Feature Reinforcement Learning: Part I. Unstructured MDPs
Marcus Hutter*
RSISE @ ANU and SML @ NICTA, Canberra, ACT, 0200, Australia
Abstract
General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, is an art that involves significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II [Hut09c]. The role of POMDPs is also considered there.
Keywords: Reinforcement learning; Markov decision process; partial observability; feature learning; explore-exploit; information & complexity; rational agents.

* A shorter version appeared in the proceedings of the AGI 2009 conference [Hut09b].

“Approximations, after all, may be made in two places - in the construction of the model and in the solution of the associated equations. It is not at all clear which yields a more judicious approximation.” —
Richard Bellman (1961)
Background & motivation.
Artificial General Intelligence (AGI) is concerned with designing agents that perform well in a wide range of environments [GP07, LH07]. Among the well-established “narrow” Artificial Intelligence (AI) approaches [RN03], arguably Reinforcement Learning (RL) [SB98] pursues most directly the same goal. RL considers the general agent-environment setup in which an agent interacts with an environment (acts and observes in cycles) and receives (occasional) rewards. The agent's objective is to collect as much reward as possible. Most if not all AI problems can be formulated in this framework. Since the future is generally unknown and uncertain, the agent needs to learn a model of the environment based on past experience, which allows it to predict future rewards, and to use this model to maximize expected long-term reward.

The simplest interesting environmental class consists of finite state fully observable Markov Decision Processes (MDPs) [Put94, SB98], which is reasonably well understood. Extensions to continuous states with (non)linear function approximation [SB98, Gor99], partial observability (POMDP) [KLC98, RPPCd08], structured MDPs (DBNs) [SDL07], and others have been considered, but the algorithms are much more brittle.

A way to tackle complex real-world problems is to reduce them to finite MDPs, which we know how to deal with efficiently. This approach leaves a lot of work to the designer, namely to extract the right state representation (“features”) out of the bare observations in the initial (formal or informal) problem description. Even if potentially useful representations have been found, it is usually not clear which ones will turn out to be better, except in situations where we already know a perfect model. Think of a mobile robot equipped with a camera plunged into an unknown environment. While we can imagine which image features will potentially be useful, we cannot know in advance which ones will actually be useful.
Main contribution.
The primary goal of this paper is to develop and investigate a method that automatically selects those features that are necessary and sufficient for reducing a complex real-world problem to a computationally tractable MDP.

Formally, we consider maps Φ from the past observation-reward-action history h of the agent to an MDP state. Histories not worth being distinguished are mapped to the same state, i.e. Φ^{-1} induces a partition on the set of histories. We call this model ΦMDP. A state may be simply an abstract label of the partition, but more often it is itself a structured object like a discrete vector. Each vector component describes one feature of the history [Hut09a, Hut09c]. For example, the state may be a 3-vector containing (shape, color, size) of the object a robot tracks. For this reason, we call the reduction Feature RL, although in this Part I only the simpler unstructured case is considered.

Φ maps the agent's experience over time into a sequence of MDP states. Rather than informally constructing Φ by hand, our goal is to develop a formal objective criterion Cost(Φ|h) for evaluating different reductions Φ. Obviously, at any point in time, if we want the criterion to be effective, it can only depend on the agent's past experience h and possibly generic background knowledge. The “Cost” of Φ shall be small iff it leads to a “good” MDP representation. The establishment of such a criterion transforms the, in general, ill-defined RL problem into a formal optimization problem (minimizing Cost) for which efficient algorithms need to be developed. Another important question is which problems can profitably be reduced to MDPs [Hut09a, Hut09c].

The real world does not conform itself to nice models: Reality is a non-ergodic partially observable uncertain unknown environment in which acquiring experience can be expensive. So we should exploit the data (past experience) at hand as well as possible, cannot generate virtual samples since the model is not given (it needs to be learned itself), and there is no reset-option. No criterion for this general setup exists. Of course, there is previous work which is in one way or another related to ΦMDP.

ΦMDP in perspective.
As partly detailed later, the suggested ΦMDP model has interesting connections to many important ideas and approaches in RL and beyond:

• ΦMDP side-steps the open problem of learning POMDPs [KLC98],
• Unlike Bayesian RL algorithms [DFA99, Duf02, PVHR06, RP08], ΦMDP avoids learning a (complete stochastic) observation model,
• ΦMDP is a scaled-down practical instantiation of AIXI [Hut05, Hut07],
• ΦMDP extends the idea of state-aggregation from planning (based on bi-simulation metrics [GDG03]) to RL (based on information),
• ΦMDP generalizes U-Tree [McC96] to arbitrary features,
• ΦMDP extends model selection criteria to general RL problems [Grü07],
• ΦMDP is an alternative to PSRs [SLJ+03] for which proper learning algorithms have yet to be developed,
• ΦMDP extends feature selection from supervised learning to RL [GE03].

Learning in agents via rewards is a much more demanding task than “classical” machine learning on independently and identically distributed (i.i.d.) data, largely due to the temporal credit assignment and exploration problem. Nevertheless, RL (and the closely related adaptive control theory in engineering) has been applied (often unrivaled) to a variety of real-world problems, occasionally with stunning success (Backgammon, Checkers [SB98, Chp.11], helicopter control [NCD+04]). ΦMDP owes its learning and planning ability to its information and complexity theoretical foundations. The implementation of ΦMDP is based on (specialized and general) search and optimization algorithms used for finding good reductions Φ. Given that ΦMDP aims at general AI problems, one may wonder about the role of other aspects traditionally considered in AI [RN03]: knowledge representation (KR) and logic may be useful for representing complex reductions Φ(h). Agent interface fields like robotics, computer vision, and natural language processing can speed up learning by pre- & post-processing the raw observations and actions into more structured formats. These representational and interface aspects will only barely be discussed in this paper. The following diagram illustrates ΦMDP in perspective.
[Diagram: ΦMDP / ΦDBN at the center, connected to Universal AI (AIXI) above and to Information, Learning, Planning, and Complexity around it; resting on Search – Optimization – Computation – Logic – KR; framed by Agents = Framework and Interface = Robots, Vision, Language.]
Contents.
Section 2 formalizes our ΦMDP setup, which consists of the agent model with a map Φ from observation-reward-action histories to MDP states. Section 3 develops our core Φ selection principle, which is illustrated in Section 4 on a tiny example. Section 5 discusses general search algorithms for finding (approximations of) the optimal Φ, concretized for context tree MDPs. In Section 6 I find the optimal action for ΦMDP, and present the overall algorithm. Section 7 improves the Φ selection criterion by “integrating” out the states. Section 8 contains a brief discussion of ΦMDP, including relations to prior work, incremental algorithms, and an outlook to more realistic structured MDPs (dynamic Bayesian networks, ΦDBN) treated in Part II.

Rather than leaving parts of ΦMDP vague and unspecified, I decided to give at the very least a simplistic concrete algorithm for each building block, which may be assembled into one sound system on which one can build.
Notation.
Throughout this article, log denotes the binary logarithm, ǫ the empty string, and δ_{x,y} = δ_{xy} = 1 if x = y and 0 else is the Kronecker symbol. I generally omit separating commas if no confusion arises, in particular in indices. For any x of suitable type (string, vector, set), I define string x = x_{1:l} = x_1...x_l, sum x_+ = Σ_j x_j, union x_* = ∪_j x_j, and vector x_• = (x_1,...,x_l), where j ranges over the full range {1,...,l} and l = |x| is the length or dimension or size of x. x̂ denotes an estimate of x. P(·) denotes a probability over states and rewards or parts thereof. I do not distinguish between random variables X and realizations x, and the abbreviation P(x) := P[X = x] never leads to confusion. More specifically, m ∈ ℕ denotes the number of states, i ∈ {1,...,m} any state index, n ∈ ℕ the current time, and t ∈ {1,...,n} any time in history. Further, in order not to get distracted, at several places I gloss over initial conditions or special cases where inessential. Also 0·undefined := 0 and 0·∞ := 0.

This section describes our formal setup. It consists of the agent-environment framework and maps Φ from observation-reward-action histories to MDP states. I call this arrangement “Feature MDP” or short ΦMDP.
Agent-environment setup.
I consider the standard agent-environment setup [RN03] in which an Agent interacts with an Environment.
The agent can choose from actions a ∈ A (e.g. limb movements) and the environment provides (regular) observations o ∈ O (e.g. camera images) and real-valued rewards r ∈ R ⊆ ℝ to the agent. The reward may be very scarce, e.g. just +1 (−1) for winning (losing) a chess game, and 0 at all other times [Hut05, Sec.6.3]. This happens in cycles t = 1, 2, 3, ...: At time t, after observing o_t and receiving reward r_t, the agent takes action a_t based on history h_t := o_1 r_1 a_1 ... o_{t−1} r_{t−1} a_{t−1} o_t r_t. Then the next cycle t+1 starts. The agent's objective is to maximize his long-term reward. Without much loss of generality, I assume that R is finite. Finiteness of R is lifted in [Hut09a, Hut09c]. I also assume that A is finite and small, which is restrictive. Part II deals with large state spaces, and large (structured) action spaces can be dealt with in a similar way. No assumptions are made on O; it may be huge or even infinite. Indeed, ΦMDP has been specifically designed to cope with huge observation spaces, e.g. camera images, which are mapped to a small space of relevant states.

The agent and environment may be viewed as a pair or triple of interlocking functions of the history H := (O×A×R)* × O×R:

  Env: H × A ⤳ O × R,   o_n r_n = Env(h_{n−1} a_{n−1}),
  Agent: H ⤳ A,   a_n = Agent(h_n),

where ⤳ indicates that the mappings → might be stochastic. The goal of AI is to design agents that achieve high (expected) reward over the agent's lifetime.
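To make the cycle structure concrete, here is a minimal sketch of the interaction protocol (my own illustration, not code from the paper); the function names, history layout, and the dummy environment are assumptions for illustration only.

```python
from typing import Any, Callable, List, Tuple

def run(env: Callable[[List[Any]], Tuple[Any, float]],
        agent: Callable[[List[Any]], Any],
        cycles: int) -> List[Any]:
    """Interaction loop; the history is stored flat as [o1, r1, a1, o2, r2, a2, ...]."""
    h: List[Any] = []
    for t in range(cycles):
        o, r = env(h)        # o_t, r_t = Env(h_{t-1} a_{t-1}); may be stochastic
        h += [o, r]
        a = agent(h)         # a_t = Agent(h_t); h_t ends with o_t r_t
        h += [a]
    return h

# Tiny usage example: a dummy coin-flip environment and a constant-action agent.
import random
history = run(lambda h: (random.randint(0, 1), 0.0), lambda h: 0, cycles=5)
```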
(Un)known environments. For known Env(), finding the reward maximizing agent is a well-defined and formally solvable problem [Hut05, Chp.4], with computational efficiency being the “only” matter of concern. For most real-world AI problems Env() is at best partially known. For unknown Env(), the meaning of expected reward maximization is even conceptually a challenge [Hut05, Chp.5]. Narrow AI considers the case where the function Env() is either known (like planning in blocks world), or essentially known (like in chess, where one can safely model the opponent as a perfect minimax player), or Env() belongs to a relatively small class of environments (e.g. elevator or traffic control).

The goal of AGI is to design agents that perform well in a large range of environments [LH07], i.e. achieve high reward over their lifetime with as little as possible assumptions about Env(). A minimal necessary assumption is that the environment possesses some structure or pattern [WM97].

From real-life experience (and from the examples below) we know that usually we do not need to know the complete history of events in order to determine (sufficiently well) what will happen next and to be able to perform well. Let Φ(h) be such a “useful” summary of history h.

Generality of ΦMDP.
The following examples show that many problems can be reduced (approximately) to finite MDPs, thus showing that ΦMDP can deal with a large variety of problems: In full-information games (like chess) with a static opponent, it is sufficient to know the current state of the game (board configuration) to play well (the history plays no role), hence Φ(h_t) = o_t is a sufficient summary (Markov condition). Classical physics is essentially predictable from the position and velocity of objects at a single time, or equivalently from the locations at two consecutive times, hence Φ(h_t) = o_{t−1} o_t is a sufficient summary (2nd order Markov). For i.i.d. processes of unknown probability (e.g. clinical trials ≃ Bandits), the frequency of observations Φ(h_n) = (Σ_{t=1}^n δ_{o_t o})_{o∈O} is a sufficient statistic. In a POMDP planning problem, the so-called belief vector at time t can be written down explicitly as some function of the complete history h_t (by integrating out the hidden states). Φ(h_t) could be chosen as (a discretized version of) this belief vector, showing that ΦMDP generalizes POMDPs. Obviously, the identity Φ(h) = h is always sufficient but not very useful, since Env() as a function of H is hard to impossible to “learn”. This suggests to look for Φ with small codomain, which allow to learn/estimate/approximate Env by Ênv such that o_t r_t ≈ Ênv(Φ(h_{t−1})) for t = 1...n. A few such candidate maps are sketched in code below.
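The following small sketch (my own, not the paper's code) illustrates some of the feature maps just mentioned; the flat history layout [o1, r1, a1, o2, r2, ...] and the function names are illustrative assumptions.

```python
from collections import Counter

def observations(h):
    """Extract the observation subsequence o_1..o_t from h = [o1, r1, a1, o2, r2, ...]."""
    return h[0::3]

def phi_markov(h):            # Phi(h_t) = o_t  (full-information games)
    return observations(h)[-1]

def phi_2nd_order(h):         # Phi(h_t) = o_{t-1} o_t  (classical mechanics)
    return tuple(observations(h)[-2:])

def phi_frequency(h):         # Phi(h_n) = observation counts (i.i.d. / bandit setting)
    return tuple(sorted(Counter(observations(h)).items()))

def phi_identity(h):          # Phi(h) = h: always sufficient, never useful
    return tuple(h)
```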
Example. Consider a robot equipped with a camera, i.e. o is a pixel image. Computer vision algorithms usually extract a set of features from o_{t−1} (or h_{t−1}), from low-level patterns to high-level objects with their spatial relation. Neither is it possible nor necessary to make a precise prediction of o_t from summary Φ(h_{t−1}). An approximate prediction must and will do. The difficulty is that the similarity measure “≈” needs to be context dependent. Minor image nuances are irrelevant when driving a car, but when buying a painting it makes a huge difference in price whether it's an original or a copy. Essentially only a bijection Φ would be able to extract all potentially interesting features, but such a Φ defeats its original purpose.

From histories to states.
It is of utmost importance to properly formalize the meaning of “≈” in a general, domain-independent way. Let s_t := Φ(h_t) summarize all relevant information in history h_t. I call s a state or feature (vector) of h. “Relevant” means that the future is predictable from s_t (and a_t) alone, and that the relevant future is coded in s_{t+1} s_{t+2} .... So we pass from the complete (and known) history o_1 r_1 a_1 ... o_n r_n a_n to a “compressed” history sra_{1:n} ≡ s_1 r_1 a_1 ... s_n r_n a_n and seek Φ such that s_{t+1} is (approximately a stochastic) function of s_t (and a_t). Since the goal of the agent is to maximize his rewards, the rewards r_t are always relevant, so they (have to) stay untouched (this will become clearer below).

The ΦMDP.
The structure derived above is a classical Markov Decision Process (MDP), but the primary question I ask is not the usual one of finding the value function or best action or comparing different models of a given state sequence. I ask how well the state-action-reward sequence generated by Φ can be modeled as an MDP compared to other sequences resulting from different Φ. A good Φ leads to a good model for predicting future rewards, which can be used to find good actions that maximize the agent's expected long-term reward.
I first review a few standard codes and model selection methods for i.i.d. sequences, subsequently adapt them to our situation, and show that they are suitable in our context. I state my Cost function for Φ and the Φ selection principle, and compare it to the Minimum Description Length (MDL) philosophy.
I.i.d. processes.
Consider i.i.d. x_1...x_n ∈ X^n for finite X = {1,...,m}. For known θ_i = P[x_t = i] we have P(x^n|θ) = θ_{x_1} · ... · θ_{x_n}. It is well-known that there exists a code (e.g. arithmetic or Shannon-Fano) for x^n of length −log P(x^n|θ), which is asymptotically optimal with probability one [Bar85, Thm.3.1]. This also easily follows from [CT06, Thm.5.10.1].

MDL/MML code [Grü07, Wal05]:
For unknown θ we may use a frequency estimate θ̂_i = n_i/n, where n_i = |{t ≤ n : x_t = i}|. Then it is easy to see that −log P(x^n|θ̂) = n·H(θ̂), where H(θ̂) := −Σ_{i=1}^m θ̂_i log θ̂_i is the entropy of θ̂ (0 log 0 := 0 =: 0 log(0/0)). We also need to code θ̂, or equivalently (n_i), which naively needs log n bits for each i. In general, a sample size of n allows estimating parameters only to accuracy O(1/√n), which is essentially equivalent to the fact that log P(x^n | θ̂ ± O(1/√n)) − log P(x^n | θ̂) = O(1). This shows that it is sufficient to code each θ̂_i to accuracy O(1/√n), which requires only ½ log n + O(1) bits each. Hence, given n and ignoring O(1) terms, the overall code length (CL) of x^n for unknown frequencies is

  CL(x^n) ≡ CL(n) := n·H(n/n) + ((m−1)/2)·log n  for n > 0,   (1)

where n = (n_1,...,n_m) and n = n_+ = n_1 + ... + n_m. We have assumed that n is given, hence only m−1 of the n_i need to be coded, since the m-th one can be reconstructed from them and n. The above is an exact code of x^n, which is optimal (within +O(1)) for all i.i.d. sources. This code may further be optimized by only coding θ̂_i for the m' = |{i : n_i > 0}| ≤ m non-empty categories, resulting in a code of length

  CL'(n) := n·H(n/n) + ((m'−1)/2)·log n + m,   (2)

where the m bits are needed to indicate which of the θ̂_i are coded. We refer to this improvement as the sparse code.
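As a small illustration (my own sketch, not part of the paper), the two code lengths (1) and (2) can be computed directly from the count vector n; the function names are placeholders.

```python
import math

def cl(counts):
    """Code length (1): n*H(n/n) + (m-1)/2 * log2(n), in bits (O(1) terms dropped)."""
    n = sum(counts)
    if n == 0:
        return 0.0
    entropy = -sum(c / n * math.log2(c / n) for c in counts if c > 0)
    return n * entropy + (len(counts) - 1) / 2 * math.log2(n)

def cl_sparse(counts):
    """Sparse code length (2): only the m' non-empty categories are coded."""
    n = sum(counts)
    if n == 0:
        return 0.0
    nonzero = [c for c in counts if c > 0]
    entropy = -sum(c / n * math.log2(c / n) for c in nonzero)
    return n * entropy + (len(nonzero) - 1) / 2 * math.log2(n) + len(counts)

counts = [50, 30, 20] + [0] * 17
print(cl(counts), cl_sparse(counts))   # with many empty categories the sparse code (2) is much shorter
```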
Combinatorial code [LV08]: A second way to code the data is to code n exactly, and then, since there are n!/(n_1!···n_m!) sequences x^n with counts n, we can easily construct a code of length log(n!/(n_1!···n_m!)) given n by enumeration, i.e.

  CL''(n) := log(n!/(n_1!···n_m!)) + (m−1)·log n.

Within ±O(1) this code length also coincides with (1).

Incremental code [WST97]:
A third way is to use a sequential estimate θ̂^{t+1}_i := (t_i + α)/(t + mα) based on the known past counts t_i := |{t' ≤ t : x_{t'} = i}|, where α > 0. Then

  P_α(x^n) = θ̂^1_{x_1} · ... · θ̂^n_{x_n} = C_α · (Π_{i=1}^m Γ(n_i + α)) / Γ(n + mα),   C_α := Γ(mα)/Γ(α)^m,   (3)

where Γ is the Gamma function. The logarithm of this expression again essentially reduces to (1) (for any α > 0, typically ½ or 1), which can also be written as

  CL'''(n) = [ln Γ(n) − Σ_{i: n_i>0} ln Γ(n_i)] / ln 2 + O(1)  if n > 0.
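For concreteness, here is a small sketch (my own, not from the paper) of the exact incremental/Bayesian code length −log₂ P_α(x^n) from (3), computed with log-Gamma functions; it can be compared numerically against the asymptotic formula (1).

```python
import math

def cl_dirichlet(counts, alpha=0.5):
    """Exact -log2 P_alpha(x^n) of the sequential (Dirichlet) estimator, Eq. (3)."""
    m, n = len(counts), sum(counts)
    log_c = math.lgamma(m * alpha) - m * math.lgamma(alpha)
    log_p = log_c + sum(math.lgamma(c + alpha) for c in counts) - math.lgamma(n + m * alpha)
    return -log_p / math.log(2)

# For large n this is close to the MDL code length (1): n*H + (m-1)/2*log2(n).
print(cl_dirichlet([600, 300, 100]))
```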
Bayesian code [Sch78, Mac03]: A fourth (the Bayesian) way is to assume a Dirichlet(α) prior over θ. The marginal distribution (evidence) is identical to (3), and the Bayesian Information Criterion (BIC) approximation leads to code (1).

Conclusion:
All four methods lead to essentially the same code length. The references above contain rigorous derivations. In the following I will ignore the O(1) terms and refer to (1) simply as the code length. Note that x^n is coded exactly (lossless). Similarly (see MDP below), sampling models more complex than i.i.d. may be considered, and the one that leads to the shortest code is selected as the best model [Grü07].

MDP definitions.
Recall that a sequence sra_{1:n} is said to be sampled from an MDP (S, A, T, R) iff the probability of s_t only depends on s_{t−1} and a_{t−1}, and r_t only on s_{t−1}, a_{t−1}, and s_t. That is,

  P(s_t | h_{t−1} a_{t−1}) = P(s_t | s_{t−1}, a_{t−1}) =: T^{a_{t−1}}_{s_{t−1} s_t},
  P(r_t | h_t) = P(r_t | s_{t−1}, a_{t−1}, s_t) =: R^{a_{t−1} r_t}_{s_{t−1} s_t}.

In our case, we can identify the state-space S with the states s_1,...,s_n “observed” so far. Hence S = {s^1,...,s^m} is finite and typically m ≪ n, since states repeat. Let s →^a s' (r') be shorthand for “action a in state s resulted in state s' (reward r')”. Let T^{ar'}_{ss'} := {t ≤ n : s_{t−1} = s, a_{t−1} = a, s_t = s', r_t = r'} be the set of times t−1 at which s →^a s' r', and n^{ar'}_{ss'} := |T^{ar'}_{ss'}| their number (n^{++}_{++} = n).

Coding MDP sequences.
For some fixed s and a, consider the subsequence s_{t_1}...s_{t_{n'}} of states reached from s via a (s →^a s_{t_i}), i.e. {t_1,...,t_{n'}} = T^{a*}_{s*}, where n' = n^{a+}_{s+}. By definition of an MDP, this sequence is i.i.d. with s' occurring n'_{s'} := n^{a+}_{ss'} times. By (1) we can code this sequence in CL(n') bits. The whole sequence s^n consists of |S×A| such i.i.d. sequences, one for each (s,a) ∈ S×A. We can join their codes and get a total code length

  CL(s^n | a^n) = Σ_{s,a} CL(n^{a+}_{s•}).   (4)

If instead of (1) we use the improved sparse code (2), non-occurring transitions s →^a s' r' will contribute only one bit rather than ½ log n bits to the code, so that large but sparse MDPs get penalized less.

Similarly to the states we code the rewards. There are different “standard” reward models. I consider only the simplest case of a small discrete reward set R like {0,1} or {−1,0,+1} here and defer generalizations to ℝ and a discussion of variants to the ΦDBN model [Hut09a]. By the MDP assumption, for each (s,a,s') triple, the rewards at times T^{a*}_{ss'} are i.i.d. Hence they can be coded in

  CL(r^n | s^n, a^n) = Σ_{s,a,s'} CL(n^{a•}_{ss'})   (5)

bits. In order to increase the statistics it might be better to treat r_t as a function of s_t only. This is not restrictive, since dependence on s_{t−1} and a_{t−1} can be mimicked by coding aspects into an enlarged state space.

Reward ↔ state trade-off. Note that the code for r depends on s. Indeed we may interpret the construction as follows: Ultimately we/the agent cares about the reward, so we want to measure how well we can predict the rewards, which we do with (5). But this code depends on s, so we need a code for s too, which is (4). To see that we need both parts, consider two extremes.

A simplistic state transition model (small |S|) results in a short code for s. For instance, for |S| = 1, nothing needs to be coded and (4) is identically zero. But this obscures potential structure in the reward sequence, leading to a long code for r. On the other hand, the more detailed the state transition model (large |S|), the easier it is to predict and hence compress r. But a large model is hard to learn, i.e. the code for s will be large. For instance, for Φ(h) = h, no state repeats and the frequency-based coding breaks down.
Φ selection principle. Let us define the Cost of Φ : H → S on h_n as the length of the ΦMDP code for s^n r^n given a^n, plus a complexity penalty CL(Φ) for Φ:

  Cost(Φ | h_n) := CL(s^n | a^n) + CL(r^n | s^n, a^n) + CL(Φ),   (6)

where s_t = Φ(h_t) and h_t = ora_{1:t−1} o_t r_t. The discussion above suggests that the minimum of the joint code length (4) and (5) is attained for a Φ that keeps all and only relevant information for predicting rewards. Such a Φ may be regarded as best explaining the rewards. I added an additional complexity penalty CL(Φ) for Φ such that from the set of Φ that minimize (4)+(5) (e.g. Φ's identical on (O×R×A)^n but different on longer histories) the simplest one is selected. The penalty is usually some code-length or log-index of Φ. This conforms with Ockham's razor and the MDL philosophy. So we are looking for a Φ of minimal cost:

  Φ^best := arg min_Φ { Cost(Φ | h_n) }.   (7)

If the minimization is restricted to some small class of reasonably simple Φ, CL(Φ) in (6) may be dropped. The state sequence generated by Φ^best (or approximations thereof) will usually only be approximately MDP. While Cost(Φ|h) is an optimal code only for MDP sequences, it still yields good codes for approximate MDP sequences. Indeed, Φ^best balances closeness to MDP with simplicity. The primary purpose of the simplicity bias is not computational tractability, but generalization ability [Leg08, Hut05].
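The following is a minimal sketch (my own, not the paper's implementation) of the Cost criterion (6) without the CL(Φ) term: it extracts the counts n^{ar'}_{ss'} from a history via a given Φ and sums the code lengths (4) and (5); the history layout and helper names are illustrative assumptions.

```python
import math
from collections import defaultdict

def cl(counts):
    """i.i.d. code length (1) in bits for a vector of counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    ent = -sum(c / n * math.log2(c / n) for c in counts if c > 0)
    return n * ent + (len(counts) - 1) / 2 * math.log2(n)

def cost(phi, history):
    """history = [(o1, r1, a1), (o2, r2, a2), ...]; phi maps a history prefix to a state.
    Returns CL(s|a) + CL(r|s,a) of Eqs. (4)+(5); the CL(Phi) term of (6) is omitted."""
    s = [phi(history[:t + 1]) for t in range(len(history))]
    states = sorted(set(s))
    rewards = sorted({x[1] for x in history})
    trans = defaultdict(lambda: defaultdict(int))   # trans[(s,a)][s']   = n^{a+}_{ss'}
    rew = defaultdict(lambda: defaultdict(int))     # rew[(s,a,s')][r']  = n^{ar'}_{ss'}
    for t in range(1, len(history)):
        a_prev, r_t = history[t - 1][2], history[t][1]
        trans[(s[t - 1], a_prev)][s[t]] += 1
        rew[(s[t - 1], a_prev, s[t])][r_t] += 1
    cl_s = sum(cl([trans[sa][s2] for s2 in states]) for sa in trans)    # Eq. (4)
    cl_r = sum(cl([rew[sas][r] for r in rewards]) for sas in rew)       # Eq. (5)
    return cl_s + cl_r

# Phi^best over a small candidate class, Eq. (7) with CL(Phi) dropped:
# phi_best = min(candidates, key=lambda phi: cost(phi, history))
```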
Relation to MDL et al. In unsupervised learning (clustering and density estimation) and supervised learning (regression and classification), penalized maximum likelihood criteria [HTF01, Chp.7] like BIC [Sch78], MDL [Grü07], and MML [Wal05] have successfully been used for semi-parametric model selection. It is far from obvious how to apply them in RL. Indeed, our derived Cost function cannot be interpreted as a usual model+data code length. The problem is the following: Ultimately we do not care about the observations but the rewards. The rewards depend on the states, but the states are arbitrary in the sense that they are model-dependent functions of the bare data (observations). The existence of these unobserved states is what complicates matters, but their introduction is necessary in order to model the rewards. For instance, Φ is actually not needed for coding rs|a, so from a strict coding/MDL perspective, CL(Φ) in (6) is redundant. Since s is some “arbitrary” construct of Φ, it is better to regard (6) as a code of r only. Since the agent chooses his actions, a need not be coded, and o is not coded, because they are only of indirect importance. The Cost() criterion is strongly motivated by the rigorous MDL principle, but invoked outside the usual induction/modeling/prediction context.

A Tiny Example
The purpose of the tiny example in this section is to provide enough insight into how and why ΦMDP works to convince the reader that our Φ selection principle is reasonable.
Example setup.
I assume a simplified MDP model in which reward r_t only depends on s_t, i.e.

  CL(r^n | s^n, a^n) = Σ_{s'} CL(n^{+•}_{+s'}).   (8)

This allows us to illustrate ΦMDP on a tiny example. The same insight is gained using (5) if an analogous larger example is considered. Furthermore I set CL(Φ) ≡ 0, and use binary observation space O = {0,1}, quaternary reward space R = {0,1,2,3}, and a single action A = {0}. Observations o_t are independent fair coin flips, i.e. Bernoulli(½), and reward r_t = 2o_{t−1} + o_t is a deterministic function of the two most recent observations.

Considered features.
As features Φ I consider Φ_k : H → O^k with Φ_k(h_t) = o_{t−k+1}...o_t for various k = 0, 1, 2, ..., which regard the last k observations as “relevant”. Intuitively Φ_2 is the best observation summary, which I confirm below. The state space is S = {0,1}^k (for sufficiently large n). The ΦMDPs for k = 0, 1, 2 are as follows:

[Figure: Φ_0MDP with the single state ǫ emitting rewards 0|1|2|3; Φ_1MDP with two states (rewards 0|2 and 1|3) and transitions between them; Φ_2MDP with four states of rewards 0, 1, 2, 3 and the admissible transitions between them.]

Φ_2MDP with all non-zero transition probabilities being 50% is an exact representation of our data source. The missing arrows (directions) are due to the fact that s = o_{t−1}o_t can only lead to s' = o'_t o'_{t+1} for which o'_t = o_t, denoted by s* = *s' in the following. Note that ΦMDP does not “know” this and has to learn the (non)zero transition probabilities. Each state has two successor states with equal probability, hence generates (see previous paragraph) a Bernoulli(½) state subsequence and a constant reward sequence, since the reward can be computed from the state = last two observations. Asymptotically, all four states occur equally often, hence the sequences have approximately the same length n/4.

In general, if s (and similarly r) consists of x ∈ ℕ i.i.d. subsequences of equal length n/x over y ∈ ℕ symbols, the code length (4) (and similarly (8)) is

  CL(s | a; x_y) = n·log y + x·((|S|−1)/2)·log(n/x),
  CL(r | s, a; x_y) = n·log y + x·((|R|−1)/2)·log(n/x),

where x_y just indicates the sequence property. So for Φ_2MDP we get

  CL(s | a; 4_2) = n + 6 log n  and  CL(r | s, a; 4_1) = 6 log n.

The log-terms reflect the required memory to code the MDP structure and probabilities. Since each state has only 2 realized/possible successors, we need n bits to code the state sequence. The reward is a deterministic function of the state, hence needs no memory to code given s.

The Φ_0MDP throws away all observations (left figure above), hence CL(s | a; 1_1) = 0. While the reward sequence is not i.i.d. (e.g. r_{t+1} = 3 cannot follow r_t = 0), Φ_0MDP has no choice but to regard it as i.i.d., resulting in CL(r | s, a; 1_4) = 2n + (3/2) log n.

The Φ_1MDP model is an interesting compromise (middle figure above). The state allows a partial prediction of the reward: state 0 allows rewards 0 and 2; state 1 allows rewards 1 and 3. Each of the two states creates a Bernoulli(½) state successor subsequence and a binary reward sequence, wrongly presumed to be Bernoulli(½). Hence CL(s | a; 2_2) = n + log n and CL(r | s, a; 2_2) = n + 3 log n.
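To make the comparison tangible, here is a small verification sketch (my own code, not the paper's): it samples the coin-flip source, evaluates Cost(Φ_k|h) = CL(s|a) + CL(r|s,a) with the simplified reward model (8) and code length (1), and shows that k = 2 gives the smallest cost for large n.

```python
import math, random
from collections import Counter

def cl(counts):
    """Code length (1) in bits: n*H + (m-1)/2*log2(n)."""
    n = sum(counts)
    if n == 0:
        return 0.0
    ent = -sum(c / n * math.log2(c / n) for c in counts if c > 0)
    return n * ent + (len(counts) - 1) / 2 * math.log2(n)

def cost_k(k, obs, rew):
    """Cost(Phi_k|h) = CL(s|a) + CL(r|s,a) with reward model (8) and a single action."""
    s = [tuple(obs[max(0, t - k + 1):t + 1]) for t in range(len(obs))]
    states = sorted(set(s))
    trans = Counter(zip(s[:-1], s[1:]))          # n_{ss'}
    rews = Counter(zip(s, rew))                  # n_{+s'} per reward value
    cl_s = sum(cl([trans[(a, b)] for b in states]) for a in states)
    cl_r = sum(cl([rews[(a, r)] for r in range(4)]) for a in states)
    return cl_s + cl_r

random.seed(0)
n = 10_000
obs = [random.randint(0, 1) for _ in range(n)]
rew = [2 * obs[t - 1] + obs[t] if t > 0 else obs[0] for t in range(n)]
for k in range(4):
    print(k, round(cost_k(k, obs, rew)))         # k = 2 comes out smallest for large n
```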
Summary. The following table summarizes the results for general k = 0, 1, 2, .... The notation of the s + r column follows the one used above in the text (x_y for s and for r); r' ≙ s' means that r' is the correct reward for state s'.

  k  | S       | |S| | n^{+}_{ss'}          | n^{+r'}_{+s'}              | n^{+}_{s+} = n^{+}_{+s'} | s + r           | CL(s|a)                          | CL(r|s,a)                 | Cost(Φ|h)
  0  | {ǫ}     | 1   | n                    | n/4                        | n                        | 1_1 1_4         | 0                                | 2n + (3/2) log n          | 2n + (3/2) log n
  1  | {0,1}   | 2   | n/4                  | (n/4)·δ_{(r'−s') mod 2 = 0}| n/2                      | 2_2 2_2         | n + log n                        | n + 3 log n               | 2n + 4 log n
  2  | {0,1}^2 | 4   | (n/8)·δ_{s*,*s'}     | (n/4)·δ_{r'≙s'}            | n/4                      | 4_2 4_1         | n + 6 log n                      | 6 log n                   | n + 12 log n
  ≥2 | {0,1}^k | 2^k | 2^{−k−1}n·δ_{s*,*s'} | 2^{−k}n·δ_{r'≙s'}          | n/2^k                    | (2^k)_2 (2^k)_1 | n + 2^{k−1}(2^k−1) log(n2^{−k})  | 3·2^{k−1} log(n2^{−k})    | n + 2^{k−1}(2^k+2) log(n2^{−k})

The last column is the sum of the two preceding columns. The part linear in n is the code length for the state/reward sequence. The part logarithmic in n is the code length for the transition/reward probabilities of the MDP; each parameter needs ½ log n bits. For large n, Φ_2 results in the shortest code, as anticipated. The “approximate” model Φ_1 is just not good enough to beat the vacuous model Φ_0, but in more realistic examples some approximate model usually has the shortest code. In [Hut09a] I show on a more complex example how Φ^best will store long-term information in a POMDP environment.

So far I have reduced the reinforcement learning problem to a formal Φ-optimization problem. This section briefly explains what we have gained by this reduction, and provides some general information about problem representations, stochastic search, and Φ neighborhoods. Finally I present a simplistic but concrete algorithm for searching context tree MDPs.

Φ search.
I now discuss how to find good summaries Φ. The introduced generic cost function Cost(Φ|h_n), based only on the known history h_n, makes this a well-defined task that is completely decoupled from the complex (ill-defined) reinforcement learning objective. This reduction should not be under-estimated. We can employ a wide range of optimizers and do not even have to worry about overfitting. The most challenging task is to come up with creative algorithms proposing Φ's.

There are many optimization methods: Most of them are search-based: random, blind, informed, adaptive, local, global, population based, exhaustive, heuristic, and other search methods [AL97]. Most are or can be adapted to the structure of the objective function, here Cost(·|h_n). Some exploit the structure more directly (e.g. gradient methods for convex functions). Only in very simple cases can the minimum be found analytically (without search). Most search algorithms require the specification of a neighborhood relation or distance between candidate Φ, which I define in the 2nd next paragraph.

Problem representation can be important: Since Φ is a discrete function, searching through (a large subset of) all computable functions is a non-restrictive approach. Variants of Levin search [Sch04, Hut05], genetic programming [Koz92, BNKF98], and recurrent neural networks [Pea89, RHHM08] are the major approaches in this direction. A different representation is as follows: Φ effectively partitions the history space H and identifies each partition with a state. Conversely, any partition of H can (up to a renaming of states) uniquely be characterized by a function Φ. Formally, Φ induces a (finite) partition ∪_s {h' : Φ(h') = s} of H, where s ranges over the codomain of Φ. Conversely, any partition of H = B_1 ∪̇ ... ∪̇ B_m induces a function Ψ(h') = i iff h' ∈ B_i, which is equivalent to Φ apart from an irrelevant permutation of the codomain (renaming of states).

State aggregation methods have been suggested earlier for solving large-scale MDP planning problems by grouping (partitioning) similar states together, resulting in (much) smaller block MDPs [GDG03]. But the bi-simulation metrics used require knowledge of the MDP transition probabilities, while our Cost criterion does not. Decision trees/lists/grids/etc. are essentially space partitioners. The most powerful versions are rule-based, in which logical expressions recursively divide the domain H into “true/false” regions [DdRD01, SB09].

Φ neighborhood relation.
A natural “minimal” change of a partition is to subdivide (split) a partition or to merge (two) partitions. Moving elements from one partition to another can be implemented as a split and merge operation. In our case this corresponds to splitting and merging states (state refinement and coarsening). Let Φ' split some state s^a ∈ S of Φ into s^b, s^c ∉ S:

  Φ'(h) := { Φ(h)         if Φ(h) ≠ s^a
           { s^b or s^c   if Φ(h) = s^a

where the histories mapped to state s^a are distributed among s^b and s^c according to some splitting rule (e.g. randomly). The new state space is S' = S\{s^a} ∪ {s^b, s^c}. Similarly Φ' merges states s^b, s^c ∈ S into s^a ∉ S if

  Φ'(h) := { Φ(h)   if Φ(h) ≠ s^b, s^c
           { s^a    if Φ(h) = s^b or s^c

where S' = S\{s^b, s^c} ∪ {s^a}. We can regard Φ' as being a neighbor of or similar to Φ.

Stochastic Φ search.
Stochastic search is the method of choice for high-dimensional unstructured problems. Monte Carlo methods can actually be highly effective, despite their simplicity [Liu02, Fis03]. The general idea is to randomly choose a neighbor Φ' of Φ and replace Φ by Φ' if it is better, i.e. has smaller Cost. Even if Cost(Φ'|h) > Cost(Φ|h) we may keep Φ', but only with some probability that is exponentially small in the cost difference. Simulated annealing is a version which minimizes Cost(Φ|h). Apparently, Φ of small cost are (much) more likely to occur than high cost Φ.
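As a generic illustration of this paragraph (a sketch under my own assumptions, not the paper's algorithm), the following accepts a worse neighbor with probability exponentially small in the cost increase; `neighbor` and `cost` are placeholder callables.

```python
import random

def stochastic_phi_search(phi, cost, neighbor, iterations=1000, temperature=1.0):
    """Metropolis-style search: propose a neighbor Phi', accept if better,
    otherwise accept with probability 2^(-(Cost(Phi') - Cost(Phi)) / T)."""
    best, best_cost = phi, cost(phi)
    cur, cur_cost = phi, best_cost
    for _ in range(iterations):
        cand = neighbor(cur)
        cand_cost = cost(cand)
        delta = cand_cost - cur_cost
        if delta <= 0 or random.random() < 2.0 ** (-delta / temperature):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
    return best, best_cost
```

Lowering `temperature` over the iterations turns this acceptance rule into simulated annealing.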
Context tree example. The Φ_k in Section 4 depended on the last k observations. Let us generalize this to a context-dependent variable length: Consider a finite complete suffix-free set of strings (= prefix tree of reversed strings) S ⊂ O* as our state space (e.g. a suitable four-element set for binary O), and define Φ_S(h_n) := s iff o_{n−|s|+1:n} = s ∈ S, i.e. s is the part of the history regarded as relevant. State splitting and merging work as follows: For binary O, if history part s ∈ S of h_n is deemed too short, we replace s by 0s and 1s in S, i.e. S' = S\{s} ∪ {0s, 1s}. If histories 1s, 0s ∈ S are deemed too long, we replace them by s, i.e. S' = S\{0s, 1s} ∪ {s}. Large O might be coded binary and then treated similarly. For small O we have the following simple Φ-optimizer:

  ΦImprove(Φ_S, h_n)
  ⌈ Randomly choose a state s ∈ S;
    Let p and q be uniform random numbers in [0,1];
    if (p > 1/2) then split s, i.e. S' = S\{s} ∪ {os : o ∈ O}
    else if {os' : o ∈ O} ⊆ S (s' is s without the first symbol)
      then merge them, i.e. S' = S\{os' : o ∈ O} ∪ {s'};
    if (Cost(Φ_S|h_n) − Cost(Φ_{S'}|h_n) > log(q)) then S := S';
  ⌊ return (Φ_S);

[Example tree: a depth-3 suffix tree over o_{n−2}, o_{n−1}, o_n whose four leaves form the state set S.]

The idea of using suffix trees as state space is from [McC96] (see also [Rin94]). It might be interesting to compare the local split/merge criterion of [McC96] with our general global Cost criterion. On the other hand, due to their limitation, suffix trees are currently out of vogue.
Exploration & Exploitation
Having obtained a good estimate Φ̂ of Φ^best in the previous section, we can/must now determine a good action for our agent. For a finite MDP with known transition probabilities, finding the optimal action is routine. For estimated probabilities we run into the infamous exploration-exploitation problem, for which promising approximate solutions have recently been suggested [SL08]. At the end of this section I present the overall algorithm for our ΦMDP agent.
Optimal actions for known MDPs.
For a known finite MDP (S, A, T, R, γ), the maximal achievable (“optimal”) expected future discounted reward sum, called (Q) Value (of action a) in state s, satisfies the following (Bellman) equations [SB98]:

  Q*^a_s = Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V*_{s'}]   and   V*_s = max_a Q*^a_s,   (9)

where 0 < γ < 1 is the discount factor. The optimal next action is

  a_n := arg max_a Q*^a_{s_n}.   (10)
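For a known (or estimated) finite MDP, (9) and (10) can be solved e.g. by simple value iteration; the following is a compact sketch (my own, with illustrative data structures), not the paper's implementation.

```python
def value_iteration(T, R, gamma, sweeps=1000):
    """T[a][s][s'] and R[a][s][s'] are nested dicts; returns (Q, V) of Eq. (9)."""
    states = list(next(iter(T.values())).keys())
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        Q = {a: {s: sum(T[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in states)
                 for s in states} for a in T}
        V = {s: max(Q[a][s] for a in T) for s in states}
    return Q, V

def best_action(Q, s):
    """Eq. (10): a_n = argmax_a Q*_{a, s_n}."""
    return max(Q, key=lambda a: Q[a][s])
```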
Estimating the MDP. We can estimate the transition probability T by

  T̂^a_{ss'} := n^{a+}_{ss'} / n^{a+}_{s+}  if n^{a+}_{s+} > 0, and 0 otherwise.   (11)

The code for s^n based on P_T̂(s^n|a^n) = Π_{t=1}^n T̂^{a_{t−1}}_{s_{t−1} s_t}, plus the code of the (non-zero) transition probabilities T̂^a_{ss'} to relevant accuracy O(1/√(n^{a+}_{s+})), has length (4), i.e. the frequency estimate (11) is consistent with the attributed code length. The expected reward can be estimated as

  R̂^a_{ss'} := Σ_{r'∈R} R̂^{ar'}_{ss'} · r',   R̂^{ar'}_{ss'} := n^{ar'}_{ss'} / n^{a+}_{ss'}.   (12)
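A direct sketch of (11) and (12) from the counts n^{ar'}_{ss'} (again my own illustrative code and data layout):

```python
from collections import defaultdict

def estimate_T_R(counts):
    """counts[(s, a, s2, r)] = n^{ar'}_{ss'}; returns frequency estimates (11) and (12)."""
    n_sas = defaultdict(int)   # n^{a+}_{ss'}
    n_sa = defaultdict(int)    # n^{a+}_{s+}
    for (s, a, s2, r), c in counts.items():
        n_sas[(s, a, s2)] += c
        n_sa[(s, a)] += c
    T = {k: c / n_sa[(k[0], k[1])] for k, c in n_sas.items()}           # Eq. (11)
    R = defaultdict(float)
    for (s, a, s2, r), c in counts.items():
        R[(s, a, s2)] += r * c / n_sas[(s, a, s2)]                      # Eq. (12)
    return T, dict(R)
```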
Exploration. Simply replacing T and R in (9) and (10) by their estimates (11) and (12) can lead to very poor behavior, since parts of the state space may never be explored, causing the estimates to stay poor.

The estimate T̂ improves with increasing n^{a+}_{s+}, which can (only) be ensured by trying all actions a in all states s sufficiently often. But the greedy policy above has no incentive to explore, which may cause the agent to perform very poorly: The agent stays with what he believes to be optimal without trying to solidify his belief. For instance, if treatment A cured the first patient, and treatment B killed the second, the greedy agent will stick to treatment A and not explore the possibility that B might be better. A simple remedy is to add one (fictitious) extra state s^e to S, not in the range of Φ(h), i.e. never observed. Henceforth, S denotes the extended state space. For instance, + in (11) now includes s^e. We set

  n^a_{s s^e} = 1,   n^a_{s^e s} = δ_{s^e s},   R^a_{s s^e} = R^e_max   (13)

for all s, a, where the exploration bonus R^e_max is polynomially (in (1−γ)^{−1} and |S×A|) larger than max R [SL08].

Now compute the agent's action by (9)-(12) but for the extended S. The optimal policy p* tries to find a chain of actions and states that likely leads to the high reward absorbing state s^e. The transition T̂^a_{s s^e} = 1/n^a_{s+} is only “large” for small n^a_{s+}, hence p* has a bias towards unexplored (state,action) regions. It can be shown that this algorithm makes only a polynomial number of sub-optimal actions.

The overall algorithm for our ΦMDP agent is as follows.

  ΦMDP-Agent(A, R)
  ⌈ Initialize Φ ≡ Φ' ≡ ǫ; S = {ǫ}; h = a = r = ǫ;
    for n = 1, 2, 3, ...
    ⌈ Choose e.g. γ = 1 − 1/n;
      Set R^e_max = Polynomial((1−γ)^{−1}, |S×A|) · max R;
      While waiting for o_n and r_n
      ⌈ Φ' := ΦImprove(Φ', h_{n−1});
      ⌊ If Cost(Φ'|h_{n−1}) < Cost(Φ|h_{n−1}) then Φ := Φ';
      Observe o_n and r_n; h_n := h_{n−1} a_{n−1} o_n r_n; s_n := Φ(h_n); S := S ∪ {s_n};
      Compute action a_n from Equations (9)-(13);
  ⌊ ⌊ Output action a_n;

As discussed, we ultimately only care about (modeling) the rewards, but this endeavor required introducing and coding states. The resulting Cost(Φ|h) function is a code length of not only the rewards but also the “spurious” states. This likely leads to a too strong penalty for models Φ with large state spaces S. The proper Bayesian formulation developed in this section allows to “integrate” out the states. This leads to a code for the rewards only, which better trades off accuracy of the reward model and state space size.

For an MDP with transition and reward probabilities T^a_{ss'} and R^{ar'}_{ss'}, the probabilities of the state and reward sequences are

  P_T(s^n | a^n) = Π_{t=1}^n T^{a_{t−1}}_{s_{t−1} s_t},   P_R(r^n | s^n a^n) = Π_{t=1}^n R^{a_{t−1} r_t}_{s_{t−1} s_t}.

The probability of r|a can be obtained by taking the product and marginalizing s:

  P_U(r^n | a^n) = Σ_{s^n} P_T(s^n | a^n) P_R(r^n | s^n a^n) = Σ_{s^n} Π_{t=1}^n U^{a_{t−1} r_t}_{s_{t−1} s_t} = Σ_{s_n} [U^{a_0 r_1} · · · U^{a_{n−1} r_n}]_{s_0 s_n},

where for each a ∈ A and r' ∈ R, the matrix U^{ar'} ∈ ℝ^{m×m} is defined as [U^{ar'}]_{ss'} ≡ U^{ar'}_{ss'} := T^a_{ss'} R^{ar'}_{ss'}. The right n-fold matrix product can be evaluated in time O(m²n). This shows that r given a and U can be coded in −log P_U bits. The unknown U needs to be estimated, e.g. by the relative frequency Û^{ar'}_{ss'} := n^{ar'}_{ss'}/n^{a+}_{s+}. Note that P_U completely ignores the observations o^n and is essentially independent of Φ. The map Φ and hence o^n enter P_Û (only and crucially) via the estimate Û.
The M free parameters of U (at most m²|A||R| of them) can be coded to sufficient accuracy in ½ M log n bits, and Φ will be coded in CL(Φ) bits. Together this leads to a code for r|a of length

  ICost(Φ | h_n) := −log P_Û(r^n | a^n) + ½ M log n + CL(Φ).   (14)

In practice, M can and should be chosen smaller, as done in the original Cost function, and/or by using the restrictive model (8) for R, and/or by considering only non-zero frequencies (2). Analogous to (7) we seek a Φ that minimizes ICost().

Since action evaluation is based on (discounted) reward sums, not individual rewards, one may think of marginalizing P_U(r|a, Φ) even further, or coding rewards only approximately. Unfortunately, the algorithms in Section 6 that learn, explore, and exploit MDPs require knowledge of the (exact) individual rewards, so this improvement is not feasible.
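The marginalization P_Û(r^n|a^n) can be computed by the n-fold matrix product just described; below is a small sketch (my own, using numpy and illustrative data structures) of its −log₂, i.e. the first term of (14). Rescaling at each step keeps the computation numerically stable for long sequences.

```python
import numpy as np

def neg_log_PU(U, actions, rewards, s0=0):
    """-log2 P_U(r^n|a^n) via the product [U^{a_0 r_1} ... U^{a_{n-1} r_n}] summed over s_n;
    U[(a, r)] is an m x m matrix, actions/rewards are the observed sequences."""
    m = next(iter(U.values())).shape[0]
    v = np.zeros(m)
    v[s0] = 1.0                       # start in state s_0
    total = 0.0
    for a, r in zip(actions, rewards):
        v = v @ U[(a, r)]             # one matrix-vector step: O(m^2) per time step
        norm = v.sum()                # rescale to avoid underflow; accumulate log-probability
        total -= np.log2(norm)
        v /= norm
    return total
```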
This section summarizes ΦMDP, relates it to previous work, and hints at more efficient incremental implementations and more realistic structured MDPs (dynamic Bayesian networks).
Summary.
Learning from rewards in general environments is an immensely complex problem. In this paper I have developed a generic reinforcement learning algorithm based on sound principles. The key idea was to reduce general learning problems to finite state MDPs, for which efficient learning, exploration, and exploitation algorithms exist. For this purpose I have developed a formal criterion for evaluating and selecting good “feature” maps Φ from histories to states. One crucial property of ΦMDP is that it neither requires nor learns a model of the complete observation space, but only of the reward-relevant observations as summarized in the states. The developed criterion has been inspired by MDL, which recommends to select the (coding) model that minimizes the length of a suitable code for the data at hand plus the complexity of the model itself. The novel and tricky part in ΦMDP was to deal with the states, since they are not bare observations, but model-dependent processed data. An improved Bayesian criterion, which integrates out the states, has also been derived. Finally, I presented a complete feature reinforcement learning algorithm ΦMDP-Agent(). The building blocks and computational flow are depicted in the following diagram:
[Diagram: the Environment provides observation o and reward r to the History h; Cost(Φ|h) minimization yields the Feature Vector Φ̂; frequency estimation yields Transition Pr. T̂ and Reward est. R̂; the exploration bonus gives T̂^e, R̂^e; the Bellman equations give the (Q̂) V̂alue, which implicitly defines the Best Policy p̂, whose action a feeds back to the Environment.]

Relation to previous work.
As already indicated here and there, ΦMDP can be regarded as extending the frontier of many previous important approaches to RL and beyond:
Partially Observable MDPs (POMDPs) are a very important generalization of MDPs [KLC98]. Nature is still assumed to be an MDP, but the states of nature are only partially observed via some non-injective or probabilistic function. Even for finite state space and known observation and transition functions, finding and even only approximating the optimal action is (harder than NP) hard [LGM01, MHC03]. Lifting any of the assumptions causes conceptual problems, and when lifting more than one we enter scientific terra nullius. Assume a POMDP environment: POMDPs can formally (but not yet practically) be reduced to MDPs over so-called (continuous) belief states. Since ΦMDP reduces every problem to an MDP, it is conceivable that it reduces the POMDP to (an approximation of) its belief MDP. This would be a profound relation between ΦMDP and POMDP, likely leading to valuable insights into ΦMDP and proper algorithms for learning POMDPs. It may also help us to restrict the space of potentially interesting features Φ.

Predictive State Representations (PSRs) are very interesting, but to this date in an even less developed stage [SLJ+03] than POMDPs.

Universal AI [Hut05] is able to optimally deal with arbitrary environments, but the resulting AIXI agent is computationally intractable [Hut07] and hard to approximate [Pan08, PH06].

Bayesian RL algorithms [DFA99, Duf02, PVHR06, RP08] (see also [KV86, Chp.11]) can be regarded as implementations of the AIξ models [PH06], which are down-scaled versions of AIXI, but the enormous computational demand still severely limits this approach. ΦMDP essentially differs from “generative” Bayesian RL and AIξ in that it neither requires to specify nor to learn a (complete stochastic) observation model. It is a more “discriminative” approach [LJ08]. Since ΦMDP “automatically” models only the relevant aspects of the environment, it should be computationally less demanding than full Bayesian RL.

State aggregation methods have been suggested earlier for solving large-scale MDP planning problems by grouping (partitioning) similar states together, resulting in (much) smaller block MDPs [GDG03]. But the bi-simulation metrics used require knowledge of the MDP transition probabilities. ΦMDP might be regarded as an approach that lifts this assumption.

Suffix trees [McC96] are a simple class of features Φ. ΦMDP combined with a local search function that expands and deletes leaf nodes is closely related to McCallum's U-Tree algorithm [McC96], with a related but likely different split&merge criterion.

Miscellaneous: ΦMDP also extends the theory of model selection (e.g. MDL [Grü07]) from passive to active learning.
Incremental updates.
As discussed in Section 5, most search algorithms are local in the sense that they produce a chain of “slightly” modified candidate solutions, here Φ. This suggests a potential speedup by computing quantities of interest incrementally, which becomes even more important in the ΦDBN case [Hut09a, Hut09c]. Computing Cost(Φ) takes at most time O(|S|²|A||R|). If we split or merge two states, we can incrementally update the cost in time O(|S||A||R|), rather than computing it again from scratch. In practice, many transitions T^a_{ss'} don't occur, and Cost(Φ) can actually be computed much faster in time O(|{n^{ar'}_{ss'} > 0}|), and incrementally even faster.

Iteration algorithms for (9) need an initial value for V or Q. We can take the estimate V̂ from a previous Φ as an initial value for the new Φ. For a merge operation we can average the values of both states; for a split operation we could give both parts the same initial value. A significant further speedup can be obtained by using prioritized iteration algorithms that concentrate their time on badly estimated states, which are in our case (states close to) the new ones [SB98].

Similarly, results from cycle n can be (re)used for the next cycle n+1. For instance, V̂ can simply be reused as an initial value in the Bellman equations, and ICost(Φ) can be updated in time O(|S|²) or even faster if U is sparse.
The use of “unstructured” MDPs, even our Φ-optimal ones, is clearly limited to very simple tasks. Real world problems are structured and can often be represented by dynamic Bayesian networks (DBNs) with a reasonable number of nodes. Our Φ selection principle can be adapted from MDPs to the conceptually much more complex DBN case. The primary purpose of this Part I was to explain the key concepts on as simple a model as possible, namely unstructured finite MDPs, to set the stage for developing the more realistic ΦDBN in Part II [Hut09c].
Outlook.
The major open problems are to develop smart Φ generation and smart stochastic search algorithms for Φ^best, and to determine whether minimizing (14) is the right criterion.
Acknowledgements.
My thanks go to Pedro Ortega, Sergey Pankov, Scott Sanner, Jürgen Schmidhuber, and Hanna Suominen for feedback on earlier drafts.
References

[AL97] E. H. L. Aarts and J. K. Lenstra, editors. Local Search in Combinatorial Optimization. Discrete Mathematics and Optimization. Wiley-Interscience, Chichester, England, 1997.
[Bar85] A. R. Barron. Logically Smooth Density Estimation. PhD thesis, Stanford University, 1985.
[BF85] D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, 1985.
[BNKF98] W. Banzhaff, P. Nordin, E. Keller, and F. D. Francone. Genetic Programming. Morgan Kaufmann, San Francisco, CA, USA, 1998.
[BT02] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
[CT06] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[DdRD01] S. Dzeroski, L. de Raedt, and K. Driessens. Relational reinforcement learning. Machine Learning, 43:7–52, 2001.
[DFA99] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian exploration. In Proc. 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 150–159, 1999.
[Duf02] M. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, Department of Computer Science, University of Massachusetts Amherst, 2002.
[Fis03] G. Fishman. Monte Carlo. Springer, 2003.
[GDG03] R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1–2):163–223, 2003.
[GE03] I. Guyon and A. Elisseeff, editors. Variable and Feature Selection. JMLR Special Issue. MIT Press, 2003.
[Gor99] G. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1999.
[GP07] B. Goertzel and C. Pennachin, editors. Artificial General Intelligence. Springer, 2007.
[Grü07] P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, Cambridge, 2007.
[HTF01] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
[Hut07] M. Hutter. Universal algorithmic intelligence: A mathematical top→down approach. In Artificial General Intelligence, pages 227–290. Springer, Berlin, 2007.
[Hut09a] M. Hutter. Feature dynamic Bayesian networks. In Proc. 2nd Conf. on Artificial General Intelligence (AGI'09), volume 8, pages 67–73. Atlantis Press, 2009.
[Hut09b] M. Hutter. Feature Markov decision processes. In Proc. 2nd Conf. on Artificial General Intelligence (AGI'09), volume 8, pages 61–66. Atlantis Press, 2009.
[Hut09c] M. Hutter. Feature reinforcement learning: Part II: Structured MDPs. In progress. Will extend [Hut09a], 2009.
[KLC98] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
[Koz92] J. R. Koza. Genetic Programming. The MIT Press, 1992.
[KS98] M. J. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. In Proc. 15th International Conf. on Machine Learning, pages 260–268. Morgan Kaufmann, San Francisco, CA, 1998.
[KV86] P. R. Kumar and P. P. Varaiya. Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, Englewood Cliffs, NJ, 1986.
[Leg08] S. Legg. Machine Super Intelligence. PhD thesis, IDSIA, Lugano, 2008.
[LGM01] C. Lusena, J. Goldsmith, and M. Mundhenk. Nonapproximability results for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14:83–103, 2001.
[LH07] S. Legg and M. Hutter. Universal intelligence: A definition of machine intelligence. Minds & Machines, 17(4):391–444, 2007.
[Liu02] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2002.
[LJ08] P. Liang and M. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proc. 25th International Conf. on Machine Learning (ICML'08), volume 307, pages 584–591. ACM, 2008.
[LV08] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, Berlin, 3rd edition, 2008.
[Mac03] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge, MA, 2003.
[McC96] A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, Department of Computer Science, University of Rochester, 1996.
[MHC03] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147:5–34, 2003.
[NCD+04] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Autonomous inverted helicopter flight via reinforcement learning. In ISER, volume 21 of Springer Tracts in Advanced Robotics, pages 363–372. Springer, 2004.
[Pan08] S. Pankov. A computational approximation to the AIXI model. In Proc. 1st Conference on Artificial General Intelligence, volume 171, pages 256–267, 2008.
[Pea89] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.
[PH06] J. Poland and M. Hutter. Universal learning of repeated matrix games. In Proc. 15th Annual Machine Learning Conf. of Belgium and The Netherlands (Benelearn'06), pages 7–14, Ghent, 2006.
[Put94] M. L. Puterman. Markov Decision Processes — Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.
[PVHR06] P. Poupart, N. A. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proc. 23rd International Conf. on Machine Learning (ICML'06), volume 148, pages 697–704, Pittsburgh, PA, 2006. ACM.
[RHHM08] L. De Raedt, B. Hammer, P. Hitzler, and W. Maass, editors. Recurrent Neural Networks - Models, Capacities, and Applications, volume 08041 of Dagstuhl Seminar Proceedings. IBFI, Schloss Dagstuhl, Germany, 2008.
[Rin94] M. Ring. Continual Learning in Reinforcement Environments. PhD thesis, University of Texas, Austin, 1994.
[RN03] S. J. Russell and P. Norvig. Artificial Intelligence. A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition, 2003.
[RP08] S. Ross and J. Pineau. Model-based Bayesian reinforcement learning in large structured domains. In Proc. 24th Conference in Uncertainty in Artificial Intelligence (UAI'08), pages 476–483, Helsinki, 2008. AUAI Press.
[RPPCd08] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663–704, 2008.
[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[SB09] S. Sanner and C. Boutilier. Practical solution techniques for first-order MDPs. Artificial Intelligence, 173(5–6):748–788, 2009.
[Sch78] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
[Sch04] J. Schmidhuber. Optimal ordered problem solver. Machine Learning, 54(3):211–254, 2004.
[SDL07] A. L. Strehl, C. Diuk, and M. L. Littman. Efficient structure learning in factored-state MDPs. In Proc. 27th AAAI Conference on Artificial Intelligence, pages 645–650, Vancouver, BC, 2007. AAAI Press.
[SL08] I. Szita and A. Lörincz. The many faces of optimism: a unifying approach. In Proc. 12th International Conference (ICML 2008), volume 307, Helsinki, Finland, June 2008.
[SLJ+03] S. Singh, M. Littman, N. Jong, D. Pardoe, and P. Stone. Learning predictive state representations. In Proc. 20th International Conference on Machine Learning (ICML'03), pages 712–719, 2003.
[Wal05] C. S. Wallace. Statistical and Inductive Inference by Minimum Message Length. Springer, Berlin, 2005.
[WM97] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.
[WST97] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. Reflections on the prize paper: The context-tree weighting method: Basic properties. IEEE Information Theory Society Newsletter, pages 20–27, 1997.

List of Notation
Interface structures
O = finite or infinite set of possible observations
A = (small) finite set of actions
R = {0,1} or [0, R_max] or other set of rewards
n ∈ ℕ = current time
o_t r_t a_t = ora_t ∈ O×R×A = true observation, reward, action at time t

Internal structures for ΦMDP
log = binary logarithm
t ∈ {1,...,n} = any time
i ∈ {1,...,m} = any state index
x = x_{1:n} = x_1...x_n (any x)
x_+, x_*, x_• = Σ_j x_j, ∪_j x_j, (x_1,...,x_l) (any x, j, l)
X̂ = estimate of X (any X)
H = (O×R×A)* × O×R = possible histories
h_n = ora_{1:n−1} o_n r_n = actual history at time n
S = {s^1,...,s^m} = internal finite state space (can vary with n)
Φ : H → S = state or feature summary of history
s_t = Φ(h_t) ∈ S = realized state at time t
P(·) = probability over states and rewards or parts thereof
CL(·) = code length
MDP = (S, A, T, R) = Markov Decision Process
T^a_{ss'} = P(s_t = s' | s_{t−1} = s, a_{t−1} = a) = transition matrix
s →^a s' (r') = action a in state s resulted in state s' (and reward r')
T^{ar'}_{ss'} = set of times t ∈ {1,...,n} at which s →^a s' r'
n^{ar'}_{ss'} = |T^{ar'}_{ss'}| = number of times t ∈ {1,...,n} at which s →^a s' r'
Cost(Φ|h) = cost (evaluation function) of Φ based on history h
ICost(Φ|h) = improved cost function
Q*^a_s, V*_s = optimal (Q) Value (of action a) in state s
γ ∈ [0,1) = discount factor ((1−γ)^{−1} ...