Reinforcement Learning via AIXI Approximation
Joel Veness (University of NSW & NICTA), joelv@cse.unsw.edu.au
Kee Siong Ng (Medicare Australia & ANU), keesiong.ng@gmail.com
Marcus Hutter (ANU & NICTA), marcus.hutter@anu.edu.au
David Silver (University College London), davidstarsilver@googlemail.com
13 July 2010
Abstract
This paper introduces a principled approach for the design of a scalable general reinforcement learning agent. This approach is based on a direct approximation of AIXI, a Bayesian optimality notion for general reinforcement learning agents. Previously, it has been unclear whether the theory of AIXI could motivate the design of practical algorithms. We answer this hitherto open question in the affirmative, by providing the first computationally feasible approximation to the AIXI agent. To develop our approximation, we introduce a Monte Carlo Tree Search algorithm along with an agent-specific extension of the Context Tree Weighting algorithm. Empirically, we present a set of encouraging results on a number of stochastic, unknown, and partially observable domains.

Contents
1 Introduction
2 The Agent Setting
3 Bayesian Agents
4 Monte Carlo Expectimax Approximation
5 Action-Conditional CTW
6 Theoretical Results
7 Experimental Results
8 Related Work
9 Limitations
10 Conclusion
11 Acknowledgements

Keywords:
Reinforcement Learning (RL); Context Tree Weighting (CTW); Monte Carlo Tree Search (MCTS); Upper Confidence bounds applied to Trees (UCT); Partially Observable Markov Decision Process (POMDP); Prediction Suffix Trees (PST).

1 Introduction

Consider an agent that exists within some unknown environment. The agent interacts with the environment in cycles. At each cycle, the agent executes an action and receives in turn an observation and a reward. The general reinforcement learning problem is to construct an agent that, over time, collects as much reward as possible from an initially unknown environment.

The AIXI agent [Hut05] is a formal, mathematical solution to the general reinforcement learning problem. It can be decomposed into two main components: planning and prediction. Planning amounts to performing an expectimax operation to determine each action. Prediction uses Bayesian model averaging, over the largest possible model class expressible on a Turing machine, to predict future observations and rewards based on past experience. AIXI is shown in [Hut05] to be optimal in the sense that it will rapidly learn an accurate model of the unknown environment and exploit it to maximise its expected future reward. As AIXI is only asymptotically computable, it is by no means an algorithmic solution to the general reinforcement learning problem. Rather, it is best understood as a Bayesian optimality notion for decision making in general unknown environments. This paper demonstrates, for the first time, how a practical agent can be built from the AIXI theory. Our solution directly approximates the planning and prediction components of AIXI. In particular, we use a generalisation of UCT [KS06] to approximate the expectimax operation, and an agent-specific extension of CTW [WST95], a Bayesian model averaging algorithm for prediction suffix trees, for prediction and learning. Perhaps surprisingly, this kind of direct approximation is possible, practical and theoretically appealing. Importantly, the essential characteristic of AIXI, its generality, can be largely preserved.

2 The Agent Setting

This section introduces the notation and terminology we will use to describe strings of agent experience, the true underlying environment and the agent's model of the environment. The (finite) action, observation, and reward spaces are denoted by A, O, and R respectively. An observation-reward pair or is called a percept. We use X to denote the percept space O × R.

Definition 1
A history h is an element of (A × X)* ∪ (A × X)* × A.

Notation: A string x_1 x_2 ... x_n of length n is denoted by x_{1:n}. The empty string is denoted by ε. The concatenation of two strings s and r is denoted by sr. The prefix x_{1:j} of x_{1:n}, j ≤ n, is denoted by x_{≤j} or x_{<j+1}. The notation generalises for blocks of symbols: e.g. ax_{1:n} denotes a_1 x_1 a_2 x_2 ... a_n x_n and ax_{<j} denotes a_1 x_1 a_2 x_2 ... a_{j-1} x_{j-1}.

The following definition states that the environment takes the form of a probability distribution over possible percept sequences conditioned on actions taken by the agent.

Definition 2
An environment ρ is a sequence of conditional probability functions {ρ_0, ρ_1, ρ_2, ...}, where ρ_n : A^n → Density(X^n), that satisfies

∀a_{1:n} ∀x_{<n}: ρ_{n-1}(x_{<n} | a_{<n}) = Σ_{x_n ∈ X} ρ_n(x_{1:n} | a_{1:n}).    (1)

In the base case, we have ρ_0(ε | ε) = 1.

Equation (1), called the chronological condition in [Hut05], captures the natural constraint that action a_n has no effect on observations made before it. For convenience, we drop the index n in ρ_n from here onwards. Given an environment ρ,

ρ(x_n | ax_{<n} a_n) := ρ(x_{1:n} | a_{1:n}) / ρ(x_{<n} | a_{<n})    (2)

is the ρ-probability of observing x_n in cycle n given history h = ax_{<n} a_n, provided ρ(x_{<n} | a_{<n}) > 0.
It now follows that

ρ(x_{1:n} | a_{1:n}) = ρ(x_1 | a_1) ρ(x_2 | ax_1 a_2) ··· ρ(x_n | ax_{<n} a_n).    (3)

Definition 2 is used to describe both the true (but unknown) underlying environment and the agent's subjective model of the environment. The latter is called the agent's environment model and is typically learnt from data. Definition 2 is extremely general. It captures a wide variety of environments, including standard reinforcement learning setups such as MDPs and POMDPs.

The agent's goal is to accumulate as much reward as it can during its lifetime. More precisely, the agent seeks a policy that will allow it to maximise its expected future reward up to a fixed, finite, but arbitrarily large horizon m ∈ N. Formally, a policy is a function that maps a history to an action. The expected future value of an agent acting under a particular policy is defined as follows.

Definition 3
Given history ax_{1:t}, the m-horizon expected future reward of an agent acting under policy π : (A × X)* → A with respect to an environment ρ is

v^m_ρ(π, ax_{1:t}) := E_{x_{t+1:t+m} ∼ ρ} [ Σ_{i=t+1}^{t+m} R_i(ax_{≤t+m}) ],    (4)

where for t+1 ≤ k ≤ t+m, a_k := π(ax_{<k}), and R_k(aor_{≤t+m}) := r_k. The quantity v^m_ρ(π, ax_{1:t} a_{t+1}) is defined similarly, except that a_{t+1} is now no longer defined by π.

The optimal policy π* is the policy that maximises the expected future reward. The maximal achievable expected future reward of an agent with history h in environment ρ looking m steps ahead is V^m_ρ(h) := v^m_ρ(π*, h). It is easy to see that if h ≡ ax_{1:t} ∈ (A × X)^t, then

V^m_ρ(h) = max_{a_{t+1}} Σ_{x_{t+1}} ρ(x_{t+1} | h a_{t+1}) ··· max_{a_{t+m}} Σ_{x_{t+m}} ρ(x_{t+m} | h ax_{t+1:t+m-1} a_{t+m}) [ Σ_{i=t+1}^{t+m} r_i ].    (5)

We will refer to Equation (5) as the expectimax operation. The m-horizon optimal action a*_{t+1} at time t+1 is given by

a*_{t+1} = arg max_{a_{t+1}} V^m_ρ(ax_{1:t} a_{t+1}).    (6)

Equations (4) and (5) can be modified to handle discounted reward; however, we focus on the finite-horizon case since it both aligns with AIXI and allows for a simplified presentation.
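For intuition, the expectimax operation can be written as a short recursion. The sketch below is a minimal, brute-force illustration under interfaces of our own choosing (an environment model exposed as a conditional probability function rho, and explicit action and percept sets); it is not the algorithm used by our agent, which approximates this computation by sampling (Section 4).

    def expectimax_value(rho, h, m, actions, percepts):
        # V_rho^m(h) as in Eq. (5): maximise over the next action, average
        # over the next percept under rho, accumulate reward, and recurse.
        if m == 0:
            return 0.0
        best = float('-inf')
        for a in actions:
            value = 0.0
            for (o, r) in percepts:
                p = rho(o, r, h + [a])  # rho(x | h a), an assumed interface
                if p > 0.0:
                    value += p * (r + expectimax_value(rho, h + [a, (o, r)],
                                                       m - 1, actions, percepts))
            best = max(best, value)
        return best

Enumerating every action-percept branch in this way takes O(|A × X|^m) time, which is precisely the cost that motivates the Monte Carlo approximation of Section 4.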
3 Bayesian Agents

In the general reinforcement learning setting, the environment ρ is unknown to the agent. One way to learn an environment model is to take a Bayesian approach. Instead of committing to any single environment model, the agent uses a mixture of environment models. This requires committing to a class of possible environments (the model class), assigning an initial weight to each possible environment (the prior), and subsequently updating the weight for each model using Bayes rule (computing the posterior) whenever more experience is obtained.

The above procedure is similar to Bayesian methods for predicting sequences of (singly typed) observations. The key difference in the agent setup is that each prediction is now also dependent on previous agent actions. We incorporate this by using the action-conditional definitions and identities of Section 2.

Definition 4

Given a model class M := {ρ_1, ρ_2, ...} and a prior weight w^ρ_0 > 0 for each ρ ∈ M such that Σ_{ρ∈M} w^ρ_0 = 1, the mixture environment model is

ξ(x_{1:n} | a_{1:n}) := Σ_{ρ∈M} w^ρ_0 ρ(x_{1:n} | a_{1:n}).

The next result follows immediately.
Proposition 1
A mixture environment model is an environment model.
Proposition 1 allows us to use a mixture environment model whenever we can use an environment model. Its importance will become clear shortly.

To make predictions using a mixture environment model ξ, we use

ξ(x_n | ax_{<n} a_n) = ξ(x_{1:n} | a_{1:n}) / ξ(x_{<n} | a_{<n}),    (7)

which follows from Proposition 1 and Eq. (2). The RHS of Eq. (7) can be written out as a convex combination of model predictions to give

ξ(x_n | ax_{<n} a_n) = Σ_{ρ∈M} w^ρ_{n-1} ρ(x_n | ax_{<n} a_n),    (8)

where the posterior weight w^ρ_{n-1} for ρ is given by

w^ρ_{n-1} := w^ρ_0 ρ(x_{<n} | a_{<n}) / Σ_{μ∈M} w^μ_0 μ(x_{<n} | a_{<n}) = Pr(ρ | ax_{<n}).    (9)

Bayesian agents enjoy a number of strong theoretical performance guarantees; these are explored in Section 6. In practice, the main difficulty in using a mixture environment model is computational. A rich model class is required if the mixture environment model is to possess general prediction capabilities; however, naively using (8) for online prediction requires at least O(|M|) time to process each new piece of experience. One of our main contributions, introduced in Section 5, is a large, efficiently computable mixture environment model that runs in time O(log(log |M|)).
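To make Equations (8) and (9) concrete, here is a minimal sketch of the naive mixture update for a finite model class. The class name and the callable-model interface are illustrative assumptions of ours; this direct implementation pays the O(|M|) per-cycle cost that the CTW-based model of Section 5 avoids.

    class MixtureEnvironmentModel:
        def __init__(self, models, priors):
            # models: callables rho(x, h) returning rho(x_n | ax_<n a_n)
            # priors: initial weights w_0^rho, assumed to sum to 1
            self.models = list(models)
            self.weights = list(priors)

        def predict(self, x, h):
            # Eq. (8): a convex combination of the models' predictions.
            return sum(w * rho(x, h) for w, rho in zip(self.weights, self.models))

        def update(self, x, h):
            # Eq. (9): scale each weight by how well its model predicted
            # the observed percept x, then renormalise.
            posterior = [w * rho(x, h) for w, rho in zip(self.weights, self.models)]
            z = sum(posterior)
            self.weights = [p / z for p in posterior]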
Before looking at that, we will examine in the next section a Monte Carlo Tree Search algorithm for approximating the expectimax operation.

4 Monte Carlo Expectimax Approximation

Full-width computation of the expectimax operation (5) takes O(|A × X|^m) time, which is unacceptable for all but tiny values of m. This section introduces ρUCT, a generalisation of the popular UCT algorithm [KS06] that can be used to approximate a finite horizon expectimax operation given an environment model ρ. The key idea of Monte Carlo search is to sample observations from the environment, rather than exhaustively considering all possible observations. This allows for effective planning in environments with large observation spaces. Note that since an environment model subsumes both MDPs and POMDPs, ρUCT effectively extends the UCT algorithm to a wider class of problem domains.

The UCT algorithm has proven effective in solving large discounted or finite horizon MDPs. It assumes a generative model of the MDP that, when given a state-action pair (s, a), produces a subsequent state-reward pair (s′, r) distributed according to Pr(s′, r | s, a). By successively sampling trajectories through the state space, the UCT algorithm incrementally constructs a search tree, with each node containing an estimate of the value of each state. Given enough time, these estimates converge to the true values.

The ρUCT algorithm can be realised by replacing the notion of state in UCT by an agent history h (which is always a sufficient statistic) and using an environment model ρ(or | h) to predict the next percept. The main subtlety with this extension is that the history used to determine the conditional probabilities must be updated during the search to reflect the extra information an agent will have at a hypothetical future point in time.

We will use Ψ to represent all the nodes in the search tree, Ψ(h) to represent the node corresponding to a particular history h, V̂^m_ρ(h) to represent the sample-based estimate of the expected future reward, and T(h) to denote the number of times a node Ψ(h) has been sampled. Nodes corresponding to histories that end or do not end with an action are called chance and decision nodes respectively.

Algorithm 1 describes the top-level algorithm, which the agent calls at the beginning of each cycle. It is initialised with the agent's total experience h (up to time t) and the planning horizon m. It repeatedly invokes the Sample routine until out of time. Importantly, ρUCT is an anytime algorithm; an approximate best action, whose quality improves with time, is always available. This is retrieved by BestAction, which computes a*_t = arg max_{a_t} V̂^m_ρ(ax_{<t} a_t).

Algorithm 1 ρUCT(h, m)
Require: A history h
Require: A search horizon m ∈ N

  Initialise(Ψ)
  repeat
    Sample(Ψ, h, m)
  until out of time
  return BestAction(Ψ, h)

Algorithm 2 describes the recursive routine used to sample a single future trajectory. It uses the SelectAction routine to choose moves at interior nodes, and invokes the Rollout routine at unexplored leaf nodes. The Rollout routine picks actions uniformly at random until the (remaining) horizon is reached, returning the accumulated reward. After a complete trajectory of length m is simulated, the value estimates are updated for each node traversed. Notice that the recursive calls on Lines 6 and 11 append the most recent percept or action to the history argument.

Algorithm 3 describes the UCB [Aue02] policy used to select actions at decision nodes. The α and β constants denote the smallest and largest elements of R respectively. The parameter C varies the selectivity of the search; larger values grow bushier trees. UCB automatically focuses attention on the best looking action in such a way that the sample estimate V̂_ρ(h) converges to V_ρ(h), whilst still exploring alternate actions sufficiently often to guarantee that the best action will be found.

Algorithm 2 Sample(Ψ, h, m)
Require: A search tree Ψ
Require: A history h
Require: A remaining search horizon m ∈ N

 1: if m = 0 then
 2:   return 0
 3: else if Ψ(h) is a chance node then
 4:   Generate (o, r) from ρ(or | h)
 5:   Create node Ψ(hor) if T(hor) = 0
 6:   reward ← r + Sample(Ψ, hor, m − 1)
 7: else if T(h) = 0 then
 8:   reward ← Rollout(h, m)
 9: else
10:   a ← SelectAction(Ψ, h, m)
11:   reward ← Sample(Ψ, ha, m)
12: end if
13: V̂(h) ← [reward + T(h) V̂(h)] / (T(h) + 1)
14: T(h) ← T(h) + 1
15: return reward

The ramifications of the ρUCT extension are particularly significant to Bayesian agents described in Section 3. Proposition 1 allows ρUCT to be instantiated with a mixture environment model, which directly incorporates model uncertainty into the planning process. This gives (in principle, provided that the model class contains the true environment and ignoring issues of limited computation) the well known Bayes-optimal solution to the exploration/exploitation dilemma; namely, if a reduction in model uncertainty would lead to higher expected future reward, ρUCT would recommend an information gathering action.
Algorithm 3 SelectAction(Ψ, h, m)
Require: A search tree Ψ
Require: A history h
Require: A remaining search horizon m ∈ N
Require: An exploration/exploitation constant C > 0

  U ← {a ∈ A : T(ha) = 0}
  if U ≠ {} then
    Pick a ∈ U uniformly at random
    Create node Ψ(ha)
    return a
  else
    return arg max_{a∈A} { V̂(ha) / (m(β − α)) + C √( log(T(h)) / T(ha) ) }
  end if
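As a concrete illustration of Algorithm 3, the sketch below selects an action given per-child visit counts and value estimates. The node interface (count, child_count, child_value) is an assumption of ours about how search statistics might be stored, not part of the paper's specification.

    import math
    import random

    def select_action(node, actions, m, alpha, beta, C):
        # Play each untried action once before applying UCB.
        untried = [a for a in actions if node.child_count(a) == 0]
        if untried:
            return random.choice(untried)
        def ucb(a):
            # Scale the value estimate into [0, 1] using the horizon m and
            # the reward range [alpha, beta], then add the exploration bonus.
            exploit = node.child_value(a) / (m * (beta - alpha))
            explore = C * math.sqrt(math.log(node.count) / node.child_count(a))
            return exploit + explore
        return max(actions, key=ucb)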
5 Action-Conditional CTW

We now introduce a large mixture environment model for use with ρUCT. Context Tree Weighting (CTW) [WST95] is an efficient and theoretically well-studied binary sequence prediction algorithm that works well in practice. It is an online Bayesian model averaging algorithm that computes, at each time point t, the probability

Pr(y_{1:t}) = Σ_M Pr(M) Pr(y_{1:t} | M),    (10)

where y_{1:t} is the binary sequence seen so far, M is a prediction suffix tree [RST96], Pr(M) is the prior probability of M, and the summation is over all prediction suffix trees of bounded depth D. A naive computation of (10) takes time O(2^{2^D}); using CTW, this computation requires only O(D) time. In this section, we outline how CTW can be extended to compute probabilities of the form

Pr(x_{1:t} | a_{1:t}) = Σ_M Pr(M) Pr(x_{1:t} | M, a_{1:t}),    (11)

where x_{1:t} is a percept sequence, a_{1:t} is an action sequence, and M is a prediction suffix tree as in (10). This extension allows CTW to be used as a mixture environment model (Definition 4) in the ρUCT algorithm, where we combine (11) and (2) to predict the next percept given a history.

Krichevsky-Trofimov Estimator.
We start with a brief review of the KT estimator for Bernoulli distributions. Given a binary string y_{1:t} with a zeroes and b ones, the KT estimate of the probability of the next symbol is given by

Pr_kt(Y_{t+1} = 1 | y_{1:t}) := (b + 1/2) / (a + b + 1).    (12)

The KT estimator can be obtained via a Bayesian analysis by putting an uninformative (Jeffreys) Beta(1/2, 1/2) prior Pr(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2} on the parameter θ ∈ [0, 1] of the Bernoulli distribution.
1] ofthe Bernoulli distribution. The probability of a string y t is given byPr kt ( y t ) = Pr kt ( y | ǫ )Pr kt ( y | y ) · · · Pr kt ( y t | y < t ) = R θ b (1 − θ ) a Pr( θ ) d θ. Prediction Su ffi x Trees. We next describe prediction suf-fix trees. We consider a binary tree where all the left edgesare labelled 1 and all the right edges are labelled 0. Thedepth of a binary tree M is denoted by d ( M ). Each nodein M can be identified by a string in { , } ∗ as usual: ǫ rep-resents the root node of M ; and if n ∈ { , } ∗ is a nodein M , then n n n . The set of M ’s leaf nodes is de-noted by L ( M ) ⊂ { , } ∗ . Given a binary string y t where t ≥ d ( M ), we define M ( y t ) : = y t y t − . . . y t ′ , where t ′ ≤ t isthe (unique) positive integer such that y t y t − . . . y t ′ ∈ L ( M ). Definition 5
Prediction Suffix Trees. We next describe prediction suffix trees. We consider a binary tree where all the left edges are labelled 1 and all the right edges are labelled 0. The depth of a binary tree M is denoted by d(M). Each node in M can be identified by a string in {0,1}* as usual: ε represents the root node of M; and if n ∈ {0,1}* is a node in M, then 1n and 0n represent the left and right children of n respectively. The set of M's leaf nodes is denoted by L(M) ⊂ {0,1}*. Given a binary string y_{1:t} where t ≥ d(M), we define M(y_{1:t}) := y_t y_{t−1} ... y_{t′}, where t′ ≤ t is the (unique) positive integer such that y_t y_{t−1} ... y_{t′} ∈ L(M).

Definition 5

A prediction suffix tree (PST) is a pair (M, Θ), where M is a binary tree and associated with each l ∈ L(M) is a distribution over {0,1} parameterised by θ_l ∈ Θ. We call M the model of the PST and Θ the parameter of the PST.

A PST (M, Θ) maps each binary string y_{1:t}, t ≥ d(M), to θ_{M(y_{1:t})}; the intended meaning is that θ_{M(y_{1:t})} is the probability that the next bit following y_{1:t} is 1. For example, the PST in Figure 1 maps the string 1110 to θ_{M(1110)} = θ_{01} = 0.3, which means the next bit after 1110 is 1 with probability 0.3.
[Figure 1: An example prediction suffix tree of depth 2, with leaf parameters θ_1 = 0.1, θ_{01} = 0.3 and θ_{00} = 0.5.]
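A PST lookup is just a walk from the root, reading the history from its most recent bit backwards. The sketch below makes this concrete; the node representation (a children dict for internal nodes, a theta parameter at leaves) is our own illustrative choice.

    def pst_theta(node, history):
        # Follow edges labelled by history[-1], history[-2], ... until a
        # leaf l = M(y_{1:t}) in L(M) is reached; return its parameter.
        i = len(history) - 1
        while node.children:
            node = node.children[history[i]]
            i -= 1
        return node.theta    # probability that the next bit is 1

On the PST of Figure 1, pst_theta(root, [1, 1, 1, 0]) follows the edges labelled 0 and then 1 and returns θ_{01} = 0.3.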
Action-Conditional PST. In the agent setting, we reduce the problem of predicting history sequences with general non-binary alphabets to that of predicting the bit representations of those sequences. Further, we only ever condition on actions; this is achieved by appending bit representations of actions to the input sequence without updating the PST parameters. Assume |X| = 2^{l_X} for some l_X > 0.
Denote by ⟦x⟧ = x[1, l_X] = x[1]x[2]...x[l_X] the bit representation of x ∈ X. Denote by ⟦x_{1:t}⟧ = ⟦x_1⟧⟦x_2⟧...⟦x_t⟧ the bit representation of a sequence x_{1:t}. Action symbols are treated similarly.

To do action-conditional sequence prediction using a PST with a given model M but unknown parameter, we start with θ_l := Pr_kt(1 | ε) = 1/2 at each l ∈ L(M). We set aside an initial portion of the binary history sequence to initialise the variable h and then repeat the following steps as long as needed (a code sketch follows the equations below):

1. set h := h⟦a⟧, where a is the current selected action;
2. for i := 1 to l_X do
   (a) predict the next bit using the distribution θ_{M(h)};
   (b) observe the next bit x[i], update θ_{M(h)} using (12) according to the value of x[i], and then set h := hx[i].

Let M be the model of a prediction suffix tree, a_{1:t} an action sequence, x_{1:t} a percept sequence, and h := ⟦ax_{1:t}⟧. For each node n in M, define h_{M,n} by

h_{M,n} := h_{i_1} h_{i_2} ··· h_{i_k},    (13)

where 1 ≤ i_1 < i_2 < ··· < i_k ≤ t and, for each i, i ∈ {i_1, i_2, ..., i_k} iff h_i is a percept bit and n is a prefix of M(h_{1:i−1}). We have the following expression for the probability of x_{1:t} given M and a_{1:t}:

Pr(x_{1:t} | M, a_{1:t}) = Π_{i=1}^{t} Π_{j=1}^{l_X} Pr(x_i[j] | M, ⟦ax_{<i} a_i⟧ x_i[1, j−1]) = Π_{n∈L(M)} Pr_kt(h_{M,n}).    (14)
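The following sketch implements the two update steps above for one agent cycle, reusing the pst_theta walk sketched earlier. The helper pst_update is hypothetical: it stands for locating the leaf M(h) and applying the KT update (12) to its parameter; action bits deliberately skip this update, since we only ever condition on actions.

    import math

    def process_cycle(pst, action_bits, percept_bits, history):
        # Step 1: append the action's bit representation without updating
        # any PST parameter; actions are conditioned on, never predicted.
        history.extend(action_bits)
        # Step 2: predict, then learn, on each percept bit in turn.
        log_prob = 0.0
        for bit in percept_bits:
            p_one = pst_theta(pst, history)   # step 2(a): Pr(next bit = 1)
            log_prob += math.log(p_one if bit == 1 else 1.0 - p_one)
            pst_update(pst, history, bit)     # step 2(b): hypothetical KT update at leaf M(h)
            history.append(bit)
        return log_prob                       # log-probability of the percept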
Context Tree Weighting. The above deals with action-conditional prediction using a single PST. We now show how we can efficiently perform action-conditional prediction using a Bayesian mixture of PSTs. There are two main computational tricks: the use of a data structure to represent all PSTs of a certain maximum depth, and the use of probabilities of sequences in place of conditional probabilities.

Definition 6
A context tree of depth D is a perfect binary tree of depth D such that attached to each node (both internal and leaf) is a probability on {0,1}*.

The weighted probability P^n_w of each node n in the context tree T after seeing h := ⟦ax_{1:t}⟧ is defined as follows:

P^n_w := Pr_kt(h_{T,n}) if n is a leaf node;
P^n_w := (1/2) Pr_kt(h_{T,n}) + (1/2) P^{n0}_w × P^{n1}_w otherwise.
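This recursion translates directly into code. The sketch below is illustrative only: it assumes each node caches the KT block probability Pr_kt(h_{T,n}) of the bits routed to it, and for clarity it ignores the log-space arithmetic and incremental path updates used in any practical CTW implementation.

    def weighted_prob(node):
        # P_w^n = Pr_kt(h_{T,n})                                 if n is a leaf,
        #         1/2 Pr_kt(h_{T,n}) + 1/2 P_w^{n0} * P_w^{n1}   otherwise.
        if node.is_leaf():
            return node.kt_prob()
        return 0.5 * node.kt_prob() + 0.5 * (
            weighted_prob(node.child(0)) * weighted_prob(node.child(1)))

Because an observed bit affects only the D nodes on a single root-to-leaf path, only those weighted probabilities need recomputing after each bit, which is the source of the O(D) update cost quoted above.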
The following is a straightforward extension of a result due to [WST95].

Lemma 1

Let T be the depth-D context tree after seeing h := ⟦ax_{1:t}⟧. For each node n in T at depth d, we have

P^n_w = Σ_{M ∈ C_{D−d}} 2^{−Γ_{D−d}(M)} Π_{l∈L(M)} Pr_kt(h_{T,nl}),    (15)

where C_d is the set of all models of PSTs with depth ≤ d, and Γ_d(M) is the code-length for M given by the number of nodes in M minus the number of leaf nodes in M of depth d.

A corollary of Lemma 1 is that at the root node ε of the context tree T after seeing h := ⟦ax_{1:t}⟧, we have

P^ε_w(x_{1:t} | a_{1:t}) = Σ_{M∈C_D} 2^{−Γ_D(M)} Π_{l∈L(M)} Pr_kt(h_{T,l})    (16)
                        = Σ_{M∈C_D} 2^{−Γ_D(M)} Π_{l∈L(M)} Pr_kt(h_{M,l})    (17)
                        = Σ_{M∈C_D} 2^{−Γ_D(M)} Pr(x_{1:t} | M, a_{1:t}),    (18)

where the last step follows from (14). Notice that the prior 2^{−Γ_D(·)} penalises PSTs with large tree structures. The conditional probability of x_t given ax_{<t} a_t can be obtained from (2). We can also efficiently sample the individual bits of x_t one by one.

Computational Complexity.
The Action-Conditional CTW algorithm grows the context tree dynamically. Using a context tree with depth D, there are at most O(tD log(|O||R|)) nodes in the context tree after t cycles. In practice, this is a lot less than 2^{D+1}, the number of nodes in a fully grown context tree. The time complexity of Action-Conditional CTW is also impressive, requiring O(D log(|O||R|)) time to process each new piece of agent experience and O(mD log(|O||R|)) to sample a single trajectory when combined with ρUCT. Importantly, this is independent of t, which means that the computational overhead does not increase as the agent gathers more experience.

6 Theoretical Results
Putting the ρUCT and Action-Conditional CTW algorithms together yields our approximate AIXI agent. We now investigate some of its properties.
Model Class Approximation.
By instantiating (5) with the mixture environment model (18), one can show that the optimal action for an agent at time t, having experienced ax_{<t}, is given by

arg max_{a_t} Σ_{x_t} ··· max_{a_{t+m}} Σ_{x_{t+m}} [ Σ_{i=t}^{t+m} r_i ] Σ_{M∈C_D} 2^{−Γ_D(M)} Pr(x_{1:t+m} | M, a_{1:t+m}).

Compare this to the action chosen by the AIXI agent

arg max_{a_t} Σ_{x_t} ··· max_{a_{t+m}} Σ_{x_{t+m}} [ Σ_{i=t}^{t+m} r_i ] Σ_{ρ∈M} 2^{−K(ρ)} ρ(x_{1:t+m} | a_{1:t+m}),

where class M consists of all computable environments ρ and K(ρ) denotes the Kolmogorov complexity of ρ. Both use a prior that favours simplicity. The main difference is in the subexpression describing the mixture over the model class. AIXI uses a mixture over all enumerable chronological semimeasures, which is completely general but incomputable. Our approximation uses a mixture of all prediction suffix trees of a certain maximum depth, which is still a rather general class, but one that is efficiently computable.

Consistency of ρUCT. [KS06] shows that the UCT algorithm is consistent in finite horizon MDPs and derives finite sample bounds on the estimation error due to sampling. By interpreting histories as Markov states, the general reinforcement learning problem reduces to a finite horizon MDP and the results of [KS06] are now directly applicable. Restating the main consistency result in our notation, we have

∀ε ∀h: lim_{T(h)→∞} Pr( |V^m_ρ(h) − V̂^m_ρ(h)| ≤ ε ) = 1.    (19)

Furthermore, the probability that a suboptimal action (with respect to V^m_ρ(·)) is chosen by ρUCT goes to zero in the limit.
Convergence to True Environment.
The next result, adapted from [Hut05], shows that if there is a good model of the (unknown) environment in C_D, then Action-Conditional CTW will predict well.
Let μ be the true environment, and Υ ≡ P^ε_w the mixture environment model formed from (18). The μ-expected squared difference of μ and Υ is bounded as follows. For all n ∈ N, for all a_{1:n},

Σ_{k=1}^{n} Σ_{x_{<k}} μ(x_{<k} | a_{<k}) Σ_{x_k} ( μ(x_k | ax_{<k} a_k) − Υ(x_k | ax_{<k} a_k) )^2
  ≤ min_{M∈C_D} { Γ_D(M) ln 2 + KL( μ(· | a_{1:n}) ∥ Pr(· | M, a_{1:n}) ) },    (20)

where KL(·∥·) is the KL divergence of two distributions.

If the RHS of (20) is finite over all n, then the sum on the LHS can only be finite if Υ converges sufficiently fast to μ. If the KL divergence grows sublinearly in n, then Υ still converges to μ (in a weaker Cesàro sense), which is for instance the case for all k-order Markov and all stationary processes μ.
Theorem 1 above, in conjunction with [Hut05, Thm. 5.36], implies that V^m_Υ(h) converges to V^m_μ(h) as long as there exists a model in the model class that approximates the unknown environment μ well. This, and the consistency (19) of the ρUCT algorithm, imply that V̂^m_Υ(h) converges to V^m_μ(h). More detail can be found in [VNHS09].

7 Experimental Results

This section evaluates our approximate AIXI agent on a variety of test domains. The Cheese Maze, 4x4 Grid and Extended Tiger domains are taken from the POMDP literature. The TicTacToe domain comprises a repeated series of games against an opponent who moves randomly. The Biased RockPaperScissor domain, described in [FMWR07], involves the agent repeatedly playing RockPaperScissor against an exploitable opponent. Two more challenging domains are included: Kuhn Poker [HSHB05], where the agent plays second against a Nash optimal player, and a partially observable version of Pacman described in [VNHS09]. With the exception of Pacman, each domain has a known optimal solution. Although our domains are modest, requiring the agent to learn the environment from scratch significantly increases the difficulty of each of these problems.
[Table 1: Parameter configuration. Columns: domain, |A|, |O|, the A/O/R encoding bits, the context depth D, and the search horizon m.]

Table 1 outlines the parameters used in each experiment. The sizes of the action and observation spaces are given,
along with the number of bits used to encode each space. The context depth parameter D specifies the maximal number of recent bits used by the Action-Conditional CTW prediction scheme. The search horizon is given by the parameter m. Larger D and m increase the capabilities of our agent, at the expense of linearly increasing computation time; our values represent an appropriate compromise between these two competing dimensions for each problem domain.

[Figure 2: Learning scalability results. Normalised average reward per cycle versus experience for the Cheese Maze, Tiger, 4x4 Grid, TicTacToe, Biased RPS, Kuhn Poker and Pacman domains, together with the optimal baseline.]

Figure 2 shows how the performance of the agent scales with experience, measured in terms of number of interaction cycles. Experience was gathered by a decaying ε-greedy policy, which either chose an action at random or used ρUCT. The results are normalised with respect to the optimal average reward per time step, except in Pacman, where we normalised to an estimate. Each data point was obtained by starting the agent with an amount of experience given by the x-axis and running it greedily for 2000 cycles. The amount of search used for each problem domain, measured by the number of ρUCT simulations per cycle, is given in Table 2. (The average search time per cycle is also given.) The agent converges to optimality on all the test domains with known optimal values, and exhibits good scaling properties on our challenging Pacman variant. Visual inspection of Pacman shows that the agent, whilst not playing perfectly, has already learnt a number of important concepts.

Table 2 summarises the resources required for approximately optimal performance on our test domains. Timing statistics were collected on a dual 2.53GHz Intel Xeon. Domains that included a planning component, such as Tiger, required more search. Convergence was somewhat slower in TicTacToe; the main difficulty for the agent was learning not to lose the game immediately by playing an illegal move. Most impressive was that the agent learnt to play an approximate best response strategy for Kuhn Poker, without knowing the rules of the game or the opponent's strategy. A video of the agent playing the Pacman domain is available at http://www.youtube.com/watch?v=RhQTWidQQ8U.
[Table 2: Resources required for optimal performance. Columns: domain, amount of experience, ρUCT simulations per cycle, and average search time per cycle (e.g. 500 simulations and 0.9s of search per cycle for the Cheese Maze).]
8 Related Work

The BLHT algorithm [SH99] is closely related to our work. It uses symbol level PSTs for learning and an (unspecified) dynamic programming based algorithm for control. BLHT uses the most probable model for prediction, whereas we use a mixture model, which admits a much stronger convergence result. A further distinction is our usage of an Ockham prior instead of a uniform prior over PST models.

The Active-LZ [FMWR07] algorithm combines a Lempel-Ziv based prediction scheme with dynamic programming for control to produce an agent that is provably asymptotically optimal if the environment is n-Markov. We implemented the Active-LZ test domain, Biased RPS, and compared against their published results. Our agent was able to achieve optimal levels of performance within 10^6 cycles; in contrast, Active-LZ was still suboptimal after 10^8 cycles.

U-Tree [McC96] is an online agent algorithm that attempts to discover a compact state representation from a raw stream of experience. Each state is represented as the leaf of a suffix tree that maps history sequences to states. As more experience is gathered, the state representation is refined according to a heuristic built around the Kolmogorov-Smirnov test. This heuristic tries to limit the growth of the suffix tree to places that would allow for better prediction of future reward. Value Iteration is used at each time step to update the value function for the learned state representation, which is then used by the agent for action selection.

It is instructive to compare and contrast our AIXI approximation with the Active-LZ and U-Tree algorithms. The small state space induced by U-Tree has the benefit of limiting the number of parameters that need to be estimated from data. This has the potential to dramatically speed up the model-learning process. In contrast, both Active-LZ and our approach require a number of parameters proportional to the number of distinct contexts. This is one of the reasons why Active-LZ exhibits slow convergence in practice. This problem is much less pronounced in our approach for two reasons. First, the Ockham prior in CTW ensures that future predictions are dominated by PST structures that have seen enough data to be trustworthy. Secondly, value function estimation is decoupled from the process of context estimation. Thus it is reasonable to expect ρUCT to make good local decisions provided Action-Conditional CTW can predict well. The downside, however, is that our approach requires search for action selection. Although ρUCT is an anytime algorithm, in practice more computation is required per cycle compared to approaches like Active-LZ and U-Tree that act greedily with respect to an estimated global value function.

The U-Tree algorithm is well motivated, but unlike Active-LZ and our approach, it lacks theoretical performance guarantees. It is possible for U-Tree to prematurely converge to a locally optimal state representation from which the heuristic splitting criterion can never recover. Furthermore, the splitting heuristic contains a number of configuration options that can dramatically influence its performance [McC96]. This parameter sensitivity somewhat limits the algorithm's applicability to the general reinforcement learning problem.

Our work is also related to Bayesian Reinforcement Learning. In model-based Bayesian RL [PV08, Str00], a distribution over (PO)MDP parameters is maintained.
In contrast, we maintain an exact Bayesian mixture of PSTs. The ρUCT algorithm shares similarities with Bayesian Sparse Sampling [WLBS05]; the key differences are estimating the leaf node values with a rollout function and guiding the search with the UCB policy.

A more comprehensive discussion of related work can be found in [VNHS09].

9 Limitations

The main limitation of our current AIXI approximation is the restricted model class. Our agent will perform poorly if the underlying environment cannot be predicted well by a PST of bounded depth. Prohibitive amounts of experience will be required if a large PST model is needed for accurate prediction. For example, it would be unrealistic to think that our current AIXI approximation could cope with real-world image or audio data.

The identification of efficient and general model classes that better approximate the AIXI ideal is an important area for future work. Some preliminary ideas are explored in [VNHS09].
10 Conclusion
We have introduced the first computationally tractable approximation to the AIXI agent and shown that it provides a promising approach to the general reinforcement learning problem. Investigating multi-alphabet CTW for prediction, parallelisation of ρUCT, further expansion of the model class (ideally, beyond variable-order Markov models) and more sophisticated rollout policies for ρUCT are exciting areas for future investigation.
11 Acknowledgements
This work received support from the Australian Research Council under grant DP0988049. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References

[Aue02] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397–422, 2002.

[FMWR07] V. Farias, C. Moallemi, T. Weissman, and B. Van Roy. Universal Reinforcement Learning. CoRR, 2007.

[HSHB05] B. Hoehn, F. Southey, R. C. Holte, and V. Bulitko. Effective short-term opponent exploitation in simplified poker. In AAAI'05, pages 783–788, 2005.

[Hut05] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, 2005.

[KS06] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ECML, pages 282–293, 2006.

[McC96] Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996.

[PV08] Pascal Poupart and Nikos Vlassis. Model-based Bayesian Reinforcement Learning in Partially Observable Domains. In ISAIM, 2008.

[RST96] D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2):117–150, 1996.

[SH99] Nobuo Suematsu and Akira Hayashi. A reinforcement learning algorithm in partially observable environments using short-term memory. In NIPS, pages 1059–1065, 1999.

[Str00] M. Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943–950, 2000.

[VNHS09] Joel Veness, Kee Siong Ng, Marcus Hutter, and David Silver. A Monte Carlo AIXI Approximation. CoRR, 2009.

[WLBS05] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In ICML, pages 956–963, 2005.

[WST95] Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The Context Tree Weighting Method: Basic Properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.