Consistency of Feature Markov Processes
Marcus Hutter and Peter Sunehag
RSISE @ ANU and SML @ NICTA, Canberra, ACT 0200, Australia
{Peter.Sunehag, Marcus.Hutter}@anu.edu.au
12 July 2010
Abstract
We study long-term sequence prediction (forecasting). We approach this by investigating criteria for choosing a compact useful state representation. The state is supposed to summarize useful information from the history. We want a method that is asymptotically consistent in the sense that it will provably eventually only choose between alternatives that satisfy an optimality property related to the used criterion. We extend our work to the case where there is side information that one can take advantage of and, furthermore, we briefly discuss the active setting where an agent takes actions to achieve desirable outcomes.
Keywords: Markov Process (MP); Hidden Markov Model (HMM); Finite State Machine (FSM); Probabilistic Deterministic Finite State Automata (PDFA); Penalized Maximum Likelihood (PML); ergodicity; asymptotic consistency; suffix trees; model selection; learning; reduction; side information; reinforcement learning.

Introduction
When studying long-term sequence prediction one is interested in answering questions like: What will the next k observations be? How often will a certain event or a sequence of events occur? What is the average rate of a variable like cost or income? This can be interesting for forecasting time series and for choosing policies with desirable outcomes.
Hidden Markov Models [CMR05, EM02] are often used for long-term forecasting and sequence prediction. In this article we will restrict our study to models based on states that result from a deterministic function of the history; in other words, states that summarize useful information that has been observed so far. We will consider finite state space maps with the property that given the current state and the next observation we can determine the next state. These maps are sometimes called Probabilistic-Deterministic Finite Automata (PDFA) [VTdlH+05]. The criteria we study concern the ability to predict the next k symbols or to have minimal entropy rate for an infinite horizon.
After the preliminary Section 2 we begin our theory development in Section 3. In our problem setting we have a finite set Y, a sequence y_1, ..., y_n of elements from Y, and we are interested in predicting the future of the sequence. To do this, being inspired by [Hut09], where general criteria for choosing a feature map for reinforcement learning were discussed, we first want to learn a feature map Φ(y_{1:n}) = s_n, where y_{1:t} := y_1, ..., y_t. We would like the map to have the following properties:
1. The distribution for the sequence s_n induced by the distribution for the sequence y_n should be that of a Markov chain, or should be a distribution which is indistinguishable from a Markov chain for the purpose of predicting the sequence y_n.
2. We want as few states as possible so that we can learn a model from a modest amount of data.
3. We want the model of the sequence y_n that arises as a function of the Markov chain s_n to be as good as possible.
Ideally it should be the true distribution.
Our approach consists of defining criteria that can be applied to any class of maps Φ, but later we restrict our study to a class of maps that are defined by finite-state machines. These maps are defined by introducing a deterministic function ψ such that s_n = ψ(s_{n−1}, y_n). If we have chosen such a map ψ and a first state s_0, then the sequence y_{1:n} determines a unique sequence s_{1:n}, and therefore we have also defined a map Φ(y_{1:n}) = s_n.
In Section 2 we provide some preliminaries on random sequences and Hidden Markov Models. We introduce a class of ergodic sequences, which is the class of sequences that we work with in this article. They are sequences with the property that an individual sequence determines a distribution over infinite sequences. We present our consistency theory by first presenting very generic results in the beginning of Section 3, and then we show how various classes of maps and models fit into this. This has the consequence that we first have results where we guarantee optimality given that the individual sequence that we work with has certain properties (these results, therefore, have no "almost sure" in the statement, since the setting is not probabilistic), while in the latter part we show that if we sample the sequence in certain ways we will almost surely get a sequence with these properties. In particular, in Section 4 we take a closer look at suffix tree sources and maps based on finite state machines related to probabilistic deterministic finite automata. Section 5 summarizes the findings in a main theorem that says that under some assumptions (a class of maps based on finite state machines of bounded memory, and ergodicity) we will recover the true model (or the closest we can get to the true model).
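As a concrete illustration of such a map Φ built from a deterministic update ψ and an initial state, consider the following sketch. The particular ψ below (a two-state machine that remembers the last symbol) is our own toy example, not one from the paper.

```python
def make_phi(psi, s0):
    """Build Phi(y_1..y_n) = s_n from a deterministic update psi and start state s0."""
    def phi(ys):
        s = s0
        for y in ys:          # s_t = psi(s_{t-1}, y_t)
            s = psi(s, y)
        return s
    return phi

# Toy example: a two-state machine over alphabet {0, 1} remembering the last symbol.
psi = lambda s, y: y          # next state is just the current observation
phi = make_phi(psi, s0=0)
print(phi([0, 1, 1, 0]))      # -> 0 (the last symbol)
```

Any deterministic ψ plugged into `make_phi` yields a map from histories to states, which is exactly the class of maps studied in Section 4.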
Section 6 contains a discussion of sequence prediction with side information, Section 7 briefly discusses the active case where an agent acts in an environment and earns rewards, and finally Section 8 contains our conclusions.

Preliminaries

In this section we review some notions and results that the rest of the article relies upon. We start with random sequences; then follows a section on Hidden Markov Models (HMM).
Random Sequences.
Consider the set of all infinite sequences y_t, t = 1, 2, ..., of elements from a finite alphabet Y. We equip the set with the σ-algebra that is generated by the cylinder sets Γ_{y_{1:n}} = {x_{1:∞} | x_t = y_t, t = 1, ..., n}. A measure with respect to this space is determined by its values on the cylinder sets. Not every set of values is valid. We need to assume that the measure of Γ_{y_{1:t}} is the sum of the measures of the sets Γ_{y_{1:t} ỹ} for all possible ỹ ∈ Y. If we want it to be a probability measure we furthermore need to assume that the measure of the whole space Y^∞ (which is the cylinder set Γ_ε of the empty string ε) equals one. The concept that is introduced in the following two definitions is of central importance to this article. In particular, ergodic sequences form the class of sequences that we intend to model. They are sequences that can be used to define a distribution over infinite sequences that we will be interested in learning.

Definition 1 (Distribution defined from one sequence)
A sequence y_{1:∞} defines a probability distribution on infinite sequences if the (relative) frequency of every finite substring of y_{1:∞} converges asymptotically. The probabilities of the cylinder sets are defined to equal those limits:
Pr(Γ_{z_{1:m}}) := lim_{n→∞} #{t ≤ n : y_{t+1:t+m} = z_{1:m}} / n.

Definition 2 (ergodic sequence)
We say that a sequence is ergodic if the frequencies of every finite substring converge asymptotically.
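The substring frequencies in Definitions 1 and 2 can be computed empirically; a minimal sketch (the function and variable names are ours, not the paper's):

```python
def substring_frequency(ys, z):
    """Relative frequency #{t <= n : y_{t+1:t+m} = z} / n of substring z in ys."""
    n, m = len(ys), len(z)
    count = sum(1 for t in range(n) if ys[t:t + m] == z)
    return count / n

# A period-2 sequence: every substring frequency converges, so it is ergodic.
ys = [0, 1] * 500
print(substring_frequency(ys, [0, 1]))  # -> 0.5
```

For a genuinely ergodic sequence these frequencies stabilize as the prefix length grows, which is exactly what the limits in Definition 1 require.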
As probabilistic models for random sequences we will in this article focus on Hidden Markov Models (HMMs) [BP66, Pet69]. More recent surveys on Hidden Markov Models are [EM02, CMR05].
Hidden Markov Models.
Here we define distributions over sequences of elements from a finite set Y of size |Y|, based on an unobserved Markov chain of elements from a finite state set S of size |S|.

Definition 3 (Hidden Markov Model, HMM)
Assume that we have a Markov chain with an S × S transition matrix T = (T_{s,s'}) and that we also have an S × Y emission matrix E = (E_{s,y}), where E_{s,y} is the probability that state s will generate outcome y ∈ Y. If we introduce a starting probability vector we have defined a probability distribution over sequences of elements from Y. This is called a Hidden Markov Model (HMM).

Sequence Prediction.
One use of Hidden Markov Models (and functions of Markov chains) is sequence prediction. Given a history y_1, ..., y_n we want to predict the future y_{n+1}, .... In some situations we know what state we are in at time n, and that state then summarizes the entire history without losing any useful information, since the future is conditionally independent of the past given the current state. If we are doing one step prediction we are interested in knowing Pr(y_{n+1} | s_n). We can also consider a zero step lookahead (called filtering) Pr(y_n | s_n), or an m step Pr(y_{n+1}, ..., y_{n+m} | s_n). The m step could also be just Pr(y_{n+m} | s_n). In a sense we can consider an infinite lookahead ability evaluated by the entropy rate −lim_{m→∞} (1/m) log Pr(y_{n+1}, ..., y_{n+m} | s_n). If the Markov chain is ergodic this limit does not depend on the state s_n.

Limit Theorems.
The following theory, which is the foundation for studying consistency of HMMs, was developed in [BP66] and [Pet69]. See [CMR05], chapter 12, for the modern state of the art.
Definition 4 (ergodic Markov chain)
A Markov chain (and the stochastic matrix that contains its transition probabilities) is called ergodic if it is possible to move from state s to state s' in a finite number of steps, for all s and s'.

The following theorem [CMR05] introduces the generalized cross-entropy H and shows that it is well defined and that it can be estimated for ergodic HMMs. It can be interpreted as the (idealized) expected number of bits needed for coding a symbol generated by a distribution defined by θ_0 but using the distribution defined by θ.

Theorem 5 (ergodic HMMs) If θ_0 and θ are HMM parameters where the transition matrix for θ_0 is an ergodic stochastic matrix, then there exists a finite number H(θ_0, θ) (which can also be defined as lim_{n→∞} H_{n,s}(θ_0, θ) for any initial state s, where H_{n,s}(θ_0, θ) := −(1/n) E_{θ_0} log Pr(y_1, ..., y_n | s_1 = s, θ)) such that, P_{θ_0}-almost surely,
−lim_{n→∞} (1/n) log Pr(y_1, ..., y_n | θ) = H(θ_0, θ),
and the convergence is uniform in the parameter space.

Definition 6 (Equivalent HMMs)
For an HMM θ, let M[θ] be the set of all θ' such that the HMM with parameters θ' defines the same distribution over outcomes as the HMM with parameters θ.

Theorem 7 (Minimal cross-entropy for the truth and only the truth) H(θ_0, θ) ≥ H(θ_0, θ_0), with equality if and only if θ ∈ M[θ_0].

Given a sequence of elements y_{1:n} from a finite alphabet we want to define a map Φ : Y* → S, which maps histories (finite strings) of elements to states Φ(y_{1:n}) = s_n. The reasons for this include, as was explained in the introduction, in particular the ability to learn a model efficiently. Suppose that every Φ under consideration is such that the size of its state space S is a finite number that depends on Φ.
We are also interested in the case when we have side information x_n ∈ X, and we then define a map Φ : (X × Y)* → S. In this more general case the models that we consider for the sequence y will have hidden states, while in the case without side information the state (given the y sequence) is not hidden. We have two reasons for expressing everything in an HMM framework: we can model long-range dependence in the y_{1:n} sequence through having states, and we include the more general case where there is side information.

Definition 8 (Feature sequence/process)
A map Φ from finite strings of elements from Y (or X × Y) to elements in a finite set S, together with a sequence y_{1:n}, induces a state sequence s_{1:n}. Define an HMM through maximum likelihood estimation: the sequence s_t = Φ(y_{1:t}) gives the transition matrix T(n) = (T_{s,s'}(n)) of probabilities
T_{s,s'}(n) := #{t ≤ n | s_t = s, s_{t+1} = s'} / #{t ≤ n | s_t = s}
and the emission matrix E(n) of probabilities
E_{s,y}(n) := #{t ≤ n | s_t = s, y_t = y} / #{t ≤ n | s_t = s}.
Denote these HMMs by θ̂_n := (T(n), E(n)). We will refer to the sequence θ̂_n as the parameters corresponding to Φ, or generated by Φ.
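The counts in Definition 8 translate directly into code; a minimal sketch (function and variable names are ours), which counts transitions and emissions over t ≤ n−1:

```python
from collections import Counter

def estimate_hmm(states, ys):
    """Maximum-likelihood transition and emission frequencies (Definition 8)."""
    trans, emit, occ = Counter(), Counter(), Counter()
    for t in range(len(states) - 1):
        occ[states[t]] += 1
        trans[(states[t], states[t + 1])] += 1
        emit[(states[t], ys[t])] += 1
    T = {k: c / occ[k[0]] for k, c in trans.items()}
    E = {k: c / occ[k[0]] for k, c in emit.items()}
    return T, E

states = [0, 1, 0, 1, 0]
ys     = ['a', 'b', 'a', 'b', 'a']
T, E = estimate_hmm(states, ys)
print(T[(0, 1)])  # -> 1.0: state 0 was always followed by state 1
```

The returned dictionaries play the role of (T(n), E(n)) = θ̂_n for the given state sequence.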
We will first state results based on some generic properties that we have defined with just the goal of making the proofs work. Then we will show that some more easily understandable cases satisfy these properties. We structure it this way not only for generality but also to make the proof techniques clearer.
Ergodic Sequences.
We begin by defining the fundamental ergodicity properties that we will rely upon. We provide asymptotic results for individual sequences that satisfy these properties. In the next two subsections we identify situations where we will almost surely get a sequence which satisfies these ergodicity properties.
Definition 9 (ergodic w.r.t. Φ) As stated in Definition 2, we say that a sequence y_t is ergodic if all substring frequencies converge as n → ∞. Furthermore we say that
1. the sequence y_t is ergodic with respect to a map Φ(y_{1:t}) = s_t if all state transition frequencies T_{s,s'}(n) and emission frequencies E_{s,y}(n) converge as n → ∞;
2. the sequence y_t is ergodic with respect to a class of maps if it is ergodic with respect to every map in the class.

Definition 10 (HMM-ergodic)
We say that a sequence y_t is HMM-ergodic for a set of HMMs Θ if there is an HMM with parameters θ_0 such that
−(1/n) log Pr(y_1, ..., y_n | θ) → H(θ_0, θ)
uniformly on compact subsets of Θ.

Definition 11 (Log-likelihood) L_n(Φ) = −log Pr(y_1, ..., y_n | θ̂_n).

We will prove our consistency results by first proving consistency using Maximum Likelihood (ML) for a finite class of maps, and then we prove that we can add a sublinearly growing model complexity penalty and still have consistency.

Proposition 12 (HMM consistency of ML for finite class)
Suppose that y_t is HMM-ergodic for the parameter set Θ with optimal parameters (in the sense of Definition 10) θ_0, that y_t is ergodic for the finite class of maps {Φ_i}_{i=1}^K, and suppose that θ_i ∈ Θ are the limiting parameters generated by Φ_i. Then it follows that there is N < ∞ such that for all n ≥ N the map Φ_i selected by minimizing L_n generates parameters θ̂_n^i whose limit is in arg min_{θ_i} H(θ_0, θ_i).

Proof.
It follows from Definition 10 and continuity (in θ) of the log-likelihood that
lim_{n→∞} (1/n) L_n(Φ_i) = H(θ_0, θ_i),
since the convergence in Definition 10 is uniform. Note that the parameters that define the log-likelihood L_n(Φ_i) can be different for every n, so the uniformity of the convergence is needed to draw the conclusion above. By Definition 9 we know that if θ̂_n^i are the parameters generated by Φ_i at time n, then lim_{n→∞} θ̂_n^i = θ_i exists for all i. It follows that if θ_i ∉ arg min_{θ_j} H(θ_0, θ_j), then there must be an N < ∞ such that Φ_i is not selected at times n ≥ N. Since there are only finitely many maps in the class, there will be a finite such N that works for all relevant i.

Definition 13 (HMM Cost function)
If the HMM with parameters θ̂_n that has been estimated from Φ at time n has S states, then let
Cost_n(Φ) = −log Pr(y_1, ..., y_n | θ̂_n) + pen(n, S),
where pen(n, S) is a positive function that is increasing in both n and S and is such that pen(n, S)/n → 0 as n → ∞ for all S.

We call the negative log-probability term the data coding cost and the other term the model complexity penalty. They are both motivated by coding (coding the data and the model). For instance in MDL/MML/BIC, pen(n, S) = (d/2) log n + O(1), where d is the dimensionality of the model θ.

Proposition 14
Suppose that Φ has optimal limiting parameters θ_0 with as few states as possible; in other words, if an HMM has fewer states than the HMM defined by θ_0, then it has a strictly larger entropy rate. We use a (finite, countable, or uncountable) class of maps that includes only Φ and maps that have strictly fewer states. We assume that all the maps generate converging parameters. Then there is an N such that the function Cost is minimized by Φ at all times n ≥ N.

Proof.
Suppose that θ_0 has S states. We will use a bound for how close one can get to the true HMM using fewer states. We would like to have a constant ε > 0 such that H(θ_0, θ) > H(θ_0, θ_0) + ε for all θ with fewer than S states. The existence of such an ε follows from continuity of H (which is actually also differentiable [BP66]), the fact that the HMMs with fewer than S states can be compactly (in the parameter space) embedded into the space of HMMs with exactly S states, and that this embedded subspace has a strictly positive minimum Euclidean distance from θ_0 in this parameter space.
The existence of ε > 0 implies that there is a D > 0 such that HMMs with fewer than S states have, for large n, at least Dn worse log probabilities than the distribution θ_0. Therefore, the penalty term (for which pen(n, S)/n → 0) will not be able to indefinitely compensate for the inferior modeling.
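A sketch of the Cost criterion of Definition 13 with a BIC-style penalty. All names are ours, the negative log-likelihood values are assumed to come from the maximum-likelihood HMM of Definition 8, and the parameter count `d` is a toy choice:

```python
import math

def cost(neg_log_prob, n, num_states, d=None):
    """Cost_n(Phi) = -log Pr(y_1..y_n | theta_hat_n) + pen(n, S),
    with a BIC-style penalty pen(n, S) = (d/2) log n."""
    if d is None:
        d = num_states * (num_states - 1)  # free transition parameters (toy choice)
    return neg_log_prob + 0.5 * d * math.log(n)

# Select the map minimizing Cost among candidates (neg_log_prob, num_states).
candidates = {'phi1': (700.0, 2), 'phi2': (690.0, 8)}
n = 1000
best = min(candidates, key=lambda k: cost(candidates[k][0], n, candidates[k][1]))
print(best)  # the simpler map wins unless the likelihood gain justifies more states
```

Because pen(n, S)/n → 0, a map whose data coding cost is linearly worse will eventually lose regardless of its penalty advantage, which is the content of the proof above.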
Theorem 15 (HMM consistency of Cost for finite class) Proposition 12 is also true for Cost.

Proof. H(θ_0, θ_k) < H(θ_0, θ_j) implies that there is a constant C > 0 such that, for large n, L_n(Φ_j) − L_n(Φ_k) ≥ Cn. Since pen(n, S)/n → 0 as n → ∞, we know that any difference in model penalty will be overtaken by the linearly growing difference in data code length.

Maps that induce HMMs.
In this section we will assume that we use a class of maps whose states we know form a Markov chain.
Definition 16 (Feature Markov Process, ΦMP)
Suppose that
Pr(y_n | Φ(y_{1:1}), ..., Φ(y_{1:n})) = Pr(y_n | Φ(y_{1:n}))
and that the state sequence is Markov, i.e.
Pr(Φ(y_{1:n}) | Φ(y_{1:1}), ..., Φ(y_{1:n−1})) = Pr(Φ(y_{1:n}) | Φ(y_{1:n−1})).
Then we say that Φ induces an HMM. We call HMMs induced by Φ Feature Markov Processes (ΦMP). If the HMM that is defined this way by Φ is the true distribution for the sequence y_1, y_2, ..., then we say that "Φ is correct". We will only discuss the situation when the true HMM is ergodic, so we will only say that there is a correct Φ in those situations; hence the statement "Φ is correct" will contain the assumption that the truth is ergodic.

Example 17
The map Φ which sends everything to the same state always induces an HMM but, unless the sequence y_1, y_2, ... is i.i.d., it is not correct. ♦

Proposition 18 (Convergence of estimated distributions) If Φ is correct, then P_{θ̂_n} → P_{θ_0} for n → ∞ (as distributions on finite strings of any fixed length), where P_{θ_0} is the true HMM distribution for the outcomes, P_θ is the HMM distribution defined by θ, and θ̂_n are the parameters generated by Φ.

Proof.
We are estimating the parameters θ̂_n through maximum likelihood for the generated sequence of states. Consistency of maximum likelihood estimation for Markov chains implies that θ̂_n → θ_0. This implies the proposition due to continuity with respect to the parameters of the likelihood (for any finite sequence length).

Proposition 19 (Inducing HMM implies drawing ergodic sequences)
If we have a set of maps that induce HMMs and the sequence y_t is drawn from one of the induced ergodic HMMs, then almost surely
1. y_t is HMM-ergodic;
2. we will draw an ergodic sequence y_t with respect to the considered class of maps.

Proof.
1. is a consequence of Theorem 5.
2. follows from consistency of maximum likelihood for Markov chains (a generalized law of large numbers), since the claim is that state transition frequencies and emission frequencies converge.
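A small simulation in the spirit of Proposition 19: sample from an ergodic two-state Markov chain and watch the empirical transition frequency approach the true value. The chain and its parameters are our own illustration, not from the paper.

```python
import random

random.seed(0)
# True ergodic two-state chain: from each state, switch with probability 0.3.
P = {0: [0.7, 0.3], 1: [0.3, 0.7]}

s, counts, visits = 0, {(0, 0): 0, (0, 1): 0}, 0
for _ in range(100_000):
    s_next = 0 if random.random() < P[s][0] else 1
    if s == 0:
        visits += 1
        counts[(0, s_next)] += 1
    s = s_next

print(counts[(0, 1)] / visits)  # empirical transition frequency, close to 0.3
```

The convergence of such frequencies is exactly the (almost sure) ergodicity with respect to the map that the proposition asserts.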
Maps based on Finite State Machines (FSMs)
We will in this section consider maps of a special form that are related to PDFAs. We will assume that Φ is such that there is a ψ such that
Φ(y_{1:n}) = ψ(Φ(y_{1:n−1}), y_n).
In other words, the current state is derived deterministically from the previous state and the current perception. Given an initial state, the state sequence is then deterministically determined by the perceptions, and therefore the combination of ψ with an initial state defines a map Φ from histories to states. This class of maps Φ can also define a class of probabilistic models of the sequence y_{1:n} by assuming that y_n only depends on s_{n−1} = Φ(y_{1:n−1}). This leads to the formula
Pr(s' | s) = Σ_{y : ψ(s,y)=s'} Pr(y | s),
and as a result we have defined an HMM for the sequence y_{1:n}.

Definition 20 (Sampling from FSM)
If we follow the procedure above we say that we have sampled the sequence y_t from the FSM. If the Markov chain of states is ergodic we say that we have sampled y_t ergodically from the FSM.

Suffix Trees.
We consider a class of maps based on FSMs that can be expressed using suffix trees [Ris86] with the same states (suffixes) as the FSM. The resulting models are sometimes called FSMX sources. A suffix tree is defined by a suffix set, which is a set of finite strings. The set must have the property that none of the strings is an ending substring (a suffix) of another string in the set, and such that any sufficiently long string ends with a substring in the suffix set. Given any sufficiently long string we then know that it ends with exactly one of the suffixes from the suffix set. If the suffix set furthermore has the property that, given the previous suffix and the new symbol, there is exactly one element (state) from the suffix set that can be (and is) the end of the new longer string, then it is an FSMX source. Another terminology says that the suffix set is FSM closed. The property implies (directly by definition) that there is a map ψ such that ψ(s_{t−1}, y_t) = s_t.
The following proposition shows a very nice connection between ergodic sequences and FSMX sources, which will be generalized in Proposition 25 to more general sources based on bounded-memory FSMs.

Proposition 21 (ergodicity of suffix trees)
If we have a set of maps based on FSMs that can be expressed by suffix trees, and the sequence y_t is sampled ergodically (Definition 20) using one of the maps, then almost surely we get a sequence y_t that is ergodic with respect to the considered class of maps, and y_t is HMM-ergodic.

Lemma 22 If the sequence y_t is ergodic, then the state transition frequencies and emission (of y) frequencies for an FSM closed suffix tree are converging.

Proof.
Let the map Φ be defined by the suffix set in question. Suppose that s' is a suffix that can follow directly after s. This means that there is a symbol y such that if you concatenate it to the end of the string s, then this new string s̃ ends with the string s'. This means that whenever a string of symbols y_{1:n} ends with s̃, the sequence of states generated by applying the map Φ to the sequence y_{1:n} will end with s_{n−1} = s and s_n = s'. It is also true that whenever the state sequence ends with ss', then y_{1:n} ends with s̃. Therefore, the counts (of ss' in the state sequence and s̃ in the y sequence) up until any finite time point are also equal. We will in this proof say that s̃ is the string that corresponds to ss'.
Given any ordered pair of states (s, s') where s' can follow s, let c_{s,s'}(n) be the number of times ss' occurs in the state sequence up to time n, and let d_{s,s'}(n) be the number of times the string s̃ that corresponds to ss' has occurred. We know that c_{s,s'}(n) = d_{s,s'}(n) for any such pair ss' and any n. If s' cannot follow s we let both c_{s,s'} = 0 and d_{s,s'} = 0. The state transition frequency for the transition from s to s' up until time n is
c_{s,s'}(n) / Σ_{s'} c_{s,s'}(n) = d_{s,s'}(n) / Σ_{s'} d_{s,s'}(n) = d_{s,s'}(n) / d_s(n) = (d_{s,s'}(n)/n) · (n/d_s(n)),
where d_s(n) is the number of times that the string that defines s has occurred up until time n in the y sequence. The right hand side converges to the frequency of the string s̃ divided by the frequency of the string that defines s. Thus we have proved that state transition frequencies converge. Emissions work the same way.

Lemma 23
If we sample y_t ergodically from a suffix tree FSM, then the frequency of each finite substring will converge almost surely. In other words, the sequence y_t is almost surely ergodic.

Proof.
If the suffix tree defines an FSM as we have defined it above, the states of the suffix tree will form an ergodic Markov chain. An ergodic Markov chain is stationary. For any state and finite string of perceptions there is a certain fixed probability of drawing the string in question. The frequency of the string str is Σ_s Pr(s) Pr(str | s), where Pr(s) is the stationary probability of seeing s and Pr(str | s) is the probability of directly seeing exactly str conditioned on being in state s. It follows from the law of large numbers that the frequency of any finite string str converges.
Another way of understanding this result is that it is implied by the convergence of the frequency of any finite string of states in the state sequence.

Proof of Proposition 21.
Lemma 22 and Lemma 23 together imply the proposition, since they say that if we sample from a suffix tree then we almost surely get converging frequencies for all finite substrings, and this implies converging transition frequencies for the states from any suffix tree.
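A sketch of a suffix-set state map. The suffix set below, {"0", "01", "11"} over the alphabet {0, 1}, is our own FSM-closed example: no suffix ends another, every long enough string ends in exactly one of them, and the next suffix is determined by the previous suffix and the new symbol.

```python
SUFFIXES = ["0", "01", "11"]  # FSM-closed suffix set over {0, 1} (our toy example)

def state_of(history):
    """Map a history string to the unique suffix it ends with."""
    for s in SUFFIXES:
        if history.endswith(s):
            return s
    raise ValueError("history too short to determine a state")

def psi(state, y):
    """FSM-closedness: the next state depends only on (state, y)."""
    return state_of(state + y)

# The state sequence of "10110" computed incrementally agrees with state_of.
h, s = "1", None
for y in "0110":
    h += y
    s = state_of(h)
print(s)  # -> "0"
```

Here `psi` is exactly the update ψ(s_{t−1}, y_t) = s_t from the FSMX definition: e.g. from state "01" reading "1" leads to state "11".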
Bounded-Memory FSMs.
We here notice that the reasons that the suffix tree theory above worked actually relate to a larger class, namely a class of FSMs where the internal state is determined by at most a finite number of previous time steps in the history.
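A bounded-memory state map in this sense can be sketched as a function that looks only at the last κ+1 symbols. The particular map below (counting ones in the window) is our own toy example:

```python
KAPPA = 2  # memory length: the state depends on at most the last KAPPA+1 symbols

def state(history):
    """State determined by the last KAPPA+1 perceptions only
    (toy example: the number of 1s in that window)."""
    window = history[-(KAPPA + 1):]
    return sum(window)

# Two histories with the same last KAPPA+1 symbols yield the same state.
print(state([0, 0, 0, 1, 1, 0]) == state([1, 1, 1, 1, 1, 0]))  # -> True
```

Any such map embeds into the full-depth suffix tree of depth κ+1, which is the observation used in the proof of Proposition 25.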
Definition 24 (bounded memory FSM)
Suppose that there is a constant κ such that if we know the last κ+1 perceptions y_{t−κ}, ..., y_t, then the present state s_t is uniquely determined. Then we say that the FSM has memory of at most length κ (not counting the current perception) and that it has bounded memory.

Proposition 25 (ergodicity of FSMs)
1. Consider a sequence y_t whose finite substring frequencies converge (i.e. the sequence is ergodic) and an FSM of bounded memory; then the sequence is ergodic with respect to the map defined by the FSM.
2. If we sample a sequence y_t ergodically from an FSM with bounded memory, then almost surely y_t is HMM-ergodic and its finite substring frequencies converge.

Proof.
The proof works the same way as for suffix tree FSMs. If an FSM has finite memory of length κ then there is a suffix tree of that depth with every suffix of full length, and every state of the FSM is a subset of the states of that suffix tree. The FSM is a partition of the suffix set into disjoint subsets. Every state transition for the FSM is exactly one of a set of state transitions for the suffix tree states, and the frequency of every ordered pair of suffix tree states converges almost surely as before. Therefore, the state transition frequencies for the FSM will almost surely converge. A distribution that is defined using an FSM of bounded memory can also be defined using a suffix tree, so 2. reduces to this case.

In this section we summarize our results in a main theorem. It follows directly from a combination of results in previous sections. They are stated with respect to our main class of maps, namely the class that is defined by bounded-memory FSMs. The generating models that we consider are models that are defined from a map in this class in such a way that the states form an ergodic Markov chain. We refer to this as sampling ergodically from the FSM. Our conclusion is that we will under these circumstances eventually only choose between maps which generate the best possible HMM parameters that can be achieved for the purpose of long-term sequence prediction. The model penalty term will influence the choice between these options towards simpler models.
The following theorem guarantees that we will almost surely asymptotically find a correct HMM for the sequence of interest, under the assumption that it is possible.

Theorem 26
If we consider a finite class of maps Φ_i, i = 0, 1, ..., k, based on finite state machines of bounded memory, and if we sample ergodically from a finite state machine of bounded memory, then there almost surely exist limiting parameters θ_i for all i, and there is N < ∞ such that for all n ≥ N the map Φ_i selected at time n by minimizing Cost generates parameters whose limit is θ_0, which is assumed to be the optimal HMM parameters.

Proof.
We are going to make use of Proposition 25 together with Theorem 15. Proposition 25 shows that our assumptions imply the assumptions of Theorem 15, which provides our conclusion.
Extension to countable classes.
To extend our results from finite to countable classes of maps, we need the model complexity penalty to be sufficiently rapidly growing in n and m. This is also necessary if we want to be sure that we eventually find a minimal representation of the optimal model that can be achieved by the class of maps.

Proposition 27 (Consistency for countable class)
Suppose that we have a countable class of maps Φ_i, i = 0, 1, ..., and
1. Suppose that our class is such that for every finite k, there are at most finitely many maps with at most k states.
2. Suppose that θ_0 is an optimal HMM for the sequence y_t, that it has m_0 states, and that θ_0 is the limit of the parameters generated by Φ_0. Furthermore, suppose that there is a finite N such that whenever n > N, m̃ > m_0 and θ̃ is any HMM with m̃ states, we have pen(n, m̃) − log P_θ̃(y_1, ..., y_n) > pen(n, m_0) − log P_{θ̂_n}(y_1, ..., y_n).
Then the conclusion of Theorem 26 holds.

The idea of the proof is to reduce the countable case to the finite case that we have already proven, by using that when n > N we will never pick a Φ with more than m_0 states, and then use the first property to say that the remaining class is finite. This reduction also shows that we will eventually not pick a map with more states than m_0.
The first property in the proposition above holds for the class of suffix trees and for the class based on FSMs with bounded memory. The second property, but with the HMM maximum likelihood parameters θ(n) with m_0 states (while we have ML for a sequence of states and observations), will almost surely hold if the penalty is such that we have strong consistency for the HMM criterion θ* = arg max_θ log P_θ(y_1, ..., y_n) − pen(n, m). This is studied in many articles, e.g. [GB03], where strong consistency is proven for a penalty of the form β(m) log n with β a cubic polynomial. Note that in the case without side information (if our map has the properties that Φ(y_{1:n}) determines y_n, and that Φ(y_{1:n−1}) and y_n determine Φ(y_{1:n})), the emissions are deterministic and the state sequence generated by any map is determined by the y sequence. This puts us in a simpler situation akin to the Markov order estimation problem [FLN96, CS00], where it is studied which penalties (e.g. BIC) will give us property 2. above.

Conjecture 28 We almost surely have Property 2. from Proposition 27 for the BIC penalty studied in [CS00].
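The reduction argument can be sketched as a search that never needs to look beyond a finite prefix of the class, assuming the maps are enumerated with non-decreasing state counts (all names and the penalty form are our own toy choices):

```python
import math

def select_map(maps, n, neg_log_prob):
    """maps: list of (name, num_states), sorted by num_states.
    neg_log_prob(name): data coding cost of that map's ML HMM at time n.
    Stops once even a zero coding cost could not beat the current best."""
    pen = lambda S: S * math.log(n)          # toy penalty with pen(n, S)/n -> 0
    best, best_cost = None, float('inf')
    for name, S in maps:
        if pen(S) >= best_cost:              # all later maps have >= this penalty
            break
        c = neg_log_prob(name) + pen(S)
        if c < best_cost:
            best, best_cost = name, c
    return best

maps = [('phi0', 1), ('phi1', 2), ('phi2', 4), ('phi3', 64)]
nlp = {'phi0': 900.0, 'phi1': 700.0, 'phi2': 699.0, 'phi3': 698.0}.get
print(select_map(maps, n=1000, neg_log_prob=nlp))
```

Since the penalty is increasing in the number of states, once it alone exceeds the best cost so far, no larger map can win; only a finite prefix of the countable class is ever in play.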
In this section we broaden our problem to the setting where we have side information available to help in our prediction task. In our problem setting we have two finite sets X and Y, a sequence p_n = (x_n, y_n) of elements from X × Y, and we are interested in predicting the future of the sequence y_n. To do this we first want to learn a feature map Φ(p_{1:n}) = s_n. In other words, we want our current state to summarize all useful information from both the x and y sequences for the purpose of predicting the future of y only.
One obvious approach is to predict the future of the entire sequence p, i.e. predicting both x and y, and then in the end only notice what we find out about y. This brings us back to the case we have studied already, since from this point of view there is no side information. A drawback with that approach can be that we create an unnecessarily complicated state representation, since we are really only interested in predicting the y sequence.
In the case when there is no side information, s_t = Φ(y_{1:t}). An important difference in the case with side information is that the sequence s_t depends on both y_{1:t} and x_{1:t}. Therefore, for the latter case, if we would like to consider a distribution for y only, y_1, ..., y_n does not determine the state sequence s_1, ..., s_n:
Pr(y_1, ..., y_n | θ̂_n) = Σ_{s_{1:n}, x_{1:n}} Pr(s_1, ..., s_n) Pr(x_1, ..., x_n, y_1, ..., y_n | s_1, ..., s_n, θ̂_n).
This expression is of course also true in the absence of side information x, but then the sum collapses to one term, since there is only one sequence of states s_{1:n} that is compatible with y_{1:n}.
An alternative to using the Cost criterion on the p sequence is to only model the y sequence and let
L_n(Φ) = −log Pr(y_1, ..., y_n | θ̂_n)
and then define Cost in exactly the same way as before. This cost function was called ICost in [Hut09].
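The marginal likelihood Pr(y_1, ..., y_n | θ̂_n) that ICost needs can be computed without summing over all state sequences, e.g. via the standard forward recursion for HMMs; a minimal sketch with made-up two-state parameters (all names ours):

```python
import math

def forward_log_likelihood(ys, start, T, E):
    """log Pr(y_1..y_n) for an HMM via the forward recursion,
    marginalizing over the hidden state sequence."""
    alpha = {s: start[s] * E[s][ys[0]] for s in start}      # alpha_1(s)
    log_like = 0.0
    for y in ys[1:]:
        alpha = {s2: sum(alpha[s1] * T[s1][s2] for s1 in alpha) * E[s2][y]
                 for s2 in alpha}
        z = sum(alpha.values())                              # rescale for stability
        log_like += math.log(z)
        alpha = {s: a / z for s, a in alpha.items()}
    return log_like + math.log(sum(alpha.values()))

# Hypothetical two-state HMM over outcomes {'a', 'b'}.
start = {0: 0.5, 1: 0.5}
T = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
E = {0: {'a': 0.8, 'b': 0.2}, 1: {'a': 0.2, 'b': 0.8}}
print(forward_log_likelihood(['a', 'a', 'b'], start, T, E))
```

Without side information there is a single compatible state sequence and the sum collapses, as noted above; with side information the forward recursion performs the marginalization over hidden states in linear time.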
Theorem 29 Theorem 26 is true for sequence prediction with side information using
ICost_n(Φ_i) = −log Pr(y_1, ..., y_n | θ̂_n) + pen(n, S),
if we define "sample ergodically" to refer to the sequence p_t = (x_t, y_t) instead of y_t.

Proof. The proofs work exactly as they are written for the case without side information.

Note that a map that is optimal for predicting the y sequence can have fewer states than a minimal map that can generate the model of the p sequence.
It is interesting to note that the interpretation of this result is not as clear as in the case without side information. It guarantees that, given enough history, the chosen Φ can and will (with the asymptotic parameters) define the correct model for the y_t sequence, but the x_t sequence has only played a part in the estimation, and we are not guaranteed that we will make use of the extra information if it does not impact the entropy rate. In particular this is true if the information in x_t is only helpful for a finite number of time steps forward: in this case that gain will not affect the entropy rate, which is a limit of averages. We have a more conclusive result for the case with side information when we use the first mentioned approach of applying Cost to the sequence p, since we proved consistency in the previous section in the sense of finding the true model when possible.
If we have injective maps Φ, e.g. maps defined by non-empty suffix trees, then we can rewrite Cost in a form that was used in [Hut09] also more generally. Therein a cost called original cost was defined as follows:

Definition 30 (OCost)
OCost = −log Pr(s_1, ..., s_n) − log Pr(y_1, ..., y_n | s_1, ..., s_n, θ̂_n) + pen(n, S).

Remark 31 If Φ_i is injective and we calculate Cost in the side information case, then Cost = OCost. ⋄

If we have no side information, both OCost and ICost will be the same as Cost, but they may differ when there is side information available.
We remarked above that if we consider only injective Φ (e.g. non-empty suffix tree based maps), then OCost equals using Cost on the joint sequence p_t = (x_t, y_t). As noted in [Hut09], OCost penalizes having many states more than ICost does, and when considering non-injective Φ one risks getting a smaller than desired state space.

In this very brief section we discuss how to map the active case to the previously introduced notions. The active case will be treated in depth in future articles. In the active case [RN10, SB98] we have an agent that interacts with an environment. The agent perceives observations o_t and real-valued rewards r_t, and takes actions a_t from a finite set of possible actions A, with the goal of receiving high total reward in some sense. We denote the events that have just occurred when the agent takes an action at time step t, i.e. a_t, o_t and r_t, by e_t. We consider maps based on FSMs (PDFAs) that take event sequences e_{1:t} as input. In the previous section's notation, x_t = (o_t, a_t), y_t = r_t and p_t = e_t. We chose this since we are interested in predicting which future rewards will result from actions chosen with the help of the observations. This gives us the possibility of determining which actions will earn the highest rewards.

At time t − 1, e_1, ..., e_{t−1} determines s_{t−1}; the agent takes an action a_{t−1}, and o_t and r_t are generated according to distributions that depend only on s_{t−1} and a_{t−1}. Then we have generated e_t and s_t = ψ(s_{t−1}, e_t).

Definition 32. The above describes what we mean when we say that the FSM generates the environment. We say that the FSM generates the environment ergodically if, for any sequence of actions chosen such that the action frequencies for every state converge asymptotically, the state transition and emission frequencies converge almost surely to those of an ergodic HMM.
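The event bookkeeping of the active case can be sketched as a short simulation loop. The two-state environment, the update ψ and the reward rule below are invented purely for illustration; the point is the interface: each step bundles a_t, o_t, r_t into an event e_t, and the state advances deterministically via s_t = ψ(s_{t−1}, e_t).

```python
import random

def psi(s, e):
    """Deterministic FSM update s_t = psi(s_{t-1}, e_t). Assumed here:
    the next state is just the last observation."""
    a, o, r = e
    return o

def environment_step(s, a, rng):
    """Toy emission distributions depending on (state, action)."""
    p_o1 = 0.8 if (s, a) == (1, 1) else 0.3
    o = 1 if rng.random() < p_o1 else 0
    r = 1.0 if o == a else 0.0  # reward for "predicting" the observation
    return o, r

def rollout(policy, steps, seed=0):
    """Run the agent-environment loop, collecting events e_t = (a_t, o_t, r_t)."""
    rng, s, history = random.Random(seed), 0, []
    for t in range(steps):
        a = policy(s)                 # action chosen from the current state
        o, r = environment_step(s, a, rng)
        e = (a, o, r)
        s = psi(s, e)                 # deterministic FSM state update
        history.append(e)
    return history

print(rollout(policy=lambda s: s, steps=5))
```

In the paper's notation the history of events is exactly the sequence p_t = e_t fed to the feature map, with x_t = (o_t, a_t) and y_t = r_t.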
Proposition 33. Suppose that we have an FSM of bounded memory generating the environment ergodically and that the action frequencies for every state converge asymptotically; then we will almost surely generate an ergodic sequence of events, and the reward sequence is HMM-ergodic.

Proof. The situation reduces through Definition 32 to that of Proposition 25.

Theorem 34. If we consider a finite class of maps Φ_i, i = 0, 1, ..., k, based on finite state machines of bounded memory, if the environment is generated ergodically by a finite state machine of bounded memory, and if the action frequencies for every internal state of the generating finite state machine converge, then there almost surely exist limiting state transition parameters θ_i for all i, and there is N < ∞ such that for all n ≥ N the map Φ_i selected by minimizing ICost at time n generates parameters whose limit is θ, the parameters of the optimal HMM.

Proof. We combine Proposition 33 with Theorem 29.

How to choose the actions so that the implications for reinforcement learning are what we want them to be is the subject of ongoing work [Hut09].

Feature Markov Decision Processes were introduced in [Hut09] as a framework for creating generic reinforcement learning agents that can learn to perform well in a large variety of complex environments. The framework was introduced as a concept without theory or empirical studies; first empirical results are reported in [Mah10]. Here we provide a consistency theory by focusing on the sequence prediction case with and without side information. We also briefly discuss the active case, where an agent takes actions that may affect the environment. The active case and empirical studies are the subject of ongoing and future work.

Acknowledgement. We thank the reviewers for their meticulous reading and valuable feedback, and the Australian Research Council for support under grant DP0988049.

References

[BP66] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains.
The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.

[CMR05] Olivier Cappé, Eric Moulines, and Tobias Rydén. Inference in Hidden Markov Models (Springer Series in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.

[CS00] Imre Csiszár and Paul C. Shields. The consistency of the BIC Markov order estimator. The Annals of Statistics, 28(6):1601–1619, 2000.

[EM02] Y. Ephraim and N. Merhav. Hidden Markov processes. IEEE Transactions on Information Theory, 48(6):1518–1569, 2002.

[FLN96] L. Finesso, C. Liu, and P. Narayan. The optimal error exponent for Markov order estimation. IEEE Transactions on Information Theory, 42:1488–1497, 1996.

[GB03] Elisabeth Gassiat and Stéphane Boucheron. Optimal error exponents in hidden Markov models order estimation. IEEE Transactions on Information Theory, 49(4):964–980, 2003.

[Hut09] Marcus Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3–24, 2009.

[Mah10] M. M. Mahmud. Constructing states for reinforcement learning. In The 27th International Conference on Machine Learning (ICML'10), 2010.

[McC96] Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, The University of Rochester, 1996.

[Pet69] T. Petrie. Probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 40(1):97–115, 1969.

[Ris83] Jorma Rissanen. A universal data compression system. IEEE Transactions on Information Theory, 29(5):656–663, 1983.

[Ris86] Jorma Rissanen. Complexity of strings in the class of Markov sources. IEEE Transactions on Information Theory, 32(4):526–532, 1986.

[RN10] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2010.

[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, 1998.

[Sin96] Yoram Singer. Adaptive mixtures of probabilistic transducers.
Neural Computation, 9:1711–1733, 1996.

[VTdlH+05] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. C. Carrasco. Probabilistic finite-state machines – Part I. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1013–1025, 2005.