Gusztáv MORVAI and Benjamin WEISS
Prediction for Discrete Time Series
Probab. Theory Related Fields 132 (2005), no. 1, 1–12.
Abstract
Let $\{X_n\}$ be a stationary and ergodic time series taking values from a finite or countably infinite set $\mathcal X$. Assume that the distribution of the process is otherwise unknown. We propose a sequence of stopping times $\lambda_n$ along which we will be able to estimate the conditional probability $P(X_{\lambda_n+1} = x \mid X_0, \dots, X_{\lambda_n})$ from the data segment $(X_0, \dots, X_{\lambda_n})$ in a pointwise consistent way, for a restricted class of stationary and ergodic finite or countably infinite alphabet time series which includes, among others, all stationary and ergodic finitarily Markovian processes. If the stationary and ergodic process turns out to be finitarily Markovian (among others, all stationary and ergodic Markov chains are included in this class), then $\lim_{n\to\infty} \frac{n}{\lambda_n} > 0$ almost surely and $\lambda_n$ is upper bounded by a polynomial, eventually almost surely.

Keywords: Nonparametric estimation, stationary processes

Mathematics Subject Classifications (2000)
Introduction
Bailey [1] and Ryabko [14] considered the problem of estimating the conditional probability $P(X_{n+1} = 1 \mid X_0, \dots, X_n)$ for binary time series. They showed that this quantity cannot be estimated from the data $(X_0, \dots, X_n)$ in such a way that the error tends to zero almost surely as $n$ increases, for all stationary and ergodic binary time series.

It is well known that if one knows in advance that the process is Markov with arbitrary (unknown) order, then one can estimate the order (cf. Csiszár and Shields [4], Csiszár [5]) and, using this estimated order, compute empirical averages of blocks of length one plus the order to estimate $P(X_{n+1} = 1 \mid X_0, \dots, X_n)$ in a pointwise consistent way. In the present paper we consider the case when it is not known in advance whether the process is Markov or not.

Morvai [11] exhibited a sequence of stopping times $\eta_n$ such that $P(X_{\eta_n+1} = 1 \mid X_0, \dots, X_{\eta_n})$ can be estimated from the data segment $(X_0, \dots, X_{\eta_n})$ in a pointwise consistent way, that is, the error vanishes as $n$ increases. The disadvantage of that scheme is that the stopping times grow very fast. Another, more reasonable scheme was proposed by Morvai and Weiss [12] for a subclass of stationary and ergodic binary time series. There the stopping times still grow exponentially, though not as fast as in Morvai [11].

Bailey [1] proved that there is no test for the Markov property, that is, no algorithm which could tell you eventually whether or not the process is Markov of some order, over all stationary and ergodic binary time series.

In this paper discrete (finite or countably infinite) alphabet stationary and ergodic processes are treated. We propose a much denser (compared to Morvai and Weiss [12]) sequence of stopping times $\lambda_n$ along which we will be able to estimate $P(X_{\lambda_n+1} = x \mid X_0, \dots, X_{\lambda_n})$ from the samples $(X_0, \dots, X_{\lambda_n})$ in a pointwise consistent way, for those processes whose conditional distribution is almost surely continuous (see the precise definition below). This class includes all Markov processes with arbitrary order and the much wider class of finitarily Markovian processes. Despite Bailey's result, for the proposed stopping times $\lambda_n$, if the stationary and ergodic process turns out to be finitarily Markovian (a class which includes all stationary and ergodic Markov chains with arbitrary order), then $\lim_{n\to\infty} \frac{n}{\lambda_n} > 0$ and $\lambda_n$ is upper bounded by a polynomial, eventually almost surely.

1 The Proposed Algorithm
Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary and ergodic time series taking values from a discrete (finite or countably infinite) alphabet $\mathcal X$. (Note that every stationary time series $\{X_n\}_{n=0}^{\infty}$ can be thought of as a two-sided time series $\{X_n\}_{n=-\infty}^{\infty}$.) For notational convenience, let $X_m^n = (X_m, \dots, X_n)$, where $m \le n$. Note that if $m > n$ then $X_m^n$ is the empty string.
For $k \ge 1$, let $1 \le l_k \le k$ be a nondecreasing unbounded sequence of integers, that is, $1 = l_1 \le l_2 \le \dots$ and $\lim_{k\to\infty} l_k = \infty$. Define auxiliary stopping times (similarly to Morvai and Weiss [12]) as follows. Set $\zeta_0 = 0$, and for $n = 1, 2, \dots$ let
$$\zeta_n = \zeta_{n-1} + \min\{ t > 0 : X^{\zeta_{n-1}+t}_{\zeta_{n-1}+t-(l_n-1)} = X^{\zeta_{n-1}}_{\zeta_{n-1}-(l_n-1)} \}. \tag{1}$$
Among other things, using $\zeta_n$ and $l_n$ we can define a very useful process $\{\tilde X_n\}_{n=-\infty}^{0}$ as a function of $X^{\infty}_{-\infty}$ as follows. Let $J(n) = \min\{ j \ge 1 : l_{j+1} > n \}$ and define
$$\tilde X_{-i} = X_{\zeta_{J(i)}-i} \quad \text{for } i \ge 0. \tag{2}$$
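The recursion (1) and the construction (2) are purely combinatorial and can be simulated directly. The following Python sketch is our own illustration (the function names and the list-based encoding of $l_k$ are ours, not the paper's); it assumes the one-sided sample path is long enough for every search to terminate, which holds almost surely by ergodicity.

```python
def zeta_times(x, l, n_steps):
    """Auxiliary stopping times zeta_0, ..., zeta_{n_steps} of eq. (1).

    x is a (sufficiently long) sample path x[0], x[1], ...; l[k] holds l_k
    for k >= 1, with l[1] = 1, l nondecreasing and l[k] <= k, so that
    zeta_{n-1} >= n - 1 >= l_n - 1 keeps every index below nonnegative.
    """
    zetas = [0]
    for n in range(1, n_steps + 1):
        w = l[n]                          # window length l_n
        z = zetas[-1]
        pattern = x[z - (w - 1): z + 1]   # the last l_n symbols seen at zeta_{n-1}
        t = 1
        while x[z + t - (w - 1): z + t + 1] != pattern:
            t += 1                        # wait until the window recurs
        zetas.append(z + t)
    return zetas


def tilde_x(x, l, m):
    """(X~_{-m}, ..., X~_0) of eq. (2): X~_{-i} = X_{zeta_{J(i)} - i},
    where J(i) = min{ j >= 1 : l_{j+1} > i }; l must be given up to J(m)+1."""
    J = 1
    while l[J + 1] <= m:                  # J = J(m) suffices for every i <= m
        J += 1
    zetas = zeta_times(x, l, J)
    values = []
    for i in range(m, -1, -1):
        j = 1
        while l[j + 1] <= i:              # j = J(i)
            j += 1
        values.append(x[zetas[j] - i])
    return values                         # listed from X~_{-m} up to X~_0
```

For instance, the list `l = [None, 1, 1, 2, 2, 3, 3]` encodes the admissible choice $l_k = \lceil k/2 \rceil$ for $k \le 6$.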
As we will see in the proof of the Theorem, $\{\tilde X_n\}_{n=-\infty}^{0}$ has the same distribution as the original process. For notational convenience let $p_k(x^0_{-k})$ and $p_k(y \mid x^0_{-k})$ denote the distribution $P(X^0_{-k} = x^0_{-k})$ and the conditional distribution $P(X_1 = y \mid X^0_{-k} = x^0_{-k})$, respectively.

Definition 1. For a stationary time series $\{X_n\}$ the (random) length $K(X^0_{-\infty})$ of the memory of the sample path $X^0_{-\infty}$ is the smallest possible $0 \le K < \infty$ such that for all $i \ge 1$, all $y \in \mathcal X$ and all $z^{-K}_{-K-i+1} \in \mathcal X^i$,
$$p_{K-1}( y \mid X^0_{-K+1} ) = p_{K+i-1}( y \mid z^{-K}_{-K-i+1}, X^0_{-K+1} )$$
provided $p_{K+i}( z^{-K}_{-K-i+1}, X^0_{-K+1}, y ) > 0$; and $K(X^0_{-\infty}) = \infty$ if there is no such $K$.
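To fix ideas (our illustration, not part of the original text): for a stationary and ergodic Markov chain of order $m$ one has, for every sample path, every $i \ge 1$ and every $y \in \mathcal X$,
$$p_{m+i-1}( y \mid z^{-m}_{-m-i+1}, X^0_{-m+1} ) = p_{m-1}( y \mid X^0_{-m+1} )$$
whenever the longer block has positive probability, so $K(X^0_{-\infty}) \le m$; in particular, an independent and identically distributed process has $K(X^0_{-\infty}) = 0$.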
Definition 2. The stationary time series $\{X_n\}$ is said to be finitarily Markovian if $K(X^0_{-\infty})$ is finite (though not necessarily bounded) almost surely.

In order to estimate $K(\tilde X^0_{-\infty})$ we need to define some explicit statistics. Define
$$\Delta_k(\tilde X^0_{-k+1}) = \sup_{i \ge 1}\ \sup_{\{ z^{-k}_{-k-i+1} \in \mathcal X^i,\, x \in \mathcal X \,:\, p_{k+i}(z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x) > 0 \}} \Big| p_{k-1}( x \mid \tilde X^0_{-k+1} ) - p_{k+i-1}( x \mid z^{-k}_{-k-i+1}, \tilde X^0_{-k+1} ) \Big|.$$

We will divide the data segment $X_0^n$ into two parts: $X_0^{\lceil n/2 \rceil - 1}$ and $X^n_{\lceil n/2 \rceil}$. Let $\mathcal L^{(1)}_{n,k}$ denote the set of strings of length $k+1$ which appear at all in $X_0^{\lceil n/2 \rceil - 1}$, that is,
$$\mathcal L^{(1)}_{n,k} = \{ x^0_{-k} \in \mathcal X^{k+1} : X^t_{t-k} = x^0_{-k} \ \text{for some } k \le t \le \lceil n/2 \rceil - 1 \}.$$
For a fixed $0 < \gamma < 1$ let $\mathcal L^{(2)}_{n,k}$ denote the set of strings of length $k+1$ which appear more than $n^{1-\gamma}$ times in $X^n_{\lceil n/2 \rceil}$, that is,
$$\mathcal L^{(2)}_{n,k} = \{ x^0_{-k} \in \mathcal X^{k+1} : |\{ \lceil n/2 \rceil + k \le t \le n : X^t_{t-k} = x^0_{-k} \}| > n^{1-\gamma} \}.$$
Let $\mathcal L^n_k = \mathcal L^{(1)}_{n,k} \cap \mathcal L^{(2)}_{n,k}$.

We define the empirical version of $\Delta_k$ as follows. For strings $y^0_{-k+1} \in \mathcal X^k$, $z^{-k}_{-k-i+1} \in \mathcal X^i$ and a letter $x \in \mathcal X$, write
$$R^n_{k,i}( z^{-k}_{-k-i+1}, y^0_{-k+1}, x ) = \frac{ |\{ \lceil n/2 \rceil + k \le t \le n : X^t_{t-k} = (y^0_{-k+1}, x) \}| }{ |\{ \lceil n/2 \rceil + k - 1 \le t \le n-1 : X^t_{t-k+1} = y^0_{-k+1} \}| } - \frac{ |\{ \lceil n/2 \rceil + k + i \le t \le n : X^t_{t-k-i} = (z^{-k}_{-k-i+1}, y^0_{-k+1}, x) \}| }{ |\{ \lceil n/2 \rceil + k + i - 1 \le t \le n-1 : X^t_{t-k-i+1} = (z^{-k}_{-k-i+1}, y^0_{-k+1}) \}| }$$
and set
$$\hat\Delta^n_k(\tilde X^0_{-k+1}) = \max_{1 \le i \le n}\ \max_{ (z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x) \in \mathcal L^n_{k+i} } 1_{\{ \zeta_{J(k)} \le \lceil n/2 \rceil - 1 \}} \big| R^n_{k,i}( z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x ) \big|.$$
Note that the cutoff $1_{\{\zeta_{J(k)} \le \lceil n/2 \rceil - 1\}}$ ensures that $\tilde X^0_{-k+1}$ is determined by $X_0^{\lceil n/2 \rceil - 1}$. Observe that, by ergodicity, for any fixed $k$,
$$\liminf_{n\to\infty} \hat\Delta^n_k \ge \Delta_k \quad \text{almost surely.} \tag{3}$$

We define an estimate $\chi_n$ of $K(\tilde X^0_{-\infty})$ from the samples $X_0^n$ as follows. Let $0 < \beta < (1-\gamma)/2$ be arbitrary. Set $\chi_0 = 0$ and, for $n \ge 1$, let $\chi_n$ be the smallest $0 \le k < n$ such that $\hat\Delta^n_k \le n^{-\beta}$. Observe that if $\zeta_j \le \lceil n/2 \rceil - 1 < \zeta_{j+1}$ then $\chi_n \le l_{j+1}$. The idea (cf. the proof of the Theorem) is that if $K(\tilde X^0_{-\infty}) < \infty$ then $\chi_n$ will equal $K(\tilde X^0_{-\infty})$ eventually, while if $K(\tilde X^0_{-\infty}) = \infty$ then $\chi_n \to \infty$.

Now we define the sequence of stopping times $\lambda_n$ along which we will be able to estimate. Set $\lambda_0 = \zeta_1$, and for $n \ge 1$, if $\zeta_j \le \lambda_{n-1} < \zeta_{j+1}$ then put
$$\lambda_n = \min\{ t > \lambda_{n-1} : X^t_{t - \chi_t + 1} = X^{\zeta_j}_{\zeta_j - \chi_t + 1} \} \tag{4}$$
and
$$\kappa_n = \chi_{\lambda_n}. \tag{5}$$
Observe that if $\zeta_j \le \lambda_{n-1} < \zeta_{j+1}$ then $\zeta_j \le \lambda_{n-1} < \lambda_n \le \zeta_{j+1}$. If $\chi_{\lambda_{n-1}+1} = 0$ then $\lambda_n = \lambda_{n-1} + 1$. Note that $\lambda_n$ is a stopping time and $\kappa_n$ is our estimate of $K(\tilde X^0_{-\infty})$ from the samples $X_0^{\lambda_n}$.

Let $\mathcal X^{*-}$ be the set of all one-sided sequences, that is,
$$\mathcal X^{*-} = \{ (\dots, x_{-1}, x_0) : x_i \in \mathcal X \ \text{for all } -\infty < i \le 0 \}.$$
Let $f : \mathcal X \to (-\infty, \infty)$ be bounded but otherwise arbitrary. Define the function $F : \mathcal X^{*-} \to (-\infty, \infty)$ as
$$F( x^0_{-\infty} ) = E( f(X_1) \mid X^0_{-\infty} = x^0_{-\infty} ).$$
E.g., if $f(x) = 1_{\{x = z\}}$ for a fixed $z \in \mathcal X$ then $F(y^0_{-\infty}) = P( X_1 = z \mid X^0_{-\infty} = y^0_{-\infty} )$; if $\mathcal X$ is a finite or countably infinite subset of the reals and $f(x) = x$ then $F(y^0_{-\infty}) = E( X_1 \mid X^0_{-\infty} = y^0_{-\infty} )$.

One denotes the $n$th estimate of $E( f(X_{\lambda_n+1}) \mid X_0^{\lambda_n} )$ from the samples $X_0^{\lambda_n}$ by $f_n$, and defines it to be
$$f_n = \frac{1}{n} \sum_{j=0}^{n-1} f( X_{\lambda_j + 1} ). \tag{6}$$

Define the distance $d^*(\cdot,\cdot)$ on $\mathcal X^{*-}$ as follows. For $x^0_{-\infty}, y^0_{-\infty} \in \mathcal X^{*-}$ let
$$d^*( x^0_{-\infty}, y^0_{-\infty} ) = \sum_{i=0}^{\infty} 2^{-i-1}\, 1_{\{ x_{-i} \ne y_{-i} \}}. \tag{7}$$
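To make the interplay of (4), (5) and (6) concrete, here is a minimal Python sketch of the whole sampling-and-averaging loop, reusing `zeta_times` from the sketch above. The memory estimate is abstracted as a callable `chi(x, t)` returning $\chi_t$ from $X_0^t$ (a hypothetical stand-in for the $\hat\Delta$-based rule; implementing $\hat\Delta^n_k$ itself is a straightforward, if tedious, counting exercise). This is illustrative code under our own naming, not the authors' implementation.

```python
def predict(x, l, f, chi, n_est):
    """f_1, ..., f_{n_est} of eq. (6) along the stopping times of eq. (4).

    chi(x, t) is assumed to return the memory estimate chi_t computed
    from x[0..t]; f is any bounded function on the alphabet.
    """
    zetas = zeta_times(x, l, n_est + 1)   # lambda_n <= zeta_{n+1}, so this suffices
    lambdas = [zetas[1]]                  # lambda_0 = zeta_1
    for n in range(1, n_est + 1):
        j = max(i for i in range(len(zetas)) if zetas[i] <= lambdas[-1])
        t = lambdas[-1] + 1
        while True:
            k = chi(x, t)                 # current memory estimate chi_t
            # match the last chi_t symbols at t against those at zeta_j;
            # with chi_t = 0 this succeeds at once, so lambda_n = lambda_{n-1} + 1
            # (for simplicity we assume zetas[j] - k + 1 >= 0)
            if x[t - k + 1: t + 1] == x[zetas[j] - k + 1: zetas[j] + 1]:
                break
            t += 1
        lambdas.append(t)                 # lambda_n; kappa_n of (5) would be chi(x, t)
    return [sum(f(x[lambdas[i] + 1]) for i in range(n)) / n
            for n in range(1, n_est + 1)]
```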
Definition 3. We say that $F(X^0_{-\infty})$ is almost surely continuous if for some set $C \subseteq \mathcal X^{*-}$ which has probability one, the function $F$ restricted to this set $C$ is continuous with respect to the metric $d^*(\cdot,\cdot)$. (Cf. Morvai and Weiss [12].)

The class of processes with almost surely continuous conditional expectation generalizes the class of processes for which it is actually continuous, cf. Kalikow [9] and Keane [10]. The stationary finitarily Markovian processes are included in the class of stationary processes with almost surely continuous $E( f(X_1) \mid X^0_{-\infty} )$ for arbitrary bounded $f(\cdot)$. Note that Ryabko [14] and Györfi, Morvai, Yakowitz [7] showed that one cannot estimate $P(X_{n+1} = 1 \mid X_0^n)$ for all $n$ in a pointwise consistent way, even for the class of all stationary and ergodic binary finitarily Markovian time series.

The entropy rate $H$ associated with a stationary finite or countably infinite alphabet time series $\{X_n\}$ is defined as
$$H = \lim_{n\to\infty} -\frac{1}{n+1} \sum_{x^0_{-n} \in \mathcal X^{n+1}} p_n( x^0_{-n} ) \log p_n( x^0_{-n} ).$$
We note that the entropy rate of a stationary finite alphabet time series is finite. For details cf. Cover and Thomas [3], pp. 63-64.

2 The Main Result

Fix positive real numbers $0 < \beta, \gamma < 1$ with $2\beta + \gamma < 1$, fix a sequence $l_n$ such that $1 = l_1 \le l_2 \le \dots$ and $l_n \to \infty$, and fix a bounded function $f(\cdot) : \mathcal X \to (-\infty, \infty)$; with these numbers, this sequence and this function define $\zeta_n$, $\chi_n$, $\kappa_n$, $\lambda_n$ and $F(\cdot)$ as described in the previous section. For the resulting $f_n$ we have the following theorem:
THEOREM. Let $\{X_n\}$ be a stationary and ergodic time series taking values from a finite or countably infinite set $\mathcal X$. If the conditional expectation $F(X^0_{-\infty})$ is almost surely continuous, then almost surely
$$\lim_{n\to\infty} f_n = F( \tilde X^0_{-\infty} ) \quad\text{and}\quad \lim_{n\to\infty} \big| f_n - E( f(X_{\lambda_n+1}) \mid X_0^{\lambda_n} ) \big| = 0.$$
The $l_n$ may be chosen in such a fashion that whenever the stationary and ergodic time series $\{X_n\}$ has finite entropy rate, the $\lambda_n$ grow no faster than a polynomial in $n$. If the stationary and ergodic time series $\{X_n\}$ turns out to be finitarily Markovian, then
$$\lim_{n\to\infty} \frac{\lambda_n}{n} = \frac{1}{ p_{K(\tilde X^0_{-\infty})-1}( \tilde X^0_{-K(\tilde X^0_{-\infty})+1} ) } < \infty \quad \text{almost surely.}$$
Moreover, if the stationary and ergodic time series $\{X_n\}$ turns out to be independent and identically distributed, then $\lambda_n = \lambda_{n-1} + 1$ eventually almost surely.
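To spell the first claim out in the leading special case (this is merely the Theorem applied to $f = 1_{\{x = z\}}$, as in the example following (6)): for each fixed $z \in \mathcal X$, almost surely
$$\frac{1}{n} \sum_{j=0}^{n-1} 1_{\{ X_{\lambda_j+1} = z \}} \longrightarrow P( X_1 = z \mid X^0_{-\infty} = \tilde X^0_{-\infty} ) \quad\text{and}\quad \Big| f_n - P( X_{\lambda_n+1} = z \mid X_0^{\lambda_n} ) \Big| \longrightarrow 0,$$
that is, the conditional probabilities $P(X_{\lambda_n+1} = z \mid X_0^{\lambda_n})$ are estimated pointwise consistently along the stopping times $\lambda_n$.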
Proof of the Theorem.

Step 1. The time series $\{\tilde X_n\}_{n=-\infty}^{0}$ and $\{X_n\}_{n=-\infty}^{0}$ have identical distribution.

For all $k \ge 1$ and $0 \le i \le k$ define (similarly to Morvai and Weiss [12]) $\hat\zeta^k_0 = 0$ and
$$\hat\zeta^k_i = \hat\zeta^k_{i-1} - \min\{ t > 0 : X^{\hat\zeta^k_{i-1} - t}_{\hat\zeta^k_{i-1} - (l_{k-i+1}-1) - t} = X^{\hat\zeta^k_{i-1}}_{\hat\zeta^k_{i-1} - (l_{k-i+1}-1)} \}.$$
Let $T$ denote the left shift operator, that is, $(T x^{\infty}_{-\infty})_i = x_{i+1}$. It is easy to see that if $\zeta_k(x^{\infty}_{-\infty}) = l$ then $\hat\zeta^k_k(T^l x^{\infty}_{-\infty}) = -l$. Now the statement follows from stationarity and the fact that for $k \ge 1$, $n \ge 0$, $x^0_{-n} \in \mathcal X^{n+1}$ and $l \ge 0$,
$$T^l \{ X^{\zeta_k}_{\zeta_k - n} = x^0_{-n},\ \zeta_k = l \} = \{ X^0_{-n} = x^0_{-n},\ \hat\zeta^k_k( X^0_{-\infty} ) = -l \}. \tag{8}$$
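To spell out how (8) gives Step 1 (a step the text leaves implicit): fix $n \ge 0$ and any $k$ with $l_{k+1} > n$; the recursion (1) forces $X^{\zeta_{j+1}}_{\zeta_{j+1}-(l_{j+1}-1)} = X^{\zeta_j}_{\zeta_j-(l_{j+1}-1)}$, and chaining these identities shows $\tilde X^0_{-n} = X^{\zeta_k}_{\zeta_k - n}$. Hence, summing (8) over $l$ and using stationarity together with the almost sure finiteness of $\zeta_k$ and $\hat\zeta^k_k$,
$$P( \tilde X^0_{-n} = x^0_{-n} ) = \sum_{l \ge 0} P( X^{\zeta_k}_{\zeta_k - n} = x^0_{-n},\ \zeta_k = l ) = \sum_{l \ge 0} P( X^0_{-n} = x^0_{-n},\ \hat\zeta^k_k( X^0_{-\infty} ) = -l ) = P( X^0_{-n} = x^0_{-n} ).$$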
Step 2. We show that
$$P( \chi_n = K(\tilde X^0_{-\infty}) \ \text{eventually} \mid K(\tilde X^0_{-\infty}) < \infty ) = 1 \quad\text{and}\quad P( \lim_{n\to\infty} \chi_n = \infty \mid K(\tilde X^0_{-\infty}) = \infty ) = 1.$$

By Step 1, $\{\tilde X_n\}_{n=-\infty}^{0}$ is stationary and ergodic with the same distribution as $\{X_n\}_{n=-\infty}^{0}$. We may assume that the sample path $\tilde X^0_{-\infty}$ is such that all finite blocks that appear in it have positive probability. It is immediate that if $K(\tilde X^0_{-\infty}) < \infty$ then $\Delta_k = 0$ for all $k \ge K(\tilde X^0_{-\infty})$, while $\Delta_{K(\tilde X^0_{-\infty})-1} > 0$ whenever $K(\tilde X^0_{-\infty}) \ge 1$. If $K(\tilde X^0_{-\infty}) = \infty$ then $\Delta_k > 0$ for every $k$ (otherwise $K(\tilde X^0_{-\infty})$ would be finite). Thus by (3), if $K(\tilde X^0_{-\infty}) = \infty$ then $\chi_n \to \infty$, and if $K(\tilde X^0_{-\infty}) < \infty$ then $\chi_n \ge K(\tilde X^0_{-\infty})$ eventually almost surely. It remains to show that $\chi_n \le K(\tilde X^0_{-\infty})$ eventually almost surely provided that $K(\tilde X^0_{-\infty}) < \infty$.

Fix now $k < n$. We estimate the probability of the undesirable event as follows:
$$P( \hat\Delta^n_k > n^{-\beta},\ K(\tilde X^0_{-\infty}) = k \mid X_0^{\lceil n/2 \rceil} ) \le \sum_{i=1}^{n} P\Big( \max_{ (z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x) \in \mathcal L^n_{k+i} } 1_{\{ \zeta_{J(k)} \le \lceil n/2 \rceil - 1 \}} \big| R^n_{k,i}( z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x ) \big| > n^{-\beta},\ K(\tilde X^0_{-\infty}) = k \,\Big|\, X_0^{\lceil n/2 \rceil} \Big).$$
Define $\mathcal M_{k-1}$ as the set of all $x^0_{-k+1} \in \mathcal X^k$ such that for all $i \ge 1$, $z \in \mathcal X$ and $y^{-k}_{-k-i+1} \in \mathcal X^i$ with $p_{k+i}( y^{-k}_{-k-i+1}, x^0_{-k+1}, z ) > 0$, one has $p_{k-1}( z \mid x^0_{-k+1} ) = p_{k+i-1}( z \mid y^{-k}_{-k-i+1}, x^0_{-k+1} )$. By the definition of $\hat\Delta^n_k$, and since on the event $\{ K(\tilde X^0_{-\infty}) = k \}$ the string $\tilde X^0_{-k+1}$ belongs to $\mathcal M_{k-1}$, the $i$-th summand above is at most
$$P\Big( \max_{ y^0_{-k+1} \in \mathcal M_{k-1},\ (z^{-k}_{-k-i+1}, y^0_{-k+1}, x) \in \mathcal L^n_{k+i} } \big| R^n_{k,i}( z^{-k}_{-k-i+1}, y^0_{-k+1}, x ) \big| > n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil} \Big).$$
For $y^0_{-k+1} \in \mathcal M_{k-1}$, the two ratios in $R^n_{k,i}$ are empirical estimates of one and the same conditional probability $p_{k-1}( x \mid y^0_{-k+1} )$, so this last probability is at most the sum of two terms: the probability that the first ratio deviates from $p_{k-1}( x \mid y^0_{-k+1} )$ by more than $0.5\, n^{-\beta}$ for some admissible string, plus the corresponding probability for the second ratio.

We overestimate these probabilities. For any $m \ge 0$ and $x^0_{-m}$, define $\sigma^m_i( x^0_{-m} )$ as the time of the $i$-th occurrence of the string $x^0_{-m}$ in the data segment $X^n_{\lceil n/2 \rceil}$, that is, let $\sigma^m_0( x^0_{-m} ) = \lceil n/2 \rceil + m - 1$ and for $i \ge 1$
$$\sigma^m_i( x^0_{-m} ) = \min\{ t > \sigma^m_{i-1}( x^0_{-m} ) : X^t_{t-m} = x^0_{-m} \}.$$
Since every string in $\mathcal L^n_{k+i}$ occurs more than $n^{1-\gamma}$ times in $X^n_{\lceil n/2 \rceil}$, the sum of the two terms is at most
$$P\Big( \max_{ y^0_{-k+1} \in \mathcal M_{k-1},\ (y^0_{-k+1}, x) \in \mathcal L^{(1)}_{n,k} }\ \sup_{ j > n^{1-\gamma} } \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma^{k-1}_r( y^0_{-k+1} ) + 1} = x \}} - p_{k-1}( x \mid y^0_{-k+1} ) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil} \Big)$$
$$+\ P\Big( \max_{ y^0_{-k+1} \in \mathcal M_{k-1},\ (z^{-k}_{-k-i+1}, y^0_{-k+1}, x) \in \mathcal L^{(1)}_{n,k+i} }\ \sup_{ j > n^{1-\gamma} } \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma^{k+i-1}_r( z^{-k}_{-k-i+1}, y^0_{-k+1} ) + 1} = x \}} - p_{k-1}( x \mid y^0_{-k+1} ) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil} \Big).$$
Since both $\mathcal L^{(1)}_{n,k}$ and $\mathcal L^{(1)}_{n,k+i}$ depend solely on $X_0^{\lceil n/2 \rceil - 1}$, we may bound each of these by the union bound, summing over the strings in the respective sets and over $j \ge \lceil n^{1-\gamma} \rceil$ the probabilities
$$P\Big( \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma_r + 1} = x \}} - p_{k-1}( x \mid y^0_{-k+1} ) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil} \Big).$$
Each of these represents the deviation of an empirical average from its mean. The variables in question are independent, since whenever the conditioning block occurs, the next symbol is chosen according to the same distribution $p_{k-1}( \cdot \mid y^0_{-k+1} )$ (here $y^0_{-k+1} \in \mathcal M_{k-1}$ is used). Thus by Hoeffding's inequality (cf. Hoeffding [8] or Theorem 8.1 of Devroye et al. [6]) for sums of bounded independent random variables, and since the cardinality of both $\mathcal L^{(1)}_{n,k}$ and $\mathcal L^{(1)}_{n,k+i}$ is not greater than $(n+2)/2$, the $i$-th summand is at most
$$2\, \frac{n+2}{2} \sum_{j = \lceil n^{1-\gamma} \rceil}^{\infty} 2\, e^{-2 j (0.5 n^{-\beta})^2} = 2(n+2) \sum_{j = \lceil n^{1-\gamma} \rceil}^{\infty} e^{-\frac{1}{2} n^{-2\beta} j}.$$
Thus
$$P( \hat\Delta^n_k > n^{-\beta},\ K(\tilde X^0_{-\infty}) = k \mid X_0^{\lceil n/2 \rceil} ) \le 2 n (n+2) \sum_{j = \lceil n^{1-\gamma} \rceil}^{\infty} e^{-\frac{1}{2} n^{-2\beta} j}.$$
Integrating both sides, we get the same bound for $P( \hat\Delta^n_k > n^{-\beta},\ K(\tilde X^0_{-\infty}) = k )$. Since $2\beta + \gamma < 1$, the right hand side is summable in $n$, and the Borel-Cantelli lemma yields
$$P( \hat\Delta^n_k \le n^{-\beta} \ \text{eventually},\ K(\tilde X^0_{-\infty}) = k ) = P( K(\tilde X^0_{-\infty}) = k ).$$
Thus $\chi_n \le k$ eventually almost surely on $\{ K(\tilde X^0_{-\infty}) = k \}$.
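For completeness, one way to check the summability invoked above (a routine bound; the constants are ours): since $0 < \tfrac12 n^{-2\beta} \le \tfrac12$ and $1 - e^{-c} \ge c/2$ for $0 \le c \le 1$,
$$\sum_{j = \lceil n^{1-\gamma} \rceil}^{\infty} e^{-\frac{1}{2} n^{-2\beta} j} \le \frac{ e^{-\frac{1}{2} n^{-2\beta} n^{1-\gamma}} }{ 1 - e^{-\frac{1}{2} n^{-2\beta}} } \le 4\, n^{2\beta}\, e^{-\frac{1}{2} n^{1 - 2\beta - \gamma}},$$
and since $1 - 2\beta - \gamma > 0$, the resulting bound $2n(n+2) \cdot 4 n^{2\beta} e^{-\frac{1}{2} n^{1-2\beta-\gamma}}$ decays faster than any power of $n$; in particular it is summable, as required by the Borel-Cantelli lemma.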
Step 3. We show the first part of the Theorem.
Recalling (6) we can write
$$f_n = \frac{1}{n} \sum_{j=0}^{n-1} \big[ f(X_{\lambda_j+1}) - E( f(X_{\lambda_j+1}) \mid X^{\lambda_j}_{-\infty} ) \big] + \frac{1}{n} \sum_{j=0}^{n-1} E( f(X_{\lambda_j+1}) \mid X^{\lambda_j}_{-\infty} ). \tag{9}$$
Observe that the first term is an average of orthogonal bounded random variables and, by Theorem 3.2.2 in Révész [13], it tends to zero.

Now we deal with the second term. If $K(\tilde X^0_{-\infty}) < \infty$ then by Step 2, $\chi_n = K(\tilde X^0_{-\infty})$ eventually, and by (1), (2), (4) and Step 1, eventually,
$$E( f(X_{\lambda_j+1}) \mid X^{\lambda_j}_{-\infty} ) = E( f(X_{\lambda_j+1}) \mid X_0^{\lambda_j} ) = F( \tilde X^0_{-\infty} ).$$
It remains to deal with the case $K(\tilde X^0_{-\infty}) = \infty$; then by Step 2, $\chi_n \to \infty$. For arbitrary $j \ge 0$, by (5) and (4) and the construction in (2),
$$X^{\lambda_j}_{\lambda_j - \kappa_j + 1} = \tilde X^0_{-\kappa_j + 1} \quad\text{and}\quad \lim_{j\to\infty} d^*( \tilde X^0_{-\infty}, X^{\lambda_j}_{-\infty} ) = 0 \quad \text{almost surely.} \tag{10}$$
By Step 1 and the almost sure continuity of $F(\cdot)$, for some set $C \subseteq \mathcal X^{*-}$ with full measure, $F(\cdot)$ is continuous on $C$ and
$$\tilde X^0_{-\infty} \in C, \quad X^n_{-\infty} \in C \ \text{for all } n \ge 0, \quad \text{almost surely.} \tag{11}$$
By the continuity of $F(\cdot)$ on the set $C$ and (10),
$$E( f(X_{\lambda_j+1}) \mid X^{\lambda_j}_{-\infty} ) = F( X^{\lambda_j}_{-\infty} ) \to F( \tilde X^0_{-\infty} ),$$
and so $f_n \to F( \tilde X^0_{-\infty} )$ almost surely.

Define the random neighbourhood $N_j( X_0^{\lambda_j} )$ of $X_0^{\lambda_j}$, depending on the random data segment $X_0^{\lambda_j}$ itself, as
$$N_j( X_0^{\lambda_j} ) = \{ z^0_{-\infty} \in \mathcal X^{*-} : z_{-\kappa_j + 1} = X_{\lambda_j - \kappa_j + 1}, \dots, z_0 = X_{\lambda_j} \}.$$
Since $\tilde X^0_{-\infty} \in N_j( X_0^{\lambda_j} )$ by (10), and by (11), the continuity of $F(\cdot)$ on the set $C$ and $\kappa_j \to \infty$, almost surely,
$$\lim_{j\to\infty} \big| E( f(X_{\lambda_j+1}) \mid X_0^{\lambda_j} ) - F( \tilde X^0_{-\infty} ) \big| = \lim_{j\to\infty} \big| E\{ F( X^{\lambda_j}_{-\infty} ) \mid X_0^{\lambda_j} \} - F( \tilde X^0_{-\infty} ) \big| \le \lim_{j\to\infty} \sup_{ y^0_{-\infty}, z^0_{-\infty} \in N_j( X_0^{\lambda_j} ) \cap C } \big| F( y^0_{-\infty} ) - F( z^0_{-\infty} ) \big| = 0.$$
Step 4. We show the second part of the Theorem.
Now we assume that the stationary and ergodic finite or countably infinite alphabet time series $\{X_n\}$ possesses finite entropy rate $H$. (A stationary finite alphabet time series always has finite entropy rate.) We will in fact obtain a more precise estimate, namely: if for some $0 < \epsilon_1 < \epsilon_2$,
$$\sum_{k=1}^{\infty} (k+1)\, 2^{-l_k(\epsilon_2 - \epsilon_1)} < \infty,$$
then $\lambda_n < 2^{l_n(H + \epsilon_2)}$ eventually almost surely.
In particular, for arbitrary $\delta > 0$ and $0 < \epsilon_1 < \epsilon_2$, if
$$l_n = \min\Big( n,\ \max\Big( 1,\ \Big\lfloor \frac{2+\delta}{\epsilon_2 - \epsilon_1} \log_2 n \Big\rfloor \Big) \Big)$$
then $\lambda_n < n^{(2+\delta)(H + \epsilon_2)/(\epsilon_2 - \epsilon_1)}$ eventually almost surely, and the upper bound is a polynomial in $n$. (With this choice, $2^{-l_k(\epsilon_2-\epsilon_1)} \le 2^{\epsilon_2-\epsilon_1} k^{-(2+\delta)}$ for all large $k$, so the summability hypothesis above is satisfied.)

Since $\lambda_n \le \zeta_n$, it is enough to prove the result for $\zeta_n$. Let $\mathcal X^*$ be the set of all two-sided sequences, that is,
$$\mathcal X^* = \{ (\dots, x_{-1}, x_0, x_1, \dots) : x_i \in \mathcal X \ \text{for all } -\infty < i < \infty \}.$$
Define $B_k \subseteq \mathcal X^{l_k}$ as $B_k = \{ x^0_{-l_k+1} \in \mathcal X^{l_k} : 2^{-l_k(H + \epsilon_1)} < p_{l_k - 1}( x^0_{-l_k+1} ) \}$. Note that there is a trivial bound on the cardinality of the set $B_k$, namely,
$$|B_k| \le 2^{l_k(H + \epsilon_1)}. \tag{12}$$
Define the set $\Upsilon_k( y^0_{-l_k+1} )$ as follows:
$$\Upsilon_k( y^0_{-l_k+1} ) = \{ z^{\infty}_{-\infty} \in \mathcal X^* : -\hat\zeta^k_k( z^0_{-\infty} ) \ge 2^{l_k(H + \epsilon_2)},\ z^0_{-l_k+1} = y^0_{-l_k+1} \}.$$
We will estimate the probability of $\Upsilon_k( y^0_{-l_k+1} )$ by a frequency argument. Let $x^{\infty}_{-\infty} \in \mathcal X^*$ be a typical sequence of the time series $\{X_n\}$. Define $\rho_0( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) = 0$ and for $i \ge 1$
$$\rho_i( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) = \min\{ l > \rho_{i-1}( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) : T^{-l} x^{\infty}_{-\infty} \in \Upsilon_k( y^0_{-l_k+1} ) \};$$
similarly, let $\tau_0( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) = 0$ and for $i \ge 1$
$$\tau_i( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) = \min\{ l \ge \tau_{i-1}( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) + 2^{l_k(H + \epsilon_2)} : T^{-l} x^{\infty}_{-\infty} \in \Upsilon_k( y^0_{-l_k+1} ) \}.$$
Notice that if $\tau_{i-1} = \rho_m$ then $\tau_i \le \rho_{m+k+1}$. (Indeed, there are at least $k+1$ occurrences of the block $y^0_{-l_k+1}$ in the data segment $X^{-\rho_{m+1}}_{-\rho_{m+k+1} - l_k + 1}$, hence $2^{l_k(H + \epsilon_2)} \le -\hat\zeta^k_k( T^{-\rho_{m+1}} x^{\infty}_{-\infty} ) \le \rho_{m+k+1} - \tau_{i-1}$.) By the ergodicity of the time series $\{X_n\}$,
$$P( X^{\infty}_{-\infty} \in \Upsilon_k( y^0_{-l_k+1} ) ) = \lim_{t\to\infty} \frac{ |\{ j \ge 1 : \rho_j( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) \le \tau_t( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) \}| }{ \tau_t( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) } = \lim_{t\to\infty} \frac{ \sum_{l=1}^{t} |\{ j \ge 1 : \tau_{l-1} < \rho_j \le \tau_l \}| }{ \tau_t } \le \lim_{t\to\infty} \frac{ t(k+1) }{ t\, 2^{l_k(H + \epsilon_2)} } = \frac{ k+1 }{ 2^{l_k(H + \epsilon_2)} }. \tag{13}$$
Since
$$T^l \{ \zeta_k = l,\ X^{\zeta_k}_{\zeta_k - l_k + 1} \in B_k \} = \{ \hat\zeta^k_k = -l,\ X^0_{-l_k+1} \in B_k \},$$
by stationarity, the upper bound (12) on the cardinality of the set $B_k$ and (13), we get
$$P( \zeta_k \ge 2^{l_k(H + \epsilon_2)},\ \tilde X^0_{-l_k+1} \in B_k ) = P( \zeta_k \ge 2^{l_k(H + \epsilon_2)},\ X^{\zeta_k}_{\zeta_k - l_k + 1} \in B_k ) = P( -\hat\zeta^k_k \ge 2^{l_k(H + \epsilon_2)},\ X^0_{-l_k+1} \in B_k ) = \sum_{ y^0_{-l_k+1} \in B_k } P( X^{\infty}_{-\infty} \in \Upsilon_k( y^0_{-l_k+1} ) ) \le (k+1)\, 2^{-l_k(\epsilon_2 - \epsilon_1)}.$$
By assumption, the right hand side is summable, and the Borel-Cantelli lemma yields that the event $\{ \zeta_k \ge 2^{l_k(H + \epsilon_2)},\ \tilde X^0_{-l_k+1} \in B_k \}$ can happen only finitely many times. By Step 1, the distribution of the time series $\{\tilde X_n\}$ is the same as the distribution of $\{X_n\}$, and by the Shannon-McMillan-Breiman theorem (cf. Chung [2]), $\tilde X^0_{-l_k+1} \in B_k$ eventually almost surely; hence $\zeta_k \ge 2^{l_k(H + \epsilon_2)}$ can happen only finitely many times.
Step 5. We show the rest of the Theorem.
By Step 2, if $1 \le K(\tilde X^0_{-\infty}) < \infty$ then $\chi_n = K(\tilde X^0_{-\infty})$ eventually, and by ergodicity, $\frac{n}{\lambda_n} \to p_{K(\tilde X^0_{-\infty})-1}( \tilde X^0_{-K(\tilde X^0_{-\infty})+1} ) > 0$.
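A sketch of this ergodicity step (our elaboration): write $K = K(\tilde X^0_{-\infty})$. Once $\chi_t = K$ for all $t$ beyond some random index, (4) makes the $\lambda_n$ run through the successive occurrences of the fixed block $\tilde X^0_{-K+1}$, so by the ergodic theorem
$$\frac{n}{\lambda_n} \sim \frac{ \big| \{ K-1 \le t \le \lambda_n : X^t_{t-K+1} = \tilde X^0_{-K+1} \} \big| }{ \lambda_n } \longrightarrow p_{K-1}( \tilde X^0_{-K+1} ) \quad \text{almost surely.}$$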
If $K(\tilde X^0_{-\infty}) = 0$ then by Step 2, $\chi_n = 0$ eventually, and by (4), $\lambda_n = \lambda_{n-1} + 1$ eventually. The proof of the Theorem is complete.

References
[1] D. H. Bailey, Sequential Schemes for Classifying and Predicting Ergodic Processes, Ph.D. thesis, Stanford University, 1976.
[2] K. L. Chung, "A note on the ergodic theorem of information theory," The Annals of Mathematical Statistics, vol. 32, pp. 612-614, 1961.
[3] T. M. Cover and J. Thomas, Elements of Information Theory, Wiley, 1991.
[4] I. Csiszár and P. Shields, "The consistency of the BIC Markov order estimator," Annals of Statistics, vol. 28, pp. 1601-1619, 2000.
[5] I. Csiszár, "Large-scale typicality of Markov sample paths and consistency of MDL order estimators," IEEE Transactions on Information Theory, vol. 48, pp. 1616-1628, 2002.
[6] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York, 1996.
[7] L. Györfi, G. Morvai, and S. Yakowitz, "Limits to consistent on-line forecasting for ergodic time series," IEEE Transactions on Information Theory, vol. 44, pp. 886-892, 1998.
[8] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, pp. 13-30, 1963.
[9] S. Kalikow, "Random Markov processes and uniform martingales," Israel Journal of Mathematics, vol. 71, pp. 33-54, 1990.
[10] M. Keane, "Strongly mixing g-measures," Inventiones Mathematicae, vol. 16, pp. 309-324, 1972.
[11] G. Morvai, "Guessing the output of a stationary binary time series," in Foundations of Statistical Inference (Y. Haitovsky, H. R. Lerche, Y. Ritov, eds.), Physica-Verlag, pp. 207-215, 2003.
[12] G. Morvai and B. Weiss, "Forecasting for stationary binary time series," Acta Applicandae Mathematicae, vol. 79, pp. 25-34, 2003.
[13] P. Révész, The Laws of Large Numbers, Academic Press, 1968.
[14] B. Ya. Ryabko, "Prediction of random sequences and universal coding," Problems of Information Transmission, vol. 24, pp. 87-96, 1988.