Gusztáv MORVAI and Benjamin WEISS
Prediction for Discrete Time Series
Probab. Theory Related Fields 132 (2005), no. 1, 1–12.
Abstract
Let $\{X_n\}$ be a stationary and ergodic time series taking values from a finite or countably infinite set $\mathcal X$. Assume that the distribution of the process is otherwise unknown. We propose a sequence of stopping times $\lambda_n$ along which we will be able to estimate the conditional probability $P(X_{\lambda_n+1} = x \mid X_0, \dots, X_{\lambda_n})$ from the data segment $(X_0, \dots, X_{\lambda_n})$ in a pointwise consistent way, for a restricted class of stationary and ergodic finite or countably infinite alphabet time series which includes, among others, all stationary and ergodic finitarily Markovian processes. If the stationary and ergodic process turns out to be finitarily Markovian (among others, all stationary and ergodic Markov chains are included in this class), then $\lim_{n\to\infty} \frac{n}{\lambda_n} > 0$ almost surely and $\lambda_n$ is upper bounded by a polynomial, eventually almost surely.

Keywords: Nonparametric estimation, stationary processes

Mathematics Subject Classifications (2000)
Introduction
Bailey [1] and Ryabko [14] considered the problem of estimating the conditional probability $P(X_{n+1} = 1 \mid X_0, \dots, X_n)$ for binary time series. They showed that this quantity cannot be estimated from the data $(X_0, \dots, X_n)$ in such a way that the error tends to zero almost surely as $n$ increases, for all stationary and ergodic binary time series.

It is well known that if one knows in advance that the process is Markov with arbitrary (unknown) order, then one can estimate the order (cf. Csiszár and Shields [4], Csiszár [5]) and, using this estimated order, compute empirical averages of blocks of length one plus the order to estimate $P(X_{n+1} = 1 \mid X_0, \dots, X_n)$ in a pointwise consistent way. In the present paper we consider the case when it is not known in advance whether the process is Markov or not.

Morvai [11] exhibited a sequence of stopping times $\eta_n$ such that $P(X_{\eta_n+1} = 1 \mid X_0, \dots, X_{\eta_n})$ can be estimated from the data segment $(X_0, \dots, X_{\eta_n})$ in a pointwise consistent way, that is, the error vanishes as $n$ increases. The disadvantage of that scheme is that the stopping times grow very fast. Another, more reasonable scheme was proposed by Morvai and Weiss [12] for a subclass of stationary and ergodic binary time series. There the stopping times still grow exponentially, though not as fast as in Morvai [11].

Bailey [1] proved that there is no test for the Markov property, that is, no algorithm which could tell you eventually whether or not the process is Markov of some order, over all stationary and ergodic binary time series.

In this paper discrete (finite or countably infinite) alphabet stationary and ergodic processes are treated. We propose a much denser (compared to Morvai and Weiss [12]) sequence of stopping times $\lambda_n$ along which we will be able to estimate $P(X_{\lambda_n+1} = x \mid X_0, \dots, X_{\lambda_n})$ from the samples $(X_0, \dots, X_{\lambda_n})$ in a pointwise consistent way, for those processes whose conditional distribution is almost surely continuous (see the precise definition below). This class includes all Markov processes with arbitrary order and the much wider class of finitarily Markovian processes. Despite Bailey's result, for the proposed stopping times $\lambda_n$, if the stationary and ergodic process turns out to be finitarily Markovian (a class which includes all stationary and ergodic Markov chains with arbitrary order), then $\lim_{n\to\infty} \frac{n}{\lambda_n} > 0$ and $\lambda_n$ is upper bounded by a polynomial, eventually almost surely.

1 The Proposed Algorithm
Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary and ergodic time series taking values from a discrete (finite or countably infinite) alphabet $\mathcal X$. (Note that every stationary time series $\{X_n\}_{n=0}^{\infty}$ can be thought of as a two-sided time series $\{X_n\}_{n=-\infty}^{\infty}$.) For notational convenience, let $X_m^n = (X_m, \dots, X_n)$, where $m \le n$. Note that if $m > n$ then $X_m^n$ is the empty string.
For $k \ge 1$, let $1 \le l_k \le k$ be a nondecreasing unbounded sequence of integers, that is, $1 = l_1 \le l_2 \le \dots$ and $\lim_{k\to\infty} l_k = \infty$. Define auxiliary stopping times (similarly to Morvai and Weiss [12]) as follows. Set $\zeta_0 = 0$, and for $n = 1, 2, \dots$ let
$$\zeta_n = \zeta_{n-1} + \min\{ t > 0 : X^{\zeta_{n-1}+t}_{\zeta_{n-1}+t-(l_n-1)} = X^{\zeta_{n-1}}_{\zeta_{n-1}-(l_n-1)} \}. \tag{1}$$
Among other things, using $\zeta_n$ and $l_n$ we can define a very useful process $\{\tilde X_n\}_{n=-\infty}^{0}$ as a function of $X^{\infty}_{-\infty}$ as follows. Let $J(n) = \min\{ j \ge 1 : l_{j+1} > n \}$ and define
$$\tilde X_{-i} = X_{\zeta_{J(i)}-i} \quad \text{for } i \ge 0. \tag{2}$$
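The recursion (1) and the construction (2) are purely combinatorial and can be simulated directly. The following Python sketch is our own illustration (the function names and the list-based encoding of $l_k$ are ours, not the paper's); it assumes the one-sided sample path is long enough for every search to terminate, which holds almost surely by ergodicity.

```python
def zeta_times(x, l, n_steps):
    """Auxiliary stopping times zeta_0, ..., zeta_{n_steps} of eq. (1).

    x is a (sufficiently long) sample path x[0], x[1], ...; l[k] holds l_k
    for k >= 1, with l[1] = 1, l nondecreasing and l[k] <= k, so that
    zeta_{n-1} >= n - 1 >= l_n - 1 keeps every index below nonnegative.
    """
    zetas = [0]
    for n in range(1, n_steps + 1):
        w = l[n]                          # window length l_n
        z = zetas[-1]
        pattern = x[z - (w - 1): z + 1]   # the last l_n symbols seen at zeta_{n-1}
        t = 1
        while x[z + t - (w - 1): z + t + 1] != pattern:
            t += 1                        # wait until the window recurs
        zetas.append(z + t)
    return zetas


def tilde_x(x, l, m):
    """(X~_{-m}, ..., X~_0) of eq. (2): X~_{-i} = X_{zeta_{J(i)} - i},
    where J(i) = min{ j >= 1 : l_{j+1} > i }; l must be given up to J(m)+1."""
    J = 1
    while l[J + 1] <= m:                  # J = J(m) suffices for every i <= m
        J += 1
    zetas = zeta_times(x, l, J)
    values = []
    for i in range(m, -1, -1):
        j = 1
        while l[j + 1] <= i:              # j = J(i)
            j += 1
        values.append(x[zetas[j] - i])
    return values                         # listed from X~_{-m} up to X~_0
```

For instance, the list `l = [None, 1, 1, 2, 2, 3, 3]` encodes the admissible choice $l_k = \lceil k/2 \rceil$ for $k \le 6$.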
As we will see in the proof of the Theorem, $\{\tilde X_n\}_{n=-\infty}^{0}$ has the same distribution as the original process. For notational convenience let $p_k(x^0_{-k})$ and $p_k(y \mid x^0_{-k})$ denote the distribution $P(X^0_{-k} = x^0_{-k})$ and the conditional distribution $P(X_1 = y \mid X^0_{-k} = x^0_{-k})$, respectively.

Definition 1. For a stationary time series $\{X_n\}$ the (random) length $K(X^0_{-\infty})$ of the memory of the sample path $X^0_{-\infty}$ is the smallest possible $0 \le K < \infty$ such that for all $i \ge 1$, all $y \in \mathcal X$ and all $z^{-K}_{-K-i+1} \in \mathcal X^i$,
$$p_{K-1}( y \mid X^0_{-K+1} ) = p_{K+i-1}( y \mid z^{-K}_{-K-i+1}, X^0_{-K+1} )$$
provided $p_{K+i}( z^{-K}_{-K-i+1}, X^0_{-K+1}, y ) > 0$; and $K(X^0_{-\infty}) = \infty$ if there is no such $K$.
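To fix ideas (our illustration, not part of the original text): for a stationary and ergodic Markov chain of order $m$ one has, for every sample path, every $i \ge 1$ and every $y \in \mathcal X$,
$$p_{m+i-1}( y \mid z^{-m}_{-m-i+1}, X^0_{-m+1} ) = p_{m-1}( y \mid X^0_{-m+1} )$$
whenever the longer block has positive probability, so $K(X^0_{-\infty}) \le m$; in particular, an independent and identically distributed process has $K(X^0_{-\infty}) = 0$.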
Definition 2. The stationary time series $\{X_n\}$ is said to be finitarily Markovian if $K(X^0_{-\infty})$ is finite (though not necessarily bounded) almost surely.

In order to estimate $K(\tilde X^0_{-\infty})$ we need to define some explicit statistics. Define
$$\Delta_k(\tilde X^0_{-k+1}) = \sup_{i \ge 1}\ \sup_{\{ z^{-k}_{-k-i+1} \in \mathcal X^i,\, x \in \mathcal X \,:\, p_{k+i}(z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x) > 0 \}} \Big| p_{k-1}( x \mid \tilde X^0_{-k+1} ) - p_{k+i-1}( x \mid z^{-k}_{-k-i+1}, \tilde X^0_{-k+1} ) \Big|.$$

We will divide the data segment $X_0^n$ into two parts: $X_0^{\lceil n/2 \rceil - 1}$ and $X^n_{\lceil n/2 \rceil}$. Let $\mathcal L^{(1)}_{n,k}$ denote the set of strings of length $k+1$ which appear at all in $X_0^{\lceil n/2 \rceil - 1}$, that is,
$$\mathcal L^{(1)}_{n,k} = \{ x^0_{-k} \in \mathcal X^{k+1} : X^t_{t-k} = x^0_{-k} \ \text{for some } k \le t \le \lceil n/2 \rceil - 1 \}.$$
For a fixed $0 < \gamma < 1$ let $\mathcal L^{(2)}_{n,k}$ denote the set of strings of length $k+1$ which appear more than $n^{1-\gamma}$ times in $X^n_{\lceil n/2 \rceil}$, that is,
$$\mathcal L^{(2)}_{n,k} = \{ x^0_{-k} \in \mathcal X^{k+1} : |\{ \lceil n/2 \rceil + k \le t \le n : X^t_{t-k} = x^0_{-k} \}| > n^{1-\gamma} \}.$$
Let $\mathcal L^n_k = \mathcal L^{(1)}_{n,k} \cap \mathcal L^{(2)}_{n,k}$.

We define the empirical version of $\Delta_k$ as follows. For strings $y^0_{-k+1} \in \mathcal X^k$, $z^{-k}_{-k-i+1} \in \mathcal X^i$ and a letter $x \in \mathcal X$, write
$$R^n_{k,i}( z^{-k}_{-k-i+1}, y^0_{-k+1}, x ) = \frac{ |\{ \lceil n/2 \rceil + k \le t \le n : X^t_{t-k} = (y^0_{-k+1}, x) \}| }{ |\{ \lceil n/2 \rceil + k - 1 \le t \le n-1 : X^t_{t-k+1} = y^0_{-k+1} \}| } - \frac{ |\{ \lceil n/2 \rceil + k + i \le t \le n : X^t_{t-k-i} = (z^{-k}_{-k-i+1}, y^0_{-k+1}, x) \}| }{ |\{ \lceil n/2 \rceil + k + i - 1 \le t \le n-1 : X^t_{t-k-i+1} = (z^{-k}_{-k-i+1}, y^0_{-k+1}) \}| }$$
and set
$$\hat\Delta^n_k(\tilde X^0_{-k+1}) = \max_{1 \le i \le n}\ \max_{ (z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x) \in \mathcal L^n_{k+i} } 1_{\{ \zeta_{J(k)} \le \lceil n/2 \rceil - 1 \}} \big| R^n_{k,i}( z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x ) \big|.$$
Note that the cutoff $1_{\{\zeta_{J(k)} \le \lceil n/2 \rceil - 1\}}$ ensures that $\tilde X^0_{-k+1}$ is determined by $X_0^{\lceil n/2 \rceil - 1}$. Observe that, by ergodicity, for any fixed $k$,
$$\liminf_{n\to\infty} \hat\Delta^n_k \ge \Delta_k \quad \text{almost surely.} \tag{3}$$

We define an estimate $\chi_n$ of $K(\tilde X^0_{-\infty})$ from the samples $X_0^n$ as follows. Let $0 < \beta < (1-\gamma)/2$ be arbitrary. Set $\chi_0 = 0$ and, for $n \ge 1$, let $\chi_n$ be the smallest $0 \le k < n$ such that $\hat\Delta^n_k \le n^{-\beta}$. Observe that if $\zeta_j \le \lceil n/2 \rceil - 1 < \zeta_{j+1}$ then $\chi_n \le l_{j+1}$. The idea (cf. the proof of the Theorem) is that if $K(\tilde X^0_{-\infty}) < \infty$ then $\chi_n$ will equal $K(\tilde X^0_{-\infty})$ eventually, while if $K(\tilde X^0_{-\infty}) = \infty$ then $\chi_n \to \infty$.

Now we define the sequence of stopping times $\lambda_n$ along which we will be able to estimate. Set $\lambda_0 = \zeta_1$, and for $n \ge 1$, if $\zeta_j \le \lambda_{n-1} < \zeta_{j+1}$ then put
$$\lambda_n = \min\{ t > \lambda_{n-1} : X^t_{t - \chi_t + 1} = X^{\zeta_j}_{\zeta_j - \chi_t + 1} \} \tag{4}$$
and
$$\kappa_n = \chi_{\lambda_n}. \tag{5}$$
Observe that if $\zeta_j \le \lambda_{n-1} < \zeta_{j+1}$ then $\zeta_j \le \lambda_{n-1} < \lambda_n \le \zeta_{j+1}$. If $\chi_{\lambda_{n-1}+1} = 0$ then $\lambda_n = \lambda_{n-1} + 1$. Note that $\lambda_n$ is a stopping time and $\kappa_n$ is our estimate of $K(\tilde X^0_{-\infty})$ from the samples $X_0^{\lambda_n}$.

Let $\mathcal X^{*-}$ be the set of all one-sided sequences, that is,
$$\mathcal X^{*-} = \{ (\dots, x_{-1}, x_0) : x_i \in \mathcal X \ \text{for all } -\infty < i \le 0 \}.$$
Let $f : \mathcal X \to (-\infty, \infty)$ be bounded but otherwise arbitrary. Define the function $F : \mathcal X^{*-} \to (-\infty, \infty)$ as
$$F( x^0_{-\infty} ) = E( f(X_1) \mid X^0_{-\infty} = x^0_{-\infty} ).$$
E.g., if $f(x) = 1_{\{x = z\}}$ for a fixed $z \in \mathcal X$ then $F(y^0_{-\infty}) = P( X_1 = z \mid X^0_{-\infty} = y^0_{-\infty} )$; if $\mathcal X$ is a finite or countably infinite subset of the reals and $f(x) = x$ then $F(y^0_{-\infty}) = E( X_1 \mid X^0_{-\infty} = y^0_{-\infty} )$.

One denotes the $n$th estimate of $E( f(X_{\lambda_n+1}) \mid X_0^{\lambda_n} )$ from the samples $X_0^{\lambda_n}$ by $f_n$, and defines it to be
$$f_n = \frac{1}{n} \sum_{j=0}^{n-1} f( X_{\lambda_j + 1} ). \tag{6}$$

Define the distance $d^*(\cdot,\cdot)$ on $\mathcal X^{*-}$ as follows. For $x^0_{-\infty}, y^0_{-\infty} \in \mathcal X^{*-}$ let
$$d^*( x^0_{-\infty}, y^0_{-\infty} ) = \sum_{i=0}^{\infty} 2^{-i-1}\, 1_{\{ x_{-i} \ne y_{-i} \}}. \tag{7}$$
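To make the interplay of (4), (5) and (6) concrete, here is a minimal Python sketch of the whole sampling-and-averaging loop, reusing `zeta_times` from the sketch above. The memory estimate is abstracted as a callable `chi(x, t)` returning $\chi_t$ from $X_0^t$ (a hypothetical stand-in for the $\hat\Delta$-based rule; implementing $\hat\Delta^n_k$ itself is a straightforward, if tedious, counting exercise). This is illustrative code under our own naming, not the authors' implementation.

```python
def predict(x, l, f, chi, n_est):
    """f_1, ..., f_{n_est} of eq. (6) along the stopping times of eq. (4).

    chi(x, t) is assumed to return the memory estimate chi_t computed
    from x[0..t]; f is any bounded function on the alphabet.
    """
    zetas = zeta_times(x, l, n_est + 1)   # lambda_n <= zeta_{n+1}, so this suffices
    lambdas = [zetas[1]]                  # lambda_0 = zeta_1
    for n in range(1, n_est + 1):
        j = max(i for i in range(len(zetas)) if zetas[i] <= lambdas[-1])
        t = lambdas[-1] + 1
        while True:
            k = chi(x, t)                 # current memory estimate chi_t
            # match the last chi_t symbols at t against those at zeta_j;
            # with chi_t = 0 this succeeds at once, so lambda_n = lambda_{n-1} + 1
            # (for simplicity we assume zetas[j] - k + 1 >= 0)
            if x[t - k + 1: t + 1] == x[zetas[j] - k + 1: zetas[j] + 1]:
                break
            t += 1
        lambdas.append(t)                 # lambda_n; kappa_n of (5) would be chi(x, t)
    return [sum(f(x[lambdas[i] + 1]) for i in range(n)) / n
            for n in range(1, n_est + 1)]
```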
Definition 3. We say that $F(X^0_{-\infty})$ is almost surely continuous if for some set $C \subseteq \mathcal X^{*-}$ which has probability one, the function $F$ restricted to this set $C$ is continuous with respect to the metric $d^*(\cdot,\cdot)$. (Cf. Morvai and Weiss [12].)

The class of processes with almost surely continuous conditional expectation generalizes the class of processes for which it is actually continuous, cf. Kalikow [9] and Keane [10]. The stationary finitarily Markovian processes are included in the class of stationary processes with almost surely continuous $E( f(X_1) \mid X^0_{-\infty} )$ for arbitrary bounded $f(\cdot)$. Note that Ryabko [14] and Györfi, Morvai, Yakowitz [7] showed that one cannot estimate $P(X_{n+1} = 1 \mid X_0^n)$ for all $n$ in a pointwise consistent way, even for the class of all stationary and ergodic binary finitarily Markovian time series.

The entropy rate $H$ associated with a stationary finite or countably infinite alphabet time series $\{X_n\}$ is defined as
$$H = \lim_{n\to\infty} -\frac{1}{n+1} \sum_{x^0_{-n} \in \mathcal X^{n+1}} p_n( x^0_{-n} ) \log p_n( x^0_{-n} ).$$
We note that the entropy rate of a stationary finite alphabet time series is finite. For details cf. Cover and Thomas [3], pp. 63-64.

2 The Main Result

Fix positive real numbers $0 < \beta, \gamma < 1$ with $2\beta + \gamma < 1$, fix a sequence $l_n$ such that $1 = l_1 \le l_2 \le \dots$ and $l_n \to \infty$, and fix a bounded function $f(\cdot) : \mathcal X \to (-\infty, \infty)$; with these numbers, this sequence and this function define $\zeta_n$, $\chi_n$, $\kappa_n$, $\lambda_n$ and $F(\cdot)$ as described in the previous section. For the resulting $f_n$ we have the following theorem:
THEOREM. Let $\{X_n\}$ be a stationary and ergodic time series taking values from a finite or countably infinite set $\mathcal X$. If the conditional expectation $F(X^0_{-\infty})$ is almost surely continuous, then almost surely
$$\lim_{n\to\infty} f_n = F( \tilde X^0_{-\infty} ) \quad\text{and}\quad \lim_{n\to\infty} \big| f_n - E( f(X_{\lambda_n+1}) \mid X_0^{\lambda_n} ) \big| = 0.$$
The $l_n$ may be chosen in such a fashion that whenever the stationary and ergodic time series $\{X_n\}$ has finite entropy rate, the $\lambda_n$ grow no faster than a polynomial in $n$. If the stationary and ergodic time series $\{X_n\}$ turns out to be finitarily Markovian, then
$$\lim_{n\to\infty} \frac{\lambda_n}{n} = \frac{1}{ p_{K(\tilde X^0_{-\infty})-1}( \tilde X^0_{-K(\tilde X^0_{-\infty})+1} ) } < \infty \quad \text{almost surely.}$$
Moreover, if the stationary and ergodic time series $\{X_n\}$ turns out to be independent and identically distributed, then $\lambda_n = \lambda_{n-1} + 1$ eventually almost surely.
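To spell the first claim out in the leading special case (this is merely the Theorem applied to $f = 1_{\{x = z\}}$, as in the example following (6)): for each fixed $z \in \mathcal X$, almost surely
$$\frac{1}{n} \sum_{j=0}^{n-1} 1_{\{ X_{\lambda_j+1} = z \}} \longrightarrow P( X_1 = z \mid X^0_{-\infty} = \tilde X^0_{-\infty} ) \quad\text{and}\quad \Big| f_n - P( X_{\lambda_n+1} = z \mid X_0^{\lambda_n} ) \Big| \longrightarrow 0,$$
that is, the conditional probabilities $P(X_{\lambda_n+1} = z \mid X_0^{\lambda_n})$ are estimated pointwise consistently along the stopping times $\lambda_n$.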
Proof of the Theorem.

Step 1. The time series $\{\tilde X_n\}_{n=-\infty}^{0}$ and $\{X_n\}_{n=-\infty}^{0}$ have identical distribution.

For all $k \ge 1$ and $0 \le i \le k$ define (similarly to Morvai and Weiss [12]) $\hat\zeta^k_0 = 0$ and
$$\hat\zeta^k_i = \hat\zeta^k_{i-1} - \min\{ t > 0 : X^{\hat\zeta^k_{i-1} - t}_{\hat\zeta^k_{i-1} - (l_{k-i+1}-1) - t} = X^{\hat\zeta^k_{i-1}}_{\hat\zeta^k_{i-1} - (l_{k-i+1}-1)} \}.$$
Let $T$ denote the left shift operator, that is, $(T x^{\infty}_{-\infty})_i = x_{i+1}$. It is easy to see that if $\zeta_k(x^{\infty}_{-\infty}) = l$ then $\hat\zeta^k_k(T^l x^{\infty}_{-\infty}) = -l$. Now the statement follows from stationarity and the fact that for $k \ge 1$, $n \ge 0$, $x^0_{-n} \in \mathcal X^{n+1}$ and $l \ge 0$,
$$T^l \{ X^{\zeta_k}_{\zeta_k - n} = x^0_{-n},\ \zeta_k = l \} = \{ X^0_{-n} = x^0_{-n},\ \hat\zeta^k_k( X^0_{-\infty} ) = -l \}. \tag{8}$$
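To spell out how (8) gives Step 1 (a step the text leaves implicit): fix $n \ge 0$ and any $k$ with $l_{k+1} > n$; the recursion (1) forces $X^{\zeta_{j+1}}_{\zeta_{j+1}-(l_{j+1}-1)} = X^{\zeta_j}_{\zeta_j-(l_{j+1}-1)}$, and chaining these identities shows $\tilde X^0_{-n} = X^{\zeta_k}_{\zeta_k - n}$. Hence, summing (8) over $l$ and using stationarity together with the almost sure finiteness of $\zeta_k$ and $\hat\zeta^k_k$,
$$P( \tilde X^0_{-n} = x^0_{-n} ) = \sum_{l \ge 0} P( X^{\zeta_k}_{\zeta_k - n} = x^0_{-n},\ \zeta_k = l ) = \sum_{l \ge 0} P( X^0_{-n} = x^0_{-n},\ \hat\zeta^k_k( X^0_{-\infty} ) = -l ) = P( X^0_{-n} = x^0_{-n} ).$$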
Step 2. We show that
$$P( \chi_n = K(\tilde X^0_{-\infty}) \ \text{eventually} \mid K(\tilde X^0_{-\infty}) < \infty ) = 1 \quad\text{and}\quad P( \lim_{n\to\infty} \chi_n = \infty \mid K(\tilde X^0_{-\infty}) = \infty ) = 1.$$

By Step 1, $\{\tilde X_n\}_{n=-\infty}^{0}$ is stationary and ergodic with the same distribution as $\{X_n\}_{n=-\infty}^{0}$. We may assume that the sample path $\tilde X^0_{-\infty}$ is such that all finite blocks that appear in it have positive probability. It is immediate that if $K(\tilde X^0_{-\infty}) < \infty$ then $\Delta_k = 0$ for all $k \ge K(\tilde X^0_{-\infty})$, while $\Delta_{K(\tilde X^0_{-\infty})-1} > 0$ whenever $K(\tilde X^0_{-\infty}) \ge 1$. If $K(\tilde X^0_{-\infty}) = \infty$ then $\Delta_k > 0$ for every $k$ (otherwise $K(\tilde X^0_{-\infty})$ would be finite). Thus by (3), if $K(\tilde X^0_{-\infty}) = \infty$ then $\chi_n \to \infty$, and if $K(\tilde X^0_{-\infty}) < \infty$ then $\chi_n \ge K(\tilde X^0_{-\infty})$ eventually almost surely. It remains to show that $\chi_n \le K(\tilde X^0_{-\infty})$ eventually almost surely provided that $K(\tilde X^0_{-\infty}) < \infty$.

Fix now $k < n$. We estimate the probability of the undesirable event as follows:
$$P( \hat\Delta^n_k > n^{-\beta},\ K(\tilde X^0_{-\infty}) = k \mid X_0^{\lceil n/2 \rceil} ) \le \sum_{i=1}^{n} P\Big( \max_{ (z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x) \in \mathcal L^n_{k+i} } 1_{\{ \zeta_{J(k)} \le \lceil n/2 \rceil - 1 \}} \big| R^n_{k,i}( z^{-k}_{-k-i+1}, \tilde X^0_{-k+1}, x ) \big| > n^{-\beta},\ K(\tilde X^0_{-\infty}) = k \,\Big|\, X_0^{\lceil n/2 \rceil} \Big).$$
Define $\mathcal M_{k-1}$ as the set of all $x^0_{-k+1} \in \mathcal X^k$ such that for all $i \ge 1$, $z \in \mathcal X$ and $y^{-k}_{-k-i+1} \in \mathcal X^i$ with $p_{k+i}( y^{-k}_{-k-i+1}, x^0_{-k+1}, z ) > 0$, one has $p_{k-1}( z \mid x^0_{-k+1} ) = p_{k+i-1}( z \mid y^{-k}_{-k-i+1}, x^0_{-k+1} )$. By the definition of $\hat\Delta^n_k$, and since on the event $\{ K(\tilde X^0_{-\infty}) = k \}$ the string $\tilde X^0_{-k+1}$ belongs to $\mathcal M_{k-1}$, the $i$-th summand above is at most
$$P\Big( \max_{ y^0_{-k+1} \in \mathcal M_{k-1},\ (z^{-k}_{-k-i+1}, y^0_{-k+1}, x) \in \mathcal L^n_{k+i} } \big| R^n_{k,i}( z^{-k}_{-k-i+1}, y^0_{-k+1}, x ) \big| > n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil} \Big).$$
For $y^0_{-k+1} \in \mathcal M_{k-1}$, the two ratios in $R^n_{k,i}$ are empirical estimates of one and the same conditional probability $p_{k-1}( x \mid y^0_{-k+1} )$, so this last probability is at most the sum of two terms: the probability that the first ratio deviates from $p_{k-1}( x \mid y^0_{-k+1} )$ by more than $0.5\, n^{-\beta}$ for some admissible string, plus the corresponding probability for the second ratio.

We overestimate these probabilities. For any $m \ge 0$ and $x^0_{-m}$, define $\sigma^m_i( x^0_{-m} )$ as the time of the $i$-th occurrence of the string $x^0_{-m}$ in the data segment $X^n_{\lceil n/2 \rceil}$, that is, let $\sigma^m_0( x^0_{-m} ) = \lceil n/2 \rceil + m - 1$ and for $i \ge 1$
$$\sigma^m_i( x^0_{-m} ) = \min\{ t > \sigma^m_{i-1}( x^0_{-m} ) : X^t_{t-m} = x^0_{-m} \}.$$
Since every string in $\mathcal L^n_{k+i}$ occurs more than $n^{1-\gamma}$ times in $X^n_{\lceil n/2 \rceil}$, the sum of the two terms is at most
$$P\Big( \max_{ y^0_{-k+1} \in \mathcal M_{k-1},\ (y^0_{-k+1}, x) \in \mathcal L^{(1)}_{n,k} }\ \sup_{ j > n^{1-\gamma} } \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma^{k-1}_r( y^0_{-k+1} ) + 1} = x \}} - p_{k-1}( x \mid y^0_{-k+1} ) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil} \Big)$$
$$+\ P\Big( \max_{ y^0_{-k+1} \in \mathcal M_{k-1},\ (z^{-k}_{-k-i+1}, y^0_{-k+1}, x) \in \mathcal L^{(1)}_{n,k+i} }\ \sup_{ j > n^{1-\gamma} } \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma^{k+i-1}_r( z^{-k}_{-k-i+1}, y^0_{-k+1} ) + 1} = x \}} - p_{k-1}( x \mid y^0_{-k+1} ) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil} \Big).$$
Since both $\mathcal L^{(1)}_{n,k}$ and $\mathcal L^{(1)}_{n,k+i}$ depend solely on $X_0^{\lceil n/2 \rceil - 1}$, we may bound each of these by the union bound, summing over the strings in the respective sets and over $j \ge \lceil n^{1-\gamma} \rceil$ the probabilities
$$P\Big( \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma_r + 1} = x \}} - p_{k-1}( x \mid y^0_{-k+1} ) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil} \Big).$$
Each of these represents the deviation of an empirical average from its mean. The variables in question are independent, since whenever the conditioning block occurs, the next symbol is chosen according to the same distribution $p_{k-1}( \cdot \mid y^0_{-k+1} )$ (here $y^0_{-k+1} \in \mathcal M_{k-1}$ is used). Thus by Hoeffding's inequality (cf. Hoeffding [8] or Theorem 8.1 of Devroye et al. [6]) for sums of bounded independent random variables, and since the cardinality of both $\mathcal L^{(1)}_{n,k}$ and $\mathcal L^{(1)}_{n,k+i}$ is not greater than $(n+2)/2$, the $i$-th summand is at most
$$2\, \frac{n+2}{2} \sum_{j = \lceil n^{1-\gamma} \rceil}^{\infty} 2\, e^{-2 j (0.5 n^{-\beta})^2} = 2(n+2) \sum_{j = \lceil n^{1-\gamma} \rceil}^{\infty} e^{-\frac{1}{2} n^{-2\beta} j}.$$
Thus
$$P( \hat\Delta^n_k > n^{-\beta},\ K(\tilde X^0_{-\infty}) = k \mid X_0^{\lceil n/2 \rceil} ) \le 2 n (n+2) \sum_{j = \lceil n^{1-\gamma} \rceil}^{\infty} e^{-\frac{1}{2} n^{-2\beta} j}.$$
Integrating both sides, we get the same bound for $P( \hat\Delta^n_k > n^{-\beta},\ K(\tilde X^0_{-\infty}) = k )$. Since $2\beta + \gamma < 1$, the right hand side is summable in $n$, and the Borel-Cantelli lemma yields
$$P( \hat\Delta^n_k \le n^{-\beta} \ \text{eventually},\ K(\tilde X^0_{-\infty}) = k ) = P( K(\tilde X^0_{-\infty}) = k ).$$
Thus $\chi_n \le k$ eventually almost surely on $\{ K(\tilde X^0_{-\infty}) = k \}$.
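For completeness, one way to check the summability invoked above (a routine bound; the constants are ours): since $0 < \tfrac12 n^{-2\beta} \le \tfrac12$ and $1 - e^{-c} \ge c/2$ for $0 \le c \le 1$,
$$\sum_{j = \lceil n^{1-\gamma} \rceil}^{\infty} e^{-\frac{1}{2} n^{-2\beta} j} \le \frac{ e^{-\frac{1}{2} n^{-2\beta} n^{1-\gamma}} }{ 1 - e^{-\frac{1}{2} n^{-2\beta}} } \le 4\, n^{2\beta}\, e^{-\frac{1}{2} n^{1 - 2\beta - \gamma}},$$
and since $1 - 2\beta - \gamma > 0$, the resulting bound $2n(n+2) \cdot 4 n^{2\beta} e^{-\frac{1}{2} n^{1-2\beta-\gamma}}$ decays faster than any power of $n$; in particular it is summable, as required by the Borel-Cantelli lemma.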
Step 3. We show the first part of the Theorem.
Recalling (6) we can write
$$f_n = \frac{1}{n} \sum_{j=0}^{n-1} \big[ f(X_{\lambda_j+1}) - E( f(X_{\lambda_j+1}) \mid X^{\lambda_j}_{-\infty} ) \big] + \frac{1}{n} \sum_{j=0}^{n-1} E( f(X_{\lambda_j+1}) \mid X^{\lambda_j}_{-\infty} ). \tag{9}$$
Observe that the first term is an average of orthogonal bounded random variables and, by Theorem 3.2.2 in Révész [13], it tends to zero.

Now we deal with the second term. If $K(\tilde X^0_{-\infty}) < \infty$ then by Step 2, $\chi_n = K(\tilde X^0_{-\infty})$ eventually, and by (1), (2), (4) and Step 1, eventually,
$$E( f(X_{\lambda_j+1}) \mid X^{\lambda_j}_{-\infty} ) = E( f(X_{\lambda_j+1}) \mid X_0^{\lambda_j} ) = F( \tilde X^0_{-\infty} ).$$
It remains to deal with the case $K(\tilde X^0_{-\infty}) = \infty$; then by Step 2, $\chi_n \to \infty$. For arbitrary $j \ge 0$, by (5) and (4) and the construction in (2),
$$X^{\lambda_j}_{\lambda_j - \kappa_j + 1} = \tilde X^0_{-\kappa_j + 1} \quad\text{and}\quad \lim_{j\to\infty} d^*( \tilde X^0_{-\infty}, X^{\lambda_j}_{-\infty} ) = 0 \quad \text{almost surely.} \tag{10}$$
By Step 1 and the almost sure continuity of $F(\cdot)$, for some set $C \subseteq \mathcal X^{*-}$ with full measure, $F(\cdot)$ is continuous on $C$ and
$$\tilde X^0_{-\infty} \in C, \quad X^n_{-\infty} \in C \ \text{for all } n \ge 0, \quad \text{almost surely.} \tag{11}$$
By the continuity of $F(\cdot)$ on the set $C$ and (10),
$$E( f(X_{\lambda_j+1}) \mid X^{\lambda_j}_{-\infty} ) = F( X^{\lambda_j}_{-\infty} ) \to F( \tilde X^0_{-\infty} ),$$
and so $f_n \to F( \tilde X^0_{-\infty} )$ almost surely.

Define the random neighbourhood $N_j( X_0^{\lambda_j} )$ of $X_0^{\lambda_j}$, depending on the random data segment $X_0^{\lambda_j}$ itself, as
$$N_j( X_0^{\lambda_j} ) = \{ z^0_{-\infty} \in \mathcal X^{*-} : z_{-\kappa_j + 1} = X_{\lambda_j - \kappa_j + 1}, \dots, z_0 = X_{\lambda_j} \}.$$
Since $\tilde X^0_{-\infty} \in N_j( X_0^{\lambda_j} )$ by (10), and by (11), the continuity of $F(\cdot)$ on the set $C$ and $\kappa_j \to \infty$, almost surely,
$$\lim_{j\to\infty} \big| E( f(X_{\lambda_j+1}) \mid X_0^{\lambda_j} ) - F( \tilde X^0_{-\infty} ) \big| = \lim_{j\to\infty} \big| E\{ F( X^{\lambda_j}_{-\infty} ) \mid X_0^{\lambda_j} \} - F( \tilde X^0_{-\infty} ) \big| \le \lim_{j\to\infty} \sup_{ y^0_{-\infty}, z^0_{-\infty} \in N_j( X_0^{\lambda_j} ) \cap C } \big| F( y^0_{-\infty} ) - F( z^0_{-\infty} ) \big| = 0.$$
Step 4. We show the second part of the Theorem.
Now we assume that the stationary and ergodic finite or countably infinite alphabet time series $\{X_n\}$ possesses finite entropy rate $H$. (A stationary finite alphabet time series always has finite entropy rate.) We will in fact obtain a more precise estimate, namely: if for some $0 < \epsilon_1 < \epsilon_2$,
$$\sum_{k=1}^{\infty} (k+1)\, 2^{-l_k(\epsilon_2 - \epsilon_1)} < \infty,$$
then $\lambda_n < 2^{l_n(H + \epsilon_2)}$ eventually almost surely.
In particular, for arbitrary $\delta > 0$ and $0 < \epsilon_1 < \epsilon_2$, if
$$l_n = \min\Big( n,\ \max\Big( 1,\ \Big\lfloor \frac{2+\delta}{\epsilon_2 - \epsilon_1} \log_2 n \Big\rfloor \Big) \Big)$$
then $\lambda_n < n^{(2+\delta)(H + \epsilon_2)/(\epsilon_2 - \epsilon_1)}$ eventually almost surely, and the upper bound is a polynomial in $n$. (With this choice, $2^{-l_k(\epsilon_2-\epsilon_1)} \le 2^{\epsilon_2-\epsilon_1} k^{-(2+\delta)}$ for all large $k$, so the summability hypothesis above is satisfied.)

Since $\lambda_n \le \zeta_n$, it is enough to prove the result for $\zeta_n$. Let $\mathcal X^*$ be the set of all two-sided sequences, that is,
$$\mathcal X^* = \{ (\dots, x_{-1}, x_0, x_1, \dots) : x_i \in \mathcal X \ \text{for all } -\infty < i < \infty \}.$$
Define $B_k \subseteq \mathcal X^{l_k}$ as $B_k = \{ x^0_{-l_k+1} \in \mathcal X^{l_k} : 2^{-l_k(H + \epsilon_1)} < p_{l_k - 1}( x^0_{-l_k+1} ) \}$. Note that there is a trivial bound on the cardinality of the set $B_k$, namely,
$$|B_k| \le 2^{l_k(H + \epsilon_1)}. \tag{12}$$
Define the set $\Upsilon_k( y^0_{-l_k+1} )$ as follows:
$$\Upsilon_k( y^0_{-l_k+1} ) = \{ z^{\infty}_{-\infty} \in \mathcal X^* : -\hat\zeta^k_k( z^0_{-\infty} ) \ge 2^{l_k(H + \epsilon_2)},\ z^0_{-l_k+1} = y^0_{-l_k+1} \}.$$
We will estimate the probability of $\Upsilon_k( y^0_{-l_k+1} )$ by a frequency argument. Let $x^{\infty}_{-\infty} \in \mathcal X^*$ be a typical sequence of the time series $\{X_n\}$. Define $\rho_0( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) = 0$ and for $i \ge 1$
$$\rho_i( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) = \min\{ l > \rho_{i-1}( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) : T^{-l} x^{\infty}_{-\infty} \in \Upsilon_k( y^0_{-l_k+1} ) \};$$
similarly, let $\tau_0( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) = 0$ and for $i \ge 1$
$$\tau_i( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) = \min\{ l \ge \tau_{i-1}( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) + 2^{l_k(H + \epsilon_2)} : T^{-l} x^{\infty}_{-\infty} \in \Upsilon_k( y^0_{-l_k+1} ) \}.$$
Notice that if $\tau_{i-1} = \rho_m$ then $\tau_i \le \rho_{m+k+1}$. (Indeed, there are at least $k+1$ occurrences of the block $y^0_{-l_k+1}$ in the data segment $X^{-\rho_{m+1}}_{-\rho_{m+k+1} - l_k + 1}$, hence $2^{l_k(H + \epsilon_2)} \le -\hat\zeta^k_k( T^{-\rho_{m+1}} x^{\infty}_{-\infty} ) \le \rho_{m+k+1} - \tau_{i-1}$.) By the ergodicity of the time series $\{X_n\}$,
$$P( X^{\infty}_{-\infty} \in \Upsilon_k( y^0_{-l_k+1} ) ) = \lim_{t\to\infty} \frac{ |\{ j \ge 1 : \rho_j( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) \le \tau_t( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) \}| }{ \tau_t( y^0_{-l_k+1}, x^{\infty}_{-\infty} ) } = \lim_{t\to\infty} \frac{ \sum_{l=1}^{t} |\{ j \ge 1 : \tau_{l-1} < \rho_j \le \tau_l \}| }{ \tau_t } \le \lim_{t\to\infty} \frac{ t(k+1) }{ t\, 2^{l_k(H + \epsilon_2)} } = \frac{ k+1 }{ 2^{l_k(H + \epsilon_2)} }. \tag{13}$$
Since
$$T^l \{ \zeta_k = l,\ X^{\zeta_k}_{\zeta_k - l_k + 1} \in B_k \} = \{ \hat\zeta^k_k = -l,\ X^0_{-l_k+1} \in B_k \},$$
by stationarity, the upper bound (12) on the cardinality of the set $B_k$ and (13), we get
$$P( \zeta_k \ge 2^{l_k(H + \epsilon_2)},\ \tilde X^0_{-l_k+1} \in B_k ) = P( \zeta_k \ge 2^{l_k(H + \epsilon_2)},\ X^{\zeta_k}_{\zeta_k - l_k + 1} \in B_k ) = P( -\hat\zeta^k_k \ge 2^{l_k(H + \epsilon_2)},\ X^0_{-l_k+1} \in B_k ) = \sum_{ y^0_{-l_k+1} \in B_k } P( X^{\infty}_{-\infty} \in \Upsilon_k( y^0_{-l_k+1} ) ) \le (k+1)\, 2^{-l_k(\epsilon_2 - \epsilon_1)}.$$
By assumption, the right hand side is summable, and the Borel-Cantelli lemma yields that the event $\{ \zeta_k \ge 2^{l_k(H + \epsilon_2)},\ \tilde X^0_{-l_k+1} \in B_k \}$ can happen only finitely many times. By Step 1, the distribution of the time series $\{\tilde X_n\}$ is the same as the distribution of $\{X_n\}$, and by the Shannon-McMillan-Breiman theorem (cf. Chung [2]), $\tilde X^0_{-l_k+1} \in B_k$ eventually almost surely; hence $\zeta_k \ge 2^{l_k(H + \epsilon_2)}$ can happen only finitely many times.
Step 5. We show the rest of the Theorem.
By Step 2, if $1 \le K(\tilde X^0_{-\infty}) < \infty$ then $\chi_n = K(\tilde X^0_{-\infty})$ eventually, and by ergodicity, $\frac{n}{\lambda_n} \to p_{K(\tilde X^0_{-\infty})-1}( \tilde X^0_{-K(\tilde X^0_{-\infty})+1} ) > 0$.
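A sketch of this ergodicity step (our elaboration): write $K = K(\tilde X^0_{-\infty})$. Once $\chi_t = K$ for all $t$ beyond some random index, (4) makes the $\lambda_n$ run through the successive occurrences of the fixed block $\tilde X^0_{-K+1}$, so by the ergodic theorem
$$\frac{n}{\lambda_n} \sim \frac{ \big| \{ K-1 \le t \le \lambda_n : X^t_{t-K+1} = \tilde X^0_{-K+1} \} \big| }{ \lambda_n } \longrightarrow p_{K-1}( \tilde X^0_{-K+1} ) \quad \text{almost surely.}$$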
If $K(\tilde X^0_{-\infty}) = 0$ then by Step 2, $\chi_n = 0$ eventually, and by (4), $\lambda_n = \lambda_{n-1} + 1$ eventually. The proof of the Theorem is complete.

References
[1] D. H. Bailey, Sequential Schemes for Classifying and Predicting Ergodic Processes, Ph.D. thesis, Stanford University, 1976.
[2] K. L. Chung, "A note on the ergodic theorem of information theory," The Annals of Mathematical Statistics, vol. 32, pp. 612-614, 1961.
[3] T. M. Cover and J. Thomas, Elements of Information Theory, Wiley, 1991.
[4] I. Csiszár and P. Shields, "The consistency of the BIC Markov order estimator," Annals of Statistics, vol. 28, pp. 1601-1619, 2000.
[5] I. Csiszár, "Large-scale typicality of Markov sample paths and consistency of MDL order estimators," IEEE Transactions on Information Theory, vol. 48, pp. 1616-1628, 2002.
[6] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York, 1996.
[7] L. Györfi, G. Morvai, and S. Yakowitz, "Limits to consistent on-line forecasting for ergodic time series," IEEE Transactions on Information Theory, vol. 44, pp. 886-892, 1998.
[8] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, pp. 13-30, 1963.
[9] S. Kalikow, "Random Markov processes and uniform martingales," Israel Journal of Mathematics, vol. 71, pp. 33-54, 1990.
[10] M. Keane, "Strongly mixing g-measures," Inventiones Mathematicae, vol. 16, pp. 309-324, 1972.
[11] G. Morvai, "Guessing the output of a stationary binary time series," in Foundations of Statistical Inference (Y. Haitovsky, H. R. Lerche, Y. Ritov, eds.), Physica-Verlag, pp. 207-215, 2003.
[12] G. Morvai and B. Weiss, "Forecasting for stationary binary time series," Acta Applicandae Mathematicae, vol. 79, pp. 25-34, 2003.
[13] P. Révész, The Laws of Large Numbers, Academic Press, 1968.
[14] B. Ya. Ryabko, "Prediction of random sequences and universal coding," Problems of Information Transmission, vol. 24, pp. 87-96, 1988.