Gusztáv MORVAI and Benjamin WEISS:
Order Estimation of Markov Chains
IEEE Trans. Inform. Theory 51 (2005), no. 4, 1496–1497.
Abstract
We describe estimators $\chi_n(X_0, X_1, \ldots, X_n)$ which, when applied to an unknown stationary process taking values from a countable alphabet $\mathcal{X}$, converge almost surely to $k$ if the process is a $k$-th order Markov chain and to infinity otherwise.

Keywords: Stationary processes, Markov chains, order estimation

Mathematics Subject Classifications (2000)
Introduction
When faced with an unknown stationary and ergodic stochastic process $X_0, X_1, \ldots, X_n, \ldots$ one may try to determine various properties of this process from the successive observations up to time $n$. For example, one might try to estimate the entropy of the process. Several schemes of the form $g_n(X_0, \ldots, X_n)$ are known which converge almost surely to the entropy of the process $\{X_n\}$; cf. Bailey [1], Csiszár and Shields [2], Csiszár [3], Ornstein and Weiss [8], [7], [9], Kontoyiannis, Algoet, Suhov and Wyner [6] and Ziv [10]. However, if one just wants to determine whether or not the process has positive entropy (often associated with the popular notion of chaos), then there is no sequence of two-valued functions $e_n(X_0, \ldots, X_n) \in \{\mathrm{ZERO}, \mathrm{POSITIVE}\}$ with the property that, almost surely, $e_n$ stabilizes at ZERO for all zero-entropy processes and at
POSITIVE for all positive-entropy processes. (While this result does not appear explicitly in Ornstein and Weiss [7], it can be readily established using a very simple variant of the construction given there.) A similar situation obtains for $k$-th order Markov chains. One can estimate the order of a Markov chain by, e.g., the method of Csiszár and Shields [2] or Csiszár [3]. They show that the minimum description length Markov order estimator will converge almost surely to the correct order if the alphabet size is bounded a priori; without this assumption, they show, this is no longer true. To accomplish their goals they study the large-scale typicality of Markov sample paths. A further negative result is that of Bailey [1], who showed that no two-valued test exists for testing mixing Markov vs. not mixing Markov.

We will present a more direct estimator for the order of a Markov chain which also uses the fact that there are universal rates for the convergence of empirical $k$-block distributions in this class. Our approach enables us to dispense with the assumption that the alphabet size is bounded; indeed it may even be infinite, as long as there is a finite memory. In addition we will show that if the process is not a Markov chain then the estimate for the order will tend to infinity. This is in complete analogy with the entropy estimation that we mentioned earlier.

1 The Order Estimator
Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary and ergodic time series taking values from a discrete (finite or countably infinite) alphabet $\mathcal{X}$. (Note that every stationary time series $\{X_n\}_{n=0}^{\infty}$ can be thought of as a two-sided time series, that is, $\{X_n\}_{n=-\infty}^{\infty}$.) For notational convenience, let $X_m^n = (X_m, \ldots, X_n)$, where $m \le n$. Note that if $m > n$ then $X_m^n$ is the empty string.

Let $p(x_{-k}^0)$ and $p(y \mid x_{-k}^0)$ denote the distribution $P(X_{-k}^0 = x_{-k}^0)$ and the conditional distribution $P(X_1 = y \mid X_{-k}^0 = x_{-k}^0)$, respectively.

A discrete alphabet stationary time series is said to be a Markov chain if for some $K \ge 0$: for all $y \in \mathcal{X}$, $i \ge 0$ and $z_{-K-i+1}^0 \in \mathcal{X}^{K+i}$, if $p(z_{-K-i+1}^0) > 0$ then
$$p(y \mid z_{-K+1}^0) = p(y \mid z_{-K-i+1}^0).$$
The order of a Markov chain is the smallest such $K$.

In order to estimate the order we need to define some explicit statistics. For $k \ge 0$ let $\mathcal{S}_k$ denote the support of the distribution of $X_{-k}^0$, that is,
$$\mathcal{S}_k = \{ x_{-k}^0 \in \mathcal{X}^{k+1} : p(x_{-k}^0) > 0 \}.$$
Define
$$\Delta_k = \sup_{i \ge 0} \; \sup_{(z_{-k-i+1}^0, x) \in \mathcal{S}_{k+i}} \left| p(x \mid z_{-k+1}^0) - p(x \mid z_{-k-i+1}^0) \right|.$$

We will divide the data segment $X_0^n$ into two parts: $X_0^{\lceil n/2 \rceil - 1}$ and $X_{\lceil n/2 \rceil}^n$. Let $\mathcal{S}^{(1)}_{n,k}$ denote the set of strings of length $k+1$ which appear at all in $X_0^{\lceil n/2 \rceil - 1}$, that is,
$$\mathcal{S}^{(1)}_{n,k} = \{ x_{-k}^0 \in \mathcal{X}^{k+1} : X_{t-k}^t = x_{-k}^0 \text{ for some } k \le t \le \lceil n/2 \rceil - 1 \}.$$
For a fixed $0 < \gamma < 1$, let $\mathcal{S}^{(2)}_{n,k}$ denote the set of strings of length $k+1$ which appear more than $n^{1-\gamma}$ times in $X_{\lceil n/2 \rceil}^n$, that is,
$$\mathcal{S}^{(2)}_{n,k} = \{ x_{-k}^0 \in \mathcal{X}^{k+1} : |\{ \lceil n/2 \rceil + k \le t \le n : X_{t-k}^t = x_{-k}^0 \}| > n^{1-\gamma} \}.$$
Let $\mathcal{S}^n_k = \mathcal{S}^{(1)}_{n,k} \cap \mathcal{S}^{(2)}_{n,k}$.

Let $C(x \mid z_{-k+1}^0 : [n_1, n_2])$ denote the empirical conditional probability of $X_1 = x$ given $X_{-k+1}^0 = z_{-k+1}^0$ from the samples $(X_{n_1}, \ldots, X_{n_2})$, that is,
$$C(x \mid z_{-k+1}^0 : [n_1, n_2]) = \frac{ |\{ n_1 + k \le t \le n_2 : X_{t-k}^t = (z_{-k+1}^0, x) \}| }{ |\{ n_1 + k - 1 \le t \le n_2 - 1 : X_{t-k+1}^t = z_{-k+1}^0 \}| },$$
where $0/0$ is defined to be $0$.

Define the empirical version of $\Delta_k$ as follows:
$$\hat\Delta^n_k = \max_{0 \le i \le n} \; \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}^n_{k+i}} \left| C(x \mid z_{-k+1}^0 : [\lceil n/2 \rceil, n]) - C(x \mid z_{-k-i+1}^0 : [\lceil n/2 \rceil, n]) \right|.$$
Observe that, by ergodicity, for any fixed $k$,
$$\liminf_{n \to \infty} \hat\Delta^n_k \ge \Delta_k \quad \text{almost surely.} \tag{1}$$

We define an estimate $\chi_n$ for the order from the samples $X_0^n$ as follows. Let $0 < \beta < (1-\gamma)/2$ be arbitrary. Set $\chi_0 = 0$, and for $n \ge 1$ let $\chi_n$ be the smallest $0 \le k_n < n$ such that $\hat\Delta^n_{k_n} \le n^{-\beta}$.
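To make the estimator concrete, here is a minimal Python sketch of $\hat\Delta^n_k$ and $\chi_n$. It is one reading of the definitions above, not the authors' code; the function and parameter names are ours, and the brute-force scans are kept for clarity rather than speed.

import math
from collections import Counter

def chi_n(xs, beta=0.2, gamma=0.4):
    # Sketch of the order estimator: returns the smallest k with
    # hat-Delta^n_k <= n^(-beta).  Requires 0 < gamma < 1 and
    # 0 < beta < (1 - gamma)/2.  xs is the sample (X_0, ..., X_n).
    n = len(xs) - 1
    if n < 1:
        return 0
    half = math.ceil(n / 2)
    first, second = xs[:half], xs[half:]   # X_0^{half-1} and X_half^n

    def block_counts(seq, length):
        # empirical counts of all blocks of the given length in seq
        return Counter(tuple(seq[t:t + length])
                       for t in range(len(seq) - length + 1))

    def cond_prob(z, x):
        # C(x | z : [half, n]): fraction of occurrences of the block z
        # in the second half that are followed by the symbol x
        num = den = 0
        for t in range(len(second) - len(z)):
            if tuple(second[t:t + len(z)]) == z:
                den += 1
                num += second[t + len(z)] == x
        return num / den if den else 0.0   # 0/0 := 0

    def S_n(m):
        # strings of length m+1 that occur in the first half and occur
        # more than n^(1-gamma) times in the second half
        seen = block_counts(first, m + 1)
        freq = block_counts(second, m + 1)
        return {w for w in seen if freq[w] > n ** (1 - gamma)}

    for k in range(n):
        delta_hat = 0.0
        for i in range(1, n + 1):
            support = S_n(k + i)
            if not support:
                break                      # still longer blocks cannot qualify
            for w in support:
                z_long, x = w[:-1], w[-1]  # z_long has length k+i
                z_short = z_long[i:]       # its suffix of length k
                delta_hat = max(delta_hat,
                                abs(cond_prob(z_short, x)
                                    - cond_prob(z_long, x)))
        if delta_hat <= n ** (-beta):
            return k
    return n                               # no k < n passed; immaterial for large n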
THEOREM. If the stationary and ergodic time series $\{X_n\}$ taking values from a discrete alphabet happens to be a Markov chain of some finite order, then $\chi_n$ equals the order eventually almost surely; if it is not a Markov chain of any finite order, then $\chi_n \to \infty$ almost surely.
Application: Let $M > 0$ be fixed, and suppose one wants to test whether or not the process is a Markov chain of order less than $M$. One may use $\chi_n$ and say YES if $\chi_n < M$ and say NO otherwise. By the Theorem, the answer will eventually be correct.
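As an illustration, this test can be run with the chi_n sketch above on synthetic data. The binary sampler and the choice M = 3 are ours; since the guarantee is only asymptotic, a moderate sample size can still misreport.

import random

def sample_markov1(n, p=0.9):
    # synthetic binary first-order Markov chain: flip the previous symbol
    # with probability p (strong dependence makes the order visible at
    # moderate sample sizes)
    xs = [0]
    for _ in range(n):
        xs.append(1 - xs[-1] if random.random() < p else xs[-1])
    return xs

xs = sample_markov1(2000)
M = 3
print("YES" if chi_n(xs) < M else "NO")   # true order is 1: eventually YES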
Proof: If the process is a Markov chain, it is immediate that for all $k$ greater than or equal to the order, $\Delta_k = 0$; for $k$ less than the order, $\Delta_k > 0$. If the process is not a Markov chain of any finite order, then $\Delta_k > 0$ for all $k$. Thus by (1), if the process is not Markov then $\chi_n \to \infty$, and if it is Markov then $\chi_n$ is greater than or equal to the order eventually almost surely. It remains to show that $\chi_n$ is less than or equal to the order eventually almost surely, provided that the process is a Markov chain.

Assume that the process is a Markov chain with order $k$. Let $n \ge k$. We estimate the probability of the undesirable event as follows:
$$P(\hat\Delta^n_k > n^{-\beta} \mid X_0^{\lceil n/2 \rceil - 1}) \le \sum_{i=1}^{n} P\Big( \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}^n_{k+i}} \big| C(x \mid z_{-k+1}^0 : [\lceil n/2 \rceil, n]) - C(x \mid z_{-k-i+1}^0 : [\lceil n/2 \rceil, n]) \big| > n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big).$$
We can bound each probability in the sum by a sum of two terms (recall that, the order being $k$, $p(x \mid z_{-k+1}^0) = p(x \mid z_{-k-i+1}^0)$ on the support):
$$P\Big( \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}^n_{k+i}} \big| C(x \mid z_{-k+1}^0 : [\lceil n/2 \rceil, n]) - C(x \mid z_{-k-i+1}^0 : [\lceil n/2 \rceil, n]) \big| > n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big)$$
$$\le P\Big( \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}^n_{k+i}} \big| C(x \mid z_{-k+1}^0 : [\lceil n/2 \rceil, n]) - p(x \mid z_{-k+1}^0) \big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big)$$
$$+ P\Big( \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}^n_{k+i}} \big| p(x \mid z_{-k+1}^0) - C(x \mid z_{-k-i+1}^0 : [\lceil n/2 \rceil, n]) \big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big).$$
We overestimate these probabilities. For any $m \ge 0$ and $x_{-m}^0$, define $\sigma^m_i(x_{-m}^0)$ as the time of the $i$-th occurrence of the string $x_{-m}^0$ in the data segment $X_{\lceil n/2 \rceil}^n$; that is, let $\sigma^m_0(x_{-m}^0) = \lceil n/2 \rceil + m - 1$ and, for $i \ge 1$,
$$\sigma^m_i(x_{-m}^0) = \min\{ t > \sigma^m_{i-1}(x_{-m}^0) : X_{t-m}^t = x_{-m}^0 \}.$$
Now
$$P\Big( \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}^n_{k+i}} \big| C(x \mid z_{-k+1}^0 : [\lceil n/2 \rceil, n]) - C(x \mid z_{-k-i+1}^0 : [\lceil n/2 \rceil, n]) \big| > n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big)$$
$$\le P\Big( \max_{(z_{-k+1}^0, x) \in \mathcal{S}^{(1)}_{n,k}} \sup_{j > n^{1-\gamma}} \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma^{k-1}_r(z_{-k+1}^0)+1} = x \}} - p(x \mid z_{-k+1}^0) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big)$$
$$+ P\Big( \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}^{(1)}_{n,k+i}} \sup_{j > n^{1-\gamma}} \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma^{k+i-1}_r(z_{-k-i+1}^0)+1} = x \}} - p(x \mid z_{-k+1}^0) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big).$$
Since both $\mathcal{S}^{(1)}_{n,k}$ and $\mathcal{S}^{(1)}_{n,k+i}$ depend solely on $X_0^{\lceil n/2 \rceil - 1}$, we get
$$\le \sum_{(z_{-k+1}^0, x) \in \mathcal{S}^{(1)}_{n,k}} \sum_{j=\lceil n^{1-\gamma} \rceil}^{\infty} P\Big( \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma^{k-1}_r(z_{-k+1}^0)+1} = x \}} - p(x \mid z_{-k+1}^0) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big)$$
$$+ \sum_{(z_{-k-i+1}^0, x) \in \mathcal{S}^{(1)}_{n,k+i}} \sum_{j=\lceil n^{1-\gamma} \rceil}^{\infty} P\Big( \Big| \frac{1}{j} \sum_{r=1}^{j} 1_{\{ X_{\sigma^{k+i-1}_r(z_{-k-i+1}^0)+1} = x \}} - p(x \mid z_{-k+1}^0) \Big| > 0.5\, n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big).$$
Each of these terms represents the deviation of an empirical count from its mean. The variables in question are independent, since whenever the block $z_{-k+1}^0$ occurs the next symbol is chosen using the same distribution $p(\cdot \mid z_{-k+1}^0)$. Thus, by Hoeffding's inequality for sums of bounded independent random variables (cf. Hoeffding [5] or Theorem 8.1 of Devroye et al. [4]), and since the cardinality of both $\mathcal{S}^{(1)}_{n,k}$ and $\mathcal{S}^{(1)}_{n,k+i}$ is not greater than $(n+2)/2$, we have
$$P\Big( \max_{(z_{-k-i+1}^0, x) \in \mathcal{S}^n_{k+i}} \big| C(x \mid z_{-k+1}^0 : [\lceil n/2 \rceil, n]) - C(x \mid z_{-k-i+1}^0 : [\lceil n/2 \rceil, n]) \big| > n^{-\beta} \,\Big|\, X_0^{\lceil n/2 \rceil - 1} \Big) \le \frac{n+2}{2} \cdot 4 \sum_{j=\lceil n^{1-\gamma} \rceil}^{\infty} e^{-n^{-2\beta} j / 2}.$$
Thus
$$P(\hat\Delta^n_k > n^{-\beta} \mid X_0^{\lceil n/2 \rceil - 1}) \le 2n(n+2) \sum_{j=\lceil n^{1-\gamma} \rceil}^{\infty} e^{-n^{-2\beta} j / 2} \le 2n(n+2)\big(1 + 2n^{2\beta}\big)\, e^{-n^{1-2\beta-\gamma}/2}.$$
Integrating both sides we get
$$P(\hat\Delta^n_k > n^{-\beta}) \le 2n(n+2)\big(1 + 2n^{2\beta}\big)\, e^{-n^{1-2\beta-\gamma}/2}.$$
The right-hand side is summable provided $2\beta + \gamma < 1$, and so by the Borel-Cantelli lemma $P(\hat\Delta^n_k \le n^{-\beta}$ eventually$) = 1$. Thus $\chi_n \le k$ eventually almost surely, provided the process is Markov with order $k$. The proof of the Theorem is complete.
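For completeness, the summability claimed in the last step rests on an elementary geometric tail bound. With $c = n^{-2\beta}/2$ as above, and using $1 - e^{-c} \ge c/(1+c)$,
$$\sum_{j=\lceil n^{1-\gamma} \rceil}^{\infty} e^{-cj} \le \frac{e^{-c\, n^{1-\gamma}}}{1 - e^{-c}} \le \Big(1 + \frac{1}{c}\Big) e^{-c\, n^{1-\gamma}} = \big(1 + 2n^{2\beta}\big)\, e^{-n^{1-2\beta-\gamma}/2}.$$
Since $2\beta + \gamma < 1$, the exponent $1 - 2\beta - \gamma$ is positive, so the bound on $P(\hat\Delta^n_k > n^{-\beta})$ decays faster than any power of $n$ and $\sum_n P(\hat\Delta^n_k > n^{-\beta}) < \infty$, as the Borel-Cantelli step requires.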
References

[1] D. H. Bailey, Sequential Schemes for Classifying and Predicting Ergodic Processes. Ph.D. thesis, Stanford University, 1976.

[2] I. Csiszár and P. Shields, "The consistency of the BIC Markov order estimator," Annals of Statistics, vol. 28, pp. 1601-1619, 2000.

[3] I. Csiszár, "Large-scale typicality of Markov sample paths and consistency of MDL order estimators," IEEE Transactions on Information Theory, vol. 48, pp. 1616-1628, 2002.

[4] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.

[5] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, pp. 13-30, 1963.

[6] I. Kontoyiannis, P. Algoet, Yu. M. Suhov, and A. J. Wyner, "Nonparametric entropy estimation for stationary processes and random fields, with application to English text," IEEE Transactions on Information Theory, vol. 44, pp. 1319-1327, 1998.

[7] D. S. Ornstein and B. Weiss, "How sampling reveals a process," The Annals of Probability, vol. 18, pp. 905-930, 1990.

[8] D. S. Ornstein and B. Weiss, "Entropy and data compression schemes," IEEE Transactions on Information Theory, vol. 39, pp. 78-83, 1993.

[9] D. S. Ornstein and B. Weiss, "Entropy and recurrence rates for stationary random fields," IEEE Transactions on Information Theory, vol. 48, pp. 1694-1697, 2002.

[10] J. Ziv, "Coding theorems for individual sequences," IEEE Transactions on Information Theory, vol. 24, pp. 405-412, 1978.