Missing Mass of Rank-2 Markov Chains
Prafulla Chandra, Andrew Thangaraj
Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, India 600036
{ee16d402, andrew}@ee.iitm.ac.in
Nived Rajaraman
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
[email protected]
Abstract—Estimation of missing mass with the popular Good-Turing (GT) estimator is well understood in the case where samples are independent and identically distributed (iid). In this article, we consider the same problem when the samples come from a stationary Markov chain with a rank-2 transition matrix, which is one of the simplest extensions of the iid case. We develop an upper bound on the absolute bias of the GT estimator in terms of the spectral gap of the chain and a tail bound on the occupancy of states. Borrowing tail bounds from known concentration results for Markov chains, we evaluate the bound using other parameters of the chain. The analysis, supported by simulations, suggests that, for rank-2 irreducible chains, the GT estimator has bias and mean-squared error falling with the number of samples at a rate that depends loosely on the connectivity of the states in the chain.
I. INTRODUCTION
A. Preliminaries
Consider a Markov chain $X^n = (X_1, X_2, \ldots, X_n)$ with states $X_i \in \mathcal{X} = [K] \triangleq \{1, 2, \ldots, K\}$ and
$$\Pr(X_i = x_i \mid X_{i-1} = x_{i-1}, \ldots, X_1 = x_1) = \Pr(X_2 = x_i \mid X_1 = x_{i-1})$$
for $i = 2, \ldots, n$ and all $x_i \in \mathcal{X}$. The transition probability matrix (t.p.m) of the Markov chain, denoted $P$, is the $K \times K$ matrix with $(i,j)$-th element $P_{ij} \triangleq \Pr(X_2 = j \mid X_1 = i)$. A distribution $\pi = [\pi_1, \ldots, \pi_K]$ on $\mathcal{X} = [K]$ is said to be a stationary or invariant distribution of the Markov chain if $\pi P = \pi$ [1]. We will consider stationary Markov chains for which the distribution of $X_1$ (and all $X_i$) is a stationary distribution.

For $x \in \mathcal{X}$, let
$$F_x(X^n) \triangleq \sum_{i=1}^{n} I(X_i = x),$$
where $I(\cdot)$ is an indicator random variable, denote the number of occurrences of $x$ in $X^n$. For a stationary Markov chain with state distribution $\pi$, the missing mass, i.e. the total probability of letters that did not appear in $X^n$, denoted $M(X^n, \pi)$, is defined as
$$M(X^n, \pi) \triangleq \sum_{x \in \mathcal{X}} \pi_x\, I(F_x(X^n) = 0).$$
For $l \geq 1$, let
$$\varphi_l(X^n) \triangleq \sum_{x \in \mathcal{X}} I(F_x(X^n) = l)$$
denote the number of letters that have occurred $l$ times in $X^n$. The popular and standard Good-Turing estimator [2] for $M(X^n, \pi)$ is defined as
$$G(X^n) \triangleq \frac{\varphi_1(X^n)}{n}. \quad (1)$$
We will drop the arguments $X^n$, $\pi$ whenever it is non-ambiguous.
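As a concrete illustration of the definitions above, the following minimal Python/NumPy sketch computes $\varphi_1$, the Good-Turing estimate $G(X^n)$ and the true missing mass $M(X^n, \pi)$ for a sample path. It is our own illustration (the helper names `good_turing` and `missing_mass` are ours, not from any library), shown here for an iid toy example.

```python
import numpy as np

def missing_mass(sample, pi):
    """True missing mass: total stationary probability of unseen states."""
    seen = set(sample)
    return sum(p for x, p in enumerate(pi) if x not in seen)

def good_turing(sample):
    """Good-Turing estimate G(X^n) = phi_1(X^n) / n."""
    n = len(sample)
    _, counts = np.unique(sample, return_counts=True)
    phi_1 = np.sum(counts == 1)          # number of states seen exactly once
    return phi_1 / n

# toy example: iid uniform samples over K = 100 states
rng = np.random.default_rng(0)
K, n = 100, 60
pi = np.full(K, 1.0 / K)
sample = rng.integers(K, size=n)
print(good_turing(sample), missing_mass(sample, pi))
```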
II. PRIOR WORK AND PROBLEM SETTING

When the t.p.m has a rank of 1, the chain is iid and the bias of Good-Turing estimators falls as $1/n$ [3]. So, in the iid case, the number of letters occurring once (i.e. $\varphi_1$) acts as a good proxy for the missing letters in a sequence. However, the Markov case can be very different, in general. Consider a t.p.m with $P_{ii} = 1 - \eta$ and $P_{ij} = \eta/(K-1)$, $i \neq j$. For small $\eta$, every state is "sticky" and $\varphi_1$ appears to be inadequate to capture the mass of missing letters. As expected, a simulation of Good-Turing estimators for such sticky chains shows a non-vanishing bias (a simulation sketch is given at the end of this section).

Markov models occur naturally in language modeling [4] and several other applications in practice, where Good-Turing estimators and their modifications are routinely used. Therefore, it is interesting to analytically understand and study Good-Turing estimators for missing mass in Markov chains. Missing mass has been studied extensively in the iid case [3], [5]–[10], and estimation in Markov chains has been studied as well [11]–[14]. Recently, there has been interest in studying concentration and estimation of missing mass in Markov chains [15], [16].

The sticky chain example above has a full-rank t.p.m, while the iid case has a rank-1 t.p.m. This naturally motivates the study of Good-Turing estimators for chains with t.p.ms of other ranks. In particular, in this article, we consider chains with rank-2 t.p.ms, which are, in some sense, the simplest in the non-iid Markovian case.

Consider a Markov chain with a rank-2 t.p.m $P$, which we will call, loosely, a rank-2 Markov chain. Since $P$ has all entries in $[0,1]$ with each row adding to 1 and it has rank 2, the eigenvalues of $P$ will be $1, \lambda, 0, \ldots, 0$, with $-1 \leq \lambda \leq 1$ (by the Perron-Frobenius theorem) [17]. The value of $\lambda$ determines several important properties of the chain. If $\lambda = 1$, the chain is reducible. If $\lambda = -1$, the chain is periodic with period 2. For $-1 < \lambda < 1$, the chain is irreducible and aperiodic.
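The sticky-chain observation is easy to reproduce numerically. The sketch below is our own illustration (the helpers `simulate_chain` and `gt_minus_missing` are ours): it builds the t.p.m with $P_{ii} = 1-\eta$, $P_{ij} = \eta/(K-1)$, simulates stationary paths, and averages $G(X^n) - M(X^n,\pi)$; for small $\eta$ the average stays away from zero as $n$ grows.

```python
import numpy as np

def simulate_chain(P, pi, n, rng):
    """Sample a stationary path X_1,...,X_n from t.p.m P with X_1 ~ pi."""
    K = len(pi)
    x = rng.choice(K, p=pi)
    path = [x]
    for _ in range(n - 1):
        x = rng.choice(K, p=P[x])
        path.append(x)
    return np.array(path)

def gt_minus_missing(P, pi, n, trials, rng):
    """Average of G(X^n) - M(X^n, pi) over independent stationary paths."""
    diffs = []
    for _ in range(trials):
        path = simulate_chain(P, pi, n, rng)
        counts = np.bincount(path, minlength=len(pi))
        g = np.sum(counts == 1) / n          # Good-Turing estimate
        m = pi[counts == 0].sum()            # true missing mass
        diffs.append(g - m)
    return np.mean(diffs)

rng = np.random.default_rng(1)
K, eta = 50, 0.05
pi = np.full(K, 1.0 / K)                     # uniform is stationary by symmetry
P = np.full((K, K), eta / (K - 1))
np.fill_diagonal(P, 1.0 - eta)
for n in (100, 400, 1600):
    print(n, gt_minus_missing(P, pi, n, 200, rng))   # bias does not vanish
```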
III. RESULTS

The main results of this article provide bounds on the absolute bias of Good-Turing estimators of missing mass, $|E[G(X^n) - M(X^n, \pi)]|$, on rank-2 stationary Markov chains. To the best of our knowledge, these are perhaps the first such bounds to have appeared in the literature.

For the purpose of bounding, the bias is split into two significant components. The first component has contributions from states $x$ for which $\pi_x$ is low. This component is bounded in terms of the spectral gap
$$\beta(P) \triangleq 1 - \lambda \in [0, 2]. \quad (2)$$
For bounding the remaining part of the bias, we use tail bounds on the occupancy $F_x(X^n)$ for states $x$ with a high enough value of $\pi_x$.

The following theorem states the main starting point of our bounds on bias. We let $\bar{x} \triangleq 1 - x$.

Theorem 1.
Let $X^n$ be a stationary Markov chain with state distribution $\pi$ and a rank-2 t.p.m with spectral gap $\beta$. Let $1/n < \delta \leq \beta/8$. Then, there exist universal constants $c_1 > 0$ and $c_2 > 0$ such that
$$|E[G(X^n) - M(X^n, \pi)]| \leq \frac{\delta}{\beta}\left(c_1 + \frac{c_2}{n\beta}\right) + 2\max_{x : \pi_x > \delta,\, P_{xx} \neq \pi_x} \Pr(F_x(X^n) \leq 1) + O(1/n). \quad (3)$$

As $\beta \to 0$, the chain becomes reducible, and the bias bound grows unbounded, as expected, because consistent estimation is not possible for missing mass in reducible chains [15]. For example, consider the chain with t.p.m
$$\begin{bmatrix} (2/K)_{K/2 \times K/2} & 0_{K/2 \times K/2} \\ 0_{K/2 \times K/2} & (2/K)_{K/2 \times K/2} \end{bmatrix},$$
where $b_{r \times c}$ denotes the $r \times c$ all-$b$ matrix, with the uniform distribution on all states as $\pi$. This chain does not visit half the states, making the estimation of missing mass inconsistent.

For a non-vanishing $\beta$, using Theorem 1 with a specific tail bound for $\Pr(F_x \leq 1)$ and a choice for $\delta$, we obtain bounds that depend only on the chain's parameters. Tail bounds, obtained from concentration inequalities for Markov chains, typically use other parameters derived from the t.p.ms. The inequality of Kontorovich and Ramanan in [18] uses a parameter $\theta(P)$ defined as
$$\theta(P) \triangleq \sup_{x, x' \in \mathcal{X}} d_{TV}\big(P(\cdot \mid x),\, P(\cdot \mid x')\big), \quad (4)$$
where $d_{TV}(p, q) = \frac{1}{2}\sum_{x \in \mathcal{X}} |p_x - q_x|$ is the total variation (TV) distance between two distributions $p$ and $q$ on the same alphabet $\mathcal{X}$. We will call $\theta(P)$, or simply $\theta$, the maximum TV gap of the chain.
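Numerically, $\theta(P)$ in (4) is simply the largest TV distance between two rows of $P$. The short sketch below is our own illustration (the helper name `max_tv_gap` is ours); it evaluates (4) by brute force over all pairs of rows.

```python
import numpy as np

def max_tv_gap(P):
    """theta(P): maximum total variation distance between two rows of P."""
    K = P.shape[0]
    theta = 0.0
    for i in range(K):
        for j in range(i + 1, K):
            theta = max(theta, 0.5 * np.abs(P[i] - P[j]).sum())
    return theta

# sticky chain from Section II: any two rows differ only in two coordinates
K, eta = 50, 0.05
P = np.full((K, K), eta / (K - 1))
np.fill_diagonal(P, 1.0 - eta)
print(max_tv_gap(P))     # close to 1 when eta is small
```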
Corollary 2. For a rank-2 Markov chain with spectral gap $\beta$ and maximum TV gap $\theta$ satisfying $\beta \geq \beta_0 \triangleq 8\left(\frac{1}{\bar{\theta} n^c} + \frac{1}{n}\right)$ for some $c \in (0, 0.5)$, we have
$$|E[G(X^n) - M(X^n, \pi)]| \leq \frac{1}{\beta}\left(\frac{1}{n} + \frac{1}{\bar{\theta} n^c}\right)\left(c_1 + \frac{c_2}{n\beta}\right) + 4 e^{-0.5\, n^{(1-2c)}} + O(1/n). \quad (5)$$

In the above corollary, a tail bound from the concentration inequality of [18] is used for $\Pr(F_x \leq 1)$ in Theorem 1 with $\delta = \beta_0/8$. The bias bound is now expressed in terms of $n$, the spectral gap $\beta$ and the maximum TV gap $\theta$, and the rate of fall of absolute bias with $n$ is governed by how the two gap parameters $\beta$ and $\theta$ vary with $n$.

If two rows of the t.p.m are disjoint in their non-zero positions, we get $\theta = 1$ and Corollary 2 does not apply. We will need to use other concentration inequalities in such cases. The concentration inequality for Markov chains due to Naor, Rao and Regev in [19] uses a parameter $\lambda_\pi(P)$ defined as
$$\lambda_\pi(P) \triangleq \sup_{z : \sum_{t=1}^{K} \pi_t z_t^2 = 1} \sqrt{\sum_{i=1}^{K} \pi_i \left(\sum_{j=1}^{K} (P_{ij} - \pi_j) z_j\right)^2}, \quad (6)$$
which is the norm of $P - \mathbf{1}_{K \times 1}\pi$ in the Hilbert space $L^2(\pi)$ with $\|z\|_\pi^2 = \sum_i \pi_i z_i^2$. If $P$ is orthonormally diagonalizable in $L^2(\pi)$, $\lambda_\pi$ is equal to the absolute second eigenvalue $|\lambda|$ in the rank-2 case. However, in most other cases, the optimization in (6) needs to be evaluated to find $\lambda_\pi$, and the evaluation is usually feasible in the rank-2 case. We will call the parameter $\lambda_\pi(P)$, or simply $\lambda_\pi$, the non-iid weighted norm of the chain.
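Since $\lambda_\pi(P)$ is the norm of $P - \mathbf{1}_{K\times 1}\pi$ as an operator on $L^2(\pi)$, it can be evaluated numerically as a weighted spectral norm. The sketch below is our own (the helper name `lambda_pi` is ours), under the assumption that the supremum in (6) is over $\|z\|_\pi = 1$ as in the reconstruction above; in that case it equals the largest singular value of $D^{1/2}(P - \mathbf{1}\pi)D^{-1/2}$ with $D = \mathrm{diag}(\pi)$.

```python
import numpy as np

def lambda_pi(P, pi):
    """Non-iid weighted norm: operator norm of (P - 1*pi^T) on L2(pi),
    computed as the largest singular value of D^{1/2} (P - 1 pi^T) D^{-1/2}."""
    K = len(pi)
    A = P - np.outer(np.ones(K), pi)
    D_half = np.diag(np.sqrt(pi))
    D_half_inv = np.diag(1.0 / np.sqrt(pi))
    return np.linalg.norm(D_half @ A @ D_half_inv, ord=2)

# rank-1 (iid) sanity check: every row equals pi, so lambda_pi should be ~0
pi = np.array([0.5, 0.3, 0.2])
P_iid = np.tile(pi, (3, 1))
print(lambda_pi(P_iid, pi))
```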
Corollary 3. For a rank-2 Markov chain with spectral gap $\beta$ and non-iid weighted norm $\lambda_\pi$ satisfying $\beta \geq \beta_0 \triangleq 8\left(3\sqrt{\frac{\ln n}{n \bar{\lambda}_\pi}} + \frac{1}{n}\right)$, we have
$$|E[G(X^n) - M(X^n, \pi)]| \leq \frac{1}{\beta}\left(c_1 + \frac{c_2}{n\beta}\right)\left(\sqrt{\frac{\ln n}{\bar{\lambda}_\pi n}} + \frac{1}{n}\right) + O(1/n), \quad (7)$$
where $c_1, c_2 > 0$ are universal constants.

In the above corollary, a tail bound from the concentration inequality of [19] is used for $\Pr(F_x \leq 1)$ in Theorem 1 with $\delta = \beta_0/8$.

We will now consider a few specific types of rank-2 irreducible chains and use Corollaries 2 and 3 to bound the absolute bias of the Good-Turing estimator for missing mass in terms of $\beta$, $\theta$ and the non-iid weighted norm $\lambda_\pi$.

Chains with $\theta = 0$ (iid): The iid case $P = \mathbf{1}_{K \times 1}\pi$ (i.e. each row equal to $\pi$) has $\beta = 1$ and $\theta = 0$, resulting in a $1/\sqrt{n}$ upper bound from Corollary 2, which, while capturing the fall of bias to zero, is poorer in rate than the well-known $1/n$ bound.

Non-iid chains: Consider three $K \times K$ rank-2 block-structured t.p.ms $P_1$, $P_2$ and $P_3$ whose non-zero blocks are all-$a$ blocks $a_{r \times c}$, with $a = 2/(K_1 + K_2)$ a normalizing constant, and whose zero blocks are placed so that, in all three cases, States 1 to $K_1/2$ and States $K_1/2 + K_2 + 1$ to $K$ are connected only through the $K_2$ states from $(K - K_2)/2 + 1$ to $(K + K_2)/2$; the three t.p.ms differ in how the rows of the two outer groups of states overlap on the middle states. Motivated by the iid worst case, we let $K_1 = n$ and consider $K_2 = \Theta(n^\kappa)$ for $\kappa \leq 1$ to control the connectivity in the chains. Table I lists the dominant terms in the parameters $\beta$, $\bar{\theta}$, $\bar{\lambda}_\pi$ and the bounds from Corollaries 2 and 3 for the t.p.ms $P_1$, $P_2$ and $P_3$ (up to a multiplicative constant).

TABLE I: Dominant terms of $\beta$, $\bar{\theta}$, $\bar{\lambda}_\pi$ and bounds (up to a multiplicative constant), for $K_1 = n$, $K_2 = n^\kappa$.

  t.p.m | $\beta$       | $\bar{\theta}$ | $\bar{\lambda}_\pi$ | Cor. 2 ($\kappa > 3/4$)   | Cor. 3 ($\kappa > 2/3$)
  $P_1$ | $n^\kappa/n$  | $n^\kappa/n$   | $n^\kappa/n$        | $1/n^{2\kappa - 3/2}$     | $\sqrt{\ln n / n^{3\kappa - 2}}$
  $P_2$ | $n^\kappa/n$  | $-$            | $-$                 | $-$                       | $-$
  $P_3$ | $n^\kappa/n$  | $-$            | $n^\kappa/n$        | $-$                       | $\sqrt{\ln n / n^{3\kappa - 2}}$

The three chains considered above illustrate different regimes of values for the parameters $\theta$ and $\lambda_\pi$. The corollaries apply when $\bar{\theta} \neq 0$ or $\bar{\lambda}_\pi \neq 0$, and when $\kappa$ is large enough to satisfy the lower bound on the spectral gap $\beta$ in the corollaries. The case of $\kappa = 1$ results in a bound tending to $1/\sqrt{n}$, and lower $\kappa$ results in weaker bounds.

Fig. 1 shows simulation plots of the rate of fall of absolute bias and mean-squared error versus $n$ for the three t.p.ms above with $\kappa = 1$ and $\kappa = 1/2$. We observe that both the absolute bias and the mean-squared error fall with $n$ in all cases. For the highly connected case of $\kappa = 1$, the rate of fall in simulations is close to $1/n$. As $\kappa$ decreases and connectivity reduces, the rate of fall of both absolute bias and MSE reduces.
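Curves of this type can be reproduced, for any given stationary rank-2 chain, with a plain Monte Carlo estimate of the bias and MSE of the Good-Turing estimator. The sketch below is our own scaffolding (the helper name `bias_and_mse` is ours); the t.p.m $P$ and its stationary distribution $\pi$ must be supplied, e.g. one of $P_1$–$P_3$. The iid uniform chain is used here only as a sanity check.

```python
import numpy as np

def bias_and_mse(P, pi, n, trials, seed=0):
    """Monte Carlo estimate of E[G - M] and E[(G - M)^2] for a stationary chain."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    diffs = np.empty(trials)
    for t in range(trials):
        x = rng.choice(K, p=pi)              # X_1 ~ pi (stationary start)
        counts = np.zeros(K, dtype=int)
        counts[x] += 1
        for _ in range(n - 1):
            x = rng.choice(K, p=P[x])
            counts[x] += 1
        g = np.sum(counts == 1) / n          # Good-Turing estimate
        m = pi[counts == 0].sum()            # missing mass
        diffs[t] = g - m
    return diffs.mean(), np.mean(diffs ** 2)

# sanity check: iid uniform chain (rank 1); bias falls roughly as 1/n
K = 200
pi = np.full(K, 1.0 / K)
P = np.tile(pi, (K, 1))
for n in (100, 200, 400):
    print(n, bias_and_mse(P, pi, n, trials=300))
```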
Periodic case: Consider the following $n \times n$ t.p.m
$$\Pi_r \otimes (r/n)_{n/r \times n/r},$$
which is a Kronecker product of an $r \times r$ right-shift-by-1 permutation matrix $\Pi_r$ and an $n/r \times n/r$ matrix with each entry equal to $r/n$. The rank of the above t.p.m is $r$, the chain is irreducible and periodic with all states having period $r$, and the uniform distribution is the unique stationary distribution. For $r = 2$, we get the rank-2 irreducible, period-2 chain. Using a direct computation in a manner similar to the iid case, we obtain the following exact characterization of bias:
$$|E[G(X^n) - M(X^n, \pi)]| = \frac{r}{n}\left(1 - \frac{r}{n}\right)^{\frac{n}{r} - 1}. \quad (8)$$
For sub-linear $r$, we see that the bias tends to zero. However, for $r \propto n$, we get a bias that is constant (a numerical check of (8) is sketched at the end of this section).

Fig. 1: Rate of fall of absolute bias ($|\mathrm{ME}|$) and mean-squared error (MSE). (The two panels plot $\log|\mathrm{ME}|/\log n$ and $\log|\mathrm{MSE}|/\log n$ versus $n$ for $P_1$, $P_2$, $P_3$ with $\kappa = 1$ and $\kappa = 1/2$.)

In summary, we see that the bias of the Good-Turing estimator for missing mass in rank-2 Markov chains can be analytically bounded using the spectral gap ($\beta$ in (2)) and other parameters, such as the maximum TV gap ($\theta$ in (4)) or the non-iid weighted norm ($\lambda_\pi$ in (6)), that control the concentration of state occupancies. If better tail bounds can be developed, the rate of fall of bias predicted from the bounds can be tightened further.

From examples such as the sticky chain and the periodic case above, we expect a constant bias for the Good-Turing estimator in certain higher-rank Markov chains. Improvement in estimation methods for missing mass is needed for such higher-rank chains.
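The Kronecker construction and the exact bias expression (8) are easy to check numerically. The sketch below is our own (the helpers `periodic_tpm` and `empirical_bias` are ours): it builds the period-$r$ t.p.m, simulates it from the uniform stationary distribution, and compares the empirical absolute bias with $\frac{r}{n}(1 - \frac{r}{n})^{n/r - 1}$.

```python
import numpy as np

def periodic_tpm(n_states, r):
    """Kronecker product of an r x r right-shift-by-1 permutation and an
    (n/r) x (n/r) all-(r/n) matrix; rank r, period r, uniform stationary."""
    shift = np.roll(np.eye(r), 1, axis=1)    # right-shift-by-1 permutation
    block = np.full((n_states // r, n_states // r), r / n_states)
    return np.kron(shift, block)

def empirical_bias(P, n, trials, seed=0):
    rng = np.random.default_rng(seed)
    K = P.shape[0]
    pi = np.full(K, 1.0 / K)
    diffs = []
    for _ in range(trials):
        x = rng.choice(K, p=pi)
        counts = np.zeros(K, dtype=int); counts[x] += 1
        for _ in range(n - 1):
            x = rng.choice(K, p=P[x]); counts[x] += 1
        diffs.append(np.sum(counts == 1) / n - pi[counts == 0].sum())
    return np.mean(diffs)

n, r = 120, 2                                 # n states, n samples, period r
P = periodic_tpm(n, r)
predicted = (r / n) * (1 - r / n) ** (n / r - 1)   # formula (8)
print(predicted, abs(empirical_bias(P, n, trials=500)))
```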
IV. PROOF OF THEOREMS AND COROLLARIES

A. Theorem 1
We begin with the result of [15, Theorem 1], which states that for a Markov chain $X^n$ with state distribution $\pi$,
$$|E[G(X^n) - M(X^n, \pi)]| \leq \frac{1}{n} + \left|\sum_{x \in \mathcal{X}} \pi_x \bar{\pi}_x \Gamma_x\right|, \quad (9)$$
where
$$n\,\Gamma_x = \sum_{i=1}^{n}\Big[\Pr\big(F_x(X^n_{\sim i}) = 0 \mid X_i = x\big) - \Pr\big(F_x(X^n_{\sim i}) = 0 \mid X_i \neq x\big)\Big]$$
and $X^n_{\sim i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$.

Lemma 4.
For a rank-2 t.p.m $P$, we have the following:
1) For $x \in \mathcal{X}$ with $P_{xx} = \pi_x$, $\Gamma_x = 0$.
2) For $\delta$ satisfying $1/n < \delta \leq \beta/8$, there exist constants $c_1$ and $c_2$ such that
$$\sum_{x : \pi_x \leq \delta,\, P_{xx} \neq \pi_x} \pi_x \bar{\pi}_x |\Gamma_x| \leq \frac{\delta}{\beta}\left(c_1 + \frac{c_2}{n\beta}\right).$$
Proof.
See Section V.

Using Part 1 of Lemma 4, it suffices to upper bound $\left|\sum_{x \in \mathcal{X} : P_{xx} \neq \pi_x} \pi_x \bar{\pi}_x \Gamma_x\right|$. In order to carry out this analysis, we show that $\pi_x$ and $\Gamma_x$ bear an inverse relationship. For this, we fix a threshold $\delta$ (to be decided later) and partition the set of states with $P_{xx} \neq \pi_x$ into two sets
$$A_1(\delta) = \{x \in \mathcal{X} : \pi_x \leq \delta,\ P_{xx} \neq \pi_x\}, \qquad A_2(\delta) = \{x \in \mathcal{X} : \pi_x > \delta,\ P_{xx} \neq \pi_x\}.$$
Observe that
$$\pi_x \bar{\pi}_x \Gamma_x = \frac{1}{n}\sum_{i=1}^{n}\Big(\bar{\pi}_x \Pr(F_x(X^n_{\sim i}) = 0,\, X_i = x) - \pi_x \Pr(F_x(X^n_{\sim i}) = 0,\, X_i \neq x)\Big) = \frac{\bar{\pi}_x}{n}\Pr(F_x(X^n) = 1) - \pi_x \Pr(F_x(X^n) = 0).$$
Summing over $x \in A_2(\delta)$, we get
$$\left|\sum_{x \in A_2(\delta)} \pi_x \bar{\pi}_x \Gamma_x\right| = \left|\sum_{x \in A_2(\delta)} \frac{\bar{\pi}_x}{n}\Pr(F_x(X^n) = 1) - \pi_x \Pr(F_x(X^n) = 0)\right| \leq \sum_{x \in A_2(\delta)} \left(\frac{\bar{\pi}_x}{n} + \pi_x\right)\Pr(F_x(X^n) \leq 1) \leq 2\max_{x \in A_2(\delta)} \Pr(F_x(X^n) \leq 1), \quad (10)$$
where, in the last inequality, we use $\bar{\pi}_x < 1$ and $|A_2(\delta)| \leq 1/\delta < n$. Using Part 2 of Lemma 4 and (10), the proof is complete.

B. Corollaries 2 and 3
For any $x$, we have $E[F_x(X^n)/n] = \pi_x$. For large $\pi_x$, if $F_x(X^n)$ concentrates, we can bound the tail probability $\Pr(F_x(X^n) \leq 1)$ using concentration of $F_x(X^n)$ as follows:
$$\Pr(F_x(X^n) \leq 1) = \Pr\big(\pi_x - F_x(X^n)/n \geq \pi_x - 1/n\big) \leq \Pr\big(|\pi_x - F_x(X^n)/n| \geq \pi_x - 1/n\big). \quad (11)$$
Using the concentration results in [18] and [19], we obtain the following:

Lemma 5.
1) If the chain has maximum TV gap $\theta < 1$,
$$\Pr\big(|\pi_x - F_x(X^n)/n| \geq \epsilon\big) \leq 2 e^{-0.5\, n \bar{\theta}^2 \epsilon^2}.$$
2) If the chain has non-iid weighted norm $\lambda_\pi < 1$,
$$\Pr\big(|\pi_x - F_x(X^n)/n| \geq \epsilon\big) \leq C \left(\frac{q}{\bar{\lambda}_\pi n}\right)^{q/2} \pi_x\, \epsilon^{-q},$$
where $C$ is a constant and $q \geq 2$.

Proof. See Section V.
1) Proof of Corollary 2:
We consider the set $A_2(\delta)$ with
$$\delta = \frac{1}{\bar{\theta} n^c} + \frac{1}{n}, \quad c \in (0, 0.5). \quad (12)$$
For $x \in A_2(\delta)$, we have $\pi_x - \frac{1}{n} > \frac{1}{\bar{\theta} n^c}$. So, setting $\epsilon = \frac{1}{\bar{\theta} n^c}$ in Part 1 of Lemma 5, we get
$$\Pr(F_x(X^n) \leq 1) \leq 2 e^{-0.5\, n^{(1-2c)}}.$$
Using $\delta$ from (12) and the above bound for $\Pr(F_x(X^n) \leq 1)$ in Theorem 1, we get (5), and the proof of Corollary 2 is complete.
2) Proof of Corollary 3:
We consider the set $A_2(\delta)$ with
$$\delta = 3\sqrt{\frac{\ln n}{n \bar{\lambda}_\pi}} + \frac{1}{n}. \quad (13)$$
For $x \in A_2(\delta)$, we have $\pi_x - 1/n > 3\sqrt{\frac{\ln n}{n \bar{\lambda}_\pi}}$. So, setting $\epsilon = 3\sqrt{\frac{\ln n}{n \bar{\lambda}_\pi}}$ and $q = 3\ln n$ in Part 2 of Lemma 5, we get
$$\max_{x \in A_2(\delta)} \Pr(F_x(X^n) \leq 1) \leq C/n^{1.5}.$$
Using $\delta$ from (13) and the above bound for $\Pr(F_x(X^n) \leq 1)$ in Theorem 1, we get (7), and the proof of Corollary 3 is complete.

V. PROOFS OF LEMMAS
Consider a rank-2 $K \times K$ irreducible t.p.m $P$ with stationary distribution $\pi = [\pi_1 \cdots \pi_K]^T$. There exist vectors $u = [u_1 \cdots u_K]^T$ and $v = [v_1 \cdots v_K]^T$ satisfying the following decompositions.
1) If $P$ is diagonalizable, with spectral gap $\beta \neq 1$, then $P = RDS$ with $SR = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, where $R = \begin{bmatrix} \mathbf{1}_{K \times 1} & v \end{bmatrix}$, $D = \begin{bmatrix} 1 & 0 \\ 0 & 1-\beta \end{bmatrix}$ and $S = \begin{bmatrix} \pi^T \\ u^T \end{bmatrix}$. Since $P$ is a t.p.m, we have, for $1 \leq i, j \leq K$,
$$0 \leq \pi_j + (1-\beta) v_i u_j \leq 1. \quad (14)$$
2) If $P$ is non-diagonalizable, with spectral gap $\beta = 1$, then $P = RS$ with $SR = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$. Since $P$ is a t.p.m, we have, for $1 \leq i, j \leq K$, $0 \leq \pi_j + v_i u_j \leq 1$.
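The decomposition in case 1) can be checked numerically from the eigendecomposition of $P$. The sketch below is our own illustration (the helper name `rank2_decomposition` is ours), assuming $P$ is a diagonalizable rank-2 t.p.m with real eigenvalues $1, \lambda, 0, \ldots, 0$ and $\lambda \neq 0$.

```python
import numpy as np

def rank2_decomposition(P):
    """Return pi, v, u, beta with P = R D S, R = [1 v], S = [pi^T; u^T],
    D = diag(1, 1 - beta) and S R = I (diagonalizable rank-2 case)."""
    evals, right = np.linalg.eig(P)
    order = np.argsort(-np.abs(evals))           # eigenvalue 1 first, then lambda
    lam = evals[order[1]].real
    wl, left = np.linalg.eig(P.T)                # left eigenvectors of P
    pi = left[:, np.argmin(np.abs(wl - 1))].real
    pi = pi / pi.sum()                           # stationary distribution
    v = right[:, order[1]].real                  # right eigenvector of lambda
    u = left[:, np.argmin(np.abs(wl - lam))].real
    u = u / (u @ v)                              # normalize so that u.v = 1
    return pi, v, u, 1.0 - lam

# two-row (rank-2) example: rows are either p or q
p, q = np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.2, 0.7])
P = np.vstack([p, q, q])
pi, v, u, beta = rank2_decomposition(P)
R = np.column_stack([np.ones(3), v])
S = np.vstack([pi, u])
D = np.diag([1.0, 1.0 - beta])
print(np.allclose(R @ D @ S, P), np.allclose(S @ R, np.eye(2)))
```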
A. Lemma 4, Part 2 (see [20] for Part 1)

The quantity $\Gamma_x$, when $P_{xx} \neq \pi_x$, can be expressed in terms of some more spectral parameters, for both the diagonalizable and the non-diagonalizable cases.

Lemma 6. If $P$ is diagonalizable and $x \in \mathcal{X}$ with $P_{xx} \neq \pi_x$,
$$\Gamma_x = \bar{\beta} v_x u_x\left[\Big((\lambda^{\sim x}_1)^{n-1} - (\lambda^{\sim x}_2)^{n-1}\Big)\frac{\Delta_x^{-1/2}}{\bar{\pi}_x} - \Big((\lambda^{\sim x}_1)^{n-1} + (\lambda^{\sim x}_2)^{n-1}\Big)\Big(1 - \frac{1}{n}\Big)\frac{\beta}{\Delta_x} + \frac{2}{n}\,\frac{\beta}{\Delta_x}\,\frac{\lambda^{\sim x}_1 \lambda^{\sim x}_2}{\lambda^{\sim x}_1 - \lambda^{\sim x}_2}\Big((\lambda^{\sim x}_1)^{n-2} - (\lambda^{\sim x}_2)^{n-2}\Big)\right], \quad (15)$$
where
$$\lambda^{\sim x}_i = 0.5\left(\bar{\pi}_x + \bar{\beta}(1 - v_x u_x) + (-1)^{i+1}\sqrt{\Delta_x}\right), \quad i = 1, 2,$$
are the eigenvalues of the matrix $\left(I_{2 \times 2} - \begin{bmatrix} \pi_x \\ u_x \end{bmatrix}\begin{bmatrix} 1 & v_x \end{bmatrix}\right)D$, and $\Delta_x = \left(\beta - \pi_x + \bar{\beta} v_x u_x\right)^2 + 4\bar{\beta}\pi_x v_x u_x$.

Proof.
See [20] for a proof.

Now, $|\Gamma_x|$ is upper-bounded by the sum of the absolute values of all the terms on the RHS of (15). To simplify the bound, we use $|\lambda^{\sim x}_i| < 1$ and $1 - 1/n < 1$ to get
$$|\Gamma_x| \leq \left(\frac{\Delta_x^{-1/2}}{\bar{\pi}_x} + \frac{2\beta}{\Delta_x} + \frac{2\beta}{n\,\Delta_x^{3/2}}\right)|\bar{\beta} v_x u_x|. \quad (16)$$
For $x \in A_1(\delta)$, we have $\pi_x \leq \delta$ and $\delta \leq \beta/8 \leq 1/4$. Using this and (14), it is possible to show that

Claim 1: $\Delta_x \geq \beta^2/2$ for $x \in A_1(\delta)$.

Using these bounds in (16), we get
$$|\Gamma_x| \leq \frac{c_1'}{\beta}\left(1 + \frac{c_2'}{n\beta}\right)|\bar{\beta} v_x u_x|, \quad x \in A_1(\delta). \quad (17)$$
Next, we claim that

Claim 2: $\sum_{x \in A_1(\delta)} |\bar{\beta} v_x u_x| \leq 3$.

Using this and (17), we get
$$\sum_{x \in A_1(\delta)} \pi_x \bar{\pi}_x |\Gamma_x| \leq \frac{\delta}{\beta}\left(c_1 + \frac{c_2}{n\beta}\right),$$
where we have bounded $\bar{\pi}_x \leq 1$ and $\pi_x < \delta$. This completes the proof for the diagonalizable case.
1) Proofs of Claims:
Claim 1:
Using $\pi_x \leq \delta \leq \beta/8$ for $x \in A_1(\delta)$ and $\bar{\beta} v_x u_x \geq -\pi_x$ from (14), we get (i) $\beta - \pi_x + \bar{\beta} v_x u_x \geq \beta - 2\pi_x \geq 0$, and (ii) $4\pi_x \bar{\beta} v_x u_x \geq -4\pi_x^2$. This implies that
$$\Delta_x = (\beta - \pi_x + \bar{\beta} v_x u_x)^2 + 4\pi_x \bar{\beta} v_x u_x \geq (\beta - 2\pi_x)^2 - 4\pi_x^2 = \beta^2 - 4\beta\pi_x \geq \beta^2/2, \quad (18)$$
and hence $\Delta_x \geq \beta^2/2$ for $x \in A_1(\delta)$.

Claim 2: To prove this, we partition $A_1(\delta)$ into two sets, $A_1^-(\delta) = \{x \in A_1(\delta) : \bar{\beta} v_x u_x \leq 0\}$ and its complement, $A_1^+(\delta) = A_1(\delta) \setminus A_1^-(\delta)$.
1) For $x \in A_1^-(\delta)$, using $\bar{\beta} v_x u_x \leq 0$ and $\bar{\beta} v_x u_x \geq -\pi_x$ from (14), we get $0 \geq \sum_{x \in A_1^-(\delta)} \bar{\beta} v_x u_x \geq -\sum_{x \in A_1^-(\delta)} \pi_x \geq -1$, which implies $\left|\sum_{x \in A_1^-(\delta)} \bar{\beta} v_x u_x\right| \leq 1$.
2) On the other hand, to deal with the sum over $A_1^+(\delta)$, note that $\sum_{x \in \mathcal{X}} v_x u_x = 1$, since the matrix product $SR = I$. Therefore,
$$\sum_{x \in \mathcal{X} : \bar{\beta} v_x u_x > 0} \bar{\beta} v_x u_x = \bar{\beta} - \sum_{x \in \mathcal{X} : \bar{\beta} v_x u_x \leq 0} \bar{\beta} v_x u_x.$$
Since the summation on the right in the above equation is bounded in absolute value by 1 (because of (14) again), we get $\left|\sum_{x \in A_1^+(\delta)} \bar{\beta} v_x u_x\right| \leq \sum_{x \in \mathcal{X} : \bar{\beta} v_x u_x > 0} \bar{\beta} v_x u_x \leq 1 + \bar{\beta} \leq 2$.

Lemma 7. If $P$ is non-diagonalizable and $x \in \mathcal{X}$ with $P_{xx} \neq \pi_x$,
$$\Gamma_x = \frac{v_x u_x}{\Delta_x}\left[\Big((\lambda^{\sim x}_1)^{n-1} - (\lambda^{\sim x}_2)^{n-1}\Big)\frac{\Delta_x^{1/2}}{\bar{\pi}_x} - \Big((\lambda^{\sim x}_1)^{n-1} + (\lambda^{\sim x}_2)^{n-1}\Big)\Big(1 - \frac{1}{n}\Big) + \frac{2}{n}\,\frac{\lambda^{\sim x}_1 \lambda^{\sim x}_2}{\lambda^{\sim x}_1 - \lambda^{\sim x}_2}\Big((\lambda^{\sim x}_1)^{n-2} - (\lambda^{\sim x}_2)^{n-2}\Big)\right], \quad (19)$$
where $\lambda^{\sim x}_i = 0.5\left(\bar{\pi}_x - v_x u_x + (-1)^{i+1}\sqrt{\Delta_x}\right)$, $i = 1, 2$, are the eigenvalues of the matrix $\begin{bmatrix} 1 - \pi_x & -\pi_x v_x \\ -u_x & -u_x v_x \end{bmatrix}$, with $\Delta_x = (\bar{\pi}_x - v_x u_x)^2 + 4 v_x u_x$.

Proof. Similar to that of Lemma 6. Bounding $\sum_{x \in A_1(\delta)} \pi_x \bar{\pi}_x |\Gamma_x|$ using a method similar to the diagonalizable case completes the proof.

B. Lemma 5

Theorem 8 ([18]). Suppose that $X^n$ is a Markov chain with state space $\mathcal{X}$ and t.p.m $P$ with $\theta(P) < 1$, and $\psi(X^n) : \mathcal{X}^n \to \mathbb{R}$ is a $d$-Lipschitz function with respect to the normalized Hamming metric. Then
$$\Pr\big(|\psi(X^n) - E[\psi(X^n)]| \geq \epsilon\big) \leq 2 e^{-n(1-\theta)^2 \epsilon^2 / (2 d^2)}. \quad (20)$$
Since $F_x(X^n)/n$ is 1-Lipschitz, Theorem 8 directly results in Part 1. For Part 2, we use the following theorem from [19].

Theorem 9 ([19]). Suppose that $X^n$ is a stationary Markov chain with state space $\mathcal{X} = [K]$, stationary distribution $\pi$ and t.p.m $P$ such that $\lambda_\pi(P) < 1$. Then every $f : [K] \to \mathbb{R}$ satisfies the following inequality for every $n \in \mathbb{N}$ and $q \geq 2$:
$$\left(E\left[\left|\frac{\sum_{i=1}^{n} f(X_i)}{n} - E[f(X_1)]\right|^q\right]\right)^{1/q} \lesssim \sqrt{\frac{q}{(1-\lambda_\pi)\,n}}\,\Big(E[|f(X_1)|^q]\Big)^{1/q}, \quad (21)$$
where $D_1 \lesssim D_2$ means that there exists a universal constant $C > 0$ such that $D_1 \leq C D_2$. Using Markov's inequality,
$$\Pr\left(\left|\pi_x - F_x(X^n)/n\right|^q \geq \epsilon^q\right) \leq E\left[\left|\pi_x - F_x(X^n)/n\right|^q\right]\epsilon^{-q} \stackrel{(a)}{\leq} C\left(\frac{q}{\bar{\lambda}_\pi n}\right)^{q/2} \pi_x\, \epsilon^{-q}, \quad (22)$$
where $(a)$ follows by setting $f(X_i) = I(X_i = x)$ in (21).

C. Proofs of Lemma 4 part 1, Lemma 6
We prove Lemma 6 and part 1 of Lemma 4 (for diagonalizable t.p.ms) using the lemma below. (The proof of part 1 of Lemma 4 for non-diagonalizable t.p.ms follows a similar method and is hence omitted.)
Lemma 10.
Let $P$ be a rank-2 diagonalizable t.p.m. Let $E^x_{\sim m}$ denote the event $F_x(X^n_{\sim m}) = 0$.
1) For $x \in \mathcal{X}$ with $P_{xx} = \pi_x$,
$$\Pr(E^x_{\sim m} \mid X_m = x) = \Pr(E^x_{\sim m} \mid X_m \neq x) = (\bar{\pi}_x)^{n-1} \quad \text{for } m = 1 \text{ to } n.$$
2) For $x \in \mathcal{X}$ with $P_{xx} \neq \pi_x$:
a) for $m = 1$ to $n$,
$$\Pr(E^x_{\sim m} \mid X_m \neq x) = \frac{1}{2\bar{\pi}_x}\left[(\lambda^{\sim x}_1)^{n-1}\left(\frac{s_x}{\sqrt{\Delta_x}} + 1\right) + (\lambda^{\sim x}_2)^{n-1}\left(1 - \frac{s_x}{\sqrt{\Delta_x}}\right)\right]; \quad (23)$$
b)
$$\Pr(E^x_{\sim 1} \mid X_1 = x) = \Pr(E^x_{\sim n} \mid X_n = x) = (\lambda^{\sim x}_1)^{n-2}\left(\frac{s_x + \sqrt{\Delta_x}}{2\sqrt{\Delta_x}} - \frac{(1-\beta) v_x u_x}{\Delta_x}\right) + (\lambda^{\sim x}_2)^{n-2}\left(\frac{\sqrt{\Delta_x} - s_x}{2\sqrt{\Delta_x}} + \frac{(1-\beta) v_x u_x}{\Delta_x}\right); \quad (24)$$
c) for $m = 2$ to $n-1$,
$$\Pr(E^x_{\sim m} \mid X_m = x) = (\lambda^{\sim x}_1)^{n-2}\left(\frac{s_x + \sqrt{\Delta_x}}{2\sqrt{\Delta_x}} - \frac{(1-\beta) v_x u_x}{\Delta_x}\right) + (\lambda^{\sim x}_2)^{n-2}\left(\frac{\sqrt{\Delta_x} - s_x}{2\sqrt{\Delta_x}} + \frac{(1-\beta) v_x u_x}{\Delta_x}\right) + \frac{\beta(1-\beta) v_x u_x}{\Delta_x}\left((\lambda^{\sim x}_1)^{m-1}(\lambda^{\sim x}_2)^{n-m} + (\lambda^{\sim x}_1)^{n-m}(\lambda^{\sim x}_2)^{m-1}\right), \quad (25)$$
where $s_x = 1 - \pi_x - (1-\beta)(1 - v_x u_x)$, and $\lambda^{\sim x}_1$, $\lambda^{\sim x}_2$, $\Delta_x$ are as defined in Lemma 6.

Recall that $n\Gamma_x = \sum_{m=1}^{n}\big[\Pr(E^x_{\sim m} \mid X_m = x) - \Pr(E^x_{\sim m} \mid X_m \neq x)\big]$. Using part 1 of Lemma 10 in the above expression completes the proof of part 1 of Lemma 4. Using part 2 of Lemma 10 in the above expression for $\Gamma_x$ and simplifying, we get (15). This completes the proof of Lemma 6.

D. Proof of Lemma 10
The probabilities $\Pr(E^x_{\sim m} \mid X_m = x)$ and $\Pr(E^x_{\sim m} \mid X_m \neq x)$ involve the events $E^x_{i:j} \triangleq (X_i \neq x,\, X_{i+1} \neq x,\, \ldots,\, X_j \neq x)$ for suitable choices of $i, j$ depending on $m$. To compute the probability of $E^x_{i:j}$, i.e. of the chain $X^n$ not visiting the state $x$ for a certain number of consecutive steps, we use the matrix $P_{\sim x}$, which is obtained by replacing the column corresponding to state $x$ in the t.p.m $P$ with 0s. Since $P = RDS$, we have $P_{\sim x} = RDS_{\sim x}$, where $S_{\sim x}$ is obtained by replacing the column corresponding to state $x$ in the matrix $S$ with zeros. Since $SR = I$, we get
$$S_{\sim x} R = \begin{bmatrix} 1 - \pi_x & -\pi_x v_x \\ -u_x & 1 - u_x v_x \end{bmatrix}.$$
We use the following lemmas to prove Lemma 10. The first two lemmas help us express $\Pr(E^x_{\sim m} \mid X_m = x)$ and $\Pr(E^x_{\sim m} \mid X_m \neq x)$ in terms of $S_{\sim x}$, $R$ and $D$.
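The censoring construction is easy to work with numerically: zeroing column $x$ of $P$ gives $P_{\sim x}$, and the probability of never visiting $x$ is obtained from row sums of powers of $P_{\sim x}$, as in the proof of Lemma 11 below. The sketch that follows is our own numerical check (the helper names `prob_never_visit` and `prob_never_visit_mc` are ours): it compares $\Pr(F_x(X^n) = 0) = \sum_{y \neq x} \pi_y \sum_z [(P_{\sim x})^{n-1}]_{y,z}$ with a Monte Carlo estimate.

```python
import numpy as np

def prob_never_visit(P, pi, x, n):
    """Pr(F_x(X^n) = 0) via the censored matrix P_{~x} (column x zeroed)."""
    P_cens = P.copy()
    P_cens[:, x] = 0.0
    M = np.linalg.matrix_power(P_cens, n - 1)
    mask = np.arange(len(pi)) != x
    return float(pi[mask] @ M[mask].sum(axis=1))

def prob_never_visit_mc(P, pi, x, n, trials, seed=0):
    rng = np.random.default_rng(seed)
    K = len(pi)
    misses = 0
    for _ in range(trials):
        s = rng.choice(K, p=pi)
        visited = (s == x)
        for _ in range(n - 1):
            s = rng.choice(K, p=P[s])
            visited = visited or (s == x)
        misses += (not visited)
    return misses / trials

# small rank-2 chain with two distinct rows
p, q = np.array([0.4, 0.4, 0.1, 0.1]), np.array([0.1, 0.1, 0.4, 0.4])
P = np.vstack([p, p, q, q])
evals, left = np.linalg.eig(P.T)
pi = left[:, np.argmin(np.abs(evals - 1))].real
pi /= pi.sum()
print(prob_never_visit(P, pi, x=2, n=12),
      prob_never_visit_mc(P, pi, x=2, n=12, trials=20000))
```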
Lemma 11. For $m = 1, \ldots, n$,
$$\Pr(E^x_{\sim m} \mid X_m \neq x) = \frac{1}{\bar{\pi}_x}\begin{bmatrix} 1 - \pi_x & -(1-\beta)\pi_x v_x \end{bmatrix}(S_{\sim x} R D)^{n-2}\begin{bmatrix} 1 - \pi_x \\ -u_x \end{bmatrix}. \quad (26)$$
Proof.
See Appendix A.
Lemma 12.
1) For $m \geq 2$,
$$\Pr(E^x_{2:m} \mid X_1 = x) = \begin{bmatrix} 1 & (1-\beta) v_x \end{bmatrix}(S_{\sim x} R D)^{m-2}\begin{bmatrix} 1 - \pi_x \\ -u_x \end{bmatrix}. \quad (27)$$
2) $\Pr(E^x_{1:n-1} \mid X_n = x) = \Pr(E^x_{2:n} \mid X_1 = x)$. $\quad (28)$
3) For $m = 2, \ldots, n-1$,
$$\Pr(E^x_{\sim m} \mid X_m = x) = \Pr(E^x_{2:m} \mid X_1 = x)\,\Pr(E^x_{2:n-m+1} \mid X_1 = x). \quad (29)$$
Proof.
See Appendix B.

The next lemma gives an expression for $(S_{\sim x}RD)^l$ ($l \geq 1$) for various values of $\pi_x$, $v_x$ and $u_x$.

Lemma 13. $(S_{\sim x}RD)^l$:
1) For $x \in \mathcal{X}$ with $v_x = u_x = 0$,
$$(S_{\sim x}RD)^l = \begin{bmatrix} \bar{\pi}_x^l & 0 \\ 0 & \bar{\beta}^l \end{bmatrix}.$$
2) For $x \in \mathcal{X}$ with $v_x = 0$, $u_x \neq 0$:
a) $\pi_x = \beta$:
$$(S_{\sim x}RD)^l = \begin{bmatrix} \bar{\pi}_x^l & 0 \\ -l u_x \bar{\pi}_x^{l-1} & \bar{\pi}_x^l \end{bmatrix};$$
b) $\pi_x \neq \beta$:
$$(S_{\sim x}RD)^l = \begin{bmatrix} 1 & 0 \\ \frac{u_x}{\pi_x - \beta} & 1 \end{bmatrix}\begin{bmatrix} \bar{\pi}_x^l & 0 \\ 0 & \bar{\beta}^l \end{bmatrix}\begin{bmatrix} 1 & 0 \\ \frac{u_x}{\pi_x - \beta} & 1 \end{bmatrix}^{-1}. \quad (30)$$
3) For $x \in \mathcal{X}$ with $u_x = 0$, $v_x \neq 0$:
a) $\pi_x = \beta$:
$$(S_{\sim x}RD)^l = \begin{bmatrix} \bar{\pi}_x^l & -l\pi_x v_x \bar{\pi}_x^l \\ 0 & \bar{\pi}_x^l \end{bmatrix};$$
b) $\pi_x \neq \beta$:
$$(S_{\sim x}RD)^l = \begin{bmatrix} 1 & \frac{-\bar{\beta}\pi_x v_x}{\pi_x - \beta} \\ 0 & 1 \end{bmatrix}\begin{bmatrix} \bar{\pi}_x^l & 0 \\ 0 & \bar{\beta}^l \end{bmatrix}\begin{bmatrix} 1 & \frac{-\bar{\beta}\pi_x v_x}{\pi_x - \beta} \\ 0 & 1 \end{bmatrix}^{-1}. \quad (31)$$
4) For $x \in \mathcal{X}$ with $v_x \neq 0$, $u_x \neq 0$,
$$(S_{\sim x}RD)^l = V_{\sim x}\begin{bmatrix} (\lambda^{\sim x}_1)^l & 0 \\ 0 & (\lambda^{\sim x}_2)^l \end{bmatrix}(V_{\sim x})^{-1}, \quad (32)$$
where the columns of
$$V_{\sim x} = \begin{bmatrix} 2(1-\beta)\pi_x v_x & 2(1-\beta)\pi_x v_x \\ s_x - \sqrt{\Delta_x} & s_x + \sqrt{\Delta_x} \end{bmatrix}$$
are right eigenvectors of $S_{\sim x}RD$ corresponding to $\lambda^{\sim x}_1$ and $\lambda^{\sim x}_2$, with $s_x = 1 - \pi_x - (1-\beta)(1 - v_x u_x)$, and $\lambda^{\sim x}_1$, $\lambda^{\sim x}_2$, $\Delta_x$ are as defined in Lemma 6.

For $x \in \mathcal{X}$ that fall in any of the first three categories in Lemma 13 (which makes $P_{xx} = \pi_x$), substituting the corresponding form of $(S_{\sim x}RD)^l$ in (26), (27) and simplifying, we get $\Pr(E^x_{\sim m} \mid X_m \neq x) = (1-\pi_x)^{n-1}$ for $m = 1$ to $n$, and $\Pr(E^x_{2:m} \mid X_1 = x) = (1-\pi_x)^{m-1}$ for all $2 \leq m \leq n$. Using this in (29) and (28), we get $\Pr(E^x_{\sim m} \mid X_m = x) = (1-\pi_x)^{n-1}$ for $m = 1$ to $n$. This completes the proof of the first part of Lemma 10.

For $x \in \mathcal{X}$ with $v_x \neq 0$ and $u_x \neq 0$, substituting the expression for $(S_{\sim x}RD)^l$ from (32) in (27) and simplifying, we get
$$\Pr(E^x_{2:m} \mid X_1 = x) = (\lambda^{\sim x}_1)^{m-2}\left(\frac{s_x + \sqrt{\Delta_x}}{2\sqrt{\Delta_x}} - \frac{(1-\beta) v_x u_x}{\Delta_x}\right) + (\lambda^{\sim x}_2)^{m-2}\left(\frac{\sqrt{\Delta_x} - s_x}{2\sqrt{\Delta_x}} + \frac{(1-\beta) v_x u_x}{\Delta_x}\right) \quad (33)$$
for $m \geq 2$. Using (33) with $m = n$ along with (28) completes the proof of (24). Using (33) in (29) and simplifying, we get (25). Lastly, substituting $(S_{\sim x}RD)^l$ from (32) in (26) and simplifying, we get (23), and the proof is complete.

APPENDIX
A. Proof of Lemma 11
To prove Lemma 11, we first note that
$$\Pr(E^x_{\sim m} \mid X_m \neq x) = \frac{\Pr(F_x(X^n_{\sim m}) = 0,\ X_m \neq x)}{\Pr(X_m \neq x)} = \frac{\Pr(F_x(X^n) = 0)}{\Pr(X_m \neq x)}.$$
We can write $\Pr(F_x(X^n) = 0)$ as
$$\Pr(F_x(X^n) = 0) = \sum_{y \in \mathcal{X},\, y \neq x} \pi_y \Pr(E^x_{2:n} \mid X_1 = y).$$
Since $\Pr(E^x_{2:n} \mid X_1 = y)$ is the probability that the Markov chain does not visit the state $x$ in the steps $X_2$ through $X_n$ given that it starts in state $y$, $\Pr(E^x_{2:n} \mid X_1 = y)$ equals the row sum of $(P_{\sim x})^{n-1}$ taken along the row corresponding to the state $y$. Therefore,
$$\Pr(F_x(X^n) = 0) = \sum_{y \neq x} \pi_y \sum_{z \in \mathcal{X}} [(P_{\sim x})^{n-1}]_{y,z} \stackrel{(a)}{=} \sum_{y \neq x} \pi_y \sum_{z \in \mathcal{X}} [(RDS_{\sim x})^{n-1}]_{y,z} = \sum_{y \neq x} \pi_y \sum_{z \in \mathcal{X}} \left[RD\,(S_{\sim x}RD)^{n-2}\,S_{\sim x}\right]_{y,z},$$
where we use $P_{\sim x} = RDS_{\sim x}$ to get $(a)$. Noting that $(S_{\sim x}RD)^{n-2}$ and $S_{\sim x}$ are matrices of dimensions $2 \times 2$ and $2 \times K$ respectively, and that $[1 \;\; (1-\beta) v_y]$ is the row of the matrix $RD$ corresponding to the state $y$, we have
$$\Pr(F_x(X^n) = 0) \stackrel{(b)}{=} \sum_{y \neq x} \pi_y \begin{bmatrix} 1 & (1-\beta) v_y \end{bmatrix}(S_{\sim x}RD)^{n-2}\begin{bmatrix} 1 - \pi_x \\ -u_x \end{bmatrix} \stackrel{(c)}{=} \begin{bmatrix} 1 - \pi_x & -(1-\beta)\pi_x v_x \end{bmatrix}(S_{\sim x}RD)^{n-2}\begin{bmatrix} 1 - \pi_x \\ -u_x \end{bmatrix},$$
where we use $\pi^T \mathbf{1}_{K \times 1} = 1$ and $u^T \mathbf{1}_{K \times 1} = 0$ (from $SR = I$) to compute the row sums of $S_{\sim x}$ to get $(b)$, and use $\pi^T \mathbf{1}_{K \times 1} = 1$ and $\pi^T v = 0$ (from $SR = I$) to get $(c)$. Using $\Pr(E^x_{\sim m} \mid X_m \neq x) = \frac{1}{1 - \pi_x}\Pr(F_x(X^n) = 0)$ for $m = 1, 2, \ldots, n$ completes the proof.

B. Proof of Lemma 12
1) Using a method similar to the proof of Lemma 11, we get
$$\Pr(E^x_{2:m} \mid X_1 = x) = \sum_{z \in \mathcal{X}} [(P_{\sim x})^{m-1}]_{x,z} = \sum_{z \in \mathcal{X}} [(RDS_{\sim x})^{m-1}]_{x,z} = \begin{bmatrix} 1 & (1-\beta) v_x \end{bmatrix}(S_{\sim x}RD)^{m-2}\begin{bmatrix} 1 - \pi_x \\ -u_x \end{bmatrix}.$$
2) We begin with the result of [15, Lemma 8], which states that for a stationary Markov chain $X^n$,
$$\Pr(X_j = x,\ E^x_{1:j-1}) = \Pr(X_1 = x,\ E^x_{2:j}), \quad \text{for } 2 \leq j \leq n. \quad (34)$$
Since
$$\Pr(E^x_{1:n-1} \mid X_n = x) = \frac{\Pr(E^x_{1:n-1},\ X_n = x)}{\Pr(X_n = x)},$$
using (34) and the stationarity of $X^n$, we get $\Pr(E^x_{1:n-1} \mid X_n = x) = \Pr(E^x_{2:n} \mid X_1 = x)$.
3) We now factorise $\Pr(E^x_{\sim m} \mid X_m = x)$, for $m = 2, \ldots, n-1$, as shown below:
$$\Pr(E^x_{\sim m} \mid X_m = x) \stackrel{(a)}{=} \Pr(E^x_{1:m-1} \mid X_m = x)\,\Pr(E^x_{m+1:n} \mid X_m = x) \stackrel{(b)}{=} \Pr(E^x_{2:m} \mid X_1 = x)\,\Pr(E^x_{m+1:n} \mid X_m = x) \stackrel{(c)}{=} \Pr(E^x_{2:m} \mid X_1 = x)\,\Pr(E^x_{2:n-m+1} \mid X_1 = x),$$
where we use the Markov property to get $(a)$, use (28) with $n = m$ to get $(b)$, and finally use the stationarity of $X^n$ to get $(c)$. This completes the proof of (29).
C. Proof of Lemma 13
Since $SR = I$, we have
$$S_{\sim x}R = \begin{bmatrix} 1-\pi_x & -\pi_x v_x \\ -u_x & 1 - v_x u_x \end{bmatrix} \quad \text{and} \quad S_{\sim x}RD = \begin{bmatrix} 1-\pi_x & -(1-\beta)\pi_x v_x \\ -u_x & (1-\beta)(1 - v_x u_x) \end{bmatrix}. \quad (35)$$
1) We substitute $v_x = u_x = 0$ in (35) and raise both sides to the power $l$.
2) a) Substituting $v_x = 0$, $\beta = \pi_x$ in (35), we get $S_{\sim x}RD = \begin{bmatrix} \bar{\pi}_x & 0 \\ -u_x & \bar{\pi}_x \end{bmatrix}$. Using induction on the power $l$, we get $(S_{\sim x}RD)^l = \begin{bmatrix} \bar{\pi}_x^l & 0 \\ -l u_x \bar{\pi}_x^{l-1} & \bar{\pi}_x^l \end{bmatrix}$.
   b) Substituting $v_x = 0$ in (35), we get $S_{\sim x}RD = \begin{bmatrix} \bar{\pi}_x & 0 \\ -u_x & \bar{\beta} \end{bmatrix}$. We observe that $\bar{\pi}_x$, $\bar{\beta}$ are the eigenvalues of $S_{\sim x}RD$, with $\begin{bmatrix} 1 \\ \frac{u_x}{\pi_x - \beta} \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ as their respective right eigenvectors, resulting in the diagonalised form in (30).
3) a) Substituting $u_x = 0$, $\beta = \pi_x$ in (35), we get $S_{\sim x}RD = \begin{bmatrix} \bar{\pi}_x & -\bar{\pi}_x \pi_x v_x \\ 0 & \bar{\pi}_x \end{bmatrix}$. Using induction on the power $l$, we get $(S_{\sim x}RD)^l = \begin{bmatrix} \bar{\pi}_x^l & -l \pi_x v_x \bar{\pi}_x^l \\ 0 & \bar{\pi}_x^l \end{bmatrix}$.
   b) Substituting $u_x = 0$ in (35), we get $S_{\sim x}RD = \begin{bmatrix} \bar{\pi}_x & -\bar{\beta}\pi_x v_x \\ 0 & \bar{\beta} \end{bmatrix}$. We observe that $\bar{\pi}_x$, $\bar{\beta}$ are the eigenvalues of $S_{\sim x}RD$, with $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} \frac{-\bar{\beta}\pi_x v_x}{\pi_x - \beta} \\ 1 \end{bmatrix}$ as their respective right eigenvectors, resulting in the diagonalised form in (31).
4) Solving $\det(S_{\sim x}RD - \lambda I) = 0$, we get $\lambda^{\sim x}_i = 0.5\big(\bar{\pi}_x + \bar{\beta}(1 - v_x u_x) + (-1)^{i+1}\sqrt{\Delta_x}\big)$, $i = 1, 2$, as the eigenvalues of $S_{\sim x}RD$, with the columns of $V_{\sim x}$ as their respective right eigenvectors, where $\Delta_x = \big(\beta - \pi_x + \bar{\beta} v_x u_x\big)^2 + 4\bar{\beta}\pi_x v_x u_x$ and $s_x = 1 - \pi_x - (1-\beta)(1 - v_x u_x)$, resulting in the diagonalised form in (32).
This completes the proof of Lemma 13.

REFERENCES
[1] R. G. Gallager, "Finite state Markov chains," in Discrete Stochastic Processes. Springer, 1996, pp. 103–147.
[2] I. J. Good, "The population frequencies of species and the estimation of population parameters," Biometrika, vol. 40, no. 3/4, pp. 237–264, 1953.
[3] D. A. McAllester and R. E. Schapire, "On the convergence rate of Good-Turing estimators," in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000, pp. 1–6.
[4] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ser. ACL '96, 1996, pp. 310–318.
[5] D. McAllester and L. Ortiz, "Concentration inequalities for the missing mass and for histogram rule error," J. Mach. Learn. Res., vol. 4, pp. 895–911, Dec. 2003.
[6] D. Berend and A. Kontorovich, "On the concentration of the missing mass," Electron. Commun. Probab., vol. 18, 7 pp., 2013.
[7] A. Ben-Hamou, S. Boucheron, and M. I. Ohannessian, "Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications," Bernoulli, vol. 23, no. 1, pp. 249–287, 2017.
[8] N. Rajaraman, A. Thangaraj, and A. T. Suresh, "Minimax risk for missing mass estimation," in Proc. IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 3025–3029.
[9] J. Acharya, Y. Bao, Y. Kang, and Z. Sun, "Improved bounds for minimax risk of estimating missing mass," in Proc. IEEE International Symposium on Information Theory (ISIT), June 2018, pp. 326–330.
[10] P. Chandra and A. Thangaraj, "Concentration and tail bounds for missing mass," in Proc. IEEE International Symposium on Information Theory (ISIT), July 2019, pp. 1862–1866.
[11] M. Asadi, R. P. Torghabeh, and N. P. Santhanam, "Stationary and transition probabilities in slow mixing, long memory Markov processes," IEEE Transactions on Information Theory, vol. 60, no. 9, pp. 5682–5701, 2014.
[12] M. Falahatgar, A. Orlitsky, V. Pichapati, and A. T. Suresh, "Learning Markov distributions: Does estimation trump compression?" in Proc. IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 2689–2693.
[13] Y. Hao, A. Orlitsky, and V. Pichapati, "On learning Markov chains," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS'18. Red Hook, NY, USA: Curran Associates Inc., 2018, pp. 646–655.
[14] G. Wolfer and A. Kontorovich, "Minimax learning of ergodic Markov chains," in Proceedings of the 30th International Conference on Algorithmic Learning Theory, 2019, pp. 904–930.
[15] P. Chandra, A. Thangaraj, and N. Rajaraman, "Missing mass of Markov chains," in Proc. IEEE International Symposium on Information Theory (ISIT), 2020, pp. 1207–1212.
[16] M. Skórski, "Missing mass in Markov chains," arXiv preprint arXiv:2001.03603, 2020.
[17] S. U. Pillai, T. Suel, and S. Cha, "The Perron-Frobenius theorem: some of its applications," IEEE Signal Processing Magazine, vol. 22, no. 2, pp. 62–75, 2005.
[18] L. A. Kontorovich and K. Ramanan, "Concentration inequalities for dependent random variables via the martingale method," The Annals of Probability, vol. 36, no. 6, pp. 2126–2158, Nov. 2008.
[19] A. Naor, S. Rao, and O. Regev, "Concentration of Markov chains with bounded moments," Ann. Inst. H. Poincaré Probab. Statist., vol. 56, no. 3, pp. 2270–2280, 2020.
[20] P. Chandra, A. Thangaraj, and N. Rajaraman, "Missing mass of rank-2 Markov chains," arXiv preprint arXiv:2102.01938v1, 2021.