Missing Mass in Markov Chains
Maciej Skorski
University of Luxembourg [email protected]
Abstract—The problem of missing mass in statistical inference (posed by McAllester and Ortiz, NIPS'02; most recently revisited by Chandra and Thangaraj, ISIT'2019) seeks to estimate the weight of symbols that have not been sampled yet from a source. So far all the approaches have focused on the IID model which, although overly simplistic, is already not straightforward to tackle. The non-trivial part is in handling correlated events and sums of variables with very different scales, where classical concentration inequalities do not yield good bounds. In this paper we develop the research on missing mass further, solving the problem for Markov chains. It turns out that the existing approaches to IID sources are not useful for Markov chains; we reframe the problem as studying the tails of hitting times and finding log-additive approximations to them. More precisely, we combine the technique of majorization and certain estimates on set hitting times to show how the problem can eventually be reduced back to the IID case. Our contributions are: a) a new technique to obtain missing mass bounds, replacing the traditionally used negative association by majorization, which works for a wider class of processes; b) the first (exponential) concentration bounds for missing mass in Markov chain models; c) simplifications of recent results on set hitting times; and d) a simplified derivation of missing mass estimates for memory-less sources.
Index Terms—Markov chains, missing mass problem, concentration bounds
I. INTRODUCTION
A. Missing Mass Problem
The missing mass problem studies the behavior of unseen symbols when sampling from a memory-less source. One wants to estimate the probability of elements that have not been visited during $n$ steps ($n$ subsequent samples from a source). Since so far only IID sources have been studied [1], [3], [10], [12], [13], it is natural to extend results to processes with memory. In this paper we develop such results for Markov chains.

B. Proof Outline for IID
For the IID case ([1], [3], [10], [11] and follow-up works improving bounds) one proves tail bounds as follows. Consider subsequent symbols $X_1,\ldots,X_n$. The fundamental observation is that the collection of variables $I(X_i = j)$, indexed by tuples $(i,j)$ (indicators of whether we hit symbol $j$ at time $i$), is negatively associated (referred to as NA). This follows in two parts: first, this is clearly true for any fixed $i$ (indeed, then $\sum_j I(X_i = j) = 1$), and then one uses the fact that NA vectors can be augmented [5]. Let $\tau_j$ be the moment of hitting a symbol $j$ (it may be infinite); any symbol that has not been seen during $n$ steps satisfies $\tau_j > n$. However, $\tau_j > n$ is equivalent to $\sum_{i=1}^n I(X_i = j) = 0$. It is also true that block sums of NA variables are NA, therefore the sums $\sum_{i=1}^n I(X_i = j)$, indexed by $j$, are NA. Finally, threshold transforms preserve NA, and thus the family of events $\sum_{i=1}^n I(X_i = j) = 0$, equivalent to $\tau_j > n$, is NA. Therefore the problem reduces to estimating a weighted sum of NA boolean variables. This is another non-trivial task where classical inequalities are too weak to produce the desired results, as they work best with homogeneous variables (weights of similar orders of magnitude).

C. Our Result - Markov Chains
We first extend the problem statement to Markov chains (or, more generally, stationary sources). Consider a Markov chain $(X_i)_i$ over $m$ states $\{1,\ldots,m\}$ with stationary distribution $\pi$ and some initial starting distribution. Let $\tau_j$ be the first time when $j$ is hit. Fix the run length $n$ and consider
$$\mathrm{MissingMass} = \sum_{j=1}^m \pi(j)\cdot \mathbf{1}\{\tau_j > n\} \quad (1)$$
which indeed extends the IID case. Motivated by seeking possible extensions, we ask the following question.

Problem: Do exponentially strong concentration inequalities hold for $\mathrm{MissingMass}$ in Equation (1)?

To set up expectations correctly, we note that this problem for Markov chains is much harder than for IID sources. First, the
NA property fails: while in the IID case, due to NA (and quite intuitively), not seeing $B$ increases the chances of seeing $A$, for Markov chains this may increase or decrease depending on the topology of the chain. For example, consider a walk where $A$ is accessed only (or mostly, in terms of probability weights) from $B$.

Second, for memory-less sources the result is based on exact formulas for the tails of hitting times, which are straightforward to derive. For Markov chains not only do we not expect accurate formulas, but it is very challenging to obtain good lower bounds. Moreover, in our problem we will have to study hitting times of entire sets, not individual points. Indeed, because we need to bound moments of Equation (1), which yields mixed moments of the indicators $\mathbf{1}\{\tau_j > n\}$, we seek bounds on probabilities of the form $\Pr[\bigwedge_{j\in J}\{\tau_j > n\}]$. Hitting and commute times are generally better understood for points, not sets [8], [9].

We now present our main result. Below $T$ refers to the hitting time of large sets, the quantity studied in recent works [6], [14] (see Section II-A).

Theorem 1 (Majorization by IID Problem). Let the chain $X$ be irreducible and let $Q_j$ be independent Bernoulli variables with success probability $e^{-c\cdot n\cdot \pi(j)/T}$ for some absolute constant $c$, where $T$ is the maximum hitting time of sets of probability at least $0.5$. Then for any set $J$ and any integer $n > 0$
$$\Pr\Big[\bigwedge_{j\in J}\{\tau_j > n\}\Big] \leqslant \prod_{j\in J}\Pr[Q_j = 1] \quad (2)$$
In particular, for any $s > 0$ it holds that
$$\mathbb{E}\exp\big(s\cdot \mathrm{MissingMass}\big) \leqslant \mathbb{E}\exp\Big(s\cdot \sum_{j=1}^m \pi(j) Q_j\Big). \quad (3)$$

Remark 1 (Dependency on Hitting Time of Large Sets). Intuitively, the dependency on $T$ is justified, because slow-mixing chains, which have large $T$, should have heavy tails for the missing mass.

D. Discussion and Applications

1) Exponential Bounds for Markovian Sources: By reducing to the IID case and using bounds from the literature we obtain (see Section III for a proof)
Corollary 1 (Exponential Upper Tails for MCs). We have the bound $\mathrm{MissingMass} \leqslant \mathbb{E}\big[\sum_j \pi(j) Q_j\big] + \epsilon$ with probability at least $1 - e^{-\Omega(n\epsilon^2/T)}$.

2) Exponential Bounds for IID Sources: Under the IID assumption the expression in Equation (2) can be computed exactly, so that we can actually take
$\Pr[Q_j = 1] := \Pr[\tau_j > n] = (1-\pi(j))^n \approx e^{-\pi(j)\cdot n}$. This corresponds to setting $c = 1$ and $T = 1$ in Theorem 1. Note that the sums considered in Equation (3) then have equal means, by the definition of $Q_j$. Thus we re-obtain the same bounds as for the IID case.

Corollary 2 (Exponential Upper Tails for Missing Mass of IID). Let $M = \mathrm{MissingMass}$; then under the IID model $|M - \mathbb{E}M| \leqslant \epsilon$ holds with probability $1 - e^{-\Omega(n\epsilon^2)}$.

3) Set Hitting Times Estimates: The bound in Equation (2) is actually an estimate on set hitting times: $\bigwedge_{j\in J}\{\tau_j > n\}$ is equivalent to $\tau_J > n$. Thus we have proven an exponential tail $\Pr[\tau_J > n] \leqslant e^{-c\cdot n\cdot \pi(J)/T}$ or, up to a constant, $\mathbb{E}\tau_J = O(T/\pi(J))$ (these conditions are equivalent up to a constant, see Proposition 1). See also Corollary 3.
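As a quick numerical illustration of the IID formulas above, the following Python sketch compares the empirical missing mass after $n$ IID draws with the exact expectation $\sum_j \pi(j)(1-\pi(j))^n$ and its exponential approximation $\sum_j \pi(j)e^{-\pi(j)n}$. The Zipf-like source, run length, and trial count are our own illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical memory-less source: a Zipf-like law over m symbols.
m = 50
pi = 1.0 / np.arange(1, m + 1)
pi /= pi.sum()

n = 100        # run length
trials = 4000  # Monte Carlo repetitions

# Exact expectation E[MissingMass] = sum_j pi(j) * (1 - pi(j))^n
# and its exponential approximation sum_j pi(j) * exp(-pi(j) * n).
exact = float(np.sum(pi * (1.0 - pi) ** n))
approx = float(np.sum(pi * np.exp(-pi * n)))

# Empirical missing mass: total weight of symbols unseen after n draws.
masses = np.empty(trials)
for t in range(trials):
    sample = rng.choice(m, size=n, p=pi)
    seen = np.zeros(m, dtype=bool)
    seen[sample] = True
    masses[t] = pi[~seen].sum()

empirical = masses.mean()
```

The two closed-form expressions nearly coincide because $(1-p)^n \approx e^{-pn}$ term by term, and the Monte Carlo average matches them to sampling accuracy.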
4) Eliminating Negative Association Theory: Traditionally the proofs for the IID case depend on non-trivial facts about negative association; for example, [10] relies on [5]. However, in view of Equation (3), if the exponential method is used (which is the case for all known bounds) we just need to prove that Equation (2) holds with $\Pr[Q_j = 1] = \Pr[\tau_j > n]$. Plugging this in, we conclude that one needs to show
$$\Pr[\tau_J > n] \leqslant \prod_{j\in J}\Pr[\tau_j > n]$$
We calculate that $\Pr[\tau_J > n] = (1-\pi(J))^n$ for any $J$, in particular also $\Pr[\tau_j > n] = (1-\pi(j))^n$. Thus it suffices to prove that
$$1 - \pi(J) \leqslant \prod_{j\in J}(1-\pi(j)) \quad (4)$$
This follows because $\pi(J) = \sum_{j\in J}\pi(j)$ and from the elementary inequality $(1-a)(1-b) \geqslant 1-a-b$ (applied recursively), valid for all $a, b \in [0,1]$.

II. PRELIMINARIES
We consider a Markov chain $X_0, X_1, \ldots$ over a finite state space $\mathcal{X}$. We assume it is irreducible, so that it has a unique stationary distribution $\pi$ [9].

A. Hitting Times

By $T(x, B)$ we denote the expected hitting time of the set $B$ when the chain starts from $x$. By $T^+(A, B)$ we denote the maximal expected hitting time of $B$ over all possible starts in $A$, that is $T^+(A, B) = \max_{x\in A} T(x, B)$. Similarly, $T^-(A, B)$ stands for the minimal expected hitting time of $B$ over possible starts in $A$, that is $T^-(A, B) = \min_{x\in A} T(x, B)$. We also let $T(B) = T^+(\mathcal{X}, B)$ (the worst-case expected hitting time of $B$) and consider the worst expected hitting time to sets of measure at least $\epsilon$, that is $T(\epsilon) = \max_{B:\,\pi(B)\geqslant\epsilon} T(B)$ (here $\pi$ is the stationary distribution). In our applications we think of $\epsilon$ as a constant and of $T(\epsilon)$ as the hitting time of large sets.

It is a standard fact that for irreducible chains the tails of hitting times are exponential [2], [9], [15]. This is shown by splitting long paths into chunks of equal size and applying the Markov property.

Proposition 1 (Exponential Tails of Hitting Times). Fix some initial distribution and let $N_B$ be the hitting time (a random variable) of the set $B$. Then we have $\Pr[N_B > t] \leqslant \exp\big(-\lfloor t/\lceil e\cdot \mathbb{E}N_B\rceil\rfloor\big)$; note that $\mathbb{E}N_B \leqslant T(B)$.

B. Ergodicity

Below we recall the ergodic theorem for Markov chains [9].
Proposition 2 (Ergodic Theorem for MCs). If $(X_n)_n$ is an irreducible Markov chain with stationary distribution $\pi$, then
$$\frac{1}{n}\sum_{i=1}^n f(X_i) \xrightarrow{\,n\to\infty\,} \mathbb{E}_\pi f \quad \text{a.s.} \quad (5)$$
for any starting distribution of the chain and any real function $f$ on the chain states.

C. Relative Entropy

By $D(p\|q)$ we denote the binary relative entropy function (Kullback-Leibler divergence), defined as $D(p\|q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$. It appears in concentration bounds (a.k.a. Chernoff bounds), and can be bounded from below in terms of the total variation distance as follows [4].

Proposition 3 (Pinsker's Inequality). For any $p, q \in (0,1)$ we have $D(p\|q) \geqslant 2(p-q)^2$.

III. PROOFS
A. Bound on Hitting Times
In this section we prove the bound $\mathbb{E}\tau_J = O(T/\pi(J))$, which implies the first part of Theorem 1, namely Equation (2), as discussed in Section I-D3. We will need the following result, which appears in [6] and connects the time of reaching $B$ from $A$ to that of the opposite direction. The original proof in the arXiv version [7] was quite involved, based on martingales and concentration inequalities applied in a non-standard setup (the martingale differences were not bounded); in the final version it got subsumed by an argument credited to Peres and Sousi [6].

Lemma 1 (Bounds on Set Hitting Times). For an irreducible chain with stationary distribution $\pi$ and any subsets of states $A, B$ we have
$$\pi(A) \leqslant \frac{T^+(A,B)}{T^+(A,B)+T^-(B,A)} \quad (6)$$
In particular
$$\pi(A)\cdot T^-(B,A) \leqslant T^+(A,B). \quad (7)$$

Below, as a contribution of independent interest, we provide an alternative simple proof which resembles the approach taken in [6] but uses only simple stopping times rather than martingales and does not need concentration inequalities. Before discussing the details we highlight the intuition as follows: we look at how the chain commutes between the sets $A$ and $B$. We split a long run of the chain into rounds, where each round is one "return trip": starting from $A$, passing through $B$ and finally returning to $A$. In $m$ rounds the walk makes, roughly, at least $m\cdot(T^+(A,B)+T^-(B,A))$ steps in total, while spending in $A$ at most $m\cdot T^+(A,B)$ steps on average. This can be compared with $\pi(A)$, which is the fraction of time spent in $A$ by the Ergodic Theorem (see Proposition 2).

Proof.
Suppose that the chain starts at some fixed point $x \in A$. For $j = 1,\ldots,m$ let $N^{A\to B\to A}_j$ be the number of further steps it takes the walk to, starting from $A$, visit $B$ and return to $A$ (such quantities are sometimes called commute times); also let $N^{A\to B}_j$ be the number of further steps it takes the walk to visit $B$ when it starts from $A$. Once $B$ gets visited, the chain does not enter $A$ again within the same round. Thus we have
$$\text{time the walk spent in } A \leqslant \sum_{j=1}^m N^{A\to B}_j \quad (8)$$
and clearly
$$\text{total time} = \sum_{j=1}^m N^{A\to B\to A}_j. \quad (9)$$
To finish the argument we only use convergence in probability. From the discussion it follows that
$$\pi(A) \leqslant \liminf_m \frac{\sum_{j=1}^m N^{A\to B}_j}{\sum_{j=1}^m N^{A\to B\to A}_j} \quad (10)$$
In the next step we replace the numerator and the denominator with their means. For any $\epsilon > 0$ and sufficiently big $m$ we have, with high probability,
$$(\pi(A)-\epsilon)\cdot \sum_{j=1}^m N^{A\to B\to A}_j \leqslant \sum_{j=1}^m N^{A\to B}_j. \quad (11)$$
Note that both collections $\{N^{A\to B}_j\}_j$ and $\{N^{A\to B\to A}_j\}_j$ are independent, as follows from the Markov property; they are however not identically distributed, because of the evolving start points. Passing to the expectations (this step is justified by Chebyshev's inequality, as discussed at the end of the proof) we obtain
$$(\pi(A)-\epsilon)\cdot \sum_{j=1}^m \mathbb{E}N^{A\to B\to A}_j \leqslant \sum_{j=1}^m \mathbb{E}N^{A\to B}_j \quad (12)$$
Without losing generality we can assume that $\pi(A) > 0$ and that $\epsilon < \pi(A)$; then $\pi(A)-\epsilon > 0$. We write $N^{A\to B\to A}_j = N^{A\to B}_j + \big(N^{A\to B\to A}_j - N^{A\to B}_j\big)$, that is, we split the commute time at the moment of reaching $B$, leaving the way back to $A$. Then we have $\mathbb{E}\big[N^{A\to B\to A}_j - N^{A\to B}_j\big] \geqslant T^-(B,A)$. Thus
$$(\pi(A)-\epsilon)\cdot \sum_{j=1}^m \big(\mathbb{E}N^{A\to B}_j + T^-(B,A)\big) \leqslant \sum_{j=1}^m \mathbb{E}N^{A\to B}_j \quad (13)$$
Rearranging the terms we write
$$(\pi(A)-\epsilon)\cdot m\cdot T^-(B,A) \leqslant (1-\pi(A)+\epsilon)\cdot \sum_{j=1}^m \mathbb{E}N^{A\to B}_j \quad (14)$$
We now use the bound $\mathbb{E}N^{A\to B}_j \leqslant T^+(A,B)$ to arrive at
$$(\pi(A)-\epsilon)\cdot m\cdot T^-(B,A) \leqslant (1-\pi(A)+\epsilon)\cdot m\cdot T^+(A,B) \quad (15)$$
This is equivalent to
$$\pi(A) \leqslant \frac{T^+(A,B)}{T^+(A,B)+T^-(B,A)} + \epsilon \quad (16)$$
and the result follows, as $\epsilon$ can be arbitrarily small.

We also comment on how the basic Chebyshev inequality can be used to complete the argument. We observe that $N^{A\to B}_j$ and $N^{A\to B\to A}_j$ have bounded second moments, because the tails of stopping times are exponential (see Proposition 1). Then the Chebyshev inequality implies
$$\Pr\left[\left|\frac{\sum_{j=1}^m \big(N^{A\to B}_j - \mathbb{E}N^{A\to B}_j\big)}{m}\right| > \epsilon\right] = O(1/m\epsilon^2) \quad (17)$$
$$\Pr\left[\left|\frac{\sum_{j=1}^m \big(N^{A\to B\to A}_j - \mathbb{E}N^{A\to B\to A}_j\big)}{m}\right| > \epsilon\right] = O(1/m\epsilon^2) \quad (18)$$
where the constants depend on $A$, $B$ and the chain. Using this in Equation (10) we arrive at the same conclusion.
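Lemma 1 is easy to check numerically on a small example. The sketch below computes exact expected hitting times by solving the standard linear system $h = \mathbf{1} + Ph$ off the target set, and then verifies $\pi(A)\cdot\big(T^+(A,B)+T^-(B,A)\big) \leqslant T^+(A,B)$. The 4-state chain and the sets $A$, $B$ are our own toy choices, not taken from the paper.

```python
import numpy as np

# Toy irreducible 4-state chain (our own example).
P = np.array([
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.5, 0.2],
    [0.1, 0.1, 0.3, 0.5],
])

def stationary(P):
    """Stationary distribution: left eigenvector of P for eigenvalue 1."""
    w, v = np.linalg.eig(P.T)
    vec = np.real(v[:, np.argmax(np.real(w))])
    return vec / vec.sum()

def hitting_times(P, B):
    """Expected hitting times h(x) = E_x[steps to reach B], solving
    h = 1 + P h on the complement of B, with h = 0 on B."""
    n = len(P)
    B = set(B)
    out = [x for x in range(n) if x not in B]
    M = np.eye(len(out)) - P[np.ix_(out, out)]
    h_out = np.linalg.solve(M, np.ones(len(out)))
    h = np.zeros(n)
    for i, x in enumerate(out):
        h[x] = h_out[i]
    return h

pi = stationary(P)
A_set, B_set = [0, 1], [3]

h_B = hitting_times(P, B_set)            # times to reach B
h_A = hitting_times(P, A_set)            # times to reach A
T_plus_AB = max(h_B[x] for x in A_set)   # T+(A, B)
T_minus_BA = min(h_A[x] for x in B_set)  # T-(B, A)

# Lemma 1 in the form pi(A) * (T+(A,B) + T-(B,A)) <= T+(A,B).
lhs = pi[A_set].sum() * (T_plus_AB + T_minus_BA)
```

Solving the linear system gives exact (not simulated) hitting times, so the inequality can be checked to machine precision; any irreducible transition matrix can be substituted for `P`.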
Remark 2. By refining the current proof we can show a slightly better constant.

Lemma 2 (Measure vs Hitting Time). For any $A$ we have $T(A) \leqslant 2\cdot T(0.5)/\pi(A)$.

Before giving a proof we explain the intuition. Consider the set $B$ of starting points that are "unlucky" for $A$, that is, those that make the hitting time of $A$ very long. Then $T^-(B,A)$ is very large, and to keep the right-hand side of Lemma 1 big enough, $T^+(A,B)$ must be sufficiently big, namely at least $\pi(A)\cdot T^-(B,A)$. But we have bounded the hitting times of big sets $B$ (see the definition of $T(\epsilon)$); therefore we conclude that $B$ is small. In other words, the complementary set $B^c$ of good starting points is big and the walk quickly reaches it; and once it gets there, it also quickly reaches $A$, by the definition of good starting points.

Proof.
Equation (6) implies that $T^-(B,A) \leqslant T(B)/\pi(A)$. Let $B$ contain all $x$ such that $T(x,A) > T(0.5)/\pi(A)$ (the unlucky starting points). Then we must have $\pi(B) < 0.5$. Hence $\pi(B^c) > 0.5$, which implies $T(B^c) \leqslant T(0.5)$, and $B^c$ consists of good starts for $A$, that is $T^+(B^c, A) \leqslant T(0.5)/\pi(A)$. By the Markov property
$$T(x,A) \leqslant T(x,B^c) + T^+(B^c, A).$$
Bounding the right-hand side by the previous estimates we obtain
$$T(x,A) \leqslant T(0.5) + T(0.5)/\pi(A) \leqslant 2\cdot T(0.5)/\pi(A);$$
taking the maximum over $x$ on the left-hand side finishes the proof.

Combining Proposition 1 and Lemma 2 we obtain the following.

Corollary 3 (Explicit Exponential Tails of Hitting Times). Let $N_A$ be the hitting time of a set $A$, for some initial distribution of the chain. Then we have
$$\Pr[N_A > t] \leqslant \exp\big(-\Omega(t\cdot \pi(A)/T(0.5))\big) \quad (19)$$
for some absolute constant under $\Omega(\cdot)$.

Remark 3 (Explicit Constant). The explicit constant can be set to $\frac{1}{e}\cdot(1+o(1))$ for large $t$. This follows from Proposition 1 and Remark 2.

B. Combining with IID Bounds

The condition in Equation (2), proved in the previous subsection, implies Equation (3). Indeed, if $\prod_{j\in J} u_j \leqslant \prod_{j\in J} q_j$ for every $J$ (majorization), then $\mathbb{E}\big(\sum_j u_j\big)^k \leqslant \mathbb{E}\big(\sum_j q_j\big)^k$ and, by the Taylor expansion of $\exp(\cdot)$, we obtain $\mathbb{E}\exp(s\cdot\sum_j u_j) \leqslant \mathbb{E}\exp(s\cdot\sum_j q_j)$. This finishes the proof of Theorem 1. Therefore, upper bounds obtained through the exponential method for the IID variables $Q_j$ apply as well. Following the discussion in [10] (particularly Lemma 11) we obtain the upper bound $\sum_j \pi(j)\cdot\mathbb{E}Q_j + \epsilon$ with probability $1 - e^{-\Theta(n\epsilon^2)}$.

IV. CONCLUSION
We have studied the missing mass problem under Markov chain models. The obtained reduction allows for deriving bounds from an IID scenario.

ACKNOWLEDGMENT

REFERENCES

[1] Daniel Berend, Aryeh Kontorovich, et al., On the concentration of the missing mass, Electronic Communications in Probability (2013).
[2] P. Brémaud, Discrete probability models and methods: Probability on graphs and trees, Markov chains and random fields, entropy and coding, Probability Theory and Stochastic Modelling, Springer International Publishing, 2017.
[3] Prafulla Chandra and Andrew Thangaraj, Concentration and tail bounds for missing mass, 2019 IEEE International Symposium on Information Theory (ISIT), IEEE, 2019, pp. 1862–1866.
[4] I. Csiszár and J. Körner, Information theory: Coding theorems for discrete memoryless systems, Cambridge University Press, 2011.
[5] Devdatt Dubhashi and Desh Ranjan, Balls and bins: A study in negative dependence, Random Structures & Algorithms (1998), no. 2, 99–124.
[6] Simon Griffiths, Ross Kang, Roberto Oliveira, and Viresh Patel, Tight inequalities among set hitting times in Markov chains, Proceedings of the American Mathematical Society (2014), no. 9, 3285–3298.
[7] Simon Griffiths, Ross J. Kang, Roberto Imbuzeiro Oliveira, and Viresh Patel, Tight inequalities among set hitting times in Markov chains, 2012.
[8] Amine Helali and Matthias Löwe, Hitting times, commute times, and cover times for random walks on random hypergraphs, Statistics & Probability Letters (2019).
[9] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer, Markov chains and mixing times, American Mathematical Society, 2006.
[10] David McAllester and Luis Ortiz, Concentration inequalities for the missing mass and for histogram rule error, Journal of Machine Learning Research (2003), no. Oct, 895–911.
[11] David A. McAllester and Luis E. Ortiz, Concentration inequalities for the missing mass and for histogram rule error, Advances in Neural Information Processing Systems 15 [Neural Information Processing Systems, NIPS 2002, December 9–14, 2002, Vancouver, British Columbia, Canada], 2002, pp. 351–358.
[12] David A. McAllester and Robert E. Schapire, On the convergence rate of Good-Turing estimators, COLT, 2000, pp. 1–6.
[13] Elchanan Mossel and Mesrob Ohannessian,