Uniform hypothesis testing for ergodic time series distributions

Abstract

Given a discrete-valued sample $X_1,\dots,X_n$, we wish to decide whether it was generated by a distribution belonging to a family $H_0$ or by a distribution belonging to a family $H_1$. In this work we assume that all distributions are stationary ergodic, and do not make any further assumptions (e.g. no independence or mixing rate assumptions). We would like to have a test whose probability of error (both Type I and Type II) is uniformly bounded. More precisely, we require that for each $\varepsilon$ there exists a sample size $n$ such that the probability of error is upper-bounded by $\varepsilon$ for samples longer than $n$. We find some necessary and some sufficient conditions on $H_0$ and $H_1$ under which a consistent test (with this notion of consistency) exists. These conditions are topological, with respect to the topology of distributional distance.


arXiv [math.ST], Dec

Uniform hypothesis testing for finite-valued stationary processes

Daniil Ryabko ∗

Given a sample $X_1,\dots,X_n$ (where $X_i$ are from a finite alphabet $A$) which is known to be generated by a stationary ergodic process, we wish to decide whether it was generated by a distribution belonging to a family $H_0$ or by a distribution belonging to a family $H_1$. Unlike most of the works on the subject, we do not assume that $X_i$ are i.i.d., but only make the much weaker assumption that the distribution generating the sample is stationary ergodic.

A test is a function that takes a sample and gives a binary (possibly incorrect) answer: either the sample was generated by a distribution from $H_0$ or by a distribution from $H_1$. An answer $i\in\{0,1\}$ is correct if the sample is generated by a distribution that belongs to $H_i$. Here we are concerned with characterizing those pairs of $H_0$ and $H_1$ for which consistent tests exist.

Consistency.

In this work we consider the following notion of consistency. For two hypotheses $H_0$ and $H_1$, a test is called uniformly consistent if for any $\varepsilon>0$ there is a sample size $n$ such that the probability of error on a sample of size larger than $n$ is not greater than $\varepsilon$ if any distribution from $H_0\cup H_1$ is chosen to generate the sample. Thus, a uniformly consistent test provides performance guarantees for finite sample sizes.

∗ INRIA Lille-Nord Europe, 40, avenue Halley, 59650 Villeneuve d’Ascq, France.

The results.

Here we obtain some topological conditions on the hypotheses for which consistent tests exist, for the case of stationary ergodic distributions.

A distributional distance between two process distributions [3] is defined as a weighted sum, over all possible tuples $X\in A^*$, of the absolute differences between the probabilities that the two distributions assign to $X$, where $A$ is the alphabet and the weights are positive and have a finite sum.

The test $\varphi_{H_0,H_1}$ that we construct is based on empirical estimates of distributional distance. It outputs 0 if the given sample is closer to the (closure of) $H_0$ than to the (closure of) $H_1$, and outputs 1 otherwise. The main result is as follows.

Theorem.

Let $H_0,H_1\subset\mathcal E$, where $\mathcal E$ is the set of all stationary ergodic process distributions. If, for each $i\in\{0,1\}$, the set $H_i$ has probability 1 with respect to ergodic decompositions of every element of (the closure of) $H_i$, then there is a uniformly consistent test for $H_0$ against $H_1$. Conversely, if there is a uniformly consistent test for $H_0$ against $H_1$, then, for each $i\in\{0,1\}$, the set $H_{1-i}$ has probability 0 with respect to ergodic decompositions of every element of (the closure of) $H_i$.

Prior work.

This work continues our previous research [13, 14], which provides similar necessary and sufficient conditions for the existence of a consistent test, for a weaker notion of asymmetric consistency: Type I error is uniformly bounded, while Type II error is required to tend to 0 as the sample size grows. Besides that, there is of course a vast body of literature on hypothesis testing for i.i.d. (real- or discrete-valued) data (see e.g. [7, 4]). There is, however, much less literature on hypothesis testing beyond i.i.d. or parametric models. For a weaker notion of consistency, namely, requiring that the test should stabilize on the correct answer for a.e. realization of the process (under either $H_0$ or $H_1$), [6] constructs a consistent test for so-called constrained finite-state model classes (including finite-state Markov and hidden Markov processes), against the general alternative of stationary ergodic processes. For the same notion of consistency, [8] gives sufficient conditions on two hypotheses $H_0$ and $H_1$ that consist of stationary ergodic real-valued processes, under which a consistent test exists, extending the results of [2] for i.i.d. data. The latter condition is that $H_0$ and $H_1$ are contained in disjoint $F_\sigma$ sets (countable unions of closed sets), with respect to the topology of weak convergence. Asymmetrically consistent tests for some specific hypotheses, but under the general alternative of stationary ergodic processes, have been proposed in [9, 10, 15, 16], which address problems of testing identity, independence, estimating the order of a Markov process, and also the change point problem. Notably, the conceptually simple hypothesis of homogeneity (testing whether two samples are generated by the same or by different processes) does not admit a consistent test even in the weakest asymptotic sense, as was shown in [12]. Empirical estimates of distributional distance have also been used to address the problem of clustering time series [11, 5].

2 Preliminaries

Let $A$ be a finite alphabet, and denote $A^*$ the set of words (or tuples) $\cup_{i=1}^{\infty}A^i$. For a word $B$ the symbol $|B|$ stands for the length of $B$. Denote $B_i$ the $i$th element of $A^*$, enumerated in such a way that the elements of $A^i$ appear before the elements of $A^{i+1}$, for all $i\in\mathbb N$. Distributions or (stochastic) processes are probability measures on the space $(A^\infty,\mathcal F_{A^\infty})$, where $\mathcal F_{A^\infty}$ is the Borel sigma-algebra of $A^\infty$. Denote $\#(X,B)$ the number of occurrences of a word $B$ in a word $X\in A^*$ and $\nu(X,B)$ its frequency:

$$\#(X,B)=\sum_{i=1}^{|X|-|B|+1} I_{\{(X_i,\dots,X_{i+|B|-1})=B\}},\quad\text{and}\quad \nu(X,B)=\begin{cases}\dfrac{1}{|X|-|B|+1}\,\#(X,B) & \text{if } |X|\ge|B|,\\ 0 & \text{otherwise,}\end{cases}$$

where $X=(X_1,\dots,X_{|X|})$. For example, $\nu(0001,00)=2/3$.

We use the abbreviation $X_{1..k}$ for $X_1,\dots,X_k$. A process $\rho$ is stationary if

$$\rho(X_{1..|B|}=B)=\rho(X_{t..t+|B|-1}=B)$$

for any $B\in A^*$ and $t\in\mathbb N$. Denote $\mathcal S$ the set of all stationary processes on $A^\infty$. A stationary process $\rho$ is called (stationary) ergodic if the frequency of occurrence of each word $B$ in a sequence $X_1,X_2,\dots$ generated by $\rho$ tends to its a priori (or limiting) probability a.s.: $\rho(\lim_{n\to\infty}\nu(X_{1..n},B)=\rho(X_{1..|B|}=B))=1$. Denote $\mathcal E$ the set of all stationary ergodic processes.

A distributional distance is defined for a pair of processes $\rho_1,\rho_2$ as follows [3]:

$$d(\rho_1,\rho_2)=\sum_{i=1}^{\infty}w_i\,|\rho_1(X_{1..|B_i|}=B_i)-\rho_2(X_{1..|B_i|}=B_i)|,$$

where $w_i$ are summable positive real weights (e.g. $w_k=2^{-k}$: we fix this choice for the sake of concreteness). Here and below, $\rho(B)$ abbreviates $\rho(X_{1..|B|}=B)$. It is easy to see that $d$ is a metric. Equipped with this metric, the space of all stochastic processes is compact, and the set of stationary processes $\mathcal S$ is its convex closed subset. (The set $\mathcal E$ is not closed.) When talking about closed and open subsets of $\mathcal S$ we assume the topology of $d$. Compactness of the set $\mathcal S$ is one of the main ingredients in the proofs of the main results. Another is that the distance $d$ can be consistently estimated, as the following lemma shows (because of its importance for further development, we give it with a proof).

Lemma 1 ($\hat d$ is consistent [15, 16]). Let $\rho,\xi\in\mathcal E$ and let a sample $X_{1..k}$ be generated by $\rho$. Then $\lim_{k\to\infty}\hat d(X_{1..k},\xi)=d(\rho,\xi)$ $\rho$-a.s.

Proof. For any $\varepsilon>0$ find an index $J$ such that $\sum_{i=J}^{\infty}w_i<\varepsilon/2$. For each $j$ we have $\lim_{k\to\infty}\nu(X_{1..k},B_j)=\rho(B_j)$ a.s., so that $|\nu(X_{1..k},B_j)-\rho(B_j)|<\varepsilon/(2Jw_j)$ from some $k$ on; denote $K_j$ this $k$. Let $K=\max_{j\le J}K_j$. For $k>K$ we have

$$\begin{aligned}|\hat d(X_{1..k},\xi)-d(\rho,\xi)| &= \Big|\sum_{i=1}^{\infty}w_i\big(|\nu(X_{1..k},B_i)-\xi(B_i)|-|\rho(B_i)-\xi(B_i)|\big)\Big|\\ &\le \sum_{i=1}^{\infty}w_i\,|\nu(X_{1..k},B_i)-\rho(B_i)|\\ &\le \sum_{i=1}^{J}w_i\,|\nu(X_{1..k},B_i)-\rho(B_i)|+\varepsilon/2\\ &\le \sum_{i=1}^{J}w_i\,\varepsilon/(2Jw_i)+\varepsilon/2=\varepsilon,\end{aligned}$$

which proves the statement.

Considering the Borel (with respect to the metric $d$) sigma-algebra $\mathcal F_{\mathcal S}$ on the set $\mathcal S$, we obtain a standard probability space $(\mathcal S,\mathcal F_{\mathcal S})$. An important tool that will be used in the analysis is ergodic decomposition of stationary processes (see e.g. [3, 1]): any stationary process can be expressed as a mixture of stationary ergodic processes. More formally, for any $\rho\in\mathcal S$ there is a measure $W_\rho$ on $(\mathcal S,\mathcal F_{\mathcal S})$, such that $W_\rho(\mathcal E)=1$, and $\rho(B)=\int dW_\rho(\mu)\,\mu(B)$, for any $B\in\mathcal F_{A^\infty}$. The support of a stationary distribution $\rho$ is the minimal closed set $U\subset\mathcal S$ such that $W_\rho(U)=1$.

A test is a function $\varphi:A^*\to\{0,1\}$ that takes a sample and outputs a binary answer, where the answer $i$ is interpreted as “the sample was generated by a distribution that belongs to $H_i$”. The answer $i$ is correct if the sample was indeed generated by a distribution from $H_i$, otherwise we say that the test made an error.

A test $\varphi$ is called uniformly consistent if for every $\alpha>0$ there is an $n_\alpha\in\mathbb N$ such that for every $n\ge n_\alpha$ the probability of error on a sample of size $n$ is less than $\alpha$:

$$\rho(X\in A^n:\varphi(X)=i)<\alpha$$

for every $\rho\in H_{1-i}$ and every $i\in\{0,1\}$.

The tests presented below are based on empirical estimates of the distributional distance $d$:

$$\hat d(X_{1..n},\rho)=\sum_{i=1}^{\infty}w_i\,|\nu(X_{1..n},B_i)-\rho(B_i)|,$$

where $n\in\mathbb N$, $\rho\in\mathcal S$, $X_{1..n}\in A^n$.
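The definitions of $\nu$ and $\hat d$ above can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the paper: the infinite sum is truncated after `n_terms` words, a process is represented by a function returning the probability of a word, and the names `frequency`, `words`, `empirical_distance` and `iid` are hypothetical helpers chosen for this illustration.

```python
from itertools import count, islice, product


def frequency(x, b):
    """nu(x, b): share of the |x|-|b|+1 windows of x equal to b (0 if |x| < |b|)."""
    windows = len(x) - len(b) + 1
    if windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if x[i:i + len(b)] == b)
    return hits / windows


def words(alphabet):
    """Enumerate A* so that all words of length l come before those of length l + 1."""
    for length in count(1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def empirical_distance(x, rho, alphabet, n_terms=30):
    """Truncated estimate: sum over i <= n_terms of 2^-i * |nu(x, B_i) - rho(B_i)|.

    Assumption: truncation replaces the infinite sum; the dropped tail has
    total weight below 2^-n_terms.
    """
    total = 0.0
    for i, b in enumerate(islice(words(alphabet), n_terms), start=1):
        total += 2.0 ** -i * abs(frequency(x, b) - rho(b))
    return total


def iid(p_one):
    """Hypothetical helper: word probabilities under an i.i.d. binary process."""
    def rho(b):
        prob = 1.0
        for symbol in b:
            prob *= p_one if symbol == "1" else 1.0 - p_one
        return prob
    return rho
```

For instance, `frequency("0001", "00")` counts two matches among three windows, and `empirical_distance("0000", iid(0.5), "01")` compares an all-zero sample with a fair-coin process. With a binary alphabet, `n_terms=30` covers exactly the words of length 1 to 4.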
In other words, $\hat d(X_{1..n},\rho)$ measures the discrepancy between empirically estimated and theoretical probabilities. For a sample $X_{1..n}\in A^n$ and a hypothesis $H\subset\mathcal E$ define

$$\hat d(X_{1..n},H)=\inf_{\rho\in H}\hat d(X_{1..n},\rho).$$

For $H\subset\mathcal S$, denote $\operatorname{cl}H$ the closure of $H$ (with respect to the topology of $d$).

For $H_0,H_1\subset\mathcal S$, the uniform test $\varphi_{H_0,H_1}$ is constructed as follows. For each $n\in\mathbb N$ let

$$\varphi_{H_0,H_1}(X_{1..n}):=\begin{cases}0 & \text{if } \hat d(X_{1..n},\operatorname{cl}H_0\cap\mathcal E)<\hat d(X_{1..n},\operatorname{cl}H_1\cap\mathcal E),\\ 1 & \text{otherwise.}\end{cases}$$

Theorem 1 (uniform testing). Let $H_0\subset\mathcal S$ and $H_1\subset\mathcal S$. If $W_\rho(H_i)=1$ for every $\rho\in\operatorname{cl}H_i$ then the test $\varphi_{H_0,H_1}$ is uniformly consistent. Conversely, if there exists a uniformly consistent test for $H_0$ against $H_1$ then $W_\rho(H_{1-i})=0$ for any $\rho\in\operatorname{cl}H_i$.

The proof is deferred to section 5.
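As an illustration of the construction of $\varphi_{H_0,H_1}$, the following Python sketch applies the decision rule to hypotheses consisting of finitely many ergodic processes, so that each hypothesis coincides with its closure and the infimum becomes a minimum. This is a sketch under stated assumptions, not the author's implementation: the infinite sum in $\hat d$ is truncated, processes are represented by word-probability functions, and `uniform_test`, `empirical_distance`, `frequency` and `iid` are hypothetical names.

```python
from itertools import count, islice, product


def frequency(x, b):
    """nu(x, b): share of the |x|-|b|+1 windows of x equal to b (0 if |x| < |b|)."""
    windows = len(x) - len(b) + 1
    if windows <= 0:
        return 0.0
    return sum(1 for i in range(windows) if x[i:i + len(b)] == b) / windows


def empirical_distance(x, rho, alphabet, n_terms=30):
    """Truncated hat-d with weights 2^-i; words are enumerated by increasing length."""
    all_words = ("".join(t) for n in count(1) for t in product(alphabet, repeat=n))
    return sum(
        2.0 ** -i * abs(frequency(x, b) - rho(b))
        for i, b in enumerate(islice(all_words, n_terms), start=1)
    )


def iid(p_one):
    """Hypothetical helper: word probabilities under an i.i.d. binary process."""
    def rho(b):
        prob = 1.0
        for symbol in b:
            prob *= p_one if symbol == "1" else 1.0 - p_one
        return prob
    return rho


def uniform_test(x, h0, h1, alphabet="01"):
    """Output 0 if the sample is strictly closer to h0 than to h1, and 1 otherwise.

    Assumption: h0 and h1 are finite lists of ergodic processes, so the
    infimum over each (closed) hypothesis is a minimum over the list.
    """
    d0 = min(empirical_distance(x, rho, alphabet) for rho in h0)
    d1 = min(empirical_distance(x, rho, alphabet) for rho in h1)
    return 0 if d0 < d1 else 1
```

A sample dominated by zeros, such as `"0001000000100000"`, is assigned to `h0 = [iid(0.1)]` rather than to `h1 = [iid(0.5), iid(0.9)]`. Infinite or non-closed hypotheses would need an actual minimisation over the closure, which this sketch does not attempt.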

First of all, it is obvious that sets that consist of just one or finitely many stationary ergodic processes are closed and closed under ergodic decompositions; therefore, for any pair of disjoint sets of this type, there exists a uniformly consistent test. (In particular, there is a uniformly consistent test for $H_0=\{\rho_0\}$ against $H_1=\{\rho_1\}$, where $\rho_0,\rho_1\in\mathcal E$.)

It is clear that for any $\rho$ there is no uniformly consistent test for $\{\rho\}$ against $\mathcal E\setminus\{\rho\}$. More generally, for any non-empty $H_0$ there is no uniformly consistent test for $H_0$ against $\mathcal E\setminus H_0$ provided the latter complement is also non-empty. Indeed, this follows from Theorem 1 since in these cases the closures of $H_0$ and $H_1$ are not disjoint. One might suggest at this point that a uniformly consistent test exists if we restrict $H_1$ to those processes that are sufficiently far from $\rho$. However, this is not true. We can prove an even stronger negative result.

Proposition 1.

Let $\rho,\nu\in\mathcal E$, $\rho\ne\nu$ and let $\varepsilon>0$. There is no uniformly consistent test for $H_0=\{\rho\}$ against $H_1=\{\nu'\in\mathcal E: d(\nu',\nu)\le\varepsilon\}$.

The proof of the proposition is deferred to the next section. What the proposition means is that, while distributional distance is well suited for characterizing those hypotheses for which consistent tests exist, it is not suited for formulating the actual hypotheses. Apparently a stronger distance is needed for the latter.

The following statement is easy to demonstrate from Theorem 1.

Corollary 1.

Given two disjoint sets $H_0$ and $H_1$ each of which is continuously parametrized by a compact set of parameters and is closed under taking ergodic decompositions, there exists a uniformly consistent test of $H_0$ against $H_1$.

Examples of parametrisations mentioned in the Corollary are the sets of $k$-order Markov sources, parametrised by transition probabilities. Thus, any two disjoint closed subsets of these sets satisfy the assumption of the Corollary.

5 Proofs

The proof of Theorem 1 will use the following lemmas, whose proofs can befound in [14].

Lemma 2 (smooth probabilities of deviation). Let $m>2k>1$, $\rho\in\mathcal S$, $H\subset\mathcal S$, and $\varepsilon>0$. Then

$$\rho(\hat d(X_{1..m},H)\ge\varepsilon)\le 2\varepsilon'^{-1}\rho(\hat d(X_{1..k},H)\ge\varepsilon'),\qquad(3)$$

where $\varepsilon':=\varepsilon-\frac{2k}{m-k+1}-t_k$ with $t_k$ being the sum of all the weights of tuples longer than $k$ in the definition of $d$: $t_k:=\sum_{i:|B_i|>k}w_i$. Further,

$$\rho(\hat d(X_{1..m},H)\le\varepsilon)\le\rho\Big(\hat d(X_{1..k},H)\le\frac{m}{m-k+1}\,2\varepsilon+\frac{4k}{m-k+1}\Big).\qquad(4)$$

The meaning of this lemma is as follows. For any word $X_{1..m}$, if it is far away from (or close to) a given distribution $\mu$ (in the empirical distributional distance), then some of its shorter subwords $X_{i..i+k}$ are far from (close to) $\mu$ too. In other words, for a stationary distribution $\mu$, it cannot happen that a small sample is likely to be close to $\mu$, but a larger sample is likely to be far.

Lemma 3.

Let $\rho_k\in\mathcal S$, $k\in\mathbb N$ be a sequence of processes that converges to a process $\rho_*$. Then, for any $T\in A^*$ and $\varepsilon>0$, if $\rho_k(T)>\varepsilon$ for infinitely many indices $k$, then $\rho_*(T)\ge\varepsilon$.

This statement follows from the fact that $\rho(T)$ is continuous as a function of $\rho$.

Proof of Theorem 1.

To prove the first statement of the theorem, we will show that the test $\varphi_{H_0,H_1}$ is a uniformly consistent test for $\operatorname{cl}H_0\cap\mathcal E$ against $\operatorname{cl}H_1\cap\mathcal E$ (and hence for $H_0$ against $H_1$), under the conditions of the theorem. Suppose that, on the contrary, for some $\alpha>0$ and for every $n'\in\mathbb N$ there is a process $\rho\in\operatorname{cl}H_0$ such that $\rho(\varphi(X_{1..n})=1)>\alpha$ for some $n>n'$. Define

$$\Delta:=d(\operatorname{cl}H_0,\operatorname{cl}H_1):=\inf_{\rho_0\in\operatorname{cl}H_0\cap\mathcal E,\ \rho_1\in\operatorname{cl}H_1\cap\mathcal E}d(\rho_0,\rho_1),$$

which is positive since $\operatorname{cl}H_0$ and $\operatorname{cl}H_1$ are closed and disjoint. We have

$$\begin{aligned}\alpha<\rho(\varphi(X_{1..n})=1)&\le\rho(\hat d(X_{1..n},H_0)\ge\Delta/2\ \text{or}\ \hat d(X_{1..n},H_1)<\Delta/2)\\ &\le\rho(\hat d(X_{1..n},H_0)\ge\Delta/2)+\rho(\hat d(X_{1..n},H_1)<\Delta/2).\end{aligned}\qquad(5)$$

This implies that either $\rho(\hat d(X_{1..n},\operatorname{cl}H_0)\ge\Delta/2)>\alpha/2$ or $\rho(\hat d(X_{1..n},\operatorname{cl}H_1)<\Delta/2)>\alpha/2$, so that, by assumption, at least one of these inequalities holds for infinitely many $n\in\mathbb N$ for some sequence $\rho_n\in H_0$. Suppose that it is the first one, that is, there is an increasing sequence $n_i$, $i\in\mathbb N$ and a sequence $\rho_i\in\operatorname{cl}H_0$, $i\in\mathbb N$ such that

$$\rho_i(\hat d(X_{1..n_i},\operatorname{cl}H_0)\ge\Delta/2)>\alpha/2\ \text{for all } i\in\mathbb N.\qquad(6)$$

The set $\mathcal S$ is compact, hence so is its closed subset $\operatorname{cl}H_0$. Therefore, the sequence $\rho_i$, $i\in\mathbb N$ must contain a subsequence that converges to a certain process $\rho_*\in\operatorname{cl}H_0$. Passing to a subsequence if necessary, we may assume that this convergent subsequence is the sequence $\rho_i$, $i\in\mathbb N$ itself.

Using Lemma 2, (3) (with $\rho=\rho_{n_m}$, $m=n_m$, $k=n_k$, and $H=\operatorname{cl}H_0$), and taking $k$ large enough to have $t_{n_k}<\Delta/4$, for every $m$ large enough to have $\frac{2n_k}{n_m-n_k+1}<\Delta/4$, we obtain

$$8\Delta^{-1}\rho_{n_m}\big(\hat d(X_{1..n_k},\operatorname{cl}H_0)\ge\Delta/4\big)\ge\rho_{n_m}\big(\hat d(X_{1..n_m},\operatorname{cl}H_0)\ge\Delta/2\big)>\alpha/2.\qquad(7)$$

That is, we have shown that for any large enough index $n_k$ the inequality $\rho_{n_m}(\hat d(X_{1..n_k},\operatorname{cl}H_0)\ge\Delta/4)>\Delta\alpha/16$ holds for infinitely many indices $n_m$. From this and Lemma 3 with $T=T_k:=\{X:\hat d(X_{1..n_k},\operatorname{cl}H_0)\ge\Delta/4\}$ we conclude that $\rho_*(T_k)\ge\Delta\alpha/16$. The latter holds for infinitely many $k$; that is, $\rho_*(\hat d(X_{1..n_k},\operatorname{cl}H_0)\ge\Delta/4)\ge\Delta\alpha/16$ infinitely often. Therefore,

$$\rho_*\big(\limsup_{n\to\infty}\hat d(X_{1..n},\operatorname{cl}H_0)\ge\Delta/4\big)>0.$$

However, we must have

$$\rho_*\big(\lim_{n\to\infty}\hat d(X_{1..n},\operatorname{cl}H_0)=0\big)=1$$

for every $\rho_*\in\operatorname{cl}H_0$: indeed, for $\rho_*\in\operatorname{cl}H_0\cap\mathcal E$ it follows from Lemma 1, and for $\rho_*\in\operatorname{cl}H_0\setminus\mathcal E$ from Lemma 1, ergodic decomposition and the conditions of the theorem.

Thus, we have arrived at a contradiction that shows that $\rho_n(\hat d(X_{1..n},\operatorname{cl}H_0)>\Delta/2)>\alpha/2$ cannot hold for infinitely many $n\in\mathbb N$ for any sequence of $\rho_n\in\operatorname{cl}H_0$. Analogously, we can show that $\rho_n(\hat d(X_{1..n},\operatorname{cl}H_1)<\Delta/2)>\alpha/2$ cannot hold for infinitely many $n\in\mathbb N$ for any sequence of $\rho_n\in\operatorname{cl}H_0$. Indeed, using Lemma 2, equation (4), we can show that $\rho_{n_m}(\hat d(X_{1..n_m},\operatorname{cl}H_1)\le\Delta/2)>\alpha/2$ for a large enough $n_m$ implies $\rho_{n_m}(\hat d(X_{1..n_k},\operatorname{cl}H_1)\le 3\Delta/4)>\alpha/2$ for a smaller $n_k$. Therefore, if we assume that $\rho_n(\hat d(X_{1..n},\operatorname{cl}H_1)<\Delta/2)>\alpha/2$ for infinitely many $n\in\mathbb N$ for some sequence of $\rho_n\in\operatorname{cl}H_0$, then we will also find a $\rho_*$ for which $\rho_*(\hat d(X_{1..n},\operatorname{cl}H_1)\le 3\Delta/4)>\alpha/2$ for infinitely many $n$, which, using Lemma 1 and ergodic decomposition, can be shown to contradict the fact that $\rho_*(\lim_{n\to\infty}\hat d(X_{1..n},\operatorname{cl}H_1)\ge\Delta)=1$.

Thus, returning to (5), we have shown that from some $n$ on there is no $\rho\in\operatorname{cl}H_0$ for which $\rho(\varphi=1)>\alpha$ holds true. The statement for $\rho\in\operatorname{cl}H_1$ can be proven analogously, thereby finishing the proof of the first statement.

To prove the second statement of the theorem, we assume that there exists a uniformly consistent test $\varphi$ for $H_0$ against $H_1$, and we will show that $W_\rho(H_{1-i})=0$ for every $\rho\in\operatorname{cl}H_i$. Indeed, let $\rho\in\operatorname{cl}H_0$, that is, suppose that there is a sequence $\xi_i\in H_0$, $i\in\mathbb N$ such that $\xi_i\to\rho$. Assume $W_\rho(H_1)=\delta>0$ and take $\alpha:=\delta/2$. Since the test $\varphi$ is uniformly consistent, there is an $N\in\mathbb N$ such that for every $n>N$ we have

$$\rho(\varphi(X_{1..n})=0)\le\int_{H_1}\mu(\varphi(X_{1..n})=0)\,dW_\rho(\mu)+\int_{\mathcal E\setminus H_1}\mu(\varphi(X_{1..n})=0)\,dW_\rho(\mu)\le\delta\alpha+1-\delta\le 1-\delta/2.$$

Recall that, for $T\in A^*$, $\mu(T)$ is a continuous function in $\mu$. In particular, this holds for the set $T=\{X\in A^n:\varphi(X)=0\}$, for any given $n\in\mathbb N$. Therefore, for every $n>N$ and for every $i$ large enough, $\rho(\varphi(X_{1..n})=0)<1-\delta/2$ implies also $\xi_i(\varphi(X_{1..n})=0)<1-\delta/2$, which contradicts $\xi_i\in H_0$, since uniform consistency with $\alpha=\delta/2$ requires $\xi_i(\varphi(X_{1..n})=1)<\delta/2$. This contradiction shows $W_\rho(H_1)=0$ for every $\rho\in\operatorname{cl}H_0$. The case $\rho\in\operatorname{cl}H_1$ is analogous. The theorem is proven.

Proof of Proposition 1.

Assume $d(\rho,\nu)>\varepsilon$ (the other case is obvious). Consider the process $(x_1,y_1),(x_2,y_2),\dots$ on pairs $(x_i,y_i)\in A^2$, such that the distribution of $x_1,x_2,\dots$ is $\nu$, the distribution of $y_1,y_2,\dots$ is $\rho$ and the two components $x_i$ and $y_i$ are independent; in other words, the distribution of $(x_i,y_i)$ is $\nu\times\rho$. Consider also a two-state stationary ergodic Markov chain $\mu$, with two states 1 and 2, whose transition probabilities are

$$\begin{pmatrix}1-p & p\\ q & 1-q\end{pmatrix},$$

where $0<p<q<1$. The limiting (and initial) probability of the state 1 is $p/(p+q)$ and that of the state 2 is $q/(p+q)$. Finally, the process $z_1,z_2,\dots$ is constructed as follows: $z_i=x_i$ if $\mu$ is in the state 1 and $z_i=y_i$ otherwise (here it is assumed that the chain $\mu$ generates a sequence of outcomes independently of $(x_i,y_i)$). Clearly, for every $p,q$ satisfying $0<p<q<1$ the process $z_1,z_2,\dots$ is stationary ergodic; denote $\zeta$ its distribution. Let $p_n:=1/(n+1)$, $n\in\mathbb N$. Since $d(\rho,\nu)>\varepsilon$, we can find a $\delta>0$ such that $d(\rho,\zeta_n)>\varepsilon$, where $\zeta_n$ is the distribution $\zeta$ with parameters $p_n$ and $q_n$, where $q_n$ satisfies $q_n/(p_n+q_n)=\delta$. Thus, $\zeta_n\in H_1$ for all $n\in\mathbb N$. However, $\lim_{n\to\infty}\zeta_n=\zeta_\infty$ where $\zeta_\infty$ is the stationary distribution with $W_{\zeta_\infty}(\rho)=\delta$ and $W_{\zeta_\infty}(\nu)=1-\delta$. Therefore, $\zeta_\infty\in\operatorname{cl}H_1$ and $W_{\zeta_\infty}(H_0)>0$, so that by Theorem 1 there is no uniformly consistent test for $H_0$ against $H_1$, which concludes the proof.

Acknowledgements

This work has been partially supported by the French Ministry of Higher Education and Research, Nord-Pas de Calais Regional Council and FEDER through the “Contrat de Projets Etat Region (CPER) 2007-2013.”

References

[1] P. Billingsley. Ergodic theory and information. Wiley, New York, 1965.

[2] A. Dembo and Y. Peres. A topological criterion for hypothesis testing. Ann. Math. Stat., 22:106–117, 1994.

[3] R. Gray. Probability, Random Processes, and Ergodic Properties. Springer Verlag, 1988.

[4] M.G. Kendall and A. Stuart. The advanced theory of statistics; Vol.2: Inference and relationship. London, 1961.

[5] A. Khaleghi, D. Ryabko, J. Mary, and P. Preux. Online clustering of processes. In AISTATS, JMLR W&CP 22, pages 601–609, 2012.

[6] J.C. Kieffer. Strongly consistent code-based identification and order estimation for constrained finite-state model classes. IEEE Transactions on Information Theory, 39(3):893–902, 1993.

[7] E. Lehmann. Testing Statistical Hypotheses, 2nd edition. Wiley, New York, 1986.

[8] A. Nobel. Hypothesis testing for families of ergodic processes. Bernoulli, 12(2):251–269, 2006.

[9] B. Ryabko and J. Astola. Universal codes as a basis for time series testing. Statistical Methodology, 3:375–397, 2006.

[10] B. Ryabko, J. Astola, and A. Gammerman. Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series. Theoretical Computer Science, 359:440–448, 2006.

[11] D. Ryabko. Clustering processes. In Proc. the 27th International Conference on Machine Learning (ICML 2010), pages 919–926, Haifa, Israel, 2010.

[12] D. Ryabko. Discrimination between B-processes is impossible. Journal of Theoretical Probability, 23(2):565–575, 2010.

[13] D. Ryabko. Testing composite hypotheses about discrete-valued stationary processes. In Proc. IEEE Information Theory Workshop (ITW’10), pages 291–295, Cairo, Egypt, 2010. IEEE.

[14] D. Ryabko. Testing composite hypotheses about discrete ergodic processes. Test, 21(2):317–329, 2012.

[15] D. Ryabko and B. Ryabko. On hypotheses testing for ergodic processes. In Proc. 2008 IEEE Information Theory Workshop, pages 281–283, Porto, Portugal, 2008. IEEE.

[16] D. Ryabko and B. Ryabko. Nonparametric statistical inference for ergodic processes.