Uniform hypothesis testing for ergodic time series distributions
arXiv [math.ST], Dec
Uniform hypothesis testing for finite-valued stationary processes
Daniil Ryabko ∗

Abstract
Given a discrete-valued sample $X_1, \dots, X_n$ we wish to decide whether it was generated by a distribution belonging to a family $H_0$, or it was generated by a distribution belonging to a family $H_1$. In this work we assume that all distributions are stationary ergodic, and do not make any further assumptions (e.g. no independence or mixing rate assumptions). We would like to have a test whose probability of error (both Type I and Type II) is uniformly bounded. More precisely, we require that for each $\varepsilon$ there exist a sample size $n$ such that the probability of error is upper-bounded by $\varepsilon$ for samples longer than $n$. We find some necessary and some sufficient conditions on $H_0$ and $H_1$ under which a consistent test (with this notion of consistency) exists. These conditions are topological, with respect to the topology of distributional distance.

1 Introduction

Given a sample $X_1, \dots, X_n$ (where the $X_i$ are from a finite alphabet $A$) which is known to be generated by a stationary ergodic process, we wish to decide whether it was generated by a distribution belonging to a family $H_0$, or by a distribution belonging to a family $H_1$. Unlike most of the works on the subject, we do not assume that the $X_i$ are i.i.d., but only make the much weaker assumption that the distribution generating the sample is stationary ergodic.

A test is a function that takes a sample and gives a binary (possibly incorrect) answer: either the sample was generated by a distribution from $H_0$ or by a distribution from $H_1$. An answer $i \in \{0, 1\}$ is correct if the sample is generated by a distribution that belongs to $H_i$. Here we are concerned with characterizing those pairs of $H_0$ and $H_1$ for which consistent tests exist.

Consistency.
In this work we consider the following notion of consistency. For two hypotheses $H_0$ and $H_1$, a test is called uniformly consistent if for any $\varepsilon > 0$ there is a sample size $n$ such that the probability of error on a sample of size larger than $n$ is not greater than $\varepsilon$ if any distribution from $H_0 \cup H_1$ is chosen to generate the sample. Thus, a uniformly consistent test provides performance guarantees for finite sample sizes.

∗ INRIA Lille-Nord Europe, 40, avenue Halley, 59650 Villeneuve d’Ascq, France, [email protected]

The results.
Here we obtain some topological conditions on the hypotheses for which consistent tests exist, for the case of stationary ergodic distributions. A distributional distance between two process distributions [3] is defined as a weighted sum of the absolute differences between the probabilities that the two distributions assign to each possible tuple $X \in A^*$, where $A$ is the alphabet and the weights are positive and have a finite sum.

The test $\varphi_{H_0,H_1}$ that we construct is based on empirical estimates of distributional distance. It outputs 0 if the given sample is closer to the (closure of) $H_0$ than to the (closure of) $H_1$, and outputs 1 otherwise. The main result is as follows.

Theorem.
Let $H_0, H_1 \subset \mathcal{E}$, where $\mathcal{E}$ is the set of all stationary ergodic process distributions. If, for each $i \in \{0, 1\}$, the set $H_i$ has probability 1 with respect to ergodic decompositions of every element of the closure of $H_i$, then there is a uniformly consistent test for $H_0$ against $H_1$. Conversely, if there is a uniformly consistent test for $H_0$ against $H_1$, then, for each $i \in \{0, 1\}$, the set $H_{1-i}$ has probability 0 with respect to ergodic decompositions of every element of the closure of $H_i$.

Prior work.
This work continues our previous research [13, 14], which provides similar necessary and sufficient conditions for the existence of a consistent test, for a weaker notion of asymmetric consistency: Type I error is uniformly bounded, while Type II error is required to tend to 0 as the sample size grows. Besides that, there is of course a vast body of literature on hypothesis testing for i.i.d. (real- or discrete-valued) data (see e.g. [7, 4]). There is, however, much less literature on hypothesis testing beyond i.i.d. or parametric models. For a weaker notion of consistency, namely, requiring that the test should stabilize on the correct answer for a.e. realization of the process (under either $H_0$ or $H_1$), [6] constructs a consistent test for so-called constrained finite-state model classes (including finite-state Markov and hidden Markov processes), against the general alternative of stationary ergodic processes. For the same notion of consistency, [8] gives sufficient conditions on two hypotheses $H_0$ and $H_1$ that consist of stationary ergodic real-valued processes, under which a consistent test exists, extending the results of [2] for i.i.d. data. The latter condition is that $H_0$ and $H_1$ are contained in disjoint $F_\sigma$ sets (countable unions of closed sets), with respect to the topology of weak convergence. Asymmetrically consistent tests for some specific hypotheses, but under the general alternative of stationary ergodic processes, have been proposed in [9, 10, 15, 16], which address problems of testing identity, independence, estimating the order of a Markov process, and also the change point problem. Notably, a conceptually simple hypothesis of homogeneity (testing whether two samples are generated by the same or by different processes) does not admit a consistent test even in the weakest asymptotic sense, as was shown in [12]. Empirical estimates of distributional distance have also been used to address the problem of clustering time series [11, 5].

2 Preliminaries
Let $A$ be a finite alphabet, and denote by $A^*$ the set of words (or tuples) $\cup_{i=1}^{\infty} A^i$. For a word $B$ the symbol $|B|$ stands for the length of $B$. Denote by $B_i$ the $i$th element of $A^*$, enumerated in such a way that the elements of $A^i$ appear before the elements of $A^{i+1}$, for all $i \in \mathbb{N}$. Distributions or (stochastic) processes are probability measures on the space $(A^\infty, \mathcal{F}_{A^\infty})$, where $\mathcal{F}_{A^\infty}$ is the Borel sigma-algebra of $A^\infty$. Denote by $\#(X, B)$ the number of occurrences of a word $B$ in a word $X \in A^*$ and by $\nu(X, B)$ its frequency:
$$\#(X, B) = \sum_{i=1}^{|X|-|B|+1} I_{\{(X_i, \dots, X_{i+|B|-1}) = B\}},$$

and

$$\nu(X, B) = \begin{cases} \frac{1}{|X|-|B|+1}\, \#(X, B) & \text{if } |X| \ge |B|, \\ 0 & \text{otherwise,} \end{cases}$$

where $X = (X_1, \dots, X_{|X|})$. For example, $\nu(0001, 00) = 2/3$. We use the abbreviation $X_{1..k}$ for $X_1, \dots, X_k$.

A process $\rho$ is stationary if $\rho(X_{1..|B|} = B) = \rho(X_{t..t+|B|-1} = B)$ for any $B \in A^*$ and $t \in \mathbb{N}$. Denote by $\mathcal{S}$ the set of all stationary processes on $A^\infty$. A stationary process $\rho$ is called (stationary) ergodic if the frequency of occurrence of each word $B$ in a sequence $X_1, X_2, \dots$ generated by $\rho$ tends to its a priori (or limiting) probability a.s.: $\rho(\lim_{n\to\infty} \nu(X_{1..n}, B) = \rho(X_{1..|B|} = B)) = 1$. Denote by $\mathcal{E}$ the set of all stationary ergodic processes.

A distributional distance is defined for a pair of processes $\rho_1, \rho_2$ as follows [3]:

$$d(\rho_1, \rho_2) = \sum_{i=1}^{\infty} w_i \left| \rho_1(X_{1..|B_i|} = B_i) - \rho_2(X_{1..|B_i|} = B_i) \right|,$$

where $w_i$ are summable positive real weights (e.g. $w_k = 2^{-k}$: we fix this choice for the sake of concreteness). It is easy to see that $d$ is a metric. Equipped with this metric, the space of all stochastic processes is compact, and the set of stationary processes $\mathcal{S}$ is its convex closed subset. (The set $\mathcal{E}$ is not closed.) When talking about closed and open subsets of $\mathcal{S}$ we assume the topology of $d$. Compactness of the set $\mathcal{S}$ is one of the main ingredients in the proofs of the main results. Another is that the distance $d$ can be consistently estimated, as the following lemma shows (because of its importance for further development, we give it with a proof).

Lemma 1 ($\hat d$ is consistent [15, 16]). Let $\rho, \xi \in \mathcal{E}$ and let a sample $X_{1..k}$ be generated by $\rho$. Then $\lim_{k\to\infty} \hat d(X_{1..k}, \xi) = d(\rho, \xi)$ $\rho$-a.s.

Proof. For any $\varepsilon > 0$ we can find an index $J$ such that $\sum_{i=J}^{\infty} w_i < \varepsilon/2$. For each $j$ we have $\lim_{k\to\infty} \nu(X_{1..k}, B_j) = \rho(B_j)$ a.s., so that $|\nu(X_{1..k}, B_j) - \rho(B_j)| < \varepsilon/(2 J w_j)$ from some $k$ on; denote by $K_j$ this $k$. Let $K = \max_j$
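The quantities used in Lemma 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the text: the empirical distance $\hat d(X_{1..n}, \xi)$ is taken to be $d$ with the probabilities of one process replaced by the sample frequencies $\nu(X_{1..n}, B_i)$, which is how the proof of Lemma 1 appears to use it; the infinite sum is truncated to words of length at most `max_len`; and the reference process is an i.i.d. binary source. All function names are hypothetical.

```python
from itertools import product


def count(x: str, b: str) -> int:
    """#(X, B): number of (possibly overlapping) occurrences of word b in x."""
    return sum(1 for i in range(len(x) - len(b) + 1) if x[i:i + len(b)] == b)


def nu(x: str, b: str) -> float:
    """nu(X, B): frequency of b in x, defined as 0 when x is shorter than b."""
    if len(x) < len(b):
        return 0.0
    return count(x, b) / (len(x) - len(b) + 1)


def words(alphabet: str, max_len: int):
    """Enumerate B_1, B_2, ... so that shorter words come before longer ones."""
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def d_hat(x: str, prob, alphabet: str = "01", max_len: int = 3) -> float:
    """Truncated empirical distributional distance with weights w_i = 2^{-i}.

    Assumption: sample frequencies stand in for the probabilities of the
    process that generated x; `prob(b)` is the probability of word b under
    the reference process. Words longer than max_len are ignored.
    """
    return sum(
        2.0 ** -i * abs(nu(x, b) - prob(b))
        for i, b in enumerate(words(alphabet, max_len), start=1)
    )


def bernoulli(p: float):
    """Word probabilities under an i.i.d. process with P(symbol '1') = p."""
    def prob(b: str) -> float:
        ones = b.count("1")
        return p ** ones * (1 - p) ** (len(b) - ones)
    return prob


if __name__ == "__main__":
    print(nu("0001", "00"))  # 2/3, matching the example in the text
    sample = "01" * 50
    print(d_hat(sample, bernoulli(0.5)), d_hat(sample, bernoulli(0.9)))
```

Truncating at `max_len` changes the value of the sum by at most the total weight of the omitted words, so the sketch only approximates the distance defined above.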
First of all, it is obvious that sets that consist of just one or finitely many stationary ergodic processes are closed and closed under ergodic decompositions; therefore, for any pair of disjoint sets of this type, there exists a uniformly consistent test. (In particular, there is a uniformly consistent test for $H_0 = \{\rho_0\}$ against $H_1 = \{\rho_1\}$, where $\rho_0, \rho_1 \in \mathcal{E}$.)

It is clear that for any $\rho_0$ there is no uniformly consistent test for $\{\rho_0\}$ against $\mathcal{E} \setminus \{\rho_0\}$. More generally, for any non-empty $H_0$ there is no uniformly consistent test for $H_0$ against $\mathcal{E} \setminus H_0$ provided the latter complement is also non-empty. Indeed, this follows from Theorem 1 since in these cases the closures of $H_0$ and $H_1$ are not disjoint. One might suggest at this point that a uniformly consistent test exists if we restrict $H_1$ to those processes that are sufficiently far from $\rho_0$. However, this is not true. We can prove an even stronger negative result.

Proposition 1.
Let $\rho, \nu \in \mathcal{E}$, $\rho \ne \nu$ and let $\varepsilon > 0$. There is no uniformly consistent test for $H_0 = \{\rho\}$ against $H_1 = \{\nu' \in \mathcal{E} : d(\nu', \nu) \le \varepsilon\}$.

The proof of the proposition is deferred to the next section. What the proposition means is that, while distributional distance is well suited for characterizing those hypotheses for which consistent tests exist, it is not suited for formulating the actual hypotheses. Apparently a stronger distance is needed for the latter.

The following statement is easy to demonstrate from Theorem 1.
Corollary 1.
Given two disjoint sets $H_0$ and $H_1$ each of which is continuously parametrized by a compact set of parameters and is closed under taking ergodic decompositions, there exists a uniformly consistent test of $H_0$ against $H_1$.

Examples of the parametrisations mentioned in the Corollary are the sets of $k$-order Markov sources, parametrised by transition probabilities. Thus, any two disjoint closed subsets of these sets satisfy the assumption of the Corollary.

5 Proofs
The proof of Theorem 1 will use the following lemmas, whose proofs can be found in [14].
Lemma 2 (smooth probabilities of deviation). Let $m > 2k > 1$, $\rho \in \mathcal{S}$, $H \subset \mathcal{S}$, and $\varepsilon > 0$. Then

$$\rho(\hat d(X_{1..m}, H) \ge \varepsilon) \le 2\varepsilon'^{-1} \rho(\hat d(X_{1..k}, H) \ge \varepsilon'), \tag{3}$$

where $\varepsilon' := \varepsilon - \frac{2k}{m-k+1} - t_k$ with $t_k$ being the sum of all the weights of tuples longer than $k$ in the definition of $d$: $t_k := \sum_{i : |B_i| > k} w_i$. Further,

$$\rho(\hat d(X_{1..m}, H) \le \varepsilon) \le 2\rho\left(\hat d(X_{1..k}, H) \le \frac{m}{m-k+1}\, 2\varepsilon + \frac{4k}{m-k+1}\right). \tag{4}$$

The meaning of this lemma is as follows. For any word $X_{1..m}$, if it is far away from (or close to) a given distribution $\mu$ (in the empirical distributional distance), then some of its shorter subwords $X_{i..i+k}$ are far from (close to) $\mu$ too. In other words, for a stationary distribution $\mu$, it cannot happen that a small sample is likely to be close to $\mu$, but a larger sample is likely to be far.

Lemma 3.
Let $\rho_k \in \mathcal{S}$, $k \in \mathbb{N}$ be a sequence of processes that converges to a process $\rho^*$. Then, for any $T \in A^*$ and $\varepsilon > 0$, if $\rho_k(T) > \varepsilon$ for infinitely many indices $k$, then $\rho^*(T) \ge \varepsilon$.

This statement follows from the fact that $\rho(T)$ is continuous as a function of $\rho$.

Proof of Theorem 1.
To prove the first statement of the theorem, we will show that the test $\varphi_{H_0,H_1}$ is a uniformly consistent test for $\mathrm{cl}\,H_0 \cap \mathcal{E}$ against $\mathrm{cl}\,H_1 \cap \mathcal{E}$ (and hence for $H_0$ against $H_1$), under the conditions of the theorem. Suppose that, on the contrary, for some $\alpha > 0$ and every $n' \in \mathbb{N}$ there is a process $\rho \in \mathrm{cl}\,H_0$ such that $\rho(\varphi(X_{1..n}) = 1) > \alpha$ for some $n > n'$. Define

$$\Delta := d(\mathrm{cl}\,H_0, \mathrm{cl}\,H_1) := \inf_{\rho_0 \in \mathrm{cl}\,H_0 \cap \mathcal{E},\ \rho_1 \in \mathrm{cl}\,H_1 \cap \mathcal{E}} d(\rho_0, \rho_1),$$

which is positive since $\mathrm{cl}\,H_0$ and $\mathrm{cl}\,H_1$ are closed and disjoint. We have

$$\alpha < \rho(\varphi(X_{1..n}) = 1) \le \rho(\hat d(X_{1..n}, H_0) \ge \Delta/2 \text{ or } \hat d(X_{1..n}, H_1) < \Delta/2) \le \rho(\hat d(X_{1..n}, H_0) \ge \Delta/2) + \rho(\hat d(X_{1..n}, H_1) < \Delta/2). \tag{5}$$

This implies that either $\rho(\hat d(X_{1..n}, \mathrm{cl}\,H_0) \ge \Delta/2) > \alpha/2$ or $\rho(\hat d(X_{1..n}, \mathrm{cl}\,H_1) < \Delta/2) > \alpha/2$, so that, by assumption, at least one of these inequalities holds for infinitely many $n \in \mathbb{N}$ for some sequence $\rho_n \in H_0$. Suppose that it is the first one, that is, there is an increasing sequence $n_i$, $i \in \mathbb{N}$ and a sequence $\rho_i \in \mathrm{cl}\,H_0$, $i \in \mathbb{N}$ such that

$$\rho_i(\hat d(X_{1..n_i}, \mathrm{cl}\,H_0) \ge \Delta/2) > \alpha/2 \text{ for all } i \in \mathbb{N}. \tag{6}$$

The set $\mathcal{S}$ is compact, hence so is its closed subset $\mathrm{cl}\,H_0$. Therefore, the sequence $\rho_i$, $i \in \mathbb{N}$ must contain a subsequence that converges to a certain process $\rho^* \in \mathrm{cl}\,H_0$. Passing to a subsequence if necessary, we may assume that this convergent subsequence is the sequence $\rho_i$, $i \in \mathbb{N}$ itself.

Using Lemma 2, (3) (with $\rho = \rho_{n_m}$, $m = n_m$, $k = n_k$, and $H = \mathrm{cl}\,H_0$), and taking $k$ large enough to have $t_{n_k} < \Delta/4$, for every $m$ large enough to have $\frac{2 n_k}{n_m - n_k + 1} < \Delta/4$, we obtain

$$8\Delta^{-1} \rho_{n_m}\left(\hat d(X_{1..n_k}, \mathrm{cl}\,H_0) \ge \Delta/4\right) \ge \rho_{n_m}\left(\hat d(X_{1..n_m}, \mathrm{cl}\,H_0) \ge \Delta/2\right) > \alpha/2. \tag{7}$$

That is, we have shown that for any large enough index $n_k$ the inequality $\rho_{n_m}(\hat d(X_{1..n_k}, \mathrm{cl}\,H_0) \ge \Delta/4) > \Delta\alpha/16$ holds for infinitely many indices $n_m$. From this and Lemma 3 with $T = T_k := \{X : \hat d(X_{1..n_k}, \mathrm{cl}\,H_0) \ge \Delta/4\}$ we conclude that $\rho^*(T_k) > \Delta\alpha/16$. The latter holds for infinitely many $k$; that is, $\rho^*(\hat d(X_{1..n_k}, \mathrm{cl}\,H_0) \ge \Delta/4) > \Delta\alpha/16$ infinitely often. Therefore,

$$\rho^*\left(\limsup_{n\to\infty} \hat d(X_{1..n}, \mathrm{cl}\,H_0) \ge \Delta/4\right) > 0.$$

However, we must have

$$\rho^*\left(\lim_{n\to\infty} \hat d(X_{1..n}, \mathrm{cl}\,H_0) = 0\right) = 1$$

for every $\rho^* \in \mathrm{cl}\,H_0$: indeed, for $\rho^* \in \mathrm{cl}\,H_0 \cap \mathcal{E}$ it follows from Lemma 1, and for $\rho^* \in \mathrm{cl}\,H_0 \setminus \mathcal{E}$ from Lemma 1, ergodic decomposition and the conditions of the theorem.

Thus, we have arrived at a contradiction that shows that $\rho_n(\hat d(X_{1..n}, \mathrm{cl}\,H_0) > \Delta/2) > \alpha/2$ for only finitely many $n \in \mathbb{N}$ for any sequence of $\rho_n \in \mathrm{cl}\,H_0$. Analogously, we can show that $\rho_n(\hat d(X_{1..n}, \mathrm{cl}\,H_1) < \Delta/2) > \alpha/2$ for only finitely many $n \in \mathbb{N}$ for any sequence of $\rho_n \in \mathrm{cl}\,H_0$. Indeed, using Lemma 2, equation (4), we can show that $\rho_{n_m}(\hat d(X_{1..n_m}, \mathrm{cl}\,H_1) \le \Delta/2) > \alpha/2$ for a large enough $n_m$ implies $\rho_{n_m}(\hat d(X_{1..n_k}, \mathrm{cl}\,H_1) \le 3\Delta/4) > \alpha/4$ for a smaller $n_k$. Therefore, if we assume that $\rho_n(\hat d(X_{1..n}, \mathrm{cl}\,H_1) < \Delta/2) > \alpha/2$ for infinitely many $n \in \mathbb{N}$ for some sequence of $\rho_n \in \mathrm{cl}\,H_0$, then we will also find a $\rho^*$ for which $\rho^*(\hat d(X_{1..n}, \mathrm{cl}\,H_1) \le 3\Delta/4) > \alpha/4$ for infinitely many $n$, which, using Lemma 1 and ergodic decomposition, can be shown to contradict the fact that $\rho^*(\lim_{n\to\infty} \hat d(X_{1..n}, \mathrm{cl}\,H_1) \ge \Delta) = 1$.

Thus, returning to (5), we have shown that from some $n$ on there is no $\rho \in \mathrm{cl}\,H_0$ for which $\rho(\varphi = 1) > \alpha$ holds true. The statement for $\rho \in \mathrm{cl}\,H_1$ can be proven analogously, thereby finishing the proof of the first statement.

To prove the second statement of the theorem, we assume that there exists a uniformly consistent test $\varphi$ for $H_0$ against $H_1$, and we will show that $W_\rho(H_{1-i}) = 0$ for every $\rho \in \mathrm{cl}\,H_i$. Indeed, let $\rho \in \mathrm{cl}\,H_0$, that is, suppose that there is a sequence $\xi_i \in H_0$, $i \in \mathbb{N}$ such that $\xi_i \to \rho$. Assume $W_\rho(H_1) = \delta > 0$ and take $\alpha := \delta/2$. Since the test $\varphi$ is uniformly consistent, there is an $N \in \mathbb{N}$ such that for every $n > N$ we have

$$\rho(\varphi(X_{1..n}) = 0) \le \int_{H_1} \mu(\varphi(X_{1..n}) = 0)\, dW_\rho(\mu) + \int_{\mathcal{E} \setminus H_1} \mu(\varphi(X_{1..n}) = 0)\, dW_\rho(\mu) \le \delta\alpha + 1 - \delta \le 1 - \delta/2.$$

Recall that, for $T \in A^*$, $\mu(T)$ is a continuous function in $\mu$. In particular, this holds for the set $T = \{X \in A^n : \varphi(X) = 0\}$, for any given $n \in \mathbb{N}$. Therefore, for every $n > N$ and for every $i$ large enough, $\rho(\varphi(X_{1..n}) = 0) < 1 - \delta/2$ implies also $\xi_i(\varphi(X_{1..n}) = 0) < 1 - \delta/2$, which contradicts $\xi_i \in H_0$. This contradiction shows $W_\rho(H_1) = 0$ for every $\rho \in \mathrm{cl}\,H_0$. The case $\rho \in \mathrm{cl}\,H_1$ is analogous. The theorem is proven.

Proof of Proposition 1.
Assume $d(\rho, \nu) > \varepsilon$ (the other case is obvious). Consider the process $(x_1, y_1), (x_2, y_2), \dots$ on pairs $(x_i, y_i) \in A^2$, such that the distribution of $x_1, x_2, \dots$ is $\nu$, the distribution of $y_1, y_2, \dots$ is $\rho$ and the two components $x_i$ and $y_i$ are independent; in other words, the distribution of $(x_i, y_i)$ is $\nu \times \rho$. Consider also a two-state stationary ergodic Markov chain $\mu$, with two states 1 and 2, whose transition probabilities are

$$\begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix},$$

where $0 < p < q < 1$. The limiting (and initial) probability of the state 1 is $p/(p+q)$ and that of the state 2 is $q/(p+q)$. Finally, the process $z_1, z_2, \dots$ is constructed as follows: $z_i = x_i$ if $\mu$ is in the state 1 and $z_i = y_i$ otherwise (here it is assumed that the chain $\mu$ generates a sequence of outcomes independently of $(x_i, y_i)$). Clearly, for every $p, q$ satisfying $0 < p < q < 1$ the process $z_1, z_2, \dots$ is stationary ergodic; denote by $\zeta$ its distribution. Let $p_n := 1/(n+1)$, $n \in \mathbb{N}$. Since $d(\rho, \nu) > \varepsilon$, we can find a $\delta > 0$ such that $d(\rho, \zeta_n) > \varepsilon$, where $\zeta_n$ is the distribution $\zeta$ with parameters $p_n$ and $q_n$, where $q_n$ satisfies $q_n/(p_n + q_n) = \delta$. Thus, $\zeta_n \in H_1$ for all $n \in \mathbb{N}$. However, $\lim_{n\to\infty} \zeta_n = \zeta_\infty$, where $\zeta_\infty$ is the stationary distribution with $W_{\zeta_\infty}(\rho) = \delta$ and $W_{\zeta_\infty}(\nu) = 1 - \delta$. Therefore, $\zeta_\infty \in \mathrm{cl}\,H_1$ and $W_{\zeta_\infty}(H_0) > 0$, so that by Theorem 1 there is no uniformly consistent test for $H_0$ against $H_1$, which concludes the proof.

Acknowledgements
This work has been partially supported by the French Ministry of Higher Education and Research, Nord-Pas de Calais Regional Council and FEDER through the “Contrat de Projets Etat Region (CPER) 2007-2013.”
References

[1] P. Billingsley. Ergodic theory and information. Wiley, New York, 1965.
[2] A. Dembo and Y. Peres. A topological criterion for hypothesis testing. Ann. Math. Stat., 22:106–117, 1994.
[3] R. Gray. Probability, Random Processes, and Ergodic Properties. Springer Verlag, 1988.
[4] M.G. Kendall and A. Stuart. The advanced theory of statistics; Vol. 2: Inference and relationship. London, 1961.
[5] A. Khaleghi, D. Ryabko, J. Mary, and P. Preux. Online clustering of processes. In AISTATS, JMLR W&CP 22, pages 601–609, 2012.
[6] J.C. Kieffer. Strongly consistent code-based identification and order estimation for constrained finite-state model classes. IEEE Transactions on Information Theory, 39(3):893–902, 1993.
[7] E. Lehmann. Testing Statistical Hypotheses, 2nd edition. Wiley, New York, 1986.
[8] A. Nobel. Hypothesis testing for families of ergodic processes. Bernoulli, 12(2):251–269, 2006.
[9] B. Ryabko and J. Astola. Universal codes as a basis for time series testing. Statistical Methodology, 3:375–397, 2006.
[10] B. Ryabko, J. Astola, and A. Gammerman. Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series. Theoretical Computer Science, 359:440–448, 2006.
[11] D. Ryabko. Clustering processes. In Proc. the 27th International Conference on Machine Learning (ICML 2010), pages 919–926, Haifa, Israel, 2010.
[12] D. Ryabko. Discrimination between B-processes is impossible. Journal of Theoretical Probability, 23(2):565–575, 2010.
[13] D. Ryabko. Testing composite hypotheses about discrete-valued stationary processes. In Proc. IEEE Information Theory Workshop (ITW’10), pages 291–295, Cairo, Egypt, 2010. IEEE.
[14] D. Ryabko. Testing composite hypotheses about discrete ergodic processes. Test, 21(2):317–329, 2012.
[15] D. Ryabko and B. Ryabko. On hypotheses testing for ergodic processes. In Proc. 2008 IEEE Information Theory Workshop, pages 281–283, Porto, Portugal, 2008. IEEE.
[16] D. Ryabko and B. Ryabko. Nonparametric statistical inference for ergodic processes.