Nonasymptotic control of the MLE for misspecified nonparametric hidden Markov models
Luc Lehéricy [email protected]
Laboratoire de Mathématiques d'Orsay, Univ. Paris-Sud, CNRS, Université Paris-Saclay, 91405 Orsay, France
Abstract
We study the problem of estimating an unknown time process distribution using nonparametric hidden Markov models in the misspecified setting, that is, when the true distribution of the process may not come from a hidden Markov model. We show that when the true distribution is exponentially mixing and satisfies a forgetting assumption, the maximum likelihood estimator recovers the best approximation of the true distribution. We prove a finite sample bound on the resulting error and show that it is optimal in the minimax sense (up to logarithmic factors) when the model is well specified.
Keywords: misspecified model, nonasymptotic bound, nonparametric statistics, maximum likelihood estimator, model selection, oracle inequality, hidden Markov model
1. Introduction
Let (Y_1, ..., Y_n) be a sample following some unknown distribution P*. The maximum likelihood estimator can be formalized as follows: let {P_θ}_{θ∈Θ}, the model, be a family of possible distributions; pick a distribution P_θ̂ of the model which maximizes the likelihood of the observed sample.

In many situations, the true distribution may not belong to the model at hand: this is the so-called misspecified setting. One would like the estimator to give sensible results even in this setting. This can be done by showing that the estimated distribution converges to the best approximation of the true distribution within the model. The goal of this paper is to establish a finite sample bound on the error of the maximum likelihood estimator for a large class of true distributions and a large class of nonparametric hidden Markov models.

In this paper, we consider maximum likelihood estimators (MLE for short) based on model selection among finite state space hidden Markov models (HMM for short). A finite state space hidden Markov model is a stochastic process (X_t, Y_t)_t where only the observations (Y_t)_t are observed, such that the process (X_t)_t is a Markov chain taking values in a finite space and such that the Y_s are independent conditionally to (X_t)_t with a distribution depending only on the corresponding X_s. The parameters of a HMM (X_t, Y_t)_t are the initial distribution and the transition matrix of (X_t)_t and the distributions of Y_s conditionally to X_s.

HMMs have been widely used in practice, for instance in climatology (Lambert et al., 2003), ecology (Boyd et al., 2014), voice activity detection and speech recognition (Couvreur and Couvreur, 2000; Lefèvre, 2003) and biology (Yau et al., 2011; Volant et al., 2014). One of their advantages is their ability to account for complex dependencies between the observations: despite
the seemingly simple structure of these models, the fact that the process (X_t)_t is hidden makes the process (Y_t)_t non-Markovian.

Up to now, most theoretical work in the literature focused on well-specified and parametric HMMs, where a smooth parametrization by a subset of R^d is available, see for instance Baum and Petrie (1966) for discrete state and observation spaces, Leroux (1992) for general observation spaces and Douc and Matias (2001) and Douc et al. (2011) for general state and observation spaces. Asymptotic properties for misspecified models have been studied recently by Mevel and Finesso (2004) for consistency and asymptotic normality in finite state space HMMs and Douc and Moulines (2012) for consistency in HMMs with general state space. Let us also mention Pouzo et al. (2016), who studied a generalization of hidden Markov models in a semi-misspecified setting. All these results focus on parametric models.

Few results are available on nonparametric HMMs, and all of them focus on the well-specified setting. Alexandrovich et al. (2016) prove consistency of a nonparametric maximum likelihood estimator based on finite state space hidden Markov models with nonparametric mixtures of parametric densities. Vernet (2015a,b) studies the posterior consistency and concentration rates of a Bayesian nonparametric maximum likelihood estimator. Other methods have also been considered, such as spectral estimators in Anandkumar et al. (2012); Hsu et al. (2012); De Castro et al. (2017); Bonhomme et al. (2016); Lehéricy (2017) and least squares estimators in de Castro et al. (2016); Lehéricy (2017). Besides Vernet (2015b), to the best of our knowledge, there has been no result on convergence rates or finite sample error of the nonparametric maximum likelihood estimator, even in the well-specified setting.

The main result of this paper is an oracle inequality that holds as soon as the models have controlled tails.
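The non-Markovianity of the observed process mentioned above can be checked exactly on a toy example: already for a two-state chain with binary emissions, P(Y_3 | Y_2) differs from P(Y_3 | Y_2, Y_1). All parameters below are illustrative assumptions, not taken from the paper.

```python
from itertools import product

# Illustrative two-state HMM with binary emissions (assumed parameters).
pi = [0.5, 0.5]                      # stationary for the symmetric Q below
Q = [[0.9, 0.1], [0.1, 0.9]]
emis = [[0.8, 0.2], [0.2, 0.8]]      # emis[x][y] = P(Y = y | X = x)

def p_obs(ys):
    """P(Y_1, ..., Y_m = ys), summing over all hidden state sequences."""
    total = 0.0
    for xs in product(range(2), repeat=len(ys)):
        p = pi[xs[0]] * emis[xs[0]][ys[0]]
        for t in range(1, len(ys)):
            p *= Q[xs[t - 1]][xs[t]] * emis[xs[t]][ys[t]]
        total += p
    return total

# If (Y_t) were Markov, these two conditional probabilities would agree.
p_markov = p_obs([1, 1]) / p_obs([1])          # P(Y_3 = 1 | Y_2 = 1)
p_full = p_obs([1, 1, 1]) / p_obs([1, 1])      # P(Y_3 = 1 | Y_2 = 1, Y_1 = 1)
```

A longer run of identical observations gives more evidence about the hidden state, so the two conditional probabilities differ: the hidden layer creates long-range dependence in (Y_t)_t.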
This bound is optimal when the true distribution is a HMM taking values in R. Let us give some details about this result.

Let us start with an overview of the assumptions on the true distribution P*. The first assumption is that the observed process is strongly mixing. Strong mixing assumptions can be seen as a strengthened version of ergodicity. They have been widely used to extend results on independent observations to dependent processes, see for instance Bradley (2005) and Dedecker et al. (2007) for a survey on strong mixing and weak dependence conditions. The second assumption is that the process forgets its past exponentially fast. For hidden Markov models, this forgetting property is closely related to the exponential stability of the optimal filter, see for instance Le Gland and Mevel (2000); Gerencsér et al. (2007); Douc et al. (2004, 2009). The last assumption is that the likelihood of the true process has sub-polynomial tails. None of these assumptions is specific to HMMs, thus making our result applicable to the misspecified setting.

To approximate a large class of true distributions, we consider nonparametric HMMs, where the parameters are not described by a finite dimensional space. For instance, one may consider HMMs with an arbitrary number of states and arbitrary emission distributions. Computing a maximizer of the likelihood directly in a nonparametric model may be hard or result in overfitting. The model selection approach offers a way to circumvent this problem. It consists in considering a countable family of parametric sets (S_M)_{M∈M} (the models) and selecting one of them. The larger the union of all models, the more distributions can be approximated. Several criteria can be used to select the model, such as bootstrap, cross validation (see for instance Arlot and Celisse (2010)) or penalization (see for instance Massart (2007)). We use a penalized criterion, which consists in maximizing the function
$$(S, \theta \in S) \longmapsto \frac{1}{n} \log p_\theta(Y_1, \dots, Y_n) - \mathrm{pen}_n(S),$$
where p_θ is the density of (Y_1, ..., Y_n) under the parameter θ and the penalty pen_n(S) only depends on the model S and the number of observations n.

Assume that the emission distributions of the HMMs (that is, the distributions of the observations conditionally to the hidden states) are absolutely continuous with respect to some known probability measure, and call emission densities their densities with respect to this measure. The tail assumption ensures that the emission densities have sub-polynomial tails:
$$\forall v \geq e, \qquad \mathbb{P}^* \left( \sup_\gamma \gamma(Y_1) > v^{D(n)} \right) \leq \frac{1}{v},$$
where the supremum is taken over all emission densities γ in the models, for a function n ↦ D(n). For instance, this assumption holds when all densities are upper bounded by e^{D(n)}. A key remark at this point is the dependency of D(n) on n: we allow the models to depend on the sample size. Typically, taking a larger sample makes it possible to consider larger models. A good choice is to take D(n) proportional to log n.

To stabilize the log-likelihood, we modify the models in the following way. First, we only keep HMMs whose transition matrix is entrywise lower bounded by a positive function n ↦ σ_-(n). We show that taking this lower bound as (log n)^{-1} is a safe choice. Then, we replace the emission densities γ by a convex combination of the original emission densities and of the dominating measure λ, with a weight that decreases polynomially with the sample size. In other words, we replace γ by (1 − n^{−a})γ + n^{−a}λ for some a > 0. Taking a > 1 ensures that the added component λ is asymptotically negligible.

Our main result can be stated informally as follows. For any a > 1 and α > 0, there exist constants A and n_0 such that if the penalty is large enough, the penalized maximum likelihood estimator θ̂_n satisfies, for all t > 0, η ∈ (0, 1) and n ≥ n_0, with probability larger than 1 − e^{−t} − n^{−α}:
$$K(\hat\theta_n) \leq (1 + \eta) \inf_{S \,:\, \dim(S) \leq n} \left\{ \inf_{\theta \in S} K(\theta) + 2\,\mathrm{pen}_n(S) \right\} + \frac{A}{\eta} \, \frac{t \, (\log n)^{c}}{n},$$
where c is a numerical constant and K(θ) can be seen as a Kullback-Leibler divergence between the distributions P* and P_θ. In other words, the estimator recovers the best approximation of the true distribution within the model, up to the penalty and the residual term.

In the case where the true distribution is a HMM, it is possible to quantify the approximation error inf_{θ∈S} K(θ). Using the results of Kruijer et al. (2010), we show that the above oracle inequality is optimal in the minimax sense (up to logarithmic factors) for real-valued HMMs, see Corollary 12. This is done by taking HMMs whose emission densities are mixtures of exponential power distributions, which include Gaussian mixtures as a special case.

The paper is organized as follows. We detail the framework of the article in Section 2. In particular, Section 2.3 describes the assumptions on the true distribution, Section 2.4 presents the assumptions on the models and Section 2.5 introduces the Kullback-Leibler criterion used in the oracle inequality. Our main results are stated in Section 3. Section 3.1 contains the oracle inequality and Section 3.2 shows how it can be used to prove minimax adaptivity for real-valued HMMs. Section 4 lists some perspectives for this work.

One may wish to relax our assumptions depending on the setting. For instance, one could want to change the dependency of the functions B(n) and σ_-(n) on n, change the tail conditions or the rate of forgetting. We give an overview of the key steps of the proof of our oracle inequality in Section 5 to make it easier to adapt our result.

Some proofs are postponed to the appendices. Appendix A contains the proof of the minimax adaptivity result and Appendix B contains the proof of the main technical lemma of Section 5.
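To fix ideas before introducing notation, the generative mechanism of a finite state space HMM can be sketched in a few lines. This is a minimal illustration with assumed parameters (a two-state chain with Gaussian emissions), not code from the paper.

```python
import random

def sample_hmm(pi, Q, emit, n, rng):
    """Draw ((X_1, ..., X_n), (Y_1, ..., Y_n)) from a finite state space HMM.

    pi: initial distribution; Q: row-stochastic transition matrix;
    emit[x]: function drawing Y_t given X_t = x.
    """
    xs, ys = [], []
    x = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(n):
        xs.append(x)
        ys.append(emit[x](rng))                     # Y_t depends only on X_t
        x = rng.choices(range(len(Q)), weights=Q[x])[0]
    return xs, ys

# Illustrative two-state chain with Gaussian emissions (assumed parameters).
rng = random.Random(0)
xs, ys = sample_hmm([0.5, 0.5],
                    [[0.9, 0.1], [0.2, 0.8]],
                    [lambda r: r.gauss(-2.0, 1.0), lambda r: r.gauss(2.0, 1.0)],
                    500, rng)
```

Only the sequence ys is observed; the estimation problem studied below is to recover (an approximation of) its distribution from ys alone.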
2. Notations and assumptions
We will use the following notations:

• a ∨ b is the maximum of a and b, a ∧ b the minimum;
• for x ∈ R, we write x_+ = x ∨ 0;
• N* = {1, 2, 3, ...} is the set of positive integers;
• for K ∈ N*, we write [K] = {1, 2, ..., K};
• Y_a^b is the vector (Y_a, ..., Y_b);
• L²(A, 𝒜, μ) is the set of measurable and square integrable functions defined on the measured space (A, 𝒜, μ). We write L²(A, μ) when the sigma-field is not ambiguous;
• log is the inverse function of the exponential function exp.

2.1 Hidden Markov models

Finite state space hidden Markov models (HMM for short) are stochastic processes (X_t, Y_t)_{t≥1} with the following properties. The hidden state process (X_t)_t is a Markov chain taking values in a finite set 𝒳 (the state space). We denote by K the cardinality of 𝒳, and by π and Q the initial distribution and transition matrix of (X_t)_t respectively. The observation process (Y_t)_t takes values in a Polish space 𝒴 (the observation space) endowed with a Borel probability measure λ. The observations Y_t are independent conditionally to (X_t)_t with a distribution depending only on X_t. In the following, we assume that the distribution of Y_t conditionally to {X_t = x} is absolutely continuous with respect to λ with density γ_x. We call γ = (γ_x)_{x∈𝒳} the emission densities.

Therefore, the parameters of a HMM are its number of hidden states K, its initial distribution π (the distribution of X_1), its transition matrix Q and its emission densities γ. When appropriate, we write p_{(K,π,Q,γ)} the density of the process with respect to the dominating measure under the parameters (K, π, Q, γ). For a sequence of observations Y_1^n, we denote by l_n(K, π, Q, γ) the associated log-likelihood under the parameters (K, π, Q, γ), defined by
$$l_n(K, \pi, Q, \gamma) = \log p_{(K,\pi,Q,\gamma)}(Y_1^n).$$
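The log-likelihood l_n defined above can be computed in O(nK²) time by the forward recursion instead of summing over the K^n hidden state sequences. A minimal log-space sketch (illustrative code, not from the paper):

```python
import math

def _logsumexp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = max(v)
    return m + math.log(sum(math.exp(a - m) for a in v))

def log_likelihood_forward(pi, Q, log_emis):
    """l_n = log p_{(K, pi, Q, gamma)}(Y_1, ..., Y_n) by the forward recursion.

    log_emis[t][x] = log gamma_x(Y_t); pi is the initial distribution and
    Q the transition matrix. Cost O(n K^2) instead of O(K^n).
    """
    K = len(pi)
    # alpha[x] = log p(Y_1, ..., Y_t, X_t = x)
    alpha = [math.log(pi[x]) + log_emis[0][x] for x in range(K)]
    for t in range(1, len(log_emis)):
        alpha = [_logsumexp([alpha[xp] + math.log(Q[xp][x]) for xp in range(K)])
                 + log_emis[t][x]
                 for x in range(K)]
    return _logsumexp(alpha)
```

Working in log-space avoids the underflow that the raw product of densities would cause for large n.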
We denote by P* the true (and unknown) distribution of the process (Y_t)_t, by E* the expectation under P*, by p* the density of P* with respect to the dominating measure and by l*_n the log-likelihood of the observations under P*. Let us stress that this distribution may not be generated by a finite state space HMM.

2.2 Construction of the models

Let (S_{K,M,n})_{K∈N*, M∈M} be a family of parametric models such that for all K ∈ N* and M ∈ M, the parameters (K, π, Q, γ) ∈ S_{K,M,n} correspond to HMMs with K hidden states. Note that the models S_{K,M,n} may depend on the number of observations n. Let us see two ways to construct such models.

Mixture densities.
Let {f_ξ}_{ξ∈Ξ} be a parametric family of probability densities indexed by Ξ ⊂ R^d. Let M ⊂ N*. We choose S_{K,M,n} to be the set of parameters (K, π, Q, γ) such that Q and π are uniformly lower bounded by (log n)^{−1} and, for all x ∈ [K], γ_x is a convex combination of M elements of {f_ξ}_{ξ ∈ Ξ ∩ [−n,n]^d}.

L² densities. Let (E_M)_{M∈M} be a family of finite dimensional subspaces of L²(𝒴, λ). We choose S_{K,M,n} to be the set of parameters (K, π, Q, γ) such that Q and π are uniformly lower bounded by (log n)^{−1} and, for all x ∈ [K], γ_x is a probability density such that γ_x = g ∨ 0 for some g ∈ E_M with ‖g‖ ≤ n.

In both cases, we took a lower bound on the coefficients of the transition matrix Q that tends to zero as the number of observations grows. This makes it possible to estimate parameters for which some coefficients of the transition matrix are small or zero. We prove the choice (log n)^{−1} to be a good choice in general in Theorem 8.

For all K ∈ N* and M ∈ M, we define the maximum likelihood estimator on S_{K,M,n}:
$$(K, \hat\pi_{K,M,n}, \hat{Q}_{K,M,n}, \hat\gamma_{K,M,n}) \in \arg\max_{(K,\pi,Q,\gamma) \in S_{K,M,n}} \frac{1}{n} l_n(K, \pi, Q, \gamma).$$
Since the true distribution does not necessarily correspond to a parameter of S_{K,M,n}, taking a larger model S_{K,M,n} will reduce the bias of the estimator (K, π̂_{K,M,n}, Q̂_{K,M,n}, γ̂_{K,M,n}). However, larger models make the estimation more difficult, resulting in a larger variance. This means one has to perform a bias-variance tradeoff to select a model of reasonable size. To do so, we select a number of states K̂_n among a set of integers 𝒦_n and a model index M̂_n among a set of indices M_n such that the penalized log-likelihood is maximal:
$$(\hat{K}_n, \hat{M}_n) \in \arg\max_{K \in \mathcal{K}_n, M \in \mathcal{M}_n} \left( \frac{1}{n} l_n(K, \hat\pi_{K,M,n}, \hat{Q}_{K,M,n}, \hat\gamma_{K,M,n}) - \mathrm{pen}_n(K, M) \right)$$
for some penalty pen_n to be chosen.

In the following, we use the following notations:

• S_n := ∪_{K∈𝒦_n, M∈M_n} S_{K,M,n} is the set of all parameters involved in the construction of the maximum likelihood estimator;
• S^{(γ)}_{K,M,n} = {γ | (K, π, Q, γ) ∈ S_{K,M,n}} is the set of density vectors from the model S_{K,M,n}; S^{(γ)}_n is defined in the same way.

2.3 Assumptions on the true distribution

In this section, we introduce the assumptions on the true distribution of the process (Y_t)_{t>0}. We assume that (Y_t)_{t>0} is stationary, so that one can extend it into a process (Y_t)_{t∈Z}.
Let us state the two assumptions on the dependency of the process (Y_t)_t.

[A⋆forgetting] There exist two constants C* > 0 and ρ* ∈ (0, 1) such that for all i ∈ Z, for all k, k' ∈ N* and for all y_{i−(k∨k')}^{i} ∈ 𝒴^{(k∨k')+1},
$$\left| \log p^*(y_i \mid y_{i-k}^{i-1}) - \log p^*(y_i \mid y_{i-k'}^{i-1}) \right| \leq C^* (\rho^*)^{k \wedge k' - 1}.$$

For the mixing assumption, let us recall the definition of the ρ-mixing coefficient. Let (Ω, ℱ, P) be a measured space and let 𝒜 ⊂ ℱ and ℬ ⊂ ℱ be two sigma-fields. Let
$$\rho_{\mathrm{mix}}(\mathcal{A}, \mathcal{B}) = \sup_{f \in L^2(\Omega, \mathcal{A}, P),\ g \in L^2(\Omega, \mathcal{B}, P)} | \mathrm{Corr}(f, g) |.$$
The ρ-mixing coefficient of (Y_t)_t is defined by
$$\rho_{\mathrm{mix}}(n) = \rho_{\mathrm{mix}}\big( \sigma(Y_i, i \geq n),\ \sigma(Y_i, i \leq 0) \big).$$

[A⋆mixing] There exist two constants c* > 0 and n* ∈ N* such that for all n ≥ n*, ρ_mix(n) ≤ e^{−c* n}.

[A⋆forgetting] ensures that the process forgets its initial distribution exponentially fast. This assumption is especially useful for truncating the dependencies in the likelihood. [A⋆mixing] is a usual mixing assumption and is used to obtain Bernstein-like concentration inequalities. Note that [A⋆mixing] implies that the process (Y_t)_{t>0} is ergodic.

Even if [A⋆forgetting] is analogous to a ψ-mixing condition (see Bradley (2005) for a survey on mixing conditions) and is proved using the same tool as [A⋆mixing] in hidden Markov models, namely the geometric ergodicity of the hidden state process, these two assumptions are different in general. For instance, a Markov chain always satisfies [A⋆forgetting] but not necessarily [A⋆mixing]. Conversely, there exist processes satisfying [A⋆mixing] but not [A⋆forgetting].

Lemma 1
Assume that (Y_t)_t is generated by a HMM with a compact metric state space 𝒳 (not necessarily finite) endowed with a Borel probability measure μ. Write Q* its transition kernel and assume that Q* admits a density with respect to μ that is uniformly lower bounded and upper bounded by positive and finite constants σ*_− and σ*_+. Write (γ*_x)_{x∈𝒳} its emission densities and assume that they satisfy ∫ γ*_x(y) μ(dx) ∈ (0, +∞) for all y ∈ 𝒴.

Then [A⋆forgetting] and [A⋆mixing] hold by taking
$$\rho^* = 1 - \frac{\sigma^*_-}{\sigma^*_+}, \qquad C^* = \frac{1}{1 - \rho^*}, \qquad c^* = -\frac{\log(1 - \sigma^*_-)}{2} \qquad \text{and} \qquad n^* = 1.$$

Proof
This lemma follows from the geometric ergodicity of the HMM.

For [A⋆forgetting], see for instance Douc et al. (2004), proof of Lemma 2.

For [A⋆mixing], the Doeblin condition implies that for all distributions π and π' on 𝒳,
$$\int \left| p^*(X_n = x \mid X_0 \sim \pi) - p^*(X_n = x \mid X_0 \sim \pi') \right| \mu(dx) \leq (1 - \sigma^*_-)^n \, \| \pi - \pi' \|_1.$$
Let A ∈ σ(Y_t, t ≥ k) and B ∈ σ(Y_t, t ≤ 0) such that P*(B) > 0. Taking π the stationary distribution of (X_t)_t and π' the distribution of X_0 conditionally to B in the above equation implies
$$| P^*(A \mid B) - P^*(A) | = \left| \int P^*(A \mid X_n = x)\, \big( p^*(X_n = x) - p^*(X_n = x \mid B) \big)\, \mu(dx) \right| \leq \int \left| p^*(X_n = x) - p^*(X_n = x \mid B) \right| \mu(dx) \leq (1 - \sigma^*_-)^n.$$
Therefore, the process (Y_t)_{t>0} is φ-mixing with φ_mix(n) ≤ (1 − σ*_−)^n, so that it is ρ-mixing with ρ_mix(n) ≤ 2(φ_mix(n))^{1/2} ≤ 2(1 − σ*_−)^{n/2} (see e.g. Bradley (2005) for the definition of the φ-mixing coefficient and its relation to the ρ-mixing coefficient). One can check that the choice of c* and n* allows to obtain [A⋆mixing] from this inequality.

We need to control the probability that the true density takes extreme values.

[A⋆tail] There exist two constants B* > 0 and q ∈ [0, 1] such that
$$\forall i \in \mathbb{Z},\ \forall k \in \mathbb{N},\ \forall u \geq 1, \qquad \mathbb{P}^*\left( \left| \log p^*(Y_i \mid Y_{i-k}^{i-1}) \right| > B^* u^q \right) \leq e^{-u}.$$

In practice, only two values of q are of interest. The case q = 0 occurs when the densities are lower and upper bounded by positive and finite constants. If the densities are not bounded, then q = 1 works in most cases and corresponds to sub-polynomial tails. Indeed, the lower bound on log p*(Y_i | Y_{i-k}^{i-1}) is always true when taking q = 1 and B* = 1 by definition of the density p*, resulting in the following equivalent assumption:

[A⋆tail'] There exists a constant B* > 0 such that
$$\forall i \in \mathbb{Z},\ \forall k \in \mathbb{N},\ \forall v \geq e, \qquad \mathbb{P}^*\left( p^*(Y_i \mid Y_{i-k}^{i-1}) > v^{B^*} \right) \leq \frac{1}{v}.$$

This can be obtained from Markov's inequality under a moment assumption, as shown in the following lemma.
Lemma 2
Assume that there exists δ > 0 such that
$$M_\delta := \sup_{i,k} \mathbb{E}^*\left[ \left( p^*(Y_i \mid Y_{i-k}^{i-1}) \right)^\delta \right] < \infty.$$
Then [A⋆tail] holds for q = 1 and B* = (1 + (log M_δ)_+)/δ. Indeed, Markov's inequality gives P*(p*(Y_i | Y_{i-k}^{i-1}) > e^{B* u}) ≤ M_δ e^{−δ B* u} ≤ e^{−u} for all u ≥ 1 as soon as δB* ≥ 1 + (log M_δ)_+.

2.4 Assumptions on the models

We now state the assumptions on the models. Let us recall that the distribution of the observed process is not assumed to belong to one of these models.

Consider a family of models (S_{K,M,n})_{K∈N*, M∈M, n∈N*} such that for each K, M and n, the elements of S_{K,M,n} are of the form (K, π, Q, γ) where π is a probability distribution on [K], Q is a transition matrix on [K] and γ is a vector of K probability densities on 𝒴 with respect to λ.

We need the following assumption on the transition matrices and initial distributions of S_n.

[Aergodic] There exists σ_−(n) ∈ (0, e^{−1}] such that for all (K, π, Q, γ) ∈ S_n,
$$\inf_{x, x' \in [K]} Q(x, x') \geq \sigma_-(n) \qquad \text{and} \qquad \inf_{x \in [K]} \pi(x) \geq \sigma_-(n).$$

[Aergodic] is standard in maximum likelihood estimation. It ensures that the process forgets the past exponentially fast, which implies that the difference between the log-likelihood (1/n) l_n and its limit converges to zero with rate 1/n in supremum norm. When (
K, π, Q, γ) ∈ S_n, [Aergodic] implies that under the parameters (K, π, Q, γ), for all x ∈ [K], the probability to jump to state x at time t is at least σ_−(n), whatever the past may be. This implies that the density p_{(K,π,Q,γ)}(Y_t | Y_1^{t−1}) is lower bounded by σ_−(n) Σ_x γ_x(Y_t). Furthermore, it is upper bounded by Σ_x γ_x(Y_t). Thus, it is enough to bound this quantity to control p_{(K,π,Q,γ)} without having to handle the time dependency.

For all γ ∈ S^{(γ)}_n and y ∈ 𝒴, let
$$b_\gamma(y) = \log \sum_x \gamma_x(y).$$
We need to control the tails of b_γ like we did for log p* in order to get nonasymptotic bounds. This is the purpose of the following assumption.

[Atail] There exist two constants q ∈ [0, 1] and B(n) > 0 such that
$$\forall u \geq 1, \qquad \mathbb{P}^*\left[ \sup_{\gamma \in S^{(\gamma)}_n} | b_\gamma(Y_1) | > B(n)\, u^q \right] \leq e^{-u}.$$

This assumption is often easy to check in practice, as shown in the following lemma.
Lemma 3
Assume that one of the two following assumptions holds:

1. (subpolynomial tails) there exists D(n) > 0 such that
$$\forall u \geq 1, \qquad \mathbb{P}^*\left[ \sup_{\gamma \in S^{(\gamma)}_n} b_\gamma(Y_1) > D(n)\, u \right] \leq e^{-u};$$

2. (bounded densities) there exists D(n) > 0 such that
$$\sup_{y \in \mathcal{Y}} \sup_{\gamma \in S^{(\gamma)}_n} b_\gamma(y) \leq D(n).$$

Consider a new model where all γ are replaced by γ' = (1 − n^{−a})γ + n^{−a} for a fixed constant a > 0. Then [Atail] holds for this new model with q = 1 (resp. q = 0 under the second assumption) and B(n) = D(n) ∨ (a log n).

Changing the densities as in the lemma amounts to adding a mixture component (with weight n^{−a} and distribution λ) to the emission densities to make sure that they are uniformly lower bounded. We shall see in the following that if a > 1, then this additional component changes nothing to the approximation properties of the models, see the proof of Corollary 12. This is in agreement with the fact that this component is asymptotically never observed as soon as a > 1.

The following assumption means that as far as the bracketing entropy is concerned, the set of emission densities of the model S_{K,M,n} (without taking the hidden state into account) behaves like a parametric model with dimension m_M.

[Aentropy] There exist a function (M, K, D, n) ↦ C_aux(M, K, D, n) and a sequence (m_M)_{M∈M} ∈ N^M such that for all δ > 0, M, K and D,
$$N\left( \left\{ y \mapsto \gamma_x(y)\, \mathbf{1}_{\sup_{\gamma' \in S^{(\gamma)}_n} | b_{\gamma'}(y) | \leq D},\ \gamma \in S^{(\gamma)}_{K,M,n},\ x \in [K] \right\},\ d_\infty,\ \delta \right) \leq \max\left( \frac{C_{\mathrm{aux}}}{\delta},\ 1 \right)^{m_M}, \qquad (1)$$
where d_∞ is the distance associated with the supremum norm and N(A, d, ε) is the smallest number of brackets of size ε for the distance d needed to cover A. Let us recall that the bracket [a; b] is the set of functions f such that a(·) ≤ f(·) ≤ b(·), and that the size of the bracket [a; b] is d(a, b).

Note that we allow the models to depend on the sample size n, which can make C_aux grow to infinity with n. To control the growth of the models, we use the following assumption.

[Agrowth] There exist ζ > 0 and n_growth such that for all n ≥ n_growth,
$$\sup_{K, M \text{ s.t. } K \leq n \text{ and } m_M \leq n} \log C_{\mathrm{aux}}\big( M, K, B(n) (\log n)^q, n \big) \leq n^\zeta.$$

A typical way to check [Aentropy] is to use a parametrization of the emission densities, for instance a Lipschitz application [−1, 1]^{m_M} → S^{(γ)}_{K,M,n}. This reduces the construction of a bracket covering of S^{(γ)}_{K,M,n} to the construction of a bracket covering of the unit ball of R^{m_M}. In this case, C_aux depends on the Lipschitz constant of the parametrization. An example of this approach is given in Section 3.2 for mixtures of exponential power distributions.

2.5 Convergence of the log-likelihood

In this section, we focus on the convergence of the log-likelihood. First, we recall results from Barron (1985) and Leroux (1992) that show the existence of its limit in a general setting. Then, we show how to control the difference between the log-likelihood and its limit using the assumptions from the previous sections.
The first result comes from Barron (1985) and shows that the true log-likelihood converges almost surely with no assumption other than the ergodicity of the process (Y_t)_{t>0}.

Lemma 4 (Barron (1985))
Assume that the process (Y_t)_{t>0} is ergodic. Then there exists a quantity l* > −∞ such that
$$\frac{1}{n}\, l^*_n \underset{n \to \infty}{\longrightarrow} l^* \quad \text{a.s.}$$
and
$$l^* = \lim_{n \to \infty} \mathbb{E}^*\left[ \log p^*(Y_n \mid Y_1^{n-1}) \right].$$

The second result follows from Theorem 2 of Leroux (1992). A careful reading of his proof shows that one can relax his assumptions to get the following lemma. Note that the definition of l_n extends naturally to the case where γ is not a vector of probability densities, or even a vector of integrable functions with respect to λ, through the formula
$$l_n(K, \pi, Q, \gamma) = \log \sum_{x_1, \dots, x_n \in [K]^n} \pi(x_1) \prod_{i=1}^{n-1} Q(x_i, x_{i+1}) \prod_{i=1}^{n} \gamma_{x_i}(Y_i).$$

Lemma 5 (Leroux (1992))
Let K be a positive integer, γ a vector of K nonnegative and measurable functions, Q a transition matrix of size K and π a probability measure on [K]. Assume that the process (Y_t)_{t>0} is ergodic and that E*[(log γ_x(Y_1))_+] < +∞ for all x ∈ [K]. Then:

1. There exists a quantity l(K, Q, γ) < +∞ which does not depend on π such that
$$\limsup_{n \to \infty} \frac{1}{n} l_n(K, \pi, Q, \gamma) \leq l(K, Q, \gamma) \quad \mathbb{P}^*\text{-a.s.}$$
and such that if inf_{x∈[K]} π(x) > 0, then
$$\frac{1}{n} l_n(K, \pi, Q, \gamma) \underset{n \to \infty}{\longrightarrow} l(K, Q, \gamma) \quad \mathbb{P}^*\text{-a.s.}$$

2. Assume l(K, Q, γ) > −∞. Then the almost sure convergence also holds in L¹(P*).

3. Assume E*|log γ_x(Y_1)| < +∞ for all x ∈ [K]. Then l(K, Q, γ) > −∞.

When appropriate, we define K(K, Q, γ) by
$$K(K, Q, \gamma) := l^* - l(K, Q, \gamma).$$
Note that when γ is a vector of probability densities, K(K, Q, γ) ≥ 0, and if inf_{x∈[K]} π(x) > 0, then
$$K(K, Q, \gamma) = \lim_{n \to \infty} \frac{1}{n} \mathrm{KL}\left( \mathbb{P}^*_{Y_1^n} \,\big\|\, \mathbb{P}_{Y_1^n \mid (K, \pi, Q, \gamma)} \right).$$

The following lemma controls the difference between the log-likelihood and its limit. When [A⋆forgetting] (resp. [Aergodic]) holds, the log-density of Y_i conditionally to the previous observations converges exponentially fast to what can be seen as the density of Y_i conditionally to the whole past, that is p*(Y_i | Y_{−∞}^{i−1}) (resp. p_{(K,Q,γ)}(Y_i | Y_{−∞}^{i−1})). Strictly speaking, we define the limits of the log-densities L*_{i,∞} and L_{i,∞}(K, Q, γ), which can be seen respectively as log p*(Y_i | Y_{−∞}^{i−1}) and log p_{(K,Q,γ)}(Y_i | Y_{−∞}^{i−1}).

For all i ∈ Z and k ∈ N*, let
$$L^*_{i,k} = \log p^*(Y_i \mid Y_{i-k}^{i-1}),$$
where the process (Y_t)_{t>0} is extended into a process (Y_t)_{t∈Z} by stationarity. Likewise, for all i ∈ Z, k ∈ N*, (K, π, Q, γ) ∈ S_n and for all probability distributions μ on [K], let
$$L_{i,k,\mu}(K, Q, \gamma) = \log p_{(K,Q,\gamma)}(Y_i \mid Y_{i-k}^{i-1}, X_{i-k} \sim \mu),$$
where p_{(K,Q,γ)} is the density of a stationary HMM with parameters (K, Q, γ).
When μ is the stationary distribution of the Markov chain under the parameters (K, Q, γ), we write L_{i,k}(K, Q, γ).

Lemma 6 (Douc et al. (2004)) 1. Assume [Aergodic] holds. Let
$$\rho = 1 - \frac{\sigma_-(n)}{1 - \sigma_-(n)}.$$
Then for all i, k, k', μ and μ',
$$\sup_{(K,\pi,Q,\gamma) \in S_n} \left| L_{i,k,\mu}(K, Q, \gamma) - L_{i,k',\mu'}(K, Q, \gamma) \right| \leq \frac{\rho^{k \wedge k' - 1}}{1 - \rho}$$
and there exists a process (L_{i,∞})_{i∈Z} such that for all i and μ, L_{i,k,μ} → L_{i,∞} as k → ∞ in supremum norm (when seen as a function of (K, π, Q, γ)) and for all i, k and μ,
$$\sup_{(K,\pi,Q,\gamma) \in S_n} \left| L_{i,k,\mu}(K, Q, \gamma) - L_{i,\infty}(K, Q, \gamma) \right| \leq \frac{\rho^{k-1}}{1 - \rho}.$$
2. Assume [A⋆forgetting] holds. Then for all i, k and k', |L*_{i,k} − L*_{i,k'}| ≤ C* (ρ*)^{k∧k'−1}, and there exists a process (L*_{i,∞})_{i∈Z} such that for all i, L*_{i,k} → L*_{i,∞} as k → ∞, and for all i and k, |L*_{i,k} − L*_{i,∞}| ≤ C* (ρ*)^{k−1}.
3. Assume [A⋆forgetting] and [Aergodic] hold. Under P*, the processes (L*_{i,∞})_{i∈Z} and (L_{i,∞}(K, Q, γ))_{i∈Z} are stationary for all (K, π, Q, γ) ∈ S_n. Moreover, if (Y_t)_{t>0} is ergodic (for instance if [A⋆mixing] holds), they are ergodic and:

• if [Atail] holds, then for all (K, π, Q, γ) ∈ S_n, l(K, Q, γ) exists, is finite and l(K, Q, γ) = E*[L_{0,∞}(K, Q, γ)];
• if [A⋆tail] holds, then l* exists, is finite and l* = E*[L*_{0,∞}].

Proof
The second point follows directly from [A⋆forgetting]. The third point follows from the ergodicity of (Y_t)_{t>0} under [A⋆mixing], from the integrability of L_{i,∞} and L*_{i,∞} under [Atail] and [A⋆tail], and from Lemmas 4 and 5.

Note that under the assumptions of point 3 of Lemma 6, one has K(K, Q, γ) = E*[L*_{0,∞} − L_{0,∞}(K, Q, γ)] ∈ [0, +∞) for all (K, π, Q, γ) ∈ S_n (recall that γ is a vector of probability densities in this case), or with some notation abuse:
$$K(K, Q, \gamma) = \mathbb{E}^*\left[ \log \frac{p^*(Y_0 \mid Y_{-\infty}^{-1})}{p_{(K,Q,\gamma)}(Y_0 \mid Y_{-\infty}^{-1})} \right] = \mathbb{E}^*_{Y_{-\infty}^{-1}}\left[ \mathrm{KL}\left( \mathbb{P}^*_{Y_0 \mid Y_{-\infty}^{-1}} \,\big\|\, \mathbb{P}_{Y_0 \mid Y_{-\infty}^{-1}, (K,Q,\gamma)} \right) \right].$$

Thus, K(K, Q, γ) can be seen as a Kullback-Leibler divergence that measures the difference between the distribution of Y_0 conditionally to the whole past under the parameters (K, Q, γ) and under the true distribution. It can be seen as the prediction error under the parameters (K, Q, γ).

In the particular case where the true distribution of (Y_t)_t is a finite state space hidden Markov model, K characterizes the true parameters up to permutation of the hidden states, provided the emission densities are all distinct and the transition matrix is invertible, as shown in the following result.

Lemma 7 (Alexandrovich et al. (2016), Theorem 5)
Assume (Y_t)_t is generated by a finite state space HMM with parameters (K*, π*, Q*, γ*). Assume Q* is invertible and ergodic, that the emission densities (γ*_x)_{x∈[K*]} are all distinct and that E*[(log γ*_x(Y_1))_+] < ∞ for all x ∈ [K*] (so that l* < ∞).

Then for all K ∈ N*, for all transition matrices Q of size K and for all K-tuples of probability densities γ, one has K(K, Q, γ) ≥ 0.

In addition, if K ≤ K*, then K(K, Q, γ) = 0 if and only if (K, Q, γ) = (K*, Q*, γ*) up to permutation of the hidden states.
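The exponential filter stability behind Lemma 6 can be observed numerically: the conditional log-density log p(Y_i | Y_{i−k}^{i−1}, X_{i−k} ∼ μ), computed by a forward recursion, forgets the initializing distribution μ geometrically fast in k. The two-state example below uses discrete emissions and assumed parameters (illustrative code, not from the paper).

```python
import math

def cond_log_density(mu, Q, emis, ys):
    """log p(Y_i | Y_{i-k}^{i-1}, X_{i-k} ~ mu) for the last entry of ys.

    ys = (y_{i-k}, ..., y_i); emis[x][y] = gamma_x(y), y ranging over a finite set.
    """
    filt = list(mu)                                   # law of the current hidden state
    for y in ys[:-1]:
        post = [filt[x] * emis[x][y] for x in range(len(filt))]
        s = sum(post)                                 # Bayes update on the observation y,
        filt = [sum(post[xp] / s * Q[xp][x]           # then one transition step through Q
                    for xp in range(len(filt)))
                for x in range(len(filt))]
    return math.log(sum(filt[x] * emis[x][ys[-1]] for x in range(len(filt))))

# Two initial distributions mu; the gap between the conditional log-densities
# shrinks geometrically as the history length k grows (assumed parameters).
Q = [[0.6, 0.4], [0.5, 0.5]]
emis = [[0.7, 0.3], [0.4, 0.6]]
ys = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
gaps = [abs(cond_log_density([1.0, 0.0], Q, emis, ys[-(k + 1):])
            - cond_log_density([0.0, 1.0], Q, emis, ys[-(k + 1):]))
        for k in range(1, len(ys))]
```

With a short history the two initializations give visibly different predictions; with the full history the gap is numerically negligible, which is the forgetting phenomenon that the bounds ρ^{k−1}/(1−ρ) of Lemma 6 quantify.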
3. Main results
3.1 Oracle inequality

The following theorem states an oracle inequality on the prediction error of our estimator. It shows that, with high probability, our estimator performs as well as the best model of the class in terms of Kullback-Leibler divergence, up to a multiplicative constant and up to an additive term decreasing as a power of log n divided by n, provided the penalty is large enough.

Theorem 8
Assume [A⋆forgetting], [A⋆mixing], [A⋆tail], [Aergodic], [Atail], [Aentropy] and [Agrowth] hold.

Let (w_M)_{M∈M} be a nonnegative sequence such that Σ_{M∈M} e^{−w_M} ≤ e^{−1}. Assume σ_−(n) = C_σ (log n)^{−1} and B(n) = C_B log n for some constants C_σ > 0 and C_B > ζ (where ζ is defined in [Agrowth]). Let α > 0. For all K and M, let
$$(K, \hat\pi_{K,M,n}, \hat{Q}_{K,M,n}, \hat\gamma_{K,M,n}) \in \arg\max_{(K,\pi,Q,\gamma) \in S_{K,M,n}} \frac{1}{n} l_n(K, \pi, Q, \gamma),$$
$$(\hat{K}, \hat{M}) \in \arg\max_{K \leq \frac{\log n}{C_\sigma},\ M \text{ s.t. } m_M \leq n} \left( \frac{1}{n} l_n(K, \hat\pi_{K,M,n}, \hat{Q}_{K,M,n}, \hat\gamma_{K,M,n}) - \mathrm{pen}_n(K, M) \right)$$
and let (K̂, π̂, Q̂, γ̂) = (K̂, π̂_{K̂,M̂,n}, Q̂_{K̂,M̂,n}, γ̂_{K̂,M̂,n}) be the nonparametric maximum likelihood estimator.

Then there exist constants A and C_pen depending only on α, C_σ, C_B, n* and c*, and a constant n_0 depending only on α, C_σ and C_B, such that for all
$$n \geq n_{\mathrm{growth}} \vee n_0 \vee \exp\left( C_\sigma \left( (1 + C^*) \vee \frac{2 - \rho^*}{1 - \rho^*} \vee e \right) \right) \vee \exp\left( \frac{B^*}{C_B} \right) \vee \exp\left( \sqrt{C_\sigma}\, (n^* + 1) \right),$$
for all t > 0 and for all η ∈ (0, 1], with probability at least 1 − e^{−t} − n^{−α},
$$K(\hat{K}, \hat{Q}, \hat\gamma) \leq (1 + \eta) \inf_{K \leq \frac{\log n}{C_\sigma},\ M \text{ s.t. } m_M \leq n} \left\{ \inf_{(K,\pi,Q,\gamma) \in S_{K,M,n}} K(K, Q, \gamma) + 2\, \mathrm{pen}_n(K, M) \right\} + \frac{A}{\eta}\, \frac{t\, (\log n)^{6 + 2q}}{n}$$
as soon as
$$\mathrm{pen}_n(K, M) \geq \frac{C_{\mathrm{pen}}}{\eta}\, \frac{(\log n)^q}{n} \left( w_M + (\log n)^q\, (m_M K + K^2) \left( (\log n)^2 \log \log n + \log C_{\mathrm{aux}} \right) \right).$$

The proof of this theorem is presented in Section 5. Its structure and main steps are detailed in Section 5.1, and the proofs of these steps are gathered in Section 5.2.

Note that this theorem is not specific to one choice of the parametric models S_{K,M,n}: one can choose the type of model that best suits the density one wants to estimate. In the following section, we use mixture models to estimate densities when 𝒴 is unbounded. If 𝒴 is compact, we could use L² spaces and this oracle inequality would still hold.

The powers of log n in the residual term come from:
• The limitation of the dependency to the most recent observations, which induces a power of log n;
• The dependency of σ_−(n) and B(n) on n, each of them at the root of a power of log n;
• The truncation of the emission densities (possible thanks to assumptions [Atail] and [A⋆tail]), which induces a factor (log n)^q;
• The use of a Bernstein inequality for exponentially α-mixing processes, which introduces an additional power of log n compared to a Bernstein inequality for independent variables. However, together with the previous point (the truncation of the emission densities), these two points only induce a factor (log n)^q.

In the term (log n) log log n of the penalty, a factor log n comes from the limitation of the dependency and a factor log n log log n from σ_−(n). Finally, the term (log n)^q in the penalty comes from the dependency of B(n) on n, from the truncation of the emission densities and from the Bernstein inequality for exponentially α-mixing processes.

3. Minimax adaptive estimation

In this section, we show that the oracle inequality of Theorem 8 allows us to construct an estimator that is adaptive and minimax up to logarithmic factors when the observations are generated by a finite state space hidden Markov model. To do so, we consider models whose emission densities are finite mixtures of exponential power distributions, and use an approximation result by Kruijer et al. (2010).

Assume that (Y_t)_{t ≥ 1} is generated by a stationary HMM with parameters (K∗, Q∗, γ∗), which we call the true parameters. We consider the case Y = R endowed with the probability λ with density G_λ : y ↦ (π(1 + y²))^{−1} with respect to the Lebesgue measure. In order to quantify the approximation error by location-scale mixtures, we use the following assumptions from Kruijer et al. (2010).

(C1) Smoothness. log(γ∗_x G_λ) is locally β-Hölder with β >
0, i.e. there exist a polynomial L and a constant R > 0 such that, denoting by r the largest integer smaller than β,

∀ y, y′ s.t. |y − y′| ≤ R,  | (∂^r/∂y^r) log(γ∗_x G_λ)(y) − (∂^r/∂y^r) log(γ∗_x G_λ)(y′) | ≤ r! L(y) |y − y′|^{β−r}.

(C2) Moments. There exists ǫ > 0 such that

∀ j ∈ {1, . . . , r},  ∫ | (∂^j/∂y^j) log(γ∗_x G_λ)(y) |^{(2β+ǫ)/j} (γ∗_x G_λ)(y) dλ(y) < ∞  and  ∫ L(y)^{(2β+ǫ)/β} (γ∗_x G_λ)(y) dλ(y) < ∞.

(C3) Tail. There exist positive constants c and τ such that (γ∗_x G_λ)(y) = O(e^{−c|y|^τ}).

(C4) Monotonicity. (γ∗_x G_λ) is positive and there exist y_m < y_M such that (γ∗_x G_λ) is nondecreasing on (−∞, y_m) and nonincreasing on (y_M, +∞).

All these assumptions refer to the functions (γ∗_x G_λ), which are the densities of the emission distributions with respect to the Lebesgue measure. Hence, the choice of the dominating measure λ does not matter as far as regularity conditions are concerned. Note that Kruijer et al. (2010) only assumed (C3) outside of a compact set. However, since the regularity assumption (C1) implies that (γ∗_x G_λ) is continuous, one can assume (C3) for all y without loss of generality.

It is important to note that even though we require some regularity on the emission densities, for instance through the polynomial L and the constants β and τ, we do not need to know them to construct our estimator, which is what makes it adaptive.

We consider the following models. Let p > 2 and

ψ(y) = e^{−y^p} / (2Γ(1 + 1/p)).

Let M = N∗. We take S_{K,M,n} as the set of parameters (
K, π, Q, γ) such that

• inf Q ≥ σ_−(n) := (log n)^{−1} and inf π ≥ σ_−(n);

• for all x ∈ [K], there exist (s_{x,1}, . . . , s_{x,M}) ∈ [1/M; 1]^M, (µ_{x,1}, . . . , µ_{x,M}) ∈ [−n; n]^M and w_x = (w_{x,1}, . . . , w_{x,M}) ∈ [0, 1]^M such that Σ_i w_{x,i} = 1 and for all y ∈ R,

γ_x(y) = 1/n² + (1 − 1/n²) (1/G_λ(y)) Σ_{i=1}^M (w_{x,i}/s_{x,i}) ψ((y − µ_{x,i})/s_{x,i}).

In other words, the emission densities are mixtures of λ (with weight n^{−2}) and of M translations and dilatations of ψ.

Lemma 9 (Checking the assumptions)
Assume inf Q∗ > 0. Then:

• [A⋆forgetting] and [A⋆mixing] hold.
• Assume (C3); then [A⋆tail] holds by taking B∗ ≥ log ‖Σ_x γ∗_x‖_∞ and q = 1.
• [Aergodic] holds.
• [Atail] holds by taking B(n) = 5 log n, K_n ⊂ {K : K ≤ n} and M_n = {M : m_M ≤ n} with m_M = 2M.
• [Aentropy] and [Agrowth] hold by taking m_M = 2M and C_aux = 4pn.

Proof
The first point follows from Lemma 1. The second point follows from the fact that the densities γ∗_x are uniformly bounded under (C3) and by taking δ large enough in Lemma 2. [Aergodic] holds by definition of the models. See Section A.1.1 for the proof of the last two points.

Remark 10
One can also take (s_{x,1}, . . . , s_{x,M}) ∈ [1/n; n]^M, in which case Lemma 9 holds by taking B(n) = 6 log n and C_aux = 2pn. The results of this section remain the same when the weight of λ in the emission densities of S_{K,M,n} is allowed to be larger than n^{−2} instead of being exactly n^{−2}.

Lemma 4 from Kruijer et al. (2010) implies the following result.
Lemma 11 (Approximation rates)
Assume (C1)-(C4) hold. Then there exists a sequence of mixtures (g_{M,x})_M such that n^{−2} + (1 − n^{−2}) g_{M,x} ∈ S^{(γ)}_{K∗,M,n} for all n ≥ M and

KL(γ∗_x ‖ g_{M,x}) = O( M^{−2β} (log M)^{β(1+pτ)} ).

Proof
Proof in Section A.1.2.
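The approximation scheme behind Lemma 11 can be mimicked numerically. The sketch below is purely illustrative: the Gaussian target, the grid-based choice of centers, scales and weights, and the small heavy-tailed floor standing in for the n^{−2} weight on λ are all ad hoc assumptions, not the construction used in the proof. It builds M-component location-scale mixtures of the exponential power kernel ψ and checks that the Kullback-Leibler divergence to the target decreases as M grows.

```python
import math

P = 4  # exponential power exponent; the models above assume p > 2

def psi(y, p=P):
    # exponential power kernel: psi(y) = exp(-|y|^p) / (2 Gamma(1 + 1/p))
    return math.exp(-abs(y) ** p) / (2 * math.gamma(1 + 1 / p))

def g_lambda(y):
    # heavy-tailed dominating density G_lambda(y) = 1 / (pi (1 + y^2))
    return 1.0 / (math.pi * (1 + y * y))

def target(y):
    # hypothetical smooth density to approximate: standard Gaussian
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

def mixture(M, eps=1e-6):
    # ad hoc construction: equally spaced centers on [-4, 4], common scale
    # equal to the grid spacing, weights proportional to the target at the
    # centers; eps plays the role of the n^{-2} weight put on lambda
    means = [-4 + 8 * i / (M - 1) for i in range(M)]
    s = 8 / (M - 1)
    raw = [target(m) for m in means]
    z = sum(raw)
    w = [r / z for r in raw]
    def g(y):
        mix = sum(wi / s * psi((y - mi) / s) for wi, mi in zip(w, means))
        return eps * g_lambda(y) + (1 - eps) * mix
    return g

def kl(f, g, lo=-8.0, hi=8.0, n=4000):
    # midpoint-rule approximation of KL(f || g)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        total += f(y) * math.log(f(y) / g(y)) * h
    return total

errs = [kl(target, mixture(M)) for M in (5, 10, 20, 40)]
```

As M grows, the mixture tracks the target more and more closely, in line with the bias term of Lemma 11; the residual error of this crude construction is dominated by the ripple caused by using a common scale equal to the grid spacing.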
Corollary 12 (Minimax adaptive estimation rates)
Assume (C1)-(C4) hold. Also assume that inf Q∗ > 0. Then there exists a constant C > 0 such that for all M ≥ 1 and n ≥ M,

inf_{(K∗,π,Q,γ) ∈ S_{K∗,M,n}} K(K∗, Q, γ) ≤ C ( (log n)^{···}/n + M^{−2β} (log M)^{β(1+pτ)} (log n)^{···} ).

Hence, using Theorem 8 with pen_n(K, M) = (KM + K²)(log n)^{···}/n, there exists a constant C such that almost surely, there exists a (random) n_0 such that

∀ n ≥ n_0,  K(K̂_n, Q̂_n, γ̂_n) ≤ C n^{−2β/(2β+1)} (log n)^{pτ − pτ/(2β+1)} ≤ C n^{−2β/(2β+1)} (log n)^{pτ}.

Proof
Proof in Section A.1.3.

This result shows that our estimator reaches the minimax rate of convergence proved by Maugis-Rabusseau and Michel (2013) for density estimation in Hellinger distance, up to logarithmic factors. Since estimating a density is the same as estimating a one-state HMM, this means that our result is adaptive and minimax up to logarithmic factors when K∗ = 1. As far as we know, whether increasing the number of hidden states can improve the minimax rate of convergence is still an open problem. It seems reasonable to believe that it cannot, which would imply that our estimator is in general adaptive and minimax.
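The rate above comes from balancing the approximation term, of order M^{−2β}, against the penalized estimation term, of order M/n, ignoring logarithmic factors. The sketch below is purely illustrative (the unit constants and the discrete search are assumptions, not quantities from the proof): it minimizes this proxy risk over M and checks that the minimizer scales like n^{1/(2β+1)} and the minimal risk like n^{−2β/(2β+1)}.

```python
def proxy_risk(M, n, beta):
    # squared-bias proxy M^{-2 beta} plus variance proxy M / n (log factors dropped)
    return M ** (-2.0 * beta) + M / n

def best_M(n, beta):
    # exhaustive search over a safe range of mixture orders
    return min(range(1, 200), key=lambda M: proxy_risk(M, n, beta))

beta = 2.0
n_small, n_large = 10 ** 3, 10 ** 5
M_small, M_large = best_M(n_small, beta), best_M(n_large, beta)
r_small = proxy_risk(M_small, n_small, beta)
r_large = proxy_risk(M_large, n_large, beta)
# theory: M* ~ n^{1/(2 beta + 1)} and minimal risk ~ n^{-2 beta/(2 beta + 1)}
```

With β = 2, the optimal order grows from 5 to 13 when n goes from 10³ to 10⁵, which matches n^{1/5} up to rounding, and the minimal proxy risk shrinks by a factor close to 100^{4/5} ≈ 39.8.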
4. Perspectives
The main result of this paper is a guarantee that maximum likelihood estimators based on nonparametric hidden Markov models give sensible results even in the misspecified setting, and that their error can be controlled nonasymptotically. Two properties of both the models and the true distributions are at the core of this result: a mixing property and a forgetting property, which can be seen as a local dependence property.

These two properties are not specific to hidden Markov models. Therefore, it is likely that our result can be generalized to many other models and distributions. To name a few, one could consider hidden Markov models with continuous state space as studied in Douc and Matias (2001) or Douc et al. (2011), or more generally partially observed Markov models, see for instance Douc et al. (2016) and references therein. Special cases of partially observed Markov models are HMMs with autoregressive properties (Douc et al., 2004) and models with time inhomogeneous Markov regimes (Pouzo et al., 2016). One could also consider hidden Markov fields (Kunsch et al., 1995) and graphical models in order to handle distributions more general than time processes.

Another interesting approach is to consider other forgetting and mixing assumptions. For instance, Le Gland and Mevel (2000) state a more general version of the forgetting assumption where the constant is replaced by an almost surely finite random variable, and Gerencsér et al. (2007) give conditions under which the moments of this random variable are finite. Other mixing and weak dependence conditions have also been introduced in the literature with the hope of describing more general processes, see for instance Dedecker et al. (2007).
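The forgetting property that drives these results is easy to observe numerically: when every transition probability is bounded below, two forward filters started from different initial distributions but fed the same observations merge geometrically fast. The sketch below is an illustration under assumed toy parameters (the 3-state chain, the floor σ = 0.2 and the likelihood range are arbitrary choices, not quantities from the paper).

```python
import random

def filter_step(mu, Q, g):
    # one step of the normalized forward filter: predict with the transition
    # matrix Q, then reweight by the observation likelihoods g and renormalize
    K = len(mu)
    pred = [sum(mu[xp] * Q[xp][x] for xp in range(K)) for x in range(K)]
    upd = [pred[x] * g[x] for x in range(K)]
    z = sum(upd)
    return [u / z for u in upd]

def tv(mu, nu):
    # total variation distance between two probability vectors
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

random.seed(0)
K, sigma = 3, 0.2  # every transition probability is at least sigma
Q = []
for _ in range(K):
    r = [random.random() for _ in range(K)]
    s = sum(r)
    Q.append([sigma + (1 - K * sigma) * ri / s for ri in r])

mu = [1.0, 0.0, 0.0]  # two filters started from opposite corners of the simplex
nu = [0.0, 0.0, 1.0]
gaps = [tv(mu, nu)]
for _ in range(40):
    g = [0.9 + 0.2 * random.random() for _ in range(K)]  # mild shared likelihoods
    mu, nu = filter_step(mu, Q, g), filter_step(nu, Q, g)
    gaps.append(tv(mu, nu))
```

With every entry of Q at least σ, each prediction step contracts the total variation gap by a factor at most 1 − Kσ (Dobrushin's bound), which is what makes the collapse geometric.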
5. Proof of the oracle inequality (Theorem 8)
By definition of (K̂, π̂, Q̂, γ̂), one has for all K ≤ (log n)/C_σ, for all M such that m_M ≤ n and for all (K, π_{K,M}, Q_{K,M}, γ_{K,M}) ∈ S_{K,M,n}:

(1/n) l∗_n − (1/n) l_n(K̂, π̂, Q̂, γ̂) ≤ (1/n) l∗_n − (1/n) l_n(K, π_{K,M}, Q_{K,M}, γ_{K,M}) + pen_n(K, M) − pen_n(K̂, M̂)

where K̂ and M̂ are the selected number of hidden states and model index respectively. Let

ν(K, π, Q, γ) := ( (1/n) l∗_n − (1/n) l_n(K, π, Q, γ) ) − K(K, Q, γ);

then

K(K̂, Q̂, γ̂) ≤ K(K, Q_{K,M}, γ_{K,M}) + 2 pen_n(K, M) + [ ν(K, π_{K,M}, Q_{K,M}, γ_{K,M}) − pen_n(K, M) ] − [ ν(K̂, π̂, Q̂, γ̂) + pen_n(K̂, M̂) ].

Now, assume that with high probability, for all K, M and (K, π, Q, γ) ∈ S_{K,M,n},

|ν(K, π, Q, γ)| − pen_n(K, M) ≤ η K(K, Q, γ) + R_n   (2)

for some constant η ∈ (0, 1), some penalty pen_n and some residual term R_n. The above inequality leads to

(1 − η) K(K̂, Q̂, γ̂) ≤ (1 + η) K(K, Q_{K,M}, γ_{K,M}) + 2 pen_n(K, M) + 2 R_n,

and the oracle inequality follows, up to rescaling η, by noticing that (1 + η)/(1 − η) = 1 + 2η/(1 − η) and 1 − η ∈ (0, 1).

Let us now prove equation (2). The following remark will be useful in our proofs: since

p_{(K,π,Q,γ)}(X_k = x | Y_1^{k−1}) = [ Σ_{x′∈[K]} p_{(K,π,Q,γ)}(X_{k−1} = x′ | Y_1^{k−2}) Q(x′, x) γ_{x′}(Y_{k−1}) ] / [ Σ_{x′∈[K]} p_{(K,π,Q,γ)}(X_{k−1} = x′ | Y_1^{k−2}) γ_{x′}(Y_{k−1}) ] ∈ [σ_−(n); 1]

using [Aergodic], one has for all k, µ and (K, π, Q, γ) ∈ S_n,

L_{i,k,µ}(K, Q, γ) ∈ [ log σ_−(n) + log Σ_{x∈[K]} γ_x(Y_i) ; log Σ_{x∈[K]} γ_x(Y_i) ] = [ log σ_−(n) + b_γ(Y_i) ; b_γ(Y_i) ]   (3)

and finally, for all k, k′ ∈ N∗, for all probability distributions µ, µ′ and for all (K, π, Q, γ) and (K′, π′, Q′, γ′) ∈ S_n,

|L_{i,k,µ}(K, Q, γ) − L_{i,k′,µ′}(K′, Q′, γ′)| ≤ log(1/σ_−(n)) + |b_γ(Y_i)| + |b_{γ′}(Y_i)|,
|L_{i,k,µ}(K, Q, γ) − L∗_{i,k′}| ≤ log(1/σ_−(n)) + |b_γ(Y_i)| + |L∗_{i,k′}|.   (4)

Approximate ν(K, π, Q, γ) by the deviation

ν̄_k(t^{(D)}_{(K,Q,γ)}) := (1/n) Σ_{i=1}^n t^{(D)}_{(K,Q,γ)}(Y^i_{i−k}) − E∗[ t^{(D)}_{(K,Q,γ)}(Y^0_{−k}) ]

where D > 0 and

t^{(D)}_{(K,Q,γ)} : Y^0_{−k} ↦ (L∗_{0,k} − L_{0,k,x}(K, Q, γ)) 1_{ |L∗_{0,k}| ∨ sup_{γ′∈S^{(γ)}_n} |b_{γ′}(Y_0)| ≤ D }

for a fixed x ∈ [K]. Note that ‖t^{(D)}_{(K,Q,γ)}‖_∞ ≤ 2D + log(1/σ_−(n)) thanks to equation (3). Considering these functions t^{(D)}_{(K,Q,γ)} has two advantages. The first one is to limit the time dependency to an interval of length k, which makes it possible to use the forgetting property of the process (Y_t)_{t∈Z}. The second one is to consider bounded functionals of this process, for which one can get Bernstein-like concentration inequalities. The error of this approximation is given by the following lemma.

Lemma 13
Assume [Atail], [Aergodic], [A⋆tail] and [A⋆forgetting] hold. Also assume B(n) ≥ B∗ and σ_−(n) ≤ (1 − ρ∗)/(2 − ρ∗) ∧ 1/(1 + C∗). Then for all u ≥ 1, with probability greater than 1 − ne^{−u}, for all (K, π, Q, γ) ∈ S_n,

| ν(K, π, Q, γ) − ν̄_k(t^{(B(n)u^q)}_{(K,Q,γ)}) | ≤ 4( B(n)u^q + log(1/σ_−(n)) ) e^{−u} + 4ρ/(n(1 − ρ)²) + 4ρ^{k−1}/(1 − ρ)   (5)

where ρ = 1 − σ_−(n)/(1 − σ_−(n)).
Proof in Section 5.2.2.

The following theorem is our main technical result. It shows that ν̄_k(t^{(B(n)u^q)}_{(K,Q,γ)}) can be controlled uniformly over all models with high probability.

Theorem 14
Assume [Aergodic] , [Aentropy] and [A ⋆ mixing] . Also assume that thereexists n such that for all n > n , for all K n and M such that m M n , π ( m M K + K − ke − D (log n ) ( k + log C aux ) n. (6) Let ( w M ) M ∈M be a sequence of positive numbers such that P M e − w M e − . Thenthere exists constants C pen and A depending on n ∗ and c ∗ and a numerical constant n suchthat for all ǫ > and n > n ∨ n , the following holds.Let pen n be a function such that for all K n and M such that m M n ,pen n ( K, M ) > C pen n ( n ∗ + k + 1) " (cid:18) D + log 1 σ − ( n ) (cid:19) (log n ) ( m M K + K − × ǫ ∨ (cid:16) D + log σ − ( n ) (cid:17) (log n ) n ∗ + k + 1 (cid:18) log n + k log 2 σ − ( n ) + D + log C aux (cid:19) + (cid:18) D + log 1 σ − ( n ) (cid:19) (log n ) + ǫ ∨ (cid:16) D + log σ − ( n ) (cid:17) (log n ) n ∗ + k + 1 w M . (7) Then for all s > , with probability larger than − e − s , for all K n ∧ σ − ( n ) and M suchthat m M n and for all ( K, π, Q , γ ) ∈ S K,M,n , | ¯ ν k ( t ( D )( K, Q ,γ ) ) | − pen n ( K, M ) ǫ E [ t ( D )( K, Q ,γ ) ( Y − k ) ]+ A ( n ∗ + k + 1) (cid:18) D + log 1 σ − ( n ) (cid:19) (log n ) + 1 ǫ ∨ (cid:16) D + log σ − ( n ) (cid:17) (log n ) n ∗ + k + 1 sn . (8) Proof
Proof in Section B.

The last step is to control the variance term E∗[ t^{(D)}_{(K,Q,γ)}(Y^0_{−k})² ] by K(K, Q, γ).

Lemma 15
Assume [Atail], [Aergodic], [A⋆tail] and [A⋆forgetting] hold. Also assume that B(n) ≥ B∗ and σ_−(n) ≤ (1 − ρ∗)/(2 − ρ∗) ∧ 1/(1 + C∗) ∧ e^{−1}. Then for all k such that

k ≥ (3/σ_−(n)) ( log n + 2 log(1/σ_−(n)) ),

one has for all D > 0, v ≥ log n and (K, π, Q, γ) ∈ S_n:

( 1 / (3(2B(n)v^q + log(1/σ_−(n)))) ) E∗[ t^{(D)}_{(K,Q,γ)}(Y^i_{i−k})² ] ≤ K(K, Q, γ) + 22/n.
Proof in Section 5.2.3.

Now that the main lemmas have been stated, let us show how the assumptions of Theorem 8 lead to the desired oracle inequality. Let C_σ and C_B be two positive constants and let

σ_−(n) = C_σ (log n)^{−1},   B(n) = C_B log n.

Let α >
0. In order to have ne^{−u} ≤ n^{−α}, take u = (1 + α) log n. Note that u ≥ 1 as soon as n ≥
3. The assumptions on v and k are v > log n and k > σ − ( n ) (cid:16) log n + 2 log σ − ( n ) (cid:17) (note that the assumption on k entails ρ k − (1 − ρ ) /n ).Thus, there exists an integer n depending on C σ such that if n > n , these assumptionshold for ( k = C σ (log n ) v = log n . In order to get ǫ E ∗ [ t ( D )( K, Q ,γ ) ( Y ii − k ) ] η K ( K, Q , γ ) + ηn using Lemma 15, one needs1 ǫ > η (cid:18) C B (log n ) q + log 1 C σ + log log n (cid:19) . This quantity is smaller than η (cid:16) C B ∨ log C σ ∨ (cid:17) (log n ) q ) . Let C ǫ = 48(1 + α ) q ( C B ∨ log C σ ∨
1) and ( ǫ = C ǫ η (log n ) q ) D = B ( n ) u q = C B (1 + α ) q (log n ) q . There exists an integer n ′ depending only on C σ and α such that for all n > n ′ , (cid:18) D + log 1 σ − ( n ) (cid:19) (log n ) = (cid:18) C B (1 + α ) q (log n ) q + log 1 C σ + log log n (cid:19) (log n ) (log n ) − q ǫ and therefore 1 ǫ ∨ (cid:16) D + log σ − ( n ) (cid:17) (log n ) n ∗ + k + 1 (log n ) − q ǫ . Thus, there exists an integer n ′′ depending on C σ , C B and α such that for all n > n ′′ ∨ exp( C σ ((1 + C ∗ ) ∨ − ρ ∗ − ρ ∗ ∨ e )) ∨ exp( B ∗ C B ) ∨ exp q C σ ( n ∗ + 1) (so that k = C σ (log n ) > n ∗ + 1, racle inequality for misspecified NPHMMs B ( n ) > B ∗ and σ − ( n ) − ρ ∗ − ρ ∗ ∧ C ∗ ∧ e − ), equation (7) is implied bypen n ( K, M ) > C pen n C σ (log n ) C ǫ η (log n ) q × " w M + 2 C B (1 + α ) q (log n ) q ( m M K + K − × (cid:18) C σ (log n ) (cid:18) log 1 C σ + log log n (cid:19) + log C aux (cid:19) , such that equation (8) (combined with Lemma 15) implies | ¯ ν k ( t ( D )( K, Q ,γ ) ) | − pen n ( K, M ) η K ( K, Q , γ ) + A C σ (log n ) C ǫ η (log n ) q sn , such that equation (5) implies (cid:12)(cid:12)(cid:12) ν ( K, π, Q , γ ) − ¯ ν k ( t ( B ( n ) u q )( K, Q ,γ ) ) (cid:12)(cid:12)(cid:12) C B (1 + α ) q (log n ) q n α +1 + 4(log n ) C σ n + 4 n and such that when [Agrowth] holds and when m M n and K n , equation (6) isimplied by 13800 πn C σ (log n ) e − α ) q C B (log n ) q (log n ) n ζ n for all n > n growth , which is itself implied by27600 πC σ n (log n ) e − C B log n n ζ n i.e. 27600 πC σ (log n ) n ζ − C B , which holds for all n > n ′′ (up to modification of n ′′ ) when C B > ζ . Putting theseequations together proves Theorem 8. Let W be a nonnegative random variable such that for all u > P ∗ ( W > u q ) = e − u (if q >
0; otherwise W = 0). Assumption [Atail] implies that there exists a couplingof W and sup γ ∈ S ( γ ) n | b γ ( Y ) | such that on the event { sup γ ∈ S ( γ ) n | b γ ( Y ) | > B ( n ) } , one hassup γ ∈ S ( γ ) n | b γ ( Y ) | B ( n ) W P ∗ -almost surely. Therefore, controlling the moments of W isenough to control the moments of sup γ ∈ S ( γ ) n | b γ ( Y ) | .For u >
0, let E q ( u ) = E [ W W > u ] ,V q ( u ) = E [ W W > u ] . . Leh´ericy Lemma 16
For all u > , ( E q ( u ) u q e − u V q ( u ) u q e − u . Proof
One has E q ( u ) = Z t > P ( W > t ∨ u q ) dt = u q e − u + Z t > u q e − t /q dt = u q e − u + q Z T > u T q − e − T dT u q e − u + Z T > u e − T dT since q u q e − u . Likewise, V q ( u ) = Z a,b > P ( W > a ∨ b ∨ u q ) dt = u q e − u + 2 Z t > u q te − t /q dt = u q e − u + 2 q Z T > u T q − e − T dT = u q e − u + 2 qu q − e − u + 2 q (2 q − Z T > u T q − e − T dT by integration by parts, which is enough to conclude. Let t ( K, Q ,γ ) : Y − k L ∗ ,k − L ,k,x ( K, Q , γ ). Then, since ν ( K, π, Q , γ ) − ¯ ν k ( t ( K, Q ,γ ) ) = 1 n n X i =1 ( L ∗ i,i − − L ∗ i,k ) − n n X i =1 ( L i,i − ,π ( K, Q , γ ) − L i,k,x ( K, Q , γ )) − E [ L ∗ , ∞ − L ∗ ,k ] + E [ L , ∞ ( K, Q , γ ) − L ,k,x ( K, Q , γ )] , racle inequality for misspecified NPHMMs one gets using Lemma 6 and [A ⋆ forgetting] that | ν ( K, π, Q , γ ) − ¯ ν k ( t ( K, Q ,γ ) ) | n n X i =1 ρ ( i − ∧ k − − ρ + C ∗ n n X i =1 ρ ( i − ∧ k − ∗ + ρ k − − ρ + C ∗ ρ k − ∗ nρ (1 − ρ ) + 2 ρ k − − ρ + C ∗ (cid:18) nρ ∗ (1 − ρ ∗ ) + 2 ρ k − ∗ (cid:19) nρ (1 − ρ ) + 4 ρ k − − ρ as soon as ( ρ > ρ ∗ − ρ > C ∗ , which holds for σ − ( n ) − ρ ∗ − ρ ∗ ∧ C ∗ .Then, note that¯ ν k ( t ( K, Q ,γ ) ) − ¯ ν k ( t ( B ( n ) u q )( K, Q ,γ ) ) = 1 n n X i =1 t ( K, Q ,γ ) ( Y ii − k ) | L ∗ i,k |∨ (sup γ ′∈ S ( γ ) n | b γ ′ ( Y i ) | ) >B ( n ) u q − E ∗ [ t ( K, Q ,γ ) ( Y − k ) | L ∗ ,k |∨ (sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | ) >B ( n ) u q ] . We restrict ourselves to the event T ni =1 {| L ∗ i,k | ∨ (sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y i ) | ) B ( n ) u q } , whichoccurs with probability greater than 1 − ne − u using assumptions [Atail] and [A ⋆ tail] . Onthis event, 1 n n X i =1 t ( K, Q ,γ ) ( Y ii − k ) | L ∗ i,k |∨ (sup γ ′∈ S ( γ ) n | b γ ′ ( Y i ) | ) >B ( n ) u q = 0 . Moreover, | E ∗ [ t ( K, Q ,γ ) ( Y − k ) − t ( B ( n ) u q )( K, Q ,γ ) ( Y − k )] | = E ∗ [ | t ( K, Q ,γ ) ( Y − k ) | | L ∗ ,k |∨ (sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | ) >B ( n ) u q ] . 
Equation (3) ensures that | t ( K, Q ,γ ) ( Y − k ) | | L ∗ ,k | + sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | + log σ − ( n ) , so that | E ∗ [ t ( K, Q ,γ ) ( Y − k ) − t ( B ( n ) u q )( K, Q ,γ ) ( Y − k )] | E ∗ " | L ∗ ,k | | L ∗ ,k | >B ( n ) u q + | L ∗ ,k | B ( n ) u q < sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | ! + E ∗ " sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | >B ( n ) u q + sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | B ( n ) u q < | L ∗ ,k | ! + E ∗ " (cid:18) log 1 σ − ( n ) (cid:19) | L ∗ ,k | >B ( n ) u q + sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | >B ( n ) u q ! . Leh´ericy [Atail] and [A ⋆ tail] imply that sup γ ′ ∈ S ( γ ) n (cid:12)(cid:12) b γ ′ ( Y ) (cid:12)(cid:12) /B ( n ) and | L ∗ ,k | /B ∗ can be upperbounded by the random variable W defined in Section 5.2.1, which means that for all u > | E ∗ [ t ( K, Q ,γ ) ( Y − k ) − t ( B ( n ) u q )( K, Q ,γ ) ( Y − k )] | B ∗ E q ( u ) + B ( n ) u q P ∗ sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | > B ( n ) u q ! + B ( n ) E q ( u ) + B ( n ) u q P ∗ ( | L ∗ ,k | > B ( n ) u q )+ (cid:18) log 1 σ − ( n ) (cid:19) P ∗ sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | > B ( n ) u q ! + P ∗ ( | L ∗ ,k | > B ( n ) u q ) ! B ( n ) u q e − u + 2 (cid:18) log 1 σ − ( n ) (cid:19) e − u as soon as B ( n ) > B ∗ , which concludes the proof. Lemma 17
Assume [Atail] , [Aergodic] and [A ⋆ tail] hold. Assume σ − ( n ) e − and let V ( K, Q , γ ) := E ∗ (cid:2) ( L ∗ , ∞ − L , ∞ ( K, Q , γ )) (cid:3) . Then for all v > , B ( n ) v q + log σ − ( n ) ) V ( K, Q , γ ) K ( K, Q , γ ) + 643 e − v . Proof
We need the following lemma:
Lemma 18 (Shen et al. (2013), Lemma 4)
For any two probability measures P and Q with densities p and q and any λ ∈ (0, e^{−1}],

E_P[ (log(p/q))² ] ≤ H²(P, Q) ( 12 + 2 (log(1/λ))² ) + 8 E_P[ (log(p/q))² 1_{p/q ≥ 1/λ} ]

where H(P, Q) is the Hellinger distance between P and Q:

H²(P, Q) = 2 − 2 E_P[ (q/p)^{1/2} ] = ∫ (√p − √q)² dλ.

Take P = P∗_{Y_0 | Y^{−1}_{−∞}} and Q = P_{Y_0 | Y^{−1}_{−∞}, (K,Q,γ)}, so that E_P[(log(p/q))²] = V(K, Q, γ). Using equation (4), one gets

(log(p/q))² ≤ ( sup_{γ′∈S^{(γ)}_n} |b_{γ′}(Y_0)| + |L∗_{0,∞}| + log(1/σ_−(n)) )² ≤ 2(1 + τ) ( sup_{γ′∈S^{(γ)}_n} |b_{γ′}(Y_0)| )² + 2(1 + τ) |L∗_{0,∞}|² + (1 + 1/τ) ( log(1/σ_−(n)) )²
0. Let v be a real number such that 2 B ( n ) v q = log λ − log σ − ( n ) , then (cid:18) pq > λ (cid:19) sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | + | L ∗ , ∞ | > log 1 λ − log 1 σ − ( n ) ! sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | ∨ | L ∗ , ∞ | > B ( n ) v q ! , so that8 E P "(cid:18) log pq (cid:19) (cid:18) pq > λ (cid:19) τ ) E ∗ " | L ∗ , ∞ | | L ∗ , ∞ | >B ( n ) v q + | L ∗ , ∞ | B ( n ) v q < sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | ! + 16(1 + τ ) E ∗ " sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | >B ( n ) v q + sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | B ( n ) v q < | L ∗ , ∞ | ! + 8 (cid:18) τ (cid:19) (cid:18) log 1 σ − (cid:19) E ∗ " | L ∗ , ∞ | >B ( n ) v q + sup γ ′∈ S ( γ ) n | b γ ′ ( Y ) | >B ( n ) v q [Atail] and [A ⋆ tail] imply that sup γ ′ ∈ S ( γ ) n (cid:12)(cid:12) b γ ′ ( Y ) (cid:12)(cid:12) /B ( n ) and | L ∗ , ∞ | /B ∗ can be upperbounded by the random variable W defined in Section 5.2.1, which means that for all v > E P "(cid:18) log pq (cid:19) (cid:18) pq > λ (cid:19) τ ) ( B ∗ ) V q ( v ) + B ( n ) v q P ∗ sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | > B ( n ) v q !! + 16(1 + τ ) (cid:0) B ( n ) V q ( v ) + B ( n ) v q P ∗ ( | L ∗ ,k | > B ( n ) v q ) (cid:1) + 8 (cid:18) τ (cid:19) (cid:18) log 1 σ − ( n ) (cid:19) P ∗ sup γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | > B ( n ) v q ! + P ∗ ( | L ∗ ,k | > B ( n ) v q ) ! e − v (cid:18) τ (cid:19) (cid:18) log 1 σ − ( n ) (cid:19) + 12(1 + τ ) B ( n ) v q ! e − v (cid:18) log 1 σ − ( n ) (cid:19) + 4 B ( n ) v q ! as soon as B ( n ) > B ∗ by taking τ = .Therefore, for all v > λ defined by 2 B ( n ) v q = log λ − log σ − ( n ) satisfies λ e − (i.e. 2 B ( n ) v q > − log σ − ( n ) , which holds as soon as v > . Leh´ericy σ − ( n ) e − ), V ( K, Q , γ ) E ∗ Y − −∞ h H ( P ∗ Y | Y − −∞ , P Y | Y − −∞ , ( K, Q ,γ ) ) i
12 + 2 (cid:18) B ( n ) v q + log 1 σ − ( n ) (cid:19) ! + 64 (cid:18) log 1 σ − ( n ) (cid:19) + 4 B ( n ) v q ! e − v E ∗ Y − −∞ h KL ( P ∗ Y | Y − −∞ k P Y | Y − −∞ , ( K, Q ,γ ) ) i
12 + 2 (cid:18) B ( n ) v q + log 1 σ − ( n ) (cid:19) ! + 64 (cid:18) log 1 σ − ( n ) + 2 B ( n ) v q (cid:19) e − v using that the Kullback Leibler divergence is lower bounded by the Hellinger distance. Thecondition 2 B ( n ) v q > − log σ − ( n ) ensures that 12 + 2(2 B ( n ) v q + log σ − ( n ) ) B ( n ) v q +log σ − ( n ) ) . Finally, using E ∗ Y − −∞ [ KL ( P ∗ Y | Y − −∞ k P Y | Y − −∞ , ( K, Q ,γ ) )] = K ( K, Q , γ ) , one gets V ( K, Q , γ ) (cid:18) B ( n ) v q + log 1 σ − ( n ) (cid:19) K ( K, Q , γ ) + 64 (cid:18) B ( n ) v q + log 1 σ − ( n ) (cid:19) e − v and the lemma is proved by dividing both sides by 3 (cid:16) B ( n ) v q + log σ − ( n ) (cid:17) .The next step is the control of the difference between V ( K, Q , γ ) and E ∗ [ t ( D )( K, Q ,γ ) ( Y ii − k ) ].Taking t ( K, Q ,γ ) : Y − k L ∗ ,k − L ,k,x ( K, Q , γ ), one has by definition of t ( D )( K, Q ,γ ) E ∗ [ t ( D )( K, Q ,γ ) ( Y ii − k ) ] E ∗ [ t ( K, Q ,γ ) ( Y ii − k ) ] . Then, | E ∗ [ t ( K, Q ,γ ) ( Y ii − k ) ] − V ( K, Q , γ ) | = (cid:12)(cid:12) E ∗ (cid:2) ( L ∗ ,k − L ,k,x ( K, Q , γ )) (cid:3) − E ∗ (cid:2) ( L ∗ , ∞ − L , ∞ ( K, Q , γ )) (cid:3)(cid:12)(cid:12) E ∗ | (( L ∗ ,k − L ∗ , ∞ ) − ( L ,k,x − L , ∞ )( K, Q , γ )) × (( L ∗ ,k − L ,k,x ( K, Q , γ )) + ( L ∗ , ∞ − L , ∞ ( K, Q , γ ))) | ρ k − − ρ E ∗ " γ ′ ∈ S ( γ ) n | b γ ′ ( Y ) | + | L ∗ ,k | + | L ∗ , ∞ | + 2 log 1 σ − ( n ) ! ρ k − − ρ (cid:18) (2 B ( n ) + 2 B ∗ )(1 + E q (1)) + 2 log 1 σ − ( n ) (cid:19) ρ k − − ρ (cid:18) B ( n ) + log 1 σ − ( n ) (cid:19) . racle inequality for misspecified NPHMMs using Lemma 6, equation (4), Lemma 16, B ( n ) > B ∗ and the condition on σ − ( n ) (whichimplies ρ > ρ ∗ and − ρ > C ∗ ). 
Therefore, under the assumptions of Lemma 17, one has13(2 B ( n ) v q + log σ − ( n ) ) E ∗ [ t ( D )( K, Q ,γ ) ( Y ii − k ) ] K ( K, Q , γ ) + 643 e − v + 4 ρ k − − ρ )(2 B ( n ) v q + log σ − ( n ) ) (cid:18) B ( n ) + log 1 σ − ( n ) (cid:19) K ( K, Q , γ ) + 643 e − v + 8 ρ k − − ρ )(2 B ( n ) v q + log σ − ( n ) ) K ( K, Q , γ ) + 643 e − v + 2 ρ k − − ρ ) . Let us take k > − log n log ρ + log(1 − ρ )log ρ + 1 and v > log n , so that643 e − v + 2 ρ k − − ρ ) n + 2 n (1 − ρ )3(1 − ρ ) n . The constant ρ is defined by ρ = 1 − σ − ( n )1 − σ − ( n ) , so that − ρ σ − ( n ) and − log(1 − ρ ) log σ − ( n ) . Therefore, the condition on k holds as soon as k > σ − ( n ) (cid:18) log n + 2 log 1 σ − ( n ) (cid:19) (9)using that log log x (log x ) /e for all x > e (1 − /e ) >
1. Therefore, for all k satisfying equation (9), for all D > 0 and v ≥ log n,

( 1 / (3(2B(n)v^q + log(1/σ_−(n)))) ) E∗[ t^{(D)}_{(K,Q,γ)}(Y^i_{i−k})² ] ≤ K(K, Q, γ) + 22/n,

which concludes the proof.

Acknowledgements
I am grateful to Élisabeth Gassiat for her valuable advice and insightful discussions.
References
Grigory Alexandrovich, Hajo Holzmann, and Anna Leister. Nonparametric identificationand maximum likelihood estimation for hidden Markov models.
Biometrika , 103(2):423–434, 2016.Animashree Anandkumar, Daniel J Hsu, and Sham M Kakade. A method of moments formixture models and hidden Markov models. In
COLT, volume 1, page 4, 2012.

Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection.
Statistics surveys , 4:40–79, 2010.Andrew R Barron. The strong ergodic theorem for densities: generalized Shannon-McMillan-Breiman theorem.
The Annals of Probability, 13(4):1292–1303, 1985.

Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains.
The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.

Stéphane Bonhomme, Koen Jochmans, and Jean-Marc Robin. Non-parametric estimation of finite mixtures from repeated measurements.
Journal of the Royal Statistical Society:Series B (Statistical Methodology) , 78(1):211–229, 2016.Charlotte Boyd, Andr´e E Punt, Henri Weimerskirch, and Sophie Bertrand. Movementmodels provide insights into variation in the foraging effort of central place foragers.
Ecological modelling , 286:13–25, 2014.Richard C Bradley. Basic properties of strong mixing conditions. A survey and some openquestions.
Probability surveys , 2:107–144, 2005.Laurent Couvreur and Christophe Couvreur. Wavelet-based non-parametric HMM’s: the-ory and applications. In
Acoustics, Speech, and Signal Processing, 2000. ICASSP’00.Proceedings. 2000 IEEE International Conference on , volume 1, pages 604–607. IEEE,2000.Yohann de Castro, ´Elisabeth Gassiat, and Claire Lacour. Minimax adaptive estimation ofnonparametric hidden Markov models.
Journal of Machine Learning Research , 17(111):1–43, 2016.Yohann De Castro, Elisabeth Gassiat, and Sylvain Le Corff. Consistent estimation of thefiltering and marginal smoothing distributions in nonparametric hidden Markov models.
IEEE Transactions on Information Theory , 2017.J´erˆome Dedecker, Paul Doukhan, Gabriel Lang, Le´on R Jos´e Rafael, Sana Louhichi, andCl´ementine Prieur.
Weak dependence: With examples and applications . Springer, 2007.Randal Douc and Catherine Matias. Asymptotics of the maximum likelihood estimator forgeneral hidden Markov models.
Bernoulli , 7(3):381–420, 2001.Randal Douc and Eric Moulines. Asymptotic properties of the maximum likelihood estima-tion in misspecified hidden Markov models.
The Annals of Statistics , 40(5):2697–2732,2012.Randal Douc, Eric Moulines, and Tobias Ryd´en. Asymptotic properties of the maximumlikelihood estimator in autoregressive models with Markov regime.
The Annals of statis-tics , 32(5):2254–2304, 2004.Randal Douc, Gersende Fort, Eric Moulines, and Pierre Priouret. Forgetting the initialdistribution for hidden Markov models.
Stochastic Processes and their Applications, 119(4):1235–1256, 2009.

Randal Douc, Eric Moulines, Jimmy Olsson, and Ramon Van Handel. Consistency of the maximum likelihood estimator for general hidden Markov models. The Annals of Statistics, 39(1):474–513, 2011.

Randal Douc, Jimmy Olsson, and François Roueff. Posterior consistency for partially observed Markov models. arXiv preprint arXiv:1608.06851, 2016.

László Gerencsér, György Michaletzky, and Gábor Molnár-Sáska. An improved bound for the exponential stability of predictive filters of hidden Markov models.
Communicationsin Information & Systems , 7(2):133–152, 2007.Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hiddenMarkov models.
Journal of Computer and System Sciences , 78(5):1460–1480, 2012.Willem Kruijer, Judith Rousseau, and Aad Van Der Vaart. Adaptive Bayesian densityestimation with location-scale mixtures.
Electronic Journal of Statistics, 4:1225–1257, 2010.

Hans Künsch, Stuart Geman, and Athanasios Kehagias. Hidden Markov random fields.
The Annals of Applied Probability, 5(3):577–602, 1995.

Martin F Lambert, Julian P Whiting, and Andrew V Metcalfe. A non-parametric hidden Markov model for climate state identification.
Hydrology and Earth System SciencesDiscussions , 7(5):652–667, 2003.Fran¸cois Le Gland and Laurent Mevel. Exponential forgetting and geometric ergodicity inhidden Markov models.
Mathematics of Control, Signals and Systems , 13(1):63–93, 2000.Fabrice Lef`evre. Non-parametric probability estimation for HMM-based automatic speechrecognition.
Computer Speech & Language , 17(2):113–136, 2003.Luc Leh´ericy. State-by-state minimax adaptive estimation for nonparametric hidden Markovmodels. arXiv preprint arXiv:1706.08277 , 2017.Brian G Leroux. Maximum-likelihood estimation for hidden Markov models.
Stochasticprocesses and their applications , 40(1):127–143, 1992.Pascal Massart. Concentration inequalities and model selection. In
Lecture Notes in Math-ematics , volume 1896. Springer, Berlin, 2007.Cathy Maugis-Rabusseau and Bertrand Michel. Adaptive density estimation for clusteringwith Gaussian mixtures.
ESAIM: Probability and Statistics , 17:698–724, 2013.Florence Merlev`ede, Magda Peligrad, and Emmanuel Rio. Bernstein inequality and mod-erate deviations under strong mixing conditions. In
High dimensional probability V: theLuminy volume , pages 273–292. Institute of Mathematical Statistics, 2009.Laurent Mevel and Lorenzo Finesso. Asymptotical statistics of misspecified hidden Markovmodels.
IEEE Transactions on Automatic Control, 49(7):1123–1132, 2004.

Demian Pouzo, Zacharias Psaradakis, and Martin Sola. Maximum likelihood estimation in possibly misspecified dynamic models with time inhomogeneous Markov regimes. 2016.

Weining Shen, Surya T Tokdar, and Subhashis Ghosal. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures.
Biometrika , 100(3):623–640, 2013.Elodie Vernet. Posterior consistency for nonparametric hidden Markov models with finitestate space.
Electronic Journal of Statistics , 9(1):717–752, 2015a.Elodie Vernet. Non parametric hidden markov models with finite state space: posteriorconcentration rates. arXiv preprint arXiv:1511.08624 , 2015b.Stevenn Volant, Caroline B´erard, Marie-Laure Martin-Magniette, and St´ephane Robin.Hidden Markov models with mixtures as emission distributions.
Statistics and Computing ,24(4):493–504, 2014.C Yau, Omiros Papaspiliopoulos, Gareth O Roberts, and Christopher Holmes. Bayesiannon-parametric hidden Markov models with applications in genomics.
Journal of theRoyal Statistical Society: Series B (Statistical Methodology) , 73(1):37–57, 2011.
Appendix A. Proofs for the minimax adaptive estimation
A.1 Proofs for the mixture framework
A.1.1 Proof of Lemma 9 (checking the assumptions)
Checking [Atail]
By definition of the emission densities, $b_\gamma(y) \geq -n$ for all $\gamma \in \mathcal S^{(\gamma)}_n$. Moreover, for all $y \in \mathcal Y$ and $\gamma \in \mathcal S^{(\gamma)}_{K,M,n}$,
\begin{align*}
b_\gamma(y) &\leq \log\bigg( \sum_{x\in[K]} 1 \vee \max_{\mu,s}\frac{\frac1s\psi\big(\frac{y-\mu}s\big)}{G_\lambda(y)} \bigg) \leq \log K + 0 \vee \bigg( \max_{\mu,s}\log\frac1s\psi\Big(\frac{y-\mu}s\Big) - \log G_\lambda(y) \bigg) \\
&\leq \log n + 0 \vee \bigg( \max_{\mu,s}\Big\{\log\frac1s - \Big(\frac{y-\mu}s\Big)^p\Big\} + \log(1+y^2) + \log\pi \bigg) \\
&\leq \log n + 0 \vee \Big( -\min_\mu\, (y-\mu)^p + \log(1+y^2) + \log M + \log\pi \Big),
\end{align*}
where we recall that the maximum is taken over $\mu \in [-n,n]$ and $s \in [M^{-1}, 1]$. By definition of $K_n$ and $M_n$, one also has $K \leq n$ and $M \leq n/2$.

If $y \in [-n,n]$, then
\[ b_\gamma(y) \leq \log n + 0\vee\big(\log(1+y^2) + \log M + \log\pi\big) \leq \log n + 0\vee\Big(\log(1+n^2) + \log\frac n2 + \log\pi\Big) \leq 4\log n + \log\frac{\pi e}2 \leq 5\log n \]
as soon as $n \geq 5$. Otherwise, one can take $y > n$ (the case $y < -n$ is symmetric) and then
\begin{align*}
b_\gamma(y) &\leq \log n + 0\vee\big( -(y-n)^p + \log(1+y^2) + \log M + \log\pi \big) \\
&\leq \log n + 0\vee\big( -(y-n)^p + 2\log(1+2(y-n)+2n) + \log M + \log\pi \big) \\
&\leq \log n + 0\vee\Big( -Y^p + 2\log(1+2Y) + 2\log 2n + \log\frac n2 + \log\pi \Big)
\end{align*}
by writing $Y = y - n$ and using that $\log(a+b) \leq \log a + \log b$ when $a, b \geq 2$. Since $\max_{Y>0}\big( -Y^p + 2\log(1+2Y) \big) \leq \log 9$ as soon as $p \geq 2$, one gets $b_\gamma(y) \leq 4\log n + \log(18\pi) \leq 5\log n$ as soon as $n$ is large enough, so that in both cases $b_\gamma(y) \leq 5\log n$.

Checking [Aentropy] and [Hgrowth]
Let us first assume that there exists a constant $L_p$ such that, for all $y$, the function $(\mu,s) \mapsto \frac{\frac1s\psi\big(\frac{y-\mu}s\big)}{G_\lambda(y)}$ is $L_p$-Lipschitz (where the origin space is endowed with the supremum norm). Then a bracket covering of size $\epsilon$ of $[-n,n]\times[M^{-1},1]$ provides a bracket covering of $\{\gamma(\cdot|x)\}_{\gamma\in\mathcal S^{(\gamma)}_n,\, x\in[K]}$ of size $L_p\epsilon$. Since there exists a bracket covering of size $\epsilon$ of $[-n,n]\times[M^{-1},1]$ for the supremum norm with less than $\big(\frac{4n}\epsilon \vee 1\big)^2$ brackets, one gets [Aentropy] by taking $C_{\mathrm{aux}} = 4L_p n$ and $m_M = 2M$.

Let us now check that this constant $L_p$ exists. One has
\begin{align*}
\bigg| \frac{\partial}{\partial\mu} \frac{\frac1s\psi\big(\frac{y-\mu}s\big)}{G_\lambda(y)} \bigg| &= \frac{\pi(1+y^2)}{2\Gamma(1+\frac1p)} \bigg| \frac\partial{\partial\mu} \frac1s\exp\Big( -\Big(\frac{y-\mu}s\Big)^p \Big) \bigg| = \frac{\pi(1+y^2)}{2\Gamma(1+\frac1p)}\, \frac p{s^2} \Big|\frac{y-\mu}s\Big|^{p-1} \exp\Big( -\Big(\frac{y-\mu}s\Big)^p \Big) \\
&\leq \frac{\pi(1+y^2)}{2\Gamma(1+\frac1p)}\, pM^2 Z^{1-\frac1p}e^{-Z} \leq pn^4
\end{align*}
by writing $Y = |y-\mu|/s$ and $Z = Y^p$, and using that $(1+y^2) \leq 2(1+n^2)(1+Y^2)$ and that $(1+Y^2)Z^{1-1/p}e^{-Z}$ is bounded. Likewise,
\[ \bigg|\frac\partial{\partial s}\frac{\frac1s\psi\big(\frac{y-\mu}s\big)}{G_\lambda(y)}\bigg| = \frac{\pi(1+y^2)}{2\Gamma(1+\frac1p)}\, \frac1{s^2}\,|pZ - 1|\,e^{-Z} \leq \frac{\pi(1+y^2)}{2\Gamma(1+\frac1p)}\, M^2|pZ-1|e^{-Z} \leq pn^4 \]
for $p \geq 2$. Thus, one can take $L_p = pn^4$, which corresponds to $C_{\mathrm{aux}} = 4pn^5$. With this $C_{\mathrm{aux}}$, checking [Hgrowth] is straightforward for all $\zeta > 0$.

A.1.2 Proof of Lemma 11 (approximation rates)
Let $F(y) = e^{-c|y|^\tau}$. Lemma 4 of Kruijer et al. (2010) ensures that there exist $c' > 0$ and $H > \beta + 4p$ such that for all $x \in [K^*]$ and $u > 0$, there exists a mixture $g_{u,x}$ with $O(u^{-1}|\log u|^{p/\tau})$ components, each with density $\frac1u\psi\big(\frac{\cdot-\mu}u\big)$ with respect to the Lebesgue measure for some $\mu \in \{y \,|\, F(y) > c'u^H\}$, such that $g_{u,x}$ approximates the emission density $\gamma^*_x$:
\[ \max_x \mathrm{KL}(\gamma^*_x \,\|\, g_{u,x}) = O(u^{2\beta}). \]
Take $s = u|\log u|^{-p/\tau}$. When $|\mu| > s^{-1}$, one has $F(\mu) \leq \exp(-cs^{-\tau}) = o(c's^H)$. Thus, for $s$ small enough, all translation parameters $\mu$ belong to $[-s^{-1}, s^{-1}]$. Moreover, by definition of $s$, the mixture $g_{u,x}$ contains fewer than $s^{-1}$ components when $s$ is small enough. Finally, we use that
\[ s = u|\log u|^{-p/\tau} \implies u \leq 2s|\log s|^{p/\tau} \]
for $s$ small enough. Taking $s^{-1} = M$ and $g_{M,x} = g_{u,x}$ concludes the proof.

A.1.3 Proof of Corollary 12 (minimax adaptive estimation rate)
Denote by $h$ the Hellinger distance, defined by $h^2(p,q) = \mathbb E_P[(\sqrt{q/p} - 1)^2]$ for all probability densities $p$ and $q$ associated with probability measures $P$ and $Q$. Let
\[ H^2(K, \mathbf Q, \gamma) = \mathbb E^*_{Y^0_{-\infty}}\Big[ h^2\big( p^*_{Y_1 | Y^0_{-\infty}},\; p_{Y_1 | Y^0_{-\infty}, (K,\mathbf Q,\gamma)} \big) \Big] \]
be the squared Hellinger distance between the distributions of $Y_1$ conditionally to $Y^0_{-\infty}$ under the true distribution and under the parameters $(K, \mathbf Q, \gamma)$ (see Lemma 6 for the definition of these conditional distributions).

The following lemma shows that the Kullback–Leibler divergence and the squared Hellinger distance are equivalent up to a logarithmic factor and a small additive term.

Lemma 19
Assume that [A ⋆ tail] , [A ⋆ forgetting] , [Atail] and [Aergodic] hold with B ( n ) = C B log n and σ − ( n ) = C σ (log n ) − .Then there exists a constant n depending on C B and C σ such that for all n > n ∨ exp( B ∗ C B ) , one has for all ( K, Q , γ ) ∈ S n H ( K, Q , γ ) K ( K, Q , γ ) C B (log n ) (cid:18) H ( K, Q , γ ) + 3 n (cid:19) . Proof
The lower bound comes from the fact that the square of the Hellinger distance is smaller than the Kullback–Leibler divergence. For the upper bound, we use Lemma 4 of Shen et al. (2013): for all $v > 0$ and all probability measures $P$ and $Q$ with densities $p$ and $q$,
\[ \mathrm{KL}(p\|q) \leq h^2(p,q)(1+2v) + 2\,\mathbb E_P\Big[\Big(\log\frac pq\Big)^2\mathbb 1\Big\{\log\frac pq > v\Big\}\Big]. \]
Take $p = p^*_{Y_1|Y^0_{-\infty}}$ and $q = p_{Y_1|Y^0_{-\infty},(K,\mathbf Q,\gamma)}$. Then
\[ \Big|\log\frac pq\Big| \leq |b_\gamma| + |L^*_{1,\infty}| + \log\frac1{\sigma_-(n)} \quad\text{and}\quad \Big\{\log\frac pq > v\Big\} \subset \Big\{|b_\gamma| > \tfrac12\big(v - \log\tfrac1{\sigma_-(n)}\big)\Big\} \cup \Big\{|L^*_{1,\infty}| > \tfrac12\big(v - \log\tfrac1{\sigma_-(n)}\big)\Big\}. \]
Taking $v = 2C_B(\log n)^2 + \log\frac1{\sigma_-(n)}$, one gets that there exists $n_0$ depending only on $C_B$ and $C_\sigma$ such that for all $n \geq n_0$, $\frac12\big(v - \log\frac1{\sigma_-(n)}\big) \geq C_B(\log n)^2$ and $1+2v \leq 5C_B(\log n)^2$, so that
\begin{align*}
K(K,\mathbf Q,\gamma) \leq{}& 5C_B(\log n)^2 H^2(K,\mathbf Q,\gamma) + 5C_B(\log n)^2\big\{ \mathbb P^*\big(|b_\gamma| > C_B(\log n)^2\big) + \mathbb P^*\big(|L^*_{1,\infty}| > C_B(\log n)^2\big) \big\} \\
&+ 2\,\mathbb E^*\big[ (|L^*_{1,\infty}|+|b_\gamma|)^2 \big( \mathbb 1\{|L^*_{1,\infty}| > C_B(\log n)^2\} \vee \mathbb 1\{|b_\gamma| > C_B(\log n)^2\} \big) \big].
\end{align*}
Note that [A⋆tail] also holds for $L^*_{1,\infty}$ using the uniform convergence of Lemma 6. This implies that $\mathbb P^*(|L^*_{1,\infty}| > C_B(\log n)^2) \leq \exp(-\log n) = n^{-1}$, since $C_B(\log n)^2 \geq B^*\log n$ for $n \geq \exp(B^*/C_B)$. Likewise, [Atail] implies that $\mathbb P^*(|b_\gamma| > C_B(\log n)^2) \leq n^{-1}$.

The last expectation of the above equation can be written as $2\mathbb E^*[(a+b)^2\mathbb 1\{a\vee b > C_B(\log n)^2\}]$ where $a = |L^*_{1,\infty}|$ and $b = |b_\gamma|$. Then, note that
\begin{align*}
2\mathbb E^*[a^2\mathbb 1\{a\vee b > C_B(\log n)^2\}] &= 2\mathbb E^*[a^2\mathbb 1\{a > C_B(\log n)^2\}] + 2\mathbb E^*[a^2\mathbb 1\{b > C_B(\log n)^2 \geq a\}] \\
&\leq 2C_B^2(\log n)^4 e^{-\log n} + 2C_B^2(\log n)^4\,\mathbb P^*\big[b > C_B(\log n)^2\big] \leq \frac{4C_B^2(\log n)^4}{n}
\end{align*}
using $C_B\log n \geq B^*$ and Lemma 16 for the first term and [Atail] for the second one. Likewise, $2\mathbb E^*[b^2\mathbb 1\{a\vee b > C_B(\log n)^2\}] \leq \frac{4C_B^2(\log n)^4}n$, so that finally
\[ K(K,\mathbf Q,\gamma) \leq 5C_B(\log n)^2 H^2(K,\mathbf Q,\gamma) + \frac{14\,C_B^2(\log n)^4}n, \]
which concludes the proof.

Let $M \in \mathbb N^*$.
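The squared Hellinger distance used in Lemma 19 is easy to sanity-check numerically. The following sketch (an illustration only, using unit-variance Gaussian densities as test inputs) compares a direct numerical integration of $h^2(p,q) = \int(\sqrt p - \sqrt q)^2\,d\lambda$ with the closed form $2\big(1 - e^{-(\mu_1-\mu_2)^2/8}\big)$ valid for two unit-variance Gaussians:

```python
import numpy as np

def hellinger_sq(p, q, ys):
    """Squared Hellinger distance h^2(p, q) = integral of (sqrt(p) - sqrt(q))^2,
    computed by trapezoidal integration on the grid ys."""
    return np.trapz((np.sqrt(p(ys)) - np.sqrt(q(ys))) ** 2, ys)

def gaussian(mu):
    # Unit-variance Gaussian density centered at mu.
    return lambda y: np.exp(-(y - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

mu = 1.5
ys = np.linspace(-20.0, 20.0, 200001)
numeric = hellinger_sq(gaussian(0.0), gaussian(mu), ys)
# Hellinger affinity of two unit-variance Gaussians is exp(-(mu1-mu2)^2/8),
# so h^2 = 2 * (1 - affinity).
closed_form = 2.0 * (1.0 - np.exp(-mu ** 2 / 8.0))
```

The same routine can be used to check the comparison $h^2 \leq \mathrm{KL}$ on examples.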
Let $g_{M,x}$ be the approximating densities given by Lemma 11 and write $\gamma_{M,x} = \frac1n + \big(1 - \frac1n\big)g_{M,x}$ for all $x \in [K^*]$. The following lemma controls the error $H^2(K^*, \mathbf Q^*, (\gamma_{M,x})_x)$ coming from the approximation of the densities.

Lemma 20
Assume $\sigma_-(n) \leq \inf \mathbf Q^*$. Then
\[ H^2(K^*,\mathbf Q^*,(\gamma_{M,x})_x) \leq \Big( 2 + \frac{32}{(\sigma_-(n))^3(1-\rho)^3} \Big) \sum_{x\in[K^*]} h^2(\gamma^*_x, \gamma_{M,x}). \]

Proof
Let $p^*_x = p^*(X_1 = x \,|\, Y^0_{-\infty})$ and $p_x = p_{(K^*,\mathbf Q^*,(\gamma_{M,x})_x)}(X_1 = x \,|\, Y^0_{-\infty})$. The Cauchy–Schwarz inequality implies that $\big(\sqrt{\sum_x a_x} - \sqrt{\sum_x b_x}\big)^2 \leq \sum_x (\sqrt{a_x} - \sqrt{b_x})^2$, so that
\begin{align*}
h^2\Big( \sum_x p^*_x \gamma^*_x,\, \sum_x p_x\gamma_{M,x} \Big) &= \int \Big( \sqrt{\textstyle\sum_x p^*_x\gamma^*_x} - \sqrt{\textstyle\sum_x p_x\gamma_{M,x}} \Big)^2 d\lambda \leq \int \sum_x \big( \sqrt{p^*_x\gamma^*_x} - \sqrt{p_x\gamma_{M,x}} \big)^2 d\lambda \\
&\leq 2\int \sum_x \Big( p_x\big(\sqrt{\gamma^*_x} - \sqrt{\gamma_{M,x}}\big)^2 + \big(\sqrt{p_x} - \sqrt{p^*_x}\big)^2 \gamma^*_x \Big) d\lambda \\
&\leq 2\sum_x h^2(\gamma^*_x, \gamma_{M,x}) + 2\sum_x \big(\sqrt{p^*_x} - \sqrt{p_x}\big)^2.
\end{align*}
Thus, one needs to control the expectation of the second term. Since $p_x$ and $p^*_x$ belong to $[\sigma_-(n), 1]$ by minoration of their transition matrices, one has
\[ \sum_x \big(\sqrt{p_x} - \sqrt{p^*_x}\big)^2 \in \Big[ \frac14\,;\, \frac1{4\sigma_-(n)} \Big] \cdot \sum_x (p_x - p^*_x)^2. \]
The following equation follows from a careful reading of the proof of Proposition 2.1 of De Castro et al. (2017), by noticing that the roles of $\gamma^*$ and $\gamma_M$ are symmetrical in their proof:
\[ \sum_x |p_x - p^*_x| \leq \frac{4}{\sigma_-(n)(1-\rho)} \sum_{i=0}^{+\infty} \rho^i\, \frac{\max_x |\gamma^*_x(Y_{-i}) - \gamma_{M,x}(Y_{-i})|}{\sum_x \gamma^*_x(Y_{-i}) \vee \sum_x \gamma_{M,x}(Y_{-i})}. \]
Therefore, using the Cauchy–Schwarz inequality,
\[ \sum_x (p_x - p^*_x)^2 \leq \Big( \sum_x |p_x - p^*_x| \Big)^2 \leq \frac{16}{(\sigma_-(n))^2(1-\rho)^3} \sum_{i=0}^\infty \rho^i \Big( \frac{\max_x|\gamma^*_x(Y_{-i}) - \gamma_{M,x}(Y_{-i})|}{\sum_x\gamma^*_x(Y_{-i}) \vee \sum_x \gamma_{M,x}(Y_{-i})} \Big)^2. \]
Since $|a - b| \leq 2\sqrt{a\vee b}\,|\sqrt a - \sqrt b|$, one has
\begin{align*}
\mathbb E^*\Big[ \Big( \frac{\max_x|\gamma^*_x(Y_0) - \gamma_{M,x}(Y_0)|}{\sum_x\gamma^*_x(Y_0)\vee\sum_x\gamma_{M,x}(Y_0)} \Big)^2 \Big] &\leq \int \frac{\max_x(\gamma^*_x(y) - \gamma_{M,x}(y))^2}{\sum_x\gamma^*_x(y)\vee\sum_x\gamma_{M,x}(y)}\, d\lambda(y) \leq \sum_x \int \frac{(\gamma^*_x(y)-\gamma_{M,x}(y))^2}{\gamma^*_x(y)\vee\gamma_{M,x}(y)}\, d\lambda(y) \\
&\leq 4\sum_x \int \big( \sqrt{\gamma^*_x(y)} - \sqrt{\gamma_{M,x}(y)} \big)^2 d\lambda(y) = 4\sum_x h^2(\gamma^*_x,\gamma_{M,x}),
\end{align*}
so that
\[ \mathbb E^*\Big[\sum_x\big(\sqrt{p^*_x}-\sqrt{p_x}\big)^2\Big] \leq \frac1{4\sigma_-(n)}\,\mathbb E^*\Big[\sum_x(p_x-p^*_x)^2\Big] \leq \frac{16}{(\sigma_-(n))^3(1-\rho)^3}\sum_x h^2(\gamma^*_x,\gamma_{M,x}), \]
which concludes the proof of the lemma.

Finally, since $|\sqrt{a+b} - \sqrt c| \leq |\sqrt a - \sqrt c| + \sqrt{|b|}$ for all $b \in \mathbb R$, $a \geq (-b)\vee 0$ and $c \geq 0$,
\[ h^2(\gamma^*_x, \gamma_{M,x}) \leq 2h^2(\gamma^*_x, g_{M,x}) + \frac4n \leq 2\,\mathrm{KL}(\gamma^*_x \| g_{M,x}) + \frac4n. \]
Therefore,
\[ K(K^*,\mathbf Q^*,(\gamma_{M,x})_x) \leq \frac{14C_B^2(\log n)^4}n + 5C_B(\log n)^2\Big( 2 + \frac{32}{(\sigma_-(n))^3(1-\rho)^3} \Big)\sum_x\Big( \frac4n + 2\,\mathrm{KL}(\gamma^*_x\|g_{M,x}) \Big). \]
Since $\sigma_-(n) = C_\sigma(\log n)^{-1}$ and $(1-\rho)^{-1} \leq 2(\sigma_-(n))^{-1}$, there exists a constant $C$ such that for all $n$,
\[ K(K^*,\mathbf Q^*,(\gamma_{M,x})_x) \leq C\Big( \frac{(\log n)^9}n + M^{-2\beta}(\log M)^{2\beta(1+\frac p\tau)}(\log n)^9 \Big) \]
by definition of the densities $g_{M,x}$. The choice of penalty verifies the lower bound of Theorem 8.
Thus, the oracle inequality of Theorem 8 with $\eta = 1$, $\alpha = 2$ and $t = 2\log n$ entails that for $n$ large enough and for any sequence $(M_n)_n$ such that $M_n \leq n$,
\[ K(\hat K, \hat{\mathbf Q}, \hat\gamma) \leq 2K(K^*,\mathbf Q^*,(\gamma_{M_n,x})_x) + 2\,\mathrm{pen}_n(K^*, M_n) + \frac{A(\log n)^2}{n} \leq C\Big( \frac{(\log n)^9}{n} + M_n^{-2\beta}(\log n)^{2\beta(1+\frac p\tau)+9} \Big) + \frac{2(K^*)^2(\log n)^2 M_n}{n} + \frac{2A(\log n)^2}{n}. \]
Taking $M_n \sim \big(n(\log n)^{2\beta(1+p/\tau)+7}\big)^{\frac1{2\beta+1}}$, one gets the announced rate.

Appendix B. Proof of the control of $\bar\nu_k$ (Theorem 14)

Let us give an overview of the proof of the control of $\bar\nu_k$.

The first step of the proof is to obtain a Bernstein inequality on $\bar\nu_k(t)$ for a single function $t$. This is done using the mixing properties of the process $(Y_i)_i$ and by noticing that $\bar\nu_k(t)$ is the deviation of an empirical mean.

The second step is to transform the inequality on one function $t$ into an inequality on the supremum over all functions $t$ belonging to a given class. This step involves the bracketing entropy of the aforementioned class. The control of this entropy is where the shape of the penalty appears.

At this stage, one is able to upper bound the supremum of $\bar\nu_k(t^{(D)}_{(K,\mathbf Q,\gamma)})$ over all parameters $(K,\pi,\mathbf Q,\gamma) \in \mathcal S_{K,M,n}$. However, this upper bound is of order $n^{-1/2}$ (up to logarithmic factors), which is suboptimal. The third step of the proof gets rid of the $n^{-1/2}$ term by considering the processes
\[ W_{K,M,n} := \sup_{(K,\pi,\mathbf Q,\gamma)\in\mathcal S_{K,M,n}} \frac{|\bar\nu_k(t^{(D)}_{(K,\mathbf Q,\gamma)})|}{\mathbb E^*\big[t^{(D)}_{(K,\mathbf Q,\gamma)}(Z_0)^2\big] + x^2_{K,M,n}} \]
for some constants $x_{K,M,n}$. The last step of the proof consists in taking appropriate $x_{K,M,n}$ in order to have, with high probability and for all $K$ and $M$,
\[ W_{K,M,n} \leq \epsilon \qquad\text{and}\qquad W_{K,M,n}\, x^2_{K,M,n} \leq \mathrm{pen}_n(K,M) + R_n \]
for a residual term $R_n$ depending on the probability, which leads to the desired inequality:
\[ \forall (K,\pi,\mathbf Q,\gamma) \in \mathcal S_{K,M,n}, \quad |\bar\nu_k(t^{(D)}_{(K,\mathbf Q,\gamma)})| - \mathrm{pen}_n(K,M) \leq \epsilon\,\mathbb E^*\big[t^{(D)}_{(K,\mathbf Q,\gamma)}(Z_0)^2\big] + R_n. \]
The concentration results are stated in Section B.1.
The control of the bracketing entropy is done in Section B.2. Finally, the choice of $x_{K,M,n}$ and the synthesis of the proof are done in Section B.3.

In the rest of this section, we omit the dependency of $\sigma_-$, $B$, $W_{K,M}$, $x_{K,M}$ and $\mathcal S_{K,M}$ on $n$ in the notations. We also introduce the notation $\theta = (K,\pi,\mathbf Q,\gamma)$ for $(K,\pi,\mathbf Q,\gamma) \in \mathcal S_n$ to make the notations shorter. Given $\theta \in \mathcal S_n$, we write $\pi_\theta$, $\mathbf Q_\theta$ and $\gamma_\theta$ for its components.

B.1 Concentration inequality
First, let us introduce some notations. Let $D > 0$, $K \geq 1$, $M \in \mathcal M$ and $k \geq 1$. For all $i \in \mathbb Z$, let $Z_i = Y^i_{i-k}$. Define for all $\sigma > 0$,
\[ \mathcal B_\sigma = \big\{ \theta \in \mathcal S_{K,M} \;\big|\; \mathbb E^*\big[t^{(D)}_\theta(Z_0)^2\big] \leq \sigma^2 \big\}. \]
Let $d_k$ be the semi-distance defined by $d_k(t_1, t_2)^2 = \mathbb E^*[(t_1 - t_2)^2(Z_0)]$. Let $N_{[\cdot]}(A, d, \epsilon) = e^{H_{[\cdot]}(A,d,\epsilon)}$ denote the minimal cardinality of a covering of $A$ by brackets of size $\epsilon$ for the semi-distance $d$, that is by sets $[t_1, t_2] = \{t : \mathcal Y^{k+1} \to \mathbb R,\; t_1(\cdot) \leq t(\cdot) \leq t_2(\cdot)\}$ such that $d(t_1,t_2) \leq \epsilon$. $H_{[\cdot]}(A,d,\cdot)$ is called the bracketing entropy of $A$ for the semi-distance $d$.

The first step of the proof is to obtain a Bernstein inequality for the deviations of a single $t^{(D)}(Z_i)$.

Theorem 21
Assume [A⋆mixing] holds. Then there exists a constant $C_{\mathrm{mix}}$ depending on $c^*$ and $n^*$ such that the following holds. Let $t$ be a real valued, measurable, bounded function on $\mathcal Y^{k+1}$. Let $V = \mathbb E^*[t(Z_0)^2]$. Then for all $\lambda \in \big(0, \frac{1}{C_{\mathrm{mix}}(n^*+k+1)\|t\|_\infty(\log n)^2}\big)$ and for all $n \in \mathbb N^*$,
\[ \phi(\lambda) := \log \mathbb E^* \exp\bigg[\lambda \sum_{i=1}^n \big(t(Z_i) - \mathbb E^* t(Z_i)\big)\bigg] \leq \frac{C_{\mathrm{mix}}(n^*+k+1)^2\, (nV + \|t\|_\infty^2)\,\lambda^2}{1 - C_{\mathrm{mix}}(n^*+k+1)\|t\|_\infty(\log n)^2\,\lambda}. \]

Proof
The following result is a Bernstein inequality for exponentially $\alpha$-mixing processes.

Lemma 22 (Merlevède et al. (2009), Theorem 2)

Let $(A_i)_{i \geq 1}$ be a stationary sequence of centered real-valued random variables such that $\|A_1\|_\infty \leq M$ and whose $\alpha$-mixing coefficients satisfy, for a certain $c > 0$,
\[ \forall n \in \mathbb N^*, \quad \alpha_{\mathrm{mix}}(n) \leq e^{-cn}. \]
Then there exist positive constants $C_1$ and $C_2$ depending on $c$ such that for all $n \geq 2$ and all $\lambda \in \big(0, \frac1{C_2 M(\log n)^2}\big)$,
\[ \log \mathbb E \exp\bigg[\lambda\sum_{i=1}^n A_i\bigg] \leq \frac{C_1\lambda^2(nv + M^2)}{1 - C_2\lambda M(\log n)^2}, \]
where $v$ is defined by $v = \mathrm{Var}(A_1) + 2\sum_{i>1}|\mathrm{Cov}(A_1, A_i)|$.

Assumption [A⋆mixing] implies that the $\alpha$-mixing coefficients of $(Y_i)_i$ satisfy $\alpha_{\mathrm{mix}}(n) \leq e^{-c^* n}$ for all $n \geq n^*$, since $4\alpha_{\mathrm{mix}}(n) \leq \rho_{\mathrm{mix}}(n)$ (see for instance Bradley (2005)). However, this is not enough to apply the previous result: one needs the inequality to hold for all $n$ (and not only for $n$ larger than some constant) and for the process $(Z_i)_i$. To do so, we partition the process $(Z_i)_i$ into several processes for which the above result applies, and then gather the inequalities.

Consider the processes $(Z_{i(n^*+k+1)+j})_i$ with $\alpha$-mixing coefficients $\alpha_{Z,j}(n)$. By construction, they satisfy $\alpha_{Z,j}(n) \leq e^{-c^* n^* n}$ for all $n \geq 1$ and $j \in \{1,\ldots,n^*+k+1\}$. Applying Lemma 22, one gets that there exist two positive constants $C_1$ and $C_2$ depending on $c^*$ and $n^*$ such that for all functions $t$, all $\lambda \in \big(0, \frac1{C_2\|t\|_\infty(\log n)^2}\big)$ and all $n \in \mathbb N^*$,
\[ \phi_j(\lambda) := \log \mathbb E^* \exp\bigg[\lambda \sum_i \big(t(Z_{i(n^*+k+1)+j}) - \mathbb E\, t(Z_{i(n^*+k+1)+j})\big)\bigg] \leq \frac{C_1\lambda^2(nv + \|t\|_\infty^2)}{1 - C_2\lambda\|t\|_\infty(\log n)^2}, \]
where, denoting $V = \mathbb E^* t(Z_0)^2$,
\[ v = \mathrm{Var}\big(t(Z_j)\big) + 2\sum_{i>0}\big|\mathrm{Cov}\big(t(Z_j), t(Z_{i(n^*+k+1)+j})\big)\big| \leq V\bigg( 1 + 2\sum_{i>0} e^{-c^* n^* i} \bigg) \leq \frac{3V}{1 - e^{-c^* n^*}} \]
using [A⋆mixing]. Finally, using that $\mathbb E\prod_{i=1}^k A_i \leq \prod_{i=1}^k(\mathbb E A_i^k)^{1/k}$ for any positive integer $k$ and any positive random variables $(A_i)_{i\leq k}$, one gets
\[ \phi(\lambda) \leq \frac1{n^*+k+1}\sum_{j=1}^{n^*+k+1}\phi_j\big((n^*+k+1)\lambda\big), \]
so that
\[ \phi(\lambda) \leq \frac{\frac{3C_1}{1 - e^{-c^*n^*}}\,(n^*+k+1)^2\lambda^2(nV + \|t\|_\infty^2)}{1 - C_2(n^*+k+1)\lambda\|t\|_\infty(\log n)^2}, \]
which concludes the proof.

The following result follows mutatis mutandis from the proof of Theorem 6.8 of Massart (2007) using the previous theorem.

Lemma 23
Assume [A⋆mixing] holds. Then there exists a constant $C^* > 0$ depending on $n^*$ and $c^*$ such that the following holds. Let $\mathcal T$ be a class of real valued and measurable functions on $\mathcal Y^{k+1}$ which is separable for the supremum norm. Also assume that there exist positive numbers $\sigma$ and $b$ such that for all $t \in \mathcal T$, $\|t\|_\infty \leq b$ and $\mathbb E^* t(Z_0)^2 \leq \sigma^2$, and assume that $N_{[\cdot]}(\mathcal T, d_k, \delta)$ is finite for all $\delta > 0$. Then for all measurable sets $A$ such that $\mathbb P^*(A) > 0$,
\[ \mathbb E^*\Big( \sup_{t\in\mathcal T}|\bar\nu_k(t)| \;\Big|\; A \Big) \leq C^*(n^*+k+1)\bigg[ \frac En + \sigma\sqrt{\frac1n\log\frac1{\mathbb P^*(A)}} + \frac{b(\log n)^2}{n}\log\frac1{\mathbb P^*(A)} \bigg], \]
where
\[ E = \sqrt n\int_0^\sigma \sqrt{H_{[\cdot]}(\mathcal T, d_k, u) \wedge n}\;du + b(\log n)^2 H_{[\cdot]}(\mathcal T, d_k, \sigma). \]
Now, by taking $\mathcal T = \{t^{(D)}_\theta \,|\, \theta \in \mathcal B_\sigma\}$ and $b = 2D + \log\frac1{\sigma_-}$, one gets the following lemma from Lemma 4.23 and Lemma 2.4 of Massart (2007):

Lemma 24
Assume that there exists a function $\varphi$ and constants $C$ and $\sigma_{K,M}$ such that $x \mapsto \frac{\varphi(x)}x$ is nonincreasing and
\[ \forall \sigma \geq \sigma_{K,M}, \quad E \leq C\varphi(\sigma)\sqrt n. \tag{10} \]
Then for all $x_{K,M} \geq \sigma_{K,M}$ and $z > 0$, one has with probability greater than $1 - e^{-z}$:
\[ W_{K,M} := \sup_{\theta\in\mathcal S_{K,M}} \frac{|\bar\nu_k(t^{(D)}_\theta)|}{\mathbb E^*\big[t^{(D)}_\theta(Z_0)^2\big] + x_{K,M}^2} \leq C^*(n^*+k+1)\bigg[ C\frac{\varphi(x_{K,M})}{x_{K,M}^2\sqrt n} + \sqrt{\frac{z}{n\,x_{K,M}^2}} + \Big(2D + \log\frac1{\sigma_-}\Big)\frac{z(\log n)^2}{x_{K,M}^2\, n} \bigg]. \tag{11} \]
The two remaining steps are the control of the bracketing entropy, which will lead to equation (10) (see Section B.2), and the choice of the parameters $x_{K,M}$ and $z$ (see Section B.3).

B.2 Control of the bracketing entropy
B.2.1 Reduction of the set
For all $\theta \in \mathcal S_{K,M}$, let $g_\theta = (g_{\theta,x})_{x\in[K]}$ where
\[ g_{\theta,x} : y_0^k \mapsto p_\theta\big(X_k = x, Y_k = y_k \,\big|\, Y_0^{k-1} = y_0^{k-1}\big)\,\mathbb 1\Big\{ |L^*_{k,k}| \vee \sup_{\theta'\in\mathcal S_n}|b_{\theta'}(y_k)| \leq D \Big\}. \]
Instead of directly controlling the bracketing entropy of $\{t^{(D)}_\theta \,|\, \theta\in\mathcal B_\sigma\}$, we control the bracketing entropy of the set $\mathcal G := \{g_\theta \,|\, \theta\in\mathcal S_{K,M}\}$ for the distance
\[ d_{\mathcal G}(g_{\theta_1}, g_{\theta_2}) = \mathbb E^*_{Y_0^{k-1}}\bigg[ \sum_{x\in[K]}\int \big|g_{\theta_1,x}(Y_0^{k-1}, y_k) - g_{\theta_2,x}(Y_0^{k-1}, y_k)\big|\,\mathbb 1\Big\{|L^*_{k,k}| \vee \sup_{\theta'\in\mathcal S_n}|b_{\theta'}(y_k)| \leq D\Big\}\, d\lambda(y_k) \bigg]. \]

Remark 25

In the rest of Section B.2, we always assume that
\[ |L^*_{k,k}| \vee \sup_{\theta'\in\mathcal S_n}|b_{\theta'}(y_k)| \leq D \tag{12} \]
since if this is not the case, then $t^{(D)}_{\theta_1}(y_0^k) = t^{(D)}_{\theta_2}(y_0^k) = 0$. This means that only the $y_0^k$ satisfying equation (12) are relevant for the construction of the brackets.

For all $\theta\in\mathcal S_{K,M}$, one has
\[ \sum_{x\in[K]} g_{\theta,x} = \sum_{x,x'\in[K]} p_\theta(Y_k = y_k | X_k = x)\,\mathbf Q_\theta(x',x)\, p_\theta(X_{k-1} = x' | Y_0^{k-1} = y_0^{k-1}) \in [\sigma_-, 1]\cdot e^{b_\theta(y_k)}, \]
so that for all $\theta\in\mathcal S_{K,M}$, using equation (12),
\[ \sigma_- e^{-D} \leq \sum_{x\in[K]} g_{\theta,x} \leq e^D. \]
Let $[a,b]$ be a bracket of size $\epsilon$ for $\mathcal G$ with the distance $d_{\mathcal G}$ such that $\sigma_- e^{-D}/2 \leq \sum_x a_x \leq \sum_x b_x \leq 2e^D$. Then
\[ \Big| \log\sum_x a_x - \log\sum_x b_x \Big| \leq \Big( 2D + \log\frac4{\sigma_-} \Big) \wedge \Big( \frac{2e^D}{\sigma_-}\sum_x|a_x - b_x| \Big) \]
using that $|\log a - \log b| \leq |a-b|/(a\wedge b)$. Therefore,
\[ d_k\Big(\log\sum_x a_x,\, \log\sum_x b_x\Big)^2 = \mathbb E^*_{Y_0^{k-1}}\bigg[\int \Big(\log\sum_x a_x - \log\sum_x b_x\Big)^2(Y_0^{k-1}, y_k)\, p^*(Y_k = y_k | Y_0^{k-1})\,\lambda(dy_k)\bigg] \leq 4\Big(2D + \log\frac1{\sigma_-}\Big)\frac{e^{2D}}{\sigma_-}\, d_{\mathcal G}(a,b), \]
so that
\[ N_{[\cdot]}\big(\{t^{(D)}_\theta \,|\, \theta\in\mathcal B_\sigma\}, d_k, \epsilon\big) \leq \bar N\bigg( \mathcal G, d_{\mathcal G}, \frac{\sigma_-\,\epsilon^2}{4\big(2D + \log\frac1{\sigma_-}\big)e^{2D}} \bigg) \tag{13} \]
where $\bar N$ is the minimal cardinality of a bracket covering of $\mathcal G$ such that all brackets $[a,b]$ satisfy $\sigma_- e^{-D}/2 \leq \sum_x a_x \leq \sum_x b_x \leq 2e^D$.

B.2.2 Decomposition into simple sets
The aim of this section is to prove the following lemma.

Lemma 26
Let $\epsilon \in \big(0, \frac1{110k}\big(\frac{\sigma_-}2\big)^{k+1}\big)$. Then
\[ \bar N(\mathcal G, d_{\mathcal G}, \epsilon) \leq N_{[\cdot]}\Big(\{\pi_\theta\}_{\theta\in\mathcal S_{K,M}}, d_\infty, \Big(\frac{\sigma_-}2\Big)^k\frac{\epsilon}{110\,ke^D}\Big) \times N_{[\cdot]}\Big(\{\mathbf Q_\theta\}_{\theta\in\mathcal S_{K,M}}, d_\infty, \Big(\frac{\sigma_-}2\Big)^k\frac{\epsilon}{110\,ke^D}\Big) \times N_{[\cdot]}\Big(\{\gamma_\theta\}_{\theta\in\mathcal S_{K,M}}, d_\infty, \Big(\frac{\sigma_-}2\Big)^k\frac{\epsilon\, e^{-D}}{110\,ke^D}\Big) \]
where $d_\infty$ is the distance of the supremum norm and where $\gamma_\theta$ denotes the function $(x,y)\mapsto\gamma_\theta(y|x)$.

Let:
- $[a,b]$ be a bracket of $\{\pi_\theta\}_{\theta\in\mathcal S_{K,M}}$ of size $\epsilon$ for the supremum norm;
- $[p,q]$ be a bracket of $\{\mathbf Q_\theta\}_{\theta\in\mathcal S_{K,M}}$ of size $\epsilon$ for the supremum norm;
- $[u,v]$ be a bracket of $\{\gamma_\theta\}_{\theta\in\mathcal S_{K,M}}$ of size $\epsilon e^{-D}$ for the supremum norm.

Without loss of generality, one can assume $\sigma_- \leq a(x) \leq b(x) \leq 1$ and $\sigma_- \leq p(x,x') \leq q(x,x') \leq 1$ for all $x, x' \in [K]$, since all elements of $\{\pi_\theta\}_{\theta\in\mathcal S_{K,M}}$ and $\{\mathbf Q_\theta\}_{\theta\in\mathcal S_{K,M}}$ satisfy these inequalities. One can also assume that there exists $\theta\in\mathcal S_{K,M}$ such that $\pi_\theta\in[a,b]$, $\mathbf Q_\theta\in[p,q]$ and $\gamma_\theta\in[u,v]$. Under this assumption, all brackets that we construct are non-empty, and for all $y\in\mathcal Y$, $e^{-D}(1-K\epsilon) \leq \sum_x u(y|x) \leq \sum_x v(y|x) \leq e^D + K\epsilon e^{-D}$.

Using the approach of Appendix A of De Castro et al. (2017), one can write $g_{\theta,x}$ as the following product of matrices:
\[ g_{\theta,x}(y_0^k) = \big( \mu_{\theta,0|k-1}\, F_{\theta,1|k-1}\cdots F_{\theta,k-1|k-1}\,\mathbf Q_\theta \big)_x\, \gamma_\theta(y_k|x), \]
where
\begin{align*}
\beta_{i|k}(x_i) &= \sum_{x_{i+1}^k\in[K]^{k-i}} \mathbf Q_\theta(x_i,x_{i+1})\gamma_\theta(y_{i+1}|x_{i+1})\cdots\mathbf Q_\theta(x_{k-1},x_k)\gamma_\theta(y_k|x_k), \\
\mu_{\theta,0|k}(x) &= \frac{\pi_\theta(x)\beta_{0|k}(x)}{\sum_{x'\in[K]}\pi_\theta(x')\beta_{0|k}(x')}, \\
F_{\theta,i|k}(x_{i-1},x_i) &= \frac{\beta_{i|k}(x_i)\,\mathbf Q_\theta(x_{i-1},x_i)\,\gamma_\theta(y_i|x_i)}{\sum_{x\in[K]}\beta_{i|k}(x)\,\mathbf Q_\theta(x_{i-1},x)\,\gamma_\theta(y_i|x)}.
\end{align*}
To clarify the role of these quantities, observe that
\begin{align*}
\beta_{i|k}(x_i) &= \mathbb P_\theta(Y_{i+1}^k \,|\, X_i = x_i), \\
\mu_{\theta,0|k}(x) &= \mathbb P_\theta(X_0 = x \,|\, Y_0^k), \\
F_{\theta,i|k}(x_{i-1},x_i) &= \mathbb P_\theta(X_i = x_i \,|\, Y_i^k, X_{i-1} = x_{i-1}),
\end{align*}
so that
\[ \big( \mu_{\theta,0|k}\, F_{\theta,1|k}\cdots F_{\theta,k|k} \big)_x = \mathbb P_\theta(X_k = x \,|\, Y_0^k). \]
Now, let
\begin{align*}
\alpha_{i|k}(x_i) &= \sum_{x_{i+1}^k\in[K]^{k-i}} p(x_i,x_{i+1})u(y_{i+1}|x_{i+1})\cdots p(x_{k-1},x_k)u(y_k|x_k), \\
\delta_{i|k}(x_i) &= \sum_{x_{i+1}^k\in[K]^{k-i}} q(x_i,x_{i+1})v(y_{i+1}|x_{i+1})\cdots q(x_{k-1},x_k)v(y_k|x_k), \\
\nu(x) &= \frac{a(x)\alpha_{0|k}(x)}{\sum_{x'\in[K]}b(x')\delta_{0|k}(x')}, \qquad \omega(x) = \frac{b(x)\delta_{0|k}(x)}{\sum_{x'\in[K]}a(x')\alpha_{0|k}(x')},
\end{align*}
and
\[ f_{i|k}(x_{i-1},x_i) = \frac{\alpha_{i|k}(x_i)\,p(x_{i-1},x_i)\,u(y_i|x_i)}{\sum_{x\in[K]}\delta_{i|k}(x)\,q(x_{i-1},x)\,v(y_i|x)}, \qquad g_{i|k}(x_{i-1},x_i) = \frac{\delta_{i|k}(x_i)\,q(x_{i-1},x_i)\,v(y_i|x_i)}{\sum_{x\in[K]}\alpha_{i|k}(x)\,p(x_{i-1},x)\,u(y_i|x)}. \]
$[\nu,\omega]$ and $[f_{i|k}, g_{i|k}]$ are brackets of $\{\mu_{\theta,0|k}\}_{\theta\in\mathcal S_{K,M}}$ and $\{F_{\theta,i|k}\}_{\theta\in\mathcal S_{K,M}}$ for all $i\in\{1,\ldots,k\}$. Moreover, if one has a bracket covering of the sets $\{\pi_\theta\}_{\theta\in\mathcal S_{K,M}}$, $\{\mathbf Q_\theta\}_{\theta\in\mathcal S_{K,M}}$ and $\{\gamma_\theta\}_{\theta\in\mathcal S_{K,M}}$, then this construction gives a bracket covering of $\{\mu_{\theta,0|k}\}_{\theta\in\mathcal S_{K,M}}$ and $\{F_{\theta,i|k}\}_{\theta\in\mathcal S_{K,M}}$ for all $i\in\{1,\ldots,k\}$.

The next step of the proof is to control the size of these new brackets.

Lemma 27
Assume $\epsilon \leq \frac1{2K}$. Then
\[ \sup_{0\leq i\leq k} \frac{\sum_{x\in[K]}|\alpha_{i|k}(x) - \delta_{i|k}(x)|}{\sum_{x\in[K]}\alpha_{i|k}(x)} \leq 4\Big(\frac2{\sigma_-}\Big)^{k-i}\epsilon \quad\text{and}\quad \sup_{0\leq i\leq k} \frac{\sum_{x\in[K]}|\alpha_{i|k}(x)u(y_i|x) - \delta_{i|k}(x)v(y_i|x)|}{\sum_{x\in[K]}\alpha_{i|k}(x)u(y_i|x)} \leq 4\Big(\frac2{\sigma_-}\Big)^{k-i+1}\epsilon. \]

Proof
Using minimalist notations, one has
\[ \sum_{x\in[K]}|\alpha_{i|k}(x) - \delta_{i|k}(x)| \leq \sum_{j=i+1}^k \sum_{x_i^k} p_i^{i+1}u_{i+1}\cdots u_{j-1}\,|p_{j-1}^j - q_{j-1}^j|\,v_j\cdots q_{k-1}^k v_k + \sum_{j=i+1}^k \sum_{x_i^k} p_i^{i+1}u_{i+1}\cdots p_{j-1}^j\,|u_j - v_j|\,q_j^{j+1}\cdots q_{k-1}^k v_k. \]
Then, note that for all $j$,
\[ \sum_{x_i^k} p_i^{i+1}u_{i+1}\cdots p_{j-2}^{j-1}u_{j-1}\,|p_{j-1}^j - q_{j-1}^j|\,v_j\,q_j^{j+1}\cdots q_{k-1}^k v_k \leq \epsilon \sum_{x_i^{j-1}} p_i^{i+1}u_{i+1}\cdots p_{j-2}^{j-1}u_{j-1}\,\sum_{x_j}(u_j + \epsilon e^{-D})\cdots\sum_{x_k}(u_k + \epsilon e^{-D}) \]
and
\[ \sum_{x_i^k} p_i^{i+1}u_{i+1}\cdots p_{j-2}^{j-1}u_{j-1}\,p_{j-1}^j u_j\cdots p_{k-1}^k u_k \geq \sigma_-^{k-j+1}\sum_{x_i^{j-1}} p_i^{i+1}u_{i+1}\cdots p_{j-2}^{j-1}u_{j-1}\,\sum_{x_j}u_j\cdots\sum_{x_k}u_k, \]
so that
\[ \frac{\sum_{x_i^k} p_i^{i+1}u_{i+1}\cdots u_{j-1}|p_{j-1}^j - q_{j-1}^j|v_j\cdots q_{k-1}^k v_k}{\sum_{x_i^k} p_i^{i+1}u_{i+1}\cdots u_{j-1}\,p_{j-1}^j u_j\cdots p_{k-1}^k u_k} \leq \frac{\epsilon}{\sigma_-^{k-j+1}}\prod_{\ell=j}^k \frac{K\epsilon e^{-D} + \sum_{x_\ell}u_\ell}{\sum_{x_\ell}u_\ell} \leq \frac{\epsilon}{\sigma_-^{k-j+1}}\Big(\frac1{1-K\epsilon}\Big)^{k-j+1} \leq \Big(\frac2{\sigma_-}\Big)^{k-j+1}\epsilon, \]
and likewise
\[ \frac{\sum_{x_i^k} p_i^{i+1}u_{i+1}\cdots p_{j-1}^j|u_j - v_j|q_j^{j+1}\cdots q_{k-1}^k v_k}{\sum_{x_i^k} p_i^{i+1}u_{i+1}\cdots u_{j-1}\,p_{j-1}^j u_j\cdots p_{k-1}^k u_k} \leq \Big(\frac2{\sigma_-}\Big)^{k-j+1}\epsilon. \]
Therefore, when $\epsilon \leq \frac1{2K}$, one has
\[ \frac{\sum_{x\in[K]}|\alpha_{i|k}(x) - \delta_{i|k}(x)|}{\sum_{x\in[K]}\alpha_{i|k}(x)} \leq 2\epsilon\sum_{j=i+1}^k\Big(\frac2{\sigma_-}\Big)^{k-j+1} = 2\epsilon\sum_{a=1}^{k-i}\Big(\frac2{\sigma_-}\Big)^a \leq \frac{2\epsilon}{1-\frac{\sigma_-}2}\Big(\frac2{\sigma_-}\Big)^{k-i} \leq 4\Big(\frac2{\sigma_-}\Big)^{k-i}\epsilon, \]
which gives the desired result. The proof of the second case is similar and comes from the fact that
\[ \sum_{x\in[K]}|\alpha_{i|k}(x)u(y_i|x) - \delta_{i|k}(x)v(y_i|x)| \leq \sum_{j=i+1}^k \sum_{x_i^k} u_i\,p_i^{i+1}u_{i+1}\cdots u_{j-1}|p_{j-1}^j - q_{j-1}^j|v_j\cdots q_{k-1}^k v_k + \sum_{j=i}^k \sum_{x_i^k} u_i\,p_i^{i+1}u_{i+1}\cdots p_{j-1}^j|u_j - v_j|q_j^{j+1}\cdots q_{k-1}^k v_k. \]

Lemma 28
Assume $\epsilon \leq \frac1{2K}$. Then
\[ \|\nu - \omega\|_1 \leq 5\Big(\frac2{\sigma_-}\Big)^{k+1}\epsilon \quad\text{and}\quad \sup_{1\leq i\leq k}\sup_{x\in[K]}\|f_{i|k}(x,\cdot) - g_{i|k}(x,\cdot)\|_1 \leq 5\Big(\frac2{\sigma_-}\Big)^{k-i+2}\epsilon \leq 5\Big(\frac2{\sigma_-}\Big)^{k+1}\epsilon. \tag{14} \]

Proof
With minimalist notations, one has
\[ \sum|\nu - \omega| = \sum\bigg|\frac{a\alpha}{\sum b\delta} - \frac{b\delta}{\sum a\alpha}\bigg| \leq \frac{\sum|a\alpha - b\delta|}{\sum b\delta} + \frac{\sum|a\alpha - b\delta|}{\sum a\alpha} \leq \frac2{\sigma_-}\cdot\frac{\sum|a-b|\,\alpha + \sum b\,|\alpha - \delta|}{\sum\alpha} \leq \frac2{\sigma_-}\bigg(\epsilon + 4\Big(\frac2{\sigma_-}\Big)^k\epsilon\bigg) \leq 5\Big(\frac2{\sigma_-}\Big)^{k+1}\epsilon, \]
using that $\sigma_- \leq a \leq b \leq 1$. Likewise, for all $i\in\{1,\ldots,k\}$ and $x\in[K]$,
\[ \sum_{x'\in[K]}|g_{i|k} - f_{i|k}|(x,x') \leq \frac{\sum|\alpha pu - \delta qv|}{\sum\delta qv} + \frac{\sum|\alpha pu - \delta qv|}{\sum\alpha pu} \leq \frac2{\sigma_-}\cdot\frac{\sum|\alpha u - \delta v|\,q + \sum\alpha u\,|p-q|}{\sum\alpha u\, p} \leq \frac2{\sigma_-}\bigg(4\Big(\frac2{\sigma_-}\Big)^{k-i+1}\epsilon + \frac\epsilon{\sigma_-}\bigg) \leq 5\Big(\frac2{\sigma_-}\Big)^{k-i+2}\epsilon. \]
Define $\eta = 5\big(\frac2{\sigma_-}\big)^{k+1}\epsilon$. Equation (14) implies that as soon as $\eta \leq 1 - K\sigma_-$, it is possible to enlarge the bracket $[f_{i|k}, g_{i|k}]$ into a bracket $[f'_{i|k}, g'_{i|k}]$ of size smaller than $3\eta$ for the norm of Lemma 28 such that $f'_{i|k}/(1-\eta)$ and $g'_{i|k}/(1+\eta)$ are transition matrices.

For instance, one can take any $f'$ and $g'$ such that $\sigma_-\mathbf 1\mathbf 1^\top \leq f' \leq f \leq g \leq g' \leq \mathbf 1\mathbf 1^\top$ coefficient-wise and such that $f'\mathbf 1 = (1-\eta)\mathbf 1$ and $g'\mathbf 1 = (1+\eta)\mathbf 1$ (where $\mathbf 1$ is a vector of size $K$ whose coefficients are all equal to 1). One can construct such a matrix $f'$ (resp. $g'$) by taking a suitable barycenter of the lines of $\sigma_-\mathbf 1\mathbf 1^\top$ and $f$ (resp. $\mathbf 1\mathbf 1^\top$ and $g$) for the lines of $f'$ (resp. $g'$). The only condition is $K\sigma_- \leq 1-\eta \leq \max_x(f\mathbf 1)_x \leq \max_x(g\mathbf 1)_x \leq 1+\eta \leq K$, which is true when $\eta \leq 1 - K\sigma_-$.

Let
\[ A_x(y_0^k) = \big(\nu f'_{1|k-1}\cdots f'_{k-1|k-1}\, p\big)_x\, u(y_k|x), \qquad B_x(y_0^k) = \big(\omega g'_{1|k-1}\cdots g'_{k-1|k-1}\, q\big)_x\, v(y_k|x). \]
$[A,B]$ is a bracket of $\mathcal G$, and this construction gives a bracket covering of $\mathcal G$.

Lemma 29
Assume $\epsilon \leq \frac1{2K}\wedge\frac1{10k}\big(\frac{\sigma_-}2\big)^{k+1}$. Then for all $y_1^k$,
\[ \sum_{x\in[K]}\big|(\nu f'_{1|k}\cdots f'_{k|k})_x - (\omega g'_{1|k}\cdots g'_{k|k})_x\big| \leq 7k\eta = 35k\Big(\frac2{\sigma_-}\Big)^{k+1}\epsilon \]
and
\[ \sum_{x\in[K]}\big|(\nu f'_{1|k}\cdots f'_{k|k}\,p)_x - (\omega g'_{1|k}\cdots g'_{k|k}\,q)_x\big| \leq 53k\Big(\frac2{\sigma_-}\Big)^{k+1}\epsilon. \]

Proof
Note that
\[ \sum_{x\in[K]}\big|(\nu f'_{1|k}\cdots f'_{k|k})_x - (\omega g'_{1|k}\cdots g'_{k|k})_x\big| \leq \sum_{x\in[K]}\big|\big((\nu-\omega)f'_{1|k}\cdots f'_{k|k}\big)_x\big| + \sum_{j=1}^k\sum_{x\in[K]}\big|\big(\omega g'_{1|k}\cdots g'_{j-1|k}(g'_{j|k} - f'_{j|k})f'_{j+1|k}\cdots f'_{k|k}\big)_x\big|. \]
Then, we use that $f'_{i|k}/(1-\eta)$ and $g'_{i|k}/(1+\eta)$ are transition matrices (and thus are 1-Lipschitz linear operators of $L^1([K])$):
\[ \|\nu f'_{1|k}\cdots f'_{k|k} - \omega g'_{1|k}\cdots g'_{k|k}\|_1 \leq \|\omega - \nu\|_1(1-\eta)^k + \sum_{j=1}^k\|\omega\|_1(1+\eta)^{j-1}\sup_{1\leq i\leq k}\sup_{x\in[K]}\|f'_{i|k}(x,\cdot) - g'_{i|k}(x,\cdot)\|_1\,(1-\eta)^{k-j}, \]
so that using Lemma 28:
\[ \|\nu f'_{1|k}\cdots f'_{k|k} - \omega g'_{1|k}\cdots g'_{k|k}\|_1 \leq \eta + 3\eta(1+\eta)\sum_{j=0}^{k-1}(1+\eta)^j \leq \eta + 3(1+\eta)\big((1+\eta)^k - 1\big) \leq \eta + 3(1+\eta)(e^{k\eta} - 1). \]
One can check that for all $x\in[0,\frac12]$, $3(1+x)(e^x - 1) \leq 6x$. Replacing $x$ by $k\eta$, one gets that for all $\eta \leq \frac1{2k}$,
\[ \|\nu f'_{1|k}\cdots f'_{k|k} - \omega g'_{1|k}\cdots g'_{k|k}\|_1 \leq \eta + 6k\eta \leq 7k\eta. \]
For the second part, note that
\[ \sum_x\big|(\nu f'_{1|k}\cdots f'_{k|k}\,p)_x - (\omega g'_{1|k}\cdots g'_{k|k}\,q)_x\big| \leq \sum_x\sum_{x'}\big|(\nu f'_{1|k}\cdots f'_{k|k})_{x'} - (\omega g'_{1|k}\cdots g'_{k|k})_{x'}\big|\,q_{x',x} + \sum_x\sum_{x'}(\nu f'_{1|k}\cdots f'_{k|k})_{x'}\,|p_{x',x} - q_{x',x}|. \]
Since the brackets are not empty, one has $\sum_x q_{x',x} \leq 1+K\epsilon$ for all $x'$ and $\sum_{x'}(\nu f'_{1|k}\cdots f'_{k|k})_{x'} \leq 1$ (because $\nu f'_{1|k}\cdots f'_{k|k}$ is the lower bound of a non-empty bracket of $\{p_{X_k|Y_0^k,\theta} \,|\, \theta\in\mathcal S_{K,M}\}$), so that
\[ \sum_x\big|(\nu f'_{1|k}\cdots f'_{k|k}\,p)_x - (\omega g'_{1|k}\cdots g'_{k|k}\,q)_x\big| \leq (1+K\epsilon)\,35k\Big(\frac2{\sigma_-}\Big)^{k+1}\epsilon + K\epsilon. \]
Finally, we use that since $\epsilon \leq \frac1{2K}$, one has $(1+K\epsilon)35 \leq 53$, and since $K\sigma_- \leq 1$, one has $K \leq \big(\frac2{\sigma_-}\big)^{k+1}$.

Lemma 30
Assume $\epsilon \leq \frac1{2K}\wedge\frac1{10k}\big(\frac{\sigma_-}2\big)^k$. Then
\[ d_{\mathcal G}(A,B) \leq 108\,k\Big(\frac2{\sigma_-}\Big)^k\epsilon. \]

Proof
By definition, d G ( A, B ) = E ∗ Y k − X x ∈ [ K ] Z | A x ( Y k ) − B x ( Y k ) | λ ( dY k ) . Taking some fixed Y k − , one has X x Z | A x ( y k ) − B x ( y k ) | λ ( dy k )= X x Z | u ( y k | x )( νf ′ | k − . . . f ′ k − | k − p ) x − v ( y k | x )( ωg ′ | k − . . . g ′ k − | k − q ) x | λ ( dy k ) X x Z | u ( y k | x ) − v ( y k | x ) | ( νf ′ | k − . . . f ′ k − | k − p ) x λ ( dy k )+ X x Z v ( y k | x ) | ( νf ′ | k − . . . f ′ k − | k − p ) x − ( ωg ′ | k − . . . g ′ k − | k − q ) x | λ ( dy k ) . Since we assumed the brackets to be non empty, one has R v ( y | x ) λ ( dy ) k v − u k ∞ =1 + ǫe − D and P x ( νf ′ | k − . . . f ′ k − | k − p ) x { p X k | Y k − ,θ | θ ∈ S K,M } ). Therefore, one gets with Lemma 29 that d G ( A, B ) ǫe − D X x ( νf ′ | k − . . . f ′ k − | k − p ) x + (1 + ǫe − D ) X x | ( νf ′ | k − . . . f ′ k − | k − p ) x − ( ωg ′ | k − . . . g ′ k − | k − q ) x | ǫe − D + (1 + ǫe − D )53( k − (cid:18) σ − (cid:19) k ǫ. . Leh´ericy Finally, notice that ǫe − D σ − ) k to conclude.Lemma 29 implies that sup x | ( νf ′ | k . . . f ′ k | k p ) x − ( ωg ′ | k . . . g ′ k | k q ) x | η ′ := 53( k − σ − ) k ǫ .Therefore, since the bracket [ A, B ] is not empty, one gets by using the assumption on u and v that ( σ − − η ′ ) e − D (1 − Kǫ ) X x ∈ [ K ] A x X x ∈ [ K ] B x (1 + η ′ )( e D + Kǫe − D ) , from which we deduce that the desired inequality σ − e − D / P x ∈ [ K ] A x P x ∈ [ K ] B x e D holds as soon as η ′ σ − and ǫ K , i.e. ǫ σ − k − σ − ) k ) ∧ K , which is implied by ǫ k (cid:0) σ − (cid:1) k +1 since K σ − . This concludes the proof of Lemma 26. B.2.3 Control of the bracketing entropy of the simple sets and synthesis
Lemma 31
Let $\delta > 0$. Then
\[ N_{[\cdot]}\big(\{\pi_\theta\}_{\theta\in\mathcal S_{K,M}}, d_\infty, \delta\big) \leq \max\Big(\frac{3(K-1)}\delta, 1\Big)^{K-1}, \qquad N_{[\cdot]}\big(\{\mathbf Q_\theta\}_{\theta\in\mathcal S_{K,M}}, d_\infty, \delta\big) \leq \max\Big(\frac{3(K-1)}\delta, 1\Big)^{K(K-1)}. \]
Let $C'_{\mathrm{aux}} = C_{\mathrm{aux}}e^D \vee 3(K-1)$. By [Aentropy],
\[ N_{[\cdot]}\big(\{\gamma_\theta\}_{\theta\in\mathcal S_{K,M}}, d_\infty, \delta e^{-D}\big) \leq \max\Big(\frac{C'_{\mathrm{aux}}}\delta, 1\Big)^{m_M K}. \]
Then, Lemma 26 ensures that for all $\epsilon \leq \frac1{110k}\big(\frac{\sigma_-}2\big)^{k+1}$,
\[ \log\bar N(\mathcal G, d_{\mathcal G}, \epsilon) \leq (m_MK + K^2 - 1)\log\max\bigg( \Big(\frac2{\sigma_-}\Big)^k\frac{110\,ke^D C'_{\mathrm{aux}}}\epsilon,\; 1 \bigg), \]
so that using Equation (13) and letting $H(u) = H_{[\cdot]}(\{t^{(D)}_\theta \,|\, \theta\in\mathcal B_\sigma\}, d_k, u)$, one has for all $\epsilon \leq 2\sqrt{\big(2D+\log\frac1{\sigma_-}\big)\frac{e^{2D}}{\sigma_-}}\sqrt{\frac1{110k}\big(\frac{\sigma_-}2\big)^{k+1}}$,
\begin{align*}
H(\epsilon) &\leq (m_MK + K^2 - 1)\log\max\bigg( \frac{4\big(2D+\log\frac1{\sigma_-}\big)e^{2D}}{\sigma_-}\Big(\frac2{\sigma_-}\Big)^k\frac{110\,ke^D C'_{\mathrm{aux}}}{\epsilon^2},\; 1 \bigg) \\
&\leq 2(m_MK + K^2 - 1)\log\max\bigg( 21\sqrt{2D+\log\frac1{\sigma_-}}\,\Big(\frac2{\sigma_-}\Big)^{\frac{k+1}2}\frac{e^{3D/2}\sqrt{kC'_{\mathrm{aux}}}}\epsilon,\; 1 \bigg).
\end{align*}
Then, since $\frac2{\sigma_-} \geq 1$, one gets that for all $\epsilon > 0$,
\[ H(\epsilon) \leq 2(m_MK + K^2 - 1)\log\max\bigg( 21\sqrt{2D+\log\frac1{\sigma_-}}\,\Big(\frac2{\sigma_-}\Big)^{\frac{k+1}2}\frac{e^{3D/2}\sqrt{kC'_{\mathrm{aux}}}}\epsilon,\; \Big(\frac2{\sigma_-}\Big)^{\frac{k+1}2}ke^{3D/2}\sqrt{C'_{\mathrm{aux}}} \bigg). \]
The goal of this section is to find a function $\varphi$ and a constant $C$ for which equation (10) holds, and to choose the weights $x_{K,M}$ of Lemma 24.
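The envelope function used below (Lemma 32) states that for $H(x) = A \log \max(B/x, C)$, the function $\varphi(x) = x\sqrt{\pi A}\,(1 + \sqrt{\log \max(B/x, C)})$ dominates both $x\sqrt{H(x)}$ and the entropy integral $\int_0^x \sqrt{H(u)}\, du$. The following Python snippet is only a numerical sanity check of these two inequalities; the values of $A$, $B$, $C$ are arbitrary illustrative choices, not the ones arising in the paper.

```python
import math

def H(x, A, B, C):
    # Entropy-type bound H(x) = A * log(max(B/x, C))
    return A * math.log(max(B / x, C))

def phi(x, A, B, C):
    # Envelope of Lemma 32: phi(x) = x * sqrt(pi*A) * (1 + sqrt(log max(B/x, C)))
    return x * math.sqrt(math.pi * A) * (1.0 + math.sqrt(math.log(max(B / x, C))))

def entropy_integral(x, A, B, C, steps=100_000):
    # Midpoint rule for the (mildly singular) integral of sqrt(H) over (0, x]
    h = x / steps
    return sum(math.sqrt(H((i + 0.5) * h, A, B, C)) * h for i in range(steps))

# Arbitrary illustrative constants (C >= 1 so that H >= 0 everywhere)
A, B, C = 3.0, 50.0, 2.0
for x in (0.01, 0.1, 1.0, 10.0, 100.0):
    assert x * math.sqrt(H(x, A, B, C)) <= phi(x, A, B, C)
    assert entropy_integral(x, A, B, C) <= phi(x, A, B, C)
```

The check covers both regimes of the maximum ($B/x \geq C$ for small $x$, $B/x < C$ for large $x$); the comfortable margin reflects the factor $\sqrt{\pi}$ in the envelope.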
Lemma 32 Let $A, B, C \in \mathbb{R}_+^*$, let $H : x \in \mathbb{R}_+^* \mapsto A \log \max\big( \frac{B}{x}, C \big)$ and $\varphi : x \in \mathbb{R}_+^* \mapsto x \sqrt{\pi A}\, \big( 1 + \sqrt{\log \max\big( \frac{B}{x}, C \big)} \big)$. Then:
\[
x \sqrt{H(x)} \leq \varphi(x), \qquad \int_0^x \sqrt{H(u)}\, du \leq \varphi(x).
\]

Let
\[
\varphi(u) = u \sqrt{\pi (m_M K + K^2 - 1)} \left( 1 + \left\{ \log \max\left( \sqrt{2D + \log \tfrac{1}{\sigma_-}} \left( \frac{2}{\sigma_-} \right)^{(k+1)/2} \frac{e^{D/2} \sqrt{k\, C_{\mathrm{aux}}'}}{u},\ 109 \left( \frac{2}{\sigma_-} \right)^{(k+1)/2} k\, e^{D/2} \sqrt{C_{\mathrm{aux}}'} \right) \right\}^{1/2} \right).
\]
The function $x \mapsto \varphi(x)/x$ is nonincreasing, so $x \mapsto \varphi(x)/x^2$ is decreasing, and one can define $\sigma_{K,M}$ as the unique solution of the equation
\[
\left( 1 + \sqrt{\big( 2D + \log \tfrac{1}{\sigma_-} \big) \log n} \right) \varphi(x) = \sqrt{n}\, x^2
\]
with unknown $x$, when a solution exists. By definition of $E$, one has
\[
\forall \sigma \geq \sigma_{K,M}, \quad E \leq \frac{\varphi(\sigma)}{\sqrt{n}} + \sqrt{\big( 2D + \log \tfrac{1}{\sigma_-} \big) \log n}\; \frac{\varphi(\sigma)}{\sqrt{n}} = \left( 1 + \sqrt{\big( 2D + \log \tfrac{1}{\sigma_-} \big) \log n} \right) \frac{\varphi(\sigma)}{\sqrt{n}}.
\]
We define $D' := \big( 2D + \log \frac{1}{\sigma_-} \big)(\log n)$ in order to lighten the notations. Using equation (11), one gets that for all $z > 0$ and all $x_{K,M} \geq \sigma_{K,M}$, with probability larger than $1 - e^{-z}$,
\[
W_{K,M} \leq C^* (n^* + k + 1) \left[ \big( 1 + \sqrt{D'} \big) \frac{\varphi(x_{K,M})}{x_{K,M} \sqrt{n}}\, x_{K,M} + \sqrt{\frac{z}{n}}\, x_{K,M} + D' \frac{z}{n} \right] \leq C^* (n^* + k + 1) \left[ \sigma_{K,M}\, x_{K,M} + \sqrt{\frac{z}{n}}\, x_{K,M} + D' \frac{z}{n} \right].
\]
Let $\epsilon > 0$, and let us take
\[
x_{K,M} = \frac{1}{\theta} \left( \sigma_{K,M} + \sqrt{\frac{z}{n}} \right),
\]
where $\theta > 0$ is such that $2\theta + D'\theta^2 \leq \frac{\epsilon}{C^*(n^* + k + 1)}$. Then $\frac{W_{K,M}}{x_{K,M}^2} \leq C^*(n^* + k + 1) \big[ \theta + \theta + D'\theta^2 \big] \leq \epsilon$ and
\[
W_{K,M} \leq C^*(n^* + k + 1) \left[ \theta\, x_{K,M}^2 + D' \frac{z}{n} \right] \leq C^*(n^* + k + 1) \left[ \frac{2}{\theta} \sigma_{K,M}^2 + \left( D' + \frac{2}{\theta} \right) \frac{z}{n} \right].
\]
Take $z = s + w_M + K$; then, since $\sum_M e^{-w_M} \leq e - 1$, one gets that with probability larger than $1 - e^{-s}$, for all $M$, $K$ and for all functions $\mathrm{pen}$ such that
\[
\mathrm{pen}_n(K, M) \geq C^*(n^* + k + 1) \left[ \frac{2}{\theta} \sigma_{K,M}^2 + \left( D' + \frac{2}{\theta} \right) \frac{w_M + K}{n} \right],
\]
one has
\[
W_{K,M} - \mathrm{pen}_n(K, M) \leq C^*(n^* + k + 1) \left( D' + \frac{2}{\theta} \right) \frac{s}{n}.
\]
A possible choice of $\theta$ is
\[
\theta = \frac{1}{D'} \left( \sqrt{1 + \frac{\epsilon D'}{C^*(n^* + k + 1)}} - 1 \right).
\]
Using that $\sqrt{1 + x} - 1 \geq \frac{x}{3 \max(1, \sqrt{x})}$ for all $x > 0$, one gets that there exists $\theta$ such that $2\theta + D'\theta^2 \leq \frac{\epsilon}{C^*(n^* + k + 1)}$ and
\[
\frac{1}{\theta} \leq 3\, C^*(n^* + k + 1) \max\left( \sqrt{\frac{D'}{\epsilon\, C^*(n^* + k + 1)}},\ \frac{1}{\epsilon} \right).
\]
Therefore,
\[
W_{K,M} - \mathrm{pen}_n(K, M) \leq C^*(n^* + k + 1) \left( D' + 6\, C^*(n^* + k + 1) \left( \frac{1}{\epsilon} \vee \sqrt{\frac{D'}{\epsilon\, C^*(n^* + k + 1)}} \right) \right) \frac{s}{n}
\]
as soon as
\[
\mathrm{pen}_n(K, M) \geq C^*(n^* + k + 1) \left[ 6\, C^*(n^* + k + 1) \left( \frac{1}{\epsilon} \vee \sqrt{\frac{D'}{\epsilon\, C^*(n^* + k + 1)}} \right) \sigma_{K,M}^2 + \left( D' + 6\, C^*(n^* + k + 1) \left( \frac{1}{\epsilon} \vee \sqrt{\frac{D'}{\epsilon\, C^*(n^* + k + 1)}} \right) \right) \frac{w_M + K}{n} \right].
\]
The last step of the proof is to find an upper bound on $\sigma_{K,M}$.

Lemma 33
Let $A$, $B$, $C$ and $E$ be functions $\mathbb{N} \longrightarrow [1, \infty)$, and $\varphi : x \mapsto x A \big( 1 + \sqrt{\log \max(B/x, C)} \big)$. Let $\sigma$ be the only solution of the equation $E\, \varphi(x) = \sqrt{n}\, x^2$ with unknown $x \in \mathbb{R}_+^*$. Let
\[
f(n) = \left[ \frac{A(n)\, C(n)\, E(n)}{B(n)} \left( 1 + \sqrt{\log B(n) + \log n} \right) \right]^2.
\]
Assume that there exists $n_0$ such that for all $n \geq n_0$, $f(n) \leq n$. Then
\[
\forall n \geq n_0, \quad \sigma \leq \frac{A(n)\, E(n)}{\sqrt{n}} \left( 1 + \sqrt{\log B(n) + \log n} \right).
\]

In our case,
\[
\begin{aligned}
A &= \sqrt{\pi (m_M K + K^2 - 1)}, \\
B &= \sqrt{2D + \log \tfrac{1}{\sigma_-}} \left( \frac{2}{\sigma_-} \right)^{(k+1)/2} e^{D/2} \sqrt{k\, C_{\mathrm{aux}}'}, \\
C &= 109 \left( \frac{2}{\sigma_-} \right)^{(k+1)/2} k\, e^{D/2} \sqrt{C_{\mathrm{aux}}'}, \\
E &= 1 + \sqrt{D'} \leq 2 \sqrt{D'}.
\end{aligned}
\]
Hence
\[
f(n) \lesssim \pi (m_M K + K^2 - 1) \frac{k}{2D + \log \frac{1}{\sigma_-}}\, e^{-D} (\log n) \left( 2 \log\left( 2D + \log \tfrac{1}{\sigma_-} \right) + (2k + 1) \log \frac{2}{\sigma_-} + 5D + \log k + \log C_{\mathrm{aux}}' + 2 \log n \right).
\]
By using that $1 \leq k \leq n$, that $\log\big( 2D + \log \frac{1}{\sigma_-} \big) \leq 2D + \log \frac{1}{\sigma_-}$, that $\log C_{\mathrm{aux}}' \leq \log C_{\mathrm{aux}} + D + \log n$, and that $K \sigma_- \leq 1$ and $n \geq k \geq 2$, one gets:
\[
f(n) \leq \tilde{f}_{K,M}(n) := 6900\, \pi (m_M K + K^2 - 1)\, k\, e^{-D} (\log n)^2 (k + \log C_{\mathrm{aux}}).
\]
Now, assume that there exists $n_0$ such that $\tilde{f}_{K,M}(n) \leq n$ for all $n \geq n_0$; then
\[
\forall n \geq n_0, \quad \sigma_{K,M}^2 \lesssim \pi (m_M K + K^2 - 1) \frac{D'}{n} \left( \log n + 2 \log\left( 2D + \log \tfrac{1}{\sigma_-} \right) + (2k + 1) \log \frac{2}{\sigma_-} + 6D + \log k + \log C_{\mathrm{aux}} \right),
\]
so that
\[
\forall n \geq n_0, \quad \sigma_{K,M}^2 \lesssim \pi (m_M K + K^2 - 1) \frac{D'}{n} \left( \log n + k \log \frac{2}{\sigma_-} + D + \log C_{\mathrm{aux}} \right).
\]
Therefore, there exists a numerical constant $C_{\mathrm{pen}}$ such that the condition on the penalty is implied by
\[
\mathrm{pen}_n(K, M) \geq \frac{C_{\mathrm{pen}}}{n} (n^* + k + 1) \left[ D' \left( \frac{1}{\epsilon} \vee \sqrt{\frac{D'}{\epsilon\, C^*(n^* + k + 1)}} \right) (m_M K + K^2 - 1) \left( \log n + k \log \frac{2}{\sigma_-} + D + \log C_{\mathrm{aux}} \right) + \left( D' + \frac{1}{\epsilon} \vee \sqrt{\frac{D'}{\epsilon\, C^*(n^* + k + 1)}} \right) w_M \right].
\]
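The quantity $\sigma_{K,M}$ is only defined implicitly, as the root of an equation of the form $E\, \varphi(x) = \sqrt{n}\, x^2$ in which $x \mapsto \varphi(x)/x^2$ is decreasing, so the root is unique whenever it exists and can be bracketed and found by bisection. A minimal numerical sketch of this fixed-point computation; all constants below ($A$, $B$, $C$, $E$, $n$) are hypothetical stand-ins for the quantities of the proof, not values taken from the paper.

```python
import math

def solve_fixed_point(phi, E, n, lo=1e-12, hi=1e12, iters=200):
    # Solve E * phi(x) = sqrt(n) * x**2 by bisection.
    # Since x -> phi(x) / x**2 is decreasing, g below changes sign exactly once:
    # g > 0 to the left of the root, g < 0 to the right.
    def g(x):
        return E * phi(x) - math.sqrt(n) * x * x
    assert g(lo) > 0 and g(hi) < 0, "root not bracketed on [lo, hi]"
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint: the bracket spans many orders of magnitude
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical constants standing in for A = sqrt(pi (m_M K + K^2 - 1)), B, C and E = 1 + sqrt(D'):
A, B, C = 10.0, 1e4, 5.0
phi = lambda x: x * A * (1.0 + math.sqrt(math.log(max(B / x, C))))
n, E = 10_000, 3.0
sigma_KM = solve_fixed_point(phi, E, n)
```

The returned value satisfies the fixed-point equation up to numerical precision; in the proof, this root is then bounded explicitly via Lemma 33 rather than computed.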