Asymptotic properties of the maximum likelihood estimation in misspecified hidden Markov models
Institute of Mathematical Statistics, 2012
ASYMPTOTIC PROPERTIES OF THE MAXIMUM LIKELIHOOD ESTIMATION IN MISSPECIFIED HIDDEN MARKOV MODELS

By Randal Douc and Eric Moulines

Télécom SudParis and Télécom ParisTech
Let (Y_k)_{k∈Z} be a stationary sequence on a probability space (Ω, 𝒜, P) taking values in a standard Borel space Y. Consider the associated maximum likelihood estimator with respect to a parametrized family of hidden Markov models such that the law of the observations (Y_k)_{k∈Z} is not assumed to be described by any of the hidden Markov models of this family. In this paper we investigate the consistency of this estimator in such misspecified models under mild assumptions.
1. Introduction.
An assumption underlying most of the classical theory of maximum likelihood is that the "true" distribution of the observations is known to lie within a specified parametric family of distributions. In many settings, it is doubtful that this assumption is satisfied. It is therefore natural to investigate the convergence of the maximum likelihood estimator (MLE) and to identify the possible limit for misspecified models. Such questions have been mainly investigated for models in which the observations are independent; see [15, 29]. Much less is known on the behavior of the MLE for dependent observations; see [10] and the references therein. For independent observations, under mild additional technical conditions, the MLE converges to the parameter which minimizes the relative entropy rate; see [15]. The purpose of this paper is to show that such a result remains true when the observations are from an ergodic process and for classes of parametric distributions associated to hidden Markov models (HMM). An HMM is a bivariate stochastic process (X_k, Y_k)_{k≥0}, where (X_k)_{k≥0} is a Markov chain (often referred to as the state sequence) in a state space X and, conditionally on (X_k)_{k≥0}, (Y_k)_{k≥0} is a sequence of independent random

Received October 2011; revised July 2012. Supported by the Agence Nationale de la Recherche through the 2009–2012 project Big MC.
AMS 2000 subject classifications.
Primary 62M09; secondary 62F12.
Key words and phrases.
Strong consistency, hidden Markov models, maximum likelihood estimator, misspecified models, state space models.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2012, Vol. 40, No. 5, 2697–2732. This reprint differs from the original in pagination and typographic detail.
variables in a state space Y such that the conditional distribution of Y_k given the state sequence depends on X_k only. The key feature of HMMs is that the state sequence (X_k)_{k≥0} is not observable, so that statistical inference has to be carried out by means of the observations (Y_k)_{k≥0} only. Such problems are far from straightforward due to the fact that the observation process (Y_k)_{k≥0} is generally a dependent, non-Markovian time series [even though the bivariate process (X_k, Y_k)_{k≥0} is itself a Markov chain].

HMMs have been intensively used in many scientific disciplines including econometrics [16, 23], biology [5], engineering [18] and neurophysiology [11], and their statistical inference is therefore of significant practical importance [4]. In all these applications, misspecified models are the rule, so it is worthwhile to understand the behavior of the MLE under such a regime.

This work extends previous results in this direction obtained by Mevel and Finesso [24], which are restricted to discrete state-space Markov chains. Our main result on the consistency of the MLE in misspecified HMMs is derived under assumptions which are quite weak, covering general state-space HMMs under conditions which are much weaker than those of [9], where a strong mixing condition was imposed on the transition kernels of the hidden chain. Therefore our results can be applied to many models of practical interest, including the Gaussian linear state space model, the discrete state-space HMM and more general nonlinear state-space models.

The paper is organized as follows. In Section 2, we first introduce the setting and notation that are used throughout the paper. In Section 3, we state our main assumptions and results.
In Section 4, our main result is used to establish consistency in three general classes of models: linear Gaussian state space models, finite state models and nonlinear state space models of the vector ARCH type (this includes the stochastic volatility model and many other models of interest in time series analysis and financial econometrics). Section 5 is devoted to the proof of our main result.

Notation.
Some notation pertaining to transition kernels is required. Let L be a (possibly unnormalized) transition kernel on (X, 𝒳), that is, for any x ∈ X, L(x, ·) is a finite measure on (X, 𝒳) and, for any A ∈ 𝒳, x ↦ L(x, A) is a measurable function from (X, 𝒳) to ([0, ∞), ℬ([0, ∞))). The kernel L acts on bounded functions f on X and on σ-finite positive measures μ on (X, 𝒳) via

Lf(x) = δ_x L f := ∫ L(x, dy) f(y),    μL(A) = μL 1_A := ∫ μ(dx) L(x, A).

If L and L′ are two transition kernels on (X, 𝒳), then LL′ is the transition kernel on (X, 𝒳) given, for any x ∈ X and A ∈ 𝒳, by

LL′(x, A) = ∫ L(x, dy) L′(y, A).
2. Problem statement.
We consider a parameterized family of HMMs with parameter space Θ, assumed to be a compact metric space. For each parameter θ ∈ Θ, the distribution of the HMM is specified by the transition kernel Q^θ of the Markov chain (X_k)_{k≥0}, and by the conditional distribution g^θ of the observation Y_k given the hidden state X_k, referred to as the likelihood of the observation.

For any m ≤ n and any sequence {a_k}_{k∈Z}, denote a_m^n := (a_m, ..., a_n), and for any probability measure χ on (X, 𝒳), define the likelihood of the observations by

p_χ^θ(y_m^n) := ∫···∫ χ(dx_m) g^θ(x_m, y_m) ∏_{p=m+1}^n Q^θ(x_{p−1}, dx_p) g^θ(x_p, y_p),

p_χ^θ(y_p^n | y_m^{p−1}) := p_χ^θ(y_m^n) / p_χ^θ(y_m^{p−1}),    m < p ≤ n,

with the standard convention ∏_{p=m}^n a_p = 1 if m > n.

Let (Ω, ℱ, P) be a probability space, and let (Y_k)_{k∈Z} be a stationary ergodic stochastic process taking values in (Y, 𝒴). We denote by P^Y the image probability of P by (Y_k)_{k∈Z} on the product space (Y^Z, 𝒴^{⊗Z}), and by E^Y the associated expectation. We stress that the distribution P^Y may or may not belong to the parametric family of distributions specified by the transition kernels {(Q^θ, g^θ), θ ∈ Θ}. If P^Y does not belong to this family, the model is said to be misspecified.

If χ is a probability measure on (X, 𝒳), we define the maximum likelihood estimator (MLE) associated to the initial distribution χ by

θ̂_{χ,n} := argmax_{θ∈Θ} ln p_χ^θ(Y_0^{n−1}).    (1)

The study of asymptotic properties of the MLE in HMMs was initiated in the seminal work of Baum and Petrie [2, 26] in the 1960s. In these papers, the model is assumed to be well specified, and the state space X and the observation space Y were both presumed to be finite sets. More than two decades later, Leroux [22] proved consistency for well-specified models in the case where X is a finite set and Y is a general state space.
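To make (1) concrete, here is a minimal numerical sketch (our own illustration, not taken from the paper): for a finite state space, the likelihood p_χ^θ(y_0^{n−1}) can be evaluated by the standard forward recursion, and the maximization in (1) can be approximated over a grid of the compact parameter set Θ. The two-state chain, the Bernoulli emission densities and the parametrization by θ are all illustrative assumptions.

```python
import math

def log_likelihood(chi, Q, g, ys):
    """Forward recursion: log p_chi^theta(y_0^{n-1}) for a finite-state HMM.
    chi: initial distribution, Q: transition matrix, g[x](y): emission density."""
    alpha = [chi[x] * g[x](ys[0]) for x in range(len(chi))]
    ll = 0.0
    for y in ys[1:]:
        c = sum(alpha)                       # normalize to avoid underflow
        ll += math.log(c)
        alpha = [a / c for a in alpha]
        alpha = [sum(alpha[x] * Q[x][j] for x in range(len(alpha))) * g[j](y)
                 for j in range(len(alpha))]
    return ll + math.log(sum(alpha))

def make_g(theta):
    # illustrative emissions: state 0 emits 1 with prob theta, state 1 with 1-theta
    return [lambda y, p=theta: p if y == 1 else 1 - p,
            lambda y, p=1 - theta: p if y == 1 else 1 - p]

Q = [[0.9, 0.1], [0.1, 0.9]]
chi = [0.5, 0.5]
ys = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

# Grid-search approximation of the argmax in (1) over Theta = [0.1, 0.9].
grid = [0.1 + 0.05 * i for i in range(17)]
theta_hat = max(grid, key=lambda t: log_likelihood(chi, Q, make_g(t), ys))
```

Note that this toy model is invariant under swapping the two states together with θ ↔ 1 − θ, so the grid maximizer is only identified up to that symmetry.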
The consistency of the MLE in more general HMMs has subsequently been investigated for well-specified models in a series of contributions [7, 9, 14, 20, 21] using different methods. A general consistency result for HMMs has been developed in [8]. Though the consistency results above differ in the details of their proofs, all proofs have a common thread which serves also as the starting point for this paper. Denote by p_χ^θ(Y_0^n) the likelihood of the observations Y_0^n for the HMM with parameter θ ∈ Θ and initial distribution χ. The first step of the proof aims to establish that for any θ ∈ Θ, there is a constant ℓ(θ) such that

lim_{n→∞} n^{−1} log p_χ^θ(Y_0^{n−1}) = lim_{n→∞} n^{−1} E[log p_χ^θ(Y_0^{n−1})] = ℓ(θ),    P-a.s.
Up to an additive constant, θ ↦ ℓ(θ) is the negated relative entropy rate between the distribution of the observations and p_χ^θ(·). When the model is well specified and θ = θ⋆ is the true value of the parameter, this convergence follows from the generalized Shannon–Breiman–McMillan theorem [1]; for misspecified models, or for well-specified models with θ ≠ θ⋆, the existence of the limit is far from obvious.

The second step of the proof aims to show that the maximizer of the normalized log-likelihood θ ↦ n^{−1} log p_χ^θ(Y_0^{n−1}) converges P-a.s. to the maximizer of θ ↦ ℓ(θ), that is, to the minimizer of the relative entropy rate. Together, these two steps show that the MLE is a natural estimator for the parameter which minimizes the relative entropy rate in the parametric family {(Q^θ, g^θ), θ ∈ Θ}.

Let us note that one can write the normalized log-likelihood as

n^{−1} log p_χ^θ(Y_0^{n−1}) = (1/n) ∑_{k=0}^{n−1} log p_χ^θ(Y_k | Y_0^{k−1}),

where p_χ^θ(Y_k | Y_0^{k−1}) denotes the conditional density of Y_k given Y_0^{k−1} under the misspecified model with parameter θ (i.e., the one-step predictive density). If the limit p_χ^θ(Y_k | Y_0^{k−1}) → π_Y^θ(Y_{−∞}^k) as k → ∞ can be shown to exist P-a.s. and to be P-integrable, the convergence of the log-likelihood to the relative entropy rate follows from the Birkhoff ergodic theorem, since the process {Y_k}_{k∈Z} is assumed to be ergodic. This result provides an explicit representation of the relative entropy rate ℓ(θ) as the expectation of the limit, ℓ(θ) = E[log π_Y^θ(Y_{−∞}^0)].
The limit π_Y^θ(Y_{−∞}^k) may be interpreted as the conditional likelihood of Y_k given the whole past Y_{−∞}^{k−1}, but we must refrain from considering this quantity as a genuine conditional density.

Such an approach was used in [2] for finite state spaces, and was later extended by Douc, Moulines and Rydén [9] to general state spaces, but under stringent technical conditions (uniform mixing of the Markov kernel, which more or less restricts the validity of the results to compact state spaces, leaving aside important models, such as linear Gaussian state space models). Alternatively, the predictive distribution p_χ^θ(Y_k | Y_0^{k−1}) can be expressed as a component of the state of a measure-valued Markov chain; in this approach, the existence of the limiting relative entropy rate ℓ(θ) follows from the ergodic theorem for Markov chains, provided that this Markov chain can be shown to be ergodic. This approach was used in [7, 20, 21] and was later extended to misspecified models in [24]. This technique is adequate for finite state-space Markov chains, but does not extend easily to general state-space Markov chains; see [7].

In [22], the existence of the relative entropy rate is established by means of Kingman's subadditive ergodic theorem (the same approach is used indirectly in [26], which invokes the Furstenberg–Kesten theory of random matrix products). After some additional work, an explicit representation of the relative entropy rate is again obtained.
However, as is noted in [22], page 136, this last step is surprisingly difficult, as Kingman's ergodic theorem does not directly yield a representation of the limit as an expectation.

For completeness, we note that a recent attempt [12] to prove consistency of the MLE for general HMMs contains very serious problems in the proof [17] (not addressed in [13]), and therefore fails to establish the claimed results.

In this paper, we prove consistency of the MLE for general HMMs in misspecified models under quite general assumptions. Our proof broadly follows the original approach of Baum and Petrie [2] and Douc, Moulines and Rydén [9], but relaxes the very restrictive technical conditions used in these works and extends the analysis to misspecified models. The key technique used to obtain this result is to establish the exponential forgetting of the filtering distribution; this is achieved by means of a coupling technique originally introduced in [19] and refined in [6].
3. Assumptions and main results.
For any integer t ≥ 1, θ ∈ Θ and any sequence y_0^{t−1} ∈ Y^t, consider the unnormalized kernel L^θ⟨y_0^{t−1}⟩ on (X, 𝒳) defined, for all x_0 ∈ X and A ∈ 𝒳, by

L^θ⟨y_0^{t−1}⟩(x_0, A) = ∫···∫ [∏_{i=0}^{t−1} g^θ(x_i, y_i) Q^θ(x_i, dx_{i+1})] 1_A(x_t).    (2)

Note that, for any t ≥ 1, θ ∈ Θ, x ∈ X and y_0^{t−1} ∈ Y^t,

L^θ⟨y_0^{t−1}⟩(x, X) = p_x^θ(y_0^{t−1}),    (3)

where for x ∈ X and s ≤ t, p_x^θ(y_s^t), the likelihood of the observations y_s^t starting from state x, is a shorthand notation for p_{δ_x}^θ(y_s^t).

Definition 1.
Let r be an integer. A set C ∈ 𝒳 is an r-local Doeblin set with respect to the family {Q^θ, g^θ}_{θ∈Θ} if there exist positive functions ǫ_C^−: Y^r → R_+ and ǫ_C^+: Y^r → R_+, a family of probability measures {λ_C^θ⟨z⟩}_{θ∈Θ, z∈Y^r} and a family of positive functions {φ_C^θ⟨z⟩}_{θ∈Θ, z∈Y^r} such that, for any θ ∈ Θ and z ∈ Y^r, λ_C^θ⟨z⟩(C) = 1 and, for any A ∈ 𝒳 and x ∈ C,

ǫ_C^−(z) φ_C^θ⟨z⟩(x) λ_C^θ⟨z⟩(A) ≤ L^θ⟨z⟩(x, A ∩ C) ≤ ǫ_C^+(z) φ_C^θ⟨z⟩(x) λ_C^θ⟨z⟩(A).    (4)

This implies that, for any measurable nonnegative function f on (X, 𝒳), x ∈ C and z ∈ Y^r,

ǫ_C^−(z) φ_C^θ⟨z⟩(x) λ_C^θ⟨z⟩(1_C f) ≤ δ_x L^θ⟨z⟩(1_C f) ≤ ǫ_C^+(z) φ_C^θ⟨z⟩(x) λ_C^θ⟨z⟩(1_C f).

We require that the condition be satisfied for any θ ∈ Θ, but this is not a serious restriction since Θ is assumed to be compact.
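As a sanity check of Definition 1 in the simplest setting (our own illustration: a finite state space, C = X, r = 1, with the g-factor absorbed into φ), the two-sided bound reduces to ǫ_C^− λ_C(A) ≤ Q(x, A ∩ C) ≤ ǫ_C^+ λ_C(A) with λ_C uniform. The sketch below verifies this by enumerating every subset A; all numerical values are illustrative assumptions.

```python
from itertools import combinations

# Illustrative 3-state transition matrix (our own choice of numbers).
Q = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
d = len(Q)

# With C = X and lambda_C uniform (lambda_C(A) = |A| / d), the tightest
# constants in the two-sided bound are d * min Q and d * max Q.
qmin = min(min(row) for row in Q)
qmax = max(max(row) for row in Q)
eps_minus, eps_plus = d * qmin, d * qmax

def lam(A):
    # uniform probability measure on C = {0, ..., d-1}
    return len(A) / d

# Verify eps^- * lam(A) <= Q(x, A) <= eps^+ * lam(A) for every x and A.
for x in range(d):
    for k in range(1, d + 1):
        for A in combinations(range(d), k):
            QxA = sum(Q[x][xp] for xp in A)
            assert eps_minus * lam(A) <= QxA + 1e-12
            assert QxA <= eps_plus * lam(A) + 1e-12
```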
Remark 1.
To illustrate this condition, consider the case r = 1. Assume that for some set C there exist positive constants ǫ_C^−, ǫ_C^+ and a family of probability measures {λ_C^θ}_{θ∈Θ} such that, for any θ ∈ Θ, λ_C^θ(C) = 1 and, for any A ∈ 𝒳 and x ∈ C,

ǫ_C^− λ_C^θ(A) ≤ Q^θ(x, A ∩ C) ≤ ǫ_C^+ λ_C^θ(A).

Then, clearly, L^θ⟨y⟩(x, A) = g^θ(x, y) Q^θ(x, A) satisfies (4), where ǫ_C^− and ǫ_C^+ are positive constants. In this case, C is a 1-local Doeblin set with respect to Q^θ; see [6] and [19].

Remark 2.
Local Doeblin sets share some similarities with 1-small sets in the theory of Markov chains on general state spaces; see [25], Chapter 5. Recall that a set C is 1-small for the kernel Q^θ, θ ∈ Θ, if there exist a probability measure λ̃_C^θ and a constant ǫ̃_C > 0 such that λ̃_C^θ(C) = 1 and, for all x ∈ C and A ∈ 𝒳, Q^θ(x, A ∩ C) ≥ ǫ̃_C λ̃_C^θ(A ∩ C). In particular, a local Doeblin set is 1-small with ǫ̃_C = ǫ_C^− and λ̃_C^θ = λ_C^θ. The main difference stems from the fact that we impose both a lower and an upper bound, and we impose that the minorizing and the majorizing measures are the same.

(A1) There exist an integer r ≥ 1 and a set K ∈ 𝒴^{⊗r} such that:

(i) P[Y_0^{r−1} ∈ K] > 2/3.

(ii) For all η > 0, there exists an r-local Doeblin set C ∈ 𝒳 such that, for all θ ∈ Θ and all y_0^{r−1} ∈ K,

sup_{x∈C^c} p_x^θ(y_0^{r−1}) ≤ η sup_{x∈X} p_x^θ(y_0^{r−1}) < ∞    (5)

and

inf_{y_0^{r−1}∈K} ǫ_C^−(y_0^{r−1}) / ǫ_C^+(y_0^{r−1}) > 0,    (6)

where the functions ǫ_C^+ and ǫ_C^− are defined in Definition 1.

(iii) There exists a set D such that

E[ln^− inf_{θ∈Θ} inf_{x∈D} L^θ⟨Y_0^{r−1}⟩(x, D)] < ∞.    (7)

(A2) (i) For any θ ∈ Θ, the function g^θ: (x, y) ∈ X × Y ↦ g^θ(x, y) is positive.

(ii) E[ln^+ sup_{θ∈Θ} sup_{x∈X} g^θ(x, Y_0)] < ∞.

(A3) There exists p ∈ N such that, for any x ∈ X and n ≥ p, P-a.s. the function θ ↦ p_x^θ(Y_0^n) is continuous on Θ.

Remark 3.
Assumption (A2) requires that the conditional likelihood g^θ be positive. The case where g^θ can vanish typically requires different conditions; see [3, 27]. The second condition can be read as a generalized moment condition on Y_0. It is satisfied in many examples of interest.

Remark 4.
To check (A1)(iii), one may, for example, check that:

(i) inf_{x∈D} inf_{θ∈Θ} Q^θ(x, D) > 0;
(ii) E[ln^− inf_{θ∈Θ} inf_{x∈D} g^θ(x, Y_0)] < ∞.

Condition (ii) is satisfied if (x, θ) ↦ g^θ(x, y) is continuous and D is compact. Condition (i) holds if D is a small set for all θ ∈ Θ, that is, if there exist a probability measure ν^θ such that ν^θ(D) = 1 and a constant δ > 0 such that, for all x ∈ D and A ∈ 𝒳, Q^θ(x, A) ≥ δν^θ(A). Note, however, that (A1)(iii) is far weaker than imposing that the set D is 1-small. This is important to deal with examples for which the transition kernel Q^θ(x, ·) does not admit a density with respect to some fixed dominating measure; see, for example, Section 4.1.

Remark 5.
Assumption (A3) is in general a consequence of the continuity of θ ↦ Q^θ(x, ·) and of θ ↦ g^θ(x, ·), using classical techniques for integrals depending on a parameter.

Remark 6.
According to (3), bound (5) may also be rewritten in terms of the kernel L^θ⟨y_0^{r−1}⟩ as

sup_{x∈C^c} L^θ⟨y_0^{r−1}⟩(x, X) ≤ η sup_{x∈X} L^θ⟨y_0^{r−1}⟩(x, X) < ∞.

The convergence of the relative entropy is obtained for initial distributions belonging to a particular class of probability measures. For the integer r and the set D ∈ 𝒳 defined in (A1), let M(D, r) be the subset of P(X, 𝒳), the set of probability measures on (X, 𝒳), defined by

M(D, r) = {χ ∈ P(X, 𝒳): E[ln^− inf_{θ∈Θ} χL^θ⟨Y_0^{u−1}⟩1_D] < ∞ for all u ∈ {1, ..., r}}.    (8)

Proposition 1.
Assume (A1) and (A2). Then:

(i) for any θ ∈ Θ, there exists a measurable function π_Y^θ: Y^{Z_−} → R such that, for any probability measure χ ∈ M(D, r),

P[lim_{m→∞} p_χ^θ(Y_0 | Y_{−m}^{−1}) = π_Y^θ(Y_{−∞}^0)] = 1;    moreover, E[|ln π_Y^θ(Y_{−∞}^0)|] < ∞;    (9)

(ii) for any θ ∈ Θ and any probability measure χ ∈ M(D, r),

lim_{n→∞} n^{−1} ln p_χ^θ(Y_0^{n−1}) = ℓ(θ),    P-a.s., where ℓ(θ) := E[ln π_Y^θ(Y_{−∞}^0)].
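Proposition 1(ii) can be illustrated numerically (our own sketch; all model and parameter choices are illustrative assumptions): for a candidate two-state HMM evaluated on data whose law need not belong to the model family, here i.i.d. Bernoulli draws, the normalized log-likelihood n^{−1} ln p_χ^θ(Y_0^{n−1}) stabilizes as n grows.

```python
import math
import random

def norm_loglik(chi, Q, g, ys):
    # (1/n) * log p_chi^theta(y_0^{n-1}) via the normalized forward recursion
    alpha = [chi[x] * g[x](ys[0]) for x in range(len(chi))]
    ll = 0.0
    for y in ys[1:]:
        c = sum(alpha)
        ll += math.log(c)
        alpha = [a / c for a in alpha]
        alpha = [sum(alpha[x] * Q[x][j] for x in range(len(alpha))) * g[j](y)
                 for j in range(len(alpha))]
    return (ll + math.log(sum(alpha))) / len(ys)

random.seed(0)
ys = [1 if random.random() < 0.6 else 0 for _ in range(20000)]  # data: i.i.d. Bernoulli

# Candidate HMM with persistent hidden chain: the data-generating process
# need not be (and here is not) matched by this parametrization.
Q = [[0.95, 0.05], [0.05, 0.95]]
g = [lambda y: 0.7 if y == 1 else 0.3,
     lambda y: 0.3 if y == 1 else 0.7]
chi = [0.5, 0.5]

ell_half = norm_loglik(chi, Q, g, ys[:10000])
ell_full = norm_loglik(chi, Q, g, ys)
# ell_half and ell_full are already close, illustrating the a.s. limit ell(theta)
```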
Theorem 2.
Assume (A1)–(A3). Then θ ↦ ℓ(θ) is upper semi-continuous and, defining Θ⋆ ⊂ Θ by Θ⋆ := argmax_{θ∈Θ} ℓ(θ), we have, for any probability measure χ ∈ M(D, r),

lim_{n→∞} d(θ̂_{χ,n}, Θ⋆) = 0,    P-a.s.
When the model is well specified, the law of the observations belongs to the parametric family of distributions over which the maximization occurs, and is therefore associated to a specific parameter θ⋆. In this particular case, under appropriate assumptions, the set Θ⋆ reduces to the singleton {θ⋆}, and the consistency result for the MLE in well-specified models can then be written as (see [8])

lim_{n→∞} d(θ̂_{χ,n}, θ⋆) = 0,    P-a.s.

A simple sufficient condition can be proposed to ensure that χ ∈ M(D, r).

Proposition 3.
Assume there exists a sequence of sets D_u ∈ 𝒳, u ∈ {0, ..., r−1}, such that (setting D_r = D for notational convenience), for some δ > 0,

inf_{x_{u−1}∈D_{u−1}} inf_{θ∈Θ} Q^θ(x_{u−1}, D_u) ≥ δ,    u ∈ {1, ..., r},    (10)

and

E[ln^− inf_{θ∈Θ} inf_{x∈D_u} g^θ(x, Y_0)] < ∞    for u ∈ {0, ..., r−1}.    (11)

Then, any initial distribution χ on (X, 𝒳) satisfying χ(D_0) > 0 belongs to M(D, r).
To check (11), we typically assume that, for any given y ∈ Y, the function (x, θ) ↦ g^θ(x, y) is continuous and that D_i × Θ is a compact set for i ∈ {0, ..., r−1}. This condition then translates into an assumption on some generalized moments of the process Y.

To check (10), the following lemma is useful.

Lemma 4.
Assume that X = R^d for some integer d > 0 and that 𝒳 is the associated Borel σ-field. Assume in addition that, for any open subset O ∈ 𝒳, the function (x, θ) ↦ Q^θ(x, O) is lower semi-continuous on the product space X × Θ. Then, for any δ > 0 and any compact subset D ⊂ X, there exists a sequence of compact subsets D_u, u ∈ {0, ..., r−1}, satisfying (10).
4. Applications.
In this section, we develop three classes of examples. In Section 4.1, we consider linear Gaussian state space models. This is obviously a very important model, which is routinely used to analyze time series. We analyze this model under assumptions which are very general
and might serve to illustrate the stated assumptions. In Section 4.2, we consider the classic case where the state space of the underlying Markov chain is a finite set. Finally, in Section 4.3, we develop a general class of nonlinear state space models. In all these examples, we will find that the assumptions of Theorem 2 are satisfied under general conditions.

4.1. Gaussian linear state space models.
Gaussian linear state space models form an important class of HMMs. In this setting, let X = R^{d_x} and Y = R^{d_y} for some integers d_x and d_y, and let Θ be a compact parameter space. The model is specified by

X_{k+1} = A_θ X_k + R_θ U_k,    (12)

Y_k = B_θ X_k + S_θ V_k,    (13)

where {(U_k, V_k)}_{k≥0} is an i.i.d. sequence of Gaussian vectors with zero mean and identity covariance matrix, independent of X_0. Here U_k is d_u-dimensional, V_k is d_y-dimensional and the matrices A_θ, R_θ, B_θ, S_θ have the appropriate dimensions.

For any integer n, define the observability matrix O_{θ,n} and the controllability matrix C_{θ,n} by

O_{θ,n} := [B_θ; B_θA_θ; B_θA_θ²; ...; B_θA_θ^{n−1}] (blocks stacked vertically),    C_{θ,n} := [A_θ^{n−1}R_θ, A_θ^{n−2}R_θ, ..., R_θ].    (14)

It is assumed in the sequel that, for any θ ∈ Θ, the following hold:

(L1) The pair [A_θ, B_θ] is observable and the pair [A_θ, R_θ] is controllable; that is, there exists an integer r such that the observability matrix O_{θ,r} and the controllability matrix C_{θ,r} are full rank.

(L2) The measurement noise covariance matrix S_θ is full rank.

(L3) The functions θ ↦ A_θ, θ ↦ R_θ, θ ↦ B_θ and θ ↦ S_θ are continuous on Θ.

(L4) E[‖Y_0‖²] < ∞.

We now check the assumptions of Theorem 2. The dimension d_u of the state noise vector U_k is in many situations smaller than the dimension d_x of the state vector X_k, and hence R_θ ᵗR_θ (where ᵗA is the transpose of the matrix A) may be rank deficient.

Some additional notation is needed. For any positive definite matrix A and any vector z of appropriate dimension, denote ‖z‖²_A = ᵗz A^{−1} z. Define, for any integer n,

F_{θ,n} = D_{θ,n} ᵗD_{θ,n} + S_{θ,n} ᵗS_{θ,n},    (15)

where D_{θ,n} is the lower block-triangular matrix with zero blocks on and above the diagonal and with (i, j)-block B_θ A_θ^{i−j−1} R_θ for i > j, and S_{θ,n} is the block-diagonal matrix with diagonal blocks equal to S_θ. Under (L2), for any n ≥ r, the matrix F_{θ,n} is positive definite.
The likelihood of the observations y_0^{n−1} ∈ Y^n starting from x is given by

p_x^θ(y_0^{n−1}) = (2π)^{−nd_y/2} det^{−1/2}(F_{θ,n}) exp(−(1/2)‖y_0^{n−1} − O_{θ,n}x‖²_{F_{θ,n}}),    (16)

where y_0^{n−1} = ᵗ[ᵗy_0, ᵗy_1, ..., ᵗy_{n−1}] and O_{θ,n} is defined in (14).

Consider first (A1). Under (L1), the observability matrix O_{θ,r} is full rank; hence, for any compact subset K ⊂ Y^r,

lim_{‖x‖→∞} inf_{y_0^{r−1}∈K} ‖y_0^{r−1} − O_{θ,r}x‖²_{F_{θ,r}} = ∞,

showing that, for all η > 0, we may choose a compact set C in such a way that (5) is satisfied. It remains to prove that any compact set C is an r-local Doeblin set satisfying condition (6). For any y_0^{r−1} ∈ Y^r and x_0 ∈ X, the measure L^θ⟨y_0^{r−1}⟩(x_0, ·) is absolutely continuous with respect to the Lebesgue measure on X, with Radon–Nikodym derivative ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r) given (up to an irrelevant multiplicative factor) by

ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r) ∝ det^{−1/2}(G_{θ,r}) exp(−(1/2)‖[y_0^{r−1}; x_r] − [O_{θ,r}; A_θ^r]x_0‖²_{G_{θ,r}}),    (17)

where the covariance matrix G_{θ,r} is given by

G_{θ,r} = [D_{θ,r}; C_{θ,r}][ᵗD_{θ,r} ᵗC_{θ,r}] + [S_{θ,r}; 0][ᵗS_{θ,r} 0].

The proof of (17) relies on the positivity of G_{θ,r}, which requires further discussion. By construction, the matrix G_{θ,r} is nonnegative. For any y_0^{r−1} ∈ Y^r and x ∈ X, the equation

[ᵗy_0^{r−1} ᵗx] G_{θ,r} [y_0^{r−1}; x] = ‖ᵗD_{θ,r} y_0^{r−1} + ᵗC_{θ,r} x‖² + ‖ᵗS_{θ,r} y_0^{r−1}‖² = 0

implies that ‖ᵗD_{θ,r} y_0^{r−1} + ᵗC_{θ,r} x‖² = 0 and ‖ᵗS_{θ,r} y_0^{r−1}‖² = 0. Since the matrix S_{θ,r} is full rank, this implies that y_0^{r−1} = 0. Since C_{θ,r} is full rank (the pair [A_θ, R_θ] is controllable), this in turn implies that x = 0. Therefore, the matrix G_{θ,r} is positive definite and, for any y_0^{r−1}, the function

(x_0, x_r) ↦ ‖[y_0^{r−1}; x_r] − [O_{θ,r}; A_θ^r]x_0‖²_{G_{θ,r}}

is continuous, and is therefore bounded on any compact subset of X × X. This implies that every nonempty compact set C ⊂ R^{d_x} is an r-local Doeblin set, with λ_C^θ(·) = λ^Leb(· ∩ C)/λ^Leb(C) and

ǫ_C^−(y_0^{r−1}) = (λ^Leb(C))^{−1} inf_{θ∈Θ} inf_{(x_0,x_r)∈C×C} ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r),

ǫ_C^+(y_0^{r−1}) = (λ^Leb(C))^{−1} sup_{θ∈Θ} sup_{(x_0,x_r)∈C×C} ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r).

Therefore, condition (6) is satisfied for any compact set K ⊂ Y^r.
It remains to show (A1)(iii). Under (L1), L^θ⟨y_0^{r−1}⟩(x_0, ·) is absolutely continuous with respect to the Lebesgue measure λ^Leb. Therefore, for any set D,

inf_{θ∈Θ} inf_{x_0∈D} L^θ⟨y_0^{r−1}⟩(x_0, D) ≥ inf_{θ∈Θ} inf_{(x_0,x_r)∈D×D} ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r) λ^Leb(D).

Take D to be any compact set with positive Lebesgue measure. Then

sup_{θ∈Θ} sup_{(x_0,x_r)∈D×D} ‖[y_0^{r−1}; x_r] − [O_{θ,r}; A_θ^r]x_0‖²_{G_{θ,r}} ≤ 2λ_max(G_{θ,r}^{−1}) {‖y_0^{r−1}‖² + max_{x∈D} ‖x‖² [1 + λ_max(ᵗO_{θ,r}O_{θ,r} + ᵗA_θ^r A_θ^r)]},

where λ_max(A) is the largest eigenvalue of A. Under (L3), θ ↦ λ_max(G_{θ,r}^{−1}) and θ ↦ λ_max(ᵗO_{θ,r}O_{θ,r} + ᵗA_θ^r A_θ^r) are bounded. Under (L4), E[‖Y_0‖²] < ∞, and hence (A1)(iii) is satisfied for any compact set D with positive Lebesgue measure.

Consider now (A2). Under (L2), S_θ is full rank and, choosing the reference measure μ to be the Lebesgue measure on Y, we find that, for each x ∈ X, g^θ(x, ·) is a Gaussian density with covariance matrix S_θ ᵗS_θ. We therefore have

sup_{θ∈Θ} sup_{x∈X} g^θ(x, y) = (2π)^{−d_y/2} sup_{θ∈Θ} det^{−1/2}(S_θ ᵗS_θ) < ∞,

so that (A2)(i) and (ii) are satisfied.

We finally check (A3). For any n ≥ r and x ∈ X, the function θ ↦ p_x^θ(y_0^{n−1}) is given by (16). Under (L3), the functions θ ↦ O_{θ,n} [where O_{θ,n} is the observability matrix defined in (14)] and θ ↦ det^{−1/2}(F_{θ,n}) [where F_{θ,n} is the covariance matrix defined in (15)] are continuous on Θ for any n ≥ r. Thus, for any x ∈ X, θ ↦ p_x^θ(y_0^{n−1}) is continuous for every n ≥ r, showing (A3).
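In practice the likelihood (16) is not evaluated by forming F_{θ,n} explicitly; it factorizes into one-step predictive Gaussian densities computed by the Kalman filter. The following scalar sketch (our own illustration; the parameter values and function name are assumptions, not from the paper) shows the recursion.

```python
import math

def kalman_loglik(a, r, b, s, m0, P0, ys):
    """Scalar analogue of (16): log-likelihood of the linear Gaussian state
    space model X_{k+1} = a X_k + r U_k, Y_k = b X_k + s V_k, evaluated
    recursively by the Kalman filter."""
    m, P, ll = m0, P0, 0.0
    for y in ys:
        # predictive law of Y_k given the past: N(b*m, b^2*P + s^2)
        S = b * b * P + s * s
        ll += -0.5 * (math.log(2 * math.pi * S) + (y - b * m) ** 2 / S)
        # measurement update
        K = P * b / S
        m, P = m + K * (y - b * m), (1 - K * b) * P
        # time update
        m, P = a * m, a * a * P + r * r
    return ll

ll = kalman_loglik(a=0.8, r=0.5, b=1.0, s=0.3, m0=0.0, P0=1.0,
                   ys=[0.7, -0.2, 0.4])
```

For a single observation, the recursion reduces to the marginal Gaussian density of Y_0 with variance b²P_0 + s², which gives an easy consistency check.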
To conclude this discussion, we need to specify more explicitly the set M(D, r) [see (8)] of possible initial distributions. Using Proposition 3, we have to check the sufficient conditions (10) and (11). To check (10), we use Lemma 4. Note that, for any open subset O,

Q^θ(x, O) = E[1_O(A_θx + R_θU)],

where the expectation is taken with respect to the standard normal random variable U. Let {(x_n, θ_n)}_{n=1}^∞ be any sequence converging to (x, θ). By Fatou's lemma, using that the function 1_O is lower semi-continuous and that θ ↦ A_θ and θ ↦ R_θ are continuous under (L3), we have

lim inf_{n→∞} Q^{θ_n}(x_n, O) ≥ E[lim inf_{n→∞} 1_O(A_{θ_n}x_n + R_{θ_n}U)] ≥ E[1_O(A_θx + R_θU)] = Q^θ(x, O),

showing that, for any open subset O, the function (x, θ) ↦ Q^θ(x, O) is lower semi-continuous.

Assumption (L2) implies that, for all (x, y) ∈ X × Y,

ln g^θ(x, y) ≥ −(d_y/2) ln(2π) + inf_{θ∈Θ} ln det^{−1/2}(S_θ ᵗS_θ) − [inf_{θ∈Θ} λ_min(S_θ ᵗS_θ)]^{−1} [‖y‖² + sup_{θ∈Θ} ‖B_θx‖²],

where λ_min(S_θ ᵗS_θ) is the minimal eigenvalue of S_θ ᵗS_θ. Therefore, under (L4), (11) is satisfied because D_u is a compact set for u ∈ {0, ..., r−1}.

We can therefore apply Theorem 2 to show that the MLE is consistent for any initial measure χ as soon as the process {Y_k}_{k∈Z} is stationary ergodic and E[‖Y_0‖²] < ∞.

4.2. Finite state models.
One of the most widely used classes of HMMs is obtained when the state space is finite, that is, X = {1, ..., d} for some integer d, Y is any Polish space and Θ is a compact metric space. For each parameter θ ∈ Θ, the transition kernel Q^θ is determined by the corresponding transition probability matrix Q_θ, while the observation density g^θ is given as in the general setting of this paper.

It is assumed in the sequel that:

(F1) There exists an integer r > 0 such that inf_{θ∈Θ} inf_{(x,x′)∈X×X} Q_θ^r(x, x′) > 0.

(F2) There exists a set M ⊂ Y such that inf_{θ∈Θ} inf_{y∈M} inf_{x∈X} g^θ(x, y) > 0 and sup_{θ∈Θ} sup_{y∈M} sup_{x∈X} g^θ(x, y) < ∞.

(F3) For any θ ∈ Θ, the function g^θ: (x, y) ∈ X × Y ↦ g^θ(x, y) is positive and E[ln^+ sup_{θ∈Θ} sup_{x∈X} g^θ(x, Y_0)] < ∞.

(F4) E[ln^− inf_{θ∈Θ} inf_{x∈X} g^θ(x, Y_0)] < ∞.

(F5) θ ↦ Q_θ and θ ↦ g^θ(x, y) are continuous for any x ∈ X and y ∈ Y.

Consider first (A1). We set C = X. Since C^c = ∅, (5) is trivially satisfied. Under (F1), equation (4) is satisfied with φ_X^θ⟨y_0^{r−1}⟩(x) ≡ 1, λ_X^θ = d^{−1} ∑_{i=1}^d δ_i, and

ǫ_X^−(y_0^{r−1}) = d ∏_{i=0}^{r−1} inf_{θ∈Θ} inf_{x∈X} g^θ(x, y_i) × inf_{θ∈Θ} inf_{(x,x′)∈X×X} Q_θ^r(x, x′),

ǫ_X^+(y_0^{r−1}) = d ∏_{i=0}^{r−1} sup_{θ∈Θ} sup_{x∈X} g^θ(x, y_i) × sup_{θ∈Θ} sup_{(x,x′)∈X×X} Q_θ^r(x, x′).

Hence, the state space X is an r-local Doeblin set. Assumption (F2) implies that (6) is satisfied with K = M^r. Now, note that for all u ∈ {1, ..., r} and y_0^{u−1} ∈ Y^u,

inf_{θ∈Θ} inf_{x∈X} L^θ⟨y_0^{u−1}⟩(x, X) ≥ ∏_{i=0}^{u−1} inf_{θ∈Θ} inf_{x∈X} g^θ(x, y_i).    (18)

Using the previous inequality with u = r and noting that (F4) implies E[ln^− inf_{θ∈Θ} inf_{x∈X} g^θ(x, Y_0)] < ∞, we see that equation (7) is satisfied with D = X. The same argument for any u ∈ {1, ..., r} shows that all the probability measures on (X, 𝒳) belong to the set M(X, r) defined in (8).

Assumption (A2) is a direct consequence of (F3). Finally, the continuity of θ ↦ Q_θ and θ ↦ g^θ(x, y) immediately yields that θ ↦ p_x^θ(y_0^n) is a continuous function for every n ≥ 0 and y_0^n ∈ Y^{n+1}, establishing (A3).

We can therefore apply Theorem 2 under (F1)–(F5) to show that the MLE is consistent for any initial measure χ as soon as the process {Y_k}_{k∈Z} is stationary ergodic.

4.3. Nonlinear state space models.
In this section, we consider a class of nonlinear state space models. Let X = R^d and Y = R^ℓ, and let 𝒳 and 𝒴 be the associated Borel σ-fields. Let Θ be a compact metric space. For each θ ∈ Θ and each x ∈ X, the Markov kernel Q^θ(x, ·) has a density q^θ(x, ·) with respect to the Lebesgue measure on X. For example, (X_k)_{k≥0} may be defined through the nonlinear recursion

X_k = T_θ(X_{k−1}) + Σ_θ(X_{k−1}) ζ_k,

where (ζ_k)_{k≥1} is an i.i.d. sequence of d-dimensional random vectors which are assumed to possess a density ρ_ζ with respect to the Lebesgue measure λ^Leb on R^d, and T_θ: R^d → R^d and Σ_θ: R^d → R^{d×d} are given measurable functions such that, for each θ ∈ Θ and x ∈ X, Σ_θ(x) is full rank. Such a model for (X_k)_{k≥0} is sometimes known as a vector ARCH model, and covers many models of interest in time series analysis and financial econometrics. We let the reference measure μ be the Lebesgue measure on R^ℓ, and define the observed process (Y_k)_{k≥0} by means of a given observation density g^θ(x, y).

We now introduce the basic assumptions of this section.

(NL1) The function (x, x′, θ) ↦ q^θ(x, x′) is positive and continuous on X × X × Θ. In addition, sup_{θ∈Θ} sup_{(x,x′)∈X×X} q^θ(x, x′) < ∞.

(NL2) For any compact subset K ⊂ Y and θ ∈ Θ,

lim_{|x|→∞} sup_{y∈K} g^θ(x, y) / sup_{x′∈X} g^θ(x′, y) = 0.

(NL3) For each (x, y) ∈ X × Y, the function θ ↦ g^θ(x, y) is positive and continuous on Θ. Moreover, E[ln^+ sup_{θ∈Θ} sup_{x∈X} g^θ(x, Y_0)] < ∞.

(NL4) There exists a compact subset D ⊂ X such that E[ln^− inf_{θ∈Θ} inf_{x∈D} g^θ(x, Y_0)] < ∞.

We have made no attempt at generality here: for the sake of example, we have chosen a set of conditions under which the assumptions of Theorem 2 are easily verified. Of course, the applicability of Theorem 2 extends far beyond the simple assumptions imposed in this section.
Remark 9.
Nonetheless, the present assumptions already cover a broad class of nonlinear models. Consider, for example, the stochastic volatility model [16]

X_{k+1} = φ_θ X_k + σ_θ ζ_k,    Y_k = β_θ exp(X_k/2) ε_k,    (19)

where (ζ_k, ε_k) are i.i.d. Gaussian random vectors in R² with zero mean and identity covariance matrix, β_θ > 0 and σ_θ > 0 for all θ ∈ Θ, and the functions θ ↦ φ_θ, θ ↦ σ_θ and θ ↦ β_θ are continuous. Then, assumptions (NL1)–(NL4) are satisfied, as noted by Douc et al. [8], Remark 10.

Under (NL1), every compact set C ⊂ X = R^d with λ^Leb(C) > 0 is a 1-local Doeblin set, with λ_C^θ(·) = λ^Leb(· ∩ C)/λ^Leb(C), φ_C^θ⟨y⟩(x) = g^θ(x, y) λ^Leb(C) and

ǫ_C^− = inf_{θ∈Θ} inf_{(x,x′)∈C×C} q^θ(x, x′),    ǫ_C^+ = sup_{θ∈Θ} sup_{(x,x′)∈C×C} q^θ(x, x′).

Under (NL1) and (NL2), (5) and (6) are satisfied with r = 1; equation (7) follows from (NL1) and (NL4). Thus assumption (A1) holds. Assumption (A2) follows directly from (NL3). To establish (A3), it suffices to note that, under (NL1), for any (x, x′) ∈ X × X, θ ↦ q^θ(x, x′) is continuous; under (NL3), for any (x, y) ∈ X × Y, θ ↦ g^θ(x, y) is continuous; and, for any n ∈ N, sup_{θ∈Θ} sup_{x∈X} ∏_{k=0}^n g^θ(x, Y_k) < ∞, P-a.s. The bounded convergence theorem then shows that, P-a.s., the function θ ↦ p_x^θ(Y_0^n) is continuous. Finally, under (NL1)–(NL4), according to Theorem 2 and Proposition 3, the MLE is consistent for any initial measure χ such that χ(D) > 0.
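A minimal simulation of the stochastic volatility model (19) can make the example concrete (our own sketch; the parameter values, the function name and the stationary initialization are illustrative assumptions):

```python
import math
import random

def simulate_sv(phi, sigma, beta, n, seed=1):
    """Simulate the stochastic volatility model (19):
    X_{k+1} = phi*X_k + sigma*zeta_k,  Y_k = beta*exp(X_k/2)*eps_k,
    with zeta_k, eps_k i.i.d. standard Gaussian."""
    rng = random.Random(seed)
    xs, ys = [], []
    # start from the stationary law of the AR(1) volatility process
    x = rng.gauss(0.0, sigma / math.sqrt(1 - phi * phi))
    for _ in range(n):
        xs.append(x)
        ys.append(beta * math.exp(x / 2.0) * rng.gauss(0.0, 1.0))
        x = phi * x + sigma * rng.gauss(0.0, 1.0)
    return xs, ys

xs, ys = simulate_sv(phi=0.95, sigma=0.2, beta=0.5, n=1000)
```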
5. Proofs of Proposition 1 and Theorem 2.
Block decomposition.
The first step of the proof consists of splitting the observations into blocks of size r, where r is defined in (A1). More precisely, we will first show the equivalent of Proposition 1 and Theorem 2 with Y_i replaced by Z_i ≜ Y_{ir}^{(i+1)r−1}. With this notation,

θ̂_{χ,nr} = argmax_{θ∈Θ} ln p^θ_χ(Y_0^{nr−1}) = argmax_{θ∈Θ} ln p^θ_χ(Z_0^{n−1}).

In the following, θ̂_{χ,nr} is called the block maximum likelihood estimator (denoted hereafter as the block MLE) associated to the observations Z_0, …, Z_{n−1}.

5.1.1. Forgetting of the initial distribution for the block conditional likelihood.
Denote, for i ∈ Z,

z_i = y_{ir}^{(i+1)r−1} ∈ Y^r.   (20)

Then, the likelihood p^θ_χ(z_0^{n−1}) may be rewritten as

p^θ_χ(z_0^{n−1}) = p^θ_χ(y_0^{nr−1}) = χ L_θ⟨z_0⟩ ⋯ L_θ⟨z_{n−1}⟩ 1_X = χ L_θ⟨z_0^{n−1}⟩ 1_X,   (21)

where L_θ⟨z_0^{n−1}⟩ = L_θ⟨y_0^{nr−1}⟩ is defined in (2). For any sequence {z_i}_{i≥0} ∈ Z^N, where Z ≜ Y^r, any probability measures χ and χ′ on (X, 𝒳) and any measurable nonnegative functions f and h from X to R_+, define

∆^θ_{χ,χ′}⟨z_0^{n−1}⟩(f, h) = (χ L_θ⟨z_0^{n−1}⟩ f)(χ′ L_θ⟨z_0^{n−1}⟩ h) − (χ L_θ⟨z_0^{n−1}⟩ h)(χ′ L_θ⟨z_0^{n−1}⟩ f).   (22)

Let X̄ = X × X and 𝒳̄ = 𝒳 ⊗ 𝒳. For P a (possibly unnormalized) kernel on (X, 𝒳), we denote by P̄ the transition kernel on (X̄, 𝒳̄) defined, for any (x, x′) ∈ X̄ and A, A′ ∈ 𝒳, by

P̄[(x, x′), A × A′] = P(x, A) P(x′, A′).   (23)

If χ and χ′ are two probability measures on (X, 𝒳) and f, h are real-valued measurable functions on (X, 𝒳), define, for Ā ∈ 𝒳̄ and w̄ = (w, w′) ∈ X̄,

χ ⊗ χ′(Ā) = ∫∫ χ(dx) χ′(dx′) 1_Ā(x, x′),    f ⊗ h(w̄) = f(w) h(w′).   (24)
With the notation introduced above, (22) can be rewritten as follows:∆ θχ,χ ′ h z n − i ( f, h ) = Z · · · Z χ ⊗ χ ′ (d ¯ w ′ ) n − Y i =0 ¯ L θ h z i i ( ¯ w i , d ¯ w i +1 ) ! (25) × { f ⊗ h − h ⊗ f } ( ¯ w n ) . The following proposition extends [6], Proposition 12.
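The quantity ∆ measures how far the two likelihood functionals started from χ and χ′ are from being proportional, and the proposition that follows shows it decays geometrically. For intuition, here is a toy finite-state check that two filters started from very different initial laws merge; the transition matrix, Gaussian emission means and initial laws below are all invented for the sketch, which is far simpler than the general setting.

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy transition matrix (invented)
means = np.array([-1.0, 1.0])            # toy Gaussian emission means (invented)

def filter_step(p, y):
    """One step of the normalized filter: predict with Q, then correct
    by the (unnormalized) Gaussian emission density and renormalize."""
    p = p @ Q
    p = p * np.exp(-0.5 * (y - means) ** 2)
    return p / p.sum()

y = means[rng.integers(0, 2, 200)] + rng.standard_normal(200)
chi = np.array([0.99, 0.01])        # two very different initial laws
chi_prime = np.array([0.01, 0.99])
for obs in y:
    chi, chi_prime = filter_step(chi, obs), filter_step(chi_prime, obs)

# after 200 steps the two filters agree: the initial law has been forgotten
print(np.abs(chi - chi_prime).max())
```

Because every entry of Q is bounded away from zero, the chain satisfies a one-step Doeblin-type condition and the merging is geometric, which is the finite-state shadow of the local Doeblin machinery used in the proofs.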
Proposition 5.
Assume (A1). Let 0 ≤ γ_− < γ_+ ≤ 1. Then, for any η > 0, there exists ρ ∈ (0, 1) such that, for any sequence (z_i)_{i≥0} ∈ Z^N satisfying

n^{−1} ∑_{i=0}^{n−1} 1_K(z_i) ≥ max(1 − γ_−, (1 + γ_+)/2),   (26)

for any β ∈ (γ_−, γ_+), any nonnegative bounded functions f and h, any probability measures χ and χ′ on (X, 𝒳) and any θ ∈ Θ,

|∆^θ_{χ,χ′}⟨z_0^{n−1}⟩(f, h)| ≤ ρ^{⌊n(β−γ_−)⌋} {(χ L_θ⟨z_0^{n−1}⟩ f)(χ′ L_θ⟨z_0^{n−1}⟩ h) + (χ′ L_θ⟨z_0^{n−1}⟩ f)(χ L_θ⟨z_0^{n−1}⟩ h)} + 2 η^{⌊n(γ_+−β)⌋/2} [∏_{i=0}^{n−1} |L_θ⟨z_i⟩(·, X)|_∞] |f|_∞ |h|_∞.

Proof.
Let η >
0. According to (A1), there exists a set C ⊂ Y suchthat (5) and (6) hold. Denote ¯ C , C × C and for z = y r − , set ¯ ϕ θ C h z i = ϕ θ C h z i ⊗ ϕ θ C h z i and ¯ λ θ C h z i , λ θ C h z i ⊗ λ θ C h z i where ϕ θ C h z i and λ θ C h z i are definedin Definition 1. For any measurable nonnegative function ¯ f on (¯ X , ¯ X ), θ ∈ Θand ¯ x ∈ ¯ C , ( ǫ − C ( z )) ¯ ϕ θ C h z i (¯ x )¯ λ θ C h z i ( ¯ C ¯ f )(27) ≤ δ ¯ x ¯ L θ h z i ( ¯ C ¯ f ) ≤ ( ǫ + C ( z )) ¯ ϕ θ C h z i (¯ x )¯ λ θ C h z i ( ¯ C ¯ f ) . Define the unnormalized kernel ¯ L θ, h z i and ¯ L θ, h z i on (¯ X , ¯ X ) as follows: forall ¯ x ∈ ¯ X and ¯ A ∈ ¯ X ,¯ L θ, h z i (¯ x, ¯ A ) , ¯ C (¯ x )( ǫ − C ( z )) ¯ ϕ θ C h z i (¯ x )¯ λ θ C h z i (¯ C ∩ ¯ A ) , (28) ¯ L θ, h z i (¯ x, ¯ A ) , ¯ L θ h z i (¯ x, ¯ A ) − ¯ L θ, h z i (¯ x, ¯ A ) . (29)Equation (27) implies that, for all ¯ x ∈ ¯ C , and any measurable nonnegativefunction ¯ f , 0 ≤ δ ¯ x ¯ L θ, h z i ( ¯ C ¯ f ) ≤ r C ( z ) δ ¯ x ¯ L θ h z i ( ¯ C ¯ f ) , LE IN MISSPECIFIED HMMS where r C ( z ) , − ( ǫ − C ( z ) /ǫ + C ( z )) . It then follows δ ¯ x ¯ L θ, h z i ( ¯ f )= ¯ C (¯ x ) δ ¯ x ¯ L θ, h z i ( ¯ C ¯ f ) + ¯ C (¯ x ) δ ¯ x ¯ L θ, h z i ( ¯ C c ¯ f ) + ¯ C c (¯ x ) δ ¯ x ¯ L θ, h z i ( ¯ f )(30) ≤ r C ( z ) ¯ C (¯ x ) δ ¯ x ¯ L θ h z i ( ¯ C ¯ f ) + ¯ C (¯ x ) δ ¯ x ¯ L θ h z i ( ¯ C c ¯ f ) + ¯ C c (¯ x ) δ ¯ x ¯ L θ h z i ( ¯ f ) ≤ δ ¯ x ¯ L θ h z i ( r C ( z ) ¯ C (¯ x ) ¯ C ¯ f ) . Note that ∆ θχ,χ ′ h z n − i ( f, h ) may be decomposed as∆ θχ,χ ′ h z n − i ( f, h ) = X t n − ∈{ , } n ∆ θ,t n − χ,χ ′ h z n − i ( f, h ) , where∆ θ,t n − χ,χ ′ h z n − i ( f, h ) = Z · · · Z χ ⊗ χ ′ (d ¯ w ′ ) n − Y i =0 ¯ L θ,t i h z i i ( ¯ w i , d ¯ w i +1 ) ! Φ( ¯ w n )with Φ , f ⊗ h − h ⊗ f . First assume that there exists an index i ∈ { , . . . , n − } such that t i = 0. 
Then∆ θ,t n − χ,χ ′ h z n − i ( f, h ) = χ ⊗ χ ′ ( ¯ L θ,t h z i · · · ¯ L θ,t i − h z i − i ( ¯ C × ¯ ϕ θ C h z i i )) × ( ǫ − C ( z i )) ¯ λ θ C h z i i ( ¯ C ¯ L θ,t i +1 h z i +1 i · · · ¯ L θ,t n − h z n − i Φ) . By symmetry, ¯ λ θ C h z i i ( ¯ C ¯ L θ,t i +1 h z i +1 i · · · ¯ L θ,t n − h z n − i Φ) = 0 , showing that ∆ θ,t n − χ,χ ′ h z n − i ( f, h ) = 0 except if for all i ∈ { , . . . , n − } , t i = 1.Therefore, ∆ θχ,χ ′ h z n − i ( f, h ) = χ ⊗ χ ′ ( ¯ L θ, h z i · · · ¯ L θ, h z n − i Φ) . This implies, using (30), that | ∆ θχ,χ ′ h z n − i ( f, h ) |≤ χ ⊗ χ ′ ( ¯ L θ, h z i · · · ¯ L θ, h z n − i| Φ | )(31) ≤ Z · · · Z χ ⊗ χ ′ (d ¯ w ) n − Y i =0 ¯ L θ h z i i ( ¯ w i , d ¯ w i +1 )( r C ( z i )) ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) ! × | Φ | ( ¯ w n ) . Note that n − Y i =0 ( r C ( z i )) ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) ≤ ̺ P n − i =0 ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) K ( z i ) C , (32) R. DOUC AND E. MOULINES where ̺ C , sup z ∈ K r C ( z ) < z n − such that n − P n − i =0 K ( z i ) ≥ (1 − γ − ), we have P n − i =0 K c ( z i ) ≤ nγ − , so that n − X i =0 K c ( z i ) ≤ ⌊ nγ − ⌋ . Moreover, we have n − X i =0 ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) K ( z i )= n − X i =0 ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) − n − X i =0 ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) K c ( z i )(33) ≥ N ¯ C ,n ( ¯ w n ) − n − X i =0 K c ( z i ) ≥ N ¯ C ,n ( ¯ w n ) − ⌊ nγ − ⌋ , where, for any set ¯ A ∈ ¯ X , N ¯ A ,n ( ¯ w n ) = P n − i =0 ¯ A × ¯ A ( ¯ w i , ¯ w i +1 ). By combining(32) and (33) and using that ⌊ nβ ⌋ − ⌊ nγ − ⌋ ≥ ⌊ n ( β − γ − ) ⌋ , we thereforeobtain, for any β ∈ ( γ − , n − Y i =0 ( r C ( z i )) ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) ≤ ̺ ⌊ n ( β − γ − ) ⌋ C + { N ¯ C ,n ( ¯ w n ) < ⌊ nβ ⌋} . (34)For any sequence ¯ w n − ∈ ¯ X n and any ¯ A ∈ ¯ X , denote M ¯ A ,n ( ¯ w n − ) , n − X i =0 ¯ A ( ¯ w i ) . Using [6], Lemma 17, for any sequence ¯ w n satisfying N ¯ C ,n ( ¯ w n ) < ⌊ nβ ⌋ whichis equivalent to N ¯ C ,n ( ¯ w n ) ≤ ⌊ nβ ⌋ −
1, we have M ¯ C ,n ( ¯ w n − ) ≤ ( ⌊ nβ ⌋ + n ) / N ¯ C ,n ( ¯ w n ) < ⌊ nβ ⌋ ⇒ M ¯ C c ,n ( ¯ w n − ) ≥ a n , n − ⌊ nβ ⌋ . (35)In words, either the number of consecutive visits to the set ¯ C at most ⌊ nβ ⌋ ,or the number of visits to the complementary of the set ¯ C is larger than a n .Plugging (35) into (34) and combining it with (31) yields | ∆ θχ,χ ′ h z n i ( f, h ) | ≤ ̺ ⌊ n ( β − γ − ) ⌋ C χ ⊗ χ ′ ( ¯ L θ h z i · · · ¯ L θ h z n − i| Φ | )+ 2 | f | ∞ | h | ∞ Γ θχ,χ ′ ( z n − ) , LE IN MISSPECIFIED HMMS whereΓ θχ,χ ′ ( z n − ) , Z · · · Z χ ⊗ χ ′ (d ¯ w ) n − Y i =0 ¯ L θ h z i i ( ¯ w i , d ¯ w i +1 ) { M ¯ C c ,n ( ¯ w n − ) ≥ a n } . We finally have to bound this last term. First rewrite Γ θχ,χ ′ ( z n − ) as follows:Γ θχ,χ ′ ( z n − ) = n − Y i =0 | L θ h z i i ( · , X ) | ∞ ! Z χ ⊗ χ ′ (d ¯ w )( η P n − i =0 ¯ C c ( ¯ w i ) K ( z i ) ) × n − Y i =0 ¯ L θ h z i i ( ¯ w i , d ¯ w i +1 ) η ¯ C c ( ¯ w i ) K ( z i ) | L θ h z i i ( · , X ) | ∞ ! { M ¯ C c ,n ( ¯ w n − ) ≥ a n } . Note that (26) implies that P n − i =0 K ( z i ) ≥ ( n + ⌊ nγ + ⌋ ) /
2. Then, for any γ + > β , the inequality M ¯ C c ,n ( ¯ w n − ) ≥ a n implies that n − X i =0 ¯ C c (¯ x i ) K ( z i ) ≥ n − X i =0 ¯ C c (¯ x i ) − n − X i =0 K c ( z i ) ≥ ⌊ nγ + ⌋ − ⌊ nβ ⌋ ≥ ⌊ n ( γ + − β ) ⌋ , showing that( η P n − i =0 ¯ C c (¯ x i ) K ( z i ) ) { M ¯ C c ,n (¯ x n − ) ≥ a n } ≤ η ⌊ n ( γ + − β ) ⌋ / . The proof follows noting that, for any ¯ w = ( w, w ′ ) ∈ ¯ X and z ∈ Y r , (3) and(5) imply Z Z ¯ L θ h z i ( ¯ w, d ¯ w i +1 ) η ¯ C c ( ¯ w ) K ( z ) | L θ h z i ( · , X ) | ∞ = L θ h z i ( w, X ) L θ h z i ( w ′ , X ) η ¯ C c ( ¯ w ) K ( z ) | L θ h z i ( · , X ) | ∞ ≤ . (cid:3) Lemma 6.
Let (U_k)_{k∈Z}, (V_k)_{k∈Z} and (W_k)_{k∈Z} be stationary sequences such that E[ln^+ U_0] < ∞, E[ln^+ V_0] < ∞ and E[ln^+ W_0] < ∞. Then, for all η, ρ in (0, 1) such that −ln η > E[ln^+ V_0], there exist a P-a.s. finite random variable D and a constant ̺ ∈ (0, 1) such that, for all k ≥ 0 and m ≥ 0,

ρ^{k+m} + η^{k+m} W_{−m} (∏_{i=−m}^{k−1} V_i) U_k ≤ ̺^{k+m} D,   P-a.s.

Proof.
Let α ∈ (0 ,
1) such that E [ln + V ] < − ln α < − ln η , and let ˜ α > η/α ) ∨ ρ < ˜ α <
1. Then ρ k + m + η k + m W − m k − Y i = − m V i ! U k R. DOUC AND E. MOULINES = "(cid:18) ρ ˜ α (cid:19) k + m ˜ α m + (cid:18) ηα ˜ α (cid:19) k + m ( ˜ α m W − m ) k − Y i = − m ( V i α ) ! ( ˜ α k U k ) ≤ (cid:18) ρ ˜ α ∨ ηα ˜ α (cid:19) k + m D with D , (cid:16) sup m ≥ ˜ α m W − m (cid:17) sup m ≥ Y i = − m ( V i α ) ! sup k ≥ k − Y i =1 ( V i α ) !(cid:16) sup k ≥ ˜ α k U k (cid:17) . We now show that D is P -a.s. finite. First note that combining the bound E [ln + U < ∞ ] with Lemma 7 (stated and proved below), we obtain that therandom variable sup k ≥ ˜ α k U k is P -a.s. finite; in the same way, sup m ≥ ˜ α m W − m is P -a.s. finite. Moreover, since E [ln + V ] < ∞ , Birkoff’s ergodic theorem en-sures that 1 k − k − X i =1 ln + V i → k →∞ E [ln + V ] < − ln α, P -a.s.By taking the exponential function in the previous limit, we obtain that k − Y i =1 ( V i α ) ≤ exp ( ( k − k − k − X i =1 ln + V i + ln α !) → k →∞ , P -a.s.so that sup k ≥ Q k − i =1 ( V i α ) is P -a.s. finite. Following the same arguments,sup m ≥ Y i = − m ( V i α )is P -a.s. finite. Finally D is P -a.s. finite. The proof is complete. (cid:3) Lemma 7.
Let {Z_k}_{k∈Z} be a sequence of nonnegative random variables on a probability space (Ω, 𝒜, P) having the same marginal distribution, that is, for any k ∈ Z and any measurable nonnegative function f, E[f(Z_k)] = E[f(Z_0)].

(i) Assume that E[(ln Z_0)^+] < ∞. Then, for all β ∈ (0, 1), sup_{k≥0} β^k Z_k < ∞, P-a.s.

(ii) Assume that E[|ln Z_0|] < ∞. Then, for all β ∈ (0, 1), sup_{k∈Z} β^{|k|} Z_k < ∞ and inf_{k∈Z} β^{−|k|} Z_k > 0, P-a.s.

Proof. Let β ∈ (0, 1). Since

P[β^k Z_k > 1] = P[ln Z_k/(−ln β) ≥ k] = P[ln Z_0/(−ln β) ≥ k],

it follows that

∑_{k=0}^∞ P[β^k Z_k > 1] = ∑_{k=0}^∞ P[ln Z_0/(−ln β) ≥ k] ≤ 1 + E[(ln Z_0)^+]/(−ln β) < ∞.

The proof of (i) is completed by using the Borel–Cantelli lemma. Now, (ii) can be easily derived by noting that if E[|ln Z_0|] < ∞, then one may use (i) twice, first by replacing Z_k by Z_{−k} and then by replacing Z_k by 1/Z_k. □

Proposition 8.
Assume (A1) and (A2). There exist a constant κ ∈ (0, 1) and an integer-valued random variable K satisfying P_Y[K < ∞] = 1 such that, for any initial distributions χ, χ′ ∈ M(D, r) [where M(D, r) is defined in (8)],

sup_{θ∈Θ} sup_{k≥K} sup_{m≥0} κ^{−(m+k)} |ln p^θ_χ(Z_k | Z_{−m}^{k−1}) − ln p^θ_{χ′}(Z_k | Z_{−m}^{k−1})| < ∞, P-a.s.,   (36)

sup_{θ∈Θ} sup_{k≥K} sup_{m≥0} κ^{−(m+k)} |ln p^θ_χ(Z_k | Z_{−m}^{k−1}) − ln p^θ_χ(Z_k | Z_{−m−1}^{k−1})| < ∞, P-a.s.,   (37)

sup_{θ∈Θ} sup_{m≥0} κ^{−m} |ln p^θ_χ(Z_0 | Z_{−m}^{−1}) − ln p^θ_χ(Z_0 | Z_{−m−1}^{−1})| < ∞, P-a.s.   (38)

Proof.
Proof of (36). It follows from (21) that, for any integer ( m, k ) ∈ N and any sequence z k − m , p θχ ( z k | z k − − m ) = χ L θ h z k − − m i ( L θ h z k i X ) χ L θ h z k − − m i ( X ) . Since, for any a, b >
0, ln( a ) − ln( b ) ≤ ( a − b ) /b , definition (22) implies thatln p θχ ( z k | z k − − m ) − ln p θχ ′ ( z k | z k − − m )(39) ≤ ∆ θχ,χ ′ h z k − − m i ( L θ h z k i X , X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X ) . Let 0 ≤ γ − < γ + ≤
1. By Proposition 5, for any η > β ∈ ( γ − , γ + ) thereexists ̺ ∈ (0 ,
1) such that, for any sequence z k − − m satisfying( m + k ) − k − X i = − m K ( z i ) ≥ max(1 − γ − , (1 + γ + ) / , (40) R. DOUC AND E. MOULINES we have ∆ θχ,χ ′ h z k − − m i ( L θ h z k i X , X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( X ) ≤ ̺ a ( m + k ) (cid:20) χ L θ h z k − − m i ( L θ h z k i X ) × χ ′ L θ h z k − − m i ( X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X ) (cid:21) (41) + 2 η b ( m + k ) C m,k , where a ( n ) = ⌊ n ( β − γ − ) ⌋ , b ( n ) = ⌊ n ( γ + − β ) ⌋ / C m,k , Q k − i = − m | L θ h z i i ( · , X ) | ∞ χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X ) | L θ h z k i ( · , X ) | ∞ . (42)Moreover, by (22), χ L θ h z k − − m i ( L θ h z k i X ) × χ ′ L θ h z k − − m i ( X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X )= ∆ θχ,χ ′ h z k − − m i ( L θ h z k i X , X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X ) + 1 . Plugging this identity into (41) and then using (39) yieldsln p θχ ( z k | z k − − m ) − ln p θχ ′ ( z k | z k − − m )(43) ≤ − ̺ a ( m + k ) ) − [ ̺ a ( m + k ) + η b ( m + k ) C m,k ] . For any sequence z k − − m , we have χ L θ h z k − − m i ( X ) ≥ χ ( D ) k − Y i = − m n inf x ∈ D L θ h z i i ( x, D ) o , (44) χ ′ L θ h z k − − m i ( L θ h z k i X ) ≥ χ ′ ( D ) k Y i = − m n inf x ∈ D L θ h z i i ( x, D ) o . Exchanging χ and χ ′ in (43) allows us to obtain an upper bound for | ln p θχ ( z k | z k − − m ) − ln p θχ ′ ( z k | z k − − m ) | . More precisely, for any sequence z k − − m sat-isfying (40), we havesup θ ∈ Θ | ln p θχ ( z k | z k − − m ) − ln p θχ ′ ( z k | z k − − m ) |≤ − ̺ a ( m + k ) ) − (45) × ( ̺ a ( m + k ) + η b ( m + k ) χ ( D ) χ ′ ( D ) " k − Y j = − m ( D z j ) D z k ) , LE IN MISSPECIFIED HMMS where, for z ∈ Y r , D z = sup θ ∈ Θ | L θ h z i ( · , X ) | ∞ inf θ ∈ Θ inf x ∈ D L θ h z i ( x, D ) . (46)Assume that E [ln + ( D Z )] < ∞ , and set η small enough so that E [ln + ( D Z )] ≤− ln η . 
By Lemma 6, there exists a P -a.s. finite random variable C , and aconstant κ ∈ (0 ,
1) such that, for all k ≥ m ≥ − ̺ a ( m + k ) ( ̺ a ( m + k ) + η b ( m + k ) χ ( D ) χ ′ ( D ) " k − Y j = − m ( D z j ) D z k ) ≤ Cκ k + m , P -a.s.It remains to show that E [ln + ( D Z )] < ∞ . Since for any a, b >
0, ln + ( a/b ) ≤ ln + ( a ) + ln − ( b ),ln + ( D z ) ≤ ln + (cid:16) sup θ ∈ Θ | L θ h z i ( · , X ) | ∞ (cid:17) + ln − (cid:16) inf θ ∈ Θ inf x ∈ D L θ h z i ( x, D ) (cid:17) . (47)Since, for any z = y r − ∈ Y r , sup θ ∈ Θ | L θ h z i ( · , X ) | ∞ ≤ Q r − i =0 sup θ ∈ Θ | g θ ( · , y i ) | ∞ ,(A1)(iii) and (A2) imply that E [ln + ( D Z )] < ∞ . Finally, according to (45),sup θ ∈ Θ | ln p θχ ( Z k | Z k − − m ) − ln p θχ ′ ( Z k | Z k − − m ) | ≤ Cκ m + k , P -a.s. , provided that( m + k ) − k − X j = − m K ( Z j ) ≥ max(1 − γ − , (1 + γ + ) / , P -a.s.(48)It thus remains to show the existence of a P -a.s. finite random variable K such that for any k ≥ K and any m ≥
0, (48) holds P -a.s. Under (A1)(i),1 − P [ Z ∈ K ] < P [ Z ∈ K ] −
1. Then, choose ˜ γ − , γ − , γ + and ˜ γ + such that1 − P [ Z ∈ K ] < ˜ γ − < γ − < γ + < ˜ γ + < P [ Z ∈ K ] − . (49)By construction (1 + ˜ γ + ) / < P Y [ Z ∈ K ] and 1 − ˜ γ − < P [ Z ∈ K ]. Since( Z k ) k ∈ Z is stationary and ergodic, the Birkhoff ergodic theorem ensures thatthere exists a P -a.s. finite random variable B such that for any k ≥ B and m ≥ B , P -a.s., max (cid:18) − ˜ γ − , γ + (cid:19) < k − k − X i =0 K ( Z i ) , (50) max (cid:18) − ˜ γ − , γ + (cid:19) < m − − X i = − m K ( Z i ) . (51) R. DOUC AND E. MOULINES
Set K + , B (1 + γ + ) / (˜ γ + − γ + ). If m ≥ B and k ≥ K + , then using that K + ≥ B , P -a.s., P k − i = − m K ( Z i ) k + m > k (1 + ˜ γ + ) / m (1 + ˜ γ + ) / k + m = (1 + ˜ γ + ) / > (1 + γ + ) / . Now, if 0 ≤ m < B and k ≥ K + , P k − i = − m K ( Z i ) k + m ≥ P k − i =0 K ( Z i ) k + m > k (1 + ˜ γ + ) / k + m> K + (1 + ˜ γ + ) / K + + B = (1 + γ + ) / . Similarly, setting K − , B (1 − γ − ) / (˜ γ − − γ − ), we obtain, for all m ≥ k ≥ K − that, P -a.s., P k − i = − m K ( Z i ) k + m ≥ − γ − . The proof of (36) is now completed by setting K = K + ∨ K − . Proof of (37). Note that p θχ ( z k | z k − − m − ) = p θχ ′ ( z k | z k − − m )with χ ′ ( A ) = χ ( L θ h z − m − i A ) /χ ( L θ h z − m − i X ). Since1 χ ′ ( D ) = χ ( L θ h z − m − i X ) χ ( L θ h z − m − i D ) ≤ D z − m − χ ( D ) , where D z is defined in (46), (45) writessup θ ∈ Θ | ln p θχ ( z k | z k − − m ) − ln p θχ ( z k | z k − − m − ) |≤ − ̺ a ( m + k ) ) − × " ̺ a ( m + k ) + η b ( m + k ) [ χ ( D )] D z − m − k − Y j = − m ( D z j ) D z k . And the rest of the proof of (37) follows the same lines as (36) and is omittedfor brevity.
Proof of (38). Noting that, when k = 0, equation (48) follows immediatelyfrom (51), the proof of (38) follows the same lines as the proof of (37) andis omitted for brevity. (cid:3) Corollary 9 (Corollary of Proposition 8).
Assume (A1) and (A2). For any θ ∈ Θ, there exists a measurable function π^θ_Z : Z^{Z_−} → R such that, for any probability measure χ ∈ M(D, r) [where M(D, r) is defined in (8)],

P_Y[ lim_{m→∞} p^θ_χ(Z_0 | Z_{−m}^{−1}) = π^θ_Z(Z_{−∞}^0) ] = 1.   (52)

In the sequel, we denote p^θ(Z_0 | Z_{−∞}^{−1}) ≜ π^θ_Z(Z_{−∞}^0) and, for n ≥ 0, p^θ(Z_0^n | Z_{−∞}^{−1}) ≜ ∏_{i=0}^n π^θ_Z(Z_{−∞}^i).

5.1.2. Consistency of the block MLE.
Proposition 10.
Assume (A1) and (A2). Then: (i)
For any θ ∈ Θ,

E[|ln p^θ(Z_0 | Z_{−∞}^{−1})|] < ∞.   (53)

(ii) For any probability measure χ ∈ M(D, r) [where M(D, r) is defined in (8)],

lim sup_{n→∞} sup_{θ∈Θ} |n^{−1} ln p^θ_χ(Z_0^{n−1}) − n^{−1} ln p^θ(Z_0^{n−1} | Z_{−∞}^{−1})| = 0, P-a.s.

(iii) For any θ ∈ Θ and for any probability measure χ ∈ M(D, r),

lim_{n→∞} n^{−1} ln p^θ_χ(Z_0^{n−1}) = E[ln p^θ(Z_0 | Z_{−∞}^{−1})], P-a.s.

Proof.
Proof of (i). It follows from (52) that, P -a.s., p θ ( Z | Z − −∞ ) = lim m →∞ p θχ ( Z | Z − − m ) ≤ | L θ h Z i ( · , X ) | ∞ ≤ r − Y i =0 | g θ ( · , Y i ) | ∞ . (54)Then, (A2) shows that E [ln + p θ ( Z | Z − −∞ )] ≤ E [ln + | L θ h Z i ( · , X ) | ∞ ] < ∞ . We now show that E [ln − p θ ( Z | Z − −∞ )] < ∞ by establishing that E [ln p θ ( Z | Z − −∞ )] > −∞ . For that purpose, introduce the sequence L θm , m − m X ℓ =1 [ln + | L θ h Z i ( · , X ) | ∞ − ln p θχ ( Z | Z − − ℓ )] . By (54), the sequence ( L θm ) m ≥ is nonnegative and the Fatou lemma impliesthat lim inf m →∞ E [ L θm ] ≥ E h lim inf m →∞ L θm i . (55) R. DOUC AND E. MOULINES
By definition,lim inf m →∞ E [ L θm ] = E [ln + | L θ h Z i ( · , X ) | ∞ ](56) − lim sup m →∞ m − m X ℓ =1 E [ln p θχ ( Z | Z − − ℓ )]and E h lim inf m →∞ L θm i = E [ln + | L θ h Z i ( · , X ) | ∞ ](57) − E " lim sup m →∞ m − m X ℓ =1 ln p θχ ( Z | Z − − ℓ ) . Since ( Y k ) k ∈ Z is stationary, for any ℓ ∈ N , E [ln p θχ ( Z | Z − − ℓ )] = E [ln p θχ ( Z ℓ | Z ℓ − )]showing that m − m X ℓ =1 E [ln p θχ ( Z | Z − − ℓ )] = m − m X ℓ =1 E [ln p θχ ( Z ℓ | Z ℓ − )] . (58)The Cesaro mean convergence lemma implies that, P -a.s.,lim sup m →∞ m − m X ℓ =1 ln p θχ ( Z | Z − − ℓ ) = lim ℓ →∞ ln p θχ ( Z | Z − − ℓ ) = ln p θ ( Z | Z − −∞ ) . (59)Combining (55), (56), (57), (58) and (59) yields to E [ln p θ ( Z | Z − −∞ )] ≥ lim sup m →∞ m − m X ℓ =1 E [ln p θχ ( Z ℓ | Z ℓ − )](60) = lim sup m →∞ { E [ m − ln p θχ ( Z m )] − m − E [ln p θχ ( Z )] } > −∞ , where the last bound follows from (A1)(iii) and the minorizationln p θχ ( Z m ) ≥ ln χ ( D ) + m X i =0 ln inf x ∈ D L θ h Z i i ( x, D ) . The proof of (i) follows.
Proof of (ii). According to Proposition 8 (36), there exists a randomvariable C satisfying P Y [ C < ∞ ] = 1 such that for all k ≥ K and m ≥ θ ∈ Θ | ln p θχ ( Z k | Z k − − m ) − ln p θχ ( Z k | Z k − − m − ) | ≤ Cκ k + m , P -a.s. , LE IN MISSPECIFIED HMMS which implies thatsup θ ∈ Θ | ln p θχ ( Z k | Z k − ) − ln p θ ( Z k | Z k − −∞ ) | ≤ Cκ k / (1 − κ ) , P -a.s.The proof of (ii) follows from the obvious decomposition n − ln p θχ ( Z n − ) = n − n − X k =1 ln p θχ ( Z k | Z k − ) + n − ln p θχ ( Z ) , (61) n − ln p θ ( Z n − | Z − −∞ ) = n − n − X k =0 ln p θ ( Z k | Z k − −∞ ) . The proof of (iii) follows from (53) and (61) using the Birkhoff theorem; see,for example, [28], Theorem 1.14. (cid:3)
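Proposition 10(iii) can be illustrated numerically: in a toy two-state Gaussian HMM (all constants below are invented for the sketch), the normalized log-likelihood computed with different sample sizes and different initial laws settles to a common limit, as the Birkhoff argument predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy transition matrix (invented)
means = np.array([0.0, 2.0])             # toy Gaussian emission means (invented)

def simulate(n):
    """Draw one trajectory of the toy HMM."""
    x = 0
    y = np.empty(n)
    for k in range(n):
        y[k] = means[x] + rng.standard_normal()
        x = rng.choice(2, p=Q[x])
    return y

def avg_loglik(y, chi):
    """n^{-1} ln p_chi(y_0^{n-1}) via the scaled forward recursion."""
    logp, p = 0.0, chi.astype(float)
    for obs in y:
        g = np.exp(-0.5 * (obs - means) ** 2) / np.sqrt(2.0 * np.pi)
        c = p @ g                 # p_chi(y_k | y_0^{k-1})
        logp += np.log(c)
        p = ((p * g) @ Q) / c     # next predictive filter
    return logp / len(y)

y = simulate(50_000)
l_half = avg_loglik(y[:25_000], np.array([1.0, 0.0]))
l_full = avg_loglik(y, np.array([0.5, 0.5]))
print(abs(l_half - l_full))  # small: both averages approach the same limit
```

The scaled forward recursion is the standard numerically stable way to evaluate the likelihood; the point of the check is only that the per-observation average is insensitive to both the initial law and the sample size once n is large.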
Proposition 11.
Assume (A1)–(A3). Let χ be a probability measure in M(D, r) [where M(D, r) is defined in (8)].

(i) For any θ_0 ∈ Θ and any ρ > 0,

lim sup_{n→∞} sup_{θ∈B(θ_0,ρ)} (1/n) ln p^θ_χ(Z_0^{n−1}) ≤ E[sup_{θ∈B(θ_0,ρ)} ln p^θ(Z_0 | Z_{−∞}^{−1})], P-a.s.

(ii) The function θ ↦ E[ln p^θ(Z_0 | Z_{−∞}^{−1})] is upper semi-continuous.

(iii) For any compact set Ξ ⊂ Θ, the sequence (sup_{θ∈Ξ} (1/n) ln p^θ_χ(Z_0^{n−1}))_{n≥1} converges P-a.s. and

lim_{n→∞} sup_{θ∈Ξ} (1/n) ln p^θ_χ(Z_0^{n−1}) = sup_{θ∈Ξ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})], P-a.s.

Proof.
Proof of (i). Proposition 10(ii) shows thatlim sup n →∞ sup θ ∈B ( θ ,ρ ) n ln p θχ ( Z n − )(62) ≤ lim sup n →∞ n n − X i =0 sup θ ∈B ( θ ,ρ ) ln p θ ( Z i | Z i − −∞ ) , P -a.s.By (54), for any θ ∈ Θ and ρ > p θ ( Z | Z − −∞ ) ≤ sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ )(63) ≤ r − X i =0 sup θ ∈ Θ ln + | g ( · , Y i ) | ∞ , P -a.s. , R. DOUC AND E. MOULINES which shows using (53) and (A2) that E h(cid:12)(cid:12)(cid:12) sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) (cid:12)(cid:12)(cid:12)i < ∞ . The Birkhoff theorem therefore implieslim sup n →∞ n n − X i =0 sup θ ∈B ( θ ,ρ ) ln p θ ( Z i | Z i − −∞ )(64) = E h sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) i , P -a.s. , which completes the proof of (i). Proof of (ii). First note thatsup θ ∈B ( θ ,ρ ) E [ln p θ ( Z | Z − −∞ )] ≤ E h sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) i . (65)Now, since under (A3), for any m ≥ p , P -a.s., the function θ ln p θχ ( Z | Z − − m )is continuous, then P -a.s., the function θ ln p θ ( Z | Z − −∞ ) is continuous asa uniform limit of continuous functions. Using (63), r − X i =0 sup θ ∈ Θ ln + | g ( · , Y i ) | ∞ − sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) ≥ , the monotone convergence theorem therefore implies thatlim ρ ↓ E h sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) i = E h lim ρ ↓ sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) i (66) = E [ln p θ ( Z | Z − −∞ )] . Combining (65) and (66) shows thatlim ρ ↓ sup θ ∈B ( θ ,ρ ) E [ln p θ ( Z | Z − −∞ )] ≤ E [ln p θ ( Z | Z − −∞ )] . Proof of (iii). By taking the limit of both sides of (i) with respect to ρ ↓ θ ∈ Θ,lim ρ ↓ lim sup n →∞ sup θ ∈B ( θ ,ρ ) n ln p θχ ( Z n − ) ≤ E [ln p θ ( Z | Z − −∞ )] , P -a.s.(67)Therefore, for any δ > θ ∈ Ξ, there exists ρ θ > n →∞ sup θ ∈B ( θ ,ρ θ ) n ln p θχ ( Z n − ) ≤ E [ln p θ ( Z | Z − −∞ )] + δ, P -a.s. 
LE IN MISSPECIFIED HMMS Since Ξ is compact, by extracting a finite covering, the latter inequalityshows thatlim sup n →∞ sup θ ∈ Ξ n ln p θχ ( Z n − ) ≤ sup θ ∈ Ξ E [ln p θ ( Z | Z − −∞ )] + δ, P -a.s.Since δ is arbitrary, we therefore havelim sup n →∞ sup θ ∈ Ξ n ln p θχ ( Z n − ) ≤ sup θ ∈ Ξ E [ln p θ ( Z | Z − −∞ )] . (68)Now, since for any θ ∈ Ξ,sup θ ∈ Ξ n ln p θχ ( Z n − ) ≥ n ln p θ χ ( Z n − ) . Proposition 10(iii) yieldslim inf n →∞ sup θ ∈ Ξ n ln p θχ ( Z n − ) ≥ E [ln p θ ( Z | Z − −∞ )] , P -a.s. θ being arbitrary in Ξ, we finally obtainlim inf n →∞ sup θ ∈ Ξ n ln p θχ ( Z n − ) ≥ sup θ ∈ Ξ E [ln p θ ( Z | Z − −∞ )] , P -a.s.Combining this inequality with (68) completes the proof. (cid:3) Theorem 12.
Assume (A1)–(A3). Then, for any probability measure χ ∈ M(D, r),

lim_{n→∞} d(θ̂_{χ,nr}, Θ⋆_b) = 0, P-a.s.,

where Θ⋆_b ⊂ Θ is defined by Θ⋆_b ≜ argmax_{θ∈Θ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})].

Proof.
By Proposition 11(ii), the function θ ↦ E[ln p^θ(Z_0 | Z_{−∞}^{−1})] is upper semi-continuous. Therefore the set Θ⋆_b is compact, as a closed subset of the compact set Θ, so that for any δ > 0, Ξ_δ = {θ ∈ Θ : d(θ, Θ⋆_b) ≥ δ} is also a compact set. In addition, being upper semi-continuous, the function θ ↦ E[ln p^θ(Z_0 | Z_{−∞}^{−1})] restricted to Ξ_δ attains its maximum, which implies that

sup_{θ∈Ξ_δ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})] = max_{θ∈Ξ_δ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})] < E[ln p^{θ⋆}(Z_0 | Z_{−∞}^{−1})],

where θ⋆ is any point in Θ⋆_b. Combining this with Proposition 11(iii) yields

lim_{n→∞} sup_{θ∈Ξ_δ} (1/n) ln p^θ_χ(Z_0^{n−1}) < E[ln p^{θ⋆}(Z_0 | Z_{−∞}^{−1})], P-a.s.
Using that

lim_{n→∞} (1/n) ln p^{θ⋆}_χ(Z_0^{n−1}) = E[ln p^{θ⋆}(Z_0 | Z_{−∞}^{−1})], P-a.s.,

we finally obtain that, P-a.s., θ̂_{χ,nr} ∈ Ξ_δ only finitely many times. The proof is complete. □

5.2. Proofs of Proposition 1 and Theorem 2.
We now have all the tools for obtaining the consistency of the MLE as a byproduct of the results obtained for the block MLE. We first state and prove the forgetting of the initial distribution for the predictive filter.
Lemma 13.
Assume (A1). Let 0 < γ_− < γ_+ ≤ 1. Then, for all η > 0, there exists ρ_η ∈ (0, 1) such that, for all sequences (z_i)_{i≥0} satisfying

n^{−1} ∑_{i=0}^{n−1} 1_K(z_i) ≥ max(1 − γ_−, (1 + γ_+)/2),   (69)

all β ∈ (γ_−, γ_+), all bounded measurable functions f, all probability measures χ and χ′ and all θ ∈ Θ,

| χ L_θ⟨z_0^{n−1}⟩ f / χ L_θ⟨z_0^{n−1}⟩ 1_X − χ′ L_θ⟨z_0^{n−1}⟩ f / χ′ L_θ⟨z_0^{n−1}⟩ 1_X | ≤ (ρ_η^{⌊n(β−γ_−)⌋} + 2 η^{⌊n(γ_+−β)⌋/2}) [∏_{i=0}^{n−1} D_{z_i}] / (χ(D) χ′(D)) |f|_∞,

where D_z is defined in (46).

Proof.
By Proposition 5, (cid:12)(cid:12)(cid:12)(cid:12) χ L θ h z n − i fχ L θ h z n − i X − χ ′ L θ h z n − i fχ ′ L θ h z n − i X (cid:12)(cid:12)(cid:12)(cid:12) = | ∆ θχ,χ ′ h z n − i ( f, X ) | χ L θ h z n − i X × χ ′ L θ h z n − i X ≤ ρ ⌊ n ( β − γ − ) ⌋ | f | ∞ + 2 η ⌊ n ( γ + − β ) ⌋ / Q n − i =0 | L θ h z i i ( · , X ) | ∞ χ L θ h z n − i X × χ ′ L θ h z n − i X | f | ∞ , where we have used that χ L θ h z n − i fχ L θ h z n − i X ∨ χ ′ L θ h z n − i fχ ′ L θ h z n − i X ≤ | f | ∞ . The proof follows by noting that (44) implies that Q n − i =0 | L θ h z i i ( · , X ) | ∞ χ L θ h z n − i X × χ ′ L θ h z n − i X ≤ [ Q n − i =0 D z i ] χ ( D ) χ ′ ( D ) . (cid:3) LE IN MISSPECIFIED HMMS Proof of Proposition 1.
Proof of (i). Let χ be a probability measure such that χ(D) >
0. The first step of the proof consists of using the for-getting property obtained in Lemma 13 to show that P -a.s., the sequence( p θχ ( Y | Y − − ℓ )) ℓ ≥ converges. Denote for any t ∈ { , . . . , r } , χ θm,t ( A ) = χ L θ h y − mr − − mr − t i A χ L θ h y − mr − − mr − t i X . Then, write for any m ≥ t ∈ { , . . . , r } and any y − mr − t ∈ Y mr + t +1 , p θχ ( y | y − − mr − t ) = p θχ θm,t ( y | z − − m ) = χ θm,t L θ h z − − m i ( g θ ( · , y )) χ θm,t L θ h z − − m i ( X ) . Let 0 < γ − < γ + <
1. Lemma 13 shows that for any t ∈ { , . . . , r } and η > ρ ∈ (0 ,
1) such that, if m − − X i = − m K ( z i ) ≥ max(1 − γ − , (1 + γ + ) / , then for all β ∈ ( γ − , γ + ), and θ ∈ Θ, | p θχ ( y | y − − mr − t ) − p θχ ( y | y − − mr ) |≤ ρ ⌊ m ( β − γ − ) ⌋ + η ⌊ m ( γ + − β ) ⌋ / χ θm,t ( D ) χ ( D ) − Y j = − m ( D z j ) ! sup θ ∈ Θ | g θ ( · , y ) | ∞ ≤ ρ ⌊ m ( β − γ − ) ⌋ + η ⌊ m ( γ + − β ) ⌋ / D ′− m − Y j = − m ( D z j ) ! sup θ ∈ Θ | g θ ( · , y ) | ∞ , where D ′− m = max t =1 ,...,r − θ ∈ Θ χ θm,t ( D ) χ ( D ) . ( D ′− m ) m ≥ is a stationary sequence. Using the same argument as in the proofof (47), the condition χ ∈ M ( D , r ) [defined in (8)], we have E [ln + D ′− m ] < ∞ .By choosing γ + and γ − such that P Y [ Z ∈ K ] > max(1 − γ − , (1 + γ + ) /
2) andby applying Lemma 6, it follows that there exist ̺ χ ∈ (0 ,
1) and a P -a.s. finiterandom variable C χ such that for any ℓ ≥ | p θχ ( Y | Y − − ℓ ) − p θχ ( Y | Y − − ℓ − ) | ≤ C χ ̺ ℓχ , P -a.s.Similarly, for any probability measure χ ′ such that χ ′ ( D ) >
0, there exist ̺ χ,χ ′ ∈ (0 ,
1) and a P -a.s. finite random variable C χ,χ ′ such that for any ℓ ≥ | p θχ ( Y | Y − − ℓ ) − p θχ ′ ( Y | Y − − ℓ ) | ≤ C χ,χ ′ ̺ ℓχ,χ ′ , P -a.s. R. DOUC AND E. MOULINES
This implies that for any probability measure χ satisfying χ ( D ) >
0, thesequence ( p θχ ( Y | Y − − ℓ )) ℓ ≥ converges P -a.s. and that the limit denoted by p θ ( Y | Y − −∞ ) does not depend on χ . Then, by stationarity of ( Y ℓ ) ℓ ∈ Z , weobtain that for all k ≥ θ ∈ Θ,lim m →∞ p θχ ( Y k | Y k − − m ) = p θ ( Y k | Y k − −∞ ) , P -a.s. , which shows the first part of (i). To complete the proof of (i), it remains toprove that E [ | ln p θ ( Y k | Y k − −∞ ) | ] < ∞ . Since p θχ ( Y k | Y k − − m ) ≤ sup x ∈ X g θ ( x, Y k ),we have ln + p θχ ( Y k | Y k − −∞ ) ≤ ln + sup x ∈ X g θ ( x, Y k ) , which shows, under (A2), that E [ln + p θ ( Y k | Y k − −∞ )] < ∞ . (70)This allows us to define E [ln p θ ( Y k | Y k − −∞ )] as E [ln p θ ( Y k | Y k − −∞ )] = E [ln + p θ ( Y k | Y k − −∞ )] − E [ln − p θ ( Y k | Y k − −∞ )] , so that E [ln − p θ ( Y k | Y k − −∞ )] < ∞ provided that we have shown E [ln p θ ( Y k | Y k − −∞ )] > −∞ . By stationarity of ( Y k ) k ∈ Z , r E [ln p θ ( Y | Y − −∞ )] = r { E [ln + p θ ( Y | Y − −∞ )] − E [ln − p θ ( Y | Y − −∞ )] } = E " r − X k =0 ln + p θ ( Y k | Y k − −∞ ) − E " r − X k =0 ln − p θ ( Y k | Y k − −∞ ) (71) = E " r − X k =0 ln p θ ( Y k | Y k − −∞ ) , where the last equality follows by applying E ( A − B ) = E ( A ) − E ( B ) fornonnegative random variables A, B such that E ( A ) < ∞ . Now, note that r − Y k =0 p θ ( Y k | Y k − −∞ ) = r − Y k =0 lim m →∞ p θχ ( Y k | Y k − − mr ) = lim m →∞ r − Y k =0 p θχ ( Y k | Y k − − mr )= lim m →∞ p θχ ( Y r − | Y − − mr ) = lim m →∞ p θχ ( Z | Z − − m )= p θ ( Z | Z − −∞ ) . By plugging this expression into (71) and using E [ | ln p θχ ( Z | Z − −∞ ) | ] < ∞ (seeProposition 10), we finally obtain r E [ln p θ ( Y | Y − −∞ )] = E [ln p θ ( Z | Z − −∞ )] > −∞ , (72)which completes the proof of (i). LE IN MISSPECIFIED HMMS Proof of (ii). Let χ be a probability measure such that χ ( D ) > t ∈ { , . . . , r − } . 
Then, for any m ≥ m − ln p θχ ( Z m +10 ) ≤ m − ln p θχ ( Y mr + t ) + m − ln + A m,t (73) ≤ m − ln p θχ ( Z m ) + m − ln + B m,t + m − ln + A m,t , where A m,t , sup θ ∈ Θ sup x p θQ θ ( x, · ) ( Y ( m +1) r − mr + t +1 ) , B m,t , sup θ ∈ Θ sup x p θδ x ( Y mr + tmr ) . Note that ( A m,t ) m ≥ and ( B m,t ) m ≥ are stationary. Moreover, using (A2),it can be easily checked that E [ln + A m,t ] < ∞ , E [ln + B m,t ] < ∞ . Then, Lemma 7 may apply and for any β ∈ (0 , P -a.s. finiterandom variables A, B such that for all m ≥ A m,t ≤ Aβ − m , B m,t ≤ Bβ − m , P -a.s.so that, P -a.s., 0 ≤ lim sup m →∞ m − ln + A m,t ≤ − ln β, ≤ lim sup m →∞ m − ln + B m,t ≤ − ln β. By letting β ↑ m →∞ m − ln + A m,t = 0 , lim m →∞ m − ln + B m,t = 0 , P -a.s.(74)Now, note that ( A m,t ) m ≥ and ( B m,t ) m ≥ do not depend on θ ∈ Θ so that(74) together with (73) yieldslim sup m →∞ sup θ ∈ Θ m − | ln p θχ ( Y mr + t ) − ln p θχ ( Z m ) | = 0 , P -a.s.(75)Since t is chosen arbitrarily in { , . . . , r − } , we finally obtain using Propo-sition 10(ii), lim n →∞ n − ln p θχ ( Y n ) = r − lim m →∞ m − ln p θχ ( Z m )= r − E [ln p θ ( Z | Z − −∞ )]= E [ln p θ ( Y | Y − −∞ )] , P -a.s. , which completes the proof of Proposition 1. (cid:3) Proof of Theorem 2.
By Proposition 11(ii) and (72), the function θ ↦ ℓ(θ) is upper semi-continuous. Moreover, (72) also implies

Θ⋆ = argmax_{θ∈Θ} E[ln p^θ(Y_0 | Y_{−∞}^{−1})] = argmax_{θ∈Θ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})] = Θ⋆_b.

Now let t ∈ {0, …, r − 1} and recall that Z_m = Y_{mr}^{(m+1)r−1}. Theorem 12 together with (75) shows that

lim_{n→∞} d(θ̂_{χ,nr+t}, Θ⋆) = 0, P-a.s.   (76)

The proof of Theorem 2 is then complete since t is arbitrary in {0, …, r − 1}. □

Proof of Proposition 3.
Under these two conditions, for any u ∈ {0, …, r} and θ ∈ Θ,

χ L_θ⟨y_0^{u−1}⟩ 1_D ≥ (∏_{i=0}^{u−1} inf_{x_i∈D_i} g_θ(x_i, y_i)) ∫⋯∫ χ(dx_0) 1_D(x_u) ∏_{i=1}^u 1_{D_{i−1}}(x_{i−1}) Q_θ(x_{i−1}, dx_i) ≥ (∏_{i=0}^{u−1} inf_{x_i∈D_i} g_θ(x_i, y_i)) χ(D_0) δ^u. □

Proof of Lemma 4.
The proof proceeds by induction on u ∈ {1, …, r}. Assume that D_{u−1} is a compact subset; we show that there exists a compact set D_u such that inf_{x_{u−1}∈D_{u−1}} inf_{θ∈Θ} Q_θ(x_{u−1}, D_u) ≥ δ.

Let (x, θ) ∈ D_{u−1} × Θ and set δ < δ′ < 1. Since X = R^d is a complete separable metric space and 𝒳 is the associated Borel σ-field, there exists a sequence B_1^{x,θ}, B_2^{x,θ}, … of open balls of radius 1 covering X. Choose N_{x,θ} large enough that Q_θ(x, O_{x,θ}) ≥ δ′, where O_{x,θ} = ∪_{i≤N_{x,θ}} B_i^{x,θ}. Since for any open set O the function (x′, θ′) ↦ Q_{θ′}(x′, O) is lower semi-continuous, there exists a neighborhood V_{x,θ} (for the product topology on X × Θ) such that for all (x′, θ′) ∈ V_{x,θ}, Q_{θ′}(x′, O_{x,θ}) ≥ δ. Since O_{x,θ} is totally bounded, its closure, denoted K_{x,θ}, is a compact subset, which satisfies, for any (x′, θ′) ∈ V_{x,θ}, Q_{θ′}(x′, K_{x,θ}) ≥ δ.

Then ∪_{(x,θ)∈D_{u−1}×Θ} V_{x,θ} is a covering of D_{u−1} × Θ. Since the set D_{u−1} × Θ is compact, we may extract a finite subcover D_{u−1} × Θ ⊆ ∪_{i=1}^I V_{x_i,θ_i}. Take D_u = ∪_{i=1}^I K_{x_i,θ_i}. As a finite union of compact sets, D_u is a compact set, which satisfies, for all (x, θ) ∈ D_{u−1} × Θ, Q_θ(x, D_u) ≥ δ. This completes the proof. □

REFERENCES

[1]
[1] Barron, A. R. (1985). The strong ergodic theorem for densities: Generalized Shannon–McMillan–Breiman theorem. Ann. Probab.
[2] Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist.
[3] Budhiraja, A. and Ocone, D. (1997). Exponential stability of discrete-time filters for bounded observation noise. Systems Control Lett.
[4] Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer, New York. MR2159833
[5] Churchill, G. (1992). Hidden Markov chains and the analysis of genome structure. Computers and Chemistry
[6] Douc, R., Fort, G., Moulines, E. and Priouret, P. (2009). Forgetting the initial distribution for hidden Markov models. Stochastic Process. Appl.
[7] Douc, R. and Matias, C. (2001). Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli
[8] Douc, R., Moulines, E., Olsson, J. and van Handel, R. (2011). Consistency of the maximum likelihood estimator for general hidden Markov models. Ann. Statist.
[9] Douc, R., Moulines, É. and Rydén, T. (2004). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Ann. Statist.
[10] Fomby, T. B. and Hill, R. C., eds. (2003). Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later. Advances in Econometrics. Elsevier, Amsterdam. MR2531667
[11] Fredkin, D. R. and Rice, J. A. (1987). Correlation functions of a function of a finite-state Markov process with application to channel kinetics. Math. Biosci.
[12] Fuh, C.-D. (2006). Efficient likelihood estimation in state space models. Ann. Statist.
[13] Fuh, C.-D. (2010). Reply to "On some problems in the article Efficient likelihood estimation in state space models" by Cheng-Der Fuh [Ann. Statist. (2006) 2026–2068] [MR2604693]. Ann. Statist.
[14] Genon-Catalot, V. and Laredo, C. (2006). Leroux's method for general hidden Markov models. Stochastic Process. Appl.
[15] Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics. Univ. California Press, Berkeley.
[16] Hull, J. and White, A. (1987). The pricing of options on assets with stochastic volatilities. J. Finance
[17] Jensen, J. L. (2010). On some problems in the article Efficient likelihood estimation in state space models by Cheng-Der Fuh [Ann. Statist. (2006) 2026–2068] [MR2283726]. Ann. Statist.
[18] Juang, B. H. and Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics
[19] Kleptsyna, M. L. and Veretennikov, A. Y. (2008). On discrete time ergodic filters with wrong initial data. Probab. Theory Related Fields
[20] Le Gland, F. and Mevel, L. (2000). Basic properties of the projective product with application to products of column-allowable nonnegative matrices. Math. Control Signals Systems
[21] Le Gland, F. and Mevel, L. (2000). Exponential forgetting and geometric ergodicity in hidden Markov models. Math. Control Signals Systems
[22] Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models. Stochastic Process. Appl.
[23] Mamon, R. S. and Elliott, R. J., eds. (2007). Hidden Markov Models in Finance. International Series in Operations Research & Management Science. Springer, New York. MR2407726
[24] Mevel, L. and Finesso, L. (2004). Asymptotical statistics of misspecified hidden Markov models. IEEE Trans. Automat. Control
[25] Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer, London. MR1287609
[26] Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Ann. Math. Statist.
[27] van Handel, R. (2008). Discrete time nonlinear filters with informative observations are stable. Electron. Commun. Probab.
[28] Walters, P. (1982). An Introduction to Ergodic Theory. Graduate Texts in Mathematics. Springer, New York. MR0648108
[29] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica

SAMOVAR
CNRS UMR 5157
Institut Télécom/Télécom SudParis
9 rue Charles Fourier
91000 Evry
France
E-mail: [email protected]