Asymptotic properties of the maximum likelihood estimation in misspecified hidden Markov models
Institute of Mathematical Statistics, 2012
ASYMPTOTIC PROPERTIES OF THE MAXIMUM LIKELIHOOD ESTIMATION IN MISSPECIFIED HIDDEN MARKOV MODELS

By Randal Douc and Eric Moulines

Télécom SudParis and Télécom ParisTech
Let (Y_k)_{k∈Z} be a stationary sequence on a probability space (Ω, 𝒜, P) taking values in a standard Borel space Y. Consider the associated maximum likelihood estimator with respect to a parametrized family of hidden Markov models such that the law of the observations (Y_k)_{k∈Z} is not assumed to be described by any of the hidden Markov models of this family. In this paper we investigate the consistency of this estimator in such misspecified models under mild assumptions.
1. Introduction.
An assumption underlying most of the classical theory of maximum likelihood is that the "true" distribution of the observations is known to lie within a specified parametric family of distributions. In many settings, it is doubtful that this assumption is satisfied. It is therefore natural to investigate the convergence of the maximum likelihood estimator (MLE) and to identify the possible limit for misspecified models. Such questions have been mainly investigated for models in which the observations are independent; see [15, 29]. Much less is known on the behavior of the MLE for dependent observations; see [10] and the references therein. For independent observations, under mild additional technical conditions, the MLE converges to the parameter which minimizes the relative entropy rate; see [15]. The purpose of this paper is to show that such a result remains true when the observations are from an ergodic process and for classes of parametric distributions associated to hidden Markov models (HMM). An HMM is a bivariate stochastic process (X_k, Y_k)_{k≥0}, where (X_k)_{k≥0} is a Markov chain (often referred to as the state sequence) in a state space X and, conditionally on (X_k)_{k≥0}, (Y_k)_{k≥0} is a sequence of independent random

Received October 2011; revised July 2012. Supported by the Agence Nationale de la Recherche through the 2009–2012 project Big MC.
AMS 2000 subject classifications.
Primary 62M09; secondary 62F12.
Key words and phrases.
Strong consistency, hidden Markov models, maximum likelihood estimator, misspecified models, state space models.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2012, Vol. 40, No. 5, 2697–2732. This reprint differs from the original in pagination and typographic detail.
variables in a state space Y such that the conditional distribution of Y_k given the state sequence depends on X_k only. The key feature of HMMs is that the state sequence (X_k)_{k≥0} is not observable, so that statistical inference has to be carried out by means of the observations (Y_k)_{k≥0} only. Such problems are far from straightforward due to the fact that the observation process (Y_k)_{k≥0} is generally a dependent, non-Markovian time series [even though the bivariate process (X_k, Y_k)_{k≥0} is itself a Markov chain].

HMMs have been intensively used in many scientific disciplines including econometrics [16, 23], biology [5], engineering [18] and neurophysiology [11], and their statistical inference is therefore of significant practical importance [4]. In all these applications, misspecified models are the rule, so it is worthwhile to understand the behavior of the MLE under such a regime.

This work extends previous results in this direction obtained by Mevel and Finesso [24], which are restricted to discrete state-space Markov chains. Our main result on the consistency of the MLE in misspecified HMMs is derived under assumptions which are quite weak, covering general state-space HMMs under conditions which are much weaker than those of [9], where a strong mixing condition was imposed on the transition kernels of the hidden chain. Therefore our results can be applied to many models of practical interest, including the Gaussian linear state space model, the discrete state-space HMM and more general nonlinear state-space models.

The paper is organized as follows. In Section 2, we first introduce the setting and notation that are used throughout the paper. In Section 3, we state our main assumptions and results.
In Section 4, our main result is used to establish consistency in three general classes of models: linear Gaussian state space models, finite state models and nonlinear state space models of the vector ARCH type (this includes the stochastic volatility model and many other models of interest in time series analysis and financial econometrics). Section 5 is devoted to the proof of our main result.

Notation.
Some notation pertaining to transition kernels is required. Let L be a (possibly unnormalized) transition kernel on (X, 𝒳), that is, for any x ∈ X, L(x, ·) is a finite measure on (X, 𝒳) and, for any A ∈ 𝒳, x ↦ L(x, A) is a measurable function from (X, 𝒳) to ([0, ∞), ℬ([0, ∞))). The kernel L acts on bounded functions f on X and on σ-finite positive measures μ on (X, 𝒳) via

Lf(x) = δ_x L f := ∫ L(x, dy) f(y),    μL(A) = μL 1_A := ∫ μ(dx) L(x, A).

If L and L′ are two transition kernels on (X, 𝒳), then LL′ is the transition kernel on (X, 𝒳) given, for any x ∈ X and A ∈ 𝒳, by

LL′(x, A) = ∫ L(x, dy) L′(y, A).
2. Problem statement.
We consider a parameterized family of HMMs with parameter space Θ, assumed to be a compact metric space. For each parameter θ ∈ Θ, the distribution of the HMM is specified by the transition kernel Q^θ of the Markov chain (X_k)_{k≥0}, and by the conditional distribution g^θ of the observation Y_k given the hidden state X_k, referred to as the likelihood of the observation.

For any m ≤ n and any sequence {a_k}_{k∈Z}, denote a_m^n := (a_m, ..., a_n), and for any probability measure χ on (X, 𝒳), define the likelihood of the observations by

p_χ^θ(y_m^n) := ∫···∫ χ(dx_m) g^θ(x_m, y_m) ∏_{p=m+1}^n Q^θ(x_{p−1}, dx_p) g^θ(x_p, y_p),

p_χ^θ(y_p^n | y_m^{p−1}) := p_χ^θ(y_m^n) / p_χ^θ(y_m^{p−1}),    m < p ≤ n,

with the standard convention ∏_{p=m}^n a_p = 1 if m > n.

Let (Ω, ℱ, P) be a probability space, and let (Y_k)_{k∈Z} be a stationary ergodic stochastic process taking values in (Y, 𝒴). We denote by P^Y the image probability of P by (Y_k)_{k∈Z} on the product space (Y^Z, 𝒴^{⊗Z}), and by E^Y the associated expectation. We stress that the distribution P^Y may or may not belong to the parametric family of distributions specified by the transition kernels {(Q^θ, g^θ), θ ∈ Θ}. If P^Y does not belong to this family, the model is said to be misspecified.

If χ is a probability measure on (X, 𝒳), we define the maximum likelihood estimator (MLE) associated to the initial distribution χ by

θ̂_{χ,n} := argmax_{θ∈Θ} ln p_χ^θ(Y_0^{n−1}).    (1)

The study of asymptotic properties of the MLE in HMMs was initiated in the seminal work of Baum and Petrie [2, 26] in the 1960s. In these papers, the model is assumed to be well specified, and the state space X and the observation space Y were both presumed to be finite sets. More than two decades later, Leroux [22] proved consistency for well-specified models in the case where X is a finite set and Y is a general state space.
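To make (1) concrete, here is a minimal numerical sketch (our own illustration, not taken from the paper): for a finite state space, the likelihood p_χ^θ(y_0^{n−1}) can be evaluated by the standard forward recursion, and the maximization in (1) can be approximated over a grid of the compact parameter set Θ. The two-state chain, the Bernoulli emission densities and the parametrization by θ are all illustrative assumptions.

```python
import math

def log_likelihood(chi, Q, g, ys):
    """Forward recursion: log p_chi^theta(y_0^{n-1}) for a finite-state HMM.
    chi: initial distribution, Q: transition matrix, g[x](y): emission density."""
    alpha = [chi[x] * g[x](ys[0]) for x in range(len(chi))]
    ll = 0.0
    for y in ys[1:]:
        c = sum(alpha)                       # normalize to avoid underflow
        ll += math.log(c)
        alpha = [a / c for a in alpha]
        alpha = [sum(alpha[x] * Q[x][j] for x in range(len(alpha))) * g[j](y)
                 for j in range(len(alpha))]
    return ll + math.log(sum(alpha))

def make_g(theta):
    # illustrative emissions: state 0 emits 1 with prob theta, state 1 with 1-theta
    return [lambda y, p=theta: p if y == 1 else 1 - p,
            lambda y, p=1 - theta: p if y == 1 else 1 - p]

Q = [[0.9, 0.1], [0.1, 0.9]]
chi = [0.5, 0.5]
ys = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

# Grid-search approximation of the argmax in (1) over Theta = [0.1, 0.9].
grid = [0.1 + 0.05 * i for i in range(17)]
theta_hat = max(grid, key=lambda t: log_likelihood(chi, Q, make_g(t), ys))
```

Note that this toy model is invariant under swapping the two states together with θ ↔ 1 − θ, so the grid maximizer is only identified up to that symmetry.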
The consistency of the MLE in more general HMMs has subsequently been investigated for well-specified models in a series of contributions [7, 9, 14, 20, 21] using different methods. A general consistency result for HMMs has been developed in [8]. Though the consistency results above differ in the details of their proofs, all proofs have a common thread which serves also as the starting point for this paper. Denote by p_χ^θ(Y_0^n) the likelihood of the observations Y_0^n for the HMM with parameter θ ∈ Θ and initial distribution χ. The first step of the proof aims to establish that for any θ ∈ Θ, there is a constant ℓ(θ) such that

lim_{n→∞} n^{−1} log p_χ^θ(Y_0^{n−1}) = lim_{n→∞} n^{−1} E[log p_χ^θ(Y_0^{n−1})] = ℓ(θ),    P-a.s.
Up to an additive constant, θ ↦ ℓ(θ) is the negated relative entropy rate between the distribution of the observations and p_χ^θ(·). When the model is well specified and θ = θ⋆ is the true value of the parameter, this convergence follows from the generalized Shannon–Breiman–McMillan theorem [1]; for misspecified models, or for well-specified models with θ ≠ θ⋆, the existence of the limit is far from obvious.

The second step of the proof aims to show that the maximizer of the normalized log-likelihood θ ↦ n^{−1} log p_χ^θ(Y_0^{n−1}) converges P-a.s. to the maximizer of θ ↦ ℓ(θ), that is, to the minimizer of the relative entropy rate. Together, these two steps show that the MLE is a natural estimator for the parameter which minimizes the relative entropy rate in the parametric family {(Q^θ, g^θ), θ ∈ Θ}.

Let us note that one can write the normalized log-likelihood as

n^{−1} log p_χ^θ(Y_0^{n−1}) = (1/n) ∑_{k=0}^{n−1} log p_χ^θ(Y_k | Y_0^{k−1}),

where p_χ^θ(Y_k | Y_0^{k−1}) denotes the conditional density of Y_k given Y_0^{k−1} under the misspecified model with parameter θ (i.e., the one-step predictive density). If the limit p_χ^θ(Y_k | Y_0^{k−1}) → π_Y^θ(Y_{−∞}^k) as k → ∞ can be shown to exist P-a.s. and to be P-integrable, the convergence of the log-likelihood to the relative entropy rate follows from the Birkhoff ergodic theorem, since the process {Y_k}_{k∈Z} is assumed to be ergodic. This result provides an explicit representation of the relative entropy rate ℓ(θ) as the expectation of the limit, ℓ(θ) = E[log π_Y^θ(Y_{−∞}^0)].
The limit π_Y^θ(Y_{−∞}^k) may be interpreted as the conditional likelihood of Y_k given the whole past Y_{−∞}^{k−1}, but we must refrain from considering this quantity as a genuine conditional density.

Such an approach was used in [2] for finite state spaces, and was later extended by Douc, Moulines and Rydén [9] to general state spaces, but under stringent technical conditions (uniform mixing of the Markov kernel, which more or less restricts the validity of the results to compact state spaces, leaving aside important models, such as linear Gaussian state space models). Alternatively, the predictive distribution p_χ^θ(Y_k | Y_0^{k−1}) can be expressed as a component of the state of a measure-valued Markov chain; in this approach, the existence of the limiting relative entropy rate ℓ(θ) follows from the ergodic theorem for Markov chains, provided that this Markov chain can be shown to be ergodic. This approach was used in [7, 20, 21] and was later extended to misspecified models in [24]. This technique is adequate for finite state-space Markov chains, but does not extend easily to general state-space Markov chains; see [7].

In [22], the existence of the relative entropy rate is established by means of Kingman's subadditive ergodic theorem (the same approach is used indirectly in [26], which invokes the Furstenberg–Kesten theory of random matrix products). After some additional work, an explicit representation of the relative entropy rate is again obtained.
However, as is noted in [22], page 136, this last step is surprisingly difficult, as Kingman's ergodic theorem does not directly yield a representation of the limit as an expectation.

For completeness, we note that a recent attempt [12] to prove consistency of the MLE for general HMMs contains very serious problems in the proof [17] (not addressed in [13]), and therefore fails to establish the claimed results.

In this paper, we prove consistency of the MLE for general HMMs in misspecified models under quite general assumptions. Our proof broadly follows the original approach of Baum and Petrie [2] and Douc, Moulines and Rydén [9], but relaxes the very restrictive technical conditions used in these works and extends the analysis to misspecified models. The key technique used to obtain this result is to establish the exponential forgetting of the filtering distribution; this is achieved by means of a coupling technique originally introduced in [19] and refined in [6].
3. Assumptions and main results.
For any integer t ≥ 1, θ ∈ Θ and any sequence y_0^{t−1} ∈ Y^t, consider the unnormalized kernel L^θ⟨y_0^{t−1}⟩ on (X, 𝒳) defined, for all x_0 ∈ X and A ∈ 𝒳, by

L^θ⟨y_0^{t−1}⟩(x_0, A) = ∫···∫ [∏_{i=0}^{t−1} g^θ(x_i, y_i) Q^θ(x_i, dx_{i+1})] 1_A(x_t).    (2)

Note that, for any t ≥ 1, θ ∈ Θ, x ∈ X and y_0^{t−1} ∈ Y^t,

L^θ⟨y_0^{t−1}⟩(x, X) = p_x^θ(y_0^{t−1}),    (3)

where for x ∈ X and s ≤ t, p_x^θ(y_s^t), the likelihood of the observations y_s^t starting from state x, is a shorthand notation for p_{δ_x}^θ(y_s^t).

Definition 1.
Let r be an integer. A set C ∈ 𝒳 is an r-local Doeblin set with respect to the family {Q^θ, g^θ}_{θ∈Θ} if there exist positive functions ǫ_C^−: Y^r → R_+ and ǫ_C^+: Y^r → R_+, a family of probability measures {λ_C^θ⟨z⟩}_{θ∈Θ, z∈Y^r} and a family of positive functions {φ_C^θ⟨z⟩}_{θ∈Θ, z∈Y^r} such that, for any θ ∈ Θ and z ∈ Y^r, λ_C^θ⟨z⟩(C) = 1 and, for any A ∈ 𝒳 and x ∈ C,

ǫ_C^−(z) φ_C^θ⟨z⟩(x) λ_C^θ⟨z⟩(A) ≤ L^θ⟨z⟩(x, A ∩ C) ≤ ǫ_C^+(z) φ_C^θ⟨z⟩(x) λ_C^θ⟨z⟩(A).    (4)

This implies that, for any measurable nonnegative function f on (X, 𝒳), x ∈ C and z ∈ Y^r,

ǫ_C^−(z) φ_C^θ⟨z⟩(x) λ_C^θ⟨z⟩(1_C f) ≤ δ_x L^θ⟨z⟩(1_C f) ≤ ǫ_C^+(z) φ_C^θ⟨z⟩(x) λ_C^θ⟨z⟩(1_C f).

We require that the condition be satisfied for any θ ∈ Θ, but this is not a serious restriction since Θ is assumed to be compact.
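As a sanity check of Definition 1 in the simplest setting (our own illustration: a finite state space, C = X, r = 1, with the g-factor absorbed into φ), the two-sided bound reduces to ǫ_C^− λ_C(A) ≤ Q(x, A ∩ C) ≤ ǫ_C^+ λ_C(A) with λ_C uniform. The sketch below verifies this by enumerating every subset A; all numerical values are illustrative assumptions.

```python
from itertools import combinations

# Illustrative 3-state transition matrix (our own choice of numbers).
Q = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
d = len(Q)

# With C = X and lambda_C uniform (lambda_C(A) = |A| / d), the tightest
# constants in the two-sided bound are d * min Q and d * max Q.
qmin = min(min(row) for row in Q)
qmax = max(max(row) for row in Q)
eps_minus, eps_plus = d * qmin, d * qmax

def lam(A):
    # uniform probability measure on C = {0, ..., d-1}
    return len(A) / d

# Verify eps^- * lam(A) <= Q(x, A) <= eps^+ * lam(A) for every x and A.
for x in range(d):
    for k in range(1, d + 1):
        for A in combinations(range(d), k):
            QxA = sum(Q[x][xp] for xp in A)
            assert eps_minus * lam(A) <= QxA + 1e-12
            assert QxA <= eps_plus * lam(A) + 1e-12
```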
Remark 1.
To illustrate this condition, consider the case r = 1. Assume that for some set C there exist positive constants ǫ_C^−, ǫ_C^+ and a family of probability measures {λ_C^θ}_{θ∈Θ} such that, for any θ ∈ Θ, λ_C^θ(C) = 1 and, for any A ∈ 𝒳 and x ∈ C,

ǫ_C^− λ_C^θ(A) ≤ Q^θ(x, A ∩ C) ≤ ǫ_C^+ λ_C^θ(A).

Then, clearly, L^θ⟨y⟩(x, A) = g^θ(x, y) Q^θ(x, A) satisfies (4), where ǫ_C^− and ǫ_C^+ are positive constants. In this case, C is a 1-local Doeblin set with respect to Q^θ; see [6] and [19].

Remark 2.
Local Doeblin sets share some similarities with 1-small sets in the theory of Markov chains on general state spaces; see [25], Chapter 5. Recall that a set C is 1-small for the kernel Q^θ, θ ∈ Θ, if there exist a probability measure λ̃_C^θ and a constant ǫ̃_C > 0 such that λ̃_C^θ(C) = 1 and, for all x ∈ C and A ∈ 𝒳, Q^θ(x, A ∩ C) ≥ ǫ̃_C λ̃_C^θ(A ∩ C). In particular, a local Doeblin set is 1-small with ǫ̃_C = ǫ_C^− and λ̃_C^θ = λ_C^θ. The main difference stems from the fact that we impose both a lower and an upper bound, and we impose that the minorizing and the majorizing measures are the same.

(A1) There exist an integer r ≥ 1 and a set K ∈ 𝒴^{⊗r} such that:

(i) P[Y_0^{r−1} ∈ K] > 2/3.

(ii) For all η > 0, there exists an r-local Doeblin set C ∈ 𝒳 such that, for all θ ∈ Θ and all y_0^{r−1} ∈ K,

sup_{x∈C^c} p_x^θ(y_0^{r−1}) ≤ η sup_{x∈X} p_x^θ(y_0^{r−1}) < ∞    (5)

and

inf_{y_0^{r−1}∈K} ǫ_C^−(y_0^{r−1}) / ǫ_C^+(y_0^{r−1}) > 0,    (6)

where the functions ǫ_C^+ and ǫ_C^− are defined in Definition 1.

(iii) There exists a set D such that

E[ln^− inf_{θ∈Θ} inf_{x∈D} L^θ⟨Y_0^{r−1}⟩(x, D)] < ∞.    (7)

(A2) (i) For any θ ∈ Θ, the function g^θ: (x, y) ∈ X × Y ↦ g^θ(x, y) is positive.

(ii) E[ln^+ sup_{θ∈Θ} sup_{x∈X} g^θ(x, Y_0)] < ∞.

(A3) There exists p ∈ N such that, for any x ∈ X and n ≥ p, P-a.s. the function θ ↦ p_x^θ(Y_0^n) is continuous on Θ.

Remark 3.
Assumption (A2) requires that the conditional likelihood g^θ be positive. The case where g^θ can vanish typically requires different conditions; see [3, 27]. The second condition can be read as a generalized moment condition on Y_0. It is satisfied in many examples of interest.

Remark 4.
To check (A1)(iii), one may, for example, check that:

(i) inf_{x∈D} inf_{θ∈Θ} Q^θ(x, D) > 0;
(ii) E[ln^− inf_{θ∈Θ} inf_{x∈D} g^θ(x, Y_0)] < ∞.

Condition (ii) is satisfied if (x, θ) ↦ g^θ(x, y) is continuous and D is compact. Condition (i) holds if D is a small set for all θ ∈ Θ, that is, if there exist a probability measure ν^θ such that ν^θ(D) = 1 and a constant δ > 0 such that, for all x ∈ D and A ∈ 𝒳, Q^θ(x, A) ≥ δν^θ(A). Note, however, that (A1)(iii) is far weaker than imposing that the set D is 1-small. This is important to deal with examples for which the transition kernel Q^θ(x, ·) does not admit a density with respect to some fixed dominating measure; see, for example, Section 4.1.

Remark 5.
Assumption (A3) is in general a consequence of the continuity of θ ↦ Q^θ(x, ·) and of θ ↦ g^θ(x, ·), using classical techniques for integrals depending on a parameter.

Remark 6.
According to (3), bound (5) may also be rewritten in terms of the kernel L^θ⟨y_0^{r−1}⟩ as

sup_{x∈C^c} L^θ⟨y_0^{r−1}⟩(x, X) ≤ η sup_{x∈X} L^θ⟨y_0^{r−1}⟩(x, X) < ∞.

The convergence of the relative entropy is obtained for initial distributions belonging to a particular class of probability measures. For the integer r and the set D ∈ 𝒳 defined in (A1), let M(D, r) be the subset of P(X, 𝒳), the set of probability measures on (X, 𝒳), defined by

M(D, r) = {χ ∈ P(X, 𝒳): E[ln^− inf_{θ∈Θ} χL^θ⟨Y_0^{u−1}⟩1_D] < ∞ for all u ∈ {1, ..., r}}.    (8)

Proposition 1.
Assume (A1) and (A2). Then:

(i) for any θ ∈ Θ, there exists a measurable function π_Y^θ: Y^{Z_−} → R such that, for any probability measure χ ∈ M(D, r),

P[lim_{m→∞} p_χ^θ(Y_0 | Y_{−m}^{−1}) = π_Y^θ(Y_{−∞}^0)] = 1;    moreover, E[|ln π_Y^θ(Y_{−∞}^0)|] < ∞;    (9)

(ii) for any θ ∈ Θ and any probability measure χ ∈ M(D, r),

lim_{n→∞} n^{−1} ln p_χ^θ(Y_0^{n−1}) = ℓ(θ),    P-a.s., where ℓ(θ) := E[ln π_Y^θ(Y_{−∞}^0)].
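Proposition 1(ii) can be illustrated numerically (our own sketch; all model and parameter choices are illustrative assumptions): for a candidate two-state HMM evaluated on data whose law need not belong to the model family, here i.i.d. Bernoulli draws, the normalized log-likelihood n^{−1} ln p_χ^θ(Y_0^{n−1}) stabilizes as n grows.

```python
import math
import random

def norm_loglik(chi, Q, g, ys):
    # (1/n) * log p_chi^theta(y_0^{n-1}) via the normalized forward recursion
    alpha = [chi[x] * g[x](ys[0]) for x in range(len(chi))]
    ll = 0.0
    for y in ys[1:]:
        c = sum(alpha)
        ll += math.log(c)
        alpha = [a / c for a in alpha]
        alpha = [sum(alpha[x] * Q[x][j] for x in range(len(alpha))) * g[j](y)
                 for j in range(len(alpha))]
    return (ll + math.log(sum(alpha))) / len(ys)

random.seed(0)
ys = [1 if random.random() < 0.6 else 0 for _ in range(20000)]  # data: i.i.d. Bernoulli

# Candidate HMM with persistent hidden chain: the data-generating process
# need not be (and here is not) matched by this parametrization.
Q = [[0.95, 0.05], [0.05, 0.95]]
g = [lambda y: 0.7 if y == 1 else 0.3,
     lambda y: 0.3 if y == 1 else 0.7]
chi = [0.5, 0.5]

ell_half = norm_loglik(chi, Q, g, ys[:10000])
ell_full = norm_loglik(chi, Q, g, ys)
# ell_half and ell_full are already close, illustrating the a.s. limit ell(theta)
```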
Theorem 2.
Assume (A1)–(A3). Then θ ↦ ℓ(θ) is upper semi-continuous and, defining Θ⋆ ⊂ Θ by Θ⋆ := argmax_{θ∈Θ} ℓ(θ), we have, for any probability measure χ ∈ M(D, r),

lim_{n→∞} d(θ̂_{χ,n}, Θ⋆) = 0,    P-a.s.
When the model is well specified, the law of the observations belongs to the parametric family of distributions over which the maximization occurs, and is therefore associated to a specific parameter θ⋆. In this particular case, under appropriate assumptions, the set Θ⋆ reduces to the singleton {θ⋆}, and the consistency result for the MLE in well-specified models can then be written as (see [8])

lim_{n→∞} d(θ̂_{χ,n}, θ⋆) = 0,    P-a.s.

A simple sufficient condition can be proposed to ensure that χ ∈ M(D, r).

Proposition 3.
Assume there exists a sequence of sets D_u ∈ 𝒳, u ∈ {0, ..., r−1}, such that (setting D_r = D for notational convenience), for some δ > 0,

inf_{x_{u−1}∈D_{u−1}} inf_{θ∈Θ} Q^θ(x_{u−1}, D_u) ≥ δ,    u ∈ {1, ..., r},    (10)

and

E[ln^− inf_{θ∈Θ} inf_{x∈D_u} g^θ(x, Y_0)] < ∞    for u ∈ {0, ..., r−1}.    (11)

Then, any initial distribution χ on (X, 𝒳) satisfying χ(D_0) > 0 belongs to M(D, r).
To check (11), we typically assume that, for any given y ∈ Y, the function (x, θ) ↦ g^θ(x, y) is continuous and that D_i × Θ is a compact set for i ∈ {0, ..., r−1}. This condition then translates into an assumption on some generalized moments of the process Y.

To check (10), the following lemma is useful.

Lemma 4.
Assume that X = R^d for some integer d > 0 and that 𝒳 is the associated Borel σ-field. Assume in addition that, for any open subset O ∈ 𝒳, the function (x, θ) ↦ Q^θ(x, O) is lower semi-continuous on the product space X × Θ. Then, for any δ > 0 and any compact subset D ⊂ X, there exists a sequence of compact subsets D_u, u ∈ {0, ..., r−1}, satisfying (10).
4. Applications.
In this section, we develop three classes of examples. In Section 4.1, we consider linear Gaussian state space models. This is obviously a very important model, which is routinely used to analyze time series. We analyze this model under assumptions which are very general
and might serve to illustrate the stated assumptions. In Section 4.2, we consider the classic case where the state space of the underlying Markov chain is a finite set. Finally, in Section 4.3, we develop a general class of nonlinear state space models. In all these examples, we will find that the assumptions of Theorem 2 are satisfied under general conditions.

4.1. Gaussian linear state space models.
Gaussian linear state space models form an important class of HMMs. In this setting, let X = R^{d_x} and Y = R^{d_y} for some integers d_x and d_y, and let Θ be a compact parameter space. The model is specified by

X_{k+1} = A_θ X_k + R_θ U_k,    (12)

Y_k = B_θ X_k + S_θ V_k,    (13)

where {(U_k, V_k)}_{k≥0} is an i.i.d. sequence of Gaussian vectors with zero mean and identity covariance matrix, independent of X_0. Here U_k is d_u-dimensional, V_k is d_y-dimensional and the matrices A_θ, R_θ, B_θ, S_θ have the appropriate dimensions.

For any integer n, define the observability matrix O_{θ,n} and the controllability matrix C_{θ,n} by

O_{θ,n} := [B_θ; B_θA_θ; B_θA_θ²; ...; B_θA_θ^{n−1}] (blocks stacked vertically),    C_{θ,n} := [A_θ^{n−1}R_θ, A_θ^{n−2}R_θ, ..., R_θ].    (14)

It is assumed in the sequel that, for any θ ∈ Θ, the following hold:

(L1) The pair [A_θ, B_θ] is observable and the pair [A_θ, R_θ] is controllable; that is, there exists an integer r such that the observability matrix O_{θ,r} and the controllability matrix C_{θ,r} are full rank.

(L2) The measurement noise covariance matrix S_θ is full rank.

(L3) The functions θ ↦ A_θ, θ ↦ R_θ, θ ↦ B_θ and θ ↦ S_θ are continuous on Θ.

(L4) E[‖Y_0‖²] < ∞.

We now check the assumptions of Theorem 2. The dimension d_u of the state noise vector U_k is in many situations smaller than the dimension d_x of the state vector X_k, and hence R_θ ᵗR_θ (where ᵗA is the transpose of the matrix A) may be rank deficient.

Some additional notation is needed. For any positive definite matrix A and any vector z of appropriate dimension, denote ‖z‖²_A = ᵗz A^{−1} z. Define, for any integer n,

F_{θ,n} = D_{θ,n} ᵗD_{θ,n} + S_{θ,n} ᵗS_{θ,n},    (15)

where D_{θ,n} is the lower block-triangular matrix with zero blocks on and above the diagonal and with (i, j)-block B_θ A_θ^{i−j−1} R_θ for i > j, and S_{θ,n} is the block-diagonal matrix with diagonal blocks equal to S_θ. Under (L2), for any n ≥ r, the matrix F_{θ,n} is positive definite.
The likelihood of the observations y_0^{n−1} ∈ Y^n starting from x is given by

p_x^θ(y_0^{n−1}) = (2π)^{−nd_y/2} det^{−1/2}(F_{θ,n}) exp(−(1/2)‖y_0^{n−1} − O_{θ,n}x‖²_{F_{θ,n}}),    (16)

where y_0^{n−1} = ᵗ[ᵗy_0, ᵗy_1, ..., ᵗy_{n−1}] and O_{θ,n} is defined in (14).

Consider first (A1). Under (L1), the observability matrix O_{θ,r} is full rank; hence, for any compact subset K ⊂ Y^r,

lim_{‖x‖→∞} inf_{y_0^{r−1}∈K} ‖y_0^{r−1} − O_{θ,r}x‖²_{F_{θ,r}} = ∞,

showing that, for all η > 0, we may choose a compact set C in such a way that (5) is satisfied. It remains to prove that any compact set C is an r-local Doeblin set satisfying condition (6). For any y_0^{r−1} ∈ Y^r and x_0 ∈ X, the measure L^θ⟨y_0^{r−1}⟩(x_0, ·) is absolutely continuous with respect to the Lebesgue measure on X, with Radon–Nikodym derivative ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r) given (up to an irrelevant multiplicative factor) by

ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r) ∝ det^{−1/2}(G_{θ,r}) exp(−(1/2)‖[y_0^{r−1}; x_r] − [O_{θ,r}; A_θ^r]x_0‖²_{G_{θ,r}}),    (17)

where the covariance matrix G_{θ,r} is given by

G_{θ,r} = [D_{θ,r}; C_{θ,r}][ᵗD_{θ,r} ᵗC_{θ,r}] + [S_{θ,r}; 0][ᵗS_{θ,r} 0].

The proof of (17) relies on the positivity of G_{θ,r}, which requires further discussion. By construction, the matrix G_{θ,r} is nonnegative. For any y_0^{r−1} ∈ Y^r and x ∈ X, the equation

[ᵗy_0^{r−1} ᵗx] G_{θ,r} [y_0^{r−1}; x] = ‖ᵗD_{θ,r} y_0^{r−1} + ᵗC_{θ,r} x‖² + ‖ᵗS_{θ,r} y_0^{r−1}‖² = 0

implies that ‖ᵗD_{θ,r} y_0^{r−1} + ᵗC_{θ,r} x‖² = 0 and ‖ᵗS_{θ,r} y_0^{r−1}‖² = 0. Since the matrix S_{θ,r} is full rank, this implies that y_0^{r−1} = 0. Since C_{θ,r} is full rank (the pair [A_θ, R_θ] is controllable), this in turn implies that x = 0. Therefore, the matrix G_{θ,r} is positive definite and, for any y_0^{r−1}, the function

(x_0, x_r) ↦ ‖[y_0^{r−1}; x_r] − [O_{θ,r}; A_θ^r]x_0‖²_{G_{θ,r}}

is continuous, and is therefore bounded on any compact subset of X × X. This implies that every nonempty compact set C ⊂ R^{d_x} is an r-local Doeblin set, with λ_C^θ(·) = λ^Leb(· ∩ C)/λ^Leb(C) and

ǫ_C^−(y_0^{r−1}) = (λ^Leb(C))^{−1} inf_{θ∈Θ} inf_{(x_0,x_r)∈C×C} ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r),

ǫ_C^+(y_0^{r−1}) = (λ^Leb(C))^{−1} sup_{θ∈Θ} sup_{(x_0,x_r)∈C×C} ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r).

Therefore, condition (6) is satisfied for any compact set K ⊂ Y^r.
It remains to show (A1)(iii). Under (L1), L^θ⟨y_0^{r−1}⟩(x_0, ·) is absolutely continuous with respect to the Lebesgue measure λ^Leb. Therefore, for any set D,

inf_{θ∈Θ} inf_{x_0∈D} L^θ⟨y_0^{r−1}⟩(x_0, D) ≥ inf_{θ∈Θ} inf_{(x_0,x_r)∈D×D} ℓ^θ⟨y_0^{r−1}⟩(x_0, x_r) λ^Leb(D).

Take D to be any compact set with positive Lebesgue measure. Then

sup_{θ∈Θ} sup_{(x_0,x_r)∈D×D} ‖[y_0^{r−1}; x_r] − [O_{θ,r}; A_θ^r]x_0‖²_{G_{θ,r}} ≤ 2λ_max(G_{θ,r}^{−1}) {‖y_0^{r−1}‖² + max_{x∈D} ‖x‖² [1 + λ_max(ᵗO_{θ,r}O_{θ,r} + ᵗA_θ^r A_θ^r)]},

where λ_max(A) is the largest eigenvalue of A. Under (L3), θ ↦ λ_max(G_{θ,r}^{−1}) and θ ↦ λ_max(ᵗO_{θ,r}O_{θ,r} + ᵗA_θ^r A_θ^r) are bounded. Under (L4), E[‖Y_0‖²] < ∞, and hence (A1)(iii) is satisfied for any compact set D with positive Lebesgue measure.

Consider now (A2). Under (L2), S_θ is full rank and, choosing the reference measure μ to be the Lebesgue measure on Y, we find that, for each x ∈ X, g^θ(x, ·) is a Gaussian density with covariance matrix S_θ ᵗS_θ. We therefore have

sup_{θ∈Θ} sup_{x∈X} g^θ(x, y) = (2π)^{−d_y/2} sup_{θ∈Θ} det^{−1/2}(S_θ ᵗS_θ) < ∞,

so that (A2)(i) and (ii) are satisfied.

We finally check (A3). For any n ≥ r and x ∈ X, the function θ ↦ p_x^θ(y_0^{n−1}) is given by (16). Under (L3), the functions θ ↦ O_{θ,n} [where O_{θ,n} is the observability matrix defined in (14)] and θ ↦ det^{−1/2}(F_{θ,n}) [where F_{θ,n} is the covariance matrix defined in (15)] are continuous on Θ for any n ≥ r. Thus, for any x ∈ X, θ ↦ p_x^θ(y_0^{n−1}) is continuous for every n ≥ r, showing (A3).
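In practice the likelihood (16) is not evaluated by forming F_{θ,n} explicitly; it factorizes into one-step predictive Gaussian densities computed by the Kalman filter. The following scalar sketch (our own illustration; the parameter values and function name are assumptions, not from the paper) shows the recursion.

```python
import math

def kalman_loglik(a, r, b, s, m0, P0, ys):
    """Scalar analogue of (16): log-likelihood of the linear Gaussian state
    space model X_{k+1} = a X_k + r U_k, Y_k = b X_k + s V_k, evaluated
    recursively by the Kalman filter."""
    m, P, ll = m0, P0, 0.0
    for y in ys:
        # predictive law of Y_k given the past: N(b*m, b^2*P + s^2)
        S = b * b * P + s * s
        ll += -0.5 * (math.log(2 * math.pi * S) + (y - b * m) ** 2 / S)
        # measurement update
        K = P * b / S
        m, P = m + K * (y - b * m), (1 - K * b) * P
        # time update
        m, P = a * m, a * a * P + r * r
    return ll

ll = kalman_loglik(a=0.8, r=0.5, b=1.0, s=0.3, m0=0.0, P0=1.0,
                   ys=[0.7, -0.2, 0.4])
```

For a single observation, the recursion reduces to the marginal Gaussian density of Y_0 with variance b²P_0 + s², which gives an easy consistency check.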
To conclude this discussion, we need to specify more explicitly the set M(D, r) [see (8)] of possible initial distributions. Using Proposition 3, we have to check the sufficient conditions (10) and (11). To check (10), we use Lemma 4. Note that, for any open subset O,

Q^θ(x, O) = E[1_O(A_θx + R_θU)],

where the expectation is taken with respect to the standard normal random variable U. Let {(x_n, θ_n)}_{n=1}^∞ be any sequence converging to (x, θ). By Fatou's lemma, using that the function 1_O is lower semi-continuous and that θ ↦ A_θ and θ ↦ R_θ are continuous under (L3), we have

lim inf_{n→∞} Q^{θ_n}(x_n, O) ≥ E[lim inf_{n→∞} 1_O(A_{θ_n}x_n + R_{θ_n}U)] ≥ E[1_O(A_θx + R_θU)] = Q^θ(x, O),

showing that, for any open subset O, the function (x, θ) ↦ Q^θ(x, O) is lower semi-continuous.

Assumption (L2) implies that, for all (x, y) ∈ X × Y,

ln g^θ(x, y) ≥ −(d_y/2) ln(2π) + inf_{θ∈Θ} ln det^{−1/2}(S_θ ᵗS_θ) − [inf_{θ∈Θ} λ_min(S_θ ᵗS_θ)]^{−1} [‖y‖² + sup_{θ∈Θ} ‖B_θx‖²],

where λ_min(S_θ ᵗS_θ) is the minimal eigenvalue of S_θ ᵗS_θ. Therefore, under (L4), (11) is satisfied because D_u is a compact set for u ∈ {0, ..., r−1}.

We can therefore apply Theorem 2 to show that the MLE is consistent for any initial measure χ as soon as the process {Y_k}_{k∈Z} is stationary ergodic and E[‖Y_0‖²] < ∞.

4.2. Finite state models.
One of the most widely used classes of HMMs is obtained when the state space is finite, that is, X = {1, ..., d} for some integer d, Y is any Polish space and Θ is a compact metric space. For each parameter θ ∈ Θ, the transition kernel Q^θ is determined by the corresponding transition probability matrix Q_θ, while the observation density g^θ is given as in the general setting of this paper.

It is assumed in the sequel that:

(F1) There exists an integer r > 0 such that inf_{θ∈Θ} inf_{(x,x′)∈X×X} Q_θ^r(x, x′) > 0.

(F2) There exists a set M ⊂ Y such that inf_{θ∈Θ} inf_{y∈M} inf_{x∈X} g^θ(x, y) > 0 and sup_{θ∈Θ} sup_{y∈M} sup_{x∈X} g^θ(x, y) < ∞.

(F3) For any θ ∈ Θ, the function g^θ: (x, y) ∈ X × Y ↦ g^θ(x, y) is positive and E[ln^+ sup_{θ∈Θ} sup_{x∈X} g^θ(x, Y_0)] < ∞.

(F4) E[ln^− inf_{θ∈Θ} inf_{x∈X} g^θ(x, Y_0)] < ∞.

(F5) θ ↦ Q_θ and θ ↦ g^θ(x, y) are continuous for any x ∈ X and y ∈ Y.

Consider first (A1). We set C = X. Since C^c = ∅, (5) is trivially satisfied. Under (F1), equation (4) is satisfied with φ_X^θ⟨y_0^{r−1}⟩(x) ≡ 1, λ_X^θ = d^{−1} ∑_{i=1}^d δ_i, and

ǫ_X^−(y_0^{r−1}) = d ∏_{i=0}^{r−1} inf_{θ∈Θ} inf_{x∈X} g^θ(x, y_i) × inf_{θ∈Θ} inf_{(x,x′)∈X×X} Q_θ^r(x, x′),

ǫ_X^+(y_0^{r−1}) = d ∏_{i=0}^{r−1} sup_{θ∈Θ} sup_{x∈X} g^θ(x, y_i) × sup_{θ∈Θ} sup_{(x,x′)∈X×X} Q_θ^r(x, x′).

Hence, the state space X is an r-local Doeblin set. Assumption (F2) implies that (6) is satisfied with K = M^r. Now, note that for all u ∈ {1, ..., r} and y_0^{u−1} ∈ Y^u,

inf_{θ∈Θ} inf_{x∈X} L^θ⟨y_0^{u−1}⟩(x, X) ≥ ∏_{i=0}^{u−1} inf_{θ∈Θ} inf_{x∈X} g^θ(x, y_i).    (18)

Using the previous inequality with u = r and noting that (F4) implies E[ln^− inf_{θ∈Θ} inf_{x∈X} g^θ(x, Y_0)] < ∞, we see that equation (7) is satisfied with D = X. The same argument for any u ∈ {1, ..., r} shows that all the probability measures on (X, 𝒳) belong to the set M(X, r) defined in (8).

Assumption (A2) is a direct consequence of (F3). Finally, the continuity of θ ↦ Q_θ and θ ↦ g^θ(x, y) immediately yields that θ ↦ p_x^θ(y_0^n) is a continuous function for every n ≥ 0 and y_0^n ∈ Y^{n+1}, establishing (A3).

We can therefore apply Theorem 2 under (F1)–(F5) to show that the MLE is consistent for any initial measure χ as soon as the process {Y_k}_{k∈Z} is stationary ergodic.

4.3. Nonlinear state space models.
In this section, we consider a class of nonlinear state space models. Let X = R^d and Y = R^ℓ, and let 𝒳 and 𝒴 be the associated Borel σ-fields. Let Θ be a compact metric space. For each θ ∈ Θ and each x ∈ X, the Markov kernel Q^θ(x, ·) has a density q^θ(x, ·) with respect to the Lebesgue measure on X. For example, (X_k)_{k≥0} may be defined through the nonlinear recursion

X_k = T_θ(X_{k−1}) + Σ_θ(X_{k−1}) ζ_k,

where (ζ_k)_{k≥1} is an i.i.d. sequence of d-dimensional random vectors which are assumed to possess a density ρ_ζ with respect to the Lebesgue measure λ^Leb on R^d, and T_θ: R^d → R^d and Σ_θ: R^d → R^{d×d} are given measurable functions such that, for each θ ∈ Θ and x ∈ X, Σ_θ(x) is full rank. Such a model for (X_k)_{k≥0} is sometimes known as a vector ARCH model, and covers many models of interest in time series analysis and financial econometrics. We let the reference measure μ be the Lebesgue measure on R^ℓ, and define the observed process (Y_k)_{k≥0} by means of a given observation density g^θ(x, y).

We now introduce the basic assumptions of this section.

(NL1) The function (x, x′, θ) ↦ q^θ(x, x′) is positive and continuous on X × X × Θ. In addition, sup_{θ∈Θ} sup_{(x,x′)∈X×X} q^θ(x, x′) < ∞.

(NL2) For any compact subset K ⊂ Y and θ ∈ Θ,

lim_{|x|→∞} sup_{y∈K} g^θ(x, y) / sup_{x′∈X} g^θ(x′, y) = 0.

(NL3) For each (x, y) ∈ X × Y, the function θ ↦ g^θ(x, y) is positive and continuous on Θ. Moreover, E[ln^+ sup_{θ∈Θ} sup_{x∈X} g^θ(x, Y_0)] < ∞.

(NL4) There exists a compact subset D ⊂ X such that E[ln^− inf_{θ∈Θ} inf_{x∈D} g^θ(x, Y_0)] < ∞.

We have made no attempt at generality here: for the sake of example, we have chosen a set of conditions under which the assumptions of Theorem 2 are easily verified. Of course, the applicability of Theorem 2 extends far beyond the simple assumptions imposed in this section.
Remark 9.
Nonetheless, the present assumptions already cover a broad class of nonlinear models. Consider, for example, the stochastic volatility model [16]

X_{k+1} = φ_θ X_k + σ_θ ζ_k,    Y_k = β_θ exp(X_k/2) ε_k,    (19)

where (ζ_k, ε_k) are i.i.d. Gaussian random vectors in R² with zero mean and identity covariance matrix, β_θ > 0 and σ_θ > 0 for all θ ∈ Θ, and the functions θ ↦ φ_θ, θ ↦ σ_θ and θ ↦ β_θ are continuous. Then, assumptions (NL1)–(NL4) are satisfied, as noted by Douc et al. [8], Remark 10.

Under (NL1), every compact set C ⊂ X = R^d with λ^Leb(C) > 0 is a 1-local Doeblin set, with λ_C^θ(·) = λ^Leb(· ∩ C)/λ^Leb(C), φ_C^θ⟨y⟩(x) = g^θ(x, y) λ^Leb(C) and

ǫ_C^− = inf_{θ∈Θ} inf_{(x,x′)∈C×C} q^θ(x, x′),    ǫ_C^+ = sup_{θ∈Θ} sup_{(x,x′)∈C×C} q^θ(x, x′).

Under (NL1) and (NL2), (5) and (6) are satisfied with r = 1; equation (7) follows from (NL1) and (NL4). Thus assumption (A1) holds. Assumption (A2) follows directly from (NL3). To establish (A3), it suffices to note that, under (NL1), for any (x, x′) ∈ X × X, θ ↦ q^θ(x, x′) is continuous; under (NL3), for any (x, y) ∈ X × Y, θ ↦ g^θ(x, y) is continuous; and, for any n ∈ N, sup_{θ∈Θ} sup_{x∈X} ∏_{k=0}^n g^θ(x, Y_k) < ∞, P-a.s. The bounded convergence theorem then shows that, P-a.s., the function θ ↦ p_x^θ(Y_0^n) is continuous. Finally, under (NL1)–(NL4), according to Theorem 2 and Proposition 3, the MLE is consistent for any initial measure χ such that χ(D) > 0.
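A minimal simulation of the stochastic volatility model (19) can make the example concrete (our own sketch; the parameter values, the function name and the stationary initialization are illustrative assumptions):

```python
import math
import random

def simulate_sv(phi, sigma, beta, n, seed=1):
    """Simulate the stochastic volatility model (19):
    X_{k+1} = phi*X_k + sigma*zeta_k,  Y_k = beta*exp(X_k/2)*eps_k,
    with zeta_k, eps_k i.i.d. standard Gaussian."""
    rng = random.Random(seed)
    xs, ys = [], []
    # start from the stationary law of the AR(1) volatility process
    x = rng.gauss(0.0, sigma / math.sqrt(1 - phi * phi))
    for _ in range(n):
        xs.append(x)
        ys.append(beta * math.exp(x / 2.0) * rng.gauss(0.0, 1.0))
        x = phi * x + sigma * rng.gauss(0.0, 1.0)
    return xs, ys

xs, ys = simulate_sv(phi=0.95, sigma=0.2, beta=0.5, n=1000)
```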
5. Proofs of Proposition 1 and Theorem 2.
Block decomposition.
The first step of the proof consists of splitting the observations into blocks of size r, where r is defined in (A1). More precisely, we will first show the equivalent of Proposition 1 and Theorem 2 with Y_i replaced by Z_i ≜ Y_{ir}^{(i+1)r−1}. With this notation,

θ̂_{χ,nr} = argmax_{θ∈Θ} ln p^θ_χ(Y_0^{nr−1}) = argmax_{θ∈Θ} ln p^θ_χ(Z_0^{n−1}).

In the following, θ̂_{χ,nr} is called the block maximum likelihood estimator (denoted hereafter as the block MLE) associated to the observations Z_0, …, Z_{n−1}.

5.1.1. Forgetting of the initial distribution for the block conditional likelihood.
Denote, for i ∈ Z,

z_i = y_{ir}^{(i+1)r−1} ∈ Y^r.   (20)

Then, the likelihood p^θ_χ(z_0^{n−1}) may be rewritten as

p^θ_χ(z_0^{n−1}) = p^θ_χ(y_0^{nr−1}) = χ L_θ⟨z_0⟩ ⋯ L_θ⟨z_{n−1}⟩ 1_X = χ L_θ⟨z_0^{n−1}⟩ 1_X,   (21)

where L_θ⟨z_0^{n−1}⟩ = L_θ⟨y_0^{nr−1}⟩ is defined in (2). For any sequence {z_i}_{i≥0} ∈ Z^N, where Z ≜ Y^r, any probability measures χ and χ′ on (X, 𝒳) and any measurable nonnegative functions f and h from X to R_+, define

∆^θ_{χ,χ′}⟨z_0^{n−1}⟩(f, h) = (χ L_θ⟨z_0^{n−1}⟩ f)(χ′ L_θ⟨z_0^{n−1}⟩ h) − (χ L_θ⟨z_0^{n−1}⟩ h)(χ′ L_θ⟨z_0^{n−1}⟩ f).   (22)

Let X̄ = X × X and 𝒳̄ = 𝒳 ⊗ 𝒳. For P a (possibly unnormalized) kernel on (X, 𝒳), we denote by P̄ the transition kernel on (X̄, 𝒳̄) defined, for any (x, x′) ∈ X̄ and A, A′ ∈ 𝒳, by

P̄[(x, x′), A × A′] = P(x, A) P(x′, A′).   (23)

If χ and χ′ are two probability measures on (X, 𝒳) and f, h are real-valued measurable functions on (X, 𝒳), define, for Ā ∈ 𝒳̄ and w̄ = (w, w′) ∈ X̄,

χ ⊗ χ′(Ā) = ∫∫ χ(dx) χ′(dx′) 1_Ā(x, x′),    f ⊗ h(w̄) = f(w) h(w′).   (24)
With the notation introduced above, (22) can be rewritten as follows:∆ θχ,χ ′ h z n − i ( f, h ) = Z · · · Z χ ⊗ χ ′ (d ¯ w ′ ) n − Y i =0 ¯ L θ h z i i ( ¯ w i , d ¯ w i +1 ) ! (25) × { f ⊗ h − h ⊗ f } ( ¯ w n ) . The following proposition extends [6], Proposition 12.
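The quantity ∆ measures how far the two likelihood functionals started from χ and χ′ are from being proportional, and the proposition that follows shows it decays geometrically. For intuition, here is a toy finite-state check that two filters started from very different initial laws merge; the transition matrix, Gaussian emission means and initial laws below are all invented for the sketch, which is far simpler than the general setting.

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy transition matrix (invented)
means = np.array([-1.0, 1.0])            # toy Gaussian emission means (invented)

def filter_step(p, y):
    """One step of the normalized filter: predict with Q, then correct
    by the (unnormalized) Gaussian emission density and renormalize."""
    p = p @ Q
    p = p * np.exp(-0.5 * (y - means) ** 2)
    return p / p.sum()

y = means[rng.integers(0, 2, 200)] + rng.standard_normal(200)
chi = np.array([0.99, 0.01])        # two very different initial laws
chi_prime = np.array([0.01, 0.99])
for obs in y:
    chi, chi_prime = filter_step(chi, obs), filter_step(chi_prime, obs)

# after 200 steps the two filters agree: the initial law has been forgotten
print(np.abs(chi - chi_prime).max())
```

Because every entry of Q is bounded away from zero, the chain satisfies a one-step Doeblin-type condition and the merging is geometric, which is the finite-state shadow of the local Doeblin machinery used in the proofs.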
Proposition 5.
Assume (A1). Let 0 ≤ γ_− < γ_+ ≤ 1. Then, for any η > 0, there exists ρ ∈ (0, 1) such that, for any sequence (z_i)_{i≥0} ∈ Z^N satisfying

n^{−1} ∑_{i=0}^{n−1} 1_K(z_i) ≥ max(1 − γ_−, (1 + γ_+)/2),   (26)

for any β ∈ (γ_−, γ_+), any nonnegative bounded functions f and h, any probability measures χ and χ′ on (X, 𝒳) and any θ ∈ Θ,

|∆^θ_{χ,χ′}⟨z_0^{n−1}⟩(f, h)| ≤ ρ^{⌊n(β−γ_−)⌋} {(χ L_θ⟨z_0^{n−1}⟩ f)(χ′ L_θ⟨z_0^{n−1}⟩ h) + (χ′ L_θ⟨z_0^{n−1}⟩ f)(χ L_θ⟨z_0^{n−1}⟩ h)} + 2 η^{⌊n(γ_+−β)⌋/2} [∏_{i=0}^{n−1} |L_θ⟨z_i⟩(·, X)|_∞] |f|_∞ |h|_∞.

Proof.
Let η >
0. According to (A1), there exists a set C ⊂ Y suchthat (5) and (6) hold. Denote ¯ C , C × C and for z = y r − , set ¯ ϕ θ C h z i = ϕ θ C h z i ⊗ ϕ θ C h z i and ¯ λ θ C h z i , λ θ C h z i ⊗ λ θ C h z i where ϕ θ C h z i and λ θ C h z i are definedin Definition 1. For any measurable nonnegative function ¯ f on (¯ X , ¯ X ), θ ∈ Θand ¯ x ∈ ¯ C , ( ǫ − C ( z )) ¯ ϕ θ C h z i (¯ x )¯ λ θ C h z i ( ¯ C ¯ f )(27) ≤ δ ¯ x ¯ L θ h z i ( ¯ C ¯ f ) ≤ ( ǫ + C ( z )) ¯ ϕ θ C h z i (¯ x )¯ λ θ C h z i ( ¯ C ¯ f ) . Define the unnormalized kernel ¯ L θ, h z i and ¯ L θ, h z i on (¯ X , ¯ X ) as follows: forall ¯ x ∈ ¯ X and ¯ A ∈ ¯ X ,¯ L θ, h z i (¯ x, ¯ A ) , ¯ C (¯ x )( ǫ − C ( z )) ¯ ϕ θ C h z i (¯ x )¯ λ θ C h z i (¯ C ∩ ¯ A ) , (28) ¯ L θ, h z i (¯ x, ¯ A ) , ¯ L θ h z i (¯ x, ¯ A ) − ¯ L θ, h z i (¯ x, ¯ A ) . (29)Equation (27) implies that, for all ¯ x ∈ ¯ C , and any measurable nonnegativefunction ¯ f , 0 ≤ δ ¯ x ¯ L θ, h z i ( ¯ C ¯ f ) ≤ r C ( z ) δ ¯ x ¯ L θ h z i ( ¯ C ¯ f ) , LE IN MISSPECIFIED HMMS where r C ( z ) , − ( ǫ − C ( z ) /ǫ + C ( z )) . It then follows δ ¯ x ¯ L θ, h z i ( ¯ f )= ¯ C (¯ x ) δ ¯ x ¯ L θ, h z i ( ¯ C ¯ f ) + ¯ C (¯ x ) δ ¯ x ¯ L θ, h z i ( ¯ C c ¯ f ) + ¯ C c (¯ x ) δ ¯ x ¯ L θ, h z i ( ¯ f )(30) ≤ r C ( z ) ¯ C (¯ x ) δ ¯ x ¯ L θ h z i ( ¯ C ¯ f ) + ¯ C (¯ x ) δ ¯ x ¯ L θ h z i ( ¯ C c ¯ f ) + ¯ C c (¯ x ) δ ¯ x ¯ L θ h z i ( ¯ f ) ≤ δ ¯ x ¯ L θ h z i ( r C ( z ) ¯ C (¯ x ) ¯ C ¯ f ) . Note that ∆ θχ,χ ′ h z n − i ( f, h ) may be decomposed as∆ θχ,χ ′ h z n − i ( f, h ) = X t n − ∈{ , } n ∆ θ,t n − χ,χ ′ h z n − i ( f, h ) , where∆ θ,t n − χ,χ ′ h z n − i ( f, h ) = Z · · · Z χ ⊗ χ ′ (d ¯ w ′ ) n − Y i =0 ¯ L θ,t i h z i i ( ¯ w i , d ¯ w i +1 ) ! Φ( ¯ w n )with Φ , f ⊗ h − h ⊗ f . First assume that there exists an index i ∈ { , . . . , n − } such that t i = 0. 
Then∆ θ,t n − χ,χ ′ h z n − i ( f, h ) = χ ⊗ χ ′ ( ¯ L θ,t h z i · · · ¯ L θ,t i − h z i − i ( ¯ C × ¯ ϕ θ C h z i i )) × ( ǫ − C ( z i )) ¯ λ θ C h z i i ( ¯ C ¯ L θ,t i +1 h z i +1 i · · · ¯ L θ,t n − h z n − i Φ) . By symmetry, ¯ λ θ C h z i i ( ¯ C ¯ L θ,t i +1 h z i +1 i · · · ¯ L θ,t n − h z n − i Φ) = 0 , showing that ∆ θ,t n − χ,χ ′ h z n − i ( f, h ) = 0 except if for all i ∈ { , . . . , n − } , t i = 1.Therefore, ∆ θχ,χ ′ h z n − i ( f, h ) = χ ⊗ χ ′ ( ¯ L θ, h z i · · · ¯ L θ, h z n − i Φ) . This implies, using (30), that | ∆ θχ,χ ′ h z n − i ( f, h ) |≤ χ ⊗ χ ′ ( ¯ L θ, h z i · · · ¯ L θ, h z n − i| Φ | )(31) ≤ Z · · · Z χ ⊗ χ ′ (d ¯ w ) n − Y i =0 ¯ L θ h z i i ( ¯ w i , d ¯ w i +1 )( r C ( z i )) ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) ! × | Φ | ( ¯ w n ) . Note that n − Y i =0 ( r C ( z i )) ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) ≤ ̺ P n − i =0 ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) K ( z i ) C , (32) R. DOUC AND E. MOULINES where ̺ C , sup z ∈ K r C ( z ) < z n − such that n − P n − i =0 K ( z i ) ≥ (1 − γ − ), we have P n − i =0 K c ( z i ) ≤ nγ − , so that n − X i =0 K c ( z i ) ≤ ⌊ nγ − ⌋ . Moreover, we have n − X i =0 ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) K ( z i )= n − X i =0 ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) − n − X i =0 ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) K c ( z i )(33) ≥ N ¯ C ,n ( ¯ w n ) − n − X i =0 K c ( z i ) ≥ N ¯ C ,n ( ¯ w n ) − ⌊ nγ − ⌋ , where, for any set ¯ A ∈ ¯ X , N ¯ A ,n ( ¯ w n ) = P n − i =0 ¯ A × ¯ A ( ¯ w i , ¯ w i +1 ). By combining(32) and (33) and using that ⌊ nβ ⌋ − ⌊ nγ − ⌋ ≥ ⌊ n ( β − γ − ) ⌋ , we thereforeobtain, for any β ∈ ( γ − , n − Y i =0 ( r C ( z i )) ¯ C × ¯ C ( ¯ w i , ¯ w i +1 ) ≤ ̺ ⌊ n ( β − γ − ) ⌋ C + { N ¯ C ,n ( ¯ w n ) < ⌊ nβ ⌋} . (34)For any sequence ¯ w n − ∈ ¯ X n and any ¯ A ∈ ¯ X , denote M ¯ A ,n ( ¯ w n − ) , n − X i =0 ¯ A ( ¯ w i ) . Using [6], Lemma 17, for any sequence ¯ w n satisfying N ¯ C ,n ( ¯ w n ) < ⌊ nβ ⌋ whichis equivalent to N ¯ C ,n ( ¯ w n ) ≤ ⌊ nβ ⌋ −
1, we have M ¯ C ,n ( ¯ w n − ) ≤ ( ⌊ nβ ⌋ + n ) / N ¯ C ,n ( ¯ w n ) < ⌊ nβ ⌋ ⇒ M ¯ C c ,n ( ¯ w n − ) ≥ a n , n − ⌊ nβ ⌋ . (35)In words, either the number of consecutive visits to the set ¯ C at most ⌊ nβ ⌋ ,or the number of visits to the complementary of the set ¯ C is larger than a n .Plugging (35) into (34) and combining it with (31) yields | ∆ θχ,χ ′ h z n i ( f, h ) | ≤ ̺ ⌊ n ( β − γ − ) ⌋ C χ ⊗ χ ′ ( ¯ L θ h z i · · · ¯ L θ h z n − i| Φ | )+ 2 | f | ∞ | h | ∞ Γ θχ,χ ′ ( z n − ) , LE IN MISSPECIFIED HMMS whereΓ θχ,χ ′ ( z n − ) , Z · · · Z χ ⊗ χ ′ (d ¯ w ) n − Y i =0 ¯ L θ h z i i ( ¯ w i , d ¯ w i +1 ) { M ¯ C c ,n ( ¯ w n − ) ≥ a n } . We finally have to bound this last term. First rewrite Γ θχ,χ ′ ( z n − ) as follows:Γ θχ,χ ′ ( z n − ) = n − Y i =0 | L θ h z i i ( · , X ) | ∞ ! Z χ ⊗ χ ′ (d ¯ w )( η P n − i =0 ¯ C c ( ¯ w i ) K ( z i ) ) × n − Y i =0 ¯ L θ h z i i ( ¯ w i , d ¯ w i +1 ) η ¯ C c ( ¯ w i ) K ( z i ) | L θ h z i i ( · , X ) | ∞ ! { M ¯ C c ,n ( ¯ w n − ) ≥ a n } . Note that (26) implies that P n − i =0 K ( z i ) ≥ ( n + ⌊ nγ + ⌋ ) /
2. Then, for any γ + > β , the inequality M ¯ C c ,n ( ¯ w n − ) ≥ a n implies that n − X i =0 ¯ C c (¯ x i ) K ( z i ) ≥ n − X i =0 ¯ C c (¯ x i ) − n − X i =0 K c ( z i ) ≥ ⌊ nγ + ⌋ − ⌊ nβ ⌋ ≥ ⌊ n ( γ + − β ) ⌋ , showing that( η P n − i =0 ¯ C c (¯ x i ) K ( z i ) ) { M ¯ C c ,n (¯ x n − ) ≥ a n } ≤ η ⌊ n ( γ + − β ) ⌋ / . The proof follows noting that, for any ¯ w = ( w, w ′ ) ∈ ¯ X and z ∈ Y r , (3) and(5) imply Z Z ¯ L θ h z i ( ¯ w, d ¯ w i +1 ) η ¯ C c ( ¯ w ) K ( z ) | L θ h z i ( · , X ) | ∞ = L θ h z i ( w, X ) L θ h z i ( w ′ , X ) η ¯ C c ( ¯ w ) K ( z ) | L θ h z i ( · , X ) | ∞ ≤ . (cid:3) Lemma 6.
Let (U_k)_{k∈Z}, (V_k)_{k∈Z} and (W_k)_{k∈Z} be stationary sequences such that E[ln^+ U_0] < ∞, E[ln^+ V_0] < ∞ and E[ln^+ W_0] < ∞. Then, for all η, ρ in (0, 1) such that −ln η > E[ln^+ V_0], there exist a P-a.s. finite random variable D and a constant ̺ ∈ (0, 1) such that, for all k ≥ 0 and m ≥ 0,

ρ^{k+m} + η^{k+m} W_{−m} (∏_{i=−m}^{k−1} V_i) U_k ≤ ̺^{k+m} D,   P-a.s.

Proof.
Let α ∈ (0 ,
1) such that E [ln + V ] < − ln α < − ln η , and let ˜ α > η/α ) ∨ ρ < ˜ α <
1. Then ρ k + m + η k + m W − m k − Y i = − m V i ! U k R. DOUC AND E. MOULINES = "(cid:18) ρ ˜ α (cid:19) k + m ˜ α m + (cid:18) ηα ˜ α (cid:19) k + m ( ˜ α m W − m ) k − Y i = − m ( V i α ) ! ( ˜ α k U k ) ≤ (cid:18) ρ ˜ α ∨ ηα ˜ α (cid:19) k + m D with D , (cid:16) sup m ≥ ˜ α m W − m (cid:17) sup m ≥ Y i = − m ( V i α ) ! sup k ≥ k − Y i =1 ( V i α ) !(cid:16) sup k ≥ ˜ α k U k (cid:17) . We now show that D is P -a.s. finite. First note that combining the bound E [ln + U < ∞ ] with Lemma 7 (stated and proved below), we obtain that therandom variable sup k ≥ ˜ α k U k is P -a.s. finite; in the same way, sup m ≥ ˜ α m W − m is P -a.s. finite. Moreover, since E [ln + V ] < ∞ , Birkoff’s ergodic theorem en-sures that 1 k − k − X i =1 ln + V i → k →∞ E [ln + V ] < − ln α, P -a.s.By taking the exponential function in the previous limit, we obtain that k − Y i =1 ( V i α ) ≤ exp ( ( k − k − k − X i =1 ln + V i + ln α !) → k →∞ , P -a.s.so that sup k ≥ Q k − i =1 ( V i α ) is P -a.s. finite. Following the same arguments,sup m ≥ Y i = − m ( V i α )is P -a.s. finite. Finally D is P -a.s. finite. The proof is complete. (cid:3) Lemma 7.
Let {Z_k}_{k∈Z} be a sequence of nonnegative random variables on a probability space (Ω, 𝒜, P) having the same marginal distribution, that is, for any k ∈ Z and any measurable nonnegative function f, E[f(Z_k)] = E[f(Z_0)].

(i) Assume that E[(ln Z_0)^+] < ∞. Then, for all β ∈ (0, 1), sup_{k≥0} β^k Z_k < ∞, P-a.s.

(ii) Assume that E[|ln Z_0|] < ∞. Then, for all β ∈ (0, 1), sup_{k∈Z} β^{|k|} Z_k < ∞ and inf_{k∈Z} β^{−|k|} Z_k > 0, P-a.s.

Proof. Let β ∈ (0, 1). Since

P[β^k Z_k > 1] = P[ln Z_k/(−ln β) ≥ k] = P[ln Z_0/(−ln β) ≥ k],

it follows that

∑_{k=0}^∞ P[β^k Z_k > 1] = ∑_{k=0}^∞ P[ln Z_0/(−ln β) ≥ k] ≤ 1 + E[(ln Z_0)^+]/(−ln β) < ∞.

The proof of (i) is completed by using the Borel–Cantelli lemma. Now, (ii) can be easily derived by noting that if E[|ln Z_0|] < ∞, then one may use (i) twice, first by replacing Z_k by Z_{−k} and then by replacing Z_k by 1/Z_k. □

Proposition 8.
Assume (A1) and (A2). There exist a constant κ ∈ (0, 1) and an integer-valued random variable K satisfying P_Y[K < ∞] = 1 such that, for any initial distributions χ, χ′ ∈ M(D, r) [where M(D, r) is defined in (8)],

sup_{θ∈Θ} sup_{k≥K} sup_{m≥0} κ^{−(m+k)} |ln p^θ_χ(Z_k | Z_{−m}^{k−1}) − ln p^θ_{χ′}(Z_k | Z_{−m}^{k−1})| < ∞, P-a.s.,   (36)

sup_{θ∈Θ} sup_{k≥K} sup_{m≥0} κ^{−(m+k)} |ln p^θ_χ(Z_k | Z_{−m}^{k−1}) − ln p^θ_χ(Z_k | Z_{−m−1}^{k−1})| < ∞, P-a.s.,   (37)

sup_{θ∈Θ} sup_{m≥0} κ^{−m} |ln p^θ_χ(Z_0 | Z_{−m}^{−1}) − ln p^θ_χ(Z_0 | Z_{−m−1}^{−1})| < ∞, P-a.s.   (38)

Proof.
Proof of (36). It follows from (21) that, for any integer ( m, k ) ∈ N and any sequence z k − m , p θχ ( z k | z k − − m ) = χ L θ h z k − − m i ( L θ h z k i X ) χ L θ h z k − − m i ( X ) . Since, for any a, b >
0, ln( a ) − ln( b ) ≤ ( a − b ) /b , definition (22) implies thatln p θχ ( z k | z k − − m ) − ln p θχ ′ ( z k | z k − − m )(39) ≤ ∆ θχ,χ ′ h z k − − m i ( L θ h z k i X , X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X ) . Let 0 ≤ γ − < γ + ≤
1. By Proposition 5, for any η > β ∈ ( γ − , γ + ) thereexists ̺ ∈ (0 ,
1) such that, for any sequence z k − − m satisfying( m + k ) − k − X i = − m K ( z i ) ≥ max(1 − γ − , (1 + γ + ) / , (40) R. DOUC AND E. MOULINES we have ∆ θχ,χ ′ h z k − − m i ( L θ h z k i X , X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( X ) ≤ ̺ a ( m + k ) (cid:20) χ L θ h z k − − m i ( L θ h z k i X ) × χ ′ L θ h z k − − m i ( X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X ) (cid:21) (41) + 2 η b ( m + k ) C m,k , where a ( n ) = ⌊ n ( β − γ − ) ⌋ , b ( n ) = ⌊ n ( γ + − β ) ⌋ / C m,k , Q k − i = − m | L θ h z i i ( · , X ) | ∞ χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X ) | L θ h z k i ( · , X ) | ∞ . (42)Moreover, by (22), χ L θ h z k − − m i ( L θ h z k i X ) × χ ′ L θ h z k − − m i ( X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X )= ∆ θχ,χ ′ h z k − − m i ( L θ h z k i X , X ) χ L θ h z k − − m i ( X ) × χ ′ L θ h z k − − m i ( L θ h z k i X ) + 1 . Plugging this identity into (41) and then using (39) yieldsln p θχ ( z k | z k − − m ) − ln p θχ ′ ( z k | z k − − m )(43) ≤ − ̺ a ( m + k ) ) − [ ̺ a ( m + k ) + η b ( m + k ) C m,k ] . For any sequence z k − − m , we have χ L θ h z k − − m i ( X ) ≥ χ ( D ) k − Y i = − m n inf x ∈ D L θ h z i i ( x, D ) o , (44) χ ′ L θ h z k − − m i ( L θ h z k i X ) ≥ χ ′ ( D ) k Y i = − m n inf x ∈ D L θ h z i i ( x, D ) o . Exchanging χ and χ ′ in (43) allows us to obtain an upper bound for | ln p θχ ( z k | z k − − m ) − ln p θχ ′ ( z k | z k − − m ) | . More precisely, for any sequence z k − − m sat-isfying (40), we havesup θ ∈ Θ | ln p θχ ( z k | z k − − m ) − ln p θχ ′ ( z k | z k − − m ) |≤ − ̺ a ( m + k ) ) − (45) × ( ̺ a ( m + k ) + η b ( m + k ) χ ( D ) χ ′ ( D ) " k − Y j = − m ( D z j ) D z k ) , LE IN MISSPECIFIED HMMS where, for z ∈ Y r , D z = sup θ ∈ Θ | L θ h z i ( · , X ) | ∞ inf θ ∈ Θ inf x ∈ D L θ h z i ( x, D ) . (46)Assume that E [ln + ( D Z )] < ∞ , and set η small enough so that E [ln + ( D Z )] ≤− ln η . 
By Lemma 6, there exists a P -a.s. finite random variable C , and aconstant κ ∈ (0 ,
1) such that, for all k ≥ m ≥ − ̺ a ( m + k ) ( ̺ a ( m + k ) + η b ( m + k ) χ ( D ) χ ′ ( D ) " k − Y j = − m ( D z j ) D z k ) ≤ Cκ k + m , P -a.s.It remains to show that E [ln + ( D Z )] < ∞ . Since for any a, b >
0, ln + ( a/b ) ≤ ln + ( a ) + ln − ( b ),ln + ( D z ) ≤ ln + (cid:16) sup θ ∈ Θ | L θ h z i ( · , X ) | ∞ (cid:17) + ln − (cid:16) inf θ ∈ Θ inf x ∈ D L θ h z i ( x, D ) (cid:17) . (47)Since, for any z = y r − ∈ Y r , sup θ ∈ Θ | L θ h z i ( · , X ) | ∞ ≤ Q r − i =0 sup θ ∈ Θ | g θ ( · , y i ) | ∞ ,(A1)(iii) and (A2) imply that E [ln + ( D Z )] < ∞ . Finally, according to (45),sup θ ∈ Θ | ln p θχ ( Z k | Z k − − m ) − ln p θχ ′ ( Z k | Z k − − m ) | ≤ Cκ m + k , P -a.s. , provided that( m + k ) − k − X j = − m K ( Z j ) ≥ max(1 − γ − , (1 + γ + ) / , P -a.s.(48)It thus remains to show the existence of a P -a.s. finite random variable K such that for any k ≥ K and any m ≥
0, (48) holds P -a.s. Under (A1)(i),1 − P [ Z ∈ K ] < P [ Z ∈ K ] −
1. Then, choose ˜ γ − , γ − , γ + and ˜ γ + such that1 − P [ Z ∈ K ] < ˜ γ − < γ − < γ + < ˜ γ + < P [ Z ∈ K ] − . (49)By construction (1 + ˜ γ + ) / < P Y [ Z ∈ K ] and 1 − ˜ γ − < P [ Z ∈ K ]. Since( Z k ) k ∈ Z is stationary and ergodic, the Birkhoff ergodic theorem ensures thatthere exists a P -a.s. finite random variable B such that for any k ≥ B and m ≥ B , P -a.s., max (cid:18) − ˜ γ − , γ + (cid:19) < k − k − X i =0 K ( Z i ) , (50) max (cid:18) − ˜ γ − , γ + (cid:19) < m − − X i = − m K ( Z i ) . (51) R. DOUC AND E. MOULINES
Set K + , B (1 + γ + ) / (˜ γ + − γ + ). If m ≥ B and k ≥ K + , then using that K + ≥ B , P -a.s., P k − i = − m K ( Z i ) k + m > k (1 + ˜ γ + ) / m (1 + ˜ γ + ) / k + m = (1 + ˜ γ + ) / > (1 + γ + ) / . Now, if 0 ≤ m < B and k ≥ K + , P k − i = − m K ( Z i ) k + m ≥ P k − i =0 K ( Z i ) k + m > k (1 + ˜ γ + ) / k + m> K + (1 + ˜ γ + ) / K + + B = (1 + γ + ) / . Similarly, setting K − , B (1 − γ − ) / (˜ γ − − γ − ), we obtain, for all m ≥ k ≥ K − that, P -a.s., P k − i = − m K ( Z i ) k + m ≥ − γ − . The proof of (36) is now completed by setting K = K + ∨ K − . Proof of (37). Note that p θχ ( z k | z k − − m − ) = p θχ ′ ( z k | z k − − m )with χ ′ ( A ) = χ ( L θ h z − m − i A ) /χ ( L θ h z − m − i X ). Since1 χ ′ ( D ) = χ ( L θ h z − m − i X ) χ ( L θ h z − m − i D ) ≤ D z − m − χ ( D ) , where D z is defined in (46), (45) writessup θ ∈ Θ | ln p θχ ( z k | z k − − m ) − ln p θχ ( z k | z k − − m − ) |≤ − ̺ a ( m + k ) ) − × " ̺ a ( m + k ) + η b ( m + k ) [ χ ( D )] D z − m − k − Y j = − m ( D z j ) D z k . And the rest of the proof of (37) follows the same lines as (36) and is omittedfor brevity.
Proof of (38). Noting that, when k = 0, equation (48) follows immediatelyfrom (51), the proof of (38) follows the same lines as the proof of (37) andis omitted for brevity. (cid:3) Corollary 9 (Corollary of Proposition 8).
Assume (A1) and (A2). For any θ ∈ Θ, there exists a measurable function π^θ_Z : Z^{Z_−} → R such that, for any probability measure χ ∈ M(D, r) [where M(D, r) is defined in (8)],

P_Y[ lim_{m→∞} p^θ_χ(Z_0 | Z_{−m}^{−1}) = π^θ_Z(Z_{−∞}^0) ] = 1.   (52)

In the sequel, we denote p^θ(Z_0 | Z_{−∞}^{−1}) ≜ π^θ_Z(Z_{−∞}^0) and, for n ≥ 0, p^θ(Z_0^n | Z_{−∞}^{−1}) ≜ ∏_{i=0}^n π^θ_Z(Z_{−∞}^i).

5.1.2. Consistency of the block MLE.
Proposition 10.
Assume (A1) and (A2). Then: (i)
For any θ ∈ Θ,

E[|ln p^θ(Z_0 | Z_{−∞}^{−1})|] < ∞.   (53)

(ii) For any probability measure χ ∈ M(D, r) [where M(D, r) is defined in (8)],

lim sup_{n→∞} sup_{θ∈Θ} |n^{−1} ln p^θ_χ(Z_0^{n−1}) − n^{−1} ln p^θ(Z_0^{n−1} | Z_{−∞}^{−1})| = 0, P-a.s.

(iii) For any θ ∈ Θ and for any probability measure χ ∈ M(D, r),

lim_{n→∞} n^{−1} ln p^θ_χ(Z_0^{n−1}) = E[ln p^θ(Z_0 | Z_{−∞}^{−1})], P-a.s.

Proof.
Proof of (i). It follows from (52) that, P -a.s., p θ ( Z | Z − −∞ ) = lim m →∞ p θχ ( Z | Z − − m ) ≤ | L θ h Z i ( · , X ) | ∞ ≤ r − Y i =0 | g θ ( · , Y i ) | ∞ . (54)Then, (A2) shows that E [ln + p θ ( Z | Z − −∞ )] ≤ E [ln + | L θ h Z i ( · , X ) | ∞ ] < ∞ . We now show that E [ln − p θ ( Z | Z − −∞ )] < ∞ by establishing that E [ln p θ ( Z | Z − −∞ )] > −∞ . For that purpose, introduce the sequence L θm , m − m X ℓ =1 [ln + | L θ h Z i ( · , X ) | ∞ − ln p θχ ( Z | Z − − ℓ )] . By (54), the sequence ( L θm ) m ≥ is nonnegative and the Fatou lemma impliesthat lim inf m →∞ E [ L θm ] ≥ E h lim inf m →∞ L θm i . (55) R. DOUC AND E. MOULINES
By definition,lim inf m →∞ E [ L θm ] = E [ln + | L θ h Z i ( · , X ) | ∞ ](56) − lim sup m →∞ m − m X ℓ =1 E [ln p θχ ( Z | Z − − ℓ )]and E h lim inf m →∞ L θm i = E [ln + | L θ h Z i ( · , X ) | ∞ ](57) − E " lim sup m →∞ m − m X ℓ =1 ln p θχ ( Z | Z − − ℓ ) . Since ( Y k ) k ∈ Z is stationary, for any ℓ ∈ N , E [ln p θχ ( Z | Z − − ℓ )] = E [ln p θχ ( Z ℓ | Z ℓ − )]showing that m − m X ℓ =1 E [ln p θχ ( Z | Z − − ℓ )] = m − m X ℓ =1 E [ln p θχ ( Z ℓ | Z ℓ − )] . (58)The Cesaro mean convergence lemma implies that, P -a.s.,lim sup m →∞ m − m X ℓ =1 ln p θχ ( Z | Z − − ℓ ) = lim ℓ →∞ ln p θχ ( Z | Z − − ℓ ) = ln p θ ( Z | Z − −∞ ) . (59)Combining (55), (56), (57), (58) and (59) yields to E [ln p θ ( Z | Z − −∞ )] ≥ lim sup m →∞ m − m X ℓ =1 E [ln p θχ ( Z ℓ | Z ℓ − )](60) = lim sup m →∞ { E [ m − ln p θχ ( Z m )] − m − E [ln p θχ ( Z )] } > −∞ , where the last bound follows from (A1)(iii) and the minorizationln p θχ ( Z m ) ≥ ln χ ( D ) + m X i =0 ln inf x ∈ D L θ h Z i i ( x, D ) . The proof of (i) follows.
Proof of (ii). According to Proposition 8 (36), there exists a randomvariable C satisfying P Y [ C < ∞ ] = 1 such that for all k ≥ K and m ≥ θ ∈ Θ | ln p θχ ( Z k | Z k − − m ) − ln p θχ ( Z k | Z k − − m − ) | ≤ Cκ k + m , P -a.s. , LE IN MISSPECIFIED HMMS which implies thatsup θ ∈ Θ | ln p θχ ( Z k | Z k − ) − ln p θ ( Z k | Z k − −∞ ) | ≤ Cκ k / (1 − κ ) , P -a.s.The proof of (ii) follows from the obvious decomposition n − ln p θχ ( Z n − ) = n − n − X k =1 ln p θχ ( Z k | Z k − ) + n − ln p θχ ( Z ) , (61) n − ln p θ ( Z n − | Z − −∞ ) = n − n − X k =0 ln p θ ( Z k | Z k − −∞ ) . The proof of (iii) follows from (53) and (61) using the Birkhoff theorem; see,for example, [28], Theorem 1.14. (cid:3)
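Proposition 10(iii) can be illustrated numerically: in a toy two-state Gaussian HMM (all constants below are invented for the sketch), the normalized log-likelihood computed with different sample sizes and different initial laws settles to a common limit, as the Birkhoff argument predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy transition matrix (invented)
means = np.array([0.0, 2.0])             # toy Gaussian emission means (invented)

def simulate(n):
    """Draw one trajectory of the toy HMM."""
    x = 0
    y = np.empty(n)
    for k in range(n):
        y[k] = means[x] + rng.standard_normal()
        x = rng.choice(2, p=Q[x])
    return y

def avg_loglik(y, chi):
    """n^{-1} ln p_chi(y_0^{n-1}) via the scaled forward recursion."""
    logp, p = 0.0, chi.astype(float)
    for obs in y:
        g = np.exp(-0.5 * (obs - means) ** 2) / np.sqrt(2.0 * np.pi)
        c = p @ g                 # p_chi(y_k | y_0^{k-1})
        logp += np.log(c)
        p = ((p * g) @ Q) / c     # next predictive filter
    return logp / len(y)

y = simulate(50_000)
l_half = avg_loglik(y[:25_000], np.array([1.0, 0.0]))
l_full = avg_loglik(y, np.array([0.5, 0.5]))
print(abs(l_half - l_full))  # small: both averages approach the same limit
```

The scaled forward recursion is the standard numerically stable way to evaluate the likelihood; the point of the check is only that the per-observation average is insensitive to both the initial law and the sample size once n is large.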
Proposition 11.
Assume (A1)–(A3). Let χ be a probability measure in M(D, r) [where M(D, r) is defined in (8)].

(i) For any θ_0 ∈ Θ and any ρ > 0,

lim sup_{n→∞} sup_{θ∈B(θ_0,ρ)} (1/n) ln p^θ_χ(Z_0^{n−1}) ≤ E[sup_{θ∈B(θ_0,ρ)} ln p^θ(Z_0 | Z_{−∞}^{−1})], P-a.s.

(ii) The function θ ↦ E[ln p^θ(Z_0 | Z_{−∞}^{−1})] is upper semi-continuous.

(iii) For any compact set Ξ ⊂ Θ, the sequence (sup_{θ∈Ξ} (1/n) ln p^θ_χ(Z_0^{n−1}))_{n≥1} converges P-a.s. and

lim_{n→∞} sup_{θ∈Ξ} (1/n) ln p^θ_χ(Z_0^{n−1}) = sup_{θ∈Ξ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})], P-a.s.

Proof.
Proof of (i). Proposition 10(ii) shows thatlim sup n →∞ sup θ ∈B ( θ ,ρ ) n ln p θχ ( Z n − )(62) ≤ lim sup n →∞ n n − X i =0 sup θ ∈B ( θ ,ρ ) ln p θ ( Z i | Z i − −∞ ) , P -a.s.By (54), for any θ ∈ Θ and ρ > p θ ( Z | Z − −∞ ) ≤ sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ )(63) ≤ r − X i =0 sup θ ∈ Θ ln + | g ( · , Y i ) | ∞ , P -a.s. , R. DOUC AND E. MOULINES which shows using (53) and (A2) that E h(cid:12)(cid:12)(cid:12) sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) (cid:12)(cid:12)(cid:12)i < ∞ . The Birkhoff theorem therefore implieslim sup n →∞ n n − X i =0 sup θ ∈B ( θ ,ρ ) ln p θ ( Z i | Z i − −∞ )(64) = E h sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) i , P -a.s. , which completes the proof of (i). Proof of (ii). First note thatsup θ ∈B ( θ ,ρ ) E [ln p θ ( Z | Z − −∞ )] ≤ E h sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) i . (65)Now, since under (A3), for any m ≥ p , P -a.s., the function θ ln p θχ ( Z | Z − − m )is continuous, then P -a.s., the function θ ln p θ ( Z | Z − −∞ ) is continuous asa uniform limit of continuous functions. Using (63), r − X i =0 sup θ ∈ Θ ln + | g ( · , Y i ) | ∞ − sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) ≥ , the monotone convergence theorem therefore implies thatlim ρ ↓ E h sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) i = E h lim ρ ↓ sup θ ∈B ( θ ,ρ ) ln p θ ( Z | Z − −∞ ) i (66) = E [ln p θ ( Z | Z − −∞ )] . Combining (65) and (66) shows thatlim ρ ↓ sup θ ∈B ( θ ,ρ ) E [ln p θ ( Z | Z − −∞ )] ≤ E [ln p θ ( Z | Z − −∞ )] . Proof of (iii). By taking the limit of both sides of (i) with respect to ρ ↓ θ ∈ Θ,lim ρ ↓ lim sup n →∞ sup θ ∈B ( θ ,ρ ) n ln p θχ ( Z n − ) ≤ E [ln p θ ( Z | Z − −∞ )] , P -a.s.(67)Therefore, for any δ > θ ∈ Ξ, there exists ρ θ > n →∞ sup θ ∈B ( θ ,ρ θ ) n ln p θχ ( Z n − ) ≤ E [ln p θ ( Z | Z − −∞ )] + δ, P -a.s. 
LE IN MISSPECIFIED HMMS Since Ξ is compact, by extracting a finite covering, the latter inequalityshows thatlim sup n →∞ sup θ ∈ Ξ n ln p θχ ( Z n − ) ≤ sup θ ∈ Ξ E [ln p θ ( Z | Z − −∞ )] + δ, P -a.s.Since δ is arbitrary, we therefore havelim sup n →∞ sup θ ∈ Ξ n ln p θχ ( Z n − ) ≤ sup θ ∈ Ξ E [ln p θ ( Z | Z − −∞ )] . (68)Now, since for any θ ∈ Ξ,sup θ ∈ Ξ n ln p θχ ( Z n − ) ≥ n ln p θ χ ( Z n − ) . Proposition 10(iii) yieldslim inf n →∞ sup θ ∈ Ξ n ln p θχ ( Z n − ) ≥ E [ln p θ ( Z | Z − −∞ )] , P -a.s. θ being arbitrary in Ξ, we finally obtainlim inf n →∞ sup θ ∈ Ξ n ln p θχ ( Z n − ) ≥ sup θ ∈ Ξ E [ln p θ ( Z | Z − −∞ )] , P -a.s.Combining this inequality with (68) completes the proof. (cid:3) Theorem 12.
Assume (A1)–(A3). Then, for any probability measure χ ∈ M(D, r),

lim_{n→∞} d(θ̂_{χ,nr}, Θ⋆_b) = 0, P-a.s.,

where Θ⋆_b ⊂ Θ is defined by Θ⋆_b ≜ argmax_{θ∈Θ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})].

Proof.
By Proposition 11(ii), the function θ ↦ E[ln p^θ(Z_0 | Z_{−∞}^{−1})] is upper semi-continuous. Therefore the set Θ⋆_b is compact, as a closed subset of the compact set Θ, so that for any δ > 0, Ξ_δ = {θ ∈ Θ : d(θ, Θ⋆_b) ≥ δ} is also a compact set. In addition, being upper semi-continuous, the function θ ↦ E[ln p^θ(Z_0 | Z_{−∞}^{−1})] restricted to Ξ_δ attains its maximum, which implies that

sup_{θ∈Ξ_δ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})] = max_{θ∈Ξ_δ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})] < E[ln p^{θ⋆}(Z_0 | Z_{−∞}^{−1})],

where θ⋆ is any point in Θ⋆_b. Combining this with Proposition 11(iii) yields

lim_{n→∞} sup_{θ∈Ξ_δ} (1/n) ln p^θ_χ(Z_0^{n−1}) < E[ln p^{θ⋆}(Z_0 | Z_{−∞}^{−1})], P-a.s.
Using that

lim_{n→∞} (1/n) ln p^{θ⋆}_χ(Z_0^{n−1}) = E[ln p^{θ⋆}(Z_0 | Z_{−∞}^{−1})], P-a.s.,

we finally obtain that, P-a.s., θ̂_{χ,nr} ∈ Ξ_δ only finitely many times. The proof is complete. □

5.2. Proofs of Proposition 1 and Theorem 2.
We now have all the tools for obtaining the consistency of the MLE as a byproduct of the results obtained for the block MLE. We first state and prove the forgetting of the initial distribution for the predictive filter.
Lemma 13.
Assume (A1). Let 0 < γ_− < γ_+ ≤ 1. Then, for all η > 0, there exists ρ_η ∈ (0, 1) such that, for all sequences (z_i)_{i≥0} satisfying

n^{−1} ∑_{i=0}^{n−1} 1_K(z_i) ≥ max(1 − γ_−, (1 + γ_+)/2),   (69)

all β ∈ (γ_−, γ_+), all bounded measurable functions f, all probability measures χ and χ′ and all θ ∈ Θ,

| χ L_θ⟨z_0^{n−1}⟩ f / χ L_θ⟨z_0^{n−1}⟩ 1_X − χ′ L_θ⟨z_0^{n−1}⟩ f / χ′ L_θ⟨z_0^{n−1}⟩ 1_X | ≤ (ρ_η^{⌊n(β−γ_−)⌋} + 2 η^{⌊n(γ_+−β)⌋/2}) [∏_{i=0}^{n−1} D_{z_i}] / (χ(D) χ′(D)) |f|_∞,

where D_z is defined in (46).

Proof.
By Proposition 5, (cid:12)(cid:12)(cid:12)(cid:12) χ L θ h z n − i fχ L θ h z n − i X − χ ′ L θ h z n − i fχ ′ L θ h z n − i X (cid:12)(cid:12)(cid:12)(cid:12) = | ∆ θχ,χ ′ h z n − i ( f, X ) | χ L θ h z n − i X × χ ′ L θ h z n − i X ≤ ρ ⌊ n ( β − γ − ) ⌋ | f | ∞ + 2 η ⌊ n ( γ + − β ) ⌋ / Q n − i =0 | L θ h z i i ( · , X ) | ∞ χ L θ h z n − i X × χ ′ L θ h z n − i X | f | ∞ , where we have used that χ L θ h z n − i fχ L θ h z n − i X ∨ χ ′ L θ h z n − i fχ ′ L θ h z n − i X ≤ | f | ∞ . The proof follows by noting that (44) implies that Q n − i =0 | L θ h z i i ( · , X ) | ∞ χ L θ h z n − i X × χ ′ L θ h z n − i X ≤ [ Q n − i =0 D z i ] χ ( D ) χ ′ ( D ) . (cid:3) LE IN MISSPECIFIED HMMS Proof of Proposition 1.
Proof of (i). Let χ be a probability measure such that χ(D) >
0. The first step of the proof consists of using the for-getting property obtained in Lemma 13 to show that P -a.s., the sequence( p θχ ( Y | Y − − ℓ )) ℓ ≥ converges. Denote for any t ∈ { , . . . , r } , χ θm,t ( A ) = χ L θ h y − mr − − mr − t i A χ L θ h y − mr − − mr − t i X . Then, write for any m ≥ t ∈ { , . . . , r } and any y − mr − t ∈ Y mr + t +1 , p θχ ( y | y − − mr − t ) = p θχ θm,t ( y | z − − m ) = χ θm,t L θ h z − − m i ( g θ ( · , y )) χ θm,t L θ h z − − m i ( X ) . Let 0 < γ − < γ + <
1. Lemma 13 shows that for any t ∈ { , . . . , r } and η > ρ ∈ (0 ,
1) such that, if m − − X i = − m K ( z i ) ≥ max(1 − γ − , (1 + γ + ) / , then for all β ∈ ( γ − , γ + ), and θ ∈ Θ, | p θχ ( y | y − − mr − t ) − p θχ ( y | y − − mr ) |≤ ρ ⌊ m ( β − γ − ) ⌋ + η ⌊ m ( γ + − β ) ⌋ / χ θm,t ( D ) χ ( D ) − Y j = − m ( D z j ) ! sup θ ∈ Θ | g θ ( · , y ) | ∞ ≤ ρ ⌊ m ( β − γ − ) ⌋ + η ⌊ m ( γ + − β ) ⌋ / D ′− m − Y j = − m ( D z j ) ! sup θ ∈ Θ | g θ ( · , y ) | ∞ , where D ′− m = max t =1 ,...,r − θ ∈ Θ χ θm,t ( D ) χ ( D ) . ( D ′− m ) m ≥ is a stationary sequence. Using the same argument as in the proofof (47), the condition χ ∈ M ( D , r ) [defined in (8)], we have E [ln + D ′− m ] < ∞ .By choosing γ + and γ − such that P Y [ Z ∈ K ] > max(1 − γ − , (1 + γ + ) /
2) andby applying Lemma 6, it follows that there exist ̺ χ ∈ (0 ,
1) and a P -a.s. finiterandom variable C χ such that for any ℓ ≥ | p θχ ( Y | Y − − ℓ ) − p θχ ( Y | Y − − ℓ − ) | ≤ C χ ̺ ℓχ , P -a.s.Similarly, for any probability measure χ ′ such that χ ′ ( D ) >
0, there exist ̺ χ,χ ′ ∈ (0 ,
1) and a P -a.s. finite random variable C χ,χ ′ such that for any ℓ ≥ | p θχ ( Y | Y − − ℓ ) − p θχ ′ ( Y | Y − − ℓ ) | ≤ C χ,χ ′ ̺ ℓχ,χ ′ , P -a.s. R. DOUC AND E. MOULINES
This implies that for any probability measure χ satisfying χ ( D ) >
0, thesequence ( p θχ ( Y | Y − − ℓ )) ℓ ≥ converges P -a.s. and that the limit denoted by p θ ( Y | Y − −∞ ) does not depend on χ . Then, by stationarity of ( Y ℓ ) ℓ ∈ Z , weobtain that for all k ≥ θ ∈ Θ,lim m →∞ p θχ ( Y k | Y k − − m ) = p θ ( Y k | Y k − −∞ ) , P -a.s. , which shows the first part of (i). To complete the proof of (i), it remains toprove that E [ | ln p θ ( Y k | Y k − −∞ ) | ] < ∞ . Since p θχ ( Y k | Y k − − m ) ≤ sup x ∈ X g θ ( x, Y k ),we have ln + p θχ ( Y k | Y k − −∞ ) ≤ ln + sup x ∈ X g θ ( x, Y k ) , which shows, under (A2), that E [ln + p θ ( Y k | Y k − −∞ )] < ∞ . (70)This allows us to define E [ln p θ ( Y k | Y k − −∞ )] as E [ln p θ ( Y k | Y k − −∞ )] = E [ln + p θ ( Y k | Y k − −∞ )] − E [ln − p θ ( Y k | Y k − −∞ )] , so that E [ln − p θ ( Y k | Y k − −∞ )] < ∞ provided that we have shown E [ln p θ ( Y k | Y k − −∞ )] > −∞ . By stationarity of ( Y k ) k ∈ Z , r E [ln p θ ( Y | Y − −∞ )] = r { E [ln + p θ ( Y | Y − −∞ )] − E [ln − p θ ( Y | Y − −∞ )] } = E " r − X k =0 ln + p θ ( Y k | Y k − −∞ ) − E " r − X k =0 ln − p θ ( Y k | Y k − −∞ ) (71) = E " r − X k =0 ln p θ ( Y k | Y k − −∞ ) , where the last equality follows by applying E ( A − B ) = E ( A ) − E ( B ) fornonnegative random variables A, B such that E ( A ) < ∞ . Now, note that r − Y k =0 p θ ( Y k | Y k − −∞ ) = r − Y k =0 lim m →∞ p θχ ( Y k | Y k − − mr ) = lim m →∞ r − Y k =0 p θχ ( Y k | Y k − − mr )= lim m →∞ p θχ ( Y r − | Y − − mr ) = lim m →∞ p θχ ( Z | Z − − m )= p θ ( Z | Z − −∞ ) . By plugging this expression into (71) and using E [ | ln p θχ ( Z | Z − −∞ ) | ] < ∞ (seeProposition 10), we finally obtain r E [ln p θ ( Y | Y − −∞ )] = E [ln p θ ( Z | Z − −∞ )] > −∞ , (72)which completes the proof of (i). LE IN MISSPECIFIED HMMS Proof of (ii). Let χ be a probability measure such that χ ( D ) > t ∈ { , . . . , r − } . 
Then, for any m ≥ m − ln p θχ ( Z m +10 ) ≤ m − ln p θχ ( Y mr + t ) + m − ln + A m,t (73) ≤ m − ln p θχ ( Z m ) + m − ln + B m,t + m − ln + A m,t , where A m,t , sup θ ∈ Θ sup x p θQ θ ( x, · ) ( Y ( m +1) r − mr + t +1 ) , B m,t , sup θ ∈ Θ sup x p θδ x ( Y mr + tmr ) . Note that ( A m,t ) m ≥ and ( B m,t ) m ≥ are stationary. Moreover, using (A2),it can be easily checked that E [ln + A m,t ] < ∞ , E [ln + B m,t ] < ∞ . Then, Lemma 7 may apply and for any β ∈ (0 , P -a.s. finiterandom variables A, B such that for all m ≥ A m,t ≤ Aβ − m , B m,t ≤ Bβ − m , P -a.s.so that, P -a.s., 0 ≤ lim sup m →∞ m − ln + A m,t ≤ − ln β, ≤ lim sup m →∞ m − ln + B m,t ≤ − ln β. By letting β ↑ m →∞ m − ln + A m,t = 0 , lim m →∞ m − ln + B m,t = 0 , P -a.s.(74)Now, note that ( A m,t ) m ≥ and ( B m,t ) m ≥ do not depend on θ ∈ Θ so that(74) together with (73) yieldslim sup m →∞ sup θ ∈ Θ m − | ln p θχ ( Y mr + t ) − ln p θχ ( Z m ) | = 0 , P -a.s.(75)Since t is chosen arbitrarily in { , . . . , r − } , we finally obtain using Propo-sition 10(ii), lim n →∞ n − ln p θχ ( Y n ) = r − lim m →∞ m − ln p θχ ( Z m )= r − E [ln p θ ( Z | Z − −∞ )]= E [ln p θ ( Y | Y − −∞ )] , P -a.s. , which completes the proof of Proposition 1. (cid:3) Proof of Theorem 2.
By Proposition 11(ii) and (72), the function θ ↦ ℓ(θ) is upper semi-continuous. Moreover, (72) also implies

Θ⋆ = argmax_{θ∈Θ} E[ln p^θ(Y_0 | Y_{−∞}^{−1})] = argmax_{θ∈Θ} E[ln p^θ(Z_0 | Z_{−∞}^{−1})] = Θ⋆_b.

Now let t ∈ {0, …, r − 1} and recall that Z_m = Y_{mr}^{(m+1)r−1}. Theorem 12 together with (75) shows that

lim_{n→∞} d(θ̂_{χ,nr+t}, Θ⋆) = 0, P-a.s.   (76)

The proof of Theorem 2 is then complete since t is arbitrary in {0, …, r − 1}. □

Proof of Proposition 3.
Under these two conditions, for any u ∈ {0, …, r} and θ ∈ Θ,

χ L_θ⟨y_0^{u−1}⟩ 1_D ≥ (∏_{i=0}^{u−1} inf_{x_i∈D_i} g_θ(x_i, y_i)) ∫⋯∫ χ(dx_0) 1_D(x_u) ∏_{i=1}^u 1_{D_{i−1}}(x_{i−1}) Q_θ(x_{i−1}, dx_i) ≥ (∏_{i=0}^{u−1} inf_{x_i∈D_i} g_θ(x_i, y_i)) χ(D_0) δ^u. □

Proof of Lemma 4.
The proof proceeds by induction on u ∈ {1, …, r}. Assume that D_{u−1} is a compact subset; we show that there exists a compact set D_u such that inf_{x_{u−1}∈D_{u−1}} inf_{θ∈Θ} Q_θ(x_{u−1}, D_u) ≥ δ.

Let (x, θ) ∈ D_{u−1} × Θ and set δ < δ′ < 1. Since X = R^d is a complete separable metric space and 𝒳 is the associated Borel σ-field, there exists a sequence B_1^{x,θ}, B_2^{x,θ}, … of open balls of radius 1 covering X. Choose N_{x,θ} large enough that Q_θ(x, O_{x,θ}) ≥ δ′, where O_{x,θ} = ∪_{i≤N_{x,θ}} B_i^{x,θ}. Since for any open set O the function (x′, θ′) ↦ Q_{θ′}(x′, O) is lower semi-continuous, there exists a neighborhood V_{x,θ} (for the product topology on X × Θ) such that for all (x′, θ′) ∈ V_{x,θ}, Q_{θ′}(x′, O_{x,θ}) ≥ δ. Since O_{x,θ} is totally bounded, its closure, denoted K_{x,θ}, is a compact subset, which satisfies, for any (x′, θ′) ∈ V_{x,θ}, Q_{θ′}(x′, K_{x,θ}) ≥ δ.

Then ∪_{(x,θ)∈D_{u−1}×Θ} V_{x,θ} is a covering of D_{u−1} × Θ. Since the set D_{u−1} × Θ is compact, we may extract a finite subcover D_{u−1} × Θ ⊆ ∪_{i=1}^I V_{x_i,θ_i}. Take D_u = ∪_{i=1}^I K_{x_i,θ_i}. As a finite union of compact sets, D_u is a compact set, which satisfies, for all (x, θ) ∈ D_{u−1} × Θ, Q_θ(x, D_u) ≥ δ. This completes the proof. □

REFERENCES

[1]
[1] Barron, A. R. (1985). The strong ergodic theorem for densities: Generalized Shannon–McMillan–Breiman theorem. Ann. Probab.
[2] Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist.
[3] Budhiraja, A. and Ocone, D. (1997). Exponential stability of discrete-time filters for bounded observation noise. Systems Control Lett.
[4] Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer, New York. MR2159833
[5] Churchill, G. (1992). Hidden Markov chains and the analysis of genome structure. Computers and Chemistry
[6] Douc, R., Fort, G., Moulines, E. and Priouret, P. (2009). Forgetting the initial distribution for hidden Markov models. Stochastic Process. Appl.
[7] Douc, R. and Matias, C. (2001). Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli
[8] Douc, R., Moulines, E., Olsson, J. and van Handel, R. (2011). Consistency of the maximum likelihood estimator for general hidden Markov models. Ann. Statist.
[9] Douc, R., Moulines, É. and Rydén, T. (2004). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Ann. Statist.
[10] Fomby, T. B. and Hill, R. C., eds. (2003). Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later. Advances in Econometrics. Elsevier, Amsterdam. MR2531667
[11] Fredkin, D. R. and Rice, J. A. (1987). Correlation functions of a function of a finite-state Markov process with application to channel kinetics. Math. Biosci.
[12] Fuh, C.-D. (2006). Efficient likelihood estimation in state space models. Ann. Statist.
[13] Fuh, C.-D. (2010). Reply to "On some problems in the article Efficient likelihood estimation in state space models" by Cheng-Der Fuh [Ann. Statist. (2006) 2026–2068] [MR2604693]. Ann. Statist.
[14] Genon-Catalot, V. and Laredo, C. (2006). Leroux's method for general hidden Markov models. Stochastic Process. Appl.
[15] Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics. Univ. California Press, Berkeley.
[16] Hull, J. and White, A. (1987). The pricing of options on assets with stochastic volatilities. J. Finance
[17] Jensen, J. L. (2010). On some problems in the article Efficient likelihood estimation in state space models by Cheng-Der Fuh [Ann. Statist. (2006) 2026–2068] [MR2283726]. Ann. Statist.
[18] Juang, B. H. and Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics
[19] Kleptsyna, M. L. and Veretennikov, A. Y. (2008). On discrete time ergodic filters with wrong initial data. Probab. Theory Related Fields
[20] Le Gland, F. and Mevel, L. (2000). Basic properties of the projective product with application to products of column-allowable nonnegative matrices. Math. Control Signals Systems
[21] Le Gland, F. and Mevel, L. (2000). Exponential forgetting and geometric ergodicity in hidden Markov models. Math. Control Signals Systems
[22] Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models. Stochastic Process. Appl.
[23] Mamon, R. S. and Elliott, R. J., eds. (2007). Hidden Markov Models in Finance. International Series in Operations Research & Management Science. Springer, New York. MR2407726
[24] Mevel, L. and Finesso, L. (2004). Asymptotical statistics of misspecified hidden Markov models. IEEE Trans. Automat. Control
[25] Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer, London. MR1287609
[26] Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Ann. Math. Statist.
[27] van Handel, R. (2008). Discrete time nonlinear filters with informative observations are stable. Electron. Commun. Probab.
[28] Walters, P. (1982). An Introduction to Ergodic Theory. Graduate Texts in Mathematics. Springer, New York. MR0648108
[29] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica

SAMOVAR
CNRS UMR 5157
Institut Télécom/Télécom SudParis
9 rue Charles Fourier
91000 Evry
France
E-mail: [email protected]