Forgetting of the initial condition for the filter in general state-space hidden Markov chain: a coupling approach
Randal Douc
GET/Télécom INT, France, e-mail: [email protected]
Eric Moulines
GET/Télécom Paris, 46 rue Barrault, 75634 Paris Cédex 13, France, e-mail: [email protected]
Ya’acov Ritov
Department of Statistics, The Hebrew University of Jerusalem, e-mail: [email protected]
AMS 2000 subject classifications:
Primary 93E11; secondary 60J57. Keywords: hidden Markov chain, stability, non-linear filtering.
1. Introduction and Notation
We consider the filtering problem for a Markov chain {X_k, Y_k}_{k≥0} with state X_k and observation Y_k. The state process {X_k}_{k≥0} is a homogeneous Markov chain taking values in a measurable set X equipped with a σ-algebra B(X). We let Q be the transition kernel of the chain. The observations {Y_k}_{k≥0} take values in a measurable set Y (B(Y) is the associated σ-algebra). For i ≤ j, denote Y_{i:j} := (Y_i, Y_{i+1}, ..., Y_j); similar notation will be used for other sequences. We assume furthermore that, for each k ≥ 0, conditionally on X_k, the observation Y_k is independent of X_{0:k−1}, X_{k+1:∞}, Y_{0:k−1} and Y_{k+1:∞}. We also assume that, for each x ∈ X, the conditional law of Y_k given X_k = x has a density g(x, ·) with respect to some fixed σ-finite measure on the Borel σ-field B(Y).

We denote by φ_{ξ,n}[y_{0:n}] the distribution of the hidden state X_n conditionally on the observations y_{0:n} := [y_0, ..., y_n], which is given by

φ_{ξ,n}[y_{0:n}](A) = ∫_{X^{n+1}} ξ(dx_0) g(x_0, y_0) ∏_{i=1}^{n} Q(x_{i−1}, dx_i) g(x_i, y_i) 1_A(x_n) / ∫_{X^{n+1}} ξ(dx_0) g(x_0, y_0) ∏_{i=1}^{n} Q(x_{i−1}, dx_i) g(x_i, y_i), (1)

where ξ is the initial distribution. In practice the model is rarely known exactly, and suboptimal filters are therefore computed by replacing the unknown transition kernel, likelihood function and initial distribution by approximations. The choice of these quantities plays a key role both when studying the convergence of sequential Monte Carlo methods and when analysing the asymptotic behaviour of the maximum likelihood estimator (see e.g. (8) or (5) and the references therein). A key question when analyzing the maximum likelihood estimator, or the stability of the filter over an infinite horizon, is whether φ_{ξ,n}[y_{0:n}] and φ_{ξ′,n}[y_{0:n}] are close (in some sense) for large values of n and two different choices of the initial distributions ξ and ξ′.

The forgetting property of the initial condition of the optimal filter in nonlinear state-space models has attracted many research efforts, and it is impossible to give
credit to every contributor. The purpose of the short presentation of existing results below is mainly to allow comparison of the assumptions and results of this contribution with those previously reported in the literature. The first result in this direction was obtained by (13), who established L^p-type convergence of the optimal filter initialised with a wrong initial condition to the filter initialised with the true initial distribution; their proof does not provide a rate of convergence. A new approach based on the Hilbert projective metric was later introduced in (1) to establish the exponential stability of the optimal filter with respect to its initial condition. However, their results are based on stringent mixing conditions for the transition kernel; these conditions state that there exist positive constants ε_− and ε_+ and a probability measure λ on (X, B(X)) such that, for f ∈ B_+(X),

ε_− λ(f) ≤ Q(x, f) ≤ ε_+ λ(f), for any x ∈ X. (2)

This condition implies in particular that the chain is uniformly geometrically ergodic. Similar results were obtained independently by (9) using the Dobrushin ergodicity coefficient (see (10) for further refinements of this result). The mixing condition was later weakened by (6), under the assumption that the kernel Q is positive recurrent and is dominated by some reference measure λ:

sup_{(x,x′)∈X×X} q(x, x′) < ∞ and ∫ essinf_{x′} q(x, x′) π(x) λ(dx) > 0,

where q(x, ·) = dQ(x, ·)/dλ, essinf is the essential infimum with respect to λ, and π dλ is the stationary distribution of the chain Q. Although the upper bound is reasonable, the lower bound is restrictive in many applications and fails to be satisfied, e.g., for the linear Gaussian state-space model.

In (12), the stability of the optimal filter is studied for a class of kernels referred to as pseudo-mixing.
The definition of a pseudo-mixing kernel is adapted to the case where the state space is X = R^d, equipped with the Borel σ-field B(X). A kernel Q on (X, B(X)) is pseudo-mixing if for any compact set C with a diameter d large enough, there exist positive constants ε_−(d), ε_+(d) and a measure λ_C (which may be chosen to be finite without loss of generality) such that

ε_−(d) λ_C(A) ≤ Q(x, A) ≤ ε_+(d) λ_C(A), for any x ∈ C, A ∈ B(X). (3)

This condition implies that, for any (x′, x″) ∈ C × C,

ε_−(d)/ε_+(d) ≤ essinf_{x∈X} q(x′, x)/q(x″, x) ≤ esssup_{x∈X} q(x′, x)/q(x″, x) ≤ ε_+(d)/ε_−(d),

where q(x, ·) := dQ(x, ·)/dλ_C, and esssup and essinf denote the essential supremum and infimum with respect to λ_C. This condition is obviously more general than (2), but it is still not satisfied in the linear Gaussian case (see (12, Example 4.3)).

Several attempts have been made to establish stability under the so-called small-noise condition. The first result in this direction was obtained by (1) (in continuous time), who considered an ergodic diffusion process with constant diffusion coefficient and linear observations: when the variance of the observation noise is sufficiently small, (1) established that the filter is exponentially stable. Small-noise conditions also appear (in a discrete-time setting) in (4) and (14). These results do not allow one to consider the linear Gaussian state-space model with arbitrary noise variance.

More recently, (7) proved that the nonlinear filter forgets its initial condition in mean over the observations, for functions satisfying certain integrability conditions. The main result of that paper relies on the martingale convergence theorem rather than on a direct analysis of the filtering equations.
Unfortunately, this method of proof cannot provide any rate of convergence.

It is tempting to assume that forgetting of the initial condition should hold in general, and that the lack of proofs for the general state-space case is only a matter of technicalities. The heuristic argument says that either
• the observations Y are informative, so that we learn about the hidden state X_k from the Y's around it and forget the initial starting point; or
• the observations Y are non-informative, and then the X chain is moving by itself, and by itself it forgets its initial condition, for example if it is positive recurrent.
Since forgetting of the initial condition holds in these two extreme cases, one might expect it to hold under any condition. However, this argument is false, as shown by the following examples, in which the conditional chain does not forget its initial condition whereas the unconditional chain does. Conversely, it can happen that the observed process {Y_k}_{k≥0} is not ergodic, while the conditional chain uniformly forgets its initial condition.

Example 1.
Suppose that {X_k}_{k≥0} are i.i.d. B(1, 1/2) (fair Bernoulli draws). Suppose Y_i = 1{X_i = X_{i−1}}. Then P(X_i = 1 | X_0 = 0, Y_{0:n}) = 1 − P(X_i = 1 | X_0 = 1, Y_{0:n}) ∈ {0, 1}.

Here is a slightly less extreme example. Consider a Markov chain on the unit circle; all values below are taken modulo 2π. We assume that X_i = X_{i−1} + U_i, where the state noise {U_k}_{k≥0} is i.i.d. The chain is hidden by additive noise: Y_i = X_i + ε_i, with ε_i = πW_i + V_i, where W_i is a Bernoulli random variable independent of V_i. Suppose now that U_i and V_i are symmetric and supported on some small interval. The hidden chain does not forget its initial distribution under this model: in fact, the support of the distribution of X_i given Y_{0:n} and X_0 = x is disjoint from the support of its distribution given Y_{0:n} and X_0 = x + π.

On the other hand, let {Y_k}_{k≥0} be an arbitrary process. Suppose it is modeled (incorrectly!) by an autoregressive process observed in additive noise. We will show that, under various assumptions on the distributions of the state and observation noise, the conditional chain (given the observations Y, which are not necessarily generated by the model) forgets its initial condition geometrically fast.

The proofs presented in this paper are based on a generalization of the notion of small sets and on coupling of two (non-homogeneous) Markov chains sampled from the distribution of X_{0:n} given Y_{0:n}. The coupling argument is based on constructing two chains {X_k} and {X′_k} which marginally follow the same sequence of transition kernels, but have different initial distributions of the starting state. The chains move independently until they couple at a random time T, and from that time on they remain equal.

Roughly speaking, the two copies of the chain may couple at time k if they stand close to one another. Formally, we mean by this that the pair of states of the two chains at time k belongs to some set, which may depend on the current, but also past and future, observations.
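The parity mechanism of Example 1 can be verified directly: since Y_i = 1{X_i = X_{i−1}}, the state path is a deterministic function of X_0 and Y_{1:n}, so the conditional law of X_i is a point mass whose location flips with X_0, and the filter never forgets the initial state. A minimal sketch (toy code, not from the paper):

```python
def conditional_state(x0, ys):
    """Deterministic reconstruction of X_i from X_0 and Y_{1:i} in
    Example 1: Y_i = 1 means X_i = X_{i-1}, Y_i = 0 means X_i flips."""
    x = x0
    for y in ys:
        x = x if y == 1 else 1 - x
    return x

ys = [1, 0, 0, 1, 0]
# the two conditional laws are point masses at opposite states:
print(conditional_state(0, ys), conditional_state(1, ys))  # 1 0
```

Whatever the observed sequence, the two conditional distributions sit on opposite states, so their total variation distance stays equal to 1.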
The novelty of the present paper is to consider sets which are genuinely defined by the pair of states. For example, the set can be defined as {(x, x′) : ||x − x′|| < c}; that is, close in the usual sense of the word.

The prototypical example we use is the non-linear state-space model

X_i = a(X_{i−1}) + U_i,
Y_i = b(X_i) + V_i, (4)

where {U_k}_{k≥0} is the state noise and {V_k}_{k≥0} is the measurement noise. Both {U_k}_{k≥0} and {V_k}_{k≥0} are assumed to be i.i.d. and mutually independent. Of course, the filtering problem for the linear version of this model with independent Gaussian noise is solved explicitly by the Kalman filter; but this is one of the few non-trivial models which admit a simple solution. Under the Gaussian linear model, we argue that, whatever Y_{0:n} is, two independent chains drawn from the conditional distribution will remain close to each other even if the Y's are drifting away. Any time they are close, they will be able to couple, and this will happen quite frequently.

Our approach for proving that a chain forgets its initial conditions can be decomposed into two stages. We first argue that there are coupling sets (which may depend on the observations, and may also vary with the iteration index) where we can couple two copies of the chain, drawn independently from the conditional distribution given the observations and started from two different initial conditions, with a probability which is an explicit function of the observations. We then argue that a pair of chains is likely to drift frequently towards these coupling sets.

The first group of results identifies situations in which the coupling set is given in product form, and in particular situations where X × X is a coupling set.
In the typical situation, many values of Y_i entail that X_i is in some set with high probability; hence two conditionally independent copies are likely to be in this set and close to each other. In particular, this enables us to prove the convergence of (nonlinear) state-space processes with bounded noise and, more generally, in situations where the tails of the observation errors are thinner than those of the dynamics innovations.

The second argument generalizes the standard drift condition to the coupling set. The general argument, specialized to the linear Gaussian state model, is surprisingly simple. We generalize this argument to the linear model in which both the dynamics innovations and the measurement errors have strongly unimodal densities.
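For the linear Gaussian version of (4), the filter is available in closed form, and the forgetting phenomenon that the coupling argument is meant to capture can be observed directly: two Kalman filters started from very different priors merge on any observation record. A small sketch with illustrative parameters (our own choice, not from the paper):

```python
import numpy as np

a, b, su2, sv2 = 0.9, 1.0, 1.0, 1.0   # X_i = a X_{i-1} + U_i, Y_i = b X_i + V_i

def kalman(ys, m, P):
    """Scalar Kalman filter: posterior mean and variance of X_n given
    y_{1:n}, starting from the prior N(m, P) on X_0."""
    for y in ys:
        m, P = a * m, a * a * P + su2                  # predict
        K = P * b / (b * b * P + sv2)                  # gain
        m, P = m + K * (y - b * m), (1 - K * b) * P    # update
    return m, P

rng = np.random.default_rng(2)
x, ys = 0.0, []
for _ in range(50):                    # one observation record
    x = a * x + rng.normal()
    ys.append(b * x + rng.normal())

m1, P1 = kalman(ys, -10.0, 1.0)        # two very different priors
m2, P2 = kalman(ys, 10.0, 100.0)
print(abs(m1 - m2) < 1e-6, abs(P1 - P2) < 1e-9)   # True True
```

The posterior variance recursion does not depend on the data at all, and the difference of the posterior means contracts by the factor (1 − Kb)a at each step: an elementary instance of the exponential forgetting studied here.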
2. Notations and definitions
Let n be a given positive index and consider the finite-dimensional distributions of {X_k}_{k≥0} given Y_{0:n}. It is well known (see (5, Chapter 3)) that, for any positive index k, the distribution of X_k given X_{0:k−1} and Y_{0:n} reduces to that of X_k given X_{k−1} only and Y_{0:n}. The following definitions will be instrumental in decomposing the joint posterior distributions.

Definition 1 (Backward functions). For k ∈ {0, . . . , n}, the backward function β_{k|n} is the non-negative measurable function on Y^{n−k} × X defined by

β_{k|n}(y_{k+1:n}, x) = ∫ ··· ∫ Q(x, dx_{k+1}) g(x_{k+1}, y_{k+1}) ∏_{l=k+2}^{n} Q(x_{l−1}, dx_l) g(x_l, y_l), (5)

for k ≤ n − 1 (with the convention that the rightmost product is empty for k = n − 1); β_{n|n}(·) is set to the constant function equal to 1 on X.

The term "backward variables" is part of the HMM credo and dates back to the seminal work of Baum and his colleagues (2, p. 168). The backward functions may be obtained, for all x ∈ X, by the recursion

β_{k|n}(x) = ∫ Q(x, dx′) g(x′, y_{k+1}) β_{k+1|n}(x′), (6)

operating on decreasing indices k = n − 1, n − 2, . . . , 0, initialized with

β_{n|n}(x) = 1. (7)

Definition 2 (Forward smoothing kernels). Given n ≥ 0, define for indices k ∈ {0, . . . , n − 1} the transition kernels

F_{k|n}(x, A) := [β_{k|n}(x)]^{−1} ∫_A Q(x, dx′) g(x′, y_{k+1}) β_{k+1|n}(x′) if β_{k|n}(x) ≠ 0, and F_{k|n}(x, A) := 0 otherwise, (8)

for any point x ∈ X and set A ∈ B(X). For indices k ≥ n, simply set

F_{k|n} := Q, (9)

where Q is the transition kernel of the unobservable chain {X_k}_{k≥0}.

Note that for indices k ≤ n − 1, F_{k|n} depends on the future observations Y_{k+1:n} through the backward variables β_{k|n} and β_{k+1|n} only. The subscript n in the notation F_{k|n} is meant to underline the fact that, like the backward functions β_{k|n}, the forward smoothing kernels F_{k|n} depend on the final index n at which the observation sequence ends. For any x ∈ X, A ↦ F_{k|n}(x, A) is a probability measure on B(X). Because the functions x ↦ β_{k|n}(x) are measurable on (X, B(X)), for any set A ∈ B(X), x ↦ F_{k|n}(x, A) is B(X)/B(R)-measurable. Therefore, F_{k|n} is indeed a Markov transition kernel on (X, B(X)).

Given n, for any index k ≥ 0 and f ∈ F_b(X),

E_ξ[f(X_{k+1}) | X_{0:k}, Y_{0:n}] = F_{k|n}(X_k, f).

More generally, for any integers n and m, any function f ∈ F_b(X^{m+1}) and any initial probability ξ on (X, B(X)),

E_ξ[f(X_{0:m}) | Y_{0:n}] = ∫ ··· ∫ f(x_{0:m}) φ_{ξ,0|n}(dx_0) ∏_{i=1}^{m} F_{i−1|n}(x_{i−1}, dx_i), (10)

where {F_{k|n}}_{k≥0} are defined by (8) and (9) and φ_{ξ,k|n} is the marginal smoothing distribution of the state X_k given the observations Y_{0:n}. Note that φ_{ξ,k|n} may be expressed, for any A ∈ B(X), as

φ_{ξ,k|n}(A) = [∫ φ_{ξ,k}(dx) β_{k|n}(x)]^{−1} ∫_A φ_{ξ,k}(dx) β_{k|n}(x), (11)

where φ_{ξ,k} is the filtering distribution defined in (1) and β_{k|n} is the backward function.
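On a finite state space, all of the objects above reduce to matrix operations. The following sketch (toy kernel and likelihoods, our own choice) implements the backward recursion (6)-(7) and the forward smoothing kernels (8), and checks numerically that propagating the marginal smoothing distribution (11) at k = 0 through the kernels F_{k|n} recovers the filtering distribution at time n, as (10) implies:

```python
import numpy as np

Q = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition kernel
G = np.array([[0.8, 0.2], [0.3, 0.7]])   # G[x, y] = g(x, y)
xi = np.array([0.5, 0.5])                # initial distribution
ys = [0, 1, 1, 0]
n = len(ys) - 1

# backward recursion (6)-(7), run on decreasing indices
beta = np.ones((n + 1, 2))
for k in range(n - 1, -1, -1):
    beta[k] = Q @ (G[:, ys[k + 1]] * beta[k + 1])

# filtering distributions phi_{xi,k} of (1), by the normalized forward recursion
phi = [xi * G[:, ys[0]] / (xi * G[:, ys[0]]).sum()]
for y in ys[1:]:
    p = (phi[-1] @ Q) * G[:, y]
    phi.append(p / p.sum())

# forward smoothing kernels (8): each row is a probability distribution
F = [Q * (G[:, ys[k + 1]] * beta[k + 1]) / beta[k][:, None] for k in range(n)]

# smoothing marginal (11) at k = 0, propagated through the F's as in (10)
phi0n = phi[0] * beta[0] / (phi[0] * beta[0]).sum()
prop = phi0n
for f in F:
    prop = prop @ f
print(np.allclose(prop, phi[-1]))   # True: the filter is recovered at time n
```

That the rows of each F_{k|n} sum to one is exactly the recursion (6), which is also a convenient numerical check.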
3. The coupling construction and coupling sets
As outlined in the introduction, our proofs are based on coupling two copies of the conditional chain started from two different initial conditions. For any two probability measures µ and µ′ we define the total variation distance ||µ − µ′||_TV = sup_A |µ(A) − µ′(A)|, and we recall the identities sup_{|f|≤1} |µ(f) − µ′(f)| = 2 ||µ − µ′||_TV and sup_{0≤f≤1} |µ(f) − µ′(f)| = ||µ − µ′||_TV.

Let n and m be integers, and let k ∈ {0, . . . , n − m}. Define the m-skeleton of the forward smoothing kernels as

F_{k,m|n} := F_{km|n} F_{km+1|n} ··· F_{km+m−1|n}. (12)

Definition 3 (Coupling constant of a set). Let n and m be integers, and let k ∈ {0, . . . , n − m}. The coupling constant of the set C̄ ⊂ X × X is defined as

ε_{k,m|n}(C̄) := 1 − (1/2) sup_{(x,x′)∈C̄} ||F_{k,m|n}(x, ·) − F_{k,m|n}(x′, ·)||_TV. (13)

The definition of the coupling constant implies that, for any (x, x′) ∈ C̄,

F_{k,m|n}(x, A) ∧ F_{k,m|n}(x′, A) ≥ ε_{k,m|n}(C̄) ν^{C̄}_{k,m|n}(x, x′; A), (14)

where

ν^{C̄}_{k,m|n}(x, x′; A) = (F_{k,m|n}(x, ·) ∧ F_{k,m|n}(x′, ·))(A) / (F_{k,m|n}(x, ·) ∧ F_{k,m|n}(x′, ·))(X), (15)

and where, for any measures µ and ν on (X, B(X)), µ ∧ ν is the largest measure for which (µ ∧ ν)(A) ≤ min(µ(A), ν(A)) for all A ∈ B(X).

We may now proceed to the coupling construction. Let n be an integer and, for any k ∈ {0, . . . , ⌊n/m⌋}, let C̄_{k|n} be a set-valued function, C̄_{k|n} : Y^{n+1} → B(X) ⊗ B(X), where B(X) ⊗ B(X) is the smallest σ-algebra containing the sets A × B with A, B ∈ B(X). We define R̄_{k,m|n} as the Markov transition kernel satisfying, for all (x, x′) ∈ C̄_{k|n} and all A, A′ ∈ B(X),

R̄_{k,m|n}(x, x′; A × A′) = {(1 − ε_{k,m|n})^{−1} (F_{k,m|n}(x, A) − ε_{k,m|n} ν_{k,m|n}(x, x′; A))} × {(1 − ε_{k,m|n})^{−1} (F_{k,m|n}(x′, A′) − ε_{k,m|n} ν_{k,m|n}(x, x′; A′))}, (16)

where we have omitted the dependence on the set C̄_{k|n} in the notation for the coupling constant ε_{k,m|n} and the minorizing probability ν_{k,m|n}. For all (x, x′) ∈ X × X, we define

F̄_{k,m|n}(x, x′; ·) = F_{k,m|n} ⊗ F_{k,m|n}(x, x′; ·), (17)

where, for two kernels K and L on X, K ⊗ L is the tensor product of the kernels, i.e., for all (x, x′) ∈ X × X and A, A′ ∈ B(X),

K ⊗ L(x, x′; A × A′) = K(x, A) L(x′, A′). (18)

Define the product space Z = X × X × {0, 1} and the associated product σ-algebra B(Z). Define on the space (Z^N, B(Z)^{⊗N}) a Markov chain Z_i := (X̃_i, X̃′_i, d_i), i ∈ {0, . . . , n}, as follows.
If d_i = 1, then draw X̃_{i+1} ∼ F_{i,m|n}(X̃_i, ·), and set X̃′_{i+1} = X̃_{i+1} and d_{i+1} = 1. Otherwise, if (X̃_i, X̃′_i) ∈ C̄_{i|n}, flip a coin with probability of heads ε_{i,m|n}. If the coin comes up heads, then draw X̃_{i+1} from ν_{i,m|n}(X̃_i, X̃′_i; ·), and set X̃′_{i+1} = X̃_{i+1} and d_{i+1} = 1. If the coin comes up tails, then draw (X̃_{i+1}, X̃′_{i+1}) from the residual kernel R̄_{i,m|n}(X̃_i, X̃′_i; ·) and set d_{i+1} = 0. If (X̃_i, X̃′_i) ∉ C̄_{i|n}, then draw (X̃_{i+1}, X̃′_{i+1}) according to the kernel F̄_{i,m|n}(X̃_i, X̃′_i; ·) and set d_{i+1} = 0.

For µ a probability measure on B(Z), denote by P^Y_µ the probability measure induced by the Markov chain Z_i, i ∈ {0, . . . , n}, with initial distribution µ. It is then easily checked that, for any i ∈ {0, . . . , ⌊n/m⌋}, any initial distributions ξ and ξ′, and any A, A′ ∈ B(X),

P^Y_{ξ⊗ξ′⊗δ_0}(Z_i ∈ A × X × {0, 1}) = φ_{ξ,im|n}(A),
P^Y_{ξ⊗ξ′⊗δ_0}(Z_i ∈ X × A′ × {0, 1}) = φ_{ξ′,im|n}(A′),

where δ_x is the Dirac measure at x, ⊗ is the tensor product of measures, and φ_{ξ,k|n} is the marginal posterior distribution given by (11).

Note that d_i is the bell variable, which indicates whether the chains have coupled (d_i = 1) or not (d_i = 0) by time i. Define the coupling time

T = inf{k ≥ 1, d_k = 1}, (19)

with the convention inf ∅ = ∞. By the Lindvall inequality, the total variation distance between the filtering distributions associated with two different initial distributions ξ and ξ′, i.e., P_ξ(X_n ∈ · | Y_{0:n}) and P_{ξ′}(X_n ∈ · | Y_{0:n}), is bounded by the tail distribution of the coupling time:

||P_ξ(X_n ∈ · | Y_{0:n}) − P_{ξ′}(X_n ∈ · | Y_{0:n})||_TV ≤ P^Y_{ξ⊗ξ′⊗δ_0}(T ≥ ⌊n/m⌋). (20)

In the following section, we consider several conditions allowing one to bound the tail distribution of the coupling time.
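In the homogeneous case with C̄ = X × X, the construction above reduces to a familiar coin-flip coupling and T is geometric with success probability ε. A small sketch (illustrative two-state kernel, our own choice) of the ε-coin / residual-kernel mechanism of (16) and the coupling time (19):

```python
import numpy as np

rng = np.random.default_rng(0)
F = np.array([[0.6, 0.4], [0.3, 0.7]])   # illustrative homogeneous kernel

# coupling constant (13) with C = X x X, via the minorizing measure (15);
# for discrete laws, (mu ^ mu')(X) = 1 - (1/2) sum_x |mu(x) - mu'(x)|
nu_raw = np.minimum(F[0], F[1])
eps = nu_raw.sum()                       # here eps = 0.7
nu = nu_raw / eps

def coupling_time(x, xp, max_steps=10_000):
    """Bell-variable construction: flip an eps-coin; on heads both chains
    draw the same point from nu and couple; on tails each moves by its
    residual kernel, as in (16)."""
    for t in range(1, max_steps + 1):
        if rng.random() < eps:
            return t                     # d_t = 1: the chains have merged
        x = rng.choice(2, p=(F[x] - eps * nu) / (1 - eps))
        xp = rng.choice(2, p=(F[xp] - eps * nu) / (1 - eps))
    return max_steps

T = [coupling_time(0, 1) for _ in range(3000)]
print(np.mean(T))   # close to 1/eps = 1.43, the mean of a geometric law
```

By (20), the tail P(T ≥ k) = (1 − ε)^{k−1} then bounds the total variation distance between the two filters.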
Of course, the construction above is of interest only if we may find set-valued functions C̄_{k|n} whose coupling constants ε_{k,m|n}(C̄_{k|n}) are non-zero 'most of the time'. Recall that these quantities are typically functions of the whole trajectory y_{0:n}. It is not always easy to find such sets, because the definition of the coupling constant involves products of the forward smoothing kernels F_{k|n}, which are not easy to handle. In some situations (but not always), it is possible to identify appropriate sets from the properties of the unconditional transition kernel Q.

Definition 4 (Strong small set). A set C ∈ B(X) is a strong small set for the transition kernel Q if there exist a measure ν_C and constants σ_−(C) > 0 and σ_+(C) < ∞ such that, for all x ∈ C and A ∈ B(X),

σ_−(C) ν_C(A) ≤ Q(x, A) ≤ σ_+(C) ν_C(A). (21)

The following proposition helps to characterize appropriate sets where coupling may occur with positive probability from products of strong small sets.

Proposition 1.
Assume that C is a strong small set. Then, for any n and any k ∈ {0, . . . , n}, C × C is a coupling set for the forward smoothing kernels F_{k|n}; more precisely, there exists a probability distribution ν_{k|n} such that, for any A ∈ B(X),

inf_{x∈C} F_{k|n}(x, A) ≥ (σ_−(C)/σ_+(C)) ν_{k|n}(A).

Proof.
The proof is postponed to the appendix.

Assume that X = R^d and that the kernel satisfies the pseudo-mixing condition (3). Let C be a compact set with diameter d = diam(C) large enough that (3) is satisfied. Then, for any n and any k ∈ {0, . . . , n}, C̄ = C × C is a coupling set for F_{k|n}, and ε(C̄) may be chosen equal to ε_−(d)/ε_+(d). (12) gives non-trivial examples of pseudo-mixing Markov chains which are not uniformly ergodic. Nevertheless, although the existence of small sets is automatically guaranteed for phi-irreducible Markov chains, the conditions imposed for the existence of a strong small set are much more stringent. As shown below, it is sometimes worthwhile to consider coupling sets which are much larger than products of strong small sets.
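As a toy numerical check (our own example, not from the paper): for a scalar drift a(x) = 0.9x with Laplace state noise, i.e. q(x, x′) = (1/2) exp(−|x′ − a(x)|), any bounded set C is a strong small set in the sense of (21). Taking ν_C proportional to q(z, ·) for a fixed z ∈ C, the constants σ_±(C) can be read off the ratio q(x, x′)/q(z, x′), which is bounded above and away from zero uniformly in x′:

```python
import numpy as np

a = lambda x: 0.9 * x                                # hypothetical drift
q = lambda x, xp: 0.5 * np.exp(-np.abs(xp - a(x)))   # Laplace transition density

C = np.linspace(-1.0, 1.0, 41)        # candidate strong small set
xs = np.linspace(-50.0, 50.0, 4001)   # grid standing in for the state space
z = 0.0                               # reference point: nu_C ~ q(z, .)

ratios = q(C[:, None], xs[None, :]) / q(z, xs)[None, :]
sigma_minus, sigma_plus = ratios.min(), ratios.max()
print(sigma_minus, sigma_plus)        # approx exp(-0.9) and exp(0.9)
```

The same ratio computed with a Gaussian q is unbounded in x′ (the log-ratio grows linearly), which is one way to see why the linear Gaussian model falls outside the strong-small-set framework and motivates the larger coupling sets considered later.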
4. Coupling over the whole state-space
The easiest situation is when the coupling constant of the whole state space, ε_{k,m|n}(X × X), stays away from zero for sufficiently many trajectories y_{0:n}; for unconditional Markov chains, this occurs when the chain is uniformly ergodic (i.e., satisfies the Doeblin condition). This is still the case here, though now the constants may depend on the observations Y. As stressed in the discussion, and perhaps surprisingly, we will find non-trivial examples where the coupling constant ε_{k,m|n}(X × X) is bounded away from zero for all y_{0:n}, whereas the underlying unconditional Markov chain is not uniformly geometrically ergodic. We state without proof the following elementary result.

Theorem 2.
Let n be an integer and m ≥ 1. Then,

||φ_{ξ,n} − φ_{ξ′,n}||_TV ≤ ∏_{k=0}^{⌊n/m⌋} {1 − ε_{k,m|n}(X × X)}.

Remark. Consider the case where the kernel is uniformly ergodic, i.e.,

σ_− := inf_{(x,x′)∈X×X} q(x, x′) > 0 and σ_+ := sup_{(x,x′)∈X×X} q(x, x′) < ∞.

One may then take m = 1 and, using Proposition 1, ε_{k,1|n}(X × X) ≥ σ_−/σ_+. In such a case,

||φ_{ξ,n} − φ_{ξ′,n}||_TV ≤ (1 − σ_−/σ_+)^n.

To go beyond this example, we have to find verifiable conditions under which we may ascertain that X × X is an m-coupling set.

Definition 5 (Uniform accessibility). Let k, ℓ, n be integers satisfying ℓ ≥ 1 and k ∈ {0, . . . , n − ℓ}. A set C is uniformly accessible for the forward smoothing kernels F_{k,ℓ|n} if there exists a constant κ_{k,ℓ}(C) > 0 satisfying

inf_{x∈X} F_{k,ℓ|n}(x, C) ≥ κ_{k,ℓ}(C). (22)

The next step is to find conditions under which a set is uniformly accessible. For any set A ∈ B(X), define the function α : Y^{ℓ+1} → [0, 1],

α(y_{0:ℓ}; A) := inf_{(x_0, x_{ℓ+1})∈X×X} W[y_{0:ℓ}](x_0, x_{ℓ+1}; A) / W[y_{0:ℓ}](x_0, x_{ℓ+1}; X) = (1 + α̃(y_{0:ℓ}; A))^{−1}, (23)

where we have set

W[y_{0:ℓ}](x_0, x_{ℓ+1}; A) := ∫ ··· ∫ ∏_{i=1}^{ℓ} q(x_{i−1}, x_i) g(x_i, y_i) q(x_ℓ, x_{ℓ+1}) 1_A(x_ℓ) µ(dx_{1:ℓ}) (24)

and

α̃(y_{0:ℓ}; A) := sup_{(x_0, x_{ℓ+1})∈X×X} W[y_{0:ℓ}](x_0, x_{ℓ+1}; A^c) / W[y_{0:ℓ}](x_0, x_{ℓ+1}; A). (25)

Of course, the situations of interest are those where α(y_{0:ℓ}; A) is positive or, equivalently, α̃(y_{0:ℓ}; A) < ∞. In such cases, we may prove the following uniform accessibility condition.

Proposition 3.
For any integer n and any k ∈ {0, . . . , n − ℓ},

inf_{x∈X} F_{k,ℓ|n}(x, C) ≥ α(Y_{k+1:k+ℓ}; C). (26)

If in addition C is a strong small set for Q, then X × X is an (ℓ + 1)-coupling set:

inf_{x∈X} F_{k+ℓ+1|n} ··· F_{k|n}(x, A) ≥ (σ_−(C)/σ_+(C)) α(Y_{k+1:k+ℓ}; C). (27)

The proof is given in Section 6.

Assume that a Markov chain {X_k}_{k≥0} on X = R^{d_X} is observed in bounded noise. The case of bounded observation error is of course special, because the observations Y allow one to locate the corresponding X's within a set. More precisely, we assume that {X_k}_{k≥0} is a Markov chain with transition kernel Q having density q with respect to Lebesgue measure, and Y_k = b(X_k) + V_k, where
• {V_k} is an i.i.d. sequence, independent of {X_k}, with density p_V; in addition, p_V(|x|) = 0 for |x| ≥ M;
• the transition density (x, x′) ↦ q(x, x′) is strictly positive and continuous;
• the level sets of b, {x ∈ X : |b(x)| ≤ K}, are compact.

This case has already been considered by (3), using projective Hilbert metric techniques. We will compute an explicit lower bound for the coupling constant ε_{k,2|n}(X × X), and will then prove, under mild additional assumptions on the distribution of the Y's, that the chain forgets its initial conditions geometrically fast. For y ∈ Y, denote C(y) := {x ∈ X : |b(x)| ≤ |y| + M}. Note that, for any x ∈ X and A ∈ B(X),

F_{k+1|n} F_{k|n}(x, A) = ∫∫ q(x, x_{k+1}) g(x_{k+1}, Y_{k+1}) q(x_{k+1}, x_{k+2}) g(x_{k+2}, Y_{k+2}) 1_A(x_{k+2}) β_{k+2|n}(x_{k+2}) dx_{k+1} dx_{k+2} / ∫∫ q(x, x_{k+1}) g(x_{k+1}, Y_{k+1}) q(x_{k+1}, x_{k+2}) g(x_{k+2}, Y_{k+2}) β_{k+2|n}(x_{k+2}) dx_{k+1} dx_{k+2}.

Since q is continuous and positive, for any compact sets C and C′, inf_{C×C′} q(x, x′) > 0 and sup_{C×C′} q(x, x′) < ∞.
On the other hand, because the observation noise is bounded, g(x, y) = g(x, y) 1_{C(y)}(x). Therefore,

F_{k+1|n} F_{k|n}(x, A) ≥ ρ(Y_{k+1}, Y_{k+2}) ν_{k|n}(A),

where

ρ(y, y′) = inf_{C(y)×C(y′)} q(x, x′) / sup_{C(y)×C(y′)} q(x, x′)

and

ν_{k|n}(A) := ∫ g(x_{k+2}, Y_{k+2}) 1_A(x_{k+2}) β_{k+2|n}(x_{k+2}) ν(dx_{k+2}) / ∫ g(x_{k+2}, Y_{k+2}) β_{k+2|n}(x_{k+2}) ν(dx_{k+2}).

By applying Theorem 2, we obtain

||φ_{ξ,n} − φ_{ξ′,n}||_TV ≤ ∏_{k=0}^{⌊n/2⌋} {1 − ρ(Y_k, Y_{k+1})}.

Hence, the Markov chain is geometrically ergodic if

liminf_{n→∞} n^{−1} Σ_{k=0}^{⌊n/2⌋} ρ(Y_k, Y_{k+1}) > 0, a.s.

This property holds under many different assumptions on the observations Y_{0:n}, and in particular if the observations follow a model which is 'approximately equal' to the assumed one.

It is also of interest to consider cases where both the X's and the Y's are unbounded. We consider a non-linear non-Gaussian state-space model (borrowed from (12, Example 5.8)). We assume that X_0 ∼ ξ and, for k ≥ 1,

X_k = a(X_{k−1}) + U_k,
Y_k = b(X_k) + V_k,

where {U_k} and {V_k} are two independent sequences of i.i.d. random variables, with probability densities p̄_U and p̄_V with respect to Lebesgue measure on X = R^{d_X} and Y = R^{d_Y}, respectively. In addition, we assume that
• for any x ∈ X = R^{d_X}, p̄_U(x) = p_U(|x|), where p_U is bounded, bounded away from zero on [0, M], non-increasing on [M, ∞), and, for some positive constant γ,

p_U(α + β) / (p_U(α) p_U(β)) ≥ γ > 0; (28)

• the function a is Lipschitz, i.e., there exists a positive constant a_+ such that |a(x) − a(x′)| ≤ a_+ |x − x′| for any x, x′ ∈ X;
• the function b is one-to-one and differentiable, and its Jacobian is bounded and bounded away from zero.
• for any y ∈ Y = R^{d_Y}, p̄_V(y) = p_V(|y|), where p_V is a bounded positive lower semi-continuous function, non-increasing on [M, ∞), satisfying

Υ := ∫_0^∞ [p_U(x)]^{−1} p_V(b_− x) [p_U(a_+ x)]^{−1} dx < ∞, (29)

where b_− is the lower bound on the Jacobian of the function b.

The condition on the state noise {U_k} is satisfied by Pareto-type, exponential and logistic densities, but obviously not by the Gaussian density, whose tails are too light. The fact that the tails of the state noise U are heavier than the tails of the observation noise V (see (29)) plays a key role in the derivations that follow. In Section 5 we consider a case where this restriction is not needed (e.g., the normal case).

The following technical lemma (whose proof is postponed to Section 7) shows that any set with finite diameter is a strong small set.

Lemma 4.
Assume that diam(C) < ∞. Then, for all x_0 ∈ C and x_1 ∈ X,

ε(C) h_C(x_1) ≤ q(x_0, x_1) ≤ ε^{−1}(C) h_C(x_1), (30)

with

ε(C) := γ p_U(diam(C)) ∧ inf_{u≤diam(C)+M} p_U(u) ∧ (sup_{u≤diam(C)+M} p_U(u))^{−1}, (31)

h_C(x_1) := 1{d(x_1, a(C)) ≤ M} + 1{d(x_1, a(C)) > M} p_U(|x_1 − a(z)|), (32)

where γ is defined in (28) and z is an arbitrary element of C. In addition, for all x_0 ∈ X and x_1 ∈ C,

ν(C) k_C(x_0) ≤ q(x_0, x_1), (33)

with

ν(C) := inf_{|u|≤diam(C)+M} p_U(u), (34)

k_C(x_0) := 1{d(a(x_0), C) < M} + 1{d(a(x_0), C) ≥ M} p_U(|z − a(x_0)|), (35)

where z is an arbitrary point in C.

By Lemma 4, the denominator of (25) is lower bounded by

W[y_1](x_0, x_2; C) ≥ ε(C) ν(C) k_C(x_0) h_C(x_2) ∫_C g(x_1, y_1) dx_1. (36)

Therefore, we may bound α̃(y_1; C), defined in (25), by

α̃(y_1; C) ≤ (ε(C) ν(C) ∫_C g(x_1, y_1) dx_1)^{−1} × sup_{(x_0,x_2)∈X×X} [k_C(x_0)]^{−1} [h_C(x_2)]^{−1} W[y_1](x_0, x_2; C^c). (37)

In the sequel, we choose C = C_K(y) := {x : |x − b^{−1}(y)| ≤ K}, where K is a constant to be chosen later. Since, by construction, the diameter of the set C_K(y) is 2K uniformly with respect to y, the constants ε(C_K(y)) (defined in (31)) and ν(C_K(y)) (defined in (34)) are functions of K only, and are therefore uniformly bounded from below with respect to y. We first show that, for K large enough, ∫_{C_K(y)} g(x_1, y) dx_1 is uniformly bounded from below; this is the content of the following lemma (whose proof is postponed to Section 7). The following two lemmas bound the terms appearing in the RHS of (37).

Lemma 5.

lim_{K→∞} inf_{y∈Y} ∫_{C_K(y)} p_V(|y − b(x)|) dx > 0.

We set z = b^{−1}(y) in the definition (32) of h_C and z = b^{−1}(y) in the definition (35) of k_C.
We denote

I_K(x_0, x_2; y) := [k_{C_K(y)}(x_0)]^{−1} [h_{C_K(y)}(x_2)]^{−1} × ∫_{C_K^c(y)} p_U(|x_1 − a(x_0)|) p_V(|y − b(x_1)|) p_U(|x_2 − a(x_1)|) dx_1. (38)

The following lemma shows that K may be chosen large enough that I_K(x_0, x_2; y) is uniformly bounded over x_0, x_2 and y.

Lemma 6.

limsup_{K→∞} sup_{y∈Y} sup_{(x_0,x_2)∈X×X} I_K(x_0, x_2; y) < ∞. (39)

The proof is postponed to Section 7.
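The tail condition (28), which separates logistic-type state noise from Gaussian state noise, can be checked numerically. Below (a toy check of our own), γ is estimated on a grid for the logistic density, where the ratio is in fact always ≥ 1, and for the Gaussian density, where the ratio decays like e^{−αβ} so that no positive γ exists:

```python
import numpy as np

p_logistic = lambda u: np.exp(-u) / (1 + np.exp(-u)) ** 2   # p_U(u), logistic
p_gauss = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

g = np.linspace(0.0, 20.0, 201)
A, B = np.meshgrid(g, g)

# smallest value of p_U(a + b) / (p_U(a) p_U(b)) over the grid
gamma_log = (p_logistic(A + B) / (p_logistic(A) * p_logistic(B))).min()
gamma_gau = (p_gauss(A + B) / (p_gauss(A) * p_gauss(B))).min()
print(gamma_log, gamma_gau)   # the first stays >= 1, the second collapses to ~0
```

For the Gaussian density the ratio equals √(2π) e^{−αβ}, which vanishes as α, β grow; for the logistic density, (1 + e^{−α})(1 + e^{−β}) ≥ 1 + e^{−(α+β)} gives the lower bound directly.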
5. Pairwise drift conditions
In situations where coupling over the whole state-space leads to trivial results, one may still use the coupling argument, but over smaller sets. In such cases, however, we need a device to control the return time of the joint chain to the set where the two chains are allowed to couple. In this section we obtain results general enough to cover the autoregressive model with Gaussian innovations and Gaussian measurement errors. Drift conditions are used to obtain bounds on the coupling time. Consider the following drift condition.
Definition 6 (Pairwise drift condition toward a set). Let n be an integer, let k ∈ {0, …, n−1}, and let C̄_{k|n} be a set-valued function C̄_{k|n} : Y^{n+1} → B(X) ⊗ B(X). We say that the forward smoothing kernel F_{k|n} satisfies the pairwise drift condition toward the set C̄_{k|n} if there exist functions V_{k|n} : X × X × Y^{n+1} → R_+ with V_{k|n} ≥ 1, and functions λ_{k|n} : Y^{n+1} → [0, 1), ρ_{k|n} : Y^{n+1} → R_+ such that, for any sequence y_{0:n} ∈ Y^{n+1},

    R̄_{k|n} V_{k+1|n}(x, x') ≤ ρ_{k|n},    (x, x') ∈ C̄_{k|n},    (40)
    F̄_{k|n} V_{k+1|n}(x, x') ≤ λ_{k|n} V_{k|n}(x, x'),    (x, x') ∉ C̄_{k|n},    (41)

where R̄_{k|n} is defined in (16) and F̄_{k|n} is defined in (17).

We set ε_{k|n} = ε(C̄_{k|n}), the coupling constant of the set C̄_{k|n}, and we denote

    B_{k|n} := 1 ∨ ρ_{k|n} (1 − ε_{k|n}) λ_{k|n}^{-1}.    (42)

For any vector {a_{i,n}}_{0 ≤ i ≤ n}, denote by [↓a]_{(i,n)} the i-th largest order statistic, i.e., [↓a]_{(1,n)} ≥ [↓a]_{(2,n)} ≥ ⋯ ≥ [↓a]_{(n,n)}, and by [↑a]_{(i,n)} the i-th smallest order statistic, i.e., [↑a]_{(1,n)} ≤ [↑a]_{(2,n)} ≤ ⋯ ≤ [↑a]_{(n,n)}.

Theorem 7.
Let n be an integer. Assume that for each k ∈ {0, …, n−1} there exists a set-valued function C̄_{k|n} : Y^{n+1} → B(X) ⊗ B(X) such that the forward smoothing kernel F_{k|n} satisfies the pairwise drift condition toward the set C̄_{k|n}. Then, for any probabilities ξ, ξ' on (X, B(X)),

    ‖φ_{ξ,n} − φ_{ξ',n}‖_TV ≤ min_{0 ≤ m ≤ n} A_{m,n},    (43)

where

    A_{m,n} := ∏_{i=1}^{m} (1 − [↑ε]_{(i,n)}) + ∏_{i=0}^{n-1} λ_{i|n} ∏_{i=1}^{m} [↓B]_{(i,n)} ξ ⊗ ξ'(V_{0|n}).    (44)

The proof is given in Section 6.1.

Corollary 8.
If there exists a sequence {m(n)} of integers satisfying m(n) ≤ n for any integer n, lim_{n→∞} m(n) = ∞, and, P_Y-a.s.,

    lim sup_n { ∑_{i=1}^{m(n)} log(1 − [↑ε]_{(i,n)}) + ∑_{i=0}^{n-1} log λ_{i|n} + ∑_{i=1}^{m(n)} log [↓B]_{(i,n)} } = −∞,

then lim_n ‖φ_{ξ,n} − φ_{ξ',n}‖_TV = 0, P_Y-a.s.

Corollary 9.
If there exists a sequence {m(n)} of integers such that m(n) ≤ n for any integer n, lim inf_n m(n)/n > 0, and, P_Y-a.s.,

    lim sup_n { n^{-1} ∑_{i=1}^{m(n)} log(1 − [↑ε]_{(i,n)}) + n^{-1} ∑_{i=0}^{n-1} log λ_{i|n} + n^{-1} ∑_{i=1}^{m(n)} log [↓B]_{(i,n)} } ≤ −λ < 0,

then there exists ν ∈ (0, 1) such that ν^{-n} ‖φ_{ξ,n} − φ_{ξ',n}‖_TV → 0, P_Y-a.s.

Consider the linear Gaussian model

    X_i = αX_{i-1} + σU_i,    Y_i = X_i + τV_i,

where |α| < 1, and {U_i}_{i≥1} and {V_i}_{i≥0} are i.i.d. standard Gaussian and independent of X_0. Let n be an integer and k ∈ {0, …, n−1}. The backward functions are given by

    β_{k|n}(x) ∝ exp( −(αx − m_{k|n})² / (2ρ_{k|n}²) ),    (45)

where m_{k|n} and ρ_{k|n}² can be computed for k ∈ {0, …, n−1} using the following backward recursions (see (6)):

    m_{k|n} = (ρ_{k+1|n}² Y_{k+1} + ατ² m_{k+1|n}) / (ρ_{k+1|n}² + α²τ²),
    ρ_{k|n}² = ((τ² + σ²) ρ_{k+1|n}² + α²σ²τ²) / (ρ_{k+1|n}² + α²τ²),    (46)

initialized with m_{n-1|n} = Y_n and ρ_{n-1|n}² = σ² + τ². The conditional transition kernel F_{i|n}(x, ·) has a density with respect to the Lebesgue measure given by φ(·; μ_{i|n}(x), γ_{i|n}²), where φ(z; μ, σ²) is the density of a Gaussian random variable with mean μ and variance σ², and

    μ_{i|n}(x) = (τ²ρ_{i+1|n}² αx + σ²ρ_{i+1|n}² Y_{i+1} + σ²ατ² m_{i+1|n}) / ((σ² + τ²)ρ_{i+1|n}² + τ²α²σ²),
    γ_{i|n}² = σ²τ²ρ_{i+1|n}² / ((τ² + σ²)ρ_{i+1|n}² + α²τ²σ²).

From (46), it follows that for any i ∈ {0, …, n−1}, σ² ≤ ρ_{i|n}² ≤ σ² + τ². This implies that, for any (x, x') ∈ X × X and any i ∈ {0, …, n−1}, the function μ_{i|n} is Lipschitz with a Lipschitz constant uniformly bounded by some β < |α|,

    |μ_{i|n}(x) − μ_{i|n}(x')| ≤ β|x − x'|,    β := |α|τ²(σ² + τ²) / ((σ² + τ²)² + τ²α²σ²),    (47)

and that the variance is uniformly bounded:

    γ_-² := σ²τ² / ((1 + α²)τ² + σ²) ≤ γ_{i|n}² ≤ γ_+² := σ²τ²(σ² + τ²) / ((σ² + τ²)² + α²τ²σ²).    (48)

Therefore, for any c < ∞, all sets of the form

    C := {(x, x') ∈ X × X : |x − x'| ≤ c}    (49)

are coupling sets. Note indeed that, for any i ∈ {0, …, n−1},

    (1/2) ‖F_{i|n}(x, ·) − F_{i|n}(x', ·)‖_TV = 2 erf( γ_{i|n}^{-1} |μ_{i|n}(x) − μ_{i|n}(x')| ) ≤ 2 erf( γ_-^{-1} βc ),

where erf is the error function. More precisely, for any (x, x') ∈ C, any integer n, and any i ∈ {0, …, n−1},

    F_{i|n}(x, A) ∧ F_{i|n}(x', A) ≥ ε ν_{i|n}(x, x'; A),    where ε := 1 − 2 erf( γ_-^{-1} βc ),    (50)

and ν_{i|n} is defined as in (15). For c large enough, the drift condition is satisfied with V(x, x') = 1 + (x − x')²:

    F̄_{i|n} V(x, x') = 1 + (μ_{i|n}(x) − μ_{i|n}(x'))² + 2γ_{i|n}² ≤ 1 + β²|x − x'|² + 2γ_+².

The condition (40) holds with

    ρ_{i|n} ≤ ρ := (1 − ε)^{-1} (1 + β²c² + 2γ_+²),    (51)

where c is the width of the coupling set in (49). The condition (41) is satisfied with λ_{i|n} = β̃² for any β̃ and c satisfying β < β̃ < 1 and c² > (1 − β̃² + 2γ_+²) / (β̃² − β²). It is worthwhile to note that all these bounds are uniform with respect to n, i ∈ {0, …, n−1}, and the realization of the observations y_{0:n}. Therefore, for any m ∈ {0, …, n}, we may upper bound A_{m,n} (defined in (44)) by

    A_{m,n} ≤ (1 − ε)^m + B^m β̃^{2n} ( 1 + 2∫ξ(dx) x² + 2∫ξ'(dx) x² ),    with B := 1 ∨ ρ(1 − ε) β̃^{-2},

where ε is defined in (50) and ρ is defined in (51). Taking m = [δn] for some δ > 0 such that B^δ β̃² <
1, this upper bound can be shown to go to zero exponentially fast, uniformly with respect to the observations y_{0:n}.

The Gaussian example can be generalized to the case where the distributions of the state noise and the measurement noise are strongly unimodal. Recall that a density is strongly unimodal if the logarithm of the density is concave. First note that if f and g are two strongly unimodal densities, then the density h = fg/∫fg is also strongly unimodal, with a mode that lies between the two modes; the second-order derivative of log h is the sum of the second-order derivatives of log f and log g (hence more negative than either). Let the state noise density be denoted by p_U(·) = e^{φ(·)} and that of the measurement errors by p_V(·) = e^{ψ(·)}. Define, by the recursion operating on decreasing indices,

    β̄_{i|n}(x) = p_V(y_i − x) ∫ q(x, x_{i+1}) β̄_{i+1|n}(x_{i+1}) dx_{i+1},    (52)

with the initial condition β̄_{n|n}(x) = p_V(y_n − x). These functions are the conditional densities of the observations Y_{i:n} given X_i = x. They are related to the backward functions through the relation β̄_{i|n}(x) = β_{i|n}(x) p_V(y_i − x). We denote ψ_{i|n}(x) := log β̄_{i|n}(x). Now,

    ψ_{i|n}(x) = ψ(Y_i − x) + log ∫ p_U(z − αx) β̄_{i+1|n}(z) dz.

Under the stated assumptions, the forward smoothing kernel F_{i|n} has a density with respect to the Lebesgue measure which is given by

    f_{i|n}(x_i, x_{i+1}) = p_U(x_{i+1} − αx_i) β̄_{i+1|n}(x_{i+1}) / ∫ p_U(z − αx_i) β̄_{i+1|n}(z) dz.    (53)

Denote by gCov_{i|n,x} the covariance with respect to the forward smoothing kernel density. We recall that for any probability distribution P on (X, B(X)) and any two increasing measurable functions f and g which are square integrable with respect to P, the covariance of f and g with respect to P is non-negative.
Hence,

    ψ''_{i|n}(x) = ψ''(Y_i − x) + α² { ∫ p''_U(z − αx) β̄_{i+1|n}(z) dz / ∫ p_U(z − αx) β̄_{i+1|n}(z) dz − ( ∫ p'_U(z − αx) β̄_{i+1|n}(z) dz / ∫ p_U(z − αx) β̄_{i+1|n}(z) dz )² }
    = ψ''(Y_i − x) − α² { ∫ p'_U(z − αx) β̄'_{i+1|n}(z) dz / ∫ p_U(z − αx) β̄_{i+1|n}(z) dz − ( ∫ p'_U(z − αx) β̄_{i+1|n}(z) dz / ∫ p_U(z − αx) β̄_{i+1|n}(z) dz )( ∫ p_U(z − αx) β̄'_{i+1|n}(z) dz / ∫ p_U(z − αx) β̄_{i+1|n}(z) dz ) }
    = ψ''(Y_i − x) − α² gCov_{i|n,x}( φ'(· − αx), ψ'_{i+1|n}(·) )
    ≤ ψ''(Y_i − x),    (54)

where we used a direct differentiation, integration by parts, and the fact that both φ' and ψ'_{i+1|n} are monotone non-increasing functions (the last statement follows by applying (54) inductively from n backward).

We conclude that ψ_{i|n} is strongly unimodal, with curvature at least that of the log-likelihood of a single observation. Hence the curvature of the logarithm of the forward smoothing density is bounded by the sum of the curvatures of the state and measurement noises:

    [log f_{i|n}(x_i, x_{i+1})]'' ≤ φ''(x_{i+1} − αx_i) + ψ''(Y_{i+1} − x_{i+1}) ≤ −c,    (55)

where

    c := −( max_x φ''(x) + max_x ψ''(x) ).    (56)

Lemma 10 below shows that the variance of X_{i+1} given X_i and Y_{i+1:n} is uniformly bounded:

    v_{i|n}(x) := ∫ ( x_{i+1} − ∫ x_{i+1} f_{i|n}(x, x_{i+1}) dx_{i+1} )² f_{i|n}(x, x_{i+1}) dx_{i+1} ≤ c^{-1},

where c is defined in (56). Now let

    e_{i|n}(x) := ∫ x_{i+1} f_{i|n}(x, x_{i+1}) dx_{i+1}.

Similarly as above,

    (d e_{i|n}/dx)(x) = −α gCov_{i|n,x}( Z, φ'(Z − αx) ).

Note that the functions x_{i+1} ↦ e_{i|n}(x) − x_{i+1}, x_{i+1} ↦ φ'(x_{i+1} − αx), and x_{i+1} ↦ ψ'_{i+1|n}(x_{i+1}) are all monotone non-increasing, and therefore their pairwise covariances are non-negative with respect to any probability measure.
Hence,

    |(d e_{i|n}/dx)(x)| = |α| ∫ ( e_{i|n}(x) − x_{i+1} ) φ'(x_{i+1} − αx) e^{φ(x_{i+1}−αx) + ψ_{i+1|n}(x_{i+1})} dx_{i+1} / ∫ e^{φ(x_{i+1}−αx) + ψ_{i+1|n}(x_{i+1})} dx_{i+1}
    ≤ |α| ∫ ( e_{i|n}(x) − x_{i+1} ) ( φ'(x_{i+1} − αx) + ψ'_{i+1|n}(x_{i+1}) ) e^{φ(x_{i+1}−αx) + ψ_{i+1|n}(x_{i+1})} dx_{i+1} / ∫ e^{φ(x_{i+1}−αx) + ψ_{i+1|n}(x_{i+1})} dx_{i+1}
    = |α|,

where the last equality follows by integration by parts. Put, as before, V(x, x') = 1 + (x − x')². It follows from the discussion above that

    F̄_{i|n} V(x, x') = 1 + ( e_{i|n}(x) − e_{i|n}(x') )² + v_{i|n}(x) + v_{i|n}(x'),

where v_{i|n}(x) and v_{i|n}(x') are uniformly bounded with respect to x and x', and |e_{i|n}(x) − e_{i|n}(x')| ≤ |α| |x − x'|. The rest of the argument is as in the normal-normal case. We conclude by stating and proving a lemma which was used above.

Lemma 10.
Suppose that Z is a random variable with probability density function f satisfying sup_x (∂²/∂x²) log f(x) ≤ −c < 0. Then Z is square integrable and Var(Z) ≤ c^{-1}.

Proof. Suppose, w.l.o.g., that the maximum of f is at 0. Under the stated assumption, there exist constants a and b such that f(x) ≤ a e^{−c(x−b)²/2}. This implies that Z is square integrable. Denote ξ(z) := log f(z) + cz²/2 and let m be the mean of Z. Then

    E[(Z − m)²] = ∫ (z − m) z e^{ξ(z) − cz²/2} dz
    = c^{-1} ∫ (z − m)(cz − ξ'(z)) e^{ξ(z) − cz²/2} dz + c^{-1} ∫ (z − m) ξ'(z) e^{ξ(z) − cz²/2} dz.

By construction, ξ'(z) is a non-increasing function. Since Cov(φ(Z), ψ(Z)) ≥ 0 for any two non-decreasing functions φ and ψ with finite second moment, the second term on the RHS of the previous equation is non-positive. Since (cz − ξ'(z)) e^{ξ(z) − cz²/2} = −f'(z), the proof follows by integration by parts:

    Var(Z) ≤ −c^{-1} ∫ (z − m) f'(z) dz = c^{-1} ∫ f(z) dz = c^{-1}.
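Lemma 10 is easy to check by numerical quadrature. In the sketch below, the Gaussian with curvature exactly −c attains the bound Var(Z) = 1/c, while a quartic tilt (a hypothetical test density with (log f)'' = −c − 12z² ≤ −c) must fall strictly below it.

```python
import numpy as np

# Numerical check of Lemma 10: if (log f)'' <= -c everywhere, then
# Var(Z) <= 1/c. The quartic example is a made-up test density.

def variance(logf, lo=-10.0, hi=10.0, num=100001):
    z = np.linspace(lo, hi, num)
    f = np.exp(logf(z))
    trap = lambda g: float(np.sum((g[1:] + g[:-1]) * 0.5 * np.diff(z)))
    mass = trap(f)
    mean = trap(z * f) / mass
    return trap((z - mean) ** 2 * f) / mass

c = 1.0
v_gauss = variance(lambda z: -0.5 * c * z ** 2)             # curvature exactly -c
v_quartic = variance(lambda z: -0.5 * c * z ** 2 - z ** 4)  # curvature <= -c
```

The Gaussian case shows the constant c^{-1} cannot be improved in general.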
6. Proofs
Proof of Proposition 1.
The proof is similar to the one in (11). For x ∈ C, condition (21) implies that

    σ_-(C) ν_C(dx') ≤ Q(x, dx') ≤ σ_+(C) ν_C(dx').

Plugging the lower and upper bounds into the numerator and the denominator of (8) yields

    F_{k|n}(x_k, A) ≥ (σ_-(C)/σ_+(C)) · ∫_A β_{k+1|n}(x_{k+1}) ν_C(dx_{k+1}) / ∫_X β_{k+1|n}(x_{k+1}) ν_C(dx_{k+1}).

The result is established with

    ν_{k|n}(A) := ∫_A β_{k+1|n}(x_{k+1}) ν_C(dx_{k+1}) / ∫_X β_{k+1|n}(x_{k+1}) ν_C(dx_{k+1}).

Proof of Proposition 3.
For any x_i ∈ X,

    P(X_{i+ℓ} ∈ C | X_i = x_i, Y_{0:n})
    = ∫ W[Y_{i+1:i+ℓ}](x_i, x_{i+ℓ+1}; C) β_{i+ℓ+1|n}(x_{i+ℓ+1}) μ(dx_{i+ℓ+1}) / ∫ W[Y_{i+1:i+ℓ}](x_i, x_{i+ℓ+1}; X) β_{i+ℓ+1|n}(x_{i+ℓ+1}) μ(dx_{i+ℓ+1})
    = ∫ [ W[Y_{i+1:i+ℓ}](x_i, x_{i+ℓ+1}; C) / W[Y_{i+1:i+ℓ}](x_i, x_{i+ℓ+1}; X) ] W[Y_{i+1:i+ℓ}](x_i, x_{i+ℓ+1}; X) β_{i+ℓ+1|n}(x_{i+ℓ+1}) μ(dx_{i+ℓ+1}) / ∫ W[Y_{i+1:i+ℓ}](x_i, x_{i+ℓ+1}; X) β_{i+ℓ+1|n}(x_{i+ℓ+1}) μ(dx_{i+ℓ+1}),

where W is defined in (24). The proof is concluded by noting that, under the stated assumptions,

    inf_{(x_i, x_{i+ℓ+1}) ∈ X × X} W[Y_{i+1:i+ℓ}](x_i, x_{i+ℓ+1}; C) / W[Y_{i+1:i+ℓ}](x_i, x_{i+ℓ+1}; X) ≥ α(Y_{i+1:i+ℓ}; C).

Proof of Theorem 7.
For notational simplicity, we drop the dependence on the sample size n. Denote N_n := ∑_{j=0}^n 1_{C̄_j}(X_j, X'_j) and ε_i := ε(C̄_i). For any m ∈ {0, …, n+1}, we have

    P^Y_{ξ,ξ',0}(T ≥ n) ≤ P^Y_{ξ,ξ',0}(T ≥ n, N_{n-1} ≥ m) + P^Y_{ξ,ξ',0}(T ≥ n, N_{n-1} < m).    (57)

The first term on the RHS of the previous equation is the probability that we fail to couple the chains in at least m independent trials. It is bounded by

    P^Y_{ξ,ξ',0}(T ≥ n, N_{n-1} ≥ m) ≤ ∏_{i=1}^{m} (1 − [↑ε]_{(i)}),    (58)

where [↑ε]_{(i)} are the smallest order statistics of (ε_0, …, ε_{n-1}). We now consider the second term on the RHS of (57). Set B_j := 1 ∨ ρ_j (1 − ε_j) λ_j^{-1}. On the event {N_{n-1} ≤ m − 1},

    ∏_{j=0}^{n-1} B_j^{1_{C̄_j}(X_j, X'_j)} ≤ ∏_{j=1}^{m-1} [↓B]_{(j)},

where [↓B]_{(j)} is the j-th largest order statistic of B_0, …, B_{n-1}. Hence,

    1{N_{n-1} ≤ m − 1} ≤ ∏_{j=1}^{m-1} [↓B]_{(j)} ∏_{j=0}^{n-1} B_j^{−1_{C̄_j}(X_j, X'_j)},

which implies that

    P^Y_{ξ,ξ',0}(T ≥ n, N_{n-1} < m) ≤ ∏_{j=0}^{n-1} λ_j ∏_{j=1}^{m} [↓B]_{(j)} E^Y_{ξ,ξ',0}[M_n],    (59)

where, for k ∈ {0, …, n},

    M_k := ∏_{j=0}^{k-1} λ_j^{-1} ∏_{j=0}^{k-1} B_j^{−1_{C̄_j}(X_j, X'_j)} V_k(X_k, X'_k) 1{d_k = 0}.    (60)

Since, by construction,

    E_{ξ,ξ',0}[ V_{k+1}(X_{k+1}, X'_{k+1}) 1{d_{k+1} = 0} | F_k ]
    ≤ (1 − ε_k) R̄_k V_{k+1}(X_k, X'_k) 1_{C̄_k}(X_k, X'_k) + F̄_k V_{k+1}(X_k, X'_k) 1_{C̄_k^c}(X_k, X'_k)
    ≤ (1 − ε_k) ρ_k 1_{C̄_k}(X_k, X'_k) + λ_k V_k(X_k, X'_k) 1_{C̄_k^c}(X_k, X'_k),

it is easily shown that (M_k, k ≥ 0) is an (F, P^Y_{ξ,ξ',0})-supermartingale, where F := (F_k)_{0 ≤ k ≤ n} with F_k := σ[(X_j, X'_j, d_j), 0 ≤ j ≤ k]. Therefore,

    E^Y_{ξ,ξ',0}(M_n) ≤ E^Y_{ξ,ξ',0}(M_0) = ξ ⊗ ξ'(V_{0|n}).

This establishes (43) and concludes the proof.
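The bound (43)–(44) is straightforward to evaluate numerically once the constants are available. The sketch below is schematic: the arrays eps, lam, B and the scalar xiV are hypothetical stand-ins for the coupling constants ε_{k|n}, the drift rates λ_{k|n}, the constants B_{k|n} of (42), and ξ ⊗ ξ'(V_{0|n}); they are not derived from any particular model.

```python
import numpy as np

# Schematic evaluation of min_m A_{m,n} from (43)-(44); all inputs are
# hypothetical placeholder values.
def forgetting_bound(eps, lam, B, xiV):
    eps_up = np.sort(np.asarray(eps, dtype=float))       # increasing order stats
    B_down = np.sort(np.asarray(B, dtype=float))[::-1]   # decreasing order stats
    lam_prod = float(np.prod(lam))
    best = np.inf
    for m in range(len(eps_up) + 1):
        # probability that m coupling attempts all fail, plus the drift term
        a_m = float(np.prod(1.0 - eps_up[:m])) \
            + lam_prod * float(np.prod(B_down[:m])) * xiV
        best = min(best, a_m)
    return best
```

Minimizing over m balances the failure probability of the coupling attempts against the return-time (drift) term, exactly as in the decomposition (57) above.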
7. Proofs of Section 4.1.2
To simplify the notation, the dependence of C_K(y) on K is left implicit throughout this section.

Proof of Lemma 4.
Consider first the case d(x_1, a(C)) ≥ M. For any z ∈ C,

    M ≤ |x_1 − a(x_0)| ≤ |x_1 − a(z)| + |a(z) − a(x_0)| ≤ diam(C) + |x_1 − a(z)|,
    M ≤ |x_1 − a(z)| ≤ |x_1 − a(x_0)| + |a(z) − a(x_0)| ≤ diam(C) + |x_1 − a(x_0)|.

Using that p_U is non-increasing on [M, ∞) and (28), we obtain

    p_U(|x_1 − a(x_0)|) ≥ p_U(diam(C) + |x_1 − a(z)|) ≥ γ p_U(diam(C)) p_U(|x_1 − a(z)|),

and, similarly,

    p_U(|x_1 − a(z)|) ≥ γ p_U(diam(C)) p_U(|x_1 − a(x_0)|),

which establishes that (30) holds when d(x_1, a(C)) ≥ M.

Consider now the case d(x_1, a(C)) ≤ M. Since x_0 belongs to C, |x_1 − a(x_0)| ≤ M + diam(C), which implies that

    inf_{u ≤ M + diam(C)} p_U(u) ≤ p_U(|x_1 − a(x_0)|) ≤ sup_{u ≤ M + diam(C)} p_U(u),

so (30) holds for d(x_1, a(C)) ≤ M as well.

Consider now the second assertion. Assume first that x_0 is such that d(a(x_0), C) ≥ M and let z be an arbitrary point of C. Then, for any x_1 ∈ C,

    M ≤ |x_1 − a(x_0)| ≤ |x_1 − z| + |z − a(x_0)| ≤ diam(C) + |z − a(x_0)|.

Using that p_U is monotone decreasing on [M, ∞) and (28),

    p_U(|x_1 − a(x_0)|) ≥ p_U(diam(C) + |z − a(x_0)|) ≥ γ p_U(diam(C)) p_U(|z − a(x_0)|).    (61)

If d(a(x_0), C) ≤ M, then for any x_1 ∈ C, |x_1 − a(x_0)| ≤ diam(C) + M, so that

    inf_{|u| ≤ diam(C) + M} p_U(u) ≤ p_U(|x_1 − a(x_0)|).    (62)

Proof of Lemma 5.
Choose K such that b_- K ≥ M. If |b^{-1}(y) − x| ≥ K, then

    |y − b(x)| = |b(b^{-1}(y)) − b(x)| ≥ b_- |b^{-1}(y) − x| ≥ M,    (63)

and since p_V is non-increasing on the interval [M, ∞), the following inequality holds:

    ∫_{|x − b^{-1}(y)| ≥ K} p_V(|y − b(x)|) dx ≤ ∫_{|x − b^{-1}(y)| ≥ K} p_V(b_- |b^{-1}(y) − x|) dx ≤ 2 b_-^{-1} ∫_{b_- K}^{∞} p_V(x) dx.

Since the Jacobian of b is bounded, ∫ p_V(|y − b(x)|) dx is bounded away from zero by a change of variables. The proof follows.

Proof of Lemma 6.
We will establish the result by considering the following cases separately.

1. For any y and any (x_0, x_2) such that d(a(x_0), C(y)) ≤ M and d(x_2, a[C(y)]) ≤ M,

    I(x_0, x_2; y) ≤ (sup p_U)² ∫_{C^c(y)} p_V(|y − b(x_1)|) dx_1.
2. For any y and any (x_0, x_2) such that d(a(x_0), C(y)) > M and d(x_2, a[C(y)]) ≤ M,

    I(x_0, x_2; y) ≤ γ^{-1} (sup p_U) ∫_K^{∞} [p_U(x)]^{-1} p_V(b_- x) dx.
3. For any y and any (x_0, x_2) such that d(a(x_0), C(y)) ≤ M and d(x_2, a[C(y)]) > M,

    I(x_0, x_2; y) ≤ γ^{-1} (sup p_U) ∫_K^{∞} p_V(b_- x) { ( inf_{u ≤ M} p_U(u) )^{-1} + [p_U(a_+ x)]^{-1} } dx.
4. For any y and any (x_0, x_2) such that d(a(x_0), C(y)) > M and d(x_2, a[C(y)]) > M,

    I(x_0, x_2; y) ≤ γ^{-2} ∫_K^{∞} [p_U(x)]^{-1} p_V(b_- x) { ( inf_{u ≤ M} p_U(u) )^{-1} + [p_U(a_+ x)]^{-1} } dx.

Proof of Assertion 1.
On the set {x_0 : d(a(x_0), C(y)) ≤ M}, k_{C(y)}(x_0) ≡ 1; on the set {x_2 : d(x_2, a[C(y)]) ≤ M}, h_{C(y)}(x_2) ≡ 1. Since p_U is uniformly bounded, the bound follows from Lemma 5 and the choice of K.

Proof of Assertion 2.
On the set {x_0 : d(a(x_0), C(y)) > M}, k_C(x_0) = p_U(|b^{-1}(y) − a(x_0)|); on the set {x_2 : d(x_2, a[C(y)]) ≤ M}, h_C(x_2) ≡ 1. Therefore, for such (x_0, x_2),

    I(x_0, x_2; y) ≤ (sup p_U) p_U^{-1}(|b^{-1}(y) − a(x_0)|) ∫_{C^c(y)} p_U(|x_1 − a(x_0)|) p_V(|y − b(x_1)|) dx_1.    (64)

We set α = x_1 − a(x_0) and β = b^{-1}(y) − x_1. Note that |α + β| = |b^{-1}(y) − a(x_0)| ≥ d(a(x_0), C(y)) > M. Since p_U is non-increasing on [M, ∞), p_U(|α + β|) ≥ p_U(|α| + |β|), and condition (28) shows that [p_U(|α + β|)]^{-1} p_U(|α|) ≤ γ^{-1} p_U^{-1}(|β|), which implies

    p_U^{-1}(|b^{-1}(y) − a(x_0)|) p_U(|x_1 − a(x_0)|) ≤ γ^{-1} p_U^{-1}(|b^{-1}(y) − x_1|).    (65)

Therefore, plugging (65) into the RHS of (64) yields

    I(x_0, x_2; y) ≤ γ^{-1} (sup p_U) ∫_{|x_1 − b^{-1}(y)| ≥ K} p_U^{-1}(|b^{-1}(y) − x_1|) p_V(b_- |b^{-1}(y) − x_1|) dx_1
    ≤ γ^{-1} (sup p_U) ∫_K^{∞} p_U^{-1}(x) p_V(b_- x) dx.

Proof of Assertion 3.
On the set {x_0 : d(a(x_0), C(y)) ≤ M}, k_C(x_0) ≡ 1; on the set {x_2 : d(x_2, a[C(y)]) > M}, h_C(x_2) = p_U(|x_2 − a(b^{-1}(y))|). Therefore, for such (x_0, x_2),

    I(x_0, x_2; y) ≤ (sup p_U) p_U^{-1}(|x_2 − a(b^{-1}(y))|) ∫_{C^c(y)} p_V(|y − b(x_1)|) p_U(|x_2 − a(x_1)|) dx_1.    (66)

We set α = x_2 − a(x_1) and β = a(x_1) − a(b^{-1}(y)). Since |α + β| ≥ d(x_2, a[C(y)]) > M, using as above that [p_U(|α + β|)]^{-1} p_U(|α|) ≤ γ^{-1} p_U^{-1}(|β|), we obtain

    p_U^{-1}(|x_2 − a(b^{-1}(y))|) p_U(|x_2 − a(x_1)|) ≤ γ^{-1} p_U^{-1}(|a(x_1) − a(b^{-1}(y))|).    (67)

Since for any x, x' ∈ X,

    p_U^{-1}(|a(x) − a(x')|) ≤ ( inf_{u ≤ M} p_U(u) )^{-1} 1{|a(x) − a(x')| ≤ M} + p_U^{-1}(a_+ |x − x'|) 1{|a(x) − a(x')| > M},    (68)

the RHS of (66) is therefore bounded by

    I(x_0, x_2; y) ≤ γ^{-1} (sup p_U) ∫_{|x_1 − b^{-1}(y)| ≥ K} p_V(b_- |x_1 − b^{-1}(y)|) { ( inf_{u ≤ M} p_U(u) )^{-1} + p_U^{-1}(a_+ |x_1 − b^{-1}(y)|) } dx_1.

Proof of Assertion 4.
On the set {x_0 : d(a(x_0), C(y)) > M}, k_{C(y)}(x_0) = p_U(|b^{-1}(y) − a(x_0)|). On the set {x_2 : d(x_2, a[C(y)]) > M}, h_{C(y)}(x_2) = p_U(|x_2 − a(b^{-1}(y))|). Therefore, for such (x_0, x_2),

    I(x_0, x_2; y) ≤ p_U^{-1}(|b^{-1}(y) − a(x_0)|) p_U^{-1}(|x_2 − a(b^{-1}(y))|) × ∫_{C^c(y)} p_U(|x_1 − a(x_0)|) p_V(|y − b(x_1)|) p_U(|x_2 − a(x_1)|) dx_1.    (69)

Using (63), (65), (67), and (68), the RHS of the previous equation is bounded by

    I(x_0, x_2; y) ≤ γ^{-2} ∫_K^{∞} p_U^{-1}(x) p_V(b_- x) { ( inf_{u ≤ M} p_U(u) )^{-1} + p_U^{-1}(a_+ x) } dx.

The proof follows.
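The sub-multiplicativity condition (28), p_U(u + v) ≥ γ p_U(u) p_U(v) for u, v ≥ 0, used repeatedly in the proofs above, can be checked directly for specific densities. For the Laplace density (a hypothetical example, not one assumed in the text) it holds with γ = 2 exactly, since (1/2)e^{-(u+v)} = 2 · [(1/2)e^{-u}] · [(1/2)e^{-v}]:

```python
import numpy as np

# Check p_U(u+v) >= gamma * p_U(u) * p_U(v) on a grid for the Laplace
# density; the ratio is identically 2 for u, v >= 0.
def laplace(u):
    return 0.5 * np.exp(-np.abs(u))

def gamma_lower_bound(grid):
    """Minimum over the grid of p_U(u+v) / (p_U(u) p_U(v)), u, v >= 0."""
    u, v = np.meshgrid(grid, grid)
    return float(np.min(laplace(u + v) / (laplace(u) * laplace(v))))
```

For lighter-tailed densities (e.g. Gaussian) the ratio degenerates as u, v grow, which is why (28) is only required over the bounded ranges appearing in Lemma 4.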
References

[1] Atar, R. and Zeitouni, O. (1997). Exponential stability for nonlinear filtering. Ann. Inst. H. Poincaré Probab. Statist. 33(6), 697–725.
[2] Baum, L. E., Petrie, T. P., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41(1), 164–171.
[3] Budhiraja, A. and Ocone, D. (1997). Exponential stability of discrete-time filters for bounded observation noise. Systems Control Lett. 30, 185–193.
[4] Budhiraja, A. and Ocone, D. (1999). Exponential stability in discrete-time filtering for non-ergodic signals. Stochastic Process. Appl. 82(2), 245–257.
[5] Cappé, O., Moulines, E., and Rydén, T. (2005). Inference in Hidden Markov Models. Springer.
[6] Chigansky, P. and Liptser, R. (2004). Stability of nonlinear filters in non-mixing case. Ann. Appl. Probab. 14(4), 2038–2056.
[7] Chigansky, P. and Liptser, R. (2006). On a role of predictor in the filtering stability. Electron. Comm. Probab. 11, 129–140 (electronic).
[8] Del Moral, P. (2004). Feynman-Kac Formulae. Genealogical and Interacting Particle Systems with Applications. Springer.
[9] Del Moral, P. and Guionnet, A. (1998). Large deviations for interacting particle systems: applications to non-linear filtering. Stoch. Proc. Appl. 78, 69–95.
[10] Del Moral, P., Ledoux, M., and Miclo, L. (2003). On contraction properties of Markov kernels. Probab. Theory Related Fields 126(3), 395–420.
[11] Douc, R., Moulines, E., and Rydén, T. (2004). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Ann. Statist. 32(5), 2254–2304.
[12] LeGland, F. and Oudjane, N. (2003). A robustification approach to stability and to uniform particle approximation of nonlinear filters: the example of pseudo-mixing signals. Stochastic Process. Appl. 106(2), 279–316.
[13] Ocone, D. and Pardoux, E. (1996). Asymptotic stability of the optimal filter with respect to its initial condition. SIAM J. Control Optim. 34, 226–243.
[14] Oudjane, N. and Rubenthaler, S. (2005). Stability and uniform particle approximation of nonlinear filters in case of non ergodic signals. Stoch. Anal. Appl. 23(3), 421–448.