Metropolising forward particle filtering backward sampling and Rao-Blackwellisation of Metropolised particle smoothers
Jimmy Olsson and Tobias Rydén

October 24, 2018
Abstract
Smoothing in state-space models amounts to computing the conditional distribution of the latent state trajectory given the observations, or expectations of functionals of the state trajectory with respect to this distribution. For models that are not linear Gaussian or do not have a finite state space, smoothing distributions are in general infeasible to compute, as they involve integrals over a space of dimensionality at least equal to the number of observations. Recent years have seen an increased interest in Monte Carlo-based methods for smoothing, often involving particle filters. One such method is to approximate the filter distributions with a particle filter, and then to simulate backwards on the trellis of particles using a backward kernel. We show that by supplementing this procedure with a Metropolis-Hastings step deciding whether to accept a proposed trajectory or not, one obtains a Markov chain Monte Carlo scheme whose stationary distribution is the exact smoothing distribution. We also show that in this procedure, backward sampling can be replaced by backward smoothing, which effectively means averaging over all possible trajectories. In an example we compare these approaches to a similar one recently proposed by Andrieu, Doucet and Holenstein, and show that the new methods can be more efficient in terms of precision (inverse variance) per computation time.
1 Introduction

The topic of the present paper is the computation of smoothed expectations of functionals of the state process in state-space models, i.e. conditional expectations of such functionals given data. To make this discussion more precise, let $(X, Y)$ be a state-space model, where $Y = (Y_k)_{k=0}^n$ is the observable (output) process and $X = (X_k)_{k=0}^n$ is the latent (unobserved) Markov chain. The relation between the two is such that given $X$, the $Y_k$'s are conditionally independent, with the conditional distribution of a particular $Y_k$ depending on the corresponding $X_k$ only. If the state space of $X$ is finite we use the term hidden Markov model (HMM). Our interest thus lies in computing conditional expectations of the form $\mathrm{E}[h(X_{0:n}) \mid y_{0:n}]$ for a real-valued functional $h$ of one, several, or all of the latent variables. Here $X_{0:k} = (X_0, X_1, \ldots, X_k)$ etc.; this form will be our generic notation for vectors.

Conditional expectations as above are often interesting and relevant in their own respect, with e.g. $\mathrm{E}[X_k \mid Y_{0:n}]$, $\mathrm{E}[X_k^2 \mid Y_{0:n}]$ and $\mathrm{P}(X_k \geq x \mid Y_{0:n}) = \mathrm{E}[\mathbb{1}(X_k \geq x) \mid Y_{0:n}]$, where $\mathbb{1}(\cdot)$ denotes an indicator function, providing useful inferential summaries of the latent states. Another very common use of such expectations is however for inference on model parameters through the EM algorithm. Indeed, assume that the distribution of $(X, Y)$ depends on some model parameter (vector) $\theta$. Then in the E-step of the EM algorithm, it is typical that conditional expectations of functionals like $h(x_{0:n}) = \sum_{k=0}^n x_k$, $h(x_{0:n}) = \sum_{k=0}^n x_k^2$, $h(x_{0:n}) = \sum_{k=1}^n x_{k-1} x_k$ etc. appear, in particular if the joint distribution of $(X, Y)$ belongs to an exponential family of distributions. For HMMs, the functional $h(x_{0:n}) = \sum_{k=1}^n \mathbb{1}(x_{k-1} = i, x_k = j)$ is used to re-estimate the transition probability from state $i$ to state $j$.
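Once a trajectory is available, E-step functionals of the kinds listed above are straightforward to evaluate. A minimal illustration in Python (the function name and interface are ours, not from the paper):

```python
import numpy as np

def em_sufficient_stats(x, n_states=None):
    """Evaluate E-step functionals of the kinds listed above on one trajectory.

    For a real-valued trajectory: the sums of x_k, of x_k^2, and of x_{k-1}*x_k.
    If n_states is given, x is instead treated as a finite-state (HMM) path and
    the transition counts sum_k 1(x_{k-1}=i, x_k=j) are returned as well.
    """
    x = np.asarray(x)
    stats = {
        "sum_x": x.sum(),
        "sum_x2": (x ** 2).sum(),
        "sum_cross": (x[:-1] * x[1:]).sum(),
    }
    if n_states is not None:
        counts = np.zeros((n_states, n_states))
        for a, b in zip(x[:-1], x[1:]):
            counts[a, b] += 1
        stats["trans_counts"] = counts
    return stats
```

In an MCEM setting these statistics would be averaged over simulated trajectories before the M-step.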
Unfortunately, exact numeric computation of the conditional distribution of $X_k$, or that of a sequence of $X$'s, given $Y_{0:n}$, is possible essentially only in two cases: firstly for HMMs, for which the forward-backward algorithm [4] provides the solution, and secondly for linear Gaussian state-space models, for which conditional distributions are Gaussian and the Kalman smoothing recursions provide their conditional means and (co)variances [e.g. 13, Chapter 4]. For other models, i.e. models with continuous state space and non-linear and/or non-Gaussian dynamics and/or output characteristics, there are no exact numerical methods available and one is confined to using approximations. Traditional approaches include Kalman filtering and smoothing techniques based on linearisation of the system dynamics and output characteristics, such as the extended Kalman filter, but following the impact of Markov chain Monte Carlo (MCMC) methods in general, during the last 10–15 years there has been a dramatic increase in the interest in and use of simulation-based methods to approximate conditional expectations given data. When such methods are used to approximate expectations appearing in the E-step of the EM algorithm, one often talks about Monte Carlo EM (MCEM) algorithms.

For a model as above it holds that the conditional distribution of $X$ given $Y$ is that of a time-varying Markov chain. This fact lies behind the existence of the forward-backward algorithm for HMMs, and is also the cornerstone of any algorithm that simulates $X$ conditionally on $Y$. To simulate $X$ given $Y$ and a fixed set of parameters $\theta$, there are essentially two different approaches. The first one, often referred to as local updating, is to run an MCMC algorithm that updates one $X_k$ at a time, given $Y$ and $X_{-k} = (X_j)_{0 \leq j \leq n, j \neq k}$. Because of the model structure, the only variables appearing in this conditional distribution are $Y_k$, $X_{k-1}$ (unless $k = 0$) and $X_{k+1}$ (unless $k = n$).
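A single local update thus only needs the factors $q(x_{k-1}, x_k)$, $q(x_k, x_{k+1})$ and $g(x_k, y_k)$. A sketch of one such update, for a hypothetical AR(1)-plus-Gaussian-noise model (illustrative only, not a model from the paper) and a random-walk M-H move:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: X_k = phi*X_{k-1} + N(0,1), Y_k = X_k + N(0,1).
phi = 0.9
log_q = lambda x, xn: -0.5 * (xn - phi * x) ** 2   # transition log-density (up to a constant)
log_g = lambda x, y: -0.5 * (y - x) ** 2           # emission log-density (up to a constant)

def local_update(x, y, k, step=1.0):
    """One random-walk M-H update of x[k] given its neighbours and y[k]."""
    prop = x[k] + step * rng.normal()

    def log_target(v):
        # full conditional involves only y[k], x[k-1] and x[k+1]
        lp = log_g(v, y[k])
        if k > 0:
            lp += log_q(x[k - 1], v)
        if k < len(x) - 1:
            lp += log_q(v, x[k + 1])
        return lp

    if np.log(rng.uniform()) < log_target(prop) - log_target(x[k]):
        x[k] = prop
    return x
```

Sweeping $k$ over $0, \ldots, n$ gives one full pass of the local-updating sampler.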
By varying $k$ one obtains an MCMC algorithm whose stationary distribution is that of $X$ given $Y$ [see e.g. 18]. The other approach to simulating $X$ given $Y$ is simply to simulate a full trajectory from the conditional distribution in question. This can be done either by forward filtering-backward sampling (FFBS), or by backward recursion-forward sampling. These names stem from the two blocks of the forward-backward algorithm for HMMs, replacing either of them by simulation. In this paper we focus on the former approach. After recursively computing the filter, i.e. the conditional distribution of $X_k$ given $Y_{0:k}$, for $k = 0, 1, \ldots, n$, forward filtering-backward sampling first simulates $X_n$ from the filter distribution at time $n$ and then recursively simulates, for $k = n-1, n-2, \ldots, 0$, $X_k$ from the conditional distribution of $X_k$ given $X_{k+1}$ and $Y_{0:k}$.

Comparing the two approaches, the advantage of local updating is its simplicity, as it only involves simulation from univariate (conditional) distributions. Its disadvantage is that a significant burn-in period may be required to remove bias, and that mixing can be slow, so that many MCMC iterations are required to make sure that sample means approximate the corresponding conditional expectations with the required accuracy. For FFBS on the other hand, one must compute the filter distributions. This is easily done for HMMs [e.g. 7, used FFBS for HMMs], but for continuous state spaces the filter distributions are in general not available. The foremost advantage of FFBS is that the simulated replications of $X \mid Y$ are independent.

To approximate filter distributions in models with continuous state space, a class of methods known as particle filters, or sequential Monte Carlo (SMC) methods, has received considerable attention during the last 10–15 years; see e.g. [14, 11, 6] for introductions to such methods, and e.g.
[2, 15, 20] for surveys and applications.

Particle filters approximate the filter distribution at time $k$ by a discrete distribution $\sum_{i=1}^N \omega_k^i \delta_{\xi_k^i}$, for which the locations $\xi_k^i$, the so-called particles, and the non-negative weights $\omega_k^i$ evolve randomly and recursively in time. Using such an approximation one can thus recursively compute approximations to the filter distributions, and then use them to simulate a state trajectory backwards. The distribution of this trajectory will then only approximately be that of $X \mid Y$. The idea to use particle filters for approximate backward sampling first appeared, to the best of our knowledge, in [12]. The paper [8] provides theory (see e.g. Theorem 5 and Corollary 6 therein) that supports the validity of this approach, such as consistency results ensuring that, as the number of particles increases, the distribution of a trajectory sampled using the particle filter converges to the true smoothing distribution.

A recent paper, [1], devised a method related to FFBS that removes bias entirely by adding a Metropolis-Hastings (M-H) step. The approach is described in detail below, but in short it involves running a particle filter and then selecting a state trajectory not by backward sampling, but by sampling one of the particles at the final time-point $n$ according to its importance weight, and then tracing the history of this particle back to the first time-point. The M-H step is constructed so that the stationary distribution of the sampled trajectory is indeed the distribution of $X \mid Y$. A well-known problem of particle filters is however that as the filter recursion proceeds beyond a given time-point $k$, after a while only a few of the particles that existed at $k$ will have survived the step-wise selection process. This implies that the genealogical tree of the particle filter provides a poor approximation to the smoothing distribution at time-points just a bit prior to the current time.
FFBS does not suffer from this kind of degeneration, and in the present paper we show how an M-H step can be applied to remove bias also from particle FFBS. We also show how the approach can be Rao-Blackwellised, by which we mean that sampling of trajectories is replaced by the corresponding expectation, which is the backward smoothing recursion. As a compromise between single trajectory sampling and smoothing, one may also simulate a small number of trajectories from each set of particles. We analyse the various approaches from a variance/cost perspective, and show that backward simulation and smoothing can be notably more efficient than sampling from the genealogical tree.

We started the work on the material presented in this paper when we, during the writing of the manuscript [16] on approximate data augmentation MCMC schemes using particle FFBS, became aware of a preprint version of [1]. Later, in the discussion part following the published version of that paper (p. 306–307), we found that Nick Whiteley (University of Bristol) had been thinking along similar lines. Therefore we would like to point out some features of our paper that are not found in Whiteley's comment, or in the paper [1] itself. One such feature is that we allow for general auxiliary particle filters in the MCMC sampler; another one is the multiple trajectory sampling idea that compromises between single trajectory sampling and backward smoothing, as well as the variance/cost analysis of this and other approaches.

2 Notation and basic concepts

In this section we sharpen the notation and introduce some basic concepts that will be used throughout the paper. We assume that all random variables are defined on a common probability space $(\Omega, \mathcal{F}, \mathrm{P})$. The state space of $X$ is denoted by $\mathsf{X}$, and by $\mathsf{Y}$ we denote the space in which $Y$ takes its values. We suppose that both of these spaces are Polish and write $\mathcal{B}(\mathsf{X})$ and $\mathcal{B}(\mathsf{Y})$, respectively, for the corresponding Borel $\sigma$-fields.
The transition kernel and initial distribution of $X$ are denoted by $Q$ and $\rho$, respectively, and we assume that the transition kernel $Q$ admits a density $q$ w.r.t. some fixed reference measure $\lambda$ on $\mathsf{X}$, in the sense that
$$Q(x, A) = \int \mathbb{1}_A(x') \, q(x, x') \, \lambda(dx')$$
for all $A \in \mathcal{B}(\mathsf{X})$ and $x \in \mathsf{X}$. We also assume that the conditional distribution of $Y_k$ given $X_k = x$ has a density $g(x, \cdot)$ (the emission density) w.r.t. some reference measure $\nu$. In most applications $\mathsf{X}$ and $\mathsf{Y}$ are products of $\mathbb{R}$, and $\lambda$ and $\nu$ are Lebesgue measures. Here we have tacitly assumed that neither $Q$ nor $g$ depends on the time $k$, but the extension to time-varying systems is immediate.

We will throughout the paper assume that we are given a record $y_{0:n}$ of arbitrary but fixed observations, and our main target is to produce samples from the joint posterior distribution $\phi_{r:s|n}(A) \overset{\mathrm{def}}{=} \mathrm{P}(X_{r:s} \in A \mid Y_{0:n} = y_{0:n})$, $A \in \mathcal{B}(\mathsf{X})^{\otimes(s-r+1)}$, of a record of states given the observations. The special cases $\phi_k \overset{\mathrm{def}}{=} \phi_{k:k|k}$ and $\phi_{0:n|n}$ will be referred to as the filter and joint smoothing distributions, respectively. Since the model is fully dominated, each joint smoothing distribution $\phi_{0:k|k}$ has a density (denoted by the same symbol) w.r.t. products of $\lambda$ and $\nu$. This density is proportional to $\rho(x_0) g_0(x_0) \prod_{\ell=1}^k g_\ell(x_\ell) q(x_{\ell-1}, x_\ell)$, and we denote by $Z_k$ the normalising constant. Since the observations are fixed, we will keep the dependence of any quantity on them implicit and introduce the short-hand notation $g_k(x) \overset{\mathrm{def}}{=} g(x, y_k)$ for $x \in \mathsf{X}$.
It is easily shown that the joint smoothing distributions $(\phi_{0:k|k})_{k=0}^n$ satisfy the well known forward smoothing recursion
$$\phi_{0:k|k}(A) = \frac{\iint \mathbb{1}_A(x_{0:k}) \, g_k(x_k) \, Q(x_{k-1}, dx_k) \, \phi_{0:k-1|k-1}(dx_{0:k-1})}{\iint g_k(x_k) \, Q(x_{k-1}, dx_k) \, \phi_{0:k-1|k-1}(dx_{0:k-1})}, \quad (2.1)$$
implying the analogous recursion
$$\phi_k(A) = \frac{\iint \mathbb{1}_A(x_k) \, g_k(x_k) \, Q(x_{k-1}, dx_k) \, \phi_{k-1}(dx_{k-1})}{\iint g_k(x_k) \, Q(x_{k-1}, dx_k) \, \phi_{k-1}(dx_{k-1})} \quad (2.2)$$
for the filter distributions. Conversely, the joint smoothing distributions may be retrieved from the filter distribution flow using the so-called backward decomposition of the smoothing measure. Indeed, let, for $A \in \mathcal{B}(\mathsf{X})$,
$$\overleftarrow{Q}_\mu(x', A) = \frac{\int \mathbb{1}_A(x) \, q(x, x') \, \mu(dx)}{\int q(u, x') \, \mu(du)} \quad (2.3)$$
be the reverse kernel associated with $Q$ and $\mu$, where $\mu$ is a probability measure on $\mathsf{X}$. In particular, by letting $\mu$ be some marginal distribution of $X$ we obtain the transition kernel of $X$ when evolving in reverse time. Using (2.3), the joint smoothing distribution $\phi_{0:n|n}$ may be expressed as
$$\phi_{0:n|n}(A) = \int \cdots \int \mathbb{1}_A(x_{0:n}) \, \phi_n(dx_n) \prod_{k=0}^{n-1} \overleftarrow{Q}_{\phi_k}(x_{k+1}, dx_k) \quad (2.4)$$
for $A \in \mathcal{B}(\mathsf{X})^{\otimes(n+1)}$ [6, Corollary 3.3.8]. Here $(\overleftarrow{Q}_{\phi_k})_{k=0}^{n-1}$ are the so-called backward kernels, describing transitions of $X$ when evolving backwards in time and conditionally on the given observations. Consequently, a draw $X_{0:n}$ from $\phi_{0:n|n}$ may be produced by first computing recursively, using (2.2), the filter distributions $(\phi_k)_{k=0}^n$ (the forward filtering pass), simulating $X_n$ from $\phi_n$, and then simulating, recursively for $k = n-1, n-2, \ldots, 0$, $X_k$ from $\overleftarrow{Q}_{\phi_k}(X_{k+1}, \cdot)$ (the backward simulation pass). This is the mentioned FFBS algorithm.

As stressed in the introduction, joint smoothing distributions can be expressed in closed form only for a very few models.
The same applies for any marginals of the same, including the filter distributions, and thus the decomposition (2.4) appears, at first glance, to be of academic interest only. However, while there is a well-established difficulty in applying SMC methods directly to the smoothing recursion (2.1) (as systematically resampling the particle trajectories rapidly decreases the number of distinct particle coordinates at early time steps; see Section 3.1), SMC methods may be efficiently used for approximating the filter distributions. Hence, by following [12] and replacing the filter distributions in (2.4) by particle filter estimates, we obtain an approximation of the joint smoothing distribution that is not at all affected by the degeneracy of the genealogical particle tree. This issue will be discussed further in Section 3.1.

3 Particle filters

A particle filter approximates the filter distribution $\phi_k$ at time $k$ by a weighted empirical measure
$$\phi_k^N(dx) \overset{\mathrm{def}}{=} \sum_{i=1}^N \frac{\omega_k^i}{\sum_{\ell=1}^N \omega_k^\ell} \, \delta_{\xi_k^i}(dx), \quad (3.1)$$
where $(\xi_k^i, \omega_k^i)_{i=1}^N$ is a weighted finite sample of so-called particles (the $\xi_k^i$'s) with associated importance weights (the $\omega_k^i$'s), and $\delta_\xi$ denotes a unit point mass at $\xi$. We remark that the weights $(\omega_k^i)_{i=1}^N$ are not normalised, i.e. required to sum to unity, which motivates the self-normalisation in (3.1). Based on an approximation as above at time $k$, an approximation of $\phi_{k+1}$ can be obtained in different ways; however, two specific operations are common to all SMC algorithms: selection, which amounts to dropping particles that have small importance weights and duplicating particles with larger weights, and mutation, which amounts to randomly moving the particles in the state space $\mathsf{X}$.
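The simplest concrete instance of these two operations is the bootstrap filter, where selection is multinomial resampling according to the normalised weights and mutation draws from the prior transition kernel. A sketch of one iteration, for an assumed linear-Gaussian model (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_step(xi, w, y, phi=0.9, sx=1.0, sy=1.0):
    """One bootstrap-filter iteration: selection (multinomial resampling by
    the normalised weights), then mutation through the prior kernel, then
    reweighting by the emission density.  The linear-Gaussian dynamics
    X' = phi*X + N(0, sx^2), Y = X' + N(0, sy^2) are assumed purely for
    illustration."""
    N = len(xi)
    idx = rng.choice(N, size=N, p=w / w.sum())        # selection
    xi_new = phi * xi[idx] + sx * rng.normal(size=N)  # mutation
    w_new = np.exp(-0.5 * ((y - xi_new) / sy) ** 2)   # emission weights
    return xi_new, w_new

# filter mean at this step: np.average(xi_new, weights=w_new)
```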
The approach we describe below is called the auxiliary particle filter [17]. Given the ancestor sample $(\xi_k^i, \omega_k^i)_{i=1}^N$, one iteration of the auxiliary particle filter involves sampling from the auxiliary distribution
$$\Phi_{k+1}^N(\{i\} \times A) \overset{\mathrm{def}}{=} \frac{\omega_k^i \int g_{k+1}(x) \, Q(\xi_k^i, dx)}{\sum_{\ell=1}^N \omega_k^\ell \int g_{k+1}(x) \, Q(\xi_k^\ell, dx)} \left( \frac{\int \mathbb{1}_A(x) \, g_{k+1}(x) \, Q(\xi_k^i, dx)}{\int g_{k+1}(x) \, Q(\xi_k^i, dx)} \right)$$
on the product space $\{1, \ldots, N\} \times \mathsf{X}$, using some proposal distribution
$$\Pi_{k+1}^N(\{i\} \times A) \overset{\mathrm{def}}{=} \frac{\omega_k^i \vartheta_k^i}{\sum_{\ell=1}^N \omega_k^\ell \vartheta_k^\ell} \, R_k(\xi_k^i, A),$$
where $R_k$ is a proposal kernel on $\mathsf{X}$ and $(\vartheta_k^i)_{i=1}^N$ is a set of adjustment multiplier weights. As a motivation for this we note that $\sum_{i=1}^N \Phi_{k+1}^N(\{i\} \times A)$ is the mixture distribution obtained by simply plugging the weighted empirical measure (3.1) into the filtering recursion (2.2); thus, by simulating a set of particle positions and indices from $\Phi_{k+1}^N$ and discarding the latter, a sample of particles approximating $\phi_{k+1}$ is obtained. This procedure may then be repeated recursively as new observations become available, in order to obtain weighted particle samples approximating the filter distributions at all time points. We will throughout this paper assume that the adjustment multiplier weights are generated from the ancestor sample according to $\vartheta_k^i \overset{\mathrm{def}}{=} \vartheta_k(\xi_k^i)$, where $\vartheta_k : \mathsf{X} \to \mathbb{R}_+$ is a weight function. In addition we will assume that the proposal kernel has a transition density $r_k$ with respect to $\lambda$. The latter implies that also $\Pi_{k+1}^N$ has a density, which we denote by the same symbol, on $\{1, \ldots, N\} \times \mathsf{X}$. In practice a draw from $\Pi_{k+1}^N$ is produced by first drawing an index $I = i$ with probability proportional to $\omega_k^i \vartheta_k^i$ and then simulating a new particle location $\xi$ from the measure $R_k(\xi_k^I, dx)$.
Each of the draws $(\xi_{k+1}^i, I_{k+1}^i)_{i=1}^N$ from $\Pi_{k+1}^N$ is assigned the importance weight
$$\omega_{k+1}^i \overset{\mathrm{def}}{=} \omega_{k+1}(\xi_k^{I_{k+1}^i}, \xi_{k+1}^i) \propto \frac{d\Phi_{k+1}^N}{d\Pi_{k+1}^N}(I_{k+1}^i, \xi_{k+1}^i),$$
where $\omega_{k+1}(\cdot) : \mathsf{X}^2 \to \mathbb{R}_+$ is the importance weight function given by
$$\omega_{k+1}(x, x') \overset{\mathrm{def}}{=} \frac{g_{k+1}(x') \, q(x, x')}{\vartheta_k(x) \, r_k(x, x')}. \quad (3.2)$$
Finally, since the original target distribution is the marginal of $\Phi_{k+1}^N$ with respect to the particle position, a weighted sample approximating the former is obtained by discarding the indices $I_{k+1}^i$ and returning $(\xi_{k+1}^i, \omega_{k+1}^i)_{i=1}^N$.

The scheme is initialised by drawing $(\xi_0^i)_{i=1}^N$ independently from some initial instrumental distribution $\rho_0$ on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ and assigning each of these initial particles the importance weight $\omega_0^i = \omega_0(\xi_0^i)$ where, for $x \in \mathsf{X}$, $\omega_0(x) \overset{\mathrm{def}}{=} g_0(x) \, d\rho/d\rho_0(x)$.

Under suitable conditions the approximation $\phi_k^N$ is consistent in the sense that, as $N$ tends to infinity,
$$\phi_k^N(h) \overset{\mathrm{P}}{\longrightarrow} \phi_k(h)$$
for all $\phi_k$-integrable target functions $h$ [see 9, for some convergence results on the auxiliary particle filter]. In addition, as a by-product, an asymptotically consistent estimate of the normalising constant $Z_n$ can be obtained as
$$Z_n^N \overset{\mathrm{def}}{=} \frac{1}{N^{n+1}} \left( \prod_{k=0}^{n-1} \sum_{\ell=1}^N \omega_k^\ell \vartheta_k^\ell \right) \sum_{\ell=1}^N \omega_n^\ell. \quad (3.3)$$
We remark that as a particle one may view not only the actual position $\xi_k^i$ at time $k$, but also the whole trajectory $(\xi_0^{G_0^i}, \xi_1^{G_1^i}, \ldots, \xi_{k-1}^{G_{k-1}^i}, \xi_k^i)$ of positions that led up to this current position, where the indices $(G_k^i)_{k=0}^{n-1}$ of the genealogical path are defined recursively backwards through $G_{k-1}^i = I_k^{G_k^i}$ with $G_n^i = i$. The particle filter may thus be used not only to approximate the filter distribution $\phi_k$, but also to approximate the joint smoothing distribution $\phi_{0:k|k}$, by viewing the trajectory associated with $\xi_k^i$ as a draw from this distribution. The set of all such histories is often referred to as the genealogical tree.
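The recursion $G_{k-1}^i = I_k^{G_k^i}$ is straightforward to implement once the index draws are stored; a minimal sketch (the list layout is our own convention):

```python
def trace_genealogy(I, i):
    """Trace the particle with index i at the final time point back through
    the stored ancestor indices.  I[k-1][j] holds the index I_k^j of the
    time-(k-1) ancestor of particle j at time k, for k = 1, ..., n; the
    recursion is G_{k-1} = I_k^{G_k} with G_n = i."""
    path = [i]
    for k in range(len(I) - 1, -1, -1):
        path.append(I[k][path[-1]])
    return path[::-1]  # ancestor indices G_0, ..., G_{n-1}, G_n = i
```

The positions of the genealogical trajectory are then read off as `xi[k][g]` for each time `k` and returned index `g`.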
The problem with this approach, in its basic form, is that for time-points $k$ smaller than $n$, the particles $(\xi_n^i)_{i=1}^N$ will tend to originate from a small set of ancestors at time $k$. This problem is known as degeneration of the genealogical tree, and typically it happens that for $k$ small enough, all particles alive at $n$ originate from the same single particle at time $k$. The conclusion is that drawing particle trajectories ending with $\xi_n^i$ will thus produce a poor estimate of the smoothing distribution $\phi_{k:k|n}$ for $k$ just a bit smaller than $n$, as there is in practice only a small collection of particles being sampled at that time-point. Backward sampling, to be described in the following, is a remedy for this problem.

Given a sequence $(\phi_k^N)_{k=0}^n$ of filter approximations obtained in a prefatory pass with the auxiliary particle filter, a particle approximation of $\phi_{0:n|n}$ may, as mentioned in Section 2, be obtained by replacing each filter distribution $\phi_k$ in (2.4) by the corresponding particle estimate $\phi_k^N$. This yields the estimator
$$\phi_{0:n|n}^N(A) \overset{\mathrm{def}}{=} \int \cdots \int \mathbb{1}_A(x_{0:n}) \, \phi_n^N(dx_n) \prod_{k=0}^{n-1} \overleftarrow{Q}_{\phi_k^N}(x_{k+1}, dx_k) \quad (3.4)$$
for $A \in \mathcal{B}(\mathsf{X})^{\otimes(n+1)}$. The estimator (3.4) was recently analysed in [8], establishing its convergence to $\phi_{0:n|n}$ in several probabilistic senses. By definition (2.3), each measure $\overleftarrow{Q}_{\phi_k^N}(x, \cdot)$, $x \in \mathsf{X}$, has support at the particles $(\xi_k^i)_{i=1}^N$ only, and the weight of each support point $\xi_k^i$ is given by $\omega_k^i q(\xi_k^i, x) / \sum_{\ell=1}^N \omega_k^\ell q(\xi_k^\ell, x)$. Thus the estimator $\phi_{0:n|n}^N$ is impractical, since the cardinality of its support grows exponentially with $n$. However, a draw from $\phi_{0:n|n}^N$ is straightforwardly obtained using the following algorithm.

Algorithm 1 (∗ particle-based FFBS ∗)

1. run the particle filter to obtain $(\xi_k^i, \omega_k^i)_{1 \leq i \leq N, \, 0 \leq k \leq n}$
2. simulate $J_n \sim (\omega_n^i)_{i=1}^N$
3. set $X_n^* \leftarrow \xi_n^{J_n}$
4. for $k \leftarrow n-1$ to $0$ do
5. simulate $J_k \sim (\omega_k^i \, q(\xi_k^i, X_{k+1}^*))_{i=1}^N$
6. set $X_k^* \leftarrow \xi_k^{J_k}$
7. set $X_{0:n}^* \leftarrow (X_0^*, \ldots, X_n^*)$
8. return $X_{0:n}^*$

We note, for reasons that will become clear in the coming section, that Algorithm 1 provides, as a by-product of the forward filtering pass in Step 1, an estimate $Z_n^N$ (given in (3.3)) of the normalising constant $Z_n$. Since computing the normalising constant of the probability distribution in Step 5 involves summing over $N$ terms, the overall cost of executing Steps 2–7 (i.e. the backward simulation pass) is $O(nN)$. As noted in [8], this cost can be reduced significantly in the case where the transition density $q$ is bounded by some finite constant $q_+$, i.e. $q(x, x') \leq q_+$ for all $(x, x') \in \mathsf{X}^2$, which is the case for a large class of models (e.g. all non-linear models with additive Gaussian noise). Indeed, by applying instead a standard accept-reject scheme, where a candidate $J_k^*$ is sampled from the probability distribution induced by the particle weights $(\omega_k^i)_{i=1}^N$ (whose normalising constant is obtained as a by-product of the forward filtering pass) and accepted with probability $q(\xi_k^{J_k^*}, X_{k+1}^*)/q_+$, the corresponding complexity can be reduced to $O(n)$. More specifically, [8, Proposition 1] proves that the number of simulations per index needed for obtaining, at any time step $k$, $N$ indices $J_k$ of $N$ conditionally independent replicates of the backward index chain tends to a constant in probability. We will apply this strategy for the implementation in Section 4.

Since the density of the smoothing distribution is known up to a normalising constant, state-space models can be perfectly cast into the framework of the Metropolis-Hastings algorithm. When applied to smoothing in state-space models, the output of the M-H algorithm is a Markov chain $(X_{0:n}^{(\ell)})_{\ell \geq 0}$ on $\mathsf{X}^{n+1}$.
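For concreteness, the backward simulation pass of Algorithm 1 (the $O(nN)$ variant, without the accept-reject speed-up) can be sketched as follows, assuming the particles and unnormalised weights from Step 1 are stored per time step:

```python
import numpy as np

rng = np.random.default_rng(2)

def backward_sample(xi, w, q):
    """Backward pass of particle FFBS (Steps 2-7 of Algorithm 1).
    xi[k] and w[k] hold the particles and unnormalised weights at time k,
    and q(x, x') is the transition density (vectorised over x).
    Returns one trajectory X*_0, ..., X*_n."""
    n = len(xi) - 1
    j = rng.choice(len(xi[n]), p=w[n] / w[n].sum())       # Step 2
    traj = [xi[n][j]]                                     # Step 3
    for k in range(n - 1, -1, -1):                        # Steps 4-6
        bw = w[k] * q(xi[k], traj[-1])                    # omega_k^i q(xi_k^i, X*_{k+1})
        j = rng.choice(len(xi[k]), p=bw / bw.sum())
        traj.append(xi[k][j])
    return np.array(traj[::-1])                           # Steps 7-8
```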
Given $X_{0:n}^{(\ell)}$, a candidate $X_{0:n}^*$ for $X_{0:n}^{(\ell+1)}$ is produced by simulation according to $X_{0:n}^* \sim k_n(X_{0:n}^{(\ell)}, \cdot)$, where $k_n$ is some proposal kernel on $\mathsf{X}^{n+1}$; after this, one sets $X_{0:n}^{(\ell+1)} = X_{0:n}^*$ with probability
$$\alpha_n(X_{0:n}^{(\ell)}, X_{0:n}^*) = 1 \wedge \left( \frac{\phi_{0:n|n}(X_{0:n}^*) \, k_n(X_{0:n}^*, X_{0:n}^{(\ell)})}{\phi_{0:n|n}(X_{0:n}^{(\ell)}) \, k_n(X_{0:n}^{(\ell)}, X_{0:n}^*)} \right), \quad (3.5)$$
and $X_{0:n}^{(\ell+1)} = X_{0:n}^{(\ell)}$ otherwise. The initial trajectory $X_{0:n}^{(0)}$ may be chosen arbitrarily. The M-H algorithm above admits $\phi_{0:n|n}$ as stationary distribution, and under weak additional assumptions (such as Harris recurrence), $(X_{0:n}^{(\ell)})_{\ell \geq 0}$ converges in distribution to $\phi_{0:n|n}$ [see e.g. 19, for details]. In order to obtain an acceptance rate close to one, one should aim to simulate the candidates from a proposal distribution that is as close to $\phi_{0:n|n}$ as possible. Recalling Algorithm 1, a natural strategy is thus to generate the candidate using the particle-based FFBS. Indeed, with $\mathcal{L}^N(X_{0:n}^*)$ denoting the law of the draw $X_{0:n}^*$ returned by Algorithm 1, [16, Theorem 1] shows that under rather weak assumptions there exists a constant $C_n < \infty$ such that for all $N \geq 1$,
$$\left\| \mathcal{L}^N(X_{0:n}^*) - \phi_{0:n|n} \right\|_{\mathrm{TV}} \leq C_n / N,$$
where $\|\cdot\|_{\mathrm{TV}}$ denotes the total variation distance; $\|\mu - \nu\|_{\mathrm{TV}} = \sup_{\|f\|_\infty \leq 1} |\mu(f) - \nu(f)|$ for probability measures $\mu$ and $\nu$. Unfortunately, constructing an M-H kernel based on this proposal distribution (which is independent of the given $X_{0:n}^{(\ell)}$) is not possible in practice, since the density of $\mathcal{L}^N(X_{0:n}^*)$ is infeasible to compute. However, the joint density of all random variables (i.e. all indices and particle locations drawn in the forward pass as well as the indices obtained in the backward pass) generating the output of the particle-based FFBS has a simple form; thus, inspired by [1], we bypass this difficulty by sampling instead a well chosen auxiliary target distribution on the augmented state space of all these random variables.
Interestingly, it turns out that the acceptance ratio of the resulting independent M-H sampler, which is described in Algorithm 2 below, is the same as for the standard particle independent M-H sampler proposed in [1].

Algorithm 2 (∗ FFBS-based IM-H sampler ∗) Input: $X_{0:n}^{(\ell)}$
1. run Algorithm 1 to obtain $X_{0:n}^*$ and $Z_n^{N,*}$
2. set $X_{0:n}^{(\ell+1)} \leftarrow X_{0:n}^*$ with probability
$$\alpha_n(X_{0:n}^{(\ell)}, X_{0:n}^*) = 1 \wedge \frac{Z_n^{N,*}}{Z_n^N};$$
otherwise set $X_{0:n}^{(\ell+1)} \leftarrow X_{0:n}^{(\ell)}$
3. return $X_{0:n}^{(\ell+1)}$

In order to derive precisely the scheme above, denote by $\boldsymbol{\xi}_k \overset{\mathrm{def}}{=} (\xi_k^1, \ldots, \xi_k^N)$ and $\mathbf{I}_k \overset{\mathrm{def}}{=} (I_k^1, \ldots, I_k^N)$ the collection of all particles and indices generated by the particle filter at time step $k \geq$
$0$. Then the process $(\boldsymbol{\xi}_k, \mathbf{I}_k)_{k \geq 0}$ is Markovian with joint law given by the density
$$\psi_n^N(\boldsymbol{\xi}_0, \ldots, \boldsymbol{\xi}_n, \mathbf{i}_1, \ldots, \mathbf{i}_n) \overset{\mathrm{def}}{=} \left( \prod_{\ell=1}^N \rho_0(\xi_0^\ell) \right) \prod_{k=1}^n \prod_{\ell=1}^N \Pi_k^N(i_k^\ell, \xi_k^\ell). \quad (3.6)$$
Let again $(J_k)_{k=0}^n$ denote the indices drawn in the backward simulation pass. The joint law of the particles $(\boldsymbol{\xi}_k)_{k=0}^n$ and indices $(\mathbf{I}_k)_{k=1}^n$ obtained in the forward filtering pass and the indices $(J_k)_{k=0}^n$ of the backward simulation pass is then given by
$$k_n^N(\boldsymbol{\xi}_0, \ldots, \boldsymbol{\xi}_n, \mathbf{i}_1, \ldots, \mathbf{i}_n, j_0, \ldots, j_n) \quad (3.7)$$
$$\overset{\mathrm{def}}{=} \psi_n^N(\boldsymbol{\xi}_0, \ldots, \boldsymbol{\xi}_n, \mathbf{i}_1, \ldots, \mathbf{i}_n) \times \frac{\omega_n^{j_n}}{\sum_{\ell=1}^N \omega_n^\ell} \prod_{k=1}^n \overleftarrow{Q}_{\phi_{k-1}^N}(\xi_k^{j_k}, \xi_{k-1}^{j_{k-1}})$$
$$= \psi_n^N(\boldsymbol{\xi}_0, \ldots, \boldsymbol{\xi}_n, \mathbf{i}_1, \ldots, \mathbf{i}_n) \times \frac{\omega_n^{j_n}}{\sum_{\ell=1}^N \omega_n^\ell} \prod_{k=1}^n \frac{\omega_{k-1}^{j_{k-1}} \, q(\xi_{k-1}^{j_{k-1}}, \xi_k^{j_k})}{\sum_{\ell=1}^N \omega_{k-1}^\ell \, q(\xi_{k-1}^\ell, \xi_k^{j_k})},$$
where the second factor is the conditional distribution of $(J_k)_{k=0}^n$ given the particle locations and indices obtained in the forward filtering pass. Using the density (3.6), the law of a draw produced by Algorithm 1 can be expressed, for $A \in \mathcal{B}(\mathsf{X})^{\otimes(n+1)}$, as
$$\mathcal{L}^N(X_{0:n}^*)(A) = \mathrm{E}_{k_n^N}[\mathbb{1}_A(\xi_0^{J_0}, \ldots, \xi_n^{J_n})] = \mathrm{E}_{k_n^N}[\mathrm{E}_{k_n^N}[\mathbb{1}_A(\xi_0^{J_0}, \ldots, \xi_n^{J_n}) \mid \boldsymbol{\xi}_0, \ldots, \boldsymbol{\xi}_n, \mathbf{I}_1, \ldots, \mathbf{I}_n]] = \mathrm{E}_{\psi_n^N}[\phi_{0:n|n}^N(A)].$$
It turns out that the distribution targeted by Algorithm 2 is given by
$$\pi_n^N(\boldsymbol{\xi}_0, \ldots, \boldsymbol{\xi}_n, \mathbf{i}_1, \ldots, \mathbf{i}_n, j_0, \ldots, j_n) \overset{\mathrm{def}}{=} \frac{\phi_{0:n|n}(\xi_0^{j_0}, \ldots, \xi_n^{j_n})}{N^{n+1}} \quad (3.8)$$
$$\times \frac{\psi_n^N(\boldsymbol{\xi}_0, \ldots, \boldsymbol{\xi}_n, \mathbf{i}_1, \ldots, \mathbf{i}_n)}{\rho_0(\xi_0^{j_0}) \prod_{k=1}^n \Pi_k^N(i_k^{j_k}, \xi_k^{j_k})} \times \prod_{k=1}^n \frac{\omega_{k-1}^{i_k^{j_k}} \, q(\xi_{k-1}^{i_k^{j_k}}, \xi_k^{j_k})}{\sum_{\ell=1}^N \omega_{k-1}^\ell \, q(\xi_{k-1}^\ell, \xi_k^{j_k})}.$$

Theorem 1
For any particle sample size $N \geq 1$, the distribution of $(\xi_0^{J_0}, \ldots, \xi_n^{J_n})$ under $\pi_n^N$ is $\phi_{0:n|n}$.

Theorem 2
For any $N \geq 1$, the update produced by Algorithm 2 is a standard M-H update with target distribution $\pi_n^N$ and proposal distribution $k_n^N$.

Impose the following (standard) boundedness condition on the particle importance and adjustment multiplier weight functions.
Assumption 1. For all $0 \leq k \leq n$, $\|\omega_k\|_\infty < \infty$ and $\|\vartheta_k\|_\infty < \infty$.

We now have the following result.
Theorem 3
Let Assumption 1 hold. Then there is a $\kappa \in [0, 1)$ such that for all $\ell \geq 0$ and all $x_{0:n} \in \mathsf{X}^{n+1}$,
$$\left\| \mathcal{L}(X_{0:n}^{(\ell)} \mid X_{0:n}^{(0)} = x_{0:n}) - \phi_{0:n|n} \right\|_{\mathrm{TV}} \leq \kappa^\ell.$$

This result relies on the fact that if the ratio of the target to the proposal density, here $Z_n^{N,*}/Z_n$, is bounded, the independence M-H sampler converges geometrically [cf. 1, p. 293]. Finally, by applying an Azuma-Hoeffding-type exponential inequality for geometrically ergodic Markov chains, derived recently in [10], we obtain, as a corollary of Theorem 3, the following result describing the convergence of MCMC estimates formed by the output of Algorithm 2.

Corollary 4
Let Assumption 1 hold. Then for all $N \geq 1$ there exists a constant $c > 0$ such that for all bounded measurable functions $h : \mathsf{X}^{n+1} \to \mathbb{R}$, all $m \geq 1$, all initial trajectories $X_{0:n}^{(0)} = x_{0:n} \in \mathsf{X}^{n+1}$, and all $\epsilon > 0$,
$$\mathrm{P}\left( \left| \frac{1}{m} \sum_{\ell=0}^m h(X_{0:n}^{(\ell)}) - \phi_{0:n|n}(h) \right| \geq \epsilon \right) \leq c \exp\left( -\frac{m}{c} \left\{ \frac{\epsilon^2}{\|h\|_\infty^2} \wedge \frac{\epsilon}{\|h\|_\infty} \right\} \right).$$

The results above show that the acceptance probability for a proposed trajectory obtained by backward sampling does in fact not depend on the current or proposed trajectories themselves, but only on the likelihood estimates, computed from the set of particles and their importance weights, underlying the respective trajectories. Theorem 2 in [1] shows that the same holds true when tracing a trajectory backwards from the genealogical tree, an MCMC algorithm they refer to as the particle independent Metropolis-Hastings (PIMH) sampler. Therefore we can view the M-H sampler as one that proposes and possibly accepts sets of particles rather than trajectories, and from the current set of particles we may choose to simulate a trajectory either by sampling a final state and following its genealogy backwards, or by backward sampling.

Denoting by $X_{0:n}^{\mathrm{sim}}$ a trajectory simulated by either method using the current set of particles, we know that
$$\mathrm{E}[h(X_{0:n}) \mid y_{0:n}] = \mathrm{E}\left\{ \mathrm{E}[h(X_{0:n}^{\mathrm{sim}}) \mid (\boldsymbol{\xi}_k, \boldsymbol{\omega}_k)_{0 \leq k \leq n}] \mid y_{0:n} \right\},$$
where expectations are computed under the stationary distribution of the MCMC sampler and the notation $(\boldsymbol{\xi}_k, \boldsymbol{\omega}_k)_{0 \leq k \leq n}$ also contains the ancestral history of each particle, if required.
Therefore it also holds that
$$\mathrm{E}[h(X_{0:n}) \mid y_{0:n}] = \mathrm{E}\left\{ \mathrm{E}\left[ J^{-1} \sum_{j=1}^J h(X_{0:n}^{\mathrm{sim},j}) \,\Big|\, (\boldsymbol{\xi}_k, \boldsymbol{\omega}_k)_{0 \leq k \leq n} \right] \,\Big|\, y_{0:n} \right\},$$
where now $X_{0:n}^{\mathrm{sim},j}$ denotes one of $J$ trajectories sampled independently. Moreover, we can in principle remove sampling of trajectories altogether by letting $J \to \infty$. This is equivalent to enumerating all possible sampled trajectories $x_{0:n}$, computing the probability, $v_{x_{0:n}}$ say, of that trajectory being sampled, and finally computing the weighted average $\sum v_{x_{0:n}} h(x_{0:n})$. When sampling from the genealogical tree this is possible to do, as there are only $N$ different possible trajectories (ending at positions $\xi_n^i$ for $i = 1, \ldots, N$). Replacing sampling by averaging in this way is known as Rao-Blackwellisation. Section 4.6 in [1] does point this out, and it also provides convergence results for the weighted average above. For backward sampling it is generally not possible to work with all possible trajectories, as there are typically $N^{n+1}$ of them, but for low-dimensional distributions of $X_{0:n}$, like that of a single $X_k$ or a pair $(X_k, X_{k+1})$, Rao-Blackwellisation is feasible. It can then be obtained by iterating the normalised weights at time $n$ backwards through the backward kernels, which is equivalent to the backward pass of the forward-backward algorithm for HMMs. Thus we obtain the smoothing probability, $v_k^i$ say, that a sampled trajectory will pass through $\xi_k^i$ at time $k$, and we can compute the weighted average $\sum_{i=1}^N v_k^i h(\xi_k^i)$. For a pair $(X_k, X_{k+1})$, a similar computation is possible.

Computing smoothing probabilities obviously requires more computing time than does tracing a trajectory backwards, as when sampling from the genealogical tree. However, since the tree will have low variability at time points away from the final time point $n$, averaging over such points will involve summing over just one or a few particles.
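The backward iteration of the normalised weights just described can be written compactly. A sketch, vectorised over particles (the transition density `q` is assumed to accept arrays):

```python
import numpy as np

def smoothing_marginals(xi, w, q):
    """Iterate the normalised final-time weights backwards through the
    discrete backward kernels, giving the probability v[k][i] that a
    backward-sampled trajectory passes through particle xi[k][i] at time k.
    xi[k] and w[k] hold the particles and unnormalised weights at time k."""
    n = len(xi) - 1
    v = [None] * (n + 1)
    v[n] = w[n] / w[n].sum()
    for k in range(n - 1, -1, -1):
        # B[i, j] = P(trajectory visits particle i at k | it visits j at k+1)
        B = w[k][:, None] * q(xi[k][:, None], xi[k + 1][None, :])
        B /= B.sum(axis=0, keepdims=True)
        v[k] = B @ v[k + 1]
    return v  # each v[k] sums to one

# E[h(X_k) | y_{0:n}] is then approximated by sum_i v[k][i] * h(xi[k][i])
```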
Backward smoothing does not suffer from this problem, and hence we can expect better Rao-Blackwellisation for all time-points except for the last few. A compromise is however also possible, namely to simulate a number, say 5–25, of trajectories using backward sampling and to compute the average over those. We will now take a closer look at this approach.

Write $(\xi^{(r)}, \omega^{(r)})$ for the set $(\xi_k, \omega_k)_{0 \leq k \leq n}$ of particles and weights in the $r$-th iteration of the MCMC algorithm, and let $X^{(r,j)}_{0:n}$, $j = 1, \ldots, J$, be $J$ trajectories obtained from this set of particles using backward sampling, simulated independently. Assume for simplicity that we wish to estimate $\mathbb{E}[h(X_k) \mid y_{0:n}]$ for some $k$; the discussion here generalises with only notational changes to functionals of more than one $X$-variable.

Running $R$ MCMC iterations, $(1/RJ) \sum_{r=1}^{R} \sum_{j=1}^{J} h(X^{(r,j)}_k)$ is our estimate of $\mathbb{E}[h(X_k) \mid y_{0:n}]$. To express the variance of this estimate, consider
$$
\begin{aligned}
\operatorname{Var}\left( \sum_{r=1}^{R} \sum_{j=1}^{J} h(X^{(r,j)}_k) \right)
&= \mathbb{E}\left[ \operatorname{Var}\left( \sum_{r=1}^{R} \sum_{j=1}^{J} h(X^{(r,j)}_k) \,\Big|\, (\xi^{(r)}, \omega^{(r)})_{r=1}^{R} \right) \right] \\
&\quad + \operatorname{Var}\left( \mathbb{E}\left[ \sum_{r=1}^{R} \sum_{j=1}^{J} h(X^{(r,j)}_k) \,\Big|\, (\xi^{(r)}, \omega^{(r)})_{r=1}^{R} \right] \right) \\
&= \mathbb{E}\left[ \sum_{r=1}^{R} J \operatorname{Var}\left( h(X^{\mathrm{sim}}_k) \mid (\xi^{(r)}, \omega^{(r)}) \right) \right] + \operatorname{Var}\left( \sum_{r=1}^{R} J\, \mathbb{E}\left( h(X^{\mathrm{sim}}_k) \mid (\xi^{(r)}, \omega^{(r)}) \right) \right) \\
&= RJ \left( \sigma^2 + J R^{-1} \operatorname{Var}\left( \sum_{r=1}^{R} \mathbb{E}\left( h(X^{\mathrm{sim}}_k) \mid (\xi^{(r)}, \omega^{(r)}) \right) \right) \right) \\
&\approx RJ \left\{ \sigma^2 + J \sigma^2_\infty \right\},
\end{aligned}
$$
where $X^{\mathrm{sim}}$ as above denotes a generic trajectory obtained by backward sampling, $\sigma^2 = \mathbb{E}[\operatorname{Var}( h(X^{\mathrm{sim}}_k) \mid (\xi, \omega) )]$, and $\sigma^2_\infty$ is the limit as $R \to \infty$ of the normalised variance of the sum in the second last step. This limit, the so-called time-average variance constant (TAVC) in terminology from [3, Chapter IV.1], will exist if the MCMC sampler does not mix too slowly. Thus we can approximate the variance of our estimate as
$$
\operatorname{Var}\left( \frac{1}{RJ} \sum_{r=1}^{R} \sum_{j=1}^{J} h(X^{(r,j)}_k) \right) \approx \frac{1}{R} \left( \sigma^2 / J + \sigma^2_\infty \right).
$$
(3.9)

Now assume that it takes time $\tau_{\mathrm{PF}}$ to simulate one set of particles, and that it takes time $\tau_{\mathrm{BS}}$ to simulate one trajectory using backward sampling. The total computational cost for obtaining the estimate above is then $R(\tau_{\mathrm{PF}} + J \tau_{\mathrm{BS}})$. If we have a total computation time $\tau$ available, we can minimise the right-hand side of (3.9) under the constraint that the total computation time is $\tau$. Treating $J$ as a continuous variable, one finds that the optimal value of $J$ is
$$
J_{\mathrm{opt}} = \sqrt{ \frac{\sigma^2 / \tau_{\mathrm{BS}}}{\sigma^2_\infty / \tau_{\mathrm{PF}}} }.
$$
This expression is quite intuitive; if the variability of $h(X^{\mathrm{sim}}_k)$ within a fixed set of particles tends to be large ($\sigma^2$ is large) and sampling trajectories is quick ($\tau_{\mathrm{BS}}$ is small), then we should reduce variability by drawing many trajectories. Likewise we should do so if variability between sets of particles is small ($\sigma^2_\infty$ is small) and it is time-consuming to generate new sets of particles ($\tau_{\mathrm{PF}}$ is large).

In practice none of the parameters involved above are known, so they need to be estimated from data and run times. In the example below we illustrate this.

Andrieu et al. [1, Section 4.4] devised an algorithm, referred to as the particle marginal Metropolis-Hastings (PMMH) update, for sampling in the case where a model parameter is included in the MCMC sampler's state space. We will now outline, briefly, that an entirely similar approach is applicable when trajectories are proposed using FFBS.

Thus there is a parameter (vector) $\theta$ in some space $\Theta$, and the transition density $q$, the emission densities $g_k$, and the initial distribution $\rho$ may all depend on $\theta$. To $\theta$ belongs a prior density (with respect to some dominating measure on $\Theta$), denoted by $\pi$. The joint posterior density of $\theta$ and $x_{0:n}$, which we denote by $\pi_n(\theta, x_{0:n})$, is then proportional to $\pi(\theta) \phi_{n|n}(x_{0:n}; \theta)$. The MCMC algorithm uses a proposal density, $k_\theta(\cdot \mid \cdot)$ say, for proposing new values for $\theta$, and is as follows.

Algorithm 3 (FFBS-based PMMH sampler) Input: $\theta^{(\ell)}$ and $X^{(\ell)}_{0:n}$
1. sample $\theta^*$ from $k_\theta(\cdot \mid \theta^{(\ell)})$

2. run Algorithm 1, under the parameter $\theta^*$, to obtain $X^*_{0:n}$ and $Z^{N,*}_n$
3. set $(\theta^{(\ell+1)}, X^{(\ell+1)}_{0:n}) \leftarrow (\theta^*, X^*_{0:n})$ with probability
$$
1 \wedge \frac{ k_\theta(\theta^{(\ell)} \mid \theta^*)\, Z^{N,*}_n }{ k_\theta(\theta^* \mid \theta^{(\ell)})\, Z^N_n };
$$
otherwise set $(\theta^{(\ell+1)}, X^{(\ell+1)}_{0:n}) \leftarrow (\theta^{(\ell)}, X^{(\ell)}_{0:n})$

4. return $(\theta^{(\ell+1)}, X^{(\ell+1)}_{0:n})$

In the same fashion as in [1], one may show that on an enlarged MCMC state space encompassing $\theta$, $(\xi_0, \ldots, \xi_n)$, $(i_1, \ldots, i_n)$ and $(j_0, \ldots, j_n)$, the proposal density of the sampler is $k_\theta(\theta' \mid \theta)\, k^N_n(\xi_0, \ldots, \xi_n, i_1, \ldots, i_n, j_0, \ldots, j_n; \theta')$, where $\theta$ is the current parameter and $k^N_n(\xi_0, \ldots, \xi_n, i_1, \ldots, i_n, j_0, \ldots, j_n; \theta')$ is as in (3.7) but with dependence on the parameter included; that the density targeted by the MCMC sampler is proportional to $\pi(\theta)\, \pi^N_n(\xi_0, \ldots, \xi_n, i_1, \ldots, i_n, j_0, \ldots, j_n; \theta)$ with $\pi^N_n(\xi_0, \ldots, \xi_n, i_1, \ldots, i_n, j_0, \ldots, j_n; \theta)$ as in (3.8) but with dependence on $\theta$ included; and that the marginal distribution of $(\theta, \xi^{J_0}_0, \ldots, \xi^{J_n}_n)$ under this target density is the posterior $\pi_n(\theta, x_{0:n})$ (cf. Theorem 1). Moreover, under standard assumptions on irreducibility, the sequence $(\theta^{(\ell)}, X^{(\ell)}_{0:n})$ generated by Algorithm 3 will converge in distribution to $\pi_n(\theta, x_{0:n})$ [cf. 1, Theorem 4.4b].

Since the acceptance probability of Algorithm 3 again does not depend on the current or proposed trajectories themselves but only on the likelihood estimates, one can just as in Section 3.3 draw multiple trajectories from the current set of particles, or average over all of them using backward smoothing, to reduce the variance of sample means that approximate posterior expectations of functionals of the latent states. Also, the same remark again applies when trajectories are sampled backwards from the genealogical tree.

4 Example
In this section we illustrate the methods developed above for a state-space model often referred to as the growth model, which is a standard example in the particle filtering literature. The model is
$$
X_k = \frac{X_{k-1}}{2} + \frac{25 X_{k-1}}{1 + X_{k-1}^2} + 8 \cos(1.2 k) + V_k, \quad (4.1)
$$
$$
Y_k = \frac{X_k^2}{20} + W_k, \quad (4.2)
$$
with $X_0 \sim \mathrm{N}(0, \sigma_0^2)$, $V_k \sim \mathrm{NID}(0, \sigma_V^2)$ and $W_k \sim \mathrm{NID}(0, \sigma_W^2)$. Because of the squared $X_k$ in the measurement equation (4.2), the filter distributions for this model are in general bimodal.

We chose parameters $\sigma_0^2 = 5$, $\sigma_V^2 = 10$ and $\sigma_W^2 = 1$ [an example also studied in 1], and $N = 500$ particles. We simulated a set $y_{0:50}$ of observations, i.e. $n = 50$, and then $R = 5000$ sets of particles. We used the bootstrap filter, i.e. the filter with all adjustment multiplier weights $\vartheta^i_k = 1$ and proposal kernel equal to the system dynamics; in other words, $R_{k-1}(x, \cdot)$ was the Gaussian density with mean as in the right-hand side of (4.1) and with variance $\sigma_V^2$. For the bootstrap filter the importance weights $\omega^i_{k+1}(x, x')$ simply become the emission densities $g_{k+1}(x')$. The number of accepted proposed sets of particles was 1515, yielding an empirical acceptance ratio $1515/R \approx 0.30$.
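For concreteness, simulating data from (4.1)–(4.2) might look as follows. This is a Python sketch (the paper's simulations were done in Matlab), and the function and parameter names are my own:

```python
import math
import random

def simulate_growth_model(n, sigma0_sq=5.0, sigma_v_sq=10.0,
                          sigma_w_sq=1.0, rng=random):
    """Simulate states x_0,...,x_n and observations y_0,...,y_n
    from the growth model (4.1)-(4.2)."""
    x = [rng.gauss(0.0, math.sqrt(sigma0_sq))]
    for k in range(1, n + 1):
        mean = (x[-1] / 2 + 25 * x[-1] / (1 + x[-1] ** 2)
                + 8 * math.cos(1.2 * k))
        x.append(mean + rng.gauss(0.0, math.sqrt(sigma_v_sq)))
    y = [xk ** 2 / 20 + rng.gauss(0.0, math.sqrt(sigma_w_sq)) for xk in x]
    return x, y
```

Under the bootstrap filter, particles are propagated through the same state recursion, and the importance weight of a propagated particle $x'$ is simply the Gaussian emission density $\mathrm{N}(y_k;\, x'^2/20,\, \sigma_W^2)$.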
We did not use a burn-in period at all, as the output showed no signs of a significant initial transient.

With the aim of estimating $\mathbb{E}[X_k \mid y_{0:n}]$ for each $k$, we did in each sweep of the MCMC algorithm, i.e. for each current set of particles,

(i) simulate one trajectory by tracing the genealogical tree backwards;

(ii) compute the Rao-Blackwellised average, for each $X_k$, over all $N$ backward trajectories from the genealogical tree;

(iii) simulate $J = 25$ trajectories using backward sampling;

(iv) run the backward smoothing algorithm to compute a smoothed average of $X_k$, which is the same as Rao-Blackwellising backward sampling.

We denote these four methods by GT, GTRB, BS and BSM respectively. Thus GT is what is referred to as PIMH in [1]. Backward sampling was done using the importance sampling (IS) scheme in [8, Algorithm 1]. This scheme avoids computing all backward transition probabilities when extending a trajectory one step backwards, although we did abort IS, compute all transition probabilities and use them to simulate the state in question after 15 failed IS proposals. The average IS acceptance rate over all sets of particles and time-points was 16%.

Sample averages over the $R$ sets of particles, and, in the case of backward sampling, over the $J$ simulated trajectories for each set of particles, are shown in Figure 1. Obviously all methods provide the same result, which they should, so the differences lie in the variances.

Figure 1: Estimates of $\mathbb{E}[X_k \mid y_{0:n}]$ computed as sample means from the methods GT, GTRB, BS and BSM. The four curves overlap and are not distinguishable.

For BSM we have the expression $\sigma^2_{\infty, \mathrm{BSM}, k} / R$ for the asymptotic variance, where $\sigma^2_{\infty, \mathrm{BSM}, k}$ is as in Section 3.3; observe that the expression $\mathbb{E}[h(X^{\mathrm{sim}}_k) \mid \xi^{(r)}]$, with $h$ as the identity function, is indeed the mean of $X_k$ obtained with backward smoothing.
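A TAVC such as $\sigma^2_{\infty, \mathrm{BSM}, k}$ can be estimated from the sequence of per-sweep conditional means by summing estimated autocovariances over lags $|\ell| < \sqrt{R}$, weighted by $(1 - |\ell|/R)$, which is the estimator used in this example. The sketch below is an illustrative Python version; implementation details beyond the stated window and weights are my own assumptions:

```python
import math

def estimate_tavc(z):
    """Estimate the time-average variance constant of a sequence z
    by summing estimated autocovariances over lags |l| < sqrt(R),
    weighted by (1 - |l|/R)."""
    R = len(z)
    mean = sum(z) / R
    L = int(math.sqrt(R))

    def autocov(lag):
        # standard biased autocovariance estimate at the given lag
        return sum((z[t] - mean) * (z[t + lag] - mean)
                   for t in range(R - lag)) / R

    tavc = autocov(0)
    for lag in range(1, L):
        tavc += 2 * (1 - lag / R) * autocov(lag)
    return tavc
```

Dividing such an estimate by $R$ then approximates the variance of the corresponding MCMC sample mean.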
Here we also include a subindex $k$ as this variance will depend on $k$, and also a subindex BSM as we will require similar variances for GT and GTRB. For BS we have the asymptotic variance $(\sigma^2_k / J + \sigma^2_{\infty, \mathrm{BSM}, k}) / R$ as in (3.9), where the subindex $k$ in $\sigma^2_k$ again denotes dependence on the time-index $k$. The asymptotic variances of GT and GTRB we write as $\sigma^2_{\infty, \mathrm{GT}, k} / R$ and $\sigma^2_{\infty, \mathrm{GTRB}, k} / R$ respectively, where $\sigma^2_{\infty, \mathrm{GT}, k}$ and $\sigma^2_{\infty, \mathrm{GTRB}, k}$ are TAVCs defined similarly as $\sigma^2_{\infty, \mathrm{BSM}, k}$ but for one trajectory sampled from the genealogical tree and for the weighted average over all such trajectories, respectively.

In practice none of these variances are known, and we need to estimate them from the simulations. We estimated $\sigma^2_k$ by first, for each set of particles, computing the sample variance of all $J$ trajectories $X^{\mathrm{sim},j}_k$ obtained by backward sampling, and then computing the average of these sample variances over all $R$ sets of particles. The TAVCs $\sigma^2_\infty$ were estimated by summing up estimated autocovariances over lags $|\ell| < \sqrt{R}$, weighted by $(1 - |\ell|/R)$ [cf. 5, p. 59]. Inserting these variance estimates into the expressions for asymptotic variances and taking square roots yields standard errors for the respective estimates of $\mathbb{E}[X_k \mid y_{0:n}]$, shown in Figure 2.

Figure 2: Standard errors of estimates of $\mathbb{E}[X_k \mid y_{0:n}]$ for the methods GT ($\times$), GTRB ($\circ$), BS ($+$) and BSM ($*$).

We see that GT has standard errors larger than those of BS and BSM, which is to be expected as GT samples a single trajectory while BS samples $J = 25$ trajectories and BSM averages over all of them. We also see that the standard errors of GT and GTRB are close to identical except towards the final time-point $n$. This is a result of the degeneracy of the genealogical tree: for $k$ just a bit less than $n$ there are only one or a few ancestors with descendants alive at time $n$, and then Rao-Blackwellisation (GTRB) adds little compared to just sampling (GT). For $k \geq$
43, say, GTRB however does better, and it is on par with BSM for $k \geq$
48; for such late time-points the final particles' ancestries have not coalesced, and at time $n$ GTRB and BSM are equivalent. Comparing BS and BSM we find that they have similar standard errors, and this is because the term $\sigma^2_k / J$ is, with some exceptional $k$, smaller than $\sigma^2_{\infty, \mathrm{BSM}, k}$ for $J = 25$.

Comparing standard errors without comparing execution times does not provide the full picture, however, and for that reason we introduce a measure of precision per computational effort, defined as inverse variance over computation time. We refer to this measure as efficiency, and we can estimate it using inverse squared standard errors over measured computation times. The computation time of each method was measured using the function cputime in Matlab, the software used for all simulations. Figure 3 plots these estimates. We see that BS is better than BSM, which in turn is better than GT and GTRB, which perform about equally. The exception is the last few time-points, for which GTRB, which is fast, does very well. The ratios of efficiencies for BS vs. GT (over all $k$) range from 0.19 to 30.7, with 48 (out of 50) of them being larger than one and their geometric mean being 5.4. For BSM vs. GT the corresponding figures are 0.04, 11.4, 36 and 1.8 respectively.

Figure 3: Estimated efficiency for estimating $\mathbb{E}[X_k \mid y_{0:n}]$ for the methods GT ($\times$), GTRB ($\circ$), BS with $J = 25$ trajectories ($+$), BS with $J = 7$ trajectories ($\Delta$) and BSM ($*$). The $y$-axis is truncated; GTRB reaches efficiencies of about 900 s$^{-1}$ for the final time points.

Having said that, we remark that figures like these crucially depend on software and implementation. GTRB is fast because Matlab does vectorised operations quickly, and we also believe that BS has a slight disadvantage from slow random number generation in
Matlab. In addition, the resolution of cputime appears to be 10 ms, which may not be short enough to provide accurate measures of execution times (as we measured the time of each call to the functions performing the various methods).

As discussed in Section 3.3, we can choose the number $J$ of trajectories sampled in the BS method to achieve the best variance/cost performance. This number varies quite a bit with respect to $k$ however, with the minimum value of $J_{\mathrm{opt}}$ (viewed as a continuous variable) being 0.74 and the largest 18.9. As a compromise we chose the geometric mean 6.98, rounded to $J = 7$ (the arithmetic mean is 8.12). The estimated efficiency for this $J$ is also plotted in Figure 3, and we see that there is an improvement over $J = 25$ that is mostly marginal, but for some $k$ notable. The ratios of efficiencies for this optimised BS vs. GT range from 0.50 to 37.6, with 49 (out of 50) of them being larger than one and their geometric mean being 7.5.

A Proofs
A.1 Proof of Theorem 1
To prove the statement we simply carry through the marginalisation. Thus let $\xi^{-\ell}_k \stackrel{\mathrm{def}}{=} (\xi^i_k)_{1 \leq i \leq N,\, i \neq \ell}$ and define $i^{-\ell}_k$ analogously; the marginal of $\pi^N_n$ with respect to $(\xi^{J_0}_0, \ldots, \xi^{J_n}_n, J_0, \ldots, J_n)$ is then obtained by integrating $\pi^N_n$ over $(i_k, \xi^{-j_k}_k)_{k=0}^{n}$. We start with integrating over $i_n$ and $\xi^{-j_n}_n$ according to
$$
\begin{aligned}
&\pi^N_n(\xi_0, \ldots, \xi_{n-1}, \xi^{j_n}_n, i_1, \ldots, i_{n-1}, j_0, \ldots, j_n) \\
&\quad = \sum_{i_n} \int \pi^N_n(\xi_0, \ldots, \xi_n, i_1, \ldots, i_n, j_0, \ldots, j_n) \, \mathrm{d}\xi^{-j_n}_n \\
&\quad = \frac{\phi_{n|n}(\xi^{j_0}_0, \ldots, \xi^{j_n}_n)}{N^{n+1}} \times \prod_{k=1}^{n-1} \frac{\omega^{i^{j_k}_k}_{k-1}\, q(\xi^{i^{j_k}_k}_{k-1}, \xi^{j_k}_k)}{\sum_{\ell=1}^N \omega^\ell_{k-1}\, q(\xi^\ell_{k-1}, \xi^{j_k}_k)} \\
&\qquad \times \sum_{i^{-j_n}_n} \int \frac{\psi^N_n(\xi_0, \ldots, \xi_n, i_1, \ldots, i_n)}{\rho(\xi^{j_0}_0) \prod_{k=1}^{n} \Pi^N_k(i^{j_k}_k, \xi^{j_k}_k)} \, \mathrm{d}\xi^{-j_n}_n \\
&\qquad \times \sum_{i^{j_n}_n} \frac{\omega^{i^{j_n}_n}_{n-1}\, q(\xi^{i^{j_n}_n}_{n-1}, \xi^{j_n}_n)}{\sum_{\ell=1}^N \omega^\ell_{n-1}\, q(\xi^\ell_{n-1}, \xi^{j_n}_n)}. \quad \text{(A.1)}
\end{aligned}
$$
In the expression above,
$$
\sum_{i^{j_n}_n} \frac{\omega^{i^{j_n}_n}_{n-1}\, q(\xi^{i^{j_n}_n}_{n-1}, \xi^{j_n}_n)}{\sum_{\ell=1}^N \omega^\ell_{n-1}\, q(\xi^\ell_{n-1}, \xi^{j_n}_n)} = 1 \quad \text{(A.2)}
$$
and
$$
\sum_{i^{-j_n}_n} \int \frac{\psi^N_n(\xi_0, \ldots, \xi_n, i_1, \ldots, i_n)}{\rho(\xi^{j_0}_0) \prod_{k=1}^{n} \Pi^N_k(i^{j_k}_k, \xi^{j_k}_k)} \, \mathrm{d}\xi^{-j_n}_n = \frac{\psi^N_{n-1}(\xi_0, \ldots, \xi_{n-1}, i_1, \ldots, i_{n-1})}{\rho(\xi^{j_0}_0) \prod_{k=1}^{n-1} \Pi^N_k(i^{j_k}_k, \xi^{j_k}_k)}. \quad \text{(A.3)}
$$
Combining (A.1)–(A.3) yields
$$
\begin{aligned}
&\pi^N_n(\xi_0, \ldots, \xi_{n-1}, \xi^{j_n}_n, i_1, \ldots, i_{n-1}, j_0, \ldots, j_n) \\
&\quad = \frac{\phi_{n|n}(\xi^{j_0}_0, \ldots, \xi^{j_n}_n)}{N^{n+1}} \times \frac{\psi^N_{n-1}(\xi_0, \ldots, \xi_{n-1}, i_1, \ldots, i_{n-1})}{\rho(\xi^{j_0}_0) \prod_{k=1}^{n-1} \Pi^N_k(i^{j_k}_k, \xi^{j_k}_k)} \\
&\qquad \times \prod_{k=1}^{n-1} \frac{\omega^{i^{j_k}_k}_{k-1}\, q(\xi^{i^{j_k}_k}_{k-1}, \xi^{j_k}_k)}{\sum_{\ell=1}^N \omega^\ell_{k-1}\, q(\xi^\ell_{k-1}, \xi^{j_k}_k)}. \quad \text{(A.4)}
\end{aligned}
$$
Now, by integrating (A.4) with respect to $(i_{n-1}, \xi^{-j_{n-1}}_{n-1})$ and repeating the same procedure for $(i_{n-2}, \xi^{-j_{n-2}}_{n-2}), \ldots, (i_1, \xi^{-j_1}_1)$ and finally $\xi^{-j_0}_0$, we obtain the marginal density
$$
\pi^N_n(\xi^{j_0}_0, \ldots, \xi^{j_n}_n, j_0, \ldots, j_n) = \frac{\phi_{n|n}(\xi^{j_0}_0, \ldots, \xi^{j_n}_n)}{N^{n+1}}. \quad \text{(A.5)}
$$
Finally, for any rectangle $A = A_0 \times A_1 \times \cdots \times A_n$ in $\mathcal{B}(\mathsf{X})^{\otimes(n+1)}$,
$$
\mathbb{P}_{\pi^N_n}\left( \xi^{J_0}_0 \in A_0, \ldots \right.
\left. , \xi^{J_n}_n \in A_n \right) = \sum_{j_0, \ldots, j_n} \int_{A_0} \cdots \int_{A_n} \pi^N_n(\xi^{j_0}_0, \ldots, \xi^{j_n}_n, j_0, \ldots, j_n) \, \mathrm{d}\xi^{j_0}_0 \cdots \mathrm{d}\xi^{j_n}_n = \phi_{n|n}(A),
$$
implying that these measures are identical on $\mathcal{B}(\mathsf{X})^{\otimes(n+1)}$. We complete the proof by noting that the arguments above apply independently of the particle sample size $N$.

A.2 Proof of Theorem 2
It is enough to prove that
$$
R \stackrel{\mathrm{def}}{=} \frac{\pi^N_n(\xi_0, \ldots, \xi_n, I_1, \ldots, I_n, J_0, \ldots, J_n)}{k^N_n(\xi_0, \ldots, \xi_n, I_1, \ldots, I_n, J_0, \ldots, J_n)} = \frac{Z^N_n}{Z_n},
$$
where $Z^N_n$ is defined in (3.3). Using that
$$
\Pi^N_k(I^{J_k}_k, \xi^{J_k}_k) = \frac{\omega^{I^{J_k}_k}_{k-1} \vartheta^{I^{J_k}_k}_{k-1}}{\sum_{\ell=1}^N \omega^\ell_{k-1} \vartheta^\ell_{k-1}}\, r_{k-1}(\xi^{I^{J_k}_k}_{k-1}, \xi^{J_k}_k),
$$
we obtain
$$
R = \frac{\phi_{n|n}(\xi^{J_0}_0, \ldots, \xi^{J_n}_n) \prod_{k=1}^{n} \left\{ q(\xi^{I^{J_k}_k}_{k-1}, \xi^{J_k}_k) \sum_{\ell=1}^N \omega^\ell_{k-1} \vartheta^\ell_{k-1} \right\} \sum_{\ell=1}^N \omega^\ell_n}{N^{n+1}\, \rho(\xi^{J_0}_0)\, \omega^{J_0}_0 \prod_{k=1}^{n} \left\{ \omega^{J_k}_k\, q(\xi^{J_{k-1}}_{k-1}, \xi^{J_k}_k)\, \vartheta^{I^{J_k}_k}_{k-1}\, r_{k-1}(\xi^{I^{J_k}_k}_{k-1}, \xi^{J_k}_k) \right\}}. \quad \text{(A.6)}
$$
Now, by the definition (3.2) of the importance weights,
$$
\omega^{J_k}_k\, \vartheta^{I^{J_k}_k}_{k-1}\, r_{k-1}(\xi^{I^{J_k}_k}_{k-1}, \xi^{J_k}_k) = g_k(\xi^{J_k}_k)\, q(\xi^{I^{J_k}_k}_{k-1}, \xi^{J_k}_k) \quad \text{(A.7)}
$$
and plugging the identity (A.7) into (A.6) gives, using that, in addition, $\rho(\xi^{J_0}_0)\, \omega^{J_0}_0 = \rho(\xi^{J_0}_0)\, g_0(\xi^{J_0}_0)$,
$$
R = \frac{\phi_{n|n}(\xi^{J_0}_0, \ldots, \xi^{J_n}_n) \sum_{\ell=1}^N \omega^\ell_n \prod_{k=1}^{n} \sum_{\ell=1}^N \omega^\ell_{k-1} \vartheta^\ell_{k-1}}{N^{n+1}\, \rho(\xi^{J_0}_0)\, g_0(\xi^{J_0}_0) \prod_{k=1}^{n} \left\{ g_k(\xi^{J_k}_k)\, q(\xi^{J_{k-1}}_{k-1}, \xi^{J_k}_k) \right\}}.
$$
The proof is concluded by noting that
$$
\phi_{n|n}(\xi^{J_0}_0, \ldots, \xi^{J_n}_n)\, Z_n = \rho(\xi^{J_0}_0)\, g_0(\xi^{J_0}_0) \prod_{k=1}^{n} \left\{ g_k(\xi^{J_k}_k)\, q(\xi^{J_{k-1}}_{k-1}, \xi^{J_k}_k) \right\}.
$$

References

[1] Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo methods (with discussion).
J. Roy. Statist. Soc. B 72, 269–342.

[2] Andrieu, C., A. Doucet, S. S. Singh, and V. B. Tadić (2004). Particle methods for change detection, system identification, and control. Proc. IEEE 92, 423–438.

[3] Asmussen, S. and P. W. Glynn (2007). Stochastic Simulation: Algorithms and Analysis. New York: Springer.

[4] Baum, L. E., T. Petrie, G. Soules, and N. Weiss (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41(1), 164–171.

[5] Brockwell, P. J. and R. A. Davis (2002). Introduction to Time Series and Forecasting (2nd ed.). New York: Springer.

[6] Cappé, O., E. Moulines, and T. Rydén (2005). Inference in Hidden Markov Models. New York: Springer.

[7] Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. J. Econometrics 75, 79–97.

[8] Douc, R., A. Garivier, E. Moulines, and J. Olsson (2011). Sequential Monte Carlo smoothing for general state space hidden Markov models. Ann. Appl. Probab. To appear.

[9] Douc, R., E. Moulines, and J. Olsson (2009). Optimality of the auxiliary particle filter. Probab. Math. Statist. 29(1), 1–28.

[10] Douc, R., E. Moulines, J. Olsson, and R. van Handel (2011). Consistency of the maximum likelihood estimator for general hidden Markov models. Ann. Statist. To appear.

[11] Doucet, A., N. de Freitas, and N. Gordon (Eds.) (2001). Sequential Monte Carlo Methods in Practice. New York: Springer.

[12] Doucet, A., S. Godsill, and C. Andrieu (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput. 10, 197–208.

[13] Durbin, J. and S. J. Koopman (2001). Time Series Analysis by State Space Methods. Oxford: Oxford University Press.

[14] Fearnhead, P. (1998). Sequential Monte Carlo Methods in Filter Theory. Ph.D. thesis, University of Oxford.

[15] Gustafsson, F. (2010). Particle filter theory and practice with positioning applications. IEEE Aerospace Electronic Syst. Mag. 25, 53–82.

[16] Olsson, J., T. Rydén, and S. Stjernqvist (2010). A particle-based Markov chain Monte Carlo sampler for state-space models with applications to DNA copy number data. Submitted. Also appears in S. Stjernqvist's Ph.D. thesis Modelling Allelic and DNA Copy Number Variations Using Continuous-Index Hidden Markov Models, Centre for Mathematical Sciences, Lund University, 2010.

[17] Pitt, M. K. and N. Shephard (1999). Filtering via simulation: auxiliary particle filters. J. Am. Statist. Assoc. 94(446), 590–599.

[18] Robert, C. P., G. Celeux, and J. Diebolt (1993). Bayesian estimation of hidden Markov chains: a stochastic implementation. Statist. Probab. Lett. 16, 77–83.

[19] Roberts, G. O. and J. S. Rosenthal (2004). General state space Markov chains and MCMC algorithms. Probab. Surv. 1, 20–71.

[20] Schön, T., F. Gustafsson, and R. Karlsson (2011). Particle filtering in practice. In D. Crisan and B. L. Rozovskii (Eds.),