Sampling latent states for high-dimensional non-linear state space models with the embedded HMM method
Alexander Y. Shestopaloff
Department of Statistical Sciences
University of Toronto

Radford M. Neal
Department of Statistical Sciences & Department of Computer Science
University of Toronto

February 2016, revised 25 June 2016
Abstract
We propose a new scheme for selecting pool states for the embedded Hidden Markov Model (HMM) Markov Chain Monte Carlo (MCMC) method. This new scheme allows the embedded HMM method to be used for efficient sampling in state space models where the state can be high-dimensional. Previously, embedded HMM methods were only applied to models with a one-dimensional state space. We demonstrate that using our proposed pool state selection scheme, an embedded HMM sampler can have similar performance to a well-tuned sampler that uses a combination of Particle Gibbs with Backward Sampling (PGBS) and Metropolis updates. The scaling to higher dimensions is made possible by selecting pool states locally near the current value of the state sequence. The proposed pool state selection scheme also allows each iteration of the embedded HMM sampler to take time linear in the number of the pool states, as opposed to quadratic as in the original embedded HMM sampler. We also consider a model with a multimodal posterior, and show how a technique we term "mirroring" can be used to efficiently move between the modes.
Consider a non-linear, non-Gaussian state space model for an observed sequence y = (y_1, ..., y_n). This model, with parameters θ, assumes that the Y_i are drawn from an observation density p(y_i | x_i, θ), where X_i is an unobserved Markov process with initial density p(x_1 | θ) and transition density p(x_i | x_{i-1}, θ). Here, the x_i might be either continuous or discrete. We may be interested in inferring both the realized values of the Markov process x = (x_1, ..., x_n) and the model parameters θ. In a Bayesian approach to this problem, this can be done by drawing a sample of values for x and θ using a Markov chain that alternately samples from the conditional posterior distributions p(x | θ, y) and p(θ | x, y). In this paper, we will only consider inference for x by sampling from p(x | θ, y), taking the parameters θ to be known. As a result, we will omit θ in model densities for the rest of the paper. Except for linear Gaussian models and models with a finite state space, this sampling problem has no exact solution and hence approximate methods such as MCMC must be used.

One method for sampling state sequences in non-linear, non-Gaussian state space models is the embedded HMM method (Neal, 2003; Neal, Beal and Roweis, 2004). An embedded HMM update proceeds as follows. First, at each time i, a set of L "pool states" in the latent space is constructed. In this set, L - 1 pool states are newly generated, with one pool state set to the current x_i. This step can be thought of as temporarily reducing the state space model to an HMM with a finite set of L states, hence the name of the method. Then, using efficient forward-backward computations, which take time proportional to L^2 n, a new sequence x' is selected from the "ensemble" of L^n sequences passing through the set of pool states, with the probability of choosing each sequence proportional to its posterior density divided by the probability of the sequence under the pool state density.
At the next iteration of the sampler, a new set of pool states is constructed, so that the chain can sample all possible x_i, even when the set of possible values is infinite.

Another method is the Particle Gibbs with Backward Sampling (PGBS) method. The Particle Gibbs (PG) method was first introduced in Andrieu, Doucet and Holenstein (2010); Whiteley suggested the backward sampling modification in the discussion following this paper. Lindsten and Schön (2012) implemented backward sampling and showed that it improves the efficiency of PG. Starting with a current sequence x, PGBS first uses conditional Sequential Monte Carlo (SMC) to construct a set of candidate sequences and then uses backward sampling to select a new sequence from the set of candidate ones. Here, conditional SMC works in the same way as ordinary SMC when generating a set of particles, except that one of the particles at time i is always set to the current x_i, similar to what is done in the embedded HMM method, which allows the sampler to remain at x_i if x_i lies in a high-density region. While this method works well for problems with low-dimensional state spaces, the reliance of the SMC procedure on choosing an appropriate importance density can make it challenging to make the method work in high dimensions. An important advantage of Particle Gibbs, however, is that each iteration takes time that is only linear in the number of particles.

Both the PGBS and embedded HMM methods can facilitate sampling of a latent state sequence, x, when there are strong temporal dependencies amongst the x_i. In this case, using a method that samples x_i conditional on fixed values of x_{i-1} and x_{i+1} can be an inefficient way of producing a sample from p(x | y, θ), because the conditional density of x_i given x_{i-1} and x_{i+1} can be highly concentrated relative to the marginal density of x_i. In contrast, with the embedded HMM and PGBS methods it is possible to make changes to blocks of x_i's at once.
This allows larger changes to the state in each iteration of the sampler, making updates more efficient. However, good performance of the embedded HMM and PGBS methods relies on appropriately choosing the set of pool states or particles at each time i.

In this paper, our focus will be on techniques for choosing pool states for the embedded HMM method. When the latent state space is one-dimensional, embedded HMMs work well when choosing pool states in a variety of ways. For example, in Shestopaloff and Neal (2013), we choose pool states at each time i by constructing a "pseudo-posterior" for each latent variable by taking the product of a "pseudo-prior" and the observation density, the latter treated as a "pseudo-likelihood" for the latent variable. In Shestopaloff and Neal (2014), we choose pool states at each time i by sampling from the marginal prior density of the latent process.

Ways of choosing pool states that work well in one dimension begin to exhibit problems when applied to models with higher-dimensional state spaces. This is true even for dimensions as small as three. Since these schemes are global, designed to produce sets of pool states without reference to the current point, as the dimension of the latent space grows, a higher proportion of the sequences in the ensemble ends up having low posterior density. Ensuring that performance doesn't degrade in higher dimensions thus requires a significant increase in the number of pool states. As a result, computation time may grow so large that any advantage that comes from using embedded HMMs is eliminated. One advantage of the embedded HMM method over PGBS is that the embedded HMM construction allows placing pool states locally near the current value of x_i, potentially allowing the method to scale better with the dimensionality of the state space. Switching to such a local scheme fixes the problem to some extent.
However, local pool state schemes come with their own problems, such as making it difficult to handle models with multiple posterior modes that are well-separated: the pool states might end up being placed near only some of the modes.

In this paper, we propose an embedded HMM sampler suitable for models where the state space is high-dimensional. This sampler uses a sequential approximation to the density p(x_i | y_1, ..., y_i) or to the density p(x_i | y_{i+1}, ..., y_n) as the pool state density. We show that by using this pool state density, together with an efficient MCMC scheme for sampling from it, we can reduce the cost per iteration of the embedded HMM sampler to be proportional to nL, as with PGBS. At the same time, we retain the ability to generate pool states locally, allowing better scaling for high-dimensional state spaces. Our proposed scheme can thus be thought of as combining the best features of the PGBS and the embedded HMM methods, while overcoming the deficiencies of both.

We use two sample state space models as examples. Both have Gaussian latent processes and Poisson observations, with one model having a unimodal posterior and the second a multimodal one. For the multimodal example, we introduce a "mirroring" technique that allows efficient movement between the different posterior modes. For these models, we show how our proposed embedded HMM method compares to a simple Metropolis sampler, a PGBS sampler, as well as a sampler that combines PGBS and simple Metropolis updates. Further details on ensemble methods are available in the PhD thesis of Shestopaloff (2016).

The embedded HMM method

We review the embedded HMM method (Neal, 2003; Neal, Beal and Roweis, 2004) here. We take the model parameters, θ, to be fixed, so we do not write them explicitly. Let p(x_1) be the density from which the state at time 1 is drawn, let p(x_i | x_{i-1}) be the transition density between states at times i and i - 1, and let p(y_i | x_i) be the density of the observation y_i given x_i.

Suppose our current sequence is x = (x_1, ..., x_n). The embedded HMM sampler updates x to x' as follows.

First, at each time i = 1, ..., n, we generate a set of L pool states, denoted by P_i = {x_i^[1], ..., x_i^[L]}. The pool states are sampled independently across the different times i. We choose l_i ∈ {1, ..., L} uniformly at random and set x_i^[l_i] to x_i. We sample the remaining L - 1 pool states x_i^[1], ..., x_i^[l_i - 1], x_i^[l_i + 1], ..., x_i^[L] using a Markov chain that leaves a pool density κ_i invariant, as follows. Let R_i(x' | x) be the transitions of this Markov chain, with R̃_i(x | x') the transitions for this Markov chain reversed (i.e. R̃_i(x | x') = R_i(x' | x) κ_i(x) / κ_i(x')), so that

    κ_i(x) R_i(x' | x) = κ_i(x') R̃_i(x | x')    (1)

for all x and x'. Then, starting at j = l_i - 1, use reverse transitions R̃_i(x_i^[j] | x_i^[j+1]) to generate x_i^[l_i - 1], ..., x_i^[1], and starting at j = l_i + 1 use forward transitions R_i(x_i^[j] | x_i^[j-1]) to generate x_i^[l_i + 1], ..., x_i^[L].

At each i = 1, ..., n, we then compute the forward probabilities α_i(x), with x taking values in P_i. At time i = 1, we have

    α_1(x) = p(x) p(y_1 | x) / κ_1(x)    (2)

and at times i = 2, ..., n, we have

    α_i(x) = [p(y_i | x) / κ_i(x)] Σ_{l=1}^{L} p(x | x_{i-1}^[l]) α_{i-1}(x_{i-1}^[l])    (3)

Finally, we sample a new state sequence x' using a stochastic backwards pass. This is done by selecting x'_n amongst the set, P_n, of pool states at time n, with probabilities proportional to α_n(x), and then going backwards, sampling x'_{i-1} from the set P_{i-1}, with probabilities proportional to α_{i-1}(x) p(x'_i | x). Note that only the relative values of the α_i(x) will be required, so the α_i may be computed up to some constant factor.

Alternatively, given a set of pool states, embedded HMM updates can be done by first computing the backward probabilities. We will see later on that the backward probability formulation of the embedded HMM method allows us to introduce a variation of our proposed pool state selection scheme. Setting β_n(x) = 1 for all x ∈ P_n, we compute for i < n

    β_i(x) = [1 / κ_i(x)] Σ_{l=1}^{L} p(y_{i+1} | x_{i+1}^[l]) p(x_{i+1}^[l] | x) β_{i+1}(x_{i+1}^[l])    (4)

A new state sequence is then sampled using a stochastic forward pass, setting x'_1 to one of the x in the pool P_1 with probabilities proportional to β_1(x) p(x) p(y_1 | x), and then choosing subsequent states x'_i from the pools P_i with probabilities proportional to β_i(x) p(x | x'_{i-1}) p(y_i | x).

Computing the α_i or β_i at each time i takes time proportional to L^2, since for each of the L pool states it takes time proportional to L to compute the sums in (3) or (4).
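As an illustration, the forward recursion (2)-(3) and the stochastic backward pass can be sketched as follows. This is a minimal sketch in Python/NumPy, not code from the paper: it assumes a one-dimensional latent space, takes the pool states as already generated, and passes the model and pool densities in as functions (all function and variable names here are hypothetical). Computations are done in log space for numerical stability.

```python
import numpy as np

def log_sum_exp(a):
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def sample_index(logits, rng):
    # Draw an index with probabilities proportional to exp(logits).
    p = np.exp(np.asarray(logits) - log_sum_exp(logits))
    return rng.choice(len(p), p=p / p.sum())

def embedded_hmm_update(pools, log_kappa, log_p1, log_trans, log_obs, rng):
    """One embedded HMM update, given pool states.

    pools[i][l] is pool state l at time i (scalars here, for simplicity);
    log_kappa(i, x), log_p1(x), log_trans(x_prev, x) and log_obs(i, x)
    give the log pool, initial, transition and observation densities.
    A new sequence is drawn from the L^n sequences through the pools,
    with probability proportional to its posterior density divided by
    the product of pool densities."""
    n, L = len(pools), len(pools[0])
    log_alpha = np.empty((n, L))
    # Forward pass, eqs. (2) and (3), up to a constant factor.
    for l, x in enumerate(pools[0]):
        log_alpha[0, l] = log_p1(x) + log_obs(0, x) - log_kappa(0, x)
    for i in range(1, n):
        for l, x in enumerate(pools[i]):
            s = log_sum_exp([log_trans(xp, x) + log_alpha[i - 1, m]
                             for m, xp in enumerate(pools[i - 1])])
            log_alpha[i, l] = log_obs(i, x) - log_kappa(i, x) + s
    # Stochastic backward pass: x'_n ~ alpha_n, then x'_{i-1} given x'_i.
    new_x = [None] * n
    l = sample_index(log_alpha[n - 1], rng)
    new_x[n - 1] = pools[n - 1][l]
    for i in range(n - 1, 0, -1):
        logits = [log_alpha[i - 1, m] + log_trans(xp, new_x[i])
                  for m, xp in enumerate(pools[i - 1])]
        l = sample_index(logits, rng)
        new_x[i - 1] = pools[i - 1][l]
    return new_x
```

Note that the double loop over pool states at each time makes the quadratic cost in L visible directly in the code.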
Hence each iteration of the embedded HMM sampler takes time proportional to L^2 n.

Particle Gibbs with Backward Sampling MCMC
We review the Particle Gibbs with Backward Sampling (PGBS) sampler here. For full details, see the articles by Andrieu, Doucet and Holenstein (2010) and Lindsten and Schön (2012).

Let q_1(x | y_1) be the importance density from which we sample particles at time 1, and let q_i(x | y_i, x_{i-1}) be the importance density for sampling particles at times i > 1. These may depend on the current value of the parameters, θ, which we suppress in this notation. Suppose we start with a current sequence x. We set the first particle x_1^[1] to the current state x_1. We then sample L - 1 particles x_1^[2], ..., x_1^[L] from q_1 and compute and normalize the weights of the particles:

    w_1^[l] = p(x_1^[l]) p(y_1 | x_1^[l]) / q_1(x_1^[l] | y_1)    (5)

    W_1^[l] = w_1^[l] / Σ_{m=1}^{L} w_1^[m]    (6)

for l = 1, ..., L.

For i > 1, we proceed sequentially. We first set x_i^[1] = x_i. We then sample a set of L - 1 ancestor indices at time i - 1, defined by A_{i-1}^[l] ∈ {1, ..., L}, for l = 2, ..., L, with probabilities proportional to W_{i-1}^[l]. The ancestor index for the first state, A_{i-1}^[1], is 1. We then sample each of the L - 1 particles x_i^[l], at time i, for l = 2, ..., L, from q_i(x | y_i, x_{i-1}^[A_{i-1}^[l]]) and compute and normalize the weights at time i:

    w_i^[l] = p(x_i^[l] | x_{i-1}^[A_{i-1}^[l]]) p(y_i | x_i^[l]) / q_i(x_i^[l] | y_i, x_{i-1}^[A_{i-1}^[l]])    (7)

    W_i^[l] = w_i^[l] / Σ_{m=1}^{L} w_i^[m]    (8)

A new sequence taking values in the set of particles at each time is then selected using a backwards sampling pass. This is done by first selecting x'_n from the set of particles at time n with probabilities W_n^[l], and then selecting the rest of the sequence going backward in time to time 1, setting x'_i to x_i^[l] with probability

    w_i^[l] p(x'_{i+1} | x_i^[l]) / Σ_{m=1}^{L} w_i^[m] p(x'_{i+1} | x_i^[m])    (9)

A common choice for q_i is the model's transition density, which is the choice we compare to in this paper.

Note that each iteration of the PGBS sampler takes time proportional to Ln, since it takes time proportional to L to create the set of particles at each time i, and to do one step of backward sampling.

An embedded HMM sampler for high dimensions
We propose two new ways, denoted f and b, of generating pool states for the embedded HMM sampler. Unlike previously-used pool state selection schemes, where pool states are selected independently at each time, our new schemes select pool states sequentially, with pool states at time i selected conditional on pool states at time i - 1, or alternatively at time i + 1.

The first way to generate pool states is to use a forward pool state selection scheme, with a sequential approximation to p(x_i | y_1, ..., y_i) as the pool state density. In particular, at time 1, we set the pool state distribution of our proposed embedded HMM sampler to

    κ_1^f(x) ∝ p(x) p(y_1 | x)    (10)

As a result of equation (2), α_1(x) is constant. At time i > 1, we set the pool state distribution to

    κ_i^f(x | P_{i-1}) ∝ p(y_i | x) Σ_{ℓ=1}^{L} p(x | x_{i-1}^[ℓ])    (11)

which makes α_i(x) constant for i > 1 as well. We then draw a sequence composed of these pool states as in the original embedded HMM method, with the α_i(x)'s all set to 1.

The second way is to instead use a backward pool state selection scheme, with a sequential approximation of p(x_i | y_{i+1}, ..., y_n) as the pool state density. We begin by creating the pool P_n, consisting of the current state x_n and the remaining L - 1 pool states drawn from p_n(x), the marginal density at time n, which is the same as p(x_1) if the latent process is stationary. The backward probabilities β_n(x), for x in P_n, are then set to 1. At time i < n we set the pool state densities to

    κ_i^b(x | P_{i+1}) ∝ Σ_{ℓ=1}^{L} p(y_{i+1} | x_{i+1}^[ℓ]) p(x_{i+1}^[ℓ] | x)    (12)

so that β_i(x) is constant for all i = 1, ..., n (see equation (4)). We then draw a sequence composed of these pool states as in the backward probability implementation of the embedded HMM method, with the β_i(x)'s all set to 1.

If the latent process is Gaussian, and the latent state at time 1 is sampled from the stationary distribution of the latent process, it is possible to update the latent variables by applying the forward scheme to the reversed sequence (y_n, ..., y_1), making use of time reversibility, since X_n is also sampled from the stationary distribution, and the latent process evolves backward in time according to the same transition density as it would going forward. We then use the forward pool state selection scheme along with a stochastic backward pass to sample a sequence (x_n, ..., x_1), starting with x_1 and going to x_n.

It can sometimes be advantageous to alternate between using forward and backward (or, alternatively, forward applied to the reversed sequence) embedded HMM updates, since this can improve sampling of certain x_i. The sequential pool state selection schemes use only part of the observed sequence in generating the pool states.
By alternating update directions, the pool states can depend on different parts of the observed data, potentially allowing us to better cover the region where x_i has high posterior density. For example, at time 1, the pool state density may disperse the pool states too widely, leading to poor sampling for x_1, but sampling x_1 using a backwards scheme can be much better, since we are now using all of the data in the sequence when sampling pool states at time 1.

To sample from κ_i^f or κ_i^b, we can use any Markov transitions R_i that leave this distribution invariant. The validity of the method does not depend on the Markov transitions for sampling from κ_i^f or κ_i^b reaching equilibrium, or even on them being ergodic.

Directly using these pool state densities in an MCMC routine leads to a computational cost per iteration that is proportional to L^2 n, as in the original embedded HMM method, since at times i > 1 we need L - 1 updates to produce L pool states, and the cost of computing an acceptance probability is proportional to L.

However, it is possible to reduce the cost per iteration of the embedded HMM method to be proportional to nL when we use κ_i^f or κ_i^b as the pool state densities. To do this, we start by thinking of the pool state densities at each time i > 1 as marginal densities of a joint distribution over x and an auxiliary variable ℓ = 1, ..., L that indexes a pool state at the previous time. Specifically, κ_i^f can be viewed as a marginal of the density

    λ_i(x, ℓ) ∝ p(y_i | x) p(x | x_{i-1}^[ℓ])    (13)

while κ_i^b is a marginal of the density

    γ_i(x, ℓ) ∝ p(y_{i+1} | x_{i+1}^[ℓ]) p(x_{i+1}^[ℓ] | x)    (14)

Both of these densities are defined given a pool P_{i-1} at time i - 1 or a pool P_{i+1} at time i + 1. This technique is reminiscent of the auxiliary particle filter of Pitt and Shephard (1999).
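To make the auxiliary-variable construction concrete, the following sketch (hypothetical Python, not code from the paper) generates pool states at time i from λ_i of equation (13), alternating a random-walk Metropolis update of x given ℓ with a uniform proposal for ℓ given x. Each update involves only a single predecessor state, so producing L pool states costs time proportional to L rather than L^2. For simplicity, the sketch runs the chain forward only, omitting the reversed transitions used to place the current x_i at a random position within the pool.

```python
import numpy as np

def sample_pool_from_lambda(prev_pool, x_init, log_obs_i, log_trans,
                            n_pool, step, rng):
    """Pool states at time i sampled from lambda_i(x, l), eq. (13),
    with lambda_i(x, l) proportional to p(y_i | x) p(x | x_{i-1}^[l]).
    prev_pool holds the pool states at time i-1; x_init is the current
    state x_i.  Returns n_pool sampled x values; the marginal of
    lambda_i over x is the pool density kappa_i^f."""
    L = len(prev_pool)
    # Stochastic initialization of l, with probabilities proportional
    # to p(x_init | x_{i-1}^[l]).
    logits = np.array([log_trans(xp, x_init) for xp in prev_pool])
    p = np.exp(logits - logits.max())
    l = rng.choice(L, p=p / p.sum())
    x, pool = x_init, []
    for _ in range(n_pool):
        # Metropolis update of x given l (random-walk proposal);
        # only the single predecessor prev_pool[l] enters the ratio.
        x_new = x + step * rng.standard_normal()
        log_r = (log_obs_i(x_new) + log_trans(prev_pool[l], x_new)
                 - log_obs_i(x) - log_trans(prev_pool[l], x))
        if np.log(rng.uniform()) < log_r:
            x = x_new
        # Metropolis update of l given x (uniform proposal on {1,...,L}).
        l_new = rng.integers(L)
        log_r = log_trans(prev_pool[l_new], x) - log_trans(prev_pool[l], x)
        if np.log(rng.uniform()) < log_r:
            l = l_new
        pool.append(x)
    return pool
```

The same structure applies to γ_i of equation (14), with the roles of the neighbouring pool reversed.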
We then use Markov transitions, R_i, to sample a set of values of x and ℓ, with probabilities proportional to λ_i for the forward scheme, or proportional to γ_i for the backward scheme. The chain is started at x set to the current x_i, and the initial value of ℓ is chosen randomly with probabilities proportional to p(x_i | x_{i-1}^[ℓ]) for the forward scheme, or p(y_{i+1} | x_{i+1}^[ℓ]) p(x_{i+1}^[ℓ] | x_i) for the backward scheme. This stochastic initialization of ℓ is needed to make the algorithm valid when we use λ_i or γ_i to generate the pool states.

Sampling values of x and ℓ from λ_i or γ_i can be done by updating each of x and ℓ separately, alternately sampling values of x conditional on ℓ and values of ℓ conditional on x, or by updating x and ℓ jointly, or by a combination of these.

Updating x given ℓ can be done with any appropriate sampler, such as Metropolis, or for a Gaussian latent process we can use autoregressive updates, which we describe below. To update ℓ given x, we can also use Metropolis updates, proposing ℓ' = ℓ + k, with k drawn from some proposal distribution on {-K, ..., -1, 1, ..., K}. Alternatively, we can simply propose ℓ' uniformly at random from {1, ..., L}.

To jointly update x and ℓ, we propose two novel updates, a "shift" update and a "flip" update. Since these are also Metropolis updates, using them together with Metropolis or autoregressive updates for each of x and ℓ separately allows embedded HMM updates to be performed in time proportional to nL.

For sampling pool states in our embedded HMM MCMC schemes, as well as for comparison MCMC schemes, we will make use of Neal's (1998) "autoregressive" Metropolis-Hastings update, which we review here.
This update is designed to draw samples from a distribution of the form p(w) p(y | w), where p(w) is multivariate Gaussian with mean µ and covariance Σ, and p(y | w) is typically a density for some observed data.

This autoregressive update proceeds as follows. Let L be the lower triangular Cholesky decomposition of Σ, so Σ = LL^T, and let n be a vector of i.i.d. normal random variables with zero mean and identity covariance. Let ε ∈ [-1, 1] be a tuning parameter that determines the scale of the proposal. Starting at w, we propose

    w' = µ + √(1 - ε²) (w - µ) + εLn    (15)

Because these autoregressive proposals are reversible with respect to p(w), the proposal density and p(w) cancel in the Metropolis-Hastings acceptance ratio. This update is therefore accepted with probability

    min(1, p(y | w') / p(y | w))    (16)

Note that for this update, the same value of ε is used for scaling along every dimension. It would be of independent interest to develop a version of this update where ε can be different for each dimension of w.

We can simultaneously update ℓ and x at time i > 1 by proposing to change (x, ℓ) to (x', ℓ'), where ℓ' is proposed in any valid way while x' is chosen in a way such that x' and x_{i-1}^[ℓ'] are linked in the same way as x and x_{i-1}^[ℓ]. The shift update makes it easier to generate a set of pool states at time i with different predecessor states at time i - 1, helping to ensure that the pool states are well-dispersed. This update is accepted with the usual Metropolis probability.

For a concrete example we use later, suppose that the latent process is an autoregressive Gaussian process of order 1, with the model being that X_i | x_{i-1} ∼ N(Φx_{i-1}, Σ). In this case, given ℓ', we propose x'_i = x_i + Φ(x_{i-1}^[ℓ'] - x_{i-1}^[ℓ]). This update is accepted with probability

    min(1, p(y_i | x'_i) / p(y_i | x_i))    (17)

as a result of the transition densities in the acceptance ratio cancelling out, since

    x'_i - Φx_{i-1}^[ℓ'] = x_i + Φ(x_{i-1}^[ℓ'] - x_{i-1}^[ℓ]) - Φx_{i-1}^[ℓ']    (18)
                        = x_i - Φx_{i-1}^[ℓ]    (19)

To be useful, shift updates normally need to be combined with other updates for generating pool states. When combining shift updates with other updates, tuning of acceptance rates for both updates needs to be done carefully in order to ensure that the shift updates actually improve sampling performance. In particular, if the pool states at time i - 1 are widely dispersed, proposals for x alone and for x and ℓ jointly may need to be scaled to give a relatively high acceptance rate on updates of x, in order to ensure that the acceptance rate for the shift updates isn't low.

Generating pool states locally can be helpful when applying embedded HMMs to models with high-dimensional state spaces, but it also makes sampling difficult if the posterior is multimodal. Consider the case when the observation probability depends on |x_i| instead of x_i, so that many modes with different signs for some x_i exist. We propose to handle this problem by adding an additional flip update that creates a "mirror" set of pool states, in which -x_i will be in the pool if x_i is.
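The autoregressive update of (15)-(16) and the shift update for the AR(1) example can be sketched as follows (hypothetical Python, for illustration only). In the shift update, shifting x by Φ(x_{i-1}^[ℓ'] - x_{i-1}^[ℓ]) preserves the residual x - Φx_{i-1}^[ℓ], so only the observation densities enter the acceptance ratio, as in (17).

```python
import numpy as np

def autoregressive_update(w, mu, chol, eps, log_lik, rng):
    """Neal's autoregressive Metropolis-Hastings update: propose
    w' = mu + sqrt(1 - eps^2) (w - mu) + eps L n, as in eq. (15).
    The proposal is reversible with respect to N(mu, Sigma), where
    Sigma = L L^T, so only the likelihood ratio appears in the
    acceptance probability, eq. (16)."""
    n = rng.standard_normal(len(w))
    w_new = mu + np.sqrt(1.0 - eps ** 2) * (w - mu) + eps * (chol @ n)
    if np.log(rng.uniform()) < log_lik(w_new) - log_lik(w):
        return w_new
    return w

def shift_update(x, l, prev_pool, Phi, log_obs_i, rng):
    """The "shift" update for an AR(1) latent process
    X_i | x_{i-1} ~ N(Phi x_{i-1}, Sigma): propose l' uniformly and set
    x' = x + Phi (x_{i-1}^[l'] - x_{i-1}^[l]), so that the transition
    densities cancel (eqs. (18)-(19)) and the update is accepted with
    probability min(1, p(y_i | x') / p(y_i | x)), eq. (17)."""
    L = len(prev_pool)
    l_new = rng.integers(L)
    x_new = x + Phi @ (prev_pool[l_new] - prev_pool[l])
    if np.log(rng.uniform()) < log_obs_i(x_new) - log_obs_i(x):
        return x_new, l_new
    return x, l
```

With ε near 0 the autoregressive proposal stays close to the current point, while ε = ±1 proposes independently from N(µ, Σ).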
By having a mirror set of pool states, we are able to flip large segments of the sequence in a single update, allowing efficient exploration of different posterior modes.

To generate a mirror set of pool states, we must correctly use the flip updates when sampling the pool states. Since we want each pool state to have a negative counterpart, we choose the number of pool states L to be even. The chain used to sample pool states then alternates two types of updates, a usual update to generate a pool state and a flip update to generate its negated version. The usual update can be a combination of any updates, such as those we consider above. So that each state will have a flipped version, we start with a flip transition between x^[1] and x^[2], a usual transition between x^[2] and x^[3], and so on up to a flip transition between x^[L-1] and x^[L].

At time 1, we start with the current state x_1 and randomly assign it to some index l in the chain used to generate pool states. Then, starting at x_1, we generate pool states by reversing the Markov chain transitions back to 1 and going forward up to L. Each flip update is then a Metropolis update proposing to generate a pool state -x given that the chain is at some pool state x. Note that if the observation probability depends on x only through |x| and p(x) is symmetric around zero, then this update is always accepted.

At time i > 1, a flip update proposes to update a pool state (x, ℓ) to (-x, ℓ') such that x_{i-1}^[ℓ'] = -x_{i-1}^[ℓ]. Here, since the pool states at each time are generated by alternating flip and usual updates, starting with a flip update to x_i^[1], the proposal to move from ℓ to ℓ' can be viewed as follows. Suppose that instead of labelling our pool states from 1 to L, we instead label them 0 to L - 1. The pool states with indices 0 and 1, then 2 and 3, and so on, will then be flipped pairs, and the proposal to change ℓ to ℓ' can be seen as proposing to flip the lowest-order bit in a binary representation of ℓ. For example, a proposal to move from ℓ = 3 to ℓ' = 2 can be seen as proposing to change ℓ from 11 to 10 (in binary). Such a proposal will always be accepted, assuming a transition density for which p(x_i | x_{i-1}) = p(-x_i | -x_{i-1}) and an observation probability which depends on x_i only via |x_i|.

The forward pool state selection scheme can be used to construct a sampler with properties similar to PGBS. This is done by using independence Metropolis to sample values of x and ℓ from λ_i. At time 1, we propose our pool states from p(x). At times i ≥ 2, we propose ℓ' by selecting it uniformly at random from {1, ..., L}, and we propose x' by sampling from p(x | x_{i-1}^[ℓ']). The proposals at all times i are accepted with probability

    min(1, p(y_i | x'_i) / p(y_i | x_i))    (20)

This sampler has computational cost proportional to Ln per iteration, like PGBS. It is analogous to a PGBS sampler with importance densities

    q_1(x | y_1) = p(x)    (21)

and

    q_i(x | x_{i-1}, y_i) = p(x | x_{i-1}), for i > 1    (22)

the difference being that PGBS attaches an importance weight of p(y_i | x_i) to each particle, instead of using an independence Metropolis accept-reject step.

We modify the original proof of Neal (2003), which assumes that the sets of pool states P_1, ..., P_n are selected independently at each time, to show the validity of our new sequential pool state selection scheme. Another change in the proof is to account for generating the pool states by sampling them from λ_i or γ_i instead of κ_i^f or κ_i^b.

This proof shows that the probability of starting at x and moving to x', with given sets of pool states P_i (consisting of values of x at each time i), pool indices l_i of x_i, and pool indices l'_i of x'_i, is the same as the probability of starting at x' and moving to x with the same sets of pool states P_i, pool indices l'_i of x'_i, and pool indices l_i of x_i. This in turn implies, by summing/integrating over P_i and l_i, that the embedded HMM method with the sequential pool state scheme satisfies detailed balance with respect to p(x | y), and hence leaves p(x | y) invariant.

Suppose we use the sequential forward scheme. The probability of starting at x and moving to x' decomposes into the product of the probability of starting at x, which is p(x | y), the probability of choosing a set of pool state indices l_i, which is 1/L^n, the probability of selecting the initial values of ℓ_i for the stochastic initialization step, the probability of selecting the sets of pool states P_i, P(P_1, ..., P_n), and finally the probability of choosing x'.

The probability of selecting given initial values for the links to previous states ℓ_2, ..., ℓ_n is

    Π_{i=2}^{n} [ p(x_i | x_{i-1}^[ℓ_i]) / Σ_{m=1}^{L} p(x_i | x_{i-1}^[m]) ]    (23)

The probability of choosing a given set of pool states is

    P(P_1, ..., P_n) = P(P_1) Π_{i=2}^{n} P(P_i | P_{i-1})    (24)

At time 1, we use a Markov chain with invariant density κ_1 to select pool states in P_1. Using R̃_1(x | x') = R_1(x' | x) κ_1(x) / κ_1(x') and telescoping the ratio of κ_1 values, we therefore have

    P(P_1) = Π_{j=l_1+1}^{L} R_1(x_1^[j] | x_1^[j-1]) Π_{j=l_1-1}^{1} R̃_1(x_1^[j] | x_1^[j+1])
           = [κ_1(x_1^[1]) / κ_1(x_1^[l_1])] Π_{j=1}^{L-1} R_1(x_1^[j+1] | x_1^[j])    (25)

For times i > 1, we use a Markov chain with invariant density λ_i to sample a set of pool states, given P_{i-1}. The chain is started at x_i^[l_i] = x_i and ℓ_i^[l_i] = ℓ_i.
By the same argument as for P(P_1),

    P(P_i | P_{i-1}) = Π_{j=l_i+1}^{L} R_i(x_i^[j], ℓ_i^[j] | x_i^[j-1], ℓ_i^[j-1]) Π_{j=l_i-1}^{1} R̃_i(x_i^[j], ℓ_i^[j] | x_i^[j+1], ℓ_i^[j+1])
                     = [λ_i(x_i^[1], ℓ_i^[1]) / λ_i(x_i^[l_i], ℓ_i^[l_i])] Π_{j=1}^{L-1} R_i(x_i^[j+1], ℓ_i^[j+1] | x_i^[j], ℓ_i^[j])    (26)

Finally, we choose a new sequence x' amongst the collection of sequences consisting of the pool states with a backward pass. This is done by first choosing a pool state x'_n uniformly at random from P_n. We then select the remaining states x_i^[l'_i] by selecting l'_1, ..., l'_{n-1} with probability

    Π_{i=2}^{n} [ p(x'_i | x_{i-1}^[l'_{i-1}]) / Σ_{m=1}^{L} p(x'_i | x_{i-1}^[m]) ]    (27)

Thus, the probability of starting at x and going to x', with given P_1, ..., P_n, l_1, ..., l_n and l'_1, ..., l'_n, is

    p(x | y) × (1/L^n) × Π_{i=2}^{n} [ p(x_i | x_{i-1}^[ℓ_i]) / Σ_{m=1}^{L} p(x_i | x_{i-1}^[m]) ]
      × [κ_1(x_1^[1]) / κ_1(x_1^[l_1])] Π_{j=1}^{L-1} R_1(x_1^[j+1] | x_1^[j])    (28)
      × Π_{i=2}^{n} [ (λ_i(x_i^[1], ℓ_i^[1]) / λ_i(x_i^[l_i], ℓ_i^[l_i])) Π_{j=1}^{L-1} R_i(x_i^[j+1], ℓ_i^[j+1] | x_i^[j], ℓ_i^[j]) ]
      × (1/L) × Π_{i=2}^{n} [ p(x'_i | x_{i-1}^[l'_{i-1}]) / Σ_{m=1}^{L} p(x'_i | x_{i-1}^[m]) ]

    = κ_1(x_1^[1]) Π_{j=1}^{L-1} R_1(x_1^[j+1] | x_1^[j]) × Π_{i=2}^{n} [ λ_i(x_i^[1], ℓ_i^[1]) Π_{j=1}^{L-1} R_i(x_i^[j+1], ℓ_i^[j+1] | x_i^[j], ℓ_i^[j]) ]
      × (1/L^{n+1}) × p(x | y) / [ κ_1(x_1) Π_{i=2}^{n} λ_i(x_i, ℓ_i) ]
      × Π_{i=2}^{n} [ p(x_i | x_{i-1}^[ℓ_i]) / Σ_{m=1}^{L} p(x_i | x_{i-1}^[m]) ] × Π_{i=2}^{n} [ p(x'_i | x_{i-1}^[l'_{i-1}]) / Σ_{m=1}^{L} p(x'_i | x_{i-1}^[m]) ]

Here, we have x_{i-1}^[l'_{i-1}] = x'_{i-1}. Also

    κ_1(x_1) = p(x_1) p(y_1 | x_1) / Σ_{x∈P_1} p(x) p(y_1 | x)

and

    Π_{i=2}^{n} λ_i(x_i, ℓ_i) = Π_{i=2}^{n} [ p(y_i | x_i) p(x_i | x_{i-1}^[ℓ_i]) / Σ_{x_i∈P_i} Σ_{m=1}^{L} p(y_i | x_i) p(x_i | x_{i-1}^[m]) ]    (29)

and

    p(x | y) = p(x_1) Π_{i=2}^{n} p(x_i | x_{i-1}) Π_{i=1}^{n} p(y_i | x_i) / p(y)    (30)

Therefore (28) can be simplified to

    (1/p(y)) × κ_1(x_1^[1]) Π_{j=1}^{L-1} R_1(x_1^[j+1] | x_1^[j]) × Π_{i=2}^{n} [ λ_i(x_i^[1], ℓ_i^[1]) Π_{j=1}^{L-1} R_i(x_i^[j+1], ℓ_i^[j+1] | x_i^[j], ℓ_i^[j]) ]
      × (1/L^{n+1}) × [ Π_{i=2}^{n} p(x_i | x_{i-1}) / Π_{i=2}^{n} Σ_{m=1}^{L} p(x_i | x_{i-1}^[m]) ] × [ Π_{i=2}^{n} p(x'_i | x'_{i-1}) / Π_{i=2}^{n} Σ_{m=1}^{L} p(x'_i | x_{i-1}^[m]) ]
      × [ Σ_{x∈P_1} p(x) p(y_1 | x) ] × Π_{i=2}^{n} [ Σ_{x_i∈P_i} Σ_{m=1}^{L} p(y_i | x_i) p(x_i | x_{i-1}^[m]) ]    (31)

The last factors in this product depend only on the selected set of pool states. By exchanging x and x', we see that the probability of starting at x' and then going to x, with given sets of pool states P_i, pool indices l'_i of x'_i, and pool indices l_i of x_i, is the same.

Experiments
To demonstrate the performance of our new pool state scheme, we use two different state space models. The latent process for both models is a vector autoregressive process, with

X_1 \sim N(0, \Sigma_{\mathrm{init}})    (32)

X_i \mid x_{i-1} \sim N(\Phi x_{i-1}, \Sigma), \quad i = 2, \ldots, n    (33)

where X_i = (X_{i,1}, \ldots, X_{i,P})' and

\Phi = \begin{pmatrix} \phi_1 & & \\ & \ddots & \\ & & \phi_P \end{pmatrix}    (34)

\Sigma = \begin{pmatrix} 1 & \cdots & \rho \\ \vdots & \ddots & \vdots \\ \rho & \cdots & 1 \end{pmatrix}    (35)

\Sigma_{\mathrm{init}} = \begin{pmatrix}
  \frac{1}{1-\phi_1^2} & \cdots & \frac{\rho}{\sqrt{1-\phi_1^2}\sqrt{1-\phi_P^2}} \\
  \vdots & \ddots & \vdots \\
  \frac{\rho}{\sqrt{1-\phi_P^2}\sqrt{1-\phi_1^2}} & \cdots & \frac{1}{1-\phi_P^2}
\end{pmatrix}    (36)

Note that \Sigma_{\mathrm{init}} is the covariance of the stationary distribution for this process.

For model 1, the observations are given by

Y_{i,j} \mid x_{i,j} \sim \mathrm{Poisson}(\exp(c_j + \sigma_j x_{i,j})), \quad i = 1, \ldots, n, \; j = 1, \ldots, P    (37)

For model 2, the observations are given by

Y_{i,j} \mid x_{i,j} \sim \mathrm{Poisson}(\sigma_j |x_{i,j}|), \quad i = 1, \ldots, n, \; j = 1, \ldots, P    (38)

For model 1, we use a 10-dimensional latent state and a sequence length of n = 250, setting ρ = 0.7, with c_j, φ_j and σ_j set to fixed values common to all j = 1, …, P, where P = 10. For model 2, we increase the dimensionality of the latent space to 15 and the sequence length to 500, again fixing ρ and setting φ_j and σ_j to values common to all j = 1, …, P, with P = 15.

We generated one random sequence from each model to test our samplers on. These observations from model 1 and model 2 are shown in Figure 1. Note that we are testing only sampling of the latent variables, with the parameters set to their true values. The code for all experiments in this paper is available at http://arxiv.org/abs/1602.06030.
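As a concrete illustration of the latent process (32)-(36) and the model 1 observation density (37), the following sketch simulates data from a scaled-down version of model 1 with P = 2, using only the Python standard library. All numeric parameter values are illustrative assumptions, not the settings used in the experiments.

```python
import math
import random

random.seed(1)

# Simulate data from a scaled-down version of model 1 (P = 2).
# All numeric parameter values are illustrative assumptions,
# not the values used in the paper's experiments.
n, P = 50, 2
rho = 0.7                  # off-diagonal entry of Sigma
phi = [0.9, 0.9]           # diagonal of Phi (assumed)
c = [-0.4, -0.4]           # observation offsets c_j (assumed)
sigma = [0.6, 0.6]         # observation scalings sigma_j (assumed)

def corr_pair(rho):
    """Draw a pair with unit variances and correlation rho (2x2 Cholesky)."""
    u1, u2 = random.gauss(0, 1), random.gauss(0, 1)
    return (u1, rho * u1 + math.sqrt(1 - rho**2) * u2)

def poisson(lam):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    thresh, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= thresh:
            return k
        k += 1

# X_1 ~ N(0, Sigma_init): scaling a correlated pair by 1/sqrt(1 - phi_j^2)
# gives exactly the stationary covariance (36).
z = corr_pair(rho)
x = [[z[j] / math.sqrt(1 - phi[j]**2) for j in range(P)]]
for i in range(1, n):
    eps = corr_pair(rho)                      # noise with covariance Sigma
    x.append([phi[j] * x[i - 1][j] + eps[j] for j in range(P)])

# Y_{i,j} ~ Poisson(exp(c_j + sigma_j * x_{i,j})), as in (37)
y = [[poisson(math.exp(c[j] + sigma[j] * x[i][j])) for j in range(P)]
     for i in range(n)]
```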
(a) Model 1 (b) Model 2
Figure 1: Observations from Model 1 and Model 2 along dimension j = 1.

A simple scheme for sampling the latent state sequence is to use Metropolis-Hastings updates that sample each x_i in sequence, conditional on x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) and the data, starting at time 1 and going to time n. We sample all dimensions of x_i at once using autoregressive updates (see section 5.4.3).

The conditional densities of the X_i are

p(x_1 \mid x_{-1}, y) \propto p(x_1 \mid x_2) \, p(y_1 \mid x_1) \propto p(x_1) \, p(x_2 \mid x_1) \, p(y_1 \mid x_1)

p(x_i \mid x_{-i}, y) \propto p(x_i \mid x_{i-1}, x_{i+1}) \, p(y_i \mid x_i) \propto p(x_i \mid x_{i-1}) \, p(x_{i+1} \mid x_i) \, p(y_i \mid x_i), \quad 2 \le i \le n-1

p(x_n \mid x_{-n}, y) \propto p(x_n \mid x_{n-1}) \, p(y_n \mid x_n)

The densities p(x_1 \mid x_2), p(x_i \mid x_{i-1}, x_{i+1}), and p(x_n \mid x_{n-1}) are all Gaussian. The means and covariances for these densities can be derived by viewing p(x_1) or p(x_i \mid x_{i-1}) as a Gaussian prior for x_i and p(x_{i+1} \mid x_i) as a Gaussian likelihood for x_i. In particular, we have

X_1 \mid x_2 \sim N(\mu_1, \Sigma_1)

X_i \mid x_{i-1}, x_{i+1} \sim N(\mu_i, \Sigma_i), \quad 2 \le i \le n-1

X_n \mid x_{n-1} \sim N(\mu_n, \Sigma_n)

where

\mu_1 = [(\Phi^{-1}\Sigma\Phi^{-1})^{-1} + \Sigma_{\mathrm{init}}^{-1}]^{-1} [(\Phi^{-1}\Sigma\Phi^{-1})^{-1} \Phi^{-1} x_2]
      = [(\Phi^{-1}\Sigma\Phi^{-1})^{-1} + \Sigma_{\mathrm{init}}^{-1}]^{-1} [\Sigma^{-1}(\Phi x_2)]
      = [\Phi^2 + \Sigma_{\mathrm{init}}^{-1}\Sigma]^{-1} \Phi x_2

\Sigma_1 = [(\Phi^{-1}\Sigma\Phi^{-1})^{-1} + \Sigma_{\mathrm{init}}^{-1}]^{-1} = [\Phi\Sigma^{-1}\Phi + \Sigma_{\mathrm{init}}^{-1}]^{-1}

\mu_i = [(\Phi^{-1}\Sigma\Phi^{-1})^{-1} + \Sigma^{-1}]^{-1} [\Sigma^{-1}\Phi x_{i-1} + (\Phi^{-1}\Sigma\Phi^{-1})^{-1} \Phi^{-1} x_{i+1}]
      = [(\Phi^{-1}\Sigma\Phi^{-1})^{-1} + \Sigma^{-1}]^{-1} [\Sigma^{-1}(\Phi(x_{i-1} + x_{i+1}))]
      = [\Phi^2 + I]^{-1} \Phi (x_{i-1} + x_{i+1})

\Sigma_i = [(\Phi^{-1}\Sigma\Phi^{-1})^{-1} + \Sigma^{-1}]^{-1} = [\Phi\Sigma^{-1}\Phi + \Sigma^{-1}]^{-1}

\mu_n = \Phi x_{n-1}

\Sigma_n = \Sigma

(The simplifications above use the fact that \Phi commutes with \Sigma and \Sigma_{\mathrm{init}}, since the \phi_j are set to a common value in our experiments.) To speed up the Metropolis updates, we precompute and store the matrices [\Phi^2 + \Sigma_{\mathrm{init}}^{-1}\Sigma]^{-1}\Phi and [\Phi^2 + I]^{-1}\Phi, as well as the Cholesky decompositions of the posterior covariances.

In both of our test models, the posterior standard deviation of the latent variables x_{i,j} varies depending on the value of the observed y_{i,j}. To address this, we alternately use a larger or a smaller proposal scaling, ε, in the autoregressive update when performing an iteration of the Metropolis sampler.
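A minimal sketch of such single-site updates for a one-dimensional (P = 1) analogue of model 1, where the matrix formulas above reduce to scalars. The exact form of the autoregressive proposal is an assumption, noted in the comments; because the proposal is reversible with respect to the Gaussian conditional prior, the acceptance ratio involves only the Poisson likelihood.

```python
import math
import random

random.seed(2)

# One sweep of single-site Metropolis updates for a 1-dimensional analogue
# of model 1.  The "autoregressive" proposal is drawn around the Gaussian
# conditional prior p(x_i | x_{i-1}, x_{i+1}).  This is a sketch of one
# plausible form of such an update; parameter values are illustrative.
phi, c, sig = 0.9, -0.4, 0.6

def loglik(xv, yv):
    lam = math.exp(c + sig * xv)
    return yv * math.log(lam) - lam

def cond_prior(i, x):
    """Mean and sd of the Gaussian conditional prior of x[i] given its
    neighbours (the P = 1 case of the mu_i, Sigma_i formulas, Sigma = 1)."""
    if i == 0:
        # precision phi^2 + (1 - phi^2) = 1, mean phi * x[1]
        return phi * x[1], 1.0
    if i == len(x) - 1:
        return phi * x[-2], 1.0
    return phi * (x[i - 1] + x[i + 1]) / (phi**2 + 1), math.sqrt(1 / (phi**2 + 1))

def sweep(x, y, eps):
    """Update each x[i] in turn; returns the number of accepted proposals."""
    accepted = 0
    for i in range(len(x)):
        m, s = cond_prior(i, x)
        prop = m + math.sqrt(1 - eps**2) * (x[i] - m) + eps * s * random.gauss(0, 1)
        # Proposal is reversible w.r.t. the conditional prior, so only the
        # likelihood ratio appears in the acceptance probability.
        if math.log(random.random()) < loglik(prop, y[i]) - loglik(x[i], y[i]):
            x[i], accepted = prop, accepted + 1
    return accepted

y = [0, 1, 2, 1, 0, 0, 3, 1, 2, 0]       # tiny synthetic observations
x = [0.0] * len(y)
for _ in range(100):
    sweep(x, y, 0.8)                     # alternating scalings could be used here
```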
We implement the PGBS method as described in Section 5.3, using the initial density p(x_1) and the transition densities p(x_i | x_{i-1}) as importance densities to generate particles. We combine PGBS updates with single-state Metropolis updates from Section 5.6.2. This way, we combine the strengths of the two samplers in targeting different parts of the posterior distribution. In particular, we expect the Metropolis updates to do better for the x_i with highly informative y_i, and the PGBS updates to do better for the x_i where y_i is not as informative.

For model 1, we compared the embedded HMM sampler to the simple single-state Metropolis sampler, to the PGBS sampler, and to the combination of PGBS with Metropolis. For model 2, we compared the embedded HMM sampler to the PGBS with Metropolis sampler. For both models and all samplers, we ran the sampler five times using five different random number generator seeds. We implemented the samplers in MATLAB on a Linux system with a 2.60 GHz Intel i7-3720QM CPU.

Model 1
For the single-state Metropolis sampler, we initialized all x_{i,j} to 0. Every iteration alternately used a smaller and a larger scaling factor, ε, the larger being 0.8, which resulted in an average acceptance rate of between 30% and 90% for the different x_i over the sampler run. We ran the sampler for 1000000 iterations, and prior to analysis the resulting sample was thinned by a factor of 10, to 100000. The thinning was done due to the difficulty of working with all samples at once, and after thinning the samples still possessed autocorrelation times significantly greater than 1. Each of the 100000 samples took about 0.17 seconds to draw.

For the PGBS sampler and the sampler combining PGBS and Metropolis updates, we also initialized all x_{i,j} to 0. We used 250 particles for the PGBS updates. For the Metropolis updates, we alternated between two scaling factors, the larger again 0.8, which also gave acceptance rates between 30% and 90%. For the standalone PGBS sampler, we performed a total of 70000 iterations. Each iteration produced two samples, for a total of 140000 samples, and consisted of a PGBS update using the forward sequence and a PGBS update using the reversed sequence. Each sample took about 0.12 seconds to draw. For the PGBS with Metropolis sampler, we performed a total of 30000 iterations of the sampler. Each iteration was used to produce four samples, for a total of 120000 samples, and consisted of a PGBS update using the forward sequence, ten Metropolis updates (of which only the value after the tenth update was retained), a PGBS update using the reversed sequence, and another ten Metropolis updates, again keeping only the value after the tenth. The average time to draw each of the 120000 samples was about 0.14 seconds.
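For concreteness, a stripped-down version of such a PGBS update, for a one-dimensional AR(1) latent chain with Poisson observations and particles proposed from the initial and transition densities, might look as follows. Parameter values and the tiny data set are illustrative, and this sketch uses far fewer particles than the experiments.

```python
import math
import random

random.seed(3)

# A minimal sketch of one Particle Gibbs with Backward Sampling (PGBS)
# update for a 1-dimensional AR(1) latent chain with Poisson observations.
# Parameter values and data are illustrative assumptions.
phi, c, sig = 0.9, -0.4, 0.6
init_sd = 1 / math.sqrt(1 - phi**2)

def log_g(xv, yv):                # observation log-density, up to a constant
    lam = math.exp(c + sig * xv)
    return yv * math.log(lam) - lam

def log_f(xn, xv):                # transition log-density, up to a constant
    return -0.5 * (xn - phi * xv)**2

def pick(weights):
    """Draw one index with probability proportional to the given weights."""
    r, s = random.random() * sum(weights), 0.0
    for k, w in enumerate(weights):
        s += w
        if r <= s:
            return k
    return len(weights) - 1

def pgbs_update(ref, y, N):
    n = len(y)
    xs = [[0.0] * N for _ in range(n)]       # particle values
    ws = [None] * n                          # normalized weights
    for i in range(n):
        for k in range(N):
            if k == N - 1:
                xs[i][k] = ref[i]            # reference trajectory retained
            elif i == 0:
                xs[i][k] = random.gauss(0, init_sd)
            else:
                j = pick(ws[i - 1])          # resample an ancestor
                xs[i][k] = random.gauss(phi * xs[i - 1][j], 1.0)
        lw = [log_g(xs[i][k], y[i]) for k in range(N)]
        mx = max(lw)
        w = [math.exp(v - mx) for v in lw]
        tot = sum(w)
        ws[i] = [v / tot for v in w]
    out = [0.0] * n                          # backward sampling pass
    out[n - 1] = xs[n - 1][pick(ws[n - 1])]
    for i in range(n - 2, -1, -1):
        bw = [ws[i][k] * math.exp(log_f(out[i + 1], xs[i][k])) for k in range(N)]
        out[i] = xs[i][pick(bw)]
    return out

y = [0, 1, 2, 1, 0, 0, 3, 1, 2, 0]
traj = [0.0] * len(y)
for _ in range(10):
    traj = pgbs_update(traj, y, 25)
```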
Model 2

For model 2, we were unable to make the single-state Metropolis sampler converge to anything resembling the actual posterior in a reasonable amount of time. In particular, we found that for x_{i,j} sufficiently far from 0, the Metropolis sampler tended to be stuck in a single mode, never visiting values with the opposite sign.

For the PGBS with Metropolis sampler, we set the initial values of x_{i,j} to 1. We set the number of particles for PGBS to 80000, which was nearly the maximum possible for the memory capacity of the computer we used. For the Metropolis sampler, we again alternated between a smaller and a larger scaling factor.

For both model 1 and model 2, we implemented the proposed embedded HMM method using the forward pool state selection scheme, alternating between updates that use the original and the reversed sequence. As for the baseline samplers, we ran the embedded HMM samplers five times for both models, using five different random number generator seeds.

We generate pool states at time 1 using autoregressive updates to sample from κ_f. At times i ≥ 2, we sample each pool state from λ_i(x, ℓ) by combining an autoregressive and a shift update. The autoregressive update proposes to change only x, keeping the current ℓ fixed. The shift update samples both x and ℓ, with a new ℓ proposed from a Uniform{1, …, L} distribution. For model 2, we also add a flip update to generate a negated version of each pool state.

Note that the chain used to produce the pool states now uses a sequence of updates. Therefore, if our forward transition first does an autoregressive update and then a shift update, the reverse transition must first do a shift update and then an autoregressive update.

As for the single-state Metropolis updates, it is beneficial to use a different proposal scaling, ε, when generating each pool state at each time i. This allows generation of sets of pool states which are more concentrated when y_i is informative and more dispersed when y_i holds little information.

For model 1, we initialized all x_{i,j} to 0. We used 50 pool states for the embedded HMM updates. For each Metropolis update to sample a pool state, we used a different scaling ε, chosen at random from a uniform distribution whose upper limit was 0.4. The acceptance rates ranged between 55% and 95% for the Metropolis updates and between 20% and 70% for the shift updates. We performed a total of 9000 iterations of the sampler, with each iteration consisting of an embedded HMM update using the forward sequence and an embedded HMM update using the reversed sequence, for a total of 18000 samples. Each sample took about 0.81 seconds to draw.
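The pool-state chain described above might be sketched as follows for a one-dimensional model: a Markov chain on pairs (x, ℓ) that leaves λ_i(x, ℓ), proportional to p(y_i | x) p(x | x_prev[ℓ]), invariant. The specific forms of the autoregressive and shift updates below are illustrative assumptions; in particular, the shift update moves x by the difference of conditional means so that the transition density cancels in the acceptance ratio.

```python
import math
import random

random.seed(4)

# Sketch of pool-state generation at a single time i for a 1-dimensional
# model.  The chain alternates an autoregressive update of x (l held fixed)
# with a shift update proposing l uniformly.  The exact form of both updates
# is an illustrative assumption, not taken verbatim from the paper.
phi, c, sig = 0.9, -0.4, 0.6

def log_g(xv, yv):
    lam = math.exp(c + sig * xv)
    return yv * math.log(lam) - lam

def ar_update(x, l, yv, x_prev, eps):
    m = phi * x_prev[l]                        # conditional mean given x_prev[l]
    prop = m + math.sqrt(1 - eps**2) * (x - m) + eps * random.gauss(0, 1)
    if math.log(random.random()) < log_g(prop, yv) - log_g(x, yv):
        return prop, l
    return x, l

def shift_update(x, l, yv, x_prev):
    l2 = random.randrange(len(x_prev))         # Uniform{1, ..., L}
    prop = x + phi * (x_prev[l2] - x_prev[l])  # keeps x - phi * x_prev[l] fixed
    if math.log(random.random()) < log_g(prop, yv) - log_g(x, yv):
        return prop, l2
    return x, l

def gen_pool(x_cur, l_cur, yv, x_prev, L, eps=0.3):
    """Generate L pool states starting from the current (x, l).  For brevity
    the current state is placed first; the method instead places it at a
    random position and fills earlier slots with the reversed chain."""
    pool, x, l = [(x_cur, l_cur)], x_cur, l_cur
    for _ in range(L - 1):
        x, l = ar_update(x, l, yv, x_prev, eps)
        x, l = shift_update(x, l, yv, x_prev)
        pool.append((x, l))
    return pool

x_prev = [random.gauss(0, 1) for _ in range(10)]   # pool states at time i - 1
pool = gen_pool(0.5, 0, 2, x_prev, L=10)
```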
For model 2, we initialized the x_{i,j} to 1. We used a total of 80 pool states for the embedded HMM sampler (i.e. 40 positive-negative pairs due to flip updates). Each Metropolis update used to sample a pool state used a scaling ε randomly drawn from a uniform distribution whose upper limit was 0.2. The acceptance rates ranged between 75% and 90% for the Metropolis updates and between 20% and 40% for the shift updates. We performed a total of 9000 iterations of the sampler, producing two samples per iteration with an embedded HMM update using the forward sequence and an embedded HMM update using the reversed sequence. Each of the 18000 samples took slightly over 1 second to draw.

Comparisons

As a way of comparing the performance of the different methods, we use an estimate of autocorrelation time for each of the latent variables x_{i,j}. Autocorrelation time is a measure of how many draws need to be made using the sampling chain to produce the equivalent of one independent sample. The autocorrelation time is defined as \tau = 1 + 2\sum_{k=1}^{\infty} \rho_k, where \rho_k is the autocorrelation at lag k. It is commonly estimated as

\hat{\tau} = 1 + 2\sum_{k=1}^{K} \hat{\rho}_k    (39)

where \hat{\rho}_k are estimates of the lag-k autocorrelations and the cutoff point K is chosen so that \hat{\rho}_k is negligibly different from 0 for k > K. Here

\hat{\rho}_k = \hat{\gamma}_k / \hat{\gamma}_0    (40)

where \hat{\gamma}_k is an estimate of the lag-k autocovariance

\hat{\gamma}_k = \frac{1}{n} \sum_{l=1}^{n-k} (x_l - \bar{x})(x_{k+l} - \bar{x})    (41)

When estimating autocorrelation time, we remove the first 10% of the sample as burn-in. Then, to estimate \hat{\gamma}_k, we first estimate autocovariances for each of the five runs, taking \bar{x} to be the overall mean over the five runs. We then average these five autocovariance estimates to produce \hat{\gamma}_k. To speed up autocovariance computations, we use the Fast Fourier Transform. The autocorrelation time estimates are then adjusted for computation time, by multiplying the estimated autocorrelation time by the time it takes to draw a sample, to ensure that the samplers are compared fairly.

The computation time-adjusted autocorrelation time estimates for Model 1, for all the latent variables, plotted over time, are presented in Figure 2. We found that the combination of single-state Metropolis and PGBS works best for the unimodal model.
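Before turning to the other samplers, the estimator in (39)-(41) can be sketched as follows, with direct autocovariance sums in place of the FFT and a simple threshold rule standing in for a hand-chosen cutoff K.

```python
import random

random.seed(5)

# Sketch of the autocorrelation time estimate in (39)-(41), using direct
# O(n*K) autocovariance sums for clarity (the FFT is faster) and a simple
# cutoff rule: stop at the first lag whose estimated autocorrelation falls
# below a small threshold.  The threshold rule is a simplification of
# choosing K so that rho_k is negligible beyond it.
def act_time(chain, burn_frac=0.1, thresh=0.05):
    x = chain[int(burn_frac * len(chain)):]     # discard burn-in
    n = len(x)
    mean = sum(x) / n
    def gamma(k):                               # lag-k autocovariance, as in (41)
        return sum((x[l] - mean) * (x[l + k] - mean) for l in range(n - k)) / n
    g0 = gamma(0)
    tau = 1.0
    for k in range(1, n):
        rho = gamma(k) / g0                     # as in (40)
        if rho < thresh:
            break
        tau += 2 * rho                          # partial sum in (39)
    return tau

# Check on an AR(1) chain, whose true autocorrelation time is (1+a)/(1-a).
a, z, chain = 0.5, 0.0, []
for _ in range(20000):
    z = a * z + random.gauss(0, 1)
    chain.append(z)
tau_hat = act_time(chain)   # roughly (1+0.5)/(1-0.5) = 3, less truncation bias
```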
The other samplers work reasonably well too. We note that the spike in autocorrelation time for the PGBS sampler, and to a lesser extent for the PGBS with Metropolis sampler, occurs at the point where the data are very informative. This in turn makes the use of the diffuse transition distribution the particles are drawn from inefficient, and much of the sampling in that region is due to the Metropolis updates. Here, we also note that the computation time adjustment is sensitive to the particularities of the implementation, in this case done in MATLAB, where performance depends a lot on how well vectorization can be exploited. Implementing the samplers in a different language might change the relative comparisons.

(Technically, when we alternate updates with the forward and reversed sequence, or mix PGBS and single-state Metropolis updates, we cannot use autocorrelation times to measure how well the chain explores the space. While the sampling scheme leaves the correct target distribution invariant, the flipping of the sequence makes the sampling chain for a given variable non-homogeneous. However, suppose that instead of deterministically flipping the sequence at every step, we add an auxiliary indicator variable that determines, given the current state, whether the forward or the reversed sequence is used, and that the probability of flipping this indicator variable is nearly one. With this auxiliary variable, the sampling chain becomes homogeneous, with behaviour nearly identical to that of our proposed scheme. Using autocorrelation time estimates to evaluate the performance of our sampler is therefore valid, for all practical purposes.)

Figure 2: Estimated autocorrelation times for each latent variable for Model 1, adjusted for computation time. (a) Metropolis (0.17 seconds/sample) (b) PGBS (0.12 seconds/sample) (c) PGBS+Metropolis (0.14 seconds/sample) (d) Embedded HMM (0.81 seconds/sample)

We now look at how the samplers perform on the more challenging Model 2. We first did a preliminary check of whether the samplers do indeed explore the different modes of the distribution by looking at variables far apart in the sequence, where we expect to see four modes (with all possible combinations of signs). This is indeed the case for both the PGBS with Metropolis and embedded HMM samplers. Next, we look at how efficiently the latent variables are sampled. Of particular interest are the latent variables with well-separated modes, since sampling performance for such variables is illustrative of how well the samplers explore the different posterior modes.

Consider one such latent variable, which has true value −0.99. Figure 3 shows how the different samplers explore the two modes for this variable, with equal computation times used to produce the samples for the trace plots. We can see that the embedded HMM sampler with flip updates performs significantly better for sampling a variable with well-separated modes. Experiments showed that the performance of the embedded HMM sampler on model 2 without flip updates is much worse.

Figure 3: Comparison of samplers for Model 2, for a latent variable with true value −0.99. (a) Embedded HMM (b) Particle Gibbs with Metropolis

We can also look at the product of this variable with another of the latent variables, which has true value −0.45. The trace plot is given in Figure 4. In this case, we can see that the PGBS with Metropolis sampler performs better. Since the flip updates change the signs of all dimensions of x_i at once, we do not expect them to be as useful for improving sampling of this function of state. The vastly greater number of particles used by PGBS, 80000, versus 80 for the embedded HMM method, works to the advantage of PGBS, and explains the performance difference.

Figure 4: Comparison of samplers for Model 2, for the product of two latent variables. (a) Embedded HMM (b) Particle Gibbs with Metropolis

Looking at these results, we might expect that we can get a good sampler for both quantities by alternating embedded HMM and PGBS with Metropolis updates. This is indeed the case, as can be seen in Figure 5. For producing these plots, we used an embedded HMM sampler with the same settings as in the experiment for Model 2, and a PGBS with Metropolis sampler with 10000 particles and Metropolis updates using the same settings as in the experiment for Model 2.

Figure 5: Combination of embedded HMM and PGBS with Metropolis samplers. (a) Trace plot for the single variable (b) Trace plot for the product of the two variables

This example of Model 2 demonstrates another advantage of the embedded HMM viewpoint, which is that it allows us to design updates for sampling pool states to handle certain properties of the density. This is arguably easier than designing importance densities in high dimensions.
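The flip updates rely on the sign symmetry of the model 2 observation density: Poisson(σ_j |x_{i,j}|) is unchanged under x → −x. A one-dimensional sketch of this mirroring idea follows; pairing every chain draw with its negation is an illustrative reading of the description above of 40 positive-negative pairs among 80 pool states.

```python
import math
import random

random.seed(6)

# Sketch of the "mirroring" idea used for model 2, in one dimension: the
# observation density Poisson(sigma * |x|) is unchanged under x -> -x, so
# each pool state sampled by the chain can be paired with its negation,
# giving the ensemble candidate states in both posterior modes.
sig = 0.8   # illustrative value

def log_g(xv, yv):
    lam = sig * abs(xv)
    return yv * math.log(lam) - lam

def mirrored_pool(chain_states):
    """Pair each chain-sampled pool state with its negation."""
    pool = []
    for xv in chain_states:
        pool.extend([xv, -xv])
    return pool

states = [random.gauss(2.0, 0.5) for _ in range(40)]   # stand-in for chain draws
pool = mirrored_pool(states)

# Both members of each pair have identical observation density:
ok = all(abs(log_g(xv, 3) - log_g(-xv, 3)) < 1e-12 for xv in states)
```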
Discussion

We have demonstrated that, by introducing a new pool state selection scheme, it is possible to use embedded HMMs to efficiently sample state sequences in models with higher-dimensional state spaces. We have also shown how embedded HMMs can improve sampling efficiency in an example model with a multimodal posterior. There are several directions in which this research can be further developed.

The most obvious extension is to treat the model parameters as unknown and add a step to sample parameters given a value of the latent state sequence. In the unknown parameter context, it would also be interesting to see how the proposed sequential pool state selection schemes can be used together with the ensemble MCMC updates of Shestopaloff and Neal (2013). For example, one approach is to have the pool state distribution depend on the average of the current and proposed parameter values in an ensemble Metropolis update, as in Shestopaloff and Neal (2014).

One might also wonder whether it is possible to use the entire current state of x in constructing the pool state density at a given time. It is not obvious how, or whether, this limitation can be overcome. For example, for the forward scheme, using the current value of the state sequence at some time k > i to construct pool states at time i means that the pool states at time k will end up depending on the current value of x_k, which would lead to an invalid sampler.

At each time i < n, the pool state generation procedure does not depend on the data after time i, which may cause some difficulties in scaling this method further. On one hand, this allows for greater dispersion in the pool states than if we were to impose a constraint from the other direction as with the single-state Metropolis method, potentially allowing us to make larger moves. On the other hand, the removal of this constraint also means that the pool states can become too dispersed. In higher dimensions, one way in which this can be controlled is by using a Markov chain that samples pool states close to the current x_i, that is, a Markov chain that is deliberately slowed down in order not to overdisperse the pool states, which could otherwise lead to a collection of sequences with low posterior density.

Acknowledgements
We thank Arnaud Doucet for helpful comments. This research was supported by the Natural Sciences and Engineering Research Council of Canada. A. S. is in part funded by an NSERC Postgraduate Scholarship. R. N. holds a Canada Research Chair in Statistics and Machine Learning.
References
Andrieu, C., Doucet, A. and Holenstein, R. (2010) "Particle Markov chain Monte Carlo methods", Journal of the Royal Statistical Society B, vol. 72, pp. 269-342.

Lindsten, F. and Schon, T. B. (2012) "On the use of backward simulation in the particle Gibbs sampler", in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 3845-3848.

Neal, R. M. (1998) "Regression and classification using Gaussian process priors", in J. M. Bernardo et al (editors), Bayesian Statistics 6, Oxford University Press, pp. 475-501.

Neal, R. M. (2003) "Markov Chain Sampling for Non-linear State Space Models using Embedded Hidden Markov Models", Technical Report No. 0304, Department of Statistics, University of Toronto, http://arxiv.org/abs/math/0305039.

Neal, R. M., Beal, M. J., and Roweis, S. T. (2004) "Inferring state sequences for non-linear systems with embedded hidden Markov models", in S. Thrun et al (editors), Advances in Neural Information Processing Systems 16, MIT Press.

Pitt, M. K. and Shephard, N. (1999) "Filtering via Simulation: Auxiliary Particle Filters",