A Note on Particle Gibbs Method and its Extensions and Variants
Niharika Gauraha
Division of Computational Science and Technology, KTH Royal Institute of Technology, Sweden
Abstract
High-dimensional state trajectories of state-space models pose challenges for Bayesian inference. Particle Gibbs (PG) methods have been widely used to sample from the posterior of a state-space model. In essence, particle Gibbs is a Particle Markov Chain Monte Carlo (PMCMC) algorithm that mimics the Gibbs sampler by drawing model parameters and states from their conditional distributions. This tutorial provides an introductory view of the particle Gibbs (PG) method and its extensions and variants, illustrated through several examples of inference in non-linear state-space models (SSMs). We also implement PG samplers in two programming languages, Python and Rust, and compare the run-time performance of the Python and Rust programs for the various PG methods.
1. Introduction
Email address: [email protected] (Niharika Gauraha)

State-space models (SSMs) have been used extensively to model time series and dynamical systems. SSMs can be broadly divided into two groups: linear Gaussian, and nonlinear and/or non-Gaussian. In this tutorial, we mainly focus on the latter group, nonlinear SSMs, as defined below:

x_t = f(x_{t-1}) + \epsilon_t, \quad x_t \in \mathbb{R}^{n_x} (1a)
y_t = g(x_t) + w_t, \quad y_t \in \mathbb{R}^{n_y} (1b)
x_1 \sim p(x_1), (1c)

where the system noise \epsilon_t \sim N(0, Q) and the measurement noise w_t \sim N(0, R) are both Gaussian. The variables x_t for t = 1, ..., T are latent (unobserved) and y_t for t = 1, ..., T are observed. The functional forms of f and g are assumed to be known. Usually, learning an SSM involves a parameter inference problem as well as a state inference problem. More specifically, we are concerned with probabilistic learning of SSMs: inferring the noise variances Q and R, along with the state trajectory x_t for t = 1, ..., T, conditioned on the T observations y_{1:T} = {y_1, ..., y_T}. Since no closed-form solution exists for extracting this information about the state variables and parameters, we consider Monte Carlo based approximation methods.

The sequential and dynamic nature of SSMs suggests the use of sequential Monte Carlo (SMC) methods; in particular, particle filters are widely used to learn the latent state variables from data when the model parameters are assumed known. When the model parameters are also unknown, Particle Markov chain Monte Carlo (PMCMC) [1] techniques have been established for simultaneous state and parameter estimation. PMCMC methods are a (non-trivial) combination of MCMC and SMC methods, where SMC algorithms are used to design efficient high-dimensional proposal distributions for MCMC algorithms. The two main techniques in the PMCMC framework are the Particle Metropolis-Hastings (PMH) sampler and the particle Gibbs (PG) sampler. We mainly focus on particle Gibbs methods. Particle Gibbs (PG) is a PMCMC algorithm that mimics the Gibbs sampler.
In PG, samples from the joint posterior are generated by alternating between sampling the states and the parameters. Two major drawbacks of PG are path degeneracy and computational complexity: when the number of states and parameters is large, PMCMC algorithms can become computationally inefficient. We discuss various extensions of the PG sampler that address one or both of these problems. The extension of PG known as particle Gibbs with ancestor sampling (PGAS) alleviates the path-degeneracy problem and reduces the computational cost from quadratic to linear in the number of time steps T, under favorable conditions. Interacting particle Markov chain Monte Carlo (iPMCMC) [14] was introduced to mitigate the path-degeneracy problem by using a trade-off between exploration and exploitation, resulting in improved mixing of the Markov chains. The Blocked Particle Gibbs (bPG) sampler [16] addresses the time-complexity problem by dividing the whole sequence of states into small blocks, such that some blocks (odd or even) can be computed in parallel. PG is an exact approximation of the Gibbs sampler and can never do better than the Gibbs sampler it approximates. To improve its performance beyond the underlying Gibbs sampler, collapsed particle Gibbs was proposed in [17]. In collapsed PG, one or more parameters are marginalized out when the parameter prior is conjugate to the complete-data likelihood.

Each method discussed above is implemented in the Python and Rust programming languages. Python is a general-purpose programming language known for its simple syntax and readable code. Rust is a systems programming language whose compile-time correctness guarantees enable fast performance. We compare the run-time performance of the Rust and Python programs for the particle Gibbs methods.

The rest of the paper is organized as follows. We start with introductory background on state-space models and Monte Carlo methods in Section 2.
Then we introduce the particle Gibbs method in Section 3. In Section 4, we discuss extensions and variants of PG methods. In Section 5, we conclude and discuss the future outlook.
2. Background
In this section, we provide brief background on state-space models and Monte Carlo methods, and we fix the notation and assumptions used throughout the paper. We only review the underlying principles of MCMC and SMC methods as they are used for the inference problems associated with SSMs. There is an extremely rich literature on Monte Carlo methods; see, for example, [7] and [9].
State-space models (SSMs) have been widely used in a variety of fields, for example econometrics [12], ecology [13], climatology [2], robotics [5], and epidemiology [15], to mention just a few.

Usually, in an SSM there is an unobserved state of interest x_t that evolves through time, but only noisy or partial observations y_t of the state are available. The state process is assigned an initial density x_1 ~ p(x_1 | θ) and evolves in time with transition density p(x_t | x_{t-1}, θ). Given the latent states x_t, the observations are assumed to be conditionally independent with density p(y_t | x_t, θ). Here, θ is a parameter vector with prior density p(θ). The SSM can be expressed in probabilistic form as follows:

x_1 ~ p(x_1 | θ) (2a)
x_t | x_{t-1}, θ ~ p(x_t | x_{t-1}, θ) (2b)
y_t | x_t, θ ~ p(y_t | x_t, θ) (2c)
θ ~ p(θ). (2d)

We assume in all our experiments that the initial state is fixed, and the other model parameters θ = (Q, R)^T are fixed in some settings only. Using the Markov property and conditional probabilities, the joint distribution p(x_{1:T}, θ, y_{1:T}) can be factorized as follows:

p(x_{1:T}, θ, y_{1:T}) = ( ∏_{t=1}^T p(y_t | x_t, θ) ) ( ∏_{t=2}^T p(x_t | x_{t-1}, θ) ) p(x_1 | θ) p(θ). (3)

Figure 1: Graphical representation of a finite state-space model. State variables x_t are latent and measurements y_t are observed. The probabilistic relationships between variables are shown with directed edges. Dependency on θ is omitted for simplicity.

The posterior distribution of the unknowns in the model can be factorized as follows:

p(x_{1:T}, θ | y_{1:T}) = p(x_{1:T} | θ, y_{1:T}) p(θ | y_{1:T}). (4)

The term p(θ | y_{1:T}), the estimation of the parameter vector θ given the observations y_{1:T}, is referred to as parameter inference.
The term p(x_{1:T} | θ, y_{1:T}) is referred to as the state inference problem, which involves the estimation of the states x_{1:T} given θ and y_{1:T}. These inference problems are analytically intractable for most SSMs. We consider MCMC for parameter inference, SMC for state inference, and PMCMC (the particle Gibbs sampler) for simultaneous estimation of states and parameters.

We simulate data from the model defined in Eq. (1), with the following settings:

f(x_t, t) = 0.5 x_t + 25 x_t / (1 + x_t^2) + 8 cos(1.2 t)
g(x_t) = x_t^2 / 20
θ = {Q = 0.1, R = 1}, x_1 = 0, T = 500. (5)

The simulated data for the first 100 time points are plotted in Figure 2. We use this data throughout the paper for the various experiments.

Figure 2: Simulated data from the non-linear SSM with latent state (orange), observations (blue) and the autocorrelation function (ACF) of the observations (coral).
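To make the setup concrete, data from the model of Eq. (5) can be generated with a short Python routine. This is a sketch with our own function name and seed handling; the coefficients follow the settings in Eq. (5) as reconstructed above.

```python
import numpy as np

def simulate_ssm(T=500, Q=0.1, R=1.0, x1=0.0, seed=0):
    """Simulate the benchmark nonlinear SSM of Eq. (5):
    x_t = 0.5 x_{t-1} + 25 x_{t-1}/(1 + x_{t-1}^2) + 8 cos(1.2 t) + e_t
    y_t = x_t^2 / 20 + w_t, with e_t ~ N(0, Q), w_t ~ N(0, R)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    y = np.zeros(T)
    x[0] = x1
    y[0] = x[0] ** 2 / 20 + rng.normal(0, np.sqrt(R))
    for t in range(1, T):
        f = 0.5 * x[t - 1] + 25 * x[t - 1] / (1 + x[t - 1] ** 2) + 8 * np.cos(1.2 * t)
        x[t] = f + rng.normal(0, np.sqrt(Q))          # state transition with noise
        y[t] = x[t] ** 2 / 20 + rng.normal(0, np.sqrt(R))  # quadratic observation
    return x, y
```

A call such as `simulate_ssm(T=500)` reproduces the kind of dataset used throughout the experiments in this paper.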
We consider a Bayesian approach to parameter estimation, using Markov chain Monte Carlo (MCMC) methods, which are based on simulating a Markov chain with the target, p(θ | y_{1:T}), as its stationary distribution. Efficient and broadly used MCMC methods are the Metropolis-Hastings and Gibbs samplers and their variants. Here, we consider the Gibbs sampler as proposed in [8]. The Gibbs sampler updates a single parameter at a time, sampling from the conditional distribution of each parameter given the current values of all the other parameters, and repeatedly applies this updating process. For details on how and why the Gibbs sampler works, we recommend the tutorial [3]. Algorithm 1
Gibbs Sampler
Initialize: set θ[1] arbitrarily
for m = 2 to the number of iterations M do
  Draw θ_1[m] ~ p(θ_1 | y_{1:T}, θ_2[m−1], ..., θ_k[m−1])
  ...
  Draw θ_k[m] ~ p(θ_k | y_{1:T}, θ_1[m], ..., θ_{k−1}[m])

In an SSM, sampling from p(θ_i | y_{1:T}, θ_1, ..., θ_k) involves the likelihood p(y_{1:T} | θ). Since no closed-form expression is available for this likelihood, one can use an estimate of it. In this tutorial, we are mainly interested in the state inference problem and the simultaneous inference of states and parameters, which are discussed in the sections below.

When θ ∈ Θ is known, sequential Monte Carlo (SMC) methods are used for inference about the states. In particular, we consider SMC methods to approximate the sequence of posterior densities p(x_t | y_{1:t}) by a set of N random weighted samples called particles:

p̂(x_t | y_{1:t}) = ∑_{i=1}^N w_t^i δ_{x_t^i}(x_t), (6)

where w_t^i is the importance weight associated with particle x_t^i.

There are broadly two types of state inference problems in SSMs: filtering and smoothing. We mainly focus on inference problems related to marginal filtering, in which the observations y_{1:t} up to the current time step t are used to infer the current value of the state x_t. The Bayesian filtering recursions are applied iteratively to solve the filtering problem at each time t via the following two steps:

p_θ(x_t | y_{1:t−1}) = ∫ p_θ(x_t | x_{t−1}) p_θ(x_{t−1} | y_{1:t−1}) dx_{t−1} (7)
p_θ(x_t | y_{1:t}) = p_θ(y_t | x_t) p_θ(x_t | y_{1:t−1}) / p_θ(y_t | y_{1:t−1}). (8)

We consider the simplest particle filter, called the bootstrap particle filter (BPF), or standard SMC. At a high level, SMC works as follows. At time 1, N particles x_1^i, for i = 1, ..., N, are generated from the prior p(x_1 | θ) and the corresponding importance weights are computed as w̃_1^i = p(y_1 | θ, x_1^i). To generate N particles approximately distributed according to the posterior p(x_1 | θ, y_1), we sample N times from the importance sampling (IS) approximation p̂(x_1 | y_1); this is known as the resampling step.
At time 2, the algorithm aims to produce samples approximately distributed according to p(x_{1:2} | θ, y_{1:2}) using the samples obtained at time 1. This process is then repeated up to time T. The standard particle filter is summarized in Algorithm 2, where Cat denotes the categorical distribution. We refer to [6] for a gentle introduction to SMC techniques.
Algorithm 2
SMC
Initialize
  Draw x_1^i ~ p(x_1 | θ) for i = 1 ... N
  Compute w̃_1^i = p(y_1 | θ, x_1^i) and normalize w_1^i = w̃_1^i / ∑_j w̃_1^j, for i = 1 ... N
for t = 2 to the number of states T do
  Sample a_t^i ~ Cat(w_{t−1}^1, ..., w_{t−1}^N) for i = 1 ... N and set x̄_{t−1}^i = x_{t−1}^{a_t^i}
  Draw x_t^i ~ p(x_t | x̄_{t−1}^i, θ) for i = 1 ... N
  Set x_{1:t}^i = {x̄_{1:t−1}^i, x_t^i}
  Compute w̃_t^i = p(y_t | θ, x_t^i) and normalize w_t^i = w̃_t^i / ∑_j w̃_t^j, for i = 1 ... N

Error in the latent state estimation using SMC
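The bootstrap particle filter of Algorithm 2, specialized to the model of Eq. (5), can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation; for simplicity the initial particles are drawn from N(0, Q) rather than a fixed x_1, and all names are ours.

```python
import numpy as np

def bootstrap_pf(y, N=500, Q=0.1, R=1.0, seed=0):
    """Bootstrap particle filter (Algorithm 2) for the model of Eq. (5).
    Returns the filtered state means E[x_t | y_{1:t}]."""
    rng = np.random.default_rng(seed)
    T = len(y)
    xhat = np.zeros(T)
    # Time 1: draw particles and weight by the observation likelihood.
    x = rng.normal(0.0, np.sqrt(Q), size=N)
    logw = -0.5 * (y[0] - x ** 2 / 20) ** 2 / R
    w = np.exp(logw - logw.max()); w /= w.sum()
    xhat[0] = np.sum(w * x)
    for t in range(1, T):
        a = rng.choice(N, size=N, p=w)               # resample ancestors, a ~ Cat(w)
        f = 0.5 * x[a] + 25 * x[a] / (1 + x[a] ** 2) + 8 * np.cos(1.2 * t)
        x = f + rng.normal(0, np.sqrt(Q), size=N)    # propagate through the dynamics
        logw = -0.5 * (y[t] - x ** 2 / 20) ** 2 / R  # weight by p(y_t | x_t) up to consts
        w = np.exp(logw - logw.max()); w /= w.sum()
        xhat[t] = np.sum(w * x)
    return xhat
```

The log-weight trick (subtracting the maximum before exponentiating) avoids numerical underflow, which matters for long sequences.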
Consider the data generated from Eq. (5). The difference between the true states and the states estimated using SMC with N = 500 is plotted in Figure 3.

Figure 3: Error in the latent state estimate using SMC with N = 500.

Intuitively, a Gibbs sampler for simultaneous state and parameter inference in SSMs can be thought of as alternating between updating θ and updating x_{1:T}:

Draw θ[m] ~ p(θ | x_{1:T}[m−1], y_{1:T}) (9)
Draw x_{1:T}[m] ~ p(x_{1:T} | θ[m], y_{1:T}). (10)

However, it is hard to draw from p(x_{1:T} | θ[m], y_{1:T}). Therefore, we approximate it using a particle filter. More specifically, we use conditional SMC (cSMC), in which one pre-specified path is retained throughout the sampler. cSMC and PG are discussed in detail in the next section.
3. Particle Gibbs Method
The particle Gibbs (PG) sampler was introduced in [1] as a way to use approximate SMC proposals within an exact MCMC algorithm (the Gibbs sampler). It has been widely used for joint parameter and state inference in non-linear state-space models. First, we define the conditional particle filter (cSMC), which is the basic building block of PG methods.
Conditional SMC
Conditional SMC (cSMC), or the conditional particle filter (CPF), is similar to the standard SMC algorithm except that a pre-specified path x'_{1:T} is retained through all the resampling steps, while the remaining N − 1 particles are generated as usual. This is achieved by setting the N-th particle x_t^N = x'_t and its ancestor index a_t^N = N deterministically, where N is the number of particles. The conditioning ensures the correct stationary distribution for any N ≥ 2. The cSMC algorithm returns a trajectory indexed by b, where b is sampled with probability proportional to the final particle weights, b ~ Cat({w_T^i}_{i=1}^N). The cSMC algorithm is summarized in Algorithm 3.

Algorithm 3 cSMC
Initialize:
  Draw x_1^i ~ p(x_1 | θ) for i = 1 ... N − 1, and set x_1^N = x'_1
  Compute normalized weights w_1^i for i = 1 ... N
for t = 2 to the number of states T do
  Sample a_t^i for i = 1 ... N − 1, and set a_t^N = N and x̄_{t−1}^i = x_{t−1}^{a_t^i}
  Draw x_t^i ~ p(x_t | x̄_{t−1}^i, θ) for i = 1 ... N − 1, and set x_t^N = x'_t
  Set x_{1:t}^i = {x̄_{1:t−1}^i, x_t^i}
  Compute normalized weights w_t^i for i = 1 ... N
Draw b ~ Cat({w_T^i}_{i=1}^N)
return x_{1:T}^b

The PG algorithm iteratively runs cSMC sweeps as shown in Algorithm 4, where each conditional trajectory is sampled from the surviving trajectories of the previous sweep.
Algorithm 4 PG
Initialize: set x_{1:T}[1] and θ[1] arbitrarily
for m = 2 to the number of iterations M do
  Draw θ[m] ~ p(· | x_{1:T}[m−1], θ[m−1])
  x_{1:T}[m] = cSMC(x_{1:T}[m−1], θ[m], y_{1:T})

Simultaneous State and Parameter Inference using Particle Gibbs
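A compact sketch of the PG sampler (Algorithms 3 and 4) for the model of Eq. (5) is given below. It assumes a fixed R and, as an illustrative choice of our own, an inverse-Gamma prior on Q so that the parameter draw is available in closed form; all function and variable names are ours, not the paper's.

```python
import numpy as np

def weight(yt, xt, R):
    """Normalized observation weights for the model of Eq. (5)."""
    logw = -0.5 * (yt - xt ** 2 / 20) ** 2 / R
    w = np.exp(logw - logw.max())
    return w / w.sum()

def csmc(y, x_ref, N, Q, R, rng):
    """Conditional SMC (Algorithm 3): particle N is clamped to the reference path."""
    T = len(y)
    X = np.zeros((T, N))
    X[0] = rng.normal(0.0, np.sqrt(Q), size=N)
    X[0, N - 1] = x_ref[0]                    # clamp reference at time 1
    w = weight(y[0], X[0], R)
    for t in range(1, T):
        a = rng.choice(N, size=N, p=w)        # resample ancestor indices
        a[N - 1] = N - 1                      # deterministic ancestor for the reference
        X[:t] = X[:t, a]                      # reorder stored trajectories
        f = 0.5 * X[t - 1] + 25 * X[t - 1] / (1 + X[t - 1] ** 2) + 8 * np.cos(1.2 * t)
        X[t] = f + rng.normal(0, np.sqrt(Q), size=N)
        X[t, N - 1] = x_ref[t]                # clamp reference state
        w = weight(y[t], X[t], R)
    b = rng.choice(N, p=w)                    # b ~ Cat(final weights)
    return X[:, b]

def particle_gibbs(y, M=100, N=100, R=1.0, a0=0.01, b0=0.01, seed=0):
    """PG (Algorithm 4) with an inverse-Gamma draw for Q (illustrative conjugacy)."""
    rng = np.random.default_rng(seed)
    T = len(y)
    x = np.zeros(T)                           # arbitrary initial reference trajectory
    Q, Qs = 1.0, []
    for m in range(M):
        # Draw Q | x_{1:T}: inverse-Gamma is conjugate to the Gaussian transitions.
        f = 0.5 * x[:-1] + 25 * x[:-1] / (1 + x[:-1] ** 2) + 8 * np.cos(1.2 * np.arange(1, T))
        sse = np.sum((x[1:] - f) ** 2)
        Q = 1.0 / rng.gamma(a0 + (T - 1) / 2, 1.0 / (b0 + sse / 2))
        x = csmc(y, x, N, Q, R, rng)          # refresh the state trajectory
        Qs.append(Q)
    return x, np.array(Qs)
```

Reordering the stored trajectories at every resampling step keeps full particle paths, which is what the returned trajectory `X[:, b]` requires.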
In this experiment, we use a dataset simulated from Eq. (5), and the model parameters and latent states are assumed to be unknown. The error in the latent state estimate using PG with 500 particles and 50,000 iterations is plotted in Figure 4. The parameter posteriors (after discarding the first one third of the samples as burn-in) are plotted in Figures 5 and 6.

Figure 4: Error in the latent state estimate using PG with 500 particles and 50,000 iterations.

Figure 5: Posterior of Q. Figure 6: Posterior of R.

Comparing Run-time Performance of Python and Rust Programs for PG
For the previous example, the run-time performance of the Python and Rust programs for the PG sampler against different numbers of iterations (with fixed N = 500) is given in Table 1 and plotted in Figure 7. The table shows that the Rust program is about 10 times faster than the Python program.

Figure 7: Run-time performance of the Python and Rust programs for PG against different numbers of iterations.
Table 1: Comparison of time (in seconds) for the Python and Rust programs for PG.
4. Extensions and Variants of Particle Gibbs Methods
In this section, we discuss various extensions and variants of the particle Gibbs method.

4.1. Particle Gibbs with Ancestor Sampling
The PG algorithm has been proven to be uniformly ergodic under standard assumptions; however, the mixing of the PG sampler can be poor, especially when there is severe degeneracy in the underlying SMC. It has been shown that the number of particles N must increase linearly with T for the sampler to mix properly for large T, which results in an overall computational complexity that is quadratic in T. To address this problem, PGAS was introduced in [10]. In PGAS, the ancestor of the reference trajectory at each time step is sampled according to ancestor weights, instead of being set deterministically, which significantly improves the mixing of the sampler for small N, even when T is large.

Ancestor resampling within cSMC was introduced mainly to mitigate path degeneracy, and it helps the sampler move around the conditioned path. Instead of setting a_t^N = N, a new value is sampled from {1, ..., N}. The idea is to connect the partial reference trajectory x'_{t:T} to one of the particles x_{t−1}^i. This is done in the following two steps:

Compute weights: w̃_{t−1|T}^i ∝ w_{t−1}^i p(x'_t | x_{t−1}^i) (11)
Sample: P(a_t^N = i) ∝ w̃_{t−1|T}^i (12)

The cSMC-AS algorithm is summarized as follows.

Algorithm 5 cSMC-AS
Initialize:
  Draw x_1^i ~ p(x_1 | θ) for i = 1 ... N − 1, and set x_1^N = x'_1
  Compute normalized weights w_1^i for i = 1 ... N
for t = 2 to the number of states T do
  Sample a_t^i for i = 1 ... N − 1
  Compute weights w̃_{t−1|T}^i ∝ w_{t−1}^i p(x'_t | x_{t−1}^i)
  Sample P(a_t^N = i) ∝ w̃_{t−1|T}^i
  Set x̄_{t−1}^i = x_{t−1}^{a_t^i}
  Draw x_t^i ~ p(x_t | x̄_{t−1}^i, θ) for i = 1 ... N − 1, and set x_t^N = x'_t
  Set x_{1:t}^i = {x̄_{1:t−1}^i, x_t^i}
  Compute normalized weights w_t^i for i = 1 ... N
Draw b ~ Cat({w_T^i}_{i=1}^N)
return x_{1:T}^b

Algorithm 6
PGAS
Initialize: set x_{1:T}[1] and θ[1] arbitrarily
for m = 2 to the number of iterations M do
  Draw θ[m] ~ p(· | x_{1:T}[m−1], θ[m−1])
  x_{1:T}[m] = cSMC-AS(x_{1:T}[m−1], ...)

Mixing of PG and PGAS
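The ancestor-sampling weights of Eq. (11), which drive the improved mixing examined in this experiment, can be computed for the model of Eq. (5) roughly as follows. This is a sketch with our own names; the Gaussian transition density is evaluated up to constants, which is sufficient because the weights are normalized.

```python
import numpy as np

def ancestor_weights(w_prev, x_prev, x_ref_t, Q, t):
    """Ancestor-sampling weights of Eq. (11):
    w̃^i_{t-1|T} ∝ w^i_{t-1} * p(x'_t | x^i_{t-1}) under the model of Eq. (5)."""
    f = 0.5 * x_prev + 25 * x_prev / (1 + x_prev ** 2) + 8 * np.cos(1.2 * t)
    log_trans = -0.5 * (x_ref_t - f) ** 2 / Q   # Gaussian transition density, up to consts
    logw = np.log(w_prev) + log_trans
    w = np.exp(logw - logw.max())
    return w / w.sum()
```

In cSMC-AS, instead of setting a_t^N = N, one would then draw the reference ancestor index from Cat(`ancestor_weights(...)`), per Eq. (12).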
To illustrate that ancestor resampling can considerably improve the mixing of PG, we plot the autocorrelation functions (ACFs) of the noise parameter Q. We consider the dataset generated from Eq. (5), with T = 500, and we assume that the model parameters and states are unknown. The PG and PGAS samplers are run for 50,000 iterations, and the first one third of the samples are discarded as burn-in. The ACFs of Q for PG and PGAS against different values of N are plotted in Figure 8; PGAS shows good mixing rates even for small N, whereas PG requires large N to mix well.

Figure 8: ACFs of the parameter Q for PG (left column) and for PGAS (right column) for a dataset generated from Eq. (5) with T = 500. The results are reported against different numbers of particles N.

4.2. Interacting Particle Markov Chain Monte Carlo

As mentioned in the previous section, a major drawback of PG is path degeneracy in the cSMC step: conditioning on an existing trajectory implies that whenever resampling of the trajectories results in a common ancestor, this ancestor must correspond to the reference trajectory. This results in high correlation between the samples and poor mixing of the Markov chain. Interacting particle Markov chain Monte Carlo (iPMCMC) [14] was introduced to mitigate this problem by, from time to time, switching a cSMC particle system with a completely independent SMC one, which results in improved mixing.

In iPMCMC, a pool of conditional SMC samplers and standard SMC samplers are run as parallel processes, where each process is referred to as a node. Assume that there are R separate nodes, P of which run cSMC while the remaining R − P run SMC; they interact by exchanging only minimal information at each iteration to draw new MCMC samples. The cSMC nodes are given identifiers c_j ∈ {1, ..., R}, where j ∈ {1, ..., P}. Let x_r^i = x_{1:T,r}^i be the internal particle trajectories of node r ∈ {1, ..., R}. At each iteration m, the nodes c_{1:P} run cSMC with the previous MCMC samples x'_j[m−1] as the reference particles. The remaining R − P nodes run standard SMC. Each node r returns an estimate of the marginal likelihood for its internal particle system, defined as

Ẑ_r = ∏_{t=1}^T (1/N) ∑_{i=1}^N w̃_{t,r}^i. (13)

The new conditional nodes are then set by sampling new indices c_j as follows:

p(c_j = r | c_{1:P\j}) = ζ̂_r^j (14)
ζ̂_r^j = Ẑ_r 1(r ∉ c_{1:P\j}) / ∑_q Ẑ_q 1(q ∉ c_{1:P\j}) (15)

Thus, one loop through the conditional SMC node indices is required to resample each of them from the union of the current node index and the unconditional SMC node indices, in proportion to their marginal likelihood estimates. This is the key step that may switch the nodes from which the reference particles are drawn.

Algorithm 7 iPMCMC
Input: number of nodes R, number of conditional nodes P, and MCMC steps M
Initialize: set x'_{1:P}[1]
for m = 2 to the number of iterations M do
  Workers c_{1:P} run cSMC using x'_{1:P}[m−1] as reference particles
  Workers {1, ..., R} \ c_{1:P} run SMC
  for j = 1 to P do
    Sample c_j according to Eq. (14)
    Set x'_j[m] to a trajectory drawn from node c_j

The run-time performance of the Rust and Python programs for the iterated PG sampler is compared in Table 2 against different numbers of iterations, with a fixed number of particles N = 500, R = 16, and P = 8. Each program used the same dataset generated from Eq. (5). The table shows that the Rust program is more than 8 times faster than the Python program.

Table 2: Comparison of time (in seconds) between the Python and Rust programs for the iterated PG sampler.
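The node-index resampling of Eqs. (14)-(15) can be sketched as follows. This is an illustration with our own names; `Zhat` holds the per-node marginal-likelihood estimates of Eq. (13), and `c` holds the current conditional-node indices.

```python
import numpy as np

def resample_conditional_nodes(Zhat, c, rng):
    """One sweep of Eqs. (14)-(15): resample each conditional-node index c_j
    from {c_j} ∪ {unconditional nodes}, in proportion to Ẑ_r."""
    c = list(c)
    R = len(Zhat)
    for j in range(len(c)):
        others = set(c) - {c[j]}                       # the set c_{1:P \ j}
        candidates = [r for r in range(R) if r not in others]
        z = np.array([Zhat[r] for r in candidates], dtype=float)
        probs = z / z.sum()                            # ζ̂ of Eq. (15)
        c[j] = candidates[rng.choice(len(candidates), p=probs)]
    return c
```

Because the candidate set excludes the other conditional indices, the resampled indices remain distinct, as required.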
4.3. Blocked Particle Gibbs

The uniform ergodicity of the Markov kernel used in PG was proven in [4], and it was shown that the mixing rate does not decay if the number of particles grows at least linearly with the number of latent states. However, the computational complexity of a PG sampler is then quadratic in the number of latent states, which can be a limiting factor for its use on long observation sequences. The Blocked Particle Gibbs (bPG) sampler was introduced in [16] to address this problem, and it was shown that, using blocking strategies, a sampler can achieve a stable mixing rate at a linear cost per iteration. The main idea in blocked PG is to divide the whole sequence of states into small (overlapping) blocks, such that the odd and even blocks can be computed in parallel.

Let I = {1, ..., T} be the index set of the sequence of latent variables x_1, ..., x_T. In blocked PG, the sequence x_1, ..., x_T is divided into blocks, where consecutive blocks may overlap but nonconsecutive blocks do not overlap and are separated, as illustrated in Figure 9. The block size L and overlap p are chosen such that the ideal sampler is stable and the number of particles is large enough to obtain a stable PG. Note that blocked PG depends only on the size L, not on T.

Figure 9: Blocked particle Gibbs strategy. The blocks in the first row are the odd blocks, the second row contains the even blocks, and consecutive odd and even blocks overlap.
Let J = J_1, ..., J_r be a cover of {1, ..., T}, and let P = P_{J_1} ··· P_{J_r} be the Gibbs kernel for one complete sweep from left to right. The parallel Gibbs kernels are defined as follows. For simplicity we assume that the number of blocks is even (it is easy to construct similar arguments for an odd number of blocks):

P_odd = P_{J_1} P_{J_3} ··· P_{J_{r−1}}
P_even = P_{J_2} P_{J_4} ··· P_{J_r}

In the first iteration, we sweep through the blocks from left to right, with P the kernel corresponding to one complete sweep. At each subsequent iteration, we update all the odd-numbered blocks first and then all the even-numbered blocks; this is called parallel blocked Gibbs sampling. The kernel for an internal block J = {s, ..., u}, called the blocked conditional SMC sampler, is defined in the following.

Blocked cSMC
The blocked SMC approximates the sequence of target distributions p(x_s, ..., x_t | x_{s−1}, y_s, ..., y_t) for t = s, ..., u using conditional SMC. For initialization, the distribution p(x_s | x_{s−1}) is used. The for-loop of the blocked SMC algorithm is similar to that of cSMC. After the loop, to take into account that the target distribution is p(x_s, ..., x_u | x_{s−1}, x_{u+1}, y_s, ..., y_u), the conditioning on the fixed boundary state x_{u+1} is applied; it contributes the term p(x_{u+1} | x_u^i) to the final weight w_u^i.

Algorithm 8 blockedSMC
Initialize:
  Draw x_s^i ~ p(x_s | x_{s−1}) for i = 1 ... N − 1, and set x_s^N = x'_s
  Compute normalized weights w_s^i for i = 1 ... N
for t = s + 1 to u do
  Sample a_t^i for i = 1 ... N − 1, and set a_t^N = N and x̄_{t−1}^i = x_{t−1}^{a_t^i}
  Draw x_t^i ~ p(x_t | x̄_{t−1}^i, θ) for i = 1 ... N − 1, and set x_t^N = x'_t
  Set x_{s:t}^i = {x̄_{s:t−1}^i, x_t^i}
  Compute normalized weights w_t^i for i = 1 ... N
Set w_u^i = w_u^i · p(x_{u+1} | x_u^i) for i = 1 ... N
Draw b ~ Cat({w_u^i}_{i=1}^N)
return x_{s:u}^b

Note that for the first block we have a deterministic initial condition as before, and for the last block we do not have to adjust for the (overlapping) next block. The blocked PG is summarized in the following algorithm.
Algorithm 9 blocked PG
Input: block size L, overlap size p, and MCMC steps M
Iteration 1: set x_{1:T}[1] by calling SMC; compute the start and end index of each block
for m = 2 to the number of iterations M do
  Compute an initial state, a boundary state and a reference trajectory x_{s:u} for each block using x_{1:T}[m−1]
  Run blockedSMC for each odd block in parallel
  Run blockedSMC for each even block in parallel
  Combine the results from all the odd and even sweeps into x_{1:T}[m]

Comparing the True States and the Estimated Latent States
In this experiment, we compare the true states with the latent states estimated using blocked PG for the data simulated from Eq. (5), as shown in Figure 10. We ran blocked PG for 10,000 iterations with N = 500 particles, a block size of 30, and an overlap of 1, and all the blocks were run in parallel. From Figure 10, the estimated states appear to be a close estimate of the true states.

Figure 10: Comparison between the true states and the estimated states using blocked PG with N = 500 particles and 10,000 iterations.
Run-time Performance Comparison
For the previous example, the run-time performance of the Rust and Python programs for the blocked PG sampler is compared in Table 3 against different numbers of iterations, with a fixed number of particles N = 500. For each sampler we used a block size of 30 and an overlap of 1, and all the blocks were run in parallel. The table shows that the Rust program is almost 8 times faster than the Python program.

Table 3: Comparison of time (in seconds) for the Python and Rust programs for blocked PG.
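The block bookkeeping used in these blocked-PG experiments (block size L, overlap p) can be sketched as follows. This is our own helper, assuming p < L; the odd and even blocks are then simply the alternating entries of the returned list.

```python
def make_blocks(T, L, p):
    """Partition indices 0..T-1 into half-open blocks [s, u) of size at most L,
    where consecutive blocks share p states and nonconsecutive blocks are disjoint.
    Assumes 0 <= p < L."""
    blocks = []
    s = 0
    while s < T:
        u = min(s + L, T)
        blocks.append((s, u))
        if u == T:
            break
        s = u - p          # next block starts p states before the previous end
    return blocks
```

With `blocks = make_blocks(T, L, p)`, the odd blocks are `blocks[0::2]` and the even blocks are `blocks[1::2]`, which can each be processed in parallel.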
4.4. Collapsed Particle Gibbs

Usually, independent samples from the target distribution are desired. When there is strong correlation between the variables, the standard Gibbs sampler can generate correlated samples. A Gibbs sampler in which one or more variables are integrated out (marginalized over) when sampling the other variables is known as a collapsed Gibbs sampler [11].

Here we focus on a marginalized state update, integrating out the model parameters. In the particle Gibbs sampler, there is a dependence between the states x_{1:T} and the model parameters θ, which leads to correlated samples. By marginalizing out the parameters from the state update, the amount of autocorrelation between samples can be reduced.

In the following, we define marginalized SMC, followed by marginalized particle Gibbs (collapsed particle Gibbs) and its application to non-linear state-space models. For the details we refer to [17].

Marginalized SMC

Marginalized conditional SMC (mcSMC) is similar to the cSMC algorithm except that we integrate out the model parameters. We assume that there is a conjugacy relationship between the prior distribution p(θ) and the complete-data likelihoods p(x_t, y_t | θ), for t = 1, ..., T. The use of a restricted exponential family was proposed, where the log-partition function is assumed to be separable into two parts, one parameter-dependent and the other state-dependent. The complete-data likelihood under the restricted exponential family is given by

p(x_t, y_t | x_{t−1}, θ) = h_t exp(θ^T s_t − A(θ)^T r_t),

where A(θ) is the restricted log-partition function and r_t = r(x_t) is a function that depends only on the state.
A conjugate prior for this likelihood is

p(θ | χ_0, ν_0) = g(χ_0, ν_0) exp(θ^T χ_0 − A(θ)^T ν_0).

The parameter posterior is then of the same form, π(χ_{t−1}, ν_{t−1}), where the hyperparameters are iteratively updated according to

χ_t = χ_0 + ∑_{k=1}^t s_k = χ_{t−1} + s_t (16)
ν_t = ν_0 + ∑_{k=1}^t r_k = ν_{t−1} + r_t (17)

With the above joint likelihood and conjugate prior, the marginal of the joint distribution of states and observations at time t can be derived in closed form:

p(x_t, y_t | x_{1:t−1}, y_{1:t−1}) = ∫ p(x_t, y_t | x_{t−1}, θ) p(θ | χ_{t−1}, ν_{t−1}) dθ = h_t g(χ_{t−1}, ν_{t−1}) / g(χ_t, ν_t).

To compute the weights for mcSMC under the restricted exponential family assumption, we only need to keep track of and update the hyperparameters according to Eqs. (16) and (17). The mcSMC method is summarized in Algorithm 10.

Algorithm 10 mcSMC
Initialize:
  Draw x_1^i ~ p(x_1 | θ) for i = 1 ... N − 1, and set x_1^N = x'_1
  Compute normalized weights w_1^i for i = 1 ... N
for t = 2 to the number of states T do
  Update the hyperparameters χ_t^i, ν_t^i for i = 1 ... N
  Sample a_t^i for i = 1 ... N − 1, and set a_t^N = N and x̄_{t−1}^i = x_{t−1}^{a_t^i}
  Draw x_t^i ~ p(x_t | x̄_{t−1}^i) for i = 1 ... N − 1, and set x_t^N = x'_t
  Set x_{1:t}^i = {x̄_{1:t−1}^i, x_t^i}
  Compute normalized weights w_t^i for i = 1 ... N
Draw b ~ Cat({w_T^i}_{i=1}^N)
return x_{1:T}^b

The collapsed PG algorithm iteratively runs mcSMC sweeps as shown in Algorithm 11, where each conditional trajectory is sampled from the surviving trajectories of the previous sweep.
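The running hyperparameter updates of Eqs. (16)-(17) reduce to simple additions of sufficient statistics. The sketch below uses our own names; the inverse-Gamma example in the comment is our illustrative assumption, not the paper's general construction.

```python
def update_hyperparams(chi, nu, s_t, r_t):
    """Conjugate running update of Eqs. (16)-(17):
    χ_t = χ_{t-1} + s_t, ν_t = ν_{t-1} + r_t."""
    return chi + s_t, nu + r_t

# Illustrative example (an assumption of ours): with an inverse-Gamma prior on Q
# and Gaussian transitions x_t | x_{t-1} ~ N(f(x_{t-1}), Q), the sufficient
# statistics are s_t = (x_t - f(x_{t-1}))^2 / 2 and r_t = 1/2, so after T steps
# the posterior hyperparameters are simply the prior ones plus these sums.
```

Because only (χ_t, ν_t) need to be carried along each particle, the marginalization adds essentially no cost to the SMC sweep.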
Algorithm 11 Collapsed PG
Initialize: set x_{1:T}[1] and θ[1] arbitrarily
for m = 2 to the number of iterations M do
  Draw θ[m] ~ p(· | x_{1:T}[m−1], θ[m−1])
  x_{1:T}[m] = mcSMC(x_{1:T}[m−1], ...)

Comparing Mixing Rates of PG and Collapsed PG

To compare the mixing rates, we ran the PG sampler and the collapsed PG sampler for 10,000 iterations with varying numbers of particles, for the dataset generated from Eq. (5). After discarding the first one third of the samples, the first 15 lags of the ACFs were computed and plotted, as shown in Figure 11. The figure shows that the mixing rate of collapsed PG is much more stable than that of PG, even for a small number of particles.
Figure 11: ACFs of the parameter Q for PG (left column) and for collapsed PG (right column) for a dataset simulated from Eq. (5). The results are reported against different numbers of particles N.
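The ACF curves reported in this and the earlier mixing experiments can be computed from an MCMC trace with a routine like the following (a minimal sketch with our own names):

```python
import numpy as np

def acf(samples, nlags=15):
    """Empirical autocorrelation function of an MCMC trace, for lags 0..nlags."""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()
    n = len(x)
    c0 = np.dot(x, x) / n                       # lag-0 autocovariance
    return np.array([np.dot(x[: n - k], x[k:]) / (n * c0) for k in range(nlags + 1)])
```

Fast-decaying ACF values indicate good mixing; for a well-mixing sampler the curve drops toward zero within a few lags, as seen for PGAS and collapsed PG.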
5. Conclusion and Future Research
We discussed the particle Gibbs sampler and its variants and extensions, such as particle Gibbs with ancestor sampling, interacting particle MCMC, blocked PG and collapsed PG, for state and parameter inference in non-linear SSMs. We illustrated all the methods with simulated datasets. We hope that our implementations of all the discussed methods in the Python and Rust programming languages make them easy to understand. We compared the run-time performance of the Python and Rust programs; the results show that the Rust programs are 8 to 10 times faster than the corresponding Python programs.

PGAS is off-line in nature, in the sense that, given a new observation, the algorithm has to be executed from scratch. Simultaneous estimation of parameters and states with an on-line approach would be more useful in dynamical system identification and related applications.
Code
The source code of our implementation is available at https://github.com/niharikag/PGSampler.

Acknowledgments
The computations were performed on resources provided by SNIC throughTetralith under project SNIC 2020/5-278.
References

[1] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
[2] Francisco M Calafat, Thomas Wahl, Fredrik Lindsten, Joanne Williams, and Eleanor Frajka-Williams. Coherent modulation of the sea-level annual cycle in the United States by Atlantic Rossby waves. Nature Communications, 9(1):1–13, 2018.
[3] George Casella and Edward I George. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.
[4] Nicolas Chopin and Sumeetpal S Singh. On particle Gibbs sampling. Bernoulli, 21(3):1855–1883, 2015.
[5] Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
[6] Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3, 2009.
[7] Andrew Gelman and Donald B Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.
[8] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. Journal of Applied Statistics, 20(5-6):25–62, 1993.
[9] Walter R Gilks, Sylvia Richardson, and David Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1995.
[10] Fredrik Lindsten, Michael I Jordan, and Thomas B Schön. Particle Gibbs with ancestor sampling. The Journal of Machine Learning Research, 15(1):2145–2184, 2014.
[11] Jun S Liu. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89(427):958–966, 1994.
[12] Nima Nonejad. Particle Gibbs with ancestor sampling for stochastic volatility models with: heavy tails, in mean effects, leverage, serial dependence and structural breaks. Studies in Nonlinear Dynamics & Econometrics, 19(5):561–584, 2015.
[13] John Parslow, Noel Cressie, Edward P Campbell, Emlyn Jones, and Lawrence Murray. Bayesian learning and predictability in a stochastic nonlinear dynamical model. Ecological Applications, 23(4):679–698, 2013.
[14] Tom Rainforth, Christian Naesseth, Fredrik Lindsten, Brooks Paige, Jan-Willem van de Meent, Arnaud Doucet, and Frank Wood. Interacting particle Markov chain Monte Carlo. In International Conference on Machine Learning, pages 2616–2625, 2016.
[15] David A Rasmussen, Oliver Ratmann, and Katia Koelle. Inference for nonlinear epidemiological models using genealogies and time series. PLoS Computational Biology, 7(8), 2011.
[16] Sumeetpal S Singh, Fredrik Lindsten, and Eric Moulines. Blocking strategies and stability of particle Gibbs samplers. Biometrika, 104(4):953–969, 2017.
[17] Anna Wigren, Riccardo Sven Risuleo, Lawrence Murray, and Fredrik Lindsten. Parameter elimination in particle Gibbs sampling. In Advances in Neural Information Processing Systems, 2019.