Augmented pseudo-marginal Metropolis-Hastings for partially observed diffusion processes
Andrew Golightly∗ and Chris Sherlock

School of Mathematics, Statistics and Physics, Newcastle University, UK
Department of Mathematics and Statistics, Lancaster University, UK

∗[email protected]
Abstract
We consider the problem of inference for nonlinear, multivariate diffusion processes, satisfying Itô stochastic differential equations (SDEs), using data at discrete times that may be incomplete and subject to measurement error. Our starting point is a state-of-the-art correlated pseudo-marginal Metropolis-Hastings scheme, which uses correlated particle filters to induce strong and positive correlation between successive marginal likelihood estimates. However, unless the measurement error or the dimension of the SDE is small, this correlation can be eroded by the resampling steps in the particle filter. We therefore propose a novel augmentation scheme that allows for conditioning on values of the latent process at the observation times, completely avoiding the need for resampling steps. We integrate over the uncertainty at the observation times with an additional Gibbs step. Connections between the resulting pseudo-marginal scheme and existing inference schemes for diffusion processes are made. The methodology is applied in three examples of increasing complexity. We find that our approach offers substantial increases in overall efficiency compared to competing methods.
Keywords: stochastic differential equation; Bayesian inference; pseudo-marginal Metropolis-Hastings; data augmentation; linear noise approximation
1 Introduction

Although stochastic differential equations (SDEs) have been ubiquitously applied in areas such as finance (see e.g. Kalogeropoulos et al., 2010; Stramer et al., 2017), climate modelling (see e.g. Majda et al., 2009; Chen et al., 2014) and the life sciences (see e.g. Golightly and Wilkinson, 2011; Fuchs, 2013; Wilkinson, 2018; Picchini and Forman, 2019), their widespread uptake is hindered by the significant challenge of fitting such models to partial data at discrete times. General nonlinear, multivariate SDEs rarely admit analytic solutions, necessitating the use of a numerical solution (such as that obtained from the Euler-Maruyama scheme). The resulting discretisation error can be controlled through the use of intermediate time steps between observation instants. However, integrating over the uncertainty at these intermediate times can be computationally expensive.

Upon resorting to discretisation, two approaches to Bayesian inference are apparent. If learning of both the static parameters and latent process is required, a Gibbs sampler provides a natural way of exploring the joint posterior. The well-studied dependence between the parameters and latent process can be problematic; see Roberts and Stramer (2001) for a discussion of the problem. Gibbs strategies that overcome this dependence have been proposed by Roberts and Stramer (2001) for reducible SDEs and by Golightly and Wilkinson (2008), Fuchs (2013), Papaspiliopoulos et al. (2013) and van der Meulen and Schauer (2017) (among others) for irreducible SDEs. If primary interest lies in learning the parameters, pseudo-marginal Metropolis-Hastings (PMMH) schemes offer an alternative. Our contribution is an augmentation scheme that conditions on values of the latent process at the observation times, but not at the intermediate times between observation time instants. Given observations y, latent values x^o at observation instants and parameters θ, a Gibbs sampler is used to update θ conditional on (x^o, y), and x^o conditional on (θ, y).
Both steps require estimating likelihoods of the form p(x^o_{t+1} | x^o_t, θ), for which we obtain unbiased estimators via importance sampling. Consequently, our approach can be cast within the pseudo-marginal framework and we further use correlation to improve computational efficiency. Crucially, we avoid the need for resampling, thus preserving induced positive correlation between likelihood estimates. Furthermore, each iteration of our proposed scheme admits steps that are embarrassingly parallel. We refer to the resulting sampler as augmented CPMMH (aCPMMH). It should be noted that aCPMMH requires careful initialisation of x^o and, subsequently, a suitable proposal mechanism. We provide practical advice for initialisation and tuning of proposals for a wide class of SDEs. Special cases of aCPMMH are also considered, and result in a connection with the reparameterisation of Golightly and Wilkinson (2008). We apply aCPMMH in three examples of increasing complexity and compare against state-of-the-art CPMMH and PMMH schemes. We find that the proposed approach offers an increase in overall efficiency of over an order of magnitude in several settings.

The remainder of this paper is organised as follows. Background information on the inference problem, PMMH and CPMMH approaches is provided in Section 2. Our novel contribution is described in Section 3, and we explore connections between our proposed approach and existing samplers that use data augmentation in Section 3.4. Applications are given in Section 4, and conclusions are provided in Section 5, alongside directions for future research.

Consider a continuous-time d-dimensional Itô process {X_t, t ≥ 0} satisfying an SDE of the form

dX_t = α(X_t, θ) dt + √β(X_t, θ) dW_t,   X_0 ∼ p(x_0).
(1)

Here, α is a d-vector of drift functions, the diffusion coefficient β is a d × d positive definite matrix with a square-root representation √β such that √β √β^T = β, and W_t is a d-vector of (uncorrelated) standard Brownian motion processes. We assume that both α and β depend on X_t = (X_{1,t}, ..., X_{d,t})^T and the p-vector parameter θ = (θ_1, ..., θ_p)^T. The process {X_t, t ≥ 0} cannot be observed exactly, but observations y = (y_1, ..., y_n)^T are available on a regular time grid, and these are conditionally independent (given the latent process). We link the observations to the latent process via an observation model of the form

Y_t = F^T X_t + ε_t,   ε_t | Σ ~(indep) N(0, Σ),   (2)

where Y_t is a d_o-vector, F is a constant d × d_o matrix and ε_t is a random d_o-vector. Note that this setup allows for only observing a subset of components (d_o < d). In settings where learning Σ is also of interest, the parameter vector θ can be augmented to include the components of Σ.

For most problems of interest the form of the SDE in (1) will not permit an analytic solution. We therefore work with the Euler-Maruyama approximation

ΔX_t ≡ X_{t+Δt} − X_t = α(X_t, θ) Δt + √β(X_t, θ) ΔW_t,

where ΔW_t ~ N(0, I_d Δt). To allow arbitrary accuracy of this approximation, we adopt a partition of [t, t+1] as

t = τ_{t,0} < τ_{t,1} < τ_{t,2} < ... < τ_{t,m−1} < τ_{t,m} = t + 1,

thus introducing m − 1 intermediate time points, with

Δτ ≡ τ_{t,k+1} − τ_{t,k} = 1/m.   (3)

The Euler-Maruyama approximation is then applied over each interval of width Δτ, with m chosen by the practitioner to balance accuracy and computational efficiency.
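The Euler-Maruyama recursion over this partition is simple to implement. The following is a minimal NumPy sketch (the function and argument names are ours, not from the paper; the usage example below borrows the square-root diffusion of Section 4 purely as an illustration):

```python
import numpy as np

def euler_maruyama(x0, theta, alpha, beta_sqrt, m, rng):
    """Simulate X over one unit interval [t, t+1] with m Euler-Maruyama steps
    of width dtau = 1/m.  alpha(x, theta) returns the d-vector drift;
    beta_sqrt(x, theta) returns a d x d square root of the diffusion matrix."""
    dtau = 1.0 / m
    x = np.asarray(x0, dtype=float)
    path = [x]
    for _ in range(m):
        dW = rng.normal(0.0, np.sqrt(dtau), size=x.shape)  # Delta W ~ N(0, I_d dtau)
        x = x + alpha(x, theta) * dtau + beta_sqrt(x, theta) @ dW
        path.append(x)
    return np.array(path)  # (m+1) x d array: x at tau_{t,0}, ..., tau_{t,m}
```

Calling this repeatedly over successive unit intervals gives a skeleton of the complete latent path x.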
The transition density under the Euler-Maruyama approximation of X_{τ_{t,k+1}} | X_{τ_{t,k}} = x_{τ_{t,k}} is denoted by

p_e(x_{τ_{t,k+1}} | x_{τ_{t,k}}, θ) = N(x_{τ_{t,k+1}}; x_{τ_{t,k}} + α(x_{τ_{t,k}}, θ)Δτ, β(x_{τ_{t,k}}, θ)Δτ),

where N(·; μ, V) denotes the Gaussian density with mean vector μ and variance matrix V. In what follows, we adopt the shorthand notation

x_{[t,t+1]} = (x_{τ_{t,0}}, ..., x_{τ_{t,m}})^T

for the latent process over the time interval [t, t+1], with an analogous notation for intervals of the form (t, t+1] and (t, t+1), which ignore x_{τ_{t,0}} and (x_{τ_{t,0}}, x_{τ_{t,m}}) respectively. Hence, the complete latent trajectory is given by x = (x^T_{[0,1]}, x^T_{(1,2]}, ..., x^T_{(n−1,n]})^T. The joint density of the latent process over an interval of interest is then given by a product of Gaussian densities; for example

p_e(x_{(t,t+1]} | x_t, θ) = ∏_{k=0}^{m−1} p_e(x_{τ_{t,k+1}} | x_{τ_{t,k}}, θ).   (4)

Suppose that interest lies in the marginal parameter posterior

π^{(m)}(θ | y) ∝ π(θ) p^{(m)}(y | θ),   (5)

where π(θ) is the prior density ascribed to θ and p^{(m)}(y | θ) is the (marginal) likelihood under the augmented Euler-Maruyama approach. That is,

p^{(m)}(y | θ) = ∫ p^{(m)}(x | θ) p(y | x, θ) dx,

where

p^{(m)}(x | θ) = p(x_0) ∏_{t=0}^{n−1} p_e(x_{(t,t+1]} | x_t, θ)   and   p(y | x, θ) = ∏_{t=1}^{n} p(y_t | x_t, θ).   (6)

Although π^{(m)}(θ | y) is typically complicated by the intractable likelihood term p^{(m)}(y | θ), the latter can be unbiasedly estimated using a particle filter (Del Moral, 2004; Pitt et al., 2012). We write the estimator as p̂^{(m)}_U(y | θ), with explicit dependence on U ~ p(u), that is, the collection of all random variables of which a realisation u gives the estimate p̂^{(m)}_u(y | θ). Algorithm 1 gives the necessary steps for the generation of p̂^{(m)}_u(y | θ), with the explicit role of u suppressed for simplicity.
A key requirement of the particle filter is the ability to simulate latent trajectories x_{(t,t+1]} at each time t. To yield a reasonable particle weight, such trajectories must be consistent with x_t and y_{t+1}, and are typically termed bridges. In this article we generate bridges by drawing from a density of the form

g(x_{(t,t+1]} | x_t, y_{t+1}, θ) = ∏_{k=0}^{m−1} g(x_{τ_{t,k+1}} | x_{τ_{t,k}}, y_{t+1}, θ),

where the constituent terms take the form

g(x_{τ_{t,k+1}} | x_{τ_{t,k}}, y_{t+1}, θ) = N(x_{τ_{t,k+1}}; x_{τ_{t,k}} + μ(x_{τ_{t,k}}, y_{t+1}, θ)Δτ, Ψ(x_{τ_{t,k}}, θ)Δτ)   (7)

for suitable choices of μ(x_{τ_{t,k}}, y_{t+1}, θ) and Ψ(x_{τ_{t,k}}, θ). The form of (7) permits a wide choice of bridge construct, and we refer the reader to Whitaker et al. (2017) and Schauer et al. (2017) for several options. Throughout this paper, we take

μ(x_{τ_{t,k}}, y_{t+1}, θ) = α_k + β_k F (F^T β_k F Δ_k + Σ)^{−1} {y_{t+1} − F^T (x_{τ_{t,k}} + α_k Δ_k)}   (8)

and

Ψ(x_{τ_{t,k}}, θ) = β_k − β_k F (F^T β_k F Δ_k + Σ)^{−1} F^T β_k Δτ,   (9)

where Δ_k = t + 1 − τ_{t,k} and we adopt the shorthand notation α_k := α(x_{τ_{t,k}}, θ) and β_k := β(x_{τ_{t,k}}, θ). We note that (8) and (9) correspond to the (extension to partial and noisy observations of the) modified diffusion bridge construct of Durham and Gallant (2002). We may write the construct generatively as

x_{τ_{t,k+1}} = x_{τ_{t,k}} + μ(x_{τ_{t,k}}, y_{t+1}, θ)Δτ + √(Ψ(x_{τ_{t,k}}, θ)Δτ) u_{τ_{t,k}},   (10)

where u_{τ_{t,k}} ~ N(0, I_d). It should then be clear that an estimate of the likelihood, p̂^{(m)}_u(y | θ), is a deterministic function of the Gaussian innovations driving the bridge construct and, additionally, any random variates required in the resampling step of Algorithm 1.
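A single propagation step (10) of this construct can be sketched as follows (a NumPy sketch under the linear-Gaussian observation model (2); the helper name `mdb_step` and the callable arguments are ours):

```python
import numpy as np

def mdb_step(x, y_next, u, theta, alpha, beta, F, Sigma, Delta_k, dtau):
    """Propagate x_{tau_{t,k}} -> x_{tau_{t,k+1}} under the modified diffusion
    bridge, conditioning on the noisy observation y_{t+1};
    Delta_k = t + 1 - tau_{t,k} is the time remaining, u ~ N(0, I_d)."""
    a = alpha(x, theta)                                   # alpha_k
    b = beta(x, theta)                                    # beta_k
    C = F.T @ b @ F * Delta_k + Sigma
    gain = b @ F @ np.linalg.inv(C)
    mu = a + gain @ (y_next - F.T @ (x + a * Delta_k))    # eq (8)
    Psi = b - gain @ F.T @ b * dtau                       # eq (9)
    L = np.linalg.cholesky(Psi)                           # a square root of Psi
    return x + mu * dtau + np.sqrt(dtau) * (L @ u)        # eq (10)
```

With full, noise-free observation (F = I, Σ = 0), μ reduces to (x_{t+1} − x_{τ_{t,k}})/Δ_k, recovering the original Durham-Gallant construct.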
We use systematic resampling, which requires a single uniform innovation per resampling step.

The pseudo-marginal Metropolis-Hastings (PMMH) scheme (Andrieu and Roberts, 2009; Andrieu et al., 2009) is a Metropolis-Hastings (MH) scheme targeting the joint density

π^{(m)}(θ, u | y) ∝ π(θ) p̂^{(m)}_u(y | θ) p(u),

for which it is easily checked (using ∫ p̂^{(m)}_u(y | θ) p(u) du = p^{(m)}(y | θ)) that π^{(m)}(θ | y) is a marginal density. Hence, for a proposal density that factorises as q(θ′ | θ) q(u′), the MH acceptance probability is

α(θ′, u′ | θ, u) = min{1, [π(θ′)/π(θ)] × [p̂^{(m)}_{u′}(y | θ′)/p̂^{(m)}_u(y | θ)] × [q(θ | θ′)/q(θ′ | θ)]}.   (11)

Algorithm 1 Particle filter
Input: parameters θ, auxiliary variables u and the number of particles N.

1. Initialise. For i = 1, ..., N sample particle x^i_0 from the initial state distribution and assign weight w^i_0 = 1/N.

2. For times t = 0, 1, ..., n − 1:

(a) Resample. For i = 1, ..., N, sample the index a^i_t ~ M(w^{1:N}_t) of the ancestor of particle i, where M(w^{1:N}_t) denotes a categorical distribution on {1, ..., N} with probabilities w^{1:N}_t.

(b) Propagate. Draw x^i_{(t,t+1]} ~ g(· | x^{a^i_t}_t, y_{t+1}, θ), i = 1, ..., N.

(c) Compute the weights. For i = 1, ..., N set

w̃^i_{t+1} = p(y_{t+1} | x^i_{t+1}, θ) p_e(x^i_{(t,t+1]} | x^{a^i_t}_t, θ) / g(x^i_{(t,t+1]} | x^{a^i_t}_t, y_{t+1}, θ),   w^i_{t+1} = w̃^i_{t+1} / Σ_{j=1}^N w̃^j_{t+1}.

Output: estimate p̂^{(m)}_u(y | θ) = N^{−n} ∏_{t=0}^{n−1} Σ_{i=1}^N w̃^i_{t+1} of the observed data likelihood.

The variance of p̂^{(m)}_u(y | θ) is controlled by the number of particles N, which should be chosen to balance both mixing and computational efficiency. For example, as the variance of the likelihood estimator increases, the acceptance probability of the pseudo-marginal MH scheme decreases to 0 (Pitt et al., 2012). Increasing N results in more acceptances at increased computational cost. Practical advice for choosing N is given by Sherlock et al. (2015) and Doucet et al. (2015) under two different sets of simplifying assumptions. Given a parameter value with good support under the posterior (e.g. the marginal posterior mean, estimated from a pilot run), we select N so that the estimated log-likelihood at this parameter value has a standard deviation of roughly 1.5. Unfortunately, the value of N required to meet this condition is often found to be impractically large. Therefore, we consider a variance reduction technique which is key to our proposed approach.

The correlated pseudo-marginal scheme (Deligiannidis et al., 2018; Dahlin et al., 2015) aims to reduce the variance of the acceptance ratio in (11) by inducing strong and positive correlation between successive estimates of the observed data likelihood in the MH scheme. This can be achieved by taking a proposal of the form q(θ′ | θ) K(u′ | u), where K(u′ | u) satisfies the detailed balance equation

K(u′ | u) p(u) = K(u | u′) p(u′).
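The systematic resampling used in step 2(a) of Algorithm 1 consumes a single uniform draw per time step, which matters for the correlated scheme discussed next. A standard implementation (our own sketch, not code from the paper):

```python
import numpy as np

def systematic_resample(weights, u):
    """Return N ancestor indices from normalised weights using one uniform
    u in [0, 1); particle i is selected on average N * weights[i] times."""
    N = len(weights)
    positions = (u + np.arange(N)) / N   # N evenly spaced points in [0, 1)
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0                        # guard against rounding error
    return np.searchsorted(cdf, positions)
```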
Recall that u consists of the collection of Gaussian random variates used to propagate the state particles (step 2(b) in Algorithm 1) and any variates required in the resampling step (step 2(a) in Algorithm 1). The uniform random variate required for systematic resampling can be obtained by applying the standard Gaussian cdf to a Gaussian draw. Hence, u consists entirely of standard Gaussian variates, and it is then natural to set

K(u′ | u) = N(u′; ρu, (1 − ρ²) I_{d*}),   (12)

where d* is the total number of required innovations. We note that the density in (12) corresponds to a Crank–Nicolson proposal density, for which it is easily checked that p(u) = N(u; 0, I_{d*}) is invariant. The parameter ρ is chosen by the practitioner, with ρ ≈ 1 inducing strong and positive correlation between successive likelihood estimates.

Algorithm 2 Correlated PMMH scheme (CPMMH)
Input: correlation parameter ρ and the number of CPMMH iterations n_iters.

1. Initialise. Set the iteration counter i = 1.
(a) Set θ^(0) in the support of π(θ) and draw u^(0) ~ p(·).
(b) Compute p̂^{(m)}_{u^(0)}(y | θ^(0)) by running Algorithm 1 with (θ, u) = (θ^(0), u^(0)).

2. Update parameters.
(a) Draw θ′ ~ q(· | θ^(i−1)) and u′ ~ K(· | u^(i−1)).
(b) Compute p̂^{(m)}_{u′}(y | θ′) by running Algorithm 1 with (θ, u) = (θ′, u′).
(c) With probability α(θ′, u′ | θ^(i−1), u^(i−1)) given by (11), put (θ^(i), u^(i)) = (θ′, u′); otherwise store the current values (θ^(i), u^(i)) = (θ^(i−1), u^(i−1)).

3. If i = n_iters, stop. Otherwise, set i := i + 1 and go to step 2.

Output: θ^(1), ..., θ^(n_iters).

The aim is to induce strong and positive correlation between p̂^{(m)}_{u′}(y | θ′) and p̂^{(m)}_u(y | θ). The correlated pseudo-marginal Metropolis-Hastings scheme is given by Algorithm 2, which should be used in conjunction with a modified version of Algorithm 1. Essentially, the resampling step has the effect of breaking down correlation between successive likelihood estimates. To alleviate this problem, the particles can be sorted immediately after propagation, e.g. using a Hilbert sorting procedure (Deligiannidis et al., 2018) or simple Euclidean sorting (Choppala et al., 2016). Given a distance metric between particles, the particles are sorted as follows: the first particle in the sorted list is the one which has the smallest first component; for j = 2, ..., N, the j-th particle in the sorted list is chosen to be the one among the unsorted N − j + 1 particles that is closest to the (j − 1)-th particle in the sorted list. The correlation parameter ρ (e.g. ρ = 0.99) and the number of particles N can be chosen to minimise the distance between successive log estimates of marginal likelihood (Deligiannidis et al., 2018). In practice, we choose N so that the variance of log p̂^{(m)}_{u′}(y | θ) − log p̂^{(m)}_u(y | θ) is around 1, for θ set at some central posterior value.
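Two ingredients of the scheme above are easily sketched: the Crank–Nicolson kernel (12) and the greedy Euclidean sorting procedure (our own NumPy rendering of the description above, not code from the paper):

```python
import numpy as np

def crank_nicolson(u, rho, rng):
    """Propose u' ~ N(rho*u, (1-rho^2) I).  Since var(rho*u + sqrt(1-rho^2)*z) = 1
    for u, z ~ N(0, I), the N(0, I) distribution of u is left invariant."""
    return rho * u + np.sqrt(1.0 - rho ** 2) * rng.normal(size=u.shape)

def euclidean_sort(particles):
    """Order an N x d particle array: start from the particle with the smallest
    first component, then repeatedly append the closest remaining particle.
    Returns the ordering as an index array."""
    N = particles.shape[0]
    remaining = list(range(N))
    order = [min(remaining, key=lambda i: particles[i, 0])]
    remaining.remove(order[0])
    while remaining:
        last = particles[order[-1]]
        nxt = min(remaining, key=lambda i: float(np.linalg.norm(particles[i] - last)))
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)
```

Setting ρ = 0 in `crank_nicolson` recovers an independent refresh of u (and hence standard PMMH), while ρ ≈ 1 keeps successive auxiliary variables close.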
Although sorting particle trajectories after propagation can alleviate the effect of resampling on correlation between successive likelihood estimates, the sorting procedure can be unsatisfactory in practice. For example, the Euclidean sorting procedure as described above (and implemented within the SDE context in Golightly et al. (2019)) does not take into account the entire particle trajectory, and is likely to be ineffectual when the measurement error variance is large relative to the stochasticity inherent in the latent diffusion process. For the latter scenario, resampling may be executed less often, although choosing a resampling schedule a priori may necessarily be ad hoc. If the resampling step is omitted from the particle filter, an importance sampling algorithm is obtained. However, unless the bridge construct is particularly accurate, the resulting importance weights are likely to have high variance. In what follows, we derive a novel approach which avoids resampling altogether, without recourse to importance sampling of the entire latent process.

3.1 Setup
It will be helpful throughout this section to denote by x^o the values of x at the observation times 1, 2, ..., n, and by x^L the values of x at the remaining intermediate times. That is,

x^L = (x^T_{[0,1)}, x^T_{(1,2)}, ..., x^T_{(n−1,n)})^T.

Note that it is also possible to treat x_n as a latent variable. In what follows, we include x_n in x^o for ease of exposition.

Rather than target the posterior in (5), we target the joint posterior

π^{(m)}(θ, x^o | y) ∝ π(θ) p^{(m)}(x^o | θ) p(y | x^o, θ),   (13)

where

p^{(m)}(x^o | θ) = ∫ p^{(m)}(x | θ) dx^L   (14)

and p(y | x^o, θ) = p(y | x, θ) as in (6). Although the integral in (14) will be intractable, we may estimate it unbiasedly as follows. We adopt the factorisation

p^{(m)}(x^o | θ) = p^{(m)}(x_1 | θ) ∏_{t=1}^{n−1} p^{(m)}(x_{t+1} | x_t, θ)

and note that the constituent terms can be written as

p^{(m)}(x_1 | θ) = ∫ p(x_0) p_e(x_{(0,1]} | x_0, θ) dx_{[0,1)},   p^{(m)}(x_{t+1} | x_t, θ) = ∫ p_e(x_{(t,t+1]} | x_t, θ) dx_{(t,t+1)};   (15)

recall that p_e(x_{(t,t+1]} | x_t, θ) is given by (4). Now, given some suitable importance density g(x_{(t,t+1)} | x_t, x_{t+1}, θ), we may write

p^{(m)}(x_{t+1} | x_t, θ) = ∫ [p_e(x_{(t,t+1]} | x_t, θ) / g(x_{(t,t+1)} | x_t, x_{t+1}, θ)] g(x_{(t,t+1)} | x_t, x_{t+1}, θ) dx_{(t,t+1)} = E_{x_{(t,t+1)} ~ g} {p_e(x_{(t,t+1]} | x_t, θ) / g(x_{(t,t+1)} | x_t, x_{t+1}, θ)},

and a similar expression can be obtained for p^{(m)}(x_1 | θ). Hence, given N draws x^i_{(t,t+1)}, i = 1, ..., N, from g(· | x_t, x_{t+1}, θ), a realisation of an unbiased estimator of p^{(m)}(x_{t+1} | x_t, θ) is

p̂^{(m)}_{u_t}(x_{t+1} | x_t, θ) = (1/N) Σ_{i=1}^N p_e(x^i_{(t,t+1]} | x_t, θ) / g(x^i_{(t,t+1)} | x_t, x_{t+1}, θ),   (16)

with the convention that x^i_{t+1} = x_{t+1} for all i. We recognise (16) as an importance sampling estimator of (15).
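For a scalar SDE the estimator (16) can be sketched as follows (our own sketch; it uses the simplified bridge (17) given below, with the end point x_{t+1} held fixed so that only the m − 1 intermediate values consume innovations):

```python
import numpy as np

def norm_logpdf(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def is_transition_estimate(x_t, x_next, theta, alpha, beta, m, u):
    """Estimate p^(m)(x_{t+1} | x_t, theta) by importance sampling as in (16),
    for a scalar SDE; u has shape (N, m-1), one innovation per intermediate point."""
    N = u.shape[0]
    dtau = 1.0 / m
    w = np.empty(N)
    for i in range(N):
        x, tau = x_t, 0.0
        log_pe = log_g = 0.0
        for k in range(m):
            a, b = alpha(x, theta), beta(x, theta)
            if k < m - 1:
                Delta_k = 1.0 - tau
                mu = (x_next - x) / Delta_k                  # eq (17)
                Psi = (1.0 - (tau + dtau)) / Delta_k * b     # eq (17)
                x_new = x + mu * dtau + np.sqrt(Psi * dtau) * u[i, k]
                log_g += norm_logpdf(x_new, x + mu * dtau, Psi * dtau)
            else:
                x_new = x_next                               # end point is fixed
            log_pe += norm_logpdf(x_new, x + a * dtau, b * dtau)
            x, tau = x_new, tau + dtau
        w[i] = np.exp(log_pe - log_g)                        # importance weight
    return w.mean()                                          # eq (16)
```

With m = 1 there are no intermediate points and the estimator collapses to the exact Euler transition density.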
An unbiased importance sampling estimator of p^{(m)}(x_1 | θ) can be obtained in a similar manner, by using an importance density of the form p(x_0) g(x_{(0,1)} | x_0, x_1, θ).

We take g(x_{(t,t+1)} | x_t, x_{t+1}, θ) as a simplification of the bridge construct used in Section 2.1, so that

g(x_{(t,t+1)} | x_t, x_{t+1}, θ) = ∏_{k=0}^{m−2} g(x_{τ_{t,k+1}} | x_{τ_{t,k}}, x_{t+1}, θ),

Algorithm 3 Importance sampling
Input: parameters θ, latent values x_t, x_{t+1}, auxiliary variables u_t and the number of importance samples N.

(a) Sample. Draw x^i_{(t,t+1)} ~ g(· | x_t, x_{t+1}, θ), i = 1, ..., N. (If t = 0, draw x^i_0 ~ p(·) and x^i_{(0,1)} ~ g(· | x^i_0, x_1, θ), i = 1, ..., N.)

(b) Compute the weights. For i = 1, ..., N set

w̃^i_{t+1} = p_e(x^i_{(t,t+1]} | x_t, θ) / g(x^i_{(t,t+1)} | x_t, x_{t+1}, θ).

Output: estimate p̂^{(m)}_{u_t}(x_{t+1} | x_t, θ) = N^{−1} Σ_{i=1}^N w̃^i_{t+1} of p^{(m)}(x_{t+1} | x_t, θ) (or p̂^{(m)}_{u_0}(x_1 | θ) = N^{−1} Σ_{i=1}^N w̃^i_1 of p^{(m)}(x_1 | θ) if t = 0).

Here g(x_{(t,t+1)} | x_t, x_{t+1}, θ) has the form (7), but with the exact x_{t+1} taking the place of the noisy y_{t+1}. Since Σ = 0, (8) and (9) simplify to

μ(x_{τ_{t,k}}, x_{t+1}) = (x_{t+1} − x_{τ_{t,k}}) / (t + 1 − τ_{t,k}),   Ψ(x_{τ_{t,k}}, θ) = [(t + 1 − τ_{t,k+1}) / (t + 1 − τ_{t,k})] β(x_{τ_{t,k}}, θ).   (17)

We make clear the role of the innovation vector u_t = (u_{t,0}, ..., u_{t,m−2})^T in (16) by writing the bridge construct generatively as in (10), but with μ and Ψ given by (17). Now, since the x_{(t,t+1)}, t = 0, ..., n − 1, are drawn conditionally on the elements of x^o, we may unbiasedly estimate p^{(m)}(x^o | θ) with

p̂^{(m)}_U(x^o | θ) = p̂^{(m)}_{U_0}(x_1 | θ) ∏_{t=1}^{n−1} p̂^{(m)}_{U_t}(x_{t+1} | x_t, θ),   (18)

realisations of which may be computed by running the importance sampler in Algorithm 3 for each t = 0, ..., n −
1. Note that each t-iteration of Algorithm 3 can be performed in parallel if desired.

We adopt a pseudo-marginal approach by targeting the joint density

π^{(m)}(θ, x^o, u | y) ∝ π(θ) p̂^{(m)}_u(x^o | θ) p(y | x^o, θ) p(u),   (19)

for which it is easily checked that the posterior of interest, π^{(m)}(θ, x^o | y) given by (13), is a marginal density. The form of (19) immediately suggests a Gibbs sampler which alternates between draws from the full conditional densities (FCDs)

1. π^{(m)}(θ | u, x^o, y) ∝ π(θ) p̂^{(m)}_u(x^o | θ) p(y | x^o, θ);
2. π^{(m)}(x^o, u | θ, y) ∝ p̂^{(m)}_u(x^o | θ) p(y | x^o, θ) p(u).

Since neither FCD can be sampled directly, Metropolis-within-Gibbs steps are necessary. To sample π^{(m)}(θ | u, x^o, y) we use a proposal density q(θ′ | θ), so that the acceptance probability is given by

α(θ′ | θ, u, x^o) = min{1, [π(θ′)/π(θ)] × [p̂^{(m)}_u(x^o | θ′)/p̂^{(m)}_u(x^o | θ)] × [p(y | x^o, θ′)/p(y | x^o, θ)] × [q(θ | θ′)/q(θ′ | θ)]}.   (20)

Given that π^{(m)}(x^o, u | θ, y) may be high dimensional, we propose to update (x^o, u) in separate blocks corresponding to each time component of x^o. For t = 1, ..., n − 1, the FCD is

π^{(m)}(x_t, u_{t−1}, u_t | x_{t−1}, x_{t+1}, y_t, θ) ∝ p̂^{(m)}_{u_{t−1}}(x_t | x_{t−1}, θ) p̂^{(m)}_{u_t}(x_{t+1} | x_t, θ) p(y_t | x_t, θ) p(u_{t−1}) p(u_t),

where p(u_t) = N(u_t; 0, I_{(m−1)d}). For t = 1, p̂^{(m)}_{u_{t−1}}(x_t | x_{t−1}, θ) is replaced by p̂^{(m)}_{u_0}(x_1 | θ). The full conditional for the remaining end point is given by

π^{(m)}(x_n, u_{n−1} | x_{n−1}, y_n, θ) ∝ p̂^{(m)}_{u_{n−1}}(x_n | x_{n−1}, θ) p(y_n | x_n, θ) p(u_{n−1}).

We sample from each FCD using a Metropolis-within-Gibbs step. For each t = 1, ..., n − 1, we use a proposal density of the form

q(x′_t, u′_{(t−1,t)} | x_t, u_{(t−1,t)}) = q(x′_t | x_t) K(u′_{(t−1,t)} | u_{(t−1,t)}),

where K(u′_{(t−1,t)} | u_{(t−1,t)}) = K(u′_{t−1} | u_{t−1}) K(u′_t | u_t).
Hence, the innovations are updated using a Crank-Nicolson kernel. The end-point proposal is defined similarly, with q(x′_n, u′_{n−1} | x_n, u_{n−1}) = q(x′_n | x_n) K(u′_{n−1} | u_{n−1}). We recall that the Crank-Nicolson kernel satisfies detailed balance with respect to the innovation density to arrive at the acceptance probabilities

α(x′_t, u′_{(t−1,t)} | x_t, u_{(t−1,t)}, x_{t−1}, x_{t+1}, θ) = min{1, [p̂^{(m)}_{u′_{t−1}}(x′_t | x_{t−1}, θ)/p̂^{(m)}_{u_{t−1}}(x_t | x_{t−1}, θ)] × [p̂^{(m)}_{u′_t}(x_{t+1} | x′_t, θ)/p̂^{(m)}_{u_t}(x_{t+1} | x_t, θ)] × [p(y_t | x′_t, θ)/p(y_t | x_t, θ)] × [q(x_t | x′_t)/q(x′_t | x_t)]}   (21)

for t = 1, ..., n −
1. For the end-point update, the acceptance probability is

α(x′_n, u′_{n−1} | x_n, u_{n−1}, x_{n−1}, θ) = min{1, [p̂^{(m)}_{u′_{n−1}}(x′_n | x_{n−1}, θ)/p̂^{(m)}_{u_{n−1}}(x_n | x_{n−1}, θ)] × [p(y_n | x′_n, θ)/p(y_n | x_n, θ)] × [q(x_n | x′_n)/q(x′_n | x_n)]}.   (22)

It is evident that the innovations (u_0, ..., u_{n−1}) are updated twice per Gibbs iteration. We note that a scheme that only updates these innovations once per Gibbs iteration is also possible, but we eschew this approach in favour of the above, which promotes better exploration of the innovation variable space. Further tuning considerations are discussed in Section 3.5.

We refer to the resulting inference scheme as augmented CPMMH (aCPMMH). The scheme is summarised by Algorithm 4. Note that the components of the latent process at the observation times (x^o) are updated in two sweeps; the first for t = 1, 3, ..., n − 1 and the second for t = 2, 4, ..., n − 2 (if n is even). Updating in this way allows for embarrassingly parallel operations over t (at steps 2, 3 and 4).

Consider aCPMMH with N = 1 particle and ρ = 0. In this case aCPMMH exactly coincides with the modified innovation scheme introduced by Golightly and Wilkinson (2008) (see also Golightly and Wilkinson, 2010; Papaspiliopoulos et al., 2013; Fuchs, 2013; van der Meulen and Schauer, 2017). We note that for this choice of N there is a one-to-one correspondence between the innovations u and the latent path x. Hence, step 2 of the Gibbs sampler in Section 3.3 is equivalent to directly updating the latent path x in blocks of size 2m −
1. To make this clear, consider updating x_{(t−1,t+1)}. Upon substituting (16) into the acceptance probability in (21) we obtain

min{1, [p_e(x′_{(t−1,t]} | x_{t−1}, θ)/p_e(x_{(t−1,t]} | x_{t−1}, θ)] × [p_e(x′_{(t,t+1]} | x′_t, θ)/p_e(x_{(t,t+1]} | x_t, θ)] × [g(x_{(t−1,t)} | x_{t−1}, x_t, θ)/g(x′_{(t−1,t)} | x_{t−1}, x′_t, θ)] × [g(x_{(t,t+1)} | x_t, x_{t+1}, θ)/g(x′_{(t,t+1)} | x′_t, x_{t+1}, θ)] × [p(y_t | x′_t, θ)/p(y_t | x_t, θ)] × [q(x_t | x′_t)/q(x′_t | x_t)]},

Algorithm 4 Augmented CPMMH scheme (aCPMMH)
Input: parameter and latent values (θ^(0), x^{o,(0)}), correlation parameter ρ, number of importance samples N and the number of iterations n_iters.

1. Initialise. Obtain p̂^{(m)}_{u^(0)}(x^{o,(0)} | θ^(0)) by running Algorithm 3 for t = 0, ..., n − 1. Set the iteration counter i = 1.

2. Update parameters θ.
(a) Draw θ′ ~ q(· | θ^(i−1)) and compute p̂^{(m)}_{u^(i−1)}(x^{o,(i−1)} | θ′) by running Algorithm 3 for t = 0, ..., n − 1.
(b) With probability α(θ′ | θ^(i−1), u^(i−1), x^{o,(i−1)}) given by (20), put θ^(i) = θ′; otherwise store the current value θ^(i) = θ^(i−1).

3. Update (x_t, u_{(t−1,t)}), t = 1, 3, ..., n − 1.
(a) Draw x′_t ~ q(· | x^(i−1)_t) and u′_{(t−1,t)} ~ K(· | u^(i−1)_{(t−1,t)}). Compute p̂^{(m)}_{u′_{t−1}}(x′_t | x^(i−1)_{t−1}, θ^(i)) and p̂^{(m)}_{u′_t}(x^(i−1)_{t+1} | x′_t, θ^(i)) by running iterations t − 1 and t of Algorithm 3.
(b) With probability α(x′_t, u′_{(t−1,t)} | x^(i−1)_t, u^(i−1)_{(t−1,t)}, x^(i−1)_{t−1}, x^(i−1)_{t+1}, θ^(i)) given by (21), put x^(i)_t = x′_t and u^(i)_{(t−1,t)} = u′_{(t−1,t)}; otherwise store the current values x^(i)_t = x^(i−1)_t and u^(i)_{(t−1,t)} = u^(i−1)_{(t−1,t)}.

4. Update (x_t, u_{(t−1,t)}), t = 2, 4, ..., n − 2.
(a) Draw x′_t ~ q(· | x^(i−1)_t) and u′_{(t−1,t)} ~ K(· | u^(i)_{(t−1,t)}). Compute p̂^{(m)}_{u′_{t−1}}(x′_t | x^(i)_{t−1}, θ^(i)) and p̂^{(m)}_{u′_t}(x^(i)_{t+1} | x′_t, θ^(i)) by running iterations t − 1 and t of Algorithm 3.
(b) With probability α(x′_t, u′_{(t−1,t)} | x^(i−1)_t, u^(i)_{(t−1,t)}, x^(i)_{t−1}, x^(i)_{t+1}, θ^(i)) given by (21), put x^(i)_t = x′_t and u^(i)_{(t−1,t)} = u′_{(t−1,t)}; otherwise store the current value x^(i)_t = x^(i−1)_t (and u^(i)_{(t−1,t)} remains unchanged).

5. Update (x_n, u_{n−1}).
(a) Draw x′_n ~ q(· | x^(i−1)_n) and u′_{n−1} ~ K(· | u^(i)_{n−1}). Compute p̂^{(m)}_{u′_{n−1}}(x′_n | x^(i)_{n−1}, θ^(i)) by running iteration n − 1 of Algorithm 3.
(b) With probability α(x′_n, u′_{n−1} | x^(i−1)_n, u^(i)_{n−1}, x^(i)_{n−1}, θ^(i)) given by (22), put x^(i)_n = x′_n and u^(i)_{n−1} = u′_{n−1}; otherwise store the current value x^(i)_n = x^(i−1)_n (and u^(i)_{n−1} remains unchanged).

6. If i = n_iters, stop. Otherwise, set i := i + 1 and go to step 2.

Output: θ^(1), ...
, θ^(n_iters), x^{o,(1)}, ..., x^{o,(n_iters)}.

The above acceptance probability corresponds to a MH step that uses a RWM proposal to obtain x′_t and then, conditional on this value, uses the bridge construct to propose x′_{(t−1,t)} and x′_{(t,t+1)}. Step 1 of the Gibbs sampler is equivalent to the reparameterisation used by the modified innovation scheme. Rather than update θ conditional on x (and y), the innovations u are the effective component being conditioned on. The motivation for this reparameterisation is to break down the well known problematic dependence between θ and x (Roberts and Stramer, 2001). To make the connection clear, note that upon combining (18) with (16) and substituting the result into the acceptance probability in (20), we obtain

min{1, [π(θ′)/π(θ)] × ∏_{t=0}^{n−1} [p_e(x_{(t,t+1]} | x_t, θ′)/p_e(x_{(t,t+1]} | x_t, θ)] × ∏_{t=0}^{n−1} [g(x_{(t,t+1)} | x_t, x_{t+1}, θ)/g(x_{(t,t+1)} | x_t, x_{t+1}, θ′)] × [p(y | x^o, θ′)/p(y | x^o, θ)] × [q(θ | θ′)/q(θ′ | θ)]}.

It is straightforward to show that {∏_{t=0}^{n−1} g(x_{(t,t+1)} | x_t, x_{t+1}, θ)}^{−1} is the Jacobian associated with the change of variables (from x to u), and therefore the above acceptance probability coincides with that obtained for the parameter update in the modified innovation scheme (see e.g. page 14 of Golightly and Wilkinson, 2010).

For N = 1 and 0 < ρ <
1, aCPMMH can be seen as an extension of the modified innovation scheme that uses a Crank-Nicolson proposal for the innovations. A recent application can be found in Arnaudon et al. (2020). We assess the performance of aCPMMH for different values of ρ and N in Section 4.

3.5 Initialisation and tuning

Recall that both CPMMH and PMMH require setting the number of particles N and, if using a random walk Metropolis (RWM) proposal, a suitable innovation variance. Practical advice on choosing N for (C)PMMH is discussed at the end of Sections 2.1 and 2.2. For a RWM proposal of the form q(θ′ | θ) = N(θ′; θ, Ω), a rule of thumb for the innovation variance Ω is to take Ω = (2.562²/p) var̂(θ | y) (Sherlock et al., 2015), which could be obtained from an initial pilot run (such as that required to find a plausible θ value for subsequently choosing N).

For (C)PMMH, both the pilot and main monitoring runs require careful initialisation of θ (Owen et al., 2015). The aCPMMH scheme additionally requires initialisation of x^o, with poor choices likely to slow initial convergence of the Gibbs sampler. One possibility is to seek an approximation to π^{(m)}(θ, x^o | y), denoted π^{(a)}(θ, x^o | y), for which samples can be obtained (e.g. via MCMC) at relatively low computational cost. These samples can then be used to compute estimates Ê(θ | y) and Ê(x_t | y), which can be used to initialise aCPMMH. Further, the proposal variances for θ and x_t can be made proportional to the estimates var̂(θ | y) and var̂(x_t | y) respectively, which can also be computed from the samples. For SDE models of the form (1), the linear noise approximation (LNA) (Stathopoulos and Girolami, 2013; Fearnhead et al., 2014) provides a tractable Gaussian approximation.
We describe the LNA, its solution and sampling of π^{(a)}(θ, x^o | y) in Appendix A. In scenarios where using the LNA is not practical, we suggest initialising a pilot run of aCPMMH with x^o = y (if d_o = d, so that all components are observed) or sampling x^o via recursive application of the bridge construct in (7). The pilot run can be used to obtain further quantities required for tuning the proposal densities q(θ′ | θ) and q(x′_t | x_t). Hence, our initialisation and tuning advice can be summarised by the following two options:

1. Perform a short pilot run of an MCMC scheme targeting π^{(a)}(θ, x^o | y) (as described in Appendix A for the LNA) to obtain the estimates Ê(θ | y), Ê(x_t | y), var̂(θ | y) and var̂(x_t | y). These quantities are used to initialise the main monitoring run of aCPMMH and in the RWM proposals for θ and the components of x^o.

2. Perform a short pilot run of aCPMMH with x^o initialised at the data y (if all SDE components are observed) or, in the case of incomplete observation of all SDE components, recursively draw from (7), retaining only those values at the observation times. Compute estimates as in option 1, for use in the main monitoring run.

The length of the pilot run can be set by choosing a fraction of the main monitoring run to fix the overall computational budget. For simplicity, we use RWM proposals in the pilot runs with diagonal innovation variances chosen to obtain (approximately) a desired acceptance rate. We note that Option 2 additionally requires specifying an initial number of particles N for the pilot run. In our experiments, we find that N = 1 is often sufficient. For either option, the number of particles can be further tuned if desired, before the main monitoring run.

4 Applications

We consider three applications of increasing complexity. All algorithms are coded in R and were run on a desktop computer with an Intel quad-core CPU.
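Under either option, the pilot samples feed directly into the RWM proposal for θ; the following sketch assumes the rule-of-thumb scaling Ω = (2.562²/p) var̂(θ | y) of Sherlock et al. (2015) (the function name and interface are ours):

```python
import numpy as np

def rwm_proposal(theta, pilot_samples, rng):
    """Propose theta' ~ N(theta, Omega), with Omega = (2.562^2 / p) * var-hat(theta | y)
    and the posterior covariance estimated from pilot-run samples (rows = draws)."""
    p = len(theta)
    Omega = np.atleast_2d((2.562 ** 2 / p) * np.cov(pilot_samples, rowvar=False))
    return theta + np.linalg.cholesky(Omega) @ rng.normal(size=p)
```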
For all experiments, we compare the performance of competing algorithms using the minimum (over each parameter chain and, for aCPMMH, state chain) effective sample size per second (mESS/s). Effective sample size (ESS) is the number of independent and identically distributed samples from the target that would produce an estimator with the same variance as the auto-correlated MCMC output. We computed ESS using the R coda package, details of which can be found in Plummer et al. (2006). We report CPU time based on the main monitoring runs and note that the CPU cost of tuning was small relative to the cost of the main run (and typically less than 10% of the reported CPU time). For all experiments (unless stated otherwise) we used a discretisation of ∆τ = 0.

Consider a univariate diffusion process satisfying an Itô SDE of the form

dX_t = (θ₁ − θ₂)X_t dt + √{(θ₁ + θ₂)X_t} dW_t,   (23)

which can be seen as a degenerate case of a Feller square-root diffusion (Feller, 1952). We generated two synthetic data sets consisting of 101 observations at integer times using θ = (0. , . )ᵀ and an initial condition of X₀ = 25. The observation model is Y_t ∼ N(X_t, σ²), where σ ∈ {1, 5}, giving data sets designated as D1 and D2 respectively (and shown in Figure 1). We took independent N(0, ) priors for each log θ_i, i = 1, 2, and work on the logarithmic scale when using the random walk proposal mechanism.

We ran aCPMMH for 50K iterations with ρ fixed at 0.99. We report results for N = 1 particle, since N > 1 gave no increase in overall efficiency. The θ₁ and θ₂ values are similar, for which the assumption that fluctuations about the mean of X_t are small (as is necessary for an accurate LNA) is unreasonable. Nevertheless, we were still able to use the LNA to adequately initialise and tune aCPMMH. We also note that both initialisation and tuning options give comparable results.

It is evident from Table 1 that aCPMMH offers substantial improvements in overall efficiency compared to PMMH and, to a lesser extent, CPMMH. Minimum effective sample size per second for PMMH : CPMMH : aCPMMH scales as 1 : 2.7 : 6.0 for D1 and 1 : 2.5 : 2.5 for D2. We found that as the measurement error variance (σ²) is increased, the optimal number of particles N for both PMMH and CPMMH also increased. aCPMMH, by contrast, required only N = 1 (that is, we observed no additional improvement in overall efficiency for N > 1).

Data set Algorithm ρ N
CPU(s) mESS (θ, x^o) mESS/s Rel.
D1 (σ = 1):
aCPMMH (1)  0.99   1    777   (3194, 3818)  4.11  6.0
aCPMMH (2)  0.99   1    780   (3080, 3757)  3.95  5.8
CPMMH       0.99   2    745   (1358, –)     1.82  2.7
PMMH        0.00  10   2217   (1513, –)     0.68  1.0
D2 (σ = 5):
aCPMMH (1)  0.99   1    775   (829, 481)    0.62  2.5
aCPMMH (2)  0.99   1    770   (799, 450)    0.58  2.3
CPMMH       0.99   6   1684   (1050, –)     0.62  2.5
PMMH        0.00  20   4989   (1254, –)     0.25  1.0

Table 1: Birth–Death model. Number of particles N, correlation parameter ρ, CPU time (in seconds), minimum ESS (over θ and x^o chains), minimum ESS per second and relative (to PMMH) minimum ESS per second. All results are based on 5 × 10⁴ iterations of each scheme.

Figure 1: Birth–Death model. Data sets (circles) and summaries (mean and 95% credible intervals obtained from the output of aCPMMH) of the within-sample predictive π(y | D1) (left) and π(y | D2) (right).

Finally, although the Euclidean sorting algorithm used in CPMMH is likely to be effective for this simple univariate example, we anticipate its deterioration in subsequent examples with increasing state dimension.

The Lotka-Volterra system describes the time-course behaviour of two interacting species: prey X_{1,t} and predators X_{2,t}. The stochastic differential equation describing the dynamics of X_t = (X_{1,t}, X_{2,t})ᵀ is given by

d(X₁, X₂)ᵀ = ( θ₁X₁ − θ₂X₁X₂ , θ₂X₁X₂ − θ₃X₂ )ᵀ dt
             + ( θ₁X₁ + θ₂X₁X₂   −θ₂X₁X₂
                 −θ₂X₁X₂          θ₂X₁X₂ + θ₃X₂ )^{1/2} d(W₁, W₂)ᵀ   (24)

after suppressing dependence on t.

We repeated the experiments of Golightly et al. (2019) which, for this example, involved three synthetic data sets generated with θ = (0. , . , . )ᵀ and an initial condition of X₀ = (100, )ᵀ.

Figure 2: Birth–Death model. Marginal posterior distributions using data set D and based on the output of aCPMMH (solid lines) and the LNA (dashed lines).
The true values of θ₁, θ₂ and θ₁ − θ₂ are indicated.

The observation model is Y_t ∼ N(X_t, σ²I₂), where I₂ is the 2 × 2 identity matrix and σ ∈ {1, 5, 10}, giving data sets designated as D1, D2 and D3 respectively. Data set D is shown in Figure 3, and gives dynamics typical of the parameter choice taken. The parameters correspond to the rates of prey reproduction, prey death and predator reproduction, and predator death. As the parameters must be strictly positive, we work on the logarithmic scale with independent N(0, ) priors assumed for each log θ_i, i = 1, 2, 3. The main monitoring runs consisted of 10⁵ iterations of aCPMMH, CPMMH (with ρ = 0.99) and PMMH. Note that aCPMMH used random walk proposals in the log θ and x^o updates, with variances obtained from the output of an MCMC pilot run based on the LNA, which was also used to initialise θ and x^o.

From Figure 4 we see that aCPMMH gives parameter posterior output that is consistent with the ground truth (and also with the output of PMMH and CPMMH – not shown). In this case, the LNA gives accurate output when used as an inferential model. We compare the efficiency of PMMH, CPMMH and aCPMMH in Table 2. We found that N = 1 was sufficient for aCPMMH but also include results for N = 2, which gave a small increase in minimum ESS but a decrease in overall efficiency, due to the increase (doubling) in CPU time. It is clear that as σ increases, PMMH and CPMMH require an increase in N to maintain a reasonable minimum ESS. Consequently, their performance degrades. Although the statistical efficiency (mESS) of aCPMMH reduces as σ increases, the reduction is gradual (compared to that of CPMMH) and we see an increase in overall efficiency of aCPMMH (with ρ = 0.99) of an order of magnitude over PMMH in all experiments, and over CPMMH for data sets D2 and D3. We also include the output of aCPMMH with N = 1 and ρ = 0.0, corresponding to the modified innovation scheme of Golightly and Wilkinson (2008) (and as discussed in Section 3.4). Although this approach works well compared to CPMMH and PMMH, and gives well-mixing parameter chains, we see a decrease in mESS (relative to aCPMMH with ρ = 0.99) calculated from the x^o chains, and this relative difference increases as σ increases.

Data set Algorithm ρ N CPU(s) mESS (θ, x^o) mESS/s Rel.
D1 (σ = 1):
aCPMMH  0.99   1    7330   (9375, 9483)   1.279  27.8
aCPMMH  0.99   2   12773   (9446, 12485)  0.740  16.1
aCPMMH  0.00   1    6877   (8805, 2028)   0.299   6.5
CPMMH   0.99   3   11280   (8023, –)      0.711  16.3
PMMH    0.00  16   59730   (2771, –)      0.046   1.0
D2 (σ = 5):
aCPMMH  0.99   1    6780   (7331, 6807)   1.004  25.7
aCPMMH  0.99   2   12807   (7877, 7117)   0.556  14.2
aCPMMH  0.00   1    6769   (8022, 1380)   0.204   5.2
CPMMH   0.99   8   29780   (3681, –)      0.124   3.2
PMMH    0.00  20   75930   (2959, –)      0.039   1.0
D3 (σ = 10):
aCPMMH  0.99   1    6772   (4986, 3301)   0.487  16.8
aCPMMH  0.99   2   12753   (5859, 3446)   0.270   9.3
aCPMMH  0.00   1    6786   (4676, 1384)   0.203   7.0
CPMMH   0.99  19   71520   (3516, –)      0.049   1.7
PMMH    0.00  28  105770   (3031, –)      0.029   1.0

Table 2: Lotka-Volterra model. Number of particles N, correlation parameter ρ, CPU time (in seconds), minimum ESS (over θ and x^o chains), minimum ESS per second and relative (to PMMH) minimum ESS per second. All results are based on 10⁵ iterations of each scheme.

Finally, we compare the performance of CPMMH and aCPMMH when parallelised over two cores. For aCPMMH (and as discussed at the end of Section 3.3), operations over t in steps 2, 3 and 4 of Algorithm 4 can be performed in parallel. For CPMMH, we perform the propagate step of the particle filter (step 2(b) of Algorithm 1) in parallel. Figure 5 shows the difference (2 cores vs 1 core) in log CPU times (denoted ∆ log CPU) against log ∆τ, where the discretisation level is ∆τ ∈ { − , − , − , − }; for a perfect speed up from the use of two cores, this would be −1. Results based on aCPMMH used N = 1 in all cases, whereas CPMMH used N = 3, N = 8 and N = 18. These values correspond to the numbers of particles required for each synthetic data set. For aCPMMH, we see that using 2 cores is beneficial for ∆τ ≤ − . For CPMMH and N = 1, there is almost no benefit in a multi-core approach (and CPU time using 2 cores is typically higher than with a single core). This is unsurprising given the resampling steps (performed in serial) between the propagate steps. As N increases, the benefit of using 2 cores can be seen.

A commonly used and well-studied mechanism for auto-regulation in prokaryotes is a negative feedback mechanism whereby dimers of a protein repress its own transcription (e.g. Arkin et al., 1998). A simplified model for such prokaryotic auto-regulation, based on this mechanism, can be found in Golightly and Wilkinson (2005) (see also Golightly and Wilkinson, 2011). We consider the SDE representation of the dynamics of the key species involved in this mechanism. These are
RNA, P, P₂ and DNA, denoted as X₁, X₂, X₃ and X₄ respectively. The SDE takes the form (1) where

α(X_t, θ) = S h(X_t, θ),   β(X_t, θ) = S diag{h(X_t, θ)} Sᵀ,

the stoichiometry matrix S is

S = (  0  0  1  0   0   0  −1   0
       0  0  0  1  −2   2   0  −1
      −1  1  0  0   1  −1   0   0
      −1  1  0  0   0   0   0   0 ),

and the hazard function h(X_t, θ) is

h(X, θ) = ( 0. X₄X₃ , θ₁(10 − X₄) , θ₂X₄ , 0. X₁ , 0. X₂(X₂ − 1)/2 , θ₃X₃ , θ X₁ , θ₄X₂ )ᵀ
after dropping t to ease the notation. Further details regarding the derivation of the SDE can be found in Golightly and Wilkinson (2005).

Figure 3: Lotka-Volterra model. Data set D (circles) and summaries (mean and 95% credible intervals obtained from the output of aCPMMH) of the within-sample predictive π(y | D) (left: prey component, right: predator component).

Figure 4: Lotka-Volterra model. Marginal posterior distributions using data set D and based on the output of aCPMMH (solid lines) and the LNA (dashed lines). The true values of θ₁, θ₂ and θ₃ are indicated.

The parameters θ = (θ₁, θ₂, θ₃, θ₄)ᵀ correspond to the rate of protein unbinding at an operator site, the rate of transcription of a gene into mRNA, the rate at which protein dimers disassociate and the rate at which protein molecules degrade. We generated a single synthetic data set with θ = (0. , . , . , . )ᵀ and an initial condition of X₀ = (8, , , )ᵀ. The observation model is Y_t ∼ N(X_t, Σ), where Σ is a diagonal matrix with elements 1, , , 0.25. The data are shown in Figure 6. Independent U(− , 5) priors were assumed for each log θ_i, i = 1, 2, 3,
4. A short MH run was performed using the LNA, to obtain estimates of var(log θ | y) and var(x_t | y) (to be used as the innovation variances of the random walk proposal mechanisms in (a)CPMMH) and plausible values of θ and x^o (to be used to initialise the main monitoring runs of (a)CPMMH). Pilot runs of aCPMMH and CPMMH suggested taking N = 1 and N = 20 for each respective scheme. We then ran aCPMMH and CPMMH for 10⁵ iterations with these tuning choices. Table 3 and Figure 6 summarise our findings.

Figure 5: Lotka-Volterra model. Difference (2 cores vs 1 core) in log CPU times (∆ log CPU) against log ∆τ, using aCPMMH with N = 1 (solid lines) and CPMMH (dashed lines) with N = 3 (data set D1, left), N = 8 (data set D2, centre) and N = 18 (data set D3, right).

Algorithm ρ N CPU(s) mESS (θ, x^o) mESS/s Rel.
aCPMMH  0.99   1   18248   (992, 5524)   0.054  13.5
aCPMMH  0.99   2   41578   (766, 6210)   0.018   4.6
aCPMMH  0.00   1   18252   (1358, 866)   0.028   6.9
CPMMH   0.99  20  199782   (805, –)      0.004   1.0

Table 3: Autoregulatory model. Number of particles N, correlation parameter ρ, CPU time (in seconds), minimum ESS (over θ and x^o chains), minimum ESS per second and relative (to CPMMH) minimum ESS per second. All results are based on 10⁵ iterations of each scheme.

It is clear that aCPMMH with ρ = 0.99 results in a considerable improvement in statistical efficiency over aCPMMH with ρ = 0; in particular, the minimum ESS (over the x^o chains) is almost an order of magnitude higher for ρ = 0.99 (866 vs 5524). An improvement in overall efficiency of aCPMMH over CPMMH is evident, irrespective of the choice of ρ. Increasing N to 2 results in better mixing of the x^o chains, but no appreciable increase in minimum ESS over all chains.

Given observations at discrete times, performing fully Bayesian inference for the parameters governing nonlinear, multivariate stochastic differential equations is a challenging problem. A discretisation approach allows inference for a wide class of SDE models, at the cost of introducing an additional bias (the so-called discretisation bias). The simplest such approach uses the Euler-Maruyama approximation in combination with intermediate time points between observations, allowing a time step, chosen by the practitioner, that should trade off computational cost and accuracy. It is worth emphasising that although a Gaussian transition density is assumed, the mean and variance are typically nonlinear functions of the diffusion process and, consequently, the data likelihood (after integrating out the process at intermediate times) remains intractable, even under the assumption of additive Gaussian noise.

These remarks motivate the use of pseudo-marginal Metropolis-Hastings (PMMH) schemes,
which replace an evaluation of the intractable likelihood with a realisation of an unbiased estimator, obtained from a single run of a particle filter over the dynamic states (Andrieu et al., 2010). It is crucial that the number of particles is carefully chosen to balance computational efficiency whilst allowing for reasonably accurate likelihood estimates. Inducing strong and positive correlation between successive likelihood estimates can reduce the variance of the acceptance ratio, permitting fewer particles (Dahlin et al., 2015; Deligiannidis et al., 2018). Essentially, the (assumed Gaussian) innovations that are used to construct the likelihood estimates are updated with a Crank-Nicolson (CN) proposal. The resampling steps in the particle filter are also modified in order to preserve correlation; the random numbers used during this step are included in the CN update, and the particle trajectories are sorted before resampling takes place at the next time point. We follow Choppala et al. (2016) and Golightly et al. (2019) by using a simple Euclidean sorting procedure based on the state of the particle trajectory at the current observation time. We find that the effectiveness of this correlated PMMH (CPMMH) approach degrades as the observation variance and state dimension increase.

Figure 6: Autoregulatory model. Data set (circles) and summaries (mean and 95% credible intervals obtained from the output of aCPMMH) of the within-sample predictive π(y | D).

Our novel approach avoids the use of resampling steps by updating parameters conditional on the values of the latent diffusion process at the observation times (and the observations themselves), whilst integrating over the state uncertainty at the intermediate times. An additional step is then used to update the latent process at the observation times, conditional on the parameters and data. The resulting algorithm can be seen as a pseudo-marginal scheme, with unbiased estimators of the likelihood terms obtained via importance sampling. We further block together the updating of the latent states and the innovations used to construct the likelihood estimates, and adopt a CN proposal mechanism for the latter. We denote the resulting sampler as augmented, correlated PMMH (aCPMMH). A related approach is given by Fearnhead and Meligkotsidou (2016), who use particle MCMC with additional latent variables, carefully chosen to trade off the error in the particle filter against the mixing of the MCMC steps. We emphasise that, unlike this approach, the motivation behind aCPMMH is to avoid use of a particle filter altogether, the benefits of which are twofold: positive correlation between successive likelihood estimates is preserved, and the method for obtaining these likelihood estimates can be easily parallelised over observation intervals. Section 4.2 shows that once the updating over an inter-observation interval is sufficiently costly, substantial gains can be obtained by parallelising this task over the different inter-observation intervals: this could be useful for stiff SDEs, high-dimensional SDEs, or if multiple inter-observation intervals are tackled by a single importance sampler.

Figure 7: Autoregulatory model. Marginal posterior distributions based on the output of aCPMMH (solid lines) and the LNA (dashed lines). The true parameter values are indicated.

In addition to the tuning choices required by CPMMH (that is, the number of particles N, the correlation parameter ρ in the CN proposal and the parameter proposal mechanism), aCPMMH requires initialisation of the latent process at the observation times and a suitable proposal mechanism. If a computationally cheap approximation of the joint posterior can be found, this may be used to initialise and tune aCPMMH.
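The parallelisation referred to above exploits the fact that, conditional on (θ, x^o), the likelihood estimate is a product of independent per-interval terms. A toy sketch of that structure (threads for brevity; the per-interval term here is a stand-in Gaussian contribution, not the bridge-based importance sampling estimator used by aCPMMH):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def interval_log_weight(args):
    """Stand-in for one interval's importance-sampling log-likelihood estimate."""
    t, x_start, x_end = args
    # In aCPMMH this would integrate over the latent path on (t, t + 1] via a
    # bridge-based importance sampler; a dummy Gaussian term shows the structure.
    return -0.5 * (x_end - x_start) ** 2 - 0.5 * math.log(2.0 * math.pi)

def log_lik_estimate(x_obs, parallel=True):
    """Sum independent per-interval contributions, optionally computed concurrently."""
    tasks = [(t, x_obs[t], x_obs[t + 1]) for t in range(len(x_obs) - 1)]
    if parallel:
        with ThreadPoolExecutor() as pool:
            contributions = list(pool.map(interval_log_weight, tasks))
    else:
        contributions = [interval_log_weight(task) for task in tasks]
    return sum(contributions)

x_obs = [0.0, 1.0, 3.0, 2.5]  # illustrative latent values at the observation times
```

Because no resampling couples the intervals, the parallel and serial evaluations return identical totals; a particle filter, by contrast, must resample serially between propagate steps.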
To this end, we found that use of a linear noise approximation (LNA) can work well, even in settings where inferences made under the LNA are noticeably discrepant compared to those obtained under the SDE. In scenarios where use of the LNA is not practical, a pilot run of aCPMMH can be used instead.

We compared the performance of aCPMMH with both PMMH and CPMMH using three examples of increasing complexity. In terms of overall efficiency (as measured by minimum effective sample size per second), aCPMMH offered an increase of up to a factor of 28 over PMMH. We obtained comparable performance with CPMMH for a univariate SDE application, and an increase of up to factors of 10 and 14 in two applications involving SDEs of dimension 2 and 4 respectively. Our experiments suggest that although the mixing efficiency of aCPMMH increases with N, the additional computational cost results in little benefit (in terms of overall efficiency) over using N = 1. A special case of aCPMMH (when ρ = 0 and N = 1) is the modified innovation scheme of Golightly and Wilkinson (2008), which is typically outperformed (in terms of overall efficiency) with ρ > 0. When N = 1, it is possible to directly calculate the required log-likelihood gradient. Although we have focussed on updating the latent states in separate blocks (single-site updating), other blocking schemes may offer improved mixing efficiency. Alternatively, it might be possible to reduce the number of latent variables, for example, by only explicitly including latent states in the joint posterior at every (say) kth observation instant. The success of such a scheme is likely to depend on the accuracy of an importance sampler that covers k observations, and on whether or not the resulting likelihood estimates can be made sufficiently correlated. This is the subject of ongoing work.

A Linear noise approximation (LNA)
The linear noise approximation (LNA) provides a tractable approximation to the SDE in (1). We provide brief, intuitive details of the LNA and its implementation, and refer the reader to Fearnhead et al. (2014) (and the references therein) for an in-depth treatment.
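As a concrete illustration of this tractability: for a scalar SDE with linear drift, dX_t = (θ₁ − θ₂X_t) dt + σ dW_t, the LNA is exact, and the approximating mean and variance solve the scalar versions of the ODEs (26) and (33) derived below, which can be checked against closed-form solutions. A sketch under these assumptions (Python; names are ours):

```python
import math

def lna_moments(theta1, theta2, sigma, x0, t_end, dt=1e-4):
    """Euler-integrate the LNA mean/variance ODEs for
    dX_t = (theta1 - theta2 * X_t) dt + sigma dW_t:
        d(eta)/dt = theta1 - theta2 * eta,  eta(0) = x0
        dV/dt     = 2 * H * V + sigma**2,   V(0) = 0, with Jacobian H = -theta2.
    """
    eta, V = float(x0), 0.0
    for _ in range(int(round(t_end / dt))):
        eta += (theta1 - theta2 * eta) * dt
        V += (2.0 * (-theta2) * V + sigma ** 2) * dt
    return eta, V

eta, V = lna_moments(theta1=1.0, theta2=0.5, sigma=0.3, x0=4.0, t_end=1.0)
# For this linear SDE the LNA is exact:
#   eta(t) = theta1/theta2 + (x0 - theta1/theta2) * exp(-theta2 * t)
#   V(t)   = sigma^2 / (2 * theta2) * (1 - exp(-2 * theta2 * t))
```

For nonlinear drift or diffusion the same ODE system applies with H_t the Jacobian of the drift evaluated along the deterministic path, but the approximation is then no longer exact.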
A.1 Derivation and solution
Partition X_t as

X_t = η_t + R_t,   (25)

where {η_t, t ≥ 0} is a deterministic process satisfying the ODE

dη_t/dt = α(η_t, θ), η₀ = x₀,   (26)

and {R_t, t ≥ 0} is a residual stochastic process. The residual process (R_t) satisfies

dR_t = {α(X_t, θ) − α(η_t, θ)} dt + √β(X_t, θ) dW_t,   (27)

which will typically be intractable. A tractable approximation can be obtained by Taylor expanding α(X_t, θ) and β(X_t, θ) about η_t. Retaining the first two terms in the expansion of α and the first term in the expansion of β gives

dR̂_t = H_t R̂_t dt + √β(η_t, θ) dW_t,   (28)

where H_t is the Jacobian matrix with (i, j)th element

(H_t)_{i,j} = ∂α_i(η_t, θ)/∂η_{j,t}.   (29)

The motivation for the LNA is an underlying assumption that ||X_t − η_t|| is "small", or in other words, that the drift term α(X_t, θ) dominates the diffusion coefficient β(X_t, θ).

Given an initial condition R̂₀ ∼ N(r̂₀, V̂₀), we obtain R̂_t as a Gaussian random variable. The solution requires the d × d fundamental matrix P_t that satisfies the ODE

dP_t/dt = H_t P_t, P₀ = I_d,   (30)

where I_d is the d × d identity matrix. Now let U_t = P_t⁻¹ R̂_t and apply the Itô formula to obtain dU_t = P_t⁻¹ √β(η_t, θ) dW_t. Hence, we may write

U_t = U₀ + ∫₀ᵗ P_s⁻¹ √β(η_s, θ) dW_s,

so that

U_t | U₀ ∼ N{ U₀ , ∫₀ᵗ P_s⁻¹ β(η_s, θ) (P_s⁻¹)ᵀ ds }.   (31)

Therefore, for the initial condition above, we have that

R̂_t | R̂₀ = r̂₀ ∼ N(P_t r̂₀, P_t ψ_t P_tᵀ), where ψ_t = V̂₀ + ∫₀ᵗ P_s⁻¹ β(η_s, θ) (P_s⁻¹)ᵀ ds.

Setting m_t = P_t r̂₀ and V_t = P_t ψ_t P_tᵀ gives X_t | X₀ ∼ N(η_t + m_t, V_t), where η_t, m_t and V_t satisfy the coupled ODE system consisting of (26) and

dm_t/dt = H_t m_t, m₀ = r̂₀,   (32)
dV_t/dt = V_t H_tᵀ + β(η_t, θ) + H_t V_t, V₀ = V̂₀.   (33)

In the absence of an analytic solution, this system of coupled ODEs must be solved numerically. Note that if η₀ = x₀, so that r̂₀ = 0, then m_t = 0 for all times t ≥ 0.

A.2 Inference using the LNA
Consider the LNA as an inferential model. The posterior over the parameters and the latent process (at the observation times) is denoted by π^(a)(θ, x^o | y). We sample this posterior in two steps. First, a Metropolis-Hastings scheme is used to target the marginal parameter posterior

π^(a)(θ | y) ∝ π(θ) p^(a)(y | θ),   (34)

where p^(a)(y | θ) is the marginal likelihood under the LNA. Then, a sample x^o is drawn from p^(a)(x^o | θ, y) for each θ sample from step one. Note that p^(a)(y | θ) and p^(a)(x^o | θ, y) are tractable under the LNA.

A forward filter is used to evaluate p^(a)(y | θ). Since the parameters θ remain fixed throughout the calculation of p^(a)(y | θ), we drop them from the notation where possible. Define yᵗ = (y₁, . . . , y_t)ᵀ and suppose that X₁ ∼ N(a, C) a priori. The marginal likelihood p^(a)(y | θ) under the LNA can then be obtained using Algorithm 5. Hence, samples of θ can be obtained from π^(a)(θ | y) by running a Metropolis-Hastings scheme with target (34). Then, for each (thinned) θ draw, x^o can be sampled from p^(a)(x^o | θ, y) using a backward sampler (see Algorithm 6).

References
Andrieu, C., Doucet, A., and Holenstein, R. (2009). Particle Markov chain Monte Carlo for efficient numerical simulation. In L'Ecuyer, P. and Owen, A. B., editors, Monte Carlo and Quasi-Monte Carlo Methods 2008, pages 45–60. Springer-Verlag Berlin Heidelberg.
Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods (with discussion). J. R. Statist. Soc. B, 72(3):269–342.

Algorithm 5
LNA forward filter

1. Initialisation. Compute p̃(y₁) = N(y₁; Fᵀa, FᵀCF + Σ). The posterior at time t = 1 is therefore X₁ | y₁ ∼ N(a₁, C₁), where
a₁ = a + CF(FᵀCF + Σ)⁻¹(y₁ − Fᵀa),
C₁ = C − CF(FᵀCF + Σ)⁻¹FᵀC.
Store the values of a₁ and C₁.

2. For t = 1, . . . , n − 1:

(a) Prediction. Initialise the LNA with η_t = a_t, V_t = C_t and P_t = I_d. Integrate the ODEs (26), (33) and (30) forward to t + 1 to obtain η_{t+1}, V_{t+1} and P_{t+1}. Hence X_{t+1} | yᵗ ∼ N(η_{t+1}, V_{t+1}).

(b) One-step forecast. Using the observation equation (2), we have that Y_{t+1} | yᵗ ∼ N(Fᵀη_{t+1}, FᵀV_{t+1}F + Σ). Compute the updated marginal likelihood
p^(a)(yᵗ⁺¹) = p^(a)(yᵗ) p^(a)(y_{t+1} | yᵗ) = p^(a)(yᵗ) N(y_{t+1}; Fᵀη_{t+1}, FᵀV_{t+1}F + Σ).

(c) Posterior at t + 1. Combining the distributions in (a) and (b) gives the joint distribution of X_{t+1} and Y_{t+1} (conditional on yᵗ) as
(X_{t+1}, Y_{t+1})ᵀ ∼ N{ (η_{t+1}, Fᵀη_{t+1})ᵀ , ( V_{t+1}    V_{t+1}F
                                                   FᵀV_{t+1}  FᵀV_{t+1}F + Σ ) }
and therefore X_{t+1} | yᵗ⁺¹ ∼ N(a_{t+1}, C_{t+1}), where
a_{t+1} = η_{t+1} + V_{t+1}F(FᵀV_{t+1}F + Σ)⁻¹(y_{t+1} − Fᵀη_{t+1}),
C_{t+1} = V_{t+1} − V_{t+1}F(FᵀV_{t+1}F + Σ)⁻¹FᵀV_{t+1}.
Store the values of a_{t+1}, C_{t+1}, η_{t+1}, V_{t+1} and P_{t+1}.

Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient computation. Annals of Statistics, 37:697–725.
Arkin, A., Ross, J., and McAdams, H. H. (1998). Stochastic kinetic analysis of developmental pathway bifurcation in phage λ-infected Escherichia coli cells.
Genetics, 149:1633–1648.
Arnaudon, A., van der Meulen, F., Schauer, M., and Sommer, S. (2020). Diffusion bridges for stochastic Hamiltonian systems with applications to shape analysis. Available from http://arxiv.org/abs/2002.00885.
Chen, N., Giannakis, D., Herbei, R., and Majda, A. J. (2014). An MCMC algorithm for parameter estimation in signals with hidden intermittent instability. SIAM/ASA Journal on Uncertainty Quantification, 2:647–669.

Algorithm 6
LNA backward sampler

1. First draw x_n from X_n | yⁿ ∼ N(a_n, C_n).

2. For t = n − 1, n − 2, . . . , 1:

(a) Joint distribution of X_t and X_{t+1}. Note that X_t | yᵗ ∼ N(a_t, C_t). The joint distribution of X_t and X_{t+1} (conditional on yᵗ) is
(X_t, X_{t+1})ᵀ ∼ N{ (a_t, η_{t+1})ᵀ , ( C_t          C_t P_{t+1}ᵀ
                                          P_{t+1} C_t  V_{t+1} ) }.

(b) Backward distribution. The distribution of X_t | X_{t+1} = x_{t+1}, yᵗ is N(â_t, Ĉ_t), where
â_t = a_t + C_t P_{t+1}ᵀ V_{t+1}⁻¹ (x_{t+1} − η_{t+1}),
Ĉ_t = C_t − C_t P_{t+1}ᵀ V_{t+1}⁻¹ P_{t+1} C_t.
Draw x_t from X_t | X_{t+1} = x_{t+1}, yᵗ ∼ N(â_t, Ĉ_t).

Choppala, P., Gunawan, D., Chen, J., Tran, M.-N., and Kohn, R. (2016). Bayesian inference for state space models using block and correlated pseudo marginal methods. Available from http://arxiv.org/abs/1311.3606.
Dahlin, J., Lindsten, F., Kronander, J., and Schon, T. B. (2015). Accelerating pseudo-marginal Metropolis-Hastings by correlating auxiliary variables. Available from http://arxiv.org/abs/1511.05483v1.
Del Moral, P. (2004). Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer, New York.
Deligiannidis, G., Doucet, A., and Pitt, M. K. (2018). The correlated pseudo-marginal method.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):839–870.
Doucet, A., Pitt, M. K., Deligiannidis, G., and Kohn, R. (2015). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313.
Durham, G. B. and Gallant, R. A. (2002). Numerical techniques for maximum likelihood estimation of continuous time diffusion processes. Journal of Business and Economic Statistics, 20:279–316.
Fearnhead, P., Giagos, V., and Sherlock, C. (2014). Inference for reaction networks using the Linear Noise Approximation. Biometrics, 70:457–466.
Fearnhead, P. and Meligkotsidou, L. (2016). Augmentation schemes for particle MCMC. Statistics and Computing, 26:1293–1306.
Feller, W. (1952). The parabolic differential equations and the associated semi-groups of transformations. Annals of Mathematics, 55:468–519.
Fuchs, C. (2013). Inference for diffusion processes with applications in Life Sciences. Springer, Heidelberg.
Golightly, A., Bradley, E., Lowe, T., and Gillespie, C. S. (2019). Correlated pseudo-marginal schemes for time-discretised stochastic kinetic models. Computational Statistics and Data Analysis, 136:92–107.
Golightly, A. and Wilkinson, D. J. (2005). Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics, 61(3):781–788.
Golightly, A. and Wilkinson, D. J. (2008). Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics and Data Analysis, 52(3):1674–1693.
Golightly, A. and Wilkinson, D. J. (2010). Markov chain Monte Carlo algorithms for SDE parameter estimation. In Lawrence, N. D., Girolami, M., Rattray, M., and Sanguinetti, G., editors, Learning and Inference in Computational Systems Biology. MIT Press.
Golightly, A. and Wilkinson, D. J. (2011). Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo. Interface Focus, 1(6):807–820.
Kalogeropoulos, K., Roberts, G., and Dellaportas, P. (2010). Inference for stochastic volatility models using time change transformations. Annals of Statistics, 38:784–807.
Majda, A. J., Franzke, C., and Crommelin, D. (2009). Normal forms for reduced stochastic climate models. PNAS, 106:3649–3653.
Owen, J., Wilkinson, D. J., and Gillespie, C. S. (2015). Scalable inference for Markov processes with intractable likelihoods. Statistics and Computing, 25:145–156.
Papaspiliopoulos, O., Roberts, G. O., and Stramer, O. (2013). Data augmentation for diffusions. Journal of Computational and Graphical Statistics, 22:665–688.
Picchini, U. and Forman, J. L. (2019). Bayesian inference for stochastic differential equation mixed effects models of a tumour xenography study. Journal of the Royal Statistical Society: Series C, 68:887–913.
Pitt, M. K., dos Santos Silva, R., Giordani, P., and Kohn, R. (2012). On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171(2):134–151.
Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11.
Roberts, G. O. and Stramer, O. (2001). On inference for non-linear diffusion models using Metropolis-Hastings algorithms. Biometrika, 88(3):603–621.
Schauer, M., van der Meulen, F., and van Zanten, H. (2017). Guided proposals for simulating multi-dimensional diffusion bridges. Bernoulli, 23:2917–2950.
Sherlock, C., Thiery, A., Roberts, G. O., and Rosenthal, J. S. (2015). On the efficiency of pseudo-marginal random walk Metropolis algorithms. The Annals of Statistics, 43(1):238–275.
Stathopoulos, V. and Girolami, M. A. (2013). Markov chain Monte Carlo inference for Markov jump processes via the linear noise approximation. Philosophical Transactions of the Royal Society A, 371:20110541.
Stramer, O. and Bognar, M. (2011). Bayesian inference for irreducible diffusion processes using the pseudo-marginal approach. Bayesian Analysis, 6:231–258.
Stramer, O., Shen, X., and Bognar, M. (2017). Bayesian inference for Heston-STAR models. Statistics and Computing, 27:331–348.
Tran, M.-N., Kohn, R., Quiroz, M., and Villani, M. (2016). Block-wise pseudo-marginal Metropolis-Hastings. Available from http://arxiv.org/abs/1603.02485.
van der Meulen, F. and Schauer, M. (2017). Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals. Electronic Journal of Statistics, 11:2358–2396.
Whitaker, G. A., Golightly, A., Boys, R. J., and Sherlock, C. (2017). Improved bridge constructs for stochastic differential equations. Statistics and Computing, 27:885–900.
Wilkinson, D. J. (2018).