A practical guide to pseudo-marginal methods for computational inference in systems biology
David J. Warne∗, Ruth E. Baker, and Matthew J. Simpson

School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland 4001, Australia
Mathematical Institute, University of Oxford, Oxford, OX2 6GG, United Kingdom

January 1, 2020
Abstract
For many stochastic models of interest in systems biology, such as those describing biochemical reaction networks, exact quantification of parameter uncertainty through statistical inference is intractable. Likelihood-free computational inference techniques enable parameter inference when the likelihood function for the model is intractable but the generation of many sample paths is feasible through stochastic simulation of the forward problem. The most common likelihood-free method in systems biology is approximate Bayesian computation, which accepts parameters that result in low discrepancy between stochastic simulations and measured data. However, it can be difficult to assess how the accuracy of the resulting inferences is affected by the choice of acceptance threshold and discrepancy function. The pseudo-marginal approach is an alternative likelihood-free inference method that utilises a Monte Carlo estimate of the likelihood function. This approach has several advantages, particularly in the context of noisy, partially observed, time-course data typical in biochemical reaction network studies. Specifically, the pseudo-marginal approach facilitates exact inference and uncertainty quantification, and may be efficiently combined with particle filters for low variance, high-accuracy likelihood estimation. In this review, we provide a practical introduction to the pseudo-marginal approach using inference for biochemical reaction networks as a series of case studies. Implementations of key algorithms and examples are provided using the Julia programming language, a high performance, open source programming language for scientific computing (https://github.com/davidwarne/Warne2019_GuideToPseudoMarginal).
Keywords: biochemical reaction networks; stochastic differential equations; Markov chain Monte Carlo; Bayesian inference; pseudo-marginal methods.

∗To whom correspondence should be addressed. E-mail: [email protected]

Introduction
Stochastic models are routinely used in systems biology to facilitate the interpretation and understanding of experimental observations. In particular, stochastic models are often more realistic descriptions, compared with their deterministic counterparts, of many biochemical processes that are naturally affected by extrinsic and intrinsic noise (Kærn et al., 2005; Raj and van Oudenaarden, 2008), such as the biochemical reaction pathways that regulate gene expression (Paulsson et al., 2000; Tian and Burrage, 2006). Such stochastic models enable the exploration of various biochemical network motifs to explain particular phenomena observed through the use of modern, high resolution experimental techniques (Sahl et al., 2017). The validation and comparison of theories against observations can be achieved using statistical inference techniques to quantify the uncertainty in unknown parameters and the likelihoods of observations under different models. Recent reviews by Schnoerr et al. (2017) and Warne et al. (2019) highlight the state-of-the-art in computational techniques for simulation of biochemical networks, analysis of the distribution of future states of biochemical systems, and computational inference from a Bayesian perspective. Both studies point out that, for realistic biochemical reaction networks, the likelihood function is intractable. As a result, likelihood-free computational inference schemes are essential for practical situations.

In our previous work (Warne et al., 2019), we provide an accessible discussion of a wide range of algorithms for simulation and inference in the context of biochemical systems and provide example implementations for demonstration purposes. In particular, Warne et al. (2019) highlights the use of approximate Bayesian computation (ABC) (Sisson et al., 2018) for likelihood-free inference of kinetic rate parameters using time-course data.
While ABC is a widely applicable and popular likelihood-free approach within the life sciences (Toni et al., 2009), inferences obtained by this method are, as the name implies, approximations, and the accuracy of these approximations is highly dependent upon choices made by the user (Sunnåker et al., 2013).

Time-course data describing temporal variations in particular molecular signals within living cells are often obtained using time-lapse optical microscopy with fluorescent reporters (Figure 1(A)) (Bar-Joseph et al., 2012; Locke and Elowitz, 2009; Young et al., 2012). Individual cells are tracked (Figure 1(B)) and the luminescence from the reporter is measured over time at discrete intervals (Figure 1(C)). These luminescence values are then used to determine concentrations of mRNAs or proteins that may be associated with the expression of a particular gene over time. These data provide information about the complex dynamics of gene regulatory networks that can result in stochastic switching (Tian and Burrage, 2006) or oscillatory behaviour (Figure 1(C)) (Elowitz and Leibler, 2000; Shimojo et al., 2008).

Common features of gene expression time-course data include sparsity of temporal observations, relatively few concurrent fluorescent reporters, and noisy observations. Therefore, likelihood-free inference methods are essential for statistical inference (Toni et al., 2009). However, complex dynamics observed in real gene regulatory networks, such as stochastic oscillations or bi-stability, can render ABC methods impractical for accurate inference, since most simulations will be rejected even when the values of the model parameters are close to the true values.

Figure 1: Time-course gene expression data. (A) Snapshots of time-lapse microscopy using two fluorescent reporters indicating Hes1 gene expression (green) and BrdU incorporation (red) used to indicate cell-cycle phases. (B) Time-series of the green fluorescent reporter for a tracked single cell over 17 hours in 20 minute intervals. (C) The resulting time-series indicating oscillatory Hes1 gene expression. Panels (A), (B), and (C) are modified with permission from Shimojo et al. (2008).

Pseudo-marginal methods (Andrieu and Roberts, 2009) are an alternative likelihood-free approach that can provide exact inferences under the prescribed model and are significantly less sensitive to user-defined input. Variants of this approach are particularly well suited for Bayesian inference of nonlinear stochastic models using partially observed time-course data (Andrieu et al., 2010). This makes the pseudo-marginal method ideal for the study of biochemical systems; however, to date, few applications of these approaches are present in the systems biology literature (Golightly and Wilkinson, 2008, 2011). While Warne et al. (2019) briefly discuss the pseudo-marginal approach, no examples or implementations are provided.

The purpose of this review is to complement Warne et al. (2019) and Schnoerr et al. (2017) by providing an accessible, didactic guide to pseudo-marginal methods (Andrieu et al., 2010; Andrieu and Roberts, 2009; Doucet et al., 2015) for the inference of kinetic rate parameters of biochemical reaction network models using the chemical Langevin description. For all of our examples, we provide accessible implementations using the open source, high performance Julia programming language (Besançon et al., 2019; Bezanson et al., 2017). The Julia code examples and demonstration scripts are available from GitHub: https://github.com/davidwarne/Warne2019_GuideToPseudoMarginal

In this section, we introduce several concepts that are fundamental to understanding how pseudo-marginal methods work and why they are effective. Firstly, we introduce stochastic models of biochemical reaction networks.
A biochemical reaction network consists of N chemical species, X_1, X_2, ..., X_N, that interact via a network of M reactions,

    \sum_{i=1}^{N} \nu_{i,j}^{-} X_i \xrightarrow{k_j} \sum_{i=1}^{N} \nu_{i,j}^{+} X_i, \qquad j = 1, 2, \ldots, M,    (1)

where \nu_{i,j}^{-} and \nu_{i,j}^{+} are, respectively, the numbers of reactant and product molecules of species X_i involved in the jth reaction, and k_j is the kinetic rate parameter for the jth reaction. We refer to \nu_{i,j} = \nu_{i,j}^{+} - \nu_{i,j}^{-} as the stoichiometry of species i for reaction j.

While spatially extended systems can be considered (Cotter and Erban, 2016; Flegg et al., 2015), we will assume the chemical mixture is spatially uniform for clarity. Under this assumption the law of mass action holds, and the probability of the jth reaction occurring in the time interval [t, t + dt) is a_j(X_t) dt, where X_t = [X_{1,t}, X_{2,t}, ..., X_{N,t}]^T is the N x 1 state vector at time t and a_j(X_t) is the propensity function for reaction j (Gillespie, 1977; Kurtz, 1972). Should a reaction j event occur, the state is updated by adding the stoichiometric vector \nu_j = [\nu_{1,j}, \nu_{2,j}, ..., \nu_{N,j}]^T to the current system state. Example implementations for generating a range of common biochemical reaction network models are provided in ChemicalReactionNetworkModels.jl.

In situations where the number of molecules in the system is sufficiently large, the forwards evolution of the biochemical reaction network can be accurately approximated by the chemical Langevin equation (Gillespie, 2000; Higham, 2008; Wilkinson, 2009). The chemical Langevin equation is an Itô SDE of the form

    dX_t = \sum_{j=1}^{M} \nu_j a_j(X_t)\, dt + \sum_{j=1}^{M} \nu_j \sqrt{a_j(X_t)}\, dW_t^{(j)},    (2)

where X_t takes values in R^N and W_t^{(1)}, W_t^{(2)}, ..., W_t^{(M)} are independent scalar Wiener processes. For a fixed initial condition, X_0, the solution to Equation (2), \{X_t\}_{t \ge 0}, can be approximately simulated using numerical methods.
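To make the forward simulation of Equation (2) concrete, the following is a minimal Python sketch of an Euler-Maruyama discretisation for a chemical Langevin model (the paper's implementations are in Julia; the two-reaction production-degradation network and all parameter values here are illustrative only):

```python
import math
import random

def euler_maruyama_cle(nu, a, x0, t_end, dt, rng):
    """Approximate one path of the chemical Langevin equation
    dX_t = sum_j nu_j a_j(X_t) dt + sum_j nu_j sqrt(a_j(X_t)) dW_t^(j)
    using the Euler-Maruyama scheme with step size dt."""
    x = list(x0)
    t = 0.0
    path = [(t, list(x))]
    while t < t_end - 1e-9:
        props = [a_j(x) for a_j in a]  # propensities evaluated at the current state X_t
        for j, p_j in enumerate(props):
            xi = rng.gauss(0.0, 1.0)  # dW^(j) over [t, t+dt) is sqrt(dt) * xi
            step = p_j * dt + math.sqrt(max(p_j, 0.0) * dt) * xi
            for i in range(len(x)):
                x[i] += nu[j][i] * step  # add nu_j * (drift + diffusion) increment
        t += dt
        path.append((t, list(x)))
    return path

# Hypothetical two-reaction production-degradation network:
# 0 -> X with propensity k1, and X -> 0 with propensity k2 * X.
k1, k2 = 1.0, 0.01
nu = [[1.0], [-1.0]]                       # stoichiometric vectors nu_1, nu_2
a = [lambda x: k1, lambda x: k2 * x[0]]    # propensity functions a_1, a_2
path = euler_maruyama_cle(nu, a, [50.0], 100.0, 0.1, random.Random(42))
```

The function accepts any network through its stoichiometric vectors and propensity functions, which mirrors the generic structure of Equation (2).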
In this work, we apply the Euler-Maruyama scheme (Kloeden and Platen, 1999; Maruyama, 1955), which approximates a realisation at time t + \Delta t given X_t according to

    X_{t+\Delta t} = X_t + \sum_{j=1}^{M} \nu_j a_j(X_t)\, \Delta t + \sum_{j=1}^{M} \nu_j \sqrt{a_j(X_t)\, \Delta t}\, \xi^{(j)},

where \xi^{(1)}, \xi^{(2)}, ..., \xi^{(M)} are independent, identically distributed (i.i.d.) standard normal random variables. It can be shown that the Euler-Maruyama scheme converges with rate O(\sqrt{\Delta t}) to the true path-wise solution (Kloeden and Platen, 1999). While higher-order schemes are possible, they impose tighter restrictions on the form of the SDE; therefore we restrict ourselves to the Euler-Maruyama scheme in this work. For an accessible introduction to numerical methods for SDEs, see Higham (2001), and for a detailed monograph that includes rigorous analysis of convergence rates, see Kloeden and Platen (1999). Example implementations of the Euler-Maruyama scheme for the chemical Langevin equation are provided in EulerMaruyama.jl and ChemicalLangevin.jl.

In practice, the application of mathematical models to the study of real biochemical networks requires model calibration and parameter inference using experimental data. The data are typically chemical concentrations derived from optical microscopy and fluorescent reporters, such as green fluorescent proteins (Finkenstädt et al., 2008; Sahl et al., 2017; Wilkinson, 2011). Let \theta \in \Theta be the vector of unknown model parameters, such as kinetic rate parameters or initial conditions. The task is to quantify the uncertainty in the model parameters after taking the experimental data, D, into account. Given a model parameterised by \theta and experimental data, D, uncertainty in the unknown model parameters can be quantified using the Bayesian posterior probability density,

    p(\theta \mid D) = \frac{L(\theta; D)\, p(\theta)}{p(D)},    (3)

where: p(\theta) is the prior probability density that encodes parameter assumptions; L(\theta; D) is the likelihood function that determines the probability of the data under the assumed model for fixed \theta; and p(D) is the evidence, which provides the total probability of the data under the assumed model over all possible parameter values.

Parameter uncertainty quantification often involves computing expectations of functionals with respect to the posterior distribution (Equation (3)),

    E[f(\theta)] = \int_{\Theta} f(\theta)\, p(\theta \mid D)\, d\theta,

which may be estimated using Monte Carlo integration,

    E[f(\theta)] \approx \hat{f} = \frac{1}{M} \sum_{i=1}^{M} f(\theta^{(i)}),

where \theta^{(1)}, \theta^{(2)}, ..., \theta^{(M)} are i.i.d. samples from the posterior distribution, p(\theta \mid D). In particular, the jth marginal posterior probability density, p(\theta_j \mid D), with \theta_j the jth dimension of \theta, may be estimated using a smoothed kernel density estimate,

    p(\theta_j \mid D) \approx \frac{1}{Mh} \sum_{i=1}^{M} K\left(\frac{\theta_j - \theta_j^{(i)}}{h}\right),

where \theta_j^{(1)}, \theta_j^{(2)}, ..., \theta_j^{(M)} are the jth dimensions of i.i.d. posterior samples, h is a user-prescribed smoothing parameter, and the kernel K(x) is chosen such that \int_{-\infty}^{\infty} K(x)\, dx = 1. Typically, K(x) is a standard Gaussian, and h is chosen using Silverman's rule (Silverman, 1986). In most cases, direct i.i.d. sampling from the posterior distribution is not possible, since it is often not from a standard distribution family.

MCMC methods are based on the idea of simulating a discrete time Markov chain, \{\theta_m\}_{m \ge 0}, in parameter space, \Theta, for which the posterior of interest is its stationary distribution (Green et al., 2015; Roberts and Rosenthal, 2004). A popular MCMC algorithm is the Metropolis-Hastings method (Hastings, 1970; Metropolis et al., 1953) (Algorithm 1; an example implementation is provided in MetropolisHastings.jl), in which transitions from state \theta_m to a proposed new state \theta^* occur with probability proportional to the relative posterior density between the two locations. The proposals are determined through sampling a proposal kernel distribution that is conditional on \theta_m, with density q(\theta^* \mid \theta_m).

Algorithm 1 The Metropolis-Hastings method for MCMC
  Given an initial condition \theta_0 such that p(\theta_0 \mid D) > 0.
  for m = 1, ..., M do
    Sample the transition kernel, \theta^* \sim q(\theta \mid \theta_{m-1}).
    Calculate the acceptance probability
      \alpha(\theta^*, \theta_{m-1}) = \min\left(1, \frac{q(\theta_{m-1} \mid \theta^*)\, p(\theta^* \mid D)}{q(\theta^* \mid \theta_{m-1})\, p(\theta_{m-1} \mid D)}\right).
    Set \theta_m \leftarrow \theta^* with probability \alpha(\theta^*, \theta_{m-1}); otherwise, set \theta_m \leftarrow \theta_{m-1}.
  end for

Under some regularity conditions on the proposal density, the resulting Markov chain will converge to the target posterior as its stationary distribution (Mengersen and Tweedie, 1996). Therefore, expectations can be computed with Monte Carlo integration using a sufficiently large dependent sequence from the Metropolis-Hastings Markov chain. It is important to note that this is an asymptotic result, and determining when such a sequence is sufficiently large for practical purposes is an active area of research (Cowles and Carlin, 1996; Gelman and Rubin, 1992; Gelman et al., 2014; Vehtari et al., 2019). It is also important to note that the efficiency of MCMC based on Metropolis-Hastings is heavily dependent on the proposal density used (Metropolis et al., 1953). Adaptive schemes may be applied (Roberts and Rosenthal, 2009); however, care must be taken when applying these schemes, as the stationary distribution may be altered. In many practical applications, the proposal density and number of iterations are selected heuristically (Hines et al., 2014).

Alternative MCMC algorithms include Gibbs sampling (Geman and Geman, 1984), Hamiltonian Monte Carlo (Duane et al., 1987), and Zig-Zag sampling (Bierkens et al., 2019). In this work, however, we base all discussion and examples on the Metropolis-Hastings method, as it is the most natural to extend to challenging inference problems in systems biology (Golightly and Wilkinson, 2011; Marjoram et al., 2003).
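Algorithm 1 is compact enough to express directly in code. Below is a minimal Python sketch (the paper's MetropolisHastings.jl plays this role in Julia). It uses a symmetric Gaussian random-walk proposal, for which the proposal densities q cancel in the acceptance ratio, and works with log densities for numerical stability; the standard normal target is a hypothetical stand-in for log p(\theta | D) chosen only so the output is easy to check.

```python
import math
import random

def metropolis_hastings(log_target, theta0, n_iters, sigma, rng):
    """Metropolis-Hastings (Algorithm 1) with a symmetric Gaussian random-walk
    proposal, so alpha = min(1, p(theta*|D) / p(theta_{m-1}|D))."""
    chain = [theta0]
    log_p = log_target(theta0)
    for _ in range(n_iters):
        theta_star = chain[-1] + rng.gauss(0.0, sigma)   # propose theta*
        log_p_star = log_target(theta_star)
        if math.log(rng.random()) < log_p_star - log_p:  # accept with prob alpha
            chain.append(theta_star)
            log_p = log_p_star
        else:
            chain.append(chain[-1])                      # reject: keep theta_{m-1}
    return chain

# Hypothetical stand-in target: standard normal log-density in place of log p(theta|D).
chain = metropolis_hastings(lambda th: -0.5 * th * th, 3.0, 20000, 1.0,
                            random.Random(1))
burn = chain[5000:]
mean = sum(burn) / len(burn)
```

For an asymmetric proposal, the q terms in Algorithm 1 must be retained in the acceptance ratio; dropping them is only valid because the random walk is symmetric.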
We demonstrate the application of MCMC to perform exact Bayesian inference using a biochemical reaction network for which an analytic solution to the likelihood can be obtained. This enables us to highlight important MCMC algorithm design considerations before introducing the additional complexity that arises when the likelihood is intractable. Consider a biochemical system consisting of a single chemical species, X, involving only two reactions,

    \emptyset \xrightarrow{k_1} X \ \text{(external production of X molecules)} \quad \text{and} \quad X \xrightarrow{k_2} \emptyset \ \text{(degradation of X molecules)}.    (4)

Here, k_1 > 0 and k_2 > 0 are the kinetic rate parameters, with propensities a_1(X_t) = k_1 and a_2(X_t) = k_2 X_t, and respective stoichiometries \nu_1 = 1 and \nu_2 = -1. The chemical Langevin equation for this production-degradation model (Equation (4)) is

    dX_t = (k_1 - k_2 X_t)\, dt + \sqrt{k_1 + k_2 X_t}\, dW_t,    (5)

where W_t is a Wiener process. Approximate realisations of the solution process can be generated using the Euler-Maruyama discretisation, as demonstrated in Figure 2(A) (see example DemoProdDeg.jl). Throughout this work we take time, t, and rate parameters to be dimensionless. However, all results can be re-dimensionalised as appropriate.

Figure 2: (A) Four example realisations of the chemical Langevin SDE for the production-degradation model. (B) The analytic stationary distribution compared with an approximation using simulations. Simulations are performed using the Euler-Maruyama scheme with initial condition X_0 = 50 and kinetic rate parameters k_1 = 1.0 and k_2 = 0.01.

Assume that the degradation rate is known, k_2 = 0.01. The inference task is to quantify the uncertainty in the production kinetic rate, k_1, using experimental data D = [Y_{obs}^{(1)}, Y_{obs}^{(2)}, ..., Y_{obs}^{(n)}], where Y_{obs}^{(1)}, Y_{obs}^{(2)}, ..., Y_{obs}^{(n)} are n independent observations of a hypothetical real biochemical production-degradation process that has reached its equilibrium distribution (Appendix D). For simplicity, we also assume these observations are not subject to any observation error; that is, our data are assumed to be exact realisations of the stationary process for the production-degradation model (Equation (4)) under the chemical Langevin equation representation (Equation (5)).

For inference, we require the Bayesian posterior probability density,

    p(k_1 \mid D) \propto L(k_1; D)\, p(k_1),    (6)

where the prior is p(k_1) and the likelihood is

    L(k_1; D) = \prod_{i=1}^{n} p_s(Y_{obs}^{(i)}; k_1).    (7)

We prescribe a uniform prior on k_1, and the true production rate is k_1 = 1.0. In Equation (7), p_s(x; k_1) is the probability density function for the chemical Langevin equation (Equation (5)) solution process, \{X_t\}_{t \ge 0}, as t \to \infty; that is, the stationary process X_\infty \sim p_s(x; k_1). For this particular example, it is possible to obtain an analytical expression for this stationary probability density function. The solution is obtained by formulating the Fokker-Planck equation for the Itô process in Equation (5) and solving for the steady state (Appendix A) to yield

    p_s(x; k_1) = \frac{\exp\left(-2x + \left(4k_1/k_2 - 1\right)\ln(k_1 + k_2 x)\right)}{\int_0^{\infty} \exp\left(-2y + \left(4k_1/k_2 - 1\right)\ln(k_1 + k_2 y)\right)\, dy}.    (8)

Given a value for k_1, the denominator can be accurately calculated using quadrature. Figure 2(B) overlays this analytical solution against a histogram obtained from the time series of a single very long simulation.

Since the likelihood (Equation (7)) can be evaluated point-wise, the Metropolis-Hastings method (Algorithm 1) can be applied directly with acceptance probability

    \alpha(\theta^*, \theta_m) = \min\left(1, \frac{q(\theta_m \mid \theta^*)\, p(\theta^*) \prod_{i=1}^{n} p_s(Y_{obs}^{(i)}; \theta^*)}{q(\theta^* \mid \theta_m)\, p(\theta_m) \prod_{i=1}^{n} p_s(Y_{obs}^{(i)}; \theta_m)}\right),    (9)

where \theta^* \sim q(\cdot \mid \theta_m) is the proposal mechanism.

The choice of proposal kernel dramatically affects the rate of convergence of the Markov chain. For example, Figure 3 demonstrates the Markov chain based on Equation (9) using a Gaussian proposal kernel,

    q(\theta^* \mid \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta^* - \theta)^2}{2\sigma^2}\right),    (10)

for different choices of the standard deviation parameter \sigma (see example DemoMH.jl). In all cases, the initial state of the chain is set in a region of very low posterior density, \theta_0 = 0.8, to ensure we can compare transient and stationary behaviour of the Markov chain.

For a small standard deviation, \sigma = 0.01 (Figure 3(A)–(B)), the move acceptance rate is high (see Figure 3(G)); however, only very small steps are ever taken (Figure 3(A)). These small steps lead to an over-sampling of the low density region of initialisation before the chain drifts toward the high density region. This over-sampling of the tail is still evident after 500 iterations (Figure 3(B)), and almost 10,000 iterations are required to compensate for this initial transient behaviour. We emphasise that here we refer to transient behaviour of the Metropolis-Hastings Markov chain; this is not to be confused with any transient behaviour of the underlying model. In Figure 3(C)–(D), we show that the use of a larger standard deviation, \sigma = 0.1, results in the rejection of significantly more proposals (Figure 3(C) and Figure 3(G)); however, the larger steps result in rapid convergence to the true target density in almost 500 iterations (Figure 3(D)). Increasing the standard deviation further, to \sigma = 1.0, leads to the rejection of almost all proposals (Figure 3(E) and Figure 3(G)), and after 10,000 iterations the chain still has not reached stationarity.

Figure 3: (A, C, and E) Metropolis-Hastings trace plots for k_1, and (B, D, and F) smoothed kernel density estimates for the target Bayesian posterior, shown for Gaussian proposal kernels with standard deviations: (A-B) \sigma = 0.01; (C-D) \sigma = 0.1; (E-F) \sigma = 1.0. Smoothed kernel density estimates for the target Bayesian posterior are shown at 250 iterations (solid blue), 500 iterations (solid orange), and 10,000 iterations (solid green) alongside the exact posterior (dashed black). The true production rate is k_1 = 1.0. (G) The acceptance rate, A_m, for the different proposal standard deviations, \sigma = 0.01 (solid blue), \sigma = 0.1, and \sigma = 1.0.

In Figure 3(B), (D), and (F), the exact posterior density for the production-degradation inference problem, p(k_1 \mid D), is overlaid to demonstrate that all chains are still converging to the same stationary distribution. Having this exact solution also enables us to demonstrate the impact that the choice of proposal kernel has on the efficacy of the Metropolis-Hastings method. In general, the selection of an optimal proposal kernel is an open problem; however, there are techniques that can be applicable in specific cases (Gelman et al., 1996; Roberts and Rosenthal, 2009; Yang and Rodríguez, 2013).

Nearly all likelihood functions for stochastic biochemical systems of interest are intractable. This renders the standard Metropolis-Hastings method for MCMC sampling (Algorithm 1) impossible to implement directly (Sisson et al., 2018; Warne et al., 2019; Wilkinson, 2011). To deal with this problem, techniques for sampling Bayesian posterior distributions have been developed that avoid point-wise evaluation of the likelihood. These so-called likelihood-free methods fall into two main categories: approximate Bayesian computation and pseudo-marginal methods. In this section, we provide a brief description of both approaches in the context of the Metropolis-Hastings method for MCMC sampling. We then demonstrate some important features of these methods in the context of the tractable production-degradation inference problem presented in Section 2.3.
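Returning briefly to the tractable example: the normalising constant in the denominator of Equation (8) is a one-dimensional integral that is straightforward to evaluate by quadrature. A minimal Python sketch (a stand-in for the paper's Julia code) using a composite trapezoid rule, with a log-space shift because the unnormalised density overflows double precision for these parameter values:

```python
import math

def log_ps_unnorm(x, k1, k2):
    """Log of the unnormalised stationary density in Equation (8):
    -2x + (4*k1/k2 - 1) * ln(k1 + k2*x)."""
    return -2.0 * x + (4.0 * k1 / k2 - 1.0) * math.log(k1 + k2 * x)

def stationary_density(xs, k1, k2):
    """Normalise the density over a uniform grid xs with the trapezoid rule,
    shifting by the maximum log value to avoid overflow."""
    logs = [log_ps_unnorm(x, k1, k2) for x in xs]
    m = max(logs)
    vals = [math.exp(l - m) for l in logs]
    h = xs[1] - xs[0]
    z = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))  # trapezoid quadrature
    return [v / z for v in vals]

k1, k2 = 1.0, 0.01
xs = [0.5 * i for i in range(801)]  # grid on [0, 400]; density is negligible outside
ps = stationary_density(xs, k1, k2)
# Sanity check: the stationary mean of the linear-drift model is k1/k2 = 100.
mean = sum(x * p for x, p in zip(xs, ps)) * 0.5
```

With the density normalised this way, the log-likelihood in Equation (7) is simply the sum of log_ps_unnorm over the observations minus n times the log normalising constant.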
ABC is a broad class of Bayesian sampling techniques that are applicable when the likelihood is intractable but simulated data can be generated efficiently for a given parameter vector \theta (Sisson et al., 2018; Sunnåker et al., 2013; Warne et al., 2019). The fundamental idea is that parameter values that frequently lead to simulated data D_s that are similar to the true observations D will have higher posterior probability density. In effect, ABC samples from an approximate Bayesian posterior,

    p(\theta \mid \rho(D, D_s) \le \epsilon) \propto P(\rho(D, D_s) \le \epsilon \mid \theta)\, p(\theta),    (11)

where the discrepancy metric, \rho(D, D_s), quantifies how different the two datasets are, the acceptance threshold, \epsilon > 0, specifies the difference that is considered close, and D_s \sim f(D \mid \theta) is the data simulation process.

In the context of MCMC sampling, Marjoram et al. (2003) developed a modified Metropolis-Hastings method using the acceptance probability

    \alpha(\theta^*, \theta_m) = \begin{cases} \min\left(1, \dfrac{q(\theta_m \mid \theta^*)\, p(\theta^*)}{q(\theta^* \mid \theta_m)\, p(\theta_m)}\right), & \text{if } \rho(D, D_s) \le \epsilon, \\ 0, & \text{if } \rho(D, D_s) > \epsilon. \end{cases}    (12)

Marjoram et al. (2003) also show that the stationary distribution of the resulting Markov chain is Equation (11). Provided that the discrepancy metric, \rho(D, D_s), and acceptance threshold, \epsilon, are appropriately selected so that p(\theta \mid \rho(D, D_s) \le \epsilon) \approx p(\theta \mid D), then we can use this Markov chain for inference in the same way that the chain from classical Metropolis-Hastings MCMC (Algorithm 1) would be used. An example implementation is provided in ABCMCMC.jl.

The choice of \rho(D, D_s) and \epsilon is critical to both the accuracy of the approximate posterior and the computational cost of the method. Ideally, we require \rho(D, D_s) such that we recover the true posterior density in the limit as \epsilon \to 0. Using a metric such as the Euclidean distance satisfies this property; however, when the data has high dimensionality, it is completely infeasible for small \epsilon to accept any parameter proposals. Conversely, metrics based on summary statistics of the data can be used to reduce the data dimensionality so that a smaller \epsilon can be used; however, this may not lead to the true posterior as \epsilon \to 0. In practice, it is common to choose \epsilon to be of similar order to the observation error (Toni et al., 2009; Wilkinson, 2013) to obtain accurate posteriors for the purposes of inference.

Pseudo-marginal methods (Andrieu and Roberts, 2009) are an alternative approach to likelihood-free inference with some desirable properties compared with ABC. The pseudo-marginal approach can be used when one has an unbiased Monte Carlo estimator, \hat{L}(\theta; D), for the point-wise evaluation of the likelihood function L(\theta; D). This estimator is used directly in place of the true likelihood for the purposes of MCMC.

In the context of Metropolis-Hastings MCMC (Algorithm 1), the acceptance probability for the pseudo-marginal approach is

    \alpha(\theta^*, \theta_m) = \min\left(1, \frac{q(\theta_m \mid \theta^*)\, \hat{L}(\theta^*; D)\, p(\theta^*)}{q(\theta^* \mid \theta_m)\, \hat{L}(\theta_m; D)\, p(\theta_m)}\right).    (13)

On initial inspection, one would expect the stationary distribution of the resulting Markov chain to be an approximation to the true posterior, just as with the ABC approach using Equation (12). Surprisingly, this is not the case: the stationary distribution of the Markov chain using Equation (13) is, in fact, the exact posterior distribution (Equation (3)). As a result, pseudo-marginal methods have been referred to as exact approximations (Golightly and Wilkinson, 2011). For a brief explanation of why the true posterior is recovered, see Appendix B. For more detail we refer the reader to Andrieu and Roberts (2009), Beaumont (2003), and Golightly and Wilkinson (2011).
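A minimal Python sketch of pseudo-marginal Metropolis-Hastings with the acceptance probability of Equation (13) is shown below, on a hypothetical toy latent-variable model (flat prior, symmetric proposal; the model, parameter values, and function names are illustrative, not from the paper). One detail worth emphasising: the likelihood estimate attached to the current state must be stored and reused, not refreshed at each iteration, or exactness is lost.

```python
import math
import random

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_estimate(theta, y, R, rng):
    """Unbiased Monte Carlo estimate L_hat(theta; y) = (1/R) sum_j g(y | X_j),
    with latent draws X_j ~ N(theta, 1) and observation model g(y|x) = N(y; x, 1).
    (Hypothetical stand-in model; exactly, L(theta; y) = N(y; theta, sqrt(2)).)"""
    return sum(norm_pdf(y, theta + rng.gauss(0.0, 1.0), 1.0) for _ in range(R)) / R

def pseudo_marginal_mh(y, theta0, n_iters, sigma, R, rng):
    """Pseudo-marginal Metropolis-Hastings (Equation (13)) with a flat prior and
    symmetric Gaussian proposal, so alpha = min(1, L_hat* / L_hat)."""
    theta, L = theta0, likelihood_estimate(theta0, y, R, rng)
    chain = []
    for _ in range(n_iters):
        theta_star = theta + rng.gauss(0.0, sigma)
        L_star = likelihood_estimate(theta_star, y, R, rng)
        if rng.random() < min(1.0, L_star / L):
            theta, L = theta_star, L_star  # keep the estimate with the state
        chain.append(theta)
    return chain

chain = pseudo_marginal_mh(y=1.5, theta0=0.0, n_iters=20000, sigma=1.0, R=30,
                           rng=random.Random(7))
post_mean = sum(chain[2000:]) / len(chain[2000:])
```

Because the estimator is unbiased, the chain targets the exact posterior (here, centred on the observation y = 1.5 under the flat prior) for any R; R only controls mixing speed.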
An example implementation is provided in PseudoMarginalMetropolisHastings.jl.

Unlike classical Metropolis-Hastings, the acceptance probability, \alpha(\theta^*, \theta_m) (Equation (13)), is still a random variable, given values for \theta^* and \theta_m. This additional randomness reduces the rate at which the Markov chain approaches stationarity, but the additional noise can be controlled by reducing the variance of the likelihood estimator \hat{L}(\theta; D). However, reducing the variance necessarily incurs higher computational costs, since a larger number of Monte Carlo samples will be required to compute \hat{L}(\theta; D). Research has been undertaken to develop methods to choose the number of samples optimally. In particular, Doucet et al. (2015) perform a detailed analysis and find, under some restrictive assumptions, that choosing the number of samples such that Var[\log \hat{L}(\bar{\theta}; D)] \approx 1.2, where \bar{\theta} is the posterior mean, is the optimal trade-off.

3.3 Comparison for an example with a tractable likelihood

We now demonstrate the ABC and pseudo-marginal approaches to MCMC using the tractable production-degradation problem from Section 2.3. Specifically, we demonstrate how the ABC acceptance threshold and the pseudo-marginal Monte Carlo estimator variance affect both the rate of convergence and the stationary distribution.

For the ABC case, we can generate simulated data D_s = [X_T^{(1)}, X_T^{(2)}, ..., X_T^{(n)}], where X_T^{(1)}, X_T^{(2)}, ..., X_T^{(n)} are independent approximate realisations of the production-degradation model (Equation (5)) using the Euler-Maruyama scheme over the interval 0 \le t \le T, with end time T = 1000, degradation rate k_2 = 0.01, and k_1 given by the state of the Markov chain, \theta_m. Given that the data has dimension n = 10 (Section 2.3, Appendix D), we choose a discrepancy metric that reduces the data dimension for ease of demonstration, that is,

    \rho(D, D_s) = |\hat{\mu}(D) - \hat{\mu}(D_s)| + |\hat{\sigma}(D) - \hat{\sigma}(D_s)|,

where \hat{\mu}(D) and \hat{\sigma}(D) are the sample mean and standard deviation of the observations D, and \hat{\mu}(D_s) and \hat{\sigma}(D_s) are the sample mean and standard deviation of the simulated data D_s. Figure 4(A)–(F) demonstrates the behaviour of ABC MCMC using acceptance thresholds of \epsilon = 14 (Figure 4(A)–(B)), \epsilon = 7 (Figure 4(C)–(D)), and \epsilon = 3.5 (Figure 4(E)–(F)) (see example DemoABCMCMC.jl). The Markov chain trajectories shown in Figure 4(A), (C), and (E) indicate that larger values of \epsilon lead to more rapid convergence to stationarity. The converse is true for the accuracy of the stationary distribution as an approximation to the exact posterior, as demonstrated in Figure 4(B), (D), and (F), with larger values of \epsilon leading to a more diffuse approximate posterior. This highlights a known shortcoming of ABC for the purposes of MCMC sampling: choosing a small \epsilon for accuracy will tend to result in a Markov chain that repeatedly gets stuck in the same location (Sisson et al., 2007).

For the pseudo-marginal approach we use a standard smoothed kernel density estimate for the likelihood, that is,

    \hat{L}(\theta; D) = \frac{1}{(Rh)^n} \prod_{i=1}^{n} \sum_{j=1}^{R} K\left(\frac{Y_{obs}^{(i)} - X_T^{(j)}}{h}\right),

where h is the smoothing parameter chosen using Silverman's rule (Silverman, 1986), K(x) is a standard Gaussian smoothing kernel, and X_T^{(1)}, X_T^{(2)}, ..., X_T^{(R)} are independent approximate realisations of the production-degradation model (Equation (5)) using the Euler-Maruyama scheme with identical parameterisation as used for ABC.
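The kernel density likelihood estimator above can be sketched in a few lines of Python (the paper's code is Julia; the simulated values and observations below are illustrative stand-ins for the X_T realisations and data). One common form of Silverman's rule for a Gaussian kernel in one dimension is h = 1.06 \hat{\sigma} R^{-1/5}, which is used here as an assumption:

```python
import math
import random

def silverman_bandwidth(samples):
    """One common form of Silverman's rule for a Gaussian kernel:
    h = 1.06 * sd * R^(-1/5)."""
    R = len(samples)
    mu = sum(samples) / R
    sd = math.sqrt(sum((s - mu) ** 2 for s in samples) / (R - 1))
    return 1.06 * sd * R ** (-0.2)

def kde_likelihood(obs, sims):
    """Smoothed kernel density likelihood estimate
    L_hat = prod_i (1/(R*h)) * sum_j K((Y_i - X_j)/h), with K standard Gaussian
    and sims the R simulated stationary values for the proposed parameter."""
    R = len(sims)
    h = silverman_bandwidth(sims)
    c = 1.0 / (R * h * math.sqrt(2.0 * math.pi))
    L = 1.0
    for y in obs:
        L *= c * sum(math.exp(-0.5 * ((y - x) / h) ** 2) for x in sims)
    return L

# Illustrative check: simulated stationary values roughly N(100, 10); observations
# consistent with the simulations should score a much higher likelihood estimate.
rng = random.Random(3)
sims = [rng.gauss(100.0, 10.0) for _ in range(200)]
near = kde_likelihood([98.0, 103.0], sims)
far = kde_likelihood([150.0, 160.0], sims)
```

Within pseudo-marginal MCMC, kde_likelihood would be evaluated on fresh simulations at each proposed parameter value and substituted into Equation (13).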
The variance of the likelihood estimator depends on the number of realisations used in the estimate, R. Figure 4(G)–(L) demonstrates the behaviour of the pseudo-marginal approach to MCMC using realisation numbers of R = 25 (Figure 4(G)–(H)), R = 50 (Figure 4(I)–(J)), and R = 100 (Figure 4(K)–(L)) (see example DemoPMMH.jl). As expected, increasing R has the effect of increasing the convergence rate (although not significantly so). More importantly, regardless of the value of R, the same stationary distribution is approached in the limit; that is, the exact posterior distribution.

Figure 4: Comparison of likelihood-free MCMC algorithms, (A-F) ABC and (G-L) the pseudo-marginal approach, using the tractable production-degradation example. ABC trace plots and smoothed kernel density estimates for the target approximate Bayesian posterior are shown for decreasing acceptance thresholds: (A-B) \epsilon = 14; (C-D) \epsilon = 7; and (E-F) \epsilon = 3.5. Pseudo-marginal trace plots and smoothed kernel density estimates for the target approximate Bayesian posterior are shown for increasing sample numbers for likelihood estimation: (G-H) R = 25; (I-J) R = 50; and (K-L) R = 100. Smoothed kernel density estimates for the target Bayesian posterior are shown at 250 iterations (solid blue), 500 iterations (solid orange), and 10,000 iterations (solid green) alongside the exact posterior (dashed black). The true production rate is k_1 = 1.0. In all cases the Gaussian proposal kernel has standard deviation \sigma = 0.01, and the chain is initialised with \theta_0 = 0.8.

This highlights a major advantage of pseudo-marginal methods: the stationary distribution is independent of the number of realisations, R; furthermore, the stationary distribution is the exact Bayesian posterior distribution. Even with R = 1 the method will eventually converge to the exact posterior distribution. This is in stark contrast to ABC methods, where the stationary distribution depends on the discrepancy metric and acceptance threshold. Ultimately, this means that user choices only affect the computational performance of pseudo-marginal methods, rather than both computational performance and inference accuracy as with ABC. This is a clear advantage of the pseudo-marginal approach.

The production-degradation example presented in Section 2.3 is useful for highlighting the essential concepts of standard MCMC sampling and likelihood-free alternatives. However, this inference problem is very simple compared with practical problems, since real biochemical processes are generally not observed in their stationary state without observation error. Rather, real biochemical process data, as shown in Figure 1, are often characterised by noisy, time-course observations, with few observations in time and only partially observed states (Finkenstädt et al., 2008; Golightly and Wilkinson, 2011; Warne et al., 2019).
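Data with these features are easy to mimic in simulation, which is useful for testing an inference pipeline before applying it to real measurements. A minimal Python sketch, following the production-degradation example with a hypothetical Gaussian observation model and illustrative parameter values:

```python
import math
import random

def simulate_observations(k1, k2, x0, obs_times, dt, obs_sd, rng):
    """Generate synthetic noisy time-course data: one latent Euler-Maruyama path
    of the production-degradation chemical Langevin model, recorded at sparse
    observation times through a Gaussian observation model Y ~ N(X_t, obs_sd^2)."""
    x, t, obs = x0, 0.0, []
    for t_next in obs_times:
        while t < t_next - 1e-9:
            xi1, xi2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            # One Euler-Maruyama step with independent noise per reaction.
            x += (k1 - k2 * x) * dt \
                 + math.sqrt(k1 * dt) * xi1 \
                 - math.sqrt(max(k2 * x, 0.0) * dt) * xi2
            t += dt
        obs.append(x + rng.gauss(0.0, obs_sd))  # noisy observation of the state
    return obs

rng = random.Random(11)
obs_times = [25.0 * i for i in range(1, 9)]  # sparse: 8 observation times
data = simulate_observations(1.0, 0.01, 50.0, obs_times, 0.1, 5.0, rng)
```

Only the noisy values in data would be passed to the inference algorithm; the latent path itself is discarded, which reproduces the partially observed setting described above.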
In the case of time-course data, the observations are samples at discrete points in time, t_0, t_1, …, t_n, from a single realisation of a stochastic process, such as a gene regulatory network. The observation process, denoted by {Y_t}_{t ≥ 0}, often has the form Y_t ∼ g(Y_t | X_t), where {X_t}_{t ≥ 0} is the underlying stochastic process, prescribed by the chemical Langevin equation (Equation (2)), that governs the biochemical kinetics, and g(Y_t | X_t) is the observation probability density. The resulting discrete observations are D = [Y_obs^(0), Y_obs^(1), …, Y_obs^(n)] with Y_obs^(i) = Y_{t_i} for i = 0, 1, …, n. The likelihood for such observations is

L(θ; D) = p(Y_obs^(0)) ∏_{i=1}^{n} p(Y_obs^(i) | Y_obs^(0), …, Y_obs^(i−1)). (14)

Not only is this likelihood intractable, but a direct Monte Carlo likelihood estimator will be impractical for the pseudo-marginal approach. For example, the following is a direct Monte Carlo estimate for Equation (14),

L̂(θ; D) = (1/R) ∑_{j=1}^{R} ∏_{i=1}^{n} g(Y_obs^(i) | X_{t_i}^(j)), (15)

where X_t^(1), X_t^(2), …, X_t^(R) are R independent realisations of the continuous sample path from the chemical Langevin equation (Equation (2)) that are subsequently observed at times t_0, t_1, …, t_n. However, a prohibitively large number of sample paths, R, will be required to obtain an acceptable variance in the estimator in Equation (15). Consequently, more advanced approaches to pseudo-marginal sampling are required.

The following observation assists in finding an alternative solution:

p(Y_obs^(i) | Y_obs^(0), …, Y_obs^(i−1)) = ∫_{ℝ^N} g(Y_obs^(i) | X_{t_i}) p(X_{t_i} | Y_obs^(0), …, Y_obs^(i−1)) dX_{t_i}.

If we can sample from p(X_{t_i} | Y_obs^(0), …, Y_obs^(i−1)) for all i = 1, 2, …, n, then we can use the alternative Monte Carlo estimator

L̂(θ; D) = ∏_{i=1}^{n} (1/R) ∑_{j=1}^{R} g(Y_obs^(i) | X_{t_i}^(j)), (16)

where X_{t_i}^(1), X_{t_i}^(2), …, X_{t_i}^(R) are R samples from the distribution p(X_{t_i} | Y_obs^(0), …, Y_obs^(i−1)). This estimator will have lower variance, because conditioning the samples X_{t_i}^(1), X_{t_i}^(2), …, X_{t_i}^(R) on all observations taken up to time t_{i−1} automatically removes contributions from trajectories that do not match the observational history. The challenge is in sampling from p(X_{t_i} | Y_obs^(0), …, Y_obs^(i−1)), and this motivates the use of so-called particle filters (Doucet and Johansen, 2011). We present the mathematical basis for this approach in the next section, along with practical examples that demonstrate how the method works in practice.

The bootstrap particle filter (Doucet and Johansen, 2011; Gordon et al., 1993) is a technique based on sequential importance sampling (Del Moral et al., 2006). This enables one to sample from the sequence of distributions p(X_{t_1} | Y_obs^(0)), p(X_{t_2} | Y_obs^(0), Y_obs^(1)), …, p(X_{t_n} | Y_obs^(0), …, Y_obs^(n−1)), and thereby evaluate the lower variance likelihood estimator (Equation (16)).

Suppose we have independent samples, called particles,

X_{t_{i−1}}^(1), X_{t_{i−1}}^(2), …, X_{t_{i−1}}^(R) ∼ p(X_{t_{i−1}} | Y_obs^(0), …, Y_obs^(i−1)).

Then, using the Euler-Maruyama scheme (or similar), we can simulate each particle forward to time t_i. This results in a new set of independent particles

X̃_{t_i}^(1), X̃_{t_i}^(2), …, X̃_{t_i}^(R) ∼ p(X_{t_i} | Y_obs^(0), …, Y_obs^(i−1)).

From these particles, we can evaluate the Monte Carlo estimate for the marginal likelihood at time t_i,

p̂(Y_obs^(i) | Y_obs^(0), …, Y_obs^(i−1)) = (1/R) ∑_{k=1}^{R} g(Y_obs^(i) | X̃_{t_i}^(k)). (17)

Provided we can then generate a new set of independent particles,

X_{t_i}^(1), X_{t_i}^(2), …, X_{t_i}^(R) ∼ p(X_{t_i} | Y_obs^(0), …, Y_obs^(i)),

we can compute Equation (17) for all i = 1, 2, …, n, and hence compute the likelihood estimator given in Equation (16). Progress can be made by noting that, through application of Bayes' Theorem,

p(X_{t_i} | Y_obs^(0), …, Y_obs^(i)) = p(Y_obs^(i) | X_{t_i}) p(X_{t_i} | Y_obs^(0), …, Y_obs^(i−1)) / p(Y_obs^(i) | Y_obs^(0), …, Y_obs^(i−1)).

Therefore, we can approximate the set of particles X_{t_i}^(1), X_{t_i}^(2), …, X_{t_i}^(R) by resampling the particles X̃_{t_i}^(1), X̃_{t_i}^(2), …, X̃_{t_i}^(R) with replacement using probabilities

P(X_{t_i} = X̃_{t_i}^(k)) = g(Y_obs^(i) | X̃_{t_i}^(k)) / [R p̂(Y_obs^(i) | Y_obs^(0), …, Y_obs^(i−1))] = g(Y_obs^(i) | X̃_{t_i}^(k)) / ∑_{j=1}^{R} g(Y_obs^(i) | X̃_{t_i}^(j)).

The resampled particles are approximately distributed according to p(X_{t_i} | Y_obs^(0), …, Y_obs^(i)). This leads to the bootstrap particle filter (Algorithm 2) (Gordon et al., 1993). An example implementation is provided in BootstrapParticleFilter.jl.

Algorithm 2:
The bootstrap particle filter for likelihood estimation
1. Initialise i = 0 and particles {X_{t_0}^(k)}_{k=1}^{R}, where X_{t_0}^(k) ∼ p(X_{t_0} | Y_obs^(0)) for k = 1, 2, …, R.
2. for i = 1, …, n do
3.   for k = 1, …, R do
4.     Simulate the particle forward, X_{t_i}^(k) ∼ f(X_{t_i} | X_{t_{i−1}}^(k)).
5.     Compute the weight, W_i^k ← g(Y_obs^(i) | X_{t_i}^(k)).
6.   end for
7.   Compute the marginal likelihood estimate, p̂(Y_obs^(i) | Y_obs^(0), …, Y_obs^(i−1)) ← (1/R) ∑_{k=1}^{R} W_i^k.
8.   Resample the particles, {X_{t_i}^(k)}_{k=1}^{R}, with replacement using probabilities W_i^k / [∑_{j=1}^{R} W_i^j] for k = 1, 2, …, R.
9. end for
10. Compute the likelihood estimate, L̂(θ; D) ← ∏_{i=1}^{n} p̂(Y_obs^(i) | Y_obs^(0), …, Y_obs^(i−1)).

Figure 5 provides a visual demonstration of this process using a small number of particles, R = 4, for ease of visualisation. Figure 5(A) shows time-course data with error bars indicating the magnitude of the observation error. These data, with very low time resolution, are typical of many experimental studies, such as the data in Figure 1(C). Initially, all four particles are set to the initial data point with equal weighting (Figure 5(B)). The particles are then evolved forwards to the next observation time (Figure 5(C)), and the weighting is calculated as the probability density of the observation given each of these particles (Figure 5(D)). Note that only one of the four particles contributes significantly to the likelihood after this first forwards step, so simulating the other three particles forwards any further would be a waste of computational effort. As such, we generate four independent continuations of the one highly weighted particle (Figure 5(D)) to evolve toward the next observation time (Figure 5(E)).
The same process is repeated using the second generation of weights (Figure 5(F)), in order to perform the last step (Figure 5(G)–(H)).

The visualisation in Figure 5 highlights the effect of conditioning on past observations. That is, model simulations that lead to a small likelihood for one of the observations are discarded early, and new particles are generated by continuing high likelihood simulations. As a result, all samples are focused on the higher density region of the likelihood, and the variance of the estimator is reduced.

The application of particle filters for likelihood estimation within MCMC schemes is called particle MCMC (Wilkinson, 2012). For the purposes of this article, we directly apply the pseudo-marginal method using Algorithm 2 to evaluate the acceptance probability (Equation (13)) within the Metropolis-Hastings MCMC scheme (Algorithm 1). This turns out to be a special case of the particle marginal Metropolis-Hastings sampler, which may be used effectively to sample the joint probability density p(θ, X_{t_0}, X_{t_1}, …, X_{t_n} | Y_obs^(0), Y_obs^(1), …, Y_obs^(n)), as demonstrated within the very general framework introduced by Andrieu et al. (2010).

Figure 5: Demonstration of the bootstrap particle filter using R = 4 particles for demonstration purposes. (A) Observations with error bars indicating three standard deviations of the observation noise distribution. (B) Particles are initially weighted equally. (C)–(H) Three iterations of the bootstrap particle filter. (C), (E), and (G) Particle forward trajectories with weights indicated by opacity. (D), (F), and (H) Particle weight distributions computed for resampling.

There are a number of factors that may affect the performance of particle MCMC sampling in practice. Firstly, an important issue for sequential importance resampling schemes, such as the bootstrap particle filter, is the problem of particle degeneracy.
That is, as the number of iterations increases, the number of particles with non-zero weights decreases. As a result, the accuracy of the approximation to p(X_{t_i} | Y_obs^(0), …, Y_obs^(i)) degrades, as this dependency on a very small particle count introduces bias. While the resampling step reduces the impact of degeneracy, in general a larger number of observations, n, will necessitate a larger number of particles, R, for the likelihood estimator (Doucet and Johansen, 2011). Degeneracy becomes even more problematic when the observation error is very small (Golightly and Wilkinson, 2008, 2011), and may require more advanced resampling methods, such as systematic and stratified resampling (Carpenter et al., 1999; Kitagawa, 1996). In this work, we apply a direct multinomial resampling scheme (Doucet and Johansen, 2011).

Just as with the more general pseudo-marginal approach, there is a trade-off between the convergence rate of the Markov chain and the computational cost of each likelihood estimate. While Doucet et al. (2015) provide guidelines for optimally choosing R, these guidelines may not be feasible to implement, since the behaviour of the likelihood estimator about the posterior mean is rarely known (especially since one is often using MCMC in order to compute this quantity).

The performance of particle MCMC methods also depends on the choice of proposal kernel, just as with classical Metropolis-Hastings. When the full inference problem is considered, there are a number of novel proposal schemes (Andrieu et al., 2010; Pooley et al., 2015). A number of asymmetric proposal kernels, such as preconditioned Crank-Nicolson Langevin proposals (Cotter et al., 2013), can also be very effective in high dimensional parameter spaces. However, in general, one needs to experiment to find an effective combination of proposal kernel and particle number that will converge in an acceptable timeframe.

The question of assessing convergence can be challenging.
Typically, the autocorrelation functions (ACFs) for each parameter are computed, along with the potential scale reduction factor (Geyer, 1992). However, these convergence diagnostics can be very misleading, especially if the posterior is multimodal. To deal with this, it is common to use multiple chains and assess the within-chain and between-chain variances (Gelman et al., 1996, 2014). In this work, we follow the recent recommendations of Vehtari et al. (2019).

As a practical demonstration of the use of particle MCMC, three examples are provided where expressions for the likelihood function are not available.
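Before moving to the examples, Algorithm 2 itself is compact enough to sketch in full. The paper's implementation (BootstrapParticleFilter.jl) is in Julia; below is an illustrative Python version for a scalar state, where `propagate` and `log_g` are hypothetical user-supplied functions for forward simulation between observation times and for the log observation density, respectively.

```python
import numpy as np

def bootstrap_log_likelihood(propagate, log_g, x0, y_obs, R, rng):
    """Log-likelihood estimate via the bootstrap particle filter (Algorithm 2).

    propagate(particles, i, rng): simulate every particle from t_{i-1} to t_i.
    log_g(y, particles): log observation density of y under each particle.
    """
    particles = np.full(R, float(x0))                 # initialise all particles
    log_like = 0.0
    for i, y in enumerate(y_obs):
        particles = propagate(particles, i, rng)      # forward simulation step
        logw = log_g(y, particles)                    # unnormalised log weights
        c = logw.max()
        w = np.exp(logw - c)                          # subtract max for stability
        log_like += c + np.log(w.mean())              # log p_hat(y_i | y_0:i-1)
        idx = rng.choice(R, size=R, p=w / w.sum())    # multinomial resampling
        particles = particles[idx]
    return log_like
```

Inside a pseudo-marginal MCMC chain this function plays the role of the likelihood estimator: one call per proposed θ, with θ entering through `propagate` and `log_g`.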
The first example is based on the stochastic variant of the classical model for enzymekinetics (Michaelis and Menten, 1913; Rao and Arkin, 2003).
The classical model of Michaelis and Menten (1913) for enzyme kinetics describes the conversion of a chemical substrate S into a product P through the binding of an enzyme E. An enzyme molecule, E, binds to a substrate molecule, S, to form a complex, C, which converts S to P. The stochastic process is described through three reactions (Rao and Arkin, 2003),

E + S →(k₁) C (enzyme and substrate molecules combine to form a complex),
C →(k₂) E + S (decay of complex), and
C →(k₃) E + P (catalytic conversion of substrate to product), (18)

with propensities a₁(X_t) = k₁E_tS_t, a₂(X_t) = k₂C_t, and a₃(X_t) = k₃C_t, where X_t = [E_t, S_t, C_t, P_t]^T, and stoichiometries ν₁ = [−1, −1, 1, 0]^T, ν₂ = [1, 1, −1, 0]^T and ν₃ = [1, 0, −1, 1]^T.

Application of the chemical Langevin approximation (Equation (2)) to the Michaelis-Menten model (Equation (18)) leads to a coupled system of Itô SDEs,

dE_t = [−k₁E_tS_t + (k₂ + k₃)C_t] dt − √(k₁E_tS_t) dW_t^(1) + √(k₂C_t) dW_t^(2) + √(k₃C_t) dW_t^(3),
dS_t = (−k₁E_tS_t + k₂C_t) dt − √(k₁E_tS_t) dW_t^(1) + √(k₂C_t) dW_t^(2),
dC_t = [k₁E_tS_t − (k₂ + k₃)C_t] dt + √(k₁E_tS_t) dW_t^(1) − √(k₂C_t) dW_t^(2) − √(k₃C_t) dW_t^(3),
dP_t = k₃C_t dt + √(k₃C_t) dW_t^(3), (19)

where W_t^(1), W_t^(2) and W_t^(3) are independent Wiener processes driving each reaction channel. A typical realisation of the model is provided in Figure 6. Note that as t → ∞ the stationary distribution is a product of Dirac distributions, that is, a point mass at X_∞ = [E₀ + C₀, 0, 0, S₀ + C₀ + P₀]^T given X₀ = [E₀, S₀, C₀, P₀]^T. Therefore, observations involving the transient behaviour are essential to recover information about the rate parameters.
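The coupled SDE system in Equation (19) is straightforward to simulate with the Euler-Maruyama scheme. The sketch below is an illustrative Python version (the paper's examples are in Julia); the parameter values and initial condition are assumptions chosen only so the sketch runs, and copy numbers are clipped at zero inside the propensities so the square roots remain well defined.

```python
import numpy as np

def simulate_mm_cle(k1, k2, k3, x0, T, dt, rng):
    """Euler-Maruyama simulation of the Michaelis-Menten chemical Langevin SDE.

    State ordering is X_t = [E_t, S_t, C_t, P_t].
    """
    # stoichiometries for E+S->C, C->E+S and C->E+P (rows are reactions)
    nu = np.array([[-1.0, -1.0,  1.0, 0.0],
                   [ 1.0,  1.0, -1.0, 0.0],
                   [ 1.0,  0.0, -1.0, 1.0]])
    n = int(round(T / dt))
    X = np.empty((n + 1, 4))
    X[0] = x0
    for i in range(n):
        E, S, C, _ = np.maximum(X[i], 0.0)           # keep propensities non-negative
        a = np.array([k1 * E * S, k2 * C, k3 * C])   # reaction propensities
        dW = np.sqrt(dt) * rng.standard_normal(3)    # one Wiener increment per channel
        X[i + 1] = X[i] + (a @ nu) * dt + (np.sqrt(a) * dW) @ nu
    return X

# assumed parameters and initial condition, for illustration only
rng = np.random.default_rng(42)
path = simulate_mm_cle(1e-3, 5e-3, 1e-2, [100.0, 100.0, 0.0, 0.0], 100.0, 0.01, rng)
```

The conservation laws E_t + C_t and S_t + C_t + P_t are linear invariants of Equation (19) and are preserved exactly by the scheme, which makes a useful correctness check.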
Figure 6: Example realisation of the Michaelis-Menten model demonstrating enzyme kinetics, produced using the Euler-Maruyama scheme (the step size, initial condition and rate parameters used are specified in DemoMichMentPMCMC.jl).

While some analytic progress on likelihood approximation can be made using moment closures (Schnoerr et al., 2017), the second order reaction for the production of complexes, C, effectively renders the distribution of the forwards problem analytically intractable.

We generate synthetic data using a single realisation of the Michaelis-Menten chemical Langevin SDE, with initial condition X₀ = [E₀, S₀, C₀, P₀]^T, where E₀ = 100, and kinetic rate parameters k₁, k₂ and k₃ (the full configuration is given in DemoMichMentPMCMC.jl). Observations are taken at n = 20 uniformly spaced time points t₁ = 5, t₂ = 10, …, t₂₀ = 100. The observation process applies Gaussian noise to each chemical species copy number with a standard deviation of σ_obs = 10, that is, Y_obs^(i) ∼ N(X_{t_i}, σ_obs² I), where I is the 4 × 4 identity matrix.

We perform inference on all three rate parameters, θ = [k₁, k₂, k₃]^T. We use the particle MCMC approach to sample the Bayesian posterior,

p(k₁, k₂, k₃ | D) ∝ L(k₁, k₂, k₃; D) p(k₁, k₂, k₃),

where p(k₁, k₂, k₃) is the joint uniform prior with independent components, each with a lower bound of zero (the upper bounds are specified in DemoMichMentPMCMC.jl). The likelihood is estimated using the bootstrap particle filter (Algorithm 2) with R = 100 particles and the Euler-Maruyama method for forward simulation.

Any application of MCMC requires both a method of initialising the chain and a choice of proposal kernel. To deal with both of these challenges we can apply trial chains. Four trial chains are simulated for M = 8,000 iterations, each initialised with a random sample from the prior that has a non-zero likelihood estimate. The proposal kernel used in all four trial chains is a Gaussian with a diagonal covariance matrix (specified in DemoMichMentPMCMC.jl). The diagonal entries correspond to a proposal density such that one tenth of the prior standard deviation for each parameter is within a single standard deviation of the proposal; such independent proposal kernels are typical choices. However, this proposal is not very efficient, as Figure 7(A)–(F) indicates for the first 8,000 iterations of the first chain. These trial chains are not used for inference, only for configuring a new set of more efficient chains.

The tuned proposal kernel is constructed by taking the covariance matrix, Σ̂, of the pooled sample of the four trial chains (a total of 32,000 samples) and applying the optimal scaling rule of Roberts and Rosenthal (2001),

Σ_opt = (2.38²/3) Σ̂.

While the optimality of this scaling factor assumes a Gaussian posterior density, it is a useful guide that is widely applied (Roberts and Rosenthal, 2009). The final iteration of each trial chain is then used to initialise four new tuned chains with this optimal proposal covariance. The improvement in convergence behaviour is shown in Figure 7(G)–(L).

Figure 7: Comparison of marginal trace plots and autocorrelation functions (see Appendix C) using (A)–(F) the naïve independent Gaussian proposals and (G)–(L) optimally scaled correlated proposals.

Convergence assessment and parameter estimates

Determining the number of iterations of the tuned chains needed to ensure valid inference is another practical challenge. Here, we follow the recommendations of Vehtari et al. (2019) and apply the rank-normalised R̂ statistic along with the multiple-chain effective sample size, S_eff (Appendix C), using the four tuned Markov chains. Informally, R̂ represents the ratio between an estimate of the posterior variance and the average variance of each independent Markov chain; as M → ∞, R̂ → 1. The S_eff statistic provides a measure of the effective number of i.i.d. samples that the Markov chains represent for the purposes of computing an expectation. Larger values of S_eff are better, but S_eff will typically be much smaller than M.

The results, by parameter, are shown in Table 1 after M = 15,000 iterations per chain. Vehtari et al. (2019) recommend that R̂ < 1.01 and S_eff > 400 for each parameter. We conclude that the chains have converged sufficiently for our purposes.

Table 1: Convergence diagnostics using four chains, each with 15,000 iterations, using the optimal proposal with dependent components.

          k₁     k₂     k₃
  S_eff   986    683    1,909

Table 2 reports parameter estimates based on estimates of the mean, μ̂, and standard deviation, σ̂, with respect to the marginal posterior for each rate parameter.

In Figure 8, we see that the modes of the marginal posteriors for two of the rate parameters are very close to the true values; for the remaining parameter, the mode overestimates the true value. It is important to emphasise that this is not due to inaccuracy of the pseudo-marginal inference, but is a feature of the true posterior density. This result effectively highlights the uncertainty in that rate estimate due to partial observations, observation error, and model stochasticity.

The implementation of this inference problem, including data generation, tuning and initialisation steps, convergence assessment, and plotting, is given in DemoMichMentPMCMC.jl. The rank-normalised R̂ and S_eff statistics are implemented within Diagnostics.jl.

Figure 8: Marginal smoothed kernel density estimates for the Michaelis-Menten model using four converged particle MCMC chains. True parameter values are also indicated (red dashed).
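The proposal-tuning step used in this example (and reused for the Schlögl and repressilator models below) reduces to pooling the trial chains, computing their sample covariance, and applying the Roberts and Rosenthal (2001) scaling factor 2.38²/d. A minimal Python sketch follows (the paper performs this step in Julia within DemoMichMentPMCMC.jl):

```python
import numpy as np

def tuned_proposal_covariance(trial_chains):
    """Optimally scaled Gaussian proposal covariance from trial MCMC chains.

    trial_chains: list of (M, d) arrays of sampled parameter vectors; the
    chains are pooled and the Roberts-Rosenthal factor 2.38^2 / d applied.
    """
    pooled = np.vstack(trial_chains)             # e.g. 4 chains x 8,000 = 32,000 samples
    d = pooled.shape[1]
    sigma_hat = np.cov(pooled, rowvar=False)     # pooled sample covariance
    return (2.38 ** 2 / d) * sigma_hat
```

The returned matrix is used directly as the covariance of the Gaussian random-walk proposal for the tuned chains, so parameter correlations learned in the trial runs carry over to the proposals.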
The second example demonstrates the phenomenon of stochastic bi-stability. This leads to a very challenging inference problem that is poorly suited to alternative likelihood-free schemes such as ABC.
This example is a theoretical biochemical network initially studied by Schlögl (1972). The model involves a single chemical species, X_t, that evolves according to four reactions,

2X →(k₁) 3X,  3X →(k₂) 2X,  ∅ →(k₃) X,  and  X →(k₄) ∅, (20)

with propensities a₁(X_t) = k₁X_t(X_t − 1), a₂(X_t) = k₂X_t(X_t − 1)(X_t − 2), a₃(X_t) = k₃, and a₄(X_t) = k₄X_t, and stoichiometries ν₁ = 1, ν₂ = −1, ν₃ = 1 and ν₄ = −1. The chemical Langevin Itô SDE is

dX_t = [−k₂X_t³ + (k₁ + 3k₂)X_t² − (k₁ + 2k₂ + k₄)X_t + k₃] dt + √(k₂X_t³ + (k₁ − 3k₂)X_t² + (2k₂ − k₁ + k₄)X_t + k₃) dW_t, (21)

where W_t is a Wiener process. For certain values of the rate parameters, the underlying deterministic model has two stable steady states separated by an unstable steady state (Schlögl, 1972; Vellela and Qian, 2009). In the stochastic case, it is possible for the intrinsic noise of the system to drive X_t from around one stable state toward the other, resulting in the switching behaviour demonstrated in Figure 9, called stochastic bi-stability. The time between switching events is also a random variable, and observations taken from a single realisation will be very difficult to match using simulated data in the ABC setting; therefore, acceptance rates will be prohibitively low. On the other hand, particle MCMC is ideally suited to this problem, since we condition simulations on the observations, thereby only sampling from realisations that pass closely to the data.
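The drift and diffusion polynomials of Equation (21) can be simulated directly with Euler-Maruyama to reproduce the switching behaviour. The following is an illustrative Python sketch (the paper's code is Julia); the rate values below are the canonical bistable Schlögl rates and are an assumption for illustration, and the diffusion argument is clipped at zero for numerical safety.

```python
import numpy as np

def simulate_schlogl_cle(k1, k2, k3, k4, x0, T, dt, rng):
    """Euler-Maruyama simulation of the Schlogl chemical Langevin SDE (Equation (21))."""
    n = int(round(T / dt))
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        X = max(x[i], 0.0)                        # clip the state used in the coefficients
        drift = -k2 * X**3 + (k1 + 3.0 * k2) * X**2 - (k1 + 2.0 * k2 + k4) * X + k3
        diff2 = k2 * X**3 + (k1 - 3.0 * k2) * X**2 + (2.0 * k2 - k1 + k4) * X + k3
        x[i + 1] = x[i] + drift * dt + np.sqrt(max(diff2, 0.0) * dt) * rng.standard_normal()
    return x

# canonical bistable rates, assumed for illustration; the stable states sit near X ~ 100 and X ~ 400
rng = np.random.default_rng(7)
path = simulate_schlogl_cle(0.18, 2.5e-4, 2200.0, 37.5, 0.0, 50.0, 1e-3, rng)
```

Running several seeds and plotting the paths reproduces the switching between the two stable states seen in Figure 9.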
Figure 9: Four example realisations of the Schlögl model demonstrating stochastic bi-stability, produced using the Euler-Maruyama scheme with initial condition X₀ = 0 (the step size and rate parameters used are specified in DemoSchloglPMCMC.jl).

We generate synthetic data using a single realisation of the Schlögl model chemical Langevin SDE, with initial condition X₀ = 0 and the kinetic rate parameters k₁, k₂, k₃ and k₄ specified in DemoSchloglPMCMC.jl. Observations are taken at n = 16 uniformly spaced time points t₁ = 12.5, t₂ = 25, …, t₁₆ = 200. The observation process is modelled by Gaussian noise applied to the chemical species copy number with a standard deviation of σ_obs = 10, that is, Y_obs^(i) ∼ N(X_{t_i}, σ_obs²). See Appendix D for the resulting data table.

We perform inference on all four rate parameters, θ = [k₁, k₂, k₃, k₄]^T. We use the particle MCMC approach to sample the Bayesian posterior,

p(k₁, k₂, k₃, k₄ | D) ∝ L(k₁, k₂, k₃, k₄; D) p(k₁, k₂, k₃, k₄),

where p(k₁, k₂, k₃, k₄) is the joint uniform prior with independent components, each with a lower bound of zero (the upper bounds are specified in DemoSchloglPMCMC.jl). The likelihood is estimated using the bootstrap particle filter (Algorithm 2) with R = 100 particles and the Euler-Maruyama method for forward simulation.

To initialise and tune four chains for inference on the four rate parameters of the Schlögl model, we apply the same procedure as described for the Michaelis-Menten inference problem; the only difference is the number of samples used. Firstly, four trial chains are simulated for M = 20,000 iterations, each initialised with a random sample from the prior that has a non-zero likelihood estimate. The proposal kernel used in all four trial chains is a Gaussian with a diagonal covariance matrix (specified in DemoSchloglPMCMC.jl). The tuned proposal kernel is then constructed by taking the covariance matrix, Σ̂, of the pooled sample of the four trial chains (a total of 80,000 samples) and applying the optimal scaling rule of Roberts and Rosenthal (2001),

Σ_opt = (2.38²/4) Σ̂.

The final iteration of each trial chain is then used to initialise four new chains using this optimal proposal covariance. Figure 10 demonstrates the improvement in convergence behaviour.

Figure 10: Comparison of marginal trace plots and autocorrelation functions (see Appendix C) using (A)–(H) the naïve independent Gaussian proposals and (I)–(P) optimally scaled correlated proposals.

Convergence assessment and parameter estimates

Convergence diagnostic results, by parameter, are shown in Table 3 after M = 240,000 iterations per chain; all parameters satisfy the criteria of R̂ < 1.01 and S_eff > 400.

Table 3: Convergence diagnostics using four chains, each with 240,000 iterations, using the optimal proposal with dependent components.

          k₁    k₂    k₃    k₄
  S_eff   577   656   467   625

The modes of the marginal posteriors for k₁ and k₂ are very close to the true parameter values; however, the uncertainties are asymmetric, indicating a range of possibly appropriate parameter values greater than the true values. The posteriors of k₃ and k₄ are very interesting as they are bi-modal (Figure 11(C)–(D)). While the higher density mode is closer to the true parameter values, the second, lower density mode indicates that an alternative parameter combination in k₃ and k₄ can lead to very similar stochastic bi-stability in the Schlögl model evolution. To observe this posterior bi-modality using ABC methods would be very challenging, since ε would need to be prohibitively small. Furthermore, most expositions on ABC methods (Sunnåker et al., 2013; Toni et al., 2009; Warne et al., 2019) do not deal with multimodal posteriors.

Table 4 reports parameter estimates based on estimates of the mean, μ̂, and standard deviation, σ̂, with respect to the marginal posterior for each rate parameter.

The apparent bi-modal nature of the posterior (Figure 11) could be one reason for the increased computational requirements of this model, since all chains must occupy both modes sufficiently to reduce R̂ and to increase S_eff sufficiently.

The implementation of this inference problem, including data generation, tuning and initialisation steps, convergence assessment, and plotting, is given in DemoSchloglPMCMC.jl. The rank-normalised R̂ and S_eff statistics are implemented within Diagnostics.jl.

Figure 11: Marginal smoothed kernel density estimates for the Schlögl model using four converged particle MCMC chains. True parameter values are also indicated (red dashed).
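The multiple-chain convergence check used in these examples can be illustrated with the classic (non rank-normalised) potential scale reduction factor; the paper itself uses the rank-normalised variant of Vehtari et al. (2019), implemented in Diagnostics.jl, but the between-/within-chain comparison is the same idea. A minimal Python sketch:

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor, R_hat, for one parameter.

    chains: (n_chains, M) array of post burn-in samples; values near 1
    indicate that the between-chain and within-chain variances agree.
    """
    chains = np.asarray(chains, dtype=float)
    _, M = chains.shape
    W = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    B = M * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (M - 1) / M * W + B / M           # pooled posterior variance estimate
    return np.sqrt(var_hat / W)
```

Applied per parameter across the four tuned chains, values above roughly 1.01 signal that more iterations (or better mixing) are needed.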
The last example we consider in this work is a gene regulatory network, originally realisedsynthetically by Elowitz and Leibler (2000), that includes a feedback loop resulting instochastic oscillatory dynamics in the gene expression. The model is of interest in bio-logical studies (Pokhilko et al., 2012; Potvin-Trottier et al., 2016) and is a challengingbenchmark for inference methods (Toni et al., 2009).
The repressilator consists of three genes where the expression of one gene inhibits the expression of the next gene, forming a feedback loop between the three genes. The regulatory network consists of twelve reactions describing the transcription of the three mRNAs, M₁, M₂, and M₃, associated with each gene, G₁, G₂, and G₃, their expression through translation into proteins, P₁, P₂, and P₃, and degradation processes for both mRNAs and proteins. For the ith gene we have

G_i →(α₀ + α/(1 + P_j^n)) M_i (mRNA transcription),
M_i →(β) M_i + P_i (protein translation),
P_i →(β) ∅ (protein degradation), and
M_i →(γ) ∅ (mRNA degradation), (22)

where j = (i + 1 mod 3) + 1, α₀ ≥ 0 is the leakage transcription rate under full repression, α + α₀ is the maximum transcription rate, n > 0 is the Hill coefficient controlling the strength of repression by protein P_j, β > 0 is the translation and protein degradation rate, γ > 0 is the mRNA degradation rate, and G_i = 1 for i = 1, 2, 3. The resulting chemical Langevin approximation (Equation (2)) to the repressilator model (Equation (22)) leads to a coupled system of Itô SDEs,

dM₁,t = (α₀ + α/(1 + P₃,t^n) − γM₁,t) dt + √(α₀ + α/(1 + P₃,t^n)) dW_t^(1) − √(γM₁,t) dW_t^(4),
dP₁,t = β(M₁,t − P₁,t) dt + √(βM₁,t) dW_t^(2) − √(βP₁,t) dW_t^(3),
dM₂,t = (α₀ + α/(1 + P₁,t^n) − γM₂,t) dt + √(α₀ + α/(1 + P₁,t^n)) dW_t^(5) − √(γM₂,t) dW_t^(8),
dP₂,t = β(M₂,t − P₂,t) dt + √(βM₂,t) dW_t^(6) − √(βP₂,t) dW_t^(7),
dM₃,t = (α₀ + α/(1 + P₂,t^n) − γM₃,t) dt + √(α₀ + α/(1 + P₂,t^n)) dW_t^(9) − √(γM₃,t) dW_t^(12),
dP₃,t = β(M₃,t − P₃,t) dt + √(βM₃,t) dW_t^(10) − √(βP₃,t) dW_t^(11), (23)

where W_t^(1), W_t^(2), …, W_t^(12) are independent Wiener processes driving each reaction channel. Certain parameter combinations lead to stochastic oscillations in the gene expression levels, that is, the protein copy numbers associated with the expressed genes. Figure 12 demonstrates this behaviour, in which the expressed gene alternates between G₁, G₂, and G₃ in sequence due to the feedback loop in the gene inhibitor network.

Figure 12: Example realisation of the repressilator model demonstrating oscillatory gene expression. The simulation is produced using the Euler-Maruyama scheme with parameters α₀ = 1, α = 1000, n = 2, β = 5, and γ = 1 (the step size and initial condition are specified in DemoRepressilatorPMCMC.jl).

We generate synthetic data using a single realisation of the repressilator chemical Langevin SDE, with the same initial condition and parameters α₀ = 1, α = 1000, n = 2, β = 5, and γ = 1. Observations are taken at n = 20 uniformly spaced time points t₁ = 5, t₂ = 10, …, t₂₀ = 100. Again we consider
Again we consider29aussian noise applied to each chemical species copy number with a standard deviationof σ obs = 10, that is, Y ( i )obs ∼ N ( X t i , σ I ) where I is the 6 × θ = [ α , α, n, β ] T , and fix themRNA degradation rate γ = 1. We use the particle MCMC approach to sample fromthe Bayesian posterior, p ( α , α, n, β | D ) ∝ L ( α , α, n, β ; D ) p ( α , α, n, β ) , where p ( α , α, n, β ) is the joint uniform prior with independent components α ∼ U (0 , α ∼ U (500 , n ∼ U (0 , β ∼ U (0 , The same initialisation and proposal tuning procedure applied to the Michaelis-Mentenand Schl¨ogl models is applied here. The resulting tuned proposal kernel convariance isgiven by Σ opt = 2 . .
288 95 .
328 173 .
584 416 . .
328 0 .
345 1 .
051 1 . .
584 1 .
051 5 .
286 3 . .
368 1 .
142 3 .
488 5 . , which is derived through application of the Roberts and Rosenthal (2001) scaling rule tothe covariance matrix of the pooled samples from four trial chains, each with M = 5 , R , on thedistribution of the logarithm of the likelihood estimator evaluated Z = log ˆ L ( θ ; D ). − − − − Z . . . . . p ( Z ) R = 100 R = 200 R = 400 R = 800 Figure 13: Distribution of 1 ,
000 likelihood estimates around a high density posteriorpoint for different numbers of particles in the bootstrap particle filter. Significant bias isintroduced for the lower particle counts. 30ote that as R decreases, not only does the variance of the estimator increase, butso does the bias that is seen through the shift in the estimator mode. Here, there isa trade-off, R = 800 yields a low variance and is much closer to the optimal criterionof Doucet et al. (2015). However, R = 400 has a very similar mode, but slightly highervariance. This motivates the use of R = 400 particles to achieve reasonable convergencesrates without too much additional computational burden. Convergence diagnostic results, by parameter, are shown in Table 5 after M = 145 , R < . ,
000 iterations usingthe optimal proposal with dependent components. α α β nS eff
102 107 152 205ˆ R µ , and standard deviation,ˆ σ , with respect to the marginal posterior. α α β n θ true . . . . µ . . . . σ . . . . β and n parameters are veryaccurately retrieved, whereas the marginal posteriors for α and α lead to underestimates.However, these underestimates are consistent with previous results (Toni et al., 2009). Alikely cause of this is the temporal sparsity of observations, leading to few observationsof the peak gene expression levels (see Figure 12 and data in Appendix D), as α and α relate to the transcription rate of the mRNAs. Despite the additional computationalcomplexity associated with the particle filter for this inference problem, the repressilatormodel provides an insightful example of the efficacy of the pseudo-marginal approach toresolve biological parameters associated with gene regulation using synthetic data that isbiological realisable.The implementation of this inference problem, including data generation, tuning andinitialisation steps, convergence assessment, and plotting is give in DemoRepressilatorPMCMC.jl .The rank normalised ˆ R and S eff statistics are implemented within Diagnostics.jl .31
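The bias visible in Figure 13 is a generic feature of plugging an unbiased likelihood estimate into a logarithm: by Jensen's inequality, E[log L̂] ≤ log L, and both the bias and the variance of log L̂ grow as R shrinks. This can be reproduced with a toy Monte Carlo likelihood, independent of any particle filter; the model below (L = E[g(X)] with X ∼ N(0, 1) and g a unit Gaussian density centred at 1) is an assumption chosen purely for illustration.

```python
import numpy as np

def log_likelihood_estimate(R, rng):
    """Log of an unbiased R-sample Monte Carlo estimate of the toy likelihood
    L = E[g(X)], where X ~ N(0, 1) and g(x) = N(1; x, 1)."""
    x = rng.standard_normal(R)
    g = np.exp(-0.5 * (1.0 - x) ** 2) / np.sqrt(2.0 * np.pi)
    return np.log(g.mean())

# exact value: L = N(1; 0, sqrt(2)), so log L = -0.5*log(4*pi) - 1/4
exact = -0.5 * np.log(4.0 * np.pi) - 0.25

rng = np.random.default_rng(0)
for R in (5, 50, 500):
    z = np.array([log_likelihood_estimate(R, rng) for _ in range(3000)])
    print(R, z.mean() - exact, z.var())   # downward bias and variance both shrink with R
```

The same downward shift of the mode with decreasing R that appears here is what Figure 13 shows for the particle filter estimator.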
Figure 14: Marginal smoothed kernel density estimates for the repressilator model using four particle MCMC chains. True parameter values are also indicated (red dashed).
In this work, we provide a practical guide to computational Bayesian inference using the pseudo-marginal approach (Andrieu et al., 2010; Andrieu and Roberts, 2009; Beaumont, 2003). We compare and contrast, using a tractable example, the pseudo-marginal approach with the ABC alternative (Sisson et al., 2018; Sunnåker et al., 2013). Throughout, chemical Langevin SDE descriptions of biochemical reaction networks (Gillespie, 2000; Higham, 2008), of various degrees of complexity, have been employed to demonstrate practical considerations when applying these techniques to inference problems with intractable likelihoods.

The ABC approach to likelihood-free inference is widely applicable and used extensively in practical applications (Browning et al., 2018; Johnston et al., 2016; Kursawe et al., 2018; Warne et al., 2019b; Wilkinson, 2011). For some applications, however, it can be difficult to determine a priori an appropriate discrepancy metric and acceptance threshold for reliable inference. Furthermore, a sufficiently small threshold for the desired level of accuracy may result in prohibitively low acceptance rates (Sisson et al., 2007). Pseudo-marginal methods do not suffer from these accuracy considerations, since they converge to the true posterior target regardless of the variance of the estimator (Golightly and Wilkinson, 2011). As a result, the pseudo-marginal approach is significantly less sensitive to user-specified algorithm parameters than likelihood-free inference based on ABC.

There are also disadvantages to the pseudo-marginal approach. Firstly, it is not as generally applicable as ABC; the pseudo-marginal method requires an unbiased estimator, whereas ABC only needs a model simulation process.
While convergence to the true posterior distribution is not affected by the estimator variance, the rate of convergence is (Andrieu and Roberts, 2009); to obtain optimal likelihood variances, a large number of particles may be required, and evaluating the likelihood estimate will then be very expensive. Alternatively, ABC will only ever use a single simulation per iteration. Furthermore, in the presence of observation error and model misspecification, convergence to the true posterior is not always a significant advantage (Andrieu et al., 2018; Wilkinson, 2009), and ABC may be effectively considered exact (Wilkinson, 2013). Lastly, pseudo-marginal methods are not widespread in the systems biology literature and there is a lack of exemplars, despite their suitability for many problems of interest. This review is intended to address this by presenting all the steps involved clearly and providing user-friendly implementations in an open access environment (https://github.com/davidwarne/Warne2019 GuideToPseudoMarginal).

For practical illustrative purposes, we focus on the fundamental method of particle marginal Metropolis-Hastings (Andrieu et al., 2010) using the bootstrap particle filter (Gordon et al., 1993) for likelihood estimation. There are many other variants to this classic approach, such as particle Gibbs sampling (Andrieu et al., 2010; Doucet et al., 2015), coupled Markov chains (Dodwell et al., 2015, 2019), and more advanced particle filters (Doucet and Johansen, 2011) and proposal mechanisms (Botha et al., 2019; Cotter et al., 2013). It is also important to note that the pseudo-marginal approach is equally valid for Bayesian sampling strategies based on sequential Monte Carlo (Del Moral et al., 2006; Li et al., 2019; Sisson et al., 2007).
Furthermore, advances in stochastic simulation (Schnoerr et al., 2017; Warne et al., 2019) can also improve the performance of the likelihood estimator, and the application of multilevel Monte Carlo to particle filters can further reduce estimator variance (Gregory et al., 2016; Jasra et al., 2017, 2018).

Likelihood-free methods are essential to modern biological sciences, since many mechanistic models of interest have intractable likelihoods. Unlike ABC methods, the pseudo-marginal approach does not affect the stationary distribution for the purposes of MCMC sampling; this is a desirable property. However, one reason for the popularity and success of ABC methods has been their simplicity of implementation. Through this accessible and practical demonstration, along with example open-source codes, the pseudo-marginal approach may become an additional readily available tool for likelihood-free inference within the wider scientific community.
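To summarise the workflow advocated in this review, particle marginal Metropolis-Hastings with a bootstrap particle filter can be sketched in a few dozen lines. The following Python sketch uses a deliberately simple linear-Gaussian state-space model (our choice, not one of the paper's case studies, whose reference implementations are the Julia scripts in the repository) so that every step is visible.

```python
import numpy as np

# Toy state-space model: X_{t+1} = X_t + theta + process noise, observed
# with Gaussian error. All names and settings here are illustrative.
rng = np.random.default_rng(7)

T, SIGMA_X, SIGMA_Y = 25, 0.5, 1.0
THETA_TRUE = 1.0
x = np.cumsum(THETA_TRUE + SIGMA_X * rng.normal(size=T))   # latent path, X_0 = 0
y = x + SIGMA_Y * rng.normal(size=T)                       # observed data

def log_lik_hat(theta, n_particles=100):
    """Bootstrap particle filter estimate of log L(theta; y); the estimator
    is unbiased on the likelihood scale, as pseudo-marginal MCMC requires."""
    particles = np.zeros(n_particles)
    ll = 0.0
    for t in range(T):
        particles = particles + theta + SIGMA_X * rng.normal(size=n_particles)
        logw = -0.5 * ((y[t] - particles) / SIGMA_Y) ** 2  # up to a constant
        m = logw.max()
        w = np.exp(logw - m)
        ll += m + np.log(w.mean())                         # incremental likelihood
        idx = rng.choice(n_particles, n_particles, p=w / w.sum())
        particles = particles[idx]                         # multinomial resampling
    return ll

def pmmh(n_iters=2000, step=0.2):
    theta, ll = 0.0, log_lik_hat(0.0)
    samples = []
    for _ in range(n_iters):
        prop = theta + step * rng.normal()                 # symmetric random walk
        ll_prop = log_lik_hat(prop)                        # fresh noisy estimate
        # flat prior on theta: accept using the ratio of estimated likelihoods
        if np.log(rng.uniform()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        samples.append(theta)
    return np.array(samples)

samples = pmmh()
print(samples[500:].mean())   # posterior mean, close to THETA_TRUE
```

Note the key pseudo-marginal detail: the likelihood estimate for the current state is reused, never recomputed, which is what preserves the exact posterior as the stationary distribution (Appendix B).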
Software availability
The Julia code examples and demonstration scripts are available from GitHub: https://github.com/davidwarne/Warne2019 GuideToPseudoMarginal.
Acknowledgements
This work was supported by the Australian Research Council (DP170100474). M.J.S. appreciates support from the University of Canterbury Erskine Fellowship. R.E.B. would like to thank the Leverhulme Trust for a Leverhulme Research Fellowship, the Royal Society for a Wolfson Research Merit Award, and the BBSRC for funding via BB/R00816/1.
References

Andrieu, C., Doucet, A., Holenstein, R., 2010. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72:269–342.

Andrieu, C., Lee, A., Vihola, M., 2018. Theoretical and methodological aspects of MCMC computations with noisy likelihoods. In: Handbook of Approximate Bayesian Computation, Sisson, S.A., Fan, Y., and Beaumont, M.A. (Eds.), 1st edn. Chapman & Hall/CRC Press.

Andrieu, C., Roberts, G.O., 2009. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37:697–725.

Bar-Joseph, Z., Gitter, A., Simon, I., 2012. Studying and modelling dynamic biological processes using time-series gene expression data. Nature Reviews Genetics, 13:552–562. DOI:10.1038/nrg3244

Beaumont, M.A., 2003. Estimation of population growth or decline in genetically monitored populations. Genetics, 164:1139–1160.

Besançon, M., Anthoff, D., Arslan, A., Byrne, S., Lin, D., Papamarkou, T., Pearson, J., 2019. Distributions.jl: definition and modeling of probability distributions in the JuliaStats ecosystem. arXiv e-prints, arXiv:1907.08611 [stat.CO]

Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B., 2017. Julia: a fresh approach to numerical computing. SIAM Review, 59:65–98. DOI:10.1137/141000671

Bierkens, J., Fearnhead, P., Roberts, G., 2019. The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47:1288–1320. DOI:10.1214/18-AOS1715

Botha, I., Kohn, R., Drovandi, C., 2019. Particle methods for stochastic differential equation mixed effects models. arXiv e-prints, arXiv:1907.11017 [stat.CO]

Browning, A.P., McCue, S.W., Binny, R.N., Plank, M.J., Shah, E.T., Simpson, M.J., 2018. Inferring parameters for a lattice-free model of cell migration and proliferation using experimental data. Journal of Theoretical Biology, 437:251–260. DOI:10.1016/j.jtbi.2017.10.032

Carpenter, J., Clifford, P., Fearnhead, P., 1999. Improved particle filter for nonlinear problems. IEE Proceedings – Radar, Sonar and Navigation, 146:2–7. DOI:10.1049/ip-rsn:19990255

Cotter, S.L., Erban, R., 2016. Error analysis of diffusion approximation methods for multiscale systems in reaction kinetics. SIAM Journal on Scientific Computing.

Cotter, S.L., Roberts, G.O., Stuart, A.M., White, D., 2013. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28:424–446.

Cowles, M.K., Carlin, B.P., 1996. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91:883–904. DOI:10.2307/2291683

Del Moral, P., Doucet, A., Jasra, A., 2006. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68:411–436.

Dodwell, T.J., Ketelsen, C., Scheichl, R., Teckentrup, A.L., 2015. A hierarchical multilevel Markov chain Monte Carlo algorithm with applications to uncertainty quantification in subsurface flow. SIAM/ASA Journal on Uncertainty Quantification, 3:1075–1108.

Dodwell, T.J., Ketelsen, C., Scheichl, R., Teckentrup, A.L., 2019. Multilevel Markov chain Monte Carlo. SIAM Review, 61:509–545. DOI:10.1137/19M126966X

Doucet, A., Johansen, A., 2011. A tutorial on particle filtering and smoothing: fifteen years later. In: The Oxford Handbook of Nonlinear Filtering, Oxford University Press, New York, 656–704.

Doucet, A., Pitt, M.K., Deligiannidis, G., Kohn, R., 2015. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102:295–313.

Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D., 1987. Hybrid Monte Carlo. Physics Letters B, 195:216–222. DOI:10.1016/0370-2693(87)91197-X

Elowitz, M.B., Leibler, S., 2000. A synthetic oscillatory network of transcriptional regulators. Nature, 403:335–338. DOI:10.1038/35002125

Fearnhead, P., Prangle, D., 2012. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74:419–474.

Gelman, A., Roberts, G.O., Gilks, W.R., 1996. Efficient Metropolis jumping rules. Bayesian Statistics, 5:599–607.

Gelman, A., Rubin, D.B., 1992. Inference from iterative simulation using multiple sequences. Statistical Science, 7:457–472.

Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B., 2014. Bayesian Data Analysis, 3rd edn. Chapman & Hall/CRC.

Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741. DOI:10.1109/TPAMI.1984.4767596

Geyer, C.J., 1992. Practical Markov chain Monte Carlo. Statistical Science, 7:473–483. DOI:10.1214/ss/1177011137

Gillespie, D.T., 1977. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81:2340–2361.

Gillespie, D.T., 2000. The chemical Langevin equation. The Journal of Chemical Physics, 113:297–306.

Golightly, A., Wilkinson, D.J., 2008. Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics and Data Analysis, 52:1674–1693.

Golightly, A., Wilkinson, D.J., 2011. Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo. Interface Focus, 1:807–820.

Gordon, N.J., Salmond, D.J., Smith, A.F.M., 1993. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F – Radar and Signal Processing, 140:107–113. DOI:10.1049/ip-f-2.1993.0015

Green, P.J., Łatuszyński, K., Pereyra, M., Robert, C.P., 2015. Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25:835–862.

Gregory, A., Cotter, C.J., Reich, S., 2016. Multilevel ensemble transform particle filtering. SIAM Journal on Scientific Computing, 38:A1317–A1338.

Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Higham, D.J., 2001. An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Review, 43:525–546.

Higham, D.J., 2008. Modeling and simulating chemical reactions. SIAM Review, 50:347–368.

Hines, K.E., Middendorf, T.R., Aldrich, R.W., 2014. Determination of parameter identifiability in nonlinear biophysical models: a Bayesian approach. The Journal of General Physiology, 143:401–416. DOI:10.1085/jgp.201311116

Jasra, A., Kamatani, K., Law, K., Zhou, Y., 2017. Multilevel particle filters. SIAM Journal on Numerical Analysis, 55:3068–3096. DOI:10.1137/17M1111553

Jasra, A., Kamatani, K., Law, K., Zhou, Y., 2018. Bayesian static parameter estimation for partially observed diffusions via multilevel Monte Carlo. SIAM Journal on Scientific Computing, 40:A887–A902. DOI:10.1137/17M1112595

Johnston, S.T., Ross, J.V., Binder, B.J., McElwain, D.L.S., Haridas, P., Simpson, M.J., 2016. Quantifying the effect of experimental design choices for in vitro scratch assays. Journal of Theoretical Biology, 400:19–31. DOI:10.1016/j.jtbi.2016.04.012

Kærn, M., Elston, T.C., Blake, W.J., Collins, J.J., 2005. Stochasticity in gene expression: from theories to phenotypes. Nature Reviews Genetics, 6:451–464.

Kitagawa, G., 1996. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5:1–25. DOI:10.2307/1390750

Kloeden, P.E., Platen, E., 1999. Numerical Solution of Stochastic Differential Equations, 3rd edn. Springer, New York.

Kursawe, J., Baker, R.E., Fletcher, A.G., 2018. Approximate Bayesian computation reveals the importance of repeated measurements for parameterising cell-based models of growing tissues. Journal of Theoretical Biology, 443:66–81. DOI:10.1016/j.jtbi.2018.01.020

Kurtz, T.G., 1972. The relationship between stochastic and deterministic models for chemical reactions. The Journal of Chemical Physics, 57:2976–2978. DOI:10.1063/1.1678692

Li, D., Clements, A., Drovandi, C., 2019. Efficient Bayesian estimation for GARCH-type models via sequential Monte Carlo. arXiv e-prints, arXiv:1906.03828 [stat.AP]

Locke, J.C.W., Elowitz, M.B., 2009. Using movies to analyse gene circuit dynamics in single cells. Nature Reviews Microbiology, 7:383–392. DOI:10.1038/nrmicro2056

Marjoram, P., Molitor, J., Plagnol, V., Tavaré, S., 2003. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America, 100:15324–15328.

Maruyama, G., 1955. Continuous Markov processes and stochastic equations. Rendiconti del Circolo Matematico di Palermo, 4:48–90. DOI:10.1007/BF02846028

Mengersen, K.L., Tweedie, R.L., 1996. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24:101–121.

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21:1087–1092.

Michaelis, L., Menten, M.L., 1913. Die Kinetik der Invertinwirkung. Biochemische Zeitschrift, 49:333–369.

Molecular Systems Biology, 8:574. DOI:10.1038/msb.2012.6

Pooley, C.M., Bishop, S.C., Marion, G., 2015. Using model-based proposals for fast parameter inference on discrete state space, continuous-time Markov processes. Journal of the Royal Society Interface.

Potvin-Trottier, L., Lord, N.D., Vinnicombe, G., Paulsson, J., 2016. Synchronous long-term oscillations in a synthetic gene circuit. Nature, 538:514–517. DOI:10.1038/nature19841

Raj, A., van Oudenaarden, A., 2008. Nature, nurture, or chance: stochastic gene expression and its consequences. Cell, 135:216–226.

Roberts, G.O., Rosenthal, J.S., 2001. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16:351–367. DOI:10.1214/ss/1015346320

Roberts, G.O., Rosenthal, J.S., 2004. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71. DOI:10.1214/154957804100000024

Roberts, G.O., Rosenthal, J.S., 2009. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18:349–367. DOI:10.1198/jcgs.2009.06134

Sahl, S.J., Hell, S.W., Jakobs, S., 2017. Fluorescence nanoscopy in cell biology. Nature Reviews Molecular Cell Biology, 18:685–701.

Schlögl, F., 1972. Chemical reaction models for non-equilibrium phase transitions. Zeitschrift für Physik, 253:147–161.

Schnoerr, D., Sanguinetti, G., Grima, R., 2017. Approximation and inference methods for stochastic biochemical kinetics – a tutorial review. Journal of Physics A: Mathematical and Theoretical, 50:093001.

Neuron, 58:52–64. DOI:10.1016/j.neuron.2008.02.014

Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton.

Sisson, S.A., Fan, Y., Beaumont, M.A., 2018. Handbook of Approximate Bayesian Computation, 1st edn. Chapman & Hall/CRC.

Sisson, S.A., Fan, Y., Tanaka, M.M., 2007. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America, 104:1760–1765.

Sunnåker, M., Busetto, A.G., Numminen, E., Corander, J., Foll, M., Dessimoz, C., 2013. Approximate Bayesian computation. PLOS Computational Biology, 9:e1002803.

Toni, T., Welch, D., Strelkowa, N., Ipsen, A., Stumpf, M.P.H., 2009. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6:187–202.

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., Bürkner, P.-C., 2019. Rank-normalization, folding, and localization: an improved R̂ for assessing convergence of MCMC. arXiv e-prints, arXiv:1903.08008 [stat.CO]

Vellela, M., Qian, H., 2009. Stochastic dynamics and non-equilibrium thermodynamics of a bistable chemical system: the Schlögl model revisited. Journal of the Royal Society Interface, 6:925–940. DOI:10.1098/rsif.2008.0476

Warne, D.J., Baker, R.E., Simpson, M.J., 2019. Simulation and inference algorithms for stochastic biochemical reaction networks: from basic concepts to state-of-the-art. Journal of the Royal Society Interface, 16:20180943.

Warne, D.J., Baker, R.E., Simpson, M.J., 2019b. Using experimental data and information criteria to guide model selection for reaction-diffusion problems in mathematical biology. Bulletin of Mathematical Biology, 81:1760–1804.

Wilkinson, D.J., 2009. Stochastic modelling for quantitative description of heterogeneous biological systems. Nature Reviews Genetics, 10:122–133.

Wilkinson, D.J., 2011. Parameter inference for stochastic kinetic models of bacterial gene regulation: a Bayesian approach to systems biology. Bayesian Statistics, 9:679–706.

Wilkinson, D.J., 2012. Stochastic Modelling for Systems Biology, 2nd edn. CRC Press.

Wilkinson, R.D., 2013. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12:129–141.

Proceedings of the National Academy of Sciences of the United States of America, 110:19307–19312. DOI:10.1073/pnas.1311790110

Young, J.W., Locke, J.C.W., Altinok, A., Rosenfeld, N., Bacarian, T., Swain, P.S., Mjolsness, E., Elowitz, M.B., 2012. Measuring single-cell gene expression dynamics in bacteria using fluorescence time-lapse microscopy. Nature Protocols, 7:80–88. DOI:10.1038/nprot.2011.432

Appendix A Derivation of stationary distribution for the production-degradation model
Here, we derive the solution to the stationary distribution for the production-degradation model. First, recall that the general form for the chemical Langevin equation is
\[
\mathrm{d}\mathbf{X}_t = \sum_{j=1}^{M} \boldsymbol{\nu}_j a_j(\mathbf{X}_t)\,\mathrm{d}t + \sum_{j=1}^{M} \boldsymbol{\nu}_j \sqrt{a_j(\mathbf{X}_t)}\,\mathrm{d}W_t^{(j)},
\]
where X_t takes values in R^N, W_t^{(1)}, W_t^{(2)}, ..., W_t^{(M)} are independent scalar Wiener processes, ν_1, ν_2, ..., ν_M are the stoichiometric vectors, and a_1(X_t), a_2(X_t), ..., a_M(X_t) the propensity functions. Consider the distribution of X_t over all possible realisations at time t, with probability density function p(x, t). The Fokker-Planck equation describes the forward evolution of this probability density in time. For the general chemical Langevin equation, the Fokker-Planck equation is given by
\[
\frac{\partial p(\mathbf{x},t)}{\partial t} = \frac{1}{2}\sum_{j=1}^{M} \boldsymbol{\nu}_j^{\mathrm{T}} H\big(a_j(\mathbf{x})p(\mathbf{x},t)\big)\boldsymbol{\nu}_j - \sum_{j=1}^{M} \nabla\big[a_j(\mathbf{x})p(\mathbf{x},t)\big]\cdot\boldsymbol{\nu}_j, \tag{A.1}
\]
where, for reaction j, ∇[a_j(x)p(x,t)] and H(a_j(x)p(x,t)) are, respectively, the gradient vector and Hessian matrix of the product a_j(x)p(x,t) with respect to the state vector x.

For the production-degradation model, we have a single chemical species X_t, propensity functions
\[
a_1(X_t) = k_1 \quad \text{and} \quad a_2(X_t) = k_2 X_t, \tag{A.2}
\]
with rate parameters k_1 and k_2, and stoichiometries
\[
\nu_1 = 1 \quad \text{and} \quad \nu_2 = -1. \tag{A.3}
\]
By substituting Equation (A.2) and Equation (A.3) into Equation (A.1), we obtain the Fokker-Planck equation for the production-degradation model,
\[
\frac{\partial p(x,t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\big[(k_1 + k_2 x)p(x,t)\big] - \frac{\partial}{\partial x}\big[(k_1 - k_2 x)p(x,t)\big]. \tag{A.4}
\]
The stationary distribution of X_t corresponds to the steady-state solution of Equation (A.4), that is, p_s(x) = lim_{t→∞} p(x, t). The stationary probability density function, p_s(x), satisfies
\[
\frac{1}{2}\frac{\mathrm{d}^2}{\mathrm{d}x^2}\big[(k_1 + k_2 x)p_s(x)\big] - \frac{\mathrm{d}}{\mathrm{d}x}\big[(k_1 - k_2 x)p_s(x)\big] = 0. \tag{A.5}
\]
To obtain a solution, integrate Equation (A.5) once to obtain
\[
\frac{1}{2}\frac{\mathrm{d}}{\mathrm{d}x}\big[(k_1 + k_2 x)p_s(x)\big] - (k_1 - k_2 x)p_s(x) = C, \tag{A.6}
\]
where C is an arbitrary constant. We obtain C = 0 by assuming the boundary condition lim_{x→∞} p_s(x) = 0. The solution to Equation (A.6) can be obtained using an integrating factor,
\[
p_s(x) = \frac{\tilde{C}}{k_1 + k_2 x}\exp\left(2\int^{x} \frac{k_1 - k_2 y}{k_1 + k_2 y}\,\mathrm{d}y\right) = \tilde{C}\exp\left(-2x + \left(\frac{4k_1}{k_2} - 1\right)\ln(k_1 + k_2 x)\right), \tag{A.7}
\]
where C̃ is a constant that is obtained by enforcing the condition ∫_{-∞}^{∞} p_s(x) dx = 1, yielding the stationary probability density provided in the main manuscript.

Appendix B Pseudo-marginal MCMC as an exact approximation
Here we briefly explain why the stationary distribution of the pseudo-marginal Metropolis-Hastings method is the exact posterior. For more detailed analysis, see Andrieu and Roberts (2009), and Golightly and Wilkinson (2008). Consider the following algebraic manipulations applied to the pseudo-marginal Metropolis-Hastings acceptance probability. We start with
\[
\alpha(\theta^*, \theta_m)
= \frac{q(\theta_m \mid \theta^*)\,\hat{L}(\theta^*; D)\,p(\theta^*)}{q(\theta^* \mid \theta_m)\,\hat{L}(\theta_m; D)\,p(\theta_m)}
= \frac{q(\theta_m \mid \theta^*)\,L(\theta^*; D)\left[\hat{L}(\theta^*; D)/L(\theta^*; D)\right]p(\theta^*)}{q(\theta^* \mid \theta_m)\,L(\theta_m; D)\left[\hat{L}(\theta_m; D)/L(\theta_m; D)\right]p(\theta_m)}.
\]
Now, define the random variable Z = L̂(D; θ)/L(D; θ), with density p(Z | θ), that represents a scaled likelihood estimator. We apply this change of variable and perform some straightforward algebra to obtain a new representation for α(θ*, θ_m) that reveals some interesting structure:
\[
\alpha(\theta^*, \theta_m)
= \frac{q(\theta_m \mid \theta^*)\,L(D; \theta^*)\,Z^*\,p(\theta^*)}{q(\theta^* \mid \theta_m)\,L(D; \theta_m)\,Z_m\,p(\theta_m)}
= \frac{p(Z^* \mid \theta^*)\,p(Z_m \mid \theta_m)}{p(Z^* \mid \theta^*)\,p(Z_m \mid \theta_m)} \times \frac{q(\theta_m \mid \theta^*)\,L(D; \theta^*)\,Z^*\,p(\theta^*)}{q(\theta^* \mid \theta_m)\,L(D; \theta_m)\,Z_m\,p(\theta_m)}
= \frac{p(Z_m \mid \theta_m)\,q(\theta_m \mid \theta^*)\left[L(D; \theta^*)\,p(\theta^*)\,Z^*\,p(Z^* \mid \theta^*)\right]}{p(Z^* \mid \theta^*)\,q(\theta^* \mid \theta_m)\left[L(D; \theta_m)\,p(\theta_m)\,Z_m\,p(Z_m \mid \theta_m)\right]}. \tag{B.1}
\]
The expressions outside the brackets can be considered a proposal density, q(Z*, θ* | Z_m, θ_m) = p(Z* | θ*) q(θ* | θ_m), in the product state space Z × Θ, where Z ⊂ R_+ is the space of values Z can take. By extending the dimension of the Markov chain state to include Z_m, we see that the acceptance probability for the original pseudo-marginal Markov chain in Θ, as given in Equation (B.1), can also be considered as an acceptance probability for this new Markov chain in Z × Θ based on exact Metropolis-Hastings MCMC. That is,
\[
\alpha\big((Z^*, \theta^*), (Z_m, \theta_m)\big)
= \frac{q(Z_m, \theta_m \mid Z^*, \theta^*)\left[L(D; \theta^*)\,p(\theta^*)\,Z^*\,p(Z^* \mid \theta^*)\right]}{q(Z^*, \theta^* \mid Z_m, \theta_m)\left[L(D; \theta_m)\,p(\theta_m)\,Z_m\,p(Z_m \mid \theta_m)\right]}.
\]
This Markov chain has the stationary distribution
\[
p(Z, \theta) = L(D; \theta)\,p(\theta)\,Z\,p(Z \mid \theta) \propto p(\theta \mid D)\,Z\,p(Z \mid \theta).
\]
Integrating over Z we obtain
\[
\int_{\mathcal{Z}} p(\theta \mid D)\,Z\,p(Z \mid \theta)\,\mathrm{d}Z = p(\theta \mid D)\int_{\mathcal{Z}} Z\,p(Z \mid \theta)\,\mathrm{d}Z = p(\theta \mid D)\,\mathrm{E}\left[Z \mid \theta\right].
\]
By linearity of expectation we have
\[
p(\theta \mid D)\,\mathrm{E}\left[Z \mid \theta\right] = \frac{p(\theta \mid D)}{L(D; \theta)}\,\mathrm{E}\big[\hat{L}(D; \theta) \mid \theta\big].
\]
We also have E[L̂(D; θ) | θ] = L(D; θ), since the Monte Carlo estimator for the likelihood is unbiased. Therefore,
\[
\int_{\mathcal{Z}} p(Z, \theta)\,\mathrm{d}Z \propto p(\theta \mid D).
\]
We conclude that the original chain has as its stationary distribution the exact posterior, p(θ | D).

Appendix C MCMC convergence diagnostics
In the main text we apply the rank-normalised R̂ statistic and the multiple chain effective sample size measure, S_eff, as defined in the recent work by Vehtari et al. (2019) that improves earlier definitions (Gelman and Rubin, 1992; Gelman et al., 2014). Since these diagnostics are relatively recent updates, we present their definitions here (see Diagnostics.jl for an example implementation).

Consider R chains, taking values in R^d, each consisting of an even number of iterations, M. Let θ^r_{k,m} denote the k-th dimension of the m-th iteration of the r-th chain; then we define the rank-normalised transform as
\[
z^r_{k,m} = \Phi^{-1}\left(\frac{\eta^r_{k,m} - 3/8}{RM + 1/4}\right),
\]
where η^r_{k,m} is the rank of θ^r_{k,m} taken over all m = 1, 2, ..., M and r = 1, 2, ..., R, and Φ^{-1} : (0, 1) → R is the inverse cumulative distribution function of the standard normal distribution. Then the rank-normalised R̂_k statistic for the k-th parameter is defined as
\[
\hat{R}_k = \sqrt{\frac{V_k}{W_k}}, \quad \text{where} \quad V_k = \frac{M - 2}{M}W_k + \frac{2}{M}B_k,
\]
with within-chain variance estimate W_k and between-chain variance estimate B_k. Splitting each chain into two halves of length M/2, denoted by superscripts + and -, these estimates are given by
\[
B_k = \frac{M}{2(2R - 1)}\sum_{r=1}^{R}\left[\left(\bar{z}^r_{k,+} - \bar{\bar{z}}_k\right)^2 + \left(\bar{z}^r_{k,-} - \bar{\bar{z}}_k\right)^2\right] \quad \text{and} \quad W_k = \frac{1}{2R}\sum_{r=1}^{R}\left[\left(s^r_{k,+}\right)^2 + \left(s^r_{k,-}\right)^2\right],
\]
where
\[
\bar{z}^r_{k,+} = \frac{2}{M}\sum_{m=M/2+1}^{M} z^r_{k,m}, \quad \bar{z}^r_{k,-} = \frac{2}{M}\sum_{m=1}^{M/2} z^r_{k,m}, \quad \bar{\bar{z}}_k = \frac{1}{2R}\sum_{r=1}^{R}\left(\bar{z}^r_{k,+} + \bar{z}^r_{k,-}\right),
\]
and
\[
\left(s^r_{k,+}\right)^2 = \frac{2}{M - 2}\sum_{m=M/2+1}^{M}\left(z^r_{k,m} - \bar{z}^r_{k,+}\right)^2 \quad \text{and} \quad \left(s^r_{k,-}\right)^2 = \frac{2}{M - 2}\sum_{m=1}^{M/2}\left(z^r_{k,m} - \bar{z}^r_{k,-}\right)^2.
\]
The multiple chain effective sample size measure is computed according to
\[
S_{\mathrm{eff},k} = \frac{RM}{\hat{\tau}_k}, \quad \text{where} \quad \hat{\tau}_k = 1 + 2\sum_{\ell=1}^{L_k}\hat{\rho}_{k,\ell}, \quad \hat{\rho}_{k,\ell} = 1 - \frac{1}{V_k}\left(W_k - \frac{1}{R}\sum_{r=1}^{R}\hat{\rho}^r_{k,\ell}\right),
\]
and ρ̂^r_{k,ℓ} = Cov[z^r_{k,m}, z^r_{k,m+ℓ}] / Var[z^r_{k,m}] is the autocorrelation function for the trace of the k-th dimension of the r-th chain at lag ℓ. Here, L_k is the largest odd integer such that ρ̂_{k,ℓ+1} + ρ̂_{k,ℓ+2} > 0 for all ℓ = 1, 3, ..., L_k - 2.

Given R chains, θ^1_m, θ^2_m, ..., θ^R_m, for estimation of the posterior mean, Vehtari et al. (2019) recommend that the chains should at least satisfy the conditions R̂_k < 1.01 and S_eff,k > 400 for all k = 1, 2, ..., d. Of course, this does not guarantee that the chains have converged, but it is a guide that, coupled with trace plots and ACF plots, provides reasonably conservative results.

Appendix D Observed data
The synthetic data used in the main manuscript and example code are provided in Table D.1 for the stationary production-degradation model, Table D.2 for the Michaelis-Menten model, Table D.3 for the Schlögl model, and Table D.4 for the repressilator model.

Table D.1: Data, D, used for inference on the production-degradation model. Generated using parameter values k_1 = 1.0 and k_2 = 0.01, initial condition X_0 = 10, and final time t = 1,000. [Table: ten observations, Y^(1)_obs, ..., Y^(10)_obs, of X_∞.]

Table D.2: Data, D, used for inference on the Michaelis-Menten model. Generated using parameter values k_2 = 0.05 and k_3 = 0.01 (k_1 as given in the main manuscript), and initial conditions E_0 = 100, S_0 = 100, C_0 = 0 and P_0 = 0. The observation error is Gaussian with standard deviation σ = 10. [Table: twenty observations, Y^(1)_obs, ..., Y^(20)_obs, of (E_t, S_t, C_t, P_t) at times t = 5, 10, ..., 100.]

Table D.3: Data, D, used for inference on the Schlögl model. Generated using parameter values k_3 = 2200 and k_4 = 37.5 (k_1 and k_2 as given in the main manuscript), and initial condition X_0 = 0. The observation error is Gaussian with standard deviation σ = 10. [Table: sixteen observations, Y^(1)_obs, ..., Y^(16)_obs, of X_t.]

Table D.4: Data, D, used for inference on the repressilator model. Generated using parameter values α = 1000, α_0 = 1, n = 2, β = 5 and γ = 1, and initial conditions M_{1,0} = 0, P_{1,0} = 2, M_{2,0} = 0, P_{2,0} = 1, M_{3,0} = 0 and P_{3,0} = 3. The observation error is Gaussian with standard deviation σ = 10. [Table: twenty observations, Y^(1)_obs, ..., Y^(20)_obs, of (M_{1,t}, P_{1,t}, M_{2,t}, P_{2,t}, M_{3,t}, P_{3,t}) at times t = 5, 10, ..., 100.]
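For completeness, observations of the kind tabulated above can be regenerated by simulating the chemical Langevin equation forward and recording the state. The following Python sketch mirrors the Table D.1 setup (k_1 = 1.0, k_2 = 0.01, X_0 = 10) using an Euler-Maruyama discretisation; the step size and the clamping of negative excursions are illustrative choices of ours, not the paper's Julia data-generation scripts.

```python
import numpy as np

# Euler-Maruyama simulation of the production-degradation chemical Langevin
# equation, dX = (k1 - k2*X) dt + sqrt(k1 + k2*X) dW (the two reaction noise
# channels combined into one, which is equivalent in distribution).
rng = np.random.default_rng(0)

def simulate_prod_deg(k1=1.0, k2=0.01, x0=10.0, t_final=1000.0, dt=0.1):
    x = x0
    for _ in range(int(t_final / dt)):
        drift = (k1 - k2 * x) * dt
        diffusion = np.sqrt(max(k1 + k2 * x, 0.0) * dt) * rng.normal()
        x = max(x + drift + diffusion, 0.0)  # clamp so the sqrt stays real
    return x

# Ten independent draws of X at t = 1,000, by which time the process is
# approximately stationary (mean near k1/k2 = 100).
obs = [simulate_prod_deg() for _ in range(10)]
print(np.round(obs, 1))
```

With these rates the stationary density derived in Appendix A has mean k_1/k_2 = 100 and standard deviation near 10, so the simulated observations should scatter around 100.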