Accelerating sequential Monte Carlo with surrogate likelihoods
A Preprint
Joshua J. Bon
School of Mathematical Sciences, Queensland University of Technology, [email protected]

Anthony Lee
School of Mathematics, University of Bristol, [email protected]

Christopher Drovandi
School of Mathematical Sciences, Queensland University of Technology, [email protected]
September 9, 2020

Abstract
Delayed-acceptance is a technique for reducing computational effort for Bayesian models with expensive likelihoods. Using a delayed-acceptance kernel for Markov chain Monte Carlo can reduce the number of expensive likelihood evaluations required to approximate a posterior expectation. Delayed-acceptance uses a surrogate, or approximate, likelihood to avoid evaluation of the expensive likelihood when possible. Within the sequential Monte Carlo framework, we utilise the history of the sampler to adaptively tune the surrogate likelihood to yield better approximations of the expensive likelihood, and use a surrogate first annealing schedule to further increase computational efficiency. Moreover, we propose a framework for optimising computation time whilst avoiding particle degeneracy, which encapsulates existing strategies in the literature. Overall, we develop a novel algorithm for computationally efficient SMC with expensive likelihood functions. The method is applied to static Bayesian models, which we demonstrate on toy and real examples, code for which is available at https://github.com/bonStats/smcdar.

Keywords: Bayesian statistics · Delayed-acceptance · Approximate likelihood · Whittle likelihood · MCMC

Acknowledgements: JJB is a recipient of a PhD Research Training Program scholarship from the Australian Government. JJB, AL, and CD thank the Australian Research Council (ARC) Centre of Excellence for Mathematical and Statistical Frontiers for financial support (CE140100049). AL and CD were supported by an ARC Discovery Project (DP200102101). AL was supported by an EPSRC grant (EP/R034710/1) and received travel funding from the Statistical Society of Australia. JJB and CD also thank the Centre for Data Science at QUT for support.
1 Introduction

A significant barrier for statisticians and practitioners is the computational requirement of handling complex statistical models with challenging likelihood functions. Particularly in Bayesian modelling, computationally expensive likelihoods can impede inference and cause drastic reductions in accuracy on a limited computational budget.

Bayesian statistics is becoming more popular in a growing number of disciplines due to its principled framework for uncertainty quantification in parameter estimation, model selection, and prediction. To keep up with practitioners' desire to develop and use more sophisticated and realistic stochastic models, there is a demand for improved statistical methodology for parameter estimation and model selection from data.

This article focuses on sequential Monte Carlo (SMC, Chopin 2002, Del Moral et al. 2006), a theoretically justifiable method for Bayesian parameter estimation and model selection. SMC works by propagating a set of weighted samples (called particles) through a sequence of distributions connecting a simple distribution to a more complex distribution that is difficult to sample from directly. SMC for Bayesian inference typically has the posterior distribution as its final target.

The basic ingredients of SMC are reweighting, resampling, and mutation of the particles. SMC can be appealing for a wide variety of inference problems as it can handle multimodality, is easily parallelisable as well as adaptable, and provides an estimate of the posterior normalising constant (the evidence, which can be useful for model selection). Despite the widespread success of SMC, it can involve a large number of likelihood evaluations, which is expensive for models with computationally intensive likelihood functions.

An alternative and popular method for Bayesian sampling is Markov chain Monte Carlo (MCMC), which involves constructing a Markov chain with the target posterior as its limiting distribution. The cornerstone of MCMC, the Metropolis-Hastings (MH) algorithm (Metropolis et al. 1953, Hastings 1970), also suffers computationally when the likelihood function is expensive. An alternative MCMC kernel, delayed-acceptance (DA), can be used to alleviate some of this computational burden (Fox & Nicholls 1997, Christen & Fox 2005). This method uses a cheap surrogate of the target to perform an initial screening phase of each proposed parameter value in MCMC. It resembles the MH kernel, but with two stages.

In this paper, we use the terms surrogate likelihood and surrogate posterior to refer to a particular approximate likelihood and posterior, respectively. Using the DA kernel, if the proposal is a poor candidate for the target distribution, according to the surrogate, it will have a high probability of being rejected without needing to evaluate the expensive likelihood. If the surrogate can accurately predict which proposals are likely to be accepted or rejected using the actual likelihood, a more efficient MCMC algorithm can be expected. Issues can arise when the tails of the surrogate are mismatched to the target, but there are adjustments that can be made to address this (Banterle et al. 2019).

For DA-MH to be successful, the surrogate must be (i) relatively cheap and (ii) roughly proportional to the posterior as a function of the parameter. In some applications, a cheap version of the model may be directly available, such as the linear noise approximation for Markov processes (Elf & Ehrenberg 2003, Stathopoulos & Girolami 2013).
In other applications, a general surrogate can be developed using ideas from emulation, for example regression trees (Sherlock et al. 2017) and Gaussian processes (Conrad et al. 2016, Drovandi et al. 2018). In the former scenario, the cheap approximation may lack sufficient flexibility to satisfy (ii). In the latter scenario, the training region for the emulator can be hard to identify and again might not be flexible enough to satisfy (ii).

The aim of this paper is to develop novel delayed-acceptance methods that are efficient and automated by harnessing the SMC framework. DA has been leveraged in SMC previously in the setting of approximate Bayesian computing (Everitt & Rowińska 2017). We propose surrogate accelerated SMC to improve delayed-acceptance, and the use of surrogate likelihoods in SMC, through three avenues, which also constitute the main contributions of this work. Firstly, delayed-acceptance can be used directly in the mutation step of the SMC algorithm, reducing costly evaluations of the likelihood. Secondly, the particles can be used to tune the surrogate likelihood to better match the full likelihood.
Lastly, we can change the annealing strategy of SMC to use only the surrogate likelihood at first, and then anneal to the full posterior, which uses the full likelihood. These adaptations are made possible by access to the population of particles in SMC.

In creating our new SMC sampler, we also develop a framework for adaptively minimising the computation time accrued in the mutation step of SMC whilst achieving a minimum level of particle diversification. This framework encompasses existing strategies in the literature with appropriate approximations.

The paper proceeds as follows. Section 2 provides an overview of SMC, whilst Section 3 introduces the delayed-acceptance Metropolis-Hastings algorithm. In Section 4 we discuss existing tuning strategies for mutation kernels in SMC and propose a framework for such strategies in the context of MH kernels. Section 5 extends our tuning framework to delayed-acceptance in SMC, and discusses the other ways surrogate likelihoods can be utilised for improved computational efficiency. Section 6 contains a simulation study investigating the effect of computation cost differentials between the surrogate and full likelihood functions, as well as different tuning parameters and strategies. Section 7 applies our methods to time series models using the Whittle likelihood approximation (Whittle 1953).
2 Sequential Monte Carlo

Sequential Monte Carlo generates samples from a sequence of distributions forming a bridge between a simple initial distribution, which is easy to sample from, and a final target distribution. The target distribution is generally the posterior distribution in Bayesian analysis. In this paper we consider a power-likelihood path connecting the prior, or other starting distribution, to the posterior distribution. As such, we use this temperature-annealed sequence to illustrate SMC methodology, which takes the form
$$p_t(\theta \mid y) \propto p(\theta \mid y)^{\gamma_t} \, p_0(\theta)^{1 - \gamma_t}, \qquad p(\theta \mid y) \propto p(y \mid \theta)\,\pi(\theta)$$
where $p(\theta \mid y)$ is the posterior density, $p(y \mid \theta)$ is the likelihood implied by a statistical model assumed to generate the observed data $y$ and parameterised by $\theta$, $\pi(\theta)$ is the prior distribution, and $0 = \gamma_0 < \gamma_1 < \cdots < \gamma_T = 1$. The ultimate target distribution is the posterior $p_T(\theta \mid y) \equiv p(\theta \mid y)$. The sequence of targets begins with the initial distribution $p_0(\theta)$ when the annealing parameter is $\gamma_t = 0$, transitions towards the target posterior by steadily increasing $\gamma_t$, and ultimately terminates at $\gamma_T = 1$. It is often the case that the prior is chosen to be the initial distribution by taking $p_0(\theta) = \pi(\theta)$, resulting in likelihood tempering of the form $p_t(\theta \mid y) \propto p(y \mid \theta)^{\gamma_t} \pi(\theta)$.

At each iteration of the SMC algorithm, the $t$th target distribution $p_t$ is represented by a weighted empirical measure
$$\hat{p}_t^N = \sum_{i=1}^{N} W_t^i \, \delta_{\theta_t^i}(\theta)$$
using $N$ particles, $\theta_t^i$, each associated with weights $W_t^i$, such that $\sum_{i=1}^{N} W_t^i = 1$. The empirical measure can be used to approximate expectations with respect to $p_t$, and ultimately the posterior.

As stated, an SMC algorithm iterates through reweighting, resampling and mutation steps to migrate the population of particles from $p_0(\theta)$ to $p(\theta \mid y)$. For static Bayesian models under posterior tempering, the reweighting step uses the current particles and weights from $\hat{p}_{t-1}^N$ to generate the next set of particles and weights for $\hat{p}_t^N$. The new weights are
$$W_t^i \propto w_t^i = W_{t-1}^i \, \frac{p_t(\theta_{t-1}^i \mid y)}{p_{t-1}(\theta_{t-1}^i \mid y)}$$
and the updated particles and normalised weights are given by
$$\theta_t^i = \theta_{t-1}^i, \qquad W_t^i = \frac{w_t^i}{\sum_{k=1}^{N} w_t^k}$$
for $i = 1, \ldots, N$.

We can measure the quality of the SMC sample at iteration $t$ via the effective sample size (ESS). The ESS is often estimated via the normalised weights
$$\mathrm{ESS}_t = 1 \Big/ \sum_{i=1}^{N} (W_t^i)^2. \qquad (1)$$
When the ESS drops below some threshold $S$ (often set at $N/2$) an intervention is required to prevent degeneracy in the particle set. Resampling $N$ particles from the particle set $\{\theta_t^i\}_{i=1}^N$ with weights $\{W_t^i\}_{i=1}^N$ produces an unweighted empirical measure ($W_t^i = 1/N$) which also approximates the current tempered distribution. The resampling process duplicates particles with relatively high weights and drops those with relatively small weights so that the SMC algorithm can better explore high probability regions of the posterior. A number of resampling strategies are available, including multinomial, stratified, or residual resampling (Kitagawa 1996, Liu & Chen 1998).

In practice, $\mathrm{ESS}_t(\gamma_t)$ only depends on the next annealing parameter $\gamma_t$ since the particle values, $\theta_t$, are fixed during the reweighting stage. As such, we can select $\gamma_t$ adaptively via the bisection method bounded on $(\gamma_{t-1}, 1]$, such that $\mathrm{ESS}_t(\gamma_t) \approx S$, where $S$ is the targeted ESS (Jasra et al. 2011, Beskos et al. 2016).
If such a strategy is chosen then the particles are resampled every iteration.

To diversify the resampled particle set (which is now likely to contain duplicates), the particles can be perturbed with an MCMC kernel with invariant distribution $p_t$. For simplicity, and consistent with other approaches in the literature (see Chopin 2002, Jasra et al. 2011, Beskos et al. 2016, for example), a Metropolis-Hastings kernel is used with proposal distribution, $q_\phi$, given by a multivariate normal (MVN) random walk
$$q_\phi(\theta^* \mid \theta_t) = \mathcal{N}(\theta^*; \theta_t, \phi^2 \Sigma), \qquad (2)$$
where, in this case, the proposal distribution has step size, $\phi$, as a tuning parameter. For other kernels, however, a vector of tuning parameters may be appropriate. The covariance matrix $\Sigma$ can be set to the sample covariance estimated from the weighted particles before resampling, a choice adopted in our demonstrations in Sections 6 and 7.

A proposed update, or proposal $\theta^{i*}$, is made for the $i$th particle, $\theta_t^i$, by drawing a random variable according to (2). In general, the proposal is accepted as the new value of the particle with probability $\alpha(\theta^*, \theta_t) = \min\{1, r(\theta^*, \theta_t)\}$, with
$$r(\theta^*, \theta_t) = \frac{p_t(\theta^* \mid y) \, q_\phi(\theta_t \mid \theta^*)}{p_t(\theta_t \mid y) \, q_\phi(\theta^* \mid \theta_t)},$$
where, in this case, $r$ can be simplified to $p_t(\theta^* \mid y) / p_t(\theta_t \mid y)$ due to the symmetry of the MVN proposal distribution. If the proposal is rejected, the value of the particle remains unchanged. For a generic mutation kernel, $K_\phi(\theta, \cdot)$, we will refer to the tuning parameters as the vector $\phi$ henceforth.

It is often the case that one iteration of an MCMC kernel is not sufficient to diversify the particle set adequately. One can choose the kernel to be a cycle of primitive MCMC kernels, and it is often the case that SMC algorithms apply a given kernel multiple times for the mutation step. Whilst the number of repeats can be fixed, there are several adaptive methods for tuning the number of cycles required. Such adaptive methods can be informed by a pilot run of the mutation step (Drovandi & Pettitt 2011, Salomone et al. 2018), or evolve based on probabilistic rules (Fearnhead et al. 2013). We explore these tuning methods further in Sections 4 and 5, where we present our framework for tuning kernels based on minimal diversification and computation time.

We present a generic description of an adaptive SMC algorithm for static Bayesian models in Algorithm 1. This algorithm summarises the methodology outlined in this section for a particular instance of SMC, a resample-move algorithm (Gilks & Berzuini 2001) with power-likelihood annealing. Using the resample-move algorithm as a prototype simplifies the exposition of this paper; however, the developments we make are applicable to a much wider class of SMC algorithms. The mutation kernel, $K_\phi(\theta, \cdot)$, in Algorithm 1 will typically be the Metropolis-Hastings kernel, or the delayed-acceptance kernel discussed in Section 3 in the case of expensive likelihoods.
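To make the adaptive tempering concrete, the following is a minimal sketch (not the authors' code) of the ESS calculation (1) and the bisection search for the next annealing parameter, assuming likelihood tempering with equal incoming weights after resampling; all function names are our own.

```python
import numpy as np

def ess(log_w):
    """Effective sample size 1 / sum_i (W^i)^2 from unnormalised log-weights."""
    log_w = log_w - np.max(log_w)      # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def next_temperature(log_like, gamma_prev, target_ess, tol=1e-8):
    """Bisection on (gamma_prev, 1] so that ESS(gamma) is approximately target_ess.

    log_like: array of log-likelihoods log p(y | theta^i) for each particle.
    Under likelihood tempering with equal previous weights, the incremental
    log-weight at temperature gamma is (gamma - gamma_prev) * log_like.
    """
    if ess((1.0 - gamma_prev) * log_like) >= target_ess:
        return 1.0                     # can jump straight to the posterior
    lo, hi = gamma_prev, 1.0           # invariant: ESS(lo) >= target > ESS(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess((mid - gamma_prev) * log_like) >= target_ess:
            lo = mid                   # ESS still high enough: move further
        else:
            hi = mid
    return lo
```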
Algorithm 1
Adaptive resample-move sequential Monte Carlo for static Bayesian models
Input:
Number of particles, $N$; initial distribution, $p_0$; family of MCMC kernels, $K_\phi$; kernel tuning set, $\Phi$; ESS threshold, $S$; diversification criterion function and threshold, $D$ and $d$ respectively.

1. Initialisation.
   (a) $t = 0$, $\gamma_0 = 0$, and $W_0^i = 1/N$ for $i \in \{1, \ldots, N\}$
   (b) Simulate $\theta_0^i \sim p_0$ for $i \in \{1, \ldots, N\}$ (perhaps from a previous SMC run, see Section 5.3)
2. While $\gamma_t < 1$, increment $t$, then iterate through
   (a) Calculate $\gamma_t = \sup_{\gamma \in (\gamma_{t-1}, 1]} \{\gamma : \mathrm{ESS}_t(\gamma) \geq S\}$, the new temperature, using the bisection method
   (b) Compute new weights $w_t^i = W_{t-1}^i \, p_t(\theta_{t-1}^i \mid y) / p_{t-1}(\theta_{t-1}^i \mid y)$, then normalise $W_t^i = w_t^i / \sum_{k=1}^{N} w_t^k$ for $i \in \{1, \ldots, N\}$
   (c) Resample particles $\theta_t^i \sim \sum_{i=1}^{N} W_t^i \delta_{\theta_{t-1}^i}(\cdot)$ (multinomial as an example), then set $W_t^i = 1/N$ for $i \in \{1, \ldots, N\}$
   (d) Kernel calibration of $K_\phi$
      (i) Calculate $\hat{\Sigma}_t$ from the weighted particles for the MVN proposal
      (ii) If $K_\phi$ is a DA kernel: calibrate the surrogate density (Section 5.2)
   (e) Pilot mutation step, set $s = 0$, then
      (i) Allocate particles by partition $\bigcup_{\phi \in \Phi} S_\phi = \{1, \ldots, N\}$
      (ii) Set $\phi^i = \phi$ for $i \in S_\phi$, for each $\phi \in \Phi$
      (iii) Mutate particles $\check{\theta}_0^i \sim K_{\phi^i}(\theta_t^i, \cdot)$ for $i \in \{1, \ldots, N\}$
      (iv) Calculate diversification value $J^i$ for $i \in \{1, \ldots, N\}$ (Section 4.2)
      (v) Calculate optimal tuning parameter $\phi^*$ (Sections 4.2 and 5.1)
   (f) While $D(\{J^i\}_{i=1}^N) < d$, increment $s$, then iterate through
      (i) Mutate particles $\check{\theta}_s^i \sim K_{\phi^*}(\check{\theta}_{s-1}^i, \cdot)$ for $i \in \{1, \ldots, N\}$
      (ii) Update diversification value $J^i$ for $i \in \{1, \ldots, N\}$ (Section 4.2)
   (g) Update particles $\theta_t^i = \check{\theta}_s^i$ for $i \in \{1, \ldots, N\}$

Output:
Particles $\{\theta_T^i\}_{i=1}^N$ for $T$ such that $\gamma_T = 1$. Some additional information from the mutation step is required for the diversification criterion calculation and tuning parameter optimisation, e.g. proposal values and acceptance probabilities.
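The loop structure of Algorithm 1 can be summarised in a short schematic sketch. The callables here (sample_prior, pilot_mutation, mutate, diversified) are hypothetical placeholders for the steps developed in Sections 4 and 5, and next_temperature can be the bisection routine sketched in Section 2; this is an illustration of the control flow, not a complete implementation.

```python
import numpy as np

def smc_resample_move(log_like, sample_prior, next_temperature,
                      pilot_mutation, mutate, diversified, N=2000, rng=None):
    """Schematic resample-move SMC with likelihood tempering (Algorithm 1)."""
    rng = rng or np.random.default_rng()
    theta = sample_prior(N)                             # 1(b): theta_0^i ~ p_0
    gamma = 0.0
    while gamma < 1.0:
        ll = np.array([log_like(th) for th in theta])
        gamma_new = next_temperature(ll, gamma, N / 2)  # 2(a): bisection on ESS
        log_w = (gamma_new - gamma) * ll                # 2(b): incremental weights
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        theta = theta[rng.choice(N, size=N, p=w)]       # 2(c): multinomial resampling
        phi_star, J = pilot_mutation(theta, gamma_new)  # 2(d)-(e): calibrate + pilot
        while not diversified(J):                       # 2(f): mutate until D >= d
            theta, J = mutate(theta, gamma_new, phi_star, J)
        gamma = gamma_new                               # 2(g): accept mutated set
    return theta
```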
3 Delayed-acceptance

Expensive likelihoods can considerably slow down Bayesian analysis when using computational methods for posterior approximation. In Metropolis-Hastings routines, one remedy is delayed-acceptance, which uses a surrogate likelihood to first test whether a proposal is worthy of evaluation under the expensive full likelihood. If the surrogate is a good approximation to the full likelihood then it can be used to reject poor proposals early and avoid spending unnecessary computation time evaluating the full likelihood.

Delayed-acceptance has its origins in Bayesian conductivity imaging (Fox & Nicholls 1997), followed by a more complete account in Christen & Fox (2005), who formalised the ergodic correctness of the DA-MH algorithm. More recently, Banterle et al. (2019) proposed a modified version of delayed-acceptance which is robust to poor tail coverage by the surrogate likelihood. They explore bounds for the variance of delayed-acceptance compared to standard Metropolis-Hastings. We use this version of delayed-acceptance in our applications of SMC.
Delayed-acceptance has been used to speed up a number of costly MCMC algorithms, including pseudo-marginal methods (Golightly et al. 2015, Wiqvist et al. 2018), approximate Bayesian computing (ABC, Everitt & Rowińska 2017), and Bayesian inverse problems (Cui et al. 2011). It has also been combined with data subsampling (Quiroz et al. 2018) and consensus MCMC (Payne & Mallick 2018) to produce more general algorithms for accelerated MCMC. It is worth noting, however, that whilst delayed-acceptance targets the correct stationary distribution, not all of the aforementioned methods share this property after further approximations are made.

The surrogate likelihood in delayed-acceptance can be a deterministic approximation, such as the linear noise approximation (LNA) for stochastic kinetic models (Golightly et al. 2015), but need not be. In particular, Sherlock et al. (2017) and Wiqvist et al. (2018) use adaptive non-parametric approximations to the full likelihood. The former uses a nearest-neighbour approximation whilst the latter uses a Gaussian process for the log-likelihood, in a similar manner to Drovandi et al. (2018).

More theory for delayed-acceptance has been considered by Sherlock et al. (2015), who assess the asymptotic efficiency of DA in random walk Metropolis and pseudo-marginal random walk Metropolis. Their theoretical analysis provides some practical guidelines for tuning DA algorithms but is tailored to MCMC.

Related methods for reducing computational cost for Bayesian models without approximation include early rejection (Solonen et al. 2012). Early rejection partitions the posterior into a monotonically decreasing ordering where the cumulative MH ratio can be checked sequentially and a proposal rejected as soon as it is deemed infeasible. No approximation to the original MH kernel is made; instead, early rejection relies on partial calculation of the posterior density at each iteration. Such a partition has also been used for ABC. In particular, early rejection based on the prior can avoid the expensive simulation required to evaluate the indicator kernel (Everitt & Rowińska 2017). Lazy ABC (Prangle 2016) is another example of work on speeding up Bayesian computation. Lazy ABC uses a random stopping rule to end simulations that are unlikely to result in a feasible parameter, and ensures an unchanged target distribution by reweighting. Like early rejection in ABC, this can be highly effective if the simulations are costly.

As mentioned, delayed-acceptance has been used in sequential Monte Carlo previously by Everitt & Rowińska (2017) for ABC, where the surrogate model is a cheap, but approximate, simulator. The novel adaptation strategy we describe can also be used in ABC-SMC, but is not limited to this type of Bayesian inference.

The general delayed-acceptance routine extends the standard Metropolis-Hastings algorithm, outlined in Section 2, into several steps. The standard MH acceptance probability $\alpha(\theta^*, \theta)$ is decomposed into two or more distinct acceptance probabilities for delayed-acceptance. For simplicity, we will describe and use only two acceptance probabilities in this paper, but the ideas can be applied to several stages of surrogate models.
For a given particle $\theta_t$, at iteration $t$ of the SMC algorithm, the sequential acceptance probabilities in delayed-acceptance are $\alpha_s(\theta^*, \theta_t) = \min\{1, r_s(\theta^*, \theta_t)\}$, for $s = 1, 2$, with
$$r_1(\theta^*, \theta_t) = \frac{\tilde{p}_t(\theta^* \mid y) \, q_\phi(\theta_t \mid \theta^*)}{\tilde{p}_t(\theta_t \mid y) \, q_\phi(\theta^* \mid \theta_t)}, \qquad r_2(\theta^*, \theta_t) = \frac{p_t(\theta^* \mid y) \, \tilde{p}_t(\theta_t \mid y)}{p_t(\theta_t \mid y) \, \tilde{p}_t(\theta^* \mid y)} = \left[ \frac{p(y \mid \theta^*) \, \tilde{p}(y \mid \theta_t)}{p(y \mid \theta_t) \, \tilde{p}(y \mid \theta^*)} \right]^{\gamma_t}$$
where $\tilde{p}_t(\theta \mid y) \propto [\tilde{p}(y \mid \theta) \pi(\theta)]^{\gamma_t} p_0(\theta)^{1 - \gamma_t}$ is the annealed surrogate posterior. The delayed-acceptance MH step proceeds in two parts. First, test the proposal using only the (computationally inexpensive) surrogate likelihood, by provisionally accepting the proposal with probability $\alpha_1(\theta^*, \theta_t)$. Second, if accepted by the surrogate likelihood, accept the proposal definitively with probability $\alpha_2(\theta^*, \theta_t)$, which accounts for the discrepancy between the surrogate and full likelihood, ensuring overall that the correct MH ratio has been used. Proposals that are rejected during the initial screening by the surrogate likelihood are not evaluated by the full likelihood, thus saving computational time when the surrogate is representative of the full likelihood.
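A minimal sketch of one such delayed-acceptance step for a single particle follows, assuming a symmetric random-walk proposal so that the $q$-ratios cancel; the function names (propose, log_post_tilde, log_post) are placeholders for the annealed surrogate and full log-posteriors.

```python
import numpy as np

def da_mh_step(theta, propose, log_post_tilde, log_post, rng):
    theta_star = propose(theta, rng)
    # Stage 1: screen with the cheap surrogate, alpha_1 = min{1, r_1}.
    log_r1 = log_post_tilde(theta_star) - log_post_tilde(theta)
    if np.log(rng.uniform()) >= log_r1:
        return theta, False            # early rejection: full likelihood untouched
    # Stage 2: correct with the expensive posterior, alpha_2 = min{1, r_2},
    # where r_2 = [p_t(theta*)/p~_t(theta*)] * [p~_t(theta)/p_t(theta)].
    log_r2 = (log_post(theta_star) - log_post_tilde(theta_star)
              - log_post(theta) + log_post_tilde(theta))
    if np.log(rng.uniform()) < log_r2:
        return theta_star, True
    return theta, False
```

Note that the full log-posterior is evaluated only inside the second stage, which is the source of the computational savings.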
We use the robust delayed-acceptance method proposed by Banterle et al. (2019), which adjusts the standard delayed-acceptance routine to allow some proposals to bypass the surrogate test with low probability. This mitigates potential stability issues, for example if the approximate likelihood has tails that are too light relative to the full likelihood.

In this paper we generalise the method for tuning kernel parameters and iteration number put forth by Salomone et al. (2018), as accounting for computation time becomes a more pressing issue with expensive likelihood functions. The framework we propose for tuning kernel parameters and choosing the number of cycles for each SMC mutation step is detailed in the next two sections, for both the MH and delayed-acceptance kernels.
4 A framework for tuning mutation kernels in SMC

Our research is motivated by seeking efficient use of an adaptive delayed-acceptance algorithm within SMC. In this section, we first develop a novel framework for optimising the number of mutation cycles used in a single SMC iteration and demonstrate how this framework encompasses existing approaches in the literature. We then utilise and extend the framework to optimise delayed-acceptance within SMC in Section 5.

The mutation step in SMC algorithms requires careful consideration when the likelihood is expensive. On the one hand, the diversification of particles is crucial in maintaining a representative and non-degenerate particle population, but this comes at the computational cost of evaluating the expensive likelihood frequently. In particular, when using delayed-acceptance we would like to lower the acceptance rate and increase the number of iterations to significantly reduce the number of expensive likelihood evaluations.

Several adaptive SMC (and MCMC) algorithms exist that aim to ensure sufficient diversification, but these are yet to explicitly address the issue of the associated computational cost. Our major contribution is to frame the computational cost in SMC, with and without delayed-acceptance, as an optimisation problem. To begin, let $C(k, \phi)$ be the cost of $k$ cycles of the MCMC kernel with tuning parameters $\phi$ belonging to the set $\Phi$. The tuning parameters to be considered may be all of the available parameters for a particular proposal, or a particular subset of these parameters. We wish to minimise this cost whilst maintaining some particle diversification condition, $D(k, \phi) \geq d$. Under such conditions, an appropriate formulation of the optimisation is
$$\arg\min_{k \in \mathbb{Z}^+, \, \phi \in \Phi} C(k, \phi) \quad \text{s.t.} \quad D(k, \phi) \geq d. \qquad (3)$$
For example, we expect an appropriate, but approximate, cost function for an MH kernel to be
$$C(k, \phi) = k \times L_F \qquad (4)$$
where $L_F$ is the cost of evaluating the likelihood, and $k$ is the number of cycles the MCMC kernel iterates for.

To make the optimisation tractable in practice we perform a pilot run of the MCMC kernel over a grid of tuning parameters for $\phi$, which assists in estimating the quantities required (a generic sketch of this grid search is given below). This is not the only strategy possible, of course, but we have found it to perform well. The general optimisation problems described will be transferable to other strategies one might adopt.

Before continuing on to the specific (approximate) solutions of (3), we first need to describe some of the methods currently in use to ensure sufficient particle diversification in SMC. Implicitly, these methods aim to minimise the number of iterations required to meet their chosen diversification condition, $D(k, \phi) \geq d$. We consider two main methods in the SMC literature: the first uses the MH acceptance probability to ensure diversification, whilst the second uses expected squared jumping distance (ESJD).

We will now discuss the process of tuning the mutation step at a given iteration of the SMC sampler. In what follows, we assume that one evaluation of the full likelihood costs $L_F$ units, and we are concerned with using $k$ cycles of a parameterised MCMC kernel as our mutation step.
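As a sketch of how (3) can be attacked on a finite grid (the strategy we adopt), the following hypothetical helper searches over $\Phi$, delegating the criterion-specific pieces (min_cycles and cost) to the methods developed in the remainder of this section.

```python
def optimise_kernel(Phi, min_cycles, cost):
    """Solve (3) on a grid: smallest feasible k per phi, then the cheapest pair."""
    best = None
    for phi in Phi:
        k = min_cycles(phi)           # smallest k with D(k, phi) >= d
        c = cost(k, phi)              # e.g. C(k, phi) = k * L_F for MH, eq. (4)
        if best is None or c < best[0]:
            best = (c, k, phi)
    return best[1], best[2]           # (k*, phi*)
```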
4.1 One-move diversification

To illustrate the proposed optimisation framework, we begin by considering a simple method from South et al. (2019) for choosing the number of MCMC cycles, the basis of which is from Drovandi & Pettitt (2011). In this regime the number of cycles, $k$, is chosen so that each particle moves at least once in $k$ iterations. A move occurs when a proposal is accepted using an MH kernel. We will refer to this criterion as one-move diversification. For a fixed scaling parameter $h$, one-move diversification uses the MH acceptance rates to determine the average number of MCMC cycles required for at least one proposal per particle to be accepted. More generally, one could require a higher minimum number of moves, but for simplicity we just consider the case of at least one move.

This section will consider a single mutation step of the SMC algorithm, consisting of multiple cycles of the MCMC kernel, indexed by $s \in \{1, 2, \ldots, k\}$. Assuming the probability of moving (or acceptance, $\alpha^{(1)}$) is equal across cycles, the average probability (across the tempered posterior distribution) that at least one proposal is accepted in a sequence of $k$ cycles, $\alpha^{(k)}$, is
$$\alpha^{(k)} = 1 - \left(1 - \alpha^{(1)}\right)^k \qquad (5)$$
for $k \in \{1, 2, \ldots\}$. A pilot mutation step can be used to estimate the average acceptance rate across the particles in a single step, $\hat{\alpha}^{(1)}$. We can then find $k$ such that $\alpha^{(k)} \geq p_{\min}$ for some threshold $0 < p_{\min} < 1$. The formula to choose the total number of iterations, $k$, is
$$k = \left\lceil \frac{\log(1 - p_{\min})}{\log\left(1 - \hat{\alpha}^{(1)}\right)} \right\rceil \qquad (6)$$
where $\hat{\alpha}^{(1)}$ is the estimated acceptance rate from the pilot run of the MH kernel (a small sketch of this rule is given after Proposition 1).

To frame this in the context of optimising computation time, note that the underlying criterion is to ensure a sufficient number of mutation steps are taken so that the probability of at least one move is greater than $p_{\min}$ for a given particle. If we denote a move by $\|\theta_s - \theta_{s-1}\|_0$, where $\|\cdot\|_0$ is the zero "norm", the corresponding diversification criterion can be expressed as $\{(k, \phi) : D(k, \phi) \geq p_{\min}\}$ where
$$D(k, \phi) = \mathbb{P}\left( \sum_{s=1}^{k} \|\theta_s - \theta_{s-1}\|_0 \geq 1 \right). \qquad (7)$$
Of course, this expression for $D(k, \phi)$ is a more general version of (5), and coincides with it if we assume the probability of acceptance is equal across particle locations and MCMC iterations, $s$. We emphasise the norm notation to draw a comparison to the jumping distance diversification in Section 4.2.

Now we wish to use the criterion to select the tuning parameters. If we use different proposal kernel tuning parameters for particular subsets of particles, the acceptance rate will be a function of those parameters, so we write $\alpha^{(k)}$ as $\alpha^{(k)}(\phi)$. The optimisation stated in (3) can be simplified as stated in Proposition 1.

Proposition 1
Assume the cost function is $C(k, \phi) = k \times L_F$, approximating the cost of a standard Metropolis-Hastings step, and $D(k, \phi) = \alpha^{(k)}(\phi)$. The latter also corresponds to (7) assuming a uniform acceptance rate across the support of $\theta$. Then the general problem in (3) is equivalent to
$$\arg\min_{\phi \in \Phi} \frac{\log(1 - p_{\min})}{\log\left(1 - \alpha^{(1)}(\phi)\right)} \qquad (8)$$
where the general diversification threshold, $d$, has been replaced by the probability $p_{\min}$.

Proposition 1 is the solution to choosing the best tuning parameters with the one-move criterion and MH cost. It closely connects to the original decision rule for $k$ without tuning parameters (6). A proof of Proposition 1 is in Appendix A.1.
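A small sketch of the one-move rule (6) follows, using only quantities introduced above.

```python
import numpy as np

def one_move_cycles(alpha_hat, p_min=0.5):
    """k = ceil( log(1 - p_min) / log(1 - alpha_hat) ), equation (6)."""
    if alpha_hat <= 0.0:
        return np.inf                  # no acceptances observed: cannot diversify
    if alpha_hat >= 1.0:
        return 1
    return int(np.ceil(np.log1p(-p_min) / np.log1p(-alpha_hat)))
```

Under the MH cost (4), Proposition 1 then amounts to picking the $\phi \in \Phi$ that minimises one_move_cycles applied to the pilot estimate $\hat{\alpha}^{(1)}(\phi)$.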
In general, we expect the tuning criterion in Proposition 1 to perform poorly. This can be demonstrated by a simple, but highly applicable, example. If the tuning parameter is the step size for an MH mutation, i.e. $\phi = [h]$, then we would expect the acceptance probability, $\alpha^{(1)}(\phi)$, to be monotone decreasing in $h$. Hence the minimisation in (8) will prefer the minimum step size possible, which will ensure at least one move with minimal computational cost. In other words, the diversification criterion in (7) is only concerned with the probability of at least one move, not the quality of this move.

Due to the aforementioned shortcoming, a diversification criterion that also measures the quality of the mutation is desirable. For this reason, we turn to jumping distance diversification as a criterion.

4.2 Jumping distance diversification

As with all Monte Carlo samplers, it is difficult to optimise the variance of estimators in real time. For SMC, variance is introduced when the set of particles degenerates, and we can avoid high variance by mutating particles to ensure diversity. In this sense, the ESJD is a good candidate measure for particle diversification in SMC as it balances the trade-off between decreasing acceptance rates and increasing jump sizes (and vice versa) under different tuning parameters in the MH proposal. This is somewhat related to the motivation for ESJD in MCMC, in that Pasarica & Gelman (2010) used ESJD due to the equivalence between maximising the ESJD and minimising a Markov chain's first-order autocorrelation.

The ESJD criterion has been applied to SMC by Fearnhead et al. (2013) and Salomone et al. (2018). In particular, Fearnhead et al. (2013) adaptively tuned the kernel by drawing parameters from a set that was reweighted and mutated based on their ESJD performance in previous iterations, much like SMC itself. Salomone et al. (2018), on the other hand, found good performance by allocating a tuning parameter from a candidate set to each particle during a pilot run of the mutation step, and selecting the tuning parameter with the highest median ESJD. This parameter is used in subsequent mutation steps (within the same SMC iteration) until a threshold for the total median ESJD is met. This second criterion is the one we choose to generalise, as it is more amenable to the calculations required to apply the optimisation in (3) to delayed-acceptance in SMC.

We start by defining the conditional ESJD, which will be used as the starting point for formalising the criterion in Salomone et al. (2018). For the $s$th mutation cycle on a given particle, define the conditional ESJD as the conditional expectation
$$J(\theta_{s-1}, \theta^*) = \mathbb{E}\left( \|\theta_s - \theta_{s-1}\|_\Sigma^2 \,\Big|\, \theta_{s-1}, \theta^* \right) \qquad (9)$$
where $\theta_{s-1}$ is the current position of the particle, $\theta^*$ is the proposed move for the particle, and $\|\cdot\|_\Sigma$ denotes the Mahalanobis distance with covariance matrix $\Sigma$. For an MH kernel, the conditional ESJD can be written as
$$J(\theta_{s-1}, \theta^*) = \|\theta^* - \theta_{s-1}\|_\Sigma^2 \, \alpha(\theta_{s-1}, \theta^*), \qquad (10)$$
where $\alpha(\theta_{s-1}, \theta^*)$ is the MH acceptance probability of moving from $\theta_{s-1}$ to $\theta^*$, and $s = 1, 2, \ldots, k$ is the current cycle of the mutation kernel. The ESJD of Pasarica & Gelman (2010) is found by taking the expectation with respect to the conditional values in (10) with distributions
$$\theta_{s-1} \sim p_t(\theta \mid y) \quad \text{and} \quad (\theta^* \mid \theta_{s-1}) \sim q_\phi(\theta_{s-1}, \cdot) \qquad (11)$$
where $q_\phi$ is the proposal distribution of the MH kernel.
We will denote the idealised ESJD random variable, or jumping distance, generated during iteration $s$ of the SMC algorithm as
$$J_s(\phi) = \|\theta^* - \theta_{s-1}\|_\Sigma^2 \, \alpha(\theta_{s-1}, \theta^*)$$
where $\theta^*$ and $\theta_{s-1}$ are approximately distributed according to (11), due to a finite number of particles.

The first diversification criterion under a jumping distance rule we consider is the feasible set $\mathcal{D} = \{(k, \phi) : D(k, \phi) \geq p_{\min}\}$ where
$$D(k, \phi) = \mathbb{P}\left( \sum_{s=1}^{k} J_s(\phi) \geq d \right). \qquad (12)$$
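For reference, a sketch of the conditional ESJD (10) for a single proposal, assuming the MH ratio is supplied on the log scale; the names are our own.

```python
import numpy as np

def conditional_esjd(theta, theta_star, log_mh_ratio, Sigma_inv):
    """J(theta, theta*) = ||theta* - theta||^2_Sigma * alpha(theta, theta*), eq. (10)."""
    diff = theta_star - theta
    sq_mahalanobis = float(diff @ Sigma_inv @ diff)   # squared Mahalanobis distance
    alpha = np.exp(min(0.0, log_mh_ratio))            # alpha = min{1, r}
    return sq_mahalanobis * alpha
```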
We can also write $D(k, \phi)$ as
$$D(k, \phi) = \mathbb{P}\left( \mathbb{E}\left[ \sum_{s=1}^{k} \|\theta_s - \theta_{s-1}\|_\Sigma^2 \right] \geq d \right)$$
where the expectation is with respect to the random acceptances over $k$ cycles of the MH kernel. Written in this way, $D(k, \phi)$ elicits an interesting comparison to (7): it is simply a change of "norm" when moving between one-move and jumping distance diversification.

Unfortunately, using the general jumping distance criterion in (12) to select the tuning parameters with a pilot MCMC run is more challenging than the one-move criterion. To simplify, we will first limit our focus to the median, but note that our arguments can apply to any chosen quantile. Assuming $p_{\min} = 0.5$, we can rewrite the feasible set $\mathcal{D}$ as $\mathcal{D} = \{(k, \phi) : D(k, \phi) \geq d\}$ where
$$D(k, \phi) = \mathrm{median}\left\{ \sum_{s=1}^{k} J_s(\phi) \right\}. \qquad (13)$$
This criterion is still relatively difficult to optimise, so we use an upper bound found by Jensen's inequality for multivariate medians (Merkle 2010, Theorem 5.2) to develop an approximate optimisation problem. The new, and tractable, optimisation problem replaces $D(k, \phi)$ in (13) with
$$\tilde{D}(k, \phi) = k \times \mathrm{median}\{ J_1(\phi) \} \qquad (14)$$
where $J_1(\phi)$ is the first jumping distance with respect to the tuning parameter(s) $\phi$. In practice, each $J_1(\phi)$ is estimated using a pilot run of the mutation step. The approximation can be thought of as assuming the first jumping distance is representative of the jumping distances for subsequent steps. Under this criterion, a diversification threshold of $d$ is achieved when the median of the jumping distance from a single cycle is greater than $d/k$, the average distance per total number of cycles.

Using $\tilde{D}(k, \phi)$, rather than the original $D(k, \phi)$, imposes a stronger condition on the diversification requirement during the mutation step, as outlined in Proposition 2.

Proposition 2
Consider the feasible sets $\mathcal{D}$ and $\tilde{\mathcal{D}}$, given by
$$\mathcal{D} = \{(k, \phi) : D(k, \phi) \geq d\} \quad \text{where} \quad D(k, \phi) = \mathrm{median}\left\{ \sum_{s=1}^{k} J_s(\phi) \right\}$$
$$\tilde{\mathcal{D}} = \left\{(k, \phi) : \tilde{D}(k, \phi) \geq d\right\} \quad \text{where} \quad \tilde{D}(k, \phi) = k \times \mathrm{median}\{ J_1(\phi) \}.$$
For a given $\phi$, if $J_s(\phi)$, $s = 1, 2, \ldots, k$, are iid then $\tilde{\mathcal{D}} \subseteq \mathcal{D}$.

Imposing the feasible set $\tilde{\mathcal{D}}$ on the optimisation problem (3) ensures the weaker condition of the feasible set $\mathcal{D}$ also holds. This motivates the use of $\tilde{\mathcal{D}}$ as an approximation of the intended optimisation. The proof of Proposition 2 is in Appendix A.2.

With the approximate diversification criterion, $\tilde{D}(k, \phi)$, and MH cost, we can simplify the general optimisation (3) to coincide with the rule given in Salomone et al. (2018).

Proposition 3
Assume the cost function is $C(k, \phi) = k \times L_F$, approximating the cost of a standard Metropolis-Hastings step, and diversification criterion $D(k, \phi) = \tilde{D}(k, \phi)$. The solution to (3) will be equivalent to
$$\arg\max_{\phi \in \Phi} \mathrm{median}\{ J_1(\phi) \}. \qquad (15)$$
A proof is in Appendix A.3. Proposition 3 provides further justification for the median ESJD tuning rule (Salomone et al. 2018), the authors having found it useful for choosing a minimum number of mutation cycles to avoid degenerate particle populations.
To apply this principle in practice we can use a pilot run of the mutation step over a discrete grid $\Phi$ in order to make this problem tractable. Our approach is to randomly assign each particle a value from a relatively small pool of candidates, generally a grid of reasonable values. For example, when $\phi = [h]$, the step size in an MVN proposal, we could take $\Phi$ to be a small grid of step sizes and assign these values to each particle using a random partition (of equal sizes) of the particles. The median jumping distance, $m_\phi = \mathrm{median}\{J_1(\phi)\}$ for each $\phi \in \Phi$, can then be approximated by
$$\hat{m}_\phi = \mathrm{median}\left\{ J^i \right\}_{i \in S_\phi}$$
where $J^i$ is the jumping distance of the $i$th particle in the pilot run, and $S_\phi$ is the partition describing the allocation of particles to each $\phi \in \Phi$.

The main benefit of using a grid for the optimisation is to ensure that the procedure is simple and computationally unburdensome. There are other possibilities, however, which may be more appropriate in higher dimensions of $\phi$. For example, if assigning each particle a unique value of $\phi$, running a non-linear regression to generate an estimate of the median jumping distance may be feasible.

5 Optimising delayed-acceptance in SMC

So far we have set up the computational optimisation problem we wish to solve, and shown how it applies to two diversification criteria in the literature: one-move diversification in Section 4.1, and jumping distance diversification in Section 4.2. The solutions we derived for the optimal tuning parameters correspond to mutations using cycles of MH kernels in SMC. However, we have only considered standard Metropolis-Hastings, and we can now extend the optimisation framework to the delayed-acceptance kernel.

Under two-stage delayed-acceptance the optimisation now has the objective function
$$C(k, \phi) = k \left( L_S + \alpha_1(\phi) L_F \right) \qquad (16)$$
which can be interpreted as $k$ MCMC iterations, each with cost $L_S$ for evaluating the approximate likelihood and additional cost $L_F$ for evaluating the full likelihood, which occurs with average probability $\alpha_1(\phi)$ and is dependent on the tuning parameters of the proposal distribution. Combined with the chosen diversification criterion, this optimisation is somewhat more involved. Our approach is to (again) rely on taking the possible tuning parameter set, $\Phi$, to be a small finite set of values. This set should be much smaller than the total number of particles so that the pilot run can effectively estimate the parameters required to solve the (approximate) optimisation.

For delayed-acceptance with an adaptive Gaussian proposal distribution the tuning parameter $\phi$ is the step size $h$. A pilot run of the mutation step is performed where the $i$th particle is randomly allocated to the group $S_\phi$ associated with a particular grid point, $\phi \in \Phi$, whilst ensuring evenly sized groups.

The first-stage acceptance probability can be estimated for each $\phi \in \Phi$ by averaging within the groups to yield $\hat{\alpha}_1(\phi)$, whilst the average likelihood costs $L_F$ and $L_S$ can be estimated using all of the pilot runs. These estimates allow us to approximate the first component of the optimisation, the computational effort $C(k, \phi)$. In order to ensure the jumping distance threshold is met, we need to estimate the overall acceptance probability, and provide a means to approximate the total (additive) jumping distance after $k$ mutation cycles, based only on one pilot run.
The general strategy is to take $\Phi$ as a discrete grid, estimate the minimum number of cycles required to reach the diversification threshold for each $\phi \in \Phi$, then find the $\phi$ with minimal cost; a sketch of this selection under the median method is given below.

Once a pair of tuning parameters and required iterations has been found, $(\phi^*, k^*_\phi)$, one can either iterate the mutation step $k^*_\phi$ times ($k^*_\phi - 1$ is also an option), or monitor the empirical diversification threshold after each mutation step until it has been met. We choose the latter for our simulation study in Section 6 and the example using the Whittle likelihood in Section 7.
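The following sketch combines the group-wise pilot estimates $\hat{m}_\phi$ (Section 4.2) with the delayed-acceptance cost (16) under the median method; the container names are hypothetical placeholders, and all inputs come from a single pilot cycle.

```python
import numpy as np

def choose_phi_da(Phi, groups, jumps, alpha1, L_S, L_F, d):
    """groups[phi]: particle indices S_phi; jumps[i]: pilot ESJD J^i;
    alpha1[phi]: average first-stage acceptance rate within group phi."""
    best = None
    for phi in Phi:
        m = np.median(jumps[groups[phi]])     # m_hat_phi
        if m <= 0:
            continue                          # no movement in this group
        k = int(np.ceil(d / m))               # median method: k* = ceil(d / m)
        c = k * (L_S + alpha1[phi] * L_F)     # delayed-acceptance cost, eq. (16)
        if best is None or c < best[0]:
            best = (c, k, phi)
    return best[2], best[1]                   # (phi*, k*_phi)
```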
5.1 Choosing the number of mutation cycles

The main hurdle to choosing the optimal kernel parameters with cost function (16) and a finite discrete set $\Phi$ is determining the minimum number of MCMC cycles required to satisfy the diversification criterion. This section details several possible methods for this important problem, followed by Algorithm 2, which describes the overall process of choosing the parameters.

To estimate the number of cycles $k$ required for each element $\phi \in \Phi$ to achieve the required jumping distance threshold in (14), we can use a pilot run of the mutation step. After such a pilot run we have a set of realisations of the ESJD for each tuning parameter; that is, we have observed values for $J_1(\phi)$. Fitting an amenable model to these realisations will allow us to estimate how many cycles are required for diversification under each parameter. We propose three such methods for modelling the expected squared jumping distance used to estimate the required number of cycles.

5.1.1 Median method

The first model uses the same principles as the median method described in Section 4.2, whereby the minimum number of cycles $k$ is approximated by
$$k^*_\phi = \min\{ k \in \mathbb{Z}^+ : k \times \mathrm{median}\{ J_1(\phi) \} \geq d \}. \qquad (17)$$
In practice the number of cycles is then chosen by $k^*_\phi = \lceil d / \mathrm{median}\{ J(\phi)_i \}_{i \in S_\phi} \rceil$, where $J(\phi)_i$ is the realisation of the ESJD for the $i$th particle. We refer to this process as the median method.

5.1.2 Gamma method

We may also model the jumping distances parametrically, which we describe as follows. The ESJDs are positive random variables generated from the normalised proposal distribution distance multiplied by the acceptance probability. In the case of a multivariate normal proposal, the use of the Mahalanobis distance results in the proposed jumps having a Chi-squared distribution with $p$ degrees of freedom. This motivates one choice of model for the expected squared jumping distance: the gamma distribution. It has the desirable properties of having positive support, and the sum of gamma random variables is also gamma, providing a means to estimate how many iterations are required to achieve the minimum diversification threshold.

Under the gamma model, for each group of jumping distances, $\{J(\phi)_i\}_{i \in S_\phi}$, indexed by $\phi$, we fit a gamma distribution to obtain the parameter estimates $\hat{a}_\phi$ and $\hat{b}_\phi$, the shape and rate respectively. Based on these parameters, the total ESJD of $k$ cycles is approximated by
$$\sum_{s=1}^{k} J_s(\phi) \sim \mathcal{G}(k \hat{a}_\phi, \hat{b}_\phi) \qquad (18)$$
using the additive property of the gamma distribution. We can then calculate the minimum number of iterations for each parameter $\phi$,
$$k^*_\phi = \min\left\{ k \in \mathbb{Z}^+ : \mathbb{P}\left( \sum_{s=1}^{k} J_s(\phi) \geq d \right) \geq p_{\min} \right\} \qquad (19)$$
by iterating through $k = 1, 2, \ldots$ using the complementary cumulative distribution function (CCDF) of the gamma distribution in (18) and noting the monotonicity in $k$ of the required probability. If $\hat{a}_\phi$ is small, it may be pragmatic to do a line search rather than iterate through each $k = 1$ then $k = 2$ et cetera.

5.1.3 Bootstrap method

We also suggest a non-parametric alternative, which we refer to as the bootstrap method. Under this method we assume each $J_s(\phi)$ for $s = 1, 2, \ldots$ is drawn from a discrete distribution with values from the pilot run $\{J(\phi)_i\}_{i \in S_\phi}$ for $\phi \in \Phi$. If we assign equal probability to these draws then we can bootstrap the minimum $k^*_\phi$ required. For each $\phi$ we continue drawing new jumping distances until the probability threshold in (19) is met.
This approximates the required number of mutation cycles non-parametrically; sketches of both estimates of $k^*_\phi$ are given below. The process for calculating $k^*_\phi$ and choosing the tuning parameter(s) is described in Algorithm 2 for both the gamma and bootstrap methods. Replacing steps 1(a)-(b) of Algorithm 2 with $k_g = \lceil d / \mathrm{median}\{J(\phi_g)\} \rceil$ gives the algorithm for the median method.
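Sketches of the gamma and bootstrap estimates of $k^*_\phi$ in (19) follow, using scipy for the gamma fit; the default arguments shown are illustrative choices rather than the paper's settings.

```python
import numpy as np
from scipy import stats

def k_star_gamma(pilot_jumps, d, p_min=0.5, k_max=100):
    """Gamma method: sum of k iid G(a, b) draws is G(k a, b), eq. (18)-(19)."""
    a, _, scale = stats.gamma.fit(pilot_jumps, floc=0.0)
    for k in range(1, k_max + 1):
        if stats.gamma.sf(d, k * a, scale=scale) >= p_min:   # CCDF >= p_min
            return k
    return k_max

def k_star_bootstrap(pilot_jumps, d, p_min=0.5, k_max=100, B=500, rng=None):
    """Bootstrap method: resample cumulative sums of pilot jumps and take the
    first k whose total exceeds d with probability at least p_min."""
    rng = rng or np.random.default_rng()
    draws = rng.choice(pilot_jumps, size=(B, k_max), replace=True)
    totals = draws.cumsum(axis=1)
    prob = (totals >= d).mean(axis=0)
    hits = np.nonzero(prob >= p_min)[0]
    return int(hits[0]) + 1 if hits.size else k_max
```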
Tuning parameter φ selection from pilot mutation run Input:
Tuning parameter $\phi$ selection from pilot mutation run

Input:
$\Phi = \{\phi_g\}_{g=1}^G$; allocation of tuning parameters to particles, $\{S_\phi\}_{\phi \in \Phi}$; pilot run ESJDs, $\{J^i\}_{i=1}^N$; average first-stage acceptance probabilities, $\{\hat{\alpha}_1(\phi_g)\}_{g=1}^G$; approximate computation costs of the surrogate and full likelihoods, $L_S$ and $L_F$; minimum probability quantile, $p_{\min}$.

1. For $g \in \{1, \ldots, G\}$ iterate through
   (a) Set $k_g = 0$ and $\hat{p} = 0$
   (b) While $\hat{p} \leq p_{\min}$, increment $k_g$ then update the quantile estimate $\hat{p} = \mathbb{P}\left( \sum_{s=1}^{k_g} J_s(\phi_g) \geq d \right)$ using the gamma or bootstrap method (Sections 5.1.2 and 5.1.3)
   (c) Calculate $C_g = C(k_g, \phi_g) = k_g (L_S + \hat{\alpha}_1(\phi_g) L_F)$
2. Calculate the index of the best tuning parameter, $g^* = \arg\min_{g \in \{1, \ldots, G\}} C_g$

Return:
Selected tuning parameter $\phi_{g^*}$.

One further consideration is required for determining the tuning parameters with a pilot run when using delayed-acceptance rather than standard Metropolis-Hastings. In the case of delayed-acceptance, the overall acceptance probability is the product of the two acceptance probabilities (see Section 3), that is,
$$\alpha(\theta^*, \theta_s) = \alpha_1(\theta^*, \theta_s) \, \alpha_2(\theta^*, \theta_s). \qquad (20)$$
However, $\alpha_2(\theta^*, \theta_s)$ is only calculated for the proposals that pass the surrogate screening, hence we cannot calculate $\alpha(\theta^*, \theta_s)$ exactly for the remaining particles without wasting computation time. As such, we run a linear regression with response $\log r(\theta^*, \theta_t)$, the full log-ratio used in standard MH, and explanatory variable $\log r_1(\theta^*, \theta_t)$, the surrogate log-ratio for the acceptance rate. We also use the step size of the mutation kernel as an additional explanatory variable, which assists when the first-stage acceptance rate is small. The overall log-ratio, $\widehat{\log r}(\theta^*, \theta_t)$, is then predicted for cases where the proposal was rejected using the surrogate likelihood. We convert this to a probability using the standard MH formula, $\min\{\exp(\widehat{\log r}), 1\}$.

Simpler strategies are possible, but were found to be ineffective. For example, using $\alpha(\theta^*, \theta_s) = \alpha_1(\theta^*, \theta_s) \bar{\alpha}_2$ if the proposal was rejected at the surrogate stage when calculating the ESJD, where $\bar{\alpha}_2$ is the mean of the second-stage acceptance probabilities that were calculated.

5.2 Calibrating the surrogate likelihood

The surrogate likelihood can be a biased approximation of the full likelihood, which impacts the effectiveness of the delayed-acceptance method. Within SMC, the previous evaluations of the full likelihood contain valuable information which we can use to tune the surrogate likelihood. We propose a generic method for calibrating the surrogate likelihood to better match the full likelihood during the delayed-acceptance mutation step of our SMC algorithm. The method is constructed to avoid evaluating the costly full likelihood by relying on the history of the particles, and so that no user-chosen tuning parameters are required. We begin by defining the general transformation considered, with the corresponding optimisation problem, followed by the particular implementation we chose to explore in this paper.
The general transformation consists of two parts. The first anneals components of the surrogate likelihood as follows. Suppose the surrogate likelihood, $\tilde{L}(y \mid \theta)$, can be decomposed into the product $\tilde{L}(y \mid \theta) = \prod_{j=1}^{q} \tilde{L}_j(y \mid \theta)$; then define the weighted-annealing transformation of the surrogate as
$$\tilde{L}_{\zeta}(y \mid \theta) = \prod_{j=1}^{q} \tilde{L}_j(y \mid \theta)^{\zeta_j}.$$
In some cases the product decomposition defined by components $\tilde{L}_j(y \mid \theta)$ will correspond to each datum, i.e. $\tilde{L}_j(y \mid \theta) = p_j(y_j \mid \theta)$ where $p_j$ is the probability density function for datum $y_j$. This need not be the case, and if the surrogate likelihood is not decomposable we can take $q = 1$ or redefine the surrogate likelihood with additional components if appropriate. A special case of this transformation, taking $q = 1$ or $\zeta_i = \zeta$, corresponds to the power-likelihood annealing commonly used in SMC.

The second transformation on the surrogate likelihood is a bijection of the model parameters, $T_{\xi}$, with tuning parameters $\xi$. Combining the annealing weights and parameter transformation, the overall parameterised surrogate likelihood is $\tilde{L}_{\zeta}(y \mid T_{\xi}(\theta))$.

Prior to the mutation step in each iteration of SMC we would like to minimise a distance (or discrepancy) to find $(\xi^\star, \zeta^\star)$, the optimal transformation, by solving
$$(\xi^\star, \zeta^\star) = \arg\min_{\xi, \zeta} \sum_{\theta \in H} d\left[ L(y \mid \theta), \, \tilde{L}_{\zeta}(y \mid T_{\xi}(\theta)) \right] \qquad (21)$$
where $d(x, y)$ is a measure of discrepancy, $L(y \mid \theta)$ is the full likelihood, and $H$ is the set of particle locations to average across. In some circumstances, it may also be appropriate to add a penalty term to this optimisation problem. To reduce the cost of calibrating the surrogate likelihood in this way, $H$ should be a subset of locations where the expensive likelihood, $L(y \mid \theta)$, has already been evaluated. For simplicity, we choose $H$ to be the set of locations from the current particle set in the SMC algorithm.

Our examples in Sections 6 and 7 take the discrepancy measure to be $d(x, y) = (\log x - \log y)^2$, corresponding to the sum of squared differences of the log-likelihoods, with $T_{\xi}(\theta) = \theta - \xi$, and the annealing weights, $\zeta$, applied to the density of each datum. The transformation $T_{\xi}$ serves to correct for bias in the surrogate likelihood, whilst the $\zeta$ act to flatten or steepen the surrogate as needed. This is a relatively simple choice, for which we formulate an approximate optimisation, and it serves to illustrate that even simple calibration can be useful for delayed-acceptance methods.

We approximate the solution to (21) by first minimising the discrepancy with respect to $\xi$ with $\zeta = \mathbf{1}$, followed by a conditional optimisation with a lasso penalty (Tibshirani 1996), with shrinkage towards the unit vector rather than zero. The proposed approximate solution is
$$\xi^\star = \arg\min_{\xi, \mu_1} \sum_{\theta \in H} \left[ \ell(y \mid \theta) - \tilde{\ell}_{\zeta}(y \mid \theta - \xi) - \mu_1 \right]^2 \quad \text{with } \zeta = \mathbf{1}, \qquad (22)$$
$$\zeta^\star = \arg\min_{\zeta, \mu_2} \sum_{\theta \in H} \left[ \ell(y \mid \theta) - \tilde{\ell}_{\zeta}(y \mid \theta - \xi^\star) - \mu_2 \right]^2 + \Lambda \|\zeta - \mathbf{1}\|_1 \qquad (23)$$
where $\ell(y \mid \theta)$ is the full log-likelihood, $\tilde{\ell}_{\zeta}(y \mid \theta) = \log \tilde{L}_{\zeta}(y \mid \theta)$ is the weighted surrogate log-likelihood, whilst $\mu_1$ and $\mu_2$ are nuisance parameters. Structuring the optimisation as such, (22) can be solved using non-linear least squares, and serves to align the surrogate and full likelihood using a translation.
Meanwhile, (23) has the form of a lasso regression, since
$$\tilde{\ell}_{\zeta}(y \mid \theta) = \sum_{j=1}^{q} \zeta_j \log \tilde{L}_j(y \mid \theta),$$
where the $\zeta_j$ are the linear coefficients. Shrinkage towards the unit vector can be achieved with a change of variables, $\zeta' = \zeta - \mathbf{1}$, which transforms the regression to the standard form. We find it convenient to automate the choice of $\Lambda$ using cross-validation of the lasso (Friedman et al. 2010); a sketch of this two-stage calibration is given below.
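A sketch of the two-stage calibration (22)-(23), assuming the surrogate log-likelihood decomposes into per-component terms as above; scikit-learn's cross-validated lasso stands in here for the glmnet-style procedure of Friedman et al. (2010), and all names are our own.

```python
import numpy as np
from scipy.optimize import least_squares
from sklearn.linear_model import LassoCV

def calibrate(thetas, log_like, log_like_tilde_parts, dim):
    """thetas: particle locations H; log_like[i] = l(y | theta_i), already stored
    from the SMC history; log_like_tilde_parts(theta) -> vector of log Ltilde_j."""
    full = np.asarray(log_like)

    def resid(params):                        # params = (xi, mu_1), eq. (22)
        xi, mu = params[:dim], params[dim]
        return np.array([full[i] - log_like_tilde_parts(th - xi).sum() - mu
                         for i, th in enumerate(thetas)])

    xi_star = least_squares(resid, x0=np.zeros(dim + 1)).x[:dim]

    # Lasso in shifted variables zeta' = zeta - 1, so shrinkage is towards 1.
    X = np.array([log_like_tilde_parts(th - xi_star) for th in thetas])
    y = full - X.sum(axis=1)                  # residual at zeta = 1
    fit = LassoCV(cv=5).fit(X, y)             # intercept plays the role of mu_2
    zeta_star = 1.0 + fit.coef_
    return xi_star, zeta_star
```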
5.3 Surrogate first annealing

The final application of a surrogate likelihood for an efficient SMC algorithm is to utilise it in the distribution path of SMC. It is possible to use the surrogate likelihood for this purpose by annealing through an inexpensive sequence of distributions using the surrogate before correcting to the full posterior. Such a distribution path can eliminate low probability regions of the parameter space with little computational cost. We propose surrogate first annealing (SFA), in which the sequence of distributions the particles travel through is determined by
$$p_t(\theta) = p_0(\theta)^{\max\{1 - \gamma_t, \, 0\}} \; \tilde{p}(\theta \mid y)^{\lambda \min\{\gamma_t, \, 2 - \gamma_t\}} \; p(\theta \mid y)^{\max\{0, \, \gamma_t - 1\}}$$
$$\text{for } \gamma_0 < \cdots < \gamma_S = 1 < \gamma_{S+1} < \cdots < \gamma_T = 2 \text{ and } 0 < \lambda \leq 1 \qquad (24)$$
where $p_0(\theta)$ is the initial distribution, $\tilde{p}(\theta \mid y)$ is the posterior with surrogate likelihood, $p(\theta \mid y)$ is the posterior with full likelihood, and $\lambda$ controls the maximum power of the surrogate posterior during the sequence. In practice this amounts to two SMC runs (in line with Algorithm 1), where the first anneals from the initial distribution to (a power of) the surrogate posterior, and the second anneals from (a power of) the surrogate posterior to the full posterior. Some representative temperatures, $\gamma_t$, and associated distributions along the annealing path are given in Table 1.

Table 1: Representative temperatures and distributions using surrogate first annealing.

  $\gamma_t$     $p_t(\theta)$
  $0$            $p_0(\theta)$
  $0.5$          $p_0(\theta)^{0.5} \, \tilde{p}(\theta \mid y)^{0.5\lambda}$
  $1$            $\tilde{p}(\theta \mid y)^{\lambda}$
  $1.5$          $\tilde{p}(\theta \mid y)^{0.5\lambda} \, p(\theta \mid y)^{0.5}$
  $2$            $p(\theta \mid y)$

As with all SMC and importance sampling algorithms, if the initial distribution does not adequately cover the tails of the target distribution then the sampler can perform poorly. As such, care needs to be taken that the surrogate posterior is a good initial distribution for the full posterior. Therefore, we recommend choosing a small value of $\lambda$, and $\lambda$ should decrease as the number of observations increases. In general, $\lambda$ should be chosen as small as possible but sufficiently large to eliminate low-probability areas of the parameter space and speed up the SMC algorithm. For instance, at $\gamma_t = 1$ the particles target $\tilde{p}(\theta \mid y)^{\lambda}$, a flattened (overdispersed) version of the surrogate posterior that still rules out negligible-probability regions.

6 Simulation study

To test the efficacy of the proposed adaptive SMC algorithm, we consider a linear regression where we artificially control the cost of the likelihood evaluations. In particular, we fit the following normal or student-t regression models
$$(y_i \mid \beta, \sigma^2) \sim \mathcal{N}(X_i \beta, \sigma^2) \qquad (25)$$
or
$$(y_i \mid \nu, \beta, s) \sim \mathcal{T}(\nu, X_i \beta, s) \qquad (26)$$
for $i = 1, 2, \ldots, n$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$, $\mathcal{T}(\nu, \mu, s)$ denotes the location-scale student-t distribution with degrees of freedom $\nu$, mean $\mu$ (if $\nu > 1$) and scale $s$, and $X_i$ is the $i$th row of the design matrix. Both models have the following priors on $\beta$:
$$\beta_j \sim \mathcal{N}(0, \tau^2)$$
The surrogate likelihood is also given an artificial delay, L S , in order to controlthe computational difference between the likelihoods. We express this as the ratio ρ = L F /L S , as the results can beinterpreted for any time unit.Relative to the surrogate likelihood the full likelihoods, the normal and student-t distributions, represent two idealisedextremes possible to encounter with delayed-acceptance. In the normal case, the surrogate likelihood can be transformedto exactly match the full likelihood using the calibration method in Section 5.2, whereas the surrogate likelihood willnever be able to replicate the heavy tails in the student-t likelihood.We ran several adaptive SMC algorithms under various combinations of settings. The MH - and SFA -SMC algorithmsused SMC with a Metropolis-Hastings transition kernel using an adaptive step size, the latter using the surrogate firstannealing as described in Section 5.3. The DA , DA+T , DA+SFA , and
DA+T+SFA algorithms use the delayed-acceptance kernel, where +T indicates that surrogate calibration was used as in Section 5.2, and +SFA indicates the use of surrogate first annealing. Each of these algorithms was tested with kernel tuning using the median, gamma, and bootstrap methods described in Section 5.1, using the cost function (4) for MH and (16) for DA. We also ran two SMC algorithms akin to MH-SMC, but with fixed step size h. The step sizes for these algorithms, MH (f-) and
MH (f+), were chosen by taking the average optimal step size from the MH-SMC algorithm and selecting the closest smaller (f-) or larger (f+) step size from the set Φ. Note that the performance tables that follow report only the better performing of MH (f-) and MH (f+).

We tested the above SMC algorithms with the following settings. The number of particles is N = 2000, the initial distribution p_0 is the prior, and the proposal distribution for the mutation step is an adaptive-variance multivariate normal random walk, i.e. (2). The ESS threshold used to adaptively select the temperature, γ_t, is S = N/2. Each SMC algorithm used stratified sampling with 10 strata to resample the particles.

We choose the diversification threshold to be the median (p_min = 0.5) of the total ESJD exceeding d ≈ 7.29, chosen such that P(X > d) = 0.2, where X ∼ χ²(5), a chi-square distribution with 5 degrees of freedom. The choice of d is motivated by the following idea. Conditional on the current location of the particle, θ_{s−1}, the ‖θ* − θ_{s−1}‖² term in the ESJD definition (10) has a χ²(5) distribution (the number of parameters). As such, requiring P(X > d) = 0.2 ensures that, after k mutation cycles with any level of acceptance, the empirical distribution of the ESJD exceeds the 80% quantile of the distribution that assumes all proposals from one cycle are accepted.

The number of mutation cycles is optimised with respect to the step-scale, h, of the MVN random walk, as described in Sections 4 and 5. Possible values for h are chosen from a fixed set Φ of eight values. Cycles of the mutation kernel continue until the median requirement is satisfied empirically by the movement of the particles, or the maximum number of cycles (set at 100) is reached. The maximum power of the surrogate first annealing procedure, λ, was held fixed.

To calibrate the surrogate likelihood, we use the transformations described in Section 5.2. The intercept accounting for scaling differences between the log-likelihoods is included in the optimisation, but not used to transform the surrogate log-likelihood as it does not affect the MH ratio. Five-fold cross-validation was used to select Λ in the lasso procedure.
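A minimal sketch of this diversification rule follows (our own illustrative Python, not the paper's code). Here mutate_once is a hypothetical function applying one cycle of the mutation kernel to all particles, and the jumping distances are assumed to be pre-whitened by the proposal covariance so that a full move is approximately χ²(dim) distributed.

```python
import numpy as np
from scipy.stats import chi2

def diversify(particles, mutate_once, dim=5, max_cycles=100):
    """Cycle a mutation kernel until the median total ESJD exceeds d.

    d is the 80% quantile of a chi-square(dim) distribution, so that
    P(X > d) = 0.2 for X ~ chi2(dim).
    """
    d = chi2.ppf(0.8, df=dim)            # ~7.29 when dim = 5
    n = particles.shape[0]
    total_esjd = np.zeros(n)             # accumulated squared jumping distances
    for cycle in range(max_cycles):
        new_particles = mutate_once(particles)
        jumps = np.sum((new_particles - particles) ** 2, axis=1)
        total_esjd += jumps              # rejected moves contribute zero
        particles = new_particles
        if np.median(total_esjd) >= d:   # median criterion (p_min = 0.5)
            break
    return particles, cycle + 1
```

The median criterion corresponds to p_min = 0.5: at least half the particles must have accumulated total squared jumping distance above the χ² threshold before the mutation step terminates.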
[Figure 1 panels: faceted by tuning method (Bootstrap, Gamma, Median) and likelihood (Normal, Student); x-axis: likelihood cost ratio (ρ); y-axis: median relative efficiency (SE × Time); one line per SMC algorithm: MH, DA+T, DA, SFA+DA+T, SFA+DA, SFA, MH (f−), MH (f+).]
Figure 1: Computational efficiency as measured by squared error × computation time (SE × Time) for competing SMC algorithms. Median efficiency is plotted relative to the MH-SMC algorithm (constant red line at 1), which used median tuning across all simulations. The 10th and 90th quantiles are plotted as error bars for the SFA+DA+T algorithm.

The SMC sampler was run with the cost of the full likelihood, L_F, varied over a grid of values, whilst the cost of the surrogate likelihood was held fixed at L_S, yielding a corresponding grid of relative costs ρ. The simulation was repeated 50 times under each setting, from which we measured the average efficiency of the algorithms in two ways. The first efficiency metric was the squared error (SE) of the parameter values multiplied by the scaled likelihood evaluations, calculated as SLE = E_L + ρ⁻¹ E_{˜L}, where E_L is the number of full likelihood evaluations and E_{˜L} is the number of surrogate likelihood evaluations. The second efficiency metric was squared error multiplied by computation time. Median efficiencies are displayed in Figure 1, and Tables 2 and 3 present the median efficiency gains relative to the standard MH-SMC algorithm under each metric. The raw computation time gains are reported in Table 4.

The DA+T+SFA algorithm had the best median efficiency gains among all algorithms, tuning methods, and efficiency metrics, with a single exception at ρ = 10, where it was marginally outperformed by DA for the student likelihood under the SE × SLE metric (Table 2). DA+T+SFA performed best under median tuning for the student-t likelihood and (overall) second to the bootstrap method for the normal likelihood. In the results and tables that follow, we focus on the median method, since all tuning methods performed similarly under the normal distribution and the median method is the simplest to program. Also note that the maximum mutation-cycle limit (of 100) was never reached in the DA+T+SFA simulations using the median method.

Surrogate first annealing with delayed-acceptance and tuning (DA+T+SFA) achieved multiplicative efficiency gains under both the median SE × SLE metric (Table 2) and the median SE × time metric (Table 3), with the largest gains occurring at the higher cost ratios. DA+T+SFA also had the best median raw computation-time improvements relative to MH-SMC (see Appendix A.4, Table 4).
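The first metric can be computed directly from the evaluation counts. A minimal sketch, under the definition above and with hypothetical names:

```python
def scaled_likelihood_evals(n_full, n_surrogate, rho):
    """Scaled likelihood evaluations: SLE = E_L + E_L~ / rho.

    One surrogate evaluation counts as 1/rho of a full evaluation,
    where rho = L_F / L_S is the cost ratio between the likelihoods.
    """
    return n_full + n_surrogate / rho

# Example: 1e4 full and 5e4 surrogate evaluations at cost ratio rho = 100
sle = scaled_likelihood_evals(1e4, 5e4, 100.0)   # 10500.0
# The SE x SLE efficiency metric multiplies SLE by the squared error
# of the posterior estimates (not shown here).
```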
The best result of the fixed MH-SMC algorithms, MH (f-) or MH (f+), is reported in Tables 2–4 as MH (fixed). These algorithms clearly performed worse than the adaptive MH-SMC, with median relative performance below one under all metrics. This represents an improvement of at least 25% for the median tuning method when compared to the "best" fixed step size for MH within SMC (which is not available in practice), adding further evidence that this adaptive procedure is useful for speeding up standard SMC (in line with recommendations from Salomone et al. 2018).

The algorithms using delayed-acceptance with surrogate likelihood calibration (Section 5.2, DA+T) outperformed their counterparts without such calibration, in some cases substantially. The exception to this trend occurs for the student-t distribution under the SE × SLE metric when comparing DA and DA+T: in this scenario they are mostly on par, except at ρ = 10. Less tuning was required for the student-t likelihood (see the comparison of posteriors in Appendix A.4), which may account for this. It is interesting to note that using tuning for the student-t likelihood does increase the 10th quantile of the efficiency measures, on average, for large ρ.

The efficiency and computation-time tables reveal an interesting feature of the proposed DA+T+SFA algorithm: its efficiency gains cannot be attributed solely to either the DA+T aspect or the SFA aspect. Moreover, the gains from using both DA+T and SFA within the DA+T+SFA algorithm are not additive in its constituent parts.

Given the lighter tails of the normal distribution, we also investigated algorithm performance on this model with a larger maximum annealing parameter λ for the SFA method. Under this regime, overall speed-ups were observed to be greater than the results reported thus far for the normal model.

From the results, it is clear that surrogate likelihoods have the potential to improve computation time and efficiency in SMC. However, there does appear to be a threshold that the ratio of computation cost between the surrogate and full likelihoods must exceed before substantial gains are realised. Whilst the modest speed-up observed at ρ = 10 may be critical in some cases, it may not by itself justify implementing new methods to improve computation time. The DA-SMC methods in this paper are therefore likely to be most valuable when the likelihood cost ratio is well above ρ = 10, where larger speed-ups can be expected (see the 80% intervals of the SE × time metric).

The Whittle likelihood is a computationally efficient likelihood approximation for time series models (Whittle 1953), constructed by applying (discrete) Fourier transforms to move to the frequency domain. A key component of the Whittle likelihood is the periodogram of the series, an estimate of the series' spectral density. The periodogram is asymptotically unbiased, a property inherited by the Whittle likelihood, making it a popular tool in time series modelling. To describe the Whittle likelihood in full, we begin with definitions for the Fourier transform of a time series model's covariance function and the discrete Fourier transform of the time series data.

Let {X_t}_{t=1}^n be a zero-mean, equally spaced time series with stationary covariance function κ(τ, θ) = E(X_t X_{t−τ}), where θ are the parameters of the distribution governing X_t. Transforming both the data and the covariance function to the frequency domain enables us to construct the Whittle likelihood from these elements, rather than using the time domain as inputs.
The Fourier transform of the model's covariance function, i.e. the spectral density f_θ(ω), is

f_θ(ω) = (1/2π) Σ_{τ=−∞}^{∞} κ(τ, θ) exp(−iωτ),

where the angular frequency ω ∈ (−π, π]. The discrete Fourier transform (DFT) of the time series data is defined as

J(ω_k) = (1/√(2π)) Σ_{t=1}^{n} X_t exp(−iω_k t),  ω_k = 2π(k − ⌈n/2⌉)/n,

using the Fourier frequencies {ω_k}_{k=1}^n.
[Table 2: Median (80% interval) multiplicative improvement in efficiency (SE × SLE) relative to MH-SMC (median tuning method) for the simulation study. Columns: Likelihood, Cost ratio (ρ), SFA+DA+T, SFA, DA+T, DA, MH (fixed).]

The periodogram of the series is then

I(ω_k) = |J(ω_k)|² / n.

Using the aforementioned definitions, we can define the Whittle log-likelihood (Whittle 1953) as

ℓ_whittle(θ) = − Σ_{k=1}^{n} ( log f_θ(ω_k) + I(ω_k) / f_θ(ω_k) ).

In practice, the summation over the Fourier frequencies, ω_k, need only be evaluated over a subset (less than half) of the values, due to symmetry about ω_k = 0 and since I(0) = 0 for a zero-mean series.

The periodogram can be calculated in O(n log n) time, and only needs to be calculated once per dataset. After amortising this cost, each subsequent likelihood evaluation costs O(n), compared with the usual O(n²) likelihood cost for time series.

We demonstrate the use of the Whittle likelihood on an example from Salomone et al. (2019), who subsample frequencies of the Whittle likelihood to obtain a computationally efficient MCMC algorithm for long time series. Our example differs slightly, in that we would like to demonstrate the efficacy of our method in a pre-asymptotic regime. Hence we analyse shorter time series, which can have multimodal posteriors, making them an ideal test for SMC.

We use an autoregressive fractionally integrated moving average (ARFIMA) model to demonstrate the method on a non-trivial model (Granger & Joyeux 1980). ARFIMA models generalise autoregressive integrated moving average (ARIMA) models by allowing fractional, rather than integer, values of the differencing parameter d. For n of adequate length, the zero-mean series {X_t}_{t=1}^n is an ARFIMA time series if

φ(L)(1 − L)^d X_t = θ(L) ε_t,
where L is the lag operator, ε_t is zero-mean Gaussian noise with variance σ², and the polynomials φ(z) and θ(z) are defined as

φ(z) = 1 − Σ_{i=1}^{p} φ_i z^i  and  θ(z) = 1 + Σ_{i=1}^{q} θ_i z^i.

These models are fully parameterised by the collection of parameters (φ_1, . . . , φ_p), (θ_1, . . . , θ_q), d, and σ². For the ARFIMA process to be stationary, the zeros of φ(z) and θ(z) must lie outside the complex unit circle and −0.5 < d < 0.5. We impose the stationarity conditions by transforming the polynomial coefficients of φ(z) and θ(z) to partial autocorrelations (Barndorff-Nielsen & Schou 1973), which only requires the magnitude of the (transformed) coefficients to be less than one for stationarity. Following this transformation, all parameters are mapped to the real line. ARFIMA models are useful for their ability to describe long-term dependence in time series, which cannot be captured by ARIMA models (Granger & Joyeux 1980). The spectral density function of an ARFIMA time series is

f_{φ,d,θ}(ω) = (σ²/2π) · |θ(e^{−iω})|² / |φ(e^{−iω})|² · |1 − e^{−iω}|^{−2d},

as described in Brockwell & Davis (2016, Chapter 7.3.2). The spectral density is used to calculate the Whittle approximation for the surrogate likelihood.

[Table 3: Median (80% interval) multiplicative improvement in efficiency (SE × Time) relative to MH-SMC (median tuning method) for the simulation study. Columns: Likelihood, Cost ratio (ρ), SFA+DA+T, SFA, DA+T, DA, MH (fixed).]

To test the proposed SMC algorithm we simulated an ARFIMA(p = 2, d, q = 1) time series of length 8001, with fixed true parameters φ_1 > 0, φ_2 > 0, d > 0, θ_1 < 0, and σ² = 1. With N = 5000 particles we fit the MH-SMC and DA+T+SFA algorithms, as well as SMC using only the surrogate likelihood (labelled surrogate only). The surrogate likelihood was calibrated using the transformation described in Section 5.2, without a shift transformation, as the surrogate and full likelihoods are well aligned. The maximum power of the surrogate first annealing procedure was set to a small value of λ, reflecting the high number of observations. The median tuning method was used to select the optimal step size, and the mutation step of the SMC algorithm ran until the median empirical total ESJD exceeded the threshold.
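To illustrate, a minimal sketch of the Whittle log-likelihood for this ARFIMA model follows. This is our own illustrative Python, not the paper's released code; arfima_spectral_density and whittle_loglik are hypothetical names, and the normalisation follows the definitions above.

```python
import numpy as np

def arfima_spectral_density(omega, phi, theta, d, sigma2):
    """ARFIMA spectral density f(omega) from the formula above."""
    z = np.exp(-1j * omega)
    ar = 1.0 - sum(p * z ** (i + 1) for i, p in enumerate(phi))    # phi(e^{-iw})
    ma = 1.0 + sum(q * z ** (i + 1) for i, q in enumerate(theta))  # theta(e^{-iw})
    frac = np.abs(1.0 - z) ** (-2.0 * d)                           # |1 - e^{-iw}|^{-2d}
    return sigma2 / (2 * np.pi) * np.abs(ma) ** 2 / np.abs(ar) ** 2 * frac

def whittle_loglik(x, phi, theta, d, sigma2):
    """Whittle log-likelihood of a zero-mean series x, using the strictly
    positive Fourier frequencies only (by symmetry about zero)."""
    n = len(x)
    J = np.fft.fft(x)                       # J[j] = sum_t x_t e^{-i 2 pi j t / n}
    m = (n - 1) // 2                        # number of strictly positive frequencies
    omega = 2 * np.pi * np.arange(1, m + 1) / n
    I = np.abs(J[1:m + 1]) ** 2 / (2 * np.pi * n)   # periodogram at omega
    f = arfima_spectral_density(omega, phi, theta, d, sigma2)
    # Double the positive-frequency sum by symmetry; the omega = 0 term
    # vanishes for a zero-mean series, and the omega = pi term (even n)
    # is ignored in this sketch.
    return -2.0 * np.sum(np.log(f) + I / f)
```

The O(n log n) FFT is computed once per dataset, after which each likelihood evaluation reduces to the O(n) sum over frequencies, matching the cost analysis above.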
When calculating the estimate of the covariance matrix, Σ, for use in these algorithms, it was necessary to demean the particle locations with respect to the mode they occupied. This avoided a near-singular estimate of Σ and better reflected the average local covariance structure around the modes. We estimated the modes with k-means clustering (Hartigan & Wong 1979), where the number of clusters was selected using the Duda–Hart test (Duda & Hart 1973) and then the Caliński–Harabasz criterion (Caliński & Harabasz 1974), as implemented in the R package fpc (Hennig 2020).

Overall, the results showed a consistent speed-up across the 10 simulations (80% interval). In real terms, this reduced the computation time from about 20.5 hours to about 4.5 hours, a roughly 4.5-fold improvement. The full likelihood was evaluated several times more often during the standard MH-SMC algorithm than in the DA+T+SFA version. The cost ratio of the full likelihood to the Whittle likelihood indicated that the speed-ups were in line with the simulation study in Section 6, though slightly less than expected.

Figure 2 displays the density plots of φ from the SMC algorithms over the replicates, whilst the remaining densities are displayed in Appendix A.5. These density comparisons demonstrate that surrogate likelihoods can be used in SMC with minimal loss of accuracy in the posterior computations, together with an appreciable gain in speed. This was a challenging model because the multimodality in the full posterior is not well approximated by the surrogate posterior; comparing the top-left facet of Figure 2 with the bottom-right facet illustrates this.

In an additional experiment, where the length of the time series was 10001 (rather than 8001), we observed even greater speed-ups. Under this particular simulated dataset, the target posterior did not exhibit multimodality, which may partially explain the increase. The adaptive SMC methods proposed in this paper perform well in both cases, with and without multimodality.

We have explored several ways of using surrogate likelihoods to improve the efficiency of SMC. In particular, delayed-acceptance within the mutation step, with calibration using the population of particles, and surrogate first annealing were proposed and used to this end.

A prevailing assumption of ours has been that a surrogate likelihood is available for the application at hand. In the absence of a good candidate, it may be convenient to use a variational Bayes approximation as a surrogate posterior (see for example Bishop 2006, Ch. 10). In this case the surrogate first annealing method would be similar to Donnet & Robin (2017), who start with a variational approximation as their initial distribution. Non-parametric surrogate likelihoods could also be considered, such as nearest-neighbour or Gaussian process approximations (see, for example, Sherlock et al. 2017, Drovandi et al. 2018).

A computational aspect of delayed-acceptance, particularly important in SMC, is its effect on parallel computation. The mutation step of SMC is easily parallelised, but delayed-acceptance results in some mutations occurring quickly (first-stage rejection) whilst others take considerably longer. Whilst we did not observe any adverse behaviour in our settings, appropriate scheduling should be considered in parallel implementations.

The SFA method is sensitive to the choice of surrogate temperature, λ, especially if there is a large mismatch between the surrogate and full likelihoods.
In this case, if λ is too low then computational efficiency is lost, but if λ is too high the final posterior can be inaccurate. In the simulation study and the ARFIMA example, λ was chosen by inspecting the surrogate likelihood and determining a value that retained sufficient density to approximately cover the two peaks observed in the full likelihood. In general, more observations call for a lower value of λ. Future work on determining an appropriate λ automatically would be a useful contribution.

The theory contributed by this paper proposes a general framework for choosing tuning parameters in SMC, with a focus on the typically costly mutation step. The tuning-parameter decision is cast as an optimisation problem of cost minimisation, subject to a sufficient quality of diversification. The framework connects with, and generalises, several tuning methods in the SMC literature, allowing them to be used with delayed-acceptance kernels. We have provided evidence that this framework is appropriate for improving computational efficiency with both Metropolis-Hastings and delayed-acceptance kernels in SMC, without burdensome input from the user.

[Figure 2 panels: DA+T+SFA; MH; Surrogate only: γ_T = 1; Surrogate only: γ_T = 0.01; x-axis: φ; y-axis: density.]

Figure 2: Posteriors of φ from 10 replicates of four SMC algorithms. The SMC algorithms using only the surrogate likelihood are annealed to γ_T = 1 and γ_T = 0.01; the latter corresponds to the initial particle set for the surrogate first annealing procedure.

References
Banterle, M., Grazian, C., Lee, A. & Robert, C. P. (2019), 'Accelerating Metropolis-Hastings algorithms by delayed acceptance', Foundations of Data Science (2), 103–128.

Barndorff-Nielsen, O. & Schou, G. (1973), 'On the parametrization of autoregressive models by partial autocorrelations', Journal of Multivariate Analysis (4), 408–419.

Beskos, A., Jasra, A., Kantas, N. & Thiery, A. (2016), 'On the convergence of adaptive sequential Monte Carlo methods', The Annals of Applied Probability (2), 1111–1146.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer, New York.

Brockwell, P. J. & Davis, R. A. (2016), Introduction to Time Series and Forecasting, Springer, Cham, Switzerland.

Caliński, T. & Harabasz, J. (1974), 'A dendrite method for cluster analysis', Communications in Statistics – Theory and Methods (1), 1–27.

Chopin, N. (2002), 'A sequential particle filter method for static models', Biometrika (3), 539–552.

Christen, J. A. & Fox, C. (2005), 'Markov chain Monte Carlo using an approximation', Journal of Computational and Graphical Statistics (4), 795–810.

Conrad, P. R., Marzouk, Y. M., Pillai, N. S. & Smith, A. (2016), 'Accelerating asymptotically exact MCMC for computationally intensive models via local approximations', Journal of the American Statistical Association (516), 1591–1607.

Cui, T., Fox, C. & O'Sullivan, M. (2011), 'Bayesian calibration of a large-scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis Hastings algorithm', Water Resources Research (10).

Del Moral, P., Doucet, A. & Jasra, A. (2006), 'Sequential Monte Carlo samplers', Journal of the Royal Statistical Society: Series B (Statistical Methodology) (3), 411–436.

Donnet, S. & Robin, S. (2017), 'Using deterministic approximations to accelerate SMC for posterior sampling', arXiv preprint arXiv:1707.07971.

Drovandi, C. C., Moores, M. T. & Boys, R. J. (2018), 'Accelerating pseudo-marginal MCMC using Gaussian processes', Computational Statistics & Data Analysis, 1–17.

Drovandi, C. C. & Pettitt, A. N. (2011), 'Likelihood-free Bayesian estimation of multivariate quantile distributions', Computational Statistics & Data Analysis (9), 2541–2556.

Duda, R. O. & Hart, P. E. (1973), Pattern Classification and Scene Analysis, Wiley, New York.

Elf, J. & Ehrenberg, M. (2003), 'Fast evaluation of fluctuations in biochemical networks with the linear noise approximation', Genome Research (11), 2475–2484.

Everitt, R. G. & Rowińska, P. A. (2017), 'Delayed acceptance ABC-SMC', arXiv preprint arXiv:1708.02230.

Fearnhead, P. & Taylor, B. M. (2013), 'An adaptive sequential Monte Carlo sampler', Bayesian Analysis (2), 411–438.

Fox, C. & Nicholls, G. (1997), Sampling conductivity images via MCMC, in K. Mardia, C. Gill & R. Aykroyd, eds, 'The Art and Science of Bayesian Image Analysis', Proceedings of the Leeds Annual Statistical Research Workshop (LASR), Leeds, pp. 91–100.

Friedman, J., Hastie, T. & Tibshirani, R. (2010), 'Regularization paths for generalized linear models via coordinate descent', Journal of Statistical Software (1), 1.

Gilks, W. R. & Berzuini, C. (2001), 'Following a moving target – Monte Carlo inference for dynamic Bayesian models', Journal of the Royal Statistical Society: Series B (Statistical Methodology) (1), 127–146.

Golightly, A., Henderson, D. A. & Sherlock, C. (2015), 'Delayed acceptance particle MCMC for exact inference in stochastic kinetic models', Statistics and Computing (5), 1039–1055.

Granger, C. W. & Joyeux, R. (1980), 'An introduction to long-memory time series models and fractional differencing', Journal of Time Series Analysis (1), 15–29.

Hartigan, J. A. & Wong, M. A. (1979), 'Algorithm AS 136: A k-means clustering algorithm', Journal of the Royal Statistical Society. Series C (Applied Statistics) (1), 100–108.

Hastings, W. (1970), 'Monte Carlo sampling methods using Markov chains and their applications', Biometrika (1), 97–109.

Hennig, C. (2020), fpc: Flexible Procedures for Clustering. R package version 2.2-7. URL: https://CRAN.R-project.org/package=fpc

Jasra, A., Stephens, D. A., Doucet, A. & Tsagaris, T. (2011), 'Inference for Lévy-driven stochastic volatility models via adaptive sequential Monte Carlo', Scandinavian Journal of Statistics (1), 1–22.

Kitagawa, G. (1996), 'Monte Carlo filter and smoother for non-Gaussian nonlinear state space models', Journal of Computational and Graphical Statistics (1), 1–25.

Liu, J. S. & Chen, R. (1998), 'Sequential Monte Carlo methods for dynamic systems', Journal of the American Statistical Association (443), 1032–1044.

Merkle, M. (2010), 'Jensen's inequality for multivariate medians', Journal of Mathematical Analysis and Applications (1), 258–269.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953), 'Equation of state calculations by fast computing machines', The Journal of Chemical Physics (6), 1087–1092.

Pasarica, C. & Gelman, A. (2010), 'Adaptively scaling the Metropolis algorithm using expected squared jumped distance', Statistica Sinica (1), 343–364.

Payne, R. D. & Mallick, B. K. (2018), 'Two-stage Metropolis-Hastings for tall data', Journal of Classification (1), 29–51.

Prangle, D. (2016), 'Lazy ABC', Statistics and Computing (1–2), 171–185.

Quiroz, M., Tran, M.-N., Villani, M. & Kohn, R. (2018), 'Speeding up MCMC by delayed acceptance and data subsampling', Journal of Computational and Graphical Statistics (1), 12–22.

Salomone, R., Quiroz, M., Kohn, R., Villani, M. & Tran, M.-N. (2019), 'Spectral subsampling MCMC for stationary time series', arXiv preprint arXiv:1910.13627.

Salomone, R., South, L. F., Drovandi, C. C. & Kroese, D. P. (2018), 'Unbiased and consistent nested sampling via sequential Monte Carlo', arXiv preprint arXiv:1805.03924.

Sherlock, C., Golightly, A. & Henderson, D. A. (2017), 'Adaptive, delayed-acceptance MCMC for targets with expensive likelihoods', Journal of Computational and Graphical Statistics (2), 434–444.

Sherlock, C., Thiery, A. & Golightly, A. (2015), 'Efficiency of delayed-acceptance random walk Metropolis algorithms', arXiv preprint arXiv:1506.08155.

Solonen, A., Ollinaho, P., Laine, M., Haario, H., Tamminen, J. & Järvinen, H. (2012), 'Efficient MCMC for climate model parameter estimation: Parallel adaptive chains and early rejection', Bayesian Analysis (3), 715–736.

South, L. F., Pettitt, A. N. & Drovandi, C. C. (2019), 'Sequential Monte Carlo samplers with independent Markov chain Monte Carlo proposals', Bayesian Analysis (3), 753–776. URL: https://doi.org/10.1214/18-BA1129

Stathopoulos, V. & Girolami, M. A. (2013), 'Markov chain Monte Carlo inference for Markov jump processes via the linear noise approximation', Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences (1984), 20110541.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society: Series B (Methodological) (1), 267–288.

Whittle, P. (1953), 'Estimation and information in stationary time series', Arkiv för Matematik (5), 423–434.

Wiqvist, S., Picchini, U., Forman, J. L., Lindorff-Larsen, K. & Boomsma, W. (2018), 'Accelerating delayed-acceptance Markov chain Monte Carlo algorithms', arXiv preprint arXiv:1806.05982.

A Appendix
A.1 Proof of Proposition 1
Under the MH cost function, C(k, φ) = k × L_F, and the one-move diversification criterion

{(k, φ) : α^(k)(φ) ≥ p_min},  where α^(k)(φ) = 1 − (1 − α^(1)(φ))^k,

the inequality in the diversification criterion can be rearranged into

k ≥ log(1 − p_min) / log(1 − α^(1)(φ)).

Under this restriction, note that

C(k, φ) ≥ L_F × log(1 − p_min) / log(1 − α^(1)(φ)),

so under these conditions the general problem in (3) is equivalent to

argmin_{φ ∈ Φ} log(1 − p_min) / log(1 − α^(1)(φ)).
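As a worked illustration of this bound (our own numerical example, not from the source): with a one-move acceptance probability of α^(1)(φ) = 0.3 and p_min = 0.9, the criterion requires

k ≥ log(1 − 0.9) / log(1 − 0.3) = log(0.1) / log(0.7) ≈ 6.46,

so the smallest admissible number of mutation cycles is k = 7, at cost C(7, φ) = 7 L_F. A step size with a higher one-move acceptance probability lowers the required k and hence the cost.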
A.2 Proof of Proposition 2

Using the multivariate Jensen inequality for medians in Merkle (2010, Theorem 5.2), we have that

Σ_{s=1}^{k} median{J_s(φ)} ≤ median{ Σ_{s=1}^{k} J_s(φ) },

and, assuming the jumping distances are iid for a given φ, we further reduce this to

k × median{J(φ)} ≤ median{ Σ_{s=1}^{k} J_s(φ) }.  (27)

Defining the set ˜D as

˜D = {(k, φ) ∈ Z⁺ × Φ : ˜D(k, φ) ≥ d},  where ˜D(k, φ) = k × median{J(φ)},

we then see from (27) that ˜D ⊆ D.

A.3 Proof of Proposition 3
Under the MH cost function, C(k, φ) = k × L_F, and the approximate ESJD diversification criterion

{(k, φ) : ˜D(k, φ) ≥ d},  where ˜D(k, φ) = k × median{J(φ)},

the inequality in the diversification criterion can be rearranged into

k ≥ d / median{J(φ)}.

Under this restriction, note that

C(k, φ) ≥ L_F × d / median{J(φ)},

so under these conditions the general problem in (3) is equivalent to

argmin_{φ ∈ Φ} (median{J(φ)})⁻¹ ≡ argmax_{φ ∈ Φ} median{J(φ)}.
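Propositions 1 and 3 suggest a simple tuning loop: pick the step scale with the largest median jumping distance, then take just enough cycles to meet the diversification threshold. A minimal sketch under these assumptions (our own illustrative Python; median_jump would be estimated from pilot mutations of the particle population):

```python
import math

def tune_mutation(median_jump, step_scales, d, L_F):
    """Select the step scale and number of cycles per Proposition 3.

    median_jump: dict mapping each step scale to the median one-move
    squared jumping distance estimated from the particle population.
    """
    # Proposition 3: minimising cost subject to k * median{J} >= d
    # is equivalent to maximising the median jumping distance.
    best = max(step_scales, key=lambda h: median_jump[h])
    k = math.ceil(d / median_jump[best])   # smallest k meeting the criterion
    cost = k * L_F                         # total full-likelihood cost
    return best, k, cost

# Example with hypothetical median-jump estimates:
jumps = {0.1: 0.8, 0.5: 2.1, 1.0: 1.3}
print(tune_mutation(jumps, list(jumps), d=7.29, L_F=1.0))  # (0.5, 4, 4.0)
```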
A.4 Additional figures and tables from simulation
[Table 4: Median (80% interval) multiplicative improvement in computation time relative to MH-SMC (median tuning method) for the simulation study. Columns: Likelihood, Cost ratio (ρ), SFA+DA+T, SFA, DA+T, DA, MH (fixed).]

[Figures 3–6 each show posterior β densities faceted by cost ratio ρ and likelihood (Normal, Student), with one density per SMC algorithm: MH, SFA+DA+T, Approx; x-axis: value; y-axis: posterior density.]

Figure 3: Example of posterior β densities from three SMC algorithms (example 1).

Figure 4: Example of posterior β densities from three SMC algorithms (example 2).

Figure 5: Example of posterior β densities from three SMC algorithms (example 3).

Figure 6: Example of posterior β densities from three SMC algorithms (example 4).
A.5 Additional figures from ARFIMA model example

[Figures 7–10 each show four panels, DA+T+SFA; MH; Surrogate only: γ_T = 1; Surrogate only: γ_T = 0.01, of posterior densities from 10 replicates.]

Figure 7: Posteriors of φ from 10 replicates of the four SMC algorithms. The SMC algorithms using only the surrogate likelihood are annealed to γ_T = 1 and γ_T = 0.01; the latter corresponds to the initial particle set for the surrogate first annealing procedure.

Figure 8: Posteriors of θ from 10 replicates of the four SMC algorithms. The surrogate-only algorithms are annealed as in Figure 7.

Figure 9: Posteriors of d from 10 replicates of the four SMC algorithms. The surrogate-only algorithms are annealed as in Figure 7.

Figure 10: Posteriors of σ² from 10 replicates of the four SMC algorithms. The surrogate-only algorithms are annealed as in Figure 7.