Anytime Monte Carlo
Lawrence M. Murray (Uppsala University & University of Oxford), Sumeetpal Singh (University of Cambridge & The Alan Turing Institute), Pierre E. Jacob (Harvard University), Anthony Lee (University of Warwick & The Alan Turing Institute)
Abstract
A Monte Carlo algorithm typically simulates some prescribed number of samples, taking some random real time to complete the computations necessary. This work considers the converse: to impose a real-time budget on the computation, so that the number of samples simulated is random. To complicate matters, the real time taken for each simulation may depend on the sample produced, so that the samples themselves are not independent of their number, and a length bias with respect to compute time is apparent. This is especially problematic when a Markov chain Monte Carlo (MCMC) algorithm is used and the final state of the Markov chain—rather than an average over all states—is required. The length bias does not diminish with the compute budget in this case. It occurs, for example, in sequential Monte Carlo (SMC) algorithms. We propose an anytime framework to address the concern, using a continuous-time Markov jump process to study the progress of the computation in real time. We show that the length bias can be eliminated for any MCMC algorithm by using a multiple chain construction. The utility of this construction is demonstrated on a large-scale SMC² implementation, using four billion particles distributed across a cluster of 128 graphics processing units on the Amazon EC2 service. The anytime framework imposes a real-time budget on the MCMC move steps within SMC², ensuring that all processors are simultaneously ready for the resampling step, demonstrably reducing wait times and providing substantial control over the total compute budget.

1 Introduction

Real-time budgets arise in embedded systems, fault-tolerant computing, energy-constrained computing, distributed computing and, potentially, management of cloud computing expenses and the fair computational comparison of methods. Here, we are particularly interested in the development of Monte Carlo algorithms to observe such budgets, as well as the statistical properties—and limitations—of these algorithms.
While the approach has broader applications, the pressing motivation in this work is the deployment of Monte Carlo methods on large-scale distributed computing systems, using real-time budgets to ensure the simultaneous readiness of multiple processors for collective communication, minimising the idle wait time that typically precedes it.

A conceptual solution is found in the anytime algorithm. An anytime algorithm maintains a numerical result at all times, and will improve upon this result if afforded extra time. When interrupted, it can always return the current result. Consider, for example, a greedy optimisation algorithm: its initial result is little more than a guess, but at each step it improves upon that result according to some objective function. If interrupted, it may not return the optimal result, but it should have improved on the initial guess. A conventional Markov chain Monte Carlo (MCMC) algorithm, however, is not anytime if we are interested in the final state of the Markov chain at some interruption time: as will be shown, the distribution of this state is length-biased by compute time, and this bias does not reduce when the algorithm is afforded additional time.

MCMC algorithms are typically run for some long time and, after removing an initial burn-in period, the expectations of some functionals of interest are estimated from the remainder of the chain. The prescription of a real-time budget, $t$, introduces an additional bias in these estimates, as the number of simulations that will have been completed is a random variable, $N(t)$. This bias diminishes in $t$, and for long-run Markov chains may be rendered negligible. In motivating this work, however, we have in mind situations where the final state of the chain is most important, rather than averages over all states. The bias in the final state does not diminish in $t$.
Examples where this may be important include (a) sequential Monte Carlo (SMC), where, after resampling, a small number of local MCMC moves are performed on each particle before the next resampling step, and (b) parallel tempering, where, after swapping between chains, a number of local MCMC moves are performed on each chain before the next swap. In a distributed computing setting, the resampling step of SMC, and the swap step of parallel tempering, require the synchronisation of multiple processes, such that all processors must idle until the slowest completes. By fixing a real-time budget for local MCMC moves, we can reduce this idle time and eliminate the bottleneck, but must ensure that the length bias imposed by the real-time budget is negligible, if not eliminated entirely.

The compute time of a Markov chain depends on exogenous factors such as processor hardware, memory bandwidth, I/O load, network traffic, and other jobs contesting the same processor. But, importantly, it may also depend on the states of the Markov chain. Consider (a) inference for a mixture model where one parameter gives the number of components, and where the time taken to evaluate the likelihood of a data set is proportional to the number of components; (b) a differential model that is numerically integrated forward in time with an adaptive time step, where the number of steps required across any given interval is influenced by parameters; (c) a complex posterior distribution simulated using a pseudomarginal method (Andrieu and Roberts, 2009), where the number of samples required to marginalise latent variables depends on the parameters; (d) a model requiring approximate Bayesian computation (ABC) with a rejection sampling mechanism, where the acceptance rate is higher for parameters with higher likelihood, and so the compute time lower, and vice versa. Even in simple cases there may be a hidden dependency.
Consider, for example, a Metropolis–Hastings sampler where there is some seemingly innocuous book-keeping code, such as for output, to be run when a proposal is accepted, but not when a proposal is rejected. States from which the proposal is more likely to be accepted then have longer expected hold time due to this book-keeping code. In general, we should assume that there is indeed a dependency, and assess whether the resulting bias is appreciable.

The first major contribution of the present work is a framework for converting any existing Monte Carlo algorithm into an anytime Monte Carlo algorithm, by running multiple Markov chains in a particular manner. The framework can be applied in numerous contexts where real-time considerations might be beneficial. The second major contribution is an application in one such context: an SMC algorithm deployed on a large-scale distributed computing system. An anytime treatment is applied to the MCMC move steps that precede resampling, which requires synchronisation between processors, and can be a bottleneck in distributed deployment of the algorithm. The real-time budget reduces wait time at synchronisation, relieves the resampling bottleneck, provides direct control over the compute budget for the most expensive part of the computation, and in doing so provides indirect control over the total compute budget.

Anytime Monte Carlo algorithms have recently garnered some interest. Paige et al. (2014) propose an anytime SMC algorithm called the particle cascade. This transforms the structure of conventional SMC by running particles to completion one by one, with the sufficient statistics of preceding particles used to make birth-death decisions in place of the usual resampling step. To circumvent the sort of real-time pitfalls discussed in this work, a random schedule of work units is used. This requires a central scheduler, so it is not immediately obvious how the particle cascade might be distributed.
In contrast, we propose, in this work, an SMC algorithm with the conventional structure, but including parameter estimation as in SMC² (Chopin et al., 2013), and an anytime treatment of the move step. This facilitates distributed implementation, but provides only indirect control over the total compute budget.

The construction of Monte Carlo estimators under real-time budgets has been considered before. Heidelberger (1988), Glynn and Heidelberger (1990) and Glynn and Heidelberger (1991) suggest a number of estimators for the mean of a random variable after performing independent and identically distributed (iid) simulation for some prescribed length of real time. Bias and variance results are established for each. The validity of their results relies on the exchangeability of simulations conditioned on their number, and does not extend to MCMC algorithms except for the special cases of regenerative Markov chains and perfect simulation. The present work establishes results for MCMC algorithms (for which, of course, iid sampling is a special case) albeit for a different problem definition.

A number of other recent works are relevant. Recent papers have considered the distributed implementation of Gibbs sampling and the implications of asynchronous updates in this context, which involves real time considerations (Terenin et al., 2015; De Sa et al., 2016). As mentioned above, optimisation algorithms already exhibit anytime behaviour and it is natural to consider whether they might be leveraged to develop anytime Monte Carlo algorithms, perhaps in a manner similar to the weighted likelihood bootstrap (Newton and Raftery, 1994). Meeds and Welling (2015) suggest an approach along this vein for approximate Bayesian computation. Beyond Monte Carlo, other methods for probabilistic inference might be considered in an anytime setting. Embedded systems, with organic real-time constraints, yield natural examples (e.g. Ramos and Cozman, 2005).

The remainder of the paper is structured as follows.
Section 2 formalises the problem and the framework of a proposed solution; proofs are deferred to Appendix A. Section 3 uses the framework to establish SMC algorithms with anytime moves, amongst them an algorithm suitable for large-scale distributed computing. Section 4 validates the framework on a simple toy problem, and demonstrates the SMC algorithms on a large-scale distributed computing case study. Section 5 discusses some of the finer points of the results, before Section 6 concludes.

Let $(X_n)_{n=0}^\infty$ be a Markov chain with initial state $X_0$, evolving on a space $\mathcal{X}$, with transition kernel $X_n \mid x_{n-1} \sim \kappa(\mathrm{d}x \mid x_{n-1})$ and target (invariant) distribution $\pi(\mathrm{d}x)$. We do not assume that $\kappa$ has a density (e.g. it may be a Metropolis–Hastings kernel). A computer processor takes some random and positive real time $H_{n-1}$ to complete the computations necessary to transition from $X_{n-1}$ to $X_n$ via $\kappa$. Let $H_{n-1} \mid x_{n-1} \sim \tau(\mathrm{d}h \mid x_{n-1})$ for some probability distribution $\tau$. The cumulative distribution function (cdf) corresponding to $\tau$ is denoted $F_\tau(h \mid x_{n-1})$, and its survival function $\bar{F}_\tau(h \mid x_{n-1}) = 1 - F_\tau(h \mid x_{n-1})$. We assume (1) that $H > \epsilon$ for some $\epsilon > 0$, (2) that $\sup_{x \in \mathcal{X}} \mathbb{E}_\tau[H \mid x] < \infty$, and (3) that $\tau$ is homogeneous in time.

The distribution $\tau$ captures the dependency of hold time on the current state of the Markov chain, as well as on exogenous factors such as contesting jobs that run on the same processor, memory management, I/O, network latency, etc. The first two assumptions seem reasonable: on a computer processor, a computation must take at least one clock cycle, justifying a lower bound on $H$, and be expected to complete, justifying finite expectation. The third assumption, of homogeneity, is more restrictive. Exogenous factors may produce transient effects, such as a contesting process that begins part way through the computation of interest.
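In practice $\tau$ is never written down: it is realised implicitly by timing the kernel. A minimal sketch of this joint simulation of states and hold times (all function names are illustrative, not the paper's code), pairing a random-walk Metropolis kernel for a standard normal target with wall-clock measurement of each hold time:

```python
import math
import random
import time

def mh_step(x, step=1.0):
    """One random-walk Metropolis step targeting N(0, 1)."""
    y = x + random.gauss(0.0, step)
    log_alpha = 0.5 * (x * x - y * y)  # log pi(y) - log pi(x)
    return y if math.log(random.random()) < log_alpha else x

def simulate_with_hold_times(x0, n_steps):
    """Simulate (X_n, H_{n-1}) jointly: the hold time of each state is the
    wall-clock time taken to compute the transition out of it."""
    xs, hs = [x0], []
    for _ in range(n_steps):
        start = time.perf_counter()
        xs.append(mh_step(xs[-1]))
        hs.append(time.perf_counter() - start)  # one draw from tau(dh | x)
    return xs, hs

random.seed(0)
xs, hs = simulate_with_hold_times(0.0, 1000)
```

Recording the elapsed time of each transition in this way is exactly how draws from $\tau$ would be obtained in an implementation, with exogenous factors folded into the measurement.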
We discuss the relaxation of this assumption—as future work—in Section 5.

Besides these assumptions, no particular form is imposed on $\tau$. Importantly, we do not assume that $\tau$ is memoryless. In general, nothing is known about $\tau$ except how to simulate it, precisely by recording the length of time $H_{n-1}$ taken to simulate $X_n \mid x_{n-1} \sim \kappa(\mathrm{d}x \mid x_{n-1})$. In this sense there exists a joint kernel $\kappa(\mathrm{d}x_n, \mathrm{d}h_{n-1} \mid x_{n-1}) = \kappa(\mathrm{d}x_n \mid h_{n-1}, x_{n-1})\,\tau(\mathrm{d}h_{n-1} \mid x_{n-1})$ for which the original kernel $\kappa$ is the marginal over all possible hold times $H_{n-1}$ in transit from $X_{n-1}$ to $X_n$:
$$\kappa(\mathrm{d}x \mid x_{n-1}) = \int_0^\infty \kappa(\mathrm{d}x \mid x_{n-1}, h_{n-1})\,\tau(\mathrm{d}h_{n-1} \mid x_{n-1}).$$

We now construct a real-time semi-Markov jump process to describe the progress of the computation in real time. The states of the process are given by the sequence $(X_n)_{n=0}^\infty$, with associated hold times $(H_n)_{n=0}^\infty$. Define the arrival time of the $n$th state as
$$A_n := \sum_{i=0}^{n-1} H_i,$$
for all $n \geq 1$, with $A_0 = 0$, and a process counting the number of arrivals by time $t$ as
$$N(t) := \sup\{n : A_n \leq t\}.$$
Figure 1 illustrates a realisation of the process.

Now, construct a continuous-time Markov jump process to chart the progress of the simulation in real time. Let $X(t) := X_{N(t)}$ (the interpolated process in Figure 1) and define the lag time elapsed since the last jump as $L(t) := t - A_{N(t)}$. Then $(X, L)(t)$ is a Markov jump process.

The process $(X, L)(t)$ is readily manipulated to establish the following properties. Proofs are given in Appendix A.

Figure 1: A realisation of a Markov chain $(X_n)_{n=0}^\infty$ in real time, with hold times $(H_n)_{n=0}^\infty$ and arrival times $(A_n)_{n=0}^\infty$.

Proposition 1.
The stationary distribution of the Markov jump process $(X, L)(t)$ is
$$\alpha(\mathrm{d}x, \mathrm{d}l) = \frac{\bar{F}_\tau(l \mid x)}{\mathbb{E}_\tau[H]}\,\pi(\mathrm{d}x)\,\mathrm{d}l.$$

Corollary 2.
With respect to $\pi(\mathrm{d}x)$, $\alpha(\mathrm{d}x)$ is length-biased by expected hold time:
$$\alpha(\mathrm{d}x) = \frac{\mathbb{E}_\tau[H \mid x]}{\mathbb{E}_\tau[H]}\,\pi(\mathrm{d}x).$$

The interpretation of these is as follows: in stationarity, the likelihood of a particular $x$ appearing is proportional to the likelihood with which it arises under the original Markov chain, and the expected length of real time for which it holds when it does.

Corollary 3.
Let $(X, L) \sim \alpha$; then the conditional probability distribution of $X$ given $L < \epsilon$ is $\pi$.

Corollary 3 simply recovers the original Markov chain from the Markov jump process. We refer to $\alpha$ as the anytime distribution. It is precisely the stationary distribution of the Markov jump process. The new name is introduced to distinguish it from the stationary distribution of the original Markov chain, which we continue to refer to as the target distribution.

Finally, we state an ergodic theorem from Alsmeyer (1997); see also Alsmeyer (1994). Rather than study the process $(X, L)(t)$, we can equivalently study $(X_n, A_n)_{n=0}^\infty$. Conditioned on the states $(X_n)_{n=0}^\infty$, the hold times $(H_n)_{n=0}^\infty$ are independent. This conditional independence can be exploited to derive ergodic properties of $(X_n, A_n)_{n=0}^\infty$, based on assumed regularity of the driving chain $(X_n)_{n=0}^\infty$.

Proposition 4 (Alsmeyer 1997, Corollary 1). Assume that the Markov chain $(X_n)_{n=0}^\infty$ is Harris recurrent. For a function $g : \mathcal{X} \to \mathbb{R}$ with $\int |g(x)|\,\alpha(\mathrm{d}x) < \infty$,
$$\lim_{t \to \infty} \mathbb{E}\left[g(X(t)) \mid x(0), l(0)\right] = \int g(x)\,\alpha(\mathrm{d}x)$$
for $\pi$-almost all $x(0)$ and all $l(0)$.

The above results establish that, when interrupted at real time $t$, the state of a Monte Carlo computation is distributed according to the anytime distribution, $\alpha$. We wish to establish situations in which it is instead distributed according to the target distribution, $\pi$. This will allow us to draw samples of $\pi$ simply by interrupting the running process at any time $t$, and will form the basis of anytime Monte Carlo algorithms.

Figure 2: Illustration of the multiple chain concept with two Markov chains. At any time, one chain is being simulated (indicated with a dotted line) while one chain is waiting (indicated with a solid line).
When querying the process at some time $t$, it is the state of the waiting chain that is reported, so that the hold times of each chain are the compute times of the other chain. For $K + 1 \geq 2$ chains, at any time one chain is simulated while $K$ chains wait, and when querying the process at some time $t$, the states of all $K$ waiting chains are reported.

Recalling Corollary 2, a sufficient condition to establish $\alpha(\mathrm{d}x) = \pi(\mathrm{d}x)$ is that $\mathbb{E}_\tau[H \mid x] = \mathbb{E}_\tau[H]$, i.e. for the expected hold time to be independent of $X$. For iid sampling, this is trivially the case: we have $\kappa(\mathrm{d}x \mid x_{n-1}) = \pi(\mathrm{d}x)$, the hold time $H_{n-1}$ for $X_{n-1}$ is the time taken to draw $X_n \sim \pi(\mathrm{d}x)$ and so is independent of $X_{n-1}$, and $\mathbb{E}_\tau[H \mid x] = \mathbb{E}_\tau[H]$.

For non-iid sampling, first consider the following change to the Markov kernel:
$$X_n \mid x_{n-2} \sim \kappa(\mathrm{d}x \mid x_{n-2}).$$
That is, each new state $X_n$ depends not on the previous state, $x_{n-1}$, but on one state back again, $x_{n-2}$, so that odd- and even-numbered states are independent. The hold times of the even-numbered states are the compute times of the odd-numbered states and vice versa, so that hold times are independent of states, and the desired property $\mathbb{E}_\tau[H \mid x] = \mathbb{E}_\tau[H]$, and so $\alpha(\mathrm{d}x) = \pi(\mathrm{d}x)$, is achieved.

Another interpretation of this construction is that of running two independent Markov chains, one made up of the even states, the other of the odd states, of the original Markov chain. The construction is then seen to achieve $\alpha(\mathrm{d}x) = \pi(\mathrm{d}x)$ for one of those chains at any time. It may be generalised to multiple chains, where it is seen to achieve $\alpha(\mathrm{d}x) = \pi(\mathrm{d}x)$ for all but one of the chains at any time. As the length bias is isolated to one chain, it can be eliminated simply by discarding the state of that chain.

Formally, suppose that we are simulating $K$ Markov chains, with $K$ a positive integer, plus one extra chain. Denote these $K + 1$ chains $(X_n^{1:K+1})_{n=0}^\infty$. For simplicity, assume that all have the same target distribution $\pi$, kernel $\kappa$, and hold time distribution $\tau$.
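The effect of the construction can be checked numerically. Below is a hedged sketch in virtual time (a two-state toy target with state-dependent compute cost; all names and constants are illustrative, not the paper's experiments): querying a single chain over-represents the slow state, while querying the waiting chain of a two-chain rotation does not.

```python
import random

def kernel(x):
    """Markov kernel on {0, 1} with uniform target: flip with probability 0.7."""
    return 1 - x if random.random() < 0.7 else x

def hold_time(x):
    """Virtual compute time of a transition out of state x (state-dependent)."""
    return 1.0 if x == 0 else 3.0

def held_state_single(t_budget):
    """Single chain: return the state held at virtual time t_budget."""
    x, t = random.randint(0, 1), 0.0
    while t + hold_time(x) <= t_budget:
        t += hold_time(x)
        x = kernel(x)
    return x

def held_state_waiting(t_budget):
    """Two-chain rotation: return the waiting chain's state at t_budget."""
    active, waiting, t = random.randint(0, 1), random.randint(0, 1), 0.0
    while t + hold_time(active) <= t_budget:
        t += hold_time(active)
        active, waiting = waiting, kernel(active)
    return waiting

random.seed(1)
n = 10000
biased = sum(held_state_single(49.5) for _ in range(n)) / n    # near 0.75
unbiased = sum(held_state_waiting(49.5) for _ in range(n)) / n  # near 0.5
```

Under the uniform target the slow state should appear half the time, but the single chain reports it with probability close to $0.5 \cdot 3 / (0.5 \cdot 1 + 0.5 \cdot 3) = 0.75$, exactly the length bias of Corollary 2; the waiting chain of the rotation reports it with probability close to $0.5$.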
The joint target is
$$\Pi(\mathrm{d}x^{1:K+1}) = \prod_{k=1}^{K+1} \pi(\mathrm{d}x^k).$$
The $K + 1$ chains are simulated in rotation: at each step $n$, the new state of chain 1 is simulated via the kernel $\kappa$ from the previous state of chain $K + 1$, and the state of each other chain $k \in \{1, \ldots, K\}$ becomes the state of chain $k + 1$.
The transition can then be written as
$$X_n^{1:K+1} \mid x_{n-1}^{1:K+1} \sim \kappa(\mathrm{d}x_n^1 \mid x_{n-1}^{K+1}) \prod_{k=1}^{K} \delta_{x_{n-1}^k}(\mathrm{d}x_n^{k+1}).$$
As before, this joint Markov chain has an associated joint Markov jump process $(X^{1:K+1}, L)(t)$, where $L(t)$ is the lag time elapsed since the last jump. This joint Markov jump process is readily manipulated to yield the following properties, analogous to the single chain case. Proofs are given in Appendix A.

Proposition 5. The stationary distribution of the Markov jump process $(X^{1:K+1}, L)(t)$ is
$$A(\mathrm{d}x^{1:K+1}, \mathrm{d}l) = \alpha(\mathrm{d}x^{K+1}, \mathrm{d}l) \prod_{k=1}^{K} \pi(\mathrm{d}x^k). \quad (1)$$
This result for the joint process extrapolates Proposition 1. To see this, note that the key ingredients of Proposition 1, $\bar{F}_\tau(l \mid x)$ and $\pi(\mathrm{d}x)$, are now replaced with $\bar{F}_\tau(l \mid x^{K+1})$ and $\Pi(\mathrm{d}x^{1:K+1})$. The hold-time survival function for the product chain is $\bar{F}_\tau(l \mid x^{K+1})$, since it relates to the time taken to execute the kernel $\kappa(\mathrm{d}x_n^1 \mid x_{n-1}^{K+1})$, which depends on the state of chain $K + 1$.

Corollary 6.
With respect to $\Pi(\mathrm{d}x^{1:K+1})$, $A(\mathrm{d}x^{1:K+1})$ is length-biased by expected hold time on the extra state $x^{K+1}$ only:
$$A(\mathrm{d}x^{1:K+1}) = \alpha(\mathrm{d}x^{K+1}) \prod_{k=1}^{K} \pi(\mathrm{d}x^k).$$

Corollary 7.
Let $(X^{1:K+1}, L) \sim A$; then the conditional probability distribution of $X^{1:K+1}$ given $L < \epsilon$ is $\Pi(\mathrm{d}x^{1:K+1})$.

We state without proof that the product chain construction also satisfies the ergodic theorem under the same assumptions as Proposition 4. Numerical validation is given in Section 4.1.

The practical implication of these properties is that any MCMC algorithm running $K + 1$ chains in this manner exhibits anytime behaviour: when interrupted at real time $t$, the state of the extra chain is distributed according to $\alpha$, while the states of the remaining $K$ chains are independently distributed according to $\pi$. The state of the extra chain is simply discarded to eliminate the length bias.

It is straightforward to apply the above framework to design anytime MCMC algorithms. One simply runs $K + 1$ chains as described, interrupting at the real-time budget $t$. The anytime framework is particularly useful within a broader SMC method (Del Moral et al., 2006), where there already exist multiple chains (particles) with which to establish anytime behaviour. We propose appropriate SMC methods in this section.

We are interested in the context of sequential Bayesian inference targeting a posterior distribution $\pi(\mathrm{d}x) = p(\mathrm{d}x \mid y_{1:V})$ for a given data set $y_{1:V}$. For the purposes of SMC, we assume that the target distribution $\pi(\mathrm{d}x)$ admits a density $\pi(x)$ in order to compute importance weights.

Define a sequence of target distributions $\pi_0(\mathrm{d}x) = p(\mathrm{d}x)$ and $\pi_v(\mathrm{d}x) \propto p(\mathrm{d}x \mid y_{1:v})$ for $v = 1, \ldots, V$. The first target equals the prior distribution, while the final target equals the posterior distribution, $\pi_V(\mathrm{d}x) = p(\mathrm{d}x \mid y_{1:V}) = \pi(\mathrm{d}x)$. Each target $\pi_v$ has an associated Markov kernel $\kappa_v$, invariant to that target, which could be defined using an MCMC algorithm.

An SMC algorithm propagates a set of weighted samples (particles) through the sequence of target distributions. At step $v$, the target $\pi_v(\mathrm{d}x)$ is represented by an empirical approximation $\hat{\pi}_v(\mathrm{d}x)$, constructed with $K$ samples $x_v^{1:K}$ and their associated weights $w_v^{1:K}$:
$$\hat{\pi}_v(\mathrm{d}x) = \frac{\sum_{k=1}^{K} w_v^k\,\delta_{x_v^k}(\mathrm{d}x)}{\sum_{k=1}^{K} w_v^k}. \quad (2)$$
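To make the notation concrete, here is a toy end-to-end sketch of the weight–resample–move recursion for a $N(0,1)$ prior with observations $y_v \sim N(x, 1)$, using multinomial resampling and a fixed number of random-walk Metropolis moves per particle. All specifics (target, kernel, step size) are illustrative choices, not the paper's case study:

```python
import math
import random

def log_lik(y, x):
    """log N(y; x, 1), additive constants dropped."""
    return -0.5 * (y - x) ** 2

def mh_move(x, log_target, step=0.5):
    """One random-walk Metropolis step leaving log_target invariant."""
    y = x + random.gauss(0.0, step)
    return y if math.log(random.random()) < log_target(y) - log_target(x) else x

def smc(ys, K=500, n_moves=5):
    xs = [random.gauss(0.0, 1.0) for _ in range(K)]         # draw from prior
    for v in range(1, len(ys) + 1):
        ws = [math.exp(log_lik(ys[v - 1], x)) for x in xs]  # incremental weight
        xs = random.choices(xs, weights=ws, k=K)            # resample
        def log_target(x, seen=ys[:v]):                     # pi_v, unnormalised
            return -0.5 * x * x + sum(log_lik(u, x) for u in seen)
        for _ in range(n_moves):                            # move n_v times
            xs = [mh_move(x, log_target) for x in xs]
    return xs

random.seed(7)
xs = smc([1.0, 1.2, 0.8])
post_mean = sum(xs) / len(xs)
```

With these observations the model is conjugate and the exact posterior is $N(0.75, 0.25)$, so the particle mean should land near 0.75.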
A basic SMC algorithm proceeds as in Algorithm 1.

Algorithm 1: A basic SMC algorithm. Where $k$ appears, the operation is performed for all $k \in \{1, \ldots, K\}$.
1. Initialise $x_0^k \sim \pi_0(\mathrm{d}x)$.
2. For $v = 1, \ldots, V$:
   (a) Set $x_v^k = x_{v-1}^k$ and weight $w_v^k = \pi_v(x_v^k)/\pi_{v-1}(x_v^k) \propto p(y_v \mid x_v^k, y_{1:v-1})$, to form the empirical approximation $\hat{\pi}_v(\mathrm{d}x_v) \approx \pi_v(\mathrm{d}x_v)$.
   (b) Resample $x_v^k \sim \hat{\pi}_v(\mathrm{d}x_v)$.
   (c) Move $x_v^k \sim \kappa_v(\mathrm{d}x_v \mid x_v^k)$ for $n_v$ steps.

An extension of the algorithm concerns the case where the sequence of target distributions requires marginalising over some additional latent variables. In these cases, a pseudomarginal approach (Andrieu and Roberts, 2009) can be adopted, replacing the exact weight computations with unbiased estimates (Fearnhead et al., 2008, 2010). For example, for a state-space model at step $v$, there are $v$ hidden states $z_{1:v} \in \mathcal{Z}^v$ to marginalise over:
$$\pi_v(\mathrm{d}x) = \int_{\mathcal{Z}^v} \pi_v(\mathrm{d}x, \mathrm{d}z_{1:v}).$$
Unbiased estimates of this integral can be obtained by a nested SMC procedure targeting $\pi_v(\mathrm{d}z_{1:v} \mid x)$, leading to the algorithm known as SMC² (Chopin et al., 2013), where the kernels $\kappa_v$ are particle MCMC moves (Andrieu et al., 2010). This is used for the example of Section 4.2.

In the conventional SMC algorithm, it is necessary to choose $n_v$, the number of kernel moves to make per particle in step 2(c). For anytime moves, this is replaced with a real-time budget $t_v$. Move steps are typically the most expensive steps—certainly so for SMC²—with the potential for significant variability in the time taken to move each particle. An anytime treatment provides control over the budget of the move step which, if it is indeed the most expensive step, provides substantial control over the total budget also.

The anytime framework is used as follows. Associated with each target distribution $\pi_v$, and its kernel $\kappa_v$, is a hold time distribution $\tau_v$, and an implied anytime distribution $\alpha_v$.
At step $v$, after resampling, an extra particle and lag $(x_v^{K+1}, l_v)$ are drawn (approximately) from the anytime distribution $\alpha_v$. The real-time Markov jump process $(X_v^{1:K+1}, L_v)(t)$ is then initialised with these particles and lag, and simulated forward until time $t_v$ is reached. The extra chain and lag are then eliminated, and the states of the remaining chains are restored as the $K$ particles $x_v^{1:K}$. The complete algorithm is given in Algorithm 2.

Algorithm 2:
SMC with anytime moves. Where $k$ appears, the operation is performed for all $k \in \{1, \ldots, K\}$.
1. Initialise $x_0^k \sim \pi_0(\mathrm{d}x)$.
2. For $v = 1, \ldots, V$:
   (a) Set $x_v^k = x_{v-1}^k$ and weight $w_v^k = \pi_v(x_v^k)/\pi_{v-1}(x_v^k) \propto p(y_v \mid x_v^k, y_{1:v-1})$, to form the empirical approximation $\hat{\pi}_v(\mathrm{d}x_v) \approx \pi_v(\mathrm{d}x_v)$.
   (b) Resample $x_v^k \sim \hat{\pi}_v(\mathrm{d}x_v)$.
   (c) Draw (approximately) an extra particle and lag $(x_v^{K+1}, l_v) \sim \alpha_v(\mathrm{d}x_v, \mathrm{d}l_v)$. Construct the real-time Markov jump process $(X_v^{1:K+1}, L_v)(0) = (x_v^{1:K+1}, l_v)$ and simulate it forward for some real time $t_v$. Set $x_v^{1:K} = X_v^{1:K}(t_v)$, discarding the extra particle and lag.

By the end of the move step 2(c), as $t_v \to \infty$, the particles $x_v^{1:K+1}$ become distributed according to $A_v$, regardless of their distribution after the resampling step 2(b). This is assured by Proposition 4. After eliminating the extra particle, the remaining $x_v^{1:K}$ are distributed according to $\Pi_v$.

In practice, of course, it is necessary to choose some finite $t_v$ for which the $x_v^{1:K}$ are distributed only approximately according to $\Pi_v$. For any given $t_v$, their divergence in distribution from $\Pi_v$ is minimised by an initialisation as close as possible to $A_v$. We have, already, the first $K$ chains initialised from an empirical approximation of the target, $\hat{\pi}_v$, which is unlikely to be improved upon. We need only consider the extra particle and lag.

An easily-implemented choice is to draw $(x_v^{K+1}, l_v) \sim \hat{\pi}_v(\mathrm{d}x_v)\,\delta_0(\mathrm{d}l_v)$. In practice, this merely involves resampling $K + 1$ rather than $K$ particles in step 2(b), setting $l_v = 0$. An alternative is to draw $(x_v^{K+1}, l_v) \sim \delta_{x_{v-1}^{K+1}}(\mathrm{d}x_v)\,\delta_{l_{v-1}}(\mathrm{d}l_v)$. This resumes the computation of the extra particle that was discarded at step $v - 1$. As $t_{v-1} \to \infty$, it amounts to approximating $\alpha_v$ by $\alpha_{v-1}$, which is sensible if the sequence of anytime distributions changes only slowly.

While the potential to parallelise SMC is widely recognised (see e.g. Lee et al., 2010; Murray, 2015), the resampling step 2(b) in Algorithm 1 is acknowledged as a potential bottleneck in a distributed computing environment of $P$ processors. This is due to collective communication: all processors must synchronise after completing the preceding steps in order for resampling to proceed. Resampling cannot proceed until the slowest among them completes. As this is a maximum among $P$ processors, the expected wait time increases with $P$. Recent work has considered either global pairwise interaction (Murray, 2011; Murray et al., 2016) or limited interaction (Vergé et al., 2015; Whiteley et al., 2016; Lee and Whiteley, 2016) to address this issue. Instead, we propose to preserve collective communication, but to use an anytime move step to ensure the simultaneous readiness of all processors for resampling.

SMC with anytime moves is readily distributed across multiple processors. The $K$ particles are partitioned so that processor $p \in \{1, \ldots, P\}$ has some number of particles, denoted $K_p$, with $\sum_{p=1}^{P} K_p = K$. Each processor can proceed with initialisation, move and weight steps independently of the other processors. After the resampling step, each processor has $K_p$ particles. During the move step, each processor draws its own extra particle and lag from the anytime distribution, giving it $K_p + 1$ chains; the extra is discarded at the end of the move step, leaving $K_p$ again. Collective communication is required for the resampling step, and an appropriate distributed resampling scheme should be used (see e.g. Bolić et al., 2005; Vergé et al., 2015; Lee and Whiteley, 2016).

In the simplest case, all workers have homogeneous hardware and the obvious partition of particles is $K_p = K/P$. For heterogeneous hardware another partition may be set a priori (see e.g. Rosenthal, 2000).
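However the particles are partitioned, the per-processor move step reduces to cycling the local chains through the kernel until a wall-clock deadline, then discarding the chain at the head of the rotation. A cooperative sketch (it checks the clock between moves rather than interrupting one mid-flight; all names and the AR(1) kernel are illustrative):

```python
import random
import time

def anytime_moves(particles, kernel, budget):
    """Move K+1 particles in rotation under a real-time budget.

    On expiry of the budget, the chain at the head of the rotation plays
    the role of the extra chain and is discarded, leaving K particles.
    """
    deadline = time.perf_counter() + budget
    i = 0                                    # chain currently being simulated
    while time.perf_counter() < deadline:
        particles[i] = kernel(particles[i])
        i = (i + 1) % len(particles)
    del particles[i]                         # discard the extra chain
    return particles

random.seed(11)
ps = [random.gauss(0.0, 1.0) for _ in range(9)]   # K = 8, plus one extra
ar1 = lambda x: 0.9 * x + random.gauss(0.0, (1.0 - 0.81) ** 0.5)  # N(0,1)-invariant
out = anytime_moves(ps, ar1, budget=0.05)
```

The number of moves completed is random, but the wall-clock cost is fixed at the budget, which is what makes all processors simultaneously ready for the collective resampling step.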
Note also that with heterogeneous hardware, each processor may have a different compute capability and therefore a different distribution $\tau_v$. For processor $p$, we denote this $\tau_v^p$ and the associated anytime distribution $\alpha_v^p$. This difference between processors is easily accommodated, as the anytime treatment is local to each processor. A distributed SMC algorithm with anytime moves proceeds as in Algorithm 3.

Algorithm 3:
SMC with anytime moves. Where $k$ appears, the operation is performed for all $k \in \{1, \ldots, K_p\}$.
1. On each processor $p$, initialise $x_0^k \sim \pi_0(\mathrm{d}x)$.
2. For $v = 1, \ldots, V$:
   (a) On each processor $p$, set $x_v^k = x_{v-1}^k$ and weight $w_v^k = \pi_v(x_v^k)/\pi_{v-1}(x_v^k) \propto p(y_v \mid x_v^k, y_{1:v-1})$. Collectively, all $K$ particles form the empirical approximation $\hat{\pi}_v(\mathrm{d}x_v) \approx \pi_v(\mathrm{d}x_v)$.
   (b) Collectively resample $x_v^k \sim \hat{\pi}_v(\mathrm{d}x_v)$ and redistribute the particles among processors so that processor $p$ has exactly $K_p$ particles again.
   (c) On each processor $p$, draw (approximately) an extra particle and lag $(x_v^{K_p+1}, l_v) \sim \alpha_v^p(\mathrm{d}x, \mathrm{d}l)$. Construct the real-time process $(X_v^{1:K_p+1}, L_v)(0) = (x_v^{1:K_p+1}, l_v)$ and simulate it forward for some real time $t_v$. Set $x_v^{1:K_p} = X_v^{1:K_p}(t_v)$, discarding the extra particle and lag.

The preceding discussion around the approximate anytime distribution still holds for each processor in isolation: for any given budget $t_v$, to minimise the divergence between the distribution of particles and the target distribution, $\hat{A}_v^p$ should be chosen as close as possible to $A_v^p$.

3.4 Setting the compute budget

We set an overall compute budget for move steps, which we denote $t$, and apportion this into a quota for each move step $v$, which we denote $t_v$ as above. This requires some a priori knowledge of the compute profile for the problem at hand.

Given $\hat{\pi}_{v-1}$, if the compute time necessary to obtain $\hat{\pi}_v$ is constant with respect to $v$, then a suitable quota for the $v$th move step is the obvious $t_v = t/V$. If, instead, the compute time grows linearly in $v$, as is the case for SMC², then we expect the time taken to complete the $v$th step to be proportional to $v + c$ (where the constant $c$ is used to capture overheads). A sensible quota for the $v$th move step is then
$$t_v = \frac{v + c}{\sum_{u=1}^{V}(u + c)}\,t = \frac{2(v + c)}{V(V + 2c + 1)}\,t. \quad (3)$$
For the constant, a default choice might be $c = 0$; higher values shift more time to earlier time steps.

This approximation does neglect some complexities. The use of memory-efficient path storage (Jacob et al., 2015), for example, introduces a time-dependent contribution of $O(K)$ at $v = 1$, increasing to $O(K \log K)$ with $v$ as the ancestry tree grows. Nonetheless, for the example of Section 4.2 we observe, anecdotally, that this partitioning of the time budget produces surprisingly consistent results with respect to the random number of moves completed at each move step $v$.

To reduce the variance in resampling outcomes (Douc and Cappé, 2005), implementations of SMC often use schemes such as systematic, stratified (Kitagawa, 1996) or residual (Liu and Chen, 1998) resampling, rather than the multinomial scheme (Gordon et al., 1993) with which the above algorithms have been introduced. The implementation of these alternative schemes does not necessarily leave the random variables $X_v^1, \ldots, X_v^K$ exchangeable; for example, the offspring of a particle are typically neighbours in the output of systematic or stratified resampling (see e.g. Murray et al., 2016).

Likewise, distributed resampling schemes do not necessarily redistribute particles identically between processors. For example, the implementation in LibBi (Murray, 2015) attempts to minimise the transport of particles between processors, such that the offspring of a parent particle are more likely to remain on the same processor as that particle. This means that the distribution of the $K_p$ particles on each processor may have greater divergence from $\pi_v$ than the distribution of the $K$ particles overall.

In both cases, the effect is that particles are initialised further from the ideal $A_v$. Proposition 4 nonetheless ensures consistency as $t_v \to \infty$. A random permutation of particles may result in a better initialisation, but this can be costly, especially in a distributed implementation where particles must be transported between processors. For a fixed total budget, the time spent permuting may be better spent by increasing $t_v$. The optimal choice is difficult to identify in general; we return to this point in the discussion.
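The quota in equation (3) is a one-liner. A sketch (function name illustrative), using the closed form obtained from $\sum_{u=1}^{V}(u + c) = V(V + 2c + 1)/2$:

```python
def move_budget(v, V, t_total, c=0.0):
    """Quota t_v for the v-th move step when step cost grows like v + c:
    t_v = 2 (v + c) t / (V (V + 2c + 1))."""
    return 2.0 * (v + c) / (V * (V + 2.0 * c + 1.0)) * t_total

# Apportion a 100-unit budget over V = 10 move steps with c = 0.
quotas = [move_budget(v, 10, 100.0) for v in range(1, 11)]
```

The quotas sum to the overall budget $t$, grow linearly in $v$, and a larger $c$ flattens the schedule toward the equal quotas $t/V$.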
This section presents two case studies to empirically investigate the anytime framework and the proposed SMCmethods. The first uses a simple model where real-time behaviour is simulated in order to validate the results ofSection 2. The second considers a Lorenz ’96 state-space model with non-trivial likelihood and compute profile,testing the SMC methods of Section 3 in two real-world computing environments.
Consider the model

X ∼ Gamma(k, θ), H | x ∼ Gamma(x^p/θ, θ),

with shape parameter k, scale parameter θ, and polynomial degree p. The two Gamma distributions correspond to π and τ, respectively. The anytime distribution is

α(dx) = (E_τ[H | x] / E_τ[H]) π(dx) ∝ x^{k+p−1} exp(−x/θ) dx,

which is Gamma(k + p, θ). Of course, in real situations, τ is not known explicitly, and is merely implied by the algorithm used to simulate X. For this first study, however, we assume the explicit form above and simulate virtual hold times. This permits exploration of the real-time effects of polynomial computational complexity in a controlled environment, including constant (p = 0), linear (p = 1), quadratic (p = 2) and cubic (p = 3) complexity.

To construct a Markov chain (X_n) with target distribution Gamma(k, θ), first consider a Markov chain (Z_n) with target distribution N(0, 1) and kernel

Z_n | z_{n−1} ∼ N(ρ z_{n−1}, 1 − ρ²),

where ρ is an autocorrelation parameter. Now define (X_n) as

x_n = F_γ^{−1}(F_φ(z_n); k, θ),

where F_γ^{−1} is the inverse cdf of the Gamma distribution with parameters k and θ, and F_φ is the cdf of the standard normal distribution. By construction, ρ parameterises a Gaussian copula inducing correlation between adjacent elements of (X_n).

For the experiments in this section, we fix the shape k, scale θ and autocorrelation ρ, and consider p ∈ {0, 1, 2, 3}. In all cases Markov chains are initialised from π and simulated for 200 units of virtual time.

We employ the 1-Wasserstein distance to compare distributions. For two univariate distributions µ and ν with associated cdfs F_µ(x) and F_ν(x), the 1-Wasserstein distance d(F_µ, F_ν) can be evaluated as (Shorack and Wellner, 1986, p. 64)

d(F_µ, F_ν) = ∫_{−∞}^{∞} |F_µ(x) − F_ν(x)| dx,

which, for the purposes of this example, is sufficiently approximated by numerical integration. The first distribution will always be the empirical distribution of a set of n samples, its cdf denoted F_n(x). If those samples are distributed according to the second distribution, the distance will go to zero as n increases.

We first validate, empirically, that the anytime distribution is indeed Gamma(k + p, θ) as expected. We simulate n Markov chains. At each integer time we take the state of all n chains to construct an empirical distribution. We then compute the 1-Wasserstein distance between this and the anytime distribution, using the empirical cdf F_n(x) and anytime cdf F_α(x).

Figure 3 plots the results over time. In all cases the distance goes to zero in t, more slowly for larger p. This affirms the theoretical results obtained in Section 2.

We next check the efficacy of the multiple chain strategy in eliminating length bias.
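This validation can be sketched in a few lines (our own minimal version, not the paper's code). For simplicity we use an iid kernel (ρ = 0) with illustrative values k = 2, θ = 1 and p = 1, so that E_τ[H | x] = x and the anytime distribution should be Gamma(3, 1); the 1-Wasserstein distance is approximated by numerical integration of |F_n − F| as in the text.

```python
import bisect, math, random

def erlang_cdf(x, k, theta=1.0):
    # Gamma cdf for integer shape k (Erlang distribution)
    if x <= 0.0:
        return 0.0
    z = x / theta
    return 1.0 - math.exp(-z) * sum(z**m / math.factorial(m) for m in range(k))

def wasserstein1(samples, cdf, grid_max=30.0, step=0.01):
    # numerical integration of |F_n(x) - F(x)| over a truncated grid
    xs, n = sorted(samples), len(samples)
    total, x = 0.0, 0.0
    while x < grid_max:
        f_n = bisect.bisect_right(xs, x) / n  # empirical cdf at x
        total += abs(f_n - cdf(x)) * step
        x += step
    return total

def anytime_state(T, k=2, theta=1.0, rng=random):
    """One virtual-time run: iid draws from Gamma(k, theta), each held for a
    virtual hold time H | x with mean x (p = 1); return the state held at T."""
    t = 0.0
    while True:
        x = sum(rng.expovariate(1.0 / theta) for _ in range(k))  # x ~ Gamma(k, theta)
        t += rng.expovariate(1.0 / x)  # hold time with mean E[H | x] = x
        if t > T:
            return x  # the state whose hold interval covers time T

rng = random.Random(1)
states = [anytime_state(T=50.0, rng=rng) for _ in range(4000)]
d_anytime = wasserstein1(states, lambda x: erlang_cdf(x, 3))  # vs Gamma(k + p, theta)
d_target = wasserstein1(states, lambda x: erlang_cdf(x, 2))   # vs Gamma(k, theta)
print(d_anytime < d_target)  # states are length-biased towards the anytime distribution
```

The sampled states sit close to Gamma(k + p, θ) and visibly far from the target Gamma(k, θ), the length bias at work.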
For each of five values of K + 1, we initialise K + 1 chains from π and simulate them for 200 units of virtual time as before; this is repeated n/(K + 1) times. We consider two cases. In the first, uncorrected, case, we keep the terminal states of all K + 1 chains, giving n samples from which to construct an empirical cdf F_n(x). In the second, corrected, case, we eliminate the extra chain but keep the remaining K, giving nK/(K + 1) samples from which to construct an empirical cdf F_{nK/(K+1)}(x). In both cases we compute the 1-Wasserstein distance between the empirical and target distributions, using the appropriate empirical cdf, and the target cdf F_π(x).
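The multiple chain strategy itself admits a compact virtual-time sketch (again our own, with an iid kernel and illustrative values k = 2, θ = 1, p = 1): a single processor works on K + 1 chains in round-robin, and at the budget the chain whose move is in progress is discarded.

```python
import random

def multi_chain_states(K, T, k=2, theta=1.0, rng=random):
    """Round-robin over K + 1 chains on one processor. Each completed move
    draws a fresh state x ~ Gamma(k, theta) (iid kernel, for simplicity) and
    costs a virtual hold time with mean x (p = 1). Returns the K frozen
    states and, separately, the state of the in-progress extra chain."""
    draw = lambda: sum(rng.expovariate(1.0 / theta) for _ in range(k))
    states = [draw() for _ in range(K + 1)]  # initialise all chains from pi
    t, c = 0.0, 0
    while True:
        t += rng.expovariate(1.0 / states[c])  # time to move chain c onward
        if t > T:
            break  # budget reached while chain c's move is in progress
        states[c] = draw()
        c = (c + 1) % (K + 1)
    rest = [s for i, s in enumerate(states) if i != c]
    return rest, states[c]

rng = random.Random(2)
corrected, biased_extra = [], []
for _ in range(3000):
    rest, extra = multi_chain_states(K=3, T=100.0, rng=rng)
    corrected += rest
    biased_extra.append(extra)
mean = lambda xs: sum(xs) / len(xs)
print(mean(corrected), mean(biased_extra))
```

With these values the target mean is kθ = 2 while the anytime mean is (k + p)θ = 3: the discarded in-progress chain carries the length bias, and the remaining K chains estimate the target mean without it.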
Figure 3: Convergence of Markov chains to the anytime distribution for the simulation study, with constant (p = 0), linear (p = 1), quadratic (p = 2) and cubic (p = 3) expected hold time. Each plot shows the evolution of the 1-Wasserstein distance between the anytime distribution and the empirical distribution of the n independent Markov chains initialised from the target distribution.

Figure 4 plots the results over time for both the uncorrected (top) and corrected (bottom) cases. For the uncorrected case, the 1-Wasserstein distance between the empirical distribution and target distribution does not converge to zero. Neither does it become arbitrarily bad: the distance is due to one of the K + 1 chains being distributed according to α and not π, the influence of which decreases as K increases. For the corrected case, where the extra chain is eliminated, the distance converges to zero in time. This confirms the efficacy of the multiple chain strategy in yielding an anytime distribution equal to the target distribution.

Consider a stochastic extension of the deterministic Lorenz '96 (Lorenz, 2006) model described by the stochastic differential equation (SDE)

dX_d = (X_{d−1}(X_{d+1} − X_{d−2}) − X_d + F) dt + σ dW_d,   (4)

with parameter F, constant σ, state vector X(t) ∈ R^D and Wiener process vector W(t) ∈ R^D, with elements of those vectors indexed cyclically by subscripts (i.e. X_{d−D} ≡ X_d ≡ X_{d+D}). The SDE may be equivalently interpreted in the Itô or Stratonovich sense, as the noise term is additive (Kloeden and Platen, 1992, p. 157). The observation model is given by

Y_d(t) ∼ N(x_d(t), ς²).

We fix D, σ and ς, and set a prior on the parameter F and initial conditions X(0) of

F ∼ U([0, 7]), X_d(0) ∼ N(0, σ²).

The SDE can be approximately decomposed into a deterministic drift component given by the ordinary differential equation (ODE)

dx_d/dt = x_{d−1}(x_{d+1} − x_{d−2}) − x_d + F,

and a diffusion component given by the Wiener process. On a fixed time step ∆t, the drift component is first simulated using an appropriate numerical scheme for ODEs. Then, a Wiener process increment ∆W_d ∼ N(0, ∆t) is simulated and added to the result.
This numerical scheme yields a result similar to that of Euler–Maruyama for the original SDE but, for the drift, substitutes the usual first-order Euler method with a higher-order Runge–Kutta method. This is advantageous in low-noise regimes where σ is close to zero, as here. In such cases the dynamics are drift-dominated and can benefit from the higher-order scheme (see e.g. Milstein and Tretyakov, 2004, ch. 3).

Figure 4: Correction of length bias for the simulation study, using five values of K + 1 chains (light to dark), with constant (p = 0), linear (p = 1), quadratic (p = 2) and cubic (p = 3) expected hold time. Each plot shows the evolution of the 1-Wasserstein distance between the empirical and target distributions. On the top row, the states of all chains contribute to the empirical distribution, which does not converge to the target. On the bottom row, the state of the extra chain is eliminated, so that only the remaining states contribute to the empirical distribution, which does converge to the target.

The RK4(3)5[2R+]C algorithm of Kennedy et al. (2000) is used to simulate the drift. This provides a fourth-order solution to the ODE with an embedded third-order solution for error estimates. Adaptive step-size adjustment is then used as in Hairer et al. (1993). The complete method is implemented (Murray, 2012) on a graphics processing unit (GPU) as in the LibBi software (Murray, 2015).

The Lorenz '96 model exhibits intricate qualitative behaviours that depend on the parameter F. These range from decay, to periodicity, to chaos and back again (Figure 5, top row). With adaptive step-size adjustment, the number of steps required to simulate trajectories within given error bounds generally increases with F, so that compute time does also.

We produce a data set by fixing F and simulating observations Y(t) every 0.4 time units. This gives 100 observations in total. We then use SMC to attempt to recover the correct posterior distribution over F given this data set. This is non-trivial: this particular value of F is in a region where the qualitative behaviour of the Lorenz '96 model appears to switch frequently, in F, between periodic and chaotic regimes (Figure 5, top right), suggesting that the marginal likelihood may not be smooth in the region of the posterior, and inference may be difficult.

The marginal likelihood p(y | F) cannot be computed exactly, but it can be unbiasedly estimated with SMC.
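The drift/diffusion split described above can be sketched as follows. This is a minimal fixed-step version for illustration only: classic RK4 stands in for the low-storage RK4(3)5[2R+]C scheme, there is no adaptive step size, and the values D = 8, F = 5, σ = 10⁻³ and ∆t = 0.005 are our own illustrative choices, not those of the paper.

```python
import math, random

def lorenz96_drift(x, F):
    # Lorenz '96 drift with cyclic indexing: x[d-1] * (x[d+1] - x[d-2]) - x[d] + F
    D = len(x)
    return [x[(d - 1) % D] * (x[(d + 1) % D] - x[(d - 2) % D]) - x[d] + F
            for d in range(D)]

def step(x, F, sigma, dt, rng=random):
    """One time step: classic RK4 on the deterministic drift, then an
    additive Wiener increment N(0, dt) scaled by sigma."""
    add = lambda a, b, s: [ai + s * bi for ai, bi in zip(a, b)]
    k1 = lorenz96_drift(x, F)
    k2 = lorenz96_drift(add(x, k1, dt / 2), F)
    k3 = lorenz96_drift(add(x, k2, dt / 2), F)
    k4 = lorenz96_drift(add(x, k3, dt), F)
    drift = [x[d] + dt / 6 * (k1[d] + 2 * k2[d] + 2 * k3[d] + k4[d])
             for d in range(len(x))]
    return [xd + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0) for xd in drift]

rng = random.Random(0)
x = [rng.gauss(0.0, 0.01) for _ in range(8)]  # D = 8, initial conditions near zero
for _ in range(1000):
    x = step(x, F=5.0, sigma=1e-3, dt=0.005, rng=rng)
print(all(math.isfinite(xd) for xd in x))  # prints True: trajectory stays bounded
```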
For each value of F on a regular grid, we run SMC to estimate the marginal likelihood. These estimates are shown in the middle left of Figure 5. The likelihood is clearly multi-modal, and the estimator heteroskedastic. Nevertheless, the variance in the estimator is tolerable in the region of the posterior distribution (middle right of Figure 5), suggesting that F can be recovered. The real time taken to compute these estimates is shown in the lower plots of Figure 5. The computations were performed in a random order through the grid points on F so as to decorrelate F with any transient exogenous effects on compute time. There appears, in fact, to have been some such effect: note the dotted line of points above the bulk on each plot, suggesting that a subset of runs have been slowed. This is most likely due to a contesting process on the shared server on which these computations were run. As expected, compute time tends to increase in F (after an initial plateau where other factors dominate). Furthermore, variance appears to increase with F in the higher regions.

We now run SMC² using the LibBi software on two platforms:

1. A shared-memory machine with 8 GPUs, each with 1536 cores, for approximately 12000-way parallelism, using approximately one billion particles overall. This is a shared machine where contestation from other jobs is expected.

2. A distributed-memory cluster on the Amazon EC2 service, with 128 GPUs, each with 1536 cores, for approximately 200000-way parallelism, using approximately four billion particles overall. This is a dedicated cluster where contestation from other jobs is not expected.

In order to obtain a more repeatable comparison between conventional SMC² and SMC² with anytime moves, we choose to use the same number of samples of X(t) for all time steps, rather than adapting this in time as recommended in Chopin et al. (2011). For the same reason, we resample at all steps rather than use an adaptive trigger. With anytime moves, the extra particle and lag are drawn as (x^{K+1}_v, l_v) ∼ π̂_v(dx_v) δ_0(dl_v).

We first run conventional SMC², making n_v = 10 moves per particle at each step v. We then run SMC² with anytime moves, prescribing a total budget for move steps of 60 minutes for the 8 GPU configuration, and 5 minutes for the 128 GPU configuration, apportioned as in (3).

The results of the 128 GPU runs are given in Figure 6. Both methods recover a posterior distribution concentrated around the value of F used to simulate the data set. Wait times for the conventional method are far more pronounced in the 8 GPU case, where a contesting process on one processor has encumbered the entire computation. The anytime move step grants a robustness to this contesting process, and wait times are significantly reduced. For the 128 GPU case, even in the absence of such exogenous problems, wait times are noticeably reduced.

The framework presented is a generic means by which any MCMC algorithm (including iid sampling as a special case) can be made an anytime Monte Carlo algorithm. This facilitates the configuration of Monte Carlo computation in real-time terms, rather than in the number of simulations. The benefits of this have been demonstrated in a distributed computing context where, by setting real-time compute budgets, wait times are significantly reduced for an SMC algorithm that requires collective communication. The framework has potential applications elsewhere, for example as a foundation for real-time, fault-tolerant and energy-constrained Monte Carlo algorithms, for the management of cloud computing budgets, or for the fair computational comparison of methods.

We have assumed throughout that an algorithm is given to simulate the target distribution π, and that the anytime distribution α is merely a consequence of this. The aim has then been to correct the length bias in α. This is a pragmatic approach, as it leverages existing Monte Carlo algorithms. A tantalising alternative is to develop algorithms that, from the outset, yield π as the anytime distribution. This might be done with an underlying Markov chain that targets something other than π but that, by design, yields π once length biased.
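The benefit of a real-time budget for move steps can be illustrated with a simple virtual-time sketch (the per-move costs here are hypothetical, not measurements from the paper): with a fixed number of moves, every processor waits for the slowest; with a shared deadline, all stop together.

```python
def conventional(costs_per_move, n_moves):
    # each processor makes exactly n_moves; all must wait for the slowest
    finish = [c * n_moves for c in costs_per_move]
    deadline = max(finish)
    return [deadline - f for f in finish]  # per-processor wait times

def anytime(costs_per_move, budget):
    # each processor makes as many complete moves as fit in the budget;
    # the move in progress at the deadline is discarded (extra-chain trick),
    # so no wait exceeds the cost of a single move
    waits = []
    for c in costs_per_move:
        completed = int(budget // c)
        waits.append(budget - completed * c)
    return waits

costs = [1.0, 1.0, 1.0, 5.0]  # processor 4 is slowed by a contesting job
print(conventional(costs, n_moves=10))  # → [40.0, 40.0, 40.0, 0.0]
print(anytime(costs, budget=50.0))      # → [0.0, 0.0, 0.0, 0.0]
```

The conventional scheme leaves three of the four processors idle for most of the step, whereas under the budget every processor is ready for the collective resampling at the same moment.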
We expect, however, that to do this even approximately will require at least some knowledge of τ, which will restrict its applicability to specific cases only.

The proposed SMC² method uses anytime move steps, but is not a complete anytime algorithm, as it does not provide control over the total compute budget. Its objective is to minimise the wait time that precedes resampling steps in a distributed implementation of SMC, and on this account it succeeds.

Figure 5: Elucidating the Lorenz '96 model. The left column shows the range F ∈ [0, 7] as in the uniform prior distribution, while the right column shows a tighter range in the vicinity of the posterior distribution. The solid vertical lines indicate the value of F used to simulate the data set. The first row depicts simulated trajectories of X(t) for various values of F. Each column is a density plot for a particular value of F; darker for higher density values, scaled so that the mode is black. Note the intricate behaviours of decay, periodicity and chaos induced by F. The second row depicts estimates of the marginal log-likelihood of the simulated data set for the same values of F, using SMC. Multiple modes and heteroskedasticity are apparent. The third row depicts the compute time taken to obtain these estimates, showing increasing compute time in F after an initial plateau.

Figure 6: Posterior distributions over F for the Lorenz '96 case study. On the left, from conventional SMC²; on the right, from SMC² with anytime moves, both running on the 128 GPU configuration.
Figure 7: Compute profiles for the Lorenz '96 case study. On the left is a conventional distributed SMC method with a fixed number of moves per particle after resampling. On the right is distributed SMC with anytime move steps. Each row represents the activity of a single processor over time: light grey while active and dark grey while waiting. The top profiles are for an 8 GPU shared system where contesting processes are expected. The conventional method on the left exhibits significant idle time on processors 2 to 8 due to a contesting job on processor 1. The two bottom profiles are for the 128 GPU configuration with no contesting processes. Wait time in the conventional methods on the left is significantly reduced in the anytime methods on the right. Total execution times are not intended to be compared.

A complete anytime SMC algorithm (of a conventional structure) will require, in addition, anytime weighting and anytime resampling steps, as well as the apportioning of the total compute budget between these. Because approximations may appear in each of these steps, the apportioning is not straightforward, and will involve tradeoffs. As already identified, for example, the redistribution of particles after resampling in a distributed environment is an expensive operation, and all or part of that time may be better invested in the budget allocation for anytime moves. Such investigations have been left to future work. An alternative means to an anytime SMC algorithm is to use a different structure to the conventional, as in the particle cascade (Paige et al., 2014). Whatever the structure, these anytime algorithms are somewhat more elaborate than the standard SMC algorithms for which theoretical results have been established, and may warrant further study.

Finally, we return to the strongest of the assumptions of the whole framework: that of the homogeneity of τ in time. This may be unrealistic in the presence of transient exogenous factors, such as intermittent network issues, or contesting processes running on the same hardware only temporarily. If the assumption is relaxed, so that τ varies in time, the anytime distribution will vary as well, and ergodicity will not hold. Figure 4 suggests that, for example, an exogenous switching factor in τ would induce transient effects in the anytime distribution that are not necessarily eliminated by the multiple chain strategy. There may be weaker assumptions under which comparable results and appropriate methods can be established, but this investigation is left to future work.

This work has presented an approach that allows any MCMC algorithm to be made an anytime Monte Carlo algorithm, eliminating the length bias associated with real-time budgets. This is particularly important in situations where the final state of a Markov chain is more important than computing averages over all states. It has applications in embedded, distributed, cloud, real-time and fault-tolerant computing. To demonstrate the usefulness of the approach, a new SMC method has been presented, which exhibits significantly reduced wait time when run on a large-scale distributed computing system.

Supplementary materials
The methods introduced in this paper are implemented in LibBi version 1.3.0, and the empirical results may be reproduced with the LibBi Anytime package. Both are available from .

Acknowledgements
The authors would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme Monte Carlo Methods for Complex Inference Problems (MCMW01), where some of this work was undertaken. This work was also financially supported by an EPSRC-Cambridge Big Data Travel Grant and EPSRC (EP/K020153/1), The Alan Turing Institute under the EPSRC grant EP/N510129/1, and the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE.

References
G. Alsmeyer. On the Markov renewal theorem. Stochastic Processes and their Applications, 50:37–56, 1994.

G. Alsmeyer. The Markov renewal theorem and related results. Markov Processes and Related Fields, 3:103–127, 1997.

C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725, 2009. doi: 10.1214/07-AOS574.

C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 72:269–302, 2010. doi: 10.1111/j.1467-9868.2009.00736.x.

M. Bolić, P. M. Djurić, and S. Hong. Resampling algorithms and architectures for distributed particle filters. IEEE Transactions on Signal Processing, 53:2442–2450, 2005. doi: 10.1109/TSP.2005.849185.

N. Chopin, P. Jacob, and O. Papaspiliopoulos. SMC²: A sequential Monte Carlo algorithm with particle Markov chain Monte Carlo updates. 2011.

N. Chopin, P. Jacob, and O. Papaspiliopoulos. SMC²: An efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society B, 75:397–426, 2013. doi: 10.1111/j.1467-9868.2012.01046.x.

C. De Sa, K. Olukotun, and C. Ré. Ensuring rapid mixing and low bias for asynchronous Gibbs sampling. 2016.

P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B, 68:411–436, 2006. doi: 10.1111/j.1467-9868.2006.00553.x.

R. Douc and O. Cappé. Comparison of resampling schemes for particle filtering. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA 2005), pages 64–69, 2005.

P. Fearnhead, O. Papaspiliopoulos, and G. O. Roberts. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society B, 70:755–777, 2008. doi: 10.1111/j.1467-9868.2008.00661.x.

P. Fearnhead, O. Papaspiliopoulos, G. O. Roberts, and A. Stuart. Random-weight particle filtering of continuous time processes. Journal of the Royal Statistical Society B, 72(4):497–512, 2010. doi: 10.1111/j.1467-9868.2010.00744.x.

P. W. Glynn and P. Heidelberger. Bias properties of budget constraint simulations. Operations Research, 38(5):801–814, 1990.

P. W. Glynn and P. Heidelberger. Analysis of parallel replicated simulations under a completion time constraint. ACM Transactions on Modeling and Computer Simulations, 1(1):3–23, 1991.

N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings-F, 140:107–113, 1993. doi: 10.1049/ip-f-2.1993.0015.

E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer–Verlag, 2nd edition, 1993.

P. Heidelberger. Discrete event simulations and parallel processing: Statistical properties. SIAM Journal on Scientific and Statistical Computing, 9(6):1114–1132, 1988. doi: 10.1137/0909077.

P. E. Jacob, L. M. Murray, and S. Rubenthaler. Path storage in the particle filter. Statistics and Computing, 25(2):487–496, 2015. doi: 10.1007/s11222-013-9445-x.

C. A. Kennedy, M. H. Carpenter, and R. M. Lewis. Low-storage, explicit Runge–Kutta schemes for the compressible Navier–Stokes equations. Applied Numerical Mathematics, 35:177–219, 2000.

G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5:1–25, 1996. doi: 10.2307/1390750.

P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer–Verlag, 1992.

A. Lee and N. Whiteley. Forest resampling for distributed sequential Monte Carlo. Statistical Analysis and Data Mining: The ASA Data Science Journal, 9(4):230–248, 2016. doi: 10.1002/sam.11280.

A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics, 19:769–789, 2010. doi: 10.1198/jcgs.2010.10039.

J. S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93:1032–1044, 1998.

E. N. Lorenz. Predictability of Weather and Climate, chapter 3: Predictability – A Problem Partly Solved, pages 40–58. Cambridge University Press, 2006. doi: 10.1017/CBO9780511617652.004.

T. Meeds and M. Welling. Optimization Monte Carlo: Efficient and embarrassingly parallel likelihood-free inference. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2080–2088. Curran Associates, Inc., 2015.

G. N. Milstein and M. V. Tretyakov. Stochastic Numerics for Mathematical Physics. Scientific Computation. Springer–Verlag, 2004. doi: 10.1007/978-3-662-10063-9.

L. M. Murray. GPU acceleration of the particle filter: The Metropolis resampler. In DMMD: Distributed machine learning and sparse representation with massive data sets, 2011. URL http://arxiv.org/abs/1202.6163.

L. M. Murray. GPU acceleration of Runge–Kutta integrators. IEEE Transactions on Parallel and Distributed Systems, 23:94–101, 2012. doi: 10.1109/TPDS.2011.61.

L. M. Murray. Bayesian state-space modelling on high-performance hardware using LibBi. Journal of Statistical Software, 67(10):1–36, 2015. doi: 10.18637/jss.v067.i10.

L. M. Murray, A. Lee, and P. E. Jacob. Parallel resampling in the particle filter. Journal of Computational and Graphical Statistics, 25(3):789–805, 2016. doi: 10.1080/10618600.2015.1062015.

M. A. Newton and A. E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society B, 56(1):3–48, 1994.

B. Paige, F. Wood, A. Doucet, and Y. W. Teh. Asynchronous anytime sequential Monte Carlo. In Advances in Neural Information Processing Systems 27, pages 3410–3418. 2014.

F. T. Ramos and F. G. Cozman. Anytime anyspace probabilistic inference. International Journal of Approximate Reasoning, 38(1):53–80, 2005. doi: 10.1016/j.ijar.2004.04.001.

J. S. Rosenthal. Parallel computing and Monte Carlo algorithms. Far East Journal of Theoretical Statistics, 4:207–236, 2000.

G. R. Shorack and J. A. Wellner. Empirical Processes with Applications to Statistics. Wiley, New York, 1986.

A. Terenin, D. Simpson, and D. Draper. Asynchronous Gibbs sampling. 2015. URL https://arxiv.org/abs/1509.08999.

C. Vergé, C. Dubarry, P. Del Moral, and E. Moulines. On parallel implementation of sequential Monte Carlo methods: the island particle model. Statistics and Computing, 25(2):243–260, 2015. doi: 10.1007/s11222-013-9429-x.

N. Whiteley, A. Lee, and K. Heine. On the role of interaction in sequential Monte Carlo algorithms. Bernoulli, 22(1):494–529, 2016. doi: 10.3150/14-BEJ666.
A Proofs
Recall the assumptions that H > ε for some ε > 0, that sup_{x ∈ X} E_τ[H | x] < ∞, and that τ is homogeneous in time. For any positive ∆ ≤ ε, define the notation

x := x(t), l := l(t), x_+ := x(t + ∆), l_+ := l(t + ∆).

At most one jump can occur in any time interval (t, t + ∆], and it may occur at any time in that interval. This implies that either l_+ = l + ∆ − h with l_+ ∈ [0, ∆) if a jump occurs, or l_+ = l + ∆ if no jump occurs.

To simplify the proof of Proposition 1 somewhat, we further assume that τ admits a density, although this is not strictly necessary. The below proofs are similar if one wishes to adopt discrete hold times; assuming, for example, that jumps can only occur at the end of a processor's clock cycle. Here we assume that jumps can occur at any real time, and that we may query the state of the process at any real time.

Proof of Proposition 1.
The transition kernel is

λ_∆(dx_+, dl_+ | x, l) = ρ(x, l) λ_1(dx_+, dl_+ | x, l) + (1 − ρ(x, l)) λ_0(dx_+, dl_+ | x, l),

where

ρ(x, l) = (F_τ(l + ∆ | x) − F_τ(l | x)) / F̄_τ(l | x)

is the probability that a jump occurs,

λ_1(dx_+, dl_+ | x, l) = κ(dx_+ | x, l + ∆ − l_+) τ(l + ∆ − l_+ | x) I_[0,∆)(dl_+) / (F_τ(l + ∆ | x) − F_τ(l | x))

is the transition if a jump occurs, and

λ_0(dx_+, dl_+ | x, l) = δ_x(dx_+) δ_{l+∆}(dl_+)

is the transition if no jump occurs. We have

∫_X ∫_0^∞ λ_∆(dx_+, dl_+ | x, l) α(dx, dl) = ∫_X ∫_0^∞ ρ(x, l) λ_1(dx_+, dl_+ | x, l) α(dx, dl)   (i)
  + ∫_X ∫_0^∞ (1 − ρ(x, l)) λ_0(dx_+, dl_+ | x, l) α(dx, dl)   (ii).

To demonstrate that α(dx, dl) is the invariant distribution, it is sufficient to show that the above equals α(dx_+, dl_+). For the first term,

(i) = (I_[0,∆)(dl_+) / E_τ[H]) ∫_X ∫_0^∞ κ(dx_+ | x, l + ∆ − l_+) τ(l + ∆ − l_+ | x) dl π(dx)
  = (I_[0,∆)(dl_+) / E_τ[H]) ∫_X ∫_0^∞ κ(dx_+ | x, l) τ(l | x) dl π(dx)   (since ∆ − l_+ ∈ (0, ∆] and F_τ(∆ | x) = 0)
  = (I_[0,∆)(dl_+) / E_τ[H]) ∫_X κ(dx_+ | x) π(dx)
  = (I_[0,∆)(dl_+) / E_τ[H]) π(dx_+)
  = I_[0,∆)(l_+) α(dx_+, dl_+),   (since F̄_τ(l_+ | x_+) = 1 for l_+ ∈ [0, ∆))

and for the second,

(ii) = I_[∆,∞)(dl_+) (1 − ρ(x_+, l_+ − ∆)) α(dx_+, dl_+ − ∆)
  = I_[∆,∞)(dl_+) (F̄_τ(l_+ | x_+) / F̄_τ(l_+ − ∆ | x_+)) (F̄_τ(l_+ − ∆ | x_+) / E_τ[H]) π(dx_+)
  = I_[∆,∞)(dl_+) α(dx_+, dl_+),

so that (i) + (ii) = α(dx_+, dl_+). □

Proof of Corollary 2.
We have

α(dx) = (∫_0^∞ F̄_τ(l | x) dl / E_τ[H]) π(dx) = (E_τ[H | x] / E_τ[H]) π(dx). □

Proof of Corollary 3.
We have

α(dx, L < ∆) = (∫_0^∆ F̄_τ(l | x) dl / E_τ[H]) π(dx) = (∆ / E_τ[H]) π(dx)   (since F̄_τ(l | x) = 1 for l ∈ [0, ∆)),
α(L < ∆) = ∆ / E_τ[H],
α(dx | L < ∆) = α(dx, L < ∆) / α(L < ∆) = π(dx). □

Proof of Proposition 5. The transition kernel is

ν_∆(dx^{1:K+1}_+, dl_+ | x^{1:K+1}, l) = ρ(x^{K+1}, l) ν_1(dx^{1:K+1}_+, dl_+ | x^{1:K+1}, l) + (1 − ρ(x^{K+1}, l)) ν_0(dx^{1:K+1}_+, dl_+ | x^{1:K+1}, l),

where

ν_1(dx^{1:K+1}_+, dl_+ | x^{1:K+1}, l) = λ_1(dx^1_+, dl_+ | x^{K+1}, l) ∏_{k=1}^K δ_{x^k}(dx^{k+1}_+)

is the transition if a jump occurs, and

ν_0(dx^{1:K+1}_+, dl_+ | x^{1:K+1}, l) = λ_0(dx^{K+1}_+, dl_+ | x^{K+1}, l) ∏_{k=1}^K δ_{x^k}(dx^k_+)

is the transition if a jump does not occur. We have

∫_{X^{K+1}} ∫_0^∞ ν_∆(dx^{1:K+1}_+, dl_+ | x^{1:K+1}, l) A(dx^{1:K+1}, dl) = ∫_{X^{K+1}} ∫_0^∞ ρ(x^{K+1}, l) ν_1(dx^{1:K+1}_+, dl_+ | x^{1:K+1}, l) A(dx^{1:K+1}, dl)   (i)
  + ∫_{X^{K+1}} ∫_0^∞ (1 − ρ(x^{K+1}, l)) ν_0(dx^{1:K+1}_+, dl_+ | x^{1:K+1}, l) A(dx^{1:K+1}, dl)   (ii).

To demonstrate that A(dx^{1:K+1}, dl) is the invariant distribution, it is sufficient to show that the above equals A(dx^{1:K+1}_+, dl_+). For the first term,

(i) = (∫_X ∫_0^∞ ρ(x^{K+1}, l) λ_1(dx^1_+, dl_+ | x^{K+1}, l) α(dx^{K+1}, dl)) ∏_{k=1}^K ∫_X δ_{x^k}(dx^{k+1}_+) π(dx^k)
  = I_[0,∆)(l_+) α(dx^1_+, dl_+) ∏_{k=1}^K π(dx^{k+1}_+)   (as in the proof of Proposition 1)
  = I_[0,∆)(l_+) α(dx^{K+1}_+, dl_+) ∏_{k=1}^K π(dx^k_+)   (by Corollary 3)
  = I_[0,∆)(l_+) A(dx^{1:K+1}_+, dl_+),

and for the second,

(ii) = (∫_X ∫_0^∞ (1 − ρ(x^{K+1}, l)) λ_0(dx^{K+1}_+, dl_+ | x^{K+1}, l) α(dx^{K+1}, dl)) ∏_{k=1}^K ∫_X δ_{x^k}(dx^k_+) π(dx^k)
  = I_[∆,∞)(l_+) α(dx^{K+1}_+, dl_+) ∏_{k=1}^K π(dx^k_+)   (as in the proof of Proposition 1)
  = I_[∆,∞)(l_+) A(dx^{1:K+1}_+, dl_+),

so that (i) + (ii) = A(dx^{1:K+1}_+, dl_+). □
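As a numerical sanity check of Corollary 2 (our own sketch, using the Gamma hold-time model of Section 4.1 with illustrative values k = 2, θ = 1, p = 1): since E_τ[H | x] = x^p, the mean of α should be (k + p)θ = 3, with E_τ[H] = kθ = 2.

```python
import math

k, theta, p = 2, 1.0, 1.0  # illustrative values

def pi_density(x):
    # Gamma(k, theta) density, the target pi
    return x**(k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta**k)

def expected_hold(x):
    # E_tau[H | x] = x**p for the Gamma hold-time model of Section 4.1
    return x**p

# alpha(dx) = (E_tau[H | x] / E_tau[H]) pi(dx); compute E_tau[H] and the mean
# of alpha by simple quadrature on (0, 50]
step = 1e-3
xs = [i * step for i in range(1, 50_000)]
e_h = sum(expected_hold(x) * pi_density(x) * step for x in xs)
alpha_mean = sum(x * expected_hold(x) * pi_density(x) / e_h * step for x in xs)
print(round(e_h, 3), round(alpha_mean, 3))  # ≈ 2.0 and 3.0, as Corollary 2 predicts
```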