[PDF] A Single SMC Sampler on MPI that Outperforms a Single MCMC Sampler

Abstract

Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms which are primarily used in Bayesian statistics to sample from a target distribution when direct sampling is challenging. Single instances of MCMC methods are widely considered hard to parallelise in a problem-agnostic fashion and hence, unsuitable to meet both constraints of high accuracy and high throughput. Sequential Monte Carlo (SMC) Samplers can address the same problem, but are parallelisable: they share with Particle Filters the same key tasks and bottleneck. Although a rich literature already exists on MCMC methods, SMC Samplers are relatively underexplored, such that no parallel implementation is currently available. In this paper, we first propose a parallel MPI version of the SMC Sampler, including an optimised implementation of the bottleneck, and then compare it with single-core Metropolis-Hastings. The goal is to show that SMC Samplers may be a promising alternative to MCMC methods with high potential for future improvements. We demonstrate that a basic SMC Sampler with 512 cores is up to 85 times faster or up to 8 times more accurate than Metropolis-Hastings.

Full PDF

AA S

INGLE

SMC S

AMPLER ON

MPI

THAT O UTPERFORMS A S INGLE

MCMC S

AMPLER

Alessandro Varsi

Department of Electrical Engineering and ElectronicsUniversity of LiverpoolLiverpool, L69 3GJ, UK

[email protected]

Lykourgos Kekempanos

Department of Electrical Engineering and ElectronicsUniversity of LiverpoolLiverpool, L69 3GJ, UK

[email protected]

Jeyarajan Thiyagalingam

Scientiﬁc Computing DepartmentRutherford Appleton Laboratory, STFCDidcot, Oxfordshire, OX11 0QX, UK [email protected]

Simon Maskell

Department of Electrical Engineering and ElectronicsUniversity of LiverpoolLiverpool, L69 3GJ, UK

[email protected] A BSTRACT

Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms which are primarilyused in Bayesian statistics to sample from a target distribution when direct sampling is challenging.Single instances of MCMC methods are widely considered hard to parallelise in a problem-agnosticfashion and hence, unsuitable to meet both constraints of high accuracy and high throughput. Se-quential Monte Carlo (SMC) Samplers can address the same problem, but are parallelisable: theyshare with Particle Filters the same key tasks and bottleneck. Although a rich literature alreadyexists on MCMC methods, SMC Samplers are relatively underexplored, such that no parallel im-plementation is currently available. In this paper, we ﬁrst propose a parallel MPI version of theSMC Sampler, including an optimised implementation of the bottleneck, and then compare it withsingle-core Metropolis-Hastings. The goal is to show that SMC Samplers may be a promisingalternative to MCMC methods with high potential for future improvements. We demonstrate thata basic SMC Sampler with cores is up to times faster or up to times more accurate thanMetropolis-Hastings. K eywords Distributed memory architectures · Metropolis-Hastings · Message Passing Interface · Parallel SMCSamplers · Particle Filters.

In Bayesian statistics, it is often necessary to collect and compute random samples from a probability distribution.Markov Chain Monte Carlo (MCMC) methods are commonly used to address this problem since direct sampling isoften hard or impossible. Sequential Monte Carlo (SMC) Samplers are a member of the broader class of SMC methods(which also includes Particle Filters) and can be used in the same application domains as MCMC [1]. While manypapers on Particle Filters or MCMC methods exist, SMC Samplers still remain relatively unexplored as a replacementto MCMC.Research has been focused on improving the run-time and accuracy of SMC and MCMC methods to meet the constraintsof modern applications. While accuracy has been improved by several approaches ranging from using better proposaldistributions [2] to better resampling and better recycling [3], to improve the run-time these algorithms need to employparallel computing.Generic MCMC methods are not parallelisable by nature as it is hard for a single Markov chain to be processedsimultaneously by multiple processing elements. In [4], an approach which aims to parallelise a single chain ispresented but it quickly becomes problem-speciﬁc because the efﬁciency of parallelisation is not guaranteed, especiallyfor computationally cheap proposal distributions. We acknowledge that one could implement multiple instances of a r X i v : . [ s t a t . C O ] M a y CMC in parallel as in [5], but argue that we could also apply the same idea to multiple instances of SMC samplers.However, all chains also need to burn-in concurrently, making it difﬁcult to use this approach to reduce the run-time. Inthis paper, we seek to develop a parallel implementation of a single instance of a sampling algorithm which outperformsa single MCMC algorithm both in terms of run-time and accuracy. Therefore, we leave comparisons with multiple-chainMCMC to future work along with comparisons with parallel instances of SMC Samplers.Particle Filters offer inherent parallelism, although an efﬁcient parallelisation is not trivially achievable. The resamplingstep, which is necessary to respond to particle degeneracy [6], is indeed a challenging task to parallelise. This isdue to the problems encountered in parallelising the constituent redistribute step. Initial approaches to performingresampling are explained in [6] [7] and achieve O ( N ) time complexity. In [8], it has been proven that redistribute canbe parallelised by using a divide-and-conquer approach with time complexity equal to O (( log N ) ) . This algorithmhas been optimised and implemented on MapReduce in [9] and then ported to Message Passing Interface (MPI) in [10].Although the time complexity is improved to O (( log N ) ) , it has been shown that at least parallel cores are requiredto outperform the O ( N ) redistribute version when all other steps are parallelised using MPI.No parallel implementation of the SMC Sampler on MPI is currently available, despite its similarities with the ParticleFilter. Hence, the ﬁrst goal of this paper is to show that an MPI implementation of the SMC Sampler can be translatedfrom the MPI Particle Filter in [10] by porting its key components. An optimisation of the redistribute in [10] will alsobe discussed and included in the proposed algorithm. This paper also compares, both in terms of run-time and accuracy,a basic implementation of the SMC Sampler on MPI with an equally simple MCMC method, Metropolis-Hastings [11].By proving that the SMC Sampler can outperform at least one instance of MCMC, the goal is to clear the way forfuture research (which space constraints prohibit exploring extensively herein). That future research can then optimisethe SMC Sampler and compare it with better-performing MCMC methods, such as TMCMC [12] or HMC [13], inthe context of both single and multiple chains (see above). In doing so, optimisations of SMC Samplers may includeimproved L-kernels, proposal distributions and a full comparison of resampling implementations (akin to that donein [14] in the context of a single core).The rest of the paper is organised as follows: in Section 2 we give some information about distributed memoryarchitectures and MPI. In Section 3, we describe Metropolis-Hastings and SMC methods with a focus on similaritiesand differences between Particle Filters and SMC Samplers. In Section 4, we introduce our novel implementationstrategy. In Section 5, we describe and show the results of several exemplary case studies with a view to showingworst-case performance, maximum speed-up and space complexity of our MPI implementation of the SMC Samplerand its performance versus Metropolis-Hastings. In Section 6, we draw our conclusions and give suggestions for futureimprovements. Distributed memory architectures are a type of parallel system which are inherently different from shared memoryarchitectures. In this environment, the memory is distributed over the cores and each core can only directly access itsown private memory. Exchange of information stored in the memory of the other cores is achieved by sending/receivingexplicit messages through a common communication network.The main advantages relative to shared memory architectures include scalable memory and computation capability withthe number of cores and a guarantee of there being no interference when a core accesses its own memory. The maindisadvantage is the cost of communication and consequent data movement. This may affect the speed-up relative to asingle-core.In order to implement the algorithms we discuss in this paper, we use Message Passing Interface (MPI) which is one ofthe most common programming models for distributed memory environments. In this model, the cores are uniquelyidentiﬁed by a rank, connected via communicators and they use Send/Receive communication routines to exchangemessages.

In this section, we provide details about MCMC and SMC methods with a view to showing similarities and differencesbetween Particle Filters and SMC Samplers. The reader is referred to [1], [6] and [11] for further details.

SMC methods apply the Importance Sampling principle to make Bayesian inferences. The main idea consists ofgenerating N statistically independent hypotheses called particles (or samples) at every given iteration t . The populationof particles x t ∈ R N × M is sampled from a user-deﬁned proposal distribution q ( x t | x t − ) such that x t represents thepdf of the state of a dynamic model (in Particle Filters) or samples from a static target posterior distribution (in SMC amplers ). Each particle x it is then assigned to an unnormalised importance weight w it = F ( w it − , x it , x it − ) , suchthat the array of weights w t ∈ R N provides information on which particle best describes the real state of interest.The particles are however subjected to a phenomenon called degeneracy which (within a few iterations) makes allweights but one decrease towards . This is because the variance of the weights is proven to increase at everyiteration [6]. There exist different strategies to tackle degeneracy. The most common is to perform a resamplingstep which repopulates the particles by eliminating the most negligible ones and duplicating the most important ones.Different variants of resampling exist [14] and the chosen methodology is described in detail in Section 3.3. Resamplingis only triggered when it is needed, more precisely when the (approximate) effective sample size N eff = 1 (cid:80) N − i =0 ( ˜w it ) (1)decreases below a certain threshold N ∗ (which is usually set to N ). ˜w t ∈ R N represents the array of the normalisedweights, each of them calculated as follows: ˜ w it = w it (cid:80) N − j =0 w jt (2)At every iteration, estimates are produced as a weighted sum of x t , weighted using ˜w t . A range of different Particle Filter methods exist. This section provides a brief description of Sequential ImportanceResampling (SIR), described by Algorithm 1 in the appendix.Let X t ∈ R M be the current state of the dynamic system that we want to estimate. At every time step t a newmeasurement Y t ∈ R D is collected. In the SIR Filter, the weighted particles are initially drawn from the priordistribution q ( x ) = p ( x ) and then drawn from the proposal distribution as follows: x it ∼ q ( x it | x it − , Y t ) (3)The weights are initially set to /N and then computed as w it = F ( w it − , x it , x it − ) = w it − p ( x it | x it − ) p ( Y t | x it ) q ( x it | x it − , Y t ) (4)The weights are then normalised and used to calculate N eff as in (1). Then resampling is performed if needed. In thelast step, the estimation of the state is given by the weighted mean of the particles. Like MCMC methods, the goal in the SMC Samplers is to draw samples from a static target distribution of interest π t ( x t ) . The algorithm begins by drawing N samples from the initial proposal q ( x ) and giving the i -th sample theweight w i = π ( x i ) q ( x i ) .After the ﬁrst iteration, the samples are drawn from the forward Markov kernel, q t ( x t | x t − ) , while the weights requirebackward Markov kernels L t ( x t − | x t ) as follows: w it = F ( w it − , x it , x it − ) = w it − π t ( x it ) π t ( x it − ) L t ( x it − | x it ) q t ( x it | x it − ) (5)As is the case for Particle Filters, after the importance weights evaluation and normalisation, the resampling step maybe triggered depending on the value of N eff .In the vanilla SMC Sampler, estimates are performed according to the particles in the ﬁnal iteration. The expectedvalue is computed by multiplication of the particles at the ﬁnal iteration T with the corresponding weights. In [3], anovel recycling method is proposed. Instead of considering the particles from the last iteration as providing the outputs,estimates are computed using all particles from all iterations. Using the notation of this paper, estimates are performedas follows: ˆf = (cid:80) Tt =1 f t ˜ c t (cid:80) Tt =1 ˜ c t (6) While it is not discussed here extensively, SMC Samplers can also be conﬁgured to offer improved performance in contextswhere a Particle Filter struggles [15]. here f t is calculated as f t = N − (cid:88) i =0 x it ˜ w it (7)and the normalisation constant is ˜ c t = (cid:90) π ( x t ) d x t ≈ c t = (cid:80) N − i =0 w it (cid:80) N − i =0 w it − (8)Algorithm 2 in the appendix describes the SMC Sampler with the recycling method. The Metropolis-Hastings (MH) algorithm (see Algorithm 3 in the appendix) simulates a Markov chain where, at eachiteration, a new sample, x ∗ , is drawn from a proposal distribution. The new sample is accepted or rejected using theRejection Sampling principle with acceptance probability a = min { , π ( x ∗ ) q ( x | x ∗ ) π ( x ) q ( x ∗ | x ) } . To reduce the dependency on theinitial sample, the ﬁrst (user-deﬁned) τ samples are discharged (burn-in). Algorithms 1 and 2 in the appendix show that SIR Particle Filter and SMC Sampler with recycling share the same keycomponents.Importance Sampling is trivially parallelisable as (3), (4) and (5) are element-wise operations. Hence, ImportanceSampling achieves O (1) time complexity for P = N cores.Expressions (1), (2), (7), and the weighted mean of particles require Sum and then can be easily parallelised by usingReduction. The time complexity of any Reduction operation scales logarithmically with the number of cores.Both algorithms invoke resampling if N eff < N ∗ . Several alternative resampling steps have been proposed and acomparison between them is discussed in [14]. These algorithms solve the problem in O ( N ) operations. The keyidea of these algorithms is to process w t to generate an array of integers called ncopies ∈ Z N whose i -th element, ncopies i , indicates how many times the i -th particle has to be duplicated. It is easy to infer that ncopies has thefollowing property: (cid:88) N − i =0 ncopies i = N (9)In previous work to parallelise Particle Filters described in [6], [8], [9] and [10], Minimum Variance Resampling (MVR),a variant of Systematic Resampling in [14], has always been the preferred resampling scheme. Since this paper is builton the results in [10], MVR will be the only variant of resampling we consider. This algorithm uses Cumulative Sumto calculate the CDF and then it generates ncopies such that the new population of particles has minimum ergodicvariance. After that, it is necessary to perform a task called redistribute which duplicates x it as many times as ncopies i .This task has already been identiﬁed as bottleneck (see [6], [8], [9]) and it will be discussed in detail in the next section.We note that the reset step (which sets all the weights to /N ) after redistribute is trivially parallelised. The redistribute step is necessary to regenerate the population of particles and is a task which all resampling variantshave in common. A naive and mature implementation can be found in [6] [7]. The same is described by Algorithm 5in the appendix and referred to as Sequential Redistribute (S-R) in the rest of the paper. This routine simply iteratesover ncopies and, for the j − th element, it copies x j as many times as ncopies j . Considering that ncopies follows(9), it is easy to infer that S-R achieves O ( N ) time complexity with a very low constant time. However, this algorithmis not trivial to parallelise because the workload among the processors cannot be readily distributed deterministically.This is because ncopies j could be equal to any value between and N . Parallelisation is even more complicated ondistributed memory architectures since one core cannot directly access the memory of the other cores [10].In [8], it has been shown that, by using a top-down divide-and-conquer approach, redistribute can be parallelised.Starting from the root node, the key idea consists of sorting ncopies and moving the particles at every stage of a binarytree. This can be achieved by searching for a particular index called pivot which perfectly divides the node into twobalanced leaves. Once pivot is identiﬁed, the node is split. In order to ﬁnd pivot , Cumulative Sum (whose parallelimplementation runs in O ( log N ) steps [16]) is performed and then pivot is the ﬁrst index where Cumulative Sum isequal to or greater than N . This routine is repeated recursively log N times. Since Bitonic Sort is the chosen parallelsorting algorithm and it is known that its time complexity is equal to O (( log N ) ) with P = N cores, then we caninfer that this redistribute achieves O (( log N ) ) time complexity for the same level of parallelism. Sorting the particlesis required to make sure that the splitting phase can be performed deterministically in O (1) . (8) is equivalent to (14) in [3], albeit with simpliﬁed notation here. n [9], the redistribute algorithm in [8] was improved by making a subtle consideration: the workload can still bedivided deterministically if we perform Bitonic Sort only once. After this single sort, the algorithm moves on toanother top-down routine where we use rotational shifts to shift all particles on the left side of pivot up to the left sideof the node. This way the father node gets split into two balanced leaves. This algorithm is recursively performed O ( log N ) times until the workload is equally distributed across the cores; then S-R is called. Algorithm 7 in theappendix summarises this routine and, in this paper, is described as Bitonic Sort Based Redistribute (B-R). Rotationalshifts are faster than Bitonic Sort as the achieved time complexity is equal to O ( log N ) and, therefore, the overalltime complexity is improved to O (( log N ) ) . In [9], B-R has been implemented on MapReduce and, although it wassigniﬁcantly better than the algorithm in [8], its runtime for cores was up to times worse than a single-core S-R.In [10], B-R has been ported to distributed memory architectures by using MPI and compared to a deterministic MPIimplementation of S-R, in which one core gathers all particles from the other cores, performs S-R locally and scattersback the resulting array. To avoid misunderstanding, we refer to the MPI implementation of S-R in [10] as CentralisedRedistribute (C-R), which is described by Algorithm 6 in the appendix. The results indicate that the scalability of theMPI implementation is improved relative to the scalability achieved using MapReduce because B-R on MPI couldoutperform C-R for at least P = 64 cores. Possible ways to improve B-R are discussed in the next session. In this section, we consider ways to improve redistribute and how an MPI SMC Sampler could be an alternative toMetropolis-Hastings. P r un - t i m e [ s ] SIR Particle Filter: bottleneck analysis - N = 2 , M = 1 , T = 100 SIR Particle FilterRedistributeBitonic Sort

Figure 1: SIR Particle Filter: bottleneck analysis - N = 2 , M = 1 , T = 100 As is observed here and has been discussed elsewhere in the literature [9] [10], the redistribute step is the bottleneck thatcomplicates parallel implementation of Particle Filters. To ensure this is clear, we repeat an experiment from [10] andreport the results of the same SIR Particle Filter with B-R within using N = 2 , T = 100 in Figure 1. The run-timesvs the number of cores P for the entire Particle Filter, the constituent redistribute step and the subset of redistribute thatis taken up with the Bitonic Sort step are given. As we can see, for P > , redistribute always accounts for at least of the total run-time and this proportion increases with P . For the same values of P , we can also observe that BitonicSort is by far the most computationally intensive task within redistribute and hence is the true bottleneck of the ParticleFilter. igure 2: Bitonic Sort & Nearly Sort - Sorting NetworkBitonic Sort is a very fast comparison-based parallel sorting algorithm [17]. This algorithm uses a divide-and-conquerapproach to ﬁrst divide the input sequence into a series of Bitonic sequences . Then the Bitonic sequences are recursivelymerged together until the algorithm returns a single monotonic sorted sequence. A possible sorting network which canbe used is illustrated in Figure 2. Each horizontal wire represents a key, the vertical arrows connect the input keys for acomparison and the direction represents the order of the keys after the comparison has occurred. The coloured blocksrepresent application of Bitonic Merge (blue or red if Merge is called in increasing or decreasing order respectively).It has been proven that, given a generic array of N elements, Bitonic Sort solves the problem in O ( N ( log N ) ) comparisons [17]. Bitonic Sort is, however, suitable to run in parallel by making P cores work on different chunks ofthe input array. In this case, each wire (or groups of wires) in Figure 2 may also represent a core (or the elements thateach core operates on). When P = N , it is easy to infer that the achieved time complexity is equal to O (( log N ) ) .More generically, we can say that for any number of cores P ≤ N the time complexity is equal to O (cid:32) NP (cid:18) log (cid:18) NP (cid:19)(cid:19) + NP ( log P ) (cid:33) (10) NP ( log ( NP )) is the time complexity to perform Bitonic Sort locally before the cores start interacting with each other.This term is deﬁnitely dominant, especially for low values of P . One possible way to improve Bitonic Sort (and byextension redistribute) is to substitute the serial Bitonic Sort algorithm with a better single-core sorting algorithm, aswas been suggested in [10].In the literature, there are plenty of alternatives to Bitonic Sort available. Algorithms such as Quicksort, Mergesortand Heapsort, for example, achieve O ( N log N ) time complexity. Quicksort is on average faster than Mergesort andHeapsort. However, Quicksort’s choice of its pivot can severely inﬂuence the performance: it is known, in fact, thatQuicksort’s worst-case time complexity is O ( N ) . This occurs when the pivot chosen at every iteration is equal toeither the minimum or the maximum of the available keys. Although this case is statistically very rare in several modernapplications, in the case of SMC methods the worst-case scenario is however often encountered: ncopies has to besorted and, since (9) holds, there is a high probability that is picked as Quicksort’s pivot, i.e. a high probability thatthe pivot is the minimum element.Heapsort achieves O ( N log N ) time complexity in all cases except when all keys are equal. In this special althoughrather unlikely case, the time complexity is O ( N ) . However, Mergesort is perfectly deterministic and data-independentand represents a valid alternative to Bitonic Sort which we consider in the experiment in Section 5.1. A Bitonic Sorterwith Mergesort performed locally achieves the following time complexity: O (cid:18) NP log (cid:18) NP (cid:19) + NP ( log P ) (cid:19) (11)We also observe that ncopies is an array of integers. Hence, one could locally use linear time sorting algorithms suchas Counting Sort or Radix Sort (which are both only applicable to arrays of integers). Although Counting Sort hasdeterministic and data-independent time complexity, its space complexity is data-dependent. This is because CountingSort allocates a temporary array with as many elements as max − min + 1 . In the worst-case max = N , min = 0 andsince N could be very high, the temporary array may not ﬁt within the local memory of a single machine. This problem A Bitonic sequence is a sequence of N keys in which the ﬁrst N/ keys are sorted in increasing order, while the last N/ keysare sorted in decreasing order. s shared with C-R and the impact of this issue will be discussed in Section 5.4. On the other hand, Radix Sort is afeasible deterministic solution. However, Radix Sort is data-dependent because its time complexity is actually O ( C · N ) where the constant C is equal to the number of digits of the maximum element (which can be N in the worst-case).Therefore, Radix Sort may be too slow when N is high and its run-time may ﬂuctuate too much as a function of theinput.In summary, we are looking for a parallel sorting algorithm that works with integer numbers and is deterministic anddata-independent with respect to both time and space complexity. While a combination of Bitonic Sort and Mergesortwithin a core achieves these aims, in the next two sessions, we go on to develop an improved strategy that is sufﬁcientfor our needs and does not require sort at all. In [6], sorting was used extensively. In B-R, rotational shifts are used log P times while Bitonic Sort is used only once.This replacement of sort with rotational shift has improved the time complexity (from O (( log N ) ) to O (( log N ) )) .However, it has also led to a more subtle consideration: by observing the input of rotational shifts we can infer that wedo not actually need to perfectly sort the particles to divide the workload deterministically. This condition is alwayssatisﬁed as long as stage by stage the particles that have to be duplicated are separated from those that do not. To makethings more clear we ﬁrst provide the following deﬁnition. Deﬁnition 1

Let g be a sequence of N elements. g is called Nearly-Sorted sequence when it has the following shape: (cid:2) , ..., , g , g , ..., g m − (cid:3) where g i > ∀ i = 0 , , ..., m − and ≤ m ≤ N . g is an ascending Nearly-Sortedsequence if the ﬁrst elements of g are and a descending Nearly-Sorted sequence if the ﬁnal elements are . We can infer that the workload can be divided deterministically if ncopies is a Nearly-Sorted sequence. In B-R, thiscondition is ensured by sorting before the subsequent parts of the redistribute step. While there are single-core sortingalgorithms that achieve O ( N ) time complexity, these algorithms do not satisfy our need for deterministic run-timeand storage. However, it is possible to use a single core nearly-sort for an array of integers with a deterministic anddata-independent approach with O ( N ) time complexity.Algorithm 8 in the appendix, which we have called Sequential Nearly Sort (S-NS), declares two iterators l and r whichrespectively point at the ﬁrst and the last element of ncopies . Step by step, the i -th element of ncopies is consideredand if the value is higher than then the particle is copied to the right end of the output array. If not, it gets copied to theleft end. The output ncopies new will then be an ascending Nearly-Sorted sequence. S-NS requires N iterations of thefor loop, which means that it achieves O ( N ) time complexity or O ( NP ) if we consider that each core owns NP elements.S-NS is, therefore, a very good alternative to Serial Bitonic Sort, Mergesort, Heapsort and Radix Sort. This is because itachieves low time complexity with deterministic and data-independent run-time and space complexity. We want S-NS to be used as part of a parallel algorithm which generates a Nearly-Sorted sequence from a random inputone. We now discuss how to achieve this.

Deﬁnition 2

Let h be a sequence of N elements. h is called a Nearly-Bitonic sequence when it is possible to ﬁnd anindex k which splits h into two monotonic Nearly-Sorted sequences. One could use S-NS and the same sorting network of Bitonic Sort to ﬁrst divide the input into a series of Nearly-Bitonicsequences and then to recursively merge the sequences together until we generate a monotonic Nearly-Sorted sequenceat the last step.We need to adapt Bitonic Merge such that it processes a Nearly-Bitonic sequence and returns a monotonic Nearly-Sortedsequence. We call this algorithm Nearly Merge. Stage-by-stage, one core with MPI rank i is coupled with another corewith MPI rank j . The assumption is that each core owns a Nearly-Sorted sequence of keys such that the combination ofboth is necessarily a Nearly-Bitonic sequence. Stage by stage, the cores call MPI_Sendrecv to exchange their local data.Then they consume a complementary subset of NP elements. Depending on the direction of the arrow in the sortingnetwork (see again Figure 2), one core will start consuming the s ﬁrst and then the positive elements while the othercore will do the opposite. This way, the s will be conﬁned to one end of the output array separated from the positiveelements.Figure 3 illustrates a possible example of Nearly Merge where each core owns keys; the positive elements are paddedwith Xs for brevity. By extension, each core owns exactly NP particles and performs the same amount of writes tomemory. Therefore, Nearly Merge achieves O ( NP ) time complexity just as S-NS does. We can infer that the overalltime complexity for Nearly Sort is equal to O (cid:18) NP + NP ( log P ) (cid:19) (12)This algorithm has asymptotically the same time complexity of Bitonic Sort when P = N , but the time complexity forthe serial algorithm is improved by a factor of O (( log ( NP )) ) . Therefore, we expect this algorithm to outperform both ore i Core jMPI Sendrecv x x x X x X x X x X x X x x x x X x X x x X x X x X x x x X x X x x X x X x X x x x x X x X x X x X x X Figure 3: Nearly Merge - exampleBitonic Sort and a Bitonic Sorter with Mergesort performed locally. By extension, if we exchange Bitonic Sort withNearly Sort in B-R we expect to have better performance. Algorithm 9 in the appendix describes the new routine. Apossible example for N = 16 and P = 4 is shown in Figure 4. From now on, we refer to this algorithm as Nearly SortBased Redistribute (N-R).In SMC methods ncopies is the array to (nearly) sort and each key ncopies i is necessarily coupled to the particle x i ∈ R M . (12) must then be extended to: O (cid:18) M · NP + M · NP ( log P ) (cid:19) (13)We denote that (13) can qualitatively describe the time complexity of N-R and, by extension, the time complexity of anSMC method which uses N-R within. The same conclusions about (12) and (13) can be made about (10) and (11) butthey are left out for brevity. Metropolis-Hastings and the SMC Sampler perform sampling from a target distribution and they both provide an accurateresult for a sufﬁciently large number of iterations. However, the details of the two approaches differ substantially. Aswe have discussed in Section 3, Metropolis-Hastings uses a Markov Chain to generate each sample one by one based onthe history of the previous samples. This approach makes a single instance of Metropolis-Hastings hard to parallelise ina problem-agnostic way. On the other hand, the SMC Sampler is a population-based algorithm where all samples areprocessed independently and concurrently during each iteration.Now let the total simulation-time for each algorithm be ﬁxed to ∆ seconds. After ∆ seconds, Metropolis-Hastings andthe SMC Sampler will have performed T MH and T SMC iterations respectively and provide a solution with a certainroot mean squared error. Since a single Metropolis-Hastings is hard to parallelise, we cannot increase its accuracywithout running the simulation for longer than ∆ seconds. However, a single SMC Sampler can improve its throughputor accuracy by taking advantage of its inherent parallelism. In an ideal world, cores can perform T SMC iterations in ∆2 seconds, but they can also, and most importantly, run T SMC iterations in ∆ seconds. This means they can achievebetter accuracy with the same run-time. By extension, P cores can ideally run P · T SMC iterations in ∆ seconds butthey will achieve a much better accuracy than a single core is capable of.The main goal of this paper is to prove that a P core MPI SMC Sampler can be more accurate over the same run-timethan Metropolis-Hastings when they sample from the same target distribution and use the same proposal. A moreexhaustive explanation with experimental results is provided in Section 5.5.1. earlySort pivotRotationalshiftsSequentialRedistribute x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x Figure 4: Nearly Sort Based Redistribute

In this section, we brieﬂy describe the experiments we make and we analyse the results. Table 1 provides details aboutBarkla and Chadwick, the two platforms we use for the described experiments. Barkla is the preferred cluster for themajority of the experiments as it can provide more resources.

To evaluate the improvements in the bottleneck, we ﬁrst focus on the sorting phase. Then we compare N-R, B-R andC-R. M = 1 in this ﬁrst experiment for brevity. P Q u a li t a t i v e e x p e c t e d r un - t i m e t r e n d Sorting Algorithms: qualitative trend based on Time ComplexitiesNSBSMS + BS (a) Sorting P Q u a li t a t i v e e x p e c t e d r un - t i m e t r e n d Redistribute Algorithms: qualitative trend based on Time ComplexitiesN-RB-RC-R (b) Redistribute

Figure 5: Bottleneck: theoretical run-time trend

As we outlined in Section 4, Bitonic Sort is the slowest task in B-R. In this experiment, we compare three differentdeterministic sorting/nearly sorting algorithms: Bitonic Sort (BS), Bitonic Sort with Mergesort performed locally(BS+MS) and Nearly Sort (NS). These algorithms are compared by passing in input the same two arrays: ncopies and x . ncopies represents the array of the numbers of copies and hence it is an array of integers. It is generated randomlyaccording to (9) by using a Gaussian random generator followed by MVR. x represents an array of single-dimensionparticles and hence it is an array of ﬂoating point numbers which are generated by a Gaussian random generator. he experiments have been run on Barkla for N = 2 , , , particles and increasing numbers of cores P = 1 , , , , ..., . Both N and P must necessarily be equal to power of numbers, due to the constraint of BitonicSort and Nearly Sort. Each experiment has been run times and we report the median of the sampled run-times vs thenumber of cores.As we can see in Figure 6, NS does not scale for up to cores. This trend might seem odd but it can be explainedby analysing the time complexity of NS described by (12). Figure 5a describes the qualitative trend of (10), (11) and(12). When P = 1 , the quasilinear term in (12) is equal to . However, for ≤ P ≤ , the quasilinear term offsets theimprovement associated with the linear term. Theoretically, the run-time should have positive speed-ups for P = 32 . Inthe measured values for N ≥ , this happens when P ≥ or P ≥ cores, depending on N . We associate thisslight discrepancy to the additional communication cost associated with larger numbers of particles.However, the most important result is that NS is signiﬁcantly faster than the other algorithms and especially BS for alow number of cores. Then, when P increases the performance of both algorithms become closer because the timecomplexity of both algorithms is asymptotically equal to O (( log N ) ) , as underlined in Section 4. These resultssuggest that using Bitonic Sort or Nearly Sort results in similar run-time for P ≥ but, using Nearly Sort may leadto signiﬁcant improvements for P < . This means that the crossing point with respect to the run-time of C-R maybe shifted to the left side of the graph, relative to the results in [10].The results for N = 2 keys show that BS and BS+MS stop scaling for a very low value of P . The reasons behind thisresult have required further investigation. For a very low number of keys, the granularity of the pipeline is already ﬁne.In other words, the computation cost is already comparable with the communication cost and using more cores does notprovide any scalability. NS is also affected by the same problem. In addition, the time complexity of Nearly Sort isnecessarily higher than O ( N ) for P ≤ cores. For these two reasons NS always returns negative speed-ups. In this experiment, we use exactly the same strategy described in the previous section, since the required input for N-R,B-R and C-R is the same. The results are shown in Figure 6. The results for redistribute with BS+MS are left out forbrevity.As we expected from theory, for N ≥ N-R is better than B-R overall and much faster for a small number of cores.However, the most important result is that N-R outperforms C-R at the theoretical minimum (which is P = 32 ) forsome values of the dataset size N . This suggests that, as long as we use a parallel redistribute whose time complexityis equal to O (( log N ) ) , we cannot outperform C-R for P < nor have positive speed-up for the same values of P . In order to achieve this goal, a new algorithm with O ( log N ) time complexity is needed. Sorting networks whichachieve the theoretical lower bound have been proposed in [18], [19] which improve the original AKS sorting networkpresented in [20]. These networks can also be rearranged to perform redistribute by substituting the comparators withbalancers. However, they cannot be practically used because each atomic step requires a huge constant time C . Theexact value of C is unknown as it also depends on the network parameters but it seems to be in the order of thousands(e.g. C = 6100 in the best conﬁguration in [18]). It can then be inferred that they cannot outperform O (( log N ) ) sorting networks such as Bitonic Sort for any practical N . In [21] it has been estimated that a hypothetical C = 87 would require N ≥ to make AKS-like sorting networks faster than Bitonic Sort. Therefore, the infeasibility of thisclass of algorithms makes O (( log N ) ) redistribute on distributed memory systems a practical lower bound (althoughit cannot yet be considered a theoretical minimum).For N = 2 , N-R does not scale and does not outperform C-R either. As we outlined in the previous section, for lowvalues of N the granularity is already too ﬁne to observe any speed-up. In this section, we analyse three case studies, two for the Particle Filter and one example of an SMC Sampler. Dependingon the chosen redistribute (N-R, B-R or C-R), we use the acronyms N-PF/B-PF/C-PF for the Particle Filter, and N-SMCS/B-SMCS/C-SMCS for the SMC Sampler. We consider a worst-case scenario which occurs when resamplingis needed at every iteration and the time taken to perform Importance Sampling is small relative to the time taken toperform resampling. This section aims to achieve two goals. The ﬁrst one is to demonstrate that the historic progressmade in developing parallel implementations of Particle Filters can be translated to develop a parallel implementationof an SMC Sampler. The second one is to estimate the worst-case speed-up of our improved algorithm.All experiments in this section are run for the same values of N as were considered in the previous section. Eachrun-time is once again the median of runs, each one representing a simulation of iterations in the worst-casescenario and for increasing number of cores P = 1 , , , ..., . In this experiment, we use Barkla and compare N-PF, B-PF and C-PF. We apply these algorithms to a stochasticvolatility model which describes the evolution of the pound-to-dollar exchange rate between the 1st of October 1980and the 28th of June 1985. This model has been used in [22] to demonstrate the utility of advanced SMC methods, such P -18 -17 -16 -15 -14 -13 -12 m e d i a n r un - t i m e [ s ] MPI NS vs MPI BS vs MPI BS+MS - run-timesNS N =2 BS N =2 BS+MS N =2 (a) NS, BS and BS+MS for N = 2 P -16 -15 -14 -13 -12 -11 -10 -9 m e d i a n r un - t i m e [ s ] N-R vs B-R vs C-R - run-timesN-R N =2 B-R N =2 C-R N =2 (b) N-R, B-R and C-R for N = 2 P -12 -11 -10 -9 -8 -7 -6 m e d i a n r un - t i m e [ s ] MPI NS vs MPI BS vs MPI BS+MS - run-timesNS N =2 BS N =2 BS+MS N =2 (c) NS, BS and BS+MS for N = 2 P -10 -9 -8 -7 m e d i a n r un - t i m e [ s ] N-R vs B-R vs C-R - run-times N-R N =2 B-R N =2 C-R N =2 (d) N-R, B-R and C-R for N = 2 P -5 -4 -3 -2 -1 m e d i a n r un - t i m e [ s ] MPI NS vs MPI BS vs MPI BS+MS - run-timesNS N =2 BS N =2 BS+MS N =2 (e) NS, BS and BS+MS for N = 2 P -4 -3 -2 -1 m e d i a n r un - t i m e [ s ] N-R vs B-R vs C-R - run-times N-R N =2 B-R N =2 C-R N =2 (f) N-R, B-R and C-R for N = 2 P m e d i a n r un - t i m e [ s ] MPI NS vs MPI BS vs MPI BS+MS - run-timesNS N =2 BS N =2 BS+MS N =2 (g) NS, BS and BS+MS for N = 2 P m e d i a n r un - t i m e [ s ] N-R vs B-R vs C-R - run-times N-R N =2 B-R N =2 C-R N =2 (h) N-R, B-R and C-R for N = 2 Figure 6: Bottleneck: median run-times vs P for N = 2 , , , s Block Sampling Particle Filters, over SIR Particle Filters. X t = φX t − + σV t (14a) Y t = β exp (cid:18) X t (cid:19) W t (14b)(14a) and (14b) represent the model where the coefﬁcients φ = 0 . , σ = 0 . , β = 0 . (as selected in [22])and the noise terms for the state and the measurement are V t ∼ N (0 , and W t ∼ N (0 , . The initial state is sampledas X ∼ N (0 , σ − φ ) . The particles are initially drawn from the prior distribution. The importance weights are simplyequal to the likelihood p ( Y t | X t ) since the dynamic model is used as the proposal. This experiment is focused on showing the performance of N-PF, B-PF and C-PF applied to a non-scalar model. Thechosen example is the popular four-dimensional state model for a Bearing-Only Tracking problem, where the state isrepresented by the position and velocity of the tracked object. Both position and velocity are 2-dimensional physicalquantities. This model was previously presented in several publications, such as in [7], and used in [22] to test the BlockSampling SIR Filter. In accordance with [22], we consider the state to be composed of four elements denoted such that X t = (cid:2) X t , X t , X t , X t (cid:3) where X t , X t are position and X t , X t are velocity.The model is deﬁned as follows: X t = A · X t − + V t (15a) Y t = arctan (cid:18) X t X t (cid:19) + W t (15b)where the state transition matrix and the covariance are A =   Σ = 

5∆ 0 00 0  The noise terms are V t ∼ N ( , Σ ) and W t ∼ N (cid:0) , − (cid:1) . The initial state X has the identity matrix as covarianceand mean equal to the true initial simulated point of the system. The parameter ∆ represents the sampling period and itis set to ∆ = 1 s. We apply N-SMCS, B-SMCS and C-SMCS to sample from a static single-dimensional ( M = 1 ) Student’s t posteriordistribution calculated as: π ( x ) = Γ (cid:0) ν +12 (cid:1) Γ (cid:0) ν (cid:1) (cid:112) ( νπ ) (cid:18) ν ( x − µ ) (cid:19) − ν +12 (16)where ν and µ correspond to the degrees of freedom and the mean value respectively.In the experiment, the particles, as samples of x t , at the t -th iteration are generated using random walk as theproposal distribution, x t ∼ N ( x t − , (cid:15) ) . The backward kernel is naively selected to emulate MCMC such that L t ( x t − | x t ) = q t ( x t | x t − ) . We then anticipate that a better choice of L t ( x t − | x t ) can positively impact our estimate.The recycling described in Section 3.1.2 is also used. In this section we provide the results of the experiments described in Sections 5.2.1, 5.2.2 and 5.2.3 which are shown inFigure 7 for N = 2 , , , respectively. Since that N-R, B-R and C-R have the same baseline, we show thespeed-ups instead of the run-times as, in this case, these speed-ups can also prove which algorithm is faster.As we can see, for N = 2 N-PF/N-SMCS does not outperform C-PF/C-SMCS. The reasons are the granularity andthe theoretical time complexity of N-R as we explained in Section 5.1. In contrast, C-PF/C-SMCS can keep scalingfor a relatively low P , until redistribute emerges as the bottleneck. However, since modern applications need a largenumber of particles ( is just ), we are not discouraged by these limitations.For N ≥ , computation becomes dominant over communication and both N-PF/N-SMCS and B-PF/B-SMCS canscale for much larger values of P and can both outperform C-PF/C-SMCS. For this range of N , using Nearly Sortinstead of Bitonic Sort makes the SMC method faster for any number of cores and up to twice as fast as using Bitonic ort for low values of P . The gap between these two approaches also increases with N . N-PF/N-SMCS runs faster thanthe solution with C-R in the range ≤ P ≤ and it can be as much as approximately times faster for P = 512 cores.Overall, we can say that for a ﬁxed N , changing the model in the Particle Filter (i.e. whether we consider stochasticvolatility or Bearing-Only Tracking) or switching from Particle Filters to SMC Samplers gives roughly the sametrend. This proves that the improvements that have been demonstrated in the context of Particle Filters can directly betranslated to the context of SMC Samplers.As we can see, the minimum worst-case speed-up is about and occurs in the context of the Bearing-Only Tracking.On the other hand, the maximum worst-case speed-up can be up to : this occurs in the context of the SMC Sampler.The efﬁciency of N-SMCS with respect to the maximum speed-up is indeed signiﬁcantly higher than for the ParticleFilter. This is due to the different way particles and weights are calculated in the SMC Sampler. For example, (16) ismore computationally intensive than (14a), and the likelihood calculation in the SMC Sampler is more computationallydemanding than the likelihood in the Particle Filters (this is because of the need to compute the L -kernel). Therefore,resampling accounts for a lower percentage of the entire workload in the SMC Sampler than it does in the Particle Filter.The resampling step is no longer such a signiﬁcant bottleneck for a low number of cores. This is discussed in moredetail in the next section. All the previous experiments use relatively simple proposal distributions and likelihoods. However, in real problems,these two tasks are likely to be much more complicated (e.g. they may involve non-linear systems or Partial DifferentialEquations etc.). In the next experiment, we investigate the impact that a more computationally intensive ImportanceSampling step has on the maximum speed-up.In order to simulate this scenario, we adjust the experiment described in Section 5.2.2 by using

D > sensors spreadover the Cartesian plane. This practice is also common in real applications to make the estimate more accurate (sincethe triangulation observability criterion is satisﬁed [23]). The measurement is now a D -dimensional measurementvector: Y t = arctan (cid:18) X t − ˜ y k X t − ˜ x k (cid:19) + W t ∀ k = 1 , ..., D (17)where ( ˜ x k , ˜ y k ) is the position of the k -th sensor with respect to the target. The state equation remains the same as isdescribed in Section 5.2.2 such that M is unchanged. We consider N = 2 . The maximum speed-up efﬁciency foreach D is estimated as the speed-up for P = 512 vs the ideal speed-up for the same P . We increase D until we have atleast efﬁciency. For each D we also report the percentage of the total workload that Importance Sampling accountsfor when the run-time of redistribute is at its peak, i.e. for P = 2 .As we can observe in Figure 8, a more computationally intensive Importance Sampling step leads to higher speed-ups.The speed-up for D = 360 is indeed about . times the speed-up for D = 1 (which corresponds to the experiment inSection 5.2.2). Therefore, in these problems, the bottleneck for a low number of cores is likely to be the ImportanceSampling step and not resampling. However, when P = N , Importance Sampling has O (1) time complexity whileresampling has complexity of O (( log N ) ) . In other words, since resampling always emerges as the bottleneck for asufﬁciently high level of parallelism, it is crucial to use a parallelisable redistribute such that we can achieve near-linearspeed-ups for higher P . N-R and B-R, have both scalable space complexity equal to O ( M · NP ) . However, C-R has constant space complexityequal to O ( M · N ) [10]: one core is in charge of collecting the particles, performing the routine locally and thendistributing the new population back to the other cores.The main side effect is that when the available memory in each node is insufﬁcient to store all the necessary data for P = 1 , we cannot run C-R for any P . In contrast, even for very large values of N , we can always run N-R or B-R aslong as each node has enough memory for its data. In order to show this problem, we repeat the experiment described inSection 5.2.2 on Chadwick (which has a GB memory in each node, i.e. less than Barkla’s

GB per node). Figure9 shows the measured run-times for N = 2 (speed-ups are not provided since the baseline is impossible to run dueto space complexity limitations). The results for N ∈ { , , } are left out for brevity since they resemble theresults in Figure 7. The total absence of a curve for C-PF or the missing points for N-PF occur because of an mpirunabort (which happens when we request more memory than the node has). As we can see we need at least P = 64 coresto run N-PF while it is never possible with C-PF with N = 2 .We can conclude that N-PF/N-SMCS outperforms B-PF/B-SMCS for any P and outperforms C-PF/C-SMCS for P ≥ cores. Furthermore, it is always possible to run N-PF/N-SMCS while C-PF/C-SMCS may be impossible to runfor high N . P s p ee d - u p MPI PF Econometrics: N-PF vs B-PF vs C-PF - speed-upsN-PF N =2 B-PF N =2 C-PF N =2 (a) Econometrics for N = 2 P s p ee d - u p MPI PF Bearing-Only Track.: N-PF vs B-PF vs C-PF - speed-upsN-PF N =2 B-PF N =2 C-PF N =2 (b) Bearing-Only Track. for N = 2 P s p ee d - u p MPI SMC Sampler: N-SMCS vs B-SMCS vs C-SMCS - speed-upsN-SMCS N =2 B-SMCS N =2 C-SMCS N =2 (c) Synthetic SMC Sampler for N = 2 P s p ee d - u p MPI PF Econometrics: N-PF vs B-PF vs C-PF - speed-upsN-PF N =2 B-PF N =2 C-PF N =2 (d) Econometrics for N = 2 P s p ee d - u p MPI PF Bearing-Only Track.: N-PF vs B-PF vs C-PF - speed-upsN-PF N =2 B-PF N =2 C-PF N =2 (e) Bearing-Only Track. for N = 2 P s p ee d - u p MPI SMC Sampler: N-SMCS vs B-SMCS vs C-SMCS - speed-upsN-SMCS N =2 B-SMCS N =2 C-SMCS N =2 (f) Synthetic SMC Sampler for N = 2 P s p ee d - u p MPI PF Econometrics: N-PF vs B-PF vs C-PF - speed-upsN-PF N =2 B-PF N =2 C-PF N =2 (g) Econometrics for N = 2 P s p ee d - u p MPI PF Bearing-Only Track.: N-PF vs B-PF vs C-PF - speed-upsN-PF N =2 B-PF N =2 C-PF N =2 (h) Bearing-Only Track. for N = 2 P s p ee d - u p MPI SMC Sampler: N-SMCS vs B-SMCS vs C-SMCS - speed-upsN-SMCS N =2 B-SMCS N =2 C-SMCS N =2 (i) Synthetic SMC Sampler for N = 2 P s p ee d - u p MPI PF Econometrics: N-PF vs B-PF vs C-PF - speed-upsN-PF N =2 B-PF N =2 C-PF N =2 (j) Econometrics for N = 2 P s p ee d - u p MPI PF Bearing-Only Track.: N-PF vs B-PF vs C-PF - speed-upsN-PF N =2 B-PF N =2 C-PF N =2 (k) Bearing-Only Track. for N = 2 P s p ee d - u p MPI SMC Sampler: N-SMCS vs B-SMCS vs C-SMCS - speed-upsN-SMCS N =2 B-SMCS N =2 C-SMCS N =2 (l) Synthetic SMC Sampler for N = 2 Figure 7: SMC methods: speed-ups vs P for N = 2 , , ,

50 100 150 200 250 300 350 400 D % Speed-up efficiency for increasing D speed-up efficiency, P =512 workload ratio for Importance Sampling, P =2 Figure 8: Multi Sensors: maxspeed-up efﬁciency for increasing D P m e d i a n r un - t i m e [ s ] MPI PF Bearing-Only Track. on Chadwick: N-PF vs C-PF - run-times

N-PF N =2 Figure 9: Bearing-Only Track.for N = 2 on Chadwick P s p ee d - u p MPI SMC Sampler vs MH - inter-task speed-ups N =2 N =2 N =2 Figure 10: N-SMCS vs MH: inter-taskspeed-ups vs P for T SMC = 100

As we have anticipated in Section 4.4, this experiment aims to achieve two goals. We ﬁrst want to prove that a P -coreimplementation of the SMC Sampler can achieve a lower run-time than a single-chain Metropolis-Hastings when bothalgorithms draw the same number of samples in total (see below). Then we want to prove that the extra speed-up that P cores provide can make an SMC Sampler more accurate than Metropolis-Hastings, since as P increases an SMCSampler can perform more iterations over the same time span.The ﬁrst part of the experiment is done by comparing the run-time of both algorithms for the same workload such that: T MH = N × T SMC (18)To investigate the second issue, we primarily need to know the inter-task speed-up SU P which P cores can provide,keeping N ﬁxed. We estimate SU P from the ﬁrst part of the experiment. Then we compare the algorithms in terms ofaccuracy over the same computational time which happens when: T MH = N × T SMC × SU P , P = 1 , , , ... (19)In other words, the SMC Sampler will run for SU P -times more iterations (or less if ≤ SU P ≤ ) over the samerun-time.The number of samples, N , and the number of cores, P , are the same as in the previous experiments and we will use T SMC ∈ { , , } . T MH is constant, independent of P and always picked using (19) for P = 1 . The SMCSampler is once again assessed in the worst-case setting when resampling is needed at every iteration. Figure 10 shows the inter-task speed-up between SMC Sampler and Metropolis-Hastings for the same workload (see(18)), after having set T SMC = 100 . We calculate that a single-core implementation of the SMC Sampler is slightlyslower (typically by ) than Metropolis-Hastings. This means that an SMC Sampler running on a cluster of nodescould be much faster than Metropolis-Hastings. As we can see, for high values of N (which as we have seen lead tolarger speed-ups) the SMC Sampler can be up to times faster than Metropolis-Hastings. P -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 S M C S v s M H a cc u r a c y r a t i o MPI SMC Sampler vs MH accuracy ratio for N =2 T SMC =10 T SMC =10 T SMC =10 (a) N = 2 P -16 -14 -12 -10 -8 -6 -4 -2 S M C S v s M H a cc u r a c y r a t i o MPI SMC Sampler vs MH accuracy ratio for N =2 T SMC =10 T SMC =10 T SMC =10 (b) N = 2 P -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 S M C S v s M H a cc u r a c y r a t i o MPI SMC Sampler vs MH accuracy ratio for N =2 T SMC =10 T SMC =10 T SMC =10 (c) N = 2 Figure 11: SMC Sampler vs Metropolis-Hastings accuracy ratiosIn the second part of the experiment we compare the accuracy (expressed as Root Mean Squared Error (RMSE)) for highspeed-ups when both algorithms run for the same time span (see (19)). To make the run-time equivalent we set SU P to he values shown in Figure 10. Figure 11 shows the accuracy ratios between the algorithms vs P for increasing N or T SMC . The baseline (and numerator of the accuracy ratio) is the accuracy of Metropolis-Hastings (see Table 2). As wecan see, for low values of T SMC , the SMC Sampler does not outperform Metropolis-Hastings as the gap for P = 1 isinitially too big. However, while an SMC Sampler is less accurate, the relative beneﬁt of using Metropolis-Hastingsreduces when P increases. Combinations of bigger values for N or T SMC lead to comparable gains in accuracy when SU P is maximised (see pairs: N = 2 , T SMC = 10 ; N = 2 , T SMC = 10 ; N = 2 , T SMC = 10 ). In the end,for even higher values of T SMC or N the initial gap at the baseline is lower such that, when SU P increases sufﬁciently,the SMC Sampler ﬁnally outperforms Metropolis-Hastings. Figures 11c proves that the RMSE of the SMC Samplercan be up to approximately times lower. If we consider that the standard deviation σ for Metropolis-Hastings scalesas O (1 / √ N ) , we can infer that the ideal improvement in accuracy for P = 512 cores would be: σ P =512 σ P =1 = √ N √ P N = 1 √ ≈ . (20)This would occur only if we could trivially parallelise a single-chain Metropolis-Hastings and observe linear speed-up,which means that the proposed MPI SMC Sampler already achieves approximately efﬁciency with respect tothe ideal scenario. This is an encouraging ﬁnding, especially considering that we have used a simple SMC Sampler.We anticipate that further improvements in accuracy would result from using a more sophisticated L -Kernel, betterrecycling, novel proposal distributions or alternative resampling implementations [24]. The results for N = 2 are notshown. This is because for high values of T SMC , the run-times of Metropolis-Hastings and the SMC Sampler for lowvalues of P exceed the simulation time limit on both clusters (which is set to days).Table 1: Details of the clusters. Name Barkla Chadwick

OS CentOS Linux 7 RHEL 6.10 (Santiago)Number of Nodes 16 8Cores per node 40 16CPU 2 Xeon Gold 6138 2 Xeon(R) E5-2660RAM 384 GB 64 GBMPI Version OpenMPI-1.10.1 OpenMPI-1.5.3Max time per job 72 hours 72 hours

Table 2: Metropolis-Hastings: RMSE (log scale) T SMC N − . − . − . − . − . − . − . − . − . In this paper, we have shown that a parallel implementation of the SMC Sampler on distributed memory architectures isan advantageous alternative to Metropolis-Hastings as it can be up to times faster over the same workload, and up to times more accurate over the same run-time for cores.To get to this position, we have made several advances. An MPI implementation of the SMC Sampler has previouslybeen unavailable but we have proven that it can be produced by porting the key components of the Particle Filter. Thereexist several alternative algorithms to perform the common bottleneck, redistribute, including a state-of-the-art parallelalgorithm and a textbook non-parallelisable implementation. In this paper, we have optimised the parallel algorithm andproven it can outperform the current approach for any number of cores and be up to times faster than the textbookimplementation for a sufﬁciently high degree of parallelism. In addition, we have demonstrated the infeasibility of thenon-parallelisable algorithm for large numbers of particles.The proposed algorithm for cores is times as fast as its serial conﬁguration in the worst case scenario,which occurs when resampling (and redistribute) is needed at every step and, most importantly, when the model isunrealistically simple and hence Importance Sampling has a very fast constant time. More realistic models have highlycomputationally intensive proposal distributions or likelihoods. Under these realistic conditions, we have shown thatthe overall speed-up increases with the workload of Importance Sampling and the maximum recorded speed-up is about for cores. key observation we can make is that the SMC Sampler version we have used is a basic reference version as theL-kernel is equal to the proposal distribution, which is Gaussian; better recycling and resampling are yet to be explored.This means that we still have left signiﬁcant scope for future improvements. A combination of intelligent recycling, amore sophisticated L-Kernel, improved proposal distribution and better resampling may have major impacts on theaccuracy.Another improvement avenue is to speed up the run-time which as we have seen can indirectly improve the accuracytoo. One possible way to achieve this goal is to investigate the beneﬁts of mixing shared memory architectures anddistributed memory architectures. OpenMP is the most common programming model for shared memory architecturesand including OpenMP algorithms within MPI is a routine approach in the high performance computing domain. Datalocality may also lead to alternative and more efﬁcient ways of implementing redistribute. A second environmentthat may lead to further speed-up consists of using the additional computational power that the GPU card within eachmachine provides.Future work will focus on implementing all these improvements and comparing the resulting SMC Sampler withbetter MCMC methods than Metropolis-Hastings. These comparisons must necessarily be made both in the single andmultiple chain contexts. Acknowledgments

This work was supported by the UK EPSRC Doctoral Training Award and by Schlumberger.

Appendices

The following algorithms summarise the routines which are described in detail in Sections 3 and 4 and compared inSection 5 of the main paper. In these algorithms, we use the same notation of the main paper: arrays and matrices are inbold while scalars, such as some input parameters and single elements of one-dimensional arrays, are written in italicfont-style.

SMC methods and Metropolis-Hastings

Algorithms 1, 2 and 3 explain the two considered SMC methods and Metropolis-Hastings.

Algorithm 1

SIR Particle Filter

Input: T , N , N ∗ Output: f t x , w ← Initialisation () , each particle is initially drawn from the prior distribution q ( x ) = p ( x ) and eachweight is initialised to /N for t ← ; t ≤ T ; t ← t + 1 do Y t ← New_Measurement () x t , w t ← Importance_Sampling () , see (3) and (4) in the main paper ˜w t ← Normalise ( w t ) , see (2) in the main paper N eff ← ESS ( ˜w t ) , see (1) in the main paper if N eff < N ∗ then x t , w t ← Resampling ( x t , ˜w t , N ) end if f t ← Mean ( x t , w t ) , calculate the weighted mean of the particles to estimate the real state. end for Resampling and Redistribute

Algorithm 4 depicts the chosen resampling step for our implementation of the Particle Filter and the SMC Sampler.The three considered MPI implementations of the constituent redistribute step are described by Algorithms 6, 7 and 9.These routines make use of Algorithm 5 to redistribute within each core once the workload is balanced (or centralisedas in Algorithm 6). Algorithm 8 explains the single-core Nearly Sort that is used for the MPI Nearly Sort in Algorithm9 which replaces the MPI Bitonic Sort in Algorithm 7. lgorithm 2 SMC sampler with recycling

Input: T , N , N ∗ Output: x t , ˆf x , w ← Initialisation () , x ∼ q ( x ) and each particle is assigned to its initial weight w i = π ( x i ) q ( x i ) for t ← ; t ≤ T ; t ← t + 1 do ˜ c t ← Normalisation_Constant ( w t ) , see (8) in the main paper x t , w t ← Importance_Sampling () , see (5) in the main paper for w t ; x t ∼ q ( x t | x t − ) ˜w t ← Normalise ( w t ) , see (2) in the main paper f t ← Estimate ( x t , w t ) , see (7) in the main paper N eff ← ESS ( ˜w t ) , see (1) in the main paper if N eff < N ∗ then x t , w t ← Resampling ( x t , ˜w t , N ) end if end for ˆf ← Recycling ( f , ˜c , T ) , see (6) in the main paper Algorithm 3

Metropolis-Hastings

Input: T , (cid:15) , Σ Output: x t x ∼ q ( x ) for t ← ; t ≤ T ; t ← t + 1 do x ∗ ∼ N ( x ∗ | x t − , (cid:15) Σ ) , a new sample is drawn from the proposal distribution a = min { , π ( x ∗ ) q ( x t − | x ∗ ) π ( x t − ) q ( x ∗ | x t − ) } , calculate the acceptance probability r ∼ [0 , if a < r then x t = x ∗ , the proposed sample is accepted else x t = x t − , the proposed sample is rejected and the old sample is propagated to the next iteration end if end for References [1] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers.

Journal of the RoyalStatistical Society. Series B (Statistical Methodology) , 68(3):411–436, 2006.[2] Wei Shao, Guangbao Guo, Fanyu Meng, and Shuqin Jia. An efﬁcient proposal distribution for metropolis-hastingsusing a b-splines technique.

Computational Statistics & Data Analysis , 57(1):465 – 478, 2013.[3] T. L. T. Nguyen, F. Septier, G. W. Peters, and Y. Delignon. Efﬁcient sequential monte-carlo samplers for bayesianinference.

IEEE Transactions on Signal Processing , 64(5):1305–1319, March 2016.[4] A. E. Brockwell. Parallel markov chain monte carlo simulation by pre-fetching.

Journal of Computational andGraphical Statistics , 15(1):246–261, 2006.[5] Radu V. Craiu, Jeffrey Rosenthal, and Chao Yang. Learn from thy neighbor: Parallel-chain and regional adaptivemcmc.

Journal of the American Statistical Association , 104(488):1454–1466, 2009.[6] M.S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle ﬁlters for online nonlinear/non-gaussian bayesian tracking.

IEEE Transactions on Signal Processing , 50(2):174–188, 2002.

Algorithm 4

Minimum Variance Resampling

Input: x t , w t , N Output: x t , w t ncopies ← MVR ( w t ) , apply Minimum Variance Resampling to generate ncopies from w t x t ← Redistribute ( N, ncopies , x t ) w t ← Reset ( w t ) , all weights are reset to /N lgorithm 5 Sequential Redistribute (S-R)

Input: N , ncopies , x Output: x new i ← for j ← ; j < N ; j ← j + 1 do for k ← ; k < ncopies j ; k ← k + 1 do x inew ← x i i ← i + 1 end for end forAlgorithm 6 Centralised Redistribute (C-R)

Input: N , ncopies , x , mpi _ rank , P Output: x if mpi _ rank == 0 then

2: Allocate memory for N integers and N particles3: end if

4: The master core (i.e., the core with mpi _ rank = 0 ) collects the N/P elements in ncopies from each core using MPI_Gatherand stores them into tmp _ ncopies

5: The master core collects the

N/P particles in x from each core using MPI_Gather and stores them into tmp _ x if mpi _ rank == 0 then tmp _ x new ← Sequential_Redistribute ( N, tmp _ ncopies , tmp _ x ) , see Algorithm 58: end if

9: The master core scatters tmp _ x new to the other cores using MPI_Scatter. x is used as received buffer10: return x [7] N.J. Gordon, D.J. Salmond, and A.F.M. Smith. Novel approach to nonlinear/non-gaussian bayesian stateestimation. Radar and Signal Processing, IEE Proceedings F , 140:107 – 113, 05 1993.[8] S. Maskell, B. Alun-Jones, and M. Macleod. A single instruction multiple data particle ﬁlter. In

IEEE NonlinearStatistical Signal Processing Workshop , pages 51–54, 2006.[9] Jeyarajan Thiyagalingam, Lykourgos Kekempanos, and Simon Maskell. Mapreduce particle ﬁltering with exactresampling and deterministic runtime.

EURASIP Journal on Advances in Signal Processing , 2017(1):71, Oct2017.[10] A. Varsi, L. Kekempanos, J. Thiyagalingam, and S. Maskell. Parallelising particle ﬁlters with deterministicruntime on distributed memory systems.

IET Conference Proceedings , pages 11–18, 2017.

Algorithm 7

Bitonic Sort Based Redistribute (B-R)

Input:

Node = [ ncopies , x ] , N , P , n = NP Output: x if P > then MPI_Bitonic_Sort ( Node , N, P ) end if procedure D ISTRIBUTE ( Node , N, P, n )2: if N == n then , the workload is now fully balanced as each core has n = NP particles to copy3: x ← Sequential_Redistribute ( n, ncopies , x ) , see Algorithm 54: return x end if csum ← MPI_Cumulative_Sum ( N, P, ncopies ) , MPI_Scan is used between the MPI nodes7: pivot ← Pivot_Calc ( ncopies , csum ) , pivot is the ﬁrst index of csum such that csum pivot ≥ N/ r ← pivot − (cid:0) N − (cid:1) ( Leaf l , Leaf r ) ← MPI_Rotational_Shifts ( Node , r ) , up to log P MPI_Sendrecv since r is expressed as a sum ofpower of numbers.10: Distribute ( Leaf l , N/ , P/ , n ) , the left node becomes a new father node and the size of the problem is halved11: Distribute ( Leaf r , N/ , P/ , n ) , the right node becomes a new father node and the size of the problem is halved12: end procedure lgorithm 8 Sequential Nearly Sort (S-NS)

Input: N , ncopies , x Output: x new , ncopies new l ← , r ← N − for i ← ; i < N ; i ← i + 1 do if ncopies i > then ncopies rnew ← ncopies i x rnew ← x i r ← r − else ncopies lnew ← ncopies i x lnew ← x i l ← l + 1 end if end forAlgorithm 9 Nearly Sort Based Redistribute (N-R)

Input:

Node = [ ncopies , x ] , N , P , n = NP Output: x if P > then MPI_Nearly_Sort ( Node , N, P ) end if procedure D ISTRIBUTE ( Node , N, P, n )2: if N == n then , the workload is now fully balanced as each core has n = NP particles to copy3: x ← Sequential_Redistribute ( n, ncopies , x ) , see Algorithm 54: return x end if csum ← MPI_Cumulative_Sum ( N, P, ncopies ) , MPI_Scan is used between the MPI nodes7: pivot ← Pivot_Calc ( ncopies , csum ) , pivot is the ﬁrst index of csum such that csum pivot ≥ N/ r ← pivot − (cid:0) N − (cid:1) ( Leaf l , Leaf r ) ← MPI_Rotational_Shifts ( Node , r ) , up to log P MPI_Sendrecv since r is expressed as a sum ofpower of numbers.10: Distribute ( Leaf l , N/ , P/ , n ) , the left node becomes a new father node and the size of the problem is halved11: Distribute ( Leaf r , N/ , P/ , n ) , the right node becomes a new father node and the size of the problem is halved12: end procedure [11] Metropolis NS, A.W. Rosenbluth, M.N. Rosenbluth, AH Teller, and E J. Teller. Equation of state calculations byfast computing machines. 21:1087–1092, 01 1953.[12] Jianye Ching and Yi-Chu Chen. Transitional markov chain monte carlo method for bayesian model updating,model class selection, and model averaging. 133, 07 2007.[13] Simon Duane, A.D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid monte carlo. Physics Letters B ,195(2):216 – 222, 1987.[14] J. D. Hol, T. B. Schon, and F. Gustafsson. On resampling algorithms for particle ﬁlters. In , pages 79–82, Sept 2006.[15] S. Maskell. An application of sequential monte carlo samplers: An alternative to particle ﬁlters for non-linearnon-gaussian sequential inference with zero process noise. In ,pages 1–8, May 2012.[16] Richard E. Ladner and Michael J. Fischer. Parallel preﬁx computation.

J. ACM , 27(4):831–838, 1980.[17] Kenneth E. Batcher. Sorting networks and their applications. In

Proceedings of the 1968 Spring Joint ComputernConference (SJCC) , volume 32, pages 307–314. AFIPS Press, 1968.[18] M. S. Paterson. Improved sorting networks with o(logn) depth.

Algorithmica , 5(1):75–92, Jun 1990.[19] J. Seiferas. Sorting networks of logarithmic depth, further simpliﬁed.

Algorithmica , 53(3):374–384, Mar 2009.[20] M. Ajtai, J. Komlós, and E. Szemerédi. An 0(n log n) sorting network. In

Proceedings of the Fifteenth AnnualACM Symposium on Theory of Computing , STOC ’83, pages 1–9, New York, NY, USA, 1983. ACM.

21] S.W.A.H. Baddar and K.E. Batcher.

Designing Sorting Networks: A New Paradigm . SpringerLink : Bücher.Springer New York, 2012.[22] Arnaud Doucet, Mark Briers, and Stéphane Sénécal. Efﬁcient block sampling strategies for sequential montecarlo methods.

Journal of Computational and Graphical Statistics , 15(3):693–711, 2006.[23] Darko Mušicki. Bearings only multi-sensor maneuvering target tracking.

Systems & Control Letters , 57(3):216 –221, 2008.[24] T. Li, M. Bolic, and P. M. Djuric. Resampling methods for particle ﬁltering: Classiﬁcation, implementation, andstrategies.

IEEE Signal Processing Magazine , 32(3):70–86, 2015., 32(3):70–86, 2015.