vSMC: Parallel Sequential Monte Carlo in C++
Yan Zhou
University of Warwick
Abstract
Sequential Monte Carlo is a family of algorithms for sampling from a sequence of distributions. Some of these algorithms, such as particle filters, are widely used in physics and signal processing research. More recent developments have established their application in more general inference problems such as Bayesian modeling. These algorithms have attracted considerable attention in recent years as they admit natural and scalable parallelization. However, they are perceived to be difficult to implement. In addition, parallel programming is often unfamiliar to many researchers, though conceptually appealing, especially in fields related to sequential Monte Carlo. A C++ template library is presented for the purpose of implementing general sequential Monte Carlo algorithms on parallel hardware. Two examples are presented: a simple particle filter and a classic Bayesian modeling problem.
Keywords: Monte Carlo, particle filter, C++, template generic programming.
1. Introduction
Sequential Monte Carlo (SMC) methods are a class of sampling algorithms that combine importance sampling and resampling. They have been primarily used as "particle filters" to solve optimal filtering problems; see, for example, Cappé, Godsill, and Moulines (2007) and Doucet and Johansen (2011) for recent reviews. They are also used in a static setting where a target distribution is of interest, for example, for the purpose of Bayesian modeling. This was proposed by Del Moral, Doucet, and Jasra (2006b) and developed by Peters (2005) and Del Moral, Doucet, and Jasra (2006a). This framework involves the construction of a sequence of artificial distributions on spaces of increasing dimensions which admit the distributions of interest as particular marginals.

SMC algorithms are perceived as being difficult to implement, and general tools were not available until the development by Johansen (2009), which provided a general framework for implementing SMC algorithms. SMC algorithms admit natural and scalable parallelization. However, parallel implementations of SMC algorithms exist only for specific applications, usually associated with particular SMC-related research. Lee, Yau, Giles, Doucet, and Holmes (2010) studied the parallelization of SMC algorithms on GPUs with some generality. There are few general tools to implement SMC algorithms on parallel hardware, even though multicore CPUs are very common today and computing on specialized hardware such as GPUs is more and more popular.

The purpose of the current work is to provide a general framework for implementing SMC algorithms on both sequential and parallel hardware. There are two main goals of the presented framework. The first is reusability. It will be demonstrated that the same implementation source code can be used to build a serialized sampler, or, using different programming models (for example, OpenMP (OpenMP Architecture Review Board 2011) and Intel TBB (Intel Cooperation 2013c)), to build parallelized samplers for multicore CPUs. They can be scaled for clusters using MPI (Message Passing Interface Forum 2012) with few modifications. And with a little effort they can also be used to build parallelized samplers on specialized massively parallel hardware such as GPUs using OpenCL (Kronos OpenCL Working Group 2012). The second is extensibility. It is possible to write a backend for vSMC to use new parallel programming models while reusing existing implementations. It is also possible to enhance the library to improve performance for specific applications. Almost all components of the library can be reimplemented by users; thus, if the default implementation is not suitable for a specific application, it can be replaced while being integrated with other components seamlessly.
2. Sequential Monte Carlo
Importance sampling is a technique which allows the calculation of the expectation of a function $\varphi$ with respect to a distribution $\pi$ using samples from some other distribution $\eta$ with respect to which $\pi$ is absolutely continuous, based on the identity
$$\mathbb{E}_\pi[\varphi(X)] = \int \varphi(x)\pi(x)\,\mathrm{d}x = \int \varphi(x)\frac{\pi(x)}{\eta(x)}\eta(x)\,\mathrm{d}x = \mathbb{E}_\eta\!\left[\frac{\varphi(X)\pi(X)}{\eta(X)}\right]. \qquad (1)$$
Thus, letting $\{X^{(i)}\}_{i=1}^N$ be samples from $\eta$, $\mathbb{E}_\pi[\varphi(X)]$ can be approximated by
$$\hat\varphi = \frac{1}{N}\sum_{i=1}^N \varphi(X^{(i)})\frac{\pi(X^{(i)})}{\eta(X^{(i)})}. \qquad (2)$$
In practice $\pi$ and $\eta$ are often known only up to normalizing constants, which can be estimated using the same samples. Let $w^{(i)} = \pi(X^{(i)})/\eta(X^{(i)})$; then we have
$$\hat\varphi = \frac{\sum_{i=1}^N w^{(i)}\varphi(X^{(i)})}{\sum_{i=1}^N w^{(i)}} \qquad (3)$$
or
$$\hat\varphi = \sum_{i=1}^N W^{(i)}\varphi(X^{(i)}), \qquad (4)$$
where $W^{(i)} \propto w^{(i)}$ are normalized such that $\sum_{i=1}^N W^{(i)} = 1$.

Sequential importance sampling (SIS) generalizes the importance sampling technique to a sequence of distributions $\{\pi_t\}_{t\ge0}$ defined on spaces $\{\prod_{k=0}^t E_k\}_{t\ge0}$. At time $t = 0$, sample $\{X_0^{(i)}\}_{i=1}^N$ from $\eta_0$ and compute the weights $W_0^{(i)} \propto \pi_0(X_0^{(i)})/\eta_0(X_0^{(i)})$. At time $t \ge 1$, each sample $X_{0:t-1}^{(i)}$, usually termed a particle in the literature, is extended to $X_{0:t}^{(i)}$ by a proposal $q_t(\,\cdot\,|X_{0:t-1}^{(i)})$, and the weights are recalculated by $W_t^{(i)} \propto \pi_t(X_{0:t}^{(i)})/\eta_t(X_{0:t}^{(i)})$, where
$$\eta_t(X_{0:t}^{(i)}) = \eta_{t-1}(X_{0:t-1}^{(i)})\,q_t(X_{0:t}^{(i)}|X_{0:t-1}^{(i)}) \qquad (5)$$
and thus
$$W_t^{(i)} \propto \frac{\pi_t(X_{0:t}^{(i)})}{\eta_t(X_{0:t}^{(i)})} = \frac{\pi_t(X_{0:t}^{(i)})}{\pi_{t-1}(X_{0:t-1}^{(i)})\,q_t(X_{0:t}^{(i)}|X_{0:t-1}^{(i)})}\,\frac{\pi_{t-1}(X_{0:t-1}^{(i)})}{\eta_{t-1}(X_{0:t-1}^{(i)})} = \frac{\pi_t(X_{0:t}^{(i)})}{q_t(X_{0:t}^{(i)}|X_{0:t-1}^{(i)})\,\pi_{t-1}(X_{0:t-1}^{(i)})}\,W_{t-1}^{(i)}, \qquad (6)$$
and an importance sampling estimate of $\mathbb{E}_{\pi_t}[\varphi_t(X_t)]$ can be obtained using $\{W_t^{(i)}, X_{0:t}^{(i)}\}_{i=1}^N$.

However, this approach fails as $t$ becomes large: the weights tend to become concentrated on a few particles as the discrepancy between $\eta_t$ and $\pi_t$ becomes larger. Resampling techniques are applied such that a new particle system $\{\bar W_t^{(i)}, \bar X_{0:t}^{(i)}\}_{i=1}^M$ is obtained with the property
$$\mathbb{E}\!\left[\sum_{i=1}^M \bar W_t^{(i)}\varphi_t(\bar X_{0:t}^{(i)})\right] = \mathbb{E}\!\left[\sum_{i=1}^N W_t^{(i)}\varphi_t(X_{0:t}^{(i)})\right]. \qquad (7)$$
In practice, the resampling algorithm is usually chosen such that $M = N$ and $\bar W^{(i)} = 1/N$ for $i = 1,\dots,N$. Resampling can be performed at each time $t$ or adaptively based on some criterion of the discrepancy. One popular quantity used to monitor the discrepancy is the effective sample size (ESS), introduced by Liu and Chen (1998), defined as
$$\mathrm{ESS}_t = \frac{1}{\sum_{i=1}^N (W_t^{(i)})^2}, \qquad (8)$$
where $\{W_t^{(i)}\}_{i=1}^N$ are the normalized weights, and resampling can be performed when $\mathrm{ESS} \le \alpha N$, where $\alpha \in [0,1]$. Instead of generating $\{\bar X_{0:t}^{(i)}\}_{i=1}^N$ directly, a random sample of integers $\{R^{(i)}\}_{i=1}^N$ is generated, such that $R^{(i)} \ge 0$ for $i = 1,\dots,N$ and $\sum_{i=1}^N R^{(i)} = N$, and each particle value $X_{0:t}^{(i)}$ is replicated $R^{(i)}$ times in the new particle system.
The distribution of $\{R^{(i)}\}_{i=1}^N$ shall fulfill the requirement of Equation 7. One such distribution is the multinomial distribution of size $N$ with weights $(W_t^{(1)},\dots,W_t^{(N)})$. See Douc, Cappé, and Moulines (2005) for some commonly used resampling algorithms.

SMC samplers allow us to obtain, iteratively, collections of weighted samples from a sequence of distributions $\{\pi_t\}_{t\ge0}$ over essentially any random variables on some spaces $\{E_t\}_{t\ge0}$, by constructing a sequence of auxiliary distributions $\{\tilde\pi_t\}_{t\ge0}$ on spaces of increasing dimensions, $\tilde\pi_t(x_{0:t}) = \pi_t(x_t)\prod_{s=0}^{t-1} L_s(x_{s+1}, x_s)$, where the sequence of Markov kernels $\{L_s\}_{s=0}^{t-1}$, termed backward kernels, is formally arbitrary but critically influences the estimator variance. See Del Moral et al. (2006b) for further details and guidance on the selection of these kernels.
Standard sequential importance sampling and resampling algorithms can then be applied to the sequence of synthetic distributions $\{\tilde\pi_t\}_{t\ge0}$. At time $t-1$, assume that a set of weighted particles $\{W_{t-1}^{(i)}, X_{0:t-1}^{(i)}\}_{i=1}^N$ approximating $\tilde\pi_{t-1}$ is available; then at time $t$, the path of each particle is extended with a Markov kernel, say $K_t(x_{t-1}, x_t)$, and the set of particles $\{X_{0:t}^{(i)}\}_{i=1}^N$ reaches the distribution $\eta_t(X_{0:t}^{(i)}) = \eta_0(X_0^{(i)})\prod_{k=1}^t K_k(X_{k-1}^{(i)}, X_k^{(i)})$, where $\eta_0$ is the initial distribution of the particles. To correct the discrepancy between $\eta_t$ and $\tilde\pi_t$, Equation 6 is applied, and in this case,
$$W_t^{(i)} \propto \frac{\tilde\pi_t(X_{0:t}^{(i)})}{\eta_t(X_{0:t}^{(i)})} = \frac{\pi_t(X_t^{(i)})\prod_{s=0}^{t-1} L_s(X_{s+1}^{(i)}, X_s^{(i)})}{\eta_0(X_0^{(i)})\prod_{k=1}^t K_k(X_{k-1}^{(i)}, X_k^{(i)})} \propto \tilde w_t(X_{t-1}^{(i)}, X_t^{(i)})\,W_{t-1}^{(i)}, \qquad (9)$$
where $\tilde w_t$, termed the incremental weights, are calculated as
$$\tilde w_t(X_{t-1}^{(i)}, X_t^{(i)}) = \frac{\pi_t(X_t^{(i)})\,L_{t-1}(X_t^{(i)}, X_{t-1}^{(i)})}{\pi_{t-1}(X_{t-1}^{(i)})\,K_t(X_{t-1}^{(i)}, X_t^{(i)})}. \qquad (10)$$
If $\pi_t$ is only known up to a normalizing constant, say $\pi_t(x_t) = \gamma_t(x_t)/Z_t$, then we can use the unnormalized incremental weights
$$w_t(X_{t-1}^{(i)}, X_t^{(i)}) = \frac{\gamma_t(X_t^{(i)})\,L_{t-1}(X_t^{(i)}, X_{t-1}^{(i)})}{\gamma_{t-1}(X_{t-1}^{(i)})\,K_t(X_{t-1}^{(i)}, X_t^{(i)})} \qquad (11)$$
for importance sampling. Further, with the previously normalized weights $\{W_{t-1}^{(i)}\}_{i=1}^N$, we can estimate the ratio of normalizing constants $Z_t/Z_{t-1}$ by
$$\frac{\hat Z_t}{Z_{t-1}} = \sum_{i=1}^N W_{t-1}^{(i)}\,w_t(X_{t-1}^{(i)}, X_t^{(i)}). \qquad (12)$$
Sequentially, the ratio of normalizing constants between the initial distribution $\pi_0$ and some target $\pi_T$, $T \ge 1$, can be estimated. See Del Moral et al. (2006b) for details on calculating the incremental weights. In practice, when $K_t$ is invariant to $\pi_t$ and an approximate suboptimal backward kernel
$$L_{t-1}(x_t, x_{t-1}) = \frac{\pi_t(x_{t-1})\,K_t(x_{t-1}, x_t)}{\pi_t(x_t)} \qquad (13)$$
is used, the unnormalized incremental weights become
$$w_t(X_{t-1}^{(i)}, X_t^{(i)}) = \frac{\gamma_t(X_{t-1}^{(i)})}{\gamma_{t-1}(X_{t-1}^{(i)})}. \qquad (14)$$
Some other commonly used sequential Monte Carlo algorithms can be viewed as special cases of the algorithms introduced above. Annealed importance sampling (AIS; Neal 2001) can be viewed as SMC samplers without resampling. Particle filters, as seen in the physics and signal processing literature, can also be interpreted as sequential importance sampling and resampling algorithms; see Doucet and Johansen (2011). The remainder of this paper introduces the vSMC library.
3. Using the vSMC library
The vSMC library makes use of C++'s template generic programming to implement general SMC algorithms. This library is formed by a few major modules, listed below. Some features not included in these modules are introduced later in context.
Core
The highest level of abstraction of SMC samplers. Users interact with classes defined within this module to create and manipulate general SMC samplers. Classes in this module include Sampler, Particle and others. These classes use user-defined callback functions or callable objects, such as functors, to perform problem-specific operations, such as updating particle values and weights. This module is documented in Section 4.1.
Symmetric Multiprocessing (SMP)
This is the form of computing most people use every day, including multiprocessor workstations and multicore desktops and laptops. Classes within this module make it possible to write generic operations which manipulate a single particle and which can be applied either sequentially or in parallel through various parallel programming models. A method defined through classes of this module can be used by Sampler as a callback object. This module is documented in Section 4.2.
Message Passing Interface
MPI is the de facto standard for parallel programming on distributed memory architectures. This module enables users to adapt implementations of algorithms written for the SMP module such that the same sampler can be parallelized using MPI. In addition, when used with the SMP module, it allows easy implementation of hybrid parallelization such as MPI/OpenMP. In Section 5.2, an example shows how to extend existing SMP-parallelized samplers using this module.
OpenCL
This module is similar to the two above except that it eases parallelization through OpenCL, for example for the purpose of general-purpose GPU programming (GPGPU). OpenCL is a framework for writing programs that can be executed across heterogeneous platforms; OpenCL programs can run on either CPUs or GPUs. It is beyond the scope of this paper to give a proper introduction to GPGPU, OpenCL and their use in vSMC. However, we will demonstrate the relative performance of this programming model on both CPUs and GPUs in Section 5.2.

vSMC is a header-only library; there is practically no installation step. The library can be downloaded from http://zhouyan.github.io/vSMC/vSMC.zip. After downloading and unpacking, one can start using vSMC by ensuring that the compiler can find the headers inside the include directory. To permanently install the library in a system directory, one can simply copy the contents of the include directory to an appropriate place.
Alternatively, one can use the CMake (Martin and Hoffman 2010) (2.8 or later) configuration script and obtain the source through Git (Torvalds et al.),

git clone git://github.com/zhouyan/vSMC.git
cd vSMC
git submodule init
git submodule update
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make install
For Unix-like systems, there is also a shell script build.sh that builds all the examples in this paper and produces the results and benchmarks shown in Section 5. See the documentation in that script for details of how to change the settings for the user's platform.
To build the reference manual, one needs Doxygen (van Heesch 2013), version 1.8.3 or later. Continuing the last step (still in the build directory), invoking make docs will create a doc directory inside build, which contains the HTML references. Alternatively, the reference manual can also be found at http://zhouyan.github.io/vSMC/doc/html/index.html. It is beyond the scope of this paper to document every feature of the vSMC library; in many places we will refer to this reference manual for further information.

vSMC uses Random123 (Salmon, Moraes, Dror, and Shaw 2011) counter-based RNGs for random number generation. For an SMC sampler with N particles, vSMC constructs N (statistically) independent RNG streams. It is possible to use millions of such streams without a huge memory footprint or other performance penalties. Since each particle has its own independent RNG stream, this frees users from many thread-safety and statistical independence considerations. It is highly recommended that users install this library. Within vSMC, these RNG streams are wrapped under C++11 RNG engines, and they can be replaced by other compatible RNG engines seamlessly; users only need to be familiar with the classes defined in the C++11 <random> header.

vSMC can use the Boost (Dawes et al.) library; version 1.49 or later is required. However, this is actually optional provided that proper C++11 features are available in the standard library, for example when using clang (The LLVM Developer Group 2013a) with libc++ (The LLVM Developer Group 2013b). The C++11 headers of concern are <functional> and <random>. To use them instead of the Boost library, one needs to define the configuration macros before including any vSMC headers. For example,

clang++ -std=c++11 -stdlib=libc++ \
    -DVSMC_HAS_CXX11LIB_FUNCTIONAL=1 \
    -DVSMC_HAS_CXX11LIB_RANDOM=1 \
    -o prog prog.cpp

tells the library to use the C++11 headers. These macros can also be configured through the CMake configuration script. vSMC has been tested with recent versions of clang, GCC (GNU Project 2013), Intel C++ Compiler (Intel Cooperation 2013a) and Microsoft Visual C++ (Microsoft Cooperation 2012).

vSMC can optionally use some C++11 features to improve performance and usability. In particular, as mentioned before, vSMC can use the C++11 standard library instead of the Boost library. At the time of writing, clang with libc++ has the most comprehensive support of C++11 with respect to standard compliance and feature completeness; GCC, Visual C++ and the Intel C++ Compiler 2013 also have very good C++11 support. Note that, provided the Boost library is available, all C++11 language and library features are optional: vSMC can be used with any C++98 conforming compiler.
4. The vSMC library
The core module abstracts general SMC samplers. SMC samplers can be viewed as formed by a few concepts regardless of the specific problem. The following is a list of the most commonly seen components of SMC samplers and their corresponding vSMC abstractions.

• A collection of all particle state values, namely {X_t^(i)}_{i=1}^N. In vSMC, users need to define a class, say T, to abstract this concept. We refer to this as the value collection. We will slightly abuse the generic name T in this paper: whenever a template parameter is mentioned with the name T, it always refers to such a value collection type unless stated otherwise.

• A collection of all particle state values and their associated weights. This is abstracted by a Particle<T> object, which also contains an RngSet object.

• Operations that perform tasks common to all samplers on these particles, such as resampling. These are the member functions of Particle<T>.

• A Sampler<T> object that manages a Particle<T> object and the operations applied to it at each iteration.

• A Monitor that, for the purpose of computing the estimates of E[h(X_t)], computes h(X_t^(i)) for each i = 1, ..., N. The function value h(X_t) is allowed to be a vector.

Note that within the core module, all operations are applied to the Particle<T> object as a whole, through the Sampler<T> object.
The template parameter T is a user-defined type that abstracts the value collection. vSMC does not restrict how the values are actually stored: they can be stored in memory, spread among the nodes of a cluster, in GPU memory or even in a database. However, this kind of flexibility comes with a small price: the value collection does need to fulfill two requirements. We will see later that for most common usage, vSMC provides readily usable implementations, on top of which users can create problem-specific classes.

First, the value collection class T has to provide a constructor of the form

T (SizeType N);

where SizeType is some integer type. Since vSMC allows one to allocate the states in any way suitable, one needs to provide this constructor, which Particle<T> calls to construct the value collection with the given number of particles. Second, Particle<T> needs a way to replicate particle values during resampling, so T has to provide a member function that copies the state of one particle to another; the exact signature is documented in the reference manual.
Constructing a sampler
Once the value collection class T is defined, one can start constructing SMC samplers. For example, the following line creates a sampler with N particles,

Sampler<T> sampler(N);

The number of particles is the only mandatory argument of the constructor. There are two optional parameters: the resampling scheme and the resampling threshold. The scheme parameter is self-explanatory. vSMC provides a few built-in resampling schemes; see the reference manual for a list of them. User-defined resampling algorithms can also be used; see the reference manual for details. The threshold is the value of ESS/N below which a resampling will be performed. It follows that if threshold ≥ 1, resampling is performed at every iteration, and if threshold ≤ 0, it is never performed.

Initialization and updating
All the callable objects that initialize, move and weight the particle collection can be added to a sampler through a set of member functions. All these objects operate on the Particle<T> object as a whole. For example, an initialization method can be a function,

std::size_t init_func (Particle<T> &particle, void *param);

or a class with operator() properly defined, for example,

struct init_class
{
    std::size_t operator() (Particle<T> &particle, void *param);
};

They can be added to the sampler through

sampler.init(init_func);

or

sampler.init(init_class());

respectively. C++11 std::function or its Boost equivalent boost::function can also be used. For example,

std::function<std::size_t (Particle<T> &, void *)> init_obj(init_func);
sampler.init(init_obj);
The addition of updating methods is more flexible. There are two kinds of updating methods. One is simply called move in vSMC, and is performed before the possible resampling at each iteration; these moves usually perform the updating of the weights among other tasks. The other is called mcmc, and is performed after the possible resampling; they are often MCMC-type moves. Multiple move's or mcmc's are also allowed. In fact, a vSMC sampler consists of a queue of move's and a queue of mcmc's. The move's in the queue can be changed through member functions of Sampler<T>. Objects of class type with operator() properly defined can also be used, similarly to the initialization method. The queue of mcmc's can be used similarly. See the reference manual for other methods that can be used to manipulate these two queues.

In principle, one can combine all moves into a single move. However, sometimes it is more natural to think of a queue of moves. For instance, if a multi-block Metropolis random walk consists of kernels K_1 and K_2, then one can implement each of them as functions, say mcmc_k1 and mcmc_k2, and add them to the sampler sequentially,

sampler.mcmc(mcmc_k1, true).mcmc(mcmc_k2, true);

Then at each iteration, they will be applied to the particle collection sequentially in the order in which they are added.
Running the algorithm, monitoring and outputs
Having set all the operations, one can initialize and iterate the sampler by calling

sampler.initialize((void *)param);
sampler.iterate(iter_num);

The param argument to initialize is optional, with NULL as its default. This parameter is passed to the user-defined init_func. The iter_num argument to iterate is also optional; the default is 1.

Before initializing the sampler, or after a certain time point, one can add monitors to the sampler. The concept is similar to BUGS (Spiegelhalter, Thomas, and Best 1996)'s monitor statement, except that it does not monitor the individual values but rather the importance sampling estimates. Consider approximating E[h(X_t)], where h(X_t) = (h_1(X_t), ..., h_m(X_t)) is an m-vector function. The importance sampling estimate can be obtained as A^T W, where A is an N by m matrix with A(i, j) = h_j(X_t^(i)) and W = (W_t^(1), ..., W_t^(N))^T is the N-vector of normalized weights. To compute this importance sampling estimate, one needs to define the following evaluation function (or a class with operator() properly defined),

void monitor_eval (std::size_t iter, std::size_t m,
                   const Particle<T> &particle, double *res);

where res points to the matrix A to be filled. The monitor is then added to the sampler together with a name and the dimension m; see the reference manual for the exact member function.
A reference to the value collection T object can be retrieved through the Particle<T>::value member function. The weights of a Particle<T> object are managed through a weight set class, by default WeightSet. The Particle<T> class is designed such that this weight set class can be replaced; for example, the MPI module provides a special weight set class that manages weights across multiple nodes and performs proper normalization, computation of ESS, and other tasks.

The default WeightSet object provides some ways to retrieve weights. Here we document some of the most commonly used; see the reference manual for details of others. The weights can be accessed one by one through accessors of Particle<T> and WeightSet (the exact member function names are given in the reference manual). One can also read all weights into a container, for example a std::vector<double>, through member functions that copy them to an output iterator. Note that these member functions accept general output iterators.
Implementing initialization and updating
So far we have only discussed how to add initialization and updating objects to a sampler. To actually implement them, one writes callable objects that operate on the Particle<T> object as a whole. Inside the body of such a function, one can change the values by manipulating the object through the reference returned by particle.value(). When using the default weight set class, the weights can be updated through a set of member functions of WeightSet, such as WeightSet::set_log_weight, which reads the new logarithm weights through an iterator, for example from a std::vector<double> computed by the move.
Generating random numbers
The Particle<T> object provides an independent RNG engine for each particle, accessible through its rng member function (also available from a SingleParticle<T> object as sp.rng()). These engines conform to the C++11 RNG engine concept, so they can be used with the C++11 <random> distributions, for example std::normal_distribution, or with their Boost library equivalents, for example boost::random::normal_distribution, depending on the availability of C++11 as indicated by the macro VSMC_HAS_CXX11LIB_RANDOM. Though the user is free to choose which one to use in their own code, there is a convenient alternative: for each class defined in the C++11 <random> header, vSMC provides a corresponding name that refers to either the C++11 standard library or Boost (see the reference manual). The benefit is that if one needs to develop on multiple platforms, and only some of them support C++11 while some of them have the Boost library, then only the configuration macro VSMC_HAS_CXX11LIB_RANDOM needs to be changed. This can be configured through CMake and other build systems. Of course, one can also use an entirely different RNG system than those provided by vSMC.
The value collection
Many typical problems' value collections can be viewed as a matrix of a certain type. For example, a simple particle filter whose state is a vector of length Dim and type double can be viewed as an N by Dim matrix, where N is the number of particles. A trans-dimensional problem can use an N by 1 matrix whose type is a user-defined class, say StateType. For this kind of problem, vSMC provides the class template StateMatrix, with three template parameters. The first (RowMajor or ColMajor) specifies how the values are ordered in memory; usually one shall choose RowMajor to optimize data access. The second template parameter is the number of variables, an integer value no less than 1, or the special value Dynamic, in which case StateMatrix provides a member function resize_dim such that the number of variables can be changed at runtime. The third template parameter is the type of the state values.

Each particle's state is thus a vector of length Dim, indexed from 0 to Dim - 1. To obtain the value at position pos of the vector of particle i, one can use one of the following member functions,

value.state(i, pos);
value.state(i, Position<Pos>());

where Pos is a compile-time constant expression whose value is the same as pos, assuming the position is known at compile time. One can also read all values: the variable at position pos can be copied into a container through an output iterator, and alternatively one can read all values through an iterator which points to iterators.

If C++11 features are available, vSMC also provides the StateTuple class template, which is similar to StateMatrix except that the types of the values do not have to be the same for each variable. This is similar to R (R Core Team 2013)'s data.frame. For example, suppose each particle's state is formed by two double's, an int and a user-defined type StateType; then a value collection can be constructed by instantiating StateTuple with these types as its template arguments (see the reference manual for the exact declaration). There are a few ways to access the state values, similar to StateMatrix,

double x0 = value.state(i, Position<0>());
int x2 = value.state<2>(i);

and, as with StateMatrix, all values of a given variable can be read into a container such as a std::vector.
A single particle
For a Particle<T> object, a SingleParticle<T> object represents a single particle within the collection. It is a lightweight object, constructed from the index of a particle and the Particle<T> object it belongs to, and it provides access to that index, to the Particle<T> object, and to the particle's own RNG engine through sp.rng(). When the value collection is a subclass of StateMatrix or StateTuple, it also provides direct access to the particle's state values. If the provided interface is insufficient for a particular problem, users can also extend the SingleParticle class template; see the reference manual for details.
Sequential and parallel implementations
Once we have the SingleParticle<T> abstraction, operations that manipulate a single particle can be written once and dispatched by the SMP module's base class templates, either sequentially or in parallel. The template parameter BaseState of some of these class templates needs to satisfy the general value collection requirements in addition to providing a copy_particle member function; StateMatrix is one example. Other base classes expect T to satisfy the general value collection requirements. The details of all these class templates can be found in the reference manual. Here we use the MoveSEQ base class template for illustration.

For the purpose of illustration, the type T is defined as, for example,

typedef StateMatrix<RowMajor, 4, double> T;

A move is implemented as a class derived from MoveSEQ<T>, whose member function move_state operates on a SingleParticle<T> object; the operator() of MoveSEQ<T> loops over all the particles and calls move_state for each of them. Implemented this way, MoveSEQ only takes away the loop around the particle-specific part of the move. However, if one implements the move in such a way and then replaces MoveSEQ with MoveOMP, the change of the base class name causes vSMC to use OpenMP to parallelize the loop: one can declare the class with MoveOMP as its base and use exactly the same implementation as before. Now when move::operator() is called, it will be parallelized by OpenMP. Other backends are available in case OpenMP is not; among them are Cilk Plus (Intel Cooperation 2011) and Intel TBB. In addition to these well-known parallel programming models, vSMC also has its own implementation using C++11 concurrency facilities.

To compile the same source into either a serialized sampler or an OpenMP-parallelized one, we can select the base class through a preprocessor macro and compile the same source into different samplers with Makefile rules such as the following,

prog-seq : prog.cpp
	$(CXX) $(CXXFLAGS) -DUSE_SEQ -o prog-seq prog.cpp

prog-omp : prog.cpp
	$(CXX) $(CXXFLAGS) -DUSE_OMP -o prog-omp prog.cpp

Or one can configure the source file through a build system such as CMake, which can also determine which programming model is available on the system.
Adapter
Sometimes, the cumbersome task of writing a class to implement a move and other operations can outweigh the power we gain through object-oriented programming. For example, a simple move_state-like function may be all that we need to get the work done. In this case, one can create a move through the MoveAdapter. For example, say we have implemented

std::size_t move_state (std::size_t iter, SingleParticle<T> sp);

Then it can be turned into move objects, for example,

MoveAdapter<T, MoveSEQ> move_seq(move_state);
MoveAdapter<T, MoveOMP> move_omp(move_state);

These are, respectively, sequential and OpenMP implementations; adapters for the other backends are created in the same way. The first template parameter is the type of the value collection and the second is the name of the base class template. The MoveAdapter's constructor accepts two more optional arguments, the pre_processor and the post_processor, corresponding to the other two aforementioned member functions. Similar adapters for the other three base classes also exist.

Another scenario where an adapter is desired is when the backend to use needs to be decided at runtime. The above simple adapters can already be used for this purpose. In addition, another form of the adapter takes an implementation class instead of a function as its argument; see the reference manual for its exact form.
Thread safety

• Constructors of Particle and Sampler are not thread-safe. Therefore, if one needs to construct multiple Sampler objects from different threads, mutex protection is needed. However, subsequent member function calls on the constructed objects are thread-safe according to the following rules.

• Member functions that concern a single particle are generally thread-safe, in the sense that one can call them on the same object from different threads as long as they are called for different particles. For example, Particle::rng and StateMatrix::state are thread-safe.

In general, one can safely manipulate different individual particles from different threads, which is the minimal requirement for scalable parallelization. But one cannot manipulate the whole particle collection from different threads, for example through WeightSet::set_log_weight. User-defined callbacks shall generally follow the same rules. For example, for a MoveOMP subclass, pre_processor and post_processor do not have to be thread-safe, but move_state needs to be. In general, avoid write access from move_state and similar member functions to memory locations shared by all particles. One also needs to take some extra care when using third-party libraries. For example, in our example implementation of the move class, the rnorm object, which is used to generate Normal random variates, is defined within move_state instead of being class member data, even though it is created with the same parameters for each particle. This is because the call rnorm(sp.rng()) is not thread-safe in some implementations, for example when using the Boost library.

For scalable performance, certain practices should be avoided when implementing member functions such as move_state. For example, dynamic memory allocation is usually lock-based and thus should be avoided; alternatively, one can use a scalable memory allocator such as the one provided by Intel TBB. In general, in functions such as move_state, one should avoid using locks to guarantee thread-safety, as they can become a bottleneck to parallel performance.
5. Example
The model and algorithm
This is an example used in SMCTC (Johansen 2009). Through this example, we will show how to re-implement a simple particle filter in vSMC; it shall walk one through the basic features of the library. SMCTC is the first general framework for implementing SMC algorithms in C++ and serves as one of the most important inspirations for the vSMC library. It is widely used by many researchers. One of the goals of the current work is that users familiar with SMCTC shall find little difficulty in using the new library.

The state space model, known as the almost constant velocity model in the tracking literature, provides a simple scenario. The state vector X_t contains the position and velocity of an object moving in a plane, that is, X_t = (x_pos^t, y_pos^t, x_vel^t, y_vel^t)^T. Imperfect observations Y_t = (x_obs^t, y_obs^t)^T of the positions are possible at each time instance. The state and observation equations are linear with additive noise,

X_t = A X_{t-1} + V_t
Y_t = B X_t + α W_t

where A is the usual constant velocity transition matrix with time step ∆ and B selects the two position coordinates of the state vector. The components of V_t are independent Normal with variances 0.02 and 0.001 for the position and velocity coordinates, respectively. The observation noise W_t comprises independent, identically distributed t-distributed random variables with ν = 10 degrees of freedom, and α is a fixed scale parameter. The prior at time 0 corresponds to an axis-aligned Gaussian with variance 4 for the position coordinates and 1 for the velocity coordinates. The particle filter algorithm is shown in Algorithm 1.

Implementations
We first introduce the body of the main function, showing the typical workflow of a vSMC program.

Initialization: Set t <- 0. Sample x_pos^(0,i), y_pos^(0,i) ~ N(0, 4) and x_vel^(0,i), y_vel^(0,i) ~ N(0, 1). Weight W_0^(i) proportional to L(X_0^(i) | Y_0), where L is the likelihood function.

Iteration: Set t <- t + 1. Sample

x_pos^(t,i) ~ N(x_pos^(t-1,i) + Delta x_vel^(t-1,i), 0.02)
x_vel^(t,i) ~ N(x_vel^(t-1,i), 0.001)
y_pos^(t,i) ~ N(y_pos^(t-1,i) + Delta y_vel^(t-1,i), 0.02)
y_vel^(t,i) ~ N(y_vel^(t-1,i), 0.001)

Weight W_t^(i) proportional to W_{t-1}^(i) L(X_t^(i) | Y_t).

Repeat the Iteration step until all data are processed.

Algorithm 1: Particle filter algorithm for the almost constant velocity model.

int main ()
{
    Sampler<cv> sampler(ParticleNum);
    sampler.init(cv_init);
    sampler.move(cv_move(), false);
    sampler.monitor("pos", 2, cv_monitor_state);

    sampler.initialize(const_cast<char *>("pf.data"));
    sampler.iterate(DataNum - 1);

    std::ofstream est("pf.est");
    est << sampler << std::endl;

    return 0;
}
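Stripped of the library machinery, one iteration of Algorithm 1 is just propagate-then-reweight. The sketch below is a self-contained illustration of that step, not vSMC code; the struct, the function names and the fixed Delta = 0.1 are assumptions made for the example:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// State of one particle in the almost constant velocity model.
struct cv_state { double x_pos, y_pos, x_vel, y_vel; };

// State equation: positions advance by delta * velocity plus Normal noise
// with variance 0.02; velocities receive Normal noise with variance 0.001.
inline cv_state cv_propagate (const cv_state &s, std::mt19937 &rng,
                              double delta = 0.1)
{
    std::normal_distribution<double> n_pos(0, std::sqrt(0.02));
    std::normal_distribution<double> n_vel(0, std::sqrt(0.001));
    cv_state r;
    r.x_pos = s.x_pos + delta * s.x_vel + n_pos(rng);
    r.y_pos = s.y_pos + delta * s.y_vel + n_pos(rng);
    r.x_vel = s.x_vel + n_vel(rng);
    r.y_vel = s.y_vel + n_vel(rng);
    return r;
}

// Unnormalized log-density of the t-distributed observation noise (nu = 10);
// the normalizing constant cancels under self-normalization of the weights.
inline double log_t10 (double x)
{
    return -5.5 * std::log(1 + x * x / 10);
}

// Weighting step: W_t^(i) proportional to W_{t-1}^(i) * L(X_t^(i) | Y_t),
// computed in log scale and then self-normalized.
inline std::vector<double> cv_weight (std::vector<double> &log_weight,
                                      const std::vector<cv_state> &particle,
                                      double x_obs, double y_obs)
{
    const double alpha = 0.1; // observation noise scale
    for (std::size_t i = 0; i != particle.size(); ++i) {
        log_weight[i] += log_t10((x_obs - particle[i].x_pos) / alpha);
        log_weight[i] += log_t10((y_obs - particle[i].y_pos) / alpha);
    }
    // subtract the maximum log-weight before exponentiating, for stability
    double max_lw = log_weight[0];
    for (double lw : log_weight) max_lw = lw > max_lw ? lw : max_lw;
    std::vector<double> w(log_weight.size());
    double sum = 0;
    for (std::size_t i = 0; i != w.size(); ++i)
        sum += (w[i] = std::exp(log_weight[i] - max_lw));
    for (std::size_t i = 0; i != w.size(); ++i)
        w[i] /= sum;
    return w;
}
```

Resampling and the initial prior draw complete the algorithm; vSMC supplies both, along with per-particle RNG streams.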
In the main function, we construct a sampler with ParticleNum particles. We then add the initialization function cv_init and a move object of type cv_move, followed by a monitor that records the importance sampling estimates of the two position parameters. Next, we initialize the sampler with the data file pf.data and iterate it until all data are processed. The last step is to output the results into a file called pf.est.

The class cv will be our value collection, a subclass of StateMatrix. The move cv_move will be implemented through the MoveSEQ adapter. To monitor the position parameters, we will implement a simple function cv_monitor_state and use the adapter MonitorEvalAdapter. The state space is an (N x Dim = 4) matrix of type double, so we can simply use StateMatrix directly.
After that, if we want to re-initialize the sampler, we can simply call

sampler.initialize();

This resets the sampler and initializes it again, but without reading the data. If the data set is large, repeated IO can be very expensive. After reading the data, we initialize each particle's value with Normal random variates and calculate its log-likelihood. The last step is to set the logarithm of the weights of the particle collection. Since this is not an accept-reject type algorithm, the returned acceptance count bears no meaning. The initialization function has the following signature,

std::size_t cv_init (Particle<cv> &particle, void *info)
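Conceptually, the body of cv_init draws every particle's state from the prior described above. Here is a library-free sketch of that step; the struct and the function are illustrative, not vSMC's API:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Stand-in for one particle's state (positions and velocities).
struct cv_particle { double x_pos, y_pos, x_vel, y_vel; };

// Draw each particle's initial state from the prior: an axis-aligned
// Gaussian with variance 4 for positions and 1 for velocities.
inline std::vector<cv_particle> init_particles (std::size_t n,
                                                std::mt19937 &rng)
{
    std::normal_distribution<double> prior_pos(0, 2); // variance 4
    std::normal_distribution<double> prior_vel(0, 1); // variance 1
    std::vector<cv_particle> particle(n);
    for (cv_particle &p : particle) {
        p.x_pos = prior_pos(rng);
        p.y_pos = prior_pos(rng);
        p.x_vel = prior_vel(rng);
        p.y_vel = prior_vel(rng);
    }
    return particle;
}
```

In the real cv_init, the draws use each particle's own RNG stream, and the initial log-likelihood is computed and set as the log-weight afterwards.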
In this example, we read all data from a single file for simplicity. In a realistic application, the data are often processed online: the filter is applied as new data become available. In this case, the user can use the optional argument of Sampler::initialize to pass the new data to the initialization function.

vSMC provides several equivalent ways to access the particles. First, the SingleParticle type provides access to an individual particle's state and RNG stream through its index. Second, vSMC provides STL style iterators over a Particle object. Third, if one has a compiler supporting C++11 auto and range-based for, the following is also supported,

for (auto &sp : particle)
    sp.state(0) = norm_pos(sp.rng());

There is little or no performance difference among these forms of loops. However, one can choose an appropriate form to work with the interfaces of other libraries. For example, the iterator support allows vSMC to be used with STL algorithms.
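The equivalence of the loop forms can be seen side by side with a plain std::vector standing in for the particle collection (this illustrates the loop styles, not vSMC's Particle interface):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Apply three updates through three loop styles; with vSMC the container
// would be a Particle<T> object, here a plain vector stands in for it.
inline std::vector<double> loop_forms_demo (std::vector<double> state)
{
    // 1. index-based loop (SingleParticle-style access)
    for (std::size_t i = 0; i != state.size(); ++i)
        state[i] += 1;

    // 2. STL-style iterators, usable with standard algorithms
    std::for_each(state.begin(), state.end(), [](double &x) { x *= 2; });

    // 3. C++11 range-based for
    for (double &x : state)
        x -= 1;

    return state;
}
```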
If we later decide to monitor all states, we only need to change the monitor dimension in the above line to Dim. After implementing all of the above, compiling and running the program, a file called pf.est is written by the following statement in the main function,
C++ lllllll l l lll llllll lllllll lll lllllllllllll lllllll lllll l lllllllll lll lll llllllllll llll llllllll lllllllll l lllll l l l lll llll ll ll llll l lll lllllllllllll llll lll lllll l ll lllllll lll lll llllllllll lll l lllll lll lllllllll
X position Y po s i t i on Data ll EstimatesObservations
est << sampler << std::endl;

Figure 1: Observations and estimates of a simple particle filter.
The output file contains the ESS, resampling indicators, importance sampling estimates and other information in a table. This file can be read by most statistical software for further analysis. For instance, we can process this file with R and plot the estimates of the positions against the observations, as shown in Figure 1.

The model and algorithm
Since Richardson and Green (1997), the Gaussian mixture model (GMM) has provided a canonical example of a model-order-determination problem. We use the same model as in Del Moral et al. (2006b) to illustrate the implementation of this classical example in the Monte Carlo literature. This model is also used in Zhou, Johansen, and Aston (2013) to demonstrate the use of SMC in the context of Bayesian model comparison, and that paper provides more details of the following setting. The model is as follows: data y = (y_1, ..., y_n) are independently and identically distributed as

y_i | theta_r ~ sum_{j=1}^r omega_j N(mu_j, lambda_j^{-1})

where N(mu_j, lambda_j^{-1}) denotes the Normal distribution with mean mu_j and precision lambda_j, theta_r = (mu_r, lambda_r, omega_r), and r is the number of components in each model. The parameter space is thus R^r x (R^+)^r x S_r, where S_r = {omega_r : 0 <= omega_j <= 1, sum_{j=1}^r omega_j = 1} is the standard (r - 1)-simplex. The priors are mu_j ~ N(xi, kappa^{-1}), lambda_j ~ G(nu, chi) and omega_r ~ D(rho), where D(rho) is the symmetric Dirichlet distribution with parameter rho and G(nu, chi) is the Gamma distribution with shape nu and scale chi. The prior parameters are set in the same manner as in Richardson and Green (1997); see also Zhou et al. (2013) for details. The data are simulated from a four-component model with lambda_j = 2 and omega_j = 0.25 for j = 1, ..., 4. Our interest is in simulating the posterior distribution of models with r components, denoted by M_r, and obtaining the normalizing constant for the purpose of Bayesian model comparison (Robert 2007, chap. 7).

Numerous strategies are possible for constructing a sequence of distributions for SMC sampling. One option is to use, for each model M_r, r in {1, 2, ...}, the sequence {pi_t}_{t=0}^{T_r} defined by

pi_t(theta_r^t) proportional to pi(theta_r^t | M_r) p(y | theta_r^t, M_r)^{alpha(t/T_r)}, (15)

where the number of distributions, T_r, and the annealing schedule, alpha : [0, 1] -> [0, 1], may be different for each model. This leads to Algorithm 2. For each model M_r in M, the algorithm is as follows.

Initialization: Set t <- 0. Sample theta_r^(i,0) ~ pi(theta_r^(i,0) | M_r). Weight W_0^(i) proportional to 1.

Iteration: Set t <- t + 1. Weight W_t^(i) proportional to W_{t-1}^(i) p(y | theta_r^(i,t-1), M_r)^{alpha(t/T_r) - alpha((t-1)/T_r)}. Apply resampling if necessary. Sample theta_r^(i,t) ~ K_t(. | theta_r^(i,t-1)), a pi_t-invariant MCMC kernel.

Repeat the Iteration step up to t = T_r.

Algorithm 2: SMC algorithm for Bayesian modeling of the Gaussian mixture model.

The MCMC kernel K_t in Algorithm 2 is constructed as a three-block Metropolis random walk:

1. Update mu_r through a Normal random walk.
2. Update lambda_r through a Normal random walk on the logarithm scale, that is, on log lambda_j, j = 1, ..., r.
3. Update omega_r through a Normal random walk on the logit scale, that is, on log(omega_j / omega_r), j = 1, ..., r - 1.

Following Zhou et al. (2013), a standard SMC estimator of the ratio of normalizing constants is

lambda-hat_{T_r,N}^{ds} = log { (1/N) sum_{i=1}^N [pi(theta_r^(i,0) | M_r) / nu(theta^(i,0))] x prod_{t=1}^{T_r} sum_{i=1}^N W_{t-1}^(i) p(y | theta_r^(i,t-1), M_r)^{alpha(t/T_r) - alpha((t-1)/T_r)} }, (16)

where nu is the distribution from which the initial sample is drawn and W_{t-1}^(i) is the normalized importance weight of sample theta_{t-1}^(i).
As shown in Zhou et al. (2013), the estimation of the normalizing constant associated with our sequence of distributions can also be achieved by a Monte Carlo approximation to the path sampling formulation given by Gelman and Meng (1998), also known as thermodynamic integration or Ogata's method. Given a parameter alpha which defines a family of distributions {p_alpha = q_alpha / Z_alpha}_{alpha in [0,1]} that moves smoothly from p_0 = q_0/Z_0 to p_1 = q_1/Z_1 as alpha increases from zero to one, one can estimate the logarithm of the ratio of their normalizing constants via a simple integral relationship,

log(Z_1 / Z_0) = int_0^1 E_alpha[ d log q_alpha(.) / d alpha ] d alpha, (17)

where E_alpha denotes expectation under p_alpha. The sequence of distributions in the SMC algorithm for this example can be interpreted as belonging to such a family of distributions, with alpha = alpha(t/T_r).

The SMC sampler provides us with a set of weighted samples obtained from a sequence of distributions suitable for approximating this integral. At each time t we can obtain an estimate of the expectation within the integral via the usual importance sampling estimator, and the integral can then be approximated via trapezoidal integration. In summary, the path sampling estimator of the ratio of normalizing constants lambda_{T_r} = log(Z_1/Z_0) is
lambda-hat_{T_r,N}^{ps} = sum_{t=1}^{T_r} (1/2)(alpha_t - alpha_{t-1})(U_t^N + U_{t-1}^N) (18)

where

U_t^N = sum_{i=1}^N W_t^(i) [ d log q_alpha(X_t^(i)) / d alpha ]_{alpha = alpha_t}. (19)

Implementations
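As a first concrete piece, the arithmetic behind equations (18) and (19) is a weighted average per iteration followed by trapezoidal integration. The following sketch isolates just that arithmetic; it is illustrative, not vSMC's gmm_path class:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// U_t^N of equation (19): importance-weighted average of the path sampling
// integrand d log q_alpha / d alpha, with weights already normalized.
inline double path_integrand (const std::vector<double> &weight,
                              const std::vector<double> &d_log_q)
{
    double u = 0;
    for (std::size_t i = 0; i != weight.size(); ++i)
        u += weight[i] * d_log_q[i];
    return u;
}

// Equation (18): trapezoidal rule over the grid alpha_0, ..., alpha_T,
// estimating log(Z_1 / Z_0).
inline double path_sampling (const std::vector<double> &alpha,
                             const std::vector<double> &u)
{
    double ps = 0;
    for (std::size_t t = 1; t != alpha.size(); ++t)
        ps += 0.5 * (alpha[t] - alpha[t - 1]) * (u[t] + u[t - 1]);
    return ps;
}
```

With a constant integrand the estimator recovers exactly integrand x (alpha_{T_r} - alpha_0), which is a convenient sanity check.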
In this example we will implement the following classes.

• gmm_param is a class that abstracts the parameters of the model, theta_r = (mu_r, lambda_r, omega_r).
• gmm_state is the value collection class.
• gmm_init is a class that implements the operations used to initialize the sampler.
• gmm_move_smc is a class that implements the operations used to update the weights, as well as to select the random walk proposal scales and the distribution parameter alpha(t/T_r).
• gmm_move_mu, gmm_move_lambda and gmm_move_weight are classes that implement the random walks, each for one of the three blocks.
• gmm_path is a class that implements the monitor for the path sampling estimator. This class is similar to the importance sampling monitor introduced before. It is to be used with Sampler::path_sampling.
The definitions of these macros will be changed at compile time to build parallelized samplers. For example, when using OpenMP parallelization, the header backend_omp.hpp will be used instead of backend_seq.hpp, and StateSEQ will be changed to StateOMP, along with similar changes to the other macros. In the distributed source code, this is configured by the CMake build system.

Again, we first introduce the main function. The required headers are the same as in the last particle filter example, in addition to the SMP backend headers described above. The following variables, used in the main function, will be set by user input.

int ParticleNum;
int IterNum;
int AnnealingScheme;
int PriorPower;
int CompNum;
std::string DataFile;

In the main function, we create objects that set the distribution parameter alpha(t/T_r) at each iteration according to the user input AnnealingScheme. Below is the main function; note that some source code of IO operations which set the parameters above is omitted.

int main ()
{
    gmm_move_smc::alpha_setter_type alpha_setter;
    if (AnnealingScheme == 1)
        alpha_setter = gmm_alpha_linear(IterNum);
    if (AnnealingScheme == 2)
        alpha_setter = gmm_alpha_prior(IterNum, PriorPower);

    Sampler<gmm_state> sampler(ParticleNum);
The sampler first sets the number of components and allocates memory through the member function comp_num of gmm_state. Then it sets the initialization and updating methods. Before possible resampling, a gmm_move_smc object is added; after that, three Metropolis random walks are appended. In addition, we add a gmm_path object to calculate the path sampling integration. Then we initialize and iterate the sampler and obtain the normalizing constant estimates.

Obviously, the parameter class gmm_param needs to store the parameters (mu_r, lambda_r, omega_r). We also associate with each particle its log-likelihood and log-prior. Here is the definition of the gmm_param class, with the definitions of some data access member functions omitted.

class gmm_param
{
    public :
    void comp_num (std::size_t num);
    void save_old ();

    double log_prior () const {return log_prior_;}
    double &log_prior () {return log_prior_;}
    double log_likelihood () const {return log_likelihood_;}
    double &log_likelihood () {return log_likelihood_;}

    int mh_reject_mu (double p, double u);
    int mh_reject_lambda (double p, double u);
    int mh_reject_weight (double p, double u);
    int mh_reject_common (double p, double u);

    double log_lambda_diff () const;
    double logit_weight_diff () const;
    void update_log_lambda ();

    private :
    std::size_t comp_num_;
    double log_prior_, log_prior_old_, log_likelihood_, log_likelihood_old_;
    std::vector<double> mu_, mu_old_, lambda_, lambda_old_, weight_, weight_old_;
};

The member function mh_reject_mu accepts an acceptance probability, say p, and a uniform [0, 1] random variate, say u. It rejects the proposed change if p < u, restoring the particle's mu_r parameters to the values saved by save_old. The member functions mh_reject_lambda and mh_reject_weight do the same for the other two sets of parameters. All three also call mh_reject_common, which restores the stored log-likelihood and log-prior values. The use of these member functions will be seen in the implementation of gmm_move_mu, in the context of which their own implementations become obvious. Other member functions provide useful computations, such as the logarithm of lambda_r; they are used when computing the log-likelihood.

The class gmm_state contains some properties common to all particles, such as the data and the distribution parameter alpha(t/T_r). We also have it record the logarithm of the ratio of normalizing constants, using the NormalizingConstant class; we will see how this variable is updated at each iteration in the implementation of gmm_move_smc. The prior parameters are also stored in the value collection. Here is the definition of this value collection class, again with some data access member functions omitted.

class gmm_state : public BASE_STATE
The initialize_param member function is called before the pre_processor, which is absent in this case and so has the default implementation, which does nothing. It processes the optional parameter of Sampler::initialize, the file name of the data. The initialize_state member function initializes the state values according to the prior and updates the log-prior and log-likelihood.

After initialization, at each iteration, the gmm_move_smc class implements the updating of the weights as well as the selection of the proposal scales and the distribution parameter. For example, when using the linear annealing scheme, we can implement a gmm_alpha_linear class as follows,

class gmm_alpha_linear
{
    public :
    gmm_alpha_linear (const std::size_t iter_num) : iter_num_(iter_num) {}

    void operator() (std::size_t iter, Particle<gmm_state> &particle);

    private :
    const std::size_t iter_num_;
};
The gmm_alpha_linear class accepts the total number of iterations T_r as an argument to its constructor, and it implements an operator() that updates the distribution parameter alpha(t/T_r). The prior annealing scheme can be implemented similarly. For simplicity and demonstration purposes, we only allow gmm_move_smc to be configured with different annealing schemes, and hard-code the proposal scales.

class gmm_move_smc
{
    public :
    typedef cxx11::function<void (std::size_t, Particle<gmm_state> &)>
        alpha_setter_type;

Here cxx11::function is either std::function or boost::function, depending on the configuration macro VSMC_HAS_CXX11LIB_FUNCTIONAL. Objects of this class type can be added to a sampler as a move. The operator() satisfies the interface requirement of the core module. First, it uses alpha_setter_ to set the distribution parameter alpha(t/T_r). Second, it sets the proposal scales for the three Metropolis random walks according to the current value of alpha. Then it computes the unnormalized incremental weights. The NormalizingConstant class has a member function add_log_weight, not unlike the one with the same name in WeightSet; it accepts the logarithm of the incremental weights and a WeightSet object, and the standard normalizing constant estimate is computed using these values. Last, we also modify the WeightSet object itself by adding the logarithm of the incremental weights.

At each iteration, the random walks are also performed. Their implementations are straightforward. Below is the implementation of the random walk on the mean parameters; the random walks on the other parameters are similar.

class gmm_move_mu : public BASE_MOVE
First, we save in p the logarithm of the value of the target density computed using the old values; p will become the acceptance probability. Then we call gmm_param::save_old to save the old values. Next, we update each parameter with a proposed Normal random variate and compute the new log-prior and log-likelihood, as well as the new value of the target density. Then we reject the proposal according to the Metropolis algorithm, as implemented in gmm_param, which manages both the current states and the backup.

Last, we need to monitor certain quantities for inference purposes. Recall that in the main function we used sampler.path_sampling(gmm_path()) to set up the monitoring of path sampling. The path_sampling member function requires a callable object with the following signature,

double path_eval (std::size_t iter, const Particle<gmm_state> &particle, double *res);
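Stepping back to the accept/reject mechanics used by gmm_move_mu, the save/propose/restore pattern can be sketched with a stripped-down stand-in for gmm_param (the class below is a toy, not the real one):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal illustration of the mh_reject_* logic: the proposal is rejected
// when the acceptance probability p is below the uniform variate u, in
// which case the values saved by save_old are restored. Returns 1 on
// acceptance and 0 on rejection, mirroring the int return type above.
class toy_param
{
    public :
    explicit toy_param (std::size_t n) : mu_(n), mu_old_(n) {}

    std::vector<double> &mu () { return mu_; }

    void save_old () { mu_old_ = mu_; }

    int mh_reject_mu (double p, double u)
    {
        if (p < u) {
            mu_ = mu_old_; // reject: restore the saved state
            return 0;
        }
        return 1;          // accept: keep the proposed state
    }

    private :
    std::vector<double> mu_, mu_old_;
};
```

In the real class, mh_reject_common additionally restores the saved log-prior and log-likelihood, so a rejected move leaves the particle exactly as it was.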
This can be implemented through an adapter, in the same fashion as the MoveSEQ template introduced in Section 4.2. The path sampling integrands under this geometric annealing scheme are simply the log-likelihood. Therefore the implementation of the gmm_path class is rather simple,

class gmm_path : public BASE_PATH
Results
After compiling and running the algorithm, the results were consistent with those reported in Del Moral et al. (2006b). For a more in-depth analysis of the methodology, its extensions and the results, see Zhou et al. (2013).
Extending the implementation using MPI

vSMC's MPI module assumes that identical samplers are constructed on each node, with possibly different numbers of particles to accommodate differences in capacity among nodes. To extend the above SMP implementation for use with MPI, first, at the beginning of the main function, we add
MPIEnvironment env(argc, argv);

to initialize the MPI environment. When the object env is destroyed at the exit of the main function, the MPI environment is finalized. Second, we need to replace the base value collection class template with StateMPI. So now gmm_state is declared as follows,

class gmm_state : public StateMPI
The implementation is exactly the same as before. Third, the gmm_param class now needs to be transferable using MPI. Unlike in the SMP situations, a simple copy constructor is not enough. vSMC uses the Boost MPI library, and thus one only needs to write a serialize member function for gmm_param such that the data can be serialized into bytes; see the documentation of the Boost MPI and serialization libraries for details. In summary, the following member function accepts an Archive object as input, and it performs either a store or a load operation based on the Archive type. In a load operation the Archive object is like an input stream, and in a store operation it is like an output stream.

template <typename Archive>
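The key point of the serialize idiom is that one member function serves both directions: with an output archive, operator& writes; with an input archive, it reads. The toy archives below illustrate the mechanism without depending on Boost:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy output archive: operator& appends values to a buffer.
struct toy_oarchive
{
    std::vector<double> buf;
    toy_oarchive &operator& (double v) { buf.push_back(v); return *this; }
};

// Toy input archive: operator& reads values back in the same order.
struct toy_iarchive
{
    std::vector<double> buf;
    std::size_t pos = 0;
    toy_iarchive &operator& (double &v) { v = buf[pos++]; return *this; }
};

// A parameter class with a single serialize member, in the style required
// by Boost MPI/Serialization: the same function stores with an output
// archive and loads with an input archive.
struct toy_gmm_param
{
    double log_prior, log_likelihood;

    template <typename Archive>
    void serialize (Archive &ar, const unsigned /* version */)
    {
        ar & log_prior;
        ar & log_likelihood;
    }
};
```

With Boost, the Archive type would be, for example, boost::mpi::packed_oarchive or packed_iarchive, and the member vectors of gmm_param would be serialized the same way.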
Fourth, after user input of the sampler parameters, we need to synchronize them across all nodes. For example, for the ParticleNum parameter,

boost::mpi::communicator World;
boost::mpi::broadcast(World, ParticleNum, 0);

Last, any importance sampling estimates that are computed on each node need to be combined into final results. For example, the path sampling results are now obtained by adding the results from each node together,

double ps_sum = 0;
boost::mpi::reduce(World, ps, ps_sum, std::plus<double>(), 0);
For the standard normalizing constant ratio estimator, we replace NormalizingConstant with NormalizingConstantMPI, which will perform these and other tasks. After these few lines of change, the sampler is parallelized using MPI and can be deployed to clusters and other distributed memory architectures. On each node, the selected SMP parallelization is used to perform multi-threaded parallelization locally. vSMC's MPI module will take care of normalizing the weights and other tasks.

Parallelization performance
One of the main motivations behind the creation of vSMC is to ease parallelization with different programming models. The same implementation can be used to build different samplers based on which parallel programming model is supported on the user's platform. In this section we compare the performance of various SMP parallel programming models and OpenCL parallelization.

We consider five different implementations supported by the Intel C++ Compiler 2013: sequential, Intel TBB, Cilk Plus, OpenMP and C++11 threads. All implementations are compiled with

CXX=icpc -std=c++11 -gcc-name=gcc-4.7 -gxx-name=g++-4.7
CXXFLAGS=-O3 -xHost -fp-model precise \
    -DVSMC_HAS_CXX11LIB_FUNCTIONAL=1 \
    -DVSMC_HAS_CXX11LIB_RANDOM=1

on a Ubuntu 12.10 workstation with a Xeon W3550 (3.06GHz, 4 cores, 8 hardware threads through hyper-threading) CPU. A four-component model and 100 iterations with a prior annealing scheme are used for all implementations. A range of numbers of particles is tested.

For different numbers of particles, the wall clock time and speedup are shown in Figure 2. For larger numbers of particles, the differences are minimal among all the programming models; they all reach roughly 550% speedup. With smaller numbers of particles, vSMC's C++11 thread parallelization is less efficient than the other, industry-strength programming models. However, with 1000 or more particles, which is fewer than in typical applications, the difference is not very significant.
OpenCL implementations are also compared on the same workstation, which also has an NVIDIA Quadro 2000 graphics card. OpenCL programs can be compiled to run on both CPUs and GPUs. For the CPU implementation, there are the Intel OpenCL (Intel Corporation 2013b) and AMD APP OpenCL (Advanced Micro Devices, Inc. 2012) platforms. We use the Intel TBB implementation as a baseline for comparison. The same OpenCL implementation is used for all the CPU and GPU runtimes; therefore it is not particularly optimized for any of them. For the GPU implementation, in addition to double precision, we also tested a single precision configuration. Unlike modern CPUs, which have the same performance for double and single precision floating point operations (unless SIMD instructions are used, which can give at most a speedup by a factor of 2), GPUs penalize double precision performance heavily.

For different numbers of particles, the wall clock time and speedup are plotted in Figure 3. With smaller numbers of particles, the OpenCL implementations have a high overhead compared to the Intel TBB implementation. With a large number of particles, AMD APP OpenCL has a performance similar to the Intel TBB implementation, while Intel OpenCL is about 40% faster than the Intel TBB implementation; this is due to more efficient vectorization and compiler optimizations. The double precision performance of the NVIDIA GPU shows a 220% speedup and the single precision performance a nearly 1600% speedup. As a rough reference for the expected performance gain, the CPU has a theoretical peak performance of 24.48 GFLOPS, while the GPU has a theoretical peak of 60 GFLOPS in double precision and 480 GFLOPS in single precision. This represents 245% and 1960% speedup compared to the CPU, respectively.
Figure 2: Performance of C++ implementations of Bayesian modeling for the Gaussian mixture model (Linux; Xeon W3550, 3.06GHz, 4 cores, 8 threads).
Figure 3: Performance of OpenCL implementations of Bayesian modeling for the Gaussian mixture model (Linux; Xeon W3550, 3.06GHz, 4 cores, 8 threads; NVIDIA Quadro 2000).
It is widely believed that OpenCL programming is tedious and hard. However, vSMC provides facilities to manage OpenCL platforms and devices as well as common operations. Limited by the scope of this paper, the OpenCL implementation (distributed with the vSMC source) is not documented here. Overall, the OpenCL implementation has about 800 lines, including both host and device code. This is not an enormous increase in effort compared to the 500-line SMP implementation: for less than double the code base, the speedup is more than 15-fold, so we consider the programming effort relatively small.
6. Discussion

This paper introduced a C++ template library intended for implementing general SMC algorithms and constructing parallel samplers with different programming models. While it is possible to implement many realistic applications within the presented framework, some technical proficiency is still required to implement the problem specific parts of the algorithms. Some basic knowledge of C++ in general, and of how to use a template library, is also required. It is shown that with the presented framework it is possible to implement parallelized, scalable SMC samplers in an efficient and reusable way. The performance of some common parallel programming models is compared using an example.

Some future work may be worth the effort to further ease the implementation of SMC algorithms. However, there is a balance between performance, flexibility and ease of use. vSMC aims to be developer-friendly and to provide users as much control as possible over all performance related aspects. For a BUGS-like interface, users may be interested in other software such as BiiPS (Caron, Todeschini, Legrand, and Moral 2012). In addition, LibBi (Murray 2013) provides a user friendly and high performance alternative with a focus on state-space models. Compared with these recent developments, vSMC is less accessible to those with little or no knowledge of C++. However, for researchers with expertise in C++, and template generic programming in particular, vSMC provides a framework within which potentially superior performance can be obtained and greater flexibility and extensibility are possible.
References
Advanced Micro Devices, Inc (2012). AMD Accelerated Parallel Processing OpenCL Programming Guide. Version 2.8, URL http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk.

Cappé O, Godsill SJ, Moulines E (2007). "An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo." Proceedings of the IEEE, (5), 899–924.

Caron F, Todeschini A, Legrand P, Moral PD (2012). BiiPS: Bayesian Inference with Interacting Particle Systems. Version 0.7.2, URL https://alea.bordeaux.inria.fr/biips/doku.php.

Dawes B, et al. (2013). Boost C++ Libraries. Version 1.53.

Del Moral P, Doucet A, Jasra A (2006a). "Sequential Monte Carlo Methods for Bayesian Computation." In Bayesian Statistics 8. Oxford University Press.

Del Moral P, Doucet A, Jasra A (2006b). "Sequential Monte Carlo Samplers." Journal of the Royal Statistical Society B, (3), 411–436.

Douc R, Cappé O, Moulines E (2005). "Comparison of Resampling Schemes for Particle Filtering." In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pp. 1–6.

Doucet A, Johansen AM (2011). "A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later." In The Oxford Handbook of Non-linear Filtering. Oxford University Press.

Gelman A, Meng XL (1998). "Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling." Statistical Science, (2), 163–185.

GNU Project (2013). GCC – GNU Compiler Collection. Version 4.8.1, URL http://gcc.gnu.org.

Intel Corporation (2011). Intel Cilk Plus Language Specification. Version 1.1, URL http://cilkplus.org/.

Intel Corporation (2013a). Intel C++ Composer XE. Version 13.1, URL http://software.intel.com/en-us/intel-compilers.

Intel Corporation (2013b). Intel SDK for OpenCL Applications 2013. URL http://software.intel.com/en-us/vcsource/tools/opencl-sdk.

Intel Corporation (2013c). Intel Threading Building Blocks. Version 4.1, URL http://threadingbuildingblocks.org/.

Johansen AM (2009). "SMCTC: Sequential Monte Carlo in C++." Journal of Statistical Software, (6), 1–41.

Khronos OpenCL Working Group (2012). The OpenCL Specification. Version 1.2.

Lee A, Yau C, Giles MB, Doucet A, Holmes CC (2010). "On the Utility of Graphics Cards to Perform Massively Parallel Simulation of Advanced Monte Carlo Methods." Journal of Computational and Graphical Statistics, (4), 769–789.

Liu JS, Chen R (1998). "Sequential Monte Carlo Methods for Dynamic Systems." Journal of the American Statistical Association, (443), 1032–1044.

Martin K, Hoffman B (2010). Mastering CMake. Version 2.8.

Message Passing Interface Forum (2012). MPI: A Message-Passing Interface Standard. Version 3.0.

Microsoft Corporation (2012). Microsoft Visual C++. URL http://msdn.microsoft.com/en-us/vstudio/hh386302.

Murray LM (2013). "Bayesian State-Space Modelling on High-Performance Hardware Using LibBi." arXiv, pp. 1–28. URL http://arxiv.org/abs/1306.3277.

Neal RM (2001). "Annealed Importance Sampling." Statistics and Computing, (2), 125–139.

OpenMP Architecture Review Board (2011). OpenMP Application Program Interface. Version 3.1.

Peters GW (2005). Topics in Sequential Monte Carlo Samplers. Master's thesis, Department of Engineering, University of Cambridge.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Version 2.15.2.

Richardson S, Green PJ (1997). "On Bayesian Analysis of Mixtures with an Unknown Number of Components." Journal of the Royal Statistical Society B, (4), 731–792.

Robert CP (2007). The Bayesian Choice: From Decision-theoretic Foundations to Computational Implementation. 2nd edition. Springer-Verlag, New York.

Salmon JK, Moraes MA, Dror RO, Shaw DE (2011). Parallel Random Numbers: As Easy as 1, 2, 3.

Spiegelhalter D, Thomas A, Best N (1996). BUGS – Bayesian Inference Using Gibbs Sampling. Version 0.5.

The LLVM Developer Group (2013a). clang: A C Language Family Frontend for LLVM. Version 3.2, URL http://clang.llvm.org/.

The LLVM Developer Group (2013b). "libc++" C++ Standard Library. Version 1.1, URL http://libcxx.llvm.org.

Torvalds L, et al. (2013). Git – Distributed Source Code Management. Version 1.8.3.1, URL http://git-scm.

van Heesch D (2013). Doxygen – Generating Documentation from Source Code. Version 1.8.4.

Zhou Y, Johansen AM, Aston JAD (2013). "Towards Automatic Model Comparison: Adaptive Sequential Monte Carlo Approach." arXiv, pp. 1–33. URL http://arxiv.org/abs/1303.3123.

Affiliation:
Yan Zhou
Department of Statistics
University of Warwick
Coventry, CV4 7AL, United Kingdom
E-mail: