Parallelizing MCMC Sampling via Space Partitioning
Vasyl Hafych, Philipp Eller, Oliver Schulz, Allen Caldwell
1: Max Planck Institute for Physics, D-80805 Munich, Germany
2: Technical University of Munich, D-80333 Munich, Germany
August 10, 2020
Abstract
Efficient sampling of many-dimensional and multimodal density functions is a task of great interest in many research fields. We describe an algorithm that allows parallelizing inherently serial Markov chain Monte Carlo (MCMC) sampling by partitioning the space of the function parameters into multiple subspaces and sampling each of them independently. The samples of the different subspaces are then reweighted by their integral values and stitched back together. This approach allows reducing sampling wall-clock time by parallel operation. It also improves sampling of multimodal target densities and results in less correlated samples. Finally, the approach yields an estimate of the integral of the target density function.
Markov chain Monte Carlo (MCMC) is a technique that allows generating samples with a distribution proportional to a given target density function. (In the following, we will use the term 'target density function' also for unnormalized, positive definite continuous functions.) This technique is widely used in Bayesian statistics, statistical mechanics, computational biology, and many other fields of research. One of the major strengths of this technique is that it can converge to the target density even if the target functions are highly multidimensional and multimodal. A major difficulty is that convergence is reached only asymptotically, and approaching the stationary distribution can require a very large number of sampling steps.

A Markov chain, by definition, consists of a series of consecutive steps that move a point or a set of points across the parameter space, which is an inherently serial process. A proposed displacement, by definition of the Markov process, does not depend on the history that led to the current location. Determining whether to accept a displacement involves evaluating the target density at the proposed location. For many real-world applications, evaluation of a target density can be computationally very costly, and there is usually a limit to how far a single target evaluation can be parallelized efficiently; this can make MCMC sampling very costly. A further complication stems from the fact that a large number of burn-in steps (the steps necessary for the MCMC to reach the stationary distribution) need to be performed for each MCMC chain before representative samples can be generated. The burn-in duration can even exceed the sampling time, especially for target densities that have a complex shape. While separate MCMC chains can be run independently and in parallel, simply increasing their number while producing fewer samples from each chain is therefore not an effective parallelization strategy, as the length of the burn-in process for each chain would not change.

Significant research has been conducted to enhance the efficiency of MCMC methods. The developments in this field can be divided into several categories [1]. The first is based on exploiting the geometry of the target density function. Hamiltonian Monte Carlo (HMC) [2] belongs to this category; it introduces an auxiliary variable, called momentum, which is refreshed using the gradient of the density function. Symplectic integrators of different orders of precision have been developed to approximate the Hamiltonian equations [3, 4]. HMC provides less correlated samples than the Metropolis-Hastings algorithm; however, the gradient of the density is not always readily available or cannot be computed in reasonable time. The second approach to accelerating MCMC is based on improving the proposal function. Techniques such as simulated tempering [5, 6], adaptive MCMC [7], and multi-proposal MCMC [8, 9] are available and have been shown to be effective for many applications [10, 11, 12, 13, 14]. The third approach is based on breaking initially complicated problems into simpler pieces. For example, separate MCMC chains can explore the parameter space in parallel and the resulting samples are then merged together [15, 16].
As discussed earlier, this approach does not simplify the convergence of the chains to the stationary distribution.

In this paper, we present an approach that improves MCMC sampling by partitioning the parameter space of the target density function into multiple subspaces and separately sampling the different subspaces. An effective partitioning can thereby change the task from sampling from a complicated target distribution to sampling from many, simpler target distributions. Any MCMC sampling algorithm can be used to generate samples in the subspaces, and all subspaces can be sampled independently and in parallel. This allows for massively parallel and distributed execution on multiple processors of multiple computer systems. After sampling, the integrated density of each subspace is calculated and the samples of each subspace are re-weighted correspondingly.

Our approach results in increased MCMC sampling efficiency, yielding samples with reduced correlations. It also reduces the required MCMC burn-in time significantly, since the possibly multiple modes of the full target density will ideally lie in separate subspaces; each chain will then only have to sample a unimodal density. In combination, these benefits result in a shorter total sampling time for each MCMC chain (including burn-in), which makes this approach of running many chains distributed over many subspaces efficient, whereas running many chains over the full density is not (due to the constant burn-in overhead).

We start with a brief review of Bayesian data analysis, as this is the primary application that we aim for. In section 3, the general idea of the algorithm and our implementation of it are described. In section 4, the performance of the algorithm is shown using a mixture of four multivariate normal distributions as a target density function. Finally, section 5 summarizes the developments presented in this paper and discusses further directions.

For a given model M, parameters λ, and data D, Bayes' theorem reads

$$ P(\lambda \mid \mathcal{D}, M) = \frac{P(\mathcal{D} \mid \lambda, M)\, P(\lambda \mid M)}{P(\mathcal{D} \mid M)} , \qquad (1) $$

where P(D | λ, M) denotes the likelihood that is used to update the prior probability density P(λ | M) of λ to the posterior probability density P(λ | D, M). The denominator is usually called the 'evidence' or 'marginal likelihood' and is given by the law of total probability,

$$ P(\mathcal{D} \mid M) = \int P(\mathcal{D} \mid \lambda, M)\, P(\lambda \mid M)\, d\lambda . \qquad (2) $$

MCMC methods do not require knowledge of the normalization constant P(D | M) to generate samples with the correct distribution. However, we require the normalizing constant in each subspace in order to reproduce the correct target distribution by patching together samples from different subspaces. Our approach thereby relies on being able to accurately calculate P_i(D | M), where i labels one of the subspaces. A variety of techniques exist that allow estimating the evidence of models; an overview can be found in [18]. These typically require a re-sampling of the target probability density function once modes have been identified, and only a few of the methods can be used in a post-processing step based on existing samples. We use the recently published AHMI algorithm [19] to evaluate the integrals in each subspace directly from the samples. This provides the user with correctly weighted samples and allows for a simple calculation of the evidence for the full target distribution. This is relevant for evaluating, amongst other things, the Bayes factor used in model comparisons.
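Since the subspaces form a disjoint cover of the parameter space, both the evidence and the posterior decompose over the partition. The following short derivation, which uses only Eqs. (1)-(2) and a disjoint partition {ω_k} of the parameter space (introduced in the next section), spells out the reweighting logic that the rest of the paper relies on:

$$ P(\mathcal{D} \mid M) = \sum_{k} \int_{\omega_k} P(\mathcal{D} \mid \lambda, M)\, P(\lambda \mid M)\, d\lambda \equiv \sum_{k} P_k(\mathcal{D} \mid M) , $$

$$ P(\lambda \mid \mathcal{D}, M) = \sum_{k} \frac{P_k(\mathcal{D} \mid M)}{P(\mathcal{D} \mid M)}\; P_k(\lambda \mid \mathcal{D}, M) , $$

where P_k(λ | D, M) denotes the posterior restricted to, and normalized within, the subspace ω_k. Samples generated within ω_k therefore only need to be weighted by the subspace evidence P_k(D | M), relative to the other subspaces, to reproduce the full posterior.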
We consider generating samples according to a target density function f(λ), where λ ∈ Ω ⊂ ℝ^m and Ω is the support of the function. To illustrate our method, we will use as an example the sum of four bivariate normal distributions with λ = (λ₁, λ₂),

$$ f(\lambda) = \sum_{i=1}^{4} a_i \, \mathcal{N}(\lambda \mid \mu_i, \Sigma_i) , \qquad (3) $$

where two of the weights a_i are 0.48 and the other two are small, the means µ_i place one mode in each quadrant of the (λ₁, λ₂) plane, and the Σ_i are 2 × 2 covariance matrices. Two of the modes, located in opposite quadrants, have the large weights (0.48) and the other two, in the remaining quadrants, have small weights.

The algorithm then proceeds in the following steps (a minimal code sketch of these steps is given below):

1. Generate N_exp exploration samples {λ*_i}, i = 1..N_exp ∈ Ω, distributed amongst N_chains chains, where N_exp is a small number compared to the desired number of final MCMC samples and N_chains is the number of chains. The chains should have different (possibly randomly chosen) starting points. The samples are used to find the region(s) of the parameter space with a high density, and the MCMC chains are not required to converge. An initial sampling of our example function with N_exp = 500, generated using N_chains = 25 with 20 samples per chain, is shown in Fig. 1-a.

2. Partition the parameter space into N_sp mutually exclusive subspaces {ω_k}, k = 1..N_sp, in such a way that ∪_k ω_k = Ω and ω_k ∩ ω_m = ∅ if k ≠ m (see also Fig. 1-b). While in general the boundaries of the subspaces could be arbitrary shapes, in the following 'N space partitions' will refer to N cuts along different, single parameter axes. This splits the parameter space into N + 1 rectangular subspaces.

3. Generate N^k_samp samples {λ^k_i}, i = 1..N^k_samp ∈ ω_k, in each subspace k with a distribution proportional to f(λ), using a sampling algorithm of choice (see Fig. 1-c). Note that each sampler has to perform its burn-in cycle only in the reduced subspace ω_k, which significantly reduces tuning time.

4. Determine the integrated density of the target distribution in each subspace by computing I_k = ∫_{ω_k} f(λ) dλ and assign the following weight to the samples of subspace k:

$$ w_k \propto \frac{I_k}{N^k_{\mathrm{samp}}} . $$
5. Stitch the now weighted samples together, resulting in the final sampling distribution (see Fig. 1-d).

There are many ways of implementing the described idea, based on the choice of samplers, integrators and space partitioning strategies. In the following, we describe our implementation, which is also made available in the Bayesian Analysis Toolkit package [20].
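For concreteness, the following minimal sketch walks through steps 1-5 for a one-dimensional, bimodal toy density. It uses a plain random-walk Metropolis-Hastings sampler and a simple Riemann-sum quadrature in place of the samplers and the AHMI integrator of our actual implementation; all names are illustrative and do not correspond to the BAT.jl API.

```julia
using Statistics

# Toy bimodal target (unnormalized); stands in for Eq. (3).
f(x) = 0.48 * exp(-0.5 * ((x + 3) / 0.5)^2) + 0.52 * exp(-0.5 * ((x - 3) / 0.5)^2)

# Plain random-walk Metropolis-Hastings restricted to the interval `support`.
function mh(f, support, x0, nsteps; stepsize = 0.5)
    xs = Float64[]
    x, fx = x0, f(x0)
    for _ in 1:nsteps
        y = x + stepsize * randn()
        if first(support) <= y <= last(support)
            fy = f(y)
            if rand() < fy / fx
                x, fx = y, fy
            end
        end
        push!(xs, x)
    end
    return xs
end

# Step 1: exploration samples from many short chains with random starting points.
exploration = vcat([mh(f, (-10.0, 10.0), 20 * rand() - 10, 20) for _ in 1:25]...)

# Step 2: one cut (here simply at the median of the exploration samples) -> two subspaces.
cut = median(exploration)
subspaces = [(-10.0, cut), (cut, 10.0)]

# Step 3: sample each subspace independently (this loop is trivially parallelizable).
samples = [mh(f, sp, 0.5 * (sp[1] + sp[2]), 10_000) for sp in subspaces]

# Step 4: integrate f over each subspace (simple Riemann sum in place of AHMI)
# and assign per-sample weights w_k proportional to I_k / N_samp^k.
integral(sp) = (grid = range(sp[1], sp[2]; length = 10_000); sum(f.(grid)) * step(grid))
Ik = [integral(sp) for sp in subspaces]
wk = Ik ./ length.(samples)

# Step 5: stitch the weighted samples together; a weighted histogram of `stitched`
# approximates f, and sum(Ik) estimates its integral.
stitched = [(x, wk[k]) for k in eachindex(subspaces) for x in samples[k]]
println("estimated integral of f: ", sum(Ik))
```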
Exploration samples play an important role in this algorithm, since the parameter space is partitioned based on them. If the exploration samples represent the structure of the target density closely enough (for example, indicating the presence of multiple modes by clusters of spatially neighboring points), then the space partitioning algorithm can capture these features and generate partitions in such a way as to split those clusters. This simplifies the target density in each subspace and thus allows for much faster burn-in and tuning procedures. In our implementation, we generate exploration samples by running a large number of MCMC chains, where each chain generates a few hundred samples. There is no tuning or convergence requirement for these chains, but a small set of samples is initially used to set the parameters of the proposal functions for each chain. Some knowledge of the form of the target distribution is useful in determining how many chains and how many samples will be necessary. While the morphology of the resulting sample clouds should resemble that of the target density as closely as possible, this initial exploration should be fast compared to the subsequent sampling time in the partitioned space.
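A minimal sketch of such an exploration stage is shown below, assuming a one-dimensional problem: many short chains are started at random points, and each chain's proposal width is set from a small pilot run. The Gaussian random-walk proposal and all names are illustrative simplifications, not the BAT.jl interface.

```julia
using Statistics

# Pool short, untuned chains started at random points within [lo, hi].
function explore(logf, lo, hi; nchains = 25, npilot = 50, nsteps = 200)
    run_chain(x0, width, n) = begin
        xs = Float64[]
        x, lp = x0, logf(x0)
        for _ in 1:n
            y = x + width * randn()
            if lo <= y <= hi && log(rand()) < logf(y) - lp
                x, lp = y, logf(y)
            end
            push!(xs, x)
        end
        xs
    end
    pooled = Float64[]
    for _ in 1:nchains
        x0    = lo + (hi - lo) * rand()                   # random starting point
        pilot = run_chain(x0, 0.1 * (hi - lo), npilot)    # small pilot run per chain
        width = max(std(pilot), 0.01 * (hi - lo))         # proposal scale from the pilot
        append!(pooled, run_chain(last(pilot), width, nsteps))
    end
    return pooled                                          # roughly nchains * nsteps samples
end

logf(x) = log(exp(-0.5 * ((x + 3) / 0.5)^2) + exp(-0.5 * ((x - 3) / 0.5)^2))
exploration_samples = explore(logf, -10.0, 10.0)
```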
Given the discussed exploration samples, we partition the parameter space into rectangular subspaces in such a way as to split clusters of spatially neighboring samples. To do so, a binary tree is used where each node is determined by a cut that is orthogonal to one of the parameter axes. For the sake of illustration, we consider a one-dimensional problem and exploration samples {λ*} (see the upper histogram in Fig. 2). The cut position perpendicular to the λ axis is denoted by λ̃ and is selected by finding the minimum of the following cost function:

$$ \tilde{\lambda} = \arg\min_a W(a, \lambda^*) , \qquad W(a, \lambda^*) = \sum_{\lambda^*_i < a} \bigl| \lambda^*_i - \langle \lambda^* \rangle_{\lambda < a} \bigr| \; + \sum_{\lambda^*_i > a} \bigl| \lambda^*_i - \langle \lambda^* \rangle_{\lambda > a} \bigr| , \qquad (4) $$

where ⟨λ*⟩_{λ<a} and ⟨λ*⟩_{λ>a} denote the means of the samples below and above the cut position a, respectively. This process is then repeated iteratively, resulting in the desired number of partitions.

Figure 1: Color maps displaying the steps in the partitioned sampling approach. (a) The 500 samples from the exploration step for the target density defined in Eq. 3; the red dashed lines show contours of the true density. (b) The result of partitioning the parameter space into 30 subspaces; the black lines show the boundaries of the subspaces. (c) The samples generated in each subspace by the individual MCMC chains. (d) The weighted samples displayed in the full space.

The blue lines in Fig. 2 demonstrate how this cost function depends on the cut position for 3 partitioning steps. The partitioning procedure is ended when a minimal change in the cost function results from further partitioning or when a maximum number of subspaces is reached. The evolution of the cost function for our example is also shown in Fig. 2.

Figure 2: Illustration of the space partitioning algorithm using one-dimensional exploration samples λ* with the distribution shown in the upper left histogram. The red dashed lines show the first three cut positions λ̃. The blue lines show the value of W(a, λ*) as a function of the cut position for 3 iterations of space partitioning. The gray dashed lines show the value of the cost function at its minimum for each iteration. The bottom right subplot shows W as a function of the number of cuts.

The partitioning procedure is analogous for higher dimensional spaces. If, for instance, a sample vector is M-dimensional, then we evaluate Eq. 4 for every dimension, which results in proposed cut positions λ̃_i with corresponding cost values W_i(a_i, λ̃_i) for dimensions i = 1, ..., M. The minimum cost value is selected and the cut along the corresponding dimension is accepted. Additionally, if preliminary knowledge about the structure of the target density is available, the user can specify manually along which parameters the partitioning of the parameter space should be performed.
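The following sketch implements the cut selection of Eq. (4) in plain Julia: the cost of a candidate cut is the summed absolute deviation of the samples from their mean on either side of the cut, the candidate with the smallest cost is chosen and, in several dimensions, the cheapest axis wins. The brute-force scan over candidate positions and the function names are illustrative simplifications of our binary-tree implementation.

```julia
using Statistics

# Cost W(a, λ*) of cutting a set of one-dimensional samples at position a.
function cut_cost(xs::AbstractVector{<:Real}, a::Real)
    lo = [x for x in xs if x < a]
    hi = [x for x in xs if x >= a]
    (isempty(lo) || isempty(hi)) && return Inf
    return sum(abs.(lo .- mean(lo))) + sum(abs.(hi .- mean(hi)))
end

# Best cut along one axis: scan the candidate positions given by the sample values
# (an O(N^2) scan, which is sufficient for a sketch).
function best_cut(xs::AbstractVector{<:Real})
    candidates = sort(unique(xs))[2:end]
    costs = [cut_cost(xs, a) for a in candidates]
    i = argmin(costs)
    return candidates[i], costs[i]
end

# In M dimensions: evaluate the best cut for every axis and keep the cheapest one.
function best_axis_cut(samples::AbstractMatrix{<:Real})   # samples is N x M
    results = [best_cut(view(samples, :, d)) for d in 1:size(samples, 2)]
    d = argmin([cost for (pos, cost) in results])
    return (axis = d, position = results[d][1])
end

# Example: exploration samples with two clusters along the first axis only;
# the selected cut lies between the clusters, on axis 1.
xs  = vcat(randn(300) .- 3, randn(300) .+ 3)
pos, cost = best_cut(xs)
best_axis_cut(hcat(xs, randn(600)))
```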
Sampling in the subspaces is performed independently and does not require communication between the MCMC processes. It can therefore be trivially divided into tasks and executed in parallel on multiple processors using distributed computing. In the following, we define a 'worker' as a computing unit (this can be a node of a cluster, a networked machine, or a single machine) that consists of multiple CPU cores and is used to perform one task, i.e., to sample the target density in one subspace. All the cores that belong to one worker are called threads and are used to run multiple MCMC chains in parallel within one subspace. Running multiple chains within one subspace is advantageous for determining the convergence of the MCMC process. Our implementation allows running subspaces on multiple remote hosts using Julia's support for compute clusters, and MPI/TCP/IP protocols can be used for communication between workers. By default, the Metropolis-Hastings algorithm is used to generate MCMC samples in each subspace; however, any other sampling algorithm can be used.
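A minimal sketch of this parallelization pattern with Julia's Distributed standard library is shown below: one task per subspace, dispatched with pmap over locally added worker processes (remote hosts would be added analogously). The toy target and the per-subspace Metropolis-Hastings sampler are illustrative stand-ins, not the BAT.jl machinery, and running several chains per worker in threads is omitted.

```julia
using Distributed
addprocs(4)          # local worker processes; remote hosts could be added instead

@everywhere begin
    # Toy bimodal target; each worker needs its own copy of the definitions.
    f(x) = exp(-0.5 * ((x + 3) / 0.5)^2) + exp(-0.5 * ((x - 3) / 0.5)^2)

    # Random-walk Metropolis-Hastings restricted to one subspace sp = (lo, hi).
    function sample_subspace(sp; nsteps = 10_000, stepsize = 0.5)
        lo, hi = sp
        x  = 0.5 * (lo + hi)
        xs = Float64[]
        for _ in 1:nsteps
            y = x + stepsize * randn()
            if lo <= y <= hi && rand() < f(y) / f(x)
                x = y
            end
            push!(xs, x)
        end
        return xs
    end
end

subspaces = [(-10.0, 0.0), (0.0, 10.0)]
samples_per_subspace = pmap(sample_subspace, subspaces)   # one parallel task per subspace
```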
3.2.4 Reweighting

Samples that originate from different subspaces have different, and a priori unknown, normalizations with respect to each other. In order to correctly stitch them together, a weight proportional to the integral of the target density within the subspace needs to be applied. Given that samples {λ^k} ∼ f(λ) are drawn from the target function in each subspace k, we compute the integrals

$$ I_k = \int_{\omega_k} f(\lambda)\, d\lambda . \qquad (5) $$

As noted earlier, we use the Adaptive Harmonic Mean Integration (AHMI) algorithm. Given samples drawn according to a density proportional to the function, the AHMI algorithm estimates the integral of the function and the uncertainty of this estimate, acting on the samples as a post-processing step with no need to re-evaluate the target density. The AHMI algorithm in its current form has been used to accurately integrate functions in up to 20 dimensions.

Once the weighting is performed, the samples from the multiple subspaces are concatenated and returned to the user. The total integral of the function f(λ) is then estimated by summing the weights of the subspaces, I = Σ_{k=1..N_sp} I_k. Our implementation was used to generate the results shown in Fig. 1.

In this section, we evaluate the performance of our algorithm on a more complicated test density function. The function was chosen in such a way as to (a) have a known analytic integral, (b) allow generating independent and identically distributed (i.i.d) samples, and (c) have multiple modes in many dimensions and thus be challenging for a classical Metropolis-Hastings algorithm. With this aim, we have chosen a mixture of four multivariate normal distributions in 9-dimensional space:

$$ f(\lambda) = \sum_{i=1}^{4} a_i\, \mathcal{N}(\lambda \mid \mu_i, \Sigma_i) , \qquad (6) $$

where all a_i = 1/4 and the µ_i and Σ_i are randomly assigned mean vectors and covariance matrices. The full parameter set used for this performance test is given in the Appendix. Figure 3 illustrates the one- and two-dimensional distributions of i.i.d samples drawn from this density; a sketch of how a test density of this type and the corresponding i.i.d reference samples can be constructed is shown below.

Figure 3: One- and two-dimensional distributions of the density function given by Eq. 6. Histograms are constructed using i.i.d samples.

There are two primary points that we demonstrate in this section. The first is the ability to reduce the wall-clock time spent on sampling by efficiently utilizing computational resources. The second is the ability to improve the quality of the samples as the number of space partitions is increased. The performance measurements were set up as follows:

• We use a varying number of subspaces, S = (1, 2, 4, 8, 16, 32).

• In addition, we also vary the wall-clock time that workers can spend on generating samples, considering four time intervals between 3 and 15 seconds.

• For every combination of space partitions and wall-clock times, we repeat the sampling process 3 times to evaluate statistical fluctuations.

The overall number of MCMC runs is 72. An example run is: 8 subspaces and 8 workers with 10 CPU cores each are used to sample 10 chains for 11 seconds of wall-clock time, after which the samples are integrated and returned; sampling with this setting is repeated 3 times.
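As an illustration, a density of the type of Eq. (6) and i.i.d reference samples can be constructed with Distributions.jl as sketched here; the means and variances below are arbitrary placeholders, not the values listed in the Appendix.

```julia
using Distributions, LinearAlgebra, Random

Random.seed!(42)
dim = 9
mus  = [10 .* randn(dim) for _ in 1:4]                      # placeholder mean vectors
covs = [Diagonal(1.0 .+ 30.0 .* rand(dim)) for _ in 1:4]    # placeholder diagonal covariances
mix  = MixtureModel([MvNormal(m, C) for (m, C) in zip(mus, covs)], fill(0.25, 4))

f(x)    = pdf(mix, x)         # target density, a_i = 1/4 for all four components
logf(x) = logpdf(mix, x)      # what an MCMC sampler would actually evaluate
iid     = rand(mix, 10_000)   # 9 x 10_000 matrix of i.i.d reference samples
```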
A summary of the benchmark runs is shown in Fig. 4. We used the MPCDF HPC system DRACO with Intel 'Haswell' Xeon E5-2698 processors (880 nodes with 32 cores @ 2.3 GHz each) to perform the parallel MCMC executions. When sampling with space partitions, each subspace requires a different amount of time for sampling and integration, depending on the complexity of the underlying density region; the time of the slowest one is reported in Fig. 4. It can be seen that, by changing the number of subspaces from 1 to 32 (and the number of total CPU cores from 10 to 320), the number of generated samples increases by almost two orders of magnitude while the wall-clock time remains constant.

Figure 4: Summary of the benchmark runs for the target density function given by Eq. 6. The different colors represent runs with different numbers of partitions; the number of subspaces is denoted by S (from S = 1 with 10 cores up to S = 32 with 320 cores). The vertical axis is common to all subplots and gives the ratio of the total number of samples generated per single run to the number of samples generated if no space partition is performed (N_ref). Left subplot: the horizontal axis shows the ratio of the time spent on sampling and integration to the time that a single worker spent if no space partition is performed (t_ref). Middle subplot: the horizontal axis shows the ratio of the integral estimate to the true value; error bars are obtained from the integration algorithm. Right subplot: the horizontal axis shows the ratio of the effective number of samples (separately for each dimension) to the total number of samples. The effective number of samples is estimated per dimension and the error bars represent the standard deviation across dimensions. Dashed colored lines represent the average fraction over the runs with the same number of space partitions.

Figure 4 can be rearranged into a slightly different form in order to demonstrate the sampling rate. We define N as the number of samples that the sampler with no space partitions has generated during the time interval Δt = t_stop − t_start (t_stop is the wall-clock time when integration has finished and t_start is the wall-clock time when sampling in the subspace has started). We further denote by N_k the total number of samples from the run with k subspaces, and the time spent on each subspace by Δt_k. The sampling rate is defined as

$$ S = \frac{N_k}{\max_k \Delta t_k} \cdot \frac{\Delta t}{N} . \qquad (7) $$

In addition to the wall-clock time, we measure the CPU time spent on sampling and integration in each subspace using the CPUTime.jl package (a detailed definition of the CPU time can be found in the package documentation). We denote it by τ_i, where i is the subspace index, and τ is the CPU time when no space partitions are used. The per-chain sampling rate is defined as

$$ S_{\text{per-chain}} = \frac{N_k}{\sum_{i=1}^{k} \tau_i} \cdot \frac{\tau}{N} . \qquad (8) $$

Eq. 7 and Eq. 8 do not include the time spent on the generation of exploration samples and the construction of the partition tree. The time spent on the generation of exploration samples depends primarily on the complexity of a likelihood evaluation and, for our problem, it is equal to 4 seconds. The time required to generate the space partition tree depends primarily on the number of exploration samples; it is about 2 seconds for our problem.

Figure 5: Sampling rate (upper subplot) and per-chain sampling rate (lower subplot) versus the number of space partitions. The gray lines represent the average over 3 runs.

The sampling rate and the per-chain sampling rate versus the number of space partitions are presented in Fig. 5. The figure shows that the sampling rates are improved in both cases. The improvement in the per-chain sampling rate indicates that, by partitioning the parameter space, we simplify the target density function, resulting in faster tuning and convergence (tuning and convergence occur in every subspace). While the improvement in the sampling rate is expected due to the scaling of the number of CPU cores, its faster-than-linear behavior can be explained by a superposition of the improved per-chain sampling rate and the linear sampling rate.

4.2 Density Integration and Effective Sample Size

Another important characteristic to track is the integral estimate of the target density function. If, for example, the samples do not correctly represent the target function, then the integral will deviate from the truth. By partitioning the parameter space we simplify the task for both the sampler and the integrator; the complicated problem has been split into a number of simpler ones. This results in better integral estimates, which can be seen in Fig. 4 (middle subplot).

Samples that originate from one MCMC chain are correlated. The degree of sample correlation depends on the acceptance probability, the number of chains, the complexity of the target density function, etc. The effective number of samples can be estimated as

$$ N_{\mathrm{eff}} = \frac{N}{\hat{\tau}} , \qquad (9) $$

where N is the number of samples and τ̂ is the integrated autocorrelation time, estimated via the normalized autocorrelation function ρ̂(τ):

$$ \hat{\tau}_k = 1 + 2 \sum_{\tau=1}^{\infty} \hat{\rho}_k(\tau) , \qquad (10) $$

$$ \hat{\rho}_k(\tau) = \frac{\hat{c}_k(\tau)}{\hat{c}_k(0)} , \qquad (11) $$

$$ \hat{c}_k(\tau) = \frac{1}{N - \tau} \sum_{n=1}^{N-\tau} \bigl( \lambda_{k,n} - \hat{\lambda}_k \bigr) \bigl( \lambda_{k,n+\tau} - \hat{\lambda}_k \bigr) , \qquad (12) $$

where k refers to one of the M dimensions of the multivariate sample λ_n = {λ_{1,n}, ..., λ_{M,n}} and λ̂_k is the average of the k-th component over all N samples. In order to compute the sum given by Eq. 10, we use a heuristic cut-off given by Geyer's initial monotone sequence estimator [21]. This technique allows us to calculate an effective sample size for each dimension, N_{eff,k} = N / τ̂_k; a sketch of this computation is given below. As shown in Fig. 4 (right subplot), the effective number of samples increases with the number of space partitions. It can also be seen that, in our example, there is no increase in the fraction of effective samples when the number of subspaces exceeds 8.
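The following plain-Julia sketch estimates the effective sample size for a single dimension of a chain, with the sum of Eq. (10) truncated by an initial-monotone-sequence rule in the spirit of [21]; the function names are illustrative.

```julia
using Statistics

# Normalized autocorrelation rho(tau) = c(tau) / c(0) of a one-dimensional chain.
function autocorrelation(x::AbstractVector{<:Real}, tau::Integer)
    N, xbar = length(x), mean(x)
    c(t) = sum((x[n] - xbar) * (x[n + t] - xbar) for n in 1:(N - t)) / (N - t)
    return c(tau) / c(0)
end

# Integrated autocorrelation time 1 + 2 * sum_tau rho(tau), truncated by pairing
# consecutive lags, stopping at the first non-positive pair sum, and forcing the
# pair sums to be non-increasing (Geyer-style initial monotone sequence).
function integrated_autocorr_time(x::AbstractVector{<:Real})
    pairs = Float64[]
    m = 0
    while 2m + 1 < length(x) - 1
        g = autocorrelation(x, 2m) + autocorrelation(x, 2m + 1)
        g <= 0 && break                                     # initial positive sequence
        push!(pairs, isempty(pairs) ? g : min(g, last(pairs)))  # enforce monotonicity
        m += 1
    end
    return 2 * sum(pairs) - 1
end

effective_sample_size(x) = length(x) / integrated_autocorr_time(x)

# Example: a strongly autocorrelated AR(1) chain has far fewer effective samples.
x = accumulate((xp, e) -> 0.9 * xp + e, randn(10_000); init = 0.0)
println(effective_sample_size(x), " effective samples out of ", length(x))
```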
A summary of the performance enhancement from partitioning and sampling in parallel is presented in Fig. 6. The accuracy of the integral of the function for different numbers of subspaces, as a function of the sampling wall-clock time, is shown in the left panel. There, we see that even with the longest running times tested, the integral estimate without partitioning deviates considerably from the correct value. Good results are seen already for the shortest running times with 4 subspaces, and running on 32 subspaces gives excellent results for all running times tested.

The right panel in Fig. 6 shows the average number of effective samples for different combinations of numbers of subspaces and running times. We find a dramatic increase in the number of effective samples: the factor achieved in the effective number of samples is an order of magnitude larger than the increase in the computing resources (number of processors). This is due to the much simpler forms of the distributions sampled in the subspaces. Both of these results show that a strong scaling of the performance is achieved using the partitioning scheme that we have outlined.

Figure 6: The ratio of the integral to the true value (left panel) and the average number of effective MCMC samples (right panel) for the different numbers of subspaces and sampling times. In the right panel, the values are referenced to the average effective number of samples generated for one subspace and a running time of 3 seconds: N_ref = 267.

Since our algorithm requires weighting samples from different subspaces and stitching them together, we test whether this results in a smooth posterior approximating the true target density function. For comparison, we also approximate the true density by directly generating i.i.d samples. In the following, we use the two-sample Kolmogorov-Smirnov test [22] and a two-sample classifier test [23] as quantitative assessments of how close our MCMC samples are to the i.i.d ones.
The two-sample Kolmogorov-Smirnov test is used to test whether two one-dimensional marginalized samples come from the same distribution. P-values from this test for every marginal and for different numbers of space partitions are shown in Fig. 7. We use the effective number of samples to calculate the p-values for the Kolmogorov-Smirnov test. If two sets of samples stem from the same distribution, then the Kolmogorov-Smirnov p-values should be uniformly distributed. It can be seen that for a small number of space partitions the p-values peak around 0 and 1. Peaks close to 1 indicate that the effective number of MCMC samples is likely underestimated, demonstrating that the samples are very correlated. In contrast, the peaks near p-values of zero indicate that the marginals of the i.i.d and MCMC distributions deviate from each other. When using more than 4 partitions, the p-values uniformly cover the range from 0 to 1, indicating that the marginals of the i.i.d and MCMC samples are in good agreement. A sketch of such a test for a single marginal is given below.

Figure 7: Histograms of the p-values from the two-sample Kolmogorov-Smirnov test used to verify whether the MCMC and i.i.d samples come from the same distribution. Different colors represent different numbers of space partitions. One set of MCMC samples results in 9 p-values, one for every dimension.
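In the sketch below, the two-sample KS statistic is the largest distance between the two empirical distribution functions, and the p-value follows from the asymptotic Kolmogorov distribution, with the effective (rather than raw) MCMC sample size entering the calculation as described above. The implementation is plain Julia and is only meant to illustrate the procedure.

```julia
# Two-sample Kolmogorov-Smirnov statistic for one marginal.
function ks_statistic(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    sx, sy = sort(x), sort(y)
    ecdf_at(s, v) = searchsortedlast(s, v) / length(s)     # empirical CDF of s at v
    return maximum(abs(ecdf_at(sx, v) - ecdf_at(sy, v)) for v in vcat(sx, sy))
end

# Asymptotic p-value; nx and ny are the (effective) sizes of the two samples.
function ks_pvalue(d::Real, nx::Real, ny::Real)
    z = d * sqrt(nx * ny / (nx + ny))
    p = 2 * sum((-1)^(j - 1) * exp(-2 * j^2 * z^2) for j in 1:1000)
    return clamp(p, 0.0, 1.0)
end

# Example with stand-in data: compare one MCMC marginal against the i.i.d reference,
# replacing the raw MCMC sample size by its effective sample size N_eff.
mcmc_marginal = randn(5_000)
iid_marginal  = randn(5_000)
d = ks_statistic(mcmc_marginal, iid_marginal)
p = ks_pvalue(d, 5_000, 5_000)     # for real MCMC samples, use N_eff instead of 5_000
```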
A second approach to determining whether two samples are similar to one another uses a binary classifier that aims to distinguish them. A training dataset is constructed by pairing MCMC and i.i.d samples with opposite labels. If the samples are indistinguishable, the classification accuracy on the test dataset should be close to that obtained from a random guess. To train a classifier, we use a simple neural network model with two dense layers of sizes 9 × 20 and 20 × 2, and the sigmoid activation function. To construct the training and testing datasets, we generate MCMC and i.i.d samples with equal weights. By default, i.i.d samples come with weight one; for our weighted MCMC samples, we generate samples with unit weights by using the ordered resampling implemented in [20]. The training and testing datasets contain equal fractions of MCMC and i.i.d samples. Training is performed individually for every MCMC run described in Fig. 4, and the results are presented in Fig. 8. The left subplot in Fig. 8 shows a modified receiver operating characteristic (ROC), where the vertical axis is the difference between the true positive rate (TPR) and the false positive rate (FPR) for the different MCMC runs. If the classifier cannot distinguish the two samples, then the line will fluctuate around zero. The right subplot in Fig. 8 shows the integral under the ROC curve (expected to be close to 0.5 for indistinguishable distributions) versus the number of MCMC samples (before the resampling was performed). It can be seen that, even though the training datasets consist of the same number of MCMC samples, there is a difference in their distinguishability. Samples obtained from the runs with a large number of space partitions have ROC curve integrals much closer to 0.5. It was not possible to detect this difference in the 1d Kolmogorov-Smirnov tests. A sketch of a classifier two-sample test of this kind is given below.

Figure 8: Results of the classifier two-sample test performed to distinguish i.i.d samples from MCMC samples obtained with space partitioning. The left subplot shows a modified receiver operating characteristic (ROC), where TPR stands for true positive rate and FPR for false positive rate. Different lines correspond to different numbers of subspaces (S). The right subplot shows the area under the ROC curve versus the number of samples.
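The sketch below illustrates the idea of the classifier two-sample test with a plain logistic-regression classifier trained by gradient descent, in place of the small two-layer network used above, and with the area under the ROC curve computed from the rank (Mann-Whitney) formula; all names and the toy data are illustrative.

```julia
using Random

# Train w, b so that sigmoid(w'x + b) predicts the label (1 = MCMC, 0 = i.i.d).
function train_logreg(X::AbstractMatrix, y::AbstractVector; epochs = 500, lr = 0.05)
    d, n = size(X)                                # X is d x n, one sample per column
    w, b = zeros(d), 0.0
    for _ in 1:epochs
        p = 1 ./ (1 .+ exp.(-(X' * w .+ b)))      # predicted probabilities
        g = p .- y                                # gradient of the cross-entropy loss
        w .-= lr .* (X * g) ./ n
        b  -= lr * sum(g) / n
    end
    return w, b
end

# Area under the ROC curve from scores, via the rank (Mann-Whitney) formula.
function auc(scores::AbstractVector, y::AbstractVector)
    ranks = sortperm(sortperm(scores))            # rank of each score, 1 = smallest
    npos  = count(==(1), y)
    nneg  = length(y) - npos
    return (sum(ranks[y .== 1]) - npos * (npos + 1) / 2) / (npos * nneg)
end

# Toy example: slightly shifted "MCMC" samples are partially distinguishable,
# so the test AUC lies somewhat above 0.5; identical distributions give AUC near 0.5.
d, n  = 9, 4_000
iid   = randn(d, n)
mcmc  = randn(d, n) .+ 0.1                        # stand-in for resampled MCMC samples
X     = hcat(mcmc, iid)
y     = vcat(ones(n), zeros(n))
perm  = randperm(2n)
tr, te = perm[1:n], perm[(n + 1):end]             # split into training and test sets
w, b  = train_logreg(X[:, tr], y[tr])
scores = X[:, te]' * w .+ b
println("test AUC: ", auc(scores, y[te]))
```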
We have presented an approach to both improve and accelerate MCMC sampling by partitioning the parameter space of the target density function into multiple subspaces and sampling independently in each subspace. These subspaces can be sampled in parallel and the resulting samples are then stitched together with appropriate weighting. The scheme relies on a good space partitioning, which we achieve using a binary partitioning algorithm, and on a good integrator for determining the weights assigned to the samples in the different subspaces. The integrations in our examples were performed using the AHMI algorithm [19]. The parallelized MCMC sampling algorithm described in this paper has been implemented in the Bayesian Analysis Toolkit [20].

We have benchmarked this technique by evaluating the quality of the samples and the sampling rate for a mixture of four multivariate normal distributions in 9-dimensional space. We demonstrate that the space partitioning allows us to obtain a 50-fold increase in the sampling rate while increasing the number of CPUs by a factor of 32. This increase is a superposition of two effects: a linear scaling with the number of CPU cores, and a CPU-time reduction due to the simplification of the target density function. In addition to the increase in the sampling rate, sampling with space partitioning also resulted in an increased quality of the MCMC samples by reducing their correlations. This was evidenced in particular by more accurate integral values of the target density.

We have evaluated the correctness of the resulting sampling distributions by comparing the MCMC samples with i.i.d samples using a two-sample Kolmogorov-Smirnov test and a two-sample classifier test. Both show that increasing the number of space partitions leads to a better agreement between the MCMC and i.i.d samples. Finally, this approach provides the user with the normalization constant of the target density function: the Bayesian evidence is obtained at no extra cost.
Appendix
Covariance matrices used in Eq. 6 are diagonal with size 9 × 9,

$$ \Sigma_1 = \mathrm{diag}(12.\ldots,\, \ldots),\quad \Sigma_2 = \mathrm{diag}(10.\ldots,\, \ldots),\quad \Sigma_3 = \mathrm{diag}(33.\ldots,\, \ldots),\quad \Sigma_4 = \mathrm{diag}(27.\ldots,\, \ldots), \qquad (13) $$

and the mean vectors are

$$ \mu_1 = (4.\ldots,\, \ldots),\quad \mu_2 = (2.\ldots,\, \ldots),\quad \mu_3 = (-\ldots,\, \ldots),\quad \mu_4 = (-\ldots,\, \ldots). \qquad (14) $$

References

[1] Robert, Christian P., et al. "Accelerating MCMC algorithms." Wiley Interdisciplinary Reviews: Computational Statistics 10.5 (2018): e1435.
[2] Duane, Simon, et al. "Hybrid Monte Carlo." Physics Letters B 195.2 (1987): 216-222.
[3] Leimkuhler, Benedict, and Sebastian Reich. Simulating Hamiltonian Dynamics. Vol. 14. Cambridge University Press, 2004.
[4] Blanes, Sergio, Fernando Casas, and J. M. Sanz-Serna. "Numerical integrators for the Hybrid Monte Carlo method." SIAM Journal on Scientific Computing 36.4 (2014): A1556-A1580.
[5] Geyer, Charles J. "Markov chain Monte Carlo maximum likelihood." (1991).
[6] Marinari, Enzo, and Giorgio Parisi. "Simulated tempering: a new Monte Carlo scheme." EPL (Europhysics Letters) 19.6 (1992): 451.
[7] Douc, Randal, et al. "Convergence of adaptive mixtures of importance sampling schemes." The Annals of Statistics 35.1 (2007): 420-448.
[8] Liu, Jun S., Faming Liang, and Wing Hung Wong. "The multiple-try method and local optimization in Metropolis sampling." Journal of the American Statistical Association 95.449 (2000): 121-134.
[9] Bédard, Mylène, Randal Douc, and Eric Moulines. "Scaling analysis of multiple-try MCMC methods." Stochastic Processes and their Applications 122.3 (2012): 758-786.
[10] Laloy, Eric, and Jasper A. Vrugt. "High-dimensional posterior exploration of hydrologic models using multiple-try DREAM (ZS) and high-performance computing." Water Resources Research 48.1 (2012).
[11] Neal, Radford M. "Sampling from multimodal distributions using tempered transitions." Statistics and Computing 6.4 (1996): 353-366.
[12] Xie, Yun, Jian Zhou, and Shaoyi Jiang. "Parallel tempering Monte Carlo simulations of lysozyme orientation on charged surfaces." The Journal of Chemical Physics 132.6 (2010): 02B602.
[13] Carter, J. N., and D. A. White. "History matching on the Imperial College fault model using parallel tempering." Computational Geosciences 17.1 (2013): 43-65.
[14] Nampally, Arun, and C. R. Ramakrishnan. "Adaptive MCMC-based inference in probabilistic logic programs." arXiv preprint arXiv:1403.6036 (2014).
[15] Mykland, Per, Luke Tierney, and Bin Yu. "Regeneration in Markov chain samplers." Journal of the American Statistical Association 90.429 (1995): 233-241.
[16] Jacob, Pierre E., John O'Leary, and Yves F. Atchadé. "Unbiased Markov chain Monte Carlo with couplings." arXiv preprint arXiv:1708.03625 (2017).
[17] Gelfand, Alan E., and Adrian F. M. Smith. "Sampling-based approaches to calculating marginal densities." Journal of the American Statistical Association 85.410 (1990): 398-409.
[18] Friel, Nial, and Jason Wyse. "Estimating the evidence – a review." Statistica Neerlandica 66 (2011): 288-308.
[19] Caldwell, Allen, et al. "Integration with an Adaptive Harmonic Mean Algorithm." arXiv preprint arXiv:1808.08051v2 (2020).
[20] The Bayesian Analysis Toolkit repository, https://github.com/bat/BAT.jl