Bayesian Fusion: Scalable unification of distributed statistical analyses
Hongsheng Dai †
University of Essex, Colchester, U.K.
Murray Pollock
Newcastle University, Newcastle-upon-Tyne, and
The Alan Turing Institute, London, U.K.
Gareth O. Roberts
University of Warwick, Coventry, and
The Alan Turing Institute, London, U.K.
Summary. There has recently been considerable interest in addressing the problem of unifying distributed statistical analyses into a single coherent inference. This problem naturally arises in a number of situations, including in big-data settings, when working under privacy constraints, and in Bayesian model choice. The majority of existing approaches have relied upon convenient approximations of the distributed analyses. Although typically computationally efficient, and readily scaling with respect to the number of analyses being unified, approximate approaches can have significant shortcomings: the quality of the inference can degrade rapidly with the number of analyses being unified, and can be substantially biased even when unifying a small number of analyses that do not concur. In contrast, the recent Fusion approach of Dai et al. (2019) is a rejection sampling scheme which is readily parallelisable and is exact (avoiding any form of approximation other than Monte Carlo error), albeit limited in applicability to unifying a small number of low-dimensional analyses. In this paper we introduce a practical Bayesian Fusion approach. We extend the theory underpinning the Fusion methodology and, by embedding it within a sequential Monte Carlo algorithm, we are able to recover the correct target distribution. By means of extensive guidance on the implementation of the approach, we demonstrate theoretically and empirically that Bayesian Fusion is robust to increasing numbers of analyses, and coherently unifies analyses which do not concur. This is achieved while remaining computationally competitive with approximate schemes.
Keywords: Bayesian inference; Distributed data; Fork-and-join; Langevin diffusion; Sequential Monte Carlo
1. Introduction
There has recently been considerable interest in developing methodology to combine distributed statistical inferences into a single (Bayesian) inference. This distributed scenario can arise for a number of practically compelling reasons. For instance, it can arise by construction in large data settings where, to circumvent the memory constraints of a single machine, we split the available data set across C machines (which we term cores †) and conduct C separate inferences (Scott et al., 2016). Other modern instances appear when working under confidentiality constraints, where pooling the underlying data would be deemed a data privacy breach (for instance, Yıldırım and Ermiş (2019)), and in model selection (Buchholz et al., 2019). More classical instances of this common scenario appear in Bayesian meta-analysis (see for example Fleiss (1993); Smith et al. (1995)), and in constructing priors from multiple expert elicitation (Berger, 1980; Genest and Zidek, 1986).

† Address for correspondence: Hongsheng Dai, Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, U.K. Email: [email protected]

In this article we present general statistical methodology to address this fusion problem. We term each of the C inferences across C cores that we wish to unify a sub-posterior, denoted by $f_c(x)$ for $c \in \{1, \ldots, C\}$. The natural manner in which to unify the sub-posteriors is by considering the product pooled posterior density (which we term the fusion density),

$$f(x) \propto f_1(x) \cdots f_C(x). \quad (1)$$

Our goal is to produce a Monte Carlo sample from (1). For convenience, and in common with many existing approaches (Scott et al., 2016; Neiswanger et al., 2013; Xue and Liang, 2019; Wang and Dunson, 2013), we will assume in this article that independent samples from each sub-posterior are readily available and that it is possible to evaluate each sub-posterior point-wise. As discussed later, neither of these is a limiting factor for our methodology.

Specific applications, such as those we used to introduce the fusion problem, have a number of constraints and considerations unique to them. For instance, in the large data setting particular consideration may be given to latency and computer architectures (Scott et al., 2016), whereas in the confidentiality setting of Yıldırım and Ermiş (2019) one may be constrained in the number and type of mathematical operations conducted.
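To make the pooling in (1) concrete: if every sub-posterior were Gaussian, the fusion density would itself be Gaussian, with precision equal to the sum of the sub-posterior precisions. A minimal one-dimensional sketch, with purely illustrative values; the function name is our own:

```python
def gaussian_product(means, variances):
    """Pool C Gaussian sub-posteriors N(a_c, s_c^2): their pointwise product
    is proportional to a Gaussian whose precision is the sum of the
    sub-posterior precisions, and whose mean is the precision-weighted
    average of the a_c."""
    precision = sum(1.0 / v for v in variances)
    mean = sum(a / v for a, v in zip(means, variances)) / precision
    return mean, 1.0 / precision

# Three disagreeing, equally confident sub-posteriors:
mean, var = gaussian_product([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
# mean == 2.0 (the precision-weighted average), var == 0.5 / 3
```

Outside the Gaussian case no such closed form exists, which is precisely where approximate schemes begin to accrue bias.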
Indeed, the majority of the current literature addressing the fusion problem has been developed to address specific applications. Our focus in this paper will not concern any particular application, but rather methodology for the general fusion problem, which in principle could be applied and adapted to the statistical contexts we describe. Some general discussion of particular applications is given in Section 3.6, following the introduction of our methodology.

The methodologies proposed in the literature to address the fusion problem are mostly approximate, often supported by underpinning theory which ensures their limiting unbiasedness in an appropriate asymptotic limit. While these methods are often computationally efficient and generally effective, it is generally difficult to assess the extent of the biases introduced by these methods, and equally difficult to correct for these biases. One of the earliest, and most widely used, methods for dealing with the fusion problem is the Consensus Monte Carlo (CMC) method (Scott et al., 2016; Scott, 2017). This method weights samples from individual sub-posteriors in a way which would be completely unbiased if each sub-posterior were indeed Gaussian. This is attractive in the large data context which motivated that work. On the other hand, outside the Gaussian context CMC can be very biased (Wang et al., 2015; Srivastava et al., 2016). An alternative method, applying aggregation techniques based on Weierstrass transforms to each sub-posterior, was proposed in Wang and Dunson (2013). In comparison to CMC, the Weierstrass sampler is computationally more expensive, although it tends to produce less biased results in the context of non-Gaussian sub-posteriors.
We shall use these two methods as benchmarks against which to compare the methodology we propose here.

Much of the existing approximate literature has been focused on distributed large data settings, and as a consequence there has been particular attention on developing embarrassingly parallel procedures, where communication between cores is limited to a single unification step. Often termed divide-and-conquer approaches (although strictly speaking fork-and-join approaches), recent contributions include Neiswanger et al. (2013), who construct a kernel density estimate for each sub-posterior in order to reconstruct the posterior density. Other approaches which construct approximations directly from sub-posterior draws include Minsker et al. (2014); Srivastava et al. (2016); Wang et al. (2015); Stamatakis and Aberer (2013); Agarwal and Duchi (2011); Xue and Liang (2019) and Wang and Dunson (2013). Alternative non-embarrassingly parallel approaches are discussed extensively in Jordan et al. (2018) and Xu et al. (2014). Within a hierarchical framework, Rendell et al. (2018) (and subsequently Vono et al. (2019)) introduce a methodology in which a smoothed approximation to (1) can be obtained if increased communication between the cores is permitted.

In contrast to approximate methods, the Monte Carlo Fusion approach recently introduced by Dai et al. (2019) provides a theoretical framework to sample independent draws from (1) exactly (without any form of approximation). Monte Carlo Fusion is based upon constructing a rejection sampler on an auxiliary space which admits (1) as a marginal. However, unlike approximate approaches, there are considerable computational challenges with Monte Carlo Fusion. In particular, the scalability of the methodology in terms of the number of sub-posteriors to be unified, increasing dissimilarity in the sub-posteriors, and the dimensionality of the underlying fusion target density all inhibit its practical adoption.
The challenge that we successfully address in this paper is to devise a methodology which shares the consistency properties of Monte Carlo Fusion while matching the scalability behaviour of the approximate alternatives.

In this paper we substantially reformulate the theoretical underpinnings of the auxiliary construction used in Dai et al. (2019) to support the use of scalable Monte Carlo methodology. There are a number of substantial and novel contributions, which we list here for clarity.

• We show that it is possible to sample from (1) by means of simulation from the probability measure of a forward stochastic differential equation (SDE).

• Based upon the SDE formulation, we further develop a sequential Monte Carlo (SMC) algorithm to sample consistently from (1), in a methodology which we term Bayesian Fusion.

• We develop theory to show that Bayesian Fusion is robust in the large-C and increasingly discrepant sub-posterior scenarios, and as a consequence is considerably more efficient when used in practical Bayesian settings.

• For practitioners, we provide practical guidance for setting algorithm hyperparameters which will (approximately) optimise the efficiency of our approach.

• Finally, we provide extensive pedagogical examples and real-data applications to contrast our methodology with existing approximate and exact approaches, and to study empirically the scaling properties of our approach, verifying that it attains that given by our theoretical guidance.

In the next section we present the theory that underpins Bayesian Fusion, together with methodology and pseudo-code for its implementation in Section 2.1. We provide guidance on implementing Bayesian Fusion in Section 3, which includes the selection of user-specified parameters in Sections 3.1 and 3.2, studies of the robustness of the algorithm with respect to how similar the sub-posteriors are in Sections 3.3 and 3.4, and extensive discussion of practical considerations in Sections 3.5 and 3.6. Section 4 studies the performance of our methodology in comparison to competing methodologies for idealised models and a synthetic data set, and Section 5 its performance in a number of real data set applications. We conclude in Section 6 with a discussion and future directions. We suppress all proofs from the main text; they are instead collated in the appendices. The appendices also include some discussion of the underlying diffusion theory and assumptions (Appendix A), theory to support implementations for distributed environments (Appendix D), and discussion of the application of the methodology to large data settings (Appendix E), and are referenced as appropriate in the main text.
2. Bayesian Fusion
Consider the d-dimensional posterior density f(x) described in (1). As motivated in the introduction, we want to sample from f(x) by means of sampling and evaluating functionals of the available sub-posterior densities $f_c(x)$ ($c \in \{1, \ldots, C\}$). f(x) can be obtained as a marginal of a suitably chosen extended target fusion measure on an extended state space, which we present in Theorem 1.

To introduce the fusion measure, we first present some notation and terminology. We term the proposal measure, $\mathbb{P}$, the probability law induced by C interacting d-dimensional parallel continuous-time Markov processes on [0, T], where each process $c \in \{1, \ldots, C\}$ is described by the following d-dimensional SDE:

$$\mathrm{d}X^{(c)}_t = \frac{\bar{X}_t - X^{(c)}_t}{T-t}\,\mathrm{d}t + \mathrm{d}W^{(c)}_t, \qquad X^{(c)}_0 := x^{(c)}_0 \sim f_c, \quad t \in [0, T], \quad (2)$$

where $\{W^{(c)}_t\}_{c=1}^{C}$ are independent Brownian motions, and $\bar{X}_t := C^{-1}\sum_{c=1}^{C} X^{(c)}_t$. Typical realisations of the proposal measure are denoted by $X := \{\vec{x}_t, t \in [0,T]\}$, where $\vec{x}_t := x^{(1:C)}_t$ is the dC-dimensional vector of all processes at time t, with one such realisation illustrated in Figure 1.

Interaction of the C processes in a realisation of X occurs through their average at a given time marginal ($\bar{X}_t$), and note that we have coalescence at time T ($x^{(1)}_T = \cdots = x^{(C)}_T =: y$). We describe in detail in Section 2.1 how to simulate from $\mathbb{P}$, but note that (critically) initialisation of the proposal measure at t = 0 only requires independent draws from the C available sub-posteriors.

Now we define the fusion measure, $\mathbb{F}$, to be the probability measure induced by the
following Radon–Nikodým derivative:

$$\frac{\mathrm{d}\mathbb{F}}{\mathrm{d}\mathbb{P}}(X) \propto \rho_0(\vec{x}_0) \cdot \prod_{c=1}^{C}\left[\exp\left\{-\int_0^T \phi_c\big(x^{(c)}_t\big)\,\mathrm{d}t\right\}\right], \quad (3)$$

where $\{x^{(c)}_t, t \in [0,T]\}$ is a Brownian bridge from $x^{(c)}_0$ to $x^{(c)}_T$, $\phi_c(x) := \Delta f_c(x)/(2 f_c(x))$ (where $\Delta$ is the Laplacian operator), and

$$\rho_0 := \rho_0(\vec{x}_0) = \exp\left\{-\sum_{c=1}^{C}\frac{\big\|x^{(c)}_0 - \bar{x}_0\big\|^2}{2T}\right\} \in (0,1], \qquad \text{where } \bar{x}_0 = C^{-1}\sum_{c=1}^{C} x^{(c)}_0. \quad (4)$$

[Figure 1: A typical realisation of X (C interacting Markov processes).]

We now establish that we can access the fusion density f by means of the temporal marginal of $\mathbb{F}$ given by the common value of the C trajectories at time T.

Theorem 1.
Under Assumptions A.1 and A.2 given in Appendix A, with probability 1 we have that, under the fusion measure $\mathbb{F}$, $y := x^{(1)}_T = \cdots = x^{(C)}_T$, and this common value has density f.

Proof. See Appendix A.

2.1. Simulating f by means of simulating from the fusion measure $\mathbb{F}$

As suggested by Theorem 1, we could simulate from the desired f in (1) by simulating $X \sim \mathbb{F}$ and simply retaining its time T marginal, y. However, direct simulation of $\mathbb{F}$ will typically not be possible, and so we now outline general methodology to simulate $\mathbb{F}$ indirectly (and so, by extension, f). In particular, we show that we can simulate from $\mathbb{F}$ by means of a rejection sampler with proposals $X \sim \mathbb{P}$, which are accepted with probability proportional to the Radon–Nikodým derivative given in (3).

For the purposes of the efficiency of the methodology we will subsequently develop, we will consider the simulation of $\mathbb{P}$ and $\mathbb{F}$ at discrete time points given by the following auxiliary temporal partition:

$$\mathcal{P} = \{t_0, t_1, \ldots, t_n : 0 =: t_0 < t_1 < \cdots < t_n := T\}, \quad (5)$$

noting that ultimately we only require the time T marginal corresponding to the n-th temporal partition. For simplicity we will suppress subscripts when considering the Markov processes at times coinciding with the partition, denoting $x^{(c)}_{t_j}$ as $x^{(c)}_j$, and $\vec{x}_{t_j}$ as $\vec{x}_j$. We further denote $\Delta_j := t_j - t_{j-1}$.

We begin by considering simulating exactly $X \sim \mathbb{P}$ at the points given by the temporal partition, $\mathcal{P}$. To do so, we simply note that the SDE given in (2) is linear and therefore describes a Gaussian process, whose finite-dimensional distributions are explicitly available.

Theorem 2. If X satisfies (2), then under the proposal measure, $\mathbb{P}$, we have:

(a) For $s < t$,

$$\vec{X}_t \,\Big|\, \big(\vec{X}_s = \vec{x}_s\big) \sim \mathcal{N}\big(\vec{M}_{s,t}, V_{s,t}\big), \quad (6)$$

where $\mathcal{N}$ is a multivariate Gaussian density, $\vec{M}_{s,t} = \big(M^{(1)}_{s,t}, \ldots, M^{(C)}_{s,t}\big)$ with

$$M^{(c)}_{s,t} = \frac{T-t}{T-s}\, x^{(c)}_s + \frac{t-s}{T-s}\, \bar{x}_s, \quad (7)$$

and where $V_{s,t} = \Sigma \otimes \mathbf{I}_{d \times d}$, with $\Sigma = (\Sigma_{ij})$ a $C \times C$ matrix given by

$$\Sigma_{ii} = \frac{(t-s)(T-t)}{T-s} + \frac{(t-s)^2}{C(T-s)}, \qquad \Sigma_{ij} = \frac{(t-s)^2}{C(T-s)}. \quad (8)$$

(b) For every $c \in \{1, \ldots, C\}$, the distribution of $\{X^{(c)}_u, s \leq u \leq t\}$ given the endpoints $X^{(c)}_s = x^{(c)}_s$ and $X^{(c)}_t = x^{(c)}_t$ is a Brownian bridge, so that

$$X^{(c)}_u \,\Big|\, \big(x^{(c)}_s, x^{(c)}_t\big) \sim \mathcal{N}\left(\frac{(t-u)\, x^{(c)}_s + (u-s)\, x^{(c)}_t}{t-s},\ \frac{(u-s)(t-u)}{t-s}\, \mathbf{I}_{d \times d}\right). \quad (9)$$

Proof.
See Appendix A.

To simplify the presentation of the methodology, we now restrict our attention to the d(nC+1)-dimensional density of the C d-dimensional Markov processes at the (n+1) time marginals given by the temporal partition under $\mathbb{P}$. An illustration of this is given in Figure 2. As a consequence of Theorem 2 we have

$$h(\vec{x}_0, \ldots, \vec{x}_{n-1}, y) \propto \prod_{c=1}^{C}\Big[f_c\big(x^{(c)}_0\big)\Big] \cdot \prod_{j=1}^{n} \mathcal{N}\big(\vec{x}_j;\ \vec{M}_j, V_j\big), \quad (10)$$
[Figure 2: Illustration of the d(nC+1)-dimensional density corresponding to the time marginals of a typical realisation of X given by the temporal partition $\mathcal{P}$.]

where, to simplify notation, we have $\vec{M}_j := \vec{M}_{t_{j-1},t_j}$ and $V_j := V_{t_{j-1},t_j}$. By factorising (3) according to the temporal partition $\mathcal{P}$, the equivalent d(nC+1)-dimensional density under $\mathbb{F}$ is simply

$$g(\vec{x}_0, \ldots, \vec{x}_{n-1}, y) \propto h(\vec{x}_0, \ldots, \vec{x}_{n-1}, y) \cdot \prod_{j=0}^{n} \rho_j, \quad (11)$$

where $\rho_0$ is as given in (4) and, for $j \in \{1, \ldots, n\}$,

$$\rho_j := \rho_j(\vec{x}_{j-1}, \vec{x}_j) = \prod_{c=1}^{C} \mathbb{E}_{\mathbb{W}_{j,c}}\left[\exp\left\{-\int_{t_{j-1}}^{t_j} \Big(\phi_c\big(x^{(c)}_t\big) - \Phi_c\Big)\,\mathrm{d}t\right\}\right] \in (0, 1],$$

where $\mathbb{W}_{j,c}$ is the law of a Brownian bridge $\{x^{(c)}_t, t \in (t_{j-1}, t_j)\}$ from $x^{(c)}_{j-1}$ to $x^{(c)}_j$, and $\Phi_c$ is a constant such that $\inf_x \phi_c(x) \geq \Phi_c > -\infty$. Discussion of $\Phi_c$ can be found in Appendix A.

As we are interested in sampling from the fusion density f (corresponding to the time T marginal of the d(nC+1)-dimensional density g), it is sufficient to simulate g rather than the more complicated object $X \sim \mathbb{F}$. As suggested by (11), this can be achieved by rejection sampling: first simulating a proposal from the density h, and accepting this proposal with probability equal to $\prod_{j=0}^{n} \rho_j$.

Simulation of a proposal from h is straightforward following Theorem 2 and (10). In particular, we can do so by first simulating a single draw from each sub-posterior and composing them to obtain a proposal at the time 0 marginal of the temporal partition $\mathcal{P}$ (in particular $\vec{x}_0 := x^{(1:C)}_0$, where for $c \in \{1, \ldots, C\}$, $x^{(c)}_0 \sim f_c$). This initial draw can then be iteratively propagated n times using the Gaussian transitions given in (10) to compose the entire draw from h.
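A sketch of this propagation mechanism for scalar components (d = 1): the covariance in (8) has the equicorrelated form $\Sigma = \alpha\,\mathbf{I} + \beta\,\mathbf{J}$, so a draw can be composed as independent noise per process plus a single shared Gaussian term. The function name and interface are our own:

```python
import math, random

def propagate(xs, s, t, T, rng=random):
    """One exact Gaussian step (Theorem 2(a)) of the C interacting processes,
    for scalar (d = 1) components, from time s to time t <= T.
    Sigma = a*I + b*J, so we add independent N(0, a) noise per process
    plus one shared N(0, b) draw common to all C processes."""
    C = len(xs)
    xbar = sum(xs) / C
    a = (t - s) * (T - t) / (T - s)       # Sigma_ii - Sigma_ij
    b = (t - s) ** 2 / (C * (T - s))      # Sigma_ij, shared across processes
    shared = math.sqrt(b) * rng.gauss(0.0, 1.0)
    return [((T - t) * x + (t - s) * xbar) / (T - s)    # mean M^(c) of (7)
            + math.sqrt(a) * rng.gauss(0.0, 1.0) + shared
            for x in xs]

# At t = T we have a = 0 and the mean collapses to xbar, so all C
# processes coalesce to a common value y, as required.
```

This decomposition avoids any matrix factorisation: the shared draw reproduces exactly the off-diagonal entry $\Sigma_{ij}$ of (8).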
Now, considering the computation of the acceptance probability of the proposal, note that although $\rho_0$ is computable, the direct computation of $\rho_1, \ldots, \rho_n$ is not possible, as it would require the evaluation of path integrals of functionals of Brownian motion. However, it is possible to construct unbiased estimators of these intractable quantities, and then simulate them using variations of established techniques. We denote the estimators we use by $\hat{\rho}_1, \ldots, \hat{\rho}_n$; they are given by

$$\hat{\rho}_j := \prod_{c=1}^{C} \frac{\Delta_j^{\kappa_c}\, e^{-(U^{(c)}_j - \Phi_c)\Delta_j}}{\kappa_c! \cdot p(\kappa_c \mid R_c)} \prod_{k_c=1}^{\kappa_c} \Big(U^{(c)}_j - \phi_c\big(x^{(c)}_{\chi_{c,k_c}}\big)\Big), \quad (12)$$

where $R_c$ is a function of the Brownian bridge sample path $x^{(c)} \sim \mathbb{W}_{j,c}$ which determines the compact subset of $\mathbb{R}^d$ in which it is constrained, $U^{(c)}_j$ is a constant such that $\phi_c\big(x^{(c)}_t\big) \leq U^{(c)}_j$ for all $x^{(c)}_t \sim \mathbb{W}_{j,c} \mid R_c$, $\kappa_c$ is a discrete random variable with conditional probabilities $\mathbb{P}[\kappa_c = k_c \mid R_c] := p(\kappa_c \mid R_c)$, and $\{\chi_{c,1}, \ldots, \chi_{c,\kappa_c}\} \overset{\text{iid}}{\sim} \mathcal{U}[t_{j-1}, t_j]$. The validity of (12) is established as follows:

Theorem 3.
For every $1 \leq j \leq n$, $\hat{\rho}_j$ is an unbiased estimator of $\rho_j$.

Proof. See Appendix B.

The construction of estimators of the type in (12), details of their specification (including the function $R_c$), and a full proof of Theorem 3 are deferred to Appendix B. As it is possible to construct [0, 1]-valued unbiased estimators of $\rho_1, \ldots, \rho_n$, we now have an implementable rejection sampler: upon simulating the proposal from h, we can simply simulate $\prod_{j=0}^{n} \hat{\rho}_j \in (0, 1]$ and accept with this probability. The validity of this can be established by simply noting that, as $\hat{\rho}_1, \ldots, \hat{\rho}_n \in [0, 1]$, the rejection-based algorithms resulting from their use are algorithmically equivalent to the original constructions of the algorithm had the intractable quantities been available. Furthermore, as a consequence of this, there is no detrimental effect from the use of the estimators (such as decreased acceptance probabilities, or inflated variance).

Rejection-sampling-based algorithms suffer from a number of inefficiencies in this setting. For instance, it is clear from (11) that the acceptance probability of rejection sampling will likely decay geometrically with increasing C, and exponentially with increasing T. Dai et al. (2019) introduced a variant of this rejection sampling approach based upon methodology developed from a substantial simplification of Theorem 2 without the auxiliary temporal partition, $\mathcal{P}$. Although theoretically sound, and the first exact fusion approach, the Monte Carlo Fusion approach introduced in Dai et al. (2019) is impractical in many settings due to this lack of robustness with increasing numbers (and heterogeneity, or lack of similarity) of sub-posteriors. Some further discussion of this approach and its shortcomings is given in Section 4.

An immediate extension of the rejection sampling approach of Dai et al.
(2019) would be an importance sampling approach, in which an importance weight corresponding to the acceptance probability is assigned to each of the proposals from h. This would, however, ultimately suffer from inefficiencies similar to those of the rejection sampling approach, manifested in the variance of the importance weights. A drawback of both rejection and importance sampling approaches is the computational complication arising from the simulation of diffusion bridges (required in (12)), whose computational cost does not scale linearly in T; this is one of the motivations for introducing the temporal partition, $\mathcal{P}$.

The key novelty of Theorem 2 is that the auxiliary temporal partition $\mathcal{P}$ which has been introduced allows g to be simulated using a sequential Monte Carlo (SMC) approach. This mitigates the robustness drawbacks of the Monte Carlo Fusion approach of Dai et al. (2019), and allows us to leverage the results and approaches available within the SMC literature. In particular, and as suggested by (11), one could initialise an algorithm by simulating N particles from the time 0 marginal of h in (11), $\vec{x}_{0,1}, \ldots, \vec{x}_{0,N}$ (recalling that $\vec{x}_0 := x^{(1:C)}_0$, where for $c \in \{1, \ldots, C\}$, $x^{(c)}_0 \sim f_c$), and assigning each an un-normalised importance weight $w'_{0,\cdot} := \rho_0(\vec{x}_{0,\cdot})$. This initial particle set (which constitutes an approximation of the time 0 marginal of g in (11)) can then be iteratively propagated n times by interlacing Gaussian transitions of the particle set over the j-th partition of $\mathcal{P}$ (with mean vector $\vec{M}_j$ and covariance matrix $V_j$ as given in (11)), and updating the particle set weightings by a factor of $\hat{\rho}_j(\vec{x}_{j-1,\cdot}, \vec{x}_{j,\cdot})$. The weighted particle set obtained after the final (n-th) iteration of the algorithm (which is an approximation of the time T marginal of g) can then be used as a proxy for the desired f (as supported by Theorem 2).
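The incremental weights $\hat{\rho}_j$ of (12) can be simulated along the lines of the following one-dimensional sketch. For brevity it assumes a single global upper bound U on $\phi$ over the bridge (sidestepping the layered construction via $R_c$ detailed in Appendix B), and takes $p(\kappa)$ to be Poisson$((U - \Phi)\Delta)$, in which case the prefactor in (12) collapses and the estimator reduces to a product of $(U - \phi)/(U - \Phi)$ factors. All names are our own:

```python
import math, random

def poisson_draw(lam, rng):
    """Draw kappa ~ Poisson(lam) by multiplying uniforms (Knuth's method)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def rho_hat(phi, x0, xT, delta, Phi, U, rng=random):
    """Unbiased estimate of exp(-integral_0^delta (phi(x_t) - Phi) dt) along
    a Brownian bridge from x0 to xT, assuming Phi <= phi <= U on the bridge."""
    kappa = poisson_draw((U - Phi) * delta, rng)
    est, t_prev, x_prev = 1.0, 0.0, x0
    for t in sorted(rng.uniform(0.0, delta) for _ in range(kappa)):
        # fill in the bridge at time t given (t_prev, x_prev) and (delta, xT),
        # using the Brownian bridge conditional of Theorem 2(b)
        mean = x_prev + (t - t_prev) * (xT - x_prev) / (delta - t_prev)
        var = (t - t_prev) * (delta - t) / (delta - t_prev)
        x_t = rng.gauss(mean, math.sqrt(var))
        est *= (U - phi(x_t)) / (U - Phi)
        t_prev, x_prev = t, x_t
    return est
```

Averaging independent replicates recovers the path-integral expectation; for instance, for the constant test function $\phi \equiv 1$ with $\Phi = 0$, $U = 2$ and $\Delta = 1$, the estimator has expectation $e^{-1}$.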
We term the SMC approach outlined above Bayesian Fusion, and present pseudo-code for it in Algorithm 1. Note that in this setting (unlike the rejection sampling setting) we need to further consider the construction of the unbiased estimator for $\rho_j$ and its variance, which is fully considered in Appendix B.

Algorithm 1 outputs a weighted particle set at the end of each iteration, the weights of which are re-normalised. As is standard within the SMC literature, we monitor for weight degeneracy by monitoring the importance sampling weights, and resample if appropriate. In particular, we compute the effective sample size (ESS) (Kong et al., 1994) of the particle set, and if the ESS falls below a lower user-specified threshold, then the next iteration of the algorithm is instead initialised by (re-)sampling N times from the empirical distribution defined by the current set of weighted particles (for simplicity we use multinomial resampling). In our particular case, re-normalisation removes all contributory components of $\Phi_1, \ldots, \Phi_C$ from $\hat{\rho}_j$. This conveniently allows us to avoid the computation of the constants $\Phi_1, \ldots, \Phi_C$, which would otherwise seem to be required by Theorem 2 and (12). As such, in our presentation of Algorithm 1 we have simply replaced $\hat{\rho}_j$ by $\tilde{\rho}_j$ to exploit this, where

$$\tilde{\rho}_j(\vec{x}_{j-1,\cdot}, \vec{x}_{j,\cdot}) := \tilde{\rho}_j := e^{-\sum_{c=1}^{C} \Phi_c \Delta_j}\, \hat{\rho}_j. \quad (13)$$

As suggested by Algorithm 1, the output can be used directly as an approximation of the fusion density, f. Clearly, the efficiency of the Bayesian Fusion approach outlined in Algorithm 1 will depend critically on the user-specified time horizon T and the resolution of $\mathcal{P}$ (and hence the number of iterations required in the algorithm). In the following section we provide guidance on selecting these tuning parameters, together with additional practical guidance on implementation.
Algorithm 1: Bayesian Fusion Algorithm.

(a) Initialisation Step (j = 0)
  (i) Input: sub-posteriors $f_1, \ldots, f_C$, number of particles N, time horizon T, and temporal partition $\mathcal{P}: 0 = t_0 < t_1 < \cdots < t_n = T$.
  (ii) For i in 1 to N,
    A. $\vec{x}_{0,i}$: for c in 1 to C, simulate $x^{(c)}_{0,i} \sim f_c$. Set $\vec{x}_{0,i} := x^{(1:C)}_{0,i}$.
    B. $w'_{0,i}$: compute the un-normalised weight $w'_{0,i} = \rho_0(\vec{x}_{0,i})$, as per (4).
  (iii) $w_{0,\cdot}$: for i in 1 to N compute the normalised weight $w_{0,i} = w'_{0,i} / \sum_{k=1}^{N} w'_{0,k}$.
  (iv) $g^N_0$: set $g^N_0(\mathrm{d}\vec{x}_0) := \sum_{i=1}^{N} w_{0,i} \cdot \delta_{\vec{x}_{0,i}}(\mathrm{d}\vec{x}_0)$.

(b) Iterative Update Steps (j = j + 1 while j ≤ n)
  (i) Resample: if $\mathrm{ESS} := \big(\sum_{i=1}^{N} w^2_{j-1,i}\big)^{-1}$ breaches the lower user-specified threshold, then for i in 1 to N resample $\vec{x}_{j-1,i} \sim g^N_{j-1}$, and set $w_{j-1,i} = 1/N$.
  (ii) For i in 1 to N,
    A. $\vec{x}_{j,i}$: simulate $\vec{x}_{j,i} \sim \mathcal{N}\big(\vec{M}_{j,i}, V_j\big)$, where $\vec{M}_{j,i}$ (a function of $\vec{x}_{j-1,i}$) and $V_j$ are computed using Theorem 2.
    B. $w'_{j,i}$: compute the un-normalised weight $w'_{j,i} = w_{j-1,i} \cdot \tilde{\rho}_j(\vec{x}_{j-1,i}, \vec{x}_{j,i})$, as per Algorithm 4 of Appendix B.
  (iii) $w_{j,\cdot}$: for i in 1 to N compute the normalised weight $w_{j,i} = w'_{j,i} / \sum_{k=1}^{N} w'_{j,k}$.
  (iv) $g^N_j$: set $g^N_j(\mathrm{d}\vec{x}_j) := \sum_{i=1}^{N} w_{j,i} \cdot \delta_{\vec{x}_{j,i}}(\mathrm{d}\vec{x}_j)$.

(c) Output: $\hat{f}(\mathrm{d}y) := g^N_n(\mathrm{d}y) \approx f(\mathrm{d}y)$.
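To make Algorithm 1 concrete, the following is a compact end-to-end sketch for a toy one-dimensional problem with three Gaussian sub-posteriors. Everything here (the sub-posteriors, N, T, n, the resampling threshold, all names) is our own illustrative choice and, to keep the sketch short, the exact unbiased estimator of Appendix B in step (b)(ii)B is replaced by a crude midpoint approximation of the path integral, so this is consistent in spirit rather than exact:

```python
import math, random

A = [-1.0, 0.0, 1.0]   # sub-posterior means a_c (illustrative)
S2 = 1.0               # common sub-posterior variance
C = len(A)

def phi(c, x):
    """phi_c = f_c''/(2 f_c) for the Gaussian sub-posterior f_c = N(a_c, S2)."""
    return ((x - A[c]) ** 2 / S2 ** 2 - 1.0 / S2) / 2.0

PHI = [-1.0 / (2.0 * S2)] * C   # lower bounds Phi_c = inf_x phi_c(x)

def normalise(logw):
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]
    tot = sum(w)
    return [wi / tot for wi in w]

def resample_index(w, rng):
    u, acc = rng.random(), 0.0
    for i, wi in enumerate(w):
        acc += wi
        if u <= acc:
            return i
    return len(w) - 1

def propagate(xs, s, t, T, rng):
    """Exact Gaussian transition of Theorem 2(a), d = 1."""
    xbar = sum(xs) / C
    a = (t - s) * (T - t) / (T - s)
    b = (t - s) ** 2 / (C * (T - s))
    shared = math.sqrt(b) * rng.gauss(0.0, 1.0)
    return [((T - t) * x + (t - s) * xbar) / (T - s)
            + math.sqrt(a) * rng.gauss(0.0, 1.0) + shared for x in xs]

def bayesian_fusion(N=2000, T=1.0, n=10, threshold=0.5, rng=random):
    dt = T / n
    # (a) initialise from the sub-posteriors; weight by rho_0 of (4)
    parts = [[rng.gauss(A[c], math.sqrt(S2)) for c in range(C)] for _ in range(N)]
    logw = [-sum((x - sum(xs) / C) ** 2 for x in xs) / (2.0 * T) for xs in parts]
    # (b) iterate over the regular temporal partition
    for j in range(1, n + 1):
        s, t = (j - 1) * dt, j * dt
        w = normalise(logw)
        if 1.0 / sum(wi * wi for wi in w) < threshold * N:   # ESS check
            parts = [parts[resample_index(w, rng)][:] for _ in range(N)]
            logw = [0.0] * N
        for i in range(N):
            xs = parts[i]
            ys = propagate(xs, s, t, T, rng)
            # crude midpoint stand-in for the unbiased estimator rho-tilde_j
            for c in range(C):
                mid = rng.gauss((xs[c] + ys[c]) / 2.0, math.sqrt(dt / 4.0))
                logw[i] -= (phi(c, mid) - PHI[c]) * dt
            parts[i] = ys
    # (c) at t_n = T each particle's C components have coalesced to y
    return [xs[0] for xs in parts], normalise(logw)
```

Here the fusion target is $N(-1,1) \times N(0,1) \times N(1,1) \propto N(0, 1/3)$, so the weighted particle mean should be near 0 and the weighted variance near 1/3.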
3. Theoretical underpinning and implementational guidance
In this section we provide guidance on implementing the Bayesian Fusion algorithm: in particular, how to select the user-specified time horizon (T), and an appropriate resolution of the auxiliary temporal partition (n and $\mathcal{P}$). These are considered in Sections 3.1 and 3.2 respectively. The robustness of this guidance is considered by means of two extreme possible scenarios in Sections 3.3 and 3.4. We conclude in Sections 3.5 and 3.6 by presenting other practical considerations for efficiently implementing Algorithm 1.

We begin in developing guidance for T, n and $\mathcal{P}$ by noting that Algorithm 1 is an SMC algorithm for simulating the extended target density g in (11), which is achieved by approximating successive temporal marginals of g (in particular, $g^N_j$) by means of propagating and re-weighting the previous temporal marginal ($g^N_{j-1}$). As such, it is natural to choose T, n and $\mathcal{P}$ to ensure the discrepancy between the sequence of proposal and target distributions is not degenerate, and so effective sample size (ESS) is an appropriate quantity to analyse (see Kong et al. (1994)). However, the implementation we present in Algorithm 1 makes use of both weight normalisation and resampling in order to combat weight degradation. As such, it is more natural in this setting to study a variant of ESS which is instead based upon the un-normalised incremental weight change within Algorithm 1 (recalling that we denote by $\hat{\rho}_j(\vec{x}_{j-1,i}, \vec{x}_{j,i}) =: \hat{\rho}_{j,i}$ the incremental weight change of the i-th particle in the j-th iteration), which we term the conditional effective sample size (CESS) (following, for instance, Zhou et al. (2016)). In particular, we denote

$$\mathrm{CESS}_j := \frac{\big(\sum_{i=1}^{N} \hat{\rho}_{j,i}\big)^2}{\sum_{i=1}^{N} \hat{\rho}_{j,i}^2}.$$

To develop concrete implementational guidance we consider and analyse the idealised setting of posterior distributions of large sample size m.
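The CESS diagnostic depends only on the un-normalised incremental weights of a single step; a minimal sketch (assuming uniform weights entering the step, as holds immediately after resampling; the function name is our own):

```python
def cess(rho):
    """Conditional ESS of one SMC step from the un-normalised incremental
    weights rho_{j,i} (> 0), with uniform weights entering the step:
    CESS_j = (sum_i rho_i)^2 / sum_i rho_i^2, which lies in [1, N]."""
    s1 = sum(rho)
    s2 = sum(r * r for r in rho)
    return s1 * s1 / s2

# Perfectly even increments give no degradation; one dominant increment
# collapses the step to a single effective particle.
print(cess([2.0] * 100))          # 100.0
print(cess([1e6, 1e-6, 1e-6]))    # ~1.0
```

Monitoring this quantity per step is what motivates the choices of T and of the partition increments analysed below.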
In particular, we assume that the target density f is multivariate Gaussian with mean vector a and covariance matrix $m^{-1} b\, \mathbf{I}_d$ (for some $b > 0$), and that each of the sub-posterior densities $f_c(x)$ ($c \in \{1, \ldots, C\}$) is also multivariate Gaussian, with mean vector $a_c$ and covariance matrix $m^{-1} C b\, \mathbf{I}_d$ respectively. Note that we have $a = C^{-1} \sum_{c=1}^{C} a_c$, and we will further reasonably assume $m > C > 1$. To study the robustness of Algorithm 1 we further consider the quantity

$$\sigma^2_a := C^{-1} \sum_{c=1}^{C} \|a_c - a\|^2,$$

which gives a measure of what we term the sub-posterior heterogeneity (the degree to which the individual sub-posteriors agree or disagree with one another).

3.1. Selecting the time horizon T

Considering the selection of T, note from Algorithm 1 that its influence appears solely in the initial weighting given to each of the N particles in (4) through $\rho_0$. As such, we study the initial conditional effective sample size.

Theorem 4.
Considering the initial conditional effective sample size (CESS$_0$), we have that as $N \to \infty$,

$$N^{-1}\,\mathrm{CESS}_0 \to \exp\left\{-\frac{\sigma^2_a\, b m^{-1}}{\big(\tfrac{T}{C} + b m^{-1}\big)\big(\tfrac{T}{C} + 2 b m^{-1}\big)}\right\} \cdot \left[\frac{\big(1 + \tfrac{Cb}{Tm}\big)^2}{1 + \tfrac{2Cb}{Tm}}\right]^{-(C-1)d/2}.$$
Proof. See Appendix C.

Theorem 4 shows explicitly how CESS$_0$ degrades as the level of sub-posterior heterogeneity ($\sigma^2_a$) increases. To explore this dependency we introduce the following conditions, which will allow us to clearly identify regimes where CESS$_0$ is well-behaved.

Condition 1 (SH($\lambda$)). The sub-posteriors obey the SH($\lambda$) condition (for some constant $\lambda > 0$) if

$$\sigma^2_a = \frac{b(C-1)\lambda}{m}.$$

Condition 2 (SSH($\gamma$)). The sub-posteriors obey the super sub-posterior heterogeneity
SSH($\gamma$) condition (for some constant $\gamma > 0$) if

$$\sigma^2_a = b\gamma.$$
Note that Condition 1 is a very natural condition which would arise in many settings: for instance, if an $(m/C)$-th share of the data were randomly allocated to each sub-posterior, then $\sigma^2_a \sim b/m \times \chi^2_{C-1}$, which has mean $b(C-1)/m$. For $m/C$ large, we would expect that for $\lambda > 1$ the sub-posteriors would obey the SH($\lambda$) condition with high probability. At the other end of the spectrum, the SSH($\gamma$) condition of Condition 2 captures the case where sub-posterior heterogeneity does not decay with m.

Considering the initial conditional effective sample size under Conditions 1 and 2, we establish the following corollary.

Corollary 1.
If, for some constant $k_1 > 0$, T is chosen such that

$$T \geq \frac{b C^{3/2}}{k_1 m}, \quad (14)$$

then the following lower bounds on CESS$_0$ hold:

(a) If SH($\lambda$) holds for some $\lambda > 0$, then

$$\lim_{N \to \infty} N^{-1}\,\mathrm{CESS}_0 \geq \exp\big\{-\lambda k_1^2 - d k_1^2/2\big\}. \quad (15)$$

(b) If SSH($\gamma$) holds for some $\gamma > 0$, and $T \geq k_2 C m^{-1/2}$ (for some constant $k_2 > 0$), then

$$\lim_{N \to \infty} N^{-1}\,\mathrm{CESS}_0 \geq \exp\big\{-\gamma b^2 k_2^{-2} - d k_1^2/2\big\}. \quad (16)$$
Proof. See Appendix C.

Corollary 1 gives explicit guidance on the minimal values of T which should be selected in order to robustly initialise Algorithm 1, as measured by the initial CESS. In principle, one could choose T in excess of this minimal guidance. This, however, comes at either the cost of increasing the number of iterations of the algorithm required (which we discuss in the following section), or increasing the increment size in the auxiliary temporal partition (which leads to an increased computational cost in simulating $\hat{\rho}_\cdot$ from Theorem 3), or some combination of both.

3.2. Selecting n and $\mathcal{P}$

Having selected an appropriate T (using the guidance of Section 3.1 and Corollary 1), we are left with choosing the remaining user-specified parameters n and $\mathcal{P}$ (the resolution and spacing of the auxiliary temporal partition), as required in Algorithm 1. We address this implicitly by considering how to choose the j-th interval size (i.e. the interval $(t_{j-1}, t_j]$) of the auxiliary temporal partition, which we do by again considering the conditional effective sample size, in Theorem 5.

Theorem 5.
Considering the conditional effective sample size for the j th iteration of Algorithm 1 (CESS_j), and letting k_3, k_4 be positive constants, we have

lim inf lim_{N→∞} N^{−1} CESS_j ≥ exp{ −k_3 − d k_4 },

where the outer lim inf is taken over sequences of t_j − t_{j−1} → 0 with

t_j − t_{j−1} ≤ min{ ( b k_3 C / (m σ_{t_j}^2) )^{1/2}, ( k_4 b^2 C / m^2 )^{1/2} },   (17)

and σ_{t_j}^2 = C^{−1} Σ_{c=1}^C ‖E(x_j^{(c)} | ξ_j) − a_c‖^2 (where ξ_j denotes a sequence of standard Gaussian vectors as defined in Corollary 2).

Proof. See Appendix C.

We can use Theorem 5 to develop guidance for choosing the j th interval size of the auxiliary temporal partition by considering the effect of σ_{t_j}^2 in (17). In essence, σ_{t_j}^2 describes the average variation of the distributions of the proposed update locations of the C trajectories with respect to their individual sub-posterior means (i.e. how different E(x_j^{(c)} | ξ_j) is from a_c). Recalling that Algorithm 1 coalesces C trajectories, initialised independently from their respective sub-posteriors, to a common end point, σ_{t_j}^2 will largely be determined by a combination of how close the interval is to the end point T, how large the interval (t_{j−1}, t_j] we are simulating over is, and, critically, the degree of sub-posterior heterogeneity as determined by the variation in their means. Intuitively one may wish to choose the regularity of the mesh itself dependent on sub-posterior heterogeneity (in particular, one would anticipate decreasing the interval size in the partition approaching T, to counteract the increasing disagreement of the coalescing trajectories with their own respective means), but for algorithmic simplicity in the following we impose a regular mesh (Δ_j = t_j − t_{j−1} = T/n =: Δ).
Consequently, and as in Section 3.1, we develop guidance for n and P by considering sub-posterior heterogeneity and its impact on (17) (noting that for a regular mesh we can simply set n = O(T Δ_j^{−1})). We then return in Section 3.5 to consider the implication of imposing a regular mesh over an irregular mesh.

We begin by noting (see (24) in Corollary 2) that

σ_{t_j}^2 = C^{−1} Σ_{c=1}^C ‖ (a − a_c) t_j/T + (t_j/√(CT)) ξ_j ‖^2, and so E σ_{t_j}^2 = C^{−1} Σ_{c=1}^C ‖a − a_c‖^2 (t_j/T)^2 + d t_j^2/(CT).

(a) If SH(λ) holds we have, by choosing T = O(b C^{1/2} k_1^{−1} m^{−1}) following Corollary 1,

E σ_{t_j}^2 ≤ (λ b (C−1)/m)(t_j/T)^2 + d t_j^2/(CT) ≤ O( bCλ/m + d b C^{1/2}/(k_1 C m) ),   (18)

which implies that E σ_{t_j}^2 is bounded above by O(C/m). From Theorem 5 we then have that CESS_j will be well-behaved provided we choose

t_j − t_{j−1} = O( C^{1/2}/m ).   (19)

(b) If SSH(γ) holds we have

E σ_{t_j}^2 = O( t_j^2/T^2 + t_j^2/(CT) ).   (20)

Following Corollary 1, if we choose T = O( max{ b C^{1/2} k_1^{−1} m^{−1}, C^{1/2}/k_2 } ), and recalling that m > C > 1, then E σ_{t_j}^2 = O( (t_j^2/T^2)(1 + T/C) ) = O( t_j^2/T^2 ), which is bounded above by O(1). As such, from Theorem 5 we have that CESS_j will be well-behaved provided we choose

t_j − t_{j−1} = O( C^{1/2}/m^{1/2} ).   (21)

In keeping with the intuition we developed earlier, note from (18) and (20) that in both the SH(λ) and SSH(γ) settings σ_{t_j}^2 will increase as t_j ↑ T, and so from (17) we would anticipate some marginal gain in using an irregular mesh and decreasing the interval size as we approach T.
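The scalings above can be gathered into a small tuning calculator. The sketch below (plain Python) treats the rates associated with (14), (19) and (21), and the constants k1, k2, as assumptions; the function name and its defaults are ours:

```python
import math

def fusion_tuning(m, C, b=1.0, k1=1.0, k2=1.0, setting="SH"):
    """Recommended (T, delta, n) for Bayesian Fusion under the guidance of
    Section 3. The rates and constants used here are assumptions taken from
    the reconstructed forms of (14), (19) and (21)."""
    if setting == "SH":
        T = b * math.sqrt(C) / (k1 * m)          # (14): T >= b C^(1/2) / (k1 m)
        delta = math.sqrt(C) / m                 # (19): delta = O(C^(1/2) / m)
    elif setting == "SSH":
        # T = O(max{b C^(1/2)/(k1 m), C^(1/2)/k2}) when heterogeneity persists
        T = max(b * math.sqrt(C) / (k1 * m), math.sqrt(C) / k2)
        delta = math.sqrt(C) / math.sqrt(m)      # (21): delta = O((C/m)^(1/2))
    else:
        raise ValueError("setting must be 'SH' or 'SSH'")
    delta = min(delta, T)                        # increment cannot exceed horizon
    n = max(1, math.ceil(T / delta))             # regular mesh: Delta_j = T / n
    return T, delta, n
```

Under SH(λ) the returned number of iterations n is O(1), while under SSH(γ) it grows with m, matching the qualitative behaviour reported in Sections 3.3 and 3.4.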
However, choosing Δ_j = t_j − t_{j−1} as per (19) and (21), in conjunction with the guidance for choosing T in Section 3.1, results in only a marginal effect, and so choosing a regular mesh is advantageous from the perspective of algorithmic simplicity (we verify this statement empirically in Section 3.5). As a consequence of choosing T as per Section 3.1 and imposing a regular mesh, determining appropriate choices for n and P (the remaining user-specified parameters of Algorithm 1) is direct.

Of course, choosing interval sizes (Δ_j) smaller than this minimal guidance is possible (and may help computationally in the simulation of ρ̂_· as per Lemma 4), but leads to an increased number of iterations of Algorithm 1. As the reader will surmise, choosing T and n beyond the minimal guidance given by Sections 3.1 and 3.2 is more of a practical computational consideration, but as shown the algorithm should be well-behaved.

Having established guidance for choosing T, n, and P for Bayesian Fusion, we now verify that these selections lead to Bayesian Fusion being robust to increasing data size (as measured by CESS). We do so by studying the guidance in idealised settings for the posterior distribution under the SH(λ) and SSH(γ) conditions, in Sections 3.3 and 3.4 respectively. Note that we consider more substantial examples and comparisons with competing methodologies in Sections 4 and 5.

3.3. The SH(λ) setting
We begin by examining the guidance for T and n in Bayesian Fusion under the SH(λ) setting of Condition 1. Recall this would be the most common setting of relatively homogeneous sub-posteriors (as characterised by variation in the sub-posterior mean), which would occur if, for instance, we were able to randomly allocate approximately a (1/C)th of the available data to each sub-posterior.
To do so we consider the idealised scenario in which we wish to recover a target distribution f, which is Gaussian with mean μ = 0 and variance σ^2 = m^{−1}, by applying Algorithm 1 to unify C sub-posteriors (f_c, c ∈ {1, . . . , C}), which are Gaussian with mean μ_c = 0 and variance σ_c^2 = Cσ^2. In this example we consider a range of data sizes from m = 1,000 to m = 50,000, with a fixed number of sub-posteriors (C = 10), and a particle set of size N = 10,000. In implementing Algorithm 1 we use UE-b (Condition B.2) of Appendix B for simulating the unbiased estimator in Step b(ii)B.

In line with the development of our guidance in Section 3, we consider CESS_0 and CESS_j (j ∈ {1, . . . , n}) with increasing data size by first considering fixed choices for T and n (with n = 5), then choosing a robust scaling of T but with fixed n (as in Section 3.1), and then robustly scaling both T and n (as in Section 3.2). This is presented in Figures 3(a)–3(c) respectively.

Considering the results of fixing T and n in Figure 3(a), it is clear that in this regime Algorithm 1 would lack robustness with increasing data size. Although CESS_0 improves with increasing data size, as expected with increasingly similar sub-posteriors from (4) of Theorem 2, this comes with drastically decreasing CESS_j (as suggested by Theorem 5), which in totality would render the methodology impractical. Scaling T following the guidance in Section 3.1 immediately stabilises both CESS_0 and CESS_j in the SH(λ) setting, making Algorithm 1 robust to increasing data size (as shown in Figure 3(b)). Additionally scaling n substantively improves CESS_j for all data sizes. In both Figures 3(b) and 3(c) the slightly decreased CESS_j for small data sizes can be explained by random variation in the simulation of the sub-posteriors, which leads to slight mis-matching.
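In these idealised Gaussian settings the fusion target is available in closed form: a product of Gaussian densities is Gaussian, with precisions adding and a precision-weighted mean. The check below (plain Python; the helper name is ours) confirms that C sub-posteriors N(0, Cσ^2) multiply back to the N(0, σ^2) target:

```python
def fuse_gaussians(params):
    """Product of Gaussian densities N(mu_c, var_c), c = 1..C.

    Returns the (mean, variance) of the normalised product density:
    precisions add, and the mean is the precision-weighted average."""
    prec = sum(1.0 / v for _, v in params)
    mean = sum(mu / v for mu, v in params) / prec
    return mean, 1.0 / prec

# C = 10 sub-posteriors, each N(0, C * sigma2), fuse back to N(0, sigma2);
# here sigma2 = 1/m with m = 1000 as in the idealised example.
C, sigma2 = 10, 1e-3
mean, var = fuse_gaussians([(0.0, C * sigma2)] * C)
```

The same helper applies unchanged to heterogeneous sub-posterior means, as in the SSH(γ) setting below.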
[Figure 3: conditional effective sample size (vertical axis) against data size (horizontal axis). Panels: (a) fixed user-specified tuning parameters T and n; (b) recommended scaling of T, fixed n; (c) recommended scaling of T and n.]
Fig. 3.
Conditional effective sample size of Algorithm 1 with increasing data size in the SH(λ) setting of Section 3.3. Solid lines denote initial conditional effective sample size (CESS_0, following Algorithm 1 Step a)). Dotted lines denote averaged conditional effective sample size in subsequent iterations of Algorithm 1 ((Σ_{j=1}^n CESS_j)/n, following Algorithm 1 Step b)).

3.4. The SSH(γ) setting
Now we examine the guidance for T and n in Bayesian Fusion under the SSH(γ) setting of Condition 2. Recall this would be an extreme setting in which sub-posterior heterogeneity does not decay with data size, m. To investigate this setting we consider recovering a target distribution f, which is Gaussian with mean μ = 0 and variance σ^2 = m^{−1}, by using Algorithm 1 to unify C = 2 sub-posteriors with means μ_c = ±0.5 and variance σ_c^2 = 2σ^2. In this scenario, as data size increases the sub-posteriors have increasingly diminishing common support, although our measure of heterogeneity is fixed, with σ_a^2 = 0.25. In this example we consider a range of data sizes from m = 250 to m = 2,500, and use a particle set of size N = 10,000. We again use UE-b (Condition B.2) of Appendix B for simulating the unbiased estimator in Step b(ii)B when implementing Algorithm 1.

As in the SH(λ) setting of Section 3.3, for this SSH(γ) setting we consider CESS_0 and CESS_j (j ∈ {1, . . . , n}) with increasing data size, first with fixed choices for T and n (with n = 5), then choosing a robust scaling of T but with fixed n (as in Section 3.1), and then robustly scaling both T and n (as in Section 3.2). This is presented in Figures 4(a)–4(c) respectively.

It is clear looking at the results for the SSH(γ) setting in Figure 4, and contrasting them with the SH(λ) setting of Figure 3, that the SSH(γ) setting is considerably more challenging. This is to be expected, as the sub-posteriors become increasingly mismatched as data size increases.
However, scaling T and n does substantively improve Algorithm 1, as it did in Section 3.3. Considering the results of fixing T and n in Figure 4(a), it is clear that in this regime Algorithm 1 is degenerate. Incorporating the scaling of T in Figure 4(b) stabilises CESS_0 and leads to a slower degradation of CESS_j with data size. However, incorporating the scaling of both T and n following our earlier guidance in Section 3 retains the stabilised CESS_0 and substantively improves CESS_j to a level where it could lead to a practical algorithm.
[Figure 4: conditional effective sample size (vertical axis) against data size (horizontal axis). Panels: (a) fixed user-specified tuning parameters T and n; (b) recommended scaling of T, fixed n; (c) recommended scaling of T and n.]
Fig. 4.
Conditional effective sample size of Algorithm 1 with increasing data size in the SSH(γ) setting of Section 3.4. Solid lines denote initial conditional effective sample size (CESS_0, following Algorithm 1 Step a)). Dotted lines denote averaged conditional effective sample size in subsequent iterations of Algorithm 1 ((Σ_{j=1}^n CESS_j)/n, following Algorithm 1 Step b)).

3.5. Temporal mesh regularity
In Section 3.2, in order to simplify the guidance for selecting the partition P, we imposed a regular mesh. This allowed us to use the minimal guidance for the temporal distance between points in the partition which we developed in Theorem 5, which, in conjunction with the guidance already established for choosing T in Section 3.1, allowed us to indirectly specify n and in turn P. As discussed in Section 3.2, there may be some advantage to using an irregular mesh (in which the temporal distance between points in the partition decreases as t_j ↑ T). In this section we investigate the impact of using a regular mesh on CESS_j (j ∈ {1, . . . , n}) as a function of the iteration of Algorithm 1.

To investigate temporal regularity we revisit the idealised examples of the SH(λ) and SSH(γ) settings introduced in Sections 3.3 and 3.4 respectively. For both settings we consider a data size of m = 1,000 distributed across C = 2 sub-posteriors, and specify a fixed temporal horizon T and a regular mesh of size n = 10. In implementing Bayesian Fusion we use a particle set of size N = 10,000, and consider the use of two variants for the unbiased estimator in Step b(ii)B when implementing Algorithm 1 – UE-a (Condition B.1) and UE-b (Condition B.2) of Appendix B – UE-a being a relatively straightforward construction, whereas UE-b requires slightly more specification but in general leads to a more robust estimator, as measured by the variance of the estimator.
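The diagnostic used throughout this section is the standard conditional effective sample size of the un-normalised incremental weights, CESS = (Σ_i w_i)^2 / Σ_i w_i^2, which equals N for perfectly uniform weights and approaches 1 as a single particle dominates. A direct implementation (ours, not from any accompanying code) is:

```python
def cess(weights):
    """Conditional effective sample size of un-normalised incremental weights.

    Equals len(weights) when all weights are equal; degenerates towards 1 as
    the weight distribution concentrates on a single particle."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 > 0 else 0.0
```

For example, `cess([2.0] * 100)` returns 100, while a particle set with one dominant weight returns a value close to 1.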
The results are presented in Figure 5.

Considering the SH(λ) setting of Figure 5(a), we find that CESS_j is stable across iterations of Algorithm 1, which would suggest that there is little to be gained, when heterogeneity is low, from having a more flexible irregular mesh. The SSH(γ) setting of Figure 5(b) is slightly more complicated. The results here would suggest that, if using UE-a in the SSH(γ) setting, there may be some advantage to using an irregular mesh to balance CESS_j across the iterations of Algorithm 1. However, in both the SH(λ) and SSH(γ) settings, when using the UE-b unbiased estimator we find that CESS_j is stable. This would suggest that there is little to be gained from specifying an irregular mesh over the regular one we have imposed in Section 3.2. Choosing a good estimator for a regular mesh is far simpler than optimising an irregular mesh for a poor estimator, and so the more critical consideration is to ensure a suitable unbiased estimator is chosen – a full discussion of which can be found in Appendix B.

3.6. Practical considerations
As motivated in the introduction, the primary contribution of this paper is to develop a practical alternative to the
Monte Carlo Fusion approach of Dai et al. (2019) for inference in the fusion problem (simulating from (1)). The methodological development of Section 2, and the practical guidance of Sections 3.1 and 3.2, have been developed to this end. However, in some particular settings where this methodology is applied, it is likely there will be a number of additional specific constraints that necessitate careful implementation, or some modification, of Algorithm 1. For instance, latency in communication between cores may be of particular concern, or, in applications where there is a large amount of data on each individual sub-posterior, the computational efficiency of some quantities in Algorithm 1 may need consideration. In this section we highlight some aspects and minor (non-standard) modifications of the methodology we have developed which may be useful for practitioners.

For the purposes of clarity regarding the primary contributions of this paper, the methodology and examples given elsewhere in the paper do not exploit the modifications we present below. We discuss other more substantial possible directions for the practical

[Figure 5: conditional effective sample size (vertical axis) against iteration j (horizontal axis). Panels: (a) SH(λ) setting, f_c(x) = N(0, Cσ^2); (b) SSH(γ) setting, f_c(x) = N(μ_c, Cσ^2) with μ_c = ±0.5.]
Fig. 5.
Conditional effective sample size at each iteration of Algorithm 1 (j ∈ {1, . . . , 10}) under the SH(λ) and SSH(γ) settings respectively. Solid lines denote results based upon selecting the unbiased estimator ˜ρ_j := ˜ρ^{(a)}_j. Dotted lines denote the unbiased estimator ˜ρ_j := ˜ρ^{(b)}_j.

development of the Bayesian Fusion methodology in the conclusions. We consider the possible modifications to Bayesian Fusion grouped into the constituent elements of Algorithm 1: initialisation; propagation of the particle set; computing importance weights; and normalisation and resampling of the particle set. These are presented in Sections 3.6.1–3.6.4 respectively.

Note that sequential Monte Carlo methods (upon which Bayesian Fusion is based) are in principle well-suited to parallel implementation in distributed environments (see, for instance, Doucet and Lee (2018, Sec. 7.5.3) and Crisan et al. (2018)). A considerable literature has been developed on distributed resampling methodologies (Lee et al., 2010; Murray et al., 2016; Lee and Whiteley, 2016), and methodological adaptations such as distributed particle filters (Bolic et al., 2005; Heine and Whiteley, 2017) and the island particle filter (Vergé et al., 2015). The guidance provided in this subsection may be of interest in developing a truly parallel implementation of Bayesian Fusion, although note that the particularities of the fusion problem make this a challenging problem outwith the scope of this paper. In particular, in the fusion setting the sub-posteriors (and accompanying data) are distributed across the available cores, and a natural implementation of Algorithm 1 would have the particle set common among all cores – this is at odds with the setting typically addressed by the distributed SMC literature. We defer further discussion on this to the conclusions.

3.6.1. Initialisation
The initialisation of Bayesian Fusion as presented in Section 2 utilises the fact that we have access to independent draws from the C sub-posteriors.
In particular, we propose x⃗_0 := x_0^{(1:C)}, where for c ∈ {1, . . . , C}, x_0^{(c)} ∼ f_c (as in Algorithm 1 Step a(ii)A). Composing x⃗_0 requires communication between the cores, and furthermore x⃗_0 requires communication back to the cores for the computation of the proposal importance weight, ρ_0(x⃗_0) (as in Algorithm 1 Step a(ii)B). Although ρ_0(x⃗_0) can be trivially decomposed into a product of C terms corresponding to the contribution from each core separately (4), computing ρ_0(x⃗_0) still requires a third communication between cores during initialisation. Such a level of communication between the cores is undesirable, particularly as latency can make this communication expensive. In this setting, one could attempt to improve the quality of the proposals made on each core (in isolation, noting we do not wish to introduce additional communication), and reduce the level of communication. Choosing some θ̃ ∈ R^d (for instance, by performing a single pre-processing step and choosing θ̃ to be the weighted average of the approximate modes of each sub-posterior), we can modify the proposal distribution for the initial draw from each core to be

˜f_c(x_0^{(c)}) ∝ exp{ −‖x_0^{(c)} − θ̃‖^2 / (2T) } · f_c(x_0^{(c)}),   (22)

compensating for this modification by replacing ρ_0 within Algorithm 1 with

˜ϱ_0(x⃗_0) := exp{ ‖x̄_0 − θ̃‖^2 / (2T/C) }, where x̄_0 = C^{−1} Σ_{c=1}^C x_0^{(c)}.
(23)

The validity of these modifications can be established by noting that

˜ϱ_0(x⃗_0) · Π_{c=1}^C ˜f_c(x_0^{(c)}) ∝ ρ_0(x⃗_0) · Π_{c=1}^C f_c(x_0^{(c)}),

and recalling that re-normalisation within Algorithm 1 removes the need to compute the constant of proportionality for ˜ϱ_0.

Noting that it is possible to sample from (22) on each core in isolation by rejection sampling (using f_c as a proposal), this can be done by each core in parallel in advance of initialising the algorithm, and will lead to improved proposal quality. Furthermore, note that computation of the proposal importance weight, ˜ϱ_0(x⃗_0) in (23), does not require further communication by the cores. In particular, we have removed two of the three communications required in the original formulation of the initialisation of Bayesian Fusion. This simple modification to the Bayesian Fusion algorithm is presented in Algorithm 2.

Algorithm 2: Modified Initialisation (in place of Algorithm 1 Step a(ii)).
(a)(ii) For i in 1 to N,
  A. x⃗_{0,i}: For c in 1 to C, simulate x^{(c)}_{0,i} ∼ ˜f_c. Set x⃗_{0,i} := x^{(1:C)}_{0,i}.
  B. w′_{0,i}: Compute the un-normalised weight w′_{0,i} = ˜ϱ_0(x⃗_{0,i}), as per (23).

3.6.2. Propagation of the particle set
Considering the iterative propagation of the particle set in Algorithm 1 Step b(ii)A, note that for each particle we need to compute M⃗_j and V_j, from (7) and (8). In particular, communication between the cores is required, as the computation of M⃗_j and V_j requires the temporal position of every trajectory over all cores, which then needs to be communicated back to each core. Upon propagation, further communication is required in order to compute the updated importance weight of the particle in Algorithm 1 Step b(ii)B. This is clearly inefficient, and consequently we wish to minimise the number and size of communications. We would instead like to propagate x⃗_{j−1} to x⃗_j by considering the separate propagation of each of the C parallel processes which compose x⃗_{j−1}, namely x^{(c)}_{j−1}, c ∈ {1, . . . , C}. To enable this we exploit the following corollary.

Corollary 2.
Simulating x⃗_j ∼ N(x⃗_{j−1}; M⃗_j, V_j), the required transition from x⃗_{j−1} to x⃗_j in Algorithm 1, can be expressed as

x_j^{(c)} = ( Δ_j^2 / (C(T − t_{j−1})) )^{1/2} ξ_j + ( ((T − t_j)/(T − t_{j−1})) Δ_j )^{1/2} η_j^{(c)} + M_j^{(c)},   (24)

where ξ_j and η_j^{(c)} are standard Gaussian vectors, and M_j^{(c)} is the sub-vector of M⃗_j corresponding to the c th component.

Proof. See Appendix D.

Corollary 2 allows us to propagate the c th trajectory from x⃗_{j−1} to x⃗_j in relative isolation, noting that the interaction with the other trajectories appears solely in the mean of the trajectories at the previous iteration (x̄_{j−1}). Computation of x̄_{j−1} can be conducted at the previous iteration of Algorithm 1, at the same time as the trajectories are communicated for composition and for use in computing the importance weight — thus removing an unnecessary communication. As we already compute x̄_{0,i}, as required in the computation of ρ_0 in Algorithm 1 Step a(ii)B (or alternatively as required by ˜ϱ_0 in Section 3.6.1), incorporating this into Bayesian Fusion requires only a minor modification of Algorithm 1, as presented in Algorithm 3.

Algorithm 3
Modified Propagation (in place of Algorithm 1 Step b(ii)A).
b(ii)A.1. For c in 1 to C, simulate x^{(c)}_{j,i} | (x̄_{j−1,i}, x^{(c)}_{j−1,i}) as per (24).
b(ii)A.2. Set x⃗_{j,i} := x^{(1:C)}_{j,i}, and compute x̄_{j,i} := Σ_{c=1}^C x^{(c)}_{j,i} / C.

3.6.3. Computing importance weights
In many settings it may not be practical to compute the required functionals of each sub-posterior (f_c, c ∈ {1, . . . , C}), rendering the evaluation of φ_c, and in turn ˜ρ_j in Algorithm 1 Step b(ii)B, infeasible. This may be due to a form of intractability of the sub-posteriors (such as the settings considered by Andrieu and Roberts (2009)), or simply that their evaluation is computationally too expensive (such as in the large data settings considered by Pollock et al. (2020)). This particular issue can be circumvented by noting that it is possible to construct an unbiased estimator of ˜ρ_j as follows.

Corollary 3.
The estimator

˜ϱ_j := Π_{c=1}^C [ ( Δ_j^{κ_c} e^{−Ū_j^{(c)} Δ_j} ) / ( κ_c! · p(κ_c | R_c) ) · Π_{k_c=1}^{κ_c} ( Ū_j^{(c)} − φ̂_c( x^{(c)}_{χ_{c,k_c}} ) ) ],

where κ_c, p, R_c and χ_· are as defined in Theorem 3, φ̂_c is an unbiased estimator of φ_c, and Ū_j^{(c)} is a constant such that φ̂_c(x_t^{(c)}) ≤ Ū_j^{(c)} for all x_t^{(c)} ∼ W_{j,c} | R_c, is an unbiased estimator of ˜ρ_j.

Proof. Follows directly from the proof of Theorem 3 in Appendix B.

The estimator ˜ϱ_j in Corollary 3 can be used as an immediate replacement for ˜ρ_j in Algorithm 1 Step b(ii)B, and simulated by direct modification of Algorithm 4. To take advantage of Corollary 3 one simply has to find a suitable unbiased estimator of φ_c, which in many settings will be straightforward to construct, as φ_c is linear in terms of ∇ log f_c(x) and Δ log f_c(x). In finding a suitable unbiased estimator to use in place of ˜ρ_j, it is important to recognise the penalty for its introduction. In particular, introducing the estimator ˜ϱ_j will (typically) increase the variance of the estimator, which will manifest itself in the variance of the particle-set weights in Algorithm 1. To control this we will (typically) require a heavier-tailed choice of discrete distribution p in Corollary 3. An extensive discussion on finding low-variance estimators of the type in Theorem 2 can be found in Appendix B, and can be adapted directly to the setting of Corollary 3. A concrete application of Corollary 3 can be found in Appendix E, where we consider a simple large data setting.

3.6.4. Normalisation and resampling of the particle set
For simplicity in the presentation of Bayesian Fusion, and in our examples, we have employed multinomial resampling of the particle set (Gordon et al., 1993).
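The construction in Corollary 3 above is a Poisson-type unbiased estimator. The sketch below (ours, plain Python, for a single component over an interval of length Δ) makes the mechanics concrete under the simplifying assumption that κ is drawn from a Poisson(ŪΔ) distribution and the evaluation times uniformly on the interval; the leading factors then cancel, leaving a product of factors (Ū − φ̂)/Ū:

```python
import math
import random

def poisson_estimator(phi_hat, U_bar, delta, rng):
    """Unbiased estimator of exp(-integral_0^delta phi(t) dt), one component.

    Illustrative assumption: kappa ~ Poisson(U_bar * delta), with evaluation
    times uniform on [0, delta]. With this choice the leading factors
    Delta^kappa e^{-U_bar Delta} / (kappa! p(kappa)) collapse to U_bar^{-kappa}.
    Requires phi_hat(t) <= U_bar on the interval."""
    # kappa ~ Poisson(U_bar * delta), simulated by counting unit-rate arrivals.
    kappa, t = 0, rng.expovariate(1.0)
    while t <= U_bar * delta:
        kappa += 1
        t += rng.expovariate(1.0)
    est = 1.0
    for _ in range(kappa):
        est *= (U_bar - phi_hat(rng.uniform(0.0, delta))) / U_bar
    return est

# Constant phi = 2 on an interval of length 0.5: the target is exp(-1).
rng = random.Random(1)
draws = [poisson_estimator(lambda t: 2.0, U_bar=3.0, delta=0.5, rng=rng)
         for _ in range(200_000)]
approx = sum(draws) / len(draws)
```

As the text notes, a looser bound Ū (or a heavier-tailed choice of p) changes the variance of the estimator but not its expectation.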
It is common within the sequential Monte Carlo (SMC) literature for alternative resampling schemes to be employed which minimise the introduction of additional variance, and these can be used in place of multinomial resampling within Bayesian Fusion (typically with better performance (Douc et al., 2005)). These include systematic resampling (Kitagawa, 1996), stratified resampling (Carpenter et al., 1999) and residual resampling (Higuchi, 1997; Liu and Chen, 1998). Further detail on resampling schemes can be found in Doucet et al. (2001), which includes a review of more advanced methodologies.
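As an illustration, systematic resampling can be sketched in a few lines (ours; it assumes the weights have already been normalised to sum to one):

```python
import random

def systematic_resample(weights, rng):
    """Systematic resampling (Kitagawa, 1996): a single uniform draw, then a
    regular grid of N points traversing the cumulative normalised weights.
    Returns the N ancestor indices."""
    N = len(weights)
    u0 = rng.random() / N              # single uniform on [0, 1/N)
    indices, cum, j = [], weights[0], 0
    for i in range(N):
        u = u0 + i / N                 # regular grid point
        while u > cum and j < N - 1:   # advance through cumulative weights
            j += 1
            cum += weights[j]
        indices.append(j)
    return indices
```

Only one uniform random variate is consumed, and each index is replicated close to N·w_i times, which is the source of the variance reduction relative to multinomial resampling.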
4. Illustrative comparisons with competing methodologies
In this section we contrast Bayesian Fusion with competing methodologies. In Section 4.1 we compare it to Monte Carlo Fusion (Dai et al., 2019). In Section 4.2 we consider a simple logistic regression model and the relative performance of Bayesian Fusion with the approximate
Consensus Monte Carlo (Scott et al., 2016) and
Weierstrass Refinement Sampler (Wang and Dunson, 2013) methodologies. Note that in both subsections we consider pedagogical examples in order to illustrate the strengths and weaknesses of each methodology, before returning to more substantive examples in Section 5.
4.1. Comparison with Monte Carlo Fusion
As discussed in the introduction and Section 2.1, the primary motivation for developing Bayesian Fusion is to address the scalability of the (otherwise exact)
Monte Carlo Fusion approach of Dai et al. (2019). Recall that Monte Carlo Fusion is a rejection-sampling based approach, and as a consequence, to be computationally practical, requires acceptance probabilities which are sufficiently large. However, when contrasting Bayesian Fusion with competing methodologies in Section 4.2 in more realistic (albeit idealised) settings, and when considering the practical application of Bayesian Fusion in Section 5, Monte Carlo Fusion proves to be impractical. As Monte Carlo Fusion is the progenitor of this methodological approach to the fusion problem of (1), we explicitly contrast the scalability of Monte Carlo Fusion and Bayesian Fusion in idealised settings well suited to Monte Carlo Fusion.

In our first scenario, illustrated in Figure 6(a), we consider the fusion of an increasing number of identical sub-posteriors. The challenge in this setting is that, despite the sub-posterior homogeneity, the (fusion) target we want to recover becomes increasingly concentrated relative to the sub-posteriors. In Figure 6(b), we consider the fusion of two Gaussian sub-posteriors with the same variance but with different means, and consider the computational cost of each methodology to achieve a fixed ESS, while fixing T, as the means of the sub-posteriors increase in distance from one another. This corresponds (loosely) to the increasing sub-posterior heterogeneity scenario of Section 3.

It is clear in both scenarios in Figure 6 that, even without employing the optimised guidance on the implementation of Bayesian Fusion of Section 3 (in particular for T in Figure 6(b)), Bayesian Fusion still has far better scaling properties and offers considerable advantage over Monte Carlo Fusion, even in idealised settings well suited to Monte Carlo Fusion.

4.2. Logistic regression
In this section we study the performance of Bayesian Fusion against other competing (approximate) methodologies for simulating from (1).
A synthetic data set of size m = 1,000 was simulated from the following logistic regression model:

y_i = 1 with probability exp{z_i^T β} / (1 + exp{z_i^T β}), and y_i = 0 otherwise.   (25)

[Figure 6: log(computational cost per ESS) (vertical axis). Panels: (a) increasing C in the SH(λ) setting, with f(x) ∝ N(0, 1/m) and f_c(x) ∝ f^{1/C}(x); (b) increasing sub-posterior heterogeneity, with two Gaussian component densities with means ±μ/2 and T = 1, plotted against the difference between component means.]
Fig. 6.
Log computational cost comparison of Bayesian Fusion (red dashed line) and Monte Carlo Fusion (blue solid line) in idealised scaling scenarios.
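Simulating from a model of the form (25) is straightforward; the sketch below (plain Python) is illustrative only, with the value of β and the covariate distribution chosen as placeholders rather than those of the paper:

```python
import math
import random

def simulate_logistic(m, beta, rng):
    """Simulate m records from the logistic regression model of (25):
    y_i = 1 with probability exp(z_i^T beta) / (1 + exp(z_i^T beta)).
    Each z_i is (1, x_i): an intercept plus a single Gaussian covariate
    (the covariate distribution here is a placeholder assumption)."""
    data = []
    for _ in range(m):
        z = (1.0, rng.gauss(0.0, 1.0))
        eta = sum(b * zj for b, zj in zip(beta, z))
        p = 1.0 / (1.0 + math.exp(-eta))     # logistic link
        y = 1 if rng.random() < p else 0
        data.append((z, y))
    return data

rng = random.Random(42)
data = simulate_logistic(1000, beta=(-3.0, 1.0), rng=rng)  # illustrative beta
```

A strongly negative intercept makes positive responses rare, mirroring the scarcity of positive responses discussed in the text.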
The true β (with the first co-ordinate corresponding to the intercept) was fixed with both co-ordinates negative. Each record contained a single covariate in addition to an intercept, which was independently simulated from a Gaussian distribution. The BayesLogit R package was used to fit logistic regression on the entire data set, using a Gaussian prior distribution for both β_0 and β_1. We term this the benchmark posterior distribution, and use it to compare methodologies in this section.

For this data set we consider recovering the posterior distribution by unifying sub-posteriors across an increasing number of cores C (up to C = 40). To obtain our C sub-posteriors we evenly distributed the data among the C cores (m_c = m/C), and for each core we specified a prior distribution by raising the prior distribution specified for the entire data set to the power 1/C. We then fit logistic regression using the BayesLogit R package. The specification we have chosen (data size, parameterisation, and number of cores) is particularly challenging for all methodologies when considering the fusion problem in (1). The lack of data on each core (particularly in the case where C = 40), and the scarcity of positive responses in the entire data set (we had Σ_i y_i = 30), result in sub-posteriors which are both irregular and exhibit a high degree of dissimilarity, and which are consequently difficult to unify. To illustrate this, the sub-posterior marginals are shown in Figure 7 for the case where C = 10.

We contrasted Bayesian Fusion (Algorithm 1) with the approximate Consensus Monte Carlo (CMC) method of Scott et al. (2016), and the approximate
Weierstrass Refinement Sampler (WRS) of Wang and Dunson (2013). Recall that, unlike the approximate schemes in the literature, Bayesian Fusion is an asymptotically consistent methodology, and so increased posterior accuracy can be obtained over these competing methodologies given sufficient computational budget. We also attempted to implement
[Figure 7: marginal density (vertical axis). Panels: (a) β_0; (b) β_1.]
Fig. 7. Sub-posterior marginals for the logistic regression problem of Section 4.2 with C = 10. The black solid line denotes the benchmark posterior distribution. Purple dashed lines denote the sub-posterior distributions.

Monte Carlo Fusion (Dai et al., 2019) and the
Weierstrass Rejection Sampler (Wang and Dunson, 2013), but the small acceptance probabilities resulting from the dissimilarity of the sub-posteriors, together with the number of sub-posteriors, rendered applying these methodologies computationally infeasible. Bayesian Fusion was implemented following the guidance in Section 3 with a particle set of size N = 10,000, and the other methodologies were implemented following the guidance suggested by their authors and tuned to this particular data set.

The marginal densities for each methodology were obtained together with their running times, and are presented in Figure 8. As the competing methodologies are approximate, it is important to determine the accuracy of each method for the given computational budget. To do so we define and compute the Integrated Absolute Distance (IAD) for each method with respect to the benchmark distribution we obtained earlier. We obtain the IAD by simply considering the difference between the marginal for the methodology and the benchmark for each dimension. In particular,

IAD := (1/d) Σ_{j=1}^d ∫ | f̂(θ_j) − f(θ_j) | dθ_j ∈ [0, 2],   (26)

where f is the benchmark distribution and f̂ is the distribution obtained from the methodology employed, both computed using a kernel density estimate as necessary. It is important to note that, as Bayesian Fusion is asymptotically consistent, when considering the IAD and the associated running time (Figures 8(c) and 8(d) respectively) this is one possible combination – the IAD can be improved to a user-specified accuracy for Bayesian Fusion given sufficient computational budget (which is not true for the other schemes in the literature).

Considering solely the computational cost of each scheme (Figure 8(d)), it is clear that Consensus Monte Carlo (CMC) is substantially faster than both Bayesian Fusion and the Weierstrass Refinement Sampler (WRS).
This is largely due to the desirable lack of communication between cores that CMC achieves. However, scrutinising the marginals in Figure 8(a) and Figure 8(b), it is apparent that even for this standard two-dimensional logistic regression problem, CMC incorrectly estimates both the location of the modes (in particular β ) and the tail structure of the benchmark distribution. This is summarised by the IAD in Figure 8(c), further suggesting the methodology is not robust when unifying the target distribution with increasing numbers of cores, C. The WRS substantially improves upon CMC, appearing to better capture both the mode and tail structure of the benchmark distribution, and seeming to be more robust to increasing numbers of cores. However, Bayesian Fusion recovers the benchmark distribution for only a modest increase in computational budget over the WRS. Indeed, the IAD obtained for Bayesian Fusion in Figure 8(c) is driven by Monte Carlo error (and not approximation error), and so could be further improved if necessary by increasing the computational budget. In truly large data or distributed network settings, Bayesian Fusion could be further optimised by incorporating the extensions discussed in Section 3.6.
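As an aside, the IAD of (26) is straightforward to compute from posterior samples. The following sketch is our own illustration (the function name and the fixed-bandwidth Gaussian KDE are assumptions, not the paper's implementation): it averages the integrated absolute difference of kernel density estimates over the d marginals.

```python
import numpy as np

def iad(samples_hat, samples_bench, grid_size=512, bandwidth=0.2):
    """Integrated Absolute Distance, as in (26): the absolute difference of
    two marginal density estimates, integrated and averaged over the d
    dimensions. Takes values in [0, 2]."""
    samples_hat = np.atleast_2d(np.asarray(samples_hat, dtype=float))
    samples_bench = np.atleast_2d(np.asarray(samples_bench, dtype=float))
    d = samples_hat.shape[1]
    norm = bandwidth * np.sqrt(2.0 * np.pi)
    total = 0.0
    for j in range(d):
        a, b = samples_hat[:, j], samples_bench[:, j]
        grid = np.linspace(min(a.min(), b.min()) - 4 * bandwidth,
                           max(a.max(), b.max()) + 4 * bandwidth, grid_size)
        # Fixed-bandwidth Gaussian kernel density estimates on a common grid
        f_hat = np.exp(-0.5 * ((grid[:, None] - a[None, :]) / bandwidth) ** 2).mean(axis=1) / norm
        f_ben = np.exp(-0.5 * ((grid[:, None] - b[None, :]) / bandwidth) ** 2).mean(axis=1) / norm
        total += np.trapz(np.abs(f_hat - f_ben), grid)
    return total / d
```

Identical sample sets give an IAD of zero, while samples concentrated on disjoint regions give an IAD approaching the upper bound of 2.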
5. Examples
5.1. U.S. Census Bureau population surveys
In this example we applied Bayesian Fusion to the 1994 and 1995 U.S. Census Bureau population surveys, obtained from Bache and Lichman (2013), and of size m = 199, . For the purposes of this example we investigated the effect of education on gross income. We took gross income as our observed data, treating it as binary, taking a value of one if income was greater than $ , . An income in excess of $ , is moderately rare, with only , individuals exceeding this threshold (representing approximately % of the data). In addition to the intercept, we extracted three further education covariates indicating educational stages attained by the individual (each of which was binary). We then fitted the logistic regression model of (25), with prior distribution N(0, I), to the data set to obtain a benchmark posterior distribution against which to assess the quality of Bayesian Fusion. The data size for this example exceeded the capabilities of the BayesLogit R package used in Section 4.2, and so we instead obtained our benchmark by applying Markov chain Monte Carlo to the full data set.

For this data set we considered recovering the benchmark distribution by unifying sub-posteriors across an increasing number of cores C ∈ { , , }. We again contrasted Bayesian Fusion with Consensus Monte Carlo (CMC) and the Weierstrass Refinement Sampler (WRS). To construct sub-posteriors we distributed the data among the available C cores, and fitted the logistic regression model of (25) to each using a prior obtained by raising the prior distribution specified for the entire data set to the power 1/C. In contrast with Section 4.2, in constructing the sub-posteriors we did not allocate the data randomly (or evenly) among the C cores. This is more representative of a typical application, and introduces dis-similarity in the sub-posteriors, particularly when observations or covariates are rare.
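The sub-posterior construction just described can be sketched as follows (a minimal illustration in our own notation; `log_subposterior` and its arguments are hypothetical names, not from the paper's implementation). Each core evaluates its shard's log-likelihood plus the full prior log-density scaled by 1/C, so that the product of the C sub-posteriors recovers the full posterior.

```python
import numpy as np

def log_subposterior(beta, X_shard, y_shard, C, prior_var=1.0):
    """Log-density (up to an additive constant) of one logistic regression
    sub-posterior: the shard's log-likelihood plus the N(0, prior_var * I)
    log-prior raised to the power 1/C."""
    eta = X_shard @ beta
    # Numerically stable log(1 + exp(eta)) via logaddexp
    loglik = float(np.sum(y_shard * eta - np.logaddexp(0.0, eta)))
    logprior = -0.5 * float(beta @ beta) / prior_var
    return loglik + logprior / C
```

Summing the C sub-posterior log-densities recovers the full-data log-posterior exactly, which is the defining property of the fusion problem in (1).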
For instance, in the case where C = 40, three of the cores contained data comprising in excess of % of the individuals earning in excess of $ , . Bayesian Fusion was implemented with a particle set of size N = 30, , following the guidance of Section 3. CMC and WRS were again implemented as fairly as possible, following the guidance suggested by the authors.

Fig. 8. Comparison of competing algorithms to Bayesian Fusion applied to the logistic regression problem of Section 4.2. Upper marginal densities are of the C = 40 case. Black solid lines denote the benchmark fitted target distribution. Red dashed lines denote Bayesian Fusion. Blue dotted lines denote Consensus Monte Carlo (CMC). Orange dotted and dashed lines denote the Weierstrass Refinement Sampler (WRS). Panels: (a) and (b) marginal densities; (c) IAD with respect to the benchmark; (d) log running times.

The marginal densities are presented in Figure 9 for the C = 40 setting, and in Figure 10 we again present the Integrated Absolute Distance (IAD, see (26)) of each methodology with respect to the benchmark distribution, together with their computational costs for the range of cores considered. For this data set CMC performs extremely poorly, capturing neither the marginals of the benchmark distribution (particularly β and β ) nor showing any robustness with respect to the number of cores. Considering the marginals in Figure 9, the WRS substantially improves upon CMC (only having difficulty capturing the benchmark for β and β ). However, for slightly more computational expenditure (Figure 10(b)), Bayesian Fusion substantially improves upon the IAD of the WRS (Figure 10(a)), and also appears to show robustness with increasing C.

Fig. 9.
Marginal density estimates of Bayesian Fusion and competing algorithms applied to the U.S. Census Bureau population survey data set of Section 5.1. Black solid lines denote the benchmark fitted target distribution. Red dashed lines denote Bayesian Fusion. Blue dotted lines denote Consensus Monte Carlo (CMC). Orange dotted and dashed lines denote the Weierstrass Refinement Sampler (WRS). Panels (a)–(d): marginal densities of the four regression coefficients.
Dai, Pollock and Roberts
5.2. U.K. road accidents
In this example we considered the ‘Road Safety Data’ data set published by the Department for Transport of the U.K. government (gov.uk, 2019). It comprises road accident data from 2011–2018, and in total is of size m = 1, , . We took our observation for each record to be binary, taking a value of one if a severe accident was recorded. In total in the full data set there were , such severe accidents. We selected a number of covariates to investigate what effect (if any) they have on accident severity. In particular, and in addition to an intercept, we considered road speed limit, lighting condition (which we treated as binary, taking a value of one if lighting was good and zero if lighting was poor), and weather condition (binary, taking one if good and zero if poor). The logistic regression model of (25) was fitted to the data set, together with a N(0, I) prior distribution. Our benchmark posterior distribution was obtained by applying Markov chain Monte Carlo to the full data set.

We again considered recovering the benchmark distribution by unifying sub-posteriors across an increasing number of cores C ∈ { , , }. The sub-posteriors were obtained following the same approach as Section 5.1, with the allocation of data to each core being in temporal order. We contrasted Bayesian Fusion with a particle set of size N = 30, , against fair implementations of Consensus Monte Carlo (CMC) and the Weierstrass Refinement Sampler (WRS). Marginal densities for the C = 40 setting are presented in Figure 12, and the IAD (26) with respect to the benchmark, together with the computational costs for the range of cores considered, in Figure 11. The results are in keeping with those of Section 5.1: CMC performs extremely poorly, and for a modest increase in computational budget Bayesian Fusion obtains substantially better results than the WRS.
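To illustrate why allocating records in temporal (file) order induces sub-posterior dis-similarity, consider the following toy sketch (entirely synthetic; the rates and sizes are illustrative assumptions, unrelated to the data sets above). When the rate of a rare outcome drifts over time, ordered shards have far more variable outcome frequencies than randomised shards.

```python
import numpy as np

rng = np.random.default_rng(3)
n, C = 12000, 4
# Rare-outcome probability drifting over "time" (record order)
p = np.linspace(0.01, 0.20, n)
y = (rng.uniform(size=n) < p).astype(float)

# Shard outcome frequencies under temporal-order vs randomised allocation
ordered = np.array([s.mean() for s in np.array_split(y, C)])
shuffled = np.array([s.mean() for s in np.array_split(y[rng.permutation(n)], C)])

spread_ordered, spread_shuffled = np.ptp(ordered), np.ptp(shuffled)
```

The ordered shards' outcome frequencies span a substantially wider range than the randomised shards', which is precisely the kind of heterogeneity that degrades approximate unification schemes.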
6. Conclusions
In this paper we have developed a theoretical framework, and scalable sequential Monte Carlo (SMC) methodology, for unifying distributed statistical analyses on shared parameters from multiple sources (which we term sub-posteriors) into a single coherent inference. The work significantly extends the theoretical underpinning, and addresses the practical limitations, of the exact Monte Carlo Fusion approach of Dai et al. (2019). Monte Carlo Fusion is a rejection-sampling-based approach, and the first methodology for general settings in which the product pooled posterior distribution of the fusion problem in (1) is recovered without approximation. However, it lacked scalability with respect to the number of sub-posteriors to be unified, and robustness to sub-posterior dis-similarity. This is addressed by the
Bayesian Fusion approach introduced in this paper, resulting in a methodology which both recovers the correct target distribution and is computationally competitive with leading approximate schemes. Fundamental to the methodology introduced is the construction of the fusion measure via an SMC procedure driven by the SDE in (2), leading to Algorithm 1.

In addition to the theoretical and methodological development of Bayesian Fusion presented in Section 2, in Section 3 we provide concrete theory and guidance on how to choose the free parameters of Algorithm 1 to ensure robustness with increasing numbers of sub-posteriors, and robustness to sub-posterior dis-similarity. This includes, in Section 3.6, extensive practical guidance on how Bayesian Fusion may be implemented
Fig. 10. Performance of Bayesian Fusion and competing algorithms applied to the U.S. Census Bureau population survey data set of Section 5.1. Black solid lines denote the benchmark fitted target distribution. Red dashed lines denote Bayesian Fusion. Blue dotted lines denote Consensus Monte Carlo (CMC). Orange dotted and dashed lines denote the Weierstrass Refinement Sampler (WRS). Panels: (a) IAD with respect to the benchmark; (b) log running times.

Fig. 11. Performance of Bayesian Fusion and competing algorithms applied to the U.K. road accident data set of Section 5.2. Black solid lines denote the benchmark fitted target distribution. Red dashed lines denote Bayesian Fusion. Blue dotted lines denote Consensus Monte Carlo (CMC). Orange dotted and dashed lines denote the Weierstrass Refinement Sampler (WRS). Panels: (a) IAD with respect to the benchmark; (b) log running times.
Fig. 12. Marginal density estimates of Bayesian Fusion and competing algorithms applied to the U.K. road accident data set of Section 5.2. Black solid lines denote the benchmark fitted target distribution. Red dashed lines denote Bayesian Fusion. Blue dotted lines denote Consensus Monte Carlo (CMC). Orange dotted and dashed lines denote the Weierstrass Refinement Sampler (WRS). Panels (a)–(d): marginal densities of the four regression coefficients.

by practitioners, including in distributed network and big data settings.

Section 4 provides an extensive comparison of the performance of Bayesian Fusion, and of competing approximate methodologies, in an idealised synthetic data setting. In Section 5 we apply Bayesian Fusion, together with the competing approximate methodologies, to real data sets: the ‘U.S. Census Bureau population surveys’ data set in Section 5.1, and a ‘U.K. road accidents’ data set in Section 5.2. In all settings our implementation of Bayesian Fusion performs extremely well, demonstrating appreciable scope for its broader application.

One of the key advantages of Bayesian Fusion is that it is underpinned methodologically by sequential Monte Carlo (SMC), which allows us to leverage many of the existing theoretical results and methodology found in that literature. As is typical within SMC, it is desirable to minimise the discrepancy between the sequence of proposal and target distributions. In our setting this entails ensuring that the propagated temporal marginal of g in (11) (say g^N_{j-1}) is well matched with the following temporal marginal of g (say g^N_j). Although not emphasised within the main text, there is clear scope to improve Bayesian Fusion in this sense by modifying the diffusion theory presented in its development (Appendix A) to one which better incorporates information about each sub-posterior (for instance, knowledge of the volume of data on each core). One approach explored in Dai et al. (2019, Sec.
4) is to consider an underlying Ornstein-Uhlenbeck proposal measure (appropriately parameterised), which could well approximate posterior distributions which are approximately Gaussian, and thus lead to better propagation of temporal marginals if incorporated within Bayesian Fusion. Another feasible direction is to estimate the covariance structure of each sub-posterior and transform the C spaces accordingly, which would lead to the Brownian proposals being better attuned to the target distribution (Chan et al., 2021). This would be equivalent to modifying the Fusion measure in (3), in which the transition densities for each sub-posterior are those of a Langevin diffusion with unit volatility, to one with volatility which matches the covariance structure of its respective sub-posterior. The theory remains valid provided the transition densities of the chosen diffusion in (3) have the same invariant distribution, and the proposal chosen has matching volatility.

We have provided considerable practical guidance in Section 3.6 to render many aspects of Bayesian Fusion which are non-standard, due to the particularities of the fusion problem, into standard SMC structures. A truly parallel implementation of Bayesian Fusion is a very attractive prospect for future development. As discussed in Section 3.6.4, although SMC is inherently well suited to parallel implementation in distributed environments (Doucet and Lee, 2018), in the fusion setting the natural direct interpretation of Bayesian Fusion would be to consider the sub-posteriors (and associated data) as being distributed across cores, but the particle set as shared across all cores. This is not the setting typically addressed by the distributed SMC literature, and raises interesting challenges which require further innovation to resolve; for instance, developing theory to support methodology in which the particles are not shared by all cores.

A number of other methodological directions for Bayesian Fusion are possible.
As presented in Section 2 and Section 3, the C sub-posteriors are unified together in a ‘fork-and-join’ manner. An alternative would be to unify the sub-posteriors gradually in stages, constructing a tree to perform the operation hierarchically, for instance by exploiting ‘divide-and-conquer’
SMC theory and methodologies such as that of Lindsten et al. (2017). Another direction would be to consider how approximations could be used within the methodology. Many approximate approaches tackling the fusion problem are highly computationally efficient, albeit at the expense of introducing an approximation error which can be difficult to quantify and on occasion significant. The work of Wang et al. (2019) constructs an explicit Monte Carlo scheme in which approximations can be readily used to develop exact Monte Carlo schemes. There is tangible theory linking this paper with Pollock et al. (2020) and Wang et al. (2019), and so finding a similar approach to embedding approximations may be viable.

There is considerable scope for application of Bayesian Fusion, as inference in the setting of (1) arises directly and indirectly in many interesting practical settings. Many of these applications were discussed in Section 1. One interesting direction considers the use of Fusion methodologies within the
Markov melding framework of Goudie et al. (2019), in which a modular approach is taken to statistical inference: separate sub-models are fitted to data sources in isolation (often of varying dimensionality), and then joined. This type of application would necessitate theoretical developments to the Fusion methodologies to support sub-posteriors of mismatched dimension. However, such a theoretical development, when combined with ‘divide-and-conquer’
SMC theory such as that developed in Lindsten et al. (2017), may also make Fusion methodologies more robust to increasing dimensionality.

A number of future directions for the Bayesian Fusion methodology are currently being pursued by the authors. One interesting avenue of research is to apply Fusion methodologies within statistical cryptography. In the simplest setting, a number of trusted parties who wish to securely share their distributional information on a common parameter space and model, but would prefer not to reveal their individual-level distributions, could do so by applying cryptographic techniques and exploiting the exactness, and the linear contributions to computations of the individual sub-posteriors, within the Fusion approach. In a further example, the authors are investigating the application of Bayesian Fusion for purely algorithmic reasons. One motivation for this (rather like the motivation for tempering MCMC approaches) is that the simulation of a multimodal target density could be prohibitively difficult, whereas the target density might be readily written as a product of densities with less pronounced multimodal behaviour, thus making it far more amenable to Monte Carlo sampling (see Chan et al. (2021)).
7. Acknowledgements
We would like to thank Louis Aslett, Ryan Chan, Paul Jenkins, Yuxi Jiang and Adam Johansen for helpful discussions on aspects of the paper. This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) grant numbers EP/K014463/1, EP/N031938/1, EP/R018561/1, EP/R034710/1, and the Alan Turing Institute's Defence & Security Programme. The authors would like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the programme ‘Scalable inference; statistical, algorithmic, computational aspects (SIN)’, where aspects of the work in this paper were undertaken.

A. Background notation, and proof of Theorem 1 and Theorem 2
We begin by more formally considering the C correlated continuous-time Markov processes on [0, T] introduced in Section 2, which are initialised separately but coalesce to a single point y at time T. A typical realisation of the object X = \{\vec{x}_t, t \in [0,T]\} = \{x^{(c)}_t, c \in \{1,\dots,C\}, t \in [0,T]\} is given in Figure 2, and is defined on the following space \Omega:

Definition A.1 (\Omega).
\[
\Omega := \left\{ X : x^{(c)} \in C^d[0,T],\; c \in \{1,\dots,C\},\; x^{(1)}_T = \cdots = x^{(C)}_T = y \right\},
\]
where C^d[0,T] denotes the d-dimensional continuous function space with domain [0,T].

In proving the results presented in this appendix, we impose the following regularity assumptions.
Assumption A.1. \nabla \log f_c(x) is at least once continuously differentiable, where \nabla is the gradient operator.

Assumption A.2. \phi_c(x) is bounded below by some \Phi_c \le \inf\{\phi_c(x) : x \in \mathbb{R}^d\} \in \mathbb{R}.

Both Assumptions A.1 and A.2 are easily verified in practice and will typically be satisfied in many statistical applications. Assumption A.1 is required in order to establish the Radon-Nikodým derivative in the proof of Theorem 2 below, although in principle it could be weakened to accommodate discontinuous drifts following the approach of Papaspiliopoulos et al. (2016) (at the expense of adapting Theorem 2 and complicating the resulting methodology). Within the context of the sequential Monte Carlo methodology developed in Section 2.1, Assumption A.2 can be weakened, but as discussed in Appendix B it ensures the estimator presented in Theorem 3 has finite variance (and so ensures robustness of Algorithm 1).

We can now proceed to the proof of Theorem 1, and show that if X \sim F, then the marginal y (:= \vec{x}_T) \sim f as desired.

Proof (Theorem 1).
We begin by marginalising F onto the values of \vec{x}_T. Since all densities are written with respect to \mathbb{P}, we first take an expectation with respect to F of each of the C coalescing diffusion paths (\{x^{(c)}_t, t \in [0,T]\}_{c=1}^C), conditioning on their respective endpoints (for the c-th path, x^{(c)}_0 and x^{(c)}_T respectively). Note that by construction these paths are independent Brownian bridges. The calculation for the remaining expectation appears in Dai et al. (2019, Prop. 2). Therefore the marginal distribution of the common endpoint y := x^{(1)}_T = \cdots = x^{(C)}_T has density f.

To show that the law of C independent Brownian motions initialised from their respective distributions (\vec{x}_0 = \{x^{(c)}_0\}_{c=1}^C, where x^{(1)}_0 \sim f_1, \dots, x^{(C)}_0 \sim f_C) and conditioned to coalesce at time T satisfies (2), we use Doob h-transforms (see for instance Rogers and Williams (2000, Section IV.6.39)). As such, we introduce the space-time harmonic function
\[
h(t, \vec{x}_t) = \int \prod_{c=1}^{C} \frac{1}{(2\pi (T-t))^{d/2}} \exp\left\{ -\frac{\| y - x^{(c)}_t \|^2}{2(T-t)} \right\} \mathrm{d}y,
\]
which represents the integrated density of coalescence at time T given the current state \vec{x}_t. As a consequence, the C conditioned processes satisfy an SDE of the form
\[
\mathrm{d}\vec{X}_t = \mathrm{d}\vec{W}_t + \nabla \log h(t, \vec{X}_t)\, \mathrm{d}t,
\]
where \nabla \log h(t, \vec{x}_t) = (v^{(1)}_t, \dots, v^{(C)}_t) is the concatenation of C d-dimensional vectors (which we denote \{v^{(c)}_t\}_{c=1}^C).
Considering the c-th term we have
\[
v^{(c)}_t
= \frac{\int \frac{y - x^{(c)}_t}{T-t} \prod_{c'=1}^{C} \frac{1}{(2\pi(T-t))^{d/2}} \exp\left\{-\frac{\|y - x^{(c')}_t\|^2}{2(T-t)}\right\} \mathrm{d}y}{\int \prod_{c'=1}^{C} \frac{1}{(2\pi(T-t))^{d/2}} \exp\left\{-\frac{\|y - x^{(c')}_t\|^2}{2(T-t)}\right\} \mathrm{d}y}
= \frac{\int \frac{y}{T-t} \left(\frac{C}{2\pi(T-t)}\right)^{d/2} \exp\left\{-\frac{C\|y - \bar{x}_t\|^2}{2(T-t)}\right\} \mathrm{d}y}{\int \left(\frac{C}{2\pi(T-t)}\right)^{d/2} \exp\left\{-\frac{C\|y - \bar{x}_t\|^2}{2(T-t)}\right\} \mathrm{d}y} - \frac{x^{(c)}_t}{T-t}
= \frac{\bar{x}_t - x^{(c)}_t}{T-t}.
\]
As a consequence we have
\[
\nabla \log h(t, \vec{x}_t) = \left( \frac{\bar{x}_t - x^{(1)}_t}{T-t}, \frac{\bar{x}_t - x^{(2)}_t}{T-t}, \dots, \frac{\bar{x}_t - x^{(C)}_t}{T-t} \right),
\]
and (2) holds as required.

Simulating from \mathbb{P} without discretisation error relies on having explicit access to the finite-dimensional distributions of the process given by the SDE in (2). This is established by Theorem 2.

Proof (Theorem 2). (a) From Theorem 1 we have established that X \sim F is Markov (in time), and so without loss of generality we need only consider its incremental distribution. For X \sim \mathbb{P} we have that, for all c \in \{1,\dots,C\}, \{x^{(c)}_t, t \in [0,T]\} is the realisation of a d-dimensional Brownian bridge conditioned to start at x^{(c)}_0 and end at y = x^{(c)}_T. Furthermore, under \mathbb{P} we have that, conditional on \vec{x}_0, y is distributed according to a Gaussian distribution with mean \bar{x}_0 and covariance matrix (T/C) \cdot I_{d \times d}. To derive the joint density of \vec{X}_t conditional on \vec{X}_s (for 0 \le s < t < T), we begin by considering the joint d(C+1)-dimensional density of \vec{X}_t and y conditional on \vec{X}_s, which we denote by p_1.
\[
-2\log p_1 = \frac{C\|y - \bar{x}_s\|^2}{T-s} + \sum_{c=1}^{C} \frac{T-s}{(t-s)(T-t)} \left\| x^{(c)}_t - \frac{t-s}{T-s}\, y - \frac{T-t}{T-s}\, x^{(c)}_s \right\|^2
\]
\[
= \frac{C}{T-t}\left[ \|y\|^2 - 2\, y' \bar{x}_t \right] + \sum_{c=1}^{C} \frac{T-s}{(t-s)(T-t)} \left\| x^{(c)}_t - \frac{T-t}{T-s}\, x^{(c)}_s \right\|^2 + k_1
\]
\[
= \frac{C}{T-t}\|y - \bar{x}_t\|^2 - \frac{C}{T-t}\|\bar{x}_t\|^2 + \frac{T-s}{(t-s)(T-t)} \sum_{c=1}^{C} \left\| x^{(c)}_t - \frac{T-t}{T-s}\, x^{(c)}_s \right\|^2 + k_1,
\]
where k_1 is a constant. Now, integrating out y, we obtain the dC-dimensional density of \vec{X}_t conditional on \vec{X}_s, which we denote p_2:
\[
-2\log p_2 = -\frac{1}{T-t}\,\frac{\left\|\sum_{c=1}^{C} x^{(c)}_t\right\|^2}{C} + \frac{T-s}{(t-s)(T-t)} \sum_{c=1}^{C} \left\| x^{(c)}_t - \frac{T-t}{T-s}\, x^{(c)}_s \right\|^2 + k_2
= \vec{x}_t' V^{-1}_{s,t} \vec{x}_t - 2\, \vec{x}_t' V^{-1}_{s,t} \vec{M}_{s,t} + k_3,
\]
where k_2 and k_3 are constants, and V_{s,t} and \vec{M}_{s,t} are terms we now derive. We have V^{-1}_{s,t} = \Sigma^{-1} \otimes I_{d \times d}, with
\[
\Sigma^{-1} = \frac{T-s}{(t-s)(T-t)}\, I_{C \times C} - \frac{1}{C(T-t)}\, J_{C \times C},
\]
where J_{C \times C} is the C \times C matrix in which every element is 1. Inverting \Sigma^{-1} we have
\[
\Sigma_{ii} = \frac{(t-s)(T-t)}{T-s} + \frac{(t-s)^2}{C(T-s)}, \qquad \Sigma_{ij} = \frac{(t-s)^2}{C(T-s)}.
\]
\vec{M}_{s,t} can be written as (M^{(1)}_{s,t}, \dots, M^{(C)}_{s,t}), with
\[
M^{(c)}_{s,t} = \frac{T-t}{T-s}\, x^{(c)}_s + \frac{t-s}{T-s}\, \bar{x}_s,
\]
as required by the statement of the theorem.

(b) From Dai et al. (2019) we have that, for c \in \{1,\dots,C\}, the law of \{x^{(c)}_t, t \in (0,T)\}, conditional on the endpoints x^{(c)}_0 and y, is that of a Brownian bridge. As a consequence, the result holds from standard properties of Brownian bridges.
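The inversion step in the proof of Theorem 2 is easy to verify numerically. The following sketch (with arbitrary illustrative values of C, s, t and T) checks that the stated entries of \Sigma do invert \Sigma^{-1} = \frac{T-s}{(t-s)(T-t)} I - \frac{1}{C(T-t)} J:

```python
import numpy as np

# Arbitrary illustrative values (any 0 <= s < t < T and C >= 2 will do)
C, s, t, T = 4, 0.2, 0.5, 1.0

# Sigma^{-1} as in the proof of Theorem 2
a = (T - s) / ((t - s) * (T - t))
b = 1.0 / (C * (T - t))
Sigma_inv = a * np.eye(C) - b * np.ones((C, C))

# Claimed entries of Sigma
diag = (t - s) * (T - t) / (T - s) + (t - s) ** 2 / (C * (T - s))
off = (t - s) ** 2 / (C * (T - s))
Sigma = off * np.ones((C, C)) + (diag - off) * np.eye(C)

assert np.allclose(Sigma_inv @ Sigma, np.eye(C))
```

The identity holds for any valid choice of the parameters, since \Sigma^{-1} has the form aI - bJ and its inverse follows from the Sherman-Morrison formula.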
B. Proof of Theorem 3, and unbiased estimation of \rho_j

In this appendix we provide a proof of Theorem 3, together with practical guidance on how to simulate a low-variance, positive, unbiased estimator \tilde{\rho}_j. This is accompanied by pseudo-code, presented in Algorithm 4. The approach we take is a variant of Beskos et al. (2006), Fearnhead et al. (2008, Section 4), and Pollock (2013, Chapter 7.1), applied to our particular setting.

To construct such an estimator we rely on the property that the function \phi_c, for c \in \{1,\dots,C\}, is bounded on compact sets, which follows directly from Assumption A.1 (Beskos et al., 2008). In particular, suppose there exists some compact region R_c \subset \mathbb{R}^d such that x^{(c)}(\omega) \in R_c; then there exist some L^{(c)}_j := L(x^{(c)}(\omega)) \in \mathbb{R} and U^{(c)}_j := U(x^{(c)}(\omega)) \in \mathbb{R} such that \phi_c(x^{(c)}(\omega)) \in [L^{(c)}_j, U^{(c)}_j].

We can exploit this property of \phi_c by simulating x^{(c)} (with law \mathbb{W}_{j,c}), as required, in two steps: (i) partitioning the path space of \mathbb{W}_{j,c} into disjoint layers and simulating the layer to which x^{(c)} belongs (denoting R_c := R_c(x^{(c)}) \sim \mathcal{R}_c); (ii) simulating the path at time marginals as required, conditional on the simulated layer (i.e. x^{(c)}_t \sim \mathbb{W}_{j,c} | R_c). This two-step procedure then allows us to identify L^{(c)}_j and U^{(c)}_j for use when constructing our estimator. Full detail on step (i) can be found in Pollock et al. (2016, Section 8.1 and Algorithm 16), and on step (ii) in Pollock et al. (2016, Section 8.5 and Algorithm 22); both are omitted from this paper for brevity.

We can now proceed to the proof of Theorem 3.

Proof (Theorem 3).
Recalling that R_c := R_c(x^{(c)}) is a function of the Brownian bridge sample path x^{(c)} \sim \mathbb{W}_{j,c} which determines a compact subset of \mathbb{R}^d to which x^{(c)} is constrained, further denote by \mathcal{R} the law of R_1, \dots, R_C, and by \bar{\mathbb{W}} the law of the C Brownian bridges x^{(1)}, \dots, x^{(C)}. Let \mathcal{K} denote the law of \kappa_1, \dots, \kappa_C, and \mathcal{U} the law of \chi_{1,1}, \dots, \chi_{1,\kappa_1}, \dots, \chi_{C,1}, \dots, \chi_{C,\kappa_C} \sim \mathcal{U}[t_{j-1}, t_j]. Then for j \in \{1, \dots, n\} we have
\[
\mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \mathbb{E}_{\mathcal{K}} \mathbb{E}_{\mathcal{U}} [\hat{\rho}_j]
= \mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \mathbb{E}_{\mathcal{K}} \mathbb{E}_{\mathcal{U}} \left[ \prod_{c=1}^{C} \frac{\Delta_j^{\kappa_c} \cdot e^{-(U^{(c)}_j - \Phi_c)\Delta_j}}{\kappa_c! \cdot p(\kappa_c | R_c)} \prod_{k_c=1}^{\kappa_c} \left( U^{(c)}_j - \phi_c\big(x^{(c)}_{\chi_{c,k_c}}\big) \right) \right]
\]
\[
= \mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \mathbb{E}_{\mathcal{K}} \left[ \prod_{c=1}^{C} \frac{\Delta_j^{\kappa_c} \cdot e^{-(U^{(c)}_j - \Phi_c)\Delta_j}}{\kappa_c! \cdot p(\kappa_c | R_c)} \left( \int_{t_{j-1}}^{t_j} \frac{U^{(c)}_j - \phi_c\big(x^{(c)}_t\big)}{\Delta_j} \, \mathrm{d}t \right)^{\kappa_c} \right]
\]
\[
= \mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \left[ \prod_{c=1}^{C} \sum_{k_c=0}^{\infty} \frac{\Delta_j^{k_c} \cdot e^{-(U^{(c)}_j - \Phi_c)\Delta_j}}{k_c!} \left( \int_{t_{j-1}}^{t_j} \frac{U^{(c)}_j - \phi_c\big(x^{(c)}_t\big)}{\Delta_j} \, \mathrm{d}t \right)^{k_c} \right]
\]
\[
= \mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \left[ \prod_{c=1}^{C} e^{-(U^{(c)}_j - \Phi_c)\Delta_j} \cdot \exp\left\{ \int_{t_{j-1}}^{t_j} \left( U^{(c)}_j - \phi_c\big(x^{(c)}_t\big) \right) \mathrm{d}t \right\} \right]
= \prod_{c=1}^{C} \mathbb{E}_{\mathbb{W}_{j,c}} \left[ \exp\left\{ -\int_{t_{j-1}}^{t_j} \left( \phi_c\big(x^{(c)}_t\big) - \Phi_c \right) \mathrm{d}t \right\} \right] =: \rho_j.
\]

Theorem 3 allows for significant flexibility in choosing the law \mathcal{K}. As we are embedding the estimator within a sequential Monte Carlo framework, we wish to choose the law \mathcal{K} to minimise the variance of the estimator (or, equivalently in our case, its second moment).

Lemma B.1.
The second moment of the estimator \hat{\rho}_j is minimised when p(\kappa_1 | R_1), \dots, p(\kappa_C | R_C) are chosen to be Poisson distributed with intensities
\[
\lambda_c := \left[ \Delta_j \int_{t_{j-1}}^{t_j} \left( U^{(c)}_j - \phi_c\big(x^{(c)}_t\big) \right)^2 \mathrm{d}t \right]^{1/2}, \qquad c \in \{1,\dots,C\}. \quad (27)
\]

Proof.
We have
\[
\mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \mathbb{E}_{\mathcal{K}} \mathbb{E}_{\mathcal{U}} [\hat{\rho}_j^2]
= \mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \mathbb{E}_{\mathcal{K}} \left[ \prod_{c=1}^{C} \frac{\Delta_j^{2\kappa_c} \cdot e^{-2(U^{(c)}_j - \Phi_c)\Delta_j}}{(\kappa_c!)^2 \cdot p(\kappa_c | R_c)} \left( \int_{t_{j-1}}^{t_j} \frac{\big( U^{(c)}_j - \phi_c(x^{(c)}_t)\big)^2}{\Delta_j} \, \mathrm{d}t \right)^{\kappa_c} \right]
\]
\[
= \mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \left[ \prod_{c=1}^{C} e^{-2(U^{(c)}_j - \Phi_c)\Delta_j} \cdot \sum_{k=0}^{\infty} \frac{f_k}{p_k} \right], \quad (28)
\]
where f_k := \left[ \Delta_j \int_{t_{j-1}}^{t_j} \big( U^{(c)}_j - \phi_c(x^{(c)}_t)\big)^2 \mathrm{d}t \right]^k / (k!)^2 and p_k := p(k | R_c).

Recalling that we have the flexibility to choose the discrete probability distributions p(k | R_c) for c \in \{1,\dots,C\}, we wish to make our selection to minimise (28). To do so we consider each sub-posterior separately and use Lagrange multipliers to optimise \sum_k f_k/p_k + \lambda(\sum_k p_k - 1), finding that -f_k/p_k^2 + \lambda = 0. As p_k \in (0,1) we have p_k = \sqrt{f_k/\lambda}, and further, noting \sum_k p_k = 1, we have \lambda = (\sum_k \sqrt{f_k})^2. Hence we find the optimal distribution p(k | R_c) to be Poisson,
\[
p(k | R_c) = \frac{\lambda_c^k}{k!} e^{-\lambda_c},
\]
with \lambda_c as given in (27). Substituting this selection into (28), and recalling from Theorem 2 that \Phi_c is a constant such that \inf_x \phi_c(x) \ge \Phi_c > -\infty, we have \lambda_c \le \Delta_j \cdot (U^{(c)}_j - \Phi_c), and then we obtain finiteness as required:
\[
\mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \mathbb{E}_{\mathcal{K}} \mathbb{E}_{\mathcal{U}} [\hat{\rho}_j^2]
= \mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \left[ \prod_{c=1}^{C} e^{-2(U^{(c)}_j - \Phi_c)\Delta_j} \cdot \sum_{k=0}^{\infty} \frac{\lambda_c^k e^{\lambda_c}}{k!} \right]
= \mathbb{E}_{\mathcal{R}} \mathbb{E}_{\bar{\mathbb{W}}|\mathcal{R}} \left[ \exp\left\{ 2 \sum_{c=1}^{C} \left( \lambda_c - \big( U^{(c)}_j - \Phi_c \big)\Delta_j \right) \right\} \right] \le 1 < \infty.
\]

As noted in Section 2.1, normalisation within Algorithm 1 permits us to use the estimator \tilde{\rho} (given by (13)) in place of the estimator \hat{\rho}, thus avoiding the need to compute the constants \Phi_1, \dots, \Phi_C.
Corollary B.1. The second moment of the estimator \tilde{\rho}_j is minimised when p(\kappa_1 | R_1), \dots, p(\kappa_C | R_C) are chosen as in Lemma B.1, and is finite.

Proof. Follows directly from (13) and Lemma B.1.
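Lemma B.1 can be checked numerically in a stripped-down case. The sketch below is our own illustrative setup (a single sub-posterior, \Delta_j = 1, U = 1, \Phi = 0, and a constant \phi \equiv 0.3, under which \kappa! and p(\kappa | R) collapse so that \hat{\rho}_j = e^{\mu - 1}(0.7/\mu)^{\kappa} for a Poisson(\mu) choice of \mathcal{K}); it confirms that the mean is invariant to \mu while the second moment is smallest at the optimal intensity \lambda = (\Delta_j \int (U - \phi)^2)^{1/2} = 0.7.

```python
import numpy as np

def moments(mu, n=200000, seed=7):
    """Mean and second moment of rho_hat for the constant case U = 1,
    phi = 0.3, Phi = 0, Delta_j = 1, with kappa ~ Poisson(mu). Here the
    estimator simplifies to exp(mu - 1) * (0.7 / mu) ** kappa."""
    rng = np.random.default_rng(seed)
    kappa = rng.poisson(mu, size=n)
    vals = np.exp(mu - 1.0) * (0.7 / mu) ** kappa
    return vals.mean(), (vals ** 2).mean()

m_opt, s_opt = moments(0.7)    # the optimal intensity of Lemma B.1
m_sub, s_sub = moments(2.0)    # a deliberately sub-optimal intensity
```

Both choices are unbiased for exp{-\int \phi \mathrm{d}t} = e^{-0.3} (analytically, the mean is e^{\mu - 1} e^{\mu(0.7/\mu - 1)} = e^{-0.3} for any \mu > 0), but the second moment e^{\mu + 0.49/\mu - 2} is strictly larger away from \mu = 0.7; at the optimum the estimator is in fact constant in this degenerate case.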
Although Lemma B.1 and Corollary B.1 suggest an optimal distribution and parameterisation for the simulation of the law \mathcal{K} in Theorem 3, the integral in (27) precludes this choice. In this paper we consider the following two possible choices for \mathcal{K}, which attempt to mimic the optimal parameterisation in (27) (while erring on the side of heavier tails for robustness): (i) a Poisson distribution with a higher intensity; (ii) a Poisson distribution with a random mean approximating \lambda_c, given by a Gamma distribution (which leads to the negative binomial distribution). We term these Unbiased Estimator A and B (UE-a and UE-b) respectively (they are based upon the GPE estimators of Fearnhead et al. (2008), applied to our setting).

Condition B.1 (\tilde{\rho}^{(a)}). Choosing p(\kappa_1 | R_1), \dots, p(\kappa_C | R_C) to be Poisson distributed with intensity \Lambda_c := \Delta_j (U^{(c)}_j - L^{(c)}_j) leads to the estimator
\[
\tilde{\rho}^{(a)}_j := \prod_{c=1}^{C} \left( e^{-L^{(c)}_j \Delta_j} \left( U^{(c)}_j - L^{(c)}_j \right)^{-\kappa_c} \prod_{k_c=1}^{\kappa_c} \left( U^{(c)}_j - \phi_c\big(x^{(c)}_{\chi_{c,k_c}}\big) \right) \right).
\]

Condition B.2 (\tilde{\rho}^{(b)}). Choosing p(\kappa_1 | R_1), \dots, p(\kappa_C | R_C) to be Negative Binomial distributed with mean parameter
\[
m_c := \Delta_j U^{(c)}_j - \int_{t_{j-1}}^{t_j} \phi_c\left( \frac{x^{(c)}_{j-1}(t_j - s) + x^{(c)}_j (s - t_{j-1})}{\Delta_j} \right) \mathrm{d}s \quad (29)
\]
and dispersion parameter r_c leads to the estimator
\[
\tilde{\rho}^{(b)}_j := \prod_{c=1}^{C} \left( \Delta_j^{\kappa_c} e^{-U^{(c)}_j \Delta_j} \cdot \frac{\Gamma(r_c)}{\Gamma(r_c + \kappa_c)} \cdot \frac{(m_c + r_c)^{r_c + \kappa_c}}{r_c^{r_c}\, m_c^{\kappa_c}} \prod_{k_c=1}^{\kappa_c} \left( U^{(c)}_j - \phi_c\big(x^{(c)}_{\chi_{c,k_c}}\big) \right) \right).
\]

Although \tilde{\rho}^{(a)}_j of Condition B.1 is the more natural estimator (it is trivially bounded by \exp\{-\Delta_j \sum_{c=1}^{C} \Phi_c\}, and consequently has finite variance), Fearnhead et al. (2008) recommend using an estimator of the form \tilde{\rho}^{(b)}_j, as given in Condition B.2, as it is more robust in practice.
The mean parameterisation (29) of the negative binomial is typically tractable, but if it is unavailable then a crude estimate of the integral for each sub-posterior can be used (for instance, by taking the mean of $\phi_c$ evaluated at $x^{(c)}_{j-1}$ and $x^{(c)}_j$); this does not introduce bias to the estimator (it simply inflates the variance). The dispersion parameters $r_c$ of Condition B.2 can be chosen to approximately match the tail thickness of the optimal Poisson distribution. If the dispersion parameters are chosen to be constant across sub-posteriors ($r_1 = \cdots = r_C$), then a further normalising constant can be removed from the estimator (noting that this constant is common to all particles in Algorithm 1, and so is lost when the particle weights are re-normalised). As the distribution for $\mathcal{K}$ has heavier tails under the choice in Condition B.2 than under Condition B.1, a variation of the argument of Fearnhead et al. (2008) shows that the variance of $\tilde{\rho}^{(b)}_j$ is also finite.

An algorithmic summary of the construction of the unbiased estimator $\tilde{\rho}_j$ for use in Algorithm 1 Step b(ii)B is given in Algorithm 4. In practice we have found using the slightly more complicated UE-$b$ to be more robust than UE-$a$ within Algorithm 4 Step (a)(ii), particularly when the individual sub-posterior trajectories are outwith the domain of attraction of their corresponding sub-posterior (for instance, with increasing sub-posterior heterogeneity, and as $j \to n$).

Algorithm 4
Simulating the unbiased estimator $\tilde{\rho}_j$ (Algorithm 1 Step b(ii)B).

(a) For $c$ in $1$ to $C$:
  (i) $R_c$: Simulate $R_c \sim \mathcal{R}_c$ as per Pollock et al. (2016, Alg. 16).
  (ii) $p_c$: Choose $p(\cdot \mid R_c)$ (e.g. following the guidance in Condition B.1 or Condition B.2).
  (iii) $\kappa_c$: Simulate $\kappa_c \sim p(\cdot \mid R_c)$.
  (iv) $\chi_{c,\cdot}$: Simulate $\chi_{c,1}, \ldots, \chi_{c,\kappa_c} \sim \mathcal{U}[t_{j-1}, t_j]$.
  (v) $x^{(c)}_{\cdot}$: Simulate $x^{(c)}_{\chi_{c,1}}, \ldots, x^{(c)}_{\chi_{c,\kappa_c}} \sim \mathbb{W}_{j,c} \mid R_c$ as per Pollock et al. (2016, Alg. 22).
(b) Output:
\[
\tilde{\rho}_j := \prod_{c=1}^{C} \left( \frac{\Delta_j^{\kappa_c} \, e^{-U^{(c)}_j \Delta_j}}{\kappa_c! \; p(\kappa_c \mid R_c)} \prod_{k_c=1}^{\kappa_c} \Bigl( U^{(c)}_j - \phi_c\bigl(x^{(c)}_{\chi_{c,k_c}}\bigr) \Bigr) \right).
\]

C. Proofs of Theorems 4 and 5, and Corollaries 1 and 2
To prove Theorem 4 and Corollary 1 of Section 3.1, we introduce the following lemma.
Lemma C.1.
The moment generating function (mgf) of $\sigma^2 := C^{-1} \sum_{c=1}^{C} \|x^{(c)}_0 - \bar{x}_0\|^2$, where $x^{(c)}_0$, $c = 1, \ldots, C$, are independent with $x^{(c)}_0 \sim \mathcal{N}(a_c, m^{-1}Cb\, I_d)$ and $\bar{x}_0 := C^{-1}\sum_{c=1}^C x^{(c)}_0$, is given by
\[
M_{\sigma^2}(s) = \exp\left\{ \frac{m \sigma_a^2 s}{m - 2sb} \right\} \cdot \left( 1 - \frac{2sb}{m} \right)^{-(C-1)d/2},
\]
where $\bar{a} := C^{-1}\sum_{c=1}^C a_c$, $\sigma_a^2 := C^{-1}\sum_{c=1}^C \|a_c - \bar{a}\|^2$, and $2sb/m < 1$.

Proof.
We have
\[
\frac{1}{C} \sum_{c=1}^{C} \|x^{(c)}_0 - \bar{a}\|^2 = \sigma^2 + \|\bar{x}_0 - \bar{a}\|^2,
\]
where the right-hand side is a sum of two independent random variables. If we denote
\[
\lambda := \sum_{c=1}^{C} \left\| \frac{a_c - \bar{a}}{\sqrt{m^{-1}Cb}} \right\|^2 = \frac{\sigma_a^2 m}{b},
\]
then $\frac{m}{Cb} \sum_{c=1}^{C} \|x^{(c)}_0 - \bar{a}\|^2$ is a non-central $\chi^2(Cd, \lambda)$ random variable, with moment generating function
\[
M_1(s) = \exp\left\{ \frac{\lambda s}{1 - 2s} \right\} (1 - 2s)^{-Cd/2}.
\]
Furthermore, $\frac{m}{b} \|\bar{x}_0 - \bar{a}\|^2$ is a $\chi^2(d)$ random variable with mgf $M_2(s) = (1 - 2s)^{-d/2}$. Therefore, the mgf of $\sigma^2$ is
\[
M_{\sigma^2}(s) = \frac{M_1(sb/m)}{M_2(sb/m)} = \exp\left\{ \frac{\sigma_a^2 s}{1 - \frac{2sb}{m}} \right\} \left( 1 - \frac{2sb}{m} \right)^{-Cd/2} \bigg/ \left( 1 - \frac{2sb}{m} \right)^{-d/2},
\]
and the statement of Lemma C.1 follows directly.

We can now present the proofs of Theorem 4 and Corollary 1.

Proof (Theorem 4).
The conditional effective sample size $\mathrm{CESS}_0$ for particles with weight $\rho_{0,i}$, $i = 1, \ldots, N$, is such that, as $N \to \infty$,
\[
N^{-1}\,\mathrm{CESS}_0 = N^{-1} \left[ \frac{\sum_i (\rho_{0,i})^2}{\bigl(\sum_j \rho_{0,j}\bigr)^2} \right]^{-1} \to \frac{(\mathbb{E}\,\rho_{0,i})^2}{\mathbb{E}(\rho_{0,i}^2)} = \frac{\Bigl[ \mathbb{E}\bigl( e^{-C\sigma^2/(2T)} \bigr) \Bigr]^2}{\mathbb{E}\bigl( e^{-C\sigma^2/T} \bigr)} = \frac{\bigl( M_{\sigma^2}\bigl(-\tfrac{C}{2T}\bigr) \bigr)^2}{M_{\sigma^2}\bigl(-\tfrac{C}{T}\bigr)}.
\]
From Lemma C.1, we have
\[
\frac{\bigl[ M_{\sigma^2}\bigl(-\tfrac{C}{2T}\bigr) \bigr]^2}{M_{\sigma^2}\bigl(-\tfrac{C}{T}\bigr)} = \frac{\exp\Bigl\{ -\tfrac{m\sigma_a^2 C}{Tm + Cb} \Bigr\} \cdot \bigl( 1 + \tfrac{Cb}{Tm} \bigr)^{-(C-1)d}}{\exp\Bigl\{ -\tfrac{m\sigma_a^2 C}{Tm + 2Cb} \Bigr\} \cdot \bigl( 1 + \tfrac{2Cb}{Tm} \bigr)^{-(C-1)d/2}} = \exp\left\{ -\frac{\sigma_a^2 \, \frac{b}{m}}{\bigl( \frac{T}{C} + \frac{b}{m} \bigr) \bigl( \frac{T}{C} + \frac{2b}{m} \bigr)} \right\} \cdot \left[ \frac{\bigl( 1 + \frac{Cb}{Tm} \bigr)^2}{1 + \frac{2Cb}{Tm}} \right]^{-(C-1)d/2}, \tag{30}
\]
and so Theorem 4 immediately follows.

Proof (Corollary 1).
First considering the proof of part (a): In the SH($\lambda$) setting (Condition 1) we have that $\sigma_a^2 \le bC\lambda/m$. For the first term in (30) we have
\[
\exp\left\{ -\frac{\sigma_a^2 \, \frac{b}{m}}{\bigl( \frac{T}{C} + \frac{b}{m} \bigr) \bigl( \frac{T}{C} + \frac{2b}{m} \bigr)} \right\} \ge \exp\left\{ -\frac{\sigma_a^2\, b\, C^2}{T^2 m} \right\} \ge \exp\left\{ -\frac{b^2 C^3 \lambda}{T^2 m^2} \right\} \ge \exp\left\{ -\frac{\lambda}{k_1} \right\},
\]
the final inequality following from the bound on $T$ assumed in (14), and for the second term in (30) we have (using $(1+x)^2/(1+2x) = 1 + x^2/(1+2x) \le e^{x^2/(1+2x)}$)
\[
\left[ \frac{\bigl( 1 + \frac{Cb}{Tm} \bigr)^2}{1 + \frac{2Cb}{Tm}} \right]^{-(C-1)d/2} \ge \exp\left\{ -\frac{\bigl( \frac{Cb}{Tm} \bigr)^2 (C-1)d}{2 \bigl( 1 + \frac{2Cb}{Tm} \bigr)} \right\} \ge \exp\left\{ -\frac{d}{2k_1} \right\}, \tag{31}
\]
which together prove part (a).

Now considering the proof of part (b): Using the assumed bounds in (14) and (16), we have
\[
\left( \frac{T}{C} + \frac{b}{m} \right) \cdot \left( \frac{T}{C} + \frac{2b}{m} \right) \ge \frac{T^2}{C^2} \ge \frac{b^2 k_1 k_2}{m^2}.
\]
In the SSH($\gamma$) setting (Condition 2), for which $\sigma_a^2 \le b\gamma/m$, we can deduce that
\[
\exp\left\{ -\frac{\sigma_a^2 \, \frac{b}{m}}{\bigl( \frac{T}{C} + \frac{b}{m} \bigr) \bigl( \frac{T}{C} + \frac{2b}{m} \bigr)} \right\} \ge \exp\left\{ -\frac{b^2\gamma/m^2}{b^2 k_1 k_2/m^2} \right\} = \exp\left\{ -\frac{\gamma}{k_1 k_2} \right\},
\]
which when taken together with the bound in (31) proves part (b).

In Theorem 5, for simplicity, we derive the conditional effective sample size $\mathrm{CESS}_j$ for particles with importance weight $\rho_{j,i} = \prod_{c=1}^{C} \rho^{(c)}_{j,i}$, $i = 1, \ldots, N$.

Proof (Theorem 5).
For large $N$, and $\xi_{j-1}$ as given in (24),
\[
N^{-1}\,\mathrm{CESS}_j \approx \frac{\Bigl[ \mathbb{E}\Bigl( \exp\Bigl\{ -\sum_{c=1}^{C} \int_{t_{j-1}}^{t_j} \phi_c\bigl(x^{(c)}_t\bigr)\,\mathrm{d}t \Bigr\} \,\Big|\, \xi_{j-1} \Bigr) \Bigr]^2}{\mathbb{E}\Bigl( \exp\Bigl\{ -2\sum_{c=1}^{C} \int_{t_{j-1}}^{t_j} \phi_c\bigl(x^{(c)}_t\bigr)\,\mathrm{d}t \Bigr\} \,\Big|\, \xi_{j-1} \Bigr)}.
\]
For Gaussian sub-posteriors we have $\phi_c\bigl(x^{(c)}_t\bigr) = \bigl[ (Cb)^{-2} m^2 \|x^{(c)}_t - a_c\|^2 - (Cb)^{-1} m d \bigr]/2$. If we consider very small intervals $(t_{j-1}, t_j)$, then (noting the constant term cancels in the ratio)
\[
N^{-1}\,\mathrm{CESS}_j \approx \frac{\Bigl[ \mathbb{E}\Bigl( \exp\Bigl\{ -\frac{\Delta_j}{2} \sum_{c=1}^{C} (Cb)^{-2} m^2 \|x^{(c)}_j - a_c\|^2 \Bigr\} \,\Big|\, \xi_{j-1} \Bigr) \Bigr]^2}{\mathbb{E}\Bigl( \exp\Bigl\{ -\Delta_j \sum_{c=1}^{C} (Cb)^{-2} m^2 \|x^{(c)}_j - a_c\|^2 \Bigr\} \,\Big|\, \xi_{j-1} \Bigr)}. \tag{32}
\]
On the other hand, letting $v_j$ denote the conditional variance scale of $x^{(c)}_j$ given $\xi_{j-1}$, we have that $v_j^{-1} \sum_{c=1}^{C} \|x^{(c)}_j - a_c\|^2$ is a non-central $\chi^2(Cd, \lambda'_j)$ random variable, where
\[
\lambda'_j = \sum_{c=1}^{C} \left\| \frac{\mathbb{E}\bigl( x^{(c)}_j \mid \xi_{j-1} \bigr) - a_c}{\sqrt{v_j}} \right\|^2.
\]
Using arguments similar to those in Theorem 4 and Lemma C.1, and writing $s := -\frac{\Delta_j m^2}{2 (Cb)^2}\, v_j$ and $\sigma^2_{t_{j-1}} := C^{-1} \sum_{c=1}^{C} \|\mathbb{E}\bigl( x^{(c)}_j \mid \xi_{j-1} \bigr) - a_c\|^2$ (so that $\lambda'_j = C \sigma^2_{t_{j-1}} / v_j$), we can write (32) as
\[
N^{-1}\,\mathrm{CESS}_j \approx \frac{\Bigl[ \exp\Bigl\{ \frac{\lambda'_j s}{1 - 2s} \Bigr\} (1 - 2s)^{-Cd/2} \Bigr]^2}{\exp\Bigl\{ \frac{2\lambda'_j s}{1 - 4s} \Bigr\} (1 - 4s)^{-Cd/2}} = \exp\left\{ -\frac{4 s^2 C \sigma^2_{t_{j-1}}}{v_j (1 - 2s)(1 - 4s)} \right\} \cdot \left[ \frac{1 - 4s + 4s^2}{1 - 4s} \right]^{-Cd/2}. \tag{33}
\]
If we take the limiting regime prescribed in (17), this implies that $s \to 0$. Further bounding the right-hand expression in (33) as follows,
\[
\left[ \frac{1 - 4s + 4s^2}{1 - 4s} \right]^{-Cd/2} = \left( 1 + \frac{4s^2}{1 - 4s} \right)^{-Cd/2} \ge \exp\left\{ -\frac{2 s^2 C d}{1 - 4s} \right\},
\]
and substituting in the bounds in (17), we arrive at the required result.

D. Proof of Corollary 2
Proof of Corollary 2.
From Algorithm 1 Step b(ii)A we have $\vec{x}_j \mid \vec{x}_{j-1} \sim \mathcal{N}\bigl( \vec{M}_j, V_j \bigr)$, where $\vec{M}_j$ and $V_j$ are as given in Theorem 2. In addition we have
\[
V_j = \left\{ \frac{(t_j - t_{j-1})^2}{C (T - t_{j-1})}\, \mathbf{1}_{C \times C} + \frac{(T - t_j)(t_j - t_{j-1})}{T - t_{j-1}}\, I_{C \times C} \right\} \otimes I_{d \times d},
\]
where $\mathbf{1}_{C \times C}$ denotes the $C \times C$ matrix of ones. From (24) we also have that the mean and covariance matrix of $\vec{x}_j$ given $\vec{x}_{j-1}$ are given by $\vec{M}_j$ and the above $V_j$, as required.

E. Application of Corollary 3 to large data settings
An example of where Corollary 3 may be useful to practitioners is in a large data setting, in which each core contains a large volume of data ($m_c \gg 1$, $c \in \{1, \ldots, C\}$), and where computing $\phi_c$ is consequently an (expensive) $\mathcal{O}(m_c)$ operation on each core. Consider a simple model which admits a structure with conditional independence, and hence the following factorisation:
\[
f_c(x) \propto \prod_{i=0}^{m_c} \ell_{i,c}(x), \tag{34}
\]
where $\ell_{0,c}$ is the prior and $\ell_{i,c}(x)$, $i \in \{1, \ldots, m_c\}$, are the $m_c$ likelihood terms corresponding to the $c$th sub-posterior. Recalling that $\phi_c$ is linear in terms of $\nabla \log \ell_{i,c}(x)$ and $\Delta \log \ell_{i,c}(x)$, one simple unbiased estimator for $\phi_c$ that could be used in Algorithm 1 would be
\[
\hat{\phi}_c(x) = (m_c + 1) \cdot \bigl[ (m_c + 1) \cdot (\nabla \log \ell_{I,c}(x))^{\mathrm{T}} (\nabla \log \ell_{J,c}(x)) + \Delta \log \ell_{I,c}(x) \bigr] / 2, \tag{35}
\]
where $I, J \overset{\text{iid}}{\sim} \mathcal{U}\{0, \ldots, m_c\}$. However, in this setting one would naturally be interested in the robustness of Bayesian Fusion as the volume of data on each sub-posterior increases ($m_c \to \infty$). As discussed above, the critical consideration when using Corollary 3 is to take note that the expected number of functional evaluations in Algorithm 1 will increase, from say $K$ to $K'$, and so the critical quantity to consider is the ratio $K'/K$ and its growth as $m_c \to \infty$. Unfortunately, use of (35) would in a futile manner exchange the $\mathcal{O}(m_c)$ evaluations of the sub-posterior in the case of $\tilde{\rho}_j$ for an $\mathcal{O}(m_c)$ inflation in the expected number of functional evaluations in the case of $\tilde{\varrho}_j$ modified by (35).

Instead, to exploit Corollary 3 in the large data setting, one could directly apply the approach of Pollock et al. (2020) and develop an $\mathcal{O}(1)$ unbiased estimator $\hat{\phi}_c$, with $\mathcal{O}(1)$ scaling of the ratio $K'/K$, in place of $\phi_c$, and so in turn find a suitable $\tilde{\varrho}_j$. Pollock et al. (2020, Sec.
4) propose using a small number of suitably chosen control variates in the construction of such an estimator. In the Bayesian Fusion setting it is natural to choose a set of control variates for each sub-posterior: $\nabla \log f_c$ and $\Delta \log f_c$ computed at both a point close to the mode of the sub-posterior (say $\hat{x}_c$) and a point close to the posterior mode (say $\hat{x}$), where 'close' in this sense means within $\mathcal{O}(m_c^{-1/2})$ of the true respective modes. Such points can be found by applying an appropriate mode-finding algorithm (Bottou, 2010; Nesterov, 2013; Jin et al., 2017), although note that these will involve full likelihood calculations and so are both likely to be one-time $\mathcal{O}(m_c)$ computations. With this, we instead recommend the following choice as an unbiased estimator for $\phi_c$:
\[
\hat{\phi}_c(x) = \bigl[ (\hat{\alpha}_{I,c}(x))^{\mathrm{T}} \bigl( 2 \nabla \log f_c(x^*) + \hat{\alpha}_{J,c}(x) \bigr) + \operatorname{div} \hat{\alpha}_{I,c}(x) \bigr] / 2 + C, \tag{36}
\]
where $x^*$ is either $\hat{x}_c$ or $\hat{x}$ and is chosen to be that closest to $x$; $I, J \overset{\text{iid}}{\sim} \mathcal{U}\{0, \ldots, m_c\}$; $C := \bigl[ \|\nabla \log f_c(x^*)\|^2 + \Delta \log f_c(x^*) \bigr] / 2$ is a constant; and where
\[
\hat{\alpha}_{I,c}(x) := (m_c + 1) \cdot \bigl[ \nabla \log \ell_{I,c}(x) - \nabla \log \ell_{I,c}(x^*) \bigr], \qquad \operatorname{div} \hat{\alpha}_{I,c}(x) := (m_c + 1) \cdot \bigl[ \Delta \log \ell_{I,c}(x) - \Delta \log \ell_{I,c}(x^*) \bigr].
\]
Pollock et al. (2020, Thm. 3) show that, under some mild technical assumptions, and where control variates as described above are used, then in the regular setting in which the sub-posteriors contract at the rate $m_c^{-1/2}$, the ratio $K'/K$ grows with data size like $\mathcal{O}(1)$. They also consider scaling under different contraction rates, and the effect if the control variates are not as per the guidance above.
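The following sketch illustrates the two estimators on a toy one-dimensional sub-posterior of our own construction (not the paper's: $\mathcal{N}(y_i; x, 1)$ likelihood terms and an $\mathcal{N}(x; 0, 10^2)$ prior as the $i = 0$ term; the names `g`, `h`, `x_star` and the constant `C_c` are all hypothetical, with `C_c` so named in the code to avoid clashing with the number of sub-posteriors). It checks numerically that both (35) and the control-variate estimator (36) are unbiased for $\phi_c$, and that the control variates collapse the variance once `x_star` sits near the mode:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 1-d sub-posterior (an illustrative assumption): likelihood terms
# l_i(x) = N(y_i; x, 1) for i = 1..m, plus an N(x; 0, 10^2) prior as the
# i = 0 term, so f_c(x) is proportional to their product, as in (34).
m = 20
y = rng.normal(1.0, 1.0, size=m)

def g(i, x):  # gradient of log l_i at x
    return -x / 100.0 if i == 0 else y[i - 1] - x

def h(i, x):  # Laplacian of log l_i at x
    return -1.0 / 100.0 if i == 0 else -1.0

def phi(x):   # exact phi_c(x) = (|grad log f_c|^2 + lap log f_c)/2: O(m) cost
    gf = sum(g(i, x) for i in range(m + 1))
    hf = sum(h(i, x) for i in range(m + 1))
    return 0.5 * (gf ** 2 + hf)

def phi_hat_simple(x):  # the simple subsampled estimator (35): O(1) cost
    I, J = rng.integers(0, m + 1, size=2)
    return (m + 1) * ((m + 1) * g(I, x) * g(J, x) + h(I, x)) / 2.0

# Control variates computed once, at a point x_star near the mode, for (36).
x_star = y.mean()
grad_star = sum(g(i, x_star) for i in range(m + 1))
lap_star = sum(h(i, x_star) for i in range(m + 1))
C_c = 0.5 * (grad_star ** 2 + lap_star)   # the constant in (36)

def phi_hat_cv(x):      # the control-variate estimator (36): O(1) cost
    I, J = rng.integers(0, m + 1, size=2)
    a_I = (m + 1) * (g(I, x) - g(I, x_star))
    a_J = (m + 1) * (g(J, x) - g(J, x_star))
    div_I = (m + 1) * (h(I, x) - h(I, x_star))
    return 0.5 * (a_I * (2.0 * grad_star + a_J) + div_I) + C_c

x0 = x_star + 0.1                          # evaluation point close to x_star
simple = np.array([phi_hat_simple(x0) for _ in range(200_000)])
cv = np.array([phi_hat_cv(x0) for _ in range(200_000)])
print(simple.mean(), cv.mean(), phi(x0))   # both means approximate phi(x0)
print(simple.var(), cv.var())              # (36) has far smaller variance
```

The variance comparison is what drives the $K'/K$ scaling discussed above: both estimators are unbiased, but only the control-variate version keeps the fluctuations of $\hat{\phi}_c$ of constant order as the data size grows.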
Note that although the unbiased estimator $\hat{\phi}_c$ indicated above uses only two draws from (34), it may be worthwhile using multiple draws (sampled with replacement), as the variance of the estimator ultimately impacts the stability of the particle-set weights in Algorithm 1 (as discussed in Appendix B). Further note that, conveniently, the constant $C$ in (36) does not need to be computed in Algorithm 1, as it forms part of the normalisation constant.

References
Agarwal, A. and J. Duchi (2011). Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pp. 873–881.

Andrieu, C. and G. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics 37, 697–725.

Bache, K. and M. Lichman (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Berger, J. (1980). Statistical Decision Theory and Bayesian Analysis. Springer.

Beskos, A., O. Papaspiliopoulos, and G. Roberts (2008). A factorisation of diffusion measure and finite sample path constructions. Methodology and Computing in Applied Probability 10, 85–104.

Beskos, A., O. Papaspiliopoulos, G. Roberts, and P. Fearnhead (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology) 68(3), 333–382.

Bolic, M., P. Djuric, and S. Hong (2005). Resampling algorithms and architectures for distributed particle filters. IEEE Transactions on Signal Processing 53(7), 2442–2450.

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Springer.

Buchholz, A., D. Ahfock, and S. Richardson (2019). Distributed computation for marginal likelihood based model choice. arXiv e-prints, arXiv:1910.04672.

Carpenter, J., P. Clifford, and P. Fearnhead (1999). An improved particle filter for non-linear problems. IEE Proceedings - Radar, Sonar and Navigation 146, 2–7.

Chan, R., A. Johansen, M. Pollock, and G. Roberts (2021). Hierarchical Monte Carlo Fusion. In preparation.

Crisan, D., J. Míguez, and G. Ríos-Muñoz (2018). On the performance of parallelisation schemes for particle filtering. EURASIP Journal on Advances in Signal Processing 2018(1), 31.

Dai, H., M. Pollock, and G. Roberts (2019). Monte Carlo Fusion. Journal of Applied Probability 56, 174–191.

Douc, R., O. Cappé, and E. Moulines (2005, September). Comparison of resampling schemes for particle filtering. Zagreb, Croatia.

Doucet, A., N. de Freitas, and N. Gordon (2001). Sequential Monte Carlo Methods in Practice (1st ed.). Springer.

Doucet, A. and A. Lee (2018). Sequential Monte Carlo methods. In M. Maathuis, M. Drton, S. Lauritzen, and M. Wainwright (Eds.), Handbook of Graphical Models, Chapter 7, pp. 165–189. CRC Press.

Fearnhead, P., O. Papaspiliopoulos, and G. Roberts (2008). Particle filters for partially-observed diffusions. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 70(4), 755–777.

Fleiss, J. (1993). Review papers: The statistical basis of meta-analysis. Statistical Methods in Medical Research 2(2), 121–145.

Genest, C. and J. Zidek (1986). Combining probability distributions: A critique and an annotated bibliography. Statistical Science 1(1), 114–135.

Gordon, N., J. Salmond, and A. Smith (1993). A novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F - Radar and Signal Processing 140, 107–113.

Goudie, R., A. Presanis, D. Lunn, D. De Angelis, and L. Wernisch (2019). Joining and splitting models with Markov melding. Bayesian Analysis 14(1), 81.

gov.uk (2019). 'Road Safety Data' dataset, Department for Transport, U.K. Government. https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data. Update version: 2019-12-17. Accessed: 2020-09-17.

Heine, K. and N. Whiteley (2017). Fluctuations, stability and instability of a distributed particle filter with local exchange. Stochastic Processes and their Applications 127(8), 2508–2541.

Higuchi, T. (1997). Monte Carlo filter using the genetic algorithm operators. Journal of Statistical Computation and Simulation 59(1), 1–23.

Jin, C., P. Netrapalli, and M. Jordan (2017). Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv e-prints, arXiv:1711.10456.

Jordan, M., J. Lee, and Y. Yang (2018). Communication-efficient distributed statistical inference. Journal of the American Statistical Association 114(526), 668–681.

Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics 5(1), 1–25.

Kong, A., J. Liu, and W. Wong (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association 89(425), 278–288.

Lee, A. and N. Whiteley (2016). Forest resampling for distributed sequential Monte Carlo. Statistical Analysis and Data Mining: The ASA Data Science Journal 9(4), 230–248.

Lee, A., C. Yau, M. Giles, A. Doucet, and C. Holmes (2010). On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics 19(4), 769–789.

Lindsten, F., A. Johansen, C. Naesseth, B. Kirkpatrick, T. Schön, J. Aston, and A. Bouchard-Côté (2017). Divide-and-conquer with sequential Monte Carlo. Journal of Computational and Graphical Statistics 26(2), 445–458.

Liu, J. and R. Chen (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 93(443), 1032–1044.

Minsker, S., S. Srivastava, L. Lin, and D. Dunson (2014). Scalable and robust Bayesian inference via the median posterior. In E. Xing and T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, Volume 32, pp. 1656–1664. PMLR.

Murray, L., A. Lee, and P. Jacob (2016). Parallel resampling in the particle filter. Journal of Computational and Graphical Statistics 25(3), 789–805.

Neiswanger, W., C. Wang, and E. Xing (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv e-prints, arXiv:1311.4780.

Nesterov, Y. (2013). Introductory Lectures on Convex Optimization: A Basic Course, Volume 87. Springer Science & Business Media.

Papaspiliopoulos, O., G. Roberts, and K. Taylor (2016). Exact sampling of diffusions with a discontinuity in the drift. Advances in Applied Probability 48(A), 249.

Pollock, M. (2013). Some Monte Carlo Methods for Jump Diffusions. Ph.D. thesis, Department of Statistics, University of Warwick.

Pollock, M., P. Fearnhead, A. Johansen, and G. Roberts (2020). Quasi-stationary Monte Carlo methods and the ScaLE algorithm (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology) 82, 1–59.

Pollock, M., A. Johansen, and G. Roberts (2016). On the exact and ε-strong simulation of (jump) diffusions. Bernoulli 22(2), 794–856.

Rendell, L., A. Johansen, A. Lee, and N. Whiteley (2018). Global consensus Monte Carlo. arXiv e-prints, arXiv:1807.09288.

Rogers, L. and D. Williams (2000). Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus. Cambridge University Press.

Scott, S. (2017). Comparing consensus Monte Carlo strategies for distributed Bayesian computation. Brazilian Journal of Probability and Statistics 31(4), 668–685.

Scott, S., A. Blocker, F. Bonassi, H. Chipman, E. George, and R. McCulloch (2016). Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11(2), 78–88.

Smith, T., D. Spiegelhalter, and A. Thomas (1995). Bayesian approaches to random-effects meta-analysis: a comparative study. Statistics in Medicine 14(24), 2685–2699.

Srivastava, S., V. Cevher, Q. Tran-Dinh, and D. Dunson (2016). WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pp. 912–920.

Stamatakis, A. and A. Aberer (2013). Novel parallelization schemes for large-scale likelihood-based phylogenetic inference. pp. 1195–1204.

Vergé, C., C. Dubarry, P. Del Moral, and E. Moulines (2015). On parallel implementation of sequential Monte Carlo methods: the island particle model. Statistics and Computing 25, 243–260.

Vono, M., N. Dobigeon, and P. Chainais (2019). Split-and-augmented Gibbs sampler - application to large-scale inference problems. IEEE Transactions on Signal Processing 67(6), 1648–1661.

Wang, A., M. Pollock, G. Roberts, and D. Steinsaltz (2019). Regeneration-enriched Markov processes with application to Monte Carlo. arXiv e-prints, arXiv:1910.05037.

Wang, X. and D. Dunson (2013). Parallelizing MCMC via Weierstrass sampler. arXiv e-prints, arXiv:1312.4605.

Wang, X., F. Guo, K. Heller, and D. Dunson (2015). Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pp. 451–459.

Xu, M., B. Lakshminarayanan, Y. Teh, J. Zhu, and B. Zhang (2014). Distributed Bayesian posterior sampling via moment sharing. In Advances in Neural Information Processing Systems, pp. 3356–3364.

Xue, J. and F. Liang (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Statistics and Computing 29, 23–32.

Yıldırım, S. and B. Ermiş (2019). Exact MCMC with differentially private moves. Statistics and Computing 29(5), 947–963.

Zhou, Y., A. Johansen, and J. Aston (2016). Toward automatic model comparison: An adaptive sequential Monte Carlo approach.