An Efficient Interpolation Technique for Jump Proposals in Reversible-Jump Markov Chain Monte Carlo Calculations
rsos.royalsocietypublishing.org
Research
Article submitted to journal
Subject Areas:
Astrostatistics, Data Analysis, Markov chain Monte Carlo
Keywords:
MCMC, RJMCMC
Author for correspondence:
Ilya Mandel
e-mail: [email protected]
W. M. Farr, I. Mandel, and D. Stevens

Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA), Department of Physics and Astronomy, Northwestern University, Evanston, IL, USA
MIT Kavli Institute, Cambridge, MA, USA
School of Physics and Astronomy, University of Birmingham, Birmingham, UK
Monash Center for Astrophysics and School of Physics and Astronomy, Monash University, Clayton, VIC, Australia
Department of Astronomy, The Ohio State University, Columbus, OH, USA
Selection among alternative theoretical models given an observed data set is an important challenge in many areas of physics and astronomy. Reversible-jump Markov chain Monte Carlo (RJMCMC) is an extremely powerful technique for performing Bayesian model selection, but it suffers from a fundamental difficulty: it requires jumps between model parameter spaces, but cannot efficiently explore both parameter spaces at once. Thus, a naive jump between parameter spaces is unlikely to be accepted in the MCMC algorithm, and convergence is correspondingly slow. Here we demonstrate an interpolation technique that uses samples from single-model MCMCs to propose inter-model jumps from an approximation to the single-model posterior of the target parameter space. The interpolation technique, based on a kD-tree data structure, is adaptive and efficient in modest dimensionality. We show that our technique leads to improved convergence over naive jumps in an RJMCMC, and compare it to other proposals in the literature to improve the convergence of RJMCMCs. We also demonstrate the use of the same interpolation technique as a way to construct efficient "global" proposal distributions for single-model MCMCs without prior knowledge of the structure of the posterior distribution, and discuss improvements that permit the method to be used in higher-dimensional spaces efficiently.
1. Introduction
Selection among alternative theoretical models given an observed data set is an important challenge in many areas of physics and astronomy. In a Bayesian context, model selection involves computing the evidence for the data given each model. The model evidence is an integral of the unnormalized posterior probability distribution over the model parameter space, representing the probability of obtaining the data set within that model. Models with larger evidence are preferred; the ratio of the evidences of two models is the Bayes factor between them. The product of the Bayes factor and the ratio of prior probabilities for the two models yields the odds ratio for the models.

There are many ways to compute model evidences. In low-dimensional parameter spaces, the unnormalized posterior probability can be evaluated on a grid or lattice and the integral can be performed directly. For many problems or models of interest, however, the dimensionality of the parameter space is too large to make this approach practical, and stochastic sampling must be used.

Markov chain Monte Carlo (MCMC) methods attempt to stochastically produce parameter samples with density proportional to the posterior probability distribution. In MCMC techniques, the primary target is an accurate estimate of the posterior distribution. (We note that an alternative stochastic method for exploring a model parameter space, nested sampling (Skilling 2004, Skilling 2006, Feroz et al. 2009), focuses on evidence computation rather than on sampling the posterior probability density functions.) It is not straightforward to compute the model evidence from MCMC samples. The most direct way to estimate the evidence for a model from MCMC samples is to compute the harmonic-mean estimator, but this estimator of the evidence can suffer from infinite variance (Newton & Raftery 1994, Chib 1995, van Haasteren 2009, Wolpert & Schmidler 2012). MCMC implementations with parallel tempering (Swendsen & Wang 1986, Earl & Deem 2005) allow for evidence computation via thermodynamic integration (Friel & Pettitt 2008), but these can be computationally costly.

Weinberg (2009) gives a method for directly computing the evidence integral from existing MCMC samples by using a kD-tree data structure to decompose a parameter space into boxes containing the MCMC sample points. The integral is approximated as a sum over box volumes. This method is promising, but it is not clear in general what statistical and systematic errors it introduces and how these are affected by the shape of the posterior distribution that the MCMC samples.

When the goal is model selection between several known models, only the relative evidence of each model is needed. In this circumstance, the Reversible Jump MCMC technique first introduced by Green (1995) is one of the most reliable and accurate ways to compare the models. Reversible Jump MCMC (RJMCMC), described more fully in Section 2, performs a standard MCMC in a space that is an augmented union of all the model parameter spaces. Such an MCMC involves both intra- and inter-model jumps; the number of MCMC samples in each model's parameter space is proportional to that model's relative evidence within the suite of models being compared.

Implemented naively, RJMCMC has a significant drawback: because the chain of samples must be Markovian, only the current sample is available to the algorithm as it is choosing the next sample.
Each time an RJMCMC transitions between models, the information about the choices of parameter values in the previous model is lost; subsequent jumps into that model must "start fresh," and are correspondingly unlikely to be accepted, delaying convergence of the RJMCMC sample chain (see Section 2(c) for a caveat). Littenberg & Cornish (2009) addressed this issue by proposing a new method for producing inter-model jumps in an RJMCMC that relies on interpolating single-model posterior distributions using a box decomposition of parameter space.

Here we introduce an alternative technique based on a kD-tree data structure to construct an approximation to each model's posterior parameter distribution. This improved interpolation method leads to faster convergence of RJMCMC sample chains. We draw jump proposals into the model from this approximation to its posterior. Because jumps are proposed preferentially to locations favored by the single-model posterior, the RJMCMC compares "good" locations in parameter space across all the models, and convergence is generally rapid. We have successfully applied this RJMCMC technique to a 10-way model selection among alternative mass distribution models for black-hole X-ray binaries (Farr et al. 2011). We also provide an example of using this method as an "asymptotically Markovian" (ter Braak & Vrugt 2008) jump proposal in the context of a single-model, nine-dimensional MCMC in Section 5.

The method of Littenberg & Cornish (2009) for producing inter-model jumps in an RJMCMC relies on a box decomposition of parameter space using fixed-size boxes. That method cannot adapt to the local structure of the posterior, and becomes asymptotically inefficient for high-dimensional parameter spaces or highly peaked posteriors. In contrast, the approximation to the posterior distribution produced by the kD-tree is a constant-in-box interpolation of the posterior, similar in spirit to the phase-space density interpolants produced from N-body positions and momenta in Ascasibar & Binney (2005). The kD-tree interpolation is effective in parameter spaces of modest dimensionality, and is quite space-efficient, requiring $O(N)$ storage space and $O(\log N)$ time to produce each proposed jump, where $N$ is the number of samples in an MCMC over the parameter space of one model ('single-model MCMC') used to construct the interpolation.

The structure of this paper is as follows. In Section 2 we introduce in more detail the concept of a Reversible Jump MCMC, and describe the fundamental difficulty with a naive jump proposal in an RJMCMC. In Section 3 we introduce the kD-tree data structure used to decompose the parameter space into boxes for interpolation. In Section 4 we demonstrate the efficiency gains that are achieved from use of the interpolated jump proposal. In Section 5 we give examples of some other uses of the interpolated jump proposal that suggest its utility in the context of a single-model MCMC. Finally, in Section 6 we offer a summary and some concluding remarks on the method.
2. Reversible Jump MCMC
Reversible jump Markov chain Monte Carlo (RJMCMC) (Green 1995) is a technique for Bayesian model comparison. Below, we give a very brief introduction to Bayesian analysis, describe a standard MCMC, and introduce RJMCMC.

(a) Bayesian analysis
Consider an observed data set $d$ and a set of competing models for the data, indexed by an integer $i$: $\{M_i \mid i = 1, 2, \ldots\}$. Each model has some continuous parameters, $\vec{\theta}_i$; given the model and its parameters, we can make a prediction about the likelihood of observing the experimental data: $L(d \mid \vec{\theta}_i, M_i)$. Within the framework of each model, Bayes' rule gives us a way to compute the posterior probability distribution function (PDF) for the model parameters implied by the data:

$$ p(\vec{\theta}_i \mid d, M_i) = \frac{L(d \mid \vec{\theta}_i, M_i)\, p(\vec{\theta}_i \mid M_i)}{p(d \mid M_i)}, \quad (2.1) $$

where $p(\vec{\theta}_i \mid d, M_i)$ is the posterior distribution for the model parameters $\vec{\theta}_i$ implied by the data in the context of model $M_i$, $p(\vec{\theta}_i \mid M_i)$ is the prior probability of the model parameters that represents our beliefs before accumulating any of the data $d$, and $p(d \mid M_i)$, called the evidence, is an overall normalizing constant that ensures that $p(\vec{\theta}_i \mid d, M_i)$ is properly normalized as a probability distribution on the $\vec{\theta}_i$. This implies that the evidence is equal to

$$ p(d \mid M_i) = \int_{V_i} d\vec{\theta}_i\, L(d \mid \vec{\theta}_i, M_i)\, p(\vec{\theta}_i \mid M_i), \quad (2.2) $$

where $V_i$ is the parameter-space volume of model $M_i$. For model comparison, we are interested in the posterior probability of a particular model, $M_i$, given the data, $p(M_i \mid d)$. Using Bayes' rule, we see that this involves the evidence, Eq. (2.2):

$$ p(M_i \mid d) = \frac{p(d \mid M_i)\, p(M_i)}{p(d)}, \quad (2.3) $$

where $p(M_i)$ is our a priori belief in model $M_i$ and $p(d)$ is a normalizing constant,

$$ p(d) = \sum_i p(d \mid M_i)\, p(M_i). \quad (2.4) $$

When selecting among alternative models, we are interested in finding the model with the highest posterior probability $p(M_i \mid d)$. However, attempts to directly compute the evidence by performing the integration in Eq. (2.2) are generally very difficult in a multi-dimensional, multi-modal parameter space when the likelihood has to be evaluated numerically. In particular, a grid-based integral quickly becomes computationally unfeasible as the dimensionality of $\vec{\theta}$ exceeds a few. The parameter space must typically be explored in a stochastic manner before the evidence integral can be computed. There are several stochastic parameter-exploration techniques focused directly on evidence computation (e.g., nested sampling (Skilling 2004, Skilling 2006) and its variant MultiNest (Feroz et al. 2009)). Although nested sampling can be used to compute the posterior PDFs within each model along with the evidences for the various models, the most common technique for computing posterior PDFs in the context of a model is the Markov chain Monte Carlo, which we now describe.
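To make the cost of direct evidence integration concrete, the following is a minimal sketch (ours, not from the paper) of the grid-based evaluation of Eq. (2.2) for a hypothetical one-parameter Gaussian model; the function names, prior bounds, and grid resolution are illustrative assumptions. A grid with $k$ points per dimension requires $k^D$ likelihood evaluations in $D$ dimensions, which is why this approach fails beyond a few dimensions.

```python
import numpy as np

def log_likelihood(d, mu):
    # Independent N(mu, 1) likelihoods for the data points d (toy model).
    return -0.5 * np.sum((d - mu) ** 2) - 0.5 * len(d) * np.log(2.0 * np.pi)

def grid_evidence(d, mu_lo=-5.0, mu_hi=5.0, n_grid=10_000):
    """Eq. (2.2) by brute-force quadrature over a 1-D parameter grid."""
    mus = np.linspace(mu_lo, mu_hi, n_grid)
    log_prior = -np.log(mu_hi - mu_lo)  # uniform prior density on [mu_lo, mu_hi]
    log_integrand = np.array([log_likelihood(d, m) for m in mus]) + log_prior
    m = log_integrand.max()             # stabilized rectangle-rule sum
    return np.exp(m) * np.sum(np.exp(log_integrand - m)) * (mus[1] - mus[0])

rng = np.random.default_rng(0)
data = rng.normal(0.3, 1.0, size=100)
evidence = grid_evidence(data)          # feasible in 1-D; k**D points in D dimensions
```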
(b) Markov chain Monte Carlo

A Markov chain Monte Carlo (Gilks et al. 1996) produces a set of samples $\{\vec{\theta}^{(j)} \mid j = 1, \ldots\}$ from the model parameter space that are sampled according to the posterior, meaning that, in the limit that the chain length tends to infinity, the relative frequency with which a given set of parameters appears in the chain is proportional to the desired posterior, $p(\vec{\theta} \mid d, M)$. Therefore, the output of an MCMC can be directly interpreted as the posterior PDF over the full parameter space, while PDFs for individual parameters can be obtained by marginalizing over the uninteresting parameters.

A Markov chain has the property that the probability distribution of the next state can depend only on the current state, not on the past history:

$$ p(\vec{\theta}^{(j+1)}) = \int_V d\vec{\theta}^{(j)}\, p(\vec{\theta}^{(j)} \to \vec{\theta}^{(j+1)})\, p(\vec{\theta}^{(j)}), \quad (2.5) $$

where the jump probability $p(\vec{\theta}^{(j)} \to \vec{\theta}^{(j+1)})$ depends only on $\vec{\theta}^{(j)}$ and $\vec{\theta}^{(j+1)}$. An additional requirement for an MCMC arises from the fact that the desired distribution is the equilibrium distribution. Detailed balance requires that $p(\vec{\theta}^{(i)})\, p(\vec{\theta}^{(i)} \to \vec{\theta}^{(j)}) = p(\vec{\theta}^{(j)})\, p(\vec{\theta}^{(j)} \to \vec{\theta}^{(i)})$.

One way to produce such a sequence of samples is via the Metropolis-Hastings algorithm, first proposed by Metropolis et al. (1953) and later generalized by Hastings (1970):

(i) Given a current state $\vec{\theta}^{(j)}$, propose the next state $\vec{\theta}^p$ by drawing from a jump proposal distribution with probability $Q(\vec{\theta}^{(j)} \to \vec{\theta}^p)$.

(ii) Compute the probability of accepting the proposed jump as

$$ p_{\rm accept} \equiv \min\left(1,\; \frac{p(\vec{\theta}^p \mid d, M)}{p(\vec{\theta}^{(j)} \mid d, M)}\, \frac{Q(\vec{\theta}^p \to \vec{\theta}^{(j)})}{Q(\vec{\theta}^{(j)} \to \vec{\theta}^p)}\right). \quad (2.6) $$

(iii) Pick a uniform random number $\alpha \in [0, 1]$. If $\alpha < p_{\rm accept}$, accept the proposed jump, setting $\vec{\theta}^{(j+1)} = \vec{\theta}^p$. Otherwise, reject the jump, and remain at the same location in parameter space for the next step, $\vec{\theta}^{(j+1)} = \vec{\theta}^{(j)}$.

The jump proposal distribution $Q(\vec{\theta}^{(j)} \to \vec{\theta}^p)$ can depend on the parameters of the current state $\vec{\theta}^{(j)}$, but not on the past history. It must also allow any state within the prior volume to be reachable (eventually) by the MCMC. Any jump proposal that satisfies these properties is suitable for an MCMC.
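The following is a minimal sketch (ours, not the paper's code) of steps (i)-(iii) with a symmetric Gaussian random-walk proposal, for which the $Q$ ratio in Eq. (2.6) cancels; the function name, step size, and interface are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_steps, step_size=0.5, rng=None):
    """Metropolis-Hastings with a symmetric Gaussian random-walk proposal."""
    if rng is None:
        rng = np.random.default_rng()
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    chain = np.empty((n_steps, theta.size))
    for j in range(n_steps):
        theta_p = theta + step_size * rng.standard_normal(theta.size)  # step (i)
        lp_p = log_post(theta_p)
        # Steps (ii)-(iii) in log form: accept with probability min(1, exp(lp_p - lp)).
        if np.log(rng.uniform()) < lp_p - lp:
            theta, lp = theta_p, lp_p
        chain[j] = theta  # a rejected jump repeats the current state
    return chain

# Example: sample a one-dimensional standard-normal "posterior".
chain = metropolis_hastings(lambda t: -0.5 * float(t @ t), [0.0], 20_000)
```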
The jump proposal is the most important choice in the MCMC, as it determines the sampling efficiency of the algorithm, i.e., the length of the chain before it converges to the posterior PDF. Creating an efficient jump proposal distribution requires an understanding of the structure of the parameter space, which may not be available until the PDFs are found, creating a Catch-22; one possibility for resolving this infinite loop is described in Section 5.

It should be noted that although an MCMC whose jump acceptance criterion obeys detailed balance (as the Metropolis-Hastings algorithm does) must eventually converge to the desired distribution, there is no way to guarantee convergence in a fixed number of steps or to test whether a chain has converged in a foolproof manner. For example, MCMC chains can get stuck on local maxima, producing an apparently well-converged sampling of the PDF in the vicinity of the maximum; or, if the chain visits a sequence of local maxima, moving rarely between maxima, the autocorrelation length of the chain may represent a substantial fraction of the total number of samples, resulting in an effective sample size that is too small to accurately represent the relative sizes of the modes in the PDF.

Finally, we note that, in practice, the randomly chosen initial starting point of the MCMC may be in a particularly unlikely location in the parameter space. Because jumps are frequently local, we will generally want to ignore the early samples in a finite-size chain to avoid biases in the recovered posterior PDF due to the choice of the initial location. The samples thus discarded are referred to as "burn-in" samples.

(c) RJMCMC
The samples produced by an MCMC algorithm can be used to directly perform a Monte Carlo evidence integral. This results in a harmonic-mean estimator for the evidence, which may suffer from infinite variance (Newton & Raftery 1994, Chib 1995, van Haasteren 2009, Wolpert & Schmidler 2012). Additional techniques for the direct integration of the evidence, also based on a kD-tree decomposition of the parameter space (see Sec. 3), are described in Weinberg (2009). These techniques are promising, but in some cases suffer from large variance and bias (Farr et al. 2011). An alternative approach to model selection among a set of models is based on performing an MCMC in a "super-model" that encompasses all of the models under consideration; this is known as the Reversible Jump Markov chain Monte Carlo (RJMCMC).

The parameter space of the super-model in an RJMCMC consists of a discrete parameter that identifies the model, $M_i$, and a set of continuous parameters appropriate for that model, $\vec{\theta}_i$. Thus, each sample consists of a model identifier and a location within the parameter space of that model, $\{M_i, \vec{\theta}_i\}$. We perform the MCMC in the super-model parameter space just like a regular MCMC; we propose jumps to different parameters within a model (intra-model jumps) and jumps to a different model with different parameters (inter-model jumps). The acceptance probability for a proposed jump from $\vec{\theta}_i^{(j)}$ in model $M_i$ to $\vec{\theta}_j^p$ in model $M_j$ becomes

$$ p_{\rm accept} \equiv \min\left(1,\; \frac{p(\vec{\theta}_j^p, M_j \mid d)}{p(\vec{\theta}_i^{(j)}, M_i \mid d)}\, \frac{Q(\vec{\theta}_j^p, M_j \to \vec{\theta}_i^{(j)}, M_i)}{Q(\vec{\theta}_i^{(j)}, M_i \to \vec{\theta}_j^p, M_j)}\right). \quad (2.7) $$

Here the $Q$ factors incorporate both a discrete probability on the model index, reflecting the probabilistic choice of which model to jump into, and a continuous probability density on the target model's parameter space. For example, $Q(\vec{\theta}_j^p, M_j \to \vec{\theta}_i^{(j)}, M_i)$ has a factor for the probability of proposing a jump to model $i$ when in model $j$, and is a density on the parameter space of $M_i$. These densities cancel the corresponding densities in the ratio of posteriors, making the acceptance probability a parameterisation-independent scalar. In the common special case where the two parameter spaces have equal dimension and the jump proposal is a diffeomorphism, $\phi$, between them,

$$ \vec{\theta}_j^p = \phi\!\left(\vec{\theta}_i^{(j)}\right), \quad (2.8) $$

and therefore

$$ \vec{\theta}_i^{(j)} = \phi^{-1}\!\left(\vec{\theta}_j^p\right), \quad (2.9) $$

then the jump proposal ratio reduces to the Jacobian of the diffeomorphism:

$$ \frac{Q(\vec{\theta}_j^p, M_j \to \vec{\theta}_i^{(j)}, M_i)}{Q(\vec{\theta}_i^{(j)}, M_i \to \vec{\theta}_j^p, M_j)} = \left|\frac{\partial\phi}{\partial\vec{\theta}_i^{(j)}}\right|. \quad (2.10) $$

The resulting chain samples from the posterior $p(M_i, \{\vec{\theta}_i\} \mid d)$. As in a usual MCMC, the PDF on the model as a parameter, with other parameters ignored, is obtained by marginalizing over the remaining parameters.
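As a short worked illustration of Eq. (2.10) (our example, not taken from the paper), consider a deterministic jump between two one-dimensional models that parameterize the same positive quantity linearly and logarithmically:

$$ \theta_j^p = \phi(\theta_i) = \log\theta_i, \qquad \frac{Q(\theta_j^p, M_j \to \theta_i, M_i)}{Q(\theta_i, M_i \to \theta_j^p, M_j)} = \left|\frac{d\phi}{d\theta_i}\right| = \frac{1}{\theta_i}, $$

so the acceptance probability of Eq. (2.7) must carry a factor of $1/\theta_i$ even though the proposal itself involves no randomness.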
The posterior probability of a model is proportional to the number of samples the chain accumulates in that model:

$$ p(M_i \mid d) = \int d\vec{\theta}_i\, \frac{L(d \mid M_i, \vec{\theta}_i)\, p(\vec{\theta}_i \mid M_i)\, p(M_i)}{p(d)} \approx \frac{N_i}{N}, \quad (2.11) $$

where $N_i$ is the number of RJMCMC samples listing the $i$'th model and $N$ is the total chain length. Thus, the probability of a particular model relative to other models under consideration is given by the fraction of RJMCMC samples lying in the parameter space of that model.

The main difficulty in achieving an efficient RJMCMC is finding a good jump proposal distribution for inter-model jumps. In order to have relatively high acceptance ratios for inter-model jumps, which is necessary for efficient mixing between models, jumps should be preferentially proposed into regions with a high posterior. However, because the algorithm is Markovian, it has no past memory, so a jump proposed into a model from outside cannot access information from earlier in the chain which may identify a posterior peak. It is, in principle, possible to overcome this constraint by storing the union of all $\{\vec{\theta}_i\}$ as the MCMC state vector, with the likelihood a function only of the parameters $\vec{\theta}_i$ that correspond to the current model $M_i$. In this case, inter-model jump proposals would change only the model $M_i$. However, if the chain finds a high-likelihood region in one model space faster than in another, a jump to the other model will take a very long time to be accepted, again rendering the RJMCMC inefficient.

The way to solve this problem is to identify a good jump proposal distribution in advance, by exploiting information from single-model MCMCs to generate efficient jump proposal distributions for our reversible jump MCMC. (Single-model MCMCs can take small local jumps within their model, meaning that they are much less likely than an RJMCMC to lose a high-posterior mode once it has been located.) The ideal jump proposal distribution for the parameters within a model would consist of the posterior PDF for those parameters, $p(\vec{\theta}_i \mid M_i, d)$, and single-model MCMCs already represent samples from these posterior PDFs. However, the samples are discrete, and a jump proposal must be continuous. Therefore, the output of each single-model MCMC must be interpolated to construct the desired jump proposal. The strategy we propose for efficiently interpolating a discretely sampled PDF is described in the next section.
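The following is a minimal sketch (ours, under assumed interfaces) of a single inter-model step implementing Eq. (2.7). Each model is assumed to carry its own log posterior and a proposal object, such as the kD-tree interpolation of Section 3, that can both draw parameters and evaluate its own density; the `proposal.draw` and `proposal.log_density` methods are hypothetical names. Because the model to jump into is chosen uniformly, the discrete model-choice factors in the two $Q$'s cancel.

```python
import numpy as np

def rjmcmc_step(state, models, rng):
    """One inter-model RJMCMC step per Eq. (2.7); interfaces are illustrative."""
    i, theta_i = state                      # current model index and parameters
    j = int(rng.integers(len(models)))      # uniform choice of target model
    if j == i:
        return state                        # intra-model moves handled elsewhere
    theta_j = models[j].proposal.draw(rng)  # draw from interpolated posterior of M_j
    log_q_forward = models[j].proposal.log_density(theta_j)  # density on M_j's space
    log_q_reverse = models[i].proposal.log_density(theta_i)  # density on M_i's space
    log_accept = (models[j].log_posterior(theta_j)
                  - models[i].log_posterior(theta_i)
                  + log_q_reverse - log_q_forward)
    if np.log(rng.uniform()) < log_accept:
        return (j, theta_j)
    return (i, theta_i)
```

Counting the samples the chain accumulates in each model then estimates $p(M_i \mid d) \approx N_i/N$, as in Eq. (2.11).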
3. kD Trees and Interpolation
The problem of drawing a proposed jump from an interpolation of single-model MCMC samples can be thought of as the problem of assigning a local "neighborhood" to each sample in the MCMC chain. We choose these neighborhoods to be non-overlapping and to fill the parameter space. The size of a neighborhood is inversely proportional to the local sample density. The proposed jumps are drawn from a piecewise-constant (constant on each neighborhood) interpolation of the PDF. To draw a proposed jump, we select a sample uniformly from the MCMC samples, find its associated neighborhood, and then draw the proposed jump uniformly from the neighborhood. Since the MCMC samples are distributed according to the posterior PDF for the single model, this procedure produces proposed jumps that are approximately distributed according to the posterior PDF.

There are various techniques that could be used to construct the set of neighborhoods associated with each sample. Littenberg & Cornish (2009) decompose the parameter space into constant-volume "bricks" whose size is set by the typical size of the peaks of the PDF. Each sample is associated with the brick that contains it, and the probability of proposing a jump into a particular brick is thus proportional to the number of samples within that brick. Additionally, an extra uniform jump proposal is added to allow for jumps into bricks that do not contain any samples, so that the jump proposal covers the entire model parameter space. However, the bricks in this algorithm do not adapt to the local structure of the PDF. One must either use small bricks to capture the local structure of the PDF, placing many bricks in regions without MCMC samples (which can increase memory management and access costs), or use large bricks, missing the local structure of the PDF in exchange for fewer empty bricks.

An alternate technique for producing adaptive neighborhoods would be to use the Voronoi regions (Voronoi 1907) associated with each MCMC sample. The Voronoi region associated with a sample contains all the parameter-space points that are closer to that sample than to any other sample. The Voronoi decomposition into neighborhoods is, in a sense, maximally adaptive, in contrast to the approach of Littenberg & Cornish (2009), which is minimally adaptive. Unfortunately, defining the Voronoi decomposition requires a metric on parameter space, which may be difficult or impossible to define. Also, the computational cost of computing the Voronoi regions increases rapidly with dimensionality.

Here we propose to use a decomposition of the parameter space into neighborhoods based on a data structure called a kD-tree (see, e.g., de Berg et al. (2008) or Gaede & Günther (1998)). The decomposition is more adaptive than the boxes of Littenberg & Cornish (2009), and more efficient in high-dimensional spaces than the Voronoi decomposition.

A kD-tree is a binary, space-partitioning tree. To partition a set of samples into a kD-tree, begin by placing them in a rectangular box that contains all of parameter space. Then proceed recursively:

(i) If the given box contains exactly one sample, stop; this is a leaf of the tree. Otherwise:

(ii) Choose a dimension along which to divide the samples. Divide the samples in half along this dimension (or nearly in half, if the number of samples is odd), forming two sub-boxes.
The "left" sub-box contains the half (or nearly half) of the samples that have small coordinates along the chosen dimension; the "right" sub-box contains the half (or nearly half) of the samples that have large coordinates along the chosen dimension.

(iii) Return to Step (i) with each of the sub-boxes, storing the resulting trees as sub-trees of the current box.

(The kD-tree data structure defined here places box boundaries between the samples. An alternate definition common in the literature places box boundaries on the median sample, but such a definition is inconvenient for our purposes.)

The key algorithmic step in the production of a kD-tree is finding the median sample along a given dimension in order to divide the samples in half in Step (ii). For $n$ samples, this can be accomplished in $O(n)$ time (see, e.g., Press et al. (2007)). If there are $N$ samples in total, there are $O(\log N)$ levels in the tree; at each level, $O(N)$ samples must be processed once in the median-finding algorithm. Tree construction thus costs $O(N \log N)$ in time, and the tree consumes $O(N)$ space. As an illustration, box boundaries for a kD-tree constructed around a sample set that is normally distributed around the origin in two dimensions are shown in Figure 1.

Figure 1. The neighborhoods from a kD-tree constructed around a set of samples that are normally distributed about the origin in two dimensions. As the samples become denser around the origin, the typical neighborhood gets smaller. The interpolated PDF within a box of volume $V_i$ is $1/(N V_i)$, where $N$ is the total number of samples (which is also the number of boxes).

In order to use the kD-tree interpolation as a jump proposal in an MCMC, we randomly select a point stored in the kD-tree with equal probability, and propose from the associated neighbourhood. Therefore, we must be able to quickly find the neighborhood associated with a given point to compute the jump probability (see Eq. 2.6). We introduce an additional parameter into the neighbourhood search, $N_{\rm boxing}$, which describes the minimum number of points in a kD box used as the neighbourhood for a point; small $N_{\rm boxing}$ increases variance in the proposal but resolves finer-scale structure in the posterior. The neighbourhood search can be accomplished in $O(\log N)$ time and constant space with the following algorithm, which is a modified binary tree search. Given the randomly chosen point, $\vec{\theta}_i$, the tree, $T$, and $N_{\rm boxing}$:
(i) If $T$ contains fewer than $N_{\rm boxing}$ points, then its box is the associated neighborhood. Otherwise:

(ii) The tree $T$ has two sub-trees. If the point $\vec{\theta}_i$ is contained in the "left" sub-tree, then return to Step (i), considering this sub-tree; otherwise return to Step (i), considering the "right" sub-tree.

The returned box is used for the jump proposal by drawing uniformly from its interior, so the proposal density is

$$ Q(\vec{\theta} \to \vec{\theta}^p) = \frac{N_{\rm box}}{N V}, \quad (3.1) $$

where $N$ is the number of points in the tree, $N_{\rm box}$ is the number of points in the chosen kD box, and $V$ is the coordinate-space volume of the box.
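The following is a minimal sketch (ours, not the paper's implementation) of the construction and proposal procedure just described: median splits with boundaries placed between samples, a descent that stops before entering a box with fewer than $N_{\rm boxing}$ samples (so the returned neighborhood holds at least $N_{\rm boxing}$ samples whenever the root does), and a uniform draw whose density follows Eq. (3.1). The class and function names, the longest-dimension splitting rule, and the use of the sample bounding box as the root are illustrative assumptions.

```python
import numpy as np

class KDNode:
    """A box in the kD-tree, split at the midpoint between the two middle samples."""
    def __init__(self, samples, lo, hi):
        self.lo, self.hi = lo, hi                  # corners of this box
        self.n = len(samples)
        self.left = self.right = None
        if self.n > 1:
            self.dim = int(np.argmax(hi - lo))     # split the longest dimension
            samples = samples[np.argsort(samples[:, self.dim])]
            mid = self.n // 2
            self.cut = 0.5 * (samples[mid - 1, self.dim] + samples[mid, self.dim])
            hi_left, lo_right = hi.copy(), lo.copy()
            hi_left[self.dim] = self.cut
            lo_right[self.dim] = self.cut
            self.left = KDNode(samples[:mid], lo, hi_left)
            self.right = KDNode(samples[mid:], lo_right, hi)

    def neighborhood(self, point, n_boxing):
        """Smallest box containing `point` with at least n_boxing samples."""
        node = self
        while node.left is not None:
            child = node.left if point[node.dim] < node.cut else node.right
            if child.n < n_boxing:
                break                              # stop before an under-filled box
            node = child
        return node

def kd_propose(tree, samples, n_boxing, rng):
    """Pick a stored sample uniformly, then draw uniformly from its neighborhood;
    the returned log density is the log of Eq. (3.1)."""
    s = samples[rng.integers(len(samples))]
    box = tree.neighborhood(s, n_boxing)
    theta_p = rng.uniform(box.lo, box.hi)
    log_q = np.log(box.n) - np.log(len(samples)) - np.log(np.prod(box.hi - box.lo))
    return theta_p, log_q

# Example: tree over 2-D Gaussian samples; in practice the root box should
# cover the full prior support, here approximated by the sample bounding box.
rng = np.random.default_rng(1)
samples = rng.normal(size=(4096, 2))
tree = KDNode(samples, samples.min(axis=0), samples.max(axis=0))
theta_p, log_q = kd_propose(tree, samples, n_boxing=8, rng=rng)
```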
4. RJMCMC Efficiency
In this section, we demonstrate the efficiency of the kD-interpolated jump proposal on a toy model-comparison problem. The same algorithm has been used in real-world settings (Farr et al. 2011) and, as discussed below, is available in several forms as a software library for re-use by others.

In the toy model of this section, we draw $N = 100$ simulated data points from a $N(0, 1)$ Gaussian distribution, and then ask whether these data are better described by a model where they are Gaussian distributed with unknown mean $\mu$ and standard deviation $\sigma$,

$$ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad (4.1) $$

or by a model where they are Cauchy distributed with mode $\alpha$ and width $\beta$,

$$ p(x) = \frac{1}{\pi\beta\left[1 + \left(\frac{x-\alpha}{\beta}\right)^2\right]}. \quad (4.2) $$

We take priors on $\mu$ and $\alpha$ to be uniform on an interval centered on zero, and priors on $\sigma$ and $\beta$ to be uniform on an interval bounded away from zero. With a data set of 100 points, the relative uncertainty in determining the parameters of the underlying distribution is approximately 10%, so we expect the posterior probabilities in the $(\mu, \sigma)$ and $(\alpha, \beta)$ spaces to occupy only a few percent of the prior volume. The Cauchy distribution is much broader than the Gaussian (it has no finite moments), so with equal model priors the posterior probability for the Gaussian model over the Cauchy model is extremely large. In order to ensure that the RJMCMC produces samples in the Cauchy model at all, we impose a model prior that strongly favors the Cauchy model relative to the Gaussian. With these priors, the evidence ratio between the models for our chosen data set, $r \equiv p({\rm Gaussian} \mid d)/p({\rm Cauchy} \mid d)$, is of order unity, yielding a theoretical maximum acceptance rate of inter-model jumps of $(1 + 1/r)/2$.

We obtain single-model MCMC samples by independently running an MCMC within each model, and use the kD-tree interpolation method described above to propose inter-model jumps in an RJMCMC. The acceptance rate of inter-model jumps is approximately 0.8. To explore how the efficiency of the method degrades as the interpolation becomes less accurate, we artificially truncated the kD-tree with higher and higher numbers of samples in each box (this can be accomplished during the neighborhood search phase by stopping the search when a box is found containing the desired number of samples). For each truncation choice, we performed an RJMCMC with the resulting interpolated jump proposal. The acceptance rate is plotted against the number of single-model MCMC samples per box (kD-tree leaf) in Figure 2. The more samples in each leaf of the tree when the search is truncated, the lower the acceptance probability; when points are drawn from the top level of the tree, the acceptance probability asymptotes to that of a naive draw from the prior.

The relative error on the determination of the Bayes factor (evidence ratio) scales as $1/\sqrt{N_{\rm transitions}}$, where $N_{\rm transitions}$ is the number of inter-model transitions in the RJMCMC. Thus, as the acceptance rate of inter-model jumps goes down, the RJMCMC must run longer to achieve a desired accuracy in the evidence ratio. By boosting the acceptance rate of inter-model jumps, the interpolation method described above can improve the runtime of an RJMCMC.
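For concreteness, the single-model log posteriors for the toy comparison above could be written as follows; this is an illustrative sketch (ours, with the uniform prior bounds passed in as parameters rather than hard-coded), not the paper's implementation. These are the functions each single-model MCMC would sample before the kD-tree proposal is built from its chain.

```python
import numpy as np

def log_post_gaussian(theta, d, bounds):
    """Log posterior for the Gaussian model, Eq. (4.1), with uniform priors."""
    mu, sigma = theta
    (mu_lo, mu_hi), (s_lo, s_hi) = bounds
    if not (mu_lo < mu < mu_hi and s_lo < sigma < s_hi):
        return -np.inf                  # zero prior support outside the bounds
    return (np.sum(-0.5 * ((d - mu) / sigma) ** 2)
            - len(d) * np.log(np.sqrt(2.0 * np.pi) * sigma))

def log_post_cauchy(theta, d, bounds):
    """Log posterior for the Cauchy model, Eq. (4.2), with uniform priors."""
    alpha, beta = theta
    (a_lo, a_hi), (b_lo, b_hi) = bounds
    if not (a_lo < alpha < a_hi and b_lo < beta < b_hi):
        return -np.inf
    return -np.sum(np.log(np.pi * beta * (1.0 + ((d - alpha) / beta) ** 2)))

# 100 draws from the true N(0, 1) distribution, as in the text.
d = np.random.default_rng(42).normal(0.0, 1.0, size=100)
```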
5. kD-Interpolated Jump Proposal in Higher-Dimensional Single-Model MCMCs
In the model selection example from Section 4, the models have two-dimensional parameter spaces; in Farr et al. (2011) the highest-dimensional model had five dimensions. As the number of dimensions increases, the interpolation becomes more difficult for two reasons. First, the number of samples required for a given number of subdivisions in each dimension grows exponentially with the number of dimensions; hence, for a reasonable number of samples, a high-dimensional kD-tree will have few subdivisions along each dimension. Second, as the dimensionality increases, the fraction of the volume at the "edges" and "corners" of each kD-tree box becomes more significant, and the placement of the sample becomes increasingly unrepresentative of the typical density in the box. This problem is even more pronounced when the distribution of samples does not align with the coordinate axes along which the tree subdivides.

Figure 2. The inter-model jump acceptance rate versus the number of samples per box when the kD-tree neighborhood search is truncated. As the number of samples per box increases, and the interpolation becomes less accurate, the acceptance rate falls, asymptoting to the rate for naive draws from the uniform prior.

In this section, we propose a practical solution that allows us to incorporate larger-scale features in the sample density distribution. We illustrate our solution with a single-model MCMC that updates the jump proposal distribution on the fly by using a kD-tree to interpolate the density of existing samples. We conclude with an example of a successful application to gravitational-wave parameter estimation.

The kD-tree based interpolated jump proposal described in Section 3 selects one sample from the tree at random and proposes a uniform point from the box containing that sample. This ignores the density of other samples in the neighborhood of this box, which contains additional information about the true density within the box, which is likely non-uniform. A better proposal, therefore, would account for larger-scale sample covariances in parameter space. To address significant covariance between samples, which would correspond to strong density gradients and a very non-uniform density profile within a rectangular box, the modified jump proposal does not descend to the leaf nodes of the tree, but instead stops descending whenever the number of samples in the current box, $N_{\rm box}$, falls below some threshold, $N_{\rm crit}$. We then find the principal axes of the samples contained in this box by calculating the eigenvectors of the covariance matrix of these samples. We use these eigenvectors to draw a new box. This box, henceforth called a covariance cell, contains all $N_{\rm box}$ samples from the kD-tree box, is centered on the mean of these samples, and has edges that are aligned with the principal axes. Furthermore, this box is drawn around the samples as tightly as possible; in other words, for each edge, there exists a point in the covariance cell that lies along that edge. Figure 3 illustrates these two boxes.
Figure 3. A two-dimensional illustration of the two boxes involved in the modified kD proposal. The larger box, aligned with the coordinate axes, is the normal kD-tree box containing the given samples, while the tighter box is the "covariance cell" aligned with the eigenvectors of the sample distribution. The modified kD-interpolated jump proposal draws a point from the intersection of the two boxes. Tightly correlated posteriors in parameter space such as this are typical of the gravitational-wave parameter estimation problem described in Veitch et al. (2015) and in the text. Without the modification to account for the correlated samples, the kD neighbourhood of these points would produce a very inefficient jump proposal, since most of the bounding box would contain empty (i.e., low-posterior) parameter space.
The jump proposal chooses a stored sample at random and finds the largest kD-tree box containing it and fewer than $N_{\rm crit}$ total samples. It then draws the covariance cell and proposes a new point uniformly from the intersection of the covariance cell with the original kD-tree box. The jump probability for this jump proposal is still (cf. Eq. (3.1)) given by

$$ Q(\vec{\theta}_i \to \vec{\theta}_{i+1}) = \frac{N_{\rm box}}{N V}, \quad (5.1) $$

where $N$ is the total number of samples in the kD-tree, and $V$ is the volume of the intersection between the kD-tree box and the covariance cell containing $\vec{\theta}_{i+1}$. When used as the only jump proposal for an RJMCMC, it is important that this proposal be capable of proposing points in all allowed regions of parameter space; on the other hand, when the kD proposal is used in a suite of other jump proposals, this is no longer a requirement.

This technique trivially violates detailed balance, since the jump proposal distribution depends on the past history. One way to address this problem, through diminishing adaptations (Brooks et al. 2011), requires continuously accumulating samples into the kD-tree as the MCMC explores more of the parameter space. As the number of samples drawn from the posterior PDF increases, the neighborhoods around each sample shrink with increasing sample density, and the calculated covariances between samples in the kD-tree better approximate the true covariances between parameters in the equilibrium posterior PDF. Hence, for large $N$, the change in $Q(\vec{\theta}^{(j)} \to \vec{\theta}^{(j+1)})$ approaches zero and the jump proposal does a progressively better job of sampling the equilibrium posterior while becoming asymptotically Markovian. Conservatively, the entire set of samples obtained while the jump proposal is being dynamically updated could be discarded as burn-in, and future analysis could use the static jump proposal distribution interpolated from samples accumulated during the extended burn-in phase.

To efficiently insert samples into the tree as the chain accumulates them, the algorithm from Section 3 must be modified:

(i) Instead of partitioning the sample set at its median, we now partition the bounding box at its geometric center along a particular dimension, which cycles as we descend the tree levels. Note that this allows for empty boxes if all the samples cluster to one side of the domain.

(ii) When inserting a sample, we descend the tree to a leaf node, which now can be either empty or contain one sample. If it is empty, we insert the sample at this node. If it contains one sample, we subdivide the appropriate dimension, and place both samples into the two sub-boxes. If both land in the same sub-box, we continue subdividing until each box in the resulting sub-tree contains one or zero samples.
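The following is a minimal sketch (ours, with illustrative names; not the LALInference implementation) of the covariance-cell construction and the uniform draw from the cell-box intersection described earlier in this section.

```python
import numpy as np

def covariance_cell(samples):
    """Principal axes of the samples in a kD box, plus their tight bounding
    box in the rotated frame (the 'covariance cell')."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    _, axes = np.linalg.eigh(cov)      # columns are the principal axes
    proj = (samples - mean) @ axes     # samples in the principal-axis frame
    return mean, axes, proj.min(axis=0), proj.max(axis=0)

def draw_from_intersection(box_lo, box_hi, cell, rng, max_tries=10_000):
    """Uniform draw from (covariance cell) intersected with (kD-tree box),
    by rejection sampling from the cell."""
    mean, axes, c_lo, c_hi = cell
    for _ in range(max_tries):
        u = rng.uniform(c_lo, c_hi)    # uniform in the rotated cell
        theta = mean + axes @ u        # back to the original coordinates
        if np.all(theta >= box_lo) and np.all(theta <= box_hi):
            return theta
    raise RuntimeError("intersection too small; fall back to the plain kD box")
```

The intersection volume $V$ needed in Eq. (5.1) is not available in closed form for a rotated cell; it could be estimated, for example, by multiplying the cell volume by the fraction of trial draws that land inside the kD box.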
We have implemented this modified kD-tree proposal as one of many jump proposals in the LALInferenceMCMC sampler (van der Sluys et al. 2008, Raymond et al. 2010, Raymond 2012, Veitch et al. 2015). LALInferenceMCMC is an MCMC code, based on the LIGO Algorithm Library (LAL), designed to sample the posterior on parameters of merging compact-object binaries (masses, sky location, orbital orientation, distance, etc.) encoded in their gravitational-wave signatures as observed by the ground-based gravitational-wave detectors LIGO (Harry, G. M. & the LIGO Scientific Collaboration 2010) and Virgo (Virgo Collaboration 2009). The simplest such signal has a nine-dimensional parameter space, and the posterior often includes near-degeneracies and multiple modes that make convergence of the MCMC chain very slow with traditional proposals (Aasi et al. 2013). Figure 4 shows the acceptance ratio for the kD-tree jump proposal applied as part of a
LALInferenceMCMC analysis. In this example, the kD-tree jump proposal is one of several jump proposals used during the MCMC, and the kD-tree itself is updated with accepted samples whenever its jump proposal is called. In spite of the very low acceptance rate of the proposal, applying the kD-tree proposal to one out of every 20 proposed jumps improved the convergence time of the simulation, defined as the average time to reach a specified number of independent samples from the posterior after thinning by the autocorrelation length as described in Veitch et al. (2015), by a factor of two compared to the standard suite of proposals, because it is particularly efficient at producing "mode-hopping" jumps, which are difficult to produce with the other proposals in
LALInferenceMCMC.
6. Conclusion
The need to compare evidences for multiple models arises in a large variety of physical and astronomical contexts. In this paper, we described a technique that allows for efficient evidence computations via a Reversible-Jump Markov chain Monte Carlo. This technique solves the usual problem of finding good inter-model jump proposals in an RJMCMC by using a kD-tree to quickly and accurately interpolate an approximate posterior PDF from a single-model MCMC run, and then proposing efficient inter-model jumps from this interpolated PDF.

We demonstrated the efficiency of this technique on a toy model-comparison problem described in Section 4. We also successfully applied this technique to the problem of selecting the best model for the observed distribution of black-hole X-ray binary masses, as described in Farr et al. (2011). In addition to model comparison, the PDF interpolation described here can be useful in single-model MCMCs to inform the jump proposal distribution on the fly in order to propose jumps that can efficiently sample the parameter space (see Section 5), or to test MCMC convergence.

We have made our implementation of the technique described in this paper publicly available online at http://github.com/farr/mcmc-ocaml, and also in the
LALInferenceMCMC sampler. We welcome readers to take advantage of this toolkit.
Figure 4. The cumulative acceptance ratio for the modified kD-tree jump proposal used in the
LALInferenceMCMC code as a function of the number of steps (in hundreds). The simulation in question was a nine-dimensional analysis of a simulated gravitational-wave signal injected into synthetic data similar to that taken by the LIGO and Virgo detectors (Harry, G. M. & the LIGO Scientific Collaboration 2010, Virgo Collaboration 2009). The parameter space includes the masses of the compact objects generating the gravitational wave, their location on the sky, distance, orbital orientation, the time of signal arrival, and the orbital phase. The posterior in this problem has a number of well-separated modes in the parameter space which are difficult to jump between using traditional jump proposals; in spite of the small acceptance ratio of the kD proposal, when applied to one in twenty jumps proposed in this simulation, it improved the convergence time of the sampler by a factor of two compared to using only the standard suite of proposals. The acceptance rate asymptotes to the steady-state solution once sufficient samples have been accumulated in the kD-tree to allow the sample density to be accurately interpolated; samples collected prior to this point should be discarded as burn-in.
Acknowledgments
We are grateful to Neil Cornish for interesting discussions. WF acknowledges support from NSF grant AST-0908930 and NASA grant NNX09AJ56G. IM acknowledges support from the NSF Astronomy and Astrophysics Postdoctoral Fellowship, award AST-0901985, for the early stages of this work. DS acknowledges support from a Northwestern University Summer Undergraduate Research Grant. IM is grateful for the hospitality of the Monash Center for Astrophysics, supported by a Monash Research Acceleration Grant (PI Y. Levin). This work has been partially supported by the UK Science and Technology Facilities Council.
References
Aasi J, Abadie J, Abbott B P, Abbott R, Abbott T D, Abernathy M, Accadia T, Acernese F, Adams C, Adams T et al. 2013 Phys. Rev. D 88(6), 062001.
Ascasibar Y & Binney J 2005 Mon. Not. R. Astron. Soc. 356, 872. arXiv:astro-ph/0409233.
Brooks S, Gelman A, Jones G & Meng X 2011 Handbook of Markov Chain Monte Carlo Chapman and Hall/CRC.
Chib S 1995 J. Am. Stat. Assoc. 90(432), 1313–1321.
de Berg M, Cheong O, van Kreveld M & Overmars M 2008 Computational Geometry 3rd edn Springer.
Earl D J & Deem M W 2005 Phys. Chem. Chem. Phys. 7, 3910. arXiv:physics/0508111.
Farr W M, Sravan N, Cantrell A, Kreidberg L, Bailyn C D, Mandel I & Kalogera V 2011 Astrophys. J. 741, 103.
Feroz F, Hobson M P & Bridges M 2009 Mon. Not. R. Astron. Soc. 398, 1601–1614. arXiv:0809.3437.
Friel N & Pettitt A N 2008 Journal of the Royal Statistical Society: Series B 70, 589–607.
Gaede V & Günther O 1998 ACM Computing Surveys 30(2), 170–231.
Gilks W R, Richardson S & Spiegelhalter D J 1996 Markov Chain Monte Carlo in Practice Chapman & Hall/CRC.
Green P J 1995 Biometrika 82(4), 711–732.
Harry, G. M. & the LIGO Scientific Collaboration 2010 Classical and Quantum Gravity 27(8), 084006.
Hastings W K 1970 Biometrika 57(1), 97–109.
Littenberg T B & Cornish N J 2009 Phys. Rev. D 80, 063007. arXiv:0902.0368.
Metropolis N, Rosenbluth A W, Rosenbluth M N, Teller A H & Teller E 1953 J. Chem. Phys. 21, 1087–1092.
Newton M A & Raftery A E 1994 J. R. Stat. Soc. B 56(1), 3–48.
Press W H, Teukolsky S A, Vetterling W T & Flannery B P 2007 Numerical Recipes: The Art of Scientific Computing 3rd edn Cambridge University Press.
Raymond V et al. 2010 Classical and Quantum Gravity 27(11), 114009. arXiv:0912.3746.
Raymond V 2012 PhD thesis Northwestern University.
Skilling J 2004 in R. Fischer, R. Preuss & U. V. Toussaint, eds, 'AIP Conf. Ser.' Vol. 735 pp. 395–405.
Skilling J 2006 Bayesian Analysis 1(4), 833–860.
Swendsen R H & Wang J S 1986 Physical Review Letters 57, 2607–2609.
ter Braak C J F & Vrugt J A 2008 Statistics and Computing 18(4), 435–446.
van der Sluys M, Raymond V, Mandel I, Röver C, Christensen N, Kalogera V, Meyer R & Vecchio A 2008 Classical and Quantum Gravity 25(18), 184011.
van Haasteren R 2009 ArXiv e-prints. arXiv:0911.2150.
Veitch J, Raymond V, Farr B, Farr W M, Graff P, Vitale S, Aylott B, Blackburn K, Christensen N, Coughlin M, Del Pozzo W, Feroz F, Gair J, Haster C J, Kalogera V, Littenberg T, Mandel I, O'Shaughnessy R, Pitkin M, Rodriguez C, Röver C, Sidery T, Smith R, Van Der Sluys M, Vecchio A, Vousden W & Wade L 2015 Phys. Rev. D 91(4), 042003.
Virgo Collaboration 2009 Advanced Virgo Baseline Design Virgo Technical Report VIR-0027A-09. https://tds.ego-gw.it/itf/tds/file.php?callFile=VIR-0027A-09.pdf.
Voronoi G 1907 Journal für die Reine und Angewandte Mathematik 133, 97–178.
Wolpert R L & Schmidler S C 2012 Statistica Sinica 22, 1233–1251.