Inference for Trans-dimensional Bayesian Models with Diffusive Nested Sampling
Brendon J. Brewer
Department of Statistics, The University of Auckland
Private Bag 92019, Auckland 1142, New Zealand
[email protected]
Abstract
Many inference problems involve inferring the number N of components in some region, along with their properties {x_i}_{i=1}^N, from a dataset D. A common statistical example is finite mixture modelling. In the Bayesian framework, these problems are typically solved using one of the following two methods: i) by executing a Monte Carlo algorithm (such as Nested Sampling) once for each possible value of N, and calculating the marginal likelihood or evidence as a function of N; or ii) by doing a single run that allows the model dimension N to change (such as Markov Chain Monte Carlo with birth/death moves), and obtaining the posterior for N directly. In this paper we present a general approach to this problem that uses trans-dimensional MCMC embedded within a Nested Sampling algorithm, allowing us to explore the posterior distribution and calculate the marginal likelihood (summed over N) even if the problem contains a phase transition or other difficult features such as multimodality. We present two example problems: finding sinusoidal signals in noisy data, and finding and measuring galaxies in a noisy astronomical image. Both examples demonstrate phase transitions in the relationship between the likelihood and the cumulative prior mass, highlighting the need for Nested Sampling.

Consider the following class of inference problems. There is some unknown number, N, of components in a (physical or metaphorical) region. Each of the N components has parameters x, which may be a single scalar value (for example, a mass) or a set of values (e.g. a two-dimensional position and a mass). These problems are often challenging to solve because of the unknown (and potentially large) dimensionality. In addition, they have the so-called "label-switching degeneracy" (Jasra et al., 2005): the meaning of the model is invariant to switching the identities of the components.
Finite mixture modelling is a common statistical example, but this kind of problem appears in many other contexts (e.g. Skilling, 1998). To carry out the inference, we must assign a prior to the components' parameters {x_i}_{i=1}^N. Since N may be large, it is usually easier to assign a "conditional prior" given some hyperparameters α, and then assign a prior to α. This kind of model is usually called hierarchical. The prior for N, α, and {x_i} is usually factorised in the following way:

p(N, α, {x_i}) = p(N) p(α | N) p({x_i} | α, N)    (1)
              = p(N) p(α) ∏_{i=1}^N p(x_i | α).   (2)

Here we have assumed that the priors for N and α are independent, and that the conditional prior for {x_i} is iid and does not depend on N. Given data D, we usually want to calculate properties of the posterior distribution, given by:

p(N, α, {x_i} | D) ∝ p(N, α, {x_i}) p(D | N, α, {x_i}).   (3)

The structure of the model is depicted graphically in Figure 1.

The standard computational method for calculating properties of the posterior distribution is to generate samples from it using Markov Chain Monte Carlo (MCMC). Since N is unknown, an MCMC sampler needs to be able to jump between candidate solutions with different numbers of components, using either birth-death MCMC (Stephens, 2000) or the reversible jump framework of Green (1995). The marginal likelihood, or evidence, is also an important quantity, and is the normalisation constant of the posterior given in Equation 3:

Z = p(D)    (4)
  = Σ_{N=0}^{N_max} ∫ p(N, α, {x_i}) p(D | N, α, {x_i}) d^N x dα.   (5)

Some authors have addressed these kinds of problems by doing separate runs of Nested Sampling (or another method of calculating marginal likelihoods), with N fixed at various trial values (e.g. Hou et al., 2014; Feroz and Hobson, 2014).
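As a concrete illustration of the factorisation in Equation 2, the following sketch draws (N, α, {x_i}) from the prior. The specific prior choices here (a discrete-uniform p(N), a log-uniform prior on a single hyperparameter α = μ, and an exponential conditional prior) are illustrative assumptions, not a prescription:

```python
import math
import random

# Illustrative (not prescriptive) prior choices for the factorised prior of
# Equation 2: N ~ DiscreteUniform{0, ..., N_MAX}, a single hyperparameter
# alpha = mu with a log-uniform prior, and an iid conditional prior
# x_i | mu ~ Exponential with mean mu.
N_MAX = 10

def sample_prior(rng):
    N = rng.randint(0, N_MAX)                                   # p(N)
    mu = math.exp(rng.uniform(math.log(0.1), math.log(100.0)))  # p(alpha)
    xs = [rng.expovariate(1.0 / mu) for _ in range(N)]          # p(x_i | alpha)
    return N, mu, xs

rng = random.Random(0)
N, mu, xs = sample_prior(rng)
assert 0 <= N <= N_MAX and len(xs) == N and 0.1 <= mu <= 100.0
```

Any joint state drawn this way has prior density p(N) p(μ) ∏_i p(x_i | μ), so a proposal scheme that preserves this distribution (Section 2) explores the prior correctly.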
Then the posterior for N can be calculated from the estimates of the marginal likelihood obtained as a function of N. This approach does not generalise well to large N, as it would require a large number of parallel runs.

In this paper, we introduce an approach that uses trans-dimensional MCMC moves to infer N, but embeds this process in the Nested Sampling framework to overcome difficulties that would be encountered by sampling the posterior distribution directly. Our motivation for using Nested Sampling is not primarily to calculate the marginal likelihood Z (including summing over N, as in Equation 5), although Z is readily available from the output. The motivation for using Nested Sampling, rather than just sampling the posterior using trans-dimensional MCMC, is that the posterior often contains difficult features, such as strong dependencies or multiple modes, that cause problems with mixing.

In addition to the well-known challenges caused by multiple modes and strong dependencies, phase transitions (Skilling, 2006) can also cause difficulties for Bayesian computation. Phase transitions occur when the posterior distribution is a mixture of a "slab" with high volume but low density, and a "spike" with low volume but high density. To understand why this causes problems, imagine sampling the posterior using the Metropolis algorithm. If the sampler is in the slab, it will only rarely propose a move into the spike, because the spike's volume is so small.

Figure 1:
A probabilistic graphical model (PGM) depicting the kind of model discussed in this paper, produced using daft (daft-pgm.org). The number, N, of components in the model is an unknown parameter, as are the properties of the components. The data depend on the properties of the components. Note that it is possible (and common) to have other parameters in the model that point directly to the data; we have omitted such parameters for simplicity.

If the sampler is in the spike, a proposal to move into the slab will have a very low acceptance probability because the density of the slab is so low relative to that of the spike. If the spike has a volume V_spike and the slab has a volume V_slab, then the number of MCMC iterations needed to jump from the spike to the slab, or vice versa, is of order V_slab/V_spike, which may be very large.

In many problems, the posterior distribution is a mixture of a spike component and a slab component, but the total posterior probability in the slab component is negligible. In this situation, posterior sampling is unaffected, but problems can arise when calculating the marginal likelihood using an annealing-type approach. For example, consider the thermodynamic integral, which gives the log of the marginal likelihood:

log(Z) = ∫₀¹ ⟨log L(θ)⟩_β dβ   (6)

where L(θ) is the likelihood function, and the expectation is taken with respect to the "thermal" distributions which are intermediate between the prior π(θ) and the posterior:

p_β(θ) ∝ π(θ) L(θ)^β.   (7)

If the posterior distribution is a mixture of a slab and a spike, then for some range of inverse temperatures β the probability mass in the slab and the spike will be comparable. At these temperatures the MCMC method will be unable to mix (the sampler will only visit the spike or the slab, and not both) and will give an incorrect estimate of ⟨log L(θ)⟩_β.
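To see Equation 6 in action without the complications of MCMC, the following sketch evaluates the thermodynamic integral by quadrature on a one-dimensional toy problem of our own choosing (Uniform(0,1) prior and L(θ) = e^{−θ}), where the true evidence Z = 1 − e^{−1} is known:

```python
import math

# Toy problem where the thermodynamic integral can be evaluated without MCMC:
# prior Uniform(0,1), likelihood L(theta) = exp(-theta), so log L(theta) = -theta
# and the true evidence is Z = 1 - exp(-1).
def mean_log_L(beta, n=2000):
    # <log L>_beta under p_beta ∝ pi(theta) L(theta)^beta (Equation 7),
    # computed by midpoint quadrature in one dimension.
    num = den = 0.0
    for i in range(n):
        theta = (i + 0.5) / n
        w = math.exp(-beta * theta)
        num += -theta * w
        den += w
    return num / den

# Equation 6: log Z = integral of <log L>_beta over beta in [0, 1].
m = 400
log_Z = sum(mean_log_L((j + 0.5) / m) for j in range(m)) / m
print(log_Z, math.log(1.0 - math.exp(-1.0)))  # the two values agree closely
```

In this easy unimodal problem the quadrature converges to the truth; with a slab-and-spike posterior, the failure point would be the MCMC estimates of ⟨log L⟩_β at the intermediate temperatures where both phases carry comparable mass.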
Nested Sampling (Skilling, 2006) overcomes these problems by working with an alternative sequence of probability distributions, proportional to the prior but with a hard constraint on the value of the likelihood:

p_l(θ) ∝ π(θ) 𝟙[L(θ) > l].   (8)

The normalising constants of these distributions are given by

X(l) = ∫ π(θ) 𝟙[L(θ) > l] dθ,   (9)

which is the amount of prior mass that has likelihood greater than l. As Nested Sampling proceeds, X shrinks geometrically with time, so the number of iterations required to compress from the slab to the spike is of order log(V_slab/V_spike). In Diffusive Nested Sampling (Section 1.2), it takes of order log(V_slab/V_spike) iterations to pass from the slab to the spike initially, and of order log(V_slab/V_spike) iterations to move between them thereafter. During this process Nested Sampling visits states with likelihoods intermediate between those of the slab and the spike, whereas no choice of β in Equation 7 will give these states substantial probability.

The main difficulty with Nested Sampling is coming up with methods to sample the constrained prior distributions of Equation 8. Various implementations of Nested Sampling exist, with different strategies for sampling Equation 8, as well as other features to infer whether the run was successful, or to improve the accuracy of the marginal likelihood estimate (e.g. Feroz, Hobson, & Bridges, 2009; Feroz et al., 2013; Corsaro and De Ridder, 2014; Buchner, 2014). Diffusive Nested Sampling (DNS; Brewer, Pártay, & Csányi, 2011b) is an alternative version of Nested Sampling that is appropriate when MCMC is used to explore the parameter space. Rather than working with a sequence of constrained prior distributions as in Equation 8, DNS replaces the posterior distribution with an alternative target distribution composed of a mixture of the prior distribution with the constrained priors, facilitating mixing between the multiple modes and along degeneracy curves.
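The geometric compression of X can be demonstrated with a minimal Nested Sampling loop on the same kind of toy problem (our own choice: Uniform(0,1) prior, L(θ) = e^{−θ}), where the constrained prior of Equation 8 can be sampled exactly. This is a bare sketch of the general idea, not the implementation used in this paper:

```python
import math
import random

# Toy problem: prior Uniform(0,1), L(theta) = exp(-theta), true Z = 1 - exp(-1).
# Since L is decreasing, the constraint L(theta) > L(worst) is just theta < worst,
# so the constrained prior of Equation 8 can be sampled exactly.
rng = random.Random(1)
K = 200                                    # number of "live" points
live = [rng.random() for _ in range(K)]

Z, X_prev = 0.0, 1.0
for i in range(1, 15 * K + 1):
    worst = max(live)                      # point with the lowest likelihood
    X = math.exp(-i / K)                   # prior mass shrinks geometrically
    Z += math.exp(-worst) * (X_prev - X)   # accumulate L * dX
    X_prev = X
    live[live.index(worst)] = rng.uniform(0.0, worst)  # constrained-prior draw

Z += sum(math.exp(-t) for t in live) / K * X_prev      # remaining live points
print(Z)   # close to 1 - exp(-1) ≈ 0.632
```

Note the compression: after i iterations the remaining prior mass is roughly e^{−i/K}, which is what makes passing from a wide slab to a narrow spike take only logarithmically many steps.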
The mixture of constrained priors is given by

p(θ) = Σ_{i=0}^n w_i π(θ) 𝟙[L(θ) > l_i] / X_i   (10)

where {l_i} is a set of likelihood thresholds or "levels", constructed such that the prior mass X_{i+1} associated with level i+1 is approximately e^{−1} times that of level i. The {w_i} are a set of mixture weights, usually chosen to emphasise the higher levels during the initial stages of a run (while creating levels), and set to uniform once the desired number of levels n has been created. Since DNS does not sample the posterior distribution, the samples need to be assigned weights (in a manner similar to importance sampling) to represent the posterior. See Brewer, Pártay, & Csányi (2011b) for more details about DNS.

The idea of replacing the target distribution with something other than the posterior is similar to annealed importance sampling (Neal, 2001) or parallel tempering (Hansmann, 1997), where the target distribution is modified from the posterior to something "easier". (A C++ implementation of DNS, called DNest3, is available at https://github.com/eggplantbren/DNest3 .) However, the

Figure 2:
Left: An example of a two-dimensional posterior distribution with multiple modes, strong dependence, and a phase transition (in the top-right mode). While this low-dimensional example is not difficult to sample, an analogous problem in high dimensions would be.
Right: Diffusive Nested Sampling replaces the target posterior distribution by a mixture of constrained priors. The density is non-negligible everywhere (since the prior is a component of this distribution), and states intermediate between the two phases in the top-right mode are appropriately up-weighted.

sequence of distributions used in Nested Sampling avoids some of the problems with annealing-based methods. In particular, there is no need to choose a temperature schedule (a potentially large set of tuning parameters), and the method does not fail when phase transitions are present. Figure 2 shows a mock two-dimensional posterior density with three modes, one of which contains a strong dependence, and another of which contains a phase transition. The target distribution used by DNS includes the prior as part of the mixture, so travel between the modes is possible.

Since DNS is effectively the Metropolis-Hastings algorithm applied to a distribution other than the posterior, proposal distributions are needed which define how a walker moves around the hypothesis space. In Section 2, we outline a set of proposal distributions used to explore the hypothesis space of possible values for N, α, and {x_i}. A C++ implementation of the ideas described in this paper is available online (under the terms of the GNU General Public Licence) at https://github.com/eggplantbren/RJObject .

We will now define a set of proposal distributions that can be used to sample the prior distribution p(N, α, {x_i}) = p(N) p(α) ∏_{i=1}^N p(x_i | α). These proposals will need to satisfy detailed balance in order to be valid when used inside DNS. When a proposal satisfies detailed balance with respect to the prior, the DNS algorithm can incorporate hard likelihood constraints by rejecting any proposal whose likelihood does not satisfy the constraint.
The overall proposal is a mixture of all of the proposals listed here. When using DNS, the proposals typically should be very heavy-tailed. When DNS is exploring a distribution close to the prior, large proposals will generally be needed to explore the prior efficiently. When DNS is exploring a distribution constrained to very high values of the likelihood function, much smaller proposals will be appropriate. Rather than attempting expensive tuning (since the number of distributions involved may number in the hundreds or even thousands), we apply heavy-tailed proposal distributions and simply accept that there will be some waste. If a parameter u has a uniform prior between 0 and 1, then a particular choice of heavy-tailed proposal is defined by:

u′ := mod(u + 10^{1.5 − 6a} b, 1)   (11)

where a ∼ Uniform(0, 1) and b ∼ N(0, 1). If a is close to zero, the scale of the proposal is of order 10^{1.5}, which (because of the mod function) effectively randomises u′ from the prior. If a is close to 1, the scale of the proposal is roughly 10^{−4.5} times the prior width. The coefficient of a was set to 6 because smaller proposals are usually not necessary. The value 1.5 in the exponent was found to optimise the performance of this proposal when sampling from Uniform(0, 1) and N(0, 1) priors.

Most of the proposals discussed in this paper involve changing one (or a subset) of the parameters and/or hyperparameters, while keeping the others fixed. Whenever a heavy-tailed proposal distribution is required, we use the proposal from Equation 11, or an analogous one, e.g. a discrete version when the proposal is to change N. The DNest3 package monitors the acceptance rate as a function of level. In most applications, the acceptance probability is close to 1 (or may be precisely 1) when sampling the prior, and decreases to 5-40% when sampling high levels.

The first kind of proposal we consider are proposals that change the dimension of the model, i.e. proposals that change the value of N. By necessity, these will also change the parameters {x_i}_{i=1}^N, because the number of parameters will be changed. The proposals used here are traditionally called birth and death proposals. We will assume that the prior for N is a discrete uniform distribution between 0 and some N_max, inclusive. The proposal starts by choosing a new value N′ according to

N′ = mod(N + δ_N, N_max + 1)   (12)

where δ_N is drawn from a heavy-tailed distribution which is a discrete analogue of Equation 11 (N is treated as a real number in [0, N_max + 1] and then we take the integer part at the end). The most probable values of δ_N are ±1, but values of order N_max also have some probability, to allow fast exploration when the target distribution is similar to the prior. If N′ = N, N′ is regenerated as N + 1 with probability 0.5 and N − 1 otherwise, wrapping back into {0, ..., N_max} if necessary. If N′ > N, i.e. the proposal is to add components to the model, the extra parameters needed, {x_i}_{i=N+1}^{N′}, are drawn from their prior conditional on the current value of the hyperparameters, i.e. the conditional prior p(x | α). If N′ < N, i.e. the proposal is to remove components from the model, then N − N′ components must be selected for removal. All of the components have the same probability of being selected for removal.

We now consider a proposal distribution that modifies one or more of the components {x_i}, while keeping the number of components N, as well as the hyperparameters α, fixed. Let F(x; α) be a function that takes a component x and transforms it to a value u that has a uniform distribution between 0 and 1, given α. If the component x consists of a single scalar value, F is the cumulative distribution function of the conditional prior. Denote the inverse of F by G. A proposal for a component involves transforming its parameters to [0, 1] using F, making the proposal in terms of u, and then transforming back. Specifically, the proposal chooses a new value x′_i from the current value x_i as follows:

u_i := F(x_i; α)   (13)
u′_i := mod(u_i + δ_u, 1)   (14)
x′_i := G(u′_i; α)   (15)

where δ_u is drawn from a heavy-tailed distribution as in Equation 11. Choosing just one component to change is most appropriate when the DNS distribution is very constrained. When it is close to the prior, bolder proposals that change more than one component at a time are possible. Hence, we choose the number of components to change from a heavy-tailed distribution where the most probable value is 1 but there is also a nontrivial probability of proposing to change ∼N components at once.

Another kind of proposal that we will consider is one that keeps all of the components fixed in place (i.e. leaves {x_i} unchanged), but changes the hyperparameter(s) from their current value α to a new value α′. The proposal for the hyperparameter(s) is chosen to satisfy detailed balance with respect to p(α) and should be heavy-tailed. The overall Metropolis acceptance probability for this kind of move, if sampling the prior (or the constrained prior of DNS), must include the factor

∏_{i=1}^N p(x_i | α′) / ∏_{i=1}^N p(x_i | α).   (16)

Since this proposal leaves the components {x_i} fixed, it will usually not affect the value of the likelihood, and therefore the likelihood will not need to be recomputed. This proposal is therefore most useful for mixing the values of the hyperparameters α when the target distribution is highly constrained.

The above proposals allow for changes to N, α, and {x_i}, and are therefore sufficient to allow for "correct" exploration of the prior distribution, and of the constrained prior distributions used in DNS.
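A sketch of the heavy-tailed proposal of Equation 11, together with one possible reading of its discrete analogue used in Equation 12 (the exact scaling used in the C++ implementation may differ; function names are ours):

```python
import random

def heavy_tailed_step(u, rng):
    # Equation 11: u' = mod(u + 10^(1.5 - 6a) * b, 1), with a ~ Uniform(0,1)
    # and b ~ Normal(0,1). When a ≈ 0 the step scale is of order 10^1.5,
    # which (with the mod) effectively redraws u from the prior; when a ≈ 1
    # it is roughly 10^-4.5 of the prior width.
    a = rng.random()
    b = rng.gauss(0.0, 1.0)
    return (u + 10.0 ** (1.5 - 6.0 * a) * b) % 1.0

def birth_death_proposal(N, N_max, rng):
    # One reading of the discrete analogue in Equation 12: treat N as a real
    # number on [0, N_max + 1), take a heavy-tailed step scaled to the whole
    # range, and keep the integer part. If the step rounds to no change,
    # move +/- 1 with equal probability, wrapping if necessary.
    a = rng.random()
    delta = 10.0 ** (1.5 - 6.0 * a) * rng.gauss(0.0, 1.0) * (N_max + 1)
    N_new = int((N + 0.5 + delta) % (N_max + 1))
    if N_new == N:
        N_new = (N + (1 if rng.random() < 0.5 else N_max)) % (N_max + 1)
    return N_new

rng = random.Random(0)
u, N = 0.5, 3
for _ in range(1000):
    u = heavy_tailed_step(u, rng)
    N = birth_death_proposal(N, 10, rng)
    assert 0.0 <= u < 1.0 and 0 <= N <= 10
```

Both moves are symmetric, so when used to explore the prior (in u-space) they need no extra acceptance factor beyond the hard likelihood constraint.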
However, they do not necessarily allow for efficient exploration, even of the prior itself. The main reason for this is the inability to make large changes to the hyperparameters α. If the proposed change from α to α′ is large, the ratio in Equation 16 is likely to be very small, and the move will probably be rejected.

Therefore, we allow an additional move that changes α to a new value α′, but rather than leaving the components {x_i} fixed, they are "dragged" so that they represent the distribution p(x | α′) rather than p(x | α). We do this by making use of the transformation functions F(x; α) and G(u; α) defined in Section 2.2. The "dragging" process works as follows, and must be carried out on each component:

u_i := F(x_i; α)   (17)
x′_i := G(u_i; α′)   (18)

In other words, the components' parameters are transformed to [0, 1] using the current value of the hyperparameters, and then transformed back using the proposed value of the hyperparameters, so that they represent the new conditional prior rather than the old one.
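For a concrete instance of Equations 17-18, suppose the conditional prior is exponential with mean μ (the amplitude prior used in Section 3); then F is the exponential CDF, G is its inverse, and dragging simply rescales each component by μ′/μ:

```python
import math
import random

# Transformation functions F and G (Equations 13-15, 17-18) for an
# illustrative conditional prior: x | mu ~ Exponential with mean mu.
def F(x, mu):           # CDF of the conditional prior: maps x to u in [0, 1]
    return 1.0 - math.exp(-x / mu)

def G(u, mu):           # inverse CDF: maps u back to x
    return -mu * math.log(1.0 - u)

def drag(xs, mu_old, mu_new):
    # Equations 17-18: transform each component to u-space with the current
    # hyperparameter, then back with the proposed one, so the components
    # represent p(x | mu_new) instead of p(x | mu_old).
    return [G(F(x, mu_old), mu_new) for x in xs]

rng = random.Random(0)
xs = [rng.expovariate(1.0 / 2.0) for _ in range(10000)]   # mu_old = 2
ys = drag(xs, 2.0, 5.0)                                   # propose mu_new = 5
print(sum(ys) / len(ys))   # sample mean moves to ≈ mu_new = 5
```

Because the u values are held fixed, the dragged components are exact draws from the new conditional prior whenever the old ones were draws from the old conditional prior, which is what makes large hyperparameter moves acceptable.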
In this section we demonstrate a seemingly simple example which exhibits a phase transition, making a standard MCMC approach difficult. Suppose a signal is composed of N sinusoids of different periods, amplitudes, and phases. The signal is observed with noise at various times {t_i}, and we want to use the resulting data to infer the number of sinusoids N, along with the periods {T_i}_{i=1}^N, amplitudes {A_i}_{i=1}^N, and phases {φ_i}_{i=1}^N. This kind of model has many applications, and has been solved analytically under certain sets of assumptions (see e.g. Bretthorst, 1988; Mortier et al., 2014). Similar models have been used in many different fields (e.g. Bretthorst, 2003; Umstätter et al., 2005; Brewer et al., 2007; Brewer and Stello, 2009). Note that this problem is also very closely related to the problem of detecting exoplanet signals in radial velocity data, which has attracted a lot of research attention in recent years (e.g. Gregory, 2011; Hou et al., 2014; Feroz et al., 2011).

To demonstrate the techniques, we simulated some data based on the assumption N = 2, i.e. the true signal was composed of two sinusoids. The simulated data is shown in Figure 3, and shows a large, slow oscillation with a smaller, much faster oscillation superimposed. The noise level is such that the fast oscillation is difficult to detect. The true signal has the form of Equation 21 with N = 2, the second sinusoid having a much smaller amplitude and a much shorter period than the first, with independent gaussian noise of standard deviation σ added. The signal was observed at n = 1001 points equally spaced between t = 0 and t = 100.

To infer N, the number of sinusoids, as well as all the periods, amplitudes, and phases, we need to define prior distributions. In the notation of Section 1, the parameters of each component are

x_i = {A_i, T_i, φ_i}.   (20)

For the conditional prior p(x_i | α), we introduced a single hyperparameter μ, such that A_i given μ has an exponential prior with mean μ. For the periods, we assigned a log-uniform distribution
Figure 3:
The simulated data for the sinusoidal example. The solid line shows the true signal, which consists of a large slow oscillation and a much smaller amplitude fast oscillation. The points are the noisy measurements.

between fixed boundaries, and a uniform distribution for the phases between 0 and 2π. In other words, our prior is only hierarchical for the amplitudes {A_i}. The prior for μ was log-uniform, with probability density ∝ 1/μ between fixed limits. The model for the shape of the (noise-free) signal is

y(t) = Σ_{i=1}^N A_i sin(2πt/T_i + φ_i)   (21)

where there are N sinusoids in the signal, the amplitudes are {A_i}, the periods are {T_i}, and the phases are {φ_i}. The sampling distribution (probability distribution for the data given the parameters) was a normal (gaussian) distribution centred on the model signal with standard deviation σ, applied independently to each data point:

Y_i | N, {A_i}, {T_i}, {φ_i}, σ ∼ N(y(t_i), σ²).   (22)

This also depends on an additional noise standard deviation parameter σ, which is also inferred; we used a log-uniform prior for σ between fixed limits.

We ran DNS on the simulated data shown in Figure 3. The (log) likelihood as a function of (log) enclosed prior mass X (Equation 9) is plotted in Figure 5. This curve is a standard output of Nested Sampling methods and can be used to gain insight into the structure of the problem. In particular, when phase transitions are present, this curve has concave-up regions (Skilling, 2006). Two phase transitions can be seen in Figure 5. The first phase transition, at log(X) ≈ −10 nats, separates "noise-only" models (where the entire dataset is accounted for by the noise term) from
Figure 4:
A "trace plot" of the log likelihood in the sinusoidal model, for an MCMC chain designed to sample the posterior distribution. The two phases have log-likelihoods around −730 and −745 respectively, but transitions between the two phases are quite uncommon.

"one-sinusoid" models where N = 1. At log(X) ≈ −35 nats, a second phase transition separates N = 1 models from N = 2 models. This phase transition causes a double-peaked feature in the posterior weights (lower panel of Figure 5).

If we were to do this analysis by trying to explore the posterior directly, rather than by using Nested Sampling, it would be difficult to jump between the N = 1 solution and the N = 2 solution, because the posterior distribution would be composed of a mixture of a small-volume but high-likelihood spike and a large-volume but (relatively) low-likelihood slab. Note that this phase transition still exists if we condition on N = 2, so it is not caused by the trans-dimensional model. To demonstrate this, Figure 7 shows the value of the log likelihood versus time for 45 million iterations of MCMC targeting the posterior distribution. The sampler should visit states with log likelihood ∼ −745 about 10% of the time (corresponding to models that do not contain the small oscillation), and the rest of the time should be spent in states with log likelihood around −730.

The posterior distribution for N is shown in Figure 6. The most probable value is N = 2, which is also the true value, and there is a small probability for N = 1, with decreasing probabilities for N > 2. The marginal likelihood (summed over N) was also estimated from the run, and the information, or Kullback-Leibler divergence from the prior to the posterior, was estimated to be 39.8 nats.
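The model of Equations 21-22 is simple to implement. The following sketch uses made-up parameter values (not the dataset's true values) to check that the N = 2 model is favoured over an N = 1 model on data that contain two sinusoids:

```python
import math
import random

def signal(t, A, T, phi):
    # Equation 21: a sum of N sinusoids.
    return sum(a * math.sin(2.0 * math.pi * t / p + f)
               for a, p, f in zip(A, T, phi))

def log_likelihood(ts, ys, A, T, phi, sigma):
    # Equation 22: independent gaussian noise with standard deviation sigma.
    c = -0.5 * math.log(2.0 * math.pi) - math.log(sigma)
    return sum(c - 0.5 * ((y - signal(t, A, T, phi)) / sigma) ** 2
               for t, y in zip(ts, ys))

# Synthetic data with made-up values: two sinusoids, the second smaller
# and faster, as in the paper's simulated dataset.
rng = random.Random(42)
A, T, phi, sigma = [1.0, 0.3], [25.0, 1.3], [0.0, 2.0], 0.3
ts = [0.1 * i for i in range(1001)]
ys = [signal(t, A, T, phi) + rng.gauss(0.0, sigma) for t in ts]

ll2 = log_likelihood(ts, ys, A, T, phi, sigma)               # N = 2 model
ll1 = log_likelihood(ts, ys, A[:1], T[:1], phi[:1], sigma)   # N = 1 model
print(ll2 > ll1)   # the two-sinusoid model fits the data better
```

The gap between these two log-likelihood values is the "spike versus slab" contrast discussed above: conditioned on the smaller set of components, the chain sits at the lower plateau and only rarely finds the higher one.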
Figure 5: Top panel: The shape of the likelihood function with respect to prior mass. Bottom panel: The posterior weights of the saved samples. The existence of a phase transition causes this to display two separate peaks, which would be difficult to mix between if we simply attempted to sample the posterior distribution.

Figure 6:
The inference for N, the number of sinusoids, based on the sinusoid data. The true value was N = 2, which is also the most probable value given the data. Of course, this is not always the case.
Figure 7:
A "trace plot" of the log likelihood in the SineWave model, for an MCMC chain designed to sample the posterior distribution. The two phases have log-likelihoods around −730 and −745 respectively, but transitions between the two phases are quite uncommon.
Source detection is an important problem in astrophysics. Given some data, usually one or more noisy, blurred images of the sky, we would like to know how many components N (such as stars or galaxies) are in the image, and their properties (such as flux, size, and orientation). Various approaches exist, covering a wide spectrum between ad-hoc approaches and principled inference approaches (e.g. Irwin, 1985; Bertin and Arnouts, 1996; Dolphin, 2000; Hobson & McLachlan, 2003; Brewer et al., 2013). The more principled techniques tend to be much more computationally intensive, so should only be applied to interesting or especially challenging subsets of astronomical imaging. In fact, the approach used by Brewer et al. (2013) was essentially an early version of the methodology presented in this paper.

In this section we apply the DNS approach to a toy version of the problem of detecting and quantifying extended components such as galaxies in a noisy image. We model each galaxy as a mixture of two concentric elliptical gaussian components, with the following eight parameters. Firstly, two parameters x_c and y_c describe the central position of the galaxy within the image. The flux f describes the total integral of the galaxy's intensity profile. The axis ratio q describes the ellipticity of the galaxy, and θ its orientation angle with respect to horizontal. The radius of the bigger gaussian is a parameter w. Finally, we include a "radius ratio" u describing the ratio of the radius of the smaller gaussian to that of the bigger gaussian, and a parameter v describing the fraction of the total flux in the smaller gaussian (so the fraction of the light in the bigger gaussian is 1 − v). In a coordinate system aligned with the major axis of the ellipse, the surface brightness profile of a "galaxy" is given by

ρ(x, y) = [f(1 − v) / (2πw²)] exp[−(qx² + y²/q) / (2w²)]   (23)
        + [fv / (2π(uw)²)] exp[−(qx² + y²/q) / (2(uw)²)].   (24)
Figure 8:
The simulated image for the GalaxyField example. The image is 200 × 200 pixels in size and contains N = 47 "galaxies".

We generated a simulated 200 × 200 pixel noisy image (Figure 8) containing 47 "galaxies", in order to test the algorithm.
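The profile of Equations 23-24 can be checked numerically: with the normalisations above, the two gaussians integrate to (1 − v)f and vf respectively, so the whole profile integrates to the total flux f. A sketch with made-up parameter values:

```python
import math

def galaxy_profile(x, y, f, q, w, u, v):
    # Equations 23-24: a mixture of two concentric elliptical gaussians in
    # coordinates aligned with the major axis. Each gaussian is normalised
    # so that the whole profile integrates to the total flux f.
    r2 = q * x * x + y * y / q
    big = f * (1.0 - v) / (2.0 * math.pi * w ** 2) * math.exp(-0.5 * r2 / w ** 2)
    small = (f * v / (2.0 * math.pi * (u * w) ** 2)
             * math.exp(-0.5 * r2 / (u * w) ** 2))
    return big + small

# Midpoint-rule check over a wide grid (made-up parameter values):
f, q, w, u, v = 10.0, 0.7, 2.0, 0.4, 0.3
h, n = 0.05, 600                      # grid covering [-15, 15) x [-15, 15)
total = h * h * sum(galaxy_profile(-15.0 + (i + 0.5) * h,
                                   -15.0 + (j + 0.5) * h, f, q, w, u, v)
                    for i in range(n) for j in range(n))
print(total)   # ≈ f = 10
```

The determinant of the quadratic form (qx² + y²/q) is 1, which is why the usual circular-gaussian normalisation 2πw² still applies for any axis ratio q.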
For the conditional prior for the total fluxes of the galaxies, and for the widths of the galaxies, we used the Pareto distribution. The probability density function for a variable x given hyperparameters x_min and a is

p(x | x_min, a) = a x_min^a / x^{a+1} for x ≥ x_min, and 0 otherwise.   (25)

Since we are using this for both the fluxes and the widths of the galaxies, there are four hyperparameters in total: two lower limits (one for the fluxes and one for the widths) and two slopes. The lower limits were assigned log-uniform priors between fixed bounds, and uniform priors between 0 and 1 were assigned to the inverses of the slopes. This latter choice was based on the fact that a Pareto distribution for x is an exponential distribution for ln(x), with scale length 1/a. If we do not expect the scale length to be very extreme then we might want to restrict it to [0, 1] with a uniform prior.

The conditional priors for the radius ratios u and the flux ratios v were both uniform distributions between a lower limit and an upper limit, which are hyperparameters. Since u ∈ [0, 1] and v ∈ [0, 1], these limits must themselves lie between 0 and 1. Therefore the prior for the upper limit was a uniform distribution between 0 and 1, and the prior for the lower limit given the upper limit was a uniform distribution between 0 and the upper limit. As in Section 3, we allowed the noise standard deviation σ to be a free parameter with a log-uniform prior between fixed bounds.

Figure 9: Top panel: The shape of the likelihood function with respect to prior mass for the GalaxyField data. Note the phase transition, which separates models with just the few brightest galaxies from models with many faint galaxies as well. While this doesn't affect the posterior distribution (unlike the sinewave example), it would affect calculation of the marginal likelihood if annealing were used. Bottom panel: The posterior weights of the saved samples.

The results from running DNS on the simulated galaxy data are shown in Figures 9 and 10. This problem, like the sinusoidal problem, exhibits a phase transition, which can be seen in Figure 9. The phase transition separates models which only fit the brightest galaxies from models that also fit the faint galaxies close to the noise level. This phase transition does not affect the posterior distribution, as the low-N solutions are eventually found to have very low importance. However, it would cause difficulty for a thermal approach to calculating the marginal likelihood. The marginal likelihood was also estimated by DNS for this data, and the information, or Kullback-Leibler divergence from the prior to the posterior, was estimated to be 550.2 nats.
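The Pareto conditional prior of Equation 25 can be sampled by inverting its CDF, and the observation above, that ln(x) is exponentially distributed with scale 1/a, provides a simple check. A sketch with illustrative hyperparameter values:

```python
import math
import random

def sample_pareto(x_min, a, rng):
    # Inverse-CDF sampling for the Pareto density of Equation 25:
    # the CDF is 1 - (x_min / x)^a, so x = x_min * (1 - u)^(-1/a)
    # for u ~ Uniform(0, 1).
    return x_min * (1.0 - rng.random()) ** (-1.0 / a)

rng = random.Random(0)
draws = [sample_pareto(1.0, 3.0, rng) for _ in range(100000)]

# ln(x / x_min) should be exponential with scale 1/a = 1/3:
mean_log = sum(math.log(x) for x in draws) / len(draws)
print(min(draws) >= 1.0, mean_log)   # mean of ln(x) ≈ 1/3
```

Because F and its inverse G are available in closed form here, the component-move proposals of Equations 13-15 and the dragging move of Equations 17-18 apply directly to these fluxes and widths.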
Figure 10:
The inference for N , the number of galaxies, based on the GalaxyField data. The truevalue was N = 47 . In both of the examples discussed in this paper, the likelihood evaluation involved computing amock noise-free signal from the current value of the model parameters. The mathematical formof the mock signal was a sum over all N components present. Computing the mock signal isoften the most expensive step in the evaluation of the likelihood.However, many of the proposal distributions used involve changing a subset of the parameters,while keeping others fixed. Considerable speedups can be achieved by, when appropriate, updat-ing the mock signal to reflect the proposed change to the model, rather than computing the entiremock signal from scratch. For example, if the proposal is to add two new components to themodel, the mock signal can simply be updated by adding the effect of the two new componentsto the model.In general, it is possible to compute the number of components that have been affected bythe proposal. If this is greater than or equal to N , we compute the mock signal from scratch,otherwise we subtract the effect of those components that have been removed and add in theeffect of those that have been added. In the C++ implementation of this work, when a proposaltakes place, the difference between the old and new model is cached and is available to the user,so that the likelihood can be updated rather than computed from scratch. Using this techniqueoffers a speed advantage of a factor of ∼ In this paper, we presented a general approach to trans-dimensional Bayesian inference prob-lems. To compute the posterior distribution in these contexts, birth and death MCMC moves are ommonly used. However, the approach presented here replaces the posterior distribution withthe alternative target distribution used in Diffusive Nested Sampling. This is done to facilitatemixing between multiple modes, along strong dependencies, and between different phases. 
Additionally, the marginal likelihood can be computed, including the sum over the unknown number of components. The method was demonstrated on two illustrative examples inspired by astronomical data analysis, but should be applicable in other contexts such as mixture modelling. Software (in C++) implementing the general Metropolis proposals for these models, as well as the specific examples, is available at https://github.com/eggplantbren/RJObject .

Acknowledgements
This work is supported by a Marsden Fast-Start grant from the Royal Society of New Zealand. I would like to thank the following people for valuable conversations and inspiration: Anna Pancoast (UCSB), David Hogg (NYU), Daniel Foreman-Mackey (NYU), Courtney Donovan (Auckland), Tom Loredo (Cornell), Iain Murray (Edinburgh), Ewan Cameron (Oxford), John Skilling (MaxEnt Data Consultants), and Daniela Huppenkothen (Amsterdam, NYU).
References
Bertin, E., Arnouts, S. 1996. SExtractor: Software for source extraction. Astronomy and Astrophysics Supplement Series 117, 393-404.

Bretthorst, G. L. 1988. Bayesian Spectrum Analysis and Parameter Estimation. Lecture Notes in Statistics 48, Springer-Verlag, New York.

Bretthorst, G. L. 2003. Frequency Estimation, Multiple Stationary Nonsinusoidal Resonances With Trend. Bayesian Inference and Maximum Entropy Methods in Science and Engineering 659, 3-22.

Brewer, B. J., Bedding, T. R., Kjeldsen, H., Stello, D. 2007. Bayesian Inference from Observations of Solar-like Oscillations. The Astrophysical Journal 654, 551-557.

Brewer, B. J., Stello, D. 2009. Gaussian process modelling of asteroseismic data. Monthly Notices of the Royal Astronomical Society 395, 2226-2233.

Brewer, B. J., Foreman-Mackey, D., Hogg, D. W. 2013. Probabilistic Catalogs for Crowded Stellar Fields. The Astronomical Journal 146, 7.

Brewer, B. J., Pártay, L. B., Csányi, G. 2011. Statistics and Computing 21 (4), 649-656. arXiv:0912.2380.

Brewer, B. J., Lewis, G. F., Belokurov, V., Irwin, M. J., Bridges, T. J., Evans, N. W. 2011. Modelling of the complex CASSOWARY/SLUGS gravitational lenses. Monthly Notices of the Royal Astronomical Society 412, 2521-2529.

Buchner, J. 2014. A statistical test for Nested Sampling algorithms. Statistics and Computing, 1-10.

Corsaro, E., De Ridder, J. 2014. DIAMONDS: A new Bayesian nested sampling tool. Application to peak bagging of solar-like oscillations. Astronomy and Astrophysics 571, A71.

Dolphin, A. E. 2000. PASP 112, 1383.

Feroz, F., Hobson, M. P., Cameron, E., Pettitt, A. N. 2013. Importance Nested Sampling and the MultiNest Algorithm. ArXiv e-prints arXiv:1306.2144.

Feroz, F., Hobson, M. P., Bridges, M. 2009. MNRAS 398, 1601.

Feroz, F., Balan, S. T., Hobson, M. P. 2011. Detecting extrasolar planets from stellar radial velocities using Bayesian evidence. Monthly Notices of the Royal Astronomical Society 415, 3462-3472.

Feroz, F., Hobson, M. P. 2014. Bayesian analysis of radial velocity data of GJ667C with correlated noise: evidence for only two planets. Monthly Notices of the Royal Astronomical Society 437, 3540-3549.

Green, P. J. 1995. Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika 82 (4), 711-732.

Gregory, P. C. 2011. Bayesian exoplanet tests of a new method for MCMC sampling in highly correlated model parameter spaces. Monthly Notices of the Royal Astronomical Society 410, 94-110.

Hansmann, U. H. E. 1997. Parallel tempering algorithm for conformational studies of biological molecules. Chemical Physics Letters 281 (1), 140-150.

Hobson, M. P., McLachlan, C. 2003. MNRAS 338, 765.

Hou, F., Goodman, J., Hogg, D. W. 2014. The Probabilities of Orbital-Companion Models for Stellar Radial Velocity Data. ArXiv e-prints arXiv:1401.6128.

Irwin, M. J. 1985. MNRAS 214, 575.

Jasra, A., Holmes, C. C., Stephens, D. A. 2005. Markov Chain Monte Carlo Methods and the Label Switching Problem in Bayesian Mixture Modeling. Statistical Science 20 (1), 50-67.

MacKay, D. J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge.

Mortier, A., Faria, J. P., Correia, C. M., Santerne, A., Santos, N. C. 2014. BGLS: A Bayesian formalism for the generalised Lomb-Scargle periodogram. ArXiv e-prints arXiv:1412.0467.

Neal, R. M. 2001. Annealed importance sampling. Statistics and Computing 11, 125-139.

Skilling, J. 1998. Massive Inference and Maximum Entropy. In Maximum Entropy and Bayesian Methods, Kluwer Academic Publishers, Dordrecht/Boston/London, p. 14.

Skilling, J. 2006. Nested Sampling for General Bayesian Computation. Bayesian Analysis 4, 833-860.

Stephens, M. 2000. Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics, 40-74.

Umstätter, R., Christensen, N., Hendry, M., Meyer, R., Simha, V., Veitch, J., Vigeland, S., Woan, G. 2005. Bayesian modeling of source confusion in LISA data. Physical Review D 72, 022001.