MNRAS 000, 1–5 (2020)    Preprint 26 February 2021    Compiled using MNRAS LaTeX style file v3.0
Nested sampling with any prior you like
Justin Alsing1,2★ and Will Handley3,4
1 Oskar Klein Centre for Cosmoparticle Physics, Department of Physics, Stockholm University, Stockholm SE-106 91, Sweden
2 Imperial Centre for Inference and Cosmology, Department of Physics, Imperial College London, Blackett Laboratory, Prince Consort Road, London SW7 2AZ, UK
3 Astrophysics Group, Cavendish Laboratory, JJ Thomson Avenue, Cambridge CB3 0HA, UK
4 Kavli Institute for Cosmology, Madingley Road, Cambridge CB3 0HA, UK
★ E-mail: [email protected]
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Nested sampling is an important tool for conducting Bayesian analysis in Astronomy and other fields, both for sampling complicated posterior distributions for parameter inference, and for computing marginal likelihoods for model comparison. One technical obstacle to using nested sampling in practice is the requirement that prior distributions be provided in the form of bijective transformations from the unit hyper-cube to the target prior density. For many applications – particularly when using the posterior from one experiment as the prior for another – such a transformation is not readily available. In this letter we show that parametric bijectors trained on samples from a desired prior density provide a general-purpose method for constructing transformations from the uniform base density to a target prior, enabling the practical use of nested sampling under arbitrary priors. We demonstrate the use of trained bijectors in conjunction with nested sampling on a number of examples from cosmology.
Key words: data analysis: methods
1 INTRODUCTION

Nested sampling has become one of the pillars of Bayesian computation in Astronomy (and other fields), enabling efficient sampling of complicated and high-dimensional posterior distributions for parameter inference, and estimation of marginal likelihoods for model comparison (Skilling 2006; Feroz et al. 2009; Handley et al. 2015a,b). In spite of its widespread popularity, nested sampling has had a practical limitation owing to the fact that priors must be specified as a bijective transform from the unit hyper-cube. As a consequence, nested sampling has only been practical for a somewhat restrictive class of priors, which have a readily available representation as a bijector.

This limitation has been notably prohibitive for (commonly occurring) situations where one wants to use an interim posterior derived from one experiment as the prior for another. In these situations, while the interim posterior is computable up to a normalization constant and so can be sampled from, no bijector representation readily presents itself in general.

In this letter we address the need to provide priors in the form of bijectors as required for nested sampling, by training parametric bijectors to represent priors given only a set of samples from the desired prior. For use cases where interim posteriors are to be used as priors, the bijector can simply be trained on a set of MCMC samples from the interim posterior. In the context of cosmological data analysis, representing interim posteriors as (computationally inexpensive) bijectors has an additional advantage: it circumvents the need to either multiply computationally expensive likelihoods together, or introduce an extra importance sampling step, when performing joint analyses of multiple datasets. This hence leads to substantial computational (and convenience) gains when performing multi-probe cosmological analyses.

The structure of this letter is as follows. In §2 we briefly review nested sampling theory. In §3 we discuss the representation of priors as bijective transformations from a base density. In §4 we use bijectors to represent cosmological parameter posteriors for a number of experiments, demonstrating that they are sufficiently accurate for practical use as priors for subsequent (multi-probe) cosmological analyses. We conclude in §5.
2 NESTED SAMPLING

Given a likelihood function L(θ) and prior distribution π(θ), nested sampling was incepted by Skilling (2006) as a Monte Carlo tool to compute the Bayesian evidence

$$ \mathcal{Z} = \int \mathcal{L}(\theta)\, \pi(\theta)\, d\theta , \quad (1) $$

whilst simultaneously generating samples drawn from the normalised posterior distribution P(θ) = L(θ)π(θ)/Z.

Nested sampling performs these dual processes by evolving an ensemble of n_live live points, initially drawn from the prior distribution π, and then at each subsequent iteration: (a) discarding the lowest-likelihood live point and (b) replacing it by a live point drawn from the prior π subject to the hard constraint that the new point is at a higher likelihood than the discarded point. The set of discarded dead points and likelihoods can be used to compute a statistical estimate of the Bayesian evidence (1), and be weighted to form a set of samples from the posterior distribution.

The nested sampling procedure therefore differs substantially from traditional strategies such as Gibbs sampling, Metropolis–Hastings or Hamiltonian Monte Carlo in that it acts as a likelihood scanner rather than a posterior sampler. It scans athermally, so in fact is capable of producing samples at any temperature and computing the full partition function in addition to the Bayesian evidence (Handley 2019b).

Implementations of the nested sampling meta-algorithm, e.g. MultiNest (Feroz et al. 2009) and PolyChord (Handley et al. 2015a,b), differ primarily in their choice of method by which to draw a new live point from the prior subject to a hard likelihood constraint. Nested sampling algorithms typically assume that each parameter has a prior that is uniform over the prior range [0, 1], i.e. the unit hypercube. As we shall see in the next section, this is not as restrictive as it sounds, since any non-uniform proper prior may be transformed onto the unit hypercube via a bijective transformation. It should be noted that normalising flows and bijectors have been used by Moss (2020) and Williams et al. (2021) in the context of techniques for generating new live points, but the approach in this letter for using them for prior specification is applicable to all existing nested sampling algorithms.
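To make steps (a) and (b) concrete, here is a minimal sketch of the nested sampling loop in Python. It is illustrative only: it draws constrained replacement points by naive rejection sampling from the prior (the step that MultiNest and PolyChord implement far more efficiently), it uses a fixed number of iterations rather than a convergence test, and all function and variable names are our own rather than those of any published implementation.

```python
import numpy as np

def nested_sampling(loglike, prior_draw, n_live=500, n_iter=5000):
    """Toy nested sampler: returns an estimate of ln Z and the dead points."""
    live = [prior_draw() for _ in range(n_live)]
    live_logL = np.array([loglike(theta) for theta in live])
    logZ = -np.inf
    dead = []
    dlogX = -1.0 / n_live  # expected log-shrinkage of prior volume per iteration
    for i in range(n_iter):
        # (a) discard the lowest-likelihood live point
        worst = int(np.argmin(live_logL))
        logL_star = live_logL[worst]
        logX = i * dlogX                                    # current prior volume X_i
        logw = logL_star + logX + np.log(-np.expm1(dlogX))  # weight L* (X_i - X_{i+1})
        logZ = np.logaddexp(logZ, logw)                     # accumulate evidence (1)
        dead.append((live[worst], logw))
        # (b) replace it with a prior draw subject to L > L*
        # (naive rejection sampling; real implementations are much smarter)
        while True:
            theta = prior_draw()
            logL = loglike(theta)
            if logL > logL_star:
                break
        live[worst], live_logL[worst] = theta, logL
    return logZ, dead
```

The dead points, weighted by their contributions to the evidence, can then be re-used as posterior samples, as described above.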
3 PRIORS AS BIJECTIVE TRANSFORMATIONS

A bijector f : x → y is an invertible function between the elements of two sets, with a one-to-one correspondence between elements, i.e., each element of one set is paired with precisely one element of the other set and there are no unpaired elements. In the context of probability theory, bijectors are convenient for defining mappings from one probability distribution to another. If a random variate x with probability density p(x) is transformed under a bijector f : x → y = f(x), then samples and probability densities transform as:

$$ y = f(x), \qquad x = f^{-1}(y), $$
$$ p(y) = |J|^{-1}\, p(x), \qquad p(x) = |J|\, p(y), \quad (2) $$

where J_ij = ∂f_i/∂x_j is the Jacobian corresponding to the bijector function f (and |J| denotes the absolute value of its determinant). Any probability density can be represented as a bijection from a base density, or equivalently, any target density can be completely specified by a base density and a bijector from that base density to the target. Note also that any composition of bijectors is also a bijector.

In the context of nested sampling, priors π(θ) over model parameters θ should be provided as a bijector that transforms from a base (unit) uniform density u ∼ U(0, 1) to the desired prior π(θ), i.e.,

$$ \theta = f(u), \quad (3) $$

where

$$ \pi(\theta) = |J|^{-1}\, \mathcal{U}(f^{-1}(\theta)). \quad (4) $$

In practice, all that is required for specifying nested sampling priors is the bijector function f(u).

In many use cases, such a bijector is readily available. For example, for any univariate prior density, the bijector from the unit uniform is simply the inverse cumulative distribution function (CDF) of the target prior,

$$ f(u) = \mathrm{CDF}^{-1}(u), \qquad \mathrm{CDF}(\theta) = \int_{-\infty}^{\theta} \pi(\theta')\, d\theta', \quad (5) $$

where the CDF can be interpolated and inverted numerically if the integral is intractable.
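As a concrete sketch of this numerical route (a minimal example, assuming only a one-dimensional prior density known up to normalisation; the function name `inverse_cdf_bijector` and the grid resolution are our own illustrative choices):

```python
import numpy as np
from scipy.interpolate import interp1d

def inverse_cdf_bijector(log_prior, lo, hi, n_grid=10000):
    """Tabulate the CDF of an unnormalised 1-d prior on [lo, hi] and invert it."""
    theta = np.linspace(lo, hi, n_grid)
    logp = log_prior(theta)
    pdf = np.exp(logp - logp.max())            # unnormalised density, stabilised
    cdf = np.cumsum(pdf)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])  # normalise so the CDF spans [0, 1]
    return interp1d(cdf, theta)                # f(u) = CDF^{-1}(u)

# usage: a Gaussian prior truncated to [-5, 5]
f = inverse_cdf_bijector(lambda t: -0.5 * t**2, -5.0, 5.0)
samples = f(np.random.rand(1000))              # prior draws from uniform u
```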
However, for the general case of correlated multivariate priors, the requisite bijector is far more involved to compute analytically (see §5.1 of Feroz et al. 2009) and in general must be obtained numerically by other means. One particularly common scenario where this arises is when one wants to use the (sampled) posterior from one experiment as the prior for another; in these cases, all that is available are samples from the desired prior and un-normalized values of its density at those points.

The solution is to fit a parametric model for the required bijector f(u; w) (with parameters w) to the target prior. In the following sections we describe how to define expressive parametric bijector models, and fit them to target (prior) densities.

Parametric bijector models f(u; w) have recently been gaining popularity for solving complex density estimation tasks, probabilistic learning and variational inference, with a plethora of models available. In this section we give a brief overview of the key methods, in order of increasing complexity. For a more in-depth review, see Papamakarios et al. (2019).

Compositions of simple invertible functions
Since any composition of bijectors is also a bijector, it is possible to construct flexible parametric bijectors by composing even simple invertible functions, such as affine transformations, exponentials, sigmoids, hyperbolic functions, etc.

Some such transformations applied to a base density lead to already well-studied distribution families, such as the “sinh-arcsinh” distributions (Jones & Pewsey 2009), which are generated by applying an $a\,\sinh(b\,\sinh^{-1}(x) + c)$ bijector (with parameters a, b and c) to a base normal random variate. Chaining a number of such transformations together, interleaved with linear transformations of the parameter space, can already lead to rather expressive families of distributions.
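As a sketch of this idea using TensorFlow Probability (the library used for the bijector code later in this letter; the specific parameter values below are arbitrary), a sinh-arcsinh bijector chained with an affine transform of a standard normal base density already gives a skewed, heavy- or light-tailed family:

```python
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

# compose simple invertible functions: affine ∘ sinh-arcsinh
bij = tfb.Chain([
    tfb.Shift(0.5),                                 # linear shift
    tfb.Scale(2.0),                                 # linear scale
    tfb.SinhArcsinh(skewness=0.3, tailweight=1.5),  # sinh((sinh^{-1}(x) + s) t)
])
dist = tfd.TransformedDistribution(tfd.Normal(0.0, 1.0), bij)
x = dist.sample(1000)   # samples from the transformed (target) density
lp = dist.log_prob(x)   # automatically includes the log|Jacobian| of eq. (2)
```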
Autoregressive flows

More sophisticated bijector models typically involve constructing normalizing flows parameterized by neural networks. Of these neural network based models, Inverse Autoregressive Flows (IAFs; Kingma et al. 2016) have been demonstrated to be effective at representing complex densities, are fast to train, and fast to evaluate once trained. IAFs define a series of n bijectors composed into a normalizing flow, as follows:

$$ z^{(0)} = n \sim \mathcal{N}(0, I) $$
$$ z^{(1)} = \mu^{(1)}[z^{(0)}] + \sigma^{(1)}[z^{(0)}] \odot z^{(0)} $$
$$ \vdots $$
$$ y = z^{(n)} = \mu^{(n)}[z^{(n-1)}] + \sigma^{(n)}[z^{(n-1)}] \odot z^{(n-1)}, \quad (6) $$

where the shifts μ and scales σ of the subsequent affine transforms are autoregressive and are parameterized by neural networks, i.e.,

$$ \mu^{(k)}_i = \mu^{(k)}_i[z_{1:i-1}; w], \qquad \sigma^{(k)}_i = \sigma^{(k)}_i[z_{1:i-1}; w], \quad (7) $$

where w denote the weights and biases of the neural network(s). For more details see Papamakarios et al. (2017).
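A sketch of such a flow in TensorFlow Probability: inverting a masked autoregressive flow yields an IAF, and stacking a few layers interleaved with permutations gives an expressive bijector. The dimensionality, layer count and hidden-unit sizes below are illustrative assumptions, not the settings used for the results in this letter.

```python
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

dim = 6          # e.g. number of cosmological parameters (illustrative)
layers = []
for _ in range(3):
    made = tfb.AutoregressiveNetwork(params=2, event_shape=[dim],
                                     hidden_units=[64, 64])
    # an MAF computes the affine shift/scale of eq. (6) autoregressively;
    # inverting it gives an inverse autoregressive flow (IAF)
    layers.append(tfb.Invert(tfb.MaskedAutoregressiveFlow(made)))
    layers.append(tfb.Permute(permutation=list(range(dim))[::-1]))
flow = tfd.TransformedDistribution(
    tfd.Sample(tfd.Normal(0.0, 1.0), sample_shape=[dim]),  # base N(0, I)
    bijector=tfb.Chain(layers))
```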
Continuous normalizing flows

In order to take a further step forward in expressivity, one can replace the discrete set of transforms with an integral of continuous-time dynamics (Chen et al. 2018; Grathwohl et al. 2018), i.e., defining the bijector as the solution to an initial value problem:

$$ z(0) \sim \mathcal{N}(0, I), \qquad y = z(t_1) = z(0) + \int_0^{t_1} f(z, t; w)\, dt, \quad (8) $$

where the dynamics f(z, t; w) = dz/dt are parameterized by a neural network with weights w. The elevation from discrete to continuous normalizing flows comes with a significant increase in expressivity, making such models appropriate for the most complex bijector representation tasks. See Chen et al. (2018) and Grathwohl et al. (2018) for more details.

Fitting a bijector model to the target can be achieved by minimizing the Kullback–Leibler (KL) divergence between the target π(θ) and the (model) bijective density, with respect to the model parameters w. Since the KL divergence will not be analytically tractable in general, one can instead minimize a Monte Carlo estimate of the integral given samples from the target prior {θ} ∼ π(θ):

$$ \mathcal{L}(w) = \widehat{\mathrm{KL}}(w) = -\frac{1}{N} \sum_{i=1}^{N} \ln \pi^*(\theta_i; w), \quad (9) $$

where π*(θ; w) = |J(θ; w)|⁻¹ U(f⁻¹(θ; w)) is the probability density corresponding to the parametric bijector model, with parameters w.

Alternatively, when one has access to not only samples from the target prior, but also the values of the density at those samples, it can be advantageous to regress the model to the target density using the L2 norm, i.e., minimize the loss function

$$ \mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^{N} \big\| \ln \pi(\theta_i) - \ln \pi^*(\theta_i; w) \big\|^2. \quad (10) $$

This has been shown to be less noisy than minimizing the (sampled) KL-divergence in certain cases (Seljak & Yu 2019), and can be readily extended to exploit gradient information about the target density if it is available (see Seljak & Yu 2019 for details).
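A minimal training loop for the sampled-KL loss of eq. (9) in TensorFlow, assuming `flow` is the TransformedDistribution built above and `prior_samples` is a hypothetical (N, dim) array of samples from the target prior (e.g. MCMC samples from an interim posterior):

```python
import tensorflow as tf

theta = tf.convert_to_tensor(prior_samples, dtype=tf.float32)  # assumed (N, dim)
_ = flow.log_prob(theta[:1])          # build the network variables once
opt = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        # Monte Carlo estimate of the KL divergence, eq. (9),
        # up to a w-independent constant (the target's entropy)
        loss = -tf.reduce_mean(flow.log_prob(theta))
    grads = tape.gradient(loss, flow.trainable_variables)
    opt.apply_gradients(zip(grads, flow.trainable_variables))
    return loss

for step in range(5000):
    loss = train_step()
```

Since the base density of this flow is Gaussian rather than uniform, the final map from the unit hypercube composes the trained bijector with an inverse normal CDF (e.g. `tfb.Invert(tfb.NormalCDF())`), so that eq. (3) is satisfied.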
4 COSMOLOGICAL EXAMPLES

As a concrete and non-trivial cosmological example we choose the concordance ΛCDM model of the universe (Scott 2018) with an additional spatial curvature parameter Ω_K. Such models have been debated recently in the literature (Handley 2019a; Di Valentino et al. 2019; Efstathiou & Gratton 2020). We choose this model not for its controversy, but merely for the fact that it yields non-trivial banana-like priors and posteriors, and therefore proves a more challenging cosmological example for a bijector to learn than models without curvature.

For likelihoods we choose three datasets: (1) the Planck baseline TT,TE,EE+lowE likelihood (without lensing) (Planck Collaboration 2020a,b; hereafter CMB); (2) Baryon Acoustic Oscillation and Redshift Space Distortion measurements from the Baryon Oscillation Spectroscopic Survey (BOSS) DR12 (Alam et al. 2017; Beutler et al. 2011; Ross et al. 2015; hereafter BAO); and (3) local cosmic distance ladder measurements of the expansion rate, using type Ia SNe calibrated by variable Cepheid stars from the SH0ES collaboration (SH0ES Collaboration 2018; hereafter SH0ES).

[Figure 1: comparison of log Z estimates for log P(CMB, BAO), log P(BAO | CMB) + log P(CMB), and log P(CMB | BAO) + log P(BAO), plus the analogous SH0ES combinations.]
Figure 1. In the same three paths as Figure 2, the Bayesian evidences log Z for each of the three runs combine consistently to within error, demonstrating that the bijective prior methodology can be used to recover accurate evidences as well as posteriors. Plot produced under anesthetic (Handley 2019b).
The bijector code is implemented as an extension to CosmoChord, using forpy¹ to call tensorflow bijectors from FORTRAN. All the nested sampling runs, code and details for reproducing the results in this letter can be found on Zenodo (Handley & Alsing 2020).

¹ https://github.com/ylikx/forpy
[Figure 2: hexagonal arrangement of corner plots in the (Ω_K, H0) plane; panels show π → CMB, CMB → BAO, π → BAO, BAO → CMB, π → BAO+CMB and BAO+CMB.]
Figure 2. Reading around the upper edge of the hexagon: from the prior π, nested sampling uses the Planck CMB likelihood to generate a set of posterior samples. These are then used to train a new bijector to define a prior in the next step, for which a new nested sampling run using a BAO likelihood is performed. The lower edge is the same process but with CMB and BAO reversed, and through the middle one recovers the posterior if both CMB and BAO likelihoods are applied at the same time. The right-most plot shows that the posteriors recovered through these three paths are consistent to within sampling error. Plot produced under anesthetic (Handley 2019b).

Figure 2 shows our key results for CMB and BAO data. In this case we compare the three routes to a combined CMB and BAO constraint on a kΛCDM universe. The simplest approach is to run nested sampling with the default
CosmoMC prior with both likelihoods turned on: π → CMB+BAO. The bijector approach first runs the CMB dataset to update the prior, π → CMB; the posterior samples from this run are then used to train a CMB bijector, which provides the distribution used as the prior for the Bayesian update with the next dataset, CMB → BAO. In Figure 2 we show that doing this in either order recovers the same posterior.

One critical advantage of this approach is that, since much of the cost of a nested sampling run is in compressing from prior to posterior, whilst the first run π → CMB is expensive, the second update CMB → BAO requires considerably fewer likelihood calls. Furthermore, since the journey from prior to posterior is shorter, less Poisson error accumulates over nested sampling iterations, making for more accurate posteriors and evidence estimates. For expensive models like those involving cosmic curvature, this is a significant advantage.

It should be noted that as CMB and BAO are in mild tension, it has been argued that these combined constraints should be viewed with some scepticism (Handley 2019a; Vagnozzi et al. 2020), and this tension is much stronger between CMB and SH0ES data (Verde et al. 2019). Whilst this is a general cause for simultaneous concern and excitement scientifically speaking, in the context of this work it also serves to highlight the expressivity of these bijector models when combined with nested sampling chains. The tail behaviour of these bijective transformations is sufficiently powerful to still recover the correct answer even when combining likelihoods which are in strong tension. Figure 1 demonstrates that the evidence estimates extracted by this run from anesthetic (Handley 2019b) are also consistent to within the errors on nested sampling.
5 CONCLUSIONS

Nested sampling implementations have been burdened by a practical limitation that priors need to be specified as bijective transforms from the unit hyper-cube, and in many cases such a representation is not readily available. We have shown that nested sampling priors can be constructed by fitting flexible parametric bijectors to samples from the target prior, providing a general-purpose approach for constructing generic priors for nested sampling. This is of particular importance for use cases where an interim posterior from one experiment is to be used as the prior for another, where a parametric bijector can then simply be trained on MCMC samples from the interim posterior. We have demonstrated the use of parametric bijectors on some typical use-cases from cosmology.

In the longer term, we plan to release a library of cosmological bijectors for use as nested sampling priors, in order to save users computational resources and standardise cosmological prior choice.
ACKNOWLEDGEMENTS
Justin Alsing was supported by the research project grant “Fundamental Physics from Cosmological Surveys” funded by the Swedish Research Council (VR) under Dnr 2017-04212. Will Handley was supported by a Gonville & Caius Research Fellowship, STFC grant number ST/T001054/1 and a Royal Society University Research Fellowship.
REFERENCES
Alam S., et al., 2017, MNRAS, 470, 2617
Beutler F., et al., 2011, MNRAS, 416, 3017
Chen R. T., Rubanova Y., Bettencourt J., Duvenaud D. K., 2018, in Advances in Neural Information Processing Systems. pp 6571–6583
Di Valentino E., Melchiorri A., Silk J., 2019, Nature Astronomy, 4, 196–203
Efstathiou G., Gratton S., 2020, Monthly Notices of the Royal Astronomical Society: Letters, 496, L91–L95
Feroz F., Hobson M. P., Bridges M., 2009, Monthly Notices of the Royal Astronomical Society, 398, 1601–1614
Grathwohl W., Chen R. T., Bettencourt J., Sutskever I., Duvenaud D., 2018, arXiv preprint arXiv:1810.01367
Handley W., 2019a, Curvature tension: evidence for a closed universe (arXiv:1908.09139)
Handley W., 2019b, Journal of Open Source Software, 4, 1414
Handley W., Alsing J., 2020, Nested sampling with any prior you like (Supplementary inference products), doi:10.5281/zenodo.4247552, https://doi.org/10.5281/zenodo.4247552
Handley W. J., Hobson M. P., Lasenby A. N., 2015a, Monthly Notices of the Royal Astronomical Society: Letters, 450, L61–L65
Handley W. J., Hobson M. P., Lasenby A. N., 2015b, Monthly Notices of the Royal Astronomical Society, 453, 4385–4399
Jones M. C., Pewsey A., 2009, Biometrika, 96, 761
Kingma D. P., Salimans T., Jozefowicz R., Chen X., Sutskever I., Welling M., 2016, in Advances in Neural Information Processing Systems. pp 4743–4751
Moss A., 2020, MNRAS, 496, 328
Papamakarios G., Pavlakou T., Murray I., 2017, arXiv preprint arXiv:1705.07057
Papamakarios G., Nalisnick E., Rezende D. J., Mohamed S., Lakshminarayanan B., 2019, arXiv preprint arXiv:1912.02762
Planck Collaboration 2020a, A&A, 641, A5
Planck Collaboration 2020b, A&A, 641, A6
Ross A. J., Samushia L., Howlett C., Percival W. J., Burden A., Manera M., 2015, MNRAS, 449, 835
SH0ES Collaboration 2018, ApJ, 855, 136
Scott D., 2018, The Standard Model of Cosmology: A Skeptic's Guide (arXiv:1804.01318)
Seljak U., Yu B., 2019, arXiv preprint arXiv:1901.04454
Skilling J., 2006, Bayesian Anal., 1, 833
Vagnozzi S., Di Valentino E., Gariazzo S., Melchiorri A., Mena O., Silk J., 2020, arXiv e-prints, p. arXiv:2010.02230
Verde L., Treu T., Riess A. G., 2019, arXiv e-prints, p. arXiv:1907.10625
Williams M. J., Veitch J., Messenger C., 2021, arXiv e-prints, p. arXiv:2102.11056

This paper has been typeset from a TEX/LaTeX file prepared by the author.