MNRAS 000, 1–5 (2020)    Preprint 26 February 2021    Compiled using MNRAS LaTeX style file v3.0
Nested sampling with any prior you like
Justin Alsing1,2★ and Will Handley3,4
1 Oskar Klein Centre for Cosmoparticle Physics, Department of Physics, Stockholm University, Stockholm SE-106 91, Sweden
2 Imperial Centre for Inference and Cosmology, Department of Physics, Imperial College London, Blackett Laboratory, Prince Consort Road, London SW7 2AZ, UK
3 Astrophysics Group, Cavendish Laboratory, JJ Thomson Avenue, Cambridge CB3 0HA, UK
4 Kavli Institute for Cosmology, Madingley Road, Cambridge CB3 0HA, UK
★ E-mail: [email protected]
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Nested sampling is an important tool for conducting Bayesian analysis in Astronomy and other fields, both for sampling complicated posterior distributions for parameter inference, and for computing marginal likelihoods for model comparison. One technical obstacle to using nested sampling in practice is the requirement that prior distributions be provided in the form of bijective transformations from the unit hyper-cube to the target prior density. For many applications – particularly when using the posterior from one experiment as the prior for another – such a transformation is not readily available. In this letter we show that parametric bijectors trained on samples from a desired prior density provide a general-purpose method for constructing transformations from the uniform base density to a target prior, enabling the practical use of nested sampling under arbitrary priors. We demonstrate the use of trained bijectors in conjunction with nested sampling on a number of examples from cosmology.
Key words: data analysis: methods
1 INTRODUCTION

Nested sampling has become one of the pillars of Bayesian computation in Astronomy (and other fields), enabling efficient sampling of complicated and high-dimensional posterior distributions for parameter inference, and estimation of marginal likelihoods for model comparison (Skilling 2006; Feroz et al. 2009; Handley et al. 2015a,b). In spite of its widespread popularity, nested sampling has had a practical limitation owing to the fact that priors must be specified as a bijective transform from the unit hyper-cube. As a consequence, nested sampling has only been practical for a somewhat restrictive class of priors, which have a readily available representation as a bijector.

This limitation has been notably prohibitive for (commonly occurring) situations where one wants to use an interim posterior derived from one experiment as the prior for another. In these situations, while the interim posterior is computable up to a normalization constant and so can be sampled from, no bijector representation readily presents itself in general.

In this letter we address the need to provide priors in the form of bijectors as required for nested sampling, by training parametric bijectors to represent priors given only a set of samples from the desired prior. For use cases where interim posteriors are to be used as priors, the bijector can simply be trained on a set of MCMC samples from the interim posterior. In the context of cosmological data analysis, representing interim posteriors as (computationally inexpensive) bijectors has an additional advantage: it circumvents the need to either multiply computationally expensive likelihoods together, or introduce an extra importance sampling step, when performing joint analyses of multiple datasets. This hence leads to substantial computational (and convenience) gains when performing multi-probe cosmological analyses.

The structure of this letter is as follows. In §2 we briefly review nested sampling theory. In §3 we discuss the representation of priors as bijective transformations from a base density. In §4 we use bijectors to represent cosmological parameter posteriors for a number of experiments, demonstrating that they are sufficiently accurate for practical use as priors for subsequent (multi-probe) cosmological analyses. We conclude in §5.
2 NESTED SAMPLING

Given a likelihood function L(θ) and prior distribution π(θ), nested sampling was incepted by Skilling (2006) as a Monte Carlo tool to compute the Bayesian evidence

$$ \mathcal{Z} = \int \mathcal{L}(\theta)\, \pi(\theta)\, d\theta , \quad (1) $$

whilst simultaneously generating samples drawn from the normalised posterior distribution P(θ) = L(θ)π(θ)/Z.

Nested sampling performs these dual processes by evolving an ensemble of n_live live points, initially drawn from the prior distribution π, and then at each subsequent iteration: (a) discarding the lowest-likelihood live point and (b) replacing it by a live point drawn from the prior π subject to the hard constraint that the new point is at a higher likelihood than the discarded point. The set of discarded dead points and likelihoods can be used to compute a statistical estimate of the Bayesian evidence (1), and be weighted to form a set of samples from the posterior distribution.

The nested sampling procedure therefore differs substantially from traditional strategies such as Gibbs sampling, Metropolis–Hastings or Hamiltonian Monte Carlo in that it acts as a likelihood scanner rather than a posterior sampler. It scans athermally, so in fact is capable of producing samples at any temperature and computing the full partition function in addition to the Bayesian evidence (Handley 2019b).

Implementations of the nested sampling meta-algorithm, e.g. MultiNest (Feroz et al. 2009) and PolyChord (Handley et al. 2015a,b), differ primarily in their choice of method by which to draw a new live point from the prior subject to a hard likelihood constraint. Nested sampling algorithms typically assume that each parameter has a prior that is uniform over the prior range [0, 1], i.e. the unit hypercube. As we shall see in the next section, this is not as restrictive as it sounds, since any non-uniform proper prior may be transformed onto the unit hypercube via a bijective transformation. It should be noted that normalising flows and bijectors have been used by Moss (2020) and Williams et al. (2021) in the context of techniques for generating new live points, but the approach in this letter for using them for prior specification is applicable to all existing nested sampling algorithms.
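To make steps (a) and (b) concrete, here is a minimal sketch of the nested sampling loop in Python. It is illustrative only: it draws constrained replacement points by naive rejection sampling from the prior (the step that MultiNest and PolyChord implement far more efficiently), it uses a fixed number of iterations rather than a convergence test, and all function and variable names are our own rather than those of any published implementation.

```python
import numpy as np

def nested_sampling(loglike, prior_draw, n_live=500, n_iter=5000):
    """Toy nested sampler: returns an estimate of ln Z and the dead points."""
    live = [prior_draw() for _ in range(n_live)]
    live_logL = np.array([loglike(theta) for theta in live])
    logZ = -np.inf
    dead = []
    dlogX = -1.0 / n_live  # expected log-shrinkage of prior volume per iteration
    for i in range(n_iter):
        # (a) discard the lowest-likelihood live point
        worst = int(np.argmin(live_logL))
        logL_star = live_logL[worst]
        logX = i * dlogX                                    # current prior volume X_i
        logw = logL_star + logX + np.log(-np.expm1(dlogX))  # weight L* (X_i - X_{i+1})
        logZ = np.logaddexp(logZ, logw)                     # accumulate evidence (1)
        dead.append((live[worst], logw))
        # (b) replace it with a prior draw subject to L > L*
        # (naive rejection sampling; real implementations are much smarter)
        while True:
            theta = prior_draw()
            logL = loglike(theta)
            if logL > logL_star:
                break
        live[worst], live_logL[worst] = theta, logL
    return logZ, dead
```

The dead points, weighted by their contributions to the evidence, can then be re-used as posterior samples, as described above.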
3 PRIORS AS BIJECTIVE TRANSFORMATIONS

A bijector f : x → y is an invertible function between the elements of two sets, with a one-to-one correspondence between elements, i.e., each element of one set is paired with precisely one element of the other set and there are no unpaired elements. In the context of probability theory, bijectors are convenient for defining mappings from one probability distribution to another. If a random variate x with probability density p(x) is transformed under a bijector f : x → y = f(x), then samples and probability densities transform as:

$$ y = f(x), \qquad x = f^{-1}(y), $$
$$ p(y) = |J|^{-1}\, p(x), \qquad p(x) = |J|\, p(y), \quad (2) $$

where J_ij = ∂f_i/∂x_j is the Jacobian corresponding to the bijector function f (and |J| denotes the absolute value of its determinant). Any probability density can be represented as a bijection from a base density, or equivalently, any target density can be completely specified by a base density and a bijector from that base density to the target. Note also that any composition of bijectors is also a bijector.

In the context of nested sampling, priors π(θ) over model parameters θ should be provided as a bijector that transforms from a base (unit) uniform density u ∼ U(0, 1) to the desired prior π(θ), i.e.,

$$ \theta = f(u), \quad (3) $$

where

$$ \pi(\theta) = |J|^{-1}\, \mathcal{U}(f^{-1}(\theta)). \quad (4) $$

In practice, all that is required for specifying nested sampling priors is the bijector function f(u).

In many use cases, such a bijector is readily available. For example, for any univariate prior density, the bijector from the unit uniform is simply the inverse cumulative distribution function (CDF) of the target prior,

$$ f(u) = \mathrm{CDF}^{-1}(u), \qquad \mathrm{CDF}(\theta) = \int_{-\infty}^{\theta} \pi(\theta')\, d\theta', \quad (5) $$

where the CDF can be interpolated and inverted numerically if the integral is intractable.
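As a concrete sketch of this numerical route (a minimal example, assuming only a one-dimensional prior density known up to normalisation; the function name `inverse_cdf_bijector` and the grid resolution are our own illustrative choices):

```python
import numpy as np
from scipy.interpolate import interp1d

def inverse_cdf_bijector(log_prior, lo, hi, n_grid=10000):
    """Tabulate the CDF of an unnormalised 1-d prior on [lo, hi] and invert it."""
    theta = np.linspace(lo, hi, n_grid)
    logp = log_prior(theta)
    pdf = np.exp(logp - logp.max())            # unnormalised density, stabilised
    cdf = np.cumsum(pdf)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])  # normalise so the CDF spans [0, 1]
    return interp1d(cdf, theta)                # f(u) = CDF^{-1}(u)

# usage: a Gaussian prior truncated to [-5, 5]
f = inverse_cdf_bijector(lambda t: -0.5 * t**2, -5.0, 5.0)
samples = f(np.random.rand(1000))              # prior draws from uniform u
```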
However, for the general case of correlated multivariate priors, the requisite bijector is far more involved to compute analytically (see §5.1 of Feroz et al. 2009) and in general must be obtained numerically by other means. One particularly common scenario where this arises is when one wants to use the (sampled) posterior from one experiment as the prior for another; in these cases, all that is available are samples from the desired prior and un-normalized values of its density at those points.

The solution is to fit a parametric model for the required bijector f(u; w) (with parameters w) to the target prior. In the following sections we describe how to define expressive parametric bijector models, and fit them to target (prior) densities.

Parametric bijector models f(u; w) have recently been gaining popularity for solving complex density estimation tasks, probabilistic learning and variational inference, with a plethora of models available. In this section we give a brief overview of the key methods, in order of increasing complexity. For a more in-depth review, see Papamakarios et al. (2019).

Compositions of simple invertible functions
Since any composition of bijectors is also a bijector, it is possible to construct flexible parametric bijectors by composing even simple invertible functions, such as affine transformations, exponentials, sigmoids, hyperbolic functions, etc.

Some such transformations applied to a base density lead to already well-studied distribution families, such as the “sinh-arcsinh” distributions (Jones & Pewsey 2009), which are generated by applying an $a\,\sinh(b\,\sinh^{-1}(x) + c)$ bijector (with parameters a, b and c) to a base normal random variate. Chaining a number of such transformations together, interleaved with linear transformations of the parameter space, can already lead to rather expressive families of distributions.
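As a sketch of this idea using TensorFlow Probability (the library used for the bijector code later in this letter; the specific parameter values below are arbitrary), a sinh-arcsinh bijector chained with an affine transform of a standard normal base density already gives a skewed, heavy- or light-tailed family:

```python
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

# compose simple invertible functions: affine ∘ sinh-arcsinh
bij = tfb.Chain([
    tfb.Shift(0.5),                                 # linear shift
    tfb.Scale(2.0),                                 # linear scale
    tfb.SinhArcsinh(skewness=0.3, tailweight=1.5),  # sinh((sinh^{-1}(x) + s) t)
])
dist = tfd.TransformedDistribution(tfd.Normal(0.0, 1.0), bij)
x = dist.sample(1000)   # samples from the transformed (target) density
lp = dist.log_prob(x)   # automatically includes the log|Jacobian| of eq. (2)
```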
Autoregressive flows

More sophisticated bijector models typically involve constructing normalizing flows parameterized by neural networks. Of these neural network based models, Inverse Autoregressive Flows (IAFs; Kingma et al. 2016) have been demonstrated to be effective at representing complex densities, are fast to train, and fast to evaluate once trained. IAFs define a series of n bijectors composed into a normalizing flow, as follows:

$$ z^{(0)} = n \sim \mathcal{N}(0, I) $$
$$ z^{(1)} = \mu^{(1)}[z^{(0)}] + \sigma^{(1)}[z^{(0)}] \odot z^{(0)} $$
$$ \vdots $$
$$ y = z^{(n)} = \mu^{(n)}[z^{(n-1)}] + \sigma^{(n)}[z^{(n-1)}] \odot z^{(n-1)}, \quad (6) $$

where the shifts μ and scales σ of the subsequent affine transforms are autoregressive and are parameterized by neural networks, i.e.,

$$ \mu^{(k)}_i = \mu^{(k)}_i[z_{1:i-1}; w], \qquad \sigma^{(k)}_i = \sigma^{(k)}_i[z_{1:i-1}; w], \quad (7) $$

where w denote the weights and biases of the neural network(s). For more details see Papamakarios et al. (2017).
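A sketch of such a flow in TensorFlow Probability: inverting a masked autoregressive flow yields an IAF, and stacking a few layers interleaved with permutations gives an expressive bijector. The dimensionality, layer count and hidden-unit sizes below are illustrative assumptions, not the settings used for the results in this letter.

```python
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

dim = 6          # e.g. number of cosmological parameters (illustrative)
layers = []
for _ in range(3):
    made = tfb.AutoregressiveNetwork(params=2, event_shape=[dim],
                                     hidden_units=[64, 64])
    # an MAF computes the affine shift/scale of eq. (6) autoregressively;
    # inverting it gives an inverse autoregressive flow (IAF)
    layers.append(tfb.Invert(tfb.MaskedAutoregressiveFlow(made)))
    layers.append(tfb.Permute(permutation=list(range(dim))[::-1]))
flow = tfd.TransformedDistribution(
    tfd.Sample(tfd.Normal(0.0, 1.0), sample_shape=[dim]),  # base N(0, I)
    bijector=tfb.Chain(layers))
```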
Continuous normalizing flows

In order to take a further step forward in expressivity, one can replace the discrete set of transforms with an integral of continuous-time dynamics (Chen et al. 2018; Grathwohl et al. 2018), i.e., defining the bijector as the solution to an initial value problem:

$$ z(0) \sim \mathcal{N}(0, I), \qquad y = z(t_1) = z(0) + \int_0^{t_1} f(z, t; w)\, dt, \quad (8) $$

where the dynamics f(z, t; w) = dz/dt are parameterized by a neural network with weights w. The elevation from discrete to continuous normalizing flows comes with a significant increase in expressivity, making such models appropriate for the most complex bijector representation tasks. See Chen et al. (2018) and Grathwohl et al. (2018) for more details.

Fitting a bijector model to the target can be achieved by minimizing the Kullback–Leibler (KL) divergence between the target π(θ) and the (model) bijective density, with respect to the model parameters w. Since the KL divergence will not be analytically tractable in general, one can instead minimize a Monte Carlo estimate of the integral given samples from the target prior {θ} ∼ π(θ):

$$ \mathcal{L}(w) = \widehat{\mathrm{KL}}(w) = -\frac{1}{N} \sum_{i=1}^{N} \ln \pi^*(\theta_i; w), \quad (9) $$

where π*(θ; w) = |J(θ; w)|⁻¹ U(f⁻¹(θ; w)) is the probability density corresponding to the parametric bijector model, with parameters w.

Alternatively, when one has access to not only samples from the target prior, but also the values of the density at those samples, it can be advantageous to regress the model to the target density using the L2 norm, i.e., minimize the loss function

$$ \mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^{N} \big\| \ln \pi(\theta_i) - \ln \pi^*(\theta_i; w) \big\|^2. \quad (10) $$

This has been shown to be less noisy than minimizing the (sampled) KL-divergence in certain cases (Seljak & Yu 2019), and can be readily extended to exploit gradient information about the target density if it is available (see Seljak & Yu 2019 for details).
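A minimal training loop for the sampled-KL loss of eq. (9) in TensorFlow, assuming `flow` is the TransformedDistribution built above and `prior_samples` is a hypothetical (N, dim) array of samples from the target prior (e.g. MCMC samples from an interim posterior):

```python
import tensorflow as tf

theta = tf.convert_to_tensor(prior_samples, dtype=tf.float32)  # assumed (N, dim)
_ = flow.log_prob(theta[:1])          # build the network variables once
opt = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        # Monte Carlo estimate of the KL divergence, eq. (9),
        # up to a w-independent constant (the target's entropy)
        loss = -tf.reduce_mean(flow.log_prob(theta))
    grads = tape.gradient(loss, flow.trainable_variables)
    opt.apply_gradients(zip(grads, flow.trainable_variables))
    return loss

for step in range(5000):
    loss = train_step()
```

Since the base density of this flow is Gaussian rather than uniform, the final map from the unit hypercube composes the trained bijector with an inverse normal CDF (e.g. `tfb.Invert(tfb.NormalCDF())`), so that eq. (3) is satisfied.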
4 COSMOLOGICAL EXAMPLES

As a concrete and non-trivial cosmological example we choose the concordance ΛCDM model of the universe (Scott 2018) with an additional spatial curvature parameter Ω_K. Such models have been debated recently in the literature (Handley 2019a; Di Valentino et al. 2019; Efstathiou & Gratton 2020). We choose this model not for its controversy, but merely for the fact that it yields non-trivial banana-like priors and posteriors, and therefore proves a more challenging cosmological example for a bijector to learn than models without curvature.

For likelihoods we choose three datasets: (1) the Planck baseline TT,TE,EE+lowE likelihood (without lensing) (Planck Collaboration 2020a,b; hereafter CMB); (2) Baryon Acoustic Oscillation and Redshift Space Distortion measurements from the Baryon Oscillation Spectroscopic Survey (BOSS) DR12 (Alam et al. 2017; Beutler et al. 2011; Ross et al. 2015; hereafter BAO); and (3) local cosmic distance ladder measurements of the expansion rate, using type Ia SNe calibrated by variable Cepheid stars from the SH0ES collaboration (SH0ES Collaboration 2018; hereafter SH0ES).

[Figure 1: comparison of log Z estimates for log P(CMB, BAO), log P(BAO | CMB) + log P(CMB), and log P(CMB | BAO) + log P(BAO), plus the analogous SH0ES combinations.]
Figure 1. In the same three paths as Figure 2, the Bayesian evidences log Z for each of the three runs combine consistently to within error, demonstrating that the bijective prior methodology can be used to recover accurate evidences as well as posteriors. Plot produced under anesthetic (Handley 2019b).
The bijector code is implemented as an extension to CosmoChord, using forpy¹ to call tensorflow bijectors from FORTRAN. All the nested sampling runs, code and details for reproducing the results in this letter can be found on Zenodo (Handley & Alsing 2020).

¹ https://github.com/ylikx/forpy
[Figure 2: hexagonal arrangement of corner plots in the (Ω_K, H0) plane; panels show π → CMB, CMB → BAO, π → BAO, BAO → CMB, π → BAO+CMB and BAO+CMB.]
Figure 2. Reading around the upper edge of the hexagon: from the prior π, nested sampling uses the Planck CMB likelihood to generate a set of posterior samples. These are then used to train a new bijector to define a prior in the next step, for which a new nested sampling run using a BAO likelihood is performed. The lower edge is the same process but with CMB and BAO reversed, and through the middle one recovers the posterior if both CMB and BAO likelihoods are applied at the same time. The right-most plot shows that the posteriors recovered through these three paths are consistent to within sampling error. Plot produced under anesthetic (Handley 2019b).

Figure 2 shows our key results for CMB and BAO data. In this case we compare the three routes to a combined CMB and BAO constraint on a kΛCDM universe. The simplest approach is to run nested sampling with the default
CosmoMC prior with both likelihoods turned on: π → CMB+BAO. The bijector approach first runs the CMB dataset to update the prior, π → CMB; the posterior samples from this run are then used to train a CMB bijector, which provides the distribution used as the prior for the Bayesian update with the next dataset, CMB → BAO. In Figure 2 we show that doing this in either order recovers the same posterior.

One critical advantage of this approach is that, since much of the cost of a nested sampling run is in compressing from prior to posterior, whilst the first run π → CMB is expensive, the second update CMB → BAO requires considerably fewer likelihood calls. Furthermore, since the journey from prior to posterior is shorter, less Poisson error accumulates over nested sampling iterations, making for more accurate posteriors and evidence estimates. For expensive models like those involving cosmic curvature, this is a significant advantage.

It should be noted that as CMB and BAO are in mild tension, it has been argued that these combined constraints should be viewed with some scepticism (Handley 2019a; Vagnozzi et al. 2020), and this tension is much stronger between CMB and SH0ES data (Verde et al. 2019). Whilst this is a general cause for simultaneous concern and excitement scientifically speaking, in the context of this work it also serves to highlight the expressivity of these bijector models when combined with nested sampling chains. The tail behaviour of these bijective transformations is sufficiently powerful to still recover the correct answer even when combining likelihoods which are in strong tension. Figure 1 demonstrates that the evidence estimates extracted by this run from anesthetic (Handley 2019b) are also consistent to within the errors on nested sampling.
5 CONCLUSIONS

Nested sampling implementations have been burdened by a practical limitation that priors need to be specified as bijective transforms from the unit hyper-cube, and in many cases such a representation is not readily available. We have shown that nested sampling priors can be constructed by fitting flexible parametric bijectors to samples from the target prior, providing a general-purpose approach for constructing generic priors for nested sampling. This is of particular importance for use cases where an interim posterior from one experiment is to be used as the prior for another, where a parametric bijector can then simply be trained on MCMC samples from the interim posterior. We have demonstrated the use of parametric bijectors on some typical use-cases from cosmology.

In the longer term, we plan to release a library of cosmological bijectors for use as nested sampling priors, in order to save users computational resources and standardise cosmological prior choice.
ACKNOWLEDGEMENTS
Justin Alsing was supported by the research project grant “Fundamental Physics from Cosmological Surveys” funded by the Swedish Research Council (VR) under Dnr 2017-04212. Will Handley was supported by a Gonville & Caius Research Fellowship, STFC grant number ST/T001054/1 and a Royal Society University Research Fellowship.
REFERENCES
Alam S., et al., 2017, MNRAS, 470, 2617
Beutler F., et al., 2011, MNRAS, 416, 3017
Chen R. T., Rubanova Y., Bettencourt J., Duvenaud D. K., 2018, in Advances in Neural Information Processing Systems. pp 6571–6583
Di Valentino E., Melchiorri A., Silk J., 2019, Nature Astronomy, 4, 196–203
Efstathiou G., Gratton S., 2020, Monthly Notices of the Royal Astronomical Society: Letters, 496, L91–L95
Feroz F., Hobson M. P., Bridges M., 2009, Monthly Notices of the Royal Astronomical Society, 398, 1601–1614
Grathwohl W., Chen R. T., Bettencourt J., Sutskever I., Duvenaud D., 2018, arXiv preprint arXiv:1810.01367
Handley W., 2019a, Curvature tension: evidence for a closed universe (arXiv:1908.09139)
Handley W., 2019b, Journal of Open Source Software, 4, 1414
Handley W., Alsing J., 2020, Nested sampling with any prior you like (Supplementary inference products), doi:10.5281/zenodo.4247552, https://doi.org/10.5281/zenodo.4247552
Handley W. J., Hobson M. P., Lasenby A. N., 2015a, Monthly Notices of the Royal Astronomical Society: Letters, 450, L61–L65
Handley W. J., Hobson M. P., Lasenby A. N., 2015b, Monthly Notices of the Royal Astronomical Society, 453, 4385–4399
Jones M. C., Pewsey A., 2009, Biometrika, 96, 761
Kingma D. P., Salimans T., Jozefowicz R., Chen X., Sutskever I., Welling M., 2016, in Advances in Neural Information Processing Systems. pp 4743–4751
Moss A., 2020, MNRAS, 496, 328
Papamakarios G., Pavlakou T., Murray I., 2017, arXiv preprint arXiv:1705.07057
Papamakarios G., Nalisnick E., Rezende D. J., Mohamed S., Lakshminarayanan B., 2019, arXiv preprint arXiv:1912.02762
Planck Collaboration 2020a, A&A, 641, A5
Planck Collaboration 2020b, A&A, 641, A6
Ross A. J., Samushia L., Howlett C., Percival W. J., Burden A., Manera M., 2015, MNRAS, 449, 835
SH0ES Collaboration 2018, ApJ, 855, 136
Scott D., 2018, The Standard Model of Cosmology: A Skeptic's Guide (arXiv:1804.01318)
Seljak U., Yu B., 2019, arXiv preprint arXiv:1901.04454
Skilling J., 2006, Bayesian Anal., 1, 833
Vagnozzi S., Di Valentino E., Gariazzo S., Melchiorri A., Mena O., Silk J., 2020, arXiv e-prints, p. arXiv:2010.02230
Verde L., Treu T., Riess A. G., 2019, arXiv e-prints, p. arXiv:1907.10625
Williams M. J., Veitch J., Messenger C., 2021, arXiv e-prints, p. arXiv:2102.11056

This paper has been typeset from a TEX/LaTeX file prepared by the author.