Nested Sampling with Normalising Flows for Gravitational-Wave Inference
Michael J. Williams, John Veitch, and Chris Messenger
SUPA, School of Physics and Astronomy, University of Glasgow, Glasgow G12 8QQ, United Kingdom
(Dated: February 23, 2021)

We present a novel method for sampling iso-likelihood contours in nested sampling using a type of machine learning algorithm known as normalising flows and incorporate it into our sampler Nessai. Nessai is designed for problems where computing the likelihood is computationally expensive and therefore the cost of training a normalising flow is offset by the overall reduction in the number of likelihood evaluations. We validate our sampler on 128 simulated gravitational-wave signals from compact binary coalescence and show that it produces unbiased estimates of the system parameters. Subsequently, we compare our results to those obtained with dynesty and find good agreement between the computed log-evidences whilst requiring 2.32 times fewer likelihood evaluations. We also highlight how the likelihood evaluation can be parallelised in Nessai without any modifications to the algorithm. Finally, we outline the diagnostics included in Nessai and how these can be used to tune the sampler's settings.
I. INTRODUCTION
Gravitational-wave astronomy has contributed to our understanding of physics and astrophysics from the atomic scale up to the scale of the Universe [1–3]. This is set to continue as the detectors of the LIGO-Virgo-KAGRA (LVK) Collaboration continue to improve in sensitivity [4] and new detectors come online [5], increasing the potential of detecting previously unseen types of sources. Prior to the recent third observing run there were 11 confirmed detections of gravitational waves (GWs) from compact binary coalescence (CBC) [6], including the first multi-messenger detection [7]. The first half of the last year-long observing run has resulted in a further 39 candidate detections [8]. Each of the candidate events requires extensive analysis to determine whether it is astrophysical in origin and, consequently, to allow us to understand the astrophysics that produces such phenomena. The nature of these analyses and the ever increasing number of candidates necessitate more efficient analysis techniques.

Our understanding of the sources that produce the detected GWs hinges on the ability to infer the parameters that describe them. This inference is carried out in a Bayesian framework [9, 10] centred around Bayes' theorem, which allows prior knowledge to be updated with observations to obtain posterior distributions, or more explicitly:

$p(\theta | d, H) = \frac{p(d | \theta, H)\, p(\theta | H)}{p(d | H)}$, (1)

where θ are the parameters that describe the model H, d is the observed data, p(θ|d, H) is the posterior, p(d|θ, H) is the likelihood, p(θ|H) is the prior and p(d|H) is the evidence. Whilst computing the posterior is trivial in lower dimensions, the gravitational-wave signals from compact binary coalescence are described by a minimum of 15 parameters and the resulting parameter space can be multi-modal and highly correlated [11]. Typically, this requires applying stochastic sampling techniques such as Markov chain Monte Carlo (MCMC) [12] and nested sampling [13]. These techniques are computationally expensive and their cost is directly related to the cost of evaluating the likelihood and the amount of data being analysed. There have been various efforts to reduce this cost by means such as re-parameterisations [14], reduced order methods [15, 16], parallel sampling methods [17–19] and problem-specific sampling algorithms [11, 20–22].

Machine learning algorithms have successfully been applied to various aspects of gravitational-wave data analysis [23], including data-quality improvement, waveform modelling, searches and, most relevant to this work, parameter estimation [24–27]. These algorithms, once trained, greatly reduce the cost of computing posterior distributions. However, they are still in their infancy and there are various challenges that have yet to be solved; for example, they are constrained by the distribution of training data and cannot reliably be applied to data that is outside that distribution, e.g. with different detector noise curves or longer duration signals.

Normalising flows are a type of generative machine learning algorithm that have recently been applied to a wide variety of problems including generation, inference and representation learning [28, 29]. The fundamental idea behind normalising flows is to model a complex probability distribution as the transformation of a simple distribution. This transformation is constructed to be highly flexible but maintain an explicit mathematical description with a tractable Jacobian.
This makes them well suited to applications in the physical sciences, where the apparent opacity of other machine learning algorithms can hinder their widespread adoption.

In this paper we propose a modified nested sampling algorithm that incorporates a novel proposal method using normalising flows. During sampling, the normalising flow is trained such that the resulting distribution can be re-sampled according to the prior to produce new live points within a given iso-likelihood contour which are independent of the current worst point. This is akin to other nested sampling algorithms which sample from the constrained prior, thus avoiding the need to evolve points with a random walk. These algorithms typically require fewer evaluations of the likelihood compared to those that use random walks, which reduces the computational cost. However, the challenge when implementing these algorithms is determining the constrained prior, which often requires defining multiple bounding distributions. Using a normalising flow allows us to use a single bounding distribution, but comes at the cost of training during sampling. This additional cost is easily offset in problems where there is significant computational cost associated with computing the likelihood. As such, we choose to apply our algorithm to gravitational-wave inference.

This paper is structured as follows: in section II we outline the background theory for gravitational-wave inference, nested sampling and normalising flows. We then introduce our method in section III and discuss related work in section IV. In section V we present results for gravitational-wave parameter estimation, compare to a commonly used nested sampler, and discuss the implications of our results. Finally, in section VI we summarise our results and draw conclusions.

II. BACKGROUND

A. Gravitational-wave likelihood for compact binary coalescence
The strain data from gravitational-wave interferometers are modelled as a signal with additive noise in the time domain [11]. The standard definition of the likelihood for compact binary coalescence then assumes that the noise is Gaussian and stationary over short observation times and that the power spectral density $S_n$ is independent of the model parameters. This allows us to define the likelihood as an independent product in the frequency domain:

$p(d | \theta, H) \propto \exp\left(\sum_i \left[-\frac{2\,|\tilde{h}_i(\theta) - \tilde{d}_i|^2}{T\, S_n(f_i)}\right]\right)$, (2)

where $\tilde{h}_i(\theta)$ and $\tilde{d}_i$ are the waveform model and data in the frequency domain and T is the duration of the data. There are various different waveform models h(θ), which are characterised by sets of parameters. The exact number of parameters depends on the physics that the model describes; if both compact objects are assumed to be black holes, then typically 15 parameters are required [11]. Different parameterisations have proven to improve sampling efficiency and convergence, most notably reparameterising the component masses in terms of the chirp mass $\mathcal{M} = (m_1 m_2)^{3/5}(m_1 + m_2)^{-1/5}$ and the asymmetric mass ratio $q = m_2/m_1$ [11], and using the system frame when describing the orientation of the binary [14].
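To make eq. (2) concrete, the following is a minimal sketch of the log-likelihood, up to a θ-independent constant; the array names and the exact frequency-domain conventions are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

def log_likelihood(h_f, d_f, psd, duration):
    """Whittle-type log-likelihood of eq. (2), up to a constant.

    h_f, d_f : complex frequency-domain waveform and data arrays
    psd      : one-sided power spectral density S_n(f_i) at the same bins
    duration : observation time T in seconds
    """
    residual = np.abs(h_f - d_f) ** 2
    return -2.0 * np.sum(residual / psd) / duration
```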
B. Nested sampling

Nested sampling is a stochastic algorithm for Bayesian inference that was proposed by Skilling in [13] and computes the Bayesian evidence $Z = p(d|H) = \int p(d|\theta, H)\, p(\theta|H)\, \mathrm{d}\theta$ in eq. (1). In nested sampling the evidence integral is simplified by considering the total prior volume X contained within a given iso-likelihood contour $L^* = p(d|\theta, H)$:

$X(L^*) = \int_{p(d|\theta,H) > L^*} p(\theta | H)\, \mathrm{d}\theta$, (3)

where $\mathrm{d}X = p(\theta|H)\, \mathrm{d}\theta$. The evidence integral can then be re-written using the likelihood L by inverting eq. (3):

$Z = \int_0^1 L(X)\, \mathrm{d}X$. (4)

Since the integrand is positive and decreasing, the function is well behaved and the integral can therefore be approximated by considering an ordered sequence of M points of decreasing prior volume $X_i$, evaluating the likelihood at each point $L_i = L(X_i)$ and finally, for example, using the trapezoid rule:
$Z \approx \sum_{i=1}^{M} \tfrac{1}{2}\left(X_{i-1} - X_{i+1}\right) L_i$. (5)

The complete algorithm is detailed in [13] but, in short, it requires first sampling a set of points, known as live points, from the prior distribution. The point with the lowest likelihood L* is then removed and replaced by another sample drawn from within the likelihood contour defined by the original point and according to the prior distribution. This replacement process is then continued until a stopping criterion is met [13]. In gravitational-wave inference this is typically chosen to be when the fractional change in log-evidence at a given iteration, ∆ ln Z_i, caused by replacing another point is less than 0.1, where $\Delta \ln Z_i \equiv \ln(Z_i + \Delta Z_i) - \ln Z_i$ and $\Delta Z_i \approx L^* X_i$. Once the algorithm has terminated, the final evidence is computed including the final set of live points, and posterior samples can be produced from the set of discarded and final live points [13].

The challenge when implementing this algorithm is efficiently proposing the replacement samples within the current iso-likelihood contour, since these samples must be independently and identically distributed (i.i.d.) according to the prior. One commonly used method draws new points based on the location of existing points, which are then evolved using a random walk to ensure they are i.i.d. [11, 18]. The efficiency of this method depends on the length of the MCMC chain required, since each step in the chain requires evaluating the likelihood; for GW inference, proposing a single new sample often requires of order 10 likelihood evaluations. Other approaches avoid using a random walk by sampling from the constrained prior at a given iteration. This requires determining the distribution of the current live points and constructing one or more bounding distributions, for example, using multidimensional ellipsoids as described in [30]. In this case the challenge is efficiently constructing the bounding distributions and sampling them without over-constraining the prior and consequently under-sampling a region in parameter space. What is more, these two approaches are sometimes used in conjunction [31] to tackle particularly challenging problems, such as gravitational-wave inference [32], where the individual methods can be inefficient. In section III we propose a method for directly sampling from a constrained prior using normalising flows to learn the iso-likelihood contours.
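As an illustration of eqs. (3)–(5), a minimal sketch of the evidence sum is given below; it assumes the standard deterministic estimate $X_i = \exp(-i/K)$ for the prior volumes and ignores the statistical uncertainty on $X_i$ and the contribution of the final live points that a full implementation would include.

```python
import numpy as np
from scipy.special import logsumexp

def log_evidence(log_l, n_live):
    """Trapezoid-rule estimate of ln Z (eq. (5)) from the ordered
    log-likelihoods of the M discarded points, using X_i = exp(-i / n_live)."""
    m = len(log_l)
    x = np.exp(-np.arange(m + 2) / n_live)   # X_0 = 1, ..., X_{M+1}
    log_w = np.log(0.5 * (x[:-2] - x[2:]))   # (X_{i-1} - X_{i+1}) / 2
    return logsumexp(log_w + np.asarray(log_l))
```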
C. Normalising flows

Normalising flows [33] are a type of generative machine learning algorithm that map samples x in the physical space X to samples z in a latent space Z, such that the samples z are distributed according to a prior probability distribution $p_Z$, known as the latent prior. (We use the term physical space to distinguish it from the strain data; in the literature, X is often referred to as the data space.) If we assume this mapping to be bijective and denote it $f: X \to Z$, then using the change of variables formula we can define a probability distribution in the physical space, $p_X$:

$p_X(x) = p_Z(f(x)) \left|\det\left(\frac{\partial f(x)}{\partial x^T}\right)\right|$, (6)

where $\partial f(x)/\partial x^T$ is the Jacobian of f at x. Samples can then be drawn from $p_X$ by sampling the latent space $z \sim p_Z$ and applying the inverse mapping $f^{-1}$. This requires the bijective mapping f to have a tractable Jacobian, which for functions with multidimensional domains and codomains is often computationally expensive. These mappings are therefore carefully constructed to meet these criteria and are usually parameterised by a neural network, which allows for the mapping to be learned.

Normalising flows generally fall into two categories depending on how the mappings are defined: autoregressive flows [34, 35] and flows based on coupling transforms [36–38]. Each has distinct advantages and disadvantages; most notably, autoregressive flows are often more flexible but have a greater computational cost and are not always analytically invertible, whereas coupling-based flows tend to be less flexible but computationally cheaper to evaluate and analytically invertible. In this work we use the latter, since we aim to minimise computational cost and later rely on the invertibility of normalising flows.

Flows based on coupling transforms map an n-dimensional input vector x to an n-dimensional output vector y by splitting the input vector into two parts, $x = [x_{1:m}, x_{m+1:n}]$ where $m < n$, and then transforming one part conditioned on the other, unchanged part. The choice of splitting depends on the specific coupling transform, though typically an alternating pattern is used. For example, the coupling transform proposed in [36] is

$y_{1:m} = x_{1:m}$, (7a)
$y_{m+1:n} = x_{m+1:n} \odot \exp[s(x_{1:m})] + t(x_{1:m})$, (7b)

where $s(x_{1:m})$ and $t(x_{1:m})$ are the outputs of a neural network and ⊙ denotes the element-wise product. We show a schematic of eq. (7) in fig. 1. Since $y_{1:m} = x_{1:m}$, the Jacobian matrix of these transforms is lower triangular and the Jacobian determinant is simply the product of the diagonal entries. This transform can easily be inverted by using $y_{1:m}$ as the input to s and t:

$x_{1:m} = y_{1:m}$, (8a)
$x_{m+1:n} = [y_{m+1:n} - t(y_{1:m})] \odot \exp[-s(y_{1:m})]$. (8b)

As with eq. (7), the Jacobian determinant is trivial to compute. Coupling transforms can also be stacked by alternating which part of x is updated, $x_{1:m}$ or $x_{m+1:n}$, to produce more flexible transforms. It is also common practice to include permutations between coupling transforms; these permute the inputs to the transforms, making the mapping more expressive [36]. More recently this has been generalised to linear transforms in which the permutation is learnt during training [37]. In fig. 1 we show a schematic of how coupling transforms can be stacked with permutations. For a more detailed description of normalising flows, the different types and their applications, we point the reader to [28, 29].
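A minimal PyTorch sketch of the affine coupling transform in eqs. (7) and (8) is shown below; the network architecture and sizes are arbitrary choices for illustration and differ from the residual networks used in Nessai.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """A single affine coupling transform (eq. (7)): the first m inputs
    pass through unchanged and condition the scale s and translation t
    applied to the remaining n - m inputs."""

    def __init__(self, n, m, hidden=64):
        super().__init__()
        self.m = m
        self.net = nn.Sequential(
            nn.Linear(m, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (n - m)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.m], x[:, self.m:]
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t
        # The Jacobian is lower triangular, so log|det J| is the sum of s
        return torch.cat([x1, y2], dim=-1), s.sum(dim=-1)

    def inverse(self, y):
        y1, y2 = y[:, :self.m], y[:, self.m:]
        s, t = self.net(y1).chunk(2, dim=-1)
        x2 = (y2 - t) * torch.exp(-s)   # eq. (8)
        return torch.cat([y1, x2], dim=-1), -s.sum(dim=-1)
```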
III. METHOD

We present a novel method for sampling within a given iso-likelihood contour using a normalising flow. The normalising flow learns the distribution of a set of live points and is constructed such that the learnt distribution can be sampled from analytically. We introduce additional steps to ensure that the samples are distributed according to the sampling prior (we use sampling prior to denote the prior used for nested sampling and to distinguish it from the latent prior used in normalising flows) and bounded by the iso-likelihood contour. This eliminates the need to evolve new samples and allows us to efficiently draw new samples within a complex iso-likelihood contour.
FIG. 1. Diagram of a normalising flow f(x) composed of four coupling transforms which maps an n-dimensional input vector x to an n-dimensional latent vector z. Each transform splits x into two parts, $[x_{1:m}, x_{m+1:n}]$, and updates one part conditioned on the other. In the first and third transforms $x_{1:m}$ is used as the input to a neural network (NN), which produces the scale s and translation t vectors of length n − m. The element-wise product (⊙) is then computed between $x_{m+1:n}$ and exp(s), followed by the sum of the output and t. This is shown in the left transform. In the second and fourth transforms $x_{1:m}$ is updated conditioned on $x_{m+1:n}$, as shown in the right transform.

A more efficient proposal equates to few rejected points and, since the likelihood must be computed before a point can be rejected, this is also equivalent to a reduction in the number of likelihood evaluations. We first describe our method in the isolated case of a single set of live points at a given iteration and then present a nested sampling algorithm which incorporates it.

A. Sampling within an iso-likelihood contour
At any given point in the nested sampling algorithm there is a current set of live points which, by definition, is contained within the iso-likelihood contour defined by the worst point with likelihood L*. We describe the implementation in our algorithm in terms of four steps:

a. How to define an iso-likelihood contour using a normalising flow. If we treat the sampling space as the physical space X and the live points as the data x, we can train a normalising flow to approximate the distribution of the live points to within some error and use it to draw new samples in X. This requires sampling from the latent prior $p_Z$, which we choose to be an n-dimensional Gaussian, and then applying the inverse mapping learned by the flow, $f^{-1}$. However, since our choice of latent prior $p_Z$ has an infinite domain, the resulting distribution of samples $p_X$ will not be bounded by the iso-likelihood contour or distributed according to the sampling prior. We therefore examine the notion of an iso-likelihood contour in the context of the normalising flow with an n-dimensional Gaussian latent prior.

b. How to determine the contour given the current set of live points. Once trained, the normalising flow can be used to map the current worst point in the physical space, x*, to the latent space Z as z*. This point has a likelihood in the latent space, $L^*_Z$, given by $p_Z(z^*)$ and, since $p_Z$ is an n-dimensional Gaussian, points of equal likelihood lie on the (n − 1)-sphere of radius r* given by z*. If we assume a perfect mapping, then this iso-likelihood contour in the latent space can be mapped to an iso-likelihood contour in the physical space. We can therefore sample within the contour in the latent space and use the inverse mapping to produce samples within the contour in the data space.

c. How to sample within the contour. We use two approaches for drawing K new samples $z_i$ in the latent space given a radius r*; these produce normally and uniformly distributed samples respectively.
FIG. 2. Example of how a normalising flow trained on a set of live points can produce samples within the current iso-likelihood contour for a simple two-dimensional parameter space. Top: example of training samples in the physical space X and the learned mapping to the latent space Z, with the iso-likelihood contour for the current worst point shown in orange. Middle: samples drawn from a truncated Gaussian within the iso-likelihood contour in Z and mapped to X using the inverse mapping. Bottom: pool of accepted samples after applying rejection sampling until 1000 points are obtained, shown in both Z and X.
Both approaches start by drawing K samples on the (n − 1)-sphere: K n-dimensional vectors $y_i$ are drawn from an n-dimensional unit Gaussian and then normalised using the Euclidean norm. These samples can then be rescaled to obtain samples within the n-ball,

$z_i = \rho_i \frac{y_i}{||y_i||}$, (9)

where the choice of distribution for $\rho_i$ determines how the resulting samples $z_i$ are distributed in Z. For uniformly distributed samples $\rho_i = u_i^{1/n}$ where $u_i \sim U(0, (r^*)^n)$, and for normally distributed samples $\rho_i \sim \chi(n)$, where χ(n) is a chi-distribution with n degrees of freedom truncated at r*. The inverse mapping of the normalising flow can then be applied to $z_i$ to obtain samples in the physical space. We consider two approaches because sampling from a truncated Gaussian in a high-dimensional space can become inefficient for large values of r*.

d. How to ensure new samples are drawn according to the prior. The samples obtained in the previous step must be re-sampled such that they are distributed according to the sampling prior. We use rejection sampling and compute a weight $\alpha_i$ for each sample,

$\alpha_i = \frac{p(x_i)}{q(x_i)}$, (10)

where $x_i = f^{-1}(z_i)$, p(x) is the sampling prior and q(x) is the proposal probability, which is computed using the inverse of eq. (6):

$q(x) = q(f^{-1}(z)) = p_Z(z)\left|\det\left(\frac{\partial f^{-1}(z)}{\partial z^T}\right)\right|^{-1}$. (11)

The choice of latent prior $p_Z$ will depend on which method was used to draw the samples in the latent space. The weights in eq. (10) are then rescaled such that their maximum value is one. We then draw N samples $u_i \sim U[0, 1]$ and accept the samples for which $\alpha_i / u_i > 1$. In fig. 2 we show an example of this process for a simple two-dimensional case.
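The following sketch illustrates steps c and d: uniform sampling within the latent n-ball (written here in the equivalent form $\rho = r^* u^{1/n}$ with $u \sim U(0, 1)$) followed by the rejection sampling of eq. (10). The callables flow_inverse, log_prior and log_q are hypothetical placeholders for the flow's inverse mapping, the sampling prior and the proposal probability of eq. (11).

```python
import numpy as np

rng = np.random.default_rng()

def sample_in_ball(k, n, r):
    """Draw k points uniformly within the n-ball of radius r (eq. (9)):
    Gaussian directions normalised to the unit sphere, radii r * u**(1/n)."""
    y = rng.standard_normal((k, n))
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    rho = r * rng.uniform(size=k) ** (1.0 / n)
    return rho[:, None] * y

def rejection_sample(z, flow_inverse, log_prior, log_q):
    """Accept samples with probability alpha_i = p(x_i) / q(x_i) (eq. (10)),
    after rescaling so that the largest weight is one."""
    x = flow_inverse(z)
    log_alpha = log_prior(x) - log_q(x, z)
    log_alpha -= log_alpha.max()
    u = rng.uniform(size=len(log_alpha))
    return x[np.log(u) < log_alpha]
```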
B. Algorithm details
We identify three key stages of the sampling approach detailed in section III A that we then incorporate into the nested sampling algorithm:

• Training: a normalising flow is trained on the current set of K live points by minimising a Monte Carlo approximation of the KL divergence between the target distribution and $p_X(x)$ as defined in eq. (6). The explicit loss function is derived in appendix A.

• Population: once the normalising flow is trained, samples are drawn within the n-ball of radius r defined by the worst point. These samples are then mapped to the data space X, re-sampled according to the prior and finally stored in the pool of new samples.

• Proposal: once the pool of new samples has been populated, new live points are drawn at random from the pool and then removed, until the pool is empty or the normalising flow is retrained.
The standard nested sampling algorithm is modified to include these stages; we call the resulting algorithm Nessai. The start of the algorithm remains unchanged: K live points are drawn from the prior distribution and their log-likelihoods computed. We then start the iterative process of determining the worst live point, with log-likelihood L*, and drawing a replacement live point that lies within the iso-likelihood contour. In our modified algorithm we use standard rejection sampling from the prior for the first M points (typically 2K) or until it becomes inefficient. The normalising flow is then trained on the current K live points, allowing us to map the worst point to the latent space to obtain the worst latent point $z_w$ and the radius of the corresponding n-ball, r. This radius is then used for the population stage when drawing samples in the latent space. Once populated, a replacement point is drawn from the pool, its log-likelihood is computed and, if it is greater than L*, the point is accepted; if not, more points are drawn until one is accepted. The proposal stage is then repeated for subsequent worst live points until one of four criteria is met (a schematic sketch of the full loop is given at the end of this section):

• the proposal pool is depleted: the normalising flow is retrained using the current live points, the worst live point is used to compute a new radius and the population stage is repeated,

• the acceptance falls below a user-defined threshold: the current proposal pool is discarded and the normalising flow is retrained,

• the criterion for retraining the normalising flow is met: the normalising flow is retrained with the current K live points; by default this happens every K iterations,

• the nested sampling convergence criterion is met: the algorithm terminates.

Since this algorithm relies on the normalising flows' ability to approximate the distribution of live points at various stages throughout sampling, we include a series of reparameterisations of X to reduce the complexity of the data space and remove certain features. We denote this reparameterised space X′ and include the Jacobian for each reparameterisation in eq. (11). These reparameterisations are:

• Rescaling: we add the option to rescale the input data according to either the sampling priors or the current minimum and maximum values, such that all of the parameters in X′ are defined over the same domain. As a default we use [−1, 1].

• Boundary inversion: we observe that asymmetric distributions with high density regions near the prior bounds are often under-sampled. To mitigate this effect we add the option to mirror the live points around such bounds and train on the resulting symmetric distribution. Further details are provided in appendix B.

We also introduce additional settings which help with convergence and sampling efficiency; some of these are discussed in section V E and a comprehensive list can be found in the online documentation for our sampler [41].
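The following pseudocode-style sketch summarises how the three stages slot into the nested sampling loop. All names (flow.train, flow.latent_radius, flow.populate and the convergence test) are schematic placeholders rather than the Nessai API, and bookkeeping such as the initial prior-sampling phase, evidence accumulation and the acceptance-based retraining criterion is omitted.

```python
def nested_sampling_with_flow(live_points, flow, converged, train_interval):
    """Schematic loop for the modified nested sampling algorithm."""
    pool, since_trained = [], 0
    while not converged(live_points):
        worst = min(live_points, key=lambda p: p.log_likelihood)
        # Training and population stages
        if not pool or since_trained >= train_interval:
            flow.train(live_points)
            pool = flow.populate(flow.latent_radius(worst))
            since_trained = 0
        # Proposal stage: draw until a point beats the worst likelihood
        while True:
            candidate = pool.pop()
            if candidate.log_likelihood > worst.log_likelihood:
                break
            if not pool:
                pool = flow.populate(flow.latent_radius(worst))
        live_points.remove(worst)
        live_points.append(candidate)
        since_trained += 1
    return live_points
```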
C. Gravitational-wave reparameterisations
The gravitational-wave parameter space is typically 15-dimensional and contains various degeneracies between parameters, such as the masses, inclination and luminosity distance, which can make sampling inefficient. Previous work has shown that certain reparameterisations can improve sampling efficiency [11]. We use two of these: the chirp mass M and asymmetric mass ratio q replace the component masses, and we use the system-frame parameterisation in place of the radiation frame to describe the orientation of the binary [14].

More than half of the parameters to sample are angles, and we note that the periodicity of these angles is not encoded in the mapping learned by the normalising flow, since the latent space Z is continuous and unbounded. We therefore include a further reparameterisation specifically for the angular parameters $\theta_i$. We assume that each angle has a corresponding radial component $\rho_{\theta_i}$ and together they describe a position in a two-dimensional plane. We can therefore use standard transformations to express this position in Cartesian coordinates $(x_{\theta_i}, y_{\theta_i})$:

$x_{\theta_i} = \rho_{\theta_i}\cos\theta_i, \quad y_{\theta_i} = \rho_{\theta_i}\sin\theta_i$. (12)

If we choose the distribution of the radial components such that $\rho_{\theta_i} \in [0, \infty)$, then $x_{\theta_i}, y_{\theta_i} \in (-\infty, \infty)$. Since we are using a Gaussian latent prior $p_Z$, we sample $\rho_{\theta_i}$ from a chi-distribution with two degrees of freedom such that, if the angle is uniformly distributed on [0, 2π], the resulting distribution of $(x_{\theta_i}, y_{\theta_i})$ is Gaussian. We use this treatment for the phase, inclination, polarisation and all four spin angles; for the polarisation we rescale the angle to [0, 2π] before applying the transformation to Cartesian coordinates. This reparameterisation also naturally includes periodic boundary conditions for the angles with uniform priors.

The sky location is described by a further two angles, right ascension α and declination δ. For these angles we extend the previous treatment from two-dimensional to three-dimensional Cartesian coordinates (x, y, z) and draw the radial component ρ from a chi-distribution with three degrees of freedom. For the standard priors, p(α) ∼ U[0, 2π] and p(δ) ∝ cos δ, the resulting distribution of (x, y, z) is again Gaussian.

The spin magnitudes χ₁ and χ₂ also require a specific treatment. They are typically defined on [0, 0.99]
with uniform priors and, importantly, the posterior distributions are often broad and span the entire prior range. We considered applying the boundary inversion to both bounds, but in practice found this ineffective. We instead opt to map $\chi_i$ into a two-dimensional plane with positions described using Cartesian coordinates $x_{\chi_i}$ and $y_{\chi_i}$. We achieve this by first defining a rescaled magnitude $\hat{\chi}_i \in [0, 1]$,
which is obtained using the corresponding priors. Then we consider the angle defined by $\hat{\chi}_i \pi$ and, again, introduce a radial component $\rho_{\chi_i} \sim \chi(2)$. The corresponding Cartesian coordinates $(x_{\chi_i}, y_{\chi_i})$ are defined on (−∞, ∞) and [0, ∞) respectively. However, we know that the coupling transforms we have chosen to use are better suited to unbounded domains. To avoid the bounded domain, we introduce a random variable k which is drawn from a Rademacher distribution and include it in the Cartesian coordinate transform:

$x_{\chi_i} = \rho_{\chi_i}\cos(\hat{\chi}_i\pi), \quad y_{\chi_i} = k\,\rho_{\chi_i}\sin(\hat{\chi}_i\pi)$. (13)

As a result of including k, $(x_{\chi_i}, y_{\chi_i}) \in (-\infty, \infty)$ and the hard boundary at $y_{\chi_i} = 0$ has been avoided.

We choose to reparameterise the luminosity distance $d_L$ such that the prior for the resulting parameter $d_U$ is uniform. The exact reparameterisation therefore depends on the prior used for $d_L$. In this work we choose to use a prior on $d_L$ that is uniform in co-moving volume, so we first convert the luminosity distance to a co-moving distance $d_C$ and then the uniform parameter is simply $d_U = d_C^3$. Similar reparameterisations can be determined for other commonly used distance priors, such as a power law. Additionally, we allow boundary inversion as described in section III B, but limit it to only the upper bound, since in practice the luminosity distance posterior will not rail against the lower bound.

The remaining parameters are the chirp mass M and mass ratio q, for which we use the reparameterisations from X to X′ mentioned in section III B, allowing boundary inversion for q.
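A minimal sketch of the angular reparameterisation in eq. (12) is given below; a full implementation would also track the Jacobian of the transformation for eq. (11), which is omitted here.

```python
import numpy as np

rng = np.random.default_rng()

def angle_to_cartesian(theta):
    """Map angles in [0, 2*pi) to Cartesian coordinates (eq. (12)) using a
    chi(2)-distributed radial component, so that uniformly distributed
    angles map to a standard bivariate Gaussian."""
    rho = np.sqrt(rng.chisquare(2, size=np.shape(theta)))  # chi(2) samples
    return rho * np.cos(theta), rho * np.sin(theta)
```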
D. Implementation

We use the implementation of normalising flows in
PyTorch [42], available in nflows [43], which allows for a wide variety of normalising flows to be used. However, we choose to use coupling transforms [36] because of the tractable Jacobian and the ease of computing the inverse mapping. As suggested in [37, 44], we include invertible linear transforms that randomly permute the parameters before each coupling transform, allowing all of the parameters to interact with each other. We also include batch normalisation [45] after each coupling transform, as described in [36]. We use a residual neural network [46, 47] for computing the parameters of each transform. We train the normalising flows with the Adam optimiser [48]. In appendix D we detail the specific parameters used for the results presented in section V.

Our sampler,
Nessai (Nested Sampling with Artificial Intelligence) is available as an open source package [49] and documentation is also available online [41].
IV. RELATED WORK
Different frameworks and samplers have been developed for gravitational-wave inference. LALInference [11] implements nested sampling and MCMC with specific proposal methods for the gravitational-wave parameter space and has been used extensively for analyses of the first gravitational-wave detections [50] and GWTC-1 [6]. More recently, the Python package Bilby [22] has been developed to use off-the-shelf samplers, such as dynesty [31], and has been shown to achieve results comparable to LALInference on GWTC-1 [32].

Machine learning has previously been incorporated into stochastic sampling algorithms; in [51] the likelihood function is approximated with a neural network, and in [52] neural networks are used to generalise Hamiltonian Monte Carlo. More closely related to our work, normalising flows have been used to improve the efficiency of MCMC methods by reparameterising the sampling space [53], and a similar approach has also been extended to the MCMC sampling used within nested sampling in [54].

Recent work has shown that likelihood-free inference using conditional variational autoencoders [24, 25] and normalising flows [26, 27] can produce posterior distributions for compact binary coalescences of binary black holes. These approaches promise to drastically reduce the cost of producing posterior samples when compared to traditional stochastic sampling methods. However, they require large amounts of training data and they currently lack the flexibility to deal with, for example, different PSDs, high sampling frequencies and long duration signals.
V. RESULTS
We chose to evaluate the performance of our sampling algorithm, Nessai, with simulated gravitational-wave signals from CBC. The parameter space is multi-dimensional and various parameters are correlated, which can prove challenging when sampling. The likelihood, as defined in section II A, is also typically computationally costly to evaluate, making the problem well suited to our sampler; however, this depends on the waveform approximant used and the length of the observation. We first use probability-probability plots to check the consistency of our sampler and then compare our results to those obtained using dynesty. We then highlight how the likelihood computation can be parallelised in Nessai before finally discussing various diagnostics that can be used to identify problems during sampling and tune the sampler settings.

We use bilby_pipe and Bilby [22] to simulate 128 injections using IMRPhenomPv2 [55, 56], sampled at 2048 Hz with a 4-second observing time, in a three-detector network comprising AdLIGO Hanford, AdLIGO Livingston and AdVirgo at design sensitivity [4]. We choose uniform priors on the chirp mass
M ∼ U[25, 35] M_⊙ and the asymmetric mass ratio q, and a prior on the luminosity distance that is uniform in co-moving volume with a lower bound of 100 Mpc. The remaining priors are the defaults used in Bilby [32]; see appendix C for a complete list.

FIG. 3. Probability-probability (P-P) plots showing the confidence interval versus the fraction of events within that confidence interval for the posterior distributions obtained using Nessai for 128 simulated compact binary coalescence signals produced with Bilby and bilby_pipe: (a) with phase marginalisation (combined p-value 0.3394) and (b) with phase and distance marginalisation (combined p-value 0.6818). The 1-, 2- and 3-σ confidence intervals are indicated by the shaded regions; p-values are shown for each of the parameters and the combined p-value is also shown.

We analyse the injections with our sampling algorithm,
Nessai, outlined in section III B, including the specific reparameterisations for gravitational-wave analyses described in section III C. We choose to analyse each injection twice: once with just phase marginalisation and once with both phase and distance marginalisation. Further details of the exact settings used for
Nessai are provided in appendix D.
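For context, an analysis along these lines can be set up through Bilby's standard interface; the snippet below is a hypothetical, self-contained toy example (a one-parameter Gaussian rather than a CBC signal) and assumes that the installed Bilby version exposes Nessai as a sampler backend. The keyword arguments shown are illustrative, not the exact settings of appendix D.

```python
import bilby
import numpy as np

# Toy problem: recover the mean of Gaussian data with the Nessai backend.
data = np.random.normal(3.0, 1.0, size=100)

class GaussianLikelihood(bilby.Likelihood):
    def __init__(self, data):
        super().__init__(parameters={"mu": None})
        self.data = data

    def log_likelihood(self):
        mu = self.parameters["mu"]
        n = len(self.data)
        return -0.5 * np.sum((self.data - mu) ** 2) - 0.5 * n * np.log(2 * np.pi)

priors = {"mu": bilby.core.prior.Uniform(0, 5, "mu")}

result = bilby.run_sampler(
    likelihood=GaussianLikelihood(data),
    priors=priors,
    sampler="nessai",   # assumes the Nessai sampler plugin is available
    nlive=1000,
    outdir="outdir",
    label="toy",
)
```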
A. Result validation
Probability-probability (P-P) plots are a standard method of verifying the performance of sampling algorithms [57, 58]. They test whether the correct proportion of injected values is recovered at a given confidence interval for a specific prior distribution. These tests are particularly useful when using a Gaussian likelihood, such as eq. (2), since the fraction of events within a given confidence interval should be uniformly distributed and we can therefore compute p-values for each parameter and a combined p-value for all of the parameters. We produce P-P plots for both of our analyses using Bilby and present the results in fig. 3. For an idealised sampler, p% of the events should be recovered within the p% confidence interval; this would correspond to a diagonal line. In practice we expect to see deviations from the diagonal; as such, the 1-, 2- and 3-σ confidence intervals are also shown in fig. 3. These results show that Nessai consistently recovers the posteriors for the 128 injections, but also indicate that the luminosity distance is consistently harder to sample. The combined p-values of 0.3394 and 0.6818 for our analyses without and with distance marginalisation serve as further verification.

B. Comparison to dynesty
To further validate our results we compare them to those obtained with dynesty [31], another nested sampling algorithm commonly used in gravitational-wave inference [22, 32]. We use the configuration described in [32] but increase the number of live points to 2000 and run on a single thread to ensure as direct a comparison with
Nessai as possible. With these settings dynesty passes the P-P test (see appendix E), but we note that these settings are the minimum required to produce reliable results and in practice more conservative settings are often used. Additionally, several injections required a second analysis with a different sampling seed in order to reach convergence.

FIG. 4. Difference between the log-evidences, ∆ ln Z, obtained using dynesty and Nessai for all 128 injections, with distance marginalisation (dashed line) and without distance marginalisation (solid line).

The results obtained with dynesty allow us to verify the log-evidences returned by
Nessai, since these cannot be computed analytically, and provide a point of reference when considering the number of likelihood evaluations and the total computational time.

In fig. 4 we compare the log-evidences returned by dynesty and
Nessai. If
Nessai were consistently over- or under-estimating the log-evidence when compared to dynesty, this would indicate a potential problem during sampling, such as over- or under-constraining, which would lead to biased results. The results in fig. 4 show no such bias. However, since sampling is a stochastic process, there is an error associated with the computed log-evidence. The theoretical error can be approximated using the information H and the number of live points K as $\delta \ln Z \approx \sqrt{H/K}$. To quantify this error we repeat the analysis on a single injection with 50 different sampling seeds and compute an approximate error δ ln Z ≈ 0.
11; this is consistent with previous analyses which determined that there are additional sources of uncertainty [59].
Nessai is designed with the aim of improving the efficiency of drawing replacement live points, at the cost of repeatedly training a normalising flow and populating a pool of live points. An improvement in the efficiency translates to a reduction in the total number of likelihood evaluations, since the likelihood must be computed for each rejected point. We therefore compare the total number of likelihood evaluations required to reach convergence for each sampler in fig. 5, with and without distance marginalisation.
Nessai requires a median of 5.× and 7.× likelihood evaluations to converge without and with distance marginalisation respectively, and dynesty requires 10.× and 9.×. In contrast to dynesty, sampling with Nessai is more efficient without distance marginalisation; we attribute this to a combination of the reparameterisation used for the luminosity distance and the sampler settings we converged on for Nessai.
FIG. 5. Distribution of the total number of likelihood evaluations required to reach convergence and the total run time for dynesty (blue) and Nessai (orange) when applied to the 128 simulated signals from compact binary coalescence with the priors and sampler settings described in section V and appendix D. The results with distance marginalisation disabled are shown with solid lines and those with distance marginalisation enabled are shown with dashed lines.
This, however, does not directly translate into a decrease in the total run-time, since it does not factor in the cost of training and population. In fig. 5 we also show the total run-time for each sampler; comparing the median run-times, we observe that Nessai is 2.07 times faster than dynesty without distance marginalisation and 1.34 times faster with it. We also examine the proportion of the run-time spent on training and population and find that, on average, population accounts for approximately 39.6% of the total run-time and training accounts for a further 8.1%. This explains why, despite a 2.32 times reduction in likelihood evaluations, we do not observe a 2.32 times reduction in sampling time compared to dynesty. We also note that the cost of training and population does not depend on the cost of evaluating the likelihood; as such, this fraction of the total run-time will decrease as the cost of evaluating the likelihood increases.

For each injection we can also compare the posterior distributions produced by each sampler. These allow us to quickly identify discrepancies between samplers for specific injections or regions of the parameter space. However, these differences are not easily quantified and the correlations between more than two parameters are not clearly represented. We show an example of such a comparison in appendix F.
C. Parallelisation of the likelihood computation
Our sampler is designed such that candidate live points are drawn simultaneously in the population stage. This allows for simple parallelisation of the likelihood computation, since the pool of candidate live points can be distributed over a number of threads and the likelihood values computed and stored until needed for the proposal stage.
FIG. 6. Comparison of the total time (in hours) spent on each stage of the algorithm for an increasing number of threads for a single injection with a fixed noise seed. The time spent evaluating the likelihood decreases as the number of threads increases; the theoretical reduction is shown in black. The training and population stages remain approximately constant and as such act as a lower bound on the minimum run-time. The sum of the time spent on likelihood evaluation, training and population is approximately equal to the total time spent sampling, indicating minimal overhead.

In fig. 6 we compare the run-time and the time spent evaluating the likelihood for the same injection using an increasing number of threads for the likelihood computation. We use an additional thread for the main sampling process. The time spent evaluating the likelihood is inversely proportional to the number of threads allocated, although the overall run-time is not. With a single thread, likelihood evaluation accounts for 53.9% of the total run-time, and this decreases to 9.0% when using 16 threads. As mentioned previously, there is a cost associated with the population stage and a further, smaller cost associated with training; for this injection these are 36.4% and 9.3% respectively. These remain approximately constant when increasing the number of threads available and act as a lower limit on the theoretical minimum run-time, as shown in fig. 6.
The remaining <1% of the run-time is general overhead associated with running the sampler.
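A sketch of this pattern is shown below, assuming a picklable likelihood function; it uses a process pool for illustration, whereas the text refers to threads, and the function names are placeholders rather than the Nessai implementation.

```python
from multiprocessing import Pool

def evaluate_pool_likelihoods(candidates, log_likelihood, n_workers):
    """Evaluate the likelihoods of a batch of candidate live points in
    parallel during the population stage; the values are cached and
    consumed one at a time by the proposal stage."""
    with Pool(processes=n_workers) as pool:
        return pool.map(log_likelihood, candidates)
```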
D. Diagnostics
As mentioned previously, there are various challenges when implementing a sampling algorithm.
Nessai is designed to sample from within the constrained prior; in this case, care must be taken to ensure that the prior is not over-constrained, since this would lead to regions of parameter space being under-sampled, which in turn would bias the results. There are also specific problems that arise from the nature of the parameter space, such as multi-modality and correlations.
FIG. 7. Example of the distribution of insertion indices for two nested sampling runs with 2000 live points. Uniformly distributed indices indicate no under- or over-constraining and deviations from uniformity indicate the opposite. The result with an orange dashed line shows over-constraining and the result with a solid blue line shows a correctly converged run. The shaded region indicates the 2-σ errors on the expected distribution.

We use a series of diagnostics to identify possible problems during sampling. In section V E we also discuss how some of these diagnostics can be used to tune the sampler settings described in section III B and appendix D.

We use the cross-checks proposed in [60] as a heuristic for determining if the nested sampling algorithm has converged without over- or under-constraining the posterior distributions. These checks rely on order statistics and the assumption that new live points should be inserted uniformly among the existing live points, which allows a p-value to be computed using the Kolmogorov-Smirnov statistic [61], with the additional consideration that the underlying distribution is discrete [62] (a sketch of this check is given at the end of this section). In fig. 7 we show an example of the distribution of the indices of newly inserted live points and in fig. 8 we show the p-values computed every K iterations. The histogram shows the final distribution of insertion indices for all the nested samples; this may not highlight specific problematic regions of the parameter space, but if it is not uniform it is a clear indication that the sampler is consistently over- or under-constraining. If the distribution of p-values in fig. 8 is non-uniform, then this is another clear indication of problems during sampling.

The acceptance is another important statistic to monitor during sampling, since we aim to develop a more efficient sampler. There are two acceptances we can monitor in Nessai: the proposal acceptance and the population acceptance. The first has a direct effect on the number of likelihood evaluations whilst the second only affects the total run-time; both quantities are shown in fig. 8. This figure also highlights how periodically retraining the normalising flow leads to an increase in the proposal efficiency. It also shows how the population process is
typically inefficient, which explains why, on average, 32 percent of the total run-time is spent on the population stage.

FIG. 8. Example of the statistics that are tracked in our sampler as a function of sampling iteration: (a) minimum (blue solid) and maximum (red dashed) log-likelihood, (b) cumulative number of likelihood evaluations, (c) log-evidence ln Z (blue solid) and fractional change in evidence dZ (red dashed), (d) proposal (blue solid) and population (red dashed) acceptance and (e) p-value for the cross-checks of every K live points. The iterations at which the normalising flow is trained are indicated with vertical lines.

We also track the minimum and maximum log-likelihoods, the number of log-likelihood evaluations, the log-evidence and the fractional change in evidence. The combination of these statistics allows the user to quickly understand the current state of the sampler and identify potential issues such as plateaus in the likelihood space and regions which are inefficient to sample. The complete set of statistics is shown in fig. 8.
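A minimal sketch of the insertion-index cross-check is given below; it applies a naive one-sample KS test against the discrete uniform CDF and ignores the discreteness correction of [62], so it should be read as an illustration of the idea rather than the exact test used in Nessai.

```python
import numpy as np
from scipy import stats

def insertion_index_pvalue(indices, n_live):
    """p-value for the hypothesis that the insertion indices of new live
    points are uniform on {0, ..., n_live - 1}."""
    cdf = lambda x: np.clip((np.floor(x) + 1) / n_live, 0.0, 1.0)
    return stats.kstest(np.asarray(indices), cdf).pvalue
```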
E. Tuning Nessai
Nessai includes various settings; a comprehensive list and a description of each can be found in the documentation [41]. In practice we find that a small subset of the settings predominantly determines whether the algorithm converges without any bias. We use the validation method described in section V A and the diagnostics from section V D to understand how these settings affect convergence.

As expected, the number of live points K is an important setting, but it is even more crucial in Nessai since it limits the amount of training data available. We find that a minimum of 1000 live points is required and, for more complex problems such as gravitational-wave inference, at least 2000 live points should be used.

There is a large number of settings which relate to the complexity of the normalising flow. Whilst tuning the sampler we found that the number of coupling transforms greatly affected convergence. If too many transforms were used, the algorithm was prone to over-constraining the posterior distribution. We attribute this to the complexity of the iso-likelihood contour learnt by the flow: if the flow has too many trainable parameters, it can over-fit the distribution and exclude regions of the parameter space which should be sampled. At the other extreme, if the model is too simple, then the resulting contour can "smooth" over fine details and more samples are drawn outside of the initial likelihood constraint. These will not be accepted and the sampling process is therefore less efficient. We use a similar logic for the number of neurons and layers in the neural network that parameterises the flow, but we find that these parameters predominantly affect training time, with a lesser effect on overall convergence. Another parameter that is important to consider is the batch size: during sampling the normalising flow can be trained upwards of 100 times, hence a larger batch size is recommended since it can greatly reduce training time. We also recommend increasing the batch size when using reparameterisations that increase the amount of training data, such as the boundary inversion described in section III B and appendix B.

We note that the size of the pool of new samples affects the efficiency of the algorithm and the total run-time. If the pool-size is small, then the normalising flow is frequently retrained; in extreme cases where the proposal is inefficient, due to, for example, the complexity of the parameter space, the normalising flow can be retrained multiple times during a single iteration. Conversely, if the pool-size is large, then either the flow is forcibly retrained and a number of points are discarded or, if the flow is only retrained once the pool is empty, the rejection sampling becomes inefficient, since a large fraction of the potential new points will lie outside the likelihood bound. We instead opt to inversely scale the pool-size with the mean acceptance of the sampler since the last iteration at which the flow was trained. We recommend setting the base pool-size to the number of live points, only retraining the model when the pool is empty and setting the maximum pool-size to be ten times the base pool-size.

As mentioned previously, approximately 32% of the run-time is spent on populating the pool of new samples. This is directly attributable to the efficiency of the rejection sampling required to ensure samples are distributed according to the prior. In section III A we propose two methods for drawing samples within the contour in the latent space; these produce uniformly and normally distributed samples respectively.
In practice we find the two methods comparable in most cases, with the exception of when the latent radius lies in the tail of the chi-distribution that corresponds to the latent prior $p_Z$. In this case, using the uniform distribution results in lower population and proposal acceptances, which leads to longer run-times.

VI. CONCLUSIONS
We have proposed a novel method for sampling within a given iso-likelihood contour according to the prior that can be incorporated into the standard nested sampling algorithm. Our method employs normalising flows to learn the density of the current set of live points which, once trained, allows us to produce points within the contour by sampling from a simple distribution and using rejection sampling. The use of normalising flows allows us to avoid using multiple bounding distributions and, since new samples are independent of the previous samples, we eliminate the need to use a random walk. We implement this proposal method in our sampler,
Nessai, and conduct a series of tests to verify that it recovers the correct Bayesian posteriors, and then compare our results to those obtained with another sampler to determine if our design does in fact result in a more efficient sampler.

We apply our sampler to 128 four-second duration simulated signals from the coalescence of binary black hole systems sampled at 2048 Hz, and we run two separate analyses, one with distance marginalisation and another without. The resulting P-P plots (fig. 3) show that our sampler more reliably recovers the posterior distributions with distance marginalisation than without; however, both pass the P-P test. This indicates that our proposal method does not introduce any inherent biases. We use dynesty for the comparison, which has been shown to produce results consistent with those used in previous LVK analyses [32]. We find that our sampler returns evidences consistent with dynesty, which serves as further verification of our results. Since we aim to produce a more efficient sampler, we also compare the likelihood evaluations required to reach convergence. When not using distance marginalisation we find that
Nessai requires 5.× likelihood evaluations, 2.32 times fewer than dynesty. When distance marginalisation is enabled, Nessai requires 7.×, which, whilst still 1.40 times fewer than dynesty, is more than with the marginalisation disabled. As such, we recommend using Nessai without distance marginalisation for gravitational-wave inference.

However, this reduction in likelihood evaluations does not translate to a direct reduction in the total computation time because of the cost associated with training the normalising flow and populating the pool of new samples. We find that the fraction of the time spent on each process changes when using distance marginalisation. Without the marginalisation, on average, 8.1% of the total computation time is spent on training and a further 39.6% on population. When using distance marginalisation this changes to 4.9% spent on training and 42.2% on population. We attribute the difference in population time to the efficiency of the rejection sampling, which is improved when including the reparameterisation for distance discussed in section III C. We find that without distance marginalisation the median run time for
Nessai is 2.07 times faster than dynesty. However, when distance marginalisation is enabled we observe that, on average,
Nessai is only 1.34 times faster than dynesty. This further reinforces our recommendation to use
Nessai with distance marginalisation disabled.

We also show how our sampler can make use of parallelised likelihood functions by evaluating the likelihood of new live points during the population stage. We repeat the previous analysis for a single injection without distance marginalisation and parallelise the likelihood computation with an increasing number of threads, up to 16. We observe that the reduction in time spent evaluating the likelihood does not quite match the theoretical values, indicating that there is a small overhead associated with it. This also highlights how the limiting factor is the time spent training the normalising flow and populating the pool of new live points.

To aid in diagnosing potential biases during sampling, we include a series of diagnostics in our sampler which allow us to easily identify under- and over-constraining. These diagnostics also help to tune the sampler settings and highlight how periodically re-training the normalising flow prevents the proposal from becoming inefficient during sampling.

We find that our algorithm is susceptible to under-sampling regions of the parameter space which are close to the prior bounds. We consequently introduce the previously described reparameterisations to mitigate this and a series of diagnostics to aid in diagnosing biases and correctly tuning the settings. We aim to address this in further work with changes to the design of the normalising flows we have used.

It is natural to compare this work to [24–27], which use variational autoencoders and normalising flows to produce posterior distributions. Our approach differs from these in that it requires no prior computation, since training occurs during sampling, and we do not introduce any assumptions about the data other than those necessary to apply a nested sampling algorithm.
Nessai is therefore a drop-in replacement for existing sampling algorithms that does not require changes to existing pipelines.

In future work we aim to evaluate our sampler using more expensive waveform models, including those for longer duration signals, such as those from binary neutron star or neutron star-black hole systems, and models which include higher-order modes. We will also investigate the suitability of other types of normalising flow transforms, such as the spline-based transforms from [63] and flows which allow for specifying a manifold [64].

In summary, we have proposed a novel variation of the standard nested sampling algorithm that incorporates normalising flows, specifically designed for inference with computationally expensive likelihood functions. We have applied our sampler to the problem of gravitational-wave inference and shown that it consistently recovers the Bayesian posterior distributions and evidences with 2.32 times fewer total likelihood evaluations than dynesty, another commonly used sampler, which translates to a 2.07 times reduction in computation time. Our sampler therefore serves as a more efficient drop-in replacement for existing samplers.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the Science and Technology Facilities Council of the United Kingdom. MJW is supported by the Science and Technology Facilities Council [2285031]. JV and CM are supported by the Science and Technology Research Council [ST/L000946/1]. CM is also supported by the European Cooperation in Science and Technology (COST) action [CA17137]. The authors are grateful for computational resources provided by Cardiff University, funded by an STFC grant supporting UK involvement in the operation of Advanced LIGO.
Software: Nessai was initially developed using cpnest [18] with permission from the authors and still shares a similar interface and some core code. Nessai is implemented in Python and uses NumPy [65], SciPy [66], pandas [67, 68], nflows [43], PyTorch [42], matplotlib [69] and seaborn [70]. Gravitational-wave injections were generated using Bilby and bilby_pipe [22]. Figures were prepared using matplotlib [69], seaborn [70], Bilby [22] and corner [71].
Appendix A: Loss function
A normalising flow applies a mapping f : X → Z conditioned on its parameters θ, which are typically the trainable parameters of a neural network. In this context the goal of training a normalising flow is to approximate a target distribution p*_X(x). The KL divergence between the target distribution and the distribution of the flow p_X(x|θ) can be written as [29]:

\begin{align}
\mathcal{L}(\theta) &= D_{\mathrm{KL}}\left[ p^{*}_{X}(x) \,\|\, p_{X}(x|\theta) \right] \nonumber \\
&= -\mathbb{E}_{p^{*}_{X}(x)}\left[ \ln p_{X}(x|\theta) \right] + \mathrm{const.} \nonumber \\
&= -\mathbb{E}_{p^{*}_{X}(x)}\left[ \ln p_{Z}(f(x|\theta)) + \ln \left| \det \frac{\partial f(x|\theta)}{\partial x^{T}} \right| \right] + \mathrm{const.}
\tag{A1}
\end{align}

Then, assuming the set of K samples used for training is drawn from p*_X(x), the expectation value can be approximated as:

\begin{equation}
\mathcal{L}(\theta) \approx -\frac{1}{K} \sum_{k=1}^{K} \left[ \ln p_{Z}(f(x_{k}|\theta)) + \ln \left| \det \frac{\partial f(x_{k}|\theta)}{\partial x^{T}} \right| \right].
\tag{A2}
\end{equation}
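In practice this loss can be implemented directly with the libraries listed in the software acknowledgements. The following is a minimal sketch of minimising Eq. (A2) with PyTorch [42] and nflows [43]; the flow architecture, dimensionality and training data are illustrative placeholders rather than the configuration used by Nessai.

```python
import torch
from nflows.flows.base import Flow
from nflows.distributions.normal import StandardNormal
from nflows.transforms.base import CompositeTransform
from nflows.transforms.autoregressive import MaskedAffineAutoregressiveTransform

n_dim = 4  # illustrative dimensionality

# f(x|theta): a stack of simple invertible transforms mapping X -> Z
transform = CompositeTransform([
    MaskedAffineAutoregressiveTransform(features=n_dim, hidden_features=32)
    for _ in range(4)
])
flow = Flow(transform, StandardNormal(shape=[n_dim]))

# Stand-in for the K training samples drawn from the target p*_X(x)
x = torch.randn(1000, n_dim)

optimiser = torch.optim.Adam(flow.parameters(), lr=1e-3)
for _ in range(50):
    optimiser.zero_grad()
    # flow.log_prob combines ln p_Z(f(x|theta)) and the log-Jacobian term,
    # so its negative mean is the Monte Carlo estimate in Eq. (A2)
    loss = -flow.log_prob(x).mean()
    loss.backward()
    optimiser.step()
```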
Appendix B: Boundary inversion

In section III B we describe boundary inversion, which we introduce to avoid under-sampling regions which are close to the prior bounds. The user defines which parameters the inversion can be applied to and, before training, the sampler determines whether it should be applied to each parameter using the following steps:

1. Compute the density of samples over the specified range and find the maximum value.
2. Compute the fraction of the density that lies within the initial and final m% of the bounds, i.e. [0, 0.1] and [0.9, 1.0] if the parameter is defined on [0, 1].
3. Apply the inversion if this fraction is greater than n% of the maximum density and the density at the bound is non-zero.

Based on our testing, the percentages m and n default to 10% and 50% respectively, but they can be changed by the user. We consider two methods for applying the inversion:

• duplication: duplicate the set of points and apply the inversion to the duplicates,
• splitting: randomly select half of the points to apply the inversion to.

We find that duplication generally provides more consistent results but at the cost of increased training time. As such, we recommend using splitting when inversion is applied to more than two parameters.
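The decision procedure above can be summarised in a few lines of NumPy. This is an illustrative sketch under the stated assumptions, not Nessai's exact implementation; the function name, bin count and histogram-based density estimate are our own choices.

```python
import numpy as np

def should_invert(samples, lower_bound, upper_bound, m=0.1, n=0.5):
    """Illustrative check for whether to apply boundary inversion."""
    # 1. Estimate the density of samples over the prior range and
    #    find its maximum value.
    density, edges = np.histogram(
        samples, bins=50, range=(lower_bound, upper_bound), density=True
    )
    max_density = density.max()
    width = upper_bound - lower_bound
    centres = 0.5 * (edges[:-1] + edges[1:])
    # 2. Density within the initial and final m-fraction of the bounds,
    #    e.g. [0, 0.1] and [0.9, 1.0] for a parameter defined on [0, 1].
    near_lower = density[centres < lower_bound + m * width]
    near_upper = density[centres > upper_bound - m * width]
    # 3. Invert if the density near either bound exceeds n% of the
    #    maximum and the density at the bound itself is non-zero.
    invert_lower = near_lower.max() > n * max_density and density[0] > 0
    invert_upper = near_upper.max() > n * max_density and density[-1] > 0
    return invert_lower or invert_upper
```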
Appendix C: Gravitational-wave priors

TABLE I. Prior distributions used for each parameter for gravitational-wave parameter estimation. Their corresponding labels and the lower and upper bounds are included where applicable.

Parameter | Label | Prior | Bounds
Chirp mass | M | Uniform | [25, 100] M⊙
Asymmetric mass ratio | q | Uniform | [0.125, 1.0]
Luminosity distance | d_L | Uniform in co-moving volume | [100, 5000] Mpc
Right ascension | α | Uniform | [0, 2π]
Declination | δ | Cosine | -
Reference time at geocentre | t_c | Uniform | [-0.1, 0.1] s
Inclination | θ_JN | Sine | -
Polarisation | ψ | Uniform | [0, π]
Phase | φ_c | Uniform | [0, 2π]
Dimensionless spin magnitudes | χ_i | Uniform | [0, 0.99]
Spin tilt angles | θ_i | Sine | -
Difference between the azimuthal angles of each spin vector relative to the orbital angular momentum | φ_12 | Uniform | [0, 2π]
Difference between the azimuthal angles of the total and orbital angular momentum | φ_JL | Uniform | [0, 2π]
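In practice, prior sets such as Table I are specified programmatically. The following is a hedged sketch of how these priors could be written with bilby's prior classes [22]; the parameter names follow bilby's conventions and the bounds are those listed above, but this is an illustration rather than the exact configuration used in this work.

```python
import numpy as np
from bilby.core.prior import Cosine, PriorDict, Sine, Uniform
from bilby.gw.prior import UniformComovingVolume

priors = PriorDict()
priors["chirp_mass"] = Uniform(25, 100, name="chirp_mass")
priors["mass_ratio"] = Uniform(0.125, 1.0, name="mass_ratio")
priors["luminosity_distance"] = UniformComovingVolume(
    100, 5000, name="luminosity_distance"
)
priors["ra"] = Uniform(0, 2 * np.pi, name="ra")
priors["dec"] = Cosine(name="dec")
# Reference time defined relative to the trigger time
priors["geocent_time"] = Uniform(-0.1, 0.1, name="geocent_time")
priors["theta_jn"] = Sine(name="theta_jn")
priors["psi"] = Uniform(0, np.pi, name="psi")
priors["phase"] = Uniform(0, 2 * np.pi, name="phase")
for i in (1, 2):
    priors[f"a_{i}"] = Uniform(0, 0.99, name=f"a_{i}")
    priors[f"tilt_{i}"] = Sine(name=f"tilt_{i}")
priors["phi_12"] = Uniform(0, 2 * np.pi, name="phi_12")
priors["phi_jl"] = Uniform(0, 2 * np.pi, name="phi_jl")
```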
Appendix D: Nessai sampling settings

TABLE II. Settings used for Nessai for gravitational-wave inference. These are split into three categories: general settings, which control aspects of the sampler such as the choice of latent prior or pool-size; flow hyper-parameters, which determine the configuration of the normalising flow; and flow training settings, which control the training process. For a complete description of each see the documentation [41].

Nessai settings
General settings | Flow hyper-parameters | Flow training settings
Training frequency: None | Coupling transformations: - | Optimiser: Adam
Cooldown: - | Linear transformation: LU | Learning rate: -
Base pool-size: - | Network type: ResNet | Batch size: -
Update pool-size: True | Layers per network: - | Max. epochs: -
Drawsize: - | Neurons per layer: - | Patience: -
Train on empty: True | Activation: ReLU |
Weights reset: False | Batch-Normalisation: Intra-transforms |
Latent prior: Truncated Gaussian | |
Rescale: True | |
Update bounds: True | |
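For context, settings such as these are passed to the sampler when it is constructed. Below is a minimal, hypothetical example of running Nessai on a toy likelihood, following the interface described in the documentation [41]; the model, bounds and keyword values are illustrative, and settings from Table II would be supplied as additional keyword arguments.

```python
import numpy as np
from nessai.flowsampler import FlowSampler
from nessai.model import Model

class GaussianModel(Model):
    """Toy two-parameter model with a unit-Gaussian likelihood."""

    def __init__(self):
        self.names = ["x", "y"]
        self.bounds = {"x": [-10.0, 10.0], "y": [-10.0, 10.0]}

    def log_prior(self, x):
        # Uniform prior over the bounds defined above
        log_p = np.log(self.in_bounds(x), dtype=float)
        for name in self.names:
            log_p -= np.log(self.bounds[name][1] - self.bounds[name][0])
        return log_p

    def log_likelihood(self, x):
        return -0.5 * (x["x"] ** 2 + x["y"] ** 2)

# Illustrative set-up: 2000 live points, matching the analyses in this work
sampler = FlowSampler(GaussianModel(), output="outdir", nlive=2000)
sampler.run()
```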
Appendix E: P-P tests for dynesty

FIG. 9. Probability-probability (P-P) plot showing the confidence interval versus the fraction of the events within that confidence interval for the posterior distributions obtained using dynesty for 128 simulated compact binary coalescence signals produced with Bilby and bilby_pipe. The 1-, 2- and 3-σ confidence intervals are indicated by the shaded regions; p-values are shown for each of the parameters and the combined p-value is also shown. Panel (a), with phase marginalisation: N=128, combined p-value 0.3401; per-parameter p-values χ_1 (0.380), χ_2 (0.453), M (0.850), δ (0.478), t_c (0.628), d_L (0.078), q (0.889), φ_12 (0.759), φ_JL (0.918), ψ (0.635), α (0.019), θ_JN (0.504), θ_1 (0.108), θ_2 (0.175). Panel (b), with phase and distance marginalisation: N=128, combined p-value 0.6549; per-parameter p-values χ_1 (0.419), χ_2 (0.432), M (0.628), δ (0.664), t_c (0.451), q (0.894), φ_12 (0.621), φ_JL (0.806), ψ (0.680), α (0.053), θ_JN (0.460), θ_1 (0.134), θ_2 (0.362). We use the settings described in [32] with the exception of the number of live points, which we increase to 2000.
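For reference, the per-parameter p-values in Fig. 9 come from comparing the credible level at which each injected value is recovered against the uniform distribution expected for unbiased posteriors [57], using a Kolmogorov-Smirnov test [61]. Below is a minimal sketch of this check; the function name and implementation details are our own, not bilby's exact code.

```python
import numpy as np
from scipy import stats

def pp_test(posteriors, injected_values):
    """Illustrative P-P test for a single parameter.

    posteriors: list of 1-D arrays of posterior samples, one per injection.
    injected_values: the corresponding true (injected) values.
    """
    # Credible level at which each injected value is found; for an
    # unbiased sampler these should be uniformly distributed on [0, 1].
    credible_levels = np.array([
        np.mean(samples < truth)
        for samples, truth in zip(posteriors, injected_values)
    ])
    # KS test of the credible levels against Uniform(0, 1)
    return stats.kstest(credible_levels, "uniform").pvalue
```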
Appendix F: Example corner plot

FIG. 10. Corner plot comparing the posterior distributions produced with dynesty (blue) and our sampler Nessai (red). The phase is marginalised and the remaining 14 parameters (q, χ_1, χ_2, θ_1, θ_2, Δφ, φ_JL, d_L, δ, α, θ_JN, ψ, M and t_c) are shown; see appendix C for details on the parameters. The respective 16% and 84% quantiles are also shown in the 1-dimensional marginalised posteriors.

[1] B. Abbott et al. (LIGO Scientific, Virgo, Fermi GBM, INTEGRAL, IceCube, AstroSat Cadmium Zinc Telluride Imager Team, IPN, Insight-Hxmt, ANTARES, Swift, AGILE Team, 1M2H Team, Dark Energy Camera GW-EM, DES, DLT40, GRAWITA, Fermi-LAT, ATCA, ASKAP, Las Cumbres Observatory Group, OzGrav, DWF (Deeper Wider Faster Program), AST3, CAASTRO, VINROUGE, MASTER, J-GEM, GROWTH, JAGWAR, CaltechNRAO, TTU-NRAO, NuSTAR, Pan-STARRS, MAXI Team, TZAC Consortium, KU, Nordic Optical Telescope, ePESSTO, GROND, Texas Tech University, SALT Group, TOROS, BOOTES, MWA, CALET, IKI-GW Follow-up, H.E.S.S., LOFAR, LWA, HAWC, Pierre Auger, ALMA, Euro VLBI Team, Pi of Sky, Chandra Team at McGill University, DFN, ATLAS Telescopes, High Time Resolution Universe Survey, RIMAS, RATIR, SKA South Africa/MeerKAT), Multi-messenger Observations of a Binary Neutron Star Merger, Astrophys. J. Lett., L12 (2017), arXiv:1710.05833 [astro-ph.HE].
[2] B. Abbott et al. (LIGO Scientific, Virgo), GW170817: Measurements of neutron star radii and equation of state, Phys. Rev. Lett., 161101 (2018), arXiv:1805.11581 [gr-qc].
[3] B. Abbott et al. (LIGO Scientific, Virgo), A gravitational-wave measurement of the Hubble constant following the second observing run of Advanced LIGO and Virgo, arXiv e-prints (2019), arXiv:1908.06060 [astro-ph.CO].
[4] B. Abbott et al. (KAGRA, LIGO Scientific, VIRGO), Prospects for Observing and Localizing Gravitational-Wave Transients with Advanced LIGO, Advanced Virgo and KAGRA, Living Reviews in Relativity, 10.1007/s41114-018-0012-9 (2020), arXiv:1304.0670 [gr-qc].
[5] B. Iyer, T. Souradeep, C. Unnikrishnan, S. Dhurandhar, S. Raja, and A. Sengupta, LIGO-India, Proposal of the Consortium for Indian Initiative in Gravitational-wave Observations (IndIGO), Tech. Rep. (LIGO Scientific Collaboration, 2011).
[6] B. Abbott et al. (LIGO Scientific, Virgo), GWTC-1: A Gravitational-Wave Transient Catalog of Compact Binary Mergers Observed by LIGO and Virgo during the First and Second Observing Runs, Phys. Rev. X, 031040 (2019), arXiv:1811.12907 [astro-ph.HE].
[7] B. Abbott et al. (LIGO Scientific, Virgo), GW170817: Observation of Gravitational Waves from a Binary Neutron Star Inspiral, Phys. Rev. Lett., 161101 (2017), arXiv:1710.05832 [gr-qc].
[8] R. Abbott et al. (LIGO Scientific, Virgo), GWTC-2: Compact Binary Coalescences Observed by LIGO and Virgo During the First Half of the Third Observing Run, arXiv e-prints, arXiv:2010.14527 (2020), arXiv:2010.14527 [gr-qc].
[9] J. Aasi et al. (LIGO Scientific, VIRGO), Parameter estimation for compact binary coalescence signals with the first generation gravitational-wave detector network, Phys. Rev. D, 062001 (2013), arXiv:1304.1775 [gr-qc].
[10] B. Abbott et al. (LIGO Scientific, Virgo), Properties of the Binary Black Hole Merger GW150914, Phys. Rev. Lett., 241102 (2016), arXiv:1602.03840 [gr-qc].
[11] J. Veitch, V. Raymond, B. Farr, W. Farr, P. Graff, S. Vitale, B. Aylott, K. Blackburn, N. Christensen, M. Coughlin, et al., Parameter estimation for compact binaries with ground-based gravitational-wave observations using the LALInference software library, Phys. Rev. D, 042003 (2015).
[12] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Handbook of Markov chain Monte Carlo (CRC Press, 2011).
[13] J. Skilling, Nested sampling for general Bayesian computation, Bayesian Analysis, 833 (2006).
[14] B. Farr, E. Ochsner, W. M. Farr, and R. O'Shaughnessy, A more effective coordinate system for parameter estimation of precessing compact binaries from gravitational waves, Phys. Rev. D, 024018 (2014).
[15] M. Pürrer, Frequency domain reduced order models for gravitational waves from aligned-spin compact binaries, Class. Quant. Grav., 195010 (2014), arXiv:1402.4146 [gr-qc].
[16] R. Smith, S. E. Field, K. Blackburn, C.-J. Haster, M. Pürrer, V. Raymond, and P. Schmidt, Fast and accurate inference on gravitational waves from precessing compact binaries, Phys. Rev. D, 044031 (2016).
[17] W. J. Handley, M. P. Hobson, and A. N. Lasenby, polychord: next-generation nested sampling, Monthly Notices of the Royal Astronomical Society, 4384 (2015).
[18] J. Veitch, W. D. Pozzo, M. Williams, C. Talbot, M. Pitkin, G. Ashton, Cody, M. Hübner, A. Nitz, D. Macleod, G. Carullo, G. Davies, and Tony, johnveitch/cpnest: Version 0.11 (2021).
[19] R. J. E. Smith, G. Ashton, A. Vajpeyi, and C. Talbot, Massively parallel Bayesian inference for transient gravitational-wave astronomy, Monthly Notices of the Royal Astronomical Society, 4492 (2020).
[20] J. Lange, R. O'Shaughnessy, and M. Rizzo, Rapid and accurate parameter inference for coalescing, precessing compact binaries, arXiv e-prints, arXiv:1805.10457 (2018), arXiv:1805.10457 [gr-qc].
[21] C. M. Biwer, C. D. Capano, S. De, M. Cabero, D. A. Brown, A. H. Nitz, and V. Raymond, PyCBC Inference: A Python-based Parameter Estimation Toolkit for Compact Binary Coalescence Signals, PASP, 024503 (2019), arXiv:1807.10312 [astro-ph.IM].
[22] G. Ashton, M. Hübner, P. D. Lasky, C. Talbot, K. Ackley, S. Biscoveanu, Q. Chu, A. Divakarla, P. J. Easter, B. Goncharov, F. Hernandez Vivanco, J. Harms, M. E. Lower, G. D. Meadors, D. Melchor, E. Payne, M. D. Pitkin, J. Powell, N. Sarin, R. J. E. Smith, and E. Thrane, BILBY: A User-friendly Bayesian Inference Library for Gravitational-wave Astronomy, ApJS, 27 (2019), arXiv:1811.02042 [astro-ph.IM].
[23] E. Cuoco, J. Powell, M. Cavaglià, K. Ackley, M. Bejger, C. Chatterjee, M. Coughlin, S. Coughlin, P. Easter, R. Essick, H. Gabbard, T. Gebhard, S. Ghosh, L. Haegel, A. Iess, D. Keitel, Z. Marka, S. Marka, F. Morawski, T. Nguyen, R. Ormiston, M. Puerrer, M. Razzano, K. Staats, G. Vajente, and D. Williams, Enhancing Gravitational-Wave Science with Machine Learning, arXiv e-prints, arXiv:2005.03745 (2020), arXiv:2005.03745 [astro-ph.HE].
[24] H. Gabbard, C. Messenger, I. S. Heng, F. Tonolini, and R. Murray-Smith, Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy, arXiv e-prints, arXiv:1909.06296 (2019), arXiv:1909.06296 [astro-ph.IM].
[25] A. J. K. Chua and M. Vallisneri, Learning Bayesian Posteriors with Neural Networks for Gravitational-Wave Inference, Phys. Rev. Lett., 041102 (2020), arXiv:1909.05966 [gr-qc].
[26] S. R. Green, C. Simpson, and J. Gair, Gravitational-wave parameter estimation with autoregressive neural network flows, arXiv e-prints, arXiv:2002.07656 (2020), arXiv:2002.07656 [astro-ph.IM].
[27] S. R. Green and J. Gair, Complete parameter inference for GW150914 using deep learning, arXiv e-prints, arXiv:2008.03312 (2020), arXiv:2008.03312 [astro-ph.IM].
[28] I. Kobyzev, S. J. D. Prince, and M. A. Brubaker, Normalizing Flows: An Introduction and Review of Current Methods, arXiv e-prints, arXiv:1908.09257 (2019), arXiv:1908.09257 [stat.ML].
[29] G. Papamakarios, E. Nalisnick, D. Jimenez Rezende, S. Mohamed, and B. Lakshminarayanan, Normalizing Flows for Probabilistic Modeling and Inference, arXiv e-prints, arXiv:1912.02762 (2019), arXiv:1912.02762 [stat.ML].
[30] F. Feroz, M. P. Hobson, and M. Bridges, MULTINEST: an efficient and robust Bayesian inference tool for cosmology and particle physics, MNRAS, 1601 (2009), arXiv:0809.3437 [astro-ph].
[31] J. S. Speagle, DYNESTY: a dynamic nested sampling package for estimating Bayesian posteriors and evidences, MNRAS, 3132 (2020), arXiv:1904.02180 [astro-ph.IM].
[32] I. M. Romero-Shaw, C. Talbot, S. Biscoveanu, V. D'Emilio, G. Ashton, C. P. L. Berry, S. Coughlin, S. Galaudage, C. Hoy, M. Huebner, K. S. Phukon, M. Pitkin, M. Rizzo, N. Sarin, R. Smith, S. Stevenson, A. Vajpeyi, M. Arene, K. Athar, S. Banagiri, N. Bose, M. Carney, K. Chatziioannou, R. Cotesta, B. Edelman, C. Garcia-Quiros, A. Ghosh, R. Green, C. J. Haster, A. X. Kim, F. Hernandez-Vivanco, I. Magana Hernandez, C. Karathanasis, P. D. Lasky, N. De Lillo, M. E. Lower, D. Macleod, M. Mateu-Lucena, A. Miller, M. Millhouse, S. Morisaki, S. H. Oh, S. Ossokine, E. Payne, J. Powell, M. Puerrer, A. Ramos-Buades, V. Raymond, E. Thrane, J. Veitch, D. Williams, M. J. Williams, and L. Xiao, Bayesian inference for compact binary coalescences with BILBY: Validation and application to the first LIGO-Virgo gravitational-wave transient catalogue, arXiv e-prints, arXiv:2006.00714 (2020), arXiv:2006.00714 [astro-ph.IM].
[33] D. Jimenez Rezende and S. Mohamed, Variational Inference with Normalizing Flows, arXiv e-prints, arXiv:1505.05770 (2015), arXiv:1505.05770 [stat.ML].
[34] G. Papamakarios, T. Pavlakou, and I. Murray, Masked Autoregressive Flow for Density Estimation, arXiv e-prints, arXiv:1705.07057 (2017), arXiv:1705.07057 [stat.ML].
[35] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville, Neural Autoregressive Flows, arXiv e-prints, arXiv:1804.00779 (2018), arXiv:1804.00779 [cs.LG].
[36] L. Dinh, J. Sohl-Dickstein, and S. Bengio, Density estimation using Real NVP, arXiv e-prints, arXiv:1605.08803 (2016), arXiv:1605.08803 [cs.LG].
[37] D. P. Kingma and P. Dhariwal, Glow: Generative Flow with Invertible 1x1 Convolutions, arXiv e-prints, arXiv:1807.03039 (2018), arXiv:1807.03039 [stat.ML].
[38] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, Neural Spline Flows, arXiv e-prints, arXiv:1906.04032 (2019), arXiv:1906.04032 [stat.ML].
[39] M. E. Muller, A note on a method for generating points uniformly on n-dimensional spheres, Commun. ACM, 19 (1959).
[40] G. Marsaglia, Choosing a point from the surface of a sphere, Ann. Math. Statist., 645 (1972).
[41] M. J. Williams, Nessai documentation, https://nessai.readthedocs.io/ (2021).
[42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019) pp. 8024-8035.
[43] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, nflows: normalizing flows in PyTorch (2020).
[44] L. Dinh, D. Krueger, and Y. Bengio, NICE: Non-linear Independent Components Estimation, arXiv e-prints, arXiv:1410.8516 (2014), arXiv:1410.8516 [cs.LG].
[45] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv e-prints, arXiv:1502.03167 (2015), arXiv:1502.03167 [cs.LG].
[46] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, arXiv e-prints, arXiv:1512.03385 (2015), arXiv:1512.03385 [cs.CV].
[47] K. He, X. Zhang, S. Ren, and J. Sun, Identity Mappings in Deep Residual Networks, arXiv e-prints, arXiv:1603.05027 (2016), arXiv:1603.05027 [cs.CV].
[48] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[49] M. J. Williams, Nessai: Nested sampling with artificial intelligence (2021).
[50] B. Abbott et al. (LIGO Scientific, Virgo), Observation of Gravitational Waves from a Binary Black Hole Merger, Phys. Rev. Lett., 061102 (2016), arXiv:1602.03837 [gr-qc].
[51] P. Graff and F. Feroz, BAMBI: Blind Accelerated Multimodal Bayesian Inference (2013), ascl:1312.008.
[52] D. Levy, M. D. Hoffman, and J. Sohl-Dickstein, Generalizing Hamiltonian Monte Carlo with Neural Networks, arXiv e-prints, arXiv:1711.09268 (2017), arXiv:1711.09268 [stat.ML].
[53] M. Hoffman, P. Sountsov, J. V. Dillon, I. Langmore, D. Tran, and S. Vasudevan, NeuTra-lizing Bad Geometry in Hamiltonian Monte Carlo Using Neural Transport, arXiv e-prints, arXiv:1903.03704 (2019), arXiv:1903.03704 [stat.CO].
[54] A. Moss, Accelerated Bayesian inference using deep learning, arXiv e-prints, arXiv:1903.10860 (2019), arXiv:1903.10860 [astro-ph.CO].
[55] P. Schmidt, M. Hannam, and S. Husa, Towards models of gravitational waveforms from generic binaries: A simple approximate mapping between precessing and non-precessing inspiral signals, Phys. Rev. D, 104063 (2012).
[56] S. Khan, K. Chatziioannou, M. Hannam, and F. Ohme, Phenomenological model for the gravitational-wave signal from precessing binary black holes with two-spin effects, Phys. Rev. D, 024059 (2019), arXiv:1809.10113 [gr-qc].
[57] S. R. Cook, A. Gelman, and D. B. Rubin, Validation of software for Bayesian models using posterior quantiles, Journal of Computational and Graphical Statistics, 675 (2006).
[58] S. Talts, M. Betancourt, D. Simpson, A. Vehtari, and A. Gelman, Validating Bayesian Inference Algorithms with Simulation-Based Calibration, arXiv e-prints, arXiv:1804.06788 (2018), arXiv:1804.06788 [stat.ME].
[59] J. Veitch and A. Vecchio, Bayesian coherent analysis of in-spiral gravitational wave signals with a detector network, Phys. Rev. D, 062003 (2010), arXiv:0911.3820 [astro-ph.CO].
[60] A. Fowlie, W. Handley, and L. Su, Nested sampling cross-checks using order statistics, arXiv e-prints, arXiv:2006.03371 (2020), arXiv:2006.03371 [stat.CO].
[61] N. Smirnov, Table for estimating the goodness of fit of empirical distributions, The Annals of Mathematical Statistics, 279 (1948).
[62] T. B. Arnold and J. W. Emerson, Nonparametric goodness-of-fit tests for discrete null distributions, R Journal (2011).
[63] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios, Neural Spline Flows, arXiv e-prints, arXiv:1906.04032 (2019), arXiv:1906.04032 [stat.ML].
[64] J. Brehmer and K. Cranmer, Flows for simultaneous manifold learning and density estimation, arXiv e-prints, arXiv:2003.13913 (2020), arXiv:2003.13913 [stat.ML].
[65] S. van der Walt, S. C. Colbert, and G. Varoquaux, The NumPy array: A structure for efficient numerical computation, Computing in Science & Engineering, 22 (2011).
[66] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods, 261 (2020).
[67] The pandas development team, pandas-dev/pandas: Pandas (2020).
[68] W. McKinney, Data Structures for Statistical Computing in Python, in Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman (2010) pp. 56-61.
[69] J. D. Hunter, Matplotlib: A 2D graphics environment, Computing in Science & Engineering, 90 (2007).
[70] M. Waskom and the seaborn development team, mwaskom/seaborn (2020).
[71] D. Foreman-Mackey, corner.py: Scatterplot matrices in Python, The Journal of Open Source Software 1 (2016).