Complete parameter inference for GW150914 using deep learning
Stephen R. Green ∗ and Jonathan Gair † Max Planck Institute for Gravitational Physics (Albert Einstein Institute), Am Mühlenberg 1, 14476 Potsdam, Germany
The LIGO and Virgo gravitational-wave observatories have detected many exciting events over the past five years. As the rate of detections grows with detector sensitivity, this poses a growing computational challenge for data analysis. With this in mind, in this work we apply deep learning techniques to perform fast likelihood-free Bayesian inference for gravitational waves. We train a neural-network conditional density estimator to model posterior probability distributions over the full 15-dimensional space of binary black hole system parameters, given detector strain data from multiple detectors. We use the method of normalizing flows (specifically, a neural spline normalizing flow), which allows for rapid sampling and density estimation. Training the network is likelihood-free, requiring samples from the data generative process, but no likelihood evaluations. Through training, the network learns a global set of posteriors: it can generate thousands of independent posterior samples per second for any strain data consistent with the prior and detector noise characteristics used for training. By training with the detector noise power spectral density estimated at the time of GW150914, and conditioning on the event strain data, we use the neural network to generate accurate posterior samples consistent with analyses using conventional sampling techniques.
Introduction.—
Since the first detection in September 2015 [1], the LIGO/Virgo Collaboration has published observations of gravitational waves from more than a dozen compact binary coalescences [2–5], primarily binary black hole mergers, but also two binary neutron star mergers. In addition, the LIGO/Virgo Collaboration has publicly released around fifty additional triggers [6] of events of interest, the details of which have so far not been published. These observations have had a transformative impact on our understanding of compact objects in the Universe, facilitated by inferring the parameters of the system using accurate physical models of the emitted gravitational waves.

This inference is extremely computationally expensive, as posterior distributions of the parameters are usually obtained using more than one waveform model to probe any potential systematic effects, and using the physically most complete (and thus computationally most costly) waveforms available. LIGO/Virgo currently employ Markov Chain Monte Carlo and nested-sampling algorithms to obtain posterior samples [7, 8]. Run times for single posterior calculations typically take days for binary black hole systems and weeks for binary neutron stars [2, 9]. These long run times will become increasingly problematic as the sensitivity of the instruments improves and event rates reach one per day or higher [10].

There is an urgent need for new approaches that can generate scientific inferences much more rapidly than the existing pipelines [7, 8]. Deep-learning methods are one promising approach to increasing the speed of gravitational-wave inference by several orders of magnitude, and they have received increasing attention in recent years [11]. These techniques attempt to train a neural-network conditional density estimator q(θ|s) to approximate the Bayesian posterior distribution p(θ|s) of parameter values θ given detector strain data s.
Neural networks typically have millions of parameters, which are optimized stochastically during training to minimize an appropriate loss function. With a "likelihood-free" training algorithm, it is never necessary to draw samples from the posterior or evaluate a likelihood; rather, the procedure is generative and just requires an ability to simulate data sets. Consequently, training can be done in a time comparable to that taken to obtain posterior samples using a standard method. There have been several previous works on this topic, but these either simplified the description of the posterior, e.g., by using a Gaussian approximation [12], or simplified the input, e.g., using a reduced space of parameters and a single detector [13].

In a previous paper [14], we used a type of neural network known as a conditional variational autoencoder (CVAE) [15, 16] combined with normalizing flows [17–20] to learn the posterior distribution for inference of all parameters of an aligned-spin quasi-circular merger observed with a single gravitational-wave detector. With a single detector we could not recover the full set of waveform parameters, and all data sets analyzed were artificially generated with Advanced LIGO design-sensitivity noise [21].

In this Letter, we extend that work into a tool that can, for the first time, be used to analyze real data from the LIGO/Virgo interferometers. We describe a neural-network architecture, based on normalizing flows alone, that is able to generate posteriors on the full D = 15 dimensional parameter space of quasi-circular binary inspirals, using input data from multiple gravitational-wave detectors. We apply this network to analyze observed interferometer data surrounding the first observed gravitational-wave event, GW150914, and show that we can successfully recover posterior distributions consistent with conventional methods.
This is the first demonstration that these methods can be used in a realistic setting to produce fast and accurate scientific inference on real data. This Letter thus establishes a new benchmark in fast-and-accurate gravitational-wave inference, as well as describing methods that could also be applied to other inference problems in experimental physics.

Neural network model.—
Our aim is to train a neural conditional density estimator q(θ|s) to approximate the gravitational-wave posterior p(θ|s). To this end, q(θ|s) must have sufficient flexibility to capture the detailed shape of the true posterior over parameters θ, as well as the dependence on the complicated strain data s. We use the method of normalizing flows.

A normalizing flow f is an invertible mapping on a sample space with simple Jacobian determinant [17]. For a conditional distribution, the flow must depend on s, so we denote it f_s. The idea is to train the flow so that it maps a simple "base" distribution π(u) into the far more complex q(θ|s). We define the conditional distribution in terms of the flow by

q(θ|s) = π(f_s^{-1}(θ)) |det J_{f_s}^{-1}|,   (1)

which is based on the change-of-variables rule for probability distributions. The base distribution π(u) should be chosen such that it can be easily sampled and its density evaluated; we will always take it to be a standard multivariate normal of the same dimension D as the sample space.

By the properties of a normalizing flow, q(θ|s) inherits the nice properties of the base distribution. Indeed, to draw a sample, one first samples u ∼ π(u), and then sets θ = f_s(u); it follows that θ ∼ q(θ|s). To evaluate the conditional density, one uses (1); the right-hand side may be evaluated using the defining properties that f_s is invertible and has simple Jacobian determinant.

Normalizing flows are under active development in computer science, and are usually represented by neural networks. Neural networks are very flexible trainable function approximators, so they can give rise to complex conditional densities. Our previous work [14] used a Masked Autoregressive Flow [19] with affine transformations. In the present work, we use a much more powerful flow called a neural spline flow [22].
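To make Eq. (1) concrete, here is a minimal numerical sketch of a conditional flow, with a toy scale-and-shift map standing in for the learned neural spline flow f_s. The conditioning function `params` and all numbers are illustrative assumptions, not part of the paper's implementation; the point is only the sampling rule θ = f_s(u) and the change-of-variables density.

```python
import numpy as np

# Toy invertible map in place of the learned flow: f_s scales and shifts u
# by amounts that depend on the conditioning data s (illustrative only).
D = 2
rng = np.random.default_rng(0)

def params(s):
    # hypothetical "conditioner": s sets a positive scale and a shift
    scale = 1.0 + np.abs(s[:D])
    shift = s[D:2 * D]
    return scale, shift

def f(u, s):                      # theta = f_s(u)
    scale, shift = params(s)
    return scale * u + shift

def f_inv(theta, s):              # u = f_s^{-1}(theta)
    scale, shift = params(s)
    return (theta - shift) / scale

def log_pi(u):
    # standard multivariate normal base density pi(u)
    return -0.5 * np.sum(u**2, axis=-1) - 0.5 * D * np.log(2 * np.pi)

def log_q(theta, s):
    # Eq. (1): log q(theta|s) = log pi(f_s^{-1}(theta)) + log|det J_{f_s}^{-1}|
    scale, _ = params(s)
    return log_pi(f_inv(theta, s)) - np.sum(np.log(scale))

# sampling: u ~ pi, then theta = f_s(u) gives theta ~ q(.|s)
s = rng.standard_normal(2 * D)        # toy stand-in for strain data
u = rng.standard_normal((1000, D))
theta = f(u, s)
assert np.allclose(f_inv(theta, s), u)   # invertibility check
```

For this diagonal map the Jacobian determinant is just the product of the scales, which is what makes the density evaluation cheap; the neural spline flow of the paper keeps this property while being far more expressive.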
We use directly the original neural spline flow implementation [26], which we illustrate in figure 1. We now give a brief summary.

The flow is a composition of "coupling transforms" c_s(u), each of which transforms elementwise half of the parameters (say, u_{d+1:D}) conditional upon the other half (u_{1:d}) as well as the strain data s [27], i.e.,

c_{s,i}(u) = u_i                   if i ≤ d,
c_{s,i}(u) = c_i(u_i; u_{1:d}, s)  if i > d.   (2)

If c_i is invertible and differentiable with respect to u_i, then it follows immediately that the coupling transform is a normalizing flow. By composing n_flows of these transforms, and permuting the indices of u in between, a much more flexible flow can be obtained.

FIG. 1. Overall structure of the normalizing flow [22] from the base space (u) to the parameter space (θ), with optimal hyperparameter choices indicated (n_flows = 15, n_SVD = 100, n_hidden = 512, n_blocks = 10, K = 9). Red connections are invertible. The residual network is made up of n_blocks residual blocks, each with two fully-connected hidden layers of n_hidden units. Prior to each linear transformation [23], we inserted batch normalization layers to speed training [24] and Exponential Linear Units for nonlinearity [25]. Each block is also conditioned on the strain data s.

The neural spline coupling transform [22] takes each c_i to be a monotonically increasing piecewise function, defined by a set of knots {(u_i^{(k)}, c_i^{(k)})}_{k=0}^{K} and positive-valued derivatives {δ_i^{(k)}}_{k=0}^{K}, between which rational-quadratic (RQ) functions are interpolated. The knots and derivatives are output from a residual neural network [28], which takes as input u_{1:d} and s; details of the network are given in figure 1. The RQ spline is differentiable and has an analytic inverse, so it satisfies the properties of a coupling transform.

Training.—
The conditional density estimator q(θ|s) must be trained to approximate as closely as possible the gravitational-wave posterior p(θ|s). We do this by tuning the neural network parameters to minimize a loss function, the expected value (over s) of the cross-entropy between the true and model distributions,

L = −∫ ds p(s) ∫ dθ p(θ|s) log q(θ|s)
  = −∫ dθ p(θ) ∫ ds p(s|θ) log q(θ|s).   (3)

On the second line we used Bayes' theorem to express L in a form that involves an integral over the likelihood rather than the posterior; this is a key simplification which means posterior samples are not needed for training. We evaluate the integral (3) on a minibatch of training data with a Monte Carlo approximation,

L ≈ −(1/N) Σ_{i=1}^{N} log q(θ^{(i)} | s^{(i)}),   (4)

where θ^{(i)} ∼ p(θ), s^{(i)} ∼ p(s|θ^{(i)}), and N is the number of samples in the minibatch. We then use backpropagation (the chain rule) to compute the gradient with respect to network parameters, and minimize L stochastically on minibatches using the Adam optimizer [29].

To obtain a training pair (θ^{(i)}, s^{(i)}), we draw θ^{(i)} from the prior, we generate a waveform h(θ^{(i)}), and then we add a noise realization to obtain s^{(i)}. Waveform generation is too costly to perform in real time during training, so we adopt a hybrid approach: we sample "intrinsic" parameters in advance and save the associated waveform polarizations h^{(i)}_{+,×}; at train time we sample "extrinsic" parameters, project onto detectors, and add noise. We used 10^6 sets of intrinsic parameters, which was sufficient to avoid overfitting.

Prior.—
We perform inference over the full D = 15 dimensional set of precessing quasi-circular binary black hole parameters: detector-frame masses m_i (i = 1, 2), phase of coalescence φ_c, time of coalescence t_{c,geocent}, luminosity distance d_L, spin magnitudes a_i, spin angles (θ_i, φ_12, φ_JL) [30], inclination angle θ_JN, polarization angle ψ, and sky position (α, δ). Of these, we consider (m_i, φ_c, a_i, θ_i, φ_12, φ_JL, θ_JN) to be intrinsic, so they are sampled in advance of training. To analyze GW150914, we take a uniform prior over

10 M⊙ ≤ m_i ≤ 80 M⊙,   (5a)
100 Mpc ≤ d_L ≤ 1000 Mpc,   (5b)
0 ≤ a_i ≤ 0.88,   (5c)
−0.1 s ≤ t_{c,geocent} ≤ 0.1 s.   (5d)

The prior is standard over the remaining quantities. We take t_{c,geocent} = 0 to be the trigger time, and we require m_1 ≥ m_2.

Although a prior uniform in the comoving source frame [8] would be most physical, we adopted a uniform prior over d_L and an upper bound of 1000 Mpc to more uniformly cover the parameter space and improve training. We applied the physical prior in postprocessing by reweighting samples. Also to improve training, we rescaled all parameters to have zero mean and unit variance.

Strain data.—
For likelihood-free training, we require simulated strain data sets s^{(i)} that arise from the data generative process, s^{(i)} ∼ p(s|θ^{(i)}). We assume stationary Gaussian noise, so the gravitational-wave likelihood function is known explicitly, but the likelihood-free approach applies even when this is not the case—e.g., in the presence of non-Gaussian noise—as long as it is possible to simulate data.

We generate training waveforms using the IMRPhenomPv2 frequency-domain precessing model [31–33]. We take a frequency range of [20, 1024] Hz, and a waveform duration of 8 s. We then whiten h^{(i)}_{+,×} using the noise PSD estimated from 1024 s of detector data prior to the event. Following [12], we compress the whitened waveforms to a reduced-order representation; we use a singular value decomposition (SVD), and keep the first n_SVD = 100 components.

At train time, we sample extrinsic parameters and generate detector signals. This requires a trivial rescaling to apply d_L, and linear combinations of h^{(i)}_{+,×} to project onto the antenna patterns for the two LIGO detectors. To apply time delays in the reduced-basis (RB) representation, we follow the approach of [34, 35] of pre-preparing a grid of time-translation matrix operators that act on vectors of RB coefficients, using cubic interpolation for intermediate times. Since the transformation to RB space is a rotation, we add white noise directly to the RB coefficients of the whitened waveforms to obtain s^{(i)}. Finally, data is standardized to have zero mean and unit variance in each component.

Results.—
We trained for 500 epochs with a batch size of 512. The initial learning rate was 0.0002, and we used cosine annealing [36] to reduce the learning rate to zero over the course of training. We performed a search over network hyperparameters, and have listed those with best performance (as measured by final validation loss) in figure 1. During training, we reserved 10% of our training set for validation, and found no evidence of overfitting. Training was performed on an NVIDIA Quadro P4000 GPU. We then conditioned the network on the GW150914 strain data and drew samples from q(θ|s), at a rate of 5,000 per second. We benchmarked these against samples produced by bilby [8, 37] with the dynesty sampler [38].

Our main result is presented in figure 2, which compares the neural network and bilby posteriors. Both distributions are clearly in very close agreement. There are minor differences in the inclination angle θ_JN, where the neural network gives more support to the secondary mode, and the sky position, where the neural network also develops a small secondary mode. With more training or a larger network, we expect even better convergence.

FIG. 2. Marginalized one- and two-dimensional posterior distributions over a subset of parameters, comparing the normalizing flow (orange) and bilby dynesty (blue). Contours represent 50% and 90% credible regions. Neural network posteriors are constructed from 5 × 10^4 samples. The inset shows the sky position, with rejection sampling used to obtain unweighted neural network samples.

Although our demonstration has focused on GW150914, the neural network has been trained to generate any posterior consistent with the prior and the given noise PSD. To demonstrate this, we performed inference on 100 injections of waveforms drawn from the prior with added noise. Performance is summarized in the P–P plot shown in figure 3. This shows the cumulative distribution function (CDF) of the percentile scores of each injection parameter within the one-dimensional marginalized posteriors. The percentiles should be distributed uniformly between 0 and 1; since the CDF lies close to the diagonal, we conclude that the network is properly sampling the posteriors. This is confirmed by a Kolmogorov–Smirnov (KS) test.

FIG. 3. P–P plot for 100 artificial strain data sets analyzed by the neural network. For each injection and one-dimensional posterior distribution, we compute the percentile value of the injected parameter. The figure shows the CDF of the injections for each parameter, which should lie close to the diagonal if the network is performing properly. KS test p-values are given in the legend.

In our experiments, we also varied n_SVD, and we found slightly reduced performance as this was increased.
This indicates that, although with less compression it should be possible to produce tighter posteriors, better network optimization is required to take full advantage. Indeed, subleading SVD elements contain mostly noise, which makes training more difficult.
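The reduced-basis compression and its trade-off can be illustrated with a toy SVD construction. The waveform family (damped sinusoids), sizes, and tolerance below are assumptions for illustration, not the whitened IMRPhenomPv2 waveforms used in the paper.

```python
import numpy as np

# Toy reduced-order compression: project "waveforms" onto the first
# n_svd right singular vectors of the training waveform matrix.
rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 512)
waveforms = np.array([
    np.exp(-2.0 * t)
    * np.sin(2 * np.pi * rng.uniform(5, 15) * t + rng.uniform(0, 2 * np.pi))
    for _ in range(200)
])  # shape (200, 512): a smooth two-parameter family

n_svd = 100
U, S, Vh = np.linalg.svd(waveforms, full_matrices=False)
basis = Vh[:n_svd]                 # reduced basis (top right singular vectors)

coeffs = waveforms @ basis.T       # RB coefficients: the compressed representation
reconstructed = coeffs @ basis     # map back to the time domain

# for this smooth family, 100 components reconstruct the signals essentially exactly
rel_err = np.linalg.norm(reconstructed - waveforms) / np.linalg.norm(waveforms)
assert rel_err < 1e-3
```

Because the effective dimension of a smooth waveform family is small, the leading singular vectors capture nearly all of the signal, while subleading components would mostly encode noise in real whitened data, consistent with the reduced performance at larger n_SVD noted above.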
Conclusions.—
In this Letter, we have demonstrated for the first time that deep neural networks can accurately infer all 15 binary black hole parameters from real gravitational-wave strain data. Once the network is trained, inference is extremely fast, producing 5,000 independent samples per second.

Rapid parameter estimation is critical for multimessenger followup and for confronting the expected high rate of future detections. An advantage of likelihood-free methods is that waveform generation is done in advance of training and inference, rather than at sampling time as for conventional methods. Thus, waveform models that include more physics but may be slower to evaluate [39] can be used to analyze data in the same time as faster models.

The network we presented is tuned to a particular noise PSD—in this case, estimated just prior to GW150914. However, the noise characteristics of the LIGO and Virgo detectors vary from event to event, and ultimately we would like to amortize training costs by building a conditional density estimator that can do inference on any event without retraining for each PSD. One approach would be to condition the model on PSD information: during training, waveforms would be whitened with respect to a PSD drawn from a distribution representing the variation in detector noise from event to event, and (a summary of) this PSD information would be passed to the network as additional context. (PSD samples can be obtained from detector data at random times.) For inference, PSD information would then be passed along with the whitened strain data. Similar approaches could also be used to treat non-Gaussian noise artifacts.

In contrast to CVAEs used in past work [13, 14], normalizing flows have the advantage of estimating the density directly, without any need to marginalize over latent variables.
This means that the loss function can be taken to be the cross-entropy (4) rather than an upper bound [15, 16]. Moreover, since q(θ|s) is a normalized probability distribution, the Bayesian evidence can be obtained as a byproduct. The performance we achieved without latent variables in this work was made possible by the use of a more powerful normalizing flow [22] compared to [14]. As new and more powerful normalizing flows are developed by computer scientists in the future, they will be straightforward to deploy to further improve the performance and capabilities of deep learning for gravitational-wave parameter estimation.

We would like to thank C. Simpson for helpful discussions and for performing training runs during the early stages of this work. Our code was implemented in PyTorch [40], using the neural spline flow implementation of [26]. Plots were produced with matplotlib [41],
ChainConsumer [42] and ligo.skymap [43].

∗ [email protected]
† [email protected]

[1] B. P. Abbott et al. (LIGO Scientific, Virgo), "Observation of Gravitational Waves from a Binary Black Hole Merger," Phys. Rev. Lett. 116, 061102 (2016), arXiv:1602.03837 [gr-qc].
[2] B. P. Abbott et al. (LIGO Scientific, Virgo), "GWTC-1: A Gravitational-Wave Transient Catalog of Compact Binary Mergers Observed by LIGO and Virgo during the First and Second Observing Runs," (2018), arXiv:1811.12907 [astro-ph.HE].
[3] R. Abbott et al. (LIGO Scientific, Virgo), "GW190412: Observation of a Binary-Black-Hole Coalescence with Asymmetric Masses," (2020), arXiv:2004.08342 [astro-ph.HE].
[4] B. P. Abbott et al. (LIGO Scientific, Virgo), "GW190425: Observation of a Compact Binary Coalescence with Total Mass ∼ 3.4 M⊙," Astrophys. J. Lett. 892, L3 (2020), arXiv:2001.01761 [astro-ph.HE].
[5] R. Abbott et al. (LIGO Scientific, Virgo), "GW190814: Gravitational Waves from the Coalescence of a 23 Solar Mass Black Hole with a 2.6 Solar Mass Compact Object," Astrophys. J. Lett. 896, L44 (2020), arXiv:2006.12611 [astro-ph.HE].
[6] "Gravitational-wave candidate event database," https://gracedb.ligo.org/superevents/public/O3/, accessed: 2020-07-22.
[7] J. Veitch et al., "Parameter estimation for compact binaries with ground-based gravitational-wave observations using the LALInference software library," Phys. Rev. D 91, 042003 (2015), arXiv:1409.7215 [gr-qc].
[8] I. M. Romero-Shaw et al., "Bayesian inference for compact binary coalescences with BILBY: Validation and application to the first LIGO–Virgo gravitational-wave transient catalogue," (2020), arXiv:2006.00714 [astro-ph.IM].
[9] B. P. Abbott et al. (LIGO Scientific, Virgo), "Properties of the binary neutron star merger GW170817," Phys. Rev. X 9, 011001 (2019), arXiv:1805.11579 [gr-qc].
[10] B. P. Abbott et al. (KAGRA, LIGO Scientific, VIRGO), "Prospects for Observing and Localizing Gravitational-Wave Transients with Advanced LIGO, Advanced Virgo and KAGRA," Living Rev. Rel. 21, 3 (2018), arXiv:1304.0670 [gr-qc].
[11] Elena Cuoco et al., "Enhancing Gravitational-Wave Science with Machine Learning," (2020), arXiv:2005.03745 [astro-ph.HE].
[12] Alvin J. K. Chua and Michele Vallisneri, "Learning Bayesian posteriors with neural networks for gravitational-wave inference," Phys. Rev. Lett. 124, 041102 (2020), arXiv:1909.05966 [gr-qc].
[13] Hunter Gabbard, Chris Messenger, Ik Siong Heng, Francesco Tonolini, and Roderick Murray-Smith, "Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy," (2019), arXiv:1909.06296 [astro-ph.IM].
[14] Stephen R. Green, Christine Simpson, and Jonathan Gair, "Gravitational-wave parameter estimation with autoregressive neural network flows," (2020), arXiv:2002.07656 [astro-ph.IM].
[15] Diederik P. Kingma and Max Welling, "Auto-encoding variational bayes," (2013), arXiv:1312.6114 [stat.ML].
[16] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in International Conference on Machine Learning (2014) pp. 1278–1286, arXiv:1401.4082 [stat.ML].
[17] Danilo Rezende and Shakir Mohamed, "Variational inference with normalizing flows," in International Conference on Machine Learning (2015) pp. 1530–1538, arXiv:1505.05770 [stat.ML].
[18] Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, "Improved variational inference with inverse autoregressive flow," in Advances in Neural Information Processing Systems (2016) pp. 4743–4751, arXiv:1606.04934 [cs.LG].
[19] George Papamakarios, Theo Pavlakou, and Iain Murray, "Masked autoregressive flow for density estimation," in Advances in Neural Information Processing Systems (2017) pp. 2338–2347, arXiv:1705.07057 [stat.ML].
[20] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel, "Variational lossy autoencoder," (2016), arXiv:1611.02731 [cs.LG].
[21] "Advanced LIGO anticipated sensitivity curves," LIGO Technical Document, LIGO-T0900288-v2 (2009).
[22] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios, "Neural spline flows," (2019), arXiv:1906.04032 [stat.ML].
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Identity mappings in deep residual networks," (2016), arXiv:1603.05027 [cs.CV].
[24] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (2015) pp. 448–456, arXiv:1502.03167 [cs.LG].
[25] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," (2015), arXiv:1511.07289 [cs.LG].
[26] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios, "Neural spline flows," https://github.com/bayesiains/nsf (2019).
[27] Laurent Dinh, David Krueger, and Yoshua Bengio, "NICE: Non-linear independent components estimation," (2014), arXiv:1410.8516 [cs.LG].
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," (2015), arXiv:1512.03385 [cs.CV].
[29] Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization," (2014), arXiv:1412.6980 [cs.LG].
[30] Benjamin Farr, Evan Ochsner, Will M. Farr, and Richard O'Shaughnessy, "A more effective coordinate system for parameter estimation of precessing compact binaries from gravitational waves," Phys. Rev. D 90, 024018 (2014), arXiv:1404.7070 [gr-qc].
[31] Mark Hannam, Patricia Schmidt, Alejandro Bohé, Leïla Haegel, Sascha Husa, Frank Ohme, Geraint Pratten, and Michael Pürrer, "Simple Model of Complete Precessing Black-Hole-Binary Gravitational Waveforms," Phys. Rev. Lett. 113, 151101 (2014), arXiv:1308.3271 [gr-qc].
[32] Sebastian Khan, Sascha Husa, Mark Hannam, Frank Ohme, Michael Pürrer, Xisco Jiménez Forteza, and Alejandro Bohé, "Frequency-domain gravitational waves from nonprecessing black-hole binaries. II. A phenomenological model for the advanced detector era," Phys. Rev. D 93, 044007 (2016), arXiv:1508.07253 [gr-qc].
[33] Alejandro Bohé, Mark Hannam, Sascha Husa, Frank Ohme, Michael Pürrer, and Patricia Schmidt, "PhenomPv2 – technical notes for the LAL implementation," LIGO Technical Document, LIGO-T1500602-v4 (2016).
[34] Priscilla Canizares, Scott E. Field, Jonathan R. Gair, and Manuel Tiglio, "Gravitational wave parameter estimation with compressed likelihood evaluations," Phys. Rev. D 87, 124005 (2013), arXiv:1304.0462 [gr-qc].
[35] Rory Smith, Scott E. Field, Kent Blackburn, Carl-Johan Haster, Michael Pürrer, Vivien Raymond, and Patricia Schmidt, "Fast and accurate inference on gravitational waves from precessing compact binaries," Phys. Rev. D 94, 044031 (2016), arXiv:1604.08253 [gr-qc].
[36] Ilya Loshchilov and Frank Hutter, "SGDR: Stochastic gradient descent with warm restarts," (2016), arXiv:1608.03983 [cs.LG].
[37] Gregory Ashton et al., "BILBY: A user-friendly Bayesian inference library for gravitational-wave astronomy," Astrophys. J. Suppl. 241, 27 (2019), arXiv:1811.02042 [astro-ph.IM].
[38] Joshua S. Speagle, "dynesty: a dynamic nested sampling package for estimating Bayesian posteriors and evidences," Monthly Notices of the Royal Astronomical Society 493, 3132–3158 (2020), arXiv:1904.02180 [astro-ph.IM].
[39] Serguei Ossokine et al., "Multipolar Effective-One-Body Waveforms for Precessing Binary Black Holes: Construction and Validation," (2020), arXiv:2004.09442 [gr-qc].
[40] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019) pp. 8024–8035.
[41] J. D. Hunter, "Matplotlib: A 2D graphics environment," Computing in Science & Engineering 9, 90–95 (2007).
[42] S. R. Hinton, "ChainConsumer," The Journal of Open Source Software 1, 00045 (2016).
[43] Leo Singer, "ligo.skymap," https://lscsoft.docs.ligo.org/ligo.skymap/.