The sum of the masses of the Milky Way and M31: a likelihood-free inference approach
Pablo Lemos, Niall Jeffrey, Lorne Whiteway, Ofer Lahav, Noam I. Libeskind, Yehuda Hoffman
Department of Physics and Astronomy, University College London, Gower Street, London, WC1E 6BT, UK
Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, France
Leibniz-Institut für Astrophysik Potsdam (AIP), An der Sternwarte 16, 14482 Potsdam, Germany
University of Lyon, UCB Lyon-1/CNRS/IN2P3, IPN Lyon, France
Racah Institute of Physics, Hebrew University, Jerusalem, 91904 Israel
We use Density Estimation Likelihood-Free Inference, Λ Cold Dark Matter simulations of ∼10^6 galaxy pairs, and data from Gaia and the Hubble Space Telescope to infer the sum of the masses of the Milky Way and Andromeda (M31) galaxies, the two main components of the Local Group. This method overcomes most of the approximations of the traditional timing argument, makes the writing of a theoretical likelihood unnecessary, and allows non-linear modelling of observational errors that takes into account correlations in the data and non-Gaussian distributions. We obtain an M_200 mass estimate M_MW+M31 = 4.6^{+2.3}_{-1.8} × 10^{12} M_⊙ (68% C.L.), in agreement with previous estimates both for the sum of the two masses and for the individual masses. This result is not only one of the most reliable estimates of the sum of the two masses to date, but is also an illustration of likelihood-free inference in a problem with only one parameter and only three data points.

I. INTRODUCTION
Likelihood-free inference (LFI) has emerged as a very promising technique for inferring parameters from data, particularly in cosmology. It provides parameter posterior probability estimation without requiring the calculation of an analytic likelihood (i.e. the probability of the data being observed given the parameters). LFI uses forward simulations in place of an analytic likelihood function. Writing a likelihood for cosmological observables can be extremely complex, often requiring the solution of Boltzmann equations, as well as approximations for highly nonlinear processes such as structure formation and baryonic feedback. While simulations have their own limitations and are computationally expensive, the quality and efficiency of cosmological simulations are constantly increasing, and they are likely to soon far surpass the accuracy and robustness of any likelihood function.

This is a rapidly growing topic in cosmology, due to the emergence of novel methods for likelihood-free inference [see e.g. 1, 2], with applications to data sets such as the Joint Light Curve Analysis (JLA) and Pantheon supernova datasets [3, 4], and the Dark Energy Survey Science Verification data [5], amongst others [6-8]. There are, therefore, many applications for which LFI could improve the robustness of parameter inference using cosmological data. In this work we perform an LFI-based estimation of the sum of the masses of the Milky Way and M31. The likelihood function for this problem requires significant simplifications, but forward simulations can be obtained easily.

The Milky Way and Andromeda are the main components of the Local Group, which includes tens of smaller galaxies. We define M_MW+M31 as the sum of the MW and M31 masses. Estimating M_MW+M31 remains an elusive and complex problem in astrophysics. As the mass of each of the Milky Way and M31 is known only to within a factor of 2, it is important to constrain the sum of their masses.

∗ [email protected]
The traditional approach is to use the so-called timing argument (TA) [9]. The timing argument estimates M_MW+M31 using Newtonian dynamics integrated from the Big Bang. This integration is an extremely simplified version of a very complex problem. Therefore, alternative methods that do not rely on the same approximations become extremely useful.

In this work, we use the MultiDark Planck (MDPL) simulation [10, 11], combined with data from the Hubble Space Telescope [HST, 12] and Gaia [13], to estimate M_MW+M31. A similar data set was previously used in [14] to obtain a point estimate of M_MW+M31 using Artificial Neural Networks (ANN) in conjunction with the TA. In contrast, our work uses Density Estimation Likelihood-Free Inference [DELFI 2, 15-17], using the pyDELFI package (https://github.com/justinalsing/pydelfi), combined with more recent data. While the result is important on its own, this paper also illustrates the fundamental methodology of DELFI in a problem that is statistically simple but physically complex.

The structure of the paper is as follows: Sec. II reviews and describes previous estimates of M_MW+M31. Sec. III describes the basics of LFI, and the particular techniques used in this work. Sec. IV and Sec. V describe the simulations and data, respectively, used in this work. Sec. VI shows our results, and conclusions are presented in Sec. VII.

II. PREVIOUS ESTIMATES
A first approach to estimating M_MW+M31 from dynamics, known as the TA, is based on the simple idea that the MW and M31 are point masses approaching each other on a radial orbit that obeys

r̈ = −G M_MW+M31 / r² + (Λ/3) r,  (1)

where M_MW+M31 is the sum of the masses of the two galaxies [9, 21], and where the Λ term, which represents a form of dark energy, was added in later studies [22, 23]. The argument was also extended to modified gravity models [24]. Since we know the present-day distance r between the MW and M31 and their relative radial velocity v_r, and if we assume the age of the universe and Λ, we can infer the mass M_MW+M31. The analysis can be extended to cover the case of non-zero tangential velocity v_t [e.g. 14, 25]. The pros and cons of the timing argument are well known. It is a simple model which assumes only two point-mass bodies; this ignores, for example, the tidal forces due to neighbouring galaxy haloes in the Local Group and the extended cosmic web around it. While the timing argument model does not capture the complexity of the cosmic structure and the resulting cosmic variance, it gives a somewhat surprisingly good estimate of M_MW+M31. As shown below, it can also serve to test the sensitivity of the results to parameters such as the cosmological constant Λ and the Hubble constant H_0, in case simulations are not available for different values of these parameters.

A second approach is to consider the dynamics of all the galaxies in the Local Group using the Least Action Principle [26], as, for example, was implemented in [20]. In this approach all members of the Local Group appear in the model, but as a result the derived masses are correlated, and the error bars should be interpreted accordingly. A third approach is to use N-body simulations of the local universe, assuming a cosmological model such as ΛCDM [14, 18, 19].
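As a benchmark for the discussion above, the Λ = 0 timing argument can be sketched in a few lines using the standard parametric (cycloid) solution of the radial orbit in Eq. (1). This is our own illustrative implementation, not the code of [9] or [23], and the input values below are round numbers of the kind quoted later in the paper.

```python
import math

G = 4.30091e-6          # Newton's constant in kpc (km/s)^2 / Msun
KPC_KMS = 3.0857e16     # seconds in one kpc/(km/s) time unit

def timing_argument_mass(r0_kpc, vr_kms, t0_gyr=13.8):
    """Classical (Lambda = 0) timing-argument mass from the cycloid solution
    r = a(1 - cos eta), t = sqrt(a^3 / GM) (eta - sin eta)."""
    t0 = t0_gyr * 1e9 * 3.156e7 / KPC_KMS   # age of the Universe in kpc/(km/s)
    target = vr_kms * t0 / r0_kpc           # dimensionless combination v*t/r
    # Solve sin(eta)(eta - sin eta)/(1 - cos eta)^2 = target on (pi, 2*pi),
    # the collapsing branch; the left-hand side decreases monotonically
    # from 0 towards -infinity there, so bisection is safe.
    lo, hi = math.pi + 1e-6, 2.0 * math.pi - 1e-6
    for _ in range(200):
        eta = 0.5 * (lo + hi)
        f = math.sin(eta) * (eta - math.sin(eta)) / (1.0 - math.cos(eta)) ** 2
        if f > target:
            lo = eta
        else:
            hi = eta
    # Eliminate the semi-major axis a to get M from r, t and eta.
    return (r0_kpc ** 3 * (eta - math.sin(eta)) ** 2
            / (G * t0 ** 2 * (1.0 - math.cos(eta)) ** 3))
```

For r = 770 kpc and v_r ≈ −109 km s⁻¹ this returns a mass of order 4 × 10^{12} M_⊙, in line with the TA estimates collected in Tab. I.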
In this paper we apply this third method, but using the DELFI technique (this provides significant improvements over other methods, as discussed in the following section).

Representative results for M_MW+M31 from previous works are given in Tab. I. Throughout the paper we quote 68% credible intervals.
III. LIKELIHOOD-FREE INFERENCE
In Bayesian statistics we often face the following problem: given observed data D_obs, and a theoretical model I with a set of parameters θ, calculate the probability of the parameters given the data. In other words, we want to calculate the posterior distribution P ≡ p(θ|D_obs, I); here p is a probability (for a model with discrete parameters) or a probability density (for continuous parameters). We do so using Bayes' theorem:

p(θ|D_obs, I) = p(D_obs|θ, I) p(θ|I) / p(D_obs|I)  ⇔  P = L × Π / Z,  (2)

where L is called the likelihood, Π the prior, and Z the Bayesian Evidence. The Bayesian Evidence acts as an overall normalization in parameter estimation, and can therefore be ignored for this task. Thus, given a choice of prior distribution and a likelihood function, we can estimate the posterior distribution. However, obtaining a likelihood function is not always easy. The likelihood function provides a probability of measuring the data as a function of the parameter values, and often requires approximations both in the statistics and in the theoretical modelling.

Likelihood-Free Inference is an alternative method for calculating the posterior distribution; in this method we do not formally write down a likelihood function. Instead, we use forward simulations of the model to generate samples of the data and parameters. In the simplest version of LFI, we select only the forward simulations that are the most similar to the observed data, rejecting the rest. This method is known as Approximate Bayesian Computation [ABC, 27]; it relies on choices of a distance metric (to measure similarity between simulated and observed data) and of a maximum distance parameter ε (used to accept or reject simulations).

In this work, we will use a version of LFI called Density Estimation Likelihood-Free Inference (DELFI).
In this approach, we use all existing forward simulations to learn a conditional density distribution of the data d given the parameters θ, using a density estimation algorithm. Examples of density estimation algorithms are Kernel Density Estimation [KDE, 28-30], Mixture Models, Mixture Density Networks [31, 32], and Masked Autoregressive Flows [33]. We use the package pyDELFI, and estimate the likelihood function from the forward simulations using Gaussian Mixture Density Networks (GMDN) and Masked Autoregressive Flows (MAF). In this sense, the name Likelihood-Free Inference is perhaps misleading: the inference is not 'likelihood-free', we simply avoid writing a likelihood and instead model it using forward simulations. A more accurate name for the method could therefore be 'explicit-likelihood-free'. A more extended discussion of our choice of density estimation algorithms and conditional distribution is presented in Appendix A.

DELFI has several advantages over the simpler ABC approach to LFI: it does not rely on a choice of a distance parameter ε (although admittedly the choice of basis in parameter space can change the implicit distance metric of the density estimator) and it uses all available forward simulations to build the conditional distribution, making it far more efficient.

Note that we use the letter d to refer to the data space used in DELFI to learn conditional distributions, and we use D_obs to refer to the observed data.

TABLE I. Estimates of M_MW+M31 from previous work. The third column shows the data used, with r in Mpc and v_r, v_t in km s⁻¹. The fourth column shows M_MW+M31 in units of 10^{12} M_⊙. Note that Gaussian approximations have been used to convert the reported confidence levels to 68% confidence levels in some cases.

Reference                       | Method                | Assumed (r, v_r, v_t)   | M_MW+M31
Li & White (2008) [18]          | TA calibrated on Sims | (0.784, 130, 0)         | 5.…^{+2.…}_{−…}
Gonzalez & Kravtsov (2014) [19] | Sims                  | (0.783, 109.3, 0)       | 4.…^{+2.…}_{−…}
McLeod et al. (2017) [14]       | TA + Λ                | (0.…±0.…, …±…, …±17)    | 4.…^{+0.…}_{−0.…}
McLeod et al. (2017) [14]       | Sims + ANN + Shear    | (0.…±0.…, …±…, …±17)    | 4.…^{+0.…}_{−0.…}
Phelps et al. (2013) [20]       | Least Action          | (0.79, 119, 0)          | 6.… ± …

While relatively new, likelihood-free inference has already been applied to several problems in astrophysics [e.g. 5, 34-39]. However, most applications involving LFI suffer from the curse of dimensionality: there can be hundreds, thousands, or even millions of observables. The M_MW+M31 problem, by contrast, has only three data points and one parameter of interest, making it an extremely simple application of the method from the statistical point of view; it illustrates all necessary techniques, and does not require data compression.

To summarize, the steps that we will follow are:

• Generate a large number of simulations of systems similar to the one of interest. The simulations used in this work are described in Sec. IV.

• Use a density estimator, in our case GMDN and MAF as part of the pyDELFI package, to obtain the sampling distribution for any data realization, p(d|θ, I).

• Evaluate this distribution at the observed data (which will be described in Sec. V) to obtain the likelihood function for our observed data realization: p(d = D_obs|θ, I).

• From a prior distribution and the likelihood, and using Bayes' theorem (Eq. (2)), get a posterior distribution p(θ|D_obs, I).

IV. SIMULATIONS
We use the publicly available MDPL simulation. This is an N-body simulation of a periodic box with side length L_box = 1 Gpc h⁻¹, populated with 2048³ dark matter particles. Such a simulation achieves a mass resolution of ≈8.7 × 10^9 M_⊙ h⁻¹ and a Plummer-equivalent gravitational softening of 7 kpc. The simulation was run with the ART code and assumed a ΛCDM power spectrum of fluctuations with Ω_Λ = 0.73, Ω_b = 0.047, Ω_m = 0.27, σ_8 = 0.82 and H_0 = 100 h km s⁻¹ Mpc⁻¹, with h = 0.70.

Haloes are selected in the mass range 5 × 10^… < M/(M_⊙ h⁻¹) < … × 10^…. For each such halo, we find all haloes (irrespective of mass) within ∼… Mpc; if the partner's mass lies in [… × 10^…, … × 10^…] and it is separated by more than 500 kpc but less than 1500 kpc, the two haloes are considered a potential pair. The partner is then examined to ensure that the pair is isolated (namely that there is no other halo with a mass > … × 10^… closer to either pair member than the pair separation). If this is the case then the pair is kept for the analysis described here. In this way, 1,094,839 pairs are found (as opposed to the 30,190 pairs used in [14]).

We believe the criteria used to select the halo pairs in this work are not very restrictive, and any cuts lie well outside the realistic limits for the actual MW-M31 system. Though these selections are unrestrictive, we note that including more pairs does improve the density estimation, as DELFI can use all the available simulated data, and not only those that are close to the observation.
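The pair-selection criteria above amount to a filter over a halo catalogue, which can be sketched as follows. The numerical thresholds below (mass window, isolation mass) are illustrative placeholders, since the exact cut values are not all recoverable from the text; only the 500-1500 kpc separation window is taken from the paper.

```python
import math

def find_isolated_pairs(haloes, m_lo=5e11, m_hi=5e13,
                        sep_min=500.0, sep_max=1500.0, iso_mass=5e11):
    """Return isolated halo pairs in the spirit of Sec. IV.

    `haloes` is a list of dicts with 'mass' (Msun/h) and 'pos' (kpc).
    The mass thresholds are assumed placeholders, not the paper's cuts.
    """
    def dist(a, b):
        return math.dist(a['pos'], b['pos'])

    pairs = []
    for i, h in enumerate(haloes):
        if not (m_lo < h['mass'] < m_hi):
            continue
        for j, p in enumerate(haloes):
            if j <= i:
                continue  # count each pair only once
            d = dist(h, p)
            if not (sep_min < d < sep_max):
                continue
            # Isolation: no third massive halo closer to either member
            # than the pair separation itself.
            isolated = all(
                dist(h, o) >= d and dist(p, o) >= d
                for k, o in enumerate(haloes)
                if k not in (i, j) and o['mass'] > iso_mass
            )
            if isolated:
                pairs.append((i, j, d))
    return pairs
```

In the full analysis this filter runs over the MDPL halo catalogue; here it serves only to make the isolation criterion concrete.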
V. DATA
We use three observations to constrain M_MW+M31: the distance to M31, and the radial and tangential components of its velocity. While the TA requires other observables, such as the age of the Universe t_0 and the cosmological constant Λ, in our approach these are already included in the simulations at fixed values. We discuss below how uncertainties in cosmological parameters affect the estimated mass.

Henceforth, all masses are defined as M_200, the total mass enclosed in the largest sphere surrounding them with an enclosed mean density over 200 times the critical value [18].

FIG. 1. An illustration of our non-Gaussian error modelling for the tangential velocity. The plot was obtained by taking the components of the tangential velocity for a randomly chosen simulation (with fixed M), scattering them a large number of times by the errors of Eq. (3), and calculating the norm for each sample. The black dashed line shows the norm of the components of the tangential velocity of the simulation without measurement error.

For the distance to M31, we adopt the commonly used value r = 770 ± 40 kpc [46-49]. For the velocity, we follow the results of [50, henceforth VdM19]. The radial velocity in the galactocentric rest frame is v_r = −109.3 ± 4.4 km s⁻¹ [51] from HST observations. The tangential velocity is slightly more cumbersome. VdM19 report the following value for the components of the tangential velocity of M31 from a combination of Gaia DR2 and HST:

μ = (10 ± …, −… ± …) μas yr⁻¹,  (3)

already in the galactocentric frame, i.e. after correcting for the solar reflex motion. VdM19 use the distance to M31 to convert this to km s⁻¹. They then use a method described in [52] to correct for the fact that taking the norm of the two components leads to a 'bias' in the reported tangential velocity (we will discuss this supposed bias in a future publication). However, in this work we take a different approach: for each simulation, we scatter the value of each component of the tangential velocity according to the observational error of each component shown in Eq. (3). By doing this, we are putting the observational errors into the simulated measurements, instead of attempting to 'debias' the tangential velocity summary statistic. We convert to km s⁻¹ using the value of the distance for that sample, to account for the covariance between r and v_t. We take as the observed value the norm of the two observed components, v_t = 72 km s⁻¹. While this differs from the value reported in VdM19, this should not be a problem, as long as the way in which we calculate the tangential velocity in simulations and observations is consistent, and we use a summary statistic that extracts all the available information (which the norm of the components does). Furthermore, Appendix C shows that our results do not significantly change if we repeat the analysis using a purely radial motion (v_t = 0).

This approach takes into account both the non-Gaussian errors in v_t and the correlation of v_t with the errors in the distance measurement (these have not been accounted for in previous estimates of M_MW+M31).

VI. RESULTS

A. Overview
Having discussed the method, the simulations, and the data, we have everything we need to perform LFI using DELFI. We have three data points d = {r, v_r, v_t} and one parameter θ = M. The process is illustrated in Fig. 2 and Fig. 3, and consists of the following:

1. Left panel of Fig. 2: We generate a large number of forward simulations, as discussed in Sec. IV. Increasing the number of simulations will increase the accuracy of the density estimation, and of the resulting posterior. Reference [2] demonstrates an active learning scheme with pyDELFI, providing criteria to run new simulations based on discrepancies between the density estimates in the neural density estimator ensemble. However, due to our initial large number of simulations, we had no need to run such extra simulations on-the-fly.

2. Right panel of Fig. 2: The observational errors are introduced as scatter in the forward simulations. More specifically, we displace the simulations by a number sampled from the error model presented in Sec. V. Note how in Fig. 2 this step does not affect the mass. This is because the mass in this problem is part of the parameters θ, not the data, as it is our goal to obtain a posterior distribution for the mass.

3. We use density estimation to get the conditional density distribution p(d|θ, I), as shown in Fig. 3. While it might seem counter-intuitive to learn the likelihood instead of directly learning the posterior, this allows us to then sample from a chosen prior, instead of being limited to the prior that is implicit in the simulations. This is discussed in more detail in Appendix A. There are several algorithms that can be used to get a conditional probability distribution from samples. In this work we use GMDN and MAF (as part of the pyDELFI package).

4. Finally, we evaluate this conditional density distribution at the observed data

D_obs = {r = 0.77 Mpc, v_r = −109.3 km s⁻¹, v_t = 72 km s⁻¹},  (4)

as discussed in Sec. V. This way, we get the likelihood function:

L ≡ p(d = D_obs|θ, I).  (5)

FIG. 2. Illustration of DELFI for the estimation of M_MW+M31. The left panel is a scatter plot of the simulations described in Sec. IV. The right panel adds the observational errors by scattering the simulations.

FIG. 3. Conditional distribution p(d|θ, I) density estimate from the points shown in the right panel of Fig. 2. The 2D subplots in the left column show the probabilities of the data d = {r, v_r, v_t} conditional on the parameter θ = M (which has been given a uniform distribution), while the remaining 2D subplots have been marginalised over this uniform distribution. By evaluating this function at the observed data points we obtain the likelihood function. The points were sampled using the Nested Sampling [53] code PolyChord [54, 55]. Note that the one-dimensional distribution on M is flat by construction, as this is the parameter, and the distributions on r and v_r appear to be 'cut' because of the selection criteria described in Sec. IV.

Note that while this work uses GMDN and MAF for density estimation, Fig. 3 uses KDE instead. This is because the plots were generated using the code anesthetic [56], which uses KDE to plot smooth probability distributions. KDE is appropriate in this case, as anesthetic only plots the one- and two-dimensional posterior distributions, whereas in this work we are trying to learn the full 4D distribution. anesthetic uses the fastKDE implementation [57, 58].

Through this process we obtain a likelihood function, without ever having to write a theory or use a Gaussian approximation. While this process is limited by the number and accuracy of the available simulations, it has a big advantage over likelihood-based approaches that use simplifying approximations to make the likelihood more tractable, or easier to compute. The calculation of M_MW+M31 is a good example: the likelihood-based approach relies on the TA and data-modelling approximations, which we know oversimplify the problem. Instead, using DELFI, we can account for the complex nonlinear evolution of the system through our N-body simulation.
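The four steps above can be condensed into a toy, pure-Python sketch. The forward model (d = sqrt(θ) plus Gaussian noise), the binned-Gaussian conditional density, and all numerical values are illustrative stand-ins for the MDPL simulations and the neural density estimators actually used.

```python
import math, random

def toy_delfi(d_obs=2.0, n_sims=20000, n_bins=20, seed=1):
    """Toy DELFI: simulate, model p(d|theta) per theta-bin as a Gaussian,
    evaluate it at the observed datum, and form a grid posterior."""
    rng = random.Random(seed)

    # Step 1: forward simulations (toy model: d = sqrt(theta) + noise).
    theta = [rng.uniform(1.0, 10.0) for _ in range(n_sims)]
    # Step 2: observational errors enter as scatter on the simulated data.
    data = [math.sqrt(t) + rng.gauss(0.0, 0.2) for t in theta]

    # Step 3: crude conditional density estimate of p(d|theta):
    # bin theta and fit a Gaussian to d within each bin.
    edges = [1.0 + 9.0 * i / n_bins for i in range(n_bins + 1)]
    centers, likelihood = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ds = [d for t, d in zip(theta, data) if lo <= t < hi]
        mu = sum(ds) / len(ds)
        sig = math.sqrt(sum((d - mu) ** 2 for d in ds) / len(ds))
        # Step 4: evaluate the conditional density at d_obs -> likelihood.
        L = math.exp(-0.5 * ((d_obs - mu) / sig) ** 2) / (sig * math.sqrt(2 * math.pi))
        centers.append(0.5 * (lo + hi))
        likelihood.append(L)

    # Flat prior: the posterior is proportional to the likelihood on the grid.
    norm = sum(likelihood)
    posterior = [L / norm for L in likelihood]
    return centers, posterior
```

With a flat prior the posterior peaks near θ = d_obs² = 4, as expected for this toy forward model; replacing the binned Gaussian with a GMDN or MAF recovers the actual pyDELFI approach.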
B. Density estimation validation
For the density estimation with pyDELFI, we use a combination of two GMDNs (with four and five Gaussian components) and two MAFs (with three and four components). The GMDNs have two layers with fifty components each, while the MAFs have thirty components on each of the two hidden layers. For a more robust density estimation we stack the results weighted by each density estimation's relative likelihood, as described in [2].

We hold back 10,000 simulations from the training set to be used for validating the likelihood that we have learned through the simulations. Each validation simulation has a 'true' mass, position and velocity, and we can use these to estimate how well our likelihood works. The results from this validation are shown in Fig. 4. We also perform a quantile test, finding that 95.…% of the true masses lie within the 2σ predicted posterior, as expected.

C. Prior distribution
As previously discussed, we have the freedom to choose a suitable prior distribution (this is because we have used the simulations to learn a likelihood function, instead of directly learning the posterior distribution). The left panel of Fig. 8 shows four priors relevant to this study. Uninformative priors in this case could be either a flat prior or a logarithmic prior. In addition, in this problem Press-Schechter theory [59] supplies us with a physically-motivated prior. The Press-Schechter formalism predicts the number of virialized objects with a given mass. While this would be a fully correct prior only if the MW and M31 formed a single halo, it can provide a good prior distribution for the problem (the Local Group is a bound system but not a virialized system, so placing it in the Press-Schechter mass function works as an approximation). We calculate the Press-Schechter prior using the code Colossus [60] (https://bdiemer.bitbucket.io/colossus/index.html) and the Tinker mass function [61]. Finally, the prior shown as a black dashed line is the one that would have been in use had we learned the posterior directly from the simulations (when learning the posterior directly, we still have a prior, we simply lose the freedom to choose it). In this work, we adopt the Press-Schechter prior, which as shown in Fig. 8 is virtually equivalent to a flat prior in log M. The effect of using different priors on our results is discussed in Appendix B.

FIG. 4. Validation plot for our density estimation. We use 10,000 simulations that have not been used for training. For all of these, we use r, v_r and v_t to estimate a mass, and compare with the true mass. The top panel plots predicted vs true mass, while the bottom panel shows the residuals. The bars show the 68% C.L. obtained using the method described in this paper.
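For intuition, the shape of a Press-Schechter-style prior can be sketched with a toy power-law σ(M); the pivot mass, slope, and normalisation below are illustrative assumptions, not the Colossus/Tinker calculation used in the paper.

```python
import math

DELTA_C = 1.686  # spherical-collapse overdensity threshold

def sigma_toy(m, m_pivot=1e13, sigma_pivot=1.0, slope=0.25):
    """Toy power-law rms fluctuation sigma(M) (assumed, for illustration)."""
    return sigma_pivot * (m / m_pivot) ** (-slope)

def press_schechter_prior(m):
    """Unnormalised Press-Schechter multiplicity f(nu), converted to a
    prior shape in mass via |d ln sigma / d ln M| / M."""
    nu = DELTA_C / sigma_toy(m)
    f_nu = math.sqrt(2.0 / math.pi) * nu * math.exp(-0.5 * nu * nu)
    dln_sigma_dlnm = 0.25  # constant for the toy power law above
    return f_nu * dln_sigma_dlnm / m
```

As expected, this prior falls steeply towards high masses, which is why it behaves much like a flat prior in log M over the range of interest.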
FIG. 5. The posterior on M_MW+M31 obtained in this work (solid blue) compared to the timing argument result of [14] that includes Λ and the tangential velocity from [50] (dashed orange). While our peak is lower, the posteriors are fully consistent.
Once we have obtained a likelihood function and prior, we can get a posterior using Bayes' theorem, Eq. (2). We can describe this posterior by sampling from it using an algorithm such as Markov chain Monte Carlo (MCMC) or Nested Sampling [62]. However, in our case a 'brute-force' approach is more practical (because the posterior is only one-dimensional): we simply calculate the posterior on a grid of mass values.

Our result using the Press-Schechter prior is shown in Fig. 5 (solid blue). Our peak and 68% confidence levels are M_MW+M31 = 4.6^{+2.3}_{-1.8} × 10^{12} M_⊙, in good agreement with [14] (also shown in Fig. 5) but with improved error bars.

D. Results discussion
Our result is compared to previous results in Fig. 7. We see that all other estimates considered in this work are within the 68% confidence interval of our posterior on the mass, despite the different methods used. We notice that the Least Action result of [20] obtains tighter constraints than our method; however, our result is the first one to fully account for the distribution of the observed errors in a robust (and Bayesian) manner. Other results use Gaussian approximations for observational errors, or neglect them completely, and therefore our result is the most accurate estimate of M_MW+M31 to date. This framework also allows for more accurate estimates, in particular by accounting for the presence of M33 and the LMC in the Local Group. This will be explored in future work.

The simulation was run using one particular set of cosmological parameters, but in reality these parameters are uncertain and we should marginalise over them. This is infeasible for us, as we have a pre-run set of simulations with fixed cosmological parameters, but we can estimate the size of the effect by reference to the timing argument (TA). The TA uses the same observational constraints as does this work, and like this work is based on modelling/simulating the trajectories of galaxies similar to those in the MW+M31 system; as a result the TA should have similar sensitivities to cosmological parameters as this work. The TA sensitivities can be estimated by numerically differentiating the mass estimation algorithm described in [23]. We parameterise this algorithm using h and Ω_Λ (from which Λ and the age of the universe may be derived, the latter assuming Ω_m + Ω_Λ = 1). We find ∂M_MW+M31/∂Ω_Λ = −… × 10^{12} M_⊙ and ∂M_MW+M31/∂h = 7.… × 10^{12} M_⊙. Multiplying these sensitivities by the uncertainties on the cosmological parameters (ΔΩ_Λ = 0.006 and Δh = 0.004 [63]) yields uncertainties on the mass estimate that are immaterial compared to the uncertainty implied by the posterior width, and hence will be ignored. This conclusion continues to hold even if we assume a larger uncertainty on h reflecting the current tension between early- and late-Universe measurements of this parameter. For example, a change in h of 0.066 induces a change in the TA M_MW+M31 of 0.… × 10^{12} M_⊙ (in agreement with [24]); adding this in quadrature to the uncertainty implied by the posterior width yields only a marginal increase in total uncertainty (from 2.… × 10^{12} M_⊙ to 2.… × 10^{12} M_⊙). This calculation illustrates that in a simulation-based approach it is important to have a benchmark analytical model, to gauge whether parameters not explored by the simulations are relevant and whether extra simulations are needed.

Finally, we can compare our results with separate estimates of the masses of the Milky Way and M31. There are several values in the literature for the separate masses of each galaxy, in some cases discrepant. Given this discrepancy, we take a number of estimates of each mass obtained through different methods, and assume that the true value is contained within the ranges of the different estimates, as done in [64]. While conservative, this method should provide ranges that contain the true mass of each galaxy, and allow us to combine estimates in tension. Through this method, we get the following:

• M_MW ∈ (1.…, …) × 10^{12} M_⊙, from [65-70]

• M_M31 ∈ (0.…, …) × 10^{12} M_⊙, from [65, 71-73]

Combining these two measurements yields M_MW+M31 ∈ (1.…, …) × 10^{12} M_⊙. This is slightly lower than our result, but still in agreement, as illustrated in Fig. 6. We can see in Fig. 7 that all estimates of the sum of the masses based on the relative distance and velocity of the bodies (TA, ANN and our approach) obtain slightly larger values than the sum of the separate masses. A possible explanation for this could be the fact that all these approaches ignore the effect of other bodies such as the LMC and M33 on the observed velocities, which could bias the sum of the masses to higher values [74]. The effect of the LMC and M33 on our posterior mass will be explored in future work.

FIG. 6. A comparison of the estimates of the separate masses of M31, the MW, and their sum, the latter from this work. The plot shows the small discrepancy between separate estimates of the individual masses of the MW and M31 and this work.
VII. CONCLUSIONS
In this work we have used Density Estimation Likelihood-Free Inference with forward modelling to estimate the posterior distribution for the sum of the masses of the Milky Way and M31, using observations of the distance and velocity of M31. We obtain a mass M_MW+M31 = 4.6^{+2.3}_{-1.8} × 10^{12} M_⊙ (M_200). Our method overcomes the several approximations of the traditional Timing Argument, accounts for non-Gaussian sources of observational measurement error, and uses a physically motivated prior; this makes it the most reliable estimate of the M_MW+M31 mass to date.

The sensitivity analysis performed in this study illustrates that in any simulation-based approach it is important to have a benchmark analytical (or semi-analytical) model, to assess how to cover the parameter space of required simulations.

This work serves not only to obtain state-of-the-art estimates of M_MW+M31; by applying Likelihood-Free Inference to a problem that is physically rich and complex yet statistically simple (thanks to its low dimensionality), we can illustrate how the method works, what different choices need to be made, and what challenges need to be tackled. The ability to robustly infer M_MW+M31 without requiring an analytic theory or a likelihood demonstrates the potential of Likelihood-Free Inference methods in astronomy and cosmology.

FIG. 7. Comparison of this work with previous estimates of M_MW+M31, shown as best fit and 68% confidence intervals. The result of this work is shown at the bottom; it is the first to account fully for the observational errors, and to not rely on the approximations of the TA.
ACKNOWLEDGMENTS
We thank Andrey Kravtsov, Roeland van der Marel, Ekta Patel and Vasily Belokurov. We are also grateful to David Benisty for participating in discussions that led to the idea of this paper.

PL and OL acknowledge STFC Consolidated Grant ST/R000476/1. NIL acknowledges financial support of the Project IDEXLYON at the University of Lyon under the Investments for the Future Program (ANR-16-IDEX-0005). YH has been partially supported by the Israel Science Foundation grant ISF 1358/18.
Appendix A: Density Estimators
One of the key elements of DELFI is the estimation of a probability distribution from samples. This corresponds to going from Fig. 2 (right panel) to Fig. 3. The density estimation problem arises in many fields (for example image analysis [75, 76]) and several algorithms have been developed to address it. In this section, we review some of the most popular density estimation methods in the context of LFI. For an overview of neural density estimation in the context of LFI we recommend [2].

Density estimation algorithms that rely on there being samples near the point of interest, such as spline or Kernel Density Estimation (KDE), struggle in high-dimensional spaces due to the sparsity of the sampling. They are very useful, however, for estimating low-dimensional PDFs, which is why they are often used for plotting marginalised posterior distributions. Public codes such as
GetDist [77],
ChainConsumer [78] or anesthetic [56] use KDE to generate plots of marginalised posterior distributions.

A Mixture Model (MM) represents a PDF p as a weighted sum of component distributions:

p(y) = Σ_{c=1}^{N} α_c D(y; Φ_c).   (A1)

Here N is the number of components in the mixture, while D is some family of distributions described by parameters Φ; the weights {α_c} and parameters {Φ_c} are fit to observed or training data. A common choice is the Gaussian Mixture Model (GMM), in which each component distribution is Gaussian: D(y; Φ_c) = N(y; μ_c, σ_c).

GMMs can successfully represent a large number of PDFs. In addition, they have the advantage that the weights and parameters {α_c, μ_c, σ_c} can be easily fit to the data using the Expectation-Maximization algorithm [79]. There are, however, some issues with GMMs: they are sensitive to the choice of N, and they have problems fitting certain features (such as the sharp edges that can arise when flat priors are used).

In the context of LFI we are interested in modelling a conditional distribution p(y|x) (for example, in our case p is the conditional likelihood, for which y = d and x = θ). Such conditional distributions can be modelled by Mixture Density Networks (MDNs) [31, 32]. As with MMs, they model the PDF as a weighted sum of component distributions, but now the weights and parameters describing the components are themselves (possibly non-linear) functions of x:

p(y|x) = Σ_{c=1}^{N} α_c(x) D_c(y; Φ_c(x)).   (A2)

Again, a common choice is the Gaussian MDN (GMDN), in which each component is Gaussian: D(y; Φ_c(x)) = N(y; μ_c(x), σ_c(x)). The functions {α_c(x), μ_c(x), σ_c(x)} can be modelled by a neural network with a set of weights; these weights are then fit to the data.
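To make Eq. (A1) and the Expectation-Maximization fit concrete, here is a minimal sketch (our own illustration, not the code used in the paper) that fits a two-component one-dimensional GMM to bimodal toy data with a hand-rolled EM loop; all variable names and the toy data are assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy samples from a bimodal density: equal-weight mixture of N(-3,1) and N(3,1).
y = np.concatenate([rng.normal(-3.0, 1.0, 5000), rng.normal(3.0, 1.0, 5000)])

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# EM for a two-component GMM: p(y) = a_1 N(y; mu_1, s_1) + a_2 N(y; mu_2, s_2)
alpha = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])      # deliberately poor starting guesses
sigma = np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibility of each component for each sample
    r = alpha[:, None] * normal_pdf(y[None, :], mu[:, None], sigma[:, None])
    r /= r.sum(axis=0)
    # M-step: re-fit weights, means and widths from the responsibilities
    n_c = r.sum(axis=1)
    alpha = n_c / len(y)
    mu = (r * y).sum(axis=1) / n_c
    sigma = np.sqrt((r * (y - mu[:, None]) ** 2).sum(axis=1) / n_c)

print(np.round(mu, 1), np.round(alpha, 2))  # means near ±3, weights near 0.5
```

With well-separated modes the loop converges in a handful of iterations; the sensitivity to the choice of N mentioned above shows up here as the need to fix the number of components (two) before fitting.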
As with GMMs, GMDNs require specification of the number N of mixture components to be used; however, this dependence is much weaker than in the case of GMMs (as GMDNs can fit complex distributions using only a small number of components).

We finish by describing Masked Autoregressive Flows (MAFs), which have recently emerged as a powerful density estimation method [33, 43]. They do not rely on a choice of number of components, and have the advantage of providing simple tests of the goodness of fit to the samples.

Here is the motivation for masking as a strategy for density estimation. Consider training a neural network NN to mimic its input; one can imagine training a parrot, for example. The trainer speaks (= input signal), and rewards the bird if its output matches this training input. If NN has sufficient complexity then it will learn to mimic the input. Now repeat the process but with a bird with covered (= masked) ears. The bird cannot hear the input, but nevertheless receives the training reward if its output matches the input. Now the bird can only 'play the percentages'; it learns the optimal strategy, which is to output a weighted average of the input signals (weighted by their frequency of usage by the trainer). In this way the masked parrot learns the probability distribution of the input signal, i.e. it has become a density estimator.

That was the one-dimensional case. The two-dimensional case needs two parrots. The first is trained on signal x_1, which it cannot hear, with the result that it learns p(x_1). The second is trained on x_2, which it cannot hear, but it is allowed to hear x_1. As a result it learns p(x_2|x_1). Thus between them they learn p(x_1) p(x_2|x_1) = p(x_1, x_2), as desired. The multi-dimensional case is similar. This strategy is called an autoregressive autoencoder. Note that it treats the coordinates asymmetrically.

The conditional distributions p(x_i | x_1, ..., x_{i−1}) learned by NN are typically modelled as Gaussian. Consider generating samples from the estimated probability distribution p; for each sample we need a set of n random unit normals (i.e. a draw from N(0, I)), which we transform to get samples from p – call this transform T. The details of T come from the means and standard deviations of the conditional distributions, which can be obtained from n evaluations of NN. However, importantly, the inverse mapping T^{−1} can be found with just one evaluation of NN. This is the idea of the Masked Autoencoder for Distribution Estimation (MADE) algorithm [80].

Now apply T^{−1} to the training data D. The resulting 'pulled-back' data T^{−1}(D) will ideally be a set of samples from N(0, I), and its deviation from this ideal gives a direct measure of how imperfect our modelling of p is. We can then use the 'pulled-back' data as the training data for yet another MADE process, and so on through several iterations. Between iterations we permute the coordinate axes, thereby symmetrising how we treat them. With sufficient iterations, the multiply-pulled-back training data approaches N(0, I); at this point the algorithm has an easy-to-evaluate mapping between p and N(0, I), which suffices for doing calculations. This is the Masked Autoregressive Flows (MAF) algorithm [33, 43].

[FIG. 8. On the left, a comparison of four possible priors on M_MW+M31 [10^12 M_⊙]: a flat prior in the mass (blue), a flat prior in the logarithm of the mass (red), a physically motivated Press-Schechter prior (orange), and the prior from the distribution of the simulations (dashed black). On the right, the corresponding posterior P/P_max obtained from each prior. Note that we did not show the posterior for the flat prior in the logarithm of the mass, as this is equivalent to the Press-Schechter prior.]
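The autoregressive factorization and the pull-back logic above can be illustrated with a toy two-dimensional example (our own sketch under simplifying assumptions, not the paper's implementation): fit Gaussian conditionals for p(x_1) and p(x_2|x_1) with a linear mean function, apply the resulting T^{−1}, and check that the pulled-back data resembles N(0, I):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D training data: x2 depends linearly on x1.
x1 = rng.normal(1.0, 2.0, size=20000)
x2 = 0.5 * x1 + rng.normal(0.0, 1.0, size=20000)

# Autoregressive model with Gaussian conditionals:
#   p(x1) = N(mu1, s1);  p(x2|x1) = N(a + b*x1, s2)
mu1, s1 = x1.mean(), x1.std()
b, a = np.polyfit(x1, x2, 1)            # linear mean function for x2|x1
s2 = (x2 - (a + b * x1)).std()          # residual scatter

# Inverse transform T^{-1}: pull the data back towards N(0, I).
z1 = (x1 - mu1) / s1
z2 = (x2 - (a + b * x1)) / s2

# If the model fits, the pulled-back data is standard normal and uncorrelated.
print(z1.mean(), z1.std(), z2.std(), np.corrcoef(z1, z2)[0, 1])
```

In a real MAF the means and widths of the conditionals come from a masked neural network rather than moments and a linear fit, but the diagnostic is the same: any residual structure in (z1, z2) measures how imperfectly the model captures p.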
Appendix B: Dependence on Priors
In this appendix, we explore how the posterior distribution of M_MW+M31 (as shown in Fig. 5 and discussed in Sec. VI) depends on our choice of prior. We consider the four different priors illustrated in Fig. 8:

• A flat prior in the mass;
• A logarithmic prior in the mass;
• A prior based on the Press-Schechter distribution (as adopted in this work);
• A prior distribution matching the distribution of masses in the simulation.

In our case the second and third choices are virtually the same (as shown in the left panel of Fig. 8), and so we omit the 'logarithmic in mass' prior when examining how the priors affect our results. The posteriors obtained when using the remaining three priors are shown in the right panel of Fig. 8, and we see that our result is essentially independent of our choice of prior, be it the Press-Schechter prior, a flat prior on the mass (M_LG (flat prior) = 5.^{+2.}_{−.} × 10^12 M_⊙) or the prior from the simulation distribution (M_LG (simulation distribution) = 4. ± . × 10^12 M_⊙).

Appendix C: Dependence on Tangential Velocity
Imagine a 2-dimensional velocity vector V = (V_x, V_y). If V_x and V_y are uncorrelated and normally distributed with zero mean and equal variance σ², then the overall speed V = √(V_x² + V_y²) will be characterized by the Rayleigh distribution [81], with mean V̄ = σ√(π/2), rather than naively zero. Similarly, non-zero measurements of V_x and V_y with error bars will result in a distribution function with a mean that is not V̄ = √(V_x² + V_y²). In our analysis we pay attention to this via the forward-modelling approach, starting with simulated (V_x, V_y) and propagating their impact through to the final posterior for M_MW+M31.

While the distance and radial velocity have been measured on numerous occasions using different methods, observations of the tangential velocity are far more scarce [50–52]. Their work treats the effect of converting measurements of two components of the tangential velocity into a modulus in a novel way. We check the robustness of our approach in this section by comparing the main result of the paper to the case of no tangential velocity. As shown in Fig. 9, our posterior on the mass does not depend strongly on the tangential velocity. We are therefore confident in the accuracy of our posterior on the mass.

[FIG. 9. The posterior on M_MW+M31 [10^12 M_⊙] for different values of the tangential velocity: v_t = 72 km s^−1 as used in this work (solid blue), and a purely radial motion v_t = 0 (dashed red).]

[1] Florent Leclercq. Bayesian optimization for likelihood-free cosmological inference. Phys. Rev. D, 98(6):063511, September 2018. doi:10.1103/PhysRevD.98.063511.
[2] Justin Alsing, Tom Charnock, Stephen Feeney, and Benjamin Wandelt. Fast likelihood-free cosmology with neural density estimators and active learning. MNRAS, 488(3):4440–4458, September 2019. doi:10.1093/mnras/stz1960.
[3] M. Betoule, R. Kessler, J. Guy, J. Mosher, and D. Hardin et al. Improved cosmological constraints from a joint analysis of the SDSS-II and SNLS supernova samples. A&A, 568:A22, August 2014. doi:10.1051/0004-6361/201423413.
[4] Yu-Chen Wang, Yuan-Bo Xie, Tong-Jie Zhang, Hui-Chao Huang, Tingting Zhang, and Kun Liu.
Likelihood-free Cosmological Constraints with Artificial Neural Networks: An Application on Hubble Parameters and SN Ia. arXiv e-prints, art. arXiv:2005.10628, May 2020.
[5] Niall Jeffrey, Justin Alsing, and Francois Lanusse. Likelihood-free inference with neural compression of DES SV weak lensing map statistics. arXiv e-prints, art. arXiv:2009.08459, September 2020.
[6] Johann Brehmer, Siddharth Mishra-Sharma, Joeri Hermans, Gilles Louppe, and Kyle Cranmer. Mining for Dark Matter Substructure: Inferring subhalo population properties from strong lenses with machine learning. 9 2019. doi:10.3847/1538-4357/ab4c41.
[7] Doogesh Kodi Ramanah, Radosław Wojtak, and Nikki Arendse. Simulation-based inference of dynamical galaxy cluster masses with 3D convolutional neural networks. 9 2020.
[8] Luca Tortorelli, Martina Fagioli, Jörg Herbel, Adam Amara, Tomasz Kacprzak, and Alexandre Refregier. Measurement of the B-band galaxy Luminosity Function with Approximate Bayesian Computation. J. Cosmology Astropart. Phys., 2020(9):048, September 2020. doi:10.1088/1475-7516/2020/09/048.
[9] F. D. Kahn and L. Woltjer. Intergalactic Matter and the Galaxy. ApJ, 130:705, November 1959. doi:10.1086/146762.
[10] Francisco Prada, Anatoly A. Klypin, Antonio J. Cuesta, Juan E. Betancort-Rijo, and Joel Primack. Halo concentrations in the standard Λ cold dark matter cosmology. MNRAS, 423(4):3018–3030, July 2012. doi:10.1111/j.1365-2966.2012.21007.x.
[11] K. Riebe, A. M. Partl, H. Enke, J. Forero-Romero, S. Gottlöber, A. Klypin, G. Lemson, F. Prada, J. R. Primack, M. Steinmetz, and V. Turchaninov. The MultiDark Database: Release of the Bolshoi and MultiDark cosmological simulations. Astronomische Nachrichten, 334(7):691–708, August 2013. doi:10.1002/asna.201211900.
[12] Georges Meylan, Juan P. Madrid, and Duccio Macchetto. Hubble Space Telescope Science Metrics. PASP, 116(822):790–796, August 2004. doi:10.1086/423227.
[13] Gaia Collaboration. The Gaia mission. A&A, 595:A1, November 2016.
doi:10.1051/0004-6361/201629272.
[14] M. McLeod, N. Libeskind, O. Lahav, and Y. Hoffman. Estimating the mass of the Local Group using machine learning applied to numerical simulations. J. Cosmology Astropart. Phys., 2017(12):034, December 2017. doi:10.1088/1475-7516/2017/12/034.
[15] F. V. Bonassi, L. You, and M. West. Bayesian Learning from Marginal Data in Bionetwork Models. Statistical applications in genetics and molecular biology, 2011(10,1):49, October 2011. doi:10.2202/1544-6115.1684.
[16] Y. Fan, D. J. Nott, and S. A. Sisson. Approximate Bayesian Computation via Regression Density Estimation. arXiv e-prints, art. arXiv:1212.1479, December 2012.
[17] George Papamakarios and Iain Murray. Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation. arXiv e-prints, art. arXiv:1605.06376, May 2016.
[18] Yang-Shyang Li and Simon D. M. White. Masses for the Local Group and the Milky Way. Mon. Not. Roy. Astron. Soc., 384:1459–1468, 2008. doi:10.1111/j.1365-2966.2007.12748.x.
[19] Roberto E. Gonzalez, Andrey V. Kravtsov, and Nickolay Y. Gnedin. On the mass of the Local Group.
Astrophys. J., 793:91, 2014. doi:10.1088/0004-637X/793/2/91.
[20] Steven Phelps, Adi Nusser, and Vincent Desjacques. The mass of the Milky Way and M31 using the method of least action. Astrophys. J., 775:102, 2013. doi:10.1088/0004-637X/775/2/102.
[21] D. Lynden-Bell. The dynamical age of the local group of galaxies. The Observatory, 101:111–114, August 1981.
[22] James Binney and Scott Tremaine. Galactic dynamics. 1987.
[23] Candace Partridge, Ofer Lahav, and Yehuda Hoffman. Weighing the Local Group in the Presence of Dark Energy. Mon. Not. Roy. Astron. Soc., 436:45, 2013. doi:10.1093/mnrasl/slt109.
[24] Michael McLeod and Ofer Lahav. The Two Body Problem in the Presence of Dark Energy and Modified Gravity: Application to the Local Group. arXiv e-prints, art. arXiv:1903.10849, March 2019.
[25] David Benisty, Eduardo I. Guendelman, and Ofer Lahav. Milky Way and Andromeda past-encounters in different gravity models: the impact on the estimated Local Group mass. arXiv e-prints, art. arXiv:1904.03153, April 2019.
[26] P. J. E. Peebles. Orbits of the Nearby Galaxies. ApJ, 429:43, July 1994. doi:10.1086/174301.
[27] Donald B. Rubin. Bayesianly justifiable and relevant frequency calculations for the applied statistician.
The Annals of Statistics, pages 1151–1172, 1984.
[28] Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. Ann. Math. Statist., (3):832–837, 09. doi:10.1214/aoms/1177728190.
[29] Emanuel Parzen. On estimation of a probability density function and mode. Ann. Math. Statist., (3):1065–1076, 09. doi:10.1214/aoms/1177704472.
[30] J. S. Simonoff. Smoothing Methods in Statistics. Springer Series in Statistics. Springer, 1996. ISBN 9780387947167. URL https://books.google.co.uk/books?id=wFTgNXL4feIC.
[31] Christopher M. Bishop. Mixture density networks. Technical report, 1994.
[32] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.
[33] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
[34] E. Cameron and A. N. Pettitt. Approximate Bayesian Computation for astronomical model analysis: a case study in galaxy demographics and morphological transformation at high redshift. MNRAS, 425(1):44–65, September 2012. doi:10.1111/j.1365-2966.2012.21371.x.
[35] Anja Weyant, Chad Schafer, and W. Michael Wood-Vasey. Likelihood-free Cosmological Inference with Type Ia Supernovae: Approximate Bayesian Computation for a Complete Treatment of Uncertainty. ApJ, 764(2):116, February 2013. doi:10.1088/0004-637X/764/2/116.
[36] Joël Akeret, Alexandre Refregier, Adam Amara, Sebastian Seehars, and Caspar Hasner. Approximate Bayesian Computation for Forward Modeling in Cosmology.
JCAP, 08:043, 2015. doi:10.1088/1475-7516/2015/08/043.
[37] ChangHoon Hahn, Mohammadjavad Vakili, Kilian Walsh, Andrew P. Hearin, David W. Hogg, and Duncan Campbell. Approximate Bayesian computation in large-scale structure: constraining the galaxy–halo connection. Mon. Not. Roy. Astron. Soc., 469(3):2791–2805, 2017. doi:10.1093/mnras/stx894.
[38] Austin Peel, Chieh-An Lin, Francois Lanusse, Adrienne Leonard, Jean-Luc Starck, and Martin Kilbinger. Cosmological constraints with weak lensing peak counts and second-order statistics in a large-field survey. Astron. Astrophys., 599:A79, 2017. doi:10.1051/0004-6361/201629928.
[39] T. Kacprzak, J. Herbel, A. Amara, and A. Réfrégier. Accelerating Approximate Bayesian Computation with Quantile Regression: application to cosmological redshift distributions. JCAP, 02:042, 2018. doi:10.1088/1475-7516/2018/02/042.
[40] Justin Alsing, Benjamin Wandelt, and Stephen Feeney. Massive optimal data compression and density estimation for scalable, likelihood-free inference in cosmology. MNRAS, 477(3):2874–2885, July 2018. doi:10.1093/mnras/sty819.
[41] Justin Alsing and Benjamin Wandelt. Generalized massive optimal data compression. MNRAS, 476(1):L60–L64, May 2018. doi:10.1093/mnrasl/sly029.
[42] Alan F. Heavens, Elena Sellentin, and Andrew H. Jaffe. Extreme data compression while searching for new physics. Monthly Notices of the Royal Astronomical Society, 498(3):3440–3451, 2020.
[43] George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 837–848. PMLR, 2019.
[44] Jan-Matthis Lueckmann, Giacomo Bassetto, Theofanis Karaletsos, and Jakob H. Macke. Likelihood-free inference with emulator networks. In Symposium on Advances in Approximate Bayesian Inference, pages 32–53. PMLR, 2019.
[45] Alexander Knebe, Steffen R. Knollmann, Stuart I. Muldrew, Frazer R. Pearce, and Miguel Angel Aragon-Calvo et al. Haloes gone MAD: The Halo-Finder Comparison Project. MNRAS, 415(3):2293–2318, August 2011. doi:10.1111/j.1365-2966.2011.18858.x.
[46] Stephen Holland. The Distance to the M31 Globular Cluster System.
Astron. J., 115:1916, 1998. doi:10.1086/300348.
[47] Y. C. Joshi, A. K. Pandey, D. Narasimha, R. Sagar, and Y. Giraud-Hiraud. Identification of 13 Cepheids and 333 other variables in M 31. Astron. Astrophys., 402:113–123, 2003. doi:10.1051/0004-6361:20030136.
[48] Ignasi Ribas, Carme Jordi, Francesc Vilardell, Edward L. Fitzpatrick, Ron W. Hilditch, and Edward F. Guinan. First determination of the distance and fundamental properties of an eclipsing binary in the Andromeda galaxy. Astrophys. J. Lett., 635:L37–L40, 2005. doi:10.1086/499161.
[49] Alan McConnachie and Mike Irwin. Structural parameters for the M31 dwarf spheroidals. Mon. Not. Roy. Astron. Soc., 365:1263, 2006. doi:10.1111/j.1365-2966.2005.09806.x.
[50] Roeland P. van der Marel, Mark A. Fardal, Sangmo Tony Sohn, Ekta Patel, Gurtina Besla, Andrés del Pino, Johannes Sahlmann, and Laura L. Watkins. First Gaia Dynamics of the Andromeda System: DR2 Proper Motions, Orbits, and Rotation of M31 and M33. ApJ, 872(1):24, February 2019. doi:10.3847/1538-4357/ab001b.
[51] Roeland P. van der Marel, Gurtina Besla, T. J. Cox, Sangmo Tony Sohn, and Jay Anderson. The M31 Velocity Vector. III. Future Milky Way M31-M33 Orbital Evolution, Merging, and Fate of the Sun. ApJ, 753(1):9, July 2012. doi:10.1088/0004-637X/753/1/9.
[52] Roeland P. van der Marel and Puragra Guhathakurta. M31 Transverse Velocity and Local Group Mass from Satellite Kinematics. Astrophys. J., 678:187, 2008. doi:10.1086/533430.
[53] John Skilling et al. Nested sampling for general bayesian computation. Bayesian analysis, 1(4):833–859, 2006.
[54] W. J. Handley, M. P. Hobson, and A. N. Lasenby. POLYCHORD: nested sampling for cosmology. MNRAS, 450:L61–L65, June 2015. doi:10.1093/mnrasl/slv047.
[55] W. J. Handley, M. P. Hobson, and A. N. Lasenby. POLYCHORD: next-generation nested sampling. MNRAS, 453:4384–4398, November 2015. doi:10.1093/mnras/stv1911.
[56] Will Handley. anesthetic: nested sampling visualisation. The Journal of Open Source Software, 4(37):1414, June 2019. doi:10.21105/joss.01414.
[57] Travis O'Brien, Karthik Kashinath, Nicholas Cavanaugh, William Collins, and John O'Brien. A fast and objective multidimensional kernel density estimation method: fastKDE. Computational Statistics & Data Analysis, 101, 03 2016. doi:10.1016/j.csda.2016.02.014.
[58] Travis O'Brien, William Collins, Sara Rauscher, and Todd Ringler. Reducing the computational cost of the ECF using a NUFFT: A fast and objective probability density estimation method. Computational Statistics & Data Analysis, 79, 11 2014. doi:10.1016/j.csda.2014.06.002.
[59] William H. Press and Paul Schechter. Formation of Galaxies and Clusters of Galaxies by Self-Similar Gravitational Condensation. ApJ, 187:425–438, February 1974. doi:10.1086/152650.
[60] Benedikt Diemer. COLOSSUS: A python toolkit for cosmology, large-scale structure, and dark matter halos.
Astrophys. J. Suppl., 239(2):35, 2018. doi:10.3847/1538-4365/aaee8c.
[61] Jeremy L. Tinker, Andrey V. Kravtsov, Anatoly Klypin, Kevork Abazajian, Michael S. Warren, Gustavo Yepes, Stefan Gottlöber, and Daniel E. Holz. Toward a halo mass function for precision cosmology: The Limits of universality. Astrophys. J., 688:709–728, 2008. doi:10.1086/591439.
[62] John Skilling. Nested sampling for general bayesian computation. Bayesian Anal., (4):833–859, 12. doi:10.1214/06-BA127.
[63] Planck Collaboration. Planck 2018 results. VI. Cosmological parameters. A&A, 641:A6, 2020. doi:10.1051/0004-6361/201833910.
[64] Noam I. Libeskind, Edoardo Carlesi, Robert J. J. Grand, Arman Khalatyan, Alexander Knebe, Ruediger Pakmor, Sergey Pilipenko, Marcel S. Pawlowski, Martin Sparre, Elmo Tempel, Peng Wang, Hélène M. Courtois, Stefan Gottlöber, Yehuda Hoffman, Ivan Minchev, Christoph Pfrommer, Jenny G. Sorce, Volker Springel, Matthias Steinmetz, R. Brent Tully, Mark Vogelsberger, and Gustavo Yepes. The HESTIA project: simulations of the Local Group. MNRAS, 498(2):2968–2983, August 2020. doi:10.1093/mnras/staa2541.
[65] J. D. Diaz, S. E. Koposov, M. Irwin, V. Belokurov, and N. W. Evans. Balancing mass and momentum in the Local Group. MNRAS, 443(2):1688–1703, September 2014. doi:10.1093/mnras/stu1210.
[66] Dennis Zaritsky and Helene Courtois. A dynamics-free lower bound on the mass of our Galaxy. MNRAS, 465(3):3724–3728, March 2017. doi:10.1093/mnras/stw2922.
[67] Kohei Hattori, Monica Valluri, Eric F. Bell, and Ian U. Roederer. Old, Metal-Poor Extreme Velocity Stars in the Solar Neighborhood. Astrophys. J., 866(2):121, 2018. doi:10.3847/1538-4357/aadee5.
[68] Lorenzo Posti and Amina Helmi. Mass and shape of the Milky Way's dark matter halo with globular clusters from Gaia and Hubble. A&A, 621:A56, January 2019. doi:10.1051/0004-6361/201833355.
[69] Laura L. Watkins, Roeland P. van der Marel, Sangmo Tony Sohn, and N. Wyn Evans. Evidence for an Intermediate-mass Milky Way from Gaia DR2 Halo Globular Cluster Motions. ApJ, 873(2):118, March 2019. doi:10.3847/1538-4357/ab089f.
[70] Ekaterina V. Karukes, Maria Benito, Fabio Iocco, Roberto Trotta, and Alex Geringer-Sameth. A robust estimate of the Milky Way mass from rotation curve data. JCAP, 05:033, 2020. doi:10.1088/1475-7516/2020/05/033.
[71] E. Corbelli, S. Lorenzoni, R. Walterbos, R. Braun, and D. Thilker. A wide-field H I mosaic of Messier 31. II. The disk warp, rotation, and the dark matter halo. A&A, 511:A89, February 2010. doi:10.1051/0004-6361/200913297.
[72] A. Tamm, E. Tempel, P. Tenjes, O. Tihhonova, and T. Tuvikene. Stellar mass map and dark matter distribution in M 31. A&A, 546:A4, October 2012. doi:10.1051/0004-6361/201220065.
[73] Prajwal R. Kafle, Sanjib Sharma, Geraint F. Lewis, Aaron S. G. Robotham, and Simon P. Driver. The Need for Speed: Escape velocity and dynamical mass measurements of the Andromeda galaxy. Mon. Not. Roy. Astron. Soc., 475(3):4043–4054, 2018. doi:10.1093/mnras/sty082.
[74] Jorge Peñarrubia, Facundo A. Gómez, Gurtina Besla, Denis Erkal, and Yin-Zhe Ma. A timing constraint on the (total) mass of the Large Magellanic Cloud. MNRAS, 456(1):L54–L58, February 2016. doi:10.1093/mnrasl/slv160.
[75] Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pages 1927–1935, 2015.
[76] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. arXiv e-prints, art. arXiv:1701.05517, January 2017.
[77] Antony Lewis. GetDist: a Python package for analysing Monte Carlo samples. 2019. URL https://getdist.readthedocs.io.
[78] S. R. Hinton. ChainConsumer. The Journal of Open Source Software, 1:00045, August 2016. doi:10.21105/joss.00045.
[79] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977. ISSN 00359246.
[80] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
[81] Lord Rayleigh. The problem of the random walk.