[PDF] Estimating the Galactic Mass Profile in the Presence of Incomplete Data

Abstract

A powerful method to measure the mass profile of a galaxy is through the velocities of tracer particles distributed through its halo. Transforming this kind of data accurately to a mass profile M(r), however, is not a trivial problem. In particular, limited or incomplete data may substantially affect the analysis. In this paper we develop a Bayesian method to deal with incomplete data effectively; we have a hybrid-Gibbs sampler that treats the unknown velocity components of tracers as parameters in the model. We explore the effectiveness of our model using simulated data, and then apply our method to the Milky Way using velocity and position data from globular clusters and dwarf galaxies. We find that in general, missing velocity components have little effect on the total mass estimate. However, the results are quite sensitive to the outer globular cluster Pal 3. Using a basic Hernquist model with an isotropic velocity dispersion, we obtain credible regions for the cumulative mass profile M(r) of the Milky Way, and provide estimates for the model parameters with 95 percent Bayesian credible intervals. The mass contained within 260 kpc is 1.37x10^12 solar masses, with a 95 percent credible interval of (1.27,1.51)x10^12 solar masses. The Hernquist parameters for the total mass and scale radius are 1.55 (+0.18/-0.13)x10^12 solar masses and 16.9 (+4.8/-4.1) kpc, where the uncertainties span the 95 percent credible intervals. The code we developed for this work, Galactic Mass Estimator (GME), will be available as an open source package in the R Project for Statistical Computing.

Full PDF

DDraft version September 16, 2018

Preprint typeset using L A TEX style emulateapj v. 5/2/11

ESTIMATING THE GALACTIC MASS PROFILE IN THE PRESENCE OF INCOMPLETE DATA

Gwendolyn M. Eadie , William E. Harris , and Lawrence M. Widrow Draft version September 16, 2018

ABSTRACTA powerful method to measure the mass proﬁle of a galaxy is through the velocities of tracer particlesdistributed through its halo. Transforming this kind of data accurately to a mass proﬁle M ( r ),however, is not a trivial problem. In particular, limited or incomplete data may substantially aﬀectthe analysis. In this paper we develop a Bayesian method to deal with incomplete data eﬀectively; wehave a hybrid-Gibbs sampler that treats the unknown velocity components of tracers as parameters inthe model. We explore the eﬀectiveness of our model using simulated data, and then apply our methodto the Milky Way using velocity and position data from globular clusters and dwarf galaxies. We ﬁndthat in general, missing velocity components have little eﬀect on the total mass estimate. However,the results are quite sensitive to the outer globular cluster Pal 3. Using a basic Hernquist model withan isotropic velocity dispersion, we obtain credible regions for the cumulative mass proﬁle M ( r ) of theMilky Way, and provide estimates for the model parameters with 95% Bayesian credible intervals. Themass contained within 260kpc is 1 . × M (cid:12) , with a 95% credible interval of (1 . , . × M (cid:12) .The Hernquist parameters for the total mass and scale radius are 1 . +0 . − . × M (cid:12) and 16 . +4 . − . kpc,where the uncertainties span the 95% credible intervals. The code we developed for this work, GalacticMass Estimator (GME), will be available as an open source package in the R Project for StatisticalComputing. Subject headings:

Milky Way — galaxies: dark matter — galaxies: halos — galaxies: satellites INTRODUCTIONAlmost every galaxy in the universe is assumed to re-side in a massive, dark matter halo that can extend farbeyond the visible components of the galaxy. Standardmethods to determine the mass distribution within vis-ible portions of galaxies are based on rotation curves orvelocity dispersion proﬁles. The former is applicable tospiral galaxies, while the latter method is used mainly forelliptical galaxies or disk galaxy bulges. At large radii,where the mass distribution is presumably dominated bydark matter, one can use observations of kinematic trac-ers to learn about a galaxy’s mass proﬁle.The Milky Way (MW) has many distant satellites, suchas globular clusters, halo stars, planetary nebulae, anddwarf galaxies. The kinematic properties of these satel-lites can be used to learn about the gravitational poten-tial of the whole system, and thus the Galaxy’s massproﬁle out to large radii.One way to use the kinematic data of tracers to es-timate the mass distribution is via a mass estimator , amethod ﬁrst suggested by Hartwick & Sargent (1978)and a method that avoids using an explicit model. Themethod has endured partly because it uses only the lineof sight velocities and positions of the tracers. It is use-ful when no proper motions are available and conversionto a Galactocentric refrence frame is impossible. Nowa-days, however, many MW tracers have proper motionmeasurements, and more continue to become available.Thus, it would be beneﬁcial to have a method whichcan incorporate both complete data (i.e. tracers with 3- Dept. of Physics and Astronomy, McMaster University,Hamilton, ON L8S 4M1, Canada. Dept. of Physics, Engineering Physics & Astronomy, Queen’sUniversity, Kingston, ON K7L 3N6, Canada. dimensional space motions in the Galactocentric frame)and incomplete data (i.e. tracers with only line of sightvelocities). In this paper we introduce such a method ofmass estimation.The method used here is based on the phase-space dis-tribution function (DF), which is a probability distribu-tion for a satellite in terms of its position r and veloc-ity v in the Galactocentric reference frame. Using theDF for a model and a Bayesian approach to the anal-ysis, we obtain probability distributions for the modelparameters (and thus the mass). The method was ﬁrstsuggested by Little & Tremaine (1987), who showed howto use the DF and a Bayesian approach to estimate themass speciﬁcally for the Milky Way. Since then, otherstudies have used the Bayesian approach or MaximumLikelihood methods for Galactic mass estimation (e.g.Kulessa & Lynden-Bell 1992; Kochanek 1996; Wilkinson& Evans 1999; Widrow et al. 2008). Deriving analyticand physically relevant DFs has been explored by Hern-quist (1990), Cuddeford (1991), Ciotti (1996), Widrow(2000), and Evans & Williams (2014), to name a few.A DF’s derivation and ﬁnal form is, by default, in theGalactocentric reference frame, but previous studies havere-written DFs in terms of the line-of-sight velocity com-ponent only, in order to incorporate incomplete data (e.g.Wilkinson & Evans 1999). This does not take full ad-vantage of the complete data that is available, which isan issue when the method may be susceptible to largeuncertainties due to small sample size (as discussed byZaritsky et al. 1989). Furthermore, rewriting the DF interms of the line-of-sight velocity can be mathematicallydiﬃcult. In our study, we introduce a generalized ap-proach via the Bayesian framework, whereby it is easyto incorporate complete and incomplete data simultane-ously, and also unnecessary to rewrite the DF in terms a r X i v : . [ a s t r o - ph . GA ] M a r Eadie, Harris, and Widrowof the line-of-sight velocity.The purpose of this paper is to lay out the methodfully, and set the groundwork for future studies with arange of DFs and datasets. We also test the method withsimulated data, and do some preliminary analysis withthe method as it applies to the Milky Way.The outline of this paper is as follows. Section 2.1brieﬂy describes the theoretical background of DFs, andSection 2.2 introduces two models that already have an-alytic DFs, which we use in this work. Next, in Sec-tion 2.3, we review how Bayes’ Theorem can be used withthe DFs to obtain parameter estimates of the model. Sec-tions 2.4, 2.5, and 2.6 introduce the method for incorpo-rating both complete and incomplete data via a hybrid-Gibbs Sampler, and Section 2.7 discusses the techniquesused to assess convergence of the Markov chains.We ﬁrst apply and test our method on simulated data(described in Section 3) and then apply the method tosome kinematic data of satellites orbiting the Milky Way(described in Section 4). The results of these analyses arepresented and discussed in Sections 5 and 6 respectively.Many future prospects are also discussed in Section 6. BACKGROUNDThis section provides background information and no-tation about distribution functions, the models we usein our analysis, and the methods of Bayesian inferenceas they apply to the current problem. Additional detailsand discussion can be found in Eadie (2013).2.1.

Distribution Function

The Distribution Function (DF), f ( r , v ), is a probabil-ity density function that gives the probability of ﬁnding aparticle with a position r and velocity v within a phase-space volume element d r d v (Binney & Tremaine 2008).Like any probability density, the DF integrates to one: (cid:90) f ( r , v ) d r d v = 1 . (1)Eq. 1 is often renormalized so that the DF integrates toa quantity of interest, such as the total mass M tot : (cid:90) f ( r , v ) d r d v = M tot . (2)However, in a Bayesian framework the DF as deﬁned ineq. 1 is used, and thus the left-hand-side of eq. 2 is di-vided by M tot . Thus, it is important that models have aﬁnite mass in a Bayesian analysis— if the mass is inﬁnite,then the DF is not a proper probability distribution.A DF can be speciﬁed by use of Jeans’ Theorem, whichstates that any solution of the time-independent colli-sionless Boltzmann equation is a function of the phase-space coordinates ( r , v ) only. In a time-independent sys-tem, the Hamiltonian H = v + Φ( r ) is always an inte-gral of motion, and if the system is also spherical then themagnitude of the angular momentum, L , is an integral ofmotion too. Therefore, any non-negative function f ( H )or f ( H, L ) will be a solution to the time-independentcollisionless Boltzmann equation, and thus a DF for thesystem. Whether or not f is a function of H or both H and L determines the velocity dispersion of the system; f ( H ) corresponds to an isotropic system, and f ( H, L )corresponds to an anisotropic system. In practice, a DF corresponding to an isotropic, spher-ical, self-consistent system is usually written in terms of E , the relative energy per unit mass, deﬁned as E = − v r ) (3)where v is the speed of a particle at a distance r fromthe center of the system, and Ψ( r ) is the relative gravita-tional potential of the system at r , as deﬁned in Binney& Tremaine (2008). Particles with E ≤ f = 0. If the system has an anisotropic ve-locity dispersion, then the DF is written as a function ofboth E and the angular momentum L .2.2. Models

Models with analytic DFs are preferable to empiri-cal distribution functions in theoretical analyses becausethey allow for easy sampling of the distribution, and alsosave computation time by avoiding numerical integra-tion. Finding a DF that models a realistic galaxy is a dif-ﬁcult task, however, because galaxies are often composedof multiple subsystems such as a bulge, a stellar halo, adark matter halo, and possibly a disk. Finding a singlephase-space distribution function that is self-consistent,analytic, and that describes the intricate features of agalaxy is very challenging.An empirical luminosity proﬁle that has been success-ful in ﬁtting the surface brightness proﬁles of ellipticalgalaxies and bulges is the de Vaucouleurs (1948) R / proﬁle. A generalization of the R / proﬁle is R /n ,which was introduced by Sersic (1968). Due to the suc-cess of R / , theorists have tried to develop distributionfunctions that can reproduce the proﬁle. The analyticmodels introduced by Jaﬀe (1983) and Hernquist (1990)ﬁt the R / type galaxies well for most radii.In this work, primarily for the purpose of testing themethod, we use the Hernquist and Jaﬀe models becauseof their analytic simplicity and their ubiquity. The Hern-quist model also has the beneﬁt of having more than oneanalytic DF— one has an isotropic velocity dispersion,and there are a few that are anisotropic. Furthermore,both the Hernquist and Jaﬀe models are self-consistentand have a ﬁnite total mass, making consistent numeri-cal computations feasible. For these reasons, we use theHernquist-style models to lay out the methodology ofour Bayesian approach and the derivation of mass pro-ﬁle credible regions. In future work, we will extend themethod to include models with non-analytic DFs andnon-ﬁnite mass distributions, such as the Navarro et al.(1996) (NFW) model.Hernquist (1990) introduced a halo model that is a self-consistent, analytic potential-density pair. With G ≡ r ) = − M tot r + a (4)and the mass density proﬁle is ρ ( r ) = aM tot πr ( r + a ) (5)where M tot is the total mass of the system, and a is thescale radius. Integrating over a sphere, the Hernquistcumulative mass proﬁle is then, M ( r ) = M tot r ( r + a ) (6)Hernquist (1990) provides two DFs for their modelthat are written in terms of elementary functions: onefor an isotropic velocity dispersion, and a second for ananisotropic velocity dispersion of the Osipkov (1979) andMerritt (1985) type (hereafter OM-type). The OM-typeallows the anisotropy to vary as a function of r , and in-cludes a constant parameter called the anisotropy radius r a : β ( r ) = r r + r a (7)The parameter r a controls the degree of radial anisotropyin the system at diﬀerent radii. As r a → ∞ , β ( r ) → β = 0 .

5, which was derived by Evans &An (2006). We consider this model because recent re-search by Deason et al. (2013) showed that blue horizon-tal branch (BHB) stars have a radially biased velocityanisotropy of 0 . β may be approximately constant for most of the stellarhalo.We also consider the isotropic Jaﬀe (1983) model inthis work, which has mass proﬁle and potential M J ( r ) = M tot ln (1 + a J /r ) a J (8)Φ J ( r ) = − M tot πa J r (1 + r/a J ) (9)where a J is the scale radius.Overall, we ﬁt the Milky Way data to four dif-ferent models: three Hernquist models with diﬀerentanisotropies (isotropic, OM-type, constant anisotropy( β = 0 . Bayes Theorem and Parameter Estimation

Bayes’ Theorem is named after Thomas Bayes (1701-1761) and was introduced posthumously by RichardPrice (Bayes & Price 1763). Using the rules of condi-tional probabilities, Bayes showed that the conditionalprobability p ( A | B ) is, p ( A | B ) = p ( B | A ) p ( A ) p ( B ) (10)now known as Bayes’ Theorem.Bayesian inference involves using eq. 10 in data anal-ysis to obtain probability distributions about model pa-rameters. Bayesian inference returns a probability dis-tribution for parameters given the data and a prior dis-tribution on the parameters.When Bayes’ Theorem is used for Bayesian inference,it is re-written in terms of the vector of model param-eters θ and the data y , and is sometimes referred to asBayes’ rule. The Bayesian posterior probability for θ ,given some data y , is then p ( θ | y ) = p ( y | θ ) p ( θ ) p ( y ) (11) where p ( y | θ ) is called the likelihood , p ( θ ) is the priorprobability on the parameters, and p ( y ) is the marginalprobability of the data. Because the marginal probabilitydoes not depend on θ , and with ﬁxed y can be considereda constant, it is common practice to sample the unnor-malized posterior probability, p ( θ | y ) ∝ p ( y | θ ) p ( θ ) . (12)The pioneering work by Bahcall & Tremaine (1981)showed that the DF determines the likelihood p ( y | θ ).For example, the probability of the ﬁrst satellite in ourdata set having r and v , given the model parameters,is f ( r , v | θ ). Assuming all n satellites in the data setare independent, the probability of their correspondingpositions and velocities is the product of the DFs, andthus the likelihood is p ( y | θ ) = n (cid:89) i =1 f ( r i , v i | θ ) . (13)Therefore, eq 12 becomes p ( θ | y ) ∝ n (cid:89) i =1 f ( r i , v i | θ ) p ( θ ) . (14)Sampling eq. 14 is usually done via a Markov ChainMonte Carlo (MCMC) method, which creates a Markovchain— a sequence of random variables, θ t , where t =1 , , ... represents the position in the chain. Every ran-dom variable in the chain depends only on the variablebefore it, θ t − (Gelman et al. 2003). When MCMC algo-rithms are used to sample a Bayesian posterior density,then by construction, the Markov chain is a collection ofparameter vectors that have the same stationary distri-bution as the posterior (eq. 11).We apply the Metropolis algorithm (Metropolis &Ulam 1949; Metropolis et al. 1953) to sample eq. 14. TheMetropolis algorithm is iterative and creates a Markovchain whose stationary distribution is proportional to theBayesian posterior probability in question. The Markovchains in this work are constructed as follows (Gelmanet al. 2003):1. Draw a trial value θ ∗ from a symmetric proposaldistribution2. Calculate d = p ( θ ∗ | y ) p ( θ t − | y )

3. If d >

1, then accept θ ∗ as θ t (a) set θ t = θ ∗ (b) return to step 14. If instead d <

1, then accept θ ∗ with probability d (a) draw a random number z from the uniformdistribution U (0 , d > z then accept θ ∗ as in step 3, and returnto step 1(c) if d < z then reject θ ∗ i. set θ t = θ t − ii. return to step 1 Eadie, Harris, and WidrowAccepting θ ∗ only when d > z ensures that the θ valuesare accepted with probability proportional to the poste-rior, provided that the chain has converged to the tar-get distribution. The above process is repeated N times,resulting in a Markov chain with N values of θ whichrepresents samples from the posterior.Because a Bayesian analysis leads to distributions forparameters, the results are arguably easier to interpret.The Markov chains can be used to acquire estimates,uncertainties, and probabilities pertaining to model pa-rameters. Furthermore, the uncertainties can be carriedforward to subsequent modeling and analysis.A Bayesian analysis can also easily include nuisanceparameters— parameters in the model whose values areunknown but not necessarily of interest to the researcher.This feature turns out to be useful in our current problemof galaxy mass estimation; sometimes we do not knowthe tangential velocity component of a satellite object,but we can treat that unknown component as a nuisanceparameter in the model.2.4. Nuisance ( v t ) Parameters In a Galactocentric coordinate system, the total speedof a satellite in orbit around the Galaxy can be written v = (cid:113) v r + v t (15)where v r and v t are the radial and tangential componentsrespectively. In turn, v t = v φ + v θ . (16)The total speed of a satellite is needed for the DF f ( r, v | θ ) in the likelihood of Bayes’ theorem. However,proper motion measurements are not available for allsatellites. In many cases, distant tracers of the MW haveonly line-of-sight velocity measurements with respect toour position in the Galaxy. We want to use this satel-lite information in our analyses, but without a propermotion, the line-of-sight velocity in the local standard ofrest frame does not give us v r or v t in Galactocentric co-ordinates. For very distant objects, the line-of-sight ve-locity is approximately v los ≈ v r , since the angle createdby the location of the Sun, the satellite, and the centerof the galaxy is quite small. However, we still have novalue for v t . If we treat these unknown v t ’s as nuisanceparameters in the model, then Bayes’ rule reads, p ( θ | y ) ∝ n (cid:89) i =1 f ( r i , v r,i ) | θ , v t,i ) p ( θ ) p ( v t,i ) (17)where p ( v t,i ) is the prior probability on the tangentialvelocity of the i th satellite, to be discussed in section 2.6.2.5. The Gibbs Sampler

When nuisance ( v t ) parameters are present, we use aGibbs sampler, which was ﬁrst introduced by Geman &Geman (1984) in the area of image processing, and thenadapted to iterative simulations in the study of statisticsby Tanner & Wong (1987). Gelfand & Smith (1990) thenshowed how to apply it to Bayesian inference. Since then,the Gibbs sampler has been applied to many problems(Gelman et al. 2003, and references therein). The Gibbs sampler is sometimes called alternatingconditional sampling, and can be very useful in multi-dimensional problems where θ = ( θ , ..., θ n ) (Gelmanet al. 2003). The Gibbs algorithm samples each of the pa-rameters ( θ , ..., θ n ) one at a time, based on the currentvalue of all of the other parameters and the conditionalprobability given those parameters.Consider θ i , the i th parameter in a model with n pa-rameters, and let θ − i represent all of the other parame-ters. Next, let t be the t th iteration of the chain— thechain that will be a sample of the posterior distribution p ( θ | y ). In the Gibbs sampler, each θ i is sampled oneat a time based on its conditional probability given thecurrent values of all of the other parameters, p ( θ i | θ t − − i , y ) (18)where θ t − − i represents the other parameters at their cur-rent value, θ t − − i = ( θ t , ..., θ ti − , θ t − i +1 , ..., θ t − n ) . In our work, we do not directly sample the conditionaldistributions (eq. 18) because they are usually not avail-able. Instead, we use a Metropolis step to update theconditional distributions. Thus, we employ a hybrid-Gibbs sampler: we use a symmetric proposal distribu-tion, so that the accept/reject condition of the trial pa-rameter θ ∗ i follows the same algorithm as that describedin section 2.3, except d is now d = p (cid:0) θ ∗ i | θ t − − i , y (cid:1) p (cid:0) θ t − i | θ t − − i , y (cid:1) (Gelman et al. 2003).In the problem at hand, the hybrid-Gibbs sampleris more eﬃcient than a standard Metropolis algorithm.The latter method samples all parameters simultane-ously, while the former samples parameters individually.Under the Metropolis algorithm, if even one parametersuggestion is highly improbable, then the entire vectorof parameters is likely to be rejected. Therefore, a high-dimensional Markov chain may take an extremely longtime to walk through parameter space and converge tothe posterior distribution. By contrast, the hybrid-Gibbssampler is much more eﬃcient in our high-dimensionalMarkov chain (there are 2 model parameters, and 44 tan-gential velocity parameters for the Milky Way data dis-cussed below). The parameters M tot and a are sampledsimultaneously, based on the current v t parameters, andthe v t parameters are sampled individually based on thecurrent values of all the other parameters.Using the hybrid-Gibbs sampler for the v t ’s allows usto obtain a probability distribution for each v t param-eter eﬃciently. We can look at the probability distri-bution for each v t and make a prediction of the mostprobable v t value for each satellite. Although we ﬁndthat the resulting v t distributions are quite diﬀuse, anda meaningful prediction of v t cannot be made from them,the hybrid-Gibbs sampler method is nevertheless eﬃcientand in general does not aﬀect the mass estimate of theGalaxy (as will be shown below). If we assume that thesatellite is bound to the galaxy, then by setting eq. 3 tozero we obtain an upper limit on the tangential velocity, v t,max = (cid:112) r ) − v r . (19)2.6. Prior Probabilities

In a Bayesian analysis, the choice of a prior can bethought of as a chance for the researcher to state plainlyand explicitly the prior assumptions. When little isknown about the problem at hand, it is common to use a noninformative prior, so that the information containedin the likelihood is not overwhelmed by information con-tained in the prior.In this preliminary analysis, we use uniform priors forall of the model parameters because we assume littleabout the mass and scale of the system. The uniformprior for each parameter θ is, p ( θ ) = 1 θ max − θ min (20)where θ min and θ max are the lower and upper bounds ofthe uniform distribution. In practice, the parameters aresampled in the natural log-space to ensure that the totalmass and scale radius are always positive. The Markovchain values are then exponentiated before examinationof the posterior. The bounds ( θ min , θ max ) that we usefor M tot and a are on the order of (10 , ) M (cid:12) and(10 − , )kpc respectively.When the tangential velocities are treated as nuisanceparameters and sampled in the Markov chain, they toorequire a prior. The tangential velocity is a 2-dimensionalvector on the plane of the sky, so we would like the prioron v t to be uniform in v t . Because we are sampling v t and not the squared tangential velocity, the uniform prioron v t needs to be transformed to one for v t .Here we use Jeﬀreys’ invariance principle (Jeﬀrey1939). Suppose a parameter θ has a prior distribu-tion p ( θ ), and that a one-to-one transformation is sub-sequently performed on θ such that φ = h ( θ ). Then theJeﬀrey’s prior for φ which expresses the same belief asthat of p ( θ ) is p ( φ ) = p ( θ ) (cid:12)(cid:12)(cid:12)(cid:12) dθdφ (cid:12)(cid:12)(cid:12)(cid:12) = p ( θ ) | h (cid:48) ( θ ) | − (21)(Gelman et al. 2003).Following equation 21, let θ = v t and φ = v t , so that h ( θ ) = (cid:112) v t . Then the prior on v t is given by p ( v t ) = 2 v t v t,max − v t,min . (22)The minimum tangential velocity, v t,min , is zero, and themaximum tangential velocity, v t.max , is a large constant.Note that if a value of v t that makes a particle unboundedis suggested in the Markov chain, it will be rejected viathe likelihood.2.7. MCMC chains and Assessing convergence

The computer code created for this research is writ-ten for the R Project for Statistical Computing ( R ), anopen source software environment for statistical comput-ing and graphics (R Development Core Team 2012) withmany well-developed and eﬃcient statistical diagnostic tools. Recently, R has gained popularity in astronomyand the ﬁeld of astrostatistics (e.g. Feigelson & Babu2012), and our code is being developed into an R pack-age, called Galactic Mass Estimator (hereafter GME).GME takes data of the form ( r, v r , v t ) in Galactocen-tric coordinates, allows the user to select one of ﬁve DFs,constructs three Markov chains in parallel, and then com-bines them into a ﬁnal, single chain after convergenceconditions have been met. The result is a single Markovchain that represents samples from the posterior distri-bution for the model parameters, given the data. TheSNOW package (Tierney et al. (2013)) is used for paral-lel computing, and the CODA package (Plummer et al.2006) is used for convergence diagnostics.Many diagnostics to assess convergence of a Markovchain have been developed, and most of these methodsuse multiple chains. One advantage of running multiplechains is that the initial values can be dispersed widelyin the parameter space, and then convergence can beapproximated when they appear to reach a common sta-tionary distribution. This approach allows more explo-ration of the parameter space and makes it less likely fora local maximum to be mistaken for the mode of the pos-terior. Furthermore, using multiple chains on the samedata set allows estimates of convergence to be obtainedin a more reliable and quantitative manner than a singlechain. Gelman & Rubin (1992) suggest using the statis-tic (cid:98) R to assess the mutual convergence of parallel chains: (cid:98) R = (cid:115) (cid:99) var + ( ψ | y ) W . (23)In equation 23, (cid:99) var + ( ψ | y ) is the marginal posterior vari-ance of the estimand [parameter] , which is essentially aweighted average of the within-chain variance W , andbetween-chain variance B (see Gelman et al. 2003, formore details). In practice, (cid:98) R is calculated for each pa-rameter separately; to assume convergence, Gelman et al.(2003) recommend a value of (cid:98) R < . (cid:98) R statistic isavailable in the R Software Statistical Computing Lan-guage in the CODA package (Plummer et al. 2006).In this work, the three Markov chains’ starting valuesare widely dispersed in the parameter space, and eachchain is run for i = 10 iterations. After this initialrun, (cid:98) R is calculated for each parameter. If any of theparameters have (cid:98) R > .

1, then it is assumed that thechains have not converged, and the last parameter valuesin each chain are used as the initial parameters in threenew chains. The process is repeated until all parametersacross all three chains have (cid:98)

R < .

1, at which pointconvergence is assumed. The ﬁnal sample of the posteriordistribution is created by combining the last halves ofthe three chains, thus providing 1500 parameter vectorsamples. Prior to this ﬁnal step, however, we check theeﬀective sample sizes of the Markov chains.In general, the draws in a Markov chain are not trulyindependent; some autocorrelation exists in the sequenceof samples (Gelman et al. 2003). The eﬀective samplesize is the equivalent number of independent samples: n eﬀ = mn (cid:99) var + ( ψ | y ) B (24) Eadie, Harris, and Widrowwhere m is the number of parallel Markov chains and n is the number of draws in each chain (Gelman et al.2003). If the draws in all chains were perfectly indepen-dent, then the number of independent draws would be mn . However, the draws within a chain are general au-tocorrelated, and so B > (cid:99) var + ( ψ | y ), and n eff < mn .An eﬀective sample size of at least 100 is necessary toobtain reliable ﬁrst-moment statistics such as the meanand median, while an eﬀective sample size over 200 isneeded for second-order moments.In our code, we use the eﬀectiveSize function in theCODA package (Plummer et al. 2006), and ﬁnd that allparameters had eﬀective sample sizes greater than 300 forall models when i = 10 . An acceptance rate between20 and 30% is required for the v t parameters, and anacceptance rate of 30-40% is required for the model pa-rameters. The ﬁnal chain (15000 samples) for each modelis visually inspected to ensure that the three chains didnot reach very diﬀerent maxima in the posterior distri-bution. SIMULATIONSPrior to data analysis, we explored the statistical prop-erties of our Bayesian estimates under repeated sampling,using simulation. Unfortunately, a Bayesian analysiscannot be repeated when the data is real (i.e. for a singledata set). A Bayesian analysis can be repeated, however,with simulated data sets produced from the same DF,and analysed independently using the same model. Byexamining the range of parameter estimates, the aver-age behaviour of the model on this type of data can beexplored. Simulations and analyses of trivial cases (i.e.when the model and data have the same distribution) arealso an eﬀective way to test code.It is expected that when the simulated data come fromthe same DF as that of the model, then on average theBayesian parameter estimates should be correct. Fur-thermore, quantities like uncertainties and credible inter-vals should be reliable (e.g. a 50% credible region shouldcontain the true parameter values 50% of the time). Incontrast, when the simulated data comes from a diﬀerentDF than that of the model, biases in the estimates mayoccur, and the credible regions may become unreliable(e.g. overconﬁdent). Moreover, we need to investigatewhether or not treating missing v t measurements as pa-rameters aﬀects other parameter estimates, regardless ofwhether or not the data and model share the same DF.We simulate mock observations of 100 satellites orbit-ing a galaxy whose gravitational potential follows theHernquist model. The mock tracer observations includetheir distances r and velocity components v r and v t . Weexplore the eﬀects of assuming an isotropic Hernquistmodel in the following three scenarios:1. isotropic data with complete velocity vectors2. isotropic data with 50 unknown v t values3. constant anisotropic data from the β = 0 . v t ’sFor each scenario, we create 500 data sets with 100 par-ticles each. The Bayesian analysis is performed on eachset of data, as described in Sections 2.3 - 2.7, yielding500 Markov chains for each scenario. From these chains, statistics such as the mean parameter estimate and cred-ible intervals are calculated.For all the simulations, we use M tot = 10 M (cid:12) and a = 15kpc to generate the data. Our choice for the totalmass is based on many other studies that have shown theMilky Way’s mass to be close to this order of magnitude.For numerical simplicity, the following units are used inour code: the gravitational constant G ≡ r is measuredin kiloparsecs (kpc), velocity components are measuredin 100km s − , and mass is measured in 2 . × M (cid:12) . KINEMATIC DATA FOR THE MILKY WAYIn principle, any well-deﬁned object orbiting theGalaxy with a measured distance from the Galactic cen-ter and at least one velocity measurement may be usedto estimate the mass of the Milky Way. In this work, as aﬁrst run, we use only globular clusters (GCs) and dwarfgalaxies (DGs). It is possible to measure the proper mo-tions and line-of-sight velocities of these tracers in theGalaxy’s halo, and to convert these measurements intoGalactocentric coordinates. Indeed, many of the kine-matic measurements and conversions have already beenmade (e.g. Dinescu et al. 1999; Casetti-Dinescu et al.2010, 2013; Boylan-Kolchin et al. 2013). However, theproper motions of many GCs and DGs have yet to bemeasured, and so the conversion from our frame of refer-ence to a Galactocentric one cannot be performed. Nev-ertheless, the line-of-sight velocities of these objects areavailable, and could contain useful information about theGalaxy’s mass proﬁle. Thus, we incorporate some of thisincomplete data into our analysis.The data used in this research are in Galactocentric co-ordinates (see Table 1). The ﬁrst 59 objects are GCs, andthe last 29 are DGs. Note that 26 GCs and 18 DGs listeddo not have tangential velocities, because they have noproper motion measurements. The Galactocentric radialvelocities for these data must be approximated, for whichwe assume v r ≈ v los . We use this approximation only forobjects with | cos γ | ≥ .

95 (where γ is the angle sub-tended by the line connecting the Sun and the GalacticCentre, from the object), guaranteeing that any furtheradjustment to v los will be small. We also exclude thefollowing clusters, even though they have | cos γ | ≥ . v r values are taken fromthe HyperLeda Catalogue (Paturel et al. 2003), with theexception of those for Coma Berenices, Sagittarius, andSextans, which are taken from Simon & Geha (2007),Ibata et al. (1997), and Walker et al. (2006) respectively.The r -values in Table 1 are based on mean magnitudesof RR Lyrae and horizontal branch stars, and are uncer- TABLE 1Milky Way Kinematic Data

Object r v r ∆ v r v t ∆ v t cos γ Object r v r ∆ v r v t ∆ v t cos γ (kpc) (km s − ) (km s − ) (kpc) (km s − ) (km s − )NGC 104 7 17.0 0.2 171.0 22.0 0.15 NGC 6540 3 0.0 1.4 – – -0.97NGC 288 11 16.0 0.4 59.0 18.0 0.75 NGC 6569 3 -0.2 5.6 – – 0.95NGC 362 9 55.0 0.5 85.0 31.2 0.61 NGC 6864 15 -1.1 3.6 – – 0.96NGC 1851 16 186.0 0.6 170.0 42.4 0.89 IC 1257 18 -66.5 2.1 – – 0.99NGC 1904 18 93.0 0.5 83.0 44.7 0.94 Arp 2 21 153.0 10.0 – – 0.99NGC 2298 14 -58.0 1.3 100.0 52.7 0.89 NGC 7492 25 -97.4 0.6 – – 0.95Pal 3 85 -247.0 8.4 242.0 121.5 1.00 NGC 5824 26 -117.7 1.5 – – 0.98NGC 4147 19 57.0 1.0 161.0 65.7 0.93 Pal 13 27 192.4 0.3 – – 0.95NGC 4590 9 -99.0 0.6 300.0 35.6 0.69 NGC 5694 29 -228.1 0.8 – – 0.98NGC 5024 18 -106.0 4.1 250.0 86.5 0.90 NGC 6229 30 22.6 7.6 – – 0.96NGC 5139 6 -31.0 0.7 65.0 14.1 0.05 Whiting 1 35 -103.5 1.8 – – 0.98NGC 5272 12 2.0 0.4 164.0 24.5 0.75 Pal 2 35 -104.4 57.0 – – 1.00NGC 5466 16 254.0 0.3 216.0 66.8 0.88 Pal 15 38 147.8 1.1 – – 0.99Pal 5 16 -11.0 16.0 62.0 38.0 0.95 NGC 7006 39 -185.2 0.4 – – 0.98NGC 5897 7 49.0 1.0 138.0 59.4 0.79 Pyxis 41 -195.2 1.9 – – 0.98NGC 5904 6 -313.0 0.5 234.0 39.6 0.33 Pal 14 72 165.4 0.2 – – 1.00NGC 6093 3 60.0 4.1 85.0 28.2 0.67 NGC 2419 90 -26.4 0.5 – – 1.00NGC 6121 6 -58.0 0.4 25.0 22.6 -0.91 Eridanus 95 -141.0 2.1 – – 1.00NGC 6144 3 109.0 1.1 137.0 33.3 0.47 Pal 4 111 50.5 2.1 – – 1.00NGC 6171 4 20.0 0.3 156.0 36.9 -0.29 AM 1 125 -41.6 20.0 – – 1.00NGC 6205 8 279.0 0.9 129.0 35.0 0.48 Fornax 140 -31.8 1.7 196.0 29.0 1.00NGC 6218 5 -21.0 0.6 168.0 22.0 -0.46 LeoI 261 167.9 2.8 101.0 34.4 1.00NGC 6254 5 -53.0 1.1 178.0 28.3 -0.59 LMC 49 93.2 3.7 346.0 8.5 0.99NGC 6341 9 70.0 1.7 46.0 26.2 0.61 SMC 60 6.8 2.4 259.0 17.0 0.99NGC 6362 5 -40.0 0.6 134.0 20.5 0.25 Sculptor 87 79.0 6.0 198.0 50.0 1.00NGC 6397 6 18.0 0.1 166.0 16.3 -0.83 Draco 92 -98.5 2.6 210.0 25.0 1.00NGC 6584 6 150.0 15.0 185.0 55.9 0.88 BootesI 57 106.6 1.0 – – 0.99NGC 6626 3 8.0 1.0 172.0 26.4 -0.87 BootesII 43 -115.6 5.0 – – 0.98NGC 6656 5 172.0 0.6 214.0 31.9 -0.94 CanesVenaticiI 219 76.8 1.0 – – 1.00NGC 6712 4 208.0 0.6 132.0 21.5 -0.08 CanesVenaticiII 150 -96.1 1.0 – – 1.00NGC 6752 5 -19.0 1.5 200.0 11.4 -0.50 Carina 102 14.3 1.0 – – 1.00NGC 6779 9 172.0 0.9 39.0 58.1 0.63 ComaBernices 45 82.6 5.0 – – 0.98NGC 6809 4 -181.0 0.4 119.0 30.4 -0.49 Hercules 141 142.9 1.0 – – 1.00NGC 6838 7 3.0 0.2 180.0 17.8 -0.05 LeoII 235 26.5 8.0 – – 1.00NGC 6934 12 -305.0 1.6 124.0 93.0 0.86 LeoIV 154 13.9 1.0 – – 1.00NGC 7078 10 -74.0 0.6 141.0 34.7 0.70 LeoV 175 62.3 3.0 – – 1.00NGC 7089 10 46.0 0.9 331.0 63.9 0.74 Sagittarius 16 166.3 60.0 – – 0.93NGC 7099 7 14.0 1.0 120.0 30.8 0.46 Segue1 28 113.5 1.0 – – 0.97NGC 5634 21 -0.8 6.6 – – 0.95 Segue2 41 39.7 1.0 – – 0.99NGC 6284 8 0.3 1.7 – – 0.98 Sextans 89 78.2 1.0 – – 1.00NGC 6356 7 0.6 4.3 – – 0.97 UrsaMajorI 101 -8.8 1.0 – – 1.00NGC 6426 14 -0.5 23.0 – – 0.96 UrsaMajorII 36 -36.5 2.0 – – 0.99NGC 6441 4 -0.0 1.0 – – 0.95 UrsaMinor 77 -89.8 8.0 – – 1.00NGC 6453 4 -0.9 8.3 – – 0.98 Willman1 42 33.7 2.0 – – 0.98 Note . — Columns from left to right: objects’ names, Galactocentric distance, radial velocity, uncertainty in radial velocity, tangentialvelocity, uncertainty in tangential velocity,and cos γ . All data are in Galactocentric coordinates ( r , v r , v t ) as described in Section 2.4, withthe exception of GCs and DGs that lack tangential velocities (see text). Conversions from line-of-sight and proper motion measurementsto Galactocentric measurements were completed by the studies mentioned in Section 4. tain to typically 5% (see Harris 1996). The uncertain-ties associated with r and v r , and the diﬀerences in theLSR assumed motion used among the diﬀerent studiesare (cid:47)

15 km/s, and thus unimportant compared to theuncertainties associated with the v t values.In the following analysis, we speciﬁcally assume (a) aspherical Hernquist-like or Jaﬀe-like halo potential, (b)equal weights for all data points, (c) no net rotation ofthe halo, and (d) that all tracers are bound to the Galaxy. RESULTS5.1.

Simulation Results

Figure 1 shows the distribution of the mean parameterestimates from scenario 1. Black dots are the mean of theestimates, and red dashed lines are the true parametervalues. On average, the estimates are unbiased withinone standard deviation (sd), and the sd of the chains isroughly equal to the sd of the estimates.Because the Markov chain represents the posteriordistribution, we can also calculate credible regions —Bayesian analogues of conﬁdence intervals — for the M ( r ) proﬁle. An example of the mass proﬁle credible re-gions for one data set is shown in Figure 2, where shadesof teal show the 50, 75, and 95% credible intervals as afunction of r . Credible regions are found by calculating Eadie, Harris, and Widrow mean of M tot ( M sun ) F r equen cy true value = 1mean = 1.01sd = 0.07 mean of a (kpc)

10 12 14 16 18 20 22 true value = 15mean = 15.21sd = 1.77

Fig. 1.—

Empirical distribution of M tot and a estimates from simulated data analysis. Black points and red dashed lines show the meanof the estimates and the true value of the parameter respectively. The standard deviation of the estimates is 0.07 and 1.77 for M tot and a respectively. . . . . . . . r (kpc) M ( r ) ( M s o l )

50 %75 %95 %

Fig. 2.—

Example of predicted and true mass proﬁle from anal-ysis of simulated data. The true M ( r ) proﬁle is shown in red, andthe 50, 75, and 96% credible regions are shown as shades of teal.Note: this is the result from one analysis (i.e. one data set). M ( r ) at several diﬀerent r values, for every set of param-eters in the Markov chain. The true M ( r ) proﬁle is thesolid red line, calculated from eq. 6 and the true M tot and a . We ﬁnd that the credible regions are reliable when the DF of the assumed model and the DF of the data are thesame. For example, the true M ( r ) curve fell within the75% credible region seventy-ﬁve percent of the time overthe course of the 500 analyses for scenario 1.In scenario 2, we randomly remove 50 v t ’s from eachdata set, and treat them as parameters in the analy-sis. We ﬁnd a very small positive bias in both the M tot and a estimates, as shown in the top two panels inFig. 3. The bias is insigniﬁcant, as the means of the esti-mates (1 . × M (cid:12) and 15 . . × M (cid:12) and1 . β = 0 .

5. Despite the data and modelhaving diﬀerent DFs, the estimates show only a slightpositive bias; the true M tot and a are still within onestandard deviation of the distribution (see Fig. 3). Themean of the estimates for M tot and a are 1 . ± . × M (cid:12) and 15 . ± . v t parameters tends to increase the width ofthe credible regions at all r values compared to Fig. 2.In scenario 2, we ﬁnd the credible regions to be slightlyover conﬁdent for values of 17 < r < M ( r ) falling within the 50, 75, and 95% regions 48,73, and 93% of the time over the 500 analyses. At allother r values, however, the credible regions are reliable.In scenario 2, the credible regions are slightly lower thanthe true cumulative mass proﬁle; the opposite is the case F r equen cy (2) true value = 1mean = 1.01sd = 0.08 true value = 15mean = 15.25sd = 1.89 mean of M tot ( M sun ) F r equen cy (3) true value = 1mean = 1.01sd = 0.08 mean of a (kpc) true value = 15mean = 15.31sd = 1.8310 12 14 16 18 20 22 Fig. 3.—

Empirical distribution of M tot and a estimates from simulated data analysis, with 50 tangential velocities removed. The toppanels are for scenario (2) and the bottom, (3). The black points and red dashed lines show the mean of the estimates and the true valueof the parameters respectively. The standard deviations of the estimates in scenario (2) are 0 . × M (cid:12) and 1 . . × M (cid:12) and 1 . . . . . . . . r (kpc) M ( r ) ( M s o l )

50 %75 %95 % . . . . . . . r (kpc) M ( r ) ( M s o l )

50 %75 %95 %

Fig. 4.—

Example cumulative mass proﬁle when 50 v t ’s are unknown, and an isotropic Hernquist model is assumed. The true proﬁle isshown as a solid red line, and the credible regions are shown as shades of teal. The left proﬁle is scenario (2) and the right is scenario (3). r (see Fig. 4). We reiter-ate, however, that Fig. 4 are examples of M ( r ) proﬁlesfrom a single data set and analysis. Over 500 analyseswe ﬁnd that the 50, 75, and 95% credible regions do con-tain the true M ( r ) 50, 75, and 95% of the time, in bothscenarios 2 and 3, for almost all r .5.2. Milky Way Results

Assuming an isotropic Hernquist model, and using allthe kinematic data from Table 1, we ﬁnd a mean M tot of1 . ± . × M (cid:12) and a scale radius of 16 . ± . M tot and a are (1 . , . × M (cid:12) and(12 . , .

7) kpc respectively. We also report the mean M tot and scale radius, with uncertainties of one stan-dard deviation, when the other models are assumed (Ta-ble 2). The mass estimates and scale radii vary onlyslightly between Hernquist models. The Jaﬀe model’smass is similar, but the Jaﬀe scale radius radius cannotbe compared directly to that of Hernquist because theyhave physically diﬀerent deﬁnitions.The mass proﬁle credible regions are shown in Fig. 5.The innermost dark region corresponds to the 50% cred-ible region. The vertical dashed lines show the extentof the data, with NGC 6540 and Leo I being the clos-est and furthest objects from the Galactic center respec-tively. The mass contained within the distance of Leo I is1 . +0 . − . × M (cid:12) , where the uncertainties correspondto the 95% credible interval. . . . . r (kpc) M ( r ) ( M s o l )

50 %75 %95 %50 %75 %95 %r min max data

Fig. 5.—

Mass proﬁle credible regions assuming a Hernquistmodel with isotropic velocity dispersion. The dashed lines indi-cate the location of NGC 6540 and Leo I (the closest and furthestobjects from the Galactic center respectively in our dataset).

Some satellites may have a large eﬀect on the massestimate of the Galaxy. Leo I, for example, remaineda contentious object for many years, because it is at alarge distance from the Galactic center and it was un-clear whether or not it is bound to the MW. Recently,however, Boylan-Kolchin et al. (2013) showed that LeoI is likely bound to the MW. Furthermore, when LeoI’s proper motion is taken into account, the object haslittle eﬀect on the mass estimate of the Galaxy (Wilkin-son & Evans 1999). We run our analysis assuming theisotropic Hernquist model both with and without LeoI, and we also ﬁnd that it has no eﬀect within erroron M tot . When Leo I is removed from the analysis, M tot = 1 . ± . × M (cid:12) and a = 16 . ± . v los ≈ v r for tracers with | cos γ | ≥ . | cos γ | = 0 .

93, so one may question the inclusion of thisobject. However, once again we ﬁnd little change inthe mass estimate without it, for all models. For ex-ample, the isotropic Hernquist model returned M tot =1 . ± . × M (cid:12) , which is almost identical to theresult obtained using all the data (see Table 2). The scaleradius is also unchanged within error (17 . ± . . ± . × M (cid:12) . Next, we remove ﬁve tangentialvelocities from the data, and repeat the analysis treatingthose missing v t ’s as parameters. Repeating this processand removing ﬁve diﬀerent v t ’s each time, we ﬁnd thatthe v t ’s cannot be well estimated. However, treating v t ’sas parameters has little to no eﬀect on the mass estimate,within error. There is one exception to the latter state-ment: when Pal 3’s tangential velocity was removed, themass estimate was reduced signiﬁcantly.To investigate the inﬂuence of Pal 3 further, we per-formed an analysis using only the Dinescu data, but with-out Pal 3’s v t value. Treating Pal 3’s v t as a parameter,the mass estimate of the Milky Way fell by more than50% ( M tot = 0 . ± . × M (cid:12) ). We also ran theanalysis using only the Dinescu data, but without any v t ’s. In this case, M tot = 0 . ± . × M (cid:12) , which issimilar to the estimate obtained in the former analysis.We note, however, that Pal 3 has the most uncertain v t measurement in the list (Table 1). It is evident that in-cluding measurement uncertainties in the analysis wouldreduce its leverage considerably.Using all kinematic data, but removing Pal 3 from theanalysis, also resulted in reduced mass estimates. Fur-thermore, the eﬀect is observed regardless of the selectedmodel (Table 2). Thus, Pal 3’s proper motion, and in-deed Pal 3 in general, has signiﬁcant inﬂuence on the1 TABLE 2

Parameter Estimates for the Milky Way.All Data Without Pal 3 Without Draco

Model - σ M tot Scale Radius M tot Scale Radius M tot Scale Radius(10 M (cid:12) ) (kpc) (10 M (cid:12) ) (kpc) (10 M (cid:12) ) (kpc)H - iso 1.55 ± ± ± ± ± ± ± ± ± ± ± ± β = 0 . ± ± ± ± ± ± ± ± ± ± ± ± Note . — In the ﬁrst column, the ﬁrst three models are of the Hernquist type, with isotropic, OM-type anisotropy, and constantanisotropy. The last row shows the results of assuming an isotropic Jaﬀe model. Uncertainties are one standard deviation of the posteriordistribution. r (kpc) E ( k m s - ) complete dataincomplete dataPal 3DracoLeo I F ( r ) Fig. 6.—

Satellite energies given the model parameters from theisotropic Hernquist model ﬁt. Satellites without tangential veloci-ties are shown as open circles, and are plotted using the estimated v t value from the model ﬁt. Unbound (escaping) objects would lieabove the dotted line. mass estimate of the Galaxy. This issue regarding Pal3 conﬁrms the ﬁnding of Sakamoto et al. (2003), whonoted that high-velocity objects such as Pal 3 and Dracocan have a signiﬁcant eﬀect on the mass estimate of theGalaxy. As mentioned previously, we also test the eﬀectof Draco on the mass estimate and ﬁnd that it has littleeﬀect on the mass estimate (Table 2).To demonstrate the eﬀectiveness of using v t ’s as pa-rameters, consider Figure 6. Using eq. 3 and the meanparameter values from the isotropic Hernquist model ﬁt,we plot the negative of the tracers’ speciﬁc energies as afunction of r . Filled points are satellite data with com-plete velocity vectors, and hollow points are data withunknown v t ’s (plotted using the mean v t estimates fromthe Markov chain). As demonstrated by our simulationsand tests with the Dinescu data, the v t values cannot bewell estimated. However, the v t parameters appear toconverge to an average v t for a given r value, and this isreﬂected in the positions of the hollow points in Fig. 6. The results demonstrate that in a small sample of data,some objects carry greater inﬂuence on the mass estimatethan others. Furthermore, the variation in these resultsimplies that it would be fruitful to weight the data bytheir measurement uncertainties. In a Bayesian analy-sis, however, a fully hierarchical approach is necessary toproperly include the measurement uncertainties of thedata, and a probability distribution for the errors mustbe assumed. We leave this analysis to a future paper,and instead perform three more approximate sensitivityanalyses.The ﬁrst two sensitivity analyses are extreme cases: 1)all the tangential velocities are increased by 2∆ v t , and 2)all the tangential velocities are decreased by 2∆ v t . In thethird and more realistic sensitivity analysis, we randomlychange each v t into a new tangential velocity v t,new via, v t,new = v t + N (0 , ∆ v t ) (25)where N (0 , ∆ v t ) represents a random number drawnfrom a normal distribution with mean zero and variance∆ v t . Using eq. 25, we generate 100 data sets with new v t values, and then analyze these data sets assuming theisotropic Hernquist model. The estimates of M tot and a from the 100 analyses have a distribution that conﬁrmsthe results of the original analysis (Fig. 7); the mean ofthe estimates is nearly identical to the result in Table 2.The results of the sensitivity analyses show that aproper treatment of the measurement uncertainties isworth pursuing. In future analyses, we plan to incor-porate the measurement uncertainties of the data via ahierarchical model. DISCUSSION AND FUTURE PROSPECTSThe results of this study are promising. Not only doesthe Bayesian analysis provide an eﬀective way of incorpo-rating complete and incomplete data, but it also enableseasy calculation of probabilistic credible regions for thecumulative mass proﬁle. Furthermore, even though thisis a preliminary analysis, and mostly meant to lay thegroundwork for future studies, our total mass estimatesare similar to other studies that use diﬀerent methods.Because our method returns a sample of parametervalues representing the posterior distribution, it is easyto compare our results with mass estimates, at any radii,obtained in other studies. We can compute M ( r ) credibleregions from our Markov chain at any r value, and obtaina mass estimate at that radius, with uncertainties.A collection of total mass estimates within r =100kpc, from 10 diﬀerent studies, has been compiled2 Eadie, Harris, and Widrow mean of M tot ( M sun ) F r equen cy mean = 1.55sd = 0.06 mean of a (kpc) mean = 16.96sd = 0.22 Fig. 7.—

Distribution of M tot and a estimates from the third sensitivity analysis. The black dots show the mean of the estimates. Thevalue of the mean and the standard deviation of the empirical distribution are shown in the legend. by Courteau et al. (2014). Our mass estimate atthis radius assuming the isotropic Hernquist model is M = 1 . × M (cid:12) with a 95% credible region of(1.05, 1.26) × M (cid:12) , which is within the range of val-ues listed in the review.Watkins et al. (2010) ﬁnd the mass within 300 kpcto be 0 . ± . × M (cid:12) for an isotropic model.Our estimate M assuming an isotropic model is1 . × M (cid:12) with a 95% credible region of (1.29,1.53) × M (cid:12) . When they consider an anisotropic ve-locity distribution with β derived from the observationaldata, however, they ﬁnd 3 . ± . × M (cid:12) , in contrastto our β = 0 . M = 1 . × M (cid:12) , with a 95% credible interval of(1.27,1.51) × M (cid:12) .Deason et al. (2013) used BHB stars to trace the MW’smass, and found M ( r = 50kpc) to be approximately4 × M (cid:12) , assuming a model of constant anisotropywith β = 0 .

5. However, our mass estimate for the MW at50kpc, using the Hernquist constant anisotropic model,is substantially higher at 9 . × M (cid:12) with a 95% cred-ible interval of (8 . , . × M (cid:12) . Even removing Pal3 from the data set does not lower this estimate signiﬁ-cantly, reducing it to 8 . × M (cid:12) with a 95% credibleinterval of (7 . , . × M (cid:12) .Using a truncated, ﬂat rotation curve model, Battagliaet al. (2005) found the mass of the MW dark matterhalo to be 1 . +1 . − . × M (cid:12) , and with an NFW modelfound a virial radius of 0 . +1 . − . × M (cid:12) . Boylan-Kolchin et al. (2013) estimate the MW virial mass at 1 . × M (cid:12) , with a 90% conﬁdence interval of 1.0to 2 . × M (cid:12) . Sohn et al. (2013) use the timingargument of Leo I to arrive at a virial mass estimate M vir = 3 . +1 . − . × M (cid:12) . Li & White (2008) foundthe virial mass to be 2 . × M (cid:12) , with a lower 95%conﬁdence level of 0 . × M (cid:12) . Thus, our preliminaryresults are on par with many other studies that use dif-ferent methodologies.To our knowledge, no other studies besides Sakamotoet al. (2003) have found Pal 3 to carry so much weightin the analysis. Pal 3’s proper motion is already known(though with large uncertainties) and the satellite doesnot lie as far from the Galactic center as Leo I and othersatellite dwarfs, which may have allowed its eﬀect to gounnoticed. Removing Pal 3’s true v t from the analysisand treating it as a parameter lowered the mass signif-icantly, suggesting that the tangential velocity is in thetail of the v t distribution at r = 84kpc. The dwarf galax-ies in our data set, on the other hand, seem to have littleindividual inﬂuence on the mass estimate of the Galaxyeven though some have high velocities and large distancesfrom the galactic center.Many improvements, challenges, and exploratory anal-yses remain:1. One way to substantially improve the analysis isto incorporate measurement uncertainties via a hi-erarchical model, rather than the preliminary sen-sitivity analysis performed here. In the Bayesianparadigm, a probability density function of themeasurement errors must be assumed. For exam-ple, for each data point the known measurement3uncertainty may be used to deﬁne the variance ofa Gaussian distribution centered on the measure-ment value. Objects with high inﬂuence when mea-surement errors are ignored (e.g. Pal 3) might havereduced inﬂuence when measurement errors are in-cluded.2. An immediate challenge is ﬁnding models with dis-tribution functions that are tractable and that de-scribe the Milky Way in a more sophisticated man-ner. Using the NFW model is of particular interestbecause it is known to ﬁt the dark matter halos ofgalaxies on many diﬀerent scales, as well as groupsand clusters of galaxies. The NFW DF, however,is not analytic. Although numerical solutions forthe NFW DF have been derived, applying them inthe Bayesian framework is more diﬃcult because ofthe model’s inﬁnite mass. When a model’s DF doesnot integrate to a ﬁnite mass (eq. 2) then it is nota proper probability distribution— a requirementwhen applying Bayes’ theorem. We are currentlyexploring the problem through an ApproximateBayesian Computation (ABC) algorithm, which al-lows for calculations of posterior distributions with-out explicit calculation of the likelihood.3. The DFs for all of the models employed hereare analytic. Furthermore, the models are self-consistent— i.e. we implicitly assume that the darkmatter and the satellites follow the same distribu-tion. However, it is possible that the tracers (GCsand dwarf galaxies) have a diﬀerent distributionthan the dark matter halo particles. For example,the tracers may have a Hernquist-type density pro-ﬁle ρ H ( r ) given by eq. 5, but may reside in an NFWgravitational potential given byΦ NFW ( r ) = − πρ o r s ln (1 + r/r s ) r/r s . (26)In this situation, there are two extra parameters, r s and ρ o , which correspond to the scale length anddensity parameter of the dark matter halo. We canderive the DF for such a model via the Eddingtonformula: f ( E ) = 1 √ π (cid:90) E √E − Ψ (cid:18) d ρd Ψ (cid:19) d Ψ+ 1 √E (cid:18) dρd Ψ (cid:19) Ψ=0 (27)where Ψ = Φ − Φ o is the relative potential (see Bin-ney & Tremaine 2008). For the Hernquist model, ρ can be written as an analytic function of Ψ, and theintegral can be evaluated in closed form. For thecase at hand, however, the relation between ρ H andΨ NFW is a transcendental equation. Nevertheless,the integral required in the Eddington formula canbe evaluated numerically. Widrow et al. (2008),for example, used this method to derive DFs fortheir self-consistent disk-bulge-halo galaxy models.A numerically derived DF may still be used withour method, as long as it is a normalized probabil-ity distribution. We plan to implement models ofthis type in future studies of tracer populations. 4. Diﬀerent velocity anisotropy formalisms are also ofinterest. For example, other Hernquist model DFsof diﬀerent anisotropies are discussed by Cuddeford(1991) and Baes & Dejonghe (2002). The formerderived a velocity anisotropy that is a generaliza-tion of the OM-type anisotropy, where another pa-rameter β is introduced such that β ( r ) = r + β r a r + r a . (28)When β = 0, eq. 28 reduces to OM-typeanisotropy. As r a → ∞ , β ( r ) → β , in constrastto eq. 7. Baes & Dejonghe (2002) derive a Hern-quist DF using this formalism, and show that onlyfour values of β lead to DFs that can be expressedin terms of elementary functions. The simplest ofthese DFs occurs when β = 0 .

5, while the otherDFs are ”...somewhat more elaborate” and not pro-vided (Baes & Dejonghe 2002).5. We will explore biases that may occur due to selec-tion eﬀects. In the Hernquist simulations used here,some tracers are unrealistically far from the Galac-tic center (e.g. more than 500kpc away), while ourkinematic data set has a range from r = 3kpc to261 kpc. Imposing a more realistic range on sim-ulated data may or may not introduce biases inparameter estimates.6. Further along the line, it will be exciting to applythe method presented here to large datasets of ﬁeldhalo stars, leading up to the GAIA data. Sakamotoet al. (2003) showed that including many ﬁeld hor-izontal branch stars greatly reduced the eﬀect ofhigh-velocity objects (such as Draco and Pal 3) onthe mass estimate of the Milky Way. Therefore, itcan be expected that the accurate and abundantkinematic data from GAIA will also improve ourmass estimates in a major way and decrease theeﬀect of outliers.7. The method outlined in this paper could also beextended to obtain mass estimates of other galaxiesfor which tracer objects will have only the projectedpositions and line of sight velocities.SUMMARYWe have introduced a method to estimate the mass ofthe Milky Way that incorporates both complete and in-complete data for positions and velocities of tracers. Themethod treats unknown tangential velocities as parame-ters in the model. Simulations showed that although thetangential velocities cannot be well constrained, treating v t ’s as parameters has little eﬀect on the mass estimatewhen other complete velocity vectors are available. Anexception does occur, however, when a tracer has an un-usually extreme tangential velocity (e.g. Pal 3).Under simple assumptions of a Hernquist-like halo po-tential and modest anisotropy, we ﬁnd M tot ≈ . − . × M (cid:12) , in good agreement with other recent work. Foran isotropic model we ﬁnd M tot = 1 . × M (cid:12) witha 95% credible interval of (1 . , . × M (cid:12) , and ascale radius of a = 16 . . × M (cid:12) , with a 95%credible interval of (1 . , . × M (cid:12) .In future research, we will be incorporating measure-ment uncertainties into the analysis and will test moreextensively for parameter biases. We also plan to use theNFW model and ﬁnd other DF’s to use in the GME code.The method outlined here will eventually be applied toextragalactic studies, where the complete velocity vectorsof the tracers are unknown. ACKNOWLEDGEMENTSWEH and LMW acknowledge the ﬁnancial support ofNSERC. GME thanks the reviewer for very useful com-ments and suggestions, and would also like to thankAaron Springford for useful discussions about sensitiv-ity analyses..In future research, we will be incorporating measure-ment uncertainties into the analysis and will test moreextensively for parameter biases. We also plan to use theNFW model and ﬁnd other DF’s to use in the GME code.The method outlined here will eventually be applied toextragalactic studies, where the complete velocity vectorsof the tracers are unknown. ACKNOWLEDGEMENTSWEH and LMW acknowledge the ﬁnancial support ofNSERC. GME thanks the reviewer for very useful com-ments and suggestions, and would also like to thankAaron Springford for useful discussions about sensitiv-ity analyses.