MNRAS 000, 1–9 (2018)    Preprint 9 April 2019    Compiled using MNRAS LaTeX style file v3.0
Extracting distribution parameters from multiple uncertain observations with selection biases
Ilya Mandel,* Will M. Farr,† Jonathan R. Gair‡
Monash Centre for Astrophysics, School of Physics and Astronomy, Monash University, Clayton, Victoria 3800, Australia
Birmingham Institute for Gravitational Wave Astronomy and School of Physics and Astronomy, University of Birmingham, Birmingham, B15 2TT, United Kingdom
Department of Physics and Astronomy, Stony Brook University, Stony Brook NY 11794, USA
Center for Computational Astronomy, Flatiron Institute, 162 5th Ave., New York NY 10010, USA
School of Mathematics, University of Edinburgh, James Clerk Maxwell Building, Peter Guthrie Tait Road, Edinburgh EH9 3FD, UK
ABSTRACT
We derive a Bayesian framework for incorporating selection effects into population analyses. We allow both for measurement uncertainty in individual measurements and, crucially, for selection biases on the population of measurements, and show how to extract the parameters of the underlying distribution based on a set of observations sampled from this distribution. We illustrate the performance of this framework with an example from gravitational-wave astrophysics, demonstrating that the mass-ratio distribution of merging compact-object binaries can be extracted from Malmquist-biased observations with substantial measurement uncertainty.
Key words: methods: data analysis – gravitational waves – stars: neutron
The problem of extracting the distributional properties of a population of sources based on a set of observations drawn from that distribution is a common one, frequently labelled as hierarchical modelling (e.g., Hogg et al. 2010); Bovy et al. (2011) call this "extreme deconvolution". In practical applications, one often has to deal with selection effects: the observed population will have a Malmquist bias (Malmquist 1922, 1925) whereby the loudest or brightest sources are most likely to be detected, and it is necessary to correct for this bias in order to extract the true source population (e.g., Foreman-Mackey et al. 2014; Farr et al. 2014). In other applications, significant measurement uncertainties in the individual observations must be accounted for (e.g., Farr & Mandel 2018). Of course, these two complications – measurement uncertainties and selection effects – are often present simultaneously.

There have been multiple attempts to address the problem of population-based inference with both selection effects and significant measurement uncertainties. The earliest correct published solution to this problem, as far as we are aware, belongs to Loredo (2004). However, despite the availability of this solution, it is easy to be lured into a seemingly straightforward but incorrect derivation. The most common mistake is the modification of the model population distribution to account for the selection function, i.e., the inclusion of the probability of detecting a particular event only as a multiplicative term in the probability of observing that event. This detection probability is usually included as the probability marginalised over all realisations of the data, ignoring the fact that we know the particular data realisation that has been observed. For a given data realisation, the probability that a source is detected, which is a property purely of the data, is by definition equal to one for any data set associated with an observation we are analysing. On the other hand, as shown below, it is critical to include the detection probability in the normalisation factor to account for the different numbers of events expected to be observed under different population models.

We sketched out the correct approach to including selection effects in Mandel et al. (2016) (which is superseded by the present manuscript) and Abbott et al. (2016). Other correct applications in the literature include Fishbach & Holz (2017), Fishbach et al. (2018), and Feeney et al. (2019). Here, we expand and clarify the earlier treatment of Loredo (2004) by presenting two different approaches to solving this problem below: a bottom-up and a top-down derivation, showing that they yield the same result. Some among us find one or the other approach to be more clear, and we hope that including both will benefit readers.

We illustrate the derived methodology with two examples. The first is the classic example of measuring a luminosity function with a flux-limited survey. The second is an example from gravitational-wave astronomy: the measurement of the mass ratio of merging binary neutron stars.

* E-mail: [email protected]   † E-mail: [email protected]   ‡ E-mail: [email protected]
We show that ≳ 20 observations will be necessary to accurately measure the mass-ratio distribution. This feat, which can be accomplished with third-generation ground-based gravitational-wave detectors, could elucidate the details of neutron star formation.
We consider a population of events or objects, each described by a set of parameters $\vec\theta$. These parameters represent the characteristics of individual events. For example, in the case of compact binary coalescences observed by LIGO and Virgo these would include the masses, spin magnitudes and spin orientations of the two components, the location of the source on the sky, the distance of the source, the orientation and eccentricity of the binary orbit, etc. The distribution of events in the population is described via parameters $\vec\lambda$, so that the number density of objects follows

$\frac{\mathrm{d}N}{\mathrm{d}\vec\theta}(\vec\lambda) = N\, p_{\rm pop}(\vec\theta\,|\,\vec\lambda')$.

In the gravitational-wave context, these parameters could represent properties of the population, like the slope of the mass function of black holes in compact binaries, the shape of the spin magnitude distribution, or the mixing fractions of different sub-populations. They could also represent physical ingredients used in population synthesis calculations, for example the parameters of the initial mass function, the stellar metallicity distribution or stellar winds, and the properties of common envelope evolution or of the distribution of supernova kicks. In this second case, the distribution of the individual event properties $p_{\rm pop}(\vec\theta|\vec\lambda')$ could be obtained from the output of population synthesis codes for that particular choice of input physics. We have separated $\vec\lambda$ into the overall normalisation for the number or rate of events, $N$, and the set of parameters describing the shape of the distribution alone, $\vec\lambda'$.
For instance, if the underlying distribution is modelled as a multi-dimensional Gaussian, $\vec\lambda$ would consist of the mean vector and covariance matrix; alternatively, a non-parametric distribution could be described with a (multi-dimensional) histogram, in which case $\vec\lambda$ represents the weights of the various histogram bins.

This distribution is sampled by drawing a set of $N_{\rm obs}$ "observed events" with true parameters $\{\vec\theta_i\}$, for $i \in [1, N_{\rm obs}]$. For each object in the population we make a noisy measurement of $\vec\theta_i$, represented by a likelihood function relating the measured data, $\vec d_i$, to the parameters of the event, $\vec\theta$: $p(\vec d_i | \vec\theta_i)$.

Moreover, based on the observed data, some objects are classed as "observable" and others as "un-observable". For example, a survey may impose a per-pixel or per-aperture threshold on the flux for inclusion of point sources in a catalogue, or a gravitational-wave detector may only report events whose signal-to-noise ratio rises above some predetermined threshold. This detection probability can be estimated empirically for a search pipeline via a large injection campaign. In some cases, it can be modelled analytically; for example, for low-mass compact binaries, the gravitational-wave strain amplitude in the frequency domain is proportional to $\mathcal{M}_c^{5/6}$, where $\mathcal{M}_c$ is the chirp mass, so the detection probability scales as the surveyed volume, $\propto \mathcal{M}_c^{15/6}$. Throughout this article, we will assume that whether or not an event is counted as a detection is a property only of the data for each object, so that there exists an indicator function $I(\vec d)$ that is equal to 1 for "observable" objects that would be classified as detections and 0 otherwise; this is by far the most common case for astronomical observations.

Our ultimate goal is to determine the population properties $\vec\lambda$.
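The chirp-mass scaling quoted above can be made concrete with a minimal sketch, assuming (purely for illustration) a detector with a fixed signal-to-noise threshold and sources distributed uniformly in volume, so that the horizon distance grows as $\mathcal{M}_c^{5/6}$ and the surveyed volume, hence the detection probability, as $\mathcal{M}_c^{15/6}$. The reference chirp mass is a hypothetical number:

```python
# Illustrative scaling of detection probability with chirp mass for
# low-mass compact binaries: strain amplitude ~ Mc^(5/6), so the
# horizon distance scales as Mc^(5/6) and the surveyed volume
# (proportional to the detection probability for a uniform-in-volume
# population) scales as Mc^(15/6).  All numbers are hypothetical.

def relative_pdet(mc, mc_ref=1.2):
    """Detection probability of a binary with chirp mass mc,
    relative to one with chirp mass mc_ref."""
    return (mc / mc_ref) ** (15.0 / 6.0)

# Doubling the chirp mass makes a source roughly 2^(15/6) ~ 5.7
# times easier to detect.
print(relative_pdet(2.4))
```

This steep dependence is why mass-dependent selection must be modelled before population properties can be inferred.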
Of course, we cannot uniquely reconstruct $\vec\lambda$ using a limited set of observations with selection biases and measurement uncertainties. The best we can do is compute the posterior probability on $\vec\lambda$, the distribution on distributions, given the observations, which, in the usual Bayesian formalism, is given by

$p(\vec\lambda | \{\vec d_i\}) = \frac{p(\{\vec d_i\} | \vec\lambda)\, \pi(\vec\lambda)}{p(\{\vec d_i\})}$,    (1)

where $p(\{\vec d_i\}|\vec\lambda)$ is the likelihood of observing the data set given the population properties, $\pi(\vec\lambda)$ is the prior on $\vec\lambda$, and the evidence $p(\{\vec d_i\})$ is the integral of the numerator over all $\vec\lambda$. This evidence can be used to select between different models for representing the distribution, as in Farr et al. (2011). In the next two sections, we present two alternative ways of deriving $p(\{\vec d_i\}|\vec\lambda)$.

First, we follow the bottom-up approach of deriving the likelihood for obtaining a particular set of observations given the population parameters, by starting with a simple problem without either measurement uncertainties or selection effects and gradually building up the problem complexity. For the moment, we assume that we are only interested in the shape of the population distribution, and ignore the normalisation, or rate, of objects in the population; we discuss estimation of both the rate and shape of a population at the end of this section and in §. The distribution is sampled by events with true parameters $\{\vec\theta_i\}$, for $i \in [1, N_{\rm obs}]$, which for now we take to be measured perfectly. The total probability of making this particular set of independent observations is

$p(\{\vec\theta_i\} | \vec\lambda') = \prod_{i=1}^{N_{\rm obs}} \frac{p_{\rm pop}(\vec\theta_i | \vec\lambda')}{\int \mathrm{d}\vec\theta\; p_{\rm pop}(\vec\theta | \vec\lambda')}$.    (2)
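Before selection effects enter, Eq. (2) can be sketched for a one-dimensional toy population. Here $p_{\rm pop}(\theta|\vec\lambda')$ is an illustrative, already normalised Gaussian with mean `mu` and width `sigma` (so the normalising integral equals 1), and the events' parameters are assumed to be measured perfectly:

```python
import math

# Minimal sketch of Eq. (2): for a normalised 1-D Gaussian population
# p_pop(theta | mu, sigma), the likelihood of perfectly measured
# events {theta_i} is the product of population densities.
# All numbers are illustrative.

def log_likelihood(thetas, mu, sigma):
    norm = -0.5 * math.log(2 * math.pi) - math.log(sigma)
    return sum(norm - 0.5 * ((t - mu) / sigma) ** 2 for t in thetas)

thetas = [0.1, -0.4, 0.3]
# The likelihood favours a population mean near the sample mean.
print(log_likelihood(thetas, mu=0.0, sigma=1.0) >
      log_likelihood(thetas, mu=2.0, sigma=1.0))
```

The remainder of the derivation builds on this product form by adding selection and measurement noise.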
The normalisation factor in equation (2) accounts for the overall probability of making an observation given a particular choice of $\vec\lambda$ (it will be equal to 1 if $p_{\rm pop}$ is properly normalised, but we keep the normalisation term for completeness).

[Footnote: In practice, there are very weak deviations from this power law due to the imperfect – noisy – measurement of signal amplitude.]

[Footnote: An example where the selection may be parameter- rather than data-dependent is in surveys of objects that have been selected based on data in yet other surveys; Maggie Lieu pointed us to X-ray selected populations of galaxy clusters in a weak-lensing catalogue. This can still be treated within the framework proposed here, by considering the combined likelihood for both data sets and marginalising over the "discarded" data from the survey used for selection.]

In practice, there is often a selection bias involved: some events are easier to observe than others. This can be characterised by a detection probability $p_{\rm det}(\vec\theta)$. We assume for now that when systems are observed their parameters can be measured perfectly, i.e., we directly measure $\{\vec\theta_i\}$. This effectively says that the noise in the measurement is negligible and the selection effects can be applied directly to the event parameters: $p_{\rm det}(\vec\theta) = I(\vec\theta)$, i.e., events are either always detected or never detected depending on their parameters. With the selection effect included, equation (2) becomes (e.g., Chennamangalam et al. 2013; Farr et al.
2015):

$p(\{\vec\theta_i\} | \vec\lambda') = \prod_{i=1}^{N_{\rm obs}} \frac{p_{\rm pop}(\vec\theta_i|\vec\lambda')\, p_{\rm det}(\vec\theta_i)}{\int \mathrm{d}\vec\theta\; p_{\rm pop}(\vec\theta|\vec\lambda')\, p_{\rm det}(\vec\theta)} = \prod_{i=1}^{N_{\rm obs}} \frac{p_{\rm pop}(\vec\theta_i|\vec\lambda')}{\int \mathrm{d}\vec\theta\; p_{\rm pop}(\vec\theta|\vec\lambda')\, p_{\rm det}(\vec\theta)}$,    (3)

where the second equality follows because, by definition, $I(\vec\theta) = 1$ for any event we have included among the set of detections.

In general, we do not have the luxury of directly measuring the parameters of a given event, $\vec\theta_i$. Instead, we measure the data set $\vec d_i$, which encodes these parameters but also includes some random noise. For a given data set and search pipeline, we assume that detectability is deterministic: if the data exceed some threshold (e.g., a threshold on the signal-to-noise ratio), then the event is detectable; otherwise, it is not. In other words, the detection probability for a given set of parameters introduced earlier is, in fact, an integral over the possible data sets given those parameters:

$p_{\rm det}(\vec\theta) = \int_{\vec d > {\rm threshold}} p(\vec d|\vec\theta)\, \mathrm{d}\vec d = \int I(\vec d)\, p(\vec d|\vec\theta)\, \mathrm{d}\vec d$.    (4)

In the gravitational-wave context, detection is usually well approximated as a cut on the observed signal-to-noise ratio (SNR), and so this detection probability is the likelihood distribution of observed SNRs. There are two stochastic components to the observed SNR: fluctuations in the detector noise, which change the observed SNR relative to the intrinsic SNR, and fluctuations in the intrinsic SNR due to variations in the source parameters.
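The integral in Eq. (4) is exactly what an injection campaign estimates: draw noise realisations, add them to the signal, and count how often the result clears the threshold. A minimal sketch, assuming an illustrative scalar "signal" $\theta$, Gaussian noise, and a hypothetical threshold value:

```python
import random

# Monte-Carlo estimate of Eq. (4): p_det(theta) is the probability,
# over noise realisations, that the observed data d = theta + noise
# exceeds a detection threshold.  The Gaussian noise model and the
# threshold value are illustrative assumptions.

def p_det(theta, threshold=8.0, noise_sigma=1.0, n_draws=100_000, seed=42):
    rng = random.Random(seed)
    hits = sum(theta + rng.gauss(0.0, noise_sigma) > threshold
               for _ in range(n_draws))
    return hits / n_draws

# A source sitting exactly at threshold is detected about half the
# time; one far above threshold is detected almost always.
print(abs(p_det(8.0) - 0.5) < 0.01, p_det(12.0) > 0.99)
```

This is the sense in which $p_{\rm det}(\vec\theta)$ already marginalises over noise realisations.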
As an example of fluctuations due to the source parameters, the expected signal amplitude is a strong function of the mass, a selection effect that is critical to consider when inferring the underlying distribution of binary black hole masses from the observed events (Abbott et al. 2016; Fishbach & Holz 2017). As another example, the intrinsic SNR also depends on the extrinsic parameters of the binary, i.e., the sky location and orientation of the system. That dependence is largely encoded in the distribution of the parameter Θ described in Finn & Chernoff (1993). The function $p_{\rm det}(\vec\theta)$ encodes both these types of intrinsic selection effect, plus marginalisation over instrumental noise fluctuations.

[Footnote: The assumption that selection acts directly on the parameters, $p_{\rm det}(\vec\theta) = I(\vec\theta)$, is a very artificial model, since all detectors have noise, and the reason that $p_{\rm det}(\vec\theta)$ is not equal to one is because of that noise. However, it serves to illustrate the basic idea.]

Using Eq. (4), we can write the probability of observing a particular data set (where "observing" implies that the data are above the threshold, hence included as one of our observations) given the assumed distribution parametrised by $\vec\lambda'$ as

$p(\vec d|\vec\lambda') = \frac{\int \mathrm{d}\vec\theta\; p(\vec d|\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')}{\alpha(\vec\lambda')}$,    (5)

where the normalisation factor $\alpha(\vec\lambda')$ is given by

$\alpha(\vec\lambda') \equiv \int_{\vec d>{\rm threshold}} \mathrm{d}\vec d \int \mathrm{d}\vec\theta\; p(\vec d|\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda') = \int \mathrm{d}\vec\theta \left[\int_{\vec d>{\rm threshold}} \mathrm{d}\vec d\; p(\vec d|\vec\theta)\right] p_{\rm pop}(\vec\theta|\vec\lambda') \equiv \int \mathrm{d}\vec\theta\; p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')$.    (6)
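The selection integral $\alpha(\vec\lambda')$ of Eq. (6) is rarely available in closed form; in practice it is estimated by Monte Carlo, averaging $p_{\rm det}$ over draws from the population. A sketch, assuming an illustrative Gaussian population and a hypothetical sigmoid selection function:

```python
import random, math

# Monte-Carlo estimate of Eq. (6):
#   alpha(lambda') = ∫ p_det(theta) p_pop(theta | lambda') dtheta,
# approximated by averaging p_det over draws from the population.
# The Gaussian population and the sigmoid p_det are illustrative
# stand-ins for a real selection function.

def alpha(mu, sigma, n_draws=200_000, seed=1):
    rng = random.Random(seed)
    def p_det(theta):  # hypothetical smooth selection function
        return 1.0 / (1.0 + math.exp(-(theta - 8.0)))
    return sum(p_det(rng.gauss(mu, sigma)) for _ in range(n_draws)) / n_draws

# A population centred well above the selection turn-on is nearly
# always detected; one centred well below it is rarely detected.
print(alpha(12.0, 1.0) > 0.9, alpha(4.0, 1.0) < 0.1)
```

Note that $\alpha$ depends on the population parameters, which is precisely why it cannot be dropped as a constant during inference.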
The normalisation factor $\alpha(\vec\lambda')$ can be interpreted as the fraction of events in the Universe that would be detected for a particular population model, characterised by the population parameters $\vec\lambda'$.

Thus, in the presence of both measurement uncertainty and selection effects, equations (2) and (3) become

$p(\{\vec d_i\}|\vec\lambda') = \prod_{i=1}^{N_{\rm obs}} \frac{\int \mathrm{d}\vec\theta\; p(\vec d_i|\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')}{\int \mathrm{d}\vec\theta\; p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')}$.    (7)

The presence of the likelihood $p(\vec d_i|\vec\theta)$ in this equation is a reminder that we do not have a perfect measurement of the parameters of a given event. The likelihood can be rewritten in terms of the posterior probability density function (PDF) $p(\vec\theta_i|\vec d_i)$ that is computed in the course of single-event parameter estimation using some assumed prior $\pi(\vec\theta)$:

$p(\vec d_i|\vec\theta_i) = \frac{p(\vec\theta_i|\vec d_i)\, p(\vec d_i)}{\pi(\vec\theta_i)}$.    (8)

Thus, each term of the product in Eq. (7) is a normalised convolution integral of the population with the posterior PDF (Mandel 2010).

In practice, the posterior PDF $p(\vec\theta_i|\vec d_i)$ is often discretely sampled with $S_i$ samples from the posterior, $\{{}^j\vec\theta_i\}$, for $j \in [1, S_i]$. Because the samples are drawn according to the posterior, the parameter-space volume associated with each sample is inversely proportional to the local PDF, $\mathrm{d}{}^j\vec\theta_i \propto \left[p({}^j\vec\theta_i|\vec d_i)\right]^{-1}$. This allows us to easily replace the integral in Eq. (7) with a discrete sum over PDF samples:

$p(\{\vec d_i\}|\vec\lambda') = \prod_{i=1}^{N_{\rm obs}} \frac{\frac{1}{S_i} \sum_{j=1}^{S_i} \frac{p_{\rm pop}({}^j\vec\theta_i|\vec\lambda')}{\pi({}^j\vec\theta_i)}\, p(\vec d_i)}{\int \mathrm{d}\vec\theta\; p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')}$.    (9)
Finally, the posterior on the underlying population parameters $\vec\lambda'$ is given by substituting equation (9) into equation (1):

$p(\vec\lambda'|\{\vec d_i\}) = \frac{\pi(\vec\lambda')}{p(\{\vec d_i\})} \prod_{i=1}^{N_{\rm obs}} \frac{\frac{1}{S_i}\sum_{j=1}^{S_i} \frac{p_{\rm pop}({}^j\vec\theta_i|\vec\lambda')}{\pi({}^j\vec\theta_i)}\, p(\vec d_i)}{\int \mathrm{d}\vec\theta\; p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')} \propto \pi(\vec\lambda') \prod_{i=1}^{N_{\rm obs}} \frac{\frac{1}{S_i}\sum_{j=1}^{S_i} \frac{p_{\rm pop}({}^j\vec\theta_i|\vec\lambda')}{\pi({}^j\vec\theta_i)}}{\int \mathrm{d}\vec\theta\; p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')}$.    (10)

Of course, if interested in the distribution of a single parameter, we can marginalise over Eq. (10) in the usual way, by integrating over the remaining parameters.
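Putting the pieces together, the shape-only posterior of Eq. (10) can be sketched end to end for a one-dimensional toy problem. The Gaussian population, flat single-event prior, hypothetical sigmoid selection function, and mock posterior samples below are all illustrative assumptions, not the paper's worked example:

```python
import math, random

# End-to-end sketch of Eq. (10) in 1-D: for each event, average
# p_pop over its posterior samples (flat single-event prior), then
# divide by a Monte-Carlo estimate of the selection normalisation
# alpha.  A flat prior on the population mean mu is assumed.

def log_posterior(mu, event_samples, sigma_pop=1.0, n_mc=100_000, seed=7):
    rng = random.Random(seed)

    def p_pop(t):
        return math.exp(-0.5 * ((t - mu) / sigma_pop) ** 2) / (
            sigma_pop * math.sqrt(2.0 * math.pi))

    def p_det(t):  # hypothetical selection function
        return 1.0 / (1.0 + math.exp(-(t - 1.0)))

    # alpha(mu): Monte-Carlo estimate of ∫ p_det p_pop dtheta.
    a = sum(p_det(rng.gauss(mu, sigma_pop)) for _ in range(n_mc)) / n_mc

    log_l = 0.0
    for samples in event_samples:
        num = sum(p_pop(t) for t in samples) / len(samples)
        log_l += math.log(num) - math.log(a)
    return log_l

rng = random.Random(3)
events = [[rng.gauss(2.0, 0.3) for _ in range(1000)] for _ in range(5)]
# Events observed near theta ~ 2 favour mu near 2 over mu = -2,
# after correcting for the selection bias toward large theta.
print(log_posterior(2.0, events) > log_posterior(-2.0, events))
```

A real analysis would evaluate this posterior with a stochastic sampler over the full population-parameter space, but the structure is the same.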
(9)Finally, the posterior on the underlying population pa-rameters (cid:126)λ (cid:48) is given by substituting equation (9) into equa-tion (1): p ( (cid:126)λ (cid:48) |{ (cid:126)d i } ) = π ( (cid:126)λ (cid:48) ) p ( { (cid:126)d i } ) N obs (cid:89) i =1 1 S i (cid:80) S i j =1 p pop ( j (cid:126)θ i | (cid:126)λ (cid:48) ) π ( (cid:126)θ ) p ( (cid:126)d i ) (cid:82) d (cid:126)θp det ( (cid:126)θ ) p pop ( (cid:126)θ | (cid:126)λ (cid:48) )= π ( (cid:126)λ (cid:48) ) N obs (cid:89) i =1 1 S i (cid:80) S i j =1 p pop ( j (cid:126)θ i | (cid:126)λ (cid:48) ) π ( (cid:126)θ ) (cid:82) d (cid:126)θp det ( (cid:126)θ ) p pop ( (cid:126)θ | (cid:126)λ (cid:48) ) . (10)Of course, if interested in the distribution of a singleparameter, we can marginalise over Eq. (10) in the usualway, by integrating over the remaining parameters.We have so far described inference based on the shape MNRAS000 , 1–9 (2018) of the distribution p pop ( (cid:126)θ | (cid:126)λ (cid:48) ) while ignoring the overall nor-malisation. This is appropriate when the overall normalisa-tion on the population counts is not interesting, or when thebulk of the information comes from the distribution prop-erties rather than the detection rate (a single data point).This is a reasonable assumption in the gravitational-wavecontext, where the astrophysical uncertainty on the rates ofcompact object mergers covers several order of magnitude.While inferring the rate is of great interest, models may notpredict it with sufficient precision for that measurement tohave strong constraining power.In contexts in which the expected number of detections N det can be predicted, this can be readily included in theframework. The probability of observing k events is givenby the Poisson distribution p ( k | N det ) = e − N det ( N det ) N obs . (11)Here, the usual N obs ! 
term in the denominator is absent because the events are distinguishable by their data; in any case, as a normalisation term that depends on the data only, it would not impact inference on $\vec\lambda$. The expected number of detections once selection effects are included is (cf. Eq. (23)):
$$N_{\rm det}(\vec\lambda) \equiv \int_{\vec d > {\rm threshold}} d\vec d\; d\vec\theta\; p(\vec d|\vec\theta)\, \frac{dN}{d\vec\theta}(\vec\lambda) = N\alpha(\vec\lambda'). \quad (12)$$
The posterior on the population parameters with the rate included becomes
$$p(\vec\lambda', N\,|\,\{\vec d_i\}) = \pi(\vec\lambda')\,\pi(N) \left[\prod_{i=1}^{N_{\rm obs}} \frac{\frac{1}{S_i}\sum_{j=1}^{S_i} \frac{p_{\rm pop}({}^j\vec\theta_i|\vec\lambda')}{\pi({}^j\vec\theta_i)}}{\int d\vec\theta\; p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')}\right] \times e^{-N_{\rm det}}\,(N_{\rm det})^{N_{\rm obs}}. \quad (13)$$
Note that if a prior $\pi(N) \propto 1/N$ is assumed on the intrinsic event number or rate (Fishbach et al. 2018), equation (13) can be marginalised over $N$ to again yield Eq. (10) up to a normalisation constant, which depends only on the number of observed events and would not impact inference on model parameters:
$$\int dN\; \frac{\pi(\vec\lambda')}{N} \left[\prod_{i=1}^{N_{\rm obs}} \frac{\sum_{j=1}^{S_i} \frac{p_{\rm pop}({}^j\vec\theta_i|\vec\lambda')}{\pi({}^j\vec\theta_i)}}{S_i\,\alpha(\vec\lambda')}\right] e^{-N\alpha(\vec\lambda')} \left(N\alpha(\vec\lambda')\right)^{N_{\rm obs}} = (N_{\rm obs}-1)!\;\pi(\vec\lambda') \prod_{i=1}^{N_{\rm obs}} \frac{\sum_{j=1}^{S_i} \frac{p_{\rm pop}({}^j\vec\theta_i|\vec\lambda')}{\pi({}^j\vec\theta_i)}}{S_i\,\alpha(\vec\lambda')}, \quad (14)$$
where we used
$$\int dN\; N^{-1}\, e^{-N\alpha(\vec\lambda')} \left(N\alpha(\vec\lambda')\right)^{N_{\rm obs}} = \int dN_{\rm det}\; e^{-N_{\rm det}}\, N_{\rm det}^{N_{\rm obs}-1} = \Gamma(N_{\rm obs}) = (N_{\rm obs}-1)!\,. \quad (15)$$
Alternatively, we consider a top-down calculation.
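Both the expected-detections bookkeeping of Eq. (12) and the Gamma-function integral of Eq. (14) can be spot-checked numerically. A Python sketch (the toy model at the end, a standard-normal population with additive Gaussian noise and a zero data threshold, is a purely illustrative choice of ours):

```python
import math
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(42)

def expected_detections(n_total, draw_theta, draw_data, is_detected, n_mc=200_000):
    """Monte Carlo estimate of N_det = N alpha (Eq. 12): draw parameters from
    the normalised population shape, simulate data from p(d | theta), and
    count the detected fraction."""
    theta = draw_theta(n_mc)
    return n_total * np.mean(is_detected(draw_data(theta)))

def n_marginalisation_check(n_obs, alpha):
    """The integral over N of N^{-1} e^{-N alpha} (N alpha)^{n_obs} used in
    Eq. (14); it should equal Gamma(n_obs) = (n_obs - 1)! for any alpha > 0."""
    # rewrite N^{-1} (N alpha)^{n_obs} as alpha^{n_obs} N^{n_obs - 1},
    # which is regular at N = 0
    f = lambda n: math.exp(-n * alpha) * alpha ** n_obs * n ** (n_obs - 1)
    return quad(f, 0.0, math.inf)[0]

# illustrative toy: theta ~ N(0, 1), d = theta + N(0, 1), detected iff d > 0,
# so by symmetry half of a population of 100 objects is detected on average
n_det = expected_detections(
    100.0,
    lambda n: rng.normal(size=n),
    lambda t: t + rng.normal(size=t.size),
    lambda d: d > 0.0,
)
```

The Monte Carlo estimate converges to $N\alpha = 50$ for this toy, and the integral check returns $(N_{\rm obs}-1)!$ independently of $\alpha$.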
If we have observed a representative sample from the population (i.e. a "fair draw"), then the appropriate (unnormalised) joint distribution for the parameters $\{\vec\theta_i\}_{i=1}^{N_{\rm total}}$ and observations $\{\vec d_i\}$ of the $i = 1, \ldots, N_{\rm total}$ objects given the parameters $\vec\lambda$ describing the population (again, $\vec\lambda$ are all parameters describing the population, including the rate, while $\vec\lambda'$ are parameters that only describe the shape of the population) is
$$\pi\left(\{\vec\theta_i\}, \{\vec d_i\}\,\middle|\,\vec\lambda\right) \propto \left[\prod_{i=1}^{N_{\rm total}} p(\vec d_i|\vec\theta_i)\, \frac{dN}{d\vec\theta_i}(\vec\lambda)\right] \exp\left[-N(\vec\lambda)\right], \quad (16)$$
where
$$N(\vec\lambda) \equiv \int d\vec d\; d\vec\theta\; p(\vec d|\vec\theta)\, \frac{dN}{d\vec\theta}(\vec\lambda) \quad (17)$$
is the expected number of objects in the population. This is the standard likelihood for a hierarchical analysis of an inhomogeneous Poisson process (Loredo & Wasserman 1995; Hogg et al. 2010; Mandel 2010; Youdin 2011; Foreman-Mackey et al. 2014; Farr et al. 2015; Barrett et al. 2018).

If some objects are classed as "observable" (indexed by $i$) and others as "un-observable" (indexed by $j$), the complete set of observations partitions into two subsets of cardinality $N_{\rm obs}$ and $N_{\rm nobs}$:
$$\pi\left(\{\vec\theta_i\}, \{\vec\theta_j\}, \{\vec d_i\}, \{\vec d_j\}\,\middle|\,\vec\lambda\right) \propto \left[\prod_{i=1}^{N_{\rm obs}} p(\vec d_i|\vec\theta_i)\, \frac{dN}{d\vec\theta_i}(\vec\lambda)\right] \times \left[\prod_{j=1}^{N_{\rm nobs}} p(\vec d_j|\vec\theta_j)\, \frac{dN}{d\vec\theta_j}(\vec\lambda)\right] \exp\left[-N(\vec\lambda)\right]. \quad (18)$$
Again, a key point is that we can perform this partitioning simply by examining the data obtained for each object.

It is common for the data associated with "non-observable" objects to be completely censored; that is, they often do not appear in a catalog or otherwise at all. In this case, it is appropriate to marginalise over the parameters and (unknown) data for the "non-observable" objects. Doing so destroys the distinguishability inherent in the inhomogeneous Poisson distribution, so we must introduce a factor of $N_{\rm nobs}!$ to account for the over-counting:
$$\pi\left(\{\vec\theta_i\}, \{\vec d_i\}, N_{\rm nobs}\,\middle|\,\vec\lambda\right) \propto \left[\prod_{i=1}^{N_{\rm obs}} p(\vec d_i|\vec\theta_i)\, \frac{dN}{d\vec\theta_i}(\vec\lambda)\right] \times \frac{N_{\rm ndet}^{N_{\rm nobs}}(\vec\lambda)}{N_{\rm nobs}!} \exp\left[-N(\vec\lambda)\right], \quad (19)$$
where
$$N_{\rm ndet}(\vec\lambda) \equiv \int_{\{\vec d\,|\,\text{non-detection}\}} d\vec d\; d\vec\theta\; p(\vec d|\vec\theta)\, \frac{dN}{d\vec\theta}(\vec\lambda) \quad (20)$$
is the expected number of non-detections in the population model. Stopping here, we would have a model similar to the ones discussed in Messenger & Veitch (2013) (though that reference did not discuss rate estimation); however, it is common to not even know how many non-detected objects there were in a given survey or data set. (The rationale for writing Eq. (17) as a double integral, when the integral over $\vec d$ is in fact trivial – since the likelihood is normalised over $\vec d$ – will become apparent below.) In this case we must marginalise – sum, since counting is a discrete operation –
over the unknown number of non-detections, $N_{\rm nobs}$, yielding
$$\pi\left(\{\vec\theta_i\}, \{\vec d_i\}\,\middle|\,\vec\lambda\right) \propto \left[\prod_{i=1}^{N_{\rm obs}} p(\vec d_i|\vec\theta_i)\, \frac{dN}{d\vec\theta_i}(\vec\lambda)\right] \exp\left[-\left(N(\vec\lambda) - N_{\rm ndet}(\vec\lambda)\right)\right], \quad (21)$$
or
$$\pi\left(\{\vec\theta_i\}, \{\vec d_i\}\,\middle|\,\vec\lambda\right) \propto \left[\prod_{i=1}^{N_{\rm obs}} p(\vec d_i|\vec\theta_i)\, \frac{dN}{d\vec\theta_i}(\vec\lambda)\right] \exp\left[-N_{\rm det}(\vec\lambda)\right], \quad (22)$$
where $N_{\rm det}$ – the complement of $N_{\rm ndet}$ – is the expected number of detections under the population model:
$$N_{\rm det}(\vec\lambda) \equiv \int_{\{\vec d\,|\,\text{detection}\}} d\vec d\; d\vec\theta\; p(\vec d|\vec\theta)\, \frac{dN}{d\vec\theta}(\vec\lambda) = \int d\vec d\; d\vec\theta\; I(\vec d)\, p(\vec d|\vec\theta)\, \frac{dN}{d\vec\theta}(\vec\lambda). \quad (23)$$
This equation is the posterior for a hierarchical analysis of the number density and properties of objects from a data set subject to selection effects (e.g. Gair et al. 2010; Youdin 2011; Fishbach et al. 2018; Wysocki et al. 2018).

This is the same result we derived in §
3. Each multiplicative term in the numerator of Eq. (13) from §3 is $\int d\vec\theta\; p(\vec d_i|\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')$, approximated as a Monte Carlo sum over the posterior samples. The denominator of Eq. (13) is $\alpha^{N_{\rm obs}}$. Meanwhile, $\alpha = N_{\rm det}/N$ according to equation (12), which is identical to equation (23) from this section. With the substitution $p_{\rm pop}(\vec\theta|\vec\lambda') = (dN/d\vec\theta)/N$, the entire fraction in Eq. (13) is identical to the first term of Eq. (22) divided by $N_{\rm det}^{N_{\rm obs}}$, which cancels the last term of Eq. (13). Thus, we see that equations (13) and (22) are equivalent up to the choice of priors.

As in §
3, if we re-parameterise $\frac{dN}{d\vec\theta}$ so that we can write
$$\frac{dN}{d\vec\theta} \equiv N\, p\left(\vec\theta\,\middle|\,\vec\lambda'\right) \quad (24)$$
with $p(\vec\theta|\vec\lambda')$ integrating to 1 over the population for any value of the new parameters $\vec\lambda'$, impose a prior $p(N) \propto 1/N$, and marginalise over $N$, we arrive at the treatment of selection functions for estimating population distributions from Loredo (2004) and Abbott et al. (2016). This correspondence only holds with a $1/N$ scale-invariant prior on the number of objects in the population (see Fishbach et al. (2018) and Eq. (14) above); other priors are, of course, possible, but will not marginalise to the population analysis above.

Note that the commonly-employed technique of modifying $\frac{dN}{d\vec\theta}$ to account for the selection function is not correct, and will lead to biased results as long as the selection depends only on the observed data.

It is natural to ask how many events must be observed before the incorrect treatment of selection effects starts to influence the results. Any incorrect analysis, i.e., writing down a posterior distribution that is not consistent with the true data-generating process, will lead to a bias in the result and might also change the inferred posterior uncertainty. Asymptotically, the bias remains constant while the uncertainty decreases as the inverse square root of the number of events. Therefore, after sufficient observations have been made, the result will be inconsistent with the true parameter values. The number of events that can be observed before the bias becomes important depends both on which particular "wrong method" is being used and on the specific problem under consideration. One plausible wrong method is to ignore selection effects completely, but more often selection effects are included in an incorrect way.
For example, one might write down the likelihood for an individual detected event as
$$\int p(\vec d, {\rm det}\,|\,\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')\, d\vec\theta,$$
which acknowledges that we have only used detected events (indicated by the flag "det"). Then an incorrect assumption is made that the specific data-generation process and the question of whether or not the event is detected are independent, so that the first term can be factorised as
$$\int p(\vec d|\vec\theta)\, p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')\, d\vec\theta.$$
This differs from the true result in two ways: the normalisation term $1/\alpha(\vec\lambda')$ is missing, and there is an extra factor of $p_{\rm det}(\vec\theta)$ in the numerator.

A slightly more astute practitioner might realise that the selection bias modifies the probability distribution for the parameters of observed events, so that this becomes
$$p(\vec\theta\,|\,{\rm det}, \vec\lambda') = \frac{p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')}{\alpha(\vec\lambda')},$$
but then fail to also condition the likelihood $p(\vec d|\vec\theta)$ on detection, and use
$$\frac{1}{\alpha(\vec\lambda')} \int p(\vec d|\vec\theta)\, p_{\rm det}(\vec\theta)\, p_{\rm pop}(\vec\theta|\vec\lambda')\, d\vec\theta,$$
which includes the correct normalisation factor but still has the additional $p_{\rm det}(\vec\theta)$ in the numerator. In this latter case, the differences will only become apparent once a sufficient number of events with $p_{\rm det}(\vec\theta)$ significantly different from 1 have been observed.
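The size of this bias is easy to demonstrate numerically. A toy Python sketch of our own construction (a Gaussian population with unknown mean, Gaussian measurement noise, and a hard data threshold; none of these choices come from the text), comparing the correct likelihood with the second wrong method above (correct normalisation $\alpha$, but an extra $p_{\rm det}(\vec\theta)$ factor):

```python
import numpy as np
from scipy.stats import norm

# illustrative toy: population theta ~ N(mu, 1) with unknown mu; data
# d = theta + N(0, 1); an event is "detected" iff d > 1, so that
# p_det(theta) = P(d > 1 | theta) = Phi(theta - 1)
rng = np.random.default_rng(1)
MU_TRUE, D_TH = 0.0, 1.0
theta = rng.normal(MU_TRUE, 1.0, size=8_000)
d_all = theta + rng.normal(size=theta.size)
d = d_all[d_all > D_TH]                    # the Malmquist-biased catalogue

th = np.linspace(-6.0, 8.0, 701)           # integration grid over theta
dth = th[1] - th[0]
p_det_th = norm.sf(D_TH - th)              # p_det(theta)

def log_posterior(mu, extra_pdet_factor):
    pop = norm.pdf(th, mu, 1.0)
    alpha = np.sum(p_det_th * pop) * dth   # selection normalisation alpha(mu)
    like = norm.pdf(d[:, None], th[None, :], 1.0)
    if extra_pdet_factor:                  # the erroneous extra p_det(theta)
        like = like * p_det_th
    per_event = np.sum(like * pop, axis=1) * dth / alpha
    return np.sum(np.log(per_event))

mus = np.linspace(-1.0, 2.0, 121)
mu_correct = mus[np.argmax([log_posterior(m, False) for m in mus])]
mu_wrong = mus[np.argmax([log_posterior(m, True) for m in mus])]
```

In this toy the wrong method overestimates the population mean by far more than the statistical uncertainty of the catalogue, while the correct likelihood recovers it.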
The number of events required would scale like the inverse of the fraction of the observable parameter space where selection effects are important, although the exact number of events would also depend on how much information those events contain about $\vec\lambda'$, i.e., how much the properties of those events depend on the properties of the population.

In the former case, every event contributes to a mistake in inference, as the factor $1/\alpha(\vec\lambda')$ is also missing. The number of events required before the error becomes apparent will then depend on how strongly this factor varies with the population parameters, which depends on the particular inference problem. For example, in the case of inferring the slope of the black hole mass function from binary black hole mergers observed by LIGO, this would be a strong effect: shallower mass functions give more higher-mass events, which are visible to greater distances, so that a higher proportion of the total population lies within the LIGO detector horizon (see, for example, Fishbach & Holz 2017). However, in the case of inferring the Hubble constant using binary neutron star observations with counterparts, the natural prior on the distance distribution is uniform in comoving volume and, since mass redshifting and non-Euclidean cosmological corrections are negligible within the current LIGO horizon, the selection effect is largely independent of the Hubble constant (Abbott et al. 2017b). To be concrete, in the example that will be described in the next section, we repeated the analysis using the former of these wrong methods (as a worst-case scenario), and we show the results of that analysis as dashed lines in Figure 3. That figure shows the probability-probability plot, i.e., the fraction of times the true parameters lie at a particular significance level over many experiments.
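Collecting the quantiles that underlie such a p-p plot is simple once the mock posteriors are in hand; a minimal sketch (the self-consistent Gaussian toy at the end is purely illustrative):

```python
import numpy as np

def pp_quantiles(true_values, posterior_sample_sets):
    """Quantile of each true parameter value within its own posterior;
    for self-consistent inference these quantiles are uniform on [0, 1],
    so their empirical CDF (the p-p plot) follows the diagonal."""
    return np.array([np.mean(samples < truth)
                     for truth, samples in zip(true_values, posterior_sample_sets)])

# illustrative self-consistent toy: truths and "posterior" samples are
# drawn from the same distribution, so the quantiles should be uniform
rng = np.random.default_rng(3)
truths = rng.normal(size=200)
posteriors = [rng.normal(size=1000) for _ in truths]
q = pp_quantiles(truths, posteriors)
```

Plotting the sorted `q` against their ranks gives the p-p curve; a biased analysis shows up as a systematic departure from the diagonal.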
For true and modelled distributions that are both Gaussians with a common width $\sigma$ but means that differ by a bias $b$, the amount by which the p-p plot deviates from the diagonal depends on $b/\sigma$ (see discussion in Gair & Moore 2015). We see that, for that specific example, with 10 events the bias is already evident in the p-p plot, but at a level consistent with $b/\sigma < 1$; by 100 events the accumulated bias corresponds to $b/\sigma \sim$ a few, so the result will be appreciably biased. These numbers are for a specific problem, and the threshold for inclusion of selection effects to avoid bias will vary from problem to problem. It is therefore important to always include selection effects properly in the analysis, unless there is a good reason to believe that they can be ignored, which typically could only be assessed by doing the analysis including selection effects anyway.

Measuring a luminosity function from a flux-limited survey is a classic problem in astronomy that deals with selection effects (see, e.g., Malmquist 1922). Here we apply the method discussed in the previous sections to a toy-model, but illustrative, version of this problem. Suppose the luminosity function of our objects can be modeled by a Schechter function (Schechter 1976):
$$\frac{dN}{dL} = \frac{\Lambda}{L_*^{1+\alpha}\,\Gamma(1+\alpha)}\, L^{\alpha} \exp\left[-\frac{L}{L_*}\right], \quad (25)$$
with $\alpha > -1$ and $L_* > 0$. We suppose that each object's luminosity is measured with a fractional uncertainty $\sigma_L \simeq$
5% uncertainty and that the measurement process results in a log-normal likelihood function:
$$p(L_{\rm obs}|L) = \frac{1}{\sqrt{2\pi}\,\sigma_L L_{\rm obs}} \exp\left[-\frac{\left(\log L - \log L_{\rm obs}\right)^2}{2\sigma_L^2}\right]. \quad (26)$$
We assume a Euclidean universe, so in appropriate units a flux limit for detection of $F_{\rm th}$ implies a probability of detection for an observed luminosity of
$$P_{\rm det}(L_{\rm obs}) = \begin{cases} 1 & \dfrac{L_{\rm obs}}{4\pi z^2} > F_{\rm th}, \\ 0 & \text{otherwise}, \end{cases} \quad (27)$$
where $z$ is the redshift (distance) to the object. For computational efficiency, we assume that our objects are uniformly distributed in $z$ for $0 \leq z \leq 1$. We take $L_* = 1$, $\alpha = -1/2$, and $F_{\rm th} = 1/(4\pi)$; this latter choice means that the detection probability for an $L_*$ object at $z = 1$ is 50%. For these choices, one draw of a random universe produces the distribution of observed and true luminosities shown in Figure 1. In this particular draw, we observed 24 objects and missed 80 in our survey.

Figure 1. The distribution of observed (blue) and true (orange) luminosities for a draw from the model discussed in Section 6. Due to selection effects, the distribution of observed luminosities peaks at higher luminosity and falls more rapidly at low luminosity than the true distribution of sources.

Applying the "top-down" methodology to this problem, the crucial integral in Eq. (23) is not analytically tractable, even though both the population distribution and the selection function are simple functions. We must evaluate this integral numerically. We choose to do so by sampling over the un-observed population and associated data (subject to the constraint that the fluxes associated with the un-observed population are always below $F_{\rm th}$) in an MCMC, at the same time as we sample the properties of the population and observed objects. That is, we explicitly implement Eq. (18) as our posterior density, summing over the (unknown) number of non-detected systems. Sampling over the un-observed population with this posterior is a method for numerically evaluating the selection integral. Code and results implementing this model in the stan sampler (Carpenter et al. 2017) can be found at https://github.com/farr/SelectionExample .
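The generative side of this toy survey is compact enough to sketch directly. A minimal Python draw of one random universe, assuming $\alpha = -1/2$, $L_* = 1$, $F_{\rm th} = 1/(4\pi)$, $\sigma_L = 5\%$, and an illustrative expected count $\Lambda = 100$ (the value of $\Lambda$ is not stated in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# A Schechter density L^alpha exp(-L/L_*) is a Gamma(1 + alpha) distribution
# in L, so with alpha = -1/2 and L_* = 1 luminosities can be drawn directly.
n = rng.poisson(100)                                 # Lambda = 100 (assumed)
L = rng.gamma(shape=0.5, scale=1.0, size=n)          # true luminosities
z = rng.uniform(0.0, 1.0, size=n)                    # uniform in z
L_obs = L * rng.lognormal(mean=0.0, sigma=0.05, size=n)  # 5% log-normal noise
# flux limit: L_obs / (4 pi z^2) > F_th = 1/(4 pi), i.e. simply L_obs > z^2
detected = L_obs > z**2
```

Histogramming `L_obs[detected]` against `L` reproduces the qualitative behaviour of Figure 1: the detected sample is systematically brighter than the underlying population.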
Figure 2. Marginal posterior distribution for the $L_*$ and $\alpha$ parameters of the luminosity function (see Eq. (25)) from the model and data described in Section 6. Black lines indicate the true values of the parameters.

One result of the sampling is an estimate of the luminosity function parameters $L_*$ and $\alpha$; a joint posterior on these parameters appears in Figure 2. The analysis also recovers with similar accuracy the expected number of objects in the survey volume ($\Lambda$), improved estimates of each object's intrinsic luminosity (informed by the population model), and the luminosity distributions of the set of objects too dim to be observed by the survey, as a by-product of the selection-function modelling.

Do all or the majority of merging binary neutron stars have mass ratios very close to unity? Is the answer to this question redshift- or metallicity-dependent? This question is an important science driver for third-generation gravitational-wave detectors. (This was identified as a key goal during an ongoing study commissioned by the Gravitational Wave International Committee (GWIC), https://gwic.ligo.org/3Gsubcomm/charge.shtml .) Here, we examine how many neutron star binary mergers must be detected in order to measure the mass-ratio distribution, providing an illustration of the methodology described in the previous sections.

The binary neutron star mass ratio distribution is sensitive to the mass ejections associated with neutron star formation in a supernova and the velocity kicks that neutron stars receive at birth. For example, figure 3 of Vigna-Gómez et al. (2018) illustrates the differences in the mass ratio distributions under different assumptions about mass fallback and natal kicks. Since models show a preference for equal mass ratios $q = m_2/m_1$, we assume a simple single-parameter form for the intrinsic mass ratio distribution:
$$p(\eta) \propto \exp\left[-\frac{(\eta - 0.25)^2}{2\lambda^2}\right], \quad (28)$$
where $\eta = q/(1+q)^2$ is the symmetric mass ratio.
We use the symmetric mass ratio because it tends to have more symmetric error bars than $q$; when the component masses are equal, $q = 1$ and $\eta = 0.25$. Inference on the population requires the single-event likelihood $p(\vec d|\eta)$. We will approximate the problem by viewing the data as a point estimate of the symmetric mass ratio $\hat\eta$ (one can think of it as a maximum-likelihood estimate) with a Gaussian likelihood function given by
$$p(\hat\eta|\eta) \propto \exp\left[-\frac{(\hat\eta - \eta)^2}{2\sigma_\eta^2}\right]. \quad (29)$$
We use a simple Fisher-information-matrix analysis with a noise power spectral density shape representative of a potential third-generation detector to estimate the expected measurement uncertainty $\sigma_\eta$. We follow Poisson & Will (1995) in using frequency-domain post-Newtonian waveforms, which can be analytically differentiated and are adequate for binary neutron star analysis, allowing us to rapidly estimate the accuracy of inference. We do not impose priors, and we include a spin-orbit coupling term but ignore the spin-spin coupling term, as suggested by Poisson & Will (1995). We derive the following simple fit to the measurement uncertainty on $\eta$ for a signal from a canonical $1.4 + 1.4\,M_\odot$ binary with non-spinning components, as a function of the event signal-to-noise ratio $\rho$:
$$\sigma_\eta = \frac{0.\ldots}{\rho} + \frac{4}{\rho^{2}} + \frac{250}{\rho^{3}}. \quad (30)$$
This fit is accurate to better than 10% for $\rho >$
18. The inverse of the Fisher information matrix is no longer a good estimate of the covariance matrix at lower values of $\rho$, where the linear-signal approximation breaks down, the log-likelihood ceases to be well approximated by a quadratic (Vallisneri 2008), and the prior constraints on variables strongly correlated with $\eta$, such as the spin parameters, become increasingly important. In any case, the mass ratio constraints become very poor at low $\rho$; for example, despite a $\rho$ of 32, the mass ratio of the binary neutron star merger GW170817 could only be constrained to $q \in [0.7,$
1] at 90% confidence (Abbott et al. 2017a).

The signal-to-noise ratio at a given distance scales as $\mathcal{M}_c^{5/6}$, where $\mathcal{M}_c \propto \eta^{3/5}$ is the chirp mass, consistent with the inspiral amplitude scaling. We assume that the distance $D$ to the event is drawn from a $p(D) \propto D^2$ distribution, consistent with a flat, isotropic universe, and is known perfectly. With this simplification, the signal-to-noise ratio as used in Eq. (30) follows
$$\rho \propto \eta^{1/2}\, D^{-1}. \quad (31)$$
(We assume that the noise spectral density is proportional to the LIGO A+ design, https://dcc.ligo.org/LIGO-T1800042/public .)
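The mock-catalogue generation implied by Eqs. (28) and (31) can be sketched in a few lines of Python; the SNR normalisation `rho_ref` (the SNR of an equal-mass source at the maximum distance) and the unit distance scale are illustrative assumptions of ours, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(7)

def mock_detected_etas(lam, n=20_000, rho_ref=12.0, rho_th=18.0):
    """Draw a mock population from Eq. (28) and apply the SNR selection of
    Eq. (31); returns (all true etas, detected true etas)."""
    # population of Eq. (28), truncated to the physical range 0 < eta <= 0.25
    eta = rng.normal(0.25, lam, size=4 * n)
    eta = eta[(eta > 0.0) & (eta <= 0.25)][:n]
    # p(D) proportional to D^2 on (0, 1] via inverse-CDF sampling
    D = rng.uniform(size=eta.size) ** (1.0 / 3.0)
    rho = rho_ref * np.sqrt(eta / 0.25) / D      # Eq. (31) scaling
    return eta, eta[rho >= rho_th]
```

Because $\rho \propto \eta^{1/2}$, the detected subsample is biased toward more nearly equal-mass (higher-$\eta$) systems, which is exactly the selection effect the hierarchical analysis must undo.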
Figure 3.
The p-p plot of the cumulative distribution of the quantile of the true value of $\lambda$ within its posterior, as estimated from 10 (solid orange curve) and 100 (solid blue curve) mock data sets. These are consistent with the diagonal (dashed black line). For comparison we show the corresponding results, as dashed lines, from using one particular wrong method, as described in Section 5.

The observed signal-to-noise ratio $\hat\rho$ follows the same scaling, but with the dependence on the data $\hat\eta$, not the true event mass ratio $\eta$. In line with the comments above on the validity of the Fisher information matrix, we will use a detection threshold $\hat\rho \geq$
18 in this simplified treatment; the detectability conditioned on the observed data is thus independent of the source properties.

We test the self-consistency of the inference on $\lambda$ by creating 100 mock populations with random values of $\lambda$ drawn from a flat prior, and computing the posterior on $\lambda$ following the methodology described above. We then ask for the quantile of the true value of $\lambda$ within this posterior. Figure 3 shows the cumulative distribution of this quantile value, the so-called p-p plot. If posteriors are self-consistent, we expect the truth to fall within the X% Bayesian credible interval X% of the time, i.e., the p-p plot should be diagonal (e.g., Cook et al. 2006; Sidery et al. 2014; Veitch et al. 2015). We confirm that the p-p plot is consistent with the diagonal within statistical fluctuations.

Having tested the method and its implementation, we now analyse the uncertainty in the inferred value of $\lambda$. This time, we fix the value of $\lambda$ at $\lambda = 0.$
05 when generating mock data catalogs, but vary the number of simulated events, with a subset of the events labeled as detectable. We compute the width of the 90% credible interval on $\lambda$, defined here as stretching from the 5th to the 95th percentile of the posterior. In Figure 4, we plot this width $\Delta\lambda$ against the number of detectable events.

We find that $\sim 1000$ events with $\hat\rho \geq$
18 are necessary in order to measure $\lambda$ to an accuracy $\delta\lambda \approx 0.$
01. Distributions with $\lambda = 0.01$ and $\lambda = 0.02$ yield median values of $\eta$ ($q$) of 0.243 (0.71) and 0.236 (0.62), respectively, so at least a thousand detections are required in order to make meaningful inference on the mass ratio distribution with a view to distinguishing evolutionary models. An even greater number of detections would be required in each of several redshift bins in order to search for redshift-dependent changes in the mass ratio distribution – perhaps $O(10000)$, given the plausible variation of the mass ratio distribution with redshift.

Figure 4. The width of the 90% credible interval $\Delta\lambda$ as a function of the number of detections; the true value of $\lambda$ is 0.05 in all mock catalogs. The fluctuations relative to the $\Delta\lambda \propto N^{-1/2}$ trend are due to the stochastic nature of the detected sample.

ACKNOWLEDGMENTS
IM and WF thank Tom Loredo for useful discussions and the Statistical and Applied Mathematical Sciences Institute, partially supported by the National Science Foundation under Grant DMS-1127914, for hospitality. IM's work was performed in part at the Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611; IM's visit there was partially supported by a grant from the Simons Foundation. IM thanks Stephen Justham, Vicky Kalogera, and Fred Rasio for discussions related to the illustrative example. We thank Arya Farahi for alerting us to a typo in the manuscript, and the anonymous referee for a number of insightful comments.
REFERENCES
Abbott B. P., et al., 2016, Physical Review X, 6, 041015
Abbott B. P., et al., 2017a, Physical Review Letters, 119, 161101
Abbott B. P., et al., 2017b, Nature, 551, 85
Barrett J. W., Gaebel S. M., Neijssel C. J., Vigna-Gómez A., Stevenson S., Berry C. P. L., Farr W. M., Mandel I., 2018, MNRAS, 477, 4685
Bovy J., Hogg D. W., Roweis S. T., 2011, Annals of Applied Statistics, 5, 1657
Carpenter B., et al., 2017, Journal of Statistical Software, Articles, 76, 1
Chennamangalam J., Lorimer D. R., Mandel I., Bagchi M., 2013, MNRAS, 431, 874
Cook S. R., Gelman A., Rubin D. B., 2006, Journal of Computational and Graphical Statistics, 15, 675
Farr W. M., Mandel I., 2018, Science, 361, aat6506
Farr W. M., Sravan N., Cantrell A., Kreidberg L., Bailyn C. D., Mandel I., Kalogera V., 2011, ApJ, 741, 103
Farr W. M., Mandel I., Aldridge C., Stroud K., 2014, preprint (arXiv:1412.4849)
Farr W. M., Gair J. R., Mandel I., Cutler C., 2015, Phys. Rev. D, 91, 023005
Feeney S. M., Peiris H. V., Williamson A. R., Nissanke S. M., Mortlock D. J., Alsing J., Scolnic D., 2019, Physical Review Letters, 122, 061105
Finn L. S., Chernoff D. F., 1993, Phys. Rev. D, 47, 2198
Fishbach M., Holz D. E., 2017, ApJ, 851, L25
Fishbach M., Holz D. E., Farr W. M., 2018, ApJ, 863, L41
Foreman-Mackey D., Hogg D. W., Morton T. D., 2014, ApJ, 795, 64
Gair J. R., Moore C. J., 2015, Phys. Rev. D, 91, 124062
Gair J. R., Tang C., Volonteri M., 2010, Phys. Rev. D, 81, 104014
Hogg D. W., Myers A. D., Bovy J., 2010, ApJ, 725, 2166
Loredo T. J., 2004, in Fischer R., Preuss R., Toussaint U. V., eds, American Institute of Physics Conference Series Vol. 735, American Institute of Physics Conference Series. pp 195-206 (arXiv:astro-ph/0409387), doi:10.1063/1.1835214
Loredo T. J., Wasserman I. M., 1995, The Astrophysical Journal Supplement Series, 96, 261
Malmquist K. G., 1922, Meddelanden fran Lunds Astronomiska Observatorium Serie I, 100, 1
Malmquist K. G., 1925, Meddelanden fran Lunds Astronomiska Observatorium Serie I, 106, 1
Mandel I., 2010, Phys. Rev. D, 81, 084029
Mandel I., Farr W. M., Gair J. R., 2016, Extracting distribution parameters from multiple uncertain observations with selection biases. https://dcc.ligo.org/LIGO-P1600187/public
Messenger C., Veitch J., 2013, New Journal of Physics, 15, 053027
Poisson E., Will C. M., 1995, Phys. Rev. D, 52, 848
Schechter P., 1976, Astrophysical Journal, 203, 297
Sidery T., et al., 2014, Phys. Rev. D, 89, 084060
Vallisneri M., 2008, Phys. Rev. D, 77, 042001
Veitch J., et al., 2015, Phys. Rev. D, 91, 042003
Vigna-Gómez A., et al., 2018, preprint (arXiv:1805.07974)
Wysocki D., Lange J., O'Shaughnessy R., 2018, preprint (arXiv:1805.06442)
Youdin A. N., 2011, ApJ, 742, 38