Beyond single-threshold searches: the Event Stacking Test
Ryan Lynch,∗ Salvatore Vitale, and Erik Katsavounidis
LIGO Laboratory, Massachusetts Institute of Technology, 185 Albany St, Cambridge, MA 02139, USA
We present a new statistical test that examines the consistency of the tails of two empirical distributions at multiple thresholds. Such distributions are often encountered in counting experiments, in physics and elsewhere, where the significance of populations of events is evaluated. This multi-threshold approach has the effect of "stacking" multiple events into the tail bin of the distribution, and thus we call it the Event Stacking Test. This test has the ability to confidently detect inconsistencies composed of multiple events, even if these events are low-significance outliers in isolation. We derive the Event Stacking Test from first principles and show that the p-value it reports is a well-calibrated representation of noise fluctuations. When applying this test to the detection of gravitational-wave transients in LIGO-Virgo data, we find that it performs better than or comparably to other statistical tests historically used within the gravitational-wave community. This test is particularly well-suited for detecting classes of gravitational-wave transients that are minimally modeled, i.e., gravitational-wave bursts. We show that the Event Stacking Test allows us to set upper limits on the astrophysical rate-density of gravitational-wave bursts that are stricter than those set using other statistical tests by factors of up to 2-3.
I. INTRODUCTION
With the recent detection of gravitational-wave (GW) events such as GW150914 [1] and GW170817 [2], we have entered an era where we expect the detection of GW transients with Advanced LIGO [3] and Virgo [4] to occur on a regular basis. To date, all of the detected GW transients have been emitted by compact binary coalescence (CBC) sources that are suitably well-modeled to be detected by templated searches [5–7]. These detections have enabled rich scientific investigations and breakthroughs. For example, the detections of binary black hole mergers have been used to test Einstein's theory of relativity in the strong-field regime [8]. The joint detection of the binary neutron star merger GW170817 with electromagnetic counterparts [9] has led to the association of short gamma-ray bursts with binary neutron star mergers [10], has provided evidence of heavy-element nucleosynthesis [11–16], and has enabled a new procedure for measuring the Hubble parameter [17]. We expect similar breakthroughs to occur when LIGO-Virgo detects non-CBC sources of GW transients, which we will refer to as GW bursts. Examples of potential GW burst sources include the core-collapse supernovae of massive stars [18, 19], neutron stars collapsing to form black holes [20], neutron star glitches [21, 22], cosmic string cusps [23], and the unknown. While the waveforms of some of these GW burst signals can be at least partially modeled [24–28], the LIGO-Virgo Collaboration has performed multi-algorithm [29–31] searches for GW bursts that are only minimally modeled to ensure that generic transient signals are reliably detected [32–35].

∗ [email protected]

Historically, the GW detection problem has been viewed as a form of outlier/anomaly detection. The signals of GW events are superimposed onto detector noise, meaning the measured event rate in a GW analysis takes the form λ_total = λ_noise + λ_GW. GW searches measure these event rates as a function of a search statistic Λ.
The cumulative rate of events whose measured search statistic exceeds a value Λ, λ(Λ), is then given by

    λ_total(Λ) = λ_noise(Λ) + λ_GW(Λ).   (1)

Typically λ_noise(Λ) ≫ λ_GW(Λ) for small values of Λ, meaning noise events dominate GW events in number. However, these search statistics are constructed specifically so that the GW event rate should dominate at large values of Λ. Thus, most detections are expected to occur in the high-Λ tail, and we refer to events with large values of Λ as "loud". Searches for GW transients have historically evaluated the significance of events in isolation, meaning the significance of each candidate is the Poisson probability of detector noise producing at least one event exceeding the candidate's measured Λ (e.g., see [36]). We will refer to this process as the Loudest Event Test (LET) [37], and we will describe it in further detail in Section III. However, by only considering events that can be detected in isolation, we are ignoring populations of low-significance GW events that may also be located in the data.

For compact-binary coalescence (CBC) events, the astrophysical distribution of sources is believed to be well-enough understood that the GW event rate can be predicted as a function of Λ. With such a model, we can compute the probability that any event, including low-significance ones, is part of an astrophysical population of GW events [38, 39]. These calculations have been performed by the LIGO-Virgo Collaboration in their first (O1) observing run in the Advanced Detector Era [36, 40]. However, it is difficult to do something similar for GW burst sources that are inherently unmodeled. The expected distributions of these sources can range from point-like distributions emitted from matter-dense regions in the Milky Way Galaxy to uniform-in-volume distributions covering most of the observable universe.
Since the generic GW-burst signal morphologies and source distributions are usually assumed to be only minimally modeled, it makes sense to work in terms of null hypothesis tests where only the noise distributions must be well-modeled. The detection statement of a null hypothesis test is that an analysis measurement is inconsistent with the measured background distribution over some region of the Λ parameter space.

In this paper, we derive a null-hypothesis test that we will refer to as the Event Stacking Test (EST). This test is designed to evaluate the statistical consistency of the high-Λ tails of measured analysis and background distributions at k different thresholds. The EST evaluates the joint significance of the k loudest events in the analysis data set by first "stacking" them into tail bins of different widths, and then comparing the statistical properties of these bins to those of the background data set. It reports the probability that the same underlying distribution produced both the analysis and background high-Λ tails. As a result, the EST is able to detect up to k GW events of a population, even if none of them is individually significant enough to be detected on its own.

The EST is the Poissonian analog of the Binomial Test presented in [41]. Both the EST formulation and the Binomial Test evaluate the similarity of the distributional shapes of the analysis and background events. However, the EST also takes into account the relative event rates of the analysis and background measurements, while the Binomial Test normalizes out these event rates. Because GW events increase the event rate of the analysis measurement as compared to the background measurement (see Eq. 1), the EST utilizes more information relevant to the GW detection problem than the Binomial Test. We also note that the p-value of the EST is calculated analytically, while the p-value of the Binomial Test historically has been computed via Monte Carlo [41, 42].
The EST is also similar in nature to the tail-targeted test presented in [43], which examines the joint statistical consistency of all analysis events exceeding a certain threshold value of Λ. We prefer the EST formulation since it always evaluates a known number of events (k) in an unknown region of Λ, as opposed to an unknown number of events in a known region of Λ. For similar configurations where both tests evaluate k events on average, the EST has the advantage of detecting up to k-event inconsistencies above or below the Λ threshold of the referenced test, while the referenced test can detect an arbitrary-event-number inconsistency but only above the Λ threshold. Thus, the EST is more likely to detect low-Λ GW events that would be missed in isolation. Additionally, as long as all confidently detected GW events are removed from the analysis, the EST can be repeated multiple times until statistical consistency is achieved, allowing it to detect more than k GW events in practice.

In Section II we develop the formalism needed to justify the EST and then derive it from first principles. Then in Section III, we compare the EST with several other standard statistical consistency tests, showing that its reported p-value is well-calibrated and that it is particularly powerful in detecting GW burst events. We give a specific demonstration of this detection power in Section IV, where we show that the EST allows us to set upper limits on the astrophysical rate-density of GW burst sources that are stricter than those set using the LET. Finally, we summarize our conclusions in Section V.
II. FORMALISM
We will derive the EST in the context of the GW-transient event detection problem. Within this context, detection is based on the analysis of timeseries data that encode the GW strain measured by multiple interferometers, such as LIGO [3], Virgo [4], and others. A GW signal must be consistent with a single (astrophysical) source across all detectors. Several consistency criteria are invoked in order to identify transient occurrences of these signals within the multi-detector data streams [29, 30, 44–47].

The background of GW searches is commonly measured by artificially shifting the timeseries data of each detector by relative time lags so that GW events within the data are no longer found coincidentally in all of the detectors. The resulting data set is assumed to contain only noise [48]. Because the analysis data set does not undergo any time lags, it is commonly referred to as the 0-lag. The foundation of our data model is that the occurrences of noise events in the 0-lag and background measurements are described by the same underlying Poisson process. Specifically, for any region of the Λ parameter space, the number of events observed (N) over a duration of time (T) will be distributed as a Poisson distribution

    P(N | λ, T) = (1/N!) (λT)^N e^(−λT),   (2)

where λ is the mean event rate in this region.

Consider a GW search that ranks events according to a search statistic Λ, where greater values of Λ correspond to more GW-like events. In order to evaluate the consistency of background and 0-lag measurements made by the GW search at a value of Λ, we can compare the rate of events exceeding a given value of Λ, λ(Λ), in each measurement. An example of this comparison is shown in Fig. 1 as a function of Λ.
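As a concrete illustration of this data model (our own sketch, not code from the paper), the following stdlib-only Python snippet draws Poisson-distributed event sets for a 0-lag and a background measurement and builds their cumulative counts above a grid of thresholds; the rates, durations, and the exponential choice for the distribution of Λ are invented for illustration:

```python
# Illustrative sketch of the Poisson data model (Eq. 2) and the cumulative
# representation lambda(Lambda). All rates, durations, and the exponential
# distribution of the statistic Lambda are assumptions for this example.
import math
import random

def sample_poisson(mean, rng):
    """Draw N ~ Poisson(mean) by CDF inversion (fine for modest means)."""
    n, p, term = 0, rng.random(), math.exp(-mean)
    cdf = term
    while p > cdf:
        n += 1
        term *= mean / n
        cdf += term
    return n

def simulate_events(rate, duration, rng):
    """Poisson number of events, each assigned a statistic Lambda."""
    n = sample_poisson(rate * duration, rng)
    return sorted(rng.expovariate(1.0) for _ in range(n))

rng = random.Random(0)
zero_lag = simulate_events(rate=5.0, duration=1.0, rng=rng)      # T_0 = 1 yr
background = simulate_events(rate=5.0, duration=100.0, rng=rng)  # T_back = 100 yr

thresholds = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [sum(1 for e in zero_lag if e > t) for t in thresholds]
# The cumulative counts must be non-increasing in Lambda (monotonicity).
assert all(a >= b for a, b in zip(counts, counts[1:]))
```

Dividing these counts by the corresponding measurement durations gives the cumulative rate estimates that Fig. 1 compares between the 0-lag and the background.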
Doing this comparison in cumulative fashion gives a nonparametric representation of the measurement data, unlike with differential binning, where the bin sizes and bin locations must be specified. Importantly, this cumulative representation of the data has a salient feature: the event rate must decrease monotonically as a function of Λ. The following sections will explore how we can use this cumulative representation of the data to quantify the level of consistency between the 0-lag and background measurements.

FIG. 1. Examples of 50 0-lag measurements and 1 background measurement drawn from the same noise-only event distribution for a GW burst search (top) and a CBC search (bottom). We sample each realization from a Poisson distribution using measurement durations T_back = 1000 years and T_0 = 1 year and a total event rate of 100 events per year. The lines trace the cumulative rate at which events exceed each value of the search statistic Λ. The noise-only event distributions are estimated using O1 LIGO-Virgo background measurements performed by oLIB [34] (for GW bursts) and PyCBC [36] (for CBC).

A. Single-Threshold Significance
Assuming Poissonity, it is straightforward to evaluate the statistical consistency of any 0-lag and background measurements at a single threshold value of Λ. Because the inconsistencies of interest to us are excesses of events in the 0-lag, the p-value we will calculate is the false-alarm probability (FAP). The FAP at a threshold of Λ is given by

    FAP_Λ(N_0*) = P(N_0 ≥ N_0* | θ⃗) = Σ_{N_0=N_0*}^{∞} P(N_0 | θ⃗) = 1 − Σ_{N_0=0}^{N_0*−1} P(N_0 | θ⃗),   (3)

where N_0 is the number of 0-lag events exceeding Λ, N_0* denotes the minimum value of N_0 that will be considered a significant excess, and θ⃗ represents the full set of conditional parameters that have been measured. In practice, one commonly sets N_0* equal to the number of 0-lag events measured to exceed Λ.

FIG. 2. The DAG describing the probabilistic dependencies of variables in the single-threshold test. The variables are denoted by nodes, and the arrows point from cause to effect. Shaded nodes represent variables that have been measured and have fixed values, while the unshaded nodes represent variables that are unconstrained. The left DAG shows the state of knowledge before any measurements are made, while the right DAG shows the state of knowledge after the relevant measurements are made. The explicit probabilistic dependencies implied by each DAG can be extracted using the theory of D-separation [49, 50].

To evaluate Eq. 3, we must specify θ⃗ and then find the functional form of P(N_0 | θ⃗). We can model the probabilistic dependencies of the parameters involved in these steps using a probabilistic directed acyclic graphical model (DAG). While a detailed formulation of probabilistic DAGs (also commonly referred to as Bayesian Networks or Bayesian Hierarchical Models) can be found in [49, 50], the basic premise is fairly intuitive: nodes represent variables, and arrows describe the cause-and-effect relationship between pairs of variables by pointing from cause to effect.
Unshaded nodes represent free variables, while shaded nodes represent measured variables.

The DAG for this single-threshold problem is shown in Fig. 2. The interpretation of the DAG is as follows: there is an underlying Poisson process with noise event rate λ ≡ λ(Λ) that, along with 0-lag (T_0) and background (T_back) measurement durations, generates a number of 0-lag (N_0) and background (N_back) events exceeding the threshold Λ. Let us then consider the scenario where we have measured T_0, T_back, and N_back. In order to solve Eq. 3, we need to find the functional form of

    P(N_0 | θ⃗) = P(N_0 | T_0, T_back, N_back) = ∫_0^∞ dλ P(N_0, λ | T_0, T_back, N_back).   (4)

To accomplish this, we can use the theory of D-separation [49, 50] on the DAG to find the following conditional independencies: (1) N_0 ⊥ N_back, T_back | λ, T_0, and (2) λ ⊥ T_0 | N_back, T_back. These conditional independencies allow us to write

    P(N_0, λ | T_0, T_back, N_back) = P(N_0 | λ, T_0, T_back, N_back) · P(λ | T_0, T_back, N_back)
                                    = P(N_0 | λ, T_0) P(λ | N_back, T_back).   (5)

One approach to solving Eq. 4 is to estimate λ using the maximum-likelihood estimator λ̂_ML = N_back / T_back. With this choice, we can write P(λ | N_back, T_back) = δ(λ − λ̂_ML). Substituting this delta function into Eq. 5 and integrating over λ, we find

    P_ML(N_0 | T_0, T_back, N_back) = P(N_0 | λ̂_ML, T_0).   (6)

The right-hand side of this expression can be identified as the Poisson distribution defined in Eq. 2. We thus find the "maximum-likelihood" FAP estimate to be

    FAP_{Λ,ML}(N_0*) = 1 − Σ_{N_0=0}^{N_0*−1} P_ML(N_0 | T_0, T_back, N_back),   (7)

where

    P_ML(N_0 | T_0, T_back, N_back) = (1/N_0!) (λ̂_ML T_0)^{N_0} e^(−λ̂_ML T_0).   (8)

Instead of using a maximum-likelihood estimator for λ, we can instead choose to adopt the Bayesian approach of marginalizing over the unobserved λ. We can apply Bayes' theorem to Eq.
5 to obtain

    P(N_0, λ | T_0, T_back, N_back) = P(N_0 | λ, T_0) P(N_back | λ, T_back) P(λ) / ∫_0^∞ dλ P(N_back | λ, T_back) P(λ),   (9)

where we have again used D-separation on the DAG to find the independency λ ⊥ T_back. Substituting this expression into Eq. 4, we find that we need to solve two integrals:

    P(N_0 | T_0, T_back, N_back) = ∫_0^∞ dλ P(N_0 | λ, T_0) P(N_back | λ, T_back) P(λ) / ∫_0^∞ dλ P(N_back | λ, T_back) P(λ).   (10)

Both integrands involve the product of Poisson distributions and priors on λ, and both integrals can be solved analytically by choosing P(λ) to be a Gamma distribution. We will solve this integration explicitly for two such "uninformative" priors: a uniform prior, where P_uni(λ) ∝ λ^0, and a Jeffreys prior, where P_Jef(λ) ∝ λ^(−1/2). For these priors, we find

    P_uni(N_0 | T_0, T_back, N_back) = [(N_back + N_0)! / (N_back! N_0!)] · T_0^{N_0} T_back^{N_back+1} / (T_back + T_0)^{N_back+N_0+1},   (11)

    P_Jef(N_0 | T_0, T_back, N_back) = [(N_back + N_0 − 1/2)! / ((N_back − 1/2)! N_0!)] · T_0^{N_0} T_back^{N_back+1/2} / (T_back + T_0)^{N_back+N_0+1/2}.   (12)

FIG. 3. The DAG describing the probabilistic dependencies of variables in the k-threshold test. The variables are denoted by nodes, and the arrows point from cause to effect. Shaded nodes represent variables that have been measured and have fixed values, while the unshaded nodes represent variables that are unconstrained. Primed variables represent the quantities as measured between the thresholds, while un-primed variables represent the quantities as measured cumulatively above the threshold. The explicit probabilistic dependencies implied by each DAG can be extracted using the theory of D-separation [49, 50].

Using these expressions, we find the "uniform FAP" and the "Jeffreys FAP" estimates to be
    FAP_{Λ,uni}(N_0*) = 1 − Σ_{N_0=0}^{N_0*−1} P_uni(N_0 | N_back, T_back, T_0)   (13)

and

    FAP_{Λ,Jef}(N_0*) = 1 − Σ_{N_0=0}^{N_0*−1} P_Jef(N_0 | N_back, T_back, T_0),   (14)

respectively.

B. Multi-Threshold Significance
If we choose to make measurements at multiple thresholds, Λ⃗, instead of just a single threshold, the measurement at each threshold is capable of reporting statistical inconsistencies. As a result, multiple-threshold consistency tests are biased towards reporting inconsistencies between 0-lag and background measurements more frequently than single-threshold consistency tests. For this reason, we must take care to properly calibrate the FAPs that are reported when making multiple measurements.

Similarly to Sec. II A, let N⃗_0* denote the minimum numbers of 0-lag events exceeding the thresholds Λ⃗ that will be considered a significant excess. Instead of Eq. 3, the FAP of observing a significant excess at any of the thresholds will be

    FAP_Λ⃗(N⃗_0*) = P(any N_{0,i} ≥ N_{0,i}* ∈ N⃗_0* | θ⃗) = 1 − P(all N_{0,i} < N_{0,i}* ∈ N⃗_0* | θ⃗),   (15)

where N_{0,i} ∈ N⃗_0 represents the number of 0-lag events exceeding the threshold Λ_i ∈ Λ⃗ and θ⃗ again represents the full set of conditional parameters that have been measured.

Let us consider the problem of calculating the statistical consistency at k value-ordered thresholds Λ⃗ in the cumulative representation. We choose N⃗_0* so that

    FAP_{Λ_i}(N_{0,i}*) ≤ FAP_single < FAP_{Λ_i}(N_{0,i}* − 1)   (16)

for all thresholds Λ_i ∈ Λ⃗. This means that each N_{0,i}* is chosen so that the FAP at threshold Λ_i (calculated using Eq. 7, 13, or 14, depending on the choice of prior) is closest to some FAP_single without exceeding it. As a result, we can view FAP_single as our single-threshold reference FAP that every threshold in Λ⃗ tests for. Under this point of view, we write

    FAP_Λ⃗(N⃗_0*) = ETF(N⃗_0*) · FAP_single   (17)

for some ETF ≥ 0. We refer to the variable ETF as the "effective trials factor" because it is the coefficient linking a single-threshold FAP to a multiple-threshold FAP, with each threshold representing a "trial" that could produce an excess of the desired significance. ETF tells us the actual probability of finding an inconsistency of significance FAP_single, accounting for the multiple measurements we have performed by using multiple thresholds. In the case of independent measurements, the Bonferroni correction [51] tells us that we should set ETF equal to the number of thresholds. Nevertheless, the word "effective" denotes that ETF will not correspond to the exact number of thresholds we have used if there is any non-zero dependence between the measurements at each threshold.

In the cumulative representation of the measurements depicted in Fig. 1, the measurements are not independent as a result of the constraint that the cumulative event rate λ(Λ) must decrease monotonically as a function of Λ. In order to calculate FAP_Λ⃗, we need to understand how this monotonicity constraint affects the relationship among the k thresholds Λ⃗. Let us order the thresholds from largest to smallest so that Λ_i ≥ Λ_{i+1} ∀ Λ_i, Λ_{i+1} ∈ Λ⃗. The monotonicity of the cumulative representation tells us that any event, 0-lag or background, that exceeds the threshold Λ_i must also exceed all thresholds Λ_j for j > i. Thus, the elements of the event-number vector N⃗ at thresholds Λ⃗ are correlated with each other for both 0-lag and background measurements. We can explain this dependence quantitatively by introducing a new set of variables: if N_i and λ_i ≡ λ(Λ_i) are the number of events and rate of events exceeding threshold Λ_i, then we define N'_{i(i−1)} and λ'_{i(i−1)} to be the number of events and the rate of events exceeding threshold Λ_i but not threshold Λ_{i−1}.
Then, the event rate λ'_{i(i−1)}, along with the measurement duration T, determines N'_{i(i−1)} according to a Poisson distribution. The number of events we measure exceeding threshold Λ_i is given by N_i = N_{i−1} + N'_{i(i−1)}. These cause-and-effect relationships allow us to depict the dependence among all variables with a probabilistic DAG, which we illustrate in Fig. 3.

We can solve for FAP_Λ⃗ by rewriting Eq. 15 as

    FAP_Λ⃗(N⃗_0*) = 1 − Σ_{N⃗_0 < N⃗_0*} P(N⃗_0 | N⃗_back, N⃗'_back, T_back, T_0),   (18)

where we note that our calculation is conditioned on all measurements available to us: the number of background events exceeding the thresholds and the durations of both the background and 0-lag measurements. We can expand the summed probability as

    P(N⃗_0 | N⃗_back, N⃗'_back, T_back, T_0)
      = Σ_{N⃗'_0} ∫_0^∞ dλ⃗ P(N⃗_0, N⃗'_0, λ⃗ | N⃗_back, N⃗'_back, T_back, T_0)
      = Σ_{N⃗'_0} ∫_0^∞ dλ⃗ P(N⃗_0 | N⃗'_0, λ⃗, N⃗_back, N⃗'_back, T_back, T_0) · P(N⃗'_0 | λ⃗, N⃗_back, N⃗'_back, T_back, T_0) P(λ⃗ | N⃗_back, N⃗'_back, T_back, T_0).   (19)

Applying D-separation to the DAG illustrated in Fig. 3, we find the following conditional independencies: (1) N⃗_0 ⊥ λ⃗, N⃗_back, N⃗'_back, T_back, T_0 | N⃗'_0; (2) N⃗'_0 ⊥ N⃗_back, N⃗'_back, T_back | λ⃗, T_0; (3) λ⃗ ⊥ N⃗_back, T_0 | N⃗'_back, T_back. These conditional independencies allow us to simplify Eq.
19 to

    P(N⃗_0 | N⃗_back, N⃗'_back, T_back, T_0)
      = Σ_{N⃗'_0} ∫_0^∞ dλ⃗ P(N⃗_0 | N⃗'_0) P(N⃗'_0 | λ⃗, T_0) P(λ⃗ | N⃗'_back, T_back)
      = Σ_{N⃗'_0} ∫_0^∞ dλ⃗ [ ∏_{i=1}^{k} P(N_{0,i} | {N_{0,j} : j < i}, N⃗'_0) ] P(N⃗'_0 | λ⃗, T_0) P(λ⃗ | N⃗'_back, T_back),   (20)

where we have conditionally factored P(N⃗_0 | N⃗'_0) in the second line. We can once again use D-separation on the DAG of Fig. 3 to find three more conditional independencies: (4) N_{0,i} ⊥ N_{0,j}
We will now discuss solving Eq. 26 for a specific selection of k thresholds, which will form the tail-targeted statistical consistency test that we will refer to as the Event Stacking Test (EST). The cumulative representation of data, as depicted in Fig. 1, is appropriate for checking for distributional inconsistencies in the tails of measurements, since all events measured to exceed a Λ threshold are "stacked" into the tail bin defined by Λ. Choosing to make measurements at the multiple thresholds of Λ⃗ is equivalent to testing the distributional consistency of the high-Λ tail for several binnings, where the value of each Λ_i defines the size of the tail bin. Specifically, large values of Λ_i correspond to short tails with few events, while small values of Λ_i allow for much wider tail regions that encompass more events. We note that Eq. 23 is the product of the probabilities of observing each N'_{0,i(i−1)} ∈ N⃗'_0, which is equivalent to calculating the joint probability of a 0-lag measurement using the differentially-binned event counts N⃗'_0. The transformation that maps a set of differentially-binned event counts into a multi-threshold test of the measurement's tail is performed by changing to the cumulative representation of event counts N⃗_0 in the summation bounds of Eq. 18.
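Before specializing the thresholds, it is worth seeing the single-threshold building blocks in code. The snippet below is our own stdlib-only sketch of Eqs. 7, 13, and 14 via the closed-form distributions of Eqs. 8, 11, and 12, written in log space so that large event counts do not overflow; the function names and the numbers in the sanity check are invented:

```python
# Sketch of the single-threshold FAP estimates (Eqs. 7, 13, 14), using the
# closed-form distributions of Eqs. 8, 11, and 12. Computed in log space
# with math.lgamma so large counts do not overflow. Names are ours.
import math

def p_ml(n, t0, t_back, n_back):
    """Eq. 8: Poisson pmf with maximum-likelihood rate n_back / t_back."""
    lam_t = (n_back / t_back) * t0          # assumes n_back > 0
    return math.exp(n * math.log(lam_t) - lam_t - math.lgamma(n + 1))

def p_uniform(n, t0, t_back, n_back):
    """Eq. 11: marginal pmf under a uniform rate prior, P(lam) ~ lam^0."""
    return math.exp(math.lgamma(n_back + n + 1) - math.lgamma(n_back + 1)
                    - math.lgamma(n + 1) + n * math.log(t0)
                    + (n_back + 1) * math.log(t_back)
                    - (n_back + n + 1) * math.log(t_back + t0))

def p_jeffreys(n, t0, t_back, n_back):
    """Eq. 12: marginal pmf under a Jeffreys prior, P(lam) ~ lam^(-1/2)."""
    return math.exp(math.lgamma(n_back + n + 0.5) - math.lgamma(n_back + 0.5)
                    - math.lgamma(n + 1) + n * math.log(t0)
                    + (n_back + 0.5) * math.log(t_back)
                    - (n_back + n + 0.5) * math.log(t_back + t0))

def fap(pmf, n_star, t0, t_back, n_back):
    """Eqs. 7/13/14: FAP = 1 - sum_{N=0}^{N*-1} P(N | ...)."""
    return 1.0 - sum(pmf(n, t0, t_back, n_back) for n in range(n_star))

# Sanity check: each marginal pmf must be normalized over N.
total = sum(p_uniform(n, 1.0, 1000.0, 50) for n in range(2000))
assert abs(total - 1.0) < 1e-9
```

For example, `fap(p_ml, 1, 1.0, 1000.0, 50)` reproduces 1 − e^(−λ̂_ML T_0) = 1 − e^(−0.05), the familiar single-event expression.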
1. Choosing optimal thresholds
It is sub-optimal to evaluate the consistency of the 0-lag and background measurements at a set of arbitrarily-defined thresholds Λ⃗, both in terms of detection power and computational efficiency. We will first show how to optimally choose Λ⃗ in order to maximize the detection power of the EST. A simple thought experiment shows that the most significant 0-lag excesses occur when a threshold Λ takes the exact value of a 0-lag event, as is illustrated in Fig. 4. We will call the occurrence of i 0-lag events exceeding a threshold an i-event realization. Consider first a single-event realization: as we decreased Λ from the value of the loudest 0-lag event, λ(Λ) could only have increased or remained unchanged. As a result, the FAP of the current single-event realization is now greater than the FAP of the single-event realization measured at the loudest 0-lag event itself. We can continue to decrease Λ until we reach the second-loudest 0-lag event. Following the above argument, we find that the most significant two-event realization occurs at the second-loudest 0-lag event. By induction, we see that the most significant i-event realization will occur when the threshold Λ is located at the i-th loudest 0-lag event.

In addition, we note that there are a finite number of thresholds that give us unique p-values, which importantly means that there is a maximum dimension of Λ⃗ above which no new information can be gained from the measured data.

FIG. 4. A graphical explanation of how we should choose to place thresholds in the EST. The plot is a toy model of Fig. 1, where the dots represent the location of events. As is described in the figure, the set of all unique thresholds is obtained by placing thresholds just past the background events' Λ's. The maximum-significance inconsistency of an i-event 0-lag realization is found at the threshold closest to the value of the i-th loudest 0-lag event without exceeding it.

To prove this, let us revisit the above thought experiment illustrated in Fig. 4. We again place a single threshold Λ at the value of the i-th-loudest 0-lag event. Let us also assume we do not know the event rate λ(Λ) exactly, but instead our estimates of it rely on our background measurements. The significance estimate at the threshold in this scenario takes the form of Eq. 7, 13, or 14, depending on the choice of event rate prior. The relevant background measurement is N_back: the number of measured background events exceeding the threshold Λ. We can decrease the value of Λ from the i-th-loudest 0-lag event without changing the significance of the i-event realization as long as N_back remains unchanged. Thus, if we define Λ⃗ to consist of thresholds located at Λ_n + ε, where Λ_n is the measured value of the n-th background event and ε → 0, for every background event in the measurement, we can calculate the unique significance of every possible 0-lag measurement.
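To make the threshold-placement rule concrete, here is a small sketch (ours, with invented event lists and durations) that evaluates the single-threshold FAP of the i-event realization at each of the k loudest 0-lag events, using the uniform-prior form of Eqs. 11 and 13:

```python
# Sketch: per-event FAPs at the k loudest 0-lag events (Sec. II B 1).
# The event lists, durations, and k are invented; the FAP uses the
# uniform-prior marginal of Eq. 11 (in log space) inside Eq. 13.
import math

def p_uniform(n, t0, t_back, n_back):
    return math.exp(math.lgamma(n_back + n + 1) - math.lgamma(n_back + 1)
                    - math.lgamma(n + 1) + n * math.log(t0)
                    + (n_back + 1) * math.log(t_back)
                    - (n_back + n + 1) * math.log(t_back + t0))

def fap_uniform(n_star, t0, t_back, n_back):
    return 1.0 - sum(p_uniform(n, t0, t_back, n_back) for n in range(n_star))

def per_event_faps(zero_lag, background, t0, t_back, k):
    """FAP of the i-event realization at the i-th loudest 0-lag event."""
    loudest = sorted(zero_lag, reverse=True)[:k]
    faps = []
    for i, lam_i in enumerate(loudest, start=1):
        n_back = sum(1 for b in background if b > lam_i)  # events above Lambda_i
        faps.append(fap_uniform(i, t0, t_back, n_back))   # N_0* = i here
    return faps

zero_lag = [8.1, 6.4, 5.9, 3.2, 2.5]                     # invented statistics
background = [7.0, 6.0, 5.5, 5.0] + [0.1 * j for j in range(40)]
faps = per_event_faps(zero_lag, background, t0=1.0, t_back=100.0, k=3)
fap_min = min(faps)   # the reference FAP_single the EST will calibrate
assert 0.0 < fap_min < 1.0
```

In this toy configuration the smallest FAP comes from the three-event realization, illustrating how an excess built from several modest events can outrank the loudest event alone.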
2. Calculating the EST p-value
Now that we have an optimal procedure for choosing unique thresholds in the measurement's tail, we define the EST. The k-threshold EST evaluates the statistical consistency of 0-lag and background measurements at the k sharpest binnings of the 0-lag's high-Λ tail. The use of sharp tail bins ensures that any statistical inconsistencies will involve the k loudest 0-lag events. The final significance of the EST is given by the p-value

    FAP_EST = ETF · FAP_min,   (27)

where FAP_min is the minimum FAP observed across all of the k thresholds. Calculating the p-value of the EST takes the following algorithmic form:

1. Choose k, where we will evaluate the single-threshold FAP at each of the k loudest 0-lag events.

2. Evaluate the single-threshold FAP at each of the k loudest 0-lag events using Eq. 7, 13, or 14, depending on the choice of rate prior. Thus, at the i-th threshold, N_back is equal to the number of background events exceeding the i-th loudest 0-lag event. We also choose N_0* = i so that we are evaluating the significance of seeing the measured i-event 0-lag realization. Because we are evaluating significances at the measured 0-lag events, any inconsistencies between the 0-lag and background measurements will have maximum significance.

3. We are looking for the most significant 0-lag inconsistencies, so we choose our single-threshold reference FAP to be the minimum single-threshold FAP found across the k thresholds, FAP_min; i.e., we choose FAP_single = FAP_min.

4. We finally need to find the ETF that maps our minimum single-threshold FAP into a calibrated multiple-threshold FAP via Eq. 27. To do this, we need to find the k "critical" thresholds Λ⃗ = {Λ_i : i ∈ 1, . . . , k} that produce 0-lag realizations with FAPs closest to, without exceeding, FAP_min (see Eq. 16). We already saw that any threshold is equivalent to some threshold located at a background event.
Thus, we only need to consider thresholds located at the background events. This feature makes the determination of the critical thresholds independent of the shape of the underlying noise event rate λ(Λ). We find the k critical thresholds using the following algorithm:

(a) Initialize n = 0, i = 1, and all critical thresholds as undefined.

(b) Until i > k:

  i. Calculate FAP_test at N_back = n and N_0* = i using Eq. 7, 13, or 14, depending on the choice of prior.

  ii. If FAP_test ≤ FAP_min, we have achieved a FAP less than the reference FAP. Store the current estimate of the i-th critical threshold Λ_i to be the value of the n-th loudest background event. Larger values of n can only produce less-significant i-event realizations with larger FAPs, so iterate n → n + 1.

  iii. If instead FAP_test > FAP_min, we no longer have achieved a FAP less than the reference FAP, meaning we have already passed the true critical threshold Λ_i. Thus, we iterate onto the next critical threshold with i → i + 1.

We note that it is possible that some critical thresholds remain undefined if there exists no i-event 0-lag realization with a FAP less than FAP_min. In these instances, we remove the undefined thresholds from Λ⃗ (meaning there might be fewer than k critical thresholds).

5. The set of critical values Λ⃗ uniquely defines all possible 0-lag realizations involving the k loudest 0-lag events with FAPs less than or equal to FAP_min. Thus, we finally evaluate Eq. 26 at the critical thresholds Λ⃗ with N_0* = i. The result is the properly-calibrated p-value of our k-threshold EST test: FAP_EST.
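As a sketch of step 4's threshold search ((a)-(b) above), the following stdlib-only snippet finds the critical thresholds using the maximum-likelihood FAP of Eq. 7. The background list and FAP_min are invented, and the final calibrated FAP_EST would still require evaluating the multi-threshold expression (Eq. 26) at the thresholds found:

```python
# Sketch of the critical-threshold search in step 4, using the
# maximum-likelihood single-threshold FAP of Eq. 7. The background events
# and FAP_min below are invented; computing the calibrated FAP_EST itself
# (Eq. 26, evaluated at these thresholds) is not reproduced here.
import math

def fap_ml(n_star, t0, t_back, n_back):
    """Eq. 7: 1 - Poisson CDF at N* - 1 with rate n_back / t_back."""
    lam_t = (n_back / t_back) * t0
    cdf = sum(lam_t ** n * math.exp(-lam_t) / math.factorial(n)
              for n in range(n_star))
    return 1.0 - cdf

def critical_thresholds(background, t0, t_back, fap_min, k):
    """Steps (a)-(b): sweep n (background events) and i (realization size)."""
    bg = sorted(background, reverse=True)   # bg[n-1] is the n-th loudest event
    thresholds = {}                         # critical Lambda_i for i = 1..k
    n, i = 0, 1
    while i <= k:
        if fap_ml(i, t0, t_back, n) <= fap_min:
            # n = 0 means "above every background event"; we store +inf as a
            # placeholder for a threshold just above the loudest background.
            thresholds[i] = bg[n - 1] if n > 0 else math.inf
            n += 1
            if n > len(bg):                 # ran out of background events
                break
        else:
            i += 1                          # passed the critical threshold
    return thresholds

background = [0.5 * j for j in range(1, 41)]    # invented background events
crit = critical_thresholds(background, t0=1.0, t_back=100.0,
                           fap_min=0.01, k=3)
# Critical thresholds must decrease as the realization size i grows.
assert all(crit[i] >= crit[i + 1] for i in range(1, len(crit)))
```

Note the design point made in the text: the search compares only event counts and durations, so the critical thresholds come out independent of the shape of the underlying noise event rate λ(Λ).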
3. Salient features of the EST
We briefly note two salient features of the EST thatgive it exceptional power to detect statistical inconsis-tencies in certain scenarios. First, it incorporates knowl-edge of the measurement durations T and T back intothe statistical consistency test. Thus, it is able to dis-tinguish 0-lag and background distributions with differ-ent underlying noise event rates λ (Λ) even if they havethe same distributional shape as a function of Λ. Sec-ond, while the necessary specification of k means thatthe EST is parameterized, this parameterization is in-tentional since it gives the user control over how much ofthe 0-lag’s high-Λ tail to test for consistency. This fea-ture can prove useful when deviations of the 0-lag fromthe background are expected to be detectable in high-Λregimes but un-detectable in low-Λ regimes.Together, these features make the EST a powerful testfor detecting a non-zero rate of GW events superimposedonto detector data in the form of Eq. 1 where λ noise (Λ) (cid:29) λ GW (Λ) for small values of Λ. This power is due tothe ability of the EST to test only the high-Λ regime ofmeasurements and to detect the elevated event rate ofthe 0-lag as compared to the background. III. COMPARISON WITH OTHERSTATISTICAL TESTS
Here we compare the performance of the proposed ESTwith several other common statistical consistency tests.These tests are all null tests in the sense that they aretesting the hypothesis that all measured data is producedby noise alone. The first of these tests is the LoudestEvent Test (LET). The LET only computes the PoissonFAP of the loudest 0-lag event [37]; i.e., it is equivalentto the EST with k = 1. By construction, the loudest 0-lag event will produce the most significant single-eventFAP since the noise event rate λ noise (Λ) is lowest at thisevent by definition of the cumulative representation. TheLET is commonly used in GW searches to make detectionstatements on individual 0-lag events (e.g., see [36]). Inpractice, if the FAP of the loudest 0-lag event is found tomeet some (strict) significance threshold, the 0-lag eventis classified as a GW detection. It can then be excised0from the analysis data set, and the consistency of the re-maining 0-lag events with the background measurementcan be tested using the new loudest event. This com-bination of the LET and GW-detection excision can berepeated until the 0-lag measurement is found to be sta-tistically consistent with the background measurement.Such an application of the LET is equivalent to eval-uating the significance of each 0-lag event in isolationof other events. This interpretation provides the LETwith an important feature: any statistical inconsisten-cies found by it are attributed entirely to single events,meaning LET detection statements consist of GW eventsalone. As a comparison, the EST detection statementpoints to a statistical inconsistency somewhere in the k -loudest 0-lag events, but it does not uniquely specifywhich of these events are GW events and which are noiseevents. 
We have verified that the k = 1 EST producesresults that are identical to those of the LET.In the opposite extreme, we can also consider con-sistency tests that compare the empirical distributionsof two measurements in their entirety. Two commondistributional tests are the Kolmogorov-Smirnov (KS)Test [52, 53] and the Anderson-Darling (AD) Test [54–56]. The KS test is most sensitive to inconsistencies inthe bulks of the empirical distributions, while the ADtest is more sensitive to inconsistencies in the tails [57].Both of these test are nonparametric in the sense thatthey always test the entire distribution in the same wayand cannot be tuned to focus on specific regions of thedistribution. As a result, any detection statements madewith the KS or AD tests provide no information as towhich 0-lag events are GW events and which are noiseevents. We also note that both the KS test and the ADtest only compare the normalized distributional shapesof the 0-lag and background measurements, i.e., they donot incorporate the relative event rates of the 0-lag andbackground measurements.We see that the EST strikes a middle ground be-tween the tradeoffs of the LET and the KS/AD tests.The LET incorporates the relative 0-lag and backgroundevent rates and pinpoints statistical inconsistencies tosingle events, but it only tests consistency at a single0-lag event. The KS and AD tests evaluate the consis-tency between the entire empirical distributions of the0-lag and background measurements, but they do notincorporate relative event rate information and cannotpinpoint which 0-lag events are the GW events causingthe distributional inconsistencies. 
The EST incorporatesrelative event rate information, tests consistency at theuser-specified k -loudest 0-lag events, and confines anystatistical inconsistencies to these same k n noise events from the estimated noise distribution,where n itself is drawn from a Poisson distribution assum-ing a measurement duration of 1,000 years and a meanevent rate of 100 events per year. For each of these 10background measurements, we then simulate 10,000 0-lagmeasurements. To do so, we first sample m of the noiseevents from the estimated noise distribution, where m isdrawn from a Poisson distribution assuming a measure-ment duration of 1 year and a mean event rate of 100events per year. GW events are then superimposed ontothese 0-lag realizations by similarly drawing events froma specified GW event distribution at a given rate andappending them to the list of 0-lag noise events. Sucha methodology allows us to properly characterize thenull test 10 different times under the assumption that allevents in both the 0-lag and background measurementsare Poissonian realizations of the same underlying noisedistribution. A. Calibration
In order to test the calibration of the statistical consis-tency tests, we run the above simulations without inject-ing any GW events into the 0-lag, i.e., with λ GW (Λ) = 0for all Λ. Thus, the 0-lag and background measurementsshould be found to be statistically consistent with respectto the reported p-values of the null test. In other words,the p-values reported by the null test should be equiva-lent to the frequency at which noise-only inconsistenciesof the same or greater significance occur.For this work, we will restrict our discussion to thescenario where T back (cid:29) T since most GW searchesoperate in this regime. Analyzing these λ GW (Λ) = 0simulations, we find that the EST, LET, KS test, andAD test are all well-calibrated in the sense that each p-value statement is reported in that fraction of the 10,0000-lag trials. The evidence of this calibration is shown inFigs. 5 and 6, where points lie on the predicted diagonalline within the sampling error bars. We note that in thelimit where T back T (cid:46)
10, the mean realization still appearsto be well-calibrated, but with large error bars since thebackground measurement poorly estimates the underly-ing event rate distribution in this regime. For the EST,this statement holds invariant of the number of thresh-olds, the shape of the background distribution, or thechoice of rate prior used (i.e., Maximum Likelihood, Uni-form, or Jeffreys). The choice of rate prior does influencethe size of the calibration error bars when T back ∼ T and there is a lack of background data to estimate theunderlying rate distribution. However, the effect of theprior choice becomes negligible in the limit T back (cid:29) T GW BurstsCBC
FIG. 5. The calibration plots for the 5-threshold EST with aJeffreys rate prior for noise-only simulations of a GW burstsearch (top) and a CBC search (bottom). The effective FAPfound analytically is the p-value reported by the EST, whilethe effective FAP found via Monte Carlo is the fraction ofthe 10,000 0-lag trials that had a p-value less than or equalto the value of the abscissa axis. We plot the results for 10different background measurements (thin continuous lines),along with the mean and standard deviation of these 10 mea-surements (error bars). We find that the EST p-values liealong the expected diagonal (dashed line), meaning it is well-calibrated. These results hold for ESTs regardless of the num-ber of thresholds used or choice of rate prior. where the same rate estimates become data-dominated.Because we will only consider T back (cid:29) T in the re-mainder of this work, we will only report the results ob-tained when using the Jeffreys prior.We note that the calibration of the EST is only pos-sible due to the effective-trials-factor adjustment derivedin Sec. II B. Fig. 7 compares the effective trials factor,ETF, used in Eq. 27 for the k-threshold EST to thenaive Bonferroni correction that would have ETF = k across all FAPs. Note that the Bonferroni correctionover-estimates ETF across all FAPs, and thus it over-estimates the rate at which the noise-only statistical in-consistencies occur. As a result, it under-estimates thesignificance of any inconsistencies. The jagged natureof the ETF curve is a result of two effects. The first isthe discrete nature of the 0-lag and background measure-ments, which prevents us from selecting a set of k critical Loudest Event TestKS TestAD Test
FIG. 6. The calibration plots for the LET with a Jeffreys rateprior (top), KS test (middle), and AD test (bottom) for noise-only simulations of a GW burst search. The effective FAPfound analytically is the p-value reported by the tests, whilethe effective FAP found via Monte Carlo is the fraction of the10,000 0-lag trials that had a p-value less than or equal to thevalue of the abscissa axis. We plot the results for 10 differentbackground measurements (thin continuous lines), along withthe mean and standard deviation of these 10 measurements(error bars). We find that the p-values of all tests lie alongthe expected diagonal (dashed line), meaning they are all well-calibrated. These results hold for CBC searches as well. thresholds whose respective realization probabilities per-fectly match FAP min . As discussed in Sec. II C, eachthreshold’s FAP merely approximate FAP min as closelyas possible. These approximations are more accurate,2 k=10 Thresholdsk=5 Thresholdsk=3 Thresholds
FIG. 7. The effective trials factor (ETF) as a function ofthe minimum single-threshold FAP observed (FAP min ) for theESTs using 10 thresholds (top), 5 thresholds (middle), and 3thresholds (bottom). We also plot the naive trials factor of k that would be obtained by applying the Bonferroni correctionto a k -threshold test. We note that the effective trials factorcan be significantly lower than the Bonferroni correction, es-pecially for ESTs using a large number of thresholds. Thejagged nature of the curve at low FAPs is due to the discretenumber of background events used to estimate the underlyingrate distribution. The steep dropoff of ∼ min of 4 × − corresponds to the removal of single-event realization thresholds from the EST. These results holdindependent of the underlying event distribution λ noise (Λ) forgiven measurement durations T and T back . and thus the ETF curve is smoother, at high FAPs thanat low FAPs since N back is larger at high FAPs and the es-timates of the underlying rate distribution λ noise (Λ) be-come more fine-grained. Second, thresholds correspond-ing to i -event realizations can drop out of the ETF calcu-lation for small i and low FAPs, as discussed in Sec. II C.For example, single-event 0-lag realizations cannot haveFAPs much lower than T T back (obtained when N back = 1, T back (cid:29) T , and the Maximum Likelihood prior isused), meaning the single-event threshold is undefined forextremely-low values of FAP min . Referencing the Bonfer-roni approximation, this means the ETF decreases by ∼ i -event realization drops out of the ETF cal-culation. Such an abrupt drop in the value of ETF canbe seen in Fig. 7 around a FAP min of 4 × − . B. Detection Efficiency
We now explore how powerful these tests are in termsof detecting GW events. Since the tests are all well-calibrated, we can trust their reported p-values to beaccurate representations of the rate of noise fluctuations.We will count as a detection any 0-lag realization wherethe p-value is less than or equal to the desired FAPthreshold of the detection statement. We note that a0-lag measurement is either counted as a detection or asa non-detection; we do not take into account how manyGW events in the 0-lag are correctly identified as de-tections. We will consider two distributions for the in-jected GW events: (1) a “point” source of GWs, and (2)a uniform-in-volume distribution of GW sources. Whencomparing the results for GW burst and CBC searches,we emphasize that GW burst searches typically havebackgrounds with heavier high-Λ tails. These tails arepresent because GW burst searches inherently place min-imal constraints on their target GW signals, meaning itis much easier for noise transients to achieve large valuesof Λ in GW burst searches than in the heavily-modeledCBC searches.
1. Point sources
We define the point source distribution to be a set ofGW events that always occur at roughly the same valueof Λ . While this distribution could be used to modelthe physically-motivated scenarios of GW events com-ing from a single matter-dense region (such as the galac-tic center or a nearby galaxy) or a randomly repeatingsource, it is more broadly representative of any scenario Physically-motivated point sources can exhibit O (1) variationsin Λ as a result of changes in the detectors’ antenna responsesthat occur if the relative position of the (same) source and thedetectors changes. FIG. 8. Left: The ROC curves comparing the GW detectionefficiency of the different consistency tests as a function of theeffective FAP of their detection statements. Here, the GWevents are injected into a GW burst search from a “point”source distribution. The GW events are added to the 0-lag atan average event rate of 3 events per year and have values ofΛ = 15 (top), 5 (middle), and 0 (bottom). The EST uses 5thresholds, and the i th -highest level in the EST curve corre-sponds to the regime where i -event realizations last contributeto detection statements. Right: the same plot as Fig. 1, butwith the point-source GW events added to the 0-lag measure-ments. where we expect events to bunch around a value of Λ.One realistic example is the residual of a LET searchafter all high-significance GW events are detected andremoved from the 0-lag. Since the single-threshold LETcan only claim detections above a certain critical valueof Λ, we may expect a set of low-significance events to belocated slightly below this value of Λ.In Fig. 8, we explore the detection efficiency of a GWburst search for three different point source distributionsthat each result in narrowly-varying values of Λ: Λ = 0, 5,or 15, each at a rate of 3 events per year. 
We note that forpoint sources where Λ = 0, none of the consistency testshave much detection power, as the number of detectionsclaimed is identical to the number expected from noisefluctuations alone. When Λ = 5 or 10, neither the KS testor AD test have much detection power at any FAP sincethe distributional inconsistencies only make up a smallfraction of the total number of 0-lag events ( ∼
3% on av-erage). For these scenarios, the LET has good detectionpower down to a cutoff FAP and has no detection powerbelow this cutoff FAP. The cutoff FAP corresponds to
FIG. 9. The same plots as Fig. 8, but for a CBC search. TheGW events are added to the 0-lag at an average event rateof 3 events per year and have values of Λ = 10 (top), 9.5(middle), and 9 (bottom). the FAP of a single-event realization at Λ. This behaviorhelps emphasize the shortcomings of the LET: the signif-icance of the statistical inconsistency is always defined bythe loudest 0-lag event alone, no matter how many GWevents may be present in the 0-lag data. The EST doesnot have this shortcoming and has substantial detectionpower across all FAPs for Λ = 5 and 10 exactly becauseit tests for events that are bunched together at similarvalues of Λ. The i th highest level seen in the EST de-tection efficiency curve for the EST represents detectionstatements being made using i -event realizations. Everytime the desired FAP can no longer be reached with i -event realizations, the detection efficiency drops to thatof the i + 1 th level since now i + 1-event realizations areneeded to reach the desired FAP. We note that the 1 st level of the EST overlaps with the detection efficiency ofthe LET, which we expect since the LET is equivalent tothe k = 1 EST. The only difference between the two isthat the 1 st level of the EST ends at slightly higher FAPsthan for the LET as a result of the effective-trials-factorpenalty applied to the EST (see Eq. 27).We note that because all statistical inconsistencies oc-cur at a single value of Λ for point sources, the extent ofthe search background’s tail does not affect the behaviorof the consistency tests. All that matters is the relativerate of GW events to noise events at that value of Λ.Thus, we observe similar behavior for CBC searches aswe do for GW burst searches. The CBC search results4are shown in Fig. 9 for three different point source distri-butions: Λ = 9, 9.5, or 10, each at a rate of 3 events peryear. 
We consider the narrower range of Λ values for theCBC search ([9,10]) than for the GW burst search ([0,15])because the rate of background events falls off much morequickly as a function of Λ for the CBC search.
2. Uniform-in-volume sources
The uniform-in-volume GW source distribution de-scribes the astrophysically-motivated scenario wheresources are distributed homogeneously in the spatial vol-ume of the local universe. If a GW search were to usethe optimal matched-filter signal-to-noise ratio (SNR) asits search statistic, namely Λ = SNR, we would expectthe cumulative GW event distribution to scale ∝ Λ − for events located in the low-redshift universe [58–60]. Ithas also been shown [61] that the cumulative noise dis-tribution falls off nearly exponentially as a function ofΛ in the distribution’s bulk. These two models implythat the GW event distribution will be more heavily-tailed than the noise distribution. However, for reasonssuch as the non-Gaussianity of the LIGO-Virgo detec-tor noise, real GW searches use a combination of SNRand signal-consistency constraints as their search statis-tics [29, 30, 44–47]. Thus, it is difficult to provide anexact analytical model of the GW event distribution asa function of Λ for many real searches. Instead, we canmodel the GW event distribution using Monte Carlo sim-ulations where we inject GW signals from a uniform-in-volume source distribution into the LIGO-Virgo datastreams. Running the GW search algorithms over theseinjections and measuring Λ for each, we get an empiricalestimate of the GW event distribution as a function of Λ.For the CBC search we perform this Monte Carlo proce-dure using the binary-black-hole waveforms processed byPyCBC for the O1 rates calculation [36], and for the GWburst search we use the sine-Gaussian waveforms used totrain the oLIB analysis for the O1 short-duration GWburst search [34].In Fig. 10, we first explore the detection efficiency ofa GW burst search for three different uniform-in-volumedistributions: GW event rates of 1, 3, and 5 events peryear. The KS and AD tests do not have much detectionpower at any GW event rate. 
This weakness is observedbecause the noise event rate is 100 events per year, sothe statistical inconsistencies only make up ∼ −
5% ofthe empirical distributions. As seen in the example 0-lagrealizations of Fig. 10, the differences in relative eventrates between the 0-lag and background measurementsare most evident in the high-Λ tail. As a result, the ESTand LET have much more significant detection powerthan either the KS or AD tests for all GW event rates,precisely because they test for inconsistent backgroundand 0-lag event rates in the high-Λ tail. At low GWevent rates, single-event 0-lag realizations at large val-ues of Λ provide the most significant inconsistencies, and
FIG. 10. Left: The ROC curves comparing the GW detec-tion efficiency of the different consistency tests as a function ofthe effective FAP of their detection statements. Here, the GWevents are injected into a GW burst search from a uniform-in-volume distribution. The GW events are added to the 0-lagat average event rates of 5 events per year (top), 3 events peryear (middle), and 1 event per year (bottom). The EST uses5 thresholds, and the i th -highest level in the EST curve corre-sponds to the regime where i -event realizations last contributeto detection statements. Right: the same plot as Fig. 1, butwith the uniform-in-volume GW events added to the 0-lagmeasurements. thus the LET and EST have similar detection efficien-cies. However, as the GW event rate increases, multiple-event 0-lag realizations are more likely to occur at largevalues of Λ, which the EST can detect but the LET can-not. Thus, EST begins to noticeably outperform the LETas the GW event rate increases. We also note that be-cause single-event 0-lag realizations can only be detectedto FAPs ∼ T T back , the minimum FAP achievable by theLET has a hard-cutoff set by the duration of the back-ground measurement. Here, the cutoff occurs at a FAPof about 6 × − , consistent with our estimates. On theother hand, the FAP of multiple-event realizations canbe much lower than the FAP of single-event realizations.This feature gives the EST the ability to make detectionstatements at much lower FAPs than the LET, even whenthe background measurement duration is limited.We observe similar results, with a few important differ-ences, in Fig. 11 for the CBC search, again for three dif-ferent uniform-in-volume distributions: GW event ratesof 1, 3, and 5 events per year. The first key differenceas compared to the GW burst search is that the detec-5 FIG. 11. The same plots as Fig. 
10, but for a CBC search.The GW events are added to the 0-lag at average event ratesof 5 events per year (top), 3 events per year (middle), and 1event per year (bottom). tion efficiencies for the LET and EST are very flat as afunction of FAP. These level regions correspond to theregions where i -event realizations last contribute to theFAP calculation (e.g., there is only a single level for theLET since it only tests for single-event realizations). Thereason these regions are so flat is because the CBC back-ground is extremely steep as compared to the GW burstbackground. Thus, with high probability, the loudestGW event occurs in a region with either (1) very highnoise event rates (Λ (cid:46) (cid:38) i -loudest 0-lag events. Because there is little room inthe Λ parameter space ( ∼ i -loudest0-lag events are detectable together but not in isolation,the EST and LET perform almost identically at FAPsachievable by the LET (i.e., greater than 6 × − ). Inother words, if an i -event realization meets a FAP thresh-old achievable by the LET, then the loudest-event re-alization almost certainly meets this FAP threshold aswell. We finally note that the difference in the single-event realization FAP cutoff for the LET and EST is dueto the effective-trials-factor penalty applied to the EST(see Eq. 27). IV. EXAMPLE APPLICATION
Sec. III demonstrates the relative merits of the ESTas compared to the other statistical consistency tests,especially in the case of GW burst searches. We nowexplore one physically-motivated application of the EST:improving the upper limits set by a GW burst searchafter a null-detection.
A. Calculating GW Burst Rate-Density UpperLimits
GW burst searches have not yet detected any non-CBCGW events [32, 34]. Thus, the results of all GW burstsearches to date has been placing rate-density upper lim-its on unknown (i.e., non-CBC) sources of GW transients.The classical upper limit on the expected number of GWevents in the data, E [ N GW ], placed by an experimentwhere no inconsistency is observed exceeding the detec-tion significance threshold satisfies P (no significant inconsistency | E [ N GW ]) ≥ − α (28)where α is the significance of the confidence interval [62].For this example, we will use a FAP detection thresholdof 3 × − , meaning we can assume that at least oneevent in any detected inconsistency is a GW event sincethe noise contamination will be low. We will also chooseto construct the α = 90% confidence interval.In order to map the upper limit on E [ N GW ] into a rate-density upper limit, we need to make some assumptionsabout our astrophysical source distribution. Assumingthat all GWs are emitted from their sources with energy E GW at a monochromatic frequency of f , the distance d of the GW source is given by d = (cid:115) GE GW π c f h (29)where h rss is the average root-squared-sum strain am-plitude of the GW events as observed at Earth [63]. Foruniform-in-volume sources emitting at constant E GW and f , the differential rate of GW events will be distributedproportionally to h − (since d ∝ h rss ). We define R to bethe rate-density of the GW events. The expected numberof GW events louder than some minimum strain ampli-tude h rss,min in a GW search of 0-lag duration T isgiven by E [ N GW ] = R × π (cid:32) GE GW π c f h (cid:33) T . (30)The actual number of GW events, N GW , with strain am-plitude louder than h rss,min is then a Poisson-distributedrandom variable with a mean value of E [ N GW ] over themeasurement duration of T . These N GW events aredistributed in strain amplitude proportionally to h − .6The upper limit can be expressed in terms of the rate-density by solving Eq. 
28 to find the R at which there isprobability 1 − α of detecting no significant inconsistency(where E [ N GW ] is a function of R ). Typically, the LETis used to evaluate the significance of events, where de-tections are iteratively removed from the 0-lag measure-ment until it is consistent with the background measure-ment. This procedure is equivalent to evaluating all 0-lagevents in isolation, meaning the number of detections willbe Poisson-distributed and the desired R can be foundby explicitly integrating over the detection efficiency ofthe GW search on a single-event basis [32]. However,the detection statements of the EST are not performedon isolated events but rather on collective multi-event0-lag realizations. As a result, the detection statementof EST is not Poissonian and an analytic calculation ofthe desired R is difficult to derive. Instead, we can use astochastic approximation method known as the Robbins-Monro algorithm [64] to calculate the upper-limit valueof R . To perform the Robbins-Monro algorithm we needto define a random variable F ( R ) whose expected valueis equal to the left side of Eq. 28. I.e., we need E [ F ( R )]= P (no significant inconsistency | E [ N GW ( R )]) . (31)This expectation is achieved if we define F ( R )= 1l (no significant inconsistency | E [ N GW ( R )]) . (32)In order to satisfy Eq. 28, we wish to find the value of R where E [ F ( R )] = 1 − α . Since the probability of detect-ing no significant inconsistencies monotonically decreasesas a function of R , the iterative equation R n +1 = R n + a n [ F ( R n ) − (1 − α )] (33)has the right form for convergence to our desired rate. Infact, Robbins and Monro proved that the above iterativeequation will converge (as n → ∞ ) to the R at which E [ F ( R )] = 1 − α if a n ∝ n [64]. As a result, we can findthe α -confidence upper-limit value of R for any statisticalconsistency test by running the following algorithm:1. Initialize R to some arbitrary starting value. 
Ini-tializing R closer to the converged value R ∞ willspeed up the convergence.2. Until R n is sufficiently converged:(a) Draw a value of N GW ( R n ) from a Poisson dis-tribution with mean rate E [ N GW ( R n )] (seeEq. 30) and measurement duration T .(b) Draw N GW ( R n ) GW events with strain am-plitude greater than h rss,min from a strain am-plitude distribution proportional to h − . Thevalues of Λ for these GW injections shouldhave previously been measured via the GWsearch. (c) Simulate a 0-lag measurement by sampling N noise noise events from the background mea-surement, where N noise is drawn from a Pois-son distribution with the same mean eventrate as the background and a measurementduration of T . Append the GW events tothis simulated 0-lag measurement.(d) Evaluate the statistical consistency test (e.g.,EST, LET) using the background and 0-lagmeasurements. Set F ( R n ) = 1 if there isno inconsistency meeting the FAP detectionthreshold and F ( R n ) = 0 if a significant in-consistency is detected.(e) Find R n +1 using Eq. 33 with a n = R n andthen iterate n → n + 1. B. Results
We now compare how well the LET and EST con-strain the GW burst rate-density after a null detection.We draw noise events and GW events from oLIB’s anal-ysis of O1 data. While the background event distri-butions are the same as published for the LIGO-VirgoO1 short-duration GW burst analysis [34], here we arbi-trarily choose the event rate to be 100 events per year,the measurement durations to be T = 1 year and T back = 1000 years, and the FAP detection threshold tobe 3 × − . With these arbitrary choices, the rate-densityupper limits calculated in this section are not astrophysi-cal and should not be compared to those published for theO1 analysis [34]. However, these parameter choices arereasonably close to those of historical GW burst searches,and thus we expect the general trends of our findings tohold for actual GW burst upper-limit calculations. Wenote that this FAP detection threshold is chosen so thata 5-threshold EST is able to make detections using itssingle-event realization threshold (see Fig. 10 for the lo-cation of the FAP cutoff for these choices of T and T back ). As a result, our findings should change only neg-ligibly if more background were measured so that T back increased. We evaluate the upper limits at a confidenceof α = 90% for E GW = 1 M (cid:12) c and h rss,min well be-low the detection sensitivity of the O1 oLIB search. Wefinally note that because the Robbins-Monro algorithmfixes the value of E [ N GW ( R n )], Eq. 30 lets us rescale therate-density upper limits for any choice of GW emissionenergy using R ∝ E − GW .We compute the GW rate-density upper limits for 4different sine-Gaussian and 2 different white-noise burstsignal morphologies used in the LIGO-Virgo O1 anal-ysis [34], using both the LET and a 5-threshold EST.We list the results in Table I. The LET results obtainedwith the Robbins-Monro algorithm match those obtainedwith the historically-used analytic calculation, giving usconfidence that the algorithm converges properly. 
Wenote that the EST consistently provides stricter GW rate-7 TABLE I. The 90%-confidence rate-density upper limits (in Gpc − yr − ) placed on GW burst sources emitting at E GW =1 M (cid:12) c , assuming a uniform-in-volume source distribution. The upper limits are calculated for 4 sine-Gaussian waveformmorphologies, defined by their central frequency ( f ) and quality factor ( Q ), and 2 white-noise burst waveform morphologies,defined by their central f , bandwidth ∆ f , and duration τ . These waveforms were used in the LIGO-Virgo O1 analysis [34].We calculate the upper limits using both the LET and EST, and give their ratio. We find that the EST sets rate-density upperlimits that are stricter than those of the LET by 40% − We stress that the configuration of our simulations is ad hoc,meaning these rate-density upper limits are non-astrophysical . SG Morphology LET [Gpc − yr − ] EST [Gpc − yr − ] LET-to-EST ratio f = 153, Q = 9 8.5 4.8 1.8 f = 235, Q = 100 48 25 1.9 f = 554, Q = 9 1300 780 1.7 f = 849, Q = 3 8700 6300 1.4WNB Morphology LET [Gpc − yr − ] EST [Gpc − yr − ] LET-to-EST ratio f = 150 Hz, ∆ f = 100 Hz, τ = 0 . f = 300 Hz, ∆ f = 100 Hz, τ = 0 . density upper limits than the LET, with the level of im-provement ranging from 40% − V. CONCLUSIONS
The Event Stacking Test (EST) is a null hypothesis test that checks for statistical consistency between a background measurement and an analysis measurement. This test specifically targets the ("loud") tail of a distribution, evaluating the consistency of the empirical distributions at k thresholds, where k is defined by the user. The multiple thresholds have the effect of "stacking" multiple analysis events into the tail bin. As a result, up to k of the loudest analysis events can be detected in any measurement, even if none of them are individually significant enough to be detected in isolation. We have shown that the EST is well-calibrated, so that the significance of the reported level of inconsistency is representative of noise fluctuations. For carefully constructed GW searches, this proper calibration ensures that any significant inconsistencies are likely caused by the presence of GW events in the analysis data.

Comparing the GW detection power of the EST to that of other statistical consistency tests, such as single-outlier tests like the Loudest Event Test (LET) or nonparametric distributional tests like the Kolmogorov-Smirnov (KS) test or the Anderson-Darling (AD) test, the benefits of the EST become clear. The EST is a parameterized test in the sense that the user must choose k, meaning tuning is needed to find the value of k that maximizes the detection power of the test. This freedom to tune is beneficial when searching for GWs since it allows the user to interpolate between single-event tests (LET) and fully distributional tests (KS and AD) depending on the exact shape of the noise and GW event distributions.
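The stacking construction described above can be illustrated with a schematic toy version of a k-threshold test (our illustration only; it omits the paper's calibration and trials correction). For each j from 1 to k, the threshold sits at the j-th loudest analysis event, and a Poisson tail probability asks how often the background alone would yield at least j events that loud; the statistic keeps the smallest of the per-threshold p-values.

```python
import math

def poisson_sf(k, mu):
    """P[N >= k] for N ~ Poisson(mu)."""
    return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

def est_statistic(analysis_events, background_events, t_ratio, k=5):
    """Toy multi-threshold stack (illustration, not the paper's exact statistic).
    t_ratio = T_analysis / T_background rescales background counts."""
    loudest = sorted(analysis_events, reverse=True)[:k]
    p_values = []
    for j, threshold in enumerate(loudest, start=1):
        # expected number of background events at least this loud per analysis time
        mu = t_ratio * sum(1 for b in background_events if b >= threshold)
        p_values.append(poisson_sf(j, mu))
    # smallest per-threshold p-value; a real test would calibrate / trials-correct it
    return min(p_values)

# Five marginal outliers: individually each sits at p of order 1e-2 (k = 1),
# but stacking all five yields a far smaller joint tail probability (k = 5).
background = [0.1 * i for i in range(10000)]   # stand-in background sample
analysis = [995.0, 996.0, 997.0, 998.0, 999.0]
print(est_statistic(analysis, background, t_ratio=0.001, k=1))
print(est_statistic(analysis, background, t_ratio=0.001, k=5))
```

The toy captures the qualitative behavior claimed above: no single event is a confident outlier, yet the stacked tail bin is grossly inconsistent with the background expectation.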
The EST takes into account the relative rates of the background and analysis measurements at multiple thresholds, meaning it incorporates more information than just the shapes of the empirical distributions.

As a result of these features, we find that the EST robustly outperforms the KS and AD tests in terms of GW event detection, independent of source type or source distribution. The noise distributions of CBC searches typically have steeply-falling tails, meaning there is limited room in the tail for low-significance outliers to bunch together and form significant inconsistencies. As a result, the EST performs comparably to the LET in terms of detecting CBC sources in the significance regime where single-event detections can be made (and obviously outperforms the LET in regimes where only multiple-event detections are possible). On the other hand, GW burst searches typically have more heavily-tailed noise distributions than CBC searches, meaning there is more room in the tail for low-significance events to bunch together and form significant inconsistencies. Thus, the EST can achieve noticeably higher GW burst detection efficiencies than the LET, even in the significance regime where single-event detections are common. As a result, the EST can place upper limits on the astrophysical rate-density of GW burst sources that are 40% to 90% stricter than those placed with the LET.
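The contrast between steep and heavy noise tails can be made concrete with two toy survival functions (our made-up distributions, not LIGO-Virgo data). Under a steeply-falling exponential tail, a single loud event is already a strong outlier; under a heavy power-law tail, the same event is unremarkable in isolation, and only the coincidence of several such events becomes telling. The naive product of single-event probabilities below is only an order-of-magnitude proxy for a joint tail probability, not a calibrated p-value.

```python
import math

def exp_tail(x):
    """Steep (CBC-like) toy tail: P[X >= x] = exp(-x)."""
    return math.exp(-x)

def pareto_tail(x):
    """Heavy (burst-like) toy tail: P[X >= x] = x**-2 for x >= 1."""
    return x ** -2.0

# An event 10x the typical noise scale:
print(exp_tail(10.0))     # steep tail: already a strong single outlier
print(pareto_tail(10.0))  # heavy tail: unremarkable in isolation

# Five such events together (crude product proxy for a joint tail probability):
print(pareto_tail(10.0) ** 5)
```

With the heavy tail, each event alone carries a tail probability of only 10⁻², yet five of them together are wildly inconsistent with noise, which is the regime where stacking pays off.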
1% of events for the LET. However, this feature should not discourage the use of the EST since any detection of GW bursts would necessarily launch an intense parameter-estimation follow-up in an attempt to explain the events' astrophysical origins. Using the EST will allow the GW community to detect GW burst events more efficiently and in greater quantity. These prospects should only encourage the continued development of the parameter estimation tools that will be needed to maximally extract science from what will surely be groundbreaking detections.
VI. ACKNOWLEDGMENTS
The authors acknowledge the support of the National Science Foundation and the LIGO Laboratory. LIGO was constructed by the California Institute of Technology and Massachusetts Institute of Technology with funding from the National Science Foundation and operates under cooperative agreement PHY-0757058. We would also like to thank Hsin-Yu Chen, Kwan Yeung Ng, Yiwen Huang, Robert Eisenstein, Satya Mohapatra, Steve Drasco, Reed Essick, and the LIGO-Virgo Burst working group for useful comments and discussion. This is LIGO document number P1800170.

[1] B. Abbott et al., Phys. Rev. Lett., 061102 (2016).
[2] B. Abbott et al., Phys. Rev. Lett., 161101 (2017).
[3] J. Aasi et al., Classical and Quantum Gravity, 074001 (2015), arXiv:1411.4547 [gr-qc].
[4] F. Acernese et al., Classical and Quantum Gravity, 024001 (2015), arXiv:1408.3978 [gr-qc].
[5] S. Usman et al., Classical and Quantum Gravity, 215004 (2016), arXiv:1508.02357 [gr-qc].
[6] A. Nitz et al., PyCBC software, https://doi.org/10.5281/zenodo.596388 (2018).
[7] C. Messick et al., Phys. Rev. D, 042001 (2017), arXiv:1604.04324 [astro-ph.IM].
[8] B. Abbott et al., Phys. Rev. Lett., 221101 (2016), arXiv:1602.03841 [gr-qc].
[9] B. Abbott et al., Astrophys. J. Lett., L12 (2017), arXiv:1710.05833 [astro-ph.HE].
[10] B. Abbott et al., Astrophys. J. Lett., L13 (2017), arXiv:1710.05834 [astro-ph.HE].
[11] D. Coulter et al., Science, 1556 (2017), arXiv:1710.05452 [astro-ph.HE].
[12] M. Soares-Santos et al., Astrophys. J. Lett., L16 (2017), arXiv:1710.05459 [astro-ph.HE].
[13] S. Valenti et al., Astrophys. J. Lett., L24 (2017), arXiv:1710.05854 [astro-ph.HE].
[14] I. Arcavi et al., Nature, 64 (2017), arXiv:1710.05843 [astro-ph.HE].
[15] N. Tanvir et al., Astrophys. J. Lett., L27 (2017), arXiv:1710.05455 [astro-ph.HE].
[16] V. Lipunov et al., Astrophys. J. Lett., L1 (2017), arXiv:1710.05461 [astro-ph.HE].
[17] B. Abbott et al., Nature, 85 (2017), arXiv:1710.05835.
[18] C. L. Fryer and K. C. B. New, Living Reviews in Relativity, 1 (2011).
[19] S. Gossan et al., Phys. Rev. D, 042002 (2016), arXiv:1511.02836 [astro-ph.HE].
[20] B. Abbott et al., Astrophys. J. Lett., L16 (2017), arXiv:1710.09320 [astro-ph.HE].
[21] B. Abbott et al., Phys. Rev. Lett., 211102 (2008).
[22] A. Corsi and B. J. Owen, Phys. Rev. D, 104014 (2011), arXiv:1102.3421 [gr-qc].
[23] T. Damour and A. Vilenkin, Phys. Rev. D, 063510 (2005).
[24] C. D. Ott, Classical and Quantum Gravity, 063001 (2009).
[25] V. Morozova et al., (2018), arXiv:1801.01914 [astro-ph.HE].
[26] H. Dimmelmeier, C. D. Ott, A. Marek, and H.-T. Janka, Phys. Rev. D, 064056 (2008).
[27] K. N. Yakunin, A. Mezzacappa, P. Marronetti, E. J. Lentz, S. W. Bruenn, W. R. Hix, O. E. Bronson Messer, E. Endeve, J. M. Blondin, and J. A. Harris, (2017), arXiv:1701.07325 [astro-ph.HE].
[28] S. Scheidegger, S. C. Whitehouse, R. Käppeli, and M. Liebendörfer, Classical and Quantum Gravity, 114101 (2010), arXiv:0912.1455 [astro-ph.HE].
[29] R. Lynch et al., Phys. Rev. D, 104046 (2017).
[30] S. Klimenko et al., Phys. Rev. D 93, 042004 (2016), arXiv:1511.05999 [gr-qc].
[31] N. Cornish and T. Littenberg, Classical and Quantum Gravity, 135012 (2015).
[32] J. Abadie et al., Phys. Rev. D, 122007 (2012).
[33] B. Abbott et al., Phys. Rev. D, 042005 (2015), arXiv:1511.04398 [gr-qc].
[34] B. Abbott et al., Phys. Rev. D, 042003 (2017).
[35] B. Abbott et al., Classical and Quantum Gravity, 065009 (2018), arXiv:1711.06843 [gr-qc].
[36] B. Abbott et al., Phys. Rev. X, 041015 (2016).
[37] R. Biswas et al., Classical and Quantum Gravity, 175009 (2009), [Erratum: Class. Quant. Grav. 30, 079502 (2013)], arXiv:0710.0465 [gr-qc].
[38] W. Farr et al., Phys. Rev. D, 023005 (2015), arXiv:1302.5341 [astro-ph.IM].
[39] B. Abbott et al., Astrophys. J. Suppl., 14 (2016), arXiv:1606.03939 [astro-ph.HE].
[40] B. Abbott et al., Astrophys. J. Lett., L1 (2016), arXiv:1602.03842 [astro-ph.HE].
[41] B. Abbott et al., Phys. Rev. D, 062004 (2008), arXiv:0709.0766 [gr-qc].
[42] B. Abbott et al., Astrophys. J., 1438 (2010), arXiv:0908.3824 [astro-ph.HE].
[43] K. Cannon, C. Hanna, and D. Keppel, Phys. Rev. D, 024025 (2013).
[44] B. Allen, Phys. Rev. D, 062001 (2005), arXiv:gr-qc/0405045 [gr-qc].
[45] A. Nitz et al., Astrophys. J., 118 (2017), arXiv:1705.01513 [gr-qc].
[46] K. Cannon, C. Hanna, and J. Peoples, ArXiv e-prints (2015), arXiv:1504.04632 [astro-ph.IM].
[47] J. Kanner et al., Phys. Rev. D, 022002 (2016), arXiv:1509.06423 [astro-ph.IM].
[48] M. Was et al., Classical and Quantum Gravity, 015005 (2010).
[49] D. Geiger, T. Verma, and J. Pearl, Networks, 507, https://onlinelibrary.wiley.com/doi/pdf/10.1002/net.3230200504.
[50] C. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer-Verlag, Berlin, Heidelberg, 2006).
[51] C. Bonferroni, Teoria statistica delle classi e calcolo delle probabilità, Pubblicazioni del R. Istituto superiore di scienze economiche e commerciali di Firenze (Libreria internazionale Seeber, 1936).
[52] A. Kolmogorov, Giornale dell'Istituto Italiano degli Attuari, 83 (1933).
[53] N. Smirnoff, Rec. Math. N.S. [Mat. Sbornik], 3 (1939).
[54] T. W. Anderson and D. A. Darling, Ann. Math. Statist., 193 (1952).
[55] T. W. Anderson and D. A. Darling, Journal of the American Statistical Association, 765 (1954).
[56] F. Scholz and M. Stephens, Journal of the American Statistical Association, 918 (1987).
[57] M. Stephens, Journal of the American Statistical Association, 730 (1974).
[58] B. Schutz, Classical and Quantum Gravity, 125023 (2011), arXiv:1102.5421 [astro-ph.IM].
[59] H.-Y. Chen and D. Holz, in APS Meeting Abstracts (2014) p. C15.001.
[60] S. Vitale, Phys. Rev. D, 121501 (2016), arXiv:1610.06914 [gr-qc].
[61] R. Lynch et al., Astrophys. J. Lett., L24 (2018), arXiv:1803.02880 [astro-ph.HE].
[62] J. Neyman, Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 333 (1937), http://rsta.royalsocietypublishing.org/content/236/767/333.full.pdf.
[63] P. Sutton, (2013), arXiv:1304.0210 [gr-qc].
[64] H. Robbins and S. Monro, Ann. Math. Statist. 22, 400 (1951).