Modeling Smooth Backgrounds and Generic Localized Signals with Gaussian Processes
Meghan Frate, Kyle Cranmer, Saarik Kalia, Alexander Vandenberg-Rodes, Daniel Whiteson
Department of Physics and Astronomy, University of California, Irvine, CA 92697
Department of Physics and Astronomy, New York University, New York, NY 10003
Department of Physics, MIT, Boston, MA
Obsidian Security Inc., Newport Beach, CA 92660
We describe a procedure for constructing a model of a smooth data spectrum using Gaussian processes rather than the historical parametric description. This approach considers a fuller space of possible functions, is robust at increasing luminosity, and allows us to incorporate our understanding of the underlying physics. We demonstrate the application of this approach to modeling the background to searches for dijet resonances at the Large Hadron Collider and describe how the approach can be used in the search for generic localized signals.
INTRODUCTION
The search for new particles and interactions is a central focus of the research program of the Large Hadron Collider (LHC). Typically, such a search is cast in the language of a hypothesis test of a background model predicted by the standard model of particle physics. In some cases, the alternative hypothesis is specified by a particular theory or class of theories, in which case a practical task of the experimentalist is to identify a good discriminating variable and to construct models of the background-only and signal-plus-background hypotheses. This approach includes searches for resonances on a smooth background spectrum as well as non-resonant signals with well-specified signal characteristics. In addition, some searches for new physics aim at model-independence by making only weak assumptions about the deviation from the background. For instance, such a search might assume only that the signal manifests itself as a localized deviation without completely specifying the distribution. A critical element in both approaches is a proper description of the smooth background spectrum.

In an ideal case, the background intensity f(x) as a function of observable x would be known exactly. In this scenario, we can derive an expected total number of events ν ≡ ∫ f(x) dx and a probability density p(x) ≡ f(x)/ν, which leads to the familiar unbinned extended maximum likelihood for a dataset D = {x_1, ..., x_N} with N observed events

p(D) = Pois(N|ν) ∏_{i=1}^{N} p(x_i).   (1)

In realistic applications, we replace the ideal background prediction f(x) with a more flexible background model. The traditional approach for searches with smooth background shapes [1–3] is to introduce a family of functions f(x|θ) with some explicit functional form parametrized by θ. The choice of functional form to describe the background is central to the new particle search, yet functional forms derived from first principles are almost never available.
Instead, the typical approach is to select an ad-hoc parametric function with little-to-no grounding in the physics involved, but which fits reasonably well in collider data and simulated samples. As the luminosity of the collected datasets grows, however, the discrepancies between the ad-hoc model and the true physical process are revealed. As the rigid form and limited flexibility of the parametric functions fail to accommodate the observed spectra, continual addition of new ad-hoc terms is required. In the coming period of high-luminosity collisions at the Large Hadron Collider, this issue will become even more acute.

A second issue with the traditional approach is an unsatisfying bifurcation of the search strategy between hypothesis testing for specific signal models and the more model-independent search for localized deviations from a smooth background. In model-specific hypothesis testing, a likelihood ratio test can be applied to weigh the background-only hypothesis against the signal-plus-background hypotheses, while the more model-independent searches often use a more elaborate procedure like BumpHunter [4]. The BumpHunter algorithm considers a large family of search windows and then performs a number-counting test in each. Both approaches face a look-elsewhere effect from multiple testing, but the BumpHunter [4] approach also faces complications from the ad-hoc algorithmic choices and multiple correlated background estimates.

In this paper, we propose a new strategy for modeling smooth backgrounds, one that uses Gaussian Processes (GPs). This strategy relaxes the overly-rigid constraints of ad-hoc functional forms, which makes the background model more robust to increasing luminosity. Moreover, GPs can be used to model generic localized signals, which allows for a unified statistical treatment of model-specific and more model-independent searches.

GPs are remarkably flexible and have been used to improve modeling for geostatistics, climate, exoplanets, and machine learning [5–10]. It is curious that GPs have not yet been widely used in high-energy physics. We suspect that this is due to a few barriers, which we aim to address. First, the GP formalism is somewhat foreign to high-energy physicists, so we detail how we can translate our understanding of the underlying physics into the GP formalism. Secondly, a large portion of the GP literature is focused on regression problems, instead of modeling probability distributions. Thus, we make an effort to connect the traditional extended likelihood treatment in high-energy physics to the GP approach. Lastly, there are subtle issues related to the statistical interpretation and procedures of GPs, so in addition to the more pedagogical goals of the paper we also incorporate some more technical statistical considerations for completeness. These technicalities are not essential to the main point, and should not deter the interested reader.

We use as a test case the LHC search for resonances decaying to two hadronic jets, which is one of the most powerful searches for new physics at the LHC and examines a falling spectrum over a broad mass range, providing a vigorous test of the method. In the following, we describe the historical approach, give the details of the GP approach, and present studies that document its performance.
PARAMETRIZED FUNCTION APPROACH

Dijet Resonance Search Example
Searching for a resonance above a smooth background is a powerful technique in the hunt for new particles and interactions. This classic approach has been used in many experiments at different facilities, spanning several decades [1, 11–14]. In particular, the search for dijets at the LHC is a simple but effective search for new physics. It analyzes the invariant mass spectrum of pairs of high-p_T jets which are found back-to-back in the detector. The dominant background is due to multi-jet production via non-resonant hadronic interactions, which historically has been difficult to model accurately using simulation, from the bulk to the high-mass tail. For this reason, a functional approach with three parameters was introduced by the UA2 collaboration [1]. In the intervening thirty years, this form has needed the addition of further terms in order to be able to adequately describe observed spectra, leading to the most recent form used by the ATLAS collaboration [15] (previously θ₄ = 0):

f(x|θ) = θ₁ (1 − x)^{θ₂} x^{θ₃ + θ₄ log x},   (2)

where x = m_jj/√s. The selection of the particular functional form has been performed by comparing a handful of choices of varying complexity, selecting the simplest that satisfies the Wilks test [16], which seeks "to determine if the background estimation would be significantly improved by an additional degree of freedom" [15]. The structure of this parametric function and the selection of the terms is nearly entirely pragmatic, and has little basis in knowledge of the underlying physics processes. An example fit of the three-parameter version to the ATLAS 2015 dijet data at √s = 13 TeV is shown in Fig. 1. In some cases, experimental papers attempt to assess a systematic uncertainty on the result due to the arbitrary structure of this function [15].
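As a concrete illustration of Eq. 2, the ad-hoc function is straightforward to evaluate numerically. The following is a minimal NumPy sketch; the parameter values are illustrative only (not fitted values from any analysis), chosen simply to produce a steeply falling spectrum:

```python
import numpy as np

def dijet_fit_function(x, theta1, theta2, theta3, theta4=0.0):
    """Ad-hoc dijet background shape of Eq. 2:
    f(x|theta) = theta1 * (1-x)^theta2 * x^(theta3 + theta4*log(x)),
    with x = m_jj / sqrt(s) in (0, 1). theta4 = 0 gives the
    three-parameter version."""
    x = np.asarray(x, dtype=float)
    return theta1 * (1.0 - x) ** theta2 * x ** (theta3 + theta4 * np.log(x))

# Illustrative (hypothetical) parameters: a smooth, steeply falling spectrum.
x = np.linspace(0.1, 0.9, 9)
f = dijet_fit_function(x, theta1=1.0, theta2=10.0, theta3=-4.0)
```

With θ₂ > 0 and θ₃ < 0 the function is positive and monotonically falling over the fit range, mimicking the qualitative shape of the dijet spectrum.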
An alternative approach is to abandon the full-spectrum fit entirely and fit only a narrow sliding window [3, 17], which tolerates a simpler functional form, but sacrifices the power of the full spectrum and complicates the global statistical interpretation.

Incorporating auxiliary information
In the dijet case, the parameters θ are only constrained by the data D; but more generally some auxiliary information a (e.g. calibration measurements, theoretical considerations, etc.) may be used to constrain the parameters through an additional constraint term, leading to the likelihood

p(D, a|θ) = Pois(N|ν(θ)) ∏_{e=1}^{N} p(x_e|θ) · p_constr.(a|θ).   (3)

In addition to constraint terms in the likelihood corresponding to auxiliary measurements a that have a clear frequentist interpretation, it is common to incorporate terms in the likelihood that reflect prior knowledge. For instance, uncertainties in the theoretical cross section due to missing higher-order corrections typically are not treated as unconstrained parameters; instead, the size of these uncertainties is estimated by varying the renormalization and factorization scales. Moreover, when very flexible models are used (e.g. in unfolding problems) it is often desirable to include penalty terms that regularize the problem in order to avoid unphysical solutions. Lastly, consideration of multiple functional forms can fit into this formalism where θ takes on discrete values indexing the model choice [18]. For a more pedagogical discussion of constraint terms, see Ref. [19]. (In the statistics literature, probability models of the form of Eq. 1 are referred to as a Poisson point process with an intensity f(x).)

FIG. 1: Invariant mass of dijet pairs reported by ATLAS [15] in proton-proton collisions at √s = 13 TeV with integrated luminosity of 3.6 fb⁻¹. The blue line is a fit using the first three terms of Eq. 2. The bottom pane shows the significance of the residual between the data and the fit.

GAUSSIAN PROCESS APPROACH
Conceptually, GPs provide a generalization of the intensity f(x) that is not tied to a particular functional form with a fixed number of parameters. Instead, the intensity at x is modeled as a Gaussian, and the relationship between the intensity at different points is encoded in the covariance kernel Σ(x, x'). In this way, GPs allow for the description of a much broader set of functions (see Fig. 2) and provide a natural way to incorporate auxiliary information and prior knowledge.

While the exact treatment of Poisson statistical fluctuations together with GPs is somewhat cumbersome, in situations with many events we can simplify the situation by working with a binned likelihood where the Poisson counts in each bin y_i are accurately approximated with a Gaussian distribution. In that case, the likelihood of Eq. 3 can be approximated as

p(y, a|θ) = ∏_{i=1}^{n} Pois(y_i | f̄(x_i|θ)) · p_constr.(a|θ)   (4)
          ≈ Gaus(y | f̄(x|θ), σ²_y) · Gaus(f̄(x|θ) | µ, Σ),

where x are the bin centers, y are the observed bin counts, and f̄(x|θ) are the expected bin counts from averaging f(x|θ) within the corresponding bin. The first term of the Gaussian approximation in Eq. 4 is the per-bin Poisson statistical fluctuation, while the second term is an n × n multivariate Gaussian distribution that approximates the effect of p_constr.(a|θ) propagated to the expected bin counts f̄(x|θ). (From the point of view of the statistics literature, the combination of a Poisson point process with a Gaussian Process as the intensity is known as a Cox process, named after the statistician David Cox, who first published the model in 1955 [20]. In order to enforce positivity in the intensity, it is common to model the log of the intensity with a Gaussian process, as was done in Ref. [21]. Inference in such a doubly stochastic process is technically more cumbersome than the approach described above [22].)

The second term reveals how an ad-hoc functional form parametrized by θ ∈ R^d can be overly rigid. In the n-dimensional space of bin counts, the parametrized function f̄(x|θ) only maps out a d-dimensional subspace (e.g. the red markers in Fig. 2).

Gaussian Processes provide a natural way to expand around the overly restricted parametrized model and fill in the full space of possibilities. A Gaussian Process is defined as "a collection of random variables, any finite number of which have a joint Gaussian distribution" [8, 21]. Instead of providing a parametric form, we directly model the mean (µ) and covariance (Σ) functions of the Gaussian process, defined as

µ(x) = E[f(x)]   (5)
Σ(x, x') = E[(f(x) − µ(x))(f(x') − µ(x'))].   (6)

This covariance kernel Σ is then augmented with the diagonal (uncorrelated) statistical component σ²(x) I to provide the likelihood for the observed bin counts y.

GPs have rarely been used in high-energy physics, with few exceptions including a top-quark mass measurement by the CMS experiment [23] and a paper by Golchi & Lockhart focusing on other statistical aspects in the search for the Higgs boson [21]. Those analyses took advantage of the nice properties of GPs, but did not emphasize or explore how physics considerations come into play in the choice of the kernel. An attractive feature of GPs for high-energy physics is that the kernel directly encodes understanding of the underlying physics, which is manifest as covariance among the bin counts. For example, our understanding of the mass resolution, the parton density function uncertainties, the jet energy scale uncertainties, and the expected smoothness can be incorporated into the kernel.

An advantage of interpreting the Gaussian process through the relationship of Eqs. 4–6 is that it makes clear how to connect the GP formalism with the existing considerations around sources of uncertainty: those associated to auxiliary measurements with a clear frequentist interpretation, those associated to theoretical considerations that cannot be identified with a random variable with frequentist probability, and those terms that are introduced for the sake of regularization.

In the following sections, we outline the connection to unfolding, study the typical covariance structure from physical processes that affect the shape of the background spectrum, construct kernel and mean functions, and describe the procedure to fit the GP to the observed spectrum.

FIG. 2: Schematic of the relationship between an ad-hoc function and the GP. An example toy dataset is shown (left) with samples from the posterior for an ad-hoc 1-parameter function (red) and a GP (green). Each posterior sample is an entire curve f(x), which corresponds to a particular point in the (center) plane of f(x_A) vs. f(x_B). The red dots for the ad-hoc 1-parameter function trace out a 1-dimensional curve, which reveals how the function is overly-rigid. In contrast, the green dots from the GP relax the assumptions and fill a correlated multivariate Gaussian (with covariance indicated by the black ellipse). The covariance kernel Σ(x, x') for the GP is shown (right) with Σ(x_A, x_B) corresponding to the black ellipse of the center panel.

Connection to unfolding
The process of constructing a GP kernel, which encodes our physical requirements for the background model, is closely related to the more familiar topic of imposing physical requirements when unfolding differential cross section distributions.

In unfolding, we have an observed set of bin counts for an observable x, which we also assume to arise from a Poisson point process as in Eq. 1. In contrast to searches, the goal is to estimate the differential cross section for a theoretical quantity z, removing dependence on experimental efficiency, acceptance, and detector effects. The relationship between the target theoretical intensity f(z) and the intensity for the observable f(x) is given by

f(x) = ∫ f(z) W(x|z) dz,   (7)

where W(x|z) is a folding matrix or transfer function that encodes smearing effects from detector resolution. Ideally, the experimentalist makes no assumptions about the theoretical intensity f(z), and infers f(z) through the unfolding process. As has been well studied, this type of inverse problem is ill-posed in the technical sense that the solution f(z) is sensitive to small changes to f(x) or the observed data. The unregularized maximum likelihood solution to the unfolding problem often exhibits unphysical oscillations in f(z). Physical considerations motivate additional regularization or penalty terms to the likelihood that are not motivated by auxiliary measurements, but which lead to solutions that behave better for the f(z) we consider relevant.

Of particular interest for particle physics is to revisit Eq. 7 when f(z) is itself a GP. In that case, Eq. 7 can be seen as applying the linear operator ∫ dz W(x|z) to the GP f(z), which gives rise to another GP through a process convolution [25, 26], resulting in

Σ(x, x') = ∫∫ dz dz' Σ(z, z') W(x|z) W(x'|z').   (8)

As expected, even in the extreme case where different bins of the theoretical distribution are allowed to be totally uncorrelated, Σ(z, z') ∝ δ(z − z'), the finite resolution of the detector will introduce correlations in x via W. For example, if W(x|z) is a Gaussian smearing with resolution σ_x, then the resulting GP is an exponential squared kernel (see Eq. 9) with length scale l = √2 σ_x.

Physically motivated kernels
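As a first example of a physically motivated kernel, the Gaussian-smearing case of Eq. 8 can be verified numerically. This is a sketch with toy grids and an arbitrary illustrative σ_x: smearing a totally uncorrelated intensity, Σ(z, z') ∝ δ(z − z'), yields an exponential-squared kernel with length scale l = √2 σ_x:

```python
import numpy as np

sigma_x = 0.5
z = np.linspace(-10.0, 10.0, 4001)          # dense integration grid for the true variable z
dz = z[1] - z[0]
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # a few observed-variable points

def W(xi, z, sigma):
    """Gaussian smearing kernel W(x|z) with resolution sigma."""
    return np.exp(-0.5 * (xi - z) ** 2 / sigma ** 2) / (sigma * np.sqrt(2 * np.pi))

# Eq. 8 with Sigma(z, z') = delta(z - z') collapses one integral:
# Sigma(x, x') = integral dz W(x|z) W(x'|z), evaluated here on the grid.
Wz = np.array([W(xi, z, sigma_x) for xi in x])
K_numeric = (Wz @ Wz.T) * dz

# Analytic expectation: exponential-squared kernel with l = sqrt(2) * sigma_x.
l = np.sqrt(2.0) * sigma_x
amplitude = 1.0 / (2.0 * sigma_x * np.sqrt(np.pi))
K_analytic = amplitude * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l ** 2)
```

The numerical and analytic kernels agree to numerical precision, confirming that detector resolution alone induces an exponential-squared covariance among observed bins.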
Next, we motivate contributions to the kernel that are clearly grounded in experimental and theoretical forms of uncertainty cast in terms of Eq. 7. (There is a deep connection between Tikhonov regularization and other forms of regularization used in unfolding and the kernels of Gaussian Processes, which is formalized in the language of Reproducing Kernel Hilbert Spaces [8, 24]. A detailed discussion of this connection is beyond the scope of this paper; however, we should anticipate a contribution to the Gaussian Process kernel that is connected to this loosely defined notion of smoothness in the underlying physics.) In the ideal case, both the theoretical prediction for f(z) and the efficiencies, acceptance, and experimental resolution encoded in W(x|z) would be well specified. In that ideal case, f(x) is totally fixed and the GP description of the intensity would collapse to a single point in function space.

If, however, the detector response were itself uncertain, then this would propagate into the space of intensities. Take, for example, the jet energy scale (JES) uncertainty. As described in Refs. [17, 27], the ATLAS JES uncertainty is only a few percent for jets with p_T of around 1 TeV where data are plentiful, while the limited size of observed examples for higher-p_T jets requires an alternate approach to estimating the JES. The resulting JES uncertainty therefore grows rapidly with m_jj and has an impact of at most 15% [27]. To illustrate the covariance due to the JES uncertainty, consider a simplified two-parameter model for the impact on the m_jj distribution: J(z, θ) = 1 + 15% θ₁ z + 5% θ₂ (1 − z), where z is the true dijet invariant mass and z_max = 7 TeV. We use the best-fit 3-parameter function as a proxy for f(z) and fold in the smearing W(x|z, θ) = Gaus(x | z J(z/z_max, θ), σ_x), where σ_x = 2% z is the dijet invariant mass resolution [17]. By assuming a uniform prior and an appropriate scaling for θ, we sample from the posterior Gaus(θ₁|0, 1) Gaus(θ₂|0, 1) and propagate the uncertainty in θ through to the predicted bin counts f̄(x|θ) as in Eq. 4. This allows us to explicitly build the covariance matrix Σ using the simulation, as shown in Fig. 3. As expected, we see a roughly block-diagonal structure defined by low and high mass regions.

Similarly, we can study the uncertainty in the theoretical distribution arising from uncertainty in the parton distribution functions (PDFs), such as that in the ATLAS 7 TeV analysis [28, 29]. Figure 3 corresponds to the PDF uncertainties described in Ref. [30] for NLO calculations from POWHEG-BOX [31, 32] using the S no-jet PDF sets provided in NNPDF3.0 [33]. In this case, sum rules in the PDFs lead to anti-correlation between low- and high-mass regions.

With a satisfying GP description of the theoretical distribution and knowledge of the smearing W(x|z), we can arrive at a well-motivated kernel for the observed spectrum via Eq. 8.

Implicit covariance in current background models
It is also instructive to examine the effective covariance implied by current approaches: the ad-hoc fit and the sliding window fit. Again we study this through the relationship of Eqs. 4–6. In the case of the global fit to the ad-hoc function of Eq. 2, we construct the posterior for θ given the ATLAS data shown in Fig. 1 and a uniform prior on θ. We sample the posterior using emcee [34], a Python implementation of the affine-invariant ensemble sampler for Markov Chain Monte Carlo (MCMC) proposed by Goodman & Weare [35]. From the posterior samples θ ∼ p(θ|y) we build the covariance matrix Σ shown in Fig. 4. The global structure of the covariance resembles that arising from PDF uncertainties, but recall that the model only sweeps out a 3-dimensional subspace in the much larger space of functions with this same covariance.

FIG. 3: Correlation coefficients between pairs of mass bins due to variations in the jet energy scale (left) or parton distribution functions (right). These demonstrate the broad but smoothly varying influence of these effects on the mass spectrum.

FIG. 4: Correlation coefficients between pairs of mass bins from many samples of the global ad-hoc fit (left) and the sliding window fit (right). The plot of the global fit reveals non-physical pivot points where the ad-hoc function is less flexible. The sliding window fit has a strictly limited correlation, by construction.

In the case of the sliding window approach, we also estimate the covariance from a table of f(x_i) values, but instead of posterior samples, we perform a single fit for each of 50 mass windows. For the k-th window, if the bin is outside the window we record 0; otherwise we record f(x_i|θ̂_k), where θ̂_k is the best-fit value of θ for the fit restricted to the window. The covariance is calculated from these recorded values. This method should create a covariance structure which is limited to the diagonal band, as each fit includes only a small portion of the distribution; indeed this is what we see in Fig. 4. (The slight negative correlation outside of this band is an artifact of recording 0 for the value of the intensity, which is less than the mean.) While the sliding window fit approach provides more flexibility and better scaling to high luminosities than a global fit, the piecewise approach to background modeling complicates the downstream statistical analysis.
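The construction of a bin-count covariance from posterior samples can be sketched in a few lines. This is a minimal NumPy illustration with a hypothetical 1-parameter toy model and Gaussian stand-in samples (rather than an actual emcee chain); because the model has a single parameter, the resulting correlation matrix is nearly rank-1, echoing the "d-dimensional subspace" point above:

```python
import numpy as np

def covariance_from_samples(f_of_theta, theta_samples, x):
    """Evaluate f(x|theta) for each posterior sample and form the
    sample covariance of the predicted bin contents."""
    curves = np.array([f_of_theta(x, th) for th in theta_samples])  # (n_samples, n_bins)
    return np.cov(curves, rowvar=False)

# Stand-in for MCMC output: Gaussian samples of a 1-parameter toy model.
rng = np.random.default_rng(1)
theta_samples = rng.normal(loc=2.0, scale=0.1, size=(500, 1))
x = np.linspace(0.1, 0.9, 20)
f = lambda x, th: (1.0 - x) ** th[0]          # hypothetical toy intensity
Sigma = covariance_from_samples(f, theta_samples, x)
corr = Sigma / np.sqrt(np.outer(np.diag(Sigma), np.diag(Sigma)))
```

Because every bin responds to the single parameter in the same direction, the off-diagonal correlations are close to +1 everywhere; the richer pivot structure of Fig. 4 arises only once several parameters compete.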
Parametrized kernels
In practice, one rarely works with a covariance matrix explicitly calculated from samples, but instead parametrizes the kernel Σ(x, x'). A common parametrized kernel is the exponential-squared kernel:

Σ(x, x') = A exp( −(x − x')² / (2 l²) ).   (9)

This kernel describes a maximal covariance with amplitude A that falls off with a length scale l. If |x − x'| ≫ l, then the covariance is very small. The parameters of the kernel are traditionally referred to as hyperparameters, and the values of the hyperparameters can be fit either a priori (e.g. by considering simulated or control samples), or to the data simultaneously with the hypothesis test itself.

In addition to the ability to construct kernels from first principles as in Eq. 8, there are also rules for various types of operations on GPs [9, 10]. For example, there are rules for describing the mean and covariance of a new GP produced by combining two GPs through summation and multiplication. These rules effectively form a grammar that allows us to compose complex kernels through a narrative.

We can also build parametrized kernels that capture essential physics in a direct, intuitive way. This approach sits somewhere between the first-principles approach and the use of ad-hoc functions; however, unlike most ad-hoc functional forms, we will be able to interpret the terms more directly. For instance, in the dijet case we might argue that the mass resolution is one of the dominant contributions to the covariance, and start with an exponential squared kernel. We would anticipate the length scale to be larger than √2 σ_x from mass resolution effects, to reflect the intrinsic smoothness of the underlying true distribution. In practice, the mass resolution is not constant, so we may allow for the length scale (incorporating both mass resolution and intrinsic smoothness) to have a linear dependence l(x) = bx + c.
This varying lengthscale can be accommodated by the Gibbs kernel [8, 36].In addition, instead of a constant amplitude for the varia-tion, we take the amplitude of fluctuations to be a fallingexponential. These considerations motivate the following of recording 0 for the value of the intensity, which is less thanthe mean. parametrized kernelΣ( x, x (cid:48) ) = Ae d − ( x + x (cid:48) )2 a (cid:113) l ( x ) l ( x (cid:48) ) l ( x ) + l ( x (cid:48) ) e − ( x − x (cid:48) )2 l ( x )2+ l ( x (cid:48) )2 (10)The hyperparameters of this kernel ( A, a, b, c, d ) will bedetermined during the fit to the data, described below.We can identify the terms associated to the narrativethat produced this kernel, which we argue makes it lessad-hoc than the functional form of Eq. 2, and the GPassociated to this kernel is much more flexible. Below wewill compare the performance of this kernel to the ad-hocfunction in the context of a dijet resonance search.
Mean function
It is common to use µ(x) = 0 for the mean of a GP, partially because once conditioned on the observations y, the posterior mean usually adapts to this offset remarkably well if the spacing between the observed x_i is small compared to the length scale. However, if we have a reasonable estimate of the mean, there is no reason not to use it. Thus we use the three-parameter fit function as the mean µ(x) of the GP, which contributes an additional three hyperparameters (θ₁, θ₂, θ₃). The results are very robust to the choice of the mean; the key to the performance is really in the choice of the kernel.

In the case of signal-plus-background hypothesis testing with a known signal model f_s(x|θ_s), the mean function also includes the signal contribution. Due to the linear relationship between y and µ(x) in the Gaussian, this is numerically equivalent to subtracting the signal expectation from the observation and modeling the residuals with the background-only GP.

Incorporating GPs into the statistical procedure
There are subtle issues associated to the Bayesian and frequentist interpretations of the GP. GPs are most commonly presented in a Bayesian formalism where the mean and covariance kernel are interpreted as a prior distribution over the space of functions. Then, given the observations y, we arrive at an updated GP that represents a posterior distribution over the space of functions. Because both the prior and the posterior are Gaussian, there are explicit formulae for the posterior mean and covariance that rely on basic linear algebra [8]. In particular,

µ(x*|y) = µ(x*) + Σ(x*, x)[Σ(x, x) + σ²(x) I]⁻¹ (y − µ(x))   (11)

and

Σ(x*, x'*|y) = Σ(x*, x'*) − Σ(x*, x)[Σ(x, x) + σ²(x) I]⁻¹ Σ(x, x'*),   (12)

where x* are the values where the posterior GP is being evaluated and x are the values being conditioned on. In a typical binned analysis, x and x* would both be the bin centers. In addition, fitting the hyperparameters of a GP is usually based on maximizing the marginal likelihood, which has the explicit form

log L = −(1/2) log|Σ| − (1/2) (y − µ(x))ᵀ Σ⁻¹ (y − µ(x)) − (n/2) log 2π.   (13)

In order to take advantage of the closed-form solutions above and fast linear algebra implementations, the statistical fluctuations are typically approximated as σ²(x) = y instead of the more accurate Poisson mean.

In principle, one can revisit the logic of a particular GP kernel to trace back the terms that came from auxiliary measurements with a clear frequentist interpretation, and carry out the equivalent profile likelihood treatment. Given the correspondence between profiling and marginalization in the Gaussian case, this should lead to equivalent results with differences of at most constant factors; however, those factors may depend on the hyperparameters. The philosophical and practical use of the GP's profile likelihood will also be complicated by the contributions to the kernel from theoretical considerations that lack a frequentist interpretation and contributions motivated by regularization considerations. Thus, we leave this as a direction for future work.

In practice there are roughly four ways the GP can be integrated into the high-energy physics search strategy. The first is to take on a fully Bayesian approach using the GP as the intensity of a Poisson point process, which forms a doubly stochastic Cox process [20]. Hypothesis testing and limits can then be based on Bayes factors. This approach was considered in Ref. [21]; however, it is computationally very intensive and difficult to combine with the bulk of the likelihood-based search strategies. The second is to approximate the Poisson fluctuations as Gaussian and use the marginal likelihood of Eq. 13 directly in the test statistic, which is more computationally efficient.
One can still use this marginal likelihood as a test statistic in the frequentist sense, but it requires using ensemble tests to calibrate the p-values, as one cannot rely on the standard asymptotic formulae [37].

The third approach is a two-step process where one first calculates the posterior mean of Eq. 11 and then uses µ(x|y) as the intensity for the Poisson likelihood of Eq. 1 or its corresponding binned version. Similar to the profile likelihood approach, the background model can be conditioned on the signal hypothesis being tested, since the signal's parameters are present in the prior mean function. Computationally this approach is similar to the one above, with the additional cost of evaluating the Poisson likelihood and the practical considerations of managing the two-step process.

The last approach is also a two-step process where one first calculates both the posterior mean of Eq. 11 and the posterior covariance of Eq. 12 and then uses those in a downstream least-squares analysis. For example, this approach is convenient for goodness-of-fit tests.

FIG. 5: Invariant mass of dijet pairs reported by ATLAS [15] in proton-proton collisions at √s = 13 TeV with integrated luminosity of 3.6 fb⁻¹. The green line shows the resulting Gaussian process background model. The bottom pane shows the significance of the residual between the data and the GP model.

In the studies below we only use the marginal likelihood for fitting the hyperparameters, which we optimize using Minuit [38]. In later studies we use the third approach of treating the posterior mean (conditioned on the signal model parameters) as the Poisson intensity for the background. We use the software package george [39] (see also celerite [40]) for GP regression, which we have extended by implementing custom kernels.

The posterior mean and posterior correlation matrix from fitting the GP to the ATLAS dataset are shown in Figs. 5 and 6.
By visual inspection, the mean function fits the data well and the correlation is constrained near the diagonal, with the off-diagonal elements dying off quickly. This structure reflects the locality of the GP, where nearby bins are closely connected but bins far from each other in mass are uncorrelated.
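The posterior mean (Eq. 11) and posterior covariance (Eq. 12) used above have simple closed forms. The sketch below is a minimal numpy version, with a generic exponential-squared kernel standing in for the paper's custom dijet kernel and the Gaussian approximation σ²(x_i) = y_i for the noise:

```python
import numpy as np

def exp_squared_kernel(x1, x2, amp=1.0, scale=1.0):
    # Generic stand-in kernel; the paper uses a custom dijet kernel.
    d = x1[:, None] - x2[None, :]
    return amp * np.exp(-0.5 * (d / scale) ** 2)

def gp_posterior(x, y, prior_mean, x_star, **kw):
    """Posterior mean (Eq. 11) and covariance (Eq. 12) given observed
    bin counts y, with Poisson noise approximated as sigma_i^2 = y_i."""
    K = exp_squared_kernel(x, x, **kw) + np.diag(y)
    Ks = exp_squared_kernel(x, x_star, **kw)
    Kss = exp_squared_kernel(x_star, x_star, **kw)
    alpha = np.linalg.solve(K, y - prior_mean(x))
    mu = prior_mean(x_star) + Ks.T @ alpha
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, cov
```

With george, the equivalent computation is `gp.compute(x, yerr=np.sqrt(y))` followed by `gp.predict(y, x_star, return_cov=True)`.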
PERFORMANCE STUDIES
Figures 5 and 6 demonstrate the fit (posterior mean) of the GP to a single dataset collected by ATLAS in proton-proton collisions at √s = 13 TeV with integrated luminosity of 3.6 fb⁻¹ [15]. More important is the characterization of the GP approach from fits to an ensemble of datasets with independent statistical fluctuations and increasing luminosity. We construct toy data samples by smoothing the ATLAS data, scaling it to the desired luminosity, and generating independent samples by adding Poisson noise to each bin.

FIG. 6: Correlation between pairs of mass bins from the GP fit, which shows the largely diagonal nature, with increasing length scale at higher mass.

Below, we present the performance of the GP approach in these datasets under two aspects of hypothesis testing:

• Background-only tests: these studies test whether the GP has sufficient flexibility to describe the typical background spectrum, assuming no signal.

• Signal-plus-background tests: these studies combine a GP background with a specific signal model and test the power of a hypothesis test based on the GP background. This requires that the GP model not be so flexible that it can absorb the localized signal into the background model.
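The toy-generation procedure described above is straightforward; a minimal sketch (with a hypothetical smoothed reference spectrum as input) might look like:

```python
import numpy as np

def generate_toys(smoothed_counts, lumi_scale, n_toys, seed=0):
    """Scale a smoothed reference spectrum to the target luminosity and
    draw independent Poisson counts in each bin for each toy dataset."""
    rng = np.random.default_rng(seed)
    expected = np.asarray(smoothed_counts, dtype=float) * lumi_scale
    return rng.poisson(expected, size=(n_toys, expected.size))
```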
Background only tests
The performance of the GP background model is evaluated in the toy datasets described above. For each toy dataset, we fit the GP, extract the posterior mean from Eq. 11, and evaluate a χ² quantity from ∑_i (y_i − µ(x_i|y))²/µ(x_i|y); note that in this test we do not incorporate the posterior covariance matrix of Eq. 12. Figure 7 shows the ±1σ band about the average µ(x|y) from these toys, with the ATLAS data to guide the eye, and the ad-hoc fit for comparison. The GP based on the kernel in Eq. 10 has more flexibility at high mass, but also provides a superior fit, as measured by the χ²/dof statistic. The number of degrees of freedom for the ad-hoc fit is 3, while the GP has 8 hyperparameters (3 from the mean function and 5 from the kernel). Figure 8 shows the distribution of χ²/dof, which peaks near χ²/dof = 1 for the GP model and is significantly larger than unity for the ad-hoc function.

FIG. 7: Tests of the Gaussian process and three-parameter ad-hoc function in toy data generated from the ATLAS data. Shown are the ±1σ bands about the mean background models, with the ATLAS data overlaid for reference.

FIG. 8: The distribution of χ² per degree of freedom in toy data generated from the ATLAS data at luminosity of 3.6 fb⁻¹. While the goodness of fit for the ad-hoc function degrades with more data, the GP is robust.

A critical test of the Gaussian process model is its robustness with increasing luminosity, where the ad-hoc approach has failed in collider data [15, 17]. In Figure 9, the mean and standard deviation of the χ²/d.o.f. are shown as a function of integrated luminosity in the toy data, demonstrating the robustness of the GP approach.

Background plus signal fits
Adding more flexibility to the background model guarantees a better fit to background-only toys; however, this generally comes at the loss of power in a search for a signal. A background model that is flexible enough to incorporate a signal contribution will have no discovery power.

FIG. 9: Mean and standard deviation of the χ²/d.o.f. measure in toy data generated from ATLAS collisions, as a function of integrated luminosity, for the ad-hoc fit and the Gaussian process.

Here, we test the GP model's performance in the toy data constructed as described earlier, but with signal injected as well. We used a generic Gaussian resonance, and performed tests with various values for the signal mass, width, and amplitude. The hyperparameters of the GP (both for the background mean and kernel functions) are fixed from our fit to the ATLAS dataset; in a realistic application, experimenters could establish the hyperparameters in simulated samples. We only fit the three parameters (amplitude, mass, and width) of the Gaussian signal. For the parametric fit, we fit all six parameters: the three fit-function parameters and three signal parameters. An example of this background-plus-signal fit is shown with an injected 2.5 TeV Gaussian signal shape in the top panel of Fig. 10.

This single example is illustrative and qualitative, but the statistical test for the presence of a signal in observed data relies on the likelihood ratio Λ between the background-only and the signal-plus-background hypotheses. We calculate the likelihood ratio between the two hypotheses in the case of background-only toy data as well as background-plus-signal toy data. This involves the use of Eq. 11 twice, as the posterior mean background prediction is different for the background-only and signal-plus-background fits.
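The corresponding test statistic can be sketched with a binned Poisson likelihood. As a simplification of the fits described above, this sketch scans only the signal amplitude over a grid (the paper also fits mass and width), and `bkg` stands in for the conditional posterior-mean background:

```python
import numpy as np

def poisson_nll(y, nu):
    # Binned Poisson negative log-likelihood, dropping y-only constants.
    return np.sum(nu - y * np.log(nu))

def gaussian_signal(x, amp, mass, width):
    return amp * np.exp(-0.5 * ((x - mass) / width) ** 2)

def two_log_lambda(x, y, bkg, mass, width, amps):
    """-2 log Lambda between the background-only and the
    signal-plus-background hypotheses, profiling the amplitude
    over a grid that includes zero."""
    nll_b = poisson_nll(y, bkg)
    nll_sb = min(poisson_nll(y, bkg + gaussian_signal(x, a, mass, width))
                 for a in amps)
    return 2.0 * (nll_b - nll_sb)
```

Because the amplitude grid includes zero, the statistic is non-negative and vanishes when the data show no preference for a signal.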
This is analogous to the profile likelihood ratio, where there are two fits and the conditional maximum likelihood estimate of the background in the background-only case is generally different from the background estimate in the signal-plus-background fit. The distribution of −2 log Λ in background-only toy data is compared to a χ² distribution with one degree of freedom, for both the ad-hoc function and the GP model.

FIG. 11: Distribution of −2 log Λ in toy data generated from the ATLAS data at √s = 13 TeV with integrated luminosity of 3.6 fb⁻¹, with a false signal injected at m_jj = 2.5 TeV, compared to a χ² distribution with one degree of freedom.

FIG. 12: Mean log likelihood ratio (Λ) between the background-only and the background-plus-signal hypotheses, shown both for the Gaussian process model and the ad-hoc fit, in 1000 toy data sets for varying injected signal mass. Solid and dashed lines indicate the threshold for 3σ significance and for an α-level of 0.05.

The added flexibility of the GP does not degrade the power of the search – in fact, the GP has more sensitivity to a signal at low mass, while the two methods are comparable at high mass. This gain in power is logically possible because the distribution used to generate the true background (which is normally unknown) does not correspond to the ad-hoc function exactly. In our case the GP background model is able to more accurately follow the true background (generated from smeared ATLAS data) than the ad-hoc function.

If a signal is detected, it is also vital to be able to extract the signal parameters. For two choices of signal mass (m_jj = 3 and 5.5 TeV), we performed fits to signal-plus-background toys fitting the mass, width, and signal strength. Figure 13 shows that the extracted signal width and yield are reliable estimators of the true values.

MODELING GENERIC LOCALIZED SIGNALS
The search for specific resonances above a smooth background is only one type of search strategy. More broadly, we hope to be sensitive to localized deviations that take different, potentially unanticipated shapes. For instance, a cascade decay can lead to triangular distributions with a sharp endpoint [41], though helicity correlations can modify this shape in detail. Searches like this require balancing a small number of tests of the background-only model using generic properties of a signal and a larger number of tests of the background-only model using more specific signal properties. A single number-counting search using the full mass range is very generic, but has very little power. Conversely, an enormous scan over specific hypothesized signals individually has more power, but this strategy suffers from a large look-elsewhere effect.

Historically, the search for generic signals over a background model has used algorithms such as
BumpHunter [4]. This approach imposes only minimal structure on the signal: that it is a localized, contiguous excess. This approach can be effective, but it has significant practical drawbacks as it contains many ad-hoc algorithmic elements. For example, in common usage
BumpHunter requires that the excess be localized to at least two bins, and at most half of the bins. This algorithmic characterization of the signal is effective, but it is difficult to interpret and characterize statistically. Secondly, to address the look-elsewhere effect, this approach explicitly accounts for multiple testing and calibrates the distribution of the test statistic by applying the entire procedure to background-only toys. This requires a global background-only prediction, which is complicated when we rely on the data to help fit the background model. In particular, if a signal is present in the data, it is unclear how this impacts the background estimate.

FIG. 13: Extracted signal parameters versus true parameters for injected Gaussian signals with m_jj = 3 TeV (top) and m_jj = 5.5 TeV (bottom). Left: the extracted signal width (σ) for a fixed signal yield. Right: the extracted signal yield (N) for a fixed signal width (σ = 250 GeV). Results are shown for both the Gaussian process background model (green) and the ad-hoc fit function (blue).

Thus far, the main strategy has been an iterative background estimation procedure that defines a signal region and extrapolates the background fit into this region. This approach introduces a coupling of algorithmic decisions with the statistical considerations. Similarly, in the context of a sliding-window background model, the procedure is further complicated by the fact that there is not a single global background prediction, but a set of correlated background predictions specific to the signal window under consideration.

In this section, we consider an alternative approach which uses a GP to model a generic localized signal. In this case, the basic physical requirement of the localized signal can be encoded directly in the kernel of the signal GP, rather than indirectly through ad-hoc algorithmic choices.
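For concreteness, the scan-and-calibrate logic described above can be sketched as follows. This is an illustrative simplification, not the BumpHunter algorithm itself: it uses a Gaussian significance per window in place of BumpHunter's Poisson tail probability, and assumes a fixed, known background prediction:

```python
import numpy as np

def max_window_significance(counts, bkg):
    """Most significant contiguous excess over windows of 2 to n//2
    bins, using a simple Gaussian approximation per window."""
    n = len(counts)
    best = -np.inf
    for width in range(2, n // 2 + 1):
        for start in range(n - width + 1):
            obs = counts[start:start + width].sum()
            exp = bkg[start:start + width].sum()
            best = max(best, (obs - exp) / np.sqrt(exp))
    return best

def global_p_value(y, bkg, n_toys=1000, seed=0):
    # Calibrate the look-elsewhere-corrected p-value with
    # background-only toys, as in the BumpHunter procedure.
    rng = np.random.default_rng(seed)
    t_obs = max_window_significance(y, bkg)
    t_toys = [max_window_significance(rng.poisson(bkg), bkg)
              for _ in range(n_toys)]
    return np.mean(np.asarray(t_toys) >= t_obs)
```

The toy calibration is what makes the resulting p-value global: the maximization over windows is applied identically to the observed data and to every background-only toy.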
This approach allows us to perform signal-plus-background fits where the signal GP absorbs the localized excess and the background GP accounts for the background. The background component from such a fit can provide a global background estimate to be used in the context of a BumpHunter approach even when a signal is present in the data. More importantly, this approach enables hypothesis tests of the background-only model against a weakly specified signal-plus-background model directly based on the likelihood ratio or Bayes factor.

We initiate the study of this approach with a specific signal GP described by the following kernel

Σ(x, x′) = A e^{−(x−x′)²/l²} e^{−((x−m)² + (x′−m)²)/t²} , (14)

which has three main terms. The first term A is an overall amplitude for the signal. The second term is the standard exponential-squared kernel with length scale l. The third term is an envelope that localizes the signal around a mass m with a width t, which is analogous to the mass window.

To demonstrate the flexibility of this kernel, we performed signal-plus-background fits for a variety of signal shapes. Figure 14 shows the background extraction for a piecewise-linear triangular signal and Figure 15 shows a square signal; both have been smeared to model detector jet energy resolution effects. Our studies indicate the GP signal is able to accommodate a wide variety of signal shapes, leaving the background model responsible for the smoother background-only component.

Look-elsewhere effect
This approach does not eliminate the look-elsewhere effect that arises from considering multiple signal hypotheses. Instead of a finite number of search windows or signal hypotheses, the GP describes a continuous family of signal hypotheses. This is not fundamentally different from the look-elsewhere effect that arises from considering a signal model with an unknown mass or width, though it is in a non-parametric setting. While both the GP and the simple example of an unknown mass correspond to an infinite number of signal hypotheses, they are highly correlated and the effective trials factor is finite [42, 43].

FIG. 14: Top, an example fit with both a GP background and signal model to toy data with an injected triangular signal. The panes below show the significance of residuals between the toy data and the background model, and the toy data and the background-and-signal model. Bottom, the residuals between the toy data and the background model, overlaid with the injected signal and the fitted GP signal.

FIG. 15: Top, an example fit with both a GP background and signal model to toy data with an injected square signal. The panes below show the significance of residuals between the toy data and the background model, and the toy data and the background-and-signal model. Bottom, the residuals between the toy data and the background model, overlaid with the injected signal and the fitted GP signal.

Fundamentally, the fact that some parameters of the signal model (e.g. mass and width) have no effect in the background-only case (in statistics jargon, they are not identified under the null [44]) means that the conditions necessary for Wilks's theorem are not satisfied and the log likelihood ratio distribution will not take on the chi-square form. While there are approaches to estimate the asymptotic distribution of the likelihood ratio test statistic for signal models with one or a few parameters [42, 43], we are not aware of an asymptotic theory in the case of GPs.
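The localized-signal kernel of Eq. 14 is simple to write down explicitly. The sketch below follows that equation as reconstructed here; the exact normalizations in the exponents are an assumption (the structure, not the factors, is what matters for illustration):

```python
import numpy as np

def signal_kernel(x1, x2, A, l, m, t):
    """Eq. 14: an exponential-squared kernel with length scale l,
    multiplied by a Gaussian envelope of width t centered at mass m
    that localizes the signal. Exponent factors are assumptions."""
    core = np.exp(-((x1[:, None] - x2[None, :]) ** 2) / l ** 2)
    envelope = np.exp(-((x1[:, None] - m) ** 2
                        + (x2[None, :] - m) ** 2) / t ** 2)
    return A * core * envelope
```

Shrinking l relative to t gives the signal GP more wiggle room inside the envelope, which is the knob behind the trials-factor discussion that follows.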
The lack of an asymptotic theory has little practical impact since even in the case of signal models with a few parameters, the asymptotic distributions are only accurate for very significant excesses, and background-only toys are usually used in the interesting region of 2–5σ.

The effective trials factor will depend on the specific background model and the kernel used for the signal GP. To illustrate this, we examined the log-likelihood ratio distribution for an ensemble of background-only toys similar to what was done in Fig. 11. In this case we fit the mass hyperparameter m in the range 2–5 TeV and fixed the hyperparameter t = 600 GeV, which specifies that the signal is localized roughly to a 600 GeV region. Naively, we expect the trials factor from allowing the mass to float (the range divided by the width) to be about 6. In addition we consider two different values for the length scale: l = t and l = t/3. Smaller values of l allow the signal GP more flexibility within the effective mass resolution, and thus further increase the trials factor. Figure 16 shows the log-likelihood ratio distribution from these tests, confirms our intuition that smaller values of l imply a larger look-elsewhere effect, and demonstrates that it is straightforward to directly calculate the global p-value from background-only toy Monte Carlo.

DISCUSSION
In this paper, we give a broad overview of the potential to use Gaussian Processes to model smooth backgrounds and generic localized signals. The use of GPs for modeling smooth backgrounds addresses the shortcomings of the current approaches based on ad-hoc parametrized functions. In particular, overly rigid parametrized functions lead to problems in high-luminosity searches because the inevitable deviations between the functional form and the true distribution can no longer hide behind statistical fluctuations. In contrast, the GP approach relaxes the rigid structure of a parametrized function, while maintaining the necessary notions of smoothness.

We have outlined the logical continuity between the GP likelihood and the conventional binned and unbinned Poisson likelihoods used in high-energy physics. We have discussed the interpretation of the kernel from first principles through a precise connection to unfolding and investigated the kernels associated with specific experimental and theoretical sources of uncertainty.

We have studied in detail the performance of a simple intuitive kernel designed for dijet resonance searches. Finally, we have introduced a novel strategy for the search for generic localized signal excesses in which the weak specification of signal properties is provided via a GP kernel instead of the ad-hoc algorithmic choices of current approaches.

FIG. 16: Distribution of −2 log Λ in background-only toys; the deviation from the χ² distribution is due to the look-elsewhere effect. The top plot corresponds to a signal GP with l = t/3, which has more flexibility and a larger trials factor than the bottom plot with l = t.

Gaussian Processes have improved the statistical modeling in a number of scientific fields, and these studies demonstrate their potential for high-energy physics. Their robustness in a high-luminosity setting and their ability to model weakly specified signals are both welcome and timely.

ACKNOWLEDGEMENTS
We would like to thank Simone Alioli and Duccio Pappadopulo for providing the parton density function uncertainties for the ATLAS dijet analyses. Duccio Pappadopulo, Lukas Heinrich, and Gilles Louppe provided useful feedback on the manuscript. We also thank Dan Foreman-Mackey for advice and support regarding custom kernels in george. DW and MF are supported by the Department of Energy Office of Science. Additionally, MF is supported via NSF ACI-1450310. KC is supported through NSF ACI-1450310, PHY-1505463, PHY-1205376, and the Moore-Sloan Data Science Environment at NYU.

[1] UA2 Collaboration, Zeitschrift für Physik C Particles and Fields, 17 (1991), ISSN 1431-5858.
[2] G. Aad et al. (ATLAS), Phys. Rev. Lett., 111803 (2012), 1202.1414.
[3] S. Abrahamyan et al. (APEX), Phys. Rev. Lett., 191804 (2011), 1108.2750.
[4] G. Choudalakis, in Proceedings, PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN, Geneva, Switzerland, 17-20 January 2011 (2011), 1101.0390, URL https://inspirehep.net/record/883244/files/arXiv:1101.0390.pdf.
[5] P. D. Sampson and P. Guttorp, Journal of the American Statistical Association, 108 (1992).
[6] W. Meiring, P. Monestiez, P. Sampson, and P. Guttorp, Geostatistics Wollongong, 162 (1997).
[7] D. Foreman-Mackey, B. T. Montet, D. W. Hogg, T. D. Morton, D. Wang, and B. Schölkopf, The Astrophysical Journal, 215 (2015), URL http://stacks.iop.org/0004-637X/806/i=2/a=215.
[8] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, vol. 1 (MIT Press, Cambridge, 2006).
[9] R. Grosse, R. Salakhutdinov, W. Freeman, and J. Tenenbaum, in Uncertainty in Artificial Intelligence (2012).
[10] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani, in Proceedings of the 30th International Conference on Machine Learning (2013).
[11] C. P. Shen et al. (Belle), Phys. Rev. Lett., 112004 (2010), 0912.2383.
[12] T. Aaltonen et al. (CDF), Phys. Rev. D79, 112002 (2009), 0812.4036.
[13] V. M. Abazov et al. (D0), Phys. Rev. Lett., 191803 (2009), 0906.4819.
[14] V. Khachatryan et al. (CMS), Phys. Rev. Lett., 211801 (2010), 1010.0203.
[15] G. Aad et al. (ATLAS), Phys. Lett. B754, 302 (2016), 1512.01530.
[16] S. S. Wilks, Ann. Math. Statist., 60 (1938).
[17] M. Aaboud et al. (ATLAS) (2017), 1703.09127.
[18] P. D. Dauncey, M. Kenzie, N. Wardle, and G. J. Davies, JINST, P04015 (2015), 1408.6865.
[19] K. Cranmer, in Proceedings, 2011 European School of High-Energy Physics (ESHEP 2011): Cheile Gradistei, Romania, September 7-20, 2011 (2015), pp. 267–308, 1503.07622, URL http://inspirehep.net/record/1356277/files/arXiv:1503.07622.pdf.
[20] D. R. Cox, Journal of the Royal Statistical Society, Series B (Methodological), 129 (1955), ISSN 00359246.
[21] S. Golchi and R. Lockhart, arXiv preprint arXiv:1501.02226 (2015).
[22] J. P. Cunningham, K. V. Shenoy, and M. Sahani, in Proceedings of the 25th International Conference on Machine Learning (ACM, New York, NY, USA, 2008), ICML '08, pp. 192–199, ISBN 978-1-60558-205-4, URL http://doi.acm.org/10.1145/1390156.1390181.
[23] A. M. Sirunyan et al. (CMS), Phys. Rev. D96, 032002 (2017), 1704.06142.
[24] G. Pillonetto, F. Dinuzzo, T. Chen, G. D. Nicolao, and L. Ljung, Automatica, 657 (2014), ISSN 0005-1098.
[25] D. Higdon, J. Swall, and J. Kern, Bayesian Statistics, 761 (1999).
[26] C. J. Paciorek and M. J. Schervish, Environmetrics, 483 (2006).
[27] M. Aaboud et al. (ATLAS), Eur. Phys. J. C77, 26 (2017), 1607.08842.
[28] G. Aad et al. (ATLAS), JHEP, 059 (2014), 1312.3524.
[29] S. Alioli, Private Communication (2017).
[30] S. Alioli, M. Farina, D. Pappadopulo, and J. T. Ruderman, JHEP, 097 (2017), 1706.03068.
[31] S. Alioli, P. Nason, C. Oleari, and E. Re, JHEP, 043 (2010), 1002.2581.
[32] S. Alioli, K. Hamilton, P. Nason, C. Oleari, and E. Re, JHEP, 081 (2011), 1012.3380.
[33] R. D. Ball et al. (NNPDF), JHEP, 040 (2015), 1410.8849.
[34] D. Foreman-Mackey, D. W. Hogg, D. Lang, and J. Goodman, Publications of the Astronomical Society of the Pacific, 306 (2013), URL http://stacks.iop.org/1538-3873/125/i=925/a=306.
[35] J. Goodman and J. Weare, Communications in Applied Mathematics and Computational Science, 65 (2010).
[36] M. N. Gibbs, Ph.D. thesis, University of Cambridge, Cambridge, England (1998).
[37] G. Cowan, K. Cranmer, E. Gross, and O. Vitells, Eur. Phys. J. C71, 1554 (2011), [Erratum: Eur. Phys. J. C73, 2501 (2013)], 1007.1727.
[38] F. James and M. Roos, Comput. Phys. Commun., 343 (1975).
[39] S. Ambikasaran, D. Foreman-Mackey, L. Greengard, D. W. Hogg, and M. O'Neil (2014), 1403.6015.
[40] D. Foreman-Mackey, E. Agol, R. Angus, and S. Ambikasaran, arXiv (2017), URL https://arxiv.org/abs/1703.09710.
[41] B. C. Allanach et al., Eur. Phys. J. C25, 113 (2002), hep-ph/0202233.
[42] E. Gross and O. Vitells, Eur. Phys. J. C70, 525 (2010), 1005.1891.
[43] O. Vitells and E. Gross, Astropart. Phys., 230 (2011), 1105.4355.
[44] R. B. Davies, Biometrika, 33 (1987), URL http://dx.doi.org/10.1093/biomet/74.1.33.