Modeling Smooth Backgrounds and Generic Localized Signals with Gaussian Processes
Meghan Frate, Kyle Cranmer, Saarik Kalia, Alexander Vandenberg-Rodes, Daniel Whiteson
Department of Physics and Astronomy, University of California, Irvine, CA 92697
Department of Physics and Astronomy, New York University, New York, NY 10003
Department of Physics, MIT, Boston, MA
Obsidian Security Inc., Newport Beach, CA 92660
We describe a procedure for constructing a model of a smooth data spectrum using Gaussian processes rather than the historical parametric description. This approach considers a fuller space of possible functions, is robust at increasing luminosity, and allows us to incorporate our understanding of the underlying physics. We demonstrate the application of this approach to modeling the background to searches for dijet resonances at the Large Hadron Collider and describe how the approach can be used in the search for generic localized signals.
INTRODUCTION
The search for new particles and interactions is a central focus of the research program of the Large Hadron Collider (LHC). Typically, such a search is cast in the language of a hypothesis test of a background model predicted by the standard model of particle physics. In some cases, the alternative hypothesis is specified by a particular theory or class of theories, in which case a practical task of the experimentalist is to identify a good discriminating variable and to construct models of the background-only and signal-plus-background hypotheses. This approach includes searches for resonances on a smooth background spectrum as well as non-resonant signals with well-specified signal characteristics. In addition, some searches for new physics aim at model-independence by making only weak assumptions about the deviation from the background. For instance, such a search might assume only that the signal manifests itself as a localized deviation without completely specifying the distribution. A critical element in both approaches is a proper description of the smooth background spectrum.

In an ideal case, the background intensity f(x) as a function of observable x would be known exactly. In this scenario, we can derive an expected total number of events ν ≡ ∫ f(x) dx and a probability density p(x) ≡ f(x)/ν, which leads to the familiar unbinned extended maximum likelihood for a dataset D = {x_1, ..., x_N} with N observed events

p(D) = Pois(N|ν) ∏_{i=1}^{N} p(x_i).   (1)

In realistic applications, we replace the ideal background prediction f(x) with a more flexible background model. The traditional approach for searches with smooth background shapes [1–3] is to introduce a family of functions f(x|θ) with some explicit functional form parametrized by θ. The choice of functional form to describe the background is central to the new particle search, yet functional forms derived from first principles are almost never available.
Instead, the typical approach is to select an ad-hoc parametric function with little-to-no grounding in the physics involved, but which fits reasonably well in collider data and simulated samples. As the luminosity of the collected datasets grows, however, the discrepancies between the ad-hoc model and the true physical process are revealed. As the rigid form and limited flexibility of the parametric functions fail to accommodate the observed spectra, continual addition of new ad-hoc terms is required. In the coming period of high-luminosity collisions at the Large Hadron Collider, this issue will become even more acute.

A second issue with the traditional approach is an unsatisfying bifurcation of the search strategy between hypothesis testing for specific signal models and the more model-independent search for localized deviations from a smooth background. In model-specific hypothesis testing, a likelihood ratio test can be applied to weigh the background-only hypothesis against the signal-plus-background hypotheses, while the more model-independent searches often use a more elaborate procedure like BumpHunter [4]. The BumpHunter algorithm considers a large family of search windows and then performs a number-counting test in each. Both approaches face a look-elsewhere effect from multiple testing, but the BumpHunter [4] approach also faces complications from the ad-hoc algorithmic choices and multiple correlated background estimates.

In this paper, we propose a new strategy for modeling smooth backgrounds, one that uses Gaussian Processes (GPs). This strategy relaxes the overly-rigid constraints of ad-hoc functional forms, which makes the background model more robust to increasing luminosity. Moreover, GPs can be used to model generic localized signals, which allows for a unified statistical treatment of model-specific and more model-independent searches.

GPs are remarkably flexible and have been used to improve modeling for geostatistics, climate, exoplanets, and machine learning [5–10]. It is curious that GPs have not yet been widely used in high-energy physics. We suspect that this is due to a few barriers, which we aim to address. First, the GP formalism is somewhat foreign to high-energy physicists, so we detail how we can translate our understanding of the underlying physics into the GP formalism. Secondly, a large portion of the GP literature is focused on regression problems, instead of modeling probability distributions. Thus, we make an effort to connect the traditional extended likelihood treatment in high-energy physics to the GP approach. Lastly, there are subtle issues related to the statistical interpretation and procedures of GPs, so in addition to the more pedagogical goals of the paper we also incorporate some more technical statistical considerations for completeness. These technicalities are not essential to the main point, and should not deter the interested reader.

We use as a test case the LHC search for resonances decaying to two hadronic jets, which is one of the most powerful searches for new physics at the LHC and examines a falling spectrum over a broad mass range, providing a vigorous test of the method. In the following, we describe the historical approach, give the details of the GP approach, and present studies that document its performance.
PARAMETRIZED FUNCTION APPROACH

Dijet Resonance Search Example
Searching for a resonance above a smooth background is a powerful technique in the hunt for new particles and interactions. This classic approach has been used in many experiments at different facilities, spanning several decades [1, 11–14]. In particular, the search for dijets at the LHC is a simple but effective search for new physics. It analyzes the invariant mass spectrum of pairs of high-p_T jets which are found back-to-back in the detector. The dominant background is due to multi-jet production via non-resonant hadronic interactions, which historically has been difficult to model accurately using simulation, from the bulk to the high-mass tail. For this reason, a functional approach with three parameters was introduced by the UA2 collaboration [1]. In the intervening thirty years, this form has needed the addition of further terms in order to be able to adequately describe observed spectra, leading to the most recent form used by the ATLAS collaboration [15] (previously θ₄ = 0):

f(x|θ) = θ₁ (1 − x)^{θ₂} x^{θ₃ + θ₄ log x},   (2)

where x = m_jj/√s. The selection of the particular functional form has been performed by comparing a handful of choices of varying complexity, selecting the simplest that satisfies the Wilks test [16], which seeks "to determine if the background estimation would be significantly improved by an additional degree of freedom" [15]. The structure of this parametric function and the selection of the terms is nearly entirely pragmatic, and has little basis in knowledge of the underlying physics processes. An example fit of the three-parameter version to the ATLAS 2015 dijet data at √s = 13 TeV is shown in Fig. 1. In some cases, experimental papers attempt to assess a systematic uncertainty on the result due to the arbitrary structure of this function [15].
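As a concrete illustration of Eq. 2, the ad-hoc function is straightforward to evaluate numerically. The following is a minimal NumPy sketch; the parameter values are illustrative only (not fitted values from any analysis), chosen simply to produce a steeply falling spectrum:

```python
import numpy as np

def dijet_fit_function(x, theta1, theta2, theta3, theta4=0.0):
    """Ad-hoc dijet background shape of Eq. 2:
    f(x|theta) = theta1 * (1-x)^theta2 * x^(theta3 + theta4*log(x)),
    with x = m_jj / sqrt(s) in (0, 1). theta4 = 0 gives the
    three-parameter version."""
    x = np.asarray(x, dtype=float)
    return theta1 * (1.0 - x) ** theta2 * x ** (theta3 + theta4 * np.log(x))

# Illustrative (hypothetical) parameters: a smooth, steeply falling spectrum.
x = np.linspace(0.1, 0.9, 9)
f = dijet_fit_function(x, theta1=1.0, theta2=10.0, theta3=-4.0)
```

With θ₂ > 0 and θ₃ < 0 the function is positive and monotonically falling over the fit range, mimicking the qualitative shape of the dijet spectrum.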
An alternative approach is to abandon the full-spectrum fit entirely and fit only a narrow sliding window [3, 17], which tolerates a simpler functional form, but sacrifices the power of the full spectrum and complicates the global statistical interpretation.

Incorporating auxiliary information
In the dijet case, the parameters θ are only constrained by the data D; but more generally some auxiliary information a (e.g. calibration measurements, theoretical considerations, etc.) may be used to constrain the parameters through an additional constraint term, leading to the likelihood

p(D, a|θ) = Pois(N|ν(θ)) ∏_{e=1}^{N} p(x_e|θ) · p_constr.(a|θ).   (3)

In addition to constraint terms in the likelihood corresponding to auxiliary measurements a that have a clear frequentist interpretation, it is common to incorporate terms in the likelihood that reflect prior knowledge. For instance, uncertainties in the theoretical cross section due to missing higher-order corrections typically are not treated as unconstrained parameters; instead, the size of these uncertainties is estimated by varying the renormalization and factorization scales. Moreover, when very flexible models are used (e.g. in unfolding problems) it is often desirable to include penalty terms that regularize the problem in order to avoid unphysical solutions. Lastly, consideration of multiple functional forms can fit into this formalism where θ takes on discrete values indexing the model choice [18]. For a more pedagogical discussion of constraint terms, see Ref. [19]. (In the statistics literature, probability models of the form of Eq. 1 are referred to as a Poisson point process with an intensity f(x).)

FIG. 1: Invariant mass of dijet pairs reported by ATLAS [15] in proton-proton collisions at √s = 13 TeV with integrated luminosity of 3.6 fb⁻¹. The blue line is a fit using the first three terms of Eq. 2. The bottom pane shows the significance of the residual between the data and the fit.

GAUSSIAN PROCESS APPROACH
Conceptually, GPs provide a generalization of the intensity f(x) that is not tied to a particular functional form with a fixed number of parameters. Instead, the intensity at x is modeled as a Gaussian, and the relationship between the intensity at different points is encoded in the covariance kernel Σ(x, x'). In this way, GPs allow for the description of a much broader set of functions (see Fig. 2) and provide a natural way to incorporate auxiliary information and prior knowledge.

While the exact treatment of Poisson statistical fluctuations together with GPs is somewhat cumbersome, in situations with many events we can simplify the situation by working with a binned likelihood where the Poisson counts in each bin y_i are accurately approximated with a Gaussian distribution. In that case, the likelihood of Eq. 3 can be approximated as

p(y, a|θ) = ∏_{i=1}^{n} Pois(y_i | f̄(x_i|θ)) · p_constr.(a|θ)   (4)
          ≈ Gaus(y | f̄(x|θ), σ²_y) · Gaus(f̄(x|θ) | µ, Σ),

where x are the bin centers, y are the observed bin counts, and f̄(x|θ) are the expected bin counts from averaging f(x|θ) within the corresponding bin. The first term of the Gaussian approximation in Eq. 4 is the per-bin Poisson statistical fluctuation, while the second term is an n × n multivariate Gaussian distribution that approximates the effect of p_constr.(a|θ) propagated to the expected bin counts f̄(x|θ). (From the point of view of the statistics literature, the combination of a Poisson point process with a Gaussian Process as the intensity is known as a Cox process, named after the statistician David Cox, who first published the model in 1955 [20]. In order to enforce positivity in the intensity, it is common to model the log of the intensity with a Gaussian process, as was done in Ref. [21]. Inference in such a doubly stochastic process is technically more cumbersome than the approach described above [22].)

The second term reveals how an ad-hoc functional form parametrized by θ ∈ R^d can be overly rigid. In the n-dimensional space of bin counts, the parametrized function f̄(x|θ) only maps out a d-dimensional subspace (e.g. the red markers in Fig. 2).

Gaussian Processes provide a natural way to expand around the overly restricted parametrized model and fill in the full space of possibilities. A Gaussian Process is defined as "a collection of random variables, any finite number of which have a joint Gaussian distribution" [8, 21]. Instead of providing a parametric form, we directly model the mean (µ) and covariance (Σ) functions of the Gaussian process, defined as

µ(x) = E[f(x)]   (5)
Σ(x, x') = E[(f(x) − µ(x))(f(x') − µ(x'))].   (6)

This covariance kernel Σ is then augmented with the diagonal (uncorrelated) statistical component σ²(x) I to provide the likelihood for the observed bin counts y.

GPs have rarely been used in high-energy physics, with few exceptions including a top-quark mass measurement by the CMS experiment [23] and a paper by Golchi & Lockhart focusing on other statistical aspects in the search for the Higgs boson [21]. Those analyses took advantage of the nice properties of GPs, but did not emphasize or explore how physics considerations come into play in the choice of the kernel. An attractive feature of GPs for high-energy physics is that the kernel directly encodes understanding of the underlying physics, which is manifest as covariance among the bin counts. For example, our understanding of the mass resolution, the parton density function uncertainties, the jet energy scale uncertainties, and the expected smoothness can be incorporated into the kernel.

An advantage of interpreting the Gaussian process through the relationship of Eqs. 4–6 is that it makes clear how to connect the GP formalism with the existing considerations around sources of uncertainty: those associated to auxiliary measurements with a clear frequentist interpretation, those associated to theoretical considerations that cannot be identified with a random variable with frequentist probability, and those terms that are introduced for the sake of regularization.

In the following sections, we outline the connection to unfolding, study the typical covariance structure from physical processes that affect the shape of the background spectrum, construct kernel and mean functions, and describe the procedure to fit the GP to the observed spectrum.

FIG. 2: Schematic of the relationship between an ad-hoc function and the GP. An example toy dataset is shown (left) with samples from the posterior for an ad-hoc 1-parameter function (red) and a GP (green). Each posterior sample is an entire curve f(x), which corresponds to a particular point in the (center) plane of f(x_A) vs. f(x_B). The red dots for the ad-hoc 1-parameter function trace out a 1-dimensional curve, which reveals how the function is overly-rigid. In contrast, the green dots from the GP relax the assumptions and fill a correlated multivariate Gaussian (with covariance indicated by the black ellipse). The covariance kernel Σ(x, x') for the GP is shown (right) with Σ(x_A, x_B) corresponding to the black ellipse of the center panel.

Connection to unfolding
The process of constructing a GP kernel, which encodes our physical requirements for the background model, is closely related to the more familiar topic of imposing physical requirements when unfolding differential cross section distributions.

In unfolding, we have an observed set of bin counts for an observable x, which we also assume to arise from a Poisson point process as in Eq. 1. In contrast to searches, the goal is to estimate the differential cross section for a theoretical quantity z, removing dependence on experimental efficiency, acceptance, and detector effects. The relationship between the target theoretical intensity f(z) and the intensity for the observable f(x) is given by

f(x) = ∫ f(z) W(x|z) dz,   (7)

where W(x|z) is a folding matrix or transfer function that encodes smearing effects from detector resolution. Ideally, the experimentalist makes no assumptions about the theoretical intensity f(z), and infers f(z) through the unfolding process. As has been well studied, this type of inverse problem is ill-posed in the technical sense that the solution f(z) is sensitive to small changes to f(x) or the observed data. The unregularized maximum likelihood solution to the unfolding problem often exhibits unphysical oscillations in f(z). Physical considerations motivate additional regularization or penalty terms to the likelihood that are not motivated by auxiliary measurements, but which lead to solutions that behave better for the f(z) we consider relevant.

Of particular interest for particle physics is to revisit Eq. 7 when f(z) is itself a GP. In that case, Eq. 7 can be seen as applying the linear operator ∫ dz W(x|z) to the GP f(z), which gives rise to another GP through a process convolution [25, 26], resulting in

Σ(x, x') = ∫∫ dz dz' Σ(z, z') W(x|z) W(x'|z').   (8)

As expected, even in the extreme case where different bins of the theoretical distribution are allowed to be totally uncorrelated, Σ(z, z') ∝ δ(z − z'), the finite resolution of the detector will introduce correlations in x via W. For example, if W(x|z) is a Gaussian smearing with resolution σ_x, then the resulting GP is an exponential squared kernel (see Eq. 9) with length scale l = √2 σ_x.

Physically motivated kernels
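As a first example of a physically motivated kernel, the Gaussian-smearing case of Eq. 8 can be verified numerically. This is a sketch with toy grids and an arbitrary illustrative σ_x: smearing a totally uncorrelated intensity, Σ(z, z') ∝ δ(z − z'), yields an exponential-squared kernel with length scale l = √2 σ_x:

```python
import numpy as np

sigma_x = 0.5
z = np.linspace(-10.0, 10.0, 4001)          # dense integration grid for the true variable z
dz = z[1] - z[0]
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # a few observed-variable points

def W(xi, z, sigma):
    """Gaussian smearing kernel W(x|z) with resolution sigma."""
    return np.exp(-0.5 * (xi - z) ** 2 / sigma ** 2) / (sigma * np.sqrt(2 * np.pi))

# Eq. 8 with Sigma(z, z') = delta(z - z') collapses one integral:
# Sigma(x, x') = integral dz W(x|z) W(x'|z), evaluated here on the grid.
Wz = np.array([W(xi, z, sigma_x) for xi in x])
K_numeric = (Wz @ Wz.T) * dz

# Analytic expectation: exponential-squared kernel with l = sqrt(2) * sigma_x.
l = np.sqrt(2.0) * sigma_x
amplitude = 1.0 / (2.0 * sigma_x * np.sqrt(np.pi))
K_analytic = amplitude * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l ** 2)
```

The numerical and analytic kernels agree to numerical precision, confirming that detector resolution alone induces an exponential-squared covariance among observed bins.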
Next, we motivate contributions to the kernel that are clearly grounded in experimental and theoretical forms of uncertainty cast in terms of Eq. 7. (There is a deep connection between Tikhonov regularization and other forms of regularization used in unfolding and the kernels of Gaussian Processes, which is formalized in the language of Reproducing Kernel Hilbert Spaces [8, 24]. A detailed discussion of this connection is beyond the scope of this paper; however, we should anticipate a contribution to the Gaussian Process kernel that is connected to this loosely defined notion of smoothness in the underlying physics.) In the ideal case, both the theoretical prediction for f(z) and the efficiencies, acceptance, and experimental resolution encoded in W(x|z) would be well specified. In that ideal case, f(x) is totally fixed and the GP description of the intensity would collapse to a single point in function space.

If, however, the detector response were itself uncertain, then this would propagate into the space of intensities. Take, for example, the jet energy scale (JES) uncertainty. As described in Refs. [17, 27], the ATLAS JES uncertainty is only a few percent for jets with p_T of around 1 TeV where data are plentiful, while the limited size of observed examples for higher-p_T jets requires an alternate approach to estimating the JES. The resulting JES uncertainty therefore grows rapidly with m_jj and has an impact of at most 15% [27]. To illustrate the covariance due to the JES uncertainty, consider a simplified two-parameter model for the impact on the m_jj distribution: J(z, θ) = 1 + 15% θ₁ z + 5% θ₂ (1 − z), where z is the true dijet invariant mass and z_max = 7 TeV. We use the best-fit 3-parameter function as a proxy for f(z) and fold in the smearing W(x|z, θ) = Gaus(x | z J(z/z_max, θ), σ_x), where σ_x = 2% z is the dijet invariant mass resolution [17]. By assuming a uniform prior and an appropriate scaling for θ, we sample from the posterior Gaus(θ₁|0, 1) Gaus(θ₂|0, 1) and propagate the uncertainty in θ through to the predicted bin counts f̄(x|θ) as in Eq. 4. This allows us to explicitly build the covariance matrix Σ using the simulation, as shown in Fig. 3. As expected, we see a roughly block-diagonal structure defined by low and high mass regions.

Similarly, we can study the uncertainty in the theoretical distribution arising from uncertainty in the parton distribution functions (PDFs), such as that in the ATLAS 7 TeV analysis [28, 29]. Figure 3 corresponds to the PDF uncertainties described in Ref. [30] for NLO calculations from POWHEG-BOX [31, 32] using the S no-jet PDF sets provided in NNPDF3.0 [33]. In this case, sum rules in the PDFs lead to anti-correlation between low- and high-mass regions.

With a satisfying GP description of the theoretical distribution and knowledge of the smearing W(x|z), we can arrive at a well-motivated kernel for the observed spectrum via Eq. 8.

Implicit covariance in current background models
It is also instructive to examine the effective covariance implied by current approaches: the ad-hoc fit and the sliding window fit. Again we study this through the relationship of Eqs. 4–6. In the case of the global fit to the ad-hoc function of Eq. 2, we construct the posterior for θ given the ATLAS data shown in Fig. 1 and a uniform prior on θ. We sample the posterior using emcee [34], a Python implementation of the affine-invariant ensemble sampler for Markov Chain Monte Carlo (MCMC) proposed by Goodman & Weare [35]. From the posterior samples θ ∼ p(θ|y) we build the covariance matrix Σ shown in Fig. 4. The global structure of the covariance resembles that arising from PDF uncertainties, but recall that the model only sweeps out a 3-dimensional subspace in the much larger space of functions with this same covariance.

FIG. 3: Correlation coefficients between pairs of mass bins due to variations in the jet energy scale (left) or parton distribution functions (right). These demonstrate the broad but smoothly varying influence of these effects on the mass spectrum.

FIG. 4: Correlation coefficients between pairs of mass bins from many samples of the global ad-hoc fit (left) and the sliding window fit (right). The plot of the global fit reveals non-physical pivot points where the ad-hoc function is less flexible. The sliding window fit has a strictly limited correlation, by construction.

In the case of the sliding window approach, we also estimate the covariance from a table of f(x_i) values, but instead of posterior samples, we perform a single fit for each of 50 mass windows. For the k-th window, if the bin is outside the window we record 0; otherwise we record f(x_i|θ̂_k), where θ̂_k is the best-fit value of θ for the fit restricted to the window. The covariance is calculated from these recorded values. This method should create a covariance structure which is limited to the diagonal band, as each fit includes only a small portion of the distribution; indeed this is what we see in Fig. 4. (The slight negative correlation outside of this band is an artifact of recording 0 for the value of the intensity, which is less than the mean.) While the sliding window fit approach provides more flexibility and better scaling to high luminosities than a global fit, the piecewise approach to background modeling complicates the downstream statistical analysis.
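The construction of a bin-count covariance from posterior samples can be sketched in a few lines. This is a minimal NumPy illustration with a hypothetical 1-parameter toy model and Gaussian stand-in samples (rather than an actual emcee chain); because the model has a single parameter, the resulting correlation matrix is nearly rank-1, echoing the "d-dimensional subspace" point above:

```python
import numpy as np

def covariance_from_samples(f_of_theta, theta_samples, x):
    """Evaluate f(x|theta) for each posterior sample and form the
    sample covariance of the predicted bin contents."""
    curves = np.array([f_of_theta(x, th) for th in theta_samples])  # (n_samples, n_bins)
    return np.cov(curves, rowvar=False)

# Stand-in for MCMC output: Gaussian samples of a 1-parameter toy model.
rng = np.random.default_rng(1)
theta_samples = rng.normal(loc=2.0, scale=0.1, size=(500, 1))
x = np.linspace(0.1, 0.9, 20)
f = lambda x, th: (1.0 - x) ** th[0]          # hypothetical toy intensity
Sigma = covariance_from_samples(f, theta_samples, x)
corr = Sigma / np.sqrt(np.outer(np.diag(Sigma), np.diag(Sigma)))
```

Because every bin responds to the single parameter in the same direction, the off-diagonal correlations are close to +1 everywhere; the richer pivot structure of Fig. 4 arises only once several parameters compete.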
Parametrized kernels
In practice, one rarely works with a covariance matrix explicitly calculated from samples, but instead parametrizes the kernel Σ(x, x'). A common parametrized kernel is the exponential-squared kernel:

Σ(x, x') = A exp( −(x − x')² / (2 l²) ).   (9)

This kernel describes a maximal covariance with amplitude A that falls off with a length scale l. If |x − x'| ≫ l, then the covariance is very small. The parameters of the kernel are traditionally referred to as hyperparameters, and the values of the hyperparameters can be fit either a priori (e.g. by considering simulated or control samples), or to the data simultaneously with the hypothesis test itself.

In addition to the ability to construct kernels from first principles as in Eq. 8, there are also rules for various types of operations on GPs [9, 10]. For example, there are rules for describing the mean and covariance of a new GP produced by combining two GPs through summation and multiplication. These rules effectively form a grammar that allows us to compose complex kernels through a narrative.

We can also build parametrized kernels that capture essential physics in a direct, intuitive way. This approach sits somewhere between the first-principles approach and the use of ad-hoc functions; however, unlike most ad-hoc functional forms, we will be able to interpret the terms more directly. For instance, in the dijet case we might argue that the mass resolution is one of the dominant contributions to the covariance, and start with an exponential squared kernel. We would anticipate the length scale to be larger than √2 σ_x from mass resolution effects, to reflect the intrinsic smoothness of the underlying true distribution. In practice, the mass resolution is not constant, so we may allow for the length scale (incorporating both mass resolution and intrinsic smoothness) to have a linear dependence l(x) = bx + c.
This varying lengthscale can be accommodated by the Gibbs kernel [8, 36].In addition, instead of a constant amplitude for the varia-tion, we take the amplitude of fluctuations to be a fallingexponential. These considerations motivate the following of recording 0 for the value of the intensity, which is less thanthe mean. parametrized kernelΣ( x, x (cid:48) ) = Ae d − ( x + x (cid:48) )2 a (cid:113) l ( x ) l ( x (cid:48) ) l ( x ) + l ( x (cid:48) ) e − ( x − x (cid:48) )2 l ( x )2+ l ( x (cid:48) )2 (10)The hyperparameters of this kernel ( A, a, b, c, d ) will bedetermined during the fit to the data, described below.We can identify the terms associated to the narrativethat produced this kernel, which we argue makes it lessad-hoc than the functional form of Eq. 2, and the GPassociated to this kernel is much more flexible. Below wewill compare the performance of this kernel to the ad-hocfunction in the context of a dijet resonance search.
Mean function
It is common to use µ(x) = 0 for the mean of a GP, partially because once conditioned on the observations y, the posterior mean usually adapts to this offset remarkably well if the spacing between the observed x_i is small compared to the length scale. However, if we have a reasonable estimate of the mean, there is no reason not to use it. Thus we use the three-parameter fit function as the mean µ(x) of the GP, which contributes an additional three hyperparameters (θ₁, θ₂, θ₃). The results are very robust to the choice of the mean; the key to the performance is really in the choice of the kernel.

In the case of signal-plus-background hypothesis testing with a known signal model f_s(x|θ_s), the mean function also includes the signal contribution. Due to the linear relationship between y and µ(x) in the Gaussian, this is numerically equivalent to subtracting the signal expectation from the observation and modeling the residuals with the background-only GP.

Incorporating GPs into the statistical procedure
There are subtle issues associated to the Bayesian and frequentist interpretations of the GP. GPs are most commonly presented in a Bayesian formalism where the mean and covariance kernel are interpreted as a prior distribution over the space of functions. Then, given the observations y, we arrive at an updated GP that represents a posterior distribution over the space of functions. Because both the prior and the posterior are Gaussian, there are explicit formulae for the posterior mean and covariance that rely on basic linear algebra [8]. In particular,

µ(x*|y) = µ(x*) + Σ(x*, x)[Σ(x, x) + σ²(x) I]⁻¹ (y − µ(x))   (11)

and

Σ(x*, x'*|y) = Σ(x*, x'*) − Σ(x*, x)[Σ(x, x) + σ²(x) I]⁻¹ Σ(x, x'*),   (12)

where x* are the values where the posterior GP is being evaluated and x are the values being conditioned on. In a typical binned analysis, x and x* would both be the bin centers. In addition, fitting the hyperparameters of a GP is usually based on maximizing the marginal likelihood, which has the explicit form

log L = −(1/2) log|Σ| − (1/2) (y − µ(x))ᵀ Σ⁻¹ (y − µ(x)) − (n/2) log 2π.   (13)

In order to take advantage of the closed-form solutions above and fast linear algebra implementations, the statistical fluctuations are typically approximated as σ²(x) = y instead of the more accurate Poisson mean.

In principle, one can revisit the logic of a particular GP kernel to trace back the terms that came from auxiliary measurements with a clear frequentist interpretation, and carry out the equivalent profile likelihood treatment. Given the correspondence between profiling and marginalization in the Gaussian case, this should lead to equivalent results with differences of at most constant factors; however, those factors may depend on the hyperparameters. The philosophical and practical use of the GP's profile likelihood will also be complicated by the contributions to the kernel from theoretical considerations that lack a frequentist interpretation and contributions motivated by regularization considerations. Thus, we leave this as a direction for future work.

In practice there are roughly four ways the GP can be integrated into the high-energy physics search strategy. The first is to take on a fully Bayesian approach using the GP as the intensity of a Poisson point process, which forms a doubly stochastic Cox process [20]. Hypothesis testing and limits can then be based on Bayes factors. This approach was considered in Ref. [21]; however, it is computationally very intensive and difficult to combine with the bulk of the likelihood-based search strategies. The second is to approximate the Poisson fluctuations as Gaussian and use the marginal likelihood of Eq. 13 directly in the test statistic, which is more computationally efficient.
One can still use this marginal likelihood as a test statistic in the frequentist sense, but it requires using ensemble tests to calibrate the p-values, as one cannot rely on the standard asymptotic formulae [37].

The third approach is a two-step process where one first calculates the posterior mean of Eq. 11 and then uses µ(x|y) as the intensity for the Poisson likelihood of Eq. 1 or its corresponding binned version. Similar to the profile likelihood approach, the background model can be conditioned on the signal hypothesis being tested, since the signal's parameters are present in the prior mean function. Computationally this approach is similar to the one above, with the additional cost of evaluating the Poisson likelihood and the practical considerations of managing the two-step process.

The last approach is also a two-step process where one first calculates both the posterior mean of Eq. 11 and the posterior covariance of Eq. 12 and then uses those in a downstream least-squares analysis. For example, this approach is convenient for goodness-of-fit tests.

FIG. 5: Invariant mass of dijet pairs reported by ATLAS [15] in proton-proton collisions at √s = 13 TeV with integrated luminosity of 3.6 fb⁻¹. The green line shows the resulting Gaussian process background model. The bottom pane shows the significance of the residual between the data and the GP model.

In the studies below we only use the marginal likelihood for fitting the hyperparameters, which we optimize using Minuit [38]. In later studies we use the third approach of treating the posterior mean (conditioned on the signal model parameters) as the Poisson intensity for the background. We use the software package george [39] (see also celerite [40]) for GP regression, which we have extended by implementing custom kernels.

The posterior mean and posterior correlation matrix from fitting the GP to the ATLAS dataset are shown in Figs. 5 and 6.
By visual inspection, the mean function fits the data well and the correlation is constrained near the diagonal, with the off-diagonal elements dying off quickly. This structure reflects the locality of the GP, where nearby bins are closely connected but bins far from each other in mass are uncorrelated.
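The posterior mean (Eq. 11) and posterior covariance (Eq. 12) used above have simple closed forms. The sketch below is a minimal numpy version, with a generic exponential-squared kernel standing in for the paper's custom dijet kernel and the Gaussian approximation σ²(x_i) = y_i for the noise:

```python
import numpy as np

def exp_squared_kernel(x1, x2, amp=1.0, scale=1.0):
    # Generic stand-in kernel; the paper uses a custom dijet kernel.
    d = x1[:, None] - x2[None, :]
    return amp * np.exp(-0.5 * (d / scale) ** 2)

def gp_posterior(x, y, prior_mean, x_star, **kw):
    """Posterior mean (Eq. 11) and covariance (Eq. 12) given observed
    bin counts y, with Poisson noise approximated as sigma_i^2 = y_i."""
    K = exp_squared_kernel(x, x, **kw) + np.diag(y)
    Ks = exp_squared_kernel(x, x_star, **kw)
    Kss = exp_squared_kernel(x_star, x_star, **kw)
    alpha = np.linalg.solve(K, y - prior_mean(x))
    mu = prior_mean(x_star) + Ks.T @ alpha
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, cov
```

With george, the equivalent computation is `gp.compute(x, yerr=np.sqrt(y))` followed by `gp.predict(y, x_star, return_cov=True)`.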
PERFORMANCE STUDIES
Figures 5 and 6 demonstrate the fit (posterior mean) of the GP to a single dataset collected by ATLAS in proton-proton collisions at √s = 13 TeV with integrated luminosity of 3.6 fb⁻¹ [15]. More important is the characterization of the GP approach from fits to an ensemble of datasets with independent statistical fluctuations and increasing luminosity. We construct toy data samples by smoothing the ATLAS data, scaling it to the desired luminosity, and generating independent samples by adding Poisson noise to each bin.

FIG. 6: Correlation between pairs of mass bins from the GP fit, which shows the largely diagonal nature, with increasing length scale at higher mass.

Below, we present the performance of the GP approach in these datasets under two aspects of hypothesis testing:

• Background-only tests: these studies test whether the GP has sufficient flexibility to describe the typical background spectrum, assuming no signal.

• Signal-plus-background tests: these studies combine a GP background with a specific signal model and test the power of a hypothesis test based on the GP background. This requires that the GP model not be so flexible that it can absorb the localized signal into the background model.
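The toy-generation procedure described above is straightforward; a minimal sketch (with a hypothetical smoothed reference spectrum as input) might look like:

```python
import numpy as np

def generate_toys(smoothed_counts, lumi_scale, n_toys, seed=0):
    """Scale a smoothed reference spectrum to the target luminosity and
    draw independent Poisson counts in each bin for each toy dataset."""
    rng = np.random.default_rng(seed)
    expected = np.asarray(smoothed_counts, dtype=float) * lumi_scale
    return rng.poisson(expected, size=(n_toys, expected.size))
```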
Background only tests
The performance of the GP background model is evaluated in the toy datasets described above. For each toy dataset, we fit the GP, extract the posterior mean from Eq. 11, and evaluate a χ² quantity from ∑_i (y_i − µ(x_i|y))²/µ(x_i|y); note that in this test we do not incorporate the posterior covariance matrix of Eq. 12. Figure 7 shows the ±1σ band about the average µ(x|y) from these toys, with the ATLAS data to guide the eye, and the ad-hoc fit for comparison. The GP based on the kernel in Eq. 10 has more flexibility at high mass, but also provides a superior fit, as measured by the χ²/dof statistic. The number of degrees of freedom for the ad-hoc fit is 3, while the GP has 8 hyperparameters (3 from the mean function and 5 from the kernel). Figure 8 shows the distribution of χ²/dof, which peaks near χ²/dof = 1 for the GP model and is significantly larger than unity for the ad-hoc function.

FIG. 7: Tests of the Gaussian process and three-parameter ad-hoc function in toy data generated from the ATLAS data. Shown are the ±1σ bands about the mean background models, with the ATLAS data overlaid for reference.

FIG. 8: The distribution of χ² per degree of freedom in toy data generated from the ATLAS data at luminosity of 3.6 fb⁻¹. While the goodness of fit for the ad-hoc function degrades with more data, the GP is robust.

A critical test of the Gaussian process model is its robustness with increasing luminosity, where the ad-hoc approach has failed in collider data [15, 17]. In Figure 9, the mean and standard deviation of the χ²/d.o.f. are shown as a function of integrated luminosity in the toy data, demonstrating the robustness of the GP approach.

Background plus signal fits
Adding more flexibility to the background model guarantees a better fit to background-only toys; however, this generally comes at the loss of power in a search for a signal. A background model that is flexible enough to incorporate a signal contribution will have no discovery power.

FIG. 9: Mean and standard deviation of the χ²/d.o.f. measure in toy data generated from ATLAS collisions, as a function of integrated luminosity, for the ad-hoc fit and the Gaussian process.

Here, we test the GP model's performance in the toy data constructed as described earlier, but with signal injected as well. We used a generic Gaussian resonance, and performed tests with various values for the signal mass, width, and amplitude. The hyperparameters of the GP (both for the background mean and kernel functions) are fixed from our fit to the ATLAS dataset; in a realistic application, experimenters could establish the hyperparameters in simulated samples. We only fit the three parameters (amplitude, mass, and width) of the Gaussian signal. For the parametric fit, we fit all six parameters: the three fit-function parameters and three signal parameters. An example of this background-plus-signal fit is shown with an injected 2.5 TeV Gaussian signal shape in the top panel of Fig. 10.

This single example is illustrative and qualitative, but the statistical test for the presence of a signal in observed data relies on the likelihood ratio Λ between the background-only and the signal-plus-background hypotheses. We calculate the likelihood ratio between the two hypotheses in the case of background-only toy data as well as background-plus-signal toy data. This involves the use of Eq. 11 twice, as the posterior mean background prediction is different for the background-only and signal-plus-background fits.
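The corresponding test statistic can be sketched with a binned Poisson likelihood. As a simplification of the fits described above, this sketch scans only the signal amplitude over a grid (the paper also fits mass and width), and `bkg` stands in for the conditional posterior-mean background:

```python
import numpy as np

def poisson_nll(y, nu):
    # Binned Poisson negative log-likelihood, dropping y-only constants.
    return np.sum(nu - y * np.log(nu))

def gaussian_signal(x, amp, mass, width):
    return amp * np.exp(-0.5 * ((x - mass) / width) ** 2)

def two_log_lambda(x, y, bkg, mass, width, amps):
    """-2 log Lambda between the background-only and the
    signal-plus-background hypotheses, profiling the amplitude
    over a grid that includes zero."""
    nll_b = poisson_nll(y, bkg)
    nll_sb = min(poisson_nll(y, bkg + gaussian_signal(x, a, mass, width))
                 for a in amps)
    return 2.0 * (nll_b - nll_sb)
```

Because the amplitude grid includes zero, the statistic is non-negative and vanishes when the data show no preference for a signal.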
This is analogous to the profile likelihood ratio, where there are two fits and the conditional maximum likelihood estimate of the background in the background-only case is generally different from the background estimate in the signal-plus-background fit. The distribution of −2 log Λ in background-only toy data is compared to a χ² distribution with one degree of freedom, for both the ad-hoc function and the GP model.

FIG. 11: Distribution of −2 log Λ in toy data generated from the ATLAS data at √s = 13 TeV with integrated luminosity of 3.6 fb⁻¹, with a false signal injected at m_jj = 2.5 TeV, compared to a χ² distribution with one degree of freedom.

FIG. 12: Mean log likelihood ratio (Λ) between the background-only and the background-plus-signal hypotheses, shown both for the Gaussian process model and the ad-hoc fit, in 1000 toy data sets for varying injected signal mass. Solid and dashed lines indicate the threshold for 3σ significance and for an α-level of 0.05.

The added flexibility of the GP does not degrade the power of the search – in fact, the GP has more sensitivity to a signal at low mass, while the two methods are comparable at high mass. This gain in power is logically possible because the distribution used to generate the true background (which is normally unknown) does not correspond to the ad-hoc function exactly. In our case the GP background model is able to more accurately follow the true background (generated from smeared ATLAS data) than the ad-hoc function.

If a signal is detected, it is also vital to be able to extract the signal parameters. For two choices of signal mass (m_jj = 3 and 5.5 TeV), we performed fits to signal-plus-background toys fitting the mass, width, and signal strength. Figure 13 shows that the extracted signal width and yield are reliable estimators of the true values.

MODELING GENERIC LOCALIZED SIGNALS
The search for specific resonances above a smooth background is only one type of search strategy. More broadly, we hope to be sensitive to localized deviations that take different, potentially unanticipated shapes. For instance, a cascade decay can lead to triangular distributions with a sharp endpoint [41], though helicity correlations can modify this shape in detail. Searches like this require balancing a small number of tests of the background-only model using generic properties of a signal and a larger number of tests of the background-only model using more specific signal properties. A single number-counting search using the full mass range is very generic, but has very little power. Conversely, an enormous scan over specific hypothesized signals individually has more power, but this strategy suffers from a large look-elsewhere effect.

Historically, the search for generic signals over a background model has used algorithms such as
BumpHunter [4]. This approach imposes only minimal structure on the signal: that it is a localized, contiguous excess. This approach can be effective, but it has significant practical drawbacks as it contains many ad-hoc algorithmic elements. For example, in common usage
BumpHunter requires that the excess be localized to at least two bins, and at most half of the bins. This algorithmic characterization of the signal is effective, but it is difficult to interpret and characterize statistically. Secondly, to address the look-elsewhere effect, this approach explicitly accounts for multiple testing and calibrates the distribution of the test statistic by applying the entire procedure to background-only toys. This requires a global background-only prediction, which is complicated when we rely on the data to help fit the background model. In particular, if a signal is present in the data, it is unclear how this impacts the background estimate.

FIG. 13: Extracted signal parameters versus true parameters for injected Gaussian signals with m_jj = 3 TeV (top) and m_jj = 5.5 TeV (bottom). Left: the extracted signal width (σ) for a fixed signal yield. Right: the extracted signal yield (N) for a fixed signal width (σ = 250 GeV). Results are shown for both the Gaussian process background model (green) and the ad-hoc fit function (blue).

Thus far, the main strategy has been an iterative background estimation procedure that defines a signal region and extrapolates the background fit into this region. This approach introduces a coupling of algorithmic decisions with the statistical considerations. Similarly, in the context of a sliding-window background model, the procedure is further complicated by the fact that there is not a single global background prediction, but a set of correlated background predictions specific to the signal window under consideration.

In this section, we consider an alternative approach which uses a GP to model a generic localized signal. In this case, the basic physical requirement of the localized signal can be encoded directly in the kernel of the signal GP, rather than indirectly through ad-hoc algorithmic choices.
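For concreteness, the scan-and-calibrate logic described above can be sketched as follows. This is an illustrative simplification, not the BumpHunter algorithm itself: it uses a Gaussian significance per window in place of BumpHunter's Poisson tail probability, and assumes a fixed, known background prediction:

```python
import numpy as np

def max_window_significance(counts, bkg):
    """Most significant contiguous excess over windows of 2 to n//2
    bins, using a simple Gaussian approximation per window."""
    n = len(counts)
    best = -np.inf
    for width in range(2, n // 2 + 1):
        for start in range(n - width + 1):
            obs = counts[start:start + width].sum()
            exp = bkg[start:start + width].sum()
            best = max(best, (obs - exp) / np.sqrt(exp))
    return best

def global_p_value(y, bkg, n_toys=1000, seed=0):
    # Calibrate the look-elsewhere-corrected p-value with
    # background-only toys, as in the BumpHunter procedure.
    rng = np.random.default_rng(seed)
    t_obs = max_window_significance(y, bkg)
    t_toys = [max_window_significance(rng.poisson(bkg), bkg)
              for _ in range(n_toys)]
    return np.mean(np.asarray(t_toys) >= t_obs)
```

The toy calibration is what makes the resulting p-value global: the maximization over windows is applied identically to the observed data and to every background-only toy.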
This approach allows us to perform signal-plus-background fits where the signal GP absorbs the localized excess and the background GP accounts for the background. The background component from such a fit can provide a global background estimate to be used in the context of a BumpHunter approach even when a signal is present in the data. More importantly, this approach enables hypothesis tests of the background-only model against a weakly specified signal-plus-background model directly based on the likelihood ratio or Bayes factor.

We initiate the study of this approach with a specific signal GP described by the following kernel

Σ(x, x′) = A e^{−(x−x′)²/l²} e^{−((x−m)² + (x′−m)²)/t²} , (14)

which has three main terms. The first term A is an overall amplitude for the signal. The second term is the standard exponential-squared kernel with length scale l. The third term is an envelope that localizes the signal around a mass m with a width t, which is analogous to the mass window.

To demonstrate the flexibility of this kernel, we performed signal-plus-background fits for a variety of signal shapes. Figure 14 shows the background extraction for a piecewise-linear triangular signal and Figure 15 shows a square signal; both have been smeared to model detector jet energy resolution effects. Our studies indicate the GP signal is able to accommodate a wide variety of signal shapes, leaving the background model responsible for the smoother background-only component.

Look-elsewhere effect
This approach does not eliminate the look-elsewhere effect that arises from considering multiple signal hypotheses. Instead of a finite number of search windows or signal hypotheses, the GP describes a continuous family of signal hypotheses. This is not fundamentally different from the look-elsewhere effect that arises from considering a signal model with an unknown mass or width, though it is in a non-parametric setting. While both the GP and the simple example of an unknown mass correspond to an infinite number of signal hypotheses, they are highly correlated and the effective trials factor is finite [42, 43].

FIG. 14: Top, an example fit with both a GP background and signal model to toy data with an injected triangular signal. The panes below show the significance of residuals between the toy data and the background model, and the toy data and the background-and-signal model. Bottom, the residuals between the toy data and the background model, overlaid with the injected signal and the fitted GP signal.

FIG. 15: Top, an example fit with both a GP background and signal model to toy data with an injected square signal. The panes below show the significance of residuals between the toy data and the background model, and the toy data and the background-and-signal model. Bottom, the residuals between the toy data and the background model, overlaid with the injected signal and the fitted GP signal.

Fundamentally, the fact that some parameters of the signal model (e.g. mass and width) have no effect in the background-only case (in statistics jargon, they are not identified under the null [44]) means that the conditions necessary for Wilks's theorem are not satisfied and the log likelihood ratio distribution will not take on the chi-square form. While there are approaches to estimate the asymptotic distribution of the likelihood ratio test statistic for signal models with one or a few parameters [42, 43], we are not aware of an asymptotic theory in the case of GPs.
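The localized-signal kernel of Eq. 14 is simple to write down explicitly. The sketch below follows that equation as reconstructed here; the exact normalizations in the exponents are an assumption (the structure, not the factors, is what matters for illustration):

```python
import numpy as np

def signal_kernel(x1, x2, A, l, m, t):
    """Eq. 14: an exponential-squared kernel with length scale l,
    multiplied by a Gaussian envelope of width t centered at mass m
    that localizes the signal. Exponent factors are assumptions."""
    core = np.exp(-((x1[:, None] - x2[None, :]) ** 2) / l ** 2)
    envelope = np.exp(-((x1[:, None] - m) ** 2
                        + (x2[None, :] - m) ** 2) / t ** 2)
    return A * core * envelope
```

Shrinking l relative to t gives the signal GP more wiggle room inside the envelope, which is the knob behind the trials-factor discussion that follows.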
The lack of an asymptotic theory has little practical impact since even in the case of signal models with a few parameters, the asymptotic distributions are only accurate for very significant excesses, and background-only toys are usually used in the interesting region of 2–5σ.

The effective trials factor will depend on the specific background model and the kernel used for the signal GP. To illustrate this, we examined the log-likelihood ratio distribution for an ensemble of background-only toys similar to what was done in Fig. 11. In this case we fit the mass hyperparameter m in the range 2–5 TeV and fixed the hyperparameter t = 600 GeV, which specifies that the signal is localized roughly to a 600 GeV region. Naively, we expect the trials factor from allowing the mass to float (the range divided by the width) to be about 6. In addition we consider two different values for the length scale: l = t and l = t/3. Smaller values of l allow the signal GP more flexibility within the effective mass resolution, and thus further increase the trials factor. Figure 16 shows the log-likelihood ratio distribution from these tests, confirms our intuition that smaller values of l imply a larger look-elsewhere effect, and demonstrates that it is straightforward to directly calculate the global p-value from background-only toy Monte Carlo.

DISCUSSION
In this paper, we give a broad overview of the potential to use Gaussian Processes to model smooth backgrounds and generic localized signals. The use of GPs for modeling smooth backgrounds addresses the shortcomings of the current approaches based on ad-hoc parametrized functions. In particular, overly rigid parametrized functions lead to problems in high-luminosity searches because the inevitable deviations between the functional form and the true distribution can no longer hide behind statistical fluctuations. In contrast, the GP approach relaxes the rigid structure of a parametrized function, while maintaining the necessary notions of smoothness.

We have outlined the logical continuity between the GP likelihood and the conventional binned and unbinned Poisson likelihoods used in high-energy physics. We have discussed the interpretation of the kernel from first principles through a precise connection to unfolding and investigated the kernels associated with specific experimental and theoretical sources of uncertainty.

We have studied in detail the performance of a simple intuitive kernel designed for dijet resonance searches. Finally, we have introduced a novel strategy for the search for generic localized signal excesses in which the weak specification of signal properties is provided via a GP kernel instead of the ad-hoc algorithmic choices of current approaches.

FIG. 16: Distribution of −2 log Λ in background-only toys; the deviation from the χ² distribution is due to the look-elsewhere effect. The top plot corresponds to a signal GP with l = t/3, which has more flexibility and a larger trials factor than the bottom plot with l = t.

Gaussian Processes have improved the statistical modeling in a number of scientific fields, and these studies demonstrate their potential for high-energy physics. Their robustness in a high-luminosity setting and their ability to model weakly specified signals are both welcome and timely.

ACKNOWLEDGEMENTS
We would like to thank Simone Alioli and Duccio Pappadopulo for providing the parton density function uncertainties for the ATLAS dijet analyses. Duccio Pappadopulo, Lukas Heinrich, and Gilles Louppe provided useful feedback on the manuscript. We also thank Dan Foreman-Mackey for advice and support regarding custom kernels in george. DW and MF are supported by the Department of Energy Office of Science. Additionally, MF is supported via NSF ACI-1450310. KC is supported through NSF ACI-1450310, PHY-1505463, PHY-1205376, and the Moore-Sloan Data Science Environment at NYU.

[1] UA2 Collaboration, Zeitschrift für Physik C Particles and Fields, 17 (1991), ISSN 1431-5858.
[2] G. Aad et al. (ATLAS), Phys. Rev. Lett., 111803 (2012), 1202.1414.
[3] S. Abrahamyan et al. (APEX), Phys. Rev. Lett., 191804 (2011), 1108.2750.
[4] G. Choudalakis, in Proceedings, PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN, Geneva, Switzerland, 17-20 January 2011 (2011), 1101.0390, URL https://inspirehep.net/record/883244/files/arXiv:1101.0390.pdf.
[5] P. D. Sampson and P. Guttorp, Journal of the American Statistical Association, 108 (1992).
[6] W. Meiring, P. Monestiez, P. Sampson, and P. Guttorp, Geostatistics Wollongong, 162 (1997).
[7] D. Foreman-Mackey, B. T. Montet, D. W. Hogg, T. D. Morton, D. Wang, and B. Schölkopf, The Astrophysical Journal, 215 (2015), URL http://stacks.iop.org/0004-637X/806/i=2/a=215.
[8] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, vol. 1 (MIT Press, Cambridge, 2006).
[9] R. Grosse, R. Salakhutdinov, W. Freeman, and J. Tenenbaum, in Uncertainty in Artificial Intelligence (2012).
[10] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani, in Proceedings of the 30th International Conference on Machine Learning (2013).
[11] C. P. Shen et al. (Belle), Phys. Rev. Lett., 112004 (2010), 0912.2383.
[12] T. Aaltonen et al. (CDF), Phys. Rev. D79, 112002 (2009), 0812.4036.
[13] V. M. Abazov et al. (D0), Phys. Rev. Lett., 191803 (2009), 0906.4819.
[14] V. Khachatryan et al. (CMS), Phys. Rev. Lett., 211801 (2010), 1010.0203.
[15] G. Aad et al. (ATLAS), Phys. Lett. B754, 302 (2016), 1512.01530.
[16] S. S. Wilks, Ann. Math. Statist., 60 (1938).
[17] M. Aaboud et al. (ATLAS) (2017), 1703.09127.
[18] P. D. Dauncey, M. Kenzie, N. Wardle, and G. J. Davies, JINST, P04015 (2015), 1408.6865.
[19] K. Cranmer, in Proceedings, 2011 European School of High-Energy Physics (ESHEP 2011): Cheile Gradistei, Romania, September 7-20, 2011 (2015), pp. 267–308, 1503.07622, URL http://inspirehep.net/record/1356277/files/arXiv:1503.07622.pdf.
[20] D. R. Cox, Journal of the Royal Statistical Society, Series B (Methodological), 129 (1955), ISSN 00359246.
[21] S. Golchi and R. Lockhart, arXiv preprint arXiv:1501.02226 (2015).
[22] J. P. Cunningham, K. V. Shenoy, and M. Sahani, in Proceedings of the 25th International Conference on Machine Learning (ACM, New York, NY, USA, 2008), ICML '08, pp. 192–199, ISBN 978-1-60558-205-4, URL http://doi.acm.org/10.1145/1390156.1390181.
[23] A. M. Sirunyan et al. (CMS), Phys. Rev. D96, 032002 (2017), 1704.06142.
[24] G. Pillonetto, F. Dinuzzo, T. Chen, G. D. Nicolao, and L. Ljung, Automatica, 657 (2014), ISSN 0005-1098.
[25] D. Higdon, J. Swall, and J. Kern, Bayesian Statistics, 761 (1999).
[26] C. J. Paciorek and M. J. Schervish, Environmetrics, 483 (2006).
[27] M. Aaboud et al. (ATLAS), Eur. Phys. J. C77, 26 (2017), 1607.08842.
[28] G. Aad et al. (ATLAS), JHEP, 059 (2014), 1312.3524.
[29] S. Alioli, Private Communication (2017).
[30] S. Alioli, M. Farina, D. Pappadopulo, and J. T. Ruderman, JHEP, 097 (2017), 1706.03068.
[31] S. Alioli, P. Nason, C. Oleari, and E. Re, JHEP, 043 (2010), 1002.2581.
[32] S. Alioli, K. Hamilton, P. Nason, C. Oleari, and E. Re, JHEP, 081 (2011), 1012.3380.
[33] R. D. Ball et al. (NNPDF), JHEP, 040 (2015), 1410.8849.
[34] D. Foreman-Mackey, D. W. Hogg, D. Lang, and J. Goodman, Publications of the Astronomical Society of the Pacific, 306 (2013), URL http://stacks.iop.org/1538-3873/125/i=925/a=306.
[35] J. Goodman and J. Weare, Communications in Applied Mathematics and Computational Science, 65 (2010).
[36] M. N. Gibbs, Ph.D. thesis, University of Cambridge, Cambridge, England (1998).
[37] G. Cowan, K. Cranmer, E. Gross, and O. Vitells, Eur. Phys. J. C71, 1554 (2011), [Erratum: Eur. Phys. J. C73, 2501 (2013)], 1007.1727.
[38] F. James and M. Roos, Comput. Phys. Commun., 343 (1975).
[39] S. Ambikasaran, D. Foreman-Mackey, L. Greengard, D. W. Hogg, and M. O'Neil (2014), 1403.6015.
[40] D. Foreman-Mackey, E. Agol, R. Angus, and S. Ambikasaran, arXiv (2017), URL https://arxiv.org/abs/1703.09710.
[41] B. C. Allanach et al., Eur. Phys. J. C25, 113 (2002), hep-ph/0202233.
[42] E. Gross and O. Vitells, Eur. Phys. J. C70, 525 (2010), 1005.1891.
[43] O. Vitells and E. Gross, Astropart. Phys., 230 (2011), 1105.4355.
[44] R. B. Davies, Biometrika, 33 (1987), URL http://dx.doi.org/10.1093/biomet/74.1.33.