Detecting new signals under background mismodelling
Version of December 4, 2019
Detecting new signals under background mismodelling∗

Sara Algeri†
School of Statistics, University of Minnesota, Minneapolis (MN), 55455, USA
(Dated: December 4, 2019)

Searches for new astrophysical phenomena often involve several sources of non-random uncertainties which can lead to highly misleading results. Among these, model uncertainty arising from background mismodelling can dramatically compromise the sensitivity of the experiment under study. Specifically, overestimating the background distribution in the signal region increases the chances of missing new physics. Conversely, underestimating the background outside the signal region leads to an artificially enhanced sensitivity and a higher likelihood of claiming false discoveries. The aim of this work is to provide a unified statistical strategy to perform modelling, estimation, inference, and signal characterization under background mismodelling. The method proposed allows one to incorporate the (partial) scientific knowledge available on the background distribution and provides a data-updated version of it in a purely nonparametric fashion, without requiring the specification of prior distributions on the unknown parameters. Applications in the context of dark matter searches and radio surveys show how the tools presented in this article can be used to incorporate non-stochastic uncertainty due to instrumental noise and to overcome violations of classical distributional assumptions in stacking experiments.
I. INTRODUCTION
When searching for new physics, a discovery claim is made if the data collected by the experiment provides sufficient statistical evidence in favor of the new phenomenon. If the background and signal distributions are specified correctly, this can be done by means of statistical tests of hypothesis, upper limits, and confidence intervals.
The problem.
In practice, even if a reliable description of the signal distribution is available, providing accurate background models may be challenging, as the behavior of the sources which contribute to it is often poorly understood. Some examples include searches for nuclear recoils of weakly interacting massive particles over electron-recoil backgrounds [1, 2], searches for gravitational-wave signals over non-Gaussian backgrounds from stellar-mass binary black holes [3], and searches for a standard model-like Higgs boson over prompt diphoton production [4].

Unfortunately, model uncertainty due to background mismodelling can significantly compromise the sensitivity of the experiment under study. Specifically, overestimating the background distribution in the signal region increases the chances of missing new physics. Conversely, underestimating the background outside the signal region leads to an artificially enhanced sensitivity, which can easily result in false discovery claims. Several methods have been proposed in the literature to address this problem [e.g., 5–7]. However, to the best of the author's knowledge, none of the methods available provides a unified strategy to (i) assess the validity of existing models for the background, (ii) fully characterize the background distribution, (iii) perform signal detection even if the signal distribution is not available, (iv) characterize the signal distribution, and (v) detect additional signals of new unexpected sources.

∗ The author declares no conflict of interest.
† [email protected]

Goal.
The aim of this work is to integrate modelling, estimation, and inference under background mismodelling, and to provide a general statistical methodology to perform (i)–(v). As a brief overview, given a source-free sample and the (partial) scientific knowledge available on the background distribution, a data-updated version of the latter is obtained in a purely nonparametric fashion, without requiring the specification of prior distributions on the unknown parameters. At this stage, a graphical tool is provided in order to assess if and where significant deviations between the true and the postulated background distributions occur. The "updated" background distribution is then used to assess if the distribution of the data collected by the experiment deviates significantly from the background model. Also in this case, it is possible to assess graphically how the data distribution deviates from the expected background model. If a source-free sample is available, or if control regions can be identified, the solution proposed does not require the specification of a model for the signal; however, if the signal distribution is known (up to some free parameters), the latter can be used to further improve the accuracy of the analysis and to detect the signals of unexpected new sources. Finally, the method can be easily adjusted to cover situations in which a source-free sample or control regions are not available, and the background is unknown or incorrectly specified, but a functional form of the signal distribution is known.
The key to the solution.
The statistical methodologies involved rely on the novel LP approach to statistical modelling first introduced by Mukhopadhyay and Parzen in 2014 [8]. As it will become clearer later on in the paper, the letter L typically denotes robust nonparametric methods based on quantiles, whereas P stands for polynomials [9, Supp S1]. This approach allows the unification of many of the standard results of classical statistics by expressing them in terms of quantiles and comparison distributions, and provides a simple and powerful framework for statistical learning and data analysis. The interested reader is directed to [10–14] and references therein for recent advancements in mode detection, nonparametric time series, goodness-of-fit on prior distributions, and large-scale inference using an LP approach.

Organization.
Section II is dedicated to a review of the main constructs of LP modelling. Section III highlights the practical advantages offered by modelling background distributions using an LP approach. Section IV introduces a novel LP-based framework for statistical inference. Section V outlines the main steps of a data-scientific approach for signal detection and characterization. In Section VI, the methods proposed are applied in the context of dark matter searches, where the goal is to distinguish γ-ray emissions due to dark matter from those due to pulsars. In Section VII, the tools discussed are applied to a simulation of the Fermi Large Area Telescope, and it is shown how upper limits and Brazil plots can be constructed by means of comparison distributions. Section VIII is dedicated to model-denoising. Section IX presents an application to data from the NVSS astronomical survey and discusses a simple approach to assess the validity of distributional assumptions on the polarized intensity in stacking experiments. A discussion of the main results and extensions is proposed in Section X.

II. LP APPROACH TO STATISTICAL MODELLING
The LP approach to statistical modelling [8] is a novel statistical framework which makes it possible to simultaneously assess the validity of the scientific knowledge available and fill the gap between the initial scientific belief and the evidence provided by the data. Sections II A, II B, and II C below introduce the LP modelling framework, whereas Section III discusses how the problem of background mismodelling can be formulated under this paradigm.
A. The skew-G density model
Let X be a continuous random variable with cumulative distribution function (cdf) F(x) and probability density function (pdf) f(x). Since F is the true distribution of the data, it is typically unknown. However, suppose a suitable cdf G(x) is available, and let g(x) be the respective pdf. In order to understand if G is a good candidate for F, it is convenient to express the relationship between the two in a concise manner.

The skew-G density model [8, 10] is a universal representation scheme which allows one to express any pdf f(x) as

    f(x) = g(x) d(G(x); G, F)                                        (1)

where d(u; G, F) is called the comparison density [15] and is such that

    d(u; G, F) = f(G⁻¹(u)) / g(G⁻¹(u)),   0 ≤ u ≤ 1,                 (2)

with u = G(x) and G⁻¹(u) = inf{x : G(x) ≥ u} denoting the "postulated" quantile function of X. The comparison density is the pdf of the random variable U = G(X), whereas its cdf,

    D(u) = F(G⁻¹(u)) = ∫₀ᵘ d(v; G, F) dv,                            (3)

is called the comparison distribution.

Practical remarks.
Equations (2) and (3) are of fundamental importance to understand the power of a statistical modelling approach based on the comparison density. Specifically, d(u; G, F) allows one to "connect" any given pdf g to the true pdf f through the quantile transformation G⁻¹ of u. Furthermore, g ≡ f if and only if d(u; G, F) = 1 for all u ∈ [0, 1], i.e., U is uniformly distributed over the interval [0, 1]. Conversely, if g ≢ f, d(u; G, F) models the departure of the true density f(x) from the postulated model g(x). Consequently, an adequate estimate of d(u; G, F) not only leads to an estimate of the true f(x) based on (1), but also allows one to identify the regions where f(x) deviates substantially from g(x).

B. LP skew-G series representation
Denote with L²[0, 1] the Hilbert space of square integrable functions on the unit interval with respect to the measure G. A complete, orthonormal basis of functions in L²[0, 1] can be constructed by considering powers of G(x), i.e., G(x), G²(x), G³(x), . . ., adequately orthonormalized via the Gram–Schmidt procedure [10]. The resulting bases can equivalently be expressed as normalized shifted Legendre polynomials, namely Leg_j(u), with u = G(x).

Under the assumption that (2) is a square integrable function on [0, 1], i.e., d ∈ L²[0, 1], we can then represent d(u; G, F) via a series of {Leg_j(u)}_{j≥0} polynomials, i.e.,

    d(u; G, F) = 1 + Σ_{j>0} LP_j Leg_j(u)                           (4)

with coefficients LP_j = ∫₀¹ Leg_j(u) d(u; G, F) du. The representation in (4) is called the LP skew-G series representation [10].
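To make the objects of Sections II A–II B concrete, here is a small Python sketch (illustrative only; the distributions and tolerances are this sketch's own choices, not the paper's): it builds the normalized shifted Legendre basis, checks its orthonormality, and evaluates the comparison density (2) for an arbitrary misspecified pair (G, F), verifying that d(u; G, F) integrates to one.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats
from scipy.integrate import quad

def leg_shifted(j, u):
    """Normalized shifted Legendre polynomial Leg_j on [0, 1]:
    sqrt(2j+1) * P_j(2u - 1), so that int_0^1 Leg_j(u)^2 du = 1."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

# Orthonormality: int_0^1 Leg_j Leg_k du = delta_jk.
for j in range(4):
    for k in range(4):
        val, _ = quad(lambda u: leg_shifted(j, u) * leg_shifted(k, u), 0, 1)
        assert abs(val - (1.0 if j == k else 0.0)) < 1e-9

# Leg_1 matches the closed form sqrt(12) (u - 0.5).
uu = np.linspace(0, 1, 11)
assert np.allclose(leg_shifted(1, uu), np.sqrt(12) * (uu - 0.5))

# Comparison density, Eq. (2): true model F = Beta(2, 4),
# postulated model G = Beta(2, 2) (an arbitrary misspecified choice).
F, G = stats.beta(2, 4), stats.beta(2, 2)
d = lambda u: F.pdf(G.ppf(u)) / G.pdf(G.ppf(u))

# d is the pdf of U = G(X): it integrates to one over [0, 1].
mass, _ = quad(d, 0, 1)
assert abs(mass - 1.0) < 1e-6
```

Any continuous pair (F, G) works here; the Beta pair is used only so the quantile function G⁻¹ is available in closed form via `scipy`.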
C. LP density estimate
Let x₁, . . . , xₙ be a sample of independent and identically distributed (i.i.d.) observations from X. Observations from U are given by u₁ = G(x₁), . . . , uₙ = G(xₙ). The LP_j coefficients in (4) can then be estimated via

    LP̂_j = (1/n) Σ_{i=1}^n Leg_j(uᵢ).                                (5)

Alternatively, in virtue of (3), the estimates LP̂_j can also be specified as

    LP̂_j = ∫₀¹ Leg_j(u) dD̃(u) = ∫₀¹ Leg_j(u) dF̃(G⁻¹(u))              (6)

where F̃ and D̃ denote the empirical distributions of the samples x₁, . . . , xₙ and u₁, . . . , uₙ, respectively.

The moments of the LP̂_j are

    E[LP̂_j] = LP_j,  V(LP̂_j) = σ²_j/n,  Cov(LP̂_j, LP̂_k) = σ_jk/n     (7)

where σ²_j = ∫₀¹ (Leg_j(u) − LP_j)² d(u; G, F) du and σ_jk = ∫₀¹ (Leg_j(u) − LP_j)(Leg_k(u) − LP_k) d(u; G, F) du. When f ≡ g, the equalities in (7) reduce to

    E[LP̂_j] = 0,  V(LP̂_j) = 1/n,  Cov(LP̂_j, LP̂_k) = 0 for j ≠ k.    (8)

Derivations of (7) and (8) are discussed in Appendix A.

If (4) is approximated by the first M + 1 terms, an estimate of the comparison density is given by

    d̂(u; G, F) = 1 + Σ_{j=1}^M LP̂_j Leg_j(u),                        (9)

with variance

    V[d̂(u; G, F)] = Σ_{j=1}^M (σ²_j/n) Leg²_j(u) + 2 Σ_{j<k} (σ_jk/n) Leg_j(u) Leg_k(u).   (10)

See Appendix B for more details on the derivation of (10). Finally, the standard error of d̂(u; G, F) corresponds to the square root of (10), with σ²_j and σ_jk estimated by their sample counterparts, i.e.,

    σ̂²_j = (1/n) Σ_{i=1}^n (Leg_j(uᵢ) − LP̂_j)²,
    σ̂_jk = (1/n) Σ_{i=1}^n Leg_j(uᵢ) Leg_k(uᵢ) − LP̂_j LP̂_k.

Classical Legendre polynomials are defined over [−1, 1]; here, their "shifted" counterpart over the range [0, 1] is considered. The first three normalized shifted Legendre polynomials are Leg₀(u) = 1, Leg₁(u) = √12 (u − 0.5), and Leg₂(u) = √180 (u² − u + 1/6).
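The estimation step in (5), (9), and (10) can be sketched as follows. The models, sample size, and truncation point below are assumptions of this illustration, not values from the paper; with G uniform, the true comparison density of a Beta(2, 4) sample is itself a degree-4 polynomial, so M = 4 recovers it up to sampling noise.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

def leg_shifted(j, u):
    """Normalized shifted Legendre polynomial Leg_j on [0, 1]."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

rng = np.random.default_rng(1)

# Truth F = Beta(2, 4); postulated G = uniform on [0, 1] (illustrative).
F, G = stats.beta(2, 4), stats.uniform(0, 1)
x = F.rvs(size=20000, random_state=rng)
u = G.cdf(x)                                    # u_i = G(x_i)

M = 4
LP = np.array([leg_shifted(j, u).mean() for j in range(1, M + 1)])  # Eq. (5)

def d_hat(uu):
    """Estimated comparison density, Eq. (9)."""
    return 1.0 + sum(LP[j - 1] * leg_shifted(j, uu) for j in range(1, M + 1))

def se_d_hat(uu):
    """Standard error of d_hat: sample counterpart of the variance (10)."""
    Lu = np.array([leg_shifted(j, u) for j in range(1, M + 1)])   # on data
    S = np.cov(Lu)                                # sigma_j^2 and sigma_jk
    Lg = np.array([leg_shifted(j, uu) for j in range(1, M + 1)])  # on grid
    return np.sqrt(np.einsum('ju,jk,ku->u', Lg, S, Lg) / u.size)

def f_hat(xx):
    """Skew-G density estimate: g(x) times d_hat(G(x))."""
    return G.pdf(xx) * d_hat(G.cdf(xx))

grid = np.linspace(0.05, 0.95, 19)
assert np.max(np.abs(f_hat(grid) - F.pdf(grid))) < 0.3
```

The `einsum` call evaluates the full quadratic form Σ_{j,k} Leg_j(u) σ̂_jk Leg_k(u)/n, which reproduces both the diagonal and the cross terms of (10) in one step.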
Finally, in virtue of the skew-G density model in (1), we can estimate f(x) as

    f̂(x) = g(x) d̂(G(x); G, F).                                       (11)

Recall that the first normalized shifted Legendre polynomial is Leg₀(u) = 1.
FIG. 1: Upper panel: histogram of a source-free sample simulated from the tail of a Gaussian with mean 55 and width 15 (green solid line). The candidate background distribution is given by the best fit of a second-degree polynomial (red dashed line), and it is updated using the source-free data by means of (19) (purple dot-dashed line). The kernel density estimator of f_b is also displayed for comparison (orange dot-dashed line). Bottom panel: comparison density estimate (blue solid line) plotted on the x-scale and respective standard errors (light blue area).

Since each Leg_j(u) is a polynomial function of the random variable U, each LP̂_j estimate can be expressed as a linear combination of the first j sample moments of U, e.g.,

    LP̂₂ = (1/n) Σ_{i=1}^n Leg₂(uᵢ) = √180 (μ̂₂ − μ̂₁ + 1/6)

where μ̂₁ = (1/n) Σ_{i=1}^n uᵢ and μ̂₂ = (1/n) Σ_{i=1}^n u²ᵢ. Therefore, the truncation point M can be interpreted as the order of the highest moment considered to characterize the distribution of U. (The reader is directed to Section IV C for a discussion on the choice of M.)

D. The bias-variance trade-off
In order to understand how good (9) is at estimating d(u; G, F), we consider the Mean Integrated Squared Error (MISE) of d̂(u; G, F), i.e.,

    MISE = E[∫₀¹ (d̂(u; G, F) − d(u; G, F))² du]                      (12)
         = Σ_{j=1}^M σ²_j/n + Σ_{j>M} LP²_j                          (13)

where the first term in (13) corresponds to the integral of (10) over [0, 1], whereas the second term corresponds to the Integrated Squared Bias (IBS), i.e.,

    IBS = ∫₀¹ (E[d̂(u; G, F)] − d(u; G, F))² du.                      (14)

Interestingly, the latter can also be specified as

    IBS = ∫₀¹ ((f(x) − g(x))/g(x))² du − Σ_{j=1}^M LP²_j             (15)

with u = G(x) (see derivations in Appendix B). The first term on the right-hand side of (15) is particularly important in understanding the role played by g in obtaining a reliable estimate of f. Specifically, the closer g is to f, the lower the bias of d̂(G(x); G, F) and of f̂(x) in (11).

Practical remarks.
Equation (13) implies that larger values of M do not necessarily lead to better estimates of d(u; G, F). Specifically, when n → ∞, the first term in (13) tends to zero. However, for large values of M, more and more terms contribute to it, and thus increasing M may lead to a substantial inflation of the variance in (10). Conversely, the bias is not affected by the sample size, and it can be controlled by either choosing g sufficiently close to f (see (15)) and/or increasing M while preserving a good bias-variance trade-off.

Further remarks.
Equation (6) implies that the estimator in (9) relies on the empirical distribution of the observed sample by means of the LP̂_j estimates. Therefore, an estimator of the comparison density based entirely on the empirical cdf can be expressed by setting M = n − 1. For any choice of M < n − 1, the estimator in (9) not only leads to a reduction of the variance but, in virtue of (15), its bias is mitigated when the postulated model g is sufficiently close to the true pdf of the data f.

III. DATA-DRIVEN CORRECTIONS FOR MISSPECIFIED BACKGROUND MODELS
Let x_B = (x₁, . . . , x_N) be a sample of observations from control regions or the result of Monte Carlo simulations, where we expect no signal to be present. Hereafter, we refer to x_B as the source-free sample. Therefore, x_B can be used to "learn" the unknown pdf of the background, namely f_b(x), and obtain an estimate for it via (11).

Despite the true background model being unknown, suppose that a candidate pdf, namely g_b(x), is available. The candidate model g_b(x) can be specified from previous experiments or theoretical results, or can be obtained by fitting specific functions (e.g., polynomial, exponential, etc.) to x_B. If g_b(x) does not provide an accurate description of f_b(x), the sensitivity of the experiment can be strongly affected.

Consider, for instance, a source-free sample of N observations simulated from the tail of a Gaussian with mean 55 and width 15, i.e.,

    f_b(x) = e^{−(x−55)²/(2·15²)} / k_{f_b}                          (16)

with k_{f_b} = ∫ e^{−(x−55)²/(2·15²)} dx the normalizing constant over the search region. Suppose that a candidate model for the background is obtained by fitting a second-degree polynomial to the source-free sample and adequately normalizing it in order to obtain a proper pdf, i.e.,

    g_b(x) = (β̂₀ + β̂₁ x + β̂₂ x²) / k_{g_b}                           (17)

with β̂₀, β̂₁, β̂₂ the fitted coefficients and k_{g_b} the respective normalizing constant. For illustrative purposes, assume that the distribution of the signal is a Gaussian centered at 25, with width 4, i.e.,

    f_s(x) = e^{−(x−25)²/(2·4²)} / k_{f_s}                           (18)

with k_{f_s} the respective normalizing constant. The histogram of the source-free sample along with (16)–(18) is shown in Fig. 1. At the higher end of the spectrum, the postulated background (red dashed line) underestimates the true background distribution (green solid line). As a result, using (17) as the background model increases the chance of false discoveries in this region. Conversely, at the lower end of the spectrum, g_b(x) overestimates f_b(x), reducing the sensitivity of the analysis.
For the sake of comparison, a kernel density estimate (orange dot-dashed line) has been computed by selecting the bandwidth parameter as recommended in [16]. The latter exhibits substantial bias at the boundary and appears to overfit the data sample.

It is important to point out that the discrepancy of f_b(x) from g_b(x) is typically due to the fact that the specific functional form imposed (in our example, a second-degree polynomial) is not adequate for the data. Thus, changing the values of the fitted parameters (or assigning priors to them) is unlikely to solve the problem. However, it is possible to "repair" g_b(x) and obtain a suitable estimate of f_b(x) by means of (11). Specifically, f_b(x) can be estimated via

    f̂_b(x) = g_b(x) d̂(G_b(x); G_b, F_b)                              (19)

where d̂(G_b(x); G_b, F_b) is the comparison density estimated via (9) on the sample G_b(x₁), . . . , G_b(x_N), whereas F_b and G_b are the true and the postulated background distributions, with pdfs as in (16) and (17), respectively.

In our example, choosing M = 2 leads to an estimate of the form

    d̂(G_b(x); G_b, F_b) = 1 + LP̂₁ Leg₁[G_b(x)] + LP̂₂ Leg₂[G_b(x)],   (20)

where Leg₁[G_b(x)] and Leg₂[G_b(x)] are the first and second normalized shifted Legendre polynomials evaluated at G_b(x). Notice that, by combining (19) and (20), we can easily write the background model as a series of shifted Legendre polynomials. This may be especially useful when dealing with complicated likelihoods for which a functional form is difficult to specify.

The upper panel of Fig. 1 shows the "calibrated" background model in (19) as a purple dot-dashed line; it matches almost exactly the true background density in (16) (green solid line). The plot of d̂(G_b(x); G_b, F_b) in the bottom panel of Fig. 1 provides important insights on the deficiencies of (17) as a candidate background model.
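The repair in (19)–(20) can be mirrored numerically. Everything below is an illustrative reconstruction: the search region [10, 45], the sample size, the histogram-based polynomial fit, and the choice M = 4 are assumptions of this sketch (only the Gaussian-tail background with mean 55 and width 15 and the second-degree polynomial candidate follow the text; the paper selects M via the scheme of Section IV C).

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

rng = np.random.default_rng(7)
LO, HI = 10.0, 45.0      # assumed search region (illustrative)

# True background, cf. Eq. (16): Gaussian tail, mean 55, width 15.
a, b = (LO - 55) / 15, (HI - 55) / 15
F_b = stats.truncnorm(a, b, loc=55, scale=15)
x_B = F_b.rvs(size=10000, random_state=rng)      # source-free sample

# Candidate background, cf. Eq. (17): normalized second-degree polynomial
# fit (here, a simple least-squares fit to a histogram of the sample).
hist, edges = np.histogram(x_B, bins=40, range=(LO, HI), density=True)
beta = np.polyfit(0.5 * (edges[:-1] + edges[1:]), hist, deg=2)

xx = np.linspace(LO, HI, 4001)
g_raw = np.maximum(np.polyval(beta, xx), 1e-9)   # clip to keep a proper pdf
steps = 0.5 * (g_raw[1:] + g_raw[:-1]) * np.diff(xx)
k_gb = steps.sum()
G_tab = np.concatenate([[0.0], np.cumsum(steps)]) / k_gb
g_b = lambda x: np.interp(x, xx, g_raw) / k_gb   # candidate pdf g_b
G_b = lambda x: np.interp(x, xx, G_tab)          # candidate cdf G_b

def leg_shifted(j, u):
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

# LP correction, cf. Eqs. (19)-(20).
M = 4
u = G_b(x_B)
LP = [leg_shifted(j, u).mean() for j in range(1, M + 1)]
d_hat = lambda v: 1 + sum(LP[j - 1] * leg_shifted(j, v)
                          for j in range(1, M + 1))
f_hat_b = lambda x: g_b(x) * d_hat(G_b(x))

# The repaired model should sit closer to the truth than the candidate.
grid = np.linspace(LO + 0.5, HI - 0.5, 70)
err_candidate = np.mean((g_b(grid) - F_b.pdf(grid)) ** 2)
err_repaired = np.mean((f_hat_b(grid) - F_b.pdf(grid)) ** 2)
assert err_repaired < err_candidate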
Specifically, the magnitude and the direction of the departure of d̂(G_b(x); G_b, F_b) from one correspond to the estimated departure of f_b(x) from g_b(x) for each value of x. Therefore, if d̂(G_b(x); G_b, F_b) is below one in the region where we expect the signal to occur, using f̂_b(x) in place of g_b(x) increases the sensitivity of the analysis. Conversely, if d̂(G_b(x); G_b, F_b) is above one outside the signal region, the use of f̂_b(x) instead of g_b(x) prevents false discoveries.

Notice that in this article we only consider continuous data. In this respect, the goal is to learn the model of the background considered as a continuum, and no binning is applied. Therefore, the histograms presented here are only a graphical tool used to display the data distribution and are not intended to represent an actual binning of the data.

IV. LP-BASED INFERENCE
When discussing the skew-G density model in (1), we have seen that f ≡ g if and only if d(u; G, F) = 1 for all u ∈ [0, 1]. Additionally, the graph of d̂(u; G, F) provides an exploratory tool to understand the nature of the deviation of f(x) from g(x). This section introduces a novel inferential framework to test the significance of the departure of f(x) from g(x). Specifically, our goal is to test the hypotheses

    H₀: d(u; G, F) = 1 for all u ∈ [0, 1]
    vs
    H₁: d(u; G, F) ≠ 1 for some u ∈ [0, 1].                          (21)

First, an overall test, namely the deviance test, is presented. The deviance test assesses if f(x) deviates significantly from g(x) anywhere over the range of x considered. Second, adequate confidence bands are constructed in order to assess where significant departures occur.

A. The deviance test
Recall that the LP_j coefficients in (4) specify as LP_j = ∫₀¹ Leg_j(u) d(u; G, F) du. Consequently, by orthogonality of the {Leg_j(u)}_{j>0} basis with Leg₀(u) = 1, when H₀ in (21) is true, all the LP_j coefficients are equal to zero, including the first M of them. We can then quantify the departure of d̂(u; G, F) from one by means of the deviance statistic [13], which specifies as Σ_{j=1}^M LP̂²_j. If the deviance is equal to zero, we may expect that g is approximately equivalent to f; hence, we test

    H₀: Σ_{j=1}^M LP²_j = 0   vs   H₁: Σ_{j=1}^M LP²_j > 0           (22)

by means of the test statistic

    D_M = n Σ_{j=1}^M LP̂²_j.                                         (23)

It can be shown [10] that, as n → ∞,

    √n LP̂_j →d N(0, 1),                                              (24)

where →d denotes convergence in distribution, and thus, under H₀, D_M is asymptotically χ²_M-distributed. Hence, an asymptotic p-value for (22) is given by

    P(D_M > d_M) → P(χ²_M > d_M)   as n → ∞,                         (25)

where d_M is the value of D_M observed on the data.
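The test in (23)–(25) can be sketched in a few lines of Python; the distributions, sample size, and truncation point below are this sketch's own illustrative choices.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

def leg_shifted(j, u):
    """Normalized shifted Legendre polynomial Leg_j on [0, 1]."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

def deviance_test(u, M):
    """Deviance statistic D_M = n * sum_j LPhat_j^2, Eq. (23), and its
    asymptotic chi2_M p-value, Eq. (25); u = G(x_1), ..., G(x_n)."""
    u = np.asarray(u)
    LP = np.array([leg_shifted(j, u).mean() for j in range(1, M + 1)])
    D_M = u.size * np.sum(LP ** 2)
    return D_M, stats.chi2.sf(D_M, df=M)

rng = np.random.default_rng(0)

# Under H0 (g = f), U = G(X) is uniform and the p-value is not small.
_, p_null = deviance_test(rng.uniform(size=2000), M=4)

# Under a misspecified model, U departs from uniformity: H0 is rejected.
_, p_alt = deviance_test(rng.beta(2.0, 1.2, size=2000), M=4)
assert p_alt < 1e-3 and p_null > p_alt
```

Note that only the u-sample enters the test: any postulated model G is handled by first transforming the data through u = G(x).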
Practical remarks. Notice that H₀ in (21) implies H₀ in (22). Similarly, H₁ in (22) implies H₁ in (21); however, the opposite implications are not true in general, since there may be some non-zero LP_j coefficients for j > M. Therefore, choosing a small M may lead to conservative, yet valid, inference.

B. Confidence bands
The estimator in (9) only accounts for the first M + 1 terms of the expansion in (4); therefore, d̂(u; G, F) is, in general, a biased estimator of d(u; G, F). Specifically, as discussed in Section II D, the integrated squared bias is given by Σ_{j>M} LP²_j, whereas, as shown in Appendix B, the bias at a given point u is given by Σ_{j>M} LP_j Leg_j(u). It follows that, when the bias is large, confidence bands based on d̂(u; G, F) are shifted away from the true density d(u; G, F).

Although the bias cannot be easily quantified in the general setting, it follows from (8) that, when H₀ in (21) (and consequently H₀ in (22)) is true, both the bias at a point u and the integrated bias are equal to zero. Thus, we can exploit this property to construct reliable confidence bands under the null. Specifically, the goal is to identify c_α such that

    1 − α = P(−c_α ≤ d̂(u; G, F) − 1 ≤ c_α, for all u ∈ [0, 1] | H₀)
          = P(max_u |d̂(u; G, F) − 1| ≤ c_α | H₀)                     (26)

where α is the desired significance level.

If the bias determines where the confidence bands are centered, the distribution and the variance of d̂(u; G, F) determine their width. As discussed in Section II C (see (8)), under H₀ in (21), the LP̂_j estimates have mean zero and variance 1/n, and they are uncorrelated with one another. Therefore, when f ≡ g, the standard error of d̂(u; G, F) corresponds to the square root of (10) with σ²_j = 1 and σ_jk = 0, i.e.,

    SE[d̂(u; G, F) | H₀] = √( Σ_{j=1}^M Leg²_j(u)/n ).                (27)

Additionally, (24) implies that d̂(u; G, F) is asymptotically normally distributed; hence

    (d̂(u; G, F) − 1) / √( Σ_{j=1}^M Leg²_j(u)/n ) →d N(0, 1)
(28) as n → ∞, for all u ∈ [0, 1], under H₀.

We can then construct approximate confidence bands under H₀ which satisfy (26) by means of tube formulae (see [17, Ch. 5] and [18]), i.e.,

    [1 − c_α √(Σ_{j=1}^M Leg²_j(u)/n),  1 + c_α √(Σ_{j=1}^M Leg²_j(u)/n)],   (29)

where c_α is the solution of

    2(1 − Φ(c_α)) + (k₀/π) e^{−c²_α/2} = α,                          (30)

with k₀ a constant determined by the derivatives ∂Leg_j(u)/∂u of the basis, j = 1, . . . , M. If d̂(u; G, F) is within the bands in (29) over the entire range [0, 1], we conclude that there is no evidence that f deviates significantly from g anywhere over the range considered, at confidence level 1 − α. Conversely, we expect significant departures to occur in regions where d̂(u; G, F) lies outside the confidence bands.

Practical remarks.
Notice that, under H₀ in (22), d̂(u; G, F) is an unbiased estimator of d(u; G, F), regardless of the choice of M. This implies that the confidence bands in (29) are only affected by the variance and asymptotic distribution of d̂(u; G, F) under H₀.

In astrophysics, the statistical significance α is often expressed in terms of the number of σ-deviations from the mean of a standard normal. For instance, a 2σ significance corresponds to α = 1 − Φ(2) ≈ 0.0228, where Φ(·) denotes the cdf of a standard normal.

C. Choice of M

The number of LP̂_j estimates considered determines the level of "smoothness" of d̂(u; G, F), with smaller values of M leading to smoother estimates. The deviance test can be used to select the value M which maximizes the sensitivity of the analysis according to the following scheme:

i. Choose a sufficiently large value M_max.
ii. Obtain the estimates LP̂₁, . . . , LP̂_{M_max} as in (5).
iii. For m = 1, . . . , M_max, calculate the deviance test p-value as in (25), i.e.,

    p(m) = P(χ²_m > d_m)                                             (31)

with d_m = n Σ_{j=1}^m LP̂²_j.
iv. Choose M such that

    M = argmin_m {p(m)}.                                             (32)
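Steps i–iv above can be sketched as follows (the simulated departure from uniformity and all numerical values are illustrative assumptions of this sketch):

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

def leg_shifted(j, u):
    """Normalized shifted Legendre polynomial Leg_j on [0, 1]."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

def select_M(u, M_max=20):
    """Steps i-iv: choose the M minimizing the deviance p-value p(m)."""
    u = np.asarray(u)
    LP = np.array([leg_shifted(j, u).mean() for j in range(1, M_max + 1)])
    d = u.size * np.cumsum(LP ** 2)            # d_m = n * sum_{j<=m} LP_j^2
    pvals = stats.chi2.sf(d, df=np.arange(1, M_max + 1))       # Eq. (31)
    return int(np.argmin(pvals)) + 1, pvals                    # Eq. (32)

rng = np.random.default_rng(3)

# A smooth departure from uniformity: most of the evidence sits in the
# low-order moments, so the selected M is typically small.
u = rng.beta(1.4, 1.0, size=3000)
M, pvals = select_M(u, M_max=20)
assert 1 <= M <= 20 and pvals[M - 1] == pvals.min()
```

Because d_m is a cumulative sum, all M_max candidate tests are obtained from a single pass over the LP̂_j estimates.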
1. Adjusting for post-selection
As with any data-driven selection process, the scheme presented above affects the distribution of (9) and can lead to overly optimistic inference [19, 20]. Despite this aspect being often ignored in practical applications, correct coverage can only be guaranteed if adequate corrections are implemented.

The issues arising in the context of post-selection inference can be interpreted in terms of the look-elsewhere effect [21, 22], where one has to adjust the inference for the fact that, in practice, many different models have been considered and, consequently, many different tests have been conducted for the sake of assessing the goodness of fit.

In our setting, the number of models under comparison is typically small (M_max ≤ 20); hence, a simple Bonferroni-type correction can be applied, and the deviance p-value in (25) is replaced by

    M_max · P(χ²_M > d_M),                                           (33)

where M is the value selected via (32).

As an anonymous referee correctly pointed out, d̂(u; G, F) is always smooth, as it is constructed as a series of infinitely differentiable functions. In statistics, however, the word "smoothness" is often used to indicate the flexibility of the estimator considered or, in other words, its degrees of freedom. Often, this is quantified in terms of the magnitude of the second derivative of the function considered. Despite the abuse of terminology, throughout the manuscript we will refer to this latter definition of smoothness.

Algorithm 1: a data-scientific signal search
INPUTS: source-free sample x_B;
        postulated background distribution g_b(x);
        physics sample x.
        If available: signal distribution f_s(x, θ_s).

PHASE A: background calibration
Step 1: estimate d̂(u; G_b, F_b) on u_B = G_b(x_B) and test (21) via deviance test and CD plot.
Step 2: if F_b ≢ G_b, set f̂_b(x) = g_b(x) d̂(u; G_b, F_b); else set f̂_b(x) = g_b(x).

PHASE B: signal search
Stage 1: nonparametric signal detection
Step 3: set g(x) = f̂_b(x).
Step 4: estimate d̂(u; G, F) on u = G(x) and test (21) via deviance test and CD plot.
Step 5: if G ≢ F, claim evidence in favor of the signal and go to Step 6; else set f̂(x) = g(x), claim that no signal is present, and stop.

Stage 2: semiparametric signal characterization
Step 6: if f_s(x, θ_s) is given, fit g_bs(x) in (38); else use the CD plot of d̂(u; G, F) and the theory available to specify a suitable model for f_s(x, θ_s), and fit g_bs(x) in (38).
Step 7: estimate d̂(u; G_bs, F) on u = G_bs(x) and test (21) via deviance test and CD plot.
Step 8: if G_bs ≢ F, claim evidence of an unexpected signal and use the CD plot of d̂(u; G_bs, F) and the theory available to further investigate the nature of the deviation from G_bs; else go to Step 9.
Step 9: compute d̂̂(u; G, F) as in (39) and use it to refine f̂_b(x) or f_s(x, θ̂_s) as in (40). Go back to Step 3.

Similarly, confidence bands can be adjusted by substituting c_α in (29) with c_{α,M_max} satisfying

    2(1 − Φ(c_{α,M_max})) + (k₀/π) e^{−c²_{α,M_max}/2} = α/M_max.    (34)
As noted in Section II, the estimate in (9) involves the first M sample moments of U; therefore, M_max can be interpreted as the order of the highest moment which we expect to contribute to discriminating the distribution of U from uniformity. Notice that, in addition to the inflation of the variance of (9) when M is large, the computation of normalized shifted Legendre polynomials of higher order may face numerical instability (see Section VIII B). Therefore, as a rule of thumb, M_max is typically chosen ≤ 20. Finally, Steps i–iv aim to select the approximant based on the first M most significant moments, while excluding powers of higher order. A further note on model-denoising is given in Section VIII.

V. A DATA-SCIENTIFIC APPROACH TO SIGNAL SEARCHES
The tools presented in Sections II and IV provide a natural framework to simultaneously

(a) assess the validity of the postulated background model and, if necessary, update it using the data (Section III);
(b) perform signal detection on the physics sample;
(c) characterize the signal when a model for it is not available.

Furthermore, if the model for the signal is known (up to some free parameters), it is possible to

(d) further refine the background or signal distribution;
(e) detect hidden signals from new unexpected sources.

Notice that, since Bonferroni's correction leads to an upper bound for the overall significance, the resulting coverage will be higher than the nominal one. Alternatively, approximate post-selection confidence bands and inference can be constructed using Monte Carlo and/or resampling methods, repeating the selection process at each replicate.

Tasks (a)–(e) can be tackled in two main phases. In the first phase, the postulated background model is "calibrated" on a source-free sample in order to improve the sensitivity of the analysis and reduce the risk of false discoveries. The second phase focuses on searching for the signal of interest and involves both a nonparametric signal detection stage and a semiparametric stage for signal characterization. Both phases and their respective steps are described in detail below and summarized in Algorithm 1.
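Phase A and the nonparametric part of Phase B (Steps 1–5 of Algorithm 1) can be sketched end to end. All simulated numbers below (exponential background, signal location, sample sizes, M = 4) are illustrative assumptions; in this toy run the postulated background happens to be correct, so Step 2 keeps g_b unchanged.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

def leg_shifted(j, u):
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

def lp_test(u, M):
    """Deviance p-value of Eq. (25) on u = G(x_1), ..., G(x_n)."""
    u = np.asarray(u)
    LP = np.array([leg_shifted(j, u).mean() for j in range(1, M + 1)])
    return stats.chi2.sf(u.size * np.sum(LP ** 2), df=M)

rng = np.random.default_rng(11)
alpha = 2.87e-7                  # 5-sigma discovery threshold (one-sided)

# Phase A: calibrate the postulated background G_b on a source-free sample.
G_b = stats.expon(scale=10.0)
x_B = G_b.rvs(size=5000, random_state=rng)
p_bkg = lp_test(G_b.cdf(x_B), M=4)            # Step 1

# Phase B, Stage 1: nonparametric detection on the physics sample,
# which contains a 5% Gaussian signal on top of the background.
x_phys = np.concatenate([G_b.rvs(size=9500, random_state=rng),
                         rng.normal(15.0, 1.0, size=500)])
p_sig = lp_test(G_b.cdf(x_phys), M=4)         # Steps 3-4

assert p_bkg > alpha     # Step 2: calibration sample is clean, keep g_b
assert p_sig < alpha     # Step 5: evidence in favor of the signal
```

The semiparametric stage (Steps 6–9) additionally requires the mixture model g_bs of (38), which is introduced later in the paper and is therefore omitted from this sketch.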
A. Background calibration
As discussed in Section III, deviations of d̂(G_b(x); G_b, F_b) from one suggest that a further refinement of the candidate background model g_b is needed. However, as M increases, the deviations of d̂(G_b(x); G_b, F_b) from one may become more and more prominent while the variance inflates. Thus, it is important to assess if such deviations are indeed significant. In order to address this task, the analysis of Section III can be further refined in light of the inferential tools introduced in Section IV.

FIG. 2: Deviance test and CD plot for the source-free sample. The comparison density is estimated via (20) (solid blue line), whereas its standard error (light blue area) is computed as the square root of the estimate of the variance in (10). Finally, confidence bands have been constructed around one (grey areas) via (29), with c_α replaced by c_{α,M_max} in (34). The significance is reported with a "≥" sign to highlight that Bonferroni's correction has been applied to adjust for post-selection inference, leading to an increase of the nominal coverage.

For the toy example discussed in Section III, we have seen that g_b overestimates f_b in the signal region and underestimates it at the higher end of the range considered (Fig. 1). We can now assess if any of these deviations are significant by implementing the deviance test in (23)–(25), whereas, to identify where the most significant departures occur, we construct confidence bands under the null model as in (29), i.e., assuming that no "update" of g_b is necessary.

The results are collected in the comparison density plot, or CD plot, presented in Fig. 2. First, with the value of M selected as in Section IV C, the deviance test indicates that the departure of f_b from g_b is significant at a level above 6σ (see the adjusted p-value reported in Fig. 2).
Additionally, the estimated comparison density in (20) lies outside the $2\sigma$ confidence bands over the region where the signal specified in (18) is expected to occur. Hence, using (19) instead of (17) is recommended in order to improve the sensitivity of the analysis in the signal region.

Important remarks on the CD plot.
When comparing different models for the background, or when assessing whether the data distribution deviates from the model expected when no signal is present, it is common practice to visualize the results of the analysis by superimposing the models under comparison on the histogram of the data observed on the original scale (e.g., upper panel of Fig. 1). This corresponds to a data visualization in the density domain. Conversely, the CD plot (e.g., Fig. 2) provides a representation of the data in the quantile domain, which offers the advantage of connecting the true density of the data with the quantiles of the postulated model (see (2)-(3)). Consequently, the most substantial departures of the data distribution from the expected model are magnified, and those due to random fluctuations are smoothed out (see, also, Section VII B). Furthermore, the deviance test and the CD plot together provide a powerful goodness-of-fit and exploratory tool which, unlike classical methods such as Anderson-Darling [24] and Kolmogorov-Smirnov [26], not only allows one to test whether the distributions under comparison differ, but also to assess how and where they differ. As a result, the CD plot can be used to characterize the unknown signal distribution (see Section V B 2) and to identify exclusion regions (e.g., Case I in Section V B 1).

As an additional advantage, the deviance test appears to enjoy higher detection power than classical approaches. This aspect is highlighted in Table I, where several methods for goodness of fit or two-sample comparisons are implemented, along with the deviance test, for all the cases discussed in Section V.
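In code, the estimator and test just described reduce to a few lines. The sketch below assumes an orthonormal shifted Legendre basis on [0, 1], coefficient estimates given by sample means of the basis evaluated at $u_i = G_b(x_i)$, and a deviance statistic $n\sum_j \widehat{LP}_j^2$ compared against a $\chi^2_M$ null; the function names and the exponential toy model are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.special import eval_legendre
from scipy.stats import chi2

def leg(j, u):
    # shifted Legendre polynomial of degree j, orthonormal on [0, 1]
    return np.sqrt(2 * j + 1) * eval_legendre(j, 2 * u - 1)

def cd_estimate(x, G_b, M):
    """Estimate d(u; G_b, F_b) from a sample x given the postulated
    background cdf G_b, and return the (unadjusted) deviance test."""
    u = G_b(x)                                   # data in the quantile domain
    LP = np.array([leg(j, u).mean() for j in range(1, M + 1)])
    d_hat = lambda t: 1.0 + sum(c * leg(j, t) for j, c in enumerate(LP, 1))
    deviance = len(x) * np.sum(LP**2)            # n * sum of squared LP coefficients
    pval = chi2.sf(deviance, df=M)
    return d_hat, deviance, pval

# toy check: data generated from the postulated model itself
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=5000)
G_b = lambda t: 1.0 - np.exp(-t / 2.0)           # correct background cdf
d_hat, dev, pval = cd_estimate(x, G_b, M=4)      # d_hat should hover near one
```

When `G_b` is misspecified (e.g., a wrong scale parameter), the deviance explodes and the p-value collapses, mirroring the calibration logic of this section.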
Reliability of the calibrated background model.
The size $N$ of the source-free sample plays a fundamental role in the validity of $\hat{f}_b(x)$ as a reliable background model. Specifically, the randomness involved in (19) only depends on the $\widehat{LP}_j$ estimates. If $N$ is sufficiently large, by the strong law of large numbers,

$P\big(\lim_{N \to \infty} \widehat{LP}_j = LP_j\big) = 1.$

Therefore, despite the variance of $\hat{f}_b(x)$ becoming negligible as $N \to \infty$, one has to account for the fact that $\hat{f}_b(x)$ leads to a biased estimate of $f_b(x)$ when $f_b \not\equiv g_b$ (see Section II D). For sufficiently smooth densities, a visual inspection is often sufficient to assess whether $\hat{d}(u; G_b, F_b)$ (and, consequently, $\hat{f}_b(x)$) provides a satisfactory fit for the data, whereas, for more complex distributions, the effect of the bias can be mitigated by considering larger values of $M$ and model-denoising (see Section VIII A).

B. Signal search
1. Nonparametric signal detection
The background calibration phase allows the specification of a well-tailored model for the background, namely $\hat{f}_b(x)$, which simultaneously integrates the initial guess, $g_b$, and the information carried by the source-free data sample. Hereafter, we disregard the source-free data sample and focus on analyzing the physics sample.

Under the assumption that the source-free sample has no significant source contamination, we expect that, if the signal is absent, both the source-free and the physics sample follow the same distribution. Therefore, the calibrated background model, $\hat{f}_b(x)$, plays the role of the postulated distribution for the physics sample, i.e., the model that we expect the data to follow when no signal is present; hence, we set $g(x) = \hat{f}_b(x)$. Let $f(x)$ be the (unknown) true pdf of the physics sample, which may or may not carry evidence in favor of the source of interest. When no model for the signal is specified, it is reasonable to consider any significant deviation of $f$ from $g$ as an indication that a signal of unknown nature may be present.

TABLE I: Comparison of the deviance test and classical inferential tools. The first two columns report the p-values of the Anderson-Darling [24] and Cramér-von Mises [26] goodness-of-fit tests obtained assuming as theoretical distribution the same $G$ indicated in Sections V A and V B for the calibration phase and Cases I, II and III, respectively. The raw deviance p-values and their post-selection adjusted counterparts are reported in the third column. Finally, the fourth and fifth columns report, respectively, the Kolmogorov-Smirnov [26] and Wilcoxon rank sum [25] tests used to compare directly the physics samples in Cases I, II and III with the source-free sample used in Section V A.

Sample      | Anderson-Darling | Cramér-von Mises | Deviance (adjusted)  | Kolmogorov-Smirnov | Wilcoxon rank sum
Calibration | 1.…·10^-…        | …·10^-…          | …·10^-… (6.…·10^-…)  | -                  | -
Case I      | 0.…              | 0.…              | 0.… (> 1)            | 0.9248             | 0.5487
Case II     | 4.…·10^-…        | …·10^-…          | …·10^-… (1.…·10^-…)  | 1.…·10^-…          | …·10^-…
Case III    | 4.…·10^-…        | …·10^-…          | …·10^-… (5.…·10^-…)  | 2.…·10^-…          | …·10^-…

FIG. 3: Deviance test and CD plots for Case I, where no signal is present (left panel), and Case II, where the signal is present (right panel). In both cases, the postulated distribution $G$ corresponds to the cdf of the calibrated background model in (35). For the sake of comparison, $d(u; G, F)$ has been estimated via (9) with $M = 4$ in both panels.

In what follows, we examine where deviations of $f$ from $g$ occur. Two possible scenarios are considered: a physics sample which collects only background data (Case I) and a physics sample of observations from both background and signal (Case II).

Case I: background-only.
Let x be a physics sample of $n$ observations whose true distribution $f(x)$ is equivalent to $f_b(x)$ in (16). We set

$g(x) = \hat{f}_b(x) = g_b(x)\,\hat{d}(G_b(x); G_b, F_b)$   (35)

where $g_b(x)$ and $\hat{d}(G_b(x); G_b, F_b)$ are defined as in (17) and (20), respectively. The resulting CD plot and deviance test are reported in the left panel of Fig. 3. When applying the scheme in Section IV C, none of the values of $M$ considered leads to significant results; therefore, for the sake of comparison with Case II below, we choose $M = 4$. Not surprisingly, the estimated comparison density approaches one over the entire range and lies entirely within the confidence bands. This suggests that the true distribution of the data does not differ significantly from the model which accounts only for the background. Similarly, the deviance test leads to very low significance (adjusted p-value > 1).

Case II: background + signal.
Let x be a physics sample of $n$ observations whose true distribution $f(x)$ is equal to $f_{bs}(x)$ in (36),

$f_{bs}(x) = (1-\eta)\, f_b(x) + \eta\, f_s(x)$   (36)

with $f_b(x)$ and $f_s(x)$ defined as in (16) and (18), respectively, and $\eta = 0.15$. The histogram of the data and the graph of $f_{bs}(x)$ are plotted in Fig. 4. As in Case I, we set $g(x)$ as in (35). The CD plot and deviance test in the right panel of Fig. 3 show a significant departure of the data distribution from the background-only model in (35). The maximum significance of the deviance is achieved at $M = 4$, leading to a rejection of the null hypothesis at a significance level of more than $11\sigma$ (adjusted p-value $= \dots \cdot 10^{-\dots}$). The CD plot shows a prominent peak at the lower end of the spectrum; hence, we conclude that there is evidence in favor of the signal, and we proceed to characterize its distribution as described in Section V B 2.

FIG. 4: Histogram of a physics sample of $n$ observations with true density $f_{bs}(x)$, together with the estimates $\hat{f}(x)$ computed as in (11), by plugging in the $\hat{d}(G(x); G, F)$ estimates obtained with $M = 4$ and $M = 9$, and the fitted model $g_{bs}(x)$ in (38).
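As Fig. 4 illustrates, moving from the quantile domain back to the density domain amounts to multiplying the postulated pdf by the estimated comparison density, i.e., $\hat{f}(x) = g(x)\,\hat{d}(G(x))$. Below is a minimal sketch of this Eq. (11)-style construction under the same assumptions as before (orthonormal shifted Legendre basis); the exponential postulated model and the function names are illustrative.

```python
import numpy as np
from scipy.special import eval_legendre
from scipy.stats import expon

def leg(j, u):
    # shifted Legendre polynomial of degree j, orthonormal on [0, 1]
    return np.sqrt(2 * j + 1) * eval_legendre(j, 2 * u - 1)

def density_estimate(x_grid, x_sample, g_pdf, g_cdf, M):
    """Density-domain estimate f^(x) = g(x) * d^(G(x)): the postulated
    pdf g is tilted by the comparison density estimated from the sample."""
    u_sample = g_cdf(x_sample)
    LP = np.array([leg(j, u_sample).mean() for j in range(1, M + 1)])
    u_grid = g_cdf(x_grid)
    d_hat = 1.0 + sum(c * leg(j, u_grid) for j, c in enumerate(LP, 1))
    return g_pdf(x_grid) * d_hat

# illustrative postulated model: an exponential background
rng = np.random.default_rng(11)
x = rng.exponential(2.0, size=3000)
grid = np.linspace(0.01, 10.0, 200)
f_hat = density_estimate(grid, x, expon(scale=2.0).pdf, expon(scale=2.0).cdf, M=4)
```

Since the data here are drawn from the postulated model itself, the tilt factor stays close to one and `f_hat` essentially reproduces the exponential pdf; with contaminated data it would instead bend toward the true density.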
2. Semiparametric signal characterization
The signal detection strategy proposed in Section V B 1 does not require the specification of a distribution for the signal. However, if a model for the signal is known (up to some free parameters), the analysis can be further refined by providing a parametric estimate of the comparison density and assessing whether additional signals from new unexpected sources are present.
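The parametric refinement step amounts to a standard ML fit of the signal fraction (and, in general, of $\theta_s$) in the mixture $(1-\eta)\hat{f}_b + \eta f_s$. The sketch below fixes $\theta_s$ for simplicity and uses an exponential stand-in for the calibrated background and a Gaussian stand-in for the signal; all names and numerical values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import expon, norm

# stand-ins (illustrative): calibrated background pdf and known signal pdf
f_b_hat = expon(scale=2.0).pdf
f_s = norm(loc=2.5, scale=0.3).pdf

def fit_eta(x):
    """ML estimate of the signal fraction eta in the mixture
    (1 - eta) * f_b_hat + eta * f_s, i.e., the Eq. (37)-(38)-type step."""
    def nll(eta):
        mix = (1.0 - eta) * f_b_hat(x) + eta * f_s(x)
        return -np.sum(np.log(np.clip(mix, 1e-300, None)))
    return minimize_scalar(nll, bounds=(0.0, 1.0), method="bounded").x

rng = np.random.default_rng(7)
n, eta_true = 4000, 0.15
n_s = rng.binomial(n, eta_true)
x = np.concatenate([rng.exponential(2.0, n - n_s), rng.normal(2.5, 0.3, n_s)])
eta_hat = fit_eta(x)          # should land near eta_true = 0.15
```

The fitted mixture $(1-\hat{\eta})\hat{f}_b + \hat{\eta} f_s$ then plays the role of the postulated model when re-running the deviance test and CD plot.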
Case IIa: background + (known) signal.
Assume that a model for the signal, $f_s(x, \theta_s)$, is given, with $\theta_s$ being a vector of unknown parameters. Since the CD plot in the right panel of Fig. 3 provides evidence in favor of the signal, we expect the data to be distributed according to the pdf

$\hat{f}_{bs}(x) = (1-\eta)\,\hat{f}_b(x) + \eta\, f_s(x, \theta_s), \quad 0 \leq \eta \leq 1,$   (37)

where $\hat{f}_b(x)$ is the calibrated background distribution in (35), and $\eta$ and $\theta_s$ can be estimated via Maximum Likelihood (ML). Letting $\hat{\eta}$ and $\hat{\theta}_s$ be the ML estimates of $\eta$ and $\theta_s$, respectively, we specify

$g_{bs}(x) = (1-\hat{\eta})\,\hat{f}_b(x) + \hat{\eta}\, f_s(x, \hat{\theta}_s)$   (38)

as the postulated model. For simplicity, let $f_s$ be fully specified as in (18); we construct the deviance test and the CD plot to assess whether (38) deviates significantly from the true distribution of the data. The scheme in Section IV C has been implemented with $M_{max} = 20$, and none of the values of $M$ considered led to significant results. The CD plot and deviance test for $M = 3$ (adjusted p-value > 1) suggest that no significant deviations occur; thus, (38) is a reliable model for the physics sample.

Moreover, we can use (38) to further refine our $\hat{f}_b(x)$ or $f_s(x, \hat{\theta}_s)$ distributions. Specifically, we first construct a semiparametric estimate of $d(G(x); G, F)$, i.e.,

$\hat{\hat{d}}(G(x); G, F) = \frac{(1-\hat{\eta})\,\hat{f}_b(x) + \hat{\eta}\, f_s(x, \hat{\theta}_s)}{\hat{f}_b(x)},$   (39)

and rewrite

$\hat{\hat{f}}_b(x) = \frac{\hat{f}_b(x)\,\hat{\hat{d}}(G(x); G, F) - \hat{\eta}\, f_s(x, \hat{\theta}_s)}{1-\hat{\eta}}, \qquad \hat{\hat{f}}_s(x) = \frac{\hat{f}_b(x)\,\hat{\hat{d}}(G(x); G, F) - (1-\hat{\eta})\,\hat{f}_b(x)}{\hat{\eta}}.$   (40)

In the upper right panel of Fig. 5, the true comparison density (grey dashed line) of our physics sample is compared with its semiparametric estimate computed as in (39) (pink dashed line) with $f_s(x, \hat{\theta}_s) = f_s(x)$ in (18). The graphs of two nonparametric estimates of $d(u; G, F)$, computed via (9) with $M = 4$ and $M = 9$, are also displayed. The semiparametric estimate recovers the true $d(u; G, F)$ almost exactly, whereas both nonparametric estimates show some discrepancies from the true comparison density. All the estimates suggest that there is only one prominent peak, in correspondence of the signal region.

When moving from the comparison density domain to the density domain in Fig. 4, the discrepancies between the nonparametric estimates and the true density $f(x)$ are substantially magnified. Specifically, when computing (9) and (11) with $M = 9$, the resulting $\hat{f}(x)$ exhibits high bias at the boundaries (dotted black line).

Case IIb: background + (unknown) signal.
When the signal distribution is unknown, the CD plot of $\hat{d}(u; G, F)$ can be used to guide the scientist in navigating across the different theories on the astrophysical phenomenon under study and in specifying a suitable model for the signal, i.e., $f_s$. The model proposed can then be validated, as in Case IIa, by fitting (38) and constructing deviance tests and CD plots. At this stage, the scientist has the possibility to iteratively query the data and explore the distribution of the signal by assuming different models. A viable signal characterization is achieved when no significant deviations of $\hat{d}(u; G_{bs}, F)$ from one are observed (e.g., see the upper left panel of Fig. 5). Notice that a similar approach can also be followed in the background calibration stage (Section V A) to provide a parametric characterization of the background distribution.

Boundary bias is a common problem among nonparametric density estimation procedures [e.g., 17, Ch.5, Ch.8]. When aiming for a nonparametric estimate of the data density $f(x)$, solutions exist to mitigate this problem [e.g., 27].

FIG. 5: Upper panels: deviance test and CD plots for Case IIa, where the signal is present, and the postulated distribution $G_{bs}$ corresponds to the cdf of the estimated background+signal model in (38) with $\hat{\eta} = 0.\dots$ and $M = 3$. Bottom panels: deviance test and CD plots for Case III where, in addition to the signal of interest, an additional resonance is present. The data are first analyzed considering the background-only pdf in (17) as the postulated model (left panel). The analysis is then repeated by assuming the fitted background + signal model in (38) as the postulated distribution (right panel). Both estimates of the comparison density in the left and right panels have been computed as in (9) with $M = 9$.

Case III: background + (known) signal + unexpected source.
The tools proposed so far can also be used to detect signals from unexpected sources whose pdfs are, by design, unknown. Suppose that the physics sample x contains $n$ observations and that the true distribution $f(x)$ is equal to $f_{bsh}(x)$,

$f_{bsh}(x) = (1-\eta_1-\eta_2)\, f_b(x) + \eta_1\, f_s(x) + \eta_2\, f_h(x)$   (41)

where $f_h(x)$ is the pdf of the unexpected signal, whose distribution is assumed to be normal with center at 37 and width 1.8. Let $f_b(x)$ and $f_s(x)$ be defined as in (16) and (18), respectively, and let $\eta_1 = 0.15$. As postulated model we first consider the background-only distribution, setting $g(x) = \hat{f}_b(x)$ in (35). The respective CD plot and deviance tests are reported in the bottom left panel of Fig. 5.

Choosing $M = 9$, as in (32), both the CD plot and the deviance test indicate a significant departure from the expected background-only model, and a prominent peak is observed in correspondence of the signal of interest centered around 25. A second but weaker peak appears to be right on the edge of our confidence bands, suggesting the possibility of an additional source. At this stage, if $f_s$ were unknown, we could proceed with a semiparametric signal characterization as in Case IIb. Assuming instead that the distribution of the signal of interest is known and given by (18), we fit (38), with $\hat{\eta}$ estimated via ML, aiming to capture a significant deviation in correspondence of the second bump. This is precisely what we observe in the bottom right panel of Fig. 5. Here the estimated comparison density deviates from one around 35, providing evidence in favor of an additional signal in this region. We can then proceed as in Case IIb by exploring the theories available and/or collecting more data to further investigate the nature and the cause of the unanticipated bump.

VI. SIGNAL DETECTION WITHOUT CALIBRATION SAMPLE AND MODEL SELECTION
There are situations where a source-free sample is simply not available, and thus the calibration phase in Section V A cannot be implemented. The tools described in Sections II and IV can, however, still be applied in order to perform signal detection and goodness-of-fit when a model for the signal is known, up to some free parameters. In this framework, we expect the data either to come only from the signal (with at most some negligible background contamination) or only from the background.

In order to illustrate how to proceed in this setting, we consider a dark matter search where the postulated model for dark matter γ-ray emissions is the one of [29, Eq. 29], i.e.,

$g_{DM}(y) = \frac{0.73\, M_\chi^{0.5}}{y^{1.5}\, k_{M_\chi}}\, \exp\Big\{-\frac{7.8\, y}{M_\chi}\Big\}$   (42)

with $y$ and $M_\chi$ measured in Teraelectronvolts (TeV), and $k_{M_\chi}$ a normalizing constant. The goal is to show that, when considering a background-only sample, the method proposed correctly rejects (42) as a suitable model for the data, whereas, when considering a dark matter sample, the dark matter model in (42) is "accepted".

To further increase the complexity of the problem, we consider a situation where the background sample corresponds to γ-ray emissions due to a pulsar, with distribution

$g_{PS}(y) = \frac{y}{k_\tau}\, \exp\Big\{-\Big(\frac{y}{y_0}\Big)^{\tau}\Big\},$   (43)

with $y_0$ fixed, $\tau > 0$, and $k_\tau$ a normalizing constant. Notice that, as discussed in [30], distinguishing γ-ray emissions due to pulsars from those due to dark matter is a particularly challenging task. The histograms of the two datasets considered are shown in Figure 6; the overlapping curves correspond to the best fit of the models in (42) and (43) on each sample.

FIG. 6: Dark matter and pulsar samples. The left panel corresponds to the histogram of a sample of 2000 observations simulated from the model in (42). The right panel corresponds to the histogram of a sample of 2000 observations simulated from the model in (43) with $\tau = 2$. The best fits of (42) and (43) are also reported as blue solid lines and black dashed lines, respectively, on top of each histogram.
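The two spectral models can be sketched as plain densities with numerically computed normalizing constants. In the code below, the dark-matter shape follows the standard $0.73\,x^{-1.5}e^{-7.8x}$ parametrization with $x = y/M_\chi$ (several constants printed in (42)-(43) are garbled in this copy), and the energy range, the $M_\chi$ value, the pulsar power-law index and the cutoff $y_0$ are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

Y_LO, Y_HI = 0.1, 100.0   # assumed search region in TeV (illustrative)

def g_dm(y, m_chi):
    """Dark-matter spectrum in the spirit of Eq. (42): the standard
    0.73 * x**-1.5 * exp(-7.8 x) shape with x = y / m_chi, normalized
    numerically over [Y_LO, Y_HI]."""
    shape = lambda t: 0.73 * m_chi**0.5 * t**-1.5 * np.exp(-7.8 * t / m_chi)
    k, _ = quad(shape, Y_LO, Y_HI)
    return shape(y) / k

def g_ps(y, tau, y0=0.5, index=1.5):
    """Pulsar-like cutoff power law in the spirit of Eq. (43); y0 and
    the power-law index are illustrative placeholders."""
    shape = lambda t: t**-index * np.exp(-((t / y0) ** tau))
    k, _ = quad(shape, Y_LO, Y_HI)
    return shape(y) / k

# sanity check: both models integrate to one over the search region
area_dm, _ = quad(lambda t: g_dm(t, m_chi=3.5), Y_LO, Y_HI)   # m_chi illustrative
area_ps, _ = quad(lambda t: g_ps(t, tau=2.0), Y_LO, Y_HI)
```

With the normalizing constants handled this way, either density can be plugged in as the postulated model $g$ of the CD-plot machinery, which is exactly how the two samples are compared against (42) and (43) below.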
Interestingly, for both samples, (42) and (43) provide a very similar fit to the data; hence the importance of correctly selecting the most adequate model or of excluding the dark matter hypothesis when observing emissions due to pulsars.

The upper panels of Figure 7 display the CD plots obtained by setting $g = g_{DM}$ in (42) as the postulated model and comparing it with the distribution of the dark matter sample (upper left panel) and of the background pulsar sample (upper right panel). Remarkably, the CD plots and the adjusted deviance tests correctly lead to the conclusion that the distribution of the dark matter sample does not deviate significantly from (42), whereas the distribution of the pulsar sample does deviate substantially from (42), and the deviance test (adequately adjusted for post-selection inference) rejects the dark matter model with more than $3\sigma$ significance (adjusted p-value of $4.87 \cdot 10^{-\dots}$). Notice that, in both cases, we are ignoring the information regarding the pulsar distribution, and the only inputs considered are the data and the signal model in (42).

Finally, when incorporating the knowledge of the pulsar distribution in (43) into the analysis, one can select between the models in (42) and (43) by constructing additional CD plots and deviance tests for both samples and setting $g = g_{PS}$ in (43). The results are shown in the lower panels of Figure 7. As expected, the dark matter model is rejected (lower left panel) with more than $2\sigma$ significance (adjusted p-value of 0.0108), whereas the pulsar sample shows no significant deviation from the pulsar model (adjusted p-value > 1, lower right panel).

FIG. 7: CD plots and deviance tests for the dark matter and pulsar samples. The upper left panel displays the CD plot and deviance test for the dark matter sample with $g = g_{DM}$ in (42). The upper right panel compares the distribution of the pulsar sample with the model in (42). The lower left panel displays the CD plot and deviance test for the dark matter sample with $g = g_{PS}$ in (43). The lower right panel displays the CD plot and deviance test for the pulsar sample with $g = g_{PS}$ in (43). For the plots on the left, the size of the basis selected is $M = 6$, whereas a different value of $M$ is selected for the plots on the right.

VII. BACKGROUND MISMODELLING DUE TO INSTRUMENTAL NOISE AND UPPER LIMITS CONSTRUCTIONS
When conducting real data analyses, one has to take into account that the data generating process is affected by both statistical and non-random uncertainty due to instrumental noise. As a result, even when a model for the background is known, the data distribution may substantially deviate from it due to the smearing introduced by the detector [e.g., 31]. In order to account for the instrumental error affecting the data, it is common practice to consider folded distributions, where the errors due to the detector are often modelled assuming a normal distribution or estimated via non-parametric methods [e.g., 32, 33]. In Section VII A, it is shown how the same approach described in Sections V A and V B 1 can be used to assess whether the instrumental error is negligible and, when it is not, how to update the postulated background model in order to incorporate the instrumental noise. Section VII B discusses upper limit constructions by means of comparison distributions.
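As a concrete (and deliberately simplified) illustration of a folded distribution, the sketch below convolves a clean background pdf with a constant-width Gaussian resolution kernel on a grid; real detector responses are energy dependent, and all names and values here are illustrative.

```python
import numpy as np

def fold(pdf, grid, sigma):
    """Fold a 'clean' pdf with a Gaussian detector-resolution kernel by
    direct numerical convolution on a grid; a toy stand-in for the
    folded distributions mentioned in the text (sigma assumed constant)."""
    dx = grid[1] - grid[0]
    # K[i, j] = probability density of measuring grid[i] given true grid[j]
    K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / sigma) ** 2)
    K /= K.sum(axis=0) * dx                 # each true energy smears to total mass one
    folded = K @ pdf(grid) * dx
    return folded / (folded.sum() * dx)     # renormalize on the truncated grid

grid = np.linspace(1.0, 100.0, 2000)
power_law = lambda x: x ** -2.4             # Eq. (44)-style shape (unnormalized)
smeared = fold(power_law, grid, sigma=2.0)
```

Comparing a source-free sample against the unfolded model via the CD plot, as done next, is precisely what reveals whether a smearing step of this kind is needed.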
A. Modelling the instrumental error
The data considered come from a simulated observation by the Fermi Large Area Telescope [28] with realistic representations of the effects of the detector and present backgrounds [21, 34]. The Fermi-LAT is a pair-conversion γ-ray telescope on board the earth-orbiting Fermi satellite. It measures energies and images γ-rays between about 100 MeV and several TeV. The goal of the analysis is to assess whether the data could result from the self-annihilation of a dark matter particle. Let the distribution of the astrophysical background be a power law, i.e.,

$g_b(x) = \frac{k_\phi}{x^{\phi+1}}$   (44)

where $k_\phi$ is a normalizing constant and $x$ is measured in Gigaelectronvolts (GeV). Equation (44) corresponds to the distribution we would expect the background to follow if there were no smearing of the detector. The left panel of Figure 8 shows the histogram of a source-free sample of 35,157 i.i.d. observations from a power-law distributed background source with index 2.4.

FIG. 8: Histograms of simulated Fermi-LAT samples. The left panel displays the histogram of a source-free simulated Fermi-LAT sample of N = 35,157 observations, whereas the black dashed line corresponds to the best fit of the power-law model in (44). The right panel shows the histograms of two simulated Fermi-LAT physics samples of n = 200 observations. The grey histogram corresponds to the background-only sample, whereas the blue histogram corresponds to the dark matter signal sample.

Estimating $\hat{d}(G_b(x); G_b, F_b)$ and $\hat{f}_b$ as in (9) and (19), the CD plot and deviance test in the left panel of Figure 9 indicate that the smearing introduced by the detector is not negligible; the resulting calibrated background model is

$\hat{f}_b(x) = \frac{k_{\hat{\phi}}}{x^{\hat{\phi}+1}}\,\Big(1 + \sum_{j=1}^{4} \widehat{LP}_j\, \mathrm{Leg}_j[G_b(x)]\Big),$   (45)

where $G_b(x)$ is the cdf of (44) and $\hat{\phi}$ is the ML estimate of $\phi$ in (44).

For the sake of comparison, the same analysis has been repeated considering 35,157 i.i.d. observations from a power-law background source with index 2.4 and no instrumental error. The respective CD plot and deviance test are shown in the right panel of Figure 9 and indicate that the power-law model in (44), with $\phi$ replaced by its MLE, provides a satisfactory fit for the data in this case.

FIG. 9: Deviance test and CD plot for the source-free simulated Fermi-LAT sample including the instrumental error (left panel) and for a simulated sample with no instrumental error (right panel).

B. Signal detection and upper limit construction
Once a calibrated background distribution has been obtained, we proceed with the signal detection phase by setting $g(x) = \hat{f}_b(x)$ in (45). Similarly to Section V B 1, two physics samples are given: one containing 200 observations from the background source, distributed as in (44), and the other containing 200 observations from a dark matter emission. The signal distribution from which the data have been simulated is the pdf of γ-ray dark matter energies in [29, Eq. 28]. The scheme in Section IV C selects $M = 3$; therefore, for the sake of comparison, we choose $M = 3$ for both samples. For the background-only sample, no significant deviation from the calibrated background model is observed, whereas, for the dark matter sample, a significant departure is detected with more than $3\sigma$ significance (adjusted p-value $= 4.552 \cdot 10^{-\dots}$). As in (39), it is possible to proceed with the signal characterization stage (see Section V B 2); however, in this setting, one has to account for the fact that the signal distribution must also include the smearing effect of the detector.

As an anonymous referee pointed out, it is important to discuss how upper limits and Brazil plots can be constructed via LP modelling and how they relate to the constructs discussed so far in this manuscript. Indeed, the confidence bands reported in the CD plots are themselves upper limits. Specifically, in the signal detection framework of Section V B 1, the confidence bands in (29) are constructed assuming that there is no signal in the data: they correspond to the regions where the comparison density estimator is expected to lie, at the $1-\alpha$ confidence level, if the data include background-only events. Conversely, any deviation from the confidence bands characterizes the quantiles at which the data distribution does not conform with the one postulated under the assumption that no signal is present.

When the interest is in identifying areas of the search region where deviations from the background model occur, one can exploit the fact that $u = G(x)$, and thus upper limits and classical "Brazil plots" based on the comparison density can be obtained by plotting (20) and the respective confidence bands in (29) as a function of $x$. This is shown, for our Fermi-LAT example, in the bottom panels of Figure 10. Indeed, the upper and bottom panels of Figure 10 carry essentially the same information in two different domains: the CD plots display the departure of $f$ from $g$ in the quantile domain, whereas the Brazil plots show the same differences in the frequency domain. For signal detection purposes, the bottom panels may be preferred, as they identify the locations where substantial deviations between the background and signal models occur, whereas the CD plots are more suitable for goodness-of-fit purposes, as they provide a simultaneous visualization of the differences occurring at each quantile of the distribution.

VIII. MODEL-DENOISING
As discussed in Section II D, the choice of $M$ affects the resulting estimator of $d(u; G, F)$ in terms of both bias and variance. When dealing with complex background distributions, a large value of $M$ may be necessary to reduce the bias of the estimated comparison density. At the same time, however, a large value of $M$ leads to an inflation of the variance. In other words, considering a basis of $M$ shifted Legendre polynomials may lead to overfitting. Practically speaking, overfitting leads to wiggly (i.e., non-smooth) estimates, and thus one may overcome this limitation by attempting to denoise the estimator in (9). Section VIII A reviews the model-denoising approach proposed by [8, 10], whereas Section VIII B briefly discusses inference and model selection in this setting. Finally, Section VIII C compares the results obtained with a full and a denoised solution on the examples of Section V.

A. AIC denoising
Let $\widehat{LP}_1, \ldots, \widehat{LP}_M$ be the estimates of the first $M$ coefficients of the expansion in (4). The most "significant" $LP_j$ coefficients are selected by sorting the respective $\widehat{LP}_j$ estimates so that

$\widehat{LP}^2_{(1)} \geq \widehat{LP}^2_{(2)} \geq \cdots \geq \widehat{LP}^2_{(M)}$

and choosing the value $k = 1, \ldots, M$ for which $\mathrm{AIC}(k)$ in (46) is maximum,

$\mathrm{AIC}(k) = \sum_{j=1}^{k} \widehat{LP}^2_{(j)} - \frac{2k}{n}.$   (46)

The AIC-denoised estimator of $d(u; G, F)$ is given by

$\hat{d}^{*}(u; G, F) = 1 + \sum_{j=1}^{k^{*}_M} \widehat{LP}_{(j)}\, \mathrm{Leg}_{(j)}(u)$   (47)

where $\widehat{LP}_{(j)}$ is the estimate whose square is the $j$-th largest among $\widehat{LP}^2_1, \ldots, \widehat{LP}^2_M$, $\mathrm{Leg}_{(j)}(u)$ is the respective shifted Legendre polynomial, and

$k^{*}_M = \arg\max_k \{\mathrm{AIC}(1), \ldots, \mathrm{AIC}(M)\}.$   (48)

Practical remarks.
Recall that the first $M$ coefficients $LP_j$ can be expressed as a linear combination of the first $M$ moments of $U$. Thus, the AIC-denoising approach selects the $LP_j$ coefficients which carry all the "sufficient" information on the first $M$ moments of the distribution.

FIG. 10: Deviance test, CD plots and Brazil plots for the simulated Fermi-LAT background-only sample of size 200 (left panels) and the simulated Fermi-LAT dark matter signal sample of size 200 (right panels). In both cases, the postulated distribution $G$ corresponds to the cdf of the calibrated background model in (45). For the sake of comparison, $d(u; G, F)$ has been estimated via (9) with $M = 3$ in both cases.

B. Inference after denoising
The deviance test can be used, as in Section IV C, to choose the size of the initial basis of $M$ polynomials among $M_{max}$ possible models. Then, the $k^{*}_M$ largest coefficients are chosen by maximizing (46). This two-step procedure selects $\hat{d}^{*}(u; G, F)$ in (47) from a pool of $M_{tot} = M_{max} + M(M-1)$ possible estimators. Therefore, the Bonferroni-adjusted p-value of the deviance test is given by

$M_{tot} \cdot P\big(\chi^2_{k^{*}_M} > d_{k^{*}_M}\big)$   (49)

with $d_{k^{*}_M} = n \sum_{j=1}^{k^{*}_M} \widehat{LP}^2_{(j)}$. Similarly, confidence bands can be constructed as

$\Bigg[\, 1 - c_{\alpha, M_{tot}} \sqrt{\sum_{j=1}^{k^{*}_M} \frac{\mathrm{Leg}^2_{(j)}(u)}{n}}\,,\;\; 1 + c_{\alpha, M_{tot}} \sqrt{\sum_{j=1}^{k^{*}_M} \frac{\mathrm{Leg}^2_{(j)}(u)}{n}}\,\Bigg]$   (50)

where $c_{\alpha, M_{tot}}$ is the solution of

$2\big(1 - \Phi(c_{\alpha, M_{tot}})\big) + \frac{k}{\pi}\, e^{-0.5\, c^2_{\alpha, M_{tot}}} = \frac{\alpha}{M_{tot}}.$   (51)

Practical remarks.
Given the possibility of denoising our solution, one may legitimately wonder why not consider a large value of M_max, e.g., M_max = 100, and then select k*_{M_max} directly. In other words, why should we first implement the procedure in Section IV C and only afterwards refine our estimator as in Section VIII A, and not vice versa? There are two main reasons why such an approach is discouraged. First, if the selection stage proposed in Section IV C is ignored, there is no guarantee that the resulting k*_{M_max} would include all the LP̂_j terms that provide the strongest evidence in favor of H1 in (22); therefore, the resulting p-value can in principle be larger than the one in (49). Indeed, the AIC criterion in (46) aims to improve the fit of the estimator to the data, whereas the deviance selection criterion in (32) aims to maximize the power of the inferential procedure. Second, choosing M_max = 100 is computationally unfeasible with most of the standard programming languages, such as R and Python, and the numerical computation of (9) may easily lead to divergent or inaccurate results.
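The estimators throughout are built on the normalized shifted Legendre basis. As a minimal numerical sketch (assuming the standard normalization Leg_j(u) = √(2j+1) P_j(2u − 1), orthonormal on [0, 1]), the basis can be evaluated with numpy; evaluating such polynomials at high degree is precisely where the numerical instabilities mentioned above arise.

```python
import numpy as np
from numpy.polynomial import legendre as L

def leg(j, u):
    """Normalized shifted Legendre polynomial on [0, 1]:
    Leg_j(u) = sqrt(2j + 1) * P_j(2u - 1), so that
    int_0^1 Leg_j(u) Leg_k(u) du = 1 if j == k, else 0."""
    coef = np.zeros(j + 1)
    coef[j] = 1.0
    return np.sqrt(2 * j + 1) * L.legval(2 * np.asarray(u) - 1.0, coef)

# midpoint-rule check of orthonormality on a fine grid
u = (np.arange(200_000) + 0.5) / 200_000
print(np.mean(leg(3, u) ** 2))          # ~ 1.0
print(np.mean(leg(2, u) * leg(3, u)))   # ~ 0.0
```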
C. Comparing full and denoised solution
Fig. 11 compares the fit of the estimators d̂(u; G, F) and d̂*(u; G, F) for the examples in Section V. For all the cases considered, M and k*_M have been selected as in (32) and (48) (see the second column of Table II).

FIG. 11: Comparison of the estimators d̂(u; G, F) (blue solid lines) and d̂*(u; G, F) (pink dashed lines) for the toy examples in Section V (panels: Calibration, Case I, Case II, Case III). The true comparison densities d(u; G, F) are displayed as dotted black curves. The light gray areas correspond to the confidence bands of the full solution, whereas the dark gray areas refer to the confidence bands of the denoised solution. In all the examples proposed, the latter is almost entirely overlapping with the former.

TABLE II: Model selection and inference for the toy examples in Section V. The second column reports the M and k*_M values selected as in (32) and (48), respectively. The third column collects the unadjusted deviance p-values for the full and denoised solutions. The Bonferroni-adjusted p-values, computed as in (33) and (49), are reported in the fourth column. The correction terms applied correspond to M_max = 20 for the full solution and M_tot = M_max + M(M − 1) for the denoised solution.

When no significance was achieved for any of the values of M considered, a small basis was nonetheless used to construct d̂(u; G, F), which was then further denoised in order to obtain d̂*(u; G, F). Table II shows the results of the deviance tests of the full and the denoised solutions for the examples in Section V. In half of the cases, k*_M = M, and the estimators d̂(u; G, F) and d̂*(u; G, F) overlap over the entire range [0, 1]. The inferential results were also approximately equivalent in the majority of the situations considered.

The main differences are observed in the analysis of the background-only physics sample (Case I). In this case, the deviance-selection procedure leads to non-significant results for all the values of M considered, with the minimum p-value observed at M = 18. For modelling purposes, and for the sake of comparison with the case where a signal is present, a basis of M = 18 polynomials was nonetheless considered; after denoising, the adjusted p-value of the denoised solution is substantially smaller than that of the full solution. This further emphasizes the importance of adjusting for model selection in order to avoid false discoveries.

These results suggest that the denoising approach can easily adapt to situations where a sparse solution is preferable (i.e., when only a few of the M coefficients LP_j are non-zero), without enforcing sparsity when many of the M coefficients considered are needed to adequately fit the data (e.g., bottom right panel of Fig. 11). From an inferential perspective, denoising can improve the sensitivity of the analysis; however, in order to avoid false discoveries, extra care needs to be taken when the deviance selection procedure leads to large p-values for all the M_max models considered.
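The denoising step of (46)–(48) amounts to ranking the squared coefficient estimates and keeping the prefix that maximizes a penalized fit criterion. The sketch below assumes an AIC-style penalty of 2k/n; the exact form of the criterion in (46) should be taken from the paper, so this is an illustration, not the definitive rule.

```python
import numpy as np

def denoise_aic(lp_hat, n):
    """AIC-style denoising of estimated LP coefficients (a sketch of (46)-(48)):
    keep the k largest LP_j^2 maximizing AIC(k) = sum_{j<=k} LP_(j)^2 - 2k/n.
    The 2k/n penalty is an assumption made for illustration."""
    lp = np.asarray(lp_hat, dtype=float)
    order = np.argsort(lp ** 2)[::-1]              # indices, largest LP^2 first
    lp2 = lp[order] ** 2
    aic = np.cumsum(lp2) - 2.0 * np.arange(1, lp.size + 1) / n
    k_star = int(np.argmax(aic)) + 1               # number of retained terms
    kept = np.zeros_like(lp)
    kept[order[:k_star]] = lp[order[:k_star]]      # zero out the discarded terms
    return k_star, kept
```

This naturally yields a sparse estimator when only a few coefficients are large, while retaining the full basis when many terms are needed, in line with the behavior described above.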
FIG. 12: Histograms of the NVSS samples. The left panels show the source-free sample with and without outliers (upper left and lower left panels, respectively). In both cases, the best fit of the Rayleigh model in (53) is displayed as a black curve. The right panel compares the source-free sample without outliers (grey histogram) with the source sample (blue histogram) truncated over the search area considered.

FIG. 13: Deviance test and CD plot for the NVSS source-free sample of size 28,739 compared to the Rayleigh distribution in (53).
IX. AN APPLICATION TO STACKING EXPERIMENTS
In radio astronomical surveys, stacking techniques are often used to combine noisy images or "stacks" in order to increase the signal-to-noise ratio and improve the sensitivity of the analysis in detecting faint sources [e.g., 35–37]. In polarized signal searches, for instance, a faint population of sources is considered when the median polarized intensity observed over the control regions differs significantly from the median of the region where the sources are expected to be present. In this context, under simplifying assumptions, the intensity of the source polarization is often assumed to follow a Rice distribution, i.e.,

$$f(x) = \frac{x\, e^{-\frac{x^2 + \nu^2}{2\sigma^2}}}{k_{\nu\sigma}}\,\mathrm{Bessel}_0\!\left(\frac{x\nu}{\sigma^2}\right) \qquad (52)$$

where Bessel_0(·) denotes the Bessel function of the first kind of order zero and k_{νσ} is a normalizing constant. Furthermore, (52) reduces to a Rayleigh pdf when no signal is present [38], i.e., when ν = 0. Below, it is shown how the methods described in Sections V A and V B 1 can be used to assess whether the Rayleigh distribution is a reliable model for the background and, when it is too simplistic, to investigate the impact of incorrectly assuming a Rayleigh distribution on the reliability of the analysis.

The data considered come from the NRAO VLA Sky Survey (NVSS) [39]. The NVSS is an astronomical survey of the Northern hemisphere carried out by the Very Large Array of the National Radio Astronomy Observatory. The NVSS has detected 1.8 million sources in total intensity, but only 14% of these have a reported polarized signal peak greater than 3σ [37]. The original source-free sample contained 29,739 observations over the range of polarized flux considered (in Jy/beam). The nominal noise in NVSS polarization is 0.00029 Jy/beam, and a reasonable threshold for the detection of an individual source is three times the noise. Hence, a source sample of 6,220 observations has been selected from positions where compact radio sources with a brightness in total intensity between 0 and 0.0009 Jy/beam are known to be present. Both the source-free and the source samples are assumed to be i.i.d. The histograms of the samples considered are shown in Fig. 12. (In statistics, an observation x_i is considered an outlier if x_i < Q_{0.25} − 1.5[Q_{0.75} − Q_{0.25}] or x_i > Q_{0.75} + 1.5[Q_{0.75} − Q_{0.25}], where Q_{0.25} and Q_{0.75} are the first and third sample quartiles.)

As a first step, we fit a Rayleigh distribution (adequately truncated over the observed range) to the source-free sample, i.e.,

$$g_b(x) = \frac{x\, e^{-\frac{x^2}{2\widehat\sigma^2}}}{k_{\widehat\sigma}} \qquad (53)$$

where k_{σ̂} is a normalizing constant and σ̂ is the scale parameter estimated on the source-free sample. In order to assess if (53) provides a good fit for the data, we estimate the comparison density d(G_b(x); G_b, F_b) by first selecting M as in (32) and then applying the AIC-based denoising approach described in Section VIII A. In this case, the denoising step retains k* of the M = 10 polynomial terms initially selected. The deviance test and the CD plot in Fig. 13 suggest that, despite the fact that the median of the data coincides with that of the Rayleigh model, overall the latter does not provide a good fit for the distribution of the source-free sample. Specifically, the data distribution shows a heavier right tail than expected under the Rayleigh assumption, whereas the first quantiles are overestimated by the Rayleigh model. Therefore, the researcher can either adopt a more refined parametric model for the background or consider the calibrated background distribution of the form in (19), which in our setting specializes to

$$\widehat f_b(x) = g_b(x)\left[1 + \sum_{j=1}^{M}\widehat{LP}_j\,\mathrm{Leg}_j\bigl(G_b(x)\bigr)\right] \qquad (54)$$

where G_b(x) is the cdf of (53) and the coefficients LP̂_j are those estimated on the source-free sample.

The strategy described in Section V B 1 allows us to identify where significant differences between the control and source samples occur. In order to assess the effect of incorrectly assuming a Rayleigh background, we compare the distribution of the physics sample with both the Rayleigh distribution and the calibrated background distribution in (54). Figure 14 reports the deviance tests and CD plots obtained on the physics sample when setting g(x) = g_b(x) as in (53) and g(x) = f̂_b(x) as in (54). Both analyses provide strong evidence that the distribution of the physics sample differs significantly from the postulated models f̂_b(x) and g_b(x), and the most substantial discrepancies occur on the right tail of the distribution. However, since the Rayleigh model underestimates the right tail of the background distribution (see Fig. 13), it leads to an artificially enhanced sensitivity in this region.

FIG. 14: Deviance tests and CD plots for the NVSS source sample assuming g(x) to be the calibrated background model in (54) (left panel) and letting g(x) be the pdf of the truncated Rayleigh distribution in (53) (right panel). In both cases the estimator of the comparison density has been denoised as described in Section VIII A.
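The calibration steps of this section can be sketched end to end on synthetic data. In the illustration below, a simulated Rice sample stands in for the NVSS source-free data, the Rayleigh model is not truncated, and the basis size M = 6 is arbitrary; all of these are assumptions made for illustration only.

```python
import numpy as np
from numpy.polynomial import legendre as L
from scipy import stats

rng = np.random.default_rng(1)
x = stats.rice.rvs(b=1.0, scale=0.5, size=5000, random_state=rng)  # stand-in data

# step 1: postulated background g_b -- a Rayleigh fit with location fixed at 0
sigma_hat = np.sqrt(np.mean(x ** 2) / 2)          # Rayleigh MLE of the scale
g_b = stats.rayleigh(scale=sigma_hat)

# normalized shifted Legendre basis on [0, 1]
def leg(j, u):
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * L.legval(2 * np.asarray(u) - 1.0, c)

# step 2: map the data through the postulated cdf and estimate the LP coefficients
u = g_b.cdf(x)
M = 6                                             # illustrative basis size
lp_hat = np.array([leg(j, u).mean() for j in range(1, M + 1)])

# step 3: calibrated background density, as in (19)/(54):
# f_b(x) = g_b(x) * [1 + sum_j LP_j * Leg_j(G_b(x))]
def f_b(t):
    ut = g_b.cdf(t)
    corr = 1.0 + sum(lp_hat[j - 1] * leg(j, ut) for j in range(1, M + 1))
    return g_b.pdf(t) * corr
```

Since each Leg_j integrates to zero on [0, 1], the correction leaves the total probability mass unchanged, so f_b still integrates to one over the support.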
The differences between the two CD plots are less prominent around the median expected under f̂_b(x) and g_b(x) (i.e., in correspondence of u = 0.5). The significance obtained when assuming the Rayleigh background in (53) is, however, noticeably higher than the one obtained under the calibrated background model in (54), consistent with the artificially enhanced sensitivity induced by the Rayleigh model on the right tail. Conversely, the calibrated background model in (54) allows us to safely compare the entire distribution of the polarized intensity in the source and control regions via CD plots and deviance tests without affecting the sensitivity of the analysis.

X. DISCUSSION
This article proposes a unified framework for signal detection and characterization under background mismodelling. From a methodological perspective, the methods presented here extend LP modelling to the inferential setting. The solution discussed is articulated in two main phases: a calibration phase, where the background model is "trained" on a source-free sample, and a signal search phase, conducted on the physics sample collected by the experiment. If a model for the signal is given, the method proposed allows the identification of hidden signals from new, unexpected sources and/or the refinement of the postulated background or signal distributions. Furthermore, the tools presented in this manuscript can easily be extended to situations where a source-free sample is not available and the background is unknown (up to some free parameters). As discussed in Section VI, however, in this setting the signal distribution is required to be known, and the physics sample is expected to contain only signal-like events, i.e., the background is almost completely reduced. The theory of Section II D and the analyses in Section V have highlighted that, although a fully nonparametric approach provides reliable inference, it may lead to unsatisfactory estimates when the postulated pdf g is substantially different from the true density f. In this setting, a semiparametric stage can be performed in order to provide a reliable model for the data. Each individual step in both the nonparametric and the semiparametric stages of Sections V B 2 and V A provides useful scientific insights on the signal and background distributions.
Hence, an automated implementation of the steps of Algorithm 1 based solely on the p-values of the deviance tests is discouraged, as it would lead to a substantial loss of scientific knowledge on the phenomena under study. Finally, it is important to point out that, despite this article's focus on one-dimensional searches on continuous data, all the constructs presented in Section II, as well as the deviance test in Section IV A, also apply to the discrete case when considering i.i.d. events. More work is needed to extend these results and those of Section IV B to searches in multiple dimensions and to settings with Poisson events with functional mean. In the first case, the difficulty mainly lies in generalizing the constructs of Section IV to account for the dependence structure occurring across multiple dimensions. In the second case, the main challenge lies in identifying the equivalent of (1) to model the mean of the distribution while incorporating the Poisson error.

CODE AVAILABILITY
The LPBkg Python package [40] and the LPBkg R package [41] implement the methods proposed in this manuscript. Detailed tutorials on how to use the functions provided are also available at http://salgeri.umn.edu/my-research.

ACKNOWLEDGMENTS
The author thanks Jeroen Stil, who provided the NVSS datasets used in Section IX, and Lawrence Rudnick, who first recognized the usefulness of the method proposed in the context of stacking experiments. Conversations with Subhadeep Mukhopadhyay have been of great help when this work was first conceptualized. Discussions and e-mail exchanges with Charles Doss and Chad Shafer are gratefully acknowledged. Finally, the author thanks an anonymous referee whose feedback has been substantial to improve the overall quality of the paper.
Appendix A: Moments of the LP̂_j estimates

Consider the general setting where f ≢ g and thus d(u; G, F) ≢ 1 over [0, 1]. It follows that each u_i is independently and identically distributed with pdf d(u; G, F); hence, all the expectations in E[LP̂_j], V(LP̂_j) and Cov(LP̂_j, LP̂_k) are taken with respect to d(u; G, F). Specifically,

$$E[\widehat{LP}_j] = E\Bigl[\frac{1}{n}\sum_{i=1}^{n}\mathrm{Leg}_j(U_i)\Bigr] = E[\mathrm{Leg}_j(U)] = \int_0^1 \mathrm{Leg}_j(u)\, d(u; G, F)\, du = LP_j,$$

where the second equality follows from the fact that each observed value u_i is a realization of a random variable U_i, and each of U_1, ..., U_n is identically distributed as the random variable U, whose pdf is given by the comparison density d(u; G, F). Notice that if d(u; G, F) = 1, then LP_j = ∫₀¹ Leg_j(u) du = 0, from which the first equivalence in (8) follows. Moreover,

$$V(\widehat{LP}_j) = \frac{1}{n^2}\, V\Bigl(\sum_{i=1}^{n}\mathrm{Leg}_j(U_i)\Bigr) = \frac{1}{n}\, V\bigl(\mathrm{Leg}_j(U)\bigr) = \frac{\sigma^2_j}{n},$$

where $V(\mathrm{Leg}_j(U)) = \int_0^1 (\mathrm{Leg}_j(u) - LP_j)^2\, d(u; G, F)\, du = \sigma^2_j$. The second equality holds because the u_i are independent and identically distributed. Notice that if d(u; G, F) = 1, then σ²_j = ∫₀¹ Leg²_j(u) du = 1 by orthonormality of the Leg_j(u) polynomials. Hence the second equivalence in (8) holds. Finally,

$$\mathrm{Cov}(\widehat{LP}_j, \widehat{LP}_k) = \mathrm{Cov}\Bigl(\frac{1}{n}\sum_{i=1}^{n}\mathrm{Leg}_j(U_i),\, \frac{1}{n}\sum_{i=1}^{n}\mathrm{Leg}_k(U_i)\Bigr) = \frac{1}{n}\,\mathrm{Cov}\bigl(\mathrm{Leg}_j(U), \mathrm{Leg}_k(U)\bigr) = \frac{\sigma_{jk}}{n};$$

also in this case, the second equality follows by independence and identical distribution of each u_i, and

$$\mathrm{Cov}\bigl(\mathrm{Leg}_j(U), \mathrm{Leg}_k(U)\bigr) = \int_0^1 (\mathrm{Leg}_j(u) - LP_j)(\mathrm{Leg}_k(u) - LP_k)\, d(u; G, F)\, du = \int_0^1 \mathrm{Leg}_j(u)\mathrm{Leg}_k(u)\, d(u; G, F)\, du - LP_j\, LP_k = \sigma_{jk}. \qquad (A1)$$

Because of the orthogonality of the Leg_j(u) polynomials, σ_jk = 0 when d(u; G, F) = 1. Hence the third equivalence in (8) follows.
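The moment identities above are easy to verify by simulation: when d(u; G, F) = 1, i.e. the u_i are uniform on [0, 1], the estimates LP̂_j should be centered at zero with variance 1/n. A minimal sketch, assuming the normalized shifted Legendre basis Leg_j(u) = √(2j+1) P_j(2u − 1):

```python
import numpy as np
from numpy.polynomial import legendre as L

def leg(j, u):
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * L.legval(2 * np.asarray(u) - 1.0, c)

rng = np.random.default_rng(7)
n, reps, j = 100, 4000, 3
# each replicate: LP_hat_j = mean of Leg_j over n uniform draws (d = 1 case)
lp_hat = np.array([leg(j, rng.uniform(size=n)).mean() for _ in range(reps)])

print(lp_hat.mean())      # ~ 0   (E[LP_j] = LP_j = 0 when d = 1)
print(n * lp_hat.var())   # ~ 1   (V(LP_j) = sigma_j^2 / n with sigma_j^2 = 1)
```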
Appendix B: Bias, variance and MISE of d̂(u; G, F)

Given a point u over [0, 1], the bias of (9) at u is

$$\mathrm{Bias}\bigl[\hat d(u; G, F)\bigr] = E\bigl[\hat d(u; G, F)\bigr] - d(u; G, F) \qquad (B1)$$
$$= \sum_{j=1}^{M} E[\widehat{LP}_j]\,\mathrm{Leg}_j(u) - \sum_{j>0} LP_j\,\mathrm{Leg}_j(u) \qquad (B2)$$
$$= -\sum_{j>M} LP_j\,\mathrm{Leg}_j(u); \qquad (B3)$$

here (B2) follows from (4) and (9). The integrated squared bias is then

$$\mathrm{ISB} = \int_0^1 \Bigl(\sum_{j>M} LP_j\,\mathrm{Leg}_j(u)\Bigr)^2 du \qquad (B4)$$
$$= \sum_{j>M} LP^2_j \int_0^1 \mathrm{Leg}^2_j(u)\, du + 2\sum_{M<j<k} LP_j\, LP_k \int_0^1 \mathrm{Leg}_j(u)\mathrm{Leg}_k(u)\, du \qquad (B5, B6)$$
$$= \sum_{j>M} LP^2_j, \qquad (B7)$$

where (B7) holds because of the orthonormality of the Leg_j(u) polynomials. Notice that

$$\mathrm{ISB} = \sum_{j>M} LP^2_j = \sum_{j>0} LP^2_j - \sum_{j=1}^{M} LP^2_j \qquad (B8)$$
$$= \int_0^1 \bigl(d(u; G, F) - 1\bigr)^2 du - \sum_{j=1}^{M} LP^2_j \qquad (B9)$$
$$= \int_0^1 \left(\frac{f(G^{-1}(u)) - g(G^{-1}(u))}{g(G^{-1}(u))}\right)^2 du - \sum_{j=1}^{M} LP^2_j, \qquad (B10)$$

where (B9) follows by Parseval's identity, whereas (B10) follows from (2). The variance of (9) at a given point u is given by

$$V\bigl[\hat d(u; G, F)\bigr] = V\Bigl(\sum_{j=1}^{M}\widehat{LP}_j\,\mathrm{Leg}_j(u)\Bigr) \qquad (B11)$$
$$= \sum_{j=1}^{M}\mathrm{Leg}^2_j(u)\, V(\widehat{LP}_j) + 2\sum_{j<k}\mathrm{Leg}_j(u)\mathrm{Leg}_k(u)\,\mathrm{Cov}(\widehat{LP}_j, \widehat{LP}_k) \qquad (B12, B13)$$
$$= \sum_{j=1}^{M}\frac{\sigma^2_j}{n}\,\mathrm{Leg}^2_j(u) + 2\sum_{j<k}\frac{\sigma_{jk}}{n}\,\mathrm{Leg}_j(u)\mathrm{Leg}_k(u). \qquad (B14)$$

By orthonormality of the polynomials Leg_j(u), the integral of (B14) over [0, 1] is

$$\int_0^1 V\bigl[\hat d(u; G, F)\bigr]\, du = \sum_{j=1}^{M}\frac{\sigma^2_j}{n}. \qquad (B15)$$
Finally, the MISE is

$$\mathrm{MISE}\bigl[\hat d(u; G, F)\bigr] = E\left[\int_0^1 \bigl(\hat d(u; G, F) - d(u; G, F)\bigr)^2 du\right] \qquad (B16)$$
$$= \int_0^1 E\Bigl[\bigl(\hat d(u; G, F) - d(u; G, F)\bigr)^2\Bigr] du \qquad (B17)$$
$$= \int_0^1 V\bigl[\hat d(u; G, F)\bigr] + \Bigl(E\bigl[\hat d(u; G, F)\bigr] - d(u; G, F)\Bigr)^2\, du \qquad (B18, B19)$$
$$= \sum_{j=1}^{M}\frac{\sigma^2_j}{n} + \sum_{j>M} LP^2_j, \qquad (B20)$$

where (B17) holds because of the Fubini–Tonelli theorem, whereas the last equality follows by (B7) and (B15).

[1] Aprile, E., et al. Physical Review Letters 119.18 (2017): 181301. URL: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.119.181301
[2] Agnese, R., et al. Physical Review D 99.6 (2019): 062001.
[3] Smith, R. and Thrane, E. Physical Review X 8.2 (2018): 021019.
[4] Sirunyan, A.M., et al. Physics Letters B 793 (2019): 320-347.
[5] Yellin, S. Physical Review D 66.3 (2002): 032005.
[6] Priel, N., et al. Journal of Cosmology and Astroparticle Physics 2017.05 (2017): 013.
[7] Dauncey, P.D., et al. Journal of Instrumentation 10.04 (2015): P04015.
[8] Mukhopadhyay, S., and Parzen, E. arXiv preprint arXiv:1405.2601 (2014).
[9] Mukhopadhyay, S., and Wang, K. To appear in: Biometrika (2019+). arXiv[stat.ME]:1810.01724v2. Supplement available at: http://unitedstatalgo.com/wp-content/uploads/2019/09/LPKsample_Supp.pdf
[10] Mukhopadhyay, S. Electronic Journal of Statistics 11.1 (2017): 215-240. arXiv[stat.ME]:1509.06428
[11] Mukhopadhyay, S., and Fletcher, D. Scientific Reports 8.1 (2018): 9983.
[12] Mukhopadhyay, S., and Parzen, E. Journal of Risk and Financial Management 11.3 (2018): 37. arXiv[math.ST]:1308.0642
[13] Mukhopadhyay, S. Biometrics 72.2 (2016): 325-334. arXiv[math.ST]:1507.08727
[14] Mukhopadhyay, S. Journal of Nonparametric Statistics 30.4 (2018): 1003-1015. arXiv[stat.ME]:1805.02075
[15] Parzen, E. Statistical Science 19.4 (2004): 652-662.
[16] Sheather, S.J. and Jones, M.C. Journal of the Royal Statistical Society B 53.3 (1991): 683-690.
[17] Wasserman, L. All of Nonparametric Statistics. Springer Science & Business Media (2006).
[18] Pilla, R.S., Loader, C. and Taylor, C.C. Physical Review Letters 95.23 (2005): 230202.
[19] Shen, X., Huang, H.C., and Ye, J. Journal of the American Statistical Association 99.467 (2004): 751-762.
[20] Leeb, H., and Pötscher, B.M. Econometric Theory 21.1 (2005): 21-59. URL: https://pdfs.semanticscholar.org/307d/4f561831d98246c91424b84ff90f81b964eb.pdf
[21] Algeri, S., et al. Journal of Instrumentation 11.12 (2016): P12010.
[22] Gross, E. and Vitells, O. The European Physical Journal C 70.1-2 (2010): 525-530.
[23] Dunn, O.J. The Annals of Mathematical Statistics (1958): 1095-1111.
[24] Anderson, T.W., and Darling, D.A. The Annals of Mathematical Statistics 23.2 (1952): 193-212.
[25] Wilcoxon, F. Biometrics 3.3 (1947): 119-122.
[26] Darling, D.A. The Annals of Mathematical Statistics 28.4 (1957): 823-838.
[27] Efromovich, S. Nonparametric Curve Estimation. Springer Science & Business Media (2008).
[28] Atwood, W.B., et al. The Astrophysical Journal 697.2 (2009).
[29] Bergström, L., Ullio, P. and Buckley, J.H. Astroparticle Physics 9.2 (1998): 137-162.
[30] Baltz, E.A., Taylor, J.E., and Wai, L.L. The Astrophysical Journal Letters 659.2 (2007): L125.
[31] Lyons, L. PHYSTAT 2011 Workshop Proceedings (2011): 225-228.
[32] Panaretos, V. PHYSTAT 2011 Workshop Proceedings (2011): 229-239.
[33] Blobel, V. PHYSTAT 2011 Workshop Proceedings (2011): 240-251.
[34] Algeri, S., Conrad, J. and van Dyk, D.A. Monthly Notices of the Royal Astronomical Society: Letters 458.1 (2016): L84-L88.
[35] Brown, S., et al. The Astrophysical Journal Letters 740.1 (2011): L28.
[36] White, R.L., et al. The Astrophysical Journal 654.1 (2007): 99.
[37] Stil, J.M., et al. The Astrophysical Journal 787.2 (2014): 99.
[38] Simmons, J.F.L., and Stewart, B.G. Astronomy and Astrophysics 142 (1985): 100-106.
[39] Condon, J.J., et al. The Astronomical Journal 115.5 (1998): 1693.
[40] Zhang, Z. LPBkg: Detecting New Signals under Background Mismodelling, 2019. URL: https://pypi.org/project/LPBkg/
[41] Algeri, S. and Liu, H. LPBkg: Detecting New Signals under Background Mismodelling, 2019. URL: https://CRAN.R-project.org/package=LPBkg