Detecting new signals under background mismodelling
Version of December 4, 2019
Detecting new signals under background mismodelling∗

Sara Algeri†
School of Statistics, University of Minnesota, Minneapolis (MN), 55455, USA
(Dated: December 4, 2019)

Searches for new astrophysical phenomena often involve several sources of non-random uncertainties which can lead to highly misleading results. Among these, model uncertainty arising from background mismodelling can dramatically compromise the sensitivity of the experiment under study. Specifically, overestimating the background distribution in the signal region increases the chances of missing new physics. Conversely, underestimating the background outside the signal region leads to an artificially enhanced sensitivity and a higher likelihood of claiming false discoveries. The aim of this work is to provide a unified statistical strategy to perform modelling, estimation, inference, and signal characterization under background mismodelling. The method proposed allows one to incorporate the (partial) scientific knowledge available on the background distribution and provides a data-updated version of it in a purely nonparametric fashion, without requiring the specification of prior distributions on the unknown parameters. Applications in the context of dark matter searches and radio surveys show how the tools presented in this article can be used to incorporate non-stochastic uncertainty due to instrumental noise and to overcome violations of classical distributional assumptions in stacking experiments.
I. INTRODUCTION
When searching for new physics, a discovery claim is made if the data collected by the experiment provides sufficient statistical evidence in favor of the new phenomenon. If the background and signal distributions are specified correctly, this can be done by means of statistical tests of hypothesis, upper limits, and confidence intervals.
The problem.
In practice, even if a reliable description of the signal distribution is available, providing accurate background models may be challenging, as the behavior of the sources which contribute to it is often poorly understood. Some examples include searches for nuclear recoils of weakly interacting massive particles over electron-recoil backgrounds [1, 2], searches for gravitational-wave signals over non-Gaussian backgrounds from stellar-mass binary black holes [3], and searches for a standard model-like Higgs boson over prompt diphoton production [4].

Unfortunately, model uncertainty due to background mismodelling can significantly compromise the sensitivity of the experiment under study. Specifically, overestimating the background distribution in the signal region increases the chances of missing new physics. Conversely, underestimating the background outside the signal region leads to an artificially enhanced sensitivity, which can easily result in false discovery claims. Several methods have been proposed in the literature to address this problem [e.g., 5–7]. However, to the best of the author's knowledge, none of the methods available provides a unified strategy to (i) assess the validity of existing models for the background, (ii) fully characterize the background distribution, (iii) perform signal detection even if the signal distribution is not available, (iv) characterize the signal distribution, and (v) detect additional signals of new unexpected sources.

∗ The author declares no conflict of interest.
† [email protected]

Goal.
The aim of this work is to integrate modelling, estimation, and inference under background mismodelling, and to provide a general statistical methodology to perform (i)–(v). As a brief overview, given a source-free sample and the (partial) scientific knowledge available on the background distribution, a data-updated version of the latter is obtained in a purely nonparametric fashion, without requiring the specification of prior distributions on the unknown parameters. At this stage, a graphical tool is provided in order to assess if and where significant deviations between the true and the postulated background distributions occur. The "updated" background distribution is then used to assess if the distribution of the data collected by the experiment deviates significantly from the background model. Also in this case, it is possible to assess graphically how the data distribution deviates from the expected background model. If a source-free sample is available, or if control regions can be identified, the solution proposed does not require the specification of a model for the signal; however, if the signal distribution is known (up to some free parameters), the latter can be used to further improve the accuracy of the analysis and to detect the signals of unexpected new sources. Finally, the method can be easily adjusted to cover situations in which a source-free sample or control regions are not available, and the background is unknown or incorrectly specified, but a functional form of the signal distribution is known.
The key to the solution.
The statistical methodologies involved rely on the novel LP approach to statistical modelling first introduced by Mukhopadhyay and Parzen in 2014 [8]. As it will become clearer later on in the paper, the letter L typically denotes robust nonparametric methods based on quantiles, whereas P stands for polynomials [9, Supp S1]. This approach allows the unification of many of the standard results of classical statistics by expressing them in terms of quantiles and comparison distributions, and provides a simple and powerful framework for statistical learning and data analysis. The interested reader is directed to [10–14] and references therein for recent advancements in mode detection, nonparametric time series, goodness-of-fit on prior distributions, and large-scale inference using an LP approach.

Organization.
Section II is dedicated to a review of the main constructs of LP modelling. Section III highlights the practical advantages offered by modelling background distributions using an LP approach. Section IV introduces a novel LP-based framework for statistical inference. Section V outlines the main steps of a data-scientific approach for signal detection and characterization. In Section VI, the methods proposed are applied in the context of dark matter searches, where the goal is to distinguish γ-ray emissions due to dark matter from those due to pulsars. In Section VII, the tools discussed are applied to a simulation of the Fermi Large Area Telescope, and it is shown how upper limits and Brazil plots can be constructed by means of comparison distributions. Section VIII is dedicated to model-denoising. Section IX presents an application to data from the NVSS astronomical survey and discusses a simple approach to assess the validity of distributional assumptions on the polarized intensity in stacking experiments. A discussion of the main results and extensions is proposed in Section X.

II. LP APPROACH TO STATISTICAL MODELLING
The LP approach to statistical modelling [8] is a novel statistical framework which makes it possible to simultaneously assess the validity of the scientific knowledge available and fill the gap between the initial scientific belief and the evidence provided by the data. Sections II A, II B, and II C below introduce the LP modelling framework, whereas Section III discusses how the problem of background mismodelling can be formulated under this paradigm.
A. The skew-G density model
Let X be a continuous random variable with cumulative distribution function (cdf) F(x) and probability density function (pdf) f(x). Since F is the true distribution of the data, it is typically unknown. However, suppose a suitable cdf G(x) is available, and let g(x) be the respective pdf. In order to understand if G is a good candidate for F, it is convenient to express the relationship between the two in a concise manner.

The skew-G density model [8, 10] is a universal representation scheme which allows one to express any pdf f(x) as

    f(x) = g(x) d(G(x); G, F)                                        (1)

where d(u; G, F) is called the comparison density [15] and is such that

    d(u; G, F) = f(G⁻¹(u)) / g(G⁻¹(u)),   0 ≤ u ≤ 1,                 (2)

with u = G(x) and G⁻¹(u) = inf{x : G(x) ≥ u} denoting the "postulated" quantile function of X. The comparison density is the pdf of the random variable U = G(X), whereas its cdf,

    D(u) = F(G⁻¹(u)) = ∫₀ᵘ d(v; G, F) dv,                            (3)

is called the comparison distribution.

Practical remarks.
Equations (2) and (3) are of fundamental importance to understand the power of a statistical modelling approach based on the comparison density. Specifically, d(u; G, F) allows one to "connect" any given pdf g to the true pdf f through the quantile transformation G⁻¹ of u. Furthermore, g ≡ f if and only if d(u; G, F) = 1 for all u ∈ [0, 1], i.e., U is uniformly distributed over the interval [0, 1]. Conversely, if g ≢ f, d(u; G, F) models the departure of the true density f(x) from the postulated model g(x). Consequently, an adequate estimate of d(u; G, F) not only leads to an estimate of the true f(x) based on (1), but also allows one to identify the regions where f(x) deviates substantially from g(x).

B. LP skew-G series representation
Denote with L²[0, 1] the Hilbert space of square integrable functions on the unit interval with respect to the measure G. A complete, orthonormal basis of functions in L²[0, 1] can be constructed by considering powers of G(x), i.e., G(x), G²(x), G³(x), . . ., adequately orthonormalized via the Gram–Schmidt procedure [10]. The resulting bases can equivalently be expressed as normalized shifted Legendre polynomials, namely Leg_j(u), with u = G(x).

Under the assumption that (2) is a square integrable function on [0, 1], i.e., d ∈ L²[0, 1], we can then represent d(u; G, F) via a series of {Leg_j(u)}_{j≥0} polynomials, i.e.,

    d(u; G, F) = 1 + Σ_{j>0} LP_j Leg_j(u)                           (4)

with coefficients LP_j = ∫₀¹ Leg_j(u) d(u; G, F) du. The representation in (4) is called the LP skew-G series representation [10].
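To make the objects of Sections II A–II B concrete, here is a small Python sketch (illustrative only; the distributions and tolerances are this sketch's own choices, not the paper's): it builds the normalized shifted Legendre basis, checks its orthonormality, and evaluates the comparison density (2) for an arbitrary misspecified pair (G, F), verifying that d(u; G, F) integrates to one.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats
from scipy.integrate import quad

def leg_shifted(j, u):
    """Normalized shifted Legendre polynomial Leg_j on [0, 1]:
    sqrt(2j+1) * P_j(2u - 1), so that int_0^1 Leg_j(u)^2 du = 1."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

# Orthonormality: int_0^1 Leg_j Leg_k du = delta_jk.
for j in range(4):
    for k in range(4):
        val, _ = quad(lambda u: leg_shifted(j, u) * leg_shifted(k, u), 0, 1)
        assert abs(val - (1.0 if j == k else 0.0)) < 1e-9

# Leg_1 matches the closed form sqrt(12) (u - 0.5).
uu = np.linspace(0, 1, 11)
assert np.allclose(leg_shifted(1, uu), np.sqrt(12) * (uu - 0.5))

# Comparison density, Eq. (2): true model F = Beta(2, 4),
# postulated model G = Beta(2, 2) (an arbitrary misspecified choice).
F, G = stats.beta(2, 4), stats.beta(2, 2)
d = lambda u: F.pdf(G.ppf(u)) / G.pdf(G.ppf(u))

# d is the pdf of U = G(X): it integrates to one over [0, 1].
mass, _ = quad(d, 0, 1)
assert abs(mass - 1.0) < 1e-6
```

Any continuous pair (F, G) works here; the Beta pair is used only so the quantile function G⁻¹ is available in closed form via `scipy`.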
C. LP density estimate
Let x₁, . . . , xₙ be a sample of independent and identically distributed (i.i.d.) observations from X. Observations from U are given by u₁ = G(x₁), . . . , uₙ = G(xₙ). The LP_j coefficients in (4) can then be estimated via

    LP̂_j = (1/n) Σ_{i=1}^n Leg_j(uᵢ).                                (5)

Alternatively, in virtue of (3), the estimates LP̂_j can also be specified as

    LP̂_j = ∫₀¹ Leg_j(u) dD̃(u) = ∫₀¹ Leg_j(u) dF̃(G⁻¹(u))              (6)

where F̃ and D̃ denote the empirical distributions of the samples x₁, . . . , xₙ and u₁, . . . , uₙ, respectively.

The moments of the LP̂_j are

    E[LP̂_j] = LP_j,  V(LP̂_j) = σ²_j/n,  Cov(LP̂_j, LP̂_k) = σ_jk/n     (7)

where σ²_j = ∫₀¹ (Leg_j(u) − LP_j)² d(u; G, F) du and σ_jk = ∫₀¹ (Leg_j(u) − LP_j)(Leg_k(u) − LP_k) d(u; G, F) du. When f ≡ g, the equalities in (7) reduce to

    E[LP̂_j] = 0,  V(LP̂_j) = 1/n,  Cov(LP̂_j, LP̂_k) = 0 for j ≠ k.    (8)

Derivations of (7) and (8) are discussed in Appendix A.

If (4) is approximated by the first M + 1 terms, an estimate of the comparison density is given by

    d̂(u; G, F) = 1 + Σ_{j=1}^M LP̂_j Leg_j(u),                        (9)

with variance

    V[d̂(u; G, F)] = Σ_{j=1}^M (σ²_j/n) Leg²_j(u) + 2 Σ_{j<k} (σ_jk/n) Leg_j(u) Leg_k(u).   (10)

See Appendix B for more details on the derivation of (10). Finally, the standard error of d̂(u; G, F) corresponds to the square root of (10), with σ²_j and σ_jk estimated by their sample counterparts, i.e.,

    σ̂²_j = (1/n) Σ_{i=1}^n (Leg_j(uᵢ) − LP̂_j)²,
    σ̂_jk = (1/n) Σ_{i=1}^n Leg_j(uᵢ) Leg_k(uᵢ) − LP̂_j LP̂_k.

Classical Legendre polynomials are defined over [−1, 1]; here, their "shifted" counterpart over the range [0, 1] is considered. The first three normalized shifted Legendre polynomials are Leg₀(u) = 1, Leg₁(u) = √12 (u − 0.5), and Leg₂(u) = √180 (u² − u + 1/6).
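The estimation step in (5), (9), and (10) can be sketched as follows. The models, sample size, and truncation point below are assumptions of this illustration, not values from the paper; with G uniform, the true comparison density of a Beta(2, 4) sample is itself a degree-4 polynomial, so M = 4 recovers it up to sampling noise.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

def leg_shifted(j, u):
    """Normalized shifted Legendre polynomial Leg_j on [0, 1]."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

rng = np.random.default_rng(1)

# Truth F = Beta(2, 4); postulated G = uniform on [0, 1] (illustrative).
F, G = stats.beta(2, 4), stats.uniform(0, 1)
x = F.rvs(size=20000, random_state=rng)
u = G.cdf(x)                                    # u_i = G(x_i)

M = 4
LP = np.array([leg_shifted(j, u).mean() for j in range(1, M + 1)])  # Eq. (5)

def d_hat(uu):
    """Estimated comparison density, Eq. (9)."""
    return 1.0 + sum(LP[j - 1] * leg_shifted(j, uu) for j in range(1, M + 1))

def se_d_hat(uu):
    """Standard error of d_hat: sample counterpart of the variance (10)."""
    Lu = np.array([leg_shifted(j, u) for j in range(1, M + 1)])   # on data
    S = np.cov(Lu)                                # sigma_j^2 and sigma_jk
    Lg = np.array([leg_shifted(j, uu) for j in range(1, M + 1)])  # on grid
    return np.sqrt(np.einsum('ju,jk,ku->u', Lg, S, Lg) / u.size)

def f_hat(xx):
    """Skew-G density estimate: g(x) times d_hat(G(x))."""
    return G.pdf(xx) * d_hat(G.cdf(xx))

grid = np.linspace(0.05, 0.95, 19)
assert np.max(np.abs(f_hat(grid) - F.pdf(grid))) < 0.3
```

The `einsum` call evaluates the full quadratic form Σ_{j,k} Leg_j(u) σ̂_jk Leg_k(u)/n, which reproduces both the diagonal and the cross terms of (10) in one step.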
Finally, in virtue of the skew-G density model in (1), we can estimate f(x) as

    f̂(x) = g(x) d̂(G(x); G, F).                                       (11)

Recall that the first normalized shifted Legendre polynomial is Leg₀(u) = 1.
FIG. 1: Upper panel: histogram of a source-free sample simulated from the tail of a Gaussian with mean 55 and width 15 (green solid line). The candidate background distribution is given by the best fit of a second-degree polynomial (red dashed line), and it is updated using the source-free data by means of (19) (purple dot-dashed line). The kernel density estimator of f_b is also displayed for comparison (orange dot-dashed line). Bottom panel: comparison density estimate (blue solid line) plotted on the x-scale and respective standard errors (light blue area).

Since each Leg_j(u) is a polynomial function of the random variable U, each LP̂_j estimate can be expressed as a linear combination of the first j sample moments of U, e.g.,

    LP̂₂ = (1/n) Σ_{i=1}^n Leg₂(uᵢ) = √180 (μ̂₂ − μ̂₁ + 1/6)

where μ̂₁ = (1/n) Σ_{i=1}^n uᵢ and μ̂₂ = (1/n) Σ_{i=1}^n u²ᵢ. Therefore, the truncation point M can be interpreted as the order of the highest moment considered to characterize the distribution of U. (The reader is directed to Section IV C for a discussion on the choice of M.)

D. The bias-variance trade-off
In order to understand how good (9) is at estimating d(u; G, F), we consider the Mean Integrated Squared Error (MISE) of d̂(u; G, F), i.e.,

    MISE = E[∫₀¹ (d̂(u; G, F) − d(u; G, F))² du]                      (12)
         = Σ_{j=1}^M σ²_j/n + Σ_{j>M} LP²_j                          (13)

where the first term in (13) corresponds to the integral of (10) over [0, 1], whereas the second term corresponds to the Integrated Squared Bias (IBS), i.e.,

    IBS = ∫₀¹ (E[d̂(u; G, F)] − d(u; G, F))² du.                      (14)

Interestingly, the latter can also be specified as

    IBS = ∫₀¹ ((f(x) − g(x))/g(x))² du − Σ_{j=1}^M LP²_j             (15)

with u = G(x) (see derivations in Appendix B). The first term on the right-hand side of (15) is particularly important in understanding the role played by g in obtaining a reliable estimate of f. Specifically, the closer g is to f, the lower the bias of d̂(G(x); G, F) and of f̂(x) in (11).

Practical remarks.
Equation (13) implies that larger values of M do not necessarily lead to better estimates of d(u; G, F). Specifically, when n → ∞, the first term in (13) tends to zero. However, for large values of M, more and more terms contribute to it, and thus increasing M may lead to a substantial inflation of the variance in (10). Conversely, the bias is not affected by the sample size, and it can be controlled by either choosing g sufficiently close to f (see (15)) and/or increasing M while preserving a good bias-variance trade-off.

Further remarks.
Equation (6) implies that the estimator in (9) relies on the empirical distribution of the observed sample by means of the LP̂_j estimates. Therefore, an estimator of the comparison density based entirely on the empirical cdf can be expressed by setting M = n − 1. For any choice of M < n − 1, the estimator in (9) not only leads to a reduction of the variance but, in virtue of (15), its bias is mitigated when the postulated model g is sufficiently close to the true pdf of the data f.

III. DATA-DRIVEN CORRECTIONS FOR MISSPECIFIED BACKGROUND MODELS
Let x_B = (x₁, . . . , x_N) be a sample of observations from control regions or the result of Monte Carlo simulations, where we expect no signal to be present. Hereafter, we refer to x_B as the source-free sample. Therefore, x_B can be used to "learn" the unknown pdf of the background, namely f_b(x), and obtain an estimate for it via (11).

Despite the true background model being unknown, suppose that a candidate pdf, namely g_b(x), is available. The candidate model g_b(x) can be specified from previous experiments or theoretical results, or can be obtained by fitting specific functions (e.g., polynomial, exponential, etc.) to x_B. If g_b(x) does not provide an accurate description of f_b(x), the sensitivity of the experiment can be strongly affected.

Consider, for instance, a source-free sample of N observations simulated from the tail of a Gaussian with mean 55 and width 15, i.e.,

    f_b(x) = e^{−(x−55)²/(2·15²)} / k_{f_b}                          (16)

with k_{f_b} = ∫ e^{−(x−55)²/(2·15²)} dx the normalizing constant over the search region. Suppose that a candidate model for the background is obtained by fitting a second-degree polynomial to the source-free sample and adequately normalizing it in order to obtain a proper pdf, i.e.,

    g_b(x) = (β̂₀ + β̂₁ x + β̂₂ x²) / k_{g_b}                           (17)

with β̂₀, β̂₁, β̂₂ the fitted coefficients and k_{g_b} the respective normalizing constant. For illustrative purposes, assume that the distribution of the signal is a Gaussian centered at 25, with width 4, i.e.,

    f_s(x) = e^{−(x−25)²/(2·4²)} / k_{f_s}                           (18)

with k_{f_s} the respective normalizing constant. The histogram of the source-free sample along with (16)–(18) is shown in Fig. 1. At the higher end of the spectrum, the postulated background (red dashed line) underestimates the true background distribution (green solid line). As a result, using (17) as the background model increases the chance of false discoveries in this region. Conversely, at the lower end of the spectrum, g_b(x) overestimates f_b(x), reducing the sensitivity of the analysis.
For the sake of comparison, a kernel density estimate (orange dot-dashed line) has been computed by selecting the bandwidth parameter as recommended in [16]. The latter exhibits substantial bias at the boundary and appears to overfit the data sample.

It is important to point out that the discrepancy of f_b(x) from g_b(x) is typically due to the fact that the specific functional form imposed (in our example, a second-degree polynomial) is not adequate for the data. Thus, changing the values of the fitted parameters (or assigning priors to them) is unlikely to solve the problem. However, it is possible to "repair" g_b(x) and obtain a suitable estimate of f_b(x) by means of (11). Specifically, f_b(x) can be estimated via

    f̂_b(x) = g_b(x) d̂(G_b(x); G_b, F_b)                              (19)

where d̂(G_b(x); G_b, F_b) is the comparison density estimated via (9) on the sample G_b(x₁), . . . , G_b(x_N), whereas F_b and G_b are the true and the postulated background distributions, with pdfs as in (16) and (17), respectively.

In our example, choosing M = 2 leads to an estimate of the form

    d̂(G_b(x); G_b, F_b) = 1 + LP̂₁ Leg₁[G_b(x)] + LP̂₂ Leg₂[G_b(x)],   (20)

where Leg₁[G_b(x)] and Leg₂[G_b(x)] are the first and second normalized shifted Legendre polynomials evaluated at G_b(x). Notice that, by combining (19) and (20), we can easily write the background model as a series of shifted Legendre polynomials. This may be especially useful when dealing with complicated likelihoods for which a functional form is difficult to specify.

The upper panel of Fig. 1 shows the "calibrated" background model in (19) as a purple dot-dashed line; it matches almost exactly the true background density in (16) (green solid line). The plot of d̂(G_b(x); G_b, F_b) in the bottom panel of Fig. 1 provides important insights on the deficiencies of (17) as a candidate background model.
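The repair in (19)–(20) can be mirrored numerically. Everything below is an illustrative reconstruction: the search region [10, 45], the sample size, the histogram-based polynomial fit, and the choice M = 4 are assumptions of this sketch (only the Gaussian-tail background with mean 55 and width 15 and the second-degree polynomial candidate follow the text; the paper selects M via the scheme of Section IV C).

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

rng = np.random.default_rng(7)
LO, HI = 10.0, 45.0      # assumed search region (illustrative)

# True background, cf. Eq. (16): Gaussian tail, mean 55, width 15.
a, b = (LO - 55) / 15, (HI - 55) / 15
F_b = stats.truncnorm(a, b, loc=55, scale=15)
x_B = F_b.rvs(size=10000, random_state=rng)      # source-free sample

# Candidate background, cf. Eq. (17): normalized second-degree polynomial
# fit (here, a simple least-squares fit to a histogram of the sample).
hist, edges = np.histogram(x_B, bins=40, range=(LO, HI), density=True)
beta = np.polyfit(0.5 * (edges[:-1] + edges[1:]), hist, deg=2)

xx = np.linspace(LO, HI, 4001)
g_raw = np.maximum(np.polyval(beta, xx), 1e-9)   # clip to keep a proper pdf
steps = 0.5 * (g_raw[1:] + g_raw[:-1]) * np.diff(xx)
k_gb = steps.sum()
G_tab = np.concatenate([[0.0], np.cumsum(steps)]) / k_gb
g_b = lambda x: np.interp(x, xx, g_raw) / k_gb   # candidate pdf g_b
G_b = lambda x: np.interp(x, xx, G_tab)          # candidate cdf G_b

def leg_shifted(j, u):
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

# LP correction, cf. Eqs. (19)-(20).
M = 4
u = G_b(x_B)
LP = [leg_shifted(j, u).mean() for j in range(1, M + 1)]
d_hat = lambda v: 1 + sum(LP[j - 1] * leg_shifted(j, v)
                          for j in range(1, M + 1))
f_hat_b = lambda x: g_b(x) * d_hat(G_b(x))

# The repaired model should sit closer to the truth than the candidate.
grid = np.linspace(LO + 0.5, HI - 0.5, 70)
err_candidate = np.mean((g_b(grid) - F_b.pdf(grid)) ** 2)
err_repaired = np.mean((f_hat_b(grid) - F_b.pdf(grid)) ** 2)
assert err_repaired < err_candidate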
Specifically, the magnitude and the direction of the departure of d̂(G_b(x); G_b, F_b) from one correspond to the estimated departure of f_b(x) from g_b(x) for each value of x. Therefore, if d̂(G_b(x); G_b, F_b) is below one in the region where we expect the signal to occur, using f̂_b(x) in place of g_b(x) increases the sensitivity of the analysis. Conversely, if d̂(G_b(x); G_b, F_b) is above one outside the signal region, the use of f̂_b(x) instead of g_b(x) prevents false discoveries.

Notice that in this article we only consider continuous data. In this respect, the goal is to learn the model of the background considered as a continuum, and no binning is applied. Therefore, the histograms presented here are only a graphical tool used to display the data distribution and are not intended to represent an actual binning of the data.

IV. LP-BASED INFERENCE
When discussing the skew-G density model in (1), we have seen that f ≡ g if and only if d(u; G, F) = 1 for all u ∈ [0, 1]. Additionally, the graph of d̂(u; G, F) provides an exploratory tool to understand the nature of the deviation of f(x) from g(x). This section introduces a novel inferential framework to test the significance of the departure of f(x) from g(x). Specifically, our goal is to test the hypotheses

    H₀: d(u; G, F) = 1 for all u ∈ [0, 1]
    vs
    H₁: d(u; G, F) ≠ 1 for some u ∈ [0, 1].                          (21)

First, an overall test, namely the deviance test, is presented. The deviance test assesses if f(x) deviates significantly from g(x) anywhere over the range of x considered. Second, adequate confidence bands are constructed in order to assess where significant departures occur.

A. The deviance test
Recall that the LP_j coefficients in (4) specify as LP_j = ∫₀¹ Leg_j(u) d(u; G, F) du. Consequently, by orthogonality of the {Leg_j(u)}_{j>0} basis with Leg₀(u) = 1, when H₀ in (21) is true, all the LP_j coefficients are equal to zero, including the first M of them. We can then quantify the departure of d̂(u; G, F) from one by means of the deviance statistic [13], which specifies as Σ_{j=1}^M LP̂²_j. If the deviance is equal to zero, we may expect that g is approximately equivalent to f; hence, we test

    H₀: Σ_{j=1}^M LP²_j = 0   vs   H₁: Σ_{j=1}^M LP²_j > 0           (22)

by means of the test statistic

    D_M = n Σ_{j=1}^M LP̂²_j.                                         (23)

It can be shown [10] that, as n → ∞,

    √n LP̂_j →d N(0, 1),                                              (24)

where →d denotes convergence in distribution, and thus, under H₀, D_M is asymptotically χ²_M-distributed. Hence, an asymptotic p-value for (22) is given by

    P(D_M > d_M) → P(χ²_M > d_M)   as n → ∞,                         (25)

where d_M is the value of D_M observed on the data.
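The test in (23)–(25) can be sketched in a few lines of Python; the distributions, sample size, and truncation point below are this sketch's own illustrative choices.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

def leg_shifted(j, u):
    """Normalized shifted Legendre polynomial Leg_j on [0, 1]."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

def deviance_test(u, M):
    """Deviance statistic D_M = n * sum_j LPhat_j^2, Eq. (23), and its
    asymptotic chi2_M p-value, Eq. (25); u = G(x_1), ..., G(x_n)."""
    u = np.asarray(u)
    LP = np.array([leg_shifted(j, u).mean() for j in range(1, M + 1)])
    D_M = u.size * np.sum(LP ** 2)
    return D_M, stats.chi2.sf(D_M, df=M)

rng = np.random.default_rng(0)

# Under H0 (g = f), U = G(X) is uniform and the p-value is not small.
_, p_null = deviance_test(rng.uniform(size=2000), M=4)

# Under a misspecified model, U departs from uniformity: H0 is rejected.
_, p_alt = deviance_test(rng.beta(2.0, 1.2, size=2000), M=4)
assert p_alt < 1e-3 and p_null > p_alt
```

Note that only the u-sample enters the test: any postulated model G is handled by first transforming the data through u = G(x).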
Practical remarks. Notice that H₀ in (21) implies H₀ in (22). Similarly, H₁ in (22) implies H₁ in (21); however, the opposite implications are not true in general, since there may be some non-zero LP_j coefficients for j > M. Therefore, choosing a small M may lead to conservative, yet valid, inference.

B. Confidence bands
The estimator in (9) only accounts for the first M + 1 terms of the expansion in (4); therefore, d̂(u; G, F) is, in general, a biased estimator of d(u; G, F). Specifically, as discussed in Section II D, the integrated squared bias is given by Σ_{j>M} LP²_j, whereas, as shown in Appendix B, the bias at a given point u is given by Σ_{j>M} LP_j Leg_j(u). It follows that, when the bias is large, confidence bands based on d̂(u; G, F) are shifted away from the true density d(u; G, F).

Although the bias cannot be easily quantified in the general setting, it follows from (8) that, when H₀ in (21) (and consequently H₀ in (22)) is true, both the bias at a point u and the integrated bias are equal to zero. Thus, we can exploit this property to construct reliable confidence bands under the null. Specifically, the goal is to identify c_α such that

    1 − α = P(−c_α ≤ d̂(u; G, F) − 1 ≤ c_α, for all u ∈ [0, 1] | H₀)
          = P(max_u |d̂(u; G, F) − 1| ≤ c_α | H₀)                     (26)

where α is the desired significance level.

If the bias determines where the confidence bands are centered, the distribution and the variance of d̂(u; G, F) determine their width. As discussed in Section II C (see (8)), under H₀ in (21), the LP̂_j estimates have mean zero and variance 1/n, and they are uncorrelated with one another. Therefore, when f ≡ g, the standard error of d̂(u; G, F) corresponds to the square root of (10) with σ²_j = 1 and σ_jk = 0, i.e.,

    SE[d̂(u; G, F) | H₀] = √( Σ_{j=1}^M Leg²_j(u)/n ).                (27)

Additionally, (24) implies that d̂(u; G, F) is asymptotically normally distributed; hence

    (d̂(u; G, F) − 1) / √( Σ_{j=1}^M Leg²_j(u)/n ) →d N(0, 1)
(28) as n → ∞, for all u ∈ [0, 1], under H₀.

We can then construct approximate confidence bands under H₀ which satisfy (26) by means of tube formulae (see [17, Ch. 5] and [18]), i.e.,

    [1 − c_α √(Σ_{j=1}^M Leg²_j(u)/n),  1 + c_α √(Σ_{j=1}^M Leg²_j(u)/n)],   (29)

where c_α is the solution of

    2(1 − Φ(c_α)) + (k₀/π) e^{−c²_α/2} = α,                          (30)

with k₀ a constant determined by the derivatives ∂Leg_j(u)/∂u of the basis, j = 1, . . . , M. If d̂(u; G, F) is within the bands in (29) over the entire range [0, 1], we conclude that there is no evidence that f deviates significantly from g anywhere over the range considered, at confidence level 1 − α. Conversely, we expect significant departures to occur in regions where d̂(u; G, F) lies outside the confidence bands.

Practical remarks.
Notice that, under H₀ in (22), d̂(u; G, F) is an unbiased estimator of d(u; G, F), regardless of the choice of M. This implies that the confidence bands in (29) are only affected by the variance and asymptotic distribution of d̂(u; G, F) under H₀.

In astrophysics, the statistical significance α is often expressed in terms of the number of σ-deviations from the mean of a standard normal. For instance, a 2σ significance corresponds to α = 1 − Φ(2) ≈ 0.0228, where Φ(·) denotes the cdf of a standard normal.

C. Choice of M

The number of LP̂_j estimates considered determines the level of "smoothness" of d̂(u; G, F), with smaller values of M leading to smoother estimates. The deviance test can be used to select the value M which maximizes the sensitivity of the analysis according to the following scheme:

i. Choose a sufficiently large value M_max.
ii. Obtain the estimates LP̂₁, . . . , LP̂_{M_max} as in (5).
iii. For m = 1, . . . , M_max, calculate the deviance test p-value as in (25), i.e.,

    p(m) = P(χ²_m > d_m)                                             (31)

with d_m = n Σ_{j=1}^m LP̂²_j.
iv. Choose M such that

    M = argmin_m {p(m)}.                                             (32)
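Steps i–iv above can be sketched as follows (the simulated departure from uniformity and all numerical values are illustrative assumptions of this sketch):

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

def leg_shifted(j, u):
    """Normalized shifted Legendre polynomial Leg_j on [0, 1]."""
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

def select_M(u, M_max=20):
    """Steps i-iv: choose the M minimizing the deviance p-value p(m)."""
    u = np.asarray(u)
    LP = np.array([leg_shifted(j, u).mean() for j in range(1, M_max + 1)])
    d = u.size * np.cumsum(LP ** 2)            # d_m = n * sum_{j<=m} LP_j^2
    pvals = stats.chi2.sf(d, df=np.arange(1, M_max + 1))       # Eq. (31)
    return int(np.argmin(pvals)) + 1, pvals                    # Eq. (32)

rng = np.random.default_rng(3)

# A smooth departure from uniformity: most of the evidence sits in the
# low-order moments, so the selected M is typically small.
u = rng.beta(1.4, 1.0, size=3000)
M, pvals = select_M(u, M_max=20)
assert 1 <= M <= 20 and pvals[M - 1] == pvals.min()
```

Because d_m is a cumulative sum, all M_max candidate tests are obtained from a single pass over the LP̂_j estimates.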
1. Adjusting for post-selection
As with any data-driven selection process, the scheme presented above affects the distribution of (9) and can lead to overly optimistic inference [19, 20]. Despite this aspect being often ignored in practical applications, correct coverage can only be guaranteed if adequate corrections are implemented.

The issues arising in the context of post-selection inference can be interpreted in terms of the look-elsewhere effect [21, 22], where one has to adjust the inference for the fact that, in practice, many different models have been considered and, consequently, many different tests have been conducted for the sake of assessing the goodness of fit.

In our setting, the number of models under comparison is typically small (M_max ≤ 20); hence, a simple Bonferroni-type correction can be applied, and the deviance p-value in (25) is replaced by

    M_max · P(χ²_M > d_M),                                           (33)

where M is the value selected via (32).

As an anonymous referee correctly pointed out, d̂(u; G, F) is always smooth, as it is constructed as a series of infinitely differentiable functions. In statistics, however, the word "smoothness" is often used to indicate the flexibility of the estimator considered or, in other words, its degrees of freedom. Often, this is quantified in terms of the magnitude of the second derivative of the function considered. Despite the abuse of terminology, throughout the manuscript we will refer to this latter definition of smoothness.

Algorithm 1: a data-scientific signal search
INPUTS: source-free sample x_B;
        postulated background distribution g_b(x);
        physics sample x.
        If available: signal distribution f_s(x, θ_s).

PHASE A: background calibration
Step 1: estimate d̂(u; G_b, F_b) on u_B = G_b(x_B) and test (21) via deviance test and CD plot.
Step 2: if F_b ≢ G_b, set f̂_b(x) = g_b(x) d̂(u; G_b, F_b); else set f̂_b(x) = g_b(x).

PHASE B: signal search
Stage 1: nonparametric signal detection
Step 3: set g(x) = f̂_b(x).
Step 4: estimate d̂(u; G, F) on u = G(x) and test (21) via deviance test and CD plot.
Step 5: if G ≢ F, claim evidence in favor of the signal and go to Step 6; else set f̂(x) = g(x), claim that no signal is present, and stop.

Stage 2: semiparametric signal characterization
Step 6: if f_s(x, θ_s) is given, fit g_bs(x) in (38); else use the CD plot of d̂(u; G, F) and the theory available to specify a suitable model for f_s(x, θ_s), and fit g_bs(x) in (38).
Step 7: estimate d̂(u; G_bs, F) on u = G_bs(x) and test (21) via deviance test and CD plot.
Step 8: if G_bs ≢ F, claim evidence of an unexpected signal and use the CD plot of d̂(u; G_bs, F) and the theory available to further investigate the nature of the deviation from G_bs; else go to Step 9.
Step 9: compute d̂̂(u; G, F) as in (39) and use it to refine f̂_b(x) or f_s(x, θ̂_s) as in (40). Go back to Step 3.

Similarly, confidence bands can be adjusted by substituting c_α in (29) with c_{α,M_max} satisfying

    2(1 − Φ(c_{α,M_max})) + (k₀/π) e^{−c²_{α,M_max}/2} = α/M_max.    (34)
As noted in Section II, the estimate in (9) involves the first M sample moments of U; therefore, M_max can be interpreted as the order of the highest moment which we expect to contribute to discriminating the distribution of U from uniformity. Notice that, in addition to the inflation of the variance of (9) when M is large, the computation of normalized shifted Legendre polynomials of higher order may face numerical instability (see Section VIII B). Therefore, as a rule of thumb, M_max is typically chosen ≤ 20. Finally, Steps i–iv aim to select the approximant based on the first M most significant moments, while excluding powers of higher order. A further note on model-denoising is given in Section VIII.

V. A DATA-SCIENTIFIC APPROACH TO SIGNAL SEARCHES
The tools presented in Sections II and IV provide a natural framework to simultaneously

(a) assess the validity of the postulated background model and, if necessary, update it using the data (Section III);
(b) perform signal detection on the physics sample;
(c) characterize the signal when a model for it is not available.

Furthermore, if the model for the signal is known (up to some free parameters), it is possible to

(d) further refine the background or signal distribution;
(e) detect hidden signals from new unexpected sources.

Notice that, since Bonferroni's correction leads to an upper bound for the overall significance, the resulting coverage will be higher than the nominal one. Alternatively, approximate post-selection confidence bands and inference can be constructed using Monte Carlo and/or resampling methods, repeating the selection process at each replicate.

Tasks (a)–(e) can be tackled in two main phases. In the first phase, the postulated background model is "calibrated" on a source-free sample in order to improve the sensitivity of the analysis and reduce the risk of false discoveries. The second phase focuses on searching for the signal of interest and involves both a nonparametric signal detection stage and a semiparametric stage for signal characterization. Both phases and their respective steps are described in detail below and summarized in Algorithm 1.
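Phase A and the nonparametric part of Phase B (Steps 1–5 of Algorithm 1) can be sketched end to end. All simulated numbers below (exponential background, signal location, sample sizes, M = 4) are illustrative assumptions; in this toy run the postulated background happens to be correct, so Step 2 keeps g_b unchanged.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import stats

def leg_shifted(j, u):
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * np.asarray(u) - 1.0, c)

def lp_test(u, M):
    """Deviance p-value of Eq. (25) on u = G(x_1), ..., G(x_n)."""
    u = np.asarray(u)
    LP = np.array([leg_shifted(j, u).mean() for j in range(1, M + 1)])
    return stats.chi2.sf(u.size * np.sum(LP ** 2), df=M)

rng = np.random.default_rng(11)
alpha = 2.87e-7                  # 5-sigma discovery threshold (one-sided)

# Phase A: calibrate the postulated background G_b on a source-free sample.
G_b = stats.expon(scale=10.0)
x_B = G_b.rvs(size=5000, random_state=rng)
p_bkg = lp_test(G_b.cdf(x_B), M=4)            # Step 1

# Phase B, Stage 1: nonparametric detection on the physics sample,
# which contains a 5% Gaussian signal on top of the background.
x_phys = np.concatenate([G_b.rvs(size=9500, random_state=rng),
                         rng.normal(15.0, 1.0, size=500)])
p_sig = lp_test(G_b.cdf(x_phys), M=4)         # Steps 3-4

assert p_bkg > alpha     # Step 2: calibration sample is clean, keep g_b
assert p_sig < alpha     # Step 5: evidence in favor of the signal
```

The semiparametric stage (Steps 6–9) additionally requires the mixture model g_bs of (38), which is introduced later in the paper and is therefore omitted from this sketch.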
A. Background calibration
As discussed in Section III, deviations of d̂(G_b(x); G_b, F_b) from one suggest that a further refinement of the candidate background model g_b is needed. However, as M increases, the deviations of d̂(G_b(x); G_b, F_b) from one may become more and more prominent while the variance inflates. Thus, it is important to assess if such deviations are indeed significant. In order to address this task, the analysis of Section III can be further refined in light of the inferential tools introduced in Section IV.

FIG. 2: Deviance test and CD plot for the source-free sample. The comparison density is estimated via (20) (solid blue line), whereas its standard error (light blue area) is computed as the square root of the estimate of the variance in (10). Finally, confidence bands have been constructed around one (grey areas) via (29), with c_α replaced by c_{α,M_max} in (34). The significance is reported with a "≥" sign to highlight that Bonferroni's correction has been applied to adjust for post-selection inference, leading to an increase of the nominal coverage.

For the toy example discussed in Section III, we have seen that g_b overestimates f_b in the signal region and underestimates it at the higher end of the range considered (Fig. 1). We can now assess if any of these deviations are significant by implementing the deviance test in (23)–(25), whereas, to identify where the most significant departures occur, we construct confidence bands under the null model as in (29), i.e., assuming that no "update" of g_b is necessary.

The results are collected in the comparison density plot, or CD plot, presented in Fig. 2. First, with the value of M selected as in Section IV C, the deviance test indicates that the departure of f_b from g_b is significant at a level above 6σ (see the adjusted p-value reported in Fig. 2).
Additionally, the estimated comparison density in (20) lies outside the $2\sigma$ confidence bands over the region where the signal specified in (18) is expected to occur. Hence, using (19) instead of (17) is recommended in order to improve the sensitivity of the analysis in the signal region.

Important remarks on the CD plot.
When comparing different models for the background, or when assessing whether the data distribution deviates from the model expected when no signal is present, it is common practice to visualize the results of the analysis by superimposing the models under comparison on the histogram of the data observed on the original scale (e.g., upper panel of Fig. 1). This corresponds to a data visualization in the density domain. Conversely, the CD plot (e.g., Fig. 2) provides a representation of the data in the quantile domain, which offers the advantage of connecting the true density of the data with the quantiles of the postulated model (see (2)-(3)). Consequently, the most substantial departures of the data distribution from the expected model are magnified, and those due to random fluctuations are smoothed out (see, also, Section VII B). Furthermore, the deviance test and the CD plot together provide a powerful goodness-of-fit and exploratory tool which, unlike classical methods such as Anderson-Darling [24] and Kolmogorov-Smirnov [26], not only allows one to test whether the distributions under comparison differ, but also to assess how and where they differ. As a result, the CD plot can be used to characterize the unknown signal distribution (see Section V B 2) and to identify exclusion regions (e.g., Case I in Section V B 1).

As an additional advantage, the deviance test appears to enjoy higher detection power than classical approaches. This aspect is highlighted in Table I, where several methods for goodness of fit or two-sample comparisons are implemented, along with the deviance test, for all the cases discussed in Section V.
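In code, the estimator and test just described reduce to a few lines. The sketch below assumes an orthonormal shifted Legendre basis on [0, 1], coefficient estimates given by sample means of the basis evaluated at $u_i = G_b(x_i)$, and a deviance statistic $n\sum_j \widehat{LP}_j^2$ compared against a $\chi^2_M$ null; the function names and the exponential toy model are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.special import eval_legendre
from scipy.stats import chi2

def leg(j, u):
    # shifted Legendre polynomial of degree j, orthonormal on [0, 1]
    return np.sqrt(2 * j + 1) * eval_legendre(j, 2 * u - 1)

def cd_estimate(x, G_b, M):
    """Estimate d(u; G_b, F_b) from a sample x given the postulated
    background cdf G_b, and return the (unadjusted) deviance test."""
    u = G_b(x)                                   # data in the quantile domain
    LP = np.array([leg(j, u).mean() for j in range(1, M + 1)])
    d_hat = lambda t: 1.0 + sum(c * leg(j, t) for j, c in enumerate(LP, 1))
    deviance = len(x) * np.sum(LP**2)            # n * sum of squared LP coefficients
    pval = chi2.sf(deviance, df=M)
    return d_hat, deviance, pval

# toy check: data generated from the postulated model itself
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=5000)
G_b = lambda t: 1.0 - np.exp(-t / 2.0)           # correct background cdf
d_hat, dev, pval = cd_estimate(x, G_b, M=4)      # d_hat should hover near one
```

When `G_b` is misspecified (e.g., a wrong scale parameter), the deviance explodes and the p-value collapses, mirroring the calibration logic of this section.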
Reliability of the calibrated background model.
The size $N$ of the source-free sample plays a fundamental role in the validity of $\hat{f}_b(x)$ as a reliable background model. Specifically, the randomness involved in (19) only depends on the $\widehat{LP}_j$ estimates. If $N$ is sufficiently large, by the strong law of large numbers,

$P\big(\lim_{N \to \infty} \widehat{LP}_j = LP_j\big) = 1.$

Therefore, despite the variance of $\hat{f}_b(x)$ becoming negligible as $N \to \infty$, one has to account for the fact that $\hat{f}_b(x)$ leads to a biased estimate of $f_b(x)$ when $f_b \not\equiv g_b$ (see Section II D). For sufficiently smooth densities, a visual inspection is often sufficient to assess whether $\hat{d}(u; G_b, F_b)$ (and, consequently, $\hat{f}_b(x)$) provides a satisfactory fit for the data, whereas, for more complex distributions, the effect of the bias can be mitigated by considering larger values of $M$ and model-denoising (see Section VIII A).

B. Signal search
1. Nonparametric signal detection
The background calibration phase allows the specification of a well-tailored model for the background, namely $\hat{f}_b(x)$, which simultaneously integrates the initial guess, $g_b$, and the information carried by the source-free data sample. Hereafter, we disregard the source-free data sample and focus on analyzing the physics sample.

Under the assumption that the source-free sample has no significant source contamination, we expect that, if the signal is absent, both the source-free and the physics sample follow the same distribution. Therefore, the calibrated background model, $\hat{f}_b(x)$, plays the role of the postulated distribution for the physics sample, i.e., the model that we expect the data to follow when no signal is present; hence, we set $g(x) = \hat{f}_b(x)$. Let $f(x)$ be the (unknown) true pdf of the physics sample, which may or may not carry evidence in favor of the source of interest. When no model for the signal is specified, it is reasonable to consider any significant deviation of $f$ from $g$ as an indication that a signal of unknown nature may be present.

TABLE I: Comparison of the deviance test and classical inferential tools. The first two columns report the p-values of the Anderson-Darling [24] and Cramér-von Mises [26] goodness-of-fit tests obtained assuming as theoretical distribution the same $G$ indicated in Sections V A and V B for the calibration phase and Cases I, II and III, respectively. The raw deviance p-values and their post-selection adjusted counterparts are reported in the third column. Finally, the fourth and fifth columns report, respectively, the Kolmogorov-Smirnov [26] and Wilcoxon rank sum [25] tests used to compare directly the physics samples in Cases I, II and III with the source-free sample used in Section V A.

Sample      | Anderson-Darling | Cramér-von Mises | Deviance (adjusted)  | Kolmogorov-Smirnov | Wilcoxon rank sum
Calibration | 1.…·10^-…        | …·10^-…          | …·10^-… (6.…·10^-…)  | -                  | -
Case I      | 0.…              | 0.…              | 0.… (> 1)            | 0.9248             | 0.5487
Case II     | 4.…·10^-…        | …·10^-…          | …·10^-… (1.…·10^-…)  | 1.…·10^-…          | …·10^-…
Case III    | 4.…·10^-…        | …·10^-…          | …·10^-… (5.…·10^-…)  | 2.…·10^-…          | …·10^-…

FIG. 3: Deviance test and CD plots for Case I, where no signal is present (left panel), and Case II, where the signal is present (right panel). In both cases, the postulated distribution $G$ corresponds to the cdf of the calibrated background model in (35). For the sake of comparison, $d(u; G, F)$ has been estimated via (9) with $M = 4$ in both panels.

In what follows, we examine where deviations of $f$ from $g$ occur. Two possible scenarios are considered: a physics sample which collects only background data (Case I) and a physics sample of observations from both background and signal (Case II).

Case I: background-only.
Let x be a physics sample of $n$ observations whose true distribution $f(x)$ is equivalent to $f_b(x)$ in (16). We set

$g(x) = \hat{f}_b(x) = g_b(x)\,\hat{d}(G_b(x); G_b, F_b)$   (35)

where $g_b(x)$ and $\hat{d}(G_b(x); G_b, F_b)$ are defined as in (17) and (20), respectively. The resulting CD plot and deviance test are reported in the left panel of Fig. 3. When applying the scheme in Section IV C, none of the values of $M$ considered leads to significant results; therefore, for the sake of comparison with Case II below, we choose $M = 4$. Not surprisingly, the estimated comparison density approaches one over the entire range and lies entirely within the confidence bands. This suggests that the true distribution of the data does not differ significantly from the model which accounts only for the background. Similarly, the deviance test leads to very low significance (adjusted p-value > 1).

Case II: background + signal.
Let x be a physics sample of $n$ observations whose true distribution $f(x)$ is equal to $f_{bs}(x)$ in (36),

$f_{bs}(x) = (1-\eta)\, f_b(x) + \eta\, f_s(x)$   (36)

with $f_b(x)$ and $f_s(x)$ defined as in (16) and (18), respectively, and $\eta = 0.15$. The histogram of the data and the graph of $f_{bs}(x)$ are plotted in Fig. 4. As in Case I, we set $g(x)$ as in (35). The CD plot and deviance test in the right panel of Fig. 3 show a significant departure of the data distribution from the background-only model in (35). The maximum significance of the deviance is achieved at $M = 4$, leading to a rejection of the null hypothesis at a significance level of more than $11\sigma$ (adjusted p-value $= \dots \cdot 10^{-\dots}$). The CD plot shows a prominent peak at the lower end of the spectrum; hence, we conclude that there is evidence in favor of the signal, and we proceed to characterize its distribution as described in Section V B 2.

FIG. 4: Histogram of a physics sample of $n$ observations with true density $f_{bs}(x)$, together with the estimates $\hat{f}(x)$ computed as in (11), by plugging in the $\hat{d}(G(x); G, F)$ estimates obtained with $M = 4$ and $M = 9$, and the fitted model $g_{bs}(x)$ in (38).
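As Fig. 4 illustrates, moving from the quantile domain back to the density domain amounts to multiplying the postulated pdf by the estimated comparison density, i.e., $\hat{f}(x) = g(x)\,\hat{d}(G(x))$. Below is a minimal sketch of this Eq. (11)-style construction under the same assumptions as before (orthonormal shifted Legendre basis); the exponential postulated model and the function names are illustrative.

```python
import numpy as np
from scipy.special import eval_legendre
from scipy.stats import expon

def leg(j, u):
    # shifted Legendre polynomial of degree j, orthonormal on [0, 1]
    return np.sqrt(2 * j + 1) * eval_legendre(j, 2 * u - 1)

def density_estimate(x_grid, x_sample, g_pdf, g_cdf, M):
    """Density-domain estimate f^(x) = g(x) * d^(G(x)): the postulated
    pdf g is tilted by the comparison density estimated from the sample."""
    u_sample = g_cdf(x_sample)
    LP = np.array([leg(j, u_sample).mean() for j in range(1, M + 1)])
    u_grid = g_cdf(x_grid)
    d_hat = 1.0 + sum(c * leg(j, u_grid) for j, c in enumerate(LP, 1))
    return g_pdf(x_grid) * d_hat

# illustrative postulated model: an exponential background
rng = np.random.default_rng(11)
x = rng.exponential(2.0, size=3000)
grid = np.linspace(0.01, 10.0, 200)
f_hat = density_estimate(grid, x, expon(scale=2.0).pdf, expon(scale=2.0).cdf, M=4)
```

Since the data here are drawn from the postulated model itself, the tilt factor stays close to one and `f_hat` essentially reproduces the exponential pdf; with contaminated data it would instead bend toward the true density.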
2. Semiparametric signal characterization
The signal detection strategy proposed in Section V B 1 does not require the specification of a distribution for the signal. However, if a model for the signal is known (up to some free parameters), the analysis can be further refined by providing a parametric estimate of the comparison density and assessing whether additional signals from new unexpected sources are present.
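The parametric refinement step amounts to a standard ML fit of the signal fraction (and, in general, of $\theta_s$) in the mixture $(1-\eta)\hat{f}_b + \eta f_s$. The sketch below fixes $\theta_s$ for simplicity and uses an exponential stand-in for the calibrated background and a Gaussian stand-in for the signal; all names and numerical values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import expon, norm

# stand-ins (illustrative): calibrated background pdf and known signal pdf
f_b_hat = expon(scale=2.0).pdf
f_s = norm(loc=2.5, scale=0.3).pdf

def fit_eta(x):
    """ML estimate of the signal fraction eta in the mixture
    (1 - eta) * f_b_hat + eta * f_s, i.e., the Eq. (37)-(38)-type step."""
    def nll(eta):
        mix = (1.0 - eta) * f_b_hat(x) + eta * f_s(x)
        return -np.sum(np.log(np.clip(mix, 1e-300, None)))
    return minimize_scalar(nll, bounds=(0.0, 1.0), method="bounded").x

rng = np.random.default_rng(7)
n, eta_true = 4000, 0.15
n_s = rng.binomial(n, eta_true)
x = np.concatenate([rng.exponential(2.0, n - n_s), rng.normal(2.5, 0.3, n_s)])
eta_hat = fit_eta(x)          # should land near eta_true = 0.15
```

The fitted mixture $(1-\hat{\eta})\hat{f}_b + \hat{\eta} f_s$ then plays the role of the postulated model when re-running the deviance test and CD plot.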
Case IIa: background + (known) signal.
Assume that a model for the signal, $f_s(x, \theta_s)$, is given, with $\theta_s$ being a vector of unknown parameters. Since the CD plot in the right panel of Fig. 3 provides evidence in favor of the signal, we expect the data to be distributed according to the pdf

$\hat{f}_{bs}(x) = (1-\eta)\,\hat{f}_b(x) + \eta\, f_s(x, \theta_s), \quad 0 \leq \eta \leq 1,$   (37)

where $\hat{f}_b(x)$ is the calibrated background distribution in (35), and $\eta$ and $\theta_s$ can be estimated via Maximum Likelihood (ML). Letting $\hat{\eta}$ and $\hat{\theta}_s$ be the ML estimates of $\eta$ and $\theta_s$, respectively, we specify

$g_{bs}(x) = (1-\hat{\eta})\,\hat{f}_b(x) + \hat{\eta}\, f_s(x, \hat{\theta}_s)$   (38)

as the postulated model. For simplicity, let $f_s$ be fully specified as in (18); we construct the deviance test and the CD plot to assess whether (38) deviates significantly from the true distribution of the data. The scheme in Section IV C has been implemented with $M_{max} = 20$, and none of the values of $M$ considered led to significant results. The CD plot and deviance test for $M = 3$ (adjusted p-value > 1) suggest that no significant deviations occur; thus, (38) is a reliable model for the physics sample.

Moreover, we can use (38) to further refine our $\hat{f}_b(x)$ or $f_s(x, \hat{\theta}_s)$ distributions. Specifically, we first construct a semiparametric estimate of $d(G(x); G, F)$, i.e.,

$\hat{\hat{d}}(G(x); G, F) = \frac{(1-\hat{\eta})\,\hat{f}_b(x) + \hat{\eta}\, f_s(x, \hat{\theta}_s)}{\hat{f}_b(x)},$   (39)

and rewrite

$\hat{\hat{f}}_b(x) = \frac{\hat{f}_b(x)\,\hat{\hat{d}}(G(x); G, F) - \hat{\eta}\, f_s(x, \hat{\theta}_s)}{1-\hat{\eta}}, \qquad \hat{\hat{f}}_s(x) = \frac{\hat{f}_b(x)\,\hat{\hat{d}}(G(x); G, F) - (1-\hat{\eta})\,\hat{f}_b(x)}{\hat{\eta}}.$   (40)

In the upper right panel of Fig. 5, the true comparison density (grey dashed line) of our physics sample is compared with its semiparametric estimate computed as in (39) (pink dashed line) with $f_s(x, \hat{\theta}_s) = f_s(x)$ in (18). The graphs of two nonparametric estimates of $d(u; G, F)$, computed via (9) with $M = 4$ and $M = 9$, are also displayed. The semiparametric estimate recovers the true $d(u; G, F)$ almost exactly, whereas both nonparametric estimates show some discrepancies from the true comparison density. All the estimates suggest that there is only one prominent peak, in correspondence of the signal region.

When moving from the comparison density domain to the density domain in Fig. 4, the discrepancies between the nonparametric estimates and the true density $f(x)$ are substantially magnified. Specifically, when computing (9) and (11) with $M = 9$, the resulting $\hat{f}(x)$ exhibits high bias at the boundaries (dotted black line).

Case IIb: background + (unknown) signal.
When the signal distribution is unknown, the CD plot of $\hat{d}(u; G, F)$ can be used to guide the scientist in navigating across the different theories on the astrophysical phenomenon under study and in specifying a suitable model for the signal, i.e., $f_s$. The model proposed can then be validated, as in Case IIa, by fitting (38) and constructing deviance tests and CD plots. At this stage, the scientist has the possibility to iteratively query the data and explore the distribution of the signal by assuming different models. A viable signal characterization is achieved when no significant deviations of $\hat{d}(u; G_{bs}, F)$ from one are observed (e.g., see the upper left panel of Fig. 5). Notice that a similar approach can also be followed in the background calibration stage (Section V A) to provide a parametric characterization of the background distribution.

Boundary bias is a common problem among nonparametric density estimation procedures [e.g., 17, Ch.5, Ch.8]. When aiming for a nonparametric estimate of the data density $f(x)$, solutions exist to mitigate this problem [e.g., 27].

FIG. 5: Upper panels: deviance test and CD plots for Case IIa, where the signal is present, and the postulated distribution $G_{bs}$ corresponds to the cdf of the estimated background+signal model in (38) with $\hat{\eta} = 0.\dots$ and $M = 3$. Bottom panels: deviance test and CD plots for Case III where, in addition to the signal of interest, an additional resonance is present. The data are first analyzed considering the background-only pdf in (17) as the postulated model (left panel). The analysis is then repeated by assuming the fitted background + signal model in (38) as the postulated distribution (right panel). Both estimates of the comparison density in the left and right panels have been computed as in (9) with $M = 9$.

Case III: background + (known) signal + unexpected source.
The tools proposed so far can also be used to detect signals from unexpected sources whose pdfs are, by design, unknown. Suppose that the physics sample x contains $n$ observations and that the true distribution $f(x)$ is equal to $f_{bsh}(x)$,

$f_{bsh}(x) = (1-\eta_1-\eta_2)\, f_b(x) + \eta_1\, f_s(x) + \eta_2\, f_h(x)$   (41)

where $f_h(x)$ is the pdf of the unexpected signal, whose distribution is assumed to be normal with center at 37 and width 1.8. Let $f_b(x)$ and $f_s(x)$ be defined as in (16) and (18), respectively, and let $\eta_1 = 0.15$. As postulated model we first consider the background-only distribution, setting $g(x) = \hat{f}_b(x)$ in (35). The respective CD plot and deviance tests are reported in the bottom left panel of Fig. 5.

Choosing $M = 9$, as in (32), both the CD plot and the deviance test indicate a significant departure from the expected background-only model, and a prominent peak is observed in correspondence of the signal of interest centered around 25. A second but weaker peak appears to be right on the edge of our confidence bands, suggesting the possibility of an additional source. At this stage, if $f_s$ were unknown, we could proceed with a semiparametric signal characterization as in Case IIb. Assuming instead that the distribution of the signal of interest is known and given by (18), we fit (38), with $\hat{\eta}$ estimated via ML, aiming to capture a significant deviation in correspondence of the second bump. This is precisely what we observe in the bottom right panel of Fig. 5. Here the estimated comparison density deviates from one around 35, providing evidence in favor of an additional signal in this region. We can then proceed as in Case IIb by exploring the theories available and/or collecting more data to further investigate the nature and the cause of the unanticipated bump.

VI. SIGNAL DETECTION WITHOUT CALIBRATION SAMPLE AND MODEL SELECTION
There are situations where a source-free sample is simply not available, and thus the calibration phase in Section V A cannot be implemented. The tools described in Sections II and IV can, however, still be applied in order to perform signal detection and goodness-of-fit when a model for the signal is known, up to some free parameters. In this framework, we expect the data either to come only from the signal (with at most some negligible background contamination) or only from the background.

In order to illustrate how to proceed in this setting, we consider a dark matter search where the postulated model for dark matter γ-ray emissions is the one of [29, Eq. 29], i.e.,

$g_{DM}(y) = \frac{0.73\, M_\chi^{0.5}}{y^{1.5}\, k_{M_\chi}}\, \exp\Big\{-\frac{7.8\, y}{M_\chi}\Big\}$   (42)

with $y$ and $M_\chi$ measured in Teraelectronvolts (TeV), and $k_{M_\chi}$ a normalizing constant. The goal is to show that, when considering a background-only sample, the method proposed correctly rejects (42) as a suitable model for the data, whereas, when considering a dark matter sample, the dark matter model in (42) is "accepted".

To further increase the complexity of the problem, we consider a situation where the background sample corresponds to γ-ray emissions due to a pulsar, with distribution

$g_{PS}(y) = \frac{y}{k_\tau}\, \exp\Big\{-\Big(\frac{y}{y_0}\Big)^{\tau}\Big\},$   (43)

with $y_0$ fixed, $\tau > 0$, and $k_\tau$ a normalizing constant. Notice that, as discussed in [30], distinguishing γ-ray emissions due to pulsars from those due to dark matter is a particularly challenging task. The histograms of the two datasets considered are shown in Figure 6; the overlapping curves correspond to the best fit of the models in (42) and (43) on each sample.

FIG. 6: Dark matter and pulsar samples. The left panel corresponds to the histogram of a sample of 2000 observations simulated from the model in (42). The right panel corresponds to the histogram of a sample of 2000 observations simulated from the model in (43) with $\tau = 2$. The best fits of (42) and (43) are also reported as blue solid lines and black dashed lines, respectively, on top of each histogram.
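The two spectral models can be sketched as plain densities with numerically computed normalizing constants. In the code below, the dark-matter shape follows the standard $0.73\,x^{-1.5}e^{-7.8x}$ parametrization with $x = y/M_\chi$ (several constants printed in (42)-(43) are garbled in this copy), and the energy range, the $M_\chi$ value, the pulsar power-law index and the cutoff $y_0$ are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

Y_LO, Y_HI = 0.1, 100.0   # assumed search region in TeV (illustrative)

def g_dm(y, m_chi):
    """Dark-matter spectrum in the spirit of Eq. (42): the standard
    0.73 * x**-1.5 * exp(-7.8 x) shape with x = y / m_chi, normalized
    numerically over [Y_LO, Y_HI]."""
    shape = lambda t: 0.73 * m_chi**0.5 * t**-1.5 * np.exp(-7.8 * t / m_chi)
    k, _ = quad(shape, Y_LO, Y_HI)
    return shape(y) / k

def g_ps(y, tau, y0=0.5, index=1.5):
    """Pulsar-like cutoff power law in the spirit of Eq. (43); y0 and
    the power-law index are illustrative placeholders."""
    shape = lambda t: t**-index * np.exp(-((t / y0) ** tau))
    k, _ = quad(shape, Y_LO, Y_HI)
    return shape(y) / k

# sanity check: both models integrate to one over the search region
area_dm, _ = quad(lambda t: g_dm(t, m_chi=3.5), Y_LO, Y_HI)   # m_chi illustrative
area_ps, _ = quad(lambda t: g_ps(t, tau=2.0), Y_LO, Y_HI)
```

With the normalizing constants handled this way, either density can be plugged in as the postulated model $g$ of the CD-plot machinery, which is exactly how the two samples are compared against (42) and (43) below.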
Interestingly, for both samples, (42) and (43) provide a very similar fit to the data; hence the importance of correctly selecting the most adequate model or of excluding the dark matter hypothesis when observing emissions due to pulsars.

The upper panels of Figure 7 display the CD plots obtained by setting $g = g_{DM}$ in (42) as the postulated model and comparing it with the distribution of the dark matter sample (upper left panel) and of the background pulsar sample (upper right panel). Remarkably, the CD plots and the adjusted deviance tests correctly lead to the conclusion that the distribution of the dark matter sample does not deviate significantly from (42), whereas the distribution of the pulsar sample does deviate substantially from (42), and the deviance test (adequately adjusted for post-selection inference) rejects the dark matter model with more than $3\sigma$ significance (adjusted p-value of $4.87 \cdot 10^{-\dots}$). Notice that, in both cases, we are ignoring the information regarding the pulsar distribution, and the only inputs considered are the data and the signal model in (42).

Finally, when incorporating the knowledge of the pulsar distribution in (43) into the analysis, one can select between the models in (42) and (43) by constructing additional CD plots and deviance tests for both samples and setting $g = g_{PS}$ in (43). The results are shown in the lower panels of Figure 7. As expected, the dark matter model is rejected (lower left panel) with more than $2\sigma$ significance (adjusted p-value of 0.0108), whereas the pulsar sample shows no significant deviation from the pulsar model (adjusted p-value > 1, lower right panel).

FIG. 7: CD plots and deviance tests for the dark matter and pulsar samples. The upper left panel displays the CD plot and deviance test for the dark matter sample with $g = g_{DM}$ in (42). The upper right panel compares the distribution of the pulsar sample with the model in (42). The lower left panel displays the CD plot and deviance test for the dark matter sample with $g = g_{PS}$ in (43). The lower right panel displays the CD plot and deviance test for the pulsar sample with $g = g_{PS}$ in (43). For the plots on the left, the size of the basis selected is $M = 6$, whereas a different value of $M$ is selected for the plots on the right.

VII. BACKGROUND MISMODELLING DUE TO INSTRUMENTAL NOISE AND UPPER LIMITS CONSTRUCTIONS
When conducting real data analyses, one has to take into account that the data generating process is affected by both statistical and non-random uncertainty due to instrumental noise. As a result, even when a model for the background is known, the data distribution may substantially deviate from it due to the smearing introduced by the detector [e.g., 31]. In order to account for the instrumental error affecting the data, it is common practice to consider folded distributions, where the errors due to the detector are often modelled assuming a normal distribution or estimated via non-parametric methods [e.g., 32, 33]. In Section VII A, it is shown how the same approach described in Sections V A and V B 1 can be used to assess whether the instrumental error is negligible and, when it is not, how to update the postulated background model in order to incorporate the instrumental noise. Section VII B discusses upper limit constructions by means of comparison distributions.
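As a concrete (and deliberately simplified) illustration of a folded distribution, the sketch below convolves a clean background pdf with a constant-width Gaussian resolution kernel on a grid; real detector responses are energy dependent, and all names and values here are illustrative.

```python
import numpy as np

def fold(pdf, grid, sigma):
    """Fold a 'clean' pdf with a Gaussian detector-resolution kernel by
    direct numerical convolution on a grid; a toy stand-in for the
    folded distributions mentioned in the text (sigma assumed constant)."""
    dx = grid[1] - grid[0]
    # K[i, j] = probability density of measuring grid[i] given true grid[j]
    K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / sigma) ** 2)
    K /= K.sum(axis=0) * dx                 # each true energy smears to total mass one
    folded = K @ pdf(grid) * dx
    return folded / (folded.sum() * dx)     # renormalize on the truncated grid

grid = np.linspace(1.0, 100.0, 2000)
power_law = lambda x: x ** -2.4             # Eq. (44)-style shape (unnormalized)
smeared = fold(power_law, grid, sigma=2.0)
```

Comparing a source-free sample against the unfolded model via the CD plot, as done next, is precisely what reveals whether a smearing step of this kind is needed.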
A. Modelling the instrumental error
The data considered come from a simulated observation by the Fermi Large Area Telescope [28] with realistic representations of the effects of the detector and present backgrounds [21, 34]. The Fermi-LAT is a pair-conversion γ-ray telescope on board the earth-orbiting Fermi satellite. It measures energies and images γ-rays between about 100 MeV and several TeV. The goal of the analysis is to assess whether the data could result from the self-annihilation of a dark matter particle. Let the distribution of the astrophysical background be a power law, i.e.,

$g_b(x) = \frac{k_\phi}{x^{\phi+1}}$   (44)

where $k_\phi$ is a normalizing constant and $x$ is measured in Gigaelectronvolts (GeV). Equation (44) corresponds to the distribution we would expect the background to follow if there were no smearing of the detector. The left panel of Figure 8 shows the histogram of a source-free sample of 35,157 i.i.d. observations from a power-law distributed background source with index 2.4.

FIG. 8: Histograms of simulated Fermi-LAT samples. The left panel displays the histogram of a source-free simulated Fermi-LAT sample of N = 35,157 observations, whereas the black dashed line corresponds to the best fit of the power-law model in (44). The right panel shows the histograms of two simulated Fermi-LAT physics samples of n = 200 observations. The grey histogram corresponds to the background-only sample, whereas the blue histogram corresponds to the dark matter signal sample.

Estimating $\hat{d}(G_b(x); G_b, F_b)$ and $\hat{f}_b$ as in (9) and (19), the CD plot and deviance test in the left panel of Figure 9 indicate that the smearing introduced by the detector is not negligible; the resulting calibrated background model is

$\hat{f}_b(x) = \frac{k_{\hat{\phi}}}{x^{\hat{\phi}+1}}\,\Big(1 + \sum_{j=1}^{4} \widehat{LP}_j\, \mathrm{Leg}_j[G_b(x)]\Big),$   (45)

where $G_b(x)$ is the cdf of (44) and $\hat{\phi}$ is the ML estimate of $\phi$ in (44).

For the sake of comparison, the same analysis has been repeated considering 35,157 i.i.d. observations from a power-law background source with index 2.4 and no instrumental error. The respective CD plot and deviance test are shown in the right panel of Figure 9 and indicate that the power-law model in (44), with $\phi$ replaced by its MLE, provides a satisfactory fit for the data in this case.

FIG. 9: Deviance test and CD plot for the source-free simulated Fermi-LAT sample including the instrumental error (left panel) and for a simulated sample with no instrumental error (right panel).

B. Signal detection and upper limit construction
Once a calibrated background distribution has been obtained, we proceed with the signal detection phase by setting $g(x) = \hat{f}_b(x)$ in (45). Similarly to Section V B 1, two physics samples are given: one containing 200 observations from the background source, distributed as in (44), and the other containing 200 observations from a dark matter emission. The signal distribution from which the data have been simulated is the pdf of γ-ray dark matter energies in [29, Eq. 28]. The scheme in Section IV C selects $M = 3$; therefore, for the sake of comparison, we choose $M = 3$ for both samples. For the background-only sample, no significant deviation from the calibrated background model is observed, whereas, for the dark matter sample, a significant departure is detected with more than $3\sigma$ significance (adjusted p-value $= 4.552 \cdot 10^{-\dots}$). As in (39), it is possible to proceed with the signal characterization stage (see Section V B 2); however, in this setting, one has to account for the fact that the signal distribution must also include the smearing effect of the detector.

As an anonymous referee pointed out, it is important to discuss how upper limits and Brazil plots can be constructed via LP modelling and how they relate to the constructs discussed so far in this manuscript. Indeed, the confidence bands reported in the CD plots are themselves upper limits. Specifically, in the signal detection framework of Section V B 1, the confidence bands in (29) are constructed assuming that there is no signal in the data: they correspond to the regions where the comparison density estimator is expected to lie, at the $1-\alpha$ confidence level, if the data include background-only events. Conversely, any deviation from the confidence bands characterizes the quantiles at which the data distribution does not conform with the one postulated under the assumption that no signal is present.

When the interest is in identifying areas of the search region where deviations from the background model occur, one can exploit the fact that $u = G(x)$, and thus upper limits and classical "Brazil plots" based on the comparison density can be obtained by plotting (20) and the respective confidence bands in (29) as a function of $x$. This is shown, for our Fermi-LAT example, in the bottom panels of Figure 10. Indeed, the upper and bottom panels of Figure 10 carry essentially the same information in two different domains: the CD plots display the departure of $f$ from $g$ in the quantile domain, whereas the Brazil plots show the same differences in the frequency domain. For signal detection purposes, the bottom panels may be preferred, as they identify the locations where substantial deviations between the background and signal models occur, whereas the CD plots are more suitable for goodness-of-fit purposes, as they provide a simultaneous visualization of the differences occurring at each quantile of the distribution.

VIII. MODEL-DENOISING
As discussed in Section II D, the choice of $M$ affects the resulting estimator of $d(u; G, F)$ in terms of both bias and variance. When dealing with complex background distributions, a large value of $M$ may be necessary to reduce the bias of the estimated comparison density. At the same time, however, a large value of $M$ leads to an inflation of the variance. In other words, considering a basis of $M$ shifted Legendre polynomials may lead to overfitting. Practically speaking, overfitting leads to wiggly (i.e., non-smooth) estimates, and thus one may overcome this limitation by attempting to denoise the estimator in (9). Section VIII A reviews the model-denoising approach proposed by [8, 10], whereas Section VIII B briefly discusses inference and model selection in this setting. Finally, Section VIII C compares the results obtained with a full and a denoised solution on the examples of Section V.

A. AIC denoising
Let $\widehat{LP}_1, \ldots, \widehat{LP}_M$ be the estimates of the first $M$ coefficients of the expansion in (4). The most "significant" $LP_j$ coefficients are selected by sorting the respective $\widehat{LP}_j$ estimates so that

$\widehat{LP}^2_{(1)} \geq \widehat{LP}^2_{(2)} \geq \cdots \geq \widehat{LP}^2_{(M)}$

and choosing the value $k = 1, \ldots, M$ for which $\mathrm{AIC}(k)$ in (46) is maximum,

$\mathrm{AIC}(k) = \sum_{j=1}^{k} \widehat{LP}^2_{(j)} - \frac{2k}{n}.$   (46)

The AIC-denoised estimator of $d(u; G, F)$ is given by

$\hat{d}^{*}(u; G, F) = 1 + \sum_{j=1}^{k^{*}_M} \widehat{LP}_{(j)}\, \mathrm{Leg}_{(j)}(u)$   (47)

where $\widehat{LP}_{(j)}$ is the estimate whose square is the $j$-th largest among $\widehat{LP}^2_1, \ldots, \widehat{LP}^2_M$, $\mathrm{Leg}_{(j)}(u)$ is the respective shifted Legendre polynomial, and

$k^{*}_M = \arg\max_k \{\mathrm{AIC}(1), \ldots, \mathrm{AIC}(M)\}.$   (48)

Practical remarks.
Recall that the first $M$ coefficients $LP_j$ can be expressed as a linear combination of the first $M$ moments of $U$. Thus, the AIC-denoising approach selects the $LP_j$ coefficients which carry all the "sufficient" information on the first $M$ moments of the distribution.

FIG. 10: Deviance test, CD plots and Brazil plots for the simulated Fermi-LAT background-only sample of size 200 (left panels) and the simulated Fermi-LAT dark matter signal sample of size 200 (right panels). In both cases, the postulated distribution $G$ corresponds to the cdf of the calibrated background model in (45). For the sake of comparison, $d(u; G, F)$ has been estimated via (9) with $M = 3$ in both cases.

B. Inference after denoising
The deviance test can be used, as in Section IV C, to choose the size of the initial basis of $M$ polynomials among $M_{max}$ possible models. Then, the $k^{*}_M$ largest coefficients are chosen by maximizing (46). This two-step procedure selects $\hat{d}^{*}(u; G, F)$ in (47) from a pool of $M_{tot} = M_{max} + M(M-1)$ possible estimators. Therefore, the Bonferroni-adjusted p-value of the deviance test is given by

$M_{tot} \cdot P\big(\chi^2_{k^{*}_M} > d_{k^{*}_M}\big)$   (49)

with $d_{k^{*}_M} = n \sum_{j=1}^{k^{*}_M} \widehat{LP}^2_{(j)}$. Similarly, confidence bands can be constructed as

$\Bigg[\, 1 - c_{\alpha, M_{tot}} \sqrt{\sum_{j=1}^{k^{*}_M} \frac{\mathrm{Leg}^2_{(j)}(u)}{n}}\,,\;\; 1 + c_{\alpha, M_{tot}} \sqrt{\sum_{j=1}^{k^{*}_M} \frac{\mathrm{Leg}^2_{(j)}(u)}{n}}\,\Bigg]$   (50)

where $c_{\alpha, M_{tot}}$ is the solution of

$2\big(1 - \Phi(c_{\alpha, M_{tot}})\big) + \frac{k}{\pi}\, e^{-0.5\, c^2_{\alpha, M_{tot}}} = \frac{\alpha}{M_{tot}}.$   (51)

Practical remarks.
Given the possibility of denoising our solution, one may legitimately wonder why not consider a large value of M_max, e.g., M_max = 100, and then select k*_{M_max} directly. In other words, why should we first implement the procedure in Section IV C and only afterwards refine our estimator as in Section VIII A, and not vice versa? There are two main reasons why such an approach is discouraged. First, if the selection stage proposed in Section IV C is ignored, there is no guarantee that the resulting k*_{M_max} would include all the LP̂_j terms that provide the strongest evidence in favor of H1 in (22); therefore, the resulting p-value can in principle be larger than the one in (49). Indeed, the AIC criterion in (46) aims to improve the fit of the estimator to the data, whereas the deviance selection criterion in (32) aims to maximize the power of the inferential procedure. Second, choosing M_max = 100 is computationally unfeasible with most of the standard programming languages, such as R and Python, and the numerical computation of (9) may easily lead to divergent or inaccurate results.
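The estimators throughout are built on the normalized shifted Legendre basis. As a minimal numerical sketch (assuming the standard normalization Leg_j(u) = √(2j+1) P_j(2u − 1), orthonormal on [0, 1]), the basis can be evaluated with numpy; evaluating such polynomials at high degree is precisely where the numerical instabilities mentioned above arise.

```python
import numpy as np
from numpy.polynomial import legendre as L

def leg(j, u):
    """Normalized shifted Legendre polynomial on [0, 1]:
    Leg_j(u) = sqrt(2j + 1) * P_j(2u - 1), so that
    int_0^1 Leg_j(u) Leg_k(u) du = 1 if j == k, else 0."""
    coef = np.zeros(j + 1)
    coef[j] = 1.0
    return np.sqrt(2 * j + 1) * L.legval(2 * np.asarray(u) - 1.0, coef)

# midpoint-rule check of orthonormality on a fine grid
u = (np.arange(200_000) + 0.5) / 200_000
print(np.mean(leg(3, u) ** 2))          # ~ 1.0
print(np.mean(leg(2, u) * leg(3, u)))   # ~ 0.0
```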
C. Comparing full and denoised solution
Fig. 11 compares the fit of the estimators d̂(u; G, F) and d̂*(u; G, F) for the examples in Section V. For all the cases considered, M and k*_M have been selected as in (32) and (48) (see the second column of Table II).

FIG. 11: Comparison of the estimators d̂(u; G, F) (blue solid lines) and d̂*(u; G, F) (pink dashed lines) for the toy examples in Section V (panels: Calibration, Case I, Case II, Case III). The true comparison densities d(u; G, F) are displayed as dotted black curves. The light gray areas correspond to the confidence bands of the full solution, whereas the dark gray areas refer to the confidence bands of the denoised solution. In all the examples proposed, the latter is almost entirely overlapping with the former.

TABLE II: Model selection and inference for the toy examples in Section V. The second column reports the M and k*_M values selected as in (32) and (48), respectively. The third column collects the unadjusted deviance p-values for the full and denoised solutions. The Bonferroni-adjusted p-values, computed as in (33) and (49), are reported in the fourth column. The correction terms applied correspond to M_max = 20 for the full solution and M_tot = M_max + M(M − 1) for the denoised solution.

When no significance was achieved for any of the values of M considered, a small basis was nonetheless used to construct d̂(u; G, F), which was then further denoised in order to obtain d̂*(u; G, F). Table II shows the results of the deviance tests of the full and the denoised solutions for the examples in Section V. In half of the cases, k*_M = M, and the estimators d̂(u; G, F) and d̂*(u; G, F) overlap over the entire range [0, 1]. The inferential results were also approximately equivalent in the majority of the situations considered.

The main differences are observed in the analysis of the background-only physics sample (Case I). In this case, the deviance-selection procedure leads to non-significant results for all the values of M considered, with the minimum p-value observed at M = 18. For modelling purposes, and for the sake of comparison with the case where a signal is present, a basis of M = 18 polynomials was nonetheless considered; after denoising, the adjusted p-value of the denoised solution is substantially smaller than that of the full solution. This further emphasizes the importance of adjusting for model selection in order to avoid false discoveries.

These results suggest that the denoising approach can easily adapt to situations where a sparse solution is preferable (i.e., when only a few of the M coefficients LP_j are non-zero), without enforcing sparsity when many of the M coefficients considered are needed to adequately fit the data (e.g., bottom right panel of Fig. 11). From an inferential perspective, denoising can improve the sensitivity of the analysis; however, in order to avoid false discoveries, extra care needs to be taken when the deviance selection procedure leads to large p-values for all the M_max models considered.
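The denoising step of (46)–(48) amounts to ranking the squared coefficient estimates and keeping the prefix that maximizes a penalized fit criterion. The sketch below assumes an AIC-style penalty of 2k/n; the exact form of the criterion in (46) should be taken from the paper, so this is an illustration, not the definitive rule.

```python
import numpy as np

def denoise_aic(lp_hat, n):
    """AIC-style denoising of estimated LP coefficients (a sketch of (46)-(48)):
    keep the k largest LP_j^2 maximizing AIC(k) = sum_{j<=k} LP_(j)^2 - 2k/n.
    The 2k/n penalty is an assumption made for illustration."""
    lp = np.asarray(lp_hat, dtype=float)
    order = np.argsort(lp ** 2)[::-1]              # indices, largest LP^2 first
    lp2 = lp[order] ** 2
    aic = np.cumsum(lp2) - 2.0 * np.arange(1, lp.size + 1) / n
    k_star = int(np.argmax(aic)) + 1               # number of retained terms
    kept = np.zeros_like(lp)
    kept[order[:k_star]] = lp[order[:k_star]]      # zero out the discarded terms
    return k_star, kept
```

This naturally yields a sparse estimator when only a few coefficients are large, while retaining the full basis when many terms are needed, in line with the behavior described above.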
FIG. 12: Histograms of the NVSS samples. The left panels show the source-free sample with and without outliers (upper left and lower left panels, respectively). In both cases, the best fit of the Rayleigh model in (53) is displayed as a black curve. The right panel compares the source-free sample without outliers (grey histogram) with the source sample (blue histogram) truncated over the search area considered.

FIG. 13: Deviance test and CD plot for the NVSS source-free sample of size 28,739 compared to the Rayleigh distribution in (53).
IX. AN APPLICATION TO STACKING EXPERIMENTS
In radio astronomical surveys, stacking techniques are often used to combine noisy images or "stacks" in order to increase the signal-to-noise ratio and improve the sensitivity of the analysis in detecting faint sources [e.g., 35–37]. In polarized signal searches, for instance, a faint population of sources is considered when the median polarized intensity observed over the control regions differs significantly from the median of the region where the sources are expected to be present. In this context, under simplifying assumptions, the intensity of the source polarization is often assumed to follow a Rice distribution, i.e.,

$$f(x) = \frac{x\, e^{-\frac{x^2 + \nu^2}{2\sigma^2}}}{k_{\nu\sigma}}\,\mathrm{Bessel}_0\!\left(\frac{x\nu}{\sigma^2}\right) \qquad (52)$$

where Bessel_0(·) denotes the Bessel function of the first kind of order zero and k_{νσ} is a normalizing constant. Furthermore, (52) reduces to a Rayleigh pdf when no signal is present [38], i.e., when ν = 0. Below, it is shown how the methods described in Sections V A and V B 1 can be used to assess whether the Rayleigh distribution is a reliable model for the background and, when it is too simplistic, to investigate the impact of incorrectly assuming a Rayleigh distribution on the reliability of the analysis.

The data considered come from the NRAO VLA Sky Survey (NVSS) [39]. The NVSS is an astronomical survey of the Northern hemisphere carried out by the Very Large Array of the National Radio Astronomy Observatory. The NVSS has detected 1.8 million sources in total intensity, but only 14% of these have a reported polarized signal peak greater than 3σ [37]. The original source-free sample contained 29,739 observations over the range of polarized flux considered (in Jy/beam). The nominal noise in NVSS polarization is 0.00029 Jy/beam, and a reasonable threshold for the detection of an individual source is three times the noise. Hence, a source sample of 6,220 observations has been selected from positions where compact radio sources with a brightness in total intensity between 0 and 0.0009 Jy/beam are known to be present. Both the source-free and the source samples are assumed to be i.i.d. The histograms of the samples considered are shown in Fig. 12. (In statistics, an observation x_i is considered an outlier if x_i < Q_{0.25} − 1.5[Q_{0.75} − Q_{0.25}] or x_i > Q_{0.75} + 1.5[Q_{0.75} − Q_{0.25}], where Q_{0.25} and Q_{0.75} are the first and third sample quartiles.)

As a first step, we fit a Rayleigh distribution (adequately truncated over the observed range) to the source-free sample, i.e.,

$$g_b(x) = \frac{x\, e^{-\frac{x^2}{2\widehat\sigma^2}}}{k_{\widehat\sigma}} \qquad (53)$$

where k_{σ̂} is a normalizing constant and σ̂ is the scale parameter estimated on the source-free sample. In order to assess if (53) provides a good fit for the data, we estimate the comparison density d(G_b(x); G_b, F_b) by first selecting M as in (32) and then applying the AIC-based denoising approach described in Section VIII A. In this case, the denoising step retains k* of the M = 10 polynomial terms initially selected. The deviance test and the CD plot in Fig. 13 suggest that, despite the fact that the median of the data coincides with that of the Rayleigh model, overall the latter does not provide a good fit for the distribution of the source-free sample. Specifically, the data distribution shows a heavier right tail than expected under the Rayleigh assumption, whereas the first quantiles are overestimated by the Rayleigh model. Therefore, the researcher can either adopt a more refined parametric model for the background or consider the calibrated background distribution of the form in (19), which in our setting specializes to

$$\widehat f_b(x) = g_b(x)\left[1 + \sum_{j=1}^{M}\widehat{LP}_j\,\mathrm{Leg}_j\bigl(G_b(x)\bigr)\right] \qquad (54)$$

where G_b(x) is the cdf of (53) and the coefficients LP̂_j are those estimated on the source-free sample.

The strategy described in Section V B 1 allows us to identify where significant differences between the control and source samples occur. In order to assess the effect of incorrectly assuming a Rayleigh background, we compare the distribution of the physics sample with both the Rayleigh distribution and the calibrated background distribution in (54). Figure 14 reports the deviance tests and CD plots obtained on the physics sample when setting g(x) = g_b(x) as in (53) and g(x) = f̂_b(x) as in (54). Both analyses provide strong evidence that the distribution of the physics sample differs significantly from the postulated models f̂_b(x) and g_b(x), and the most substantial discrepancies occur on the right tail of the distribution. However, since the Rayleigh model underestimates the right tail of the background distribution (see Fig. 13), it leads to an artificially enhanced sensitivity in this region.

FIG. 14: Deviance tests and CD plots for the NVSS source sample assuming g(x) to be the calibrated background model in (54) (left panel) and letting g(x) be the pdf of the truncated Rayleigh distribution in (53) (right panel). In both cases the estimator of the comparison density has been denoised as described in Section VIII A.
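The calibration steps of this section can be sketched end to end on synthetic data. In the illustration below, a simulated Rice sample stands in for the NVSS source-free data, the Rayleigh model is not truncated, and the basis size M = 6 is arbitrary; all of these are assumptions made for illustration only.

```python
import numpy as np
from numpy.polynomial import legendre as L
from scipy import stats

rng = np.random.default_rng(1)
x = stats.rice.rvs(b=1.0, scale=0.5, size=5000, random_state=rng)  # stand-in data

# step 1: postulated background g_b -- a Rayleigh fit with location fixed at 0
sigma_hat = np.sqrt(np.mean(x ** 2) / 2)          # Rayleigh MLE of the scale
g_b = stats.rayleigh(scale=sigma_hat)

# normalized shifted Legendre basis on [0, 1]
def leg(j, u):
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * L.legval(2 * np.asarray(u) - 1.0, c)

# step 2: map the data through the postulated cdf and estimate the LP coefficients
u = g_b.cdf(x)
M = 6                                             # illustrative basis size
lp_hat = np.array([leg(j, u).mean() for j in range(1, M + 1)])

# step 3: calibrated background density, as in (19)/(54):
# f_b(x) = g_b(x) * [1 + sum_j LP_j * Leg_j(G_b(x))]
def f_b(t):
    ut = g_b.cdf(t)
    corr = 1.0 + sum(lp_hat[j - 1] * leg(j, ut) for j in range(1, M + 1))
    return g_b.pdf(t) * corr
```

Since each Leg_j integrates to zero on [0, 1], the correction leaves the total probability mass unchanged, so f_b still integrates to one over the support.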
The differences between the two CD plots are less prominent around the median expected under f̂_b(x) and g_b(x) (i.e., in correspondence of u = 0.5). The significance obtained when assuming the Rayleigh background in (53) is, however, noticeably higher than the one obtained under the calibrated background model in (54), consistent with the artificially enhanced sensitivity induced by the Rayleigh model on the right tail. Conversely, the calibrated background model in (54) allows us to safely compare the entire distribution of the polarized intensity in the source and control regions via CD plots and deviance tests without affecting the sensitivity of the analysis.

X. DISCUSSION
This article proposes a unified framework for signal detection and characterization under background mismodelling. From a methodological perspective, the methods presented here extend LP modelling to the inferential setting. The solution discussed is articulated in two main phases: a calibration phase, where the background model is "trained" on a source-free sample, and a signal search phase, conducted on the physics sample collected by the experiment. If a model for the signal is given, the method proposed allows the identification of hidden signals from new, unexpected sources and/or the refinement of the postulated background or signal distributions. Furthermore, the tools presented in this manuscript can easily be extended to situations where a source-free sample is not available and the background is unknown (up to some free parameters). As discussed in Section VI, however, in this setting the signal distribution is required to be known, and the physics sample is expected to contain only signal-like events, i.e., the background is almost completely reduced. The theory of Section II D and the analyses in Section V have highlighted that, although a fully nonparametric approach provides reliable inference, it may lead to unsatisfactory estimates when the postulated pdf g is substantially different from the true density f. In this setting, a semiparametric stage can be performed in order to provide a reliable model for the data. Each individual step in both the nonparametric and the semiparametric stages of Sections V B 2 and V A provides useful scientific insights on the signal and background distributions.
Hence, an automated implementation of the steps of Algorithm 1 based solely on the p-values of the deviance tests is discouraged, as it would lead to a substantial loss of scientific knowledge on the phenomena under study. Finally, it is important to point out that, despite this article's focus on one-dimensional searches on continuous data, all the constructs presented in Section II, as well as the deviance test in Section IV A, also apply to the discrete case when considering i.i.d. events. More work is needed to extend these results and those of Section IV B to searches in multiple dimensions and to settings with Poisson events with functional mean. In the first case, the difficulty mainly lies in generalizing the constructs of Section IV to account for the dependence structure occurring across multiple dimensions. In the second case, the main challenge lies in identifying the equivalent of (1) to model the mean of the distribution while incorporating the Poisson error.

CODE AVAILABILITY
The LPBkg Python package [40] and the LPBkg R package [41] implement the methods proposed in this manuscript. Detailed tutorials on how to use the functions provided are also available at http://salgeri.umn.edu/my-research.

ACKNOWLEDGMENTS
The author thanks Jeroen Stil, who provided the NVSS datasets used in Section IX, and Lawrence Rudnick, who first recognized the usefulness of the method proposed in the context of stacking experiments. Conversations with Subhadeep Mukhopadhyay have been of great help when this work was first conceptualized. Discussions and e-mail exchanges with Charles Doss and Chad Shafer are gratefully acknowledged. Finally, the author thanks an anonymous referee whose feedback has been substantial to improve the overall quality of the paper.
Appendix A: Moments of the LP̂_j estimates

Consider the general setting where f ≢ g and thus d(u; G, F) ≢ 1 over [0, 1]. It follows that each u_i is independently and identically distributed with pdf d(u; G, F); hence, all the expectations in E[LP̂_j], V(LP̂_j) and Cov(LP̂_j, LP̂_k) are taken with respect to d(u; G, F). Specifically,

$$E[\widehat{LP}_j] = E\Bigl[\frac{1}{n}\sum_{i=1}^{n}\mathrm{Leg}_j(U_i)\Bigr] = E[\mathrm{Leg}_j(U)] = \int_0^1 \mathrm{Leg}_j(u)\, d(u; G, F)\, du = LP_j,$$

where the second equality follows from the fact that each observed value u_i is a realization of a random variable U_i, and each of U_1, ..., U_n is identically distributed as the random variable U, whose pdf is given by the comparison density d(u; G, F). Notice that if d(u; G, F) = 1, then LP_j = ∫₀¹ Leg_j(u) du = 0, from which the first equivalence in (8) follows. Moreover,

$$V(\widehat{LP}_j) = \frac{1}{n^2}\, V\Bigl(\sum_{i=1}^{n}\mathrm{Leg}_j(U_i)\Bigr) = \frac{1}{n}\, V\bigl(\mathrm{Leg}_j(U)\bigr) = \frac{\sigma^2_j}{n},$$

where $V(\mathrm{Leg}_j(U)) = \int_0^1 (\mathrm{Leg}_j(u) - LP_j)^2\, d(u; G, F)\, du = \sigma^2_j$. The second equality holds because the u_i are independent and identically distributed. Notice that if d(u; G, F) = 1, then σ²_j = ∫₀¹ Leg²_j(u) du = 1 by orthonormality of the Leg_j(u) polynomials. Hence the second equivalence in (8) holds. Finally,

$$\mathrm{Cov}(\widehat{LP}_j, \widehat{LP}_k) = \mathrm{Cov}\Bigl(\frac{1}{n}\sum_{i=1}^{n}\mathrm{Leg}_j(U_i),\, \frac{1}{n}\sum_{i=1}^{n}\mathrm{Leg}_k(U_i)\Bigr) = \frac{1}{n}\,\mathrm{Cov}\bigl(\mathrm{Leg}_j(U), \mathrm{Leg}_k(U)\bigr) = \frac{\sigma_{jk}}{n};$$

also in this case, the second equality follows by independence and identical distribution of each u_i, and

$$\mathrm{Cov}\bigl(\mathrm{Leg}_j(U), \mathrm{Leg}_k(U)\bigr) = \int_0^1 (\mathrm{Leg}_j(u) - LP_j)(\mathrm{Leg}_k(u) - LP_k)\, d(u; G, F)\, du = \int_0^1 \mathrm{Leg}_j(u)\mathrm{Leg}_k(u)\, d(u; G, F)\, du - LP_j\, LP_k = \sigma_{jk}. \qquad (A1)$$

Because of the orthogonality of the Leg_j(u) polynomials, σ_jk = 0 when d(u; G, F) = 1. Hence the third equivalence in (8) follows.
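The moment identities above are easy to verify by simulation: when d(u; G, F) = 1, i.e. the u_i are uniform on [0, 1], the estimates LP̂_j should be centered at zero with variance 1/n. A minimal sketch, assuming the normalized shifted Legendre basis Leg_j(u) = √(2j+1) P_j(2u − 1):

```python
import numpy as np
from numpy.polynomial import legendre as L

def leg(j, u):
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * L.legval(2 * np.asarray(u) - 1.0, c)

rng = np.random.default_rng(7)
n, reps, j = 100, 4000, 3
# each replicate: LP_hat_j = mean of Leg_j over n uniform draws (d = 1 case)
lp_hat = np.array([leg(j, rng.uniform(size=n)).mean() for _ in range(reps)])

print(lp_hat.mean())      # ~ 0   (E[LP_j] = LP_j = 0 when d = 1)
print(n * lp_hat.var())   # ~ 1   (V(LP_j) = sigma_j^2 / n with sigma_j^2 = 1)
```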
Appendix B: Bias, variance and MISE of d̂(u; G, F)

Given a point u over [0, 1], the bias of (9) at u is

$$\mathrm{Bias}\bigl[\hat d(u; G, F)\bigr] = E\bigl[\hat d(u; G, F)\bigr] - d(u; G, F) \qquad (B1)$$
$$= \sum_{j=1}^{M} E[\widehat{LP}_j]\,\mathrm{Leg}_j(u) - \sum_{j>0} LP_j\,\mathrm{Leg}_j(u) \qquad (B2)$$
$$= -\sum_{j>M} LP_j\,\mathrm{Leg}_j(u); \qquad (B3)$$

here (B2) follows from (4) and (9). The integrated squared bias is then

$$\mathrm{ISB} = \int_0^1 \Bigl(\sum_{j>M} LP_j\,\mathrm{Leg}_j(u)\Bigr)^2 du \qquad (B4)$$
$$= \sum_{j>M} LP^2_j \int_0^1 \mathrm{Leg}^2_j(u)\, du + 2\sum_{M<j<k} LP_j\, LP_k \int_0^1 \mathrm{Leg}_j(u)\mathrm{Leg}_k(u)\, du \qquad (B5, B6)$$
$$= \sum_{j>M} LP^2_j, \qquad (B7)$$

where (B7) holds because of the orthonormality of the Leg_j(u) polynomials. Notice that

$$\mathrm{ISB} = \sum_{j>M} LP^2_j = \sum_{j>0} LP^2_j - \sum_{j=1}^{M} LP^2_j \qquad (B8)$$
$$= \int_0^1 \bigl(d(u; G, F) - 1\bigr)^2 du - \sum_{j=1}^{M} LP^2_j \qquad (B9)$$
$$= \int_0^1 \left(\frac{f(G^{-1}(u)) - g(G^{-1}(u))}{g(G^{-1}(u))}\right)^2 du - \sum_{j=1}^{M} LP^2_j, \qquad (B10)$$

where (B9) follows by Parseval's identity, whereas (B10) follows from (2). The variance of (9) at a given point u is given by

$$V\bigl[\hat d(u; G, F)\bigr] = V\Bigl(\sum_{j=1}^{M}\widehat{LP}_j\,\mathrm{Leg}_j(u)\Bigr) \qquad (B11)$$
$$= \sum_{j=1}^{M}\mathrm{Leg}^2_j(u)\, V(\widehat{LP}_j) + 2\sum_{j<k}\mathrm{Leg}_j(u)\mathrm{Leg}_k(u)\,\mathrm{Cov}(\widehat{LP}_j, \widehat{LP}_k) \qquad (B12, B13)$$
$$= \sum_{j=1}^{M}\frac{\sigma^2_j}{n}\,\mathrm{Leg}^2_j(u) + 2\sum_{j<k}\frac{\sigma_{jk}}{n}\,\mathrm{Leg}_j(u)\mathrm{Leg}_k(u). \qquad (B14)$$

By orthonormality of the polynomials Leg_j(u), the integral of (B14) over [0, 1] is

$$\int_0^1 V\bigl[\hat d(u; G, F)\bigr]\, du = \sum_{j=1}^{M}\frac{\sigma^2_j}{n}. \qquad (B15)$$
Finally, the MISE is

$$\mathrm{MISE}\bigl[\hat d(u; G, F)\bigr] = E\left[\int_0^1 \bigl(\hat d(u; G, F) - d(u; G, F)\bigr)^2 du\right] \qquad (B16)$$
$$= \int_0^1 E\Bigl[\bigl(\hat d(u; G, F) - d(u; G, F)\bigr)^2\Bigr] du \qquad (B17)$$
$$= \int_0^1 V\bigl[\hat d(u; G, F)\bigr] + \Bigl(E\bigl[\hat d(u; G, F)\bigr] - d(u; G, F)\Bigr)^2\, du \qquad (B18, B19)$$
$$= \sum_{j=1}^{M}\frac{\sigma^2_j}{n} + \sum_{j>M} LP^2_j, \qquad (B20)$$

where (B17) holds because of the Fubini–Tonelli theorem, whereas the last equality follows by (B7) and (B15).

[1] Aprile, E., et al. Physical Review Letters 119.18 (2017): 181301. URL: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.119.181301
[2] Agnese, R., et al. Physical Review D 99.6 (2019): 062001.
[3] Smith, R. and Thrane, E. Physical Review X 8.2 (2018): 021019.
[4] Sirunyan, A.M., et al. Physics Letters B 793 (2019): 320-347.
[5] Yellin, S. Physical Review D 66.3 (2002): 032005.
[6] Priel, N., et al. Journal of Cosmology and Astroparticle Physics 2017.05 (2017): 013.
[7] Dauncey, P.D., et al. Journal of Instrumentation 10.04 (2015): P04015.
[8] Mukhopadhyay, S., and Parzen, E. arXiv preprint arXiv:1405.2601 (2014).
[9] Mukhopadhyay, S., and Wang, K. To appear in: Biometrika (2019+). arXiv[stat.ME]:1810.01724v2. Supplement available at: http://unitedstatalgo.com/wp-content/uploads/2019/09/LPKsample_Supp.pdf
[10] Mukhopadhyay, S. Electronic Journal of Statistics 11.1 (2017): 215-240. arXiv[stat.ME]:1509.06428
[11] Mukhopadhyay, S., and Fletcher, D. Scientific Reports 8.1 (2018): 9983.
[12] Mukhopadhyay, S., and Parzen, E. Journal of Risk and Financial Management 11.3 (2018): 37. arXiv[math.ST]:1308.0642
[13] Mukhopadhyay, S. Biometrics 72.2 (2016): 325-334. arXiv[math.ST]:1507.08727
[14] Mukhopadhyay, S. Journal of Nonparametric Statistics 30.4 (2018): 1003-1015. arXiv[stat.ME]:1805.02075
[15] Parzen, E. Statistical Science 19.4 (2004): 652-662.
[16] Sheather, S.J. and Jones, M.C. Journal of the Royal Statistical Society B 53.3 (1991): 683-690.
[17] Wasserman, L. All of Nonparametric Statistics. Springer Science & Business Media (2006).
[18] Pilla, R.S., Loader, C. and Taylor, C.C. Physical Review Letters 95.23 (2005): 230202.
[19] Shen, X., Huang, H.C., and Ye, J. Journal of the American Statistical Association 99.467 (2004): 751-762.
[20] Leeb, H., and Pötscher, B.M. Econometric Theory 21.1 (2005): 21-59. URL: https://pdfs.semanticscholar.org/307d/4f561831d98246c91424b84ff90f81b964eb.pdf
[21] Algeri, S., et al. Journal of Instrumentation 11.12 (2016): P12010.
[22] Gross, E. and Vitells, O. The European Physical Journal C 70.1-2 (2010): 525-530.
[23] Dunn, O.J. The Annals of Mathematical Statistics (1958): 1095-1111.
[24] Anderson, T.W., and Darling, D.A. The Annals of Mathematical Statistics 23.2 (1952): 193-212.
[25] Wilcoxon, F. Biometrics 3.3 (1947): 119-122.
[26] Darling, D.A. The Annals of Mathematical Statistics 28.4 (1957): 823-838.
[27] Efromovich, S. Nonparametric Curve Estimation. Springer Science & Business Media (2008).
[28] Atwood, W.B., et al. The Astrophysical Journal 697.2 (2009).
[29] Bergström, L., Ullio, P. and Buckley, J.H. Astroparticle Physics 9.2 (1998): 137-162.
[30] Baltz, E.A., Taylor, J.E., and Wai, L.L. The Astrophysical Journal Letters 659.2 (2007): L125.
[31] Lyons, L. PHYSTAT 2011 Workshop Proceedings (2011): 225-228.
[32] Panaretos, V. PHYSTAT 2011 Workshop Proceedings (2011): 229-239.
[33] Blobel, V. PHYSTAT 2011 Workshop Proceedings (2011): 240-251.
[34] Algeri, S., Conrad, J. and van Dyk, D.A. Monthly Notices of the Royal Astronomical Society: Letters 458.1 (2016): L84-L88.
[35] Brown, S., et al. The Astrophysical Journal Letters 740.1 (2011): L28.
[36] White, R.L., et al. The Astrophysical Journal 654.1 (2007): 99.
[37] Stil, J.M., et al. The Astrophysical Journal 787.2 (2014): 99.
[38] Simmons, J.F.L., and Stewart, B.G. Astronomy and Astrophysics 142 (1985): 100-106.
[39] Condon, J.J., et al. The Astronomical Journal 115.5 (1998): 1693.
[40] Zhang, Z. LPBkg: Detecting New Signals under Background Mismodelling, 2019. URL: https://pypi.org/project/LPBkg/
[41] Algeri, S. and Liu, H. LPBkg: Detecting New Signals under Background Mismodelling, 2019. URL: https://CRAN.R-project.org/package=LPBkg