Harmonizing discovery thresholds and reporting two-sided confidence intervals: a modified Feldman & Cousins method
Prepared for submission to JINST
K. D. Morå
Oskar Klein Centre, Department of Physics, Stockholm University, Albanova University Center, SE-10691 Stockholm, Sweden
E-mail: [email protected]
Abstract: When searching for new physics effects, collaborations will often wish to publish upper limits and intervals with a lower confidence level than the threshold they would set to claim an excess or a discovery. In this paper, a modification to the Feldman-Cousins method is proposed that allows for a transition from one-sided upper confidence limits for null results to two-sided confidence intervals for non-null results at any specified threshold chosen to define the observation of a signal, while maintaining exact coverage.

Keywords: Analysis and statistical methods, Dark Matter detectors (WIMPs, axions, etc.)

Many physics experiments, in particular searches for new physics, look for very low event rates where the asymptotic methods of constructing frequentist confidence intervals do not work. Confidence intervals are required to have coverage: $1-\alpha$ confidence-level intervals should contain the true value in a fraction $1-\alpha$ of repeated experiments. However, the actual coverage of a statistical method may vary with the true signal properties. For example, an asymptotic 0.68 confidence-level interval for a counting experiment observing $n$ events, $[n-\sqrt{n},\, n+\sqrt{n}]$, will cover the true expectation value $\mu$ asymptotically as $\mu \to \infty$, but may cover as little as 0.55 and as much as 1 depending on $\mu$.

A method that provides confidence intervals with exact coverage has been known since 1937 as the Neyman construction [1]. The Neyman construction starts by constructing a confidence belt for each possible true value of the parameter of interest $s$:

$$1-\alpha = \int_a^b f(x \mid s)\,dx \tag{1.1}$$

where $f(x \mid s)$ is the probability density function for the observable $x$, which may depend on $s$, and $[a, b]$ denote the limits of the confidence belt. The confidence interval on $s$ can then be found by constructing $a(s)$ and $b(s)$, which express the lower and upper ends of the range in which $x$ would fall $1-\alpha$ of the time if the true parameter of interest is $s$. Inverting these functions yields the Neyman construction limits for an observation $x$:

$$[a^{-1}(x),\, b^{-1}(x)] \tag{1.2}$$

The condition for the confidence belt provided in equation 1.1 is not unique, and the limits of the confidence belt have to be set by a boundary condition. This condition has traditionally been either the desire to set upper or lower limits (for example in the absence of a signal) or the desire to report (symmetric or asymmetric) two-sided intervals in the case of a measurement of a physical parameter. In some cases, such as in searches for a new particle, experiments may wish to report upper limits if they do not observe a discovery significance exceeding a set threshold, and a two-sided confidence interval otherwise. However, Feldman & Cousins [2] noted that switching between Neyman constructions based on the experimental outcome may lead to under-coverage, even if the individual constructions provide coverage. The suggested remedy (hereafter referred to as the "FC method") is to construct confidence intervals by a single Neyman construction that provides both upper limits and two-sided intervals, depending on the experimental result. The FC confidence belt, reviewed in section 2, uses the log-likelihood ratio to decide which regions of observable space to include first.
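The under-coverage caused by switching between constructions can be checked directly in a simple case. The sketch below is illustrative only: it assumes a Gaussian observable $y = x - b$ with unit standard deviation, a 90% one-sided upper limit reported when the significance is below $3\sigma$, and a 90% central interval reported above it. The function name `flip_flop_coverage` and its parameters are hypothetical and not taken from the paper.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal; unit standard deviation is an assumption


def flip_flop_coverage(s, alpha=0.1, z_switch=3.0):
    """Coverage at true signal s of an experiment that reports a one-sided
    upper limit when y = x - b < z_switch and a central interval otherwise."""
    z_one = PHI.inv_cdf(1 - alpha)        # one-sided quantile, about 1.28
    z_two = PHI.inv_cdf(1 - alpha / 2)    # two-sided quantile, about 1.64
    # Upper-limit branch: y < z_switch, and s is covered if s <= y + z_one.
    upper_branch = max(0.0, PHI.cdf(z_switch - s) - PHI.cdf(-z_one))
    # Two-sided branch: y >= z_switch, and s is covered if |y - s| <= z_two.
    low_edge = max(z_switch - s, -z_two)
    two_sided_branch = max(0.0, PHI.cdf(z_two) - PHI.cdf(low_edge))
    return upper_branch + two_sided_branch


if __name__ == "__main__":
    for s in (0.5, 2.0, 4.0, 10.0):
        print(f"s = {s:4.1f}: coverage = {flip_flop_coverage(s):.3f}")
```

Under these assumptions the coverage falls to 0.85 for signals around $s = 2$ and returns to 0.90 for large signals, even though each of the two constructions covers at 0.90 on its own.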
Figure 1 shows the upper and lower limits for a Gaussian observable $x$ with known background $b$ and standard deviation 1, for the FC construction in blue, and for an experiment that switches from upper limits to two-sided intervals if the discovery significance exceeds $3\sigma$ in green. Since this switch moves only the upper-limit line at, for example, $s = 2$, this approach will under-cover for this signal. The modification suggested in this paper, which is constructed to maintain coverage, is shown in orange, and may be interpreted as a coverage-conserving interpolation between a one- and a two-sided Neyman construction.

Conventionally, upper limits are reported with confidence levels of 95% or less, and two-sided intervals are presented only in the case of a discovery or at least some reasonably significant indication. The (one-sided) p-value of an indication is usually much smaller than the $\alpha$ of 5% or more implied by the $1-\alpha$ confidence interval. While presenting a two-sided interval without claiming a discovery does not pose a statistical problem (the fact that the confidence interval excludes the non-existence of a signal at some confidence level should not be confused with a discovery claim), in practice experimenters are reluctant to present a two-sided interval even if the FC method provides it. A common remedy is to report only the upper edge of the interval provided. This leads to a signal-dependent over-coverage or, equivalently, some confidence intervals or upper limits could be more constraining without violating coverage. In this paper, we suggest a modification of the FC method that will provide two-sided intervals only at a desired discovery threshold, while still providing a unified confidence interval calculation method and improving the coverage. The paper is organized as follows: in section 2 we briefly review the FC method and the procedure for assessing the existence of an excess or a discovery. In section 3 we introduce the modified version of the FC method, and we illustrate the method with a line-search example in section 4.

For an experiment where one measures some data $\vec{x}$ with a probability distribution $f(\vec{x} \mid s)$ that depends on a parameter $s$, the likelihood is given as $\mathcal{L}(s) = f(\vec{x} \mid s)$.
The method proposed by Feldman and Cousins uses the log-ratio $R$ between the likelihood at the value of $s$ that maximizes the likelihood, $\hat{s}$, and the likelihood given $s$:

$$R(s) = 2 \cdot \log\left[\mathcal{L}(\hat{s}) / \mathcal{L}(s)\right] \tag{2.1}$$

to decide which $\vec{x}$ to include. Either constructing the confidence belt from equation 1.1 with the constraint that the $\vec{x}$ with the lowest $R(s)$ are included first, or constructing the confidence belt directly in the $R(s)$ parameter,

$$1-\alpha = \int_0^{R_{\max,\alpha}(s)} f(R \mid s)\,dR \tag{2.2}$$

for each value of $s$, will yield the FC construction. The confidence interval, whether one- or two-sided, will be the region where $R(s) < R_{\max,\alpha}(s)$. Note that the threshold likelihood ratio $R_{\max,\alpha}(s)$ also depends on the parameter of interest. In the case that an experiment is looking to constrain a parameter $s$ that has a null hypothesis and lower bound, $s_0$, the method has to give confidence intervals that do not contain $s_0$ in a fraction $\alpha$ of the cases. For example, in searches for the production cross-section of an unknown particle, a fraction $\alpha$ of confidence intervals will exclude the no-signal null hypothesis.

The log-likelihood ratio $R(s)$ is typically also used to assess discovery significance with respect to the null hypothesis $s = s_0$. The p-value of $R_{\mathrm{result}}(s_0)$ under the null hypothesis is:

$$p_{\mathrm{result}}\left(R_{\mathrm{result}}(s_0)\right) = \int_{R_{\mathrm{result}}(s_0)}^{\infty} f(R \mid s_0)\,dR \tag{2.3}$$

This may also be inverted to yield discovery thresholds; $p^{-1}_{\mathrm{result}}(\alpha)$ is the discovery threshold for an excess with p-value $\alpha$. Note that this equation shows that at the null hypothesis $s = s_0$, the FC threshold for inclusion in the confidence interval, $R_{\max,\alpha}(s_0)$, implies a p-value of $\alpha$, and that a confidence interval that does not include $s_0$ implies a p-value below $\alpha$. Typical values for upper limits, and thus for the FC construction, are $\alpha = 0.1, 0.05, 0.01$. Using the FC method consistently will report two-sided intervals at those same thresholds. However, a conventional discovery threshold in particle physics is $5\sigma$, or $p \approx 3 \cdot 10^{-7}$, and experiments may not wish to publish measurements of excesses lower than, say, $3\sigma$, or $p \approx 1.3 \cdot 10^{-3}$. A pragmatic solution to this is to report only the upper edge of the confidence interval as an upper limit until the discovery significance has exceeded the required discovery threshold. This will lead to over-coverage, as one extends a confidence interval constructed to cover with a $1-\alpha$ frequency.

The aim of the modified method is to provide a construction with a desired discovery significance threshold, different from what the confidence level would imply, while maintaining the constant coverage of the pure FC construction. In figure 1, this modification is indicated with an orange line, showing that the modification does not change the FC construction at higher signals, while approaching the one-sided Neyman construction upper limit for low signals. In this illustration, the modification does not reach the median signal-free result of $x - b = 0$, but for higher discovery thresholds, such as $4\sigma$, even the median upper limit will be affected by the modification, as shown in the coverage plots in figure 6 for the example in section 4.

We wish to include all results where the discovery significance is less than the reporting threshold in our Neyman confidence belt, while maintaining coverage for all signals. To accomplish this, we will treat upwards and downwards fluctuations separately, and include all upwards fluctuations that do not rise to the reporting threshold in our band. This will require constricting the confidence band for downwards fluctuations to conserve coverage. To distinguish between upwards and downwards fluctuations, the proposed modification to the FC method uses an idea very similar to that used by the ATLAS Higgs search [3], where the ordering ratio $R$ is multiplied by the sign of $\hat{s} - s$:

$$R'(s) = \mathrm{sgn}(\hat{s} - s) \cdot R(s) \tag{3.1}$$
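As a concrete illustration of equations 2.1, 2.3 and 3.1, the sketch below evaluates the ordering parameter, its signed version and the discovery p-value for the simplest possible model. It assumes a single Gaussian observable $y = x - b$ with unit standard deviation and a boundary at $s_0 = 0$; the function names are hypothetical and chosen only for this example.

```python
from math import sqrt
from statistics import NormalDist


def log_likelihood_ratio(y, s):
    """R(s) = 2 log[L(s_hat) / L(s)] for y ~ Normal(s, 1) with s_hat >= 0."""
    s_hat = max(y, 0.0)                 # best fit, clipped at the boundary s0 = 0
    return (y - s) ** 2 - (y - s_hat) ** 2


def signed_ratio(y, s):
    """R'(s) = sgn(s_hat - s) * R(s), as in equation 3.1."""
    s_hat = max(y, 0.0)
    sign = (s_hat > s) - (s_hat < s)    # +1, 0 or -1
    return sign * log_likelihood_ratio(y, s)


def discovery_p_value(y):
    """p-value of R(0) under s = 0, using that R(0) = y^2 for y > 0."""
    r0 = log_likelihood_ratio(y, 0.0)
    return 1.0 if r0 == 0.0 else 1.0 - NormalDist().cdf(sqrt(r0))


if __name__ == "__main__":
    print(signed_ratio(3.0, 0.0), signed_ratio(1.0, 2.0), signed_ratio(-1.0, 1.0))
    print(discovery_p_value(3.0), discovery_p_value(5.0))
```

In this model an observation above the tested signal gives a positive $R'(s)$ and one below it a negative $R'(s)$, while the unsigned $R(s)$ cannot tell the two apart.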
Figure 1: Illustration of three constructions of upper and lower limits for a Gaussian observable $x$ with known background $b$. The green lines show the upper and lower limits as a function of $x - b$ for an experiment that sets a 90% upper limit for discovery significances below $3\sigma$ and uses a two-sided interval above. The blue lines show the FC upper and lower limits. The orange lines show the modified FC method, which like the FC method provides coverage for all true signals, but switches between a one- and a two-sided limit when the threshold significance of $3\sigma$ is reached. This leads the upper limit for this construction to approach the one-sided limit construction for low $x - b$.

The sign factor in equation 3.1 separates the cases where the data prefer a lower or a higher signal than the tested hypothesis. Close to a boundary, say a requirement that $s_0 \le \hat{s}$, $R'(s_0)$ can only be non-negative, and for slightly larger $s$, the distribution of $R'(s)$ will still be asymmetric between upwards and downwards fluctuations. The switch of sign of $R'(s)$ occurs as $\hat{s}$ approaches $s$, which is also where $R(s)$ approaches 0. Examples of the distributions of $R'(s)$ for the line search detailed in the next section are shown in figure 2, including a blue line indicating the range of $R'(s)$ corresponding to a 90% confidence level FC interval. Orange lines show that the modified FC band shifts to include more of the positive $R'(s)$ in order to avoid excluding excesses below the discovery thresholds indicated.

We will construct the edge of the confidence belt corresponding to upwards fluctuations, $R_+(s)$, first. We denote the confidence level of the interval by $1-\alpha$, and the p-value threshold for reporting an excess with a two-sided interval by $\gamma$.
To ensure that our confidence intervals exclude the null hypothesis when the discovery significance exceeds the reporting threshold, $R_+(s_0)$ must correspond to the discovery threshold $R_{\max,\gamma}(s_0)$ defined by equation 2.3. At large signals, we wish $R_+(s)$ to approach the FC edge $R_{\max,\alpha}(s)$. We accomplish this by interpolating between the two thresholds:

$$R_+(s) = w(s) \cdot R_{\max,\gamma}(s) + (1 - w(s)) \cdot R_{\max,\alpha}(s) \tag{3.2}$$

where $w(s)$ is a weighting function that monotonically decreases from 1 at $s = s_0$ to 0 as $s$ increases. The FC threshold function $R_{\max,\alpha}(s)$ is defined by equation 2.2. The freedom to choose $w(s)$ reflects the original freedom in the Neyman construction. However, we wish the confidence band to rapidly approach the FC band with increasing $s$. In some simple cases, such as a single Gaussian-distributed variable with known standard deviation, an observation with a discovery significance exactly equal to the threshold will have an $R'(s)$ curve that exactly divides observations below and above the discovery threshold, and the $R_+(s)$ curve may be constructed as the maximum of this curve and the FC threshold. This corresponds to the vertical line at $x - b = 3$ in figure 1. In the general case, one can generate toy-Monte Carlo observations with a discovery significance below the threshold, evaluate $R'_i(s)$ for all these observations, labelled with index $i$, and compute the maximum value at each signal, $R_{\mathrm{envelope}}(s) = \sup_i\left(R'_i(s)\right)$. Finally, we set the threshold $R_+(s)$ to be the greater of $R_{\mathrm{envelope}}(s)$ and $R_{\max,\alpha}(s)$. This construction is shown in figure 3.

Figure 2: Histograms of $R'(s)$ computed for toy-Monte Carlo simulations for 0 and 3 expected signal events in the upper and lower panel, respectively, according to the example in section 4. The best-fit signal rate $\hat{s}$ is constrained to be non-negative in the signal model and fit. The sharp boundary at $R'(s) = 0$ in the upper panel arises because the constraint $0 \le \hat{s}$ is applied to the best fit. Blue and orange bands show the 90% confidence band for the FC method and the modified FC method, respectively, with the latter shown both for a $3\sigma$ and a $4\sigma$ discovery threshold.

The lower edge of the interval, $R_-(s)$, is then defined so that for any signal, a fraction $1-\alpha$ of the $R'(s)$ values are contained between $R_-(s)$ and $R_+(s)$:

$$1-\alpha = \int_{R_-(s)}^{R_+(s)} f(R' \mid s)\,dR' \tag{3.3}$$

At $s = s_0$, the above equation would indicate a coverage of $1-\alpha$.
However, at the border of the domain for $s$, the distribution of $R'(s)$ will be peaked towards 0, as shown for $s = 0$ in figure 2. By defining the lower threshold of the confidence belt as $R_-(s_0) = -\epsilon$, where $\epsilon$ is an infinitesimally small positive number such that no $R'(s_0)$ are lower than $R_-(s_0)$, the coverage at the boundary can be arranged to be $1-\gamma$.
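The construction of $R_+(s)$ and $R_-(s)$ can be sketched for the single-Gaussian case mentioned above. The sketch assumes $y = x - b$ with unit standard deviation, $s_0 = 0$, a 90% confidence level and a $3\sigma$ discovery threshold, and it uses the fact that $R'(s)$ is monotonic in $y$ for this model, so that each belt can be written as an interval in $y$. All function names are hypothetical, and a realistic analysis would replace the analytic steps with toy-Monte Carlo simulations.

```python
from math import inf, sqrt
from statistics import NormalDist

PHI = NormalDist()  # y = x - b is assumed Gaussian with unit standard deviation


def fc_belt(s, alpha=0.1):
    """FC acceptance interval [y_lo, y_hi] for true signal s >= 0 (s0 = 0)."""
    def edges(c):                           # region where R(s) <= c
        y_hi = s + sqrt(c)
        if s == 0.0:
            return -inf, y_hi               # every y < 0 has R(0) = 0
        if s - sqrt(c) >= 0.0:
            return s - sqrt(c), y_hi        # s_hat = y, so R = (y - s)^2
        return (s * s - c) / (2 * s), y_hi  # s_hat = 0, so R = s^2 - 2 y s

    def content(c):                         # probability inside the region
        y_lo, y_hi = edges(c)
        below = 0.0 if y_lo == -inf else PHI.cdf(y_lo - s)
        return PHI.cdf(y_hi - s) - below

    lo, hi = 0.0, 100.0
    for _ in range(60):                     # bisection for R_max,alpha(s)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if content(mid) < 1 - alpha else (lo, mid)
    return edges(hi)


def modified_belt(s, alpha=0.1, z_disc=3.0):
    """Belt whose upper edge never excludes s for excesses below z_disc sigma."""
    fc_lo, fc_hi = fc_belt(s, alpha)
    if fc_hi >= z_disc:
        return fc_lo, fc_hi                 # FC edge already above the envelope
    if s == 0.0:
        return -inf, z_disc                 # boundary: coverage 1 - gamma
    # Equation 3.3: raise the lower edge until the belt holds 1 - alpha again.
    y_lo = s + PHI.inv_cdf(PHI.cdf(z_disc - s) - (1 - alpha))
    return y_lo, z_disc


def interval(y, belt, s_max=8.0, step=0.01):
    """Invert a belt on a grid: the range of s whose belt contains y."""
    accepted = []
    for i in range(int(round(s_max / step)) + 1):
        s = i * step
        y_lo, y_hi = belt(s)
        if y_lo <= y <= y_hi:
            accepted.append(s)
    return (min(accepted), max(accepted)) if accepted else None


if __name__ == "__main__":
    for y in (-1.0, 0.0, 2.5, 3.5):
        print(y, interval(y, fc_belt), interval(y, modified_belt))
```

With these assumptions an observation at $y = 2.5$ gives a two-sided FC interval but only an upper limit in the modified construction, while an observation at $y = -1$ gives a tighter upper limit in the modified construction than in the FC one.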
Figure 3: Construction of the modified FC threshold function, using curves of $R'(s)$ for multiple toy-Monte Carlo realizations, colored according to whether they exceed the $3\sigma$ discovery significance threshold (green) or not (gray). The thick blue curve shows the threshold corresponding to a 90% FC construction, while the orange curve, showing $R_+(s)$, is constructed to be equal to the discovery threshold, indicated with a dashed green line, at 0 signal, and to be greater than or equal to both the likelihood-ratio curves with discovery significance less than $3\sigma$ and the FC threshold.

Confidence intervals are constructed as in the FC case, from the intersections $R'(s) = R_+(s)$ for lower limits and $R'(s) = R_-(s)$ for upper limits. For increasing discovery thresholds, the confidence belt must move to include higher positive values of $R'(s)$, and $R_-(s)$ must increase correspondingly. Close to $s_0$, where the shift is the largest, $R_-(s)$ will approach the Neyman construction boundary for an upper-limit-only construction, which can provide empty confidence intervals for strong but finite downwards fluctuations of the background. The FC method yields higher limits in the downwards fluctuation regime, as illustrated in figure 1 for the Gaussian example, with the upper limit approaching zero asymptotically when the downwards fluctuation approaches negative infinity. Some experiments setting upper limits have adopted the CLs method [4], which penalizes the p-value to yield a signal-dependent over-coverage at low signal-background discrimination, approaching 1 for signals approaching $s_0$. Others have used a power constraint [5], where upper limits are not placed below a signal where the experiment has a certain discovery power. Direct detection experiments using the two-sided FC method [6, 7] have applied a power constraint corresponding to a $-1\sigma$ downwards fluctuation of the upper limit.
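A power constraint of this kind amounts to replacing any upper limit that falls below a chosen quantile of the background-only limit distribution by that quantile. The sketch below is a minimal illustration under stated assumptions: background-only toys are drawn from a unit Gaussian, limits are one-sided 90% limits $y + 1.28$, and the floor is placed at the $-1\sigma$ quantile. The function `power_constrained` and its arguments are hypothetical names, not an interface from any of the cited analyses.

```python
import random
from statistics import NormalDist

PHI = NormalDist()


def power_constrained(upper_limit, background_only_limits, n_sigma=-1.0):
    """Clip an upper limit at the n_sigma quantile of background-only limits.

    Returns the reported limit and the floor that was applied.
    """
    ordered = sorted(background_only_limits)
    quantile = PHI.cdf(n_sigma)                    # about 0.159 for -1 sigma
    floor = ordered[int(quantile * (len(ordered) - 1))]
    return max(upper_limit, floor), floor


if __name__ == "__main__":
    rng = random.Random(1)                         # fixed seed for repeatability
    z_one = PHI.inv_cdf(0.9)
    # Toy background-only experiments with one-sided 90% limits y + z_one.
    toys = [rng.gauss(0.0, 1.0) + z_one for _ in range(20000)]
    limit, floor = power_constrained(-0.5, toys)
    print(f"floor = {floor:.3f}, reported limit = {limit:.3f}")
```

Limits above the floor are reported unchanged, so the constraint only affects strong downwards fluctuations.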
The coverage properties of the power constraint applied to the modified FC confidence intervals have a simpler form than for the CLs method, with the coverage being $1-\beta(s)$ below the critical discovery power, where $\beta(s)$ is the discovery power, and 0.9 above it.

Figure 4: Background distribution (blue line), and signal distributions for $m = 5E_0$ and $m = 60E_0$ (orange and green), together with a histogram showing an example data set drawn from the background-only distribution. The axes show counts as a function of $E/E_0$.

As an example, we consider an experiment that observes events with energies $E_i$, and searches for a Gaussian signal line with a certain mass in the presence of a power-law background. The probability distribution function $f(E \mid s)$ has the form:

$$f(E \mid s) = \frac{s}{s+b} f_s(E) + \frac{b}{s+b} f_b(E) \tag{4.1}$$

$$f_s(E) \equiv \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(E-m)^2}{2\sigma^2}} \tag{4.2}$$

$$f_b(E) \equiv E^{\Gamma} \cdot \left[\int_{E_0}^{E_1} E^{\Gamma}\,dE\right]^{-1} \tag{4.3}$$

Here, $s$ is the signal expectation value, and $m$ and $\sigma$ are the mean and standard deviation of $f_s(E)$, with $\sigma$ proportional to $m$. The background power law $f_b(E)$ has a fixed expectation value $b$ and index $\Gamma = -2$. Both distributions are normalized between $E_0$ and $E_1$. The signal expectation value is not allowed to be negative in the fit, $0 \le \hat{s}$. In this example, the nuisance parameters affecting the distribution shapes, and the background expectation, are fixed. For most experiments, the likelihood will include a number of nuisance parameters. In that case, the ordering parameter $R$ may be based on the profiled likelihood instead, yielding the profile construction [8], where coverage is not ensured by construction, but must be investigated. The toy-Monte Carlo methods used to construct the confidence belt and to investigate coverage properties are identical to the ones used for the profile construction.

The extended unbinned likelihood for the observation of $N$ energies with values $E_i$ can be written:

$$\mathcal{L}(s \mid N, \vec{E}) = \mathrm{Pois}(N \mid s+b) \cdot \prod_{i=1}^{N} f(E_i \mid s) \tag{4.4}$$

Here, $\mathrm{Pois}(N \mid s+b)$ is the Poisson probability to observe $N$ events given an expectation value of $s+b$. The distribution of the signed ordering parameter $R'(s)$, defined from equations 2.1 and 3.1, is shown in figure 2 for a single signal mass and for two different true signal expectations, $s = 0$ and $s = 3$. As an example of using toy-Monte Carlo methods to determine the interpolation function $w(s)$, figure 3 shows multiple curves of $R'(s)$ from toy-Monte Carlo simulations with true signals ranging from 0 to 40, divided by whether the discovery significance, assessed with $R'(0)$, is above or below a $3\sigma$ discovery threshold. The weighting function $w(s)$ is chosen so that the modified FC threshold is equal to or greater than all the $R'(s)$ curves below the discovery threshold for all signals, until the $R_+(s)$ curve, in orange, meets the FC belt in blue. The 90% confidence belts derived from this construction are also shown in figure 2 as bands for $3\sigma$ and $4\sigma$ discovery thresholds.

Figure 5: Illustration of confidence interval constructions using the FC (blue bands) and the modified FC (orange bands) methods, shown in two panels, (a) and (b), as a function of $s$. Intersections between the $R'(s)$ curve, in black, and the thresholds define the upper and lower limits of the interval. For comparison, the FC construction is shown with the dashed black line, with confidence interval boundaries marked by blue dots.

The confidence interval construction for the power-law example is shown in figure 5, showing both a case where an excess with a p-value below 10% gives a two-sided interval, and a case where both constructions yield upper limits. The confidence interval consists of the signal range where $R'(s)$ is contained between the $R_+(s)$ and $R_-(s)$ curves.

The coverage of the FC and modified FC 90% confidence intervals is shown in figure 6 for $m = 5E_0$ and $m = 60E_0$. The pure FC method, shown as a blue line, provides the expected coverage. The green curves show the over-coverage of experiments using the FC construction with a threshold for reporting the lower limit of either a $3\sigma$ or a $4\sigma$ discovery significance. The modified FC method exhibits the desired coverage of 0.9. The effect of the shifted lower threshold shown in figure 5 can also be seen, with a greater change for the larger, $4\sigma$ discovery threshold.

Figure 6: Coverage as a function of signal expectation for the FC method (blue), the FC method including a $3\sigma$ or $4\sigma$ discovery threshold (green), and the modified FC method (orange), for two line masses. Panels: (a) $m = 5E_0$, $3\sigma$ threshold; (b) $m = 5E_0$, $4\sigma$ threshold; (c) $m = 60E_0$, $3\sigma$ threshold; (d) $m = 60E_0$, $4\sigma$ threshold. Median upper limits are indicated with dashed lines for the FC (blue) and modified FC (orange) cases.

This paper proposes a modified FC method in which the discovery significance threshold is different from the confidence level of the upper limits and intervals. For an example case, the coverage at 0 signal corresponds to the discovery significance, and moves to the required confidence level for all signals larger than 0. This allows experiments to avoid the over-coverage that results from expanding the standard FC intervals, and simplifies reporting or discussion of coverage properties. The intervals approach the one-sided upper limit for under-fluctuations of the data, motivating the application of a power-constraint lower signal threshold, or similar, to the confidence interval construction outlined in this paper. This will result in discrete coverage regimes that depend on the true signal size, and allow the experiment to directly set the discovery threshold and minimal discovery power independent of the confidence level of the interval.
Acknowledgements
The author would like to thank Jan Conrad and Jelle Aalbers for fruitful discussions and suggestions. This research was supported by a grant from the Knut and Alice Wallenberg Foundation (PI: J. Conrad).
References
[1] J. Neyman. Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Phil. Trans. Roy. Soc. Lond., A236(767):333–380, 1937. doi:10.1098/rsta.1937.0005.
[2] Gary J. Feldman and Robert D. Cousins. A Unified approach to the classical statistical analysis of small signals. Phys. Rev., D57:3873–3889, 1998. doi:10.1103/PhysRevD.57.3873.
[3] Georges Aad and others (ATLAS Collaboration). Combined search for the Standard Model Higgs boson in pp collisions at √s = 7 TeV with the ATLAS detector. Phys. Rev., D86:032003, 2012. doi:10.1103/PhysRevD.86.032003.
[4] A. L. Read. Modified frequentist analysis of search results (the CLs method). (CERN-OPEN-2000-205), 2000. URL http://cds.cern.ch/record/451614.
[5] Glen Cowan, Kyle Cranmer, Eilam Gross, and Ofer Vitells. Power-Constrained Limits. pre-print, 2011. [physics.data-an/1105.3166].
[6] E. Aprile and others (XENON Collaboration). Dark Matter Search Results from a One Ton-Year Exposure of XENON1T. Phys. Rev. Lett., 121(11):111302, 2018. doi:10.1103/PhysRevLett.121.111302.
[7] D. S. Akerib and others (LUX Collaboration). Results from a search for dark matter in the complete LUX exposure. Phys. Rev. Lett., 118(2):021303, 2017. doi:10.1103/PhysRevLett.118.021303.
[8] M. Tanabashi and others (PDG). Review of particle physics. Phys. Rev. D, 98:030001, Aug 2018. doi:10.1103/PhysRevD.98.030001. URL https://link.aps.org/doi/10.1103/PhysRevD.98.030001.