Another Look at Confidence Intervals: Proposal for a More Relevant and Transparent Approach
Steven D. Biller
Department of Physics, University of Oxford, Oxford OX1 3RH, UK
Scott M. Oser
Department of Physics & Astronomy, University of British Columbia, Vancouver V6T 1Z1, Canada
The behaviors of various confidence/credible interval constructions are explored, particularly in the region of low event numbers where methods diverge most. We highlight a number of challenges, such as the treatment of nuisance parameters, and common misconceptions associated with such constructions. An informal survey of the literature suggests that confidence intervals are not always defined in relevant ways and are too often misinterpreted and/or misapplied. This can lead to seemingly paradoxical behaviors and flawed comparisons regarding the relevance of experimental results. We therefore conclude that there is a need for a more pragmatic strategy which recognizes that, while it is critical to objectively convey the information content of the data, there is also a strong desire to derive bounds on model parameter values and a natural instinct to interpret things this way. Accordingly, we attempt to put aside philosophical biases in favor of a practical view to propose a more transparent and self-consistent approach that better addresses these issues.
INTRODUCTION
The ability to distill experimental results in a form relevant to theoretical models is fundamental to scientific inquiry. Yet the best approach for this is still a matter of considerable discussion and debate. At the heart of the issue is the desire to both objectively quantify results in a frequentist manner and also draw relevant inferences for specific models, which inherently requires a Bayesian context (i.e. a choice of prior) for those models. A failure to satisfactorily address both of these aspects has, in many cases, led to misinterpretation and misapplication that have not been mitigated by the adoption of new frequentist conventions. The impact is largest for experiments working in the region of low numbers of signal events, where different approaches diverge most. The confusion is not helped by the use of forms for the display of frequentist information that seem to suggest direct bounds on model parameter values or relative experimental sensitivities to such models, neither of which is necessarily the case. Suggestions that such confusion arises from questions that should not be asked concerning models are not satisfactory and fail to confront the fact that scientists do, in fact, ask such questions and should therefore make use of the appropriate formalism for these.

In fact, the goals of both objectively conveying the relevant information content of data and deriving bounds on model parameter values are not mutually exclusive, but rather are closely linked. It is not generally possible to translate experimental results into meaningful model constraints without specifying a prior. As such, detailed objective information should be used to clearly define the context for Bayesian constraints. The issue is therefore largely one of establishing relevance and transparency.

In this paper, we briefly review the nature of various interval constructions; highlight some apparent paradoxes that arise from common misinterpretations; cite specific cases where experiments have run into such issues; discuss several aspects associated with practical implementation; and, finally, propose an approach to directly address the above issues in a more relevant, self-consistent and transparent manner using standard techniques.
INTERVAL CONSTRUCTIONS AND THEIR MEANING

Bayesian
Bayesian probabilities quantify the degree of belief in a hypothesis. Given a measurement, the goal of a Bayesian approach is to assign probabilities to the range of possible model parameter values. By necessity, this requires an assumed context for these models (prior), as indicated by Bayes' Theorem:

P(H_i|D) = \frac{P(D|H_i)\,P(H_i)}{\sum_j P(D|H_j)\,P(H_j)}    (1)

where P(H_i|D) is the posterior probability of hypothesis H_i given the data D; P(D|H_i) is the likelihood of the data assuming hypothesis H_i; and P(H_i) is the prior probability for H_i that defines the a priori context relative to other model parameter values. The ratio between Bayesian probabilities therefore provides an estimate of relative "betting odds" for which hypotheses are most likely to be correct.

For a purely Bayesian approach, the concept of "statistical coverage" of a credible interval (the frequency with which a large number of repetitions of an experiment subject to random fluctuations would yield intervals that bound the correct hypothesis) has no relevance, since no comparison is made to a hypothetical ensemble — only the actual measurements matter. If desired, the effective statistical coverage can often still be estimated for Bayesian constructions using Monte Carlo calculations etc. (as shown in Appendix A), but the credibility level that defines the construction simply relates the actual observation directly to the model.

Bayesian credible intervals are simply defined by the relevant portion of the posterior probability density function (PDF) that constitutes a fraction equal to a pre-defined credibility for the interval, CI. The way this fraction is selected may be altered to yield lower bounds, upper bounds, central intervals, the most compact interval, or intervals containing the highest probability densities. For intervals, as opposed to bounds, we suggest that using the highest probability density offers the most intuitive and robust definition for an arbitrary probability distribution.

As a simple example, we give the construction for an upper bound (i.e. the critical value up to which integration is performed) on an average signal strength, S, in a Poisson counting experiment where the expected background level is B and a total of n events is observed:

\frac{\int_0^{S_{up}} [(S+B)^n e^{-(S+B)}/n!]\,P(S)\,dS}{\int_0^{\infty} [(S+B)^n e^{-(S+B)}/n!]\,P(S)\,dS} = CI    (2)

where S_up is the upper bound to be determined, P(S) is the prior probability for S, and CI is the desired credibility for the interval. In the case where all positive values of S are a priori given equal consideration (i.e. a uniform prior in which P(S) is a constant for S ≥ 0), this reduces to

\frac{\sum_{m=0}^{n} (S_{up}+B)^m e^{-(S_{up}+B)}/m!}{\sum_{m=0}^{n} B^m e^{-B}/m!} = 1 - CI.    (3)

Thus, S_up can be interpreted as denoting the upper limit on the range of model parameter values for which the probability of observing n events or less is not more than 1 − CI, given that the possible number of background events cannot be greater than the total number of events observed in this measurement. If a non-uniform prior were used instead, the form would be modified and the interpretation would be that the upper limit is on the correspondingly weighted range of model parameter values.
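As a concrete illustration (not part of the original analysis), the short sketch below numerically solves Equation 3 for S_up using standard scientific Python; the function name and example numbers are our own.

    from scipy.stats import poisson
    from scipy.optimize import brentq

    def bayesian_upper_bound(n_obs, bkg, ci=0.90):
        # Solve Eq. 3: P(<= n_obs | S + B) / P(<= n_obs | B) = 1 - CI (uniform prior, S >= 0)
        target = (1.0 - ci) * poisson.cdf(n_obs, bkg)
        return brentq(lambda s: poisson.cdf(n_obs, s + bkg) - target, 0.0, 1000.0)

    # e.g. 5 events observed with an expected background of 9 (a scenario used later in the text)
    print(round(bayesian_upper_bound(5, 9.0), 2))   # ~3.88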
Standard Frequentist

Frequentist probabilities are defined as the relative frequencies of occurrence given a hypothetical ensemble of similar experiments subject to random fluctuations. There is no such thing as a "probability" for a model parameter to lie within derived bounds — either it does or it does not. However, if everyone derived bounds in the same way, the correct model would be correctly bounded a known fraction of the time (for more on statistical coverage, see Appendix A).

Rather than using the posterior probability, the Neyman construction of frequentist intervals [1] starts with the probability density function (PDF) for a given observation under a fixed hypothesis that is used to construct the likelihood. For each possible hypothesis, a portion of the possible outcomes containing the fraction CL (frequentist confidence level) is defined. The range of model parameter values for which a given measurement is "likely" (i.e. would be contained within that CL fraction) then defines the confidence region. Note that this is not the same as a statement that any given model is likely (which is Bayesian) and, indeed, the construction is such as to avoid any direct comparison of models. However, as before, there is an ambiguity in this construction regarding how the PDF is used to compose the initial frequency intervals, with common ordering choices including central, highest probability density and most compact intervals. We will define frequentist approaches that use an ordering principle based on the expected frequency of observations for a given hypothesis as "standard frequentist." Approaches that fall outside of this include those that use a likelihood ratio test as an alternative ordering principle, such as Feldman-Cousins [2] (which will be discussed separately in the next section).

For comparison, the standard frequentist construction for an upper bound on an average signal strength, S, in a Poisson counting experiment where the expected background level is B and a total of n events is observed can be written as follows:

\sum_{m=0}^{n} (S_{up}+B)^m e^{-(S_{up}+B)}/m! = 1 - CL    (4)

where S_up can thus be interpreted as denoting the upper limit on the range of model parameter values for which n events or less would be observed with a relative frequency of not more than 1 − CL if the measurements were to be repeated a large number of times. Note that this differs from the Bayesian formula for a uniform prior only in the absence of the background normalization. In other words, for this construction, the possible number of background events is not constrained to be less than or equal to the total number of all events observed in this particular measurement. This is because the probability being calculated is that for observing n events during a generic trial for an ensemble of measurements, and does not take into account additional information available from any particular observation (such as the fact that the number of background events actually detected cannot exceed n). Thus, the probability associated with any particular measurement is not a meaningful concept in the frequentist approach.

This can also be seen by the fact that the lack of a background normalization means that there will be cases for which Equation 4 does not yield a positive solution for S_up. These are instances where the observed number of events is already deemed to be less probable than the desired confidence level.
Such "empty intervals" are perfectly allowed and, indeed, are necessary in order to guarantee the correct statistical coverage for the frequency of observations within the overall ensemble of hypothetical experiments. Individual frequentist bounds, however, do not have meaning for model parameter values by themselves. Indeed, for a case where the confidence interval is empty, the observer knows that for this particular data set the confidence interval does not contain the true value of the parameter, even if the repeated construction of such confidence intervals would correctly bound it in, say, 90% of the cases where statistical fluctuations resulted in different data sets. This distinction is fundamental: frequentist confidence intervals are always statements about how often a large ensemble of hypothetical experiments will bound the true value, and are never a statement that there is a particular probability that the true value is contained in the interval for any individual data set. In fact, in many cases for both standard frequentist and Feldman-Cousins intervals, the experimenters may know that it is very unlikely that the true model is contained in the generated interval for their particular data set. This situation often tends to conflict with the desired interpretations of these bounds, since the question of interest to most experimenters is the relevance of their own particular data set for the model parameter values under study, rather than the behavior of a large ensemble of hypothetical experiments that were not actually performed.
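For comparison with the Bayesian sketch above, a similar sketch (again our own, assuming scipy) solves Equation 4 and shows how an "empty interval" arises when the observed count is already less probable than the desired confidence level.

    from scipy.stats import poisson
    from scipy.optimize import brentq

    def frequentist_upper_bound(n_obs, bkg, cl=0.90):
        # Solve Eq. 4: P(<= n_obs | S + B) = 1 - CL; no positive solution -> empty interval
        f = lambda s: poisson.cdf(n_obs, s + bkg) - (1.0 - cl)
        if f(0.0) < 0.0:
            return None          # observation already less probable than 1 - CL even for S = 0
        return brentq(f, 0.0, 1000.0)

    print(frequentist_upper_bound(5, 9.0))   # ~0.27
    print(frequentist_upper_bound(1, 9.0))   # None: P(n <= 1 | B = 9) is already below 0.10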
Feldman-Cousins

The approach of Feldman and Cousins [2] uses an ordering principle for the Neyman construction based on the ratio of likelihoods which, for the measurement of a quantity typified by a mean expectation, µ, is given by:

\Lambda_\mu(x) = \frac{L(x|\mu)}{L(x|\mu_{best})}    (5)

where x is the measurement and µ_best is the mean for the hypothesis in the physical region for which the data is most likely (not necessarily the most likely hypothesis, an assessment of which would call for a Bayesian construction).

In the standard frequentist case, the composition of intervals is simply based on the expected relative frequency of observations under each hypothesis. However, under the Feldman-Cousins approach, the composition of intervals is instead determined by a likelihood comparison across potentially different hypotheses, which can therefore lead to less intuitive interval choices.

As an example of how these interval definitions can differ, consider the case of a Gaussian variable with unit variance and a mean of µ = 0.5. Assume that this value of µ is unknown to us, but represents a physical quantity (such as a mass) that must be non-negative. From a given observation, we then wish to define 90% CL bounds for µ. Figure 1 shows the relative frequency of observations, P(x). The red striped area indicates the range of observations for which the correct value of µ is bounded by a central standard frequentist interval. These bounds are symmetric, extending ±1.64σ relative to the value of x = 0.5, as might be expected. However, as indicated by the black striped area, the range of observations for which the correct value of µ is bounded by a Feldman-Cousins frequentist interval is notably asymmetric (due to the fact that µ_best ≥ 0), bounding the true mean even for an observation of x = −1.5, which is 2σ away from the true value. On the other hand, the interval excludes the true mean for an observation of x = 1.9, even though this is less than 1.5σ away from the true mean and, hence, nearly 3 times more likely to occur.

FIG. 1: Upper plot: Relative frequency for observations, x, of a Gaussian variable with σ = 1 and µ = 0.5. The range of observations for which the true mean is bounded for standard frequentist (red stripes) and Feldman-Cousins intervals (black stripes) at 90% CL are shown. Lower plot: Values of the corresponding ordering parameter used to define the composition of the Feldman-Cousins interval.
This counterintuitive result illustrates that the interpretation of the Feldman-Cousins ordering principle, which is not based directly on the frequency of the observation but instead on the likelihood ratio Λ(x|µ), is not straightforward.

Feldman and Cousins cite Section 23.1 of the 5th edition of Kendall's book [3] as implying the ordering principle for their interval construction, but, in fact, Sections 31.31-31.34 of that same reference are much more explicit regarding the general use of a likelihood ratio to define confidence intervals. These sections end with the statement, "The difficulties with such an approach are, as before, the lack of a frequency interpretation for P* or, indeed, any direct interpretation for the function. Here, as elsewhere, the statistician must decide whether he or she is willing to make the logical leap in order to justify inferential statements that relate to single experiments."

One stated purpose of the Feldman-Cousins construction is to avoid empty intervals, thus making the bounds appear more physical for the model. However, as previously indicated, such empty intervals do not actually pose any problems in principle since frequentist bounds do not refer to direct restrictions on the physical model and only take on meaning for a large ensemble of measurements, where statistical coverage is indeed upheld. While the ordering principle used in the Feldman-Cousins approach ensures that intervals are never empty, which many may find less disconcerting, this does not avoid the basic issue that the bound for any particular data set may not be meaningful, and situations in which conventional frequentist intervals are empty are often situations in which the Feldman-Cousins procedure returns a value that is prone to misinterpretation as being an unduly strict bound on a model.

Feldman and Cousins themselves recognized this problem, noting that this results from a confusion with Bayesian inferences, which are more relevant for decision making. Accordingly, they recommended accompanying each limit by the average expected limit (the "sensitivity" of the measurement) "in order to provide information that will assist in this (Bayesian) assessment." Cases in which the obtained limit is significantly better than the expected sensitivity are cases where there is a higher probability that the true parameter value is not contained within the derived bounds. However, the expected limit clearly does not represent an actual measurement, and how information from this and/or the derived bounds is to be quantitatively applied in order to arrive at a Bayesian (or any other) assessment for these cases is not at all clear. Furthermore, this issue does not, in fact, have a clear threshold, as all negative fluctuations yield tighter bounds than the expected limit. Even worse, merely producing non-empty intervals by itself is not obviously an improvement in relevance or clarity, since any particular interval (empty or not) constructed for a given data set will not generally contain the true parameter value with a probability indicated by the quoted confidence interval, in spite of a nearly universal tendency to misinterpret it as such. The mere fact that Feldman-Cousins returns non-empty intervals in some cases may actually obfuscate their nature. In many ways a null interval, while disconcerting, at least is transparently not a limit on the parameter, whereas a narrow but non-empty Feldman-Cousins interval, such as often occurs when the data fluctuates below the expected background, may give the false impression that the confidence interval is meaningful for bounding a model.
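The asymmetry discussed above can be reproduced numerically. The following rough sketch (our own construction, assuming numpy/scipy, with an arbitrary grid) builds the 90% Feldman-Cousins acceptance region in x for a true mean of 0.5 under the constraint µ ≥ 0, and compares it with the central standard frequentist acceptance region.

    import numpy as np
    from scipy.stats import norm

    def fc_acceptance(mu_true, cl=0.90, lo=-6.0, hi=6.0, npts=20001):
        x = np.linspace(lo, hi, npts)
        dx = x[1] - x[0]
        mu_best = np.maximum(x, 0.0)                                 # best-fit mean within the physical region
        lam = norm.pdf(x, mu_true, 1) / norm.pdf(x, mu_best, 1)      # likelihood-ratio ordering parameter
        order = np.argsort(-lam)                                     # include highest-ratio observations first
        prob = np.cumsum(norm.pdf(x[order], mu_true, 1) * dx)
        accepted = order[: np.searchsorted(prob, cl) + 1]
        return x[accepted].min(), x[accepted].max()                  # accepted set is contiguous here

    print(fc_acceptance(0.5))            # asymmetric region about x = 0.5
    z = norm.ppf(0.95)                   # ~1.645: half-width of the 90% central interval
    print((0.5 - z, 0.5 + z))            # symmetric central acceptance region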
EXAMPLES OF THE BEHAVIOR OF UPPER BOUNDS FOR LOW NUMBERS OF EVENTS
We will now explore a number of scenarios in the region of low event numbers that highlight the differences between various interval constructions.

As an initial example, consider the scenario where the expected number of background events, B, for 1 year of running with a 1 kg detector is 9, but a statistical fluctuation results in a total of only 5 events observed. A fluctuation this low or lower would happen nearly 12% of the time if the true signal rate were zero. We now wish to construct a 90% CL (frequentist) or CI (Bayesian) upper bound, S_up, on a signal of strength S.
Standard Frequentist

For the Standard Frequentist case, the above scenario yields from Equation 4 a value of S_up = 0.27 events per kg per year.

Now consider the further scenario in which we were contemplating running the experiment for an additional year. The second year of data would very likely yield a number of background events much closer to the expected mean, so we would likely end up with a total of approximately 5+9=14 events where 18 are expected over the 2 year run. The same formalism would then result in a value of S_up = 2.12 events per 2 kg-years, or a bound on the rate of 1.06 events per kg per year, which is nearly 4 times less restrictive than the limit from the first year of data.

Furthermore, consider the case for a new experiment to be constructed with 100 times the fiducial mass. What constraints is it likely to achieve? Here we can use σ ∼ √N and the 1-sided Gaussian approximation that 90% CL corresponds to ∼1.28σ. This means that after 1 year of running we would typically expect a bound of 1.28√900 ≈ 38 events per 100 kg-years, or roughly 0.38 events per kg per year.
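The numbers quoted above can be checked with a few lines of code; the sketch below (our own, assuming scipy) evaluates Equation 4 for the one-year and two-year scenarios.

    from scipy.stats import poisson
    from scipy.optimize import brentq

    def frequentist_upper_bound(n_obs, bkg, cl=0.90):
        return brentq(lambda s: poisson.cdf(n_obs, s + bkg) - (1.0 - cl), 0.0, 1000.0)

    year1 = frequentist_upper_bound(5, 9.0)           # ~0.27 events per kg per year
    both_years = frequentist_upper_bound(14, 18.0)    # ~2.12 events per 2 kg-years
    print(year1, both_years, both_years / 2.0)        # rate limit worsens to ~1.06 per kg per year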
Bayesian

Here we will assume that all non-negative rates have an a priori equal probability density and, accordingly, choose a uniform prior in S for S ≥ 0. While numerical values for the derived bounds may be modified under a different assumption, their qualitative behavior will remain the same. For the case at hand (9 events expected, 5 observed), the uniform prior assumption yields a value of S_up = 3.88 events per kg per year, which is ∼14 times larger than the standard frequentist bound.

Now consider again what would happen if we decided to run the experiment for another year, once more assuming that we would likely end up with 14 events with 18 expected over two years. The above formalism would then result in a bound of 5.89 for the 2-year run, or a 90% CI rate limit of 2.945 per kg per year, which is noticeably better than before.

For the case of a 1-year exposure of an experiment with 100 times the fiducial mass, the limit would approach 0.38 events per kg per year (as before), which is very significantly better.

Thus, Bayesian bounds on the actual model behave as would be expected for something that reflects the success and relevance of a given experimental measurement, indicating that it is generally beneficial to run for longer and build better experiments under such scenarios.

A Bayesian calculation will also result in a more stringent upper bound for downward fluctuations because these bounds make explicit use of the constraint that the number of background events actually detected cannot be larger than the total number of observed events (i.e. the denominator of Equation 3). The most stringent bounds therefore always occur when n = 0, since the number of backgrounds is then also known to be identically zero for this observation. Hence, Bayesian intervals are independent of the expected background rate for such cases. However, this is not so for the frequentist case, which has no such normalization and where much larger variations for individual measurements are allowed because less likely measurements carry inherently less weight in the ensemble of other possible outcomes that defines the coverage. A frequentist limit based on an observed non-zero n can, counterintuitively, even be more stringent than a limit for n = 0, depending on the expected background levels for each case.
Feldman-Cousins

The Feldman-Cousins (F-C) bound on the scenario of 9 expected background events and 5 observed events yields a 90% CL value of S_up = 2.38. This is ∼9 times larger than the standard frequentist bound, though still smaller than the Bayesian bound.

Now consider instead a scenario in which 5 background counts are expected but none are observed. The resulting Feldman-Cousins upper bounds on S for several confidence levels are given in Table I.

TABLE I: Feldman-Cousins upper bounds on S when 5 counts are expected but none are observed (an occurrence with a statistical frequency of less than 1% for S=0).

It may look odd to have intervals for S with up to nearly 3 signal events allowed for these confidence levels when, for an expected background of 5 events, the frequency with which no counts would be observed even if S were identically zero is only 0.67%. However, this is a reflection of how the acceptance region in the observable has been distributed in a way that is not proportional to the frequency of possible observations and that the statistical coverage only takes on meaning for the ensemble.

As mentioned previously, Feldman and Cousins did recognize the problematic nature of limits such as those in Table I and recommended stating both the limit and the expected sensitivity, which in this case is 5.18 at 90% C.L. for S = 0, more than 5 times larger than what appears in the table. This large difference between the expected sensitivity and the limit is a warning flag that the limit should be interpreted with extreme caution. Had one event been observed, the Feldman-Cousins 90% limit would be 1.22, and for n_obs = 2 it would be 1.73, all of which are noticeably lower than the expected sensitivity. However, the probability of n_obs ≤ 2 for an expected background of 5 events is ∼12% even if S = 0, so such cases are by no means rare.
Example of Behavior Under an Improved Analysis

As one more example to compare the behavior of upper limits for the different approaches, first consider the case where 5 backgrounds are expected and 2 events are observed. The resulting 90% CL/CI upper bounds are given in the first column of Table II. Now assume that an improved analysis technique is developed that is expected to reduce the background levels by a factor of 10 while not impacting the efficiency of signal detection. When this is applied to the same data set, the events previously observed are cut. The new 90% CL/CI upper bounds are then re-computed and given in the second column of Table II.

For both standard and F-C frequentist approaches, paradoxically, the improved analysis actually results in a worse constraint on S. Only the Bayesian limit improves, behaving more as would be intuitively expected under this scenario. This example further illustrates the point already made: that individual frequentist bounds neither relate to restrictions on the model, nor do they necessarily represent a measure of how sensitive or informative one experimental measurement is relative to another one.

                         B=5, n=2    Improved Cuts: B=0.5, n=0
  Standard Frequentist     0.32             1.8
  Feldman-Cousins          1.73             1.94
  Bayesian                 3.13             2.3

TABLE II: 90% CL/CI upper bounds on S when 5 background counts are expected and 2 events are observed, compared with those for an analysis with 10 times better background rejection.
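The standard frequentist and Bayesian entries of Table II can be reproduced directly from Equations 4 and 3; the sketch below (our own code, assuming scipy) does so. The Feldman-Cousins values require the full Neyman construction with likelihood-ratio ordering and are simply quoted in the table.

    from scipy.stats import poisson
    from scipy.optimize import brentq

    def frequentist_upper(n, b, cl=0.90):
        return brentq(lambda s: poisson.cdf(n, s + b) - (1 - cl), 0.0, 100.0)

    def bayesian_upper(n, b, ci=0.90):
        target = (1 - ci) * poisson.cdf(n, b)
        return brentq(lambda s: poisson.cdf(n, s + b) - target, 0.0, 100.0)

    for b, n in [(5.0, 2), (0.5, 0)]:
        print(b, n, round(frequentist_upper(n, b), 2), round(bayesian_upper(n, b), 2))
    # expected: (5, 2) -> 0.32 and 3.13 ; (0.5, 0) -> 1.8 and 2.3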
Pragmatically, we believe that any useful method for producing limits must do so for any data observation, with a meaning that is easy to interpret and provides a reasonable, robust and intuitive basis on which to compare results. Both the standard and Feldman-Cousins methods appear to fail in this regard a non-negligible fraction of the time, and the suggestion that this problem can be dealt with by simply quoting the expected sensitivity and leaving the interpretation to the individual's judgement does not seem like a viable way to proceed.
EXAMPLE OF ISSUES WITH INTERVALS IN THE PRESENCE OF A CLEAR SIGNAL
Up to now, examples have focused on potential issues of misinterpretation related to upper bounds. We give here an example where a 2-sided interval construction for a clear signal can also lead to complications for a frequentist approach.

Consider the case of an ultra-high energy neutrino detector, such as IceCube [7], looking for signs of extra-terrestrial neutrinos from astrophysical sources. Assume that the instrument has a known Gaussian energy resolution and that one event with an apparent energy well beyond expectations for atmospheric neutrinos is observed. Say we now wish to construct a confidence interval for the energy of the neutrino itself (as opposed to the deposited energy). However, the likely energy of the event is strongly dependent on the index of the underlying energy spectrum, which is unknown. For example, the observation is much more likely to have resulted from the fluctuation of a lower energy event if the underlying differential neutrino spectrum followed a steeply falling power law in energy rather than a harder one. Without knowing this index, it is therefore not possible to uniquely define a hypothetical ensemble of repeated measurements, and frequentist bounds on the deposited energy alone can be misleading if incorrectly interpreted in terms of the neutrino energy. One could treat the index as a nuisance parameter but, as shown later, the associated uncertainties still cannot be propagated in a self-consistent manner using a purely frequentist framework. However, Bayesian bounds are well-defined, where the dependence on the assumed spectral prior is made explicit and the sensitivity to this choice can be shown.

While this example was chosen as a particularly clear case, all intervals are subject to this issue at some level. If the choice of prior is obvious or does not matter, Bayesian bounds are unambiguous. If the choice of prior is not clear and leads to differences in interpretation, a Bayesian construction is explicit regarding this context, whereas the misinterpretation of a frequentist interval as bounds on a model can lead to erroneous conclusions that are effectively based on an assumed but hidden prior.
COMPARISON OF EXPERIMENTAL LIMITS

It has already been noted that frequentist intervals do not necessarily reflect the relevance of individual experimental measurements. However, this issue is worth further discussion since comparisons of derived frequentist intervals from different experiments are often made on exclusion plots etc., which can lead to erroneous conclusions.

To illustrate this, consider the scenario of two counting experiments making observations to place bounds on a possible signal. The first of these has an expected background of just 1 count, while the second suffers from a higher average background level. Now consider the ensemble of comparisons between the 90% CL/CI upper interval bounds derived for each of the two experiments.

FIG. 2: The fraction of times that the derived 90% CL/CI upper interval bounds for Experiment 1 would be more restrictive than those for Experiment 2.
These issues are of more than esoteric interest, and several examples can be found where the representation of experimental results appears to run into difficulties when couched in a frequentist context.
KARMEN II
The KARMEN II neutrino oscillation experiment observed zero events during its initial run between February 1997 and April 1998, where the expected background was 2.88 ± 0.13 events.
LEP and LHC (The CLs Method)

Members of the ALEPH, DELPHI, L3 and OPAL collaborations recognized the potential difficulties in interpreting frequentist limits in the face of fluctuations. In their joint 2003 paper 'Search for the Standard Model Higgs boson at LEP' [10], the authors note the following regarding frequentist intervals: "...this procedure may lead to the undesired possibility that a large downward fluctuation of the background would allow hypotheses to be excluded for which the experiment has no sensitivity due to the small expected signal rate." Their solution was to re-define their "frequentist" upper bounds based on the ratio of the chance probability for the observation under a given signal+background hypothesis to that under the hypothesis of background alone [11]. In the context of a simple counting experiment, this "CLs method" would therefore take the form of Equation 3 which, in fact, is a Bayesian bound with uniform prior. The equivalence with a Bayesian bound also holds for a Gaussian with a prior uniform in the mean, and can sometimes be found for other distributions with different choices of prior. However, this equivalence is not true in general and, in particular, does not hold for the likelihood ratio test statistic often used to search for new particles.

While the CLs statistic avoids bounds that may appear overly strict for negative fluctuations, the interpretation of the associated confidence levels is unclear and, in fact, is variable, depending on the test statistic. It does not guarantee frequentist coverage, nor does it necessarily provide well-defined bounds on model parameter values themselves (and even in cases where there is a Bayesian equivalent, the form of the prior is not explicitly evident). Nevertheless, the technique is now ubiquitous amongst LHC experiments as well, being used (as with LEP) essentially as a binary assessment as to whether an observation is significant. If so, a 2-sided Feldman-Cousins interval is then typically quoted, though this scheme now violates the principles on which the Feldman-Cousins method is predicated (i.e. that the nature of the interval is automatically determined by the construction). For such a binary assessment, it is unclear what advantage this provides over a simple p-value (test of consistency with the zero signal hypothesis). Beyond this, if one wished to constrain model parameter values themselves with well-defined confidence levels, appropriate Bayesian bounds could instead be derived.

It is worth emphasizing again that the issue of negative fluctuations is entirely a consequence of misinterpreting frequentist bounds or, equivalently, using the wrong construction to answer a Bayesian question. We find it curious that, in order to apparently avoid a philosophical issue about providing an unambiguous interpretation, the authors have instead opted to use a scheme without any consistent interpretation at all. A more straightforward approach would be to confront the specific nature of the questions being posed and then adopt an appropriate, well-defined and consistent mathematical formalism.
ZEPLIN III

In 2009, the ZEPLIN III collaboration published their first bounds on dark matter [12]. The limits were largely based on a large region within the signal box where no events were observed. The authors noted that the fit expectation for the average background level was greater than this and, together with the difficulty in quantifying some systematics associated with the background extrapolation, this "compromised" the use of frequentist techniques, such as maximum likelihood or Feldman-Cousins. Consequently, they instead used the maximum value allowed for a Feldman-Cousins 90% CL interval of 2.44 (the value at n = B = 0). Clearly, while this may be seen as a "conservative estimate" of a frequentist bound whose coverage must be greater than 90%, the meaning of the confidence level beyond this is simply not defined. Had the authors instead used a Bayesian approach with a uniform prior, they would have arrived at a bound of 2.3 at 90% CI — a value that is, in fact, well-defined for this case.
EXO-200

The EXO collaboration published first results from the EXO-200 neutrinoless double beta-decay experiment in 2012 [13]. In the ±1σ energy resolution window around the endpoint, 1 event was observed where a background of 4.1 events was expected, leading to a published 90% CL lower limit on the 0νββ half-life of 1.6 × 10^25 years. A Feldman-Cousins bound based on the ±1σ bin would have yielded a limit of less than 2.0 signal counts at 90% CL (accounting for the 68% signal efficiency of the bin), corresponding to an even more restrictive 90% CL lower bound for the half-life due to the negative fluctuation. On the other hand, a Bayesian bound with a prior uniform in counting rate based on the ±1σ bin would have yielded a limit of less than 4.0 signal counts, corresponding to a 90% CI lower bound on the half-life that is seemingly less restrictive than the Feldman-Cousins bound by a factor of two.

The EXO data falls exactly into the category of paradoxical situations for frequentist intervals previously described, where improved background rejection and/or longer periods of data collection would likely result in less restrictive bounds than for the initial case. And, in fact, an update of EXO-200 results published in 2014 using two years of data with quadruple the exposure of the initial result appeared to actually suggest a positive fluctuation in the larger data set that accentuates this effect [14]. Within the ±1σ energy resolution window around the endpoint, 21 events were observed where a background of approximately 16 events was expected. The collaboration reported a limit on the 0νββ half-life of t_{1/2} > 1.1 × 10^25 years by applying Wilks' Theorem to a likelihood analysis, which is a factor of 1.45 less restrictive than the initial result. A Feldman-Cousins analysis based on the ±1σ bin would yield a bound that is a factor of 1.7 less restrictive than the F-C bound from the initial result. However, a Bayesian bound with a prior uniform in counting rate and based on the same bin, as well as one using the appropriate integration of the posterior probability derived from the provided likelihood curve assuming a uniform prior, are both modestly more restrictive than the initial Bayesian result, thus better reflecting the relevance of the measurements and providing a substantially more stable basis for comparison in the face of these background fluctuations. These results are summarised in Table III.

TABLE III: Derived 90% CL/CI lower bounds on the half-lives for 0νββ, in units of 10^25 years, from EXO data using various approaches. 'Data Set 1' is from the initial 2012 publication [13] and 'Data Set 2' is from the 2014 publication with quadruple the exposure [14]. The ratios of bounds between data sets are also given. Notably less restrictive frequentist bounds result from the larger data set owing to background fluctuations, whereas the Bayesian bound modestly improves.

These are just a few obvious examples of cases sampled across a number of different areas in particle physics. However, the fact is that all such comparisons of experimental results using frequentist bounds are sensitive to these issues at some level.
ISSUES ASSOCIATED WITH THE TREATMENT OF NUISANCE PARAMETERS
The concept of frequentist coverage presents particular challenges when trying to incorporate the effects of other unknown parameters. A confidence interval construction is said to have, say, 90% coverage if, for any true value of a parameter θ that is to be estimated, an ensemble of repeated experiments would result in constructions that would contain this value in 90% of the repetitions.

Consider now the case that the likelihood L(x|θ, ξ) depends on a second parameter ξ. Here ξ may either be a parameter of physics interest, or a nuisance parameter representing the effects of a systematic uncertainty. A common experimental problem is to determine a 1D confidence interval for θ, independent of the value of ξ. Standard frequentist techniques can readily define a 2D confidence region in the θ, ξ plane so that, for any point (θ, ξ), the generated region will contain that point in 90% of random trials. But these techniques do not provide any satisfactory way of producing a 1D confidence interval for θ, independent of ξ, with a desired level of coverage.

Two possible definitions of coverage may be considered in this case. The "strong" definition of coverage would be that the 1D interval generated by the method should contain the true value of θ in 90% of cases, for any values of θ and ξ. This would be the desired definition of coverage for a purely frequentist construction, since ξ may represent a parameter whose value is constant but unknown, and not subject to fluctuations from trial to trial.

One may also consider a "weak" coverage requirement. In this approach, ξ is thought of as a random variable that may have a different value every time the experiment is done (although this is not always the case in actuality). If the frequency distribution for ξ is known and denoted by f(ξ), then we could less stringently require that the frequency for the 1D interval generated by a random measurement x to contain the true value of θ is 90%, averaging over ξ. For some fixed true value of ξ, the 1D confidence interval generated might not have the desired 90% coverage, but since the true value of ξ is, by assumption, not known, we are content if:

\int d\xi\, f(\xi)\, \alpha_\theta(\xi) = 0.90    (6)

where α_θ(ξ) is the coverage at true value θ for a particular true value of ξ:

\alpha_\theta(\xi) = \int_{x \in R} dx\, P(x|\theta, \xi).    (7)

Here R denotes the region in the measurement space X determined by whatever ordering principle is used to construct the confidence intervals. If α_θ(ξ) is a constant, not depending on ξ, then the coverage is in fact independent of the nuisance parameter and the strong definition of coverage is obtained.

Note that there is no essential difference between the case where ξ is a true nuisance parameter versus the case that we wish to incorporate the effect of one "physics" parameter in the projected confidence region for another, such as generating a 1D confidence interval for the neutrino mixing angle θ from a likelihood function that depends on θ and Δm².
The Frequentist Minimization Procedure

The commonly recommended frequentist prescription for eliminating a nuisance parameter is the "profile" method, in which the profiled likelihood is generated by maximizing L(x|θ, ξ) over ξ for each fixed value of θ:

L(x|\theta) = \max_\xi L(x|\theta, \xi)    (8)

This suggests that, for example, to generate a Feldman-Cousins confidence interval in the presence of a nuisance parameter, we should form the following likelihood ratio:

\Lambda_\theta(x) = \frac{L(x|\theta, \hat{\xi}_\theta(x))}{L(x|\hat{\theta}, \hat{\xi})}    (9)

Here in the numerator, ξ̂_θ(x) is the value of ξ that maximizes the likelihood for a fixed value of θ. In the denominator, θ̂ and ξ̂ are the values of these parameters that globally maximize the likelihood. In all cases Λ ≤ 1.

For any value of θ, there is a critical value c_θ for which θ will be included in the confidence interval if Λ_θ(x) > c_θ. The value of c_θ is chosen by construction so that the frequency with which this selection occurs is 90%. Ideally, we would want the critical value c_θ to be independent of ξ. In that case, the strong coverage condition holds, and the integrity of the profiled 1D frequentist interval is maintained.
A Simple Example

Suppose we perform a single measurement of a quantity expected to follow a Gaussian distribution with mean θ + ξ and an RMS of 1, and that the result of this measurement is x. In this case, θ and ξ are degenerate. Let us therefore suppose that we have further knowledge that ξ = A ± 1, with ξ following a Gaussian distribution with mean A and RMS 1. This could result from a previous measurement of ξ, such as from a calibration run. The (unnormalized) joint likelihood function is therefore:

L(x|\theta, \xi) = \exp[-\tfrac{1}{2}(x - (\theta+\xi))^2]\, \exp[-\tfrac{1}{2}(\xi - A)^2]    (10)

It is trivial to see that this is globally maximized for ξ̂ = A, θ̂ = x − A, at which point L(x|θ̂, ξ̂) = 1. For any fixed value of θ, the likelihood is maximized at

\hat{\xi}_\theta(x) = \frac{x - \theta + A}{2},    (11)

which is the average of the value ξ = x − θ that maximizes the first factor in the likelihood and the best fit value ξ = A from the second factor. If the two Gaussian terms in the likelihood had different σ's, this would instead be a weighted average. Inserting this expression into Eq. 9 gives

\Lambda_\theta(x) = \exp[-\tfrac{1}{4}(x - A - \theta)^2].    (12)

This is the ordering parameter. For any fixed (θ, ξ), we can predict the distribution of x and, hence, of Λ_θ(x), and determine the critical value c_θ for Λ_θ(x) such that:

\frac{\int dx\, L(x|\theta, \xi)\, H(\Lambda_\theta(x) - c_\theta)}{\int dx\, L(x|\theta, \xi)} = 0.90    (13)

where H is the Heaviside step function to ensure that we include θ in the confidence interval only if Λ_θ(x) > c_θ. Since the distribution of x depends on ξ but Λ_θ(x) does not, it is clear that Equation 13 cannot be guaranteed to hold for all ξ. Therefore, the confidence interval construction does not satisfy the strong coverage condition: the procedure does not yield confidence intervals that give the correct coverage for all combinations of θ and ξ.

In this case, one might at least try to err on the side of caution by choosing the smallest critical value that is obtained for any ξ, giving the desired coverage level for that one value of ξ and giving over-coverage for other values. (For example, Cranmer has proposed including in the 1D interval any value of θ for which the standard 2D confidence region is not empty for at least one value of ξ [17].) However, this method not only is extremely likely to give over-coverage for almost all true values of ξ, but also may give overly large intervals dictated by the most extreme possible value of ξ.

The alternative is to give up on the strong coverage condition and settle for "weak coverage", as per Equation 6. To achieve this, one can simply interpret the likelihood function as a joint probability distribution for x and ξ. Drawing values for x and ξ randomly from this distribution (for fixed θ), one can, for example, then calculate the distribution for Λ_θ(x) by Monte Carlo and determine the appropriate critical value to give the desired coverage. Correct coverage of the "weak" kind is then obtained with the following meaning: if the experiment were done a large number of times, and if ξ had a different random value for each trial with a joint probability distribution L(ξ, x|θ), then 90% of the generated intervals would contain the true value of θ.

This is clearly a hybrid approach. In order to create a "frequentist" interval with desired coverage for θ, we are integrating out the nuisance parameter ξ. In doing this, we are forced to treat ξ in a Bayesian way, with an assumed prior distribution, and must partly abandon the frequentist paradigm. This is of course exactly the approach of Cousins-Highland [18], and its performance and that of profiling the likelihood have been explored by a number of authors [19, 20]. The contribution of this Bayesian aspect to the overall confidence interval is not necessarily small since, for example, it is a common goal to run experiments to the point where systematic uncertainties dominate.

Even if one still chose to accept a pseudo-Bayesian way of getting rid of nuisance parameters in order to create pseudo-frequentist intervals on the remaining parameters of interest, the incursion of Bayesian philosophy cannot necessarily be so simply contained. Consider the case of a neutrino oscillation experiment with no systematic uncertainties, and sensitivity to two oscillation parameters Δm², θ. While one can readily produce frequentist confidence regions in the 2D contour plane, what happens if we want to quote a 1D limit on either of the parameters? This is mathematically identical to eliminating a nuisance parameter, although, in this case, the parameter is actually one of physical interest.
Except in special cases, it is not possible to produce 1D frequentist confidence intervals with correct coverage for all values of the other parameter. The best that can be hoped for is to achieve "weak" coverage, but this then implies marginalising in a Bayesian way over the other parameter (as, for example, in [21]). One is forced to be Bayesian about any parameter that is being eliminated from the problem, or else abandon the notion of defining a consistent statistical coverage.

We suggest that, rather than seeking such work-arounds and compromises to the frequentist treatment of nuisance parameters, it is prudent to instead ask why such steps are necessary in the first place. If the mathematical framework is inconsistent, this may suggest that the thinking leading up to the use of such an approach is also inconsistent and, therefore, likely to lead to further difficulties and misinterpretations.
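To make the coverage issue concrete, the following Monte Carlo sketch (our own illustration, assuming numpy/scipy, with arbitrary example values) uses the simple Gaussian example above: a single critical value tuned to give 90% "weak" coverage does not deliver 90% coverage for each fixed true value of ξ.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(1)
    A, theta_true, n_trials = 0.0, 1.0, 200_000

    # -2 ln Lambda_theta(x) = (x - A - theta)^2 / 2 is chi2(1)-distributed when xi is
    # averaged over its Gaussian prior, so this critical value gives 90% "weak" coverage.
    crit = chi2.ppf(0.90, df=1)

    def coverage(xi_draws):
        x = rng.normal(theta_true + xi_draws, 1.0)    # one measurement per trial
        t = 0.5 * (x - A - theta_true) ** 2           # -2 ln Lambda evaluated at the true theta
        return np.mean(t < crit)                      # fraction of trials that cover theta_true

    print("weak coverage:", coverage(rng.normal(A, 1.0, n_trials)))   # ~0.90 by construction
    for offset in (0.0, 1.0, 2.0):                                    # fixed true xi = A + offset
        print("xi - A =", offset, ":", coverage(np.full(n_trials, A + offset)))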
Bayesian Treatment of Nuisance Parameters

Nuisance parameters present no special difficulties for a Bayesian analysis. If L(D|θ, ξ) is a likelihood function for a datum D depending on a nuisance parameter ξ, any constraints on this parameter are easily included as part of the prior in Bayes' Theorem. For example, ξ might represent the rate of a background process, perhaps measured in a separate calibration or side channel. A probability distribution g(ξ) can then be assigned for ξ which, together with a prior f(θ) for the physics parameter θ, can be used as priors in Bayes' Theorem to give a joint posterior distribution for both θ and ξ:

P(\theta, \xi|D) \propto L(D|\theta, \xi)\, f(\theta)\, g(\xi)    (14)

Dependence on the unwanted parameter ξ can then be removed simply by integrating the posterior distribution with respect to ξ.

It is worth noting that because the likelihood function L(D|θ, ξ) depends on ξ, the measurement itself may often contain useful information on the true value of ξ. Incorporating prior information on ξ along with information derived from D through the likelihood makes full use of all of the information about ξ that is available. This is often a superior approach to Monte Carlo methods in which random values of ξ are drawn from the distribution g(ξ), and then used to fit for θ, with ξ held constant in each fit. If the Monte Carlo approach is used, the results of fitting for θ using each random throw of ξ should ideally be weighted by the calculated likelihood for that value of ξ.

Choosing appropriate priors for nuisance parameters representing systematic uncertainties is also generally straightforward in practice. If the constraint on the nuisance parameter is the result of an independent measurement, the likelihood function for that measurement can be an appropriate choice of prior for g(ξ) in Equation 14 (possibly further modified by any theoretical priors on ξ itself). Physical boundaries, such as requiring background rates to be non-negative, are easily incorporated by setting the prior to zero in the unphysical region and, in fact, should always be included to prevent unphysical behavior in the posterior distribution. In the case that there is no previous measurement upon which to base a prior for the nuisance parameter, the guidelines in the Section "Choice of Bayesian Priors" may be used. An appendix contains further discussion of the mathematical relationship between integrating over a nuisance parameter vs. maximizing the likelihood with respect to one.
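A minimal numerical sketch of Equation 14 (our own example, assuming numpy/scipy, with illustrative numbers) is given below for a Poisson count with a background nuisance parameter constrained by a truncated Gaussian prior; the nuisance parameter is removed by integrating the joint posterior.

    import numpy as np
    from scipy.stats import poisson, norm

    n_obs, b0, sigma_b = 2, 5.0, 1.0                 # illustrative numbers only

    theta = np.linspace(0.0, 20.0, 2001)             # signal parameter, uniform prior f(theta)
    xi = np.linspace(0.0, 12.0, 1201)                # background nuisance; prior set to zero for xi < 0
    T, X = np.meshgrid(theta, xi, indexing="ij")

    joint = poisson.pmf(n_obs, T + X) * norm.pdf(X, b0, sigma_b)   # L(D|theta,xi) g(xi)
    marginal = np.trapz(joint, xi, axis=1)                         # integrate out xi (Eq. 14)
    marginal /= np.trapz(marginal, theta)                          # normalized posterior for theta

    cdf = np.cumsum(marginal) * (theta[1] - theta[0])
    print("90% CI upper bound on theta:", theta[np.searchsorted(cdf, 0.90)])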
ISSUES WITH UNIFICATION

The paper of Feldman and Cousins advocates the use of a "unified approach," in which the formalism of the interval construction itself dictates whether upper bounds or two-sided confidence intervals are given. The argument for this is to avoid the problem of "flip-flopping," whereby different experimenters choose for themselves when to quote a given type of interval based on the result, leading to a small statistical bias in frequentist coverage in some cases if one were to do an unfiltered survey of only those frequentist intervals reported. It should be emphasized again that this is a purely frequentist issue — Bayesian intervals are immune to such effects as they are not defined with respect to ensembles.

The frequency with which signals are excluded can, for borderline cases, be as much as 50% higher than would be inferred from the nominal confidence level (see Appendix B). In other words, an ensemble of 90% CL bounds may only have 85% coverage. This deviation from nominal coverage is of similar magnitude to the inherent variations in frequentist coverage due to quantized Poisson statistics (see Appendix A). In practice, it is also the case that any such biases are often dwarfed by other factors, including difficulties in assessing and propagating systematic uncertainties, accounting for look-elsewhere effects, and various details of the particular analysis approach employed. In addition, it should be considered that results are generally not taken purely at face value, but are often re-analyzed and combined with other results when appropriate, and that the effect in question diminishes as significance levels move away from the cross-over region (where measurements are generally viewed conservatively in any case, independent of what type of interval may be quoted). Therefore, the potential impact of "flip-flopping" is, in fact, not particularly significant in relative terms.

On the other hand, the adoption of a unified approach imposes substantial constraints that can lead to non-trivial difficulties:

• The approach conflicts with the desired and scientifically well-motivated convention to quote a 90% or 95% CL for upper/lower bounds for results that are consistent with the zero signal hypothesis, but to only claim a 2-sided discovery interval when the zero signal hypothesis is rejected at a considerably higher confidence level (typically in excess of 3 or more standard deviations). Indeed, even in cases where a unified 2-sided interval may be shown, it is often accompanied by the phrase, "we regard this as an upper limit" when the significance is not judged to have passed a critical level, despite the fact that the coverage is not appropriate for such a limit.

• A unified approach cannot easily cope with look-elsewhere effects (i.e. trials factors). For example, if a search for gamma-ray emission from 1000 different astrophysical sources results in no event excess exceeding 3 standard deviations above the background levels, the data may be judged to be consistent with statistical fluctuations and the most appropriate things to quote are upper bounds on the possible emission from each source. However, a unified approach would instead necessitate a 3σ detection interval for observations consistent with chance fluctuations.

• Even in the case of a clear detection, it may still be relevant to also quote upper and lower bounds in the context of certain models. For example, some classes of models may simply place bounds on the allowed maximum luminosity of a given source. Thus, different interval constructions can be simultaneously valid and relevant for the same results, as they simply address different questions.

These difficulties appear to substantially outweigh any benefit of making what is, in the end, a minor correction to frequentist coverage. Therefore, on balance, we believe it is pragmatically advantageous to allow the nature of interval constructions to be determined by the experimenters themselves based on an assessment of scientific relevance, rather than having these dictated by an inflexible and, ultimately, inappropriate formalism.
CONVERGENCE, DIVERGENCE AND CONFUSION
The likelihood function is central to all of the approaches described; with increasing numbers of events it increasingly constrains the model parameters and dominates the definition of these intervals. Hence, in the limit of large numbers of events, the dependence on the particular choice of prior becomes insignificant in the Bayesian construction, and the Feldman-Cousins ordering parameter has little effect away from a physical boundary such as the origin. All methods therefore converge on the same bounds in this region. This helps to address the question of what constitutes a sufficient ensemble of measurements in the frequentist approach for the constructed intervals to begin to reliably constrain a model. Namely, this occurs when there is a sufficient sampling to well-characterize the likelihood space of measurements, at which point such an approach would produce the same answer as for the Bayesian method.

The deviation between approaches in the region of low numbers of events is therefore largely a reflection of the fact that there is not yet enough information to make a more definitive statement about the model without supplying at least some additional constraints. The frequentist approach in this case is to simply place the measurement in context for some hopeful ensemble of other experiments without trying to identify the model, while the Bayesian approach is typically to seek out some minimal set of "reasonable and conservative" constraints in order to infer which model parameter values are the most likely. These philosophies are not mutually exclusive — both goals are valuable in the case of limited information and simply need to be appropriately defined and distinguished.

However, in many cases, a lack of clarity in this regard and a reticence to provide both types of information has led to confusion. An informal survey of physics journal articles suggests that, in a large fraction of cases, quoted frequentist intervals are often used to make statements regarding constraints on model parameter values, either by the authors themselves or by others in subsequent articles, even when the natures of the intervals are explicitly stated. Parameter exclusion plots, as the name implies, are instinctively interpreted as defining allowed and disallowed regions of model parameter space based on the data. This is an inherently Bayesian interpretation, yet the nature of the derived contours is not always consistent with this. We believe there is need of a more pragmatic approach which recognizes that, while it is critical to objectively convey the information content of the data, there is also a strong desire to derive bounds on model parameter values and a natural instinct to interpret things this way.
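This convergence is easy to demonstrate numerically. The sketch below (our own, assuming scipy) compares the standard frequentist (Eq. 4) and uniform-prior Bayesian (Eq. 3) upper bounds as the number of events grows and the observation moves away from the physical boundary.

    from scipy.stats import poisson
    from scipy.optimize import brentq

    def freq_up(n, b, cl=0.90):
        return brentq(lambda s: poisson.cdf(n, s + b) - (1 - cl), 0.0, 10_000.0)

    def bayes_up(n, b, ci=0.90):
        target = (1 - ci) * poisson.cdf(n, b)
        return brentq(lambda s: poisson.cdf(n, s + b) - target, 0.0, 10_000.0)

    # (n, B): a downward fluctuation near the boundary, then observations progressively
    # further above background with larger event numbers; the two bounds converge.
    for n, b in [(5, 9.0), (14, 9.0), (140, 90.0), (1400, 900.0)]:
        print(n, b, round(freq_up(n, b), 2), round(bayes_up(n, b), 2))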
TOWARDS A MORE RELEVANT AND TRANSPARENT APPROACH

Relevant Statements for Scientific Papers
We start with an attempt to distill the basic statements that are desirable to make regarding the nature of results from an experimental measurement. There are typically four issues of relevance:

1. To present the measured value of a direct observable and an assessment of systematic uncertainties that could bias the measurement. This results in a simple, objective statement concerning the observation.

2. To address the question, "How often would a measurement 'like mine' occur under the zero signal hypothesis?" This is a frequency question probing statistical consistency and focusing on an observable in the context of a single, fixed model (i.e. a Fisher-type test). For this, the normalized PDF for observations under the zero signal hypothesis can be appropriately integrated (including integration over any nuisance parameters) to arrive at an assessment, i.e. a "p-value" (a short numerical sketch is given at the end of this subsection). There is, however, some ambiguity in what is meant by 'like mine.' For example, "How likely is it to measure a rate this high?" or "How likely is it to measure a rate this far away from the predicted value under the zero signal hypothesis?" As there is not a general form for this, it needs to be defined in a relevant way on a case-by-case basis. Note that this is not equivalent to defining a frequentist interval — the result is just a single number representing the statistical chance of measuring a value for the observable in a 'similar range' under the zero signal hypothesis. Inferences based on p-values should be treated with caution (see, for example, [23] for further discussion).

3. To address the question, "What constraints do my measurements of direct observables place on model parameter values?" This is explicitly a Bayesian question and, thus, requires the application of the appropriate formalism, including the use of a prior to define the relative context of models.

4. To objectively convey the relevant information content of the data so as to allow the impact of alternative assumptions to be evaluated, facilitate the testing of different models, and permit information from this measurement to be effectively combined with that from other experiments. Frequentist intervals are blunt instruments for this purpose that only provide a crude simplification of likelihood information viewed through a particular, non-unique filter that may be prone to misinterpretation. In practice, such intervals are rarely used in the ensemble tests for which they are relevant, being disfavored relative to combined analyses that either use the raw data or likelihood maps from different experiments. A better approach would therefore be to actually provide, to the best extent possible, the likelihood information directly.

The means by which to address the first two of these issues is relatively straightforward and largely uncontroversial. We will therefore now concentrate on approaches relevant for the latter two issues.
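As a minimal illustration of item 2 (our own example, assuming scipy, with arbitrary numbers), a one-sided p-value for a counting measurement can be computed as follows; the choice of what constitutes a measurement "like mine" must still be made case by case.

    from scipy.stats import poisson

    n_obs, bkg = 12, 5.0                       # arbitrary illustrative numbers
    p_value = poisson.sf(n_obs - 1, bkg)       # P(n >= n_obs | B): chance of a count this high
    print(p_value)                             # ~0.005 for these numbers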
Choice of Bayesian Priors
As indicated previously, in order to use measurements to bound model parameter values, a context for these values must be provided in the form of a prior probability. This conveniently permits known, physical constraints to be imposed (e.g. energies and masses must be greater than zero; the position of observed events must be inside the detector, etc.) and allows known attributes of the physical system to be taken into account (e.g. energies are being sampled from a particular spectrum; the relative probabilities for different event classes are drawn from a given distribution, etc.). The choice of priors in such contexts is often non-controversial. Less straightforward is the case of defining a prior within the physical region when there is no a priori knowledge of the probability distribution for a given model parameter: the so-called "non-informative" prior. It may seem odd to need to choose a prior at all under such circumstances, but the fact is that "no knowledge" is a fuzzy concept whose meaning needs to be defined.

We should note (as others have) that it is rarely the case that there is really "no prior information" at all, as we will generally have some knowledge of previous observations related to the current measurement. At the same time, it is best to avoid tuning priors to previous observations in too substantive a way in order to preserve the robustness of independent verification. Typically, then, the term "non-informative" prior actually refers to a "weakly informative" prior.

At first glance, one might think that providing an equal weighting to all parameter values (i.e. uniform in probability) would make the most intuitive sense for the non-informative case. However, this runs into two issues, one trivial and one non-trivial. The trivial issue is that such a prior is improper, not having a finite integral, and it also raises the question of whether you actually believe, for example, that it is equally likely to detect 10 events as it is to detect 3. However, in practice, the prior is always multiplied by the likelihood function, which suppresses its impact outside of the region of interest for the actual observation and makes the exact form of the prior far from this region irrelevant. Thus, a uniform prior should really be viewed as a sufficient approximation to one that actually trails off to zero at some point in a manner that does not need to be specifically defined. The non-trivial issue with uniform priors is that uniformity is not necessarily preserved for parameters couched in a different form: a prior distribution that is uniform in a parameter S is not uniform in a nonlinear function of that parameter, such as S². This therefore requires a choice to be made as to what form of the model parameters might be reasonably assumed to have a uniform prior probability distribution.

The subject of non-informative priors is one of active discussion and debate. One common alternative is to use a Jeffreys prior [24], which uses the Fisher information content of the likelihood function itself to define a non-informative prior that is transformationally invariant. While this approach often works well for simple situations, its application is not always straightforward for realistic analysis scenarios that may involve multiple, multi-dimensional signal and background components, often with non-parametric forms, nuisance parameters for error propagation, and multi-parameter signal models.
Consequently, this can also lead to forms of the prior that are non-intuitive and that depend on the particular analysis in which it is applied.

It should be emphasized at this point that there is often no "correct" choice of prior. Making any statement regarding the viability of a given model based on the data necessarily requires a stated frame of reference, and the prior just defines the context within which one chooses to make this assessment. With this understanding in mind, we take the pragmatic view that a non-informative prior should be intuitively reasonable, simple to apply and visualize, and must allow the impact of an alternative choice of prior to be easily evaluated. To this end, we suggest that the use of uniform priors would be preferred and that relevant model parameters should be couched in simple forms that make this a not unreasonable choice. The nature of bounds on different forms can then be derived from this initial determination. This approach has the further advantage that maps of the Bayesian probability densities couched in terms of such "flat" parameters are identical to likelihood maps, which can then serve the important dual purpose of providing both a Bayesian assessment of model viability and a detailed presentation of the objective information content of the data.

For the purpose of consistency, it would be advantageous to establish common conventions (e.g. PDG guidelines) for such flat parameter forms relevant to different types of experimental measurement. Fortunately, while the number of possible models is infinite, the basic nature of fundamental parameters on which these models depend is not, and we believe that finding agreement on a set of reasonable choices is not a particularly contentious issue in practice. As a general rule of thumb, our perception of model parameters often falls into one of two categories: either they are variables of magnitude, with values spanning the same general order and for which a uniform prior (sometimes over a limited range) may be a reasonable choice; or they are variables of scale, potentially spanning several orders of magnitude and for which it may be appropriate to weight such scales equally, resulting in a prior that is uniform in the log. These two choices typically bound the range of non-informative priors that are generally considered to be plausible. It is, for example, usually difficult to justify a form of non-informative prior that actually rises with signal strength or falls faster than would be implied by giving equal weight to all scales. Examples of priors uniform in magnitude might include the value of an unknown phase angle, the value of a spectral index, or the precision measurement of a quantity whose rough magnitude is constrained. Examples of priors uniform in scale might include the energy scale for new physics, the cross section for some non-standard interaction, or first measurements of quantities whose rough magnitude is not constrained. A good indication of the relevant variable class can often be taken from how such variables are typically represented on parameter plots (i.e. whether they have linear or logarithmic scale axes).

However, for models that directly depend on a counting rate measurement (especially in the region of low event numbers), and where there is not a strong case for another form of prior, we make the pragmatic proposal to always use a prior proportional to a uniform average counting rate.
This is because the sensitivity range for an experiment to detect a signal not previously established does not typically stretch over several orders of magnitude and, in the event that an upper bound is appropriate, this prior choice produces a conservative number for evaluating the viability of model parameter values. In addition, as previously shown, this choice also tends to yield a good degree of statistical coverage for simple cases which, while not necessary for Bayesian bounds, we regard as convenient.
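To make the earlier reparametrization point explicit, the standard change-of-variables rule shows how a prior that is flat in one form of a parameter acquires structure in another; the transformation u = S² below is chosen purely for illustration and is not tied to any particular measurement:

\pi_u(u) = \pi_S\big(S(u)\big)\,\left|\frac{dS}{du}\right| = \frac{\mathrm{const}}{2\sqrt{u}} \quad \text{for } \pi_S(S) = \mathrm{const},\; u = S^2 .

Thus a prior uniform in S diverges like 1/√u when expressed in u = S²; conversely, a prior uniform in u = S² corresponds to π_S(S) ∝ S, which rises linearly with S. The choice of which form to treat as "flat" therefore has to be made explicitly.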
Sensitivity to Prior
From Equation 1 it is clear that, for a given hypothesis, H, and data set, D, the posterior probability is related to the prior as follows:

\log P(H|D) = \log P(D|H) + \log P(H) + C = \sum_{i=1}^{n} \left[ \log f(x_i|H) + \frac{1}{n}\log P(H) \right] + C \qquad (15)

where f is the likelihood function evaluated for each of the n independent data values x_i and C is a constant. Hence, the impact of the prior probability, P(H), becomes less significant as the number of events increases. However, it is still the case that the choice of prior can, in some instances, have a notable impact on the perception of model viability. The effect becomes particularly relevant where parameter ranges are unconstrained over orders of magnitude and arguments hold for choosing a nominal prior that is uniform in the log. Consequently, while the analysis of the previous section can provide a reasonable basis for default parameter representations and priors, in such cases we believe it is also important to specifically indicate the extent of the prior sensitivity. As a reasonable and pragmatic approach, we suggest doing so by comparing the results from the choice of a prior that is uniform in scale with those from a prior that is uniform in the magnitude of the relevant parameter. The impact can be indicated by an additional contour on parameter maps and, if significant, can then be further highlighted in the data summary.

It is frequently the case that the appropriate choice between the two suggested forms of prior is clear and that either conservative upper bounds can be derived in the case that the data is consistent with no signal, or that the strength of an observed signal will help to dictate robust bounds. However, if the choice of prior is not obvious, and if the conclusions strongly depend on the available choice, then avoiding a Bayesian method does not alter this fact. Using purely frequentist approaches will merely hide the ambiguity, potentially leading to false conclusions regarding the robustness of implications for model parameter values.

When taken together, we believe that the type of approach outlined, involving: 1) a pragmatic choice of prior; 2) an explicit presentation of the likelihood; and 3) a test of the prior sensitivity where appropriate, can provide a robust approach for indicating Bayesian model constraints as an important component of the overall presentation of results. In addition to helping avoid confusion in interpretation, this would also serve the valuable purpose (often overlooked) of explicitly indicating the strength of information in the data and when reliable inferences regarding model parameter values might be made.
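The prior-sensitivity test suggested above can be carried out with a few lines of numerical integration. The sketch below, for a simple Poisson counting measurement with known background, compares 90% credible upper bounds on the signal rate S under a prior uniform in S and a prior uniform in log S. The background value, observed count, lower cutoff on S for the log-uniform prior (needed because that prior is otherwise improper at S = 0), and grid settings are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

def upper_bound(n_obs, bkg, prior, cl=0.90, s_max=60.0, n_grid=200_001):
    """Credible upper bound on the Poisson signal rate S for a given prior density."""
    s = np.linspace(1e-4, s_max, n_grid)               # grid over the physical region S > 0
    post = stats.poisson.pmf(n_obs, mu=s + bkg) * prior(s)
    cdf = np.cumsum(post)
    cdf /= cdf[-1]                                      # normalize the posterior
    return np.interp(cl, cdf, s)                        # S value where the CDF reaches cl

n_obs, bkg = 10, 5.0
flat = lambda s: np.ones_like(s)                        # uniform in S
log_uniform = lambda s: np.where(s > 0.01, 1.0 / s, 0.0)  # uniform in log(S), cutoff at 0.01 (assumed)

print("uniform prior    :", upper_bound(n_obs, bkg, flat))
print("log-uniform prior:", upper_bound(n_obs, bkg, log_uniform))
```

If the two bounds differ substantially, that difference is itself the useful piece of information to report alongside the default result.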
Unified Likelihood Maps and Data Summaries

As mentioned in the previous section, the suggestion is to display the Bayesian posterior probability distribution in terms of model parameters with uniform priors, which then simultaneously shows the global likelihood as well. This probability should be suitably marginalized over the non-essential parameters. It is sufficient to give the ratio of the Bayesian probability (likelihood) at any one point to the maximum value. In fact, it would seem sensible to couch this as −2 ln(L/L_max), since this is often approximately equivalent to differences in χ² from the best fit and, thus, carries some intuition. This also readily allows for approximate frequentist intervals to be inferred via Wilks' Theorem, as discussed in Appendix A. As is often done for 1-parameter models, a simple graph of this quantity as a function of the parameter can be shown; for a 2-parameter model, a 2D contour or color map can be given; and for higher order models, appropriate sample 2D slices (preferably in the most slowly varying parameters) can be shown.

Bayesian credible intervals can be superimposed on top of these plots and are simply computed by integrating the Bayesian probability distribution (in this case equivalent to the distribution of L/L_max) in the space of these flat parameters to find the fraction, f, of the distribution above some value L_C. The contour defined by that value of −2 ln(L_C/L_max) then corresponds to the credibility level CI = f. This approach also gets around another common problem: attempting to interpret the meaning of maximum likelihood contours in terms of significance levels requires either relying on Wilks' Theorem (which, while often providing good estimates, cannot always be relied upon for precision in the region of low numbers of events and/or near a physical boundary), or undertaking a potentially burdensome (in some cases, perhaps intractable) Monte Carlo calculation. By contrast, the Bayesian calculation is always well-defined and has no such issues.

For the purposes of abstracts, text, tables, etc., a convenient summary of results is desired. In the case of single parameter models, the value of the parameter corresponding to the maximum likelihood (which is equal to the maximum of the posterior distribution) can be quoted along with the associated Bayesian credible interval. For the multi-parameter case, results may be summarized by quoting the marginalized Bayesian intervals for each parameter. As a shorthand for more detailed likelihood shape information, especially where behavior tends to be non-Gaussian, we suggest that bounds in terms of flat parameters be quoted at two different credibility levels, for example, 90% and 99%, or 68% (1σ) and 95% (2σ). And, as previously discussed, for cases where the choice between the two types of priors leads to a notably less conservative constraint, the impact should be specifically indicated.
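The credible-contour construction just described amounts to finding the likelihood-ratio threshold that encloses a given fraction of the posterior. A minimal sketch, assuming a 2D grid of ln(L/L_max) values evaluated in flat parameters (here a made-up Gaussian-like map purely for illustration):

```python
import numpy as np

def credible_threshold(log_lr, frac):
    """Given a grid of ln(L/Lmax) over flat (uniform-prior) parameters, return the
    -2*ln(Lc/Lmax) value whose contour encloses the fraction `frac` of the posterior."""
    w = np.exp(log_lr).ravel()                 # posterior weight per grid cell (flat priors)
    order = np.argsort(w)[::-1]                # sort cells from highest to lowest density
    cum = np.cumsum(w[order]) / w.sum()
    w_cut = w[order][np.searchsorted(cum, frac)]   # density at which `frac` is enclosed
    return -2.0 * np.log(w_cut)

# Illustrative likelihood map: a correlated 2D Gaussian in flat parameters x, y.
x, y = np.meshgrid(np.linspace(-5, 5, 401), np.linspace(-5, 5, 401))
log_lr = -0.5 * (x**2 + y**2 - x * y)          # ln(L/Lmax), with maximum value 0 at the origin

for f in (0.68, 0.95, 0.9973):
    print(f, credible_threshold(log_lr, f))
```

For a near-Gaussian map the resulting thresholds should land close to the 2D χ² critical values, in line with the comparison made for Figure 4 below.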
Example 1: neutrino mixing

As a first example, consider the case of neutrino oscillation and mixing. Regarding the choice of non-informative prior for ∆m², if the scale has not been firmly established, the value could correspond to a range of scales and is generally plotted logarithmically. Accordingly, this suggests the choice of a prior that is uniform in the log of this variable. If the scale has already been well established in the field, a prior uniform in ∆m² may be more appropriate (though the impact of the choice is unlikely to be substantial in this case).

For the unknown mixing angles and phases, non-informative priors that are uniform in circular range are suggested. One complication here is that the form the mixing angle takes in, for example, the 2-neutrino vacuum mixing expression is sin²2θ, which means that a given observation leads to an ambiguously defined value of θ itself, which can be in any of 4 quadrants. Indeed, there is no clear consensus in the field as to the best form to use for such experiments, with various choices including sin²θ, sin²2θ and tan²θ, making comparisons between different results and different phenomenology papers troublesome. While various trigonometric forms may appear to describe different phenomena, the fundamental parameter is the angle itself and it therefore seems appropriate to try to couch things in terms of this variable. The issue of redundant multi-quadrant values that may arise from measurements in some cases can be readily taken into account by using variables such as |θ + nπ|, which then allows a non-redundant range in θ to represent other quadrants simultaneously.

As a more specific example of such a construction, we use the publicly available contour map data from the 2005 SNO salt phase solar neutrino mixing parameter analysis [25]. At the time of this data, the scale of ∆m² had not yet been unambiguously established by solar neutrino data, so we choose a prior that is uniform in the log of this parameter. We therefore obtain the plot in Figure 4. The color scale represents the range of ln(L/L_max), while solid line contours are also shown corresponding to Bayesian credibility levels of 68%, 95% and 99.73%. The values of −2 ln(L_C/L_max) corresponding to these levels were found to be 2.82, 6.58 and 12.76, respectively, which are not so far from the naive Wilks' expectation for critical ∆χ² values corresponding to a 2D parameter space (2.3, 6.1, 11.8). This suggests that frequentist constraints would look very similar in this case. To indicate the sensitivity to the choice of prior, the dashed line contour indicates the 68% CI region if a prior that was uniform in ∆m² itself were used. This region is of a very similar size to the LMA 68% contour using the default prior that is uniform in scale, with a relatively modest displacement of boundary positions. This indicates that this range is relatively insensitive to the form of the prior and that conclusions regarding the model here are reasonably robust.

FIG. 4: Confidence intervals derived from the SNO experiment's salt phase data [25]. The colors represent the difference in the log of the likelihood ratio or, equivalently, differences in the posterior probability calculated with uniform priors for both axes, relative to the maximum point. The solid black lines are 68%, 95%, and 99.73% Bayesian contours found by integrating the posterior distribution. The black dashed region is the 68% credible region found under an alternative prior that is uniform in ∆m² rather than log(∆m²).
However, the LOW region would be eliminated, suggesting that the default choice of a prior uniform in log(∆m²) is the more conservative approach in this case.

For the data summary, we obtain the marginalized uniform Bayesian intervals |θ + n(180°)| = 33.+2..−.. degrees at the 68% (95%) CI. For ∆m², we have the interesting situation that the SNO data alone cannot separate the LOW and LMA solutions at the desired credibility levels with the assumed prior, giving rise to a bimodal distribution in the marginalized parameter. In this case, both possibilities should be presented using the 2-part intervals that are naturally produced by selecting the highest probability densities that constitute 68% and 95% of the overall distribution, respectively. Thus, we obtain the marginalized uniform Bayesian intervals −log(∆m²) = 4.+0..−.. eV² and 6.+0..−.. eV² at the 68% (95%) CI.
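The 2-part intervals mentioned above follow from ranking the marginalized posterior by probability density and accumulating the highest-density regions until the desired fraction is reached, which naturally yields disjoint intervals for a bimodal distribution. A minimal sketch, with a made-up bimodal posterior standing in for the marginalized −log(∆m²) distribution:

```python
import numpy as np

def hpd_intervals(x, pdf, frac):
    """Highest-posterior-density interval(s) containing `frac` of a (possibly
    bimodal) 1D posterior sampled on the grid x."""
    w = pdf / pdf.sum()
    order = np.argsort(w)[::-1]
    n_keep = np.searchsorted(np.cumsum(w[order]), frac) + 1
    mask = np.zeros(len(x), dtype=bool)
    mask[order[:n_keep]] = True
    # Group the selected grid points into contiguous (low, high) intervals.
    intervals, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = x[i]
        if not m and start is not None:
            intervals.append((start, x[i - 1]))
            start = None
    if start is not None:
        intervals.append((start, x[-1]))
    return intervals

# Illustrative bimodal posterior: two Gaussian bumps of unequal weight.
x = np.linspace(0, 10, 2001)
pdf = 0.7 * np.exp(-0.5 * ((x - 4.0) / 0.3) ** 2) + 0.3 * np.exp(-0.5 * ((x - 6.5) / 0.4) ** 2)

for f in (0.68, 0.95):
    print(f, hpd_intervals(x, pdf, f))
```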
Example 2: rare event search

Unfortunately, it is not a common enough practice for experiments to provide detailed likelihood information. Consequently, to provide an example of an analysis operating in a much more restricted range of event numbers, we will consider a hypothetical experiment generically searching for rare events. For simplicity, we will assume this to be a counting experiment with negligible systematic uncertainties and a well-defined background expectation, along the lines of the examples considered earlier in this document.

Assume that the expected background is B = 5 events in the region of interest and that 10 events are observed. We then wish to place 90% CI upper bounds on the average signal strength, S. In accordance with previous discussion, as this measurement relates to a low-rate counting experiment, we choose a prior that is uniform in S. We would therefore use the Poisson likelihood function from Equation 2 with B = 5 to plot −2 ln(L/L_max) as a function of S, as shown in Figure 5.

Even for such a simple measurement, the likelihood function will typically not, in fact, be a simple Poisson distribution owing to an imperfect knowledge of the expected background and systematic uncertainties, the effects of which must be taken into account by an appropriate marginalization of the distribution.

FIG. 5: Difference in log likelihood from the best-fit signal strength, as a function of signal strength, for a Poisson process with background rate B = 5 where 10 events are observed. The vertical lines represent Bayesian upper limits at the stated credibility levels, while the intersections of the dashed horizontal line with the curve indicate the 90% CL two-sided limits from Wilks' Theorem.

The Bayesian uniform prior upper bounds are indicated on the plot for 90%, 95% and 99% CI. As the alternative choice of a prior uniform in log(S) would result in a smaller value for the bound (e.g. S < .35 at 90% CI), it is not necessary to show the impact of this alternate choice, since the chosen prior represents the conservative value. Thus, we may simply quote the Bayesian uniform prior upper bounds on S as 10.4 (15.2) at the 90% (99%) CI.

The dashed line also shown on the plot corresponds to an approximate frequentist 90% CL based on Wilks' theorem, and suggests that an interval would be called for under a unified approach spanning 0.66-11.1 (comparable to the corresponding Feldman-Cousins interval of 1.2-11.5). Hence, for this case, frequentist numbers are reasonably similar to those from the Bayesian approach, with the slight difference at the upper end largely due to the choice to quote a 1-sided versus a 2-sided bound.

If we instead considered the scenario described earlier where B = 9 and 5 events are observed, the likelihood plot shown in Figure 6 is instead produced.

FIG. 6: Difference in log likelihood from the best-fit signal strength, as a function of signal strength, for a Poisson process with background rate B = 9 where 5 events are observed. The vertical lines represent Bayesian upper limits at the stated credibility levels, while the intersections of the dashed horizontal line with the curve indicate the two-sided 90% CL limits from Wilks' Theorem.

In this case, as previously described, the Bayesian upper bound is 3.88, while the frequentist bound inferred from Wilks' theorem is only 2.64 (slightly more conservative than the Feldman-Cousins bound of 2.38 for this scenario). This is an example where frequentist bounds are misleadingly foreshortened as a result of negative fluctuations and prone to misinterpretation.
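For this simple counting case the numbers quoted above can be checked directly from the Poisson likelihood. The sketch below builds −2 ln(L/L_max) for B = 5 and n = 10, integrates the uniform-prior posterior to obtain the 90% and 99% upper bounds, and reads off the approximate Wilks 90% interval from the −2 ∆ln L = 2.706 crossings; it should reproduce values close to those quoted in the text, with exact agreement depending on grid resolution.

```python
import numpy as np
from scipy import stats

B, n_obs = 5.0, 10
s = np.linspace(0.0, 60.0, 60_001)
loglike = stats.poisson.logpmf(n_obs, mu=s + B)          # ln L(S) for the counting model

# Bayesian upper bounds with a prior uniform in S (posterior proportional to L for S >= 0).
post = np.exp(loglike - loglike.max())
cdf = np.cumsum(post) / post.sum()
for cl in (0.90, 0.99):
    print(f"{int(cl * 100)}% CI upper bound on S:", np.interp(cl, cdf, s))

# Approximate frequentist 90% CL interval from Wilks' theorem: -2*dlnL <= 2.706,
# with the best fit restricted to the physical region S >= 0.
dchi2 = -2.0 * (loglike - loglike.max())
allowed = s[dchi2 <= 2.706]
print("Wilks 90% CL interval:", allowed.min(), "-", allowed.max())
```

Changing B to 9 and n_obs to 5 reproduces the second scenario discussed above.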
CONCLUSION

The distillation of constraints on theoretical model parameter values is central to scientific enquiry and represents the ultimate goal of experimental observations. There is therefore an understandable desire to interpret the probable nature of such constraints as well as to define the objective details of experimental observations, even when the available information content of the data is necessarily limited. A failure to recognize both aspects and clearly represent them with distinct and appropriate formalisms inevitably leads to misinterpretation and misuse of quoted confidence intervals.

Frequentist confidence intervals, even "sophisticated" implementations such as Feldman-Cousins, generally suffer from significant shortcomings in this regard:

• They do not reliably indicate the relevance of individual measurements.
• They are prone to frequent misinterpretation that can result in incorrect conclusions and seemingly paradoxical behaviors.
• They do not provide a robust basis for the comparison of different experimental results.
• There is no self-consistent approach to propagate systematic uncertainties.
• There is no self-consistent method for producing a lower-dimensional confidence interval with a desired statistical coverage from a higher-dimensional interval.
• They do not possess a unique definition and represent a very limited view of the underlying likelihood information.

Furthermore, frequentist intervals are rarely, if ever, actually used for the purpose in which their context is meaningfully defined: constraining model parameter values at the designated confidence level with a large ensemble of similar measurements. Even where multiple measurements exist, the practice is generally to undertake a combined analysis with either the raw measurements or likelihood maps from individual experiments rather than to compare frequentist intervals. As such, the relevance of quoting intervals that are never actually used for their intended purpose (and, in fact, are readily abandoned with the addition of further data) is questionable. As far as objectively conveying the information content of the data, we therefore believe that providing the likelihood itself as a function of relevant model parameters offers the clearest and most useful approach.

Bayesian credible intervals offer the only well-defined mathematical approach to placing bounds on model parameter values themselves and, thus, are a necessary component of data presentation if one is to address this aspect. Bayesian intervals are generally free from many of the problems that plague frequentist intervals, but suffer from the one issue of requiring the specification of a prior to establish the context of the model. However, we argue here that:

1. For Poisson statistics in particular, the use of a prior that is uniform in the counting rate produces conservative upper bounds and offers a pragmatically defensible choice for bounding model parameter values. (Appendix A shows that these credible intervals also have reasonable frequentist coverage for simple cases.)

2. In other areas, the forms for priors are effectively bounded between those that are uniform in magnitude and those that are uniform in scale (i.e. equal weighting in the log), where the appropriate choice between these two alternatives for a given parameter class is usually obvious.
3. When appropriate, explicit indications of the sensitivity to this choice of prior can easily be shown, which provides additional valuable details regarding the strength of information contained in the data that is otherwise hidden (rather than avoided) by pure frequentist methods.

Taken together, we believe this provides a clear and robust approach to the presentation of model constraints. Achieving a community consensus on such an approach for Bayesian intervals seems both straightforward and of more practical benefit than continuing to only use frequentist constructions which are, themselves, arbitrary and prone to the numerous difficulties previously listed.

We also recommend that the global likelihood for data sets be shown as a function of 'flat' Bayesian model parameters to simultaneously indicate the Bayesian posterior probability distribution and objectively represent the relevant information content of the data. Such representations allow other analysts to more optimally combine information from other experiments and test models under different assumptions. Bayesian credibility levels can be readily indicated on such plots, and approximate frequentist intervals can also generally be extracted, if desired, by appealing to Wilks' Theorem (which is found to work well for Poisson statistics, even for small numbers of events).

As previously indicated, we believe the standardization of priors is straightforward and no more arbitrary than choosing any other type of interval construction. For data summaries, we therefore recommend quoting Bayesian credibility bounds in terms of common flat parameter forms, marginalizing where appropriate for multi-dimensional parameter spaces. As a shorthand for more detailed likelihood/Bayesian shape information, we also suggest that, where convenient, such bounds be quoted at two different credibility levels, such as 90% and 99%, or 68% and 95%.

Overall, we believe this represents a much more well-balanced approach to the presentation of experimental results that offers a higher degree of relevance and transparency than purely frequentist conventions, while still providing objective information about the measurement in a more useful form.

We would like to thank Louis Lyons and Josh Klein for useful discussions and, in particular, Dean Karlen for a careful reading of an earlier draft of the paper. This work has been supported by the Science and Technology Facilities Council of the United Kingdom and the Natural Sciences and Engineering Research Council of Canada.

[1] J. Neyman, Phil. Trans. Royal Soc. London, Series A, 333-380 (1937).
[2] G.J. Feldman and R.D. Cousins, Phys. Rev. D57, 3873-3889 (1998).
[3] A. Stuart and J.K. Ord, Kendall's Advanced Theory of Statistics, Vol. 2, 5th Ed. (Oxford University Press, New York, 1991).
[4] B.P. Roe and M.B. Woodroofe, Phys. Rev. D60.
[5]-[8] Phys. Rev. D62; Phys. Rev. D63; Phys. Rev. Lett.; Nucl. Phys. Proc. Suppl., 212-219 (1999).
[9] K. Eitel et al., Nucl. Phys. Proc. Suppl., 191-197 (2001).
[10] The LEP Working Group for Higgs Boson Searches, Phys. Lett. B565.
[11]-[14] Phys. Rev. D80; Phys. Rev. Lett.; arXiv:1402.6956 (2014).
[15] Details provided by the EXO collaboration (private communication).
[16] G.M. Cordeiro et al., Journal of Statistical Computation and Simulation (3-4), 211-231 (1995).
[17] K. Cranmer, eConf C030908, WEMT004 (arXiv:physics/0310108) (2003).
[18] R. Cousins and V. Highland, Nucl. Instrum. Meth. A320.
[19] Proceedings of "Advanced Statistical Techniques in Particle Physics," Ed. M.R. Whalley and L. Lyons, IPPP-02039 (hep-ex/0206034v1) (2002).
[20] W. Rolke, A. Lopez and J. Conrad, Nucl. Instrum. Meth. A551.
[21] Phys. Rev. Lett. (accepted), arXiv:1403.1532 (2014).
[22] P.S. Laplace, "Memoir on the probability of causes of events," Memoires de Mathematique et de Physique, Tome Sixieme (1774); English translation by S.M. Stigler, Statist. Sci. 1(19), 364-378 (1986).
[23] F. Beaujean et al., Phys. Rev. D83, 012004 (2011).
[24] H. Jeffreys, Proc. of the Royal Soc. of London, Series A, Mathematical and Physical Sciences, (1007), 453-461 (1946).
[25] B. Aharmim et al., Phys. Rev. Lett., 111301 (2008).

APPENDIX A: COVERAGE AND CONSISTENCY

Frequentist Coverage
Central to the frequentist paradigm is the concept of statistical coverage: that limits derived from an ensemble of repeated experiments would correctly bound the true model a known fraction of the time. This is a hypothetical construction, independent of the prospect for actually achieving a large enough ensemble of measurements to reliably bound a model with the relevant accuracy this way. Indeed, as there is no clear criterion for when an ensemble is large enough for frequentist intervals to ever be used to constrain model parameter values, nor how this could even be done in principle outside of ultimately employing Bayesian statistics, the construction is hypothetical even where there is a large ensemble of measurements.

In this context, the necessity to produce precise frequentist bounds can seem a little unclear. In fact, as discussed earlier, achieving exact coverage is often not actually possible in the presence of systematic uncertainties, and is not possible for counting statistics even without such uncertainties, since the quantized nature of measurements inevitably leads to coverages that are either a little greater than the target confidence level (over-coverage) or less than this target (under-coverage). The former of these is generally chosen to ensure coverage in excess of the target confidence level (though one might pragmatically argue that such intervals need only be determined to an accuracy comparable to that with which they will ever actually be used to bound a model). Consequently, it is possible to spend large amounts of CPU time computing Feldman-Cousins intervals in which the achieved precision does not reflect the achieved accuracy.

The coverage map for the Neyman construction of 90% CL intervals using the Feldman-Cousins ordering rule is shown in Figure 7 for Poisson statistics in the region of low counts. This figure plots the frequency with which the true model parameter value lies outside of the derived bounds and indicates the extent and distribution of the inherent over-coverage. This assumes a perfect knowledge of the background level, no systematic uncertainties, etc. Here the over-coverage, which can be as large as a few percent for certain values of the signal and background rates, is due entirely to the discrete nature of Poisson counting statistics.

FIG. 7: Fraction of time with which a Feldman-Cousins 90% confidence interval generated from Poisson observations does not contain the true signal rate, as a function of the true signal rate and the background rate (which is assumed to be known perfectly). The over-coverage of a few percent is intrinsic to the discrete nature of the observable.

While the Neyman construction permits the most controlled definition of frequentist intervals, it is often the case that useful approximate frequentist intervals can be much more easily derived from the global likelihood itself by appealing to Wilks' Theorem, which says that, in the large n limit, −2 ln L_R is distributed as a χ²_d distribution, where L_R is the likelihood ratio between nested hypotheses defined by d model parameters. In fact, this works surprisingly well for Poisson statistics, even for small numbers of counts (as might be implied by the fact that the Bartlett correction for this case is estimated to be small [16]).
The corresponding coverage map for this is shown in Figure 8, where a χ² threshold of 2.706 is used on −2 ln(L/L_best) to define a 2-sided 90% CL interval similar to Feldman-Cousins, with L_best evaluated for non-negative values of S (which leads to the over-coverage seen on the left side of the plot). A very good approximation to 90% coverage is achieved and, in fact, for most of the plane, the magnitude of the deviation from exact coverage is no larger than that seen in the Feldman-Cousins case of Figure 7. If desired, the areas of slight under-coverage can be all but eliminated to achieve more conservative bounds by simply choosing slightly higher threshold values for the likelihood ratio. Conveniently, the accuracy with which Wilks' theorem approximates the coverage suggests that reasonable estimates of frequentist intervals can often be given, even for small statistics, by simply providing the global likelihood as a function of relevant model parameters, without resorting to the computationally intensive Feldman-Cousins construction. In cases where the applicability of Wilks' theorem may be more questionable, Monte Carlo simulations can be used to spot-check the implied coverage in the region of interest.

FIG. 8: Fraction of time with which 90% confidence intervals generated by Wilks' theorem for a Poisson observation do not contain the true signal rate, as a function of the true signal rate and the background rate (which is assumed to be known perfectly). The magnitude of deviations from the expected coverage are comparable to the Feldman-Cousins case.
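Such a Monte Carlo spot-check is straightforward for the Poisson case. The sketch below estimates, for an assumed true signal rate and known background, how often the Wilks-style interval (−2 ∆ln L ≤ 2.706, with the best fit restricted to S ≥ 0) fails to contain the true rate; the specific rates and number of pseudo-experiments are illustrative choices.

```python
import numpy as np
from scipy import stats

def wilks_miss_rate(s_true, bkg, n_trials=20_000, seed=2):
    """Fraction of pseudo-experiments in which the 90% CL Wilks interval
    (-2*dlnL <= 2.706, best fit restricted to S >= 0) excludes s_true."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(s_true + bkg, size=n_trials)
    s_hat = np.maximum(n - bkg, 0.0)                       # physical best-fit signal rate
    dlnl = (stats.poisson.logpmf(n, mu=s_true + bkg)
            - stats.poisson.logpmf(n, mu=s_hat + bkg))
    return np.mean(-2.0 * dlnl > 2.706)

for s_true in (0.5, 2.0, 5.0, 10.0):
    print(s_true, wilks_miss_rate(s_true, bkg=3.0))
```

Miss rates near 0.10 correspond to approximately correct 90% coverage at that point in the (signal, background) plane.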
Bayesian Consistency
Bayesian credibility levels represent the degree of belief that the limits derived from a particular observation actually bound the true value of the model parameter. Unlike frequentist intervals, the extent of a Bayesian interval is intended to directly represent the probable range of the model parameter. To make this concept tangible in the sense of calculation, we define "Bayesian consistency" as the notion that, if possible model parameter values were sampled according to the assumed prior distribution and instances were selected where an experimental measurement would result in exactly what was observed, then the derived credible intervals would correctly bound the true model in the desired fraction of these cases. And, indeed, correctly constructed Bayesian intervals are always exact in this sense for the assumed paradigm, even for the case of Poisson statistics. As such, unlike frequentist coverage, maps of Bayesian consistency are not necessary. It is also perhaps worth noting that, since Bayesian credible intervals are specific to each particular measurement and have a meaning that is independent of any ensemble, the choice of interval type (e.g. "flip-flopping," as discussed earlier) has no impact at all on Bayesian consistency.
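The definition above can be checked directly by simulation. In the sketch below, signal rates are drawn from the assumed prior (a uniform prior truncated at an arbitrary upper value, since a fully improper prior cannot be sampled), Poisson data are generated, and only the pseudo-experiments that reproduce a chosen observed count are kept; the fraction of those whose true rate falls below the 90% credible upper bound should come out near 0.90. All numerical settings are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
bkg, n_obs, s_max = 5.0, 10, 60.0

# 90% credible upper bound for n_obs with a uniform prior on [0, s_max].
s_grid = np.linspace(0.0, s_max, 60_001)
post = stats.poisson.pmf(n_obs, mu=s_grid + bkg)
cdf = np.cumsum(post) / post.sum()
s_up = np.interp(0.90, cdf, s_grid)

# Sample the prior, generate data, and keep only the cases that reproduce n_obs.
s_true = rng.uniform(0.0, s_max, size=2_000_000)
n = rng.poisson(s_true + bkg)
selected = s_true[n == n_obs]

print("90% CI upper bound:", s_up)
print("fraction of selected true rates below the bound:", np.mean(selected <= s_up))
```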
Some Unfair Comparisons
While never intended for this purpose, it is nevertheless of interest to explore the extent to which Bayesian constructions also provide statistical coverage. In particular, we will explore the case in which a uniform prior in signal rate is used for a simple, 1-D Poisson observable. Results are shown in Figure 9 for Bayesian upper bounds. As can be seen, these bounds tend to over-cover. In other words, for an ensemble of repeated experiments, upper bounds constructed using this Bayesian prescription will tend to contain the true model value with a higher frequency than, for example, Feldman-Cousins bounds. Therefore, at least for upper bounds on counting statistics (which is also relevant to estimates of experimental sensitivities), this prescription is conservative and sufficient for both definitions of coverage.

FIG. 9: Fraction of time with which 90% CI Bayesian upper bounds generated with Bayes theorem and a uniform prior for a Poisson observation, and then treated as a CL, do not contain the true signal rate, as a function of the true signal rate and the background rate. There is over-coverage, which implies that the Bayesian intervals are more conservative than Neyman-constructed intervals for this choice of prior.

For 2-sided interval constructions, the coverage is modified, as shown in Figure 10. Reasonable statistical coverage is generally achieved in this case as well, although there is some under-coverage in the region corresponding to borderline signal detections. For such cases, the 90% CI Bayesian bounds provide statistical coverage at the level of 86% or more if treated as a frequentist CL.

Correspondingly, it is also of interest to explore the extent to which frequentist constructions provide Bayesian consistency for this same case. Figure 11 shows this for the Feldman-Cousins method and indicates the probability for the true model parameter value to lie outside of the derived bounds for a given measurement, assuming that all signal strengths are equally likely. This plot has a very different interpretation from Figures 7-10: it indicates what fraction of true values of the signal rate are not contained in the Feldman-Cousins interval for a particular observed number of counts, where each model value is given equal weight in its assigned prior probability. Here, the frequentist CL value, treated as a CI, can significantly overestimate the credibility for downward fluctuations in the observed rate. The entire upper left portion of the plot represents cases where the Feldman-Cousins intervals very often do not contain the true signal rate. This again highlights the fact that frequentist intervals only yield correct statistical coverage as an ensemble average. However, for particular observations, the constructed interval can often be quite unlikely to contain the true value of the parameter. While some sections of the upper left region of this plot represent unlikely fluctuations, there is a reasonable fraction of phase space where this is not the case. In fact, as the average signal strength approaches zero, the frequency with which overestimation of credibility levels occurs approaches 50%, with significant overestimation occurring in more than 10% of cases. This gives rise to many of the apparent paradoxes previously described when frequentist intervals are misinterpreted as providing betting odds for parameter values.
Hence, while the Bayesian construction described here also typically provides reasonable frequentist coverage, frequentist bounds do not generally provide good estimates of Bayesian credibility. It may be objected that these conclusions are based on a particular choice of prior and that details will change for other choices. Nevertheless, it is true to say that frequentist intervals will generally not provide adequate Bayesian credibility estimates for other common prior choices either. At the same time, while the situation becomes more complicated with multiple dimensions, this example demonstrates that it is possible to make a pragmatic choice of a commonly used prior for Bayesian intervals that will also provide a reasonable level of frequentist coverage for the case of a simple Poisson observable. While such coverage is in no way required for Bayesian intervals, we regard this property for the simple Poisson case as convenient.
FIG. 10: Fraction of time with which two-sided 90% Bayesian credible intervals generated with Bayes theorem and a uniform prior for a Poisson observation do not contain the true signal rate, as a function of the true signal rate and the background rate. There is mild over-coverage or under-coverage, with deviations from the exact coverage of approximately the same size as those seen for Feldman-Cousins intervals.

FIG. 11: Probability that a 90% CL Feldman-Cousins interval excludes the true mean signal rate given a particular observed number of counts and expected background rate, where the probability is an average over equally weighted values of the mean signal rate.

APPENDIX B: MAXIMIZATION VERSUS INTEGRATION OVER NUISANCE PARAMETERS
A point that is sometimes a source of confusion is that the Bayesian and frequentist approaches for incorporating the effect of nuisance parameters are often different. Given a joint likelihood distribution L(θ, ξ), a 1D Bayesian probability distribution for θ, assuming a uniform prior π(ξ) for ξ, is given by:

P(\theta) \propto \int d\xi\, L(\theta,\xi)\, \pi(\xi) \propto \int d\xi\, L(\theta,\xi) \equiv L_{\mathrm{marg}}(\theta) \qquad (16)

where we may refer to L_marg(θ) as the "marginalized likelihood" obtained from integrating L(θ, ξ) with a flat prior for ξ. (Strictly speaking, L_marg(θ) is not a likelihood function but a posterior distribution, but since it plays an analogous role in Bayesian analyses to the frequentist profiled likelihood, the notation L_marg is useful.) In contrast, the frequentist approach to eliminating a nuisance parameter is often to instead "profile" the likelihood:

L_{\mathrm{prof}}(\theta) = \max_{\xi} L(\theta, \xi) \qquad (17)

It is not necessarily intuitive how these seemingly different methods are related, if at all.

In fact, the marginalization by integration given in Equation 16 is the more fundamental method, and follows directly from the laws of conditional probability. It is not often appreciated that the "profiling" method is actually just a numerical approximation to the full integration. This approximation, proposed by Laplace [22], can be derived by rewriting Equation 16:

L_{\mathrm{marg}}(\theta) = \int d\xi\, L(\theta,\xi) = \int d\xi\, \exp\!\big(\ln L(\theta,\xi)\big) \qquad (18)

For fixed θ, one then proceeds by finding the maximum of ln L as a function of ξ and doing a Taylor expansion around the value ξ̂_θ that maximizes ln L(θ, ξ):

L_{\mathrm{marg}}(\theta) \approx \int d\xi\, \exp\!\left( \ln L(\theta,\hat{\xi}_\theta) - \frac{1}{2} \left| \frac{d^{2} \ln L(\theta,\xi)}{d\xi^{2}} \right|_{\hat{\xi}_\theta} (\xi - \hat{\xi}_\theta)^{2} \right) \qquad (19)

To the extent that higher order terms can be neglected, this integral evaluates to:

L_{\mathrm{marg}}(\theta) \propto \left( \left| \frac{d^{2} \ln L(\theta,\xi)}{d\xi^{2}} \right|_{\hat{\xi}_\theta} \right)^{-1/2} \exp\!\big( \ln L(\theta,\hat{\xi}_\theta) \big) \qquad (20)

This is nothing other than L_prof times a correction factor which depends on the second derivative of the likelihood function. If the likelihood function is Gaussian, the correction factor is a constant and so the approximation is exact; if the likelihood function is only approximately Gaussian, then its second derivative still only varies slowly with θ. Expressed more simply in terms of the log likelihood, we have, up to an irrelevant constant:

-\ln L_{\mathrm{marg}}(\theta) \approx -\ln L_{\mathrm{prof}}(\theta) + \frac{1}{2} \ln\!\left( \left| \frac{d^{2} \ln L(\theta,\xi)}{d\xi^{2}} \right|_{\hat{\xi}_\theta} \right) \qquad (21)

Hence, we see that the frequentist's method of eliminating nuisance parameters by maximizing the likelihood as a function of the unwanted parameter is actually just an approximation to integrating over the parameter in the case that the joint likelihood is Gaussian. Furthermore, Equation 21 gives us a correction factor to this approximation that can improve the accuracy in the case that the likelihood is not exactly Gaussian.

To the extent that the likelihood function is well approximated by a multi-dimensional Gaussian, even a Bayesian analysis may find the multi-dimensional generalisation of Equation 21 a useful numerical expedient to replace a potentially complicated integral with a function minimization and the evaluation of the second derivatives of ln L at that minimum.

APPENDIX C: MAXIMUM SIZE OF FREQUENTIST "FLIP-FLOP" EFFECT