The look-elsewhere effect from a unified Bayesian and frequentist perspective
TThe look-elsewhere effect from aunified Bayesian and frequentistperspective
Adrian E. Bayer a and Uroˇs Seljak a,b a Berkeley Center for Cosmological Physics, Department of Physics, University of California, Berkeley,Berkeley, CA 94720, USA b Physics Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road,Berkeley, CA 94720, USAE-mail: [email protected], [email protected]
Abstract.
When searching over a large parameter space for anomalies such as events, peaks, ob-jects, or particles, there is a large probability that spurious signals with seemingly high significancewill be found. This is known as the look-elsewhere effect and is prevalent throughout cosmology,(astro)particle physics, and beyond. To avoid making false claims of detection, one must account forthis effect when assigning the statistical significance of an anomaly. This is typically accomplishedby considering the trials factor, which is generally computed numerically via potentially expensivesimulations. In this paper we develop a continuous generalization of the Bonferroni and ˇSid´ak cor-rections by applying the Laplace approximation to evaluate the Bayes factor, and in turn relatingthe trials factor to the prior-to-posterior volume ratio. We use this to define a test statistic whosefrequentist properties have a simple interpretation in terms of the global p -value, or statistical sig-nificance. We apply this method to various physics-based examples and show it to work well for thefull range of p -values, i.e. in both the asymptotic and non-asymptotic regimes. We also show thatthis method naturally accounts for other model complexities such as additional degrees of freedom,generalizing Wilks’ theorem. This provides a fast way to quantify statistical significance in light ofthe look-elsewhere effect, without resorting to expensive simulations. a r X i v : . [ phy s i c s . d a t a - a n ] J u l ontents q S A common problem in statistical analysis is to find evidence for a physical signal in a large, continuousparameter space, where the true position of the signal is not known a priori. By searching over a wideparameter space one increases the probability of finding large signals caused by random statisticalfluctuations, as opposed to a physical source. This is known as the look-elsewhere effect – or sometimesthe “problem of multiple comparisons” in discrete cases – and must be accounted for when performing ahypothesis test [1, 2]. Ignoring this effect would lead to an overestimation of the statistical significance,sometimes by a considerable amount, and thus incorrectly concluding the detection of a physical signal.The look-elsewhere effect is prominent throughout (astro)particle physics and cosmology. Oneof the most commonly known occurrences is in collider searches for new particles, for example it wasa key consideration in the Higgs boson discovery [3, 4]. In this example, one searches a large rangeof masses for a resonance, without a priori knowledge of the true mass of the particle. Similarly, inastrophysical searches for particles one seeks resonances in the energy flux of various astrophysicalspectra, where the true energy signature of particle is unknown. Examples include: constraining thedark matter self-annihilation cross-section via gamma ray emission from galaxy clusters [5], searchingfor WIMPs via charged cosmic rays [6], searching for non-baryonic dark matter via X-ray emissionfrom the Milky Way [7], and explaining the source of high energy astrophysical neutrinos [8, 9]. Interms of cosmology, the look-elsewhere effect occurs in searches for gravitational wave signals fromblack hole or neutron star mergers [10–12]. Here one searches large time series for a signal, where thetime and shape of the event are unknown. A further cosmological example is searching for signaturesof inflation in the primordial power spectrum [13–15].The look-elsewhere effect is also prevalent in other areas of physics and beyond, for example: inastronomy it occurs when detecting exoplanets via stellar photometry, where the period and phaseof the planets’ transits are unknown (e.g. [16]); in biology it occurs when considering large DNAsequences to study genetic association [17, 18]; in medicine it occurs when testing the effectiveness ofdrugs in clinical trials [19]; and in theology it occurs when attempting to find hidden prophecies inreligious texts [20]. Therefore, given the apparent ubiquity of the look-elsewhere effect, there is muchmotivation for a fast method to account for it. – 1 –any simple general methods exist to mitigate for the look-elsewhere effect in the case of discreteproblems, for example if one is testing multiple drugs for their effectiveness at treating a disease [19].The number of drugs tested, more generally known as the trials factor, quantifies the extent of thelook-elsewhere effect. The larger the trials factor, i.e. the more drugs tested, the larger the chance ofa false positive arising due to a statistical fluctuation. Methods such as the Bonferroni correction [21]and ˇSid´ak correction [22] use the trials factor to correct the conclusions of a hypothesis test in lightof this effect. There is however no unique definition of the trials factor when searching a continuousparameter space for a signal. Thus, a common, brute-force approach to account for the look-elsewhereeffect for continuous problems is to perform many simulations of an experiment assuming there is nosignal. One can then estimate the p -value of a chosen test statistic, usually related to the maximumlikelihood, and in turn define a relation between the significance of a signal and the test statistic.This means that to conclude a detection at the 5-sigma level, corresponding to a p -value of order10 − , one would need to simulate more than ∼ realizations of the experimental data, which iscomputationally expensive. A faster method, developed in the context of high energy physics, is toapproximate the asymptotic form of the p -value by counting upcrossings, requiring fewer simulations[23]. In both of these cases new simulations are required each time a new model is considered, andthe simulations may not be an accurate representation of the data. In this paper we seek a generalapproach that can be directly applied to experimental data, without the need for simulations.Our approach applies Bayesian logic to tackle the look-elsewhere effect. The Bayesian evidenceis equal to the prior-weighted average of the likelihood over the parameter space, which can beconsiderably lower than the maximum likelihood if the prior is broad. This integration over theprior accounts for the look-elsewhere effect by penalizing large prior volumes. When consideringlarge prior volumes, the likelihood is typically multimodal, with most of the peaks correspondingto noise fluctuations rather than physical sources. In order to estimate the location of a physicalsignal, and its associated statistical significance, one typically considers a point estimator, such asthe maximum a posteriori (MAP) estimator which maximizes the posterior density. By applying theLaplace approximation, we introduce a Bayesian generalization of the MAP estimator, referred to asthe maximum posterior mass (MPM) estimator, which corrects the MAP estimator by the prior-to-posterior volume ratio. Then, by drawing an analogy between Bayesian and frequentist methodology,we present a hybrid of the MAP and MPM estimators, called the maximum posterior significance(MPS) estimator, which determines the most significant peak in light of the look-elsewhere effect.The frequentist properties of the MPS estimator are shown to be independent of the look-elsewhereeffect, providing a universal way to quantify the p -value, or statistical significance, without the needfor expensive simulations.The outline of this paper is as follows. In section 2 we review Bayesian posterior inference andhypothesis testing for a multimodal posterior, by discussing MAP estimation and then introducingMPM estimation. We then draw an analogy between Bayesian and frequentist philosophy in section3 to motivate MPS estimation as the appropriate technique to tackle the look-elsewhere effect. Thefollowing three sections then apply this method to various examples: section 4 considers a resonancesearch, which can be thought of as a toy example of a collider or astrophysical particle search; section5 considers a white noise time series, which can be thought of as a toy example of a gravitationalwave search; and section 6 considers a search for non-Gaussian models of cosmological inflation usingPlanck data [24]. Note that section 4 is the main example, as it illustrates the key advantages ofMPS, with the other examples complementary. Finally, we summarize and conclude in section 7. Two of the main tasks of Bayesian statistical analysis are posterior inference and hypothesis testing.Consider a model with parameters z = { z j } Mj =1 , and data x = { x i } N d i =1 that depends on z . Theinference of z is given by its posterior p ( z | x ) = p ( x , z ) p ( x ) = p ( x | z ) p ( z ) p ( x ) , (2.1)– 2 –here p ( x | z ) is the likelihood of the data, p ( z ) is the prior of z , and p ( x ) = (cid:82) d z p ( x | z ) p ( z ) isthe Bayesian evidence, also known as the normalization, marginal likelihood, or partition function.Typically, one can evaluate the joint probability p ( x , z ), but not the evidence, which makes theposterior inference analytically intractable. This is usually handled using simple approximations orMonte Carlo Markov Chain methods [25].A related problem is that of a hypothesis testing. In this case there are two different hypotheses, H and H , each with their own model parameters, z and z . In Bayesian methodology, hypothesistesting is performed using the Bayesian evidence ratio of the two hypotheses, which gives the Bayesfactor B ≡ p ( x | H ) p ( x | H ) , (2.2)where the Bayesian evidence for hypothesis H is given by p ( x | H ) = (cid:90) d z p ( x | z , H ) p ( z | H ) . (2.3)The Bayesian evidence and Bayes factor are also analytically intractable and harder to evaluate thanposteriors, especially for high dimensional z , although recent numerical methods such as GaussianizedBridge Sampling [26] have made the problem easier. For the sake of exposition we will not considersuch methods in this work, but instead use analytical approximations that give the Bayes factor anintuitive meaning. It is worth keeping in mind however that the full Bayes factor calculation canalways be performed numerically, without any approximations. Given the analytical intractability of posterior inference and hypothesis testing, one often chooses anestimator to extract useful information from the posterior. A common estimator is the maximum aposteriori (MAP) point estimator, which corresponds to the global maximum of the posterior. If theprior is flat, as it will always be in this paper, this equals the maximum likelihood estimator (MLE),which maximizes the likelihood. Mathematically, MAP is defined viaMAP : arg max z p ( z | x ) . (2.4)For the purpose of comparing data to a null hypothesis, a useful quantity to define is q L ( z ) ≡ p ( x | z ) p ( x | z ) , (2.5)where z represents the values of the parameters under the null hypothesis, and a subscript of L isused because the argument of the logarithm is the Likelihood ratio. To assess the significance of aresult one considers the maximum value of q L , which in the case of a flat prior is equal to q L evaluatedat the MAP: ˆ q L = q L ( z MAP ). For a Gaussian likelihood, this is equal to the chi-squared ( χ ), and inthe absence of the look-elsewhere effect √ ˆ q L typically gives the statistical significance. However, wewill see that this test statistic greatly suffers from the look-elsewhere effect. MAP is often a good point estimator in low dimensions if there is a single mode in the posterior.However, if the posterior has several modes, a more reasonable point estimator associates with thehighest posterior mass. We refer to this as the maximum posterior mass (MPM) estimator.For the purposes of this work, we will consider the example of a multimodal posterior consistingof a sum of multivariate Gaussian distributions; this has been shown to be a good approximation inmany practical cases [27]. We thus consider a posterior of the following form, p ( z | x ) = (cid:88) l w l N ( z ; µ l , Σ l ) , (2.6)– 3 – p ( z | x ) MAP MPM
Figure 1 . Plot of a bimodal Gaussian posterior for a 1d example in which 90% of the posterior mass isassigned to the right peak and 10% to the left. MPM yields the mode that maximizes the posterior mass andis close to the true mean, whereas MAP maximizes the posterior density and can be distant from the meanand represent only a small fraction of posterior mass. where N ( z ; µ , Σ ) is a multivariate normal distribution with mean µ and covariance matrix Σ , and thedata dependence has been dropped for neatness. Note that working with a posterior of this form isequivalent to applying the Laplace approximation to a general multimodal posterior in the upcomingderivations. In this model, the mass of mode l is proportional to the weight w l , which is normalizedsuch that (cid:80) l w l = 1.Assuming that only one component contributes at each peak, the weight of mode l is given byevaluating the posterior at the location of the mode, z = µ l ,ln w l = ln p ( µ l | x ) − ln N ( µ l ; µ l , Σ l ) = ln p ( µ l | x ) + 12 (cid:104) ln det Σ l + M ln(2 π ) (cid:105) . (2.7)To obtain a quantity that can be readily computed, we multiply each weight by the normalization p ( x ) to give the mass m l , defined byln m l ≡ ln w l + ln p ( x ) = ln p ( x | µ l ) + ln p ( µ l ) + 12 (cid:104) ln det Σ l + M ln(2 π ) (cid:105) . (2.8)Thus, the mass of each mode is equal to the log likelihood multiplied by the product of the priordensity and the posterior volume at the peak, where the posterior volume is defined as V posterior ≡ (2 π ) M/ √ det Σ . (2.9)The MPM estimator corresponds to the mode with the highest mass, hence to determine theMPM mode one would compute the ln m l by first finding the positions of all local posterior maxima µ l ,and then computing Σ l using the inverse of the Hessian at each peak. Qualitatively, MPM correspondsto maximizing the posterior density multiplied by the posterior volume ∼ √ det Σ , whereas MAP onlymaximizes the former. It is apparent that if there are multiple modes in the posterior, the one thathas the largest posterior mass does not necessarily have the largest posterior density, as shown infigure 1. In some situations the MPM mode will dominate the posterior mass such that the MPMmode alone gives a useful way to summarize the posterior. Consider a model with parameters z , z , ..., z M , with z corresponding to the amplitude of a feature,and z > corresponding to the properties of the feature. For example, z might correspond to theamplitude of a signal detected in a time series at time z . A typical analysis would scan over the z > ,finding the best fit value for the amplitude z at each point, giving rise to a multimodal posterior.In this work we wish to determine whether or not a dataset contains a true anomaly. In thelanguage of hypothesis testing, we wish to compare the hypothesis that there is an anomaly H ,– 4 –orresponding to z >
0, to the null hypothesis that there is no anomaly H , corresponding to z = 0.We assume the common case that the parameters of H are a subset of the parameters of H , with H reducing to H when z = 0. There may also be parameters other than z that are common toboth models, but these are of secondary importance when considering the look-elsewhere effect andwe drop these from the notation.Using equation 2.8 with (cid:80) w l = 1 implies that the Bayesian evidence for hypothesis H is givenby p ( x | H ) = (cid:90) d z p ( x | z , H ) p ( z | H ) = (cid:88) l m l , (2.10)where the m l correspond to the masses under hypothesis H . Hence, each mode contributes its massto the evidence. It follows that the mass of mode l corresponds to the Laplace approximation of theevidence integral in equation 2.10, integrated over the region of the mode. Because the null hypothesisdoes not depend on z > , the evidence for the null hypothesis is given by the likelihood evaluated at z = 0, that is p ( x | H ) = p ( x | z = 0) ≡ p ( x ). Together with equation 2.10 this gives the Bayesfactor B ≡ p ( x | H ) p ( x | H ) = 1 p ( x ) (cid:88) l m l ≡ (cid:88) l b l , (2.11)where b l is defined as the contribution of mode l to the Bayes factor. Using equation 2.8 gives b l = p ( x | µ l ) p ( x ) p ( µ l )(2 π ) M/ (cid:112) det Σ l = p ( x | µ l ) p ( x ) V posterior ( µ l ) V prior ( µ l ) , (2.12)where we have introduced the effective volume of the prior at µ l as, V − ( µ l ) ≡ p ( µ l ) , (2.13)appropriate for the case of a narrow posterior relative to the prior. In the remainder of this paper wewill drop the µ l dependence of the prior volume, as appropriate for a flat prior on z .Intuitively, one can think of each b l as the Bayes factor one would get if mode l were the only modein the posterior. If the maximum b l is sufficiently large, it alone can provide a useful approximationto the Bayes factor, meaning the MPM mode dominates the Bayes factor. The first ratio on the righthand side of equation 2.12 corresponds to the likelihood ratio of the signal hypothesis to the nullhypothesis, evaluated at the location of the peak, z = µ l . This is greater than or equal to 1 sinceadding parameters to the null hypothesis can only improve the fit. The second ratio gives the ratioof the posterior volume to the prior volume at the peak, which is always less than 1. This acts as apenalty to the likelihood ratio, often referred to as the Occam’s razor penalty [28], or model complexitypenalty, and compensates for the look-elsewhere effect in the case of a multimodal posterior. Thehigher the prior-to-posterior volume ratio, the higher the chance that peaks with a high likelihoodwill occur because of statistical fluctuations, thus the larger the penalty required to compensate.Just as q L is the estimator associated with MAP, we can define q b ≡ b as the estimatorassociated with MPM, such that q b = q L − V prior V posterior . (2.14)The MPM mode corresponds to the mode with maximum q b . This illustrates how the MAP estimatorignores the look-elsewhere penalty by effectively considering the posterior and prior to be overlappingdelta functions, which presumes a priori knowledge of the parameters and gives a prior-to-posteriorvolume ratio of unity.An interesting question to consider is whether one can relate q b to the look-elsewhere correctedstatistical significance in a frequentist sense. In the absence of the look-elsewhere effect, the signifi-cance is given by √ q L , but simply taking √ q b as the look-elsewhere corrected significance would notbe correct. In the next section we turn to a frequentist description of the look-elsewhere effect tomotivate a new estimator which applies a small modification to q b and has a simple interpretation interms of the significance, or p -value. – 5 –efore ending this section we discuss the choice of priors appropriate for a look-elsewhere analysis.If one has no prior knowledge regarding the location of an anomaly, then a uniform prior for the z > parameters is appropriate. If the prior is wide and posterior narrow this induces a large look-elsewhereeffect. This choice of prior is not controversial. On the other hand, the choice of prior for the amplitudeparameter z is less clear. If one has no prior knowledge of the signal amplitude, then one should beopen to a signal of any size, however one does not want the amplitude prior to induce a look-elsewherepenalty. In Bayesian hypothesis testing the amplitude parameter is treated analogously to the otherparameters, thus if one uses too broad an amplitude prior it will induce an unwanted look-elsewherepenalty, whereas if one chooses too narrow an amplitude prior one risks discounting a large signal.Based on this we rewrite b in the following form, explicitly separating the marginalization over z > and z , b = e q L / V > , posterior V > , prior V , posterior V , prior . (2.15)The posterior volume terms are given by the covariance matrix, as in equation 2.9, and V > , prior isgiven by the choice of prior on z > . It thus remains to justify a choice of V , prior , which we will doby turning to a frequentist description of the look-elsewhere effect in the next section. Standard statistics literature states that Bayesian and frequentist hypothesis testing follow differentmethodologies and may give very different results. One famous illustration of this is the Jeffreys-Lindley “paradox” [29], however, there is much debate as to whether this is indeed a paradox andhow relevant it is for scientific discourse (see [30] for a review in the context of high energy physics).While Bayesian statistics uses the Bayes factor for hypothesis testing, frequentist statistics uses themaximum likelihood ratio, or ˆ q L . One of the most important aspects of frequentist methodology isthe computation of the false positive rate using the p -value, which quantifies how often a test statistic,for example ˆ q L , will take a specific value or larger under the assumptions of the null hypothesis. Thishas an intuitive interpretation as it directly relates to the false positive rate of the test statistic.On the other hand, Bayesian methodology rejects the p -value. The basis for this rejection is thelikelihood principle, which states that any inference about the parameters z from the data x can onlybe made via the likelihood p ( x | z ). When the likelihood principle is applied to testing a hypothesiswith parameters z one must use the marginal likelihood by integrating out these parameters – as inthe Bayesian evidence of equation 2.3 – thus Bayesian methodology explicitly satisfies the likelihoodprinciple. It is commonly argued that p -values violate the likelihood principle, because they rely onthe frequentist properties of a distribution that go beyond the likelihood principle. However, theBayes factor provides a less reliable tool for model comparison, as it is often interpreted in terms ofarbitrary, model-independent scales [31], unlike the p -value which directly relates to the false positiverate. We seek to elucidate how the answers of the two schools of statistics relate to one and other whenit comes to the hypothesis testing. Both schools of statistics should give a similar, or at least relatedanswer, when the question is phrased similarly. For uncertainty quantification it is often argued thatthe two schools do not answer the same question, since the Bayesian school treats data as fixed andvaries the models, while the frequentist school varies the data at a fixed model. However, whenit comes to hypothesis testing the distinction is less prominent: for example, when comparing twodiscrete hypotheses without any marginalizations, the answer in both cases gives the likelihood ratioas the optimal statistic (assuming equal prior for the two hypotheses). For continuous hypotheses itis often argued this is not possible. Here we will show that the two answers, the p -value and the Bayesfactor, can be related with a specific choice of prior. It is important to emphasize that we are notclaiming to equate the Bayesian and frequentist methodologies, but rather motivate a connection.In this work we define the p -value as the probability under the null hypothesis, H , of a randomvariable, Q , to be observed to have a value equal to or more extreme than the value observed, q . Wethus use the notation P ( Q ≥ q ) for the p -value. To compute the p -value of a test statistic, one mustconsider how the test statistic is distributed under the null hypothesis. For the example of ˆ q L this– 6 –istribution is not universal: scanning over continuous variables, as in the look-elsewhere effect, willmodify this distribution in a model dependent manner. Moreover, increasing the model complexity inother ways, for example by including extra degrees of freedom, will further modify the distribution. Toaccount for extra degrees of freedom, Wilks’ theorem [32] provides the asymptotic distribution of ˆ q L for a hypothesis test where H has ν more degrees of freedom than H . However, Wilks’ theorem relieson technical conditions, such as the observed value not being at the edge of the interval, and does notconsider the look-elsewhere effect. Generalization of Wilks’ theorem for the look-elsewhere effect havebeen considered in [33, 34] and have been translated into a practical procedure in [23]. As a result, afrequentist approach consists of a series of considerations to determine the change in the distributionof ˆ q L due to different sources of model complexity. This is unlike the Bayesian methodology whereall forms of model complexity are accounted for in the same way, as they are encoded into the Bayesfactor. By connecting the two methodologies, we will present a test statistic whose distribution isuniversal, regardless of the model complexity and look-elsewhere effect. We start by considering the typical case of one degree of freedom, corresponding to a single signal withamplitude z and features described by z > . We denote q L maximized over the amplitude parameteronly as ˇ q L ( z > ) ≡ max z q L ( z ), not to be confused with ˆ q L ≡ max z q L ( z ) which is q L maximizedover all parameters. For a t -tailed test (where t is equal 1 or 2), Wilks’ theorem gives the asymptotic p -value of ˇ q L , at any position z > , as P ( ˇ Q L ≥ ˇ q L ) = t F (ˇ q L ) ˇ q L →∞ −−−−→ t √ π ˇ q L e − ˇ q L / , (3.1)where ˜ F ν is the complementary cumulative distribution function (CCDF) of a chi-squared randomvariable with ν degrees of freedom. This maximization over z at a fixed choice of z > correspondsto the p -value in the absence of the look-elsewhere effect, referred to as the local p -value. Furthermaximizing over z > introduces the look-elsewhere effect, which can be parameterized by multiplyingby the trials factor N such that P ( ˆ Q L ≥ ˆ q L ) = N t √ π ˆ q L e − ˆ q L / . (3.2)This is referred to as the global p -value. It is this form that leads to the Bonferroni correction [21]which divides the type I error by N to account for the look-elsewhere effect. For discrete problems thetrials factor equals the number of trials performed. However, in the continuous case it is ill-defined,but it quantifies how the probability of finding a spurious peak increases as one looks elsewhere inthe space spanned by z > . Accounting for the look-elsewhere effect thus requires an expression forthe trials factor.It follows from equation 3.2 that one can define a test statistic, q S = q L − N + ln 2 πq L − t (3.3)such that the global p -value tends to P ( ˆ Q S ≥ ˆ q S ) → e − ˆ q S / , (3.4)as either N → ∞ or ˆ q S → ∞ , so this also applies for N = 1. See Appendix A for a derivation. Unlikeˆ q L , ˆ q S has a distribution that is independent of N – the look-elsewhere effect has been absorbedinto the test statistic. Intuitively one can think of the 2 ln N term as a penalty to q L to correct forthe look-elsewhere effect, while the ln 2 πq L term removes q L dependent bias, ensuring the p -valuedepends on ˆ q S alone in the asymptotic limit. Thus to account for the look-elsewhere effect one needonly compute ˆ q S and use this equation to compute the p -value. Because the p -value is a monotonicallydecreasing function of ˆ q S , one can think of selecting the peak with maximum q S as selecting the peakwith minimum p -value or maximum statistical significance. We refer to the mode with maximum q S – 7 –s the MPS mode, deferring an explanation for this nomenclature until the end of the subsection. Thesimilarity of q S to q b from equation 2.14 suggests a connection between the frequentist and Bayesianpictures, and we now invoke this connection to find an expression for N and in turn generalize theBonferroni correction to continuous parameters.Heuristically, the Bayes factor describes the probability of the alternative hypothesis relative tothe null, determined by the likelihood (as measured by ˆ q L ), while the p -value averages its inverse overall values larger than ˆ q L and will be smaller than the likelihood. We expect that for higher ˆ q L theeffect is larger because we are further into the tail of the distribution. There is no unique relationbetween the two, but one simple option is that the p -value scales as B − / ˆ q L ≈ ˆ b − / ˆ q L , where hatsnow indicate quantities associated with the MPS mode. Because we have the freedom to choose theprior on z , we can define the relation between the Bayes factor and p -value asˆ b − ˆ q L ≡ P ( ˆ Q L ≥ ˆ q L ) . (3.5)Comparing equation 2.15 with equation 3.2 then gives V > , prior ˆ V > , posterior V , prior ˆ V , posterior e − ˆ q L / ˆ q L = N t √ π ˆ q L e − ˆ q L / . (3.6)By requiring that this relation holds in the absence of the look-elsewhere effect, the trials factor canbe identified as N = V > , prior ˆ V > , posterior , (3.7)and the amplitude prior volume is given by V , prior = t (cid:112) ˆ q L ˆ V , posterior √ π = t (cid:112) ˆ q L ˆ σ ≈ t ˆ µ . (3.8)In the final steps we used ˆ V , posterior = √ π ˆ σ , where ˆ σ is the error on the amplitude parameter,ˆ µ , and that the signal-to-noise ratio obeys √ ˆ q L ≈ ˆ µ / ˆ σ . Since the look-elsewhere effect leads tolarge ˆ q L , this prior volume will be larger than the posterior volume. This choice of amplitude priorvolume ensures that there is no trials factor associated with the amplitude, as intuition would dictate.Substituting equations 3.7 and 3.8 into equation 3.3 yields q S = q b + 2 ln q L . (3.9)Hence, we have effectively applied a modification to the MPM estimator to give a combination ofthe MPM and MAP estimators, so that the asymptotic p -value is neatly given by e − ˆ q S / . In thecontext of the look-elsewhere effect, the mode with maximum q b will typically also be the mode withmaximum q L , and thus maximum q S . However, this equivalence of MAP and MPM may not alwaysbe the case, as shown in figure 1.A pure Bayesian might argue that equation 3.8 is not a valid prior, since it depends on thea posteriori amplitude parameter ˆ µ ; however, this prior does have an intuitive justification. If ascientist is willing to consider a signal of any amplitude, the prior cannot be zero at ˆ µ , as it wouldnot make sense to discard the signal. On the other hand, making the prior significantly broaderthan ˆ µ implies the scientist has some additional information on the nature of the amplitude. Whenthere is no justification for broadening the prior, the narrowest possible prior still consistent withthe measured value can be more reasonable than arbitrarily fixing the size of the prior a priori. Thischoice of amplitude prior is simply designed to allow for a signal with any amplitude, without inducingan unwanted look-elsewhere penalty.Note that the explicit dependence on ˆ q L and the marginal likelihood, via ˆ b , in equation 3.5is what makes the p -value inconsistent with the likelihood principle. One could instead considerequating ˆ b − directly with the p -value, making it consistent with the likelihood principle. This would– 8 – q − − − − ˜ F ( q )1 − exp h − ( π q ) / e − q/ i Figure 2 . Equation 3.11 is a good approximation to ˜ F ( q ) over the entire range of q . This suggests that MPSis still accurate in the absence of the look-elsewhere effect for a two-tailed test, even non-asymptotically. require an amplitude prior of V , prior = t ˆ σ / ˆ µ , which we deem unreasonable as it is smaller thanthe posterior volume ˆ V , posterior . We emphasize that the equality of ˆ b − / ˆ q L to the p -value is notstrictly required for our approach to the look-elsewhere effect, but provides intuition for the Bayesian-frequentist connection. At its core, our method considers the test statistic ˆ q S , from equation 3.3, andreplaces the trials factor N with the prior-to-posterior volume of the non-amplitude parameters z > .Intuitively one can think of the number of trials as the number posterior volumes that fit within theprior volume.Because the asymptotic p -value scales linearly with the prior volume, the non-asymptotic formof the p -value can be derived by dividing the prior volume into K (cid:29) p -value for each. Assuming independence between these regions, the product of the p -values for eachregion can be used to obtain p -value of the full volume. Further assuming that the asymptotic regimestill applies, this gives P ( ˆ Q S ≥ ˆ q S ) = lim K →∞ (cid:34) − (cid:18) − e − ˆ q S / K (cid:19) K (cid:35) = 1 − exp (cid:16) − e − ˆ q S / (cid:17) . (3.10)Just as equation 3.4 is a generalization of the Bonferroni correction, equation 3.10 is a generalizationof the ˇSid´ak correction [22] to continuous variables. This expression generalizes the p -value into thenon-asymptotic regime.For N (cid:29) p -value willapproach 1 for sufficiently low ˆ q L , which equation 3.10 predicts to be for ˆ q S <
0. In the absence ofthe look-elsewhere effect ( N = 1) a one-tailed test should approach a p -value of 0.5, while equation3.10 approaches 1 as ˆ q S → −∞ . Thus, the non-asymptotic agreement breaks down for t = 1 and N = 1. On the other hand, if t = 2 and N = 1, substituting q S = q L + ln 2 πq L − P ( ˆ Q L ≥ ˆ q L ) N =1 ,t =2 = 1 − exp (cid:34) − (cid:18) π q L (cid:19) / e − ˆ q L / (cid:35) . (3.11)The term in the square brackets can be identified as the asymptotic expansion of ˜ F (ˆ q L ). We showthe non-asymptotic agreement of this equation with ˜ F (ˆ q L ), the true two-tailed p -value for N = 1,in figure 2. This illustrates the ability of the generalized ˇSid´ak correction to produce correct non-asymptotic results, even in the absence of the look-elsewhere effect. Hence, although we have appliedasymptotic approximations throughout the above calculations, we have obtained a result that is valideven in the non-asymptotic limit. Inverting equation 3.11 gives the significance, or number of sigma, S , as S ≈ ˆ q S − ln 2 π ˆ q S + 2 ln t, (3.12)with corrections of order O (ˆ q − S ). In the limit of ˆ q S → ∞ , the significance can be interpreted as √ ˆ q S ,in an analogous way to √ ˆ q L in the absence of the look-elsewhere effect. This motivates the namemaximum posterior significance (MPS) as q S depends on the posterior via the trails factor N , and ismonotonically related to the significance S .In summary, by considering a frequentist description of the look-elsewhere effect we introducedˆ q S as a natural test statistic to use, such that the asymptotic p -value is given by e − ˆ q S / . We derived ageneral expression for the p -value which also applies in the non-asymptotic regime, and when there’sno look-elsewhere effect. Adopting the prior of equation 3.8, we showed that one can write the p -value in terms of Bayes factor as ˆ b − / ˆ q L . This intrinsically accounts for the look-elsewhere effect byidentifying the trials factor as the prior-to-posterior volume ratio of z > at the MPS mode. Whileone can compute the Bayes factor using a variety of methods, we will use the Laplace approximationto evaluate the posterior volume of each mode, as in section 2. To outline the step-by-step approach: Maximum Posterior Significance (MPS) estimation:
1. Scan over the space of non-amplitude parameters, z > , locating peaks in the posteriorwith any amplitude, z . Often only the highest few peaks are needed.2. Compute q L and the posterior volume, using equation 2.9, for each peak.3. Compute q b for each peak using equation 2.14 with the amplitude prior of equation 3.8.4. Compute q S = q b + 2 ln q L for each peak.5. Find the peak with maximum q S .6. Compute the (global) p -value using equation 3.10 and significance using 3.12. For models with multiple degrees of freedom, the frequentist approach is to apply Wilks’ theorem[32]. This is valid in the asymptotic limit, where, for a two-tailed test, the local p -value is given by P ν ( ˇ Q L ≥ ˇ q L ) = ˜ F ν (ˇ q L ) ˇ q L →∞ −−−−→ ν/ (cid:18) ˇ q L (cid:19) ν/ − e − ˇ q L / , (3.13)for a model with ν degrees of freedom. Note that the limit assumes q (cid:29) ν , but for ν = 2 it isexact for any q . Wilks’ theorem can address the model complexity problem of having multiple ( ν )continuous amplitude parameters. A specific example from particle physics is a decay process with ν decay channels, each with amplitude A i (0 ≤ i ≤ ν ). In such a case max { A i } q L ( { A i } , ... ) ∼ ˜ F ν . Wilks’theorem is not sufficiently general: it fails if the parameters are at the edge of their distribution, andit does not naturally handle the model complexity of the look-elsewhere effect, where one scans overa wide range of values for one or more parameters. Upon introduction of the look-elsewhere effecta frequentist would typically consider single trials distributed as ∼ ˜ F ν , and then use a ν -dependenttrials factor [23]. Thus in a frequentist approach extra degrees of freedom and the look-elsewhereeffect are treated separately. On the other hand, a Bayesian approach accounts for both in the sameway. – 10 –o apply the Bayesian methodology, we first reparameterize the model so that there is only asingle amplitude parameter by introducing branching ratios α i , such that each amplitude parameter is A i = α i z , where z is the total amplitude parameter and (cid:80) νi =1 α i = 1. To remove the constraint weadopt rotation angles: for example, for ν = 2 we can work with a phase angle φ , such that α = cos φ and α = sin φ . Thus, instead of working with A and A and considering max A ,A q L ( A , A , ... ) ∼ ˜ F , we consider max z q L ( z , φ, ... ) ∼ ˜ F with z > = ( φ, ... ). We can then directly apply the MPSprescription for ν = 1, as in the previous subsection, by additionally marginalizing over φ to accountfor the model complexity with an additional prior-to-posterior volume penalty.To be agnostic, one would choose a prior volume for φ of V φ, prior = π (in practice a morecomplex prior may be appropriate, but it will typically be O (1)). Furthermore, the average error on φ is typically equal to the relative error on the amplitude, thus σ φ ≈ σ /µ ≈ q − / L . This gives amodel complexity correction of V φ, prior ˆ V φ, posterior = π √ π ˆ σ φ = √ π (cid:18) ˆ q L (cid:19) / = ˜ F (ˆ q L )˜ F (ˆ q L ) . (3.14)This shows that increasing the model complexity with an extra degree of freedom is accounted for inthe Bayesian framework by marginalizing over φ . Thus, the Bayesian answer to an increase in modelcomplexity, whether it be due to including extra degrees of freedom, or looking elsewhere, is identical:marginalization over the non-amplitude parameters z > . The ν dependence of the local p -value inequation 3.13 can be interpreted as a Bayesian model complexity penalty: a fixed p -value correspondsto a larger ˆ q L as ν increases. Thus, MPS intrinsically generalizes Wilks’ theorem by relating the trialsfactor to the prior-to-posterior volume. To test the theory of section 3 we first consider a resonance search example. These appear in manydifferent areas of physics, including astroparticle and high energy physics. We consider a searchfor a new particle whose mass and cross-section are unknown. The data x could correspond tomeasurements of the invariant mass in the case of collider searches, or the energy flux in astroparticlesearches. The probability density for a single measurement, x i , is given by p ( x i | f, x ∗ , σ ∗ ) = f p s ( x i | x ∗ , σ ∗ ) + (1 − f ) p b ( x i ) , (4.1)where p s and p b are the normalized signal and background distributions respectively, and f is thefraction of events belonging to the signal. We assume that the form of the signal and backgroundare known; we take the signal to be a normal distribution p s ( x i | x ∗ , σ ∗ ) = N ( x i | x ∗ , σ ∗ ), and thebackground to be a power law. Thus the resonance has position x ∗ and width σ ∗ . Given data x = { x i } N d i =1 , the likelihood is given by the product of the individual probability densities over thedata. Using equation 4.1 this gives the likelihood as p ( x | f, x ∗ , σ ∗ ) = N d (cid:89) i =1 (cid:2) f p s ( x i | x ∗ , σ ∗ ) + (1 − f ) p b ( x i ) (cid:3) . (4.2)Note that the Bayesian evidence under the null hypothesis is independent of the parameters, namely p ( x ) ≡ p ( x | f = 0) = N d (cid:89) i =1 p b ( x i ) . (4.3)While the likelihood depends on the number of data N d , quantities such as the p -value will haveconverged provided N d is sufficiently large to resolve the resonance. Throughout this section wefix N d = 10 V x ∗ , prior /σ ∗ to ensure sufficient convergence. We note that more complex models mightconsider drawing N d from a Poisson distribution, however this is unnecessary for our proof of concept.– 11 –
100 200 300 400 500 600 700 800 900 1000 x ∗ ˇ q L − − − − P ( ˇ Q L ≥ ˇ q L ) Figure 3 . The local chi-squared (left axis) and local p -value (right axis) for an example data realization withtrue amplitude f = 5 × − , position x ∗ = 500, and width σ ∗ = 0 .
5. While there is a peak with ˇ q L ≈
10 atthe correct position, the look-elsewhere effect leads to other, sometimes larger, peaks at random positions.
We first consider a uniform prior on x ∗ , with range (0 , ), i.e. a prior volume of V x ∗ , prior = 10 .We do not fit for σ ∗ and fix it to σ ∗ = 0 . x ∗ dimension, thus to find peaks we split theparameter space along the x ∗ dimension into narrow bins of size ∆ x ∗ and compute the maximumlikelihood of equation 4.2 within each bin. Ensuring ∆ x ∗ is sufficiently small, we determine thelocation of all peaks in the posterior, µ l , by comparing adjacent bins. The Hessian at each peak isthen computed using finite differencing, and inverted to give Σ l . Note, in this example we have ananalytical form for the likelihood, enabling verification of the numerical computation with analyticalresults. The value of q b at each peak is then computed using equation 2.14, in turn giving ˆ q S .Figure 3 shows the local chi-squared and local p -value as a function of x ∗ for an example datarealization. We use true parameters f = 5 × − and x ∗ = 500. Recall from equation 3.1 that thelocal chi-squared and p -value correspond to the values obtained by maximizing over f at fixed x ∗ ,i.e. they correspond to the values obtained without having corrected for the look-elsewhere effect. Thelocal chi-squared ˇ q L can also be thought of as the projection of q L onto the x ∗ axis. It can be seen thatalthough there is a peak with q L ≈
10 at the correct position, there are also multiple spurious peaksthroughout the parameter space, with ˆ q L ≈
14 in this example. This illustrates the look-elsewhereeffect: peaks with a local p -value of ∼ − are produced by noise, meaning a signal with such a local p -value should not be considered as significant as its local p -value naively suggests.We now consider 10 different data realizations without a signal ( f = 0) to study the distributionsof ˆ q L and ˆ q S under the null hypothesis. The plots in figure 4 show the global p -value in terms of ˆ q L and ˆ q S for a variety of scenarios. One can think of the vertical axes as corresponding to the falsepositive rate (FPR) of a hypothesis test using threshold q .We first compare three different prior volumes on x ∗ , V x ∗ , prior = 10 , , , to show theeffectiveness of our method for large and small N . The top left plot of figure 4 shows that the p -valueof ˆ q L has a considerable prior volume dependence. This is the look-elsewhere effect: a larger priorvolume leads to a larger trials factor and thus an increased probability of finding a higher maximumlikelihood. On the other hand we see that ˆ q S shows no prior dependence and is in good agreementwith equation 3.10, even in the non-asymptotic regime.We also investigate the variation of the p -value with the value of the width of the signal σ ∗ . Thisis shown in the top right plot of figure 4 where we consider σ ∗ = 0 . , . , .
0. Smaller σ ∗ leads toa smaller posterior volume and thus a larger trials factor. Much like the discussion above for priorvolume variation, ˆ q L has a large σ ∗ dependence, unlike ˆ q S .Next, we investigate the variation of the p -value with the dimensionality of the look-elsewhereeffect. To do this we extended the model to consider a signal at vector position x ∗ . Each data pointnow corresponds to a vector x i , and we extend the signal and background in a symmetric fashion– 12 – -3 -2 -1 P ( ˆ Q ≥ ˆ q ) Prior Dependence [ V x ∗ , prior ]ˆ q L [10 ]ˆ q S [10 ]ˆ q L [10 ]ˆ q S [10 ]ˆ q L [10 ]ˆ q S [10 ]1 − exp( − e − ˆ q/ ) Signal Width Dependence [ σ ∗ ]ˆ q L [0 . q S [0 . q L [0 . q S [0 . q L [1 . q S [1 . − exp( − e − ˆ q/ ) ˆ q -3 -2 -1 P ( ˆ Q ≥ ˆ q ) Dimensionality Dependence [ M ]ˆ q L [0 d ]ˆ q S [0 d ]ˆ q L [1 d ]ˆ q S [1 d ]ˆ q L [2 d ]ˆ q S [2 d ]ˆ q L [3 d ]ˆ q S [3 d ]1 − exp( − e − ˆ q/ ) ˆ q Parameterization Dependence ˆ q L [ f ]ˆ q S [ f ]ˆ q L [ Poisson ]ˆ q S [ Poisson ]1 − exp( − e − ˆ q/ ) Figure 4 . CCDFs of ˆ q L (dotted) and ˆ q S (dashed), computed using 10 simulations with no signal ( f = 0).(Top Left) compares three prior volumes: 10 (red), 10 (blue), and 10 (magenta). (Top Right) comparesdifferent values of signal width σ ∗ : 0.1 (red), 0.5 (blue) and 1.0 (magenta). (Bottom Left) compares thedimensionality of x ∗ : 0 d (red), 1 d (blue), 2 d (magenta), and 3 d (green). (Bottom Right) compares the un-binned f -parameterization (red) against a binned Poisson parameterization (blue). In all cases the p -value ofˆ q L has large variation, whereas ˆ q S does not. Furthermore, ˆ q S closely follows the predictions of equation 3.10(black). across each dimension, keeping the total prior volume fixed. Within the context of collider searches,the components of x ∗ might correspond to a collection of invariant mass and jet properties. Forastroparticle searches, the multiple dimensions might correspond to different directions in the sky.The bottom left plot of figure 4 shows the variation of the test statistics for dimensionality of 1, 2,and 3, for a constant prior volume of 100. It can be seen that, while the p -value of ˆ q L is dependenton the dimensionality, the p -value of ˆ q S is not. This justifies the naturally arising (2 π ) M/ prefactorin the posterior volume in equation 2.9. We also plot the 0 d case, corresponding to only fitting for A with fixed x ∗ . Even though there is no look-elsewhere effect in this case, asymptotic agreementwith equation 3.10 is still achieved. This shows our approach is still reliable in the N → . . . . . . FPR . . . . . . T P R ROC f = 0 .
005 (SNR ≈ . f = 0 .
009 (SNR ≈ f = 0 .
013 (SNR ≈ f = 0 .
019 (SNR ≈ q L ˆ q S Figure 5 . ROC curve: comparing the true positive rate (TPR), for a variety of f , with the false positive rate(FPR) for ˆ q L (dotted) and ˆ q S (dashed). The signal-to-noise ratio (SNR) corresponds to the average √ ˆ q L overall data realizations. justifying its applicability for arbitrary N . As discussed in section 3.1, non-asymptotic agreement isnot expected for a one-tailed test in the absence of the look-elsewhere effect, as the p -value tends to0.5 as ˆ q L →
0; on the other hand, a two-tailed test would give non-asymptotic agreement as shownin figure 2.The above discussion concerns an un-binned model, parameterized by the signal fraction f . Oftenin particle physics, one performs a binned analysis with the number of events in each bin modelledas a Poisson distribution [35]. We find similar results when using this Poisson parameterization, aspictured in the bottom right of figure 4. The Poisson line agrees with the black line slightly betterthan the f line does, likely because the Laplace approximation is more accurate in the Poisson case.When it comes to hypothesis testing, the relation between the true positive rate (TPR) and thefalse positive rate (FPR) determines the predictive power of a test statistic. In order to compare therelative power of the test statistics we consider an ROC plot for a variety of true f values, shown infigure 5. We also quote the (local) signal-to-noise ratio (SNR), which we define as the average √ ˆ q L across 10 realizations for the given f . It can be seen that ˆ q S and ˆ q L have approximately equivalentROC lines, suggesting MAP and MPS have equal predictive power. This is expected as the relationbetween the test statistics is approximately monotonic, as seen in equation 3.3. Also, it can be seenthat the predictive power increases with true f – as expected a larger true signal is more likely to becorrectly detected. While we could continue the discussion in the context of resonance searches, we now consider awhite noise time series example to illustrate the application of MPS to different models. This can bethought of as a toy model of a gravitational wave search. In this section we show how MPS handlesadditional model complexity as theorized in section 3.2. We consider a time series y ( x ) comprising– 14 – ˆ q − − − P ( ˆ Q ≥ ˆ q ) Degrees of Freedom Dependence ˆ q L [ A ]ˆ q S [ A ]ˆ q L [ A, φ ]ˆ q S [ A, φ ]ˆ q L [ A, x ∗ ]ˆ q S [ A, x ∗ ]ˆ q L [ A, φ, x ∗ ]ˆ q S [ A, φ, x ∗ ]˜ F (ˆ q ) / F (ˆ q ) / − exp ( − e − ˆ q/ ) Figure 6 . CCDFs of ˆ q L and ˆ q S averaged over 10 simulations with no signal ( A = 0). The parameters in thesquare brackets are those being maximized, with other parameters being held fixed (as discussed in the text).The p -value of ˆ q L varies depending on the model complexity, whereas ˆ q S consistently follows the predictionof equation 3.10 (solid black). of measurements at N d times, x = { x i } N d i =1 , with spacing x i +1 − x i = 1. In the absence of a signal,each data point y i ≡ y ( x i ) is assumed to be a standard normal random variable, i.e. we assume whitenoise. We consider a model with 2 degrees of freedom (dofs), with signal given by p s ( x | A , A , x ∗ , ∆ , σ ∗ ) = A N ( x | x ∗ , σ ∗ ) + A N ( x | x ∗ + ∆ , σ ∗ ) (5.1)where A , > x ∗ and x ∗ + ∆ are the positions of the dofs, and σ ∗ isthe common width.As motivated in section 3.2, we reparameterize so that there’s a single amplitude parameter, z = A , and other parameters describing the properties of the single degree of freedom, z > . We thustransform variables using A = A cos φ and A = A sin φ , with A > ≤ φ ≤ π/ p s ( x | A, φ, x ∗ , ∆ , σ ∗ ) = A [cos φN ( x | x ∗ , σ ∗ ) + sin φN ( x | x ∗ + ∆ , σ ∗ )] . (5.2)The corresponding chi-squared difference between the data and the null hypothesis, equal to two timesthe log-likelihood-ratio, is given by q L ( x | A, φ, x ∗ , ∆ , σ ∗ ) = N d (cid:88) i =1 (cid:2) y i − p s ( x i | A, φ, x ∗ , ∆ , σ ∗ ) (cid:3) − [ y i ] . (5.3)We consider a uniform prior on x ∗ with range (0 , V x ∗ , prior = 100, and N d = 100. We do not fit for σ ∗ or ∆ and fix them to σ ∗ = 0 . data realizations with no signal, figure 6 shows how ˆ q L and ˆ q S are distributedfor different levels of model complexity. First we maximize over A , while holding all other parametersfixed. In this case ˆ q L ∼ ˜ F (ˆ q L ) / φ allows for 2 dofs, and gives ˆ q L ∼ ˜ F (ˆ q L ) / A , > q S follows the same asymptoticdistribution as predicted by equation 3.10. This verifies that the Bayesian picture of marginalizing over φ successfully reduces a model with 2 dofs to the same scale as 1 dof, in other words Wilks’ Theoremhas been replaced by marginalizing over φ . There is some discrepancy in the non-asymptotic regimefor the maximization over A only (red dashed line), as discussed in section 3.1 for a one-tailed test.We now introduce the look-elsewhere effect by allowing x ∗ to vary. First we maximize over A and x ∗ for fixed φ = 0, as shown by the magenta lines. This is equivalent to a model with 1 dof because φ = 0 corresponds to A = 0. We see that the distribution of ˆ q L (magenta dotted line) is shiftedto the right compared to the red and blue dotted lines due to the look-elsewhere effect. However,the distribution of ˆ q S (magenta dashed line) continues to follow the line predicted by equation 3.10.Finally, when maximizing over A , φ and x ∗ , i.e. a model with 2 dofs in the presence of the look-elsewhere effect, ˆ q L (green dotted line) is further right-shifted, whereas ˆ q S (green dashed line) againagrees with equation 3.10. The slight discrepancy in the A, φ, x ∗ maximization case is due to using toolarge a prior volume: there is a slight preference to having two well fitted peaks compared to one verywell fitted peak, thus the distribution of φ is clustered towards φ = π/
4. Using a more appropriateprior for φ would improve agreement.In summary, while the distribution of ˆ q L is highly dependent on the model complexity, via theextra degrees of freedom and look-elsewhere effect, ˆ q S has a universal distribution. There is much interest in detecting non-Gaussian models of inflation via the cosmological powerspectrum [36–41]. A specific type of such a feature model adds the following oscillatory perturbationto the ΛCDM power spectrum, P ( k ) = P ( k )[1 + A sin(2 ωk + φ )] , (6.1)where P ( k ) is the featureless (ΛCDM) power spectrum and A , ω , and φ are the amplitude, frequency,and phase of the oscillatory perturbation. Such models are searched for using Planck 2013 data in [14]using the frequentist look-elsewhere analysis technique of [13]. In this section we seek to reproducethe conclusions of these papers using MPS.Equation 6.1 can be written in the form P ( k ) = P ( k ) + ∆ P ( k ) with∆ P ( k ; A, ω, φ ) = AP ( k )[cos φ sin(2 ωk ) + sin φ cos(2 ωk )] ≡ A cos φP s ( k ; ω ) + A sin φP c ( k ; ω ) , (6.2)where in the last line we explicitly separate terms with A and φ , as only ω couples to k . Assuming alinear relation, one can write C (cid:96) = C (cid:96), + ∆ C (cid:96) , with∆ C (cid:96) ( A, ω, φ ) = A cos φC (cid:96),s ( ω ) + A sin φC (cid:96),c ( ω ) , (6.3)where C (cid:96),s and C (cid:96),c are the angular power spectra corresponding to P s and P c respectively. ThePlanck Likelihood [24] is given by − L ( ˆ C (cid:96) | A, ω, φ ) = [ ˆ C (cid:96) − C (cid:96) ( A, ω, φ )]∆ (cid:96) (cid:96) [ ˆ C (cid:96) − C (cid:96) ( A, ω, φ )] , (6.4)where ˆ C (cid:96) are the PCL estimates, and ∆ (cid:96) (cid:96) = (cid:104) ∆ ˆ C (cid:96) ∆ ˆ C (cid:96) (cid:105) is the PCL covariance matrix. In order tocompute the likelihood for the null hypothesis, CosmoMC [42] was used to find the best fit values forthe cosmological and nuisance parameters. When computing the likelihood for the signal hypothesis,the cosmological parameters were held fixed at these values; while they should really be re-fitted for– 16 – ˇ q L e rr o r s σ ω σ φ σ ω σ φ p detΣ ω,φ ω − − q q L q S Figure 7 . Planck results. Top: Plot of ˇ q L , the projection of q L onto the ω axis; this corresponds to q L evaluated at the A and φ that maximize q L at each ω . Middle: The errors obtained for the parameters, aswell as a comparison with the determinant of the covariance matrix having removed the amplitude parameter,Σ ω,φ . Bottom: A plot of q L (blue) and q S (cyan) for each peak, with the look-elsewhere correction depictedby the vertical black lines. the signal hypothesis, this is found to have little effect in [14]. The C (cid:96) are evaluated using CAMB[43] with a sufficiently high accuracy setting to ensure resolution of the rapid oscillations. To speedup the evaluation of the likelihood over parameter space, C (cid:96),s ( ω ) and C (cid:96),c ( ω ) were computed over adiscrete range of ω between 0 and 4000 with step-size ∆ ω = 5, with intermediate values computed viaspline interpolation. A flat prior was chosen for ω and φ . The rest of the analysis is analogous to theprevious examples: we find all the local maxima of the posterior, compute the Hessian using finitedifferencing, compute the covariance matrix, and use this to find ˆ q S . Unlike the previous examples,we note that ω and φ are correlated, as illustrated in the middle plot of figure 7, so it is importantto use the determinant of the full covariance matrix and not just its diagonal components. It is alsointeresting to note that higher peaks have smaller errors.The results obtained using the CAMspec component of the 2013 Planck likelihood are picturedin figure 7. The maximum occurs at ω ≈ q L = 15 .
4, giving a naive significance of √ ˆ q L ≈ q S = 3 .
0, giving a global p -value of 1 − exp(1 − e − / ) = 0 .
20 usingequation 3.10, and significance of S = 1 . p -value of 0.13, which is in reasonableagreement. Note that our likelihood profile does not match [14] exactly due to our approximateapproach, hence the p -value quoted here is the value one would obtain by applying the prescriptionof [14] to our likelihood profile. We applied the same analysis to the 2015 plik lite likelihood [44]and found a p -value of approximately 1, suggesting no evidence for such models of non-Gaussianity. One should sum the different components of the likelihood, but this is unnecessary for our proof of concept. – 17 –
Conclusions
This work has employed Bayesian and frequentist thinking to provide a general method to accountfor the look-elsewhere effect. We started by considering the Bayesian approach, and explained howmaximizing the posterior mass, as in MPM, is a more appropriate choice than maximizing the pos-terior density, as in MAP. Bayesian methodology naturally considers model complexity and the look-elsewhere effect by marginalization, which penalizes the likelihood by the prior-to-posterior volumeratio. Under the Laplace approximation, the posterior volume is proportional to the determinant oferror covariance matrix. We then considered the frequentist approach by writing the global p -value asthe local p -value multiplied by the trials factor. By drawing an analogy between the two approacheswe identified the trials factor as the prior-to-posterior volume ratio of the parameters being scannedover, in turn generalizing the Bonferroni correction to continuous problems. We introduced q S andin turn MPS, a hybrid of MPM and MAP, which considers the mode with maximum q S . Finally,we generalized the ˇSid´ak correction to continuous problems, providing a universal way to assign theglobal p -value in both the asymptotic and non-asymptotic regimes.We illustrated the effectiveness of MPS by considering several examples from (astro)particlephysics and cosmology, showing it to have equal predictive power to MAP while naturally accountingfor the look-elsewhere effect. MPS effectively shifts the hypothesis testing threshold of the maximumlikelihood ratio to a generic scale: while the peak maximum likelihood ratio, or equivalently the bestfit chi-squared χ = ˆ q L , depends on the model complexity and extent of the look-elsewhere effect,ˆ q S does not. In other words, instead of considering fixed ˆ q L thresholds, one should consider fixed ˆ q S thresholds.Unlike current methods that rely on performing numerous simulations, MPS accounts for thelook-elsewhere effect by using information from the data alone, as one need only compute the likelihoodand the posterior volume to evaluate q S . This provides a more efficient way to quantify statisticalsignificance as it does not require expensive simulations. In a typical situation one would focus onthe most promising anomalies only, with ˆ q S providing a scale that gives good guidance on what falsepositive rate one should expect. Subsequently, one would obtain additional information to verify theveracity of an anomaly when possible.For our proof of concept it was sufficient to only consider simple physical examples in this paper,but there are many applications where our methods can be employed. Examples include searches fornew particles in astroparticle and particle data, searches for gravitational wave signals in LIGO data,searches for exoplanets in transit and radial velocity data, as well as many more. In some of thesecases the look-elsewhere penalty can be considerably large, reaching beyond 6 sigma. The problem isvery general, as almost every search for unknown objects, events, new physics, or other phenomenawhose existence is unknown, has to deal with the look-elsewhere effect.The goal of a data analyst searching for anomalies is to report the most promising anomaliesin terms of having a small p -value, or a high Bayes factor. By clarifying the origins of the look-elsewhere effect and model complexity penalty for continuous parameters we hope to open the way torefinements in anomaly searches that can improve the overall success rate of a detection. This shouldbe a common goal of any experimental analysis regardless of which school of statistics one belongs to. Acknowledgments
We thank Benjamin Nachman for insightful comments on the manuscript, and Benjamin Wallischfor valuable discussion regarding example III. This research made use of the Cori supercomputer atthe National Energy Research Scientific Computing Center (NERSC), a U.S. Department of EnergyOffice of Science User Facility operated under Contract No. DE-AC02-05CH11231. This material isbased upon work supported by the National Science Foundation under Grant Numbers 1814370 andNSF 1839217, and by NASA under Grant Number 80NSSC18K1274.– 18 – ppendicesA Derivation of the CCDF of ˆ q S The asymptotic (large q L ) CCDF of the global maximum of q L is for a one-tail test is given in equation3.2 as P Q L ( Q L ≥ q L ) = N
12 ˜ F ( q L ) (A.1)= N √ πq L e − q L / + N q − / L e − q L / O (cid:0) q − L (cid:1) , (A.2)where here we include the leading order correction, and drop hats and take t = 1 for convenience.Consider the transformation of variables to q S , defined by q S ≡ g ( q L ) ≡ q L − N + ln 2 πq L . (A.3)It can be shown that the inverse of g is given by q L = g − ( q S ) = W (cid:18) N e q S π (cid:19) (A.4)= q S + ln N π − ln (cid:18) q S + ln N π (cid:19) + O (cid:18) L L (cid:19) , (A.5)where W ( z ) is the principal branch of the Lambert W function. The asymptotic expansion has beenperformed in the final line, with the shorthand L i ≡ ln i N e qS π . Assuming N is constant to study thelimiting behaviour, the CCDF of q S is thus P Q a ( Q a ≥ q S ) = P Q L (cid:2) Q L ≥ g − ( q S ) (cid:3) (A.6)= e − q S / e −O ( L /L ) − ln (cid:16) q S + ln N π (cid:17) + O (cid:16) L L (cid:17) q S + ln N π − / + O (cid:32) e − q S / q S + ln N π (cid:33) (A.7) → e − q S / , (A.8)where the limit corresponds to either N → ∞ or q S → ∞ . This means the result still appliesasymptotically in the absence of the look-elsewhere effect ( N = 1). References [1] R. G. Miller,
Simultaneous Statistical Inference . Springer New York, 1981, 10.1007/978-1-4613-8122-8.[2] J. P. Shaffer,
Multiple hypothesis testing , Annual Review of Psychology (1995) 561[ https://doi.org/10.1146/annurev.ps.46.020195.003021 ].[3] ATLAS collaboration, G. Aad et al.,
Observation of a new particle in the search for the StandardModel Higgs boson with the ATLAS detector at the LHC , Phys. Lett.
B716 (2012) 1 [ ].[4]
CMS collaboration, S. Chatrchyan et al.,
Observation of a New Boson at a Mass of 125 GeV with theCMS Experiment at the LHC , Phys. Lett.
B716 (2012) 30 [ ].[5] B. Anderson, S. Zimmer, J. Conrad, M. Gustafsson, M. S´anchez-Conde and R. Caputo,
Search forgamma-ray lines towards galaxy clusters with the fermi-lat , Journal of Cosmology and AstroparticlePhysics (2016) 026–026.[6] A. Reinert and M. W. Winkler,
A precision search for WIMPs with charged cosmic rays , Journal ofCosmology and Astroparticle Physics (2018) 055. – 19 –
7] N. Sekiya, N. Y. Yamasaki and K. Mitsuda,
A search for a keV signature of radiatively decaying darkmatter with Suzaku XIS observations of the X-ray diffuse background , Publications of the AstronomicalSociety of Japan (2015)[ https://academic.oup.com/pasj/article-pdf/68/SP1/S31/7971976/psv081.pdf ].[8] M. Aartsen, M. Ackermann, J. Adams, J. Aguilar, M. Ahlers, M. Ahrens et al., Observation ofhigh-energy astrophysical neutrinos in three years of icecube data , Physical Review Letters (2014) .[9] K. Emig, C. Lunardini and R. Windhorst,
Do high energy astrophysical neutrinos trace star formation? , Journal of Cosmology and Astroparticle Physics (2015) 029–029.[10] K. Cannon, C. Hanna and J. Peoples,
Likelihood-ratio ranking statistic for compact binary coalescencecandidates with rate estimation , 2015.[11] B. Abbott, R. Abbott, T. Abbott, M. Abernathy, F. Acernese, K. Ackley et al.,
Gw150914: First resultsfrom the search for binary black hole coalescence with advanced ligo , Physical Review D (2016) .[12] C. Messick, K. Blackburn, P. Brady, P. Brockill, K. Cannon, R. Cariou et al., Analysis framework forthe prompt discovery of compact binary mergers in gravitational-wave data , Physical Review D (2017) .[13] J. R. Fergusson, H. F. Gruetjen, E. P. S. Shellard and M. Liguori, Combining power spectrum andbispectrum measurements to detect oscillatory features , Phys. Rev.
D91 (2015) 023502 [ ].[14] J. R. Fergusson, H. F. Gruetjen, E. P. S. Shellard and B. Wallisch,
Polyspectra searches for sharposcillatory features in cosmic microwave sky data , Phys. Rev.
D91 (2015) 123506 [ ].[15] P. Hunt and S. Sarkar,
Search for features in the spectrum of primordial perturbations using planck andother datasets , Journal of Cosmology and Astroparticle Physics (2015) 052–052.[16] J. Robnik and U. Seljak,
Kepler data analysis: non-gaussian noise and fourier gaussian processanalysis of star variability , 2019.[17] P. I. W. de Bakker, R. Yelensky, I. Pe’er, S. B. Gabriel, M. J. Daly and D. Altshuler,
Efficiency andpower in genetic association studies , Nature Genetics (2005) 1217.[18] J. D. Storey and R. Tibshirani, Statistical significance for genomewide studies , Proceedings of theNational Academy of Sciences of the United States of America (2003) 9440.[19] M. Proschan and M. Waclawiw,
Practical guidelines for multiplicity adjustment in clinical trials , Controlled clinical trials (2001) 527.[20] B. McKay, D. Bar-Natan, M. Bar-Hillel and G. Kalai, Solving the bible code puzzle , Statist. Sci. (1999) 150.[21] C. E. Bonferroni, Teoria statistica delle classi e calcolo delle probabilit`a , Pubblicazioni del R IstitutoSuperiore di Scienze Economiche e Commerciali di Firenze (1936) .[22] Z. ˇSid´ak,
Rectangular confidence regions for the means of multivariate normal distributions , Journal ofthe American Statistical Association (1967) 626[ https://doi.org/10.1080/01621459.1967.10482935 ].[23] E. Gross and O. Vitells, Trial factors for the look elsewhere effect in high energy physics , The EuropeanPhysical Journal C (2010) 525–530.[24] Planck Collaboration, P. A. R. Ade, N. Aghanim, C. Armitage-Caplan, M. Arnaud, M. Ashdown et al., Planck 2013 results. XV. CMB power spectra and likelihood , Astronomy & Astrophysics (2014)A15 [ ].[25] M.-H. Chen, Q.-M. Shao and J. G. Ibrahim,
Estimating Ratios of Normalizing Constants , pp. 124–190.Springer New York, New York, NY, 2000. 10.1007/978-1-4612-1276-8 5.[26] H. Jia and U. Seljak,
Normalizing constant estimation with gaussianized bridge sampling , 2019.[27] U. Seljak and B. Yu,
Posterior inference unchained with EL O, 2019.[28] D. J. C. MacKay,
Information Theory, Inference, and Learning Algorithms . Copyright CambridgeUniversity Press, 2003.[29] D. V. Lindley,
A statistical paradox , Biometrika (1957) 187. – 20 –
30] R. D. Cousins,
The jeffreys–lindley paradox and discovery criteria in high energy physics , Synthese (2014) 395–432.[31] S. Nesseris and J. Garc´ıa-Bellido,
Is the jeffreys’ scale a reliable tool for bayesian model comparison incosmology? , Journal of Cosmology and Astroparticle Physics (2013) 036–036.[32] S. S. Wilks,
The large-sample distribution of the likelihood ratio for testing composite hypotheses , Ann.Math. Statist. (1938) 60.[33] R. B. Davies, Hypothesis testing when a nuisance parameter is present only under the alternative , Biometrika (1977) 247.[34] R. B. Davies, Hypothesis testing when a nuisance parameter is present only under the alternative , Biometrika (1987) 33.[35] G. Cowan, K. Cranmer, E. Gross and O. Vitells, Asymptotic formulae for likelihood-based tests of newphysics , The European Physical Journal C (2011) .[36] J. Maldacena, Non-gaussian features of primordial fluctuations in single field inflationary models , Journal of High Energy Physics (2003) 013–013.[37] P. Creminelli, A. Nicolis, L. Senatore, M. Tegmark and M. Zaldarriaga,
Limits on non-gaussianitiesfrom wmap data , Journal of Cosmology and Astroparticle Physics (2006) 004–004.[38] E. Komatsu, J. Dunkley, M. R. Nolta, C. L. Bennett, B. Gold, G. Hinshaw et al.,
Five-year wilkinsonmicrowave anisotropy probe observations: Cosmological interpretation , The Astrophysical JournalSupplement Series (2009) 330–376.[39] U. Seljak,
Extracting primordial non-gaussianity without cosmic variance , Phys. Rev. Lett. (2009)021302.[40] Planck Collaboration, P. A. R. Ade, N. Aghanim, C. Armitage-Caplan, M. Arnaud, M. Ashdown et al.,
Planck 2013 results. XXIV. Constraints on primordial non-Gaussianity , Astronomy & Astrophysics (2014) A24 [ ].[41] P. Collaboration, Y. Akrami, F. Arroja, M. Ashdown, J. Aumont, C. Baccigalupi et al.,
Planck 2018results. ix. constraints on primordial non-gaussianity , 2019.[42] A. Lewis and S. Bridle,
Cosmological parameters from cmb and other data: A monte carlo approach , Physical Review D (2002) .[43] A. Lewis, A. Challinor and A. Lasenby, Efficient computation of cosmic microwave backgroundanisotropies in closed friedmann-robertson-walker models , The Astrophysical Journal (2000) 473.[44]
Planck collaboration, N. Aghanim et al.,
Planck 2015 results. XI. CMB power spectra, likelihoods, androbustness of parameters , Astron. Astrophys. (2016) A11 [ ].].