Generalized asymptotic formulae for estimating statistical significance in high energy physics analyses
M. J. Basso∗
Department of Physics, University of Toronto, 60 St. George St., Toronto, Ontario, Canada
February 9, 2021
Abstract
Within the framework of likelihood-based statistical tests for high energy physics measurements, we derive generalized expressions for estimating the statistical significance of discovery using the asymptotic approximations of Wilks and Wald for a variety of measurement models. These models include arbitrary numbers of signal regions, control regions, and Gaussian constraints. We extend our expressions to use the representative or "Asimov" dataset proposed by Cowan et al. such that they are made Monte Carlo-free. While many of the generalized expressions are complicated and often involve solving systems of coupled, multivariate equations, we show these expressions reduce to closed-form results under simplifying assumptions. We also validate the predicted significance using Monte Carlo toys in select cases.
1 Introduction

In the field of high energy physics (HEP), likelihood-based statistical tests entail the construction of a likelihood function L describing a particular measurement model; the likelihood function in turn describes the "likelihood" of measuring parameters defining the model given some observed data [1]. For counting experiments typical of HEP analyses at the Large Hadron Collider (LHC), the likelihood may be written most simply as a product of Poisson probability density functions (PDFs) over N "regions" or "bins":

L(\vec{\theta}) = \prod_{i=1}^{N} P(n_i \mid \nu_i(\vec{\theta})) = \prod_{i=1}^{N} \frac{\nu_i(\vec{\theta})^{n_i} \exp(-\nu_i(\vec{\theta}))}{n_i!},   (1)

where n_i and \nu_i are the observed and expected yields in region i, respectively, and \vec{\theta} are the free parameters defining our model. Here, we have assumed a uniform prior \pi(\vec{\theta}) for our free parameters (i.e., no prior knowledge). The best-fit parameters for a given measurement will be those which maximize the likelihood.

∗ E-mail: [email protected]

Typically, one is interested in measuring some signal s (e.g., the number of Higgs boson decay events to WW*) given some known or constrained background b (e.g., the number of Drell-Yan events). In this case, s defines our parameter-of-interest (POI), what we are interested in measuring, while b defines a nuisance parameter (NP), a parameter we measure but which may not be physically interesting. We may parametrize the expected yield as \nu(\mu, b) = \mu s + b, where s is now fixed and our signal strength \mu is what tunes the amount of signal, now becoming our POI.
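For concreteness, Eq. 1 is simple to evaluate numerically; the following sketch (the single-region yields and the use of scipy are illustrative assumptions, not part of the note) computes the log-likelihood of a counting model with \nu(\mu, b) = \mu s + b:

```python
import numpy as np
from scipy.stats import poisson

def log_likelihood(n, nu):
    """Log of Eq. 1: a sum of Poisson log-PMFs over the N regions."""
    return float(np.sum(poisson.logpmf(np.asarray(n), np.asarray(nu, float))))

# Illustrative single-region model: nu(mu, b) = mu*s + b
s, b = 10.0, 50.0
n_obs = [58]
print(log_likelihood(n_obs, [1.0 * s + b]))  # log-likelihood at mu = 1
```

Maximizing this quantity over the free parameters is exactly the fit described in the text.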
Absorbing b into \vec{\theta}, which we now assume contains only our NPs, and letting L = L(\mu, \vec{\theta}), we construct the log-likelihood ratio:

\lambda(\mu) = \frac{L(\mu, \vec{\hat{\hat{\theta}}})}{L(\hat{\mu}, \vec{\hat{\theta}})},   (2)

where \hat{\mu} and \vec{\hat{\theta}} are the unconditional maximum likelihood estimators of L (i.e., the values of \mu and \vec{\theta} which set \partial L/\partial\mu = 0 and \partial L/\partial\theta_i = 0 \forall i = 1, \ldots, N, where N is the number of NPs) and \vec{\hat{\hat{\theta}}} is the conditional maximum likelihood estimator of L for fixed \mu (i.e., the values of \vec{\theta} which set \partial L/\partial\theta_i = 0 \forall i = 1, \ldots, N, where N is the number of NPs) [1, 2]. Eq. 2 ranges between 0 and 1, where \lambda(\mu) \sim 1 indicates good agreement between the data and the hypothesized \mu and \lambda(\mu) \ll 1 indicates disagreement. It is convenient to define the test statistic:

t_\mu = -2 \ln \lambda(\mu),   (3)

where t_\mu \sim 0 indicates agreement with the hypothesized \mu and increasing t_\mu indicates increasing disagreement [1, 2]. We may define a p-value, p_\mu, representing the probability of observing equal or greater disagreement with the hypothesized \mu as:

p_\mu = \int_{t_{\mu,\mathrm{obs}}}^{\infty} f(t_\mu \mid \mu)\, dt_\mu,   (4)

where f(t_\mu \mid \mu) is the PDF of the test statistic for fixed \mu and t_{\mu,\mathrm{obs}} is the observed value of the test statistic. As is common practice in HEP, one may translate the p-value into the number of standard deviations from the mean of a standard Gaussian whose integrated, one-sided tail equals such a probability, i.e.:

Z_\mu = \Phi^{-1}(1 - p_\mu),   (5)

where \Phi^{-1} is the inverse cumulative distribution function of a standard Gaussian. This quantity is referred to as the significance of the measurement.

In "discovery" HEP analyses, one looks to measure the presence of signal (i.e., \mu s + b with \mu > 0), testing the null hypothesis H_0 that no signal is present (i.e., \mu' = 0, where \mu' is the tested value of our POI) against the alternative hypothesis H_1 that signal is present in some fixed amount (i.e., \mu' = \mu, where \mu is fixed to some value).
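The p-value-to-significance map of Eq. 5 is available directly from scipy; a minimal illustrative sketch:

```python
from scipy.stats import norm

def p_to_z(p):
    """Eq. 5: Z = Phi^{-1}(1 - p)."""
    return norm.isf(p)   # isf(p) = ppf(1 - p), numerically stable for small p

def z_to_p(z):
    """Inverse map: the one-sided Gaussian tail probability."""
    return norm.sf(z)

print(z_to_p(5.0))  # the conventional 5 sigma threshold, ~2.87e-7
```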
An example of a discovery analysis would be a measurement of vector boson fusion production of Higgs bosons decaying to WW* over the prevailing top quark pair and single top production, Drell-Yan, and diboson backgrounds.

Within the framework of likelihood-based statistical tests for HEP, we adopt the test statistic for discovery analyses proposed by Cowan et al. [2]:

t_0 = \begin{cases} -2\ln\lambda(0) & \text{if } \hat{\mu} \ge 0 \\ 0 & \text{if } \hat{\mu} < 0 \end{cases},   (6)

where, as before, \hat{\mu} is our unconditional maximum likelihood estimator of \mu. As we should not measure a negative signal strength for a signal model predicting an enhancement to our measured yields, we set t_0 equal to 0 as a lower bound on our test statistic (i.e., consistent with the null hypothesis).

(N.B.: it is equally valid to let s be our POI, but in the spirit of consistency with the literature on this topic, we adopt this reparametrization.)

By Eq. 4, our p-value becomes:

p_0 = \int_{t_{0,\mathrm{obs}}}^{\infty} f(t_0 \mid 0)\, dt_0,   (7)

and by Eq. 5, our significance of discovery is:

Z_0 = \Phi^{-1}(1 - p_0).   (8)

To claim the discovery of a signal, it is typical to require that the significance exceeds 5\sigma: Z_0 \ge 5, corresponding to p_0 = 2.87 \times 10^{-7}.

Often, a physicist will want to know the expected significance of a measurement assuming their signal model in MC to be true and correct. In the case of discovery analyses, this will necessitate knowledge of f(t_0 \mid 0), and the median value of t_0, and therefore an approximation of t_{0,\mathrm{obs}}, may be extracted as the median of f(t_0 \mid \mu'), the PDF of the test statistic for discovery assuming the tested signal strength \mu'. As an equation, the median p-value is given by:

\mathrm{med}[p_0 \mid \mu'] = \int_{\mathrm{med}[t_0 \mid \mu']}^{\infty} f(t_0 \mid 0)\, dt_0.   (9)

Without knowing f(t_0 \mid 0) and f(t_0 \mid \mu'), the above expression is difficult to evaluate. Using the approximations of Wilks [3] and Wald [4], Cowan et al. [2] show that the p-value for discovery may be approximated as:

p_0 = 1 - F(t_0 \mid 0) \approx 1 - \Phi(\sqrt{t_0}),   (10)

where F(t_0 \mid 0) is the cumulative distribution function (CDF) for f(t_0 \mid 0), neglecting terms of order O(1/\sqrt{N}) (where N is the sample size) and assuming the best-fit signal strength \hat{\mu} is Gaussian distributed. Inserting Eq. 10 into Eq. 8 yields:

Z_0 \approx \sqrt{t_0},   (11)

under the same assumptions. Given that Z_0 is a monotonically decreasing function of p_0 and using Eqs. 8, 9, and 11, we may also write:

\mathrm{med}[Z_0 \mid \mu'] = \Phi^{-1}\left(1 - \int_{\mathrm{med}[t_0 \mid \mu']}^{\infty} f(t_0 \mid 0)\, dt_0\right) \approx \sqrt{\mathrm{med}[t_0 \mid \mu']}.   (12)

To evaluate the above, Cowan et al. propose the use of the "Asimov" dataset where the estimators of all parameters yield their true values [2]. In our above formulae, this is equivalent to setting all parameters equal to their true values given our particular physics model (e.g., \hat{\mu} \to \mu' and n \to \mu' s + b). If t_{0,A} is the Asimov value of our test statistic for discovery assuming the tested signal strength \mu', then we can write:

\mathrm{med}[t_0 \mid \mu'] = t_{0,A},   (13)

and by inserting Eq. 13 into Eq. 12, we yield:

\mathrm{med}[Z_0 \mid \mu'] \approx \sqrt{t_{0,A}}.   (14)

This is one of the important results shown by Cowan et al. [2]. It says we can estimate the median significance of discovery as the square root of the test statistic for discovery evaluated using Asimov data. Using the above, one can produce analytical approximations for a variety of measurement scenarios, giving a physicist a handle on the expected power of their analysis techniques without relying on numerical recourse. As Cowan et al. discuss in their paper, the asymptotic approximation is already quite good for N \sim O(100) (see for instance Fig. 7 of Ref. [2]).

This note proceeds as follows: we will motivate and construct several different measurement scenarios (e.g., multiple control regions, multiple signal regions, etc.) a physicist typically encounters and derive expressions for the median significance of discovery using the asymptotic approximation and assuming Asimov data.
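As a concrete illustration of Eq. 14: for a single counting region with a perfectly known background b (so there are no NPs to profile), t_{0,A} has the closed form 2((s + b)\ln(1 + s/b) - s), giving the familiar Asimov significance of Cowan et al. [2]. A minimal sketch with illustrative yields:

```python
import math

def z_asimov(s, b):
    """Asimov median discovery significance for one region with known b:
    Z_A = sqrt(2 * ((s + b) * ln(1 + s/b) - s))."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def z_naive(s, b):
    """The 'naive' s / sqrt(s + b) estimate, valid only for s << b."""
    return s / math.sqrt(s + b)

print(z_asimov(10.0, 100.0), z_naive(10.0, 100.0))  # agree at small s/b
print(z_asimov(10.0, 1.0), z_naive(10.0, 1.0))      # differ at large s/b
```

The second line shows where the naive Taylor-expanded estimate breaks down, a theme revisited throughout this note.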
In all cases, we generalize to an arbitrary number of regions or constraints N and show the resulting formulae to reduce to expected formulae (i.e., derived elsewhere) in the N = 1 case or to agree with numerical simulation in test cases.

Additionally, we will simplify the use of Eq. 14 in the following sections by dropping the approximation (i.e., setting it to an equality) and by assuming \mu' = 1, typical of discovery analyses where the signal model's cross section is normalized to theoretical expectations. We define:

Z \equiv \mathrm{med}[Z_0 \mid \mu' = 1] = \sqrt{t_{0,A}} = \sqrt{-2 \ln \frac{L(0, \vec{\hat{\hat{\theta}}}_A)}{L(\hat{\mu}_A, \vec{\hat{\theta}}_A)}},   (15)

where we have inserted Eq. 6 followed by Eq. 2, and the best-fit values of \vec{\hat{\hat{\theta}}}, \hat{\mu}, and \vec{\hat{\theta}} are assumed to be evaluated using Asimov data (hence the subscript A).

1.2 Validation using Monte Carlo

To verify our derivations for measurement scenarios which have not yet been studied analytically, we will use Monte Carlo (MC): we will draw toy events from the PDF governing the measurement scenario at hand in each of the relevant regions (i.e., SRs and CRs). The parameters of the individual PDFs will be set to their hypothesized values, and a full sampling of the likelihood will be performed for both the null (i.e., \mu' = 0) and alternative (i.e., \mu' = 1) hypotheses. Using the Python package probfit [5] to set up a simultaneous, unbinned (in each region), maximum-likelihood fit and MINUIT [6] via the Python package iminuit [7] to perform the minimization, we will extract the minima of -\ln L(0, \vec{\hat{\hat{\theta}}}) and -\ln L(\hat{\mu}, \vec{\hat{\theta}}), allowing us to calculate our test statistic t_0 using Eqs. 2 and 6. By performing this procedure many (i.e., O(10000)) times assuming \mu' = 0 and then assuming \mu' = 1, we can produce approximate PDFs of the test statistic, f(t_0 \mid \mu' = 0) and f(t_0 \mid \mu' = 1).
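Although the full unbinned fits use probfit and iminuit, the essential toy logic can be illustrated in a few lines for the simplest 1 SR + 1 CR counting case, where the profiling can be done in closed form; all yields, the seed, and the toy count below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def t0(n, m, tau):
    """Discovery test statistic (Eq. 6) for 1 SR + 1 CR counting, with
    the conditional and unconditional fits done in closed form."""
    n = np.asarray(n, float); m = np.asarray(m, float)
    bhh = (n + m) / (1.0 + tau)               # conditional MLE of b at mu = 0
    sh = n - m / tau                          # unconditional MLE of the signal
    def xlogy(x, y):                          # x*log(y) with 0*log(0) := 0
        return np.where(x > 0, x * np.log(np.where(x > 0, y, 1.0)), 0.0)
    lnl0 = xlogy(n, bhh) - bhh + xlogy(m, tau * bhh) - tau * bhh
    lnl1 = xlogy(n, n) - n + xlogy(m, m) - m  # value at the unconditional maximum
    return np.where(sh >= 0, -2.0 * (lnl0 - lnl1), 0.0)  # clamp when s-hat < 0

s, b, tau, ntoys = 20.0, 30.0, 3.0, 50000
rng = np.random.default_rng(1)
t_alt = t0(rng.poisson(s + b, ntoys), rng.poisson(tau * b, ntoys), tau)    # mu' = 1
t_null = t0(rng.poisson(b, ntoys), rng.poisson(tau * b, ntoys), tau)      # mu' = 0
p = np.mean(t_null >= np.median(t_alt))   # Eq. 9, estimated from toys
print(norm.isf(p))                         # toy estimate of the median significance
```

The clamp in the last line of `t0` implements the \hat{\mu} < 0 branch of Eq. 6.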
By integrating f(t_0 \mid \mu' = 0) from the median value of f(t_0 \mid \mu' = 1) to infinity, we obtain the p-value of the measurement, which in turn yields the median significance of discovery using Eq. 12. The Python packages numpy [8], scipy [9], and matplotlib [10] are used for processing and plotting.

2 Derivations
In the following subsections, we will derive expressions for the median significance of discovery in the asymptotic limit for a variety of commonly encountered measurement scenarios and provide validation of some of our expressions by comparisons to other sources or by numerical simulation. In particular, we will cover:

• Section 2.1: 1 signal region + N control regions, N \in \mathbb{N};
• Section 2.2: N signal regions + 1 control region, N \in \mathbb{N};
• Section 2.3: N signal regions + M control regions, N, M \in \mathbb{N};
• Section 2.4: 1 signal region containing N background processes with M Gaussian background constraints, N, M \in \mathbb{N}.

2.1 1 Signal Region + N Control Regions
Assuming a uniform prior, we can write our likelihood with 1 signal region (SR) and N orthogonal auxiliary measurements (read as: N control regions (CRs) for backgrounds b_i, i = 1, \ldots, N) as:

L(s, \vec{b}) = P\left(n \,\middle|\, s + \sum_{i=1}^{N} b_i\right) \cdot \prod_{i=1}^{N} P\left(m_i \,\middle|\, \sum_{j=1}^{N} \tau_{ij} b_j\right),   (16)

where P refers to a Poisson PDF, n is the observed yield in our SR, m_i is the observed yield in CR i, and \tau_{ij} are the transfer factors which carry background j in our SR to CR i. Inserting the mathematical form for P yields:

L(s, \vec{b}) = \frac{\left(s + \sum_{i=1}^{N} b_i\right)^n \exp\left(-\left(s + \sum_{i=1}^{N} b_i\right)\right)}{n!} \cdot \prod_{i=1}^{N} \frac{\left(\sum_{j=1}^{N} \tau_{ij} b_j\right)^{m_i} \exp\left(-\sum_{j=1}^{N} \tau_{ij} b_j\right)}{m_i!},   (17)

and taking the logarithm yields:

\ln L(s, \vec{b}) = n \ln\left(s + \sum_{i=1}^{N} b_i\right) - s - \sum_{i=1}^{N} b_i + \sum_{i=1}^{N} \left(m_i \ln\left(\sum_{j=1}^{N} \tau_{ij} b_j\right) - \sum_{j=1}^{N} \tau_{ij} b_j\right).   (18)

As we are dealing with a likelihood, constant offsets do not affect our optimization, and so we have dropped -(\ln(n!) + \sum_{i=1}^{N} \ln(m_i!)).

We first consider the most probable values for the backgrounds, \hat{\hat{b}}_i, in the absence of signal, s = 0. We are interested in maximizing \ln L. As the signal is fixed and constant, we make this explicit in \ln L by evaluating it at s = 0 prior to taking any partial derivatives. The result is then differentiated with respect to background k and evaluated at \vec{b} = \vec{\hat{\hat{b}}} to yield:

\frac{\partial \ln L(0, \vec{b})}{\partial b_k}\bigg|_{\vec{b} = \vec{\hat{\hat{b}}}} = \frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} - 1 + \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{\hat{b}}_j} - 1\right).   (19)

Setting all of the partial derivatives, k = 1, \ldots, N, equal to 0 yields the following system of equations for \vec{\hat{\hat{b}}}:

\frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} - 1 + \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{\hat{b}}_j} - 1\right) = 0; \quad k = 1, \ldots, N.   (20)

We now consider maximizing the likelihood in the presence of signal s. In this situation, we let \hat{s} and \vec{\hat{b}} be the signal and background yields, respectively, which maximize our likelihood. Signal and background yields are both left floating, and so we take the partial derivative with respect to s evaluated at (s, \vec{b}) = (\hat{s}, \vec{\hat{b}}):

\frac{\partial \ln L(s, \vec{b})}{\partial s}\bigg|_{(s,\vec{b}) = (\hat{s},\vec{\hat{b}})} = \frac{n}{\hat{s} + \sum_{i=1}^{N} \hat{b}_i} - 1,   (21)

as well as the partial derivative with respect to b_k:

\frac{\partial \ln L(s, \vec{b})}{\partial b_k}\bigg|_{(s,\vec{b}) = (\hat{s},\vec{\hat{b}})} = \frac{n}{\hat{s} + \sum_{i=1}^{N} \hat{b}_i} - 1 + \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} - 1\right).   (22)

Setting Eqs. 21 and 22 equal to 0 and substituting Eq. 21 into Eq. 22 yields:

0 = \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} - 1\right).   (23)

Our system of equations for \hat{s} and \vec{\hat{b}} is then:

\left\{ \hat{s} = n - \sum_{i=1}^{N} \hat{b}_i, \quad \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} - 1\right) = 0; \quad k = 1, \ldots, N \right\}.   (24)

And using Eq. 15, our significance of discovery is:

Z = \sqrt{-2 \left[ n \ln\left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{\hat{s} + \sum_{i=1}^{N} \hat{b}_i}\right) + \hat{s} + \sum_{i=1}^{N} \left( (\hat{b}_i - \hat{\hat{b}}_i) + m_i \ln\frac{\sum_{j=1}^{N} \tau_{ij} \hat{\hat{b}}_j}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} + \sum_{j=1}^{N} \tau_{ij} (\hat{b}_j - \hat{\hat{b}}_j) \right) \right]}.   (25)

In this expression and the expressions for \hat{s}, \vec{\hat{b}}, and \vec{\hat{\hat{b}}}, we set n = s + \sum_{i=1}^{N} b_i and m_i = \sum_{j=1}^{N} \tau_{ij} b_j, i.e., their Asimov values.
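In practice, the coupled systems in Eqs. 20 and 24 can be handed to a numerical root finder. The sketch below (the yields and the non-diagonal transfer matrix are illustrative assumptions) solves Eq. 20 for the conditional background estimators under Asimov data:

```python
import numpy as np
from scipy.optimize import fsolve

# Illustrative 1 SR + 2 CR scenario: tau[i, j] carries background j
# in the SR to CR i (non-diagonal, so the CRs are not pure).
s_true = 10.0
b_true = np.array([20.0, 15.0])
tau = np.array([[3.0, 0.5],
                [0.4, 4.0]])
n = s_true + b_true.sum()        # Asimov SR yield
m = tau @ b_true                 # Asimov CR yields

def grad_lnl0(bhh):
    """Left-hand sides of Eq. 20, one component per background k."""
    pred = tau @ bhh             # predicted CR yields
    return [n / bhh.sum() - 1.0
            + sum(tau[i, k] * (m[i] / pred[i] - 1.0) for i in range(len(m)))
            for k in range(len(bhh))]

bhh = fsolve(grad_lnl0, b_true.copy())   # conditional MLEs at mu = 0
print(bhh)
```

With a signal present in the Asimov SR yield, the mu = 0 fit pulls the total background estimate above its true value, as expected.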
We may simplify the above further by using n = \hat{s} + \sum_{i=1}^{N} \hat{b}_i to yield:

Z = \sqrt{-2 \left[ n \ln\left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n}\right) + n + \sum_{i=1}^{N} \left( -\hat{\hat{b}}_i + m_i \ln\frac{\sum_{j=1}^{N} \tau_{ij} \hat{\hat{b}}_j}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} + \sum_{j=1}^{N} \tau_{ij} (\hat{b}_j - \hat{\hat{b}}_j) \right) \right]}.   (26)

This is our expression for the median significance of discovery in the asymptotic limit. N.B.: it is possible to simplify the above even further using the intuitive ansatz that the solutions to Eq. 24 are \hat{s} = s and \hat{b}_k = b_k \forall k = 1, \ldots, N. Indeed, this can be explicitly checked:

\hat{s} - n + \sum_{i=1}^{N} \hat{b}_i = s - \left(s + \sum_{i=1}^{N} b_i\right) + \sum_{i=1}^{N} b_i = 0,

\sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} - 1\right) = \sum_{i=1}^{N} \tau_{ik} \left(\frac{\sum_{j=1}^{N} \tau_{ij} b_j}{\sum_{j=1}^{N} \tau_{ij} b_j} - 1\right) = \sum_{i=1}^{N} \tau_{ik} (1 - 1) = 0,   (27)

as expected for Asimov data.

N = 1 Control Region
As a check, in the case where we have only 1 CR, N = 1, we let \hat{\hat{b}} \equiv \hat{\hat{b}}_1, m \equiv m_1, and \tau \equiv \tau_{11}. From Eq. 20, we yield:

0 = \frac{n}{\hat{\hat{b}}} - 1 + \tau \left(\frac{m}{\tau \hat{\hat{b}}} - 1\right) \Leftrightarrow \hat{\hat{b}} = \frac{n + m}{1 + \tau},   (28)

as expected. Additionally, letting \hat{b} \equiv \hat{b}_1, Eq. 24 yields:

0 = \tau \left(\frac{m}{\tau \hat{b}} - 1\right) \Leftrightarrow \hat{b} = \frac{m}{\tau} \Rightarrow \hat{s} = n - \frac{m}{\tau},   (29)

as expected. Finally, from Eq. 26, our significance is:

Z = \sqrt{-2 \left( n \ln\left(\frac{\hat{\hat{b}}}{n}\right) + n - \hat{\hat{b}} + m \ln\left(\frac{\hat{\hat{b}}}{\hat{b}}\right) + \tau (\hat{b} - \hat{\hat{b}}) \right)},   (30)

but we know \tau \hat{b} - (1 + \tau) \hat{\hat{b}} = m - (n + m) = -n by Eqs. 28 and 29, leaving us with:

Z = \sqrt{-2 \left( n \ln\left(\frac{\hat{\hat{b}}}{n}\right) + m \ln\left(\frac{\hat{\hat{b}}}{\hat{b}}\right) \right)} = \sqrt{-2 \ln\left( \left(\frac{n + m}{1 + \tau}\right)^{n + m} \frac{\tau^m}{n^n m^m} \right)},   (31)

matching what is shown in Eqs. 21 and 22 of Ref. [12].

Diagonal \tau

Often CRs are defined such that they yield high-stats, pure regions for a specific background. Here, we assume CR i targets background i by assuming the matrix of transfer factors \tau is diagonal (i.e., the acceptance of CR i is 1 for background i and 0 for all other backgrounds). Letting \tau_k \equiv \tau_{kk}, our equations for \vec{\hat{\hat{b}}}, \vec{\hat{b}}, and \hat{s}, Eqs. 20 and 24, simplify as:

0 = \frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} - 1 + \frac{m_k}{\hat{\hat{b}}_k} - \tau_k \Leftrightarrow \frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} + \frac{m_k}{\hat{\hat{b}}_k} = 1 + \tau_k,   (32)

and:

0 = \frac{m_k}{\hat{b}_k} - \tau_k \Leftrightarrow \hat{b}_k = \frac{m_k}{\tau_k} \Rightarrow \hat{s} = n - \sum_{i=1}^{N} \frac{m_i}{\tau_i},   (33)

for k = 1, \ldots, N. Our significance of discovery is:

Z = \sqrt{-2 \left( n \ln\left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n}\right) + n + \sum_{i=1}^{N} \left( -\hat{\hat{b}}_i + m_i \ln\left(\frac{\hat{\hat{b}}_i}{\hat{b}_i}\right) + \tau_i (\hat{b}_i - \hat{\hat{b}}_i) \right) \right)},   (34)

where we used \tau_{ij} = \tau_i \delta_{ij}, where \delta_{ij} is the Kronecker delta function.
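A quick numerical spot-check of the closed-form N = 1 result, Eq. 31 (the Asimov yields below are illustrative):

```python
import math

def z_onoff(n, m, tau):
    """Eq. 31: median discovery significance for 1 SR + 1 CR."""
    bhh = (n + m) / (1.0 + tau)   # Eq. 28
    bh = m / tau                  # Eq. 29
    return math.sqrt(-2.0 * (n * math.log(bhh / n) + m * math.log(bhh / bh)))

# Asimov inputs n = s + b, m = tau*b, for illustrative s, b, tau
s, b, tau = 20.0, 30.0, 3.0
print(z_onoff(s + b, tau * b, tau))
```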
Or, given -n \hat{\hat{b}}_k / \sum_{i=1}^{N} \hat{\hat{b}}_i = m_k - (1 + \tau_k) \hat{\hat{b}}_k = -\hat{\hat{b}}_k + \tau_k (\hat{b}_k - \hat{\hat{b}}_k) by Eqs. 32 and 33, we can also write:

Z = \sqrt{-2 \left( n \ln\left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n}\right) + n + \sum_{i=1}^{N} \left( m_i \ln\left(\frac{\hat{\hat{b}}_i}{\hat{b}_i}\right) - \frac{n \hat{\hat{b}}_i}{\sum_{j=1}^{N} \hat{\hat{b}}_j} \right) \right)} = \sqrt{-2 \ln\left( \left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n}\right)^n \prod_{i=1}^{N} \left(\frac{\hat{\hat{b}}_i}{\hat{b}_i}\right)^{m_i} \right)}.   (35)

N = 2 Control Regions and Diagonal \tau

As a special case of the previous section, we consider 2 CRs (N = 2) and assume each CR to be pure in the background it targets, i.e., \tau is diagonal. Then by Eq. 32:

\left\{ 0 = \frac{n}{\hat{\hat{b}}_1 + \hat{\hat{b}}_2} - 1 + \frac{m_1}{\hat{\hat{b}}_1} - \tau_1, \quad 0 = \frac{n}{\hat{\hat{b}}_1 + \hat{\hat{b}}_2} - 1 + \frac{m_2}{\hat{\hat{b}}_2} - \tau_2 \right\},   (36)

and subtracting the second from the first yields:

0 = \left(\frac{m_1}{\hat{\hat{b}}_1} - \frac{m_2}{\hat{\hat{b}}_2}\right) - (\tau_1 - \tau_2) \Leftrightarrow \hat{\hat{b}}_2 = \frac{m_2 \hat{\hat{b}}_1}{m_1 + (\tau_2 - \tau_1) \hat{\hat{b}}_1}.   (37)

Consider the simpler case where \tau_1 = \tau_2 = \tau > 0. Then \hat{\hat{b}}_2 = (m_2/m_1) \hat{\hat{b}}_1 and:

0 = n \hat{\hat{b}}_1 - (1 + \tau)(\hat{\hat{b}}_1 + \hat{\hat{b}}_2) \hat{\hat{b}}_1 + m_1 (\hat{\hat{b}}_1 + \hat{\hat{b}}_2) = n \hat{\hat{b}}_1 - (1 + \tau) \frac{m_1 + m_2}{m_1} \hat{\hat{b}}_1^2 + (m_1 + m_2) \hat{\hat{b}}_1 \Rightarrow \hat{\hat{b}}_1 = \frac{m_1 (n + m_1 + m_2)}{(1 + \tau)(m_1 + m_2)},   (38)

(throwing away the \hat{\hat{b}}_1 = 0 solution). By symmetry, we can send subscripts 1 \to 2 and 2 \to 1 to yield the \hat{\hat{b}}_2 solution:

\hat{\hat{b}}_2 = \frac{m_2 (n + m_1 + m_2)}{(1 + \tau)(m_1 + m_2)}.   (39)

We now consider the more complex case where \tau_1 \ne \tau_2. Substituting Eq. 37 into the first equation of Eq. 36 (multiplied through by (\hat{\hat{b}}_1 + \hat{\hat{b}}_2)\hat{\hat{b}}_1):

0 = n \hat{\hat{b}}_1 - (1 + \tau_1)(\hat{\hat{b}}_1 + \hat{\hat{b}}_2)\hat{\hat{b}}_1 + m_1 (\hat{\hat{b}}_1 + \hat{\hat{b}}_2)
  = (m_1 - (1 + \tau_1)\hat{\hat{b}}_1)\hat{\hat{b}}_2 + ((n + m_1) - (1 + \tau_1)\hat{\hat{b}}_1)\hat{\hat{b}}_1
  = (m_1 - (1 + \tau_1)\hat{\hat{b}}_1) m_2 + ((n + m_1) - (1 + \tau_1)\hat{\hat{b}}_1)(m_1 + \Delta\tau \hat{\hat{b}}_1)
  = m_1 (m_1 + m_2 + n) + (\Delta\tau (n + m_1) - (1 + \tau_1)(m_1 + m_2)) \hat{\hat{b}}_1 - \Delta\tau (1 + \tau_1) \hat{\hat{b}}_1^2,   (40)

where \Delta\tau \equiv \tau_2 - \tau_1, which can only vary between -\tau_1 and \tau_2. We have also cancelled an overall factor of \hat{\hat{b}}_1 on the third line, to remove the uninteresting solution \hat{\hat{b}}_1 = 0. Our solution is:

\hat{\hat{b}}_1 = \frac{-B_1 \pm \sqrt{B_1^2 - 4 A_1 C_1}}{2 A_1},   (41)

where:

A_1 = -\Delta\tau (1 + \tau_1),
B_1 = \Delta\tau (n + m_1) - (1 + \tau_1)(m_1 + m_2),
C_1 = m_1 (m_1 + m_2 + n).   (42)

By symmetry, we have:

\hat{\hat{b}}_2 = \frac{-B_2 \pm \sqrt{B_2^2 - 4 A_2 C_2}}{2 A_2},   (43)

where:

A_2 = \Delta\tau (1 + \tau_2),
B_2 = -\Delta\tau (n + m_2) - (1 + \tau_2)(m_1 + m_2),
C_2 = m_2 (m_1 + m_2 + n).   (44)

The expressions above are only physically meaningful if \hat{\hat{b}}_1 > 0 and \hat{\hat{b}}_2 > 0 (we cannot have a negative number of events). We suppose, for definiteness, \tau_2 > \tau_1 \Rightarrow \Delta\tau > 0, and we require \hat{\hat{b}}_1 > 0 and \hat{\hat{b}}_2 > 0. For \hat{\hat{b}}_1: A_1 < 0 and C_1 > 0, so -4 A_1 C_1 > 0, which implies B_1^2 - 4 A_1 C_1 > 0 and the square root has a real value. Additionally, 2 A_1 < 0 and B_1 < \sqrt{B_1^2 - 4 A_1 C_1}, so to always pick up a positive solution for \hat{\hat{b}}_1, we choose the negative sign:

\hat{\hat{b}}_1 = \frac{-B_1 - \sqrt{B_1^2 - 4 A_1 C_1}}{2 A_1} = \frac{B_1 + \sqrt{B_1^2 + 4 |A_1| C_1}}{2 |A_1|},   (45)

for \Delta\tau > 0; this solution is real and positive. We turn to \hat{\hat{b}}_2: A_2 > 0 and C_2 > 0, so -4 A_2 C_2 < 0 and |B_2| > \sqrt{B_2^2 - 4 A_2 C_2}. Additionally, -B_2 > 0, so our solution is always positive and we may write it as:

\hat{\hat{b}}_2 = \frac{|B_2| \pm \sqrt{B_2^2 - 4 A_2 C_2}}{2 A_2}.   (46)

The sign choice is still ambiguous, so we return to Eq. 36 and substitute in our expressions for each. One can show that the negative sign is required to solve our system of equations, and so our solution is:

\hat{\hat{b}}_2 = \frac{|B_2| - \sqrt{B_2^2 - 4 A_2 C_2}}{2 A_2}.   (47)

The above is always positive, but the condition for it being real requires B_2^2 - 4 A_2 C_2 > 0. It can be shown that B_2^2 - 4 A_2 C_2 = B_1^2 - 4 A_1 C_1 > 0, so this is indeed the case. The unconditional estimators follow from Eq. 33 with diagonal \tau:

\left\{ \hat{b}_1 = \frac{m_1}{\tau_1}, \quad \hat{b}_2 = \frac{m_2}{\tau_2}, \quad \hat{s} = n - \hat{b}_1 - \hat{b}_2 \right\}.   (48)

Our significance of discovery in the asymptotic limit is then Eq. 35 with Eqs. 45, 47, and 48 appropriately substituted in. Assuming Asimov data, we let n = s + b_1 + b_2, m_1 = \tau_1 b_1, and m_2 = \tau_2 b_2, where s and b_1, b_2 are our theoretical signal and background yields in our SR, respectively.

We have numerically calculated the median significance of discovery using the procedure described in Sec. 1.2 for 30 different hypothesized measurement scenarios (i.e., 3 curves with 10 points per curve). Additionally, we have plotted Eq. 35 continuously alongside these numerical results: both the numerical and the asymptotic results are shown in Fig. 1. Excellent agreement is observed, even down to low values of s + b_1 + b_2. The "naive" approximation of the significance, s/\sqrt{s + b_1 + b_2}, is also plotted. As expected, this naive approximation agrees well with the asymptotic and numerical results in the regime where s/b \ll 1, as s/\sqrt{s + b_1 + b_2} is a Taylor expansion of the asymptotic result in the small s/b limit. This is demonstrated most prominently by the green curve at low values of b_1, where s, b_1, and b_2 are all O(1) and the small s/b assumption fails.

We also include examples of the PDFs for our test statistic t_0 under the assumptions of no signal, f(t_0 \mid \mu' = 0), and in the presence of signal, f(t_0 \mid \mu' = 1), for the green curve, b_2 = 5, in Fig. 1 for both the b_1 = 1 and b_1 = 1000 simulated data points. At higher values of s/b as shown in Fig. 2a, the median value of f(t_0 \mid \mu' = 1) is well offset from t_0 = 0, resulting in a smaller integrated p-value for the null hypothesis. At smaller values of s/b as shown in Fig. 2b, the median value of f(t_0 \mid \mu' = 1) is approximately at t_0 = 0 and the distribution itself is not unlike f(t_0 \mid \mu' = 0), resulting in a larger integrated p-value.
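The closed-form pieces above (Eqs. 37, 45, and 48 inserted into Eq. 35) can be assembled into a short numerical sketch; the yields and transfer factors are illustrative, and \tau_2 > \tau_1 is assumed as in the text:

```python
import math

def bhh_pair(n, m1, m2, t1, t2):
    """Conditional (mu = 0) background estimators for N = 2 CRs and
    diagonal tau, via Eqs. 45 and 37 (assumes t2 > t1, as in the text)."""
    dt = t2 - t1
    A1 = -dt * (1.0 + t1)                                   # Eq. 42
    B1 = dt * (n + m1) - (1.0 + t1) * (m1 + m2)
    C1 = m1 * (n + m1 + m2)
    b1 = (-B1 - math.sqrt(B1 * B1 - 4.0 * A1 * C1)) / (2.0 * A1)  # Eq. 45
    b2 = m2 * b1 / (m1 + dt * b1)                           # Eq. 37
    return b1, b2

def z_two_cr(s, b1, b2, t1, t2):
    """Eq. 35 with Asimov inputs n = s + b1 + b2, m_k = tau_k * b_k."""
    n, m1, m2 = s + b1 + b2, t1 * b1, t2 * b2
    h1, h2 = bhh_pair(n, m1, m2, t1, t2)
    return math.sqrt(-2.0 * (n * math.log((h1 + h2) / n)
                             + m1 * math.log(h1 / b1)
                             + m2 * math.log(h2 / b2)))

print(z_two_cr(10.0, 20.0, 15.0, 2.0, 5.0))
```

The returned pair can be verified by plugging it back into both equations of Eq. 36.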
This behaviour is as expected.

Figure 1: The median significance of discovery as a function of SR background 1 yield (b_1) and SR background 2 yield (b_2) for the 1 SR bin + 2 CR bins measurement described in Section 2.1.4. The SR signal yield s is assumed to be 10. The transfer matrix for the background to the respective control regions is assumed to be diagonal with \tau_1 = 8 and \tau_2 = 5. "Numerical" refers to the results calculated using MC toys (50,000 events for the estimation of f(t_0 \mid \mu' = 0) and 50,000 events for the estimation of f(t_0 \mid \mu' = 1), per point), "Asymptotic" refers to Eq. 35, and "Simple" refers to s/\sqrt{s + b_1 + b_2}.

Figure 2: PDFs of the test statistic of discovery t_0 for the green curve, b_2 = 5, in Fig. 1 for both the (a) b_1 = 1 and (b) b_1 = 1000 simulated data points. These distributions are used to calculate the corresponding values of Z.

2.2 N Signal Regions + 1 Control Region
We now consider the case where we have N SRs (read as: N signal bins) and 1 shared CR. Then for our i = 1, \ldots, N SRs, we have transfer factors \{\tau_i; i = 1, \ldots, N\}, where \tau_i is the transfer factor carrying the background yield in our CR to SR i. Taking our theoretical background yield in our CR to be b and our observed value to be m, we can write our likelihood as:

L(\vec{s}, b) = \prod_{i=1}^{N} \left\{ P\left(n_i \,\middle|\, s_i + \frac{b}{\tau_i}\right) \right\} \cdot P(m \mid b),   (49)

where we are dividing by the transfer factors, as in Section 2.1 we took \tau to be the factor which multiplies yields in our SR to give yields in our CR. We can immediately write our log-likelihood as:

\ln L(\vec{s}, b) = \sum_{i=1}^{N} \left( n_i \ln\left(s_i + \frac{b}{\tau_i}\right) - s_i - \frac{b}{\tau_i} \right) + m \ln(b) - b,   (50)

where we have discarded the constant -(\ln(m!) + \sum_{i=1}^{N} \ln(n_i!)). Going right ahead with finding our conditional and unconditional maximum likelihood estimators:

\frac{\partial \ln L(\vec{0}, b)}{\partial b}\bigg|_{b = \hat{\hat{b}}} = \sum_{i=1}^{N} \left(\frac{n_i}{\hat{\hat{b}}} - \frac{1}{\tau_i}\right) + \frac{m}{\hat{\hat{b}}} - 1 = 0 \Leftrightarrow \frac{m + \sum_{i=1}^{N} n_i}{\hat{\hat{b}}} = 1 + \sum_{i=1}^{N} \frac{1}{\tau_i} \Leftrightarrow \hat{\hat{b}} = \frac{m + \sum_{i=1}^{N} n_i}{1 + \sum_{i=1}^{N} \frac{1}{\tau_i}}.   (51)

Also:

\frac{\partial \ln L(\vec{s}, b)}{\partial s_k}\bigg|_{(\vec{s},b) = (\vec{\hat{s}},\hat{b})} = \frac{n_k}{\hat{s}_k + \frac{\hat{b}}{\tau_k}} - 1 = 0 \Leftrightarrow \hat{s}_k = n_k - \frac{\hat{b}}{\tau_k},   (52)

and:

\frac{\partial \ln L(\vec{s}, b)}{\partial b}\bigg|_{(\vec{s},b) = (\vec{\hat{s}},\hat{b})} = \sum_{i=1}^{N} \left(\frac{n_i}{\tau_i \hat{s}_i + \hat{b}} - \frac{1}{\tau_i}\right) + \frac{m}{\hat{b}} - 1 = 0 \Leftrightarrow \sum_{i=1}^{N} \frac{n_i}{\tau_i \hat{s}_i + \hat{b}} + \frac{m}{\hat{b}} = 1 + \sum_{i=1}^{N} \frac{1}{\tau_i}.   (53)

But \tau_i \hat{s}_i + \hat{b} = \tau_i n_i by Eq. 52, so Eq. 53 becomes:

\sum_{i=1}^{N} \frac{1}{\tau_i} + \frac{m}{\hat{b}} = 1 + \sum_{i=1}^{N} \frac{1}{\tau_i} \Leftrightarrow \hat{b} = m,   (54)

and so our solutions for \hat{s}_k and \hat{b} are:

\left\{ \hat{s}_k = n_k - \frac{m}{\tau_k}, \quad \hat{b} = m; \quad k = 1, \ldots, N \right\}.   (55)

Using Eq. 15, we can write our significance of discovery in the asymptotic limit as:

Z = \sqrt{-2 \left( \sum_{i=1}^{N} \left( n_i \ln\left(\frac{\hat{\hat{b}}}{\tau_i \hat{s}_i + \hat{b}}\right) + \hat{s}_i + \frac{\hat{b}}{\tau_i} - \frac{\hat{\hat{b}}}{\tau_i} \right) + m \ln\left(\frac{\hat{\hat{b}}}{\hat{b}}\right) + (\hat{b} - \hat{\hat{b}}) \right)}.   (56)

Assuming Asimov data, we let n_i = s_i + b/\tau_i and m = b.

We now consider the typical case where the signal yields s_1, s_2, \ldots, s_N among our N SRs are correlated and tuned by a single POI, our signal strength \mu. Our expressions in the previous section are modified by letting s_i \to \mu s_i \forall i = 1, \ldots, N and accordingly L(\vec{s}, b) \to L(\mu, b). The derivation for \hat{\hat{b}} proceeds in the same way as in Section 2.2.1, ultimately yielding Eq. 51. Eq. 52 is modified to be a single derivative with respect to \mu as:

\frac{\partial \ln L(\mu, b)}{\partial \mu}\bigg|_{(\mu,b) = (\hat{\mu},\hat{b})} = \sum_{i=1}^{N} s_i \left(\frac{n_i}{\hat{\mu} s_i + \frac{\hat{b}}{\tau_i}} - 1\right) = 0,   (57)

and Eq. 53 becomes:

\frac{\partial \ln L(\mu, b)}{\partial b}\bigg|_{(\mu,b) = (\hat{\mu},\hat{b})} = \sum_{i=1}^{N} \left(\frac{n_i}{\tau_i \hat{\mu} s_i + \hat{b}} - \frac{1}{\tau_i}\right) + \frac{m}{\hat{b}} - 1 = 0.   (58)

While these equations are difficult to solve in the general sense, we may propose the ansatz that \hat{\mu} = 1 and \hat{b} = b when assuming Asimov data (i.e., n_i = s_i + b/\tau_i \forall i = 1, \ldots, N and m = b). While the solution is intuitive, it can be explicitly checked to solve Eq. 57:

\sum_{i=1}^{N} s_i \left(\frac{n_i}{\hat{\mu} s_i + \frac{\hat{b}}{\tau_i}} - 1\right) = \sum_{i=1}^{N} s_i \left(\frac{s_i + \frac{b}{\tau_i}}{s_i + \frac{b}{\tau_i}} - 1\right) = \sum_{i=1}^{N} s_i (1 - 1) = 0,   (59)

and explicitly checked to solve Eq. 58:

\sum_{i=1}^{N} \left(\frac{n_i}{\tau_i \hat{\mu} s_i + \hat{b}} - \frac{1}{\tau_i}\right) + \frac{m}{\hat{b}} - 1 = \sum_{i=1}^{N} \left(\frac{s_i + \frac{b}{\tau_i}}{\tau_i s_i + b} - \frac{1}{\tau_i}\right) + \frac{b}{b} - 1 = \sum_{i=1}^{N} \left(\frac{1}{\tau_i} - \frac{1}{\tau_i}\right) + 1 - 1 = 0.   (60)

In terms of our median significance of discovery, Eq. 56, we let \hat{s}_i \to \hat{\mu} s_i = s_i, \hat{b} = b, and \hat{\hat{b}} be given by Eq. 51 (using Asimov data) to yield:

Z = \sqrt{-2 \left( \sum_{i=1}^{N} \left( \left(s_i + \frac{b}{\tau_i}\right) \ln\left(\frac{\hat{\hat{b}}}{\tau_i s_i + b}\right) + s_i + \frac{b}{\tau_i} - \frac{\hat{\hat{b}}}{\tau_i} \right) + b \ln\left(\frac{\hat{\hat{b}}}{b}\right) + (b - \hat{\hat{b}}) \right)}.   (61)

As in Section 2.1.4, we have numerically validated our results for the scenario where we have 3 SRs, N = 3, and 1 CR, sampling Poisson PDFs in each of our 4 bins (with mean values of s_1 + b/\tau_1, s_2 + b/\tau_2, s_3 + b/\tau_3, and b) in order to generate our simulated yields. The median significance of discovery has been calculated for 30 different hypothesized measurement scenarios (i.e., 3 curves with 10 points per curve), and we have plotted Eq. 61 continuously alongside these numerical results. This is shown in Fig. 3. As before, we see excellent agreement between the numerical and asymptotic results over the range of theoretical yields and parameters studied.

We have also plotted alongside our results the "naive" approximation of the significance of discovery where the bin-by-bin significances are summed in quadrature:

Z = \sqrt{\sum_{i=1}^{N} \left(\frac{s_i}{\sqrt{s_i + \frac{b}{\tau_i}}}\right)^2} = \sqrt{\sum_{i=1}^{N} \frac{s_i^2}{s_i + \frac{b}{\tau_i}}},   (62)

and indeed in the low s/b regime (where we are in this regime if all bins are in this regime), we see good agreement between all three methods. In the s/b \sim 1 regime or larger (where we are in this regime if any bin is in this regime), the naive approximation fails and no longer shows good agreement with the asymptotic and numerical methods, as expected.
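Eq. 61 and the naive quadrature estimate of Eq. 62 can be compared directly; a short sketch with illustrative yields (the N = 1 call reproduces the on/off result of Eq. 31):

```python
import math

def z_multi_sr(s, b, tau):
    """Eq. 61: Asimov median discovery significance for N SRs + 1 CR,
    with n_i = s_i + b/tau_i and m = b."""
    n = [si + b / ti for si, ti in zip(s, tau)]
    bhh = (b + sum(n)) / (1.0 + sum(1.0 / ti for ti in tau))    # Eq. 51
    arg = sum(ni * math.log(bhh / (ti * si + b)) + si + (b - bhh) / ti
              for ni, si, ti in zip(n, s, tau))
    arg += b * math.log(bhh / b) + (b - bhh)
    return math.sqrt(-2.0 * arg)

def z_quad(s, b, tau):
    """Eq. 62: bin-by-bin significances summed in quadrature."""
    return math.sqrt(sum(si * si / (si + b / ti) for si, ti in zip(s, tau)))

print(z_multi_sr([20.0], 90.0, [3.0]))          # ~2.814, cf. Eq. 31
print(z_multi_sr([5.0, 5.0, 2.0], 100.0, [2.0, 10.0, 20.0]),
      z_quad([5.0, 5.0, 2.0], 100.0, [2.0, 10.0, 20.0]))
```

The quadrature sum ignores the CR statistical uncertainty on b, so the two estimates approach each other only when the CR is high-stats (large \tau_i) and all bins sit in the small s/b regime.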
Figure 3: The median significance of discovery as a function of SR 1 signal yield (s_1), SR 2 signal yield (s_2), and CR background yield (b) for the 3 SR bins + 1 CR bin measurement described in Section 2.2.2. The SR 3 signal yield s_3 is assumed to be 12. The transfer factors for the background from the CR to SRs 1 (\tau_1), 2 (\tau_2), and 3 (\tau_3) are assumed to be 2, 10, and 20, respectively. "Numerical" refers to the results calculated using MC toys (50,000 events for the estimation of f(t_0 \mid \mu' = 0) and 50,000 events for the estimation of f(t_0 \mid \mu' = 1), per point), "Asymptotic" refers to Eq. 61, and "Simple" refers to \sqrt{\sum_{i=1}^{N} s_i^2/(s_i + b/\tau_i)}. Note: the last MC data point on the green curve, (s_2, b) = (25, …), is omitted due to a failed p-value calculation.

2.3 N Signal Regions + M Control Regions
We now consider the case where we have N SRs (read as: N signal bins) and M CRs (one for each background process). We assume the definitions of the CRs are SR-independent and orthogonal. For i = 1, \ldots, N SRs, we have transfer matrices \{\tau^i; i = 1, \ldots, N\}, where \tau^i is the transfer matrix carrying the yields in SR i to our M CRs (e.g., \tau^i_{jk} carries the yield for background k in SR i to CR j). Our likelihood is:

L(\vec{s}, B) = \prod_{i=1}^{N} \left\{ P\left(n_i \,\middle|\, s_i + \sum_{j=1}^{M} b_{ji}\right) \right\} \cdot \prod_{j=1}^{M} \left\{ P\left(m_j \,\middle|\, \sum_{j'=1}^{M} \tau^{i'}_{jj'} b_{j'i'}\right) \right\},   (63)

where B is our matrix of background yields in our SRs, where each row corresponds to a specific background process and each column corresponds to a specific SR (e.g., b_{ji} \equiv [B]_{ji} corresponds to the yield for background j in SR i). Additionally, i' can be any integer from 1 to N, but for consistent background yields in a given CR, regardless of which SR we are extrapolating from, we require \sum_{j'=1}^{M} \tau^{i'}_{jj'} b_{j'i'} = S_j \forall i' \in [1, N], where S_j is the expected sum of weights in CR j (i.e., a constant). Taking the logarithm of Eq. 63:

\ln L(\vec{s}, B) = \sum_{i=1}^{N} \left\{ n_i \ln\left(s_i + \sum_{j=1}^{M} b_{ji}\right) - s_i - \sum_{j=1}^{M} b_{ji} \right\} + \sum_{j=1}^{M} \left\{ m_j \ln\left(\sum_{j'=1}^{M} \tau^{i'}_{jj'} b_{j'i'}\right) - \sum_{j'=1}^{M} \tau^{i'}_{jj'} b_{j'i'} \right\},   (64)

where we have dropped the constant -\sum_{i=1}^{N} \ln n_i! - \sum_{j=1}^{M} \ln m_j!.
Going ahead:

\[
\left.\frac{\partial \ln L(\vec{0}, B)}{\partial b_{\ell k}}\right|_{B=\hat{\hat{B}}} = \frac{n_k}{\sum_{j=1}^{M} \hat{\hat{b}}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{\sum_{j'=1}^{M} \tau^{k}_{jj'} \cdot \hat{\hat{b}}_{j'k}} - 1 \right) = 0
\;\Leftrightarrow\;
\frac{n_k}{\sum_{j=1}^{M} \hat{\hat{b}}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{\hat{B}}]_{jk}} - 1 \right) = 0, \tag{65}
\]

where we have selected i′ = k when taking our partial derivative and compactified our denominator summation using matrix multiplication. So our best-fit values in the absence of signal are the solutions to the set of coupled equations:

\[
\frac{n_k}{\sum_{j=1}^{M} \hat{\hat{b}}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{\hat{B}}]_{jk}} - 1 \right) = 0\,;\quad k = 1, \ldots, N,\; \ell = 1, \ldots, M. \tag{66}
\]

Also:

\[
\left.\frac{\partial \ln L(\vec{s}, B)}{\partial s_k}\right|_{\vec{s}=\hat{\vec{s}},\, B=\hat{B}} = \frac{n_k}{\hat{s}_k + \sum_{j=1}^{M} \hat{b}_{jk}} - 1 = 0
\;\Leftrightarrow\;
\hat{s}_k = n_k - \sum_{j=1}^{M} \hat{b}_{jk}, \tag{67}
\]

and:

\[
\left.\frac{\partial \ln L(\vec{s}, B)}{\partial b_{\ell k}}\right|_{\vec{s}=\hat{\vec{s}},\, B=\hat{B}} = \frac{n_k}{\hat{s}_k + \sum_{j=1}^{M} \hat{b}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{\sum_{j'=1}^{M} \tau^{k}_{jj'} \cdot \hat{b}_{j'k}} - 1 \right) = 0
\;\Leftrightarrow\;
\sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{B}]_{jk}} - 1 \right) = 0, \tag{68}
\]

where we have subbed in Eq. 67. So our best-fit values in the presence of signal are the solutions to the set of coupled equations:

\[
\left\{ \hat{s}_k = n_k - \sum_{j=1}^{M} \hat{b}_{jk}\,,\;\; \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{B}]_{jk}} - 1 \right) = 0\,;\quad k = 1, \ldots, N,\; \ell = 1, \ldots, M \right\}. \tag{69}
\]

Using Eq. 15, the significance of discovery in the asymptotic limit is:

\[
Z = \sqrt{ -2 \cdot \left[ \sum_{i=1}^{N} \left\{ n_i \cdot \ln\left( \frac{\sum_{j=1}^{M} \hat{\hat{b}}_{ji}}{n_i} \right) + n_i - \sum_{j=1}^{M} \hat{\hat{b}}_{ji} \right\} + \sum_{j=1}^{M} \left\{ m_j \cdot \ln\left( \frac{[\tau^{i'} \cdot \hat{\hat{B}}]_{ji'}}{[\tau^{i'} \cdot \hat{B}]_{ji'}} \right) - [\tau^{i'} \cdot (\hat{\hat{B}} - \hat{B})]_{ji'} \right\} \right] }. \tag{70}
\]
Assuming Asimov data, we would let \(n_i = s_i + \sum_{j=1}^{M} b_{ji}\) and \(m_j = [\tau^{i'} \cdot B]_{ji'}\).

As in Section 2.2.2, we consider the case where our signal yields s₁, s₂, …, s_N among our N SRs are correlated and tuned by our signal strength µ. Our expressions in the previous section are modified by letting s_i → µs_i ∀ i = 1, …, N and accordingly \(L(\vec{s}, B) \rightarrow L(\mu, B)\). The derivation for \(B = \hat{\hat{B}}\) proceeds in the same way as in Section 2.3.1 and yields Eq. 66. Eq. 67 is modified to be a single derivative with respect to µ:

\[
\left.\frac{\partial \ln L(\mu, B)}{\partial \mu}\right|_{(\mu, B)=(\hat{\mu}, \hat{B})} = \sum_{i=1}^{N} s_i \cdot \left( \frac{n_i}{\hat{\mu} s_i + \sum_{j=1}^{M} \hat{b}_{ji}} - 1 \right) = 0, \tag{71}
\]

and Eq. 68 becomes:

\[
\left.\frac{\partial \ln L(\mu, B)}{\partial b_{\ell k}}\right|_{(\mu, B)=(\hat{\mu}, \hat{B})} = \frac{n_k}{\hat{\mu} \cdot s_k + \sum_{j=1}^{M} \hat{b}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{B}]_{jk}} - 1 \right) = 0. \tag{72}
\]

Assuming Asimov data \(n_i = s_i + \sum_{j=1}^{M} b_{ji}\) ∀ i = 1, …, N and \(m_j = [\tau^{i} \cdot B]_{ji}\) ∀ j = 1, …, M where i ∈ [1, N], the solutions to Eqs. 71 and 72 can be shown to be µ̂ = 1 and \(\hat{B} = B\) (N.B.: here, the elements of B are the theoretical background yields).

2.4 1 Signal Region + M Gaussian Background Constraints
We consider the case where we have 1 SR (read as: 1 bin) with a signal process yield s and N background processes with M Gaussian constraints on those backgrounds (read as: M nuisance parameters). We assume nuisance parameter j, θ_j, is described by a Gaussian constraint with a nominal value of 0, a mean of θ_j, and a variance of 1, as well as that the nuisance parameters are related via a correlation matrix Σ. Additionally, we assume background process i is affected by the nuisance parameters via:

\[
b_i \rightarrow \prod_{j=1}^{M} R_{ij}(\theta_j) \cdot b_i, \tag{73}
\]

where \(R_{ij}(\theta_j)\) is the response function of background i to nuisance parameter j, as described in Ref. [11]. We'll condense our notation in the following equations by letting \(R_i(\vec{\theta}) \equiv \prod_{j=1}^{M} R_{ij}(\theta_j)\). In our single SR bin, we assume the behaviour of a response function is governed by:

\[
R_{ij}(+1) = 1 + \sigma^{+}_{ij}\,,\quad R_{ij}(0) = 1\,,\quad R_{ij}(-1) = 1 - \sigma^{-}_{ij}, \tag{74}
\]

where \(\sigma^{+}_{ij}\) and \(\sigma^{-}_{ij}\) are constants dependent on the nuisance parameter considered (i.e., nuisance parameter j) and affecting the overall normalization of background i (i.e., they are relative "uncertainties"). The functional form of \(R_{ij}(\theta_j)\) for θ_j ∈ [−1, +1] is left free. Note: \(R_{ij}(\theta_j) = 1\) if background i is not affected by nuisance parameter j.

Our likelihood may be written as:

\[
L(s, \vec{\theta}) = P\left( n \,\middle|\, s + \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i \right) \cdot G\left( \vec{0} \,\middle|\, \vec{\theta}, \Sigma \right), \tag{75}
\]

where we have moved the functional dependence from \(\vec{b}\) to \(\vec{\theta}\) (as the value of \(\vec{\theta}\) tunes the background yields) and where the diagonal elements of Σ are assumed to be 1. If the nuisance parameters were fully decoupled from one another, Σ would be a diagonal matrix and our M-dimensional Gaussian function may be written as the product of M one-dimensional Gaussians. Explicitly:

\[
L(s, \vec{\theta}) = \frac{\left( s + \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i \right)^{n} \exp\left( -\left( s + \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i \right) \right)}{n!} \cdot \frac{\exp\left( -\frac{1}{2} \cdot \vec{\theta}^{\,\top} \Sigma^{-1} \vec{\theta} \right)}{\sqrt{(2\pi)^{M} |\Sigma|}}, \tag{76}
\]

or taking the logarithm:

\[
\ln L(s, \vec{\theta}) = n \cdot \ln\left( s + \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i \right) - s - \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i - \frac{1}{2} \cdot \vec{\theta}^{\,\top} \Sigma^{-1} \vec{\theta}, \tag{77}
\]

where we dropped the constant \(-\left( \ln n! + \ln \sqrt{(2\pi)^{M} |\Sigma|} \right)\). Note that we can also write the matrix multiplication in the last term using sums:

\[
\vec{\theta}^{\,\top} \Sigma^{-1} \vec{\theta} = \sum_{i=1}^{M} \sum_{j=1}^{M} [\Sigma^{-1}]_{ij} \theta_i \theta_j = \sum_{i=1}^{M} \sum_{j=i}^{M} (2 - \delta_{ij}) [\Sigma^{-1}]_{ij} \theta_i \theta_j, \tag{78}
\]

where \([\Sigma^{-1}]_{ij}\) refers to the element in the i-th row and j-th column of Σ⁻¹ and \(\delta_{ij}\) is the Kronecker delta function.
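The upper-triangle form of Eq. 78 is easy to spot-check numerically (the 3 × 3 correlation matrix below is an assumed toy example, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 3
Sigma = np.array([[1.0, 0.3, 0.1],     # symmetric, unit diagonal, as assumed in the text
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
theta = rng.normal(size=M)

# Full double sum over i, j (the quadratic form itself).
full = theta @ Sigma_inv @ theta
# Upper-triangle form: sum_i sum_{j >= i} (2 - delta_ij) [Sigma^-1]_ij theta_i theta_j.
upper = sum((2.0 - (i == j)) * Sigma_inv[i, j] * theta[i] * theta[j]
            for i in range(M) for j in range(i, M))
assert np.isclose(full, upper)

# The gradient formula (Eq. 79 in the text below) can be checked by central
# finite differences of the quadratic form.
eps = 1e-6
for k in range(M):
    d = np.zeros(M)
    d[k] = eps
    num = ((theta + d) @ Sigma_inv @ (theta + d)
           - (theta - d) @ Sigma_inv @ (theta - d)) / (2.0 * eps)
    assert np.isclose(num, 2.0 * Sigma_inv[k] @ theta, atol=1e-5)
```

Both identities rely only on Σ (and hence Σ⁻¹) being symmetric.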
In writing Eq. 78, we have used the fact that if Σ is symmetric then its inverse is also symmetric: \(\Sigma^{-1} = (\Sigma^{-1})^{\top}\). From the above, it is also apparent that:

\[
\frac{\partial \left( \vec{\theta}^{\,\top} \Sigma^{-1} \vec{\theta} \right)}{\partial \theta_k} = 2 \cdot \sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \theta_j. \tag{79}
\]

Continuing ahead, we consider our best-fit values in the absence of signal:

\[
\left.\frac{\partial \ln L(0, \vec{\theta})}{\partial \theta_k}\right|_{\vec{\theta}=\hat{\hat{\vec{\theta}}}} = \left( \frac{n}{\sum_{i=1}^{N} R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i} - 1 \right) \cdot \sum_{i=1}^{N} \left( \left.\frac{\partial R_{ik}(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} \right) \cdot \frac{R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i}{R_{ik}(\hat{\hat{\theta}}_k)} - \sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \hat{\hat{\theta}}_j = 0 \tag{80}
\]

(the above is not unlike Eq. 8 of Ref. [11] for decorrelated constraints), and so our system of equations solving for \(\hat{\hat{\vec{\theta}}}\) is:

\[
\left( \frac{n}{\sum_{i=1}^{N} R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i} - 1 \right) \cdot \sum_{i=1}^{N} \left( \left.\frac{\partial R_{ik}(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} \right) \cdot \frac{R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i}{R_{ik}(\hat{\hat{\theta}}_k)} - \sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \hat{\hat{\theta}}_j = 0\,;\quad k = 1, \ldots, M. \tag{81}
\]

We now consider the best-fit values in the context of our tested hypothesis:

\[
\left.\frac{\partial \ln L(s, \vec{\theta})}{\partial s}\right|_{(s, \vec{\theta})=(\hat{s}, \hat{\vec{\theta}})} = \frac{n}{\hat{s} + \sum_{i=1}^{N} R_i(\hat{\vec{\theta}}) \cdot b_i} - 1 = 0
\;\Leftrightarrow\;
\hat{s} = n - \sum_{i=1}^{N} R_i(\hat{\vec{\theta}}) \cdot b_i, \tag{82}
\]

and:

\[
\left.\frac{\partial \ln L(s, \vec{\theta})}{\partial \theta_k}\right|_{(s, \vec{\theta})=(\hat{s}, \hat{\vec{\theta}})} = \left( \frac{n}{\hat{s} + \sum_{i=1}^{N} R_i(\hat{\vec{\theta}}) \cdot b_i} - 1 \right) \cdot \sum_{i=1}^{N} \left( \left.\frac{\partial R_{ik}(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\theta}_k} \right) \cdot \frac{R_i(\hat{\vec{\theta}}) \cdot b_i}{R_{ik}(\hat{\theta}_k)} - \sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \hat{\theta}_j = 0
\;\Rightarrow\;
\sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \hat{\theta}_j = 0, \tag{83}
\]

where we substituted in Eq. 82. The above can be simultaneously satisfied for all k by writing \(\Sigma^{-1} \hat{\vec{\theta}} = \vec{0} \Rightarrow \hat{\vec{\theta}} = \vec{0}\). Accordingly, \(\hat{s} = n - \sum_{i=1}^{N} R_i(\vec{0}) \cdot b_i = n - \sum_{i=1}^{N} b_i\). Our significance of discovery in the asymptotic limit is thus:

\[
Z = \sqrt{ -2 \cdot \ln \frac{L(0, \hat{\hat{\vec{\theta}}})}{L(\hat{s}, \hat{\vec{\theta}})} }
= \sqrt{ -2 \cdot \left[ n \cdot \ln\left( \frac{\sum_{i=1}^{N} R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i}{n} \right) + n - \sum_{i=1}^{N} R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i - \frac{1}{2} \cdot \hat{\hat{\vec{\theta}}}^{\,\top} \Sigma^{-1} \hat{\hat{\vec{\theta}}} \right] }. \tag{84}
\]

Assuming Asimov data, we let \(n = s + \sum_{i=1}^{N} b_i\) in Eq. 84 and appropriately sub in the solutions from Eq. 81.

2.4.1 N Backgrounds and M = N Decorrelated Constraints, 1-Per-Background
We consider N backgrounds with M = N Gaussian constraints, one per background. We also assume the constraints are decorrelated (i.e., Σ = I ⇒ Σ⁻¹ = I, where I is the identity matrix). Then \(R_i(\vec{\theta}) \rightarrow R_i(\theta_i)\) and \(R_{ij}(\theta_j) \rightarrow R_i(\theta_i)\) (i.e., the total response function for background i is only a function of θ_i, the nuisance parameter governing the constraint which affects only background i), implying:

\[
\frac{\partial R_{ik}(\theta_k)}{\partial \theta_k} = \frac{\partial R_i(\theta_i)}{\partial \theta_k} = \delta_{ik} \cdot \frac{\partial R_k(\theta_k)}{\partial \theta_k}.
\]

Applying the above to Eq. 81 yields:

\[
\left\{ \left( \frac{n}{\sum_{i=1}^{N} R_i(\hat{\hat{\theta}}_i) \cdot b_i} - 1 \right) \cdot \left.\frac{\partial R_k(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} \cdot b_k - \hat{\hat{\theta}}_k = 0\,;\quad k = 1, \ldots, N \right\}. \tag{85}
\]

For an additional level of simplicity, we assume the response functions are linear in θ_k and that \(\sigma^{+}_k = \sigma^{-}_k \equiv \sigma_k\): \(R_k(\theta_k) = \sigma_k \cdot \theta_k + 1\), where σ_k can be interpreted as the relative "uncertainty" on the normalization of background k. Consider the redefinitions: \(R_k(\theta_k) \cdot b_k \rightarrow b_k(\theta_k)\) and \(\sigma_k \cdot b_k \rightarrow \sigma_k\) (i.e., σ_k is now an absolute "uncertainty"). Then we let \(\hat{\hat{b}}_k \equiv b_k(\hat{\hat{\theta}}_k)\) and:

\[
\left.\frac{\partial R_k(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} \cdot b_k = \left.\frac{\partial \left( R_k(\theta_k) \cdot b_k \right)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} = \left.\frac{\partial b_k(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} = \sigma_k. \tag{86}
\]

We identify \(\hat{\hat{\theta}}_k = (\hat{\hat{b}}_k - b_k)/\sigma_k\) (i.e., \(\hat{\hat{\theta}}_k\) controls the tuning of the best-fit \(\hat{\hat{b}}_k\) away from the nominal value \(b_k\), consistent with our earlier definition of the response function), so inserting this and Eq. 86 into Eq. 85:

\[
\left\{ \left( \frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} - 1 \right) \cdot \sigma_k - \frac{\hat{\hat{b}}_k - b_k}{\sigma_k} = 0\,;\quad k = 1, \ldots, N \right\}, \tag{87}
\]

which can be solved to yield the best-fit values of the backgrounds in the absence of signal. Finally, with our assumptions and redefinitions, Eq. 84 becomes:

\[
Z = \sqrt{ -2 \cdot \left( n \cdot \ln\left( \frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n} \right) + n - \sum_{i=1}^{N} \left( \hat{\hat{b}}_i + \frac{(b_i - \hat{\hat{b}}_i)^2}{2\sigma_i^2} \right) \right) }, \tag{88}
\]

which is our significance of discovery in the asymptotic limit. Assuming Asimov data, we would let \(n = s + \sum_{i=1}^{N} b_i\). This system of equations is solved in a general sense in Appendix A.
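As a sketch of how Eqs. 87 and 88 can be evaluated in practice (the yields are assumed toy values and the helper `asymptotic_Z` is illustrative, not code from the paper), a generic root-finder suffices for the coupled system:

```python
import numpy as np
from scipy.optimize import fsolve

def asymptotic_Z(s, b, sigma):
    """Median discovery significance from Eqs. 87-88 with Asimov data n = s + sum(b).

    b and sigma are arrays of nominal background yields and their absolute
    uncertainties (an illustrative helper under the linear-response assumption)."""
    b = np.asarray(b, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = s + b.sum()
    # Eq. 87: (n / sum(bhh) - 1) * sigma_k - (bhh_k - b_k) / sigma_k = 0 for all k.
    eqs = lambda bhh: (n / bhh.sum() - 1.0) * sigma - (bhh - b) / sigma
    bhh = fsolve(eqs, b)
    assert np.allclose(eqs(bhh), 0.0, atol=1e-6)
    # Eq. 88 with the solved best-fit backgrounds substituted in.
    q0 = -2.0 * (n * np.log(bhh.sum() / n) + n
                 - np.sum(bhh + (b - bhh) ** 2 / (2.0 * sigma ** 2)))
    return np.sqrt(q0)

# Toy point: s = 10 on top of b1 = b2 = 25 with absolute uncertainties 1.25 and 2.5
# (5% and 10% relative); Z comes out at roughly 1.27 for these inputs.
Z = asymptotic_Z(s=10.0, b=[25.0, 25.0], sigma=[1.25, 2.5])

# Sanity check: as sigma -> 0 the constraints pin the backgrounds and Z approaches
# the familiar single-bin Asimov result sqrt(2 * ((s + b) ln(1 + s/b) - s)).
s0, btot = 10.0, 50.0
Z_tight = asymptotic_Z(s0, [25.0, 25.0], [0.01, 0.01])
Z_exact = np.sqrt(2.0 * ((s0 + btot) * np.log(1.0 + s0 / btot) - s0))
assert abs(Z_tight - Z_exact) < 1e-3
```

The σ → 0 limit provides a convenient cross-check that the profiled constraint terms behave as expected.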
We have also numerically validated our results for the scenario where we have 2 backgrounds, each with 1 Gaussian constraint. To simulate our yields, we sample a Poisson PDF with mean s + b₁ + b₂ in our SR. Additionally, to avoid biasing ourselves, we must sample the nominal value of each of the nuisance parameters from a Gaussian PDF centered on b_i with width σ_i for i = 1 and then i = 2. The sampled values then become the "true" nominal values of the Gaussian constraints used in our maximum likelihood fit (i.e., the constraints look like \(G(\tilde{b}_i | b_i, \sigma_i)\) with \(\tilde{b}_i\) sampled from \(G(x | b_i, \sigma_i)\)). The median significance of discovery has been calculated for 70 different hypothesized measurement scenarios (i.e., 7 curves with 10 points per curve), and we plotted Eq. 88 continuously alongside these numerical results (derived via the procedure described in Appendix A). This is shown in Figs. 4 and 5. As before, we see excellent agreement between the numerical and asymptotic results over the range of theoretical yields and parameters studied.

We have also plotted alongside our results the "naive" approximation of the significance of discovery:

\[
Z = \frac{s}{\sqrt{s + b_1 + b_2 + \sigma_1^2 + \sigma_2^2}}, \tag{89}
\]

(where σ₁ and σ₂ are given as the absolute uncertainties on backgrounds 1 and 2, respectively) and indeed in the low s/b (i.e., high b₁ + b₂) regime we see good agreement between all three methods. In the s/b ∼ 1 regime (i.e., low b₁ + b₂), the naive approximation fails and no longer shows good agreement with the asymptotic and numerical methods, as expected.
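A condensed sketch of this toy procedure (2,000 toys rather than 50,000, and an assumed single point s = 10, b₁ = b₂ = 25 with absolute uncertainties σ₁ = 1.25, σ₂ = 2.5; not the paper's actual code) might look like:

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(42)
s = 10.0
b = np.array([25.0, 25.0])        # nominal background yields
sigma = np.array([1.25, 2.5])     # absolute uncertainties

def q0(n, b_tilde):
    """Discovery test statistic for one toy: profile the backgrounds at mu = 0."""
    eqs = lambda bh: (n / bh.sum() - 1.0) * sigma - (bh - b_tilde) / sigma
    bh = fsolve(eqs, b_tilde)
    # mu-hat = (n - sum(b_tilde)) / s here, so by convention q0 = 0 when mu-hat < 0.
    if n <= b_tilde.sum():
        return 0.0
    return max(0.0, -2.0 * (n * np.log(bh.sum() / n) + n
                            - np.sum(bh + (b_tilde - bh) ** 2 / (2.0 * sigma ** 2))))

# Toys thrown under the signal-plus-background hypothesis; the constraint centres
# b~_i are themselves sampled, as described above, to avoid biasing the result.
z_vals = []
for _ in range(2000):
    n = rng.poisson(s + b.sum())
    b_tilde = rng.normal(b, sigma)
    z_vals.append(np.sqrt(q0(n, b_tilde)))
z_mc = np.median(z_vals)
```

The median of √q₀ over the toys should land near the Asimov prediction of Eq. 88 for the same inputs (roughly 1.27 here), mirroring the agreement seen in Figs. 4 and 5.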
Figure 4: The median significance of discovery as a function of SR background 1 yield (b₁) and SR background 2 yield (b₂) for the 1 SR bin + 2 decorrelated Gaussian constraints measurement described in Sec. 2.4.2. The SR signal yield is assumed to be 10. The relative uncertainties on backgrounds 1 and 2 are assumed to be σ₁ = 5% and σ₂ = 10%, respectively. "Numerical" refers to the results calculated using MC toys (50,000 events for the estimation of f(t|µ′ = 0) and 50,000 events for the estimation of f(t|µ′ = 1), per point), "Asymptotic" refers to Eq. 88, and "Simple" refers to \(s/\sqrt{s + b_1 + b_2 + \sigma_1^2 + \sigma_2^2}\).

Figure 5: The median significance of discovery as a function of the uncertainty on the SR background 1 yield (σ₁), the SR background 1 yield (b₁), and the SR background 2 yield (b₂) for the 1 SR bin + 2 decorrelated Gaussian constraints measurement described in Sec. 2.4.2. The SR signal yield is assumed to be 10. The relative uncertainty on the background 2 yield is assumed to be σ₂ = 10%. "Numerical" refers to the results calculated using MC toys (50,000 events for the estimation of f(t|µ′ = 0) and 50,000 events for the estimation of f(t|µ′ = 1), per point), "Asymptotic" refers to Eq. 88, and "Simple" refers to \(s/\sqrt{s + b_1 + b_2 + \sigma_1^2 + \sigma_2^2}\).

2.4.3 Assuming N = 1 Backgrounds and M = 1 Constraints
As a special case of the previous section, we consider only N = 1 backgrounds with M = N = 1 Gaussian constraints on this background (with 1 background and 1 constraint, we are automatically in the regime of "decorrelated" constraints). Letting b ≡ b₁, θ ≡ θ₁, σ ≡ σ₁, Eq. 87 becomes:

\[
\left( \frac{n}{\hat{\hat{b}}} - 1 \right) \cdot \sigma - \frac{\hat{\hat{b}} - b}{\sigma} = 0
\;\Leftrightarrow\;
\left( n - \hat{\hat{b}} \right) \cdot \sigma^2 - (\hat{\hat{b}} - b) \cdot \hat{\hat{b}} = 0
\;\Leftrightarrow\;
-\hat{\hat{b}}^2 + (b - \sigma^2) \cdot \hat{\hat{b}} + n\sigma^2 = 0
\;\Rightarrow\;
\hat{\hat{b}} = \frac{-(b - \sigma^2) \pm \sqrt{(b - \sigma^2)^2 + 4n\sigma^2}}{-2}, \tag{90}
\]

but as \((b - \sigma^2)^2 + 4n\sigma^2 > (b - \sigma^2)^2\) (all variables in this expression are positive), we choose the minus sign to give an overall positive (i.e., physical) solution for \(\hat{\hat{b}}\):

\[
\hat{\hat{b}} = \frac{(b - \sigma^2) + \sqrt{(b - \sigma^2)^2 + 4n\sigma^2}}{2}. \tag{91}
\]

Using Eq. 88, our significance of discovery is:

\[
Z = \sqrt{ -2 \cdot \left( n \cdot \ln\left( \frac{\hat{\hat{b}}}{n} \right) + n - \hat{\hat{b}} - \frac{(b - \hat{\hat{b}})^2}{2\sigma^2} \right) }, \tag{92}
\]

with Eq. 91 appropriately subbed in, matching what is shown in Eq. 26 of Ref. [12].

We have presented a collection of derivations of generalized formulae for estimating the median significance of discovery in the asymptotic limit for various measurement models in HEP. The formulae have been verified to agree with numerical results using MC toys in select cases; other times, they are shown to reduce to known formulae in simpler cases derived by similar means. In the low s/b regime, simpler versions of the asymptotic formulae based on s/√b do just as well as the more accurate formulae derived in this document. In the s/b ∼ 1 regime, however, the simpler formulae fail, and the formulae derived here should be used instead. Future work includes simplifying the results for N SRs + M CRs in Section 2.3 when the CRs are assumed to be pure in each background (i.e., diagonal τⁱ ∀ i ∈ [1, N]), generalizing the type of constraint in Section 2.4 (e.g., Gaussian, log-normal, etc.), and generalizing the NPs in Section 2.4 to also act on signal.

References

[1] K. Cranmer,
Practical Statistics for the LHC, CERN Yellow Rep. CERN-2014-003 (2015) 267–308, arXiv:1503.07622.

[2] G. Cowan, K. Cranmer, E. Gross, and O. Vitells, Asymptotic formulae for likelihood-based tests of new physics, Eur. Phys. J. C71 (2011) 1554, arXiv:1007.1727.

[3] S. S. Wilks, The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses, Ann. Math. Statist. 9 (1938) 60–62, DOI: 10.1214/aoms/1177732360.

[4] A. Wald, Tests of statistical hypotheses concerning several parameters when the number of observations is large, Trans. Amer. Math. Soc. 54 (1943) 426–482.

[5] P. Ongmongkolkul, C. Deil, C. Cheng, A. Pearce, E. Rodrigues, H. Schreiner, M. Marinangeli, L. Geiger, and H. Dembinski, scikit-hep/probfit: 1.1.0 (Version 1.1.0), November 5, 2018.

[6] F. James and M. Roos, Minuit – A System for Function Minimization and Analysis of the Parameter Errors and Correlations, Comput. Phys. Commun. 10 (1975) 343–367, DOI: 10.1016/0010-4654(75)90039-9.

[7] H. Dembinski, P. Ongmongkolkul, C. Deil, D. Menéndez Hurtado, M. Feickert, H. Schreiner, Andrew, C. Burr, F. Rost, A. Pearce, L. Geiger, B. M. Wiedemann, Gonzalo, and O. Zapata, scikit-hep/iminuit: v1.5.2 (Version v1.5.2), September 24, 2020.

[8] C. R. Harris, K. J. Millman, S. J. van der Walt, et al., Array programming with NumPy, Nature 585 (2020) 357–362, DOI: 10.1038/s41586-020-2649-2.

[9] P. Virtanen, R. Gommers, T. E. Oliphant, et al., SciPy 1.0: fundamental algorithms for scientific computing in Python, Nature Meth. 17 (2020) 261–272, DOI: 10.1038/s41592-019-0686-2.

[10] J. D. Hunter, Matplotlib: A 2D graphics environment, IEEE Comput. Sci. Eng. 9 (2007) 90–95, DOI: 10.1109/MCSE.2007.55.

[11] J. S. Conway, Incorporating Nuisance Parameters in Likelihoods for Multisource Spectra, Proc. PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding (2011) 115–120, arXiv:1103.0354.

[12] The ATLAS Collaboration, Formulae for Estimating Significance, Tech. Rep. ATL-PHYS-PUB-2020-025, CERN, 2020, URL: https://cds.cern.ch/record/2736148.

Appendix A: Solution to \(A/\sum_j x_j + B_i x_i + C_i = 0\), \(i = 1, \ldots, N\)

We consider solutions of the system of equations:

\[
\left\{ \frac{A}{\sum_{j=1}^{N} x_j} + B_i x_i + C_i = 0\,;\quad i = 1, \ldots, N \right\}, \tag{93}
\]

where A is a constant and \(B_i\) and \(C_i\) are equation-dependent constants, analogous to Eq. 87. Subtracting the j-th equation from the i-th equation:

\[
B_i x_i + C_i - B_j x_j - C_j = 0
\;\Leftrightarrow\;
x_j = \frac{B_i x_i + (C_i - C_j)}{B_j}, \tag{94}
\]

which can then be inserted in the i-th equation of Eq. 93 to yield an expression entirely in terms of \(x_i\):

\[
A + B_i x_i \sum_{j=1}^{N} x_j + C_i \sum_{j=1}^{N} x_j = 0
\;\Leftrightarrow\;
A + B_i x_i \sum_{j=1}^{N} \frac{B_i x_i + (C_i - C_j)}{B_j} + C_i \sum_{j=1}^{N} \frac{B_i x_i + (C_i - C_j)}{B_j} = 0
\;\Leftrightarrow\;
\left( B_i^2 \sum_{j=1}^{N} \frac{1}{B_j} \right) x_i^2 + \left( B_i \sum_{j=1}^{N} \frac{2C_i - C_j}{B_j} \right) x_i + \left( A + C_i \sum_{j=1}^{N} \frac{C_i - C_j}{B_j} \right) = 0, \tag{95}
\]

which may be solved using the quadratic equation, selecting the real and positive root for \(x_i\) (or the answer which satisfies the system of equations as the physical solution, as \(x_i \leftrightarrow \hat{\hat{b}}_i\)). From Eq. 87, we identify \(A = n\), \(B_i = -1/\sigma_i^2\), and \(C_i = b_i/\sigma_i^2 - 1\). Thus in Eq. 95, we know:

\[
B_i^2 \sum_{j=1}^{N} \frac{1}{B_j} = -\frac{1}{\sigma_i^4} \sum_{j=1}^{N} \sigma_j^2 < 0, \tag{96}
\]

as \(\sigma_i > 0\) ∀ i. From Eq. 95, we identify the quadratic equation:

\[
-|\tilde{A}_i| x_i^2 + \tilde{B}_i x_i + \tilde{C}_i = 0, \tag{97}
\]

where:

\[
|\tilde{A}_i| = \left| B_i^2 \sum_{j=1}^{N} \frac{1}{B_j} \right| = \frac{1}{\sigma_i^4} \sum_{j=1}^{N} \sigma_j^2\,,
\qquad
\tilde{B}_i = B_i \sum_{j=1}^{N} \frac{2C_i - C_j}{B_j} = \frac{1}{\sigma_i^2} \sum_{j=1}^{N} \sigma_j^2 \cdot \left( \frac{2b_i}{\sigma_i^2} - \frac{b_j}{\sigma_j^2} - 1 \right)\,,
\qquad
\tilde{C}_i = A + C_i \sum_{j=1}^{N} \frac{C_i - C_j}{B_j} = n + \left( 1 - \frac{b_i}{\sigma_i^2} \right) \cdot \sum_{j=1}^{N} \sigma_j^2 \cdot \left( \frac{b_i}{\sigma_i^2} - \frac{b_j}{\sigma_j^2} \right), \tag{98}
\]

which has the solutions:

\[
x_i = \frac{\tilde{B}_i \pm \sqrt{\tilde{B}_i^2 + 4|\tilde{A}_i|\tilde{C}_i}}{2|\tilde{A}_i|}. \tag{99}
\]

Using Eq. 98, it can be shown that:

\[
\tilde{B}_i = \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} - (\sigma_i^2 - b_i) \cdot |\tilde{A}_i|, \tag{100}
\]

and therefore:

\[
\tilde{B}_i^2 + 4|\tilde{A}_i|\tilde{C}_i
= \left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} \right)^2 + (\sigma_i^2 - b_i)^2 \cdot |\tilde{A}_i|^2 - 2 \cdot (\tilde{C}_i - n) \cdot |\tilde{A}_i| + 4|\tilde{A}_i|\tilde{C}_i
= \left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} \right)^2 + (\sigma_i^2 - b_i)^2 \cdot |\tilde{A}_i|^2 + 2 \cdot (\tilde{C}_i - n) \cdot |\tilde{A}_i| + 4n|\tilde{A}_i|
= \left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} + (\sigma_i^2 - b_i) \cdot |\tilde{A}_i| \right)^2 + 4n|\tilde{A}_i|, \tag{101}
\]

which is always positive, so we are always guaranteed a real root in Eq. 99. We consider the difference:

\[
\left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} + (\sigma_i^2 - b_i) \cdot |\tilde{A}_i| \right) - \left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} - (\sigma_i^2 - b_i) \cdot |\tilde{A}_i| \right) = 2 \cdot (\sigma_i^2 - b_i) \cdot |\tilde{A}_i|. \tag{102}
\]

If \(\sigma_i^2 > b_i\), the above is positive: this means the argument of our square root will always be larger than \(\tilde{B}_i^2\) and we must select the positive sign solution for equation i for a physical solution. By Eq. 100, we know:

\[
\tilde{B}_i = \frac{1}{\sigma_i^2} \cdot \sum_{j=1}^{N} \sigma_j^2 \cdot \left( \frac{b_i}{\sigma_i^2} - \frac{b_j}{\sigma_j^2} \right) + (b_i - \sigma_i^2) \cdot |\tilde{A}_i|. \tag{103}
\]

If \(b_i > \sigma_i^2\), the second term in the above is positive.
And if \(\sum_{j=1}^{N} \sigma_j^2 \cdot \left( b_i/\sigma_i^2 - b_j/\sigma_j^2 \right) < 0\) with \(\left| \sum_{j=1}^{N} \sigma_j^2 \cdot \left( b_i/\sigma_i^2 - b_j/\sigma_j^2 \right) \right| > (b_i - \sigma_i^2) \cdot |\tilde{A}_i|\), then we must select the positive sign solution for a physical solution. In the case where \(\sum_{j=1}^{N} \sigma_j^2 \cdot \left( b_i/\sigma_i^2 - b_j/\sigma_j^2 \right) >\)