Generalized asymptotic formulae for estimating statistical significance in high energy physics analyses
M. J. Basso∗
Department of Physics, University of Toronto, 60 St. George St., Toronto, Ontario, Canada
February 9, 2021
Abstract
Within the framework of likelihood-based statistical tests for high energy physics measurements, we derive generalized expressions for estimating the statistical significance of discovery using the asymptotic approximations of Wilks and Wald for a variety of measurement models. These models include arbitrary numbers of signal regions, control regions, and Gaussian constraints. We extend our expressions to use the representative or "Asimov" dataset proposed by Cowan et al. such that they are made Monte Carlo-free. While many of the generalized expressions are complicated and often involve solving systems of coupled, multivariate equations, we show these expressions reduce to closed-form results under simplifying assumptions. We also validate the predicted significance using Monte Carlo toys in select cases.
1 Introduction

In the field of high energy physics (HEP), likelihood-based statistical tests entail the construction of a likelihood function L describing a particular measurement model; the likelihood function in turn describes the "likelihood" of measuring parameters defining the model given some observed data [1]. For counting experiments typical of HEP analyses at the Large Hadron Collider (LHC), the likelihood may be written most simply as a product of Poisson probability density functions (PDFs) over N "regions" or "bins":

L(\vec{\theta}) = \prod_{i=1}^{N} P(n_i \mid \nu_i(\vec{\theta})) = \prod_{i=1}^{N} \frac{\nu_i(\vec{\theta})^{n_i} \exp(-\nu_i(\vec{\theta}))}{n_i!},   (1)

where n_i and \nu_i are the observed and expected yields in region i, respectively, and \vec{\theta} are the free parameters defining our model. Here, we have assumed a uniform prior \pi(\vec{\theta}) for our free parameters (i.e., no prior knowledge). The best-fit parameters for a given measurement will be those which maximize the likelihood.

∗ E-mail: [email protected]

Typically, one is interested in measuring some signal s (e.g., the number of Higgs boson decay events to WW*) given some known or constrained background b (e.g., the number of Drell-Yan events). In this case, s defines our parameter-of-interest (POI), what we are interested in measuring, while b defines a nuisance parameter (NP), a parameter we measure but which may not be physically interesting. We may parametrize the expected yield as \nu(\mu, b) = \mu s + b, where s is now fixed and our signal strength \mu is what tunes the amount of signal, now becoming our POI.
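For concreteness, Eq. 1 is simple to evaluate numerically; the following sketch (the single-region yields and the use of scipy are illustrative assumptions, not part of the note) computes the log-likelihood of a counting model with \nu(\mu, b) = \mu s + b:

```python
import numpy as np
from scipy.stats import poisson

def log_likelihood(n, nu):
    """Log of Eq. 1: a sum of Poisson log-PMFs over the N regions."""
    return float(np.sum(poisson.logpmf(np.asarray(n), np.asarray(nu, float))))

# Illustrative single-region model: nu(mu, b) = mu*s + b
s, b = 10.0, 50.0
n_obs = [58]
print(log_likelihood(n_obs, [1.0 * s + b]))  # log-likelihood at mu = 1
```

Maximizing this quantity over the free parameters is exactly the fit described in the text.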
Absorbing b into \vec{\theta}, which we now assume contains only our NPs, and letting L = L(\mu, \vec{\theta}), we construct the log-likelihood ratio:

\lambda(\mu) = \frac{L(\mu, \vec{\hat{\hat{\theta}}})}{L(\hat{\mu}, \vec{\hat{\theta}})},   (2)

where \hat{\mu} and \vec{\hat{\theta}} are the unconditional maximum likelihood estimators of L (i.e., the values of \mu and \vec{\theta} which set \partial L/\partial\mu = 0 and \partial L/\partial\theta_i = 0 \forall i = 1, \ldots, N, where N is the number of NPs) and \vec{\hat{\hat{\theta}}} is the conditional maximum likelihood estimator of L for fixed \mu (i.e., the values of \vec{\theta} which set \partial L/\partial\theta_i = 0 \forall i = 1, \ldots, N, where N is the number of NPs) [1, 2]. Eq. 2 ranges between 0 and 1, where \lambda(\mu) \sim 1 indicates good agreement between the data and the hypothesized \mu and \lambda(\mu) \ll 1 indicates disagreement. It is convenient to define the test statistic:

t_\mu = -2 \ln \lambda(\mu),   (3)

where t_\mu \sim 0 indicates agreement with the hypothesized \mu and increasing t_\mu indicates increasing disagreement [1, 2]. We may define a p-value, p_\mu, representing the probability of observing equal or greater disagreement with the hypothesized \mu as:

p_\mu = \int_{t_{\mu,\mathrm{obs}}}^{\infty} f(t_\mu \mid \mu)\, dt_\mu,   (4)

where f(t_\mu \mid \mu) is the PDF of the test statistic for fixed \mu and t_{\mu,\mathrm{obs}} is the observed value of the test statistic. As is common practice in HEP, one may translate the p-value into the number of standard deviations from the mean of a standard Gaussian whose integrated, one-sided tail equals such a probability, i.e.:

Z_\mu = \Phi^{-1}(1 - p_\mu),   (5)

where \Phi^{-1} is the inverse cumulative distribution function of a standard Gaussian. This quantity is referred to as the significance of the measurement.

In "discovery" HEP analyses, one looks to measure the presence of signal (i.e., \mu s + b with \mu > 0), testing the null hypothesis H_0 that no signal is present (i.e., \mu' = 0, where \mu' is the tested value of our POI) against the alternative hypothesis H_1 that signal is present in some fixed amount (i.e., \mu' = \mu, where \mu is fixed to some value).
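The p-value-to-significance map of Eq. 5 is available directly from scipy; a minimal illustrative sketch:

```python
from scipy.stats import norm

def p_to_z(p):
    """Eq. 5: Z = Phi^{-1}(1 - p)."""
    return norm.isf(p)   # isf(p) = ppf(1 - p), numerically stable for small p

def z_to_p(z):
    """Inverse map: the one-sided Gaussian tail probability."""
    return norm.sf(z)

print(z_to_p(5.0))  # the conventional 5 sigma threshold, ~2.87e-7
```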
An example of a discovery analysis would be a measurement of vector boson fusion production of Higgs bosons decaying to WW* over the prevailing top quark pair and single top production, Drell-Yan, and diboson backgrounds.

Within the framework of likelihood-based statistical tests for HEP, we adopt the test statistic for discovery analyses proposed by Cowan et al. [2]:

t_0 = \begin{cases} -2\ln\lambda(0) & \text{if } \hat{\mu} \ge 0 \\ 0 & \text{if } \hat{\mu} < 0 \end{cases},   (6)

where, as before, \hat{\mu} is our unconditional maximum likelihood estimator of \mu. As we should not measure a negative signal strength for a signal model predicting an enhancement to our measured yields, we set t_0 equal to 0 as a lower bound on our test statistic (i.e., consistent with the null hypothesis).

(N.B.: it is equally valid to let s be our POI, but in the spirit of consistency with the literature on this topic, we adopt this reparametrization.)

By Eq. 4, our p-value becomes:

p_0 = \int_{t_{0,\mathrm{obs}}}^{\infty} f(t_0 \mid 0)\, dt_0,   (7)

and by Eq. 5, our significance of discovery is:

Z_0 = \Phi^{-1}(1 - p_0).   (8)

To claim the discovery of a signal, it is typical to require that the significance exceeds 5\sigma: Z_0 \ge 5, corresponding to p_0 = 2.87 \times 10^{-7}.

Often, a physicist will want to know the expected significance of a measurement assuming their signal model in MC to be true and correct. In the case of discovery analyses, this will necessitate knowledge of f(t_0 \mid 0), and the median value of t_0, and therefore an approximation of t_{0,\mathrm{obs}}, may be extracted as the median of f(t_0 \mid \mu'), the PDF of the test statistic for discovery assuming the tested signal strength \mu'. As an equation, the median p-value is given by:

\mathrm{med}[p_0 \mid \mu'] = \int_{\mathrm{med}[t_0 \mid \mu']}^{\infty} f(t_0 \mid 0)\, dt_0.   (9)

Without knowing f(t_0 \mid 0) and f(t_0 \mid \mu'), the above expression is difficult to evaluate. Using the approximations of Wilks [3] and Wald [4], Cowan et al. [2] show that the p-value for discovery may be approximated as:

p_0 = 1 - F(t_0 \mid 0) \approx 1 - \Phi(\sqrt{t_0}),   (10)

where F(t_0 \mid 0) is the cumulative distribution function (CDF) for f(t_0 \mid 0), neglecting terms of order O(1/\sqrt{N}) (where N is the sample size) and assuming the best-fit signal strength \hat{\mu} is Gaussian distributed. Inserting Eq. 10 into Eq. 8 yields:

Z_0 \approx \sqrt{t_0},   (11)

under the same assumptions. Given that Z_0 is a monotonically decreasing function of p_0 and using Eqs. 8, 9, and 11, we may also write:

\mathrm{med}[Z_0 \mid \mu'] = \Phi^{-1}\left(1 - \int_{\mathrm{med}[t_0 \mid \mu']}^{\infty} f(t_0 \mid 0)\, dt_0\right) \approx \sqrt{\mathrm{med}[t_0 \mid \mu']}.   (12)

To evaluate the above, Cowan et al. propose the use of the "Asimov" dataset where the estimators of all parameters yield their true values [2]. In our above formulae, this is equivalent to setting all parameters equal to their true values given our particular physics model (e.g., \hat{\mu} \to \mu' and n \to \mu' s + b). If t_{0,A} is the Asimov value of our test statistic for discovery assuming the tested signal strength \mu', then we can write:

\mathrm{med}[t_0 \mid \mu'] = t_{0,A},   (13)

and by inserting Eq. 13 into Eq. 12, we yield:

\mathrm{med}[Z_0 \mid \mu'] \approx \sqrt{t_{0,A}}.   (14)

This is one of the important results shown by Cowan et al. [2]. It says we can estimate the median significance of discovery as the square root of the test statistic for discovery evaluated using Asimov data. Using the above, one can produce analytical approximations for a variety of measurement scenarios, giving a physicist a handle on the expected power of their analysis techniques without relying on numerical recourse. As Cowan et al. discuss in their paper, the asymptotic approximation is already quite good for N \sim O(100) (see for instance Fig. 7 of Ref. [2]).

This note proceeds as follows: we will motivate and construct several different measurement scenarios (e.g., multiple control regions, multiple signal regions, etc.) a physicist typically encounters and derive expressions for the median significance of discovery using the asymptotic approximation and assuming Asimov data.
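As a concrete illustration of Eq. 14: for a single counting region with a perfectly known background b (so there are no NPs to profile), t_{0,A} has the closed form 2((s + b)\ln(1 + s/b) - s), giving the familiar Asimov significance of Cowan et al. [2]. A minimal sketch with illustrative yields:

```python
import math

def z_asimov(s, b):
    """Asimov median discovery significance for one region with known b:
    Z_A = sqrt(2 * ((s + b) * ln(1 + s/b) - s))."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def z_naive(s, b):
    """The 'naive' s / sqrt(s + b) estimate, valid only for s << b."""
    return s / math.sqrt(s + b)

print(z_asimov(10.0, 100.0), z_naive(10.0, 100.0))  # agree at small s/b
print(z_asimov(10.0, 1.0), z_naive(10.0, 1.0))      # differ at large s/b
```

The second line shows where the naive Taylor-expanded estimate breaks down, a theme revisited throughout this note.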
In all cases, we generalize to an arbitrary number of regions or constraints N and show the resulting formulae to reduce to expected formulae (i.e., derived elsewhere) in the N = 1 case or to agree with numerical simulation in test cases.

Additionally, we will simplify the use of Eq. 14 in the following sections by dropping the approximation (i.e., setting it to an equality) and by assuming \mu' = 1, typical of discovery analyses where the signal model's cross section is normalized to theoretical expectations. We define:

Z \equiv \mathrm{med}[Z_0 \mid \mu' = 1] = \sqrt{t_{0,A}} = \sqrt{-2 \ln \frac{L(0, \vec{\hat{\hat{\theta}}}_A)}{L(\hat{\mu}_A, \vec{\hat{\theta}}_A)}},   (15)

where we have inserted Eq. 6 followed by Eq. 2, and the best-fit values of \vec{\hat{\hat{\theta}}}, \hat{\mu}, and \vec{\hat{\theta}} are assumed to be evaluated using Asimov data (hence the subscript A).

1.2 Validation using Monte Carlo

To verify our derivations for measurement scenarios which have not yet been studied analytically, we will use Monte Carlo (MC): we will draw toy events from the PDF governing the measurement scenario at hand in each of the relevant regions (i.e., SRs and CRs). The parameters of the individual PDFs will be set to their hypothesized values, and a full sampling of the likelihood will be performed for both the null (i.e., \mu' = 0) and alternative (i.e., \mu' = 1) hypotheses. Using the Python package probfit [5] to set up a simultaneous, unbinned (in each region), maximum-likelihood fit and MINUIT [6] via the Python package iminuit [7] to perform the minimization, we will extract the minima of -\ln L(0, \vec{\hat{\hat{\theta}}}) and -\ln L(\hat{\mu}, \vec{\hat{\theta}}), allowing us to calculate our test statistic t_0 using Eqs. 2 and 6. By performing this procedure many (i.e., O(10000)) times assuming \mu' = 0 and then assuming \mu' = 1, we can produce approximate PDFs of the test statistic, f(t_0 \mid \mu' = 0) and f(t_0 \mid \mu' = 1).
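Although the full unbinned fits use probfit and iminuit, the essential toy logic can be illustrated in a few lines for the simplest 1 SR + 1 CR counting case, where the profiling can be done in closed form; all yields, the seed, and the toy count below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def t0(n, m, tau):
    """Discovery test statistic (Eq. 6) for 1 SR + 1 CR counting, with
    the conditional and unconditional fits done in closed form."""
    n = np.asarray(n, float); m = np.asarray(m, float)
    bhh = (n + m) / (1.0 + tau)               # conditional MLE of b at mu = 0
    sh = n - m / tau                          # unconditional MLE of the signal
    def xlogy(x, y):                          # x*log(y) with 0*log(0) := 0
        return np.where(x > 0, x * np.log(np.where(x > 0, y, 1.0)), 0.0)
    lnl0 = xlogy(n, bhh) - bhh + xlogy(m, tau * bhh) - tau * bhh
    lnl1 = xlogy(n, n) - n + xlogy(m, m) - m  # value at the unconditional maximum
    return np.where(sh >= 0, -2.0 * (lnl0 - lnl1), 0.0)  # clamp when s-hat < 0

s, b, tau, ntoys = 20.0, 30.0, 3.0, 50000
rng = np.random.default_rng(1)
t_alt = t0(rng.poisson(s + b, ntoys), rng.poisson(tau * b, ntoys), tau)    # mu' = 1
t_null = t0(rng.poisson(b, ntoys), rng.poisson(tau * b, ntoys), tau)      # mu' = 0
p = np.mean(t_null >= np.median(t_alt))   # Eq. 9, estimated from toys
print(norm.isf(p))                         # toy estimate of the median significance
```

The clamp in the last line of `t0` implements the \hat{\mu} < 0 branch of Eq. 6.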
By integrating f(t_0 \mid \mu' = 0) from the median value of f(t_0 \mid \mu' = 1) to infinity, we obtain the p-value of the measurement, which in turn yields the median significance of discovery using Eq. 12. The Python packages numpy [8], scipy [9], and matplotlib [10] are used for processing and plotting.

2 Derivations
In the following subsections, we will derive expressions for the median significance of discovery in the asymptotic limit for a variety of commonly encountered measurement scenarios and provide validation of some of our expressions by comparisons to other sources or by numerical simulation. In particular, we will cover:

• Section 2.1: 1 signal region + N control regions, N \in \mathbb{N};
• Section 2.2: N signal regions + 1 control region, N \in \mathbb{N};
• Section 2.3: N signal regions + M control regions, N, M \in \mathbb{N};
• Section 2.4: 1 signal region containing N background processes with M Gaussian background constraints, N, M \in \mathbb{N}.

2.1 1 Signal Region + N Control Regions
Assuming a uniform prior, we can write our likelihood with 1 signal region (SR) and N orthogonal auxiliary measurements (read as: N control regions (CRs) for backgrounds b_i, i = 1, \ldots, N) as:

L(s, \vec{b}) = P\left(n \,\middle|\, s + \sum_{i=1}^{N} b_i\right) \cdot \prod_{i=1}^{N} P\left(m_i \,\middle|\, \sum_{j=1}^{N} \tau_{ij} b_j\right),   (16)

where P refers to a Poisson PDF, n is the observed yield in our SR, m_i is the observed yield in CR i, and \tau_{ij} are the transfer factors which carry background j in our SR to CR i. Inserting the mathematical form for P yields:

L(s, \vec{b}) = \frac{\left(s + \sum_{i=1}^{N} b_i\right)^n \exp\left(-\left(s + \sum_{i=1}^{N} b_i\right)\right)}{n!} \cdot \prod_{i=1}^{N} \frac{\left(\sum_{j=1}^{N} \tau_{ij} b_j\right)^{m_i} \exp\left(-\sum_{j=1}^{N} \tau_{ij} b_j\right)}{m_i!},   (17)

and taking the logarithm yields:

\ln L(s, \vec{b}) = n \ln\left(s + \sum_{i=1}^{N} b_i\right) - s - \sum_{i=1}^{N} b_i + \sum_{i=1}^{N} \left(m_i \ln\left(\sum_{j=1}^{N} \tau_{ij} b_j\right) - \sum_{j=1}^{N} \tau_{ij} b_j\right).   (18)

As we are dealing with a likelihood, constant offsets do not affect our optimization, and so we have dropped -(\ln(n!) + \sum_{i=1}^{N} \ln(m_i!)).

We first consider the most probable values for the backgrounds, \hat{\hat{b}}_i, in the absence of signal, s = 0. We are interested in maximizing \ln L. As the signal is fixed and constant, we make this explicit in \ln L by evaluating it at s = 0 prior to taking any partial derivatives. The result is then differentiated with respect to background k and evaluated at \vec{b} = \vec{\hat{\hat{b}}} to yield:

\frac{\partial \ln L(0, \vec{b})}{\partial b_k}\bigg|_{\vec{b} = \vec{\hat{\hat{b}}}} = \frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} - 1 + \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{\hat{b}}_j} - 1\right).   (19)

Setting all of the partial derivatives, k = 1, \ldots, N, equal to 0 yields the following system of equations for \vec{\hat{\hat{b}}}:

\frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} - 1 + \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{\hat{b}}_j} - 1\right) = 0; \quad k = 1, \ldots, N.   (20)

We now consider maximizing the likelihood in the presence of signal s. In this situation, we let \hat{s} and \vec{\hat{b}} be the signal and background yields, respectively, which maximize our likelihood. Signal and background yields are both left floating, and so we take the partial derivative with respect to s evaluated at (s, \vec{b}) = (\hat{s}, \vec{\hat{b}}):

\frac{\partial \ln L(s, \vec{b})}{\partial s}\bigg|_{(s,\vec{b}) = (\hat{s},\vec{\hat{b}})} = \frac{n}{\hat{s} + \sum_{i=1}^{N} \hat{b}_i} - 1,   (21)

as well as the partial derivative with respect to b_k:

\frac{\partial \ln L(s, \vec{b})}{\partial b_k}\bigg|_{(s,\vec{b}) = (\hat{s},\vec{\hat{b}})} = \frac{n}{\hat{s} + \sum_{i=1}^{N} \hat{b}_i} - 1 + \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} - 1\right).   (22)

Setting Eqs. 21 and 22 equal to 0 and substituting Eq. 21 into Eq. 22 yields:

0 = \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} - 1\right).   (23)

Our system of equations for \hat{s} and \vec{\hat{b}} is then:

\left\{ \hat{s} = n - \sum_{i=1}^{N} \hat{b}_i, \quad \sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} - 1\right) = 0; \quad k = 1, \ldots, N \right\}.   (24)

And using Eq. 15, our significance of discovery is:

Z = \sqrt{-2 \left[ n \ln\left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{\hat{s} + \sum_{i=1}^{N} \hat{b}_i}\right) + \hat{s} + \sum_{i=1}^{N} \left( (\hat{b}_i - \hat{\hat{b}}_i) + m_i \ln\frac{\sum_{j=1}^{N} \tau_{ij} \hat{\hat{b}}_j}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} + \sum_{j=1}^{N} \tau_{ij} (\hat{b}_j - \hat{\hat{b}}_j) \right) \right]}.   (25)

In this expression and the expressions for \hat{s}, \vec{\hat{b}}, and \vec{\hat{\hat{b}}}, we set n = s + \sum_{i=1}^{N} b_i and m_i = \sum_{j=1}^{N} \tau_{ij} b_j, i.e., their Asimov values.
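In practice, the coupled systems in Eqs. 20 and 24 can be handed to a numerical root finder. The sketch below (the yields and the non-diagonal transfer matrix are illustrative assumptions) solves Eq. 20 for the conditional background estimators under Asimov data:

```python
import numpy as np
from scipy.optimize import fsolve

# Illustrative 1 SR + 2 CR scenario: tau[i, j] carries background j
# in the SR to CR i (non-diagonal, so the CRs are not pure).
s_true = 10.0
b_true = np.array([20.0, 15.0])
tau = np.array([[3.0, 0.5],
                [0.4, 4.0]])
n = s_true + b_true.sum()        # Asimov SR yield
m = tau @ b_true                 # Asimov CR yields

def grad_lnl0(bhh):
    """Left-hand sides of Eq. 20, one component per background k."""
    pred = tau @ bhh             # predicted CR yields
    return [n / bhh.sum() - 1.0
            + sum(tau[i, k] * (m[i] / pred[i] - 1.0) for i in range(len(m)))
            for k in range(len(bhh))]

bhh = fsolve(grad_lnl0, b_true.copy())   # conditional MLEs at mu = 0
print(bhh)
```

With a signal present in the Asimov SR yield, the mu = 0 fit pulls the total background estimate above its true value, as expected.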
We may simplify the above further by using n = \hat{s} + \sum_{i=1}^{N} \hat{b}_i to yield:

Z = \sqrt{-2 \left[ n \ln\left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n}\right) + n + \sum_{i=1}^{N} \left( -\hat{\hat{b}}_i + m_i \ln\frac{\sum_{j=1}^{N} \tau_{ij} \hat{\hat{b}}_j}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} + \sum_{j=1}^{N} \tau_{ij} (\hat{b}_j - \hat{\hat{b}}_j) \right) \right]}.   (26)

This is our expression for the median significance of discovery in the asymptotic limit. N.B.: it is possible to simplify the above even further using the intuitive ansatz that the solutions to Eq. 24 are \hat{s} = s and \hat{b}_k = b_k \forall k = 1, \ldots, N. Indeed, this can be explicitly checked:

\hat{s} - n + \sum_{i=1}^{N} \hat{b}_i = s - \left(s + \sum_{i=1}^{N} b_i\right) + \sum_{i=1}^{N} b_i = 0,

\sum_{i=1}^{N} \tau_{ik} \left(\frac{m_i}{\sum_{j=1}^{N} \tau_{ij} \hat{b}_j} - 1\right) = \sum_{i=1}^{N} \tau_{ik} \left(\frac{\sum_{j=1}^{N} \tau_{ij} b_j}{\sum_{j=1}^{N} \tau_{ij} b_j} - 1\right) = \sum_{i=1}^{N} \tau_{ik} (1 - 1) = 0,   (27)

as expected for Asimov data.

N = 1 Control Region
As a check, in the case where we have only 1 CR, N = 1, we let \hat{\hat{b}} \equiv \hat{\hat{b}}_1, m \equiv m_1, and \tau \equiv \tau_{11}. From Eq. 20, we yield:

0 = \frac{n}{\hat{\hat{b}}} - 1 + \tau \left(\frac{m}{\tau \hat{\hat{b}}} - 1\right) \Leftrightarrow \hat{\hat{b}} = \frac{n + m}{1 + \tau},   (28)

as expected. Additionally, letting \hat{b} \equiv \hat{b}_1, Eq. 24 yields:

0 = \tau \left(\frac{m}{\tau \hat{b}} - 1\right) \Leftrightarrow \hat{b} = \frac{m}{\tau} \Rightarrow \hat{s} = n - \frac{m}{\tau},   (29)

as expected. Finally, from Eq. 26, our significance is:

Z = \sqrt{-2 \left( n \ln\left(\frac{\hat{\hat{b}}}{n}\right) + n - \hat{\hat{b}} + m \ln\left(\frac{\hat{\hat{b}}}{\hat{b}}\right) + \tau (\hat{b} - \hat{\hat{b}}) \right)},   (30)

but we know \tau \hat{b} - (1 + \tau) \hat{\hat{b}} = m - (n + m) = -n by Eqs. 28 and 29, leaving us with:

Z = \sqrt{-2 \left( n \ln\left(\frac{\hat{\hat{b}}}{n}\right) + m \ln\left(\frac{\hat{\hat{b}}}{\hat{b}}\right) \right)} = \sqrt{-2 \ln\left( \left(\frac{n + m}{1 + \tau}\right)^{n + m} \frac{\tau^m}{n^n m^m} \right)},   (31)

matching what is shown in Eqs. 21 and 22 of Ref. [12].

Diagonal \tau

Often CRs are defined such that they yield high-stats, pure regions for a specific background. Here, we assume CR i targets background i by assuming the matrix of transfer factors \tau is diagonal (i.e., the acceptance of CR i is 1 for background i and 0 for all other backgrounds). Letting \tau_k \equiv \tau_{kk}, our equations for \vec{\hat{\hat{b}}}, \vec{\hat{b}}, and \hat{s}, Eqs. 20 and 24, simplify as:

0 = \frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} - 1 + \frac{m_k}{\hat{\hat{b}}_k} - \tau_k \Leftrightarrow \frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} + \frac{m_k}{\hat{\hat{b}}_k} = 1 + \tau_k,   (32)

and:

0 = \frac{m_k}{\hat{b}_k} - \tau_k \Leftrightarrow \hat{b}_k = \frac{m_k}{\tau_k} \Rightarrow \hat{s} = n - \sum_{i=1}^{N} \frac{m_i}{\tau_i},   (33)

for k = 1, \ldots, N. Our significance of discovery is:

Z = \sqrt{-2 \left( n \ln\left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n}\right) + n + \sum_{i=1}^{N} \left( -\hat{\hat{b}}_i + m_i \ln\left(\frac{\hat{\hat{b}}_i}{\hat{b}_i}\right) + \tau_i (\hat{b}_i - \hat{\hat{b}}_i) \right) \right)},   (34)

where we used \tau_{ij} = \tau_i \delta_{ij}, where \delta_{ij} is the Kronecker delta function.
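A quick numerical spot-check of the closed-form N = 1 result, Eq. 31 (the Asimov yields below are illustrative):

```python
import math

def z_onoff(n, m, tau):
    """Eq. 31: median discovery significance for 1 SR + 1 CR."""
    bhh = (n + m) / (1.0 + tau)   # Eq. 28
    bh = m / tau                  # Eq. 29
    return math.sqrt(-2.0 * (n * math.log(bhh / n) + m * math.log(bhh / bh)))

# Asimov inputs n = s + b, m = tau*b, for illustrative s, b, tau
s, b, tau = 20.0, 30.0, 3.0
print(z_onoff(s + b, tau * b, tau))
```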
Or, given -n \hat{\hat{b}}_k / \sum_{i=1}^{N} \hat{\hat{b}}_i = m_k - (1 + \tau_k) \hat{\hat{b}}_k = -\hat{\hat{b}}_k + \tau_k (\hat{b}_k - \hat{\hat{b}}_k) by Eqs. 32 and 33, we can also write:

Z = \sqrt{-2 \left( n \ln\left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n}\right) + n + \sum_{i=1}^{N} \left( m_i \ln\left(\frac{\hat{\hat{b}}_i}{\hat{b}_i}\right) - \frac{n \hat{\hat{b}}_i}{\sum_{j=1}^{N} \hat{\hat{b}}_j} \right) \right)} = \sqrt{-2 \ln\left( \left(\frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n}\right)^n \prod_{i=1}^{N} \left(\frac{\hat{\hat{b}}_i}{\hat{b}_i}\right)^{m_i} \right)}.   (35)

N = 2 Control Regions and Diagonal \tau

As a special case of the previous section, we consider 2 CRs (N = 2) and assume each CR to be pure in the background it targets, i.e., \tau is diagonal. Then by Eq. 32:

\left\{ 0 = \frac{n}{\hat{\hat{b}}_1 + \hat{\hat{b}}_2} - 1 + \frac{m_1}{\hat{\hat{b}}_1} - \tau_1, \quad 0 = \frac{n}{\hat{\hat{b}}_1 + \hat{\hat{b}}_2} - 1 + \frac{m_2}{\hat{\hat{b}}_2} - \tau_2 \right\},   (36)

and subtracting the second from the first yields:

0 = \left(\frac{m_1}{\hat{\hat{b}}_1} - \frac{m_2}{\hat{\hat{b}}_2}\right) - (\tau_1 - \tau_2) \Leftrightarrow \hat{\hat{b}}_2 = \frac{m_2 \hat{\hat{b}}_1}{m_1 + (\tau_2 - \tau_1) \hat{\hat{b}}_1}.   (37)

Consider the simpler case where \tau_1 = \tau_2 = \tau > 0. Then \hat{\hat{b}}_2 = (m_2/m_1) \hat{\hat{b}}_1 and:

0 = n \hat{\hat{b}}_1 - (1 + \tau)(\hat{\hat{b}}_1 + \hat{\hat{b}}_2) \hat{\hat{b}}_1 + m_1 (\hat{\hat{b}}_1 + \hat{\hat{b}}_2) = n \hat{\hat{b}}_1 - (1 + \tau) \frac{m_1 + m_2}{m_1} \hat{\hat{b}}_1^2 + (m_1 + m_2) \hat{\hat{b}}_1 \Rightarrow \hat{\hat{b}}_1 = \frac{m_1 (n + m_1 + m_2)}{(1 + \tau)(m_1 + m_2)},   (38)

(throwing away the \hat{\hat{b}}_1 = 0 solution). By symmetry, we can send subscripts 1 \to 2 and 2 \to 1 to yield the \hat{\hat{b}}_2 solution:

\hat{\hat{b}}_2 = \frac{m_2 (n + m_1 + m_2)}{(1 + \tau)(m_1 + m_2)}.   (39)

We now consider the more complex case where \tau_1 \ne \tau_2. Substituting Eq. 37 into the first equation of Eq. 36 (multiplied through by (\hat{\hat{b}}_1 + \hat{\hat{b}}_2)\hat{\hat{b}}_1):

0 = n \hat{\hat{b}}_1 - (1 + \tau_1)(\hat{\hat{b}}_1 + \hat{\hat{b}}_2)\hat{\hat{b}}_1 + m_1 (\hat{\hat{b}}_1 + \hat{\hat{b}}_2)
  = (m_1 - (1 + \tau_1)\hat{\hat{b}}_1)\hat{\hat{b}}_2 + ((n + m_1) - (1 + \tau_1)\hat{\hat{b}}_1)\hat{\hat{b}}_1
  = (m_1 - (1 + \tau_1)\hat{\hat{b}}_1) m_2 + ((n + m_1) - (1 + \tau_1)\hat{\hat{b}}_1)(m_1 + \Delta\tau \hat{\hat{b}}_1)
  = m_1 (m_1 + m_2 + n) + (\Delta\tau (n + m_1) - (1 + \tau_1)(m_1 + m_2)) \hat{\hat{b}}_1 - \Delta\tau (1 + \tau_1) \hat{\hat{b}}_1^2,   (40)

where \Delta\tau \equiv \tau_2 - \tau_1, which can only vary between -\tau_1 and \tau_2. We have also cancelled an overall factor of \hat{\hat{b}}_1 on the third line, to remove the uninteresting solution \hat{\hat{b}}_1 = 0. Our solution is:

\hat{\hat{b}}_1 = \frac{-B_1 \pm \sqrt{B_1^2 - 4 A_1 C_1}}{2 A_1},   (41)

where:

A_1 = -\Delta\tau (1 + \tau_1),
B_1 = \Delta\tau (n + m_1) - (1 + \tau_1)(m_1 + m_2),
C_1 = m_1 (m_1 + m_2 + n).   (42)

By symmetry, we have:

\hat{\hat{b}}_2 = \frac{-B_2 \pm \sqrt{B_2^2 - 4 A_2 C_2}}{2 A_2},   (43)

where:

A_2 = \Delta\tau (1 + \tau_2),
B_2 = -\Delta\tau (n + m_2) - (1 + \tau_2)(m_1 + m_2),
C_2 = m_2 (m_1 + m_2 + n).   (44)

The expressions above are only physically meaningful if \hat{\hat{b}}_1 > 0 and \hat{\hat{b}}_2 > 0 (we cannot have a negative number of events). We suppose, for definiteness, \tau_2 > \tau_1 \Rightarrow \Delta\tau > 0, and we require \hat{\hat{b}}_1 > 0 and \hat{\hat{b}}_2 > 0. For \hat{\hat{b}}_1: A_1 < 0 and C_1 > 0, so -4 A_1 C_1 > 0, which implies B_1^2 - 4 A_1 C_1 > 0 and the square root has a real value. Additionally, 2 A_1 < 0 and B_1 < \sqrt{B_1^2 - 4 A_1 C_1}, so to always pick up a positive solution for \hat{\hat{b}}_1, we choose the negative sign:

\hat{\hat{b}}_1 = \frac{-B_1 - \sqrt{B_1^2 - 4 A_1 C_1}}{2 A_1} = \frac{B_1 + \sqrt{B_1^2 + 4 |A_1| C_1}}{2 |A_1|},   (45)

for \Delta\tau > 0; this solution is real and positive. We turn to \hat{\hat{b}}_2: A_2 > 0 and C_2 > 0, so -4 A_2 C_2 < 0 and |B_2| > \sqrt{B_2^2 - 4 A_2 C_2}. Additionally, -B_2 > 0, so our solution is always positive and we may write it as:

\hat{\hat{b}}_2 = \frac{|B_2| \pm \sqrt{B_2^2 - 4 A_2 C_2}}{2 A_2}.   (46)

The sign choice is still ambiguous, so we return to Eq. 36 and substitute in our expressions for each. One can show that the negative sign is required to solve our system of equations, and so our solution is:

\hat{\hat{b}}_2 = \frac{|B_2| - \sqrt{B_2^2 - 4 A_2 C_2}}{2 A_2}.   (47)

The above is always positive, but the condition for it being real requires B_2^2 - 4 A_2 C_2 > 0. It can be shown that B_2^2 - 4 A_2 C_2 = B_1^2 - 4 A_1 C_1 > 0, so this is indeed the case. The unconditional estimators follow from Eq. 33 with diagonal \tau:

\left\{ \hat{b}_1 = \frac{m_1}{\tau_1}, \quad \hat{b}_2 = \frac{m_2}{\tau_2}, \quad \hat{s} = n - \hat{b}_1 - \hat{b}_2 \right\}.   (48)

Our significance of discovery in the asymptotic limit is then Eq. 35 with Eqs. 45, 47, and 48 appropriately substituted in. Assuming Asimov data, we let n = s + b_1 + b_2, m_1 = \tau_1 b_1, and m_2 = \tau_2 b_2, where s and b_1, b_2 are our theoretical signal and background yields in our SR, respectively.

We have numerically calculated the median significance of discovery using the procedure described in Sec. 1.2 for 30 different hypothesized measurement scenarios (i.e., 3 curves with 10 points per curve). Additionally, we have plotted Eq. 35 continuously alongside these numerical results: both the numerical and the asymptotic results are shown in Fig. 1. Excellent agreement is observed, even down to low values of s + b_1 + b_2. The "naive" approximation of the significance, s/\sqrt{s + b_1 + b_2}, is also plotted. As expected, this naive approximation agrees well with the asymptotic and numerical results in the regime where s/b \ll 1, as s/\sqrt{s + b_1 + b_2} is a Taylor expansion of the asymptotic result in the small s/b limit. This is demonstrated most prominently by the green curve at low values of b_1, where s, b_1, and b_2 are all O(1) and the small s/b assumption fails.

We also include examples of the PDFs for our test statistic t_0 under the assumptions of no signal, f(t_0 \mid \mu' = 0), and in the presence of signal, f(t_0 \mid \mu' = 1), for the green curve, b_2 = 5, in Fig. 1 for both the b_1 = 1 and b_1 = 1000 simulated data points. At higher values of s/b as shown in Fig. 2a, the median value of f(t_0 \mid \mu' = 1) is well offset from t_0 = 0, resulting in a smaller integrated p-value for the null hypothesis. At smaller values of s/b as shown in Fig. 2b, the median value of f(t_0 \mid \mu' = 1) is approximately at t_0 = 0 and the distribution itself is not unlike f(t_0 \mid \mu' = 0), resulting in a larger integrated p-value.
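The closed-form pieces above (Eqs. 37, 45, and 48 inserted into Eq. 35) can be assembled into a short numerical sketch; the yields and transfer factors are illustrative, and \tau_2 > \tau_1 is assumed as in the text:

```python
import math

def bhh_pair(n, m1, m2, t1, t2):
    """Conditional (mu = 0) background estimators for N = 2 CRs and
    diagonal tau, via Eqs. 45 and 37 (assumes t2 > t1, as in the text)."""
    dt = t2 - t1
    A1 = -dt * (1.0 + t1)                                   # Eq. 42
    B1 = dt * (n + m1) - (1.0 + t1) * (m1 + m2)
    C1 = m1 * (n + m1 + m2)
    b1 = (-B1 - math.sqrt(B1 * B1 - 4.0 * A1 * C1)) / (2.0 * A1)  # Eq. 45
    b2 = m2 * b1 / (m1 + dt * b1)                           # Eq. 37
    return b1, b2

def z_two_cr(s, b1, b2, t1, t2):
    """Eq. 35 with Asimov inputs n = s + b1 + b2, m_k = tau_k * b_k."""
    n, m1, m2 = s + b1 + b2, t1 * b1, t2 * b2
    h1, h2 = bhh_pair(n, m1, m2, t1, t2)
    return math.sqrt(-2.0 * (n * math.log((h1 + h2) / n)
                             + m1 * math.log(h1 / b1)
                             + m2 * math.log(h2 / b2)))

print(z_two_cr(10.0, 20.0, 15.0, 2.0, 5.0))
```

The returned pair can be verified by plugging it back into both equations of Eq. 36.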
This behaviour is as expected.

Figure 1: The median significance of discovery as a function of SR background 1 yield (b_1) and SR background 2 yield (b_2) for the 1 SR bin + 2 CR bins measurement described in Section 2.1.4. The SR signal yield s is assumed to be 10. The transfer matrix for the background to the respective control regions is assumed to be diagonal with \tau_1 = 8 and \tau_2 = 5. "Numerical" refers to the results calculated using MC toys (50,000 events for the estimation of f(t_0 \mid \mu' = 0) and 50,000 events for the estimation of f(t_0 \mid \mu' = 1), per point), "Asymptotic" refers to Eq. 35, and "Simple" refers to s/\sqrt{s + b_1 + b_2}.

Figure 2: PDFs of the test statistic of discovery t_0 for the green curve, b_2 = 5, in Fig. 1 for both the (a) b_1 = 1 and (b) b_1 = 1000 simulated data points. These distributions are used to calculate the corresponding values of Z.

2.2 N Signal Regions + 1 Control Region
We now consider the case where we have N SRs (read as: N signal bins) and 1 shared CR. Then for our i = 1, \ldots, N SRs, we have transfer factors \{\tau_i; i = 1, \ldots, N\}, where \tau_i is the transfer factor carrying the background yield in our CR to SR i. Taking our theoretical background yield in our CR to be b and our observed value to be m, we can write our likelihood as:

L(\vec{s}, b) = \prod_{i=1}^{N} \left\{ P\left(n_i \,\middle|\, s_i + \frac{b}{\tau_i}\right) \right\} \cdot P(m \mid b),   (49)

where we are dividing by the transfer factors, as in Section 2.1 we took \tau to be the factor which multiplies yields in our SR to give yields in our CR. We can immediately write our log-likelihood as:

\ln L(\vec{s}, b) = \sum_{i=1}^{N} \left( n_i \ln\left(s_i + \frac{b}{\tau_i}\right) - s_i - \frac{b}{\tau_i} \right) + m \ln(b) - b,   (50)

where we have discarded the constant -(\ln(m!) + \sum_{i=1}^{N} \ln(n_i!)). Going right ahead with finding our conditional and unconditional maximum likelihood estimators:

\frac{\partial \ln L(\vec{0}, b)}{\partial b}\bigg|_{b = \hat{\hat{b}}} = \sum_{i=1}^{N} \left(\frac{n_i}{\hat{\hat{b}}} - \frac{1}{\tau_i}\right) + \frac{m}{\hat{\hat{b}}} - 1 = 0 \Leftrightarrow \frac{m + \sum_{i=1}^{N} n_i}{\hat{\hat{b}}} = 1 + \sum_{i=1}^{N} \frac{1}{\tau_i} \Leftrightarrow \hat{\hat{b}} = \frac{m + \sum_{i=1}^{N} n_i}{1 + \sum_{i=1}^{N} \frac{1}{\tau_i}}.   (51)

Also:

\frac{\partial \ln L(\vec{s}, b)}{\partial s_k}\bigg|_{(\vec{s},b) = (\vec{\hat{s}},\hat{b})} = \frac{n_k}{\hat{s}_k + \frac{\hat{b}}{\tau_k}} - 1 = 0 \Leftrightarrow \hat{s}_k = n_k - \frac{\hat{b}}{\tau_k},   (52)

and:

\frac{\partial \ln L(\vec{s}, b)}{\partial b}\bigg|_{(\vec{s},b) = (\vec{\hat{s}},\hat{b})} = \sum_{i=1}^{N} \left(\frac{n_i}{\tau_i \hat{s}_i + \hat{b}} - \frac{1}{\tau_i}\right) + \frac{m}{\hat{b}} - 1 = 0 \Leftrightarrow \sum_{i=1}^{N} \frac{n_i}{\tau_i \hat{s}_i + \hat{b}} + \frac{m}{\hat{b}} = 1 + \sum_{i=1}^{N} \frac{1}{\tau_i}.   (53)

But \tau_i \hat{s}_i + \hat{b} = \tau_i n_i by Eq. 52, so Eq. 53 becomes:

\sum_{i=1}^{N} \frac{1}{\tau_i} + \frac{m}{\hat{b}} = 1 + \sum_{i=1}^{N} \frac{1}{\tau_i} \Leftrightarrow \hat{b} = m,   (54)

and so our solutions for \hat{s}_k and \hat{b} are:

\left\{ \hat{s}_k = n_k - \frac{m}{\tau_k}, \quad \hat{b} = m; \quad k = 1, \ldots, N \right\}.   (55)

Using Eq. 15, we can write our significance of discovery in the asymptotic limit as:

Z = \sqrt{-2 \left( \sum_{i=1}^{N} \left( n_i \ln\left(\frac{\hat{\hat{b}}}{\tau_i \hat{s}_i + \hat{b}}\right) + \hat{s}_i + \frac{\hat{b}}{\tau_i} - \frac{\hat{\hat{b}}}{\tau_i} \right) + m \ln\left(\frac{\hat{\hat{b}}}{\hat{b}}\right) + (\hat{b} - \hat{\hat{b}}) \right)}.   (56)

Assuming Asimov data, we let n_i = s_i + b/\tau_i and m = b.

We now consider the typical case where the signal yields s_1, s_2, \ldots, s_N among our N SRs are correlated and tuned by a single POI, our signal strength \mu. Our expressions in the previous section are modified by letting s_i \to \mu s_i \forall i = 1, \ldots, N and accordingly L(\vec{s}, b) \to L(\mu, b). The derivation for \hat{\hat{b}} proceeds in the same way as in Section 2.2.1, ultimately yielding Eq. 51. Eq. 52 is modified to be a single derivative with respect to \mu as:

\frac{\partial \ln L(\mu, b)}{\partial \mu}\bigg|_{(\mu,b) = (\hat{\mu},\hat{b})} = \sum_{i=1}^{N} s_i \left(\frac{n_i}{\hat{\mu} s_i + \frac{\hat{b}}{\tau_i}} - 1\right) = 0,   (57)

and Eq. 53 becomes:

\frac{\partial \ln L(\mu, b)}{\partial b}\bigg|_{(\mu,b) = (\hat{\mu},\hat{b})} = \sum_{i=1}^{N} \left(\frac{n_i}{\tau_i \hat{\mu} s_i + \hat{b}} - \frac{1}{\tau_i}\right) + \frac{m}{\hat{b}} - 1 = 0.   (58)

While these equations are difficult to solve in the general sense, we may propose the ansatz that \hat{\mu} = 1 and \hat{b} = b when assuming Asimov data (i.e., n_i = s_i + b/\tau_i \forall i = 1, \ldots, N and m = b). While the solution is intuitive, it can be explicitly checked to solve Eq. 57:

\sum_{i=1}^{N} s_i \left(\frac{n_i}{\hat{\mu} s_i + \frac{\hat{b}}{\tau_i}} - 1\right) = \sum_{i=1}^{N} s_i \left(\frac{s_i + \frac{b}{\tau_i}}{s_i + \frac{b}{\tau_i}} - 1\right) = \sum_{i=1}^{N} s_i (1 - 1) = 0,   (59)

and explicitly checked to solve Eq. 58:

\sum_{i=1}^{N} \left(\frac{n_i}{\tau_i \hat{\mu} s_i + \hat{b}} - \frac{1}{\tau_i}\right) + \frac{m}{\hat{b}} - 1 = \sum_{i=1}^{N} \left(\frac{s_i + \frac{b}{\tau_i}}{\tau_i s_i + b} - \frac{1}{\tau_i}\right) + \frac{b}{b} - 1 = \sum_{i=1}^{N} \left(\frac{1}{\tau_i} - \frac{1}{\tau_i}\right) + 1 - 1 = 0.   (60)

In terms of our median significance of discovery, Eq. 56, we let \hat{s}_i \to \hat{\mu} s_i = s_i, \hat{b} = b, and \hat{\hat{b}} be given by Eq. 51 (using Asimov data) to yield:

Z = \sqrt{-2 \left( \sum_{i=1}^{N} \left( \left(s_i + \frac{b}{\tau_i}\right) \ln\left(\frac{\hat{\hat{b}}}{\tau_i s_i + b}\right) + s_i + \frac{b}{\tau_i} - \frac{\hat{\hat{b}}}{\tau_i} \right) + b \ln\left(\frac{\hat{\hat{b}}}{b}\right) + (b - \hat{\hat{b}}) \right)}.   (61)

As in Section 2.1.4, we have numerically validated our results for the scenario where we have 3 SRs, N = 3, and 1 CR, sampling Poisson PDFs in each of our 4 bins (with mean values of s_1 + b/\tau_1, s_2 + b/\tau_2, s_3 + b/\tau_3, and b) in order to generate our simulated yields. The median significance of discovery has been calculated for 30 different hypothesized measurement scenarios (i.e., 3 curves with 10 points per curve), and we have plotted Eq. 61 continuously alongside these numerical results. This is shown in Fig. 3. As before, we see excellent agreement between the numerical and asymptotic results over the range of theoretical yields and parameters studied.

We have also plotted alongside our results the "naive" approximation of the significance of discovery where the bin-by-bin significances are summed in quadrature:

Z = \sqrt{\sum_{i=1}^{N} \left(\frac{s_i}{\sqrt{s_i + \frac{b}{\tau_i}}}\right)^2} = \sqrt{\sum_{i=1}^{N} \frac{s_i^2}{s_i + \frac{b}{\tau_i}}},   (62)

and indeed in the low s/b regime (where we are in this regime if all bins are in this regime), we see good agreement between all three methods. In the s/b \sim 1 regime or larger (where we are in this regime if any bin is in this regime), the naive approximation fails and no longer shows good agreement with the asymptotic and numerical methods, as expected.
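Eq. 61 and the naive quadrature estimate of Eq. 62 can be compared directly; a short sketch with illustrative yields (the N = 1 call reproduces the on/off result of Eq. 31):

```python
import math

def z_multi_sr(s, b, tau):
    """Eq. 61: Asimov median discovery significance for N SRs + 1 CR,
    with n_i = s_i + b/tau_i and m = b."""
    n = [si + b / ti for si, ti in zip(s, tau)]
    bhh = (b + sum(n)) / (1.0 + sum(1.0 / ti for ti in tau))    # Eq. 51
    arg = sum(ni * math.log(bhh / (ti * si + b)) + si + (b - bhh) / ti
              for ni, si, ti in zip(n, s, tau))
    arg += b * math.log(bhh / b) + (b - bhh)
    return math.sqrt(-2.0 * arg)

def z_quad(s, b, tau):
    """Eq. 62: bin-by-bin significances summed in quadrature."""
    return math.sqrt(sum(si * si / (si + b / ti) for si, ti in zip(s, tau)))

print(z_multi_sr([20.0], 90.0, [3.0]))          # ~2.814, cf. Eq. 31
print(z_multi_sr([5.0, 5.0, 2.0], 100.0, [2.0, 10.0, 20.0]),
      z_quad([5.0, 5.0, 2.0], 100.0, [2.0, 10.0, 20.0]))
```

The quadrature sum ignores the CR statistical uncertainty on b, so the two estimates approach each other only when the CR is high-stats (large \tau_i) and all bins sit in the small s/b regime.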
Figure 3: The median significance of discovery as a function of SR 1 signal yield (s_1), SR 2 signal yield (s_2), and CR background yield (b) for the 3 SR bins + 1 CR bin measurement described in Section 2.2.2. The SR 3 signal yield s_3 is assumed to be 12. The transfer factors for the background from the CR to SRs 1 (\tau_1), 2 (\tau_2), and 3 (\tau_3) are assumed to be 2, 10, and 20, respectively. "Numerical" refers to the results calculated using MC toys (50,000 events for the estimation of f(t_0 \mid \mu' = 0) and 50,000 events for the estimation of f(t_0 \mid \mu' = 1), per point), "Asymptotic" refers to Eq. 61, and "Simple" refers to \sqrt{\sum_{i=1}^{N} s_i^2/(s_i + b/\tau_i)}. Note: the last MC data point on the green curve, (s_2, b) = (25, …), is omitted due to a failed p-value calculation.

2.3 N Signal Regions + M Control Regions
We now consider the case where we have N SRs (read as: N signal bins) and M CRs (one for each background process). We assume the definitions of the CRs are SR-independent and orthogonal. For i = 1, \ldots, N SRs, we have transfer matrices \{\tau^i; i = 1, \ldots, N\}, where \tau^i is the transfer matrix carrying the yields in SR i to our M CRs (e.g., \tau^i_{jk} carries the yield for background k in SR i to CR j). Our likelihood is:

L(\vec{s}, B) = \prod_{i=1}^{N} \left\{ P\left(n_i \,\middle|\, s_i + \sum_{j=1}^{M} b_{ji}\right) \right\} \cdot \prod_{j=1}^{M} \left\{ P\left(m_j \,\middle|\, \sum_{j'=1}^{M} \tau^{i'}_{jj'} b_{j'i'}\right) \right\},   (63)

where B is our matrix of background yields in our SRs, where each row corresponds to a specific background process and each column corresponds to a specific SR (e.g., b_{ji} \equiv [B]_{ji} corresponds to the yield for background j in SR i). Additionally, i' can be any integer from 1 to N, but for consistent background yields in a given CR, regardless of which SR we are extrapolating from, we require \sum_{j'=1}^{M} \tau^{i'}_{jj'} b_{j'i'} = S_j \forall i' \in [1, N], where S_j is the expected sum of weights in CR j (i.e., a constant). Taking the logarithm of Eq. 63:

\ln L(\vec{s}, B) = \sum_{i=1}^{N} \left\{ n_i \ln\left(s_i + \sum_{j=1}^{M} b_{ji}\right) - s_i - \sum_{j=1}^{M} b_{ji} \right\} + \sum_{j=1}^{M} \left\{ m_j \ln\left(\sum_{j'=1}^{M} \tau^{i'}_{jj'} b_{j'i'}\right) - \sum_{j'=1}^{M} \tau^{i'}_{jj'} b_{j'i'} \right\},   (64)

where we have dropped the constant -\sum_{i=1}^{N} \ln n_i! - \sum_{j=1}^{M} \ln m_j!.
Going ahead:

\[
\left.\frac{\partial \ln L(\vec{0}, B)}{\partial b_{\ell k}}\right|_{B=\hat{\hat{B}}} = \frac{n_k}{\sum_{j=1}^{M} \hat{\hat{b}}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{\sum_{j'=1}^{M} \tau^{k}_{jj'} \cdot \hat{\hat{b}}_{j'k}} - 1 \right) = 0
\;\Leftrightarrow\;
\frac{n_k}{\sum_{j=1}^{M} \hat{\hat{b}}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{\hat{B}}]_{jk}} - 1 \right) = 0, \tag{65}
\]

where we have selected i′ = k when taking our partial derivative and compactified our denominator summation using matrix multiplication. So our best-fit values in the absence of signal are the solutions to the set of coupled equations:

\[
\frac{n_k}{\sum_{j=1}^{M} \hat{\hat{b}}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{\hat{B}}]_{jk}} - 1 \right) = 0\,;\quad k = 1, \ldots, N,\; \ell = 1, \ldots, M. \tag{66}
\]

Also:

\[
\left.\frac{\partial \ln L(\vec{s}, B)}{\partial s_k}\right|_{\vec{s}=\hat{\vec{s}},\, B=\hat{B}} = \frac{n_k}{\hat{s}_k + \sum_{j=1}^{M} \hat{b}_{jk}} - 1 = 0
\;\Leftrightarrow\;
\hat{s}_k = n_k - \sum_{j=1}^{M} \hat{b}_{jk}, \tag{67}
\]

and:

\[
\left.\frac{\partial \ln L(\vec{s}, B)}{\partial b_{\ell k}}\right|_{\vec{s}=\hat{\vec{s}},\, B=\hat{B}} = \frac{n_k}{\hat{s}_k + \sum_{j=1}^{M} \hat{b}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{\sum_{j'=1}^{M} \tau^{k}_{jj'} \cdot \hat{b}_{j'k}} - 1 \right) = 0
\;\Leftrightarrow\;
\sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{B}]_{jk}} - 1 \right) = 0, \tag{68}
\]

where we have subbed in Eq. 67. So our best-fit values in the presence of signal are the solutions to the set of coupled equations:

\[
\left\{ \hat{s}_k = n_k - \sum_{j=1}^{M} \hat{b}_{jk}\,,\;\; \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{B}]_{jk}} - 1 \right) = 0\,;\quad k = 1, \ldots, N,\; \ell = 1, \ldots, M \right\}. \tag{69}
\]

Using Eq. 15, the significance of discovery in the asymptotic limit is:

\[
Z = \sqrt{ -2 \cdot \left[ \sum_{i=1}^{N} \left\{ n_i \cdot \ln\left( \frac{\sum_{j=1}^{M} \hat{\hat{b}}_{ji}}{n_i} \right) + n_i - \sum_{j=1}^{M} \hat{\hat{b}}_{ji} \right\} + \sum_{j=1}^{M} \left\{ m_j \cdot \ln\left( \frac{[\tau^{i'} \cdot \hat{\hat{B}}]_{ji'}}{[\tau^{i'} \cdot \hat{B}]_{ji'}} \right) - [\tau^{i'} \cdot (\hat{\hat{B}} - \hat{B})]_{ji'} \right\} \right] }. \tag{70}
\]
Assuming Asimov data, we would let \(n_i = s_i + \sum_{j=1}^{M} b_{ji}\) and \(m_j = [\tau^{i'} \cdot B]_{ji'}\).

As in Section 2.2.2, we consider the case where our signal yields s₁, s₂, …, s_N among our N SRs are correlated and tuned by our signal strength µ. Our expressions in the previous section are modified by letting s_i → µs_i ∀ i = 1, …, N and accordingly \(L(\vec{s}, B) \rightarrow L(\mu, B)\). The derivation for \(B = \hat{\hat{B}}\) proceeds in the same way as in Section 2.3.1 and yields Eq. 66. Eq. 67 is modified to be a single derivative with respect to µ:

\[
\left.\frac{\partial \ln L(\mu, B)}{\partial \mu}\right|_{(\mu, B)=(\hat{\mu}, \hat{B})} = \sum_{i=1}^{N} s_i \cdot \left( \frac{n_i}{\hat{\mu} s_i + \sum_{j=1}^{M} \hat{b}_{ji}} - 1 \right) = 0, \tag{71}
\]

and Eq. 68 becomes:

\[
\left.\frac{\partial \ln L(\mu, B)}{\partial b_{\ell k}}\right|_{(\mu, B)=(\hat{\mu}, \hat{B})} = \frac{n_k}{\hat{\mu} \cdot s_k + \sum_{j=1}^{M} \hat{b}_{jk}} - 1 + \sum_{j=1}^{M} \tau^{k}_{j\ell} \cdot \left( \frac{m_j}{[\tau^{k} \cdot \hat{B}]_{jk}} - 1 \right) = 0. \tag{72}
\]

Assuming Asimov data \(n_i = s_i + \sum_{j=1}^{M} b_{ji}\) ∀ i = 1, …, N and \(m_j = [\tau^{i} \cdot B]_{ji}\) ∀ j = 1, …, M where i ∈ [1, N], the solutions to Eqs. 71 and 72 can be shown to be µ̂ = 1 and \(\hat{B} = B\) (N.B.: here, the elements of B are the theoretical background yields).

2.4 1 Signal Region + M Gaussian Background Constraints
We consider the case where we have 1 SR (read as: 1 bin) with a signal process yield s and N background processes with M Gaussian constraints on those backgrounds (read as: M nuisance parameters). We assume nuisance parameter j, θ_j, is described by a Gaussian constraint with a nominal value of 0, a mean of θ_j, and a variance of 1, as well as that the nuisance parameters are related via a correlation matrix Σ. Additionally, we assume background process i is affected by the nuisance parameters via:

\[
b_i \rightarrow \prod_{j=1}^{M} R_{ij}(\theta_j) \cdot b_i, \tag{73}
\]

where \(R_{ij}(\theta_j)\) is the response function of background i to nuisance parameter j, as described in Ref. [11]. We'll condense our notation in the following equations by letting \(R_i(\vec{\theta}) \equiv \prod_{j=1}^{M} R_{ij}(\theta_j)\). In our single SR bin, we assume the behaviour of a response function is governed by:

\[
R_{ij}(+1) = 1 + \sigma^{+}_{ij}\,,\quad R_{ij}(0) = 1\,,\quad R_{ij}(-1) = 1 - \sigma^{-}_{ij}, \tag{74}
\]

where \(\sigma^{+}_{ij}\) and \(\sigma^{-}_{ij}\) are constants dependent on the nuisance parameter considered (i.e., nuisance parameter j) and affecting the overall normalization of background i (i.e., they are relative "uncertainties"). The functional form of \(R_{ij}(\theta_j)\) for θ_j ∈ [−1, +1] is left free. Note: \(R_{ij}(\theta_j) = 1\) if background i is not affected by nuisance parameter j.

Our likelihood may be written as:

\[
L(s, \vec{\theta}) = P\left( n \,\middle|\, s + \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i \right) \cdot G\left( \vec{0} \,\middle|\, \vec{\theta}, \Sigma \right), \tag{75}
\]

where we have moved the functional dependence from \(\vec{b}\) to \(\vec{\theta}\) (as the value of \(\vec{\theta}\) tunes the background yields) and where the diagonal elements of Σ are assumed to be 1. If the nuisance parameters were fully decoupled from one another, Σ would be a diagonal matrix and our M-dimensional Gaussian function may be written as the product of M one-dimensional Gaussians. Explicitly:

\[
L(s, \vec{\theta}) = \frac{\left( s + \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i \right)^{n} \exp\left( -\left( s + \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i \right) \right)}{n!} \cdot \frac{\exp\left( -\frac{1}{2} \cdot \vec{\theta}^{\,\top} \Sigma^{-1} \vec{\theta} \right)}{\sqrt{(2\pi)^{M} |\Sigma|}}, \tag{76}
\]

or taking the logarithm:

\[
\ln L(s, \vec{\theta}) = n \cdot \ln\left( s + \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i \right) - s - \sum_{i=1}^{N} R_i(\vec{\theta}) \cdot b_i - \frac{1}{2} \cdot \vec{\theta}^{\,\top} \Sigma^{-1} \vec{\theta}, \tag{77}
\]

where we dropped the constant \(-\left( \ln n! + \ln \sqrt{(2\pi)^{M} |\Sigma|} \right)\). Note that we can also write the matrix multiplication in the last term using sums:

\[
\vec{\theta}^{\,\top} \Sigma^{-1} \vec{\theta} = \sum_{i=1}^{M} \sum_{j=1}^{M} [\Sigma^{-1}]_{ij} \theta_i \theta_j = \sum_{i=1}^{M} \sum_{j=i}^{M} (2 - \delta_{ij}) [\Sigma^{-1}]_{ij} \theta_i \theta_j, \tag{78}
\]

where \([\Sigma^{-1}]_{ij}\) refers to the element in the i-th row and j-th column of Σ⁻¹ and \(\delta_{ij}\) is the Kronecker delta function.
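The upper-triangle form of Eq. 78 is easy to spot-check numerically (the 3 × 3 correlation matrix below is an assumed toy example, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 3
Sigma = np.array([[1.0, 0.3, 0.1],     # symmetric, unit diagonal, as assumed in the text
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
theta = rng.normal(size=M)

# Full double sum over i, j (the quadratic form itself).
full = theta @ Sigma_inv @ theta
# Upper-triangle form: sum_i sum_{j >= i} (2 - delta_ij) [Sigma^-1]_ij theta_i theta_j.
upper = sum((2.0 - (i == j)) * Sigma_inv[i, j] * theta[i] * theta[j]
            for i in range(M) for j in range(i, M))
assert np.isclose(full, upper)

# The gradient formula (Eq. 79 in the text below) can be checked by central
# finite differences of the quadratic form.
eps = 1e-6
for k in range(M):
    d = np.zeros(M)
    d[k] = eps
    num = ((theta + d) @ Sigma_inv @ (theta + d)
           - (theta - d) @ Sigma_inv @ (theta - d)) / (2.0 * eps)
    assert np.isclose(num, 2.0 * Sigma_inv[k] @ theta, atol=1e-5)
```

Both identities rely only on Σ (and hence Σ⁻¹) being symmetric.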
In writing Eq. 78, we have used the fact that if Σ is symmetric then its inverse is also symmetric: \(\Sigma^{-1} = (\Sigma^{-1})^{\top}\). From the above, it is also apparent that:

\[
\frac{\partial \left( \vec{\theta}^{\,\top} \Sigma^{-1} \vec{\theta} \right)}{\partial \theta_k} = 2 \cdot \sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \theta_j. \tag{79}
\]

Continuing ahead, we consider our best-fit values in the absence of signal:

\[
\left.\frac{\partial \ln L(0, \vec{\theta})}{\partial \theta_k}\right|_{\vec{\theta}=\hat{\hat{\vec{\theta}}}} = \left( \frac{n}{\sum_{i=1}^{N} R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i} - 1 \right) \cdot \sum_{i=1}^{N} \left( \left.\frac{\partial R_{ik}(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} \right) \cdot \frac{R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i}{R_{ik}(\hat{\hat{\theta}}_k)} - \sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \hat{\hat{\theta}}_j = 0 \tag{80}
\]

(the above is not unlike Eq. 8 of Ref. [11] for decorrelated constraints), and so our system of equations solving for \(\hat{\hat{\vec{\theta}}}\) is:

\[
\left( \frac{n}{\sum_{i=1}^{N} R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i} - 1 \right) \cdot \sum_{i=1}^{N} \left( \left.\frac{\partial R_{ik}(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} \right) \cdot \frac{R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i}{R_{ik}(\hat{\hat{\theta}}_k)} - \sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \hat{\hat{\theta}}_j = 0\,;\quad k = 1, \ldots, M. \tag{81}
\]

We now consider the best-fit values in the context of our tested hypothesis:

\[
\left.\frac{\partial \ln L(s, \vec{\theta})}{\partial s}\right|_{(s, \vec{\theta})=(\hat{s}, \hat{\vec{\theta}})} = \frac{n}{\hat{s} + \sum_{i=1}^{N} R_i(\hat{\vec{\theta}}) \cdot b_i} - 1 = 0
\;\Leftrightarrow\;
\hat{s} = n - \sum_{i=1}^{N} R_i(\hat{\vec{\theta}}) \cdot b_i, \tag{82}
\]

and:

\[
\left.\frac{\partial \ln L(s, \vec{\theta})}{\partial \theta_k}\right|_{(s, \vec{\theta})=(\hat{s}, \hat{\vec{\theta}})} = \left( \frac{n}{\hat{s} + \sum_{i=1}^{N} R_i(\hat{\vec{\theta}}) \cdot b_i} - 1 \right) \cdot \sum_{i=1}^{N} \left( \left.\frac{\partial R_{ik}(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\theta}_k} \right) \cdot \frac{R_i(\hat{\vec{\theta}}) \cdot b_i}{R_{ik}(\hat{\theta}_k)} - \sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \hat{\theta}_j = 0
\;\Rightarrow\;
\sum_{j=1}^{M} [\Sigma^{-1}]_{kj} \hat{\theta}_j = 0, \tag{83}
\]

where we substituted in Eq. 82. The above can be simultaneously satisfied for all k by writing \(\Sigma^{-1} \hat{\vec{\theta}} = \vec{0} \Rightarrow \hat{\vec{\theta}} = \vec{0}\). Accordingly, \(\hat{s} = n - \sum_{i=1}^{N} R_i(\vec{0}) \cdot b_i = n - \sum_{i=1}^{N} b_i\). Our significance of discovery in the asymptotic limit is thus:

\[
Z = \sqrt{ -2 \cdot \ln \frac{L(0, \hat{\hat{\vec{\theta}}})}{L(\hat{s}, \hat{\vec{\theta}})} }
= \sqrt{ -2 \cdot \left[ n \cdot \ln\left( \frac{\sum_{i=1}^{N} R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i}{n} \right) + n - \sum_{i=1}^{N} R_i(\hat{\hat{\vec{\theta}}}) \cdot b_i - \frac{1}{2} \cdot \hat{\hat{\vec{\theta}}}^{\,\top} \Sigma^{-1} \hat{\hat{\vec{\theta}}} \right] }. \tag{84}
\]

Assuming Asimov data, we let \(n = s + \sum_{i=1}^{N} b_i\) in Eq. 84 and appropriately sub in the solutions from Eq. 81.

2.4.1 N Backgrounds and M = N Decorrelated Constraints, 1-Per-Background
We consider N backgrounds with M = N Gaussian constraints, one per background. We also assume the constraints are decorrelated (i.e., Σ = I ⇒ Σ⁻¹ = I, where I is the identity matrix). Then \(R_i(\vec{\theta}) \rightarrow R_i(\theta_i)\) and \(R_{ij}(\theta_j) \rightarrow R_i(\theta_i)\) (i.e., the total response function for background i is only a function of θ_i, the nuisance parameter governing the constraint which affects only background i), implying:

\[
\frac{\partial R_{ik}(\theta_k)}{\partial \theta_k} = \frac{\partial R_i(\theta_i)}{\partial \theta_k} = \delta_{ik} \cdot \frac{\partial R_k(\theta_k)}{\partial \theta_k}.
\]

Applying the above to Eq. 81 yields:

\[
\left\{ \left( \frac{n}{\sum_{i=1}^{N} R_i(\hat{\hat{\theta}}_i) \cdot b_i} - 1 \right) \cdot \left.\frac{\partial R_k(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} \cdot b_k - \hat{\hat{\theta}}_k = 0\,;\quad k = 1, \ldots, N \right\}. \tag{85}
\]

For an additional level of simplicity, we assume the response functions are linear in θ_k and that \(\sigma^{+}_k = \sigma^{-}_k \equiv \sigma_k\): \(R_k(\theta_k) = \sigma_k \cdot \theta_k + 1\), where σ_k can be interpreted as the relative "uncertainty" on the normalization of background k. Consider the redefinitions: \(R_k(\theta_k) \cdot b_k \rightarrow b_k(\theta_k)\) and \(\sigma_k \cdot b_k \rightarrow \sigma_k\) (i.e., σ_k is now an absolute "uncertainty"). Then we let \(\hat{\hat{b}}_k \equiv b_k(\hat{\hat{\theta}}_k)\) and:

\[
\left.\frac{\partial R_k(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} \cdot b_k = \left.\frac{\partial \left( R_k(\theta_k) \cdot b_k \right)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} = \left.\frac{\partial b_k(\theta_k)}{\partial \theta_k}\right|_{\theta_k=\hat{\hat{\theta}}_k} = \sigma_k. \tag{86}
\]

We identify \(\hat{\hat{\theta}}_k = (\hat{\hat{b}}_k - b_k)/\sigma_k\) (i.e., \(\hat{\hat{\theta}}_k\) controls the tuning of the best-fit \(\hat{\hat{b}}_k\) away from the nominal value \(b_k\), consistent with our earlier definition of the response function), so inserting this and Eq. 86 into Eq. 85:

\[
\left\{ \left( \frac{n}{\sum_{i=1}^{N} \hat{\hat{b}}_i} - 1 \right) \cdot \sigma_k - \frac{\hat{\hat{b}}_k - b_k}{\sigma_k} = 0\,;\quad k = 1, \ldots, N \right\}, \tag{87}
\]

which can be solved to yield the best-fit values of the backgrounds in the absence of signal. Finally, with our assumptions and redefinitions, Eq. 84 becomes:

\[
Z = \sqrt{ -2 \cdot \left( n \cdot \ln\left( \frac{\sum_{i=1}^{N} \hat{\hat{b}}_i}{n} \right) + n - \sum_{i=1}^{N} \left( \hat{\hat{b}}_i + \frac{(b_i - \hat{\hat{b}}_i)^2}{2\sigma_i^2} \right) \right) }, \tag{88}
\]

which is our significance of discovery in the asymptotic limit. Assuming Asimov data, we would let \(n = s + \sum_{i=1}^{N} b_i\). This system of equations is solved in a general sense in Appendix A.
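As a sketch of how Eqs. 87 and 88 can be evaluated in practice (the yields are assumed toy values and the helper `asymptotic_Z` is illustrative, not code from the paper), a generic root-finder suffices for the coupled system:

```python
import numpy as np
from scipy.optimize import fsolve

def asymptotic_Z(s, b, sigma):
    """Median discovery significance from Eqs. 87-88 with Asimov data n = s + sum(b).

    b and sigma are arrays of nominal background yields and their absolute
    uncertainties (an illustrative helper under the linear-response assumption)."""
    b = np.asarray(b, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = s + b.sum()
    # Eq. 87: (n / sum(bhh) - 1) * sigma_k - (bhh_k - b_k) / sigma_k = 0 for all k.
    eqs = lambda bhh: (n / bhh.sum() - 1.0) * sigma - (bhh - b) / sigma
    bhh = fsolve(eqs, b)
    assert np.allclose(eqs(bhh), 0.0, atol=1e-6)
    # Eq. 88 with the solved best-fit backgrounds substituted in.
    q0 = -2.0 * (n * np.log(bhh.sum() / n) + n
                 - np.sum(bhh + (b - bhh) ** 2 / (2.0 * sigma ** 2)))
    return np.sqrt(q0)

# Toy point: s = 10 on top of b1 = b2 = 25 with absolute uncertainties 1.25 and 2.5
# (5% and 10% relative); Z comes out at roughly 1.27 for these inputs.
Z = asymptotic_Z(s=10.0, b=[25.0, 25.0], sigma=[1.25, 2.5])

# Sanity check: as sigma -> 0 the constraints pin the backgrounds and Z approaches
# the familiar single-bin Asimov result sqrt(2 * ((s + b) ln(1 + s/b) - s)).
s0, btot = 10.0, 50.0
Z_tight = asymptotic_Z(s0, [25.0, 25.0], [0.01, 0.01])
Z_exact = np.sqrt(2.0 * ((s0 + btot) * np.log(1.0 + s0 / btot) - s0))
assert abs(Z_tight - Z_exact) < 1e-3
```

The σ → 0 limit provides a convenient cross-check that the profiled constraint terms behave as expected.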
We have also numerically validated our results for the scenario where we have 2 backgrounds, each with 1 Gaussian constraint. To simulate our yields, we sample a Poisson PDF with mean s + b₁ + b₂ in our SR. Additionally, to avoid biasing ourselves, we must sample the nominal value of each of the nuisance parameters from a Gaussian PDF centered on b_i with width σ_i for i = 1 and then i = 2. The sampled values then become the "true" nominal values of the Gaussian constraints used in our maximum likelihood fit (i.e., the constraints look like \(G(\tilde{b}_i | b_i, \sigma_i)\) with \(\tilde{b}_i\) sampled from \(G(x | b_i, \sigma_i)\)). The median significance of discovery has been calculated for 70 different hypothesized measurement scenarios (i.e., 7 curves with 10 points per curve), and we plotted Eq. 88 continuously alongside these numerical results (derived via the procedure described in Appendix A). This is shown in Figs. 4 and 5. As before, we see excellent agreement between the numerical and asymptotic results over the range of theoretical yields and parameters studied.

We have also plotted alongside our results the "naive" approximation of the significance of discovery:

\[
Z = \frac{s}{\sqrt{s + b_1 + b_2 + \sigma_1^2 + \sigma_2^2}}, \tag{89}
\]

(where σ₁ and σ₂ are given as the absolute uncertainties on backgrounds 1 and 2, respectively) and indeed in the low s/b (i.e., high b₁ + b₂) regime we see good agreement between all three methods. In the s/b ∼ 1 regime (i.e., low b₁ + b₂), the naive approximation fails and no longer shows good agreement with the asymptotic and numerical methods, as expected.
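A condensed sketch of this toy procedure (2,000 toys rather than 50,000, and an assumed single point s = 10, b₁ = b₂ = 25 with absolute uncertainties σ₁ = 1.25, σ₂ = 2.5; not the paper's actual code) might look like:

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(42)
s = 10.0
b = np.array([25.0, 25.0])        # nominal background yields
sigma = np.array([1.25, 2.5])     # absolute uncertainties

def q0(n, b_tilde):
    """Discovery test statistic for one toy: profile the backgrounds at mu = 0."""
    eqs = lambda bh: (n / bh.sum() - 1.0) * sigma - (bh - b_tilde) / sigma
    bh = fsolve(eqs, b_tilde)
    # mu-hat = (n - sum(b_tilde)) / s here, so by convention q0 = 0 when mu-hat < 0.
    if n <= b_tilde.sum():
        return 0.0
    return max(0.0, -2.0 * (n * np.log(bh.sum() / n) + n
                            - np.sum(bh + (b_tilde - bh) ** 2 / (2.0 * sigma ** 2))))

# Toys thrown under the signal-plus-background hypothesis; the constraint centres
# b~_i are themselves sampled, as described above, to avoid biasing the result.
z_vals = []
for _ in range(2000):
    n = rng.poisson(s + b.sum())
    b_tilde = rng.normal(b, sigma)
    z_vals.append(np.sqrt(q0(n, b_tilde)))
z_mc = np.median(z_vals)
```

The median of √q₀ over the toys should land near the Asimov prediction of Eq. 88 for the same inputs (roughly 1.27 here), mirroring the agreement seen in Figs. 4 and 5.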
Figure 4: The median significance of discovery as a function of SR background 1 yield (b₁) and SR background 2 yield (b₂) for the 1 SR bin + 2 decorrelated Gaussian constraints measurement described in Sec. 2.4.2. The SR signal yield is assumed to be 10. The relative uncertainties on backgrounds 1 and 2 are assumed to be σ₁ = 5% and σ₂ = 10%, respectively. "Numerical" refers to the results calculated using MC toys (50,000 events for the estimation of f(t|µ′ = 0) and 50,000 events for the estimation of f(t|µ′ = 1), per point), "Asymptotic" refers to Eq. 88, and "Simple" refers to \(s/\sqrt{s + b_1 + b_2 + \sigma_1^2 + \sigma_2^2}\).

Figure 5: The median significance of discovery as a function of the uncertainty on the SR background 1 yield (σ₁), the SR background 1 yield (b₁), and the SR background 2 yield (b₂) for the 1 SR bin + 2 decorrelated Gaussian constraints measurement described in Sec. 2.4.2. The SR signal yield is assumed to be 10. The relative uncertainty on the background 2 yield is assumed to be σ₂ = 10%. "Numerical" refers to the results calculated using MC toys (50,000 events for the estimation of f(t|µ′ = 0) and 50,000 events for the estimation of f(t|µ′ = 1), per point), "Asymptotic" refers to Eq. 88, and "Simple" refers to \(s/\sqrt{s + b_1 + b_2 + \sigma_1^2 + \sigma_2^2}\).

2.4.3 Assuming N = 1 Backgrounds and M = 1 Constraints
As a special case of the previous section, we consider only N = 1 backgrounds with M = N = 1 Gaussian constraints on this background (with 1 background and 1 constraint, we are automatically in the regime of "decorrelated" constraints). Letting b ≡ b₁, θ ≡ θ₁, σ ≡ σ₁, Eq. 87 becomes:

\[
\left( \frac{n}{\hat{\hat{b}}} - 1 \right) \cdot \sigma - \frac{\hat{\hat{b}} - b}{\sigma} = 0
\;\Leftrightarrow\;
\left( n - \hat{\hat{b}} \right) \cdot \sigma^2 - (\hat{\hat{b}} - b) \cdot \hat{\hat{b}} = 0
\;\Leftrightarrow\;
-\hat{\hat{b}}^2 + (b - \sigma^2) \cdot \hat{\hat{b}} + n\sigma^2 = 0
\;\Rightarrow\;
\hat{\hat{b}} = \frac{-(b - \sigma^2) \pm \sqrt{(b - \sigma^2)^2 + 4n\sigma^2}}{-2}, \tag{90}
\]

but as \((b - \sigma^2)^2 + 4n\sigma^2 > (b - \sigma^2)^2\) (all variables in this expression are positive), we choose the minus sign to give an overall positive (i.e., physical) solution for \(\hat{\hat{b}}\):

\[
\hat{\hat{b}} = \frac{(b - \sigma^2) + \sqrt{(b - \sigma^2)^2 + 4n\sigma^2}}{2}. \tag{91}
\]

Using Eq. 88, our significance of discovery is:

\[
Z = \sqrt{ -2 \cdot \left( n \cdot \ln\left( \frac{\hat{\hat{b}}}{n} \right) + n - \hat{\hat{b}} - \frac{(b - \hat{\hat{b}})^2}{2\sigma^2} \right) }, \tag{92}
\]

with Eq. 91 appropriately subbed in, matching what is shown in Eq. 26 of Ref. [12].

We have presented a collection of derivations of generalized formulae for estimating the median significance of discovery in the asymptotic limit for various measurement models in HEP. The formulae have been verified to agree with numerical results using MC toys in select cases; other times, they are shown to reduce to known formulae in simpler cases derived by similar means. In the low s/b regime, simpler versions of the asymptotic formulae based on s/√b do just as well as the more accurate formulae derived in this document. In the s/b ∼ 1 regime, however, the simpler formulae fail, and the formulae derived here should be used instead. Future work includes simplifying the results for N SRs + M CRs in Section 2.3 when the CRs are assumed to be pure in each background (i.e., diagonal τⁱ ∀ i ∈ [1, N]), generalizing the type of constraint in Section 2.4 (e.g., Gaussian, log-normal, etc.), and generalizing the NPs in Section 2.4 to also act on signal.

References

[1] K. Cranmer,
Practical Statistics for the LHC, CERN Yellow Rep. CERN-2014-003 (2015) 267–308, arXiv:1503.07622.

[2] G. Cowan, K. Cranmer, E. Gross, and O. Vitells, Asymptotic formulae for likelihood-based tests of new physics, Eur. Phys. J. C71 (2011) 1554, arXiv:1007.1727.

[3] S. S. Wilks, The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses, Ann. Math. Statist. 9 (1938) 60–62, DOI: 10.1214/aoms/1177732360.

[4] A. Wald, Tests of statistical hypotheses concerning several parameters when the number of observations is large, Trans. Amer. Math. Soc. 54 (1943) 426–482.

[5] P. Ongmongkolkul, C. Deil, C. Cheng, A. Pearce, E. Rodrigues, H. Schreiner, M. Marinangeli, L. Geiger, and H. Dembinski, scikit-hep/probfit: 1.1.0 (Version 1.1.0), November 5, 2018.

[6] F. James and M. Roos, Minuit – A System for Function Minimization and Analysis of the Parameter Errors and Correlations, Comput. Phys. Commun. 10 (1975) 343–367, DOI: 10.1016/0010-4654(75)90039-9.

[7] H. Dembinski, P. Ongmongkolkul, C. Deil, D. Menéndez Hurtado, M. Feickert, H. Schreiner, Andrew, C. Burr, F. Rost, A. Pearce, L. Geiger, B. M. Wiedemann, Gonzalo, and O. Zapata, scikit-hep/iminuit: v1.5.2 (Version v1.5.2), September 24, 2020.

[8] C. R. Harris, K. J. Millman, S. J. van der Walt, et al., Array programming with NumPy, Nature 585 (2020) 357–362, DOI: 10.1038/s41586-020-2649-2.

[9] P. Virtanen, R. Gommers, T. E. Oliphant, et al., SciPy 1.0: fundamental algorithms for scientific computing in Python, Nature Meth. 17 (2020) 261–272, DOI: 10.1038/s41592-019-0686-2.

[10] J. D. Hunter, Matplotlib: A 2D graphics environment, IEEE Comput. Sci. Eng. 9 (2007) 90–95, DOI: 10.1109/MCSE.2007.55.

[11] J. S. Conway, Incorporating Nuisance Parameters in Likelihoods for Multisource Spectra, Proc. PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding (2011) 115–120, arXiv:1103.0354.

[12] The ATLAS Collaboration, Formulae for Estimating Significance, Tech. Rep. ATL-PHYS-PUB-2020-025, CERN, 2020, URL: https://cds.cern.ch/record/2736148.

Appendix A: Solution to \(A/\sum_j x_j + B_i x_i + C_i = 0\), \(i = 1, \ldots, N\)

We consider solutions of the system of equations:

\[
\left\{ \frac{A}{\sum_{j=1}^{N} x_j} + B_i x_i + C_i = 0\,;\quad i = 1, \ldots, N \right\}, \tag{93}
\]

where A is a constant and \(B_i\) and \(C_i\) are equation-dependent constants, analogous to Eq. 87. Subtracting the j-th equation from the i-th equation:

\[
B_i x_i + C_i - B_j x_j - C_j = 0
\;\Leftrightarrow\;
x_j = \frac{B_i x_i + (C_i - C_j)}{B_j}, \tag{94}
\]

which can then be inserted in the i-th equation of Eq. 93 to yield an expression entirely in terms of \(x_i\):

\[
A + B_i x_i \sum_{j=1}^{N} x_j + C_i \sum_{j=1}^{N} x_j = 0
\;\Leftrightarrow\;
A + B_i x_i \sum_{j=1}^{N} \frac{B_i x_i + (C_i - C_j)}{B_j} + C_i \sum_{j=1}^{N} \frac{B_i x_i + (C_i - C_j)}{B_j} = 0
\;\Leftrightarrow\;
\left( B_i^2 \sum_{j=1}^{N} \frac{1}{B_j} \right) x_i^2 + \left( B_i \sum_{j=1}^{N} \frac{2C_i - C_j}{B_j} \right) x_i + \left( A + C_i \sum_{j=1}^{N} \frac{C_i - C_j}{B_j} \right) = 0, \tag{95}
\]

which may be solved using the quadratic equation, selecting the real and positive root for \(x_i\) (or the answer which satisfies the system of equations as the physical solution, as \(x_i \leftrightarrow \hat{\hat{b}}_i\)). From Eq. 87, we identify \(A = n\), \(B_i = -1/\sigma_i^2\), and \(C_i = b_i/\sigma_i^2 - 1\). Thus in Eq. 95, we know:

\[
B_i^2 \sum_{j=1}^{N} \frac{1}{B_j} = -\frac{1}{\sigma_i^4} \sum_{j=1}^{N} \sigma_j^2 < 0, \tag{96}
\]

as \(\sigma_i > 0\) ∀ i. From Eq. 95, we identify the quadratic equation:

\[
-|\tilde{A}_i| x_i^2 + \tilde{B}_i x_i + \tilde{C}_i = 0, \tag{97}
\]

where:

\[
|\tilde{A}_i| = \left| B_i^2 \sum_{j=1}^{N} \frac{1}{B_j} \right| = \frac{1}{\sigma_i^4} \sum_{j=1}^{N} \sigma_j^2\,,
\qquad
\tilde{B}_i = B_i \sum_{j=1}^{N} \frac{2C_i - C_j}{B_j} = \frac{1}{\sigma_i^2} \sum_{j=1}^{N} \sigma_j^2 \cdot \left( \frac{2b_i}{\sigma_i^2} - \frac{b_j}{\sigma_j^2} - 1 \right)\,,
\qquad
\tilde{C}_i = A + C_i \sum_{j=1}^{N} \frac{C_i - C_j}{B_j} = n + \left( 1 - \frac{b_i}{\sigma_i^2} \right) \cdot \sum_{j=1}^{N} \sigma_j^2 \cdot \left( \frac{b_i}{\sigma_i^2} - \frac{b_j}{\sigma_j^2} \right), \tag{98}
\]

which has the solutions:

\[
x_i = \frac{\tilde{B}_i \pm \sqrt{\tilde{B}_i^2 + 4|\tilde{A}_i|\tilde{C}_i}}{2|\tilde{A}_i|}. \tag{99}
\]

Using Eq. 98, it can be shown that:

\[
\tilde{B}_i = \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} - (\sigma_i^2 - b_i) \cdot |\tilde{A}_i|, \tag{100}
\]

and therefore:

\[
\tilde{B}_i^2 + 4|\tilde{A}_i|\tilde{C}_i
= \left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} \right)^2 + (\sigma_i^2 - b_i)^2 \cdot |\tilde{A}_i|^2 - 2 \cdot (\tilde{C}_i - n) \cdot |\tilde{A}_i| + 4|\tilde{A}_i|\tilde{C}_i
= \left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} \right)^2 + (\sigma_i^2 - b_i)^2 \cdot |\tilde{A}_i|^2 + 2 \cdot (\tilde{C}_i - n) \cdot |\tilde{A}_i| + 4n|\tilde{A}_i|
= \left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} + (\sigma_i^2 - b_i) \cdot |\tilde{A}_i| \right)^2 + 4n|\tilde{A}_i|, \tag{101}
\]

which is always positive, so we are always guaranteed a real root in Eq. 99. We consider the difference:

\[
\left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} + (\sigma_i^2 - b_i) \cdot |\tilde{A}_i| \right) - \left( \frac{\tilde{C}_i - n}{\sigma_i^2 - b_i} - (\sigma_i^2 - b_i) \cdot |\tilde{A}_i| \right) = 2 \cdot (\sigma_i^2 - b_i) \cdot |\tilde{A}_i|. \tag{102}
\]

If \(\sigma_i^2 > b_i\), the above is positive: this means the argument of our square root will always be larger than \(\tilde{B}_i^2\) and we must select the positive sign solution for equation i for a physical solution. By Eq. 100, we know:

\[
\tilde{B}_i = \frac{1}{\sigma_i^2} \cdot \sum_{j=1}^{N} \sigma_j^2 \cdot \left( \frac{b_i}{\sigma_i^2} - \frac{b_j}{\sigma_j^2} \right) + (b_i - \sigma_i^2) \cdot |\tilde{A}_i|. \tag{103}
\]

If \(b_i > \sigma_i^2\), the second term in the above is positive.
And if \(\sum_{j=1}^{N} \sigma_j^2 \cdot \left( b_i/\sigma_i^2 - b_j/\sigma_j^2 \right) < 0\) with \(\left| \sum_{j=1}^{N} \sigma_j^2 \cdot \left( b_i/\sigma_i^2 - b_j/\sigma_j^2 \right) \right| > (b_i - \sigma_i^2) \cdot |\tilde{A}_i|\), then we must select the positive sign solution for a physical solution. In the case where \(\sum_{j=1}^{N} \sigma_j^2 \cdot \left( b_i/\sigma_i^2 - b_j/\sigma_j^2 \right) >\)