Calculating p -values and their significances with the Energy Test for large datasets
February 6, 2018
W. Barter, C. Burr, C. Parkes
School of Physics and Astronomy, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
Abstract
The energy test method is a multi-dimensional test of whether two samples are consistent with arising from the same underlying population, through the calculation of a single test statistic (called the T-value). The method has recently been used in particle physics to search for differences between samples that arise from CP violation. The generalised extreme value function has previously been used to describe the distribution of T-values under the null hypothesis that the two samples are drawn from the same underlying population. We show that, in a simple test case, the distribution is not sufficiently well described by the generalised extreme value function. We present a new method, where the distribution of T-values under the null hypothesis when comparing two large samples can be found by scaling the distribution found when comparing small samples drawn from the same population. This method can then be used to quickly calculate the p-values associated with the results of the test.

Introduction
A key problem in data science is determining if two samples, measured in multi-dimensional spaces, are consistent with having arisen from the same underlying population. This can be alternatively phrased as asking whether there exists evidence for differences between two samples. One area where this question is crucial is in searches for direct CP violation, which can appear as local differences of event yields in the multi-dimensional phase space between samples of matter and anti-matter data. Several different approaches have been suggested to address this question. One approach that has recently been developed and used in the context of CP violation is that of the Energy Test Method [1-6]. This method calculates a single test statistic, the energy or T-value, from the distribution of the data in the two samples. The value of this quantity can then be used to determine the consistency of the two samples by comparing to the expected distribution of the T-value under the null hypothesis that the two samples are drawn from the same population. We will label this the T distribution. However, the analytic form that this distribution takes is not clearly understood. Understanding this distribution better will enable further application of the energy test method in the coming years, in samples containing a large number of events.

Within the energy test method the T-value is calculated as

\[
T = \frac{1}{2}\,\frac{1}{n_1(n_1-1)} \sum_{i \neq j}^{n_1} \psi_{ij}
  + \frac{1}{2}\,\frac{1}{n_2(n_2-1)} \sum_{i \neq j}^{n_2} \psi_{ij}
  - \frac{1}{n_1 n_2} \sum_{i,j}^{n_1,n_2} \psi_{ij}, \qquad (1)
\]

where the first sum is over pairs of points (or events) in the first sample, the second sum is over pairs of points in the second sample, and the final sum is over both samples. There are n_1 events in the first sample and n_2 events in the second sample.
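Equation (1) can be written down directly in code. The following is a minimal sketch (not from the paper; the function name, the NumPy-based implementation, and the default δ are illustrative) of computing T for two samples with the Gaussian weighting function:

```python
import numpy as np

def energy_test_T(sample1, sample2, delta=0.5):
    """Compute the energy test statistic T of Eq. (1) for two samples of
    shape (n, d), using the Gaussian weighting function
    psi_ij = exp(-d_ij^2 / (2 * delta^2))."""
    def psi(a, b):
        # squared Euclidean distances between all pairs of rows of a and b
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * delta ** 2))

    n1, n2 = len(sample1), len(sample2)
    p11 = psi(sample1, sample1)
    p22 = psi(sample2, sample2)
    p12 = psi(sample1, sample2)
    # i != j sums: subtract the diagonal (psi_ii = 1 for the Gaussian kernel)
    t1 = (p11.sum() - np.trace(p11)) / (2.0 * n1 * (n1 - 1))
    t2 = (p22.sum() - np.trace(p22)) / (2.0 * n2 * (n2 - 1))
    t3 = p12.sum() / (n1 * n2)
    return t1 + t2 - t3
```

For two samples drawn from the same population the returned value fluctuates around zero; a large positive value signals a difference between the samples. Note the O(n²) cost: all pairwise distances are evaluated.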
The value ψ is a weighting function, commonly taken as a Gaussian of some scaled Euclidean distance squared (d_{ij}^2) between two points in the (potentially multi-dimensional) phase space, \( \psi_{ij} = e^{-d_{ij}^{2}/(2\delta^{2})} \), where δ is a length-scale that is optimised for the problem under consideration. However, other choices have been made for this weighting function (for example logarithms of the Euclidean distance between points). The energy test essentially calculates the mean of the ψ distribution using pairs of events taken from each sample independently (the first two terms), and then considers cross-terms between the two samples (the final term), with the added consideration that since the same events are used many times to calculate distances, the different terms in the sums that calculate these means are correlated. In the case where the two samples are drawn from the same population the expected T-value is 0. A large, positive T-value indicates differences between the samples. The method has recently been extended to also enable comparison of samples containing impurities [4].

Estimating the T distribution is an important open question that we consider in this article. This question lies at the heart of interpreting whether the returned test statistic is significant, and whether the two samples are consistent with being drawn from the same underlying population. To date, the typical approach to find this distribution when analysing data is to randomly assign the data to be tested to two samples, and calculate the T-value of these permuted samples, which are created, by definition, under the null hypothesis (taking a 'permutation approach'). By repeating this process multiple times the distribution of T-values under the null hypothesis can be found, and used to determine the p-value associated with the nominal test, by counting how often the permutations return a more extreme T-value than the true data.
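The permutation ('counting') approach described above can be sketched as follows; this is an illustrative implementation (the function name and defaults are ours, not from the paper), taking any function `t_stat` that maps two samples to a T-value:

```python
import numpy as np

def permutation_p_value(sample1, sample2, t_stat, n_perm=1000, seed=0):
    """Estimate the p-value of the observed T-value by the counting
    ('permutation') approach: pool the two samples, repeatedly split the
    pool at random into two samples of the original sizes, and count how
    often the permuted T is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([sample1, sample2])
    n1 = len(sample1)
    t_obs = t_stat(sample1, sample2)
    n_extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        t = t_stat(pooled[perm[:n1]], pooled[perm[n1:]])
        if t >= t_obs:
            n_extreme += 1
    # (n_extreme + 1) / (n_perm + 1): a common convention that avoids
    # reporting a p-value of exactly zero from a finite number of permutations
    return (n_extreme + 1) / (n_perm + 1)
```

Probing p-values at the 5σ level this way requires millions of permutations, each as expensive as the original test, which is what makes the approach impractical for large samples.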
If the behaviour of the T distribution is not known, then one may need to repeat this process over 1.7 million times to determine whether a T-value has a probability (p-value) of less than 6 × 10⁻⁷ (i.e. in the case where data returns a large, positive T-value, and clearly lies in the tail of the T distribution). Such a probability is equivalent to that of finding a result 5 Gaussian standard deviations (with 5σ significance) from an expected central value, and is the key criterion in particle physics for announcing a discovery.

The use of this permutation approach raises a problem when considering large samples. The calculation of the T-value is computationally intensive, and scales as O(n²), where n is the number of events in the sample (since O(n²) distances must be calculated, used to determine some weighting function, and then summed). Similar calculations must be run many times to accurately probe the tail of the T distribution, since the tail must be understood if a significant result is to be claimed. This increases the time taken in data analysis, and potentially makes the energy test significantly less useful as a data analysis method as sample sizes get larger and if a large number of permutations are required. A clear understanding of the T distribution will therefore allow the use of the energy test to analyse the largest data samples, making its use tractable in cases where it was not before.

It has been suggested [2] that the T distribution seems to follow a generalised extreme value distribution when the energy test is used as a goodness-of-fit test. However, this property has subsequently been used when the energy test is used as a two-sample test [3-6], despite no suggestion that the T distribution takes this form in the initial paper [2].
The assumption of this property has allowed quick calculations of p-values and significances, since a generalised extreme value (GEV) function can be fit to the T distribution found from a limited number of permuted samples, and used to determine the significance of the T-value obtained when analysing the two samples present in the data. This is particularly useful if the sample sizes are too large to generate a large number of permutations quickly, since it allows large significances to be inferred when the T-value in data lies in the tail of the distribution, and is potentially larger than any of the T-values found in the permuted samples. This method has been used to determine p-values, often alongside direct counts of how often the permuted samples return more extreme T-values than the true data [3-6]. With more studies using the energy test expected in the future, we investigate here whether the GEV function adequately describes the tail of the T distribution in a test case of the two-sample comparison problem, and address the question of how to model the distribution of T-values found under the null hypothesis: this is key for quickly performing the energy test and finding an unbiased and precise p-value if the sample T-value is located in the tail of the T distribution. Formal mathematical proofs are not presented in this article. Instead, two different models are considered and used in toy studies to demonstrate the shortcomings and power of different methods. These models are presented in the next section. Following this, the use of the GEV function and a new approach are both presented. (The article that introduced this approach [2] noted that no proof exists that the GEV function describes the distribution in question and that the behaviour must be verified for each specific case.)

Data samples
Two different toy models are considered for these studies:

1. Model 1
This is a very simple model, where events in each sample are generated according to a uniform distribution in three dimensions, with the allowed region of phase space being located between 0 and 1 in all dimensions. Here the probability density of a point in phase space is independent of the location in the space.

2. Model 2
We also use the more complicated, physically-motivated model, first set out in Ref. [7], and also studied in the context of the energy test in Ref. [3] and Ref. [4]. This model considers the decay of a particle X to a three-body final state. The axes considered in this problem are formed from the three different invariant mass combinations of pairs of the final-state particles. The presence of intermediate resonances means that the density of events depends on the location in the phase space. This model is generated using the Laura++ package [8].

In both cases two different choices of the function ψ_{ij} are used: a Gaussian, as set out above, and a logarithmic function −log|d + ε|, where ε, like δ above, is a parameter to be chosen by the analyst, and which can be optimised for the study in question.

The generalised extreme value function takes three free parameters, µ, σ and ξ, and describes the distribution of a variable, labelled here as x. For ξ < 0 it is non-zero in the range −∞ < x < µ − σ/ξ, while for ξ > 0 it is non-zero in the range µ − σ/ξ < x < +∞. In the special case where ξ = 0 it is non-zero for all values of x on the real axis. The cumulative distribution is given by

\[
\mathrm{CDF}(x) = \exp\!\left(-\left(1 + \xi\,\frac{x-\mu}{\sigma}\right)^{-1/\xi}\right), \qquad (2)
\]

in the relevant ranges defined above, for ξ ≠ 0. In the case where ξ = 0, the cumulative distribution is

\[
\mathrm{CDF}(x) = \exp\!\left(-\exp\!\left(-\frac{x-\mu}{\sigma}\right)\right). \qquad (3)
\]

It is not obvious whether the GEV function should be used to describe the T distribution.
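Equations (2) and (3) can be evaluated numerically as follows; a sketch (the function name is ours), noting that SciPy's `genextreme` parametrises the shape with the opposite sign, c = −ξ:

```python
import numpy as np
from scipy.stats import genextreme

def gev_cdf(x, mu, sigma, xi):
    """Cumulative distribution of the GEV function, Eqs. (2) and (3)."""
    x = np.asarray(x, dtype=float)
    if xi == 0.0:
        # Eq. (3): the Gumbel limit
        return np.exp(-np.exp(-(x - mu) / sigma))
    # Eq. (2); clipping handles points outside the support, where the
    # CDF is 0 (for xi > 0, left of mu - sigma/xi) or 1 (for xi < 0, right of it)
    z = np.clip(1.0 + xi * (x - mu) / sigma, 0.0, None)
    with np.errstate(divide="ignore"):
        return np.exp(-z ** (-1.0 / xi))
```

This agrees with `scipy.stats.genextreme.cdf(x, c=-xi, loc=mu, scale=sigma)` for all three sign cases of ξ.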
If we take a weighting function for the energy test that can only take values between 0 and 1, such as the Gaussian function discussed above, then the T distribution must be contained in the range −1 to 1, given the form of Equation 1. However, any fit of the GEV function must contain either a finite probability for a T-value above 1 if ξ ≥ 0, or a finite probability for a T-value below −1 if ξ ≤ 0. Therefore, regardless of the value of ξ, with the GEV function the cumulative distribution is never solely defined in the range −1 < T < 1; it is therefore possible to find a T-value where the significance is either underestimated or overestimated. We perform empirical studies to determine whether this poses problems when seeking to describe reasonable T distributions that can be generated, or if the GEV function is nevertheless sufficient to describe the distribution at a level needed within particle physics studies. (The literature also uses the parameter c = −ξ.)

Figure 1: The (left) T-values found using the Gaussian weighting function are shown as black points (for comparing two 1000-event samples generated using model 1), with the fit of a GEV function shown as the solid red line; (right) the corresponding cumulative distribution. The dashed blue line corresponds to scaled T-values where only 500 events are used in each sample. On the cumulative plot the differences between the 1000 and 500 event T distributions are too small to be visible.

We generate two samples in Model 1, each containing 1000 points (or events), and calculate the T-value. This is done twice: once for the Gaussian weighting function and once for the logarithmic function. This is repeated 5 million times in order to create a distribution of T-values for each weighting function (a permutation approach is not used, since 'toy' data can be quickly generated to find the distribution of T-values). With both samples generated from the same underlying distribution, the distributions of T-values found are the relevant T distributions for each weighting function. For both the Gaussian weighting function and the logarithmic function we use δ = ε = 0.5. The T distributions are shown in Figures 1 and 2. A GEV function is fit to the distributions in both cases using the binned maximum likelihood method, with the fit also shown on the same figures, alongside the cumulative distribution and the cumulative distribution associated with the best fit. In these fits a normalisation, and the three parameters (µ, σ, ξ), are left unconstrained. It is clear for both weighting functions that the fit function does not describe the data.

In addition, further cross-checks are made to see if the GEV function can be used to describe the data. First, χ² fits are performed, and no good χ² value is returned. Second, the parameters in the GEV function left unconstrained in the fit are reduced to a normalisation and a value of ξ. This is achieved by fixing µ using the relation that the cumulative probability is 1/e when x = µ, and noting that when the cumulative distribution is 0.5, the relevant x-value, x_{1/2} (the median), can be used to relate σ and ξ through σ = ξ(x_{1/2} − µ)/((log 2)^{−ξ} − 1). A fit with these constraints is then made to the T distribution. This does not change our conclusions.

In the simple model studied here the fit returns a value for the parameter ξ > 0, and we find that the p-value tends to be overestimated in the tail of the distribution at large T. In this case the discovery of a new effect might be missed. If the fit returns a value of ξ < 0, the p-value tends to be underestimated, and such a discovery may be incorrectly claimed.

Figure 2: The figure contains the same information as Figure 1 but using the logarithmic weighting function.
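The two relations used to fix µ and σ in the constrained cross-check fit can be verified numerically; a short sketch (the parameter values are arbitrary illustrations, not fit results from the paper, and SciPy's c = −ξ sign convention is used):

```python
import numpy as np
from scipy.stats import genextreme

# arbitrary illustrative GEV parameters (not fit results)
mu, sigma, xi = 0.01, 0.002, 0.1
dist = genextreme(c=-xi, loc=mu, scale=sigma)  # SciPy shape parameter c = -xi

# the cumulative probability at x = mu is 1/e, independently of xi
assert np.isclose(dist.cdf(mu), np.exp(-1.0))

# the median x_half fixes sigma once xi is known:
#   sigma = xi * (x_half - mu) / ((log 2)**(-xi) - 1)
x_half = dist.ppf(0.5)
sigma_check = xi * (x_half - mu) / (np.log(2.0) ** (-xi) - 1.0)
assert np.isclose(sigma_check, sigma)
```

Both identities follow directly from Eq. (2) by setting the CDF to 1/e and to 0.5 respectively.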
This finding represents an important note on the application of the Energy Test method: the GEV function has already been used in the literature to fit the T distribution, and to determine p-values by extrapolating the fit results. This toy model is a simple case where that method is not valid: here, the use of the GEV function leads to the over-estimation of the p-value for large T-values, and a new effect or discovery might be missed. We therefore turn our attention to a novel approach to efficiently determining the T distribution.

The energy test (as set out in Equation 1) corresponds to the calculation of the mean values of the weighting function ψ, using correlated inputs, with O(n²) terms (or similar) in each sum. If all the terms in calculating these means were not correlated (i.e. the same events were not re-used to calculate multiple distances), then increasing the sample sizes by a common factor k would simply scale the behaviour of each of the three terms in the T distribution: the distribution of kT would be independent of the sample size (for sufficiently large samples). We investigate empirically whether this property also holds here.

We first test using model 1, by repeating the previous studies, but using two samples of 500 events (as opposed to the 1000-event samples set out above). This is repeated 30 million times to find the T distribution. The calculated T-values are scaled by a factor of 500/1000 = 0.5, and overlaid on Figures 1 and 2. This gives a much better description of the T-values in the 1000-event samples than the GEV function.

We make further studies by generating a sample using model 2 containing 1 000 000 events. We use this sample to investigate the T distribution further: we randomly assign events to two smaller sub-samples (here taken to contain an equal number of events, n) and calculate the T-value and the value of nT associated with running the energy test to compare these sub-samples. This is performed 30 million times for each value of n considered. The distribution of nT is shown in Figures 3 and 4 for different values of n, for the Gaussian and logarithmic weighting functions (in both cases we use δ = ε = 0.5). The figures also mark the positions on the nT distribution where the p-value for getting such a result corresponds to 1, 2, 3, 4, and 5σ evidence for differences between the samples respectively, for different values of n.

Figure 3: (Left) The p-values associated with a particular value of nT in model 2, considering the Gaussian weighting function (also known as the survival function). This is shown for different values of n, with the p-values associated with 1, 2, 3, 4, and 5σ significance marked as vertical lines for the largest value of n we consider, n_max = 50 000. (Right) The values of nT required for these p-values are plotted as a function of n. The uncertainty on these values is estimated and shown using the semi-transparent lines. These lines are generated by bootstrapping (sampling with replacement) a new set of T-values from the original set and determining the relevant value of nT in this new sample. This is repeated 100 times. For ease of comparison between different values of n, the expected values of nT required to achieve 1, 2, 3, 4, and 5σ significance for n = n_max = 50 000 are also shown as horizontal lines.

It is clear that in these cases the value of nT associated with a particular level of significance is independent of n for sufficiently large n (typically around 100 events, for the levels of significance we have investigated). We have also applied this method to additional models, and used different weighting functions (ψ), and have found no evidence for the breaking of this scaling property. We also find that this scaling property does not rely on the sample sizes being equal, so that if the sizes of the samples being compared are increased by a factor k, the distribution of kT under the null hypothesis is invariant (for large sample sizes). However, we note that we have no formal proof of this property. Indeed, for the Gaussian weighting function, the value of nT must lie in the range −n < nT < n. Therefore the use of this scaling property to estimate p-values will also provide incorrect coverage for some large (positive or negative) value of T. However, in the cases we have examined, this effect appears negligible for n larger than about 100 points when considering significances of around 5σ and smaller. We recommend similar tests are performed for each specific case where the method is used.

This scaling property means that the T distribution can be generated using a small value of n, and then scaled to determine the appropriate T distribution for the sample sizes under consideration in the main test. This speeds up the computation of the significance of T-values when the p-value is small.

Figure 4: The figure contains similar information to Figure 3 but using the logarithmic weighting function, and with n_max = 2 000.
The smaller value of n_max is used here owing to the significant computing time taken to calculate 30 million T-values for large n.

With this method the calculation of the distribution of the null hypothesis no longer requires the generation of over one million permuted samples of the same size as the initial samples to claim a p-value smaller than that corresponding to 5 Gaussian standard deviations. Instead, the T distribution can be quickly generated using small samples; only the one calculation of the T-value of the main data sample remains computationally intensive, and uses the full event yield in the calculation. Consequently it is with the biggest datasets that the impact of this scaling property is most significant.

The energy test is a standard method within data science that measures whether two samples are consistent with arising from the same underlying population. The method has recently been used for studies in particle physics. The small p-values necessary to claim a discovery in particle physics require understanding rare T-values returned by the energy test method under the null hypothesis that the two samples are identical. Problems arise if the datasets under study are so large that the distribution under the null hypothesis cannot be simulated quickly with a permutation method, so the tail of this distribution cannot be studied sufficiently. In the existing literature the generalised extreme value distribution has been fit to the distribution, with an extrapolation of the fit then used to determine the p-values associated with large T-values [3-6] (often alongside a straightforward 'counting method' of determining the p-value from how often tests of the permuted samples return T-values larger than that in the test of the true samples). However, we have shown here for a simple test case that the T distribution is not sufficiently well described by the generalised extreme value function.
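The scaling method described above can be sketched end-to-end. In this illustration (the function name and defaults are ours, not from the paper), `t_stat` stands for any implementation of the energy test statistic; the null distribution of n·T is built from cheap small-sample comparisons and the observed value is scaled by the full sample size before being located in its tail:

```python
import numpy as np

def scaled_null_p_value(t_obs, n_big, pooled, t_stat, n_small=500,
                        n_trials=10000, seed=0):
    """Estimate the p-value of t_obs (from comparing two n_big-event
    samples) using the scaling property: the distribution of n*T under
    the null hypothesis is approximately independent of n for
    sufficiently large n.  The null distribution of n_small*T is built
    from inexpensive small-sample comparisons, and n_big*t_obs is then
    located in its tail."""
    rng = np.random.default_rng(seed)
    nT_null = np.empty(n_trials)
    for i in range(n_trials):
        idx = rng.choice(len(pooled), size=2 * n_small, replace=False)
        a, b = pooled[idx[:n_small]], pooled[idx[n_small:]]
        nT_null[i] = n_small * t_stat(a, b)
    # count how often the scaled null exceeds the scaled observed value
    n_extreme = np.sum(nT_null >= n_big * t_obs)
    return (n_extreme + 1) / (n_trials + 1)
```

Only the single evaluation of `t_stat` on the full samples (to obtain `t_obs`) carries the O(n_big²) cost; every null trial costs only O(n_small²).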
We have therefore also presented a new method where small sub-samples of the data, which can be analysed quickly, can be used to find the distribution of T-values expected under the null hypothesis for large sample sizes. In this way, the tail of the distribution under the null hypothesis can be probed. This allows the accurate determination of small p-values associated with claims of discovery of new physical phenomena.

Acknowledgements
We wish to thank Roger Barlow, Igor Babuschkin, Giulio Dujany, Marco Gersabeck, Gediminas Sarpis, Mike Williams, and Günter Zech for illuminating discussions. We also thank Günter Zech for the clarification that if the GEV function describes the distribution when used for goodness-of-fit studies (as considered in Ref. [2]) this does not imply it also describes the distribution for two-sample tests, despite its subsequent use in such studies. This work was supported by STFC grant number ST/N000374/1.
References

[1] B. Aslan and G. Zech, New test for the multivariate two-sample problem based on the concept of minimum energy, J. Stat. Comput. Simul. (2005) 109.
[2] B. Aslan and G. Zech, Statistical energy as a tool for binning-free, multivariate goodness-of-fit tests, two-sample comparison and unfolding, Nucl. Instrum. Meth. A537 (2005) 626.
[3] M. Williams, Observing CP violation in many-body decays, Phys. Rev. D84 (2011) 054015, arXiv:1105.5338.
[4] C. Parkes et al., On model-independent searches for direct CP violation in multi-body decays, J. Phys. G44 (2017) 085001, arXiv:1612.04705.
[5] LHCb collaboration, R. Aaij et al., Search for CP violation in D⁰ → π⁻π⁺π⁰ decays with the energy test, Phys. Lett. B740 (2015) 158, arXiv:1410.4170.
[6] LHCb collaboration, R. Aaij et al., Search for CP violation in the phase space of D⁰ → π⁺π⁻π⁺π⁻ decays, Phys. Lett. B769 (2017) 345, arXiv:1612.03207.
[7] M. Williams, How good are your fits? Unbinned multivariate goodness-of-fit tests in high energy physics, JINST (2010) P09004, arXiv:1006.3019.
[8] J. Back et al., Laura++: a Dalitz plot fitter, arXiv:1711.09854.