Biased bootstrap sampling for efficient two-sample testing
Thomas P. S. Gillam∗ and Christopher G. Lester†

Abstract
The so-called ‘energy test’ is a frequentist technique used in experimental particle physics to decide whether two samples are drawn from the same distribution. Its usage requires a good understanding of the distribution of the test statistic, T, under the null hypothesis. We propose a technique which allows the extreme tails of the T-distribution to be determined more efficiently than is possible with present methods. This allows quick evaluation of (for example) multi-sigma confidence intervals that would otherwise have required prohibitively costly computation times or approximations to have been made. Furthermore, we comment on other ways that T computations could be sped up using established results from the statistics community. Beyond two-sample testing, the proposed biased bootstrap method may provide benefit anywhere extreme values are currently obtained with bootstrap sampling.

The two-sample test statistic, T, described in Section 2, has many uses in particle physics. The authors of [BBP18] seek to address practical problems associated with the calculation of the extreme tails of T-distributions. This paper develops those themes further, concentrating on two areas:

• We observe that a significant section of the particle physics community dealing with T appears to have become separated from the mathematical statistics literature dealing with Maximum Mean Discrepancy (MMD) metrics, and other integral measures of distributional difference, even though the two fields are strongly interrelated. In Section 3 we describe the connections between these fields, and highlight known results which can directly lead to performance improvements.

• Determining the shapes of the tails of T-distributions by simple means is non-trivial and can require prohibitive computational resources. It is to overcome this problem that [BBP18] suggests the use of an approximate scaling conjecture so as to obtain approximate knowledge of the T-distribution’s shape in a sensible time. We note that an alternative way of proceeding is to use a Markov chain to overlay a precisely controlled bias on the vanilla bootstrap procedure (see Section 4) so that, after compensation for this bias, the shape of the tails can become known as precisely as the shape of the core of the distribution. Such an approach could be useful in any case where those approximations are deemed unsuitable or undesirable, and could even be used simultaneously for added performance. Our description of this proposal comprises the remainder of this document, from Section 5 onwards.

It is of little practical benefit to be able to generate estimates of the shapes of T-distributions if their accuracy cannot be quantified. Our proposed binned shape estimates are made by un-biasing and re-normalising histograms that have been filled from Markov chain sequences having internal correlations. Consequently, the uncertainties in every bin of our shape estimates are non-trivially dependent on each other. Our procedure for estimating uncertainties in resulting T-distributions is therefore itself non-trivial, and is documented in an Appendix so as not to break the flow of the paper. We note, however, that our method of estimating uncertainties is considerably faster than the process of calculating the quantities which they themselves constrain.

In brief, the Markov chain history is used to generate an estimator for the underlying Markov chain transition matrix, which is itself used to derive the covariance matrix for every pair of bin counts prior to the weighting and normalisation step. The final uncertainties (after the re-weighting and normalisation process, which introduces further correlations between bins) may then be written as a simple function of the former set of covariances, provided that (as has already been assumed) the Markov chain has circulated through most of its domain ‘a few’ times.

∗ GAM Systematic | Cantab
† University of Cambridge
The T-statistic

The ‘energy test’ makes use of a test statistic T. This statistic maps two sets, S and S̄, to a real number T(S, S̄), abbreviated T. We will refer to each set as a ‘sample’ and to the elements of each set as ‘events’. The number of events in S is denoted by n, and the number in S̄ by n̄. It is assumed that each event e_i ∈ S (respectively ē_j ∈ S̄) is independent and identically distributed with underlying unknown distribution p (respectively p̄), i.e. e_i ∼ p and ē_j ∼ p̄. In common with most two-sample test statistics, the job of T is to assist in determining whether or not p is the same as p̄. Summarising crudely: if p is very similar to p̄ then T is expected to take values close to or below zero, while increasing differences between p and p̄ should (at fixed n and n̄) lead to ever greater positive values of T. The definition of T used in [BBP18] is given in equation (1),

T = \frac{1}{2}\,\frac{1}{n(n-1)} \sum_{i \neq j} \psi(e_i, e_j) + \frac{1}{2}\,\frac{1}{\bar{n}(\bar{n}-1)} \sum_{i \neq j} \psi(\bar{e}_i, \bar{e}_j) - \frac{1}{n\bar{n}} \sum_{i,j} \psi(e_i, \bar{e}_j),   (1)

in which ψ is an appropriately chosen kernel function which maps any two events (whether from S, S̄ or both) to a number. In any real-world use-case, the choice of ψ is probably the single biggest determinant of the usefulness of T as a statistic. Much work should therefore go into choosing ψ appropriately. Our own paper, however, is not interested in the statistical optimality of T, but seeks instead to improve the efficiency with which the tails of T-distributions (for any ψ) can be sampled. Consequently, we fix ψ to be the same exponential function of a Euclidean distance between events used by [BBP18]:

\psi(e_i, e_j) = \exp\left[ -\frac{|e_i - e_j|^2}{2\delta^2} \right],   (2)

with δ fixed to the value used in [BBP18], even though many alternative choices of ψ are possible.
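For concreteness, equations (1) and (2) can be evaluated directly as follows (a sketch; the function names and the default δ = 1 are our own illustrative choices, not values from [BBP18]):

```python
import numpy as np

def psi(a, b, delta=1.0):
    """Gaussian kernel of equation (2) for two single events.
    delta is a placeholder length-scale, not the value from [BBP18]."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * delta ** 2))

def energy_T(S, S_bar, delta=1.0):
    """Direct O(max(n, n_bar)^2) evaluation of the T-statistic of
    equation (1).  S and S_bar are arrays of shape (n, d) and (n_bar, d)."""
    n, n_bar = len(S), len(S_bar)

    def gram(A, B):
        # Pairwise squared Euclidean distances via broadcasting.
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * delta ** 2))

    K = gram(S, S)
    K_bar = gram(S_bar, S_bar)
    K_x = gram(S, S_bar)
    # Subtracting the trace implements the i != j restriction.
    t_ss = (K.sum() - np.trace(K)) / (n * (n - 1))
    t_bb = (K_bar.sum() - np.trace(K_bar)) / (n_bar * (n_bar - 1))
    t_sb = K_x.sum() / (n * n_bar)
    return 0.5 * t_ss + 0.5 * t_bb - t_sb
```

The quadratic cost of this direct evaluation is exactly the problem ‘P1’ discussed below.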
Conceptually there is a difference between (a) the single value of T that might be computed from a sample S obtained from a physical detector, together with a reference sample S̄ obtained from somewhere else, and (b) the (hypothetical) T-values that could appear under a null hypothesis that p = p̄. To distinguish between the two we refer to the former statistic as T(S, S̄), to indicate that it is a function of the particular samples it digests, and denote the latter random variable by T(p, n, n̄), to emphasise that the values it can take are governed by the model p assumed in the null hypothesis, together with the numbers of events that will populate each sample. Whether talking of the statistic T(S, S̄) or the random variable T(p, n, n̄), equation (1) still applies.

When hypothesis testing, it is of central importance that it is possible to evaluate P(T(p, n, n̄) ≥ T̂), which is the probability that T(p, n, n̄) exceeds any fixed value T̂. If p is independently known, this probability can (in principle) be determined by generating many independent samples S and S̄ from p, from which can be recorded the fraction of resulting T-values which are sufficiently extreme. In practice, there are two reasons such an approach is problematic. The first problem (‘P1’) is that the formula for T in equation (1) has a complexity which scales as max(n, n̄)². This would appear to be bad news for particle physics, where it is not unusual to need very large samples. The second problem (‘P2’) is that, even if T evaluation is fast, it would be necessary to generate O(3.5 × 10⁸) independent sets S and S̄ and their associated T-values in order to determine the value of T corresponding to a 5-sigma excursion with 10% precision. If neither issue were addressed, a naive 5-sigma p-value calculation in particle physics could thus easily require an infeasible number of calculations.

The work of [BBP18] could be said to address P1, insofar as it observes empirically that, for large enough n, there exist modest values of k such that the desired T(p, n, n)-distribution is well approximated by the distribution of (k/n) T(p, k, k). Since the complexity of the latter statistic is O(k²) and not O(n²), it has addressed P1 successfully.

In contrast, our own work addresses P2. That is to say: we seek to find better ways of sampling the extreme tails of the T-distribution (e.g. to calculate 5-sigma or 15-sigma p-values), independently of whether the T-statistic (or an approximation to it) is costly to calculate. Our work is thus complementary to that of [BBP18]. Future users may apply both strategies at once, or use either separately, depending on their needs.

In passing, we note that users who need to address P1 (perhaps because they have large n), but who cannot use the method of [BBP18] (perhaps because their n is not large enough for the scaling conjecture to apply), may wish to consider an O(n) technique for calculating values of the T-statistic with controlled uncertainty.

Given the decision to fix δ, there is strong motivation to use distributions p and p̄ having length scales of similar or slightly larger order. For this reason our results in Section 7 use a p which distributes events e_i uniformly within a unit cube.

The above T-statistic was first introduced in 2002–2005 in the High Energy Physics community [AZ02, AZ05], and this method has been cited in several publications from the LHCb collaboration [A+15, among others], as well as in related work [SPH07, GBR+12].

A formal correspondence between energy distances and MMD is known in the statistics literature [SSGF13]; however, there does not appear to be an appreciation of this link in the High Energy Physics literature. We suspect that recent developments in fast and efficient two-sample tests would be of interest to the LHCb collaboration and others. In particular we would highlight an efficient spectral approximation of the null hypothesis distribution [GFHS09], the B-test [ZGB13], and the use of ‘random Fourier features’ in reducing the computational complexity of MMD test-statistic computation from O(n²) to O(n) [ZM15]. Our contribution is orthogonal to the techniques mentioned in this section, since we directly target the efficiency of the bootstrap method [ET93] itself. We also note that our method can be applied anywhere the bootstrap is used, for example in independence testing [ZFGS18].

As has already been mentioned, the authors of [BBP18] conjecture that, in the case of equal sample sizes n = n̄, the distribution of nT is approximately independent of n for sufficiently large n. A proof is not provided therein; rather, an appeal is made to the uncorrelated case, in which the scaling is known to be exact. Following [BBP18], others [Zec18] have investigated the relation in more detail and affirmed the conjecture. We observe, however, that such results have been known for a decade in other fields. For example, in 2009 it was shown in [GFHS09] that there is an asymptotic limit for the distribution of nT as n → ∞, namely

nT \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left( z_l^2 - 2 \right),

where z_l ∼ N(0, 1) for all l, and λ_l represents the l-th eigenvalue in the following equation:

\int_E \tilde{\psi}(e_i, e_j)\, \phi_l(e_i)\, dp(e_i) = \lambda_l\, \phi_l(e_j).

In this context, the φ_l are the eigenfunctions, and

\tilde{\psi}(e_i, e_j) = \psi(e_i, e_j) - \mathbb{E}_e\, \psi(e_i, e) - \mathbb{E}_e\, \psi(e, e_j) + \mathbb{E}_{e,e'}\, \psi(e, e').
Not only does this confirm the conjecture of [BBP18], as does [Zec18]; it goes further by providing an expression for the limiting distribution for an arbitrary kernel function.
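As an illustration of the spectral approach of [GFHS09] mentioned above, the eigenvalues λ_l can be estimated from the centred kernel (Gram) matrix of a sample, and null-distribution draws then simulated from the limiting form. This is a sketch under our own conventions; normalisation conventions vary between references, so it should be checked against [GFHS09] before serious use:

```python
import numpy as np

def spectral_null_draws(sample, delta=1.0, n_draws=10000, rng=None):
    """Estimate eigenvalues lambda_l from the doubly-centred Gram matrix
    of `sample` (shape (n, d)), then simulate draws of the limiting
    distribution sum_l lambda_l * (z_l^2 - 2) with z_l ~ N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(sample)
    d2 = np.sum((sample[:, None, :] - sample[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * delta ** 2))      # Gaussian kernel Gram matrix
    H = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    Kc = H @ K @ H                            # empirical version of psi-tilde
    lam = np.linalg.eigvalsh(Kc) / n          # empirical eigenvalue estimates
    z = rng.standard_normal((n_draws, n))
    return (z ** 2 - 2.0) @ lam
```

Each returned draw approximates one realisation of the limiting random variable; quantiles of the returned array then approximate null-distribution quantiles.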
Suppose one has a desire to calculate properties of the T(p, n, n̄)-distribution when in possession of incomplete or limited information about p. Specifically, suppose that knowledge of p is limited to the availability of one sample S_m containing m events, each independently sampled from p. In such a situation, bootstrap sampling can help. Bootstrap sampling [Efr79] makes use of an approximation p̃ to the unknown distribution p made by combining m delta distributions, one centred on each of the events in S_m:

\tilde{p}(e) = \frac{1}{m} \sum_{e_i \in S_m} \delta(e - e_i).

This distribution p̃, sometimes referred to as the ‘empirical distribution for p’, places probability only at the locations where events have already been seen. With p̃ so defined, samples drawn from T(p̃, n, n̄) then become nothing other than values of T generated from n- and n̄-sized collections of events drawn independently from S_m with replacement. Any of the desired properties of T(p, n, n̄) may then be estimated by repeated sampling from T(p̃, n, n̄).

While bootstrap sampling provides nothing more than an estimate of any of the desired properties of T(p, n, n̄), the literature [Efr81] suggests that differences between estimates and underlying parameters become negligible for our purposes when the number of events min[n, n̄, m] exceeds 50 or 100. For the purposes of this paper we therefore make the assumption (in keeping with [BBP18]) that bootstrap sampling (rather than ‘true’ sampling) introduces only negligible uncertainties in all the places we use it. While it is legitimate to question the validity of this assumption, doing so is not relevant for the purposes of comparing our approach to that of [BBP18].

Observe that equation (1) above differs from equation (3) of [GBR+12] only by a factor of 2. In practice the eigenvalues can be estimated empirically for finite sample size, and Section 2 of [GFHS09] details this method. Typically one might have m = n + n̄, in the case in which one considers the union S ∪ S̄ as the set of events from which to draw bootstrap samples.

T-tails

Unavoidably, generators of uncorrelated samples produce extreme samples only rarely. It has already been mentioned that this is what makes it computationally inefficient to generate multi-sigma tails of distributions using vanilla bootstrap sampling. We propose that, to overcome this difficulty, the independent bootstrap sampler of [BBP18] be replaced with a Metropolis–Hastings Markov chain that: (a) generates correlated samples via a random walk on a state-space of bootstrap samples, and (b) has a calculable bias toward extreme values. The bias helps the chain to find the tails, while the correlations permit the tails (once discovered) to be explored more thoroughly.
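For reference, the vanilla (independent) bootstrap sampling of T(p̃, n, n̄) that the biased chain is intended to improve upon can be sketched as follows (a sketch assuming some `energy_T(S, S_bar)` function implementing equation (1); the names are our own):

```python
import numpy as np

def vanilla_bootstrap_T(S_m, n, n_bar, n_samples, energy_T, rng=None):
    """Estimate the T(p~, n, n_bar) distribution by repeatedly drawing
    n and n_bar events, with replacement, from the empirical sample S_m."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(S_m)
    out = np.empty(n_samples)
    for k in range(n_samples):
        S = S_m[rng.integers(0, m, size=n)]          # n events drawn from p~
        S_bar = S_m[rng.integers(0, m, size=n_bar)]  # n_bar events drawn from p~
        out[k] = energy_T(S, S_bar)
    return out
```

Because every sample here is independent, a tail region of probability q is visited only about q × n_samples times, which is the inefficiency the biased chain addresses.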
Correlation between bootstrap samples in the Markov chain may be introduced in various ways. The simplest method is to require that the ‘proposal distribution’, Q̂, which suggests Markov-state transitions from state s_i to state s_{i+1}, should update only a fixed fraction of the events forming each bootstrap sample. For our purposes, we found it sufficient to use a Q̂ which leaves a random 90% of the events in each bootstrap sample unchanged, and draws the remaining 10% of events as fresh samples from the relevant empirical distribution. Users working on other problems may find it necessary to investigate alternative choices. This particular proposal algorithm Q̂ happens to have a density, Q, which is symmetric: Q(s_{i+1} | s_i) = Q(s_i | s_{i+1}).

Bias toward extreme values of T may be obtained by modifying, with a weighting function f(T), the ratio ρ of Q-factors used in the Metropolis–Hastings acceptance rule. Specifically: a proposal to move to a new Markov state s_{i+1} given a current state s_i is made. The proposal is accepted only if a random number chosen uniformly from the real interval between 0 and 1 is found to be less than the ratio ρ, where:

\rho = \frac{Q(s_i \mid s_{i+1})}{Q(s_{i+1} \mid s_i)}\, \frac{f(T(s_{i+1}))}{f(T(s_i))}.

If not accepted, the existing state is instead re-asserted (i.e. s_{i+1} is set equal to s_i). Since our Q is symmetric, its density cancels in the definition of ρ above and so does not need to be evaluated. If p(T) dT is (by definition) the desired T-distribution, then with the above rule the random walk intrinsically samples from the weighted density p(T) f(T) dT, and so the resulting T-values generated by the chain must be unweighted to bring them back to the desired distribution. The considerable freedom in the choice of f(T) is constrained by the improvements desired.
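The random walk and acceptance rule above can be sketched as follows (a minimal illustration, assuming some `energy_T` function for equation (1) and a given weight function `f`; the 90:10 refresh ratio follows the choice described above, but all names and the restriction to equal sample sizes are our own):

```python
import numpy as np

def biased_bootstrap_chain(S_m, n, n_steps, energy_T, f, rng=None):
    """Metropolis-Hastings random walk on bootstrap samples, biased by f(T).
    Returns the visited T-values; weighting each by 1/f(T) recovers the
    unbiased T-distribution.  Uses equal sample sizes n = n_bar for brevity."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(S_m)
    # Current state: two index arrays defining the two bootstrap samples.
    idx = rng.integers(0, m, size=(2, n))
    T_cur = energy_T(S_m[idx[0]], S_m[idx[1]])
    out = np.empty(n_steps)
    for k in range(n_steps):
        prop = idx.copy()
        # Redraw a random ~10% of the events in each sample; this proposal
        # is symmetric, so Q cancels in the acceptance ratio.
        for row in range(2):
            sel = rng.random(n) < 0.1
            prop[row, sel] = rng.integers(0, m, size=sel.sum())
        T_prop = energy_T(S_m[prop[0]], S_m[prop[1]])
        rho = f(T_prop) / f(T_cur)
        if rng.random() < rho:          # accept; otherwise re-assert old state
            idx, T_cur = prop, T_prop
        out[k] = T_cur
    return out
```

With f ≡ 1 this reduces to an (auto-correlated) unbiased sampler; with f growing in the tail region, the chain lingers there and the 1/f(T) re-weighting restores the correct normalisation.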
If the goal of the biasing is to obtain quantitative information about the tails, a minimal requirement would be that f(T) allow efficient recirculation of the Markov chain between the bulk and the tail regions. Without this, co-normalisation of the tail with respect to the bulk will not be achieved. Practically speaking, this means that f(T) must be approximately proportional to 1/p(T), so that the Markov chain will visit all T-values approximately uniformly. A measure of the degree of departure of f(T) from 1/p(T) could be taken to be the largest factor, K, by which one part of the approximately uniform density exceeds another, within a subset τ of the interesting range of T-values:

K(\tau) = \exp\left( \max_{T_1, T_2 \in \tau} \left| \log\left( \frac{f(T_1)\, p(T_1)}{f(T_2)\, p(T_2)} \right) \right| \right) \geq 1.   (3)

As a rule of thumb: keeping K below ∼10 in interesting regions of T could be considered a first-order requirement for efficiency.

While the Metropolis–Hastings method is well defined for a wide class of proposal functions, a poor choice of proposal function could easily lead to very inefficient or incomplete sampling. We found our sampler to be stable with respect to large changes in this 90:10 ratio, but others using this method will have to determine what sort of proposal function has, for their application, the right balance between high variability (to allow state-space exploration) and correlation (to allow benefit to be obtained from information learned). By allowing proposals to change around 10% of the events in any one Markov ‘step’, the sampler can, in principle, reach an entirely independent bootstrap sample in order ten successful steps. By this measure, our proposal function has a relatively short memory. By ‘unweighted’ we mean ‘weighted with weight 1/f(T)’.
Values of K larger than this would indicate that there are parts of the T-distribution in which the Markov chain is lingering for an order of magnitude more time than necessary, meaning that a better choice of f(T) might have allowed equally good precision for an order of magnitude less computation time.

Since the efficiency of the sampler is strongly dependent on the choice of weight function f(T), it must be chosen carefully. Our approach is to identify a class of functions whose tails have similar properties to the tails of T-distributions, and which are completely specified by their mean, second central moment and third central moment. We then use a very short pre-sampling phase (which makes no use of weight functions for tail boosting) to gain crude estimates of the mean, second central moment and third central moment of our desired T-distribution. The member of the aforementioned class of functions having the same moments is then identified, and its reciprocal (at least in the T-range of interest) is used as the final weight function f(T). Inevitably, the moment estimates obtained from such a pre-run are unlikely to be optimal: the pre-run is unlikely to sample far into the tails, and third moments are notoriously unstable things to estimate from finite samples. Nonetheless, the bar for tolerability is low: a weight function f(T) will speed things up if it is not pathological and is a better approximation to the desired function (in our case 1/p(T)) than is a constant. Both are relatively easy to achieve.

T-tails

Strictly speaking, T-distributions are discrete and have bounded support. At large n, however, for all practical purposes they are well approximated by continuous distributions with support on a semi-infinite region of the real line. We make the heuristic observation that the logarithm of the probability density function is often well approximated by a hyperbola with a vertical asymptote on one side and an oblique asymptote on the other.
Examples of such hyperbolae are shown in the inset plot within Figure 1. This motivates considering a class of probability densities with similar properties. The simplest such class will have three parameters: one degree of freedom being needed to control the location of the vertical asymptote, while the other two specify the location and slope of the oblique asymptote. It is not necessary that the three parameters have exactly those roles, only that there be three degrees of freedom capable of providing the necessary control. Indeed, since an overall location parameter may be trivially introduced into any distribution that lacks one, it is sufficient to look for two-parameter distributions whose location is not controlled.

Being initially unable to find an off-the-shelf distribution with the above properties, we proposed one of our own. The simplest we could imagine has the following unit-normalised probability density function:

p_{\text{straw}}(x; a, \lambda) = \begin{cases} \dfrac{\exp\left[ -\lambda \left( \frac{x}{a} + \frac{a}{x} \right) \right]}{2\, |a|\, K_1(2\lambda)} & \text{if } ax > 0, \\ 0 & \text{otherwise,} \end{cases}   (4)

in which K_n(z) is a modified Bessel function of the second kind. It is controlled by a scale parameter a (which is also the mode of the distribution) and a skewness parameter λ. The parameter λ is always required to be greater than zero, while a must be non-zero. The distribution is one-sided, with support on (0, ∞) if a is positive, and on (−∞, 0) if a is negative. Some examples of this distribution are shown in Figure 1, in which the inset part of the figure demonstrates the desired hyperbolic properties of the tails.

After the trivial addition of a translational degree of freedom, it is the above density that becomes our idealised T-distribution ‘model’, and whose reciprocal (in the T-regions in which we are interested) becomes our weight function f(T). The process of fitting the above model to a set of T-values from a pre-run could, in principle, be done by many different methods. As we are desirous of obtaining a reliable fit in constant time, we eschew iterative solution finders and instead compute three moments of the data set that is being fitted.

Note that it is neither possible nor desirable to find a weight function that is exactly proportional to 1/p(T) and so having K = 1. To begin with, the existence of such a weight would remove the need to use the sampler, since its only purpose is to estimate p(T). Furthermore, it is not possible to sample from an unbounded uniform distribution, so some departure from f(T) ≈ 1/p(T) will always be needed outside the range of interesting T-values τ. In such regions the simplest prescription is to replace f(T) by a constant; this causes the sampler to return to unbiased behaviour. Note also that it is not necessary to use functions determined by their moments: any parametric or non-parametric function that can approximately fit a T-distribution and can be extrapolated non-pathologically toward the tails could be used. We base our approach on moment fitting simply because it provides a non-iterative fit with constant-time guarantees. Though we have not found the need to do so, one could iterate this whole process if the resulting weight functions were found to be of insufficient quality. For example, a short pre-run without weighting could be used to find a first guess at an optimal weight function, and this could be followed by a secondary pre-run using that first guess to find a more accurate second guess, etc. We initially considered the Generalised Extreme Value (GEV) distribution suggested by [BBP18] as a candidate model for the tails, but dismissed it after noting that its tails do not, in general, fall as the exponential of a linear function of T, and so usually lead to large and inefficient values of K (as defined in equation (3)) in our test cases. After settling on our straw model, however, we noted that a special case of the GEV called the Gumbel distribution has a probability density function with a tail of the right form. It is therefore possible that the Gumbel could have been used in place of our own straw model. We have not investigated this possibility further, but note that, as the Gumbel has one fewer parameter than our model, it is by no means guaranteed that it has sufficient flexibility to fit both the left- and right-hand sides of the T-distribution simultaneously.

Figure 1: Examples of the distributions p_straw(x; a, λ) defined in equation (4). Values of (a, λ) shown are: (1, ) in blue, (1, ) in dirty yellow, and (−, ) in green.

Figure 2: We show a number of different estimates of the T(p, n, n) density for n = 200 events, assuming a distribution p which distributes events uniformly within a unit three-dimensional cube. One estimate (shown in solid red) is generated from N = 25, of our own biased bootstrap MCMC samples and unbiased initialisation samples. It may be compared with the estimate obtained from N = 25, unbiased bootstrap non-MCMC samples (dashed green), and N = 10, , unbiased bootstrap non-MCMC samples (dotted blue). All estimates use the same binning. In each bin, b, the density is shown as the central line, while the shaded region shows the one-sigma uncertainty on the density, as calculated by the method documented in the Appendix. Technically, both the densities and uncertainties are discrete and represent averages over each bin. However, for the purposes of display only, they have been made to look continuous (i.e. the values have been placed at the centre of each bin and joined to neighbours with straight-line interpolation), as this makes the overlapping regions easier to follow.
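The straw-model density of equation (4), and the constant-time moment fit just described (using the closed-form moments supplied in the Appendix), can be sketched as follows. This is an illustration under our own naming; `scipy.special.kv` supplies the modified Bessel functions K_n, and the sample (rather than unbiased) central-moment estimators are used for brevity:

```python
import numpy as np
from scipy.special import kv       # modified Bessel function K_n(z)
from scipy.optimize import brentq  # bracketing root finder

def p_straw(x, a, lam):
    """Unit-normalised straw-model density of equation (4)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    ok = a * x > 0
    out[ok] = np.exp(-lam * (x[ok] / a + a / x[ok])) / (2 * abs(a) * kv(1, 2 * lam))
    return out

def fit_straw(T_values):
    """Constant-time moment fit: returns (a_hat, lam_hat, shift) such that
    p_straw(T - shift; a_hat, lam_hat) matches the sample mean and its
    second and third central moments (sample moments used for brevity)."""
    T = np.asarray(T_values, dtype=float)
    mu = T.mean()
    M2 = ((T - mu) ** 2).mean()
    M3 = ((T - mu) ** 3).mean()
    rho = M3 ** 2 / M2 ** 3

    K = lambda n, lam: kv(n, 2 * lam)

    def R32(lam):
        num = (K(4, lam) * K(1, lam) ** 2
               - 3 * K(3, lam) * K(2, lam) * K(1, lam) + 2 * K(2, lam) ** 3) ** 2
        den = (K(3, lam) * K(1, lam) - K(2, lam) ** 2) ** 3
        return num / den

    # R32 is monotonic decreasing with image (0, 4): solve R32(lam) = rho.
    lam_hat = brentq(lambda l: R32(l) - rho, 1e-2, 50.0)
    a_hat = np.sign(M3) * np.sqrt(
        M2 * K(1, lam_hat) ** 2 / (K(3, lam_hat) * K(1, lam_hat) - K(2, lam_hat) ** 2))
    # Translational degree of freedom chosen so that the means agree.
    shift = mu - a_hat * K(2, lam_hat) / K(1, lam_hat)
    return a_hat, lam_hat, shift
```

The reciprocal of the fitted density, evaluated over the T-range of interest and replaced by a constant elsewhere, then serves as the weight function f(T).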
In the first part of the Appendix we have supplied expressions for the quantities λ and a (and the missing position parameter) in terms of those first three moments. These expressions, together with the moments already calculated, are what we use to perform our fit to pre-run data. We emphasise, however, that many other approaches would be possible that rely on none of those results.

Finally, we note that it could be necessary to invent or use other straw models in any cases where T-distributions were found to look wholly unlike the forms shown in Figure 1. Any mismatch between straw model and real distribution ought to show up readily through poor efficiency and increasing uncertainties on the resulting estimates, and so this area should be largely self-policing.

T-distributions

Figure 2 compares three different estimates of the shape of the T(p, 200, 200) distribution for a p which distributes events uniformly within a three-dimensional unit cube. Two of the shape estimates (green-dashed and blue-dotted) are generated by a vanilla bootstrap process and act as ‘reference’ distributions against which the estimate from our own ‘biased bootstrap’ method (red-solid) can be compared. The ‘biased bootstrap’ estimate is generated from a fixed budget of T-evaluations, following a short run of pre-samples (used to evaluate parameters of the bias function). This is the same number of T-evaluations as used by the green-dashed reference distribution, and so a comparison between the two (between red-solid and green-dashed) provides a fair assessment of the performance benefits of ‘biased bootstrap’ over the reference method. It can be seen that the red and green estimates are in agreement with each other over their common domain, but that the uncertainty of the green reference method increases dramatically once the density falls to a tiny fraction of its height at the peak. The green estimate is evidently useless above T-values of order 0.010, while the ‘biased bootstrap’ estimate appears well controlled up to T ≈ 0.016 (and presumably beyond). To check that the biased-bootstrap prediction is correct beyond the point at which the statistics of the green expire, one can compare it to the dotted blue line, which is generated with three orders of magnitude more statistics. Not only is the agreement between red and blue confirmed to be good everywhere, the benefits of the ‘biased bootstrap’ approach are once again made manifest. The biased method continues to work well above T-values of 0.016, and over seven orders of magnitude of variation in probability density, while the reference method (even with the advantage of 1000 times more computing power) is unable to match this.

A biased bootstrap procedure for determining the shapes of T-distributions has been described. It has been shown to require many orders of magnitude fewer T-evaluations than existing methods when probing distribution tails. The proposed method does not make approximations beyond the use of the bootstrap itself. When desired, it may be used alongside existing methods, such as those in [BBP18], for added speed gains.

Acknowledgements

We gratefully acknowledge feedback from William Barter, Rupert Tombs and an anonymous referee acting for the Journal of Instrumentation. CGL acknowledges support from the United Kingdom’s Science and Technology Facilities Council (STFC) consolidated grants RG79174 (ST/N000234/1) and RG95164 (ST/S000712/1).
Appendix
Moments and other properties of the T-tail straw model

The analytic properties of p_straw(x; a, λ) which we make use of when fitting to samples obtained in the ‘pre-run’ previously mentioned are as follows:

• The mean takes the value
\langle x \rangle = \frac{a\, K_2(2\lambda)}{K_1(2\lambda)}.

• The n-th non-central moment of this distribution is:
\langle x^n \rangle = \frac{a^n\, K_{n+1}(2\lambda)}{K_1(2\lambda)}.

• The 2nd central moment of this distribution is:
M_2 \equiv \left\langle (x - \langle x \rangle)^2 \right\rangle = \langle x^2 \rangle - \langle x \rangle^2 = \frac{a^2 K_3(2\lambda)}{K_1(2\lambda)} - \left( \frac{a K_2(2\lambda)}{K_1(2\lambda)} \right)^2 = \frac{a^2}{K_1^2(2\lambda)} \left( K_3(2\lambda) K_1(2\lambda) - K_2^2(2\lambda) \right).

• The 3rd central moment of this distribution is:
M_3 \equiv \left\langle (x - \langle x \rangle)^3 \right\rangle = \langle x^3 \rangle - 3 \langle x^2 \rangle \langle x \rangle + 2 \langle x \rangle^3 = \frac{a^3}{K_1^3(2\lambda)} \left( K_4(2\lambda) K_1^2(2\lambda) - 3 K_3(2\lambda) K_2(2\lambda) K_1(2\lambda) + 2 K_2^3(2\lambda) \right).

• The ratio R_{3,2}(λ), defined as follows:
R_{3,2}(\lambda) \equiv \frac{M_3^2}{M_2^3} = \frac{\left( K_4(2\lambda) K_1^2(2\lambda) - 3 K_3(2\lambda) K_2(2\lambda) K_1(2\lambda) + 2 K_2^3(2\lambda) \right)^2}{\left( K_3(2\lambda) K_1(2\lambda) - K_2^2(2\lambda) \right)^3}
is a monotonic decreasing function of λ, whose image is the interval (0, 4), as shown in Figure 3.

Using the above results, the moment-fitting procedure can now be described.

1. From the set of T-values obtained in the ‘pre-run’, compute the sample mean (μ̃), an unbiased estimator (M̃_2) for the second central moment M_2, and an unbiased estimator (M̃_3) for the third central moment M_3 of the underlying distribution. Define the constant ρ = M̃_3² / M̃_2³.

2. Numerically, find the value of λ which solves the equation R_{3,2}(λ) = ρ, and call it λ̂.
If and only if the ratio ρ is between 0 and 4, the monotonic and bounded nature of R_{3,2}(λ) guarantees a unique solution that is easy to find by bisection search. If ρ is equal to zero or greater than four, then either the estimators are not yet accurate enough, or the distribution being sampled is not well modelled by our straw model. If the former, then more statistics in the pre-run are needed. If the latter, then a better straw model is needed.

Figure 3: The dependence of R_{3,2}(λ) on λ (plotted against 2 tan⁻¹(λ)/π).

3. Define the constant â by the two requirements:
|\hat{a}| = \sqrt{ \frac{ \tilde{M}_2\, K_1^2(2\hat{\lambda}) }{ K_3(2\hat{\lambda}) K_1(2\hat{\lambda}) - K_2^2(2\hat{\lambda}) } } \quad \text{and} \quad \operatorname{sign} \hat{a} = \operatorname{sign} \tilde{M}_3.
Note that the radicand is guaranteed to be positive for any 0 < λ̂ < ∞ and 0 < M̃_2 < ∞.

4. The ‘fitted’ straw model for the pre-run’s T-distribution may finally be defined to be:
p_{\text{Fit}}(T) \equiv p_{\text{straw}}\!\left( T - \tilde{\mu} + \frac{\hat{a}\, K_2(2\hat{\lambda})}{K_1(2\hat{\lambda})};\ \hat{a}, \hat{\lambda} \right),
since this distribution has the desired tail shapes, while sharing mean, second and third central moments with the estimates obtained from the pre-run. The reciprocal of p_Fit(T) is what will (in the relevant T-range) be used as the weight function to bias the Markov chain.

Uncertainty estimates
Our density estimates are made by un-biasing and normalising histograms that have been filled from Markov chain sequences having internal correlations. It is very helpful to be able to determine the uncertainties on the resulting density predictions in each bin quickly, from whatever Markov chain run generates the central values of the density estimate. More formally:

• Suppose that there exists a mechanism, M, that can generate a length-$N$ sequence of values, $T_1, T_2, \ldots, T_N$, as outputs of a Markov chain with fixed (but unknown) transition matrix $P$.

• Suppose that each of the random variables $T_i$ takes a value in a discrete set of indices which, without loss of generality, will be taken to be integers between 1 and $B$ inclusive, and which will be referred to as 'bins' for short.

• Suppose that $\vec t$ is a particular instance of such a sequence: $\vec t = \{t_1, t_2, \ldots, t_N\}$, sampled by the mechanism M from the Markov chain governed by $P$.

• Suppose that the symbol '$s_b$' is used to denote the number of occurrences of the value $b$ in the sequence $t_1, t_2, \ldots, t_N$, so that $\vec s = \{s_1, s_2, \ldots, s_B\}$ may represent counts in the bins of a histogram.

• Suppose that $\vec s'' = \{s''_1, s''_2, \ldots, s''_B\}$ represents a potentially non-linear transformation of the bin counts in $\vec s$. Equivalently, suppose that there are $B$ functions $g_1, g_2, \ldots, g_B$ such that $s''_b = g_b(s_1, s_2, \ldots, s_B)$. [This non-linear transformation will represent the re-weighting and normalisation steps required in our density estimation.]

The exact formula for our weighting and normalising procedure is given later in (6). It is relevant that $P$ is unknown because, although the biased-bootstrap process is a Markov chain, it is one implemented using a Metropolis-Hastings algorithm and therefore has no direct knowledge of its underlying transition matrix, even though one exists in principle.
Since the underlying T-values are not binned, some approximation is introduced here. We have not been able to derive a generally applicable estimate of the size of the effect introduced by this approximation, though we believe it to be small in the cases that matter to us. There is room here for further work.

Suppose that the quantities $\vec S$, $\vec S''$, $S_b$ and $S''_b$ are defined in the same way as $\vec s$, $\vec s''$, $s_b$ and $s''_b$ above, but represent random variables over the process governed by $P$, rather than a particular sample generated therefrom by the mechanism M.

Given those assumptions, we may pose two questions whose answers assist in determining the uncertainties of our density estimates. These questions are:

1. What are the values of $\mathrm{Var}[S''_1], \mathrm{Var}[S''_2], \ldots, \mathrm{Var}[S''_B]$ in terms of $P$?

2. If M exists but $P$ is not known, is it possible to write down an estimator for $P$ based on only the single sample $\vec t$, and is this estimator 'good enough' to use in place of $P$ in calculating the above variances?

The expression 'good enough' is important and relevant even when answering the first question, because it is meaningless to talk about 'exact' uncertainties. An uncertainty exists to provide an approximate measure of how close a prediction might be to an ideal underlying value. Asking how uncertain the uncertainty is would lead to a never-ending descent into the uncertainties on uncertainties on uncertainties, etc. What is needed, therefore, is a pragmatic approach to answering the above questions that ensures the answer is fit for purpose, and no more.

Using this freedom, we choose to restrict our attention to cases in which the Markov chain has begun to reach equilibrium. Prior to such a time any density estimate would be largely meaningless, and talk of its uncertainty doubly so.
Assuming that appropriate tests have been done to ensure that the above criterion is met, the simple answer to 'Question 2' is:

'Yes: a good estimator for $P$ is the transition matrix obtained by looking at the history $\vec t$ and finding within it the proportion of times with which it moved to bin $i$ given that it was already in bin $j$.'

Using the convention that $P$ is left-stochastic, we therefore take its estimator to have the value
$$\frac{1}{s_j} \sum_k \delta_{i,\,t_{k+1}} \, \delta_{j,\,t_k}$$
in its $i$th row and $j$th column. Not only is this choice heuristically simple, these are also maximum likelihood estimators for the elements of $P$ (see discussions in [Nor97] and [TL09]). Is it 'good enough'? Again, we appeal to the 'equilibrium' assumption to argue that it is. While these will never represent the true elements of $P$, we do not need them to do so. More quantitative arguments justifying this estimator for $P$ are found in [TL09].

What of Question 1? To answer this we proceed in stages. To begin with, we have proved (assuming as before that $P$ is left-stochastic) the intermediate result that:
$$\mathrm{Cov}[S_b, S_c] = N \pi_b (\delta_{bc} - \pi_c) + \left\{ \left( \frac{NQ}{I - Q} - \frac{Q - Q^{N+1}}{(I - Q)^2} \right)_{\!cb} \pi_b + [\,b \leftrightarrow c\,] \right\} \qquad (5)$$
within which a number of new symbols have been introduced. In particular: $Q$ is defined to be the matrix $Q = P - P^\infty$, in which $P^\infty$ is defined by $P^\infty = \lim_{n \to \infty} P^n$, which may itself be shown to be given by $P^\infty = (I - P + U)^{-1} \cdot U$ if $U$ is a $B \times B$ matrix consisting entirely of ones. Furthermore, the $B$-vector $\vec\pi$, having components $\vec\pi = (\pi_1, \pi_2, \ldots, \pi_B)$, contains the limiting probabilities for the chain to end up in any of its $B$ states. $\vec\pi$ may be shown (see [TL09]) to be given by $\vec\pi = (I - P + U)^{-1} \cdot \vec 1$ if $\vec 1$ is a $B$-vector consisting entirely of ones.
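To make the estimator for $P$ and the result (5) concrete, here is a sketch (our own illustrative code and function names, not from the text) that estimates a left-stochastic $\hat P$ from a binned history and evaluates $\mathrm{Cov}[S_b, S_c]$ for a chain assumed to be started in equilibrium:

```python
import numpy as np


def estimate_P(t, B):
    """Maximum-likelihood left-stochastic transition matrix from a history t
    of bin indices in {0, ..., B-1}: entry [i, j] is (moves j -> i) / (visits to j).
    Assumes every bin occurs at least once before the final step."""
    counts = np.zeros((B, B))
    for a, b in zip(t[:-1], t[1:]):
        counts[b, a] += 1.0
    return counts / counts.sum(axis=0, keepdims=True)


def occupancy_covariance(P, N):
    """Equation (5): (pi, Cov[S_b, S_c]) for an N-step chain with
    left-stochastic transition matrix P, assuming a stationary start."""
    B = P.shape[0]
    I, U = np.eye(B), np.ones((B, B))
    pi = np.linalg.solve(I - P + U, np.ones(B))    # pi = (I - P + U)^{-1} 1
    P_inf = np.linalg.solve(I - P + U, U)          # P_inf = (I - P + U)^{-1} U
    Q = P - P_inf
    A = N * Q @ np.linalg.inv(I - Q) \
        - (Q - np.linalg.matrix_power(Q, N + 1)) @ np.linalg.inv((I - Q) @ (I - Q))
    cov = N * (np.diag(pi) - np.outer(pi, pi))     # multinomial-like first term
    cov += A.T * pi[:, None] + A * pi[None, :]     # (.)_{cb} pi_b + [b <-> c]
    return pi, cov
```

For small $B$ and $N$ the result can be checked against a brute-force enumeration of all stationary-start paths, which is how we convinced ourselves the sketch matches (5).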
Finally we note that the repeated index $b$ on the right-hand side of (5) does not indicate that the summation convention is being used.

[Footnotes: In loose terminology, we assume the chain has made at least 'a few' effectively independent visits to each bin of the space. Equivalently, we may require that the chain has lost knowledge of its starting point 'a few' times over. We had to derive (5) ourselves, as we were unable to find it anywhere else in the literature, although its first few terms appear in many works ([TL09] is one) which calculate the limit, as $N \to \infty$, of $\mathrm{Cov}[S_b, S_c]/N$. We have not provided a proof of the result here, mostly for reasons of space, but also because the proof is not particularly illuminating and should be little more than an exercise for persons more familiar with Markov chains than we are. Persons nonetheless interested are welcome to contact the authors for further details.]

Next, we need to express the variances of the desired transformed quantities, $\mathrm{Var}[S''_b]$, in terms of the covariances of the raw quantities, $\mathrm{Cov}[S_b, S_c]$, given in (5). This will be achieved in (12). In principle this translation can be done for arbitrary functions $g_1, g_2, \ldots, g_B$; however, in the interests of brevity, our presentation here focuses on the particular transformation which is needed in this analysis to weight each of the raw histogram counts before then normalising the histogram to unit area:
$$S''_b = \frac{S_b \, w_b}{\sum_{i=1}^B S_i \, w_i} \qquad (6)$$
where the $\{w_1, w_2, \ldots, w_B\}$ are the set of fixed weights. The trivial changes necessary to generalise to other transformations are left as an exercise for the reader! Taking differentials of (6) and defining the fractional differentials $df_b$ and $df''_b$ by:
$$df_b = \frac{dS_b}{S_b} \qquad \text{and} \qquad df''_b = \frac{dS''_b}{S''_b}$$
it may be proved that:
$$df''_b = \sum_{k=1}^B \left( \delta_{bk} - S''_k \right) df_k. \qquad (7)$$
To gain some understanding of (7) we make some small notational changes. We use the fact that $S''_b$ represents a probability estimate in bin $b$ to motivate re-labelling it as $p''_b$. And since complementary probabilities are often denoted with $q$s, we define $q''_b = 1 - p''_b$. With those changes, (7) may be written out as:
$$df''_1 = + q''_1 \, df_1 - p''_2 \, df_2 - p''_3 \, df_3 - \ldots - p''_B \, df_B \qquad (8)$$
$$df''_2 = - p''_1 \, df_1 + q''_2 \, df_2 - p''_3 \, df_3 - \ldots - p''_B \, df_B \qquad (9)$$
$$df''_3 = - p''_1 \, df_1 - p''_2 \, df_2 + q''_3 \, df_3 - \ldots - p''_B \, df_B \qquad (10)$$
$$\vdots$$
$$df''_B = - p''_1 \, df_1 - p''_2 \, df_2 - p''_3 \, df_3 - \ldots + q''_B \, df_B. \qquad (11)$$
Since the $p''$ and $q''$ quantities are all positive, in the above form one can see that the Markov chain correlations will tend to reduce the uncertainties in regions of high probability but have the opposite effect at large distances.
Putting everything together:
$$\begin{aligned}
\mathrm{Var}[S''_b] = \mathrm{Var}[dS''_b] = \mathrm{Var}[S''_b \, df''_b] &\approx (s''_b)^2 \, \mathrm{Var}[df''_b] \\
&= (s''_b)^2 \, \mathrm{Var}\!\left[ \sum_{k=1}^B (\delta_{bk} - S''_k) \, df_k \right] \qquad \text{(by (7))} \\
&= (s''_b)^2 \sum_{i,j} \mathrm{Cov}\!\left[ (\delta_{bi} - S''_i) \, df_i ,\; (\delta_{bj} - S''_j) \, df_j \right] \\
&\approx (s''_b)^2 \sum_{i,j} (\delta_{bi} - s''_i)(\delta_{bj} - s''_j) \, \mathrm{Cov}[df_i, df_j] \\
&= (s''_b)^2 \sum_{i,j} (\delta_{bi} - s''_i)(\delta_{bj} - s''_j) \, \mathrm{Cov}\!\left[ \frac{dS_i}{S_i}, \frac{dS_j}{S_j} \right] \\
&\approx (s''_b)^2 \sum_{i,j} \frac{(\delta_{bi} - s''_i)}{s_i} \frac{(\delta_{bj} - s''_j)}{s_j} \, \mathrm{Cov}[dS_i, dS_j] \\
&= (s''_b)^2 \sum_{i,j} \frac{(\delta_{bi} - s''_i)}{s_i} \frac{(\delta_{bj} - s''_j)}{s_j} \, \mathrm{Cov}[S_i, S_j] \quad \text{(computable using (5))}. \qquad (12)
\end{aligned}$$
Each of the approximations above appeals to the 'good enough' doctrine previously discussed. Specifically, the above approximations are good if fractional uncertainties are 'small', which is something that is already ensured by the previously used requirement that the Markov chain has begun to approach equilibrium, thereby placing uncertainties into a linear regime.

The above procedure is fast, since the estimator for $P$ can be obtained during histogram filling at no extra cost, and the remainder of the steps only involve simple matrix operations on $B \times B$ matrices.

[Footnote: Bootstrap-based Markov chains are local and so will tend to positively correlate values of $df_b$ and $df_c$ when $b$ and $c$ are nearby, and anti-correlate them when far apart.]

[Figure 4: Comparison of three independent biased-bootstrap estimates of the density shown in Figure 2, together with their one-sigma uncertainty estimates.]

This concludes the description of how our uncertainties are estimated.
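The final propagation step (12) is cheap to implement. The sketch below (our own illustrative code and names) takes raw counts, fixed weights, and a covariance matrix for the raw counts, and returns the normalised density estimate with its variances. As a sanity check, with unit weights and uncorrelated (multinomial) covariances it reproduces the familiar $p(1-p)/N$ variance of a multinomial proportion:

```python
import numpy as np


def reweighted_density_variances(s, w, cov):
    """Equation (12): central values s''_b = s_b w_b / sum_i s_i w_i, and
    Var[S''_b] = (s''_b)^2 sum_ij (d_bi - s''_i)/s_i (d_bj - s''_j)/s_j Cov[S_i, S_j]."""
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    s2 = s * w / np.dot(s, w)                        # eq. (6)
    J = (np.eye(len(s)) - s2[None, :]) / s[None, :]  # J[b, i] = (delta_bi - s''_i)/s_i
    var = s2**2 * np.einsum('bi,bj,ij->b', J, J, cov)
    return s2, var
```

In practice `cov` would come from (5), evaluated with the estimated transition matrix in place of $P$.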
What remains to be understood is why this method of uncertainty estimation appears not to be in wider use in the high-energy physics community or, for that matter, elsewhere, despite the underlying maths being more than fifty years old. It would appear to have wide application in many places where, at present, we estimate uncertainties in Markov-chain-derived data by re-running the chains a few more times with different initial conditions or random seeds.
Checks on uncertainties
Since our method of estimating uncertainties is both novel and non-trivial, some readers may find it reassuring to see examples of it working as intended. One way of doing this is to illustrate that independent density estimates from separate Markov chains agree within the uncertainty estimates that this process attaches to each of them. We do so by presenting in Figure 4 three independent density estimates which, apart from having different random number seeds, are otherwise identical to those shown in the biased-bootstrap estimate of Figure 2. In particular, all the estimates in Figure 4 were generated from 25,000 biased-bootstrap MCMC samples following unbiased initialisation samples. Reassuringly, Figure 4 does indeed show that these three estimates all have the degree of agreement with each other that would be expected given the sizes of their one-sigma uncertainty bands.

The top row of plots in Figure 5 shows the history of the three chains which generated the densities just seen in Figure 4. In each case, time runs vertically. The histories show: (a) expected auto-correlations within the chains, (b) that T-values were visited approximately uniformly, as desired, and (c) that the chains have made their way between the two ends of the T-spectrum O(20) times. The latter is strong evidence that the chains should have equilibrated, and so the success of the uncertainty estimate ought not to be surprising.

What might be more interesting is to see the uncertainty estimate under trickier conditions. The bottom row of plots in Figure 5 shows histories which are only 10% of the length of those seen previously (i.e. 2,500 MCMC biased-bootstrap samples instead of 25,000). This time it is evident that these histories cannot be confidently said to have reached equilibrium. The first has hardly sampled any low T-values, while the second and third have gone from end to end only two or three times.
Nonetheless, despite these tough conditions, the one-sigma uncertainty estimates for these plots, which are shown in Figure 6, are once again remarkably well sized. It is tempting to attribute this success to the unreasonable effectiveness of mathematics. 2,500 MCMC samples is sufficient to provide O(100) samples in each of the 30 bins of Figure 6. The most likely transitions from each of those bins could therefore be relatively well estimated, and as a result of this the matrices $P$, $P^\infty$ and $Q$ can be much better known than one might at first imagine. Since our method of uncertainty estimation is based on those matrices, and analytically considers all possible MCMC chains consistent with them (not just the one, very short, chain that was actually realised in any of the three actualised histories), it is perhaps less surprising that the method succeeds so well.

[Figure 5: Histories of the T-values in three long and three short biased-bootstrap chains (seeds 1 to 3 in each case; the MCMC sample number, i.e. time, runs vertically against the T-evaluations). The resulting density estimates are shown in Figures 4 and 6.]

In conclusion: while our confidence in the method is primarily vested in the mathematics, we hope that these examples provide reassurance that our method of uncertainty estimation does indeed work well in the circumstances required, namely any in which the estimated T-distribution is remotely meaningful.

References

[A+
15] R. Aaij et al. Search for CP violation in $D^0 \to \pi^-\pi^+\pi^0$ decays with the energy test. Phys. Lett., B740:158–167, 2015.

[A+17] R. Aaij et al. Search for CP violation in the phase space of $D^0 \to \pi^+\pi^-\pi^+\pi^-$ decays. Phys. Lett., B769:345–356, 2017.

[AZ02] B. Aslan and G. Zech. A new class of binning free, multivariate goodness of fit tests: the energy tests. 2002.

[AZ05] B. Aslan and G. Zech. New test for the multivariate two-sample problem based on the concept of minimum energy. Journal of Statistical Computation and Simulation, 75(2):109–119, 2005.

[BBP18] W. Barter, C. Burr, and C. Parkes. Calculating p-values and their significances with the energy test for large datasets. Journal of Instrumentation, 13(04):P04011, 2018.

[Figure 6: Comparison of three independent short-run biased-bootstrap estimates of the density shown in Figure 2, together with their one-sigma uncertainty estimates. In contrast to earlier figures of a similar type, the density estimates here are calculated for 30 bins on account of the lower statistics.]

[BGR+06] K. Borgwardt, A. Gretton, M. J. Rasch, H. P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics (ISMB), 22(14):e49–e57, 2006.

[Efr79] B. Efron. Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1):1–26, January 1979.

[Efr81] B. Efron. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika, 68(3):589–599, 1981.

[ET93] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.

[FM53] R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70:266–285, 1953.

[GBR+12] A. Gretton, K. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[GFHS09] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 673–681. Curran Associates, Inc., 2009. https://papers.nips.cc/paper/3738-a-fast-consistent-kernel-two-sample-test

[Nor97] J. R. Norris. Markov Chains. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1997.

[PCB+17] C. Parkes, S. Chen, J. Brodzicka, M. Gersabeck, G. Dujany, and W. Barter. On model-independent searches for direct CP violation in multi-body decays. J. Phys., G44(8):085001, 2017.

[SPH07] B. Schölkopf, J. Platt, and T. Hofmann. Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference (Bradford Books), chapter A Kernel Method for the Two-Sample Problem. The MIT Press, 2007.

[SSGF13] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist., 41(5):2263–2291, 2013.

[TL09] S. Trevezas and N. Limnios. Variance estimation in the central limit theorem for Markov chains. Journal of Statistical Planning and Inference, 139(7):2242–2253, 2009.

[Wil11] M. Williams. Observing CP violation in many-body decays. Phys. Rev., D84:054015, 2011.

[Zec18] G. Zech. Scaling property of the statistical two-sample energy test. 2018. https://arxiv.org/abs/1804.10599

[ZFGS18] Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, 28(1):113–130, January 2018.

[ZGB13] W. Zaremba, A. Gretton, and M. Blaschko. B-test: a non-parametric, low variance kernel two-sample test. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 755–763. Curran Associates, Inc., 2013.

[ZM15] J. Zhao and D. Meng. FastMMD: ensemble of circular discrepancy for efficient two-sample test.