New Statistical Techniques in the Measurement of the Inclusive Top Pair Production Cross Section
Jiří Franc, Petr Bouř, Michal Štěpánek, and Václav Kůs
Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Trojanova 13, 120 00 Prague 2, Czech Republic

E-mail: [email protected], [email protected], [email protected], [email protected]
Abstract.
We present several different types of multivariate statistical techniques used in the measurement of the inclusive top pair production cross section in p p̄ collisions at √s = 1.96 TeV, using data collected with the D0 detector at the Fermilab Tevatron Collider. We consider the final state of top quark pair decays containing one electron or muon and at least two jets. We perform various statistical homogeneity tests, such as the Anderson-Darling, Kolmogorov-Smirnov, and ϕ-divergence tests, to determine which variables have good data-MC agreement as well as good separation power. We adjusted all tests to use weighted empirical distribution functions. Further, we separate the t t̄ signal from the background by applying Generalized Linear Models, Gaussian Mixture Models, and Neural Networks with Switching Units, and confront them with familiar methods from the ROOT TMVA package, such as Boosted Decision Trees and the Multi-layer Perceptron. We compare the results by the area under the receiver operating characteristic curve and verify the quality of the discrimination for all methods.
1. Introduction
The main goal of this analysis is to apply statistical techniques which are new to HEP to the measurement of the inclusive top pair production cross section in p p̄ collisions at √s = 1.96 TeV. We applied a more rigorous statistical approach and modified common homogeneity tests in terms of adding weights and the utilization of quantile binning.
2. Selection of Variables and Homogeneity Tests
In order to perform efficient training of the separation methods on MC simulation, we need to guarantee the homogeneity of both the MC and data populations, i.e., we test the following hypothesis:

H₀: F = G versus H₁: F ≠ G at significance level α,   (1)

where F is the unknown cumulative distribution function (CDF) of the data distribution and G is the unknown CDF of the MC distribution. Let X₁ = {X₁, ..., X_{n₁}} denote a random sample taken from the distribution F and X₂ = {Y₁, ..., Y_{n₂}} be a random sample taken from the distribution G. We further denote N = n₁ + n₂. Let F_{n₁}, G_{n₂} denote the empirical distribution functions (EDFs) of the samples X₁, X₂, respectively. In our hypothesis testing we seek the p-value, i.e., the lowest significance level α at which we reject H₀. Thus, we automatically reject H₀ for every higher significance level α > p-value.

However, the MC simulation is weighted by weights (w₁, ..., w_{n₂}), so we process the MC sample X₂^w = {(Y₁, w₁), ..., (Y_{n₂}, w_{n₂})}. Therefore, we are forced to replace the EDF with the weighted empirical cumulative distribution function (WEDF) defined by G^w_{n₂}(x) = (1/W₂) Σ_{i=1}^{n₂} w_i 1_{(−∞,x]}(Y_i), where W₂ = Σ_{i=1}^{n₂} w_i and 1_{(−∞,x]} is the indicator function of the set (−∞, x]. Since all the weights are equal to 1 in the data sample, the WEDF F^w_{n₁} coincides with the EDF F_{n₁}. Let us denote W = W₁ + W₂, where W₁ = Σ_{j=1}^{n₁} w_j for the case of F^w_{n₁}.

2.1. ϕ-divergence Tests of Homogeneity

This test reduces problem (1) to testing homogeneity in multinomial populations. Let {t₁, ..., t_{m+1}} be a partition of the real line such that for all x ∈ {X₁, X₂} it holds that x ∈ [t₁, t_{m+1}]. Hereby, we make this binning over the populations X₁, X₂, consisting of m bins. For i ∈ {1, 2} and j ∈ {1, ..., m} we denote by p_{ij} the probability that a randomly chosen observation from X_i belongs to the j-th bin. Instead of (1) we now test the hypotheses

H₀: p_{1j} = p_{2j} ∀ j ∈ {1, ..., m} versus H₁: H₀ is not true.   (2)
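The WEDF defined above is straightforward to implement. The following minimal sketch (Python with NumPy; the function name `wedf` is illustrative and not taken from the analysis code) evaluates the weighted empirical distribution function at a point:

```python
import numpy as np

def wedf(values, weights, x):
    """Weighted EDF: G^w_n(x) = (1/W) * sum_i w_i * 1{values_i <= x}, with W = sum_i w_i."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    W = weights.sum()
    return weights[values <= x].sum() / W

# For a data sample all weights equal 1, so the WEDF coincides with the ordinary EDF.
data = [0.2, 0.5, 0.9, 1.4]
print(wedf(data, np.ones(len(data)), 0.9))  # -> 0.75
```

With non-unit weights the same function gives the MC-side WEDF G^w_{n₂} directly.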
As given in [2], using a ϕ-divergence measure and the maximum likelihood estimators n_{ij}/n_i and N_j/N, we consider the test statistic

H^ϕ_N = (2N / ϕ''(1)) Σ_{i=1}^{2} Σ_{j=1}^{m} (n_i/N)(N_j/N) ϕ( n_{ij}N / (n_iN_j) ),   (3)

where n_{ij} is the number of observations from X_i in the j-th bin, N_j is the number of all observations in the j-th bin, and ϕ is a given function from a convex family. For our purposes, the maximum likelihood estimators n_{ij}/n_i, N_j/N need to be replaced with the corresponding weighted estimators w^(in bin)_{ij}/W_i, w^(in bin)_j/W, made of the respective sums of weights. The asymptotic distribution of the test statistic (3) is χ² with (m − 1) degrees of freedom. Thus, the approximate p-value can be obtained as 1 − F_{χ²_{m−1}}(H^ϕ_N). There are two well-known specific cases: for ϕ(x) = (x − 1)² the test coincides with the χ² Homogeneity Test, and for ϕ(x) = x log x − x + 1 the test is identically the Likelihood Ratio Test.

2.2. Kolmogorov-Smirnov Test for Two Samples

Unlike the χ² test, the Kolmogorov-Smirnov test is based on the differences between the EDFs of the two samples. We consider the statistic D_{n₁,n₂} = sup_{x∈R} |F_{n₁}(x) − G_{n₂}(x)|. It follows from the Glivenko-Cantelli lemma that under the true H₀ it holds that D_{n₁,n₂} → 0 a.s. as n₁, n₂ → ∞. Furthermore, due to [4], for the true H₀ and λ > 0 it holds that

lim_{n₁,n₂→∞} P( √(n₁n₂/(n₁+n₂)) D_{n₁,n₂} ≤ λ ) = 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²λ²}.   (4)

Therefore, we can obtain the approximate p-value as 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²λ²}, where λ = √(n₁n₂/(n₁+n₂)) D_{n₁,n₂}. In (4), the EDFs and n₁, n₂ need to be replaced with the corresponding WEDFs and sums of weights W₁, W₂, respectively. The Kolmogorov-Smirnov test is generally more powerful than the χ² test, see [3].

2.3. Anderson-Darling Test for Two Samples

Another test based on the EDF is the Anderson-Darling test. Here H_N stands for the EDF of the pooled sample {X₁, X₂}, i.e., H_N(x) = [n₁F_{n₁}(x) + n₂G_{n₂}(x)]/N. We take into account the statistic from [5]

A²_{n₁n₂} = (n₁n₂/N) ∫_{−∞}^{+∞} [F_{n₁}(x) − G_{n₂}(x)]² / (H_N(x)[1 − H_N(x)]) dH_N(x),   T_{n₁n₂} = (A²_{n₁n₂} − 1)/σ_N,   (5)

where σ²_N = var(A²_{n₁n₂}). According to [6], we can determine an approximate p-value by means of the standardized statistic T_{n₁n₂}. Once again, in (5) we need to replace the EDFs and the numbers of entries n₁, n₂, N with their respective WEDFs and sums of weights W₁, W₂, W. The Anderson-Darling test is generally more powerful than the Kolmogorov-Smirnov test, see [7] for details.

We tested all 50 potential input variables and selected 36 of them. Here we present a preview of the results for some of them in the electron channel.
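The weighted Kolmogorov-Smirnov test described above can be sketched as follows (Python with NumPy; the function names and the truncation `kmax` of the series in (4) are our assumptions, not the authors' code). The EDFs are replaced by WEDFs and n₁, n₂ by the sums of weights W₁, W₂:

```python
import numpy as np

def wedf_on_grid(values, weights, grid):
    """Evaluate the weighted EDF of (values, weights) at each point of `grid`."""
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w) / w.sum()                      # WEDF values at the sorted points
    idx = np.searchsorted(v, grid, side="right")      # count of points <= each grid value
    return np.where(idx > 0, cum[np.maximum(idx - 1, 0)], 0.0)

def weighted_ks(x1, w1, x2, w2, kmax=100):
    """Weighted two-sample KS statistic and asymptotic p-value from the series in (4)."""
    grid = np.concatenate([np.asarray(x1, float), np.asarray(x2, float)])
    d = np.max(np.abs(wedf_on_grid(x1, w1, grid) - wedf_on_grid(x2, w2, grid)))
    W1, W2 = np.sum(w1), np.sum(w2)
    lam = np.sqrt(W1 * W2 / (W1 + W2)) * d            # lambda with n1, n2 -> W1, W2
    k = np.arange(1, kmax + 1)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * lam**2))
    return d, float(min(max(p, 0.0), 1.0))

# Fully separated toy samples with unit weights: D = 1 by construction.
d, p = weighted_ks([1.0, 2.0], np.ones(2), [3.0, 4.0], np.ones(2))
print(d, p)  # d = 1.0, p ~ 0.27
```

With all weights equal to 1 this reduces to the ordinary two-sample KS test; the asymptotic p-value is only reliable away from very small λ.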
Figure 1. Approximate p-values for selected variables from the different tests of MC vs. data: ■ Kolmogorov-Smirnov, ■ Anderson-Darling, ■ χ², ■ Likelihood Ratio. (The variable HT3 is not available for the channel with only 2 jets.)
3. Discrimination
For a long time, until the end of the last century, the High Energy Physics (HEP) community used linear decision boundary methods (Fisher discriminants) and later Naive Bayesian methods for discrimination. Not until the first decade of the 21st century did supervised machine learning methods such as Neural Networks, Boosted Decision Trees, and Support Vector Machines begin to play an important role in HEP analyses. Nowadays, multivariate analysis techniques are among the fundamental tools in the discrimination phase. Nevertheless, there are still some well-known statistical methods that are worth trying out. Let us mention three methods whose quality of separation was tested in the measurement of the inclusive top pair production cross section on the D0 Tevatron full Run II data. The first one is the Model Based Clustering method (MBC) based on the EM algorithm and Gaussian Mixture Models presented in [8], the second one is the well-known Generalized Linear Models (GLM), where we tested different link functions and overdispersions, and the third one is Neural Nets with Switching Units (NNSU), a quite new method developed by F. Hakl from the Institute of Computer Science of the ASCR. We compared the mentioned approaches with other MVA methods, such as the Multilayer Perceptron (MLP) and Boosted Decision Trees (BDT) from the ROOT TMVA package. ROC curves for all three muon + jets bins are shown as an example in Figure 2. It is clear that the new methods are comparable and provide similar results to the established methods from the TMVA package.
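As an illustration of the Gaussian-mixture approach to signal/background discrimination scored by the area under the ROC curve, here is a minimal sketch on toy data (Python with scikit-learn; this is not the authors' MBC or NNSU implementation, and all sample sizes and parameters are arbitrary choices for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Toy 2-D "signal" and "background" samples standing in for MC-simulated events.
sig = rng.normal(loc=[1.0, 1.0], scale=0.8, size=(2000, 2))
bkg = rng.normal(loc=[-1.0, -1.0], scale=1.2, size=(2000, 2))

# Fit one Gaussian mixture per class on a training half of each sample.
gm_sig = GaussianMixture(n_components=2, random_state=0).fit(sig[:1000])
gm_bkg = GaussianMixture(n_components=2, random_state=0).fit(bkg[:1000])

# Discriminant: log-likelihood ratio evaluated on the held-out halves.
X_test = np.vstack([sig[1000:], bkg[1000:]])
y_test = np.concatenate([np.ones(1000), np.zeros(1000)])
score = gm_sig.score_samples(X_test) - gm_bkg.score_samples(X_test)

auc = roc_auc_score(y_test, score)  # area under the ROC curve
print(auc)
```

The same AUC figure of merit can then be used to compare such a discriminant against TMVA classifiers like BDT or MLP.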
Figure 2. ROC curves for different methods and analysis channels.
4. Discussion
We presented a generalization of statistical homogeneity tests and their utilization in HEP analysis. Since in the ROOT framework the most used classical K-S test is designed for histograms instead of weighted empirical functions, our approach is more proper. In a similar way we modified the Anderson-Darling test, which is more powerful than the two-sample K-S test. However, both mentioned tests are very sensitive, and that is why we recommend using the ϕ-divergence tests of homogeneity with quantile binning, where the utilization of weights is more straightforward. Depending on the used convex function ϕ, we can convert the test to the χ² test, the Likelihood Ratio test, or any other general ϕ-divergence test. Further, we briefly mentioned less commonly used discrimination methods, whose quality of signal-from-background separation is comparable with the methods from the ROOT TMVA package.

Acknowledgments
This work has been supported by the MSMT (CZ) grant INGO II INFRA LG12020 and the CTU (CZ) grant SGS12/197/OHK4/3T/14.
References
[1] Abazov V. M. et al. (D0 Collaboration) 2014, Measurement of differential t t̄ production cross sections in p p̄ collisions, Phys. Rev. D 90, 092006
[2] Pardo L. 2006, Statistical Inference Based on Divergence Measures (Boca Raton: Chapman & Hall/CRC) pp 394-398
[3] Stephens M. A. 1992, An Appreciation of Kolmogorov's 1933 Paper (Stanford: Stanford University) p 13
[4] Smirnov N. V. 1944, Approximate laws of distribution of random variables from empirical data, Uspekhi Mat. Nauk
[5] Pettitt A. N. 1976, A two-sample Anderson-Darling rank statistic, Biometrika 63, 161-168
[6] Scholz F. W. and Stephens M. A. 1987, K-sample Anderson-Darling tests, J. Am. Statist. Assoc. 82, 918-924
[7] Engmann S. and Cousineau D. 2011, Comparing distributions: the two-sample Anderson-Darling test as an alternative to the Kolmogorov-Smirnov test, J. Appl. Quant. Methods 6
[8] Fraley C. and Raftery A. E. 2002, Model-based clustering, discriminant analysis, and density estimation, J. Am. Statist. Assoc. 97, 611-631
[9] Abazov V. M. et al. (D0 Collaboration) 2011, Measurement of the top quark pair production cross section in the lepton + jets channel in proton-antiproton collisions at √s = 1.96 TeV