New Statistical Techniques in the Measurement of the Inclusive Top Pair Production Cross Section
Jiří Franc, Petr Bouř, Michal Štěpánek, and Václav Kůs
Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Trojanova 13, 120 00 Prague 2, Czech Republic

E-mail: [email protected], [email protected], [email protected], [email protected]
Abstract.
We present several different types of multivariate statistical techniques used in the measurement of the inclusive top pair production cross section in p p̄ collisions at √s = 1.96 TeV, using data collected with the D0 detector at the Fermilab Tevatron Collider. We consider the final state of top quark pair decays containing one electron or muon and at least two jets. We perform various statistical homogeneity tests, such as the Anderson-Darling, Kolmogorov-Smirnov, and ϕ-divergence tests, to determine which variables have good data-MC agreement as well as good separation power. We adjusted all tests to use weighted empirical distribution functions. Further, we separate the t t̄ signal from the background by applying Generalized Linear Models, Gaussian Mixture Models, and Neural Networks with Switching Units, and confront them with familiar methods from the ROOT TMVA package, such as Boosted Decision Trees and the Multi-layer Perceptron. We compare the results by the area under the receiver operating characteristic curve and verify the quality of the discrimination for all methods.
1. Introduction
The main goal of this analysis is to apply statistical techniques which are new to HEP to the measurement of the inclusive top pair production cross section in p p̄ collisions at √s = 1.96 TeV. We applied a more rigorous statistical approach and modified common homogeneity tests in terms of adding weights and the utilization of quantile binning.
2. Selection of Variables and Homogeneity Tests
In order to perform efficient training of the separation methods on MC simulation, we need to guarantee the homogeneity of both the MC and data populations, i.e., we test the following hypothesis:

H₀: F = G versus H₁: F ≠ G at significance level α,   (1)

where F is the unknown cumulative distribution function (CDF) of the data distribution and G is the unknown CDF of the MC distribution. Let X₁ = {X₁, ..., X_{n₁}} denote a random sample taken from the distribution F and X₂ = {Y₁, ..., Y_{n₂}} be a random sample taken from the distribution G. We further denote N = n₁ + n₂. Let F_{n₁}, G_{n₂} denote the empirical distribution functions (EDFs) of the samples X₁, X₂, respectively. In our hypothesis testing we seek the p-value, i.e., the lowest significance level α at which we reject H₀. Thus, we automatically reject H₀ for every higher significance level α > p-value.

However, the MC simulation is weighted by weights (w₁, ..., w_{n₂}), so we process the MC sample X₂^w = {(Y₁, w₁), ..., (Y_{n₂}, w_{n₂})}. Therefore, we are forced to replace the EDF with the weighted empirical cumulative distribution function (WEDF) defined by G^w_{n₂}(x) = (1/W₂) Σ_{i=1}^{n₂} w_i 1_{(−∞,x]}(Y_i), where W₂ = Σ_{i=1}^{n₂} w_i and 1_{(−∞,x]} is the indicator function of the set (−∞, x]. Since all the weights are equal to 1 in the data sample, the WEDF F^w_{n₁} coincides with the EDF F_{n₁}. Let us denote W = W₁ + W₂, where W₁ = Σ_{j=1}^{n₁} w_j for the case of F^w_{n₁}.

2.1. ϕ-divergence Tests of Homogeneity

This test reduces problem (1) to testing homogeneity in multinomial populations. Let {t₁, ..., t_{m+1}} be a partition of the real line such that for all x ∈ {X₁, X₂} it holds that x ∈ [t₁, t_{m+1}]. Hereby, we make this binning over the populations X₁, X₂, consisting of m bins. For i ∈ {1, 2} and j ∈ {1, ..., m} we denote by p_{ij} the probability that a randomly chosen observation from X_i belongs to the j-th bin. Instead of (1) we now test the hypotheses

H₀: p_{1j} = p_{2j} ∀ j ∈ {1, ..., m} versus H₁: H₀ is not true.   (2)
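The WEDF defined above is straightforward to implement. The following minimal sketch (Python with NumPy; the function name `wedf` is illustrative and not taken from the analysis code) evaluates the weighted empirical distribution function at a point:

```python
import numpy as np

def wedf(values, weights, x):
    """Weighted EDF: G^w_n(x) = (1/W) * sum_i w_i * 1{values_i <= x}, with W = sum_i w_i."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    W = weights.sum()
    return weights[values <= x].sum() / W

# For a data sample all weights equal 1, so the WEDF coincides with the ordinary EDF.
data = [0.2, 0.5, 0.9, 1.4]
print(wedf(data, np.ones(len(data)), 0.9))  # -> 0.75
```

With non-unit weights the same function gives the MC-side WEDF G^w_{n₂} directly.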
As given in [2], using a ϕ-divergence measure and the maximum likelihood estimators n_{ij}/n_i and N_j/N, we consider the test statistic

H^ϕ_N = (2N / ϕ''(1)) Σ_{i=1}^{2} Σ_{j=1}^{m} (n_i/N)(N_j/N) ϕ( n_{ij}N / (n_iN_j) ),   (3)

where n_{ij} is the number of observations from X_i in the j-th bin, N_j is the number of all observations in the j-th bin, and ϕ is a given function from a convex family. For our purposes, the maximum likelihood estimators n_{ij}/n_i, N_j/N need to be replaced with the corresponding weighted estimators w^(in bin)_{ij}/W_i, w^(in bin)_j/W, made of the respective sums of weights. The asymptotic distribution of the test statistic (3) is χ² with (m − 1) degrees of freedom. Thus, the approximate p-value can be obtained as 1 − F_{χ²_{m−1}}(H^ϕ_N). There are two well-known specific cases: for ϕ(x) = (x − 1)² the test coincides with the χ² Homogeneity Test, and for ϕ(x) = x log x − x + 1 the test is identically the Likelihood Ratio Test.

2.2. Kolmogorov-Smirnov Test for Two Samples

Unlike the χ² test, the Kolmogorov-Smirnov test is based on the differences between the EDFs of the two samples. We consider the statistic D_{n₁,n₂} = sup_{x∈R} |F_{n₁}(x) − G_{n₂}(x)|. It follows from the Glivenko-Cantelli lemma that under the true H₀ it holds that D_{n₁,n₂} → 0 a.s. as n₁, n₂ → ∞. Furthermore, due to [4], for the true H₀ and λ > 0 it holds that

lim_{n₁,n₂→∞} P( √(n₁n₂/(n₁+n₂)) D_{n₁,n₂} ≤ λ ) = 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²λ²}.   (4)

Therefore, we can obtain the approximate p-value as 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²λ²}, where λ = √(n₁n₂/(n₁+n₂)) D_{n₁,n₂}. In (4), the EDFs and n₁, n₂ need to be replaced with the corresponding WEDFs and sums of weights W₁, W₂, respectively. The Kolmogorov-Smirnov test is generally more powerful than the χ² test, see [3].

2.3. Anderson-Darling Test for Two Samples

Another test based on the EDF is the Anderson-Darling test. Here H_N stands for the EDF of the pooled sample {X₁, X₂}, i.e., H_N(x) = [n₁F_{n₁}(x) + n₂G_{n₂}(x)]/N. We take into account the statistic from [5]

A²_{n₁n₂} = (n₁n₂/N) ∫_{−∞}^{+∞} [F_{n₁}(x) − G_{n₂}(x)]² / (H_N(x)[1 − H_N(x)]) dH_N(x),   T_{n₁n₂} = (A²_{n₁n₂} − 1)/σ_N,   (5)

where σ²_N = var(A²_{n₁n₂}). According to [6], we can determine an approximate p-value by means of the standardized statistic T_{n₁n₂}. Once again, in (5) we need to replace the EDFs and the numbers of entries n₁, n₂, N with their respective WEDFs and sums of weights W₁, W₂, W. The Anderson-Darling test is generally more powerful than the Kolmogorov-Smirnov test, see [7] for details.

We tested all 50 potential input variables and selected 36 of them. Here we present a preview of the results for some of them in the electron channel.
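The weighted Kolmogorov-Smirnov test described above can be sketched as follows (Python with NumPy; the function names and the truncation `kmax` of the series in (4) are our assumptions, not the authors' code). The EDFs are replaced by WEDFs and n₁, n₂ by the sums of weights W₁, W₂:

```python
import numpy as np

def wedf_on_grid(values, weights, grid):
    """Evaluate the weighted EDF of (values, weights) at each point of `grid`."""
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w) / w.sum()                      # WEDF values at the sorted points
    idx = np.searchsorted(v, grid, side="right")      # count of points <= each grid value
    return np.where(idx > 0, cum[np.maximum(idx - 1, 0)], 0.0)

def weighted_ks(x1, w1, x2, w2, kmax=100):
    """Weighted two-sample KS statistic and asymptotic p-value from the series in (4)."""
    grid = np.concatenate([np.asarray(x1, float), np.asarray(x2, float)])
    d = np.max(np.abs(wedf_on_grid(x1, w1, grid) - wedf_on_grid(x2, w2, grid)))
    W1, W2 = np.sum(w1), np.sum(w2)
    lam = np.sqrt(W1 * W2 / (W1 + W2)) * d            # lambda with n1, n2 -> W1, W2
    k = np.arange(1, kmax + 1)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * lam**2))
    return d, float(min(max(p, 0.0), 1.0))

# Fully separated toy samples with unit weights: D = 1 by construction.
d, p = weighted_ks([1.0, 2.0], np.ones(2), [3.0, 4.0], np.ones(2))
print(d, p)  # d = 1.0, p ~ 0.27
```

With all weights equal to 1 this reduces to the ordinary two-sample KS test; the asymptotic p-value is only reliable away from very small λ.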
Figure 1. Approximate p-values for selected variables from the different tests of MC vs. data: ■ Kolmogorov-Smirnov, ■ Anderson-Darling, ■ χ², ■ Likelihood Ratio. (The variable HT3 is not available for the channel with only 2 jets.)
3. Discrimination
For a long time, until the end of the last century, the High Energy Physics (HEP) community used linear decision boundary methods (Fisher discriminants) and later Naive Bayesian methods for discrimination. Not until the first decade of the 21st century did supervised machine learning methods such as Neural Networks, Boosted Decision Trees, and Support Vector Machines begin to play an important role in HEP analyses. Nowadays, multivariate analysis techniques are among the fundamental tools in the discrimination phase. Nevertheless, there are still some well-known statistical methods that are worth trying out. Let us mention three methods whose quality of separation was tested in the measurement of the inclusive top pair production cross section on the D0 Tevatron full Run II data. The first one is the Model Based Clustering method (MBC) based on the EM algorithm and Gaussian Mixture Models presented in [8], the second one is the well-known Generalized Linear Models (GLM), where we tested different link functions and overdispersions, and the third one is Neural Nets with Switching Units (NNSU), a quite new method developed by F. Hakl from the Institute of Computer Science of the ASCR. We compared the mentioned approaches with other MVA methods, such as the Multilayer Perceptron (MLP) and Boosted Decision Trees (BDT) from the ROOT TMVA package. ROC curves for all three muon + jets bins are shown as an example in Figure 2. It is clear that the new methods are comparable and provide similar results to the established methods from the TMVA package.
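As an illustration of the Gaussian-mixture approach to signal/background discrimination scored by the area under the ROC curve, here is a minimal sketch on toy data (Python with scikit-learn; this is not the authors' MBC or NNSU implementation, and all sample sizes and parameters are arbitrary choices for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Toy 2-D "signal" and "background" samples standing in for MC-simulated events.
sig = rng.normal(loc=[1.0, 1.0], scale=0.8, size=(2000, 2))
bkg = rng.normal(loc=[-1.0, -1.0], scale=1.2, size=(2000, 2))

# Fit one Gaussian mixture per class on a training half of each sample.
gm_sig = GaussianMixture(n_components=2, random_state=0).fit(sig[:1000])
gm_bkg = GaussianMixture(n_components=2, random_state=0).fit(bkg[:1000])

# Discriminant: log-likelihood ratio evaluated on the held-out halves.
X_test = np.vstack([sig[1000:], bkg[1000:]])
y_test = np.concatenate([np.ones(1000), np.zeros(1000)])
score = gm_sig.score_samples(X_test) - gm_bkg.score_samples(X_test)

auc = roc_auc_score(y_test, score)  # area under the ROC curve
print(auc)
```

The same AUC figure of merit can then be used to compare such a discriminant against TMVA classifiers like BDT or MLP.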
Figure 2. ROC curves for different methods and analysis channels.
4. Discussion
We presented a generalization of statistical homogeneity tests and their utilization in HEP analysis. Since in the ROOT framework the most used classical K-S test is designed for histograms instead of weighted empirical functions, our approach is more proper. In a similar way we modified the Anderson-Darling test, which is more powerful than the two-sample K-S test. However, both mentioned tests are very sensitive, and that is why we recommend using the ϕ-divergence tests of homogeneity with quantile binning, where the utilization of weights is more straightforward. Depending on the used convex function ϕ, we can convert the test to the χ² test, the Likelihood Ratio test, or any other general ϕ-divergence test. Further, we briefly mentioned less commonly used discrimination methods, whose quality of signal-from-background separation is comparable with the methods from the ROOT TMVA package.

Acknowledgments
This work has been supported by the MSMT (CZ) grant INGO II INFRA LG12020 and the CTU (CZ) grant SGS12/197/OHK4/3T/14.
References
[1] Abazov V. M. et al. (D0 Collaboration) 2014, Measurement of differential t t̄ production cross sections in p p̄ collisions, Phys. Rev. D 90, 092006
[2] Pardo L. 2006, Statistical Inference Based on Divergence Measures (Boca Raton: Chapman & Hall/CRC) pp 394-398
[3] Stephens M. A. 1992, An Appreciation of Kolmogorov's 1933 Paper (Stanford: Stanford University) p 13
[4] Smirnov N. V. 1944, Approximate laws of distribution of random variables from empirical data, Uspekhi Mat. Nauk
[5] Pettitt A. N. 1976, A two-sample Anderson-Darling rank statistic, Biometrika 63, 161-168
[6] Scholz F. W. and Stephens M. A. 1987, K-sample Anderson-Darling tests, J. Am. Statist. Assoc. 82, 918-924
[7] Engmann S. and Cousineau D. 2011, Comparing distributions: the two-sample Anderson-Darling test as an alternative to the Kolmogorov-Smirnov test, J. Appl. Quant. Methods 6
[8] Fraley C. and Raftery A. E. 2002, Model-based clustering, discriminant analysis, and density estimation, J. Am. Statist. Assoc. 97, 611-631
[9] Abazov V. M. et al. (D0 Collaboration) 2011, Measurement of the top quark pair production cross section in the lepton + jets channel in proton-antiproton collisions at √s = 1.96 TeV