Bidirectional Long Short-Term Memory (BLSTM) neural networks for reconstruction of top-quark pair decay kinematics
Fardin Syed, Riccardo Di Sipio and Pekka K. Sinervo, C.M.
Department of Physics, University of Toronto, 60 St George St, Toronto, ON M5S 1A7, Canada
E-mail: [email protected]
Abstract: A probabilistic reconstruction, using machine learning, of the decay kinematics of top-quark pairs produced in high-energy proton-proton collisions is presented. A deep neural network whose core consists of a Bidirectional Long Short-Term Memory (BLSTM) is trained to infer the four-momenta of the two top quarks produced in the hard-scattering process. The MadGraph5+Pythia8 Monte Carlo event generator is used to create a sample of top-quark pairs decaying in the µ+jets channel, whose final-state objects are used to create the input to the deep neural network. Distortions due to the limited resolution of the experimental apparatus are simulated with the Delphes3 fast detector simulator. The level of agreement between the Monte Carlo predictions and the BLSTM for kinematic distributions at parton level is comparable to that obtained using a benchmark method that finds the jet permutation minimizing an objective function. The code is publicly available at https://github.com/IMFardz/AngryTops .

Keywords: Data processing methods; Analysis and statistical methods; Pattern recognition

Studies of energetic top quarks produced in hadron collisions provide a unique window into our theoretical framework for particle physics, known as the Standard Model [1]. They are also important for probing physics beyond the Standard Model. As the most massive fundamental particle, with a rest mass of ≈
173 GeV, the top quark can only be observed through its decay products and their corresponding signatures in detectors operating at particle colliders such as the Large Hadron Collider (LHC).

The production and decay of a top-quark pair results in six partons in the final state, including charged and neutral leptons and jets of particles arising from the daughter quarks and gluons. The complexity of the resulting events and the finite resolution of the detectors make the challenge of reconstructing the top-quark momenta exceptionally difficult. Improvements in top-quark reconstruction are essential for understanding rare processes and making precision measurements of top-quark cross sections. Current reconstruction routines such as KLFitter [2] and PseudoTop [3] attempt to solve this problem by fitting a likelihood function or algorithmically, respectively. Both can be considered improvements on a more basic algorithm, known as the χ²-fit [4], which attempts to reconstruct the decay chains by assigning one jet uniquely to each outgoing parton. The permutation that minimizes an objective function is used to define the assignment. However, if the jet produced by any of these partons is lost because of limited acceptance and resolution, the reconstruction of the event kinematics is compromised. Also, the jets arising from the top-quark decay daughters, which are defined as clusters of the observed constituents (e.g. calorimeter cells or tracks in the inner detector), are selected by applying a cut on their transverse momentum (p_T).

We use a right-handed coordinate system with its origin at the nominal pp interaction point in the center of the detector and the z-axis along the beam pipe. The x-axis points from the centre of the detector to the center of the LHC ring and the y-axis points upward. Cylindrical coordinates (r, φ) are used in the transverse plane, φ being the azimuthal angle around the z-axis.
The pseudorapidity is defined in terms of the polar angle θ as η = −ln tan(θ/2). Because of this, it is fundamentally not possible to identify quark jets with a p_T smaller than the threshold, unless some extrapolation is applied. In principle, a probabilistic approach such as the one presented in this work does not suffer from the limitations of the χ²-fit, and is able to perform such an extrapolation without any need to rely explicitly on parametric transfer functions, as in the case of KLFitter.

The aim of this paper is to introduce a machine-learning approach to top-quark reconstruction. In our analysis, we train AngryTops, our machine-learning software package, to reconstruct top-quark, bottom-quark and W-boson four-momenta in the lepton+jets t¯t decay channel in pp collisions at 13 TeV. In this topology, one top quark has decayed fully hadronically while the other top quark has decayed semileptonically. We then evaluate AngryTops by comparing its performance to the χ²-fit.

In Section 2, we outline our procedure for generating Monte Carlo (MC) events for the training and testing of our networks. In Section 3, we describe the network architecture that led to the best performance during our study. Section 4 describes in further detail the training procedure for our network architectures. In Section 5 we evaluate our model by comparing it against the χ²-fit. Finally, in Sections 6 and 7 we summarize our findings and discuss the future of machine learning in the kinematic reconstruction of top-quark momenta.

To train AngryTops, a sample of 200 million t¯t events has been created. The MadGraph5 Monte Carlo (MC) event generator [5] has been used to calculate the amplitudes of the leading-order process pp → t¯t with up to one additional outgoing quark or gluon, as shown in Figure 1.
The Pythia8 generator [6] was used to carry out the parton showering of quarks and gluons, and MLM matching [7] was employed to model how the parton showers were matched to the MadGraph5 matrix-element calculation. Finally, Delphes3 has been used to simulate the effect of detector response. An average of 25 additional soft-QCD pp collisions (pile-up) were overlaid to reproduce realistic data-taking conditions at the LHC.

In what follows, we will refer to the hadronically and semileptonically decaying top quarks as the "hadronic top quark" and the "semileptonic top quark," respectively.

Electrons, muons, jets and missing transverse energy are reconstructed by Delphes3 algorithms. Jets were reconstructed using the anti-k_T algorithm [8] as implemented in FastJet [9], with a distance parameter R = 0.4. We only consider the semileptonic top-quark decay modes that result in an energetic electron or muon and its associated neutrino. In addition, we require that the transverse momenta of the muon, W bosons and b quarks be greater than 20 GeV and that their pseudorapidities satisfy |η| < 2.5. After these additional cuts, we are left with roughly 5 million events for training and testing our BLSTMs.

Figure 1. Example of a leading-order t¯t production Feynman diagram used to train the BLSTM network. The final state shown consists of a muon, a neutrino and at least four jets. Up to one additional outgoing quark or gluon is considered in the matrix-element calculation. Matching of partons arising in the matrix-element calculation with those produced in the subsequent parton showering is performed by the MLM matching algorithm [7].

The input for AngryTops is a 36-element array, which is reshaped into a (6 × 6) matrix in the first network layer. The first six elements of the input correspond to the muon p_x, p_y, p_z, the muon arrival time of flight T, the missing transverse energy E_T^miss, and the missing-energy azimuthal angle E_φ^miss. The subsequent five columns in the input matrix each correspond to an input jet and are defined as (jet p_x, p_y, p_z, energy E, mass M, b-tagging state B), where the b-tagging state is either 0 (not tagged) or 1 (tagged). When there are only four jets present in the event, the last jet column is set to all zeros. Our matrix of inputs is written as

    ⎛ p_x^µ       p_x^j1   p_x^j2   p_x^j3   p_x^j4   p_x^j5 ⎞
    ⎜ p_y^µ       p_y^j1   p_y^j2   p_y^j3   p_y^j4   p_y^j5 ⎟
    ⎜ p_z^µ       p_z^j1   p_z^j2   p_z^j3   p_z^j4   p_z^j5 ⎟
    ⎜ T^µ         E^j1     E^j2     E^j3     E^j4     E^j5   ⎟    (3.1)
    ⎜ E_T^miss    M^j1     M^j2     M^j3     M^j4     M^j5   ⎟
    ⎝ E_φ^miss    B^j1     B^j2     B^j3     B^j4     B^j5   ⎠

The output of our model is a (6 × 3) matrix, where each row corresponds to p_x, p_y and p_z of, respectively: the bottom quark from the hadronic top-quark decay, the bottom quark from the semileptonic top-quark decay, the hadronic W boson, the leptonic W boson, and the hadronically and semileptonically decaying top quarks. For the purpose of this analysis, we fix the mass of the top quark to 172.5 GeV and that of the W boson to 80.4 GeV:

    ⎛ p_x^{b,had}   p_y^{b,had}   p_z^{b,had} ⎞
    ⎜ p_x^{b,lep}   p_y^{b,lep}   p_z^{b,lep} ⎟
    ⎜ p_x^{W,had}   p_y^{W,had}   p_z^{W,had} ⎟    (3.2)
    ⎜ p_x^{W,lep}   p_y^{W,lep}   p_z^{W,lep} ⎟
    ⎜ p_x^{t,had}   p_y^{t,had}   p_z^{t,had} ⎟
    ⎝ p_x^{t,lep}   p_y^{t,lep}   p_z^{t,lep} ⎠

Our network consists of 329,913 trainable parameters and 17 different layers.
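As a sketch, the (6 × 6) input layout of Eq. (3.1) can be assembled from a simple event record; the object names and dictionary keys here are illustrative assumptions, not taken from the AngryTops code:

```python
import numpy as np

def build_input_matrix(muon, met, jets, n_jet_slots=5):
    """Assemble the (6 x 6) input matrix: one muon/MET column
    followed by up to five jet columns, zero-padded."""
    # Column 0: muon p_x, p_y, p_z, time of flight, MET and its azimuth.
    first = [muon["px"], muon["py"], muon["pz"],
             muon["tof"], met["et"], met["phi"]]
    columns = [first]
    # Columns 1-5: jet p_x, p_y, p_z, energy, mass, b-tag state (0 or 1).
    for jet in jets[:n_jet_slots]:
        columns.append([jet["px"], jet["py"], jet["pz"],
                        jet["e"], jet["m"], jet["btag"]])
    # Events with fewer than five jets get all-zero columns.
    while len(columns) < n_jet_slots + 1:
        columns.append([0.0] * 6)
    return np.array(columns).T  # shape (6, 6): rows are features

muon = {"px": 40.0, "py": -12.0, "pz": 95.0, "tof": 0.1}
met = {"et": 55.0, "phi": 1.2}
jets = [{"px": 60.0, "py": 5.0, "pz": -30.0,
         "e": 70.0, "m": 8.0, "btag": 1}] * 4  # a 4-jet event
X = build_input_matrix(muon, met, jets)
```

With only four jets present, the sixth column of `X` is all zeros, matching the zero-padding convention described above.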
We experimented with Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), Feed-Forward Neural Networks (FFNNs) and Bidirectional Long Short-Term Memory networks (BLSTMs). As described below, we found that BLSTMs performed the best. A diagram of our network architecture can be seen in Figure 2. Our code is written with Keras [10] with a TensorFlow 2.0 RC [11] back-end.

Figure 2. A diagram of the BLSTM network architecture employed in this study.
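While the full 17-layer architecture is shown in Figure 2, a minimal Keras sketch of a BLSTM-based regression from the (6 × 6) input to the (6 × 3) output might look as follows; the layer sizes and depth here are illustrative choices, not the tuned hyper-parameters of this study:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_blstm(hidden_units=32):
    # Treat the six input columns as a sequence of six 6-feature steps.
    inputs = layers.Input(shape=(6, 6))
    # Bidirectional LSTMs read the column sequence in both directions.
    x = layers.Bidirectional(
        layers.LSTM(hidden_units, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(hidden_units))(x)
    # Regress the 18 momentum components, then reshape to (6 x 3).
    x = layers.Dense(18)(x)
    outputs = layers.Reshape((6, 3))(x)
    return models.Model(inputs, outputs)

model = build_blstm()
model.compile(optimizer="adam", loss="mse")
```

Each output row then holds the (p_x, p_y, p_z) of one of the six particles in Eq. (3.2).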
For network training, we select the Adam optimizer [12] with a learning rate of 10^− and a Mean Squared Error (MSE) loss function.

For each choice of network architecture, we scan through the hyper-parameter space and select the draw that leads to the lowest loss. With Hyperopt [13], we used a uniform distribution to select the size of the network layers and the network activation functions. With Tune [14], we use an asynchronous HyperBand scheduler to train 1000 different draws from the hyper-parameter space in simultaneous batches of 8.

Of the five million selected events from the Monte Carlo simulation, we use 90% for training and the other 10% for testing. For each training epoch, we set aside an additional 10% of the training set for validation. A model's training is stopped when its validation loss begins to increase.

Before the start of training, we scale all the network inputs and outputs with MinMax scaling, which sets the minimum and maximum value of each kinematic variable to −1 and 1. We have also experimented with shuffling the training set, ordering the inputs by different variables, and scaling to Gaussian parameter distributions with mean 0 and variance 1. We did not find any significant improvement when using these alternative scaling techniques.

The output of the deep neural network is compared to a benchmark reconstruction method inspired by the χ²-fit, i.e. based on finding the jet permutation that minimizes the objective function

    χ² = (m_jjb − m_t^MC)²/σ_t² + (m_jj − m_W^MC)²/σ_W²
       + (m_lνb − m_t^MC)²/σ_t² + (m_lν − m_W^MC)²/σ_W² ,        (5.1)

where m_t^MC = 172.5 GeV, σ_t = 30 GeV, m_W^MC = 80.4 GeV and σ_W = 20 GeV. In order to perform the calculation, up to the first five jets (ordered by decreasing transverse momentum) are considered. Each permutation then consists of four jets that are uniquely assigned to the hadronically decaying top quark, the hadronically decaying W boson and the semileptonically decaying top quark. Information about b-tagging and the lepton arrival time of flight is not considered. The masses of the top quark, W boson and bottom quark are used to calculate the energy component of the associated four-momenta. The neutrino four-momentum component along the z axis (p_z) is estimated from the missing transverse energy and the quadratic W-boson mass constraint, as in [3]. In the case of degenerate solutions, the smaller p_z value is selected.

Figures 3–8 show the reconstructed four-momenta of each of the six particles in the decay chain. We compare the normalized distributions predicted by AngryTops and the χ²-fit to those obtained using the MC event generator by means of a χ² metric defined as

    χ²/NDF = [1/(n − 1)] Σ_{i=1}^{n} (y_i^MC − y_i^predicted)² / σ_i² ,        (5.2)

where

    σ_i = sqrt( (σ_i^MC)² + (σ_i^predicted)² ) ,        (5.3)

and where y_i and σ_i correspond to the value and uncertainty in the i-th histogram bin. The results of these comparisons are presented in Table 1.

Figure 3. Reconstructed semileptonic top quark observables. The gray filled area represents the prediction obtained using the MadGraph5+Pythia8 Monte Carlo event generator. The black dashed line is obtained from the permutation of jets which minimizes the χ². The red solid line is the output of the BLSTM.

We first note that the χ² comparisons are not particularly insightful, though they show some obvious differences in algorithm performance.
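The permutation scan of the benchmark can be sketched in a few lines of Python. This is a simplified version using only invariant masses, with the leptonic W taken as a single reconstructed four-vector; the function names and event structure are illustrative assumptions:

```python
import itertools
import math

M_TOP, SIGMA_TOP = 172.5, 30.0   # GeV, values used in Eq. (5.1)
M_W, SIGMA_W = 80.4, 20.0

def inv_mass(fourvecs):
    """Invariant mass of a list of (E, px, py, pz) four-vectors."""
    E = sum(v[0] for v in fourvecs)
    px = sum(v[1] for v in fourvecs)
    py = sum(v[2] for v in fourvecs)
    pz = sum(v[3] for v in fourvecs)
    return math.sqrt(max(E * E - px * px - py * py - pz * pz, 0.0))

def chi2(assigned_jets, lep_w):
    """Eq. (5.1) for one assignment:
    assigned_jets = (b_had, q1, q2, b_lep)."""
    b_had, q1, q2, b_lep = assigned_jets
    m_jj = inv_mass([q1, q2])            # hadronic W candidate
    m_jjb = inv_mass([q1, q2, b_had])    # hadronic top candidate
    m_lv = inv_mass([lep_w])             # leptonic W candidate
    m_lvb = inv_mass([lep_w, b_lep])     # leptonic top candidate
    return ((m_jjb - M_TOP) ** 2 / SIGMA_TOP ** 2
            + (m_jj - M_W) ** 2 / SIGMA_W ** 2
            + (m_lvb - M_TOP) ** 2 / SIGMA_TOP ** 2
            + (m_lv - M_W) ** 2 / SIGMA_W ** 2)

def best_permutation(jets, lep_w):
    """Scan all ordered choices of 4 jets out of the leading 5."""
    return min(itertools.permutations(jets[:5], 4),
               key=lambda p: chi2(p, lep_w))
```

The `min` over `itertools.permutations` implements the exhaustive scan over unique jet-to-parton assignments described above.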
In particular, the χ²-fit algorithm reproduces the φ distributions well (which are expected to be featureless), while the AngryTops algorithm typically creates some structure in these distributions at the level of 10–20%.

The AngryTops algorithm shows a somewhat better performance on the semileptonic W-boson and top-quark kinematic variables than the χ²-fit, as shown in Figures 3 and 5, but does not improve on the kinematics of the b quarks. We observe that the two models are generally closest in performance on the top-quark kinematics, with the distributions of both the χ²-fit and AngryTops matching the MC distributions closely. We note that the angular distributions are not particularly well reproduced by AngryTops, a feature that appears to arise from either incomplete training of the network or a fundamental instability in how these variables are reproduced by the BLSTM.

Figure 4. Reconstructed hadronic top quark observables. The gray filled area represents the prediction obtained using the MadGraph5+Pythia8 Monte Carlo event generator. The black dashed line is obtained from the permutation of jets which minimizes the χ². The red solid line is the output of the BLSTM.

Interestingly, both the neural network and the χ²-fit algorithm perform in a similar manner on the hadronic top-quark kinematics. In particular, both tend to underestimate the top-quark p_T in the same manner. AngryTops predictions for the η distribution are more accurate, though we see a remaining asymmetry in the φ distribution.

We observe that the kinematics of the W boson and b quark are reconstructed relatively poorly by both algorithms, a feature that is well known for the χ²-fit algorithm and is not improved by the BLSTM approach. The b-quark kinematics are perhaps the most poorly reconstructed observables for AngryTops, with a consistent underestimation of the b-quark p_T distribution. Although all b-quark jet candidates used in the training are required to have p_T > 20 GeV, interestingly, AngryTops predicts results that are below that threshold.

Figure 5. Reconstructed W boson observables for the semileptonic top quark. The gray filled area represents the prediction obtained using the MadGraph5+Pythia8 Monte Carlo event generator. The black dashed line is obtained from the permutation of jets which minimizes the χ². The red solid line is the output of the BLSTM.

The choice of kinematic variables used to represent the data also has a significant impact on the performance of the models. We find that in general our models struggle most with learning the transverse-momentum distributions. For all particles besides the W boson from the semileptonic top quark and the semileptonic top quark itself, AngryTops consistently underestimates the transverse momentum and even fails to learn the p_T cutoff at 20 GeV. This error arises from the underestimates of p_x and p_y made by AngryTops. The χ²-fit, on the other hand, tends to slightly overestimate the transverse momentum, and while it fails to reproduce the cutoff in the W-boson transverse momentum, it is able to account for the cutoff in the b-quark transverse momentum.

There are also slight differences in the shapes of the distributions between our models and the MC histograms. While AngryTops mostly learns the distributions, some asymmetries occasionally occur in the φ and rapidity distributions. These asymmetries are not a consistent phenomenon and differ between training sessions. Due to the inherent complexity of AngryTops, further studies of our machine-learning approach are necessary to better understand this behaviour.

Figure 6. Reconstructed W-boson observables for the hadronic top quark. The gray filled area represents the prediction obtained using the MadGraph5+Pythia8 Monte Carlo event generator. The black dashed line is obtained from the permutation of jets which minimizes the χ². The red solid line is the output of the BLSTM.
The χ²-fit, on the other hand, does not present any asymmetries and significantly outperforms AngryTops in the φ variable.

In this study, we have analyzed the capability of neural networks to perform kinematic reconstruction of particles in the semileptonic t¯t decay. While we do not claim to have the best network architecture for this problem, our work demonstrates a potential avenue for machine learning in this line of research. In Sections 5 and 6, we show that machine-learning based approaches can be competitive with standard reconstruction algorithms such as the χ²-fit.

Figure 7. Reconstructed b-quark observables for the semileptonic top quark. The gray filled area represents the prediction obtained using the MadGraph5+Pythia8 Monte Carlo event generator. The black dashed line is obtained from the permutation of jets which minimizes the χ². The red solid line is the output of the BLSTM.

The nature of a machine-learning based approach to kinematic reconstruction also presents additional advantages in improved flexibility. Algorithms such as the χ²-fit, KLFitter and PseudoTop require a fixed number of jets and inputs, while with AngryTops one can easily update the input information of a network by adding or modifying network layers. An extension to the boosted regime, where the quarks are collimated and appear as a single jet, seems straightforward with the current implementation based on recurrent neural networks. The major drawback of AngryTops, however, is that one has little understanding of the intermediate steps performed by the network.

There are many ways to go beyond the AngryTops project. Of course, the search for the "best" neural network architecture is an ongoing problem that can only be solved with further time and developments in machine-learning techniques. Additionally, a comparison between AngryTops and a more sophisticated kinematic reconstruction algorithm such as KLFitter is necessary. We reserve this step of the analysis for a later study, as differences in detector-level simulations and input information make a direct comparison difficult, given the complexity of these more sophisticated algorithms. It is possible that a combination of pre-existing reconstruction algorithms aided by machine-learning based approaches may also lead to significant advancements in the kinematic reconstruction of t¯t final states.

Figure 8. Reconstructed b-quark observables for the hadronic top quark. The gray filled area represents the prediction obtained using the MadGraph5+Pythia8 Monte Carlo event generator. The black dashed line is obtained from the permutation of jets which minimizes the χ². The red solid line is the output of the BLSTM.
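The binned χ²/NDF comparison of Eqs. (5.2)–(5.3), used for the values reported in Table 1, can be sketched directly from two histograms with per-bin uncertainties:

```python
import numpy as np

def chi2_ndf(y_mc, sig_mc, y_pred, sig_pred):
    """Reduced chi-square of Eq. (5.2) between an MC histogram and a
    predicted histogram, with per-bin uncertainties added in
    quadrature as in Eq. (5.3)."""
    y_mc = np.asarray(y_mc, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Combined per-bin variance: sigma_i^2 = (sigma_MC)^2 + (sigma_pred)^2
    sigma2 = np.asarray(sig_mc, dtype=float) ** 2 \
           + np.asarray(sig_pred, dtype=float) ** 2
    n = len(y_mc)
    return np.sum((y_mc - y_pred) ** 2 / sigma2) / (n - 1)

# Identical histograms give chi2/NDF = 0; larger values flag
# disagreement between the prediction and the MC distribution.
```

Applying this function bin-by-bin to each normalized kinematic distribution reproduces the comparison metric of Section 5.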
    Observable     χ²/NDF (BLSTM)     χ²/NDF (χ²-fit)
    p_T^{t,lep}
    η^{t,lep}
    φ^{t,lep}
    p_T^{t,had}
    η^{t,had}
    φ^{t,had}
    p_T^{W,lep}
    η^{W,lep}
    φ^{W,lep}
    p_T^{W,had}
    η^{W,had}
    φ^{W,had}
    p_T^{b,lep}
    η^{b,lep}
    φ^{b,lep}
    p_T^{b,had}
    η^{b,had}
    φ^{b,had}

Table 1. Comparison of the difference in the distributions of the kinematic variables, as measured by χ²/NDF between the results and the MC prediction, resulting from the BLSTM and χ²-fit reconstruction methods.

Acknowledgments
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

[1] A. Quadt, "Top quark physics at hadron colliders," Eur. Phys. J., vol. C48, pp. 835–1000, 2006.
[2] J. Erdmann, S. Guindon, K. Kroeninger, B. Lemmer, O. Nackenhorst, A. Quadt, and P. Stolte, "A likelihood-based reconstruction algorithm for top-quark pairs and the KLFitter framework," Nucl. Instrum. Meth., vol. A748, pp. 18–25, 2014.
[3] J. Kvita, "Study of methods of resolved top quark reconstruction in semileptonic t¯t decay," Nucl. Instrum. Meth., vol. A900, pp. 84–100, 2018.
[4] A. Abulencia et al., "Top quark mass measurement using the template method in the lepton + jets channel at CDF II," Phys. Rev., vol. D73, p. 032003, 2006.
[5] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni et al., "The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations," JHEP, vol. 07, p. 079, 2014.
[6] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands, "An Introduction to PYTHIA 8.2," Comput. Phys. Commun., vol. 191, p. 159, 2015.
[7] F. Caravaglios, M. L. Mangano, M. Moretti, and R. Pittau, "A new approach to multijet calculations in hadron collisions," Nucl. Phys., vol. B539, pp. 215–232, 1999.
[8] M. Cacciari, G. P. Salam, and G. Soyez, "The anti-k_t jet clustering algorithm," JHEP, vol. 04, p. 063, 2008.
[9] M. Cacciari, G. P. Salam, and G. Soyez, "FastJet user manual," Eur. Phys. J. C, vol. 72, p. 1896, 2012.
[10] F. Chollet et al., "Keras," https://keras.io, 2015.
[11] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from http://tensorflow.org/
[12] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
[13] J. Bergstra, D. Yamins, and D. D. Cox, "Making a science of model search," arXiv:1209.5111, 2012.
[14] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica, "Tune: A research platform for distributed model selection and training," arXiv:1807.05118, 2018.