Development of a Vertex Finding Algorithm using Recurrent Neural Network
Kiichi Goto, Taikan Suehara, Tamaki Yoshioka, Masakazu Kurata, Hajime Nagahara, Yuta Nakashima, Noriko Takemura, Masako Iwasaki
Department of Physics, Graduate School of Science, Kyushu University
Department of Physics, Faculty of Science, Kyushu University
Research Center for Advanced Particle Physics (RCAPP), Kyushu University
Department of Physics, Graduate School of Science, The University of Tokyo
Osaka University Institute for Datability Science (IDS)
Department of Mathematics and Physics, Graduate School of Science, Osaka City University
Nambu Yoichiro Institute of Theoretical and Experimental Physics (NITEP), Osaka City University
Research Center for Nuclear Physics (RCNP), Osaka University
Abstract
Deep learning is a rapidly evolving technology with the potential to significantly improve the physics reach of collider experiments. In this study we developed a novel vertex finding algorithm for future lepton colliders such as the International Linear Collider. We deploy two networks: one is a simple fully-connected network to look for vertex seeds among track pairs, and the other is a customized Recurrent Neural Network with an attention mechanism and an encoder-decoder structure to associate tracks to the vertex seeds. The performance of the vertex finder is compared with the standard ILC reconstruction algorithm.
Machine learning has long been used for event reconstruction and physics analysis in particle physics. Deep learning (DL) techniques, which are rapidly advancing in recent years and widely applied to various fields of science and technology such as image recognition and automatic translation, have also started to be applied to particle physics as a natural extension of traditional machine learning. Since DL can efficiently process larger amounts of information than traditional machine learning, the latest energy-frontier collider detectors with millions of readout channels may significantly improve their performance with the application of DL.

The International Linear Collider (ILC) [1] is an e+e− collider project being considered for construction in Japan with an initial center-of-mass energy of 250 GeV. One of the main targets of the ILC is the precise measurement of Higgs couplings to various particles, which gives critical information to search for and identify Beyond-Standard-Model (BSM) theories. Measurements of final states including heavy-flavor (b or c) quarks are especially important in Higgs studies since the Higgs couples more strongly to heavier particles.

One of the major discriminants of b and c quarks from light quarks (u/d/s) is the existence of secondary vertices in the jets, since b and c quarks hadronize into b and c hadrons, which have finite decay lengths (cτ) of 400-500 µm and 20-300 µm, respectively. Since b hadrons mostly decay to c hadrons, b jets usually have secondary and tertiary vertices corresponding to the decays of b and c hadrons, while c jets only have secondary vertices. The secondary and tertiary vertices can be identified by finding 3-dimensional points where multiple charged tracks cross within their uncertainties, with significant separation from the primary vertex, which is the interaction point of the event.
The standard method of primary and secondary vertex finding for the ILC is LCFIPlus [2], an integrated jet reconstruction tool consisting of vertex finder, jet clustering and jet flavor tagging algorithms. The secondary vertex finder in LCFIPlus is based on the "build-up" technique, which finds track pairs giving crossing points compatible with secondary vertices as vertex candidates, and tries to add other tracks with quality selection criteria. It depends on many "human-tuned" parameters for the selection.

In this study, we developed a new vertex finding algorithm using a recurrent neural network (RNN) and an attention mechanism [3, 4]. The RNN is a network to process sequential data, often used for speech recognition and natural language processing (NLP). Attention is an emerging technique in DL, first developed to improve RNN-based networks. Our vertex finder is designed to replace the vertex finder in LCFIPlus, allowing a direct comparison of performance. TensorFlow (2.1.0) [5] and Keras (2.3.1) [6] on Python are used as the DL framework for the design and training of the networks. Inference can be done with the C++ version of TensorFlow to ensure replaceability with LCFIPlus.

For the training and the performance estimation of our vertex finder, full detector simulation samples produced for the Detailed Baseline Design (DBD) studies of the International Large Detector (ILD) [7], which is one of the detector concepts for the ILC, are used. Two-fermion samples of e+e− → bb̄ and cc̄ at √s = 91 GeV are produced with the Whizard event generator [8]. The event samples have production conditions identical to those used for the performance study of [2].
The basic concept of our vertex finder is "build-up", similar to LCFIPlus. We use two networks to realize a DL-based build-up vertex finder. The first "seed finding" network takes the information of a track pair (44 input variables; the variables are listed in Table 1) and categorizes whether the track pair is suitable to be a candidate "seed" of a vertex. The detailed categorization is discussed later. The second "vertex production" network takes a candidate seed vertex obtained from the seed-finding network and examines every remaining track to decide whether it should be associated to the seed vertex. An RNN-based network is adopted for the second network to treat a variable number of tracks. Details of the two networks are explained in this section.
Name               Description              Number of parameters
track parameters   d0, z0, φ, ω, tan λ      5

Table 1: List of the input variables of a single track.
The network for track pairs is one of the two networks for vertex finding. The input variables of the network comprise the information of the two tracks, and the network predicts the seed type of the vertex, listed in Table 2, and the distance of the vertex from the origin.
Name     Description
NC       tracks from different vertices (Not Connected)
PV       track pair from the primary vertex
SVCC     track pair from a secondary vertex of charm flavor in the final state of cc̄
SVBB     track pair from a secondary vertex of bottom flavor in the final state of bb̄
TVCC     track pair from a tertiary vertex of charm flavor in the final state of bb̄
SVBC     one track originating from a b hadron and the other from a c hadron in the same decay chain
Others   track pair originating from another particle such as a τ, a strange hadron or a photon conversion

Table 2: Categorization of track pairs ("seed type") of the first network.
The true labels of the seed types were obtained from Monte-Carlo (MC) information. The output of the vertex fitter of LCFIPlus is used as the true label of the vertex distance. The network processes a track pair, and performs 7-class categorization of the seed types and regression to obtain the vertex distance. The structure is a simple feedforward network with batch normalization layers inserted between fully-connected layers, as shown in Figure 1. The network is divided into two sections, classification and regression, at the last activation layer.

All track pairs from each event are used for training the network. Since the primary vertex has a larger number of tracks, most of the track pairs are categorized as PV or NC, which causes imbalanced statistics among the categories as shown in Figure 2. To prevent performance degradation due to the imbalanced statistics, the Cost-Sensitive Learning scheme [9] is used to weight the loss function according to the statistics of the categories. The total loss function including the regression is as follows:

L_CE = − Σ_x w_x t_x log(y_x),   x ∈ {NC, PV, SVCC, SVBB, TVCC, SVBC, Others}
L_MSLE = (ln(t_Position + 1) − ln(y_Position + 1))²
L_Tot = w_vertex L_CE + w_position L_MSLE    (1)

where t_x and y_x are the true labels and predicted scores, respectively, w_x are per-class weights, and w_vertex and w_position are the weights of the two loss terms. The loss for the regression is calculated by the mean-squared logarithmic error (MSLE).
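As an illustration, the combined loss of Eq. (1) can be sketched in NumPy as follows. The per-class weights are placeholders (their actual values are not given here), and all names such as `CLASS_WEIGHTS` and `total_loss` are chosen for this sketch rather than taken from the implementation.

```python
import numpy as np

# Placeholder class weights for (NC, PV, SVCC, SVBB, TVCC, SVBC, Others);
# the values used for cost-sensitive learning in the study are not reproduced here.
CLASS_WEIGHTS = np.array([0.1, 0.1, 1.0, 1.0, 1.0, 1.0, 0.5])

def weighted_cross_entropy(t, y):
    """L_CE = -sum_x w_x t_x log(y_x) over the 7 seed-type classes."""
    return -np.sum(CLASS_WEIGHTS * t * np.log(np.clip(y, 1e-7, 1.0)))

def msle(t_pos, y_pos):
    """L_MSLE = (ln(t+1) - ln(y+1))^2 for the vertex-distance regression."""
    return (np.log(t_pos + 1.0) - np.log(y_pos + 1.0)) ** 2

def total_loss(t, y, t_pos, y_pos, w_vertex=1.0, w_position=1.0):
    """L_Tot = w_vertex * L_CE + w_position * L_MSLE."""
    return w_vertex * weighted_cross_entropy(t, y) + w_position * msle(t_pos, y_pos)
```

In a Keras model the same combination would be expressed as two named outputs with per-output losses and `loss_weights`.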
Figure 1:
Structure of the network for track pairs.
Figure 2:
Statistics of the imbalanced data. The bb̄ and cc̄ samples are accumulated. 50% of the events are randomly taken for the NC and PV categories to reduce the imbalance of the statistics.

The performance of this network is shown in Figure 3. Reasonable classification performance for the NC and PV categories is seen, while the SV categories (SVCC, SVBB, TVCC, SVBC) are not efficiently separated from each other. Significant mis-identification of pairs with the true label of NC into other categories is seen in the purity matrix, due to the dominant fraction of the NC true label. A part of the mis-identified pairs should be removed at the selection of the input of the second network, as described in Section 3.
Figure 3:
Confusion matrices for efficiency (left) and purity (right). In the efficiency matrix the sum of each row is one, while in the purity matrix the sum of each column is one.
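For reference, the row and column normalizations behind these two matrices can be written in a few lines of NumPy; the counts matrix below is purely hypothetical.

```python
import numpy as np

def efficiency_purity(counts):
    """Row-normalize a confusion-counts matrix (true labels along rows) for
    efficiency, and column-normalize (predicted labels along columns) for purity."""
    counts = np.asarray(counts, dtype=float)
    efficiency = counts / counts.sum(axis=1, keepdims=True)
    purity = counts / counts.sum(axis=0, keepdims=True)
    return efficiency, purity

# Hypothetical 3-class counts (rows: true label, columns: predicted label).
counts = np.array([[80, 15, 5],
                   [10, 70, 20],
                   [ 5, 25, 70]])
eff, pur = efficiency_purity(counts)
# Each row of `eff` and each column of `pur` sums to one.
```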
The second network is used to generate a vertex by adding tracks one by one to the vertex seed obtained by the seed-finding network. The RNN framework was adopted for this network since the number of tracks varies in each event. We designed a network to update the vertex using the long short-term memory (LSTM) structure [10]. Each cell of the network takes the input variables of a single track, determines whether the track is suitable to be associated to the vertex whose information is considered to be stored in the memory, and updates the vertex according to the information of the input track if the track is accepted. Parameters of a track pair selected by the seed-finding network with the procedure described in Section 3 are passed through two fully-connected layers with batch normalization and ReLU activation to calculate the initial state of the memory.

One of the big issues of using the LSTM structure in this network is that the LSTM heavily depends on the order in which the tracks are provided, while the order of the tracks is not important (exchangeable) for this task. To reduce the dependence on the order of the tracks, a dedicated LSTM structure was developed for this study. Figure 4 shows the cell of the modified network, where the hidden state of the short-term memory is effectively removed compared to the original LSTM cell. Each step of the cell (1, 2, 3) in Figure 4 is calculated as:

h_N = σ(d_h · [tanh(v_{N−1}) ⊙ σ(W_o t_N + R_o v_{N−1})])
v′_N = v_{N−1} ⊙ σ(W_f t_N + R_f v_{N−1}) + tanh(W_c t_N + R_c v_{N−1}) ⊙ σ(W_i t_N + R_i v_{N−1})
v_N = (1 − h_N) v_{N−1} + h_N v′_N    (2)

where d_h, W_x, and R_x are a vector and matrices of trainable weights, and v_N and t_N are the Nth hidden state of the vertex information and the information of the input track, respectively. h_N is the Nth binary output, showing whether the Nth track is associated to the (N−1)th vertex.
"⊙" denotes element-wise multiplication. The operations can be understood as follows:

1. determine whether the Nth track is associated to the (N−1)th vertex;
2. calculate the updated vertex with the Nth track and the (N−1)th vertex;
3. adopt the Nth vertex if the track is associated in step 1, and keep the (N−1)th vertex if it is not.

For further extension, we implemented an encoder-decoder network with an attention mechanism using the dedicated LSTM cell. The attention encoder-decoder model is shown in Figure 5. A bidirectional RNN is used for the encoder part to further reduce the dependence on the order of the tracks. The encoder (blue) cells and decoder (red) cells are modified from the dedicated LSTM cell described above. The encoder cell is modified as:

h_N = tanh(v_{N−1}) ⊙ σ(W_o t_N + R_o v_{N−1})    (3)

to provide multi-dimensional variables to the encoder output.

In the decoder cell, attention weights are calculated with the additive attention scheme using the encoder output by the following formulae:

e_N = u_energy (K U_key + T_N U_query)
a_N = (a_{N,1}, a_{N,2}, ···, a_{N,i}, ···) = (exp(e_{N,1}) / Σ_j exp(e_{N,j}), exp(e_{N,2}) / Σ_j exp(e_{N,j}), ···, exp(e_{N,i}) / Σ_j exp(e_{N,j}), ···)
c_N = a_N V
h_N = σ(d_h · [tanh(v_{N−1}) ⊙ σ(W_o t_N + R_o v_{N−1} + C_o c_N)])
v′_N = tanh(W_c t_N + R_c v_{N−1} + C_c c_N) ⊙ σ(W_i t_N + R_i v_{N−1} + C_i c_N) + v_{N−1} ⊙ σ(W_f t_N + R_f v_{N−1} + C_f c_N)
v_N = (1 − h_N) v_{N−1} + h_N v′_N    (4)

where u_energy and U_key, U_query are a vector and matrices of trainable weights for the additive attention. The key K and value V are the same matrix, the encoder output. The Nth query T_N is a matrix with the Nth track stacked. e_N, a_N, and c_N are the energy, the attention weights, and the context for the Nth query, respectively. C_x are also matrices of trainable weights for the context.
The first three equations calculate the attention. The remaining three equations show the extension of the dedicated LSTM structure.

The initial hidden states v_0 of both directions of the bidirectional RNN of the encoder part and of the single RNN of the decoder part are calculated in the same way as in the simple LSTM case, by two fully-connected layers with the track-pair variables as input.

Samples of track pairs for the initial state and multiple tracks for the sequential input are necessary for the training. Track pairs coming from the same vertex in the MC information are used for the initial states, and all tracks in the same event are used for the sequential input. Since the training of an RNN requires a fixed length of the sequential input in the Keras framework, dummy tracks with all track parameters set to zero are used for padding if the number of tracks is smaller than the fixed length. To discriminate dummy tracks, one additional variable flagging whether a track is a dummy track or a real track is added to the sequential input, resulting in 23 variables in total. The order of the tracks is shuffled epoch by epoch during the training to further reduce the dependence of the training result on the order of the tracks.

Figure 4: Schematic of the dedicated LSTM cell. The numbers in the circles stand for the steps and formulae in the text.

Figure 5: Schematic of the attention encoder-decoder model. The upper part is the bidirectional LSTM for the encoder, and the lower part is the attention LSTM for the decoder. The blue and red circles show the custom LSTM cells explained in the text.

Figure 6 shows the training curves of the three types of networks. "Simple Standard LSTM" stands for the result with the standard LSTM structure, and "Simple Dedicated LSTM" stands for the result with the cell structure of Figure 4 in a simple RNN without the encoder-decoder and attention structures. "Attention Dedicated LSTM" stands for the network described in Figure 5. The two tracks used as the vertex seed were excluded from the calculation of the accuracy, true-positive fraction, and true-negative fraction. A clear improvement with the use of the dedicated LSTM structure and the attention encoder-decoder structure is seen. The instability of the training seen with the standard LSTM can be due to the shuffling of the order of the tracks at each epoch.

Figure 7 shows the attention weights of one event in a test sample. Each circle shows a track, with the same order for the encoder and the decoder tracks. The difference in the numbers of tracks is due to the dummy tracks placed only among the encoder tracks. It shows that the tracks associated to the vertex (labeled "Connected") tend to have larger attention weights than the other real tracks, while the tracks not associated to the vertex (labeled "Not Connected") tend to have larger weights than dummy tracks.
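Translated into code, one step of the dedicated LSTM cell of Eq. (2) might look like the following NumPy sketch. The dimensions (23 track variables, a 64-dimensional vertex state) and all names are illustrative assumptions, not taken from the actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dedicated_lstm_step(t_N, v_prev, W, R, d_h):
    """One step of the modified LSTM cell (Eq. 2):
    1. h: scalar gate deciding whether track t_N joins the vertex,
    2. v_cand: candidate updated vertex state including t_N,
    3. v_new: keep v_prev if rejected, adopt v_cand if accepted."""
    h = sigmoid(d_h @ (np.tanh(v_prev) * sigmoid(W['o'] @ t_N + R['o'] @ v_prev)))
    v_cand = (v_prev * sigmoid(W['f'] @ t_N + R['f'] @ v_prev)
              + np.tanh(W['c'] @ t_N + R['c'] @ v_prev)
              * sigmoid(W['i'] @ t_N + R['i'] @ v_prev))
    v_new = (1.0 - h) * v_prev + h * v_cand
    return h, v_new

# Assumed dimensions: 23 track variables, 64-dim vertex state; random weights
# stand in for the trained W_x, R_x, d_h.
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(64, 23)) for k in 'ofci'}
R = {k: rng.normal(scale=0.1, size=(64, 64)) for k in 'ofci'}
d_h = rng.normal(scale=0.1, size=64)
h, v = dedicated_lstm_step(rng.normal(size=23), np.zeros(64), W, R, d_h)
```

Note that, unlike a standard LSTM, the scalar gate h interpolates between the old and candidate vertex states, so a rejected track leaves the vertex memory essentially unchanged regardless of its position in the sequence.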
Figure 6:
Comparison of the training curves (upper-left: loss function, upper-right: accuracy, lower-left: true-positive fraction, lower-right: true-negative fraction) for the three LSTM structures. See the text for details of each structure.
Figure 7:
Strength of attention weights between encoder and decoder tracks for one event from the test sample.
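The additive-attention step that produces these weights (the first three lines of Eq. (4)) can be sketched as follows. The shapes and names are assumptions for illustration, and the energy is computed without an extra nonlinearity, following the formula as written above.

```python
import numpy as np

def additive_attention(K, V, T_N, U_key, U_query, u_energy):
    """Additive attention of Eq. (4): one energy per encoder step from the key
    matrix K and the stacked current track T_N, a softmax over the energies,
    and a context vector as the weighted sum of the value rows."""
    e = (K @ U_key + T_N @ U_query) @ u_energy   # energies e_N, one per encoder step
    a = np.exp(e - e.max())
    a /= a.sum()                                 # attention weights a_N (sum to one)
    c = a @ V                                    # context c_N
    return a, c

# Hypothetical shapes: 8 encoder steps, 16-dim encoder output, 23 track variables.
rng = np.random.default_rng(1)
K = V = rng.normal(size=(8, 16))                 # key and value are both the encoder output
T_N = np.tile(rng.normal(size=23), (8, 1))       # current track stacked once per encoder step
a, c = additive_attention(K, V, T_N,
                          rng.normal(size=(16, 32)), rng.normal(size=(23, 32)),
                          rng.normal(size=32))
```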
Figure 8:
Schematic diagram of our vertex finder. Numbers show corresponding steps in the text.
Figure 8 shows a schematic diagram of our vertex finder using the two networks. Primary and secondary vertices are reconstructed by the following steps:

1. Search all pairs of tracks for vertex seeds by inference of the "network for seed finding".
2. Generate a primary vertex by inference of the "network for vertex production" with the vertex seeds labeled as PV at Step 1.
3. Select seeds of the secondary vertices obtained by the "network for seed finding".
4. Generate secondary vertices recurrently until the seeds are used up, by inference of the "network for vertex production" with the vertex seeds selected in Step 3.

At Step 1, every track pair is labeled as PV, SV (SVBB, SVCC, TVCC, SVBC) or others (NC, Others) by inference of the "network for seed finding". At Step 2, pairs labeled as PV are listed in descending order of the PV score of the seed-finding network, and used to calculate the initial state of the "network for vertex production". All tracks in the event are used as sequential input, and tracks with scores larger than the parameter "score for PV production" are assigned as tracks from the primary vertex. This step is repeated as many times as the parameter "number of PV seeds", and all tracks assigned once or more are combined to produce the primary vertex. Step 3 is a set of preselections of the seeds of secondary vertices, with thresholds given by the parameters "score for SV seeds" and "vertex distance" applied to the output of the seed-finding network. The selections are applied to reduce the contamination of NC track pairs misassigned to SV. Track pairs including track(s) assigned to the primary vertex in Step 2 are also removed from the list of seeds. The remaining track pairs are listed in descending order of the SV score. Step 4 is the production of secondary vertices with the seeds listed at Step 3. All tracks in the event are used again as sequential input, and tracks with scores larger than the parameter "score for SVs production" are assigned as tracks from the secondary vertex.
If a track is assigned to both the primary and a secondary vertex, the scores from the networks are compared and the track is assigned to the vertex with the higher score. Step 4 is repeated until all vertex seeds are used up. From the second iteration, track pairs including track(s) already assigned to previous secondary vertices are removed from the list of seeds and from the sequential input.

The parameters described above are summarized with their optimized values in Table 3.
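The control flow of Steps 1-4 can be sketched as a short driver loop. This is not the actual implementation: the network calls `seed_net` and `vertex_net` are stand-ins, the SV sub-labels are collapsed into a single 'SV' label, and tracks already used are simply skipped rather than re-scored against both vertices.

```python
from itertools import combinations

def find_vertices(tracks, seed_net, vertex_net,
                  sv_seed_score=0.88, max_distance=30.0,
                  n_pv_seeds=3, pv_score=0.50, sv_score=0.75):
    """Sketch of the four-step build-up procedure; `seed_net(pair)` is assumed
    to return (label, score, distance) and `vertex_net(seed_pair, track)` a score."""
    pairs = [(p,) + seed_net(p) for p in combinations(tracks, 2)]

    # Steps 1-2: primary vertex from the highest-scoring PV seeds.
    pv_seeds = sorted((x for x in pairs if x[1] == 'PV'),
                      key=lambda x: x[2], reverse=True)[:n_pv_seeds]
    primary = {t for pair, _, _, _ in pv_seeds for t in tracks
               if vertex_net(pair, t) > pv_score}

    # Step 3: preselect SV seeds; drop pairs using primary-vertex tracks.
    sv_seeds = sorted((x for x in pairs
                       if x[1] == 'SV' and x[2] > sv_seed_score
                       and x[3] < max_distance and not set(x[0]) & primary),
                      key=lambda x: x[2], reverse=True)

    # Step 4: build secondary vertices until the seeds are used up.
    used, vertices = set(primary), []
    for pair, _, _, _ in sv_seeds:
        if set(pair) & used:          # seed track already taken by an earlier vertex
            continue
        vtx = {t for t in tracks
               if t not in used and vertex_net(pair, t) > sv_score} | set(pair)
        used |= vtx
        vertices.append(vtx)
    return primary, vertices
```

With toy networks that cleanly separate two groups of tracks, the loop returns one primary vertex and one secondary vertex as expected.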
Name                      Description                                                                          Value
score for SV seeds        sum of the scores for SVs obtained by the "network for seed finding"                0.88
vertex distance           distance of the vertex from the origin predicted by the "network for seed finding"  30.0
number of PV seeds        number of PV seeds used for the initial state of the PV production network          3
score for PV production   score for the PV obtained by the "network for vertex production"                    0.50
score for SVs production  score for the SVs obtained by the "network for vertex production"                   0.75

Table 3: List of the parameters of the vertex finder with their optimized values. See the text for details of the parameters.
Tables 4 and 5 show the combined performance of our vertex finder compared with LCFIPlus, using bb̄ samples at √s = 91 GeV. In the tables, each track is categorized according to the MC information as follows:

• Primary: tracks originating from the primary vertex.
• Bottom: tracks whose most immediate parent with a non-zero lifetime contains a b quark.
• Charm: same as above, except the parent contains a c quark.
• Others: all the other tracks, such as those from τ decays, strange hadrons, or photon conversions.

The tables show the fraction of tracks in each category associated to the reconstructed secondary vertices. The tracks in the secondary vertices are further categorized by two criteria:

• from the same decay chain: tracks assigned to vertices whose associated tracks all come from a single decay chain in the MC information, descending from the same b hadron.
• from the same parent particle: tracks assigned to vertices whose associated tracks all come from the same most immediate parent particle with a non-zero lifetime.

The tables show that the track-based efficiency to be associated to the secondary vertices is 5-10% higher with the DL-based vertex finder than with LCFIPlus, with reasonable quality of the vertices formed by correct tracks. This shows the possibility of applying DL techniques to improve vertex finding and quark identification. The contamination from primary and other tracks is, however, slightly higher with the DL-based vertex finder. Since the DL-based vertex finder currently only uses network-based selections, the contamination could possibly be reduced by adding analytical cuts on the track or vertex quality. A detailed study is ongoing.

Track origin                       Primary   Bottom    Charm     Others
Total number of tracks             307 649   167 161   152 314   86 225
Tracks in secondary vertices       1.2%      66.8%     74.7%     6.9%
...from the same decay chain       -         64.8%     69.1%     -
...from the same parent particle   -         37.9%     40.5%     -

Table 4: Performance of the vertex finder with DL. See the text for the explanation of each category.

Track origin                       Primary   Bottom    Charm     Others
Total number of tracks             496 897   258 299   247 352   56 432
Tracks in secondary vertices       0.6%      57.5%     64.3%     2.5%
...from the same decay chain       -         56.6%     63.4%     1.9%
...from the same parent particle   -         32.2%     38.9%     1.2%

Table 5: Performance of the vertex finder in LCFIPlus, obtained from [2].
A novel vertex finder using DL techniques has been developed. Two networks were designed: a simple DL network with fully-connected layers is used for the selection of the vertex seeds, and an RNN-based network with a custom cell structure is used to form vertices by associating tracks to the vertex seeds. An attention mechanism in an encoder-decoder structure has been implemented and has been shown to improve the performance of the network. The performance of our vertex finder has been compared with the standard method for the ILC, LCFIPlus, and shows an improvement in the efficiency of the secondary vertex reconstruction with a small increase of the contamination. More optimization is expected for this algorithm, and further development is desired to fully utilize DL techniques for full jet analysis by expanding the networks used in this vertex finder.
Acknowledgements
The authors thank the ILD group for providing the event samples produced for the training of LCFIPlus to be used for the comparison of the performance of the vertex finder. This work is done in collaboration with the RCNP project "Application of deep learning to accelerator experiments".
References

[1] "The International Linear Collider Technical Design Report - Volume 1: Executive Summary," arXiv:1306.6327 [physics.acc-ph].
[2] T. Suehara and T. Tanabe, "LCFIPlus: A Framework for Jet Analysis in Linear Collider Studies," Nucl. Instrum. Meth. A (2016) 109-116, arXiv:1506.08371 [physics.ins-det].
[3] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473 [cs.CL].
[4] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv:1508.04025 [cs.CL].
[5] https://www.tensorflow.org/.
[6] https://keras.io/ja/.
[7] ILD Concept Group Collaboration, H. Abramowicz et al., "International Large Detector: Interim Design Report," arXiv:2003.01116 [physics.ins-det].
[8] W. Kilian, T. Ohl, and J. Reuter, "WHIZARD: Simulating Multi-Particle Processes at LHC and ILC," Eur. Phys. J. C (2011) 1742, arXiv:0708.4233 [hep-ph].
[9] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'01, pp. 973-978. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation 9, no. 8 (1997) 1735-1780, https://doi.org/10.1162/neco.1997.9.8.1735.