Sequence-based Machine Learning Models in Jet Physics
Rafael Teixeira de Lima
SLAC National Accelerator Laboratory, [email protected]
Abstract
Sequence-based modeling broadly refers to algorithms that act on data represented as an ordered set of input elements. In particular, Machine Learning algorithms with sequences as inputs have seen successful applications to important problems, such as Natural Language Processing (NLP) and speech signal modeling. The usage of this class of models in collider physics leverages their ability to act on data with variable sequence lengths, such as constituents inside a jet. In this document, we explore the application of Recurrent Neural Networks (RNNs) and other sequence-based neural network architectures to classify jets, to regress jet-related quantities, and to build a physics-inspired jet representation, in connection to jet clustering algorithms. In addition, alternatives to sequential data representations are briefly discussed.
To appear in Artificial Intelligence for Particle Physics, World Scientific Publishing.
The area of Sequence-Based Learning in Machine Learning deals with the concepts and algorithms used to learn from data represented as an ordered set (sequence) of objects, each with its own set of characteristics (features), in which the positional information of each object (context) is important. The idea of contextual information being important for the algorithms is fundamental, since it can encode correlations between objects along the sequence. One of the main applications of this class of models is in natural language processing (NLP). In these cases, the sequence is often built from words in a sentence, and the algorithm must learn from it. How the learning occurs and what is learned will depend on the application. To perform a translation task (neural machine translation), for example, the algorithm must output another sequence. In other instances, summary semantic information needs to be obtained, such as when the algorithm needs to classify a certain sentence as positive or negative in an on-line product review.

In more mathematical terms, sequence-based models aim to perform operations $f$ on a sequence of inputs $\{x_t\}$, where each entry $x_t$ is a vector of features, and $t$ is a position in the ordered sequence of length $T$, as shown in the scheme presented in Fig. 1. In particle physics terms, the sequence can represent an ordered set of tracks that constitutes a jet, for example, while the entry $x_t$ represents the kinematics of the track in position $t$ of that sequence. An algorithm acting on the sequence $J = \{x_t\}$ can then be used to learn information about that jet, such as its flavor or charge (as will be discussed in the following sections).

In general, different sequences used as inputs to the algorithm will have different lengths, just as jets can have different numbers of tracks. If all sequences have fixed length, they can be collapsed into a single feature vector and simpler algorithms, such as densely-connected NNs, can be used. However, even for fixed-length sequence problems, sequence-based models can outperform simpler models by exploiting the ordered nature of the input data.

Figure 1: Scheme of a sequence-based algorithm acting on a sequence of inputs $x_t$.

A key feature of sequence-based models is the ability to share parameters in different parts of the same model, i.e., the operation $f$, which is learned, is applied to every step in the sequence. With parameter sharing, these models are able to generalize to sequences of different lengths, using the same set of parameters throughout the input elements. If different parameters were to be learned individually, the desired generalizability would not be achievable, and the model would behave similarly to a densely-connected network.
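To make the jet-as-sequence picture concrete, the short NumPy sketch below represents two toy "jets" as variable-length lists of per-track feature vectors and zero-pads them to a common length so that they can be batched; the features ($p_T$, $\eta$, $\phi$) and all values are illustrative assumptions, not taken from any experiment.

```python
import numpy as np

# Each "jet" is a variable-length sequence of per-track feature vectors.
# The features here (pT, eta, phi) are illustrative placeholders.
jets = [
    np.array([[50.1, 0.3, 1.2], [20.7, 0.1, 1.4]]),                      # 2 tracks
    np.array([[90.2, -1.1, 0.2], [35.0, -0.9, 0.3], [5.2, -1.0, 0.1]]),  # 3 tracks
]

max_len = max(len(j) for j in jets)
n_features = jets[0].shape[1]

# Zero-pad every jet to the same length so a single batch tensor can be
# formed; downstream models can mask the padded entries.
batch = np.zeros((len(jets), max_len, n_features))
for i, jet in enumerate(jets):
    batch[i, : len(jet)] = jet

print(batch.shape)  # (2, 3, 3): (jets, padded sequence length, features)
```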
A recurrent neural network (RNN) is a neural network implementation of the concepts described in the previous section. Densely-connected neural networks map an input vector of features $x$ to an output vector $o$. In contrast, RNNs map a sequence of inputs $x_t$ into an output, which can be a vector or a sequence as well. This can be achieved in many different ways, but RNN architectures generally present cyclical connections between units in the same or different layers. These interconnections between units are sequential, in the sense that each unit's hidden state is obtained by combining the previous unit's hidden state and the input from that step. This means that instead of a unit's hidden state being given by $h = f(x; \theta)$, it will be given by $h_t = f(h_{t-1}, x_t; \theta)$. In the particular case in which $f$ is given by a hyperbolic tangent function, for example, the RNN can be represented as

$$h_t = \tanh\left(W^\top x_t + V^\top h_{t-1} + b\right), \qquad (1)$$

instead of a simple densely-connected unit $h = \tanh\left(W^\top x + b\right)$, where $W$, $V$, and $b$ represent learnable weights and biases. Notice that in Eq. 1 the weights of the operations ($\theta$, or $W$, $V$, and $b$ explicitly) do not depend on the time step, or the sequence element position, $t$, explicitly showcasing the parameter sharing feature of the RNN. Similarly to MLPs functioning as universal approximators, large enough RNNs have been shown to universally approximate any measurable sequence-to-sequence map [1].

In general, RNN architectures add other features on top of the recurrent layer format described above. For example, in tasks where the algorithm needs to output another sequence, each individual hidden state in the recurrent layer might be read out into a densely connected network. In contrast, the cases where only a single output is read at the end of the sequential layer (also known as time-unfolded RNNs) are used to extract summary information of the input sequence. Even though the presence of cycles in these architectures could potentially complicate the process of updating the network parameters during the optimization step, backpropagation can still be applied to the unrolled computation graph (a procedure known as backpropagation through time), thus no specialized algorithms are necessary.

A simple but powerful extension of standard RNN architectures are Bidirectional Recurrent Neural Networks (BRNNs) [2]. For certain sequence-based modeling applications, knowledge about backward-in-time context might be as important as forward-in-time, only the latter of which is exploited in standard RNNs. BRNNs extend that context information by adding a second recurrent layer to the network architecture, which processes the sequence in reversed order. Both the forward and backward RNNs are then connected to the same output layer, providing a combined representation.

An idea related to RNN architectures, also used in time-series and signal processing analyses, is the 1D convolutional layer. In these architectures, the convolution kernel acts on neighbouring time steps and outputs a new sequence based on its inputs. These operations act on fixed-length sequences, where empty entries can be masked, similarly to 2D CNNs acting on sparse images with a fixed pixel grid. Parameter sharing is also an important feature here, exploited through the use of a single convolution operation across different time steps. One drawback of this method is its limited sensitivity to long-term dependencies, only exploring correlations across close neighbours, as defined by the kernel length.
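As a minimal sketch of the recurrence in Eq. (1), the NumPy loop below applies the same operation (here, fixed random matrices standing in for learned weights) at every time step, illustrating both parameter sharing and the time-unfolded readout of a single summary state. All dimensions and the initialization are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, T = 3, 8, 5

# One set of weights, shared across all time steps (parameter sharing).
W = rng.normal(scale=0.1, size=(n_features, n_hidden))
V = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

x = rng.normal(size=(T, n_features))  # input sequence {x_t}
h = np.zeros(n_hidden)                # initial hidden state h_0

for t in range(T):
    # Eq. (1): h_t = tanh(W^T x_t + V^T h_{t-1} + b)
    h = np.tanh(x[t] @ W + h @ V + b)

# For a "time-unfolded" RNN, the final h summarizes the whole sequence.
print(h.shape)  # (8,)
```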
When dealing with long sequences, one expected behavior of RNNs is to learn how correlated certain entries are, regardless of how far apart they appear in the input sequence. In practical terms, this implies that information from early entries in the sequence must be encoded in how the network learns about later entries (long-term dependencies). Unfortunately, during the learning steps, this requirement can give rise to the vanishing or exploding gradient problem (see, e.g., [3]). Vanishing gradients, for example, can lead to long-term dependencies being given less importance through the sequence compared to short-term ones.

This issue can be understood by imagining a simplified recurrent structure as a linear transformation between hidden states, $h_t = V^\top h_{t-1}$. This operation will be performed $T$ times when moving from time step 0 to the last sequence entry, which shows that the learnable parameters in $V$ will effectively be raised to the power of $T$: weights smaller than 1 will tend toward 0 at later steps, while weights larger than 1 will quickly grow. Most modern RNN architectures solve this problem with the usage of Long Short-Term Memory [4] (LSTM) units or Gated Recurrent Units [5] (GRUs).

LSTM units mitigate the vanishing gradient problem by introducing a new path in the recurrent loop where the information coming from each sequence entry can flow for long durations, possibly without interference from subsequent hidden states. This path is dynamically gated through learnable parameters, which means that the importance of long-term dependencies is optimized together with the rest of the network parameters. In particular, if the activation function pertaining to a gate remains close to 0, the information from the previous time step will not be propagated through the sequence. However, the LSTM will still use that time step's information to update its current hidden state. This optimizable gated structure ensures that both long-term and short-term contributions to the gradient are taken into account.

A schematic view of three sequential LSTM units is shown in Figure 2. The LSTM receives at time step $t$ both the hidden state of the previous time step ($h_{t-1}$), which is concatenated with the feature vector $x_t$, and an extra input (the cell state, $C_{t-1}$), which is regulated by a forget gate ($f_t$). The forget gate can be a neural network itself, with a sigmoid output, shared across all units, ensuring that the LSTM actively learns how much of the long-term correlations should be propagated along the sequence. Next, the cell state is updated with information learned from the previous hidden state and $x_t$ with a dedicated neural network, generating an intermediate cell state $\tilde{C}_t$. The impact of $\tilde{C}_t$ on the final cell state is regulated by another neural network, $i_t$. After the update, $C_t$ is propagated to the next time step. The final hidden state at this time step is derived with information learned from the previous hidden state and $x_t$ through a network ($o_t$), together with the updated cell state $C_t$.

Figure 2: Schematic structure of an LSTM recurrent layer. Two outputs for the unit's hidden state $h_t$ are shown, representing the case in which the LSTM layer outputs another sequence. Image adapted from [6].
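The NumPy sketch below spells out a single LSTM step using the gates just described ($f_t$, $i_t$, $o_t$ and the candidate cell state $\tilde{C}_t$); the weight shapes, sizes, and initialization are illustrative assumptions rather than any particular library's or experiment's configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid = 3, 8
z_dim = n_in + n_hid  # gates act on the concatenation [h_{t-1}, x_t]

# One weight matrix per gate, shared across all time steps.
Wf, Wi, Wc, Wo = (rng.normal(scale=0.1, size=(z_dim, n_hid)) for _ in range(4))
bf = bi = bc = bo = np.zeros(n_hid)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(z @ Wf + bf)           # forget gate: how much of C_{t-1} survives
    i_t = sigmoid(z @ Wi + bi)           # input gate: how much of C~_t is written
    C_tilde = np.tanh(z @ Wc + bc)       # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde   # gated cell-state update
    o_t = sigmoid(z @ Wo + bo)           # output gate
    h_t = o_t * np.tanh(C_t)             # new hidden state
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # a length-5 input sequence
    h, C = lstm_step(x_t, h, C)
```

Note how the additive update of $C_t$ provides the path along which gradients can flow for many steps when $f_t$ stays close to 1.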
GRUs work with a similar gated structure; however, the same gate that decides on the propagation of the previous cell state also determines, with inverse importance, how the current cell state should be updated with $h_{t-1}$ and $x_t$. With the LSTM nomenclature used above, the forget gate and the cell state update gate would be given by $f_t = u_t$ and $i_t = (1 - u_t)$, respectively, where $u_t$ is called the update gate. GRUs use comparatively fewer trainable parameters than LSTMs, which can be beneficial for smaller datasets; their less flexible architecture can potentially be detrimental in more complex applications. However, the studies discussed below that compared GRUs and LSTMs generally observe similar performance.

Representing reconstructed jets in collider experiments as images, for classification and other machine learning based tasks, has been a successful avenue of research for years now. There are, however, a few features related to this choice of representation that can make the process of training a computer vision-based algorithm difficult when compared, for example, to the training of a simple densely-connected neural network based on engineered features.

Jet images can be very sparse, i.e., contain few populated pixels, making the identification of features on individual jets a complicated task, even if identifying these features on their averaged images might be very easy. (Initial computer vision-based jet image studies reported only ∼ of activated pixels on average for 25 × 25 pixel images from signal- and background-type jets [7].) Pre-processing steps can be applied to reduce sparseness, for example, increasing the coarseness of the image by combining adjacent pixels. This procedure, however, effectively reduces the spatial information contained in the input image, penalizing the algorithm's performance.

Another issue arising in the pre-processing step is finding a unique geometrical representation for these images that reflects the expected symmetries of the problem. In general, geometrical transformations (e.g., rotations, translations and reflections) are used, aligning pre-defined axes based on the jets' spatial energy distributions, but these definitions can be very task specific and not well generalizable. This is especially true when the pattern of energy deposition inside these jets displays a different number of core clusters (prongness), for example, when comparing quark-initiated jets, jets from a collimated hadronic W boson decay, and jets from a collimated top quark decay.

The algorithms described below instead treat the jet as a collection of correlated objects, each one with its own set of characteristics (such as position and energy), and apply some of the sequence-based ML ideas discussed above to different types of tasks. This idea is reminiscent of the actual way jets are built in collider experiments, using sequential clustering algorithms (for example, the anti-$k_t$ algorithm [8], widely used in LHC experiments) acting on low-level detector quantities, such as calorimeter deposits or tracks.

The identification of jets originating from the products of the hadronization of heavy quarks (bottom and charm quarks), known as heavy flavor jets, is of fundamental importance for experiments at the LHC for two main reasons. Firstly, the Higgs boson, discovered in 2012 by ATLAS and CMS and currently the focus of intense experimental investigation, mainly decays to a pair of bottom quarks [9].
Secondly, the top quark, which together with the Higgs boson can help us learn about the structure of the electroweak vacuum [10], decays almost entirely to a bottom quark plus a W boson [11].

Finding these Higgs and top decays is not an easy task due to the enormous multijet background events at the LHC (events with at least two reconstructed jets). These multijet events are produced with cross sections over $10^3$ times larger than events with top quarks, and $10^4$ times larger than events with a Higgs boson [12]. Fortunately, most of these events contain jets from light quarks (quarks other than charm, bottom and top) and gluons (collectively grouped as light jets). Therefore, learning how to separate heavy flavor signal jets from the light jet background is necessary.

Hadrons containing bottom and charm quarks are heavy; for example, the B meson invariant mass is about 5.3 GeV/c² [11]. They decay through the weak force, thus having a long mean lifetime ($\tau_B \simeq 1.5 \times 10^{-12}$ s). Therefore, a B meson with a momentum of 50 GeV/c will travel on average over 4.5 mm before decaying. This displaced decay can be reconstructed as a secondary vertex, i.e., a vertex that is separated from the collision's primary vertex (where the initial proton-proton interaction occurred, in the case of the LHC). The LHC experiments' inner trackers (responsible for reconstructing the trajectories of charged particles) have been built with the intent of resolving these secondary vertices with good precision, in order to identify heavy flavor jets. The decay chain of a B meson, as simulated by the ATLAS experiment, is shown in Figure 3 [13], which also includes the tertiary vertex from the displaced decay of the D meson, containing a charm quark.

In general, requiring that a secondary vertex be reconstructed in order to identify a b-jet can be detrimental for a high efficiency algorithm. However, tracks originating from this displaced location will have very particular characteristics when compared to tracks from the primary vertex, and will be correlated by their shared origin. Therefore, algorithms that focus on finding correlations between tracks perform comparatively well with respect to direct secondary vertex finding, and can provide complementary information for a combined heavy versus light jet discrimination.

In the ATLAS experiment, two different types of algorithms have been developed to identify heavy flavor jets based on the likelihood of tracks originating from secondary vertices. Both use the tracks' impact parameter information, which encodes the distance of closest approach of the charged particle's trajectory with respect to the primary vertex. Particles originating from the primary vertex will have small impact parameters, while particles from secondary vertices will tend to have larger impact parameters. An important related quantity is the impact parameter significance, in which the distance is divided by the uncertainty on its measurement. Utilizing the significance minimizes the impact of low-quality tracks with large mis-measured impact parameters.

IP3D [14], one of the first ATLAS algorithms based on tracks' impact parameter information, treats the tracks as independent entities, ignoring possible correlations. It uses 3D histograms of the transverse and longitudinal impact parameter significances ($S_{d_0}$ and $S_{z_0}$, respectively) and a track quality grade, built from simulation and separately for b-jets, c-jets and light flavor jets. Per-flavor conditional likelihoods are calculated from these histograms for each track in a jet. With a naïve Bayes approach, a final jet-level likelihood is built by multiplying the individual track likelihoods.
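A sketch of this naïve Bayes combination, with the histogram template lookup replaced by a hypothetical toy function: under the independence assumption, the jet-level discriminant is simply the sum of per-track log-likelihood ratios. This is an illustration of the idea, not the actual IP3D implementation.

```python
import numpy as np

def track_likelihoods(s_d0, s_z0):
    """Placeholder for the template lookup: returns (p_b, p_light) for one
    track, standing in for the binned (S_d0, S_z0, quality) histograms."""
    # Toy model: b-jet tracks favor large impact parameter significance.
    p_b = 0.1 + 0.8 * (1.0 - np.exp(-max(s_d0, 0.0) / 5.0))
    return p_b, 1.0 - p_b

def jet_log_likelihood_ratio(tracks):
    """Naive Bayes: multiply per-track likelihoods, i.e. sum their logs,
    assuming the tracks in the jet are uncorrelated."""
    llr = 0.0
    for s_d0, s_z0 in tracks:
        p_b, p_light = track_likelihoods(s_d0, s_z0)
        llr += np.log(p_b / p_light)
    return llr

# Toy jet with three tracks, each described by (S_d0, S_z0).
print(jet_log_likelihood_ratio([(12.0, 3.0), (0.5, 0.2), (7.0, 1.5)]))
```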
However, important information is lost with the assumption that the tracks in the jet are uncorrelated. This can be seen in Figure 4, where a strong correlation between the $S_{d_0}$ distributions of the leading and subleading tracks in the jet (ordered by $S_{d_0}$) can be seen near the diagonal for b-jets, but not for light jets.

More recently, algorithms exploiting RNN architectures based on LSTMs have been proposed to perform the task described above, treating the list of tracks within the jet as the input sequence to the algorithm. Due to the variable number of tracks within a jet, RNNs are better suited than dense
architectures in this approach. Even though no natural track ordering is obvious for the problem, tracks with larger impact parameters are more likely to come from heavy flavor jets; therefore, impact parameter based ordering is a good ansatz. (Empirically, the studies mentioned below have shown that certain track orderings work better than others.) While the most basic version of this algorithm acts on tracks only, proposals have been made to combine track and secondary vertex information in a single LSTM-based architecture [15].

Figure 3: Simulated decay chain of a B meson in the ATLAS experiment. This event display was obtained from a simulated dataset of top quark pair production, at a center-of-mass energy of 13 TeV. It shows the displacement of the vertices produced by the B (secondary vertex) and subsequent D (tertiary vertex) decays, with respect to the center of the coordinate system (primary vertex).

The ATLAS implementation of the LSTM-based architecture for heavy flavor jet identification with tracks' impact parameters is called RNNIP [16]. It treats the tracks within a jet as a sequence, and uses the impact parameter significance information and track kinematics as features. It also uses categorical information based on the track reconstruction quality in an embedding layer. These categories separate high quality, well measured tracks, for which a better impact parameter resolution
is expected, based on detector-level information such as the number of hits in the innermost tracker layer.

Figure 4: 2D histogram of $S_{d_0}$ for the leading (horizontal axis) and subleading (vertical axis) tracks inside a b-jet (left) and a light flavor jet (right). The observed correlation is an indication that the naïve Bayes approach of the IP3D algorithm is not enough to exploit the full information contained in the tracks' impact parameters with respect to b-jet identification.

The tracks are ordered by $S_{d_0}$, although other orderings (such as by track $p_T$) have shown similar performance. The algorithm is trained on a simulated sample of top quark pairs, which provides a dataset enriched in both heavy and light quarks, and outputs the probabilities of a given jet being a bottom jet ($b$-jet), a charm jet ($c$-jet), or a light flavor jet. The three probabilities are then combined into a likelihood that is used for discrimination.
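A minimal Keras sketch of an RNNIP-like model, assuming (hypothetically) 4 continuous features per track, 8 track-quality categories embedded into a 2-dimensional vector, and zero-padded sequences of up to 15 tracks; the layer sizes are arbitrary choices, and this is not the actual ATLAS configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_TRACKS, N_FEAT = 15, 4   # assumed padding length and per-track features
N_CAT = 8                    # assumed quality categories; index 0 = padding

feats = layers.Input(shape=(MAX_TRACKS, N_FEAT), name="track_features")
cats = layers.Input(shape=(MAX_TRACKS,), dtype="int32", name="track_category")

# Mask zero-padded steps in the continuous features, and learn a small
# embedding for the discrete track-quality category (0 reserved for padding).
feats_masked = layers.Masking(mask_value=0.0)(feats)
cat_emb = layers.Embedding(input_dim=N_CAT, output_dim=2, mask_zero=True)(cats)

# Concatenate per-track features and embeddings, then summarize with an LSTM.
tracks = layers.Concatenate()([feats_masked, cat_emb])
summary = layers.LSTM(50)(tracks)

# Three output probabilities: b-jet, c-jet, light flavor jet.
out = layers.Dense(3, activation="softmax", name="flavor_probs")(summary)

model = tf.keras.Model(inputs=[feats, cats], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```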
A comparison between the performance of the naïve Bayes algorithm described above (IP3D) and RNNIP is shown in Figure 5. The efficiency of identifying $b$-jets is plotted on the horizontal axis, while one over the probability of identifying light flavor jets as $b$-jets (the misidentification probability) is plotted on the vertical axis. The RNNIP algorithm displays a better light flavor jet rejection for every value of $b$-jet efficiency, and is comparable to a boosted decision tree that combines IP3D with the secondary vertex-based ATLAS algorithms (MV2c10 [14]), even though it does not explicitly reconstruct secondary vertices. Performances were measured for jets clustered with the anti-$k_t$ algorithm [8] with R = 0.4, with transverse momentum above 20 GeV, in a simulated dataset of top quark pairs at a center-of-mass energy of 13 TeV.

Figure 5: Performance of heavy flavor identification algorithms in the ATLAS experiment [16]. The horizontal axis shows the efficiency of correctly identifying $b$-jets, while the vertical axis shows one over the efficiency of incorrectly identifying light flavor jets as $b$-jets. The red dashed curve shows the performance of the RNN-based algorithm (RNNIP), while the dashed blue curve shows the performance of a naïve Bayes algorithm acting on similar inputs (IP3D).

Figure 6 shows the Pearson correlation coefficient $\rho$ between the RNNIP likelihood and $S_{d_0}$ and $S_{z_0}$ for each track in the sequence. It is interesting to note that stronger correlations in b-jets are seen for the impact parameter significances of the first ∼ tracks, which may be related to the expected charged particle multiplicity of b-hadron decays. This shows that the network architecture is able to learn contextual information from the given sequence ordering.

Figure 6: Pearson correlation coefficient between the RNNIP likelihood, and the transverse and longitudinal impact parameter significances. Correlations are shown separately for b-jets (orange), c-jets (red), and light flavor jets (blue).

Heavy flavor jet identification in the CMS experiment shares many similarities with the strategies employed by ATLAS. In particular, the final discriminant is also a combination of information pertaining to secondary vertexing and to the set of tracks inside the jet. Two sets of algorithms have been developed with this intent: one combining engineered features extracted from the jet, and one that feeds reconstructed object information directly into a neural network architecture which includes LSTM layers. The CMS DeepCSV algorithm [17] exemplifies the first strategy, similarly to the ATLAS MV2c10 boosted decision tree. It is based on a densely-connected neural network with 4 hidden layers, with inputs that are defined by other algorithms which act directly on tracks' impact parameter information and reconstructed secondary vertices.
Significant improvement has been observed by CMS in moving to the DeepFlavour algorithm [18], which acts directly on these low-level observables. The DeepFlavour network receives three sequences as inputs: a sequence of charged particles (reconstructed from tracks and calorimeter clusters), a sequence of neutral particles (calorimeter clusters with no associated tracks), and a sequence of reconstructed secondary vertices. Each sequence is processed by a 1D convolutional layer, which learns a shared representation specific to each type of sequence. The convolutional layer outputs are then fed to three different LSTM layers, which summarize the sequence information into three fixed-length feature vectors. These features are combined with additional jet-level information in a densely connected network, which outputs the jet flavor probabilities.
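The sketch below mirrors this three-branch layout in Keras: each input sequence passes through its own Conv1D and LSTM, the three summaries are concatenated with jet-level variables, and a dense block outputs flavor probabilities. All dimensions and layer sizes are invented for illustration and do not reproduce the actual DeepFlavour configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def sequence_branch(name, steps, feats):
    """One branch: a shared per-step representation via Conv1D (kernel size 1
    acts like a per-element dense layer), then an LSTM that reduces the
    sequence to a fixed-length summary vector."""
    seq = layers.Input(shape=(steps, feats), name=name)
    x = layers.Conv1D(32, kernel_size=1, activation="relu")(seq)
    return seq, layers.LSTM(50)(x)

# Assumed (illustrative) padded lengths and feature counts per sequence.
charged_in, charged_out = sequence_branch("charged", 25, 16)
neutral_in, neutral_out = sequence_branch("neutral", 25, 6)
sv_in, sv_out = sequence_branch("vertices", 4, 12)
jet_in = layers.Input(shape=(6,), name="jet_level")

merged = layers.Concatenate()([charged_out, neutral_out, sv_out, jet_in])
x = layers.Dense(100, activation="relu")(merged)
out = layers.Dense(3, activation="softmax", name="flavor_probs")(x)

model = tf.keras.Model([charged_in, neutral_in, sv_in, jet_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```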
Figure 7 summarizes the CMS $b$-jet identification performance. The red and blue lines represent the performance of the DeepCSV and DeepFlavour algorithms, respectively, with the CMS detector conditions present during the 2017 LHC data taking period (Phase 1), while the green line shows the DeepCSV performance with the CMS detector conditions in 2016 (Phase 0). Between 2016 and 2017, the CMS inner tracking detector was upgraded to deal with the harsher radiation conditions of the later LHC Run 2 years. This upgrade also provided the CMS experiment with a better impact parameter resolution, directly improving its heavy flavor identification performance. The $b$-jet efficiency is shown as a function of the light flavor jet misidentification probability (full lines), and as a function of the $c$-jet misidentification probability (dashed lines). While a large improvement is seen in the performance of the DeepCSV algorithm with the CMS Phase 1 inner tracker upgrade, an improvement just as large is seen with the usage of DeepFlavour in terms of light flavor jet discrimination, with an even larger gain in terms of $c$-jet rejection. Performances were measured for jets clustered with the anti-$k_t$ algorithm, with transverse momentum above 30 GeV, in a simulated dataset of top quark pairs at a center-of-mass energy of 13 TeV.

Figure 7: Performance of heavy flavor identification algorithms in the CMS experiment [18]. The horizontal axis shows the efficiency of correctly identifying $b$-jets, while the vertical axis shows the efficiency of identifying light flavor jets (full lines) or $c$-jets (dashed lines) as $b$-jets. The red and blue curves represent the DeepCSV and DeepFlavour algorithm performances with the 2017 CMS detector conditions, while the green curve represents the DeepCSV algorithm with the 2016 CMS detector conditions.

Jets from strange quarks are grouped within the light flavor jet category by the algorithms described above. However, for certain physics applications, such as the direct measurement of the $|V_{ts}|$ element of the CKM matrix through the search for the rare decay $t \to W^+ s$ (or $\bar{t} \to W^- \bar{s}$) [19], discriminating strange jets from first-generation jets is a necessity.

An LSTM-based algorithm has recently been proposed to tackle this problem [20]. The study is performed with a simplified detector description (Delphes [21]) modeling a CMS-like detector. The algorithm is trained to discriminate between strange jets and jets from the hadronization of up and down quarks, produced in proton-proton QCD interactions.
Similarly to the ATLAS RNNIP, the proposed algorithm acts on sequences of tracks, with features based on their impact parameters and kinematics with respect to the jet.

Secondary vertices are expected to be present in strange jets through the decays of strange hadrons: kaons ($K_S$) into $\pi\pi$ and $\Lambda$ baryons into $p\pi$. Therefore, all possible secondary vertices in the jet are reconstructed by pairing tracks with small distances of closest approach to each other. These vertices are then used to define a track ordering based on a parameter R assigned to each track. This parameter is defined either by the transverse distance between the primary vertex and the secondary vertex to which the track belongs, or, in the case where the track is not associated with a secondary vertex, by the innermost tracker hit belonging to that track. If multiple tracks receive the same R (e.g., two tracks from the same secondary vertex), their ordering is performed by $p_T$. This ordering ensures that adjacent tracks belong to the same secondary vertex, which the network will use to learn about displaced decays.
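A small sketch of such an ordering, under the assumption that each track carries its assigned R and its $p_T$ (the field names are hypothetical): tracks are sorted by R (ascending here, as one possible choice), with descending $p_T$ breaking ties, so that tracks sharing a secondary vertex end up adjacent in the sequence.

```python
# Hypothetical track records: R is the transverse displacement assigned to
# the track (its secondary vertex, or else its innermost hit); pt in GeV.
tracks = [
    {"R": 3.1, "pt": 4.2},   # from a displaced vertex
    {"R": 0.0, "pt": 12.0},  # prompt track
    {"R": 3.1, "pt": 1.7},   # same vertex as the first track
    {"R": 0.0, "pt": 5.5},
]

# Sort by R; ties (tracks sharing a vertex) are broken by descending pt,
# so tracks from the same secondary vertex are adjacent in the sequence.
ordered = sorted(tracks, key=lambda trk: (trk["R"], -trk["pt"]))
for trk in ordered:
    print(trk)
```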
Figure 8 compares the performance of the LSTM-based strange jet discriminator to that of simpler methods. In particular, the performance of using the transverse momentum fractions $x_K$ and $x_\Lambda$ of identified kaons and lambda baryons is also investigated. To calculate these quantities, a selection on the invariant mass of the reconstructed secondary vertices, requiring consistency with the $K_S$ and $\Lambda$ masses, is applied. The highest $p_T$ $K_S$ and $\Lambda$ candidates among the remaining vertices are chosen and used to calculate $x_K$ and $x_\Lambda$ with respect to their parent jet. Two different setups of the LSTM are compared: one including all selected jets, and one including only jets that contain at least one track with a large transverse impact parameter, $|d_0| >$ mm.

The overall performance of this algorithm is unfortunately limited by the similarity between strange and first-generation quark jets, achieving a background efficiency of ( ) for a signal efficiency of ( ). The secondary vertices from kaon decays are not as displaced as vertices from $b$-jets or $c$-jets, and can easily be mistaken for vertices produced by material interactions or by decays of particles originating in the hadronization process. One important improvement with respect to using $x_K$ and $x_\Lambda$, however, is the ability to reach higher signal efficiencies, as shown in Figure 8.
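As a toy illustration of the $x_K$ observable, assuming reconstructed secondary-vertex candidates carrying an invariant mass and a transverse momentum (the record fields and the mass window are assumptions), the snippet selects candidates compatible with the $K_S$ mass and takes the momentum fraction of the hardest one relative to the jet.

```python
K_S_MASS = 0.4976  # GeV

# Hypothetical secondary-vertex candidates: invariant mass and pT in GeV.
candidates = [
    {"mass": 0.49, "pt": 8.1},
    {"mass": 0.51, "pt": 3.3},
    {"mass": 1.12, "pt": 6.0},  # Lambda-like, fails the K_S window
]
jet_pt = 40.0

# Keep candidates consistent with the K_S mass (toy +-20 MeV window),
# then compute x_K from the highest-pT surviving candidate.
ks_like = [c for c in candidates if abs(c["mass"] - K_S_MASS) < 0.02]
x_K = max(c["pt"] for c in ks_like) / jet_pt if ks_like else 0.0
print(x_K)  # 0.2025
```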
Figure 8: Performance of strange jet identification algorithms based on LSTM architectures and on strange hadron reconstruction [20]. The vertical axis shows the efficiency of correctly identifying strange jets, while the horizontal axis shows the efficiency of incorrectly identifying up and down quark jets as strange jets. The blue and orange lines show the performance of the LSTM-based algorithm using all jets and using only jets with at least one track with transverse impact parameter $|d_0| >$ mm, respectively. The red dashed and full black lines show the performance of selecting on the transverse momentum fractions $x_K$ and $x_\Lambda$, and only on $x_K$, respectively.

Similarly to heavy flavor jet identification, the identification of tau leptons is particularly interesting due to its ties to Higgs physics. The coupling between the Higgs boson and the tau lepton is the largest Higgs coupling to leptons in the Standard Model. It is therefore an opportunity to directly measure the structure of the Higgs Yukawa couplings to that sector of the Standard Model.

Taus decay either leptonically ($\tau \to \nu_\tau + \ell\nu_\ell$, in which $\ell$ is an electron or a muon) or hadronically ($\tau \to \nu_\tau +$ hadrons). Leptonic tau decays are roughly indistinguishable from isolated leptons in hadron collider experiments. Therefore, tau identification focuses on hadronic taus, which represent a branching fraction of approximately 65% [11]. Hadronic tau decays usually include one or three charged pions and one or more neutral pions. Therefore, these decays are seen in the detector as narrow jets with one or more tracks.

Since neutral pions do not leave signals in the detectors' inner trackers, their trajectories cannot be reconstructed as tracks. This means that if only track information were used, a large portion of the information for tau identification would be missing. Therefore, an optimal strategy should aim to combine the tracking and calorimetry information.

The ATLAS experiment's state-of-the-art tau identification algorithm is based on a double LSTM architecture that combines track sequences and calorimeter deposit (cluster) sequences [22]. The algorithm has three sets of inputs: a track sequence, a cluster sequence, and a set of high-level variables connected to a dense layer. The track and cluster features refer to their kinematics and detector-level properties, while the high-level ones are related to the jet itself, or are engineered features based on the collection of jet constituents.

Tracks are individually fed through dense layers with shared weights, so that an embedded representation can be learned.
The same procedure is applied to the calorimeter clusters separately. The two processed sequences, one of track embeddings and one of cluster embeddings, are ordered by decreasing $p_T$ of the original objects and used as inputs to two separate LSTM blocks. These blocks include two layers of LSTM units: the first one maps the sequence in the learned representation into another sequence of the same length; the second LSTM layer only outputs the information at the last time step, thus providing a summary of the input sequence. The LSTM block outputs are fed to a densely-connected block, which also receives information from a densely-connected block encoding high-level observables of the tau jet.
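A schematic Keras version of this two-branch design, assuming (for illustration only) 10 tracks with 6 features, 6 clusters with 8 features, and 12 high-level variables: per-object shared dense layers (TimeDistributed) build the embeddings, two stacked LSTMs summarize each sequence, and the summaries are merged with the high-level block. The sizes are not those of the ATLAS tagger.

```python
import tensorflow as tf
from tensorflow.keras import layers

def embed_and_summarize(seq, units):
    """Shared per-object dense embedding, then two stacked LSTMs: the first
    returns a sequence, the second returns a fixed-length summary vector."""
    x = layers.TimeDistributed(layers.Dense(units, activation="relu"))(seq)
    x = layers.LSTM(units, return_sequences=True)(x)
    return layers.LSTM(units)(x)

tracks = layers.Input(shape=(10, 6), name="tracks")      # assumed shape
clusters = layers.Input(shape=(6, 8), name="clusters")   # assumed shape
high_level = layers.Input(shape=(12,), name="high_level")

track_summary = embed_and_summarize(tracks, 24)
cluster_summary = embed_and_summarize(clusters, 24)
hl = layers.Dense(32, activation="relu")(high_level)

merged = layers.Concatenate()([track_summary, cluster_summary, hl])
x = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid", name="tau_score")(x)

model = tf.keras.Model([tracks, clusters, high_level], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```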
The training and evaluation of the architecture defined above are performed separately for the cases in which the tau decay includes one or three tracks. The tau decays (signal) are provided by a simulation of $\gamma^* \to \tau\tau$ events, while background jets are selected from a simulation of dijet events. Figure 9 compares the performance of the LSTM architecture (RNN) optimized for tau decays with one (1-prong) and three (3-prong) tracks. It also compares to the previous algorithm used in the ATLAS experiment, based on a boosted decision tree (dashed lines). The LSTM-based architecture outperforms the previous baseline for all hadronic tau efficiencies. These improvements have been shown to be significant enough that the new architecture has been used for identifying tau candidates at the ATLAS High-Level Trigger (HLT) in 2018.

Figure 9: Performance of hadronic tau identification algorithms in the ATLAS experiment [22]. The horizontal axis shows the efficiency of correctly identifying hadronic tau jets, while the vertical axis shows the inverse of the efficiency of incorrectly identifying quark-initiated jets as hadronic tau jets. The red curves show the performance exclusively on tau jets with a single track (1-prong), while the blue curves represent tau jets with three tracks (3-prong). The RNN-based model's performance is shown with full lines, with the dashed lines representing an algorithm with similar inputs but based on a boosted decision tree.

When produced at large momenta, the top quark's decay products will start to merge, making it more difficult to resolve them spatially in the detector. In this regime, the entire top quark decay can be clustered into a single jet. Usually these jets have larger radius parameters: CMS [23] utilizes R = 0.8 anti-$k_t$ jets and R = 1.5 Cambridge-Aachen jets [24], while ATLAS [25] focuses on R = 1.0 anti-$k_t$ jets. Identifying these boosted top objects, against a background of jets from the hadronization of lighter quarks and gluons, is particularly interesting when searching for Beyond the Standard Model physics which predicts TeV-range resonances decaying to top pairs.

Several interesting features which are present in a boosted top jet can be used to discriminate against a high momentum jet produced by QCD interactions. In particular, hadronic top decays will tend to be three-pronged, with each prong corresponding to a final state particle in the $t \to bW \to bq\bar{q}'$ decay chain. Two important details of this chain are that one of these prongs will be consistent with a heavy flavor jet, and the other two will be consistent with a W boson decay.
LSTM-based architectures have been proposed for identifying these boosted top jets [26]. The study is performed with a Delphes-based detector simulation [21] with a particle-flow type of particle reconstruction, overlaying minimum bias events to emulate the LHC 2016 collision conditions, with an average of 23 proton-proton interactions per event. Jets are clustered following the ATLAS strategy, with the anti-$k_t$ algorithm and R = 1.0. The signal top jets are obtained by simulating a Beyond the Standard Model process in which a $Z'$ boson with masses ranging from 1400 to 6360 GeV decays to hadronically decaying $t\bar{t}$ pairs. Background jets come from the simulation of pure QCD hard scattering processes (QCD jets).

Similarly to tau identification, the sequence for the recurrent model is built from calorimeter clusters. The input ordering is defined by going through the clustering history of the jet, starting from the final reconstructed jet and adding constituents to the sequence as they appear in each step. The decision on which path to follow, each time two parent nodes merge, depends on the anti-$k_t$ distance metric $d_{ij}$. If the two parent nodes are both present in the list of jet constituents, they are added to the sequence ordered by $p_T$. A scheme presenting how the clustering tree is used to build the sequence ordering is shown in Figure 10. Another strategy is based on reconstructing R = 0.2 anti-$k_t$ subjets, ordering them by descending $p_T$, and then adding constituents to the sequence according to which subjet they belong to; constituents in the same subjet are also ordered by descending $p_T$. These schemes are compared with ordering purely by the jet constituents' $p_T$.
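A sketch of this substructure ordering as a depth-first traversal of the clustering tree, on a toy binary-tree data structure: leaves (original constituents) are appended as they are reached, two leaf siblings are appended by descending $p_T$, and otherwise the traversal descends first into the parent with the smaller $d_{ij}$ (one possible concretization of the scheme). The node representation is a hypothetical stand-in for a real clustering history.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """Toy node in a jet clustering tree: leaves are original constituents;
    internal nodes store the d_ij of the merging that created them."""
    pt: float
    d_ij: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self):
        return self.left is None and self.right is None

def substructure_order(node: Node, out: List[Node]) -> List[Node]:
    """Depth-first traversal of the clustering tree producing the input
    sequence: leaves first among siblings, ties broken by descending pT,
    internal siblings visited in order of increasing d_ij."""
    if node.is_leaf:
        out.append(node)
        return out
    children = [node.left, node.right]
    if all(c.is_leaf for c in children):
        children.sort(key=lambda c: -c.pt)               # both leaves: by pT
    else:
        children.sort(key=lambda c: (not c.is_leaf, c.d_ij))
    for c in children:
        substructure_order(c, out)
    return out

# Toy tree: ((a, b), c), where (a, b) merged with a small d_ij.
a, b, c = Node(pt=80.0), Node(pt=15.0), Node(pt=30.0)
ab = Node(pt=95.0, d_ij=0.1, left=a, right=b)
jet = Node(pt=125.0, d_ij=0.7, left=ab, right=c)
print([n.pt for n in substructure_order(jet, [])])  # [30.0, 80.0, 15.0]
```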
As seen in Figure 11 (left), the LSTM-based architecture with substructure ordering outperforms the previous baseline studied by the same group [27], based on a fully connected dense neural network. The results are presented as a function of the ordering scheme, as well as of whether the jets are trimmed or not. Trimming [28] is a technique that ensures the robustness of the jet kinematics, especially the jet mass, against pile-up contamination by removing the constituents of R = 0.2 subjets with a small energy fraction; in this case, subjets are removed if their energy fraction is lower than a set threshold.
Figure 10: Scheme of the jet clustering history used to define the input sequence ordering in the LSTM-based top identification algorithm [26]. The anti-$k_t$ distance $d_{ij}$ is used to choose which path to follow in the tree.

Figure 11 (right) shows that the impact of the ordering strategy is limited. In fact, it appears to be smaller than the impact of applying trimming, which removes subjet information that could be important for top identification. For more information on top jet identification with jet images, see section XYZ.

Figure 11: Performance of top identification with LSTM-based techniques [26]. The horizontal axis shows the efficiency of correctly identifying top jets, while the vertical axis shows one over the efficiency of incorrectly identifying QCD jets as top jets. The left plot compares the LSTM algorithm with a strategy using a dense neural network. The right plot compares different training strategies for the LSTM, including different sequence orderings (substructure, subjet, or purely constituent $p_T$ based) and with or without trimming.
In general, a jet can be understood as a collection of final state particles that result from the hadronization and fragmentation processes of a quark or a gluon. However, the actual procedure of reverse-engineering these processes, which are quantum mechanical in nature, can be complicated, especially when dealing with the busy environment of a hadron collider event. Jet clustering algorithms aim to reduce this complexity and make the connection between observable final state particles and the phenomenological predictions based on partons.

While many types of jet clustering algorithms have been proposed in the past, most recent applications utilize strategies based on sequential clustering, such as the ones already discussed here ($k_t$, anti-$k_t$ and Cambridge-Aachen being three well-known examples). These algorithms take advantage of the fact that the particle shower development can be well approximated, due to factorization theorems, by a sequence of $1 \to 2$ splittings, creating a hierarchical representation of the jet. When reverse-engineered, this sequence of splittings becomes a sequence of mergings of jet constituents, forming a binary tree. This binary tree represents the jet clustering history, and encodes important information about the nature of that jet.

As presented in the previous section, the usage of RNN architectures for learning jet labels has been a successful avenue in collider experiments, greatly improving on previous, often already Machine Learning-based, strategies. Treating the jet constituents simply as an input sequence, however, does not fully exploit the information contained in the jet clustering history as represented by the clustering tree.

It has been suggested that understanding the interplay between the clustering history and neural network architectures could lead to a more natural jet representation [29]. The study presents a framework in which a neural network can be built based on the clustering tree obtained from the jet clustering algorithm's history. It tries to exploit the entire physical knowledge contained within the jet, particularly with respect to hierarchical correlations between individual constituents. This structure is then used to learn an overall probabilistic model of the jet conditioned on its constituents, through an unsupervised training task.

Building a probabilistic model for a class of jets presents the possibility of having a tractable and differentiable distribution function which can inform different aspects of jet physics. In particular, models trained for different classes of jets, such as b-jets and light flavor jets, can be used to compute likelihood ratios for discrimination.
Models can also be sampled in order to generate new jets, a task for which standard simulation can take a significant amount of time when large datasets are required.

The framework, named JUNIPR ("Jets from UNsupervised Interpretable PRobabilistic models"), is based on the factorization of the jet probabilistic model into a product of probabilities given by each step of the clustering tree. For a set of jet constituents denoted by their 4-momenta p_1, ..., p_n, it computes the probability density P_jet({p_1, ..., p_n}) of this set to have arisen from the specified model. This probability factorizes based on the 1 → 2 clustering tree, so that

$P_{\mathrm{jet}}(\{p_1, \ldots, p_n\}) = \prod_{t=1}^{n} P_t$,

where P_t represents the probability model of the branching step t. This factorization is schematically shown in Figure 12.

The study further models P_t as the product of three independent probabilities: P_end, a probability over binary values predicting whether the tree stops after this split; P_mother, a probability over integers for how likely it is that a given particle in step t is the mother participating in the 1 → 2 splitting; and P_branch, a probability over the kinematic configurations of the three states involved in the branching step t. Each one of these three probabilities is modeled with a densely-connected neural network.
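As a concrete illustration, the factorized log-likelihood can be sketched as below. The callables p_end, p_mother, p_branch and update are hypothetical stand-ins for the trained dense networks and for the recurrent state update given in Eq. (2) below; this is a minimal sketch of the factorization, not the actual JUNIPR implementation.

```python
import numpy as np

def log_prob_jet(branchings, p_end, p_mother, p_branch, update, h0):
    """Accumulate log P_jet = sum_t log P_t, with each step factorized
    as P_t = P_end * P_mother * P_branch. `branchings` is the sequence
    of clustering-tree steps, each holding the mother index and the two
    daughter 4-momenta; p_end, p_mother, p_branch and update are
    hypothetical stand-ins for the trained networks."""
    h, logp = h0, 0.0
    for t, step in enumerate(branchings):
        last = (t == len(branchings) - 1)
        logp += np.log(p_end(h, last))                 # does the tree stop here?
        logp += np.log(p_mother(h, step["mother"]))    # which particle splits
        logp += np.log(p_branch(h, step["daughters"])) # splitting kinematics
        h = update(h, step["daughters"])               # recurrent state, Eq. (2)
    return logp
```

Two such models trained on different jet classes can then be compared per jet, e.g. via log P_Z(jet) − log P_q(jet), which is the likelihood-ratio discrimination discussed below.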
Figure 12: JUNIPR calculation of the jet-level probability model, based on its factorization into the nodes of the binary tree of the jet clustering history [29].

These networks use as inputs the hidden state h_t at branching step t, calculated based on the recurrent relation below:

$h_t = \tanh\left(V \cdot \left(k_t^{d_1}, k_t^{d_2}\right) + W \cdot h_{t-1} + b\right)$, (2)

where k_t^{d_1}, k_t^{d_2} represent the 4-momenta of the two daughters involved in the splitting step t, and V, W and b represent learnable weights and biases. This equation has the same form as the one presented in section 1.1 detailing the functioning of RNNs. The study also tried to replace this simple RNN strategy with more complex solutions, such as LSTMs and GRUs. No significant improvement in performance was observed; it was argued that this was due to the simplicity of the task, acting on sequences with only two elements.

The algorithm is trained based on the full jet probabilistic model, maximizing the log-likelihood over the examples in the training dataset with stochastic gradient descent. Datasets are generated from e+e− collisions, without detector simulation, at a center-of-mass energy of 1 TeV. The two jets in the final state are separated into hemispheres by the exclusive k_T algorithm [30], and their clustering trees are obtained with the Cambridge-Aachen algorithm.

Results of applying the learned jet probability to jet classification, through likelihood ratio computation, are shown in Figure 13. The blue line represents the discrimination power between jets from quarks and jets from a boosted hadronic Z decay, obtained by calculating the likelihood ratio between the jet probabilities under each hypothesis, P_Z(jet)/P_q(jet). P_Z(jet) and P_q(jet) represent the JUNIPR probabilities trained individually on Z jets and quark jets, respectively, and evaluated on a given jet. The JUNIPR likelihood ratio greatly outperforms the strategy based on engineered features representing the jet substructure.

The framework has also been used for direct binary classification tasks [31]. In this case, two JUNIPR networks are built based on two different types of jets (quark jets and gluon jets, for example). The two networks are trained with a cross-entropy objective function, where the individual jet probabilities for each jet type are defined by the JUNIPR networks. The quark versus gluon discrimination achieved by this method was seen to outperform standard approaches, such as CNNs on jet images. It also significantly outperforms the strategy above of individually training JUNIPR models and calculating their likelihood ratio.

The application of RNN architectures in collider experiments and phenomenological studies has been an active area of recent developments, successfully avoiding issues previously seen with CNN architectures, while often improving on their performance. Some features of these architectures are, however, less desirable for certain problems. Certain physics problems utilizing RNNs do not have a well-defined, natural ordering of the sequence elements.
Choosing a specific order becomes a non-trivial step of the data formatting, one that could lead to undesirable performance losses.
Figure 13: Performance of JUNIPR used for binary classification, based on the calculation of likelihood ratios [29]. The horizontal axis shows the efficiency of identifying Z jets, while the vertical axis shows one minus the efficiency of incorrectly identifying quark jets as Z jets. The image also shows the discrimination performance obtained by simply using a substructure variable (τ_2, the 2-subjettiness) and constituent multiplicity information.

On the other hand, other physics problems might benefit from topological structures that encode more complex relations than a sequence.

Recursive Neural Networks (RecNNs) have been proposed in the literature as possible generalizations of RNNs, in which the sequential computational graph is replaced by a tree structure [32]. This means that a node's hidden state still depends on the previous step in the computational graph, similarly to the RNN, but the step itself is defined by a binary tree instead of a sequence. This feature opens up the possibility of adding extra domain knowledge into the network architecture itself when building the tree. A scheme of a recursive structure based on a binary tree is shown in Figure 14.

When building a RecNN, entries in a sequence are combined via a learned function, with a predetermined binary tree structure. Therefore, a hidden state combining entries x_i and x_j is represented by h = f(x_i, x_j, θ), where θ represents the learnable parameters. This function f(·, ·, θ) is then used in all binary combinations, which allows the architecture to act on variable-length sequences (this is another instance of weight sharing in neural networks, a necessity when dealing with variable-length inputs). Another advantage of RecNNs over RNNs is their lower sequential complexity, since computations are no longer performed one sequence entry at a time.

Figure 14: Scheme of a recursive binary tree structure algorithm acting on a sequence of inputs x_1, x_2, x_3, x_4, x_5. The h_i nodes correspond to hidden states performing the combination of two other nodes, while the o node represents the output.

Studies utilizing RecNNs with trees reproducing the jet clustering history as a basis for jet representation have been performed in the context of jet classification [33, 34, 35] and will be detailed below. They show that RecNNs are able to learn a fixed-length jet embedding from a variable-length tree structure built from the jet constituents. This embedding can be further used for different tasks, such as classification and regression.
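To illustrate the recursive combination h = f(x_i, x_j, θ) described above, the sketch below embeds a binary clustering tree into a fixed-length vector with a single shared function. The tanh combination layer, the embedding dimension, and the dictionary-based tree format are illustrative assumptions; a trained model would learn W and b by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding dimension (an arbitrary choice for this sketch)

# Shared parameters theta = (W, b): the same combination function is
# applied at every internal node, regardless of the size of the tree.
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))
b = np.zeros(DIM)

def embed(node):
    """Recursively embed a binary clustering tree into a fixed-length
    vector. For brevity, leaf features already have the embedding
    dimension; a real model would first project the raw constituent
    kinematics into that space."""
    if node["children"] is None:
        return node["features"]          # leaf: raw constituent features
    h_left = embed(node["children"][0])
    h_right = embed(node["children"][1])
    # h = f(x_i, x_j, theta): here a tanh layer on the concatenation
    return np.tanh(W @ np.concatenate([h_left, h_right]) + b)

# Toy tree: two constituents merged into one root node.
leaf = lambda: {"children": None, "features": rng.normal(size=DIM)}
jet_embedding = embed({"children": (leaf(), leaf()), "features": None})
```

Because the tree, rather than the code, fixes the order of combinations, swapping in a different clustering history changes the computation without changing the learned parameters, which is exactly the dependence studied below.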
Applications to Jet Physics

Initial studies used the RecNN architecture for jet discrimination [33], using simulated events processed through a simplified detector simulation (Delphes). The signal jets are comprised of hadronically decaying W bosons reconstructed as a single R = 1.0 anti-k_T jet (boosted jet), while the background is taken from purely QCD hard scattering proton-proton collisions. The study performs a comparison between RecNN architectures based on different binary trees, corresponding to different jet clustering algorithms' histories. The result of this comparison is shown in Figure 15. It is interesting to note that, even though the jets have been initially clustered with the anti-k_T algorithm, other clustering histories, such as k_T, perform better. This is consistent with previous observations that k_T outperforms anti-k_T in terms of identifying substructures in jets. In general, this variation in performance is evidence of the strong dependence of RecNN architectures on the choice of binary tree topology.

The performance is also studied when adding gating to the RecNN nodes. Similarly to LSTMs and GRUs, gated structures are used to regulate how much information is passed through the binary tree. The study also extends the jet embedding to event-level classification, feeding the embeddings of the jets in an event as a p_T-ordered sequence to a GRU-based architecture which performs the classification task. It was observed that the usage of RecNN-based embeddings improves significantly the event classification performance with respect to using a GRU-only architecture acting on sequences of jets.

Figure 15: Performance of RecNN-based algorithm for identifying boosted W jets with respect to jets produced via purely QCD interactions [33]. The horizontal axis shows the efficiency of correctly identifying W jets, while the vertical axis shows one over the efficiency of incorrectly identifying QCD jets as W jets. The different line colors correspond to different jet clustering algorithms used to build the binary tree topology in the RecNN.

A similar strategy has also been employed for quark versus gluon jet discrimination [34], showing a slight improvement over a baseline boosted decision tree-based algorithm. Quark versus gluon discrimination is an important avenue of work at the LHC, due to the enormous gluon background from soft hadronic interactions, which has a strong impact on analyses with light quarks in the final state. One important application of this class of algorithms is identifying the hard scatter final state light jets involved in the production of the Higgs boson via vector boson fusion.

As in the previous study, simulated events are processed with Delphes and jets are clustered with the anti-k_T algorithm, with R = 0.7 for high-p_T jets and R = 0.4 for jets with lower p_T. Purely QCD events with two jets from the hard scatter (dijet events) are produced separately for when the hard scatter partons are gluons (background), or up, down or strange quarks (signal). The discrimination achieved with the RecNN under different jet reconstruction strategies is shown in Figure 16, together with a baseline approach based on a BDT with engineered features (jet shape and kinematics). Three types of jet reconstruction are compared: using calorimeter towers only ("nopflow"); using particle identification for neutral hadrons, photons, and positively and negatively charged particles, encoded in one-hot vectors for each jet constituent in the tree ("one-hot"); and using a p_T-weighted jet charge defined by the clustering tree ("ptwcharge"). Although little difference is observed with respect to the investigated jet reconstruction schemes, a significant improvement is obtained over the BDT baseline.

Recent studies have also compared RecNNs to RNNs and CNNs when tasked to estimate jet charges [35]. In the same spirit of identifying jet flavor, identifying jet charges can help with mitigating the enormous multijet background present in hadronic LHC searches. Requiring that two jets forming a neutral resonance have opposite charges could potentially reach that goal, assuming a good charge reconstruction resolution is achieved.

For these studies, jets from up quarks were used as proxies for positively charged jets, while jets from down quarks were used for negatively charged jets. Jets were simulated from QCD hard scattering processes in proton-proton interactions, and clustered from final state particles with the anti-k_T algorithm with R = 0.4.
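For reference, a commonly used definition of the p_T-weighted jet charge, which underlies both the "ptwcharge" inputs above and the baseline compared against in [35], can be sketched as follows. The value of the weighting exponent κ and the normalization by the scalar sum of track p_T are assumptions of this sketch; variants of the definition exist.

```python
def pt_weighted_jet_charge(tracks, kappa=0.5):
    """Commonly used pT-weighted jet charge:
        Q_kappa = sum_i q_i * pT_i**kappa / (sum_i pT_i)**kappa
    `tracks` is a list of (charge, pT) pairs; kappa is a tunable
    weighting exponent (0.5 is a typical choice, an assumption here)."""
    pt_sum = sum(pt for _, pt in tracks)
    return sum(q * pt ** kappa for q, pt in tracks) / pt_sum ** kappa

# Example: a jet with three charged tracks (charge, pT in GeV).
q_jet = pt_weighted_jet_charge([(+1, 40.0), (-1, 25.0), (+1, 10.0)], kappa=0.5)
```

Weighting by p_T suppresses the contribution of soft tracks, making the observable more robust than a plain sum of track charges.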
To test the performance of CNNs on jet images, the jets were formatted into a δφ × δη = 33 × 33 pixel box, with one channel's pixel intensities corresponding to the transverse momenta going into that specific pixel, and the other to the sum of track charges weighted by their momenta over the tracks corresponding to that pixel.

Results are shown comparing different ML-based models acting on jet constituents: standard CNNs, residual CNNs, RNNs and RecNNs. The RNN architecture implemented is based on GRUs, with LSTMs performing similarly; the jet constituents are ordered in the input sequence by their p_T, with an ordering based on the distance to the jet axis performing equally well. These different algorithms' performances are presented in Figure 17 in terms of Significance Improvement Curves (SIC), defined by ε_s/√ε_b, where ε_s is the efficiency of correctly identifying down quark initiated jets, and ε_b is the efficiency of incorrectly identifying up quark jets as down jets. They are compared to more standard strategies based on engineered features. Overall, the algorithms acting on jet constituents consistently outperform the baseline approaches.

As shown by some of the studies described above, the ordering choice can be detrimental to the performance of RNNs. One recent strategy proposed to overcome this ordering dependency is attention mechanisms [36]. The main working principle of attention mechanisms can be understood with a simple NLP example. Within translation problems, two languages often display different semantic structures for the same sentence: "a yellow cat" in English becomes "um gato amarelo" in Portuguese, with the words for "cat" and "yellow" switching positions. This relationship is difficult to learn through standard RNNs, in which the input and output sequences have a defined order. To solve this issue, network architectures with attention directly learn correlations between the entries in the input sequence and the entries in the output sequence, which might not be encoded in the sequences' ordering.

Different strategies and architectures involving ideas related to neural network attention have been proposed to mitigate the ordering issue. In particular, Transformer networks [37] employ a self-attention technique, learning pairwise correlations between the sequence inputs themselves through attention weights. Transformers have become common tools in NLP, with pretrained models such as BERT (Bidirectional Encoder Representations from Transformers) [38] being adopted by Google in its search engine for better understanding of search queries.

Another method which aims to exploit correlations on a variable-length input structure is the Deep Sets architecture [39]. Deep Sets are particularly suited for situations in which the ordering is not well defined, as they treat the variable-length input sequence as a permutation-invariant set. A similar Deep Sets architecture was used [40] to learn representations for events in collider experiments (Particle Flow Networks). These Particle Flow Networks act on lists of particles, such as jet constituents, learn a per-particle representation through dense layers with shared weights, and sum these representations into a single, object-wide representation.

A similar architecture to the Particle Flow Networks was used in the ATLAS experiment as an alternative to the RNN-based model in the context of heavy flavor identification [41].
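A minimal sketch of the permutation-invariant structure shared by Deep Sets, Particle Flow Networks and the DIPS algorithm discussed next is shown below. The layer sizes, the ReLU activation and the plain NumPy implementation are illustrative assumptions; real applications use trained deep networks for both the per-particle and the aggregate stages.

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(w, b, x):
    # One ReLU dense layer; the weights are shared across all particles.
    return np.maximum(0.0, w @ x + b)

# Per-particle network Phi (shared weights) and jet-level network F;
# the layer sizes here are arbitrary choices for this sketch.
W1, b1 = rng.normal(scale=0.1, size=(16, 4)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(8, 16)), np.zeros(8)

def deep_sets_output(particles):
    """Permutation-invariant representation: apply the same Phi to
    every particle (a 4-component feature vector here), then sum.
    Any ordering of the input list gives the same result by
    construction, and the sum accommodates any number of particles."""
    latent = sum(dense(W1, b1, p) for p in particles)  # sum pooling
    return dense(W2, b2, latent)                       # jet-level network F

constituents = [rng.normal(size=4) for _ in range(7)]  # variable length
out = deep_sets_output(constituents)
```

Because each particle is processed independently before the sum, the per-particle stage can be evaluated in parallel, which is the source of the speed advantage noted below.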
The Deep Sets-based ATLAS heavy flavor discriminant (DIPS) outperforms its RNN analogue (RNNIP, described in section 2.1) for high b-jet efficiencies, while using the same inputs, as seen in Figure 18. DIPS has also been found to significantly reduce the training and evaluation time with respect to RNNIP, due to its parallelizability. The ability to parallelize computations on each sequence element is an important feature of this model, particularly for applications in which the network evaluation time is limited.

Sequence-based Machine Learning algorithms have a long and rich history in the context of natural language processing. While the idea of representing a jet as a sequence of its constituents is not new in particle physics, as evidenced by jet clustering algorithms, the usage of ML concepts exploiting this representation is relatively recent when compared to computer vision algorithms. Even so, the application of RNNs to jet physics in particular has been a fruitful avenue of research in the past few years. Special attention was given to the predictive power of these models, with the successful application of LSTM-based neural network architectures to different types of jet classification tasks. More generally, recurrent structures were shown to be well-suited to describe jet clustering histories, leading to a full probabilistic model of a jet given its constituents.

Recent work has also focused on expanding the basic RNN ideas of hierarchical context learning to more physics-inspired architectures, such as recursive neural networks. However, while these models achieve high precision when the correct choice of input structure is used - either the binary tree in a RecNN or the ordered sequence itself for RNNs - their performance can be significantly degraded given a wrong structure choice. This is particularly difficult to deal with when the input has variable length but the data structure is not obvious. For example, how one chooses to order the set of tracks inside a jet will depend on the task to be performed: ordering in impact parameter significance can be suited for b-jet identification, but not for quark versus gluon discrimination. With that in mind, algorithms in which the data structure itself is either learned, such as transformers or graphs, or invariant under certain problem transformations, such as permutation-invariant sets, have shown great potential for future studies.
Acknowledgements
RTdL would like to thank Michael Kagan for the help in reviewing the document; the book editors, Paolo Calafiura, David Rousseau and Kazuhiro Terao; and the other reviewers involved. This work was supported by the US Department of Energy (DOE) under grant DE-AC02-76SF00515, and by the SLAC Panofsky Fellowship.
Figure 16: Performance of RecNN-based algorithm for discriminating quark and gluon jets [34]. The horizontal axis shows the efficiency of correctly identifying quark initiated jets, while the vertical axis shows one minus the efficiency of incorrectly identifying gluon jets as quark jets. The red lines correspond to the RecNN models trained with different jet reconstruction techniques. The blue line shows the performance of a baseline BDT based on engineered features related to jet shape and kinematics.
Figure 17: Performance of ML-based algorithms acting on jet constituents to discriminate up and down jets based on their electric charges [35]. The horizontal axis shows the efficiency of correctly identifying down quark initiated jets (ε_s), while the vertical axis shows the ratio ε_s/√ε_b, where ε_b is the efficiency of incorrectly identifying up quark jets as down jets.
Figure 18: Performance of the RNNIP [16] and DIPS [41] heavy flavor identification algorithms in the ATLAS experiment. The horizontal axis shows the efficiency of correctly identifying b-jets, while the vertical axis shows the inverse of the efficiency of incorrectly identifying light flavor jets as b-jets. The violet band shows the DIPS performance, while the green band represents RNNIP. The width and central value of the curves represent the standard deviation and mean of the light flavor jet rejection for a given b-jet efficiency over five different network trainings. Performances were measured for jets with a transverse momentum above 20 GeV, in a simulated dataset of top quark pairs, at a center-of-mass energy of 13 TeV.

References
[1] B. Hammer, "On the approximation capability of recurrent neural networks," Neurocomputing, vol. 31, no. 1, pp. 107-123, 2000.
[2] M. Schuster and K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, pp. 2673-2681, 1997.
[3] R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the exploding gradient problem," CoRR, vol. abs/1211.5063, 2012.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735-1780, 1997.
[5] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," CoRR, vol. abs/1406.1078, 2014.
[6] C. Olah, "Understanding LSTM Networks," 2015. https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[7] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, "Jet-images — deep learning edition," JHEP, vol. 07, p. 069, 2016.
[8] M. Cacciari, G. P. Salam, and G. Soyez, "The anti-k_t jet clustering algorithm," JHEP, vol. 04, p. 063, 2008.
[9] LHC Higgs Cross Section Working Group, S. Heinemeyer, C. Mariotti, G. Passarino, and R. Tanaka (Eds.), "Handbook of LHC Higgs Cross Sections: 3. Higgs Properties," CERN-2013-004, CERN, Geneva, 2013.
[10] D. Buttazzo, G. Degrassi, P. P. Giardino, G. F. Giudice, F. Sala, A. Salvio, and A. Strumia, "Investigating the near-criticality of the Higgs boson," Journal of High Energy Physics, vol. 2013, Dec 2013.
[11] P. Zyla et al., "Review of Particle Physics," PTEP, vol. 2020, no. 8, p. 083C01, 2020.
[12] "Standard Model Summary Plots Spring 2019," Tech. Rep. ATL-PHYS-PUB-2019-010, CERN, Geneva, Mar 2019.
[13] "Topological b-hadron decay reconstruction and identification of b-jets with the JetFitter package in the ATLAS experiment at the LHC," Tech. Rep. ATL-PHYS-PUB-2018-025, CERN, Geneva, Oct 2018.
[14] G. Aad et al., "ATLAS b-jet identification performance and efficiency measurement with tt̄ events in pp collisions at √s = 13 TeV," The European Physical Journal C, vol. 79, Nov 2019.
[15] D. Guest, J. Collado, P. Baldi, S.-C. Hsu, G. Urban, and D. Whiteson, "Jet Flavor Classification in High-Energy Physics with Deep Neural Networks," Phys. Rev., vol. D94, no. 11, p. 112002, 2016.
[16] ATLAS Collaboration, "Identification of Jets Containing b-Hadrons with Recurrent Neural Networks at the ATLAS Experiment," Tech. Rep. ATL-PHYS-PUB-2017-003, CERN, Geneva, Mar 2017.
[17] A. Sirunyan et al., "Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV," Journal of Instrumentation, vol. 13, p. P05011, May 2018.
[18] "Performance of b tagging algorithms in proton-proton collisions at 13 TeV with Phase 1 CMS detector," Jun 2018.
[19] A. Ali, F. Barreiro, and T. Lagouri, "Prospects of measuring the CKM matrix element |V_ts| at the LHC," Physics Letters B, vol. 693, pp. 44-51, Sep 2010.
[20] J. Erdmann, "A tagger for strange jets based on tracking information using long short-term memory," JINST, vol. 15, no. 01, p. P01021, 2020.
[21] J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, "Delphes 3: a modular framework for fast simulation of a generic collider experiment," Journal of High Energy Physics, vol. 2014, Feb 2014.
[22] ATLAS Collaboration, "Identification of hadronic tau lepton decays using neural networks in the ATLAS experiment," Tech. Rep. ATL-PHYS-PUB-2019-033, CERN, Geneva, Aug 2019.
[23] A. Sirunyan et al., "Identification of heavy, energetic, hadronically decaying particles using machine-learning techniques," Journal of Instrumentation, vol. 15, p. P06005, Jun 2020.
[24] M. Wobisch and T. Wengler, "Hadronization corrections to jet cross-sections in deep inelastic scattering," in Workshop on Monte Carlo Generators for HERA Physics (Plenary Starting Meeting), pp. 270-279, 1998.
[25] M. Aaboud et al., "Performance of top-quark and W-boson tagging with ATLAS in Run 2 of the LHC," Eur. Phys. J. C, vol. 79, no. 5, p. 375, 2019.
[26] S. Egan, W. Fedorko, A. Lister, J. Pearkes, and C. Gay, "Long Short-Term Memory (LSTM) networks with jet constituents for boosted top tagging at the LHC," 2017.
[27] J. Pearkes, W. Fedorko, A. Lister, and C. Gay, "Jet constituents for deep neural network based top quark tagging," 2017.
[28] D. Krohn, J. Thaler, and L.-T. Wang, "Jet trimming," Journal of High Energy Physics, vol. 2010, Feb 2010.
[29] A. Andreassen, I. Feige, C. Frye, and M. D. Schwartz, "JUNIPR: a Framework for Unsupervised Machine Learning in Particle Physics," Eur. Phys. J., vol. C79, no. 2, p. 102, 2019.
[30] S. Catani, Y. Dokshitzer, M. Seymour, and B. Webber, "Longitudinally-invariant k_t-clustering algorithms for hadron-hadron collisions," Nuclear Physics B, vol. 406, no. 1, pp. 187-224, 1993.
[31] A. Andreassen, I. Feige, C. Frye, and M. D. Schwartz, "Binary JUNIPR: an interpretable probabilistic model for discrimination," Phys. Rev. Lett., vol. 123, no. 18, p. 182001, 2019.
[32] L. Bottou, "From machine learning to machine reasoning," CoRR, vol. abs/1102.1808, 2011.
[33] G. Louppe, K. Cho, C. Becot, and K. Cranmer, "QCD-Aware Recursive Neural Networks for Jet Physics," JHEP, vol. 01, p. 057, 2019.
[34] T. Cheng, "Recursive Neural Networks in Quark/Gluon Tagging," Comput. Softw. Big Sci., vol. 2, no. 1, p. 3, 2018.
[35] K. Fraser and M. D. Schwartz, "Jet Charge and Machine Learning," JHEP, vol. 10, p. 093, 2018.
[36] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations, ICLR 2015 (Y. Bengio and Y. LeCun, eds.), 2015.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," CoRR, vol. abs/1706.03762, 2017.
[38] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers) (J. Burstein, C. Doran, and T. Solorio, eds.), pp. 4171-4186, Association for Computational Linguistics, 2019.
[39] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola, "Deep sets," CoRR, vol. abs/1703.06114, 2017.
[40] P. T. Komiske, E. M. Metodiev, and J. Thaler, "Energy flow networks: deep sets for particle jets,"