Prepared for submission to JHEP
Explainable AI for ML jet taggers using expert variables and layerwise relevance propagation
Garvita Agarwal,a Lauren Hay,a Ia Iashvili,a Benjamin Mannix,a,b Christine McLean,a Margaret Morris,a Salvatore Rappoccio,a Ulrich Schubert a,c

a Physics Department, and Institute for Computational and Data Sciences, University at Buffalo, State University of New York, Buffalo, New York, USA
b Department of Physics, University of Oregon, Eugene, OR, USA
c Google, Pittsburgh, PA, USA
Abstract:
A framework is presented to extract and understand decision-making information from a deep neural network classifier of jet substructure tagging techniques. Two methods are studied. The first uses expert variables that augment the inputs ("expert-augmented" variables, or XAUGs). These XAUGs are concatenated to the classifier steps immediately before the final decision. The second is layerwise relevance propagation (LRP). The results show that XAUG variables can be used to interpret classifier behavior, increase discrimination ability when combined with low-level features, and in some cases capture the behavior of the classifier completely. The LRP technique can be used to find the relevant information the network is using, and, when combined with the XAUG variables, can be used to rank features, allowing one to find a reduced set of features that captures part of the network performance. These XAUGs can also be added to low-level networks as a guide to improve performance.

1 Introduction

Machine learning (ML) [1] has become an extremely prevalent tool in the classification of hadronically decaying, highly Lorentz-boosted objects ("boosted jets") and in the study of the internal structure of hadronic jets ("jet substructure") [2–7]. In these areas, classification tasks are common, for which artificial neural networks (ANNs) are well suited. Recent work on "deep" neural networks (DNNs) has shown tremendous improvements in the identification of boosted jets over selections based on expert variables (see Ref. [6] and references therein). These algorithms typically make use of classifiers based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

However, these improvements come at a significant cost. The underlying understanding of particular decisions is lost (although it could be argued that this is not a drawback). This paper provides a method to elucidate classifier decisions in an explainable framework. This can assist in the understanding of systematic uncertainties, as well as in developing better expert variables that capture the behavior of the classifier in a simpler way. We will also demonstrate that combining expert features with low-level information can improve network performance, similar (but not identical) to the approach in Ref. [8].

Collectively, explaining classifiers in artificial intelligence (AI) is referred to as "eXplainable AI" (XAI); see Refs. [9, 10] for reviews. There are many options to explain individual classifier decisions, mostly based on local approximations of the classifier function. Examples include LIME [11], SLISE [12], layerwise relevance propagation (LRP) [13–15], and explanation vectors [16]. We utilize the LRP method in this paper due to its simplicity of use and interpretation, although others could be used as well.

Particle physics has a unique property, in that many of the observables we are interested in have a full theoretical framework that can predict, or at least describe, their behavior. In particular, the expert variables provided can often be shown to fully capture the available kinematic information, as is done in Refs. [17–20], or even to admit a theoretical description of the classifier itself [21]. However, there are other observables for which the theoretical completion is lower than others.
In particular, information that can determine the original progenitor of the jet (such as flavor information) is sometimes less theoretically developed. Many new identifiers that show extremely high levels of performance rely on this type of information, such as track impact parameters, secondary vertex information, lepton content, and particle multiplicities, for instance as shown in Ref. [22]. Some of these observables are not easily calculable, and many are not even infrared/collinear safe. As such, further tools are currently needed to extricate the various sources of discrimination aside from theoretical calculations.

Our approach is to augment low-level input information with high-level expert variables, referred to as "expert-augmented" (XAUG) variables, and to utilize a local approximation method such as LRP to understand the importance of the various pieces of information. The XAUG approach is similar to other contexts such as Refs. [8, 23, 24]; however, our strategies and goals are different. Instead of using expert variables alongside low-level features in the inputs, we augment the entire low-level network with expert features by concatenating them to the post-processed decisions as a whole (either after the convolutional or recurrent network decisions). This allows a simple network to investigate and learn high-level details in the inputs. We also seek to provide these XAUG variables for dimensionality reduction in the function space of the DNN classifier, similar in strategy to the analytic tools proposed in Ref. [17], but with an eye toward extrication of the experimentally available information that may not be easily captured by theory, such as flavor information. This can assist in explaining individual classifier decisions. In an ideal case, the DNN classifier can in fact be fully contained within the XAUG variables in a simple multilayer perceptron (MLP).

We develop a framework that can be applied to any classifier in an attempt to reduce the dimensionality of the problem by replacing or augmenting the lower-level/higher-dimensional features with higher-level/lower-dimensional XAUG variables. We will show that the behavior of the DNN can be captured robustly by the addition of appropriate XAUG variables, and in some cases can be captured entirely. This technique is not limited to information that can be theoretically described or predicted.

Firstly, we develop a trivial "toy" model with only a few features that are fully captured by XAUG variables. We show that the classifier decisions of such a DNN will be the same as those of a simpler classifier based only on the XAUG variables themselves. We developed a 2D CNN based on "images" similar (but not identical) to the approaches in Refs. [24–26], as well as a 1D CNN and a 1D RNN based on particle lists inspired by the algorithm inputs in Refs. [8, 22].

Secondly, we develop several classifiers to distinguish boosted Z bosons decaying to closely separated bb̄ pairs (Z → bb̄) from standard QCD jets based on simulations. Once again, we investigate several cases, including a 2D jet image-based CNN (using only jet kinematic and shape information), a 1D particle-list CNN, and a 1D particle-list RNN. The latter two contain information beyond the kinematics of the jet, such as particle content and decay impact parameters.

As mentioned above, it has been shown in Refs. [17–20] that the kinematic information of such classifiers is exhausted by angularity variables. One example "basis" is the set of N-subjettiness variables [27, 28].
As such, classifier decisions with only kinematic information should be almost entirely correlated with existing kinematic variables used as XAUGs. Other information is not captured by these kinematic observables, however, such as the flavor and the soft radiation in the jet, for example the "jet pull" [29]. An advantage of the XAUG + LRP approach is that all of these types of information can be used to gain an understanding of the relevant features. This work is similar but complementary to Ref. [30], which extracts expert information from the network for kinematic and shape variables, whereas we also investigate flavor information.

It is known that in many cases, complete decorrelation from kinematic variables (such as transverse momentum and mass) is desirable for many analyses (see an overview in Ref. [6]). However, in the interest of simplicity, decorrelation is not addressed in this paper, although the general features of using XAUG variables and LRP extend to this case, as will be demonstrated in future work.

In the following sections, we will introduce the layerwise relevance propagation technique in Sec. 2. We will then describe a toy model for demonstration in Sec. 3 to highlight how the technique works in a trivial but instructive case. Section 4 will describe the particle-level model. Explanations of both the toy model and the particle model will be discussed in Sec. 5. Finally, we will present conclusions in Sec. 6.

2 Layerwise relevance propagation

LRP is a linearized approximation to networks that can be thought of as a "Deep Taylor decomposition" [31]. To understand this method, take a neural network with a prediction $f(x)$, based on some inputs $x$; these inputs can be pixels in an image or input variables. LRP propagates the prediction backwards through the network, eventually assigning a relevance score $R_j^{(1)}$ to each piece of input $x_j^{(1)}$. The relevance score indicates how much each input pixel or variable contributes to the final prediction.

In an ideal case, the LRP backwards propagation method has an overall relevance conservation:

$$ f(x) = \cdots = \sum_{j \in l+1} R_j^{(l+1)} = \sum_{j \in l} R_j^{(l)} = \cdots = \sum_j R_j^{(1)} \qquad (2.1) $$

where the superscript indicates the layer at which the relevances are being calculated, and the subscript indicates the summation over the relevances within that layer. Relevance conservation means that at every layer of the network, the total relevance is conserved [15]. Therefore the backwards propagation process does not alter the prediction. Additionally, LRP attributes the entirety of the network's decision to the inputs.

While there are many possible implementations of the LRP propagation rules [15], we focus on only a few of them in this paper. First, we consider LRP-$\epsilon$:

$$ \sum_j R_j^{(l)} = \sum_k \frac{x_j w_{jk}}{\sum_j x_j w_{jk} + \epsilon}\, R_k^{(l+1)} \qquad (2.2) $$

Here, and in the other LRP rules, $x_j$ is the activation of the neurons at layer $l$, $R_k^{(l+1)}$ is the relevance score assigned to the layer $l+1$ neurons, and $w_{jk}$ is the weight connecting neurons $j$ and $k$ [14]. The $\epsilon$ term is included to prevent any division by zero. In LRP-$\epsilon$, two criteria determine how relevance is propagated from layer $l+1$ to each layer-$l$ neuron. The first criterion is $x_j$, the neuron activation. Rather intuitively, more relevance goes to more activated neurons. Additionally, stronger connections (with larger $w_{jk}$) receive more relevance.
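As a concrete illustration of Eq. (2.2), the following minimal sketch applies the LRP-$\epsilon$ rule to a single fully connected layer. The array names, shapes, and the sign-based stabilizer convention are illustrative assumptions, not the exact implementation used for the results in this paper:

```python
import numpy as np

def lrp_epsilon_dense(x, W, R_next, eps=1e-6):
    """Backpropagate relevance through one dense layer with the LRP-epsilon rule.

    x      : (n_in,)        activations x_j of layer l
    W      : (n_in, n_out)  weights w_jk connecting layer l to layer l+1
    R_next : (n_out,)       relevances R_k^(l+1) of layer l+1
    """
    z = x @ W                        # pre-activations: z_k = sum_j x_j w_jk
    z = z + eps * np.sign(z)         # epsilon stabilizer prevents division by zero
    s = R_next / z                   # relevance per unit of pre-activation
    R = x * (W @ s)                  # neuron j collects x_j * w_jk * s_k summed over k
    return R

# toy usage with random numbers
rng = np.random.default_rng(0)
x = rng.random(4)
W = rng.normal(size=(4, 3))
R_next = rng.random(3)
R = lrp_epsilon_dense(x, W, R_next)
print(R.sum(), R_next.sum())         # sums agree up to the epsilon term
```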
The simple case where $\epsilon$ is set to zero is called LRP-0. In particular, for LRP-$\epsilon$, especially at large values, $\epsilon$ can absorb some of the relevance [13]. Therefore, we also consider
LRP-$\alpha\beta$:

$$ \sum_j R_j^{(l)} = \sum_k \left( \alpha \cdot \frac{(x_j w_{jk})^{+}}{\sum_j (x_j w_{jk})^{+}} - \beta \cdot \frac{(x_j w_{jk})^{-}}{\sum_j (x_j w_{jk})^{-}} \right) R_k^{(l+1)} \qquad (2.3) $$

where the $+$ and $-$ superscripts indicate the positive and negative contributions to the relevance, respectively [14]. Therefore, choosing different values of $\alpha$ and $\beta$ allows for control over the importance of positive and negative contributions to the network's decision [13]. Relevance conservation is enforced by requiring $\alpha - \beta = 1$.

In practice, ML models typically include a bias in each layer to improve accuracy. However, this violates the relevance conservation. In such cases, the summation of all relevance scores does not equal the total score, although this is not a significant disadvantage.

In this paper, we implement LRP using the iNNvestigate Python package [32]. To compute the LRP score for CNN models, we use the "Preset A" mode of iNNvestigate, which uses LRP-$\epsilon$ for dense layers and LRP-$\alpha\beta$ for the convolutional layers. To compute the LRP score for RNN models, we use LRP-$\epsilon$ throughout.
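For reference, relevance maps of this kind can be obtained along the following lines. This is a minimal, hedged sketch rather than the exact configuration used for our results; the tiny stand-in model, the analyzer options, and the package pairing (standalone Keras with iNNvestigate 1.x) are assumptions for illustration:

```python
import numpy as np
import innvestigate
import innvestigate.utils as iutils
from keras.models import Sequential
from keras.layers import Dense

# a tiny stand-in classifier; the real models are the CNNs/RNNs described in the text
model = Sequential([
    Dense(16, activation="relu", input_shape=(4,)),
    Dense(2, activation="softmax"),
])

model_no_softmax = iutils.model_wo_softmax(model)   # analyze pre-activation outputs

# "Preset A"-style analyzer: LRP-epsilon for dense layers, LRP-alpha/beta for convolutions
analyzer = innvestigate.create_analyzer("lrp.sequential_preset_a", model_no_softmax)
# pure LRP-epsilon, as used for the recurrent models
analyzer_eps = innvestigate.create_analyzer("lrp.epsilon", model_no_softmax, epsilon=1e-7)

x = np.random.rand(8, 4).astype("float32")          # batch of 8 toy inputs
relevance = analyzer.analyze(x)                     # same shape as x: one score per input
```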
3 Toy model

We first create a toy model to explore the feature exhaustion of networks using XAUG variables. The toy model is designed to vaguely mimic the salient features of "jet images", to be processed by a 2D convolutional network, and a list of "particle level" information, to be processed by an RNN or 1D CNN. We assume there are two populations of inputs for each network, and design the networks to discriminate between the two populations.

For the toy images, the generated features are based on the substructure features of a jet with two "subjets" of radii r_1 and r_2, an angular distance θ apart, with the momentum fraction of the leading "subjet" denoted as z, as shown in Figure 1. These parameters are cartoon-level models for a jet with two subjets, where θ corresponds to ∆R for a boosted jet with small angular separation as described in Ref. [33]. The "jet momentum" is normalized to one, and since we are limiting the model to events with two subjets, the subjet momenta are defined as z and 1 − z. Using these parameters, particles are randomly generated in each subjet. These are then pixelized into jet images. Each jet image is 16 × 16 pixels, representing calorimeter cells in the detector, with 20 particles in each image or event.

Figure 1: Diagram showing the toy "event" level parameters (left) and the toy "particle" level parameters (right).

There are two image populations, which we will refer to as
Signal and Background. We generate 1M images, half Signal and half Background. The Signal is randomly sampled from normal distributions for z and θ, while the Background is sampled from exponential distributions that are mostly separated from the Signal in their combined phase space. The radii of the subjets for both cases are sampled from a uniform distribution within a range that keeps the particles within the 16 × 16 image. These radii bound the sampling of the particle coordinates with respect to the subjet axis, which are represented by the variables dα and dβ. The angular distance from the subjet axis, dα, is sampled from the specified distributions (normal or exponential for Signal or Background, respectively), limited by the radius of the subjet. The azimuthal angle with respect to the subjet axis, dβ, is sampled from a uniform distribution for both populations. The momentum fraction of the subjets, z, is distributed among the 10 particles of each jet according to the distributions, partitioned to sum to z. These are used as the p_T of the constituents, and are used as the pixel intensities in the jet image. We obtain the coordinates of the pixels within the image by transforming dα and dβ to an abstracted η-φ plane where the leading-p_T subjet axis is at (0, 0) and the subleading-p_T subjet axis is at (0, −1). Finally, the image is flipped if the sum of pixel intensities on the left half of the image is greater than that on the right half. The input distributions are shown in Figure 2, and examples of summed input images are shown in Figure 3.

For use in networks that take 1-dimensional inputs, the data is structured as a list of the simulated "particle-level" information, i.e., the individual constituent z fractions and dα and dβ values, as seen in the right diagram of Figure 1. Within each event, particles are sorted by z, which is the leading jet's z value distributed among the leading jet's particles. We refer to this as a "particle list", and, just like the image data, we generate 1M events, half Signal and half Background.

Figure 2: The normalized z and θ distributions for Signal and Background events in the Toy Model, along with the z, dα, dβ, and r values (with respect to the first subjet axis) for those events.

We investigate three network structures: a 2D CNN for jet images, a 1D CNN for particle lists, and a 1D RNN for particle lists. The datasets were shuffled and split into testing, training, and validation subsets. In all cases, when XAUGs are used in conjunction with low-level inputs, they are concatenated to the post-processed outputs of the CNN or RNN. This augmented list is then combined in a flattened layer for the final decision.
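Before describing the individual architectures, the following minimal sketch illustrates how two such toy populations could be sampled. The distribution parameters below are placeholders and are not the values used to produce the figures:

```python
import numpy as np

rng = np.random.default_rng(42)
n_events = 500_000                     # half of the 1M events per class

# Signal: z and theta drawn from normal distributions
sig_z     = np.clip(rng.normal(loc=0.7, scale=0.1, size=n_events), 0.5, 1.0)
sig_theta = np.clip(rng.normal(loc=0.6, scale=0.1, size=n_events), 0.0, 1.0)

# Background: z and theta drawn from exponential distributions,
# mostly separated from the Signal in the (z, theta) plane
bkg_z     = np.clip(0.5 + rng.exponential(scale=0.05, size=n_events), 0.5, 1.0)
bkg_theta = np.clip(rng.exponential(scale=0.15, size=n_events), 0.0, 1.0)

X = np.concatenate([np.stack([sig_z, sig_theta], axis=1),
                    np.stack([bkg_z, bkg_theta], axis=1)])
y = np.concatenate([np.ones(n_events), np.zeros(n_events)])   # 1 = Signal, 0 = Background
```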
For the 2D CNN, we build a network consisting of 3 convolutional layers, followed by a dropout layer that randomly drops nodes and a max-pooling layer that calculates the maximum value of the feature map, then 2 dense rectified linear unit (ReLU) layers, with a sigmoid dense layer making the final classification between the Signal and Background populations. This network is based on networks made to analyze 2D jet images, like those in Refs. [25, 26]. The network that operates on the toy images has 154,178 trainable parameters. When we add in the "event" level XAUGs, we add a layer to flatten the 2D convolved output so that we can concatenate the 1D expert variables onto the end. Additionally, an extra dense layer is included to find further relations in the concatenated datasets. The final network has 43,073 trainable parameters, the details of which can be seen in Fig. 4.

Figure 3: Toy model summed input images for Signal (a) and Background (b) jets.

Figure 4: Diagram of 2D CNN for classification of toy model images. When used, XAUGs are concatenated to the processed outputs immediately before the final two ReLU layers.
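A minimal Keras sketch of this concatenation pattern is shown below; the filter counts, kernel sizes, dense widths, and dropout rate are illustrative assumptions and are not tuned to reproduce the parameter counts quoted above:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

image_in = layers.Input(shape=(16, 16, 1), name="jet_image")
xaug_in  = layers.Input(shape=(4,), name="xaugs")        # e.g. r1, r2, z, theta

x = layers.Conv2D(32, (3, 3), activation="relu")(image_in)
x = layers.Conv2D(32, (3, 3), activation="relu")(x)
x = layers.Conv2D(32, (3, 3), activation="relu")(x)
x = layers.Dropout(0.2)(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)

# XAUGs are concatenated to the post-processed image features,
# immediately before the final dense decision layers
x = layers.Concatenate()([x, xaug_in])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = Model(inputs=[image_in, xaug_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```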
For the 1D CNN, we build a network consisting of a set of layers that go over each set of information within the particle list separately (p_frac, dα, dβ) before the outputs of these layers are concatenated together and passed through a dense layer before a final decision is made. Each set of inputs goes through two 1-dimensional convolutional layers, a max pooling layer, and two additional convolutional layers before being flattened and concatenated together. In the case where XAUGs are added, they are concatenated with the particle-level information after it has passed through the respective convolutional layers. This model has 270,082 trainable parameters, and the details can be seen in Fig. 5.

Figure 5: Diagram of 1D CNN for classification of toy model particle lists. When used, XAUGs are concatenated to the processed outputs immediately before the final two ReLU layers.

For the toy model RNN, we build a network consisting of recurrent layers that process the particle-level information before being flattened and passed through a dense layer before the final decision is made. The inputs are divided into subsets based on the chosen timestep of the recurrent layers. In our network there are 10 timesteps per event; this equates to processing information from 2 particles at a time, with a memory of the previous 2 particles being used to find patterns within the event. The particle-list information passes through 2 gated recurrent unit layers, then a batch normalization, and one more gated recurrent unit layer, before being flattened and passed to the dense layers for decision-making. In the case where XAUGs are added, they are concatenated with the particle-level information after it has passed through the respective recurrent layers. This model has 248,938 trainable parameters, and the details can be seen in Fig. 6.
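The recurrent variant follows the same pattern of concatenating XAUGs after the recurrent layers; a short sketch with illustrative layer widths and input shaping (10 timesteps of 2 particles with 3 features each) is:

```python
from tensorflow.keras import layers, Model

list_in = layers.Input(shape=(10, 6), name="particle_list")   # 10 timesteps x (2 particles * 3 features)
xaug_in = layers.Input(shape=(4,), name="xaugs")

h = layers.GRU(32, return_sequences=True)(list_in)
h = layers.GRU(32, return_sequences=True)(h)
h = layers.BatchNormalization()(h)
h = layers.GRU(32, return_sequences=True)(h)
h = layers.Flatten()(h)

h = layers.Concatenate()([h, xaug_in])          # XAUGs join after the recurrent layers
h = layers.Dense(64, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid")(h)

rnn_model = Model(inputs=[list_in, xaug_in], outputs=out)
rnn_model.compile(optimizer="adam", loss="binary_crossentropy")
```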
4 Particle-level model

The more realistic simulation is a set of PYTHIA [34] samples. To simulate the 2-prong structure of boosted jets, we produce SM ZZ production with both Z bosons decaying to bb̄. Both b quarks from the Z decay are required to be within ∆R < . of the Z. We use generic QCD multijet events as background. We only consider the leading jet in the simulations, so we refer to these two samples as Zbb and
QCD , respectively.It has been shown in Ref. [35] that independence of kinematic variables is a desirablefeature for boosted object tagging, and has been applied in several "decorrelated" versionsof DNNs, with examples in Refs. [22, 36, 37]. The approaches taken in other work has been Our software is outlined here The configuration for SM ZZ production is here The configuration for QCD multijet event simulation is here – 8 – igure 6 : Diagram of RNN for classification of toy model particle lists. When used, XAUGsare concatenated to the processed outputs immediately before the final dense layer.to either train an adversarial network to achieve the desired independence from kinematicvariables, or to decorrelate the behavior in a brute force approach. Our simulated sampleshave different p T spectra in reality, however we wish to eliminate such kinematic differencesto some extent in our training. As such, we artificially weight the PYTHIA generation withthe bias2Selection parameter in order to have similar jet p T spectra, which is shownin Fig. 7a. We therefore partially remove the p T and mass dependencies of the classifier.The goal of this paper is to explain decisions, so we take this simpler approach for ease ofdemonstration. It does not achieve complete decorrelation. The principles we develop arealso applicable to more thoroughly decorrelated taggers, but become more complicated dueto the increased dimensionality of having adversarial networks and are thus left to futurework for investigation.We then cluster the particles with fastjet k T jets [40]with a distance parameter of R = 0 . . We then use the RecursiveTools package from fastjet-contrib β = 0 . We use themMDT with z cut = 0 . and β = 0 to find exactly two subjets. We define the "groomed"or "soft dropped" jet mass as m jet,sd .We select the leading jet in the event with p T > GeV, | η | < . , and Zbb and QCD events. Shown are the jet p T (upper left), ungroomed jet mass (upper middle), groomed jet mass with the soft dropalgorithm (upper right), n-subjettiness ratio τ /τ (lower left), charged hadron multiplicity(lower middle), and d xy, max (lower right).of the jet, and the distances to each subjet are also stored. The distribution for d xy isshown in Fig. 7. To encapsulate the kinematic information of jets and their substructure, we will select the N -subjettiness variables τ βm as one set of XAUGs. We also input the jet four-vector, andthe groomed jet mass. The angular distance between the two subjets ∆ R subjets andmomentum fraction of the leading subjet z are also added.To add simple flavor identification, we add jet composition variables such as chargedhadron multiplicity ( N ch ), neutral hadron multiplicity( N neut ), photon multiplicity ( N γ ),muon multiplicity ( N µ ), and electron multiplicity ( N e ).A measure of the soft radiation in the jet is represented by the jet pull angle ( φ pull ) [29].Finally, heavy flavor identification information is encapsulated by an (over-)simplifiedmetric of the maximum of the spatial locations of the production vertex for each particlein the x − y ( d xy ) and z ( d z ) directions.Distributions for τ /τ and N ch are shown in Fig. 7. 
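To make the jet-composition XAUGs listed above concrete, the following rough sketch computes a few of them from per-jet constituent arrays. The field names and the particle-ID subset are assumptions for illustration, and XAUGs such as the N-subjettiness ratios and the jet pull angle are taken from standard implementations rather than re-derived here:

```python
import numpy as np

def composition_xaugs(pdg_id, pt, subjet_index, dxy, dz):
    """Compute simple composition and displacement XAUGs for one jet.

    All arguments are numpy arrays over the jet constituents:
    PDG particle IDs, transverse momenta, assigned subjet index (0 = leading),
    and production-vertex offsets in the transverse plane and along z.
    """
    charged_hadron_ids = {211, 321, 2212}        # pi+-, K+-, p (illustrative subset)
    abs_id = np.abs(pdg_id)

    n_ch    = np.sum(np.isin(abs_id, list(charged_hadron_ids)))
    n_gamma = np.sum(abs_id == 22)
    n_mu    = np.sum(abs_id == 13)
    n_e     = np.sum(abs_id == 11)

    # momentum fraction carried by the leading subjet
    z_frac = pt[subjet_index == 0].sum() / pt.sum()

    # (over-)simplified displaced-vertex metrics
    dxy_max = np.max(np.abs(dxy))
    dz_max  = np.max(np.abs(dz))

    return dict(N_ch=n_ch, N_gamma=n_gamma, N_mu=n_mu, N_e=n_e,
                z=z_frac, dxy_max=dxy_max, dz_max=dz_max)
```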
To ensure that all of the inputs to our networks have comparable numerical magnitude,we perform pre-processing on each input variable based on the overall distribution in the Most taggers have historically avoided adding the mass to the classifier inputs, however in our case wehave decided to add it since it can be deduced from the N -subjettiness variables in any case. – 10 –ignal and background. An equal number of signal and background events are chosen in eachcase, and then the distribution of each input variable for the ensemble are calculated. Themean and standard deviations are computed, and the distribution is truncated between ± standard deviations of the mean. We then define the minimum to be zero, and the maximumto be one. These are referred to as "normalized" variables. To clarify, the underflows andoverflows appear at 0 and 1, respectively. We consider three types of classifiers. The first is a 2D image-based CNN, with preprocessinginspired by Refs. [25, 26]. The second is a 1d CNN that inputs measurements from the first N particles in a transverse momentum ( p T )-ordered list, similar to the DeepAK8 algorithmin Ref. [22]. The third is a recurrent neural network (RNN) that inputs an arbitrarily largelist of the same information as the 1d CNN.We discuss each classifier in turn. The first classifier we considered for particle-level simulations is a CNN that is based on 2Djet images, with preprocessing inspired by Refs. [25, 26]. We then use a CNN with the samearchitecture shown in Fig. 4, but with different XAUGs and the full list of particle inputs.At this point, if only looking at image data, the network will pass the flattened images to2 dense layers before being output. If the network was also passed XAUG variables, thesewill be concatenated with the flattened images before going through the dense and outputlayers.The pre-processing is as follows. First, we find exactly two subjets (events with fewerthan two subjets are discarded). We then examine the leading jet in the event, and calculatethe subjet directions. The subjet with the higher p T is used as the center of the jet image (0 , , and the image is then rotated in local pseudo-rapidity/azimuth space ( η − φ ) suchthat the lower- p T subjet is pointing downward. The radial distances are scaled in units ofthe ∆ R between the two subjets, such that the lower- p T subjet is placed at (0 , − . The p T of all of the constituents are scaled by the total jet p T , and are then pixelated into agrid of size . The image is then parity flipped so that the largest sum of the pixelintensities is on the right-hand side of the image. Plots of the aggregated Signal ( Zbb ) andbackground (QCD) are shown in Fig. 8. The second classifier we considered for particle-level simulations is a CNN that is based onobservables from the first N particles in a p T -ordered list of inputs, similar to the DeepAK8 algorithm in Ref. [22]. One of the main advantages of DeepAK8 is the use of particle-levelinformation to have sensitivity to particle content such as hadron flavor, and quark-gluondiscrimination. For this reason, we have two separate particle list taggers, one with only thefour-momentum information of the constituents, and another that adds other observables.We are interested in general characteristics of a procedure in this paper, so we do not– 11 – igure 8 : Aggregated Signal ( Zbb ) and Background (QCD) Jet Images after pre-processing.attempt a realistic flavor definition for our simulated sample. 
Instead, we directly input theparticle ID of the stable particle, and the x, y, z position of its production vertex. Theseare overly simplistic assumptions for a realistic tagger, however it will demonstrate theprocedure of how flavor information could be investigated with our method.In particular, we have found that N = 20 provides good discrimination between 2-prong jets and QCD jets. We use a network architecture of 2 convolutional layers, onewith shape (64, 3) and one with shape (64, 1) followed by a Max Pooling with pool size2, followed by a dropout of of the network’s nodes. The 2 convolutional layers andMax Pooling layer are repeated with shapes (32, 3), (32, 1) and 2 respectively with another dropout following. The convolved values are then flattened and concatenated with theXAUG input features, which are then fed through a dense layer with a RELU activation.The network’s structure is the same as in Fig. 5 with different XAUGs.For inputs to the 1D CNN that are similar to those in DeepAK8, 17 variables werecreated. The variables are listed in Table 1. The final classifier is an RNN that uses similar inputs as the CNN from the particle list;however, we limit the number of particle and expert variables due to the increased training– 12 –ariable log ( p T ) log ( p T /p T jet ) log ( E ) | η | ∆ φ ( jet )∆ η ( jet )∆ R ( jet )∆ R ( subjet R ( subjet Charge q isMuonisElectronisPhotonisChargedHadronisNeutralHadron d xy d z Table 1 : Particle list input variables of 1D CNN, properties of the constituents of theleading jet.and processing time of the network. The network architecture we apply is a first inputlayer, followed by 3 recurrent layers, 2 additional input layers, and 2 dense layers at theend, shown in Fig. 6.For the first input, the network takes in the four-vector ( p T , η , and φ ) data of the first20 constituents of the leading jet of each event sorted by p T . This is given to the recurrentlayers. Then, the XAUG variables τ and charged-hadron multiplicity are added beforethe final dense layers. – 13 – Explanations We now explain the performance of the models described in the previous sections. In thefollowing section, we will discuss explanations of both the toy and particle models. The classification of the toy Signal and Background is trivial, even without the XAUGvariables. The Area Under Curve (AUC) of the ROC curve is close to unity. However,it is still instructive to investigate the explanations of these decisions to demonstrate thefeatures of LRP and XAUGs. The understanding gained can also be applied to the morerealistic particle simulation.Examples of the classification in the 2D CNN are shown in Fig. 9. There are fourexamples of toy Signal and Background events in green. Their LRP scores are also shownas heatmaps in red and blue. The blue LRP scores indicate pixels that contributed positivelyto the identification, whereas the red LRP scores indicate pixels that contributed negativelyto the identification. It is clear that the Signal classifications prefer information where thereis a disjoint subjet near (0,-1) , whereas Background classifications discourage that area.Indeed, the input Signal images in Fig. 3 have a well-separated second subjet near (0,-1) , while the Background images do not. As such, the network is clearly learning theappropriate information that was used to create the Toy Model, including the position (i.e.angle) and intensity (i.e. energy) of the "subjets". 
Figure 9 : Individual Toy Model Signal (left two columns) and Background (right twocolumns) images that were correctly predicted (top) and their corresponding LRP scoreheatmaps (bottom).We then trained a network with the XAUG variables r , r , z , and θ included amongthe network input features. (These variables are defined and plotted in Figs. 1 and 2,respectively.) This is an exhaustive list of the entire information content of the toy model.As such, we expect that the network will be able to (almost) entirely ignore the information– 14 –ontained in the jet images, because its decision is completely determined by the XAUGs.Any residual relevance is an artifact of the optimization, not a distinguishing feature.To investigate the optimization dependence, we trained the same 2D CNN with imagesand XAUGs four times, labeled Models 1-4. This allows us to have four separate opti-mizations and judge their consistency. The mean normalized LRP scores of the aggregatedevents are shown in Fig. 10. To produce these bar plots, for each event we first found thefeature (XAUG or image pixel) with the maximum absolute LRP score. Then we normal-ized the event by dividing all LRP scores by this maximum value, in order to eliminate thedependence of the relevance on the events with large deviations in total relevance. Then,for each event, we summed the absolute values of the normalized pixel LRP scores to get animage LRP score. After this, we averaged the absolute values of the XAUG and image LRPscores across all events to produced the mean normalized relevances shown. This procedureis used for all bar plots.It is clear from these bar plots that the two most important pieces of information arethe XAUG variables z and θ . There is very little information that the network learnsfrom the "subjet" radii (which is expected, since they are drawn from the same uniformdistributions). Furthermore, the network does not extract many features from the imageitself after the addition of the XAUG variables. The image relevance varies between 0.0to 0.1, but is always much lower than that of z and θ . This indicates an extremely flatoptimization in the network training in the space of the input image. The network is unableto discern the subleading feature with much accuracy.To study the impact of various features on the decision making, we plot the relativerank of each feature in Fig. 11. The annotation on each bubble is the percent of events forwhich the variable had a certain rank. The z variable is ordinarily the highest ranked inall of the four models we used, while the image is ordinarily last. The variables r and r do not carry any discriminatory powers, as can be seen from Fig. 10.It is interesting to note that there is some variation within the optimizations, especiallyin the subleading domain of the optimization space. Differences in relevance attributed tofeatures between trainings point toward differences in local optimization of the network.This leads one to conclude that there should be numerical uncertainties applied to thederivation of DNN-based identification algorithms.We plot the predicted score for these 4 models for θ versus z in Fig. 12. The orangedistribution is the "background" while the blue distribution is the "signal". The intensityof the color indicates the predicted score. The plot in Fig. 13 shows the same events, butthis time shading the histograms by the z LRP score.In Fig. 12, a clear gradient is shown across the decision boundary between the twopopulations. 
This is very consistent across different trainings of the network. Figure 13shows high LRP scores along certain subspaces that correspond to details of decision bound-aries, and while the various trainings are qualitatively similar in their "broad" features, theindividual details vary among them. This indicates that there are sub-spaces within theclassification that are not identical between different trainings.Another way to display this information is shown in Fig. 14. Here, the XAUG variables z and θ are shown as 1D histograms, with their LRP relevance shown in the panel below.– 15 –he Signal is shown in blue and the Background is shown in orange. The network is placinga high magnitude of relevance on the values that are far from the overlap region ( z > . ,and θ > . ). Figure 10 : LRP scores for all variables in the Toy 2D CNN Model, with 4 separate trainedmodels.We performed the same studies shown above on the 1D CNN model. However, forbrevity we show an average over 4 different trainings in our plots. The RMS of the 4trainings are shown as uncertainties. Similar to the 2D CNN model, we can see from Fig.15that the variables that were given the most relevance by LRP were the expert variablesthat had the greatest separation in combined phase space, z and θ . We can see a cleardecision boundary in their combined phase space in Fig. 16. The left image again shows aclear gradient in the relevance of the variables, indicating that the network is unsure howto categorize the event in the overlap region.The final model that we investigated was an RNN. This model showed much differentrelevance results from the other two, despite having comparable accuracy to the 1D and2D CNN’s. The RNN gives the greatest relevances to the constituent list variables, drivenby the relevances of z (the values of z broken up among the leading jet constituents) and z (the values of (1 − z ) broken up among the sub-leading jet constituents), as can be– 16 – igure 11 : Ranked LRP scores for all variables in the Toy 2D CNN Model shown here forfour separately trained models.seen in the right plot of Fig. 15. z and z in encode the same information as z , but ina different format. We still have the same leading variable, however the RNN gives theinformation more relevance in its constituent format than the event-level or XAUG formatthat the CNNs gave more relevance.For the 2D CNN (Fig. 13) and the 1D CNN (left image in Fig. 16), the combined z − θ phase space shows a clear decision boundary. The RNN, on the other hand, has aless apparent gradient (right image in Fig. 16), with most of the relevance in background.Again, this is because the network’s recurrent layers are extracting information in a differentway, looking at the list information much more than the 1D CNN.The plots in this section provide two important pieces of information. Firstly, in the case– 17 – igure 12 : Predicted score for Toy Model for four separately trained models. The mo-mentum fraction z is shown on the x − axis while the angle θ is shown on the y − axis. Theorange distribution is the "Background" and the blue distribution is the "Signal". Theintensity of the color indicates the predicted scores.where there are XAUG variables that dominantly capture the information in the system,the LRP accurately selects the most important variables, rendering the image redundant,as seen in Fig. 10. 
Secondly, the network can arrive at different optimizations, each withdifferent sub-leading relevance, which can impact the confidence of the prediction in regionsof confusion. However, these separately trained models have nearly identical performance(see Fig. 12), despite clearly arriving at different minima (see Fig. 13).These are not related to insufficient training data, as we have checked this with bothlow-and high-statistics samples and the features persist. This leads to a recommendationabout numerical uncertainty in the classifier. In the above studies, the LRP distributions of the various XAUGs show that the maindecision boundaries made by the classifier are relatively stable. However, there are detailsin the local approximation (LRP) boundaries that are caused by differences in relevanceamong sub-leading features.The LRP score is a reliable way to quantify the most relevant variables. However,– 18 – igure 13 : Relevance (LRP) scores for Toy Model 1D CNN for four separately trainedmodels. The momentum fraction z is shown along the horizontal axis while the angle θ is shown on the vertical axis. The orange distribution is the "background" while the bluedistribution is the "signal". The intensity of the color indicates the summed LRP score,with darker colors representing more confident predictions of the model.this also means that the optimization with respect to less relevant variables is not easilydetermined, with very different local behavior for different trainings. This points to theneed for a numerical uncertainty of classifiers that is not easily visible in the final classifieroutput, since the latter is primarily sensitive to the most relevant variables. Decisions basedon less relevant variables can vary depending on the training. Optimizers like ADAM havea difficult time optimizing along these axes because they are so shallow, and the optimizerloss functions are dominated by the most relevant variables.A consequence of this study is that it is very important to perform multiple differenttrainings for networks, especially if the decisions are made on sub-dominant features. Thetrustworthiness of individual classifier decisions is very high in the case of differences infeatures with high relevance, but is considerably lower for features with low relevance.In the remaining study, we will therefore train the network 4 times and take the averageresponse. The RMS of the predictions can then be used also for an uncertainty quantifi-cation, which we add for the relevance scores. These have similar interpretations to othersystematic uncertainties, inasmuch as differences that exist, but are covered within the– 19 – a) z profile (b) theta profile Figure 14 : Profiles for Toy 2D CNN Model with relevance and predictions averaged overthe four models previously shown.uncertainty, are not trustworthy differences.– 20 – igure 15 : Mean normalized relevance score for 4 trainings of the toy 1D CNN(toy 1DRNN) on the left(right), where the relevance scores of constituent variables List ( dα , dβ , dα , dα , z const, , z const, ) are summed. Figure 16 : Mean LRP scores for z and θ variables in the Toy 1D CNN Model (left), andRNN (right). – 21 – .3 Particle Model Explanations The Toy Model introduced previously is intended as a highly simplified "cartoon" of jetsubstructure. The angle θ and momentum fraction z of the "subjets" in the Toy Model areanalogous to the ∆ R and momentum fraction z of the actual subjets from the soft dropalgorithm. 
The drastic separation observed in the Toy Model will be reduced, although wecan use the insight and techniques developed to apply to the particle-level simulation.Distributions of the normalized variables are shown in the upper portions of somevariables in Figs. 21 and in the Appendix for all variables in 25-30. The jet images from afew individual events, along with their LRP heatmaps, are shown in Fig.17. The backgroundevents are more spread out than the signal events, which tend to be clustered among one ortwo subjets. This is exploited by the network, as can be observed in the bottom panes. Thenetwork identifies signal by weighting toward the central two subjets, while the networkidentifies background from pixels away from the two subjets. These images provide similarinformation to those in Ref. [26], which show the Pearson Correlation Coefficient. Figure 17 : Individual Particle Model Signal (left two columns) and Background (right twocolumns) images (top) and their corresponding LRP score heatmaps (bottom).The predicted scores and LRP scores for the 2D CNN trained on the Particle Modelare shown in Fig. 18 for the rescaled angular separation ∆ R subjets versus the rescaledmomentum fraction of the subjets z . These are the analogous plots to our cartoon toymodel, shown in Fig. 18. It is clear that the separation between signal and background isreduced and the separation in these geometric variables is not as clear to the eye.However, the simple intuition from using XAUGs can be translated to more appropriatevariables. In particular, Fig. 19 shows the analogous distribution for two XAUGs, therescaled groomed jet mass m jet,SD and the rescaled τ ,SD . It is clear that there areseparations between the signal and background populations. The region for the rescaled m jet,SD between 0.35 and 0.4 (corresponding to the Z mass window) tends to be preferredas a signal region, and just outside that region is preferred as background. In addition,there is a clear correlation between m jet,SD and τ ,SD that the network utilizes to identify– 22 –ackground in the upper left and lower right portions of the plot (low τ at high mass, orhigh τ at low mass). Furthermore, Fig. 20 shows the groomed versus ungroomed jet mass.This gives information about the fraction of the jet that is groomed away compared tothe original, which can disentangle the hard and soft parts of the jet. The network clearlymakes use of this information. In both of these plots, the network has regions of extremelyhigh confidence that offer clear distinction between signal and background.Another way to see the same information is to look at the individual variables fromthe 2D CNN trained on the Particle Model in Figs. 21 for highly relevant variables, andin Figs. 25- 30 in the appendix for all variables. These plots show the distribution of theXAUGs in the top pane, as well as a profile distribution of their relevance scores in thebottom pane. The relevance score shows the contribution to the correct classification ofeither signal or background.It is possible to use the information gathered by the LRP procedure to quantify theimpact of each individual variable on the decision of the network. Figures 22 and 23 showthe mean relevance score over four models for each input in the 2D CNN and 1D CNN,respectively. In these cases, the relevance scores from the image and list are summed anddisplayed as a single bar in the histogram. The uncertainties on the scores are drawn fromthe RMS of four separate training iterations of the CNN. 
It is clear that there is a hierarchyof variables, with 5-6 very important ones with relevances above 0.3, around a dozen withmoderate importance between 0.1 to 0.3, and around 5-6 with low relevances below 0.1.In order to understand how much information is being used for the decision making,we can investigate the performance of our networks in several cases, shown in Fig. 24.Due to the fact that we are using a highly simplified model at the particle level only (fordemonstration), the ROC curves are (over-)optimistic about performance, although it isstill instructive to investigate differences to understand relevant features.First, as a baseline, we can look at the simplest case of MLPs using only the XAUGs asinputs. We will look at two such cases, one where we do not use flavor information (particlecontent, d xy , and d z ), and then when this information is added. These are labeled "XAUGsonly (no flavor)" and "XAUGs only").Next, we can investigate the 2D CNN with only the jet images as input (labeled as"Images only" in the figure). We then combine the information together from the image andthe 5 XAUGs with highest relevance ("Images + 5 XAUGs") or with all XAUGs ("Images+ XAUGS").We can repeat the above cases with the particle list for the 1d CNN ("Particle List"),and then add the top 5 XAUGS ("Particle List + 5 XAUGS") and all XAUGS ("ParticleList + XAUGS"). The RNN can also be similarly investigated with and without XAUGs,shown in "RNN - Particle List" and "RNN - Particle List + XAUGs".We see that the image-only NN performs the worst, with an AUC of 0.932. The XAUGvariables perform better, achieving AUC of 0.957 (0.976) without (with) flavor information.Combining the image with the XAUGs achieves an AUC of 0.984. Combining the particlelist with the XAUGs achieves AUCs of 0.994 and 0.995 for the top 5, and all XAUGs,respectively. Similarly, the RNN achieves an AUC=0.975 regardless of whether the XAUGsare used. – 23 –hese results demonstrate that adding XAUG variables can considerably improve net-work performance. In fact, the XAUGs-only performance is already fairly good at distin-guishing between Z → bb and QCD. It may be quite useful for some simplistic cases toconsider these simple networks, guided by what DNNs find important. In cases where morediscrimination is necessary, combining XAUGs with the network can act as a sort of guidefor the optimization, achieving better performance together than either achieves individu-ally. It also gives a robust method for investigation of subspaces of relevance that can beunderstood easily by analyzers.Furthermore, the cases where we add only the 5 XAUGs with the highest relevancescores are almost identical to using all XAUGs. We can therefore conclude that the XAUGswith the highest relevance scores are the primary decision factors that the network is making,and could be used by analysts to audit and understand network behavior. In addition, onlythese few XAUGs would be needed to greatly improve performance.In addition to the above observations, Fig. 22 shows considerable variation from onetraining to another in the local behavior of networks. The local network optimizationdepends weakly on features with low relevance, so decisions based on these features shouldbe trusted to a somewhat lesser degree. 
These findings suggest that networks should betrained multiple times, with the average taken and variations counted as uncertainties.In our case, the networks tend to be strongly affected by dominant smaller subspacesrelated to a few shape-based variables, jet mass with and without grooming, and the lifetimeinformation. Such an analysis combining LRP and XAUGs could also be applied in otheruse cases to rank salient features. Figure 18 : Averaged relevance score of 5 models for the particle list CNN for ∆ R subjets vs z for the Zbb and QCD samples. – 24 – igure 19 : Averaged relevance score of 5 models for the particle list CNN for τ . ,SD vs m jet,SD for the Zbb and QCD samples. The features to compare are selected from Figure23. Figure 20 : Averaged relevance score over 5 models for particle list CNN for ungroomedmass m jet vs groomed mass m jet,SD for the Zbb and QCD samples.– 25 – igure 21 : Histograms with profiles of LRP relevances normalized per event and averagedover four models. Signal and background are the Zbb and QCD samples, respectively.– 26 – igure 22 : Input features with the greatest mean normalized relevance after averaging therelevance scores of four models for the 2D CNN. These models use Zbb and QCD simulationas Signal and Background, respectively. The relevance of the image is the summed relevanceof all the pixels in the image. – 27 – igure 23 : Input features with the greatest mean normalized relevance after averaging therelevance scores of four models for the particle list CNN. ’List’ is the summed relevance ofthe 20 particle inputs per event and the 17 particle list inputs. Figure 24 : ROC curves from Image CNN, Particle List CNN, DNN with XAUGs only,and RNN, with mass window cut of < m jet,SD < GeV.– 28 – Conclusion We have presented a scheme to systematically explain jet identification decisions from adeep neural network (DNN) using eXpert AUGmented variables (XAUG) and layerwiserelevance propagation (LRP). The combination of these two techniques can shed light onnetwork decisions that would otherwise be difficult to ascertain. The XAUG variables canalso be used alone for classification, or can be used in conjunction with lower-level / higher-dimensional classifiers to "guide" the network decisions, resulting in better performance. Insome cases, XAUG variables can even capture the information of the lower-level networksentirely if there are subspaces of the network optimization that are correlated with theXAUGs themselves.An additional benefit to using LRP and XAUGs is the ability to investigate relevantsubspaces of the training. This can highlight when there are very shallow optimizations,where training multiple times could lead to slightly different subspaces. In such cases, it isrecommended to train multiple times and take the mean of the predictions.– 29 – cknowledgments This work was supported under NSF Grants PHY-1806573, PHY-1719690 and PHY-1652066.Computations were performed at the Center for Computational Research at the Universityat Buffalo.We would like to acknowledge Kyle Cranmer, Marat Freytsis, Loukas Gouskos, GregorKasieczka, Martin Kwok, Simone Marzani, David Miller, Ben Nachman, Juska Pekkanen,Jesse Thaler, Daniel Whiteson, and David Yu for helpful conversations during the BOOST2020 poster session and manuscript review. A Software Versions We used PYTHIA v8.2.35 [34] with the default parameters, with events stored in the ROOT [42]format. 
Both the toy model and particle-level simulations were processed with coffea v0.6 [43] using uproot v3.11 [44] and AwkwardArray v0.12.0rc1 [45]. We used tensorflow v1.13 [46] and inNvestigate [32] for the DNNs and LRP, respectively.Our software can be found here on github. B Relevance Scores for All XAUGs – 30 – igure 25 : Histograms with profiles of normalized LRP relevances.– 31 – igure 26 : Histograms with profiles of normalized LRP relevances.– 32 – igure 27 : Histograms with profiles of normalized LRP relevances.– 33 – igure 28 : Histograms with profiles of normalized LRP relevances. Figure 29 : Histograms with profiles of normalized LRP relevances.– 34 – igure 30 : Histograms with profiles of normalized LRP relevances.– 35 – eferences [1] J. Zhang, R. Zhang, Y. Huang, and Q. Zou, Unsupervised part mining for fine-grained imageclassification , CoRR abs/1902.09941 (2019) [ arXiv:1902.09941 ].[2] A. Abdesselam et al., Boosted Objects: A Probe of Beyond the Standard Model Physics , Eur.Phys. J. C71 (2011) 1661, [ arXiv:1012.5412 ].[3] A. Altheimer et al., Jet Substructure at the Tevatron and LHC: New results, new tools, newbenchmarks , J. Phys. G39 (2012) 063001, [ arXiv:1201.0008 ].[4] A. Altheimer et al., Boosted Objects and Jet Substructure at the LHC. Report ofBOOST2012, held at IFIC Valencia, 23rd-27th of July 2012 , Eur. Phys. J. C74 (2014), no. 32792, [ arXiv:1311.2708 ].[5] D. Adams et al., Towards an Understanding of the Correlations in Jet Substructure , Eur.Phys. J. C75 (2015), no. 9 409, [ arXiv:1504.00679 ].[6] A. J. Larkoski, I. Moult, and B. Nachman, Jet Substructure at the Large Hadron Collider: AReview of Recent Advances in Theory and Machine Learning , Phys. Rept. (2020) 1–63,[ arXiv:1709.04464 ].[7] R. Kogler et al., Jet Substructure at the Large Hadron Collider: Experimental Review , Rev.Mod. Phys. (2019), no. 4 045003, [ arXiv:1803.06991 ].[8] D. Guest, J. Collado, P. Baldi, S.-C. Hsu, G. Urban, and D. Whiteson, Jet FlavorClassification in High-Energy Physics with Deep Neural Networks , Phys. Rev. D94 (2016),no. 11 112002, [ arXiv:1607.08633 ].[9] W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu, Definitions, methods, andapplications in interpretable machine learning , Proceedings of the National Academy ofSciences (Oct, 2019) 22071–22080.[10] Z. C. Lipton, The mythos of model interpretability , CoRR abs/1606.03490 (2016)[ arXiv:1606.03490 ].[11] M. T. Ribeiro, S. Singh, and C. Guestrin, "why should I trust you?": Explaining thepredictions of any classifier , CoRR abs/1602.04938 (2016) [ arXiv:1602.04938 ].[12] A. Björklund, A. Henelius, E. Oikarinen, K. Kallonen, and K. Puolamäki, Sparse robustregression for explaining classifiers , in Discovery Science (P. Kralj Novak, T. Šmuc, andS. Džeroski, eds.), (Cham), pp. 351–366, Springer International Publishing, 2019.[13] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. MÃŒller, and W. Samek, Onpixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation , PLOS ONE (07, 2015) 1–46.[14] W. Samek, T. Wiegand, and K. Müller, Explainable artificial intelligence: Understanding,visualizing and interpreting deep learning models , CoRR abs/1708.08296 (2017)[ arXiv:1708.08296 ].[15] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, Layer-Wise RelevancePropagation: An Overview , pp. 193–209. Springer International Publishing, Cham, 2019.[16] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K. 
Muller, How toexplain individual classification decisions , Journal of Machine Learning Research (6,2010) 1803–1831. – 36 – 17] K. Datta and A. Larkoski, How Much Information is in a Jet? , JHEP (2017) 073,[ arXiv:1704.08249 ].[18] S. H. Lim and M. M. Nojiri, Spectral Analysis of Jet Substructure with Neural Networks:Boosted Higgs Case , JHEP (2018) 181, [ arXiv:1807.03312 ].[19] A. Chakraborty, S. H. Lim, and M. M. Nojiri, Interpretable deep learning for two-prong jetclassification with jet spectra , JHEP (2019) 135, [ arXiv:1904.02092 ].[20] K.-F. Chen and Y.-T. Chien, Deep learning jet substructure from two-particle correlations , Physical Review D (Jun, 2020).[21] G. Kasieczka, S. Marzani, G. Soyez, and G. Stagnitto, Towards machine learning analyticsfor jet substructure , Journal of High Energy Physics (Sep, 2020).[22] CMS Collaboration, A. M. Sirunyan et al., Identification of heavy, energetic, hadronicallydecaying particles using machine-learning techniques , JINST (2020), no. 06 P06005,[ arXiv:2004.08262 ].[23] V. Mikuni and F. Canelli, Abcnet: an attention-based method for particle tagging , TheEuropean Physical Journal Plus (Jun, 2020).[24] G. Kasieczka, T. Plehn, M. Russell, and T. Schell, Deep-learning Top Taggers or The End ofQCD? , JHEP (2017) 006, [ arXiv:1701.08784 ].[25] J. Cogan, M. Kagan, E. Strauss, and A. Schwarztman, Jet-Images: Computer VisionInspired Techniques for Jet Tagging , JHEP (2015) 118, [ arXiv:1407.5675 ].[26] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, Jet-images — DeepLearning Edition , JHEP (2016) 069, [ arXiv:1511.05190 ].[27] J. Thaler and K. Van Tilburg, Identifying Boosted Objects with N-subjettiness , JHEP (2011) 015, [ arXiv:1011.2268 ].[28] J. Thaler and K. Van Tilburg, Maximizing Boosted Top Identification by MinimizingN-subjettiness , JHEP (2012) 093, [ arXiv:1108.2701 ].[29] J. Gallicchio and M. D. Schwartz, Seeing in Color: Jet Superstructure , Phys. Rev. Lett. (2010) 022001, [ arXiv:1001.5027 ].[30] T. Faucett, J. Thaler, and D. Whiteson, Mapping Machine-Learned Physics into aHuman-Readable Space , arXiv:2010.11998 .[31] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, Explaining nonlinearclassification decisions with deep taylor decomposition , Pattern Recognition (May, 2017)211–222.[32] M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K. T. Schütt, G. Montavon, W. Samek,K.-R. Müller, S. Dähne, and P.-J. Kindermans, innvestigate neural networks! , 2018.[33] M. Dasgupta, A. Fregoso, S. Marzani, and G. P. Salam, Towards an understanding of jetsubstructure , JHEP (2013) 029, [ arXiv:1307.0007 ].[34] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel,C. O. Rasmussen, and P. Z. Skands, An Introduction to PYTHIA 8.2 , Comput. Phys.Commun. (2015) 159–177, [ arXiv:1410.3012 ].[35] J. Dolen, P. Harris, S. Marzani, S. Rappoccio, and N. Tran, Thinking outside the ROCs:Designing Decorrelated Taggers (DDT) for jet substructure , JHEP (2016) 156,[ arXiv:1603.00027 ]. – 37 – 36] C. Shimmin, P. Sadowski, P. Baldi, E. Weik, D. Whiteson, E. Goul, and A. Søgaard, Decorrelated Jet Substructure Tagging using Adversarial Neural Networks , Phys. Rev. D96 (2017), no. 7 074034, [ arXiv:1703.03507 ].[37] O. Kitouni, B. Nachman, C. Weisser, and M. Williams, Enhancing searches for resonanceswith machine learning and moment decomposition , arXiv:2010.09745 .[38] M. Cacciari and G. P. Salam, Dispelling the N myth for the k t jet-finder , Phys. Lett. 
B641 (2006) 57–61, [ hep-ph/0512210 ].[39] M. Cacciari, G. P. Salam, and G. Soyez, FastJet User Manual , Eur. Phys. J. C72 (2012)1896, [ arXiv:1111.6097 ].[40] M. Cacciari, G. P. Salam, and G. Soyez, The anti- k T jet clustering algorithm , JHEP (2008) 063, [ arXiv:0802.1189 ].[41] A. J. Larkoski, S. Marzani, G. Soyez, and J. Thaler, Soft Drop , JHEP (2014) 146,[ arXiv:1402.2657 ].[42] R. Brun, F. Rademakers, P. Canal, A. Naumann, O. Couet, L. Moneta, V. Vassilev, S. Linev,D. Piparo, G. GANIS, B. Bellenot, E. Guiraud, G. Amadio, wverkerke, P. Mato, TimurP,M. Tadel, wlav, E. Tejedor, J. Blomer, A. Gheata, S. Hageboeck, S. Roiser, marsupial,S. Wunsch, O. Shadura, A. Bose, CristinaCristescu, X. Valls, and R. Isemann, root-project/root: v6.18/02 , Aug., 2019.[43] L. Gray, N. Smith, A. Novak, D. Taylor, P. Fackeldey, C. Carballo, P. Gessinger, J. Pata,A. Woodard, Andreas, B. Fischer, Z. Surma, A. Perloff, D. Noonan, L. Heinrich, N. Amin,P. Das, I. Dutta, J. Duarte, J. Rübenach, and A. R. Hall, Coffeateam/coffea: Releasev0.6.46 , Nov., 2020.[44] J. Pivarski, P. Das, C. Burr, D. Smirnov, M. Feickert, T. Gal, L. Kreczko, N. Smith,N. Biederbeck, O. Shadura, M. Proffitt, benkrikler, H. Dembinski, H. Schreiner, J. Rembser,M. R., C. Gu, J. Rübenach, M. Peresano, and R. Turra, scikit-hep/uproot: 3.12.0 , July, 2020.[45] J. Pivarski, C. Escott, M. Hedges, N. Smith, C. Escott, J. Rembser, J. Nandi, B. Fischer,H. Schreiner, P. Das, P. Fackeldey, Nollde, and B. Krikler, scikit-hep/awkward-array:0.12.0rc1 , July, 2019.[46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia,R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore,D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker,V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,Y. Yu, and X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems ,2015. Software available from tensorflow.org.,2015. Software available from tensorflow.org.