Image-Based Jet Analysis
Michael Kagan, SLAC National Accelerator Laboratory
Image-based jet analysis is built upon the jet image representation of jets that enables a direct connection between high energy physics and the fields of computer vision and deep learning. Through this connection, a wide array of new jet analysis techniques have emerged. In this text, we survey jet image based classification models, built primarily on the use of convolutional neural networks, examine the methods to understand what these models have learned and what is their sensitivity to uncertainties, and review the recent successes in moving these models from phenomenological studies to real world application on experiments at the LHC. Beyond jet classification, several other applications of jet image based techniques, including energy estimation, pileup noise reduction, data generation, and anomaly detection, are discussed.
Contents
1. Introduction
 1.1. Notations and Definitions
2. Jets and Jet Physics Challenges
3. Jet Images and Preprocessing
 3.1. Preprocessing
4. Computer Vision and Convolutional Neural Networks
5. Jet tagging
 5.1. Jet Tagging on Single Channel Jet Images
 5.2. Multi-Channel Jet Tagging with CNNs
 5.3. Sensitivity to Theory Uncertainties
 5.4. Jet images in LHC Experiments
6. Understanding jet Image based tagging
 6.1. Probing CNNs
7. Other Applications of Jet images
 7.1. Jet Energy Regression and Pileup removal
 7.2. Generative Models with Jet Images
 7.3. Anomaly Detection
8. Conclusion
References
1. Introduction
The jet image [1] approach to jet tagging is built upon the rapidly developing field of Computer Vision (CV) in Machine Learning (ML). Jets [2, 3] are collimated streams of particles produced by the fragmentation and hadronization of high energy quarks and gluons. The particles are subsequently measured by particle detectors and clustered with jet clustering algorithms to define the jets. Jet images view the energy depositions of the stream of particles comprising a jet within a fixed geometric region of a detector as an image, thereby connecting particle detector measurements with an image representation and allowing the application of image analysis techniques from CV. In this way, models built upon advancements in deep convolutional neural networks (CNNs) can be trained for jet classification, energy determination through regression, and the reduction of noise, e.g. from simultaneous background interactions at a high intensity hadron collider such as the Large Hadron Collider (LHC). Throughout this text, the focus will be on the use of jet image techniques studied within the context of hadron colliders like the LHC [4].

Jet images form a representation of jets highly connected with the detector; one can look at segmented detectors as imaging devices and interpret the measurements as an image. In contrast, other representations of jets exist that are built more closely from the physics of jet formation, such as viewing jets as sequences [5, 6] or trees [7] formed through a sequential emission process, or viewing jets as sets, graphs, or point clouds [8, 9] with the geometric relationship between constituents of the jet encoded in the adjacency matrix and node properties. There are overlaps in these approaches, for instance a graph can be defined over detector energy measurements, but these approaches will not be discussed in detail in this chapter. The utilization of an image-based approach comes with the major advantage that CV is a highly developed field of ML, with some of the most advanced models available for application to jet analysis with jet images. From the experimental viewpoint, the detector measurements are fundamental to any subsequent analysis, and the detailed knowledge of the detector and its systematic uncertainties can be highly advantageous for analysis of LHC data.

Among the earliest uses of jet images was the classification of the parent particle inducing the jet [1], which relied on linear discriminants trained on image representations of jets. While the remainder of this text will focus on deep learning approaches to jet images, even this early work saw interesting discrimination power for this task. By utilizing the detector measurements directly, rather than relying on jet features developed using physics domain knowledge, additional discrimination power could be extracted. Deep learning approaches surpass such linear methods, but build on this notion of learning discriminating information from detector observables rather than engineered features.

Fig. 1.: An example jet image of a Lorentz boosted top quark jet after preprocessing has been applied [10].

While designed to take advantage of advances in computer vision, jet images have notable differences with respect to typical natural images in CV. Jet images are sparse, with most pixels in the image having zero content. This is markedly different from natural images, which tend to have all pixels containing content.
Moreover, jet images tend to have multiple localized regions of high density in addition to diffusely located pixels throughout the image, as opposed to the smooth structures typically found in natural images. An example top quark jet image illustrating these features can be seen in Figure 1. These differences can lead to notable challenges; for instance, the number of parameters used in jet image models (and consequently the training time) tends to be large to account for the size of the image, even though most pixels carry no information. Some techniques exist for sparse-image computer vision approaches [11], but they have not been explored in depth within the jet image community.

This text will first discuss jets and typical jet physics in Section 2. The formation of jet images and the jet image preprocessing steps applied before classification are discussed in Section 3. A brief introduction to Computer Vision is found in Section 4. The application of jet images in various jet classification problems is then discussed in Section 5, followed by a discussion on the interpretation of the information learned by jet image based classifiers in Section 6. Some recent applications of jet images beyond classification are discussed in Section 7. A brief note on the notation used throughout the text follows below (Section 1.1).

It should be noted that the majority of the studies presented in this text relate to phenomenological work using simplified settings that often do not include a realistic modeling of a detector's impact on observables. These will frequently be denoted as phenomenological studies, in contrast to the studies using realistic detector simulations or real experiment data that are discussed mainly in Section 5.4.
1.1. Notations and Definitions
As we will focus on studies of jet images in the LHC setting, we will primarily utilize the hadron collider coordinate system notation. The beam-line defines the z-axis, φ indicates the azimuthal angle, and η = −log tan(θ/2) is the pseudo-rapidity, a transformation of the polar angle θ. The rapidity, defined as y = ½ log[(E + p_z)/(E − p_z)], is frequently used for the polar measurement of massive particles, such as jets, as differences in rapidity are invariant with respect to Lorentz boosts along the beam direction. The angular separation of particles is defined as ΔR(p₁, p₂) = √[(y₁ − y₂)² + (φ₁ − φ₂)²]. The transverse momentum, p_T = √(p_x² + p_y²), is frequently used as it is invariant with respect to Lorentz boosts along the beam direction. The transverse energy is defined as E_T = E sin(θ).
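To make the notation concrete, the following is a minimal Python sketch of these kinematic quantities; the function names and the azimuthal wrap-around handling are illustrative choices, not from the text.

```python
import numpy as np

def pseudorapidity(theta):
    """eta = -log tan(theta/2), a function of the polar angle only."""
    return -np.log(np.tan(theta / 2.0))

def rapidity(E, pz):
    """y = 0.5 * log((E + pz) / (E - pz)); differences in y are invariant
    under Lorentz boosts along the beam direction."""
    return 0.5 * np.log((E + pz) / (E - pz))

def delta_R(y1, phi1, y2, phi2):
    """Angular separation, wrapping the azimuthal difference into (-pi, pi]."""
    dphi = np.mod(phi1 - phi2 + np.pi, 2 * np.pi) - np.pi
    return np.sqrt((y1 - y2) ** 2 + dphi ** 2)

def transverse_momentum(px, py):
    return np.sqrt(px ** 2 + py ** 2)
```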
2. Jets and Jet Physics Challenges
Jets are collimated streams of particles produced by the fragmentation and hadronization of high energy quarks and gluons. Jet clustering algorithms are used to combine particles into clusters that define the jets (see references [2, 3] for recent reviews). At the LHC, jet algorithms typically rely on sequential reclustering algorithms which, given a definition of distance, iteratively combine the closest two constituents (either particles or previously combined sets of particles denoted proto-jets) until a stopping condition is met. Different distance metrics define different jet algorithms, and perhaps the most commonly used algorithm at the LHC is the anti-k_T algorithm [12], in which the distance between particle i and particle j is defined as d_ij = min{k_T,i⁻², k_T,j⁻²} Δ²_ij / R², where Δ²_ij = (y_i − y_j)² + (φ_i − φ_j)² and y, φ, and k_T are the particle rapidity, azimuth, and transverse momentum, respectively. The parameter R of the jet algorithm has the effect of defining the spatial span, or approximate "radius" (though the jet is not necessarily circular), of the jet. Jets and jet algorithms are required to be IRC safe, i.e. insensitive to additional infrared radiation or collinear splittings of particles, in order for the jet properties to be calculable in fixed-order perturbation theory. This allows comparison between jets clustered on partons from the hard scattering process, referred to as parton jets, on final state particles after showering and hadronization simulation, referred to as particle jets, and on reconstructed particles in detectors, referred to as reconstructed jets.

Most of the work presented in this text comprises phenomenological studies outside the context of any individual experiment. These studies primarily utilize particle level simulation after fragmentation and hadronization, and thus study particle jets defined after clustering the final state particles. These studies typically do not use a simulation of a detector and its impact on particle kinematic measurements. Studies of jets and jet images after real detector simulation or in real detector data are discussed in Section 5.4. In the detector setting, various inputs to jet algorithms can be used to define jets: (i) towers, referring to a fixed spatial extent in η and φ in which all energy within the longitudinal depth of the calorimeter is summed, (ii) topological clusters [13], used to cluster together energy depositions in nearby calorimeter cells, and (iii) tracks, or charged particle trajectories, measured using tracking detectors. The Particle Flow (PF) algorithm [14] is used by the CMS collaboration to match charged particles with energy in the calorimeter, in order to utilize both measurements to define PF candidates that can be used as inputs to jet algorithms.

The R-parameter of the jet is used to define the spatial extent to which particles are clustered into the jet. When studying quark and gluon jets, a small radius of R = 0.4 is typically used, while large-R jets, with a larger radius parameter of typically R = 0.8 to 1.0, are often used to capture the decays of boosted heavy particles. Subjets, defined by running a jet clustering algorithm with smaller radius on the constituents of a jet, are frequently used to study the internal properties of a jet. More broadly, jet substructure refers to the study of the internal structure of jets and the development of theoretically motivated jet features which are useful for discrimination and inference tasks (see references [15, 16] for recent reviews).
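As an illustration of the anti-k_T distance defined above, the following sketch computes the pairwise and beam distances used at each clustering step. The full clustering loop (repeatedly merging the pair with the smallest d_ij, or promoting a proto-jet to a final jet when its beam distance is smallest) is omitted, and real analyses use the FastJet library rather than hand-rolled code.

```python
import numpy as np

def antikt_distance(kt_i, kt_j, y_i, y_j, phi_i, phi_j, R=0.4):
    """Pairwise anti-kT distance d_ij = min(kT_i^-2, kT_j^-2) * Delta_ij^2 / R^2."""
    dphi = np.mod(phi_i - phi_j + np.pi, 2 * np.pi) - np.pi  # wrap azimuth
    delta2 = (y_i - y_j) ** 2 + dphi ** 2
    return min(kt_i ** -2, kt_j ** -2) * delta2 / R ** 2

def beam_distance(kt_i):
    """Beam distance d_iB = kT_i^-2; a proto-jet is promoted to a final jet
    when its beam distance is smaller than all pairwise distances."""
    return kt_i ** -2
```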
One particularly important feature of a jet is the jet mass, computed as m² = (Σ_{i∈jet} p_i)², where the sum of four-vectors runs over all the constituents i clustered into the jet. As different heavy resonances have different masses, this feature can be a strong discriminant between jet types. Note that any operation performed on a jet which alters the constituents, such as the pileup mitigation discussed in the next paragraph, may alter the jet mass.

It is important to note that additional proton-proton interactions within a bunch crossing, or pileup, create additional particles within an event that can impact jet clustering and the estimation of jet properties. This is especially important for large-R jets, which cover large spatial extents. Dedicated pileup removal algorithms are used to mitigate the impact of pileup [17]. Jet trimming [18] is a jet grooming technique used to remove soft and wide angle radiation from jets, in which the constituents within the jet are reclustered using a jet algorithm with a smaller radius to define subjets, and subjets carrying a fraction of the jet energy below a threshold are removed; it is frequently used by ATLAS to aid in pileup mitigation. The pileup per particle identification algorithm (PUPPI) [19] is frequently used by CMS, in which for each particle a local shape parameter, which probes the collinear versus soft diffuse structure in the neighborhood of the particle, is calculated. The distribution of this shape parameter per event is used to calculate per particle weights that describe the degree to which particles are pileup-like. Particle four-momenta are then weighted, and thus the impact of (down-weighted) pileup particles on jet clustering is reduced [19]. Pileup mitigation can greatly improve the estimation of the jet mass, energy, and momentum by removing or down-weighting the pileup particles clustered into a jet, which only serve as noise in the estimation of jet properties.

Jet identification, energy estimation, and pileup estimation / reduction are among the primary challenges for which the jet images approach has been employed: (i) Jet identification refers to the classification of the parent particle type that gave rise to the jet, and is needed to determine the particle content of a collision event. (ii) Jet energy estimation refers to the regression of the true jet total energy from the noisy detector observations, and is needed to determine the kinematic properties of an event. (iii) Jet pileup estimation and reduction refers to the determination of the stochastic contributions to detector observations arising from incident particles produced in proton-proton collisions that are not from the primary hard scattering. This form of denoising is required to improve the energy and momentum resolutions of measurements of jets.
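As a concrete reference for the jet mass defined above, a minimal sketch follows; the (N, 4) array layout of (E, px, py, pz) rows is an illustrative assumption.

```python
import numpy as np

def jet_mass(constituents):
    """Invariant mass from the summed constituent four-vectors,
    m^2 = E^2 - |p|^2; `constituents` is an (N, 4) array of (E, px, py, pz)."""
    E, px, py, pz = np.asarray(constituents, dtype=float).sum(axis=0)
    m2 = E ** 2 - px ** 2 - py ** 2 - pz ** 2
    return np.sqrt(max(m2, 0.0))  # guard against small negative m^2 from rounding
```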
Among the primary physics settings in which jet images have been used are studies of jets produced by Lorentz boosted heavy particles, such as a W or Z boson, Higgs boson (h), top quark (t), or a hypothetical new beyond the Standard Model particle. When a heavy short-lived particle is produced with a momentum on the order of twice its mass or more, the quark decay products of such a heavy particle have a high likelihood of emerging collimated, such that the subsequent hadronic showers produced by the quarks overlap. Jet clustering algorithms can capture the entirety of the heavy particle decay within one large-R jet, with an R-parameter typically between 0.8 and 1.0, though in some cases larger R-parameters have been used. The internal structure of such a boosted jet can be highly non-trivial and significantly different than a typical jet produced by a single quark or gluon. However, the production of quarks and gluons is ubiquitous at hadron colliders, and thus powerful discrimination methods, or taggers, are needed to identify relatively clean samples of heavy-particle-induced boosted jets. Moreover, the mass scale of heavy hadronically decaying particles in the Standard Model is similar, from the W boson mass of ∼80 GeV [20] up to the top quark mass of ∼173 GeV [20]. Typical discrimination tasks thus include discriminating boosted W-, Z-, h-, or t-jets from quarks and gluons, but also discriminating between boosted heavy particle jets.

Jet images have also been employed for studying jets from individual quarks and gluons. This includes discriminating between quark and gluon jets, and between jets produced by quarks of different flavour. In these cases, smaller jets, typically with R = 0.4, are used.
3. Jet Images and Preprocessing
Jet images are built using a fixed grid, or pixelation, of the spatial distribution of energy within a jet. Early instances of such pixelation relied on energy depositions in calorimeter detectors, wherein the angular segmentation of the detector cells was used to define the "pixels" of the jet image and the pixel "intensity" was defined with the transverse energy in a cell. More recently, high resolution measurements of charged particles from tracking detectors have also been used to form images, wherein the transverse momentum of all particles found within the spatial extent of a jet image pixel is summed to define the pixel intensity. While calorimeter and tracking detectors typically span a large angular acceptance, a typical
jet has a limited angular span. The angular span of a jet is related to the R-parameter of the jet clustering algorithm. Jet images are thus designed to cover the catchment area of the jet [21]. In many cases, the jet image is first defined to be slightly larger than the expected jet catchment area, to ensure that preprocessing steps (discussed in Section 3.1) do not disrupt peripheral pixel estimates, and the images are then cropped after preprocessing. Nonetheless, only a slice of the angular space of the detector is used to define the jet image, with the image centered on the direction of the jet and the image size chosen to capture the extent of a jet with a given R-parameter. If depth segmentation is present in a calorimeter, the energy is often summed in depth. From this vantage point, a jet image can be viewed as a grey-scale image comprising the energy measurements encapsulated by the angular span of the jet. In some cases energy depositions from hadronic and electromagnetic calorimeters will be separated into different images, or separate images will be formed from both calorimeter cell measurements and the spatially pixelated charged particle measurements. In these cases, the set of jet images, each defining a view of the jet from a different set of measurements, can be seen as the color channels of a jet image.

It should be noted that jet pileup mitigation, such as the aforementioned trimming or PUPPI algorithms, is vital to reduce the impact of pileup on downstream jet image prediction tasks. While not explicitly discussed as a part of the jet image preprocessing, this step is almost always performed prior to jet image formation using the jet constituents, especially in the case of studying large-R jets.

3.1. Preprocessing
An important consideration in the training of a classifier is how to process data before feeding it to the classifier such that the classifier can learn most efficiently. For instance, a common preprocessing step in ML is to standardize inputs by scaling and mean-shifting each input feature such that each feature has zero mean and unit variance. In this case, standardization helps to ensure that features have similar range and magnitude so that no single feature dominates gradient updates. In general, data preprocessing can help to stabilize the optimization process and can help remove redundancy in the data features, easing the learning of useful representations and improving the learning sample efficiency. However, data preprocessing may come at a cost if the preprocessing step requires approximations that lead to distortion of the information in the data. The primary jet image preprocessing steps include the following:
Translation: An important consideration when preparing inputs to a classifier is the set of symmetries of the data, i.e. the transformations of inputs that should not affect the classifier prediction. In the case of jet images, these symmetries are related to the physical symmetries of the system. At a particle collider, there is no preferred direction transverse to the beam line, and the physics should be invariant to azimuthal rotations in the transverse plane. In terms of jet images, given a fixed parent particle, the distribution of jet images at a given azimuthal coordinate φ = φ_a should not differ from the distribution at a different φ = φ_b. As such, an important preprocessing step is to translate all jet images to be "centered" at φ = 0. This is often performed by translating the highest p_T subjet (formed by clustering the jet constituents with a small R-parameter jet algorithm), or the jet energy centroid, to be located at φ = 0. The same invariance is not generically true for changes in η, as translations in η correspond to Lorentz boosts along the beam direction, which could alter the jet properties if not handled carefully. When energy is used for jet image pixel intensities, a translation in η while keeping pixel intensities fixed will lead to a change in the jet mass. However, when the transverse momentum, which is invariant to boosts along the beam direction, is used to define pixel intensities, a translation in η can be performed without altering the jet mass distribution. With this definition of pixel intensities, jet images are typically translated such that the leading subjet is located at η = 0. By centering the jet on the leading p_T subjet, the classifier can focus on learning the relative variations of a jet, which are key for classification.
Rotation: The radiation within a jet is also approximately symmetric about the jet axis in the η−φ plane. A common preprocessing step is thus to rotate jet images, after centering the image on the leading p_T subjet, such that the second leading p_T subjet, or the first principal axis of the spatial distribution of p_T in the image, is aligned along the y-axis of the image. However, there are challenges with rotations. First, rotations in the η−φ plane can alter the jet mass, thus potentially impacting the classification performance (alternative definitions of rotations have been proposed that preserve jet mass [22], but these may alter other key jet properties). Second, as jet images are discretized along the spatial dimensions, rotations by angles other than factors of π/2 cannot be performed exactly on the pixel grid. A typical solution is to fit a spline to the p_T distribution within a jet image, apply a rotation to this spline function, and then impose an image grid to discretize the spline back to an image. The interpolation and the post-rotation discretization can spatially smear information in the jet and lead to aliasing. As such, there is varying use of rotation preprocessing in jet image research.
Flipping: A transformation φ → −φ should not affect the physics of the jet, and this transformation can be performed to ensure that the positive-φ half of the image contains the half of the jet with more energy, for instance due to radiation emission.
Normalization: A step often found in image preprocessing for computer vision tasks is image normalization, typically through taking an L¹ norm of the image such that x_i → x_i / Σ_j x_j, where x_i is a pixel intensity and the sum runs over all pixels in an image. However, in the case of jet images, such a normalization may be destructive, as it does not preserve the mass of the jet (as computed from the pixels) and can deteriorate discrimination performance due to this loss of information [23]. As such, there is varying usage of image normalization in jet image research.

Fig. 2.: The impact of various preprocessing steps on the distribution of estimated jet mass for boosted W boson jet images [23].

The impact on the jet mass, as computed from the pixels of jet images, for W boson jets within a fixed p_T range and within a fixed pre-pixelation mass range can be seen in Figure 2. The distortions of the jet mass from pixelation, from rotations of images with energies as pixel intensities, and from L¹ normalization can be seen clearly, whilst translation and flipping do not distort the jet image mass. As expected, mild distortion of the mass can be seen when rotations are performed on jet images with transverse energy used for pixel intensities. These distortions may or may not be impactful on downstream tasks, depending on whether the jet mass is a key learned feature for the downstream model.
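Putting these steps together, the following is a rough sketch of such a preprocessing chain on a pixelated single-channel image. Centering on the brightest pixel (rather than on a properly clustered leading subjet), the use of scipy's spline-based rotation, and applying the L¹ normalization at all are simplifying assumptions for illustration, not any experiment's official pipeline.

```python
import numpy as np
from scipy import ndimage

def preprocess(image):
    """Illustrative preprocessing for a 2D numpy array of pixel pT intensities."""
    # Translation: roll the grid so the brightest pixel sits at the center.
    cy, cx = np.unravel_index(np.argmax(image), image.shape)
    image = np.roll(image, (image.shape[0] // 2 - cy,
                            image.shape[1] // 2 - cx), axis=(0, 1))
    # Rotation: approximately align the principal axis of the pT distribution
    # vertically, using spline interpolation (which can smear/alias).
    yy, xx = np.indices(image.shape)
    total = image.sum()
    my, mx = (image * yy).sum() / total, (image * xx).sum() / total
    syy = (image * (yy - my) ** 2).sum()
    sxx = (image * (xx - mx) ** 2).sum()
    sxy = (image * (yy - my) * (xx - mx)).sum()
    angle = 0.5 * np.arctan2(2 * sxy, sxx - syy)
    image = ndimage.rotate(image, np.degrees(angle), reshape=False, order=3)
    # Flipping: put the more energetic half on one fixed side.
    half = image.shape[1] // 2
    if image[:, :half].sum() > image[:, half:].sum():
        image = np.fliplr(image)
    # Normalization (optional; distorts the pixel-level jet mass): L1 norm.
    total = image.sum()
    return image / total if total > 0 else image
```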
4. Computer Vision and Convolutional Neural Networks
Object classification in Computer Vision (CV) tasks served as a primary setting where Deep Learning had major early successes [24], quickly surpassing then state-of-the-art approaches and serving as one of the drivers of a deep learning revolution. While much of the work in CV has focused on understanding natural images, data collected by physics experiments come from heterogeneous detectors, tend to be sparse, and do not have a clear similarity to natural images. Nonetheless, the success of DL in CV inspired a parallel effort in the collider physics community to explore applications of such techniques to HEP data. Below we present a brief introduction to convolutional neural networks (CNNs) [25] and some of the state-of-the-art architecture variants, in order to provide some background for the models used in jet tagging and other applications. For a more in-depth pedagogical introduction to this material see for instance [26].

Most of the models discussed in this text rely on the use of convolutional layers. However, it should be noted that some models make use of locally-connected layers [25, 27, 28], in which a given neuron only has access to a small patch of the input but, unlike convolutional layers that rely on weight sharing (as discussed below), the neuron processing each image patch is associated with a different set of weights.

Convolutional neural networks rely on neuron local spatial connectivity and weight sharing to produce translationally equivariant models that are well adapted to image analysis. A typical CNN is built by stacking one or more convolutional and non-linear activation layers, often followed by a pooling layer. This structure is repeated several times. Fully connected layers, with full connections from all inputs to activations, are used to perform the final classification or regression prediction. Images processed by CNNs are represented as 3D tensors with dimensions width × height × depth and are often referred to as the image volume. The height and width dimensions correspond to the spatial extent of the image while the depth is typically the color channel.
Convolutional layers are composed of a set of filters, where each filter applies an inner product between a set of weights and a small patch of an input image. The filter is scanned, or convolved, across the height and width of the image to produce a 2D map, often referred to as a response map or convolved image, that gives the response of applying the filter at each position of the image. The response at each position becomes large when the filter and the image patch match, i.e. when their inner product is large. The filters will thus learn to recognize visual features such as edges, textures, and shapes, and produce large responses when such visual features are present in a patch of an image. The spatial extent of the input patch is known as the receptive field or filter size, and the filters extend through the full depth of the image volume. Several filters are learned simultaneously to respond to different visual features. The response maps of the filters are then stacked in depth, producing an output convolved image volume. Finally, the response maps are passed through point-wise (i.e. per pixel) non-linear activations to produce an activation map.

By sharing weights between neurons, i.e. by scanning and applying the same filter at each image location, it is implicitly assumed that it is useful to apply the same set of weights to different image locations. This assumption is reasonable, as a visual feature may be present at any location in an image, and the filter is thus testing for that feature across the image. This results in the convolutional layers being translationally equivariant, in that if a visual feature is shifted in an image, the response to that feature will be shifted in the activation map. In addition, parameter sharing results in a dramatic reduction in the number of free parameters in the network relative to a fully connected network with the same number of neurons.
Pooling layers reduce the spatial extent of the image volume while leaving the depth of the volume unchanged [29]. This further reduces the number of parameters needed by the network and helps control overfitting. Pooling is often performed with a max operation, wherein only the largest activation in a region, typically 2 × 2, is kept for subsequent processing.
Normalization layers may be used to adjust activation outputs, typically to control the mean and variance of the activation distribution. This helps ensure that a neuron does not produce extremely large or small activations relative to other neurons, which can aid in gradient-based optimization and in mitigating exploding / vanishing gradients. Batch normalization [30] is a common normalization method in which, for each mini-batch, the mean and variance within the mini-batch of each activation dimension are used to normalize the activation to have approximately zero mean and unit variance. A linear transformation of the normalized activation, with learnable scale and offset parameters, is then applied.

Fully connected layers are applied at the end of the network, after flattening the image volume into a vector, in order to perform classification or regression predictions. At this stage, auxiliary information, potentially processed by a separate set of neural network layers, may be merged with the information gleaned from the processing by convolutional layers in order to ensure that certain features are provided for classification. Within the jet tagging context, such information may correspond to information about the jet, such as its mass, or global event information such as the number of interactions in a given collision.
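As a concrete example of the layer stack described above, a minimal PyTorch sketch of a jet image CNN follows. The filter counts, the large (11 × 11) first-layer filter (echoing the single channel studies discussed in Section 5.1), and the hidden layer size are illustrative values, not a reference implementation.

```python
import torch.nn as nn

class JetImageCNN(nn.Module):
    """Sketch of a jet-image tagger: stacked conv/ReLU/pool blocks
    followed by fully connected layers."""
    def __init__(self, channels=1, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=11, padding=5),  # large first
            nn.ReLU(),                                           # filter for sparse images
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(300),  # infers the flattened size on first forward pass
            nn.ReLU(),
            nn.Linear(300, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```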
Residual Connections:
While CNNs encode powerful structural information into the model, such as translation equivariance, it has been noted that scaling up such models by stacking large numbers of convolutional layers can lead to large challenges in training [31]. In order to train large models using the backpropagation algorithm, the chain rule is used to compute the gradient from the model output back to the relevant weight. In early layers, the multiplication of many gradients can lead to vanishingly small or exploding gradients, thus resulting in unhelpful gradient updates. To overcome this challenge, the residual block [32] was proposed, and has led to the development of residual networks. While a typical neural network layer passes an input z through a nonlinear function f(·) to produce an output z′ = f(z), a residual block also uses a "skip connection" to pass the input to the output, in the form z′_res = W_s z + f(z), where the weights W_s can be used to project the channels of z to have the same dimension as f(z). In this way, the function f(·) is tasked with learning the relative change to the input. Moreover, the skip connection provides a path to efficiently pass gradients backwards to earlier layers of the network without distortion through the nonlinearities, making gradient descent much easier and thus enabling the training of significantly deeper models. Note that the function f(·) can contain several layers of convolutions, nonlinearities, and normalization before being recombined with the input.
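A minimal sketch of such a residual block in PyTorch; the two-convolution body and the batch normalization placement are common conventions assumed here for illustration.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: output = W_s x + f(x), where a 1x1 convolution W_s
    matches channel dimensions when needed."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.skip(x) + self.f(x))
```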
Training in supervised learning tasks is performed by minimizing an appropriate loss function that compares the CNN prediction with a true label. The loss is typically the cross-entropy in the case of binary classification, and the mean squared error in the case of regression. Minimization is performed using stochastic gradient descent (SGD) [33], or one of its variants, such as ADAM [34], designed to improve the convergence of the learning algorithm.
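Schematically, the training loop then looks as follows, assuming the JetImageCNN sketch above and a hypothetical train_loader yielding batches of images and integer class labels.

```python
import torch

model = JetImageCNN()  # from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()  # classification loss on logits

for images, labels in train_loader:  # assumed DataLoader of (image, label)
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()  # backpropagation via the chain rule
    opt.step()       # stochastic gradient update
```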
Evaluation Metrics:
Receiver operating characteristic (ROC) curves are frequently used to examine and compare the performance of binary classification algorithms. Given a model which produces a classification prediction c(x), where x is the input features and c(·) ∈ [0, 1], a threshold τ on the prediction is scanned from 0 to 1, and the fraction of inputs from each of the signal and background classes above this threshold, i.e. the signal efficiency (ε_S) and background efficiency (ε_B) for surviving this threshold, defines a point on the ROC curve for each τ value. ROC curves thus display the background efficiency (or background rejection, defined as one divided by the background efficiency) versus the signal efficiency. When the ROC curve is defined as the background efficiency versus the signal efficiency, a metric commonly used to evaluate the overall model performance is the ROC integral, also known as the area under the curve (AUC).

Significance improvement characteristic (SIC) curves [35] are closely related to ROC curves, but display ε_S/√ε_B as a function of the signal efficiency ε_S. This curve targets displaying the potential improvement in statistical significance when applying a given discriminant threshold relative to not applying such a threshold.
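These metrics are straightforward to compute with scikit-learn; in this sketch, y_true and y_score are assumed arrays of true labels and classifier outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roc_and_sic(y_true, y_score):
    """y_true: 1 for signal, 0 for background; y_score: classifier outputs.
    Returns signal efficiency, background rejection, SIC, and AUC."""
    eps_b, eps_s, _ = roc_curve(y_true, y_score)  # fpr = eps_B, tpr = eps_S
    auc = roc_auc_score(y_true, y_score)
    mask = eps_b > 0  # avoid division by zero at the tightest thresholds
    rejection = 1.0 / eps_b[mask]
    sic = eps_s[mask] / np.sqrt(eps_b[mask])
    return eps_s[mask], rejection, sic, auc
```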
5. Jet tagging
Jet tagging refers to the classification of the parent particle which gave rise to a jet. Linear discriminant methods were first applied to jet images defined using a single channel, or "grey-scale", with pixel intensities defined as the calorimeter cell p_T [1]. Subsequently, CNN based classifiers trained on single channel images were developed for discriminating between W/Z jets and quark / gluon jets [23, 27], between top jets and quark / gluon jets [10, 36–38], and for discriminating between quarks and gluons [39]. Quark / gluon discrimination with single-channel jet images has also been explored for use in heavy ion collisions [40]. The extension to utilizing jet images "in color" with multiple channels, defined for instance using charged particle information, has shown promising performance improvements over single channel approaches in many of these tasks [10, 39, 41–45], and has been explored in realistic experimental settings by the ATLAS and CMS collaborations [46, 47].

Fig. 3.: Average W boson jet image (left) and average quark / gluon jet image (right) after preprocessing [23].

5.1. Jet Tagging on Single Channel Jet Images
W/Z Tagging: The discrimination of boosted W and Z vector boson initiated jets from quark / gluon jets has served as a benchmark task in boosted jet tagging. The color singlet nature of electroweak bosons decaying to quark pairs leads to an internal structure of boosted W/Z jets in which there are typically two high energy clusters, or subjets, and additional (dipole) radiation tends to appear in the region between the subjets. The Higgs boson, also a color singlet with decays to quark pairs, has a similar substructure, although decays to heavy flavour bottom and charm quark pairs can lead to some structural differences, owing to the long lifetimes of such quarks and their harder fragmentation relative to lighter quarks. In contrast, single quarks and gluons tend to produce jets with a high energy core, lower energy secondary subjets created through radiative processes, as well as diffuse wide angle radiation further from the core of the jet. These features can be seen clearly in Figure 3, which shows the average W boson jet image and the average quark / gluon jet image after preprocessing.

Building ML models applied to jet images for this discrimination task avoids the explicit design of physics-inspired features, and rather focuses on the learning task of identifying differences in the jet image spatial energy distributions. In phenomenological studies, both fully convolutional models [23] and models with locally connected layers [27] have been examined for discriminating jet images of boosted W and Z vector boson initiated jets from quark / gluon jets. The CNN models were examined in simulated samples of jets without pileup. The locally connected models were examined in events both with and without pileup, thus enabling the examination of the impact of pileup noise on jet image based tagging.

Within both the studies on convolutional [23] and locally connected [27] models, hyperparameter scans were performed to find model parameters that maximized performance (in the case of the CNNs, the AUC was maximized, whilst the Spearmint Bayesian Optimization package [48] was used to optimize the model with locally connected layers). The hyperparameters considered in the scans included the number of convolutional / locally connected layers, the number of hidden units per layer, and the number of fully connected layers. The resulting optimized models were similar, containing 3 to 4 convolutional or locally connected layers, as well as 2 to 4 fully connected layers with approximately 300 to 400 hidden units at the end of the network. In the CNN, 32 filters were used in each convolutional layer, as well as (2 × 2) or (3 × 3) downsampling after each convolutional layer.
One notable additional optimization performed for the CNN models was the size of the convolution filters in the first layer. While filter sizes are typically (3 × 3) or (4 × 4) in standard CV applications, in the case of application to jet images it was found that a larger (11 × 11) filter in the first convolutional layer (with later layers using standard (3 × 3) filter sizes) resulted in the best performance. It was hypothesized that such large filters were beneficial when applied to sparse images [23], in order to ensure that some non-zero pixels are likely to be found within the image patch supporting the filter application.

The ROC curves indicating the performance of the CNN model and locally connected model (applied to jets with pileup included) are shown in Figure 4. It should be noted that the jets in these figures correspond to different p_T ranges, with jets of p_T ∈ [250, 300] GeV in the former and jets with p_T above 300 GeV in the latter. Comparisons are also shown with jet substructure features, including the jet mass combined with the ΔR between the leading p_T subjets, the n-subjettiness ratio τ21 [49], and the energy correlation function D2(β=2) [50]. Two-variable combinations were computed using 2D binned likelihood ratios. Both the CNN and locally connected models significantly outperform the 2D jet substructure feature combinations. It can also be seen that the jet image approach is not overly sensitive to the effects of pileup, as the large performance gain over jet substructure features persists both with and without the presence of pileup, owing to the use of jet trimming to reduce the impact of pileup noise in the jets. In addition, a Boosted Decision Tree (BDT) classifier [51] combining six substructure features was compared with the locally connected model and found to have similar performance. While these early jet image based models did not significantly outperform combinations of several jet substructure features, this may be due to their relatively small model structure. As will be seen, more complex architectures and the use of multi-channel jet images can lead to large gains over combinations of jet substructure features.

One can also see the effect of L¹ image normalization on CNN models, which appears to improve performance over unnormalized images. This effect was found to occur because the CNN model output was observed to have only small correlation with the jet mass, and thus the model was not learning to rely heavily on the jet mass information that is distorted by normalization. As a result, the regularization of the image variations due to normalization was found to be beneficial enough to overcome the induced distortion of the jet mass. With more powerful models that learn representations more correlated with the jet mass, this balance may not occur.
Top Tagging: The discrimination between boosted top quark jets and quark / gluon jets using CNNs applied to jet images has also been examined, both in phenomenological studies [36] and in realistic simulations by the CMS experiment [47]. Top quark jet images are structurally more complex than W/Z/h jet images, as hadronic decays of top quarks contain three quarks. This can have implications for both the preprocessing and the tagging performance. That is, some of the pre-processing steps previously defined will lead to uniformity among jet images for two-quark systems, such as the rotation step which aligns the leading two subjets, but may not lead to the same level of uniformity for three-quark systems.

Fig. 4.: ROC curves for quark/gluon background rejection versus boosted W boson tagging efficiency for (a) events without pileup [23], and (b) events with pileup [27]. The jet image based CNN taggers are seen to outperform combinations of jet substructure features, and to be stable with respect to the addition of pileup.

The DeepTop [36] model is a CNN applied to single channel jet images after the preprocessing described above, including image normalization. Hyperparameter optimization yielded a model with four convolutional layers, each with 8 filters of size (4 × 4), followed by fully connected layers with 64 hidden units each, trained with a mean squared error (MSE) loss. While structurally similar to the single channel CNN used for W/Z tagging in reference [23], there are some notable differences, such as the use of fewer filters (8 rather than 32) and the smaller filter size in the first layer of convolution. The reasons for these differences may be that (a) the presence of three quarks in the top quark decay leads to more pixel-populated images and thus allowed for the use of smaller initial filter sizes, or (b) the global nature of the hyperparameter scan, wherein the number of filters and the size of the filters were fixed to be the same across all convolutional layers.

The performance of the DeepTop model can be found in Figure 5a, in terms of the ROC curve comparing the quark / gluon rejection versus the boosted top jet tagging efficiency for jets with p_T ∈ [350, 450] GeV. Comparisons are shown with a tagger combining SoftDrop with n-subjettiness, as well as with a BDT, denoted MotherOfTaggers, combining several jet substructure features. The jet image based DeepTop algorithm showed clear performance gains over substructure approaches across most of the signal efficiency range.

Fig. 5.: ROC curves for quark / gluon jet rejection versus boosted top efficiency for (a) the DeepTop model [36], and (b) the updated DeepTop model from reference [42]. In both cases, the CNN based DeepTop models outperform individual and BDT combinations of substructure features, while the updated model in (b) is also seen to significantly improve the DeepTop performance.
As previously mentioned, pre-processing steps have the potential to be beneficial for the learning process by producing more uniform images, but may also lead to performance degradation. This was studied within the scope of the DeepTop algorithm, by examining the tagging performance using full preprocessing and a minimal preprocessing that only performed centering but not the rotation or the flipping. As can be seen in Figure 5a, a clear performance benefit was observed when utilizing only minimal pre-processing. While the full pre-processing may be beneficial for small sample sizes, with sufficient sample sizes and model complexity the CNN models appear able to learn well all the variations in jet images. In this case, the approximations introduced by pre-processing steps appear to be more detrimental than the benefits from uniformization of the jet image distributions.

Building upon the DeepTop design, developments in architecture design, jet image preprocessing, and optimization were introduced in the phenomenological study of reference [42]. These developments include: (i) the cross entropy loss function, rather than the mean squared error loss, was used, as it is more suitable to binary classification problems, (ii) a learning rate adaptive optimizer, AdaDelta [52], with small mini-batch sizes of 128, was used rather than vanilla stochastic gradient descent with large mini-batches of 1000, (iii) larger numbers of filters per convolutional layer, between 64 and 128 rather than 8, were used, along with 256 neurons in the dense layers instead of 64, (iv) preprocessing was performed before pixelation, under the assumption that one would have access to high resolution particle momentum measurements, for instance using Particle Flow [14] approaches to jet reconstruction, and (v) the training set size was increased by nearly a factor of 10. While the individual effects of these developments will be examined further in Section 5.2 when discussing top tagging on multi-channel jet images, the combination of these developments can be seen in Figure 5b to provide large performance improvements over DeepTop, of nearly a factor of two in background rejection at fixed signal efficiencies.

In terms of more complex architectures, the ResNeXt-50 architecture [53] was adapted to the boosted top jet tagging task using single channel jet images in the phenomenological studies of reference [10]. ResNeXt-50 utilizes blocks containing parallel convolutional layers that are aggregated and merged, together with a residual connection, at the end of each block. As jet images typically have fewer pixels than natural images, the architecture was adapted to the top tagging dataset by reducing the number of filters by a factor of four in all but the first convolutional layer, and dropout was added before the fully connected layer.
In addition, smaller pixel sizes in the jet images were utilized in this model, with a granularity of 0.025 radians in η−φ space (whereas the jet image granularity typically used in other models is 0.1 radians in η−φ space).

The ROC curve comparing the ResNeXt-50 model to a CNN based on references [36, 42], and to several other neural network models with varying architectures, can be found in Figure 6. The ResNeXt-50 model provides approximately 20% improvement in background rejection at fixed signal efficiency over the CNN model, and is among the most performant algorithms explored. This is notable as many of the other neural network models utilize particle 4-vectors as inputs, rather than particle information aggregated within a pixel cell, and make use of particle charge information, while the ResNeXt model only utilizes the distribution of energy within the jet. However, the ResNeXt-50 model contains nearly 1.5 million parameters, far more than other models such as the CNN, which contains ≈ 34k parameters. Thus powerful information for discrimination can be extracted with jet image based models even from single channel images, but it may come at the price of models with large parameter counts.

Fig. 6.: ROC curve comparisons of various boosted top tagging models [10]. Both the ResNeXt and CNN curves are jet image based taggers using CNN based architectures.

This model comparison study has been performed in a phenomenological setting on particle level simulations, and the ultimate question remains as to the suitability of these models for use in real experiment settings. In experimental settings, realistic detector noise, detection efficiency, detector heterogeneity, and data taking conditions such as pileup, underlying event, and beam evolution will impact the model performance. Powerful models, including the large ResNeXt and CNN models, will likely have sufficient flexibility to learn powerful discriminators even in these more challenging settings. However, in general it remains to be seen if these models can be accurate whilst maintaining a low calibration error (where calibration in this context refers to the criterion that the predicted class probabilities correspond to the true probabilities of a given data input having a given label) [54], or if additional care is needed to ensure calibration. Moreover, applications in real experimental settings must consider systematic uncertainties associated with training ML models in (high fidelity) simulation but applying them to real data with potentially different feature distributions. The relationship between model complexity and sensitivity to systematic uncertainties in real experiment settings still remains to be thoroughly explored. The potential benefits in terms of sensitivity to systematic uncertainties when using neural networks with different structural assumptions, such as convolutional versus graph models, also require further study and will likely depend on the details of how a given systematic uncertainty affects the feature distributions. Some exploration of these challenges can be found in Section 5.3, examining model sensitivity to theoretical uncertainties, and in Section 5.4, examining applications of these models in HEP experiments. Nonetheless, these remain important and exciting avenues of future work.
Decorrelated tagging with Jet Images:
A common strategy in HEP to search for a particle is the so-called bump hunt, in which the particle would give rise to a localized excess on top of a smoothly falling background in the distribution of the mass of reconstructed particle candidates. For instance, one may aim to identify the W boson mass peak over the quark and gluon background in the distribution of jet mass. In addition to the particle mass being localized, a key to this strategy is that the smoothly falling background mass distribution can typically be characterized with simple parametric functions, thus facilitating a fit of the data to identify the excess above this background. Jet classification methods can cause challenges for this strategy, as the classifier may preferentially select jets with a specific mass, thereby sculpting the selected jet mass distribution of the background and rendering the search strategy unusable. As a result, one line of work has focused on de-correlating classifiers from a sensitive feature (e.g. mass) such that the sensitive feature is not sculpted by the application of the tagger. Such methods tend to rely on data augmentation or regularization, and overviews of these methods can be found for instance in references [55, 56]. Two recent regularization techniques that have seen strong de-correlation capability include (i) adversarial techniques [57, 58], wherein a second neural network is trained simultaneously with the jet classifier to penalize the jet classifier when the value of the sensitive feature can be predicted from the classifier's output or its hidden representations, and (ii) distance correlation regularizers [59], wherein the jet classifier loss is augmented with an additional regularization term which explicitly computes the correlation between the classifier predictions and the sensitive feature. In both cases, the amount of penalization from the regularization can be varied through a hyperparameter scaling the relative size of the regularization term with respect to the classification loss.

De-correlation for W boson jet tagging with jet images using CNNs was examined in phenomenological studies in reference [59], using a CNN architecture similar to the model described in [42]. The quark and gluon background jet mass distribution before and after applying a threshold on the output of a CNN can be seen in Figure 7a, showing a clear sculpting of the mass distribution. However, when the distance correlation regularization, or DisCo, is used during training, the mass distribution remains largely unsculpted after applying a classification threshold. The level of de-correlation can be estimated by examining the agreement between the mass distribution before and after applying a classifier threshold, for instance using the Jensen-Shannon divergence (JSD) computed between the binned mass distributions. For classifier thresholds fixed to 50% signal efficiency, Figure 7b shows the JSD as a function of the background rejection, where the curves are produced through training with varying sizes of the regularization hyperparameter. The CNN models are compared with neural networks trained on substructure features and other classifiers with de-correlation methods applied. The CNN models, with either adversarial or distance correlation regularization, are seen to typically provide the highest background rejection for a given level of de-correlation compared to other models.
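The distance correlation penalty can be sketched in a few lines of PyTorch. This is a simplified, biased-sample-estimator version for illustration, and may differ in detail (e.g. sample weighting) from the implementation used in reference [59].

```python
import torch

def distance_correlation(a, b):
    """Sample distance correlation (squared) between 1D float tensors a and b,
    e.g. classifier outputs and jet masses within a mini-batch."""
    A = torch.cdist(a.view(-1, 1), a.view(-1, 1))  # pairwise |a_i - a_j|
    B = torch.cdist(b.view(-1, 1), b.view(-1, 1))
    # Double-center the distance matrices
    A = A - A.mean(dim=0, keepdim=True) - A.mean(dim=1, keepdim=True) + A.mean()
    B = B - B.mean(dim=0, keepdim=True) - B.mean(dim=1, keepdim=True) + B.mean()
    dcov2_ab = (A * B).mean()
    dcov2_aa = (A * A).mean()
    dcov2_bb = (B * B).mean()
    return dcov2_ab / torch.sqrt((dcov2_aa * dcov2_bb).clamp_min(1e-12))

# Regularized training loss; lambda_disco trades classification power
# against mass sculpting:
# loss = bce(preds, labels) + lambda_disco * distance_correlation(preds, jet_mass)
```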
5.2. Multi-Channel Jet Tagging with CNNs
Recent work on jet image based tagging has shown performance gainsthrough the use of multi-channel images. While single channel jet im-ages have provided gains in classification performance over individual, orpairings of, engineered substructure features, the performance benefits weretypically smaller when compared to ML models trained on larger groups ofsubstructure features (except when very large models were used, as in [10]).Multi-channel jet images use calorimeter images as only a single input im-age channel, with additional channels computed from charged particle fea-tures such as the momentum, multiplicity, or charge. There is a signifi-cant amount of freedom in choosing the definition of the additional imagechannels, allowing for a flexibility in the choice of inductive bias to deliverrelevant information to the CNN.One challenge in combining charged particle trajectory information andcalorimeter images is the mismatch in resolution; charged particle trajecto-ries tend to have a significantly finer spatial resolution than calorimeters,thus leading to the questions of how to combine such information. Ascharged particles are not measured on a regular grid, often the same spa-tial grid for the calorimeter component is used for the charged particleimage and the energy of the constituents is summed within each pixel. Al- M. Kagan
50 100 150 200 250mass [GeV]10 n o r m a li z e d c o un t s before cutafter cut, no decor.after cut, DisCo (a) R / J S D -DDT D D -kNNAdaboostuBoostDNNDNN+planingDNN+adversaryDNN+distance correlationCNNCNN+planingCNN+adversaryCNN+distance correlation (b) Fig. 7.: (a) For boosted W boson tagging, the jet mass is shown beforeapplying a threshold on a trained CNN tagger and after applying a thresh-old on a standard and mass decorrelated tagger [59]. A clear reduction inmass sculpting is observed. (b) The rejection at 50% signal efficiency versusone over the Jensen-Shannon divergence, computed on the binned jet massdistribution before and after tagging, is shown for various taggers [59]. Jetimage based CNN taggers are seen to outperform other methods, eitherusing adversarial or distance-correlation based mass decorrelation.ternatively, separate CNN blocks (or upsampling procedures) can be usedto process charged and calorimeter images separately into a latent repre-sentation of equal size such that they can be merged for further processing.Note that when Particle Flow objects are used, and thus both neutral andcharged particle measurements do no necessarily fall on a grid, a fine gridcan be used to exploit the better charged particle momentum resolution. Itshould also be noted that while phenomenological studies at particle-leveloften use fixed grids to emulate the discretization of real detectors, differ-ent inputs (i.e. charge vs neutral) in real detector settings have differentresolutions which may be difficult to account for in simple discretizationapproaches.Multi-Channel jet image based tagging was introduced in phenomeno-logical studies of discriminating between quark initiated and gluon initiatedjets [39, 41] and has since been explored within the quark vs. gluon contexton the ATLAS experiment [46], in CMS Open Data [60, 61], and for tag-ging in heavy ion collision environments [40]. More broadly, multi-channel mage-Based Jet Analysis jet image tagging has lead to improved performance in phenomenologicalstudies of boosted top quark jet tagging [10, 37, 42], as well as in boosted W/Z jet tagging [43] and in boosted Higgs boson tagging [44, 45]. Notably,multi-channel jet image based boosted top tagging has been explored onthe CMS experiment [47] including the comparison and calibration of thisdiscriminant with respect to CMS collision data, thus adding additionalinsights into the usability of such models within LHC data analysis.The use of multi-channel jet images built from charged particle mo-mentum and multiplicity information within the context of discriminatingbetween quarks and gluons is natural, as the number of charged particleswithin such a jet is known to be a powerful discriminant for this challengingtask [62]. As such, in the phenomenological studies of reference [39] threejet image channels were defined: (1) the transverse momentum of chargedparticles within each pixel, (2) the transverse momentum of neutral parti-cles within each pixel, and (3) the charged particle multiplicity within eachpixel. The same pixel size was used in each image, thus facilitating thedirect application of multi-channel CNNs. This approach thus relies on theability to separate the charged and neutral components of a jet; while thecharged component is measured using tracking detectors, the unique iden-tification of the neutral component of a jet is significantly more challengingtask. 
The benefit of the multi-channel approach for quark vs. gluon discrimination can be seen in the ROC and SIC curves in Figure 8. Both the calorimeter only approach, denoted Deep CNN grayscale, and the multi-channel approach, denoted Deep CNN w/ color, outperform single features engineered for this task, BDTs trained on five such features, and a linear discriminant trained on the grayscale jet images. In addition, the multi-channel model is seen to dominate over the single channel model in both the ROC and SIC curves across the range of jet momenta studied, from roughly 100 GeV to 1000 GeV.

Multi-channel jet images have also been used for the discrimination of boosted Higgs boson jets decaying to b ¯b from background jets in multi-jet events [44].
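Both types of curves in Figure 8 can be obtained from a simple threshold scan over classifier outputs; a minimal sketch:

```python
import numpy as np

def roc_and_sic(scores_sig, scores_bkg, n_thresholds=200):
    """Threshold scan over classifier scores, returning the signal
    efficiency eps_S, background efficiency eps_B, and significance
    improvement SIC = eps_S / sqrt(eps_B) at each threshold."""
    all_scores = np.concatenate([scores_sig, scores_bkg])
    thresholds = np.linspace(all_scores.min(), all_scores.max(), n_thresholds)
    eps_s = np.array([(scores_sig > t).mean() for t in thresholds])
    eps_b = np.array([(scores_bkg > t).mean() for t in thresholds])
    sic = np.divide(eps_s, np.sqrt(eps_b),
                    out=np.zeros_like(eps_s), where=eps_b > 0)
    return eps_s, eps_b, sic
```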
Fig. 8.: ROC curve (a) and SIC curve (b) for quark versus gluon tagging using multi-channel jet images [39]. Comparisons with jet substructure based discriminants are shown in (a), while a comparison between single channel and multi-channel jet image based tagging with CNNs is shown in (b).

In addition to a CNN focused on discrimination based on jet images, the study of reference [44] also explored simultaneously processing an event image, defined using the aforementioned three channels over the entire calorimeter, through a separate set of convolutional layers and combining with the output of the convolutional processing of the jet image before discrimination. By including such an event image, one may explore the potential benefits of event topology information outside of the jet image for discrimination. The SIC curve for this discrimination task can be seen in Figure 9, where the CNN approaches were seen to significantly outperform single engineered features. CNNs using only the jet image, only the event image, or both (denoted "Full CNN Architecture" in Figure 9) were compared, showing that much of the discrimination power rests in the jet image whilst the event image may provide some modest improvements. In addition, the jet image discrimination without the neutral particle channel was found to be comparable to that using the neutral channel, indicating that much of the discrimination power lies in the charged particle information within the jet.

While a clear approach to extending jet images to contain multiple channels is to sum the momentum of the charged particles or compute multiplicities in each pixel to form an image channel, the high resolution of the charged particle information allows for the introduction of additional inductive bias. More specifically, given the set of charged particles contained in the region of a pixel, one may compute pixel-level features that may be more amenable to a given discrimination task.
Fig. 9.: SIC curve for boosted Higgs to b ¯b versus QCD background tagging, for jets with p_T,Higgs > 450 GeV, using multi-channel jet images [44]. Models using only jet images, and models using both jet images and "event images", are shown.

This approach was followed for building CNNs to discriminate between (a) up and down type quarks, and (b) quarks and gluons [41]. In these phenomenological studies, knowledge of the utility of the jet charge feature [63-65] for discriminating jets of different parent particle charge inspired the development of a jet image channel computed per pixel as the p_T weighted charge

Q_κ = ( ∑_j Q^(j) ( p_T^(j) )^κ ) / ( ∑_j p_T^(j) )^κ ,

where the sums run over the charged particles within the pixel. The SIC curve showing the performance of the CNN trained on the two channel jet images, one channel for the p_T and one for Q_κ per pixel, is shown in Figure 10a. The two channel CNN significantly outperformed the total jet charge and classifiers trained on engineered features, and is comparable to other deep architectures trained for this task.
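A minimal sketch of such a per pixel charge channel (the grid choices are illustrative, and κ is a hyperparameter of the charge definition):

```python
import numpy as np

def charge_image(eta, phi, pt, charge, kappa=0.2, n_pixels=33, half_width=0.8):
    """Per-pixel pT-weighted jet charge: in each pixel,
    Q_kappa = sum_j Q_j * pT_j**kappa / (sum_j pT_j)**kappa,
    summing over the charged particles falling in that pixel."""
    bins = np.linspace(-half_width, half_width, n_pixels + 1)
    num, _, _ = np.histogram2d(eta, phi, bins=(bins, bins),
                               weights=charge * pt**kappa)
    den, _, _ = np.histogram2d(eta, phi, bins=(bins, bins), weights=pt)
    return np.where(den > 0, num / np.maximum(den, 1e-12)**kappa, 0.0)
```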
Fig. 10.: SIC curves for (a) discriminating down quarks from up quarks [41], and (b) discriminating between W+ and W− bosons [43]. The CNNs take as input a p_T image together with a p_T weighted charge image with a fixed value of κ.

A similar jet charge based multi-channel CNN was explored for discriminating between boosted W+/W−/Z boson jets in the phenomenological studies of reference [43]. The per pixel charge image, averaged over the test set, for W+, W−, and Z jet images is shown in Figure 11. The geometry of all three images is similar, but the average per pixel charge differs significantly, as expected, with the typical W+ image carrying positive pixel values, the typical W− image carrying negative pixel values, and the typical Z image having charge close to zero. The SIC curve for discriminating between W+ and W− jets can be seen in Figure 10b. Two CNNs were explored in this work: one, denoted CNN, in which the p_T and Q_κ images are processed together (i.e. as a single multi-channel image processed by convolutional layers), and a second in which each channel is processed by a separate stack of convolutional layers before being combined ahead of the classification layers. Both CNNs significantly outperform methods based on engineered features.

Fig. 11.: The average image of the per pixel p_T weighted charge Q_κ is shown for W+ (left), W− (middle), and Z bosons (right) [43].

Multi-channel jet images were explored for top tagging in the phenomenological studies of reference [42], using four channel jet images defined with the neutral jet component as measured by the calorimeter, the charged particle summed p_T per pixel, the charged particle multiplicity per pixel, and the muon multiplicity per pixel. The architecture is discussed in Section 5.1 within the context of single channel jet images. The inclusion of the muon image channel targets the identification of b quark initiated subjets within the top jet, as muons can be produced in b-hadron decays. As noted in Section 5.1, several changes to the model architecture, preprocessing, and training procedure relative to the first proposed DeepTop model [36] were included in this work. The impact of these individual changes can be seen in Figure 12, wherein developments on top of the first proposed DeepTop model are sequentially added and the resulting ROC curve is shown. The inclusion of multiple "color" channels was only seen to provide modest performance gains over single channel jet images. Notable among the changes that led to the largest improvements were changing the optimization objective to be more suitable for classification tasks and changing the optimizer to Adam (denoted training in the figure), increasing the model size (denoted architecture in the figure), and increasing the training sample size. In agreement with these results, recent CNN models built for processing jet images have also tended to focus on larger models trained on large samples.
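As an illustration of the "training" improvements described above, a minimal sketch of a multi-channel jet image CNN optimized for classification with a cross-entropy objective and the Adam optimizer (in PyTorch; the architecture shown is an illustrative assumption, not the DeepTop architecture):

```python
import torch
import torch.nn as nn

# Small CNN over 4-channel jet images, producing a single logit.
model = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=4, padding="same"), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=4, padding="same"), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(64), nn.ReLU(),
    nn.Linear(64, 1),  # logit for top vs. QCD
)
loss_fn = nn.BCEWithLogitsLoss()  # classification-suited objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(images, labels):
    """images: (batch, 4, H, W) multi-channel jet images; labels: (batch,)."""
    optimizer.zero_grad()
    logits = model(images).squeeze(1)
    loss = loss_fn(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```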
Fig. 12.: ROC curve of boosted top jet tagging efficiency versus background quark and gluon rejection for the minimal DeepTop model [36], compared with models sequentially including the changes proposed in [42].
5.3. Sensitivity to Theory Uncertainties
While matrix element and parton shower Monte Carlo generators often provide high fidelity predictions of the data generation process, they provide only approximations of the scattering and showering processes and empirical models of the hadronization process. As such, uncertainties in the theoretical predictions of these generators must be propagated to downstream analyses. One mechanism for doing this is to compare an observable computed with samples from different Monte Carlo generators. While not a precise estimation of theoretical uncertainty, this comparison can provide a test of whether an observable is potentially sensitive to the differing approximations of the different generators.

This sensitivity has been examined for CNN-based taggers operating on jet images in several works; we focus here on W-tagging in a phenomenological study [66] and on quark / gluon tagging using ATLAS simulation [46]. As the CNN + jet image approaches utilize the distribution of energy throughout a jet image to discriminate, one concern is that the differences in modeling of the jet formation process by different generators may lead to large performance variations. To study this, reference [66] trained a CNN model on boosted W boson jet images generated by Pythia [67, 68] and applied this trained model to samples of boosted W boson jet images generated by different Monte Carlo generators. The ROC curves of the performance can be seen in Figure 13a, wherein, at the same signal efficiency, reductions of background rejection of up to 50% can be seen when this tagger is applied to different generators. While such a variation is not ideal, it should be noted that similar variations were seen when a tagger built only from substructure features, a binned two dimensional signal over background likelihood ratio of the distribution of jet mass and the N-subjettiness ratio τ21, is applied for the same tagging task. Similar levels of performance variation are also seen in the ROC curves built for quark vs gluon tagging in ATLAS simulation with a CNN trained on Pythia jet images applied to Herwig [69] generated jet images, as seen in Figure 13b. Interestingly, when the test is reversed and the CNN is trained on jet images from Herwig and applied to jet images from Pythia, the tagging performance is similar to the CNN trained and applied to Pythia jet images. This suggests that the CNNs in both cases are learning similar representations of information useful for quark vs gluon tagging, but that the degree to which this information is expressed in the jet images varies between generators [46]. Thus while these studies show that CNNs applied to different samples may vary in performance, there may be an underlying robustness to the information learned by CNNs for jet tagging.
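In practice, such generator robustness tests amount to evaluating a single trained tagger on test samples from each generator and comparing a summary metric, such as the background rejection at fixed signal efficiency; a minimal sketch:

```python
import numpy as np

def rejection_at_efficiency(scores_sig, scores_bkg, target_eff=0.5):
    """Background rejection 1/eps_B at fixed signal efficiency, for a
    fixed tagger evaluated on a given pair of test samples."""
    cut = np.quantile(scores_sig, 1.0 - target_eff)  # keeps target_eff of signal
    eps_b = (scores_bkg > cut).mean()
    return 1.0 / eps_b if eps_b > 0 else np.inf

# e.g. compare rejection_at_efficiency(sig_pythia, bkg_pythia) with
#      rejection_at_efficiency(sig_herwig, bkg_herwig) for the same model
```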
Fig. 13.: (a) ROC curves for boosted W tagging with a jet image based CNN tagger trained on Pythia generated jet images and applied to jet images from various generators [66]. (b) ROC curves for quark versus gluon tagging with jet image based CNN taggers trained and applied on Pythia and Herwig based jet images, in events with the full ATLAS detector simulation [46].

Beyond the potential tagging variations due to generator uncertainties, a key question when developing a jet observable of any kind is whether such an observable is theoretically sound and calculable. This is often expressed as whether the observable is infrared and collinear (IRC) safe. IRC safety for jet image based tagging of boosted top jets with CNNs has been examined empirically in the phenomenological studies of reference [70]. In this work, within the context of boosted top jet tagging using a jet image based CNN, a feature denoted ∆NN is studied which explores the impact of merging soft/collinear radiation with nearby partons. ∆NN is constructed as follows: (a) a CNN is trained on particle level jet images for boosted top tagging, (b) parton level jet images are generated for boosted top decays without (unmerged) and with (merged) adding the closest gluon to a top quark parton before forming the image, and (c) the difference in CNN output between the unmerged and merged jet images is defined as ∆NN. By examining the distribution of ∆NN and its variation with features that probe soft or collinear effects, the sensitivity of the CNN tagger to IRC effects can be studied empirically. This can be seen in Figure 14, where the 2D distributions of ∆NN versus the gluon relative transverse momentum, and versus the ∆R to the nearest parton, are shown. As either the gluon relative momentum or the ∆R tends to zero, the ∆NN distribution tends towards a sharp peak at 0, indicative of the CNN being insensitive to IRC perturbations.
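A minimal sketch of this probe, assuming a trained tagger `model` that maps a batch of jet images to logits (names and framework are illustrative):

```python
import torch

def delta_nn(model, unmerged_images, merged_images):
    """Empirical IRC-sensitivity probe: absolute difference in tagger
    output between parton-level jet images built without and with
    merging the nearest gluon into a top-decay parton."""
    model.eval()
    with torch.no_grad():
        out_unmerged = torch.sigmoid(model(unmerged_images)).squeeze(1)
        out_merged = torch.sigmoid(model(merged_images)).squeeze(1)
    return (out_unmerged - out_merged).abs()
```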
Fig. 14.: ∆NN versus (a) the gluon relative transverse momentum and (b) the ∆R between the gluon and the nearest top decay parton. ∆NN is the difference between the output of a particle level trained CNN applied on parton level jet images with and without merging the closest gluon with a top decay parton [70]. Red points denote the point at which 90% of events within a vertical slice of the distribution are contained.

5.4. Jet Images in LHC Experiments

The ultimate tests of the efficacy of jet image based tagging approaches are that the performance observed in phenomenological studies is also observed in realistic high fidelity simulations, and that this performance generalizes to real data without large systematic uncertainties. With that in mind, jet image based tagging approaches have been examined for quark vs gluon tagging in ATLAS simulations [46] and in CMS Open Data [60] simulated samples [61], and for boosted top quark jet tagging in CMS simulation and real data [47].

The ATLAS quark vs gluon jet image based CNN tagger [46] was trained using fully simulated ATLAS events [71, 72]. Multi-channel jet images were used, with one channel containing an image of the sum of measured charged particle track p_T per pixel. A second image for calorimeter measurements was examined in two forms: a jet image containing the transverse energy measured in calorimeter towers of size ∆η × ∆φ = 0.1 × 0.1, or a jet image built from topo-cluster measurements projected onto a grid. The CNN used convolutional layers with filters of size 5 × 5, 5 × 5, and 3 × 3, respectively, with max pooling after each convolutional layer. As can be seen in the ROC curve in Figure 15a, the CNN processing the track + tower jet images outperforms other standard taggers for quark vs gluon tagging. Interestingly, the standard tagger based on the combination of two jet substructure features (the number of charged particles and the jet width) outperforms the CNN approach at low quark efficiency. This is likely due to the track image discretization, which may result in multiple tracks falling in the same pixel; as track multiplicity is not stored in the images, this useful discriminating information is lost to the CNN. In Figure 15b, the impact on performance of utilizing different jet image channels is examined, wherein utilizing only calorimeter based jet images provides significantly lower performance than tagging using track and calorimeter images. In addition, topo-cluster based images, which are formed by projecting the continuous topo-cluster direction estimates into a discrete grid, are seen to have lower performance than tower based images. This is likely due to the projection onto a fixed grid for use in a CNN, as this may cause a loss of information about the spatial distribution of energy within a topo-cluster and may result in the overlap of several clusters in the same pixel. Moreover, it can be seen that the track + calorimeter image approach does not reach the performance found when a CNN is trained on a jet image formed from truth particles (i.e. without the impact of detector smearing). It was noted in [46] that when comparing the performance of a CNN trained on only track images to a CNN trained on only charged truth particles, the observed performances were extremely similar. This similarity is driven by the excellent charged particle track resolution, and further indicates that the difference between the track + calorimeter jet image based CNN tagger and the truth particle based CNN tagger is driven by the low resolution, and thus loss of information, of the calorimeter.

Fig. 15.: ROC curves for quark jet efficiency versus gluon jet rejection in ATLAS fully simulated datasets, showing comparisons of (a) jet image based CNN taggers against jet width and number-of-tracks discriminants, and (b) jet image based CNN taggers trained with different input images [46].

The CMS boosted top jet image based CNN tagger [47], denoted ImageTop, was trained on fully simulated CMS events [71]. Multi-channel jet images with six channels were built using particle flow (PF) objects found within an R = 0.8 jet, with channels formed from the summed candidate p_T per pixel: one channel containing all PF candidates, and one channel for each PF candidate flavor, i.e. charged, neutral, photon, electron, and muon candidates. ImageTop was based on the multi-channel DeepTop algorithm [42], and comprises four convolutional layers, each using 4 × 4 filters. To aid in identifying subjets originating from b quarks, a b-tagging identification score [73] evaluated on subjets of the large jet was also fed as input to the dense layers of the tagger before classification. In addition to the baseline ImageTop, a mass decorrelated version denoted ImageTop-MD was also trained, wherein the mass decorrelation was performed by down-sampling the background quark and gluon jet samples to have the same mass distribution as the sample of boosted top jets used for training. In this way, the discriminating information from the jet mass is removed to first order. (As the authors note, though this method is not guaranteed to remove tagger mass dependence, it was found to work sufficiently well in this case, as the baseline tagger inputs were not observed to have a strong correlation with mass.)
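A minimal sketch of this down-sampling style of mass decorrelation, assuming arrays of jet masses for the background and signal samples (binning choices are illustrative):

```python
import numpy as np

def downsample_to_match_mass(mass_bkg, mass_sig, n_bins=50, rng=None):
    """Down-sample background jets so their mass histogram matches the
    signal's shape, removing first-order mass information from training."""
    rng = rng or np.random.default_rng(0)
    bins = np.histogram_bin_edges(np.concatenate([mass_bkg, mass_sig]), n_bins)
    h_bkg, _ = np.histogram(mass_bkg, bins)
    h_sig, _ = np.histogram(mass_sig, bins)
    # per-bin keep probability proportional to the signal/background ratio
    ratio = np.where(h_bkg > 0, h_sig / np.maximum(h_bkg, 1), 0.0)
    keep_prob = ratio / ratio.max()
    idx = np.clip(np.digitize(mass_bkg, bins) - 1, 0, n_bins - 1)
    keep = rng.random(len(mass_bkg)) < keep_prob[idx]
    return keep  # boolean mask selecting the decorrelated background sample
```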
The ROC curves showing the performance of the ImageTop model are seen in Figure 16a. Several algorithms were compared to ImageTop, including several jet substructure feature based taggers and a deep neural network, denoted DeepAK8, based on processing PF candidates. ImageTop is seen to outperform all other algorithms except DeepAK8, and generally the deep network based taggers are found to significantly outperform the other algorithms. Moreover, once mass decorrelation is included, ImageTop-MD is found to be the highest performing mass-decorrelated model. The smaller change in performance due to mass decorrelation of ImageTop relative to other algorithms such as DeepAK8 may be due to the image preprocessing: images are both normalized and "zoomed" using a Lorentz boost determined by the jet p_T to increase the uniformity of jet images over the p_T range. These steps can result in a reduction of mass information in the images and thus a reduction of the learned dependence of ImageTop on the mass. The mass spectrum for background quark and gluon jets before (in grey) and after (in green) applying a 30% signal efficiency tagging threshold for ImageTop and ImageTop-MD can be seen in Figure 16b. The decorrelation method greatly helped to preserve the mass distribution and was not seen to significantly degrade performance, as seen in the ROC curves of Figure 16a.

Fig. 16.: Examination of the CMS ImageTop tagger [47] trained on fully simulated CMS events: (a) ROC curves of the quark / gluon jet efficiency versus boosted top jet tagging efficiency, comparing several taggers and showing the dominant performance of the deep neural network based taggers, and (b) the impact of applying a threshold on tagger outputs on the background jet mass distribution, wherein the mass decorrelated taggers show significantly less sculpting.
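A minimal sketch of such p_T dependent "zooming" (the reference scale and output size are illustrative assumptions, not the values used by ImageTop):

```python
import numpy as np
from scipy.ndimage import zoom

def zoom_jet_image(image, jet_pt, ref_pt=500.0, out_size=37):
    """Rescale the image by a pT-dependent factor, so that subjets of
    jets with different momenta (opening angle ~ 2m/pT) appear at more
    uniform separations, then center-crop or pad back to a fixed size."""
    factor = max(jet_pt / ref_pt, 1.0)
    z = zoom(image, factor, order=1)  # bilinear rescaling
    h, w = z.shape
    if h >= out_size:  # center crop
        i0, j0 = (h - out_size) // 2, (w - out_size) // 2
        return z[i0:i0 + out_size, j0:j0 + out_size]
    out = np.zeros((out_size, out_size))  # center pad
    i0, j0 = (out_size - h) // 2, (out_size - w) // 2
    out[i0:i0 + h, j0:j0 + w] = z
    return out
```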
As noted earlier, one concern with jet image based approaches to jet tagging is their potential dependence on pileup conditions. For a fixed ImageTop tagging threshold giving an inclusive 30% top jet tagging efficiency, the variation of the top jet tagging efficiency as a function of the number of primary vertices in the event can be seen in Figure 17. Efficiency variations for both ImageTop and ImageTop-MD were found to be small, at the level of less than 1%, across the range of the number of primary vertices. A similar level of stability was observed for the background mis-identification rate. This stability draws largely from the pileup mitigation applied to the jet before creating the jet images, and is not disturbed by the CNN discriminant.

Fig. 17.: Variations of the boosted top jet tagging efficiency as a function of the number of reconstructed vertices in an event, after applying a fixed tagger output threshold, for the CMS ImageTop tagger [47] and several other taggers trained on fully simulated CMS events.

While the simulation based training of classifiers can lead to powerful discriminants, differences in feature distributions between data and simulation could cause a tagger to perform differently in data and in simulation. As such, the discriminant is typically calibrated before application in data. Calibration entails defining control samples of jets in data where the tagging efficiency and mis-identification rate can be measured in both data and simulation. The efficiency of the tagger as a function of jet p_T is evaluated in data and simulation, and a p_T dependent ratio of efficiencies, known as a Scale Factor (SF), is derived. This SF can then be used to weight events such that the simulation trained tagger efficiency matches the data. The SFs for the ImageTop signal efficiency were estimated in a sample of single muon events selected to have a high purity of top-pair events in the 1-lepton decay channel, while the quark and gluon background mis-identification rates were estimated in dijet samples and in samples of photons recoiling off of jets. Systematic uncertainties were evaluated on the data based estimation of the tagging efficiency and propagated to SF uncertainties. These systematic uncertainties included theory uncertainties in the parton showering model, the renormalization and factorization scales, and the parton distribution functions, as well as experimental uncertainties on the jet energy scale and resolution, the unclustered energy contribution to the missing transverse momentum, trigger and lepton identification, pileup modeling, and integrated luminosity, together with the statistical uncertainties of the simulated samples.

Fig. 18.: Calibration scale factors as a function of jet p_T for (a) the top jet tagging efficiency in single muon events, and (b) the quark / gluon jet mistag efficiency in dijet events, for the CMS ImageTop tagger [47] trained on fully simulated CMS events and calibrated to data.

The scale factors for ImageTop and ImageTop-MD, for both the top tagging efficiency and the background mis-identification rate, can be found in Figure 18. The signal efficiency scale factors were largest at low momentum, showing a departure from unity of around 10%, but were significantly closer to unity in essentially all other p_T ranges. The systematic uncertainties ranged from approximately 5-10%, with the largest uncertainties at low p_T. The scale factors for the mis-identification rate tended to be larger, up to a 20% departure from unity in dijet samples, but with smaller scale factors in the photon+jet samples. These calibrations indicate that while some departures of the scale factors from unity are observed, they are largely consistent with observations from other taggers; the situation is similar for the scale factor uncertainties. As such, the jet image and CNN based tagging approach can be seen to work well in data, without extremely large calibrations and uncertainties, indicating its viability for use in analysis.
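A minimal sketch of the p_T-binned scale factor computation described above, assuming per-jet pass/fail flags and p_T values for the data and simulation control samples:

```python
import numpy as np

def scale_factors(passed_data, pt_data, passed_sim, pt_sim, pt_bins):
    """pT-binned data/simulation tagging-efficiency ratio (Scale Factor)."""
    def binned_eff(passed, pt):
        idx = np.digitize(pt, pt_bins) - 1
        return np.array([passed[idx == i].mean() if np.any(idx == i) else np.nan
                         for i in range(len(pt_bins) - 1)])
    return binned_eff(passed_data, pt_data) / binned_eff(passed_sim, pt_sim)
```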
6. Understanding Jet Image Based Tagging

Interpretability and explainability are vital when applying ML methods to physics analysis, in order to ensure that (i) reasonable and physical information is being used for discrimination rather than spurious features of the data, and (ii) when training models on simulation, the models are not highly reliant on information that may be mismodeled with respect to real data. Interpretability and explainability of deep neural networks is highly challenging and is an active area of research within the ML community [74]. While a large number of techniques exist for examining CNNs, a subset of the techniques from the ML community have been applied within the study of jet images. A benefit of the computer vision approach to jet analysis is that while the data input to ML models may be high dimensional, in this case with a large number of pixels, the inputs can be visualized on the image grid for inspection and interpretation. The tools for interpreting CNN models applied to jet images thus tend to center on this aspect, with tools such as pixel-discriminant correlation maps, filter examination, and finding images that maximally activate neurons.

Given a jet image x with pixel values {x_ij} and discriminant c(x), one can examine how changes to the input affect the discriminant prediction. Correlation maps examine the Pearson correlation coefficient between each pixel and the discriminant prediction, thus probing how each input feature is correlated with increases and decreases of the prediction over a sample of inputs. For a sample of N inputs, the correlation map is computed as

ρ_ij = [ 1 / (σ_{x_ij} σ_c) ] (1/N) ∑_{k=1}^{N} ( x_ij^(k) − x̄_ij ) ( c(x^(k)) − c̄ ),

where x̄_ij = (1/N) ∑_{k=1}^{N} x_ij^(k) and c̄ = (1/N) ∑_{k=1}^{N} c(x^(k)) are the mean pixel and prediction values, while σ²_{x_ij} = (1/N) ∑_{k=1}^{N} ( x_ij^(k) − x̄_ij )² and σ²_c = (1/N) ∑_{k=1}^{N} ( c(x^(k)) − c̄ )² are the variances of the pixel and prediction values.
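A minimal sketch of such a correlation map over a sample of images (numpy; shapes are illustrative):

```python
import numpy as np

def correlation_map(images, scores):
    """Pearson correlation of each pixel with the discriminant output:
    images has shape (N, H, W) and scores has shape (N,)."""
    x = images.reshape(len(images), -1)          # (N, H*W)
    x_c = x - x.mean(axis=0)                     # centered pixels
    s_c = scores - scores.mean()                 # centered predictions
    cov = (x_c * s_c[:, None]).mean(axis=0)      # per-pixel covariance
    rho = cov / (x.std(axis=0) * scores.std() + 1e-12)
    return rho.reshape(images.shape[1:])         # (H, W) correlation map
```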
The filters of a CNN perform local feature matching and are applied directly to the pixels of the image (or of a convolved image), and thus one may plot each filter as an image and examine what features each filter is targeting. As there can be a large number of filters at each CNN layer, as well as a large number of channels in layers deep within a CNN, this approach tends to be easiest at the first layers of the CNN. In addition, rather than examining the filters themselves, after processing an image with a CNN model one may examine the output of any given filter. This produces a convolved image in which the local feature matching has been applied at each position of the image, highlighting the locations of the image in which a given filter has become active. In order to highlight differences in convolved images between classes, the difference between the average convolved images of two classes can reveal relative differences in the spatial location of information relevant for discrimination.

Maximally activating images or image patches correspond to applying a CNN model on a large set of images and finding the images, or image patches, that cause a given neuron to output a large activation. In the case of neurons in the fully connected layers at the end of the network, this corresponds to full images, whilst for neurons in convolutional layers this corresponds to the image patches in which the neuron is most active.

6.1. Probing CNNs

In Figure 19, the filters in the CNNs for W tagging [23] and top tagging [36] with jet images are examined. Several filters from the first convolutional layer of the CNN for W tagging are shown in the top row of Figure 19a, and the bottom row shows the corresponding difference between the average convolved signal and background images resulting from applying each filter. While the filters are not easy to interpret, one can see dark regions of the filters corresponding to relative locations of large energy depositions in the jet image, as well as some intensity gradients that help identify regions where additional radiation may be expected. After applying the filters to sets of signal and background images and taking the difference of the average convolutions of each sample, one can explore how each filter finds different information in signal-like and background-like images. The more signal-like regions are shown in red, while the more background-like regions are shown in blue. The blue region at the centers identifies wider energy depositions at the center of the jet image, whilst the signal-like regions at the bottom of such images identify common locations of the subleading energy deposition. There is a strong focus on identifying signal-like radiation between the leading two energy depositions. Similarly for the DeepTop model, in Figure 19b one can see the convolved average image difference for several filters at each layer of the model, where the rows correspond to layer depth from top to bottom. We again see the tendency for the central region to be background-like, whilst the signal-like regions correspond to different locations of the subleading subjet and radiation between the two leading subjets. One can also see broader radiation patterns which vary depending on the location of the subleading subjet and attempt to identify likely locations of additional subjets in the image.

Fig. 19.: (a) Filters from the first convolutional layer for boosted W tagging with a jet image based CNN tagger [23] are shown in the top row, while the bottom row shows the average difference between signal and background convolved images from the corresponding filter in the top row. (b) The average difference between signal and background convolved images for several filters of the DeepTop jet image based CNN tagger for boosted top jets [36].

Figure 20 examines the average of the 500 images that lead to the highest activation of each of several neurons in the last (dense) layer of the CNN for W tagging [23]. The fraction of signal jet images in this sample is also noted, and the images are ordered left to right in terms of this signal fraction. The neuron that activates predominantly on signal jet images has a clear two prong structure and a tight core between the two prongs where radiation is expected. The neuron activating predominantly on background jet images shows a very different pattern, with a much broader central region where energy may be found and a broad ring around the central region where additional wide angle radiation may be present. These features are in agreement with the known physics of such jets.
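A minimal sketch of finding and averaging the maximally activating images for a chosen neuron (assuming a helper `neuron_activation` mapping an image batch to that neuron's scalar activation; names are illustrative):

```python
import torch

def top_activating_average(neuron_activation, images, k=500):
    """Average the k images that most activate a given neuron (cf. Fig. 20)."""
    with torch.no_grad():
        acts = neuron_activation(images)     # shape (N,)
    top_idx = torch.topk(acts, k).indices    # indices of k largest activations
    return images[top_idx].mean(dim=0)       # average jet image
```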
Fig. 20.: The average jet images which most activate given neurons in the final layer of a jet image based CNN for boosted W tagging [23]. The fraction of signal events for each neuron is noted, indicating whether the neuron was most activated by signal-like or background-like images.

For a more global view of what the discriminant has learned, one can examine the correlation maps for the CNN W tagger [23] and the DeepTop model [36], using full preprocessing, in Figure 21. Structurally they are quite similar (d), however the regions of signal (red) and background (blue) correlation appear inverted. For W tagging, the location of the subleading subjet at the bottom of the image is a strong indicator of signal, owing to the fact that W jets have a two particle decay structure which strongly restricts the relative location of the two subjets for a fixed jet p_T. This relative location is not as strict in quark/gluon jets and may vary due to additional radiation. The region around the central core of the jet is correlated with background-like images, where additional radiation may be found. For top tagging, a strong energy deposition above the central leading energy deposition, as well as additional energy depositions, i.e. the third expected subjet in a top quark decay, are correlated with signal-like images. This correlation pattern indicates that the discriminant relies heavily on the identification of the third subjet, as would be expected in a top quark decay.

(d) The relative location of the second subjet was rotated to be below the leading subjet in the case of W tagging and above the leading subjet for DeepTop, which leads to the apparent flip of the correlation images over the horizontal axis.

Fig. 21.: Correlation images showing the Pearson linear correlation coefficient per pixel between jet image pixels and a jet image based CNN tagger output for (a) boosted W tagging [23], and (b) boosted top tagging with DeepTop [36].

7. Other Applications of Jet Images

In addition to classification tasks, the approach of using jet images and convolutional layers for processing has also been explored for several other data analysis challenges. We briefly examine some of these applications, showing how this computer vision approach to jet analysis can be powerful in a variety of settings.

7.1. Jet Energy Regression and Pileup Removal

Among the major challenges facing analyses utilizing jets at high luminosity hadron colliders is the presence of pileup, i.e. interactions occurring in the same bunch crossing as the primary hard scattering. Pileup interactions lead to additional particles which may fall within the catchment area of a jet and thus are effectively "noise" in the estimation of jet properties. A variety of techniques have been proposed for pileup mitigation in jets [17], ranging from subtracting an average pileup energy density from a jet to techniques targeting the classification of each particle in a jet as originating from pileup or from the hard scatter.

Within the paradigm of jet images, one approach to pileup mitigation is to predict the per pixel pileup contributions, as is done in the PUMML method [75]. In this technique, a jet can be considered as composed of four components: the charged and neutral hard scatter contributions, and the charged and neutral pileup contributions. While the charged components of the hard scatter and pileup are known from charged particle tracking measurements, the neutral hard scatter and pileup components are only observed together in calorimeter measurements.
PUMML performs a per pixel regression of the neutral component of the hard scatter contributions to the jet. A multi-channel jet image was used as input, with one channel for each of the hard scatter and pileup charged components of the jet, and one channel for the combined neutral component. As the charged contributions measured by tracking detectors have significantly better resolution than the neutral component, a significantly smaller pixel size of ∆η × ∆φ = 0.025 × 0.025 was used for the charged images than the ∆η × ∆φ = 0.1 × 0.1 used for the neutral image. The impact of PUMML on the reconstruction of the jet p_T and jet mass, in comparison with other pileup mitigation techniques such as SoftKiller and PUPPI, can be seen in Figure 22.

Fig. 22.: The impact of pileup mitigation on (a) the jet p_T, and (b) the jet mass, for various mitigation techniques including the jet image based PUMML algorithm [75].

7.2. Generative Models with Jet Images

Among the earliest work applying deep generative models as approximations of HEP high fidelity simulators made use of jet images as the data representation [28]. The aim of this work was to learn the structure of jet images as they may appear in a calorimeter and to subsequently draw sample jets from the learned generative model. As a neural generative model can be significantly faster than running a high fidelity simulator, such approaches have the potential to significantly reduce the large simulation times in HEP. In the phenomenological studies of reference [28], a generative adversarial network (GAN) setup was used to train a generative model to transform samples from a standard normal distribution into samples of jet images, whilst a second, discriminator network was used to penalize the generative model if it could discriminate between real and generated jet images. Locally connected layers as well as convolutional layers were investigated for use in the networks. The distributions of p_T for W boson jets and for quark/gluon jets were compared between the Pythia simulator [67, 68] and the GAN generated images, as seen in Figure 23a. Figure 23b shows a set of Pythia simulated jet images in the top row and their nearest neighbor GAN generated jet images in the bottom row. Both the distributions of jet properties and the general structure of jet images were reasonably well reproduced by the GAN approach. While not yet reaching the fidelity of HEP simulators, this early work in HEP data generation showed the potential utility of fast approximate simulators built from deep generative models for HEP.
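A minimal sketch of the adversarial training loop underlying such an approach (PyTorch; a generic GAN step, not the specific locally connected architecture of reference [28]):

```python
import torch
import torch.nn as nn

def gan_step(G, D, real_images, opt_g, opt_d, latent_dim=64):
    """One GAN training step: G maps latent noise to a jet image batch;
    D maps an image batch to (batch, 1) real/fake logits. Both networks
    and the latent size are assumptions of this sketch."""
    bce = nn.BCEWithLogitsLoss()
    n = real_images.size(0)
    z = torch.randn(n, latent_dim)
    fake_images = G(z)
    # Discriminator update: real images -> 1, generated images -> 0.
    opt_d.zero_grad()
    loss_d = bce(D(real_images), torch.ones(n, 1)) + \
             bce(D(fake_images.detach()), torch.zeros(n, 1))
    loss_d.backward()
    opt_d.step()
    # Generator update: penalized when the discriminator spots its fakes.
    opt_g.zero_grad()
    loss_g = bce(D(fake_images), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```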
Fig. 23.: (a) The jet image p_T distributions for W boson jets and quark/gluon jets, comparing the GAN generated distributions to the Pythia simulated distributions [28]. (b) A visual comparison of Pythia simulated (top) and the nearest GAN generated (bottom) jet images.

7.3. Anomaly Detection

The use of CNNs to process jet images provides a powerful scheme to learn useful representations of the information contained within a jet. In typical classification tasks, these representations are used for discriminating classes of jets. However, when searching for signs of new physics, one may not know a priori the properties of such a new signal, only that such a signal would have properties that deviate from known Standard Model processes. Such anomaly detection tasks are challenging due to the lack of signal knowledge, and thus the inability to use standard classifiers for this task. Within the context of a search for jets produced by new particles, recent work has combined the power of CNN representation learning on jet images with autoencoder network architectures [77, 78] to search for anomalous jets [79, 80].

Autoencoder models are designed to map an input to a compressed latent representation through an "encoder", and then decompress the latent representation back to the original input via a "decoder". Such models are trained to minimize the "reconstruction error", computed as the MSE between the original input and the autoencoder output. The reconstruction error can be used to identify inputs that are not well adapted to the compression and reconstruction scheme learned by the autoencoder. When used for anomaly detection, autoencoders are trained to compress and reconstruct one class of events. Under the assumption that this compression and reconstruction scheme will not be well adapted to inputs from classes different from the training sample, the reconstruction of inputs from new classes is expected to perform poorly and thus lead to a large reconstruction error.

When applied to searches for anomalous jets in phenomenological studies, jet images have been examined as the data representation, and convolutional layers combined with max pooling and with upsampling have been used for the encoder and decoder, respectively. In this case, the autoencoder is trained on a background sample of standard quark and gluon jets, and the ability to identify different signal jets is examined. The reconstruction error was used directly to search for excesses of events, as seen in Figure 24, where the signal was either a sample of top quark jets or jets from a hypothetical new gluino particle. The distribution of the reconstruction error shows a large separation between the background, denoted QCD, and the potential signal jets. The ROC curve for identifying top jets, produced by scanning a threshold on the reconstruction error, is also shown, comparing the CNN based autoencoder with a dense architecture autoencoder applied to a flattened vector of pixel p_T's (denoted Dense), principal components analysis, and a threshold on the jet mass alone. The jet image + CNN architecture approach dominated the other methods, although it should be noted that this domination was not seen for gluino jets.

Fig. 24.: (a) The distribution of the reconstruction error of an autoencoder trained on quark / gluon jets, showing potential top or gluino signal distributions. (b) ROC curves for identifying boosted top jets from quark / gluon jets using autoencoders with various architectures, wherein the jet image based CNN is shown to outperform other methods [80].
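A minimal sketch of such a convolutional autoencoder and its reconstruction-error anomaly score (PyTorch; the layer sizes and the 40 × 40 single channel image size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class JetAutoencoder(nn.Module):
    """Convolution + max pooling encoder, upsampling decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):  # x: (batch, 1, 40, 40)
        return self.decoder(self.encoder(x))

def anomaly_score(model, images):
    """Per-jet MSE reconstruction error, used as the anomaly score."""
    with torch.no_grad():
        recon = model(images)
    return ((recon - images) ** 2).mean(dim=(1, 2, 3))
```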
One challenge with autoencoder approaches to anomalous jet searches is the possibility that the autoencoder reconstruction quality depends on the jet mass, in which case the signal identification efficiency could be mass dependent. Moreover, if a bump hunt analysis in the jet mass spectrum is subsequently performed, such a correlation of the reconstruction error with mass could sculpt the jet mass distribution and render the bump hunt strategy infeasible. To overcome this challenge, an adversarial approach was investigated in reference [79], wherein a second network is simultaneously trained with the autoencoder to predict the jet mass from the autoencoder output, whilst the autoencoder is penalized during training if the second network is successful. The resulting adversarial autoencoder performance for identifying a top jet signal can be seen in Figure 25. With the adversary in use, the jet mass distribution was kept relatively stable even when applying a threshold on the reconstruction error which only permits a 5% background jet false positive rate. However, as seen in the ROC curve, increasing the strength λ of the adversarial penalty on the autoencoder could significantly decrease the top jet signal sensitivity.

Fig. 25.: (a) The jet mass distribution after applying a threshold allowing 3% or 5% quark / gluon jet background efficiency on the jet image based adversarial autoencoder. The background is largely unsculpted and the top jet peaks can be clearly seen [79]. (b) ROC curves for quark / gluon jet rejection versus top jet efficiency for jet image based adversarial autoencoders with varying strength of the adversarial penalty during training [79].

8. Conclusion

The representation of jets as images has proven highly useful for connecting the fields of high energy physics and machine learning. Through this connection, advanced methods in deep learning and computer vision, primarily with convolutional neural network architectures, have been applied to the challenges of jet physics and have shown promising performance both in phenomenological studies and in experiments at the LHC. Jet images have seen a broad set of use cases, not only for jet classification but also for energy regression, pileup noise removal, data generation, and anomaly detection. Image based jet tagging remains an active area of research, and broad classes of state of the art deep neural network architectures for computer vision are being explored within the field of high energy physics.

While much of the work presented in this text has been in phenomenological studies using particle level simulations, there remain open questions about the applicability of these methods to high-fidelity simulated data and to real experimental data. In more realistic settings, the complexity of the detector and of the data-taking conditions, and the differences between simulated and real data, will be key challenges for understanding and optimizing these models. Understanding the relationships in realistic data between model accuracy and calibration error, and between model complexity or structural assumptions and sensitivity to systematic uncertainties, will be important for the long-term efficacy of these image-based methods. Nonetheless, initial results from both ATLAS and CMS have shown promise, pointing towards the exciting potential of jet imaging in the future.

Acknowledgments

This work was supported by the US Department of Energy (DOE) under grant DE-AC02-76SF00515, and by the SLAC Panofsky Fellowship.
References

[1] J. Cogan, M. Kagan, E. A. Strauss, and A. Schwartzman, Jet-images: computer vision inspired techniques for jet tagging, Journal of High Energy Physics (2015) 1.
[2] S. Marzani, G. Soyez, and M. Spannowsky, Looking inside jets: an introduction to jet substructure and boosted-object phenomenology. Springer, Cham, Switzerland, 2019.
[3] J. Shelton, Jet Substructure, pp. 303-340. World Scientific, 2013.
[4] L. Evans and P. Bryant, LHC Machine, Journal of Instrumentation (Aug. 2008) S08001.
[5] A. Andreassen, I. Feige, C. Frye, and M. D. Schwartz, JUNIPR: a framework for unsupervised machine learning in particle physics, The European Physical Journal C (2019) 2, 102.
[6] ATLAS Collaboration, Deep Sets based Neural Networks for Impact Parameter Flavour Tagging in ATLAS, Tech. Rep. ATL-PHYS-PUB-2020-014, CERN, Geneva, May 2020. https://cds.cern.ch/record/2718948.
[7] G. Louppe, K. Cho, C. Becot, and K. Cranmer, QCD-aware recursive neural networks for jet physics, Journal of High Energy Physics (2019) 1, 57.
[8] P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy flow networks: deep sets for particle jets, Journal of High Energy Physics (2019) 1, 121.
[9] H. Qu and L. Gouskos, Jet tagging via particle clouds, Physical Review D (Mar. 2020) 056019.
[10] G. Kasieczka, T. Plehn, A. Butter, K. Cranmer, D. Debnath, B. M. Dillon, M. Fairbairn, D. A. Faroughy, W. Fedorko, C. Gay, et al., The Machine Learning landscape of top taggers, SciPost Physics (Jul. 2019).
[11] B. Graham and L. van der Maaten, Submanifold Sparse Convolutional Networks, arXiv:1706.01307 [cs.NE].
[12] M. Cacciari, G. P. Salam, and G. Soyez, The anti-kt jet clustering algorithm, Journal of High Energy Physics (Apr. 2008) 063.
[13] ATLAS Collaboration, Topological cell clustering in the ATLAS calorimeters and its performance in LHC Run 1, The European Physical Journal C (2017) 7, 490.
[14] CMS Collaboration, Particle-flow reconstruction and global event description with the CMS detector, Journal of Instrumentation (Oct. 2017) P10003, arXiv:1706.04965 [physics.ins-det].
[15] R. Kogler et al., Jet Substructure at the Large Hadron Collider: Experimental Review, Reviews of Modern Physics (2019) 4, 045003, arXiv:1803.06991 [hep-ex].
[16] A. J. Larkoski, I. Moult, and B. Nachman, Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning, Physics Reports (2020) 1, arXiv:1709.04464 [hep-ph].
[17] G. Soyez, Pileup mitigation at the LHC: A theorist's view, Physics Reports (Apr. 2019) 1.
[18] D. Krohn, J. Thaler, and L.-T. Wang, Jet trimming, Journal of High Energy Physics (Feb. 2010).
[19] D. Bertolini, P. Harris, M. Low, and N. Tran, Pileup per particle identification, Journal of High Energy Physics (Oct. 2014).
[20] Particle Data Group, Review of Particle Physics, Progress of Theoretical and Experimental Physics (Aug. 2020).
[21] M. Cacciari, G. P. Salam, and G. Soyez, The catchment area of jets, Journal of High Energy Physics (Apr. 2008).
[22] J. Pearkes, W. Fedorko, A. Lister, and C. Gay, Jet Constituents for Deep Neural Network Based Top Quark Tagging, arXiv:1704.02124 [hep-ex].
[23] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. G. Schwartzman, Jet-images - deep learning edition, Journal of High Energy Physics (2016) 1.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Communications of the ACM (May 2017) 84.
[25] Y. LeCun, Generalization and network design strategies, Tech. Rep. CRG-TR-89-4, University of Toronto, 1989. http://yann.lecun.com/exdb/publis/pdf/lecun-89.pdf.
[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[27] P. Baldi, K. Bauer, C. Eng, P. Sadowski, and D. Whiteson, Jet substructure classification in high-energy physics with deep neural networks, Physical Review D (May 2016) 094034.
[28] L. de Oliveira, M. Paganini, and B. Nachman, Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis, Computing and Software for Big Science (2017) 1.
[29] D. Scherer, A. Müller, and S. Behnke, Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition, in Artificial Neural Networks - ICANN 2010, K. Diamantaras, W. Duch, and L. S. Iliadis, eds. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[30] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds. Proceedings of Machine Learning Research, Lille, France, Jul. 2015. http://proceedings.mlr.press/v37/ioffe15.html.
[31] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington, eds. JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Sardinia, Italy, May 2010. http://proceedings.mlr.press/v9/glorot10a.html.
[32] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization Methods for Large-Scale Machine Learning, SIAM Review (2016) 2, 223, arXiv:1606.04838 [stat.ML].
[34] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, in International Conference on Learning Representations, 2015. arXiv:1412.6980 [cs.LG].
[35] J. Gallicchio, J. Huth, M. Kagan, M. D. Schwartz, K. Black, and B. Tweedie, Multivariate discrimination and the Higgs+W/Z search, Journal of High Energy Physics (Apr. 2011) 69, arXiv:1010.3698 [hep-ph].
[36] G. Kasieczka, T. Plehn, M. C. Russell, and T. Schell, Deep-learning top taggers or the end of QCD?, Journal of High Energy Physics (2017) 1.
[37] S. Diefenbacher, H. Frost, G. Kasieczka, T. Plehn, and J. M. Thompson, CapsNets Continuing the Convolutional Quest, arXiv:1906.11265 [hep-ph].
[38] S. Bollweg, M. Haussmann, G. Kasieczka, M. Luchmann, T. Plehn, and J. Thompson, Deep-Learning Jets with Uncertainties and More, SciPost Physics (2020) 6.
[39] P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, Journal of High Energy Physics (2017) 1.
[40] Y.-T. Chien and R. Kunnawalkam Elayavalli, Probing heavy ion collisions using quark and gluon jet substructure, arXiv:1803.03589 [hep-ph].
[41] K. Fraser and M. D. Schwartz, Jet charge and machine learning, Journal of High Energy Physics (2018) 10, 93.
[42] S. Macaluso and D. Shih, Pulling out all the tops with computer vision and deep learning, Journal of High Energy Physics (2018) 10, 121.
[43] Y.-C. J. Chen, C.-W. Chiang, G. Cottin, and D. Shih, Boosted W and Z tagging with jet charge and deep learning, Physical Review D (Mar. 2020) 053001. https://link.aps.org/doi/10.1103/PhysRevD.101.053001.
[44] J. Lin, M. Freytsis, I. Moult, and B. Nachman, Boosting H → b ¯b with machine learning, Journal of High Energy Physics (Oct. 2018) 101, arXiv:1807.10768 [hep-ph].
[45] J. H. Kim, M. Kim, K. Kong, K. T. Matchev, and M. Park, Portraying double Higgs at the Large Hadron Collider, Journal of High Energy Physics (Sep. 2019) 47, arXiv:1904.08549 [hep-ph].
[46] ATLAS Collaboration, Quark versus Gluon Jet Tagging Using Jet Images with the ATLAS Detector, Tech. Rep. ATL-PHYS-PUB-2017-017, CERN, Geneva, Jul. 2017. https://cds.cern.ch/record/2275641.
[47] CMS Collaboration, Identification of heavy, energetic, hadronically decaying particles using machine-learning techniques, Journal of Instrumentation (2020) 06, P06005, arXiv:2004.08262 [hep-ex].
[48] J. Snoek, H. Larochelle, and R. P. Adams, Practical Bayesian Optimization of Machine Learning Algorithms, in Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012. https://papers.nips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html.
[49] J. Thaler and K. Van Tilburg, Identifying boosted objects with N-subjettiness, Journal of High Energy Physics (Mar. 2011).
[50] A. J. Larkoski, I. Moult, and D. Neill, Power counting to better jet observables, Journal of High Energy Physics (Dec. 2014).
[51] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, NY, 2nd ed., 2009.
[52] M. D. Zeiler, ADADELTA: An Adaptive Learning Rate Method, arXiv:1212.5701 [cs.LG].
[53] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 5987.
[54] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, On Calibration of Modern Neural Networks, in Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh, eds. Proceedings of Machine Learning Research, Sydney, Australia, Aug. 2017. http://proceedings.mlr.press/v70/guo17a.html.
[55] ATLAS Collaboration, Performance of mass-decorrelated jet substructure observables for hadronic two-body decay tagging in ATLAS, Tech. Rep. ATL-PHYS-PUB-2018-014, CERN, Geneva, Jul. 2018. https://cds.cern.ch/record/2630973.
[56] L. Bradshaw, R. K. Mishra, A. Mitridate, and B. Ostdiek, Mass Agnostic Jet Taggers, SciPost Physics (2020) 1, 011, arXiv:1908.08959 [hep-ph].
[57] G. Louppe, M. Kagan, and K. Cranmer, Learning to Pivot with Adversarial Networks, in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017. https://papers.nips.cc/paper/2017/hash/48ab2f9b45957ab574cf005eb8a76760-Abstract.html.
[58] C. Shimmin, P. Sadowski, P. Baldi, E. Weik, D. Whiteson, E. Goul, and A. Søgaard, Decorrelated Jet Substructure Tagging using Adversarial Neural Networks, Physical Review D (2017) 7, 074034, arXiv:1703.03507 [hep-ex].
[59] G. Kasieczka and D. Shih, DisCo Fever: Robust Networks Through Distance Correlation, arXiv:2001.05310 [hep-ph].
[60] CERN, "CERN Open Data Portal." http://opendata.cern.ch/, 2017.
[61] M. Andrews, J. Alison, S. An, B. Burkle, S. Gleyzer, M. Narain, M. Paulini, B. Poczos, and E. Usai, End-to-end jet classification of quarks and gluons with the CMS Open Data, Nuclear Instruments and Methods in Physics Research A (Oct. 2020) 164304.
[62] J. Gallicchio and M. D. Schwartz, Quark and gluon jet substructure, Journal of High Energy Physics (2013) 1.
[63] R. Field and R. Feynman, A parametrization of the properties of quark jets, Nuclear Physics B (1978) 1, 1.
[64] D. Krohn, M. D. Schwartz, T. Lin, and W. J. Waalewijn, Jet Charge at the LHC, Physical Review Letters (May 2013) 212001.
[65] W. J. Waalewijn, Calculating the charge of a jet, Physical Review D (Nov. 2012) 094030.
[66] J. Barnard, E. N. Dawe, M. J. Dolan, and N. Rajcic, Parton shower uncertainties in jet substructure analyses with deep neural networks, Physical Review D (Jan. 2017) 014018.
[67] T. Sjöstrand, S. Mrenna, and P. Skands, A brief introduction to PYTHIA 8.1, Computer Physics Communications (2008) 11, 852.
[68] T. Sjöstrand, S. Mrenna, and P. Skands, PYTHIA 6.4 physics and manual, Journal of High Energy Physics (May 2006) 026.
[69] J. Bellm, S. Gieseke, D. Grellscheid, A. Papaefstathiou, S. Platzer, P. Richardson, C. Rohr, T. Schuh, M. H. Seymour, A. Siodmok, A. Wilcock, and B. Zimmermann, Herwig++ 2.7 Release Note, arXiv:1310.6877 [hep-ph].
[70] S. Choi, S. J. Lee, and M. Perelstein, Infrared safety of a neural-net top tagging algorithm, Journal of High Energy Physics (2018) 1.
[71] Geant4 Collaboration, Geant4 simulation toolkit, Nuclear Instruments and Methods in Physics Research A (2003) 3, 250.
[72] ATLAS Collaboration, The ATLAS Simulation Infrastructure, The European Physical Journal C (2010) 3, 823.
[73] CMS Collaboration, Performance of b-tagging algorithms in proton-proton collisions at 13 TeV with Phase 1 CMS detector, Tech. Rep. CMS-DP-2018-033, CERN, Jun. 2018. https://cds.cern.ch/record/2627468.
[74] U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, Y. Jia, J. Ghosh, R. Puri, J. M. F. Moura, and P. Eckersley, Explainable machine learning in deployment, Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020).
[75] P. T. Komiske, E. M. Metodiev, B. Nachman, and M. D. Schwartz, Pileup Mitigation with Machine Learning (PUMML), Journal of High Energy Physics (2017) 1.
[76] ATLAS Collaboration, Convolutional Neural Networks with Event Images for Pileup Mitigation with the ATLAS Detector, Tech. Rep. ATL-PHYS-PUB-2019-028, CERN, Geneva, Jul. 2019. https://cds.cern.ch/record/2684070.
[77] P. Baldi and K. Hornik, Neural networks and principal component analysis: Learning from examples without local minima, Neural Networks (1989) 1, 53.
[78] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, Extracting and Composing Robust Features with Denoising Autoencoders, in Proceedings of the 25th International Conference on Machine Learning (ICML '08). Association for Computing Machinery, New York, NY, USA, 2008.
[79] T. Heimel, G. Kasieczka, T. Plehn, and J. Thompson, QCD or what?, SciPost Physics (Mar. 2019).
[80] M. Farina, Y. Nakai, and D. Shih, Searching for new physics with deep autoencoders, Physical Review D 101.