Image-Based Jet Analysis
Michael Kagan, SLAC National Accelerator Laboratory
Image-based jet analysis is built upon the jet image representation of jets that enables a direct connection between high energy physics and the fields of computer vision and deep learning. Through this connection, a wide array of new jet analysis techniques have emerged. In this text, we survey jet image based classification models, built primarily on the use of convolutional neural networks, examine the methods to understand what these models have learned and what is their sensitivity to uncertainties, and review the recent successes in moving these models from phenomenological studies to real world application on experiments at the LHC. Beyond jet classification, several other applications of jet image based techniques, including energy estimation, pileup noise reduction, data generation, and anomaly detection, are discussed.
Contents
1. Introduction
 1.1. Notations and Definitions
2. Jets and Jet Physics Challenges
3. Jet Images and Preprocessing
 3.1. Preprocessing
4. Computer Vision and Convolutional Neural Networks
5. Jet tagging
 5.1. Jet Tagging on Single Channel Jet Images
 5.2. Multi-Channel Jet Tagging with CNNs
 5.3. Sensitivity to Theory Uncertainties
 5.4. Jet images in LHC Experiments
6. Understanding jet Image based tagging
 6.1. Probing CNNs
7. Other Applications of Jet images
 7.1. Jet Energy Regression and Pileup removal
 7.2. Generative Models with Jet Images
 7.3. Anomaly Detection
8. Conclusion
References
1. Introduction
The jet image [1] approach to jet tagging is built upon the rapidly developing field of Computer Vision (CV) in Machine Learning (ML). Jets [2, 3] are collimated streams of particles produced by the fragmentation and hadronization of high energy quarks and gluons. The particles are subsequently measured by particle detectors and clustered with jet clustering algorithms to define the jets. Jet images view the energy depositions of the stream of particles comprising a jet within a fixed geometric region of a detector as an image, thereby connecting particle detector measurements with an image representation and allowing the application of image analysis techniques from CV. In this way, models built upon advancements in deep convolutional neural networks (CNNs) can be trained for jet classification, energy determination through regression, and the reduction of noise, e.g. from simultaneous background interactions at a high intensity hadron collider such as the Large Hadron Collider (LHC). Throughout this text, the focus will be on the use of jet image techniques studied within the context of hadron colliders like the LHC [4].

Jet images form a representation of jets highly connected with the detector; one can look at segmented detectors as imaging devices and interpret the measurements as an image. In contrast, other representations of jets exist that are built more closely from the physics of jet formation, such as viewing jets as sequences [5, 6] or trees [7] formed through a sequential emission process, or viewing jets as sets, graphs, or point clouds [8, 9] with the geometric relationship between constituents of the jet encoded in the adjacency matrix and node properties. There are overlaps in these approaches, for instance a graph can be defined over detector energy measurements, but these approaches will not be discussed in detail in this chapter. The utilization of an image-based approach comes with the major advantage that CV is a highly developed field of ML, with some of the most advanced models available for application to jet analysis with jet images. From the experimental viewpoint, the detector measurements are fundamental to any subsequent analysis, and the detailed knowledge of the detector and its systematic uncertainties can be highly advantageous for analysis of LHC data.

Among the earliest uses of jet images was the classification of the parent particle inducing the jet [1], which relied on linear discriminants trained on image representations of jets. While the remainder of this text will focus on deep learning approaches to jet images, even this early work saw interesting discrimination power for this task. By utilizing the detector measurements directly, rather than relying on jet features developed using physics domain knowledge, additional discrimination power could be extracted. Deep learning approaches surpass such linear methods, but build on this notion of learning discriminating information from detector observables rather than engineered features.

Fig. 1.: An example jet image of a Lorentz boosted top quark jet after preprocessing has been applied [10].

While designed to take advantage of advances in computer vision, jet images have notable differences with respect to typical natural images in CV. Jet images are sparse, with most pixels in the image having zero content. This is markedly different from natural images, which tend to have all pixels containing content.
Moreover, jet images tend to have multiple localized regions of high density in addition to diffusely located pixels throughout the image, as opposed to the smooth structures typically found in natural images. An example top quark jet image illustrating these features can be seen in Figure 1. These differences can lead to notable challenges; for instance, the number of parameters used in jet image models (and consequently the training time) tends to be large to account for the size of the image, even though most pixels carry no information. Some techniques exist for sparse-image computer vision approaches [11], but they have not been explored in depth within the jet image community.

This text will first discuss jets and typical jet physics in Section 2. The formation of jet images and the jet image preprocessing steps applied before classification are discussed in Section 3. A brief introduction to Computer Vision is found in Section 4. The application of jet images in various jet classification problems is then discussed in Section 5, followed by a discussion on the interpretation of the information learned by jet image based classifiers in Section 6. Some recent applications of jet images beyond classification are discussed in Section 7. A brief note on the notation used throughout the text follows below (Section 1.1).

It should be noted that the majority of the studies presented in this text relate to phenomenological work using simplified settings that often do not include a realistic modeling of a detector's impact on observables. These will frequently be denoted as phenomenological studies, in contrast to the studies using realistic detector simulations or real experiment data that are discussed mainly in Section 5.4.
1.1. Notations and Definitions
As we will focus on studies of jet images in the LHC setting, we will primarily utilize the hadron collider coordinate system notation. The beam-line defines the z-axis, φ indicates the azimuthal angle, and η = −log tan(θ/2) is the pseudo-rapidity, a transformation of the polar angle θ. The rapidity, defined as y = ½ log[(E + p_z)/(E − p_z)], is frequently used for the polar measurement of massive particles, such as jets, as differences in rapidity are invariant with respect to Lorentz boosts along the beam direction. The angular separation of particles is defined as ΔR(p₁, p₂) = √[(y₁ − y₂)² + (φ₁ − φ₂)²]. The transverse momentum, p_T = √(p_x² + p_y²), is frequently used as it is invariant with respect to Lorentz boosts along the beam direction. The transverse energy is defined as E_T = E sin(θ).
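To make the notation concrete, the following is a minimal Python sketch of these kinematic quantities; the function names and the azimuthal wrap-around handling are illustrative choices, not from the text.

```python
import numpy as np

def pseudorapidity(theta):
    """eta = -log tan(theta/2), a function of the polar angle only."""
    return -np.log(np.tan(theta / 2.0))

def rapidity(E, pz):
    """y = 0.5 * log((E + pz) / (E - pz)); differences in y are invariant
    under Lorentz boosts along the beam direction."""
    return 0.5 * np.log((E + pz) / (E - pz))

def delta_R(y1, phi1, y2, phi2):
    """Angular separation, wrapping the azimuthal difference into (-pi, pi]."""
    dphi = np.mod(phi1 - phi2 + np.pi, 2 * np.pi) - np.pi
    return np.sqrt((y1 - y2) ** 2 + dphi ** 2)

def transverse_momentum(px, py):
    return np.sqrt(px ** 2 + py ** 2)
```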
2. Jets and Jet Physics Challenges
Jets are collimated streams of particles produced by the fragmentation and hadronization of high energy quarks and gluons. Jet clustering algorithms are used to combine particles into clusters that define the jets (see references [2, 3] for recent reviews). At the LHC, jet algorithms typically rely on sequential reclustering algorithms which, given a definition of distance, iteratively combine the closest two constituents (either particles or previously combined sets of particles denoted proto-jets) until a stopping condition is met. Different distance metrics define different jet algorithms, and perhaps the most commonly used algorithm at the LHC is the anti-k_T algorithm [12], in which the distance between particle i and particle j is defined as d_ij = min{k_T,i⁻², k_T,j⁻²} Δ²_ij / R², where Δ²_ij = (y_i − y_j)² + (φ_i − φ_j)² and y, φ, and k_T are the particle rapidity, azimuth, and transverse momentum, respectively. The parameter R of the jet algorithm has the effect of defining the spatial span, or approximate "radius" (though the jet is not necessarily circular), of the jet. Jets and jet algorithms are required to be IRC safe, i.e. insensitive to additional infrared radiation or collinear splittings of particles, in order for the jet properties to be calculable in fixed-order perturbation theory. This allows comparison between jets clustered on partons from the hard scattering process, referred to as parton jets, on final state particles after showering and hadronization simulation, referred to as particle jets, and on reconstructed particles in detectors, referred to as reconstructed jets.

Most of the work presented in this text comprises phenomenological studies outside the context of any individual experiment. These studies primarily utilize particle level simulation after fragmentation and hadronization, and thus study particle jets defined after clustering the final state particles. These studies typically do not use a simulation of a detector and its impact on particle kinematic measurements. Studies of jets and jet images after real detector simulation or in real detector data are discussed in Section 5.4. In the detector setting, various inputs to jet algorithms can be used to define jets: (i) towers, referring to a fixed spatial extent in η and φ in which all energy within the longitudinal depth of the calorimeter is summed, (ii) topological clusters [13], used to cluster together energy depositions in nearby calorimeter cells, and (iii) tracks, or charged particle trajectories, measured using tracking detectors. The Particle Flow (PF) algorithm [14] is used by the CMS collaboration to match charged particles with energy in the calorimeter, in order to utilize both measurements to define PF candidates that can be used as inputs to jet algorithms.

The R-parameter of the jet is used to define the spatial extent to which particles are clustered into the jet. When studying quark and gluon jets, a small radius of R = 0.4 is typically used, while large-R jets, with a larger radius parameter of typically R = 0.8 to 1.0, are often used to capture the decays of boosted heavy particles. Subjets, defined by running a jet clustering algorithm with smaller radius on the constituents of a jet, are frequently used to study the internal properties of a jet. More broadly, jet substructure refers to the study of the internal structure of jets and the development of theoretically motivated jet features which are useful for discrimination and inference tasks (see references [15, 16] for recent reviews).
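As an illustration of the anti-k_T distance defined above, the following sketch computes the pairwise and beam distances used at each clustering step. The full clustering loop (repeatedly merging the pair with the smallest d_ij, or promoting a proto-jet to a final jet when its beam distance is smallest) is omitted, and real analyses use the FastJet library rather than hand-rolled code.

```python
import numpy as np

def antikt_distance(kt_i, kt_j, y_i, y_j, phi_i, phi_j, R=0.4):
    """Pairwise anti-kT distance d_ij = min(kT_i^-2, kT_j^-2) * Delta_ij^2 / R^2."""
    dphi = np.mod(phi_i - phi_j + np.pi, 2 * np.pi) - np.pi  # wrap azimuth
    delta2 = (y_i - y_j) ** 2 + dphi ** 2
    return min(kt_i ** -2, kt_j ** -2) * delta2 / R ** 2

def beam_distance(kt_i):
    """Beam distance d_iB = kT_i^-2; a proto-jet is promoted to a final jet
    when its beam distance is smaller than all pairwise distances."""
    return kt_i ** -2
```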
One particularly important feature of a jet is the jet mass, computed as m² = (Σ_{i∈jet} p_i)², where the sum of four-vectors runs over all the constituents i clustered into the jet. As different heavy resonances have different masses, this feature can be a strong discriminant between jet types. Note that any operation performed on a jet which alters the constituents, such as the pileup mitigation discussed in the next paragraph, may alter the jet mass.

It is important to note that additional proton-proton interactions within a bunch crossing, or pileup, create additional particles within an event that can impact jet clustering and the estimation of jet properties. This is especially important for large-R jets, which cover large spatial extents. Dedicated pileup removal algorithms are used to mitigate the impact of pileup [17]. Jet trimming [18] is a jet grooming technique used to remove soft and wide angle radiation from jets, in which the constituents within the jet are reclustered using a jet algorithm with a smaller radius to define subjets, and subjets carrying a fraction of the jet energy below a threshold are removed; it is frequently used by ATLAS to aid in pileup mitigation. The pileup per particle identification algorithm (PUPPI) [19] is frequently used by CMS, in which for each particle a local shape parameter, which probes the collinear versus soft diffuse structure in the neighborhood of the particle, is calculated. The distribution of this shape parameter per event is used to calculate per particle weights that describe the degree to which particles are pileup-like. Particle four-momenta are then weighted, and thus the impact of (down-weighted) pileup particles on jet clustering is reduced [19]. Pileup mitigation can greatly improve the estimation of the jet mass, energy, and momentum by removing or down-weighting the pileup particles clustered into a jet, which only serve as noise in the estimation of jet properties.

Jet identification, energy estimation, and pileup estimation / reduction are among the primary challenges for which the jet images approach has been employed: (i) Jet identification refers to the classification of the parent particle type that gave rise to the jet, and is needed to determine the particle content of a collision event. (ii) Jet energy estimation refers to the regression of the true jet total energy from the noisy detector observations, and is needed to determine the kinematic properties of an event. (iii) Jet pileup estimation and reduction refers to the determination of the stochastic contributions to detector observations arising from incident particles produced in proton-proton collisions that are not from the primary hard scattering. This form of denoising is required to improve the energy and momentum resolutions of measurements of jets.
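As a concrete reference for the jet mass defined above, a minimal sketch follows; the (N, 4) array layout of (E, px, py, pz) rows is an illustrative assumption.

```python
import numpy as np

def jet_mass(constituents):
    """Invariant mass from the summed constituent four-vectors,
    m^2 = E^2 - |p|^2; `constituents` is an (N, 4) array of (E, px, py, pz)."""
    E, px, py, pz = np.asarray(constituents, dtype=float).sum(axis=0)
    m2 = E ** 2 - px ** 2 - py ** 2 - pz ** 2
    return np.sqrt(max(m2, 0.0))  # guard against small negative m^2 from rounding
```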
Among the primary physics settings in which jet images have been used are studies of jets produced by Lorentz boosted heavy particles, such as a W or Z boson, Higgs boson (h), top quark (t), or a hypothetical new beyond the Standard Model particle. When a heavy short-lived particle is produced with a momentum on the order of twice its mass or more, the quark decay products of such a heavy particle have a high likelihood of emerging collimated, such that the subsequent hadronic showers produced by the quarks overlap. Jet clustering algorithms can capture the entirety of the heavy particle decay within one large-R jet, with an R-parameter typically between 0.8 and 1.0, though in some cases larger R-parameters have been used. The internal structure of such a boosted jet can be highly non-trivial and significantly different than a typical jet produced by a single quark or gluon. However, the production of quarks and gluons is ubiquitous at hadron colliders, and thus powerful discrimination methods, or taggers, are needed to identify relatively clean samples of heavy-particle-induced boosted jets. Moreover, the mass scale of heavy hadronically decaying particles in the Standard Model is similar, from the W boson mass of ∼80 GeV [20] up to the top quark mass of ∼173 GeV [20]. Typical discrimination tasks thus include discriminating boosted W-, Z-, h-, or t-jets from quarks and gluons, but also discriminating between boosted heavy particle jets.

Jet images have also been employed for studying jets from individual quarks and gluons. This includes discriminating between quark and gluon jets, and between jets produced by quarks of different flavour. In these cases, smaller jets, typically with R = 0.4, are used.
3. Jet Images and Preprocessing
Jet images are built using a fixed grid, or pixelation, of the spatial distribution of energy within a jet. Early instances of such pixelation relied on energy depositions in calorimeter detectors, wherein the angular segmentation of the detector cells was used to define the "pixels" of the jet image and the pixel "intensity" was defined with the transverse energy in a cell. More recently, high resolution measurements of charged particles from tracking detectors have also been used to form images, wherein the transverse momentum of all particles found within the spatial extent of a jet image pixel is summed to define the pixel intensity. While calorimeter and tracking detectors typically span a large angular acceptance, a typical
jet has a limited angular span. The angular span of a jet is related to the R-parameter of the jet clustering algorithm. Jet images are thus designed to cover the catchment area of the jet [21]. In many cases, the jet image is first defined to be slightly larger than the expected jet catchment area, to ensure that preprocessing steps (discussed in Section 3.1) do not disrupt peripheral pixel estimates, and the images are then cropped after preprocessing. Nonetheless, only a slice of the angular space of the detector is used to define the jet image, with the image centered on the direction of the jet and the image size chosen to capture the extent of a jet with a given R-parameter. If depth segmentation is present in a calorimeter, the energy is often summed in depth. From this vantage point, a jet image can be viewed as a grey-scale image comprising the energy measurements encapsulated by the angular span of the jet. In some cases energy depositions from hadronic and electromagnetic calorimeters will be separated into different images, or separate images will be formed from both calorimeter cell measurements and the spatially pixelated charged particle measurements. In these cases, the set of jet images, each defining a view of the jet from a different set of measurements, can be seen as the color channels of a jet image.

It should be noted that jet pileup mitigation, such as the aforementioned trimming or PUPPI algorithms, is vital to reduce the impact of pileup on downstream jet image prediction tasks. While not explicitly discussed as a part of the jet image preprocessing, this step is almost always performed prior to jet image formation using the jet constituents, especially in the case of studying large-R jets.

3.1. Preprocessing
An important consideration in the training of a classifier is how to process data before feeding it to the classifier such that the classifier can learn most efficiently. For instance, a common preprocessing step in ML is to standardize inputs by scaling and mean-shifting each input feature such that each feature has zero mean and unit variance. In this case, standardization helps to ensure that features have similar range and magnitude so that no single feature dominates gradient updates. In general, data preprocessing can help to stabilize the optimization process and can help remove redundancy in the data features, easing the learning of useful representations and improving the learning sample efficiency. However, data preprocessing may come at a cost if the preprocessing step requires approximations that lead to distortion of the information in the data. The primary jet image preprocessing steps include the following:
Translation: An important consideration when preparing inputs to a classifier is the set of symmetries of the data, i.e. the transformations of inputs that should not affect the classifier prediction. In the case of jet images, these symmetries are related to the physical symmetries of the system. At a particle collider, there is no preferred direction transverse to the beam line, and the physics should be invariant to azimuthal rotations in the transverse plane. In terms of jet images, given a fixed parent particle, the distribution of jet images at a given azimuthal coordinate φ = φ_a should not differ from the distribution at a different φ = φ_b. As such, an important preprocessing step is to translate all jet images to be "centered" at φ = 0. This is often performed by translating the highest p_T subjet (formed by clustering the jet constituents with a small R-parameter jet algorithm), or the jet energy centroid, to be located at φ = 0. The same invariance is not generically true for changes in η, as translations in η correspond to Lorentz boosts along the beam direction, which could alter the jet properties if not handled carefully. When energy is used for jet image pixel intensities, a translation in η while keeping pixel intensities fixed will lead to a change in the jet mass. However, when the transverse momentum, which is invariant to boosts along the beam direction, is used to define pixel intensities, a translation in η can be performed without altering the jet mass distribution. With this definition of pixel intensities, jet images are typically translated such that the leading subjet is located at η = 0. By centering the jet on the leading p_T subjet, the classifier can focus on learning the relative variations of a jet, which are key for classification.
Rotation: The radiation within a jet is also approximately symmetric about the jet axis in the η−φ plane. A common preprocessing step is thus to rotate jet images, after centering the image on the leading p_T subjet, such that the second leading p_T subjet, or the first principal axis of the spatial distribution of p_T in the image, is aligned along the y-axis of the image. However, there are challenges with rotations. First, rotations in the η−φ plane can alter the jet mass, thus potentially impacting the classification performance (alternative definitions of rotations have been proposed that preserve jet mass [22], but these may alter other key jet properties). Second, as jet images are discretized along the spatial dimensions, rotations by angles other than factors of π/2 cannot be performed exactly on the pixel grid. A typical solution is to fit a spline to the p_T distribution within a jet image, apply a rotation to this spline function, and then impose an image grid to discretize the spline back to an image. The interpolation and the post-rotation discretization can spatially smear information in the jet and lead to aliasing. As such, there is varying use of rotation preprocessing in jet image research.
Flipping: A transformation φ → −φ should not affect the physics of the jet, and this transformation can be performed to ensure that the positive-φ half of the image contains the half of the jet with more energy, for instance due to radiation emission.
Normalization: A step often found in image preprocessing for computer vision tasks is image normalization, typically through taking an L¹ norm of the image such that x_i → x_i / Σ_j x_j, where x_i is a pixel intensity and the sum runs over all pixels in an image. However, in the case of jet images, such a normalization may be destructive, as it does not preserve the mass of the jet (as computed from the pixels) and can deteriorate discrimination performance due to this loss of information [23]. As such, there is varying usage of image normalization in jet image research.

Fig. 2.: The impact of various preprocessing steps on the distribution of estimated jet mass for boosted W boson jet images [23].

The impact on the jet mass, as computed from the pixels of jet images, for W boson jets within a fixed p_T range and within a fixed pre-pixelation mass range can be seen in Figure 2. The distortions of the jet mass from pixelation, from rotations of images with energies as pixel intensities, and from L¹ normalization can be seen clearly, whilst translation and flipping do not distort the jet image mass. As expected, mild distortion of the mass can be seen when rotations are performed on jet images with transverse energy used for pixel intensities. These distortions may or may not be impactful on downstream tasks, depending on whether the jet mass is a key learned feature for the downstream model.
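Putting these steps together, the following is a rough sketch of such a preprocessing chain on a pixelated single-channel image. Centering on the brightest pixel (rather than on a properly clustered leading subjet), the use of scipy's spline-based rotation, and applying the L¹ normalization at all are simplifying assumptions for illustration, not any experiment's official pipeline.

```python
import numpy as np
from scipy import ndimage

def preprocess(image):
    """Illustrative preprocessing for a 2D numpy array of pixel pT intensities."""
    # Translation: roll the grid so the brightest pixel sits at the center.
    cy, cx = np.unravel_index(np.argmax(image), image.shape)
    image = np.roll(image, (image.shape[0] // 2 - cy,
                            image.shape[1] // 2 - cx), axis=(0, 1))
    # Rotation: approximately align the principal axis of the pT distribution
    # vertically, using spline interpolation (which can smear/alias).
    yy, xx = np.indices(image.shape)
    total = image.sum()
    my, mx = (image * yy).sum() / total, (image * xx).sum() / total
    syy = (image * (yy - my) ** 2).sum()
    sxx = (image * (xx - mx) ** 2).sum()
    sxy = (image * (yy - my) * (xx - mx)).sum()
    angle = 0.5 * np.arctan2(2 * sxy, sxx - syy)
    image = ndimage.rotate(image, np.degrees(angle), reshape=False, order=3)
    # Flipping: put the more energetic half on one fixed side.
    half = image.shape[1] // 2
    if image[:, :half].sum() > image[:, half:].sum():
        image = np.fliplr(image)
    # Normalization (optional; distorts the pixel-level jet mass): L1 norm.
    total = image.sum()
    return image / total if total > 0 else image
```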
4. Computer Vision and Convolutional Neural Networks
Object classification in Computer Vision (CV) tasks served as a primary setting where Deep Learning had major early successes [24], quickly surpassing then state-of-the-art approaches and serving as one of the drivers of a deep learning revolution. While much of the work in CV has focused on understanding natural images, data collected by physics experiments come from heterogeneous detectors, tend to be sparse, and do not have a clear similarity to natural images. Nonetheless, the success of DL in CV inspired a parallel effort in the collider physics community to explore applications of such techniques to HEP data. Below we present a brief introduction to convolutional neural networks (CNNs) [25] and some of the state-of-the-art architecture variants, in order to provide some background for the models used in jet tagging and other applications. For a more in-depth pedagogical introduction to this material see for instance [26].

Most of the models discussed in this text rely on the use of convolutional layers. However, it should be noted that some models make use of locally-connected layers [25, 27, 28], in which a given neuron only has access to a small patch of the input but, unlike convolutional layers that rely on weight sharing (as discussed below), the neuron processing each image patch is associated with a different set of weights.

Convolutional neural networks rely on neuron local spatial connectivity and weight sharing to produce translationally equivariant models that are well adapted to image analysis. A typical CNN is built by stacking one or more convolutional and non-linear activation layers, often followed by a pooling layer. This structure is repeated several times. Fully connected layers, with full connections from all inputs to activations, are used to perform the final classification or regression prediction. Images processed by CNNs are represented as 3D tensors with dimensions width × height × depth and are often referred to as the image volume. The height and width dimensions correspond to the spatial extent of the image while the depth is typically the color channel.
Convolutional layers are composed of a set of filters, where each filter applies an inner product between a set of weights and a small patch of an input image. The filter is scanned, or convolved, across the height and width of the image to produce a 2D map, often referred to as a response map or convolved image, that gives the response of applying the filter at each position of the image. The response at each position becomes large when the filter and the image patch match, i.e. when their inner product is large. The filters will thus learn to recognize visual features such as edges, textures, and shapes, and produce large responses when such visual features are present in a patch of an image. The spatial extent of the input patch is known as the receptive field or filter size, and the filters extend through the full depth of the image volume. Several filters are learned simultaneously to respond to different visual features. The response maps of the filters are then stacked in depth, producing an output convolved image volume. Finally, the response maps are passed through point-wise (i.e. per pixel) non-linear activations to produce an activation map.

By sharing weights between neurons, i.e. by scanning and applying the same filter at each image location, it is implicitly assumed that it is useful to apply the same set of weights to different image locations. This assumption is reasonable, as a visual feature may be present at any location in an image, and the filter is thus testing for that feature across the image. This results in the convolutional layers being translationally equivariant, in that if a visual feature is shifted in an image, the response to that feature will be shifted in the activation map. In addition, parameter sharing results in a dramatic reduction in the number of free parameters in the network relative to a fully connected network with the same number of neurons.
Pooling layers reduce the spatial extent of the image volume while leaving the depth of the volume unchanged [29]. This further reduces the number of parameters needed by the network and helps control overfitting. Pooling is often performed with a max operation, wherein only the largest activation in a region, typically 2 × 2, is kept for subsequent processing.
Normalization layers may be used to adjust activation outputs, typically to control the mean and variance of the activation distribution. This helps ensure that a neuron does not produce extremely large or small activations relative to other neurons, which can aid in gradient-based optimization and in mitigating exploding / vanishing gradients. Batch normalization [30] is a common normalization method in which, for each mini-batch, the mean and variance within the mini-batch of each activation dimension are used to normalize the activation to have approximately zero mean and unit variance. A linear transformation of the normalized activation, with learnable scale and offset parameters, is then applied.

Fully connected layers are applied at the end of the network, after flattening the image volume into a vector, in order to perform classification or regression predictions. At this stage, auxiliary information, potentially processed by a separate set of neural network layers, may be merged with the information gleaned from the processing by convolutional layers in order to ensure that certain features are provided for classification. Within the jet tagging context, such information may correspond to information about the jet, such as its mass, or global event information such as the number of interactions in a given collision.
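As a concrete example of the layer stack described above, a minimal PyTorch sketch of a jet image CNN follows. The filter counts, the large (11 × 11) first-layer filter (echoing the single channel studies discussed in Section 5.1), and the hidden layer size are illustrative values, not a reference implementation.

```python
import torch.nn as nn

class JetImageCNN(nn.Module):
    """Sketch of a jet-image tagger: stacked conv/ReLU/pool blocks
    followed by fully connected layers."""
    def __init__(self, channels=1, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=11, padding=5),  # large first
            nn.ReLU(),                                           # filter for sparse images
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(300),  # infers the flattened size on first forward pass
            nn.ReLU(),
            nn.Linear(300, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```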
Residual Connections:
While CNNs encode powerful structural information into the model, such as translation equivariance, it has been noted that scaling up such models by stacking large numbers of convolutional layers can lead to large challenges in training [31]. In order to train large models using the backpropagation algorithm, the chain rule is used to compute the gradient from the model output back to the relevant weight. In early layers, the multiplication of many gradients can lead to vanishingly small or exploding gradients, thus resulting in unhelpful gradient updates. To overcome this challenge, the residual block [32] was proposed, and has led to the development of residual networks. While a typical neural network layer passes an input z through a nonlinear function f(·) to produce an output z′ = f(z), a residual block also uses a "skip connection" to pass the input to the output, in the form z′_res = W_s z + f(z), where the weights W_s can be used to project the channels of z to have the same dimension as f(z). In this way, the function f(·) is tasked with learning the relative change to the input. Moreover, the skip connection provides a path to efficiently pass gradients backwards to earlier layers of the network without distortion through the nonlinearities, making gradient descent much easier and thus enabling the training of significantly deeper models. Note that the function f(·) can contain several layers of convolutions, nonlinearities, and normalization before being recombined with the input.
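A minimal sketch of such a residual block in PyTorch; the two-convolution body and the batch normalization placement are common conventions assumed here for illustration.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: output = W_s x + f(x), where a 1x1 convolution W_s
    matches channel dimensions when needed."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.skip(x) + self.f(x))
```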
Training in supervised learning tasks is performed by minimizing an appropriate loss function that compares the CNN prediction with a true label. The loss is typically the cross-entropy in the case of binary classification, and the mean squared error in the case of regression. Minimization is performed using stochastic gradient descent (SGD) [33], or one of its variants, such as ADAM [34], designed to improve the convergence of the learning algorithm.
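Schematically, the training loop then looks as follows, assuming the JetImageCNN sketch above and a hypothetical train_loader yielding batches of images and integer class labels.

```python
import torch

model = JetImageCNN()  # from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()  # classification loss on logits

for images, labels in train_loader:  # assumed DataLoader of (image, label)
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()  # backpropagation via the chain rule
    opt.step()       # stochastic gradient update
```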
Evaluation Metrics:
Receiver operating characteristic (ROC) curves are frequently used to examine and compare the performance of binary classification algorithms. Given a model which produces a classification prediction c(x), where x is the input features and c(·) ∈ [0, 1], a threshold τ on the prediction is scanned from 0 to 1, and the fraction of inputs from each of the signal and background classes above this threshold, i.e. the signal efficiency (ε_S) and background efficiency (ε_B) for surviving this threshold, defines a point on the ROC curve for each τ value. ROC curves thus display the background efficiency (or background rejection, defined as one divided by the background efficiency) versus the signal efficiency. When the ROC curve is defined as the background efficiency versus the signal efficiency, a metric commonly used to evaluate the overall model performance is the ROC integral, also known as the area under the curve (AUC).

Significance improvement characteristic (SIC) curves [35] are closely related to ROC curves, but display ε_S/√ε_B as a function of the signal efficiency ε_S. This curve targets displaying the potential improvement in statistical significance when applying a given discriminant threshold relative to not applying such a threshold.
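These metrics are straightforward to compute with scikit-learn; in this sketch, y_true and y_score are assumed arrays of true labels and classifier outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roc_and_sic(y_true, y_score):
    """y_true: 1 for signal, 0 for background; y_score: classifier outputs.
    Returns signal efficiency, background rejection, SIC, and AUC."""
    eps_b, eps_s, _ = roc_curve(y_true, y_score)  # fpr = eps_B, tpr = eps_S
    auc = roc_auc_score(y_true, y_score)
    mask = eps_b > 0  # avoid division by zero at the tightest thresholds
    rejection = 1.0 / eps_b[mask]
    sic = eps_s[mask] / np.sqrt(eps_b[mask])
    return eps_s[mask], rejection, sic, auc
```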
5. Jet tagging
Jet tagging refers to the classification of the parent particle which gave rise to a jet. Linear discriminant methods were first applied to jet images defined using a single channel, or "grey-scale", with pixel intensities defined as the calorimeter cell p_T [1]. Subsequently, CNN based classifiers trained on single channel images were developed for discriminating between W/Z jets and quark / gluon jets [23, 27], between top jets and quark / gluon jets [10, 36–38], and for discriminating between quarks and gluons [39]. Quark / gluon discrimination with single-channel jet images has also been explored for use in heavy ion collisions [40]. The extension to utilizing jet images "in color" with multiple channels, defined for instance using charged particle information, has shown promising performance improvements over single channel approaches in many of these tasks [10, 39, 41–45], and has been explored in realistic experimental settings by the ATLAS and CMS collaborations [46, 47].

Fig. 3.: Average W boson jet image (left) and average quark / gluon jet image (right) after preprocessing [23].

5.1. Jet Tagging on Single Channel Jet Images
W/Z Tagging: The discrimination of boosted W and Z vector boson initiated jets from quark / gluon jets has served as a benchmark task in boosted jet tagging. The color singlet nature of electroweak bosons decaying to quark pairs leads to an internal structure of boosted W/Z jets in which there are typically two high energy clusters, or subjets, and additional (dipole) radiation tends to appear in the region between the subjets. The Higgs boson, also a color singlet with decays to quark pairs, has a similar substructure, although decays to heavy flavour bottom and charm quark pairs can lead to some structural differences, owing to the long lifetimes of such quarks and their harder fragmentation relative to lighter quarks. In contrast, single quarks and gluons tend to produce jets with a high energy core, lower energy secondary subjets created through radiative processes, as well as diffuse wide angle radiation further from the core of the jet. These features can be seen clearly in Figure 3, which shows the average W boson jet image and the average quark / gluon jet image after preprocessing.

Building ML models applied to jet images for this discrimination task avoids the explicit design of physics-inspired features, and rather focuses on the learning task of identifying differences in the jet image spatial energy distributions. In phenomenological studies, both fully convolutional models [23] and models with locally connected layers [27] have been examined for discriminating jet images of boosted W and Z vector boson initiated jets from quark / gluon jets. The CNN models were examined in simulated samples of jets without pileup. The locally connected models were examined in events both with and without pileup, thus enabling the examination of the impact of pileup noise on jet image based tagging.

Within both the studies on convolutional [23] and locally connected [27] models, hyperparameter scans were performed to find model parameters that maximized performance (in the case of the CNNs, the AUC was maximized, whilst the Spearmint Bayesian Optimization package [48] was used to optimize the model with locally connected layers). The hyperparameters considered in the scans included the number of convolutional / locally connected layers, the number of hidden units per layer, and the number of fully connected layers. The resulting optimized models were similar, containing 3 to 4 convolutional or locally connected layers, as well as 2 to 4 fully connected layers with approximately 300 to 400 hidden units at the end of the network. In the CNN, 32 filters were used in each convolutional layer, as well as (2 × 2) or (3 × 3) downsampling after each convolutional layer.
One notable additional optimization performed for the CNN models was the size of the convolution filters in the first layer. While filter sizes are typically (3 × 3) or (4 × 4) in standard CV applications, in the case of application to jet images it was found that a larger (11 × 11) filter in the first convolutional layer (with later layers using standard (3 × 3) filter sizes) resulted in the best performance. It was hypothesized that such large filters were beneficial when applied to sparse images [23], in order to ensure that some non-zero pixels are likely to be found within the image patch supporting the filter application.

The ROC curves indicating the performance of the CNN model and locally connected model (applied to jets with pileup included) are shown in Figure 4. It should be noted that the jets in these figures correspond to different p_T ranges, with jets of p_T ∈ [250, 300] GeV in the former and jets with p_T above 300 GeV in the latter. Comparisons are also shown with jet substructure features, including the jet mass combined with the ΔR between the leading p_T subjets, the n-subjettiness ratio τ21 [49], and the energy correlation function D2(β=2) [50]. Two-variable combinations were computed using 2D binned likelihood ratios. Both the CNN and locally connected models significantly outperform the 2D jet substructure feature combinations. It can also be seen that the jet image approach is not overly sensitive to the effects of pileup, as the large performance gain over jet substructure features persists both with and without the presence of pileup, owing to the use of jet trimming to reduce the impact of pileup noise in the jets. In addition, a Boosted Decision Tree (BDT) classifier [51] combining six substructure features was compared with the locally connected model and found to have similar performance. While these early jet image based models did not significantly outperform combinations of several jet substructure features, this may be due to their relatively small model structure. As will be seen, more complex architectures and the use of multi-channel jet images can lead to large gains over combinations of jet substructure features.

One can also see the effect of L¹ image normalization on CNN models, which appears to improve performance over unnormalized images. This effect was found to occur because the CNN model output was observed to have only small correlation with the jet mass, and thus the model was not learning to rely heavily on the jet mass information that is distorted by normalization. As a result, the regularization of the image variations due to normalization was found to be beneficial enough to overcome the induced distortion of the jet mass. With more powerful models that learn representations more correlated with the jet mass, this balance may not occur.
Top Tagging: The discrimination between boosted top quark jets and quark / gluon jets using CNNs applied to jet images has also been examined, both in phenomenological studies [36] and in realistic simulations by the CMS experiment [47]. Top quark jet images are structurally more complex than W/Z/h jet images, as hadronic decays of top quarks contain three quarks. This can have implications for both the preprocessing and the tagging performance. That is, some of the pre-processing steps previously defined will lead to uniformity among jet images for two-quark systems, such as the rotation step which aligns the leading two subjets, but may not lead to the same level of uniformity for three-quark systems.

Fig. 4.: ROC curves for quark/gluon background rejection versus boosted W boson tagging efficiency for (a) events without pileup [23], and (b) events with pileup [27]. The jet image based CNN taggers are seen to outperform combinations of jet substructure features, and to be stable with respect to the addition of pileup.

The DeepTop [36] model is a CNN applied to single channel jet images after the preprocessing described above, including image normalization. Hyperparameter optimization yielded a model with four convolutional layers, each with 8 filters of size (4 × 4), followed by fully connected layers with 64 hidden units each, trained with a mean squared error (MSE) loss. While structurally similar to the single channel CNN used for W/Z tagging in reference [23], there are some notable differences, such as the use of fewer filters (8 rather than 32) and the smaller filter size in the first layer of convolution. The reasons for these differences may be that (a) the presence of three quarks in the top quark decay leads to more pixel-populated images and thus allowed for the use of smaller initial filter sizes, or (b) the global nature of the hyperparameter scan, wherein the number of filters and the size of the filters were fixed to be the same across all convolutional layers.

The performance of the DeepTop model can be found in Figure 5a, in terms of the ROC curve comparing the quark / gluon rejection versus the boosted top jet tagging efficiency for jets with p_T ∈ [350, 450] GeV. Comparisons are shown with a tagger combining SoftDrop with n-subjettiness, as well as with a BDT, denoted MotherOfTaggers, combining several jet substructure features. The jet image based DeepTop algorithm showed clear performance gains over substructure approaches across most of the signal efficiency range.

Fig. 5.: ROC curves for quark / gluon jet rejection versus boosted top efficiency for (a) the DeepTop model [36], and (b) the updated DeepTop model from reference [42]. In both cases, the CNN based DeepTop models outperform individual and BDT combinations of substructure features, while the updated model in (b) is also seen to significantly improve the DeepTop performance.
As previously mentioned, pre-processing steps have the potential to be beneficial for the learning process by producing more uniform images, but may also lead to performance degradation. This was studied within the scope of the DeepTop algorithm, by examining the tagging performance using full preprocessing and a minimal preprocessing that only performed centering but not the rotation or the flipping. As can be seen in Figure 5a, a clear performance benefit was observed when utilizing only minimal pre-processing. While the full pre-processing may be beneficial for small sample sizes, with sufficient sample sizes and model complexity the CNN models appear able to learn well all the variations in jet images. In this case, the approximations introduced by pre-processing steps appear to be more detrimental than the benefits from uniformization of the jet image distributions.

Building upon the DeepTop design, developments in architecture design, jet image preprocessing, and optimization were introduced in the phenomenological study of reference [42]. These developments include: (i) the cross entropy loss function, rather than the mean squared error loss, was used, as it is more suitable to binary classification problems, (ii) a learning rate adaptive optimizer, AdaDelta [52], with small mini-batch sizes of 128, was used rather than vanilla stochastic gradient descent with large mini-batches of 1000, (iii) larger numbers of filters per convolutional layer, between 64 and 128 rather than 8, were used, along with 256 neurons in the dense layers instead of 64, (iv) preprocessing was performed before pixelation, under the assumption that one would have access to high resolution particle momentum measurements, for instance using Particle Flow [14] approaches to jet reconstruction, and (v) the training set size was increased by nearly a factor of 10. While the individual effects of these developments will be examined further in Section 5.2 when discussing top tagging on multi-channel jet images, the combination of these developments can be seen in Figure 5b to provide large performance improvements over DeepTop, of nearly a factor of two in background rejection at fixed signal efficiencies.

In terms of more complex architectures, the ResNeXt-50 architecture [53] was adapted to the boosted top jet tagging task using single channel jet images in the phenomenological studies of reference [10]. ResNeXt-50 utilizes blocks containing parallel convolutional layers that are aggregated and merged, together with a residual connection, at the end of each block. As jet images typically have fewer pixels than natural images, the architecture was adapted to the top tagging dataset by reducing the number of filters by a factor of four in all but the first convolutional layer, and dropout was added before the fully connected layer.
In addition, smaller pixel sizes in the jet images were utilized in this model, with a granularity of 0.025 radians in η−φ space (whereas the jet image granularity typically used in other models is 0.1 radians in η−φ space).

The ROC curve comparing the ResNeXt-50 model to a CNN based on references [36, 42], and to several other neural network models with varying architectures, can be found in Figure 6. The ResNeXt-50 model provides approximately 20% improvement in background rejection at fixed signal efficiency over the CNN model, and is among the most performant algorithms explored. This is notable as many of the other neural network models utilize particle 4-vectors as inputs, rather than particle information aggregated within a pixel cell, and make use of particle charge information, while the ResNeXt model only utilizes the distribution of energy within the jet. However, the ResNeXt-50 model contains nearly 1.5 million parameters, far more than other models such as the CNN, which contains ≈ 34k parameters. Thus powerful information for discrimination can be extracted with jet image based models even from single channel images, but it may come at the price of models with large parameter counts.

Fig. 6.: ROC curve comparisons of various boosted top tagging models [10]. Both the ResNeXt and CNN curves are jet image based taggers using CNN based architectures.

This model comparison study has been performed in a phenomenological setting on particle level simulations, and the ultimate question remains as to the suitability of these models for use in real experiment settings. In experimental settings, realistic detector noise, detection efficiency, detector heterogeneity, and data taking conditions such as pileup, underlying event, and beam evolution will impact the model performance. Powerful models, including the large ResNeXt and CNN models, will likely have sufficient flexibility to learn powerful discriminators even in these more challenging settings. However, in general it remains to be seen if these models can be accurate whilst maintaining a low calibration error (where calibration in this context refers to the criterion that the predicted class probabilities correspond to the true probabilities of a given data input having a given label) [54], or if additional care is needed to ensure calibration. Moreover, applications in real experimental settings must consider systematic uncertainties associated with training ML models in (high fidelity) simulation but applying them to real data with potentially different feature distributions. The relationship between model complexity and sensitivity to systematic uncertainties in real experiment settings still remains to be thoroughly explored. The potential benefits in terms of sensitivity to systematic uncertainties when using neural networks with different structural assumptions, such as convolutional versus graph models, also require further study and will likely depend on the details of how a given systematic uncertainty affects the feature distributions. Some exploration of these challenges can be found in Section 5.3, examining model sensitivity to theoretical uncertainties, and in Section 5.4, examining applications of these models in HEP experiments. Nonetheless, these remain important and exciting avenues of future work.
Decorrelated tagging with Jet Images:
A common strategy in HEP to search for a particle is the so-called bump hunt, in which the particle would give rise to a localized excess on top of a smoothly falling background in the distribution of the mass of reconstructed particle candidates. For instance, one may aim to identify the W boson mass peak over the quark and gluon background in the distribution of jet mass. In addition to the particle mass being localized, a key to this strategy is that the smoothly falling background mass distribution can typically be characterized with simple parametric functions, thus facilitating a fit of the data to identify the excess above this background. Jet classification methods can cause challenges for this strategy, as the classifier may preferentially select jets with a specific mass, thereby sculpting the selected jet mass distribution of the background and rendering the search strategy unusable. As a result, one line of work has focused on de-correlating classifiers from a sensitive feature (e.g. mass) such that the sensitive feature is not sculpted by the application of the tagger. Such methods tend to rely on data augmentation or regularization, and overviews of these methods can be found for instance in references [55, 56]. Two recent regularization techniques that have seen strong de-correlation capability include (i) adversarial techniques [57, 58], wherein a second neural network is trained simultaneously with the jet classifier to penalize the jet classifier when the value of the sensitive feature can be predicted from the classifier's output or its hidden representations, and (ii) distance correlation regularizers [59], wherein the jet classifier loss is augmented with an additional regularization term which explicitly computes the correlation between the classifier predictions and the sensitive feature. In both cases, the amount of penalization from the regularization can be varied through a hyperparameter scaling the relative size of the regularization term with respect to the classification loss.

De-correlation for W boson jet tagging with jet images using CNNs was examined in phenomenological studies in reference [59], using a CNN architecture similar to the model described in [42]. The quark and gluon background jet mass distribution before and after applying a threshold on the output of a CNN can be seen in Figure 7a, showing a clear sculpting of the mass distribution. However, when the distance correlation regularization, or DisCo, is used during training, the mass distribution remains largely unsculpted after applying a classification threshold. The level of de-correlation can be estimated by examining the agreement between the mass distribution before and after applying a classifier threshold, for instance using the Jensen-Shannon divergence (JSD) computed between the binned mass distributions. For classifier thresholds fixed to 50% signal efficiency, Figure 7b shows the JSD as a function of the background rejection, where the curves are produced through training with varying sizes of the regularization hyperparameter. The CNN models are compared with neural networks trained on substructure features and other classifiers with de-correlation methods applied. The CNN models, with either adversarial or distance correlation regularization, are seen to typically provide the highest background rejection for a given level of de-correlation compared to other models.
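The distance correlation penalty can be sketched in a few lines of PyTorch. This is a simplified, biased-sample-estimator version for illustration, and may differ in detail (e.g. sample weighting) from the implementation used in reference [59].

```python
import torch

def distance_correlation(a, b):
    """Sample distance correlation (squared) between 1D float tensors a and b,
    e.g. classifier outputs and jet masses within a mini-batch."""
    A = torch.cdist(a.view(-1, 1), a.view(-1, 1))  # pairwise |a_i - a_j|
    B = torch.cdist(b.view(-1, 1), b.view(-1, 1))
    # Double-center the distance matrices
    A = A - A.mean(dim=0, keepdim=True) - A.mean(dim=1, keepdim=True) + A.mean()
    B = B - B.mean(dim=0, keepdim=True) - B.mean(dim=1, keepdim=True) + B.mean()
    dcov2_ab = (A * B).mean()
    dcov2_aa = (A * A).mean()
    dcov2_bb = (B * B).mean()
    return dcov2_ab / torch.sqrt((dcov2_aa * dcov2_bb).clamp_min(1e-12))

# Regularized training loss; lambda_disco trades classification power
# against mass sculpting:
# loss = bce(preds, labels) + lambda_disco * distance_correlation(preds, jet_mass)
```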
5.2. Multi-Channel Jet Tagging with CNNs
Recent work on jet image based tagging has shown performance gainsthrough the use of multi-channel images. While single channel jet im-ages have provided gains in classification performance over individual, orpairings of, engineered substructure features, the performance benefits weretypically smaller when compared to ML models trained on larger groups ofsubstructure features (except when very large models were used, as in [10]).Multi-channel jet images use calorimeter images as only a single input im-age channel, with additional channels computed from charged particle fea-tures such as the momentum, multiplicity, or charge. There is a signifi-cant amount of freedom in choosing the definition of the additional imagechannels, allowing for a flexibility in the choice of inductive bias to deliverrelevant information to the CNN.One challenge in combining charged particle trajectory information andcalorimeter images is the mismatch in resolution; charged particle trajecto-ries tend to have a significantly finer spatial resolution than calorimeters,thus leading to the questions of how to combine such information. Ascharged particles are not measured on a regular grid, often the same spa-tial grid for the calorimeter component is used for the charged particleimage and the energy of the constituents is summed within each pixel. Al- M. Kagan
50 100 150 200 250mass [GeV]10 n o r m a li z e d c o un t s before cutafter cut, no decor.after cut, DisCo (a) R / J S D -DDT D D -kNNAdaboostuBoostDNNDNN+planingDNN+adversaryDNN+distance correlationCNNCNN+planingCNN+adversaryCNN+distance correlation (b) Fig. 7.: (a) For boosted W boson tagging, the jet mass is shown beforeapplying a threshold on a trained CNN tagger and after applying a thresh-old on a standard and mass decorrelated tagger [59]. A clear reduction inmass sculpting is observed. (b) The rejection at 50% signal efficiency versusone over the Jensen-Shannon divergence, computed on the binned jet massdistribution before and after tagging, is shown for various taggers [59]. Jetimage based CNN taggers are seen to outperform other methods, eitherusing adversarial or distance-correlation based mass decorrelation.ternatively, separate CNN blocks (or upsampling procedures) can be usedto process charged and calorimeter images separately into a latent repre-sentation of equal size such that they can be merged for further processing.Note that when Particle Flow objects are used, and thus both neutral andcharged particle measurements do no necessarily fall on a grid, a fine gridcan be used to exploit the better charged particle momentum resolution. Itshould also be noted that while phenomenological studies at particle-leveloften use fixed grids to emulate the discretization of real detectors, differ-ent inputs (i.e. charge vs neutral) in real detector settings have differentresolutions which may be difficult to account for in simple discretizationapproaches.Multi-Channel jet image based tagging was introduced in phenomeno-logical studies of discriminating between quark initiated and gluon initiatedjets [39, 41] and has since been explored within the quark vs. gluon contexton the ATLAS experiment [46], in CMS Open Data [60, 61], and for tag-ging in heavy ion collision environments [40]. More broadly, multi-channel mage-Based Jet Analysis jet image tagging has lead to improved performance in phenomenologicalstudies of boosted top quark jet tagging [10, 37, 42], as well as in boosted W/Z jet tagging [43] and in boosted Higgs boson tagging [44, 45]. Notably,multi-channel jet image based boosted top tagging has been explored onthe CMS experiment [47] including the comparison and calibration of thisdiscriminant with respect to CMS collision data, thus adding additionalinsights into the usability of such models within LHC data analysis.The use of multi-channel jet images built from charged particle mo-mentum and multiplicity information within the context of discriminatingbetween quarks and gluons is natural, as the number of charged particleswithin such a jet is known to be a powerful discriminant for this challengingtask [62]. As such, in the phenomenological studies of reference [39] threejet image channels were defined: (1) the transverse momentum of chargedparticles within each pixel, (2) the transverse momentum of neutral parti-cles within each pixel, and (3) the charged particle multiplicity within eachpixel. The same pixel size was used in each image, thus facilitating thedirect application of multi-channel CNNs. This approach thus relies on theability to separate the charged and neutral components of a jet; while thecharged component is measured using tracking detectors, the unique iden-tification of the neutral component of a jet is significantly more challengingtask. 
The benefit of the multi-channel approach for quark vs. gluon discrimination can be seen in the ROC and SIC curves in Figure 8. Both the calorimeter only approach, denoted Deep CNN grayscale, and the multi-channel approach, denoted Deep CNN w/ color, outperform single features engineered for this task, BDTs trained on five such features, and a linear discriminant trained on the grayscale jet images. In addition, the multi-channel model is seen to dominate over the single channel model in both the ROC and SIC curves across the range of jet momenta studied, from roughly 100 GeV to 1000 GeV.

Multi-channel jet images have also been used for the discrimination of boosted Higgs boson jets decaying to b ¯b from background jets in multi-jet events [44].
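Both types of curves in Figure 8 can be obtained from a simple threshold scan over classifier outputs; a minimal sketch:

```python
import numpy as np

def roc_and_sic(scores_sig, scores_bkg, n_thresholds=200):
    """Threshold scan over classifier scores, returning the signal
    efficiency eps_S, background efficiency eps_B, and significance
    improvement SIC = eps_S / sqrt(eps_B) at each threshold."""
    all_scores = np.concatenate([scores_sig, scores_bkg])
    thresholds = np.linspace(all_scores.min(), all_scores.max(), n_thresholds)
    eps_s = np.array([(scores_sig > t).mean() for t in thresholds])
    eps_b = np.array([(scores_bkg > t).mean() for t in thresholds])
    sic = np.divide(eps_s, np.sqrt(eps_b),
                    out=np.zeros_like(eps_s), where=eps_b > 0)
    return eps_s, eps_b, sic
```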
Fig. 8.: ROC curve (a) and SIC curve (b) for quark versus gluon tagging using multi-channel jet images [39]. Comparisons with jet substructure based discriminants are shown in (a), while a comparison between single channel and multi-channel jet image based tagging with CNNs is shown in (b).

In addition to a CNN focused on discrimination based on jet images, the study of reference [44] also explored simultaneously processing an event image, defined using the aforementioned three channels over the entire calorimeter, through a separate set of convolutional layers and combining with the output of the convolutional processing of the jet image before discrimination. By including such an event image, one may explore the potential benefits of event topology information outside of the jet image for discrimination. The SIC curve for this discrimination task can be seen in Figure 9, where the CNN approaches were seen to significantly outperform single engineered features. CNNs using only the jet image, only the event image, or both (denoted "Full CNN Architecture" in Figure 9) were compared, showing that much of the discrimination power rests in the jet image whilst the event image may provide some modest improvements. In addition, the jet image discrimination without the neutral particle channel was found to be comparable to that using the neutral channel, indicating that much of the discrimination power lies in the charged particle information within the jet.

While a clear approach to extending jet images to contain multiple channels is to sum the momentum of the charged particles or compute multiplicities in each pixel to form an image channel, the high resolution of the charged particle information allows for the introduction of additional inductive bias. More specifically, given the set of charged particles contained in the region of a pixel, one may compute pixel-level features that may be more amenable to a given discrimination task.
Fig. 9.: SIC curve for boosted Higgs to b ¯b versus QCD background tagging, for jets with p_T,Higgs > 450 GeV, using multi-channel jet images [44]. Models using only jet images, and models using both jet images and "event images", are shown.

This approach was followed for building CNNs to discriminate between (a) up and down type quarks, and (b) quarks and gluons [41]. In these phenomenological studies, knowledge of the utility of the jet charge feature [63-65] for discriminating jets of different parent particle charge inspired the development of a jet image channel computed per pixel as the p_T weighted charge

Q_κ = ( ∑_j Q^(j) ( p_T^(j) )^κ ) / ( ∑_j p_T^(j) )^κ ,

where the sums run over the charged particles within the pixel. The SIC curve showing the performance of the CNN trained on the two channel jet images, one channel for the p_T and one for Q_κ per pixel, is shown in Figure 10a. The two channel CNN significantly outperformed the total jet charge and classifiers trained on engineered features, and is comparable to other deep architectures trained for this task.
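A minimal sketch of such a per pixel charge channel (the grid choices are illustrative, and κ is a hyperparameter of the charge definition):

```python
import numpy as np

def charge_image(eta, phi, pt, charge, kappa=0.2, n_pixels=33, half_width=0.8):
    """Per-pixel pT-weighted jet charge: in each pixel,
    Q_kappa = sum_j Q_j * pT_j**kappa / (sum_j pT_j)**kappa,
    summing over the charged particles falling in that pixel."""
    bins = np.linspace(-half_width, half_width, n_pixels + 1)
    num, _, _ = np.histogram2d(eta, phi, bins=(bins, bins),
                               weights=charge * pt**kappa)
    den, _, _ = np.histogram2d(eta, phi, bins=(bins, bins), weights=pt)
    return np.where(den > 0, num / np.maximum(den, 1e-12)**kappa, 0.0)
```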
Fig. 10.: SIC curves for (a) discriminating down quarks from up quarks [41], and (b) discriminating between W+ and W− bosons [43]. The CNNs take as input a p_T image together with a p_T weighted charge image with a fixed value of κ.

A similar jet charge based multi-channel CNN was explored for discriminating between boosted W+/W−/Z boson jets in the phenomenological studies of reference [43]. The per pixel charge image, averaged over the test set, for W+, W−, and Z jet images is shown in Figure 11. The geometry of all three images is similar, but the average per pixel charge differs significantly, as expected, with the typical W+ image carrying positive pixel values, the typical W− image carrying negative pixel values, and the typical Z image having charge close to zero. The SIC curve for discriminating between W+ and W− jets can be seen in Figure 10b. Two CNNs were explored in this work: one, denoted CNN, in which the p_T and Q_κ images are processed together (i.e. as a single multi-channel image processed by convolutional layers), and a second in which each channel is processed by a separate stack of convolutional layers before being combined ahead of the classification layers. Both CNNs significantly outperform methods based on engineered features.

Fig. 11.: The average image of the per pixel p_T weighted charge Q_κ is shown for W+ (left), W− (middle), and Z bosons (right) [43].

Multi-channel jet images were explored for top tagging in the phenomenological studies of reference [42], using four channel jet images defined with the neutral jet component as measured by the calorimeter, the charged particle summed p_T per pixel, the charged particle multiplicity per pixel, and the muon multiplicity per pixel. The architecture is discussed in Section 5.1 within the context of single channel jet images. The inclusion of the muon image channel targets the identification of b quark initiated subjets within the top jet, as muons can be produced in b-hadron decays. As noted in Section 5.1, several changes to the model architecture, preprocessing, and training procedure relative to the first proposed DeepTop model [36] were included in this work. The impact of these individual changes can be seen in Figure 12, wherein developments on top of the first proposed DeepTop model are sequentially added and the resulting ROC curve is shown. The inclusion of multiple "color" channels was only seen to provide modest performance gains over single channel jet images. Notable among the changes that led to the largest improvements were changing the optimization objective to be more suitable for classification tasks and changing the optimizer to Adam (denoted training in the figure), increasing the model size (denoted architecture in the figure), and increasing the training sample size. In agreement with these results, recent CNN models built for processing jet images have also tended to focus on larger models trained on large samples.
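As an illustration of the "training" improvements described above, a minimal sketch of a multi-channel jet image CNN optimized for classification with a cross-entropy objective and the Adam optimizer (in PyTorch; the architecture shown is an illustrative assumption, not the DeepTop architecture):

```python
import torch
import torch.nn as nn

# Small CNN over 4-channel jet images, producing a single logit.
model = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=4, padding="same"), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=4, padding="same"), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(64), nn.ReLU(),
    nn.Linear(64, 1),  # logit for top vs. QCD
)
loss_fn = nn.BCEWithLogitsLoss()  # classification-suited objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(images, labels):
    """images: (batch, 4, H, W) multi-channel jet images; labels: (batch,)."""
    optimizer.zero_grad()
    logits = model(images).squeeze(1)
    loss = loss_fn(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```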
Fig. 12.: ROC curve of boosted top jet tagging efficiency versus background quark and gluon rejection for the minimal DeepTop model [36], compared with models sequentially including the changes proposed in [42].
5.3. Sensitivity to Theory Uncertainties
While matrix element and parton shower Monte Carlo generators often provide high fidelity predictions of the data generation process, they provide only approximations of the scattering and showering processes and empirical models of the hadronization process. As such, uncertainties in the theoretical predictions of these generators must be propagated to downstream analyses. One mechanism for doing this is to compare an observable computed with samples from different Monte Carlo generators. While not a precise estimation of theoretical uncertainty, this comparison can provide a test of whether an observable is potentially sensitive to the differing approximations of the different generators.

This sensitivity has been examined for CNN-based taggers operating on jet images in several works; we focus here on W-tagging in a phenomenological study [66] and on quark / gluon tagging using ATLAS simulation [46]. As the CNN + jet image approaches utilize the distribution of energy throughout a jet image to discriminate, one concern is that the differences in modeling of the jet formation process by different generators may lead to large performance variations. To study this, reference [66] trained a CNN model on boosted W boson jet images generated by Pythia [67, 68] and applied this trained model to samples of boosted W boson jet images generated by different Monte Carlo generators. The ROC curves of the performance can be seen in Figure 13a, wherein, at the same signal efficiency, reductions of background rejection of up to 50% can be seen when this tagger is applied to different generators. While such a variation is not ideal, it should be noted that similar variations were seen when a tagger built only from substructure features, a binned two dimensional signal over background likelihood ratio of the distribution of jet mass and the N-subjettiness ratio τ21, is applied for the same tagging task. Similar levels of performance variation are also seen in the ROC curves built for quark vs gluon tagging in ATLAS simulation with a CNN trained on Pythia jet images applied to Herwig [69] generated jet images, as seen in Figure 13b. Interestingly, when the test is reversed and the CNN is trained on jet images from Herwig and applied to jet images from Pythia, the tagging performance is similar to the CNN trained and applied to Pythia jet images. This suggests that the CNNs in both cases are learning similar representations of information useful for quark vs gluon tagging, but that the degree to which this information is expressed in the jet images varies between generators [46]. Thus while these studies show that CNNs applied to different samples may vary in performance, there may be an underlying robustness to the information learned by CNNs for jet tagging.
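In practice, such generator robustness tests amount to evaluating a single trained tagger on test samples from each generator and comparing a summary metric, such as the background rejection at fixed signal efficiency; a minimal sketch:

```python
import numpy as np

def rejection_at_efficiency(scores_sig, scores_bkg, target_eff=0.5):
    """Background rejection 1/eps_B at fixed signal efficiency, for a
    fixed tagger evaluated on a given pair of test samples."""
    cut = np.quantile(scores_sig, 1.0 - target_eff)  # keeps target_eff of signal
    eps_b = (scores_bkg > cut).mean()
    return 1.0 / eps_b if eps_b > 0 else np.inf

# e.g. compare rejection_at_efficiency(sig_pythia, bkg_pythia) with
#      rejection_at_efficiency(sig_herwig, bkg_herwig) for the same model
```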
Fig. 13.: (a) ROC curves for boosted W tagging with a jet image based CNN tagger trained on Pythia generated jet images and applied to jet images from various generators [66]. (b) ROC curves for quark versus gluon tagging with jet image based CNN taggers trained and applied on Pythia and Herwig based jet images, in events with the full ATLAS detector simulation [46].

Beyond the potential tagging variations due to generator uncertainties, a key question when developing a jet observable of any kind is whether such an observable is theoretically sound and calculable. This is often expressed as whether the observable is infrared and collinear (IRC) safe. IRC safety for jet image based tagging of boosted top jets with CNNs has been examined empirically in the phenomenological studies of reference [70]. In this work, within the context of boosted top jet tagging using a jet image based CNN, a feature denoted ∆NN is studied which explores the impact of merging soft/collinear radiation with nearby partons. ∆NN is constructed as follows: (a) a CNN is trained on particle level jet images for boosted top tagging, (b) parton level jet images are generated for boosted top decays without (unmerged) and with (merged) adding the closest gluon to a top quark parton before forming the image, and (c) the difference in CNN output between the unmerged and merged jet images is defined as ∆NN. By examining the distribution of ∆NN and its variation with features that probe soft or collinear effects, the sensitivity of the CNN tagger to IRC effects can be studied empirically. This can be seen in Figure 14, where the 2D distributions of ∆NN versus the gluon relative transverse momentum, and versus the ∆R to the nearest parton, are shown. As either the gluon relative momentum or the ∆R tends to zero, the ∆NN distribution tends towards a sharp peak at 0, indicative of the CNN being insensitive to IRC perturbations.
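A minimal sketch of this probe, assuming a trained tagger `model` that maps a batch of jet images to logits (names and framework are illustrative):

```python
import torch

def delta_nn(model, unmerged_images, merged_images):
    """Empirical IRC-sensitivity probe: absolute difference in tagger
    output between parton-level jet images built without and with
    merging the nearest gluon into a top-decay parton."""
    model.eval()
    with torch.no_grad():
        out_unmerged = torch.sigmoid(model(unmerged_images)).squeeze(1)
        out_merged = torch.sigmoid(model(merged_images)).squeeze(1)
    return (out_unmerged - out_merged).abs()
```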
Fig. 14.: ∆NN versus (a) the gluon relative transverse momentum and (b) the ∆R between the gluon and the nearest top decay parton. ∆NN is the difference between the output of a particle level trained CNN applied on parton level jet images with and without merging the closest gluon with a top decay parton [70]. Red points denote the point at which 90% of events within a vertical slice of the distribution are contained.

5.4. Jet Images in LHC Experiments

The ultimate tests of the efficacy of jet image based tagging approaches are that the performance observed in phenomenological studies is also observed in realistic high fidelity simulations, and that this performance generalizes to real data without large systematic uncertainties. With that in mind, jet image based tagging approaches have been examined for quark vs gluon tagging in ATLAS simulations [46] and in CMS Open Data [60] simulated samples [61], and for boosted top quark jet tagging in CMS simulation and real data [47].

The ATLAS quark vs gluon jet image based CNN tagger [46] was trained using fully simulated ATLAS events [71, 72]. Multi-channel jet images were used, with one channel containing an image of the sum of measured charged particle track p_T per pixel. A second image for calorimeter measurements was examined in two forms: a jet image containing the transverse energy measured in calorimeter towers of size ∆η × ∆φ = 0.1 × 0.1, or a jet image built from topo-cluster measurements projected onto a grid. The CNN used convolutional layers with filters of size 5 × 5, 5 × 5, and 3 × 3, respectively, with max pooling after each convolutional layer. As can be seen in the ROC curve in Figure 15a, the CNN processing the track + tower jet images outperforms other standard taggers for quark vs gluon tagging. Interestingly, the standard tagger based on the combination of two jet substructure features (the number of charged particles and the jet width) outperforms the CNN approach at low quark efficiency. This is likely due to the track image discretization, which may result in multiple tracks falling in the same pixel; as track multiplicity is not stored in the images, this useful discriminating information is lost to the CNN. In Figure 15b, the impact on performance of utilizing different jet image channels is examined, wherein utilizing only calorimeter based jet images provides significantly lower performance than tagging using track and calorimeter images. In addition, topo-cluster based images, which are formed by projecting the continuous topo-cluster direction estimates into a discrete grid, are seen to have lower performance than tower based images. This is likely due to the projection onto a fixed grid for use in a CNN, as this may cause a loss of information about the spatial distribution of energy within a topo-cluster and may result in the overlap of several clusters in the same pixel. Moreover, it can be seen that the track + calorimeter image approach does not reach the performance found when a CNN is trained on a jet image formed from truth particles (i.e. without the impact of detector smearing). It was noted in [46] that when comparing the performance of a CNN trained on only track images to a CNN trained on only charged truth particles, the observed performances were extremely similar. This similarity is driven by the excellent charged particle track resolution, and further indicates that the difference between the track + calorimeter jet image based CNN tagger and the truth particle based CNN tagger is driven by the low resolution, and thus loss of information, of the calorimeter.

Fig. 15.: ROC curves for quark jet efficiency versus gluon jet rejection in ATLAS fully simulated datasets, showing comparisons of (a) jet image based CNN taggers against jet width and number-of-tracks discriminants, and (b) jet image based CNN taggers trained with different input images [46].

The CMS boosted top jet image based CNN tagger [47], denoted ImageTop, was trained on fully simulated CMS events [71]. Multi-channel jet images with six channels were built using particle flow (PF) objects found within an R = 0.8 jet, with channels formed from the summed candidate p_T per pixel: one channel containing all PF candidates, and one channel for each PF candidate flavor, i.e. charged, neutral, photon, electron, and muon candidates. ImageTop was based on the multi-channel DeepTop algorithm [42], and comprises four convolutional layers, each using 4 × 4 filters. To aid in identifying subjets originating from b quarks, a b-tagging identification score [73] evaluated on subjets of the large jet was also fed as input to the dense layers of the tagger before classification. In addition to the baseline ImageTop, a mass decorrelated version denoted ImageTop-MD was also trained, wherein the mass decorrelation was performed by down-sampling the background quark and gluon jet samples to have the same mass distribution as the sample of boosted top jets used for training. In this way, the discriminating information from the jet mass is removed to first order. (As the authors note, though this method is not guaranteed to remove tagger mass dependence, it was found to work sufficiently well in this case, as the baseline tagger inputs were not observed to have a strong correlation with mass.)
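A minimal sketch of this down-sampling style of mass decorrelation, assuming arrays of jet masses for the background and signal samples (binning choices are illustrative):

```python
import numpy as np

def downsample_to_match_mass(mass_bkg, mass_sig, n_bins=50, rng=None):
    """Down-sample background jets so their mass histogram matches the
    signal's shape, removing first-order mass information from training."""
    rng = rng or np.random.default_rng(0)
    bins = np.histogram_bin_edges(np.concatenate([mass_bkg, mass_sig]), n_bins)
    h_bkg, _ = np.histogram(mass_bkg, bins)
    h_sig, _ = np.histogram(mass_sig, bins)
    # per-bin keep probability proportional to the signal/background ratio
    ratio = np.where(h_bkg > 0, h_sig / np.maximum(h_bkg, 1), 0.0)
    keep_prob = ratio / ratio.max()
    idx = np.clip(np.digitize(mass_bkg, bins) - 1, 0, n_bins - 1)
    keep = rng.random(len(mass_bkg)) < keep_prob[idx]
    return keep  # boolean mask selecting the decorrelated background sample
```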
The ROC curves showing the performance of the ImageTop model are seen in Figure 16a. Several algorithms were compared to ImageTop, including several jet substructure feature based taggers and a deep neural network, denoted DeepAK8, based on processing PF candidates. ImageTop is seen to outperform all other algorithms except DeepAK8, and generally the deep network based taggers are found to significantly outperform the other algorithms. Moreover, once mass decorrelation is included, ImageTop-MD is found to be the highest performing mass-decorrelated model. The smaller change in performance due to mass decorrelation of ImageTop relative to other algorithms such as DeepAK8 may be due to the image preprocessing: images are both normalized and "zoomed" using a Lorentz boost determined by the jet p_T to increase the uniformity of jet images over the p_T range. These steps can result in a reduction of mass information in the images and thus a reduction of the learned dependence of ImageTop on the mass. The mass spectrum for background quark and gluon jets before (in grey) and after (in green) applying a 30% signal efficiency tagging threshold for ImageTop and ImageTop-MD can be seen in Figure 16b. The decorrelation method greatly helped to preserve the mass distribution and was not seen to significantly degrade performance, as seen in the ROC curves of Figure 16a.

Fig. 16.: Examination of the CMS ImageTop tagger [47] trained on fully simulated CMS events: (a) ROC curves of the quark / gluon jet efficiency versus boosted top jet tagging efficiency, comparing several taggers and showing the dominant performance of the deep neural network based taggers, and (b) the impact of applying a threshold on tagger outputs on the background jet mass distribution, wherein the mass decorrelated taggers show significantly less sculpting.
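A minimal sketch of such p_T dependent "zooming" (the reference scale and output size are illustrative assumptions, not the values used by ImageTop):

```python
import numpy as np
from scipy.ndimage import zoom

def zoom_jet_image(image, jet_pt, ref_pt=500.0, out_size=37):
    """Rescale the image by a pT-dependent factor, so that subjets of
    jets with different momenta (opening angle ~ 2m/pT) appear at more
    uniform separations, then center-crop or pad back to a fixed size."""
    factor = max(jet_pt / ref_pt, 1.0)
    z = zoom(image, factor, order=1)  # bilinear rescaling
    h, w = z.shape
    if h >= out_size:  # center crop
        i0, j0 = (h - out_size) // 2, (w - out_size) // 2
        return z[i0:i0 + out_size, j0:j0 + out_size]
    out = np.zeros((out_size, out_size))  # center pad
    i0, j0 = (out_size - h) // 2, (out_size - w) // 2
    out[i0:i0 + h, j0:j0 + w] = z
    return out
```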
As noted earlier, one concern with jet image based approaches to jet tagging is their potential dependence on pileup conditions. For a fixed ImageTop tagging threshold giving an inclusive 30% top jet tagging efficiency, the variation of the top jet tagging efficiency as a function of the number of primary vertices in the event can be seen in Figure 17. Efficiency variations for both ImageTop and ImageTop-MD were found to be small, at the level of less than 1%, across the range of the number of primary vertices. A similar level of stability was observed for the background mis-identification rate. This stability draws largely from the pileup mitigation applied to the jet before creating the jet images, and is not disturbed by the CNN discriminant.

Fig. 17.: Variations of the boosted top jet tagging efficiency as a function of the number of reconstructed vertices in an event, after applying a fixed tagger output threshold, for the CMS ImageTop tagger [47] and several other taggers trained on fully simulated CMS events.

While the simulation based training of classifiers can lead to powerful discriminants, differences in feature distributions between data and simulation could cause a tagger to perform differently in data and in simulation. As such, the discriminant is typically calibrated before application in data. Calibration entails defining control samples of jets in data where the tagging efficiency and mis-identification rate can be measured in both data and simulation. The efficiency of the tagger as a function of jet p_T is evaluated in data and simulation, and a p_T dependent ratio of efficiencies, known as a Scale Factor (SF), is derived. This SF can then be used to weight events such that the simulation trained tagger efficiency matches the data. The SFs for the ImageTop signal efficiency were estimated in a sample of single muon events selected to have a high purity of top-pair events in the 1-lepton decay channel, while the quark and gluon background mis-identification rates were estimated in dijet samples and in samples of photons recoiling off of jets. Systematic uncertainties were evaluated on the data based estimation of the tagging efficiency and propagated to SF uncertainties. These systematic uncertainties included theory uncertainties in the parton showering model, the renormalization and factorization scales, and the parton distribution functions, as well as experimental uncertainties on the jet energy scale and resolution, the unclustered energy contribution to the missing transverse momentum, trigger and lepton identification, pileup modeling, and integrated luminosity, together with the statistical uncertainties of the simulated samples.

Fig. 18.: Calibration scale factors as a function of jet p_T for (a) the top jet tagging efficiency in single muon events, and (b) the quark / gluon jet mistag efficiency in dijet events, for the CMS ImageTop tagger [47] trained on fully simulated CMS events and calibrated to data.

The scale factors for ImageTop and ImageTop-MD, for both the top tagging efficiency and the background mis-identification rate, can be found in Figure 18. The signal efficiency scale factors were largest at low momentum, showing a departure from unity of around 10%, but were significantly closer to unity in essentially all other p_T ranges. The systematic uncertainties ranged from approximately 5-10%, with the largest uncertainties at low p_T. The scale factors for the mis-identification rate tended to be larger, up to a 20% departure from unity in dijet samples, but with smaller scale factors in the photon+jet samples. These calibrations indicate that while some departures of the scale factors from unity are observed, they are largely consistent with observations from other taggers; the situation is similar for the scale factor uncertainties. As such, the jet image and CNN based tagging approach can be seen to work well in data, without extremely large calibrations and uncertainties, indicating its viability for use in analysis.
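A minimal sketch of the p_T-binned scale factor computation described above, assuming per-jet pass/fail flags and p_T values for the data and simulation control samples:

```python
import numpy as np

def scale_factors(passed_data, pt_data, passed_sim, pt_sim, pt_bins):
    """pT-binned data/simulation tagging-efficiency ratio (Scale Factor)."""
    def binned_eff(passed, pt):
        idx = np.digitize(pt, pt_bins) - 1
        return np.array([passed[idx == i].mean() if np.any(idx == i) else np.nan
                         for i in range(len(pt_bins) - 1)])
    return binned_eff(passed_data, pt_data) / binned_eff(passed_sim, pt_sim)
```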
6. Understanding Jet Image Based Tagging

Interpretability and explainability are vital when applying ML methods to physics analysis, in order to ensure that (i) reasonable and physical information is being used for discrimination rather than spurious features of the data, and (ii) when training models on simulation, the models are not highly reliant on information that may be mismodeled with respect to real data. Interpretability and explainability of deep neural networks is highly challenging and is an active area of research within the ML community [74]. While a large number of techniques exist for examining CNNs, a subset of the techniques from the ML community have been applied within the study of jet images. A benefit of the computer vision approach to jet analysis is that while the data input to ML models may be high dimensional, in this case with a large number of pixels, the inputs can be visualized on the image grid for inspection and interpretation. The tools for interpreting CNN models applied to jet images thus tend to center on this aspect, with tools such as pixel-discriminant correlation maps, filter examination, and finding images that maximally activate neurons.

Given a jet image x with pixel values {x_ij} and discriminant c(x), one can examine how changes to the input affect the discriminant prediction. Correlation maps examine the Pearson correlation coefficient between each pixel and the discriminant prediction, thus probing how each input feature is correlated with increases and decreases of the prediction over a sample of inputs. For a sample of N inputs, the correlation map is computed as

ρ_ij = [ 1 / (σ_{x_ij} σ_c) ] (1/N) ∑_{k=1}^{N} ( x_ij^(k) − x̄_ij ) ( c(x^(k)) − c̄ ),

where x̄_ij = (1/N) ∑_{k=1}^{N} x_ij^(k) and c̄ = (1/N) ∑_{k=1}^{N} c(x^(k)) are the mean pixel and prediction values, while σ²_{x_ij} = (1/N) ∑_{k=1}^{N} ( x_ij^(k) − x̄_ij )² and σ²_c = (1/N) ∑_{k=1}^{N} ( c(x^(k)) − c̄ )² are the variances of the pixel and prediction values.
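A minimal sketch of such a correlation map over a sample of images (numpy; shapes are illustrative):

```python
import numpy as np

def correlation_map(images, scores):
    """Pearson correlation of each pixel with the discriminant output:
    images has shape (N, H, W) and scores has shape (N,)."""
    x = images.reshape(len(images), -1)          # (N, H*W)
    x_c = x - x.mean(axis=0)                     # centered pixels
    s_c = scores - scores.mean()                 # centered predictions
    cov = (x_c * s_c[:, None]).mean(axis=0)      # per-pixel covariance
    rho = cov / (x.std(axis=0) * scores.std() + 1e-12)
    return rho.reshape(images.shape[1:])         # (H, W) correlation map
```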
The filters of a CNN perform local feature matching and are applied directly to the pixels of the image (or of a convolved image), and thus one may plot each filter as an image and examine what features each filter is targeting. As there can be a large number of filters at each CNN layer, as well as a large number of channels in layers deep within a CNN, this approach tends to be easiest at the first layers of the CNN. In addition, rather than examining the filters themselves, after processing an image with a CNN model one may examine the output of any given filter. This produces a convolved image in which the local feature matching has been applied at each position of the image, highlighting the locations of the image in which a given filter has become active. In order to highlight differences in convolved images between classes, the difference between the average convolved images of two classes can reveal relative differences in the spatial location of information relevant for discrimination.

Maximally activating images or image patches correspond to applying a CNN model on a large set of images and finding the images, or image patches, that cause a given neuron to output a large activation. In the case of neurons in the fully connected layers at the end of the network, this corresponds to full images, whilst for neurons in convolutional layers this corresponds to the image patches in which the neuron is most active.

6.1. Probing CNNs

In Figure 19, the filters in the CNNs for W tagging [23] and top tagging [36] with jet images are examined. Several filters from the first convolutional layer of the CNN for W tagging are shown in the top row of Figure 19a, and the bottom row shows the corresponding difference between the average convolved signal and background images resulting from applying each filter. While the filters are not easy to interpret, one can see dark regions of the filters corresponding to relative locations of large energy depositions in the jet image, as well as some intensity gradients that help identify regions where additional radiation may be expected. After applying the filters to sets of signal and background images and taking the difference of the average convolutions of each sample, one can explore how each filter finds different information in signal-like and background-like images. The more signal-like regions are shown in red, while the more background-like regions are shown in blue. The blue region at the centers identifies wider energy depositions at the center of the jet image, whilst the signal-like regions at the bottom of such images identify common locations of the subleading energy deposition. There is a strong focus on identifying signal-like radiation between the leading two energy depositions. Similarly for the DeepTop model, in Figure 19b one can see the convolved average image difference for several filters at each layer of the model, where the rows correspond to layer depth from top to bottom. We again see the tendency for the central region to be background-like, whilst the signal-like regions correspond to different locations of the subleading subjet and radiation between the two leading subjets. One can also see broader radiation patterns which vary depending on the location of the subleading subjet and attempt to identify likely locations of additional subjets in the image.

Fig. 19.: (a) Filters from the first convolutional layer for boosted W tagging with a jet image based CNN tagger [23] are shown in the top row, while the bottom row shows the average difference between signal and background convolved images from the corresponding filter in the top row. (b) The average difference between signal and background convolved images for several filters of the DeepTop jet image based CNN tagger for boosted top jets [36].

Figure 20 examines the average of the 500 images that lead to the highest activation of each of several neurons in the last (dense) layer of the CNN for W tagging [23]. The fraction of signal jet images in this sample is also noted, and the images are ordered left to right in terms of this signal fraction. The neuron that activates predominantly on signal jet images has a clear two prong structure and a tight core between the two prongs where radiation is expected. The neuron activating predominantly on background jet images shows a very different pattern, with a much broader central region where energy may be found and a broad ring around the central region where additional wide angle radiation may be present. These features are in agreement with the known physics of such jets.
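A minimal sketch of finding and averaging the maximally activating images for a chosen neuron (assuming a helper `neuron_activation` mapping an image batch to that neuron's scalar activation; names are illustrative):

```python
import torch

def top_activating_average(neuron_activation, images, k=500):
    """Average the k images that most activate a given neuron (cf. Fig. 20)."""
    with torch.no_grad():
        acts = neuron_activation(images)     # shape (N,)
    top_idx = torch.topk(acts, k).indices    # indices of k largest activations
    return images[top_idx].mean(dim=0)       # average jet image
```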
Fig. 20.: The average jet images which most activate given neurons in the final layer of a jet image based CNN for boosted W tagging [23]. The fraction of signal events for each neuron is noted, indicating whether the neuron was most activated by signal-like or background-like images.

For a more global view of what the discriminant has learned, one can examine the correlation maps for the CNN W tagger [23] and the DeepTop model [36], using full preprocessing, in Figure 21. Structurally they are quite similar (d), however the regions of signal (red) and background (blue) correlation appear inverted. For W tagging, the location of the subleading subjet at the bottom of the image is a strong indicator of signal, owing to the fact that W jets have a two particle decay structure which strongly restricts the relative location of the two subjets for a fixed jet p_T. This relative location is not as strict in quark/gluon jets and may vary due to additional radiation. The region around the central core of the jet is correlated with background-like images, where additional radiation may be found. For top tagging, a strong energy deposition above the central leading energy deposition, as well as additional energy depositions, i.e. the third expected subjet in a top quark decay, are correlated with signal-like images. This correlation pattern indicates that the discriminant relies heavily on the identification of the third subjet, as would be expected in a top quark decay.

(d) The relative location of the second subjet was rotated to be below the leading subjet in the case of W tagging and above the leading subjet for DeepTop, which leads to the apparent flip of the correlation images over the horizontal axis.

Fig. 21.: Correlation images showing the Pearson linear correlation coefficient per pixel between jet image pixels and a jet image based CNN tagger output for (a) boosted W tagging [23], and (b) boosted top tagging with DeepTop [36].

7. Other Applications of Jet Images

In addition to classification tasks, the approach of using jet images and convolutional layers for processing has also been explored for several other data analysis challenges. We briefly examine some of these applications, showing how this computer vision approach to jet analysis can be powerful in a variety of settings.

7.1. Jet Energy Regression and Pileup Removal

Among the major challenges facing analyses utilizing jets at high luminosity hadron colliders is the presence of pileup, i.e. interactions occurring in the same bunch crossing as the primary hard scattering. Pileup interactions lead to additional particles which may fall within the catchment area of a jet and thus are effectively "noise" in the estimation of jet properties. A variety of techniques have been proposed for pileup mitigation in jets [17], ranging from subtracting an average pileup energy density from a jet to techniques targeting the classification of each particle in a jet as originating from pileup or from the hard scatter.

Within the paradigm of jet images, one approach to pileup mitigation is to predict the per pixel pileup contributions, as is done in the PUMML method [75]. In this technique, a jet can be considered as composed of four components: the charged and neutral hard scatter contributions, and the charged and neutral pileup contributions. While the charged components of the hard scatter and pileup are known from charged particle tracking measurements, the neutral hard scatter and pileup components are only observed together in calorimeter measurements.
PUMML performs a per pixel regression of the neutral component of the hard scatter contributions to the jet. A multi-channel jet image was used as input, with one channel for each of the hard scatter and pileup charged components of the jet, and one channel for the combined neutral component. As the charged contributions measured by tracking detectors have significantly better resolution than the neutral component, a significantly smaller pixel size of ∆η × ∆φ = 0.025 × 0.025 was used for the charged images than the ∆η × ∆φ = 0.1 × 0.1 used for the neutral image. The impact of PUMML on the reconstruction of the jet p_T and jet mass, in comparison with other pileup mitigation techniques such as SoftKiller and PUPPI, can be seen in Figure 22.

Fig. 22.: The impact of pileup mitigation on (a) the jet p_T, and (b) the jet mass, for various mitigation techniques including the jet image based PUMML algorithm [75].

7.2. Generative Models with Jet Images

Among the earliest work applying deep generative models as approximations of HEP high fidelity simulators made use of jet images as the data representation [28]. The aim of this work was to learn the structure of jet images as they may appear in a calorimeter and to subsequently draw sample jets from the learned generative model. As a neural generative model can be significantly faster than running a high fidelity simulator, such approaches have the potential to significantly reduce the large simulation times in HEP. In the phenomenological studies of reference [28], a generative adversarial network (GAN) setup was used to train a generative model to transform samples from a standard normal distribution into samples of jet images, whilst a second, discriminator network was used to penalize the generative model if it could discriminate between real and generated jet images. Locally connected layers as well as convolutional layers were investigated for use in the networks. The distributions of p_T for W boson jets and for quark/gluon jets were compared between the Pythia simulator [67, 68] and the GAN generated images, as seen in Figure 23a. Figure 23b shows a set of Pythia simulated jet images in the top row and their nearest neighbor GAN generated jet images in the bottom row. Both the distributions of jet properties and the general structure of jet images were reasonably well reproduced by the GAN approach. While not yet reaching the fidelity of HEP simulators, this early work in HEP data generation showed the potential utility of fast approximate simulators built from deep generative models for HEP.
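A minimal sketch of the adversarial training loop underlying such an approach (PyTorch; a generic GAN step, not the specific locally connected architecture of reference [28]):

```python
import torch
import torch.nn as nn

def gan_step(G, D, real_images, opt_g, opt_d, latent_dim=64):
    """One GAN training step: G maps latent noise to a jet image batch;
    D maps an image batch to (batch, 1) real/fake logits. Both networks
    and the latent size are assumptions of this sketch."""
    bce = nn.BCEWithLogitsLoss()
    n = real_images.size(0)
    z = torch.randn(n, latent_dim)
    fake_images = G(z)
    # Discriminator update: real images -> 1, generated images -> 0.
    opt_d.zero_grad()
    loss_d = bce(D(real_images), torch.ones(n, 1)) + \
             bce(D(fake_images.detach()), torch.zeros(n, 1))
    loss_d.backward()
    opt_d.step()
    # Generator update: penalized when the discriminator spots its fakes.
    opt_g.zero_grad()
    loss_g = bce(D(fake_images), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```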
Fig. 23.: (a) The jet image p_T distributions for W boson jets and quark/gluon jets, comparing the GAN generated distributions to the Pythia simulated distributions [28]. (b) A visual comparison of Pythia simulated (top) and the nearest GAN generated (bottom) jet images.

7.3. Anomaly Detection

The use of CNNs to process jet images provides a powerful scheme to learn useful representations of the information contained within a jet. In typical classification tasks, these representations are used for discriminating classes of jets. However, when searching for signs of new physics, one may not know a priori the properties of such a new signal, only that such a signal would have properties that deviate from known Standard Model processes. Such anomaly detection tasks are challenging due to the lack of signal knowledge, and thus the inability to use standard classifiers for this task. Within the context of a search for jets produced by new particles, recent work has combined the power of CNN representation learning on jet images with autoencoder network architectures [77, 78] to search for anomalous jets [79, 80].

Autoencoder models are designed to map an input to a compressed latent representation through an "encoder", and then decompress the latent representation back to the original input via a "decoder". Such models are trained to minimize the "reconstruction error", computed as the MSE between the original input and the autoencoder output. The reconstruction error can be used to identify inputs that are not well adapted to the compression and reconstruction scheme learned by the autoencoder. When used for anomaly detection, autoencoders are trained to compress and reconstruct one class of events. Under the assumption that this compression and reconstruction scheme will not be well adapted to inputs from classes different from the training sample, the reconstruction of inputs from new classes is expected to perform poorly and thus lead to a large reconstruction error.

When applied to searches for anomalous jets in phenomenological studies, jet images have been examined as the data representation, and convolutional layers combined with max pooling and with upsampling have been used for the encoder and decoder, respectively. In this case, the autoencoder is trained on a background sample of standard quark and gluon jets, and the ability to identify different signal jets is examined. The reconstruction error was used directly to search for excesses of events, as seen in Figure 24, where the signal was either a sample of top quark jets or jets from a hypothetical new gluino particle. The distribution of the reconstruction error shows a large separation between the background, denoted QCD, and the potential signal jets. The ROC curve for identifying top jets, produced by scanning a threshold on the reconstruction error, is also shown, comparing the CNN based autoencoder with a dense architecture autoencoder applied to a flattened vector of pixel p_T's (denoted Dense), principal components analysis, and a threshold on the jet mass alone. The jet image + CNN architecture approach dominated the other methods, although it should be noted that this domination was not seen for gluino jets.

Fig. 24.: (a) The distribution of the reconstruction error of an autoencoder trained on quark / gluon jets, showing potential top or gluino signal distributions. (b) ROC curves for identifying boosted top jets from quark / gluon jets using autoencoders with various architectures, wherein the jet image based CNN is shown to outperform other methods [80].
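A minimal sketch of such a convolutional autoencoder and its reconstruction-error anomaly score (PyTorch; the layer sizes and the 40 × 40 single channel image size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class JetAutoencoder(nn.Module):
    """Convolution + max pooling encoder, upsampling decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):  # x: (batch, 1, 40, 40)
        return self.decoder(self.encoder(x))

def anomaly_score(model, images):
    """Per-jet MSE reconstruction error, used as the anomaly score."""
    with torch.no_grad():
        recon = model(images)
    return ((recon - images) ** 2).mean(dim=(1, 2, 3))
```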
One challenge with autoencoder approaches to anomalous jet searches is the possibility that the autoencoder reconstruction quality depends on the jet mass, in which case the signal identification efficiency could be mass dependent. Moreover, if a bump hunt analysis in the jet mass spectrum is subsequently performed, such a correlation of the reconstruction error with mass could sculpt the jet mass distribution and render the bump hunt strategy infeasible. To overcome this challenge, an adversarial approach was investigated in reference [79], wherein a second network is simultaneously trained with the autoencoder to predict the jet mass from the autoencoder output, whilst the autoencoder is penalized during training if the second network is successful. The resulting adversarial autoencoder performance for identifying a top jet signal can be seen in Figure 25. With the adversary in use, the jet mass distribution was kept relatively stable even when applying a threshold on the reconstruction error which only permits a 5% background jet false positive rate. However, as seen in the ROC curve, increasing the strength λ of the adversarial penalty on the autoencoder could significantly decrease the top jet signal sensitivity.

Fig. 25.: (a) The jet mass distribution after applying a threshold allowing 3% or 5% quark / gluon jet background efficiency on the jet image based adversarial autoencoder. The background is largely unsculpted and the top jet peaks can be clearly seen [79]. (b) ROC curves for quark / gluon jet rejection versus top jet efficiency for jet image based adversarial autoencoders with varying strength of the adversarial penalty during training [79].

8. Conclusion

The representation of jets as images has proven highly useful for connecting the fields of high energy physics and machine learning. Through this connection, advanced methods in deep learning and computer vision, primarily with convolutional neural network architectures, have been applied to the challenges of jet physics and have shown promising performance both in phenomenological studies and in experiments at the LHC. Jet images have seen a broad set of use cases, not only for jet classification but also for energy regression, pileup noise removal, data generation, and anomaly detection. Image based jet tagging remains an active area of research, and broad classes of state of the art deep neural network architectures for computer vision are being explored within the field of high energy physics.

While much of the work presented in this text has been in phenomenological studies using particle level simulations, there remain open questions about the applicability of these methods to high-fidelity simulated data and to real experimental data. In more realistic settings, the complexity of the detector and of the data-taking conditions, and the differences between simulated and real data, will be key challenges for understanding and optimizing these models. Understanding the relationships in realistic data between model accuracy and calibration error, and between model complexity or structural assumptions and sensitivity to systematic uncertainties, will be important for the long-term efficacy of these image-based methods. Nonetheless, initial results from both ATLAS and CMS have shown promise, pointing towards the exciting potential of jet imaging in the future.

Acknowledgments

This work was supported by the US Department of Energy (DOE) under grant DE-AC02-76SF00515, and by the SLAC Panofsky Fellowship.
References

[1] J. Cogan, M. Kagan, E. A. Strauss, and A. Schwartzman, Jet-images: computer vision inspired techniques for jet tagging, Journal of High Energy Physics (2015) 1.
[2] S. Marzani, G. Soyez, and M. Spannowsky, Looking inside jets: an introduction to jet substructure and boosted-object phenomenology. Springer, Cham, Switzerland, 2019.
[3] J. Shelton, Jet Substructure, pp. 303-340. World Scientific, 2013.
[4] L. Evans and P. Bryant, LHC Machine, Journal of Instrumentation (Aug. 2008) S08001.
[5] A. Andreassen, I. Feige, C. Frye, and M. D. Schwartz, JUNIPR: a framework for unsupervised machine learning in particle physics, The European Physical Journal C (2019) 2, 102.
[6] ATLAS Collaboration, Deep Sets based Neural Networks for Impact Parameter Flavour Tagging in ATLAS, Tech. Rep. ATL-PHYS-PUB-2020-014, CERN, Geneva, May 2020. https://cds.cern.ch/record/2718948.
[7] G. Louppe, K. Cho, C. Becot, and K. Cranmer, QCD-aware recursive neural networks for jet physics, Journal of High Energy Physics (2019) 1, 57.
[8] P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy flow networks: deep sets for particle jets, Journal of High Energy Physics (2019) 1, 121.
[9] H. Qu and L. Gouskos, Jet tagging via particle clouds, Physical Review D (Mar. 2020) 056019.
[10] G. Kasieczka, T. Plehn, A. Butter, K. Cranmer, D. Debnath, B. M. Dillon, M. Fairbairn, D. A. Faroughy, W. Fedorko, C. Gay, et al., The Machine Learning landscape of top taggers, SciPost Physics (Jul. 2019).
[11] B. Graham and L. van der Maaten, Submanifold Sparse Convolutional Networks, arXiv:1706.01307 [cs.NE].
[12] M. Cacciari, G. P. Salam, and G. Soyez, The anti-kt jet clustering algorithm, Journal of High Energy Physics (Apr. 2008) 063.
[13] ATLAS Collaboration, Topological cell clustering in the ATLAS calorimeters and its performance in LHC Run 1, The European Physical Journal C (2017) 7, 490.
[14] CMS Collaboration, Particle-flow reconstruction and global event description with the CMS detector, Journal of Instrumentation (Oct. 2017) P10003, arXiv:1706.04965 [physics.ins-det].
[15] R. Kogler et al., Jet Substructure at the Large Hadron Collider: Experimental Review, Reviews of Modern Physics (2019) 4, 045003, arXiv:1803.06991 [hep-ex].
[16] A. J. Larkoski, I. Moult, and B. Nachman, Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning, Physics Reports (2020) 1, arXiv:1709.04464 [hep-ph].
[17] G. Soyez, Pileup mitigation at the LHC: A theorist's view, Physics Reports (Apr. 2019) 1.
[18] D. Krohn, J. Thaler, and L.-T. Wang, Jet trimming, Journal of High Energy Physics (Feb. 2010).
[19] D. Bertolini, P. Harris, M. Low, and N. Tran, Pileup per particle identification, Journal of High Energy Physics (Oct. 2014).
[20] Particle Data Group, Review of Particle Physics, Progress of Theoretical and Experimental Physics (Aug. 2020).
[21] M. Cacciari, G. P. Salam, and G. Soyez, The catchment area of jets, Journal of High Energy Physics (Apr. 2008).
[22] J. Pearkes, W. Fedorko, A. Lister, and C. Gay, Jet Constituents for Deep Neural Network Based Top Quark Tagging, arXiv:1704.02124 [hep-ex].
[23] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. G. Schwartzman, Jet-images - deep learning edition, Journal of High Energy Physics (2016) 1.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Communications of the ACM (May 2017) 84.
[25] Y. LeCun, Generalization and network design strategies, Tech. Rep. CRG-TR-89-4, University of Toronto, 1989. http://yann.lecun.com/exdb/publis/pdf/lecun-89.pdf.
[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[27] P. Baldi, K. Bauer, C. Eng, P. Sadowski, and D. Whiteson, Jet substructure classification in high-energy physics with deep neural networks, Physical Review D (May 2016) 094034.
[28] L. de Oliveira, M. Paganini, and B. Nachman, Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis, Computing and Software for Big Science (2017) 1.
[29] D. Scherer, A. Müller, and S. Behnke, Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition, in Artificial Neural Networks - ICANN 2010, K. Diamantaras, W. Duch, and L. S. Iliadis, eds. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[30] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds. Proceedings of Machine Learning Research, Lille, France, Jul. 2015. http://proceedings.mlr.press/v37/ioffe15.html.
[31] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington, eds. JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Sardinia, Italy, May 2010. http://proceedings.mlr.press/v9/glorot10a.html.
[32] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization Methods for Large-Scale Machine Learning, SIAM Review (2016) 2, 223, arXiv:1606.04838 [stat.ML].
[34] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, in International Conference on Learning Representations, 2015. arXiv:1412.6980 [cs.LG].
[35] J. Gallicchio, J. Huth, M. Kagan, M. D. Schwartz, K. Black, and B. Tweedie, Multivariate discrimination and the Higgs+W/Z search, Journal of High Energy Physics (Apr. 2011) 69, arXiv:1010.3698 [hep-ph].
[36] G. Kasieczka, T. Plehn, M. C. Russell, and T. Schell, Deep-learning top taggers or the end of QCD?, Journal of High Energy Physics (2017) 1.
[37] S. Diefenbacher, H. Frost, G. Kasieczka, T. Plehn, and J. M. Thompson, CapsNets Continuing the Convolutional Quest, arXiv:1906.11265 [hep-ph].
[38] S. Bollweg, M. Haussmann, G. Kasieczka, M. Luchmann, T. Plehn, and J. Thompson, Deep-Learning Jets with Uncertainties and More, SciPost Physics (2020) 6.
[39] P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, Journal of High Energy Physics (2017) 1.
[40] Y.-T. Chien and R. Kunnawalkam Elayavalli, Probing heavy ion collisions using quark and gluon jet substructure, arXiv:1803.03589 [hep-ph].
[41] K. Fraser and M. D. Schwartz, Jet charge and machine learning, Journal of High Energy Physics (2018) 10, 93.
[42] S. Macaluso and D. Shih, Pulling out all the tops with computer vision and deep learning, Journal of High Energy Physics (2018) 10, 121.
[43] Y.-C. J. Chen, C.-W. Chiang, G. Cottin, and D. Shih, Boosted W and Z tagging with jet charge and deep learning, Physical Review D (Mar. 2020) 053001. https://link.aps.org/doi/10.1103/PhysRevD.101.053001.
[44] J. Lin, M. Freytsis, I. Moult, and B. Nachman, Boosting H → b ¯b with machine learning, Journal of High Energy Physics (Oct. 2018) 101, arXiv:1807.10768 [hep-ph].
[45] J. H. Kim, M. Kim, K. Kong, K. T. Matchev, and M. Park, Portraying double Higgs at the Large Hadron Collider, Journal of High Energy Physics (Sep. 2019) 47, arXiv:1904.08549 [hep-ph].
[46] ATLAS Collaboration, Quark versus Gluon Jet Tagging Using Jet Images with the ATLAS Detector, Tech. Rep. ATL-PHYS-PUB-2017-017, CERN, Geneva, Jul. 2017. https://cds.cern.ch/record/2275641.
[47] CMS Collaboration, Identification of heavy, energetic, hadronically decaying particles using machine-learning techniques, Journal of Instrumentation (2020) 06, P06005, arXiv:2004.08262 [hep-ex].
[48] J. Snoek, H. Larochelle, and R. P. Adams, Practical Bayesian Optimization of Machine Learning Algorithms, in Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012. https://papers.nips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html.
[49] J. Thaler and K. Van Tilburg, Identifying boosted objects with N-subjettiness, Journal of High Energy Physics (Mar. 2011).
[50] A. J. Larkoski, I. Moult, and D. Neill, Power counting to better jet observables, Journal of High Energy Physics (Dec. 2014).
[51] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, NY, 2nd ed., 2009.
[52] M. D. Zeiler, ADADELTA: An Adaptive Learning Rate Method, arXiv:1212.5701 [cs.LG].
[53] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 5987.
[54] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, On Calibration of Modern Neural Networks, in Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh, eds. Proceedings of Machine Learning Research, Sydney, Australia, Aug. 2017. http://proceedings.mlr.press/v70/guo17a.html.
[55] ATLAS Collaboration, Performance of mass-decorrelated jet substructure observables for hadronic two-body decay tagging in ATLAS, Tech. Rep. ATL-PHYS-PUB-2018-014, CERN, Geneva, Jul. 2018. https://cds.cern.ch/record/2630973.
[56] L. Bradshaw, R. K. Mishra, A. Mitridate, and B. Ostdiek, Mass Agnostic Jet Taggers, SciPost Physics (2020) 1, 011, arXiv:1908.08959 [hep-ph].
[57] G. Louppe, M. Kagan, and K. Cranmer, Learning to Pivot with Adversarial Networks, in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017. https://papers.nips.cc/paper/2017/hash/48ab2f9b45957ab574cf005eb8a76760-Abstract.html.
[58] C. Shimmin, P. Sadowski, P. Baldi, E. Weik, D. Whiteson, E. Goul, and A. Søgaard, Decorrelated Jet Substructure Tagging using Adversarial Neural Networks, Physical Review D (2017) 7, 074034, arXiv:1703.03507 [hep-ex].
[59] G. Kasieczka and D. Shih, DisCo Fever: Robust Networks Through Distance Correlation, arXiv:2001.05310 [hep-ph].
[60] CERN, "CERN Open Data Portal." http://opendata.cern.ch/, 2017.
[61] M. Andrews, J. Alison, S. An, B. Burkle, S. Gleyzer, M. Narain, M. Paulini, B. Poczos, and E. Usai, End-to-end jet classification of quarks and gluons with the CMS Open Data, Nuclear Instruments and Methods in Physics Research A (Oct. 2020) 164304.
[62] J. Gallicchio and M. D. Schwartz, Quark and gluon jet substructure, Journal of High Energy Physics (2013) 1.
[63] R. Field and R. Feynman, A parametrization of the properties of quark jets, Nuclear Physics B (1978) 1, 1.
[64] D. Krohn, M. D. Schwartz, T. Lin, and W. J. Waalewijn, Jet Charge at the LHC, Physical Review Letters (May 2013) 212001.
[65] W. J. Waalewijn, Calculating the charge of a jet, Physical Review D (Nov. 2012) 094030.
[66] J. Barnard, E. N. Dawe, M. J. Dolan, and N. Rajcic, Parton shower uncertainties in jet substructure analyses with deep neural networks, Physical Review D (Jan. 2017) 014018.
[67] T. Sjöstrand, S. Mrenna, and P. Skands, A brief introduction to PYTHIA 8.1, Computer Physics Communications (2008) 11, 852.
[68] T. Sjöstrand, S. Mrenna, and P. Skands, PYTHIA 6.4 physics and manual, Journal of High Energy Physics (May 2006) 026.
[69] J. Bellm, S. Gieseke, D. Grellscheid, A. Papaefstathiou, S. Platzer, P. Richardson, C. Rohr, T. Schuh, M. H. Seymour, A. Siodmok, A. Wilcock, and B. Zimmermann, Herwig++ 2.7 Release Note, arXiv:1310.6877 [hep-ph].
[70] S. Choi, S. J. Lee, and M. Perelstein, Infrared safety of a neural-net top tagging algorithm, Journal of High Energy Physics (2018) 1.
[71] Geant4 Collaboration, Geant4 simulation toolkit, Nuclear Instruments and Methods in Physics Research A (2003) 3, 250.
[72] ATLAS Collaboration, The ATLAS Simulation Infrastructure, The European Physical Journal C (2010) 3, 823.
[73] CMS Collaboration, Performance of b-tagging algorithms in proton-proton collisions at 13 TeV with Phase 1 CMS detector, Tech. Rep. CMS-DP-2018-033, CERN, Jun. 2018. https://cds.cern.ch/record/2627468.
[74] U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, Y. Jia, J. Ghosh, R. Puri, J. M. F. Moura, and P. Eckersley, Explainable machine learning in deployment, Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020).
[75] P. T. Komiske, E. M. Metodiev, B. Nachman, and M. D. Schwartz, Pileup Mitigation with Machine Learning (PUMML), Journal of High Energy Physics (2017) 1.
[76] ATLAS Collaboration, Convolutional Neural Networks with Event Images for Pileup Mitigation with the ATLAS Detector, Tech. Rep. ATL-PHYS-PUB-2019-028, CERN, Geneva, Jul. 2019. https://cds.cern.ch/record/2684070.
[77] P. Baldi and K. Hornik, Neural networks and principal component analysis: Learning from examples without local minima, Neural Networks (1989) 1, 53.
[78] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, Extracting and Composing Robust Features with Denoising Autoencoders, in Proceedings of the 25th International Conference on Machine Learning (ICML '08). Association for Computing Machinery, New York, NY, USA, 2008.
[79] T. Heimel, G. Kasieczka, T. Plehn, and J. Thompson, QCD or what?, SciPost Physics (Mar. 2019).
[80] M. Farina, Y. Nakai, and D. Shih, Searching for new physics with deep autoencoders, Physical Review D 101.