Pileup Mitigation with Machine Learning (PUMML)
Patrick T. Komiske, Eric M. Metodiev, Benjamin Nachman, Matthew D. Schwartz
MIT–CTP 4924
Patrick T. Komiske,a Eric M. Metodiev,a Benjamin Nachman,b Matthew D. Schwartzc

a Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
b Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
c Department of Physics, Harvard University, Cambridge, MA 02138, USA
E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract:
Pileup involves the contamination of the energy distribution arising from the primary collision of interest (leading vertex) by radiation from soft collisions (pileup). We develop a new technique for removing this contamination using machine learning and convolutional neural networks. The network takes as input the energy distribution of charged leading vertex particles, charged pileup particles, and all neutral particles and outputs the energy distribution of particles coming from the leading vertex alone. The PUMML algorithm performs remarkably well at eliminating pileup distortion on a wide range of simple and complex jet observables. We test the robustness of the algorithm in a number of ways and discuss how the network can be trained directly on data.

The Large Hadron Collider (LHC) is operated at very high instantaneous luminosities to achieve the large statistics required to search for exotic Standard Model (SM) or beyond the SM processes as well as for precision SM measurements. At a hadron collider, protons are grouped together in bunches; as the luminosity increases for a fixed bunch spacing, the number of protons within each bunch that collide inelastically increases as well. Most of these inelastic collisions are soft, with the protons dissolving into mostly low-energy pions that disperse throughout the detector. A typical collision of this sort at the LHC will contribute about 0.6 GeV/rad of energy [1, 2]. Occasionally, one pair of protons within a bunch crossing collides head-on, producing hard (high-energy) radiation of interest. At high luminosity, this hard collision, or leading vertex (LV), is always accompanied by soft proton-proton collisions called pileup. The data collected thus far by ATLAS and CMS have approximately 20 pileup collisions per bunch crossing on average (⟨NPU⟩ ∼ 20); upcoming runs will reach ⟨NPU⟩ ∼ 80; and the HL-LHC in Runs 4-5 will have an even larger ⟨NPU⟩.¹

¹ Some detector systems have an integration time that is (much) longer than the bunch spacing of 25 ns, so there is also a contribution from pileup collisions happening before or after the collision of interest (out-of-time pileup). This contribution will not have charged particle tracks and can be at least partially mitigated with calorimeter timing information. Out-of-time pileup is not considered further in this analysis.

Charged particles can be associated with their production vertex using tracking information, so charged pileup radiation can be removed directly; this is the simplest pileup removal technique, called charged-hadron subtraction. The challenge with pileup removal is therefore how to distinguish neutral radiation associated with the hard collision from neutral pileup radiation. Since radiation from pileup is fairly uniform, it can be removed on average, for example, using the jet areas technique [9]. The jet areas technique focuses on correcting the overall energy of collimated sprays of particles known as jets. Indeed, both the ATLAS and CMS experiments apply jet areas or similar techniques to calibrate the energy of their jets [1, 2, 10–13]. Unfortunately, for many measurements, such as those involving jet substructure or the full radiation patterns within the jet, removing the radiation on average is not enough.

Rather than calibrating only the energy or net 4-momentum of a jet, it is possible to correct the constituents of the jet. By removing the pileup contamination from each constituent, it should be possible to reconstruct more subtle jet observables. We can coarsely classify constituent pileup mitigation strategies into several categories: constituent preprocessing, jet/event grooming, subjet corrections, and constituent corrections. Grooming refers to algorithms that remove objects, and corrections describe scale factors applied to individual objects. Both ATLAS and CMS apply preprocessing to all of their constituents before jet clustering.
For ATLAS, pileup-dependent noise thresholds in topoclustering [14] suppress low-energy calorimeter deposits that are characteristic of pileup. In CMS, charged-hadron subtraction removes all of the pileup particle-flow candidates [15].² Jet grooming techniques are not necessarily designed to exclusively mitigate pileup, but since they remove constituents or subjets in a jet (or event) that are soft and/or at wide angles to the jet axis, pileup particles are preferentially removed [6, 16–21].³ Explicitly tagging and removing pileup subjets often performs comparably to algorithms without explicit pileup subjet removal [6]. A popular event-level grooming algorithm called SoftKiller [21] removes radiation below some cutoff on transverse momentum, p_T^cut, chosen on an event-by-event basis so that half of a set of pileup-only patches are radiation free.

² Charged-hadron subtraction follows a particle-flow technique that removes calorimeter energy from pileup tracks. Due to the calorimeter energy resolution, there will be a residual contribution from charged-hadron pileup. This contribution is ignored but could in principle be added to the neutral pileup contribution.

³ This work will not explicitly discuss identification of real high-energy jets resulting from pileup collisions. The ATLAS and CMS pileup jet identification techniques are documented in Refs. [6, 7] and [8], respectively.

While grooming algorithms remove constituents and subjets, there are also techniques that try to reconstruct the exact energy distribution from the primary collision. One of the first such methods introduced was Jet Cleansing [22]. Cleansing works at the subjet level, clustering and declustering jets to correct each subjet separately based on its local energy information. Furthermore, Cleansing exploits the fact that the relative size of pileup fluctuations decreases as ⟨NPU⟩ → ∞ so that the neutral pileup-energy content of subjets can be estimated from the charged pileup-energy content. A series of related techniques operate on the constituents themselves [23–25]. One such technique, called PUPPI, also uses local charged track information but works at the particle level rather than the subjet level. PUPPI computes a scale factor for each particle, using a local estimate inspired by the jets-without-jets paradigm [26]. In this paper, we will be comparing our method to PUPPI and SoftKiller.

In this paper, we present a new approach to pileup removal based on machine learning. The basic idea is to view the energy distribution of particles as the intensity of pixels in an image [27]. Convolutional neural networks applied to jet images [28] have found widespread applications in both classification [28–32] and generation [33, 34]. Previous jet-images applications have included boosted W-boson tagging [28–30], boosted top quark identification [31], and quark/gluon jet discrimination [32]. Most of these previous applications were classification tasks: extracting a single binary classifier (quark or gluon, W jet or background jet, etc.) from a highly-correlated multidimensional input. The application to pileup removal is a more complicated regression task, as the output (a cleaned-up image) should be of similar dimensionality to the input. PUMML is among the first applications of modern machine learning tools to regression problems in high energy physics.

To apply the convolutional neural network paradigm to cleaning an image itself, we exploit the finer angular resolution of the tracking detectors relative to the calorimeters of ATLAS and CMS. Building on the use of multichannel inputs in [32], we give as input to our network three-channel jet images: one channel for the charged LV particles, one channel for the charged pileup particles, and one channel, at slightly lower resolution, for the total neutral particles.
We then ask the network to reconstruct the unknown image for LV neutral particles. Thus our inputs are like those of Jet Cleansing but binned into a regular grid (as images) rather than single numbers for each subjet [22]. Further, the architecture is designed to be local (as with Cleansing or PUPPI), with the correction of a pixel only using information in a region around it. The details of our network architecture are described in Section 2. Section 3 documents its performance in comparison to other state-of-the-art techniques. The remainder of the paper contains some robustness checks and a discussion in Section 6 of the challenges and opportunities for this approach.

The goal of the PUMML algorithm is to reconstruct the neutral leading vertex radiation from the charged leading vertex, charged pileup, and total neutral information. Since neutral particles do not have tracking information available, the challenge is to determine what fraction of the total neutral energy in each direction came from the leading vertex and what fraction came from pileup. To assist this discrimination, we take as inputs into our network the energy distribution of charged particles, separated into leading vertex and pileup contributions, in addition to the total neutral energy distribution.⁴ A natural way to combine these observables is using the multichannel images approach introduced in [32] based on color-image recognition technology.

⁴ Both ATLAS [35] and CMS [36, 37] are proposing precision timing detectors as part of their upgrades for the HL-LHC; such information could naturally be incorporated into another layer of the network.

We apply this machine learning technique to R = 0.4 anti-kt jets. The jet image inputs are square grids in pseudorapidity-azimuth (η, φ) space of size 0.9 × 0.9 centered at the charged-leading-vertex p_T-weighted centroid of the jet. One could combine all layers to determine the jet axis, but in practice the axis determined from the charged leading vertex dominates because of its superior angular resolution and pileup robustness. To simulate the detector resolutions of charged and neutral calorimeters, charged images are discretized into Δη × Δφ = 0.025 × 0.025 pixels and neutral images are discretized into Δη × Δφ = 0.1 × 0.1 pixels.⁵ We use the following three input channels:

  red = the transverse momenta of all neutral particles
  green = the transverse momenta of charged pileup particles
  blue = the transverse momenta of charged leading vertex particles

The output of our network is also an image:

  output = the transverse momenta of neutral leading vertex particles.

Only charged particles with p_T > 500 MeV were included in the green or blue channels. Charged particles not passing this charged reconstruction cut were treated as if they were neutral particles. Otherwise, the separation into channels is assumed perfect. No image normalization or standardization was applied to the jet images, allowing the network to make use of the overall transverse momentum scale in each pixel. The different resolutions for charged and neutral particles initially present a challenge, since standard architectures assume identical resolution for each color channel. To avoid this issue, we perform a direct upsampling of each neutral pixel to 4 × 4 pixels of Δη × Δφ = 0.025 × 0.025 and divide each pixel value by 16 such that the total momentum in the image is unchanged.

In summary, the following processing was applied to produce the pileup images:

1. Center: Center the jet image by translating in (η, φ) so that the total charged leading vertex p_T-weighted centroid pixel is at (η, φ) = (0, 0).

2. Pixelate: Crop to a 0.9 × 0.9 region centered at (η, φ) = (0, 0), with charged and neutral pixels of size Δη × Δφ = 0.025 × 0.025 and Δη × Δφ = 0.1 × 0.1, respectively.

3. Upsample: Upsample each neutral pixel to sixteen Δη × Δφ = 0.025 × 0.025 pixels, keeping the total transverse momentum in the image unchanged.

⁵ These dimensions are representative of typical tracking and calorimeter resolutions, but would be adapted to the particular detector in practice. We ignore other detector effects in this algorithm demonstration, as has also been done for PUPPI and SoftKiller. In principle, additional complications due to the detector response can be naturally incorporated into the algorithm during training.
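The centering, pixelation, and upsampling steps above can be sketched with NumPy. This is an illustrative sketch, not the authors' code: the function names, the simple histogramming, and the neglect of φ wrap-around are all simplifying assumptions.

```python
import numpy as np

# Illustrative sketch of the PUMML image preprocessing described above
# (not the authors' code). Each particle array has rows (pT, eta, phi);
# phi wrap-around and other edge effects are ignored for simplicity.
HALF = 0.45                   # half-width of the 0.9 x 0.9 jet image
CH_PIX, NE_PIX = 0.025, 0.1   # charged and neutral pixel sizes

def pixelate(particles, center, pix):
    """Histogram particle pT into a square (eta, phi) grid around `center`."""
    n = int(round(2 * HALF / pix))
    pt, eta, phi = particles.T
    edges = np.linspace(-HALF, HALF, n + 1)
    img, _, _ = np.histogram2d(eta - center[0], phi - center[1],
                               bins=(edges, edges), weights=pt)
    return img

def make_input(charged_lv, charged_pu, neutral_all):
    # 1. Center: translate so the charged-LV pT-weighted centroid is at (0, 0).
    pt, eta, phi = charged_lv.T
    center = (np.average(eta, weights=pt), np.average(phi, weights=pt))
    # 2. Pixelate: charged at 0.025 x 0.025, neutral at 0.1 x 0.1 resolution.
    blue  = pixelate(charged_lv,  center, CH_PIX)   # 36 x 36
    green = pixelate(charged_pu,  center, CH_PIX)   # 36 x 36
    red   = pixelate(neutral_all, center, NE_PIX)   #  9 x  9
    # 3. Upsample: each neutral pixel becomes a 4 x 4 block divided by 16,
    #    so the total pT in the image is unchanged.
    red = np.kron(red, np.ones((4, 4))) / 16.0      # 36 x 36
    return np.stack([red, green, blue], axis=-1)    # 36 x 36 x 3
```

The division by 16 after `np.kron` implements the momentum-preserving upsampling of step 3.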
Figure 1: An illustration of the PUMML framework. The input is a three-channel image: blue/purple represents charged radiation from the leading vertex, green is charged pileup radiation, and yellow/orange/red is the total neutral radiation. The resolution of the charged images is higher than for the neutral one. These images are fed into a convolutional layer with several filters whose value at each pixel is a function of a patch around that pixel location in the input images. The output is an image combining the pixels of each filter to one output pixel.

The convolutional neural network architecture used in this study took as input 36 × 36 pixel, three-channel pileup images. Two convolutional layers, each with 10 filters of size 6 × 6, were followed by a final convolutional layer with a single 1 × 1 filter that produces the output image; the 9 × 9 neutral-resolution output can be recovered by summing each 4 × 4 block of upsampled pixels. The network was trained to minimize the loss function

ℓ = ⟨ log² ( (p_T^(pred) + p̄) / (p_T^(true) + p̄) ) ⟩ ,   (2.1)

where p̄ is a hyperparameter that controls the choice between favoring all p_T equally (p̄ → ∞) or favoring soft pixels (p̄ → 0). A value of p̄ = 10 GeV was chosen, though the performance of the model as measured by correlations between reconstructed and true observables is relatively robust to this choice. PUMML was found to give good performance even with a standard loss function such as the mean squared error, which favors all p_T equally.

The PUMML architecture is local in that the rescaling of a neutral pixel is a function solely of the information in a patch in (η, φ)-space around that pixel. The size of this patch can be controlled by tuning the filter sizes and number of layers in the architecture. Further, due to weight-sharing in convolutional layers, the same function is applied for all pixels. Building this locality and translation invariance into the architecture ensures that the algorithm learns a universal pileup mitigation technique, while carrying the benefit of drastically reducing the number of model parameters. Indeed, the PUMML architecture used in this study has only 4,711 parameters, which is small on the scale of deep learning architectures, but serves to highlight the effectiveness of using modern machine learning techniques (such as convolutional layers) in high energy physics without necessarily using large or deep networks.

While we considered jets and jet images in this study, the PUMML architecture using convolutional nets readily generalizes to event-level applications. The locality of the algorithm implies that the trained model can be applied to any desired region of the event using only the surrounding pixels. To train the model on the event level, either the existing PUMML architecture could be generalized to larger inputs and outputs or the event could be sliced into smaller images and the model trained as in the present study.
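As a consistency check on the architecture described above, the parameter count can be verified and the loss of Eq. (2.1) written out. The reading of the architecture as two 6 × 6 convolutional layers with 10 filters each plus a single 1 × 1 output filter is an interpretation, not a quoted specification; the parameter count it implies is shown below.

```python
import numpy as np

# Consistency check of the architecture sketched above (an assumed
# reading: two 6x6 convolutional layers with 10 filters each, then a
# single 1x1 output filter). Each filter carries one bias term.
def conv_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out

n_params = (conv_params(6, 3, 10)     # first layer: 3 input channels -> 10
            + conv_params(6, 10, 10)  # second layer: 10 -> 10
            + conv_params(1, 10, 1))  # 1x1 output layer: 10 -> 1
print(n_params)  # 4711, matching the parameter count quoted in the text

def pumml_loss(pred, true, pbar=10.0):
    """Eq. (2.1): mean squared log-ratio of regulated pixel pTs (pbar in GeV)."""
    return np.mean(np.log((pred + pbar) / (true + pbar)) ** 2)
```

The fact that this layer layout reproduces the 4,711 parameters quoted in the text is what motivates the assumed reading.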
The parameters of the PUMML architecture are the convolutional filter sizes, the number of filters per layer, and the number of convolutional layers, which may be optimized for a specific application. Here, we have presented an architecture optimized for simplicity and performance for jet-level pileup subtraction. PUMML is designed to be applicable at both jet- and event-level.

To test the PUMML algorithm, we consider q q̄ light-quark-initiated jets coming from the decay of a scalar with mass m_φ = 500 GeV. Events were generated using Pythia 8.183 [42] with the default tune for pp collisions at √s = 13 TeV. Pileup was generated by overlaying soft QCD processes onto each event. Final state particles except muons and neutrinos were kept. The events were clustered with FastJet 3.1.3 [43] using the anti-kt algorithm [44] with a jet radius of R = 0.4. A parton-level p_T cut of 95 GeV was applied and up to two leading jets with p_T > 100 GeV and η ∈ [−2.5, 2.5] were selected from each event. All particles were taken to be massless.

Samples were generated with the number of pileup vertices ranging from 0 to 180. Since the model must be trained to fix its parameters, the learned model depends on the pileup distribution used for training. For our pileup simulations, we trained on a Poisson distribution of NPUs with mean ⟨NPU⟩ = 140. For robustness studies, we also tried training with NPU = 140 for each event or NPU = 20 for each event. The average jet image inputs for this sample are shown in Fig. 2.

Figure 2: The average leading-jet images for a 500 GeV scalar decaying to light-quark jets with ⟨NPU⟩ = 140 pileup, separated by all neutral particles (top left), charged pileup particles (top right), charged leading vertex particles (bottom left), and neutral leading vertex particles (bottom right). Different pixelizations are used for charged and neutral images to reflect the differences in calorimeter resolution. The charged and total neutral images comprise the three-channel input to the neural network, which is trained to predict the neutral leading vertex image.

For comparison, we show the performance of two powerful and widely used constituent-based pileup mitigation methods: PUPPI [23] and SoftKiller [21]. In both cases, default parameter values were used: for PUPPI, the default cone sizes, weight cut, and NPU-dependent p_T^cut; for SoftKiller, a grid size of 0.4. Variations in the PUPPI parameters did not yield a large difference in performance. Both PUPPI and SoftKiller were implemented at the particle level and then discretized for comparison with PUMML. We show the action of the various pileup mitigation methods on a random selection of events in Fig. 3. On these examples, PUMML more effectively removes moderately soft energy deposits that are retained by PUPPI and SoftKiller.

Figure 3: Depictions of three randomly chosen leading jets. Blue/purple represents charged radiation from the leading vertex, green is charged pileup radiation, and yellow/orange/red is the neutral radiation. Shown from left to right are the true neutral leading vertex particles, the event with pileup and charged leading vertex information, followed by the neutral leading vertex particles predicted by PUMML, PUPPI, and SoftKiller. From examining these events, it appears that PUMML has learned an effective pileup mitigation strategy.

To evaluate the performance of different pileup mitigation techniques, we compute several observables and compare the true values to the corrected values of the observables. To facilitate a comparison with PUMML, which outputs corrected neutral calorimeter cells rather than lists of particles, a detector discretization is applied to the true and reconstructed events. Our comparisons focus on the following six jet observables:

• Jet Mass: Invariant mass of the leading jet.
• Dijet Mass: Invariant mass of the two leading jets.
• Jet Transverse Momentum: The total transverse momentum of the jet.
• Neutral Image Activity, N [45]: The number of neutral calorimeter cells which account for 95% of the total neutral transverse momentum.
• Energy Correlation Functions, ECF_N^(β) [46]: Specifically, we consider the logarithm of the two- and three-point ECFs with β = 4.

Fig. 4 illustrates the distributions of several of these jet observables after applying the different pileup subtraction methods. While these plots are standard, they do not give a per-event indication of performance. A more useful comparison is to show the distributions of the per-event percent error in reconstructing the true values of the observables, which are shown in Fig. 5. To numerically explore the event-by-event effectiveness, we can look at the Pearson linear correlation coefficient between the true and corrected values or the interquartile range (IQR) of the percent errors. Table 1 summarizes the event-by-event correlation coefficients of the distributions shown in Fig. 4. Table 2 summarizes the IQRs of the distributions shown in Fig. 5. PUMML outperforms the other pileup mitigation techniques on both of these metrics, with improvements for jet substructure observables such as the jet mass and the energy correlation functions.
It is important to verify that PUMML learns a pileup mitigation function which is not overly sensitive to the NPU distribution of its training sample. Robustness to the NPU on which it is trained would indicate that PUMML is learning a universal subtraction strategy. To evaluate this robustness, PUMML was trained on 50k events with either NPU = 20 or NPU = 140
Figure 4: Distributions of leading jet mass (top left), dijet mass (top right), leading jet p_T (middle left), neutral N (middle right), ln ECF_{N=2}^{(β=4)} (bottom left), and ln ECF_{N=3}^{(β=4)} (bottom right) for the considered pileup subtraction methods with Poissonian ⟨NPU⟩ = 140 pileup. While all of the pileup mitigation methods do well for observables such as the dijet mass and jet p_T, PUMML more closely matches the true distributions of more sensitive substructure observables like mass, neutral N, and the energy correlation functions.
Figure 5: Distributions of the percent error between reconstructed and true values for leading jet mass (top left), dijet mass (top right), leading jet p_T (middle left), neutral N (middle right), ln ECF_{N=2}^{(β=4)} (bottom left), and ln ECF_{N=3}^{(β=4)} (bottom right) for the considered pileup subtraction methods with Poissonian ⟨NPU⟩ = 140 pileup. For the discrete neutral N observable, only the difference is shown. All distributions are centered to have median at 0. The improved reconstruction performance of PUMML is highlighted by its taller and narrower peaks.

Correlation (%)        | w. Pileup | PUMML | PUPPI | SoftKiller
Jet mass               | 65.5      | 97.4  | 94.0  | 91.3
Dijet mass             | 85.5      | 99.5  | 95.8  | 99.1
Jet p_T                |           |       |       |
Neutral N              |           |       |       |
ln ECF_{N=2}^{(β=4)}   |           |       |       |
ln ECF_{N=3}^{(β=4)}   |           |       |       |

Table 1: Correlation coefficients between the true and corrected values of different jet observables on an event-by-event level. The first column lists the correlation without any pileup mitigation applied to the event. Larger correlation coefficients are better.
IQR (%)                | PUMML | PUPPI | SoftKiller
Jet mass               | 13.0  | 28.7  | 30.8
Dijet mass             | 2.02  | 2.95  | 2.97
Jet p_T                |       |       |
ln ECF_{N=2}^{(β=4)}   |       |       |
ln ECF_{N=3}^{(β=4)}   |       |       |

Table 2: The interquartile ranges (IQR) of the distributions in Fig. 5. Note that PUMML performs better than either PUPPI or SoftKiller. Lower IQR indicates better performance.

and then tested on samples with different NPUs. Fig. 6 shows the jet mass correlation coefficients as a function of the test sample NPU. PUMML learns a strategy that is surprisingly performant outside of the NPU range on which it was trained. Further, we see that by this measure of performance, PUMML consistently outperforms both PUPPI and SoftKiller.

Figure 6: Correlation coefficients between reconstructed and true jet masses plotted as a function of NPU for the different pileup mitigation schemes. PUMML was trained on 50k events with either NPU = 20 or NPU = 140, indicated by dashed vertical lines. The performance of PUMML with Poissonian ⟨NPU⟩ = 140 is similar to the NPU = 140 curve. PUMML is surprisingly performant well outside the NPU range on which it was trained and consistently outperforms PUPPI and SoftKiller. Note that PUMML trained on the lower NPU sample better reconstructs the jet mass in the low pileup regime.

A related robustness test is to probe how the performance of PUMML depends on the p_T spectrum of the training sample. To explore this, we generated two large training samples (50k events each): one with a scalar mass of 200 GeV and one with a scalar mass of 2 TeV; we did not impose any parton-level p_T cuts on these samples. After training these two networks, we tested them on a set of samples generated from scalars with intermediate masses, from 300 GeV to 900 GeV. As can be seen in Fig. 7, the performance of PUMML is very robust to the p_T distribution of the jets in the training sample: the networks trained on the 200 GeV resonance and the 2 TeV resonance have identical performance. The figure also shows that the performance of PUMML is less sensitive to the p_T of the testing sample than either PUPPI or SoftKiller. This robustness test speaks to the PUMML algorithm's ability to learn universal aspects of pileup mitigation.

Figure 7: Correlation coefficients between reconstructed and true jet masses plotted as a function of the mass of the scalar resonance with NPU = 140. A spread in scalar resonances is generated in order to produce a range in jet transverse momenta. In order to assess the impact of the p_T distribution used for training, one version of PUMML was trained with a scalar mass of 200 GeV (black) and one was trained with a mass of 2 TeV (gray). The two PUMML curves closely match one another.

A number of modifications of PUMML were also tried. Locally connected layers were tried instead of convolutional layers and were found to perform worse due to a large increase in the number of parameters of the model, while losing the translation invariance that makes PUMML powerful. We tried training without various combinations of the input channels; the model was found to perform moderately worse without either of the charged channels but suffered severe degradation without the total neutral channel. We tried using simpler models with only one layer or fewer filters per layer. Remarkably, even with only a single layer and a single 4 × 4 filter, performance remained competitive.

While it is generally very difficult to determine what a network is learning, one possible probe is to examine the weights of the filter layers in the convolutional network. For our full network, these weights are complicated and the subtractor that the network learns is difficult to probe analytically. Instead, we trained a simplified PUMML network with a single 12 ×
12 pixel filter, which spans 3 × 3 neutral pixels. The network learns a subtractor of the form:

p_T^{N,LV} = 1.0 p_T^{N,total} − β p_T^{C,PU} + 0.0 p_T^{C,LV} ,   (5.1)

for some O(1) constant β, where p_T^{N,LV}, p_T^{N,total}, p_T^{C,PU}, and p_T^{C,LV} are the neutral-pixel-level transverse momenta of the neutral leading-vertex particles, all neutral particles, charged pileup particles, and charged leading-vertex particles, respectively.

Figure 8: Filter weights (neutral total, charged pileup, and charged leading vertex filters) for a simple PUMML network with a single 12 × 12 filter and a ReLU activation function trained with ⟨NPU⟩ = 140. The network has selected the relevant neutral pixel, turned off the charged leading vertex contribution, and is using the charged pileup information uniformly.

The values 1.0 and 0.0 in Eq. (5.1) are stable (to the 0.05 level) under variations in the loss and activation functions. This is reassuring, as the learned subtractor is thereby robust for the ⟨NPU⟩ = 140 training used here.

Eq. (5.1) is remarkably similar to the physically-motivated formula used in Jet Cleansing [22]. Cleansing is built on the observation that since pileup is the incoherent sum of many separate scattering events, its variance is smaller than the variance of the radiation from the leading vertex. Thus, it is better to estimate p_T^{N,PU} from p_T^{C,PU} than to estimate p_T^{N,LV} from p_T^{C,LV}. The simplest form of Cleansing (Linear Cleansing) gives the formula:

p_T^{N,LV} = p_T^{N,tot} − (1/γ − 1) p_T^{C,PU} ,   (5.2)

where γ is the average ratio of charged p_T to total p_T in a subjet. Thus this simple one 12 ×
12 filter PUMML network is learning a subtractor of precisely the same parametric form as Linear Cleansing!

The value of γ in Linear Cleansing and the value of β that is learned in Eq. (5.1) depend on how soft radiation is handled. For example, if no reconstruction threshold is applied, γ ≈ 2/3. The learned value of β depends on the loss function used. For example, if the loss function is minimized when the means of the true and predicted neutral transverse momenta are equal:

ℓ = | ⟨p_T^{(true)}⟩ − ⟨p_T^{(pred)}⟩ | = | ⟨p_T^{N,LV}⟩ − ⟨p_T^{N,total}⟩ + β ⟨p_T^{C,PU}⟩ | ,   (5.3)

then we find that the optimal β is:

β = ⟨p_T^{N,PU}⟩ / ⟨p_T^{C,PU}⟩ .   (5.4)

Training the 12 × 12 PUMML filter without a ReLU or bias term, using the loss function of Eq. (5.3) with the average taken pixel-wise over the batch, we find β = 0.59 with no charged reconstruction cut and β = 1.18 with the cut. These values are consistent with those predicted by Eq. (5.4) of 0.62 and 1.26, respectively.

On the other hand, if we take a mean squared error loss function:

ℓ = ⟨ (p_T^{(true)} − p_T^{(pred)})² ⟩ ,   (5.5)

then the minimum occurs at:

β = ⟨p_T^{N,PU} p_T^{C,PU}⟩ / ⟨(p_T^{C,PU})²⟩ .   (5.6)

This still depends only on the pileup properties, as with Linear Cleansing, but also depends on correlations between neutral and charged radiation. For example, training the 12 × 12 filter with the mean squared error loss, we find β = 0.56 with no charged reconstruction cut and β = 0.97 with the cut. These numbers are in general agreement (within 10–20%) with the values predicted by Eq. (5.6). Training the single 12 ×
12 filter, using the loss function of Eq. (2.1) and including a ReLU and bias term, PUMML achieves a jet mass correlation coefficient of 90.4%. This is competitive with the values listed in Table 1, as we might expect, since Linear Cleansing has comparable performance to PUPPI and SoftKiller. The full network improves on Linear Cleansing by exploiting additional correlations that are hard to disentangle by looking at the filters.
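The linear subtractor of Eq. (5.1) and the closed-form β values of Eqs. (5.4) and (5.6) are simple enough to state as a short sketch; the array and function names here are illustrative, with arrays holding per-pixel transverse momenta.

```python
import numpy as np

# The learned linear subtractor of Eq. (5.1) with the closed-form beta
# values of Eqs. (5.4) and (5.6). Names are illustrative.
def beta_mean(pt_n_pu, pt_c_pu):
    """Optimal beta for the matched-means loss, Eq. (5.4)."""
    return np.mean(pt_n_pu) / np.mean(pt_c_pu)

def beta_mse(pt_n_pu, pt_c_pu):
    """Optimal beta for the mean-squared-error loss, Eq. (5.6)."""
    return np.mean(pt_n_pu * pt_c_pu) / np.mean(pt_c_pu ** 2)

def linear_subtract(pt_n_total, pt_c_pu, beta):
    """Eq. (5.1) with coefficients 1.0 and 0.0: estimate the neutral LV pT.
    The ReLU in the trained network corresponds to clipping at zero."""
    return np.maximum(pt_n_total - beta * pt_c_pu, 0.0)
```

As discussed above, β from Eq. (5.4) depends only on average pileup properties, while β from Eq. (5.6) also folds in neutral-charged correlations.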
In this paper, we have introduced the first application of machine learning to the criticallyimportant problem of pileup mitigation at hadron colliders. We have phrased the problemof pileup mitigation in the language of a machine learning regression problem. The methodwe introduced, PUMML, takes as input the transverse momentum distribution of chargedleading-vertex, charged pileup, and all neutral particles, and outputs the corrected leadingvertex neutral energy distribution. We demonstrated that PUMML works at least as wellas, and often better than, the competing algorithms PUPPI and SoftKiller in their defaultimplementations. It will be exciting to see these algorithms compared with a full detectorsimulation, where it will be possible to test the sensitivity to important experimental effectssuch as resolutions and inefficiencies.There are several extensions and additional applications of the PUMML framework be-yond the scope of this study. As mentioned in Section 2, PUMML can very naturally beextended from jet images to entire events. Applying this event-level PUMML to the problemof missing transverse energy would be a natural next step. While the filter sizes can be the– 16 –ame for the event and jet images, the network training will likely require modification. Fur-thermore, the inhomogeneity of the detector response with | η | will require attention. Anotherpotentially useful modification to PUMML would be to train to predict the neutral pileup p T rather than the neutral leading vertex p T in order to increase out-of-sample robustness ofthe learned pileup mitigation algorithm. Additionally, using larger- R jets may be of interest,thereby necessitating a resizing of the local patch or other PUMML parameters, all of whichis easily achieved.An important consideration when using machine learning for particle physics applicationsis how the method can be used with data and whether or not the systematic uncertainties areunder control. 
Unlike a purely physically-motivated algorithm, such as PUPPI or SoftKiller, machine learning runs the risk of being a “black box” which can be difficult to understand. Nevertheless, machine learning is powerful, scalable, and capable of complementing physical insight to solve complicated or otherwise intractable problems.

To prevent the model from learning simulation artifacts, it is preferable to train on actual data rather than on simulation. In many machine learning applications in collider physics, obtaining truth-level training samples in data is a substantial challenge. To overcome this challenge in classification tasks, Ref. [47] introduces an approach to train from impure samples using class proportion information. For PUMML and pileup mitigation more broadly, a more direct method to train on data is possible. To simulate pileup, we overlay soft QCD events on top of a hard scattering process, both generated with Pythia. Experimentally, there are large samples of minimum-bias and zero-bias (i.e. randomly triggered) data. There are also samples of relatively pileup-free events from low-luminosity runs. Thus we can construct high-pileup samples using purely data. This kind of data-overlay approach, which has already been used by experimental groups in other contexts [48, 49], could be perfect for training PUMML with data. Therefore, an implementation of ML-based pileup mitigation in an actual experimental setting could avoid mis-modeling artifacts during training, thus adding more robustness and power to this new tool.

Acknowledgments
The authors would like to thank Philip Harris, Francesco Rubbo, Ariel Schwartzman and Nhan Tran for stimulating conversations, in particular for suggesting some of the extensions mentioned in the conclusions. We would also like to thank Jesse Thaler for helpful discussions. PTK and EMM would like to thank the MIT Physics Department for its support. Computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University. This work was supported by the Office of Science of the U.S. Department of Energy (DOE) under contracts DE-AC02-05CH11231 and DE-SC0013607, the DOE Office of Nuclear Physics under contract DE-SC0011090, and the DOE Office of High Energy Physics under contract DE-SC0012567. Cloud computing resources were provided through a Microsoft Azure for Research award. Additional support was provided by the Harvard Data Science Initiative.

References
[1] CMS Collaboration, V. Khachatryan et al., Jet energy scale and resolution in the CMS experiment in pp collisions at 8 TeV, JINST (2017), no. 02, P02014.
[2] ATLAS Collaboration, M. Aaboud et al., Jet energy scale measurements and their systematic uncertainties in proton-proton collisions at √s = 13 TeV with the ATLAS detector.
[3] CMS Collaboration, S. Chatrchyan et al., Description and performance of track and primary-vertex reconstruction with the CMS tracker, JINST (2014), no. 10, P10009.
[4] ATLAS Collaboration, Characterization of Interaction-Point Beam Parameters Using the pp Event-Vertex Distribution Reconstructed in the ATLAS Detector at the LHC.
[5] ATLAS Collaboration, Performance of primary vertex reconstruction in proton-proton collisions.
[6] ATLAS Collaboration, G. Aad et al., Performance of pile-up mitigation techniques for jets in pp collisions at √s = 8 TeV using the ATLAS detector, Eur. Phys. J. C76 (2016), no. 11, 581.
[7] ATLAS Collaboration, M. Aaboud et al., Identification and rejection of pile-up jets at high pseudorapidity with the ATLAS detector, Eur. Phys. J. C77 (2017), no. 9, 580.
[8] CMS Collaboration, Pileup Jet Identification.
[9] M. Cacciari and G. P. Salam, Pileup subtraction using jet areas, Phys. Lett. B659 (2008) 119–126.
[10] CMS Collaboration, S. Chatrchyan et al., Determination of Jet Energy Calibration and Transverse Momentum Resolution in CMS, JINST (2011) P11002.
[11] ATLAS Collaboration, G. Aad et al., Jet energy measurement with the ATLAS detector in proton-proton collisions at √s = 7 TeV, Eur. Phys. J. C73 (2013), no. 3, 2304.
[12] ATLAS Collaboration, Monte Carlo calibration and combination of in-situ measurements of jet energy scale, jet energy resolution and jet mass in ATLAS, ATLAS-CONF-2015-037, 2015.
[13] CMS Collaboration, Jet energy scale and resolution performances with 13 TeV data, CMS Detector Performance Summary CMS-DP-2016-020, CERN (2016).
[14] ATLAS Collaboration, G. Aad et al., Topological cell clustering in the ATLAS calorimeters and its performance in LHC Run 1.
[15] CMS Collaboration, A. M. Sirunyan et al., Particle-flow reconstruction and global event description with the CMS detector.
[16] J. M. Butterworth, A. R. Davison, M. Rubin and G. P. Salam, Jet substructure as a new Higgs search channel at the LHC, Phys. Rev. Lett. (2008) 242001.
[17] D. Krohn, J. Thaler and L.-T. Wang, Jet Trimming, JHEP (2010) 084.
[18] S. D. Ellis, C. K. Vermilion and J. R. Walsh, Techniques for improved heavy particle searches with jet substructure, Phys. Rev. D (2009), no. 5, 051501.
[19] S. D. Ellis, C. K. Vermilion and J. R. Walsh, Recombination algorithms and jet substructure: pruning as a tool for heavy particle searches, Phys. Rev. D (2010), no. 9, 094023.
[20] A. J. Larkoski, S. Marzani, G. Soyez and J. Thaler, Soft Drop, JHEP (2014) 146.
[21] M. Cacciari, G. P. Salam and G. Soyez, SoftKiller, a particle-level pileup removal method, Eur. Phys. J. C75 (2015), no. 2, 59.
[22] D. Krohn, M. D. Schwartz, M. Low and L.-T. Wang, Jet Cleansing: Pileup Removal at High Luminosity, Phys. Rev. D90 (2014), no. 6, 065020.
[23] D. Bertolini, P. Harris, M. Low and N. Tran, Pileup Per Particle Identification, JHEP (2014) 059.
[24] P. Berta, M. Spousta, D. W. Miller and R. Leitner, Particle-level pileup subtraction for jets and jet shapes, JHEP (2014) 092.
[25] ATLAS Collaboration, Constituent-level pileup mitigation performance using 2015 data, ATLAS-CONF-2017-065 (2017).
[26] D. Bertolini, T. Chan and J. Thaler, Jet Observables Without Jet Algorithms, JHEP (2014) 013.
[27] J. Cogan, M. Kagan, E. Strauss and A. Schwarztman, Jet-Images: Computer Vision Inspired Techniques for Jet Tagging, JHEP (2015) 118.
[28] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman and A. Schwartzman, Jet-images — deep learning edition, JHEP (2016) 069.
[29] P. Baldi, K. Bauer, C. Eng, P. Sadowski and D. Whiteson, Jet Substructure Classification in High-Energy Physics with Deep Neural Networks, Phys. Rev. D93 (2016), no. 9, 094034.
[30] J. Barnard, E. N. Dawe, M. J. Dolan and N. Rajcic, Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks, Phys. Rev. D95 (2017), no. 1, 014018.
[31] G. Kasieczka, T. Plehn, M. Russell and T. Schell, Deep-learning Top Taggers or The End of QCD?, JHEP (2017) 006.
[32] P. T. Komiske, E. M. Metodiev and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, JHEP (2017) 110.
[33] L. de Oliveira, M. Paganini and B. Nachman, Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis.
[34] M. Paganini, L. de Oliveira and B. Nachman, CaloGAN: Simulating 3D High Energy Particle Showers in Multi-Layer Electromagnetic Calorimeters with Generative Adversarial Networks.
[35] ATLAS Collaboration, ATLAS Phase-II Upgrade Scoping Document, CERN-LHCC-2015-020 (2015).
[36] CMS Collaboration, CMS Phase II Upgrade Scope Document, CERN-LHCC-2015-019 (2015).
[37] CMS Collaboration, Technical Proposal for the Phase-II Upgrade of the CMS Detector, CERN-LHCC-2015-010 (2015).
[38] F. Chollet, Keras, 2015. https://github.com/fchollet/keras.
[39] J. Bergstra et al., Theano: A CPU and GPU math compiler in Python, pp. 1–7, 2010.
[40] K. He, X. Zhang, S. Ren and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, pp. 1026–1034, 2015.
[41] D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[42] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen and P. Z. Skands, An Introduction to PYTHIA 8.2, Comput. Phys. Commun. (2015) 159–177.
[43] M. Cacciari, G. P. Salam and G. Soyez, FastJet User Manual, Eur. Phys. J. C72 (2012) 1896.
[44] M. Cacciari, G. P. Salam and G. Soyez, The anti-kt jet clustering algorithm, JHEP (2008) 063.
[45] J. Pumplin, How to tell quark jets from gluon jets, Phys. Rev. D44 (1991) 2025–2032.
[46] A. J. Larkoski, G. P. Salam and J. Thaler, Energy Correlation Functions for Jet Substructure, JHEP (2013) 108.
[47] L. M. Dery, B. Nachman, F. Rubbo and A. Schwartzman, Weakly Supervised Classification in High Energy Physics, JHEP (2017) 145.
[48] Z. Marshall and the ATLAS Collaboration, Simulation of pile-up in the ATLAS experiment, Journal of Physics: Conference Series, vol. 513, p. 022024, IOP Publishing, 2014.
[49] ATLAS Collaboration, A. Haas, ATLAS simulation using real data: Embedding and overlay, tech. rep., 2017.