Learning to Isolate Muons
Julian Collado, Kevin Bauer, Edmund Witkowski, Taylor Faucett, Daniel Whiteson, Pierre Baldi
Department of Computer Science, University of California, Irvine, CA 92697
Department of Physics and Astronomy, University of California, Irvine, CA 92697
(Dated: February 5, 2021)

Distinguishing between prompt muons produced in heavy boson decay and muons produced in association with heavy-flavor jet production is an important task in the analysis of collider physics data. We explore whether there is information available in calorimeter deposits that is not captured by the standard approach of isolation cones. We find that convolutional networks and particle-flow networks accessing the calorimeter cells surpass the performance of isolation cones, suggesting that the radial energy distribution and the angular structure of the calorimeter deposits surrounding the muon contain unused discrimination power. We assemble a small set of high-level observables which summarize the calorimeter information and partially close the performance gap with networks which analyze the calorimeter cells directly. These observables are theoretically well-defined and can be applied to studies of collider data. The remaining performance gap suggests the need for a new class of calorimeter-based observables.
INTRODUCTION
Searches for new physics and precision tests of the Standard Model at hadron colliders have long relied on leptonic decays of heavy bosons, due to the relatively low background rates and excellent momentum resolution compared to hadronic final states. In the case of muons, the primary source of background to prompt muons (those from W, Z, or other bosons) is production within a heavy-flavor jet. This non-prompt background is largest at lower values of muon transverse momentum, which has become important in searches for supersymmetry [1-3] as well as low-mass resonances [4].

The current state-of-the-art strategy for distinguishing prompt and non-prompt muons in experimental searches is the robust and simple approach of measuring the isolation of the muon in the calorimeter, as

    I_µ(R) = Σ_{i, ΔR_{iµ} < R} E_T^i ,

where the sum runs over calorimeter deposits i within an angular distance ΔR < R of the muon.

The observable I_µ(R) is a powerful discriminator which reduces a large amount of information to a single high-level scalar. However, it is possible that it fails to capture the fullness of the calorimeter information available to distinguish prompt muons from those which are produced within a jet. To probe whether information has been lost, we compare the performance of deep neural networks which access the full calorimeter information to shallow networks which use one or more isolation cones.

Neural network decisions are notoriously difficult to reverse-engineer, especially when the dimensionality of the data is large and the training is done with simulated samples, as is the case for networks which directly use the calorimeter cells. This leads to valid concerns about the application of such complex strategies to collider data. In this study, our goal is not to develop deep networks for use in collider data.
Instead, we apply these deep networks as a probe, to measure a loose upper bound on the possible classification performance, and to provide insight into whether information has been lost in the reduction of the calorimeter cells to isolation cones.

Where information has been lost, we attempt to capture it, not by applying the deep network, but by assembling a small set of new high-level (HL) observables that bridge the performance gap and reproduce the classification decisions of the calorimeter cell networks [16]. These high-level observables are more compact and physically interpretable, can be validated in data, and allow the straightforward assessment and propagation of systematic uncertainties.

Data generation

Samples of simulated prompt muons were generated via the process pp → Z' → µ+µ− with a Z' mass of 20 GeV. Non-prompt muons were generated via the process pp → bb̄. Both samples are generated at a center-of-mass energy √s = 13 TeV. Collisions and heavy boson decays are simulated with Madgraph5 [17], showered and hadronized with Pythia [18], and the detector response simulated with Delphes [19] using the standard ATLAS card. The classification of these objects is sensitive to the presence of additional proton interactions, referred to as pile-up events. We overlay such interactions within the simulation with an average number of interactions per event of µ = 50, as an estimate of future LHC experimental data.

Muons in the range p_T ∈ [10, 15] GeV were considered, and the signal samples are weighted such that the transverse muon momentum distributions match that of the background. Only events where a muon is identified as a track in the muon spectrometer are used. In total there were 91,592 events used, of which 47,616 were signal and 43,976 were background.

Calorimeter deposits can be represented as images where each pixel value represents the E_T deposited by a particle [13].
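The image formation used here can be sketched with NumPy. This is an illustrative sketch only: the grid size, cone radius, and cell input format below are assumptions, not the exact values or data structures used in the paper.

```python
import numpy as np

def make_muon_image(eta, phi, et, mu_eta, mu_phi, radius=0.45, n_pix=32):
    """Pixelate calorimeter deposits around a muon into an eta-phi image.

    eta, phi, et : arrays describing the calorimeter cells
    mu_eta, mu_phi : muon direction, used as the image center
    """
    deta = np.asarray(eta, dtype=float) - mu_eta
    # wrap the azimuthal difference into (-pi, pi]
    dphi = np.mod(np.asarray(phi, dtype=float) - mu_phi + np.pi, 2 * np.pi) - np.pi
    # keep only deposits inside the cone of the given radius
    mask = np.hypot(deta, dphi) < radius
    edges = np.linspace(-radius, radius, n_pix + 1)
    image, _, _ = np.histogram2d(deta[mask], dphi[mask],
                                 bins=(edges, edges),
                                 weights=np.asarray(et, dtype=float)[mask])
    return image
```

Because the pixel values are sums of E_T, the total image intensity equals the total transverse energy deposited inside the cone.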
Images are formed by considering cells in the calorimeter within a cone around the muon. Mean images in η−φ space for both signal prompt muons and background non-prompt muons are shown in Fig. 1. The signal calorimeter deposits are uniform and can be attributed to pileup, whereas the background deposits appear largely radially symmetric, with a dense core from the jet.

FIG. 1: Mean calorimeter images for signal prompt muons (top) and muons produced within heavy-flavor jets (bottom), in the vicinity of reconstructed muons within a cone of R = 0.4. The color of each cell represents the sum of the E_T of the calorimeter deposits within the cell.

We calculate the standard muon isolation observable I_µ(R) for a set of 18 equally spaced cone sizes up to R = 0.45.

Crucially, these isolation observables and all other calorimeter observables are calculated directly from the pixels of the muon images, ensuring that they contain a strict subset of the information available. This allows for direct and revealing comparisons of the performance between networks trained with the images and those trained with I_µ. Note that pixelization of the detector may incur some loss of information relative to the underlying segmentation of the calorimeter, but our studies examine the relative power of the techniques rather than making absolute comparisons with more realistic scenarios.

NETWORKS AND PERFORMANCE

We apply several strategies to the task of classifying prompt and non-prompt muons, using both low-level calorimeter information and higher-level isolation quantities. Accessing the calorimeter information at the lowest level and highest dimensionality, particle-flow networks (PFN) [15] operate on unordered lists of calorimeter cells, while convolutional networks (CNN) are applied to the muon images [13, 14].
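Since the isolation observables are computed directly from the image pixels, I_µ(R) reduces to a masked sum over the pixel grid. A minimal sketch, where the pixel-center coordinate convention is an assumption:

```python
import numpy as np

def isolation_from_image(image, radius, max_radius=0.45):
    """Compute I(R): total E_T in pixels whose centers lie within
    Delta R < radius of the image center (the muon direction)."""
    n = image.shape[0]
    # pixel-center coordinates spanning [-max_radius, max_radius]
    coords = (np.arange(n) + 0.5) / n * 2 * max_radius - max_radius
    deta, dphi = np.meshgrid(coords, coords, indexing="ij")
    mask = np.hypot(deta, dphi) < radius
    return float(image[mask].sum())
```

Nested cones are monotone by construction: I(R1) ≤ I(R2) whenever R1 ≤ R2, so a set of cone sizes encodes the cumulative radial energy profile.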
Smaller feed-forward dense networks are trained to use the information in one or more isolation cones (see the Appendix for details on network architectures and training). We evaluate the performance of each approach by comparing the integral of the ROC (Receiver Operating Characteristic) curve, known as the AUC (Area Under the Curve).

The standard approach of using a single isolation cone yields an AUC of 0.780 for the optimal cone size; similar performance was seen for other cone sizes. The muon image network achieves a significantly higher performance, with an AUC of 0.842, and the particle-flow network reaches 0.848. This immediately suggests that there is significant additional information available to distinguish between the prompt and non-prompt muons beyond what is summarized in the isolation cones. A more restricted version of the PFN, an Energy-Flow Network [15] (EFN), which enforces infrared and collinear (IRC) safety, achieves nearly the same performance, 0.843. This suggests that most of the additional information beyond the isolation cones is IRC-safe.

We hypothesized that additional cones would provide useful information about the radial energy distribution. Including a second cone with a distinct R value as input to a small neural network (see Appendix A) slightly improves performance, with an AUC of 0.785. To estimate the full information available in the cones, we perform a greedy search through all 18 cones; we find that a set of 8 cone sizes is sufficient, with further cones yielding no significant improvement.

FIG. 2: Comparison of classification performance, using the performance metric AUC, between particle-flow networks trained on unordered lists of calorimeter deposits (orange, solid), convolutional networks trained on muon images (blue, dashed), and networks which use increasing numbers of isolation cones (green, solid). For each number of cones, the optimal set is chosen.

FIG. 3: Background rejection versus signal efficiency for particle-flow networks trained on unordered lists of calorimeter deposits (orange, solid), convolutional networks trained on muon images (blue, dashed), networks trained on a set of isolation cones (purple, dotted), and the benchmark approach, a single isolation cone (green, dashed).

ANALYSIS

The networks which use the calorimeter cells directly have the most powerful performance, but our aim is not simply to optimize classification performance in this particular simulated sample. Instead, we seek to understand the nature of the learned strategy in order to validate it and translate it into simpler, more easily interpretable high-level features which can be studied in other datasets, real or simulated. In addition, this understanding can reveal how well the strategy is likely to generalize to other kinds of jets that are not represented by this background sample, such as charm jets.

The CNN and PFN results indicate that the radially symmetric isolation cones fail to utilize some information which is relevant to the classification task. In this section, we search for additional high-level observables which capture this information.

Search Strategy

Interpreting the decisions of a deep network with a high-dimensional input vector is notoriously difficult. Instead, we attempt to translate its performance into a smaller set of interpretable observables [16].
This allows us to understand the nature of the information being used, as well as to represent it more compactly.

As the background non-prompt muons are due to jet production, we search within a set of observables originally intended for the analysis of jets: the Energy Flow Polynomials (EFPs) [20], a formally infinite set of parameterized engineered functions, inspired by previous work on energy correlation functions [21], which sum over the contents of the cells scaled by relative angular distances. These parametric sums are described as the set of all isomorphic multigraphs where:

    each node ⇒ Σ_{i=1}^{N} z_i ,   (1)
    each k-fold edge ⇒ (θ_{ij})^k .   (2)

The observable corresponding to each graph can be modified with parameters (κ, β), where

    (z_i)^κ = ( p_{Ti} / Σ_j p_{Tj} )^κ ,   (3)
    θ_{ij}^β = ( Δη_{ij}² + Δφ_{ij}² )^{β/2} .   (4)

Here, p_{Ti} is the transverse momentum of cell i, and Δη_{ij} (Δφ_{ij}) is the pseudorapidity (azimuth) difference between cells i and j. As the EFPs are normalized, they capture only relative information about the energy deposition. For this reason, in each network that includes EFP observables, we include as an additional input the sum of p_T over all cells, to indicate the overall scale of the energy deposition.

The original IRC-safe EFPs require κ = 1. To explore a broader space of observables, we also consider examples with κ ≠ 1. (Note that for κ < 0, empty cells are omitted from the sum.)

In principle, the space spanned by the EFPs is complete, such that any jet observable can be described by one or more EFPs of some degree. One might consider simply searching this space for all possible combinations of EFPs for a set which maximizes performance for this task. Such a search is computationally prohibitive; instead, we follow the black-box guided algorithm of Ref. [16], which iteratively assembles a set of EFPs that mimic the decisions of another guiding network (the CNN or PFN in our case) by isolating the portion of the input space where the guiding network disagrees with the isolation network, and finding EFPs which mimic the guiding network's decisions in that subspace.

Here, the agreement between networks f(x) and g(x) is evaluated over pairs (x, x') by comparing their relative classification decisions, expressed mathematically as

    DO[f, g](x, x') = Θ( (f(x) − f(x')) (g(x) − g(x')) ) ,   (5)

and referred to as the decision ordering (DO). DO = 0 corresponds to inverted decisions over all input pairs, and DO = 1 corresponds to the same decision ordering. As prescribed in Ref. [16], we scan the space of EFPs to find the observable that has the highest average decision ordering (ADO) with the guiding network when averaged over disordered pairs. The selected EFP is then incorporated into the new network of HL features, HLN_{n+1}, and the process is repeated until the ADO plateaus.

IRC Safe Observables

We begin our search by considering only a small set of simple observables: those which are IRC safe (κ = 1), with a simple angular weighting (β ∈ [1, 2]) and a small number of nodes and edges. We include Σ p_T, where the summation is over all calorimeter cells in the image, to set the scale accompanying the normalized EFPs. The first EFP observable identified is a simple three-point correlator,

    Σ_{a,b,c=1}^{N} z_a z_b z_c θ_{ab} θ_{bc} θ_{ca} ,

which, when combined with the isolation cones and Σ p_T, yields an AUC of 0.813 and an ADO with the CNN of 0.897, a significant boost relative to just using the radial information of the isolation cones.
The subsequent scans produce variants of this observable,

    Σ_{a,b,c=1}^{N} z_a z_b z_c θ_{ab}² θ_{bc}² ,
    Σ_{a,b,c=1}^{N} z_a z_b z_c θ_{ab}² θ_{bc}² θ_{ca}² ,
    Σ_{a,b=1}^{N} z_a z_b θ_{ab}² ,

with additional edges corresponding to higher powers of the angular information. Their power may come from their sensitivity to the collimated radiation pattern of the jet. Together with the isolation cones, these observables reach an AUC of 0.821 and an ADO with the CNN of 0.908; see Table I.

This set of observables partially closes the performance gap with the calorimeter cell networks, indicating that angular information is relevant to the muon isolation classification task, but fails to fully match their performance. Further scans in this limited space do not yield significant boosts in AUC or ADO values. Distributions of these EFPs for signal and background are shown in Fig. 4. A scan guided by the CNN rather than the PFN yields very similar results, with identical choices for the first three EFPs.

IRC-unsafe Observables

To understand the nature of the remaining information used by the PFN but not captured by the isolation cones and the IRC-safe observables, we expand the search space to include observables which are not IRC safe, allowing a broader range of κ and β values, with up to n = 7 nodes and d = 7 edges. A scan of these observables finds a set of 10 which, when combined with the isolation cones and Σ p_T, reach an AUC of 0.827. Due to the overlapping nature of the large space of EFPs, there are many sets which achieve similar performance. Rather than focusing on the specific EFPs selected, we take the value of this plateau as a measure of the power contained in our finite subset of the formally infinite space of EFPs. Again, a similar scan guided by the CNN rather than the PFN yields very similar results.

DISCUSSION

The performance of the networks which use the low-level calorimeter cells indicates that information exists in these cells which is not captured by the isolation cones; see Table I.
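For reference, the decision ordering of Eq. (5) that guides these scans can be sketched numerically. This simplified version averages over all distinct input pairs rather than restricting to disordered signal-background pairs:

```python
import numpy as np

def average_decision_ordering(f_out, g_out):
    """Average of DO[f,g](x, x') = Theta[(f(x) - f(x'))(g(x) - g(x'))]
    over all distinct input pairs: 1 means f and g order every pair
    identically, 0 means they invert every pair."""
    f = np.asarray(f_out, dtype=float)
    g = np.asarray(g_out, dtype=float)
    df = np.subtract.outer(f, f)
    dg = np.subtract.outer(g, g)
    iu = np.triu_indices(len(f), k=1)   # count each unordered pair once
    return float((df[iu] * dg[iu] > 0).mean())
```

Note that the ADO compares only the relative ordering of outputs, so it is invariant under any monotonic rescaling of either network's scores.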
A guided search through the space of EFPs closes approximately half of the gap between these networks, giving us some insight as to the nature of the information.

FIG. 4: Distributions of the log of the selected IRC-safe EFPs, as chosen by the black-box guided strategy, for prompt (signal) muons and non-prompt (background) muons. The panels correspond to the selected EFPs with (κ = 1, β = 1) and (κ = 1, β = 2).

However, given that the set of EFPs is formally complete, the remaining gap presents an interesting puzzle. Why is there no EFP which can capture the information used by the calorimeter-cell networks? One clue lies in the assumptions that underlie the claim that EFPs are a complete basis for IRC-safe observables. Specifically, it is assumed that the calorimeter cell inputs are rotationally and translationally invariant, such that a transformation does not affect the value of the observable. In this case, however, an important element of the learning task violates that assumption: the location of the muon at the center of the image. As a consequence, the EFPs do not have access to information about the relative angle between a cell and the muon location, which is clearly important for this task. Instead, they can only access angular information between cells.

TABLE I: Summary of performance (AUC) and average decision ordering (ADO) in the prompt muon classification task for various network architectures and input features.

  Inputs                              AUC    ADO
  8 Iso + Σp_T + 1 IRC-safe EFP       0.813  0.897
  8 Iso + Σp_T + 4 IRC-safe EFPs      0.821  0.908
  8 Iso + Σp_T + 10 IRC-unsafe EFPs   0.827  0.923
  Calo image CNN                      0.842  0.949
  Calo cell Energy-Flow Net           0.843  0.947
  Calo cell Particle-Flow Net         0.848  1
The particle-flow network, in contrast, does not assume this invariance, and can learn that the angle relative to the center of the image is important. An extension of the EFP sets which includes an additional node of another class, to indicate the location of the muon, would likely close the performance gap, but is beyond the scope of this work.

CONCLUSIONS

We have applied deep networks to low-level calorimeter deposits surrounding prompt and non-prompt muons in order to estimate the amount of classification power available and to probe whether the standard methods are fully capturing the relevant information.

The performance of the calorimeter cell networks significantly exceeds the benchmark approach, a single isolation cone. The use of several isolation cones provides some improvement, suggesting that there is additional useful information in the full radial energy distribution. However, a substantial gap remains, hinting that there is non-radial structure in the calorimeter cells which provides useful information for classification. We map the strategy of the calorimeter cell networks into a set of energy flow polynomials, finding four IRC-safe, simple three-point correlators which capture a significant amount of the missing information. As they are simple functions of the energy deposition, they can be physically interpreted, and the fidelity of their modeling can be reliably extrapolated from control regions in collider data. Any boost in muon identification performance is extremely valuable to searches at the LHC, especially those with multiple leptons, where event-level efficiencies depend sensitively on object-level efficiencies.

Additional, non-IRC-safe EFPs provide a further modest boost in performance, but do not close the gap with the PFN and CNN, suggesting that additional information remains to be extracted. It is possible that the remaining information could be captured by more complex observables we have not included in our EFP subset, or may require an extension of the EFP observables to include information such as the location of the muon. The strong performance of the IRC-safe EFN suggests that most of the additional information beyond the isolation cones is IRC-safe.

More broadly, the existence of a gap between the performance of state-of-the-art high-level features and networks using lower-level calorimeter information represents an opportunity to gather additional power in the battle to suppress lepton backgrounds. Rather than employing black-box deep networks directly, we have demonstrated the power of using them to identify the relevant observables from a large list of physically interpretable options. This allows the physicist to understand the nature of the information being used and to assess its systematic uncertainty. While these studies were performed with simulated samples, similar studies can be performed using unsupervised methods [22, 23] on samples of collider data, which we leave to future work.

ACKNOWLEDGEMENTS

We would like to thank Michael Fenton, Dan Guest, and Jesse Thaler for providing valuable feedback and insightful comments, and Yuzo Kanomata for computing support. We also wish to acknowledge a hardware grant from NVIDIA. This material is based upon work supported by the National Science Foundation under grant number 1633631. DW is supported by the DOE Office of Science. The work of JC and PB was in part supported by grants NSF 1839429 and NSF NRT 1633631 to PB.

[1] M. Aaboud et al. (ATLAS), Phys. Rev. D97, 052010 (2018), 1712.08119.
[2] R. Schoefbeck, Nuclear and Particle Physics Proceedings, 631 (2016), ISSN 2405-6014, 37th International Conference on High Energy Physics (ICHEP), URL https://www.sciencedirect.com/science/article/pii/S2405601415005842.
[3] V. Khachatryan et al. (CMS), JHEP, 189 (2015), 1508.07628.
[4] I. Hoenig, G. Samach, and D.
Tucker-Smith, Phys. Rev. D, 023 (2014), 1408.1075.
[5] G. Aad et al. (ATLAS), Eur. Phys. J. C76, 292 (2016), 1603.05598.
[6] R. Aaij et al. (LHCb), Phys. Rev. Lett., 061801 (2018), 1710.02867.
[7] Z. Hall and J. Thaler, JHEP, 164 (2018), 1805.11622.
[8] ATLAS Collaboration, Tech. Rep. ATL-PHYS-PUB-2020-018, CERN, Geneva (2020), URL https://cds.cern.ch/record/2724632.
[9] J. Collado, J. N. Howard, T. Faucett, T. Tong, P. Baldi, and D. Whiteson (2020), 2011.01984.
[10] C. Brust, P. Maksimovic, A. Sady, P. Saraswat, M. T. Walters, and Y. Xin, JHEP, 079 (2015), 1410.0362.
[11] P. Baldi, P. Sadowski, and D. Whiteson, Nature Communications (2014).
[12] P. Baldi, Deep Learning in Science (Cambridge University Press, Cambridge, UK, 2021), in press.
[13] J. Cogan, M. Kagan, E. Strauss, and A. Schwarztman, JHEP, 118 (2015), 1407.5675.
[14] P. Baldi, K. Bauer, C. Eng, P. Sadowski, and D. Whiteson, Phys. Rev. D93, 094034 (2016), 1603.09349.
[15] P. T. Komiske, E. M. Metodiev, and J. Thaler, JHEP, 121 (2019), 1810.05165.
[16] T. Faucett, J. Thaler, and D. Whiteson (2020), 2010.11998.
[17] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli, and M. Zaro, JHEP, 079 (2014), 1405.0301.
[18] T. Sjostrand, S. Mrenna, and P. Z. Skands, JHEP, 026 (2006), hep-ph/0603175.
[19] J. de Favereau et al. (DELPHES 3), JHEP, 057 (2014), 1307.6346.
[20] P. T. Komiske, E. M. Metodiev, and J. Thaler, JHEP, 013 (2018), 1712.07124.
[21] A. J. Larkoski, G. P. Salam, and J. Thaler, arXiv (2013), 1305.0007.
[22] L. M. Dery, B. Nachman, F. Rubbo, and A. Schwartzman, JHEP, 145 (2017), 1702.00414.
[23] E. M. Metodiev, B. Nachman, and J. Thaler, JHEP, 174 (2017), 1708.02949.
[24] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., TensorFlow: Large-scale machine learning on heterogeneous systems (2015), software available from tensorflow.org, URL https://www.tensorflow.org/.
[25] F.
Chollet et al., Keras, https://keras.io (2015).
[26] D. P. Kingma and J. Ba, CoRR abs/1412.6980 (2014), 1412.6980, URL http://arxiv.org/abs/1412.6980.
[27] A. M. Saxe, J. L. McClelland, and S. Ganguli, CoRR abs/1312.6120 (2013), 1312.6120, URL http://arxiv.org/abs/1312.6120.
[28] L. Hertel, J. Collado, P. Sadowski, J. Ott, and P. Baldi, SoftwareX (2020), software available at https://github.com/sherpa-ai/sherpa, 2005.04048.
[29] X. Glorot, A. Bordes, and Y. Bengio, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (PMLR, Fort Lauderdale, FL, USA, 2011), vol. 15 of Proceedings of Machine Learning Research, pp. 315-323, URL http://proceedings.mlr.press/v15/glorot11a.html.
[30] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Journal of Machine Learning Research, 1929 (2014).
[31] P. Baldi and P. Sadowski, Artificial Intelligence, 78 (2014).
[32] P. T. Komiske, E. M. Metodiev, and J. Thaler, Journal of High Energy Physics (2019), ISSN 1029-8479, URL http://dx.doi.org/10.

A. Neural Network Architectures

All networks were trained in TensorFlow [24] and Keras [25]. The networks were optimized with Adam [26] for up to 100 epochs with early stopping. For all networks except the PFNs, the weights were initialized using orthogonal weights [27]. Hyperparameters were optimized using Bayesian optimization with the Sherpa hyperparameter optimization library [28]. The variables and ranges for the hyperparameters are shown in Tables II and III.

Below are further details regarding the networks which use images and those which use isolation and EFP observables.

B. Muon Image Networks

The pixelated images were preprocessed to have zero mean and unit standard deviation. The best muon image network structure begins with three convolutional blocks. Each block contains two convolutional layers with 56 filters with rectified linear units [29], followed by a 2x2 pooling layer.
Afterwards there are four fully connected layers with 178 rectified linear units and a final layer with a sigmoidal logistic activation function to classify signal vs. background. The model had dropout [30, 31] with value 0.2062 on the fully connected layers, an initial learning rate of 0.0002, and a batch size of 128.

TABLE II: Hyperparameter ranges for Bayesian optimization of convolutional networks

  Parameter                        Range           Value
  Num. of convolutional blocks     [1, 3]          3
  Num. of filters                  [16, 128]       56
  Num. of fully connected layers   [2, 5]          4
  Num. of hidden units             [25, 200]       178
  Learning rate                    [0.0001, 0.01]  0.0002
  Dropout                          [0.0, 0.5]      0.2062

C. Particle-Flow Networks

The Particle Flow Network (PFN) is trained using the energyflow package [32]. Input features are taken from the muon image pixels and preprocessed by subtracting the mean and dividing by the variance. The PFN uses 3 dense layers in the per-particle frontend module and 3 dense layers in the backend module. Each layer uses 100 nodes, relu activation, and the glorot_normal initializer. The final output layer uses a sigmoidal logistic activation function to predict the probability of signal or background. The Adam optimizer is used with a learning rate of 0.0001 and a batch size of 128.

D. Isolation Cone Networks

The isolation inputs are preprocessed by subtracting the mean and dividing by the variance. We trained neural networks with two to eight fully connected hidden layers, depending on the hyperparameter value, and a final layer with a sigmoidal logistic activation function to predict the probability of signal or background. For the minimal set of isolation inputs, the best model we found had 4 fully connected layers with 179 rectified linear hidden units [29] and a learning rate of 0.0002.

TABLE III: Hyperparameter ranges for Bayesian optimization of fully connected networks

  Parameter              Range           Value
  Num. of layers         [2, 8]          4
  Num. of hidden units   [1, 200]        179
  Learning rate          [0.0001, 0.01]  0.0002
  Dropout                [0.0, 0.5]      0.0160
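The per-particle/backend structure of the PFN described above can be illustrated with a pure-NumPy forward pass. This is not the trained model: the weights are random stand-ins, and the block only demonstrates the deep-sets layout (a per-particle network Φ, sum pooling, then a backend network F) and its defining property that the output is invariant under permutations of the input cells:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def g(fan_in, fan_out):
    # random stand-in weights, scaled to keep activations O(1)
    return rng.normal(size=(fan_in, fan_out)) / np.sqrt(fan_in)

d_in, width = 3, 100
W_phi = [g(d_in, width), g(width, width), g(width, width)]   # per-particle frontend
W_f   = [g(width, width), g(width, width), g(width, 1)]      # event-level backend

def pfn_forward(cells):
    """cells: (n_particles, d_in) array, e.g. rows of (pT, eta, phi)."""
    h = cells
    for W in W_phi:              # Phi is applied identically to every particle
        h = relu(h @ W)
    pooled = h.sum(axis=0)       # permutation-invariant sum over particles
    for W in W_f[:-1]:           # backend network F on the pooled latent vector
        pooled = relu(pooled @ W)
    logit = pooled @ W_f[-1]
    return float(1.0 / (1.0 + np.exp(-logit[0])))  # sigmoid score
```

Because the only interaction between particles is the sum, reordering the input list cannot change the output, which is what makes this architecture suitable for unordered lists of calorimeter cells.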