Prepared for submission to JHEP
Benchmarking Machine Learning Techniques with Di-Higgs Production at the LHC
B. Tannenwald,^a C. Neu,^a A. Li,^a G. Buehlmann,^a A. Cuddeback,^a L. Hatfield,^a R. Parvatam,^a C. Thompson^a

^a University of Virginia, 248 McCormick Road, Charlottesville, VA, USA
E-mail: [email protected]
Abstract:
Many domains of high energy physics analysis are starting to explore machine learning techniques. Powerful methods can be used to identify and measure rare processes from previously insurmountable backgrounds. One of the most profound Standard Model signatures still to be discovered at the LHC is the pair production of Higgs bosons through the Higgs self-coupling. The small cross section of this process makes detection very difficult even for the decay channel with the largest branching fraction (hh → bb̄bb̄). This paper benchmarks a variety of approaches (boosted decision trees, various neural network architectures, semi-supervised algorithms) against one another to catalog a few of the various techniques available to high energy physicists as the era of the HL-LHC approaches.

Introduction

The use of machine learning (ML) techniques in high energy particle physics has rapidly expanded in the last few decades. The proliferation of methods and applications has touched nearly every segment of analysis and reconstruction [3] and will be vital in understanding the full dataset of the Large Hadron Collider (LHC) and data from future colliders. Common approaches involve using linear techniques like decision trees and non-linear approaches like neural networks. These techniques are then used to reconstruct objects like leptons and jets, to tag b-quarks and boosted decays, and to classify different processes. Many models depend on kinematic input features physicists have traditionally used; other architectures rely on emergent features produced in a more abstract phase-space. With so many usable machine-learning options available, interesting questions arise about what types of information are best to feed to our networks. The information that is best for physicists to learn from might not be optimal for sophisticated computing algorithms.

The goal of this paper is to explore a wide range of current ML techniques that can be used to identify di-Higgs production at the High Luminosity LHC (HL-LHC). Observing di-Higgs production is necessary to measure the self-coupling of the Higgs boson and fully understand the nature of electroweak symmetry breaking. The difficulty in measuring Higgs pair production lies in the tiny cross-section of even the largest branching fraction (hh → bb̄bb̄) and the relative abundance of similarly reconstructed QCD events.

This paper is organized as follows: Section 2 deals with the physics relevant for di-Higgs production and the QCD background. Sections 3 and 4 summarize the various ML methods tested. The maximum significance

$$\sigma = \frac{N_{\text{signal}}}{\sqrt{N_{\text{background}}}} \qquad (1.1)$$

is provided for each method along with event yields normalized to the HL-LHC design integrated luminosity of 3000 fb⁻¹. Section 5 compares the results from the various approaches.

The Higgs boson is an essential part of the Standard Model (SM) of particle physics and is a product of the mechanism responsible for electroweak symmetry breaking. Along with the interactions of the Higgs boson with the other SM particles, an additional Higgs self-coupling interaction is predicted at tree-level by the SM. This mechanism contributes to non-resonant Higgs boson pair production together with quark-loop contributions via Yukawa-type interactions. Figure 1 shows the two leading order diagrams of non-resonant Higgs boson pair production. Since the production cross section for Higgs boson pair production is extremely small within the SM [4], σ_hh(14 TeV) ≈ 39 fb, any significant deviation would indicate the presence of new physics.

Figure 1. Leading order Feynman diagrams for non-resonant production of Higgs boson pairs in the Standard Model through (a) the Higgs boson self-coupling and (b) the Higgs-fermion Yukawa interaction.
Many extensions of the SM predict the existence of additional scalar bosons which may have mass larger than twice the Higgs mass and can decay into a Higgs boson pair. Searching for resonances in the hh mass spectrum can help us discover or limit exotic models which predict the presence of such particles. More importantly, measuring the SM di-Higgs cross-section (or placing limits on its magnitude) allows us to probe the self-coupling of the Higgs field and better understand the mechanism behind electroweak symmetry breaking.

The following work is focused on techniques for distinguishing non-resonant (SM-like) Higgs boson pair production where both Higgs bosons decay via h → bb̄. Using the bb̄ decay mode provides the largest possible amount of signal events but requires powerful background reduction techniques due to the large production cross-section of fully hadronic QCD processes. All results are quoted for simulated events produced by pp collisions with a center-of-mass energy of 14 TeV and scaled to the HL-LHC design integrated luminosity of 3000 fb⁻¹. Simulated samples were produced using Madgraph v2.7.0 [6] in a ROOT v6.12/04 [5] environment. Events were then showered using Pythia v8.2.44 [7] and reconstructed with Delphes v3.0 [8] using the v2 approximation of the upgraded Phase-II CMS detector.

Both the signal and background samples were generated with minimal pileup additions sampled from a Poisson distribution with an expectation value of zero additional vertices. An additional generator-level cut requiring total hadronic energy greater than 300 GeV was applied when generating background QCD events. All code used to set up the generation environment and produce events is publicly available [9]. A summary of the sample generation details is shown in Table 1. The effective cross-sections in Table 1 are used for all event yield calculations in subsequent sections.

Name | Process | σ_eff [fb] | N_events
Di-Higgs | pp → hh, h → bb̄ | … | …
QCD | pp → bb̄bb̄ | … | …

Table 1. Madgraph processes, effective cross-sections, and number of generated events for the signal and background samples used in this paper. The effective cross-sections differ from the total theoretical cross-sections due to branching fractions and generation-level cuts on hadronic activity.
The assumption of zero additional background interactions per event is an unrealisticapproximation of the HL-LHC environment where an average of two hundred additionalpileup interactions is expected for each collision. This paper is an initial study aimed atgauging the power of various techniques in a simplified environment. Future studies willextend the approaches explored here to a realistic HL-LHC regime.
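To make the normalization above concrete, the short sketch below shows how per-event weights and the Eq. (1.1) significance can be computed. This is our illustration rather than code from [9], and the numerical values are placeholders, not the values of Table 1.

```python
import math

HL_LHC_LUMI = 3000.0  # fb^-1, HL-LHC design integrated luminosity

def event_weight(sigma_eff_fb: float, n_generated: int) -> float:
    """Weight each simulated event so that summed weights give the expected yield."""
    return sigma_eff_fb * HL_LHC_LUMI / n_generated

def significance(n_signal: float, n_background: float) -> float:
    """Maximum-significance metric of Eq. (1.1): sigma = S / sqrt(B)."""
    return n_signal / math.sqrt(n_background)

# Placeholder example: a 10 fb effective cross-section, one million generated
# events, and 50,000 of them passing some selection.
w = event_weight(sigma_eff_fb=10.0, n_generated=1_000_000)
print(significance(n_signal=w * 50_000, n_background=2.8e5))
```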
The first step in reconstructing the 4b system is to reconstruct and select b-jet candidates. Jets are clustered using the anti-k_T algorithm with a radius of R = 0.4. To be selected for use in event reconstruction, a jet must have p_T > … GeV and |η| < …. Delphes uses an internal b-tagging efficiency parameterization to predict whether jets are tagged. An event is only fully reconstructed if at least 4 jets are b-tagged (unless otherwise specified). The properties and kinematics of selected jets are shown in Figure 2. This strict requirement of at least 4 b-tags in an event helps to reduce contributions from QCD and resolve some of the combinatoric ambiguity in event reconstruction.

Once events with at least 4 b-tags are selected, there is a choice about how to reconstruct the di-Higgs system. Several reconstruction methods were tested for pairing b-jets to find an optimal algorithm for correctly pairing Higgs boson constituents. Two algorithms were selected for use in the following sections: the first iterates through all selected jets in an event and returns the two pairs with di-jet masses closest to one another. The second returns the two jet pairs that yield two Higgs candidates with masses closest to the known Higgs boson mass by minimizing the quantity

$$\chi^2 = \left(\frac{m_1 - m_h}{\sigma_h}\right)^2 + \left(\frac{m_2 - m_h}{\sigma_h}\right)^2$$

where m_1,2 are the candidate di-jet masses, m_h = 125 GeV is the Higgs boson mass, and σ_h = 10 GeV is the expected experimental resolution of the Higgs boson mass in the bb̄ decay channel. A code sketch of this pairing is given after Table 2. Unless otherwise specified, the method that selects di-jets with masses closest to each other is used when training algorithms that require reconstructed events. Figure 3 shows a selection of distributions describing the di-Higgs system using this reconstruction algorithm.

Figure 2. Sample of event information and leading jet kinematics from QCD and di-Higgs simulation. Distributions are normalized to the same area to compare shapes.

Reconstructed variables include the masses and momenta of the two- and four-body Higgs candidates as well as the angular separations between the two Higgs candidates and their constituent jets. Additional event-level variables like the number of selected jets, the number of b-tagged jets, and the missing transverse energy in the event were also considered as inputs to various algorithms. A full list of all variables considered for each approach is given in Table 2. All variables were evaluated using the Kolmogorov-Smirnov (KS) test for individual separation power between signal and background. Variables were then sorted in descending order of KS separability. Each algorithm is trained on a subset that minimizes the number of variables while not sacrificing performance.

Figure 3. Sample of reconstructed kinematics of the di-Higgs system in QCD and di-Higgs simulation. Distributions are normalized to the same area to compare shapes.

Event-Level | Reconstructed di-Higgs System | Higgs Candidates | Jet Variables
N_jets, N_b-tags, Missing E_T, Scalar H_T | m_hh, p_T^hh, η_hh, φ_hh, ΔR(h1, h2), Δφ(h1, h2) | m_h1, p_T^h1, η_h1, φ_h1, m_h2, p_T^h2, η_h2, φ_h2, ΔR(h1 jets), ΔR(h2 jets), Δφ(h1 jets), Δφ(h2 jets) | p_T^(i)j, η^(i)j, φ^(i)j, b-tag^(i) for the selected jets i = 1, …

Table 2. Complete list of all variables considered for analysis in each method. Event-level variables require no di-Higgs reconstruction.
Reconstructed variables are defined only after selecting the jet pairs to reconstruct the di-Higgs system in each event. The sub/superscript 'hh' corresponds to the four-body di-Higgs system, the sub/superscript 'h1' corresponds to the leading p_T Higgs candidate, and the sub/superscript 'h2' corresponds to the sub-leading p_T Higgs candidate. Kinematics for all individual jets chosen by the reconstruction are also available for use in training classification algorithms.
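As a concrete illustration of the χ²-based pairing described above, the minimal sketch below enumerates the possible disjoint di-jet pairings of the selected b-jets. The helper names are ours, and jets are assumed to be NumPy (E, px, py, pz) four-vectors; the paper's actual implementation lives in [9].

```python
import itertools
import numpy as np

M_H = 125.0     # GeV, Higgs boson mass hypothesis
SIGMA_H = 10.0  # GeV, assumed experimental m(bb) resolution

def inv_mass(p4):
    """Invariant mass of a four-momentum (E, px, py, pz)."""
    e, px, py, pz = p4
    return np.sqrt(max(e**2 - px**2 - py**2 - pz**2, 0.0))

def chi2_pairing(jets):
    """Return the two disjoint di-jet pairs minimizing the chi^2 above."""
    best_chi2, best_pairs = None, None
    idx = range(len(jets))
    for i, j in itertools.combinations(idx, 2):
        others = [m for m in idx if m not in (i, j)]
        for k, l in itertools.combinations(others, 2):
            if (k, l) < (i, j):
                continue  # skip the mirror image of an already-tested split
            m1 = inv_mass(jets[i] + jets[j])
            m2 = inv_mass(jets[k] + jets[l])
            chi2 = ((m1 - M_H) / SIGMA_H) ** 2 + ((m2 - M_H) / SIGMA_H) ** 2
            if best_chi2 is None or chi2 < best_chi2:
                best_chi2, best_pairs = chi2, ((i, j), (k, l))
    return best_pairs, best_chi2
```

The alternative algorithm described above simply ranks pairings by |m1 − m2| instead of the χ².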
Supervised Learning
Searches for specific signatures or interactions in collider data can be thought of as a classification problem: some known signal process must be identified and separated from some known and well-modeled set of background processes. Any iterative algorithm can then improve its ability to properly separate signal from background by comparing its predicted classifications to the true known classifications and adjusting its internal parameters. This type of approach is known as supervised machine learning, and it is particularly relevant when training models to distinguish between different known processes. We discuss several supervised learning approaches below.
Boosted Decision Trees (BDTs) have a long history in high energy physics, from enabling the first observation of single top production at the Tevatron [1, 2] to helping in the discovery of the Higgs boson at the LHC [10, 11]. A decision tree functions by making a series of sequential cuts (or decisions) that each maximize the separation between signal and background events for a single variable. This separation is quantified by the Gini impurity index

$$I = 1 - \sum_{i=1}^{J} p_i^2$$

where i runs over the J classes in a dataset and p_i is the fraction of items labeled as class i in the set after a tree split. By maximizing the purity (i.e. minimizing the impurity) produced by subsequent cuts, a well-designed decision tree efficiently splits the original dataset into independent sub-populations grouped by class.

Any series of cuts for identifying events will inevitably misclassify some events, and there are many strategies for improving the results. A boosted decision tree attempts to improve the classification by creating a new set of data from the improperly classified events and training a new decision tree on these inputs. Each step of re-training with misclassified events is called a boost, and the total prediction for an event is the weighted sum of predictions from the original tree plus the predictions from the boosted trees, where each sequential boost receives a smaller weight in the sum.

The BDT trained for di-Higgs detection was built using the xgboost package [12]. The top seventeen reconstructed and event-level variables ranked by KS separability (see Table 2 for a summary of variables) were used in training. These included the mass, momentum, and angular separation variables for the di-Higgs system and the Higgs candidate di-jets. Event-level variables (e.g. N_b-tags, N_jets) and individual jet momenta were also used in training.

A set of tuneable variables called hyperparameters determines the training behavior and ultimate performance of any machine learning model, including BDTs. Of all the hyperparameters that describe the BDT, five were found to have significant impact on the final performance of the model: the multiplicative boost factor described above, the maximum number of decisions allowed per tree (max depth), a minimum separation improvement necessary for a further cut (γ), a term that reduces large absolute boosted corrections (L1 regularization or α), and a term that reduces the impact of large squared corrections in each boost (L2 regularization or λ). Regularization terms are a common element of many machine learning algorithms and are helpful in avoiding over-training. The set of hyperparameters that yielded the best significance for the BDT was: a boost factor of 0.1, a maximum tree depth of 9, γ of 1.1, an L1 regularization term of 22.1, and an L2 regularization term of 8.3; this configuration is sketched in code below.

Figure 4. Signal predictions of the trained BDT for independent signal and background samples not used for training.
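A minimal sketch of this configuration using xgboost (the input arrays here are random placeholders standing in for the seventeen KS-ranked variables, not the analysis inputs):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 17))    # placeholder for the 17 input variables
y = rng.integers(0, 2, size=10_000)  # placeholder signal/background labels

bdt = xgb.XGBClassifier(
    learning_rate=0.1,   # multiplicative boost factor
    max_depth=9,         # maximum number of decisions per tree
    gamma=1.1,           # minimum separation gain required for a further cut
    reg_alpha=22.1,      # L1 regularization
    reg_lambda=8.3,      # L2 regularization
    objective="binary:logistic",
)
bdt.fit(X, y)
scores = bdt.predict_proba(X)[:, 1]  # signal-like score in [0, 1]
```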
The predictions from the optimized BDT are shown in Figure 4. A maximum significance of 1.84 ± 0.09 was obtained, with 986 signal events and 2.8 × 10⁵ background events.

Random Forest algorithms share a similar tree structure with BDTs, but they leverage ensembles of independent decision trees as opposed to iteratively improving the predictions of a single tree using misclassified events. Each tree in a random forest is 'grown' using a random sampling of input variables and training events. The randomness of the sampling ensures each tree yields a unique but correlated prediction compared to the other trees in the forest. The class prediction of the forest is the majority vote of the constituent trees. Tuning the hyperparameters of a random forest requires optimizing the number of trees in the forest, the variable sub-sampling used to produce each tree, and the depth of the constituent trees.

The random forest trained for di-Higgs classification uses the reconstruction algorithm that selects di-jet pairs consistent with a Higgs mass hypothesis, and the same reconstructed and event-level variables used for the BDT were used as input to the forest. The random forest was implemented using xgboost [12]. Five hyperparameters were found to have a meaningful impact on the performance of the model: the number of trees in the forest, the maximum tree depth, an L1 regularization term, the fraction of events used as inputs to each tree (the sub-sample fraction), and the fraction of reconstruction variables randomly selected as inputs for each tree (the column sub-sampling rate). Similar to the approach for optimizing the hyperparameters of the BDT, an optimal random forest configuration was obtained by individually varying each significant hyperparameter over a reasonable range and selecting the best performing model.

The optimal random forest was trained with 300 constituent trees, a maximum tree depth of 20, an L1 regularization term of 1.175, a sub-sampling fraction of 0.8, and a column sub-sampling rate of 0.8 (sketched below). The best significance for the random forest approach was found to be σ = 2.44 ± 0.19, with 544 signal events and approximately 5.0 × 10⁴ background events. Distributions of the predictions for signal and background are shown in Figure 5.

Figure 5. Output score on the testing dataset with the fully trained random forest classifier.
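A corresponding sketch of the optimal random forest configuration, again with placeholder inputs; xgboost's random forest interface is one way to realize the row and column sub-sampling described above:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 17))
y = rng.integers(0, 2, size=10_000)

forest = xgb.XGBRFClassifier(
    n_estimators=300,      # number of trees in the forest
    max_depth=20,          # maximum depth of each constituent tree
    reg_alpha=1.175,       # L1 regularization
    subsample=0.8,         # event (row) sub-sampling fraction per tree
    colsample_bynode=0.8,  # variable (column) sub-sampling rate
)
forest.fit(X, y)
```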
Fully connected or feed-forward neural networks (NNs) also have a long history in high energy physics. The fundamental element of any neural network is called a layer. Multiple layers are stacked together to produce a final prediction given the input variables from the first layer. This predicted outcome is then evaluated against a known target value. A fully-connected network can have multiple internal (or hidden) layers between the input and output layers, and each hidden layer is composed of a series of activation functions and trainable weights that allow the network to identify and iteratively combine important features of the input space. A function (called the loss function) is chosen to quantify the difference between the model prediction and target values. The loss calculated after a single training iteration is used to adjust the internal network weights in the next training iteration through a process called backpropagation. The model is fully trained once the improvement in the loss between iterations falls beneath some user-defined threshold.

The NN trained for di-Higgs detection was built using the Keras [13] and TensorFlow [14] packages. The top twenty-two most separable reconstructed and event-level variables were used as the input variables for the NN. These included the mass, momentum, and angular separation variables for the di-Higgs system and the Higgs candidate di-jets, momentum and angular information from the individual selected jets, and all event-level variables. Hyperparameter optimization for neural networks involves the additional step of designing the structure of the network: how many hidden layers to use, how many nodes per hidden layer, which activation functions to use in a given layer, whether to use dropout layers, which remove random weights in each training iteration to prevent over-fitting, and whether to add batch normalization layers, which stabilize and accelerate network training by re-scaling the neuron weights of the previous layer.

Several models were trained by individually tuning each hyperparameter over a reasonable range in order to produce a final optimized model. The number of hidden layers was similarly optimized by training models with one, two, three, and four hidden layers. No networks with more than four hidden layers were tested in order to keep the number of trainable model parameters below approximately one tenth the number of training events. The final network structure consisted of the input layer, two hidden layers, and a single-node output layer. The first hidden layer contained 175 nodes with an L2 kernel regularizer. The second hidden layer contained 90 nodes with no kernel regularizer. A batch normalization layer and a dropout layer (dropout fraction 0.2) were placed in between the two hidden layers to prevent over-fitting. Both hidden layers used a rectified linear unit (ReLU) activation function, while the output layer used a sigmoid activation function. The final performance was found to be independent of the choice of activation functions, and they are included here for completeness. The binary cross-entropy loss function

$$L = -\left(y \log(p) + (1 - y)\log(1 - p)\right)$$

was used for the backpropagation. Here y represents the known binary target classifications, p represents the predicted probability that a given event is signal, and log is the natural logarithm. A schematic flowchart of the network structure is shown in Figure 6, and a code sketch is given after Figure 7.

Figure 6. Structure of the feed-forward neural network.
The input variables are fed through two fully connected dense layers to classify events. One dropout layer and one batch normalization layer help mitigate over-fitting during training.

The NN was trained for 25 epochs before the minimal loss-improvement threshold was met, and the results are shown in Figure 7. The trained model obtained a maximum σ = 2.40 ± 0.08, with a signal yield of 1659 events and a background yield of approximately 4.8 × 10⁵ events.

Figure 7. Final predictions of the feed-forward network for signal and background samples.
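A sketch of the final architecture in Keras follows. The L2 regularization strength is an assumption on our part (the value did not survive extraction), and the optimizer choice is likewise illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(22,)),                      # 22 input variables
    layers.Dense(175, activation="relu",           # first hidden layer
                 kernel_regularizer=regularizers.l2(1e-5)),  # strength assumed
    layers.BatchNormalization(),
    layers.Dropout(0.2),                           # dropout fraction 0.2
    layers.Dense(90, activation="relu"),           # second hidden layer
    layers.Dense(1, activation="sigmoid"),         # single-node output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```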
Convolutional Neural Networks (CNNs) are neural networks that predict the content of an input image by using assumptions about the local relationships between neighboring pixels. For this analysis, content prediction is simplified to a general classification of whether the image comes from a di-Higgs or QCD event. The fundamental elements of any convolutional network are convolutional layers and pooling layers. Convolutional layers use filters that perform linear combinations of neighboring pixels within the filter size, and pooling layers aggregate information by grouping neighboring pixels using either their maximum or average values. After some number of these layers, the output is flattened into a one-dimensional vector, and this flattened vector is pushed through a set of feed-forward layers in order to make a final output prediction.

Many previous papers have explored the use of convolutional networks trained on low-level quantities (e.g. tracks and calorimeter deposits) for the purposes of object identification [15] at colliders. This paper extends the application to event-level identification. Using low-level quantities removes the need to reconstruct higher-level objects like jets or jet pairs; only the detector-level measurements are required for image creation. The performance of four convolutional networks was studied in the context of di-Higgs identification. The first network used a 3-layer image composed of energy/momentum weighted tracks, electromagnetic calorimeter deposits, and hadronic calorimeter deposits. The second network used the same three layers but appended additional global event-level information to the flattened vector after image processing and before the fully connected layers. Figures 8 and 9 depict both network structures. The third and fourth networks followed the same pattern as the previous two but with the addition of two image layers corresponding to longitudinal and transverse impact parameter-weighted track information.
Figure 8. Structure of the nominal convolutional neural network. The input images are fed through two convolutional layers and a single max-pooling layer before being flattened into a one-dimensional vector. The flattened vector is then fed through one fully connected layer, a batch normalization layer, and a final fully connected layer before a final prediction is made.
Figure 9. Structure of the hybrid convolutional neural network. The input images are fed through two convolutional layers and a single max-pooling layer before being flattened into a one-dimensional vector. Scaled user-specified variables (e.g. H_T) are then concatenated with the flattened image vector. The concatenated vector is then fed through one fully connected layer, a batch normalization layer, and a final fully connected layer before a final prediction is made.

In order to produce coherent images, the center of mass and the center of momentum for each event are calculated. All constituents are then boosted longitudinally into the center of mass of the event and rotated in φ to the center of momentum. After this pre-processing, each image layer corresponds to a 31x31 pixel grid centered on the total activity in the event. Figure 10 shows the average di-Higgs image layers (a–e) and the average QCD image layers (f–j). While the average image layers for each sample closely resemble one another, they do contain different information, and variations are visible.

Importantly, clear differences are observed between the average QCD images and the average di-Higgs images. Each half of the di-Higgs image (split across φ = 0) is arranged in a roughly circular, isotropic shape due to the spin-0 nature of the Higgs. The QCD images appear balanced because of the effect of the pre-processing, but no similar circular structure is produced. Additionally, the variance of pixel intensities in di-Higgs images is much smaller than the variance in QCD images due to the more balanced kinematics of Higgs pair production compared to QCD processes.

Figure 10. Average image layers for di-Higgs (a–e) and QCD (f–j) decays. Sub-figures (a) and (f) show p_T-weighted tracks, (b) and (g) show E_T-weighted ECAL deposits, (c) and (h) show E_T-weighted HCAL deposits, (d) and (i) show transverse impact parameter-weighted tracks, and (e) and (j) show longitudinal impact parameter-weighted tracks.

As shown in Figures 8 and 9, the CNN network structure uses two sequential 2D convolutional layers, each with 16 3x3 filters, one max-pooling layer with a 2x2 window, a flattening of the outputs, two 64-node fully connected hidden layers, and one output layer for making the final prediction; a code sketch follows Table 3. As previously described, two of the networks append additional high-level variables (the scalar sum of transverse hadronic energy, the number of jets, and the number of b-tags) after the flattening and before the image information is fed through the fully connected layers. The optimal significance for each network is shown in Table 3. The best results were obtained using the 5-color network with additional high-level inputs, and the final predictions for this configuration are shown in Figure 11. A best significance of 2.86 ± 0.02 was obtained with this configuration; the corresponding signal and background yields are given in Table 7.
Method | Best σ | AUC
Tracks+HCAL+ECAL | 1.77 ± … | …
Tracks+HCAL+ECAL + high-level inputs | … | …
5-color | … | …
5-color + high-level inputs | 2.86 ± 0.02 | …

Table 3. Best significance and AUC for each convolutional network, normalized to the full HL-LHC dataset of 3000 fb⁻¹.

Figure 11. Signal prediction for the 5-color convolutional network with additional high-level inputs. The total area of the signal and background predictions are normalized to unity for easier shape comparison.
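A sketch of the hybrid structure of Figure 9 in Keras, assuming 31x31 five-channel images and three appended high-level variables; the activation choices are assumptions, since only the layer structure is specified above:

```python
from tensorflow import keras
from tensorflow.keras import layers

image_in = keras.Input(shape=(31, 31, 5))          # 5-color event image
x = layers.Conv2D(16, (3, 3), activation="relu")(image_in)
x = layers.Conv2D(16, (3, 3), activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)

hl_in = keras.Input(shape=(3,))                    # scaled H_T, N_jets, N_b-tags
x = layers.Concatenate()([x, hl_in])
x = layers.Dense(64, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model([image_in, hl_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The nominal network of Figure 8 is the same sketch with the concatenation step removed.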
Energy Flow Networks (EFNs) and Particle Flow Networks (PFNs) are neural networks that also operate on basic jet constituent information as input, rather than reconstructed jets and multi-jet composites [16]. The EFN structure takes only the rapidity, y, and azimuthal angle, φ, of jet constituents as input, while the PFN takes the rapidity, azimuthal angle, and transverse momentum, p_T, of jet constituents as input. Both the EFN and PFN are two-component networks, and their internal structures are shown in Figure 12.

Network (a) takes jet constituent information as input and uses a set of fully connected neural network layers to produce a set of latent features, Φ. Linear combinations of the latent variables, O, are then used as the input for network (b), which uses another set of fully connected layers to produce final predictions. The EFN and PFN used for di-Higgs classification use 200 nodes for each hidden layer in network (a), 256 nodes for the latent space dimension, and 300 nodes for each hidden layer in network (b); a code sketch is given below.

Figure 12. Example of a complete EFN/PFN network. Network (a) uses jet constituent information as input and outputs a set of latent variables, Φ. Network (b) produces linear combinations, O, of the latent variables which are then fed through several fully connected layers to make a final prediction.

The EFN/PFN networks were trained using four separate categories split by number of jets and number of b-tags to test the networks' dependence on higher-level jet information. Independent networks were trained using: all events, only events with ≥ … jets and ≥ … b-tags, and only events with ≥ … jets and ≥ … b-tags. In each configuration, the number of signal and background events was adjusted to maintain an equal proportion of each population in the training sample. L2 regularization and dropout layers were added to minimize over-fitting.
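The reference implementation from the EnergyFlow package [16] can realize this architecture directly. A minimal PFN sketch with the layer sizes quoted above (the number of hidden layers per component is our assumption):

```python
from energyflow.archs import PFN

# Per-constituent inputs: (y, phi, pT). The EFN variant drops pT from the
# per-particle inputs and uses it as an energy weight instead.
pfn = PFN(
    input_dim=3,
    Phi_sizes=(200, 200, 256),  # network (a); the final entry is the latent dim
    F_sizes=(300, 300),         # network (b)
)
# pfn.fit(X, Y, epochs=..., batch_size=...) with X of shape
# (n_events, max_constituents, 3), zero-padded, and Y one-hot labels
# (hypothetical training arrays).
```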
The results obtained from each EFN configuration are shown in Table 4; the results of each PFN configuration are shown in Table 5.

Category | σ | N_Signal | N_Background
All Events | 1.41 ± 0.006 | … | …
≥ … jets, ≥ … b-tags | … ± 0.006 | … | …
≥ … jets, ≥ … b-tags | … ± 0.006 | … | …
≥ … jets, ≥ … b-tags | … ± 0.008 | 3468 | …

Table 4. Results of the optimal EFN performance across categories. Event yields are normalized to the full HL-LHC dataset of 3000 fb⁻¹. The provided errors take into account the Monte Carlo statistical uncertainty and the uncertainty on the Madgraph generated cross-section.

Both networks performed best when trained over all events without any cuts on the number of jets or b-tags. The EFN obtained a highest significance of 1.41 ± 0.006; the best PFN significance is given in Table 5.
Category | σ | N_Signal | N_Background
All Events | … ± 0.008 | … | …
≥ … jets, ≥ … b-tags | … ± 0.008 | … | …
≥ … jets, ≥ … b-tags | … ± 0.009 | … | …
≥ … jets, ≥ … b-tags | … ± 0.009 | 3297 | …

Table 5. Results of the optimal PFN performance across categories. Event yields are normalized to the full HL-LHC dataset of 3000 fb⁻¹. The provided errors take into account the Monte Carlo statistical uncertainty and the uncertainty on the Madgraph generated cross-section.

While it makes sense to treat searches for new physics or rare signatures as a supervised classification problem, an alternative approach is to let an algorithm learn intrinsic features from an unlabeled dataset and then evaluate whether this self-learned information can be used to identify sub-samples within a complete dataset. The process of learning features from unlabeled datasets is called unsupervised learning. Unsupervised machine learning can be transformed into semi-supervised learning when the results of unsupervised learning are evaluated using known classification information.

k-Means Clustering

K-means clustering is an unsupervised learning algorithm that finds natural unlabeled groupings in a scaled phase-space describing the input dataset. Scaled data is created by compressing or stretching each variable of the dataset to the same maximum and minimum values. This ensures each variable is treated equally when calculating the distances between neighboring points. Minimum values of 0 and maximum values of 1 were chosen for all variables in this analysis.

The k-means approach creates clusters by randomly seeding a user-specified number of cluster centroids in the scaled phase-space. Each point is associated to the closest centroid, and the group of points belonging to a single centroid is defined as a cluster. The centroid positions are updated over several iterations by minimizing the ensemble sum of the squared distance between a centroid and all points associated to that cluster. As the centroids move, points can be associated to different clusters, eventually yielding a set of clusters defined by locally dense groupings of similar points. Combining unsupervised clustering with the supervised structure of a BDT converts the unsupervised approach into a semi-supervised algorithm whose performance can be compared to other supervised methods.

The number of clusters, k, to fit is a user-defined hyperparameter, and the number of clusters used to describe a dataset must be carefully chosen. Using too few clusters risks not being able to identify meaningfully different populations in the input dataset. Using too many clusters risks losing predictive power by splitting coherent sets of points into arbitrary groupings. An optimal k is found by scanning several values, calculating the total distance between all points and their associated centroids, and selecting a k that minimizes this distance without asymptotically approaching zero.

The optimal k for the di-Higgs data was found to lie around 20, and three different clusterings (k = 15, 20, 40) were tested for completeness. The scenarios with 15 and 40 clusters test the effects of under-clustering and over-clustering respectively. Two sets of clustered data were used as input to the nominal BDT to test the performance of this semi-supervised approach; a sketch of the clustering-plus-BDT pipeline is given below.
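One plausible realization of this pipeline (our sketch with placeholder data; exactly how the cluster information is handed to the BDT is an assumption on our part):

```python
import numpy as np
import xgboost as xgb
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 24))      # placeholder kinematic variables
y = rng.integers(0, 2, size=20_000)    # labels, used only by the BDT stage

X01 = MinMaxScaler().fit_transform(X)  # scale every variable to [0, 1]
clusters = KMeans(n_clusters=20, n_init=10).fit_predict(X01)

# Semi-supervised step: pass the unsupervised cluster assignment to the
# nominal BDT as an additional input feature.
X_semi = np.column_stack([X01, clusters])
xgb.XGBClassifier().fit(X_semi, y)
```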
The first set used all reconstructed kinematic variables as the input to the k-means clustering stage before training the BDT. The second set was produced by performing a principal component analysis (PCA) decomposition on the nominal kinematic inputs before passing through the clustering step and the BDT. PCA is a technique for finding an orthogonal basis of the input data that maximizes the variance captured along each successive axis. No transformation was found to improve the performance of the nominal configuration, and the results are shown in Table 6.

Method | σ | N_sig | N_bkg
Nominal BDT | 1.84 ± 0.09 | … | …
15 Clusters | 1.29 ± … | … | …
15 Clusters + PCA | 1.25 ± … | … | …
20 Clusters | 1.30 ± … | … | …
20 Clusters + PCA | 1.27 ± … | … | …
40 Clusters | 1.44 ± … | … | …
40 Clusters + PCA | 1.34 ± … | … | …

Table 6. Significance and yields showing BDT performance when using the nominal kinematic inputs, clustered kinematic inputs, and clustered inputs from a PCA decomposition. All yields are normalized to the full HL-LHC dataset of 3000 fb⁻¹.

Autoencoder

An autoencoder (AE) is an unsupervised machine learning architecture used for detecting anomalies that differ significantly from the data used to train the network. The structure of the AE compresses the input information into a lower-dimensional representation called the latent space. This compression 'encodes' the most important features of the training data into the latent space, while the second half of the network 'decodes' the latent space back into a representation approaching the original inputs.

This construction fundamentally changes the meaning of the loss calculation: rather than computing the loss between a prediction and a target, the AE loss is a measure of how well the network reproduces the original inputs after encoding and decoding. Inputs that differ significantly from the data used to train the AE will not be properly reconstructed, and anomalies can be identified by selecting events with large losses. Training with Monte Carlo simulations allows for a semi-supervised cross-check on AE performance since the classes of training and testing samples are known in advance.

Because AE anomaly detection relies on a well-modeled understanding of background processes, the network was trained using only QCD events. Additional models were trained by substituting the pure-QCD training sample with training samples consisting of mixtures of QCD and di-Higgs events to test the stability of the method against signal contamination. No significant deterioration was observed for reasonable levels of contamination. The AE used for di-Higgs detection was built using the Keras package [13] and consists of an input layer, a single hidden layer, and an output layer. Eleven reconstructed variables (di-Higgs and di-jet Higgs candidate masses, momenta, and angular variables) were selected for use in the AE.
Figure 13. PCA performed on the selected eleven kinematic inputs. The x-axis indicates the number of PCA features, and the y-axis indicates the variance. Choosing the optimal size for the latent space requires identifying the point of diminishing variance return.
The output layer is a mirror of the input layer and therefore has 11 nodes. Similar to the k-means clustering, the size of the latent space should be large enough to model the intrinsic features of the input data without being so large that the latent representation is able to learn features drawn arbitrarily from the input space. A PCA analysis (shown in Figure 13) was used to determine that a latent space of three nodes was able to capture over 95% of the input variance. The hidden layer and output layer use ReLU and sigmoid activation functions respectively. An L2 regularization term was added to the hidden layer to avoid over-fitting. A code sketch of this architecture is given after Figure 14. Because training an autoencoder is an unsupervised and unlabeled process, there is no prediction of whether a given event is signal or background. Since di-Higgs events should be relatively anomalous compared to the QCD training set, signal events should have a relatively larger average loss value. Cutting on the loss function allows for a significance to be calculated for comparison with other methods.

Training for several hundred epochs leads the model to converge, and it reaches an asymptotic ensemble loss value near 0.7 (see Figure 14). The best significance, reported in Table 7, was obtained for events with individual reconstruction loss scores greater than 0.05. Lower thresholds allowed slightly too much background, while higher loss thresholds rejected too many signal events. Still, this optimal point is somewhat misleading since the signal and background loss distributions have little separation. The highest significance result is effectively a cut that keeps nearly all events. This suggests that the kinematic inputs used in training do not significantly differ between signal and background processes after the latent space compression.

Figure 14. (Left) The loss of the AE during QCD training/reconstruction converged after 700 epochs. (Right) The loss distribution generated by the AE when tested separately on QCD and di-Higgs event data.
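A minimal Keras sketch of the 11 → 3 → 11 architecture described above; the L2 strength and the exact reconstruction loss are assumptions (binary cross-entropy is a natural pairing of sigmoid outputs with [0, 1]-scaled inputs):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

inputs = keras.Input(shape=(11,))                   # 11 kinematic variables
latent = layers.Dense(3, activation="relu",         # 3-node latent space
                      kernel_regularizer=regularizers.l2(1e-5))(inputs)
outputs = layers.Dense(11, activation="sigmoid")(latent)  # mirrors the input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# Train on (scaled) QCD events only; flag events whose per-event
# reconstruction loss exceeds a threshold (0.05 above) as anomalous.
```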
The methods covered in this paper are by no means an exhaustive review of the ML landscape available to high energy physics. Still, a wide range of techniques and philosophies is covered. Table 7 provides a summary of the methods described in the previous sections. The raw yields after selection and reconstruction cuts (depending on the method) have been weighted by the effective cross-sections listed in Table 1 and scaled to 3000 fb⁻¹. The result from a traditional 1-D sequential cut technique is shown for comparison, though the details were not discussed in the previous sections. Clear gains in sensitivity compared to this baseline are apparent for many of the ML models tested in this review.

An important caveat to keep in mind is that all results discussed here were determined in conditions with zero pileup. In higher pileup environments like those expected at the HL-LHC, reconstruction algorithms see serious reductions in correct combinatoric matching. This effect will certainly degrade the expected performance of techniques that rely on explicit event reconstruction. Methods that do not rely on event reconstruction (CNN, PFN) might be more robust to these effects, and this should be studied in further work. The unsupervised AE technique performed the worst among all the methods tested, but this is likely a reflection of the fact that model-specific methods often outperform model-agnostic methods when evaluated on the model used in training.
Method | σ | N_Signal | N_Background
Autoencoder | … ± 0.01 | 5840 | …
Sequential cuts | … ± 0.02 | 3621 | …
k-Means Clustering | 1.44 ± 0.02 | 1703 | …
Particle Flow Network | … ± 0.01 | … | …
Boosted Decision Tree | 1.84 ± 0.09 | 986 | 2.8 × 10⁵
Feed-Forward NN | 2.40 ± 0.08 | 1659 | 4.8 × 10⁵
Random Forest | 2.44 ± 0.19 | 544 | 5.0 × 10⁴
Convolutional NN | 2.86 ± 0.02 | … | …

Table 7. Comparison of method significance and signal/background yields normalized to the full HL-LHC dataset of 3000 fb⁻¹. The provided errors take into account the Monte Carlo statistical uncertainty and the uncertainty on the Madgraph generated cross-section.

Measuring the rate of di-Higgs production will be a problem facing the high energy physics community through the end of the HL-LHC era. The techniques explored in this paper show the power of machine learning in identifying di-Higgs signals amid overwhelming QCD backgrounds. The significance obtained for many of these methods is impressive given the simplicity of the evaluation metric. This bodes well for both current and future measurements of Higgs pair production at the LHC.

References

[1] D0 collaboration, Evidence for Production of Single Top Quarks and First Direct Measurement of |Vtb|, Phys. Rev. Lett. 98 (2007) 181802 [hep-ex/0612052].
[2] CDF collaboration, Measurement of the Single Top Quark Production Cross Section at CDF, Phys. Rev. Lett. 101 (2008) 252001.
[3] K. Albertsson et al., Machine Learning in High Energy Physics Community White Paper, 2018.
[4] LHC Higgs Cross Section Working Group collaboration, Handbook of LHC Higgs Cross Sections: 4. Deciphering the Nature of the Higgs Sector.
[5] R. Brun and F. Rademakers, ROOT: An Object Oriented Data Analysis Framework, Nucl. Instrum. Meth. A 389 (1997) 81.
[6] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer et al., The Automated Computation of Tree-Level and Next-to-Leading Order Differential Cross Sections, and their Matching to Parton Shower Simulations, JHEP 07 (2014) 079.
[7] T. Sjöstrand, S. Ask, J.R. Christiansen, R. Corke, N. Desai, P. Ilten et al., An Introduction to PYTHIA 8.2, Comput. Phys. Commun. 191 (2015) 159–177.
[8] J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens et al., DELPHES 3: A Modular Framework for Fast Simulation of a Generic Collider Experiment, JHEP 02 (2014) 057.
[9] B. Tannenwald, A. Li, A. Cuddeback, R. Parvatam and C. Thompson, "dihiggsMLProject." https://github.com/neu-physics/dihiggsMLProject, 2020.
[10] G. Aad, T. Abajyan, B. Abbott, J. Abdallah, S. Abdel Khalek, A. Abdelalim et al., Observation of a New Particle in the Search for the Standard Model Higgs Boson with the ATLAS Detector at the LHC, Phys. Lett. B 716 (2012) 1–29.
[11] S. Chatrchyan, V. Khachatryan, A. Sirunyan, A. Tumasyan, W. Adam, E. Aguilo et al., Observation of a New Boson at a Mass of 125 GeV with the CMS Experiment at the LHC, Phys. Lett. B 716 (2012) 30–61.
[12] T. Chen and C. Guestrin, XGBoost: A Scalable Tree Boosting System, CoRR abs/1603.02754 (2016).
[13] F. Chollet, "Keras." https://github.com/fchollet/keras, 2015.
[14] M. Abadi et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, CoRR abs/1603.04467 (2016).
[15] J. Alison, S. An, P. Bryant, B. Burkle, S. Gleyzer, M. Narain et al., End-to-end Particle and Event Identification at the Large Hadron Collider with CMS Open Data, in Meeting of the Division of Particles and Fields of the American Physical Society, 10, 2019.
[16] P.T. Komiske, E.M. Metodiev and J. Thaler, Energy Flow Networks: Deep Sets for Particle Jets, JHEP 01 (2019) 121.