Unsupervised clustering for collider physics
Submitted to Physical Review D
V. Mikuni and F. Canelli
University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
E-mail: [email protected]
Abstract:
We propose a new method for unsupervised clustering for collider physics, named UCluster, where information in the embedding space created by a neural network is used to categorise collision events into different clusters that share similar properties. We show how this method can be developed into an unsupervised multiclass classification of different processes and applied in anomaly detection of events to search for new physics phenomena at colliders.

1 Introduction

The Standard Model (SM) of particle physics has been successful so far at describing the interaction of fundamental particles in high energy physics (HEP). The ATLAS [1] and CMS [2] Collaborations have tested the SM extensively using particle collision events at the CERN Large Hadron Collider (LHC), while also looking for deviations from the SM that could point to physics beyond the SM (BSM). Since the underlying nature of the new physics is not known, new methods designed to be model independent have proliferated in recent years. These strategies aim at finding deviations or detecting anomalies while using only SM events, avoiding any dependence on BSM signals. For a short review of recent approaches see [3].

For measurements of SM parameters, a fully unsupervised multiclass classification method would be advantageous, particularly for precision measurements. Simulations are often needed to describe the properties of the different processes produced in LHC collisions. However, simulated events are not always precise for all physics processes. This can be caused either by a lack of simulated events compared to the data expectation, or by the need for corrections that are beyond the accuracy of the approximations used in the simulation. Further precision might be computationally prohibitive to achieve, or beyond the capability of our current methods. To mitigate these issues, different data-driven methods often replace the event simulations.
See [4–7] for recent examples.

When two or more processes are not well modelled, the common approach is to design multiple control regions, often defined using high-level distributions, to create a high-purity sample that allows a data-driven estimation and modelling of the process. However, since it is not always straightforward to define each of these regions without relying on simulations, an unsupervised multiclass classification approach could be used instead.

In this paper, we introduce a method for unsupervised clustering (UCluster). The main idea of UCluster is to use a neural network (NN) to reduce the data dimensionality while retaining the main properties of the data set. In this reduced representation, a clustering objective is added to the training to encourage points embedded in this space to be close together when they share similar properties and far apart otherwise. We test the performance of UCluster on two different tasks: unsupervised multiclass classification of three different SM processes and unsupervised anomaly detection.
2 Related work

Recently, different and innovative strategies have been proposed for unsupervised training in HEP, mostly in the context of event classification. A few examples of methods exploiting anomaly detection signatures as over-densities are [8, 9] and, more recently, [3]. In these approaches, anomalous events are identified as localised excesses in some distribution, where machine learning is then used to enhance the local significance of the new physics process.

While many strategies focus on unsupervised anomaly detection, other methods have also been proposed to better understand SM processes without relying on simulation, like the work developed in [10] for quark and gluon classification with jet topics and the methods developed in [11], employing latent Dirichlet allocation to build a data-driven top-quark tagger. In order to create an unsupervised and model-independent approach, the majority of the strategies rely on binary classification, where the main goal is to test whether an event (or a group of events) resulting from a particle collision is compatible with one of two competing hypotheses. Approaches applied to mixed samples with more than two components were also studied in [12, 13], where prior knowledge of the label proportion of each component in the mixed sample is required to achieve good performance.

In this work, we propose an unsupervised method for multiclass classification whose only requirement is the expected number of different components in a mixed sample. The same method is applied to anomalous event detection, where the data are partitioned into clusters that isolate the anomaly from the backgrounds.
3 UCluster

UCluster consists of two components: a classification step to ensure that events with similar properties are close in the embedding space created by a NN, and a clustering step, where the network learns to cluster embedded events with similar properties. These two tasks are accomplished by means of a combined loss function containing independent components for each of the described steps.

The classification loss (L_focal), applied to the output nodes of the NN, is defined by the focal loss [14]. The focal loss improves the classification performance for unbalanced labels, which is the case for the classification tasks introduced in the following sections. The expression for the focal loss is:

$$\mathcal{L}_{\mathrm{focal}} = -\frac{1}{N}\sum_{j}^{N}\sum_{m}^{M} y_{j,m}\,\big(1 - p_{\theta,m}(x_j)\big)^{\gamma}\,\log\big(p_{\theta,m}(x_j)\big), \tag{3.1}$$

where p_{θ,m}(x_j) is the network's confidence for event x_j, with trainable parameters θ, to be classified as class m. The term y_{j,m} is 1 if class m is the correct assignment for event j and 0 otherwise. In this work, we fix the hyperparameter of the focal loss to γ = 2.

The clustering loss (L_cluster) is defined similarly to the loss developed in [15]:

$$\mathcal{L}_{\mathrm{cluster}} = \frac{1}{N}\sum_{k}^{K}\sum_{j}^{n} \lVert f_{\theta}(x_j) - \mu_k \rVert^2\, \pi_{jk}, \tag{3.2}$$

where the distance between each event j and each cluster centroid µ_k is calculated in the embedding space f_θ of the neural network with trainable parameters θ. The function π_{jk} weighs the importance of each event and takes the form:

$$\pi_{jk} = \frac{e^{-\alpha \lVert f_{\theta}(x_j) - \mu_k \rVert^2}}{\sum_{k'} e^{-\alpha \lVert f_{\theta}(x_j) - \mu_{k'} \rVert^2}}, \tag{3.3}$$

with the hyperparameter α identified as an inverse temperature. Since L_cluster is differentiable, stochastic gradient descent can be used to jointly optimise the trainable parameters θ and the centroid positions µ_k.

The combined loss to be minimised is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{focal}} + \beta\,\mathcal{L}_{\mathrm{cluster}}. \tag{3.4}$$

The hyperparameter β controls the relative importance of the two losses.
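As a concrete illustration, the combined objective of Eqs. 3.1–3.4 can be sketched in NumPy as below. The paper's implementation uses TensorFlow with ABCNet; the function names and this NumPy formulation are ours, for illustration only.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Eq. 3.1: focal loss averaged over N events.
    p: (N, M) predicted class probabilities; y: (N, M) one-hot labels."""
    eps = 1e-9  # numerical guard for log(0)
    return -np.mean(np.sum(y * (1.0 - p) ** gamma * np.log(p + eps), axis=1))

def soft_assignments(f, mu, alpha=1.0):
    """Eq. 3.3: pi_jk, a softmax over clusters of -alpha * squared distance.
    f: (N, E) embeddings; mu: (K, E) cluster centroids."""
    d2 = np.sum((f[:, None, :] - mu[None, :, :]) ** 2, axis=-1)  # (N, K)
    w = np.exp(-alpha * d2)
    return d2, w / np.sum(w, axis=1, keepdims=True)

def clustering_loss(f, mu, alpha=1.0):
    """Eq. 3.2: soft k-means loss in the embedding space f_theta."""
    d2, pi = soft_assignments(f, mu, alpha)
    return np.sum(d2 * pi) / f.shape[0]

def combined_loss(p, y, f, mu, beta=10.0, gamma=2.0, alpha=1.0):
    """Eq. 3.4: L = L_focal + beta * L_cluster."""
    return focal_loss(p, y, gamma) + beta * clustering_loss(f, mu, alpha)
```

In the paper, both θ and the centroids µ_k are updated by gradient descent on this combined loss; the functions above only evaluate it.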
For these studies, we fix β = 10 to ensure that both components have the same order of magnitude. Since L_cluster requires an initial guess for the centroid positions, we pre-train the model using only L_focal for 10 epochs. After the pre-training, the k-means algorithm [16] is applied to the object embeddings to initialise the cluster centroids. The full training is then carried out with the combined loss defined in Eq. 3.4. To allow the cluster centres to change, the inverse temperature α has a starting value of 1 and increases linearly by 2 in each following epoch.

4 ABCNet

The implementation of UCluster is done using ABCNet [17], a graph-based neural network where each reconstructed particle is taken as a node in a graph. The importance of each node is then learned by the model through attention mechanisms. The embedding space for the clustering loss in Eq. 3.2 is taken as the output of an aggregation (max-pooling) layer.

Figure 1. ABCNet architecture used in UCluster for a batch size N, F input features, and embedding space of size E. Fully connected layers and encoding node sizes are denoted inside "{}". For each GAPLayer, the number of k-nearest neighbours (k) and heads (H) are given. The additional components used only for anomaly detection are shown in red.

For the following studies, the 10 nearest neighbours of each particle are used to calculate the GAPLayers [18]. The initial distances are calculated in the pseudorapidity–azimuth (η–φ) space using the distance $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$. The second GAPLayer uses the Euclidean distances in the space created by the subsequent fully connected layers. The architectures used for multiclass classification and anomaly detection are depicted in Fig. 1. Besides the output classification size, both tasks share almost identical architectures. The model used for anomaly detection uses additional high-level distributions and additional skip connections after the pooling layer to improve the classification performance. In both cases the batch size is set to 1024 and the training is stopped after 100 epochs.

ABCNet is implemented in TensorFlow v1.14 [19]. An Nvidia GTX 1080 Ti graphics card is used for the training and evaluation steps. For all tasks described in this paper, the Adam optimiser [20] is used. The learning rate starts from 0.001 and decreases by a factor of 2 every three epochs, until reaching a minimum of 1e-5.

5 Multiclass classification

The applicability of UCluster is demonstrated on an important problem in high energy physics: unsupervised multiclass classification. To achieve good performance, we require a task that results in a suitable embedding space. This task should be such that events stemming from the same physics process are found close together in the embedding space as compared to events from different physics processes. Here, a jet mass classification task is chosen in order to provide meaningful event embeddings.
Given a set of particles belonging to a jet, we ask our model to correctly identify the invariant mass of the jet. This task is inspired by the correlation between jet substructure observables and the invariant mass of a jet [21, 22]. The goal is to have our machine learning method learn to extract relevant information about the different jet substructures by first learning how to correctly identify the mass of a jet. The simplest solution to this problem could be achieved by the four-vector sum of all the jet's constituents, leading to an embedding space with no separation power for different types of jets. To alleviate this issue, we instead define jet mass labels by taking 20 equidistant steps from 10 to 200 GeV, as shown in Fig. 2. The task is then to identify the correct mass interval a jet belongs to, instead of the specific mass value. The input distributions used for the training are listed in Table 1.
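The mass-interval labels can be built, for example, by binning each jet mass into one of the 20 intervals. The bin edges below follow the 10–200 GeV range quoted in the text; the helper name is ours.

```python
import numpy as np

# 20 equidistant mass intervals from 10 to 200 GeV -> 21 bin edges
EDGES = np.linspace(10.0, 200.0, 21)

def mass_label(jet_mass):
    """Index (0-19) of the mass interval the jet falls into; jets outside
    the range are clipped to the first or last interval."""
    return int(np.clip(np.digitize(jet_mass, EDGES) - 1, 0, 19))
```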
Figure 2. Normalised distribution of the jet mass for each category used in the unsupervised multiclass classification task. The bin boundaries represent the boundaries used to define the jet mass labels.
For this study, a sample containing simulated jets originating from W bosons, Z bosons, and top quarks produced in √s = 13 TeV proton-proton collisions is used. This data set is created and configured using a parametric description of a generic LHC detector, described in [24, 25].

Table 1. Description of each feature used to define a point in the ABCNet implementation for unsupervised multiclass classification.

Variable: Description
∆η: Difference between the pseudorapidity of the constituent and the jet
∆φ: Difference between the azimuthal angle of the constituent and the jet
log pT: Logarithm of the constituent's pT
log E: Logarithm of the constituent's E
log pT/pT(jet): Logarithm of the ratio between the constituent's pT and the jet pT
log E/E(jet): Logarithm of the ratio between the constituent's E and the jet E
∆R: Distance in the η–φ space between the constituent and the jet
PID: Particle type identifier as described in [23]

The jets are clustered with the anti-kt algorithm [26] with radius parameter R = 0.8, while also requiring the jet pT to be around 1 TeV, ensuring that most of the decay products of the generated particles are found inside a single jet. The samples are available at [27]. For each jet, up to 100 particles are stored: if more particles are found inside a jet, the list is truncated; otherwise it is zero-padded up to 100. The training set contains 300,000 jets, while the validation sample consists of 140,000 jets.
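The per-constituent features of Table 1 can be computed from the constituent and jet kinematics roughly as follows. This is a sketch with names of our choosing; the categorical PID feature of [23] is omitted.

```python
import numpy as np

def constituent_features(c_eta, c_phi, c_pt, c_e, jet_eta, jet_phi, jet_pt, jet_e):
    """First seven features of Table 1 for a single jet constituent."""
    deta = c_eta - jet_eta
    # wrap the azimuthal difference into (-pi, pi]
    dphi = (c_phi - jet_phi + np.pi) % (2.0 * np.pi) - np.pi
    return np.array([
        deta,                     # Delta eta
        dphi,                     # Delta phi
        np.log(c_pt),             # log pT
        np.log(c_e),              # log E
        np.log(c_pt / jet_pt),    # log pT / pT(jet)
        np.log(c_e / jet_e),      # log E / E(jet)
        np.hypot(deta, dphi),     # Delta R in the eta-phi space
    ])
```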
Figure 3. t-SNE visualisation of the embedding space after the pre-training and before the full training for multiclass classification with 1000 jets. The true label information is shown on the left, while the initial cluster labels obtained with a k-means approach are shown on the right.
To visualise the embedding space, the t-SNE method [28] is applied to 1000 jets taken just after the pre-training with only the classification loss, and compared to the space created after the full training is performed. After the pre-training, the initial label assignment is taken from a k-means approach, shown in Fig. 3 (right), while the true labels are shown in Fig. 3 (left). At this stage, the clustering accuracy, calculated using the Hungarian algorithm [29], is 51%. After the full training is performed, the trained labels are shown in Fig. 4 (right), with a clustering accuracy of 81% compared to the true label assignment in Fig. 4 (left).

Figure 4. t-SNE visualisation of the embedding space after the full training, created for multiclass classification with 1000 jets. The true label information is shown on the left, while the trained clusters are shown on the right.

To inspect the quality of the embedding space further, a supervised KNN is trained using only the embedding features as inputs. Its performance is then compared to a separate KNN with the same setup, but using only the jet mass as input. The supervised KNNs are trained to determine class membership given the labels of the 30 nearest neighbours. The training uses 35k events and is tested on an independent sample of 15k events. The one-vs-all performance is compared using receiver operating characteristic (ROC) curves in Fig. 5, where one category is considered the signal of interest while the others are considered background. The area under the curve (AUC) for each process is also shown. The resulting AUC for the supervised training using the event embeddings is higher than for the jet mass alone in all categories. Top quark classification shows a particularly large improvement from using the embedding space information. We attribute this improvement to jets containing a top quark having a broader mass distribution compared to W and Z bosons, resulting in a worse invariant-mass separation, as seen in Fig. 2. UCluster is able to learn other jet properties beyond the invariant mass, improving the overall performance.

To estimate an upper bound on the UCluster performance, a fully supervised model using the full ABCNet architecture is also trained.
The ABCNet architecture is used to train a classifier with the real class labels as targets, achieving an accuracy of 92%. The comparable results between the fully supervised approach and the KNN trained on the event embeddings demonstrate how the method is able to reduce the dimensionality of the input data while retaining relevant information. The accuracies achieved with full supervision and the other approaches are summarised in Tab. 2.
Table 2. Supervised and unsupervised clustering accuracy of UCluster when using only the embedding space features.

Algorithm: Accuracy
Pre-training k-means: 51%
UCluster: 81%
Supervised KNN: 89%
Supervised training: 92%
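The clustering accuracies above correspond to the best one-to-one matching between clusters and true classes. A minimal version using SciPy's Hungarian-algorithm solver is sketched below; the helper name is ours, and the paper cites the algorithm as [29].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Accuracy after optimally matching cluster indices to class labels."""
    k = int(max(true_labels.max(), cluster_labels.max())) + 1
    counts = np.zeros((k, k), dtype=int)
    for t, c in zip(true_labels, cluster_labels):
        counts[c, t] += 1
    # maximise the number of matched events (minimise the negated counts)
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / len(true_labels)
```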
Figure 5. ROC curves of signal efficiency versus other-jets efficiency for each jet category (top, W, and Z) when considering the other jet categories as background, for a KNN trained on the jet mass alone and a KNN trained on the embeddings. The AUCs are 0.94 vs. 0.98 (top), 0.95 vs. 0.97 (W), and 0.95 vs. 0.97 (Z) for the jet-mass and embedding KNNs, respectively.
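The KNN probe on the embedding features can be sketched without any ML library: class scores are simply vote fractions among the 30 nearest neighbours. The function below is our illustration, not the paper's code; the resulting per-class fractions can be used as scores for one-vs-all ROC curves.

```python
import numpy as np

def knn_proba(train_x, train_y, test_x, k=30, n_classes=3):
    """Fraction of each class among the k nearest training points."""
    # squared Euclidean distances between test and training points
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(d2, axis=1)[:, :k]           # indices of the k nearest
    votes = train_y[nn]                          # (n_test, k) neighbour labels
    return np.stack([(votes == c).mean(axis=1) for c in range(n_classes)], axis=1)
```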
6 Anomaly detection
UCluster can also be applied to anomaly detection. Here, we show an example where anomalous events, created by an unknown physics process, are found close together in the embedding space created from a suitable classification task. This technique is motivated by the fact that, irrespective of the underlying physics model, events created by the same physics process carry similar event signatures.

To create a suitable embedding space, we modify the approach described in Sec. 5 to take into account all the particles created in a collision event rather than a single jet. To do so, the classification task is changed to a part-segmentation task. We consider all particles associated to a clustered jet. Each particle then receives a label proportional to the mass of the jet it was clustered into. For this task, we require the model not only to learn the mass of the jet a particle belongs to, but also to learn which particles should belong to the same jet. This approach is motivated by the fact that jet substructure often contains useful information for distinguishing different physics processes, as studied in the previous section.

The mass labels are created by defining 20 equidistant intervals from 10 to 1000 GeV. For simplicity, only the two heaviest jets are considered per event. A simplified example of the label definition is shown in Fig. 6.
Figure 6. Schematic of the labels for anomaly detection. Each particle associated to a clustered jet receives a mass label proportional to the respective jet mass. The larger the number, the more massive the associated jet.
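This per-particle label construction can be sketched as follows. The names are ours; the bin edges follow the 10–1000 GeV range quoted in the text.

```python
import numpy as np

# 20 equidistant mass intervals from 10 to 1000 GeV
EDGES = np.linspace(10.0, 1000.0, 21)

def particle_labels(jet_masses, jet_index_per_particle):
    """Each particle inherits the mass-bin index (0-19) of the jet it was
    clustered into; only the two heaviest jets are kept per event."""
    bins = np.clip(np.digitize(jet_masses, EDGES) - 1, 0, 19)
    return bins[jet_index_per_particle]
```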
To perform these studies, we use the R&D data set created for the LHC Olympics 2020 [30]. The data set consists of one million quantum chromodynamics (QCD) dijet events simulated with Pythia 8 [31] without pile-up or multiple parton interactions. The BSM signal consists of a hypothetical W' boson with mass m_W' = 3.5 TeV that decays into X and Y bosons with masses m_X = 500 GeV and m_Y = 100 GeV, respectively. The X and Y bosons, in turn, decay promptly into quarks. The detector simulation is performed with Delphes 3.4.1 [32], and particle-flow objects are clustered into jets using the FastJet [33] implementation of the anti-kt algorithm with jet radius R = 1.0. Events are required to have at least one jet with pT > 1.3 TeV. The number of generated signal events is set to 1% of the total number of events. From this data set, 300k events are randomly selected for training, 150k for testing, and 300k events are used to evaluate the clustering performance.

The distributions used as input for ABCNet are described in Tab. 3. To improve the clustering performance, a set of high-level variables is added to the network. The goal of these additional distributions is to parameterize the model performance, as described in [34]. We also point out that, even though a proxy of the jet masses is given as an input, the trivial solution is still not reached, since the model also has to identify which particles belong to which jets. To quantify the performance of UCluster, we start by considering only two clusters with an embedding space of the same dimension. Figure 7 shows the resulting embedding space, without any transformation, for 1000 random events.

Table 3. Description of each feature used to define a point in the point-cloud implementation for anomaly detection. The last two lines are the global information added to parameterize the network.
Variable: Description
∆η: Pseudorapidity difference between the constituent and the associated jet
∆φ: Azimuthal angle difference between the constituent and the associated jet
log pT: Logarithm of the constituent's pT
log E: Logarithm of the constituent's E
log pT/pT(jet): Logarithm of the ratio between the constituent's pT and the associated jet pT
log E/E(jet): Logarithm of the ratio between the constituent's E and the associated jet E
∆R: Distance in the η–φ space between the constituent and the associated jet
log mJ(1,2): Logarithm of the masses of the two heaviest jets in the event
τ21(1,2): Ratio of τ2 to τ1 for the two heaviest jets in the event, with τN defined in [35]

Figure 7. Visualisation of the embedding space created for anomaly detection using 1000 events. Since the embedding space is already two-dimensional, no additional transformation is applied. The true labels are shown on the left, while the clusters created by UCluster are shown on the right.

Most of the BSM events are found in the same trained cluster, confirming the assumption that the signal events end up close together in the embedding space. However, because of the large QCD background contamination present in the same cluster, the signal-to-background (S/B) ratio remains low, increasing only from 1% to 2.5%. If the proximity assumption holds, the cluster S/B ratio can be further enhanced by partitioning the events into more clusters. Indeed, if the classification loss favours an embedding space where signal events remain close together, increasing the number of clusters will decrease the QCD contamination in the signal clusters, whose properties differ from the signal events. To test this assumption, the number of clusters is varied while keeping all other network parameters fixed. The maximum S/B ratio found in a cluster for different numbers of clusters is shown in Fig. 8 (left). The S/B ratio steadily increases with the number of clusters, reaching an average of around 28%. To test how the performance changes with the number of events, different training sample sizes were used while keeping the model fixed, the signal fraction fixed to 1%, and the number of clusters fixed to 30. The result of each training is then evaluated on an independent sample of the same size as the training sample. The resulting approximate significance (S/√B) is shown in Fig. 8 (right). For an initial significance in the range 2–6, we observe enhancements by factors of 3–4.

The uncertainties in Fig. 8 show the standard deviation of five independent trainings with different random initial weights. When many clusters are used, the clustering stability starts to decrease, as evidenced by the larger error bars. This behaviour is expected, since a large cluster multiplicity requires clusters to target more specific event properties that might differ between trainings.

The dijet mass distribution for all events (left) and for the cluster with the highest S/B ratio (right) are shown in Fig. 9.

In order to relate the clusters in the embedding space to physical observables, four high-level features were added to the anomaly detection model: the invariant mass and τ21 of the two heaviest jets in the event. To visualise the physical properties of the clusters, histograms of these four observables are shown in Fig. 10, with the stacked contributions of the individual clusters shown for UCluster with 5 clusters. From these distributions, there is a sharp separation between the cluster boundaries for the mass of the heaviest jet in the dijet event. The sharp separation in jet mass is also related to the separation observed in the heaviest jet τ21. As pointed out in [22], QCD jets show a more distinctive two-prong structure when they have a larger mass; therefore, heavier jets tend to have lower values of τ21. This correlation between jet mass and jet substructure is why the jet mass classification task leads to clusters where jets within a cluster have similar substructure.
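Given per-event cluster assignments and truth flags (available in this benchmark study), the per-cluster S/B and approximate significance can be computed as below. The helper is ours, for illustration.

```python
import numpy as np

def best_cluster_metrics(cluster_ids, is_signal):
    """Maximum S/B and S/sqrt(B) over all clusters."""
    best_sb, best_sig = 0.0, 0.0
    for c in np.unique(cluster_ids):
        sel = cluster_ids == c
        s = float(np.count_nonzero(is_signal[sel]))
        b = float(np.count_nonzero(~is_signal[sel]))
        if b > 0.0:
            best_sb = max(best_sb, s / b)
            best_sig = max(best_sig, s / np.sqrt(b))
    return best_sb, best_sig
```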
Figure 8. Maximum signal-to-background ratio found for different numbers of clusters (left), and maximum significance found for UCluster trained and evaluated on different numbers of events with the number of clusters fixed to 30 (right). The uncertainty shows the standard deviation of the results from five trainings with different random weight initialisations.
Figure 9. Dijet mass distribution of the events prior to clustering (left) and for the cluster with the highest S/B ratio (right), found when the data are partitioned into 60 clusters.
Figure 10. Distributions of the four high-level features (heaviest and second-heaviest jet mass and τ21) used to parameterize the performance of UCluster trained with 5 clusters. Events belonging to the same cluster receive the same colour; the stacked contributions of all clusters are shown.

7 Conclusions

In this work, we presented UCluster, a new method to perform unsupervised clustering of collision events in high energy physics. We explored two potential applications of this method: unsupervised multiclass classification and anomaly detection.

The ability of the embedding space to separate different processes is directly connected to the secondary task used in conjunction with the clustering objective. We proposed a classification task motivated by the observed correlation between the jet mass and jet substructure observables, which is often useful for jet tagging. By learning to classify the mass of a jet, UCluster created an embedding space that was shown to have better separation power for all the class components in the data set compared to the jet mass alone.

UCluster was also studied for unsupervised anomaly detection. In this context, the classification task on jet masses was expanded to cover the entire event topology. Using this method, we were able to increase the signal-to-background ratio in a given cluster from an initial value of 1% up to 28%, while also observing stable performance even for a large cluster multiplicity.

We remark that tasks other than the ones proposed in this work can also be used to create meaningful embeddings. In particular, recent advances in auto-encoders applied to particle physics [36, 37] are strong candidates for a summary statistic that can encapsulate the event information in a lower-dimensional representation suitable for clustering.

Compared to [12, 13], we relax the requirements on the label proportions of the different components in a mixed sample. One interesting point to notice is that, as presented in [38], the clustering assignment problem can instead be interpreted as an optimal transport problem. This insight is particularly interesting when the label proportions are known a priori.
In this case, the additional knowledge of the label proportions can be directly added to the model as a regularisation term of the form:

$$\mathcal{L}_{\mathrm{reg.\,cluster}} = \min_{\pi} \sum_{k}^{K}\sum_{j}^{n} \lVert f_{\theta}(x_j) - \mu_k \rVert^2\,\pi_{jk} + \alpha\,\pi_{jk}\big(\log(\pi_{jk}) - 1\big). \tag{7.1}$$

This approach requires the term π_{jk} to be solved numerically, subject to:

$$\pi\,\mathbf{1}_K = \frac{1}{n}\,\mathbf{1}_N, \tag{7.2}$$
$$\pi^{T}\,\mathbf{1}_N = w, \tag{7.3}$$

where w represents the vector of label proportions.

Furthermore, we considered an application where the initial number of mixed components was known. This condition was necessary to select a suitable number of clusters. However, this requirement could also be relaxed, as shown for example in [39, 40], where the clustering model is able to identify the optimal number of partitions given the properties of a data set.

Finally, UCluster can also be used in conjunction with other anomaly detection approaches, where a set of interesting clusters is first identified and then further inspected by other methods.

Acknowledgements

The authors would like to thank Kyle James Read Cormier for the valuable suggestions regarding the development and clarity of this document. This research was supported in part by the Swiss National Science Foundation (SNF) under contract No. 200020-182037.

References
[1] ATLAS Collaboration, G. Aad et al., The ATLAS Experiment at the CERN Large Hadron Collider, JINST (2008) S08003.
[2] CMS Collaboration, S. Chatrchyan et al., The CMS Experiment at the CERN LHC, JINST (2008) S08004.
[3] B. Nachman and D. Shih, Anomaly detection with density estimation, Phys. Rev. D (2020) 075042.
[4] ATLAS Collaboration, M. Aaboud et al., Search for new phenomena with large jet multiplicities and missing transverse momentum using large-radius jets and flavour-tagging at ATLAS in 13 TeV pp collisions, JHEP (2017) 034, [arXiv:1708.02794].
[5] CMS Collaboration, A. M. Sirunyan et al., Measurement of the ttbb production cross section in the all-jet final state in pp collisions at √s = 13 TeV, Phys. Lett. B (2020) 135285, [arXiv:1909.05306].
[6] CMS Collaboration, A. M. Sirunyan et al., Search for high mass dijet resonances with a new background prediction method in proton-proton collisions at √s = 13 TeV, JHEP (2020) 033, [arXiv:1911.03947].
[7] ATLAS Collaboration, G. Aad et al., Dijet resonance search with weak supervision using √s = 13 TeV pp collisions in the ATLAS detector, arXiv:2005.02983.
[8] E. M. Metodiev, B. Nachman, and J. Thaler, Classification without labels: Learning from mixed samples in high energy physics, JHEP (2017) 174, [arXiv:1708.02949].
[9] J. H. Collins, K. Howe, and B. Nachman, Extending the search for new resonances with machine learning, Phys. Rev. D (2019), no. 1, 014038, [arXiv:1902.02634].
[10] E. M. Metodiev and J. Thaler, Jet topics: Disentangling quarks and gluons at colliders, Phys. Rev. Lett. (2018).
[11] B. M. Dillon, D. A. Faroughy, and J. F. Kamenik, Uncovering latent jet substructure, Phys. Rev. D (2019) 056002.
[12] N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le, Estimating labels from label proportions, Journal of Machine Learning Research (2009), no. 82, 2349–2374.
[13] G. Patrini, R. Nock, P. Rivera, and T. Caetano, (Almost) no label no cry, in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 190–198, Curran Associates, Inc., 2014.
[14] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, Focal loss for dense object detection, CoRR abs/1708.02002 (2017), [arXiv:1708.02002].
[15] M. M. Fard, T. Thonet, and É. Gaussier, Deep k-means: Jointly clustering with k-means and learning representations, CoRR abs/1806.10069 (2018), [arXiv:1806.10069].
[16] J. A. Hartigan and M. A. Wong, Algorithm AS 136: A k-means clustering algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics) (1979), no. 1, 100–108.
[17] V. Mikuni and F. Canelli, ABCNet: An attention-based method for particle tagging, Eur. Phys. J. Plus (2020), no. 6, 463, [arXiv:2001.05311].
[18] C. Chen, L. Zanotti Fragonara, and A. Tsourdos, GAPNet: Graph attention based point neural network for exploiting local feature of point cloud, arXiv e-prints (2019), [arXiv:1905.08705].
[19] M. Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[20] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv e-prints (2014), [arXiv:1412.6980].
[21] J. Dolen, P. Harris, S. Marzani, S. Rappoccio, and N. Tran, Thinking outside the ROCs: Designing decorrelated taggers (DDT) for jet substructure, JHEP (2016) 156, [arXiv:1603.00027].
[22] P. T. Komiske, E. M. Metodiev, and J. Thaler, Metric space of collider events, Phys. Rev. Lett. (2019), no. 4, 041801, [arXiv:1902.02346].
[23] M. Tanabashi et al., Review of particle physics, Phys. Rev. D98 (2018), no. 3, 030001.
[24] E. Coleman, M. Freytsis, A. Hinzmann, M. Narain, J. Thaler, N. Tran, and C. Vernieri, The importance of calorimetry for highly-boosted jet substructure, JINST (2018), no. 01, T01003, [arXiv:1709.08705].
[25] J. Duarte et al., Fast inference of deep neural networks in FPGAs for particle physics, JINST (2018), no. 07, P07027, [arXiv:1804.06913].
[26] M. Cacciari, G. P. Salam, and G. Soyez, The anti-kt jet clustering algorithm, JHEP (2008) 063, [arXiv:0802.1189].
[27] M. Pierini, J. M. Duarte, N. Tran, and M. Freytsis, HLS4ML LHC jet dataset (100 particles), Jan. 2020.
[28] L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research (2008) 2579–2605.
[29] H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly (1955), no. 1-2, 83–97.
[30] G. Kasieczka, B. Nachman, and D. Shih, R&D dataset for LHC Olympics 2020 anomaly detection challenge, Apr. 2019.
[31] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands, An introduction to PYTHIA 8.2, Comput. Phys. Commun. (2015) 159–177, [arXiv:1410.3012].
[32] DELPHES 3 Collaboration, J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, DELPHES 3, a modular framework for fast simulation of a generic collider experiment, JHEP (2014) 057, [arXiv:1307.6346].
[33] M. Cacciari, G. P. Salam, and G. Soyez, FastJet user manual, Eur. Phys. J. C (2012) 1896, [arXiv:1111.6097].
[34] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson, Parameterized neural networks for high-energy physics, Eur. Phys. J. C (2016), no. 5, 235, [arXiv:1601.07913].
[35] J. Thaler and K. Van Tilburg, Identifying boosted objects with N-subjettiness, JHEP (2011), no. 3, 15.
[36] ATLAS Collaboration, Deep generative models for fast shower simulation in ATLAS, Tech. Rep. ATL-SOFT-PUB-2018-001, CERN, Geneva, Jul. 2018.
[37] T. Cheng, J.-F. Arguin, J. Leissner-Martin, J. Pilette, and T. Golling, Variational autoencoders for anomalous jet tagging, arXiv:2007.01850.
[38] A. Genevay, G. Dulac-Arnold, and J.-P. Vert, Differentiable deep clustering with cluster size constraints, arXiv:1910.09036 (2019).
[39] Y. Ren, N. Wang, M. Li, and Z. Xu, Deep density-based image clustering, 2018.
[40] C. Patil and I. Baidari, Estimating the optimal number of clusters k in a dataset using data depth, Data Science and Engineering (2019), no. 2, 132–140.