Adversarially Learned Anomaly Detection on CMS Open Data: re-discovering the top quark
Oliver Knapp, Guenther Dissertori, Olmo Cerri, Thong Q. Nguyen, Jean-Roch Vlimant, Maurizio Pierini
O. Knapp, G. Dissertori
Institute for Particle Physics and Astrophysics, ETH Zurich, CH-8093 Zurich, Switzerland
O. Cerri, T. Q. Nguyen, J. R. Vlimant
Division of Physics, Mathematics and Astronomy, California Institute of Technology, Pasadena, CA 91125, USA
M. Pierini
Experimental Physics Department, European Organization for Nuclear Research (CERN), CH-1211 Geneva 23, Switzerland

May 5, 2020

ABSTRACT
We apply an Adversarially Learned Anomaly Detection (ALAD) algorithm to the problem of detecting new physics processes in proton-proton collisions at the Large Hadron Collider. Anomaly detection based on ALAD matches the performance reached by Variational Autoencoders, with a substantial improvement in some cases. Training the ALAD algorithm on 4.4 fb⁻¹ of 8 TeV CMS Open Data, we show how a data-driven anomaly detection and characterization would work in real life, re-discovering the top quark by identifying the main features of the tt̄ experimental signature at the LHC.

Keywords Deep Learning · Simulation · Autoencoders · MNIST · Jets
CERN's Large Hadron Collider (LHC) delivers proton-proton collisions in unique experimental conditions. Not only does it accelerate protons to an unprecedented energy (6.5 TeV per proton beam), it also operates at the highest collision frequency, producing one proton-beam crossing (an "event") every 25 ns. By recording the information of every sensor, the two LHC multipurpose detectors, ATLAS and CMS, generate O(1) MB of data for each interesting event. The LHC big data problem consists in the impossibility of storing every event, which would require writing O(10) TB/s. To deal with this limitation, a typical LHC detector is equipped with a real-time event selection system, the trigger.

The need for a trigger system has important consequences downstream. In particular, it naturally shapes the data analysis strategy for many searches for new physics phenomena (new particles, new forces, etc.) into a hypothesis test [1]: one specifies a signal hypothesis upfront (in fact, as upfront as the trigger system) and tests for the presence of the predicted kind of events (the experimental signature) on top of a background from known physics processes, against the alternative hypothesis (i.e., known physics processes alone). From a data-science perspective, this corresponds to a supervised strategy. This procedure has been very successful so far, thanks to well-established signal hypotheses to test (e.g., the existence of the Higgs boson). On the other hand, following this paradigm has so far produced no evidence for new-physics signals. While this is teaching us a lot about our Universe, it also raises questions about the applied methodology. For instance, the amount of information derived from this large number of "unsuccessful" searches has put into question the concept of "natural" new physics models, such as low-scale supersymmetry. Considering that the generally prevailing pre-LHC view of particle physics was based on two pillars (the Higgs boson and low-scale natural supersymmetry), these experimental results are shaping our understanding of microscopic physics as much as, and probably even more than, the discovery of the Higgs boson.

Figure 1: Comparison of deep network architectures. (a) In a GAN, a generator G returns samples G(z) from latent-space points z, while a discriminator D_x tries to distinguish the generated samples G(z) from the real samples x. (b) In an autoencoder, the encoder E compresses the input x to a latent-space point z, while the decoder D provides an estimate D(z) = D(E(x)) of x. (c) A BiGAN is built by adding to a GAN an encoder to learn the z representation of the true x, and using the information both in the real space X and the latent space Z as input to the discriminator. (d) The ALAD model is a BiGAN in which two additional discriminators help converging to a solution which fulfils the cycle-consistency conditions G(E(x)) ≈ x and E(G(z)) ≈ z. The ⊕ symbol in the figure represents a vector concatenation.

Recent works have proposed strategies, mainly based on Machine Learning (ML), to relax the underlying assumptions of a typical experimental analysis [2–12], extending traditional unsupervised searches performed at colliders [13–17]. While many of these works focus on off-line data analysis,
Ref. [3] advocates the need to also perform an on-line event selection with anomaly detection techniques, in order to save a fraction of new physics events even when the underlying new physics scenario was unforeseen (and no dedicated trigger algorithm was put in place). Selected anomalous events could then be visually inspected (as done with the CMS exotica hotline data stream in early LHC runs in 2010-2012) or be given as input to off-line unsupervised analyses, following any of the strategies suggested in the literature.

In this paper, we extend the work of Ref. [3] in two directions: (i) we identify anomalies using an Adversarially Learned Anomaly Detection (ALAD) algorithm [18], which combines the strength of generative adversarial networks [19, 20] with that of autoencoders [21–23]; (ii) we demonstrate how the anomaly detection would work in real life, using the ALAD algorithm to re-discover the top quark. To this purpose we use real LHC data, released by the CMS experiment on the CERN Open Data portal [24]. Our implementation of the ALAD model in TensorFlow [25], derived from the original code of Ref. [18], is available on GitHub [26].

This paper is structured as follows: the ALAD algorithm is described in Section 2. Its performance is assessed in Section 3, repeating the study of Ref. [3]. In Section 4 we use an ALAD algorithm to re-discover the top quark on a fraction of the CMS 2012 Open Data (described in Appendix A). Conclusions are given in Section 5.

The ALAD algorithm is a kind of Generative Adversarial Network (GAN) [19] specifically designed for anomaly detection. The basic idea underlying GANs is that two artificial neural networks compete against each other during training, as shown in Fig. 1. One network, the generator G : Z → X, learns to generate new samples in the data space (e.g., proton-proton collisions in our case), aiming to resemble the samples in the training set. The other network, the discriminator D_x : X → [0, 1], tries to distinguish real samples from generated ones, returning the score of a given sample to be real, as opposed to being generated by G. Both G and D_x are expressed as neural networks, which are trained against each other in a saddle-point problem:

$$\min_G \max_{D_x} \; \mathbb{E}_{x \sim p_X}\left[\log D_x(x)\right] + \mathbb{E}_{z \sim p_Z}\left[\log\left(1 - D_x(G(z))\right)\right]\,, \quad (1)$$

where p_X(x) is the distribution over the data space X and p_Z(z) is the distribution over the latent space Z. The solution to this problem has the property p_X = p_G, where p_G is the distribution induced by the generator [19]. The training typically involves alternating gradient descent on the parameters of G and D_x, maximizing over D_x (treating G as fixed) and minimizing over G (treating D_x as fixed).

Deep learning for anomaly detection [20] is usually discussed in the context of (variational) autoencoders [21, 22]. With autoencoders (cf. Fig. 1), one projects the input x to a point z of a latent space through an encoder network E : X → Z. An approximation D(z) = D(E(x)) of the input is then reconstructed through the decoder network D : Z → X. The intuition is that the decoder D can only reconstruct the input from the latent-space representation z if x ∼ p_X. The reconstruction of an anomalous sample, which belongs to a different distribution, would therefore typically have a higher reconstruction loss. One can then use a metric D_R defining the output-to-input distance (e.g., the one used in the reconstruction loss function) to derive an anomaly score A:

$$A(x) \sim D_R\left(x, D(E(x))\right)\,. \quad (2)$$
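To make the score of Eq. (2) concrete, the following minimal sketch implements it with a toy Keras autoencoder. The layer sizes, the random stand-in training data, and the choice of the L2 distance for D_R are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np
from tensorflow import keras

input_dim, latent_dim = 39, 8  # toy sizes, not tuned values

# Encoder E : X -> Z followed by decoder D : Z -> X in one model.
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(input_dim,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(latent_dim),             # z = E(x)
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(input_dim),              # D(E(x))
])
autoencoder.compile(optimizer="adam", loss="mse")

# Stand-in "background" sample; in the paper this role is played
# by the SM training set.
x_train = np.random.randn(10000, input_dim).astype("float32")
autoencoder.fit(x_train, x_train, epochs=5, batch_size=50, verbose=0)

def anomaly_score(x):
    """Eq. (2): output-to-input distance between x and D(E(x))."""
    reco = autoencoder.predict(x, verbose=0)
    return np.linalg.norm(x - reco, axis=1)  # here D_R is the L2 distance
```

Events with a score above a chosen threshold would be flagged as anomalous; the threshold fixes the fraction of accepted background.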
While this is not directly possible with GANs, since a generated G(z) doesn't correspond to a specific x, several GAN-based solutions have been proposed that would be suitable for anomaly detection, as for instance in Refs. [18, 20, 27–29]. In this work, we focus on the ALAD method [18], built upon the use of bidirectional GANs (BiGAN) [30]. As shown in Fig. 1, a BiGAN model adds an encoder E : X → Z to the GAN construction. This encoder is trained simultaneously with the generator. The saddle-point problem of Eq. (1) is then extended as follows:

$$\min_{G,E} \max_{D_{xz}} V(D_{xz}, E, G) = \min_{G,E} \max_{D_{xz}} \; \mathbb{E}_{x \sim p_X}\left[\log D_{xz}(x, E(x))\right] + \mathbb{E}_{z \sim p_Z}\left[\log\left(1 - D_{xz}(G(z), z)\right)\right]\,, \quad (3)$$

where D_{xz} is a modified discriminator, taking inputs from both X and Z. Provided there is convergence to the global minimum, the solution has the distribution-matching property p_E(x, z) = p_G(x, z), where one defines p_E(x, z) = p_E(z|x) p_X(x) and p_G(x, z) = p_G(x|z) p_Z(z) [30]. To help reaching full convergence, the ALAD model is equipped with two additional discriminators, D_{xx} and D_{zz}. The former, together with the value function

$$V(D_{xx}, E, G) = \mathbb{E}_{x \sim p_X}\left[\log D_{xx}(x, x)\right] + \mathbb{E}_{x \sim p_X}\left[\log\left(1 - D_{xx}(x, G(E(x)))\right)\right]\,, \quad (4)$$

enforces the cycle-consistency condition G(E(x)) ≈ x. The latter is added to further regularize the latent space through a similar value function,

$$V(D_{zz}, E, G) = \mathbb{E}_{z \sim p_Z}\left[\log D_{zz}(z, z)\right] + \mathbb{E}_{z \sim p_Z}\left[\log\left(1 - D_{zz}(z, E(G(z)))\right)\right]\,, \quad (5)$$

enforcing the cycle condition E(G(z)) ≈ z. The ALAD training objective consists in solving

$$\min_{G,E} \max_{D_{xz}, D_{xx}, D_{zz}} V(D_{xz}, E, G) + V(D_{xx}, E, G) + V(D_{zz}, E, G)\,. \quad (6)$$

Having multiple outputs at hand, one can associate the ALAD algorithm with several anomaly-score definitions. Following Ref. [18], we consider the following four anomaly scores:

• A "Logits" score, defined as A_L(x) = log(D_xx(x, G(E(x)))).
• A "Features" score, defined as A_F(x) = ‖f_xx(x, x) − f_xx(x, G(E(x)))‖, where f_xx(·,·) are the activation values in the last hidden layer of D_xx.
• The L1 distance between an input x and its reconstructed output G(E(x)): A_L1(x) = ‖x − G(E(x))‖₁.
• The L2 distance between an input x and its reconstructed output G(E(x)): A_L2(x) = ‖x − G(E(x))‖₂.

We first apply this model to the problem described in Ref. [3], in order to obtain a direct comparison with VAE-based anomaly detection. Then, we apply it to real LHC data (2012 CMS Open Data), showing how anomaly detection could guide physicists to discover and characterize new processes.
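As an illustration of how these four scores can be computed from a trained ALAD model, the sketch below wires up stand-in E, G, and D_xx networks in TensorFlow. All layer sizes are placeholders (the tuned values belong in Table 2), D_xx is built to expose both its output logit and its last hidden layer (the f_xx features), and the norm used for the Features score is an assumption; this is a sketch of the score definitions, not of the full adversarial training.

```python
import tensorflow as tf
from tensorflow import keras

input_dim, latent_dim = 39, 8  # placeholder dimensions

# Stand-in encoder E : X -> Z and generator G : Z -> X.
E = keras.Sequential([keras.layers.Input(shape=(input_dim,)),
                      keras.layers.Dense(64), keras.layers.LeakyReLU(0.2),
                      keras.layers.Dense(latent_dim)])
G = keras.Sequential([keras.layers.Input(shape=(latent_dim,)),
                      keras.layers.Dense(64, activation="relu"),
                      keras.layers.Dense(input_dim)])

# D_xx takes the concatenated pair (x, x_hat) and returns both its
# output logit and its last hidden layer f_xx.
pair_in = keras.layers.Input(shape=(2 * input_dim,))
f_xx = keras.layers.Dense(64)(pair_in)
f_xx = keras.layers.LeakyReLU(0.2)(f_xx)
logit = keras.layers.Dense(1)(f_xx)
D_xx = keras.Model(pair_in, [logit, f_xx])

def alad_scores(x):
    """The four anomaly scores of Ref. [18] for a batch of events x."""
    x = tf.convert_to_tensor(x, tf.float32)
    x_reco = G(E(x))  # cycle G(E(x))
    _, f_real = D_xx(tf.concat([x, x], axis=1))
    logit_reco, f_reco = D_xx(tf.concat([x, x_reco], axis=1))
    a_logits = tf.squeeze(tf.math.log_sigmoid(logit_reco), axis=1)  # log D_xx
    a_feat = tf.norm(f_real - f_reco, axis=1)   # norm choice is an assumption
    a_l1 = tf.norm(x - x_reco, ord=1, axis=1)
    a_l2 = tf.norm(x - x_reco, ord=2, axis=1)
    return a_logits, a_feat, a_l1, a_l2
```

In a trained model, a threshold on any one of these scores defines the anomalous-event selection.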
We consider a sample of simulated LHC collisions, pre-filtered by requiring the presence of a muon with large transverse momentum (p_T), isolated from other particles.

As common in collider physics, we use a Cartesian coordinate system with the z axis oriented along the beam axis, the x axis on the horizontal plane, and the y axis oriented upward. The x and y axes define the transverse plane, while the z axis identifies the longitudinal direction. The azimuth angle φ is computed with respect to the x axis. The polar angle θ is used to compute the pseudorapidity η = −log(tan(θ/2)). The transverse momentum (p_T) is the projection of the particle momentum on the (x, y) plane. We fix units such that c = ℏ = 1.

Proton-proton collision events at a center-of-mass energy √s = 13 TeV are generated with the PYTHIA8 event-generation library [31]. The generated events are further processed with the DELPHES library [32] to model the detector response. Subsequently, the
DELPHES particle-flow (PF) algorithm is applied to obtain a list of reconstructed particles for each event, the so-called PF candidates.

Events are filtered requiring p_T > … GeV and isolation Iso < … (a definition of isolation is provided in Appendix A). The categorical features q_ℓ, IsEle, N_µ and N_e (the lepton charge, the electron/muon flag, and the muon and electron multiplicities, respectively) are represented through one-hot encoding. The other features are standardized to zero median and unit variance. The resulting vector, containing the one-hot encoded and continuous features, has a dimension of 39 and is given as input to the ALAD algorithm.

The sample, available on Zenodo [33], consists of the following Standard Model (SM) processes:

• Inclusive W boson production: W → ℓν, with ℓ = e, µ, τ a charged lepton [34].
• Inclusive Z boson production: Z → ℓℓ [35].
• Multijet production from Quantum Chromodynamic (QCD) interaction [36].
• tt̄ production [37].

A SM cocktail is assembled from a weighted mixture of those four processes, with weights given by the production cross sections. The cocktail's composition is given in Table 1.

Table 1: Composition of the cocktail of SM processes. The first column gives the production cross section for each process. The trigger efficiency is the fraction of events passing a loose selection on the muon p_T, corresponding to an online selection in the trigger system. The acceptance is the fraction of events passing our selection criteria (p_T > … GeV and Iso < …).

Process    Cross section    Trigger efficiency    Acceptance    Fraction
W → ℓν     58               68%                   55.6%         59.2%
QCD        …                …                     …             …
Z → ℓℓ     20               77%                   16%           6.7%
tt̄         …                …                     …             …

As benchmark new-physics signals, we consider the four BSM processes of Ref. [3]:

• A leptoquark with mass 80 GeV, decaying to a b quark and a τ lepton: LQ → bτ [38].
• A neutral scalar boson with mass 50 GeV, decaying to two off-shell Z bosons, each forced to decay to two leptons: A → 4ℓ [39].
• A scalar boson with mass 60 GeV, decaying to two τ leptons: h⁰ → ττ [40].
• A charged scalar boson with mass 60 GeV, decaying to a τ lepton and a neutrino: h± → τν [41].

As a starting point, we consider the ALAD architecture [18] used for the KDD99 dataset, which has dimensionality similar to that of our input feature vector. In this configuration, both the D_xx and D_zz discriminators take as input the concatenation of the two input vectors, which is processed by the network up to a single output node, activated by a sigmoid function. The D_xz discriminator has one dense layer for each of the two inputs. The two intermediate representations are concatenated and passed to another dense layer, and then to a single output node with sigmoid activation, as for the other discriminators. The hidden nodes of the generator are activated by ReLU functions [42], while Leaky ReLU [43] activations are used for all the other nodes. The slope parameter of the Leaky ReLU function is fixed to 0.2. The network is optimized using the Adam [44] minimizer and minibatches of 50 events each. The training is regularized using dropout layers in the three discriminators.

Starting from this baseline architecture, we adjust the hyperparameters one by one, repeating the training while maximizing a figure of merit for anomaly detection efficiency, the positive likelihood ratio LR+. We define the LR+ as the ratio between the BSM signal efficiency, i.e., the true positive rate (TPR), and the SM background efficiency, i.e., the false positive rate (FPR). We perform this exercise using as anomalies the benchmark models described in Ref. [3], looking for a configuration that performs well on all of them. The training is performed on half of the available SM events (3.4M events), leaving the other half of the SM events and the BSM samples for validation. From the resulting anomaly scores, we compute the ROC curve and compare it to the results of the VAE in Ref. [3]. We further quantify the algorithm performance considering the LR+ values corresponding to an FPR of 10⁻⁵.

The optimized architecture, adapted from Ref. [18], is summarized in Table 2. This architecture is used for all subsequent studies. We consider as hyperparameters the number of hidden layers in the five networks, the number of nodes in each hidden layer, and the dimensionality of the latent space, represented in the table by the size of the E output layer.

Table 2: Hyperparameters for the ALAD algorithm. Parameters in bold have been optimized for. No dropout layer is applied wherever a dropout rate is not specified.

Operation                            Units    Activation    Batch norm.    Dropout rate
E(x) — number of hidden layers: …
  Dense                              …        Leaky ReLU    ×              –
  Dense                              …        Leaky ReLU    ×              –
  Output                             …        Linear        ×              –
G(z) — number of hidden layers: …
  Dense                              …        ReLU          ×              –
  Dense                              …        ReLU          ×              –
  Output                             39       Linear        ×              –
D_xz(x, z) — number of hidden layers: …
  Dense (on x only)                  …        Leaky ReLU    ✓              –
  Dense (on z only)                  …        Leaky ReLU    ×              …
  Dense (on concatenated outputs)    …        Leaky ReLU    ×              …
D_xx(x, x̂) — concatenate x and x̂; number of hidden layers: …
  Dense                              …        Leaky ReLU    ×              …
D_zz(z, ẑ) — concatenate z and ẑ; number of hidden layers: …
  Dense                              …        Leaky ReLU    ×              …

Training parameter     Value
Optimizer              Adam (α = 10⁻…, β₁ = 0.…)
Batch size             50
Leaky ReLU slope       0.2
Spectral norm          ✓
Weight, bias init.     Xavier initializer, Constant(0)

Having trained the ALAD on the training dataset, we compute the anomaly scores for the validation samples as well as for the four BSM samples. Figure 2 shows the ROC curves for each BSM benchmark process, for the four considered anomaly scores; the best VAE result from Ref. [3] is also shown for comparison. In the rest of this paper, we use the L1 score as the anomaly score. Similar results would have been obtained using any of the other three anomaly scores. Figure 3 compares the A_L1 distribution for each BSM process with the SM cocktail. One can clearly see that all BSM processes have an increased probability in the high-score region compared to the SM cocktail.
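Given anomaly scores for labelled SM and BSM validation samples, the ROC curve and the LR+ figure of merit used above can be obtained along these lines. The Gaussian toy scores stand in for the actual ALAD outputs, and the 10⁻⁵ working point follows the reference FPR quoted in the text.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy scores standing in for ALAD outputs on validation samples.
rng = np.random.default_rng(0)
scores_sm = rng.normal(0.0, 1.0, 500000)   # SM cocktail (background)
scores_bsm = rng.normal(1.5, 1.0, 50000)   # one BSM benchmark (signal)

y_true = np.concatenate([np.zeros_like(scores_sm), np.ones_like(scores_bsm)])
y_score = np.concatenate([scores_sm, scores_bsm])
fpr, tpr, _ = roc_curve(y_true, y_score)   # FPR = SM eff., TPR = BSM eff.

# Positive likelihood ratio LR+ = TPR / FPR at a reference SM efficiency.
target_fpr = 1e-5
i = np.searchsorted(fpr, target_fpr)       # first point with fpr >= target
print(f"TPR at FPR={target_fpr:g}: {tpr[i]:.3f}, LR+ = {tpr[i] / fpr[i]:.1f}")
```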
We further verified that the anomaly-score distributions obtained on the SM-cocktail training and validation sets are consistent. This test excludes the occurrence of over-training issues.

Figure 2: ROC curves for the ALAD trained on the SM cocktail training set and applied to the SM+BSM validation samples. The VAE curve corresponds to the best result of Ref. [3], shown here for comparison. The other four lines correspond to the different anomaly-score models of the ALAD.

The ALAD algorithm outperforms the VAE by a substantial margin on the A → 4ℓ sample, providing similar performance overall, in particular for FPR ∼ 10⁻⁵, the working point chosen as a reference in Ref. [3]. We verified that the uncertainty on the TPR at fixed FPR, computed with the Agresti–Coull interval [45], is negligible when compared to the observed differences between the ALAD and VAE ROC curves, i.e., the difference is statistically significant.

Figure 3: Distribution of the A_L1 anomaly score. The "SM cocktail" histogram corresponds to the anomaly score of the validation sample. The other four distributions refer to the scores of the four BSM benchmark models.

The left plot in Fig. 4 provides a comparison across the different BSM models. As for the VAE, ALAD performs better on A → 4ℓ and h± → τν than on the other two BSM processes. The right plot in Fig. 4 shows the LR+ values as a function of the FPR. The LR+ peaks at small values of the SM efficiency for all four BSM processes and is basically constant for smaller SM-efficiency values.

Figure 4: Left: ROC curves for each BSM process obtained with the ALAD L1-score model. Right: LR+ curves corresponding to the ROC curves on the left.

In order to test the performance of ALAD on real data, and more generally to show how an anomaly detection technique could guide physicists to a discovery in a data-driven manner, we consider a scenario in which collision data from the LHC are available, but no previous knowledge about the existence of the top quark is at hand. The list of known SM processes contributing to a dataset with one isolated high-p_T lepton would then include W production, Z/γ* production and QCD multijet events, neglecting rarer processes such as diboson production. Top-quark pair production, the "unknown anomalous process", represents a small fraction (below the percent level) of the total dataset.

We consider a fraction of the LHC events collected by the CMS experiment in 2012, corresponding to an integrated luminosity of about 4.4 fb⁻¹ at a center-of-mass energy √s = 8 TeV. For each collision event in the dataset, we compute a vector of physics-motivated high-level features, which we give as input to ALAD for training and inference. Details on the dataset, event content, and data processing can be found in Appendix A.

We select events applying the following requirements (a sketch of this selection logic follows the list):

• At least one lepton with p_T > … GeV and PF isolation Iso < …, within |η| < … .
• At least two jets with p_T > … GeV, within |η| < … .
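A minimal sketch of how such a selection could be applied to an event table follows. The feature layout and every numerical threshold are hypothetical placeholders, since the exact cut values are not reproduced above.

```python
import numpy as np

n_events = 100000
rng = np.random.default_rng(1)

# Hypothetical per-event features: leading lepton plus the four leading jets.
lep_pt = rng.uniform(0, 100, n_events)
lep_iso = rng.uniform(0, 1, n_events)
lep_eta = rng.uniform(-2.5, 2.5, n_events)
jet_pt = rng.uniform(0, 100, (n_events, 4))
jet_eta = rng.uniform(-4.0, 4.0, (n_events, 4))

# Placeholder thresholds standing in for the elided cut values.
LEP_PT_MIN, LEP_ISO_MAX, LEP_ETA_MAX = 25.0, 0.1, 2.1
JET_PT_MIN, JET_ETA_MAX, MIN_JETS = 30.0, 2.4, 2

has_lepton = ((lep_pt > LEP_PT_MIN) & (lep_iso < LEP_ISO_MAX)
              & (np.abs(lep_eta) < LEP_ETA_MAX))
n_good_jets = ((jet_pt > JET_PT_MIN)
               & (np.abs(jet_eta) < JET_ETA_MAX)).sum(axis=1)
selected = has_lepton & (n_good_jets >= MIN_JETS)
print(f"selected fraction: {selected.mean():.3f}")
```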
Figure 5: Left: ROC curves for each anomaly score, where the signal efficiency is the fraction of tt̄ (signal) events passing the anomaly selection, i.e., the true positive rate (TPR), and the background efficiency is the fraction of background events passing the selection, i.e., the false positive rate (FPR). Right: positive likelihood ratio (LR+) curves corresponding to the ROC curves on the left.

This selection is tuned to reduce the expected QCD multijet contamination to a negligible level, which avoids problems with the small size of the available simulated QCD dataset. This selection should not be seen as a limiting factor for the generality of the study: in real life, one would apply multiple sets of selection requirements on different triggers, in order to define multiple datasets on which different anomaly detection analyses would run.

Our goal is to employ ALAD to isolate tt̄ events as due to a rare (and pretended-to-be) unknown process in the selected dataset. Unlike the case discussed in Ref. [3] and Section 3, we are not necessarily interested in a pre-defined algorithm to run in a trigger system. Given an input dataset, we want to isolate its outlier events without explicitly specifying which tail of which kinematic feature one should look at. Because of this, we don't split the data into training and validation datasets; instead, we run the training on the full dataset. The training is performed fixing the ALAD architecture to the values shown in Table 2. In our procedure, we implicitly rely on the assumption that the anomalous events are rare, so that their modeling is not accurately learned by the ALAD algorithm.

In order to evaluate the ALAD performance, we show in Fig. 5 the ROC and LR+ curves on labelled Monte Carlo (MC) simulated data, considering the tt̄ sample as the signal anomaly and SM W and Z production as the background. An event is classified as anomalous whenever the L1 anomaly score is above a given threshold. The threshold is set such that only a small fraction of the events is selected. The anomaly selection results in a factor-20 enhancement of the tt̄ contribution over the SM background, for the anomaly-defining FPR threshold.

Figure 6 shows the distributions of a subset of the input quantities before and after the anomaly selection (see Appendix A): H_T, M_J, N_J, and N_b. These are the quantities that will become relevant in the rest of the discussion. The corresponding distributions for MC simulated events are shown in Fig. 7, where the tt̄ contribution to the selected anomalous data is highlighted. At this stage, we don't attempt a direct comparison of simulated distributions to data, since we couldn't apply the many energy-scale and resolution corrections normally applied in CMS studies. Such a comparison is anyhow not required for our data-driven definition of anomalous events.

In order to quantify the impact of the applied anomaly selection on a given quantity, we consider the bin-by-bin fraction of accepted events. We identify this differential quantity as the anomaly selection transfer function (ASTF). When compared to the expectation from simulated MC data, the dependence of the ASTF values on certain quantities allows one to characterize the nature of the anomaly, indicating in which range of which quantity the anomalies cluster. Figure 8 shows the ASTF for the data and for the simulated SM processes (W and Z production) defining the background.

The comparison between the data and the background ASTFs suggests the presence of an unforeseen class of events, clustering at large numbers of jets. An excess of anomalies is observed at large jet multiplicity, which also induces an excess of anomalies at large values of H_T and M_J. Notably, a large fraction of the anomalous events has jets originating from b quarks. This is the first time that a MC simulation enters our analysis. We stress that this comparison between data and simulation is qualitative. At this stage, we don't need the MC-predicted ASTF values to agree with data in the absence of a signal, since we are not attempting a background estimate like those performed in data-driven searches at the LHC.
Figure 6: Event distributions in data, before and after the anomaly selection: H_T (top left), M_J (top right), N_J (bottom left), and N_b (bottom right), normalized to unit area. A definition of the considered features is given in Appendix A.

For us, it is sufficient to observe qualitative differences in the dependence of the ASTFs on specific features. Nevertheless, a qualitative difference like those we observe in Fig. 8 could still be induced by systematic differences between data and simulation. That would still be an anomaly, but not of the kind that would lead to a discovery. In our case, we do know that the observed discrepancy is too large to be explained by a mismodeling of W and Z events, given the level of accuracy reached by the CMS simulation software in Run I. In a real-life situation, one would have to run further checks to exclude this possibility.

Starting from the ASTF plots of Fig. 8, we define a post-processing selection (PPS) aiming to isolate a subset of anomalous events in which the residual SM background is suppressed. In particular, we require N_J ≥ … and N_b ≥ … . Figure 9 shows the distributions of some of the considered features after the PPS. According to MC expectations, almost all background events should be rejected, and the same should apply to background events in data. Instead, a much larger number of events is observed. This discrepancy points to the presence of an additional process, characterized by many jets, in particular jets coming from b quarks. As a closure test, we verified that the agreement between the observed distributions in data and the expectation from MC simulation is restored once the tt̄ contribution is taken into account.

In summary, the full procedure consists of the following steps (step 4 is illustrated in the sketch below):

1. Define an input dataset with an event representation which is as generic as possible.
2. Train the ALAD (or any other anomaly detection algorithm) on it.
3. Isolate a fraction α of the events, by applying a threshold on the anomaly score.
4. Study the ASTFs of as many quantities as possible, in order to define a post-processing selection that allows one to isolate the anomaly.
5. (Optionally) once a pure sample of anomalies is isolated, a visual inspection of the events could also guide the investigation, as already suggested in Ref. [3].
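The ASTF of step 4 is just a bin-by-bin selection efficiency. A minimal sketch, with toy data in place of the CMS events and an assumed selected fraction, could look as follows.

```python
import numpy as np

def astf(feature, is_anomalous, bins):
    """Bin-by-bin fraction of events passing the anomaly selection
    (the ASTF defined above), with binomial statistical errors."""
    total, _ = np.histogram(feature, bins=bins)
    passed, _ = np.histogram(feature[is_anomalous], bins=bins)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(total > 0, passed / total, np.nan)
        err = np.where(total > 0, np.sqrt(ratio * (1 - ratio) / total), np.nan)
    return ratio, err

# Example: ASTF versus jet multiplicity N_J (toy anomaly score whose mean
# grows with N_J, so the ASTF rises at large jet multiplicity).
n_jets = np.random.poisson(3, 100000)
scores = np.random.randn(100000) + 0.1 * n_jets
threshold = np.quantile(scores, 1 - 1e-3)   # keep a small fraction of events
ratio, err = astf(n_jets, scores > threshold, bins=np.arange(0, 12))
```

Comparing the data ASTF with the same ratio computed on simulated background, bin by bin, reproduces the kind of plot shown in Fig. 8.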
Figure 7: Background event distributions from simulation before (left) and after (right) applying the anomaly selection. From top to bottom: H_T, M_J, N_J, and N_b. Distributions are normalized to an integrated luminosity ∫L dt = 4.4 fb⁻¹.

At this stage, one would gain an intuition about the nature of the new rare process. For instance, one could study the ASTF distribution using as a reference quantity on the x axis the run period at which an event was taken.
Figure 8: ASTF ratios for H_T (top left), M_J (top right), N_J (bottom left), and N_b (bottom right) in data. The filled area shows the same ratio, computed on simulated background data (W and Z events). The uncertainty bands represent the statistical uncertainties on the ASTF ratios.

Anomalies clustering in specific run periods would most likely be due to transient detector malfunctioning or experimental problems of other kinds (e.g., a bug in the reconstruction software). If instead the significant ASTF ratios point to events with a lepton, large jet multiplicity, and an excess of b jets, one might have discovered a new heavy particle decaying to leptons and b jets. In a world with no previous knowledge of the top quark, some bright theorist would explain the anomaly by proposing the existence of a third up-type quark with an unprecedentedly heavy mass.

We presented an application of ALAD to the search for new physics at the LHC. Following the study presented in Ref. [3], we showed how this algorithm matches (and in some cases improves on) the performance obtained with variational autoencoders. The ALAD architecture also offers practical advantages with respect to the VAE of Ref. [3]: besides providing improved performance in some cases, it offers an easier training procedure. Good performance can be achieved using a standard MSE loss, unlike the VAE model of Ref. [3], for which good performance was obtained only after a heavy customization of the loss function.

We trained the ALAD algorithm on a sample of real data, obtained by processing part of the 2012 CMS Open Data. On these events, we showed how one could detect the presence of tt̄ events, a population below the percent level pretended to originate from a previously unknown process. Using the anomaly score of the trained ALAD algorithm, we defined a post-selection procedure that let us isolate an almost pure subset of anomalous events. Furthermore, we presented a strategy based on ASTF distributions to characterize the nature of an observed excess, and we showed its effectiveness on the specific example at hand. Further studies should be carried out to demonstrate the robustness of this strategy for rarer processes.

For the first time, our study shows with real LHC data that anomaly detection techniques can highlight the presence of rare phenomena in a data-driven manner. This result could help promote a new class of data-driven studies in the next run of the LHC, possibly offering additional guidance in the search for new physics.
Figure 9: Data distributions for H_T (top left), M_J (top right), N_J (bottom left), and N_b (bottom right) after the post-processing selection. The filled histograms show the expectation from MC simulation, normalized to an integrated luminosity ∫L dt = 4.4 fb⁻¹, including the tt̄ contribution.

As already stressed in Ref. [3], the promotion of these algorithms to the trigger systems of the LHC experiments could help mitigate the limited bandwidth of the experiments. Unlike Ref. [3], the strategy presented in this paper would also require a substantial fraction of normal events to be saved (through pre-scaled triggers), in order to compute the ASTF ratios with sufficient precision. In this respect, there would be a cost in terms of trigger throughput, which could be compensated by foreseeing dedicated scouting streams [46–48] with reduced event content, tailored to the computation of the ASTF ratios in offline analyses. Putting this strategy in place for the LHC Run III could be an interesting way to extend the physics reach of the LHC experiments.

Acknowledgments
This work was possible thanks to the commitment of the CMS Collaboration to release its data and MC samples through the CERN Open Data portal. We would like to thank our CMS colleagues and the CERN Open Data team for their effort to promote open access to science. In particular, we thank Kati Lassila-Perini for her precious help.

This project is partially supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement no …). Part of this work was performed using "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".

Appendix A: CMS Single Muon Open Data

The re-discovery of the top quark, described in Section 4, is performed on a sample of real LHC collisions, collected by the CMS experiment and released on the CERN Open Data portal [24]. The collision data correspond to the SingleMu dataset in the Run2012B period [49]. This dataset results from the logical OR of many triggers requiring a reconstructed muon. Most of the events were collected by an inclusive isolated-muon trigger, with requirements very similar to those applied as a pre-selection in Section 3 [3].

Once collected, the raw data recorded by the CMS detector were processed by a global event reconstruction algorithm based on Particle Flow (PF) [50]. This processing step gives as output a list of so-called PF candidates (electrons, muons, photons, charged hadrons and neutral hadrons), reconstructed by combining measurements from different detectors to achieve maximal accuracy.

While the ALAD training is performed directly on data, samples from MC simulation are also used in the study, in order to validate the model and its findings on a labelled sample. To this purpose, we considered samples of W+jets [51–53], Z/γ*+jets [54–57], and tt̄ [58] events. No sample of QCD multijet events was considered, since its contribution was found to be negligible after the baseline requirements on the muon and on two additional jets (see Section 4). These MC samples were generated by the CMS Collaboration with different libraries and processed by a full Geant4 simulation [59]. The same reconstruction software used on data was applied to the output of the simulation, so that the same lists of PF candidates are available in this case.

Following a procedure similar to that of Ref. [3], we take as input the lists of PF candidates and compute a set of physics-motivated features, which are used as input to train the ALAD algorithm.

We consider 14 event-related quantities:

• H_T – the scalar sum of the transverse momenta (p_T) of all jets with p_T > … GeV and |η| < … .
• M_J – the invariant mass of all jets entering the H_T sum.
• N_J – the number of jets entering the H_T sum.
• N_b – the number of jets identified as originating from a b quark.
• p_T^miss – the missing transverse momentum, defined as p_T^miss = |−Σ_q p⃗_T^q|, where the sum runs over all PF candidates.
• p_T,TOT^µ – the vector sum of the p_T of all PF muons in the event with p_T > … GeV.
• M_µ – the combined invariant mass of all muons entering the sum in p_T,TOT^µ.
• N_µ – the number of muons entering the sum in p_T,TOT^µ.
• p_T,TOT^e – the vector sum of the p_T of all PF electrons in the event with p_T > … GeV.
• M_e – the combined invariant mass of all electrons entering the sum in p_T,TOT^e.
• N_e – the number of electrons entering the sum in p_T,TOT^e.
• N_neu – the number of neutral-hadron PF candidates.
• N_ch – the number of charged-hadron PF candidates.
• N_γ – the number of photon PF candidates.

We further consider 10 quantities specific to the highest-p_T lepton in the event (a sketch of how a few of these features are computed follows the list):

• p_T^ℓ – the lepton p_T.
• η^ℓ – the lepton pseudorapidity.
• q^ℓ – the lepton charge (either −1 or +1).
• Iso_ch^ℓ – the lepton isolation, defined as the ratio between the scalar sum of the p_T of the other charged PF candidates within an angular distance ΔR = √(Δη² + Δφ²) < … of the lepton, and the lepton p_T.
• Iso_neu^ℓ – same as Iso_ch^ℓ, but with the sum running over all neutral hadrons.
• Iso_γ^ℓ – same as Iso_ch^ℓ, but with the sum running over all photons.
• Iso – the total isolation, Iso = Iso_ch^ℓ + Iso_neu^ℓ + Iso_γ^ℓ.
• M_T – the combined transverse mass of the lepton and E_T^miss system, M_T = √(2 p_T^ℓ E_T^miss (1 − cos Δφ)), with Δφ the azimuthal angle between the lepton and the missing transverse momentum.
• p_T,∥^miss – the component of p_T^miss parallel to the lepton.
• p_T,⊥^miss – the component of p_T^miss orthogonal to the lepton.
• IsEle – a flag set to 1 if the lepton is an electron, 0 if it is a muon.
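To illustrate how a few of these high-level features are built from jet four-momenta and the lepton plus missing-p_T system, here is a small sketch. The jet values and the selection thresholds are made-up placeholders, and the transverse-mass helper follows the M_T formula above.

```python
import numpy as np

# Toy jet collection for one event: columns are (pt, eta, phi, mass).
jets = np.array([[45.0,  0.3,  1.2, 8.0],
                 [38.0, -1.1, -2.0, 6.5],
                 [22.0,  2.0,  0.5, 4.0]])

PT_MIN, ETA_MAX = 30.0, 2.4                  # placeholder jet cuts
sel = (jets[:, 0] > PT_MIN) & (np.abs(jets[:, 1]) < ETA_MAX)
pt, eta, phi, m = jets[sel].T

HT = pt.sum()          # H_T: scalar sum of selected jet pT
NJ = int(sel.sum())    # N_J: number of selected jets

# M_J: invariant mass of the summed jet four-vectors.
px, py, pz = pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)
E = np.sqrt(px**2 + py**2 + pz**2 + m**2)
MJ = np.sqrt(max(E.sum()**2 - px.sum()**2 - py.sum()**2 - pz.sum()**2, 0.0))

def transverse_mass(lep_pt, lep_phi, met, met_phi):
    """M_T of the lepton + missing-pT system, as defined above."""
    return np.sqrt(2 * lep_pt * met * (1 - np.cos(lep_phi - met_phi)))

print(HT, MJ, NJ, transverse_mass(40.0, 0.5, 35.0, -2.0))
```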
References

[1] ATLAS and CMS Collaborations, Procedure for the LHC Higgs boson search combination in Summer 2011, ATL-PHYS-PUB-2011-011, CMS-NOTE-2011-005.
[2] C. Weisser and M. Williams, Machine learning and multivariate goodness of fit.
[3] O. Cerri, T. Q. Nguyen, M. Pierini, M. Spiropulu and J.-R. Vlimant, Variational Autoencoders for New Physics Mining at the Large Hadron Collider, JHEP (2019) 036.
[4] R. T. D'Agnolo and A. Wulzer, Learning New Physics from a Machine, Phys. Rev. D99 (2019) 015014.
[5] A. De Simone and T. Jacques, Guiding New Physics Searches with Unsupervised Learning, Eur. Phys. J. C (2019) 289.
[6] M. Farina, Y. Nakai and D. Shih, Searching for New Physics with Deep Autoencoders, Phys. Rev. D (2020) 075021.
[7] J. H. Collins, K. Howe and B. Nachman, Anomaly Detection for Resonant New Physics with Machine Learning, Phys. Rev. Lett. (2018) 241803.
[8] A. Blance, M. Spannowsky and P. Waite, Adversarially-trained autoencoders for robust unsupervised new physics searches, JHEP (2019) 047.
[9] J. Hajer, Y.-Y. Li, T. Liu and H. Wang, Novelty Detection Meets Collider Physics, Phys. Rev. D (2020) 076015.
[10] T. Heimel, G. Kasieczka, T. Plehn and J. M. Thompson, QCD or What?, SciPost Phys. (2019) 030.
[11] J. H. Collins, K. Howe and B. Nachman, Extending the search for new resonances with machine learning, Phys. Rev. D99 (2019) 014038.
[12] R. T. D'Agnolo, G. Grosso, M. Pierini, A. Wulzer and M. Zanetti, Learning Multivariate New Physics.
[13] H1 Collaboration, F. D. Aaron et al., A General Search for New Phenomena at HERA, Phys. Lett. B674 (2009) 257.
[14] CDF Collaboration, T. Aaltonen et al., Global Search for New Physics with 2.0 fb⁻¹ at CDF, Phys. Rev. D79 (2009) 011101.
[15] D0 Collaboration, V. M. Abazov et al., Model independent search for new phenomena in pp̄ collisions at √s = 1.96 TeV, Phys. Rev. D85 (2012) 092015.
[16] CMS Collaboration, MUSiC, a Model Unspecific Search for New Physics, in pp Collisions at √s = 8 TeV, CMS-PAS-EXO-14-016.
[17] ATLAS Collaboration, M. Aaboud et al., A strategy for a general search for new phenomena using data-derived signal regions and its application within the ATLAS experiment, Eur. Phys. J. C79 (2019) 120.
[18] H. Zenati, M. Romain, C. S. Foo, B. Lecouat and V. R. Chandrasekhar, Adversarially Learned Anomaly Detection, arXiv:1812.02288.
[19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair et al., Generative Adversarial Networks, arXiv:1406.2661.
[20] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth and G. Langs, Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery, arXiv:1703.05921.
[21] D. P. Kingma and M. Welling, Auto-Encoding Variational Bayes, arXiv e-prints (2013).
[22] J. An and S. Cho, Variational autoencoder based anomaly detection using reconstruction probability, Special Lecture on IE (2015) 1.
[23] C. Zhou and R. C. Paffenroth, Anomaly Detection with Robust Deep Autoencoders, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 665–674, Association for Computing Machinery, 2017.
[24] CMS Collaboration, Open Data Repository.
[25] M. Abadi et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, http://tensorflow.org/.
[26] O. Knapp, Adversarially Learned Anomaly Detection for the LHC, GitHub.
[27] A. Creswell and A. A. Bharath, Inverting The Generator Of A Generative Adversarial Network, CoRR abs/1611.05644 (2016).
[28] Y. Wu, Y. Burda, R. Salakhutdinov and R. B. Grosse, On the Quantitative Analysis of Decoder-Based Generative Models, CoRR abs/1611.04273 (2016).
[29] E. Gherbi, B. Hanczar, J.-C. Janodet and W. Klaudel, An Encoding Adversarial Network for Anomaly Detection, in vol. 101 of JMLR: Workshop and Conference Proceedings, Nagoya, Japan, pp. 1–16, Nov. 2019, https://hal.archives-ouvertes.fr/hal-02421274.
[30] J. Donahue, P. Krähenbühl and T. Darrell, Adversarial Feature Learning, arXiv:1605.09782.
[31] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten et al., An Introduction to PYTHIA 8.2, Comput. Phys. Commun. 191 (2015) 159.
[32] DELPHES 3 Collaboration, J. de Favereau et al., DELPHES 3, a modular framework for fast simulation of a generic collider experiment, JHEP 02 (2014) 057.
[33] M. Pierini et al., mPP Zenodo community.
[34] O. Cerri, T. Nguyen, M. Pierini and J.-R. Vlimant, New Physics Mining at the Large Hadron Collider: W → ℓν.
[35] O. Cerri, T. Nguyen, M. Pierini and J.-R. Vlimant, New Physics Mining at the Large Hadron Collider: Z → ℓℓ.
[36] O. Cerri, T. Nguyen, M. Pierini and J.-R. Vlimant, New Physics Mining at the Large Hadron Collider: QCD multijet production.
[37] O. Cerri, T. Nguyen, M. Pierini and J.-R. Vlimant, New Physics Mining at the Large Hadron Collider: tt̄ production.
[38] O. Cerri, T. Nguyen, M. Pierini and J.-R. Vlimant, New Physics Mining at the Large Hadron Collider: LQ → bτ.
[39] O. Cerri, T. Nguyen, M. Pierini and J.-R. Vlimant, New Physics Mining at the Large Hadron Collider: A → 4ℓ.
[40] O. Cerri, T. Nguyen, M. Pierini and J.-R. Vlimant, New Physics Mining at the Large Hadron Collider: h⁰ → ττ.
[41] O. Cerri, T. Nguyen, M. Pierini and J.-R. Vlimant, New Physics Mining at the Large Hadron Collider: h± → τν.
[42] V. Nair and G. E. Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, in ICML, 2010.
[43] A. L. Maas, A. Y. Hannun and A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[44] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, in ICLR, 2015, http://arxiv.org/abs/1412.6980.
[45] A. Agresti and B. A. Coull, Approximate Is Better than "Exact" for Interval Estimation of Binomial Proportions, The American Statistician (1998).
[46] D. Anderson, Data Scouting in CMS, PoS ICHEP2016 (2016) 190.
[47] S. Mukherjee, Data Scouting: A New Trigger Paradigm, 2017.
[48] J. Duarte, Fast Reconstruction and Data Scouting, 2018.
[49] CMS Collaboration, /SingleMu/Run2012B-22Jan2013-v1/AOD.
[50] CMS Collaboration, A. Sirunyan et al., Particle-flow reconstruction and global event description with the CMS detector, JINST 12 (2017) P10003.
[51] CMS Collaboration, /W1JetsToLNu_TuneZ2Star_8TeV-madgraph/Summer12_DR53X-PU_S10_START53_V19-v1/AODSIM.
[52] CMS Collaboration, /W2JetsToLNu_TuneZ2Star_8TeV-madgraph/Summer12_DR53X-PU_S10_START53_V19-v1/AODSIM.
[53] CMS Collaboration, /W3JetsToLNu_TuneZ2Star_8TeV-madgraph/Summer12_DR53X-PU_S10_START53_V19-v1/AODSIM.
[54] CMS Collaboration, /DY1JetsToLL_M-50_8TeV_ext-madgraph/Summer12_DR53X-PU_S10_START53_V19-v1/AODSIM.
[55] CMS Collaboration, /DY2JetsToLL_M-50_TuneZ2Star_8TeV-madgraph/Summer12_DR53X-PU_RD1_START53_V7N-v1/AODSIM.
[56] CMS Collaboration, /DY3JetsToLL_M-50_TuneZ2Star_8TeV-madgraph/Summer12_DR53X-PU_RD1_START53_V7N-v1/AODSIM.
[57] CMS Collaboration, /DY4JetsToLL_M-50_TuneZ2Star_8TeV-madgraph/Summer12_DR53X-PU_RD1_START53_V7N-v1/AODSIM.
[58] CMS Collaboration, /TTJets_MSDecays_central_TuneZ2star_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V19-v1/AODSIM.
[59] GEANT4 Collaboration, S. Agostinelli et al., GEANT4: A Simulation toolkit, Nucl. Instrum. Meth. A506 (2003) 250.