Detector monitoring with artificial neural networks at the CMS experiment at the CERN Large Hadron Collider
Adrian Alan Pol, Gianluca Cerminara, Cecile Germain, Maurizio Pierini, Agrima Seth
Abstract
Reliable data quality monitoring is a key asset in delivering collision data suitable for physics analysis in any modern large-scale High Energy Physics experiment. This paper focuses on the use of artificial neural networks for supervised and semi-supervised problems related to the identification of anomalies in the data collected by the CMS muon detectors. We use deep neural networks to analyze LHC collision data, represented as images organized geographically. We train a classifier capable of detecting the known anomalous behaviors with unprecedented efficiency, and explore the usage of convolutional autoencoders to extend anomaly detection capabilities to unforeseen failure modes. A generalization of this strategy could pave the way to the automation of the data quality assessment process for present and future high-energy physics experiments.
Keywords
High Energy Physics · Large Hadron Collider · Compact Muon Solenoid · Machine Learning · Data Quality Monitoring · Artificial Neural Networks
Adrian Alan Pol: Université Paris-Saclay, CERN. E-mail: [email protected]
Gianluca Cerminara: CERN
Cecile Germain: Université Paris-Saclay
Maurizio Pierini: CERN
Agrima Seth: CERN

The Compact Muon Solenoid (CMS) experiment is a general purpose particle physics detector operating at the CERN Large Hadron Collider [1] (LHC). Data collected with the CMS detector are used in many aspects of modern particle physics, notably the discovery [2] and characterization [3] of the Higgs boson.

The CMS detector is described in detail in [4], together with a definition of the coordinate system used and the relevant kinematic variables. In CMS, muons are measured with detection planes instrumented with four detector technologies: drift tubes (DTs), cathode strip chambers, resistive plate chambers, and gas electron multipliers. A detailed description of the CMS muon detectors can be found in [5].

Within the CMS Collaboration, physics analyses are performed on good data, selected by imposing stringent quality criteria. During data taking, a subset of the collected statistics is processed in real time to create a set of histograms filled with certain critical quantities. Statistical tests are performed to compare these histograms to a set of predefined references, representing the typical detector response during normal operation conditions. Using the histogram comparison and the outcome of the tests, expert shifters acknowledge the alarms and may decide to intervene (up to stopping the data taking), depending on their evaluation of the problem severity. Knowledge of the LHC running conditions and of the history of issues identified in the past are key ingredients in this decision process. Details on the infrastructure used for this Data Quality Monitoring (DQM) are given in [6]. The two main domains of the monitoring chain are:

– online monitoring, which provides live feedback on the quality of the data while they are being acquired, allowing the operator crew to react to unforeseen issues identified by the monitoring application;
– offline monitoring, designed to certify the quality of the data collected and stored on disk using centralized processing (referred to as the event reconstruction, which converts detector hits into a list of detected particles, each associated with an energy and direction).

These two validation steps differ in three main aspects:

– the latency of the evaluation process: online monitoring is requested to identify anomalies in quasi real time to allow the operators to intervene promptly, while the offline procedure has a typical timescale of several days;
– the fraction of the data they have access to: online processing runs at a rate of 100 Hz, corresponding to approximately 0.1% of the data written to disk for analysis, while the offline processing takes as input the full set of events accepted by the trigger system;
– the granularity of the monitored detector components: while offline monitoring requires identifying only the overall status of the detector components, online monitoring should determine faulty subdetector elements.

Despite their specific characteristics, these two steps rely on the same anomaly detection strategy: the scrutiny of a long list of predefined histograms, selected to detect a set of known failure modes. These histograms are monitored by detector experts, who compare each distribution to a corresponding reference, derived from good data in line with predetermined validation guidelines.

This two-layer monitoring protocol was adopted by the CMS Collaboration for LHC Run I (2010-2012) and Run II (2015-2018). The ever increasing detector complexity, the monitoring data volumes, and the necessity to cope with different LHC running scenarios call for an increasing level of automation in the future. Already, the amount of histograms to monitor is challenging for a single shifter, and the number of histograms to monitor increases every time a new failure mode is identified and consequently added to the list of known potential problems. Furthermore, human intervention and the currently implemented tests require collecting a substantial amount of data, implying a detection delay. Last but not least, the cost in terms of human resources is substantial, i.e. the 24/7 DQM shifter and the expert personnel responsible for updating the good data references and related instructions. We believe that introducing machine learning into the CMS DQM process will help with these challenges.

This work focuses on the online monitoring. We concentrate on the application of deep learning techniques, and specifically image-like processing [7], for the automation of detector-level monitoring.
While the main focus of this work is on improving detection specificity and sensitivity, the proposed approach could come with practical advantages in operation, being based on less stringent assumptions on the nature of the anomalies. As a concrete example, we use real data recorded by the CMS DT chambers of the muon spectrometer during the data-taking campaign of LHC Run II. The main aspects of this work are:

– we exploit the geographical information of the detector, assessing the (mis)behavior with high granularity and then combining the results to probe different detector components;
– we detect different types of anomalies affecting the detector at different scales (ranging from a few channels to collective behaviors of big portions of the DT system) by combining different algorithms;
– we show that image-like processing achieves considerably better performance with respect to the current threshold-based DQM test (later called the production algorithm) and allows tuning the working point in terms of specificity (depending on the deployment strategy).

Although the experimental demonstration of the results presented in this paper is tied to the specificities of the DT subdetector, the procedure that we discuss has a potentially broader application scope, mainly because the typical issues encountered with other subdetectors are analogous. This possibility is currently under investigation for other detector components of the CMS experiment.

The remainder of the paper is organized as follows. Sections 2 and 3 present in more detail the problem that we want to solve and describe the utilized data set. Section 4 reviews the current state of the art in the machine learning domain of failure detection. Sections 5, 6 and 7 present three complementary approaches to the problem. Section 8 describes and discusses the results.

An illustration of the internal structure of a DT chamber is shown in Fig. 1. Each chamber is composed of layers of drift tubes; the readout of each tube is referred to as a channel in the rest of the paper.
Fig. 1
Schematic view of one DT chamber showing the position and orientation of the tubes. From [8].

Fig. 2
Magnified view of the CMS detector showing the wheel structure. The muon chambers are represented by the white volumes, while the red volumes represent the iron return yoke.

Particles carrying an electric charge and traversing a tube release an electronic signal by ionizing the gas in the tube (a hit). By combining the information provided by the channels, one can determine the trajectory of the particle crossing the chamber.

The chamber numbering schema follows that of the iron of the yoke, consisting of five wheels (see Fig. 2) along the z-axis, each one divided into 12 azimuthal sectors (see Fig. 3). The wheels are numbered from −2 to +2 along the z-axis, with wheel 0 situated in the central region around the proton-proton collision point. The sector numbering is assigned in an anti-clockwise sense when looking at the detector from the positive z-axis, starting from the vertically-oriented sector on the positive-x side in the CMS coordinate system (sector 1). Chambers are arranged in four stations at different radii, named MB1, MB2, MB3 and MB4. The first and the fourth stations are mounted on the inner and outer face of the yoke respectively; the remaining two are located in slots within the iron. Each station consists of 12 chambers (one per sector), except for MB4 (which contains 14 chambers). The total number of chambers is then 5 × (3 × 12 + 14) = 250.
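The chamber count can be cross-checked by enumerating the (wheel, station, sector) positions described above. The sketch below is purely illustrative and does not reproduce the official CMS chamber numbering scheme:

```python
# Enumerate CMS DT chambers by (wheel, station, sector).
# Stations MB1-MB3 have 12 chambers per wheel (one per sector);
# MB4 has 14 chambers per wheel, as stated in the text.
def dt_chambers():
    chambers = []
    for wheel in range(-2, 3):          # five wheels: -2 .. +2
        for station in (1, 2, 3, 4):
            n_chambers = 14 if station == 4 else 12
            for sector in range(1, n_chambers + 1):
                chambers.append((wheel, station, sector))
    return chambers

print(len(dt_chambers()))  # 5 * (3 * 12 + 14) = 250
```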
Fig. 3
Numbering schema of the sectors and stations of DT chambers in one wheel. From [8].

CMS data taking is organized in periods (called runs in CMS jargon), corresponding to a given setup both of the CMS detector and of the LHC accelerator. Runs are denoted by integers, increasing with time. Their duration varies from as little as a few seconds to as much as several hours.

Each run is divided into luminosity sections (LSs), a time interval corresponding to a fixed number of proton-beam orbits in the LHC and amounting to approximately 23 seconds, numbered progressively from 1 at the start of each run. Each LS can be identified uniquely by specifying the LS number and the run number. The beam intensity (also referred to as luminosity) varies along each run, resulting in a varying number of proton-proton collision data (the events).

For each chamber k in a given run, the current DQM infrastructure [9] records an occupancy matrix C^k, which contains the total number of particle hits at each channel for a given LS or set of consecutive LSs. The occupancy matrix can be viewed as a varying-size two-dimensional array organized along layer (row) and channel (column) indices:

C^k = { x^k_{i,j} ; 1 ≤ i ≤ l, 0 ≤ j < n_i },

where l = 12 is the number of layers and n_i is the number of channels in layer i. In general, we label the chambers and their components as C^k and x^k_{i,j}. For simplicity, we omit the k index when discussing problems related to individual chambers, until Section 6. Figure 4 shows examples of occupancy matrices, represented as two-dimensional occupancy plots. The utilized data set consists of 21000 occupancy matrices for the 250 chambers.
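A toy version of the ragged occupancy-matrix structure can be sketched as follows (illustrative only; real matrices come from the DQM infrastructure, and the Poisson hit counts here are a stand-in for recorded data):

```python
import numpy as np

def make_occupancy_matrix(layer_sizes, rng=None):
    """Build a toy occupancy matrix C_k as a list of rows of varying
    length: row i holds the hit counts of the n_i channels of layer i."""
    rng = rng or np.random.default_rng(0)
    return [rng.poisson(100.0, size=n).astype(float) for n in layer_sizes]

# l = 12 layers; the number of channels per layer varies between 47 and 96
C = make_occupancy_matrix([47, 58, 60, 72, 96, 50, 51, 62, 64, 70, 80, 90])
```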
Fig. 4
Example of visualization of input data for three DT chambers. The data in (A) manifest the expected behavior in spite of having a dead channel in layer 1. The production algorithm regards this instance as non-problematic. The chamber shown in plot (B) instead shows regions of low occupancy across the 12 layers and should be classified as faulty. According to the run log, this effect was induced by a transient problem with the detector electronics. (C) suffers from a region in layer 1 with lower efficiency, which should be identified as anomalous. The production algorithm classifies the chamber in (B) as anomalous. However, it is not sensitive enough to flag the chamber in (C).

– Local: data collected in each layer are treated independently from the others. As for the production algorithm, this approach regards chambers which have occupancy of hits with small variance between neighboring channels as expected behavior, and targets a well known list of problems with a supervised approach. Chambers which have dead, inefficient or noisy regions are considered problematic. We explore this approach in Section 5.
– Regional: we extend the local approach to account for intra-chamber problems, to be applied whenever faults are spotted only when the information about all layers within one chamber is present. For this purpose, we simultaneously consider all layers in a chamber, but each chamber is considered independently from the others (Section 6).
– Global: we simultaneously use the information of all the chambers for a given run. The position of the chamber in the CMS detector (uniquely determined by the wheel, station, and sector numbers) impacts the expected occupancy distribution of the channel hits. This approach is described in Section 7.

3.3 Preprocessing

A common data set preprocessing procedure is used for the three studies (for visual interpretation, see Fig. 5).
– Standardization of the chamber data: the number of channels in a layer varies not only within the chamber but also depends on the chamber position in the detector. This quantity falls between 47 and 96. We enforce a fixed input dimensionality by applying a layer-by-layer one-dimensional linear interpolation to match the size n_s of the smallest layer in the data set. The smallest layer is chosen to simplify our models later in this study. Starting from the recorded matrix x_{i,j}, a standardized matrix x̃_{i,j} is defined as:

x̃_{i,j} = frac(α) (x_{i,⌈α⌉} − x_{i,⌊α⌋}) + x_{i,⌊α⌋},

where α is an interpolation point, defined by α = j n_i / n_s, and frac(α) is its fractional part. We verified that this method does not compromise sensitivity to very small problematic regions, despite a small reduction in the amplitude and sharpness of the anomalies.
– Smoothing: according to CMS DT experts, misbehaving channels are problematic only when a spatially contiguous cluster of them is observed; isolated misbehaving channels are not considered a problem. To take this into account, a one-dimensional median filter is applied:

x̂_{i,j} = med(x_{i,j}, x_{i,j+1}, x_{i,j+2}).

– Normalization: the occupancy of the chambers in the input data set depends on the integration time and on the LHC beam configuration and intensity, i.e. on the number of LSs spanned when creating the image and the corresponding luminosity. The normalization strategy depends on the need for comparing data across chambers or across runs: the precise procedure used in the two approaches is described in Sections 5 and 6, respectively.
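The standardization and smoothing steps above can be sketched as follows. This is a rough numpy reading of the formulas, not the production code: `np.interp` stands in for the layer-by-layer linear interpolation, and the edge padding at the end of the layer is our own choice:

```python
import numpy as np

def standardize_layer(layer, n_s=47):
    """Resample a layer of n_i channel counts to the smallest layer
    size n_s with one-dimensional linear interpolation, at the
    points alpha = j * n_i / n_s, j = 0 .. n_s - 1."""
    n_i = len(layer)
    alpha = np.arange(n_s) * n_i / n_s
    return np.interp(alpha, np.arange(n_i), layer)

def smooth_layer(layer):
    """Width-3 median filter med(x_j, x_{j+1}, x_{j+2}): isolated
    misbehaving channels are discarded, contiguous clusters survive."""
    padded = np.pad(layer, (0, 2), mode="edge")
    return np.median(np.stack([padded[:-2], padded[1:-1], padded[2:]]), axis=0)
```

A single dead channel in an otherwise flat layer is removed by `smooth_layer`, while a cluster of two or more contiguous dead channels survives the filter.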
In this section we briefly discuss machine learning anomaly detection techniques in light of both the operational conditions and the a priori knowledge of the data. Machine learning presents several advantages over the currently adopted procedure, as the decision function can be learned from collected data. In the future, it might be possible to bypass human intervention when the algorithm decision is not controversial and only invoke the shifters' opinion for intermediate, questionable cases. An example of this approach is discussed in [10] in the context of the CMS offline monitoring. The high data dimensionality precludes simple parametric density estimation of the normal behavior. This leaves an extremely wide range of methods, such as one-class Support Vector Machine (µ-SVM) [11], Isolation Forest [12,13] and different flavors of deep learning. For a general survey see [14].

Anomaly detection techniques usually assume rarity of abnormal events (considered as outliers with respect to the normal generating process) and/or lack of a complete set of representative examples of all possible behaviors. If such representative examples are available, anomaly detection reduces to binary classification (supervised learning), possibly with the help of various resampling methods [15] or reformulation of the objective function [16] for dealing with class imbalance.

In our case, supervised learning is clearly a valid option, as specific anomalous scenarios were extensively studied. The CMS DQM framework keeps copious archives of subdetector-specific quality-related quantities, e.g.
Fig. 5
Example of two kinds of input sample preprocessing. (A) acquired (raw) values, (B) standardizing each layer directly from raw values using linear interpolation, (C) smoothing the raw values with a median filter, (D) standardizing each layer from smoothed data. In (C), the isolated low-occupancy spot in layer 1, corresponding to a dead channel, is discarded.

the DT occupancy plots. Moreover, the imbalance between good and bad data is not extreme, with a typical rate of anomalies reaching the 10% level. These anomalies are then frequent enough for a sizable set of them to be used for supervised training.

However, there is a good motivation for a semi-supervised anomaly detection approach. Besides the deep learning methods that will be discussed at the end of this section, we experiment with two reference methods, which are variants of one-class classification: µ-SVM and Isolation Forest. µ-SVM estimates the support of the data distribution by a non-linear (kernel) transform of the data space (as in all SVM techniques) and by identifying the hyperplane that maximizes the separation of the training data from the origin. Accordingly, µ-SVM has the important property of being a novelty detection algorithm: once trained, it is not sensitive to the frequency of anomalies. However, the implicit prior of kernel-based classification is that the function to be learned is smooth, such that generalization can be achieved by local interpolation between neighboring training examples. As argued at length by Bengio et al. (for instance in [17] and [18]), this assumption is questionable for high data dimensionality. An alternative is the Isolation Forest, which copes with the curse of dimensionality by relying only on the principle of isolation of outliers in a random recursive partitioning of the feature space along the axes and on tree ensembles.
The Isolation Forest algorithm does not rely on any distance or density measure, but assumes that anomalies can be isolated in the native feature space. Besides being highly scalable to large data sets, Isolation Forest offers some possibility of interpretation.

Classical fully unsupervised approaches based on neighborhood (e.g. k nearest neighbors), topological density estimation (Local Outlier Factor and its variants) or clustering (for a detailed presentation, see [19]) are not relevant here. These algorithms have quadratic complexity and perform poorly in high dimensions because of data sparsity (in high dimensions, all pairs of points become almost equidistant) [20]. Moreover, a simple geometric (e.g. Euclidean) distance in the feature space does not define a similarity metric. For instance, the distance between examples A and B in Fig. 4 is dominated by the contribution of well-behaving channels. The similarity function, or equivalently the adequate representation, must be learned from the data.

This representation learning view [18] points towards deep learning, as it should remain sensitive to the local geometric relationships in the data related to the underlying apparatus. Convolutional networks [21] integrate the basic knowledge of merely the topological structure of the input dimensions and learn the optimal filters that minimize the objective error.

A more ambitious goal is to extract an explanatory representation of the anomalies with latent variables, in a probabilistic framework (e.g. restricted Boltzmann machines, or variational autoencoders [22]), where the learned representation is the posterior distribution of the latent variables given an observed input.
However, even the inference step with these representations may suffer from high computational cost, and requires some further feature construction. The alternative is to trade off interpretability and simplicity by learning a direct encoding, typically as a neural network based autoencoder, which is a parametric map from inputs to their representation. Although it has been argued that, even for basic neural networks, most of the training is devoted to learning a compressed representation [23,24], autoencoders are particularly suitable for anomaly detection. When trained on the inliers, testing on unseen faulty samples tends to yield sub-optimal representations, indicating that a sample is likely generated by a different process. In order to go beyond simple dimensionality reduction while preventing over-fitting, various flavors of regularization are proposed (the literature being considerable, we give only some entry points):

– sparse autoencoders [25] penalize the output of the hidden unit activations or the bias;
– denoising autoencoders [26] robustify the mapping by requiring it to be insensitive to small random perturbations;
– contractive autoencoders [27] pursue the same goal, by penalizing analytically the sensitivity of learned features, in a data-driven interpretation of the Tangent Propagation algorithm [28].

In fact, denoising and contractive autoencoders learn density models implicitly, through the estimation of statistics or through a generative procedure [29].

Channels can stop counting hits (called dead channels in detector jargon), have lower detection efficiency (hence lower hit counts with respect to the neighboring ones in the same layer), or be dominated by electronic noise (called noisy channels). These are by far the most frequent failure modes. The local approach can be considered as an initial benchmark comparing fully supervised, semi-supervised and unsupervised methods, and specific algorithms in each category, before embarking in full-fledged anomaly detection.
Moreover, the local approach, if successful, can be further exploited as a preprocessing step for filtering these trivial faults before attempting to detect more elusive ones.

Given the locality restriction of this approach, contextual information is not accessible. As a consequence, a model based on this strategy will not be able to spot, for example, a faulty layer in which occupancy is decreased uniformly with respect to neighboring layers. We acknowledge this limitation and address it in Section 6.

5.2 Data set and methods

After applying the standardization procedure (see Section 3.3), a layer is represented as a single row of an occupancy matrix: X_i = (x̃_{i,0}, x̃_{i,1}, ..., x̃_{i,n_s−1}). The available data set consists of 21000 chambers, corresponding to 228480 individual layers.

Hit counts in a layer are normalized to a [0,
1] range, dividing them by the maximum occupancy value in the layer: ẋ_{i,j} = x̃_{i,j} / max(X_i). The need for normalization comes from the intrinsic variation of the occupancy, which depends on the spatial position of the chamber (as described in more detail in Section 7) and on the integration time of the analyzed image.

In this experiment, we compare the performances of the following:

– unsupervised learning, with (a) a simple statistical indicator, the variance within the layer, and (b) an image processing technique, namely the maximum value of the vector obtained by applying a variant of an edge-detection Sobel filter [30]: S_i = max(k ∗ X_i), where k is the one-dimensional filter kernel;
– semi-supervised learning, with (c) Isolation Forest, and (d) µ-SVM;
– supervised learning, with (e) a fully connected shallow neural network (SNN), and (f) a convolutional neural network (CNN).

The ground truth is established by field experts on a random subset of the data set, by visually inspecting the input samples before any preprocessing: 5668 layers were labeled as good and 612 as bad. The 9.75% fault rate is a faithful representation of the real problem at hand. With this ratio, both anomaly and outlier detection approaches can be considered. Out of this set, 1134 good and 123 bad examples are reserved to compose the test set, corresponding to 20% of the labeled layers. The remaining examples are used for training and validation of the semi-supervised and supervised methods.

The Isolation Forest and µ-SVM models are cross-validated using five stratified data set folds to search for their corresponding optimal hyper-parameters. Subsequently, the Isolation Forest is retrained using those hyper-parameters (100 base estimators in the ensemble) on the full unlabeled data set, while the µ-SVM (RBF
Fig. 6
Architecture of the convolutional neural network model used to target the local strategy: a 47×1 input, 3×1 convolutions producing 10@45×1 feature maps, 5×1 max pooling down to 10@9×1 feature maps, a flattening to 90 hidden units, a fully connected layer of 8 hidden units, and the fully connected output layer.
kernel, γ of 0.1, and the cross-validated ν) is retrained using only negative-class examples. The architecture of the CNN model with one-dimensional convolution layers used for this problem is shown in Fig. 6. Rectified linear units are chosen as activation functions for inner-layer nodes, while the softmax function is used for the output nodes. The model is trained using the Adam [31] optimizer and an early stopping mechanism monitoring the validation set (20% of the data set), with patience set to 32 epochs. The model is implemented in Keras [32], using TensorFlow [33] as a backend.

The SNN model consists of one hidden fully-connected layer with 16 units (chosen to approximately match the number of parameters in the CNN). As for the CNN, it uses the rectified linear unit as the activation function of the hidden nodes, and the softmax function for the output nodes. This model is primarily introduced to obtain a term of comparison for the CNN.

Unlike what was done for the other models, we do not apply the smoothing preprocessing step described in Section 3.3 for the CNN and SNN models, in order to allow them to learn their filters. Additionally, we weight our negative (S_0) and positive (S_1) samples to account for class imbalance. The weight λ_ψ for a sample in class ψ ∈ {0, 1} is defined by λ_ψ = |S| / (2 |S_ψ|), where S = S_0 ∪ S_1. We discuss the results in Section 8.1.
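The class weighting can be sketched as below. This is our reconstruction, assuming the usual |S| / (2 |S_ψ|) balancing convention, under which each class contributes equally to the loss:

```python
import numpy as np

def class_weights(labels):
    """Weight lambda_psi = |S| / (2 |S_psi|) for classes psi in {0, 1}:
    the rarer class gets the larger per-sample weight."""
    labels = np.asarray(labels)
    total = len(labels)
    return {psi: total / (2 * np.sum(labels == psi)) for psi in (0, 1)}

# e.g. a 9:1 good/bad imbalance over 100 samples
w = class_weights([0] * 90 + [1] * 10)  # w[1] == 5.0
```

With these weights, the total weighted contribution of the two classes is identical (90 × w[0] = 10 × w[1]).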
For the regional approach, the ∼500 labeled images do not provide a sufficiently large training set. Thus we start with a much larger data set (all the unlabeled samples). We solve the labeling problem using the score of the convolutional model presented in Section 5 as an approximation of the ground truth. For this, we choose a working point with 99% true positive rate (to guarantee a large data set size) and 5% false positive rate. Approximately 90% of the collected data are labeled as good by online and offline monitoring experts. We then estimate the residual bad-example contamination to be small. We standardize C^k within each chamber, obtaining matrices of shape 12 ×
46.

Fig. 7
Example of the impact on hit counting of different voltages applied to layer 9. (A) shows the occupancy map when operating the layer at 3200 V and (B) the effect of operating at 3450 V. Both examples should be regarded as anomalies. Since the values in both cases are not equal to zero, the production algorithm considers those cases as non-problematic.

The occupancy of hits within one chamber is normalized using a min-max scaler:

Ċ^k = (C̃^k − min(C̃^k)) / (max(C̃^k) − min(C̃^k)).

This normalizes the values to the [0,
1] range, while retaining the information about the relative occupancy between the layers.

In order to evaluate the model, we use the only labeled set for the class of anomalies that we want to tackle: a subset of the data (runs 302634 and 304737-304740), during which layer 9 of some chambers was operating at a voltage lower than the nominal one (see Fig. 7). In particular, the voltage was set to 3450 V in runs 304737-304740 and to 3200 V in run 302634, while the standard operation point is 3550 or 3600 V depending on the chamber. These settings result in an absolute difference in hit counting, more pronounced for the lower voltage setting, because of the physics of gas ionization by radiation. The chambers where all layers operate at nominal conditions are considered as good in the test.

In this experiment, the following semi-supervised approaches are considered:

– a simple bottleneck autoencoder with a representation layer of 20 units;
– a convolutional autoencoder;
– a denoising autoencoder, in which we add artificial noise to the training samples;
– an autoencoder with kernel L1 sparsity regularization in the hidden layers.
Fig. 8
Convolutional (A) and simple/denoising/sparse (B) autoencoder model architectures used to target the regional strategy. In (A): a 46×12 input, 4×4 convolutions giving 4@46×12 feature maps, 4×4 average pooling to 4@12×3 feature maps, a flattening to 144 hidden units, a 20-unit bottleneck, and the mirrored decoder (144 hidden units, reshape to 4@12×3, 4×4 upsampling to 4@46×12, 3×1 convolutions, 46×12 output). In (B): the 46×12 input is flattened to 552 units, followed by fully connected layers of 50 and 20 units and the mirrored decoder (50 and 552 units, reshaped to the 46×12 output).
Similarly to the local approach, we train the autoencoders using the Adam optimizer. An early stopping mechanism with the patience set to 32 epochs is adopted, monitoring the validation set (20% of the total data set). All models are implemented using the Keras library with TensorFlow as a backend. The architectures of the models are shown in Fig. 8A and 8B for, respectively, the convolutional autoencoder and the other three models (for which a common architecture is adopted). The bottleneck architecture is kept for both the denoising and sparse autoencoders in order to limit the number of parameters to train. The parametric rectified linear unit is used as the activation function on the hidden layers, while the output layer uses the sigmoid function. All models are trained to minimize the mean squared error (MSE) ε between the original, ẋ, and reconstructed, ẍ, samples:

ε^k = (1 / N) Σ_{i,j} (ẋ^k_{i,j} − ẍ^k_{i,j})²,

where N is the number of matrix entries. We discuss the results in Section 8.2.
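The reconstruction-based scoring can be sketched as follows. This is a numpy sketch of the normalization and MSE score only; `reconstruct` is a placeholder for a trained autoencoder's forward pass, not the actual model:

```python
import numpy as np

def minmax_normalize(C):
    """Scale a 12 x 46 occupancy matrix to [0, 1], retaining the
    relative occupancy between layers."""
    return (C - C.min()) / (C.max() - C.min())

def anomaly_score(C, reconstruct):
    """Mean squared reconstruction error used as the anomaly score.
    A chamber whose occupancy pattern is far from what the
    autoencoder learned on good data yields a large score."""
    x = minmax_normalize(C)
    x_hat = reconstruct(x)
    return np.mean((x - x_hat) ** 2)
```

An identity `reconstruct` gives a score of exactly zero; a trained autoencoder approaches this on good chambers and departs from it on anomalous ones.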
8.1 Local approach

The performance of the various models on a held-out test data set can be seen in Fig. 9, where we show the different receiver operating characteristic (ROC) curves. Compared to statistical, image processing or other machine learning based solutions, supervised deep learning clearly outperforms the rest. Thanks to the limited number of parameters of the model, the training converges to a satisfactory result (Fig. 10), despite the number of training samples being small.

Although the Area Under Curve (AUC) of the fully-connected shallow neural network is comparable to that of the CNN, the latter is the better solution when requiring maximum specificity (true negative rate, TNR, aimed at avoiding false positives) and sensitivity (true positive rate, TPR, aimed at avoiding false negatives). The relatively good performance of the basic, unsupervised variance method, compared to the poor results of the filter, and the near optimal performance of the SNN show that the features to learn are not simple contrasts, although the superior performance of the CNN demonstrates that the initial edge detection layer is useful. The limited performance of Isolation Forest is likely to come from the violation of its fundamental assumptions that faults are rare (remember that the fault rate is of the order of 10%) and homogeneous. The inferior performance of the typical semi-supervised method (µ-SVM) illustrates the well-known smoothness versus locality argument for deep learning [17,18]: the difficulty of modeling the highly varying decision surfaces produced by complex dependencies involving many factors.

As shown in the score distribution of Fig. 11, the proposed CNN architecture separates anomalous layers significantly. This allows for great flexibility in choosing the working point for deployment in production in the CMS DQM. Depending on the cost of type 1 and type 2 errors for the detector operators, the threshold can be set anywhere in a broad score range (up to a score of 0.9). When using the CNN for the selection of good samples for training the regional algorithms, the working point is chosen to favor neither specificity nor sensitivity, with the threshold set to a score of 0.5.

The production algorithm targets a specific failure scenario of dead regions and produces a chamber-wise goodness assessment, without being capable of identifying a specific problematic layer in the chamber. For this reason we cannot directly compare its performance with our local approach. For the sake of benchmarking our approach, we use our per-layer ground truth to label as bad any chamber with at least one problematic
Fig. 9 ROC curves for the different models used in the local approach. The area under the curve (AUC) is quoted to compare the performance.
Fig. 10 Loss as a function of the number of epochs in the training of the CNN model used for the local approach. The two curves illustrate the behavior of the training and validation data sets.
layer. We then ask the production algorithm whether it indicates at least one faulty layer in the chamber. With this per-chamber label, we estimate the specificity of the production algorithm at 91%, with a sensitivity of only 26%.

Fig. 11 Distribution of scores in the local approach for the CNN model.

Another difference of our approach with respect to the production one is its performance with low statistics, i.e., at the beginning of a run. As seen in Fig. 12, our CNN model gradually adds alarms until reaching stability. The production algorithm shows the opposite behavior, generating a substantial fraction of false alarms in the early stages of the run.

Fig. 12 Stability of the proposed model and the production algorithm as a function of time (number of lumisections) for three different runs: 306777, 306793, and 306794. The stability test is performed every 10 LSs. The CNN: Total and Production lines follow the total fraction of alarms generated by the methods, while CNN: Emerging reports the fraction of new alarms generated by the CNN model with respect to the previous test point.
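The working-point bookkeeping used in this section can be sketched as follows. The scores, labels, and the roughly 10% fault rate below are synthetic stand-ins, not the output of the CMS models; the sketch only illustrates computing a ROC curve, its AUC, and the specificity/sensitivity of a fixed score threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Synthetic classifier scores: anomalous layers (~10% of samples) score
# high, good layers score low.
labels = (rng.random(2000) < 0.10).astype(int)
scores = np.where(labels == 1,
                  rng.beta(8, 2, size=2000),   # anomalous: peaked near 1
                  rng.beta(2, 8, size=2000))   # good: peaked near 0

fpr, tpr, thresholds = roc_curve(labels, scores)  # ROC curve points
auc = roc_auc_score(labels, scores)

# Working point favoring neither specificity nor sensitivity: score 0.5.
wp = 0.5
tnr = np.mean(scores[labels == 0] < wp)       # specificity
tpr_wp = np.mean(scores[labels == 1] >= wp)   # sensitivity
print(f"AUC={auc:.3f} specificity={tnr:.3f} sensitivity={tpr_wp:.3f}")
```

Sliding `wp` across the score range trades specificity against sensitivity, which is the flexibility exploited when deploying the CNN in production.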
8.2 Regional approach

To assess the performance of a given ensemble of channels, we take as anomaly indicator the quantity

$$\epsilon^k_i = \frac{1}{j}\sum_{j}\left(\dot{x}^k_{i,j} - \ddot{x}^k_{i,j}\right)^2,$$

i.e., the MSE between the original sample given as input to the encoder ($\dot{x}^k_{i,j}$) and the output of the decoder ($\ddot{x}^k_{i,j}$). With the objective of identifying the problematic region of the chamber, we exploit the granularity of the autoencoder information, computing the MSE values for different sets of channels. For example, we can compute the MSE for all the channels corresponding to a given read-out electronic board or, alternatively, per layer when tackling potential failures of the voltage distribution system.

We use this figure of merit on the sample with the different voltage settings described in Section 6. Figure 13 shows the good performance of all models, especially the convolutional autoencoder. The distributions of the MSE for a well-behaving and for a problematic layer are shown in Fig. 14. The MSE distribution for layer 9 shows a clear separation between chambers operated at nominal and at lower voltages. For each $\epsilon_i$ value of a given example, a quantitative assessment of the severity of a potential anomaly can be derived by quoting the corresponding p-value of the good-example distribution. The separation is less pronounced for the working point at 3450 V, which is closer to the nominal setup. This is reflected in the AUC values reported in Fig. 13.

The production algorithm is not sensitive to the type of faults described in this section, since the hits in layer 9 are non-zero. Looking ahead to deployment in the DQM infrastructure of the CMS experiment, the best result would be obtained by applying the local and regional models in a pipeline.

Fig. 13 ROC and AUC of the different autoencoder models used in the regional approach. The discriminator between good and anomalous samples is the $\epsilon$ in layer 9.
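A short sketch of this figure of merit, assuming hypothetical 12-layer × 96-channel occupancy maps: the per-layer MSE $\epsilon_i$ and an empirical p-value computed against a stand-in distribution of $\epsilon$ values from good examples.

```python
import numpy as np

def layer_mse(x, x_rec):
    """Per-layer anomaly indicator eps_i: the MSE over the channels j
    of each layer i between input and reconstruction."""
    return ((x - x_rec) ** 2).mean(axis=1)

def empirical_p_value(eps, good_eps):
    """Fraction of good-example MSE values at least as large as eps:
    a severity estimate for a potential anomaly."""
    return float((good_eps >= eps).mean())

rng = np.random.default_rng(1)
# Hypothetical chamber map and its autoencoder reconstruction.
x = rng.random((12, 96))
x_rec = x + rng.normal(0.0, 0.01, size=(12, 96))
x_rec[8] += 0.3  # inject a large residual in layer 9 (index 8)

eps = layer_mse(x, x_rec)
# Stand-in eps distribution collected from well-behaving chambers.
good = rng.normal(1e-4, 2e-5, size=1000)
print(empirical_p_value(eps[8], good))  # small p-value flags layer 9
```

Computing the same indicator with the channels grouped by read-out board instead of by layer only changes the axis over which the mean is taken.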
8.3 Global approach

Figure 15 shows an example of the low-dimensionality representation of the chamber data, which clusters depending on the chamber position in the detector. The global approach is thus potentially capable of spotting unusual behavior of the DT chambers while taking the geographical constraints into account. Ultimately, this could pave the way to a more flexible assessment that scores per detector region.

When investigating the representations of a specific chamber across different runs (see Fig. 16), we notice that they tend to cluster depending on the number of problematic layers. Thanks to this, the cumulative distribution of the compressed representations could be used to highlight the occurrence of new anomalies or to associate an anomalous behavior with an already known problem. This application could assist experts in diagnosing transient and reoccurring issues.
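The compressed representations come from reading out the autoencoder bottleneck. The sketch below builds a toy dense autoencoder with a two-dimensional code (the shapes and layer sizes are illustrative, not the actual global model) and extracts codes of the kind visualized in Figs. 15 and 16.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical chamber-level inputs; 12x96 maps are an assumption.
inp = keras.Input(shape=(12, 96, 1))
flat = layers.Flatten()(inp)
code = layers.Dense(2, name="bottleneck")(flat)  # 2D compressed representation
dec = layers.Dense(12 * 96, activation="sigmoid")(code)
out = layers.Reshape((12, 96, 1))(dec)

autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")

# Encoder sharing the trained weights: maps chamber data to the 2D code,
# e.g., for scatter plots colored by station number or faulty-layer count.
encoder = keras.Model(inp, code)
chambers = np.random.rand(30, 12, 96, 1).astype("float32")
codes = encoder.predict(chambers, verbose=0)  # shape (30, 2)
```

After training the autoencoder, each chamber maps to a point in the 2D code space, and clustering in that space is what the global approach exploits.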
This paper shows how detector malfunctions can be identified with high accuracy by a set of automatic procedures based on machine learning. We considered the
Fig. 14 MSE between reconstructed and input samples for layer 3 (A) and layer 9 (B), for three categories of data, for the convolutional autoencoder. Despite the problem in layer 9, the $\epsilon$ values for layer 3 are comparable for all chambers.
Fig. 15 Compressed representation of the chamber-level data of the global model. The samples cluster according to their position in the detector, here depending on the station number.

specific case of the DT muon chambers of the CMS experiment. We developed a CNN-based classifier to spot local misbehaviors of the kind currently targeted by the existing CMS monitoring tools. We also showed that it is possible to extract more information from the map of electronic hits than the currently implemented statistical tests do. In particular, we developed a strategy to spot
Fig. 16 Compressed representation of the chamber-level data of the global model, limited to a single chamber across different runs, as a function of the number of faulty layers (color scale). The samples cluster according to similar behavior.

regional problems across layers in a detector chamber, or globally, i.e., across chambers in the full muon detector. These algorithms, based on autoencoders, offer a more robust anomaly detection strategy, since they are not defined as supervised classifiers of specific failure modes. This approach makes it possible to localize the origin of a given anomaly, exploiting the granularity offered by the use of the MSE of the decoded image as a quantification of the anomaly.

These algorithms have been integrated into the CMS online DQM infrastructure and are being commissioned with the early data of the 2018 run. The model could be further refined, e.g., by integrating a mechanism of periodic retraining that would allow alarms to be repeated for known problems, or the model to adapt to long-term changes of the detector and beam conditions.

Since the CNN is the basic ingredient in this study, and since many monitored quantities in typical high-energy physics experiments are based on 2D maps (e.g., detector occupancy, detector synchronization, etc.), the approach proposed in this paper could be extended beyond the presented use case. We hope that this case study can serve as a concrete showcase and motivate further DQM automation using machine learning.
Acknowledgements
We thank the CMS collaboration for providing the data set used in this study. We are thankful to the members of the CMS Physics Performance and Dataset project and of the CMS DT Detector Performance Group for useful discussions, suggestions, and support. We acknowledge the support of the CMS CERN group for providing the computing resources to train our models, and of CERN OpenLab for sponsoring A.S.'s internship at CERN as part of the CERN OpenLab Summer student program. We thank Danilo Rezende for precious discussions and suggestions. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement no. …).

References
1. The LHC Study Group. The Large Hadron Collider, conceptual design. Technical report, CERN/AC/95-05 (LHC), Geneva, 1995.
2. Serguei Chatrchyan et al. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B, 716(1):30-61, 2012.
3. Vardan Khachatryan et al. Precise determination of the mass of the Higgs boson and tests of compatibility of its couplings with the standard model predictions using proton collisions at 7 and 8 TeV. Eur. Phys. J., C75(5):212, 2015.
4. Serguei Chatrchyan et al. The CMS experiment at the CERN LHC. JINST, 3:S08004, 2008.
5. Albert M Sirunyan et al. Performance of the CMS muon detector and muon reconstruction with proton-proton collisions at √s = 13 TeV. 2018.
6. Federico De Guio. The data quality monitoring challenge at CMS: experience from first collisions and future plans. Technical report, 2015.
7. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
8. CMS Collaboration et al. Calibration of the CMS drift tube chambers and measurement of the drift velocity with cosmic rays. Journal of Instrumentation, 5(03):T03016, 2010.
9. L Tuura, G Eulisse, and A Meyer. CMS data quality monitoring web service. In Journal of Physics: Conference Series, volume 219, page 072055. IOP Publishing, 2010.
10. Maxim Borisyak, Fedor Ratnikov, Denis Derkach, and Andrey Ustyuzhanin. Towards automation of data quality system for CERN CMS experiment. In Journal of Physics: Conference Series, volume 898, page 092041. IOP Publishing, 2017.
11. Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.
12. Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, pages 413-422. IEEE, 2008.
13. Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1):3, 2012.
14. Charu C Aggarwal. Outlier analysis. In Data Mining, pages 237-263. Springer, 2015.
15. Charu C Aggarwal. Data Classification: Algorithms and Applications. CRC Press, 2014.
16. Glen Cowan, Kyle Cranmer, Eilam Gross, and Ofer Vitells. Asymptotic formulae for likelihood-based tests of new physics. The European Physical Journal C, 71(2):1554, 2011.
17. Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34(5):1-41, 2007.
18. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
19. Markus Goldstein and Seiichi Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE, 11(4):e0152173, 2016.
20. Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(5):363-387, 2012.
21. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
22. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, pages II-1278-II-1286, 2014.
23. Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop, pages 1-5, 2015.
24. Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017.
25. Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In Proceedings of the 19th International Conference on Neural Information Processing Systems, pages 1137-1144, 2006.
26. Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371-3408, 2010.
27. Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833-840, 2011.
28. Patrice Y. Simard, Yann A. Le Cun, John S. Denker, and Bernard Victorri. Transformation invariance in pattern recognition - tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 239-274. 1998.
29. Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res., 15(1):3563-3593, 2014.
30. Irvin Sobel. An isotropic 3×3 image gradient operator. In Machine Vision for Three-Dimensional Scenes, pages 376-379, 1990.
31. Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
32. François Chollet et al. Keras, 2015.
33. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
34. Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka. Conditional anomaly detection.