Tackling the muon identification in water Cherenkov detectors problem for the future Southern Wide-field Gamma-ray Observatory by means of Machine Learning
B.S. González a,b, R. Conceição a, M. Pimenta a, B. Tomé a, A. Guillén b

a Laboratório de Instrumentação e Física Experimental de Partículas (LIP) - Lisbon, Av. Prof. Gama Pinto 2, 1649-003 Lisbon, Portugal, and Instituto Superior Técnico (IST), Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001 Lisbon, Portugal
b Computer Architecture and Technology Department, University of Granada, Granada, Spain
Abstract
This paper presents several approaches to deal with the problem of identifying muons in a water Cherenkov detector with a reduced water volume and 4 PMTs. Different perspectives of information representation are used and new features are engineered using the specific domain knowledge. As results show, these new features, in combination with the convolutional layers, are able to achieve a good performance, avoiding overfitting and being able to generalise properly for the test set. The results also prove that the combination of state-of-the-art Machine Learning analysis techniques and water Cherenkov detectors with low water depth can be used to efficiently identify muons, which may lead to huge investment savings due to the reduction of the amount of water needed at high altitudes. This achievement can be used in further research to discriminate between gamma- and hadron-induced showers using muons as discriminant.
Keywords:
Machine learning, Neural networks, Feature engineering, Gamma rays, Cosmic rays, Information representation, Signal processing
1. Introduction
The irruption of Deep Learning (DL) models using Convolutional Neural Networks (CNNs) has been earth-shaking for the research in machine learning, setting the standard and outperforming human capabilities in many tasks. It all started in image applications with [1] but, rapidly, these methods have been applied to many other domains [2, 3]. Outside the image landscape, the principal application of these models has been biomedical signals like ElectroCardioGrams (ECG) and ElectroEncephaloGrams (EEG) [4, 5, 2].

For the concrete case of analysing cosmic-ray measurements, DL has been successfully applied to a few experiments. Some of them resemble the application done in image processing, as images can be generated in a quite straightforward way from the experimental devices, as in Imaging Atmospheric Cherenkov Telescope Arrays (IACTAs). However, a different approach is required to process other types of inputs in other detectors, for instance, the reconstruction of the main characteristics of Extensive Air Showers (EAS) in observatories like HAWC or Pierre Auger [6, 7, 8] or the detection of neutrinos with the IceCube observatory [9].

In this research field, the detection of very-high-energy (VHE) gamma rays is essential to investigate the sources of the incoming electromagnetic radiation produced by some of the most extreme, non-thermal phenomena taking place in the Universe, e.g., fast-spinning neutron stars or supermassive black holes [10]. One of the possible techniques to indirectly detect gamma rays are the EAS arrays, which cover large areas with particle detectors at high altitude. This method reconstructs the main characteristics (e.g. energy or direction) of the primary gamma ray by means of the detection of secondary particles produced during the air shower development.
EAS arrays are able to perform long-term observations of variable sources and allow the search for emissions in extended regions [11]. However, although the indirect methods are effective in the GeV-TeV region, an important disadvantage is the presence of a huge background from charged particles produced in showers originated by cosmic rays, which is three orders of magnitude larger than the signal. Several techniques have been proposed to perform the gamma/hadron separation, that is, rejecting the hadronic background. At high energy, muons, a clear signal of hadronic interactions, begin to reach the ground in sufficiently high numbers so that they can be used to discriminate gamma from hadronically induced showers. Thus, the identification of muons in water Cherenkov detectors (WCD) is explored in this work.

Using light detectors, WCDs measure the Cherenkov light emitted by charged particles in water. The production of the Cherenkov light in WCDs is closely related to its depth. Since electromagnetic particles get attenuated at the top of the station, using a detector unit with enough depth ensures a good efficiency in the identification of muons. That is the case of the HAWC observatory, which has a good discrimination power in single-muon events (those events with only muons and no electromagnetic contamination) [12, 13] by using 300 cylindrical WCDs with a depth of 5 meters and a diameter of 7.5 meters. However, it must be taken into account that at mountain altitudes the resources of liquid water are reduced; hence, the detector size must be optimised to reduce the necessity of water at those altitudes. For this reason, in this paper, the problem is tackled from a new WCD design with low water depth and a good muon tagging efficiency. The proposed detector uses a set of four PhotoMultiplier Tubes (PMTs), whose positions inside the WCD have been optimised to maximise the signal asymmetry of muons vertically crossing the detector.

The challenge in this work is to tackle the problem of identifying if a muon has crossed the WCD by means of the PMTs' gathered information. Once it is known if there are muons in the event, this information could be used for an ulterior classification between gamma- and hadron-induced showers, which could be directly applied to new observatories. To do so, it is necessary to establish the whole machine learning pipeline as well as to determine how the input information will be given to the models. Thus, the rest of the paper is organised as follows: in Section 2, the data for the analysis is described. Section 3 presents the different approaches considered to tackle the problem. Section 4 performs an experimental comparison of the previous approaches. Finally, the conclusions and a discussion are presented in Section 5.

Preprint submitted to Engineering Applications of Artificial Intelligence, January 29, 2021.
2. Data description and preprocessing
The data analysed in this work has been simulated using CORSIKA (version 7.5600) [14] and the detector using the Geant4 toolkit (version 4.10.05.p01) [15, 16, 17]. The observation level for the simulations was set at 5 200 m above sea level, and a WCD array covering an area of 80 000 m² with a fill factor of ~80% was considered, which are the operational characteristics that the future Southern Wide-field Gamma-ray Observatory (SWGO) [18] will have.

The set of EAS was generated using proton-based particles with energies between 4 and 6 TeV and zenith angles θ ∈ [5°; 15°]. This range makes muons fall vertically towards the station, making their presence clearer and adding some variability to the data set instead of having to introduce noise to make them more realistic. The azimuth angle of the primary particle was uniformly distributed over φ. The WCD unit used in this study is a cylinder.

A crucial stage in the experimentation is to define a procedure to arrange the data registered by the detectors into a classical machine learning problem. First of all, for each EAS, the simulator considers all the stations that are hit by any particle. However, it is necessary to be cautious and select only the events that are interesting for our purpose. For instance, in the case that a muon doesn't cross the detector entirely, little Cherenkov light could be detected. To overcome possible unfavourable situations, a threshold of 300 photoelectrons in the light collected is established when selecting the events.

The next step is to build the data sets necessary for the experimentation. As the muon identification is carried out at station level, to ensure the maximum fairness in our analysis, the EAS are separated beforehand to avoid having events of the same shower in different data sets. Afterwards, the data is partitioned into three independent subsets:

• Train: necessary for setting the values of the model hyperparameters.
• Validation: used to select the best final model. To this end, a 20% is taken from the training data set.
• Test: applied to perform an unbiased evaluation of the fittest models after the training/validation process.

Table 1 details the number of instances of each partition.
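The shower-level separation into the three subsets can be sketched as follows. This is a minimal illustration in plain Python, not the authors' actual implementation; the `shower_id` field and the split fractions are hypothetical:

```python
import random

def split_by_shower(station_events, test_frac=0.3, val_frac=0.2, seed=42):
    """Partition station-level events so that all stations hit by one EAS
    end up in the same subset (train, validation or test)."""
    shower_ids = sorted({ev["shower_id"] for ev in station_events})
    random.Random(seed).shuffle(shower_ids)

    n_test = int(len(shower_ids) * test_frac)
    test_ids = set(shower_ids[:n_test])
    rest = shower_ids[n_test:]

    # 20% of the remaining (training) showers go to validation.
    n_val = int(len(rest) * val_frac)
    val_ids = set(rest[:n_val])

    split = {"train": [], "val": [], "test": []}
    for ev in station_events:
        if ev["shower_id"] in test_ids:
            split["test"].append(ev)
        elif ev["shower_id"] in val_ids:
            split["val"].append(ev)
        else:
            split["train"].append(ev)
    return split
```

Splitting by shower id, rather than by station event, is what prevents stations of the same EAS from leaking across subsets.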
Data sets        Training    Test
EAS
S.M. stations    17 244      14 808
E.M. stations    340 007     289 354
Instances        357 251     304 162
Muonic prop.     ≈ 4.8%      ≈ 4.9%
Table 1: Description of the data sets. Number of station events with Single Muons (S.M. stations) and without muons (E.M. stations) from proton-induced showers with E ∈ [4; 6] TeV and θ ∈ [5°; 15°].

Muons constitute a small fraction of the total number of particles that reach the Earth's surface. Thus, as shown in Table 1, the number of stations with muons per EAS is low (in comparison with those with electrons or photons). Since the subsets are partitioned by separating the EAS, the proportion of stations with muons will remain. For this reason, different preprocessing techniques have been explored in order to balance the classes before the training stage, which is a must if we want to identify any muon [19].

• Random oversampling: random repetition of the samples available from stations with muons.
• Random undersampling: random elimination of the samples available from stations without muons.
• SMOTE (Synthetic Minority Over-sampling TEchnique): creation of synthetic new samples of stations with muons [20].

Figure 1: Scheme of the single-layered WCD. (a) Single-layered WCD design. (b) Water Cherenkov detector crossed by a Single Muon; the WCD drawing is surrounded by the corresponding PMT signal time traces (S [p.e.] vs t [ns] for PMTs 1-4).
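As an illustration, random oversampling up to a chosen minority/majority ratio can be sketched in a few lines of NumPy. This is a hedged sketch, not the authors' code; the 0.5 target ratio assumed here is the one reported later in the paper:

```python
import numpy as np

def random_oversample(X, y, ratio=0.5, seed=0):
    """Randomly repeat minority-class (muon, y == 1) samples until
    N_minority / N_majority reaches `ratio`."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == 1)
    idx_maj = np.flatnonzero(y == 0)
    n_target = int(ratio * len(idx_maj))
    if n_target > len(idx_min):
        # Draw extra minority samples with repetition.
        extra = rng.choice(idx_min, size=n_target - len(idx_min), replace=True)
        keep = np.concatenate([idx_maj, idx_min, extra])
    else:
        keep = np.concatenate([idx_maj, idx_min])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

Random undersampling works the same way with the roles of the two index sets exchanged; SMOTE additionally interpolates between minority neighbours instead of repeating them.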
3. Machine Learning algorithms: problem approach and model design
This section presents the different approaches considered to solve the problem as well as the details regarding the design methodology for each type of technique.
As it is desired to know the probability that the analysed WCD has been crossed by a muon, instead of approaching the task as a classification problem (where the output would be {0, 1}), regression models are adequate to provide such information. By having the probability, it will be possible to tune the model towards a more sensitive or specific output by defining thresholds [21].

The nature of the data allows using different techniques when handling the information collected by the PMTs. In this paper, the following strategies have been considered:

• Signals (S⃗ = (S⃗_1, S⃗_2, S⃗_3, S⃗_4)): the signal trace of the event captured during 30 ns, which allows us to explore the temporal features of the events and may be crucial for identifying, in future work, if more than one particle has crossed the station. This is especially interesting considering that analysing the whole signal could make it possible to identify reflections of light generated by the tank walls. Each S⃗_i is highly dependent on the total energy of the event; therefore, the raw signals are normalised by the sum of the integrals of all PMTs during 30 ns.

• PMT Integrals (I_1, I_2, I_3, I_4): instead of having such a large amount of data, it is possible to project the traces into a number by computing the integral of the signal recorded by each detector. By doing this, it is possible to consider classical approaches which would not be able to process the raw signals described above. As there are 4 PMTs recording the Cherenkov light, it is still possible to conserve some spatial information, although the temporal component is lost. For the sake of clarity in the muon identification, as muons reach the PMTs before the electromagnetic component, a threshold of 10 ns has been set to compute these values.

• Total light I_t: it represents the sum of the PMT integrals, therefore I_t = I_1 + I_2 + I_3 + I_4.
The reason to consider I_t is due to the intrinsic uniform properties of muons, which are confined in a concrete range of values of this variable.

• Asymmetry A_T: Cherenkov light in muonic events is mainly produced in a reduced area of the WCD, whilst in electromagnetic events the light is spread inside the tank. Under these circumstances, it is to be expected that a large asymmetry is present in muonic events when comparing the amount of light collected by the PMT with the maximum signal and the one in front of it. Let A_T be the total asymmetry of an event, defined by equation (1), and let I_max and I_opp be the integrals of the signals in the PMT with maximum signal and the one in front of it respectively, so that A_T ∈ [0, 1]:

A_T = (I_max − I_opp) / (I_max + I_opp)    (1)

• PMT Integrals Normalisation (In_1, ..., In_4): although the integral by itself could be a good projection of the event, it is possible to engineer the information encoding by normalising the amount of signal of each PMT by the total signal recorded, obtaining a magnitude-independent range of values. Thus, the integrals for each PMT are normalised using I_t: In_i = I_i / I_t.

As a summary, Table 2 shows the input given to the models and its classification.

This subsection first describes a first approach applying CNNs to the problem using the traces available. Afterwards, the process followed with boosted trees, which are not able to process the raw signals, is detailed. Finally, the possibility of combining those different approaches by means of an ensemble is commented on.
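The engineered quantities above can be sketched in NumPy as follows. This is a minimal illustration assuming the four 10-ns PMT integrals are already computed, and assuming a PMT ordering in which PMT i faces PMT (i + 2) mod 4; the actual geometry mapping in the detector simulation may differ:

```python
import numpy as np

def engineered_features(I):
    """I: the 4 PMT signal integrals over the first 10 ns.
    Returns total light I_t, normalised integrals In_i and asymmetry A_T.
    Assumes PMT i sits opposite PMT (i + 2) % 4 in the tank."""
    I = np.asarray(I, dtype=float)
    I_t = I.sum()                            # total Cherenkov light in the WCD
    In = I / I_t                             # normalised integrals, sum to 1
    i_max = int(np.argmax(I))
    I_max = I[i_max]
    I_opp = I[(i_max + 2) % 4]               # PMT facing the brightest one
    A_T = (I_max - I_opp) / (I_max + I_opp)  # eq. (1), bounded in [0, 1]
    return I_t, In, A_T
```

A single bright PMT with a dim opposite PMT drives A_T towards 1 (muon-like), while evenly spread light drives it towards 0 (electromagnetic-like).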
To process the signal traces recorded by the PMTs, 1-dimensional convolutional neural networks (CNNs) have been designed. As stated previously, CNNs are one of the state-of-the-art machine learning algorithms, whose relevant applications in different fields have proven their great potential. The algorithm is composed of two blocks of hidden layers: a first set of convolutional layers extracts the features from the raw data by means of the convolution; afterwards, a second set of hidden layers uses the previous information to perform the classification or regression and provide the output. Therefore, CNNs are capable of fusing the tasks of feature extraction from complex data (signals or images) and classification/regression in a single process, which is the main advantage of using this algorithm [22].

Before hyperparameters can be tuned, it is required to find a way of introducing such input. As stated previously, different kinds of inputs appear after analysing the information captured by the detectors. With the use of this algorithm we intend to squeeze all the information available, both spatial and temporal. On the one hand, convolutional layers will be used to extract characteristics from the signal traces. For each considered station we will have 4 signal traces to store (one per PMT). In order to respect as much as possible the temporal information that the problem provides, the approach consists of using a channel for each of these signals; in this way, each signal trace value will be attached to the nanosecond when it was registered by its position in the array which stores the data. On the other hand, after focusing on the temporal characteristics, we must consider how to feed the CNN with the other engineered variables as well. To this end, the second set of hidden layers, that is, the dense layers which are used to perform the regression, is fed with the rest of the variables.

When designing a CNN, one of the critical stages is the definition of the convolutional layers.
Nowadays, no standard approach exists to set the configuration of a CNN. There are some rules of thumb for image classification but, since the problem tackled in this paper belongs to another domain, those rules could be inadequate. To overcome this problem, the hyperparameter space has been clustered based on the experience tackling similar problems like [23, 24], so that a finite number of values are evaluated. Finally, after evaluating different topologies with the validation data set, the fittest one was composed of three convolutional layers, which were enough to cover the inner complexity of the data and extract relevant features from the raw data. The ReLU [25] activation function was used in these layers. The size of the filters was selected according to the mean of the signal traces registered, which unveils that the first pulse of Cherenkov light usually arrives within the first 3 nanoseconds of signal. Thus, small filters are proposed. The fittest configuration used filters of size 2, with a stride of 2 in the first layer to highlight the arrival of the direct Cherenkov light. The number of filters in these convolutional layers was 20, 15 and 10. Pooling layers have been discarded since they reduce the average specificity, though a higher average sensitivity was achieved when using them.

Regarding the architecture of the CNN, the configuration chosen is: three dense layers (with 30, 15 and 10 neurons) and a final output layer were used to perform the regression from the previously extracted characteristics and the introduced variables. The sigmoid activation function was used for the three dense layers and for the final neuron. ReLU and Tanh activation functions were considered as well, but the performance obtained was lower.

All CNN configurations (using different input variables) were trained using the Adam learning algorithm [26], which is a stochastic gradient-based method that has been broadly used in many deep learning applications. The adjustment of its parameters was done following those recommended in [26], that is: betas = (0.9, 0.999), epsilon = 10⁻⁸ and a learning rate of 0.001. The models were trained during 200 epochs with a batch size of 512; smaller values were showing high differences between training and validation, and higher values increased training errors. Regarding the loss function to be minimised after each epoch, the Mean Squared Error (MSE) was chosen.

Figure 2 shows the topology of the fittest CNN found using temporal and spatial features, whose parameters have been detailed previously.

Two state-of-the-art decision-tree based algorithms have been used to process exclusively the signal trace integrals, which enable us to study the spatial information present in the WCDs:
Extreme Gradient Boosting (XGBoost) and
Random Forest. Random Forest [27] is an algorithm that combines several decision trees and arises from the modification of the bagging process after decorrelating the trees
#  Variable            Type        Description
1  S⃗                  Simulated   PMTs' signal traces: first 30 ns of the signal trace of each PMT, normalised using the total signal.
2  I_t                 Engineered  WCD's total signal: sum of the integrals of each PMT's signal using 10 ns. Total Cherenkov light in the WCD.
3  I_i, i = 1...4      Simulated   Integral of each PMT's signal trace over the first 10 ns.
4  In_i, i = 1...4     Engineered  PMT integrals normalised using the total signal: In_i = I_i / I_t.
5  A_T                 Engineered  Asymmetry: a normalised difference of signal between the PMT with the greatest signal and its opposed in the WCD. Defined in equation (1).

Table 2: Variables used in the different approaches. Variable number 3 is actually 4 variables, one for each PMT in the tank. In the same way, variable number 1 consists of 4 × 30 values corresponding to the trace registered by each of the 4 PMTs.
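The two-branch topology of Figure 2 can be illustrated with a toy forward pass in plain NumPy. This is a sketch with random, untrained weights, not the trained network; it uses channels-last shapes, so the intermediate sizes differ slightly from the ones printed in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, filters, kernel=2, stride=1):
    """Valid 1-D convolution with ReLU; x has shape (length, channels)."""
    length, channels = x.shape
    w = 0.1 * rng.normal(size=(kernel, channels, filters))
    out_len = (length - kernel) // stride + 1
    out = np.empty((out_len, filters))
    for i in range(out_len):
        window = x[i * stride : i * stride + kernel]   # (kernel, channels)
        out[i] = np.maximum(np.tensordot(window, w, axes=([0, 1], [0, 1])), 0.0)
    return out

# One station: 30 time bins for each of the 4 PMT traces (toy values).
traces = rng.random((30, 4))

h = conv1d_relu(traces, filters=20, stride=2)  # size-2, stride-2 filters
h = conv1d_relu(h, filters=15)
h = conv1d_relu(h, filters=10)
conv_features = h.ravel()

# The dense block receives the conv features concatenated with the
# 9 engineered inputs (I_i, In_i and I_t).
engineered = rng.random(9)
dense_input = np.concatenate([conv_features, engineered])
```

The stride-2 first layer halves the time axis while scanning adjacent nanosecond pairs, which is the mechanism the text describes for highlighting the arrival of the direct Cherenkov light.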
[Figure 2 layer shapes: InputLayer (4, 30) → Conv1D (20, 15) → Conv1D (19, 15) → Conv1D (18, 10) → Flatten (180) → Concatenate with a second InputLayer (9) giving (189) → Dense (30) → Dense (15) → Dense (10) → Dense (1).]
Figure 2: Topology of the fittest CNN found.

generated in the process. The bagging technique is meant to reduce the variance from the combination of models. During the process, the resampling technique bootstrapping is used. With this technique, new data sets are created from the extraction of samples (with repetition) of the training data set. So, in a regression problem, a model is trained with each of the new training sets and the prediction of the total model will be the combination of the outputs of each of the trained models. The parameters to adjust after the training stage have been clustered and evaluated with the validation data set:

• Number of estimators that will be built. Evaluated values: [50, 100, 500, 1000]. Best value found: 100.
• Maximum depth for the trees built. Evaluated values: [10, 15, 20]. Best value found: 20.

The second approach considered is the Extreme Gradient Boosting (XGBoost) algorithm [28, 29], which is based on the boosting method. This method is similar to the bagging technique, but in this case the trees are built sequentially. That is, each of the trees is built using information from those previously built. In this way, "robust" trees are built from "weaker" ones. This is achieved by using the gradient descent algorithm as the optimiser. Thus, during each iteration of the training process, the parameters of the weakest models are adjusted in order to minimise the loss function established for the problem (for instance, the root of the mean square error in the case of a regression), which will be given by the result of the model. Some parameters must be established when training this algorithm, for which we have followed the same strategy used previously. The parameters which were evaluated using the validation data set are the following:

• Maximum depth for the trees built. Evaluated values: [8, 9, 10]. Best value found: 10.
• Learning rate. Evaluated values: [0.01, 0.1, 0.2, 0.7]. Best value found: 0.2.
• Objective: defines the kind of output that will be provided by the algorithm. In our case, a probability is wanted; thus, "binary:logistic" is chosen.
• Scale pos weight: used to control the balance of positive and negative weights, which is useful for unbalanced classes. Introducing the ratio between the classes, N_em/N_µ, present in the preprocessed training data set, helped in improving the result.
• Maximum number of rounds when training. Selected value: 1000.
• Early stopping rounds: applied to find the optimal number of boosting rounds. If the error is not reduced on the validation data set after a fixed number of rounds, the training will stop. Selected value: 20.

3.4. Ensemble approach
The previous approaches have a partial overlap in the variables used as input; however, there might be some features discovered by the CNN that tree models are not able to compute. At the same time, due to the smaller input data and the efficient optimisation strategy, tree models might obtain more accurate results. With these two facts in mind, the possibility of combining both outputs in an ensemble is discussed.

Two ensembles are designed using the CNN as the main algorithm and a decision-tree based algorithm as the second algorithm, whose outputs will be obtained using equation (2). Let P_µ,pred(w) be the probability obtained by the ensemble and let P_µ,m1 and P_µ,m2 be the probabilities given by the two best models found for each approach respectively. Additionally, let us propose the weight w to adjust the influence that each algorithm will have when determining the ensemble probability, so that P_µ,pred(w) ∈ [0, 1]:

P_µ,pred(w) = w · P_µ,m1 + (1 − w) · P_µ,m2    (2)

In principle, there are infinite possibilities when combining the outputs of the models depending on the weight w. However, we can adjust and get an optimum value of w in order to minimise the error. A possible approach to optimise the weight of each ensemble is to find the value which minimises the mean squared error after predicting the samples of the validation data set [30], that is, minimising equation (3), where n is the total number of samples in the validation data set, i is the index of the sample, P^(i)_µ,pred is the probability obtained by the ensemble and P^(i)_µ is the real "probability" and label of that sample (1 if there is a muon in the WCD or 0 otherwise). To solve the global optimisation problem, the Differential Evolution algorithm from the SciPy Optimize Python package is used, which is a stochastic population-based method [31].

MSE = (1/n) Σ_{i=1}^{n} ( P^(i)_µ,pred(w) − P^(i)_µ )²    (3)

With the previous approach, when there is a disagreement between the models, one of them will have more weight than the other. This means that, if one model predicts a muon and the other does not, the ensemble will predict a muon unless the weight given to the first model is very small. This situation is undesirable, as it is a requirement to be sure about when muons appear, so other approaches to combine the models' outputs have been considered as well:

P_µ,pred = P_µ,m1 · P_µ,m2    (4)

P_µ,pred = √(P_µ,m1 · P_µ,m2)    (5)

P_µ,pred = (1/√2) · √(P²_µ,m1 + P²_µ,m2)    (6)

By using the product, in case of clear disagreement, the ensemble will not provide a high probability. This value will be high only when both models agree. The ensemble of equation (6) was designed with the aim of giving greater importance to the algorithm that provides the highest probability.
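The weighted-sum ensemble and its weight optimisation can be sketched as follows. The paper uses SciPy's Differential Evolution for this step; a dense grid search is used here as a dependency-free stand-in, and the prediction vectors are toy numbers, not values from the paper:

```python
import numpy as np

def ensemble_prob(w, p1, p2):
    # Weighted-sum ensemble, eq. (2).
    return w * p1 + (1 - w) * p2

def mse(w, p1, p2, y):
    # Validation mean squared error, eq. (3).
    return float(np.mean((ensemble_prob(w, p1, p2) - y) ** 2))

# Toy validation predictions for the two models (hypothetical values).
p_cnn = np.array([0.9, 0.2, 0.8, 0.1])
p_xgb = np.array([0.7, 0.4, 0.9, 0.0])
labels = np.array([1.0, 0.0, 1.0, 0.0])

# Stand-in for scipy.optimize.differential_evolution over w in [0, 1].
grid = np.linspace(0.0, 1.0, 1001)
w_best = grid[np.argmin([mse(w, p_cnn, p_xgb, labels) for w in grid])]
```

The product-based ensembles of equations (4)-(6) need no such optimisation step, which is why the corresponding rows of Table 4 have no w column entries.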
4. Experiments and discussion
This section will show, first, how relevant the newly proposed variables are. Afterwards, the performance of the ensemble models is analysed. Finally, an experiment on solving the classification task using the probability is presented with the models presented in the paper.

As stated in Section 2, the data set is highly imbalanced so, after making some trials with the cited strategies, the best one (evaluated using the validation error) was Random oversampling with ratio N_µ/N_e.m. = 0.5, where N_µ and N_e.m. are the number of stations with muons and without muons respectively. The following experiments, unless stated explicitly, are made after applying this procedure to the training data set.

Regarding the implementation details, Python 3.7 using SciKit-learn [32], Numpy [33] and the Keras framework [34] were used to develop the neural networks and the random forest. Regarding XGBoost, the implementation available at [35] was chosen to carry out the experiments. Due to the high volume of data (each EAS can require up to hundreds of MiB), the infrastructure used to compute was a cluster at LIP with the following characteristics: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz processors, 46 GiB of RAM and two GPUs: NVIDIA GP102 TITAN Xp and NVIDIA TU102 GeForce RTX 2080 Ti.

For the sake of reproducibility and transparency, all the code and data are available at https://github.com/borja-sg/Muon-identification-WCD.

To determine how important the variables are, two methods were applied. The first one is based on the ranking that the XGBoost algorithm builds during the training stage. The second is based on the approximation error achieved by each model when training with different subsets of variables (given the fixed model architecture of Section 3.2.1).

Figure 3 shows the weights that XGBoost has assigned after carrying out the training. The most important variables are the integrals after the normalisation process, followed by the newly proposed index for the asymmetry. For the CNN approach, Table 3 shows the mean Root Mean Squared Error (RMSE, √MSE) obtained after performing a cross validation with 5 folds. By doing this, it is guaranteed that the process is not biased by the initial shuffle to select train and validation data.
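The 5-fold cross validation can be sketched as follows; a minimal NumPy illustration of the fold bookkeeping only (model training itself is omitted, and fold assignment here is by sample index, whereas the paper additionally keeps showers together):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

def rmse(y_true, y_pred):
    """Root Mean Squared Error reported in Table 3."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Averaging `rmse` over the k validation folds gives the mean RMSE (and its standard deviation) quoted per input-variable combination.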
As expected, the cross validation does not affect the model's learning capabilities.

Figure 3: Mean feature importance given by XGBoost using 5 different seeds to select train and validation data sets. The black lines correspond to the standard deviation after training with the different seeds. Importance type: weight, which is the number of times the feature was used to split the data across all trees.

In this case, the combination of variables that provides the best validation results is the one that uses all the proposed variables but the asymmetry. It is fair to say that the variables proposed always improved the approximation capabilities of the CNN architecture, as the CNN using only the raw normalised signals is the one that provides the worst validation error. Table 3 also shows the error obtained by the tree-based models. Although these models are able to learn the data properly, they tend to overfit, obtaining worse validation and test errors. Another reason to see some difference between training and validation might be the fact that the training data was balanced, meanwhile validation and test are not.
From the previous experiment, we have seen that the features proposed were able to improve the probability prediction. Once the best models have been identified, it is time to see if they can combine knowledge by assembling them. Table 4 shows the obtained RMSE for the different strategies towards creating the ensemble's output. The two models were chosen considering the best RMSE in validation for the models using the PMT signals (CNNs) as well as for the tree-based models. It is interesting to observe how, due to the overfitting of the XGBoost, it receives much more weight than the CNN although it performs worse in validation. The weight computation is also affected by the training data preprocessing, and it is important to notice how the balancing still seems necessary, as it provides better validation and test errors. The most remarkable fact of this table is that, regardless of the strategy used to combine both models' outputs, it always improves the results obtained in isolation. Taking a close look, it is easy to conclude that when combining probabilities, it is better to perform the product than the weighted sum. The product of both probabilities seems the most adequate, as it provides a significant improvement in all metrics.
Once the most accurate model to predict the probability of having a muon in the analysed WCD was obtained, this experiment will show if this prediction is reliable enough to estimate whether a muon has crossed the detector or not. In this task of saying if a tank has been crossed by a muon, a threshold has to be set to decide if the given probability is high enough. If the value is higher than the threshold, the tank is classified as positive. Otherwise, it is considered as having no muon. To determine the threshold, a loop applying cuts to the obtained probability was used with a resolution of 0.01, choosing the best one according to a metric.

Four classification metrics have been used: Area under the
Receiver Operating Characteristic (ROC) curve, F1-score, Accuracy, and an Ad-hoc criterion. The first considered is based on the ROC curves of each proposed model, to explore which ratio between the
True Positive Rate (TPR) and the
False Positive Rate (FPR) is suitable to determine whether there is a muon in the stations. Since using this information in an ulterior system to classify γ/hadron EAS requires ensuring a low false positive rate, the ROC curves can be effective to analyse the behaviour of the models when selecting different thresholds. In addition, the Area Under the Curve (AUC) will allow us to compare the overall performance of the models. Figure 4 shows the results obtained for each model for the two test data sets. According to the results obtained, there is a match between the CNN and the ensemble, obtaining an excellent result for
Single Muons. XGBoost has a good performance but does not achieve results as good.

For the other three metrics, Tables 5 and 6 show the results for the three models. Table 5 shows results for the thresholds that maximise the F1-score and Accuracy metrics respectively for the balanced Train data set. In this case, the ensemble outperforms in accuracy and F1-score; however, for the last metric, the CNN shows a better behaviour for the validation and test data sets, probably due to the effects of the overfitting of the XGBoost, which shows very poor performances in both validation and test.

The last metric is the criterion that could be used for an ulterior classification of the γ vs hadron EAS and to identify the type of particle. In this case, it is very important not to misclassify the T_γ, this is, the tanks where a muon has not passed through, and it is enough to be able to identify some muons (T_µ). Thus, these values were computed and the cut has been set where the T_γ rate is near 99.9%. As this implies new threshold values, the other two metrics have been computed as well to have a fair comparison of the models using different criteria. From Table 6, the performance of both the CNN and ensemble

Input variable sets evaluated in Table 3 (Train / Validation / Test RMSE columns):
S⃗, I_i, In_i, I_t, A_T
S⃗, I_i, In_i, I_t
S⃗, In_i, A_T, I_t
S⃗, I_i, A_T, I_t
S⃗
S⃗, A_T
S⃗, I_i
S⃗, In_i
S⃗, I_t
S⃗, I_i, I_t
S⃗, In_i, I_t

Table 3: Mean RMSE (and standard deviation) for models trained using 5 different seeds to select train and validation data sets.
Model 1 Model 2 Strategy w optimisation w Train Val. TestCNN ( (cid:126)S, I i , In i , I t ) XGB w · P µ, m1 + (1 − w ) · P µ, m2 Train (Balanced) 0.2323 0.1288 0.2092 0.2102CNN ( (cid:126)S, I i , In i , I t ) XGB w · P µ, m1 + (1 − w ) · P µ, m2 Train (Original, imbalanced) 0.1833 0.1285 0.2131 0.2141CNN ( (cid:126)S, I i , In i , I t ) XGB P µ,m · P µ,m - - CNN ( (cid:126)S, I i , In i , I t ) XGB (cid:112) P µ,m · P µ,m - - 0.1193 0.1897 0.1901CNN ( (cid:126)S, I i , In i , I t ) XGB √ · (cid:112) P µ,m + P µ,m - - 0.1623 0.2133 0.2126 Table 4: RMSE error for different ensemble strategies combining the best CNN with XGBoost. Notice that the approaches using the productdo not require the w optimisation parameter. remains quite similar although in the test sets, for the F1-score, the ensemble does an improvement in comparisonwith the other approaches. For the Ad-hoc criterion, wehave two non-dominant solutions, this is, from the two ob-jectives that we are aiming, T µ and T γ , there is no modelthat perform better in both at the same time. If a moreaccurate T µ is desired, the ensemble should be chosen buta 0.2% of error will be increased when misclassifying T γ .After comparing the different models using several met-rics, it is quite obvious that the best approach in gen-eral is the one that receives the input the traces and theengineered variables, this is, CNN. From this fact, it ispossible to devise that the temporal and spatial compo-nents of the signals are crucial when trying to discrimi-nate the presence of muons. One reason for this mightbe the fact that the traces are seeing 30 ns from the sig-nal that includes the direct light and the first reflection,meanwhile the integral is computed considering the first10 ns to avoid too much interference with the electromag-netic component. 
Thus, a tank that maximises reflections should be considered when designing the new observatory. Another interesting element to discuss is the inability of the models to learn the engineered variables on their own. This makes sense, as the engineered variables are almost impossible to obtain just by performing convolutions, but it is a call of attention to DL practitioners: they should not rely solely on the model, but also pay special attention to the problem domain and the expert knowledge available.
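The ensemble strategies compared in Table 4 can be sketched in a few lines. This is a minimal illustration under assumed inputs, not the paper's actual pipeline: `P_cnn` and `P_xgb` are synthetic stand-ins for the muon probabilities of the two base models, and the blend weight `w` is fitted here with SciPy's bounded scalar minimiser of the RMSE rather than the optimisation procedure used in the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500).astype(float)          # 1 = muon present
# Synthetic stand-ins for the probabilities produced by the two base models.
P_cnn = np.clip(y * 0.7 + rng.normal(0.2, 0.20, 500), 0.0, 1.0)
P_xgb = np.clip(y * 0.6 + rng.normal(0.2, 0.30, 500), 0.0, 1.0)

def rmse(p):
    """Root-mean-square error of a probability vector against the labels."""
    return float(np.sqrt(np.mean((p - y) ** 2)))

# Strategy 1: w * P_cnn + (1 - w) * P_xgb, with w chosen to minimise the RMSE.
res = minimize_scalar(lambda w: rmse(w * P_cnn + (1 - w) * P_xgb),
                      bounds=(0.0, 1.0), method="bounded")
w = float(res.x)

# Parameter-free strategies: product and geometric mean of the probabilities.
P_prod = P_cnn * P_xgb
P_geom = np.sqrt(P_cnn * P_xgb)

print(f"w = {w:.3f}, blended RMSE = {rmse(w * P_cnn + (1 - w) * P_xgb):.4f}")
print(f"product RMSE = {rmse(P_prod):.4f}, geom. mean RMSE = {rmse(P_geom):.4f}")
```

Note that the product-based blends need no fitted parameter, matching the "w optimisation" column of Table 4.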
5. Conclusions
The muon identification is a hot topic in cosmic-ray research due to its implications for γ-ray observation, multi-messenger studies and Ultra High Energy Physics. Within this context, a new observatory based on indirect methods, like Water Cherenkov Detectors, is being studied. This paper has dealt with the problem of determining the probability of muon presence in the signal recorded by a water Cherenkov detector prototype design. To do so, several machine learning techniques have been compared and new engineered variables have been proposed. The results have shown that the models that behave best, given the characteristics of the problem, are the ones that process the complete signal and incorporate the variables engineered by researchers in Physics and Computer Science, by means of a combination of convolutional layers with dense layers. Although the improvement is small, it is magnified when using an ensemble of models combining CNNs and XGBoost. Therefore, the possibilities that this new methodology opens for identifying Extensive Air Showers induced by γ-rays are wide, and it allows evaluating different detector designs before deploying the future observatory. Future studies will concentrate on considering other convolutions for the CNN and on methods that do not suffer from overfitting as much as XGBoost does with these data sets.

[Figure 4(a): ROC curves for the test data set using the single models: CNN, AUC = 0.954; XGB, AUC = 0.905.]
[Figure 4(b): ROC curves for the test data set using the ensembles: 0.2323·P_CNN + 0.7677·P_XGB, AUC = 0.935; 0.1833·P_CNN + 0.8167·P_XGB, AUC = 0.932; P_CNN · P_XGB, AUC = 0.953; √P_CNN + √P_XGB, AUC = 0.949; √(P_CNN · P_XGB), AUC = 0.953.]

Figure 4: ROC curves and AUC for the test data set: (a) single models, (b) ensembles.

Accuracy
Algorithm | Threshold | Train | Train (Original, imbalanced) | Validation | Test
CNN | 0.27 | 94.72 | 94.04 | 92.70 | 92.73
XGBoost | 0.83 | 99.67 | 99.59 | 94.60 | 94.48
CNN · XGBoost | 0.18 | 96.03 | 97.74 | 94.68 | 94.66

F1-score

Algorithm | Threshold | Train | Train (Original, imbalanced) | Validation | Test
CNN | 0.27 | 92.39 | 60.81 | 52.16 | 52.36
XGBoost | 0.83 | 99.51 | 95.95 | 27.86 | 27.89
CNN · XGBoost | 0.16 | 93.92 | 78.54 | 50.14 | 50.44

Table 5: Maximum values for two error measures when classifying the appearance of muons in the WCDs. These are the best values obtained over all possible thresholds used to determine the class separation.
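The metric computations behind Figure 4 and Table 5 can be sketched with scikit-learn. Here `y` and `p_mu` are synthetic stand-ins for the tank labels and one model's predicted muon probability; the real analysis applies the same ROC/threshold scan to the CNN, XGBoost and ensemble outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, accuracy_score, f1_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)                       # 1 = muon in the tank
# Synthetic probability scores with some class separation.
p_mu = np.clip(y * 0.6 + rng.normal(0.3, 0.25, 2000), 0.0, 1.0)

# ROC curve: one (FPR, TPR) point per threshold; AUC summarises the model.
fpr, tpr, _ = roc_curve(y, p_mu)
print(f"AUC = {auc(fpr, tpr):.3f}")

# Threshold scan: keep the cut that maximises each metric (cf. Table 5).
thresholds = np.linspace(0.01, 0.99, 99)
acc = np.array([accuracy_score(y, p_mu >= t) for t in thresholds])
f1 = np.array([f1_score(y, p_mu >= t) for t in thresholds])
print(f"best accuracy {acc.max():.3f} at t = {thresholds[acc.argmax()]:.2f}")
print(f"best F1-score {f1.max():.3f} at t = {thresholds[f1.argmax()]:.2f}")
```

Because accuracy and F1-score peak at different cuts on imbalanced data, the scan can return two different thresholds for the same model, as in Table 5.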
Accuracy

Algorithm | Threshold | Train | Train (Original) | Validation | Test
CNN | 0.96 | 70.70 | 95.59 | 95.61 | 95.53
XGBoost | 0.94 | 95.69 | 99.14 | 95.04 | 94.97
CNN · XGBoost | 0.82 | 87.05 | 98.04 | 95.47 | 95.48

F1-score

Algorithm | Threshold | Train | Train (Original) | Validation | Test
CNN | 0.96 | 22.13 | 21.26 | 22.39 | 20.71
XGBoost | 0.94 | 93.10 | 90.47 | 16.10 | 16.41
CNN · XGBoost | 0.82 | 75.94 | 75.08 | 21.67 | 23.04

Ad-hoc criterion

Algorithm | Threshold | Train (T_µ / T_γ) | Train (Original) (T_µ / T_γ) | Validation (T_µ / T_γ) | Test (T_µ / T_γ)
CNN | 0.96 | 12.49 / 99.81 | 12.35 / 99.81 | 13.13 / 99.79 | 11.99 / 99.81
XGBoost | 0.94 | 87.25 / 99.91 | 84.11 / 99.91 | 9.86 / 99.36 | 10.14 / 99.31
CNN · XGBoost | 0.82 | 61.33 / 99.90 | 61.24 / 99.90 | 12.99 / 99.65 | 13.89 / 99.66

Table 6: Error measures (accuracy and F1-score) and percentage of correct classification (ad-hoc criterion) when classifying the appearance of muons in the WCDs. These are the best values that ensure a minimum T_γ rate of ∼99.9% on the balanced Train data set.
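The ad-hoc criterion of Table 6 (fix the cut so that ∼99.9% of the muon-free tanks T_γ are kept, then read off the fraction of muon tanks T_µ still identified) can be sketched as follows. The labels and scores below are synthetic stand-ins, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=20000)                  # 1 = muon in the tank
# Synthetic classifier scores: higher means "more muon-like".
score = y * 1.2 + rng.normal(0.0, 0.5, size=20000)

target = 0.999                                      # required T_gamma rate
# Cut = the `target` quantile of the muon-free (background) scores.
cut = np.quantile(score[y == 0], target)
t_gamma = float(np.mean(score[y == 0] < cut))       # muon-free tanks kept
t_mu = float(np.mean(score[y == 1] >= cut))         # muon tanks still caught
print(f"threshold = {cut:.3f}, T_gamma = {t_gamma:.4f}, T_mu = {t_mu:.4f}")
```

With this criterion the threshold is driven entirely by the background distribution, which is why Table 6 re-reports accuracy and F1-score at these new, much higher cuts.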
Acknowledgements
We would like to thank A. Bueno for all the support and useful discussions during the development of this work. The authors also thank OE - Portugal, FCT, I.P., for the financial support under project PTDC/FIS-PAR/29158/2017. R. C. is grateful for the financial support by OE - Portugal, FCT, I.P., under DL57/2016/cP1330/cT0002. B.S.G. is grateful for the financial support by grant LIP/BI - 14/2020, under project IC&DT, POCI-01-0145-FEDER-029158.
References

[1] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L. Jackel, Handwritten digit recognition with a back-propagation network, Advances in Neural Information Processing Systems 2 (1989) 396–404.
[2] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, D. J. Inman, 1D convolutional neural networks and applications: A survey, Mechanical Systems and Signal Processing 151 (2021) 107398. doi:10.1016/j.ymssp.2020.107398.
[3] L. Erhan, M. Ndubuaku, M. Di Mauro, W. Song, M. Chen, G. Fortino, O. Bagdasar, A. Liotta, Smart anomaly detection in sensor systems: A multi-perspective review, Information Fusion 67 (2021) 64–79. doi:10.1016/j.inffus.2020.10.001.
[4] A. Peimankar, S. Puthusserypady, DENS-ECG: A deep learning approach for ECG signal delineation, Expert Systems with Applications 165 (2021). doi:10.1016/j.eswa.2020.113911.
[5] M. Manzano, A. Guillén, I. Rojas, L. J. Herrera, Deep learning using EEG data in time and frequency domains for sleep stage classification, in: I. Rojas, G. Joya, A. Català (Eds.), Advances in Computational Intelligence - 14th International Work-Conference on Artificial Neural Networks, IWANN 2017, Cadiz, Spain, June 14-16, 2017, Proceedings, Part I, Vol. 10305 of Lecture Notes in Computer Science, Springer, 2017, pp. 132–141. doi:10.1007/978-3-319-59153-7_12.
[6] A. Guillén, A. Bueno, J. Carceller, J. Martínez-Velázquez, G. Rubio, C. T. Peixoto, P. Sanchez-Lucas, Deep learning techniques applied to the physics of extensive air showers, Astroparticle Physics 111 (2019) 12–22. doi:10.1016/j.astropartphys.2019.03.001.
[7] A. Guillén, J. Martínez, J. M. Carceller, L. J. Herrera, A comparative analysis of machine learning techniques for muon count in UHECR extensive air-showers, Entropy 22 (11) (2020). doi:10.3390/e22111216.
[8] T. Capistrán, I. Torres, L. Altamirano, New method for gamma/hadron separation in HAWC using neural networks, arXiv preprint arXiv:1508.04370 (2015).
[9] N. Choma, F. Monti, L. Gerhardt, T. Palczewski, Z. Ronaghi, P. Prabhat, W. Bhimji, M. Bronstein, S. Klein, J. Bruna, Graph neural networks for IceCube signal classification, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018, pp. 386–391.
[10] A. De Angelis, M. Pimenta, Introduction to particle and astroparticle physics: multimessenger astronomy and its particle physics foundations, Springer, 2018.
[11] P. Assis, R. Conceição, M. Pimenta, B. Tomé, A. Blanco, P. Fonte, L. Lopes, U. B. de Almeida, R. Shellard, B. D. Piazzoli, et al., LATTES: a novel detector concept for a gamma-ray experiment in the southern hemisphere.
[12] A. Zúñiga-Reyes, A. Hernández, A. Miranda-Aguilar, A. Sandoval, J. Martínez-Castro, R. Alfaro, E. Belmont, H. León, A. P. Vizcaya, Detection of vertical muons with the HAWC water Cherenkov detectors and its application to gamma/hadron discrimination, arXiv preprint arXiv:1708.09500 (2017).
[13] A. Barber, D. Kieda, W. Springer, HAWC Collaboration, et al., Detection of near horizontal muons with the HAWC observatory, in: ICRC, Vol. 301, 2017, p. 512.
[14] D. Heck, J. Knapp, J. Capdevielle, G. Schatz, T. Thouw, A Monte Carlo code to simulate extensive air showers, Report FZKA 6019 (1998).
[15] S. Agostinelli, J. Allison, K. Amako, J. Apostolakis, H. Araujo, P. Arce, M. Asai, D. Axen, S. Banerjee, G. Barrand, et al., Geant4—a simulation toolkit, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 506 (3) (2003) 250–303.
[16] IEEE Transactions on Nuclear Science 53, No. 1 (2006) 270–278.
[17] Nuclear Instruments and Methods in Physics Research A 835 (2016) 186–225.
[18] Southern Wide-field Gamma-ray Observatory (SWGO), swgo.org.
[19] B. S. González, R. Conceição, B. Tomé, M. Pimenta, L. J. Herrera, A. Guillén, Using convolutional neural networks for muon detection in WCD tank, Journal of Physics: Conference Series 1603 (2020) 012024. doi:10.1088/1742-6596/1603/1/012024.
[20] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, J. Artif. Int. Res. 16 (1) (2002) 321–357.
[21] A. Guillén, C. Todero, J. C. Martínez, L. J. Herrera, A preliminary approach to composition classification of ultra-high energy cosmic rays, in: International Conference on Applied Physics, System Science and Computers, Springer, 2018, pp. 196–202.
[22] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, D. J. Inman, 1D convolutional neural networks and applications: A survey, arXiv preprint arXiv:1905.03554 (2019).
[23] F. Carrillo-Perez, L. J. Herrera, J. M. Carceller, A. Guillén, Improving classification of ultra-high energy cosmic rays using spacial locality by means of a convolutional DNN, in: International Work-Conference on Artificial Neural Networks, Springer, 2019, pp. 222–232.
[24] M. Manzano, A. Guillén, I. Rojas, L. J. Herrera, Combination of EEG data time and frequency representations in deep networks for sleep stage classification, in: International Conference on Intelligent Computing, Springer, 2017, pp. 219–229.
[25] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[26] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[27] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[28] J. H. Friedman, Greedy function approximation: a gradient boosting machine, Annals of Statistics (2001) 1189–1232.
[29] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[30] M. Shahhosseini, G. Hu, H. Pham, Optimizing ensemble weights and hyperparameters of machine learning models for regression problems, arXiv preprint arXiv:1908.05287 (2019).
[31] R. Storn, K. Price, Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces, Journal of Global Optimization 11 (4) (1997) 341–359.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[33] T. E. Oliphant, Python for scientific computing, Computing in Science & Engineering 9 (3) (2007) 10–20. doi:10.1109/MCSE.2007.58.
[34] F. Chollet, et al., Keras, https://github.com/fchollet/keras (2015).
[35] xgboost developers, XGBoost Python Package, https://xgboost.readthedocs.io/en/latest/python/index.html (2020).