A density-based clustering algorithm for the CYGNO data analysis

Abstract

Time Projection Chambers (TPCs) working in combination with Gas Electron Multipliers (GEMs) produce a very sensitive detector capable of observing low energy events. This is achieved by capturing photons generated during the GEM electron multiplication process by means of a high-resolution camera. The CYGNO experiment has recently developed a TPC Triple GEM detector coupled to a low noise and high spatial resolution CMOS sensor. For the image analysis, an algorithm based on an adapted version of the well-known DBSCAN was implemented, called iDBSCAN. In this paper a description of the iDBSCAN algorithm is given, including test and validation of its parameters, and a comparison with DBSCAN itself and a widely used algorithm known as Nearest Neighbor Clustering (NNC). The results show that the adapted version of DBSCAN is capable of providing full signal detection efficiency and very good energy resolution while improving the detector background rejection.


Prepared for submission to JINST


E. Baracchini,a,b L. Benussi,c S. Bianco,c C. Capoccia,c M. Caponero,c,d G. Cavoto,e,f A. Cortez,a,b I. A. Costa,g E. Di Marco,e G. D'Imperio,e G. Dho,a,b F. Iacoangeli,e G. Maccarrone,c M. Marafini,e,h G. Mazzitelli,c A. Messina,e,f R. A. Nobrega,g A. Orlandi,c E. Paoletti,c L. Passamonti,c F. Petrucci,i,j D. Piccolo,c D. Pierluigi,c D. Pinci,e F. Renga,e F. Rosatelli,c A. Russo,c G. Saviano,c,k and S. Tomassini c

a Gran Sasso Science Institute, L'Aquila, I-67100, Italy
b Istituto Nazionale di Fisica Nucleare, Laboratori Nazionali del Gran Sasso, Assergi, Italy
c Istituto Nazionale di Fisica Nucleare, Laboratori Nazionali di Frascati, I-00044, Italy
d ENEA Centro Ricerche Frascati, Frascati, Italy
e Istituto Nazionale di Fisica Nucleare, Sezione di Roma, I-00185, Italy
f Dipartimento di Fisica, Sapienza Università di Roma, I-00185, Italy
g Universidade Federal de Juiz de Fora, Juiz de Fora, Brasil
h Museo Storico della Fisica e Centro Studi e Ricerche "Enrico Fermi", Piazza del Viminale 1, Roma, I-00184, Italy
i Dipartimento di Matematica e Fisica, Università Roma TRE, Roma, Italy
j Istituto Nazionale di Fisica Nucleare, Sezione di Roma TRE, Roma, Italy
k Dipartimento di Ingegneria Chimica, Materiali e Ambiente, Sapienza Università di Roma, Roma, Italy

E-mail: [email protected]


Introduction

Clustering analysis is a widely used unsupervised technique to organize datasets into groups based on their similarities. One of the best-known algorithms is the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [1]. Given a set of elements distributed over a hyper-plane, DBSCAN looks for areas of high density to form clusters. The density is calculated from the number of elements within a pre-defined hyper-sphere. The generalization power of DBSCAN and its simplicity, which make it a very attractive algorithm, can be understood in terms of its two parameters: the radius of the hyper-sphere (ε), which is applied around each element to count its neighboring elements, and the minimum number of points inside each hyper-sphere (N_min), used to decide whether those elements should make up a cluster.

To fulfill the needs of the CYGNO experiment, a detector-specific algorithm based on DBSCAN has been developed. Within the context of the experiment, a detection apparatus has been developed [2–7] with an optical readout system based on a high-resolution, low-noise CMOS sensor, capable of providing images of the tracks produced by interacting particles that release energies in the range of a few keV. This modified version of DBSCAN, called intensity-based DBSCAN or simply iDBSCAN, has been shown to improve detector performance when compared to the previously used algorithm, based on the Nearest Neighbor Clustering (NNC) technique [8]. This paper studies the impact of iDBSCAN, compared to NNC and DBSCAN, on two crucial detector parameters, background rejection and energy resolution, measured in the energy range of a few keV. For this purpose, low energy particles (5.9 keV photons) produced by a ⁵⁵Fe radioactive source, background from natural radioactivity and data with electronics noise only were employed.

LEMOn (Large Elliptical MOdule) [9] is the most recent prototype of the CYGNO experiment. Its core consists of a 7 liter active drift volume surrounded by an elliptical field cage (20 × […] × 24 cm³) and a 20 × 24 cm² Triple GEM structure, whose photons are read out by an Orca Flash 4 CMOS-based camera [10] placed at a distance of 52.[…] […]/CF gas mixture in the proportion of 60/40 and a ⁵⁵Fe source with an activity of about 740 MBq was used. For operation, electric fields are applied to the TPC drift volume and between the GEMs. They are called drift field (E_d) and transfer field (E_t), respectively. The typical operating conditions of the detector, as used in this work, are: E_d = 500 V/cm, E_t = 2.5 kV/cm, and a voltage difference across the GEM sides (V_GEM) of 460 V.

Figure 1. Drawing of the experimental setup. In particular, the elliptical field cage, closed on one side by the triple-GEM structure and on the other side by the semitransparent cathode (A), the PMT (B), the adaptable bellow (C) and the CMOS camera with its lens (D) are visible. The sliding external ⁵⁵Fe source, positioned close to the TPC, is also drawn.

Data were acquired in auto-trigger mode. For the study presented in this document, three different acquisition datasets were used, as listed below:
• Electronic noise (EN) dataset: produced by lowering V_GEM to 260 V, a value at which the multiplication process is forbidden (6478 images recorded);
• Natural radioactivity (NRAD) dataset (composed of cosmic rays and environmental radioactivity): produced by exposing the camera lens, turning on the detector power supplies and raising V_GEM to the nominal value of 460 V to allow charge multiplication and secondary light emission during this process (864 images recorded);
• Electron Recoils (ER) dataset: the same as the previous item, but with a ⁵⁵Fe source placed near the detector drift volume (864 images recorded).

Based on the acquisition datasets defined in section 1.2, particles interacting with the detector gas can have two distinct origins: the ⁵⁵Fe source and natural radioactivity. The former releases 5.9 keV photons, which produce round spots on the image, while the latter can be composed of several different particles, such as photons, electrons and muons. Typical signals are shown in Fig. 2: three interactions of ⁵⁵Fe photons in the top left image; two low-energy electrons in the bottom left image; and two high-energy particles (likely cosmic ray muons) and, between them, two interactions of ⁵⁵Fe photons in the right image.

Figure 2. Examples of signals that can occur using the described configuration.

In this work, the signals of interest are those generated by the 5.9 keV photons, which are used to assess the impact of the proposed clustering algorithms on the detector characteristics, focusing mainly on its energy resolution and background-event rejection performance in the energy range of a few keV.

The acquisition system provides images with 2048 × 2048 pixels […] and each pixel has a size of 6.5 µm × 6.5 µm. The camera's exposure time was set to 40 ms and it covers an area of 26 × 26 cm² in relation to the plane of the last layer of the GEM detector. Each pixel provides a response, here called intensity, proportional to the number of collected photons [6] added to a baseline, also known as pedestal, which can be defined as the intensity value corresponding to zero photons. Specifically, the average pedestal value of the sensor is about 99 counts; however, it can vary from pixel to pixel. The noise level is another important parameter that can vary from pixel to pixel. These effects can be seen in Fig. 3, which shows the distributions of the mean and standard deviation of the noise as computed for each pixel with the EN dataset. To account for such variations, both the pixel baseline (µ_i) and its average noise (σ_i), calculated as the standard deviation of the pedestal distribution, are estimated for every single pixel i before running the event reconstruction procedure.
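As an illustration of the per-pixel estimate described above, the following is a minimal sketch, assuming the noise-only images are available as a stack of 2D arrays. The function name `estimate_pedestal` and the synthetic frames are hypothetical and are not taken from the CYGNO reconstruction code; the small 8 × 8 image merely stands in for the full sensor.

```python
import numpy as np

def estimate_pedestal(noise_frames):
    """Per-pixel baseline mu_i and noise sigma_i from noise-only frames.

    noise_frames: array of shape (n_frames, height, width).
    """
    frames = np.asarray(noise_frames, dtype=float)
    mu = frames.mean(axis=0)     # pedestal: mean intensity with zero photons
    sigma = frames.std(axis=0)   # noise: spread of the pedestal distribution
    return mu, sigma

# Synthetic stand-in for an EN-like dataset (illustrative values only).
rng = np.random.default_rng(0)
true_mu = 99.0 + rng.normal(0.0, 1.0, size=(8, 8))
frames = true_mu + rng.normal(0.0, 2.5, size=(500, 8, 8))
mu, sigma = estimate_pedestal(frames)
```

Computing both maps once, before reconstruction, lets every later step (pedestal subtraction and the σ_i-based threshold) work pixel by pixel without revisiting the noise data.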

[Figure 3 panels: distribution of the pixel intensity mean (entries 4194304, mean 99.59, RMS 17.68) and distribution of the pixel intensity standard deviation (entries 4194304, mean 2.50, RMS 2.16).]

Figure 3. Distributions of the mean and standard deviation of the sensor pixel noise.

The current CYGNO event-reconstruction algorithm is represented in the flowchart shown in Fig. 4 and is described below.

Figure 4. Flowchart of the CYGNO event-reconstruction algorithm.

1. Pedestal subtraction is carried out pixel by pixel by subtracting µ_i from the original intensity values, generating new intensity values defined as I_i.
2. Lower and upper thresholds are applied to I_i. The upper limit is set to 100 counts, while the lower limit is set to 1.3 times σ_i. The upper limit removes pixels with too large an intensity, produced mainly by leakage currents that go into sensor wells (also known as hot pixels). The lower limit was optimized and set just above the noise level to ensure a good detection efficiency, but not so low as to overload the event-reconstruction algorithm with pixels dominated by noise. Pixels outside those limits have their intensities reset to zero.
3. Images are then rescaled to 512 × 512 pixels, for CPU reasons, so that each 4 × 4 […] I_i of the 16 pixels occupying the same area of the sensor.
4. The rescaled image then goes through a filtering stage based on a 4 × […] w, g(x, y), as given by Equation 2.1 [12], where f(x, y) is the intensity of the macro-pixel (x, y).

$$ g(x, y) = \mathrm{median}\{ f(x, y),\ (x, y) \in w \} \qquad (2.1) $$

Such a filter is widely used in many applications due to its effective noise suppression capability and high computational efficiency [13]. Tests performed on the EN dataset (see section 1.2) showed that this filter is able to reduce the number of noise pixels sent to the clustering algorithm by a factor of 3.07 ± […]
5. […] I_i values are sent to the clustering algorithm, whose output is used to extract cluster features such as integrated light, length and width, computed over the full-resolution image. Those features are then used to select events of interest.

In this work three features, extracted from the clusters, are used:
• Length and width: the full lengths of the major and minor axes along the two eigenvectors of the (X,Y) pixel matrix, in the context of Principal Component Analysis [14], are assigned as the length and width of the cluster, respectively.
• Cluster light: calculated as the sum of the I_i intensities of all the pixels belonging to the cluster.

As mentioned before, prior to iDBSCAN, the CYGNO clustering algorithm was based on the widely employed NNC method. Basically, it groups neighboring pixels that passed a selection similar to the one in step 3. A detector performance study using this method was presented in [8]. To understand the advantages of using iDBSCAN, in addition to the comparison with NNC, the performance achieved with the DBSCAN algorithm will also be presented.

2.3 The CYGNO intensity-based clustering algorithm

2.3.1 iDBSCAN

As in many areas, in particle physics it is possible to insert a priori knowledge about the detection system and its data to improve the performance of the clustering task [15].
In this sense, a modification of the DBSCAN [16] clustering algorithm was implemented to better match the experimental conditions and data of the LEMOn detector. As mentioned before, DBSCAN has only two parameters: ε and N_min. Whenever the number of neighboring elements inside a hyper-sphere reaches the N_min value, the center element and all its neighbors are activated to start the formation of a cluster. Then, the same process is applied to all the neighboring elements in order to expand the starting cluster into a final cluster. This process is repeated for all the data elements. To be applied to CYGNO, instead of using the number of elements as the parameter that decides whether the elements inside a hyper-sphere are part of a cluster, the sum of their intensity values is used. Consequently, N_min becomes a parameter related to the total intensity within a hyper-sphere instead of the number of elements. Therefore, rather than having each pixel counted as a unit when computing the number of pixels inside a given hyper-sphere, each pixel counts I_i times. If the total intensity is equal to or greater than a certain value (N_min), the pixels are considered part of a cluster. During the development of the iDBSCAN algorithm, many ε and N_min values were tested, converging to values around 5.8 and 30, respectively, which will be validated in section 2.3.2. Additionally, to make iDBSCAN more robust against electronic noise and intensity spikes, a cluster is required to have more than two macro-pixels, otherwise it is discarded. This same operation is also applied to NNC and DBSCAN.

The CYGNO Collaboration is currently using iDBSCAN as the clustering method in its event reconstruction. The iDBSCAN performance for signals produced by the interactions of photons from ⁵⁵Fe has been studied as a function of different values of its parameters: ε and N_min.
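To make the roles of ε and N_min concrete, the intensity-weighted density criterion can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the collaboration's implementation: the function name `idbscan`, the brute-force distance matrix and the toy pixel list are all hypothetical. The only change with respect to plain DBSCAN is the core test, which sums intensities instead of counting neighbors.

```python
import numpy as np

def idbscan(points, intensities, eps, n_min):
    """Intensity-weighted DBSCAN sketch. Returns labels (-1 = discarded)."""
    pts = np.asarray(points, dtype=float)
    w = np.asarray(intensities, dtype=float)
    n = len(pts)
    # Brute-force pairwise distances: fine for a sketch, quadratic in memory.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    # Core test: summed intensity inside the eps-ball, not the pixel count.
    is_core = np.array([w[nb].sum() >= n_min for nb in neighbours])

    labels = np.full(n, -1)
    cluster = 0
    for seed in range(n):
        if labels[seed] != -1 or not is_core[seed]:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border pixels join a cluster but do not expand it
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1

    # Discard clusters with two or fewer macro-pixels, as required in the text.
    for c in range(cluster):
        members = labels == c
        if members.sum() <= 2:
            labels[members] = -1
    return labels

# Toy image: one bright 3 x 3 spot plus three isolated dim pixels.
spot = [(x, y) for x in (10, 11, 12) for y in (10, 11, 12)]
stray = [(0, 0), (30, 30), (50, 5)]
points = spot + stray
intensities = [10.0] * 9 + [2.0, 3.0, 2.0]
labels = idbscan(points, intensities, eps=5.8, n_min=30)
```

In this toy case the nine spot pixels end up in a single cluster, while the isolated dim pixels never reach the intensity threshold and stay unlabelled. For reference, scikit-learn's `DBSCAN.fit` accepts a `sample_weight` argument that applies a similar weighted core criterion; whether the CYGNO code relies on it is not stated here.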
In order to evaluate those values, a test on the detector efficiency and background rejection was carried out: a scan over the two iDBSCAN parameters has been performed. While the ε (N_min) parameter is fixed to a value of 5.8 (30), the other parameter's value is swept from 5 to 50 (4 to 10). Figure 5 (left) shows the total number of clusters found as a function of ε for two distinct datasets: ER and NRAD. For low ε values the number of NRAD clusters tends to increase, indicating an increase of background contamination. However, for ε values between 5 and 7, this contamination rate stabilizes around a minimum value. Figure 5 (right) shows the same trend when counting only clusters with an integral in the range 2000–4000 photons, characteristic of ⁵⁵Fe deposits. This range corresponds to the energy region of the electron recoils produced by ⁵⁵Fe (see Fig. 9).

Similarly, a scan over the N_min parameter has been performed, as shown in Fig. 6. Applying the same logic as for the ε parameter, the plot on the left indicates a low contamination region for N_min values between 20 and 40, and the right plot a region for N_min ≤ 30. In both cases, when stable, the difference between the results indicates a number of ⁵⁵Fe clusters of about 280.

Finally, the energy resolution has also been measured as a function of the iDBSCAN parameters. Values around 12.2% have been measured for all the ε and N_min values considered, with negligible variation. Section 3.3 provides details about the energy resolution measurement.

Figure 5. Total number of reconstructed clusters (left) and number of clusters in the ⁵⁵Fe peak region (right) as a function of ε for the ER and NRAD runs, together with a line for ⁵⁵Fe, which corresponds to ER − NRAD.

Figure 6. Total number of reconstructed clusters (left) and number of clusters in the ⁵⁵Fe peak region (right) as a function of N_min for the ER and NRAD runs, together with a line for ⁵⁵Fe, which corresponds to ER − NRAD.

The same procedure used to choose the iDBSCAN parameters was also applied to DBSCAN. The resulting values for the DBSCAN parameters were 6 for ε and 20 for N_min. It is noteworthy that the value of ε for DBSCAN is very close to the 5.8 found for iDBSCAN, which is coherent since the two-dimensional space is the same for both algorithms. The DBSCAN graphs are not shown here, as they add little information to the work: their characteristics are similar to those presented in Figs. 5 and 6.

3.1 ⁵⁵Fe energy spectra

The well-known energy deposition signature of the 5.9 keV photons emitted by the ⁵⁵Fe source is exploited in order to evaluate the detection efficiency and background rejection of the clustering methods. While the ER dataset will be used for signal characterization, the EN and NRAD datasets will be used for background rejection measurements. The EN data produce low energy clusters with a distribution squeezed into the region below 500 photons, as shown in Fig. 7; NRAD produces an energy distribution widely spread by a heavy tail component, as shown in Fig. 8; and ER forms an additional narrow distribution centered at around 3000 photons, as shown in Fig. 9. In this last case, the energy spectrum is composed of background and ⁵⁵Fe-induced deposits and, therefore, to reconstruct the ⁵⁵Fe energy distribution, the background distribution should be subtracted. All the distributions were generated with the same number of images, 864, except for the iDBSCAN distributions of Fig. 7, which used 6478 images in order to collect enough EN-clusters, which occur at a low rate.
Additionally, the signal purity is enhanced by taking into account the cluster aspect ratio, called slimness, defined as the ratio between the minor axis (width) and the major axis (length) of each cluster.

Figure 7 compares the energy spectra of clusters generated by NNC and DBSCAN with those generated by iDBSCAN for EN events, without and with a selection based on the slimness parameter; in the latter case only clusters with slimness greater than 0.4 are considered. The computed numbers of EN-clusters per image for NNC, DBSCAN and iDBSCAN were 4.[…] ± […].17, 3.[…] ± […].12 and ([…] ± […]) × 10^−[…], respectively. Regarding NNC and DBSCAN, EN-clusters dominate the background rate for energies below 500 photons, which can be seen by comparing the EN energy distribution of Fig. 7 with that of the NRAD dataset shown in Fig. 8. The selection on the slimness variable decreases the number of clusters per image to 3.[…] ± […].14, 2.[…] ± […].09 and ([…] ± […]) × 10^−[…] for NNC, DBSCAN and iDBSCAN, respectively. Therefore, when compared to NNC and DBSCAN, iDBSCAN is able to reduce the number of EN-clusters per image by a factor of ([…] ÷ […]) × […].

[Figure 7 panels: number of clusters versus cluster light (photons) for the electronic noise (EN) dataset. Entries without slimness selection: NNC 3981, DBSCAN 2740, iDBSCAN 6. Entries with slimness selection: NNC 3287, DBSCAN 1879, iDBSCAN 3.]

Figure 7. Cluster energy distributions for NNC, DBSCAN and iDBSCAN applied to the EN dataset, without (left) and with (right) a selection on the slimness.

Figure 8 shows the energy distributions for the NNC, DBSCAN and iDBSCAN clusters using the NRAD dataset, without (left) and with (right) a selection on slimness. iDBSCAN presents a clear peak around 300 photons, while NNC and DBSCAN accumulate clusters at lower energies due to EN-clusters. Compared to NNC, iDBSCAN and DBSCAN reduce the number of background events in the region between 2000 and 4000 photons, where the ⁵⁵Fe events are expected, as mentioned before, providing better background rejection for low energy events such as the 5.9 keV photons. The right panel of Fig. 8 shows the light distribution considering only clusters with slimness greater than 0.4. This selection further reduces the number of background events in the ⁵⁵Fe region, bringing NNC closer to the other methods. However, in the lower energy region the number of fake clusters is only slightly reduced, so iDBSCAN maintains a better background rejection efficiency than NNC and DBSCAN.

[Figure 8 panels: number of clusters versus cluster light (photons) for the natural radioactivity (NRAD) dataset. Entries without slimness selection: NNC 4931, DBSCAN 2918, iDBSCAN 364. Entries with slimness selection: NNC 3951, DBSCAN 1983, iDBSCAN 207.]

Figure 8. Cluster energy distributions for NNC, DBSCAN and iDBSCAN applied to the NRAD dataset, without (left) and with (right) a selection based on the slimness.

Figure 9 shows the results of the same analysis performed on the ER dataset. In this case, the sum of the distribution obtained in the NRAD sample and the one from ⁵⁵Fe interactions is expected. As shown, all three clustering algorithms are sensitive to the 5.9 keV photon events. However, as commented previously, a higher purity level is achieved using iDBSCAN. After applying the slimness threshold, as shown in the right plot of Fig. 9, the distributions of NNC, DBSCAN and iDBSCAN around the ⁵⁵Fe peak get closer, indicating that the three methods have similar detection efficiency, since the number of ⁵⁵Fe spots found by each method is practically the same.

[Figure 9 panels: number of clusters versus cluster light (photons) for the Electron Recoils (ER) dataset. Entries without slimness selection: NNC 5579, DBSCAN 3433, iDBSCAN 668. Entries with slimness selection: NNC 4554, DBSCAN 2410, iDBSCAN 465.]

Figure 9. Cluster energy distributions for NNC, DBSCAN and iDBSCAN applied to the ER dataset, without (left) and with (right) a selection based on the slimness.

3.2 Slimness selection optimization

Figure 10 shows the cumulative slimness distribution of clusters in the interval between 0 and 1 for the NRAD and ER datasets, for NNC, DBSCAN and iDBSCAN. In all cases ⁵⁵Fe spots tend to have slimness higher than about 0.4. This variable can be used in conjunction with the energy measurement to discriminate ⁵⁵Fe spots from background clusters. In this section the slimness threshold is swept in order to choose the most suitable value for its use as an event selection parameter, as well as to evaluate its impact when applied together with the energy measurement.
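The slimness used throughout this section follows from the PCA-based length and width defined earlier. The sketch below is one possible reading of that definition, assuming that "full length" means the extent of the cluster pixels projected onto each principal axis; the function name `cluster_shape` and the two toy clusters are hypothetical.

```python
import numpy as np

def cluster_shape(xy):
    """Length, width and slimness of a cluster from its pixel coordinates."""
    xy = np.asarray(xy, dtype=float)
    centred = xy - xy.mean(axis=0)
    # Eigenvectors of the 2x2 covariance matrix give the principal axes.
    _, vecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    proj = centred @ vecs                       # coordinates along each axis
    extents = proj.max(axis=0) - proj.min(axis=0)
    width, length = np.sort(extents)            # minor axis first, then major
    return length, width, width / length

# Toy clusters: a long thin track and a round spot of radius 5 pixels.
track = [(x, y) for x in range(41) for y in range(3)]
spot = [(x, y) for x in range(-5, 6) for y in range(-5, 6) if x * x + y * y <= 25]
length_t, width_t, slim_t = cluster_shape(track)
length_s, width_s, slim_s = cluster_shape(spot)
```

A track-like cluster gives a slimness well below the 0.4 threshold discussed in this section, while a round spot gives a value close to 1. The sketch assumes clusters with more than one distinct pixel position, so that the length is never zero.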

[Figure 10 panels: cumulative number of clusters versus slimness for the ER and NRAD datasets, one panel each for Nearest Neighbor Clustering (NNC), DBSCAN and iDBSCAN.]

Figure 10. Cumulative distribution of the slimness for NRAD and ER data, for NNC, DBSCAN and iDBSCAN.

In order to evaluate the signal efficiency and purity as a function of the slimness selection for the three algorithms, the number of clusters within the selected ⁵⁵Fe energy region (from 1500 to 4500 photons) was measured for various slimness threshold values (X ≥ x), as shown in Fig. 11 for the NNC, DBSCAN and iDBSCAN algorithms. This figure shows that, for slimness below 0.4, DBSCAN and iDBSCAN find a number of clusters in the ⁵⁵Fe region similar to NNC, given by the difference between the ER and NRAD curves, but with less contamination (NRAD curves).

Considering that the ⁵⁵Fe clusters produce an intensity that follows a Gaussian distribution with an average value of about 3000 photons and standard deviations of 550, 385 and 371 for NNC, DBSCAN and iDBSCAN, respectively (see Fig. 12), more than 99% of the ⁵⁵Fe clusters are selected between 1500 and 4500 photons. On the other hand, for the same region, the subtraction of the natural radioactivity events between the ER and NRAD acquisition runs has a mean value equal to zero but a fluctuation of about 23 (14), 10 (7) and 11 (7) clusters for slimness equal to 0.0 (0.4), for NNC, DBSCAN and iDBSCAN, respectively. Therefore, the dashed line of Fig. 11 is composed mainly of ⁵⁵Fe events plus a few background events produced by the statistical fluctuation that occurs in the process of subtracting natural radioactivity. As can be seen in Fig. 8, DBSCAN and iDBSCAN tend to have less background contamination than NNC, reducing the statistical uncertainty related to the background subtraction. This effect is also shown by the shaded band drawn around the dashed lines of Fig. 11.

Based on the measurements of Fig. 11, the impact of the slimness parameter can be assessed by measuring the relative efficiency (ε_sel) with respect to the bin with the highest content in the ⁵⁵Fe curve (so that for such a bin ε_sel = 100%) and the fraction of fake events (F_evts), as defined below:
• ε_sel: number of clusters found in the ER dataset (nFe) minus the number of clusters found in the NRAD dataset (nRd), divided by the maximum value of the nFe − nRd difference among all slimness values (see Equation 3.1);

$$ \varepsilon_{sel} = \frac{nFe - nRd}{\max(nFe - nRd)} \qquad (3.1) $$

• F_evts: ratio between the number of clusters found in the NRAD dataset (nRd) and the number of clusters found in the ER dataset (nFe) (see Equation 3.2a). This measure can also be understood in terms of background rejection (B_rj), as shown by Equation 3.2b.

$$ F_{evts} = \frac{nRd}{nFe} \quad (a), \qquad B_{rj} = 1 - F_{evts} \quad (b) \qquad (3.2) $$

Figure 11. Scan of the number of clusters in the ⁵⁵Fe peak region (between 1500 and 4500 photons) as the threshold on the slimness is changed, for NRAD and ER data, for NNC, DBSCAN and iDBSCAN.

Figure 10 shows that for slimness below 0.4 the efficiency for background events is very small, while most of the ⁵⁵Fe events are retained. Tables 1 and 2 show, respectively, the computed ε_sel and F_evts for the three clustering methods and different thresholds on the slimness variable, ranging from 0.0 to 0.8. The errors presented in these tables were computed considering a confidence interval of 95% for a binomial proportion [17]. For the high efficiency region (≥ […]

Table 1. ε_sel comparison between iDBSCAN, DBSCAN and NNC.

Slimness (width/length)    ε_sel iDBSCAN     ε_sel DBSCAN    ε_sel NNC
0.0                        1.00 +[…] −[…]    […]             […]
[…]

The last column of Table 2 shows the iDBSCAN background-rejection improvement compared to NNC.
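Equations 3.1 and 3.2 amount to simple arithmetic on the cluster counts in the ⁵⁵Fe energy window. The sketch below evaluates them for a list of slimness thresholds; the function name `selection_metrics` and the counts are hypothetical and are not the values of Tables 1 and 2.

```python
def selection_metrics(n_fe, n_rd):
    """Relative efficiency, fake-event fraction and background rejection.

    n_fe, n_rd: cluster counts in the 55Fe energy window for the ER and
    NRAD datasets, one entry per slimness threshold.
    """
    signal = [fe - rd for fe, rd in zip(n_fe, n_rd)]
    best = max(signal)
    eff = [s / best for s in signal]                  # Equation 3.1
    fake = [rd / fe for fe, rd in zip(n_fe, n_rd)]    # Equation 3.2a
    rej = [1.0 - f for f in fake]                     # Equation 3.2b
    return eff, fake, rej

# Hypothetical counts for three slimness thresholds (not from the paper).
n_fe = [340, 300, 120]
n_rd = [60, 24, 6]
eff, fake, rej = selection_metrics(n_fe, n_rd)
```

Because ε_sel is normalized to the threshold with the largest nFe − nRd difference, it equals 1 there by construction and can only decrease as the selection tightens.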
For slimness equal to 0.4, for example, iDBSCAN has a background rejection efficiency of 92% while NNC has 75%, leading to a relative improvement of (92 − 75)/75 ≈ 23%.

Table 2. F_evts comparison between iDBSCAN, DBSCAN and NNC.

Slimness (width/length)    F_evts iDBSCAN    F_evts DBSCAN    F_evts NNC    iDBSCAN B_rj variation (%) vs DBSCAN    vs NNC
0.0                        0.18 +[…] −[…]    […]              […]           −3.4                                    […]
[…]
[Remaining rows are not recoverable from the source; the surviving B_rj variation values are −3.5, −0.4 and −1.0.]

3.3 Light Yield Resolution

The detector energy resolution was estimated by a fit to the cluster energy distributions accounting for natural radioactivity and the ⁵⁵Fe events. The former was modeled by an exponential function and the latter by a Polya function [18]:

$$ P(n) = \frac{1}{b\bar{n}} \frac{1}{k!} \left( \frac{n}{b\bar{n}} \right)^{k} e^{-n/(b\bar{n})} \qquad (3.3) $$

where b is a free parameter and k = 1/b − 1. The distribution has n̄ as expected value, while the variance is governed by n̄ and the b parameter, as follows: σ² = n̄(1 + b n̄). The total likelihood is given by the sum of the two functions.

Figure 12 shows the fit results for NNC, DBSCAN and iDBSCAN clusters without applying any selection on the slimness parameter. Based on the computed values, the energy resolutions were measured to be (18.1 ± […] mean and sigma parameters shown in Fig. 12. The former is the mean divided by 5.9 keV (ER energy), while the latter is given by dividing the sigma by the mean.

Figure 12. Results of the fit applied to the NNC, DBSCAN and iDBSCAN energy distributions.
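The Polya function of Equation 3.3 can be evaluated directly, with the factorial generalized to non-integer k through the gamma function. The sketch below is illustrative only: the names `polya_pdf` and `moments` and the values n̄ = 3000 and b = 0.015 are assumptions chosen to resemble the ⁵⁵Fe peak, not fit results from the paper.

```python
import math

def polya_pdf(n, n_mean, b):
    """Polya density of Equation 3.3, with k = 1/b - 1."""
    k = 1.0 / b - 1.0
    x = n / (b * n_mean)
    # lgamma(k + 1) generalises log(k!) to non-integer k.
    log_p = k * math.log(x) - x - math.lgamma(k + 1.0) - math.log(b * n_mean)
    return math.exp(log_p)

def moments(n_mean, b, n_max=12000.0, steps=24000):
    """Normalisation, mean and variance from a simple midpoint sum."""
    h = n_max / steps
    grid = [(i + 0.5) * h for i in range(steps)]
    p = [polya_pdf(n, n_mean, b) for n in grid]
    norm = sum(p) * h
    mean = sum(n * pi for n, pi in zip(grid, p)) * h / norm
    var = sum((n - mean) ** 2 * pi for n, pi in zip(grid, p)) * h / norm
    return norm, mean, var

norm, mean, var = moments(n_mean=3000.0, b=0.015)
resolution = math.sqrt(var) / mean   # sigma divided by the mean
```

For the continuous form written in Equation 3.3, the numerical variance comes out as b·n̄², which is close to the quoted n̄(1 + b·n̄) whenever b·n̄ ≫ 1, as is the case for these illustrative values.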

Figure 13 shows the fit results when considering only clusters with slimness greater than 0.4. The estimated energy resolutions are 13.7 ± […] ⁵⁵Fe clusters for these two methods

Figure 13. Results of the fit applied to the NNC, DBSCAN and iDBSCAN energy distributions for clusters with slimness higher than 0.4.

Table 3. Detector resolution comparison between NNC, DBSCAN and iDBSCAN as a function of slimness.

Slimness (width/length)    Resolution (%) iDBSCAN    Resolution (%) DBSCAN    Resolution (%) NNC
0.0                        12.2 ± […]                […]                      […]
[…]

Summary

An adapted version of DBSCAN, named intensity-based DBSCAN (iDBSCAN), has recently been developed and tested on data acquired with a CYGNO TPC prototype. The impact of this algorithm on the detector performance has been studied using 5.9 keV photons from a ⁵⁵Fe radioactive source and compared with results obtained with the standard DBSCAN and NNC algorithms. The iDBSCAN parameters were optimized for the running conditions of LEMOn, which uses a 4M pixel sCMOS camera, and for signals from ⁵⁵Fe photons. The results showed that, with iDBSCAN, the clustering process of the CYGNO event-reconstruction algorithm can achieve, without any other event-selection routine, a natural radioactivity background rejection in the energy region around 5.9 keV (from 3.0 keV to 8.8 keV) of 0.82 +[…] −[…] and a number of electronic-noise clusters per image of ([…] ± […]) × 10^−[…], occurring predominantly in the region below 1 keV (≈ 500 photons). Compared to NNC, these results represent an enhancement of 57% for the former and, for the latter, an improvement by a factor of a few thousand. Compared to DBSCAN, iDBSCAN obtained similar performance regarding background rejection in the ⁵⁵Fe energy region, but it significantly reduced the number of electronic noise clusters: DBSCAN was not as efficient as iDBSCAN in reducing the effects of electronic noise. Finally, the detector energy resolution using iDBSCAN was measured to be (12.2 ± […] ([…] ± […]) × 10^−[…], a natural radioactive background rejection of 0.92 +[…] −[…] and an energy resolution of (11.8 ± […]

Acknowledgments

This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No 818744) and also by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

References

[1] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pp. 226–231, AAAI Press, 1996, http://dl.acm.org/citation.cfm?id=3001460.3001507.
[2] L. M. S. Margato, F. A. F. Fraga, S. T. G. Fetal, M. M. F. R. Fraga, E. F. S. Balau, A. Blanco et al., Performance of an optical readout GEM-based TPC, Nucl. Instrum. Meth. A535 (2004) 231.
[3] C. M. B. Monteiro, A. S. Conceicao, F. D. Amaro, J. M. Maia, A. C. S. S. M. Bento, L. F. R. Ferreira et al., Secondary scintillation yield from gaseous micropattern electron multipliers in direct dark matter detection, Phys. Lett. B677 (2009) 133.
[4] C. M. B. Monteiro, L. M. P. Fernandes, J. F. C. A. Veloso, C. A. B. Oliveira and J. M. F. dos Santos, Secondary scintillation yield from GEM and THGEM gaseous electron multipliers for direct dark matter search, Phys. Lett. B714 (2012) 18.
[5] A. Bondar, A. Buzulutskov, A. Grebenuk, A. Sokolov, D. Akimov, I. Alexandrov et al., Direct observation of avalanche scintillations in a THGEM-based two-phase Ar avalanche detector using Geiger-mode APD, JINST (2010) P08002.
[6] M. Marafini, V. Patera, D. Pinci, A. Sarti, A. Sciubba and E. Spiriti, ORANGE: A high sensitivity particle tracker based on optically read out GEM, Nucl. Instrum. Meth. A845 (2017) 285.
[7] V. C. Antochi, E. Baracchini, G. Cavoto, E. D. Marco, M. Marafini, G. Mazzitelli et al., Combined readout of a triple-GEM detector, JINST (2018) P05001.
[8] I. A. Costa, E. Baracchini, F. Bellini, L. Benussi, S. Bianco, M. Caponero et al., Performance of optically readout GEM-based TPC with a 55Fe source, Journal of Instrumentation (2019) P07011.
[9] G. Mazzitelli, V. C. Antochi, E. Baracchini, G. Cavoto, A. De Stena, E. Di Marco et al., A high resolution TPC based on GEM optical readout, in […], pp. 1–4, Oct, 2017, https://doi.org/10.1109/NSSMIC.2017.8532631.
[10] Hamamatsu, ORCA-Flash4.0 V3 Digital CMOS camera, 2018.
[11] D. Pinci, E. Di Marco, F. Renga, C. Voena, E. Baracchini, G. Mazzitelli et al., Cygnus: development of a high resolution TPC for rare events, PoS EPS-HEP2017 (2017) 077.
[12] G. Lopes, E. Baracchini, F. Bellini, L. Benussi, S. Bianco, G. Cavoto et al., Study of the impact of pre-processing applied to images acquired by the CYGNO experiment, in Pattern Recognition and Image Analysis. IbPRIA 2019. Lecture Notes in Computer Science, vol. 11869, Springer, Cham, (2019), https://doi.org/10.1007/978-3-030-31321-0_45.
[13] R. C. Gonzalez, R. E. Woods et al., Digital image processing, vol. 141, no. 7, Publishing House of Electronics Industry (2002).
[14] I. T. Jolliffe, Principal component analysis, Springer series in statistics (2002).
[15] K. Wagstaff and C. Cardie, Clustering with instance-level constraints, in Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, (San Francisco, CA, USA), pp. 1103–1110, Morgan Kaufmann Publishers Inc., 2000, https://dl.acm.org/doi/10.5555/645529.658275.
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research (2011) 2825.
[17] L. D. Brown, T. T. Cai and A. DasGupta, Interval estimation for a binomial proportion, Statistical Science (2001) 101.
[18] W. Blum, L. Rolandi and W. Riegler, Particle detection with drift chambers, Particle Acceleration and Detection, ISBN = 9783540766834. Springer Science & Business Media, 2008, 10.1007/978-3-540-76684-1.