A neural network classifier for electron identification on the DAMPE experiment
David Droz, Andrii Tykhonov, Xin Wu, Francesca Alemanno, Giovanni Ambrosi, Enrico Catanzani, Margherita Di Santo, Dimitrios Kyratzis, Stephan Zimmer
Prepared for submission to JINST
D. Droz,ᵃ A. Tykhonov,ᵃ X. Wu,ᵃ F. Alemanno,ᵇ,ᶜ G. Ambrosi,ᵈ E. Catanzani,ᵈ,ᵉ M. Di Santo,ᶠ,ᵍ D. Kyratzis,ᵇ,ᶜ S. Zimmerᵃ

ᵃ University of Geneva, CH-1205 Geneva, Switzerland
ᵇ Gran Sasso Science Institute (GSSI), Via Iacobucci 2, I-67100 L'Aquila, Italy
ᶜ Istituto Nazionale di Fisica Nucleare (INFN) - Laboratori Nazionali del Gran Sasso, I-67100 Assergi, L'Aquila, Italy
ᵈ Istituto Nazionale di Fisica Nucleare (INFN) - Sezione di Perugia, I-06123 Perugia, Italy
ᵉ Dipartimento di Fisica e Geologia, Università di Perugia, I-06123 Perugia, Italy
ᶠ Dipartimento di Matematica e Fisica "E. De Giorgi", Università del Salento, I-73100 Lecce, Italy
ᵍ Istituto Nazionale di Fisica Nucleare (INFN) - Sezione di Lecce, I-73100 Lecce, Italy
E-mail: [email protected], [email protected]

Abstract: The Dark Matter Particle Explorer (DAMPE) is a space-borne particle detector and cosmic ray observatory in operation since 2015, designed to probe electrons and gamma rays from a few GeV to 10 TeV in energy, as well as cosmic protons and nuclei up to 100 TeV. Among its main scientific objectives is the precise measurement of the cosmic electron+positron flux, which, due to the very large proton background in orbit, requires a powerful particle identification method. In the past decade, the field of machine learning has provided the needed tools. This paper presents a neural network based approach to cosmic electron identification and proton rejection, and showcases its performance on simulated Monte Carlo data. The neural network reaches a significantly lower background than the classical, cut-based method for the same detection efficiency, especially at the highest energies. A good match between simulations and real data completes the picture.

Keywords: Particle identification methods

Introduction

The DArk Matter Particle Explorer (DAMPE, also known as Wukong in China) is a satellite-based cosmic ray observatory and gamma ray telescope [1]. It can measure cosmic electrons up to an energy of 10 TeV, and protons and heavier nuclei up to a hundred TeV. It consists of four subdetectors, from top to bottom: a plastic scintillator (PSD) for absolute charge measurement; a silicon-tungsten tracker-converter (STK) for precise direction measurement and for enabling photon pair production; a Bismuth Germanium Oxide imaging calorimeter (BGO) of about 32 radiation lengths, made of 308 hodoscopically arranged bars in 14 layers, used for energy measurement, particle identification, and triggering [2]; and a neutron detector (NUD) for improving the identification of hadronic showers [1].
The DAMPE satellite was launched into a 500-km Sun-synchronous orbit in December 2015, and has been in stable operation since then [3].
Figure 1. Layout of the DAMPE detector system. Figure from [4].

Among the main scientific objectives of DAMPE is the study of cosmic ray electrons plus positrons (CREs) in the TeV range. Precise measurement of the CRE spectrum is crucial for understanding the mechanisms of cosmic ray production and propagation [5–7]. CREs represent a perfect probe of the most energetic processes in the nearby Galaxy, such as supernova explosions [8], and may enable the observation of phenomena such as dark matter decay or annihilation [9, 10]. The CRE spectrum was first measured by DAMPE in 2017 up to an energy of 4.6 TeV, with the direct detection of a spectral break in the TeV region [11]. This result was obtained using a classical, cut-based technique for electron/proton discrimination. That method, while powerful enough to reject the proton background up to a few TeV with acceptable uncertainty, could not fully exploit the particle identification capability of the detector. After more than four years of operation, and thus much larger accumulated data statistics, DAMPE is capable of measuring the CRE spectrum up to an unprecedented energy of 10 TeV with the best achieved energy resolution. However, the standard identification technique [11] does not have sufficient electron/proton discrimination power to be used at such extreme energies. A new method is therefore required, and we propose one coming from the ever-growing toolbox of data science and artificial intelligence.
The field of data science underwent incredible developments in the past decade. The exponential growth of computing power, the advent of the big data era, and improvements in existing algorithms are allowing machine learning and artificial intelligence to conquer a variety of fields [12], where they may even outperform human experts: computer vision [13], self-driving cars [14], or, notoriously, playing Go [15].
In your everyday life, you most likely use systems powered by machine learning algorithms, from web searches [16] to emails [17] and to your phone's virtual assistant [18].
Behind these advances and successes is a set of techniques known as deep learning and deep neural networks [19, 20]. The various flavours of neural networks nowadays represent the state of the art in most of data science. They have found their way into high energy physics [21, 22], where the large size and high dimensionality of the data make them much welcome additions, and into astrophysics [23, 24], where their pattern recognition power allows the extraction of additional information from telescope images. However, neural networks have only seldom been used in cosmic ray physics, and never exploited for CRE direct detection experiments at multi-TeV energies.
In this paper we propose to use neural networks for the problem of electron identification on the DAMPE experiment. In section 2 we develop the electron identification technique based on a neural network classifier. In section 3 we demonstrate the performance of the developed classifier on Monte Carlo simulations and compare it with the standard cut-based technique. In section 4 we validate the developed classifier with four years of DAMPE orbit data and demonstrate that the new method allows a substantial enhancement of electron identification, enabling the first CRE spectrum measurement with DAMPE in the 10 TeV energy domain. Finally, the work and the results are summarised in section 5.
Measuring cosmic electrons in orbit first requires identifying them among the many sources of background [25]. Helium and heavier nuclei can be rejected by measuring the absolute electric charge, which is done in DAMPE using the signal deposited inside the plastic scintillator (PSD). The PSD cut is complemented by a charge measurement in the silicon tracker (STK), allowing the rejection of heavy nuclei outside of the PSD fiducial region. Gamma rays are another source of background; however, their flux is orders of magnitude weaker than the CRE flux, especially at energies above several hundred GeV. They are rejected thanks to their electric charge of zero: a gamma ray penetrating the PSD will not leave any signal, as opposed to other particle species.
Protons, however, have the same absolute electric charge as electrons, meaning they cannot be rejected with these methods. Moreover, protons are the most abundant cosmic ray species, with a flux orders of magnitude higher than that of electrons, making the latter a rare signal drowned in a sea of background events. The exploitable difference between these two particles lies in the physical processes at play when they interact with matter: protons produce a wide and deep hadronic shower, while electrons produce narrower electromagnetic showers, resulting in different signal topologies in the BGO calorimeter.
Classically, one can build observables that quantify the shower shape inside the calorimeter, and use them to reject protons. The first DAMPE cosmic electron spectrum measurement [11] is based on a single observable named 𝜁 that combines the maximum shower depth and the shower horizontal spread [26].
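To illustrate the kind of shower-shape observable involved, the following sketch computes an energy-weighted lateral RMS in a single calorimeter layer. The bar positions and energies here are hypothetical, purely illustrative values, not DAMPE data or the actual 𝜁 definition:

```python
import numpy as np

def layer_rms(bar_positions, bar_energies):
    """Energy-weighted RMS of the hit-bar positions in one calorimeter layer.

    A narrow electromagnetic shower concentrates its energy in a few bars
    (small RMS); a wide hadronic shower spreads it out (large RMS).
    """
    pos = np.asarray(bar_positions, dtype=float)
    e = np.asarray(bar_energies, dtype=float)
    mean = np.average(pos, weights=e)
    return float(np.sqrt(np.average((pos - mean) ** 2, weights=e)))

# Hypothetical bar centres (mm) and deposited energies (arbitrary units):
bars = [-50.0, -25.0, 0.0, 25.0, 50.0]
narrow = [0.5, 10.0, 100.0, 12.0, 0.5]   # electron-like deposit
wide = [20.0, 30.0, 35.0, 30.0, 25.0]    # proton-like deposit

print(layer_rms(bars, narrow) < layer_rms(bars, wide))  # True
```

Observables of this kind, computed layer by layer, feed both the classical 𝜁 variable and the neural network described below.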
While the variable proved powerful, it does not use the entirety of the information available in the detector, including the possibility to exploit the strong correlations between topological variables used to describe the shower development in the imaging calorimeter. Furthermore, its rejection power is limited above the TeV range, where the topological difference between hadronic and electromagnetic showers in the detector is less pronounced. A more powerful method is therefore required, and is presented in this paper.

Neural networks are a new-old technique. First developed in the 1950s–1960s [27], they were quickly abandoned due to their very high requirements in computing power, despite their capabilities as a universal classifier [28]. Neural networks were brought back to life in the 21st century and found their way to the very front of machine learning development, thanks to the abundance of data of all sorts and the capabilities of modern computers.
The classifier we propose is based on a vanilla neural network, composed of a stack of densely connected layers (see below). Such algorithms are sometimes named multilayer perceptrons, feedforward neural networks, or simply artificial neural networks. Other techniques were also studied in the DAMPE collaboration prior to this work: convolutional neural networks (CNNs), suited for pattern recognition and image identification, showed promising performance. While CNNs demonstrated potential for further improvement with respect to the feedforward network, we opted for the latter technique due to the better understood systematics of this type of network, reflected in particular in the data to Monte Carlo agreement of the classifier score distribution [29]. Another technique studied by the collaboration is boosted decision trees, commonly used in high energy physics.
They showed some improvement over the classical method, though optimised for the lower energy domain, around 10 to 100 GeV [30].
An artificial neural network is a stack of densely connected layers of so-called neurons. A neuron is a mathematical unit that applies a non-linear function to a linear combination of its inputs. The function output is then used as input by all the neurons in the next layer, in the case of fully connected networks. Mathematically, if a neuron receives as input a set {𝑋ᵢ}, then its output 𝑦 is:

𝑦 = 𝑓(∑ᵢ 𝑤ᵢ𝑋ᵢ + 𝑏)    (2.1)

where 𝑓 is the non-linear activation function, 𝑤ᵢ are the weights, and 𝑏 is the bias. The activation function 𝑓, as well as the number of neurons and layers, are characteristics of the model decided by the human programmer (section 2.2). The values of 𝑤ᵢ and 𝑏 are set by the machine during the training procedure: the network is exposed to a set of labelled data (training data), where each observation/event is associated to a class (e.g. signal/background), and tries to minimise a given error metric on the set. The minimisation usually follows so-called gradient descent or one of its flavours. In the case of a classification problem, such as the discrimination between protons and electrons, the canonical error metric is the cross-entropy [20].
An artificial neural network takes as input a one-dimensional set of variables {𝑋ᵢ}. In a binary classification, it outputs a value between 0 and 1, which can be interpreted as the probability for an electron (signal) to produce {𝑋ᵢ}. This value is obtained as the output of a sigmoid or sigmoid-like function used as the activation of the very last layer.
In-depth reviews of neural networks can be found in references [19, 20].
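As a concrete illustration of equation 2.1 and of the cross-entropy metric, here is a minimal numpy sketch; the weights are arbitrary random stand-ins, not the trained classifier:

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Equation 2.1: activation of the weighted sum of the inputs plus a bias."""
    return f(np.dot(w, x) + b)

def dense_layer(x, W, b, f=np.tanh):
    """A fully connected layer: every neuron receives every input."""
    return f(W @ x + b)

def binary_cross_entropy(y_true, y_pred):
    """The canonical error metric minimised during training."""
    p = np.clip(y_pred, 1e-12, 1.0 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

rng = np.random.default_rng(0)
x = rng.normal(size=5)                      # 5 input variables {X_i}
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)
h = dense_layer(x, W, b)                    # a hidden layer of 3 neurons
print(h.shape)                              # (3,)

# A prediction closer to the true label yields a lower cross-entropy:
print(binary_cross_entropy(np.array([1.0]), np.array([0.99])) <
      binary_cross_entropy(np.array([1.0]), np.array([0.5])))  # True
```

Training then consists of adjusting `W` and `b` by gradient descent so that the cross-entropy over the labelled training set decreases.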
Two critical aspects of the procedure are the choice and preparation of the input data, and the parameters of the neural network architecture (the so-called hyperparameters).
The data available for training and testing the model are Monte Carlo (MC) simulated events, created using the Geant4 package [31] interfaced with the DAMPE software [32] to emulate the detector response to the various cosmic ray species in orbit. The training data was prepared with a set of cleaning cuts to replicate the analysis chain applied to real data: this includes shower containment and fiducial volume selection criteria, as well as cuts to remove Helium nuclei and gamma rays [11]. In this way, the only remaining background for electron identification is protons. The cut flow is completed by selecting only events with 𝜁 < 100, where 𝜁 is the classical classifier from reference [11]. This conservative criterion eliminates events with an obvious proton-like topology while retaining 99.95% of electrons. The motivation behind this final cut is to get rid of easily identified events, such that the neural network can focus on the more complex ones. This preselection yields a neural network with higher discrimination power.
Out of this cleaned-up data, we built a training set of 140,000 events, balanced 50/50 between electrons and protons. The data is normalised by dividing each input variable by its maximum value through the dataset, therefore scaling them to the [
0; 1] range.
We selected the input variables based on several criteria. First, the features had to be descriptive enough of the shower topology to provide as much information as possible to the neural network. Following published results on deep learning for particle physics [21], we opted mostly for low-level variables to maximise said information, but we also added high-level variables, such as the 𝜁 classifier, which we know provides an already powerful proton rejection. The idea is to give the network our best expert guess, and then the low-level ingredients to improve on it. Adding 𝜁 significantly improved the sub-TeV performance. Finally, we required the input variables to be well simulated. We observed that, depending on the set of features, the network could produce slightly different results between MC simulations and real data. We therefore followed an extensive campaign of empirical testing to find and remove the quantities that resulted in such differences. As a result, the selected features include the energy deposited and its RMS distribution in 12 out of the 14 layers of the BGO calorimeter (excluding the top two), the reconstructed energy, the angle of the trajectory, the energy deposited within one Molière radius of an STK track (STK cluster energy), and the classical 𝜁 classifier.

Figure 2. Schematic of the neural network model used in this work. The hidden layers use the ReLU activation function. The logistic sigmoid in the output layer is removed after training.

Along with optimising the set of input variables, we also researched and optimised the architecture of the neural network itself (the model). Building a model indeed requires several parameter choices: the number of neurons and layers, the activation function, the optimiser and regularisers, etc. The optimisation of these hyperparameters is somewhat of an art, and a common practice is to conduct a random gridsearch [33]. We decided to follow this philosophy and tested hundreds of models against each other. The winner of this computing battle royale is a model consisting of 4 layers with 300, 150, 75 and 1 neuron, respectively, regularised with a 10–20% dropout (a technique consisting of randomly turning off neurons during training) [34]. The hidden layers use the Rectified Linear Unit (ReLU) [35] activation function, and the output layer uses the logistic sigmoid function to map the network output to the [
0; 1] range, as is common in binary classification problems. Finally, the model is optimised using the Adam gradient descent algorithm [36] against the cross-entropy metric. The architecture is represented in figure 2 and table 1.
The extensive training campaigns were conducted on the Baobab computing cluster of the University of Geneva, using Nvidia Titan X GPUs. On the software side, we used Nvidia cuDNN [37], Keras [38] with Theano [39] as a backend, and Scikit-Learn [40]. Google's TensorFlow [41] was considered as well, but internal benchmarks with our models and data showed no gain in performance for a longer computing time.

Layer          Neurons   Parameters
Dense          300       9000
Dropout, 20%   –         0
Dense          150       45150
Dropout, 10%   –         0
Dense          75        11325
Dropout, 10%   –         0
Dense          1         76
Total parameters: 65,551
Trainable parameters: 65,551
Non-trainable parameters: 0

Table 1. Layer-per-layer summary of the neural network used.

A feature we noticed during the early stages of our optimisation procedure is that the neural network output values are either very close to (or exactly equal to) 0.0 or 1.0, with only very few events classified in between. This holds true for false positives and false negatives as well: figure 3 (left) shows a peak of misclassified proton events at an output score of 1.0.¹ The overall distribution is therefore non-monotonic, which introduces complications for the estimation of the background in a cosmic ray electron measurement. For example, this behaviour prevents any sort of baseline background extrapolation.
The cause is a feature of neural networks for classification: the very last operation is a logistic sigmoid function that maps the output to the [0; 1] range:

𝑓(𝑥) = 1 / (1 + 𝑒⁻ˣ)    (2.2)

Values 𝑥 ≫ 0 give 𝑓(𝑥) ≃ 1 and values 𝑥 ≪ 0 give 𝑓(𝑥) ≃ 0, effectively compressing the output into a limited, finite space. Computer floating point accuracy also has an influence: for a 16-bit float, the sigmoid evaluates to exactly 1.0 or 0.0 once |𝑥| ≳ 10. Our solution is to remove the sigmoid from the last layer after training and use its unbounded input as the classifier score (figure 3, right); since the sigmoid is monotonic, the classification performance is strictly unchanged (figure 4). Without this trick, proton events are mapped onto the whole space [0; 1], meaning that even the tightest possible cut at 1.0 will still have a non-zero false positive rate.

¹ These events are likely protons that transfer most of their energy to one or several 𝜋⁰, starting electromagnetic showers while the remaining energy yields a very small hadronic contribution. They are therefore the most difficult background to distinguish from electron-induced showers.
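The layer structure of table 1 and the post-training sigmoid removal can be sketched in a few lines of numpy. The weights here are random stand-ins; only the architecture (300–150–75–1, ReLU hidden layers, sigmoid output) follows the text, and the 29-dimensional input is our inference from the 9000 parameters of the first dense layer under the usual (n_in + 1) × n_out counting, not a figure stated in the paper:

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Architecture of table 1; 29 inputs inferred from (29 + 1) * 300 = 9000
sizes = [29, 300, 150, 75, 1]
params = [(n_in + 1) * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:])]
print(params, sum(params))  # [9000, 45150, 11325, 76] 65551

rng = np.random.default_rng(1)
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(scale=0.1, size=n_out) for n_out in sizes[1:]]

def score(x, with_sigmoid=True):
    """Forward pass; dropout is only active during training, so it is absent here."""
    for W, c in zip(weights[:-1], biases[:-1]):
        x = relu(W @ x + c)                  # hidden layers: ReLU
    z = (weights[-1] @ x + biases[-1])[0]    # unbounded pre-activation score
    return sigmoid(z) if with_sigmoid else z

events = [rng.normal(size=29) for _ in range(50)]
scores_on = np.array([score(e, True) for e in events])
scores_off = np.array([score(e, False) for e in events])

# The sigmoid is strictly monotonic: removing it reorders no events, so any
# efficiency-based cut, and hence the ROC curve, is unaffected.
print(np.array_equal(np.argsort(scores_on), np.argsort(scores_off)))  # True

# Saturation: in 16-bit floats the sigmoid collapses to exactly 1.0 by x = 12
print(np.float16(sigmoid(12.0)) == np.float16(1.0))  # True
```

The parameter counts reproduce the 65,551 trainable parameters of table 1, and the last two checks demonstrate why the sigmoid can be dropped after training without any loss.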
Figure 3. Histogram of the neural network output on electron and proton Monte Carlo: (left) vanilla model with a sigmoid activation function at the output of the last layer, with a peak of false positive outcomes in the proton Monte Carlo highlighted with a circle; (right) the same model after removing (post-training) the sigmoid at the last layer.

Figure 4. Sample Receiver Operating Characteristic (ROC) curves of a model with a sigmoid activation function at its output versus a model without. The curves compare the signal efficiency (true positive rate) versus the remaining background (false positive rate) at varying discrimination threshold. The perfectly superimposed curves prove that the performances are exactly the same.
We report in figure 5 several Receiver Operating Characteristic (ROC) curves for our neural networks, in comparison with the classical 𝜁 method, covering the energy range from 24 GeV to 10.4 TeV. ROC curves are obtained by computing classification metrics at various discrimination thresholds on the test sample. In our case we choose to plot the signal efficiency against the remaining background. Both quantities are defined as the ratio of passing events over total events, for electrons and protons respectively:

Signal efficiency = N_e,pass / N_e
Remaining background = N_p,pass / N_p

These two metrics have the advantage of being independent of the relative abundance of electrons with respect to protons. A good classifier is one that maximises the first metric and/or minimises the second. This translates into a lower curve in figure 5: classifiers with the lowest curves have the smallest background for a set efficiency. The figure shows that the neural network significantly outperforms the classical method in the lowest and highest energy ranges, while the performances appear roughly comparable at intermediate energies.

Figure 5. Sample ROC curves on Monte Carlo for the neural network classifier and the classical 𝜁 classifier, from 24 GeV to 10.4 TeV. The lowest curves have the lowest background for a fixed efficiency. The coloured band shows the statistical uncertainty from Monte Carlo sampling. Logarithmic y-scale.

Figure 6. Energy dependency of the surviving background fraction for a fixed signal efficiency of 85% (left) and 95% (right), for the neural networks and the classical 𝜁 method.

Note that for both classifiers, N_e and N_p are taken after the 𝜁 <
100 cut, for a fair comparison.
The performances are thus energy-dependent. To see this dependence and to better quantify the performance of both methods, we report in figure 6 the remaining background when the discrimination threshold is set so as to obtain an 85% or 95% signal efficiency, as a function of the energy reconstructed by the BGO calorimeter. The comparison involves an uncertainty due to the efficiencies of the two classifiers not being perfectly equal. In the figure, the error bars associated with 𝜁 show the statistical uncertainty from Monte Carlo sampling, the darker band shows the uncertainty associated with the choice of threshold to obtain a compatible efficiency, and the lighter band is the combination of both. The blue band associated with the neural networks is purely statistical. Figure 6 confirms the previous observation that the gains of the neural networks are significant at both ends of the energy range, in the high efficiency regime. From a few hundred GeV to 2 TeV, the performances are within uncertainty of each other. Above 5 TeV, the proton rejection is improved by a factor of at least 2.
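The two metrics above translate directly into code. A self-contained sketch on toy Gaussian scores (illustrative values, not the actual Monte Carlo samples) computes both metrics across thresholds, and shows how fixing the signal efficiency, as in figure 6, amounts to picking a quantile of the signal score distribution:

```python
import numpy as np

def roc_points(signal_scores, background_scores, thresholds):
    """Signal efficiency and remaining background at each threshold.

    Both are ratios of passing over total events, so they do not depend on
    the relative abundance of electrons and protons in the sample.
    """
    sig = np.asarray(signal_scores)
    bkg = np.asarray(background_scores)
    eff = np.array([(sig >= t).mean() for t in thresholds])  # N_e,pass / N_e
    rem = np.array([(bkg >= t).mean() for t in thresholds])  # N_p,pass / N_p
    return eff, rem

rng = np.random.default_rng(2)
electrons = rng.normal(loc=3.0, size=10_000)   # toy signal scores (sigmoid removed)
protons = rng.normal(loc=-3.0, size=10_000)    # toy background scores

thresholds = np.linspace(-8.0, 8.0, 200)
eff, rem = roc_points(electrons, protons, thresholds)

# A fixed 85% signal efficiency corresponds to the 15% quantile of the
# signal scores; the remaining background is then read off at that threshold.
t85 = np.quantile(electrons, 1.0 - 0.85)
print(round((electrons >= t85).mean(), 2))  # 0.85
```

Plotting `rem` against `eff` reproduces a ROC curve of the kind shown in figure 5; scanning `t85` over energy bins yields the curves of figure 6.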
Figure 7. Comparison of the neural network output distribution between simulated Monte Carlo and real data, in six energy bins from 24 GeV to 10.6 TeV.
Figure 8. Comparison of the neural network output distribution between simulated Monte Carlo and real data, in six energy bins from 24 GeV to 10.6 TeV. Logarithmic y-scale.
Model validation
Raw performance on Monte Carlo is only half of the picture. The other half is whether the classifier can be trusted on real data, that is, whether its output on simulations is compatible with its output on real data. The distributions for both simulations and real (flight) data are shown in figures 7 and 8, in six logarithmically spaced energy bins. Both samples have been cleaned such that contributions from other cosmic species, notably Helium nuclei, are negligible (section 2). A sum of proton and electron Monte Carlo templates was fitted to the data in the following way: the proton sample was scaled to the data in a proton-dominated region of low classifier score, while the electron MC was scaled to the data in an electron-dominated region of high classifier score.

Conclusion

To tackle the problem of high energy cosmic electron identification and measurement with DAMPE, we developed a four-layer deep feed-forward neural network classifier, the output of which we transformed to suit the needs of a particle physics analysis. On simulated data, the new classifier shows a strong background rejection power at a high signal efficiency of 95%, and thus outperforms the more traditional cut-based electron identification technique in the energy ranges where the latter shows its limits, thanks to a better exploitation of the information contained within the DAMPE calorimeter and tracker. In particular, the proton rejection improves by a factor of at least 2 above 5 TeV.

Acknowledgments
The DAMPE mission was funded by the strategic priority science and technology projects in space science of the Chinese Academy of Sciences. In China, the data analysis was supported in part by the National Key Research and Development Program of China (no. 2016YFA0400200), the National Natural Science Foundation of China (nos. 11525313, 11622327, 11722328, U1738205, U1738207, and U1738208), the strategic priority science and technology projects of the Chinese Academy of Sciences (no. XDA15051100), the 100 Talents Program of the Chinese Academy of Sciences, and the Young Elite Scientists Sponsorship Program. In Europe, the activities and the data analysis were supported by the Swiss National Science Foundation (SNSF), Switzerland, the National Institute for Nuclear Physics (INFN), Italy, and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 851103).
The computations presented in this document were performed at the University of Geneva on the Baobab cluster, with significant help from computer engineer Y. Meunier and from the HPC team. Simulations were performed on the INFN CNAF and ReCaS clusters, Italy, and on the Swiss National Supercomputing Centre (CSCS) Piz Daint (project s979).
The corresponding author would like to acknowledge SARS-CoV-2 for making this work significantly harder than it should have been.
References

[1] J. Chang et al. The DArk Matter Particle Explorer mission.
Astropart. Phys. , 95:6–24, 2017.[2] Zhiyong Zhang, Yunlong Zhang, Jianing Dong, Sicheng Wen, Changqing Feng, Chi Wang, YifengWei, Xiaolian Wang, Zizong Xu, and Shubin Liu. Design of a high dynamic range photomultiplierbase board for the bgo ecal of dampe.
Nuclear Instruments and Methods in Physics Research SectionA: Accelerators, Spectrometers, Detectors and Associated Equipment , 780:21–26, 2015.[3] G Ambrosi, Q An, R Asfandiyarov, P Azzarello, P Bernardini, MS Cai, M Caragiulo, J Chang,DY Chen, HF Chen, et al. The on-orbit calibration of dark matter particle explorer.
AstroparticlePhysics , 106:18–34, 2019.[4] A Tykhonov, G Ambrosi, R Asfandiyarov, P Azzarello, P Bernardini, B Bertucci, A Bolognini,F Cadoux, A D’Amone, A De Benedittis, et al. Internal alignment and position resolution of thesilicon tracker of dampe determined with orbit data.
Nuclear Instruments and Methods in PhysicsResearch Section A: Accelerators, Spectrometers, Detectors and Associated Equipment , 893:43–56,2018.[5] Peter Meyer. Cosmic rays in the galaxy.
Annual review of astronomy and astrophysics , 7(1):1–38,1969.[6] Igor V Moskalenko and AW Strong. Production and propagation of cosmic-ray positrons andelectrons.
The Astrophysical Journal , 493(2):694, 1998.[7] Yi-Zhong Fan, Bing Zhang, and Jin Chang. Electron/positron excesses in the cosmic ray spectrumand possible interpretations.
International Journal of Modern Physics D , 19(13):2011–2058, 2010.[8] T Kobayashi, Y Komori, K Yoshida, and J Nishimura. The most likely sources of high-energycosmic-ray electrons in supernova remnants.
The Astrophysical Journal , 601(1):340, 2004.[9] Gianfranco Bertone, Dan Hooper, and Joseph Silk. Particle dark matter: Evidence, candidates andconstraints.
Physics reports , 405(5-6):279–390, 2005.[10] Jonathan L Feng. Dark matter candidates from particle physics and methods of detection.
Annual Review of Astronomy and Astrophysics, 48:495–545, 2010.
[11] G. Ambrosi et al. Direct detection of a break in the teraelectronvolt cosmic-ray spectrum of electrons and positrons.
Nature , 552:63–66, 2017.[12] Stephen Marsland.
Machine learning: an algorithmic perspective . Chapman and Hall/CRC, 2014.[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassinghuman-level performance on imagenet classification. In
Proceedings of the IEEE internationalconference on computer vision , pages 1026–1034, 2015.[14] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, PrasoonGoyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning forself-driving cars. arXiv preprint arXiv:1604.07316 , 2016.[15] Jim X Chen. The evolution of computing: Alphago.
Computing in Science & Engineering , 18(4):4,2016.[16] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deepstructured semantic models for web search using clickthrough data. In
Proceedings of the 22nd ACMinternational conference on Information & Knowledge Management , pages 2333–2338. ACM, 2013.[17] Thiago S Guzella and Walmir M Caminhas. A review of machine learning approaches to spamfiltering.
Expert Systems with Applications , 36(7):10206–10222, 2009.[18] Veton Kepuska and Gamal Bohouta. Next-generation of virtual personal assistants (microsoftcortana, apple siri, amazon alexa and google home). In , pages 99–103. IEEE, 2018.[19] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning.
Nature , 521(7553):436–444,2015.[20] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep Learning . MIT Press, 2016. .[21] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for Exotic Particles in High-EnergyPhysics with Deep Learning.
Nature Commun. , 5:4308, 2014.[22] Dan Guest, Kyle Cranmer, and Daniel Whiteson. Deep learning and its application to lhc physics.
Annual Review of Nuclear and Particle Science , 68:161–181, 2018.[23] C Schaefer, M Geiger, T Kuntzer, and J-P Kneib. Deep convolutional neural networks as stronggravitational lens detectors.
Astronomy & Astrophysics , 611:A2, 2018.[24] Daniel George and EA Huerta. Deep neural networks to enable real-time multimessengerastrophysics.
Physical Review D , 97(4):044039, 2018.[25] Claus Grupen.
Astroparticle physics . Springer Science & Business Media, 2005.[26] J. Chang et al. Resolving electrons from protons in ATIC.
Adv. Space Res. , 42:431–436, 2008.[27] Jürgen Schmidhuber. Deep learning in neural networks: An overview.
Neural networks , 61:85–117,2015.[28] G. Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989.
[29] D. Droz, A. Tykhonov, and X. Wu. Neural networks for electron identification with DAMPE. In , volume 36, 2019.
[30] Hao Zhao, Wen-Xi Peng, Huan-Yu Wang, Rui Qiao, Dong-Ya Guo, Hong Xiao, and Zhao-Min Wang. A machine learning method to separate cosmic ray electrons from protons from 10 to 100 GeV using DAMPE data.
Research in Astronomy and Astrophysics , 18(6):071, 2018.[31] Sea Agostinelli, John Allison, K al Amako, John Apostolakis, H Araujo, P Arce, M Asai, D Axen,S Banerjee, G 2 Barrand, et al. Geant4—a simulation toolkit.
Nuclear instruments and methods inphysics research section A: Accelerators, Spectrometers, Detectors and Associated Equipment ,506(3):250–303, 2003.[32] Chi Wang, Dong Liu, Yifeng Wei, Zhiyong Zhang, Yunlong Zhang, Xiaolian Wang, Zizong Xu,Guangshun Huang, Andrii Tykhonov, Xin Wu, et al. Offline software for the dampe experiment.
Chinese Physics C , 41(10):106201, 2017.[33] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization.
Journal ofMachine Learning Research , 13(Feb):281–305, 2012.[34] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learningresearch , 15(1):1929–1958, 2014.[35] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In
Proceedings of the 27th international conference on machine learning (ICML-10) , pages 807–814,2010.[36] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980 , 2014.[37] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro,and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 ,2014.[38] F. Chollet et al. Keras. https://keras.io , 2015.[39] Theano Development Team. Theano: A Python framework for fast computation of mathematicalexpressions. arXiv e-prints , abs/1605.02688, May 2016.[40] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of MachineLearning Research , 12:2825–2830, 2011.[41] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, AndrewHarp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, ManjunathKudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah,Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, VincentVanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning onheterogeneous systems, 2015. Software available from tensorflow.org., 12:2825–2830, 2011.[41] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, AndrewHarp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, ManjunathKudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah,Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, VincentVanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning onheterogeneous systems, 2015. Software available from tensorflow.org.