A Convolutional Neural Network based Cascade Reconstruction for the IceCube Neutrino Observatory
R. Abbasi, M. Ackermann, J. Adams, J. A. Aguilar, M. Ahlers, M. Ahrens, C. Alispach, A. A. Alves Jr., N. M. Amin, R. An, K. Andeen, T. Anderson, I. Ansseau, G. Anton, C. Argüelles, S. Axani, X. Bai, A. Balagopal V., A. Barbano, S. W. Barwick, B. Bastian, V. Basu, V. Baum, S. Baur, R. Bay, J. J. Beatty, K.-H. Becker, J. Becker Tjus, C. Bellenghi, S. BenZvi, D. Berley, E. Bernardini, D. Z. Besson, G. Binder, D. Bindig, E. Blaufuss, S. Blot, S. Böser, O. Botner, J. Böttcher, E. Bourbeau, J. Bourbeau, F. Bradascio, J. Braun, S. Bron, J. Brostean-Kaiser, A. Burgman, R. S. Busse, M. A. Campana, C. Chen, D. Chirkin, S. Choi, B. A. Clark, K. Clark, L. Classen, A. Coleman, G. H. Collin, J. M. Conrad, P. Coppin, P. Correa, D. F. Cowen, R. Cross, P. Dave, C. De Clercq, J. J. DeLaunay, H. Dembinski, K. Deoskar, S. De Ridder, A. Desai, P. Desiati, K. D. de Vries, G. de Wasseige, M. de With, T. DeYoung, S. Dharani, A. Diaz, J. C. Díaz-Vélez, H. Dujmovic, M. Dunkman, M. A. DuVernois, E. Dvorak, T. Ehrhardt, P. Eller, R. Engel, J. Evans, P. A. Evenson, S. Fahey, A. R. Fazely, S. Fiedlschuster, A.T. Fienberg, K. Filimonov, C. Finley, L. Fischer, D. Fox, A. Franckowiak, E. Friedman, A. Fritz, P. Fürst, T. K. Gaisser, J. Gallagher, et al. (268 additional authors not shown)
Prepared for submission to JINST
Abstract: Continued improvements on existing reconstruction methods are vital to the success of high-energy physics experiments, such as the IceCube Neutrino Observatory. In IceCube, further challenges arise as the detector is situated at the geographic South Pole, where computational resources are limited. However, to perform real-time analyses and to issue alerts to telescopes around the world, powerful and fast reconstruction methods are desired. Deep neural networks can be extremely powerful, and their usage is computationally inexpensive once the networks are trained. These characteristics make a deep learning-based approach an excellent candidate for the application in IceCube. A reconstruction method based on convolutional architectures and hexagonally shaped kernels is presented. The presented method is robust towards systematic uncertainties in the simulation and has been tested on experimental data. In comparison to standard reconstruction methods in IceCube, it can improve upon the reconstruction accuracy, while reducing the time necessary to run the reconstruction by two to three orders of magnitude.

Keywords: Neutrino detectors, Data analysis, Pattern recognition
Potential of Deep Learning-based Reconstruction Methods in IceCube
The primary goal of the IceCube Neutrino Observatory is the detection of astrophysical neutrinos (neutrinos originating from extraterrestrial sources), which was achieved in 2013 [1], as well as the identification and characterization of their sources. Although many efforts have been put forth to detect the sources of high-energy astrophysical neutrinos, to date, their origin remains a mystery. However, first evidence for potential neutrino sources was found in observations of the blazar TXS 0506+056 [2, 3] and in time-integrated neutrino source searches [4]. Ongoing work is focused on improving detector calibration and reconstruction methods to further increase the sensitivity of the detector. In addition, a real-time multi-messenger approach is implemented to further increase the discovery potential through the combination of measurements from different telescopes [5, 6].

High-energy neutrino candidates are selected and reconstructed on-site at the South Pole. Given the hardware limitations, events must be processed within a certain amount of time to prevent pileup. Therefore, only simple and fast event reconstructions can be run on-site. Even though more sophisticated and powerful reconstruction methods exist [7, 8], these can take minutes to hours to reconstruct a single event, rendering them intractable for use at the South Pole system. The high data rate as well as the strict hardware and bandwidth limitations are key challenges to the success of the real-time follow-up framework. But even for off-site reconstructions, the most advanced methods in IceCube can only be applied to a subset of data due to their computational complexity. Offline analyses can therefore also benefit from improved and more efficient reconstruction methods. These limitations call for a robust reconstruction method which can handle raw data and has a short and preferably constant runtime. Any new reconstruction method would ideally also reduce the uncertainties on reconstructed parameters in comparison to the standard reconstruction method.

Recent advances in image recognition [9] have shown the capability of deep convolutional networks to surpass state-of-the-art methods. Once the networks are trained, their usage is computationally inexpensive. The network performs a fixed amount of mathematical operations on the input data, resulting in a stable runtime that is essentially independent of the input. These characteristics make deep learning-based methods an excellent candidate for powerful and fast on-site reconstructions.

In this paper, a reconstruction method based on convolutional neural networks is presented which utilizes hexagonally shaped convolution kernels. The neural network is set up as a multi-regression task. It is capable of achieving competitive reconstruction accuracy on neutrino properties, such as their incoming direction and energy, and of estimating per-event uncertainties on these quantities. Simultaneously, the neural network achieves runtimes that are reduced by two to three orders of magnitude in comparison to standard reconstruction methods in IceCube. The reconstruction method is validated by investigating the effects of systematic uncertainties in the simulation and by demonstrating agreement between simulated and experimental data. It has been applied in Ref. [10] with minor modifications to the network architecture and training procedure. Related work and applications in IceCube are discussed in Refs. [11–13].

The paper is structured as follows: in Section 2, cascades are defined and the IceCube detector is introduced. Sections 3 and 4 illustrate the data input format of the neural network and its architecture. The training procedure is explained in Section 5, and Section 6 highlights the performance of the presented method. Effects of systematic uncertainties are investigated in Section 7.
Before concluding in Section 9, limitations of the chosen network architecture are discussed in Section 8.

Figure 1: (a) Top view of IceCube; (b) side view of strings. A top view of the IceCube detector is shown on the left. The 78 strings depicted in blue are on an approximately triangular grid, while the DeepCore strings, shown in gray, are installed in a denser configuration. The color scale indicates the inter-string distance for the main IceCube array, which can substantially deviate from the usual spacing of 125 m. On the right, the DOM layout along the z-axis of the strings is illustrated. In contrast to the strings of the main array, the DOMs on the DeepCore strings (gray) are divided into two groups, one above the dust layer and one below.

The reconstruction method presented in this paper is a versatile tool that is applicable to a wide range of classification and regression tasks. In the scope of this paper, the method is demonstrated for the reconstruction of cascade events. In the following, the IceCube detector is introduced, cascade events are defined, and the current standard cascade reconstruction method is discussed.
IceCube is a neutrino detector located at the South Pole instrumenting a cubic kilometer of glacial ice. The detector consists of 5160 digital optical modules (DOMs), each with a downward-facing 10 inch-diameter photomultiplier tube (PMT) [14], installed on 86 vertical strings at depths between 1450 m and 2450 m. The origin of the IceCube coordinate system is placed in the detector center at a depth of 1948 m. The z-axis is chosen to point upwards towards the ice surface. PMT signals are digitized and buffered on the DOM mainboard with a timing resolution of about 2 ns. Upon readout request, these digitized waveforms are sent to computers on the surface of the detector array.

The 86 strings of the detector may be grouped into three detector parts. While the main IceCube array, consisting of the first 78 strings, is arranged on an approximately triangular grid, the remaining 8 strings do not follow this symmetry. They are installed in a denser configuration called DeepCore [15] with variable distances to neighboring strings. Each of the 86 strings holds 60 DOMs. These are evenly distributed along the z-axis for strings of the main array. DOMs located on DeepCore strings are further grouped into an array above the dust layer (10 DOMs on each string) and an array below the dust layer (50 DOMs per string) as illustrated in Fig. 1b. The dust layer is a layer in the glacial ice with increased dust impurities that was produced about 60000 to 70000 years ago [16, 17]. It ranges from depths of 1950 m to 2100 m and results in increased scattering and absorption coefficients. The strings of the main IceCube array have an inter-string spacing of about 125 m. However, there are deviations as shown in Fig. 1a.

At trigger level, IceCube data consist of recorded waveforms where the amplitude of the waveform corresponds to the charge recorded by the DOM. An IceCube 'event' is a set of these waveforms for many DOMs. Each DOM can measure multiple waveforms with variable starting times in a single event.
While a typical event has a readout window of approximately 15 000 ns, this can vary depending on the trigger [18, pp. 56-59]. In a subsequent step, a series of pulses, represented as a series of times and amplitudes of light observed in a DOM, is extracted from the recorded waveforms. The number of extracted pulses can vary by orders of magnitude and is different for each DOM. A detailed description of the IceCube detector and its data acquisition system can be found in Ref. [18].

In order to detect neutrinos, IceCube measures Cherenkov photons produced by charged secondary particles resulting from neutrino interactions. The two primary detection channels consist of so-called track-like events (tracks), muons induced by charged-current (CC) νμ interactions, as well as cascade-like events (cascades), which result from CC νe and ντ interactions in addition to neutral-current interactions of all neutrino types [5, p. 6]. Muons deposit energy along their trajectory through the detector, resulting in tracks as shown in Fig. 2a. Cascade-like events, on the other hand, produce a shower of secondary particles at the neutrino interaction vertex, which is generally not resolvable by IceCube, given its inter-string spacing of about 125 m (see Section 2.1). As a result, IceCube detects a spherical, almost point-like energy deposition as shown in Fig. 2b.

In contrast to track-like events, the missing lever arm and the spherical energy deposition make the angular reconstruction of cascades a challenging task in IceCube. Furthermore, cascade reconstructions are more susceptible to local ice properties than track reconstructions due to the local energy deposition of the cascade. For tracks, the energy depositions are distributed over large parts of the detector, which helps to average out local fluctuations.
It is therefore crucial to understand the effect that systematic uncertainties in the description of the local ice properties may have on the reconstruction (this is investigated in Section 7). Despite these challenges, cascade events are an important detection channel that expands IceCube's ability to probe the southern neutrino sky, which is otherwise dominated by down-going atmospheric muons.

The main cascade reconstruction quantities of interest are the deposited energy and the direction of the primary neutrino that induced the particle shower. The focus of this paper is on the angular reconstruction of cascade-like events in the TeV to PeV energy range.

To combat the challenges involved in the reconstruction of cascades, IceCube has implemented a variety of reconstruction algorithms.

Figure 2: (a) Up-going muon from a CC νμ interaction; (b) CC νe interaction inside the detector volume. Example event views of simulated data are shown for an up-going muon entering the detector (left) and for the resulting particle cascade of a charged-current (CC) νe interaction inside the detector volume (right). Each DOM is represented by a sphere. The size of the sphere corresponds to the amount of collected light and the color indicates the arrival time of the photons (darker colors correspond to later times).

The current standard cascade reconstruction method is a maximum likelihood estimation that fits a point-like cascade template to the measured light deposition pattern in the detector [19]. The cascade template describes the expected light yield at each DOM for a given cascade hypothesis consisting of the interaction vertex and time, the deposited energy, and the direction of the primary neutrino. A binned Poisson likelihood is employed to find the cascade hypothesis that best describes the measured light deposition pattern in the detector. The cascade template is obtained from MC simulation and tabulated for various cascade-DOM configurations.
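The template-likelihood fit described above can be sketched compactly. The following is an illustrative toy, not IceCube's actual template tables or code: a hypothetical point-like template function `toy_template` stands in for the tabulated light-yield prediction, and the binned Poisson negative log-likelihood is minimized over a vertex-plus-energy hypothesis.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(expected, observed):
    """Binned Poisson negative log-likelihood (constant k! terms dropped)."""
    expected = np.clip(expected, 1e-9, None)  # avoid log(0)
    return np.sum(expected - observed * np.log(expected))

def toy_template(hypothesis, dom_positions):
    """Hypothetical light-yield template: charge falls off with distance
    from the vertex and scales linearly with deposited energy."""
    x, y, z, energy = hypothesis
    d = np.linalg.norm(dom_positions - np.array([x, y, z]), axis=1)
    return energy * np.exp(-d / 30.0) / (d + 1.0) ** 2

# Toy detector and toy event (not real IceCube geometry or data).
rng = np.random.default_rng(42)
dom_positions = rng.uniform(-100.0, 100.0, size=(60, 3))
true_hypothesis = np.array([10.0, -20.0, 5.0, 1000.0])
observed = rng.poisson(toy_template(true_hypothesis, dom_positions))

# Maximum likelihood estimation: minimize the NLL over the hypothesis.
fit = minimize(
    lambda h: neg_log_likelihood(toy_template(h, dom_positions), observed),
    x0=np.array([0.0, 0.0, 0.0, 500.0]),
    method="Nelder-Mead",
)
```

In the real reconstruction, the template is looked up in precomputed tables rather than evaluated analytically, but the structure of the fit is the same.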
In order to reduce the dimensionality of the lookup tables, rotational and translational invariance in the x-y-plane are assumed. Second-order corrections can then be applied to account for the inhomogeneities in the detector medium and photon propagation. Due to these simplifications, the standard reconstruction method maximizes an approximation to the true underlying likelihood. More advanced reconstruction methods exist within IceCube [7, 8]; however, due to their computational complexity they can only be applied to individual events. Additional information on the standard reconstruction method is provided in Ref. [19].

Neural Network Input Data Format
Convolutional neural networks (CNNs) were first developed for the domain of image recognition [20]. In contrast to fully-connected (dense) networks, CNNs can exploit translational invariance in the input data. Translational invariance in the context of image recognition means that the class of an image does not change if the position of the classified object is shifted within that image. The capability of CNNs to exploit translational invariance is one of several contributions that led to the breakthrough in image recognition [9]. In the case of IceCube, the application of CNNs may help in exploiting the fact that the underlying physics of the neutrino interaction is invariant under translation in space and time.

Inputting image data into a convolutional neural network is straightforward, as an image only has two dimensions with an optional color channel. The pixels of an image are arranged on an orthogonal grid, and for typical image datasets the pictures are all of the same size or can easily be cropped to the same size. In contrast, IceCube data is given at four-dimensional points in space and time, arranged on an imperfect triangular grid, and highly variable in size. It has to be unified in a way that it can be sequenced into a convolutional neural network while maintaining spatial and temporal relations so that symmetries, such as translational invariance in space and time, can be exploited. Key challenges that need to be addressed include the high dimensionality and variability of the data as well as IceCube's hexagonal geometry and sub-arrays.
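The translation property that convolutions exploit can be demonstrated with a minimal toy example (plain NumPy, unrelated to the actual IceCube code): a convolution is translation-equivariant, i.e. shifting the input pattern shifts the resulting feature map by the same amount, so one learned kernel responds to a pattern regardless of where in the detector (or image) it occurs.

```python
import numpy as np

def cross_correlate_valid(x, k):
    """Naive 1D 'valid' cross-correlation, the core operation of a CNN layer."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

kernel = np.array([0.5, 1.0, 0.5])   # a toy 'learned' filter
signal = np.zeros(16)
signal[3:6] = [1.0, 2.0, 1.0]        # a localized light pattern
shifted = np.roll(signal, 4)         # the same pattern, 4 bins later

out = cross_correlate_valid(signal, kernel)
out_shifted = cross_correlate_valid(shifted, kernel)

# The feature map shifts together with the input pattern (equivariance).
assert np.allclose(np.roll(out, 4), out_shifted)
```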
For purposes of use in the CNN, the IceCube detector is effectively separated into three detector parts, which do not share the same geometry or symmetries. The main IceCube array is arranged on an approximately triangular grid. However, pixels in an image are typically arranged on an orthogonal grid. Data storage as well as matrix and tensor operations are also performed on orthogonal grids. In order to employ existing and efficient convolution algorithms, IceCube data must first be transformed to an orthogonal grid. A common approach to transform hexagonal data is to interpolate or rebin the data points onto an orthogonal grid, as has been done elsewhere [21–23]. However, this process introduces an unnecessary loss of information, as the original data points on the hexagonal grid are replaced with an approximation thereof. In addition, while the interpolation is reasonable for the integrated charge, it is not clear how this interpolation would have to be performed on the pulse arrival times, which heavily depend on the event characteristics and cannot generally be inferred from the measured distribution of pulses at DOMs in the vicinity.

Instead, a lossless transformation is used [11], which is also employed in Section 4.3 to obtain hexagonal convolution kernels. In this approach, each DOM itself is used as a bin and input node for the CNN. For this DOM-binning, the triangular grid of the DOMs needs to be transformed to an orthogonal grid. This can be achieved by the transformation described in Fig. 3. In order to exploit the symmetries maintained within each sub-array of the detector, the two DeepCore parts of the detector are handled separately from the main array, reducing the x- and y-coordinates to a single dimension. Note that this separate processing results in a loss of ability to directly correlate neighboring strings from the main array with those of DeepCore.
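The lossless DOM-binning can be sketched as follows, using hypothetical axial string coordinates (the real IceCube string layout and numbering differ): each string, indexed by axial coordinates (q, r) on the triangular grid, is written directly into an orthogonal array, and positions without a string remain zero-padded, so no values are interpolated.

```python
import numpy as np

# Hypothetical axial coordinates (q, r) for a small hexagonal patch of
# strings; in axial coordinates, hexagonal neighbors map to fixed index
# offsets, so an orthogonal array preserves the neighborhood structure.
strings = {  # string_id -> (q, r) on the triangular grid
    1: (0, 0), 2: (1, 0), 3: (2, 0),
    4: (0, 1), 5: (1, 1),
    6: (0, 2),
}

def to_orthogonal(strings, n_rows, n_cols):
    """Place strings on an orthogonal grid via their axial coordinates.

    Grid positions without a string stay zero-padded; every original
    data point is kept, so the transformation is lossless.
    """
    grid = np.zeros((n_rows, n_cols))
    for string_id, (q, r) in strings.items():
        grid[r, q] = string_id  # in practice: the per-DOM feature vector
    return grid

grid = to_orthogonal(strings, n_rows=3, n_cols=3)
```

For the full detector, the same mapping applied to all 78 main-array strings yields the 10 × 10 grid in the x-y-plane described in the text.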
Furthermore, the transformation as outlined above does not explicitly correct for the dust layer or for the irregularities in the detector grid, detailed in Section 2.1. As a result of the transformation, three input tensors of the shape (10 × 10 × 60 × n), (8 × 10 × n), and (8 × 50 × n) are obtained for the main array and upper and lower DeepCore, respectively, where n denotes the number of input variables per DOM. The second to last dimension in each tensor indicates the number of DOMs (60, 10, and 50) along the z-axis. Possible input variables for each DOM are discussed in the following section.

Figure 3: The main IceCube array and DeepCore strings are handled separately due to their differing geometry. Hexagonally shaped data of the main array can be transformed from an axial coordinate system into an orthogonal grid by padding with zeros and aligning the rows, which results in a 10 × 10 grid in the x-y-plane. Every DOM defines a bin in the spatial coordinates (DOM-binning).

The transformation described above allows for a convolution over all spatial dimensions for the main array and a convolution over the z-dimension for DeepCore. Hence, the symmetry in spatial coordinates can be exploited to the extent possible given IceCube's geometry. Ideally, translational invariance in time should also be exploited. The starting point of most reconstruction methods in IceCube is the extracted pulses as described in Section 2.1. These pulses are an efficient data representation. However, the number of pulses at a given DOM is highly variable and therefore the pulses are an unsuitable input to a CNN. A standard CNN requires a uniform and constant input size. One option is to bin the measured pulse charges in time. With a four-dimensional convolution over the main array input, translational invariance in space and time can be exploited. However, a full four-dimensional convolution would require on the order of thousands of time bins per DOM for the desired timing resolution. This would result in a significant increase in computational complexity. A simultaneous processing in space and time via a 4D convolution is therefore not feasible. The dimensionality of the problem must be reduced or separated out into individual tasks. Reducing the dimensionality of the problem results in a loss of spatial and temporal relations that has to be compensated for in an alternative way.

There are many ways to achieve this. For this paper, the variable size of the time dimension is reduced to nine selected summary statistics, which describe the pulse series at each DOM.
Figure 4: An example pulse series and the corresponding input data for a single DOM is shown. The measured pulse series cannot directly be utilized by a CNN due to its varying length. The pulses are therefore reduced to nine input parameters (c_total, c_100ns, c_500ns, t_first, t_last, t_20%, t_50%, t_mean, t_std) which aim to summarize the pulse distribution.

This has the added benefit that parameters can be chosen that have good agreement between measured and simulated data, making the network more robust towards possible mis-modeling in the Monte Carlo (MC) simulation. Parameters including time variables are calculated relative to a global offset, which is defined as the start of a 6000 ns long time window that maximizes the contained charge for each event. By utilizing relative timing information rather than absolute timing, translational invariance in time can be exploited to a certain degree. The nine input parameters are chosen based on their expected relevance for the reconstruction task and consist of: the total DOM charge, the charge within 100 ns and 500 ns of the first pulse, the relative time of the first pulse, the relative times at which 20% and 50% of the charge is collected, the relative time of the last pulse, and the charge-weighted mean and standard deviation of the relative pulse arrival times. The input parameters are illustrated in Fig. 4. Reconstruction methods in IceCube typically exclude pulses from saturated or overly bright DOMs to avoid potential mis-modeling in the MC simulation. In this case, the input features corresponding to the excluded DOMs are set to zero.
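A possible implementation of these summary statistics is sketched below. This is an illustrative reading of the text, not the collaboration's actual code; in particular, the window-selection procedure is assumed here to scan candidate starts at the pulse times.

```python
import numpy as np

def global_time_offset(times, charges, window=6000.0):
    """Start of the fixed-length window that maximizes the contained charge.

    times/charges: pulse times [ns] and charges [PE] of the whole event.
    Assumption: candidate window starts are scanned at the pulse times.
    """
    t = np.asarray(times, dtype=float)
    c = np.asarray(charges, dtype=float)
    contained = [c[(t >= t0) & (t < t0 + window)].sum() for t0 in t]
    return t[int(np.argmax(contained))]

def dom_summary(times, charges, offset):
    """The nine per-DOM input parameters described in the text."""
    t = np.asarray(times, dtype=float) - offset   # relative pulse times
    c = np.asarray(charges, dtype=float)
    order = np.argsort(t)
    t, c = t[order], c[order]
    c_total = c.sum()
    t_first, t_last = t[0], t[-1]
    c_100 = c[t <= t_first + 100.0].sum()         # charge within 100 ns
    c_500 = c[t <= t_first + 500.0].sum()         # charge within 500 ns
    cum = np.cumsum(c) / c_total                  # cumulative charge fraction
    t_20 = t[np.searchsorted(cum, 0.2)]           # 20% charge-quantile time
    t_50 = t[np.searchsorted(cum, 0.5)]           # 50% charge-quantile time
    t_mean = np.average(t, weights=c)             # charge-weighted mean time
    t_std = np.sqrt(np.average((t - t_mean) ** 2, weights=c))
    return np.array([c_total, c_100, c_500, t_first, t_last,
                     t_20, t_50, t_mean, t_std])

# Toy event: a stray early pulse plus the main charge cluster.
offset = global_time_offset([0.0, 10000.0, 10200.0, 10400.0],
                            [0.1, 2.0, 2.0, 1.0])
features = dom_summary([10000.0, 10200.0, 10400.0], [2.0, 2.0, 1.0], offset)
```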
With these features, three input tensors are obtained for the CNN of shape (10 × 10 × 60 × 9), (8 × 10 × 9), and (8 × 50 × 9) for the main array and upper and lower DeepCore, respectively.

In a future iteration, the calculation and selection of input parameters (feature generation) for a given DOM could be automated by applying a 1D CNN to the measured waveforms or time-binned pulses. Alternatively, a recurrent neural network could be set up to work directly on the extracted pulses. Such automated and fully differentiable feature generation would allow for an end-to-end learning task starting from the measured waveforms or pulse series.
Figure 5: A sketch of the neural network architecture is shown. Data from the three sub-arrays are sequenced into convolutional layers. The result is flattened, combined, and passed on to two fully-connected sub-networks which perform the reconstruction and uncertainty estimation.
Once the data input format is defined, the network architecture can be set up. The reconstruction quantities of interest (training labels) are defined as the deposited energy in the detector and the direction of the incoming neutrino. The zenith Θ and azimuth Φ angle of the neutrino direction are further decomposed into the direction vector components. Following a widely used convention in IceCube, the particle direction vector $\vec{d}$ is chosen to point in the direction of travel, while Θ and Φ define the origin of the particle. This results in a total of 6 labels: energy, azimuth, zenith, dir.-𝑥, dir.-𝑦, dir.-𝑧. Settings and design choices of the neural network are discussed in this section. The neural network is implemented within the Python interface of TensorFlow [24] and the code is available on GitHub. Convolution operations in common deep learning frameworks, such as
TensorFlow, are performed on orthogonal grids. Transforming the (in 𝑥- and 𝑦-dimension) hexagonally shaped IceCube data to an orthogonal grid as illustrated in Fig. 3 results in convolution kernels shaped as parallelograms in the detector. To instead obtain convolution kernels which are shaped according to the IceCube geometry, the transformation method described in our previous work [11] is applied. The convolution kernels within TensorFlow are adjusted by setting the corner elements to zero, as illustrated by the orange points on the right of Fig. 6.

(Code available at https://github.com/icecube/dnn_reco)

Figure 6: Hexagonally shaped convolutional kernels in an axial coordinate system on the left (blue dots) can be transformed into an orthogonal grid on the right by padding with zeros (orange dots) and aligning the rows [11].

As a result, hexagonally shaped convolution kernels are
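The corner-zeroing described above can be sketched as a kernel mask (a minimal example; the helper name and the axial-coordinate membership test |q + r| ≤ s − 1 are our assumptions, following the construction in Ref. [11]):

```python
import numpy as np

def hex_kernel_mask(s: int) -> np.ndarray:
    """Build a (2s-1) x (2s-1) mask for an axis-aligned hexagonal kernel
    of size s: corner elements of the orthogonal kernel that fall outside
    the hexagon are set to zero (orange points in Fig. 6). In axial
    coordinates, an offset (q, r) from the center belongs to the hexagon
    iff |q + r| <= s - 1."""
    k = 2 * s - 1
    mask = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            q, r = i - (s - 1), j - (s - 1)  # axial offsets from center
            if abs(q + r) <= s - 1:
                mask[i, j] = 1.0
    return mask
```

Multiplying a TensorFlow kernel elementwise by such a mask (and keeping the masked entries at zero) yields hexagonally shaped kernels; a hexagon of size 𝑠 then contains 3𝑠² − 3𝑠 + 1 active elements.
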
obtained for the 𝑥-𝑦-plane in the detector. The obtained kernels can be defined by a tuple of size and orientation as illustrated in Fig. 7. Due to their symmetry, these kernels can in principle be rotated by multiples of 60° to exploit rotational equivariance about the 𝑧-axis within group convolutions, as explained in Refs. [25, 26].

Figure 7: A hexagonally shaped kernel can be defined by a tuple of size 𝑠 and orientation 𝑜: (𝑠, 𝑜). For an axis-aligned hexagon (orientation 𝑜 = 0), 𝑠 defines the number of points along an edge of the hexagon.

In order for neural networks to be able to learn non-linear relations, non-linear functions (activation functions) are applied to the output of a layer. Once the architecture of the neural network is defined, the parameters of the network can be adjusted to minimize a specified target function, typically referred to as loss function or loss. The loss function quantifies the deviation from the target values. It is used to determine which set of parameter values best solves the reconstruction task. During the training process, the loss is minimized iteratively via a gradient-based minimizer [27]. Details on the chosen loss function are provided in Section 5.3.

Deep neural networks can in theory process data of any range and scaling. However, the activation function's response is typically centered around zero and often converges for very large or small values, leading to vanishing gradients. Moreover, the loss landscape's dependence on the parameters of the neural network may be highly asymmetric if input features are on different size scales, i.e. if the orders of magnitude of the features differ, which can cause problems during minimization. In addition, if input data is not normalized, small imbalances in the early layers of the network can cascade throughout the network, causing gradients to become extremely large during backpropagation.
This effect, often referred to as exploding gradients, is even stronger for deep neural networks [28, 29].

It is therefore useful to normalize the input data $X$ and training labels $Y$. Note that $X$ and $Y$ are tensors; the following transformations, outlined in Eqns. (4.1) and (4.2), are applied pointwise. IceCube measures neutrino interactions across many orders of magnitude in energy. As a result, the energy label and input features such as the collected charge span many orders of magnitude. Consequently, these input features and labels are first transformed via

$$X' = \ln(1.0 + X) \quad \text{and} \quad Y' = \ln(1.0 + Y). \quad (4.1)$$

The transformation defined in Eq. (4.1) only affects the input features total charge, charge collected within the first 100 ns and 500 ns, as well as the energy label. The identity operation $X' = X$ and $Y' = Y$ is applied to all other features and labels, respectively. Afterwards, the input data $X'$ and labels $Y'$ are normalized to zero mean and unit variance by the following transformation:

$$X'' = \frac{X' - \overline{X'}}{\sigma_{X'} + \epsilon} \quad \text{and} \quad Y'' = \frac{Y' - \overline{Y'}}{\sigma_{Y'} + \epsilon}, \quad (4.2)$$

where $\overline{X'}$ is the mean and $\sigma_{X'}$ is the standard deviation of $X'$ computed over all DOMs and over all events of the training dataset. Entries in $\overline{X'}$ and $\sigma_{X'}$ which are a result of the zero-padding described in Fig. 3 are set to zero. Therefore, padded zeros in the input data $X$ will remain zero in the transformed input data $X''$. The tensors $\overline{Y'}$ and $\sigma_{Y'}$ are defined analogously for the labels. A small constant $\epsilon$ is added to prevent division by zero. The values $\overline{X'}$, $\sigma_{X'}$ and $\overline{Y'}$, $\sigma_{Y'}$ in Eq. (4.2) are calculated in advance on the training dataset, prior to the training of the neural network. In addition to the transformations described by Eqns.
(4.1) and (4.2), steps are undertaken to maintain a standardized input to every layer throughout the network. This can be accomplished by the use of residual additions and unit-variance-maintaining layers, as described in Sections 4.3 and 4.4, respectively. Other common approaches not regarded here are the application of self-normalizing networks [29] and batch normalization [30]. Batch normalization is not used due to its regularization properties, which are too strong for the application at hand (see Section 4.5).

Residual nets [31] can greatly facilitate the training of deep neural networks by avoiding vanishing or exploding gradients through the introduction of skip connections. Intuitively, the introduction of skip connections reduces the dependence among the layers and thus the cascading effect in deep neural networks. The concept is adopted here with slight modifications. In residual layers as defined in this work, the output of a given layer $X_{\text{output}}$ is modified to $X'_{\text{output}}$ according to

$$X'_{\text{output}} = X_{\text{input}} + s \cdot X_{\text{output}}, \quad (4.3)$$

where $s$ is a scale factor defining the contribution of the residual of the given layer. This allows the neural network to essentially skip the weights of a given layer by setting $s$ to zero. The scale factor $s$ is initialized to small random values. As a result, the given layer will initially perform an approximate identity operation. During training, the scale factor $s$ is optimized and the contribution of the residual may increase. Due to downsizing or pooling operations in convolutional layers, the dimensions of $X_{\text{input}}$ and $X_{\text{output}}$ can differ. In this case, only the channels which align are treated as residuals according to Eq. (4.3).

4.4 Unit Variance Maintaining Layers

As mentioned in Section 4.2, there are great benefits to keeping the input and output normalized within the neural network.
This is also the goal of batch normalization [30], which explicitly normalizes the input to a specific layer in addition to introducing regularization. However, the regularization properties of batch normalization are too strong for the application at hand (see Section 4.5). Therefore, an alternative, regularization-free method is defined which maintains a normalized throughput by ensuring that each layer conserves the unit variance of its input. Assuming that the input to a layer is zero-centered and has unit variance, the calculations performed in that layer can be modified such that the output remains normalized. For instance, the sum $z_{\text{total}}$ of $n$ independent random variables $z_i$ has an increased variance given by

$$s_{\text{total}}^2 = \sum_{i=1}^{n} s_i^2, \quad (4.4)$$

where $s_i^2$ is the variance of input $z_i$. If the inputs $z_i$ have unit variance ($s_i = 1$), the standard deviation of the sum increases to $s_{\text{total}} = \sqrt{n}$. To compensate for this increased variance of the output, the result of the sum can be divided by $s_{\text{total}}$ to obtain the normalized output:

$$z'_{\text{total}} = \frac{z_{\text{total}}}{s_{\text{total}}} = \frac{z_{\text{total}}}{\sqrt{n}}. \quad (4.5)$$

Wherever applicable, transformations such as Eq. (4.5) are applied. The assumption of normalized inputs is an idealization that does not necessarily hold in practice. Nevertheless, the application of these compensation factors helps to keep the variance in a reasonable range for the application described in this paper.

In addition to the modifications described above, the trainable parameters of the neural network are initialized such that the layers initially perform an approximate identity operation. This initialization scheme, in combination with unit-variance-maintaining layers and the normalization of input and labels (see Section 4.2), greatly speeds up and facilitates the training procedure. As a result, the initialized but untrained network predicts each label with a probability proportional to its frequency in the training data.
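The variance compensation of Eq. (4.5) can be illustrated numerically (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def normalized_sum(inputs):
    """Sum n (approximately) unit-variance inputs along the first axis
    and divide by sqrt(n), so that the output again has roughly unit
    variance, cf. Eq. (4.5)."""
    inputs = np.asarray(inputs)
    n = inputs.shape[0]
    return inputs.sum(axis=0) / np.sqrt(n)

rng = np.random.default_rng(42)
z = rng.standard_normal((16, 100_000))  # 16 unit-variance inputs
var_naive = z.sum(axis=0).var()         # grows like n = 16
var_norm = normalized_sum(z).var()      # stays close to 1
```
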
In contrast to typical applications of CNNs in the domain of image recognition, applications in IceCube are not limited by the training dataset size. The extremely large amount of available training data (on the order of hundreds of millions of events) in combination with the rather small network capacity results in a reduced necessity for regularization. A regularization technique called dropout [32], which randomly drops input nodes during the training phase, is used in early stages of training and gradually reduced during the training procedure. The last training steps are performed nearly regularization-free. To ensure generalization, systematic variations of the baseline simulation are included in the training, validation, and test datasets. The trained model is also tested for various modifications, as discussed in Section 7.2. Additionally, entire DOMs are dropped randomly during training via dropout to ensure robust predictions and to emulate experimental data, where DOMs may drop out and be excluded for a certain data-taking period. This forces the neural network not to rely too heavily on the presence or values of a single DOM.

4.6 Network Architecture
IceCube's data is split up and transformed to an orthogonal grid as described in Section 3 and illustrated in Fig. 3. The input data for each sub-array is sequenced into a series of convolutional layers as shown in Fig. 5. A three-dimensional convolution over all spatial coordinates is performed for the data of the main IceCube array, while a single convolution over the z-axis is used for the DeepCore arrays. The result of these convolutional layers is flattened, combined, and then used in two small fully-connected networks. One of these fully-connected networks is used to obtain the prediction for each label quantity, whereas the other estimates the uncertainty on each predicted quantity. The uncertainty-estimating sub-network also obtains the prediction output as input. A gradient-stop is applied to the input of the uncertainty-estimating network such that its optimization does not alter the trainable parameters of the remaining network. Details of the chosen network architecture and the layers of each sub-network are provided in Tab. 2 and 3 in Appendix A. During training, the dropout rates $d_i$ are gradually decreased as defined in Section 5.4. Entire DOMs of the input tensors are dropped with one of these rates, the convolutional layers use another, and a third rate is applied to the combined and flattened layer after the convolutional layers. The values of the dropout rates $d_i$ are provided in Tab. 4. An exponential linear activation function (elu) [33] is used for all layers except for the output layers. The prediction sub-network does not use an activation function in its output layer, whereas the uncertainty sub-network uses the absolute value function (abs), enforcing non-negative values.

The network is set up to solve a multi-label regression task, although it can easily be altered to solve a classification task instead. Its main focus is on the directional reconstruction of the primary particle and the deposited energy in the detector.
In principle, the directional reconstruction can be learned directly via the zenith and azimuth angles. However, due to the azimuthal periodicity, it is beneficial to decompose the angles into the three components of the unit direction vector and to emphasize these during training. The zenith Θ and azimuth Φ angles are therefore not directly estimated by the neural network, but are calculated from the normalized direction vector $\vec{d} = (d_x, d_y, d_z)^T$ according to

$$\Theta = \cos^{-1}(-d_z), \quad (4.6)$$

$$\Phi = \tan^{-1}\left(\frac{-d_y}{-d_x}\right) \pmod{2\pi}. \quad (4.7)$$

The uncertainties on the direction vector components ($\sigma_{d_x}$, $\sigma_{d_y}$, $\sigma_{d_z}$) and on the angles, $\sigma_\Theta$ and $\sigma_\Phi$, are estimated directly by the neural network. For the sake of simplicity, correlations are not taken into account in this work. They may be included by adding additional output nodes to the neural network for the off-diagonal elements of the covariance matrix, in addition to changing the loss function in Eq. (5.7) to a multivariate Gaussian.

The architecture described in the previous section defines a multi-label regression task. Training multiple labels at once can be challenging due to individual labels dominating the loss. Therefore, a weighted multi-label loss function is introduced. In addition, the training data and procedure are described in this section.

The neural network is trained and evaluated on neutrino Monte Carlo (MC) simulations. Additional checks are performed on experimental data to validate the applicability of the MC simulations (see Section 7). An event selection is performed on events that trigger the detector in order to reduce the atmospheric background and to improve the event quality. In this paper, the event selection described in Ref.
[10] is used, which aims to select cascade-like events (charged-current 𝜈𝑒 and 𝜈̄𝑒 as well as neutral-current interactions of all neutrino types).

To investigate the effects of systematic uncertainties in the MC simulation, datasets are simulated with discrete variations of key parameters, such as the scattering and absorption coefficients of the glacial ice. These parameters affect the propagation of light in the detector medium and are a major source of systematic uncertainty. The neural network is trained on the baseline MC dataset in addition to its systematic variations. These systematic datasets are included in the early stages of training so that the network becomes more robust towards these uncertainties. The combined baseline and systematic dataset is split into three parts to obtain a training, validation, and test set. The resulting training set has about six million events.

In order to efficiently train the neural network, loading, preprocessing, and sequencing the data into the network have to be highly optimized. The chosen simulation dataset (about 10 TB of disk space) is preprocessed, compressed, and saved to hdf5 files. During training, reading and decompression of the hdf5 files are handled in parallel on multiple CPUs. Once read and decompressed, the input tensors and labels of these events are combined from the multiple processes and enqueued in a single shared queue. An additional process dequeues the elements and produces batches of a given size, which are then sequenced into the neural network.

Due to the high dimensionality of the input and the restrictions given by the computing architecture, only a limited number of files can be loaded into memory at once. In order to reduce the number of times files need to be read and decompressed, a given number of 𝑚 files is loaded at once. From the loaded and decompressed events, batches are randomly drawn until a specified number of 𝑘 iterations through the data in memory (epochs) are completed.
Afterwards, the next 𝑚 files are loaded into memory. This allows for a more efficient use of CPU resources, but some care must be taken to ensure variability between consecutive training batches. The values of 𝑘 and 𝑚 therefore depend on the batch size 𝑏 and the number of events per file 𝑓. They are chosen such that the number of loaded events $M = \sum_i^m f_i$ fulfills the relation $M > \sqrt{k} \cdot b$. The choice of the factor $\sqrt{k}$, as opposed to $k$, allows events in memory to be reused. The larger the pool of loaded events $M$ is with respect to the batch size, the higher is the number of allowed epochs through the data in memory.

The reconstruction method described in this paper is a multi-label regression task. As such, the neural network is trained on multiple labels simultaneously. Due to the varying reconstruction difficulty of each label, their contributions to the loss function may be on different scales. Note that the relative contributions of each label to the loss function define the contributions to the gradients used in gradient descent during training. Labels that are inherently hard to predict can dominate the training process, preventing the accurate prediction of others. To ensure that each label contributes equally to the loss function, or by a predefined importance, an adaptive weighting method is proposed.

An importance vector $\vec{C}$ is introduced which can be used to distribute weights to individual labels. The importance vector has an entry for each label, where $(\vec{C})_k$ is the weight associated to the $k$-th label. The loss $(\mathcal{L})_{b,k}$ for an event $b$ in a batch of $B$ events and label $k$ can then be weighted by

$$(\mathcal{L}_w)_{b,k} = (\vec{C})_k \cdot (\mathcal{L})_{b,k}. \quad (5.1)$$

Optionally, the events can additionally be weighted to reflect their expected frequency in experimental data. This changes the importance-weighted loss $(\mathcal{L}_w)_{b,k}$ from Eq.
(5.1) to

$$(\mathcal{L}_w)_{b,k} = (\vec{W})_b \cdot (\vec{C})_k \cdot (\mathcal{L})_{b,k}, \quad (5.2)$$

where $\vec{W}$ are the weights defining the expected frequency in experimental data for the $B$ events in a mini-batch. In this work, Eq. (5.1) is used. The overall scalar loss $\mathcal{L}_{\text{scalar}}$ is computed via

$$\mathcal{L}_{\text{scalar}} = \frac{1}{B K} \sum_{k=1}^{K} \sum_{b=1}^{B} (\mathcal{L}_w)_{b,k}, \quad (5.3)$$

where $K$ is the number of labels and $B$ is the number of events in a batch. The importance vector $\vec{C}$ is updated every $N$ optimization steps to ensure that each label contributes to the loss function according to the importance assigned to it. This compensates for the fact that different labels are trained at different speeds. The importance vector is updated via

$$(\vec{C}')_k = (\vec{C})_k \cdot \max\left[1,\ \left(\overrightarrow{\langle\mathrm{MSE}\rangle}\right)_k^{-1}\right], \quad (5.4)$$

where $\vec{C}$ is the unmodified, original importance vector and $\overrightarrow{\langle\mathrm{MSE}\rangle}$ is given by
$$\overrightarrow{\langle\mathrm{MSE}\rangle} = \frac{1}{N} \sum_{n=1}^{N} \overrightarrow{\mathrm{MSE}}_n, \quad (5.5)$$

with the mean squared error $\overrightarrow{\mathrm{MSE}}_n$ over each normalized label for the $n$-th mini-batch. A rolling average can also be chosen as an alternative to the update rule from Eq. (5.4).

During the first training steps, both the predictions and the uncertainty estimates are trained by loss functions employing mean squared errors (MSE) due to their robustness. The loss $(\mathcal{L})_{b,k}$ from Eq. (5.1) is therefore given by

$$(\mathcal{L})_{b,k} = \underbrace{\left(Y''_{\mathrm{true}} - Y''_{\mathrm{pred}}\right)^2_{b,k}}_{(\mathcal{L}_{\mathrm{pred}})_{b,k}} + \underbrace{\left[\left(Y''_{\mathrm{unc}}\right)_{b,k} - \mathrm{gradient\_stop}\left(\left|\left(Y''_{\mathrm{true}} - Y''_{\mathrm{pred}}\right)_{b,k}\right|\right)\right]^2}_{(\mathcal{L}_{\mathrm{unc}})_{b,k}}, \quad (5.6)$$

where $\mathcal{L}$, $Y''_{\mathrm{true}}$, $Y''_{\mathrm{pred}}$, and $Y''_{\mathrm{unc}}$ are $B \times K$ matrices with the number of labels $K$ and the number of events $B$ in a batch. The true normalized labels are given by $Y''_{\mathrm{true}}$ (see Eq. (4.2)). $Y''_{\mathrm{pred}}$ is the normalized network prediction and $Y''_{\mathrm{unc}}$ is the normalized output of the uncertainty-estimating sub-network. The function $\mathrm{gradient\_stop}(x)$ is an identity operation that effectively treats $x$ as a constant during backpropagation. Optimization of $\mathcal{L}_{\mathrm{unc}}$ therefore only affects the weights of the uncertainty-estimating sub-network due to the applied gradient-stop. Gradients are not propagated through to the main neural network architecture.

In later training stages, once the loss starts to converge, i.e. the improvement in loss per training step starts to flatten out, the neural network's reconstruction and uncertainty estimation are robust enough for the application of the more sensitive Gaussian likelihood (GL). In this case, the loss changes to

$$(\mathcal{L})_{b,k} = 2 \cdot \ln\left(\left(Y''_{\mathrm{unc}}\right)_{b,k}\right) + \left(\frac{\left(Y''_{\mathrm{true}} - Y''_{\mathrm{pred}}\right)_{b,k}}{\left(Y''_{\mathrm{unc}}\right)_{b,k}}\right)^2. \quad (5.7)$$

The loss function $\mathcal{L}_{\mathrm{scalar}}$ from Eq. (5.3) is minimized with the ADAM optimizer [27] within the
TensorFlow framework [24]. The general learning scheme is to start training with a high dropout rate, forcing the network to learn robust features. Over time, the learning and dropout rates are reduced. During early stages of training, the neural network is also trained on systematic variations of the baseline dataset to promote robustness towards uncertainties in the MC simulation. Training is performed in a process of multiple steps as depicted in Tab. 4 and takes about five days on an NVIDIA Tesla P40 GPU.
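The two training losses, the MSE-based loss of Eq. (5.6) and the Gaussian likelihood of Eq. (5.7), together with the importance weighting of Eqs. (5.1) and (5.3), can be sketched as follows (a numpy stand-in with names of our choosing; in the actual TensorFlow implementation the absolute residual in the uncertainty term is wrapped in tf.stop_gradient):

```python
import numpy as np

def mse_loss(y_true, y_pred, y_unc):
    """Per-event, per-label MSE loss of Eq. (5.6): prediction term plus
    uncertainty term. In TensorFlow, np.abs(...) would be wrapped in a
    gradient stop so that L_unc only trains the uncertainty sub-network."""
    resid = y_true - y_pred
    return resid**2 + (y_unc - np.abs(resid))**2

def gaussian_loss(y_true, y_pred, y_unc):
    """Per-event, per-label Gaussian likelihood loss of Eq. (5.7)."""
    return 2.0 * np.log(y_unc) + ((y_true - y_pred) / y_unc)**2

def scalar_loss(loss_bk, importance):
    """Importance-weighted scalar loss of Eqs. (5.1) and (5.3) for a
    B x K per-event loss matrix and a length-K importance vector."""
    B, K = loss_bk.shape
    return (importance[None, :] * loss_bk).sum() / (B * K)
```
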
Although the CNN architecture is a versatile tool, emphasis is put on the performance for cascade-like events for the purpose of this paper. The architecture can easily be used to reconstruct other event topologies or labels of interest. Minor changes to the loss function and to the activation of the last layer can be applied to use the architecture for classification tasks. In the following sections, the performance of the described CNN architecture is compared to the current standard cascade reconstruction method in IceCube (see Section 2.3 for details). Results shown in this paper are obtained for the cascade event selection described in Ref. [10], which aims to select high-energy events in the TeV range and above resulting from charged-current 𝜈𝑒 and 𝜈̄𝑒 as well as neutral-current interactions of all neutrino types. Wherever applicable, results shown for the CNN-based method are denoted by an icon and results for the standard (approximate) likelihood-based reconstruction are denoted by an L symbol.

Because the statistical power of searches for neutrino sources depends upon the ability to localize the origin of neutrino candidates, improving the precision of directional reconstruction is of great importance. In order to evaluate the precision of the directional reconstruction, the metric angular resolution is introduced, which is defined as the median opening angle between the true and
reconstructed direction. The relative contribution of astrophysical neutrinos steadily increases with energy. At lower energies, the neutrino candidates in the event sample are dominated by atmospheric background events. The CNN-based method described in this paper is able to improve the angular resolution of neutrinos above 10 TeV. Fig. 8 compares the angular resolution of the standard reconstruction method and the newly developed CNN-based method. At higher energies, the resolution can be improved by up to 50%. Using the CNN for reconstruction affords improved sensitivity at higher energies compared to standard reconstruction methods. In combination with additional years of data, this results in an improved point-source sensitivity of up to a factor of four in the southern sky [10] compared to previous work based on standard reconstruction methods [34].

Figure 8: The angular resolution is shown as a function of neutrino energy for the standard reconstruction method and the newly developed CNN-based method. The shaded area and lines denote the 20%, 50%, and 80% quantiles. At higher energies, the resolution can be improved by up to 50%. Systematic uncertainties are not included.
The primary neutrino energy cannot be directly measured in IceCube, but it can be inferred from the deposited energy in the detector. For charged-current cascade-like events, the deposited energy is an excellent proxy for the true neutrino energy, as detailed in Ref. [19]. Neutral-current events are more ambiguous due to large fluctuations in the fraction of the neutrino energy transferred to the target nucleus. Further, smaller uncertainties arise from fluctuations in the light yield of hadronic cascades. Therefore, cascade reconstruction methods reconstruct the electromagnetic (EM) equivalent deposited energy. This is the energy of an EM cascade that produces the same number of Cherenkov photons as the observed shower.

The neural network is trained on this EM-equivalent energy. For hadronic cascades, the average expected EM-equivalent energy is chosen. Overall, the energy resolutions of the standard reconstruction and the CNN-based method are comparable, as shown in Fig. 9. There are slight biases visible in the CNN estimate at the lowest and highest energies. These are artifacts induced by the boundaries of the training dataset. If necessary, the CNN estimate can be calibrated to remove this bias. In comparison to the standard reconstruction, the predictions of the CNN are more robust and have fewer outliers. There is a considerable smearing visible in the correlation plot for the standard reconstruction method at higher energies. (The average energy resolutions shown in the bottom panels of Fig. 9 are 0.126 and 0.106.)
Figure 9: Correlation plots between the true and reconstructed deposited energy are shown for the standard reconstruction (left) and the CNN (right). Mean 𝜇 and standard deviation 𝜎 of the residuals $y_{\text{pred}} - y_{\text{true}}$ are determined in log-space, as well as the Pearson 𝜌 and Spearman $r_s$ correlation coefficients. The energy resolution in the bottom panel is calculated according to Ref. [19].

For the application of a reconstruction method in analyses, it is necessary to have a measure of the uncertainty of a reconstructed quantity. As described in Sections 4.6 and 5.3, the model architecture estimates a per-event uncertainty on each of its reconstructed quantities via an assumed Gaussian likelihood. This works well if the residuals $\Delta y = y_{\text{pred}} - y_{\text{true}}$ follow a Gaussian distribution with a given event-dependent standard deviation. In many cases, the assumed Gaussian likelihood is a good fit, as illustrated in Figs. 10 and 11, which show the pull distribution and the coverage of the estimated uncertainties for the ensemble of events, respectively.

A correct uncertainty estimator for Gaussian residuals will produce a pull distribution that converges to a normal distribution in the limit of infinite statistics. The pull distributions shown in Fig. 10 are reasonably well described by a normal distribution, indicating correct uncertainty estimates. There are slight deviations visible in the far tail of the distribution, which are caused by rare outlier events. These outliers might be reduced through additional training iterations or by a biased sampling strategy during training that enriches the training batches with mis-reconstructed events.

Another way of validating the estimated uncertainty is by checking its coverage. Based on the assumption of a Gaussian distribution with a width as estimated by the neural network, the number of events that should lie within a certain range around the prediction can be calculated. This number can then be compared to how many events actually fall inside.
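Such a coverage check can be sketched as follows (helper names are ours): for each central quantile $q$, count the fraction of events whose residual falls within the corresponding $\pm z \cdot \sigma$ interval of the estimated Gaussian.

```python
import numpy as np
from statistics import NormalDist

def empirical_coverage(residuals, sigmas, q):
    """Fraction of events whose residual lies inside the central
    quantile q of a Gaussian with the per-event estimated width."""
    z = NormalDist().inv_cdf(0.5 + q / 2.0)  # half-width in units of sigma
    return np.mean(np.abs(residuals) <= z * sigmas)

# Toy check: perfectly estimated Gaussian uncertainties give a
# coverage curve on the diagonal (cf. Fig. 11).
rng = np.random.default_rng(1)
sigma = rng.uniform(0.5, 2.0, size=200_000)   # per-event widths
resid = rng.standard_normal(200_000) * sigma  # Gaussian residuals
cov = empirical_coverage(resid, sigma, 0.68)  # close to 0.68
```
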
For an accurate uncertainty measure, this should result in a one-to-one relation, described by the dashed diagonal in Fig. 11. The uncertainty estimates on all reconstructed quantities achieve excellent coverage on the ensemble, which deviates only by a few percent from the case of perfect coverage.

Figure 10: The pull distributions are shown for the reconstructed quantities in comparison to the normal distribution, which indicates a good uncertainty estimator. These match well, apart from slight deviations in the tails of the distribution.

Figure 11: The coverage, i.e. the number of events that fall in a certain quantile, is shown as a function of the computed central quantile based on the uncertainty estimate. Perfect coverage is obtained on the dashed diagonal. The bottom panel shows the difference in coverage of each reconstructed quantity compared to the perfect one-to-one relation.

As detailed in Section 4.6, the CNN estimates the uncertainty on each reconstructed quantity independently while disregarding potential correlations. In order to estimate the angular uncertainty, which is more closely related to the angular resolution defined as the median opening angle ΔΨ between reconstructed and true direction, either the uncertainties on the direction vector components ($\sigma_{d_x}$, $\sigma_{d_y}$, $\sigma_{d_z}$) or the uncertainties on the angles, $\sigma_\Theta$ and $\sigma_\Phi$, have to be combined. Here, the following prescription is chosen:

$$\sigma_{\Delta\Psi} = \sqrt{\sigma_\Theta^2 + (\sigma_\Phi \cdot \sin\Theta_{\text{CNN}})^2}, \quad (6.1)$$

which serves as an estimate $\sigma_{\Delta\Psi}$ for the angular resolution. $\Theta_{\text{CNN}}$ in Eq. (6.1) is the reconstructed zenith angle by the CNN.
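Eq. (6.1), together with the angle conversion of Eqs. (4.6) and (4.7), can be sketched as follows (function names are ours):

```python
import numpy as np

def to_angles(d):
    """Zenith and azimuth of the particle origin from the (normalized)
    travel-direction vector, cf. Eqs. (4.6) and (4.7)."""
    dx, dy, dz = d
    zenith = np.arccos(-dz)
    azimuth = np.arctan2(-dy, -dx) % (2.0 * np.pi)
    return zenith, azimuth

def circularized_sigma(sigma_zenith, sigma_azimuth, zenith):
    """Combined angular uncertainty estimate of Eq. (6.1)."""
    return np.sqrt(sigma_zenith**2 + (sigma_azimuth * np.sin(zenith))**2)

# A particle travelling in the -x direction originates from azimuth 0
# on the horizon (zenith 90 degrees):
theta, phi = to_angles((-1.0, 0.0, 0.0))
sigma = circularized_sigma(0.03, 0.04, theta)  # sqrt(0.03^2 + 0.04^2)
```
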
This “circularized” uncertainty estimate assumes that the uncertainty on the direction reconstruction may be approximated by a circle in zenith-azimuth space. If correlations exist, or if the uncertainties in the reconstructed zenith and azimuth angles are on different size scales, this simplified assumption will lead to under- or over-coverage. Fig. 11 shows that the “circularized” uncertainty estimate 𝜎_ΔΨ obtains correct coverage on average for the ensemble. The coverage for 𝜎_ΔΨ is computed assuming a Rayleigh distribution.

Contrary to maximum likelihood estimation, which can evaluate the local curvature of the likelihood to estimate per-event uncertainties, the neural network is a data-driven approach. As such, it relies on proper coverage of the phase space by the training data. Areas of the phase space that are less populated by training data might therefore lead to biases and result in over- or under-coverage of the estimated uncertainty. Figs. 10 and 11 show that the CNN-based uncertainties obtain correct coverage on average for the ensemble of events.
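Eq. (6.1) and the Rayleigh-based coverage check can be illustrated with a small-angle toy model. All widths below, and the choice of Rayleigh scale 𝜎_ΔΨ/√2 (exact only when both error components are equal), are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000

# Per-event toy uncertainties on zenith and azimuth (radians, illustrative).
sigma_zen = rng.uniform(0.05, 0.15, size=n)
sigma_azi = rng.uniform(0.08, 0.25, size=n)
zen = rng.uniform(0.5, np.pi - 0.5, size=n)        # reconstructed zenith

# "Circularized" uncertainty, Eq. (6.1): quadrature sum of the zenith error
# and the azimuth error projected onto the sphere.
sigma_psi = np.sqrt(sigma_zen**2 + (sigma_azi * np.sin(zen))**2)

# Small-angle toy model: independent Gaussian errors in the two tangent-plane
# directions; their quadrature sum is then approximately Rayleigh-distributed.
d_zen = rng.normal(0.0, sigma_zen)
d_azi = rng.normal(0.0, sigma_azi * np.sin(zen))
psi = np.hypot(d_zen, d_azi)                       # opening angle (approx.)

# Coverage check against a Rayleigh distribution.  Using scale = sigma_psi /
# sqrt(2) is an assumed convention; unequal components cause mild
# under- or over-coverage, as noted in the text.
for q in (0.5, 0.68, 0.9):
    bound = stats.rayleigh.ppf(q, scale=sigma_psi / np.sqrt(2))
    print(f"quantile {q:.0%}: observed coverage {np.mean(psi < bound):.3f}")
```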
Figure 12: Per-event runtimes are shown as a survival function, i.e. the fraction of events exceeding a specified runtime. In contrast to the standard reconstruction, the CNN-based method is able to run on multiple cores as well as on a GPU. Runtimes shown for the CNN are for application on a GPU. The CNN outperforms the standard method by 2-3 orders of magnitude.

The results shown here are only for statistical uncertainties. Known systematic uncertainties can, however, be included by adding training examples from systematic datasets throughout the entire training process. To obtain proper coverage, the training events must be sampled or weighted according to the assumed priors on the systematics. Ideally, the training events are sampled from a continuous distribution of systematic variations. This can be achieved by application of the SnowStorm method described in Ref. [35]. The CNN presented here only includes events from the systematic dataset in early stages of the training to promote robustness. Later training steps are only performed on the baseline dataset. The neural network is therefore trained to estimate the statistical uncertainty, whereas systematic uncertainties are not included in the prediction.
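The mixing of baseline and systematic-variation events into training batches can be sketched as follows. The function names, the 25 % mixing fraction, and the uniform ±10 % priors are hypothetical; the actual continuous-variation procedure (SnowStorm) is described in Ref. [35]:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_systematics():
    """Draw nuisance parameters from assumed priors (illustrative ranges)."""
    return {
        "absorption_scale": rng.uniform(0.9, 1.1),
        "scattering_scale": rng.uniform(0.9, 1.1),
    }

def make_training_batch(baseline_events, systematic_events, sys_fraction):
    """Mix baseline and systematic-variation events into one batch.

    In the paper, systematic datasets are only included in early training
    steps; later steps effectively use sys_fraction = 0 (baseline only).
    """
    n = len(baseline_events)
    n_sys = int(round(sys_fraction * n))
    batch = np.concatenate([
        rng.choice(systematic_events, size=n_sys),
        rng.choice(baseline_events, size=n - n_sys),
    ])
    rng.shuffle(batch)
    return batch

baseline = list(range(1000))            # stand-ins for simulated events
systematic = list(range(1000, 2000))
batch = make_training_batch(baseline, systematic, sys_fraction=0.25)
print(len(batch), int(np.sum(batch >= 1000)))  # 1000 events, 250 systematic
```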
A key advantage of and motivation for the CNN-based reconstruction method is the intended application within IceCube’s real-time system, as mentioned in Section 1. Strict hardware requirements apply to reconstructions run at the South Pole due to limited resources on-site. In addition to the requirement of a fast per-event runtime, the reconstruction methods should complete in a foreseeable and ideally constant time in order to prevent pileup. For most advanced likelihood-based reconstruction methods, this can pose a considerable challenge. The maximization of the likelihood can be prohibitively expensive and, more importantly, its runtime may vary over many orders of magnitude, as indicated by the width of the shaded area in Fig. 12.

Figure 12 compares the per-event runtime of the CNN-based method (blue) and the standard reconstruction (gray). The standard reconstruction method, which depends on the maximization of an approximative likelihood, has a wide distribution of runtimes spanning nearly four orders of magnitude. With a median runtime of about 10 s, the standard reconstruction can be applied offline, but it is too slow to use on-site. In comparison, the CNN-based method finishes in less than 24 ms for half of all reconstructed events.

The neural network performs a fixed amount of computation, defined by the chosen architecture and independent of the event parameters. Therefore, the time necessary for one forward pass is nearly constant and takes about 6 ms on an NVIDIA GTX 980. Using the CNN on a CPU increases the runtime compared to a GPU by a factor of ~60. The bottleneck of the CNN-based method, if run on a GPU, is the data preparation step, which includes the calculation of the DOM summary statistics (Section 3.2). This step also introduces a dependence on the number of detected pulses in an event and is responsible for the widening of the runtime distribution shown in Fig. 12.

Although the runtime of the CNN-based method is fast enough, it can be further reduced through optimization of the input pipeline or by choosing less computing-intensive input variables. An end-to-end solution starting directly from the measured pulses as described in Section 3.2 may also decrease the runtime. If run on a CPU, the bottleneck shifts from the data preparation step to the forward pass through the network. In this case, the size of the network architecture can be decreased while only minimally affecting its performance.
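The survival-function comparison of Fig. 12 can be reproduced on toy runtime distributions. The Gaussian and log-normal shapes and their spreads are illustrative assumptions; only the quoted medians of ~24 ms and ~10 s are taken from the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy per-event runtimes in seconds: a narrow distribution for the CNN and a
# broad log-normal one for the likelihood-based standard method.
cnn = rng.normal(0.024, 0.005, size=10_000).clip(min=0.006)
standard = rng.lognormal(mean=np.log(10.0), sigma=1.8, size=10_000)

def survival(runtimes, t):
    """Fraction of events whose runtime exceeds t (the curve of Fig. 12)."""
    return np.mean(runtimes > t)

for t in (0.024, 1.0, 10.0):
    print(f"t = {t:6.3f} s   CNN: {survival(cnn, t):.3f}   "
          f"standard: {survival(standard, t):.3f}")

print("median speed-up:", np.median(standard) / np.median(cnn))
```

With these toy parameters, the median speed-up comes out in the hundreds, i.e. within the 2-3 orders of magnitude stated in the caption of Fig. 12.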
In the previous section, the CNN-based method was shown to offer comparable, and for some tasks even improved, performance relative to the standard cascade reconstruction method in IceCube. However, the performance plots shown are for the MC baseline dataset. Limits in our knowledge of the detector and the underlying physics result in systematic uncertainties in the MC simulation. Some of the sources of systematics are known, such as the scattering and absorption coefficients of the glacial ice. The effect of these systematic uncertainties can be studied in dedicated simulations for which these parameters are varied. More challenging are unknown systematics that cannot be simulated and explicitly tested. A key quality parameter of a reconstruction method is therefore its data/MC agreement and the robustness of the method towards possible uncertainties in the MC simulation. In the context of this paper, the robustness of a reconstruction method is defined as its insensitivity towards changes in the input data. Note that a robust method does not necessarily have to be an accurate one; on the contrary, there is often a trade-off between reconstruction accuracy and robustness. In the following sections, the data/MC agreement of the presented reconstruction method is investigated and its robustness tested against potential mis-modeling in the simulation.
A correct MC simulation would exhibit no differences to experimental data; in practice, however, the simplified model in the simulation will not perfectly describe the data. While a simulation cannot accurately describe all low-level physics, it is important that the resulting high-level variables, from which physics results are derived, are well described. Reconstruction methods produce such high-level variables and must therefore be robust towards low-level disagreements in the simulation. The data/MC agreement of a given reconstruction method can be quantified by comparing distributions of analysis-level parameters.

The agreement between simulated and experimental data for the distributions of the reconstructed zenith, azimuth, and deposited energy is investigated in Fig. 13. On the left side, the distributions are compared for the standard reconstruction method and on the right for the CNN. The distributions and their agreement are comparable between the two reconstruction methods. Some smaller fluctuations exist, but overall, the simulation and the reconstructed quantities describe the experimental data well.

Figure 13: The simulated and measured distributions of the reconstructed quantities (deposited energy, azimuth, and zenith) are shown; the left side shows the standard reconstruction and the right side the CNN-based method. Each plot is divided into two panels showing the number of events (top panel, with 68.0%, 90.0%, and 99.0% expectation intervals) and the significance of the fluctuations (bottom panel) in each bin. The significance calculation is based on the assumption of Poisson-distributed data and does not include systematic uncertainties.

The comparisons of the one-dimensional distributions in Fig. 13 are blind to some classes of multivariate disagreements. In principle, higher-dimensional distributions can be compared, but the data become too sparse once the dimensionality reaches a certain level. An alternative approach is to employ a multivariate classifier to distinguish data from simulation [36, 37]. A random forest (RF) [38] classifier from the scikit-learn [39] package is trained in a 5-fold cross-validation. To quantify any mis-match found, the area under curve (AUC) metric of the receiver-operating curve is used. Ideally, the classifier should not be able to tell the difference between simulated and experimental data, i.e. the AUC should be around 0.5. If it is capable of separating the classes (AUC > 0.5), this indicates a disagreement between data and simulation.

Figure 14: A random forest (RF) is trained in a 5-fold cross-validation to distinguish data events from simulated ones on the basis of different sets of features (CNN, Standard, Combined, and Test). The receiver-operating curves are shown for each set of features. The area under this curve (AUC) is a metric for the classifier performance: a value of 1.0 corresponds to perfect separation, while 0.5 corresponds to random guessing. Events are not distinguishable on the basis of the reconstructed values for either of the reconstruction methods.

To validate this method, artificial test variables Var_{𝑥,𝑐} and Var_{𝑦,𝑐} are added, which are drawn from a bivariate normal distribution with a covariance of +𝑐 for data events and −𝑐 for simulated events. The one-dimensional distributions of these variables for data and simulated events are indistinguishable by construction; only a multivariate approach is able to find the disagreement. The resulting receiver-operating curve is shown in Fig. 14, labeled as "Test". As expected, the RF is able to distinguish the events based on the added variables, which contain mis-matches in correlated variables. The resulting feature importances also correctly order the artificial features according to their degree of mis-match, as shown on the left of Tab. 1.

The output features of each reconstruction method are tested independently as well as in a combined set. The resulting receiver-operating curves and feature importances are shown in Fig. 14 and on the right of Tab. 1. Apart from the method-test case, the RF is not capable of significantly distinguishing simulation from experimental data, boosting confidence in the baseline simulation and the reconstruction methods. Both reconstruction methods provide reasonable data/MC agreement, which indicates a certain level of robustness towards potential mis-modeling in the simulation.

Table 1: Feature importances obtained from the RF, trained to distinguish simulated from experimental data, are shown for each of the four sets of features: the test case, in which artificial features Var_{𝑥/𝑦,𝑐} with varying degrees of mis-match are added (left), and the CNN-only, standard-reconstruction-only, and combined feature sets (right, from top to bottom). The uncertainty on each quoted AUC score is ±0.02.

The accurate description of the detector medium is a challenging task for simulations in IceCube [40]. Ice properties affect how photons scatter and are absorbed in the detector medium.

In addition to the varied ice properties in Section 7.2, further unknown sources of systematics may exist. While the effect of these unknown systematics cannot be studied directly, the simulation can be altered in various ways to investigate how a reconstruction method reacts to certain modifications. In this study, several modifications are applied to the measured pulses in order to understand the reconstruction’s dependence on single DOMs, its sensitivity to timing uncertainties, and the effect of an increased noise rate.

The reconstruction’s dependence on single DOMs is investigated by discarding these prior to the reconstruction. Entries in the input data tensor 𝑋 that correspond to discarded DOMs are filled with zeros. Three different variations are tested: discarding the top 5 and top 10 DOMs with the most collected charge, and randomly discarding 10% of all DOMs. The top panel of Fig. 15b shows the results of this test. The more light a DOM captures, the more information it carries for the reconstruction task. Thus, high-charge DOMs (DOMs with the most collected charge in an event) are crucial to the reconstruction. However, if the reconstruction relies solely, or to a large extent, on a single high-charge DOM, it is more susceptible to mis-modeling in the simulation: it is challenging to accurately describe the local ice properties around a DOM, whereas averages over many DOMs are much more robust. As shown in the top panel of Fig. 15b, both reconstruction methods show a similar energy-dependent behaviour. Discarding the highest-charge DOMs negatively impacts the angular resolution at lower energies, while it becomes less important at higher energies. This is because saturated and overly bright DOMs are automatically excluded from the reconstruction.
This applies to both reconstruction methods, which are therefore, by construction, fairly robust to this systematic at higher energies. For lower-energy events, the number of hit and saturated DOMs is reduced, and hence the exclusion of the 𝑛 highest-charge DOMs becomes more and more relevant as the fraction of removed hit DOMs increases. Randomly discarding DOMs (light blue curve in the top panel) results in a slightly worse angular resolution over the complete energy range with a minimal energy dependence. While both reconstruction methods are similarly affected by the exclusion of DOMs, the standard method is more sensitive to single high-charge DOMs, which is driven by the underlying Poisson likelihood.

Figure 15: The median angular resolution is compared between the baseline simulation ΔΨ_base,50% and the systematic variation ΔΨ_sys,50% as a function of neutrino energy E_ν for the standard reconstruction method (left) and the CNN-based method (right). Panel (a) shows variations of the ice properties (absorption +10%, scattering +10%, absorption & scattering, hole-ice variation); panel (b) shows pulse modifications (discarding the top 5 DOMs, the top 10 DOMs, or a random 10% of DOMs; smearing pulse times with σ = 5 ns or σ = 20 ns; shifting the first pulse time; and noise increases of +200%, +500%, and +1000%). A ratio close to 1.0 indicates a robust reconstruction method that is insensitive to the varied systematic parameter. For each systematic variation, the data/MC test via the RF (Section 7.1) is performed; the resulting AUC for each variation is shown in brackets in the legend.

Unknown systematic uncertainties in the simulation might also affect the arrival times of photons at the photomultipliers. To investigate this, the pulse times are smeared by a Gaussian with a width of 5 ns and 20 ns. In comparison, IceCube’s timing resolution is on the order of a few ns [14, 18]. Moreover, the time of the first pulse at a DOM is particularly important for directional reconstruction, since it is likely to come from a photon that has scattered the least, i.e. one that carries a lot of information on its origin. By shifting it by 20 ns, it can be tested how much the reconstruction relies on this time. The results are plotted in the middle panel of Fig. 15b. Smearing the pulse times by 5 ns, which is larger than IceCube’s resolution, barely has an effect on the angular resolution. Interestingly, the standard method seems to rely more on timing information at lower energies than the CNN-based method. At energies around a few TeV and lower, the variations of the pulse times have almost no effect on the CNN angular reconstruction. This indicates that the CNN-based method is not using the timing information to the full extent possible. Timing information is more important for the reconstruction of low-energy cascades, as opposed to high-energy cascades for which the charge asymmetry is a good measure of the cascade direction.
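The two timing modifications of this robustness test can be expressed compactly. The pulse series and function names below are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(4)

def smear_pulse_times(times, sigma_ns):
    """Gaussian smearing of all pulse times, as in the robustness test."""
    return times + rng.normal(0.0, sigma_ns, size=times.shape)

def shift_first_pulse(times, shift_ns):
    """Shift only the earliest pulse, which likely stems from the
    least-scattered photon and therefore carries the most directional
    information."""
    shifted = times.copy()
    shifted[np.argmin(times)] += shift_ns
    return shifted

# Toy pulse series at one DOM (nanoseconds, illustrative values).
pulse_times = np.array([105.0, 121.5, 134.0, 180.2])
print(smear_pulse_times(pulse_times, sigma_ns=5.0))
print(shift_first_pulse(pulse_times, shift_ns=20.0))  # first entry -> 125.0
```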
This might explain why the standard reconstruction method achieves a significantly better resolution at lower energies.

Neural networks can be manipulated into producing wrong results by adding subtle noise to images through evolutionary algorithms or gradient ascent [42, 43]. This vulnerability is of great concern in security-critical applications such as autonomous driving. While explicit attacks are not a concern for the application in IceCube, potential mis-modeling in the noise simulation could constitute a vulnerability. Instead of explicitly attempting to manipulate the CNN, the effect of an increased overall noise rate in the detector is studied. For this study, the noise rate is increased by 200%, 500%, and 1000%. Note that the values studied here are purely hypothetical and much larger than the actual uncertainties on the noise rate, which are on the order of only a few percent; the noise rate in IceCube is well monitored and stable [18]. As previously mentioned, the standard method internally relies on a Poisson likelihood that is driven by high-charge DOMs. A uniformly increased noise rate therefore barely has any effect: the standard reconstruction method is robust towards the increased noise rate. In contrast, the CNN is sensitive to this modification. In agreement with the previous findings, it seems that the CNN-based method focuses less on single DOMs and more on the overall distribution. Increasing the noise rate to the hypothetical case of an additional +1000% yields an AUC score (uncertainty: ±0.02) in the data/MC test well above 0.5. Hence, the RF classifier is clearly able to distinguish between data and simulated events. This increases confidence that unaccounted systematics would be detected by the data/MC test if they have a significant impact on the reconstructed values.
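The RF-based data/MC test of Section 7.1 can be sketched with scikit-learn. The dataset below is synthetic: the correlated pair mimics the artificial Var_{𝑥,𝑐} / Var_{𝑦,𝑐} test variables with covariance ±𝑐, and the value 𝑐 = 0.8, the sample sizes, and the RF settings are all assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n = 2000

def sample(label, c=0.8):
    """Toy 'reconstruction outputs' for data (1) and simulation (0).

    The marginals of all features agree; only the correlation of the
    (Var_x, Var_y) pair differs, so any mis-match is purely multivariate.
    """
    s = c if label else -c
    var_xy = rng.multivariate_normal([0.0, 0.0], [[1.0, s], [s, 1.0]], size=n)
    energy = rng.normal(0.0, 1.0, size=(n, 1))     # identical for both classes
    return np.hstack([energy, var_xy])

X = np.vstack([sample(0), sample(1)])
y = np.repeat([0, 1], n)

# 5-fold cross-validation, as in the paper's data/MC test.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_predict(rf, X, y, cv=cv, method="predict_proba")[:, 1]
print(f"AUC = {roc_auc_score(y, scores):.3f}")     # ~0.5 means indistinguishable
```

As in the "Test" curve of Fig. 14, the classifier finds the hidden correlation mis-match and the AUC rises well above 0.5, even though every one-dimensional distribution agrees by construction.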
As described in Section 3, the dimensionality of the input data must be reduced and transformed in order to be utilized by a standard CNN architecture. Hence, the measured pulses are summarized in nine input features per DOM. This effectively removes the time dimension and results in a loss of information. Quantifying the impact of this information loss, as well as the relative importance of each of the input features, requires extensive tests that are not further pursued in the context of this paper. The results shown in Section 7.3 do, however, indicate that there are deficiencies: the neural network is less sensitive than the standard reconstruction method to time-based variations, as shown in the middle panel of Fig. 15b. This could mean that the neural network is not able to exploit the necessary timing information with the provided data input format and network architecture.

Further limitations are possible due to the choice of network architecture. A standard CNN architecture is chosen, with only minor modifications as detailed in Section 4. More advanced CNN architectures exist that could be applied. For this work, no extensive hyper-parameter searches were performed; a more suitable and optimized CNN architecture will therefore likely result in further performance gains. Apart from CNN architectures, other types of neural networks may be investigated. Recurrent neural networks are a natural choice given the data representation as a pulse series in time. Graph neural networks, as utilized in Ref. [12], can handle the geometry of the detector more adequately. Nonetheless, none of the above-mentioned architectures are capable of exploiting all available domain knowledge (knowledge that is specific to the application at hand). The underlying physics of the neutrino interaction is invariant under translation and rotation in space. The CNN architecture attempts to utilize the translational symmetry through the application of convolutions.
However, due to the detector layout and the inhomogeneities of the dust concentration in the ice, these symmetries are only approximately valid in the measured data. Translational invariance in time is utilized to a certain degree by the use of relative, as opposed to absolute, timing information. Available domain knowledge extends beyond the previously mentioned symmetries, but it cannot be directly incorporated into the network architecture. Instead, the neural network relies on learning these relations in a data-driven approach, which limits its capabilities. In contrast, methods based on maximum likelihood estimation can include domain knowledge in the likelihood prescription.
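The per-DOM summarization discussed at the beginning of this section can be sketched as follows. The paper uses nine features per DOM, defined in its Section 3.2 and not reproduced here; the statistics below (total charge, timing quantiles, and moments) are plausible stand-ins, not the actual feature set:

```python
import numpy as np

def dom_summary_features(times, charges):
    """Reduce a DOM's pulse series to a fixed-size feature vector.

    Illustrative stand-ins for the paper's nine summary statistics; DOMs
    without pulses map to zeros, matching the zero-filled input tensor.
    """
    if len(times) == 0:
        return np.zeros(6)
    total = charges.sum()
    w = charges / total
    mean_t = np.dot(w, times)
    return np.array([
        total,                                    # total collected charge
        times.min(),                              # time of first pulse
        np.interp(0.5, np.cumsum(w), times),      # charge-weighted median time
        mean_t,                                   # charge-weighted mean time
        np.sqrt(np.dot(w, (times - mean_t)**2)),  # charge-weighted time spread
        charges.max(),                            # largest single pulse
    ])

t = np.array([105.0, 121.5, 134.0, 180.2])
q = np.array([1.2, 0.8, 0.5, 0.1])
print(dom_summary_features(t, q))
```

Whatever the exact choice of statistics, the reduction is lossy: two pulse series with different fine time structure can map to the same feature vector, which is the information loss discussed above.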
The CNN-based reconstruction method presented in this paper is a versatile tool that complements, and in certain parameter spaces improves on, state-of-the-art reconstruction methods in IceCube. It is robust and insensitive to reasonable systematic variations. A data/MC agreement test via a multivariate classifier can be employed to detect potential unknown and unaccounted systematics, further increasing confidence in the developed method. The presented method is therefore a viable alternative to traditional reconstruction methods in IceCube. Due to the significantly improved pointing resolution for cascade-like events above 10 TeV, its application can considerably improve IceCube’s sensitivity in the southern sky [10]. In addition, the CNN’s speed and robustness, the low number of outliers and mis-reconstructed events, as well as the available per-event uncertainty estimate make this method attractive for application in the real-time system.

Despite its success, the CNN-based reconstruction has its limitations. In the current implementation, the IceCube detector has to be divided into three sub-arrays, which are processed independently and only combined at a later stage. The employed convolutional layers do not optimally use the available timing information and cannot naturally handle the geometry of the detector, which limits the application of convolutions along the z-axis for the DeepCore sub-arrays. In addition, the convolutions in the 𝑥 and 𝑦 directions over the main array are only approximately valid due to the distortions in the detector grid. Similarly, the convolution in the 𝑧-direction is an approximation. Although the underlying physics of a neutrino interaction is invariant under a translation in space, the measured light yield in the detector breaks this symmetry due to the detector response and the inhomogeneity of the dust concentration in the ice.
Apart from translational invariance, other symmetries and prior information exist which are yet to be exploited.

Future developments will have to overcome these limitations. The field of geometric deep learning [44] can help in the correct description of IceCube’s irregular geometry. However, in order to become competitive with the most advanced maximum-likelihood estimation (MLE) methods, the exploitation of invariances and prior information must be extended. Tailored reconstruction methods that combine the benefits of deep learning and MLE must be developed.

A Neural Network Architecture and Training Details
Details of the neural network architecture and training procedure are provided in this section. The chosen architecture is inspired by common CNN architectures such as AlexNet [9], ResNet [31], VGG [45], and Inception [46]. In addition, the number of layers, filters, and kernel shapes is chosen to strike a balance between computational complexity and expressiveness of the model. Compared to the previously mentioned CNN architectures, which have between 6 and 138 million free parameters, the CNN presented here, with roughly 6 million parameters, is small for the given data complexity. In combination with the large training dataset, this results in a reduced need for regularization, as described in Section 4.5.

Table 2: Details on the number and shape of convolution kernels (filters) and downsampling via max-pooling are provided for the convolutional layers over the main IceCube array (left), upper DeepCore (top right), and lower DeepCore (lower right). Convolutions over the main IceCube array use aligned hexagonally shaped kernels as described in Section 4.1. The provided filter and pooling shapes correspond to the dimensions of the orthogonal grid. All layers apply residual additions and use a dropout rate of 𝑑.

Table 3: The number of nodes per layer, the usage of residual additions (Res. Add.), and the applied activation function are shown for the fully-connected layers of the prediction (left) and uncertainty (right) sub-networks. A dropout rate of 𝑑 is applied to all but the output layer. The precursor architecture, as applied in Ref. [10], does not use layer 2.

Table 4: Eight training steps, as detailed below, are performed to train the CNN. A batch size of 32 is used throughout the entire training procedure. The loss target specifies with respect to which sub-networks (prediction/uncertainty or both) the loss function L_scalar is optimized. See Section 4.6 for the definition of the dropout rates 𝑑ᵢ. The remaining table rows list, per training step, the number of iterations, the learning rate, the dropout rates, and the loss weights for azimuth, zenith, energy, and the direction components x, y, and z.

    Training Step    1      2      3      4      5      6      7      8
    Loss Function    MSE    MSE    MSE    MSE    GL     GL     GL     GL
    Loss Target      both   both   both   both   unc    unc    both   both
    Include Sys.     True   True   True   False  False  False  False  False

Acknowledgments
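The training schedule in Table 4 switches from a mean-squared-error (MSE) loss to a Gaussian-likelihood (GL) loss once the uncertainty sub-network comes into play. This can be sketched with a generic GL form (the exact L_scalar used in the paper is defined in its Section 4.6; the function names and toy data here are ours):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error, used in the early training steps (Table 4)."""
    return np.mean((y_pred - y_true) ** 2)

def gaussian_likelihood_loss(y_true, y_pred, sigma_pred):
    """Negative log-likelihood of a Gaussian with per-event width sigma_pred
    (generic GL form, up to constants)."""
    pull_sq = ((y_pred - y_true) / sigma_pred) ** 2
    return np.mean(0.5 * pull_sq + np.log(sigma_pred))

rng = np.random.default_rng(6)
y_true = rng.normal(0.0, 1.0, size=50_000)
sigma = 0.3                                   # true residual width (toy)
y_pred = y_true + rng.normal(0.0, sigma, size=y_true.size)

print(f"MSE: {mse_loss(y_true, y_pred):.3f}")  # ~ sigma^2 = 0.09

# The GL loss is minimized (in expectation) when sigma_pred matches the true
# residual width; this is what lets the network learn per-event uncertainties.
for s in (0.1, 0.3, 0.9):
    gl = gaussian_likelihood_loss(y_true, y_pred, s)
    print(f"sigma_pred = {s:.1f}: GL = {gl:.3f}")
```

The printed GL values are smallest at the correct width of 0.3: underestimating the width is penalized by the pull term, overestimating it by the log-width term.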
The IceCube collaboration acknowledges the significant contributions to this manuscript from Mirco Hünnefeld. The authors gratefully acknowledge the support from the following agencies and institutions: USA – U.S. National Science Foundation-Office of Polar Programs, U.S. National Science Foundation-Physics Division, Wisconsin Alumni Research Foundation, Center for High Throughput Computing (CHTC) at the University of Wisconsin–Madison, Open Science Grid (OSG), Extreme Science and Engineering Discovery Environment (XSEDE), Frontera computing project at the Texas Advanced Computing Center, U.S. Department of Energy-National Energy Research Scientific Computing Center, Particle astrophysics research computing center at the University of Maryland, Institute for Cyber-Enabled Research at Michigan State University, and Astroparticle physics computational facility at Marquette University; Belgium – Funds for Scientific Research (FRS-FNRS and FWO), FWO Odysseus and Big Science programmes, and Belgian Federal Science Policy Office (Belspo); Germany – Bundesministerium für Bildung und Forschung (BMBF), Deutsche Forschungsgemeinschaft (DFG), Helmholtz Alliance for Astroparticle Physics (HAP), Initiative and Networking Fund of the Helmholtz Association, Deutsches Elektronen Synchrotron (DESY), and High Performance Computing cluster of the RWTH Aachen; Sweden – Swedish Research Council, Swedish Polar Research Secretariat, Swedish National Infrastructure for Computing (SNIC), and Knut and Alice Wallenberg Foundation; Australia – Australian Research Council; Canada – Natural Sciences and Engineering Research Council of Canada, Calcul Québec, Compute Ontario, Canada Foundation for Innovation, WestGrid, and Compute Canada; Denmark – Villum Fonden and Carlsberg Foundation; New Zealand – Marsden Fund; Japan – Japan Society for Promotion of Science (JSPS) and Institute for Global Prominent Research (IGPR) of Chiba University; Korea – National Research Foundation of Korea (NRF); Switzerland – Swiss National Science Foundation (SNSF); United Kingdom – Department of Physics, University of Oxford.
References [1] IceCube collaboration, M. Aartsen et al.,
Evidence for High-Energy Extraterrestrial Neutrinos at theIceCube Detector , Science (2013) 1242856 [ ].[2] IceCube, Fermi-LAT, MAGIC, AGILE, ASAS-SN, HAWC, H.E.S.S., INTEGRAL, Kanata, Kiso,Kapteyn, Liverpool Telescope, Subaru, Swift NuSTAR, VERITAS, VLA/17B-403collaboration, M. Aartsen et al.,
Multimessenger observations of a flaring blazar coincident withhigh-energy neutrino IceCube-170922A , Science (2018) eaat1378 [ ].[3] IceCube collaboration, M. Aartsen et al.,
Neutrino emission from the direction of the blazar TXS0506+056 prior to the IceCube-170922A alert , Science (2018) 147 [ ].[4] IceCube collaboration, M. Aartsen et al.,
Time-Integrated Neutrino Source Searches with 10 Years ofIceCube Data , Phys. Rev. Lett. (2020) 051103 [ ].[5] IceCube collaboration, M. Aartsen et al.,
The IceCube Realtime Alert System , Astropart. Phys. (2017) 30 [ ].[6] IceCube, MAGIC, VERITAS collaboration, M. Aartsen et al., Very High-Energy Gamma-RayFollow-Up Program Using Neutrino Triggers from IceCube , JINST (2016) P11009 [ ]. – 32 –
7] IceCube collaboration, D. Chirkin,
Event reconstruction in IceCube based on direct event re-simulation, p. 0581, 2013.
[8] C. Haack, L. Lu and T. Yuan, Improving the directional reconstruction of PeV hadronic cascades in IceCube, EPJ Web Conf. (2019) 05003.
[9] A. Krizhevsky, I. Sutskever and G.E. Hinton, ImageNet classification with deep convolutional neural networks, in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, pp. 1097–1105 (2012), https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[10] IceCube collaboration, M. Aartsen et al., Search for Sources of Astrophysical Neutrinos Using Seven Years of IceCube Cascade Events, Astrophys. J. (2019) 12.
[11] IceCube collaboration, M. Huennefeld, Deep Learning in Physics exemplified by the Reconstruction of Muon-Neutrino Events in IceCube, PoS ICRC2017 (2018) 1057.
[12] IceCube collaboration, N. Choma et al., Graph Neural Networks for IceCube Signal Classification.
[13] IceCube collaboration, M. Kronmueller and T. Glauch, Application of Deep Neural Networks to Event Type Classification in IceCube, PoS ICRC2019 (2020) 937.
[14] IceCube collaboration, R. Abbasi et al., Calibration and Characterization of the IceCube Photomultiplier Tube, Nucl. Instrum. Meth. A (2010) 139.
[15] IceCube collaboration, R. Abbasi et al., The Design and Performance of IceCube DeepCore, Astropart. Phys. (2012) 615.
[16] IceCube collaboration, M. Aartsen et al., South Pole glacial climate reconstruction from multi-borehole laser particulate stratigraphy, J. Glaciol. (2013) 1117.
[17] IceCube collaboration, M. Ackermann et al., Optical properties of deep glacial ice at the South Pole, J. Geophys. Res. (2006) D13203.
[18] IceCube collaboration, M. Aartsen et al., The IceCube Neutrino Observatory: Instrumentation and Online Systems, JINST (2017) P03012.
[19] IceCube collaboration, M. Aartsen et al., Energy Reconstruction Methods in the IceCube Neutrino Telescope, JINST (2014) P03009.
[20] Y. LeCun et al., Handwritten digit recognition with a back-propagation network, in Advances in Neural Information Processing Systems 2, pp. 396–404, Morgan-Kaufmann (1990), http://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network.pdf.
[21] VERITAS collaboration, Q. Feng and T.T. Lin, The analysis of VERITAS muon images using convolutional neural networks, IAU Symp. (2016) 173.
[22] T.L. Holch et al., Probing Convolutional Neural Networks for Event Reconstruction in 𝛾-Ray Astronomy with Cherenkov Telescopes, PoS ICRC2017 (2018) 795.
[23] S. Mangano et al., Extracting gamma-ray information from images with convolutional neural network methods on simulated Cherenkov Telescope Array data, 2018.
[24] M. Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems, CoRR abs/1603.04467 (2016).
[25] T.S. Cohen and M. Welling, Group equivariant convolutional networks, in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, pp. 2990–2999, 2016.
[26] E. Hoogeboom et al., HexaConv, 2018.
[27] D.P. Kingma and J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014).
[28] Y.A. LeCun, L. Bottou, G.B. Orr and K.-R. Müller, Efficient BackProp, in Lecture Notes in Computer Science, pp. 9–48, Springer Berlin Heidelberg (2012), 10.1007/978-3-642-35289-8_3.
[29] G. Klambauer et al., Self-normalizing neural networks, CoRR abs/1706.02515 (2017).
[30] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, CoRR abs/1502.03167 (2015).
[31] K. He et al., Deep residual learning for image recognition, CoRR abs/1512.03385 (2015).
[32] N. Srivastava et al., Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958, http://jmlr.org/papers/v15/srivastava14a.html.
[33] D.-A. Clevert, T. Unterthiner and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), CoRR abs/1511.07289 (2015).
[34] IceCube collaboration, M. Aartsen et al., Search for astrophysical sources of neutrinos using cascade events in IceCube, Astrophys. J. (2017) 136.
[35] IceCube collaboration, M. Aartsen et al., Efficient propagation of systematic uncertainties from calibration to analysis with the SnowStorm method in IceCube, JCAP (2019) 048.
[36] M. Börner et al., Measurement/Simulation Mismatches and Multivariate Data Discretization in the Machine Learning Era, in Astronomical Data Analysis Software and Systems XXVII, vol. 522, p. 431, 2019.
[37] D. Martschei et al., Advanced event reweighting using multivariate analysis, J. Phys. Conf. Ser. (2012) 012028.
[38] L. Breiman, Random Forests, Machine Learning (2001) 5.
[39] F. Pedregosa et al., Scikit-learn: Machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825.
[40] IceCube collaboration, M. Aartsen et al., Measurement of South Pole ice transparency with the IceCube LED calibration system, Nucl. Instrum. Meth. A (2013) 73.
[41] IceCube collaboration, M. Rongen, Measuring the optical properties of IceCube drill holes, EPJ Web Conf. (2016) 06011.
[42] C. Szegedy et al., Intriguing properties of neural networks, CoRR abs/1312.6199 (2014).
[43] A. Nguyen, J. Yosinski and J. Clune, Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, IEEE, June 2015, 10.1109/cvpr.2015.7298640.
[44] M.M. Bronstein et al., Geometric Deep Learning: Going beyond Euclidean data, IEEE Sig. Proc. Mag. (2017) 18.
[45] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2015).
[46] C. Szegedy et al., Going deeper with convolutions, in Computer Vision and Pattern Recognition (CVPR), 2015.