Dwarfs from the Dark (Energy Survey): a machine learning approach to classify dwarf galaxies from multi-band images
The Open Journal of Astrophysics, preprint (submitted February 2021)

Oliver Müller* and Eva Schnider
Observatoire Astronomique de Strasbourg (ObAS), Université de Strasbourg - CNRS, UMR 7550 Strasbourg, France
Department of Biomedical Engineering, University of Basel, Allschwil, Switzerland
ABSTRACT
Countless low-surface brightness objects – including spiral galaxies, dwarf galaxies, and noise patterns – have been detected in recent large surveys. Classically, astronomers visually inspect those detections to distinguish between real low-surface brightness galaxies and artefacts. Employing the Dark Energy Survey (DES) and machine learning techniques, Tanoglidis et al. (2020) have shown how this task can be automatically performed by computers. Here, we build upon their pioneering work and further separate the detected low-surface brightness galaxies into spirals, dwarf ellipticals, and dwarf irregular galaxies. For this purpose we have manually classified 5567 detections from multi-band images from DES and searched for a neural network architecture capable of this task. Employing a hyperparameter search, we find a family of convolutional neural networks achieving similar results as the manual classification, with an accuracy of 85% for spiral galaxies, 94% for dwarf ellipticals, and 52% for dwarf irregulars. For dwarf irregulars – due to their diversity in morphology – the task is difficult for humans and machines alike. Our simple architecture shows that machine learning can reduce the workload of astronomers studying large data sets by orders of magnitude, as will be available in the near future with missions such as Euclid.

Keywords:
Galaxies: dwarf; Galaxies: spiral; Galaxies: elliptical and lenticular, cD; Galaxies: irregular; Methods: data analysis

INTRODUCTION
We are living in a golden age to explore the low surface brightness Universe. Current instrumentation, paired with cutting-edge computing facilities, allows the study of the Universe in unprecedented detail. For our own Milky Way system, large surveys like the Sloan Digital Sky Survey (SDSS, York et al. 2000) or the Dark Energy Survey (DES, Abbott et al. 2018) have increased the number of known faint dwarf galaxies by several magnitudes (e.g., Willman et al. 2005; Belokurov et al. 2010; Kim et al. 2015; Drlica-Wagner et al. 2015). With this knowledge came new opportunities to probe our understanding of gravity (Wolf et al. 2010; McGaugh 2016; Famaey et al. 2018), the role of baryons (Pontzen & Governato 2012; Spekkens et al. 2014; Revaz & Jablonka 2018), and cosmology (Kroupa et al. 2010; Boylan-Kolchin et al. 2011; Pawlowski et al. 2012).

Outside of our own Milky Way system, the search for and study of these elusive dwarf galaxies has mainly relied on targeted observations of selected galaxy groups and clusters (e.g., Martin et al. 2009; Chiboucas et al. 2009; Merritt et al. 2014; Sand et al. 2014; Müller et al. 2015; Carlin et al. 2016; Park et al. 2017; Venhola et al. 2017; Eigenthaler et al. 2018; Smercina et al. 2018; Cohen et al. 2018; Taylor et al. 2018; Carlsten et al. 2020; Habas et al. 2020; Iodice et al. 2020). Both automatic source extraction and visual scans of the images – or a mix of both – are employed to search for dwarf galaxies, even today (e.g., Habas et al. 2020; Müller & Jerjen 2020). These are tedious and time-consuming tasks and need some expert knowledge to correctly identify the dwarf galaxy candidates.

Follow-up observations of many dwarf galaxy candidates from some of these surveys with better instruments have revealed an unavoidable contamination of false detections, i.e. by confusion with background galaxies, foreground galactic cirrus, or instrumental noise (Chiboucas et al. 2013; Merritt et al. 2016; Müller et al. 2018, 2019; Bennet et al. 2019). A prime example is the galaxy Cen 8/KK 198, which had been regarded as a dwarf galaxy associated with the Centaurus group for more than two decades (Cote 1996; Jerjen et al. 2000; Karachentsev et al. 2013; Müller et al. 2017), until VLT observations uncovered it as a low-surface brightness spiral galaxy (Müller et al. 2019, 2021). Once high-resolution imaging was available to study the morphology of the galaxy in detail, the spiral pattern in the older imaging became quite apparent; in other words, this spiral galaxy could have been spotted as an interloper in the dwarf galaxy catalogs before the costly follow-up observations.

With upcoming surveys like LSST or Euclid, we will face a surge of low-surface brightness objects. If we want to study the dwarf galaxy regime with these immense data sets, we need novel approaches – hands-on categorization of all the faint sources will be too time-consuming for humans to do. Fortunately, we can teach computers to perform such tasks. With the advent of deep neural networks, computer vision has become a solvable task and has found a broad range of usages, from recognizing handwritten letters (O'Shea & Nash 2015), human faces (Huang et al. 2012), individual bones (Schnider et al. 2020), animals (Russakovsky et al. 2014), and plants (Goëau et al. 2017), to even being able to fake pieces of art (Gatys et al. 2015), pizza (Papadopoulos et al. 2019), or videos (Korshunov & Marcel 2018).

* E-mail: [email protected]

The use of neural networks to classify astronomical images and a discussion of their potential role in handling large digital surveys dates back to the early nineties (Storrie-Lombardi et al. 1992; Bertin 1994). More recently, Tanoglidis et al.
(2020) have used a simple three-layered convolutional neural network (CNN) to separate artefacts from low-surface brightness objects they previously found in data from the DES (Tanoglidis et al. 2021), to a 90% accuracy. Even more encouraging is the fact that transfer learning – meaning to use the pre-trained network on a different data set – seems to be doable with minimal effort of re-training the model with a subsample from the new data set, in the Tanoglidis et al. (2020) case data from the Hyper Suprime-Cam survey.

In this work, we build upon the work of Tanoglidis et al. (2020) to further distinguish their low-surface brightness objects into spiral and dwarf galaxies, employing machine learning. To do so, we first manually classify 5567 objects, then we find the best set of hyperparameters to define our neural network, and then finally train this network. Before jumping into the topic of finding a suitable architecture, we will give a brief introduction to neural networks, focusing on the techniques we are going to apply.

NEURAL NETWORKS
Much research has been conducted in the past years to find the most promising machine learning methods for the classification of images (Krizhevsky et al. 2012; LeCun et al. 1989; Simonyan & Zisserman 2014). While there is a plethora of possible architectures, many supervised classification tasks work well using comparatively simple convolutional networks. Specifically, variations of LeNet (LeCun et al. 1989) in combination with modern optimisation techniques are still wide-spread (Özyurt et al. 2019; Sardogan et al. 2018). A sequence of convolutional layers with down-pooling is followed by a few fully-connected layers, ending with a classification layer consisting of as many nodes as classes of interest.

Convolutions have been used in digital image processing for a long time, also outside the context of neural networks (Malin 1977; Keys 1981; Perona & Malik 1990). Every pixel of the output image consists of the dot-product between the surrounding pixels of the input image and the entries of the convolutional kernel. Depending on their values, kernels can serve many purposes, such as edge detection, blurring, or sharpening of an image. In convolutional neural networks, the values of the convolutional kernels are incrementally learnt, instead of hand-crafted. To every input layer, multiple convolutional kernels can be learnt and applied. The resulting outputs of those individual kernels are called channels. The number of channels thus directly corresponds to the number of learnt kernels.

In fully connected layers, every node n_j^out is the weighted sum of all nodes of the previous layer, plus an offset: n_j^out = Σ_i w_{i,j} n_i^in + b_j. The weights w and biases b of the whole network are trainable parameters and are iteratively updated while training the network. As the name suggests, in fully connected layers every node of a layer is linked to every node of the preceding layer. In contrast, the local dot-product of convolutional layers only links nodes that are spatially close.

Since even an elaborate chain of linear functions stays a linear function, activation functions are added to the layers of a neural network. These simple non-linear functions allow for the possibility that the output of the neural network does not need to linearly depend on the input (Hinton et al. 2012).

Pooling layers are used to down-sample the input. They work similarly to convolutional layers, by applying operations to a small region of the input at once. In contrast to convolutional layers, the kernel of pooling layers is not trained but fixed upfront. Two popular examples include a kernel that chooses the maximum value of the input region, or the mean value of the input region. The downsampling effect is achieved by applying the kernel such that the input regions don't overlap. That way, a 2x2 pool kernel reduces the spatial size of the output by a factor of 2 per spatial dimension.

The trainable parameters of a neural network are iteratively updated as part of an optimisation process. Images are passed through the network and the output – the predicted class – is compared to the true class of the image. A common loss function to quantify the agreement is the cross-entropy loss. Using the chain rule, the gradient of the loss with respect to every single trainable parameter can be computed. This in turn allows for the use of gradient-based optimisers. A lot of research has been conducted to find first-order gradient-based optimisers that show good performance in training neural networks, with a certain robustness against ending up in local optima or getting stuck in ravines of the search space (Zeiler 2012; Kingma & Ba 2014; Qian 1999).

Modern classification network architectures tend to be very deep, i.e. consist of large numbers of convolutional layers (Szegedy et al. 2015; Krizhevsky et al. 2012; He et al. 2016). Due to the resulting high number of trainable parameters, and to prevent overfitting, they are generally trained on tens of thousands of natural images (Krishna & Kalluri 2019). Training these kinds of architectures from scratch costs a lot in terms of computational memory and time. A popular approach is therefore to leverage the power of transfer learning for new tasks (Pan & Yang 2009). In this case, network parameters are not initialized using random functions, but using the final parameter values of an openly accessible fully trained network. It is then possible to update the parameters during future training on the new task. To extend the network's predictive capabilities to new classes – unseen during the initial training phase – the very last classification layer is dropped and replaced by one which consists of the new number of output classes. While this approach reduces the number of data needed to learn a new task on very deep networks, the computational requirements remain huge, due to the millions of trainable parameters. This problem can be alleviated by fixing all parameters but the ones in the classification layers. In this way the number of trainable parameters is reduced tremendously, which speeds up the training considerably and allows the use of more accessible hardware. However, there are also drawbacks to keeping most of the layers fixed. The inner layers learn to encode images in an abstract space which facilitates the classification. Since this internal encoding was learnt during the initial training using natural images, other types of images may not be represented well using this encoding. Therefore, the success of transfer learning from state-of-the-art natural image classification networks for types of imaging such as histology, medical imaging, or astronomical images depends on the choice of network, data set size, and task (Zhang et al. 2016; Cha et al. 2017; Martinazzo et al. 2020).

DATA
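To make the two layer operations from the preceding section concrete, here is a minimal pure-Python sketch of a fully connected node, n_j^out = Σ_i w_{i,j} n_i^in + b_j, and of non-overlapping 2x2 max-pooling; all numerical values are toy values chosen for this illustration, not taken from our data.

```python
# Toy illustration of a fully connected node and 2x2 max-pooling.
# The values are made up; only the operations mirror the text.

def dense_node(inputs, weights, bias):
    """Weighted sum of all input nodes plus an offset: sum_i w_i * n_i + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def max_pool_2x2(image):
    """Non-overlapping 2x2 max-pooling: halves each spatial dimension."""
    h, w = len(image), len(image[0])
    return [[max(image[y][x], image[y][x + 1],
                 image[y + 1][x], image[y + 1][x + 1])
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

print(dense_node([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], 0.25))  # 0.5 - 2.0 + 0.75 + 0.25 = -0.5
print(max_pool_2x2([[1, 2, 5, 0],
                    [3, 4, 1, 1],
                    [0, 0, 9, 2],
                    [7, 1, 3, 8]]))  # [[4, 5], [7, 9]]
```

Note how the 4x4 input shrinks to 2x2 after pooling, the factor-of-2 reduction per spatial dimension described above.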
Tanoglidis et al. (2020) selected a set of 20'000 low-surface brightness objects detected in DES (Tanoglidis et al. 2021) and predicted them to be real low-surface brightness objects (in contrast to artefacts) by the use of machine learning techniques. We downloaded the centered images of these objects from the DES archive using a modified script provided in the git project of Tanoglidis et al. (2020): https://github.com/dtanoglidis/DeepShadows. The jpeg images have dimensions of 256 × 256 px, corresponding to a field of view of 1.… × .…, and have three channels, i.e. the standard RGB channels (not to be confused with the Sloan r and g and Johnson/Bessell B filters). The pixel values span from 0 to 255, with 0 corresponding to no intensity in the pixel.

Because many of these detections are small background spiral galaxies, we have made a size cut of 3.3 px on their table column named flux radius r, which was measured (Tanoglidis et al. 2021) with Source Extractor (Bertin & Arnouts 1996). This left us with 5567 low-surface brightness objects. Assuming a distance of 50 Mpc, this cut translates into a radius of 200 pc, corresponding to a typical effective radius for a faint dwarf (Müller et al. 2019). The distance of 50 Mpc was chosen to mock the selection of galaxies from the MATLAS survey (Duc 2020).

To further increase the sample size during the training and validation of our neural networks, we used standard data augmentation techniques used in machine learning. Namely, each image can be flipped (i.e. mirrored) and rotated (by 90, 180, or 270 degrees). These data augmentations increase the number of images used in this work by a factor of 8 to 44536. We have decided not to use translations or rotations which would need interpolation of the pixel values. Such transformations could lead to unexpected behaviors of the neural networks, but it could be interesting for future studies to include them as well.

We will split the sample into three different sets: a training, a validation, and a test set, which are randomly drawn from the total sample of galaxies. The training set consists of 70%, the validation set of 10%, and the test set of 20% of the data. As the name suggests, the training set will be used for the training of the network. The validation set will be used for the hyperparameter search, and the test set for the final evaluation of the trained network.

Figure 1. Typical objects corresponding to the three classes manually labeled in this work: spirals (top row), dwarf ellipticals (middle row), and dwarf irregulars (bottom row). The images have a side of 256 px.

Table 1
Hyperparameters, their optimization search space, and optimal value.

Name           | search space         | best value  | description
n_layers,conv  | [1, 5]               | 5           | number of convolutional layers
n_channel      | [2, …]               | 40          | number of initial channels in the first convolutional layer, with a step size of 2
n_nodes        | [10, …]              | …           | number of nodes in the first fully connected layer
δ              | [0, …]               | 0.84        | negative slope of the Leaky ReLU activation functions
p_dropout      | [0, 0.9]             | 0.28        | dropout rate for regularization
image size     | [32, 256]            | 96          | size of one side of the input image, with a step size of 32
optimizer      | Adam, SGD, RMSprop   | RMSprop     | the optimizer used to update trainable parameters
learning rate  | [10^−…, 10^−…]       | 1.…×10^−…   | learning rate used by the optimizer
momentum       | [0, 0.99]            | 0.78        | momentum used by the optimizer
weight decay   | [0, 0.2]             | 0.003       | weight decay used by the optimizer
batch size     | [2, …]               | …           | batch size used during training

MANUAL CLASSIFICATION
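The interpolation-free augmentation scheme described above (mirroring plus rotations by multiples of 90 degrees, i.e. the 8 symmetries of a square image) can be sketched in pure Python; the tiny 2x2 "image" below is a stand-in for the 256x256 px cutouts.

```python
# Flip + 0/90/180/270 degree rotations give 8 variants of each image,
# with no pixel interpolation. Toy 2x2 image stands in for a cutout.

def rot90(image):
    """Rotate a 2D list of pixel rows by 90 degrees."""
    return [list(row) for row in zip(*image)][::-1]

def flip(image):
    """Mirror a 2D list of pixel rows horizontally."""
    return [row[::-1] for row in image]

def augment(image):
    """Return the 8 flip/rotation variants of an image."""
    variants = []
    for img in (image, flip(image)):
        for _ in range(4):
            variants.append(img)
            img = rot90(img)
    return variants

cutout = [[1, 2],
          [3, 4]]
variants = augment(cutout)
print(len(variants))                    # 8
print(len({str(v) for v in variants}))  # 8 distinct variants for an asymmetric image
```

For an image with no internal symmetry, all 8 variants are distinct, which is where the factor of 8 in the sample size comes from.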
To train deep neural networks, it is necessary to provide many labelled samples of the classes we want them to be able to distinguish. Because the data set provided by Tanoglidis et al. (2020) is not morphologically labelled, we had to perform the classification ourselves. For this purpose, we have used Pidgey, a simple Jupyter notebook tool to annotate images. We have created three classes: i) spiral galaxies, ii) dwarf ellipticals, and iii) dwarf irregulars. This follows the classification scheme implemented in Habas et al. (2020). We have classified an object considering the following typical features of each class:

• spiral: indications of a bar; spiral arms; extended nucleus; compact shape; high ellipticity indicating viewing the spiral edge-on; sharp edges; strong color gradient from red in the center to blue in the outskirts.

• dwarf elliptical: smooth profile; extended; diffuse; no irregular features.

• dwarf irregular: mostly smooth profile; extended; irregular features (e.g. tidal tails, star forming regions), which are not spiral arms.

It was not always straightforward to classify an object, especially between small dwarf ellipticals and spirals. If an object appears smooth, but very small, it is likely an unresolved spiral and therefore was classified as a spiral galaxy. Of course this could falsely label some small dwarf ellipticals as spiral galaxies. Also the differences between disturbed spiral galaxies and dwarf irregulars are sometimes difficult to tell. Biases in human classifications are therefore unavoidable. To improve the reliability of human classification, we have repeated the classification three times and made a majority vote, meaning that a label had to be picked at least two times to count. This also gives a base-line for the variability of the classification coming from the same person.

In total, we have labeled 2925 objects to be spiral galaxies, 2037 to be dwarf ellipticals, and 605 to be dwarf irregulars.
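The three-run majority vote can be sketched as follows; the vote lists are hypothetical examples. A label counts only if it is picked at least twice, and objects on which all three runs agree are flagged as confidently labelled.

```python
from collections import Counter

def majority_vote(runs):
    """Combine repeated classification runs into a final label.

    Returns (label, confident): the label picked at least twice,
    and whether all runs agreed. If every run disagrees, there is
    no majority and the label is None.
    """
    label, count = Counter(runs).most_common(1)[0]
    if count < 2:
        return None, False
    return label, count == len(runs)

# Hypothetical votes for three objects over the three runs:
print(majority_vote(["spiral", "spiral", "spiral"]))  # ('spiral', True)
print(majority_vote(["dE", "dIrr", "dE"]))            # ('dE', False)
print(majority_vote(["spiral", "dE", "dIrr"]))        # (None, False)
```

The `confident` flag corresponds to the fraction of objects quoted below that received the same vote three times.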
For those, 2431 spirals (83.1%), 1658 dwarf ellipticals (81.4%), and 376 dwarf irregulars (62.1%) have received three times the same vote, meaning that we were confident about the label. Uncertainties in labelling have two main sources: one comes from assigning a wrong label (i.e. clicking on the wrong button), the other comes from the true uncertainty about the morphology of the object. The latter is difficult to assess, because we do not know the ground truth of these objects and therefore can't directly quantify it.

https://github.com/wbwvos/pidgey

HYPERPARAMETER SEARCH AND TRAINING
The search for the best set of hyperparameters is one of the most cumbersome tasks in any machine learning project, due to their high dimensionality and the non-trivial inter-dependencies between them. In machine learning, hyperparameters describe values which need to be decided a priori to the actual training of the neural network. The choice of these parameters can come from previous estimations of similar problems. Better though is the selection of the hyperparameters by some kind of pre-defined rules. The search for the best set of hyperparameters is an active research field, which continues to deliver novel approaches for this task. In our work, we use the Optuna framework (Akiba et al. 2019), which uses a multivariate Tree-structured Parzen Estimator (TPE, Bergstra et al. 2011). The TPE estimates the densities of good and bad hyperparameters. From this, it samples the most promising hyperparameters. Furthermore, Optuna uses pruning, which allows for early stopping of bad sets of hyperparameters, speeding up the process of the hyperparameter search. For the pruning, we use Optuna's implementation of Hyperband (Li et al. 2017).

The define-by-run principle implemented in Optuna allows one to dynamically construct an optimal neural network from the range of hyperparameters and a pre-defined heuristic. Here, we use the validation accuracy as this heuristic. We start with a standard layout for CNNs, combining a convolution with a kernel size of 5 with a max-pooling (halving the spatial sizes). The number of these layers n_layers,conv is a hyperparameter. The number of feature maps (also called channels) is doubled in each subsequent layer. The initial number of feature maps n_channel is another hyperparameter. After these convolutions, a series of fully connected layers reduces the features, up until the final classification layer. We use two fully connected layers. The first fully connected layer has n_nodes nodes, which is a hyperparameter; the last one has the number of labels, i.e. 3 nodes.

We employ Leaky ReLUs as activation functions, with the negative slope δ being another hyperparameter. Throughout the network, dropout is applied as regularization, with the dropout rate p_dropout being another hyperparameter. Further hyperparameters come from the optimizer: the choice of the optimizer itself, its learning rate, momentum, and weight decay. Finally, the image size is one more hyperparameter, with a step size of 32 px per side, up to 256 px. However, for the larger image sizes we sometimes ran out of memory on our local machine, depending on the other hyperparameters. In total we have a set of 11 hyperparameters. They are presented in Table 1. We have used Optuna to search for the best combination of these hyperparameters using 500 trials with a maximum of 30 epochs in each run, using a cross-entropy loss function. The best set of hyperparameters is presented in Table 1. The networks themselves were implemented in PyTorch 1.7.0 (Paszke et al. 2019) and the experiments and training were conducted on a single NVIDIA GeForce RTX 2070 GPU with 8 GB of RAM and CUDA 11.0.

Now that we have the best set of hyperparameters estimated, we only need to decide on the number of epochs the network will be trained for. For this, we again use the validation set, but with the hyperparameters fixed. We train our network for 200 epochs to see where overfitting starts to kick in. For this we measure the training and validation accuracy as a function of the epoch, see Fig. 3. While the network gets better on the training set the longer it trains, the validation accuracy starts to stay the same after ≈ 60 epochs, indicating that the network starts to overfit. We therefore stop the training for the final evaluation of the network at 60 epochs.

To compare our CNN to a more sophisticated, pre-trained network from the literature, we also study the capabilities of ResNet50 (Xie et al. 2017) for our task at hand. Because it has too many parameters to fully train on our machine, we only re-train the last fully connected layer, which is just before the classification layer. ResNet50 requires as input an image with at least a dimension of 224 × 224 px, therefore we simply use the original 256 px images. We searched for the best set of hyperparameters with Optuna, but leaving out architectural choices and only searching for the optimal batch size, optimizer, learning rate, momentum, and weight decay. They are: 44 for the batch size, SGD for the optimizer, 6.…×10^−… for the learning rate, 0.92 for the momentum, and 0.095 for the weight decay. Similarly as before, we also searched for the epoch where overfitting starts to become relevant. This occurs at 50 epochs. Finally, we train and evaluate on the same training and test sets as before.

Figure 2. Parallel coordinate plot, indicating which combination of hyperparameters resulted in which performance, measured as objective value – in our case the validation accuracy.

Figure 3. Training (blue line) and validation (red line) accuracies as a function of the training epoch. The running average of 10 epochs of the validation accuracy is indicated as a thick black line.

RESULTS AND DISCUSSION
Suitable network architectures
The best set of hyperparameters from our optimization is presented in Table 1. However, this set is not the only architecture yielding good results. In Fig. 2 we present the performance for different combinations of the hyperparameters in a so-called parallel coordinate plot. This figure visualises the impact of the different combinations of hyperparameters and indicates whether certain hyperparameters must be tuned to achieve a good result, or if there are multiple ways to achieve similar results. The latter is the case for many of the hyperparameters, meaning that there is no distinct architecture superior to the others. The best results (i.e. highest objective values) are achieved with a family of networks having similar, albeit not the same, hyperparameters, which are actually close to the proposed standard values in PyTorch, with some exceptions. Most surprisingly, the RMSprop optimizer yields the best results, although Adam is the standard for this task. We present a graphical representation of the architecture achieving the highest validation accuracy during the trials in Fig. 4.

Evaluation of the neural network
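For the best network of Table 1 (five 5x5 convolutions, each followed by a 2x2 max-pooling that halves the spatial size, with channels starting at 40 and doubling per layer, on a 96 px input), the feature-map shapes can be traced with a short sketch. That the convolutions preserve the spatial size ('same' padding) is our assumption here; the text specifies only the kernel size and the halving max-pool.

```python
# Trace feature-map shapes through the best CNN from the hyperparameter
# search: 5 conv blocks (convolution assumed size-preserving) each ending
# in a 2x2 max-pool; channels start at 40 and double per layer.

def trace_shapes(image_size=96, n_conv_layers=5, initial_channels=40):
    """Return (channels, height, width) after each conv + pool block."""
    size, channels, shapes = image_size, initial_channels, []
    for _ in range(n_conv_layers):
        size //= 2            # the 2x2 max-pool halves each spatial dimension
        shapes.append((channels, size, size))
        channels *= 2         # channels double in the next layer
    return shapes

shapes = trace_shapes()
print(shapes)     # [(40, 48, 48), (80, 24, 24), (160, 12, 12), (320, 6, 6), (640, 3, 3)]
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(flattened)  # 5760 nodes flattened into the first fully connected layer
```

Under these assumptions, the flattening layer of Fig. 4 would hand 5760 nodes to the first fully connected layer.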
The final evaluation of the network is done with the test set, which the network has not seen up to this point. The network was trained during 60 epochs, as evaluated in our overfitting experiment. The accuracy on this test set gives us the performance. After training the network for 60 epochs, it is able to correctly classify 85% of the spiral galaxies, 94% of the dwarf ellipticals, and 52% of the dwarf irregulars. In total it achieves an accuracy of 85%.

Has the network learned something? If this is the case, it must perform better than random. Because the three labels are not equally distributed, they have the following probabilities to occur by chance: for spiral galaxies it is 52%, for dwarf ellipticals it is 37%, and for dwarf irregulars it is 10%. This indeed shows that for each class, the network was able to learn from the training data and use the generalization to classify the objects.

The confusion matrix of the classification is shown in Fig. 5. This matrix shows on the diagonal the correctly classified objects (i.e. the predicted label is equal to the ground truth set by our manual classification). The off-diagonal entries represent the confusion between the classes. If we look at the objects with spiral galaxies as ground truth, 85% were correctly classified and the main confusion happened with the dwarf ellipticals (12%). Only 3% of the 'real' spiral galaxies were classified as dwarf irregulars. For the dwarf ellipticals, 94% of the test objects were correctly classified and only 4% were confused with spirals and 1% confused with dwarf irregulars. These two labels indeed yield good results with minimal confusion. However, for dwarf irregulars a different picture emerges. While 52% of the test objects were correctly labelled, there was heavy confusion with both spirals (29%) and dwarf ellipticals (20%). This shows that there is ample room for improvement for dwarf irregulars.

Figure 4. The optimal network as estimated from our hyperparameter search. It contains a sequence of convolutional (yellow) plus max pooling (red) layers, followed by a flattening layer (green), which maps all the nodes of the last convolution to a linear layer, and then a sequence of two fully connected layers (violet).

Figure 5. Confusion matrix, where the x-axis corresponds to the visually assigned label and the y-axis to the predicted label from the neural network:

                   Actual S     Actual dE    Actual dIrr
Predicted S       499 (85%)     17 (4%)      32 (29%)
Predicted dE       68 (12%)    391 (94%)     22 (20%)
Predicted dIrr     20 (3%)       6 (1%)      58 (52%)

Human vs. Machine Performance
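Per-class accuracies of the kind quoted above are column-wise ratios of a confusion matrix: the diagonal count divided by the total number of objects of that actual class. A small sketch with hypothetical counts (not the counts of Fig. 5):

```python
# Per-class accuracy from a confusion matrix: diagonal entry divided by
# the column (actual-class) total. The counts below are hypothetical.

def per_class_accuracy(matrix, classes):
    """matrix[predicted][actual] -> fraction of each actual class recovered."""
    accuracies = {}
    for j, cls in enumerate(classes):
        column_total = sum(matrix[i][j] for i in range(len(classes)))
        accuracies[cls] = matrix[j][j] / column_total
    return accuracies

classes = ["S", "dE", "dIrr"]
matrix = [[80, 5, 6],    # predicted S
          [15, 90, 4],   # predicted dE
          [5, 5, 10]]    # predicted dIrr
print(per_class_accuracy(matrix, classes))
# {'S': 0.8, 'dE': 0.9, 'dIrr': 0.5}
```

Comparing each per-class accuracy against the class prior (how often that class occurs by chance) is the learning check applied above.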
Can a machine achieve similar results as a human? In our case, a direct comparison is tricky, because the ground truth – i.e. what morphological type the object belongs to – is still unknown. The repeated manual classification of the objects, however, may be used as a proxy for the accuracy of human classification. If the object is classified in each run the same way, this indicates that we are certain about the type of the object; if on the other hand the classification changes, this indicates uncertainty. So how uncertain were our manual classifications? For this we counted the fraction of labels which we classified the same each time. For the spiral galaxies, this happened in 83% of the cases, for the dwarf ellipticals in 81% of the cases, and for the dwarf irregulars in 62% of the cases, highlighting how difficult dwarf irregulars are. If we rather ask the question whether an object is a spiral galaxy or a dwarf galaxy (i.e. we do not care whether an object is a dwarf irregular or dwarf elliptical), the latter number increases to 86%. This shows that the confusion of dwarf irregulars happens mainly with dwarf ellipticals. Two such cases are presented in Fig. 6, where the vote was 2/1.

Figure 6.
The galaxies which were labeled differently during the different manual classification runs. Left is a galaxy with a majority vote labelling it as dwarf irregular (but has one out of three votes as dwarf elliptical), and right a galaxy finally labeled as dwarf elliptical (but has one vote as dwarf irregular).

Comparing the accuracies of the network with the self-consistency of the manual classification, the network reaches for spirals 85% (vs. 83%), for dwarf ellipticals 94% (vs. 81%), and for dwarf irregulars 52% (vs. 62%). This is well compatible with our manual classification and shows that the network is capable of performing the job as well as a human expert. For dwarf ellipticals the neural network is even better; however, for dwarf irregulars it performs worse. This is expected, as the dwarf irregulars have the least amount of training data and are the hardest to classify.

Comparison to ResNet50
We have shown that our simple CNN performs quite well given the data set at hand. So how is the performance of a more sophisticated network, namely ResNet50, for which the last fully connected layer has been retrained on our data set? Our adopted ResNet50 achieves a total test accuracy of 65%, with 91% of the spirals correctly classified, 42% of the dwarf ellipticals, and 0% of the dwarf irregulars. While it outperforms the classification of the spirals, in general it performs worse than our simple CNN. What is likely to happen is that it prefers to label spirals because they are the most common type in our data set, as well as having some distinct features (like rotation), which do occur in natural images. But the generalization power of ResNet50 is less strong than that of a simpler convolutional network, like the one we found here.

SUMMARY AND CONCLUSIONS
In this work, we have shown how a neural network can be designed to distinguish from multi-band images between dwarf elliptical, dwarf irregular, and spiral galaxies to a similar accuracy as achieved by manual intervention. For this purpose, we have first classified 5567 galaxies detected in the Dark Energy Survey. Then, we used an optimization scheme to search for suitable architectures and hyperparameters. Our hyperparameter search revealed that a convolutional neural network with five convolutional layers and two fully connected layers with hyperparameters close to standard values yields the best results.

This network was trained and evaluated on a separate set of data. For spiral galaxies, the network achieves an accuracy of 85%, for dwarf ellipticals an accuracy of 94%, and for dwarf irregulars an accuracy of 52%. Compared to the uncertainty of human classification, which we measured by classifying the galaxies multiple times, the neural networks perform as well in a fraction of the needed time. However, for dwarf irregulars we are still not at a level where a label can be taken with confidence. This has two sources: they a) come in many different shapes and sizes and don't have clearly defined structures, and b) are quite infrequent, meaning that the training set is sparse. Adding more dwarf irregulars to the training sample will certainly help alleviate this problem. Comparison to a commonly used network in the literature – ResNet50 – which was pre-trained on natural images, shows that our simple network yields better results.

What could be done to improve the classification? Certainly, adding more training data would result in a better accuracy, but this cannot easily be done for the data set we used. However, the ground truth could be improved by relying on not just one expert manually classifying the images, but multiple people labelling the data.
Another avenue would be to test the impact of more elaborate data augmentation schemes, such as translations, elastic deformations, and non-trivial rotations.

One may wonder what the use of the neural network is if all the galaxies have already been classified by hand. On the one hand, our findings regarding suitable architectures and hyperparameters can serve as a starting point and benchmark for training on similar tasks with different data sets. On the other hand, there is the potential of transfer learning. Our neural network trained on data from the Dark Energy Survey may be used on a different set of data, for example the Kilo Degree Survey (KiDS). By using a pre-trained network, the amount of training data needed to achieve similar (or better) results can go down by orders of magnitude, reducing the workload to manually classifying only a subset of the total data and then using the newly trained neural network for the remaining objects. Along these lines, it will be interesting to study what kind of data set we would need to use transfer learning on upcoming large-scale surveys like Euclid or LSST.

ACKNOWLEDGEMENTS
O.M. is grateful to the Swiss National Science Foundation for financial support.

REFERENCES
Abbott, T. M. C., Abdalla, F. B., Allam, S., et al. 2018, ApJS, 239, 18
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. 2019, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631
Belokurov, V., Walker, M. G., Evans, N. W., et al. 2010, ApJL, 712, L103
Bennet, P., Sand, D. J., Crnojević, D., et al. 2019, ApJ, 885, 153
Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. 2011, in 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), Vol. 24, Neural Information Processing Systems Foundation
Bertin, E. 1994, Astrophysics and Space Science, 217, 49
Bertin, E. & Arnouts, S. 1996, A&AS, 117, 393
Boylan-Kolchin, M., Bullock, J. S., & Kaplinghat, M. 2011, MNRAS, 415, L40
Carlin, J. L., Sand, D. J., Price, P., et al. 2016, ApJL, 828, L5
Carlsten, S. G., Greco, J. P., Beaton, R. L., & Greene, J. E. 2020, ApJ, 891, 144
Cha, K. H., Hadjiiski, L. M., Chan, H.-P., et al. 2017, in Medical Imaging 2017: Computer-Aided Diagnosis, Vol. 10134, International Society for Optics and Photonics, 1013404
Chiboucas, K., Jacobs, B. A., Tully, R. B., & Karachentsev, I. D. 2013, AJ, 146, 126
Chiboucas, K., Karachentsev, I. D., & Tully, R. B. 2009, AJ, 137, 3009
Cohen, Y., van Dokkum, P., Danieli, S., et al. 2018, ApJ, 868, 96
Côté, S. 1996, PASA, 13, 278
Drlica-Wagner, A., Bechtol, K., Rykoff, E. S., et al. 2015, ApJ, 813, 109
Duc, P.-A. 2020, arXiv e-prints, arXiv:2007.13874
Eigenthaler, P., Puzia, T. H., Taylor, M. A., et al. 2018, ApJ, 855, 142
Famaey, B., McGaugh, S., & Milgrom, M. 2018, MNRAS, 480, 473
Gatys, L. A., Ecker, A. S., & Bethge, M. 2015, arXiv e-prints
Hinton, G., Deng, L., Yu, D., et al. 2012, IEEE Signal Processing Magazine, 29, 82
Huang, G. B., Mattar, M., Lee, H., & Learned-Miller, E. 2012, in NIPS
Iodice, E., Cantiello, M., Hilker, M., et al. 2020, A&A, 642, A48
Jerjen, H., Binggeli, B., & Freeman, K. C. 2000, AJ, 119, 593
Karachentsev, I. D., Makarov, D. I., & Kaisina, E. I. 2013, AJ, 145, 101
Keys, R. 1981, IEEE Transactions on Acoustics, Speech, and Signal Processing, 29, 1153
Kim, D., Jerjen, H., Mackey, D., Da Costa, G. S., & Milone, A. P. 2015, ApJL, 804, L44
Kingma, D. P. & Ba, J. 2014, arXiv preprint arXiv:1412.6980
Korshunov, P. & Marcel, S. 2018, Assessment and detection
Krishna, S. T. & Kalluri, H. K. 2019, International Journal of Recent Technology and Engineering (IJRTE), 7, 427
Krizhevsky, A., Sutskever, I., & Hinton, G. E. 2012, Advances in Neural Information Processing Systems, 25, 1097
Kroupa, P., Famaey, B., de Boer, K. S., et al. 2010, A&A, 523, A32
LeCun, Y., Boser, B., Denker, J. S., et al. 1989, Neural Computation, 1, 541
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. 2017, The Journal of Machine Learning Research, 18, 6765
Malin, D. F. 1977, AAS Photo Bulletin, 16, 10
Martin, N. F., McConnachie, A. W., Irwin, M., et al. 2009, ApJ, 705, 758
Martinazzo, A., Espadoto, M., & Hirata, N. S. 2020, in VISIGRAPP (5: VISAPP), 87–95
McGaugh, S. S. 2016, ApJL, 832, L8
Merritt, A., van Dokkum, P., & Abraham, R. 2014, ApJL, 787, L37
Merritt, A., van Dokkum, P., Danieli, S., et al. 2016, ApJ, 833, 168
Müller, O., Fahrion, K., Rejkuba, M., et al. 2021, A&A, 645, A92
Müller, O. & Jerjen, H. 2020, A&A, 644, A91
Müller, O., Jerjen, H., & Binggeli, B. 2015, A&A, 583, A79
Müller, O., Jerjen, H., & Binggeli, B. 2017, A&A, 597, A7
Müller, O., Rejkuba, M., & Jerjen, H. 2018, A&A, 615, A96
Müller, O., Rejkuba, M., Pawlowski, M. S., et al. 2019, A&A, 629, A18
O'Shea, K. & Nash, R. 2015, arXiv e-prints, arXiv:1511.08458
Özyurt, F., Sert, E., Avci, E., & Dogantekin, E. 2019, Measurement, 147, 106830
Pan, S. J. & Yang, Q. 2009, IEEE Transactions on Knowledge and Data Engineering, 22, 1345
Papadopoulos, D. P., Tamaazousti, Y., Ofli, F., Weber, I., & Torralba, A. 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7994
Park, H. S., Moon, D.-S., Zaritsky, D., et al. 2017, ApJ, 848, 19
Paszke, A., Gross, S., Massa, F., et al. 2019, in Advances in Neural Information Processing Systems 32, ed. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Curran Associates, Inc.), 8024–8035
Pawlowski, M. S., Pflamm-Altenburg, J., & Kroupa, P. 2012, MNRAS, 423, 1109
Perona, P. & Malik, J. 1990, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 629
Pontzen, A. & Governato, F. 2012, MNRAS, 421, 3464
Qian, N. 1999, Neural Networks, 12, 145
Revaz, Y. & Jablonka, P. 2018, A&A, 616, A96
Russakovsky, O., Deng, J., Su, H., et al. 2014, arXiv e-prints, arXiv:1409.0575
Sand, D. J., Crnojević, D., Strader, J., et al. 2014, ApJL, 793, L7
Sardogan, M., Tuncer, A., & Ozen, Y. 2018, in 2018 3rd International Conference on Computer Science and Engineering (UBMK), IEEE, 382–385
Schnider, E., Horváth, A., Rauter, G., et al. 2020, in Machine Learning in Medical Imaging, ed. M. Liu, P. Yan, C. Lian, & X. Cao (Cham: Springer International Publishing), 40–49
Simonyan, K. & Zisserman, A. 2014, arXiv preprint arXiv:1409.1556
Smercina, A., Bell, E. F., Price, P. A., et al. 2018, ApJ, 863, 152
Spekkens, K., Urbancic, N., Mason, B. S., Willman, B., & Aguirre, J. E. 2014, ApJL, 795, L5
Storrie-Lombardi, M., Lahav, O., Sodré, Jr., L., & Storrie-Lombardi, L. 1992, MNRAS, 259, 8P
Szegedy, C., Liu, W., Jia, Y., et al. 2015, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9
Tanoglidis, D., Ćiprijanović, A., & Drlica-Wagner, A. 2020, DeepShadows: Finding low-surface-brightness galaxies in survey images
Tanoglidis, D., Drlica-Wagner, A., Wei, K., et al. 2021, ApJS, 252, 18
Taylor, M. A., Eigenthaler, P., Puzia, T. H., et al. 2018, ApJL, 867, L15
Venhola, A., Peletier, R., Laurikainen, E., et al. 2017, A&A, 608, A142
Willman, B., Dalcanton, J. J., Martinez-Delgado, D., et al. 2005, ApJL, 626, L85
Wolf, J., Martinez, G. D., Bullock, J. S., et al. 2010, MNRAS, 406, 1220
Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. 2017, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500
York, D. G., Adelman, J., Anderson, Jr., J. E., et al. 2000, AJ, 120, 1579
Zeiler, M. D. 2012, arXiv preprint arXiv:1212.5701
Zhang, R., Zheng, Y., Mak, T. W. C., et al. 2016, IEEE Journal of Biomedical and Health Informatics, 21, 41
This paper was built using the Open Journal of Astrophysics LaTeX template. The OJA is a journal which provides fast and easy peer review for new papers in the astro-ph section of the arXiv, making the reviewing process simpler for authors and referees alike. Learn more at http://astro.theoj.org