MNRAS, 1–20 (2020). Preprint 9 February 2021. Compiled using MNRAS LaTeX style file v3.0
CNN Architecture Comparison for Radio Galaxy Classification
Burger Becker★, Mattia Vaccari, Matthew Prescott, Trienko Grobler
Computer Science Department, Stellenbosch University, Stellenbosch, South Africa
Inter-University Institute for Data Intensive Astronomy (IDIA) and Department of Physics and Astronomy, University of the Western Cape, Robert Sobukwe Road, 7535 Bellville, Cape Town, South Africa
★ Contact e-mail: [email protected]
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
The morphological classification of radio sources is important to gain a full understanding of galaxy evolution processes and their relation with local environmental properties. Furthermore, the complex nature of the problem, its appeal for citizen scientists and the large data rates generated by existing and upcoming radio telescopes combine to make the morphological classification of radio sources an ideal test case for the application of machine learning techniques. One approach that has shown great promise recently is Convolutional Neural Networks (CNNs). The literature, however, lacks two key elements when it comes to CNNs and radio galaxy morphological classification. Firstly, a proper analysis of whether overfitting occurs when training CNNs to perform radio galaxy morphological classification using a small curated training set is needed. Secondly, a good comparative study regarding the practical applicability of the CNN architectures in the literature is required. Both of these shortcomings are addressed in this paper. Multiple performance metrics are used for the latter comparative study, such as inference time, model complexity, computational complexity and mean per class accuracy. As part of this study we also investigate the effect that receptive field, stride length and coverage have on recognition performance. For the sake of completeness, we also investigate the recognition performance gains that can be obtained by employing classification ensembles. A ranking system based upon recognition and computational performance is proposed. MCRGNet, Radio Galaxy Zoo and ConvXpress (a novel classifier) are the architectures that best balance computational requirements with recognition performance.
Key words: radio continuum: galaxies – methods: statistical – surveys
Morphological classification is a fundamental aspect of galaxy formation and evolution studies, where the shape of galaxies is intimately connected to the dynamical and physical processes at play. Since the days of Hubble, astronomers have thus been establishing increasingly sophisticated classification schemes to group galaxies in different classes according to their shapes and appearance observed at optical wavelengths (Hubble 1926; de Vaucouleurs 1959; Sandage 1961; Elmegreen & Elmegreen 1987).

According to our current understanding of galaxy formation and evolution, every massive galaxy is believed to contain a supermassive black hole which undergoes periods of accretion throughout cosmic time to produce an Active Galactic Nucleus (AGN). AGN are often detected in radio surveys via their synchrotron emission produced by accelerated electrons in their cores, lobes and jets, and are then referred to as radio-loud AGN.

Morphologically, Fanaroff & Riley (1974) found that radio-loud AGN could be divided into two populations, known as Fanaroff-Riley (FR) types I and II (FRIs and FRIIs), which were found to show a division at approximately L_178MHz = 2 × 10^25 W Hz^-1 sr^-1. Those having bright cores, or "core-brightened" features, and diffuse lobes are labelled as FRIs, and those dominated by "edge-brightened" features far from their cores are known as FRIIs. A clear divide in the radio and optical luminosities between the two morphologies was observed by Owen & Ledlow (1994), indicating that they have formed and evolved in different ways. The different morphological types are thought to be due to different accretion modes. FRIs are believed to be more associated with Low Excitation Radio Galaxies (LERGs): passive galaxies undergoing inefficient accretion of hot gas, with an absence of emission lines in their optical spectra. FRIIs are more associated with High Excitation Radio Galaxies (HERGs): galaxies undergoing rapid and efficient accretion of a cold gas supply, as indicated via the presence of emission lines in their spectra (Hine & Longair 1979; Laing et al. 1994; Best & Heckman 2012; Pracy et al. 2016). While multi-wavelength observations are essential to better pinpoint the centre of radio galaxies and study their physical properties (Prescott et al. 2018; Ocran et al. 2020; Kozieł-Wierzbowska et al. 2020), determining radio morphologies is a very useful starting point toward improving our understanding of radio galaxies.

Radio morphologies can also be used to trace the environments of their host galaxies. Miraghaei & Best (2017) found that, at fixed stellar mass and radio luminosity, FRIs are more likely to be found in richer environments than FRIIs. Bent-tailed radio galaxies such as Narrow-Angle Tailed (NAT; Rudnick & Owen 1977) and Wide-Angle Tailed (WAT; Owen & Rudnick 1976; Missaglia et al. 2019) radio galaxies are associated with clusters of galaxies and represent galaxies with radio jets that are interacting with the hot intra-cluster medium (ICM) that resides there.

New surveys with lower flux limits show that upon closer inspection the FRI/FRII divide becomes less clear, revealing that there is much more overlap in the properties of FRIs and FRIIs than previously thought (Mingo et al. 2019). The FRI/FRII divide is further complicated by radio sources that have been found to exhibit hybrid FRI/FRII morphologies, also known as HyMoRS (Gopal-Krishna & Wiita 2000).
These, however, are likely to be bent FRII sources viewed at a particular orientation, whose lobes appear to have different morphologies due to the observer's line of sight (Smith & Donohoe 2019; Harwood et al. 2020).

In more recent times, the FR classification scheme has been expanded to include radio sources with compact morphologies. These so-called FR0 sources are believed to be the most abundant radio sources in the local Universe (Baldi et al. 2018; Garofalo & Singh 2019). Despite being abundant, little is known about their nature. Whilst some are young AGN that will grow to form FRI and FRII sources, a comparison between their number densities and that of extended sources indicates the majority must be older sources that have failed to form extended structures (Sadler et al. 2014; Baldi et al. 2018). Whittam et al. (2020) show FR0s are a mixed population of HERG and LERG radio sources.

Radio sources with more exotic morphologies have also attracted substantial interest recently. These include X-shaped and S-shaped radio galaxies (Cheung 2007), which may represent AGN that have undergone the process of hydrodynamical backflow (Leahy & Williams 1984) or may be the result of a spin-flip from the coalescence of two black holes (Ekers et al. 1978), with the former scenario being the preferred explanation in some of the latest work (Roberts et al. 2018; Cotton et al. 2020).
Radio astronomy is currently undergoing a rapid development in observational capabilities which is paving the way for the highly anticipated Square Kilometre Array (SKA; Braun et al. 2015, 2019). Before the advent of the SKA, its pathfinders and precursors (Norris et al. 2013) promise to revolutionize our knowledge of the radio sky. Ongoing surveys such as VLASS (Lacy et al. 2020) and EMU (Norris et al. 2011) are expected to detect 5 and 70 million radio sources, respectively, greatly exceeding the roughly 2.5 million radio sources known to date. Historically, scientific analysis used catalogues compiled by either individuals or small teams (Fanaroff & Riley 1974). However, the increasingly large samples of radio sources detected by modern radio telescopes mean that the classification of full catalogues by subject-matter experts is no longer a viable option.

One possible solution is the crowdsourcing of labelling to large groups of volunteers, known as citizen science. The first successful large scale citizen science project in galaxy morphology classification was Galaxy Zoo (Lintott et al. 2008), during which participants were asked to label galaxies observed as part of the Sloan Digital Sky Survey on the basis of their morphology. After initial fears of poor public participation, roughly 100,000 participants made 40 million individual classifications in 175 days. Due to the success of the initial project, Galaxy Zoo grew into the larger Zooniverse project, which serves as an online platform for various crowdsourcing projects. The first citizen science project devoted to radio astronomy was Radio Galaxy Zoo (Banfield et al. 2015), which aimed to classify radio sources observed in the FIRST survey based on their morphology and identify them with their infrared counterparts observed in the WISE survey. Radio Galaxy Zoo demonstrated that citizen scientists can help us to further the scientific exploitation of large radio surveys, in the process creating large samples of visually-inspected labelled sources.

Another solution is to make use of machine learning techniques to aid in the classification task. Convolutional Neural Networks (CNNs) are a popular choice for image recognition problems in both academia and industry. A CNN is a special type of neural network that learns which features are important to extract from images. These learned features are then employed to perform classification. AlexNet was one of the first CNN architectures to achieve human-level performance on the ImageNet Challenge, which involved classifying 1.2 million images into 1000 different classes (Deng et al. 2009; Krizhevsky et al. 2012).

CNNs were popularised as an automated means of morphological classification in astronomy during the Galaxy Zoo Challenge (Willett et al. 2013), hosted on the Kaggle data science platform. Dieleman et al. (2015) developed the CNN that obtained the highest recognition performance in this challenge. Both AlexNet and the study by Dieleman et al. (2015) inspired Toothless, the first CNN developed for the morphological classification of radio galaxies (Aniyan & Thorat 2017). Since then several new CNNs have been developed for morphological classification in radio astronomy, as shown in Table 1.

Constructing a CNN model usually requires a lot of training data. If the dataset that is used for training a CNN is too small, overfitting occurs. In most of the studies in Table 1 training was done on a small dataset.
It is, therefore, imperative to determine whether, in the case of radio galaxy morphological classification, overfitting indeed occurs when a small curated dataset is used for training. Having overfitted is, however, easily correctable by simply using a larger training set. The only real danger occurs when an overfitted model is used to predict the performance of an architecture in a real world setting. In this paper, we conduct an experiment to determine if this issue is in fact something which we should be cognisant of going forward. The experiment we propose makes use of two modified datasets from Ma et al. (2019).

Moreover, most of the studies in Table 1 have a similar overall layout. They first present a novel architecture and then report on the architecture's recognition performance. Furthermore, most of these studies do not analyse inference time or other factors related to computational cost (with the exception of model complexity often being evaluated through the number of trainable parameters; CLARAN's computational cost is reported in Wu et al. 2019). Some of these studies also focus on the effects of layer composition (Lukic et al. 2018). Most critically, none have looked at how computational cost impacts recognition performance. Analysis of the relationships between these metrics can further our understanding of existing architectures and help develop best practices for finding new architectures. In this paper we also address this shortcoming, i.e. we use a wide variety of computational cost metrics to assess whether the architectures in Table 1 are capable of real-time performance, whilst maintaining a high level of recognition performance.

All the architectures, the software used to analyse and test them, and the datasets used in the process are made publicly available (https://github.com/BurgerBecker/rg-benchmarker).

We start our paper by giving a brief overview of CNNs. In Section 3 we list all the architectures we used in our study, which also includes a novel architecture. The datasets we make use of are presented in Section 4 and are structured according to a simplified 4-class morphological classification system used by Alhassan et al. (2018): Compact sources, FRI, FRII and Bent-tailed sources. The experimental setup of our study is presented in Section 5. The results pertaining to the overfitting experiment are presented in Section 6. The results of the computational cost versus recognition performance analysis of the architectures in Table 1 are presented in Section 7 (a pragmatic ranking of the architectures is also established). As part of this performance analysis, we also investigate how other factors, like receptive field, stride length and coverage (see Section 2.2), impact the recognition performance of the architectures in Table 1. For the sake of completeness, we also investigate whether we can achieve recognition performance gains using ensemble classifiers in Section 7 (also see Section 5.5). Findings are then summarized according to subject in the conclusion.
A brief overview of CNNs is presented in Section 2.1. In Section 2.2 we briefly discuss the different metrics we made use of to evaluate the experiments we conducted.
Hubel & Wiesel (1968) found that mammalian visual cortices primarily consist of two types of cells: simple cells (which activate when straight edges have a certain orientation) and complex cells (with a larger receptive field and lower sensitivity to orientation). This inspired the Neocognitron artificial neural network (Fukushima 1980), which combined layers consisting wholly of one of two types of "cells" into a hierarchical model to perform handwritten character recognition. One type of cell would apply a convolutional operation to the input, while the other would downsample the input. The weights for the convolutional operation would be learned from input examples. The first deep CNN (LeCun et al. 1998) had seven layers and was primarily used for handwritten character recognition. At the time, training such deep models was computationally expensive and time consuming. The widespread advent of Graphics Processing Units (GPU) within desktop computers, however, resulted in more people being able to quickly train deep CNNs (Cireşan et al.). CNNs became the de facto standard for image classification after the 2012 ImageNet Challenge was won by a CNN, AlexNet (Krizhevsky et al. 2012).

Neural networks can be visualized as graph-like structures in which nodes are referred to as neurons. Each edge or connection to another neuron has a weight and a bias term, which represents the strength of their connection. The weights and biases affect the propagation of information through the network. With the right combination of weights, the network can reliably match groups of similar input, or classes, to a respective output, or class label. The neurons are arranged in layers, with an input being propagated from the input layer, through intermediate layers (called hidden layers), until it reaches an output layer. Although many different types of neural networks have been developed over time, CNNs have become some of the top performing classifiers for image recognition.

CNNs have three main types of layers: fully connected (or dense) layers, convolutional layers and pooling layers.
The neurons of a fully connected layer are, as the name implies, connected to all the neurons of the previous layer. The output layer is also a fully connected layer, with as many neurons as the number of classes that have been provided. The neuron with the highest output value determines the classification result, with a higher value meaning greater model confidence in the classification.
In convolutional layers, a convolution operation is performed on a small neighbourhood of pixels which then outputs a single value for the neuron in the next layer. This operation is realized using a small matrix containing trainable weights, known as the kernel. The kernel is moved over the image in strides, with a stride length of 1 moving the centre of the kernel one pixel across or down until the entire image has been covered. A commonly used kernel size is 3 × 3, although some architectures use a larger kernel, such as 11 × 11 pixels, early on in the network to reduce input size while the image still contains a high ratio of noise to information (AlexNet and Toothless make use of this). In later convolutional layers, the kernel is moved across the outputs of the previous layer and not the original image, i.e. the outputs of the previous layer become the input pixels of the current convolutional layer.

This many-to-one mapping during convolution causes the output of a convolutional layer to be downsampled (i.e. to have a smaller output dimension than the input). If downsampling happens too suddenly, this can potentially lead to the loss of too much information without it being incorporated into the model. One workaround for this is padding the input with zeros around the edges, to ensure the output shape is the same size as the input shape. This also assumes a stride length of 1, otherwise downsampling will still occur. This is often referred to as same or zero padding, whereas the absence of padding is known as valid padding. Kernel size, stride length and padding type are all examples of hyperparameters of a CNN, each of which could affect recognition performance.

The output of the final convolutional layer is then flattened into a one dimensional vector, which is then input into the first fully connected layer.
Pooling layers are used to deliberately downsample the input, reducing the input size while preserving the salient features we want the network to learn. Max pooling layers perform downsampling by moving a kernel across the input and returning only the maximum pixel value within the kernel. A max pooling layer's kernel size is normally 2 × 2, which halves the input size. A reduction in input size is needed so that the computational requirements of deeper layers can be reduced or kept constant.
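To make the preceding layer descriptions concrete, the following minimal Keras sketch (Keras being the framework used later in this paper) chains convolutional, pooling, flatten and dense layers. The layer sizes, input resolution and padding choices are illustrative assumptions; this is not one of the benchmarked architectures.

```python
from tensorflow.keras import layers, models

# Minimal illustrative CNN; sizes are assumptions for demonstration only.
model = models.Sequential([
    # 3x3 kernel, stride 1, zero ("same") padding: spatial size stays 150x150.
    layers.Conv2D(32, kernel_size=3, strides=1, padding="same",
                  activation="relu", input_shape=(150, 150, 1)),
    layers.MaxPooling2D(pool_size=2),   # 2x2 max pooling: 150x150 -> 75x75
    # "valid" padding (no padding): 75x75 shrinks to 73x73.
    layers.Conv2D(64, kernel_size=3, strides=1, padding="valid",
                  activation="relu"),
    layers.MaxPooling2D(pool_size=2),   # 73x73 -> 36x36
    layers.Flatten(),                   # 36*36*64 values -> one-dimensional vector
    layers.Dense(64, activation="relu"),        # fully connected hidden layer
    layers.Dense(4, activation="softmax"),      # one output neuron per class
])
model.summary()                         # prints output shapes and parameter counts
```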
Finding the right combination of weights is referred to as training in Deep Learning terminology, which is done with gradient descent and a loss (or error) function. The loss function represents the error between the target output (the class label that has been provided) and the network's current output (the predicted label). A single iteration of training takes place by calculating the gradient of the error function with respect to the weights and biases. The weights and biases are then updated based on the learning rate. However, since only the output layer has a clearly defined target output (the class label), the weights and biases of the intermediate neurons (hidden neurons) cannot be updated in isolation. The amount by which the weights and biases of hidden layers need to be updated depends on all the previous and subsequent layers' parameter values, which makes the gradient descent algorithm computationally expensive (especially on deep networks). This is circumvented by the backward propagation of errors (backpropagation) algorithm (Kelley 1960), which calculates the gradient of the final layer's error function and reuses partial computations from previous layers, moving "backwards" from the output layer through the hidden layers to the input layer.
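Continuing the illustrative sketch above: in Keras the loss function and the gradient-descent variant are specified at compile time, and each training step then performs the forward pass, loss evaluation, backpropagation and weight update automatically. The data array names below are placeholders.

```python
# Continuing the illustrative model defined above.
model.compile(
    optimizer="adam",                   # a gradient-descent variant
    loss="categorical_crossentropy",    # error between target and predicted labels
    metrics=["accuracy"],
)
# Each batch in fit() triggers: forward pass -> loss -> backpropagated
# gradients -> weight/bias updates scaled by the learning rate.
# model.fit(train_images, train_labels, epochs=16, batch_size=32,
#           validation_data=(val_images, val_labels))
```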
When a CNN can classify an image irrespective of orientation, it is said to have rotational invariance. CNNs are normally not fully rotationally invariant (Lukic et al. 2019a). Convolutional layers enforce translation equivariance and pooling layers add translation invariance, but both of these usually allow only limited invariance to rotations, normally not more than a few degrees (Marcos et al. 2016). Other Deep Learning architectures are rotationally invariant by design, such as Capsule Networks (Sabour et al. 2017). Lukic et al. (2019a) have compared the performance of conventional CNN architectures and capsule networks when they are both used to perform radio galaxy morphological classification.

F1-Score

A useful tool when reporting the results of classifiers is the confusion matrix. The ij-th entry of a confusion matrix tells you the number of images that were classified as belonging to class j even though they actually belong to class i. In practice, when we depict confusion matrices we often use annotated interpretable labels instead of the aforementioned integer labels. A hypothetical confusion matrix which was obtained after classifying a dataset consisting of radio sources is depicted in Figure 1. This confusion matrix depicts how well our classifier could distinguish FRI sources from non-FRI sources. The class for which a classifier's performance is currently being assessed is known as the positive class (the class currently under consideration). The remaining classes are known as the negative classes. In the case of Figure 1, FRI is our positive class. The depiction in Figure 1 will of course be different if another class becomes the positive class. Furthermore, the confusion matrix in Figure 1 also graphically depicts the definition of the following concepts in the case of a multiclass scenario: True Positives, False Positives, True Negatives and False Negatives. The general definition of the above concepts and examples thereof (from Figure 1) are listed below:

• True Positives (TP): images belonging to the positive class being classified as such (sources that were annotated as FRI and classified as such).
• True Negatives (TN): images from the negative classes that are not classified as belonging to the positive class (sources that were annotated as FRII and correctly classified as such, or sources that were annotated as Bent but incorrectly classified as FRII).
• False Negatives (FN): images belonging to the positive class not classified as such (sources annotated as FRI, but incorrectly classified as FRII).
• False Positives (FP): images belonging to the negative classes that are classified as belonging to the positive class (sources annotated as FRII, but incorrectly classified as FRI).

Recall refers to the ratio of the number of images that were correctly classified as belonging to the positive class to the total number of images in the positive class, i.e.:

recall = TP / (TP + FN)    (1)

Recall is also referred to as the True Positive Rate. Precision is the ratio of images that were correctly classified as belonging to the positive class to the total number of images that were classified as belonging to the positive class, i.e.:

precision = TP / (TP + FP)    (2)

The weighted harmonic mean of recall and precision is known as the F1-score:

F1 = 2 × (precision × recall) / (precision + recall)    (3)

Overall accuracy is the ratio of correct classifications for all classes to the total number of samples tested on.
With respect to the confusion matrix described in Figure 1, this would be calculated as the sum of the main diagonal (the TP of each class) divided by the sum of the entire matrix.

Overall accuracy can be a misleading metric, especially when a significant class imbalance is present. For this reason we use Mean Per Class Accuracy (MPCA), calculated as the mean of the main diagonal of a normalized confusion matrix. The confusion matrix is normalized by dividing each row by the number of samples in that row (which corresponds to the number of samples per class). This metric is less susceptible to class imbalances than overall accuracy.
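The metrics above reduce to a few lines of array arithmetic on a confusion matrix whose rows hold the annotated ground truth and whose columns hold the predicted labels. The sketch below follows the definitions in the text; the 4-class matrix is made up for illustration.

```python
import numpy as np

def per_class_metrics(cm):
    """Recall, precision and F1-score per class from a confusion matrix
    (rows: annotated ground truth, columns: predicted labels)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp              # positive-class images missed
    fp = cm.sum(axis=0) - tp              # negative-class images pulled in
    recall = tp / (tp + fn)               # equation (1)
    precision = tp / (tp + fp)            # equation (2)
    f1 = 2 * precision * recall / (precision + recall)   # equation (3)
    return recall, precision, f1

def overall_accuracy(cm):
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()        # diagonal sum over all samples

def mean_per_class_accuracy(cm):
    cm = np.asarray(cm, dtype=float)
    normalized = cm / cm.sum(axis=1, keepdims=True)  # divide each row by its class size
    return np.diag(normalized).mean()

# Hypothetical counts for (Compact, FRI, FRII, Bent):
cm = [[90, 3, 5, 2], [4, 40, 2, 1], [6, 3, 80, 1], [2, 5, 3, 50]]
print(per_class_metrics(cm))
print(overall_accuracy(cm), mean_per_class_accuracy(cm))
```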
Figure 1. Confusion matrix layout, with the annotated ground truth along the rows and the predicted label along the columns; each cell is marked as a true positive (TP), false positive (FP), false negative (FN) or true negative (TN). This specific example showcases an assessment of the FRI class.
We define model complexity as the number of trainable parameters. The trainable parameters of a neural network are its weights and bias terms. The number of trainable parameters of all model instances associated with a particular architecture remains the same as long as they were created using the same set of hyperparameters.
Each architecture's computational complexity is measured using Tensorflow's version 1 profiler. The aforementioned profiler measures the number of floating point operations (FLOPs) used by the model in a single forward pass.

Theoretical GPU memory usage was estimated by first determining the memory footprint of the CNN's parameters and then adding to that the amount of active memory the CNN would require when processing a batch of data (a batch size of 32 was used in this case; the batch size is the number of samples classified concurrently at any point in time).

Inference time is the time that a CNN requires to classify a single image. Classification speed is the number of images that are classified per second, obtained by inverting inference time. In this paper, inference time was estimated by taking the average of 10 timed runs in which we classified 3072 images (with a batch size of 32).
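Two of these cost metrics can be estimated directly from a Keras model, as in the simplified sketch below. The toy model, input resolution and use of random dummy inputs are assumptions; the paper's own measurements used the TensorFlow v1 profiler on a Tesla V100.

```python
import time
import numpy as np
from tensorflow.keras import layers, models

# A trivial stand-in model; any compiled Keras model could be timed this way.
model = models.Sequential([
    layers.Input(shape=(150, 150, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation="softmax"),
])

# Model complexity: the number of trainable parameters (weights and bias terms).
print("trainable parameters:", model.count_params())

# Inference time: classify 3072 dummy images in batches of 32, averaged over 10 runs.
batch = np.random.rand(32, 150, 150, 1).astype("float32")
times = []
for _ in range(10):
    start = time.perf_counter()
    for _ in range(3072 // 32):
        model.predict(batch, verbose=0)
    times.append(time.perf_counter() - start)

inference_time = np.mean(times) / 3072              # seconds per image
print("classification speed:", 1.0 / inference_time, "images per second")
```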
The final convolutional layer's output is not necessarily the result of a transformation applied to every pixel in the input image (unlike a dense/fully connected layer). Rather, each output pixel has a limited "field of view": a limited region in the input image that trickles down through the convolutional layers to become a single output.
This is the architecture's theoretical receptive field, as opposed to its effective receptive field (the pixels in that limited region that had the largest impact on the output) (Luo et al. 2016). We do not consider the effective receptive field any further in this paper. Moreover, for the sake of simplicity we will refer to the theoretical receptive field simply as the receptive field throughout the remainder of the paper.

Effective stride is defined as the stride between the input layer and the output layer of the convolutional part of a CNN (Araujo et al. 2019); effective padding is defined similarly. The reader who wants to gain more insight into these topics is referred to Araujo et al. (2019), who provide an in-depth description of how the receptive field and the effective stride of a CNN are computed.
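For a plain chain of convolution and pooling layers, the receptive field and effective stride follow simple recurrences (Araujo et al. 2019), sketched below. The example evaluates a stack of three 3 × 3 convolutions whose first layer has stride 2, which matches the first convolutional stack of the novel ConvXpress architecture introduced in Section 3 and reproduces the 11 × 11 receptive field quoted there.

```python
def receptive_field(layer_specs):
    """Theoretical receptive field and effective stride of a chain of
    convolution/pooling layers, each given as a (kernel_size, stride) pair.
    Follows the recurrences described by Araujo et al. (2019)."""
    r, j = 1, 1                  # receptive field and effective stride at the input
    for k, s in layer_specs:
        r = r + (k - 1) * j      # a k-wide kernel sees (k - 1) extra input strides
        j = j * s                # strides compound multiplicatively
    return r, j

# Three stacked 3x3 convolutions, the first with stride 2:
print(receptive_field([(3, 2), (3, 1), (3, 1)]))   # -> (11, 2)
```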
Data coverage is the percentage of the total dataset that a classifier can assign a label to, given a certain confidence threshold. A good characteristic for a classifier to have is that its recognition performance should increase as its prediction confidence threshold is increased. Data coverage will either decrease or remain constant as the prediction confidence threshold is increased, with a significant decrease expected at higher confidence thresholds. In the ideal case, excluding only a few sources from your dataset will bring about a large gain in recognition performance.
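A sketch of how coverage and the accompanying recognition performance can be traced against the confidence threshold is given below; the function name and the use of the maximum softmax output as the confidence are assumptions.

```python
import numpy as np

def coverage_and_accuracy(probs, labels, threshold):
    """Data coverage and accuracy at a given prediction-confidence threshold.

    probs  : (n_samples, n_classes) array of softmax outputs
    labels : (n_samples,) array of integer ground-truth labels
    """
    confidence = probs.max(axis=1)
    kept = confidence >= threshold          # samples the classifier will label
    coverage = kept.mean()                  # fraction of the dataset covered
    if not kept.any():
        return 0.0, float("nan")
    accuracy = (probs[kept].argmax(axis=1) == labels[kept]).mean()
    return coverage, accuracy

# Sweeping the threshold traces the coverage/performance trade-off:
# for t in np.linspace(0.25, 0.99, 20):
#     print(t, coverage_and_accuracy(probs, labels, t))
```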
In this section we discuss the architectures we considered for our study. We also present useful auxiliary information that will improve the reader's understanding of the paper.
The terms architecture and model are not used interchangeably in the context of this study. We refer to an architecture as the layout of the network's structure, whereas a model is a trained instance of the architecture. Models of the same architecture are differentiated by the data they were trained on and by other hyperparameters, such as different learning rates or the optimizer used during training.
The architectures we considered in our comparison are listed in Table 1, which provides the architecture names, the corresponding studies from the literature, as well as shortened keys assigned for use in plots (see Figure 5 as an example). Some studies have contributed more than one architecture (Lukic et al. 2019b). All of the architectures listed in Table 1 were modified to enable them to discern between four types of radio sources (i.e. the number of output classes was changed to four). Example images of the four classes that we consider in this paper are depicted in Figure 2.
Table 1. List of architectures and their keys for all figures. Architectures marked † have been modified from their original form.

Architecture Name | Key | Study
AlexNet | ALN | Krizhevsky et al. (2012)
ATLAS X-ID† | ATL | Alger et al. (2018)
ConvNet4 | CN4 | Lukic et al. (2019b)
ConvNet8 | CN8 | Lukic et al. (2019b)
FIRST Classifier | 1stC | Alhassan et al. (2018)
FR-Deep | FR-D | Tang et al. (2019)
Hosenie | H | Hosenie (2018)
MCRGNet† | MCRG | Ma et al. (2019)
Radio Galaxy Zoo | RGZ | Lukic et al. (2018)
SimpleNet | CNs | Lukic et al. (2019b)
Toothless† | TLS | Aniyan & Thorat (2017)
CLARAN† (VGG16D) | VGG | Wu et al. (2019)
ConvXpress | CXP | Novel

AlexNet, ConvNet4, ConvNet8, FIRST Classifier, FR-Deep, Hosenie, Radio Galaxy Zoo and SimpleNet were not modified in any further way.
The following architectures were further modified:

• ATLAS X-ID: ATLAS X-ID was not designed explicitly for radio galaxy classification, but rather for finding host galaxies for radio sources by cross-identification. The CNN described in the paper had an additional input vector of 10 features from the candidate host in the SWIRE survey, which has not been included in the modified version.
• MCRGNet: adapted from the neural network described by Ma et al. (2019). Initially this network was pretrained as the encoder level of a Convolutional Auto-Encoder using an unlabelled sample and then fine-tuned on a labelled sample. Several of these CNNs would be combined to form a dichotomous tree classifier, each classifying a subset of the classes. Due to computational constraints, the pretraining step has been left out and only a single instance of this architecture is used.
• Toothless: originally implemented as a fusion classifier consisting of 3 binary classifiers, classifying FRI/FRII, FRI/Bent and FRII/Bent respectively. If two classifiers predicted a source as the same class with a 60% probability, the classification would be accepted; if both predicted with less than the 60% confidence threshold, a '?' would be appended to the classification. Additionally, should none of the classifiers give the same class output, the source is labelled as "Strange". To reduce computational requirements, only a single classifier instance is considered.
• CLARAN: CLARAN takes as input a radio source and a corresponding infrared image, after which it outputs a bounding box showing the location and size of the detected radio source. The source morphology is given in the format iC_jP, where i is the number of components and j is the number of flux-density peaks. A corresponding probability of the morphology is also output. CLARAN uses VGG16D (Simonyan & Zisserman 2015) as a classification layer that is fed into a region-of-interest classifier. While the entire architecture was not suitable for this study, the VGG16D classifier layer was appropriate to include. Note that the VGG16D architecture we include in this study, in contrast with CLARAN, can only assign one label to each image it receives and would, therefore, not fare well if the images it receives contain multiple sources.

At this point we should take a moment to consider the potential impact that the modifications we discuss in the beginning of Section 3 and those in Section 3.2.2 will have on the recognition performance of the architectures presented in the studies from Table 1. First, it should be duly noted that the proposed modifications are a necessity, as they make it possible to perform a meaningful comparison of these architectures. Three major modifications were discussed in the beginning of Section 3 and in Section 3.2.2:
Output Classes
The number of output classes and in some cases even the labels of the output classes were altered (Toothless is an example of the former, CLARAN an example of the latter). This alteration, however, is standard practice within the field of Deep Learning. Take AlexNet as an example: it was originally designed for the ImageNet Challenge, but it is nowadays used to solve many other types of image recognition problems (i.e. the number of classes and the output labels it can produce differ from its original use case). Generally speaking, if a CNN architecture is identified that can discern between N different classes, then its recognition performance will normally not deteriorate significantly if the number of classes that one considers is either reduced or increased by one (given that it is properly re-trained). Moreover, neither would considering N completely different labels have a significant impact on its performance. There are of course exceptions to this: if the nature of the problem is changed completely, or the inherent separability of the dataset changes significantly, this generalization might not necessarily remain true.

Architecture Instances
Only single architecture instances were considered (for example, only a single architecture instance of Toothless was used). Multiple instances of any architecture can be incorporated into a more complex classifier (like a fusion classifier). This will certainly improve the recognition performance of a particular architecture. However, knowing how a single instance of the architecture performs enables us to identify which architectures will ultimately perform better if they are chosen to create a more complex classifier.
Data
The same dataset was used to evaluate each architecture; no peripheral data was included in our experiments, so that a fair comparison between the architectures in Table 1 could be made, even though additional data would have resulted in improved performance for certain architectures. As mentioned in Section 3.2.2, MCRGNet and ATLAS X-ID are particularly affected by this.
The architecture of ConvXpress is based on the architectures of ConvNet8 and VGG16D. ConvXpress is deeper than ConvNet8 (11 vs 8 convolutional layers) and uses the convolutional stack structure introduced by VGG16D. Each stack is comprised of 3 convolutional layers (except for the last stack, which has only two) and a max pooling layer. This was developed to match or enlarge the receptive field size (see Section 2.2.7) of AlexNet's convolutional layers without having to use AlexNet's large kernel size. The receptive field of ConvXpress's first convolutional stack is 11 × 11, obtained with three stacked non-linear activations compared to AlexNet's single activation, making the model more discriminative (Simonyan & Zisserman 2015).

Table 2. ConvXpress architecture layout.

Layer | Depth | Kernel Size | Stride Length | Activation
Conv2D | 32 | 3 | 2 | ReLU
Conv2D | 32 | 3 | 1 | ReLU
Conv2D | 32 | 3 | 1 | ReLU
MaxPooling2D | | 2 | 1 |
Dropout | | | |
Conv2D | 64 | 3 | 2 | ReLU
Conv2D | 64 | 3 | 1 | ReLU
Conv2D | 64 | 3 | 1 | ReLU
MaxPooling2D | | 2 | 1 |
Dropout | | | |
Conv2D | 128 | 3 | 1 | ReLU
Conv2D | 128 | 3 | 1 | ReLU
Conv2D | 128 | 3 | 1 | ReLU
MaxPooling2D | | 2 | 1 |
Dropout | | | |
Conv2D | 256 | 3 | 1 | ReLU
Conv2D | 256 | 3 | 1 | ReLU
MaxPooling2D | | 2 | 1 |
Dropout | | | |
Flatten | | | |
Dense | 500 | | | Linear
Dropout | | | |
Dense | 4 | | | Softmax

In addition to this, stacking reduces the number of parameters required: a layer with an 11 × 11 kernel and C input channels requires 11²C = 121C parameters, while 3 stacked layers with 3 × 3 kernels and C input channels require only 3(3²C) = 27C parameters.

ConvXpress has a non-standard stride length, similar to MCRGNet, Toothless, AlexNet and Radio Galaxy Zoo. In particular, it makes use of a stride length of 2 in the first convolutional layer of the first and second convolutional stacks. The dense (or fully connected) layers are the same as ConvNet8's, using a linear activation in the second-to-last layer with an L2 kernel regularizer. ConvXpress also contains five dropout layers. During each training step a dropout layer randomly turns off some of the neurons in the layer that comes before it (i.e. it blocks their output from propagating to the next layer). The probability that a specific neuron is turned off is known as the dropout rate p. This is similar to creating and training many small networks within the larger network (Srivastava et al. 2014). The value of p for all the dropout layers in ConvXpress is 0.25, with the exception of the last dropout layer, for which p is equal to 0.5. The architecture of ConvXpress is presented in Table 2 (a code sketch of this layout is given after the list of excluded architectures below).

Some architectures were excluded from the study, either for being originally designed to perform a task other than classification or in order to restrict the scope of the study to conventional CNN architectures. The following architectures were excluded:

• Convosource: designed for source-finding, to extract the pixels that belong to an astronomical source from an image with background noise (Lukic et al. 2019a).
• COSMODEEP: designed to perform a combination of source finding and classification by breaking up larger images into smaller tiles that are then individually classified as either containing no signal or containing a radio source (Gheller et al. 2018).
• DEEPSource: aimed at source-finding in low signal-to-noise ratio cases (Sadr et al. 2019).
• Capsule Networks: this study limits the focus of comparison to conventional CNN architectures described in the literature. Lukic et al. (2019a) compared Capsule Network performance with conventional CNN architectures.
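For concreteness, the following Keras sketch assembles the ConvXpress layout from Table 2. The input resolution, the L2 regularization strength and the use of same padding are assumptions not fixed by the table.

```python
from tensorflow.keras import layers, models, regularizers

def conv_stack(filters, first_stride):
    """One ConvXpress stack: three 3x3 convolutions, max pooling and dropout."""
    return [
        layers.Conv2D(filters, 3, strides=first_stride, padding="same", activation="relu"),
        layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu"),
        layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=1),  # stride 1, as listed in Table 2
        layers.Dropout(0.25),
    ]

convxpress = models.Sequential(
    [layers.Input(shape=(150, 150, 1))]       # input resolution is an assumption
    + conv_stack(32, first_stride=2)          # stride 2 opens the first stack
    + conv_stack(64, first_stride=2)          # ... and the second stack
    + conv_stack(128, first_stride=1)
    + [
        # The final stack has only two convolutional layers.
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=1),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(500, activation="linear",
                     kernel_regularizer=regularizers.l2(0.01)),  # strength assumed
        layers.Dropout(0.5),                  # higher rate on the last dropout layer
        layers.Dense(4, activation="softmax"),
    ]
)
convxpress.summary()
```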
AlexNet and Toothless

Just as AlexNet has been a seminal work in image classification with CNNs, so too has Toothless (Aniyan & Thorat 2017) made its mark on the classification of radio galaxies by being the first CNN developed for this very purpose. It has thus been referenced in almost all of the subsequent works listed in Table 1. Toothless is based on AlexNet's architecture and does not differ much other than in the type of padding used, with the original AlexNet design using valid padding (no padding is applied around the edges of the input of a layer) rather than same padding (zero padding around the edges of the input of a layer, to ensure there is no size reduction other than that caused by stride length). This small difference does have a slight effect on performance: since the size of the input is being reduced steadily with valid padding, fewer computational resources are required for AlexNet than for Toothless. The type of padding at different layers is something that should be carefully considered when designing an architecture, since it might shrink the input too fast, throwing away useful information. Same or zero padding will work better for a wider range of input resolutions.
Ma et al. (2019) used two datasets in their study:

• the Labelled Radio Galaxy (LRG) dataset: a curated dataset that contains well-attested sources from the CoNFIG (Gendre & Wall 2008; Gendre et al. 2010), GROUPS (Proctor 2011), FRICAT (Capetti et al. 2017a), FRIICAT (Capetti et al. 2017b), FR0CAT (Baldi et al. 2018) and Cheung (2007) catalogues (see Table 3). The dataset contains 1442 sources and consists of 6 classes (Compact, FRI, FRII, Bent, X-shaped and Ringlike).
• the Unlabelled Radio Galaxy (URG) dataset: a dataset containing 14245 AGNs from the Best-Heckman sample (Best & Heckman 2012). This dataset was manually labelled by Ma et al. (2019).

We use slightly modified versions of these two datasets in our study (note that the total number of sources reported in Ma et al. (2019) and the total number of sources in the catalogue available on their GitHub repository differ slightly). We will refer to the modified version of the LRG dataset as the Modified Labelled Radio Galaxy (MLRG) dataset throughout the rest of the paper (it contains 1,328 sources). Similarly, we will refer to the modified URG dataset as the Modified Unlabelled Radio Galaxy (MURG) dataset (it contains 14,093 sources). We made the following modifications. We removed all X-shaped and Ringlike sources from both the LRG and the URG. We also removed error-prone images from the URG dataset (i.e. images consisting only of NaN values; three Compact, two FRI and two FRII sources were also removed). Moreover, all FR0 sources were added to the Compact class.
Table 3. Class breakdown per catalogue for the LRG dataset.

Catalogue | Compact | FR0 | FRI | FRII | Bent | X | Ring
CoNFIG | 270 | 8 | 14 | 350 | 9 | 0 | 0
FR0CAT | 0 | 104 | 0 | 0 | 0 | 0 | 0
FRICAT | 1 | 19 | 173 | 0 | 5 | 0 | 0
FRIICAT | 0 | 0 | 0 | 80 | 8 | 3 | 0
Proctor (2011) | 0 | 1 | 0 | 0 | 284 | 0 | 32
Cheung (2007) | 0 | 2 | 0 | 0 | 0 | 79 | 0
Total | 271 | 134 | 187 | 430 | 306 | 82 | 32
Table 4. The first 5 rows of the MLRG sample; the full table is available on the project's GitHub repository.

Source Name | Right Ascension (degrees) | Declination (degrees) | Classification
J000330.73+002756.1 | 0.05854 | 0.46558 | Bent-tailed
J001247.57+004715.8 | 0.21321 | 0.78772 | FRII
J002107.62-005531.4 | 0.35212 | -0.92539 | FRII
J002900.98-011341.7 | 0.48361 | -1.22825 | Compact
J003930.52-103218.6 | 0.65848 | -10.5385 | FRI
Table 5. The first 5 rows of the MURG sample; the full table is available on the project's GitHub repository.

Source Name | Right Ascension (degrees) | Declination (degrees) | Classification
J000001.57-092940.3 | 0.00044 | -9.49453 | Compact
J000025.55-095752.8 | 0.00710 | -9.96467 | FRI
J000027.89-010235.4 | 0.00775 | -1.04317 | Compact
J000049.32-005042.9 | 0.01370 | -0.84525 | FRI
J000052.92+003044.6 | 0.01470 | 0.51239 | FRII
Table 6. Comparison of the MLRG and MURG datasets we use in this study.

Class | MLRG | MURG
Compact | 405 | 6093
FRI | 187 | 5039
FRII | 430 | 2072
Bent | 306 | 889
Total | 1328 | 14 093
The final class breakdown of both datasets is presented in Table 6. The final catalogues we used are partially presented in Tables 4 and 5. The rest of these catalogues are available on our GitHub repository. A script that can download the sources from the catalogues is also provided on our GitHub repository. It downloads FIRST cutouts (Becker et al. 1995) in FITS format (300 by 300 pixels) via the Skyview tool (McGlynn et al. 1998).
This section describes the experimental setup we used. The image preprocessing procedure we adopted is described in Section 5.1. The hardware used, as well as other important overarching experimental information, is presented in Section 5.2. The two main experiments conducted are described in Section 5.3 and Section 5.4. We end this section by describing the ensemble classifiers that we constructed.
The preprocessing steps used are as follows: the images were first normalized and then thresholding was applied. Allowing some noise in the training data improves learning for very deep networks (Neelakantan et al. 2015). The thresholding method used assigns a zero value to all pixels with a value below the threshold of three standard deviations above the mean pixel value of the specific image; otherwise, the pixel value is kept the same.
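A sketch of this preprocessing step is given below; the min-max normalization convention is an assumption, as the text above does not pin down the exact normalization used.

```python
import numpy as np

def preprocess(image):
    """Normalize an image, then zero every pixel below mean + 3*std."""
    image = np.nan_to_num(np.asarray(image, dtype=float))
    # Normalization (min-max scaling to [0, 1] is assumed here).
    image = (image - image.min()) / (image.max() - image.min())
    # Threshold: three standard deviations above the mean pixel value.
    threshold = image.mean() + 3.0 * image.std()
    image[image < threshold] = 0.0      # suppress background noise
    return image
```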
All training was performed on an Nvidia Tesla V100 32GB. Our architectures were constructed using the Deep Learning framework Keras (Chollet et al. 2015). To ensure replicable results we provided a random seed to all non-deterministic processes. In addition to this step, Tensorflow requires you to set the TF_CUDNN_DETERMINISTIC environment variable to '1' or 'true' (alternatively, depending on the version of Tensorflow being used, either the Nvidia Tensorflow-Determinism patch, https://pypi.org/project/tensorflow-determinism/, can be applied or the TF_DETERMINISTIC_OPS environment variable must be set).
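In code, this determinism setup amounts to something like the sketch below; the seed value shown is illustrative.

```python
import os
import random
import numpy as np
import tensorflow as tf

# Request deterministic cuDNN kernels (which variable applies depends on the
# Tensorflow version in use).
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"
# os.environ["TF_DETERMINISTIC_OPS"] = "1"   # alternative for other versions

SEED = 42                 # illustrative; a different seed was used per run
random.seed(SEED)         # Python-level randomness
np.random.seed(SEED)      # NumPy (e.g. data shuffling, subset selection)
tf.random.set_seed(SEED)  # Tensorflow/Keras weight initialization
```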
A customized version of the Keras Data Generator class was used to load images during training, validation and testing.

Each architecture was trained for 16 epochs with a learning rate dependent on the number of parameters in the architecture. Adam was used as the optimizer with a callback function that reduces the learning rate once a loss plateau is reached. This callback function halves the learning rate after 3 epochs in which the validation loss has not decreased by the set threshold of 0.001 (i.e. this is the learning rate scheduler we used). At the end of each epoch, another callback function assesses the validation loss. If the current model has a lower validation loss than the previous lowest validation loss, this model is saved and the previous one is discarded. This model is used to represent the architecture in the tests of both experiments below. This process is necessary to prevent overfitting, by storing the models that generalize well on the validation data.

Furthermore, the experiments below were repeated three times (different seed values for the random processes and the subset selection were assigned during each of the three runs). Using different seed values results in a different weight initialization for the CNNs and a different subset selection for the training, validation and testing sets during each run. This is done to get a more accurate representation of the architecture's performance for the chosen hyperparameters and to assess the validity of the results.

In this experiment we emulate the type of training most of the architectures in Table 1 employed in their respective studies: training, validation and testing on a small curated dataset. More specifically: the architectures are trained on a subset of the MLRG dataset and then tested on a mutually exclusive subset of the MLRG dataset. The architectures are then re-tested on the full MURG dataset. The training, validation and test set breakdown used for this experiment is summarized in Table 7. We elaborate further in this regard in the sections that follow.
Figure 2. Examples of the different radio morphologies that have been classified in this study: an FRI (A), an FRII (B), a compact radio source (C) and a bent-tailed radio galaxy (D). All examples are shown before the preprocessing step.
The training and validation sets are augmented by rotating each source at 15 degree intervals, leading to 24 rotated samples of each image. The total numbers of augmented samples are given in parentheses in Table 7. This is done to increase the number of samples for validation and training, as well as to address rotational invariance (discussed in Section 2.1.7). Each source is rotated after preprocessing and then saved as a new FITS image with the rotation factor added to the original file name.
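This augmentation step could look like the sketch below; the use of scipy for the rotation and the interpolation settings are assumptions.

```python
from scipy.ndimage import rotate

def augment_by_rotation(image, step_degrees=15):
    """Return 24 rotated copies of a preprocessed image (15-degree steps).
    Each copy would then be written to a new FITS file with the rotation
    angle appended to the original file name."""
    return [
        rotate(image, angle, reshape=False, order=1, mode="constant", cval=0.0)
        for angle in range(0, 360, step_degrees)
    ]
```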
The training and validation data are selected from the MLRG dataset (the samples in these sets depend on the chosen seed value and as such differ for each experimental run). Training is performed on 80 unique sources per class (1920 after augmentation) and validation on 60 unique sources per class (1440 after augmentation). Using the aforementioned number of sources for training and validation allows for testing on roughly 25% of the smallest class (FRI) in the MLRG dataset. The split ratios are in line with both a standard training/validation/test split commonly used in practice and with the splits used in most of the studies of the architectures presented in Table 1.
We first test the resulting models on the test split of the MLRGdataset. The models are then tested on the full MURG dataset. Bothsets are given in Table 7.
In this section we describe an experiment that is designed to be less susceptible to overfitting. While the MLRG sample provides excellent examples of each class, the application of stringent selection criteria results in the loss of samples that could make a model more robust. Similar to allowing some noise to remain after preprocessing an image to facilitate better training, allowing samples in the training set that are less-than-perfect examples provides a more nuanced understanding of the class distinction and can lead to more robust classification systems. This problem might be more adequately addressed by taking into account the annotator's confidence in their classification, as a separate input in the dense layer for example; however, we leave this for future exploration.

While the aforementioned issue is worth noting, the small size of the MLRG sample is the most serious concern when it comes to potentially overfitting a model. Large training sets are required to train any type of Deep Neural Network. The MURG sample provides a dataset that is large enough to be used to train a Deep Learning model. Although the training sample used for the MURG random split experiment is relatively small compared to what other deep learning studies have used, there is a 312.5% increase in the size of the training set used for this experiment compared with the size of the training set used during the Overfit experiment. The results of the MURG random split experiment provide a more realistic reflection of the expected performance of the architectures in Table 1 when deployed in practice.

More specifically: for this experiment architectures are trained on a subset of the MURG sample and tested on a test split of the MURG sample. The training, validation and test set breakdown used for this experiment is summarized in Table 8. We elaborate further in this regard in the sections that follow.
The training set is augmented by rotating each source at 15 degree intervals, leading to 24 rotated samples of each image. This results in 24,000 and 9,600 sources for training and validation respectively after augmentation, as shown in parentheses in Table 8.
The training and validation data are sampled from the MURG dataset. Training is performed on a random selection of 250 sources per class (6000 after augmentation) and validation on 100 sources per class (2400 after augmentation). Again, exactly which sources are selected is determined by the random seed that was used during each experimental run. While this is a much larger training and validation split than the one used for the Overfit experiment, it is significantly smaller than what is normally used when training a CNN: the total training set makes up only 7.87% of the total dataset, compared to the 50% or more of the set normally selected for training. This smaller selection was chosen to assess the efficacy of model generalization when training on a relatively small subset of the data.
Table 7. Overfit Experiment: training, validation and test set breakdown per class. Note the two different sets which are used for testing, which relate to the results in Figure 3. The values in parentheses are the numbers of augmented samples.

Class | Training Set (Augmented) | Validation Set (Augmented) | MLRG Test Set | MURG Test Set
Compact | 80 (1920) | 60 (1440) | 265 | 6093
FRI | 80 (1920) | 60 (1440) | 47 | 5039
FRII | 80 (1920) | 60 (1440) | 290 | 2072
Bent | 80 (1920) | 60 (1440) | 166 | 889
Total | 320 (7680) | 240 (5760) | 768 | 14 093
Table 8. MURG Random Split Experiment: training, validation and test set breakdown per class. The test set is a subset of the MURG samples. The values in parentheses are the numbers of augmented samples.

Class | Training Set (Augmented) | Validation Set (Augmented) | Test Set
Compact | 250 (6000) | 100 (2400) | 5743
FRI | 250 (6000) | 100 (2400) | 4689
FRII | 250 (6000) | 100 (2400) | 1722
Bent | 250 (6000) | 100 (2400) | 539
Total | 1000 (24 000) | 400 (9600) | 12 693
Testing was performed on a MURG test split. The total number ofsamples in this test set is given in Table 8.
An ensemble classifier is a combination of several classifiers trained to perform the same task (such as classification). These ensembles often generalize better than any of their constituents and are less likely to have the same pitfalls as their constituents (overfitting to a specific class, for example). Two ensemble classifiers have been created from the MURG Random Split Experiment:

(i) an ensemble of all classifiers (ENA);
(ii) an ensemble of the top 4 classifiers, selected based on their MPCA, from here on referred to as the SKA Artificial Intelligence Network (SKAAI Net in Figure 4, or SKN in Figure 6).

Both ensembles sum the output probabilities of their constituent classifiers and take the class with the highest summed probability as the output class.
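The ensembling rule described above is a few lines of code; the sketch below assumes a list of trained Keras models that share the same input format.

```python
import numpy as np

def ensemble_predict(models, images):
    """Sum the constituent classifiers' output probabilities and return the
    class with the highest summed probability for each image."""
    summed = sum(m.predict(images, verbose=0) for m in models)
    return np.argmax(summed, axis=1)
```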
Section 6.1 reports only on the results from the Overfit Experiment (see Section 5.3). In Section 6.2, we compare the results obtained from the Overfit experiment with those of the MURG Random Split Experiment (see Section 5.4).
Figure 3 shows the averaged MPCA over three runs of the Overfit Experiment for each architecture.
Figure 3. Models trained on the MLRG sample in the "Overfit" experiment, reporting MPCA averaged over three iterations with different random seed values. The crosses represent each architecture's averaged MPCA on the MLRG sample's test split, while the diamonds show averaged MPCA on the full MURG dataset. The results show that training and testing architectures on a small sample, such as the MLRG sample, can give misleading expectations for performance on a larger dataset that has not been curated as thoroughly, such as the MURG dataset. The standard deviation for each architecture is also shown.
The data points depicted by the cross markers represent the MPCA results associated with the MLRG test set (described in Table 7), while the data points depicted by the diamond markers represent the MPCA results associated with testing on the full MURG dataset. All models experience a more than 20% decrease in performance when switching from the MLRG test set to the MURG dataset, clear evidence of the models overfitting to the MLRG training set and of the results being unreliable for assessing performance on a larger dataset. Given that the training sample size is only 2.2% of the test sample size, model performance on the MURG dataset is not as bad as one would expect. The results, however, indicate that the models trained on only the MLRG set should not be used for autonomous classification in practice, since on the MURG dataset most of the models have a sub-50% accuracy in at least one class.

Training and testing models on small datasets gives a skewed perception of architecture performance. Models need to be trained and assessed on samples that are representative enough of the underlying data distribution that underpins the classification problem at hand (a too small training dataset prevents this).
In Figure 4, we compare the MPCA averaged over three runs of the Overfit experiment (diamond markers) with the MPCA results obtained from the models trained during the MURG Random Split experiment (circle markers). It is important to note that the models associated with the two experiments are not tested on the same data (although the intersection between the two datasets is large).
Figure 4. Models trained on a selection from the MURG sample (circles) compared with those trained on the MLRG sample (diamonds) in the "Overfit" experiment, giving MPCA averaged over three iterations with different random seed values. The ensemble classifiers' averaged MPCA is also given, with the best performing MURG-trained ensemble given as SKAAI Net and the ensemble of all classifiers given as ENA. The standard deviations for each architecture and the ensembles are also shown.
The Overfit experiment is tested on the full MURG dataset, while the MURG Random Split experiment is tested on a large subset of the MURG dataset (which differs for each run depending on the random seed that was chosen). All the models associated with the MURG experiment show an increase in recognition performance when they are compared to the models associated with the Overfit experiment (ranging from an 11.7% to a 24.03% increase in performance, with an average increase in performance of 18.5%). The average increase in recognition performance is 3.26 times that of the training data size increase (which went from 2.2% to 7.87% of the total dataset). The results of the top 4 ensemble classifier (SKAAI Net) and the ensemble of all the classifiers (ENA) are also given: the top dashed line represents the results associated with the MURG Random Split ensemble classifier and the bottom dashed line the results associated with the Overfit ensemble classifier. Note that the order in this figure is based on the MURG Random Split performance.

The MURG Random Split result is a better indication of architecture performance than the results obtained from the Overfit experiment, the reason being that the architectures are exposed to a larger dataset during training and are, therefore, less prone to overfit. The next section only deals with results obtained from the MURG Random Split experiment.
All the results presented in the subsequent subsections were obtained from the MURG Random Split experiment (Section 5.4). The results of this experiment are summarized in Table A1 and its follow-on table, Table A2. In Section 7.1, we report on the MPCA versus the computational complexity of the architectures in Table 1. Note that, for the sake of brevity, we sometimes use the shortened phrase "architectures" instead of "architectures in Table 1" when referring to the architectures that we considered in this paper. Section 7.2 looks at the per class F1-score performance of the architectures (which serves to showcase the trade-off in class performance for each classifier and highlights the shortfalls of a metric such as MPCA). The memory requirements and the classification speeds of the architectures are discussed in Section 7.3 and Section 7.4. The receptive field and the effective stride length of the architectures are reported in Section 7.5. The overall ranking of the architectures is presented in Section 7.6. Section 7.7 reports on the performance results of the ENA and SKAAI Net (the two ensemble classifiers described in Section 5.5). The data coverage versus confidence threshold graphs associated with SKAAI Net are presented in Section 7.8.
Figure 5 reports the MPCA versus the computational complexity of the architectures in Table 1 for a single forward pass (measured in floating point operations, or FLOPs). The size of the markers in Figure 5 represents the model complexity of the architectures (measured in the number of trainable parameters).

The model with the highest MPCA (72.98%) is CLARAN (i.e. VGG16) (Wu et al. 2019; Simonyan & Zisserman 2015). The best performing models from the existing literature are ConvNet8 (Lukic et al. 2019a) (71.7%), Radio Galaxy Zoo (Lukic et al. 2018) (69.87%) and Toothless (Aniyan & Thorat 2017) (68.92%). The novel classifier produced for this paper, ConvXpress, has the second highest MPCA (71.74%).

Using classifiers from the general computer vision literature that perform well on other datasets as a springboard for architecture development in radio astronomy could potentially save significant computational time. This is evident from Toothless, which was derived from AlexNet and is the fifth best classifier in this study, even though it was the first CNN implemented specifically for radio astronomy.

A very weak correlation exists between the logarithm of the FLOP count and MPCA, with a Pearson correlation coefficient of 0.39 (Freedman et al. 2007). An increase in computational complexity will therefore not necessarily translate into a proportional increase in recognition performance. This is corroborated by the following examples: ATLAS and Radio Galaxy Zoo require fewer FLOPs than Lukic et al. (2019a)'s SimpleNet and ConvNet4, whilst obtaining a better MPCA than the latter two architectures.

The logarithm of the number of parameters and MPCA are even more weakly correlated than the logarithm of the FLOP count and MPCA, with a Pearson correlation coefficient of 0.35. Large models (i.e. those with a higher trainable parameter count) often outperform smaller models in terms of recognition performance, but notable exceptions exist: ConvXpress and Radio Galaxy Zoo have low parameter counts but high MPCA scores.

Figure 6 reports the F1-score of each class, sorted by architecture performance. The F1-score encapsulates recall and precision into a single metric. More general metrics, such as MPCA and overall accuracy, can be misleading, since a model might score high in either by doing exceptionally well in one class whilst underperforming in another. Classifiers should in general not be evaluated using a single metric; however, a single metric is sometimes necessary, as it can convey information in a concise and succinct manner (the sketch below illustrates how MPCA and the per class F1-scores follow from a confusion matrix).
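To make these metrics concrete, the following minimal Python sketch (not the paper's analysis code) computes MPCA and per class F1-scores from a confusion matrix, using the mean SKAAI Net counts from Figure 10; the small difference from the reported 73.85% MPCA arises because the paper averages the MPCA over runs.

# Minimal sketch: MPCA and per-class F1 from a confusion matrix whose
# rows are true labels and whose columns are predictions. The counts
# are the mean SKAAI Net values from Figure 10.
import numpy as np

classes = ["Compact", "FRI", "FRII", "Bent"]
C = np.array([[4979,  473,  169,   70],
              [ 501, 3261,  350,  576],
              [  62,  181, 1170,  307],
              [   9,   60,   92,  377]])

recall = C.diagonal() / C.sum(axis=1)      # per-class accuracy
precision = C.diagonal() / C.sum(axis=0)
f1 = 2 * precision * recall / (precision + recall)
mpca = 100 * recall.mean()                 # mean per class accuracy

print(f"MPCA = {mpca:.2f}%")               # ~73.8%, close to the reported 73.85%
for name, score in zip(classes, f1):
    print(f"F1({name}) = {score:.3f}")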
Figure 7 reports inference time versus theoretical GPU memory usage (at a batch size of 32). As memory usage increases, inference time increases dramatically. The standard deviation associated with the inference time is also depicted in Figure 7.
Classification speed is the number of images an architecture can classify per second at a certain batch size; a batch size of 32 has been used for this experiment. The classification speed of the different architectures is obtained from the inference times reported in Figure 7 (measured over 3072 images at batch size 32; see Table A1). Figure 8 reports the MPCA versus the classification speed of the different architectures. It is evident from Figure 8 that computationally efficient models generally have faster classification speeds. A trade-off, therefore, exists between faster classification and higher recognition performance, at least for standard CNN architectures. Moreover, MCRGNet has the fastest classification speed at 1270 images per second, with Radio Galaxy Zoo close behind at 1246 images per second. VGG16 has the slowest classification speed at 291 images per second.

In comparison, the classification speed of an average person is "about 250 images in 5 minutes", or roughly 0.833 images per second (Markoff 2012). The classification task from which this result was obtained is complex (a human classifier had to choose a label from a large number of possibilities), and as such the aforementioned result should be regarded as a lower bound estimate of how fast an average person would be able to classify such images.
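For illustration, a throughput measurement of this kind can be sketched with Keras as follows. This is a minimal sketch rather than the benchmarking harness used in this study; the model file name and input shape are placeholder assumptions.

# Minimal sketch: measuring classification speed (images/second) of a
# Keras model at batch size 32, over 3072 images as in Table A1.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("architecture.h5")        # hypothetical file
images = np.random.rand(3072, 150, 150, 1).astype("float32")  # dummy data; input shape is an assumption

model.predict(images[:32], batch_size=32)  # warm-up pass, excluded from timing

start = time.perf_counter()
model.predict(images, batch_size=32)
elapsed = time.perf_counter() - start

print(f"Inference time: {elapsed:.2f} s "
      f"({len(images) / elapsed:.0f} images per second)")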
Figure 9 shows how receptive field, effective stride length and MPCA relate to one another. The correlation between receptive field and MPCA is very weak (with a Pearson correlation coefficient of 0.431). Effective stride length and MPCA are slightly better correlated (with a Pearson correlation coefficient of 0.436). This is corroborated by Figure 9; for example, ConvNet8 has a small receptive field and effective stride, but performs comparatively well against architectures that have larger receptive fields and strides. In summary, a larger receptive field and effective stride alone are no guarantee of better classifier performance. Using larger strides reduces the number of convolutions applied, which results in faster classification speeds. Applying larger strides in the first layers of an architecture reduces those layers' output sizes, which ultimately decreases the number of FLOPs used by the architecture, since it reduces the input sizes of subsequent layers.
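For completeness, the effective receptive field and effective stride of a stack of convolution and pooling layers follow from a standard recurrence (see e.g. Araujo et al. 2019). Below is a minimal sketch; the example layer stack is hypothetical and is not one of the architectures in Table 1.

# Minimal sketch: effective receptive field and effective stride of a
# stack of convolution/pooling layers, using the standard recurrences.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input to output."""
    rf, jump = 1, 1  # receptive field and cumulative stride ("jump")
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= stride             # strides compound multiplicatively
    return rf, jump

# Hypothetical example: three 3x3 convs (stride 1) with 2x2 max-pools
# (stride 2) in between.
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
rf, stride = receptive_field(stack)
print(f"effective receptive field = {rf}, effective stride = {stride}")  # 18, 4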
All the architectures listed in Table 1 have been ranked in Table 9 according to their recognition performance (given as the classifier ranking) and their computational performance (given as the computational ranking).
Figure 5. Average accuracy vs computational complexity vs model complexity: the different CNN architectures are compared based on recognition performance (MPCA on the 𝑦-axis), computational complexity (FLOPs on the 𝑥-axis) and model complexity (the number of trainable parameters, given as the circle size). A weak correlation is present between the logarithm of the computational complexity and recognition performance (Pearson correlation coefficient of 0.39).

Figure 6. F1-score per class (Compact, FRI, FRII and Bent panels) for all architectures. ENA represents an ensemble classifier of all architectures, while SKAAI Net (SKN) is an ensemble of the top 4 classifiers.

Figure 7. Inference time versus GPU memory usage.

Figure 8. MPCA versus images per second: as classification speed increases, recognition performance decreases.

An overall rank is calculated based on the sum of these two rankings. Please note that this ranking is relative and limited to the architectures within this study (and the datasets used); it cannot be seen as an absolute reflection of architecture standing. The rankings are calculated using a round-robin "tournament" in which each architecture is compared to every other architecture (excluding itself) in several different categories. If an architecture achieves a higher or lower score (which depends on the metric of the category under consideration) than a "competing" architecture does in a specific category, then the former architecture's ranking is incremented by 𝑗, while the latter architecture's ranking is decremented by 𝑘. As alluded to before, this comparison is repeated for every category and every architecture pair. A higher category score is better in the case of the recognition performance metric categories, while a lower category score is better in the case of the computational requirement metric categories. To establish the classifier ranking, the MPCA and the per class F1-score of the different architectures are compared with one another (i.e. a total of 5 categories are considered); for the classifier ranking, 𝑘 = 𝑗 = 1. To establish the computational ranking, the memory usage, the FLOP count and the inference time of the architectures are compared with one another (i.e. a total of 3 categories are considered); for the computational ranking, we also decided upon using 𝑘 = 𝑗 = 1.
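The sketch below illustrates the tournament scoring under the assumption 𝑗 = 𝑘 = 1 (consistent with Table 9, where the best computational performer scores exactly +36 across 3 categories and 12 opponents); the example metric values are illustrative only.

# Minimal sketch of the round-robin "tournament" scoring, assuming
# j = k = 1. `scores` maps each architecture key to a tuple of
# per-category metric values; `higher_is_better` gives the direction
# of each category.
from itertools import combinations

def tournament_rank(scores, higher_is_better, j=1, k=1):
    ranking = {arch: 0 for arch in scores}
    for c, higher in enumerate(higher_is_better):
        for a, b in combinations(scores, 2):
            va, vb = scores[a][c], scores[b][c]
            if va == vb:
                continue  # a tie leaves both rankings unchanged
            winner, loser = (a, b) if (va > vb) == higher else (b, a)
            ranking[winner] += j
            ranking[loser] -= k
    return ranking

# Illustrative example with two recognition categories (higher is better).
scores = {"CXP": (71.75, 0.74), "VGG": (72.98, 0.73), "ALN": (63.15, 0.67)}
print(tournament_rank(scores, higher_is_better=[True, True]))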
Figure 9. MPCA versus receptive field size versus effective stride length. In general, as the receptive field and effective stride length increase, so does MPCA; however, care should be taken when designing architectures based only on receptive field size, since a larger receptive field or effective stride length does not always translate into a higher accuracy. Receptive field and effective stride length could, however, serve as useful metrics to better understand the performance of a particular CNN.
Table 9. Classifier rankings: the proposed ranking system is based upon recognition performance (classification ranking) and computational performance (computational ranking). ConvXpress, MCRGNet and Radio Galaxy Zoo rank in the top three, each showing a different balance between computational performance and recognition performance. The rankings are calculated using a round-robin "tournament" in which each architecture is compared to every other classifier (excluding itself) in each category. MPCA and the F1-score for each class are used for the classification ranking, while memory usage, FLOP count and inference time are used for the computational ranking. The overall rank is the sum of the two rankings.

Key    Classification Ranking    Computational Ranking    Overall Rank
CXP            46                        6                     52
MCRG           14                       36                     50
RGZ            20                       30                     50
CN8            54                      -30                     24
VGG            50                      -36                     14
TLS             6                      -14                     -8
ATL           -18                        8                    -10
CNs           -10                      -10                    -20
H             -42                       20                    -22
1stC           -4                      -18                    -22
ALN           -38                       12                    -26
FR-D          -48                       10                    -38
CN4           -30                      -14                    -44
The recognition performance of SKAAI Net and ENA is given in Figures 4 and 6 (see Section 5.5). Comparing either ensemble's performance with any individual classifier's performance, the ensemble methods outperform individual classifiers in terms of accuracy and outperform most in F1-score. The top 4 ensemble (SKAAI Net) performs the best when we consider the MPCA metric (73.85%) and scores the highest F1-score in two classes, while being second in the FRI and Bent classes (indicating that, even though MPCA has its shortcomings, it remains a helpful keystone metric for evaluating architecture performance). The ensemble of all the models (ENA), on the other hand, has an MPCA of 72.08%, just slightly below the highest MPCA of the single classifiers (CLARAN, 72.98%), and performs well in F1-score.

SKAAI Net's confusion matrix is depicted in Figure 10; we only provide SKAAI Net's confusion matrix here as it outperforms ENA. The main diagonal of the matrix shows the number of correctly classified images, with the columns indicating the classifier's prediction and the rows the actual label of each image. The percentages are the normalized values for each class. Percentage-wise, the most common misclassifications are FRIIs that are labelled as bent-tails (17.85%), but in absolute terms more FRIs are misclassified as bent-tails on average (576).

Overall, SKAAI Net provides an ensemble classifier that reduces classifier-specific shortcomings with regard to recognition performance. It should be noted that these ensemble methods require a significant amount of computational resources to run (as they consist of more than one model), resulting in a much slower classification speed. ENA in particular has an exceptionally large computational footprint, as it is made up of all the classifiers in this study, which makes it infeasible for deployment in production (it requires ~11.8 GB of GPU memory). Whether the marginal gains SKAAI Net and ENA make in MPCA and F1-score are worth the significant increase in computational requirements needed to achieve those gains is, therefore, highly debatable.
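As an illustration of how such an ensemble operates, the minimal sketch below averages the softmax outputs of several trained members (simple soft voting; the precise combination rule used by SKAAI Net and ENA is the one described in Section 5.5, and the model file names are placeholders).

# Minimal sketch of a soft-voting ensemble: average each member's
# class probabilities, then take the argmax. Member files are
# hypothetical placeholders for the trained top-4 classifiers.
import numpy as np
import tensorflow as tf

member_files = ["cxp.h5", "vgg.h5", "cn8.h5", "rgz.h5"]
members = [tf.keras.models.load_model(f) for f in member_files]

def ensemble_predict(images):
    """Average the members' softmax outputs and return class indices."""
    probs = np.mean([m.predict(images) for m in members], axis=0)
    return probs.argmax(axis=1)  # predicted class index per image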
Figure 11 reports the percentage dataset coverage per class and the F1-score per class of SKAAI Net at different confidence thresholds. It also reports the percentage coverage for the entire dataset and the MPCA. Compact sources have the highest F1-score overall and the smallest decrease in coverage. Bent-tails see the largest increase in F1-score. FRII coverage drops at a faster rate than that of the other classes.

Knowing the data coverage behaviour of a classifier is important. It allows an estimate of the resources that would be required if the classifier were to be incorporated with subject-matter experts into a classification pipeline, i.e. on average how many images would be rejected by the classification system at a specific certainty threshold, which in turn helps estimate the number of man hours required to manually classify those rejected sources. Conversely, if the human resource availability is known, the certainty threshold can be adapted (a sketch of this coverage computation is given below).
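A minimal sketch of the coverage computation, assuming the ensemble's per-image softmax outputs are available as an array:

# Minimal sketch: at a given confidence threshold, "coverage" is the
# fraction of images whose top predicted probability reaches the
# threshold; the remainder would be passed to subject-matter experts.
import numpy as np

def coverage_at(probs, threshold):
    """Fraction of images the classifier keeps at this threshold."""
    confidence = probs.max(axis=1)  # top class probability per image
    kept = confidence >= threshold
    return kept.mean(), kept        # coverage, plus a mask of kept images

# Example with random placeholder probabilities (1000 images, 4 classes).
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=1000)
for t in (0.5, 0.7, 0.9):
    cov, _ = coverage_at(probs, t)
    print(f"threshold {t:.1f}: coverage {cov:.1%}")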
Figure 10. Confusion matrix of the SKAAI Net ensemble, averaged over three runs. Rows give the true label and columns the predicted label; each cell gives the normalized percentage for that class alongside the number of sources classified.

True \ Predicted    Compact                  FRI                       FRII                      Bent
Compact             (87.48±1.01)% 4979±57    (8.31±1.46)% 473±82       (2.98±1.43)% 169±81       (1.23±0.29)% 70±16
FRI                 (10.7±1.14)% 501±53      (69.55±4.23)% 3261±198    (7.46±2.18)% 350±102      (12.29±2.22)% 576±104
FRII                (3.64±1.0)% 62±17        (10.53±3.44)% 181±59      (67.98±5.92)% 1170±101    (17.85±2.36)% 307±40
Bent                (1.67±0.69)% 9±3         (11.19±3.51)% 60±18       (17.13±4.38)% 92±23       (70.01±2.05)% 377±11

Two experiments were performed in this study. The first experiment assessed overfitting on the MLRG dataset, which has large intersections with the training sets used in most studies in Table 1 (see Section 5 and Section 6). The second experiment analysed the computational cost of existing CNN architectures used for radio galaxy morphological classification (see Section 7). The results from these experiments suggest that, when evaluating an architecture's performance, careful attention should be paid to the size of the training set being used; otherwise the results obtained may not be a true reflection of architecture performance (see Figure 3 and Figure 4). Furthermore, there exists a trade-off between recognition performance and computational cost. These two factors should be carefully weighed against each other when deciding on which architecture to use in production. In addition to recognition performance metrics like MPCA and F1-score, one should also consider computational cost metrics like memory usage, floating point operations and classification speed. From all of these metrics a "best" architecture can be chosen for deployment, based on the computational resources that are available.

There are also a few other minor conclusions that can be drawn from the results obtained from the experiments we conducted in this paper:
Architecture
A few design choices for CNN architectures can speed up model performance while driving down resource costs. Stacking several convolutional layers with small kernels, without a pooling layer in between, covers the same receptive field as a single layer with a larger kernel, with the added bonus that more non-linearity is introduced (through additional ReLU activations) while driving down the number of parameters. This was originally used within the VGG architecture family developed by Simonyan & Zisserman (2015), and several of the architectures in Table 1 build on these design decisions, specifically ConvNet8 (Lukic et al. 2019b); a sketch comparing the two options is given after the next paragraph.

ConvXpress
ConvXpress utilizes the aforementioned stacking strategy. It also uses a non-standard stride length, which indirectly reduces its computational cost. This architecture performs well when compared to the other architectures in Table 1 (see Table 9).
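As an illustration of the stacking argument, the minimal Keras sketch below compares the parameter counts of a single 5×5 convolution and two stacked 3×3 convolutions over the same hypothetical 32-channel input; both configurations cover a 5×5 receptive field, but the stack is cheaper and adds an extra ReLU.

# Minimal sketch: two stacked 3x3 convolutions cover the same 5x5
# receptive field as one 5x5 convolution, but with fewer parameters.
# The input shape and channel counts are illustrative assumptions.
import tensorflow as tf

def param_count(layers):
    model = tf.keras.Sequential([tf.keras.Input(shape=(64, 64, 32))] + layers)
    return model.count_params()

single = [tf.keras.layers.Conv2D(32, 5, activation="relu", padding="same")]
stacked = [tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
           tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same")]

print("one 5x5 conv: ", param_count(single))   # 32*32*5*5 + 32 = 25632
print("two 3x3 convs:", param_count(stacked))  # 2*(32*32*3*3 + 32) = 18496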
Parameters
Model complexity is only weakly correlated with recognition performance; an increase in parameters is likely, though not guaranteed, to translate into an increase in recognition performance. Increasing recognition performance by simply utilizing more and more trainable parameters is discouraged, as it is a strategy that can lead to overfitting (and it does not scale well). Furthermore, an increase in trainable parameters will increase computational complexity, training time and GPU memory usage. Overall, deep learning models are viewed as inefficient in exploiting their full learning power (Muhammed et al. 2017), given the large number of parameters they require relative to other machine learning approaches.
FLOPs
Computational complexity (given as FLOPs) and recognition performance (approximated as MPCA) are weakly correlated (see Figure 5). Utilizing more computational resources is, therefore, likely to result in at least marginal increases in recognition performance. As hinted at previously, however, increasing recognition performance by simply utilizing more and more computational resources is frowned upon, as it is a strategy that does not scale well.
Classification Speed
Generally, models with a higher MPCA classify more slowly than those with a lower MPCA (see Figure 8). The trade-off between classification speed and recognition performance is evident, as model MPCA decreases with an increase in images classified per second.
Receptive field and Stride length
A large receptive field and stride length do not guarantee good recognition performance. These two metrics can, however, help explain the performance of a particular architecture.
Ranking
CNNs can be ranked according to their recognition performance results and the computational resources that they require; the ranking we obtained by doing just this is presented in Table 9. It is, however, important to realize that this study is not exhaustive enough to provide an absolute ranking of the architectures in Table 1. A more extensive study that considers all possible combinations of hyperparameters would be required to achieve that goal; such a study would, however, be computationally infeasible. This study does, nevertheless, provide a useful pragmatic ranking, as the hyperparameters were chosen in accordance with accepted guidelines.
Ensemble
While the ensemble methods do produce better results, the significant increase in computational requirements associated with using them is not proportional to the gain in recognition performance that they offer. An option left unexplored in this study is the creation of either a tree classifier (as was done in the original MCRGNet study (Ma et al. 2019)) or a fusion classifier with a voting scheme (as used by Toothless (Aniyan & Thorat 2017)). Both of these approaches require the training of several models that specialize in the classification of only two classes. A majority vote for a single class indicates high confidence in the prediction, while a mixed vote indicates uncertainty (such a source would be marked for inspection by a subject-matter expert); a sketch of such a voting scheme is given below.
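The minimal sketch below illustrates the flagging logic of such a voting scheme; the strict-majority rule used to identify mixed votes is an assumption made for illustration.

# Minimal sketch of a fusion voting scheme over binary (one-vs-one)
# classifiers: a strict majority yields a confident label, while a
# mixed vote flags the source for a subject-matter expert.
from collections import Counter

def fuse_votes(pairwise_predictions):
    """pairwise_predictions: one class label per binary model.
    Returns the winning label, or None for a mixed vote."""
    votes = Counter(pairwise_predictions)
    label, count = votes.most_common(1)[0]
    if count > len(pairwise_predictions) / 2:  # strict majority: confident
        return label
    return None  # mixed vote: mark the source for expert inspection

print(fuse_votes(["FRI", "FRI", "Bent"]))   # FRI (majority)
print(fuse_votes(["FRI", "FRII", "Bent"]))  # None (mixed -> expert)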
Coverage
Data coverage analysis results can be used to integrate subject-matter experts into a classification pipeline (or, at the very least, to add the ability to flag sources that the classifier is uncertain of) (see Section 7.8). Which confidence threshold is best suited to this endeavour is not explored here, since it depends on the availability of subject-matter experts and of computational resources.

Figure 11. SKAAI Net confidence threshold: dataset coverage versus recognition performance.

For an in-depth comparison of the more recent architectures that are being used for image recognition, please refer to the work done by Muhammed et al. (2017). Training CNNs for the purpose of image classification, and specifically radio galaxy classification, has become a relatively easy task to set up given the computational resources available at present. But as Jitendra Malik (Arthur J. Chick Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley), one of the seminal figures in computer vision, stated: "There are many problems in [computer] vision where getting 50 percent of the solution you can get in one minute, getting to 90 percent can take you a day, getting to 99 percent may take you five years and 99.99 percent may not happen in your lifetime" (Fridman & Malik 2020).

The lack of large sets of annotated training data remains one of the greatest challenges in assessing and improving the general recognition performance of CNNs (and of all image classification algorithms). In addition, the computational resources a model requires are important to consider when selecting an architecture for deployment. The framework and experiments laid out in this study will hopefully help shape the future of image recognition development.
ACKNOWLEDGEMENTS
DATA AVAILABILITY STATEMENT
The data underlying this article will be shared on reasonable request to the corresponding author.
REFERENCES
Alger M. J., et al., 2018, MNRAS, 478, 5547
Alhassan W., Taylor A. R., Vaccari M., 2018, MNRAS, 480, 2085
Aniyan A. K., Thorat K., 2017, ApJS, 230, 20
Araujo A., Norris W., Sim J., 2019, Distill, 4, e21
Baldi R. D., Capetti A., Massaro F., 2018, A&A, 609, A1
Banfield J. K., et al., 2015, MNRAS, 453, 2326
Becker R. H., White R. L., Helfand D. J., 1995, ApJ, 450, 559
Best P. N., Heckman T. M., 2012, MNRAS, 421, 1569
Braun R., Bourke T., Green J. A., Keane E., Wagg J., 2015, in Advancing Astrophysics with the Square Kilometre Array (AASKA14). p. 174
Braun R., Bonaldi A., Bourke T., Keane E., Wagg J., 2019, arXiv e-prints, p. arXiv:1912.12699
Capetti A., Massaro F., Baldi R. D., 2017a, A&A, 598, A49
Capetti A., Massaro F., Baldi R. D., 2017b, A&A, 601, A81
Cheung C. C., 2007, AJ, 133, 2097
Chollet F., et al., 2015, Keras, https://keras.io
Cireşan D. C., Meier U., Gambardella L. M., Schmidhuber J., 2010, Neural Comp., 22, 3207
Cotton W. D., et al., 2020, MNRAS, 495, 1271
Deng J., Dong W., Socher R., Li L.-J., Li K., Fei-Fei L., 2009, in IEEE Conference on Computer Vision and Pattern Recognition. ImageNet: A large-scale hierarchical image database, Miami, Florida, p. 248
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Ekers R. D., Fanti R., Lari C., Parma P., 1978, Nature, 276, 588
Elmegreen D. M., Elmegreen B. G., 1987, ApJ, 314, 3
Fanaroff B. L., Riley J. M., 1974, MNRAS, 167, 31P
Freedman D., Pisani R., Purves R., 2007, Statistics. W. W. Norton & Company, New York
Fridman L., Malik J., 2020, 110 – Jitendra Malik: Computer Vision
Fukushima K., 1980, Biol. Cybernetics, 36, 193
Garofalo D., Singh C. B., 2019, ApJ, 871, 259
Gendre M. A., Wall J. V., 2008, MNRAS, 390, 819
Gendre M. A., Best P. N., Wall J. V., 2010, MNRAS, 404, 1719
Gheller C., Vazza F., Bonafede A., 2018, MNRAS, 480, 3749
Gopal-Krishna, Wiita P. J., 2000, A&A, 363, 507
Harwood J. J., Vernstrom T., Stroe A., 2020, MNRAS, 491, 803
Hine R. G., Longair M. S., 1979, MNRAS, 188, 111
Hosenie Z. B., 2018, Master's thesis, North-West Univ., Potchefstroom
Hubble E. P., 1926, ApJ, 64, 321
Hubel D. H., Wiesel T. N., 1968, The J. of Phys., 195, 215
Kelley H. J., 1960, ARS J., 30, 947
Kozieł-Wierzbowska D., Goyal A., Żywucka N., 2020, ApJS, 247, 53
Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Advances in Neural Information Processing Systems. ImageNet Classification with Deep Convolutional Neural Networks, Lake Tahoe, Nevada, p. 1097
Lacy M., et al., 2020, PASP, 132, 035001
Laing R. A., Jenkins C. R., Wall J. V., Unger S. W., 1994, in Bicknell G. V., Dopita M. A., Quinn P. J., eds, ASP Conf. Ser. Vol. 54, The First Stromlo Symposium: The Physics of Active Galaxies. Astron. Soc. Pac., San Francisco, p. 201
LeCun Y., Bottou L., Bengio Y., Haffner P., 1998, Proc. of the IEEE, 86, 2278
Leahy J. P., Williams A. G., 1984, MNRAS, 210, 929
Lintott C. J., et al., 2008, MNRAS, 389, 1179
Lukic V., Brüggen M., Banfield J. K., Wong O. I., Rudnick L., Norris R. P., Simmons B., 2018, MNRAS, 476, 246
Lukic V., de Gasperin F., Brüggen M., 2019a, Galaxies, 8, 3
Lukic V., Brüggen M., Mingo B., Croston J. H., Kasieczka G., Best P. N., 2019b, MNRAS, 487, 1729
Luo W., Li Y., Urtasun R., Zemel R., 2016, in Advances in Neural Information Processing Systems. Understanding the effective receptive field in deep Convolutional Neural Networks, Barcelona, Spain, p. 4898
Ma Z., et al., 2019, ApJS, 240, 34
Marcos D., Volpi M., Tuia D., 2016, in 23rd International Conference on Pattern Recognition. Learning rotation invariant convolutional filters for texture classification, Cancun, Mexico, p. 2012
Markoff J., 2012, Seeking a Better Way to Find Web Images, The New York Times
McGlynn T., Scollick K., White N., 1998, in McLean B. J., Golombek D. A., Hayes J. J., Payne H. E., eds, IAU Symp. Vol. 179, New Horizons from Multi-Wavelength Sky Surveys. Kluwer, Dordrecht, p. 465
Mingo B., et al., 2019, MNRAS, 488, 2701
Miraghaei H., Best P. N., 2017, MNRAS, 466, 4346
Missaglia V., Massaro F., Capetti A., Paolillo M., Kraft R. P., Baldi R. D., Paggi A., 2019, A&A, 626, A8
Muhammed M. A. E., Ahmed A. A., Khalid T. A., 2017, in 2017 International Conference On Smart Technologies For Smart Nation (SmartTechCon). pp 902–907, doi:10.1109/SmartTechCon.2017.8358502
Neelakantan A., Vilnis L., Le Q. V., Sutskever I., Kaiser L., Kurach K., Martens J., 2015, preprint (arXiv:1511.06807)
Norris R. P., et al., 2011, Publ. Astron. Soc. Australia, 28, 215
Norris R. P., et al., 2013, Publ. Astron. Soc. Australia, 30, e020
Ocran E. F., Taylor A. R., Vaccari M., Ishwara-Chandra C. H., Prandoni I., 2020, MNRAS, 491, 1127
Owen F. N., Ledlow M. J., 1994, in Bicknell G. V., Dopita M. A., Quinn P. J., eds, ASP Conf. Ser. Vol. 54, The First Stromlo Symposium: The Physics of Active Galaxies. Astron. Soc. Pac., San Francisco, p. 319
Owen F. N., Rudnick L., 1976, ApJ, 205, L1
Pracy M. B., et al., 2016, MNRAS, 460, 2
Prescott M., et al., 2018, MNRAS, 480, 707
Proctor D. D., 2011, ApJS, 194, 31
Roberts D. H., Saripalli L., Wang K. X., Sathyanarayana Rao M., Subrahmanyan R., KleinStern C. C., Morii-Sciolla C. Y., Simpson L., 2018, ApJ, 852, 47
Rudnick L., Owen F. N., 1977, AJ, 82, 1
Sabour S., Frosst N., Hinton G. E., 2017, in Proceedings of the 31st International Conference on Neural Information Processing Systems. p. 3856
Sadler E. M., Ekers R. D., Mahony E. K., Mauch T., Murphy T., 2014, MNRAS, 438, 796
Sadr A. V., Vos E. E., Bassett B. A., Hosenie Z., Oozeer N., Lochner M., 2019, MNRAS, 484, 2793
Sandage A., 1961, The Hubble Atlas of Galaxies
Simonyan K., Zisserman A., 2015, in International Conference on Learning Representations
Smith M. D., Donohoe J., 2019, MNRAS, 490, 1363
Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., 2014, J. Mach. Learn. Res., 15, 1929–1958
Tang H., Scaife A. M. M., Leahy J. P., 2019, MNRAS, 488, 3358
Whittam I. H., Green D. A., Jarvis M. J., Riley J. M., 2020, MNRAS, 493, 2841
Willett K. W., et al., 2013, MNRAS, 435, 2835
Wu C., et al., 2019, MNRAS, 482, 1211
de Vaucouleurs G., 1959, Handbuch der Physik, 53, 275

APPENDIX A: TABLE OF RESULTS
Table A1. Results from the experiments discussed in Section 5, listing each architecture's name and the key given in Table 1 for use in the figures. † denotes that the architecture has been modified due to computational constraints or re-purposed for classification. ‡ the inference time and images classified per second are for the classification of 3072 images at batch size 32.

Architecture Name   Key   FLOPs        Conv. FLOPs  FC FLOPs   Trainable Params  Inference Time‡ (s)  St. Dev. (s)  Images/s‡  Eff. Receptive Field  Eff. Stride  Eff. Padding  GPU Memory (MB)
AlexNet             ALN   1107736302   1074169584   33566718   37302980          2.72816              0.24097       1126       195                   32           0             280.576
ATLAS X-ID†         ATL   546585022    545304800    1280222    1385988           2.90696              0.36538       1056       64                    10           24            411.648
CLARAN (VGG16D)†    VGG   27231525758  27044866944  186658814  201384644         10.52742             0.33702       291        212                   32           90            4061.184
ConvNet4            CN4   1036529876   870640128    165889748  165910168         3.51457              0.22828       874        24                    4            12            897.024
ConvNet8            CN8   4612528724   4571054976   41473748   42646184          4.73292              0.24774       649        76                    16           30            1844.224
ConvXpress          CXP   764997460    762947712    2049748    3415944           2.9279               0.21564       1049       333                   64           136           393.216
FIRST Classifier    1stC  1128841236   1077316875   51524361   51655412          3.29737              0.26081       931        22                    8            7             1715.2
FR-Deep             FR-D  141486958    141023472    463486     479996            2.88221              0.29405       1065       84                    30           27            412.672
Hosenie             H     109649469    109413751    235718     261239            2.76616              0.2965        1110       94                    18           38            315.392
MCRGNet†            MCRG  8406674      8201652      205022     213916            2.41828              0.27093       1270       63                    32           20            63.488
Radio Galaxy Zoo    RGZ   75825974     70580024     5245950    5283444           2.46538              0.31239       1246       74                    26           8             84.992
SimpleNet           CNs   797660134    797639400    20734      37460             3.69593              0.2417        831        49                    16           11            573.44
Toothless†          TLS   1634645742   1566476016   68169726   71906180          2.94229              0.21572       1044       195                   32           64            782.336
Table A2. Results from the recognition performance experiments discussed in Section 5, listing each architecture's name and the key given in Table 1 for use in the figures.

Architecture Name   Key   MPCA   Precision (Compact/FRI/FRII/Bent)    Recall (Compact/FRI/FRII/Bent)       F1-Score (Compact/FRI/FRII/Bent)
AlexNet             ALN   63.15  0.837 / 0.7837 / 0.5022 / 0.2509     0.8634 / 0.5872 / 0.6785 / 0.397     0.8488 / 0.6698 / 0.5734 / 0.3007
ATLAS X-ID†         ATL   65.34  0.8782 / 0.7784 / 0.5636 / 0.2069    0.853 / 0.6491 / 0.5341 / 0.5776    0.8639 / 0.7047 / 0.5461 / 0.3043
CLARAN (VGG16D)†    VGG   72.98  0.893 / 0.8225 / 0.636 / 0.2658      0.8661 / 0.6621 / 0.6775 / 0.7137   0.879 / 0.729 / 0.6531 / 0.3861
ConvNet4            CN4   62.84  0.8678 / 0.7586 / 0.5639 / 0.2025    0.8401 / 0.6662 / 0.6336 / 0.3735   0.852 / 0.707 / 0.5913 / 0.2576
ConvNet8            CN8   71.71  0.8925 / 0.7923 / 0.6665 / 0.3065    0.8687 / 0.7242 / 0.644 / 0.6314    0.8798 / 0.756 / 0.6475 / 0.4126
ConvXpress          CXP   71.75  0.8983 / 0.7849 / 0.6527 / 0.2878    0.858 / 0.7065 / 0.6405 / 0.6648    0.8767 / 0.7433 / 0.6405 / 0.4017
FIRST Classifier    1stC  65.21  0.8697 / 0.7408 / 0.585 / 0.2698     0.8279 / 0.6937 / 0.6341 / 0.4527   0.8479 / 0.713 / 0.6056 / 0.3348
FR-Deep             FR-D  59.36  0.866 / 0.7451 / 0.5282 / 0.1552     0.8433 / 0.6243 / 0.2199 / 0.6871   0.8532 / 0.6773 / 0.2956 / 0.2527
Hosenie             H     58.5   0.8957 / 0.7014 / 0.3894 / 0.1755    0.8216 / 0.6837 / 0.2238 / 0.611    0.857 / 0.6923 / 0.2809 / 0.2717
MCRGNet†            MCRG  68.09  0.8686 / 0.7809 / 0.6367 / 0.2281    0.8658 / 0.6499 / 0.59 / 0.6178     0.867 / 0.709 / 0.6124 / 0.333
Radio Galaxy Zoo    RGZ   69.87  0.886 / 0.8102 / 0.633 / 0.2144      0.8662 / 0.6292 / 0.6425 / 0.6568   0.876 / 0.7076 / 0.633 / 0.3229
SimpleNet           CNs   65.91  0.8763 / 0.7776 / 0.5758 / 0.2096    0.8593 / 0.6446 / 0.5563 / 0.5764   0.8674 / 0.7037 / 0.5657 / 0.3065
Toothless†          TLS   68.92  0.8973 / 0.785 / 0.614 / 0.2083      0.8189 / 0.6542 / 0.6518 / 0.632    0.8548 / 0.7085 / 0.6315 / 0.3129

This paper has been typeset from a TEX/LATEX file prepared by the author.