MNRAS, 1–20 (2020). Preprint 9 February 2021. Compiled using MNRAS LaTeX style file v3.0
CNN Architecture Comparison for Radio Galaxy Classification
Burger Becker★, Mattia Vaccari, Matthew Prescott, Trienko Grobler
Computer Science Department, Stellenbosch University, Stellenbosch, South Africa
Inter-University Institute for Data Intensive Astronomy (IDIA) and Department of Physics and Astronomy, University of the Western Cape, Robert Sobukwe Road, 7535 Bellville, Cape Town, South Africa
★ Contact e-mail: [email protected]
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
The morphological classification of radio sources is important to gain a full understanding of galaxy evolution processes and their relation with local environmental properties. Furthermore, the complex nature of the problem, its appeal for citizen scientists and the large data rates generated by existing and upcoming radio telescopes combine to make the morphological classification of radio sources an ideal test case for the application of machine learning techniques. One approach that has shown great promise recently is Convolutional Neural Networks (CNNs). The literature, however, lacks two key elements when it comes to CNNs and radio galaxy morphological classification. Firstly, a proper analysis of whether overfitting occurs when training CNNs to perform radio galaxy morphological classification using a small curated training set is needed. Secondly, a good comparative study regarding the practical applicability of the CNN architectures in the literature is required. Both of these shortcomings are addressed in this paper. Multiple performance metrics are used for the latter comparative study, such as inference time, model complexity, computational complexity and mean per class accuracy. As part of this study we also investigate the effect that receptive field, stride length and coverage have on recognition performance. For the sake of completeness, we also investigate the recognition performance gains that can be obtained by employing classification ensembles. A ranking system based upon recognition and computational performance is proposed. MCRGNet, Radio Galaxy Zoo and ConvXpress (a novel classifier) are the architectures that best balance computational requirements with recognition performance.
Key words: radio continuum: galaxies – methods: statistical – surveys
Morphological classification is a fundamental aspect of galaxy formation and evolution studies, where the shape of galaxies is intimately connected to the dynamical and physical processes at play. Since the days of Hubble, astronomers have thus been establishing increasingly sophisticated classification schemes to group galaxies in different classes according to their shapes and appearance observed at optical wavelengths (Hubble 1926; de Vaucouleurs 1959; Sandage 1961; Elmegreen & Elmegreen 1987).

According to our current understanding of galaxy formation and evolution, every massive galaxy is believed to contain a supermassive black hole which undergoes periods of accretion throughout cosmic time to produce an Active Galactic Nucleus (AGN). AGN are often detected in radio surveys via their synchrotron emission produced by accelerated electrons in their cores, lobes and jets, and are then referred to as radio-loud AGN.

Morphologically, Fanaroff & Riley (1974) found that radio-loud AGN could be divided into two populations, known as Fanaroff-Riley (FR) types I and II (FRIs and FRIIs), which were found to show a division at approximately L_178MHz = 2 × 10^25 W Hz^-1 sr^-1. Those having bright cores, or "core-brightened" features, and diffuse lobes are labelled as FRIs, and those dominated by "edge-brightened" features far from their cores are known as FRIIs. A clear divide in the radio and optical luminosities between the two morphologies was observed by Owen & Ledlow (1994), indicating that they have formed and evolved in different ways. The different morphological types are thought to be due to different accretion modes. FRIs are believed to be more associated with Low Excitation Radio Galaxies (LERGs): passive galaxies undergoing inefficient accretion of hot gas, with an absence of emission lines in their optical spectra. FRIIs are more associated with High Excitation Radio Galaxies (HERGs): galaxies undergoing rapid and efficient accretion of a cold gas supply, as indicated via the presence of emission lines in their spectra (Hine & Longair 1979; Laing et al. 1994; Best & Heckman 2012; Pracy et al. 2016). While multi-wavelength observations are essential to better pinpoint the centre of radio galaxies and study their physical properties (Prescott et al. 2018; Ocran et al. 2020; Kozieł-Wierzbowska et al. 2020), determining radio morphologies is a very useful starting point toward improving our understanding of radio galaxies.

Radio morphologies can also be used to trace the environments of their host galaxies. Miraghaei & Best (2017) found that, at fixed stellar mass and radio luminosity, FRIs are more likely to be found in richer environments than FRIIs. Bent-tailed radio galaxies such as Narrow-Angle Tailed (NAT; Rudnick & Owen 1977) and Wide-Angle Tailed (WAT; Owen & Rudnick 1976; Missaglia et al. 2019) radio galaxies are associated with clusters of galaxies and represent galaxies with radio jets that are interacting with the hot intra-cluster medium (ICM) that resides there.

New surveys with lower flux limits show that upon closer inspection the FRI/FRII divide becomes less clear, revealing that there is much more overlap in the properties of FRIs and FRIIs than previously thought (Mingo et al. 2019). The FRI/FRII divide is further complicated by radio sources that have been found to exhibit hybrid FRI/FRII morphologies, also known as HyMoRS (Gopal-Krishna & Wiita 2000).
These, however, are likely to be bent FRII sources viewed at a particular orientation, whose lobes appear to have different morphologies due to the observer's line of sight (Smith & Donohoe 2019; Harwood et al. 2020).

In more recent times, the FR classification scheme has been expanded to include radio sources with compact morphologies. These so-called FR0 sources are believed to be the most abundant radio sources in the local Universe (Baldi et al. 2018; Garofalo & Singh 2019). Despite being abundant, little is known about their nature. Whilst some are young AGN that will grow to form FRI and FRII sources, a comparison between their number densities and that of extended sources indicates the majority must be older sources that have failed to form extended structures (Sadler et al. 2014; Baldi et al. 2018). Whittam et al. (2020) show FR0s are a mixed population of HERG and LERG radio sources.

Radio sources with more exotic morphologies have also attracted substantial interest recently. These include X-shaped and S-shaped radio galaxies (Cheung 2007), which may represent AGN that have undergone the process of hydrodynamical backflow (Leahy & Williams 1984) or may be the result of a spin-flip from the coalescence of two black holes (Ekers et al. 1978), with the former scenario being the preferred explanation in some of the latest work (Roberts et al. 2018; Cotton et al. 2020).
Radio astronomy is currently undergoing a rapid development in observational capabilities which is paving the way for the highly anticipated Square Kilometre Array (SKA; Braun et al. 2015, 2019). Before the advent of the SKA, its pathfinders and precursors (Norris et al. 2013) promise to revolutionize our knowledge of the radio sky. Ongoing surveys such as VLASS (Lacy et al. 2020) and EMU (Norris et al. 2011) are expected to detect 5 and 70 million radio sources, respectively, greatly exceeding the roughly 2.5 million radio sources known to date. Historically, scientific analysis used catalogues compiled by either individuals or small teams (Fanaroff & Riley 1974). However, the increasingly large samples of radio sources detected by modern radio telescopes mean that the classification of full catalogues by subject-matter experts is no longer a viable option.

One possible solution is the crowdsourcing of labelling to large groups of volunteers, known as citizen science. The first successful large scale citizen science project in galaxy morphology classification was Galaxy Zoo (Lintott et al. 2008), during which participants were asked to label galaxies observed as part of the Sloan Digital Sky Survey on the basis of their morphology. After initial fears of poor public participation, roughly 100,000 participants made 40 million individual classifications in 175 days. Due to the success of the initial project, Galaxy Zoo grew into the larger Zooniverse project, which serves as an online platform for various crowdsourcing projects. The first citizen science project devoted to radio astronomy was Radio Galaxy Zoo (Banfield et al. 2015), which aimed to classify radio sources observed in the FIRST survey based on their morphology and identify them with their infrared counterparts observed in the WISE survey. Radio Galaxy Zoo demonstrated that citizen scientists can help us to further the scientific exploitation of large radio surveys, in the process creating large samples of visually-inspected labelled sources.

Another solution is to make use of machine learning techniques to aid in the classification task. Convolutional Neural Networks (CNNs) are a popular choice for image recognition problems in both academia and industry. A CNN is a special type of neural network that learns which features are important to extract from images. These learned features are then employed to perform classification. AlexNet was one of the first CNN architectures to achieve human-level performance on the ImageNet Challenge, which involved classifying 1.2 million images into 1000 different classes (Deng et al. 2009; Krizhevsky et al. 2012).

CNNs were popularised as an automated means of morphological classification in astronomy during the Galaxy Zoo Challenge (Willett et al. 2013), hosted on the Kaggle data science platform. Dieleman et al. (2015) developed the CNN that obtained the highest recognition performance in this challenge. Both AlexNet and the study by Dieleman et al. (2015) inspired Toothless, the first CNN developed for the morphological classification of radio galaxies (Aniyan & Thorat 2017). Since then several new CNNs have been developed for morphological classification in radio astronomy, as shown in Table 1.

Constructing a CNN model usually requires a lot of training data. If the dataset that is used for training a CNN is too small, overfitting occurs. In most of the studies in Table 1 training was done on a small dataset.
It is, therefore, imperative to determine whether, in the case of radio galaxy morphological classification, overfitting indeed occurs when a small curated dataset is used for training. Having overfitted is, however, easily correctable by simply using a larger training set. The only real danger occurs when an overfitted model is used to predict the performance of an architecture in a real world setting. In this paper, we conduct an experiment to determine if this issue is in fact something which we should be cognisant of going forward. The experiment we propose makes use of two modified datasets from Ma et al. (2019).

Moreover, most of the studies in Table 1 have a similar overall layout. They first present a novel architecture and then report on the architecture's recognition performance. Furthermore, most of these studies do not analyse inference time or other factors related to computational cost (with the exception of model complexity often being evaluated through the number of trainable parameters; CLARAN's computational cost is reported in Wu et al. 2019). Some of these studies also focus on the effects of layer composition (Lukic et al. 2018). Most critically, none have looked at how computational cost impacts recognition performance. Analysis of the relationships between these metrics can further our understanding of existing architectures and help develop best practices for finding new architectures. In this paper we also address this shortcoming, i.e. we use a wide variety of computational cost metrics to assess whether the architectures in Table 1 are capable of real-time performance, whilst maintaining a high level of recognition performance.

All the architectures, the software used to analyse and test them, and the datasets used in the process are made publicly available (https://github.com/BurgerBecker/rg-benchmarker).

We start our paper by giving a brief overview of CNNs. In Section 3 we list all the architectures we used in our study, which also includes a novel architecture. The datasets we make use of are presented in Section 4 and are structured according to a simplified 4-class morphological classification system used by Alhassan et al. (2018): Compact sources, FRI, FRII and Bent-tailed sources. The experimental setup of our study is presented in Section 5. The results pertaining to the overfitting experiment are presented in Section 6. The results of the computational cost versus recognition performance analysis of the architectures in Table 1 are presented in Section 7 (a pragmatic ranking of the architectures is also established). As part of this performance analysis, we also investigate how other factors, like receptive field, stride length and coverage (see Section 2.2), impact the recognition performance of the architectures in Table 1. For the sake of completeness, we also investigate whether we can achieve recognition performance gains using ensemble classifiers in Section 7 (also see Section 5.5). Findings are then summarized according to subject in the conclusion.
A brief overview of CNNs is presented in Section 2.1. In Section 2.2 we briefly discuss the different metrics we made use of to evaluate the experiments we conducted.
Hubel & Wiesel (1968) found that mammalian visual cortices primarily consist of two types of cells: simple cells (which activate when straight edges have a certain orientation) and complex cells (with a larger receptive field and lower sensitivity to orientation). This inspired the Neocognitron artificial neural network (Fukushima 1980), which combined layers consisting wholly of one of two types of "cells" into a hierarchical model to perform handwritten character recognition. One type of cell would apply a convolutional operation to the input, while the other would downsample the input. The weights for the convolutional operation would be learned from input examples. The first deep CNN (LeCun et al. 1998) had seven layers and was primarily used for handwritten character recognition. At the time, training such deep models was computationally expensive and time consuming. The widespread advent of Graphics Processing Units (GPU) within desktop computers, however, resulted in more people being able to quickly train deep CNNs (Cireşan et al.). CNNs became the de facto standard for image classification after the 2012 ImageNet Challenge was won by a CNN, AlexNet (Krizhevsky et al. 2012).

Neural networks can be visualized as graph-like structures in which nodes are referred to as neurons. Each edge or connection to another neuron has a weight and a bias term, which represents the strength of their connection. The weights and biases affect the propagation of information through the network. With the right combination of weights, the network can reliably match groups of similar input, or classes, to a respective output, or class label. The neurons are arranged in layers, with an input being propagated from the input layer, through intermediate layers (called hidden layers), until it reaches an output layer. Although many different types of neural networks have been developed over time, CNNs have become some of the top performing classifiers for image recognition.

CNNs have three main types of layers: fully connected (or dense) layers, convolutional layers and pooling layers.
The neurons of a fully connected layer are, as the name implies, connected to all the neurons of the previous layer. The output layer is also a fully connected layer, with as many neurons as the number of classes that have been provided. The neuron with the highest output value determines the classification result, with a higher value meaning greater model confidence in the classification.
In convolutional layers, a convolution operation is performed on a small neighbourhood of pixels which then outputs a single value for the neuron in the next layer. This operation is realized using a small matrix containing trainable weights, known as the kernel. The kernel is moved over the image in strides, with a stride length of 1 moving the centre of the kernel one pixel across or down until the entire image has been covered. A commonly used kernel size is 3 × 3, although some architectures use a larger kernel, such as 11 × 11 pixels, early on in the network to reduce input size while the image still contains a high ratio of noise to information (AlexNet and Toothless make use of this). In later convolutional layers, the kernel is moved across the outputs of the previous layer and not the original image, i.e. the outputs of the previous layer become the input pixels of the current convolutional layer.

This many-to-one mapping during convolution causes the output of a convolutional layer to be downsampled (i.e. to have a smaller output dimension than the input). If downsampling happens too suddenly, this can potentially lead to the loss of too much information without it being incorporated into the model. One workaround for this is padding the input with zeros around the edges, to ensure the output shape is the same size as the input shape. This also assumes a stride length of 1, otherwise downsampling will still occur. This is often referred to as same or zero padding, whereas the absence of padding is known as valid padding. Kernel size, stride length and padding type are all examples of hyperparameters of a CNN, each of which could affect recognition performance.

The output of the final convolutional layer is then flattened into a one dimensional vector, which is then input into the first fully connected layer.
Pooling layers are used to deliberately downsample the input, reducing the input size while preserving the salient features we want the network to learn. Max pooling layers perform downsampling by moving a kernel across the input and returning only the maximum pixel value within the kernel. A max pooling layer's kernel size is normally 2 × 2, which halves the input size. A reduction in input size is needed so that the computational requirements of deeper layers can be reduced or kept constant.
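To make the preceding layer descriptions concrete, the following minimal Keras sketch (Keras being the framework used later in this paper) chains convolutional, pooling, flatten and dense layers. The layer sizes, input resolution and padding choices are illustrative assumptions; this is not one of the benchmarked architectures.

```python
from tensorflow.keras import layers, models

# Minimal illustrative CNN; sizes are assumptions for demonstration only.
model = models.Sequential([
    # 3x3 kernel, stride 1, zero ("same") padding: spatial size stays 150x150.
    layers.Conv2D(32, kernel_size=3, strides=1, padding="same",
                  activation="relu", input_shape=(150, 150, 1)),
    layers.MaxPooling2D(pool_size=2),   # 2x2 max pooling: 150x150 -> 75x75
    # "valid" padding (no padding): 75x75 shrinks to 73x73.
    layers.Conv2D(64, kernel_size=3, strides=1, padding="valid",
                  activation="relu"),
    layers.MaxPooling2D(pool_size=2),   # 73x73 -> 36x36
    layers.Flatten(),                   # 36*36*64 values -> one-dimensional vector
    layers.Dense(64, activation="relu"),        # fully connected hidden layer
    layers.Dense(4, activation="softmax"),      # one output neuron per class
])
model.summary()                         # prints output shapes and parameter counts
```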
Finding the right combination of weights is referred to as training in Deep Learning terminology, which is done with gradient descent and a loss (or error) function. The loss function represents the error between the target output (the class label that has been provided) and the network's current output (the predicted label). A single iteration of training takes place by calculating the gradient of the error function with respect to the weights and biases. The weights and biases are then updated based on the learning rate. However, since only the output layer has a clearly defined target output (the class label), the weights and biases of the intermediate neurons (hidden neurons) cannot be updated in isolation. The amount by which the weights and biases of hidden layers need to be updated depends on all the previous and subsequent layers' parameter values, which makes the gradient descent algorithm computationally expensive (especially on deep networks). This is circumvented by the backward propagation of errors (backpropagation) algorithm (Kelley 1960), which calculates the gradient of the final layer's error function and reuses partial computations from previous layers, moving "backwards" from the output layer through the hidden layers to the input layer.
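Continuing the illustrative sketch above: in Keras the loss function and the gradient-descent variant are specified at compile time, and each training step then performs the forward pass, loss evaluation, backpropagation and weight update automatically. The data array names below are placeholders.

```python
# Continuing the illustrative model defined above.
model.compile(
    optimizer="adam",                   # a gradient-descent variant
    loss="categorical_crossentropy",    # error between target and predicted labels
    metrics=["accuracy"],
)
# Each batch in fit() triggers: forward pass -> loss -> backpropagated
# gradients -> weight/bias updates scaled by the learning rate.
# model.fit(train_images, train_labels, epochs=16, batch_size=32,
#           validation_data=(val_images, val_labels))
```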
When a CNN can classify an image irrespective of orientation, it is said to have rotational invariance. CNNs are normally not fully rotationally invariant (Lukic et al. 2019a). Convolutional layers enforce translation equivariance and pooling layers add translation invariance, but both of these usually allow only limited invariance to rotations, normally not more than a few degrees (Marcos et al. 2016). Other Deep Learning architectures are rotationally invariant by design, such as Capsule Networks (Sabour et al. 2017). Lukic et al. (2019a) have compared the performance of conventional CNN architectures and capsule networks when they are both used to perform radio galaxy morphological classification.

F1-Score

A useful tool when reporting the results of classifiers is the confusion matrix. The ij-th entry of a confusion matrix tells you the number of images that were classified as belonging to class j even though they actually belong to class i. In practice, when we depict confusion matrices we often use annotated interpretable labels instead of the aforementioned integer labels. A hypothetical confusion matrix which was obtained after classifying a dataset consisting of radio sources is depicted in Figure 1. This confusion matrix depicts how well our classifier could distinguish FRI sources from non-FRI sources. The class for which a classifier's performance is currently being assessed is known as the positive class (the class currently under consideration). The remaining classes are known as the negative classes. In the case of Figure 1, FRI is our positive class. The depiction in Figure 1 will of course be different if another class becomes the positive class. Furthermore, the confusion matrix in Figure 1 also graphically depicts the definition of the following concepts in the case of a multiclass scenario: True Positives, False Positives, True Negatives and False Negatives. The general definition of the above concepts and examples thereof (from Figure 1) are listed below:

• True Positives (TP): images belonging to the positive class being classified as such (sources that were annotated as FRI and classified as such).
• True Negatives (TN): images from the negative classes that are not classified as belonging to the positive class (sources that were annotated as FRII and correctly classified as such, or sources that were annotated as Bent but incorrectly classified as FRII).
• False Negatives (FN): images belonging to the positive class not classified as such (sources annotated as FRI, but incorrectly classified as FRII).
• False Positives (FP): images belonging to the negative classes that are classified as belonging to the positive class (sources annotated as FRII, but incorrectly classified as FRI).

Recall refers to the ratio of the number of images that were correctly classified as belonging to the positive class to the total number of images in the positive class, i.e.:

recall = TP / (TP + FN)    (1)

Recall is also referred to as the True Positive Rate. Precision is the ratio of images that were correctly classified as belonging to the positive class to the total number of images that were classified as belonging to the positive class, i.e.:

precision = TP / (TP + FP)    (2)

The weighted harmonic mean of recall and precision is known as the F1-score:

F1 = 2 × (precision × recall) / (precision + recall)    (3)

Overall accuracy is the ratio of correct classifications for all classes to the total number of samples tested on.
With respect to the confusion matrix described in Figure 1, this would be calculated as the sum of the main diagonal (the TP of each class) divided by the sum of the entire matrix.

Overall accuracy can be a misleading metric, especially when a significant class imbalance is present. For this reason we use Mean Per Class Accuracy (MPCA), calculated as the mean of the main diagonal of a normalized confusion matrix. The confusion matrix is normalized by dividing each row by the number of samples in that row (which corresponds to the number of samples per class). This metric is less susceptible to class imbalances than overall accuracy.
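The metrics above reduce to a few lines of array arithmetic on a confusion matrix whose rows hold the annotated ground truth and whose columns hold the predicted labels. The sketch below follows the definitions in the text; the 4-class matrix is made up for illustration.

```python
import numpy as np

def per_class_metrics(cm):
    """Recall, precision and F1-score per class from a confusion matrix
    (rows: annotated ground truth, columns: predicted labels)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp              # positive-class images missed
    fp = cm.sum(axis=0) - tp              # negative-class images pulled in
    recall = tp / (tp + fn)               # equation (1)
    precision = tp / (tp + fp)            # equation (2)
    f1 = 2 * precision * recall / (precision + recall)   # equation (3)
    return recall, precision, f1

def overall_accuracy(cm):
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()        # diagonal sum over all samples

def mean_per_class_accuracy(cm):
    cm = np.asarray(cm, dtype=float)
    normalized = cm / cm.sum(axis=1, keepdims=True)  # divide each row by its class size
    return np.diag(normalized).mean()

# Hypothetical counts for (Compact, FRI, FRII, Bent):
cm = [[90, 3, 5, 2], [4, 40, 2, 1], [6, 3, 80, 1], [2, 5, 3, 50]]
print(per_class_metrics(cm))
print(overall_accuracy(cm), mean_per_class_accuracy(cm))
```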
Figure 1. Confusion matrix layout, with the annotated ground truth along the rows and the predicted label along the columns; each cell is marked as a true positive (TP), false positive (FP), false negative (FN) or true negative (TN). This specific example showcases an assessment of the FRI class.
We define model complexity as the number of trainable parameters. The trainable parameters of a neural network are its weights and bias terms. The number of trainable parameters of all model instances associated with a particular architecture remains the same as long as they were created using the same set of hyperparameters.
Each architecture's computational complexity is measured using Tensorflow's version 1 profiler. The aforementioned profiler measures the number of floating point operations (FLOPs) used by the model in a single forward pass.

Theoretical GPU memory usage was estimated by first determining the memory footprint of the CNN's parameters and then adding to that the amount of active memory the CNN would require when processing a batch of data (a batch size of 32 was used in this case; the batch size is the number of samples classified concurrently at any point in time).

Inference time is the time that a CNN requires to classify a single image. Classification speed is the number of images that are classified per second, obtained by inverting inference time. In this paper, inference time was estimated by taking the average of 10 timed runs in which we classified 3072 images (with a batch size of 32).
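Two of these cost metrics can be estimated directly from a Keras model, as in the simplified sketch below. The toy model, input resolution and use of random dummy inputs are assumptions; the paper's own measurements used the TensorFlow v1 profiler on a Tesla V100.

```python
import time
import numpy as np
from tensorflow.keras import layers, models

# A trivial stand-in model; any compiled Keras model could be timed this way.
model = models.Sequential([
    layers.Input(shape=(150, 150, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation="softmax"),
])

# Model complexity: the number of trainable parameters (weights and bias terms).
print("trainable parameters:", model.count_params())

# Inference time: classify 3072 dummy images in batches of 32, averaged over 10 runs.
batch = np.random.rand(32, 150, 150, 1).astype("float32")
times = []
for _ in range(10):
    start = time.perf_counter()
    for _ in range(3072 // 32):
        model.predict(batch, verbose=0)
    times.append(time.perf_counter() - start)

inference_time = np.mean(times) / 3072              # seconds per image
print("classification speed:", 1.0 / inference_time, "images per second")
```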
The final convolutional layer's output is not necessarily the result of a transformation applied to every pixel in the input image (unlike a dense/fully connected layer). Rather, each output pixel has a limited "field of view": a limited region in the input image that trickles down through the convolutional layers to become a single output.
This is the architecture's theoretical receptive field, as opposed to its effective receptive field (the pixels in that limited region that had the largest impact on the output) (Luo et al. 2016). We do not consider the effective receptive field any further in this paper. Moreover, for the sake of simplicity we will refer to the theoretical receptive field simply as the receptive field throughout the remainder of the paper.

Effective stride is defined as the stride between the input layer and the output layer of the convolutional part of a CNN (Araujo et al. 2019); effective padding is defined similarly. The reader who wants to gain more insight into these topics is referred to Araujo et al. (2019), who provide an in-depth description of how the receptive field and the effective stride of a CNN are computed.
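For a plain chain of convolution and pooling layers, the receptive field and effective stride follow simple recurrences (Araujo et al. 2019), sketched below. The example evaluates a stack of three 3 × 3 convolutions whose first layer has stride 2, which matches the first convolutional stack of the novel ConvXpress architecture introduced in Section 3 and reproduces the 11 × 11 receptive field quoted there.

```python
def receptive_field(layer_specs):
    """Theoretical receptive field and effective stride of a chain of
    convolution/pooling layers, each given as a (kernel_size, stride) pair.
    Follows the recurrences described by Araujo et al. (2019)."""
    r, j = 1, 1                  # receptive field and effective stride at the input
    for k, s in layer_specs:
        r = r + (k - 1) * j      # a k-wide kernel sees (k - 1) extra input strides
        j = j * s                # strides compound multiplicatively
    return r, j

# Three stacked 3x3 convolutions, the first with stride 2:
print(receptive_field([(3, 2), (3, 1), (3, 1)]))   # -> (11, 2)
```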
Data coverage is the percentage of the total dataset that a classifier can assign a label to, given a certain confidence threshold. A good characteristic for a classifier to have is that its recognition performance should increase as its prediction confidence threshold is increased. Data coverage will either decrease or remain constant as the prediction confidence threshold is increased, with a significant decrease expected at higher confidence thresholds. In the ideal case, excluding only a few sources from your dataset will bring about a large gain in recognition performance.
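A sketch of how coverage and the accompanying recognition performance can be traced against the confidence threshold is given below; the function name and the use of the maximum softmax output as the confidence are assumptions.

```python
import numpy as np

def coverage_and_accuracy(probs, labels, threshold):
    """Data coverage and accuracy at a given prediction-confidence threshold.

    probs  : (n_samples, n_classes) array of softmax outputs
    labels : (n_samples,) array of integer ground-truth labels
    """
    confidence = probs.max(axis=1)
    kept = confidence >= threshold          # samples the classifier will label
    coverage = kept.mean()                  # fraction of the dataset covered
    if not kept.any():
        return 0.0, float("nan")
    accuracy = (probs[kept].argmax(axis=1) == labels[kept]).mean()
    return coverage, accuracy

# Sweeping the threshold traces the coverage/performance trade-off:
# for t in np.linspace(0.25, 0.99, 20):
#     print(t, coverage_and_accuracy(probs, labels, t))
```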
In this section we discuss the architectures we considered for our study. We also present useful auxiliary information that will improve the reader's understanding of the paper.
The terms architecture and model are not used interchangeably in the context of this study. We refer to an architecture as the layout of the network's structure, whereas a model is a trained instance of the architecture. Models of the same architecture are differentiated by the data they were trained on and by other hyperparameters, such as different learning rates or the optimizer used during training.
The architectures we considered in our comparison are listed in Table 1, which provides the architecture names, the corresponding studies from the literature, as well as shortened keys assigned for use in plots (see Figure 5 as an example). Some studies have contributed more than one architecture (Lukic et al. 2019b). All of the architectures listed in Table 1 were modified to enable them to discern between four types of radio sources (i.e. the number of output classes was changed to four). Example images of the four classes that we consider in this paper are depicted in Figure 2.
Table 1. List of architectures and their keys for all figures. Architectures marked † have been modified from their original form.

Architecture Name | Key | Study
AlexNet | ALN | Krizhevsky et al. (2012)
ATLAS X-ID† | ATL | Alger et al. (2018)
ConvNet4 | CN4 | Lukic et al. (2019b)
ConvNet8 | CN8 | Lukic et al. (2019b)
FIRST Classifier | 1stC | Alhassan et al. (2018)
FR-Deep | FR-D | Tang et al. (2019)
Hosenie | H | Hosenie (2018)
MCRGNet† | MCRG | Ma et al. (2019)
Radio Galaxy Zoo | RGZ | Lukic et al. (2018)
SimpleNet | CNs | Lukic et al. (2019b)
Toothless† | TLS | Aniyan & Thorat (2017)
CLARAN† (VGG16D) | VGG | Wu et al. (2019)
ConvXpress | CXP | Novel

AlexNet, ConvNet4, ConvNet8, FIRST Classifier, FR-Deep, Hosenie, Radio Galaxy Zoo and SimpleNet were not modified in any further way.
The following architectures were further modified:

• ATLAS X-ID: ATLAS X-ID was not designed explicitly for radio galaxy classification, but rather for finding host galaxies for radio sources by cross-identification. The CNN described in the paper had an additional input vector of 10 features from the candidate host in the SWIRE survey, which has not been included in the modified version.
• MCRGNet: adapted from the neural network described by Ma et al. (2019). Initially this network was pretrained as the encoder level of a Convolutional Auto-Encoder using an unlabelled sample and then fine-tuned on a labelled sample. Several of these CNNs would be combined to form a dichotomous tree classifier, each classifying a subset of the classes. Due to computational constraints, the pretraining step has been left out and only a single instance of this architecture is used.
• Toothless: originally implemented as a fusion classifier consisting of 3 binary classifiers, classifying FRI/FRII, FRI/Bent and FRII/Bent respectively. If two classifiers predicted a source as the same class with a 60% probability, the classification would be accepted; if both predicted with less than the 60% confidence threshold, a '?' would be appended to the classification. Additionally, should none of the classifiers give the same class output, the source is labelled as "Strange". To reduce computational requirements, only a single classifier instance is considered.
• CLARAN: CLARAN takes as input a radio source and a corresponding infrared image, after which it outputs a bounding box showing the location and size of the detected radio source. The source morphology is given in the format iC_jP, where i is the number of components and j is the number of flux-density peaks. A corresponding probability of the morphology is also output. CLARAN uses VGG16D (Simonyan & Zisserman 2015) as a classification layer that is fed into a region-of-interest classifier. While the entire architecture was not suitable for this study, the VGG16D classifier layer was appropriate to include. Note that the VGG16D architecture we include in this study, in contrast with CLARAN, can only assign one label to each image it receives and would, therefore, not fare well if the images it receives contain multiple sources.

At this point we should take a moment to consider the potential impact that the modifications we discuss in the beginning of Section 3 and those in Section 3.2.2 will have on the recognition performance of the architectures presented in the studies from Table 1. First, it should be duly noted that the proposed modifications are a necessity, as they make it possible to perform a meaningful comparison of these architectures. Three major modifications were discussed in the beginning of Section 3 and in Section 3.2.2:
Output Classes
The number of output classes and in some cases even the labels of the output classes were altered (Toothless is an example of the former, CLARAN an example of the latter). This alteration, however, is standard practice within the field of Deep Learning. Take AlexNet as an example: it was originally designed for the ImageNet Challenge, but it is nowadays used to solve many other types of image recognition problems (i.e. the number of classes and the output labels it can produce differ from its original use case). Generally speaking, if a CNN architecture is identified that can discern between N different classes, then its recognition performance will normally not deteriorate significantly if the number of classes that one considers is either reduced or increased by one (given that it is properly re-trained). Moreover, neither would considering N completely different labels have a significant impact on its performance. There are of course exceptions to this: if the nature of the problem is changed completely, or the inherent separability of the dataset changes significantly, this generalization might not necessarily remain true.

Architecture Instances
Only single architecture instances were considered (for example, only a single architecture instance of Toothless was used). Multiple instances of any architecture can be incorporated into a more complex classifier (like a fusion classifier). This will certainly improve the recognition performance of a particular architecture. However, knowing how a single instance of the architecture performs enables us to identify which architectures will ultimately perform better if they are chosen to create a more complex classifier.
Data
The same dataset was used to evaluate each architecture; no peripheral data was included in our experiments, so that a fair comparison between the architectures in Table 1 could be made, even though additional data would have resulted in improved performance for certain architectures. As mentioned in Section 3.2.2, MCRGNet and ATLAS X-ID are particularly affected by this.
The architecture of ConvXpress is based on the architectures of ConvNet8 and VGG16D. ConvXpress is deeper than ConvNet8 (11 vs 8 convolutional layers) and uses the convolutional stack structure introduced by VGG16D. Each stack is comprised of 3 convolutional layers (except for the last stack, which has only two) and a max pooling layer. This was developed to match or enlarge the receptive field size (see Section 2.2.7) of AlexNet's convolutional layers without having to use AlexNet's large kernel size. The receptive field of ConvXpress's first convolutional stack is 11 × 11, obtained with three stacked non-linear activations compared to AlexNet's single activation, making the model more discriminative (Simonyan & Zisserman 2015).

Table 2. ConvXpress architecture layout.

Layer | Depth | Kernel Size | Stride Length | Activation
Conv2D | 32 | 3 | 2 | ReLU
Conv2D | 32 | 3 | 1 | ReLU
Conv2D | 32 | 3 | 1 | ReLU
MaxPooling2D | | 2 | 1 |
Dropout | | | |
Conv2D | 64 | 3 | 2 | ReLU
Conv2D | 64 | 3 | 1 | ReLU
Conv2D | 64 | 3 | 1 | ReLU
MaxPooling2D | | 2 | 1 |
Dropout | | | |
Conv2D | 128 | 3 | 1 | ReLU
Conv2D | 128 | 3 | 1 | ReLU
Conv2D | 128 | 3 | 1 | ReLU
MaxPooling2D | | 2 | 1 |
Dropout | | | |
Conv2D | 256 | 3 | 1 | ReLU
Conv2D | 256 | 3 | 1 | ReLU
MaxPooling2D | | 2 | 1 |
Dropout | | | |
Flatten | | | |
Dense | 500 | | | Linear
Dropout | | | |
Dense | 4 | | | Softmax

In addition to this, stacking reduces the number of parameters required: a layer with an 11 × 11 kernel and C input channels requires 11²C = 121C parameters, while 3 stacked layers with 3 × 3 kernels and C input channels require only 3(3²C) = 27C parameters.

ConvXpress has a non-standard stride length, similar to MCRGNet, Toothless, AlexNet and Radio Galaxy Zoo. In particular, it makes use of a stride length of 2 in the first convolutional layer of the first and second convolutional stacks. The dense (or fully connected) layers are the same as ConvNet8's, using a linear activation in the second-to-last layer with an L2 kernel regularizer. ConvXpress also contains five dropout layers. During each training step a dropout layer randomly turns off some of the neurons in the layer that comes before it (i.e. it blocks their output from propagating to the next layer). The probability that a specific neuron is turned off is known as the dropout rate p. This is similar to creating and training many small networks within the larger network (Srivastava et al. 2014). The value of p for all the dropout layers in ConvXpress is 0.25, with the exception of the last dropout layer, for which p is equal to 0.5. The architecture of ConvXpress is presented in Table 2 (a code sketch of this layout is given after the list of excluded architectures below).

Some architectures were excluded from the study, either for being originally designed to perform a task other than classification or in order to restrict the scope of the study to conventional CNN architectures. The following architectures were excluded:

• Convosource: designed for source-finding, to extract the pixels that belong to an astronomical source from an image with background noise (Lukic et al. 2019a).
• COSMODEEP: designed to perform a combination of source finding and classification by breaking up larger images into smaller tiles that are then individually classified as either containing no signal or containing a radio source (Gheller et al. 2018).
• DEEPSource: aimed at source-finding in low signal-to-noise ratio cases (Sadr et al. 2019).
• Capsule Networks: this study limits the focus of comparison to conventional CNN architectures described in the literature. Lukic et al. (2019a) compared Capsule Network performance with conventional CNN architectures.
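For concreteness, the following Keras sketch assembles the ConvXpress layout from Table 2. The input resolution, the L2 regularization strength and the use of same padding are assumptions not fixed by the table.

```python
from tensorflow.keras import layers, models, regularizers

def conv_stack(filters, first_stride):
    """One ConvXpress stack: three 3x3 convolutions, max pooling and dropout."""
    return [
        layers.Conv2D(filters, 3, strides=first_stride, padding="same", activation="relu"),
        layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu"),
        layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=1),  # stride 1, as listed in Table 2
        layers.Dropout(0.25),
    ]

convxpress = models.Sequential(
    [layers.Input(shape=(150, 150, 1))]       # input resolution is an assumption
    + conv_stack(32, first_stride=2)          # stride 2 opens the first stack
    + conv_stack(64, first_stride=2)          # ... and the second stack
    + conv_stack(128, first_stride=1)
    + [
        # The final stack has only two convolutional layers.
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=1),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(500, activation="linear",
                     kernel_regularizer=regularizers.l2(0.01)),  # strength assumed
        layers.Dropout(0.5),                  # higher rate on the last dropout layer
        layers.Dense(4, activation="softmax"),
    ]
)
convxpress.summary()
```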
AlexNet and Toothless

Just as AlexNet has been a seminal work in image classification with CNNs, so too has Toothless (Aniyan & Thorat 2017) made its mark on the classification of radio galaxies by being the first CNN developed for this very purpose. It has thus been referenced in almost all of the subsequent works listed in Table 1. Toothless is based on AlexNet's architecture and does not differ much other than in the type of padding used, with the original AlexNet design using valid padding (no padding is applied around the edges of the input of a layer) rather than same padding (zero padding around the edges of the input of a layer, to ensure there is no size reduction other than that caused by stride length). This small difference does have a slight effect on performance: since the size of the input is being reduced steadily with valid padding, fewer computational resources are required for AlexNet than for Toothless. The type of padding at different layers is something that should be carefully considered when designing an architecture, since it might shrink the input too fast, throwing away useful information. Same or zero padding will work better for a wider range of input resolutions.
Ma et al. (2019) used two datasets in their study:

• the Labelled Radio Galaxy (LRG) dataset: a curated dataset that contains well-attested sources from the CoNFIG (Gendre & Wall 2008; Gendre et al. 2010), GROUPS (Proctor 2011), FRICAT (Capetti et al. 2017a), FRIICAT (Capetti et al. 2017b), FR0CAT (Baldi et al. 2018) and Cheung (2007) catalogues (see Table 3). The dataset contains 1442 sources and consists of 6 classes (Compact, FRI, FRII, Bent, X-shaped and Ringlike).
• the Unlabelled Radio Galaxy (URG) dataset: a dataset containing 14245 AGNs from the Best-Heckman sample (Best & Heckman 2012). This dataset was manually labelled by Ma et al. (2019).

We use slightly modified versions of these two datasets in our study (note that the total number of sources reported in Ma et al. (2019) and the total number of sources in the catalogue available on their GitHub repository differ slightly). We will refer to the modified version of the LRG dataset as the Modified Labelled Radio Galaxy (MLRG) dataset throughout the rest of the paper (it contains 1,328 sources). Similarly, we will refer to the modified URG dataset as the Modified Unlabelled Radio Galaxy (MURG) dataset (it contains 14,093 sources). We made the following modifications. We removed all X-shaped and Ringlike sources from both the LRG and the URG. We also removed error-prone images from the URG dataset (i.e. images consisting only of NaN values; three Compact, two FRI and two FRII sources were also removed). Moreover, all FR0 sources were added to the Compact class.
Table 3. Class breakdown per catalogue for the LRG dataset.

Catalogue | Compact | FR0 | FRI | FRII | Bent | X | Ring
CoNFIG | 270 | 8 | 14 | 350 | 9 | 0 | 0
FR0CAT | 0 | 104 | 0 | 0 | 0 | 0 | 0
FRICAT | 1 | 19 | 173 | 0 | 5 | 0 | 0
FRIICAT | 0 | 0 | 0 | 80 | 8 | 3 | 0
Proctor (2011) | 0 | 1 | 0 | 0 | 284 | 0 | 32
Cheung (2007) | 0 | 2 | 0 | 0 | 0 | 79 | 0
Total | 271 | 134 | 187 | 430 | 306 | 82 | 32
Table 4. The first 5 rows of the MLRG sample; the full table is available on the project's GitHub repository.

Source Name | Right Ascension (degrees) | Declination (degrees) | Classification
J000330.73+002756.1 | 0.05854 | 0.46558 | Bent-tailed
J001247.57+004715.8 | 0.21321 | 0.78772 | FRII
J002107.62-005531.4 | 0.35212 | -0.92539 | FRII
J002900.98-011341.7 | 0.48361 | -1.22825 | Compact
J003930.52-103218.6 | 0.65848 | -10.5385 | FRI
Table 5. The first 5 rows of the MURG sample; the full table is available on the project's GitHub repository.

Source Name | Right Ascension (degrees) | Declination (degrees) | Classification
J000001.57-092940.3 | 0.00044 | -9.49453 | Compact
J000025.55-095752.8 | 0.00710 | -9.96467 | FRI
J000027.89-010235.4 | 0.00775 | -1.04317 | Compact
J000049.32-005042.9 | 0.01370 | -0.84525 | FRI
J000052.92+003044.6 | 0.01470 | 0.51239 | FRII
Table 6. Comparison of the MLRG and MURG datasets we use in this study.

Class | MLRG | MURG
Compact | 405 | 6093
FRI | 187 | 5039
FRII | 430 | 2072
Bent | 306 | 889
Total | 1328 | 14 093
The final class breakdown of both datasets is presented in Table 6. The final catalogues we used are partially presented in Tables 4 and 5. The rest of these catalogues are available on our GitHub repository. A script that can download the sources from the catalogues is also provided on our GitHub repository. It downloads FIRST cutouts (Becker et al. 1995) in FITS format (300 by 300 pixels) via the Skyview tool (McGlynn et al. 1998).
This section describes the experimental setup we used. The image preprocessing procedure we adopted is described in Section 5.1. The hardware used, as well as other important overarching experimental information, is presented in Section 5.2. The two main experiments conducted are described in Section 5.3 and Section 5.4. We end this section by describing the ensemble classifiers that we constructed.
The preprocessing steps used are as follows: the images were first normalized and then thresholding was applied. Allowing some noise in the training data improves learning for very deep networks (Neelakantan et al. 2015). The thresholding method used assigns a zero value to all pixels with a value below the threshold of three standard deviations above the mean pixel value of the specific image; otherwise, the pixel value is kept the same.
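A sketch of this preprocessing step is given below; the min-max normalization convention is an assumption, as the text above does not pin down the exact normalization used.

```python
import numpy as np

def preprocess(image):
    """Normalize an image, then zero every pixel below mean + 3*std."""
    image = np.nan_to_num(np.asarray(image, dtype=float))
    # Normalization (min-max scaling to [0, 1] is assumed here).
    image = (image - image.min()) / (image.max() - image.min())
    # Threshold: three standard deviations above the mean pixel value.
    threshold = image.mean() + 3.0 * image.std()
    image[image < threshold] = 0.0      # suppress background noise
    return image
```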
All training was performed on an Nvidia Tesla V100 32GB. Our architectures were constructed using the Deep Learning framework Keras (Chollet et al. 2015). To ensure replicable results we provided a random seed to all non-deterministic processes. In addition to this step, Tensorflow requires you to set the TF_CUDNN_DETERMINISTIC environment variable to '1' or 'true' (alternatively, depending on the version of Tensorflow being used, either the Nvidia Tensorflow-Determinism patch, https://pypi.org/project/tensorflow-determinism/, can be applied or the TF_DETERMINISTIC_OPS environment variable must be set).
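In code, this determinism setup amounts to something like the sketch below; the seed value shown is illustrative.

```python
import os
import random
import numpy as np
import tensorflow as tf

# Request deterministic cuDNN kernels (which variable applies depends on the
# Tensorflow version in use).
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"
# os.environ["TF_DETERMINISTIC_OPS"] = "1"   # alternative for other versions

SEED = 42                 # illustrative; a different seed was used per run
random.seed(SEED)         # Python-level randomness
np.random.seed(SEED)      # NumPy (e.g. data shuffling, subset selection)
tf.random.set_seed(SEED)  # Tensorflow/Keras weight initialization
```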
A customized version of the Keras Data Generator class was used to load images during training, validation and testing.

Each architecture was trained for 16 epochs with a learning rate dependent on the number of parameters in the architecture. Adam was used as the optimizer with a callback function that reduces the learning rate once a loss plateau is reached. This callback function halves the learning rate after 3 epochs in which the validation loss has not decreased by the set threshold of 0.001 (i.e. this is the learning rate scheduler we used). At the end of each epoch, another callback function assesses the validation loss. If the current model has a lower validation loss than the previous lowest validation loss, this model is saved and the previous one is discarded. This model is used to represent the architecture in the tests of both experiments below. This process is necessary to prevent overfitting, by storing the models that generalize well on the validation data.

Furthermore, the experiments below were repeated three times (different seed values for the random processes and the subset selection were assigned during each of the three runs). Using different seed values results in a different weight initialization for the CNNs and a different subset selection for the training, validation and testing sets during each run. This is done to get a more accurate representation of the architecture's performance for the chosen hyperparameters and to assess the validity of the results.

In this experiment we emulate the type of training most of the architectures in Table 1 employed in their respective studies: training, validation and testing on a small curated dataset. More specifically: the architectures are trained on a subset of the MLRG dataset and then tested on a mutually exclusive subset of the MLRG dataset. The architectures are then re-tested on the full MURG dataset. The training, validation and test set breakdown used for this experiment is summarized in Table 7. We elaborate further in this regard in the sections that follow.
Figure 2. Examples of the different radio morphologies that have been classified in this study: an FRI (A), an FRII (B), a compact radio source (C) and a bent-tailed radio galaxy (D). All examples are shown before the preprocessing step.
The training and validation sets are augmented by rotating each source at 15 degree intervals, leading to 24 rotated samples of each image. The total numbers of augmented samples are given in parentheses in Table 7. This is done to increase the number of samples for validation and training, as well as to address rotational invariance (discussed in Section 2.1.7). Each source is rotated after preprocessing and then saved as a new FITS image with the rotation factor added to the original file name.
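This augmentation step could look like the sketch below; the use of scipy for the rotation and the interpolation settings are assumptions.

```python
from scipy.ndimage import rotate

def augment_by_rotation(image, step_degrees=15):
    """Return 24 rotated copies of a preprocessed image (15-degree steps).
    Each copy would then be written to a new FITS file with the rotation
    angle appended to the original file name."""
    return [
        rotate(image, angle, reshape=False, order=1, mode="constant", cval=0.0)
        for angle in range(0, 360, step_degrees)
    ]
```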
The training and validation data are selected from the MLRG dataset (the samples in these sets depend on the chosen seed value and as such differ for each experimental run). Training is performed on 80 unique sources per class (1920 after augmentation) and validation on 60 unique sources per class (1440 after augmentation). Using the aforementioned number of sources for training and validation allows for testing on roughly 25% of the smallest class (FRI) in the MLRG dataset. The split ratios are in line with both a standard training/validation/test split commonly used in practice and with the splits used in most of the studies of the architectures presented in Table 1.
We first test the resulting models on the test split of the MLRGdataset. The models are then tested on the full MURG dataset. Bothsets are given in Table 7.
In this section we describe an experiment that is designed to be less susceptible to overfitting. While the MLRG sample provides excellent examples of each class, the application of stringent selection criteria results in the loss of samples that could make a model more robust. Similar to allowing some noise to remain after preprocessing an image to facilitate better training, allowing samples in the training set that are less-than-perfect examples provides a more nuanced understanding of the class distinction and can lead to more robust classification systems. This problem might be more adequately addressed by taking into account the annotator's confidence in their classification, as a separate input in the dense layer for example; however, we leave this for future exploration.

While the aforementioned issue is worth noting, the small size of the MLRG sample is the most serious concern when it comes to potentially overfitting a model. Large training sets are required to train any type of Deep Neural Network. The MURG sample provides a dataset that is large enough to be used to train a Deep Learning model. Although the training sample used for the MURG random split experiment is relatively small compared to what other deep learning studies have used, there is a 312.5% increase in the size of the training set used for this experiment compared with the size of the training set used during the Overfit experiment. The results of the MURG random split experiment provide a more realistic reflection of the expected performance of the architectures in Table 1 when deployed in practice.

More specifically: for this experiment architectures are trained on a subset of the MURG sample and tested on a test split of the MURG sample. The training, validation and test set breakdown used for this experiment is summarized in Table 8. We elaborate further in this regard in the sections that follow.
The training set is augmented by rotating each source at 15 degree intervals, leading to 24 rotated samples of each image. This results in 24,000 and 9,600 sources for training and validation respectively after augmentation, as shown in parentheses in Table 8.
The training and validation data are sampled from the MURG dataset. Training is performed on a random selection of 250 sources per class (6000 after augmentation) and validation on 100 sources per class (2400 after augmentation). Again, exactly which sources are selected is determined by the random seed that was used during each experimental run. While this is a much larger training and validation split than the one used for the Overfit experiment, it is significantly smaller than what is normally used when training a CNN: the total training set makes up only 7.87% of the total dataset, compared to the 50% or more of the set normally selected for training. This smaller selection was chosen to assess the efficacy of model generalization when training on a relatively small subset of the data.
Table 7. Overfit Experiment: training, validation and test set breakdown per class. Note the two different sets which are used for testing, which relate to the results in Figure 3. The values in parentheses are the numbers of augmented samples.

Class | Training Set (Augmented) | Validation Set (Augmented) | MLRG Test Set | MURG Test Set
Compact | 80 (1920) | 60 (1440) | 265 | 6093
FRI | 80 (1920) | 60 (1440) | 47 | 5039
FRII | 80 (1920) | 60 (1440) | 290 | 2072
Bent | 80 (1920) | 60 (1440) | 166 | 889
Total | 320 (7680) | 240 (5760) | 768 | 14 093
Table 8. MURG Random Split Experiment: training, validation and test set breakdown per class. The test set is a subset of the MURG samples. The values in parentheses are the numbers of augmented samples.

Class | Training Set (Augmented) | Validation Set (Augmented) | Test Set
Compact | 250 (6000) | 100 (2400) | 5743
FRI | 250 (6000) | 100 (2400) | 4689
FRII | 250 (6000) | 100 (2400) | 1722
Bent | 250 (6000) | 100 (2400) | 539
Total | 1000 (24 000) | 400 (9600) | 12 693
Testing was performed on a MURG test split. The total number ofsamples in this test set is given in Table 8.
An ensemble classifier is a combination of several classifiers trained to perform the same task (such as classification). These ensembles often generalize better than any of their constituents and are less likely to have the same pitfalls as their constituents (overfitting to a specific class, for example). Two ensemble classifiers have been created from the MURG Random Split Experiment:

(i) an ensemble of all classifiers (ENA);
(ii) an ensemble of the top 4 classifiers, selected based on their MPCA, from here on referred to as the SKA Artificial Intelligence Network (SKAAI Net in Figure 4, or SKN in Figure 6).

Both ensembles sum the output probabilities of their constituent classifiers and take the class with the highest summed probability as the output class.
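The ensembling rule described above is a few lines of code; the sketch below assumes a list of trained Keras models that share the same input format.

```python
import numpy as np

def ensemble_predict(models, images):
    """Sum the constituent classifiers' output probabilities and return the
    class with the highest summed probability for each image."""
    summed = sum(m.predict(images, verbose=0) for m in models)
    return np.argmax(summed, axis=1)
```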
Section 6.1 reports only on the results from the Overfit Experiment (see Section 5.3). In Section 6.2, we compare the results obtained from the Overfit experiment with those of the MURG Random Split Experiment (see Section 5.4).
Figure 3 shows the averaged MPCA over three runs of the Overfit Experiment for each architecture.
Figure 3. Models trained on the MLRG sample in the "Overfit" experiment, reporting MPCA averaged over three iterations with different random seed values. The crosses represent each architecture's averaged MPCA on the MLRG sample's test split, while the diamonds show averaged MPCA on the full MURG dataset. The results show that training and testing architectures on a small sample, such as the MLRG sample, can give misleading expectations for performance on a larger dataset that has not been curated as thoroughly, such as the MURG dataset. The standard deviation for each architecture is also shown.
The data points depicted by the cross markers represent the MPCA results associated with the MLRG test set (described in Table 7), while the data points depicted by the diamond markers represent the MPCA results associated with testing on the full MURG dataset. All models experience a more than 20% decrease in performance when switching from the MLRG test set to the MURG dataset, clear evidence of the models overfitting to the MLRG training set and of the results being unreliable for assessing performance on a larger dataset. Given that the training sample size is only 2.2% of the test sample size, model performance on the MURG dataset is not as bad as one would expect. The results, however, indicate that the models trained on only the MLRG set should not be used for autonomous classification in practice, since on the MURG dataset most of the models have a sub-50% accuracy in at least one class.

Training and testing models on small datasets gives a skewed perception of architecture performance. Models need to be trained and assessed on samples that are representative enough of the underlying data distribution that underpins the classification problem at hand (a too small training dataset prevents this).
In Figure 4, we compare the MPCA averaged over three runs of the Overfit experiment (diamond markers) with the MPCA results obtained from the models trained during the MURG Random Split experiment (circle markers). It is important to note that the models associated with the two experiments are not tested on the same data (although the intersection between the two datasets is large).
Figure 4. Models trained on a selection from the MURG sample (circles) compared with those trained on the MLRG sample (diamonds) in the "Overfit" experiment, giving MPCA averaged over three iterations with different random seed values. The ensemble classifiers' averaged MPCA is also given, with the best performing MURG-trained ensemble given as SKAAI Net and the ensemble of all classifiers given as ENA. The standard deviations for each architecture and the ensembles are also shown.
The Overfit experiment is tested on the full MURG dataset, while the MURG Random Split experiment is tested on a large subset of the MURG dataset (which differs for each run depending on the random seed that was chosen). All the models associated with the MURG experiment show an increase in recognition performance when they are compared to the models associated with the Overfit experiment (ranging from an 11.7% to a 24.03% increase in performance, with an average increase in performance of 18.5%). The average increase in recognition performance is 3.26 times that of the training data size increase (which went from 2.2% to 7.87% of the total dataset). The results of the top 4 ensemble classifier (SKAAI Net) and the ensemble of all the classifiers (ENA) are also given: the top dashed line represents the results associated with the MURG Random Split ensemble classifier and the bottom dashed line the results associated with the Overfit ensemble classifier. Note that the order in this figure is based on the MURG Random Split performance.

The MURG Random Split result is a better indication of architecture performance than the results obtained from the Overfit experiment, the reason being that the architectures are exposed to a larger dataset during training and are, therefore, less prone to overfit. The next section only deals with results obtained from the MURG Random Split experiment.
All the results presented in the subsequent subsections were obtained from the MURG Random Split experiment (Section 5.4). The results of this experiment are summarized in Table A1 and its follow-on table, Table A2. In Section 7.1, we report on the MPCA versus the computational complexity of the architectures in Table 1. Note that, for the sake of brevity, we sometimes use the shortened phrase "architectures" instead of "architectures in Table 1" when referring to the architectures that we considered in this paper. Section 7.2 looks at the per class F1-score performance of the architectures (which serves to showcase the trade-off in class performance for each classifier and highlights the shortfalls of a metric such as MPCA). The memory requirements and the classification speeds of the architectures are discussed in Section 7.3 and Section 7.4. The receptive field and the effective stride length of the architectures are reported in Section 7.5. The overall ranking of the architectures is presented in Section 7.6. Section 7.7 reports on the performance results of the ENA and SKAAI Net (the two ensemble classifiers described in Section 5.5). The data coverage versus confidence threshold graphs associated with SKAAI Net are presented in Section 7.8.
Figure 5 reports the MPCA versus the computational complexity of the architectures in Table 1 for a single forward pass (measured in floating point operations, or FLOPs). The size of the markers in Figure 5 represents the model complexity of the architectures (measured in the number of trainable parameters).

The model with the highest MPCA (72.98%) is CLARAN (i.e. VGG16) (Wu et al. 2019; Simonyan & Zisserman 2015). The best performing models from the existing literature are ConvNet8 (Lukic et al. 2019a) (71.7%), Radio Galaxy Zoo (Lukic et al. 2018) (69.87%) and Toothless (Aniyan & Thorat 2017) (68.92%). The novel classifier produced for this paper, ConvXpress, has the second highest MPCA (71.74%).

Using classifiers from the general computer vision literature that perform well on other datasets as a springboard for architecture development in radio astronomy could potentially save significant computational time. This is evident from Toothless, which was derived from AlexNet and is the fifth best classifier in this study, even though it was the first CNN implemented specifically for radio astronomy.

A very weak correlation exists between the logarithm of the FLOP count and MPCA, with a Pearson correlation coefficient of 0.39 (Freedman et al. 2007). An increase in computational complexity will therefore not necessarily translate into a proportional increase in recognition performance. This is corroborated by the following examples: ATLAS and Radio Galaxy Zoo require fewer FLOPs than Lukic et al. (2019a)'s SimpleNet and ConvNet4, whilst obtaining a better MPCA than the latter two architectures.

The logarithm of the number of parameters and MPCA are even more weakly correlated than the logarithm of the FLOP count and MPCA, with a Pearson correlation coefficient of 0.35. Large models (i.e. those with a higher trainable parameter count) often outperform smaller models in terms of recognition performance, but notable exceptions exist: ConvXpress and Radio Galaxy Zoo have low parameter counts but high MPCA scores.

Figure 6 reports the F1-score of each class, sorted by architecture performance. The F1-score encapsulates recall and precision into a single metric. More general metrics, such as MPCA and overall accuracy, can be misleading, since a model might score high in either by doing exceptionally well in one class whilst underperforming in another. Classifiers should in general not be evaluated using a single metric; however, a single metric is sometimes necessary, as it can convey information in a concise and succinct manner (the sketch below illustrates how MPCA and the per class F1-scores follow from a confusion matrix).
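To make these metrics concrete, the following minimal Python sketch (not the paper's analysis code) computes MPCA and per class F1-scores from a confusion matrix, using the mean SKAAI Net counts from Figure 10; the small difference from the reported 73.85% MPCA arises because the paper averages the MPCA over runs.

# Minimal sketch: MPCA and per-class F1 from a confusion matrix whose
# rows are true labels and whose columns are predictions. The counts
# are the mean SKAAI Net values from Figure 10.
import numpy as np

classes = ["Compact", "FRI", "FRII", "Bent"]
C = np.array([[4979,  473,  169,   70],
              [ 501, 3261,  350,  576],
              [  62,  181, 1170,  307],
              [   9,   60,   92,  377]])

recall = C.diagonal() / C.sum(axis=1)      # per-class accuracy
precision = C.diagonal() / C.sum(axis=0)
f1 = 2 * precision * recall / (precision + recall)
mpca = 100 * recall.mean()                 # mean per class accuracy

print(f"MPCA = {mpca:.2f}%")               # ~73.8%, close to the reported 73.85%
for name, score in zip(classes, f1):
    print(f"F1({name}) = {score:.3f}")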
Figure 7 reports inference time versus theoretical GPU memory usage (at a batch size of 32). As memory usage increases, inference time increases dramatically. The standard deviation associated with the inference time is also depicted in Figure 7.
Classification speed is the number of images an architecture can classify per second at a certain batch size; a batch size of 32 has been used for this experiment. The classification speed of the different architectures is obtained from the inference times reported in Figure 7 (measured over 3072 images at batch size 32; see Table A1). Figure 8 reports the MPCA versus the classification speed of the different architectures. It is evident from Figure 8 that computationally efficient models generally have faster classification speeds. A trade-off, therefore, exists between faster classification and higher recognition performance, at least for standard CNN architectures. Moreover, MCRGNet has the fastest classification speed at 1270 images per second, with Radio Galaxy Zoo close behind at 1246 images per second. VGG16 has the slowest classification speed at 291 images per second.

In comparison, the classification speed of an average person is "about 250 images in 5 minutes", or roughly 0.833 images per second (Markoff 2012). The classification task from which this result was obtained is complex (a human classifier had to choose a label from a large number of possibilities), and as such the aforementioned result should be regarded as a lower bound estimate of how fast an average person would be able to classify such images.
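For illustration, a throughput measurement of this kind can be sketched with Keras as follows. This is a minimal sketch rather than the benchmarking harness used in this study; the model file name and input shape are placeholder assumptions.

# Minimal sketch: measuring classification speed (images/second) of a
# Keras model at batch size 32, over 3072 images as in Table A1.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("architecture.h5")        # hypothetical file
images = np.random.rand(3072, 150, 150, 1).astype("float32")  # dummy data; input shape is an assumption

model.predict(images[:32], batch_size=32)  # warm-up pass, excluded from timing

start = time.perf_counter()
model.predict(images, batch_size=32)
elapsed = time.perf_counter() - start

print(f"Inference time: {elapsed:.2f} s "
      f"({len(images) / elapsed:.0f} images per second)")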
Figure 9 shows how receptive field, effective stride length and MPCA relate to one another. The correlation between receptive field and MPCA is very weak (with a Pearson correlation coefficient of 0.431). Effective stride length and MPCA are slightly better correlated (with a Pearson correlation coefficient of 0.436). This is corroborated by Figure 9; for example, ConvNet8 has a small receptive field and effective stride, but performs comparatively well against architectures that have larger receptive fields and strides. In summary, a larger receptive field and effective stride alone are no guarantee of better classifier performance. Using larger strides reduces the number of convolutions applied, which results in faster classification speeds. Applying larger strides in the first layers of an architecture reduces those layers' output sizes, which ultimately decreases the number of FLOPs used by the architecture, since it reduces the input sizes of subsequent layers.
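For completeness, the effective receptive field and effective stride of a stack of convolution and pooling layers follow from a standard recurrence (see e.g. Araujo et al. 2019). Below is a minimal sketch; the example layer stack is hypothetical and is not one of the architectures in Table 1.

# Minimal sketch: effective receptive field and effective stride of a
# stack of convolution/pooling layers, using the standard recurrences.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input to output."""
    rf, jump = 1, 1  # receptive field and cumulative stride ("jump")
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= stride             # strides compound multiplicatively
    return rf, jump

# Hypothetical example: three 3x3 convs (stride 1) with 2x2 max-pools
# (stride 2) in between.
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
rf, stride = receptive_field(stack)
print(f"effective receptive field = {rf}, effective stride = {stride}")  # 18, 4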
All the architectures listed in Table 1 have been ranked in Table 9 according to their recognition performance (given as the classifier ranking) and their computational performance (given as the computational ranking).
Figure 5. Average accuracy vs computational complexity vs model complexity: the different CNN architectures are compared based on recognition performance (MPCA on the 𝑦-axis), computational complexity (FLOPs on the 𝑥-axis) and model complexity (the number of trainable parameters, given as the circle size). A weak correlation is present between the logarithm of the computational complexity and recognition performance (Pearson correlation coefficient of 0.39).

Figure 6. F1-score per class (Compact, FRI, FRII and Bent panels) for all architectures. ENA represents an ensemble classifier of all architectures, while SKAAI Net (SKN) is an ensemble of the top 4 classifiers.

Figure 7. Inference time versus GPU memory usage.

Figure 8. MPCA versus images per second: as classification speed increases, recognition performance decreases.

An overall rank is calculated based on the sum of these two rankings. Please note that this ranking is relative and limited to the architectures within this study (and the datasets used); it cannot be seen as an absolute reflection of architecture standing. The rankings are calculated using a round-robin "tournament" in which each architecture is compared to every other architecture (excluding itself) in several different categories. If an architecture achieves a higher or lower score (which depends on the metric of the category under consideration) than a "competing" architecture does in a specific category, then the former architecture's ranking is incremented by 𝑗, while the latter architecture's ranking is decremented by 𝑘. As alluded to before, this comparison is repeated for every category and every architecture pair. A higher category score is better in the case of the recognition performance metric categories, while a lower category score is better in the case of the computational requirement metric categories. To establish the classifier ranking, the MPCA and the per class F1-score of the different architectures are compared with one another (i.e. a total of 5 categories are considered); for the classifier ranking, 𝑘 = 𝑗 = 1. To establish the computational ranking, the memory usage, the FLOP count and the inference time of the architectures are compared with one another (i.e. a total of 3 categories are considered); for the computational ranking, we also decided upon using 𝑘 = 𝑗 = 1.
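The sketch below illustrates the tournament scoring under the assumption 𝑗 = 𝑘 = 1 (consistent with Table 9, where the best computational performer scores exactly +36 across 3 categories and 12 opponents); the example metric values are illustrative only.

# Minimal sketch of the round-robin "tournament" scoring, assuming
# j = k = 1. `scores` maps each architecture key to a tuple of
# per-category metric values; `higher_is_better` gives the direction
# of each category.
from itertools import combinations

def tournament_rank(scores, higher_is_better, j=1, k=1):
    ranking = {arch: 0 for arch in scores}
    for c, higher in enumerate(higher_is_better):
        for a, b in combinations(scores, 2):
            va, vb = scores[a][c], scores[b][c]
            if va == vb:
                continue  # a tie leaves both rankings unchanged
            winner, loser = (a, b) if (va > vb) == higher else (b, a)
            ranking[winner] += j
            ranking[loser] -= k
    return ranking

# Illustrative example with two recognition categories (higher is better).
scores = {"CXP": (71.75, 0.74), "VGG": (72.98, 0.73), "ALN": (63.15, 0.67)}
print(tournament_rank(scores, higher_is_better=[True, True]))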
Figure 9. MPCA versus receptive field size versus effective stride length. In general, as the receptive field and effective stride length increase, so does MPCA; however, care should be taken when designing architectures based only on receptive field size, since a larger receptive field or effective stride length does not always translate into a higher accuracy. Receptive field and effective stride length could, however, serve as useful metrics to better understand the performance of a particular CNN.
Table 9. Classifier rankings: the proposed ranking system is based upon recognition performance (classification ranking) and computational performance (computational ranking). ConvXpress, MCRGNet and Radio Galaxy Zoo rank in the top three, each showing a different balance between computational performance and recognition performance. The rankings are calculated using a round-robin "tournament" in which each architecture is compared to every other classifier (excluding itself) in each category. MPCA and the F1-score for each class are used for the classification ranking, while memory usage, FLOP count and inference time are used for the computational ranking. The overall rank is the sum of the two rankings.

Key    Classification Ranking    Computational Ranking    Overall Rank
CXP            46                        6                     52
MCRG           14                       36                     50
RGZ            20                       30                     50
CN8            54                      -30                     24
VGG            50                      -36                     14
TLS             6                      -14                     -8
ATL           -18                        8                    -10
CNs           -10                      -10                    -20
H             -42                       20                    -22
1stC           -4                      -18                    -22
ALN           -38                       12                    -26
FR-D          -48                       10                    -38
CN4           -30                      -14                    -44
The recognition performance of SKAAI Net and ENA is given in Figures 4 and 6 (see Section 5.5). Comparing either ensemble's performance with any individual classifier's performance, the ensemble methods outperform individual classifiers in terms of accuracy and outperform most in F1-score. The top 4 ensemble (SKAAI Net) performs the best when we consider the MPCA metric (73.85%) and scores the highest F1-score in two classes, while being second in the FRI and Bent classes (indicating that, even though MPCA has its shortcomings, it remains a helpful keystone metric for evaluating architecture performance). The ensemble of all the models (ENA), on the other hand, has an MPCA of 72.08%, just slightly below the highest MPCA of the single classifiers (CLARAN, 72.98%), and performs well in F1-score.

SKAAI Net's confusion matrix is depicted in Figure 10; we only provide SKAAI Net's confusion matrix here as it outperforms ENA. The main diagonal of the matrix shows the number of correctly classified images, with the columns indicating the classifier's prediction and the rows the actual label of each image. The percentages are the normalized values for each class. Percentage-wise, the most common misclassifications are FRIIs that are labelled as bent-tails (17.85%), but in absolute terms more FRIs are misclassified as bent-tails on average (576).

Overall, SKAAI Net provides an ensemble classifier that reduces classifier-specific shortcomings with regard to recognition performance. It should be noted that these ensemble methods require a significant amount of computational resources to run (as they consist of more than one model), resulting in a much slower classification speed. ENA in particular has an exceptionally large computational footprint, as it is made up of all the classifiers in this study, which makes it infeasible for deployment in production (it requires ~11.8 GB of GPU memory). Whether the marginal gains SKAAI Net and ENA make in MPCA and F1-score are worth the significant increase in computational requirements needed to achieve those gains is, therefore, highly debatable.
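As an illustration of how such an ensemble operates, the minimal sketch below averages the softmax outputs of several trained members (simple soft voting; the precise combination rule used by SKAAI Net and ENA is the one described in Section 5.5, and the model file names are placeholders).

# Minimal sketch of a soft-voting ensemble: average each member's
# class probabilities, then take the argmax. Member files are
# hypothetical placeholders for the trained top-4 classifiers.
import numpy as np
import tensorflow as tf

member_files = ["cxp.h5", "vgg.h5", "cn8.h5", "rgz.h5"]
members = [tf.keras.models.load_model(f) for f in member_files]

def ensemble_predict(images):
    """Average the members' softmax outputs and return class indices."""
    probs = np.mean([m.predict(images) for m in members], axis=0)
    return probs.argmax(axis=1)  # predicted class index per image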
Figure 11 reports the percentage dataset coverage per class and the F1-score per class of SKAAI Net at different confidence thresholds. It also reports the percentage coverage for the entire dataset and the MPCA. Compact sources have the highest F1-score overall and the smallest decrease in coverage. Bent-tails see the largest increase in F1-score. FRII coverage drops at a faster rate than that of the other classes.

Knowing the data coverage behaviour of a classifier is important. It allows an estimate of the resources that would be required if the classifier were to be incorporated with subject-matter experts into a classification pipeline, i.e. on average how many images would be rejected by the classification system at a specific certainty threshold, which in turn helps estimate the number of man hours required to manually classify those rejected sources. Conversely, if the human resource availability is known, the certainty threshold can be adapted (a sketch of this coverage computation is given below).
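A minimal sketch of the coverage computation, assuming the ensemble's per-image softmax outputs are available as an array:

# Minimal sketch: at a given confidence threshold, "coverage" is the
# fraction of images whose top predicted probability reaches the
# threshold; the remainder would be passed to subject-matter experts.
import numpy as np

def coverage_at(probs, threshold):
    """Fraction of images the classifier keeps at this threshold."""
    confidence = probs.max(axis=1)  # top class probability per image
    kept = confidence >= threshold
    return kept.mean(), kept        # coverage, plus a mask of kept images

# Example with random placeholder probabilities (1000 images, 4 classes).
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=1000)
for t in (0.5, 0.7, 0.9):
    cov, _ = coverage_at(probs, t)
    print(f"threshold {t:.1f}: coverage {cov:.1%}")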
Figure 10. Confusion matrix of the SKAAI Net ensemble, averaged over three runs. Rows give the true label and columns the predicted label; each cell gives the normalized percentage for that class alongside the number of sources classified.

True \ Predicted    Compact                  FRI                       FRII                      Bent
Compact             (87.48±1.01)% 4979±57    (8.31±1.46)% 473±82       (2.98±1.43)% 169±81       (1.23±0.29)% 70±16
FRI                 (10.7±1.14)% 501±53      (69.55±4.23)% 3261±198    (7.46±2.18)% 350±102      (12.29±2.22)% 576±104
FRII                (3.64±1.0)% 62±17        (10.53±3.44)% 181±59      (67.98±5.92)% 1170±101    (17.85±2.36)% 307±40
Bent                (1.67±0.69)% 9±3         (11.19±3.51)% 60±18       (17.13±4.38)% 92±23       (70.01±2.05)% 377±11

Two experiments were performed in this study. The first experiment assessed overfitting on the MLRG dataset, which has large intersections with the training sets used in most studies in Table 1 (see Section 5 and Section 6). The second experiment analysed the computational cost of existing CNN architectures used for radio galaxy morphological classification (see Section 7). The results from these experiments suggest that, when evaluating an architecture's performance, careful attention should be paid to the size of the training set being used; otherwise the results obtained may not be a true reflection of architecture performance (see Figure 3 and Figure 4). Furthermore, there exists a trade-off between recognition performance and computational cost. These two factors should be carefully weighed against each other when deciding on which architecture to use in production. In addition to recognition performance metrics like MPCA and F1-score, one should also consider computational cost metrics like memory usage, floating point operations and classification speed. From all of these metrics a "best" architecture can be chosen for deployment, based on the computational resources that are available.

There are also a few other minor conclusions that can be drawn from the results obtained from the experiments we conducted in this paper:
Architecture
A few design choices for CNN architectures can speed up model performance while driving down resource costs. Stacking several convolutional layers with small kernels, without a pooling layer in between, covers the same receptive field as a single layer with a larger kernel, with the added bonus that more non-linearity is introduced (through additional ReLU activations) while driving down the number of parameters. This was originally used within the VGG architecture family developed by Simonyan & Zisserman (2015), and several of the architectures in Table 1 build on these design decisions, specifically ConvNet8 (Lukic et al. 2019b); a sketch comparing the two options is given after the next paragraph.

ConvXpress
ConvXpress utilizes the aforementioned stacking strategy. It also uses a non-standard stride length, which indirectly reduces its computational cost. This architecture performs well when compared to the other architectures in Table 1 (see Table 9).
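As an illustration of the stacking argument, the minimal Keras sketch below compares the parameter counts of a single 5×5 convolution and two stacked 3×3 convolutions over the same hypothetical 32-channel input; both configurations cover a 5×5 receptive field, but the stack is cheaper and adds an extra ReLU.

# Minimal sketch: two stacked 3x3 convolutions cover the same 5x5
# receptive field as one 5x5 convolution, but with fewer parameters.
# The input shape and channel counts are illustrative assumptions.
import tensorflow as tf

def param_count(layers):
    model = tf.keras.Sequential([tf.keras.Input(shape=(64, 64, 32))] + layers)
    return model.count_params()

single = [tf.keras.layers.Conv2D(32, 5, activation="relu", padding="same")]
stacked = [tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
           tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same")]

print("one 5x5 conv: ", param_count(single))   # 32*32*5*5 + 32 = 25632
print("two 3x3 convs:", param_count(stacked))  # 2*(32*32*3*3 + 32) = 18496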
Parameters
Model complexity is only weakly correlated with recognition performance; an increase in parameters is likely, though not guaranteed, to translate into an increase in recognition performance. Increasing recognition performance by simply utilizing more and more trainable parameters is discouraged, as it is a strategy that can lead to overfitting (and it does not scale well). Furthermore, an increase in trainable parameters will increase computational complexity, training time and GPU memory usage. Overall, deep learning models are viewed as inefficient in exploiting their full learning power (Muhammed et al. 2017), given the large number of parameters they require relative to other machine learning approaches.
FLOPs
Computational complexity (given as FLOPs) and recognition performance (approximated as MPCA) are weakly correlated (see Figure 5). Utilizing more computational resources is, therefore, likely to result in at least marginal increases in recognition performance. As hinted at previously, however, increasing recognition performance by simply utilizing more and more computational resources is frowned upon, as it is a strategy that does not scale well.
Classification Speed
Generally, models with a higher MPCA classify more slowly than those with a lower MPCA (see Figure 8). The trade-off between classification speed and recognition performance is evident, as model MPCA decreases with an increase in images classified per second.
Receptive field and Stride length
A large receptive field and stride length do not guarantee good recognition performance. These two metrics can, however, help explain the performance of a particular architecture.
Ranking
CNNs can be ranked according to their recognition performance results and the computational resources that they require; the ranking we obtained by doing just this is presented in Table 9. It is, however, important to realize that this study is not exhaustive enough to provide an absolute ranking of the architectures in Table 1. A more extensive study that considers all possible combinations of hyperparameters would be required to achieve that goal; such a study would, however, be computationally infeasible. This study does, nevertheless, provide a useful pragmatic ranking, as the hyperparameters were chosen in accordance with accepted guidelines.
Ensemble
While the ensemble methods do produce better results, the significant increase in computational requirements associated with using them is not proportional to the gain in recognition performance that they offer. An option left unexplored in this study is the creation of either a tree classifier (as was done in the original MCRGNet study (Ma et al. 2019)) or a fusion classifier with a voting scheme (as used by Toothless (Aniyan & Thorat 2017)). Both of these approaches require the training of several models that specialize in the classification of only two classes. A majority vote for a single class indicates high confidence in the prediction, while a mixed vote indicates uncertainty (such a source would be marked for inspection by a subject-matter expert); a sketch of such a voting scheme is given below.
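The minimal sketch below illustrates the flagging logic of such a voting scheme; the strict-majority rule used to identify mixed votes is an assumption made for illustration.

# Minimal sketch of a fusion voting scheme over binary (one-vs-one)
# classifiers: a strict majority yields a confident label, while a
# mixed vote flags the source for a subject-matter expert.
from collections import Counter

def fuse_votes(pairwise_predictions):
    """pairwise_predictions: one class label per binary model.
    Returns the winning label, or None for a mixed vote."""
    votes = Counter(pairwise_predictions)
    label, count = votes.most_common(1)[0]
    if count > len(pairwise_predictions) / 2:  # strict majority: confident
        return label
    return None  # mixed vote: mark the source for expert inspection

print(fuse_votes(["FRI", "FRI", "Bent"]))   # FRI (majority)
print(fuse_votes(["FRI", "FRII", "Bent"]))  # None (mixed -> expert)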
Coverage
Data coverage analysis results can be used to integrate subject-matter experts into a classification pipeline (or, at the very least, to add the ability to flag sources that the classifier is uncertain of) (see Section 7.8). Which confidence threshold is best suited to this endeavour is not explored here, since it depends on the availability of subject-matter experts and of computational resources.

Figure 11. SKAAI Net confidence threshold: dataset coverage versus recognition performance.

For an in-depth comparison of the more recent architectures that are being used for image recognition, please refer to the work done by Muhammed et al. (2017). Training CNNs for the purpose of image classification, and specifically radio galaxy classification, has become a relatively easy task to set up given the computational resources available at present. But as Jitendra Malik (Arthur J. Chick Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley), one of the seminal figures in computer vision, stated: "There are many problems in [computer] vision where getting 50 percent of the solution you can get in one minute, getting to 90 percent can take you a day, getting to 99 percent may take you five years and 99.99 percent may not happen in your lifetime" (Fridman & Malik 2020).

The lack of large sets of annotated training data remains one of the greatest challenges in assessing and improving the general recognition performance of CNNs (and of all image classification algorithms). In addition, the computational resources a model requires are important to consider when selecting an architecture for deployment. The framework and experiments laid out in this study will hopefully help shape the future of image recognition development.
ACKNOWLEDGEMENTS
DATA AVAILABILITY STATEMENT
The data underlying this article will be shared on reasonable request to the corresponding author.
REFERENCES
Alger M. J., et al., 2018, MNRAS, 478, 5547
Alhassan W., Taylor A. R., Vaccari M., 2018, MNRAS, 480, 2085
Aniyan A. K., Thorat K., 2017, ApJS, 230, 20
Araujo A., Norris W., Sim J., 2019, Distill, 4, e21
Baldi R. D., Capetti A., Massaro F., 2018, A&A, 609, A1
Banfield J. K., et al., 2015, MNRAS, 453, 2326
Becker R. H., White R. L., Helfand D. J., 1995, ApJ, 450, 559
Best P. N., Heckman T. M., 2012, MNRAS, 421, 1569
Braun R., Bourke T., Green J. A., Keane E., Wagg J., 2015, in Advancing Astrophysics with the Square Kilometre Array (AASKA14). p. 174
Braun R., Bonaldi A., Bourke T., Keane E., Wagg J., 2019, arXiv e-prints, p. arXiv:1912.12699
Capetti A., Massaro F., Baldi R. D., 2017a, A&A, 598, A49
Capetti A., Massaro F., Baldi R. D., 2017b, A&A, 601, A81
Cheung C. C., 2007, AJ, 133, 2097
Chollet F., et al., 2015, Keras, https://keras.io
Cireşan D. C., Meier U., Gambardella L. M., Schmidhuber J., 2010, Neural Comp., 22, 3207
Cotton W. D., et al., 2020, MNRAS, 495, 1271
Deng J., Dong W., Socher R., Li L.-J., Li K., Fei-Fei L., 2009, in IEEE Conference on Computer Vision and Pattern Recognition. ImageNet: A large-scale hierarchical image database, Miami, Florida, p. 248
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Ekers R. D., Fanti R., Lari C., Parma P., 1978, Nature, 276, 588
Elmegreen D. M., Elmegreen B. G., 1987, ApJ, 314, 3
Fanaroff B. L., Riley J. M., 1974, MNRAS, 167, 31P
Freedman D., Pisani R., Purves R., 2007, Statistics. W. W. Norton & Company, New York
Fridman L., Malik J., 2020, 110 – Jitendra Malik: Computer Vision
Fukushima K., 1980, Biol. Cybernetics, 36, 193
Garofalo D., Singh C. B., 2019, ApJ, 871, 259
Gendre M. A., Wall J. V., 2008, MNRAS, 390, 819
Gendre M. A., Best P. N., Wall J. V., 2010, MNRAS, 404, 1719
Gheller C., Vazza F., Bonafede A., 2018, MNRAS, 480, 3749
Gopal-Krishna, Wiita P. J., 2000, A&A, 363, 507
Harwood J. J., Vernstrom T., Stroe A., 2020, MNRAS, 491, 803
Hine R. G., Longair M. S., 1979, MNRAS, 188, 111
Hosenie Z. B., 2018, Master's thesis, North-West Univ., Potchefstroom
Hubble E. P., 1926, ApJ, 64, 321
Hubel D. H., Wiesel T. N., 1968, The J. of Phys., 195, 215
Kelley H. J., 1960, ARS J., 30, 947
Kozieł-Wierzbowska D., Goyal A., Żywucka N., 2020, ApJS, 247, 53
Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Advances in Neural Information Processing Systems. ImageNet Classification with Deep Convolutional Neural Networks, Lake Tahoe, Nevada, p. 1097
Lacy M., et al., 2020, PASP, 132, 035001
Laing R. A., Jenkins C. R., Wall J. V., Unger S. W., 1994, in Bicknell G. V., Dopita M. A., Quinn P. J., eds, ASP Conf. Ser. Vol. 54, The First Stromlo Symposium: The Physics of Active Galaxies. Astron. Soc. Pac., San Francisco, p. 201
LeCun Y., Bottou L., Bengio Y., Haffner P., 1998, Proc. of the IEEE, 86, 2278
Leahy J. P., Williams A. G., 1984, MNRAS, 210, 929
Lintott C. J., et al., 2008, MNRAS, 389, 1179
Lukic V., Brüggen M., Banfield J. K., Wong O. I., Rudnick L., Norris R. P., Simmons B., 2018, MNRAS, 476, 246
Lukic V., de Gasperin F., Brüggen M., 2019a, Galaxies, 8, 3
Lukic V., Brüggen M., Mingo B., Croston J. H., Kasieczka G., Best P. N., 2019b, MNRAS, 487, 1729
Luo W., Li Y., Urtasun R., Zemel R., 2016, in Advances in Neural Information Processing Systems. Understanding the effective receptive field in deep Convolutional Neural Networks, Barcelona, Spain, p. 4898
Ma Z., et al., 2019, ApJS, 240, 34
Marcos D., Volpi M., Tuia D., 2016, in 23rd International Conference on Pattern Recognition. Learning rotation invariant convolutional filters for texture classification, Cancun, Mexico, p. 2012
Markoff J., 2012, Seeking a Better Way to Find Web Images, The New York Times
McGlynn T., Scollick K., White N., 1998, in McLean B. J., Golombek D. A., Hayes J. J., Payne H. E., eds, IAU Symp. Vol. 179, New Horizons from Multi-Wavelength Sky Surveys. Kluwer, Dordrecht, p. 465
Mingo B., et al., 2019, MNRAS, 488, 2701
Miraghaei H., Best P. N., 2017, MNRAS, 466, 4346
Missaglia V., Massaro F., Capetti A., Paolillo M., Kraft R. P., Baldi R. D., Paggi A., 2019, A&A, 626, A8
Muhammed M. A. E., Ahmed A. A., Khalid T. A., 2017, in 2017 International Conference On Smart Technologies For Smart Nation (SmartTechCon). pp 902–907, doi:10.1109/SmartTechCon.2017.8358502
Neelakantan A., Vilnis L., Le Q. V., Sutskever I., Kaiser L., Kurach K., Martens J., 2015, preprint (arXiv:1511.06807)
Norris R. P., et al., 2011, Publ. Astron. Soc. Australia, 28, 215
Norris R. P., et al., 2013, Publ. Astron. Soc. Australia, 30, e020
Ocran E. F., Taylor A. R., Vaccari M., Ishwara-Chandra C. H., Prandoni I., 2020, MNRAS, 491, 1127
Owen F. N., Ledlow M. J., 1994, in Bicknell G. V., Dopita M. A., Quinn P. J., eds, ASP Conf. Ser. Vol. 54, The First Stromlo Symposium: The Physics of Active Galaxies. Astron. Soc. Pac., San Francisco, p. 319
Owen F. N., Rudnick L., 1976, ApJ, 205, L1
Pracy M. B., et al., 2016, MNRAS, 460, 2
Prescott M., et al., 2018, MNRAS, 480, 707
Proctor D. D., 2011, ApJS, 194, 31
Roberts D. H., Saripalli L., Wang K. X., Sathyanarayana Rao M., Subrahmanyan R., KleinStern C. C., Morii-Sciolla C. Y., Simpson L., 2018, ApJ, 852, 47
Rudnick L., Owen F. N., 1977, AJ, 82, 1
Sabour S., Frosst N., Hinton G. E., 2017, in Proceedings of the 31st International Conference on Neural Information Processing Systems. p. 3856
Sadler E. M., Ekers R. D., Mahony E. K., Mauch T., Murphy T., 2014, MNRAS, 438, 796
Sadr A. V., Vos E. E., Bassett B. A., Hosenie Z., Oozeer N., Lochner M., 2019, MNRAS, 484, 2793
Sandage A., 1961, The Hubble Atlas of Galaxies
Simonyan K., Zisserman A., 2015, in International Conference on Learning Representations
Smith M. D., Donohoe J., 2019, MNRAS, 490, 1363
Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., 2014, J. Mach. Learn. Res., 15, 1929–1958
Tang H., Scaife A. M. M., Leahy J. P., 2019, MNRAS, 488, 3358
Whittam I. H., Green D. A., Jarvis M. J., Riley J. M., 2020, MNRAS, 493, 2841
Willett K. W., et al., 2013, MNRAS, 435, 2835
Wu C., et al., 2019, MNRAS, 482, 1211
de Vaucouleurs G., 1959, Handbuch der Physik, 53, 275

APPENDIX A: TABLE OF RESULTS
Table A1. Results from the experiments discussed in Section 5, listing each architecture's name and the key given in Table 1 for use in the figures. † denotes that the architecture has been modified due to computational constraints or re-purposed for classification. ‡ the inference time and images classified per second are for the classification of 3072 images at batch size 32.

Architecture Name   Key   FLOPs        Conv. FLOPs  FC FLOPs   Trainable Params  Inference Time‡ (s)  St. Dev. (s)  Images/s‡  Eff. Receptive Field  Eff. Stride  Eff. Padding  GPU Memory (MB)
AlexNet             ALN   1107736302   1074169584   33566718   37302980          2.72816              0.24097       1126       195                   32           0             280.576
ATLAS X-ID†         ATL   546585022    545304800    1280222    1385988           2.90696              0.36538       1056       64                    10           24            411.648
CLARAN (VGG16D)†    VGG   27231525758  27044866944  186658814  201384644         10.52742             0.33702       291        212                   32           90            4061.184
ConvNet4            CN4   1036529876   870640128    165889748  165910168         3.51457              0.22828       874        24                    4            12            897.024
ConvNet8            CN8   4612528724   4571054976   41473748   42646184          4.73292              0.24774       649        76                    16           30            1844.224
ConvXpress          CXP   764997460    762947712    2049748    3415944           2.9279               0.21564       1049       333                   64           136           393.216
FIRST Classifier    1stC  1128841236   1077316875   51524361   51655412          3.29737              0.26081       931        22                    8            7             1715.2
FR-Deep             FR-D  141486958    141023472    463486     479996            2.88221              0.29405       1065       84                    30           27            412.672
Hosenie             H     109649469    109413751    235718     261239            2.76616              0.2965        1110       94                    18           38            315.392
MCRGNet†            MCRG  8406674      8201652      205022     213916            2.41828              0.27093       1270       63                    32           20            63.488
Radio Galaxy Zoo    RGZ   75825974     70580024     5245950    5283444           2.46538              0.31239       1246       74                    26           8             84.992
SimpleNet           CNs   797660134    797639400    20734      37460             3.69593              0.2417        831        49                    16           11            573.44
Toothless†          TLS   1634645742   1566476016   68169726   71906180          2.94229              0.21572       1044       195                   32           64            782.336
Table A2. Results from the recognition performance experiments discussed in Section 5, listing each architecture's name and the key given in Table 1 for use in the figures.

Architecture Name   Key   MPCA   Precision (Compact/FRI/FRII/Bent)    Recall (Compact/FRI/FRII/Bent)       F1-Score (Compact/FRI/FRII/Bent)
AlexNet             ALN   63.15  0.837 / 0.7837 / 0.5022 / 0.2509     0.8634 / 0.5872 / 0.6785 / 0.397     0.8488 / 0.6698 / 0.5734 / 0.3007
ATLAS X-ID†         ATL   65.34  0.8782 / 0.7784 / 0.5636 / 0.2069    0.853 / 0.6491 / 0.5341 / 0.5776    0.8639 / 0.7047 / 0.5461 / 0.3043
CLARAN (VGG16D)†    VGG   72.98  0.893 / 0.8225 / 0.636 / 0.2658      0.8661 / 0.6621 / 0.6775 / 0.7137   0.879 / 0.729 / 0.6531 / 0.3861
ConvNet4            CN4   62.84  0.8678 / 0.7586 / 0.5639 / 0.2025    0.8401 / 0.6662 / 0.6336 / 0.3735   0.852 / 0.707 / 0.5913 / 0.2576
ConvNet8            CN8   71.71  0.8925 / 0.7923 / 0.6665 / 0.3065    0.8687 / 0.7242 / 0.644 / 0.6314    0.8798 / 0.756 / 0.6475 / 0.4126
ConvXpress          CXP   71.75  0.8983 / 0.7849 / 0.6527 / 0.2878    0.858 / 0.7065 / 0.6405 / 0.6648    0.8767 / 0.7433 / 0.6405 / 0.4017
FIRST Classifier    1stC  65.21  0.8697 / 0.7408 / 0.585 / 0.2698     0.8279 / 0.6937 / 0.6341 / 0.4527   0.8479 / 0.713 / 0.6056 / 0.3348
FR-Deep             FR-D  59.36  0.866 / 0.7451 / 0.5282 / 0.1552     0.8433 / 0.6243 / 0.2199 / 0.6871   0.8532 / 0.6773 / 0.2956 / 0.2527
Hosenie             H     58.5   0.8957 / 0.7014 / 0.3894 / 0.1755    0.8216 / 0.6837 / 0.2238 / 0.611    0.857 / 0.6923 / 0.2809 / 0.2717
MCRGNet†            MCRG  68.09  0.8686 / 0.7809 / 0.6367 / 0.2281    0.8658 / 0.6499 / 0.59 / 0.6178     0.867 / 0.709 / 0.6124 / 0.333
Radio Galaxy Zoo    RGZ   69.87  0.886 / 0.8102 / 0.633 / 0.2144      0.8662 / 0.6292 / 0.6425 / 0.6568   0.876 / 0.7076 / 0.633 / 0.3229
SimpleNet           CNs   65.91  0.8763 / 0.7776 / 0.5758 / 0.2096    0.8593 / 0.6446 / 0.5563 / 0.5764   0.8674 / 0.7037 / 0.5657 / 0.3065
Toothless†          TLS   68.92  0.8973 / 0.785 / 0.614 / 0.2083      0.8189 / 0.6542 / 0.6518 / 0.632    0.8548 / 0.7085 / 0.6315 / 0.3129

This paper has been typeset from a TEX/LATEX file prepared by the author.