Fanaroff-Riley classification of radio galaxies using group-equivariant convolutional neural networks
MNRAS, 1–11 (2015) Preprint 19 February 2021 Compiled using MNRAS LaTeX style file v3.0
Anna M. M. Scaife,1,2★ and Fiona Porter1
1 Jodrell Bank Centre for Astrophysics, Department of Physics & Astronomy, University of Manchester, Oxford Road, Manchester M13 9PL, UK
2 The Alan Turing Institute, Euston Road, London NW1 2DB, UK
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Weight sharing in convolutional neural networks (CNNs) ensures that their feature maps will be translation-equivariant. However, although conventional convolutions are equivariant to translation, they are not equivariant to other isometries of the input image data, such as rotation and reflection. For the classification of astronomical objects such as radio galaxies, which are expected statistically to be globally orientation invariant, this lack of dihedral equivariance means that a conventional CNN must learn explicitly to classify all rotated versions of a particular type of object individually. In this work we present the first application of group-equivariant convolutional neural networks to radio galaxy classification and explore their potential for reducing intra-class variability by preserving equivariance for the Euclidean group E(2), containing translations, rotations and reflections. For the radio galaxy classification problem considered here, we find that classification performance is modestly improved by the use of both cyclic and dihedral models without additional hyper-parameter tuning, and that a D16-equivariant model provides the best test performance. We use the Monte Carlo Dropout method as a Bayesian approximation to recover epistemic uncertainty as a function of image orientation and show that E(2)-equivariant models are able to reduce variations in model confidence as a function of rotation.

Key words: radio continuum: galaxies – methods: data analysis – techniques: image processing
★ E-mail: [email protected] (AMS)

1 INTRODUCTION

In radio astronomy, a massive increase in data volume is currently driving the increased adoption of machine learning methodologies and automation during data processing and analysis. This is largely due to the high data rates being generated by new facilities such as the Low-Frequency Array (LOFAR; van Haarlem et al. 2013), the Murchison Widefield Array (MWA; Beardsley et al. 2019), the MeerKAT telescope (Jarvis et al. 2016), and the Australian SKA Pathfinder (ASKAP) telescope (Johnston et al. 2008). For these instruments a natural solution has been to automate the data processing stages as much as possible, including the classification of sources. With the advent of such huge surveys, new automated classification algorithms have been developed to replace the "by eye" classification methods used in earlier work. In radio astronomy, morphological classification using convolutional neural networks (CNNs) and deep learning is becoming increasingly common for object classification, in particular with respect to the classification of radio galaxies. The ground work in this field was done by Aniyan & Thorat (2017), who made use of CNNs for the classification of Fanaroff-Riley (FR) type I and type II radio galaxies (Fanaroff & Riley 1974). This was followed by other works involving the use of deep learning in source classification. Examples include Lukic et al. (2018), who made use of CNNs for the classification of compact and extended radio sources from the Radio Galaxy Zoo catalogue (Banfield et al. 2015); the CLARAN (Classifying Radio Sources Automatically with a Neural
Network; Wu et al. 2018) model made use of the Faster R-CNN (Ren et al. 2015) network to identify and classify radio sources; and Alger et al. (2018) made use of an ensemble of classifiers including CNNs to perform host galaxy cross-identification. Tang et al. (2019) made use of transfer learning with CNNs to perform cross-survey classification, while Gheller et al. (2018) made use of deep learning for the detection of cosmological diffuse radio sources. Lukic et al. (2018) also performed morphological classification using a novel technique known as capsule networks (Sabour et al. 2017), although they found no specific advantage compared to traditional CNNs. Bowles et al. (2020) showed that an attention-gated CNN could be used to perform Fanaroff-Riley classification of radio galaxies with equivalent performance to other applications in the literature, but using ∼50% fewer learnable parameters.

As outlined above, the lack of rotational equivariance in conventional convolutions means that a conventional CNN must explicitly learn to classify all rotational augmentations of each image individually. This can result in CNNs learning multiple copies of the same kernel but in different orientations, an effect that is particularly notable when the data itself possesses rotational symmetry (Dieleman et al. 2016). Furthermore, while data augmentation that mimics a form of equivariance, such as image rotation, can result in a network learning approximate equivariance if it has sufficient capacity, it is not guaranteed that invariance learned on a training set will generalise equally well to a test set (Lenc & Vedaldi 2014). A variety of different equivariant networks have been developed to address this issue, each guaranteeing a particular transformation equivariance between the input data and associated feature maps. For example, in the field of galaxy classification using optical data, Dieleman et al.
(2015) enforced discrete rotational invariance through the use of a multi-branch network that concatenated the output features from multiple convolutional branches, each using a rotated version of the same data sample as its input. However, while effective, the approach of Dieleman et al. (2015) requires the convolutional layers of a network architecture, and hence the number of model weights associated with them, to be replicated 𝑁 times, where 𝑁 is the number of discrete rotations. Recently, a more efficient method of using convolutional layers that are equivariant to a particular group of transforms has been developed, which requires no replication of architecture and hence fewer learnable parameters. Explicitly enforcing an equivariance in the network model in this way not only provides a guarantee that it will generalise, but also prevents the network using parameter capacity to learn characteristic behaviour that can instead be specified a priori. First introduced by Cohen & Welling (2016), these Group equivariant Convolutional Neural Networks (G-CNNs), which preserve group equivariance through their convolutional layers, are a natural extension of conventional CNNs that ensure translational equivariance through weight sharing. Group equivariance has also been demonstrated to improve generalisation and increase performance (see e.g. Weiler et al. 2017; Weiler & Cesa 2019). In particular, steerable
G-CNNs have become an increasingly important solution to this problem, notably those steerable CNNs that describe E(2)-equivariant convolutions. The Euclidean group E(2) is the group of isometries of the plane ℝ² that contains translations, rotations and reflections. Isometries such as these are important for general image classification using convolution, as the target object in question is unlikely to appear at a fixed position and orientation in every test image. Such variations are not only highly significant for objects/images that have a preferred orientation, such as text or faces, but are also important for low-level features in nominally orientation-unbiased targets such as astrophysical objects. In principle, E(2)-equivariant CNNs will generalise over rotationally-transformed images by design, which reduces the amount of intra-class variability that they have to learn. In effect, such networks are insensitive to rotational or reflection variations and therefore learn only features that are independent of these properties. In this work we introduce the use of 𝐺-steerable CNNs to astronomical classification.
The structure of the paper is as follows: in Section 2 we describe the mathematical operation of 𝐺-steerable CNNs and define the specific Euclidean subgroups being considered in this work; in Section 3 we describe the data sets used in this work and the preprocessing steps implemented on those data; in Section 4 we describe the network architecture adopted in this work, explain how the 𝐺-steerable implementation is constructed and specify the group representations; in Section 5 we give an overview of the training outcomes, including a discussion of the convergence for different equivariance groups and the validation and test performance metrics, and introduce a novel use of the Monte Carlo Dropout method for quantitatively assessing the degree of model confidence in a test prediction as a function of image orientation; in Section 6 we discuss the validity of the assumption that radio galaxy populations are expected to be statistically rotation and reflection unbiased and review the implications of this work in that context; in Section 7 we draw our conclusions.

Group CNNs define feature spaces using feature fields 𝑓 : ℝ² → ℝ^𝑐, which associate a 𝑐-dimensional feature vector 𝑓(𝑥) ∈ ℝ^𝑐 with each point 𝑥 of an input space. Unlike conventional CNNs, the feature fields of such networks contain transformations that preserve the transformation law of a particular group or subgroup, which allows them to encode orientation information. This means that if one transforms the input data, 𝑥, by some transformation action, 𝑔 (translation, rotation, etc.), and passes it through a trained layer of the network, then the output from that layer, Φ(𝑥), must be equivalent to having passed the data through the layer and then transformed it, i.e.

Φ(T_𝑔 𝑥) = T′_𝑔 Φ(𝑥),   (1)

where T_𝑔 is the transformation for action 𝑔. In the case where the transformation is invariant rather than equivariant, i.e.
the input does not change at all when it is transformed, T′_𝑔 will be the identity matrix for all actions 𝑔 ∈ 𝐺. In the case of equivariance, T_𝑔 does not necessarily need to be equal to T′_𝑔 and instead must only fulfil the property that it is a linear representation of 𝐺, i.e. T(𝑔ℎ) = T(𝑔)T(ℎ). Cohen & Welling (2016) demonstrated that the conventional convolution operation in a network can be re-written as a group convolution:

[𝑓 ∗ 𝜙](𝑔) = Σ_{ℎ∈𝑋} Σ_𝑘 𝑓_𝑘(ℎ) 𝜙_𝑘(𝑔⁻¹ℎ),   (2)

where 𝑋 = ℝ² in layer one and 𝑋 = 𝐺 in all subsequent layers. Whilst this operation is translationally-equivariant, 𝜙 is still rotationally constrained. For E(2)-equivariance to hold more generally, the kernel itself must satisfy

𝜙(𝑔𝑥) = 𝜌_out(𝑔) 𝜙(𝑥) 𝜌_in(𝑔⁻¹)  ∀ 𝑔 ∈ 𝐺, 𝑥 ∈ ℝ²,   (3)

(Weiler et al. 2018), where 𝑔 is an action from group 𝐺, and 𝜙 : ℝ² → ℝ^{𝑐_in × 𝑐_out}, where 𝑐_in and 𝑐_out are the number of channels in the input and output data, respectively; 𝜌 is the group representation, which specifies how the channels of each feature vector mix under transformations. Kernels which fulfil this constraint are known as rotation-steerable and must be constructed from a suitable family of basis functions. As noted above, this is a linear relationship, which means that G-steerable kernels form a subspace of the convolution kernels used by conventional CNNs. For planar images the input space will be ℝ², and for single-frequency or continuum radio images these feature fields will be scalar, such that 𝑠 : ℝ² → ℝ. The group representation for scalar fields is also known as the trivial representation, 𝜌(𝑔) = 1 ∀ 𝑔 ∈ 𝐺, indicating that under a transformation there is no orientation information to preserve and that the amplitude does not change.
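As an illustrative aside, the first ("lifting") layer of a group convolution for the cyclic group C4 can be sketched with plain Python lists: rotating the input by 90° rotates every feature map and cyclically permutes the rotation channel, which is exactly the transformation law T′_𝑔 of Eq. (1). This toy check is not the implementation used in this work, only a sketch of the idea:

```python
def rot90(a):
    """Rotate a square 2D list by 90 degrees (counter-clockwise)."""
    return [list(row) for row in zip(*a)][::-1]

def rot(a, r):
    """Apply rot90 r times (the C4 rotation action)."""
    for _ in range(r % 4):
        a = rot90(a)
    return a

def correlate(f, k):
    """'Valid' cross-correlation of image f with kernel k."""
    n, m = len(f), len(k)
    return [[sum(f[i + u][j + v] * k[u][v]
                 for u in range(m) for v in range(m))
             for j in range(n - m + 1)]
            for i in range(n - m + 1)]

def lift(f, k):
    """C4 'lifting' correlation: one feature map per kernel rotation."""
    return [correlate(f, rot(k, r)) for r in range(4)]

f = [[1, 2, 0, 3, 1],
     [0, 1, 4, 2, 0],
     [2, 0, 1, 0, 5],
     [3, 1, 0, 2, 1],
     [0, 4, 2, 1, 0]]
k = [[1, 0, -1],
     [2, 0, -2],
     [1, 0, -1]]

maps = lift(f, k)             # feature maps of the original image
maps_rot = lift(rot90(f), k)  # feature maps of the rotated image
```

One can verify that maps_rot[r] equals the 90°-rotation of maps[(r − 1) mod 4] for every r: the output transforms by a rotation plus a cyclic channel shift, rather than changing arbitrarily.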
The group representation of the output space from a G-steerable convolution must be chosen by the user when designing their network architecture and can be thought of as a kind of hyper-parameter. However, whilst the representation of the input data is in some senses quite trivial for radio images, in practice convolution layers are interleaved with other operations that are sensitive to specific choices of representation. In particular, the range of non-linear activation layers permissible for a particular group or subgroup representation may be limited. Trivial representations, such as scalar fields, do not transform under rotation and therefore conventional nonlinearities like the widely used ReLU activation function are fine. Bias terms in convolution allow equivariance for group convolutions only in the case where there is a single bias parameter per group feature map (rather than per channel feature map), and likewise for batch normalisation (Cohen & Welling 2016). In this work we use the G-steerable network layers from Weiler & Cesa (2019), who define the Euclidean group as being constructed from the translation group, (ℝ², +), and the orthogonal group, O(2) = {O ∈ ℝ^{2×2} | OᵀO = id_{2×2}}, such that the Euclidean group is congruent with the semi-direct product of these two groups, E(2) ≅ (ℝ², +) ⋊ O(2). Consequently, the operations contained in the orthogonal group are those which leave the origin invariant, i.e. continuous rotations and reflections. In this work we specifically consider the cyclic subgroups of the Euclidean group with form (ℝ², +) ⋊ C_N, where C_N contains a set of discrete rotations in multiples of 2π/N, and the dihedral subgroups with form (ℝ², +) ⋊ D_N, where D_N ≅ C_N ⋊ ({±1}, ∗), which incorporates reflection about x = 0. C_N and D_N have orders N and 2N, respectively.
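The subgroup structure above can be made concrete in a few lines of plain Python: for N = 4 the rotation and reflection generators are exact integer matrices, so the orders |C4| = N = 4 and |D4| = 2N = 8 can be verified without floating point. This is an illustrative sketch, not part of the paper's pipeline:

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested tuples."""
    return tuple(tuple(sum(a[i][k] * b[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def generate(gens):
    """Close a set of 2x2 matrices under multiplication."""
    identity = ((1, 0), (0, 1))
    elems, frontier = {identity}, set(gens)
    while frontier:
        new = {matmul2(a, g) for a in elems | frontier for g in gens}
        new -= elems | frontier
        elems |= frontier
        frontier = new
    return elems

r = ((0, -1), (1, 0))   # rotation by 2*pi/4
s = ((-1, 0), (0, 1))   # reflection about the x = 0 axis

C4 = generate([r])      # cyclic subgroup: order N = 4
D4 = generate([r, s])   # dihedral subgroup: order 2N = 8
```

The closure test confirms that C4 is a subgroup of D4 and that the reflection is an involution, matching the semi-direct product structure D_N ≅ C_N ⋊ ({±1}, ∗).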
The data set used in this work is based on the catalogue of Miraghaei & Best (2017), who used a parent galaxy sample taken from Best & Heckman (2012) that cross-matched the Sloan Digital Sky Survey (SDSS; York et al. 2000) data release 7 (DR7; Abazajian et al. 2009) with the Northern VLA Sky Survey (NVSS; Condon et al. 1998) and the Faint Images of the Radio Sky at Twenty centimetres (FIRST; Becker et al. 1995). From the parent sample, sources were visually classified by Miraghaei & Best (2017) using the original morphological definition provided by Fanaroff & Riley (1974): galaxies which had their most luminous regions separated by less than half of the radio source's extent were classed as FRI, and those which were separated by more than half of this were classed as FRII. Where the determination of this separation was complicated by either the limited resolution of the FIRST survey or by its poor sensitivity to low surface brightness emission, the human subjectivity in this calculation was indicated by the source classification being denoted as "Uncertain", rather than "Confident". Galaxies were then further classified into morphological sub-types via visual inspection. Any sources which showed FRI-like behaviour on one half of the source and FRII-like behaviour on the other were deemed to be hybrid sources. Each object within the catalogue of Miraghaei & Best (2017) was given a three-digit classification identifier to allow images to be separated into different subsets. Images were classified by FR class, confidence of classification, and morphological sub-type. These are summarised in Table 1. For example, a radio galaxy that was confidently classified as an FRI type source with a wide-angle tail morphology would be denoted 102.
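As a hedged sketch, the three-digit identifier can be decoded mechanically. The digit-to-class mappings below follow Table 1 as read here together with the worked example above (a confident FRI source with a wide-angle tail is denoted 102), and are illustrative only:

```python
# Digit assignments per Table 1 (as read here): first digit = FR class,
# second digit = classification confidence, third digit = morphology.
FR_CLASS = {"1": "FRI", "2": "FRII", "3": "Hybrid", "4": "Unclassifiable"}
CONFIDENCE = {"0": "Confident", "1": "Uncertain"}
MORPHOLOGY = {"0": "Standard", "1": "Double-double", "2": "Wide-angle Tail",
              "3": "Diffuse", "4": "Head-tail"}

def decode(identifier):
    """Split a three-digit identifier into (FR class, confidence, sub-type)."""
    d1, d2, d3 = identifier
    return FR_CLASS[d1], CONFIDENCE[d2], MORPHOLOGY[d3]
```

For instance, decode("102") recovers the example in the text: a Confident FRI source with a Wide-angle Tail morphology.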
Digit 1              Digit 2         Digit 3
1 - FRI              0 - Confident   0 - Standard
2 - FRII             1 - Uncertain   1 - Double-double
3 - Hybrid                           2 - Wide-angle Tail
4 - Unclassifiable                   3 - Diffuse
                                     4 - Head-tail

Table 1. Numerical identifiers from the catalogue of Miraghaei & Best (2017).
We note that not all combinations of the three digits described in Table 1 are present in the catalogue, as some morphological classes are dependent on the parent FR class: only FRI type objects are sub-classified into head-tail or wide-angle tail, and only FRII type objects are sub-classified as double-double. Hybrid FR sources are not considered to have any non-standard morphologies, as their standard morphology is inherently inconsistent between sources. Confidently classified objects outnumber their uncertain counterparts across all classes, and in classes that have few examples there may be no uncertain sources present. This is particularly apparent for non-standard morphologies. From the full catalogue of 1329 labelled objects, 73 were excluded from the machine learning data set. These include (i) the 40 objects denoted as unclassifiable, (ii) 28 objects which had an angular extent greater than a selected image size of 150 × 150 pixels, (iii) 4 objects with structure that was found to overlap the edge of the sky area covered by the FIRST survey, and (iv) the single object in 3-digit category 103. This final object was excluded as a minimum of two examples from each class is required for the data set: one for the training set and one for the test set. Following these exclusions, 1256 objects remain, which we refer to as the MiraBest data set and summarise in Table 2. All images in the
MiraBest data set are subjected to a similar data pre-processing as other radio galaxy deep learning data sets in the literature (see e.g. Aniyan & Thorat 2017; Tang et al. 2019). FITS images for each object are extracted from the FIRST survey data using the SkyView service (McGlynn et al. 1998) and the astroquery library (Ginsburg et al. 2019). These images are then processed in four stages before data augmentation is applied: firstly, image pixel values are set to zero if their value is below a threshold of three times the local rms noise; secondly, the image size is clipped to 150 by 150 pixels, i.e. 270″ by 270″ for FIRST, where each pixel corresponds to 1.8″; thirdly, all pixels outside a square central region with extent equal to the largest angular size of the radio galaxy are set to zero. This helps to eliminate secondary background sources in the field and is possible for the MiraBest data set due to the inclusion of this parameter in the catalogue of Miraghaei & Best (2017). Finally, the image is normalised as:

Output = 255 × (Input − min(Input)) / (max(Input) − min(Input)),   (4)

where 'Output' is the normalised image, 'Input' is the original image, and 'min' and 'max' are functions which return the single minimal and maximal values of their inputs, respectively. Images are saved to PNG format and accumulated into a PyTorch batched data set. For this work we extract the objects labelled as Fanaroff-Riley Class I (FRI) and Fanaroff-Riley Class II (FRII; Fanaroff & Riley 1974) radio galaxies with classifications denoted as Confident (as
The MiraBest data set is available on Zenodo: 10.5281/zenodo.4288837
Figure 1. Illustration of the C4 and D4 groups for an example radio galaxy postage stamp image with 50 × 50 pixels. The members of the C4 group are each rotated by π/2, resulting in a group order |C4| = 4. The members of the D4 group are each rotated by π/2 and reflected about x = 0, resulting in a group order |D4| = 8.

Class    No.   Confidence   Morphology           No.   MiraBest Label
FRI      591   Confident    Standard             339   0
                            Wide-Angle Tailed     49   1
                            Head-Tail              9   2
               Uncertain    Standard             191   3
                            Wide-Angle Tailed      3   4
FRII     631   Confident    Standard             432   5
                            Double-Double          4   6
               Uncertain    Standard             195   7
Hybrid    34   Confident    NA                    19   8
               Uncertain    NA                    15   9

Table 2. MiraBest data set summary. The original data set labels (MiraBest Label) are shown in relation to the labels used in this work (Label). Hybrid sources are not included in this work, and therefore have no label assigned to them.

opposed to Uncertain). We exclude the objects classified as Hybrid and do not employ sub-classifications. This creates a binary classification data set with target classes FRI and FRII. We denote the subset of the full
MiraBest data set used in this work as MiraBest∗. The MiraBest∗ data set has pre-specified training and test data partitions, and the number of objects in each of these partitions is shown in Table 3, along with the equivalent partitions for the full MiraBest data set. In this work we subdivide the MiraBest∗ training partition into training and validation sets using an 80:20 split. The test partition is reserved for deriving the performance metrics presented in Section 5.2. To accelerate convergence, we further normalise individual data samples from the data set by shifting and scaling as a function of the mean and variance, both calculated from the full training set (LeCun et al. 2012) and listed in Table 3. Data augmentation is performed during training and validation for all models using random rotations from 0 to 360 degrees. This is standard practice for augmentation and is also consistent with the 𝐺-steerable CNN training implementations of Weiler & Cesa (2019), who included rotational augmentation for their own tests in order to not disadvantage models with lower levels of equivariance. To avoid issues arising from samples where the structure of the radio source overlaps the edge of the field and is artificially truncated in some orientations during augmentation, but
Table 3.
Data used in this work. The table shows the number of objects of each class that are provided in the training and test partitions for the MiraBest data set, containing sources labelled as both Confident and Uncertain, and the MiraBest∗ data set, containing only objects labelled as Confident, as well as the mean and standard deviation of the training sets in each case.

             Train          Test
Data         FRI    FRII    FRI   FRII   μ        σ
MiraBest     517    552     74    79     0.0031   0.0352
MiraBest∗    348    381     49    55     0.0031   0.0350

not in others, we apply a circular mask to each sample image, setting all pixels to zero outside a radial distance of 75 pixels from the centre. An example data sample is shown in Figure 1, where it is used to illustrate the corresponding C4 and D4 groups. As noted by Weiler & Cesa (2019), for signals digitised on a pixel grid, exact equivariance is not possible for groups that are not symmetries of the grid itself; in this case only subgroups of D4 will be exact symmetries, with all other subgroups requiring interpolation to be employed (Dieleman et al. 2016).

Table 4.
The LeNet5-style network architecture used for all the models in this work. 𝐺-steerable implementations include the additional steps marked "G-steerable only" and replace the convolutional layers with the appropriate group-equivariant equivalent in each case. Column [1] lists the operation of each layer in the network; Column [2] lists the kernel size in pixels for each layer, where appropriate; Column [3] lists the number of output channels from each layer; Column [4] denotes the degree of zero-padding in pixels added to each edge of an image, where appropriate.

Operation                                  Kernel   Channels   Padding
Convolution                                5 × 5    6          1
ReLU
Max-pooling                                2 × 2
Convolution                                5 × 5    16         1
ReLU
Max-pooling                                2 × 2
Invariant Projection (G-steerable only)
Global Average Pool (G-steerable only)
Fully-connected                                     120
ReLU
Fully-connected                                     84
ReLU
Dropout (p = 0.5)
Fully-connected                                     2

For our architecture we use a simple LeNet-style network (LeCun et al. 1998) with two convolutional layers, followed by three fully-connected layers. Each of the convolutional layers has a ReLU activation function and is followed by a max-pooling operation. The fully-connected layers are followed by ReLU activation functions and we use a 50% dropout before the final fully-connected layer, as is standard for LeNet (Krizhevsky et al. 2012). An overview of the architecture is shown in Table 4. In what follows we refer to this base architecture using conventional convolution layers as the standard CNN and denote it {𝑒}. We also note that the term conventional CNN is used throughout the paper to refer to networks that do not employ group-equivariant convolutions, independent of architecture. For the 𝐺-steerable implementation of this network we use the e2cnn extension to the PyTorch library (Weiler & Cesa 2019) and replace the convolutional layers with their subgroup-equivariant equivalent. We also introduce two additional steps into the network in order to recast the feature data from the convolutional layers into a format suitable for the conventional fully-connected layers. These steps consist of reprojecting the feature data from a geometric tensor into standard tensor format and pooling over the group features, and are indicated in Table 4. Since the additional steps in the 𝐺-steerable implementations have no learnable parameters associated with them, the overall architecture is unchanged from that of the standard CNN; it is only the nature of the kernels in the convolutional layers that differs. For the input data we use the trivial representation, but for all subsequent steps in the 𝐺-steerable implementations we adopt the regular representation, ρ_reg. This representation is typical for describing finite groups/subgroups such as C_N and D_N.
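The interaction between the regular representation and a pointwise nonlinearity can be sketched directly: ρ_reg of a rotation in C_N cyclically permutes the |G| = N channels of a feature vector, and ReLU commutes with any such permutation. This illustrative sketch is independent of the e2cnn implementation used here:

```python
def rho_reg(r, v):
    """Regular representation of rotation r in C_N: cyclically shift the
    |G| = N channels of the feature vector v."""
    n = len(v)
    return [v[(i - r) % n] for i in range(n)]

def relu(v):
    """Pointwise nonlinearity, applied to each channel independently."""
    return [max(0.0, x) for x in v]

v = [1.5, -2.0, 0.5, -0.25]   # a C4 feature vector, one entry per rotation
lhs = relu(rho_reg(1, v))     # transform first, then nonlinearity
rhs = rho_reg(1, relu(v))     # nonlinearity first, then transform
```

Because lhs and rhs are identical, placing ReLU (or max/average pooling) between regular-representation layers preserves the equivariance of the network.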
The regular representation of a finite group 𝐺 acts on a vector space ℝ^|𝐺| by permuting its axes, where |𝐺| = N for C_N and |𝐺| = 2N for D_N; see Figure 1. This representation is helpful because its action simply permutes channels of fields and is therefore equivariant under pointwise operations such as the ReLU activation function and the max and average pooling functions (Weiler & Cesa 2019).

https://github.com/QUVA-Lab/e2cnn

We train each network over 600 epochs using a standard cross-entropy loss function and the Adam optimiser (Kingma & Ba 2014) with an initial learning rate of 10− and a weight decay of 10−. We use a scheduler to reduce the learning rate by 10% each time the validation loss fails to decrease for two consecutive epochs. We use mini-batching with a batch size of 50. No additional hyper-parameter tuning is performed. We also implement an early-stopping criterion based on validation accuracy and for each training run we save the model corresponding to this criterion.

Validation loss curves for both the standard CNN implementation, denoted {𝑒}, and the group-equivariant CNN implementations for a range of orders 𝑁 are shown in Figure 2. Curves show the mean and standard deviation for each network over five training repeats. It can be seen from Figure 2 that the standard CNN implementation achieves a significantly poorer loss than that of its group-equivariant equivalents. For both the cyclic and dihedral group-equivariant models, the best validation loss is achieved for 𝑁 =
16. Although the final loss in the case of the cyclic and dihedral-equivariant networks is not significantly different in value, it is notable that the lower order dihedral networks converge towards this value more rapidly than the equivalent order cyclic networks. We observe that higher order groups minimise the validation loss more rapidly, i.e. the initial gradient of the loss as a function of epoch is steeper, up to order 𝑁 =
16 in this case. Weiler & Cesa (2019), who also noted the same behaviour when training on the MNIST data sets, attribute it to the increased generalisation capacity of equivariant networks, since there is no significant difference in the number of learnable parameters between models. Final validation error as a function of order, 𝑁, for the group-equivariant networks is shown in Figure 3. From this figure it can be seen that all equivariant models improve upon the non-equivariant CNN baseline, {𝑒}, and that the validation error decreases before reaching a minimum for both cyclic and dihedral models at approximately 16 orientations. This behaviour is discussed further in Section 6.4.

Standard performance metrics for both the standard CNN implementation, denoted {𝑒}, and the group-equivariant CNN implementations for a range of orders 𝑁 are shown in Table 5. The metrics in this table are evaluated using the reserved test set of the MiraBest∗ data set, classified using the best-performing model according to the validation early-stopping criterion. The reserved test set is augmented by a factor of 9 using discrete rotations of 20° over the interval [0°, 180°). This augmentation is performed in order to provide metrics that reflect the performance over a consistent range of orientations. The values in the table show the mean and standard deviation for each metric over five training repeats. All 𝐺-steerable CNNs listed in this table use a regular representation for feature data and apply a 𝐺-invariant map after the convolutional layers to guarantee an invariant prediction.
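The role of the 𝐺-invariant map can be illustrated with a toy orbit pooling over C4: max-pooling an orientation-sensitive score over all four rotations yields an output that is identical for every 90°-rotated copy of the input, so the prediction cannot depend on orientation. The scoring function below is a hypothetical stand-in, not the network used in this work:

```python
def rot90(a):
    """Rotate a square 2D list by 90 degrees."""
    return [list(row) for row in zip(*a)][::-1]

def score(img):
    """A hypothetical, orientation-sensitive feature: position-weighted sum."""
    return sum((i + 2 * j) * v
               for i, row in enumerate(img)
               for j, v in enumerate(row))

def orbit_pool(img):
    """Max-pool the score over the C4 orbit -> rotation-invariant output."""
    scores = []
    for _ in range(4):
        scores.append(score(img))
        img = rot90(img)
    return max(scores)

x = [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 1]]
```

Although score(x) changes when x is rotated, orbit_pool(x) is the same for every rotation of x, because the C4 orbit of the image is the same set of four arrays regardless of which member one starts from.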
Figure 2.
Validation losses during the training of the standard CNN, denoted {𝑒}, and (i) C_N-equivariant models for the MiraBest∗ data set (left), and (ii) D_N-equivariant models for the MiraBest∗ data set (right). Plots show the mean and standard deviation over five training repeats. Curves are smoothed over 20 epochs to eliminate small-scale variability.

Table 5.
Performance metrics for classification of the MiraBest∗ data set using the standard CNN ({e}) and G-steerable CNNs for different cyclic and dihedral subgroups of the E(2) Euclidean group. All G-steerable CNNs use a regular representation for feature data and apply a G-invariant map after the convolutions to guarantee an invariant prediction. For each model ({e}, and the C_N and D_N models at each order considered) the table lists the FRI precision, recall and F1-score, the FRII precision, recall and F1-score, and the overall MiraBest∗ accuracy [%], quoted as mean ± standard deviation over five training repeats.

Figure 3.
Validation errors of C_N and D_N regular steerable CNNs for different orders, N, for the MiraBest∗ data set. All equivariant models improve upon the non-equivariant CNN baseline, {e}.

From Table 5, it can be seen that the best test accuracy is achieved by the D_16 model, highlighted in bold. Indeed, while all equivariant models perform better than the standard CNN, the performance of the dihedral models is consistently better than that of the cyclic models of equivalent order.

For the cyclic models it can be observed that the largest change in performance comes from an increased FRI recall. For a binary classification problem, the recall of a class is defined as

Recall = TP / (TP + FN),   (5)

where TP indicates the number of true positives and FN indicates the number of false negatives. The recall therefore represents the fraction of all objects in that class which are correctly classified. Equivalently, the precision of the class is defined as

Precision = TP / (TP + FP),   (6)

where FP indicates the number of false positives. Consequently, if the recall of one class increases at the expense of the precision of the opposing class then it indicates that the opposing class is being disproportionately misclassified. However, in this case we can observe from Table 5 that the precision of the FRII class is also

Figure 4.
Average number of misclassifications for FRI (cyan) and FRII (grey) over all orientations and training repeats for the standard CNN, denoted {e}, the C_16 CNN and the D_16 CNN; see Section 5.2 for details.

increasing, suggesting that the improvement in performance is due to a smaller number of FRI objects being misclassified as FRII. For the cyclic models there is a smaller, but not equivalent, improvement in FRII recall. This suggests that the cyclic model primarily reduces the misclassification of FRI objects as FRII, but does not equivalently reduce the misclassification of FRII as FRI.

The dihedral models show a more even distribution of improvement across all metrics, indicating that there are more balanced reductions across both FRI and FRII misclassifications. This is illustrated in Figure 4, which shows the average number of misclassifications over all orientations and training repeats for the standard CNN, the C_16 CNN and the D_16 CNN for the reserved test set.

The test partition of the full
MiraBest data set contains 153 FRI and FRII-type sources labelled as both Confident and Uncertain, see Table 3. When using this combined test set the overall performance metrics of the networks considered in this work become accordingly lower due to the inclusion of the Uncertain sources. This is expected, not only because the Uncertain samples include edge cases that are more difficult to classify, but also because the assigned labels for these objects may not be fully accurate. However, the relative performance shows the same degree of improvement between the standard CNN, {e}, and the D_16 model, which have percentage accuracies of 82.±0.41 and 85.±0.35, respectively, when evaluated against this combined test set.

We note that given the comparatively small size of the
MiraBest∗ training set, these results may not generalise equivalently to other, potentially larger, data sets with different selection specifications, and that additional validation should be performed when considering the use of group-equivariant convolutions for other classification problems.

Target class predictions for each test data sample are made by selecting the highest softmax probability, which provides a normalised version of the network output values. By using dropout as a Bayesian approximation, as demonstrated in Gal & Ghahramani (2015), one is able to obtain a posterior distribution of network outputs for each test sample. This posterior distribution allows one to assess the degree of certainty with which a prediction is being made, i.e. if the distribution of outputs for a particular class is well-separated from those of other classes then the input is being classified with high confidence; however, if the distribution of outputs intersects those of other classes then, even though the softmax probability for a particular realisation may be high (even as high as unity), the overall distribution of softmax probabilities for that class may still fill the entire [0, 1] range, overlapping significantly with the distributions from other target classes. Such a circumstance denotes a low degree of model certainty in the softmax probability and therefore in the class prediction for that particular test sample.

By re-enabling the dropout before the final fully-connected layer at test time, we estimate the predictive uncertainty of each model for the data samples in the reserved MiraBest∗ test set. With dropout enabled, we perform T = 50 forward passes through the trained network for each sample in the test set. On each pass we recover (x_t, y_t), where x and y are the softmax probabilities of FRI and FRII, respectively. An example of the results from this process can be seen in Figure 5, where we evaluate the trained model on a rotated version of the input image at discrete intervals of 20° in the range [0°, 180°), using a trained model for the standard CNN (left panel) and for the D_16-equivariant CNN (right panel). For each rotation angle, a distribution of softmax probabilities is obtained. In the case of the standard CNN it can be seen that, although the model classifies the source with high confidence when it is unrotated (0°), the softmax probability distributions are not well-separated for the central image orientations, indicating that the model has a lower degree of confidence in the predictions being made at these orientations. For the D_16-equivariant CNN it can be seen that in this particular test case the model has a high degree of confidence in its prediction for all orientations of the image.

To represent the degree of uncertainty for each test sample quantitatively, we evaluate the degree of overlap in the distributions of softmax probabilities at a particular rotation angle using the distribution-free overlap index (Pastore & Calcagnì 2019). To do this, we calculate the local densities at position z for each class using a Gaussian kernel density estimator, such that

f_x(z) = (1/T) Σ_{t=1}^{T} [1/(β√(2π))] exp[−(z − x_t)² / (2β²)],   (7)

f_y(z) = (1/T) Σ_{t=1}^{T} [1/(β√(2π))] exp[−(z − y_t)² / (2β²)],   (8)

where β = 0.1. We then use these local densities to calculate the overlap index, η, such that

η = Σ_{i=1}^{N_z} min[f_x(z_i), f_y(z_i)] δz,   (9)

where {z_i}_{i=1}^{N_z} covers the range zero to one in N_z steps of size δz. The overlap index, η, varies between zero and one, with larger values indicating a higher degree of overlap and hence a lower degree of confidence.

For each test sample we evaluate the overlap index over a range of rotations from 0° to 180° in increments of 20°. We then calculate the average overlap index, ⟨η⟩, across these nine rotations. In Figure 5 the value of this index can be seen above each plot: in this case, the standard CNN has ⟨η⟩_{e} = 0.30 and the D_16-equivariant CNN has ⟨η⟩_{D_16} < 0.01.
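Equations (7)–(9) are straightforward to implement. A minimal numpy sketch follows; the function name, the grid size N_z = 1000 and the example samples are assumptions for illustration, not the authors' code:

```python
import numpy as np

def overlap_index(x, y, beta=0.1, n_z=1000):
    """Distribution-free overlap index, eta (Pastore & Calcagni 2019).

    x, y: arrays of T softmax probabilities for the two classes.
    Returns the integral over [0, 1] of min(f_x(z), f_y(z)), where f_x and
    f_y are Gaussian kernel density estimates with bandwidth beta (Eqs 7-9).
    """
    z = np.linspace(0.0, 1.0, n_z)   # grid {z_i} covering [0, 1]
    dz = z[1] - z[0]                 # step size, delta z
    norm = 1.0 / (beta * np.sqrt(2.0 * np.pi))
    f_x = norm * np.mean(np.exp(-(z[:, None] - np.asarray(x)) ** 2 / (2 * beta ** 2)), axis=1)
    f_y = norm * np.mean(np.exp(-(z[:, None] - np.asarray(y)) ** 2 / (2 * beta ** 2)), axis=1)
    return float(np.sum(np.minimum(f_x, f_y) * dz))  # Eq. (9)

rng = np.random.default_rng(0)
confident = overlap_index(0.95 + 0.01 * rng.standard_normal(50),
                          0.05 + 0.01 * rng.standard_normal(50))
uncertain = overlap_index(0.50 + 0.05 * rng.standard_normal(50),
                          0.50 + 0.05 * rng.standard_normal(50))
```

Well-separated class distributions give η close to zero (a confident prediction), while strongly overlapping distributions give η close to one.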
Figure 5.
A scatter of 50 forward passes of the softmax output for the standard CNN (left) and the D_16-equivariant CNN (right). The lower panel shows the rotated version of the test image. As indicated, the average overlap index for the standard CNN is ⟨η⟩ = 0.30, and ⟨η⟩ < 0.01 for the D_16-equivariant CNN.

A fraction of the test samples show an improvement in average model confidence, i.e. ⟨η⟩_{e} − ⟨η⟩_{D_16} > 0.01, when classified using the D_16-equivariant CNN compared to the standard CNN; 8.±.5 per cent show a deterioration in average model confidence, i.e. ⟨η⟩_{D_16} − ⟨η⟩_{e} > 0.01, and all other samples show no significant change in average model confidence, i.e. |⟨η⟩_{e} − ⟨η⟩_{D_16}| < 0.01. Mean values and uncertainties are determined from ⟨η⟩ values for all test samples evaluated using a pairwise comparison of 5 training realisations of the standard CNN and 5 training realisations of the D_16 CNN.

Those objects that show an improvement in average model confidence are approximately evenly divided between FRI and FRII type objects, whereas the objects that show a reduction in model confidence exhibit a weak preference for FRII. These results are discussed further in Section 6.1.
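The MC Dropout procedure itself can be illustrated schematically. The toy two-layer network below is a stand-in for the trained CNNs used in this work; the weights, layer sizes and dropout rate are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))   # toy hidden-layer weights
W2 = rng.standard_normal((8, 2))    # toy output layer: 2 classes (FRI, FRII)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def mc_dropout_passes(x, T=50, p_drop=0.5):
    """T stochastic forward passes with dropout left enabled at test time."""
    probs = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)           # ReLU hidden layer
        mask = rng.random(h.shape) > p_drop   # fresh Bernoulli dropout mask per pass
        h = h * mask / (1.0 - p_drop)         # inverted-dropout scaling
        probs.append(softmax(h @ W2))
    return np.array(probs)                    # shape (T, 2)

x = rng.standard_normal(16)   # a stand-in for an input sample's features
samples = mc_dropout_passes(x)
print(samples.shape)          # (50, 2): one (x_t, y_t) pair per forward pass
```

The spread of the 50 rows of `samples` plays the role of the posterior distribution of softmax outputs from which η is computed.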
Mathematically, G-steerable CNNs classify equivalence classes of images, as defined by the equivalence relation of a particular group, G, whereas conventional CNNs classify equivalence classes defined only by translations. Consequently, by using E(2)-equivariant convolutions the trained models assume that the statistics of extra-galactic astronomical images containing individual objects are invariant not only to translations but also to global rotations and reflections. Here we briefly review the literature in order to consider whether this assumption is robust and highlight the limitations that may result from it.

The orientation of radio galaxies, as defined by the direction of their jets, is thought to be determined by the angular momentum axis of the super-massive black hole within the host galaxy. A number of studies have looked for evidence of preferred jet alignment directions in populations of radio galaxies, as this has been proposed to be a potential consequence of angular momentum transfer during galaxy formation (e.g. White 1984; Codis et al. 2018; Kraljic et al. 2020); alternatively, it could be caused by large-scale filamentary structures in the cosmic web giving rise to preferential merger directions (see e.g. Kartaltepe et al. 2008) that might result in jet alignment for radio galaxies formed during mergers (e.g. Croton et al. 2006; Chiaberge et al. 2015). The observational evidence for both remains a subject of discussion in the literature.

Taylor & Jagannathan (2016) found a local alignment of radio galaxies in the ELAIS N1 field on scales < ° using observations from the Giant Metrewave Radio Telescope (GMRT) at 610 MHz. Local alignments were also reported by Contigiani et al. (2017), who found evidence (> σ) of local alignment on scales of ∼ ° among radio sources from the FIRST survey using a much larger sample of radio galaxies, catalogued by the Radio Galaxy Zoo project. A similar local alignment was also reported by Panwar et al. (2020) using data from the FIRST survey. Using a sample of 7555 double-lobed radio galaxies from the LOFAR Two-metre Sky Survey (LoTSS; Shimwell et al. 2019) at 150 MHz, Osinga et al. (2020) concluded that a statistical deviation from purely random distributions of orientation as a function of projected distance was caused by systematics introduced by the brightest objects and did not persist when redshift information was taken into account. However, that study also suggested that larger samples of radio galaxies should be used to confirm the result.

Whilst these results may suggest tentative evidence for spatial correlations of radio galaxy orientations in local large-scale structure, they do not provide any information on whether these orientations differ between classes of radio galaxy, i.e. the equivalence classes considered here. Moreover, the large spatial distribution and comparatively small number of galaxies that form the training set used in this work mean that even spatial correlation effects would be unlikely to be significant for the data set used here. However, the results of Taylor & Jagannathan (2016), Contigiani et al. (2017) and Panwar et al. (2020) suggest that care should be taken over this assumption if data sets are compiled from only small spatial regions.

In Section 5.1, we found that the largest improvement in performance was seen when using dihedral, D_N, models. We suggest that this improvement over cyclic, C_N, models is due to image reflections accounting for chirality, in addition to orientations on the celestial sphere, which are represented by the cyclic group. Galactic chirality has previously been considered for populations of star-forming, or normal, galaxies (see e.g. Slosar et al. 2009; Shamir 2020), as the spiral structure of star-forming galaxies means that such objects can be considered to be enantiomers, i.e. their mirror images are not superimposable (Capozziello & Lattanzi 2005). It has been suggested that a small asymmetry exists in the number of clockwise versus anti-clockwise star-forming galaxy spins (Shamir 2020). As far as the authors are aware there have been no similar studies considering the chirality of radio galaxies. However, a simple example of such chirality for radio galaxies might be the case where relativistic boosting causes one jet of a radio galaxy to appear brighter than the other due to an inclination relative to the line of sight. Since no particular orientation relative to the line of sight should dominate across the population, this would imply a global equivariance to reflection. Since the dihedral (D_N) models used in this work are insensitive to chirality, the results in Section 5.1 suggest that the radio galaxies in the training sample used here do not have a significant degree of preferred chirality. Whilst this does not itself validate the assumption of global reflection invariance, in the absence of evidence to the contrary from the literature we suggest that it is unlikely to be significant for the data sample used in this work.

From the perspective of classification, equivariance to reflections implies that inference should be independent of reflections of the input.
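The requirement that inference be independent of reflections (and rotations) can be made concrete with a G-invariant map: pooling any feature over the orbit of the dihedral group D4 yields an output that is exactly invariant to 90° rotations and flips of the input. A minimal numpy sketch of this idea (not the e2cnn implementation used in this work; the feature and weights are arbitrary):

```python
import numpy as np

def d4_orbit(img):
    """The 8 transforms of the dihedral group D4: rotations and reflected rotations."""
    flipped = np.fliplr(img)
    return [np.rot90(img, k) for k in range(4)] + [np.rot90(flipped, k) for k in range(4)]

def invariant_feature(img, w):
    """Pool an arbitrary (non-invariant) linear feature over the D4 orbit."""
    return np.mean([np.sum(g * w) for g in d4_orbit(img)])

rng = np.random.default_rng(0)
img = rng.random((32, 32))
w = rng.standard_normal((32, 32))   # arbitrary fixed weights

f = invariant_feature(img, w)
# Rotating or reflecting the input only permutes the group orbit, so the
# pooled feature is unchanged:
assert np.isclose(f, invariant_feature(np.rot90(img), w))
assert np.isclose(f, invariant_feature(np.fliplr(img), w))
```

The same reasoning underlies the G-invariant map applied after the convolutional layers of the models in Table 5.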
For FRI and FRII radio galaxy classification, incorporating such information into a classification scheme may be important more generally: the unified picture of radio galaxies holds that both FRI and FRII, as well as many other classifications of active galactic nuclei (AGN) such as quasars, QSOs (quasi-stellar objects), blazars, BL Lac objects, Seyfert galaxies, etc., are in fact defined by orientation-dependent observational differences, rather than intrinsic physical distinctions (Urry 2004).

Consequently, under the assumptions of global rotational and reflection invariance, the possibility of a classification model providing different output classifications for the same test sample at different orientations is problematic. Furthermore, the degree of model confidence in a classification should also not vary significantly as a function of sample orientation, i.e. if a galaxy is confidently classified at one particular orientation then it should be approximately equally confidently classified at all other orientations. If this is not the case, as shown for the standard CNN in Figure 5 (left), then it indicates a preferred orientation in the model weights for a given outcome, inconsistent with the expected statistics of the true source population. Such inconsistencies might be expected to result in biased samples being extracted from survey data.

In this context it is then not only the average degree of model confidence that is important as a function of sample rotation, as quantified by the value of ⟨η⟩ in Section 5.3, but also the stability of the η index as a function of rotation, i.e. a particular test sample should be classified at a consistent degree of confidence as a function of orientation, whether that confidence is low or high. To evaluate the stability of the predictive confidence as a function of orientation, we examine the variance of the η index as a function of rotation.
For the MiraBest∗ reserved test set we find that approximately 30% of the test samples show a reduction of more than 0.01 in the standard deviation of their overlap index as a function of rotation, with 17% showing a reduction of more than 0.05. Conversely, approximately 8% of test samples show an increase of > 0.01 and 4% show an increase of > 0.05. In a similar manner to the results for average model confidence given in Section 5.3, those objects that show a reduction in their variance, i.e. an improvement in the consistency of prediction as a function of rotation, are evenly balanced between the two classes; however, those objects showing a strong improvement of > 0.05 are preferentially FRI type objects.
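The stability comparison described above can be sketched as follows, using hypothetical per-angle overlap indices for a single test sample; the η values below are invented for illustration and are not measurements from this work:

```python
import numpy as np

# Hypothetical overlap indices at the nine rotation angles (0-160 deg in
# 20 deg steps) for one test sample, for a standard CNN and an equivariant CNN.
eta_standard = np.array([0.05, 0.31, 0.62, 0.55, 0.48, 0.59, 0.44, 0.28, 0.12])
eta_d16 = np.array([0.03, 0.04, 0.05, 0.04, 0.03, 0.05, 0.04, 0.03, 0.04])

# Average confidence <eta> and its stability (standard deviation over rotation):
mean_eta = eta_standard.mean()
stability_change = eta_standard.std() - eta_d16.std()

# A reduction of more than 0.01 in the standard deviation of eta counts as an
# improvement in the consistency of prediction with orientation.
print(stability_change > 0.01)
```

Here the equivariant model's η is both smaller on average and far more stable with rotation, which is the pattern reported for the majority of test samples.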
The use of capsule networks (Sabour et al. 2017) for radio galaxy classification was investigated by Lukic et al. (2018). Capsule networks aim to separate the orientation (typically referred to as the viewpoint or pose in the context of capsule networks) of an object from its nature, i.e. class, by encoding the output of their layers as tuples incorporating both a pose vector and an activation. The purpose of this approach is to focus on the linear hierarchical relationships in the data and remove sensitivity to orientation; however, as described by Lenssen et al. (2018), general capsule networks do not guarantee particular group equivariances and therefore cannot completely disentangle orientation from feature data. It is perhaps partly for this reason that Lukic et al. (2018) found that capsule networks offered no significant advantage over standard CNNs for the radio galaxy classification problem addressed in that work.

In Section 5, we found that not only is the test performance improved by the use of equivariant CNNs, but also that equivariant networks converge more rapidly. For image data, a standard CNN enables generalisation over classes of translated images, which provides an advantage over the use of an MLP, where every image must be considered individually. G-steerable CNNs extend this behaviour to include additional equivalences, further improving generalisation. This additional equivariance enhances the data efficiency of the learning algorithm because every image is no longer an individual data point but instead a representative of its wider equivalence group. Consequently, unlike capsule networks, the equivalence groups being classified by a G-steerable CNN are specified a priori, rather than the orientations of individual samples being learned during training.

Whilst this creates additional capacity in the network for learning intra-class differences that are insensitive to the specified equivalences, it does not provide the information on the orientation of individual samples that is provided as an output by capsule networks. Lenssen et al. (2018) combined group-equivariant convolutions with capsule networks in order to output information on both classification and pose, although they note that a limitation of this combined approach is that arbitrary pose information is no longer available, but is instead limited to the elements of the equivariant group. For radio astronomy, where radio galaxy orientations are expected to be extracted from images at a precision that is limited by the observational constraints of the data, it is unlikely that pose information limited to the elements of a low-order finite group, G < E(2), is sufficient for further analysis. However, given particular sets of observational and physical constraints or specifications it is possible that such an approach may become useful at some limiting order. Alternatively, pose information might be used to specify a prior for a secondary processing step that refines a measurement of orientation.

By design, the final features used for classification in equivariant CNNs do not include any information about the global orientation or chirality of an input image; however, this can also mean that they are insensitive to local equivariances in the image, when these might in fact be useful for classification. The hierarchical nature of convolutional networks can be used to mitigate against this, as kernels corresponding to earlier layers in a network have a smaller, more local, footprint on the input image and are therefore sensitive to a different scale of feature than those from deeper layers, which encompass larger-scale information. Therefore, by changing the degree of equivariance as a function of layer depth one can control the degree to which local equivariance is enforced.
Weiler & Cesa (2019) refer to this practice as group restriction and find that it is beneficial when classifying data sets that possess symmetries on a local scale but not on a global scale, such as the CIFAR and unrotated MNIST data sets. Conversely, the opposite situation may also be true, where
Figure 6.
Validation losses during the training of the standard CNN, denoted {e} (blue), the D_16 CNN (orange), and the restricted D_N|{e} CNN (green; dashed) for the MiraBest∗ data set. Plots show mean and standard deviation over five training repeats.

no symmetry is present on a local scale, but the data are statistically invariant on a global scale. In this case the reverse may be done and, rather than restricting the representation of the feature data to reduce the degree of equivariance, one might expand the domain of the representation at a particular layer depth in order to reflect a global equivariance.

We investigate the effect of group restriction by using a D_N|{e} restricted version of the LeNet architecture, i.e. the first layer is D_N-equivariant and the second convolutional layer is a standard convolution. Using N = 16, the loss curve for this restricted architecture relative to the unrestricted D_16 equivariant CNN is shown in Figure 6. From the figure it can be seen that while exploiting local symmetries gives an improved performance over the standard CNN, the performance of the group-restricted model is significantly poorer than that of the full D_16 CNN. This result suggests that although local symmetries are present in the data, it is the global symmetries of the population that result in the larger performance gain for the radio galaxy data set.
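The equivariance property that underlies both the restricted and unrestricted models can be demonstrated directly: a C4 "lifting" correlation applies four rotated copies of a kernel, and rotating the input by 90° then cyclically permutes (and spatially rotates) the four orientation channels rather than producing unrelated features. A small numpy/scipy sketch, independent of the e2cnn library used in this work:

```python
import numpy as np
from scipy.signal import correlate2d

def c4_lifting_correlation(img, kernel):
    """Correlate `img` with the four 90-degree rotations of `kernel`.

    Returns an array of shape (4, H', W'): one orientation channel per
    element of the cyclic group C4.
    """
    return np.stack([correlate2d(img, np.rot90(kernel, k), mode="valid")
                     for k in range(4)])

rng = np.random.default_rng(0)
img = rng.random((8, 8))
kernel = rng.random((3, 3))

out = c4_lifting_correlation(img, kernel)
out_rot = c4_lifting_correlation(np.rot90(img), kernel)

# Equivariance: rotating the input rotates each feature map and cyclically
# shifts the orientation channels.
for k in range(4):
    assert np.allclose(out_rot[k], np.rot90(out[(k - 1) % 4]))
```

A G-invariant prediction is then obtained by pooling over the orientation channels, which is the role of the G-invariant map applied after the convolutional layers in the models above.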
In Section 5 we found that the N = 16 cyclic and dihedral models were preferred over the higher-order N = 20 models. This may seem counter-intuitive, as one might assume that for truly rotationally invariant data sets the performance would converge to a limiting value as the order increased, rather than finding a minimum at some discrete point. However, we note that the observed minimum at N = 16 might not represent a true property of the data set but instead a limitation caused by discretisation artefacts from rotation of convolution kernels with small support, in this case k = 5, see Table 4 (Weiler & Cesa 2019). These same discretisation errors may also account in part for the small oscillation in validation error as a function of group order seen in Figure 3. Consequently, while no additional hyper-parameter tuning has been performed for any of the networks used in this work, we note that kernel size is potentially one hyper-parameter that could be tuned as a function of group order, N, and that such tuning might lead to further improvements in performance at higher orders.

In this work, we have demonstrated that the use of even low-order group-equivariant convolutions results in a performance improvement over standard convolutions for the radio galaxy classification problem considered here, without additional hyper-parameter tuning. We have shown that both cyclic and dihedral equivariant models converge to lower validation loss values during training and provide improved validation errors. We attribute this improvement to the increased capacity of the equivariant networks for learning hierarchical features specific to classification, when additional capacity for encoding redundant feature information at multiple orientations is no longer required, hence reducing intra-class variability.

We have shown that for the simple network architecture and training set considered here, a D_16 equivariant model results in the best test performance using a reserved test set. We suggest that the improvement of the dihedral over the cyclic models is due to an insensitivity to, and therefore lack of preferred, chirality in the data, and that further improvements in performance might be gained from tuning the size of the kernels in the convolutional layers according to the order of the equivalence group.
We find that cyclic models predominantly reduce the misclassification of FRI type radio galaxies, whereas dihedral models reduce misclassifications for both FRI and FRII type galaxies.

By using the MC Dropout Bayesian approximation method, we have shown that the improved performance observed for the D_16 model compared to the standard CNN is reflected in the model confidence as a function of rotation. Using the reserved test set, we have quantified this difference in confidence using the overlap between the predictive probability distributions of different target classes, as encapsulated in the distribution-free overlap index parameter, η. We find that not only is average model confidence improved when using the equivariant model, but also that the consistency of model confidence as a function of image orientation is improved. We emphasise the importance of such consistency for applications of CNN-based classification in order to avoid biases in samples being extracted from future survey data.

Whilst the results presented here are encouraging, we note that this work addresses a specific classification problem in radio astronomy and the method used here may not result in equivalent improvements when applied to other areas of astronomical image classification using different data sets or network architectures. In particular, the assumptions of global rotational and reflectional invariance are strong assumptions, which may not apply to all data sets. As described in Section 6.1, data sets extracted from localised regions of the sky may be particularly vulnerable to biases when using this method, and the properties of the MiraBest∗ data set used in this work may not generalise to all other data sets or classification problems.

We note that this is true for all CNNs benchmarked against finite data sets, and users should be aware that additional validation should be performed before models are deployed on new test data, as biases arising from data selection may be reflected in biases in classifier performance (see e.g. Wu et al. 2018; Tang 2019; Walmsley et al. 2020). However, in conclusion, we echo the expectation of Weiler & Cesa (2019) that equivariant CNNs may soon become a common choice for morphological classification in fields like astronomy, where symmetries may be present in the data, and note that the overhead in constructing such networks is now minimal due to the emergence of standardised libraries such as e2cnn. Future work will need to address the optimal architectures and hyper-parameter choices for such models as specific applications evolve.

ACKNOWLEDGEMENTS
This work makes extensive use of the e2cnn library: github.com/QUVA-Lab/e2cnn. AMS gratefully acknowledges support from an Alan Turing Institute AI Fellowship EP/V030302/1. FP gratefully acknowledges support from STFC and IBM through the iCASE studentship ST/P006795/1.
DATA AVAILABILITY
Code for this work is publicly available on GitHub at the following address: github.com/as595/E2CNNRadGal. The MiraBest data set is available on Zenodo: 10.5281/zenodo.4288837.
REFERENCES
Abazajian K. N., et al., 2009, ApJS, 182, 543
Alger M. J., et al., 2018, MNRAS, 478, 5547
Aniyan A. K., Thorat K., 2017, ApJS
Banfield J. K., et al., 2015, MNRAS, 453, 2326
Beardsley A. P., et al., 2019, Publ. Astron. Soc. Australia
Becker R. H., White R. L., Helfand D. J., 1995, ApJ, 450, 559
Best P. N., Heckman T. M., 2012, MNRAS, 421, 1569
Bowles M., Scaife A. M. M., Porter F., Tang H., Bastien D. J., 2020, arXiv e-prints, p. arXiv:2012.01248
Capozziello S., Lattanzi A., 2005, arXiv e-prints, p. physics/0509144
Chiaberge M., Gilli R., Lotz J. M., Norman C., 2015, ApJ, 806, 147
Codis S., Jindal A., Chisari N. E., Vibert D., Dubois Y., Pichon C., Devriendt J., 2018, MNRAS, 481, 4753
Cohen T. S., Welling M., 2016, arXiv e-prints, p. arXiv:1602.07576
Condon J. J., Cotton W. D., Greisen E. W., Yin Q. F., Perley R. A., Taylor G. B., Broderick J. J., 1998, AJ, 115, 1693
Contigiani O., et al., 2017, MNRAS, 472, 636
Croton D. J., et al., 2006, MNRAS, 365, 11
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Dieleman S., De Fauw J., Kavukcuoglu K., 2016, arXiv e-prints, p. arXiv:1602.02660
Fanaroff B. L., Riley J. M., 1974, MNRAS, 167, 31P
Gal Y., Ghahramani Z., 2015, arXiv e-prints, p. arXiv:1506.02142
Gens R., Domingos P. M., 2014, in Advances in Neural Information Processing Systems. Curran Associates, Inc., pp 2537–2545
Gheller C., Vazza F., Bonafede A., 2018, MNRAS, 480, 3749
Ginsburg A., et al., 2019, AJ, 157, 98
Jarvis M. J., et al., 2016, in Proceedings of Science, doi:10.22323/1.277.0006
Johnston S., et al., 2008, Experimental Astronomy
Kartaltepe J. S., Ebeling H., Ma C. J., Donovan D., 2008, MNRAS, 389, 1240
Kingma D. P., Ba J., 2014, arXiv e-prints, p. arXiv:1412.6980
Kraljic K., Davé R., Pichon C., 2020, MNRAS, 493, 362
Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Pereira F., Burges C. J. C., Bottou L., Weinberger K. Q., eds, Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp 1097–1105
LeCun Y., Bottou L., Bengio Y., Haffner P., 1998, Proceedings of the IEEE, 86, 2278
LeCun Y., Bottou L., Orr G., Müller K., 2012, in Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, doi:10.1007/978-3-642-35289-8_3
Lenc K., Vedaldi A., 2014, arXiv e-prints, p. arXiv:1411.5908
Lenssen J. E., Fey M., Libuschewski P., 2018, arXiv e-prints, p. arXiv:1806.05086
Lukic V., Brüggen M., Banfield J. K., Wong O. I., Rudnick L., Norris R. P., Simmons B., 2018, MNRAS, 476, 246
McGlynn T., Scollick K., White N., 1998, in McLean B. J., Golombek D. A., Hayes J. J. E., Payne H. E., eds, New Horizons from Multi-Wavelength Sky Surveys
Miraghaei H., Best P. N., 2017, MNRAS, 466, 4346
Osinga E., et al., 2020, A&A, 642, A70
Panwar M., Prabhakar Sandhu P. K., Wadadekar Y., Jain P., 2020, MNRAS, 499, 1226
Pastore M., Calcagnì A., 2019, Frontiers in Psychology, 10, 1089
Ren S., He K., Girshick R., Sun J., 2015, in Advances in Neural Information Processing Systems, pp 91–99
Sabour S., Frosst N., Hinton G. E., 2017, arXiv e-prints, p. arXiv:1710.09829
Shamir L., 2020, Publ. Astron. Soc. Australia, 37, e053
Shimwell T. W., et al., 2019, A&A, 622, A1
Slosar A., et al., 2009, MNRAS, 392, 1225
Tang H., 2019, FR-DEEP, https://github.com/HongmingTang060313/FR-DEEP
Tang H., Scaife A. M. M., Leahy J. P., 2019, MNRAS, 488, 3358
Taylor A. R., Jagannathan P., 2016, MNRAS, 459, L36
Urry C., 2004, in Richards G. T., Hall P. B., eds, Astronomical Society of the Pacific Conference Series Vol. 311, AGN Physics with the Sloan Digital Sky Survey. p. 49 (arXiv:astro-ph/0312545)
Van Haarlem M. P., et al., 2013, LOFAR: The Low-Frequency Array, doi:10.1051/0004-6361/201220873
Walmsley M., et al., 2020, MNRAS, 491, 1554
Weiler M., Cesa G., 2019, arXiv e-prints, p. arXiv:1911.08251
Weiler M., Hamprecht F. A., Storath M., 2017, arXiv e-prints, p. arXiv:1711.07289
Weiler M., Geiger M., Welling M., Boomsma W., Cohen T., 2018, arXiv e-prints, p. arXiv:1807.02547
White S. D. M., 1984, ApJ, 286, 38
Wu C., et al., 2018, MNRAS, 482, 1211
York D. G., et al., 2000, AJ, 120, 1579

This paper has been typeset from a TEX/LATEX file prepared by the author.