Fanaroff-Riley classification of radio galaxies using group-equivariant convolutional neural networks
MNRAS, 1–11 (2015) Preprint 19 February 2021 Compiled using MNRAS LaTeX style file v3.0
Anna M. M. Scaife,1,2★ and Fiona Porter1
1 Jodrell Bank Centre for Astrophysics, Department of Physics & Astronomy, University of Manchester, Oxford Road, Manchester M13 9PL, UK
2 The Alan Turing Institute, Euston Road, London NW1 2DB, UK
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Weight sharing in convolutional neural networks (CNNs) ensures that their feature maps will be translation-equivariant. However, although conventional convolutions are equivariant to translation, they are not equivariant to other isometries of the input image data, such as rotation and reflection. For the classification of astronomical objects such as radio galaxies, which are expected statistically to be globally orientation invariant, this lack of dihedral equivariance means that a conventional CNN must learn explicitly to classify all rotated versions of a particular type of object individually. In this work we present the first application of group-equivariant convolutional neural networks to radio galaxy classification and explore their potential for reducing intra-class variability by preserving equivariance for the Euclidean group E(2), containing translations, rotations and reflections. For the radio galaxy classification problem considered here, we find that classification performance is modestly improved by the use of both cyclic and dihedral models without additional hyper-parameter tuning, and that a D16-equivariant model provides the best test performance. We use the Monte Carlo Dropout method as a Bayesian approximation to recover epistemic uncertainty as a function of image orientation and show that E(2)-equivariant models are able to reduce variations in model confidence as a function of rotation.

Key words: radio continuum: galaxies – methods: data analysis – techniques: image processing
★ E-mail: [email protected] (AMS)

1 INTRODUCTION

In radio astronomy, a massive increase in data volume is currently driving the increased adoption of machine learning methodologies and automation during data processing and analysis. This is largely due to the high data rates being generated by new facilities such as the Low-Frequency Array (LOFAR; van Haarlem et al. 2013), the Murchison Widefield Array (MWA; Beardsley et al. 2019), the MeerKAT telescope (Jarvis et al. 2016), and the Australian SKA Pathfinder (ASKAP) telescope (Johnston et al. 2008). For these instruments a natural solution has been to automate the data processing stages as much as possible, including the classification of sources. With the advent of such huge surveys, new automated classification algorithms have been developed to replace the "by eye" classification methods used in earlier work. In radio astronomy, morphological classification using convolutional neural networks (CNNs) and deep learning is becoming increasingly common for object classification, in particular with respect to the classification of radio galaxies. The ground work in this field was done by Aniyan & Thorat (2017), who made use of CNNs for the classification of Fanaroff-Riley (FR) type I and type II radio galaxies (Fanaroff & Riley 1974). This was followed by other works involving the use of deep learning in source classification. Examples include Lukic et al. (2018), who made use of CNNs for the classification of compact and extended radio sources from the Radio Galaxy Zoo catalogue (Banfield et al. 2015); the CLARAN (Classifying Radio Sources Automatically with a Neural
Network; Wu et al. 2018) model made use of the Faster R-CNN (Ren et al. 2015) network to identify and classify radio sources; and Alger et al. (2018) made use of an ensemble of classifiers including CNNs to perform host galaxy cross-identification. Tang et al. (2019) made use of transfer learning with CNNs to perform cross-survey classification, while Gheller et al. (2018) made use of deep learning for the detection of cosmological diffuse radio sources. Lukic et al. (2018) also performed morphological classification using a novel technique known as capsule networks (Sabour et al. 2017), although they found no specific advantage compared to traditional CNNs. Bowles et al. (2020) showed that an attention-gated CNN could be used to perform Fanaroff-Riley classification of radio galaxies with equivalent performance to other applications in the literature, but using ∼50% fewer learnable parameters.

As outlined above, the lack of rotational equivariance in conventional convolutions means that a conventional CNN must explicitly learn to classify all rotational augmentations of each image individually. This can result in CNNs learning multiple copies of the same kernel but in different orientations, an effect that is particularly notable when the data itself possesses rotational symmetry (Dieleman et al. 2016). Furthermore, while data augmentation that mimics a form of equivariance, such as image rotation, can result in a network learning approximate equivariance if it has sufficient capacity, it is not guaranteed that invariance learned on a training set will generalise equally well to a test set (Lenc & Vedaldi 2014). A variety of different equivariant networks have been developed to address this issue, each guaranteeing a particular transformation equivariance between the input data and associated feature maps. For example, in the field of galaxy classification using optical data, Dieleman et al.
(2015) enforced discrete rotational invariance through the use of a multi-branch network that concatenated the output features from multiple convolutional branches, each using a rotated version of the same data sample as its input. However, while effective, the approach of Dieleman et al. (2015) requires the convolutional layers of a network architecture, and hence the number of model weights associated with them, to be replicated 𝑁 times, where 𝑁 is the number of discrete rotations. Recently, a more efficient method of using convolutional layers that are equivariant to a particular group of transforms has been developed, which requires no replication of architecture and hence fewer learnable parameters. Explicitly enforcing an equivariance in the network model in this way not only provides a guarantee that it will generalise, but also prevents the network using parameter capacity to learn characteristic behaviour that can instead be specified a priori. First introduced by Cohen & Welling (2016), these Group equivariant Convolutional Neural Networks (G-CNNs), which preserve group equivariance through their convolutional layers, are a natural extension of conventional CNNs that ensure translational equivariance through weight sharing. Group equivariance has also been demonstrated to improve generalisation and increase performance (see e.g. Weiler et al. 2017; Weiler & Cesa 2019). In particular, steerable
G-CNNs have become an increasingly important solution to this problem, notably those steerable CNNs that describe E(2)-equivariant convolutions. The Euclidean group E(2) is the group of isometries of the plane ℝ² that contains translations, rotations and reflections. Isometries such as these are important for general image classification using convolution, as the target object in question is unlikely to appear at a fixed position and orientation in every test image. Such variations are not only highly significant for objects/images that have a preferred orientation, such as text or faces, but are also important for low-level features in nominally orientation-unbiased targets such as astrophysical objects. In principle, E(2)-equivariant CNNs will generalise over rotationally-transformed images by design, which reduces the amount of intra-class variability that they have to learn. In effect, such networks are insensitive to rotational or reflection variations and therefore learn only features that are independent of these properties. In this work we introduce the use of 𝐺-steerable CNNs to astronomical classification.
The structure of the paper is as follows: in Section 2 we describe the mathematical operation of 𝐺-steerable CNNs and define the specific Euclidean subgroups being considered in this work; in Section 3 we describe the data sets used in this work and the preprocessing steps implemented on those data; in Section 4 we describe the network architecture adopted in this work, explain how the 𝐺-steerable implementation is constructed and specify the group representations; in Section 5 we give an overview of the training outcomes, including a discussion of the convergence for different equivariance groups and the validation and test performance metrics, and introduce a novel use of the Monte Carlo Dropout method for quantitatively assessing the degree of model confidence in a test prediction as a function of image orientation; in Section 6 we discuss the validity of the assumption that radio galaxy populations are expected to be statistically rotation and reflection unbiased and review the implications of this work in that context; in Section 7 we draw our conclusions.

Group CNNs define feature spaces using feature fields 𝑓 : ℝ² → ℝ^𝑐, which associate a 𝑐-dimensional feature vector 𝑓(𝑥) ∈ ℝ^𝑐 with each point 𝑥 of an input space. Unlike conventional CNNs, the feature fields of such networks contain transformations that preserve the transformation law of a particular group or subgroup, which allows them to encode orientation information. This means that if one transforms the input data, 𝑥, by some transformation action, 𝑔 (translation, rotation, etc.), and passes it through a trained layer of the network, then the output from that layer, Φ(𝑥), must be equivalent to having passed the data through the layer and then transformed it, i.e.

Φ(T_𝑔 𝑥) = T′_𝑔 Φ(𝑥),   (1)

where T_𝑔 is the transformation for action 𝑔. In the case where the transformation is invariant rather than equivariant, i.e.
the input does not change at all when it is transformed, T′_𝑔 will be the identity matrix for all actions 𝑔 ∈ 𝐺. In the case of equivariance, T_𝑔 does not necessarily need to be equal to T′_𝑔 and instead must only fulfil the property that it is a linear representation of 𝐺, i.e. T(𝑔ℎ) = T(𝑔)T(ℎ). Cohen & Welling (2016) demonstrated that the conventional convolution operation in a network can be re-written as a group convolution:

[𝑓 ∗ 𝜙](𝑔) = Σ_{ℎ∈𝑋} Σ_𝑘 𝑓_𝑘(ℎ) 𝜙_𝑘(𝑔⁻¹ℎ),   (2)

where 𝑋 = ℝ² in layer one and 𝑋 = 𝐺 in all subsequent layers. Whilst this operation is translationally-equivariant, 𝜙 is still rotationally constrained. For E(2)-equivariance to hold more generally, the kernel itself must satisfy

𝜙(𝑔𝑥) = 𝜌_out(𝑔) 𝜙(𝑥) 𝜌_in(𝑔⁻¹)  ∀ 𝑔 ∈ 𝐺, 𝑥 ∈ ℝ²,   (3)

(Weiler et al. 2018), where 𝑔 is an action from group 𝐺, and 𝜙 : ℝ² → ℝ^{𝑐_in × 𝑐_out}, where 𝑐_in and 𝑐_out are the number of channels in the input and output data, respectively; 𝜌 is the group representation, which specifies how the channels of each feature vector mix under transformations. Kernels which fulfil this constraint are known as rotation-steerable and must be constructed from a suitable family of basis functions. As noted above, this is a linear relationship, which means that G-steerable kernels form a subspace of the convolution kernels used by conventional CNNs. For planar images the input space will be ℝ², and for single-frequency or continuum radio images these feature fields will be scalar, such that 𝑠 : ℝ² → ℝ. The group representation for scalar fields is also known as the trivial representation, 𝜌(𝑔) = 1 ∀ 𝑔 ∈ 𝐺, indicating that under a transformation there is no orientation information to preserve and that the amplitude does not change.
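As an illustrative aside, the first ("lifting") layer of a group convolution for the cyclic group C4 can be sketched with plain Python lists: rotating the input by 90° rotates every feature map and cyclically permutes the rotation channel, which is exactly the transformation law T′_𝑔 of Eq. (1). This toy check is not the implementation used in this work, only a sketch of the idea:

```python
def rot90(a):
    """Rotate a square 2D list by 90 degrees (counter-clockwise)."""
    return [list(row) for row in zip(*a)][::-1]

def rot(a, r):
    """Apply rot90 r times (the C4 rotation action)."""
    for _ in range(r % 4):
        a = rot90(a)
    return a

def correlate(f, k):
    """'Valid' cross-correlation of image f with kernel k."""
    n, m = len(f), len(k)
    return [[sum(f[i + u][j + v] * k[u][v]
                 for u in range(m) for v in range(m))
             for j in range(n - m + 1)]
            for i in range(n - m + 1)]

def lift(f, k):
    """C4 'lifting' correlation: one feature map per kernel rotation."""
    return [correlate(f, rot(k, r)) for r in range(4)]

f = [[1, 2, 0, 3, 1],
     [0, 1, 4, 2, 0],
     [2, 0, 1, 0, 5],
     [3, 1, 0, 2, 1],
     [0, 4, 2, 1, 0]]
k = [[1, 0, -1],
     [2, 0, -2],
     [1, 0, -1]]

maps = lift(f, k)             # feature maps of the original image
maps_rot = lift(rot90(f), k)  # feature maps of the rotated image
```

One can verify that maps_rot[r] equals the 90°-rotation of maps[(r − 1) mod 4] for every r: the output transforms by a rotation plus a cyclic channel shift, rather than changing arbitrarily.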
The group representation of the output space from a G-steerable convolution must be chosen by the user when designing their network architecture and can be thought of as a kind of hyper-parameter. However, whilst the representation of the input data is in some senses quite trivial for radio images, in practice convolution layers are interleaved with other operations that are sensitive to specific choices of representation. In particular, the range of non-linear activation layers permissible for a particular group or subgroup representation may be limited. Trivial representations, such as scalar fields, do not transform under rotation and therefore conventional nonlinearities like the widely used ReLU activation function are fine. Bias terms in convolution allow equivariance for group convolutions only in the case where there is a single bias parameter per group feature map (rather than per channel feature map), and likewise for batch normalisation (Cohen & Welling 2016). In this work we use the G-steerable network layers from Weiler & Cesa (2019), who define the Euclidean group as being constructed from the translation group, (ℝ², +), and the orthogonal group, O(2) = {O ∈ ℝ^{2×2} | OᵀO = id_{2×2}}, such that the Euclidean group is congruent with the semi-direct product of these two groups, E(2) ≅ (ℝ², +) ⋊ O(2). Consequently, the operations contained in the orthogonal group are those which leave the origin invariant, i.e. continuous rotations and reflections. In this work we specifically consider the cyclic subgroups of the Euclidean group with form (ℝ², +) ⋊ C_N, where C_N contains a set of discrete rotations in multiples of 2π/N, and the dihedral subgroups with form (ℝ², +) ⋊ D_N, where D_N ≅ C_N ⋊ ({±1}, ∗), which incorporates reflection about x = 0. C_N and D_N have orders N and 2N, respectively.
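The subgroup structure above can be made concrete in a few lines of plain Python: for N = 4 the rotation and reflection generators are exact integer matrices, so the orders |C4| = N = 4 and |D4| = 2N = 8 can be verified without floating point. This is an illustrative sketch, not part of the paper's pipeline:

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested tuples."""
    return tuple(tuple(sum(a[i][k] * b[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def generate(gens):
    """Close a set of 2x2 matrices under multiplication."""
    identity = ((1, 0), (0, 1))
    elems, frontier = {identity}, set(gens)
    while frontier:
        new = {matmul2(a, g) for a in elems | frontier for g in gens}
        new -= elems | frontier
        elems |= frontier
        frontier = new
    return elems

r = ((0, -1), (1, 0))   # rotation by 2*pi/4
s = ((-1, 0), (0, 1))   # reflection about the x = 0 axis

C4 = generate([r])      # cyclic subgroup: order N = 4
D4 = generate([r, s])   # dihedral subgroup: order 2N = 8
```

The closure test confirms that C4 is a subgroup of D4 and that the reflection is an involution, matching the semi-direct product structure D_N ≅ C_N ⋊ ({±1}, ∗).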
The data set used in this work is based on the catalogue of Miraghaei & Best (2017), who used a parent galaxy sample taken from Best & Heckman (2012) that cross-matched the Sloan Digital Sky Survey (SDSS; York et al. 2000) data release 7 (DR7; Abazajian et al. 2009) with the Northern VLA Sky Survey (NVSS; Condon et al. 1998) and the Faint Images of the Radio Sky at Twenty centimetres (FIRST; Becker et al. 1995). From the parent sample, sources were visually classified by Miraghaei & Best (2017) using the original morphological definition provided by Fanaroff & Riley (1974): galaxies which had their most luminous regions separated by less than half of the radio source's extent were classed as FRI, and those which were separated by more than half of this were classed as FRII. Where the determination of this separation was complicated by either the limited resolution of the FIRST survey or by its poor sensitivity to low surface brightness emission, the human subjectivity in this calculation was indicated by the source classification being denoted as "Uncertain", rather than "Confident". Galaxies were then further classified into morphological sub-types via visual inspection. Any sources which showed FRI-like behaviour on one half of the source and FRII-like behaviour on the other were deemed to be hybrid sources. Each object within the catalogue of Miraghaei & Best (2017) was given a three-digit classification identifier to allow images to be separated into different subsets. Images were classified by FR class, confidence of classification, and morphological sub-type. These are summarised in Table 1. For example, a radio galaxy that was confidently classified as an FRI type source with a wide-angle tail morphology would be denoted 102.
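As a hedged sketch, the three-digit identifier can be decoded mechanically. The digit-to-class mappings below follow Table 1 as read here together with the worked example above (a confident FRI source with a wide-angle tail is denoted 102), and are illustrative only:

```python
# Digit assignments per Table 1 (as read here): first digit = FR class,
# second digit = classification confidence, third digit = morphology.
FR_CLASS = {"1": "FRI", "2": "FRII", "3": "Hybrid", "4": "Unclassifiable"}
CONFIDENCE = {"0": "Confident", "1": "Uncertain"}
MORPHOLOGY = {"0": "Standard", "1": "Double-double", "2": "Wide-angle Tail",
              "3": "Diffuse", "4": "Head-tail"}

def decode(identifier):
    """Split a three-digit identifier into (FR class, confidence, sub-type)."""
    d1, d2, d3 = identifier
    return FR_CLASS[d1], CONFIDENCE[d2], MORPHOLOGY[d3]
```

For instance, decode("102") recovers the example in the text: a Confident FRI source with a Wide-angle Tail morphology.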
Digit 1              Digit 2         Digit 3
1 - FRI              0 - Confident   0 - Standard
2 - FRII             1 - Uncertain   1 - Double-double
3 - Hybrid                           2 - Wide-angle Tail
4 - Unclassifiable                   3 - Diffuse
                                     4 - Head-tail

Table 1. Numerical identifiers from the catalogue of Miraghaei & Best (2017).
We note that not all combinations of the three digits described in Table 1 are present in the catalogue, as some morphological classes are dependent on the parent FR class: only FRI type objects are sub-classified into head-tail or wide-angle tail, and only FRII type objects are sub-classified as double-double. Hybrid FR sources are not considered to have any non-standard morphologies, as their standard morphology is inherently inconsistent between sources. Confidently classified objects outnumber their uncertain counterparts across all classes, and in classes that have few examples there may be no uncertain sources present. This is particularly apparent for non-standard morphologies. From the full catalogue of 1329 labelled objects, 73 were excluded from the machine learning data set. These include (i) the 40 objects denoted as unclassifiable, (ii) 28 objects which had an angular extent greater than a selected image size of 150 × 150 pixels, (iii) 4 objects with structure that was found to overlap the edge of the sky area covered by the FIRST survey, and (iv) the single object in 3-digit category 103. This final object was excluded as a minimum of two examples from each class is required for the data set: one for the training set and one for the test set. Following these exclusions, 1256 objects remain, which we refer to as the MiraBest data set and summarise in Table 2. All images in the
MiraBest data set are subjected to a similar data pre-processing as other radio galaxy deep learning data sets in the literature (see e.g. Aniyan & Thorat 2017; Tang et al. 2019). FITS images for each object are extracted from the FIRST survey data using the SkyView service (McGlynn et al. 1998) and the astroquery library (Ginsburg et al. 2019). These images are then processed in four stages before data augmentation is applied: firstly, image pixel values are set to zero if their value is below a threshold of three times the local rms noise; secondly, the image size is clipped to 150 by 150 pixels, i.e. 270″ by 270″ for FIRST, where each pixel corresponds to 1.8″; thirdly, all pixels outside a square central region with extent equal to the largest angular size of the radio galaxy are set to zero. This helps to eliminate secondary background sources in the field and is possible for the MiraBest data set due to the inclusion of this parameter in the catalogue of Miraghaei & Best (2017). Finally, the image is normalised as:

Output = 255 × (Input − min(Input)) / (max(Input) − min(Input)),   (4)

where 'Output' is the normalised image, 'Input' is the original image, and 'min' and 'max' are functions which return the single minimal and maximal values of their inputs, respectively. Images are saved to PNG format and accumulated into a PyTorch batched data set. For this work we extract the objects labelled as Fanaroff-Riley Class I (FRI) and Fanaroff-Riley Class II (FRII; Fanaroff & Riley 1974) radio galaxies with classifications denoted as Confident (as
The MiraBest data set is available on Zenodo: 10.5281/zenodo.4288837
Figure 1. Illustration of the C4 and D4 groups for an example radio galaxy postage stamp image with 50 × 50 pixels. The members of the C4 group are each rotated by π/2, resulting in a group order |C4| = 4. The members of the D4 group are each rotated by π/2 and reflected about x = 0, resulting in a group order |D4| = 8.

Class    No.   Confidence   Morphology           No.   MiraBest Label
FRI      591   Confident    Standard             339   0
                            Wide-Angle Tailed     49   1
                            Head-Tail              9   2
               Uncertain    Standard             191   3
                            Wide-Angle Tailed      3   4
FRII     631   Confident    Standard             432   5
                            Double-Double          4   6
               Uncertain    Standard             195   7
Hybrid    34   Confident    NA                    19   8
               Uncertain    NA                    15   9

Table 2. MiraBest data set summary. The original data set labels (MiraBest Label) are shown in relation to the labels used in this work (Label). Hybrid sources are not included in this work, and therefore have no label assigned to them.

opposed to Uncertain). We exclude the objects classified as Hybrid and do not employ sub-classifications. This creates a binary classification data set with target classes FRI and FRII. We denote the subset of the full
MiraBest data set used in this work as MiraBest∗. The MiraBest∗ data set has pre-specified training and test data partitions, and the number of objects in each of these partitions is shown in Table 3, along with the equivalent partitions for the full MiraBest data set. In this work we subdivide the MiraBest∗ training partition into training and validation sets using an 80:20 split. The test partition is reserved for deriving the performance metrics presented in Section 5.2. To accelerate convergence, we further normalise individual data samples from the data set by shifting and scaling as a function of the mean and variance, both calculated from the full training set (LeCun et al. 2012) and listed in Table 3. Data augmentation is performed during training and validation for all models using random rotations from 0 to 360 degrees. This is standard practice for augmentation and is also consistent with the 𝐺-steerable CNN training implementations of Weiler & Cesa (2019), who included rotational augmentation for their own tests in order to not disadvantage models with lower levels of equivariance. To avoid issues arising from samples where the structure of the radio source overlaps the edge of the field and is artificially truncated in some orientations during augmentation, but
Table 3.
Data used in this work. The table shows the number of objects of each class that are provided in the training and test partitions for the MiraBest data set, containing sources labelled as both Confident and Uncertain, and the MiraBest∗ data set, containing only objects labelled as Confident, as well as the mean and standard deviation of the training sets in each case.

             Train          Test
Data         FRI    FRII    FRI   FRII   μ        σ
MiraBest     517    552     74    79     0.0031   0.0352
MiraBest∗    348    381     49    55     0.0031   0.0350

not in others, we apply a circular mask to each sample image, setting all pixels to zero outside a radial distance of 75 pixels from the centre. An example data sample is shown in Figure 1, where it is used to illustrate the corresponding C4 and D4 groups. As noted by Weiler & Cesa (2019), for signals digitised on a pixel grid, exact equivariance is not possible for groups that are not symmetries of the grid itself; in this case only subgroups of D4 will be exact symmetries, with all other subgroups requiring interpolation to be employed (Dieleman et al. 2016).

Table 4.
The LeNet5-style network architecture used for all the models in this work. 𝐺-steerable implementations include the additional steps marked "G-steerable only" and replace the convolutional layers with the appropriate group-equivariant equivalent in each case. Column [1] lists the operation of each layer in the network; Column [2] lists the kernel size in pixels for each layer, where appropriate; Column [3] lists the number of output channels from each layer; Column [4] denotes the degree of zero-padding in pixels added to each edge of an image, where appropriate.

Operation                                  Kernel   Channels   Padding
Convolution                                5 × 5    6          1
ReLU
Max-pooling                                2 × 2
Convolution                                5 × 5    16         1
ReLU
Max-pooling                                2 × 2
Invariant Projection (G-steerable only)
Global Average Pool (G-steerable only)
Fully-connected                                     120
ReLU
Fully-connected                                     84
ReLU
Dropout (p = 0.5)
Fully-connected                                     2

For our architecture we use a simple LeNet-style network (LeCun et al. 1998) with two convolutional layers, followed by three fully-connected layers. Each of the convolutional layers has a ReLU activation function and is followed by a max-pooling operation. The fully-connected layers are followed by ReLU activation functions and we use a 50% dropout before the final fully-connected layer, as is standard for LeNet (Krizhevsky et al. 2012). An overview of the architecture is shown in Table 4. In what follows we refer to this base architecture using conventional convolution layers as the standard CNN and denote it {𝑒}. We also note that the term conventional CNN is used throughout the paper to refer to networks that do not employ group-equivariant convolutions, independent of architecture. For the 𝐺-steerable implementation of this network we use the e2cnn extension to the PyTorch library (Weiler & Cesa 2019) and replace the convolutional layers with their subgroup-equivariant equivalent. We also introduce two additional steps into the network in order to recast the feature data from the convolutional layers into a format suitable for the conventional fully-connected layers. These steps consist of reprojecting the feature data from a geometric tensor into standard tensor format and pooling over the group features, and are indicated in Table 4. Since the additional steps in the 𝐺-steerable implementations have no learnable parameters associated with them, the overall architecture is unchanged from that of the standard CNN; it is only the nature of the kernels in the convolutional layers that differs. For the input data we use the trivial representation, but for all subsequent steps in the 𝐺-steerable implementations we adopt the regular representation, ρ_reg. This representation is typical for describing finite groups/subgroups such as C_N and D_N.
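The interaction between the regular representation and a pointwise nonlinearity can be sketched directly: ρ_reg of a rotation in C_N cyclically permutes the |G| = N channels of a feature vector, and ReLU commutes with any such permutation. This illustrative sketch is independent of the e2cnn implementation used here:

```python
def rho_reg(r, v):
    """Regular representation of rotation r in C_N: cyclically shift the
    |G| = N channels of the feature vector v."""
    n = len(v)
    return [v[(i - r) % n] for i in range(n)]

def relu(v):
    """Pointwise nonlinearity, applied to each channel independently."""
    return [max(0.0, x) for x in v]

v = [1.5, -2.0, 0.5, -0.25]   # a C4 feature vector, one entry per rotation
lhs = relu(rho_reg(1, v))     # transform first, then nonlinearity
rhs = rho_reg(1, relu(v))     # nonlinearity first, then transform
```

Because lhs and rhs are identical, placing ReLU (or max/average pooling) between regular-representation layers preserves the equivariance of the network.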
The regular representation of a finite group 𝐺 acts on a vector space ℝ^|𝐺| by permuting its axes, where |𝐺| = N for C_N and |𝐺| = 2N for D_N; see Figure 1. This representation is helpful because its action simply permutes channels of fields and is therefore equivariant under pointwise operations such as the ReLU activation function and the max and average pooling functions (Weiler & Cesa 2019).

https://github.com/QUVA-Lab/e2cnn

We train each network over 600 epochs using a standard cross-entropy loss function and the Adam optimiser (Kingma & Ba 2014) with an initial learning rate of 10− and a weight decay of 10−. We use a scheduler to reduce the learning rate by 10% each time the validation loss fails to decrease for two consecutive epochs. We use mini-batching with a batch size of 50. No additional hyper-parameter tuning is performed. We also implement an early-stopping criterion based on validation accuracy and for each training run we save the model corresponding to this criterion.

Validation loss curves for both the standard CNN implementation, denoted {𝑒}, and the group-equivariant CNN implementations for a range of orders 𝑁 are shown in Figure 2. Curves show the mean and standard deviation for each network over five training repeats. It can be seen from Figure 2 that the standard CNN implementation achieves a significantly poorer loss than that of its group-equivariant equivalents. For both the cyclic and dihedral group-equivariant models, the best validation loss is achieved for 𝑁 =
16. Although the final loss in the case of the cyclic and dihedral-equivariant networks is not significantly different in value, it is notable that the lower order dihedral networks converge towards this value more rapidly than the equivalent order cyclic networks. We observe that higher order groups minimise the validation loss more rapidly, i.e. the initial gradient of the loss as a function of epoch is steeper, up to order 𝑁 =
16 in this case. Weiler & Cesa (2019), who also noted the same behaviour when training on the MNIST data sets, attribute it to the increased generalisation capacity of equivariant networks, since there is no significant difference in the number of learnable parameters between models. Final validation error as a function of order, 𝑁, for the group-equivariant networks is shown in Figure 3. From this figure it can be seen that all equivariant models improve upon the non-equivariant CNN baseline, {𝑒}, and that the validation error decreases before reaching a minimum for both cyclic and dihedral models at approximately 16 orientations. This behaviour is discussed further in Section 6.4.

Standard performance metrics for both the standard CNN implementation, denoted {𝑒}, and the group-equivariant CNN implementations for a range of orders 𝑁 are shown in Table 5. The metrics in this table are evaluated using the reserved test set of the MiraBest∗ data set, classified using the best-performing model according to the validation early-stopping criterion. The reserved test set is augmented by a factor of 9 using discrete rotations of 20° over the interval [0°, 180°). This augmentation is performed in order to provide metrics that reflect the performance over a consistent range of orientations. The values in the table show the mean and standard deviation for each metric over five training repeats. All 𝐺-steerable CNNs listed in this table use a regular representation for feature data and apply a 𝐺-invariant map after the convolutional layers to guarantee an invariant prediction.
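The role of the 𝐺-invariant map can be illustrated with a toy orbit pooling over C4: max-pooling an orientation-sensitive score over all four rotations yields an output that is identical for every 90°-rotated copy of the input, so the prediction cannot depend on orientation. The scoring function below is a hypothetical stand-in, not the network used in this work:

```python
def rot90(a):
    """Rotate a square 2D list by 90 degrees."""
    return [list(row) for row in zip(*a)][::-1]

def score(img):
    """A hypothetical, orientation-sensitive feature: position-weighted sum."""
    return sum((i + 2 * j) * v
               for i, row in enumerate(img)
               for j, v in enumerate(row))

def orbit_pool(img):
    """Max-pool the score over the C4 orbit -> rotation-invariant output."""
    scores = []
    for _ in range(4):
        scores.append(score(img))
        img = rot90(img)
    return max(scores)

x = [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 1]]
```

Although score(x) changes when x is rotated, orbit_pool(x) is the same for every rotation of x, because the C4 orbit of the image is the same set of four arrays regardless of which member one starts from.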
Figure 2.
Validation losses during the training of the standard CNN, denoted {𝑒}, and (i) C_N-equivariant models for the MiraBest∗ data set (left), and (ii) D_N-equivariant models for the MiraBest∗ data set (right). Plots show the mean and standard deviation over five training repeats. Curves are smoothed over 20 epochs to eliminate small-scale variability.

Table 5.
Performance metrics for classification of the MiraBest∗ data set using the standard CNN ({e}) and G-steerable CNNs for different cyclic and dihedral subgroups of the E(2) Euclidean group. All G-steerable CNNs use a regular representation for feature data and apply a G-invariant map after the convolutions to guarantee an invariant prediction. For each model ({e}, and the C_N and D_N models at each order considered) the table lists the FRI precision, recall and F1-score, the FRII precision, recall and F1-score, and the overall MiraBest∗ accuracy [%], quoted as mean ± standard deviation over five training repeats.

Figure 3.
Validation errors of C_N and D_N regular steerable CNNs for different orders, N, for the MiraBest∗ data set. All equivariant models improve upon the non-equivariant CNN baseline, {e}.

From Table 5, it can be seen that the best test accuracy is achieved by the D_16 model, highlighted in bold. Indeed, while all equivariant models perform better than the standard CNN, the performance of the dihedral models is consistently better than that of the cyclic models of equivalent order.

For the cyclic models it can be observed that the largest change in performance comes from an increased FRI recall. For a binary classification problem, the recall of a class is defined as

Recall = TP / (TP + FN),   (5)

where TP indicates the number of true positives and FN indicates the number of false negatives. The recall therefore represents the fraction of all objects in that class which are correctly classified. Equivalently, the precision of the class is defined as

Precision = TP / (TP + FP),   (6)

where FP indicates the number of false positives. Consequently, if the recall of one class increases at the expense of the precision of the opposing class then it indicates that the opposing class is being disproportionately misclassified. However, in this case we can observe from Table 5 that the precision of the FRII class is also

Figure 4.
Average number of misclassifications for FRI (cyan) and FRII (grey) over all orientations and training repeats for the standard CNN, denoted {e}, the C_16 CNN and the D_16 CNN; see Section 5.2 for details.

increasing, suggesting that the improvement in performance is due to a smaller number of FRI objects being misclassified as FRII. For the cyclic models there is a smaller, but not equivalent, improvement in FRII recall. This suggests that the cyclic model primarily reduces the misclassification of FRI objects as FRII, but does not equivalently reduce the misclassification of FRII as FRI.

The dihedral models show a more even distribution of improvement across all metrics, indicating that there are more balanced reductions across both FRI and FRII misclassifications. This is illustrated in Figure 4, which shows the average number of misclassifications over all orientations and training repeats for the standard CNN, the C_16 CNN and the D_16 CNN for the reserved test set.

The test partition of the full
MiraBest data set contains 153 FRI and FRII-type sources labelled as both Confident and Uncertain, see Table 3. When using this combined test set the overall performance metrics of the networks considered in this work become accordingly lower due to the inclusion of the Uncertain sources. This is expected, not only because the Uncertain samples include edge cases that are more difficult to classify, but also because the assigned labels for these objects may not be fully accurate. However, the relative performance shows the same degree of improvement between the standard CNN, {e}, and the D_16 model, which have percentage accuracies of 82.±0.41 and 85.±0.35, respectively, when evaluated against this combined test set.

We note that given the comparatively small size of the
MiraBest∗ training set, these results may not generalise equivalently to other, potentially larger, data sets with different selection specifications, and that additional validation should be performed when considering the use of group-equivariant convolutions for other classification problems.

Target class predictions for each test data sample are made by selecting the highest softmax probability, which provides a normalised version of the network output values. By using dropout as a Bayesian approximation, as demonstrated in Gal & Ghahramani (2015), one is able to obtain a posterior distribution of network outputs for each test sample. This posterior distribution allows one to assess the degree of certainty with which a prediction is being made, i.e. if the distribution of outputs for a particular class is well-separated from those of other classes then the input is being classified with high confidence; however, if the distribution of outputs intersects those of other classes then, even though the softmax probability for a particular realisation may be high (even as high as unity), the overall distribution of softmax probabilities for that class may still fill the entire [0, 1] range, overlapping significantly with the distributions from other target classes. Such a circumstance denotes a low degree of model certainty in the softmax probability and therefore in the class prediction for that particular test sample.

By re-enabling the dropout before the final fully-connected layer at test time, we estimate the predictive uncertainty of each model for the data samples in the reserved MiraBest∗ test set. With dropout enabled, we perform T = 50 forward passes through the trained network for each sample in the test set. On each pass we recover (x_t, y_t), where x and y are the softmax probabilities of FRI and FRII, respectively. An example of the results from this process can be seen in Figure 5, where we evaluate the trained model on a rotated version of the input image at discrete intervals of 20° in the range [0°, 180°), using a trained model for the standard CNN (left panel) and for the D_16-equivariant CNN (right panel). For each rotation angle, a distribution of softmax probabilities is obtained. In the case of the standard CNN it can be seen that, although the model classifies the source with high confidence when it is unrotated (0°), the softmax probability distributions are not well-separated for the central image orientations, indicating that the model has a lower degree of confidence in the predictions being made at these orientations. For the D_16-equivariant CNN it can be seen that in this particular test case the model has a high degree of confidence in its prediction for all orientations of the image.

To represent the degree of uncertainty for each test sample quantitatively, we evaluate the degree of overlap in the distributions of softmax probabilities at a particular rotation angle using the distribution-free overlap index (Pastore & Calcagnì 2019). To do this, we calculate the local densities at position z for each class using a Gaussian kernel density estimator, such that

f_x(z) = (1/T) Σ_{t=1}^{T} [1/(β√(2π))] exp[−(z − x_t)² / (2β²)],   (7)

f_y(z) = (1/T) Σ_{t=1}^{T} [1/(β√(2π))] exp[−(z − y_t)² / (2β²)],   (8)

where β = 0.1. We then use these local densities to calculate the overlap index, η, such that

η = Σ_{i=1}^{N_z} min[f_x(z_i), f_y(z_i)] δz,   (9)

where {z_i}_{i=1}^{N_z} covers the range zero to one in N_z steps of size δz. The overlap index, η, varies between zero and one, with larger values indicating a higher degree of overlap and hence a lower degree of confidence.

For each test sample we evaluate the overlap index over a range of rotations from 0° to 180° in increments of 20°. We then calculate the average overlap index, ⟨η⟩, across these nine rotations. In Figure 5 the value of this index can be seen above each plot: in this case, the standard CNN has ⟨η⟩_{e} = 0.30 and the D_16-equivariant CNN has ⟨η⟩_{D_16} < 0.01.
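Equations (7)–(9) are straightforward to implement. A minimal numpy sketch follows; the function name, the grid size N_z = 1000 and the example samples are assumptions for illustration, not the authors' code:

```python
import numpy as np

def overlap_index(x, y, beta=0.1, n_z=1000):
    """Distribution-free overlap index, eta (Pastore & Calcagni 2019).

    x, y: arrays of T softmax probabilities for the two classes.
    Returns the integral over [0, 1] of min(f_x(z), f_y(z)), where f_x and
    f_y are Gaussian kernel density estimates with bandwidth beta (Eqs 7-9).
    """
    z = np.linspace(0.0, 1.0, n_z)   # grid {z_i} covering [0, 1]
    dz = z[1] - z[0]                 # step size, delta z
    norm = 1.0 / (beta * np.sqrt(2.0 * np.pi))
    f_x = norm * np.mean(np.exp(-(z[:, None] - np.asarray(x)) ** 2 / (2 * beta ** 2)), axis=1)
    f_y = norm * np.mean(np.exp(-(z[:, None] - np.asarray(y)) ** 2 / (2 * beta ** 2)), axis=1)
    return float(np.sum(np.minimum(f_x, f_y) * dz))  # Eq. (9)

rng = np.random.default_rng(0)
confident = overlap_index(0.95 + 0.01 * rng.standard_normal(50),
                          0.05 + 0.01 * rng.standard_normal(50))
uncertain = overlap_index(0.50 + 0.05 * rng.standard_normal(50),
                          0.50 + 0.05 * rng.standard_normal(50))
```

Well-separated class distributions give η close to zero (a confident prediction), while strongly overlapping distributions give η close to one.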
Figure 5.
A scatter of 50 forward passes of the softmax output for the standard CNN (left) and the D_16-equivariant CNN (right). The lower panel shows the rotated version of the test image. As indicated, the average overlap index for the standard CNN is ⟨η⟩ = 0.30, and ⟨η⟩ < 0.01 for the D_16-equivariant CNN.

A fraction of the test samples show an improvement in average model confidence, i.e. ⟨η⟩_{e} − ⟨η⟩_{D_16} > 0.01, when classified using the D_16-equivariant CNN compared to the standard CNN; 8.±.5 per cent show a deterioration in average model confidence, i.e. ⟨η⟩_{D_16} − ⟨η⟩_{e} > 0.01, and all other samples show no significant change in average model confidence, i.e. |⟨η⟩_{e} − ⟨η⟩_{D_16}| < 0.01. Mean values and uncertainties are determined from ⟨η⟩ values for all test samples evaluated using a pairwise comparison of 5 training realisations of the standard CNN and 5 training realisations of the D_16 CNN.

Those objects that show an improvement in average model confidence are approximately evenly divided between FRI and FRII type objects, whereas the objects that show a reduction in model confidence exhibit a weak preference for FRII. These results are discussed further in Section 6.1.
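The MC Dropout procedure itself can be illustrated schematically. The toy two-layer network below is a stand-in for the trained CNNs used in this work; the weights, layer sizes and dropout rate are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))   # toy hidden-layer weights
W2 = rng.standard_normal((8, 2))    # toy output layer: 2 classes (FRI, FRII)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def mc_dropout_passes(x, T=50, p_drop=0.5):
    """T stochastic forward passes with dropout left enabled at test time."""
    probs = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)           # ReLU hidden layer
        mask = rng.random(h.shape) > p_drop   # fresh Bernoulli dropout mask per pass
        h = h * mask / (1.0 - p_drop)         # inverted-dropout scaling
        probs.append(softmax(h @ W2))
    return np.array(probs)                    # shape (T, 2)

x = rng.standard_normal(16)   # a stand-in for an input sample's features
samples = mc_dropout_passes(x)
print(samples.shape)          # (50, 2): one (x_t, y_t) pair per forward pass
```

The spread of the 50 rows of `samples` plays the role of the posterior distribution of softmax outputs from which η is computed.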
Mathematically, G-steerable CNNs classify equivalence classes of images, as defined by the equivalence relation of a particular group, G, whereas conventional CNNs classify equivalence classes defined only by translations. Consequently, by using E(2)-equivariant convolutions the trained models assume that the statistics of extra-galactic astronomical images containing individual objects are invariant not only to translations but also to global rotations and reflections. Here we briefly review the literature in order to consider whether this assumption is robust and highlight the limitations that may result from it.

The orientation of radio galaxies, as defined by the direction of their jets, is thought to be determined by the angular momentum axis of the super-massive black hole within the host galaxy. A number of studies have looked for evidence of preferred jet alignment directions in populations of radio galaxies, as this has been proposed to be a potential consequence of angular momentum transfer during galaxy formation (e.g. White 1984; Codis et al. 2018; Kraljic et al. 2020); alternatively, it could be caused by large-scale filamentary structures in the cosmic web giving rise to preferential merger directions (see e.g. Kartaltepe et al. 2008) that might result in jet alignment for radio galaxies formed during mergers (e.g. Croton et al. 2006; Chiaberge et al. 2015). The observational evidence for both remains a subject of discussion in the literature.

Taylor & Jagannathan (2016) found a local alignment of radio galaxies in the ELAIS N1 field on scales < ° using observations from the Giant Metrewave Radio Telescope (GMRT) at 610 MHz. Local alignments were also reported by Contigiani et al. (2017), who found evidence (> σ) of local alignment on scales of ∼ ° among radio sources from the FIRST survey using a much larger sample of radio galaxies, catalogued by the Radio Galaxy Zoo project. A similar local alignment was also reported by Panwar et al. (2020) using data from the FIRST survey. Using a sample of 7555 double-lobed radio galaxies from the LOFAR Two-metre Sky Survey (LoTSS; Shimwell et al. 2019) at 150 MHz, Osinga et al. (2020) concluded that a statistical deviation from purely random distributions of orientation as a function of projected distance was caused by systematics introduced by the brightest objects and did not persist when redshift information was taken into account. However, that study also suggested that larger samples of radio galaxies should be used to confirm the result.

Whilst these results may suggest tentative evidence for spatial correlations of radio galaxy orientations in local large-scale structure, they do not provide any information on whether these orientations differ between classes of radio galaxy, i.e. the equivalence classes considered here. Moreover, the large spatial distribution and comparatively small number of galaxies that form the training set used in this work mean that even spatial correlation effects would be unlikely to be significant for the data set used here. However, the results of Taylor & Jagannathan (2016), Contigiani et al. (2017) and Panwar et al. (2020) suggest that care should be taken over this assumption if data sets are compiled from only small spatial regions.

In Section 5.1, we found that the largest improvement in performance was seen when using dihedral, D_N, models. We suggest that this improvement over cyclic, C_N, models is due to image reflections accounting for chirality, in addition to orientations on the celestial sphere, which are represented by the cyclic group. Galactic chirality has previously been considered for populations of star-forming, or normal, galaxies (see e.g. Slosar et al. 2009; Shamir 2020), as the spiral structure of star-forming galaxies means that such objects can be considered to be enantiomers, i.e. their mirror images are not superimposable (Capozziello & Lattanzi 2005). It has been suggested that a small asymmetry exists in the number of clockwise versus anti-clockwise star-forming galaxy spins (Shamir 2020). As far as the authors are aware there have been no similar studies considering the chirality of radio galaxies. However, a simple example of such chirality for radio galaxies might be the case where relativistic boosting causes one jet of a radio galaxy to appear brighter than the other due to an inclination relative to the line of sight. Since no particular orientation relative to the line of sight should dominate across the population, this would imply a global equivariance to reflection. Since the dihedral (D_N) models used in this work are insensitive to chirality, the results in Section 5.1 suggest that the radio galaxies in the training sample used here do not have a significant degree of preferred chirality. Whilst this does not itself validate the assumption of global reflection invariance, in the absence of evidence to the contrary from the literature we suggest that it is unlikely to be significant for the data sample used in this work.

From the perspective of classification, equivariance to reflections implies that inference should be independent of reflections of the input.
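The requirement that inference be independent of reflections (and rotations) can be made concrete with a G-invariant map: pooling any feature over the orbit of the dihedral group D4 yields an output that is exactly invariant to 90° rotations and flips of the input. A minimal numpy sketch of this idea (not the e2cnn implementation used in this work; the feature and weights are arbitrary):

```python
import numpy as np

def d4_orbit(img):
    """The 8 transforms of the dihedral group D4: rotations and reflected rotations."""
    flipped = np.fliplr(img)
    return [np.rot90(img, k) for k in range(4)] + [np.rot90(flipped, k) for k in range(4)]

def invariant_feature(img, w):
    """Pool an arbitrary (non-invariant) linear feature over the D4 orbit."""
    return np.mean([np.sum(g * w) for g in d4_orbit(img)])

rng = np.random.default_rng(0)
img = rng.random((32, 32))
w = rng.standard_normal((32, 32))   # arbitrary fixed weights

f = invariant_feature(img, w)
# Rotating or reflecting the input only permutes the group orbit, so the
# pooled feature is unchanged:
assert np.isclose(f, invariant_feature(np.rot90(img), w))
assert np.isclose(f, invariant_feature(np.fliplr(img), w))
```

The same reasoning underlies the G-invariant map applied after the convolutional layers of the models in Table 5.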
For FRI and FRII radio galaxy classification, incorporating such information into a classification scheme may be important more generally: the unified picture of radio galaxies holds that both FRI and FRII, as well as many other classifications of active galactic nuclei (AGN) such as quasars, QSOs (quasi-stellar objects), blazars, BL Lac objects, Seyfert galaxies, etc., are in fact defined by orientation-dependent observational differences, rather than intrinsic physical distinctions (Urry 2004).

Consequently, under the assumptions of global rotational and reflection invariance, the possibility of a classification model providing different output classifications for the same test sample at different orientations is problematic. Furthermore, the degree of model confidence in a classification should also not vary significantly as a function of sample orientation, i.e. if a galaxy is confidently classified at one particular orientation then it should be approximately equally confidently classified at all other orientations. If this is not the case, as shown for the standard CNN in Figure 5 (left), then it indicates a preferred orientation in the model weights for a given outcome, inconsistent with the expected statistics of the true source population. Such inconsistencies might be expected to result in biased samples being extracted from survey data.

In this context it is then not only the average degree of model confidence that is important as a function of sample rotation, as quantified by the value of ⟨η⟩ in Section 5.3, but also the stability of the η index as a function of rotation, i.e. a particular test sample should be classified at a consistent degree of confidence as a function of orientation, whether that confidence is low or high. To evaluate the stability of the predictive confidence as a function of orientation, we examine the variance of the η index as a function of rotation.
For the MiraBest∗ reserved test set we find that approximately 30% of the test samples show a reduction of more than 0.01 in the standard deviation of their overlap index as a function of rotation, with 17% showing a reduction of more than 0.05. Conversely, approximately 8% of test samples show an increase of > 0.01 and 4% show an increase of > 0.05. In a similar manner to the results for average model confidence given in Section 5.3, those objects that show a reduction in their variance, i.e. an improvement in the consistency of prediction as a function of rotation, are evenly balanced between the two classes; however, those objects showing a strong improvement of > 0.05 are preferentially FRI type objects.
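The stability comparison described above can be sketched as follows, using hypothetical per-angle overlap indices for a single test sample; the η values below are invented for illustration and are not measurements from this work:

```python
import numpy as np

# Hypothetical overlap indices at the nine rotation angles (0-160 deg in
# 20 deg steps) for one test sample, for a standard CNN and an equivariant CNN.
eta_standard = np.array([0.05, 0.31, 0.62, 0.55, 0.48, 0.59, 0.44, 0.28, 0.12])
eta_d16 = np.array([0.03, 0.04, 0.05, 0.04, 0.03, 0.05, 0.04, 0.03, 0.04])

# Average confidence <eta> and its stability (standard deviation over rotation):
mean_eta = eta_standard.mean()
stability_change = eta_standard.std() - eta_d16.std()

# A reduction of more than 0.01 in the standard deviation of eta counts as an
# improvement in the consistency of prediction with orientation.
print(stability_change > 0.01)
```

Here the equivariant model's η is both smaller on average and far more stable with rotation, which is the pattern reported for the majority of test samples.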
The use of capsule networks (Sabour et al. 2017) for radio galaxy classification was investigated by Lukic et al. (2018). Capsule networks aim to separate the orientation (typically referred to as the viewpoint or pose in the context of capsule networks) of an object from its nature, i.e. class, by encoding the output of their layers as tuples incorporating both a pose vector and an activation. The purpose of this approach is to focus on the linear hierarchical relationships in the data and remove sensitivity to orientation; however, as described by Lenssen et al. (2018), general capsule networks do not guarantee particular group equivariances and therefore cannot completely disentangle orientation from feature data. It is perhaps partly for this reason that Lukic et al. (2018) found that capsule networks offered no significant advantage over standard CNNs for the radio galaxy classification problem addressed in that work.

In Section 5, we found that not only is the test performance improved by the use of equivariant CNNs, but also that equivariant networks converge more rapidly. For image data, a standard CNN enables generalisation over classes of translated images, which provides an advantage over the use of an MLP, where every image must be considered individually. G-steerable CNNs extend this behaviour to include additional equivalences, further improving generalisation. This additional equivariance enhances the data efficiency of the learning algorithm because every image is no longer an individual data point but instead a representative of its wider equivalence group. Consequently, unlike capsule networks, the equivalence groups being classified by a G-steerable CNN are specified a priori, rather than the orientations of individual samples being learned during training.

Whilst this creates additional capacity in the network for learning intra-class differences that are insensitive to the specified equivalences, it does not provide the information on the orientation of individual samples that is provided as an output by capsule networks. Lenssen et al. (2018) combined group-equivariant convolutions with capsule networks in order to output information on both classification and pose, although they note that a limitation of this combined approach is that arbitrary pose information is no longer available, but is instead limited to the elements of the equivariant group. For radio astronomy, where radio galaxy orientations are expected to be extracted from images at a precision that is limited by the observational constraints of the data, it is unlikely that pose information limited to the elements of a low-order finite group, G < E(2), is sufficient for further analysis. However, given particular sets of observational and physical constraints or specifications it is possible that such an approach may become useful at some limiting order. Alternatively, pose information might be used to specify a prior for a secondary processing step that refines a measurement of orientation.

By design, the final features used for classification in equivariant CNNs do not include any information about the global orientation or chirality of an input image; however, this can also mean that they are insensitive to local equivariances in the image, when these might in fact be useful for classification. The hierarchical nature of convolutional networks can be used to mitigate against this, as kernels corresponding to earlier layers in a network have a smaller, more local, footprint on the input image and are therefore sensitive to a different scale of feature than those from deeper layers, which encompass larger-scale information. Therefore, by changing the degree of equivariance as a function of layer depth one can control the degree to which local equivariance is enforced.
Weiler & Cesa (2019) refer to this practice as group restriction and find that it is beneficial when classifying data sets that possess symmetries on a local scale but not on a global scale, such as the CIFAR and unrotated MNIST data sets. Conversely, the opposite situation may also be true, where
Figure 6.
Validation losses during the training of the standard CNN, denoted {e} (blue), the D_16 CNN (orange), and the restricted D_N|{e} CNN (green; dashed) for the MiraBest∗ data set. Plots show mean and standard deviation over five training repeats.

no symmetry is present on a local scale, but the data are statistically invariant on a global scale. In this case the reverse may be done and, rather than restricting the representation of the feature data to reduce the degree of equivariance, one might expand the domain of the representation at a particular layer depth in order to reflect a global equivariance.

We investigate the effect of group restriction by using a D_N|{e} restricted version of the LeNet architecture, i.e. the first layer is D_N-equivariant and the second convolutional layer is a standard convolution. Using N = 16, the loss curve for this restricted architecture relative to the unrestricted D_16 equivariant CNN is shown in Figure 6. From the figure it can be seen that while exploiting local symmetries gives an improved performance over the standard CNN, the performance of the group-restricted model is significantly poorer than that of the full D_16 CNN. This result suggests that although local symmetries are present in the data, it is the global symmetries of the population that result in the larger performance gain for the radio galaxy data set.
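The equivariance property that underlies both the restricted and unrestricted models can be demonstrated directly: a C4 "lifting" correlation applies four rotated copies of a kernel, and rotating the input by 90° then cyclically permutes (and spatially rotates) the four orientation channels rather than producing unrelated features. A small numpy/scipy sketch, independent of the e2cnn library used in this work:

```python
import numpy as np
from scipy.signal import correlate2d

def c4_lifting_correlation(img, kernel):
    """Correlate `img` with the four 90-degree rotations of `kernel`.

    Returns an array of shape (4, H', W'): one orientation channel per
    element of the cyclic group C4.
    """
    return np.stack([correlate2d(img, np.rot90(kernel, k), mode="valid")
                     for k in range(4)])

rng = np.random.default_rng(0)
img = rng.random((8, 8))
kernel = rng.random((3, 3))

out = c4_lifting_correlation(img, kernel)
out_rot = c4_lifting_correlation(np.rot90(img), kernel)

# Equivariance: rotating the input rotates each feature map and cyclically
# shifts the orientation channels.
for k in range(4):
    assert np.allclose(out_rot[k], np.rot90(out[(k - 1) % 4]))
```

A G-invariant prediction is then obtained by pooling over the orientation channels, which is the role of the G-invariant map applied after the convolutional layers in the models above.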
In Section 5 we found that the N = 16 cyclic and dihedral models were preferred over the higher-order N = 20 models. This may seem counter-intuitive, as one might assume that for truly rotationally invariant data sets the performance would converge to a limiting value as the order increased, rather than finding a minimum at some discrete point. However, we note that the observed minimum at N = 16 might not represent a true property of the data set but instead a limitation caused by discretisation artefacts from rotation of convolution kernels with small support, in this case k = 5, see Table 4 (Weiler & Cesa 2019). These same discretisation errors may also account in part for the small oscillation in validation error as a function of group order seen in Figure 3. Consequently, while no additional hyper-parameter tuning has been performed for any of the networks used in this work, we note that kernel size is potentially one hyper-parameter that could be tuned as a function of group order, N, and that such tuning might lead to further improvements in performance at higher orders.

In this work, we have demonstrated that the use of even low-order group-equivariant convolutions results in a performance improvement over standard convolutions for the radio galaxy classification problem considered here, without additional hyper-parameter tuning. We have shown that both cyclic and dihedral equivariant models converge to lower validation loss values during training and provide improved validation errors. We attribute this improvement to the increased capacity of the equivariant networks for learning hierarchical features specific to classification, when additional capacity for encoding redundant feature information at multiple orientations is no longer required, hence reducing intra-class variability.

We have shown that for the simple network architecture and training set considered here, a D_16 equivariant model results in the best test performance using a reserved test set. We suggest that the improvement of the dihedral over the cyclic models is due to an insensitivity to, and therefore lack of preferred, chirality in the data, and that further improvements in performance might be gained from tuning the size of the kernels in the convolutional layers according to the order of the equivalence group.
We find that cyclic models predominantly reduce the misclassification of FRI type radio galaxies, whereas dihedral models reduce misclassifications for both FRI and FRII type galaxies.

By using the MC Dropout Bayesian approximation method, we have shown that the improved performance observed for the D_16 model compared to the standard CNN is reflected in the model confidence as a function of rotation. Using the reserved test set, we have quantified this difference in confidence using the overlap between the predictive probability distributions of different target classes, as encapsulated in the distribution-free overlap index parameter, η. We find that not only is average model confidence improved when using the equivariant model, but also that the consistency of model confidence as a function of image orientation is improved. We emphasise the importance of such consistency for applications of CNN-based classification in order to avoid biases in samples being extracted from future survey data.

Whilst the results presented here are encouraging, we note that this work addresses a specific classification problem in radio astronomy and the method used here may not result in equivalent improvements when applied to other areas of astronomical image classification using different data sets or network architectures. In particular, the assumptions of global rotational and reflectional invariance are strong assumptions, which may not apply to all data sets. As described in Section 6.1, data sets extracted from localised regions of the sky may be particularly vulnerable to biases when using this method, and the properties of the MiraBest∗ data set used in this work may not generalise to all other data sets or classification problems.

We note that this is true for all CNNs benchmarked against finite data sets, and users should be aware that additional validation should be performed before models are deployed on new test data, as biases arising from data selection may be reflected in biases in classifier performance (see e.g. Wu et al. 2018; Tang 2019; Walmsley et al. 2020). However, in conclusion, we echo the expectation of Weiler & Cesa (2019) that equivariant CNNs may soon become a common choice for morphological classification in fields like astronomy, where symmetries may be present in the data, and note that the overhead in constructing such networks is now minimal due to the emergence of standardised libraries such as e2cnn. Future work will need to address the optimal architectures and hyper-parameter choices for such models as specific applications evolve.

ACKNOWLEDGEMENTS
This work makes extensive use of the e2cnn library: github.com/QUVA-Lab/e2cnn. AMS gratefully acknowledges support from an Alan Turing Institute AI Fellowship EP/V030302/1. FP gratefully acknowledges support from STFC and IBM through the iCASE studentship ST/P006795/1.
DATA AVAILABILITY
Code for this work is publicly available on GitHub at the following address: github.com/as595/E2CNNRadGal. The MiraBest data set is available on Zenodo: 10.5281/zenodo.4288837.
REFERENCES
Abazajian K. N., et al., 2009, ApJS, 182, 543
Alger M. J., et al., 2018, MNRAS, 478, 5547
Aniyan A. K., Thorat K., 2017, ApJS
Banfield J. K., et al., 2015, MNRAS, 453, 2326
Beardsley A. P., et al., 2019, Publ. Astron. Soc. Australia
Becker R. H., White R. L., Helfand D. J., 1995, ApJ, 450, 559
Best P. N., Heckman T. M., 2012, MNRAS, 421, 1569
Bowles M., Scaife A. M. M., Porter F., Tang H., Bastien D. J., 2020, arXiv e-prints, p. arXiv:2012.01248
Capozziello S., Lattanzi A., 2005, arXiv e-prints, p. physics/0509144
Chiaberge M., Gilli R., Lotz J. M., Norman C., 2015, ApJ, 806, 147
Codis S., Jindal A., Chisari N. E., Vibert D., Dubois Y., Pichon C., Devriendt J., 2018, MNRAS, 481, 4753
Cohen T. S., Welling M., 2016, arXiv e-prints, p. arXiv:1602.07576
Condon J. J., Cotton W. D., Greisen E. W., Yin Q. F., Perley R. A., Taylor G. B., Broderick J. J., 1998, AJ, 115, 1693
Contigiani O., et al., 2017, MNRAS, 472, 636
Croton D. J., et al., 2006, MNRAS, 365, 11
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Dieleman S., De Fauw J., Kavukcuoglu K., 2016, arXiv e-prints, p. arXiv:1602.02660
Fanaroff B. L., Riley J. M., 1974, MNRAS, 167, 31P
Gal Y., Ghahramani Z., 2015, arXiv e-prints, p. arXiv:1506.02142
Gens R., Domingos P. M., 2014, in Advances in Neural Information Processing Systems. Curran Associates, Inc., pp 2537–2545
Gheller C., Vazza F., Bonafede A., 2018, MNRAS, 480, 3749
Ginsburg A., et al., 2019, AJ, 157, 98
Jarvis M. J., et al., 2016, in Proceedings of Science, doi:10.22323/1.277.0006
Johnston S., et al., 2008, Experimental Astronomy
Kartaltepe J. S., Ebeling H., Ma C. J., Donovan D., 2008, MNRAS, 389, 1240
Kingma D. P., Ba J., 2014, arXiv e-prints, p. arXiv:1412.6980
Kraljic K., Davé R., Pichon C., 2020, MNRAS, 493, 362
Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Pereira F., Burges C. J. C., Bottou L., Weinberger K. Q., eds, Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp 1097–1105
LeCun Y., Bottou L., Bengio Y., Haffner P., 1998, Proceedings of the IEEE, 86, 2278
LeCun Y., Bottou L., Orr G., Müller K., 2012, in Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, doi:10.1007/978-3-642-35289-8_3
Lenc K., Vedaldi A., 2014, arXiv e-prints, p. arXiv:1411.5908
Lenssen J. E., Fey M., Libuschewski P., 2018, arXiv e-prints, p. arXiv:1806.05086
Lukic V., Brüggen M., Banfield J. K., Wong O. I., Rudnick L., Norris R. P., Simmons B., 2018, MNRAS, 476, 246
McGlynn T., Scollick K., White N., 1998, in McLean B. J., Golombek D. A., Hayes J. J. E., Payne H. E., eds, New Horizons from Multi-Wavelength Sky Surveys
Miraghaei H., Best P. N., 2017, MNRAS, 466, 4346
Osinga E., et al., 2020, A&A, 642, A70
Panwar M., Prabhakar Sandhu P. K., Wadadekar Y., Jain P., 2020, MNRAS, 499, 1226
Pastore M., Calcagnì A., 2019, Frontiers in Psychology, 10, 1089
Ren S., He K., Girshick R., Sun J., 2015, in Advances in Neural Information Processing Systems, pp 91–99
Sabour S., Frosst N., Hinton G. E., 2017, arXiv e-prints, p. arXiv:1710.09829
Shamir L., 2020, Publ. Astron. Soc. Australia, 37, e053
Shimwell T. W., et al., 2019, A&A, 622, A1
Slosar A., et al., 2009, MNRAS, 392, 1225
Tang H., 2019, FR-DEEP, https://github.com/HongmingTang060313/FR-DEEP
Tang H., Scaife A. M. M., Leahy J. P., 2019, MNRAS, 488, 3358
Taylor A. R., Jagannathan P., 2016, MNRAS, 459, L36
Urry C., 2004, in Richards G. T., Hall P. B., eds, Astronomical Society of the Pacific Conference Series Vol. 311, AGN Physics with the Sloan Digital Sky Survey. p. 49 (arXiv:astro-ph/0312545)
Van Haarlem M. P., et al., 2013, LOFAR: The Low-Frequency Array, doi:10.1051/0004-6361/201220873
Walmsley M., et al., 2020, MNRAS, 491, 1554
Weiler M., Cesa G., 2019, arXiv e-prints, p. arXiv:1911.08251
Weiler M., Hamprecht F. A., Storath M., 2017, arXiv e-prints, p. arXiv:1711.07289
Weiler M., Geiger M., Welling M., Boomsma W., Cohen T., 2018, arXiv e-prints, p. arXiv:1807.02547
White S. D. M., 1984, ApJ, 286, 38
Wu C., et al., 2018, MNRAS, 482, 1211
York D. G., et al., 2000, AJ, 120, 1579

This paper has been typeset from a TEX/LATEX file prepared by the author.