Structured Variational Inference for Simulating Populations of Radio Galaxies
David J. Bastien★, Anna M. M. Scaife, Hongming Tang, Micah Bowles and Fiona Porter
Jodrell Bank Centre for Astrophysics, Department of Physics & Astronomy, University of Manchester, Oxford Road, Manchester M13 9PL, UK
Square Kilometre Array Organisation, Jodrell Bank Observatory, Macclesfield, SK11 9FT, UK
The Alan Turing Institute, Euston Road, London NW1 2DB, UK
★ E-mail: [email protected] (DJB)
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
We present a model for generating postage stamp images of synthetic Fanaroff-Riley Class I and Class II radio galaxies suitable for use in simulations of future radio surveys such as those being developed for the Square Kilometre Array. This model uses a fully-connected neural network to implement structured variational inference through a variational auto-encoder and decoder architecture. In order to optimise the dimensionality of the latent space for the auto-encoder we introduce the radio morphology inception score (RAMIS), a quantitative method for assessing the quality of generated images, and discuss in detail how data pre-processing choices can affect the value of this measure. We examine the 2-dimensional latent space of the VAEs and discuss how this can be used to control the generation of synthetic populations, whilst also cautioning how it may lead to biases when used for data augmentation.
Key words: radio continuum: galaxies – methods: statistical – surveys
Since Fanaroff & Riley (1974) first introduced their classification scheme for radio galaxies, various works have focused on finding an explanation for such distinct morphological characteristics. Ledlow & Owen (1996) first showed that the FRI/II transition is a function of both the radio and optical luminosity. Other works have focused on explaining the two groups through the underlying mechanism of the AGN and/or its environment. One of the most active debates considers the relationship between the accretion mechanism and the source morphology (Gendre et al. 2013). Even though an accretion correlation has not yet been established, Croston et al. (2018) present a clear analysis of the two classes, concluding that the difference in the external pressures of the lobes and jets, a distinctive intrinsic property of each class, could be explained by a difference in proton content causing a difference in energies between the two classes. Kaiser & Best (2007) worked on an evolutionary link between FRIs and FRIIs, postulating that all radio galaxies must, at some point near the start of their life, have shown FRII morphology: the lobes of sources with weak jets came into equilibrium with their surroundings, causing the lobes to start close to the core; the jets are then disrupted by the ambient medium and the lobes eventually develop into FRIs.

However, the dividing line between FRI and FRII is blurry for low flux density samples. For different redshifts and environments, the morphology is sometimes uncertain and often leads to hybrid morphologies. Best (2009) identified sources with one lobe showing FRI-like morphology and the other showing FRII-like morphology. Mingo et al. (2019b) made use of a two-sided source classification method and found that 20% of their sources were of hybrid morphology.

It is evident that new radio surveys will sharpen this blurry line between FRI and FRII radio galaxies by (i) providing larger samples of sources at different redshifts and in different environments, and (ii) providing larger samples of sources at low flux levels. Various research groups are thus working on surveys such as the Evolutionary Map of the Universe survey (EMU; Norris et al. 2011) made using the Australian Square Kilometre Array Pathfinder (ASKAP), the MeerKAT International GHz Tiered Extragalactic Exploration survey (MIGHTEE; Jarvis et al. 2017) using MeerKAT, the GaLactic and Extragalactic All-Sky MWA survey (GLEAM; Hurley-Walker et al. 2016) using the Murchison Widefield Array (MWA), the Very Large Array Sky Survey (VLASS; Villarreal Hernández & Andernach 2018) using the Very Large Array (VLA), and the latest LoTSS survey (Shimwell et al. 2019) using LOFAR.

With the advent of such large surveys, new automated classification algorithms have been developed to replace the "by eye" classification methods used in early work. These new algorithms are trained using existing databases of pre-classified FRI and FRII sources. Aniyan & Thorat (2017) made use of a deep convolutional neural network for the classification of FRI and FRII galaxies, while Tang et al. (2019) made use of transfer learning to classify sources across different surveys, introducing cross-survey identification abilities to existing neural networks. Lukic et al. (2019) made use of capsule networks instead of CNNs for classifying images from the LOFAR Two-metre Sky Survey (LoTSS). Ma et al.
(2019a) trained an encoder-decoder architecture and used the encoder as a classifier; finally, Wu et al. (2018) engineered the CLARAN network, which was able to locate and classify radio sources, trained using the Radio Galaxy Zoo catalogue (Banfield et al. 2015).
In radio astronomy, deep learning has found use in radio source classification in a similar way to classical machine learning. The groundwork was done by Aniyan & Thorat (2017), who made use of CNNs for the classification of radio galaxies. This was followed by other works involving the use of deep learning in source classification: Lukic et al. (2018) made use of CNNs for the classification of compact and extended radio sources from the Radio Galaxy Zoo catalogue; CLARAN (Classifying Radio Sources Automatically with Neural Networks; Wu et al. 2018) made use of the Faster R-CNN (Ren et al. 2015) to identify and classify radio sources; Alger et al. (2018) made use of an ensemble of classifiers, including CNNs, to perform host galaxy cross-identification; Tang et al. (2019) made use of transfer learning with CNNs for cross-survey classification; Gheller et al. (2018) made use of deep learning for the detection of cosmological diffuse radio sources; and Lukic et al. (2019) performed morphological classification using a novel technique known as capsule networks. However, while most of these previous works focused on the use of classifiers, in this work we focus on the use of neural networks to generate realistic radio galaxies.

Source simulation plays an important role in the development of modern day telescopes. Simulated sources should reflect two very important characteristics: (i) they should reflect the intrinsic properties of the astronomical entities being simulated; for example, for radio galaxies the simulated sources should show the jet, lobe and core structure; and (ii) they should simulate the properties of the telescope, such as the beam width or artefacts that might result from any receiver characteristics. In optical astronomy, both aspects have been catered for by various authors and the importance of such simulations is well established. For example, Peterson et al. (2015) simulated astronomical objects for optical surveys and state the importance of such tools for the planning of future observations and for the development of sophisticated analysis software.

In radio astronomy, equivalent simulators will be useful for the development of the SKA. Indeed, to test existing processing tools, in 2018 the SKA released the first SKA data challenge (Bonaldi & Braun 2018), which consisted of a series of generated images with specifications similar to those expected from the SKA. The community were invited to (i) undertake source finding on the generated images, (ii) perform source property characterisation, and (iii) identify the different source populations. The final results from the challenge were benchmarked against the input catalogue used to generate the images.

In addition to being used for benchmarking existing analysis tools, source simulators are also useful for improving existing classifiers. Most radio source classifiers make use of data augmentation to train their models, due to a lack of labelled training data (Tang et al. 2019; Wu et al. 2018; Aniyan & Thorat 2017). This augmentation involves rotating, flipping and inverting each image in the training set. However, Mikołajczyk & Grochowski (2018) concluded that, although fast and easy to implement, such data augmentation does not bring any new visual features to the data. In comparison, using simulated sources as training data might introduce new features and could improve the ability of a classifier.

There are different methods that can be used to simulate radio sources.
Radio galaxies can be simulated through astrophysical fluid dynamics simulations. For example, Rossi et al. (2017) made use of magneto-hydrodynamics (MHD) to simulate X-shaped sources (Ekers et al. 1978), while Tchekhovskoy & Bromberg (2016) used MHD to simulate FRI and FRII jets. Although such simulations are highly realistic, they exhibit various practical issues when it comes to general use. Firstly, they do not cater for the telescope optics, being designed to simulate ideal sources without those constraints. Secondly, when large populations of sources are needed this method is computationally inefficient, and more complex models must be employed to cater for different environmental effects.

Bonaldi & Braun (2018) made use of simple Gaussians to mimic radio sources in the SKA data challenge, and Makhathini et al. (2015) modeled radio galaxies using triple Gaussians. Even if the generated images are not as realistic as real sources, these approaches do take the telescope characteristics into consideration: Bonaldi & Braun (2018) simulated images corresponding to the 3 frequencies of the SKA, namely 560 MHz, 1.4 GHz and 9.2 GHz, also adding noise and convolving the sources with the point spread function (PSF) of the telescope. However, even if such simulations respect the telescope specifications, the fine structures typically found in jets and lobes are absent. Hence an ideal simulator would be one able to generate images with the telescope properties while also generating structures similar to those made by MHD modelling.

In this paper we use generative machine learning to simulate radio sources. Generative algorithms were first introduced in 2014 with two main methods: the Generative Adversarial Network (GAN; Goodfellow et al. 2014) and the Variational Autoencoder (VAE; Kingma et al. 2014). Generative algorithms are trained using real data to generate fake (or synthetic) data that look similar, without memorizing the training set. They have been employed across multiple different domains, including the generation of human faces (Karras et al. 2019), the generation of artworks (Elgammal et al. 2017), and the generation of radio sources (Ma et al. 2018).

The idea behind the use of neural networks for generating data is based on their ability to reduce high dimensional data to lower dimensions and vice versa. These dimensional transforms can be performed using encoders and decoders, which are the building blocks of the two basic families of generative algorithm: generative adversarial networks (GANs) and variational auto-encoders (VAEs). VAEs make use of both encoders and decoders; while encoders are not used in GANs, they constitute an important building block for VAEs. Encoders work by reducing high dimensional data to lower dimensions, using either fully connected layers or CNNs. For VAEs, the encoder can be thought of as an embedding method similar to principal component analysis (PCA; e.g. Rolinek et al. 2019). The decoder is the generative network common to all generative methods: decoders use lower dimensional data to produce higher dimensional outputs. Similar to the encoder, this can be done using a fully connected network or a de-convolution network.

These methods have been used in both the optical and radio regimes as generative methods, unsupervised clustering tools and inference methods. In the radio regime, Ralph et al. (2019) made use of auto-encoders combined with self-organizing maps to separate outliers using k-means clustering methods.
Regier et al. (2015) applied the VAE to astronomical optical images as an inference tool, while Spindler et al. (2020) designed a VAE that can be used for unsupervised clustering and the generation of synthetic optical images from the Sloan Digital Sky Survey (SDSS). Ravanbakhsh et al. (2016) make use of a conditional VAE for the generation of realistic galaxy images to be used as calibration data.

In this work, we use a variational auto-encoder (VAE) to implement a structured variational inference approach to generating simulated postage stamp images of radio galaxies in different target classes. In Section 2 we introduce variational auto-encoders and outline the statistical basis of their operation; in Section 3 we introduce the data sets used for training the VAEs used in this work; in Section 4 we introduce the concept of the radio astronomy morphology inception
score and define how it is calculated; in Section 5 we describe the neural network implemented in this work and how the dimensionality of the latent space is optimised; in Section 6 we give an analysis of the generated images from this network; in Section 7 we discuss the implications of these results and how they differ from previous generative applications in the field; and in Section 8 we draw our conclusions.

The first proposed form for an auto-encoder was a neural network comprised of two fully-connected layers that transformed an input to an output without causing any distortion in the process (Rumelhart et al. 1985). It did so by learning to generate a compact representation of the data, in a process similar to principal component analysis (PCA). This compact representation could then be used to re-generate the original input through the use of a decoder. Once trained, the decoder could be disconnected from the encoder to generate new data. In practice, however, the original form of the encoder-decoder network was not able to create truly new data samples, as the reduced space was not continuous. In order to address this problem of continuity, variational auto-encoders (VAEs) were proposed (Kingma & Welling 2013). The reduced space of the VAE is continuous by design, which implies that any point sampled from the reduced, or latent, space can generate an output that will appear real.

Since their first implementation by Kingma & Welling (2013), VAEs have been used in many different fields, including video generation (Denton & Fergus 2018), text generation (Semeniuta et al. 2017), molecular design (Blaschke et al. 2018) and cosmology (Yi et al. 2020). Various works have focused on improvements to the original VAE, including both domain-specific VAEs and general improvements to the original architecture of Kingma & Welling (2013). Among these are the infoVAE (Zhao et al. 2019), the VAE-DGP (Dai et al. 2015), the Hyperspherical VAE (Davidson et al. 2018), the WAE (Tolstikhin et al. 2017), the control VAE (Shao et al. 2020), the semi-amortized VAE (Kim et al. 2018) and the δ-VAE (Razavi et al. 2019), which works to prevent posterior collapse in VAEs.

Arguably, the most major advances in the field of VAEs were made by Kingma et al. (2014) and Sohn et al. (2015). Kingma et al. (2014) improved their original VAE by engineering a semi-supervised VAE (sVAE), which incorporates a classifier and a VAE with conditioning. As such, the networks could be trained using both unlabelled and labelled images. Sohn et al. (2015) further modified the sVAE by removing the classifier and focusing on the conditional properties of the training. The so-called Conditional VAE (CVAE) can be used to train a VAE to generate a specific target class, an approach that we follow in this work to generate FRIs and FRIIs, see Section 5.1. The CVAE has been used in various domains, such as dialog generation (Shen et al. 2017), basketball player modelling (Acuna 2017), drug discovery (Polykovskiy et al. 2018), sustainable building generation (Ge et al. 2019) and machine translation (Pagnoni et al. 2018). Improvements have also been made to the CVAE, such as the CDVAE (Lu et al. 2016), the CF-VAE (Bhattacharyya et al. 2019) and the W-VAE (Ryu et al.
2019). Further improvements in VAEs were made by combining VAEs with the discriminator of a Generative Adversarial Network (GAN). These VAEs are trained with a binary discriminator that classifies the generated images as fake or real, and the VAEs are trained so as to generate data that are classified as real by the discriminator. The VAE-GAN (Boesen Lindbo Larsen et al. 2015) is a normal VAE fitted with a GAN discriminator. Similarly, the CVAE-GAN (Bao et al. 2017) and sVAE-GAN are a CVAE and an sVAE, respectively, each fitted with a GAN discriminator.

To achieve a continuous latent space, Kingma et al. (2014) assumed that the latent space is represented by a continuous random variable with a parametric prior distribution. For neural networks that introduce non-linearity between their layers, calculation of the true posterior distribution for such a variable with respect to an input data set is generally intractable; however, the parameters of a variational approximation to the posterior can be learned instead.

Consider that we have $N$ observable data points contained in a dataset $\mathbf{X} = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$, where $\mathbf{x}^{(i)} \in \mathbb{R}^{D}$. We assume that these data points represent examples of a continuous random variable, $\mathbf{x}$, which can be encoded into a lower-dimensional latent space, $\mathbf{z}$, where $\mathbf{z} \in \mathbb{R}^{d}$ with $d < D$, and vice-versa that a sample drawn from $\mathbf{z}$ can be decoded to generate a new example, $\hat{\mathbf{x}}^{(i)}$. However, we do not a priori know the distribution of $\mathbf{z}$ or the parameters, $\theta$, of the function that maps one space to the other.

In cases where the encoder and decoder functions are complex and/or non-linear, both the true posterior, $p_{\theta}(\mathbf{z}\,|\,\mathbf{x})$, and the marginal likelihood, $p_{\theta}(\mathbf{x})$, are typically intractable and therefore these relationships cannot be evaluated directly. In such cases a variational approximation to the true posterior is adopted, denoted $q_{\phi}(\mathbf{z}\,|\,\mathbf{x})$, where $\phi$ are the parameters of (e.g.) a neural network used to approximate the true transforms. Typically a Gaussian approximation to $p_{\theta}(\mathbf{z}\,|\,\mathbf{x})$ is assumed and the Kullback-Leibler (KL) divergence (Kullback & Leibler 1951) between $p_{\theta}(\mathbf{z}\,|\,\mathbf{x})$ and $q_{\phi}(\mathbf{z}\,|\,\mathbf{x})$ is used as an optimization metric.

Following Kingma et al. (2014), in this scenario the marginal log likelihood of an individual data point is given by

$\log p_{\theta}(\mathbf{x}^{(i)}) = D_{\rm KL}(q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)}) \,\|\, p_{\theta}(\mathbf{z}\,|\,\mathbf{x}^{(i)})) + \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$,  (1)

where the first term is defined as

$D_{\rm KL}(q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)}) \,\|\, p_{\theta}(\mathbf{z}\,|\,\mathbf{x}^{(i)})) = -\sum_{\mathbf{z}} q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)}) \log \frac{p_{\theta}(\mathbf{z}\,|\,\mathbf{x}^{(i)})}{q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)})}$,  (2)

and the second term is given by

$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[-\log q_{\phi}(\mathbf{z}\,|\,\mathbf{x}) + \log p_{\theta}(\mathbf{x}, \mathbf{z})]$.  (3)

Since Eq. 2 is always non-negative, this implies that $\log p_{\theta}(\mathbf{x}^{(i)}) \geq \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$, which is referred to as the variational or evidence lower bound (ELBO) and is the loss term for training a VAE.

In practice, the expression for the lower bound given in Equation 3 can be re-written as

$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = -D_{\rm KL}(q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)}) \,\|\, p_{\theta}(\mathbf{z})) + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})}[\log p_{\theta}(\mathbf{x}^{(i)}\,|\,\mathbf{z})]$,  (4)

and these two terms play a key role in understanding the general behaviour of the variational auto-encoder.

The first term in Equation 4 is the KL term and minimises the difference between the variational approximation and the true posterior. By rearranging Equation 2 and using the Gaussian re-parameterisation proposed by Kingma et al.
(2014), which simplifies sampling a random deviate, $\tilde{z}$, from the Gaussian distribution $\mathcal{N}(\mu, \sigma^{2})$, by using the differentiable relationship

$\tilde{z} = \mu + \sigma \cdot \epsilon, \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, 1)$,  (5)
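As a concrete illustration, here is a minimal PyTorch sketch of the re-parameterised sampling step in Eq. 5, assuming the encoder outputs the mean and log-variance of the latent Gaussian (variable names are ours):

```python
import torch

def reparameterise(z_mu: torch.Tensor, z_logvar: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2) via Eq. 5, keeping the sample differentiable
    with respect to the encoder outputs mu and log(sigma^2)."""
    sigma = torch.exp(0.5 * z_logvar)   # sigma = exp(log(sigma^2) / 2)
    epsilon = torch.randn_like(sigma)   # epsilon ~ N(0, 1)
    return z_mu + sigma * epsilon       # differentiable w.r.t. mu and sigma
```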
we can therefore rewrite the KL term as:

$D_{\rm KL}(q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)}) \,\|\, p_{\theta}(\mathbf{z})) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right)$,  (6)

where $p_{\theta}(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$ and $q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)}) = \mathcal{N}(\mu, \sigma^{2})$.

The second term in Equation 4 can be considered as the reconstruction error and ensures that the network correctly reconstructs the input image. Since $p_{\theta}(\mathbf{x}\,|\,\mathbf{z})$ is a function that maps $\mathbf{z}$ into $\hat{\mathbf{x}}$, where $\hat{\mathbf{x}}$ represents the reconstructed image, we can assume that the term will take the form $p_{\theta}(\mathbf{x}\,|\,\hat{\mathbf{x}})$ and, in the Gaussian case,

$\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})}[\log p_{\theta}(\mathbf{x}^{(i)}\,|\,\mathbf{z})] = \frac{1}{N}\sum_{l=1}^{N}\left(\mathbf{x}^{(l)} - \hat{\mathbf{x}}^{(l)}\right)^{2}$.  (7)

Using Equations 6 & 7, we can therefore express the full loss term as:

$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right) + \frac{1}{N}\sum_{l=1}^{N}\left(\mathbf{x}^{(l)} - \hat{\mathbf{x}}^{(l)}\right)^{2}$.  (8)

Since the marginal log likelihood of the full training data set is given by the sum over the marginal log likelihoods of each individual data point, the total loss for the full data set is therefore given by

$\mathcal{L}(\theta, \phi; \mathbf{X}) = \sum_{i=1}^{N} \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$.  (9)

In the conditional case we again have $N$ observable data points, represented by the dataset $\mathbf{X} = \{\mathbf{x}^{(i)}\}$ where $\mathbf{x}^{(i)} \in \mathbb{R}^{D}$, but these are now accompanied by the associated labels, or class conditions, given as $\mathbf{Y} = \{\mathbf{y}^{(i)}\}$. Similarly the data will be generated from the latent space $\mathbf{z}$, where $\mathbf{z}^{(i)} \in \mathbb{R}^{d}$ with $d < D$. The encoder takes as input $\mathbf{X}$ and $\mathbf{Y}$ to encode $\mathbf{z}$, and the decoder takes $\mathbf{z}$ and $\mathbf{Y}$ to generate the data point $\hat{\mathbf{x}}^{(i)}$. We here make use of Bayes' theorem with conditioning on the class condition, $\mathbf{y}$:

$p_{\theta}(\mathbf{z}\,|\,\mathbf{x}, \mathbf{y}) = \frac{p_{\theta}(\mathbf{x}\,|\,\mathbf{z}, \mathbf{y})\, p_{\theta}(\mathbf{z}\,|\,\mathbf{y})}{p_{\theta}(\mathbf{x}\,|\,\mathbf{y})}$.  (10)

In this case the marginal likelihood of an individual data sample can be written as

$\log p_{\theta}(\mathbf{x}^{(i)}, \mathbf{y}) = D_{\rm KL}(q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)}, \mathbf{y}) \,\|\, p_{\theta}(\mathbf{z}\,|\,\mathbf{x}^{(i)}, \mathbf{y})) + \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}, \mathbf{y})$,  (11)

and consequently the ELBO is defined as

$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}, \mathbf{y}) = -D_{\rm KL}(q_{\phi}(\mathbf{z}\,|\,\mathbf{x}^{(i)}, \mathbf{y}) \,\|\, p_{\theta}(\mathbf{z}\,|\,\mathbf{y})) + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}, \mathbf{y})}[\log p_{\theta}(\mathbf{x}^{(i)}\,|\,\mathbf{z}, \mathbf{y})]$.  (12)

In this work, we make use of FRDEEP, first introduced in Tang et al. (2019). The FRDEEP data set consists of an imbalanced sample of FRI and FRII radio galaxy images, extracted from the FIRST radio survey (Becker et al. 1994) and subjected to image pre-processing using the method described in Aniyan & Thorat (2017). It makes use of two catalogues to label the images: (i) the Combined NVSS and FIRST Galaxies catalogue (CoNFIG; Gendre et al. 2013; Gendre et al. 2010) and (ii) the FRICAT catalogue (Capetti et al. 2017). Here we describe the data set in more detail.
The CoNFIG catalogue of FRI and FRII galaxies is constructed from 4 sub-samples, referred to as CoNFIG-1, 2, 3 and 4. These subsamples contain sources that were originally selected from the NVSS survey (Condon et al. 1998) with flux density cuts of $S_{1.4} \geq 1.3$, 0.8, 0.2 and 0.
005 Jy, respectively. These sources were first identified using the NVSS catalogues via the component association process brought forward by Gendre & Wall (2008), in which two or more components lying above the flux density limit are associated into a single extended source.

To balance the CoNFIG dataset, which consists mostly of FRII sources, we make use of the FRICAT catalogue. FRICAT consists of 219 sources that were first identified using the database of Best & Heckman (2012), following the method brought forward by Best (2009). The initial sample consists of 18,286 radio galaxies. These sources were cross-matched with optical spectroscopic data from data release 7 of the Sloan Digital Sky Survey (SDSS), and a redshift selection of $z < 0.$
15 was applied, within which 3357 sources were selected. The FIRST sources were then visually inspected and the sources within an extent of 30 kpc from the optical host were chosen; this corresponds to 11.5 arcsec and is ideal for morphological classification at the 5 arcsec resolution of FIRST.

The sources were then visually classified: the authors limited their selection to one-sided and two-sided jets, and focused on those sources whose brightness decreased along the jets with no enhanced brightness at the jet end. The classification was performed independently by the three authors and accepted if two authors agreed on a classification. The final catalogue consists of 219 FRI sources.
Out of the combined 659 sources from CoNFIG and FRICAT, 600 sources were randomly selected to reduce class imbalance, of which 264 were FRIs and 336 were FRIIs. The training set consists of 550 sources and the remaining 50 sources were assigned to the test set. Once augmented, the training set consists of 198,000 sources, of which 87,120 are FRIs and 110,880 are FRIIs (see Table 1).
                 FRI      FRII     Total
FRICAT           219      -        219
CoNFIG           50       390      440
Total            269      390      659

Selected         264      336      600
Training         242      308      550
Test             22       28       50
Train Augmented  87,120   110,880  198,000

Table 1. Number of FRI and FRII sources selected from FRICAT and CoNFIG. 600 sources were selected from the two datasets, with 550 training sources and 50 test sources.
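As a practical aside, the FRDEEP images can be loaded with the PyTorch Dataset classes distributed alongside the dataset. The sketch below assumes the FRDEEPF class from the FR-DEEP repository and an MNIST-style constructor; the exact class name, arguments and label convention should be checked against that repository:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Assumed import path from https://github.com/HongmingTang060313/FR-DEEP,
# which provides Dataset classes for the FIRST (FRDEEPF) and NVSS (FRDEEPN)
# image versions of the dataset.
from FRDEEP import FRDEEPF

transform = transforms.Compose([transforms.ToTensor()])

# download=True fetches the FIRST-image version of the dataset locally.
train_data = FRDEEPF('./first', train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)  # e.g. torch.Size([32, 1, 150, 150]) before cropping to 100x100
print(labels[:8])    # assumed label convention: 0 = FRI, 1 = FRII
```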
An important procedure in machine learning is data pre-processing, which helps to maintain a homogeneous sample space (Aniyan & Thorat 2017). While human classifiers can easily deal with the background noise in images and classify objects oriented differently, ML classifiers perform badly if an attempt is made to classify radio sources without a proper pre-processing procedure. For CNNs, Aniyan & Thorat (2017) and Tang et al. (2019) have shown that the noise should first be clipped at the 3-sigma level, the pixel values rescaled between 0 and 1, and finally the images augmented through flipping and rotation for the classifier to attain a good accuracy. Wu et al. (2018) applied a different pre-processing procedure: they performed a zero-centering of the pixel values, followed by a rescaling of the source and a horizontal flip, to attain good performance.

Our image pre-processing makes use of the method used by Aniyan & Thorat (2017) and Tang et al. (2019). This was done in two phases. The first phase involves the processes covered by Tang et al. (2019), where the pre-processed data was made available through their git repository (https://github.com/HongmingTang060313/FR-DEEP). This data was reprocessed to adapt it to our needs in the VAE. To augment the training dataset the sources were rotated from 0 to 360 degrees in intervals of 1 degree. This resulted in a total of 198,000 sources in the training set, with 87,120 FRI sources and 110,880 FRIIs. Table 1 summarises this source distribution.

One of the main challenges of developing and implementing generative algorithms is that they are difficult to evaluate and compare. For discriminative methods we can make use of metrics like accuracy, F1-score, precision or recall; generative algorithms have no such direct measures for comparing different outcomes. While the loss can be used as an indication of model performance, it cannot be used comparatively across different architectures or algorithms.

To deal with this issue, we make use of an adapted form of the inception score (Szegedy et al. 2014), a measure that can be used for comparison across different generative algorithms. This standardized method makes use of classifiers to quantify the generated image quality by evaluating the degree of uncertainty in its classification. The inception score was first introduced using the inception v1 model, trained on the 1000 target classes of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014 classification challenge (Russakovsky et al. 2015). The inception classifier was used to evaluate the uncertainty in image classification: images that were correctly generated were assigned to one of the 1000 classes with high probability, while those that were incorrectly generated were assigned to multiple classes with low probability. The inception score was then used as a measure of this difference.

To calculate the score, the inception model was applied to each generated image and the conditional probability that the image belongs to a given class, $p(y|x)$, was obtained. Using those probabilities, the entropy was calculated in order to evaluate the inception score, which is given by the KL divergence between $p(y|x)$ and $p(y)$:

$I_{\rm score} = \exp\left[\mathbb{E}_{x}\, D_{\rm KL}(p(y|x)\,\|\,p(y))\right]$.  (13)

Both $p(y|x)$ and $p(y)$ are evaluated from the generated dataset: $p(y|x)$ is the output of the classifier, i.e.
the label distribution of an image, where $y$ is the set of labels and $x$ is the image; $p(y)$ is the marginal distribution of the labels, $y$, over the generated dataset.

A high inception score implies a well performing generative algorithm; the highest inception score to date for the ILSVRC 2014 classification data set is 9.46, based on the de-noising diffusion probabilistic model (Ho et al. 2020).

Although we wish to use the concept of the inception score to evaluate radio source generation, the inception model itself was not trained using radio sources. To address this we make use of the existing radio source classifier architecture introduced in Tang et al. (2019), which is similar to that of Aniyan & Thorat (2017). This classifier consists of 5 convolutional layers, each with batch normalisation and max pooling. The output of the final convolution layer is flattened and input into a fully connected network consisting of 3 layers. The output from the fully connected layers is then passed through a softmax function to obtain the predictions $p(y|x)$.

This CNN was trained using the FRDEEP dataset described in Section 3, using 390 sources for training, 110 for validation and 55 for testing. The data were augmented and pre-processed using the method described in Section 3.4. We make use of the Adagrad optimizer with an initial learning rate of 0.001. The network was trained for 30 epochs and the training was stopped when the validation and training losses stabilized. Using this network, we are able to evaluate $p(y|x)$ and calculate the RAdio Morphology Inception Score (RAMIS), our equivalent of the original inception score for radio astronomy.

To test the ability of the RAMIS to quantify the quality of generated images, we performed a number of tests, transforming the training images from FRDEEP in different ways and evaluating the RAMIS for each transformation. To perform these tests, we make use of the non-augmented training data set from FRDEEP. Once transformed, these images were passed through the classifier model to obtain $p(y|x)$ and evaluate the RAMIS using Equation 13.

The original images with no transformation applied had a RAMIS of 1.4 when evaluated. We then performed the following transformations on those images, see Figure 1: (i) blurring of the images using kernels of different sizes; (ii) addition of noise at varying levels; and (iii) cropping of the images.
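For reference, a minimal sketch of how an inception-style score such as the RAMIS (Eq. 13) can be computed from the classifier's softmax outputs; the function and variable names here are ours rather than from the released code:

```python
import numpy as np

def inception_style_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (n_images, n_classes) softmax outputs p(y|x) from the classifier.
    Returns exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    p_y = probs.mean(axis=0, keepdims=True)            # marginal label distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    mean_kl = kl.sum(axis=1).mean()                    # E_x[ D_KL(p(y|x) || p(y)) ]
    return float(np.exp(mean_kl))

# Example: confident, diverse predictions give a higher score, approaching
# 2 for two perfectly separated classes.
probs = np.array([[0.95, 0.05], [0.05, 0.95], [0.90, 0.10], [0.10, 0.90]])
print(inception_style_score(probs))
```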
Figure 1.
Experiments 1, 2 and 3: Experiment 1 involves blurring the images with kernel sizes from (1,1) (no blurring) to (19,19); Experiment 2 involves the addition of random noise; Experiment 3 involves cropping the images.
In the first experiment, we performed a Gaussian blurring, varying the size of the smoothing kernel from (1,1) to (19,19) pixels in steps of 2. This results in a decrease in the resolution of the images. The RAMIS was evaluated for each kernel size. An exponential drop in the RAMIS was observed, from 1.4 to 1.05, where the non-transformed images with a kernel size of (1,1) resulted in a RAMIS of 1.4. The classifier confusion is similar to that obtained when observing at different resolving power: for example, in Figure 1 row 3 the core and the jets are clearly unresolved, which causes class confusion within the model and reduces the RAMIS.

The second experiment involved the addition of random noise to the images, see Figure 1. We added random noise with varying amplitude between 0 and 0.011 in steps of 0.001 and evaluated the RAMIS at each stage. With no added noise the RAMIS was 1.4, and a drastic drop in the score is observed from noise level 0.001 to 0.003. Above that noise threshold the classifier was strongly influenced by the noise and eventually classified the images randomly.

The final experiment involved cropping the radio images to different sizes. In Section 3.4, where we cover the pre-processing, we cropped the input images to 100 x 100 pixels, i.e. blanked an edge strip with a width of 25 pixels. This was done to prevent corner differences in the images when rotating. This process may remove information from some images, for example those that include low level features resulting from lobes or jets. To quantify this effect, the training images were cropped by blanking the edge with strips of width 0 to 75 pixels, at which point the image is completely blank. In a similar manner to the previous experiments, the RAMIS dropped exponentially as a function of strip width, converging towards 1.0 (the worst score). At a strip width of 25 pixels, which corresponds to the pre-processing procedure used in this work, a RAMIS of 1.25 was measured.

This analysis is relevant when quantitatively evaluating the performance of VAEs or other generative algorithms. For example, Ma et al. (2019a) cropped their input images to 40 x 40 pixels, corresponding to an edge blanking of 55 pixels. At that level the RAMIS is less than 1.05. This may be explained by the fact that many FRII lobes, which range in radial distance up to 75 pixels from the image centre, have been cut out of the images, causing only the core of the radio galaxy to be seen by the VAE. This will result in poorer performance when generating FRIIs.

The unsupervised VAE used in this work consists of two fully connected neural networks: (i) the encoder and (ii) the decoder. While many VAE implementations make use of convolutional layers (e.g. Ma et al. 2018; Ralph et al. 2019; Spindler et al. 2020), in this work we choose to use a fully connected network. There are a number of known issues involved in the use of convolutional VAEs and these are described in more detail in Section 7. Consequently, although the use of convolutional layers may be addressed in future work, here we retain a fully-connected architecture for simplicity.

The encoder takes in the 100 ×
100 pixel input images, which have been reshaped into a 10,000 × 1 vector, and maps the input, $x$, to a lower latent dimension, $z$, where $z \in \mathbb{R}^{d}$ and $x \in \mathbb{R}^{D}$ with $D > d$, see Figure 2. The first hidden layer has a soft-plus activation function and is fitted with a drop-out.
Figure 2. The VAE neural network architecture: the encoder reduces the dimensionality of the input images through a funnel architecture that halves the number of neurons at every layer of the encoder and doubles at every layer of the decoder.
Figure 3.
Modifications to the VAE for the CVAE model, with two additional neurons at the input layers of the encoder and decoder. Labels are forwarded to both the encoder and the decoder.
The number of neurons is halved in each subsequent layer so as to create the funnel architecture, with the 2nd layer consisting of 2048 neurons, the 3rd layer of 1024 neurons, the 4th layer of 512 neurons and the final hidden layer of 256 neurons. Each layer is fitted with a leaky ReLU. The widely used Rectified Linear Unit (ReLU) is a non-linear activation function that allows positive outputs from the neuron to pass while zeroing any negative values:

$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0. \end{cases}$  (14)

The leaky ReLU resolves the inability of the ReLU to map negative values by introducing a small negative slope for any negative input:

$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0, \end{cases}$  (15)

where $\alpha = 0.001$ in our work. All of these hidden layers are fitted with a leaky ReLU activation function and followed by a drop-out with $p = 0.2$. The final output layer of the encoder consists of two $d$-dimensional layers, where one layer outputs the mean parameters and the other outputs the variance parameters.
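As an illustration, a minimal PyTorch sketch of the fully-connected encoder described above. The 4096-neuron first hidden layer is our assumption (continuing the halving pattern from the 10,000-neuron input to the stated 2048-neuron second layer), and all class and variable names are illustrative rather than taken from the released code:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Funnel MLP encoder: 10000 -> 4096 -> 2048 -> 1024 -> 512 -> 256 -> (mu, logvar)."""
    def __init__(self, d: int, p_drop: float = 0.2, alpha: float = 0.001):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(10000, 4096), nn.Softplus(), nn.Dropout(p_drop),    # soft-plus first layer
            nn.Linear(4096, 2048), nn.LeakyReLU(alpha), nn.Dropout(p_drop),
            nn.Linear(2048, 1024), nn.LeakyReLU(alpha), nn.Dropout(p_drop),
            nn.Linear(1024, 512), nn.LeakyReLU(alpha), nn.Dropout(p_drop),
            nn.Linear(512, 256), nn.LeakyReLU(alpha), nn.Dropout(p_drop),
        )
        self.z_mu = nn.Linear(256, d)       # mean head
        self.z_logvar = nn.Linear(256, d)   # (log-)variance head

    def forward(self, x: torch.Tensor):
        h = self.body(x.view(x.size(0), -1))  # flatten 100x100 images to 10000
        return self.z_mu(h), self.z_logvar(h)
```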
The decoder, on the other hand, converts the sampled $z$ values into outputs with the same dimension as the encoder input. The decoder takes as input the latent variable, $z$, sampled from the two Gaussian parameter sets $z_{\mu}$ and $z_{\sigma}$; this sampling uses the re-parameterisation given in Equation 5. The first layer is fitted with a leaky ReLU and a drop-out. The network is then a mirrored version of the encoder, where the 2nd layer has 256 neurons, the 3rd 512, the 4th 1024, the 5th 2048 and finally the 6th 4096 neurons. These layers are each fitted with a leaky ReLU and drop-out. The final layer is followed by a sigmoid activation that bounds the output between 0 and 1. The architecture of the encoder-decoder network is shown in Figure 2 and detailed in Table 2.

For the declarative definition, our encoder and decoder were defined within the model and guide as:

• $p_{\theta}(z) = \mathcal{N}(0, I)$;
• $p_{\theta}(x\,|\,z) = H_{\theta}(z)$, where $H_{\theta}(z)$ represents the decoder.

The guide, which was introduced in Section 2.1, is given as:

• $q_{\phi}(z\,|\,x) = \mathcal{N}(z_{\mu}, z_{\sigma})$, where $z_{\mu} = F_{\phi}(x)$ and $z_{\sigma} = G_{\phi}(x)$,

where $F_{\phi}$ and $G_{\phi}$ are the same neural network, whose final output layer outputs $z_{\mu}$, the mean parameters, and $z_{\sigma}$, the variance parameters. The VAE is trained by optimizing the guide to match the model so as to minimize the loss function derived in Equation 8.

The conditional VAE is a variation on the unsupervised VAE that also takes in the labels of the data. The labels are the conditions that associate particular data samples and can be used to generate images based on specific conditions. We input these labels as one-hot vectors at two points in the network: firstly at the input to the encoder, along with the reshaped images, and secondly at the input to the decoder, along with the latent $z$, for image reconstruction. The alterations to the unsupervised VAE are shown in Figure 3 and Table 2.

The model was also modified following the method outlined in Section 2.2 to accommodate this additional information. We alter the definition of the model to include a prior on the class, $p(y) = \mathrm{Cat}(y\,|\,\pi)$, and we alter the likelihood, $p_{\theta}(x\,|\,z)$, to be $p_{\theta}(x\,|\,z, y) = \mathrm{Bernoulli}(x\,|\,H_{\theta}(z, y))$, where $H_{\theta}$ represents the decoder. The variational approximation, or guide, remains unchanged apart from the conditioning, with $q_{\phi}(z\,|\,x, y) = \mathcal{N}(z_{\mu}, z_{\sigma})$, where $z_{\mu} = F_{\phi}(x, y)$ and $z_{\sigma} = G_{\phi}(x, y)$. All networks used in this work were implemented using the Python probabilistic programming library Pyro (Bingham et al. 2018; Phan et al. 2019).

The unsupervised VAE was first trained by performing a learning rate search. The search was performed for the different latent dimensions $d$ = 2, 4, 8, 16,
32 and 64. This was done to identify the ideal learning rate, i.e. that leading to the lowest training loss after 10 epochs. The VAE was trained using initial learning rates between 0.0005 and 0.0015 in steps of 0.0001. Such a procedure is crucial as the initial learning rate is often considered to be the most important hyper-parameter (Smith 2017). Code for this work is available on the git repository: https://github.com/joshen1307/RAGA.
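For orientation, a minimal sketch of the Pyro model/guide pattern described in the previous section; class and variable names here are illustrative rather than taken from the released RAGA code, and the encoder is assumed to return mean and log-variance heads:

```python
import pyro
import pyro.distributions as dist
import torch

class VAESketch:
    """Minimal model/guide pair following the standard Pyro VAE pattern."""
    def __init__(self, encoder, decoder, d):
        self.encoder, self.decoder, self.d = encoder, decoder, d

    def model(self, x):
        pyro.module("decoder", self.decoder)
        batch = x.shape[0]
        with pyro.plate("data", batch):
            # prior p(z) = N(0, I)
            z_loc = x.new_zeros((batch, self.d))
            z_scale = x.new_ones((batch, self.d))
            z = pyro.sample("latent", dist.Normal(z_loc, z_scale).to_event(1))
            # likelihood p(x|z) via the decoder, which ends in a sigmoid; note that
            # Bernoulli strictly expects binary data, so continuous pixel values may
            # require binarising or relaxed validation
            probs = self.decoder(z)
            pyro.sample("obs", dist.Bernoulli(probs).to_event(1),
                        obs=x.reshape(batch, -1))

    def guide(self, x):
        pyro.module("encoder", self.encoder)
        with pyro.plate("data", x.shape[0]):
            # variational posterior q(z|x) = N(z_mu, z_sigma)
            z_mu, z_logvar = self.encoder(x)
            pyro.sample("latent",
                        dist.Normal(z_mu, torch.exp(0.5 * z_logvar)).to_event(1))

# Training then uses stochastic variational inference on the ELBO (Eqs. 8-9), e.g.:
# svi = pyro.infer.SVI(vae.model, vae.guide,
#                      pyro.optim.Adam({"lr": 1e-3}), loss=pyro.infer.Trace_ELBO())
```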
Figure 4.
Test loss and RAMIS score of the VAE for different latent dimensions.
Figure 5.
Test loss, RAMIS score and FRII fraction of the CVAE for different latent dimensions.

Table 3 shows the ideal initial learning rate for each latent dimension $d$. The identified initial learning rates were used to train the network for 8000 epochs. For all the latent dimensions the loss converged towards a minimum, showing adequate convergence; however, the minimum differs between the different latent dimensions. None of the networks over-fitted the training data and the test loss remained stable after 6000 epochs. At $d = 8$ the test loss stabilizes at 130; for the other latent dimensions the test loss converges to lower values (see Table 3).

An equivalent search was performed for the CVAE with $d$ = 2, 4, 8, 16, 32 and 64. In parallel to the VAE, the initial learning rate search was done between 0.0005 and 0.0015 in steps of 0.0001. Once identified, the CVAE was trained for 5000 epochs with each selected latent dimension (as shown in Table 4), where the test loss, train loss and RAMIS were calculated every 10 epochs. In addition to those three metrics, we also generated 100 FRIs and 100 FRIIs, which were classified using the classifier described in Section 4. This was done to find the class generation efficiency and to identify any class bias in our CVAE. The results are shown in Table 4 and Figure 5; the additional metrics, denoted FRI_count and FRII_count, are the respective fractions of generated FRI and FRII sources that were correctly classified.
Table 2. VAE and CVAE architectures: layer dimensions and drop-out parameters ($p = 0.2$) for the encoder and decoder of each network. The layer sizes follow the funnel structure described in the text; for the CVAE the input layers of the encoder and decoder are extended by the two label neurons.

d                   2      4      8      16     32     64
Initial LR (10^-3)  1.07   0.96   1.04   0.94   0.88   0.94
Train Loss          133.6  127.2  136.3  127.2  128.3  127.4
Test Loss           125.7  121.0  127.6  120.6

Table 3. Selected initial learning rates, train loss, test loss and RAMIS scores for the VAE at different latent dimensions.

d                   2      4      8      16
Initial LR (10^-3)  0.90   1.08   0.90   0.94
FRI_count /%        57.5   75.5   53.4   37.1
FRII_count /%       37.2   31.8   51.5   61.7

Table 4. Selected initial learning rates, train loss, test loss and RAMIS scores for the CVAE at different latent dimensions. FRI_count and FRII_count are the percentages of generated FRIs and FRIIs correctly classified by the CNN.
We make a model selection using the test loss and the RAMIS score as our main criteria. Table 3 shows the metrics for the VAE. As a selection criterion, we use the mean RAMIS and the mean test loss as benchmarks: any model with a RAMIS larger than the mean RAMIS and a test loss lower than the mean test loss was selected as the best performing model. For the VAE we chose the model with $d = 32$, which has a test loss of 127.33 (lower than the mean of 127.93) and a RAMIS of 1.17 (higher than the mean); the selected model was therefore $d = 32$.

To qualitatively understand the general ability of the unsupervised VAE model, a selection of images from the training set were fed into the encoder and reconstructed images generated from those latent coordinates were plotted for each latent dimension, see Figure 7. As the generation is a stochastic process, the output images are not expected to appear identical to the corresponding inputs; however, they should appear similar. In each case the image shown is generated at the minimum test loss and maximum RAMIS score for each model. All 6 latent dimensions were able to reconstruct the radio sources; however, some latent dimensions reproduce images with structures closer to those of the original images. At $d = 16$ and $d = 32$, for example, the VAE is able to reproduce the asymmetry of the sources: Source 3, which is asymmetric with one lobe brighter than the other, can partially be reconstructed at $d = 16$. The FRI and FRII division can also be reproduced correctly. Source 1, which is an FRII, is reproduced as a radio source with lobes brighter than the core; Source 7, which is an FRI, is reconstructed as a radio source with a bright core and low brightness lobes. However, for some latent dimensions, such as $d = 8$, we observe that sources like Source 5 are reconstructed as triple sources while the input image is a double radio source. Another limitation of the system is its inability to reproduce bent structures: Source 2, which is a bent source, is reconstructed as a straight source. This is assumed to be a consequence of having only a small number of bent sources present in the training data. Finally, one of the major drawbacks of the VAE is the 'blurriness' of the generated images. This is a known constraint of VAEs (Boesen Lindbo Larsen et al. 2015) and impacts the RAMIS score, as we have seen in Section 4.1.
Figure 6.
The VAE and CVAE training curves with the RAMIS curve.

Figure 8 shows the reconstructed images from the unsupervised VAE with $d$ =
32 for three different sources from the training set. We can see that the model initially learns the bright central core of the galaxy before learning the extended structure; the orientation of the source is only learned towards the end of the training process. For the conditional VAE, we can use an input image from the training data set to specify a point in latent space and then choose to generate synthetic images as either FRI or FRII.
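To illustrate this conditional generation step, a minimal sketch assuming a trained CVAE encoder and decoder as described above; the one-hot label ordering and helper names are our own assumptions:

```python
import torch

@torch.no_grad()
def conditional_pair(encoder, decoder, x, y_true):
    """Fix a latent point from a training image (encoded with its true label),
    then decode the same z under both class conditions."""
    fri = torch.tensor([[1.0, 0.0]])   # assumed one-hot ordering: [FRI, FRII]
    frii = torch.tensor([[0.0, 1.0]])
    # CVAE encoder input: flattened image concatenated with the one-hot label
    z_mu, _ = encoder(torch.cat([x.view(1, -1), y_true], dim=1))
    # decode the posterior mean under each class condition
    img_fri = decoder(torch.cat([z_mu, fri], dim=1)).view(100, 100)
    img_frii = decoder(torch.cat([z_mu, frii], dim=1)).view(100, 100)
    return img_fri, img_frii
```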
As covered in Section 3.4, the training and testing images have been sigma clipped. This resulted in all pixel values below a given threshold being set to zero and created a gap in pixel values between 0 and 0.0039; this is equivalent to approximately 95% of the pixels being set to zero. Generating images with zero-valued pixels is difficult for machine learning algorithms, which instead assign an infinitesimally small value to these regions in order to mimic the zeroing process. By saturating the generated images at the 95th pixel percentile and using a log-scale to visualise the data, we can observe that artifacts are present in the generated images at a low level. These are illustrated in Figure 9, and appear as concentric ring-like structures around the generated sources.

The distribution of pixel values in these regions varies between models with different latent dimensions and we suggest that this distribution can be used as an indicator of the performance of the VAE. While zeroing the pixels is computationally difficult, attaining very small values close to zero is an indicator of good network performance. This was evident for models that became stuck in local minima during the training process, where the percentage of pixels with values close to zero differed significantly from that of models which converged to a global minimum.

Although the lower dimensionality of the $d = 2$ model limits the quality of its generated images, it allows us to directly map the latent space, $z$, to the output images. Analyzing the latent space can also be useful to visualise the VAE's ability to separate source characteristics in the 2D latent space based on their morphology. To do this we sample points from the $d = 2$ latent space on a regular grid and generate the corresponding images, see Figure 10. Towards the edges of the space, at values of $z >$
3, it can be seen that part of the structure of the generated radio source overlaps the maximum 100 ×
100 pixel image extent and, while large extended sources are spread around the latent space, compact double radio sources are concentrated towards the bottom left corner of the space. Towards the top left of the space we find uncentred sources; these appear to have one lobe at the centre of the image, with the other lobe appearing in the lower left quarter of the image.

In the same manner as was illustrated for the $d = 2$ case, the latent space, $z$, transitions smoothly between different morphologies represented in the training set, even when those morphologies do not necessarily correspond to an equivalent physical continuum. Whilst this may align with observations of hybrid morphologies in some radio galaxies (e.g. Miraghaei & Best 2017; Mingo et al. 2019a), it should be considered carefully when generating synthetic examples for data augmentation, as the inclusion of too many intermediate morphologies may bias the model performance when applied to real data.

In the case of the conditional VAE, the generated sources were used to evaluate an additional performance metric designed to measure the degree of class separation achieved by the conditional generator. This was calculated by using the classifier described in Section 4 to classify a population of synthetic sources generated by the model with an input specification of a 50:50 FRI:FRII class balance and a uniform random sampling of the latent space in each case. The distributions of classifications are shown in Table 4 for each different latent dimension. From Table 4, it can be seen that although a balanced sample was specified at the input to the generator, the resulting synthetic population was found to be imbalanced by the external classifier.
Figure 7. Original and reconstructed images at different latent dimensions. The first row shows the original images from the FRDEEP training set; the remaining rows show the reconstructed images for $d$ = 2, 4, 8, 16,
32 and 64.

More specifically, all models were seen to produce an excess of FRI-type sources compared to FRIIs, with the exception of $d = 16$. When instead sampling the latent space from the prior, $\mathcal{N}(0, I)$, this behaviour is reversed and the synthetic population is found to be dominated by objects classified as FRII-type sources; however, we note that this is likely due to the fact that sources generated from the latent volume around the origin are predominantly compact, comprising unresolved and only marginally resolved objects, and the classifier is biased towards classifying these objects as FRII sources. This effect is also seen when sampling from the prior distribution over the latent space of the VAE. A selection of sources generated by randomly sampling from the prior distribution for the VAE is shown in Figure A9.

To qualitatively evaluate the CVAE's ability to produce sources with FRI and FRII characteristics, FRI and FRII sources originating from the same latent coordinates were generated. The sources were generated from the model with $d$ =
32, obtained at the selected training epoch.
Figure 8.
Sources generated as a function of training epoch from a model with $d$ = 32. The first column shows the original image from the training set that is used to define the point in latent space from which the synthetic images are generated.
Figure 9.
Image of low-level noise structures in generated images from one of the trained models.

A cross-sectional profile across the principal axis of each FRI source was compared with that of the corresponding FRII. As such, two versions (FRI and FRII) of the "same" source could be generated and compared. Figure 11 shows 6 selected sources for which the cross-section along the principal axis has been plotted. For each pair of generated sources the main distinctive feature lies in the lobe brightness: generated FRII sources had lobes brighter than the core, while for FRIs the pixel intensity decreases away from the core, which in most cases is brighter than the lobes. This is in line with the definition of FRIs and FRIIs, and shows that CVAEs can generate sources with distinctive FRI/II morphological characteristics.

Previous work on the generation of radio galaxy images has been undertaken by Ma et al. (2019a). In that work the authors used a conditional VAE with convolutional layers in both the encoder and decoder, a significantly different architecture to the fully-connected network presented here. They evaluated their network using standard classification scores (precision, recall, F1-score) with the classifier defined in Ma et al. (2019b), which itself was trained on augmented images produced using a VAE. As already noted in Section 4.1, the data pre-processing in Ma et al. (2019a) involved clipping the training images to 40 ×
40 pixels, which would have impacted significantly on the RAMIS evaluation measure defined in this work, and which we suggest may have a disproportionately large effect on the generation of FRII galaxies; these were noted to achieve poorer performance metrics than FRIs in the work of Ma et al. (2018).

The range of latent space dimensionalities considered by Ma et al. (2019a) was significantly larger than in this work, with models up to $d$ =
500 being trained. However, their conclusion is in line with the results of this work, which find that a relatively low latent space dimensionality is preferred. Ma et al. (2018) do not explicitly state their preferred dimensionality, but from their Figure 3 it appears to be in the same range as proposed here.

A further interesting observation in the work of Ma et al. (2019a) was that the generated images contained two artifacts that are not seen in the generated images from this work. The first of these was described as pseudo-structure, particularly in the generated FRII images, and the second was the presence of a grid structure, or pseudo-texture, overlying the images. Ma et al. (2019a) attributed this structure to a bias from the mean square error (MSE) loss used to construct those images, equivalent to Equation 7. However, we suggest that it may be a consequence of the convolutional layers used in their network. Chequerboard artifacts in generative algorithms that employ convolutional, and specifically deconvolutional, layers are a known issue. These artifacts arise from kernel overlap in the deconvolution steps of the decoder (Odena et al. 2016). Although they can be minimised by the use of deconvolution layers with stride 1, this is typically only used in the final layer of a convolutional decoder, and artifacts produced in earlier layers with larger stride steps can still be present. Alternatively, such chequerboard effects can also be mitigated through the use of up-sampling (see e.g. Spindler et al. 2020). Other high frequency artifacts are also thought to be caused by the use of max-pooling to subsample the output from convolutional layers in the encoder (e.g. Hénaff & Simoncelli 2015); however, we note that Ma et al. (2019a) do not use max-pooling to subsample, employing an average pooling approach instead.
Figure 10. 2D latent mapping of the VAE.
While the VAE and CVAE presented in this work can clearly be used to generate realistic radio sources, we note that there remain a number of limitations to this method that should be addressed in future work. The first of these is that VAEs and CVAEs tend to produce blurry images: while an ideal generative system should generate radio sources with a resolution similar to the training set, the resolution of the generated images produced here appears lower than that of the training set, i.e. FIRST. As we have demonstrated, this blurring will affect performance based on the RAMIS evaluation method introduced in this work. As an alternative to the MSE loss, Ma et al. (2019a) also used a pixel-wise cross entropy (PCE) loss, which they propose enabled finer structures to be generated in their output images. Another possibility for addressing this resolution issue is to introduce a discriminative network after the VAE, identifying input images as real or fake; this approach is known as a VAE-GAN (Kawai et al. 2020).

A second limitation is that VAEs cannot reproduce the sigma clipping applied in data pre-processing. In principle this can be remedied by applying a post-processing sigma clipping to the output images, but future applications should also address the nature of the systematic artifacts that appear in this noise.

A final point of note is that the generator is biased towards the creation of FRI radio galaxies. With the exception of models with a latent space dimensionality of $d$ =
16, the CVAE creates a higher number of FRIs compared to FRIIs when the output images are passed through a classifier. Most of the mis-classified FRIIs had an X-shaped morphology or were double-double sources. Compact FRIIs with smaller angular extent were correctly generated and classified, while those with large angular extent were generated with unclear morphologies. This is similar to the performance mis-match seen by Ma et al. (2018) for their FRI/II populations.
Figure 11.
Comparison of FRI and FRII sources. Column 1 shows the FRI sources, column 2 the corresponding FRII sources, and column 3 shows the cross-sectional plot across the jet axis.
CONCLUSIONS

In this work we have demonstrated the use of generative machine learning methods to simulate realistic radio galaxies. We present results from both an unsupervised variational autoencoder and a conditional variational autoencoder. The networks were trained using sources from the FIRST radio survey and produce radio sources with FRI and FRII morphologies. Furthermore, we have presented a quantitative method for evaluating the performance of these generative models in radio astronomy, formulated as the radio morphology inception score (RAMIS). Using both the RAMIS as a quantitative measure and visual inspection of the generated radio sources, we found that VAEs can be used to generate realistic radio sources, and that the lowest model loss was obtained at a latent dimension of 𝑑 = 32.
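For readers implementing such an evaluation, the sketch below shows how an inception-score-style metric of the kind underlying the RAMIS can be computed from classifier outputs. It is written in Python/NumPy, and assumes the standard inception-score construction with class probabilities taken from a pre-trained radio-morphology classifier; the specific classifier and pre-processing follow the main text, so this is an illustration rather than the reference implementation:

```python
import numpy as np

def inception_style_score(probs, eps=1e-12):
    """Inception-score-style metric from classifier outputs.

    probs : (N, K) array of class probabilities p(y|x) assigned to N
    generated images by a pre-trained classifier. Returns
    exp(E_x[KL(p(y|x) || p(y))]), where p(y) is the marginal class
    distribution over the generated sample.
    """
    p_y = probs.mean(axis=0)  # marginal class distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Higher scores indicate that individual images are classified confidently while the generated population covers the morphological classes evenly.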
ACKNOWLEDGEMENTS

The authors thank the reviewer for their comments, which significantly improved this paper. The authors would like to thank Alex Shestapoloff and Brooks Paige from the Alan Turing Institute and Tingting Mu from the University of Manchester for early discussions that informed the course of this work. DJB gratefully acknowledges support from STFC and the Newton Fund through the DARA Big Data program under grant ST/R001898/1. AMS gratefully acknowledges support from an Alan Turing Institute AI Fellowship EP/V030302/1. MB gratefully acknowledges support from the University of Manchester STFC CDT in Data Intensive Science, grant number ST/P006795/1. FP gratefully acknowledges support from STFC and IBM through the iCASE studentship ST/P006795/1.
DATA AND SOURCE CODE AVAILABILITY
Trained models from this work are publicly available on Zenodo (DOI:10.5281/zenodo.4456165). All codes are available on github: https://github.com/joshen1307/RAGA. The FRDEEP dataset used in this work is publicly available from Zenodo (DOI:10.5281/zenodo.4255826) and should be cited as Tang et al. (2019).
REFERENCES
Acuna D., 2017, Unsupervised modeling of the movement of basketball players using a deep generative model
Alger M. J., et al., 2018, MNRAS, 478, 5547
Aniyan A. K., Thorat K., 2017, ApJS, 230, 20
Banfield J. K., et al., 2015, MNRAS, 453, 2326
Bao J., Chen D., Wen F., Li H., Hua G., 2017, arXiv e-prints, p. arXiv:1703.10155
Becker R. H., White R. L., Helfand D. J., 1994, in Crabtree D. R., Hanisch R. J., Barnes J., eds, Astronomical Society of the Pacific Conference Series Vol. 61, Astronomical Data Analysis Software and Systems III. p. 165
Best P. N., 2009, Astronomische Nachrichten, 330, 184
Best P. N., Heckman T. M., 2012, MNRAS, 421, 1569
Bhattacharyya A., Hanselmann M., Fritz M., Schiele B., Straehle C.-N., 2019, arXiv e-prints, p. arXiv:1908.09008
Bingham E., et al., 2018, Journal of Machine Learning Research
Blaschke T., Olivecrona M., Engkvist O., Bajorath J., Chen H., 2018, Molecular Informatics, 37, 1700123
Boesen Lindbo Larsen A., Kaae Sønderby S., Larochelle H., Winther O., 2015, arXiv e-prints, p. arXiv:1512.09300
Bonaldi A., Braun R., 2018, arXiv e-prints, p. arXiv:1811.10454
Capetti A., Massaro F., Baldi R. D., 2017, A&A, 598, A49
Condon J. J., Cotton W. D., Greisen E. W., Yin Q. F., Perley R. A., Taylor G. B., Broderick J. J., 1998, AJ, 115, 1693
Croston J. H., Ineson J., Hardcastle M. J., 2018, MNRAS, 476, 1614
Dai Z., Damianou A., González J., Lawrence N., 2015, arXiv e-prints, p. arXiv:1511.06455
Davidson T. R., Falorsi L., De Cao N., Kipf T., Tomczak J. M., 2018, arXiv e-prints, p. arXiv:1804.00891
Denton E., Fergus R., 2018, arXiv e-prints, p. arXiv:1802.07687
Ekers R. D., Fanti R., Lari C., Parma P., 1978, Nature, 276, 588
Elgammal A., Liu B., Elhoseiny M., Mazzone M., 2017, arXiv e-prints, p. arXiv:1706.07068
Fanaroff B. L., Riley J. M., 1974, MNRAS, 167, 31P
Ge X., Goodwin R. T., Gregory J. R., Kirchain R. E., Maria J., Varshney L. R., 2019, arXiv e-prints, p. arXiv:1905.08222
Gendre M. A., Wall J. V., 2008, MNRAS, 390, 819
Gendre M. A., Best P. N., Wall J. V., 2010, MNRAS, 404, 1719
Gendre M. A., Best P. N., Wall J. V., Ker L. M., 2013, MNRAS, 430, 3086
Gheller C., Vazza F., Bonafede A., 2018, MNRAS, 480, 3749
Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., 2014, arXiv e-prints, p. arXiv:1406.2661
Hénaff O. J., Simoncelli E. P., 2015, arXiv e-prints, p. arXiv:1511.06394
Ho J., Jain A., Abbeel P., 2020, arXiv e-prints, p. arXiv:2006.11239
Hurley-Walker N., et al., 2016, MNRAS, 464, 1146
Jarvis M. J., et al., 2017, arXiv e-prints, p. arXiv:1709.01901
Kaiser C. R., Best P. N., 2007, MNRAS, 381, 1548
Karras T., Laine S., Aittala M., Hellsten J., Lehtinen J., Aila T., 2019, arXiv e-prints, p. arXiv:1912.04958
Kawai H., Chen J., Ishwar P., Konrad J., 2020, arXiv e-prints, p. arXiv:2003.00641
Kim Y., Wiseman S., Miller A. C., Sontag D., Rush A. M., 2018, arXiv e-prints, p. arXiv:1802.02550
Kingma D. P., Welling M., 2013, arXiv e-prints, p. arXiv:1312.6114
Kingma D. P., Rezende D. J., Mohamed S., Welling M., 2014, arXiv e-prints, p. arXiv:1406.5298
Kullback S., Leibler R. A., 1951, The Annals of Mathematical Statistics, 22, 79
Ledlow M. J., Owen F. N., 1996, AJ, 112, 9
Lu J., Deshpande A., Forsyth D., 2016, arXiv e-prints, p. arXiv:1612.00132
Lukic V., Brüggen M., Banfield J. K., Wong O. I., Rudnick L., Norris R. P., Simmons B., 2018, MNRAS, 476, 246
Lukic V., Brüggen M., Mingo B., Croston J. H., Kasieczka G., Best P. N., 2019, MNRAS, 487, 1729
Ma Z., Zhu J., Li W., Xu H., 2018, in 2018 14th IEEE International Conference on Signal Processing (ICSP). pp 522–526
Ma Z., Zhu J., Zhu Y., Xu H., 2019a, in 2019 15th International Conference on Computational Intelligence and Security (CIS). pp 151–155
Ma Z., Zhu J., Zhu Y., Xu H., 2019b, Communications in Computer and Information Science, 1071
Makhathini S., Jarvis M., Smirnov O., Heywood I., 2015, in Advancing Astrophysics with the Square Kilometre Array (AASKA14). p. 81 (arXiv:1412.5990)
Mikołajczyk A., Grochowski M., 2018, in 2018 International Interdisciplinary PhD Workshop (IIPhDW). pp 117–122
Mingo B., et al., 2019a, MNRAS
Mingo B., et al., 2019b, MNRAS, 488, 2701
Miraghaei H., Best P. N., 2017, MNRAS, 466, 4346
Norris R. P., et al., 2011, PASA, 28, 215–248
Odena A., Dumoulin V., Olah C., 2016, Distill
Pagnoni A., Liu K., Li S., 2018, arXiv e-prints, p. arXiv:1812.04405
Peterson J. R., et al., 2015, ApJS, 218, 14
Phan D., Pradhan N., Jankowiak M., 2019, arXiv e-prints, p. arXiv:1912.11554
Polykovskiy D., et al., 2018, Molecular Pharmaceutics, 15, 4398
Ralph N. O., et al., 2019, PASP, 131, 108011
Ravanbakhsh S., Lanusse F., Mandelbaum R., Schneider J., Poczos B., 2016, arXiv e-prints, p. arXiv:1609.05796
Razavi A., van den Oord A., Poole B., Vinyals O., 2019, arXiv e-prints, p. arXiv:1901.03416
Regier J., Miller A., McAuliffe J., Adams R., Hoffman M., Lang D., Schlegel D., Prabhat M., 2015, in Bach F., Blei D., eds, Proceedings of Machine Learning Research Vol. 37, Proceedings of the 32nd International Conference on Machine Learning. PMLR, Lille, France, pp 2095–2103, http://proceedings.mlr.press/v37/regier15.html
Ren S., He K., Girshick R., Sun J., 2015, arXiv e-prints, p. arXiv:1506.01497
Rolinek M., Zietlow D., Martius G., 2019, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 12406–12415
Rossi P., Bodo G., Capetti A., Massaglia S., 2017, A&A, 606, A57
Rumelhart D. E., Hinton G. E., Williams R. J., 1985, Technical Report, Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science
Russakovsky O., et al., 2015, International Journal of Computer Vision (IJCV), 115, 211
Ryu J. J., Choi Y., Kim Y.-H., El-Khamy M., Lee J., 2019, arXiv e-prints, p. arXiv:1905.10945
Semeniuta S., Severyn A., Barth E., 2017, arXiv e-prints, p. arXiv:1702.02390
Shao H., Yao S., Sun D., Zhang A., Liu S., Liu D., Wang J., Abdelzaher T., 2020, arXiv e-prints, p. arXiv:2004.05988
Shen X., Su H., Li Y., Li W., Niu S., Zhao Y., Aizawa A., Long G., 2017, arXiv e-prints, p. arXiv:1705.00316
Shimwell T. W., et al., 2019, A&A, 622, A1
Smith L. N., 2017, in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). pp 464–472
Sohn K., Lee H., Yan X., 2015, in Advances in Neural Information Processing Systems. pp 3483–3491
Spindler A., Geach J. E., Smith M. J., 2020, MNRAS
Szegedy C., et al., 2014, arXiv e-prints, p. arXiv:1409.4842
Tang H., Scaife A. M. M., Leahy J. P., 2019, MNRAS, 488, 3358
Tchekhovskoy A., Bromberg O., 2016, MNRAS, 461, L46
Tolstikhin I., Bousquet O., Gelly S., Schoelkopf B., 2017, arXiv e-prints, p. arXiv:1711.01558
Villarreal Hernández A. C., Andernach H., 2018, arXiv e-prints, p. arXiv:1808.07178
Wu C., et al., 2018, MNRAS, 482, 1211
Yi K., Guo Y., Fan Y., Hamann J., Wang Y. G., 2020, arXiv e-prints, p. arXiv:2001.11651
Zhao S., Song J., Ermon S., 2019, in Proceedings of the AAAI Conference on Artificial Intelligence. pp 5885–5892
APPENDIX A: IMAGES
We here present samples of generated sources using the trained VAE and CVAE.

(i) Figures A1 to A4: generated FRI sources from the CVAE that were classified as FRI by the CNN classifier.
(ii) Figures A5 to A8: generated FRII sources from the CVAE that were classified as FRII by the CNN classifier.
(iii) Figure A9: 100 generated sources from the unsupervised VAE for 𝑑 = 32, where 𝑧 is sampled randomly from the prior 𝑁(0, 𝐼) (see the sketch following this list).
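The sampling procedure used for Figure A9 amounts to drawing latent vectors from the prior and passing them through the trained decoder. A minimal sketch in PyTorch is given below; the placeholder decoder, its fully-connected form, and the 128x128 output size are illustrative assumptions, standing in for the trained models available on Zenodo:

```python
import torch
import torch.nn as nn

d = 32  # latent dimensionality used for Figure A9

# Placeholder decoder: in practice this would be the trained decoder
# loaded from the published models; the architecture here is assumed.
decoder = nn.Sequential(nn.Linear(d, 128 * 128), nn.Sigmoid())

z = torch.randn(100, d)  # 100 draws from the prior N(0, I)
with torch.no_grad():
    images = decoder(z).reshape(100, 128, 128)  # postage stamp images
```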
Figure A1.
Generated FRI Sample Set 1
Figure A2.
Generated FRI Sample Set 2
Figure A3.
Generated FRI Sample Set 3
Figure A4.
Generated FRI Sample Set 4
Figure A5.
Generated FRII Sample Set 1
Figure A6.
Generated FRII Sample Set 2
Figure A7.
Generated FRII Sample Set 3
Figure A8.
Generated FRII Sample Set 4
Figure A9.
Generated sources from the unsupervised VAE for 𝑑 = 32, where 𝑧 is sampled randomly from the prior 𝑁(0, 𝐼).

This paper has been typeset from a TEX/LATEX file prepared by the author.