AstroVaDEr: Astronomical Variational Deep Embedder for Unsupervised Morphological Classification of Galaxies and Synthetic Image Generation
MNRAS 000, 000–000 (2020) Preprint 23 November 2020 Compiled using MNRAS LaTeX style file v3.0
Ashley Spindler⋆, James E. Geach, and Michael J. Smith
Centre for Astrophysics Research, Department of Physics, Astronomy & Mathematics, University of Hertfordshire, College Lane, Hatfield, AL10 9AB
Centre of Data Innovation Research, Department of Physics, Astronomy & Mathematics, University of Hertfordshire, College Lane, Hatfield, AL10 9AB
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
We present AstroVaDEr, a variational autoencoder designed to perform unsupervised clustering and synthetic image generation using astronomical imaging catalogues. The model is a convolutional neural network that learns to embed images into a low-dimensional latent space, and simultaneously optimises a Gaussian Mixture Model (GMM) on the embedded vectors to cluster the training data. By utilising variational inference, we are able to use the learned GMM as a statistical prior on the latent space to facilitate random sampling and generation of synthetic images. We demonstrate AstroVaDEr's capabilities by training it on grey-scaled gri images from the Sloan Digital Sky Survey, using a sample of galaxies that are classified by Galaxy Zoo 2. An unsupervised clustering model is found which separates galaxies based on learned morphological features such as axis ratio, surface brightness profile, orientation and the presence of companions. We use the learned mixture model to generate synthetic images of galaxies based on the morphological profiles of the Gaussian components. AstroVaDEr succeeds in producing a morphological classification scheme from unlabelled data, but unexpectedly places high importance on the presence of companion objects, demonstrating the importance of human interpretation. The network is scalable and flexible, allowing for larger datasets to be classified, or different kinds of imaging data. We also demonstrate the generative properties of the model, which allow for realistic synthetic images of galaxies to be sampled from the learned classification scheme. These can be used to create synthetic image catalogues or to perform image processing tasks such as deblending.
Key words: galaxies: general – methods: data analysis – methods: observational
Over the past hundred years, extragalactic astronomy has seen a continual evolution in the methods used to collect, process and analyse observational data. From photographic plates to electronic detectors, and from human computers to high-performance computer algorithms, the advancement of data acquisition and analysis techniques is not always in lock-step. Amongst the many data challenges that need to be solved, a particular problem arises when (human) visual classification of individual objects is the state-of-the-art methodology. Classifying the morphologies of galaxies is one such task, and until now it has been an achievable goal for teams of expert classifiers or citizen scientists to visually inspect every object within a survey. However, even at the best classification rates for crowd-sourced projects, it would take many years to collect enough classifications for current-generation surveys (e.g. DES (Flaugher 2005), DECaLS (Dey et al. 2019), Hyper Suprime-Cam (Aihara et al. 2017)), let alone the next-generation wide and deep surveys due to come on line within the next decade (Walmsley et al. 2020).

⋆ E-mail: [email protected]

Addressing the challenge of producing physically and semantically meaningful morphological labels at the scales required for the Legacy Survey of Space and Time to be conducted with the Vera Rubin Observatory (Ivezić et al. 2019), or images from
Euclid, invariably drives attention towards machine learning as a solution. In particular, deep neural networks have been demonstrated to great effect at the task of classifying certain morphological characteristics within galaxies (LeCun et al. 2015). Until recently, most attention within the astronomical research community has been towards supervised machine learning techniques, where a labelled set of images is used to train a model to predict the class labels of new, unseen inputs. Artificial Neural Networks (ANNs) for galaxy classification have been in use in astrophysics for at least 25 years (e.g. Angel et al. 1990; Lloyd-Hart et al. 1992; Storrie-Lombardi et al. 1992; Odewahn et al. 1993, 1992; Lahav et al. 1995), and the recent growth in the power of neural networks has seen an upsurge in their application in astronomy. For example, Dieleman et al. (2015) took first place in the
Galaxy Challenge by using a rotationally-invariant convolutional neural network (CNN) to predict the voting fractions of galaxies that were classified by Galaxy Zoo 2. More recent approaches, such as Walmsley et al. (2020), show the potential for supervised learning combined with crowd-sourced labelling to actively improve and inform the neural network classifications. Supervised learning has also been demonstrated in the discovery of strong gravitational lenses, which is essentially a galaxy morphology classification problem (Petrillo et al. 2019; Li et al. 2020; Metcalf et al. 2019; Avestruz et al. 2019; Lanusse et al. 2018).

While supervised methods show some promise, they still face the limiting challenge of generating sufficient labelled images to form a cohesive training set with imaging quality matching future surveys. However, recent work has shown that unsupervised learning methods also perform well at visually classifying astronomical objects. Cheng et al. (2020) used a Convolutional Autoencoder (CAE), paired with Bayesian Gaussian Mixture Models (GMMs), to successfully construct a classifier for strong gravitational lenses. Ralph et al. (2019) also exploited CAEs, combining the feature extraction capability of the autoencoder with a Self-Organised Map and k-means clustering to identify different classes of radio galaxy morphology. Using CAEs as a starting point provides many advantages, while retaining the image recognition power of supervised CNNs. The main benefit, of course, is in removing the need for a large volume of labelled training data.

Autoencoders work by learning how to transform data into a low-dimensional representation (sometimes called the latent space, embedding or encoded representation), and back again. This is most commonly achieved with a pair of neural networks, one to encode and one to decode the data (Lecun 1987; Bourlard & Kamp 1988; Hinton & Zemel 1993).
The methods discussed above rely on separating out the dimensionality reduction and clustering tasks; however, it is possible to combine the tasks by using so-called 'deep clustering' techniques (Xie et al. 2015; Dilokthanakul et al. 2016). By training the embedding and clustering processes simultaneously, the learned latent space is encouraged to take on a clustered distribution while the clustering parameters evolve to follow the latent space. Deep clustering techniques have been shown to produce higher-accuracy clustering scores on standard data sets than independently optimised solutions, and promise to produce more interpretable clusters (Ramachandra 2019; Jiang et al. 2016; Cao et al. 2020).

Deep learning is not the only form of unsupervised methodology which has been employed in astronomy. Hocking et al. (2018) and Martin et al. (2020) demonstrate the use of a Growing Neural Gas and Hierarchical Clustering algorithm, which populates a learned model with low-dimensional neurons that act as representations of different galaxy properties. Instead of learning a feature variable from scratch, as a deep CNN would, Hocking et al. (2018) and Martin et al. (2020) use Fourier-transformed image patches to encode morphological and spectral information, training the model to group together similar patches into objects, and similar objects into morphological clusters. Uzeirbegovic et al. (2020) use principal component analysis of Hubble Space Telescope CANDELS images to demonstrate how the morphological features of galaxies can be represented by 'eigengalaxies' which describe different components of an underlying morphological manifold. We have also seen implementations in time-domain astronomy, such as in Ay et al. (2020), which uses a Dirichlet Process GMM to identify different classes of pulsars from their periods and period derivatives.
Finally, the prediction of redshifts from photometric data has also been studied with unsupervised learning, such as in Geach (2012), Siudek et al. (2018), and D'Isanto & Polsterer (2018).

There have been promising developments in recent years in improving the ability of autoencoders (AEs) to perform image classification, and among those is the integration of variational inference (Blei et al. 2016) into the encoding process. Variational inference is a field of statistics concerned with finding approximations of the posterior distributions in Bayesian models. Kingma & Welling (2013) demonstrated how a variational inference model could be approximated by using the autoencoder framework, which has led to a wide field of research into the applications of Variational Autoencoders (VAEs). Like a CAE, a VAE encodes a sample of data into a low-dimensional space, but it differs in the sense that a statistical prior is used to 'condition' the encoded space to take a certain shape. The most common prior in a VAE is a unit Gaussian; however, other models have been designed that allow for deep clustering. Variational Deep Embedding (VaDE, Jiang et al. 2016) represents one of the state-of-the-art approaches for deep, unsupervised clustering, and works by imposing a Gaussian Mixture prior upon the learned latent space.

An important distinction between a VAE and a traditional AE is that a VAE is in fact a generative network. We will discuss the generative process in Section 3, but suffice to say that the statistical prior can be used to generate synthetic images. This is analogous to another generative network that has seen some popularity in astronomy: Generative Adversarial Networks (GANs) (Goodfellow et al. 2014; Reed et al. 2016; Smith & Geach 2019; Schawinski et al. 2017).
A VAE differs from a GAN in that the latter generally comprises two neural networks that are attempting to fool each other. One network (the generator) attempts to produce new data which matches the feature distribution of the training set, while the second network (the discriminator) attempts to guess whether the new data is real or fake. The two networks compete, with the generator learning to make better (more realistic) synthetic data, and the discriminator getting better at distinguishing generated outputs from the real thing.

Variational inference has already been explored in the context of extragalactic astronomy. Most applicable to this work is that of Regier et al. (2015), who demonstrate the fundamentals of using a VAE to embed morphological characteristics of galaxies. Ravanbakhsh et al. (2017) also implement a VAE, but do so using a conditional scheme that leverages known properties of the training images to guide the learning process. The conditional VAE (C-VAE) is shown alongside a conditional GAN network, and the authors show that the C-VAE produced more consistent results, and that by adding noise profiles associated with the original dataset they could produce well-realised synthetic images. While not using a full Gaussian mixture, Sun et al. (2019) successfully employ a Cascade-VAE with a double-peaked Gaussian prior to perform star-galaxy separation. To our knowledge, the use of a Gaussian Mixture prior, as in VaDE, for galaxy morphological classification and image generation has not yet been demonstrated.

In this paper, we introduce AstroVaDEr (Astronomical Variational Deep Embedder), an implementation of the VaDE architecture which leverages numerous recent improvements to the variational deep clustering (VDC) paradigm. Here we demonstrate AstroVaDEr's capability as an unsupervised classifier for galaxy morphology, and show how its variational inference properties allow the network to be employed as a generative network.
We perform training on, and comparisons with, galaxies from the Galaxy Zoo 2 (Willett et al. 2013; Huertas-Company et al. 2015) dataset to benchmark the network, but we also demonstrate some of the flexible development choices that allow AstroVaDEr to be adapted to other surveys and image classification problems.

In Section 2 we describe the training, validation and testing sets we selected for this work, along with the image pre-processing that was performed. Section 3 describes the theoretical background that informed the model and the chosen architecture, followed by details on hyperparameter selection and model training in Section 4. Our results are presented in Section 5, which includes a demonstration of the image reconstructions achieved through training, the unsupervised clustering results, a comparison with Galaxy Zoo 2 voting fractions and synthetic image generation properties. We discuss future improvements and our conclusions in Sections 6 and 7, respectively.

In order to test the ability of VaDE to produce cluster assignments that are representative of the real underlying distribution of galaxies, we require a large dataset of labelled images of galaxies. For this purpose, we use the Galaxy Zoo 2 (GZ2) dataset, as described in Willett et al. (2013). The images classified by Galaxy Zoo are taken from Data Release 7 of the Sloan Digital Sky Survey (SDSS), and make up a magnitude-limited sample in the r band. We extract our images from the SDSS imgcutout service, initially extracting gri colour images at a fixed size in pixels, scaled to a fixed multiple of R arcsec per pixel. These images are cropped smaller than those used in the citizen science project to lessen the effect on the network of nearby structure unassociated with the target galaxy.

To help prevent over-fitting and to augment the data set, images are randomly flipped on their horizontal and vertical axes each time they are fed into the network. Mainly, this is to try to prevent the network learning some rare features in specific locations in the images, but the nature of autoencoders brings into question the validity of other image augmentation.
For example, while CNNs can be made rotationally invariant (Dieleman et al. 2015), this invariance does not necessarily transfer to the fully-connected layers used within the latent embedding. Performing random rotations (which is a common augmentation technique in galaxy morphology studies) does not prevent the network from encoding rotational features. The network needs to be able to reproduce the image regardless of its observed orientation. Random rotations may improve disentanglement of rotation as a feature (i.e. the network would use less of the encoded space to control rotation), but may hamper the disentanglement of other features. Finally, there is a scalability consideration, as performing rotations on each input can prove costly in terms of processing time.

After the random transformations, we convert the image to grey scale by averaging the three colour bands. Training on grey-scale images allows us to ensure that the network is learning strictly on the basis of morphology, as opposed to learning the colour dependence of different morphological types. (An interesting alternative may be to train using CMYK format images, with a regulariser within the clustering model to emphasise the features in the luminance band over the colour bands.) Finally, we downscale the images to a smaller fixed size in pixels to improve training speed.

Galaxy Zoo 2 provides labels for each galaxy in the sample in the form of vote counts for answers to a series of questions about a galaxy's morphology. These questions cover whether the galaxy is smooth or featured, along with topics such as the presence of bars and rings and the size of the bulge. We do not use the votes during training, but we will analyse our clustering results with respect to the raw and redshift-debiased vote fractions for a variety of morphological properties.

(R, above, refers to the Petrosian radius containing a fixed fraction of the galaxy's light.)
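The augmentation and preprocessing pipeline described above (random flips, band averaging to grey scale, downscaling) can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; the input and output pixel sizes are placeholders, since the exact dimensions used in the paper are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(img, out_size=64):
    """Random flips, grey-scaling by band averaging, then block-average
    downscaling. Sizes are illustrative, not the paper's exact values."""
    if rng.random() < 0.5:            # random horizontal flip
        img = img[:, ::-1]
    if rng.random() < 0.5:            # random vertical flip
        img = img[::-1, :]
    grey = img.mean(axis=-1)          # average the gri bands
    f = grey.shape[0] // out_size     # integer downscale factor
    grey = grey[:f * out_size, :f * out_size]
    return grey.reshape(out_size, f, out_size, f).mean(axis=(1, 3))

x = rng.random((128, 128, 3))         # stand-in for one gri cutout
print(preprocess(x).shape)            # (64, 64)
```

Because the flips are drawn afresh on every call, each pass of a galaxy through the network sees one of four orientations, which is the only augmentation the text retains.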
Quantitatively, this will allow us to compare which properties the network deems most important against those emphasised by the citizen scientists.

We select galaxies from the Galaxy Zoo 2 catalogue that have more than 36 votes for the first question (smooth / features / star / artifact). We then split the sample into a training set, a validation set and a test set. The training set uses a random selection of roughly 80% of the catalogue, rounded to a multiple of 100 objects to streamline mini-batch processing, with 159,600 galaxies in total. The validation set is a small selection of 5000 objects which is used to periodically test the network during training, but does not influence the learning process. The final test set consists of 41,100 objects. These are not seen by the network during the training process, but are used to perform analysis on the clustering model.
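A splitting scheme of this kind can be sketched as below; the catalogue size and validation-set size here are arbitrary stand-ins, and only the shuffle, the roughly 80% training fraction, and the rounding down to a multiple of 100 mirror the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def split_catalog(n_total, train_frac=0.8, val_size=500, batch=100):
    """Shuffle indices, take ~train_frac for training (rounded down to a
    multiple of `batch` to streamline mini-batching), carve off a small
    validation set, and leave the remainder as the test set."""
    idx = rng.permutation(n_total)
    n_train = int(train_frac * n_total) // batch * batch
    train = idx[:n_train]
    val = idx[n_train:n_train + val_size]
    test = idx[n_train + val_size:]
    return train, val, test

train, val, test = split_catalog(10_000)
print(len(train), len(val), len(test))  # 8000 500 1500
```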
We combine several machine learning techniques to develop a network architecture that is flexible and powerful. Our hope is that this network is not simply tuned to the specific task of classifying galaxy morphology, but rather that the base architecture can be easily modified and optimised for a variety of tasks. AstroVaDEr is a combination of several staples of the 'machine learning in astronomy' community, as well as more recent and modern algorithms tailored towards unsupervised learning. Our model builds on previous work within and without the field of astronomy. Firstly, our CNN is based on the work of Walmsley et al. (2020) (W20) in classifying galaxy morphology for Galaxy Zoo images. Secondly, we implement the VaDE technique from Jiang et al. (2016) (J16), with recent improvements to the algorithm discussed in Cao et al. (2020) (C20) for 'Simple, Scalable, and Stable Variational Deep Clustering' (s3VDC). In this section, we shall discuss the frameworks underlying the AstroVaDEr architecture and then present the specific model used in our galaxy morphology experiments.
The foundational concept behind AstroVaDEr as an unsupervised clustering algorithm is the autoencoder. In essence, an autoencoder is a neural network that takes in some N-dimensional input, x, reduces that input to a lower n-dimensional space (for n ≪ N), and then reconstructs an output, x̂, which is compared with the input. An autoencoder optimises the weights and biases of neurons, typically in the form of fully connected and convolutional layers, by attempting to reduce the reconstruction loss between x and x̂. Practically, an autoencoder can be designed with two components, an encoder (E_φ) and a decoder (D_θ), such that:

E_φ(x) = z, (1)

D_θ(z) = x̂, (2)

where φ and θ are the values of the weights and biases of the layers in the encoder and decoder, respectively. The latent variable z is the encoded representation of the input. Autoencoders prove to be a reliable form of dimensionality reduction, which allows for unsupervised clustering algorithms to be used on data sets that would
otherwise be too complex. For example, Cheng et al. (2020) employ an autoencoder with convolutional layers to learn a latent representation from a data set of simulated gravitational lens imaging, which is then used to perform Bayesian mixture modelling to build a clustering and classification scheme for determining whether a given image contains a gravitational lens. Note that there exist applications of autoencoders that embed the clustering process within the network itself. Deep Embedded Clustering optimises both the autoencoder parameters and the parameters of an unsupervised clustering or mixture model simultaneously, which results in increased classification accuracy.

For AstroVaDEr, autoencoders pose one crucial problem. Because we do not impose any structure on the latent space, an autoencoder is inherently not a generative model. We do not know what the underlying statistical distribution of the latent space is, and as such we cannot draw a random sample from it that could be fed into the decoder network to create synthetic images. However, it is possible to construct an autoencoder that has this property, by using the technique of variational inference.
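The encoder/decoder mapping of equations 1 and 2 can be sketched as a minimal forward pass. The paper's networks are convolutional; the dense layers and all sizes below are illustrative placeholders, assumed only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Input dimensionality N and latent size n, with n << N.
N, n = 256, 8

W_enc = rng.normal(0.0, 0.1, (N, n)); b_enc = np.zeros(n)
W_dec = rng.normal(0.0, 0.1, (n, N)); b_dec = np.zeros(N)

def encode(x):
    """E_phi(x) = z: map the input down to the latent space."""
    return np.tanh(x @ W_enc + b_enc)

def decode(z):
    """D_theta(z) = x_hat: reconstruct the input from the latent code."""
    return z @ W_dec + b_dec

x = rng.random((4, N))                   # a mini-batch of flattened images
z = encode(x)
x_hat = decode(z)
recon_loss = np.mean((x - x_hat) ** 2)   # MSE reconstruction loss
print(z.shape, x_hat.shape)              # (4, 8) (4, 256)
```

Training would consist of adjusting φ = (W_enc, b_enc) and θ = (W_dec, b_dec) by gradient descent on `recon_loss`.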
Variational inference is a field of statistics concerned with finding approximations of the posterior distributions in Bayesian models. Kingma & Welling (2013) developed a framework wherein an autoencoder can be constructed with knowledge of a prior distribution on the latent space, such that in the process of optimising the neural network we approximate the posterior distribution. In this scheme, the encoder becomes an inference network that approximates the distribution q_φ(z|x) by learning to map x to z. The decoder is then a generative network, approximating the distribution p_θ(x̂|z).

A VAE is a generative model due to the inclusion of a prior distribution, p(z) = N(0, I), which can be sampled to generate synthetic data. Since we approximate p(z) with a unit Gaussian, we then choose to model the encoding process as a multivariate Gaussian with diagonal covariance, E_φ(x) = N(μ_z, σ_z). To enable training of this network we make use of the reparameterisation trick, which allows for the back-propagation of gradients in stochastic gradient descent. To do this, instead of E_φ(x) directly encoding z, we instead encode the means and covariances of each input, μ_z and σ_z (in practice, we encode ln(σ_z) as it is more stable numerically). We then re-sample z in the following way:

z = μ_z + σ_z ∘ ε, ε ∼ N(0, I), (3)

where ∘ is an element-wise multiplication and ε is a draw from a unit Gaussian. The objective function of this VAE is called the Evidence Lower Bound (ELBO), and we train the network by maximising this function. The ELBO has the general form:

log p_θ(x) ≥ E_{q_φ(z|x)} [log p_θ(x̂|z)] − KL(q_φ(z|x) ‖ p(z)). (4)

The first term on the right-hand side is interpreted as the reconstruction loss incurred by transforming a latent representation z to x̂, such as the mean squared error (MSE) or binary cross-entropy (BXE) multiplied by the dimensionality of the input data (D).
The second term is the Kullback–Leibler (KL) divergence between the encoded representation of z and the prior distribution. In practical terms, the first term optimises the network to produce good reconstructions of the inputs, while the second term acts as a regulariser that punishes the network if the distribution of z drifts from a unit Gaussian. We can rewrite the ELBO into an exact loss function, L_Total, that our network can optimise in the following way:

L_Total = L_Recon + L_KL, (5)

where

L_Recon = E_{q_φ(z|x)} [log p_θ(x̂|z)] = [MSE or BXE] × D (6)

and

L_KL = −(1/2) Σ_j (1 + log(σ_j²) − μ_j² − σ_j²). (7)

When the network is trained, we can use it in a generative way by simply taking a random sample from p(z) = N(0, I) and feeding it as an input to the decoder network. Alternatively, we can generate latent representations of our inputs and perform a clustering analysis, similar to a conventional autoencoder.

One major problem with VAEs in general is that the regularising effect of the KL term tends to negatively impact the reconstruction quality. With no adjustments, this typically results in blurrier reconstructions than a conventional autoencoder with the same architecture. There is a balance between how well the VAE can learn and disentangle the latent dimensions and how well it can use that space to reconstruct inputs. A simple way to address this balance is by introducing a weight on the KL term, either increasing its effect to improve disentanglement of image properties, or decreasing its effect to improve image quality at the cost of poorer sampling. So-called β-VAE networks, and their successors, have been studied widely, and we will discuss later how we include this in our architecture.

The example VAE given in Kingma & Welling (2013) uses only a single unit Gaussian component as its prior on z, but the framework can be generalised to include a mixture of Gaussians.
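Before turning to mixture priors, the single-Gaussian machinery above can be sketched in NumPy: the reparameterisation step of equation 3 and the unit-Gaussian KL term of equation 7. This is an illustration of the standard formulas, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def reparameterise(mu, log_var):
    """Equation 3: z = mu + sigma * eps with eps ~ N(0, I), keeping the
    sampling step differentiable with respect to mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_unit_gaussian(mu, log_var):
    """Equation 7: KL(q(z|x) || N(0, I)) per sample, summed over latent
    dimensions: -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

mu, log_var = np.zeros((4, 8)), np.zeros((4, 8))   # q(z|x) = N(0, I) exactly
z = reparameterise(mu, log_var)
print(z.shape)  # (4, 8); the KL term here is zero, since q already matches p(z)
```

Note that the network encodes log σ² rather than σ, matching the numerical-stability remark in the text.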
While there have been many examples of this, the two most prominent are VaDE (J16) and GMVAE (Gaussian Mixture Variational Autoencoders, Dilokthanakul et al. 2016), which vary in their implementations. The essential difference is that VaDE calculates a single encoded mean and variance for each sample, and learns the GMM means, variances and weights as trainable parameters, whereas GMVAE approximates the GMM means and variances with additional neural networks and keeps the component weights fixed. Recently, C20 presented a study of the state-of-the-art variational deep clustering methods and highlighted some key problems with the frameworks in terms of simplicity, scalability and stability.

For AstroVaDEr, we choose to implement the VaDE algorithm with the optimisations from s3VDC. We will now discuss the modifications to the ELBO and loss functions that turn a VAE into VaDE, and the steps taken to incorporate the s3VDC framework. To begin, let us describe the clustering model, which is parameterised by the categorical distribution p(c):

c ∼ Cat(π), s.t. Σ_{c=1}^C π_c = 1, (8)

where π_c are the weights of the clusters. We find that in practice Σ_{c=1}^C π_c = 1 does not always hold while the model is training, but experiments investigating constraining these values via normalisation or a softmax function did not produce improved results. With this
categorical distribution, we also change how the latent variable z behaves with the following prior:

z ∼ N(μ_c, σ_c I), (9)

where μ_c and σ_c are the means and diagonal covariances of the clusters, and I is the identity matrix. Under VaDE, π_c, μ_c and σ_c are all trainable weights within the neural network. The ELBO objective of VaDE is then given as:

L_Total = E_{q_φ(z,c|x)} [log p_θ(x̂|z)] − KL(q_φ(z,c|x) ‖ p(z,c)). (10)

As before, the first term in equation 10 is the reconstruction term, now given with respect to q_φ(z,c|x) instead of q_φ(z|x).
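Both directions of this mixture model can be sketched in NumPy: the generative direction draws a component c ∼ Cat(π) and then z ∼ N(μ_c, σ_c I), while the inference direction computes the cluster responsibilities γ_ic = p(c|z_i) as a softmax over the component log-joint probabilities, in the manner of a standard diagonal-covariance GMM. All parameter values below are placeholders, not fitted quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative diagonal-covariance GMM in a J-dimensional latent space.
K, J = 3, 4
pi = np.full(K, 1.0 / K)                               # component weights
mu_c = np.array([[0.0] * J, [5.0] * J, [-5.0] * J])    # component means
var_c = np.ones((K, J))                                # diagonal variances

def sample_latent(n):
    """Generative direction: c ~ Cat(pi), then z ~ N(mu_c, diag(var_c))."""
    c = rng.choice(K, size=n, p=pi)
    return c, mu_c[c] + np.sqrt(var_c[c]) * rng.standard_normal((n, J))

def responsibilities(z):
    """Inference direction: gamma_ic = p(c|z_i), a numerically stable
    softmax over log p(z_i|c) + log pi_c."""
    diff = z[:, None, :] - mu_c[None]
    log_pdf = -0.5 * np.sum(np.log(2 * np.pi * var_c)[None]
                            + diff**2 / var_c[None], axis=-1)
    log_joint = log_pdf + np.log(pi)[None]
    log_joint -= log_joint.max(axis=1, keepdims=True)
    w = np.exp(log_joint)
    return w / w.sum(axis=1, keepdims=True)

c, z = sample_latent(6)
gamma = responsibilities(z)
print(gamma.shape)  # (6, 3); each row sums to 1
```

In the trained network, latent draws from `sample_latent` would be passed through the decoder to produce synthetic images.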
The KL term now describes the Kullback–Leibler divergence between the latent variable z and the cluster model p(z,c). Within the neural network model, the KL component of the loss is calculated as follows:

L_VaDE(x_i) = −(1/2) Σ_{j=1}^J (1 + log σ²_izj) − Σ_{c=1}^K γ_ic log(π_c / γ_ic) + (1/2) Σ_{c=1}^K γ_ic Σ_{j=1}^J [log(σ²_cj) + (μ_izj − μ_cj)²/σ²_cj + σ²_izj/σ²_cj] + J log 2π, (11)

where J is the dimensionality of the embedded space, K is the number of cluster components, x_i is an input sample, μ_iz and σ_iz are the embedded mean and covariance representations of x_i from the encoder network, and γ_ic is the cluster probability of the input x_i. To calculate the cluster probabilities of an input sample, we follow suit with C20 and use the scikit-learn Python implementation (Pedregosa et al. 2011):

log p(c, z_i) = log p(z_i | c) + log p(c), (12)

γ_ic = exp(log p(c, z_i)) / Σ_{c'=1}^K exp(log p(c', z_i)). (13)

(The full calculation of γ_ic can be found in the Gaussian Mixture source code for scikit-learn; we also refer the reader to the GitHub repository for C20: https://github.com/king/s3vdc.)

Training of VaDE then follows the usual scheme for a VAE. A batch of samples x is embedded into the latent space represented by μ_z and σ_z. We use the reparameterisation trick to sample z, which is fed into a probabilistic decoder. The main difference is that the network now conditions the latent space such that z tends towards N(μ_c, σ_c I) instead of N(0, I).

C20 introduced a number of improvements to variational deep clustering methods which we implement in AstroVaDEr. First, instead of pretraining the network without a KL component, as done in VaDE, we use an α-training phase (C20 call this a γ-training phase, but we wish to avoid confusion with the cluster probability, γ_ic) which incorporates a low-
weighted KL regulariser from a simple N(0, I) prior for T_α epochs. This pretraining primes the network to produce good reconstructions without wandering so far from the latent prior that the embedded space becomes too unstructured for good clustering fits. During this phase we weight the KL loss component by a small constant factor α, such that the network loss at epoch t is calculated as:

L_Total = L_Recon + α L_KL,  for T_α ≥ t > 0.   (14)

The next optimisation is a phased annealing program. The GMM prior is introduced, but its contribution to the total loss is slowly increased during the ‘annealing phase’ using a weighting factor β, whose ramp-up follows a polynomial function until the KL weight reaches 1 + α. Following annealing, we train the model in a ‘static phase’ with equal weighting given to the reconstruction and KL divergence losses. These two phases are repeated a fixed number of times. The reason for this is to hamper the effects of the competing losses: the reconstruction loss pressures the model to use as much of the latent space as possible to improve image recovery, while the KL loss tries to force the embedded representations onto the prior distribution. By phasing the weight of the KL loss, we can improve the disentanglement between the latent variables without significantly reducing reconstruction quality. The annealing and static phases are repeated for M periods, lasting T_β + T_s epochs each, where T_β and T_s are the number of epochs in the annealing and static phases, respectively. The network losses during these phases are:

L_β = L_Recon + (β^u + α) L_KL,  s.t.  β = [ t − T_α − (m − 1)(T_β + T_s) ] / T_β,
for  T_α + (m − 1)(T_β + T_s) + T_β ≥ t > T_α + (m − 1)(T_β + T_s),   (15)

and

L_static = L_Recon + L_KL,   (16)

where t is the current epoch, m is the index of the current annealing–static period, and u is a polynomial factor that dictates the rate at which β increases to 1. The third recommendation from C20 concerns the initialization of the GMM weights.
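The per-epoch KL weight implied by Equations 14–16 can be sketched as a simple schedule function. This is a minimal sketch under our reading of the schedule; the default values of α and u are illustrative placeholders, not the values used in the paper:

```python
def kl_weight(t, T_alpha=100, T_beta=25, T_s=25, alpha=0.1, u=3):
    """KL-divergence weight at (1-indexed) epoch t for the
    alpha-training / beta-annealing / static schedule.
    alpha and u are illustrative values."""
    if t <= T_alpha:
        return alpha                      # alpha-training: small fixed weight
    # position within the repeating (annealing + static) periods
    phase = (t - T_alpha - 1) % (T_beta + T_s)
    if phase < T_beta:                    # annealing: polynomial ramp beta**u
        beta = (phase + 1) / T_beta
        return beta ** u + alpha          # reaches 1 + alpha at end of ramp
    return 1.0                            # static phase: equal weighting
```

At each epoch the total loss would then be computed as `L_recon + kl_weight(t) * L_kl`, with the weight resetting close to α at the start of every new annealing period.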
Naively, one might simply initialise the weights randomly, but this would require spinning up numerous versions of the model to find the best fit, which is time-consuming and requires a large amount of computing power. A common technique is instead to calculate the initial weights by fitting a GMM to the embedded representations of the training data after some fixed pretraining period. C20 opt for this approach, but they do so by investigating the number of input samples required to perform a satisfactory fit. In what they call ‘mini-batch GMM Initialization’, we simply fit a GMM using k batches of size L, instead of the full data set.

Finally, C20 address the problem of NaN losses in VDC models. The most common culprit of
NaN values in these networks can be traced back to the approximation of log p(c, z) and γ_c. Essentially, it is possible that a sample’s probability within a cluster falls so close to zero that it generates a NaN or Inf value within the loss function. C20 address this by introducing a min-max scaling of log p(c, z), which prevents its values from getting too small. Considering V as
the full matrix representation of log p(c, z), we calculate a scaled value V̂ with a range [−λ, λ]:

V̂ = 2λ (V − min V) / (max V − min V) − λ.   (17)

With these optimizations in mind, we will now turn our attention to the neural network architecture used to construct the encoders and decoders in our experiments.

AstroVaDEr is constructed using the Keras Python API (Chollet et al. 2015). The core architecture is based on the CNN developed by W20 for GZ2, which is in turn based on VGGNet (Simonyan & Zisserman 2014), and we follow suit with the core of both the encoder and decoder networks following this architecture. Figure 1 shows the model architecture employed in this work. The main model architecture is designed to be flexible, and the model parameters shown here are simply those that were chosen during our optimisation process, discussed in Section 4. In practice, the input size, number of convolutional filters, kernel sizes and latent variables can all be tuned for different applications.

The encoder takes an input image, a 128 × 128 pixel gray-scale image in our case, and successively reduces its size while learning the key features with convolutional and pooling layers. A ‘block’ in the encoder has two convolutional layers with a stride of 1 and ‘same’ padding, followed by a 2 × 2 max pooling layer. The encoder has three such blocks, which reduce the image size by a factor of 8 in total. The number of filters in each block, and the kernel sizes, are set during hyperparameter optimisation. The output from the third block is flattened to a one-dimensional array. The flattened activation maps are passed into a pair of dense layers, which encode µ_z and log(σ_z). We implement three custom layers: the first performs the reparameterisation trick to sample z from Equation 3, followed by a pass-through layer that initialises the GMM weights for training, and finally a layer which calculates the cluster probabilities for each input sample.

The decoder essentially performs the opposite transformation to the encoder. Each block in the decoder consists of two convolutional layers and a bi-linear 2 × 2 up-sampling layer. The embedded code for a sample is fed into a dense layer with the same number of units as the flattened encoder layer, and is then reshaped into an image with bands equal to the number of filters in the last encoder block.
The three decoder blocks use the same number of filters as the encoder but in reverse, and the final up-sampled output is fed into a final convolutional layer with either a single filter for gray-scale input or three filters for RGB images.

Throughout the network we choose to implement the Leaky ReLU activation function, with a small slope α for negative inputs. We use this function, as opposed to the popular ReLU activation, to tackle the ‘dying ReLU’ problem, which occurs when a ReLU-activated neuron is set to zero and can no longer contribute to learning; essentially, the neuron ‘dies’ (Lu et al. 2019). Leaky ReLU addresses this by allowing a small slope for negative values, thus preventing the gradient from going to zero when the activated output is forced to be positive. This activation is used for all convolutional layers in the encoder, the dense layer following the decoder input, and all but the final convolutional layer in the decoder. The embedded dense layers have a linear activation, and the output has a ReLU activation (the output is not affected by the dying ReLU problem because it is the first layer in the back propagation). All convolutional and dense layers implement an l2 regulariser on their weights and biases with a small factor.

The GMM weights are fairly well behaved, but we do note that it is useful to constrain the GMM covariances to keep them positive definite. In circumstances where there are too many mixture components, or poorly defined components from the initialisation, we find that the model will attempt to remove those components by shrinking their covariances to zero or even negative values, which results in NaN values propagating through the network when an inverse or logarithmic covariance is calculated.

Training is performed on two NVIDIA Tesla V100 graphics processing units (GPUs) on the University of Hertfordshire High Performance Computing cluster, using the
Keras multi_gpu_model functionality. We experimented with a variety of batch sizes, and settled on 180 samples per batch, split across the two GPUs. The main effect of different batch sizes is on the CPU bottleneck in loading and preprocessing the inputs. We choose to utilise the Keras ImageDataGenerator class to load each image from the hard disk, perform random transformations and feed batches to the network. If the batch size is too large, the network can struggle to generate batches fast enough for the GPU to process them through the model.
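This pipeline can be approximated in pure numpy for illustration. The sketch below is a hypothetical stand-in for the Keras ImageDataGenerator: disk loading is replaced by an in-memory array, and the transformations shown (flips and 90-degree rotations) are only a subset of those available:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(batch):
    """Apply simple random transformations to a batch of images (N, H, W, 1):
    horizontal/vertical flips and 90-degree rotations."""
    out = batch.copy()
    for i in range(out.shape[0]):
        if rng.random() < 0.5:
            out[i] = out[i, ::-1]                       # vertical flip
        if rng.random() < 0.5:
            out[i] = out[i, :, ::-1]                    # horizontal flip
        out[i] = np.rot90(out[i], k=rng.integers(4), axes=(0, 1))
    return out

def batch_generator(images, batch_size=180):
    """Yield shuffled, augmented mini-batches indefinitely."""
    n = images.shape[0]
    while True:
        idx = rng.permutation(n)
        for start in range(0, n - batch_size + 1, batch_size):
            yield augment(images[idx[start:start + batch_size]])

# Toy stand-in for the image catalogue (shapes illustrative).
images = rng.random((360, 8, 8, 1)).astype(np.float32)
gen = batch_generator(images, batch_size=180)
batch = next(gen)
print(batch.shape)  # (180, 8, 8, 1)
```

Because the generator yields batches lazily, the CPU-side augmentation runs while the GPU consumes the previous batch, which is the bottleneck trade-off described above.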
As stated above, the core architecture of AstroVaDEr is intended to be flexible to different applications. We will now discuss the specific settings which were used to produce the results in this paper. We used Bayesian hyperparameter optimisation (Bergstra et al. 2013, via Hyperopt) to narrow down the parameter search space, and then a manual search for settings that optimised reconstruction and clustering quality.

The network is optimised using the Adam optimiser (Kingma & Ba 2014); starting from the initial learning rate, we decrease the learning rate every five epochs using exponential decay to its final value. We set the number of latent variables to a small value; larger values do improve the reconstruction quality of the network, but for the purpose of this paper the lower number makes interpretation more straightforward. For the convolutional blocks, the encoder uses 64 filters in the first and second pairs of convolutions, and 16 filters in the final pair. The convolutions use (3 × 3) kernels in the first block, and (5 × 5) kernels in the second and third blocks. The decoder uses the same configuration, but in reverse. The flattened dimensionality of the encoder before the latent embedding is thus 16 × 16 ×
16 = 4096, which also corresponds to the number of units in the first fully-connected layer in the decoder.

The s3VDC framework introduces an additional set of hyperparameters that must be optimised. These include: the number of GMM mini-batches, α-training epochs, annealing phase epochs, static training epochs and annealing–static periods. These parameters pose a challenge for Bayesian optimisation, however, as they require full end-to-end runs of model training, each of which can take upwards of 12 hours to complete. Instead, we used a purely qualitative approach to setting these parameters, which first involved varying the length of the α-training period until we were satisfied (Footnote: specifically, the hyperas implementation, https://github.com/maxpumperla/hyperas)
Figure 1.
Diagrams representing the network architecture of AstroVaDEr. Shown on the top row is the Encoder network, E_φ(x) = z, and on the bottom row the Decoder network, D_θ(z) = x̂. The core layers of E_φ(x) and D_θ(z) consist of paired convolutional layers, each with ReLU activation; we use max pooling layers in the Encoder and up-sampling layers in the Decoder to decrease and increase the dimensionality, respectively. The paired convolutional layers have (64, 64, 16) filters, with kernel sizes of (3, 5, 5). We utilise layer flattening and reshaping to compress and decompress the activation maps in and out of the embedded space. The L-unit dense layers represent µ_z and log(σ_z), the latent embedding. p(z) represents the reparameterisation trick. We implement a Keras layer to contain the GMM weights at the end of the Encoder.

with the quality of the reconstructions, and then again manually adjusting the length and number of annealing periods until we found convergence in the GMM weights. We show the results of this search in Table 1. Future work will attempt to quantify and streamline this process, by investigating different early-stopping conditions at different stages of training and analysing how the number of epochs in each phase is related to the learning rate and the weighting factors on the KL divergence. One additional outcome of this search was that L_KL often overpowered the reconstruction losses under certain combinations of the number of Gaussian components and latent variables, even when we employed the annealing periods. This feature of the training requires further study to fully understand, but for the purposes of this work we found that including a small additional weighting on L_KL allowed for good reconstruction quality and clustering results.

Table 1. s3VDC hyperparameters found by manual search.

Hyperparameter       Value
α-training epochs    100
GMM batches          200
Annealing epochs     25
Static epochs        25

The final parameter we can control is perhaps the most important given our final goal of producing a generative classifier: the number of components to include in the Gaussian mixture. The
Figure 2.
Training (solid lines) and validation (dashed lines) losses for AstroVaDEr obtained while training the network using the hyperparameters discussed in this section. Top left is the total loss throughout training, calculated from Equations 14 & 15 before and after 100 epochs, respectively. This loss includes additional penalties to training from regularisation of the weights and biases. Top right shows the reconstruction loss, measured as the total mean squared error per sample. Bottom left shows the KL divergence associated with the vanilla VAE prior, p(z) = N(0, I), which is used in the first 100 epochs. Bottom right shows the KL divergence from the GMM prior, z ∼ N(µ_c, σ_c I), from epoch 100 onward, which is annealed following the procedure outlined in Section 3.3.

number of clusters to use in any unsupervised clustering task depends on the data set, the method and the desired result. For some problems one may know in advance the number of clusters the data naturally fall into; for example, in clustering the MNIST (LeCun et al. 1998) handwritten digits data set one knows there should be ten clusters. However, as is often the case with real-world applications of these algorithms, we do not have a priori knowledge of the ideal number of clusters for describing the data. This is further complicated by our goal of developing a generative model, which will be able to selectively generate synthetic images from a particular component in our Gaussian mixture. We have to balance, in essence, the philosophical question of ‘how many types of galaxy are there?’ with the practical question of ‘how many components can the network accurately model?’ Too few clusters and we risk blurring the definitions of different morphological types and losing the ability to distinguish between similar classes.
With too many clusters the network will begin to encounter sparsity problems, where poorly populated components are effectively turned off.

Comparing to other astronomical works, we see a wide variety of approaches to this problem. For example, approaching the problem from the perspective of supervised learning, we may simply attempt to define two clusters, following a binary classifier such as in W20. The problem here is that with a CNN classifier one can define the specific classes to be determined (e.g. spiral or elliptical, barred or unbarred), but we do not have this luxury. Two clusters may instead define edge-on versus face-on galaxies, or isolated galaxies versus crowded images. At the other end of the scale is the work of Martin et al. (2020), who use as many as 160 clusters in their unsupervised algorithm. There are a few key differences here: firstly, Martin et al. (2020) use a Growing Neural Gas model and hierarchical clustering to produce learned feature vectors for each object (which are analogous but not equivalent to our latent space), and each object is assigned a cluster based on k-means clustering, which typically does not fail to produce large numbers of roughly equally populated clusters. A GMM, especially as implemented in VaDE, is prone to model collapse due to exploding and vanishing covariances and vanishing weights on the individual components.

Two recent works in astronomy use similar techniques to ours to solve different problems. Cheng et al. (2020) use a CAE trained on simulated images of strong gravitational lenses, followed by a Bayesian GMM applied to the embedded samples post training to produce a classifier. They use a number of clusters in their mixture model equal to the number of latent dimensions, which in a simplified way can be explained as saying that each variable in the latent space (i.e. each unit in the fully connected layer) approximates to one feature and therefore one cluster.
The assumption that each variable only controls one feature, however, is not always true, which we explore in Section 5.3. Ralph et al. (2019) instead apply a CAE to images of radio galaxies in order to classify their morphologies, and produce clusters based on a Self-Organising Map (SOM) and k-means clustering of the latent space. They show clustering results for 4 and 8 clusters, but stress that those numbers were chosen as a general demonstration of the SOM for clustering purposes. It should be emphasised that both of these methods cluster the data on the latent space after training on conventional autoencoder architectures, while we are attempting to perform the task during training on a generative architecture.

We tested a range of techniques to find optimal cluster numbers, which we will briefly explain before discussing our chosen method. C20 demonstrate how they optimise the number of clusters for a dataset where the true number of clusters is unknown. They train their network fully with a range of values for the number of components in the GMM, and then compare various unsupervised clustering metrics. They use the Calinski–Harabasz (CH, Caliński & Harabasz 1974) index, a measure of the ratio between inter-cluster dispersion and intra-cluster dispersion that is higher when clusters are dense and well separated, and the simplified silhouette score (Rousseeuw 1987), which measures how similar samples are to their cluster members compared to samples outside their cluster, and ranges between −1 and 1, where higher numbers suggest better clustering. They also investigate the disentanglement of the clusters by projecting the embedded data into two dimensions with t-distributed stochastic neighbour embedding (TSNE, Hinton & Roweis 2002), and the marginal likelihood of the GMM over the dataset. C20 suggest that comparing all of these factors together can determine the best number of clusters to use.
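Both of these metrics are available in scikit-learn. A minimal sketch on synthetic embeddings (the data and cluster labels here are purely illustrative, standing in for the embedded vectors and their assigned clusters):

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(1)

# Two well-separated synthetic "clusters" of embedded vectors.
z = np.vstack([rng.normal(0, 0.5, size=(200, 8)),
               rng.normal(5, 0.5, size=(200, 8))])
labels = np.repeat([0, 1], 200)

ch = calinski_harabasz_score(z, labels)   # higher = denser, better-separated clusters
sil = silhouette_score(z, labels)         # in [-1, 1], higher = better clustering
print(f"CH={ch:.0f}, silhouette={sil:.2f}")
```

In the procedure C20 describe, these scores would be computed on the trained embeddings for each candidate component count and compared alongside TSNE projections and the GMM marginal likelihood.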
We explored this option, but the main difficulty with these methods when working with very complex data is that training just an individual instance of the network takes many hours. Training the network multiple times with different clustering settings end-to-end isn’t computationally efficient (especially when we consider the already substantial environmental impact of high performance computing, Portegies Zwart 2020). Added to that, the experiments we did perform were largely inconclusive due to the high overlap between the learned GMM components (see Section 5.2 for details).

However, we settled on a fairly intuitive methodology that finds an optimum number of clusters without needing to train multiple versions of the full AstroVaDEr network, and which exploits the behaviour of the scikit-learn (Pedregosa et al. 2011) Bayesian Gaussian Mixture (BGM) method. BGM is a variational GMM and uses the expectation-maximisation technique (Dempster et al. 1977) to maximise the ELBO of the log-likelihood (Kullback & Leibler 1951; Attias 2000; McLachlan & Krishnan 1997; Bishop 2006). This method can be employed in a mode which approximates an infinite-mixture model with a Dirichlet Process (Ferguson 1973). In this mode, the algorithm naturally sets low-probability clusters to have zero contribution to the mixture, essentially setting the number of components automatically. How this is achieved can be understood with the stick-breaking analogy: consider a unit-length stick that represents the Dirichlet Process, and step by step we break off a piece of that stick. Each broken piece of stick represents a group of points that fall within the mixture, and has a random length that defines the weight of the piece. The remaining length of stick is successively broken down, and the last piece is used to represent the points that don’t fall into the other pieces.
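This stick-breaking behaviour can be sketched with scikit-learn's BayesianGaussianMixture: fit with deliberately more components than needed and inspect which weights survive. The toy data, component count and concentration prior below are illustrative, not the settings used in the paper:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)

# Toy embeddings drawn from three well-separated "true" clusters.
z = np.vstack([rng.normal(m, 0.3, size=(300, 4)) for m in (-4, 0, 4)])

# Dirichlet-process mode with an over-specified component count.
bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.01,   # encourages unused components to vanish
    max_iter=500,
    random_state=0,
).fit(z)

# Components with non-negligible weight approximate the effective cluster count.
n_effective = int(np.sum(bgm.weights_ > 0.01))
print(n_effective)
```

Repeating such fits over different mini-batches of embedded samples and looking for a consistent surviving component count is the essence of the selection procedure described here.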
In this way, as long as we choose a number of components that is greater than the true optimum number of clusters, the excess components will have their weights set to zero and be removed from the model. We used this behaviour to choose the number of clusters to fit to our data by initialising many instances of a BGM with different mini-batches of embedded samples, and found that a fairly consistent mixture configuration emerged. Compared to the other methods discussed here, this stick-breaking method is computationally fast, consistent, intuitive, and crucially doesn’t require comparing clustering metrics and visualisations of a dozen fitted models where the best options aren’t immediately apparent.

For this work, we trained AstroVaDEr with the hyperparameters discussed previously, and show the losses obtained during training in Figure 2. Losses from the training data and validation data are shown with solid and dashed lines, respectively, and in all cases are the mean values over all batches. For the total, reconstruction and vanilla VAE KL divergence we show the values across all epochs; for the GMM KL divergence we show its values from epoch 100 onward, as we do not train the GMM parameters in the first 100 epochs. The α, annealing and static training phases are all recognisable in the plots, with the α-training taking the first 100 epochs and the other phases alternating thereafter as previously described.

In the top left of Figure 2 we show the total loss calculated from Equations 14 & 15, plus the penalties associated with the L2 regularisers on the model weights and biases. In the first 100 epochs the total loss approaches, but does not fully reach, a minimum; this suggests that we could potentially train longer in this phase, but that runs the risk of collapsing the latent space onto the unit Gaussian prior too much.
As the annealing factor on the GMM loss increases during the β-annealing phase, the total loss increases with the cubic factor on the weight, and then decreases slowly in the first static phases before plateauing in the later ones. By the end of training, the total loss has flattened in the final static phase, and the minimum total loss achieved when the annealing factor is at its minimum does not appreciably change through the training.

The reconstruction losses, measured as the mean squared error per sample in the training and validation sets, are shown in the top right plot of Figure 2. The error here follows a similar pattern to the total loss, decreasing steadily during the α-training phase and then increasing, plateauing and minimising as the network cycles through the β and static training phases. Unlike the total losses, we do see some reduction in the peak reconstruction error with each annealing phase, and additional training may result in better imaging quality at the expense of increased computational time. The exact effect of the static phases on the reconstruction quality is not entirely clear from our study, as it does not seem to change during those epochs. According to C20, this is a fine-tuning stage where the network should be attempting to find an equilibrium between the clustering and reconstructions, but further testing is needed to be more specific.

The last two panels show the KL divergences for the unit Gaussian and GMM priors on the left and right, respectively. By design, due to the small weight applied during α-training, the unit Gaussian divergence does not fall by a large amount. During the annealing periods the unit Gaussian loss is not used in the training, but we track it to compare with the GMM loss. As the weighting factor on the GMM loss increases, the loss decreases to a minimum, where it remains steady during the static training. We do note that, like the
peak reconstruction loss decreasing with each annealing period, sotoo does the peak GMM divergence when the static phases end andthe weighting factor is reset close to zero.
The first step in assessing the quality of the trained model is to examine its ability to reconstruct test images that were not included in the training sample. We draw from the test dataset, defined in Section 2, for this purpose. We assess the reconstructions in two ways: first by a qualitative visual inspection, and second by investigating the objects that the network performs best and worst at reproducing. Reconstructions are produced by passing test images, without any augmentations, into the AstroVaDEr network, where they are embedded into the latent space by the encoder and then retrieved by the decoder.

Figure 3 shows a random selection of galaxies, with their input images, reconstructions, and residuals. The residuals provide a better idea of which parts of the galaxies the network reconstructs well and where it struggles. Broadly speaking, the overall shape of each galaxy is well preserved by the network, by which we mean that global properties such as axis ratio/inclination angle, surface brightness, angular size, and orientation are all visually reproduced. VAEs like AstroVaDEr have the well-known property of producing blurry or ‘fuzzy’ images, which we see here as well. We can see in the reconstructions and residuals that the network is effectively denoising the input image; however, the finer details are smoothed out. The result is that the network struggles to reproduce fine-scale morphological features like spiral arms and bars. Inspecting some of the highly featured objects in Figure 3, we can clearly see that the signatures of spiral arms, bars and rings all appear in the residuals.

Addressing the blurriness of VAE-generated images (both reconstructions and synthetic images) has been an ongoing area of research, and progress has been made (Yeung et al. 2017; Asperti 2018; Dai & Wipf 2019; Asperti & Trentin 2020; Kobayashi et al. 2020).
The key issue, as discussed in Asperti & Trentin (2020), appears to be the capacity and sparsity of the network. There needs to be sufficient capacity, i.e. enough latent variables, to contain the necessary information for reconstruction, but one must also prevent the collapse of those variables, which induces sparsity in the sense that the network tries to use as few of the latent variables as possible.

Tackling this problem is a matter of carefully balancing the reconstruction loss and the KL divergence between the latent embedding and the prior distribution. Essentially, as the network begins training it will force σ_z to be as close to zero as possible, ensuring a high degree of confidence in z and therefore x̂. However, as the reconstruction loss decreases, and the network focuses more on optimising the KL divergence (in our case, between the embedding and the learned GMM), it instead increases σ_z to improve the overall coverage of the latent space. With a less certain measure of z, there is a subsequent loss in reconstruction quality. Solving this problem with AstroVaDEr in its current state is beyond the scope of this work, because the techniques that have been developed to do so were developed on ‘vanilla’ VAEs with a unit Gaussian prior. Future work on improving AstroVaDEr will involve integrating one or more of the recent developments in balancing the competing losses with the GMM prior.

We have already seen that the network struggles with finer structural details within the reconstructions. To further investigate the strengths and weaknesses of the network, we present Figure 4. We calculate the mean squared error for each galaxy in the test set and present the highest- and lowest-error objects. The top two rows show input and reconstruction images of the five galaxies with the lowest mean squared error.
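The per-object error used for this ranking is simply the mean squared error per sample; a minimal numpy sketch (array shapes illustrative):

```python
import numpy as np

def per_sample_mse(x, x_hat):
    """Mean squared error for each image in a batch of shape (N, H, W, 1)."""
    return np.mean((x - x_hat) ** 2, axis=(1, 2, 3))

# Toy inputs and "reconstructions" for illustration.
x = np.zeros((4, 8, 8, 1))
x_hat = np.full((4, 8, 8, 1), 0.5)
errs = per_sample_mse(x, x_hat)
print(errs)  # [0.25 0.25 0.25 0.25]
```

Sorting the test set by this quantity (e.g. with `np.argsort(errs)`) yields the best- and worst-reconstructed objects shown in Figure 4.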
As one might expect from Figure 3, these images are smooth galaxies; they have a variety of axis ratios and surface brightness profiles, but lack high-spatial-frequency features like spiral arms. The objects with the highest reconstruction errors are shown in the bottom row, and reveal the two broad categories that AstroVaDEr struggles with the most in terms of mean squared error. The high-error objects include those that have visible artifacts and stars, as well as objects that are crowded by multiple nearby sources, be they foreground, background or interacting. One could argue that it would make sense to utilise the Galaxy Zoo labels to remove objects that have imaging artifacts. That would not only defeat the purpose of approaching this task completely unsupervised but, as we will see in the next section, these objects are identifiable by the network despite their low reconstruction quality, and this result can be exploited in other ways, such as automatic artifact detection.

The image reconstructions presented here are a limited sample of the 41,000 images in the test dataset. However, this qualitative analysis demonstrates the reconstructive abilities of the network and highlights important areas for improvement. Of course, the reconstructive ability of AstroVaDEr is not the primary goal, and we will now look at the first of the two main tasks of the network: unsupervised clustering.

Assessing the quality of unsupervised clustering results is a difficult task, especially given that we have no exact ground-truth label to compare our results to. This is perhaps made more difficult by the fact that AstroVaDEr is probabilistic, owing to its variational inference framework and GMM prior. Therefore we do not get a fixed cluster label for each object, but rather a likelihood that an object is drawn from a particular component in the mixture.
Wecan use this to our advantage by inspecting galaxies based on theirmost likely component, for which we assign a cluster label, and bylooking at the most probable objects in each component.Before visually assessing the galaxies, let us begin by lookingat the GMM properties themselves. In Table 2, we show how manygalaxies are assigned each cluster label based on their highest prob-ability component in the GMM. We can see that there is a goodrange of coverage between the components, with clusters 2, 7 and3 being the most populated, while cluster 9 is the least populated.Since the mixture model is probabilistic, we can also interrogate thedegree of overlap between the clusters in a number of ways. First,we show three different cut off values for γ c , the predicted clusterprobability for each galaxy, which could be used in a similar manneras cuts in the Galaxy Zoo 2 catalog for the basis of selecting cleansamples. Recall that γ c is normalised as per Equation 17. We alsoapply a softmax operation to γ c to force the values to sum to unity.With γ c > . , a large number of galaxies appear in multiple com-ponents, indicating that this threshold is not high enough to makeany clean cuts. At γ c > . , the cluster assignments drop belowthe values of those found from the highest probability componentfor each object. The number of objects that have γ c > . in anycomponent is 28,781, just 70% of the full test sample. The final cutof γ c > . , shows that clusters 2 and 10, which were the most MNRAS000
The first step in assessing the quality of the trained model is to examine its ability to reconstruct the test images that were not included in the training sample. We draw from the test dataset, defined in Section 2, for this purpose. We assess the reconstructions in two ways: first by a qualitative visual inspection, and second by investigating the objects that the network performs best and worst at reproducing. Reconstructions are produced by passing test images, without any augmentations, into the AstroVaDEr network, where they are embedded into the latent space by the encoder and then retrieved by the decoder.

Figure 3 shows a random selection of galaxies, with their input images, reconstructions, and residuals. The residuals provide a better idea of which parts of the galaxies the network reconstructs well and where it struggles. Broadly speaking, the overall shape of each galaxy is well preserved by the network, by which we mean that global properties such as axis ratio/inclination angle, surface brightness, angular size, and orientation are all visually reproduced. VAEs like AstroVaDEr have the well-known property of producing blurry or 'fuzzy' images, which we see here as well. We can see in the reconstructions and residuals that the network is effectively denoising the input image; however, the finer details are smoothed out. The result is that the network struggles to reproduce fine-scale morphological features like spiral arms and bars. Inspecting some of the highly featured objects in Figure 3, we can clearly see that the signatures of spiral arms, bars and rings all appear in the residuals.

Addressing the blurriness of VAE-generated images (both reconstructions and synthetic images) has been an ongoing area of research, and progress has been made (Yeung et al. 2017; Asperti 2018; Dai & Wipf 2019; Asperti & Trentin 2020; Kobayashi et al. 2020).
The key issue, as discussed in Asperti & Trentin (2020), appears to lie in the capacity and sparsity of the network. There needs to be sufficient capacity, i.e. enough latent variables, to contain the necessary information for reconstruction, but one must also prevent the collapse of those variables, which induces sparsity in the sense that the network tries to use as few of the latent variables as possible.

Tackling this problem is a matter of carefully balancing the reconstruction loss and the KL divergence between the latent embedding and the prior distribution. Essentially, as the network begins training it will force σ_z to be as close to zero as possible, ensuring a high degree of confidence in z and therefore x̂. However, as the reconstruction loss decreases and the network focuses more on optimising the KL divergence (in our case, between the embedding and the learned GMM), it instead increases σ_z to improve the overall coverage of the latent space. With a less certain measure of z, there is a subsequent loss in reconstruction quality. Solving this problem with AstroVaDEr in its current state is beyond the scope of this work, because the techniques that have been developed to do so were developed on 'vanilla' VAEs with a unit Gaussian prior. Future work on improving AstroVaDEr will involve integrating one or more of the recent developments in balancing the competing losses with the GMM prior.

We have already seen that the network struggles with finer structural details within the reconstructions. To further investigate the strengths and weaknesses of the network, we present Figure 4. We calculate the mean squared error for each galaxy in the test set and present the highest- and lowest-error objects. The top two rows show input and reconstruction images of the five galaxies with the lowest mean squared error.
As one might expect from Figure 3, these images are smooth galaxies; they have a variety of axis ratios and surface brightness profiles, but lack high-spatial-frequency features like spiral arms. The objects with the highest reconstruction errors are shown in the bottom two rows, and reveal the two broad categories that AstroVaDEr struggles with the most in terms of mean squared error. The high-error objects include those that have visible artifacts and stars, as well as objects that are crowded by multiple nearby sources, be they foreground, background or interacting. One could argue that it would make sense to utilise the Galaxy Zoo labels to remove objects that have imaging artifacts. That would not only defeat the purpose of approaching this task completely unsupervised but, as we will see in the next section, these objects are identifiable by the network despite their low reconstruction quality, and this result can be exploited in other ways, such as automatic artifact detection.

The image reconstructions presented here are a limited sample of the 41,000 images in the test dataset. However, this qualitative analysis demonstrates the reconstructive abilities of the network and highlights important areas for improvement. Of course, the reconstructive ability of AstroVaDEr is not the primary goal, and we will now look at the first of the two main tasks of the network: unsupervised clustering.

Assessing the quality of an unsupervised clustering result is a difficult task, especially given that we have no exact ground-truth label to compare our results to. This is made more difficult by the fact that AstroVaDEr is probabilistic, owing to its variational inference framework and GMM prior. Therefore we do not get a fixed cluster label for each object, but rather a likelihood that an object is drawn from a particular component in the mixture.
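The best- and worst-reconstruction selection described above amounts to ranking the test set by per-image mean squared error. A minimal sketch, assuming the images and their reconstructions are held as numpy arrays (the array and function names are illustrative, not from the paper's code):

```python
import numpy as np

def rank_by_mse(images, reconstructions, k=5):
    """Return indices of the k lowest- and k highest-MSE objects.

    images, reconstructions : arrays of shape (N, H, W)
    """
    # per-image mean squared error over all pixels
    mse = np.mean((images - reconstructions) ** 2, axis=(1, 2))
    order = np.argsort(mse)
    # lowest-error objects first; highest-error objects, worst first
    return order[:k], order[-k:][::-1]

# usage sketch: best, worst = rank_by_mse(x_test, x_hat, k=5)
```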
We can use this to our advantage by inspecting galaxies based on their most likely component, for which we assign a cluster label, and by looking at the most probable objects in each component.

Before visually assessing the galaxies, let us begin by looking at the GMM properties themselves. In Table 2, we show how many galaxies are assigned each cluster label based on their highest-probability component in the GMM. We can see that there is a good range of coverage between the components, with clusters 2, 7 and 3 being the most populated, while cluster 9 is the least populated. Since the mixture model is probabilistic, we can also interrogate the degree of overlap between the clusters in a number of ways. First, we show three different cut-off values for γ_c, the predicted cluster probability for each galaxy, which could be used in a similar manner to cuts in the Galaxy Zoo 2 catalogue as the basis for selecting clean samples. Recall that γ_c is normalised as per Equation 17; we also apply a softmax operation to γ_c to force the values to sum to unity. At the lowest cut, a large number of galaxies appear in multiple components, indicating that this threshold is not high enough to make any clean cuts. At the intermediate cut, the cluster assignments drop below the values found from the highest-probability component for each object; the number of objects that exceed this threshold in any component is 28,781, just 70% of the full test sample. The final, highest cut shows that clusters 2 and 10, which were the most

Figure 3.
Image reconstructions of a random sample of galaxies. Each galaxy is shown with the non-augmented, gray-scale input image, followed by the AstroVaDEr reconstruction and then the residual between the input and reconstruction. Input images and reconstructions are shown with a linear scale and pixel range of (0, .
Figure 4.
Comparison of reconstruction quality between galaxies with low reconstruction loss (top two rows) and high reconstruction loss (bottom two rows). The galaxies were selected by calculating the mean squared error between the input images and reconstructions, and choosing the five galaxies with the lowest and the five with the highest errors.
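The clean-sample cuts on γ_c described in the text, counting per component the objects whose responsibility exceeds a threshold, can be sketched as follows (the function names and the softmax normalisation step are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, forcing responsibilities to sum to unity."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def count_members(gamma, threshold):
    """Count, per component, the objects whose responsibility exceeds a cut.

    gamma : (N, K) responsibilities, each row summing to 1
    Returns the per-component counts and the number of objects that pass
    the cut in at least one component.
    """
    passed = gamma > threshold
    per_component = passed.sum(axis=0)          # objects per component
    any_component = passed.any(axis=1).sum()    # objects passing anywhere
    return per_component, any_component
```

A galaxy may exceed a low threshold in several components at once, which is exactly the multi-membership behaviour discussed for the lowest cut.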
Table columns: Mixture Component; No. Objects; No. Objects with γ_c above each of the three cut-off values; Mean Silhouette Score.
Mean Silhouette Score per component: −0.151, −0.132, 0.112, −0.063, −0.215, −0.185, −0.195, −0.078, −0.094, −0.257, 0.034, −0.160
Table 2.
The number of objects from the test set in each mixture component, based on the highest-likelihood component for each object.

populated, are now empty, while the less populated clusters now retain a higher proportion of their occupants. This relationship between the cluster occupancies would seem to indicate that the large components have a higher degree of overlap with each other than with the full mixture. There are a number of ways to measure the overlap between clustering results, and we show a few here. Table 2 shows the mean Simplified Silhouette Score, s(i), for objects that are assigned each component label. The silhouette score for a sample i is calculated as:

s(i) = 1 − a(i)/b(i),   if a(i) < b(i),
       0,               if a(i) = b(i),      (18)
       b(i)/a(i) − 1,   if a(i) > b(i),

where a(i) is the mean distance between i and its fellow cluster members, and b(i) is calculated by finding the smallest mean distance
Figure 5.
Cluster overlap matrix, which shows the probability that an object in the test set which resides in one cluster could also reside in another cluster. The primary clusters along the x-axis refer to the label assigned to each galaxy, based on its highest likelihood among the mixture components. The colours and numbers in each element are the mean likelihoods of objects for the secondary mixture component; for example, an object in a given cluster on the x-axis has some average likelihood of being assigned to each cluster on the y-axis.

between i and the members of each other cluster in the model (Rousseeuw 1987). A silhouette score is a measure of how similar an object looks to its fellow cluster members, compared to the samples outside its cluster, where higher scores mean that a cluster is more similar to itself than to the full sample. Cluster 2 has the highest mean silhouette score, which indicates that its members are more similar to each other than they are to the full sample, whereas, for example, clusters 4, 9 and 10 have lower values, suggesting there are similarities between these objects and the rest of the sample. We note that the mean score for the full sample is slightly negative and close to zero, which suggests, as the other measurements do, that there is a high degree of similarity between the clusters.

As a final test of the inter-cluster overlap, we provide Figure 5. This figure shows, for each component along the horizontal axis, the mean γ_c for galaxies with that label in each component of the mixture. These mean probabilities show quite clearly the groupings of clusters that appear to overlap on different subsets of objects. Choosing cluster 2 on the horizontal axis, for example, we see that those objects have an almost equal probability of being assigned a cluster 10 label. In contrast, an object in cluster 9 has almost no chance of being in cluster 2.
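The per-object score of Equation 18 can be computed directly from the latent vectors and cluster labels, using the definitions of a(i) and b(i) given above. A minimal sketch, assuming at least two clusters and illustrative array names:

```python
import numpy as np

def silhouette(z, labels):
    """Per-object silhouette score s(i), following Equation 18.

    z      : (N, D) latent vectors
    labels : (N,) integer cluster assignments (at least two clusters)
    a(i): mean distance from i to the other members of its own cluster.
    b(i): smallest mean distance from i to the members of any other cluster.
    """
    n = len(z)
    # pairwise Euclidean distances between all latent vectors
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    scores = np.empty(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False  # exclude i itself from its own cluster
        a = dist[i, same].mean() if same.any() else 0.0
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        # Equation 18, piecewise
        if a < b:
            scores[i] = 1.0 - a / b
        elif a == b:
            scores[i] = 0.0
        else:
            scores[i] = b / a - 1.0
    return scores
```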
This analysis shows, in several ways, what one might expect for such a model of galaxy morphology: there are no truly 'distinct' classes of objects, and what we describe as a galaxy's morphology is a combination of many different components. While not unique to astronomy, the problem of unbalanced and blended classes of objects in image recognition and unsupervised clustering is particularly prominent in morphological classification, and it has been argued that this makes astronomy an ideal field for pushing boundaries in machine learning research (Kremer et al. 2017).

We will now turn to visually inspecting the objects within the clusters. Since we know which components ought to contain similar objects, we can use that information to inform us as to why they have been separated. For each component, we show a random selection of objects, and the objects with the highest probabilities of carrying that label, in Figures 6 and 7, respectively. Shown here are the reconstructed images of these objects; the input images are provided in the Appendix for reference. An immediate, but perhaps surprising, observation can be made from both of these figures: the components broadly fall into two groups, that is, objects with and without secondary sources in the images. We also see that, for the components containing objects whose appearance is highly dependent on orientation, those objects have been split between components. We find five clusters that contain galaxies without secondary sources, six that do contain secondaries, and one final cluster that appears to be predominantly corrupted images or very complex
Figure 6.
Image reconstructions of test data objects assigned to each cluster. This figure shows a random selection of galaxies from those that have the appropriate cluster label. Images are shown with a linear scale and pixel range of (0, .

systems. Of the five clusters without secondaries, two contain disc galaxies with high inclinations that are separated by rotation, and one appears mainly to contain 'larger' galaxies such as face-on discs and elliptical galaxies. The final two clusters are difficult to separate visually based on the reconstructed images alone. The immediate difference again appears to be size and brightness, but there also appears to be a factor of surface brightness profile, as the galaxies in cluster 2 (top right) appear more compact and centrally dominated.

The six clusters containing objects that generally have fewer than two secondary sources pose a significant challenge for AstroVaDEr. These galaxies have not been separated at all based on the morphology of the primary source, but rather on the position of the secondary sources. In the top-left cluster, the secondaries are all positioned along the uppermost edge of the images, while the centre-top objects all have a secondary to their left. While this may seem nonsensical in the context of morphological classification, it does tell us something very interesting about AstroVaDEr, which is related to the very nature of autoencoder architectures in general.

The clusters which are morphologically dominated are those in Figure 5 which have lower likelihoods among their defined members (the values along the diagonal of the confusion matrix). This makes sense when we consider that the remaining components contain galaxies of all shapes. A galaxy without a bright secondary source that is an edge-on disc would be equally likely to be in any of the secondary-source clusters, but only have a high likelihood in one or two of the morphological groups.
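The mean-likelihood grouping shown in the Figure 5 overlap matrix (average responsibilities of objects grouped by their primary, highest-likelihood label) can be sketched as follows; array and function names are illustrative:

```python
import numpy as np

def overlap_matrix(gamma):
    """Mean responsibility per secondary component, grouped by primary label.

    gamma : (N, K) responsibilities. Element (j, c) of the result is the
    mean likelihood of component j for objects whose primary label is c.
    """
    primary = gamma.argmax(axis=1)  # highest-likelihood component per object
    K = gamma.shape[1]
    M = np.zeros((K, K))
    for c in range(K):
        members = gamma[primary == c]
        if len(members):
            M[:, c] = members.mean(axis=0)
    return M
```

Morphologically dominated clusters then show up as columns with lower diagonal values, since their members spread probability over several components.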
Contrast this with an edge-on galaxy that has a bright secondary, which would only fit in the component that matches the position of its companion and the edge-on clusters. We also note that this agrees with the silhouette scores in Table 2: all the components except the two shown top right and bottom centre in Figures 6 and 7, respectively, have negative silhouette scores, and the lowest values correspond to components that contain galaxies with a range of morphologies. The two components with positive scores contain objects which could be considered very "average" or "normal", i.e. they are round, regular and smooth looking. Even the components shown second row-left, third row-middle and third row-right in the figures have scores much closer to zero than the groups with secondary sources.

The GMM we trained is a representation of the latent space that objects are embedded into, and the types of objects that are grouped together in the 20D space depend on how much of the network
Figure 7.
Image reconstructions of test data objects assigned to each cluster. This figure shows the objects with the highest cluster likelihood among those that have the appropriate cluster label. Images are shown with a linear scale and pixel range of (0, .

capacity is being used to learn those features. In this iteration, AstroVaDEr has apparently committed a significant fraction of its capacity to accurately recreating the presence, size, brightness and position of the secondary, tertiary, etc. sources in these images. This is an example of the ways in which unsupervised learning algorithms can produce representations of data that human brains would not consider. In Section 2 we deliberately chose not to rotate our images as part of the pre-processing before training. In part, this was because we expected that horizontal and vertical flips would impart some degree of rotational invariance, but also because we were interested in the specific properties of the embedded space without those rotations.

We do note, however, that simply including random rotations in the training does not guarantee that the network will not still use the same capacity to store the orientation of an object. Consider a supervised CNN where the goal is to reproduce known labels: in a sense, the network 'knows' that a rotated image should have the same label, and so the rotation is managed as part of the learning. Under unsupervised assumptions, however, the network does not know that two rotated images could be the same object, and as such has no pressure to give them the same label.
For AstroVaDEr, when an image is rotated it will naturally end up in a different region of the latent space, and therefore be better represented by a different component. We will explore this in Section 5.3, and demonstrate how rotation is actually embedded within the different latent variables.

One of the benefits of the probabilistic nature of GMMs, and in particular of the model trained here, which has a high degree of overlap between its components, is that we can choose to ignore certain components. Throughout the clusters that have secondary sources, we noted that the overall morphologies are ignored. However, if we take the galaxies assigned to those clusters, we can then give them a secondary label based on the probabilities of the five

(Footnote: In fact, rotational invariance may not even be the correct mode of thinking, as it would imply that the network output, i.e. the reconstructed images, would be invariant to rotation. Instead, it could be more useful to consider a possible rotation 'equivariance', where the assigned cluster does not change when an image is rotated and, subsequently, similar objects which have relative rotations should also be assigned the same labels. For more information on rotational invariance/equivariance, and one potential way this can be addressed, see Prasad et al. 2020.)
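The secondary-labelling idea, restricting the responsibilities to a chosen subset of components and re-normalising them, can be sketched as follows (the function name and the component indices are illustrative assumptions):

```python
import numpy as np

def secondary_label(gamma, subset):
    """Assign each object to its most likely component within a subset.

    gamma  : (N, K) responsibilities over all K mixture components
    subset : indices of the components to consider (e.g. the purely
             morphological clusters)
    Returns the chosen component index (from `subset`) per object and
    the re-normalised probabilities over the subset.
    """
    sub = gamma[:, subset]
    sub = sub / sub.sum(axis=1, keepdims=True)  # re-normalise to unity
    choice = np.asarray(subset)[sub.argmax(axis=1)]
    return choice, sub
```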
Figure 8.
Image reconstructions of test data objects divided by their most likely morphological cluster (see the discussion in Section 5.2). For each component, we show those galaxies that have the highest cluster likelihoods, selected from those that were not assigned to these components due to the presence of a secondary source. Images are shown with a linear scale and pixel range of (0, .

morphological clusters. In Figure 8, we show the highest-probability objects from the 13,000 objects with secondary sources (excluding those in the cluster of corrupted and complex images) in each of the purely morphological clusters. Doing so allows us to retrieve a morphological label for those galaxies, despite the fact that the network primarily assigns them to other components.

These clustering results, while they may not reflect human intuition, are clearly doing a sensible job of providing some systematic measure of a galaxy's morphology. Equally, they give us greater insight into the inner workings of unsupervised clustering in the context of morphological features, which can direct future work in refining these methods.

Aside from clustering, the other main task of AstroVaDEr is to act as a generative network. The generative properties allow us to create synthetic images, as we will demonstrate, but can also be leveraged to investigate the properties of the latent variables that make up the embedded space. We will begin by simply creating new images and seeing how they hold up compared to the reconstructions and original images. Then we will explore the latent space by interpolating through it and observing how the visual properties of the generated images change.

One of the ways the generative model can be used in practice is in the data-driven production of a realistic synthetic imaging dataset that can be used to test data analysis pipelines intended for next-generation surveys, before those surveys begin releasing public
Figure 9.
Synthetic images generated using the AstroVaDEr network. For each mixture component, c, we randomly sample from the multivariate Gaussian N(µ_c, σ_c I). The sampled vectors are passed into the decoder arm of AstroVaDEr. Images are shown with a linear scale and pixel range of (0, .

science data to the community. Synthetic image generation with AstroVaDEr works as follows:

(i) Choose a cluster, c, either by choice or from c ∼ Cat(π);
(ii) Sample z from the multivariate Gaussian N(µ_c, σ_c I);
(iii) Feed z into the decoder network to generate a synthetic x̂;
(iv) Repeat until the desired number of images has been generated.

Note that we have a choice of whether to generate a fixed number of images for a particular cluster, or to sample over the full mixture and generate a dataset that reflects the morphological distribution of the training data. For example, we may choose to sample only from the five morphological clusters, as we know that the primary properties of those components are related to key morphological characteristics.

To begin with, let us recreate Figure 6 by randomly sampling from each component separately. The results of this generative sampling are shown in Figure 9. For the five morphological clusters in question, the resemblance to the image reconstructions is striking. Each of the clusters contains objects that look very similar to the random selection shown in Figure 6, and a few even resemble the highly probable objects in Figure 7.

There appears to be a larger range in the quality of the synthetic images in the remaining components of the model. Individually, most of the synthetic images look quite plausible and exhibit morphologies that are interesting and complex, but not outside the realms of reality. Comparing the synthetics in this figure to the reconstructions in Figures 6 and 7, however, it does seem that AstroVaDEr is producing more objects with these complex morphologies.
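Steps (i) and (ii) above amount to ancestral sampling from the learned GMM prior with diagonal covariance. A minimal sketch (the function name is illustrative; the decoder call at the end stands in for the trained decoder arm):

```python
import numpy as np

def sample_latents(pi, mu, sigma, n, rng=None):
    """Draw n latent vectors from a GMM prior with diagonal covariance.

    pi    : (K,) mixture weights, summing to 1
    mu    : (K, D) component means
    sigma : (K, D) component standard deviations
    """
    rng = np.random.default_rng(rng)
    # (i) choose a component for each sample, c ~ Cat(pi)
    c = rng.choice(len(pi), size=n, p=pi)
    # (ii) sample z ~ N(mu_c, diag(sigma_c**2))
    z = mu[c] + sigma[c] * rng.standard_normal((n, mu.shape[1]))
    return c, z

# (iii) in practice z would then be fed to the trained decoder, e.g.
# images = decoder.predict(z)
```

Fixing `c` to a single component instead of sampling it reproduces the per-cluster sampling used for Figure 9; sampling `c ~ Cat(π)` reproduces the morphological distribution of the training data.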
While the relative positioning of the secondary sources is mostly consistent, the fact that these clusters contain galaxies with all morphologies leads to some blending of different characteristic properties that definitely limits their quantitative value. Some images take on a 'wispy' quality, where multiple secondary sources have blended together, almost resembling (but certainly not representing) gravitational lenses. The bottom-left cluster, which is made up of anomalous, corrupted and complex images in the training data, contains some very strange objects: some resemble merger remnants, but interpretations of these galaxies may be more at home in a Salvador Dali exhibit than in a synthetic imaging catalogue.

While it is clear that the generative process in AstroVaDEr is by no means perfect, performing these experiments gives us valuable insights into how the model may be improved. The final set of
Figure 10.
Synthetic images generated using the AstroVaDEr network. For each component, we select the galaxy with the highest cluster likelihood; we then order the clusters based on the shortest route through the latent space, found with a travelling-salesperson algorithm, and interpolate along that path. The first and last image of each column are reconstructions of real objects, and the objects between them are synthetic. The sampled vectors are passed into the decoder arm of AstroVaDEr. Images are shown within a pixel range of (0, .

experiments we perform on the generative properties of the network relate to understanding how the network makes the decisions it does, by digging into the structure of the embedded space. Using the generative properties, we can explore the latent variables and produce images that demonstrate what features are being learned.

Let us first consider the 12 learned clusters. What happens to the generated images as we move through the latent space between each cluster? We select the highest-probability object for each cluster, and then calculate the shortest route between each object within the embedded space using a travelling-salesperson algorithm. Using that route, we linearly interpolate new latent vectors and generate images. Figure 10 shows the results of this generative sample. As we expect, transitions between clusters are smooth, indicating that there are no discontinuities in the latent space where unrealistic objects might be formed in the decoding process. We can gain a sense here of how features such as the axis ratio and orientation vary through the space, and can see how secondary sources can be generated with varying intensities and positions.

It is clear from the synthetic images that the latent space is quite complex, and with just 20 latent variables it is difficult to disentangle various features.
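The interpolation of Figure 10 can be sketched as linear interpolation between consecutive waypoint vectors in the latent space; the travelling-salesperson ordering of the waypoints is assumed to have been computed already, and names are illustrative:

```python
import numpy as np

def interpolate_path(waypoints, steps):
    """Linearly interpolate `steps` vectors between consecutive waypoints.

    waypoints : (M, D) ordered latent vectors (e.g. one per cluster,
                ordered by a travelling-salesperson route)
    Returns an array containing the waypoints themselves, with `steps`
    intermediate vectors inserted between each consecutive pair.
    """
    out = []
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        # include the segment start and the intermediates, not the end
        for t in np.linspace(0.0, 1.0, steps + 2)[:-1]:
            out.append((1 - t) * a + t * b)
    out.append(waypoints[-1])  # close the path with the final waypoint
    return np.stack(out)

# the resulting vectors would then be decoded: images = decoder.predict(path)
```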
The most obvious example is the secondary sources: the network needs to be able to control the number of sources, their intensity (which it seems to do independently in some cases), and their positions. In Figure 11, we explore each latent variable individually by generating latent vectors which vary between the minimum and maximum of each variable, keeping the other variables fixed at their mean values. Within each row of the figure, we can see the varied morphological features that variable is controlling, compared with the 'mean galaxy' down the centre column. For example, we see that the galaxy's brightness is almost entirely controlled by a single variable, while the axis ratio and orientation appear to be split over two variables. As expected from our previous discussions, we can see quite clearly that much of the network capacity is given to controlling the positions, intensity and quantity of secondary sources. Minimising the network footprint of these secondary sources appears to be a key challenge in improving the generative and clustering capability of the network.

One positive aspect of this experiment is that we can see that none of the latent variables are doing 'nothing', per se. All of the latent variables have a role to play in embedding the images. If anything, the most crucial takeaway here is that there is space for more latent variables to be added. The question becomes 'how
Figure 11.
Synthetic images generated using the AstroVaDEr network. We calculate the mean, maximum and minimum values of each latent variable, and in each column we linearly interpolate one variable between its minimum and maximum while keeping all other variables fixed. The interpolated vectors are passed into the decoder to generate synthetic images. Images are shown within a pixel range of (0, ).

Figure 12.
Example of deblending a target galaxy from the secondary sources in the image. The left-most and right-most images are reconstructions of each galaxy, and the intermediate synthetic galaxies are generated by interpolating between the latent vectors of the two objects.

many variables is too many?'. What we do not want to see happen is, say, doubling the number of features leading to an improvement in reconstruction loss but a collapse of the clustering optimisations.

One final point of note from Figure 11 is that it demonstrates a potential application of AstroVaDEr (or, potentially better suited, a non-clustering VAE implementation). Scanning across some latent vectors reveals that some degree of deblending is happening. It may be possible to tailor a version of the model to remove background/foreground objects from the target samples. Doing this effectively will require a much higher level of disentanglement of the latent variables than AstroVaDEr currently possesses; however, a crude version can be achieved using the cluster assignments. We first pick a galaxy assigned to one of the clusters that features secondary sources, then find the pure morphological cluster it relates to, and find a galaxy within it that looks similar. We choose the similar-looking galaxy by calculating the mean squared error between the object we want to deblend and all the objects assigned to its morphological class. Finally, we interpolate between the two chosen galaxies. The results of this test are shown in Figure 12: the left-most and right-most images are reconstructions of real galaxies, and those between are linear interpolations through the latent space.
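The crude deblending recipe above (match by mean squared error within the pure morphological cluster, then interpolate between the two embeddings) can be sketched as follows. The array shapes and random placeholder data are ours, and the decoding step is omitted:

```python
import numpy as np

def nearest_by_mse(target_img, candidate_imgs):
    """Index of the candidate image closest to the target in mean squared error."""
    errs = ((candidate_imgs - target_img) ** 2).mean(axis=(1, 2))
    return int(np.argmin(errs))

def interpolate_latents(z_a, z_b, n=6):
    """Linear path between two latent vectors, endpoints included."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) * z_a + t * z_b

rng = np.random.default_rng(1)
target = rng.random((64, 64))             # galaxy with a companion source
morph_cluster = rng.random((50, 64, 64))  # members of the matching morphological cluster
j = nearest_by_mse(target, morph_cluster)

# embeddings of the blended galaxy and its companion-free match (placeholders)
z_blend, z_pure = rng.normal(size=20), rng.normal(size=20)
path = interpolate_latents(z_blend, z_pure, n=6)
# decoding `path` row by row fades the companion out, as in Figure 12
```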
Finally, we consider how AstroVaDEr and its cluster assignments compare with the human classifications collected in GZ2. Given AstroVaDEr's current reconstruction ability, we concern ourselves with the top-level questions in the Galaxy Zoo 2 decision tree: Smooth versus Featured, and Edge-on versus Face-on. As we have identified that only of the clusters relate to morphology, we restrict this analysis to those clusters only, assigning a label to each test sample galaxy based on the highest probability among those clusters (as in 5.2). For each cluster we calculate the number of test sample galaxies that lie above a particular cut in the debiased vote fractions for each response. For smooth and featured galaxies we set this cut at p = 0. , and for edge-on and face-on galaxies we apply the featured cut and then use p = 0. to select clean samples of each. These values are those recommended in Willett et al. (2013) for clean samples.
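Selecting a clean sample from the debiased vote fractions amounts to a simple threshold mask. A toy sketch; the fractions and the 0.8 threshold below are placeholders for illustration, not the values used in the paper:

```python
import numpy as np

def clean_sample(vote_fractions, threshold):
    """Boolean mask of galaxies whose debiased vote fraction for a
    response meets the clean-sample threshold."""
    return vote_fractions >= threshold

# hypothetical debiased vote fractions for the 'smooth' response
p_smooth = np.array([0.95, 0.40, 0.81, 0.10, 0.77])
mask = clean_sample(p_smooth, threshold=0.8)
n_smooth = int(mask.sum())   # number of 'clean' smooth galaxies
```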
Figure 13.
Histogram showing the number of Galaxy Zoo 2 'Featured' (blue solid line) and 'Smooth' (red dashed line) galaxies in the morphological components of the learned GMM. Below each bin in the histogram we show an example reconstructed image of an object from that component.

In Figure 13 we show the number of galaxies that are smooth or featured based on the GZ2 debiased vote fractions. Morphological component in this figure has a higher number of smooth galaxies, implying that this cluster of objects is mainly populated by early-type galaxies. Components and , the components we identified as being mainly high axis ratio galaxies, have a higher number of featured objects, which is to be expected. Finally, components and have more balanced numbers of smooth and featured galaxies, which we infer to mean that the primary split here is likely on size and brightness profiles.

In Figure 14 we show the distribution of featured galaxies which are voted as 'Edge-on' and 'Not Edge-on' (i.e. face-on) by GZ2 users. Recall that the chosen cut is intended to produce 'clean' samples, so galaxies for which the edge-on response is unclear are excluded. We first note that there are many more face-on galaxies than edge-on galaxies, but this is to be expected, as only fairly extreme axis ratios are consistently voted as edge-on in the GZ2 scheme. Components , and in this figure have, as expected, almost no edge-on objects between them. Given that these clusters pick up smooth and featured galaxies, it makes sense that the late-type galaxies in this group would need to be face-on in order to 'look' the same to the network.

The most curious splits in this comparison are in clusters and , which we expected to be mainly edge-on galaxies. In component , the number of edge-on objects is roughly half that of the face-on objects, but in component there are times as many face-on galaxies as there are edge-on galaxies.
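The per-component counts plotted in Figures 13 and 14 amount to a group-by over the cluster labels. A toy sketch with invented labels and flags (not the paper's data):

```python
import numpy as np

def counts_per_component(labels, is_featured, n_components):
    """For each GMM component, count member galaxies flagged 'featured'
    by the GZ2 clean-sample cut, and those flagged 'smooth'."""
    featured = np.array([np.sum((labels == k) & is_featured)
                         for k in range(n_components)])
    smooth = np.array([np.sum((labels == k) & ~is_featured)
                       for k in range(n_components)])
    return featured, smooth

# hypothetical assignments: 6 galaxies spread over 3 morphological components
labels = np.array([0, 0, 1, 1, 2, 2])
is_featured = np.array([True, True, False, True, False, False])
featured, smooth = counts_per_component(labels, is_featured, 3)
# component 0 is featured-dominated; component 2 is smooth-dominated
```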
This brings into question how these two clusters are really behaving, as we initially believed that rotation was the main discerning feature of these groups. It

Figure 14.
Histogram showing the number of Galaxy Zoo 2 'Edge-on' (blue solid line) and 'Face-on' (red dashed line) galaxies in the morphological components of the learned GMM. Below each bin in the histogram we show an example reconstructed image of an object from that component.

appears from this analysis that these clusters also discern between axis ratios as a secondary feature. We have discussed throughout this work the importance of latent disentanglement, and this is a clear example of how latent features can become entangled in the clustering paradigm.

It is clear that while AstroVaDEr does not, and indeed is not designed to, reproduce the Galaxy Zoo classification scheme, it is certainly interesting how the morphological groups we have identified compare with the independent user-generated labels. In the future we hope to improve the reconstruction and disentanglement of AstroVaDEr's clustering to find the finer-grained features by which the galaxies are grouped together; in particular, we hope to pick out distinct morphological features such as bars and rings, and possibly gravitational lenses and low surface brightness features in next-generation surveys that have sufficient resolution and depth. We also note that this comparison with Galaxy Zoo brings the exciting possibility of combining unsupervised learning and citizen science: we envision a platform where citizen scientists could be employed to find the common features between objects in unsupervised clusters.

The architecture presented in this work was chosen for demonstrative purposes, and is by no means fully optimised for the task of galaxy classification. In this section we briefly discuss how AstroVaDEr will be improved and optimised in the future for general release and application to next-generation astronomical surveys.
The main point to address in improving AstroVaDEr for scientific applications is the balance between the reconstruction loss and the clustering loss. There is much room to improve both the quality of the image generation (both synthetic and reconstructed outputs) and the latent variable/cluster disentanglement. Much work has been done to investigate the sparsity problem in Variational Autoencoders, such as Balance VAE (Dai & Wipf 2019; Asperti & Trentin 2020) and Regularised Autoencoders (Ghosh et al. 2019). These models are designed to work with unit Gaussian priors, but it may be possible to develop a model under these schemes that works with a GMM. It may also be possible to develop a clustering version of Introspective VAE (Kobayashi et al. 2020), which utilises a GAN-like structure where the decoder output is re-embedded and trained antagonistically with the encoder.

AstroVaDEr is currently coded in an outdated version of
Keras, which still relies on Tensorflow version 1.15 (Abadi et al. 2015). Further development of the platform will involve updating the model to run in a current version of Tensorflow. Alongside the updated Tensorflow API, we hope to include the Tensorflow Probability module, which provides optimised probability distributions for use within machine learning architectures. We will also investigate the inclusion of more complex convolutional blocks, such as ResNet blocks (He et al. 2015). Finally, we note that the order of the upsampling and convolutional layers in the decoder is not universally agreed upon in CAE architectures, so we will test using convolutional blocks both before and after upsampling the decoded activity maps. We also plan on changing our training catalogue to imaging from the Hyper Suprime-Cam Subaru Strategic Program (Aihara et al. 2018), in an effort to tailor the network toward future Vera Rubin Observatory operations.

We have presented here a demonstration of AstroVaDEr, an Astronomical Variational Deep Embedder. This network has been developed to perform unsupervised clustering of images of galaxies imaged in the Sloan Digital Sky Survey and classified by citizen scientists in Galaxy Zoo 2. We provide a comprehensive overview of the theoretical background of Variational Autoencoders and Variational Deep Embedding, and show how we implement cutting-edge optimisations of these networks within our model.

We train AstroVaDEr on around , images of nearby galaxies to embed the images into 20 latent variables which have Gaussian distributions, to reconstruct the images from the latent embedding, and finally to cluster the images within the latent space using a Gaussian Mixture Model (GMM). The trained network is able to produce cluster labels and probabilities for new images of galaxies, as demonstrated with a test sample of approximately , objects, and also to generate synthetic images of galaxies randomly drawn from the GMM.

Reconstructed and synthetic images produced by AstroVaDEr show qualitatively realistic properties in terms of shape, light profiles, bulge presence, axis ratio and size.
Currently, we are not able to fully reconstruct finer-grained details like spiral arms and galactic bars, and output images tend to have a 'blurred' or 'smoothed' look. This blurring is mainly due to underlying problems in the VAE architecture, which originate in the competition between reconstruction loss and clustering loss. We discuss in Section 6 some of the ways this may be addressed. We show how the generative process used in the network is capable of constructing a continuous generative space between the different morphological clusters.

(Footnote: Tensorflow Probability is available at https://github.com/tensorflow/probability.)

The resulting clustering model is able to identify the presence of secondary sources within the images, and also provides a number of morphological groups. We find that galaxies are grouped together based on size, surface brightness distribution, axis ratio and rotation. We compare our clustering scheme to the debiased vote fractions from Galaxy Zoo 2 and find correlations between the unsupervised clusters and 'smooth' and 'featured' galaxies, and with 'edge-on' and 'not edge-on' galaxies.

AstroVaDEr has the potential to be used with next-generation sky surveys as a science-enabling platform. We envision that it could be used to generate large-scale synthetic imaging datasets for testing and developing data analysis pipelines in preparation for future data releases; for example, on three NVIDIA Tesla V100 GPUs we can generate , images at × pixels in one minute. At this speed we could generate one hundred million images in less than six hours. For the clustering tasks, we predict that it would take about 7 hours to calculate cluster assignments and probabilities for one hundred million objects using our current hardware. We also show that, even without improvements to the disentanglement of latent variables, AstroVaDEr demonstrates some capability in deblending primary and secondary sources in the input images.

Development of the network continues, with planned improvements focusing on reconstruction/synthetic image quality and disentanglement of latent variables and clusters.
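Drawing synthetic galaxies from the learned mixture is ancestral sampling: pick a component according to the mixture weights, sample from its Gaussian, and decode. A minimal numpy stand-in (the weights, means and scales below are random placeholders, not the trained parameters, and the decoder call is omitted):

```python
import numpy as np

def sample_gmm_latents(weights, means, scales, n, rng):
    """Draw latent vectors from a diagonal-covariance Gaussian mixture:
    pick a component per sample, then sample from its Gaussian."""
    comps = rng.choice(len(weights), size=n, p=weights)
    eps = rng.normal(size=(n, means.shape[1]))
    return means[comps] + scales[comps] * eps, comps

rng = np.random.default_rng(3)
K, D = 12, 20                        # clusters and latent dimensions, as in the paper
weights = np.full(K, 1.0 / K)        # placeholder mixture weights
means = rng.normal(size=(K, D))
scales = np.abs(rng.normal(size=(K, D)))
z, comps = sample_gmm_latents(weights, means, scales, n=1000, rng=rng)
# each row of `z` would be passed through the decoder to render a synthetic galaxy
```

Because sampling and decoding are a single forward pass per batch, this step parallelises trivially across GPUs, which is what makes the large-scale generation figures quoted above plausible.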
We plan on building AstroVaDEr into a flexible platform that can be used by researchers in a variety of fields, in and out of the astronomical community, with only minimal prior knowledge of machine learning networks. The code used to train the model and produce the results of this paper is available at https://github.com/AshleySpindler/AstroVaDEr-Public.

ACKNOWLEDGEMENTS
A.S. is supported by an STFC Innovation Fellowship (ST/R005265/1). J.E.G. is supported by a Royal Society University Research Fellowship (URF/R/180014). This research has made use of the University of Hertfordshire high-performance computing facility (https://uhhpc.herts.ac.uk).

The authors thank Christopher Lovell (University of Hertfordshire) and Sandor Kruk (ESA, ESTEC) for their valuable insight. The authors would also like to thank the referee for their kind comments and helpful feedback.
REFERENCES
Abadi M., et al., 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems
Aihara H., et al., 2017, Publications of the Astronomical Society of Japan, 70
Aihara H., et al., 2018, PASJ, 70, S4
Angel J. R. P., Wizinowich P., Lloyd-Hart M., Sandler D., 1990, Nature, 348, 221
Asperti A., 2018, arXiv e-prints, p. arXiv:1812.07238
Asperti A., Trentin M., 2020, arXiv e-prints, p. arXiv:2002.07514
Attias H., 2000, in Advances in Neural Information Processing Systems 12. MIT Press, pp 209–215
Avestruz C., Li N., Zhu H., Lightman M., Collett T. E., Luo W., 2019, ApJ, 877, 58
Ay F., Ince G., Kamaşak M. E., Ekşi K. Y., 2020, MNRAS, 493, 713
Bishop C. M., 2006, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg
Blei D. M., Kucukelbir A., McAuliffe J. D., 2016, arXiv e-prints, p. arXiv:1601.00670
Bourlard H., Kamp Y., 1988, Biological Cybernetics, 59, 291
Caliński T., Harabasz J., 1974, Communications in Statistics, 3, 1
Cao L., Asadi S., Zhu W., Schmidli C., Sjöberg M., 2020, arXiv e-prints, p. arXiv:2005.08047
Cheng T.-Y., Li N., Conselice C. J., Aragón-Salamanca A., Dye S., Metcalf R. B., 2020, MNRAS, 494, 3750
Chollet F., et al., 2015, Keras, https://keras.io
D'Isanto A., Polsterer K. L., 2018, A&A, 609, A111
Dai B., Wipf D., 2019, arXiv e-prints, p. arXiv:1903.05789
Dempster A. P., Laird N. M., Rubin D. B., 1977, Journal of the Royal Statistical Society. Series B (Methodological), 39, 1
Dey A., et al., 2019, AJ, 157, 168
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Dilokthanakul N., Mediano P. A. M., Garnelo M., Lee M. C. H., Salimbeni H., Arulkumaran K., Shanahan M., 2016, arXiv e-prints, p. arXiv:1611.02648
Ferguson T. S., 1973, Ann. Statist., 1, 209
Flaugher B., 2005, International Journal of Modern Physics A, 20, 3121
Geach J. E., 2012, MNRAS, 419, 2633
Ghosh P., Sajjadi M. S. M., Vergari A., Black M., Schölkopf B., 2019, arXiv e-prints, p. arXiv:1903.12436
Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., 2014, in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS'14. MIT Press, Cambridge, MA, USA, p. 2672–2680
He K., Zhang X., Ren S., Sun J., 2015, Deep Residual Learning for Image Recognition (arXiv:1512.03385)
Hinton G., Roweis S., 2002, in Proceedings of the 15th International Conference on Neural Information Processing Systems. NIPS'02. MIT Press, Cambridge, MA, USA, p. 857–864
Hinton G. E., Zemel R. S., 1993, in Proceedings of the 6th International Conference on Neural Information Processing Systems. NIPS'93. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, p. 3–10
Hocking A., Geach J. E., Sun Y., Davey N., 2018, MNRAS, 473, 1108
Huertas-Company M., et al., 2015, ApJS, 221, 8
Ivezić Ž., et al., 2019, ApJ, 873, 111
Bergstra J., Yamins D., Cox D. D., 2013, in Proc. of the 30th International Conference on Machine Learning. ICML 2013
Jiang Z., Zheng Y., Tan H., Tang B., Zhou H., 2016, arXiv e-prints, p. arXiv:1611.05148
Kingma D. P., Ba J., 2014, arXiv e-prints, p. arXiv:1412.6980
Kingma D. P., Welling M., 2013, arXiv e-prints, p. arXiv:1312.6114
Kobayashi K., et al., 2020, arXiv e-prints, p. arXiv:2005.12573
Kremer J., Stensbo-Smidt K., Gieseke F., Pedersen K. S., Igel C., 2017, IEEE Intelligent Systems, 32, 16–22
Kullback S., Leibler R. A., 1951, Ann. Math. Statist., 22, 79
Lahav O., et al., 1995, Science, 267, 859
Lanusse F., Ma Q., Li N., Collett T. E., Li C.-L., Ravanbakhsh S., Mandelbaum R., Póczos B., 2018, MNRAS, 473, 3895
LeCun Y., Bottou L., Bengio Y., Haffner P., 1998, Proceedings of the IEEE, 86, 2278
LeCun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436
Lecun Y., 1987, PhD thesis: Modèles connexionnistes de l'apprentissage (connectionist learning models). Université P. et M. Curie (Paris 6)
Li R., et al., 2020, arXiv e-prints, p. arXiv:2004.02715
Lloyd-Hart M., et al., 1992, ApJ, 390, L41
Lu L., Shin Y., Su Y., Karniadakis G. E., 2019, arXiv e-prints, p. arXiv:1903.06733
Martin G., Kaviraj S., Hocking A., Read S. C., Geach J. E., 2020, MNRAS, 491, 1408
McLachlan G., Krishnan T., 1997, The EM Algorithm and Extensions. Wiley, New York
Metcalf R. B., et al., 2019, A&A, 625, A119
Odewahn S. C., Stockwell E. B., Pennington R. L., Humphreys R. M., Zumach W. A., 1992, AJ, 103, 318
Odewahn S. C., Humphreys R. M., Aldering G., Thurmes P., 1993, PASP, 105, 1354
Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
Petrillo C. E., et al., 2019, MNRAS, 484, 3879
Portegies Zwart S., 2020, Nature Astronomy, 4, 819
Prasad V., Das D., Bhowmick B., 2020, arXiv e-prints, p. arXiv:2005.04613
Ralph N. O., et al., 2019, PASP, 131, 108011
Ramachandra V., 2019, arXiv e-prints, p. arXiv:1911.06623
Ravanbakhsh S., Lanusse F., Mandelbaum R., Schneider J. G., Póczos B., 2017, in AAAI
Reed S., Akata Z., Yan X., Logeswaran L., Schiele B., Lee H., 2016, arXiv e-prints, p. arXiv:1605.05396
Regier J., McAuliffe J., Prabhat, 2015, in ADG
Rousseeuw P. J., 1987, Journal of Computational and Applied Mathematics, 20, 53
Schawinski K., Zhang C., Zhang H., Fowler L., Santhanam G. K., 2017, MNRAS, 467, L110
Simonyan K., Zisserman A., 2014, arXiv e-prints, p. arXiv:1409.1556
Siudek M., et al., 2018, arXiv e-prints, p. arXiv:1805.09905
Smith M. J., Geach J. E., 2019, MNRAS, 490, 4985
Storrie-Lombardi M. C., Lahav O., Sodre L. J., Storrie-Lombardi L. J., 1992, MNRAS, 259, 8P
Sun H., Guo J., Kim E. J., Brunner R. J., 2019, arXiv e-prints, p. arXiv:1910.14056
Uzeirbegovic E., Geach J. E., Kaviraj S., 2020, arXiv e-prints, p. arXiv:2004.06734
Walmsley M., et al., 2020, MNRAS, 491, 1554
Willett K. W., et al., 2013, MNRAS, 435, 2835
Xie J., Girshick R., Farhadi A., 2015, arXiv e-prints, p. arXiv:1511.06335
Yeung S., Kannan A., Dauphin Y., Fei-Fei L., 2017, arXiv e-prints, p. arXiv:1706.03643
DATA AVAILABILITY
The data and code used to produce the training, validation and testing datasets, and the results presented in this paper, are archived on Zenodo at https://doi.org/10.5281/zenodo.4034802.

APPENDIX A: GROUND TRUTH IMAGES OF CLUSTERED GALAXIES
For comparison purposes, we provide the original input images ofgalaxies included in Figures 6 and 7 in Figures A1 and A2.
This paper has been typeset from a TeX/LaTeX file prepared by the author.
Figure A1.
Input images of test data objects assigned to each cluster. This figure shows a random selection of galaxies from among those that have the appropriate cluster label. Images are shown on a linear scale with a pixel range of (0, ).
Figure A2.
Input images of test data objects assigned to each cluster. This figure shows the objects with the highest cluster likelihood from among those that have the appropriate cluster label. Images are shown on a linear scale with a pixel range of (0, ).