Eigengalaxies: describing galaxy morphology using principal components in image space
MMNRAS , 1– ?? (2020) Preprint 16 April 2020 Compiled using MNRAS L A TEX style file v3.0
Eigengalaxies: describing galaxy morphology usingprincipal components in image space
Emir Uzeirbegovic, (cid:63)
James E. Geach and Sugata Kaviraj
Centre for Astrophysics Research, School of Physics, Astronomy & Mathematics, University of Hertfordshire, Hatfield, AL10 9ABCentre of Data Innovation Research, School of Physics, Astronomy & Mathematics, University of Hertfordshire, Hatfield, AL10 9AB
16 April 2020
ABSTRACT
We demonstrate how galaxy morphologies can be represented by weighted sums of‘eigengalaxies’ and how eigengalaxies can be used in a probabilistic framework toenable principled and simplified approaches in a variety of applications. Eigengalax-ies can be derived from a Principal Component Analysis (PCA) of sets of single- ormulti-band images. They encode the image space equivalent of basis vectors that canbe combined to describe the structural properties of large samples of galaxies in a mas-sively reduced manner. As an illustration, we show how a sample of 10,243 galaxies inthe
Hubble Space Telescope
CANDELS survey, which have visual classifications fromGalaxy Zoo, can be represented by just 12 eigengalaxies whilst retaining 96% explainedvariance. We show how the emphasis of certain eigengalaxy components correspond tovisual features (e.g. disc, point source, etc.), and explore the correspondence betweencombinations of eigengalaxies and features defined in the Galaxy Zoo-CANDELS cat-alogue. We also describe a probabilistic extension to PCA (PPCA) which enables theeigengalaxy framework to assign probabilities to galaxies and characterise a wholepopulation as a generative distribution. We present four practical applications of theprobabilistic eigengalaxy framework that are particularly relevant for the next genera-tion of large imaging surveys: we (i) show how low probability galaxies make for naturalcandidates for outlier detection (ii) demonstrate how missing data can be predicted(iii) show how a similarity search can be performed on exemplars (iv) demonstratehow unsupervised clustering of objects can be implemented.
Key words: methods: data analysis, methods: statistical, galaxies: structure, tech-niques: image processing
The distribution of light in galaxies, commonly referred to asgalaxy ‘morphology’, is a fundamental observable property.Morphology strongly correlates with the physical propertiesof a galaxy, such as its stellar mass (e.g. Bundy et al. 2005),star formation rate (e.g. Bluck et al. 2014; Smethurst et al.2015; Willett et al. 2015), surface brightness (e.g. Martinet al. 2019), rest frame colour (e.g. Strateva et al. 2001;Skibba et al. 2009; Bamford et al. 2009) and local environ-ment (e.g. Dressler et al. 1997; Postman et al. 2005) and re-veals key information about the processes that have shapedits evolution over cosmic time (e.g. Martin et al. 2018b; Jack-son et al. 2020). For example, the smooth light distributionsof elliptical galaxies, which are a result of the largely randomorbits of their stars (e.g. Cappellari et al. 2011), are signposts (cid:63) [email protected] of a merger-rich evolutionary history (e.g. Conselice 2006).On the other hand, the presence of a disc indicates a rela-tively quiescent formation history, in which the galaxy hasgrown primarily through accretion of gas from the cosmicweb (Codis et al. 2012; Martin et al. 2018a). In a similar vein,morphological details such as extended tidal features sug-gest recent mergers and/or interactions (e.g. Kaviraj 2014;Kaviraj et al. 2019; Jackson et al. 2019), with the surfacebrightness of these tidal features typically scaling with themass ratios of the mergers in question (e.g. Peirani et al.2010; Kaviraj 2010).Apart from being a fundamental component of galaxyevolution studies, morphological information has a widerange of applications across astrophysical science. For exam-ple, it can be a key prior in photometric redshift pipelines(e.g. Soo et al. 2018; Menou 2018) which underpin much ofobservational cosmology and weak lensing studies, is used ascontextual data in the classification of transient light curves © a r X i v : . [ a s t r o - ph . GA ] A p r Uzeirbegovic et al. (e.g. Djorgovski et al. 2012; Wollaeger et al. 2018) and is anessential ingredient in the study of the processes that driveactive galactic nuclei (e.g. Schawinski et al. 2014; Kavirajet al. 2015). The morphological analysis of galaxy popu-lations, especially in the large surveys that underpin ourstatistical understanding of galaxy evolution, is therefore offundamental importance.A vast literature exists on methods for measuring galaxymorphology. Popular techniques range from those that de-scribe a galaxy’s light distribution using a small number ofparameters (e.g. de Vaucouleurs 1948; S´ersic 1963; Simardet al. 2002; Odewahn et al. 2002; Lackner & Gunn 2012)to non-parametric approaches such as ‘CAS’ (e.g. Abrahamet al. 1994; Conselice 2003; Menanteau et al. 2006) or Gini- M (e.g. Lotz et al. 2004; Scarlata et al. 2007; Peth et al.2016), where the light distribution is reduced to a singlevalue. The convergence of large observational surveys andrapidly increasing computing power has recently broughtmachine learning to the fore in morphological studies. Whilethe use of machine learning can be traced back at least asfar back as the 1990s (e.g. Lahav et al. 1995), there has beena recent explosion of studies that apply such techniques tothe exploration of galaxy morphology, particularly in largesurvey datasets (e.g. Huertas-Company et al. 2015; Ostro-vski et al. 2017; Schawinski et al. 2017; Hocking et al. 2018;Goulding et al. 2018; Cheng et al. 2019; Martin et al. 2020).Automated techniques lend themselves particularly wellto the analysis of large surveys, but the most accuratemethod of morphological classification is arguably visual in-spection (e.g. Kaviraj et al. 2007). Indeed the genesis of thissubject can be traced back to the visual ‘tuning-fork’ clas-sifications by Hubble (1926), where galaxies were classifiedinto a so-called sequence of ellipticals and spirals. It is re-markable that this classification system still underpins thebroad morphological classes into which galaxies are split inmodern studies of galaxy evolution. Although visual inspec-tion of large observational surveys is time-intensive, the ad-vent of massively distributed systems like Galaxy Zoo hasrevolutionised its use on survey datasets (e.g. Lintott et al.2011; Simmons et al. 2017; Willett et al. 2017). Galaxy Zoohas used more than a million citizen-science volunteers toclassify large contemporary surveys, like the Sloan DigitalSky Survey (SDSS) and the Hubble Space Telescope ( HST )Legacy Surveys and has provided the benchmark againstwhich the accuracy of automated techniques have been rou-tinely compared (e.g. Huertas-Company et al. 2015; Diele-man et al. 2015; Beck et al. 2018; Walmsley et al. 2019; Maet al. 2019).Principal component analysis (PCA) is a multivariatestatistical technique used to convert a set of variables to aset of uncorrelated variables termed principal components(PCs). The PCs are defined iteratively such that each com-ponent accounts for the most possible remaining variance ina dataset and is orthogonal to all prior components. We con-sider the line of approach inspired by the work of Sirovich &Kirby (1987), in which they coined the term ‘eigenfaces’ asa synonym for the eigenvectors (presented as images) thatresult from a PCA decomposition of a set of 2D images,considered together as the row vectors of a common matrix.Those authors showed that a set of N × N images describedas vectors in N -space could be projected to a compara-tively small M -space whilst preserving the majority of the variance. The M basis vectors of the M -space became thetemplate eigenfaces, where each face image in the originalset could be faithfully reconstructed as a linear combinationof the templates.Similar methods have been applied in astronomy. Forexample, De La Calleja & Fuentes (2004) used PCA toproject a set of 310 disparate images to a lower rank space tofacilitate further classification steps. They referred to the ba-sis eigenvectors as ‘eigengalaxies’ and used their weightingsto test multiple machine learning methods against each otherfor the purpose of galaxy classification. Li et al. (2005) usedPCA to decompose stellar spectra from the STELIB spectro-scopic stellar library (Le Borgne et al. 2003) and Two MicronAll Sky Survey (2MASS, Skrutskie et al. 2006) near-infraredphotometry to derive ‘eigenspectra’ and then used them tofit the observed spectra of a selection of galaxies from theSDSS (Fukugita et al. 1996) Data Release 1. Anderson et al.(2004) generated many galaxies from a parametric model,calculated the eigenvectors of the generated set, and thenused the reduced eigenspace to find the nearest syntheticmodel to any given real galaxy image as a way of discover-ing correspondence between galaxies and known structures.Wild et al. (2014) used PCA to concisely describe a largenumber of model spectral energy distributions (SEDs). Theytermed the retained eigenvectors as ‘super-colours’ and usedthem to impute SEDs from sparse samples in observed galax-ies. In this paper, we further consider the utility of PCs inimage space, working with multi-band imaging and focus-ing on galaxy morphology. We start by finding a concisevariance-preserving latent space representation using PCA,and as others have done before, refer to the resultant eigen-vectors as ‘eigengalaxies’. Taking this concept further, weshow how the result can be converted into a generative modelwhich can assign likelihoods to galaxies, and demonstratehow the model enables the simple and principled implemen-tation of a set of cross-cutting applications that may be ofuse to the community in a variety of scenarios, not neces-sarily limited to galaxy morphology studies.The paper is structured as follows. In Section 2 we de-scribe both the sample and our methodology and discussthe interpretation of eigengalaxies and its relevance to theinterpretation of galaxy morphology. We then show howthis result is converted to a generating distribution usingprobabilistic PCA (PPCA). In Section 3 we show how thisprobabilistic eigengalaxy framework leads to principled ap-proaches in various applications. In particular, we (i) showhow low probability galaxies make for natural candidatesfor outlier detection (ii) show how missing data can be pre-dicted (iii) demonstrate how similarity searches for imagescan be implemented given an exemplar (iv) demonstratehow a natural unsupervised clustering of objects can be pro-duced. We then outline how the methods may be used to-gether in more advanced applications and how it is relevantto big datasets such as the upcoming Large Synoptic SurveyTelescope (LSST) project (Robertson et al. 2017). Section 4concludes and summarises our findings, and provides a linkto the codes developed for this work. MNRAS , 1– ?? (2020) igengalaxies E i g e n g a l a x i e s . . . . . Figure 1.
Contour plot of explained variance corresponding tothe required number of eigengalaxies versus sample size. Higherexplained variances tend to require proportionally more eigen-galaxies as sample size increases, however at roughly 96% ex-plained variance is asymptotic to a constant. N u m b e r o f o b j e c t s Figure 2.
Histogram showing the frequency of the ratio of thesum of deltas (difference between original and reconstructed pixelflux over all bands) and the sum of the flux in the original images,for the PCA decomposition of the 10,243 galaxies in the GZ-CANDELS GOODS–S sample. 99% of objects lie between ± . .We account for 96% of variance by design and the distributionabove shows that the reconstructed images bear a good likeness(in terms of reconstructed flux) to the originals. HST
CANDELS
HST
CANDELS (Grogin et al. 2011; Koekemoer et al. 2011)offers a high-resolution probe of galaxy evolution. The sur-vey consists of optical and near-infrared (WFC3/UVIS/IR)images from the Wide Field Camera 3 (WFC3) and opticalimages from the Advanced Camera for Surveys (ACS) in fivewell-studied extragalactic survey fields. Here, we focus onGOODS-S, one of the deep tier (at least four-orbit effectivedepth) fields. We select a sample of 10,243 galaxies presentin the ‘Galaxy Zoo: CANDELS’ (GZ-CANDELS) GOODS-Scatalogue (Simmons et al. 2016), that fall within the regionjointly covered by the F814W, F125W and F160W bands. The majority of objects are at z < (Simmons et al. 2016).The motivation for using the GZ-CANDELS catalogue isthat our results can be interpreted in the context of themorphological and structural classifications provided by theGalaxy Zoo citizen science project. The aim of registration is to remove the uninterestingsources of variation which would otherwise increase the num-ber of eigengalaxies required. We identify pixel scale match-ing, image centering, cropping, range clipping and image ro-tation as the most important confounders to account for.For each object in the catalog we take a 1.8 (cid:48)(cid:48) cut-out, usingthe catalogued sky coordinate as a centroid. ACS images aredown-sampled by a factor of two to match the pixel scale ofthe WFC3 images (0.06 (cid:48)(cid:48) pixel − ), so that for each band weobtain a × -pixel image at the position of each galaxy.The cut-out dimensions are selected with prior experimenta-tion because they result in the least superfluous backgroundfor most targets.We perform some simple post-processing to the data.In particular, we truncate the pixel distribution at the 99 th percentile to avoid anomalous spikes in flux in some galaxiesfrom accounting for too much variance. Next, we produce atemporary composite image of the array by using, for eachpixel, its maximum magnitude across all bands. We thenuse the locations of all pixels in our composite with valuesexceeding the 75 th percentile to create a matrix of the coor-dinates of bright pixels. The first principal component of thetwo column matrix of pixel coordinates is a vector of weights ( x , y ) which can be interpreted as the linear transformationof our 2D coordinates to the 1D space that preserves mostvariance (i.e. some line passing through the origin of theoriginal 2D space). The angle between the original x -axis inour image and the new variance-preserving one is then givenby θ = arctan2 ( y , x ) . Finally, we transform the original ar-ray by rotating every band in turn by θ − π / radians (inorder to make vertical what would otherwise be a rotationto the horizontal plane) which results in vertically-alignedbrightness. The set of objects, each a registered × × array, are flat-tened into × row vectors and concatenated to form afeature matrix. We perform PCA on the resultant matrix,but prior to that we conduct some experiments to deter-mine a reasonable level of explained variance to target. Intheory, in the worst case scenario, as the ratio of explainedvariance approaches unity, the number of eigengalaxies re-quired may be close to the total sample size. We performPCA for varying sample sizes and create a contour plot todepict the relationship between sample size and the num-ber of eigengalaxies required for any given level of explainedvariance (Figure 1). The n th eigenvector is always the sameregardless of what explained variance is targeted, becauseeigenvectors are ordered by explained variance and orthog-onal to each other, so targeting a higher explained variance We use the Python package sklearn.decomposition.pca .MNRAS , 1– ????
CANDELS (Grogin et al. 2011; Koekemoer et al. 2011)offers a high-resolution probe of galaxy evolution. The sur-vey consists of optical and near-infrared (WFC3/UVIS/IR)images from the Wide Field Camera 3 (WFC3) and opticalimages from the Advanced Camera for Surveys (ACS) in fivewell-studied extragalactic survey fields. Here, we focus onGOODS-S, one of the deep tier (at least four-orbit effectivedepth) fields. We select a sample of 10,243 galaxies presentin the ‘Galaxy Zoo: CANDELS’ (GZ-CANDELS) GOODS-Scatalogue (Simmons et al. 2016), that fall within the regionjointly covered by the F814W, F125W and F160W bands. The majority of objects are at z < (Simmons et al. 2016).The motivation for using the GZ-CANDELS catalogue isthat our results can be interpreted in the context of themorphological and structural classifications provided by theGalaxy Zoo citizen science project. The aim of registration is to remove the uninterestingsources of variation which would otherwise increase the num-ber of eigengalaxies required. We identify pixel scale match-ing, image centering, cropping, range clipping and image ro-tation as the most important confounders to account for.For each object in the catalog we take a 1.8 (cid:48)(cid:48) cut-out, usingthe catalogued sky coordinate as a centroid. ACS images aredown-sampled by a factor of two to match the pixel scale ofthe WFC3 images (0.06 (cid:48)(cid:48) pixel − ), so that for each band weobtain a × -pixel image at the position of each galaxy.The cut-out dimensions are selected with prior experimenta-tion because they result in the least superfluous backgroundfor most targets.We perform some simple post-processing to the data.In particular, we truncate the pixel distribution at the 99 th percentile to avoid anomalous spikes in flux in some galaxiesfrom accounting for too much variance. Next, we produce atemporary composite image of the array by using, for eachpixel, its maximum magnitude across all bands. We thenuse the locations of all pixels in our composite with valuesexceeding the 75 th percentile to create a matrix of the coor-dinates of bright pixels. The first principal component of thetwo column matrix of pixel coordinates is a vector of weights ( x , y ) which can be interpreted as the linear transformationof our 2D coordinates to the 1D space that preserves mostvariance (i.e. some line passing through the origin of theoriginal 2D space). The angle between the original x -axis inour image and the new variance-preserving one is then givenby θ = arctan2 ( y , x ) . Finally, we transform the original ar-ray by rotating every band in turn by θ − π / radians (inorder to make vertical what would otherwise be a rotationto the horizontal plane) which results in vertically-alignedbrightness. The set of objects, each a registered × × array, are flat-tened into × row vectors and concatenated to form afeature matrix. We perform PCA on the resultant matrix,but prior to that we conduct some experiments to deter-mine a reasonable level of explained variance to target. Intheory, in the worst case scenario, as the ratio of explainedvariance approaches unity, the number of eigengalaxies re-quired may be close to the total sample size. We performPCA for varying sample sizes and create a contour plot todepict the relationship between sample size and the num-ber of eigengalaxies required for any given level of explainedvariance (Figure 1). The n th eigenvector is always the sameregardless of what explained variance is targeted, becauseeigenvectors are ordered by explained variance and orthog-onal to each other, so targeting a higher explained variance We use the Python package sklearn.decomposition.pca .MNRAS , 1– ???? (2020) Uzeirbegovic et al. E x p l a i n e d V a r i a n c e R a t i o Figure 3.
Cumulative explained variance illustrating how muchvariance each additional feature accounts for. In PCA successivecomponents account for less variance than the previous compo-nent. In this instance, only two eigengalaxies are required to ac-count for ∼
85% of explained variance yet 12 eigengalaxies arerequired to account for 96% of explained variance. just means retaining more eigenvectors. We aim to target thehighest level of explained variance asymptotic to a constantnumber of eigengalaxies as the sample size increases. Thisis important because we intend the methods illustrated hereto be applicable to very large datasets in which the numberof eigengalaxies targeted would have to be on an asymptoticcurve. This is discussed further in Section 3. After choosinga target variance, each × eigenvector is reshaped intoa × × array: these are the eigengalaxies. Reshapingthe previously flattened array is done by filling a × × array one band at a time in column order. The matrix ofcoefficients defining how the eigengalaxies are combined toreconstruct any given galaxy is termed the ‘score matrix’and is also calculated. Figure 1 presents a contour plot of explained variance corre-sponding to various combinations of sample size and eigen-galaxy number. We chose to retain 96% variance as it is thehighest target variance with near zero or asymptotic growthas sample size increases. Only 12 eigengalaxies were requiredto achieve an explained variance of approximately 96% for10,243 objects. Let F be the collection of flux densities ofthe pixels in the original images and F ∆ be the correspond-ing differences in flux between the original pixels and thosein images reconstructed using the eigengalaxies. Figure 2presents a histogram of (cid:205) x ∈ F ∆ x / (cid:205) x ∈ F x (i.e. the ratio ofthe sum of deltas and the sum of flux over all bands). Itshows a symmetrical distribution centred at zero. 99% ofobjects lie between | (cid:205) x ∈ F ∆ x / (cid:205) x ∈ F x | < . .Figure 3 shows the cumulative variance explained bysuccessive eigengalaxies. PCA successively maximises thevariance in each orthogonal eigengalaxy and each additionaleigengalaxy therefore accounts for less variance. In this in-stance, only two eigengalaxies are required to account for ∼
85% of explained variance, yet 12 eigengalaxies are re-quired to account for 96% of explained variance. Figure 4
Figure 4.
The 12 eigengalaxies accounting for 96% variance inthe GZ-CANDELS GOODS-S sample. Each row is as an eigen-galaxy and each column is a band. All images are scaled identi-cally, where white indicates relative emphasis and black indicatesrelative de-emphasis of the region. Each image is × pixels,or . (cid:48)(cid:48) × . (cid:48)(cid:48) . MNRAS , 1– ?? (2020) igengalaxies Figure 5.
The left panel shows a sample of objects where eigen-galaxy component 3 has an extreme negative weight, illustratinghow different eigengalaxy components can correspond to visualfeatures. In this case, eigengalaxy 3 seems to correspond to re-solved disc-like structures, which are completely absent in theimages in the right panel, which are point sources. Each 1.8 (cid:48)(cid:48) thumbnail image is an RGB composite of the F160W, F125W,and F814W bands. presents the 12 eigengalaxies as grayscale images. The effectof image rotation to vertically align brightness can be seenin the relative emphasis of the central vertical strip. Onemay also wonder how structure with high spatial frequency– such as spirals – would be reconstructed given the featuresof the eigengalaxies. Finessed features are a function of co-ordination of adjacent pixel columns and although visuallydistinct typically account for a tiny amount of flux variance.Therefore, fine-structured spiral features (for example) willbecome washed out in reconstructions, and not appear assharp as in the real image.Examining the first two eigengalaxies we see that theyare intensity dominated. It is unsurprising that most vari-ance is to be found in accounting for relative brightness,given the GZ-CANDELS sample spans ( H -band) magni-tudes down to F160W < . mag (Simmons et al. 2016).Subsequent eigengalaxies become more complex and struc-tured. How can we interpret them? It is sometimes difficultto ‘eyeball’ the feature implications of any given eigengalaxybut it can be made clearer by contrasting samples which ex-tremely emphasise ( > th percentile) or de-emphasise ( < st percentile) the eigengalaxy as shown in Figure 5. In thisexample, it appears that extreme negative weights on eigen-galaxy component 3 corresponds to the presence of extendedresolved features such as (but not exclusively) discs; con-versely, extreme positive eigengalaxy 3 weights result inpoint sources with no, or insignificant, resolved structure.Thus, selection on relative component weighting could leadto practical applications such as star/galaxy separation orthe identification of quasars.Where individual eigengalaxies do not coincide with afeature of interest, the correspondence can be sought witha combination of eigengalaxies. One way to establish thiscorrespondence is to train a classifier to discover the rela-tionship between the eigengalaxy scores and a set of labels,and then to interpret the classifier to better understand thecorrespondence. Note that the performance of classifiers foruse cases such as these will be substantially dependent onthe purity of labelled classes in the training set. We matchthe number of galaxies in each class to make the ex ante probability of any class roughly equal. We reserve half of the data at random for training, withhold the other halffor testing, and fit a random forest (Liaw et al. 2002) clas-sifier using the GZ-CANDELS classifications as a responseand the score matrix as the explanatory variables. The ran-dom forest classifier is an ensemble learning technique whichaggregates the classifications of a large number of decisiontrees. Decision trees themselves have low bias but a highvariance on out of sample predictions because they tend toover-fit data. The random forest corrects for this by ran-domly partitioning the data that each tree can see, whichhas the effect of de-correlating tree classifications. The meanor mode of the resultant tree classifications is then used toproduce the final result. Random forests can have a drasti-cally lower variance than that of a single decision tree fitting,whilst retaining much of a decision tree’s interpretability.Figure 6 shows the error matrix (a cross tabulationof true class labels versus predicted class labels) for thewithheld data. Correspondence to complex GZ-CANDELSrules is good despite some inherent class ambiguity. GZ-CANDELS classifications are based on combinations of cut-off points corresponding to fractions of votes by human clas-sifiers. Classifications are presented as mutually exclusivewhereas it is possible for a spiral galaxy to be simultaneouslyedge-on and/or clumpy for example, leading to classificationambiguity. Figure 7 illustrates the importance weighting foreach eigengalaxy in random forest classifications. It shows aparticular emphasis on a subset of the eigengalaxies whenclassifying edge-on, spiral and clumpy galaxies against eachother.The eigengalaxies themselves encode something elemen-tary about the (usually) complex multi-band light distribu-tion in the transformed galaxy sample, and therefore can beused in a sense to ‘define’ morphology in a non-parametric,data-driven way through the score matrix, where the as-sumptions of PCA are reasonable; that is, that features areorthogonal, with each maximising the remainder of the resid-ual variance, and the dataset is well described by a mean andcovariance structure. Extending PCA to a probabilistic generative model greatlyenhances its capabilities to applications like predicting miss-ing data, generating new data and drawing inferences aboutthe likelihood of data, which PCA alone is not capable of.Tipping & Bishop (1999b) show that PCA is equivalent tothe maximum likelihood (ML) solution of the factor model,which aims to map an observed vector space to an unob-served latent space by a linear transformation: t = W x + µ + (cid:15) .Here t ∈ R d is the observation vector, x ∈ R q is the latentvector, W is a d × q loading matrix (the coefficients mappingan observed vector to the latent space) relating t to x where q < d , µ is an offset and (cid:15) is a residual error. Tipping &Bishop (1999b) show that given x ∼ N ( , I ) , and (cid:15) ∼ N ( , σ ) (i.e. isotropic error), then t ∼ N ( µ, WW T + σ I ) , and the MLsolution yields W equivalent to the eigenvectors of the co-variance matrix (the PCA decomposition).Given a PCA decomposition and an ML estimate of σ ,the established equivalence means that we can write downa generative factor model. This offers two key opportunitiesfurther discussed later: (i) the factor model is Gaussian andwe can simultaneously estimate the loading matrix and miss- MNRAS , 1– ????
The left panel shows a sample of objects where eigen-galaxy component 3 has an extreme negative weight, illustratinghow different eigengalaxy components can correspond to visualfeatures. In this case, eigengalaxy 3 seems to correspond to re-solved disc-like structures, which are completely absent in theimages in the right panel, which are point sources. Each 1.8 (cid:48)(cid:48) thumbnail image is an RGB composite of the F160W, F125W,and F814W bands. presents the 12 eigengalaxies as grayscale images. The effectof image rotation to vertically align brightness can be seenin the relative emphasis of the central vertical strip. Onemay also wonder how structure with high spatial frequency– such as spirals – would be reconstructed given the featuresof the eigengalaxies. Finessed features are a function of co-ordination of adjacent pixel columns and although visuallydistinct typically account for a tiny amount of flux variance.Therefore, fine-structured spiral features (for example) willbecome washed out in reconstructions, and not appear assharp as in the real image.Examining the first two eigengalaxies we see that theyare intensity dominated. It is unsurprising that most vari-ance is to be found in accounting for relative brightness,given the GZ-CANDELS sample spans ( H -band) magni-tudes down to F160W < . mag (Simmons et al. 2016).Subsequent eigengalaxies become more complex and struc-tured. How can we interpret them? It is sometimes difficultto ‘eyeball’ the feature implications of any given eigengalaxybut it can be made clearer by contrasting samples which ex-tremely emphasise ( > th percentile) or de-emphasise ( < st percentile) the eigengalaxy as shown in Figure 5. In thisexample, it appears that extreme negative weights on eigen-galaxy component 3 corresponds to the presence of extendedresolved features such as (but not exclusively) discs; con-versely, extreme positive eigengalaxy 3 weights result inpoint sources with no, or insignificant, resolved structure.Thus, selection on relative component weighting could leadto practical applications such as star/galaxy separation orthe identification of quasars.Where individual eigengalaxies do not coincide with afeature of interest, the correspondence can be sought witha combination of eigengalaxies. One way to establish thiscorrespondence is to train a classifier to discover the rela-tionship between the eigengalaxy scores and a set of labels,and then to interpret the classifier to better understand thecorrespondence. Note that the performance of classifiers foruse cases such as these will be substantially dependent onthe purity of labelled classes in the training set. We matchthe number of galaxies in each class to make the ex ante probability of any class roughly equal. We reserve half of the data at random for training, withhold the other halffor testing, and fit a random forest (Liaw et al. 2002) clas-sifier using the GZ-CANDELS classifications as a responseand the score matrix as the explanatory variables. The ran-dom forest classifier is an ensemble learning technique whichaggregates the classifications of a large number of decisiontrees. Decision trees themselves have low bias but a highvariance on out of sample predictions because they tend toover-fit data. The random forest corrects for this by ran-domly partitioning the data that each tree can see, whichhas the effect of de-correlating tree classifications. The meanor mode of the resultant tree classifications is then used toproduce the final result. Random forests can have a drasti-cally lower variance than that of a single decision tree fitting,whilst retaining much of a decision tree’s interpretability.Figure 6 shows the error matrix (a cross tabulationof true class labels versus predicted class labels) for thewithheld data. Correspondence to complex GZ-CANDELSrules is good despite some inherent class ambiguity. GZ-CANDELS classifications are based on combinations of cut-off points corresponding to fractions of votes by human clas-sifiers. Classifications are presented as mutually exclusivewhereas it is possible for a spiral galaxy to be simultaneouslyedge-on and/or clumpy for example, leading to classificationambiguity. Figure 7 illustrates the importance weighting foreach eigengalaxy in random forest classifications. It shows aparticular emphasis on a subset of the eigengalaxies whenclassifying edge-on, spiral and clumpy galaxies against eachother.The eigengalaxies themselves encode something elemen-tary about the (usually) complex multi-band light distribu-tion in the transformed galaxy sample, and therefore can beused in a sense to ‘define’ morphology in a non-parametric,data-driven way through the score matrix, where the as-sumptions of PCA are reasonable; that is, that features areorthogonal, with each maximising the remainder of the resid-ual variance, and the dataset is well described by a mean andcovariance structure. Extending PCA to a probabilistic generative model greatlyenhances its capabilities to applications like predicting miss-ing data, generating new data and drawing inferences aboutthe likelihood of data, which PCA alone is not capable of.Tipping & Bishop (1999b) show that PCA is equivalent tothe maximum likelihood (ML) solution of the factor model,which aims to map an observed vector space to an unob-served latent space by a linear transformation: t = W x + µ + (cid:15) .Here t ∈ R d is the observation vector, x ∈ R q is the latentvector, W is a d × q loading matrix (the coefficients mappingan observed vector to the latent space) relating t to x where q < d , µ is an offset and (cid:15) is a residual error. Tipping &Bishop (1999b) show that given x ∼ N ( , I ) , and (cid:15) ∼ N ( , σ ) (i.e. isotropic error), then t ∼ N ( µ, WW T + σ I ) , and the MLsolution yields W equivalent to the eigenvectors of the co-variance matrix (the PCA decomposition).Given a PCA decomposition and an ML estimate of σ ,the established equivalence means that we can write downa generative factor model. This offers two key opportunitiesfurther discussed later: (i) the factor model is Gaussian andwe can simultaneously estimate the loading matrix and miss- MNRAS , 1– ???? (2020) Uzeirbegovic et al. T r u e l a b e l Figure 6.
Error matrix depicting the correspondence betweentrue labels and those predicted by a random forest for edge-on(0), spiral (1) and clumpy (2) galaxies. It shows good ( (cid:38) E i g e n g a l a xy Figure 7.
Importance weighting for each eigengalaxy in our ran-dom forest classifications. This shows a particular emphasis oneigengalaxies 1, 4 and 12 when classifying edge-on, spiral andclumpy galaxies against each other. ing data in an expectation maximisation setting, and thuspredict missing values and compute PCA at the same time .This enables PCA even when data is missing, and it also en-ables the prediction of missing values; (ii) given a generativefactor model fitted within our eigengalaxy framework, wecan assign a probability to the generation of any given galaxy Although not explored in this paper, we also note that PPCAmakes a ‘mixture of PCAs’ model (Tipping & Bishop 1999a) pos-sible. This enables the discovery of linear sub-spaces and the si-multaneous PCA fitting of them, which may be more suited todatasets with which PCA struggles (e.g. data which cannot bewell described by a mean and covariance structure). It may alsoreduce the overall dimensionality requirements for each linear sub-space compared to global PCA.
Figure 8.
25 of the ‘rarest’ galaxies from our sample. Rarity isquantified by the likelihood of the galaxy being generated by thePPCA generating function. The image shows anomalous detec-tions and artefacts along with real, but intrinsically rare types ofobjects. Each 1.8 (cid:48)(cid:48) thumbnail image is an RGB composite of theF160W, F125W, and F814W bands. N u m b e r o f o b j e c t s Figure 9.
The frequency of the ratio of the sum of deltas (differ-ence between the original and predicted pixel flux) and the sumof the flux of the original images for galaxies with missing data.The body of the distribution extends to ± error but ∼ ofobjects are predicted within ± error, indicating that PPCAcan predict missing data in a 2D sense with a high level of fidelity. x using the Gaussian density p ( x | µ, WW T + σ I ) . Sortinggalaxies by their likelihood therefore provides a natural mea-sure of ‘rarity’ which may be used to identify outliers in aquantitative manner. In the next section, we explore somepractical applications of these opportunities. In this section we explore some applications of the proba-bilistic eigengalaxy framework described above.
MNRAS , 1– ?? (2020) igengalaxies A key utility of outlier detection is to make discoverable rarephenomena buried in enormous datasets. This may includesearches for exotic galaxies and rare objects but also theidentification of anomalous detections and pipeline errors.The two-part challenge is first to define an ‘outlier’ in a wayuseful to astronomy, and the second is to scale the detectionalgorithm to the size of the data.Dutta et al. (2007) implement a distributed version ofPCA using random projection and sampling to approximateeigenvectors for the purpose of outlier detection and applyit to the 2MASS (Skrutskie et al. 2006) and SDSS (Fukugitaet al. 1996) datasets. In defining an outlier they search forgalaxies which are over-represented by the last eigenvector.Other large scale PCA methods include incremental PCA(Ross et al. 2008), which approximates PCA by processingdata in batches commensurate with the available randomaccess memory. Baron & Poznanski (2017) utilise a randomforest and fit to discriminate between real and syntheticdata . For every pair of objects, they count the number oftrees in which each pair is labelled ‘real’ in the same leaf. Theoutput is a N × N similarity matrix which is then searchedfor ‘outliers’, defined as objects with a large average distancefrom all other objects.We focus on a simple and principled definition for an‘outlier’, proceeding directly from our eigengalaxy frame-work. Given that PPCA allows us to cast our eigengalaxyresult to a probabilistic model, a natural definition for anoutlier is an object with a low probability of being generated.Outliers can thus be interpreted as the objects least repre-sentative of the object population described by the meanand covariance structure of the factor model. Note that theapproach of calculating PCA for a set of features, convertingit to PPCA and then using our outlier definition is generaland could be applied not only to imaging, but also to spec-troscopy, light curves, catalogues and other kinds of data.We define a formal outlier description within the eigengalaxyframework: given a generative factor model N ( W x + µ, σ I ) ,an outlier x is an object such that p ( x | W x + µ, σ I ) < T where T is a threshold probability density, and can be setaccording to the purpose at hand.To illustrate the concept with the GZ-CANDELSdataset, we use our derived eigengalaxies to assign a log like-lihood to every galaxy, using the score samples method onthe sklearn.decomposition.PCA object which implementsthe Tipping & Bishop (1999b) factor model representationto calculate likelihood. We sort the data by likelihood andpresent in Figure 8 the 25 objects least likely to be gener-ated. The image shows not only anomalous detections andartefacts but also systems that are known to be rare, suchas dust lanes which are signposts of recent minor mergers(see e.g. Kaviraj et al. 2012), ongoing mergers (see e.g. Darget al. 2010) and edge-on spirals which appear to be accret-ing a blue companion. Outlier detection therefore offers anefficient way to identify examples of rare objects in largesurveys. These authors use the flux at each wavelength as a feature set,generating synthetic data by sampling from the marginal distri-bution of each feature.
In many situations one might be missing a particular band,for example due to bad data, partial coverage with a cer-tain bandpass, obliteration by
Starlink satellite trails, etc. Inthese cases we can consider how well we can predict the miss-ing data using eigengalaxies. Tipping & Bishop (1999b) de-fine an expectation maximisation (EM) algorithm for PPCAin the presence of missing data which works by treating miss-ing values as jointly distributed with the latent variables andmaximising the expectation of the joint likelihood function.As a demonstration, we randomly omit a band with equalprobability from our dataset for 5% of rows chosen at ran-dom. We use the data as processed in Section 2.2 and anefficient variational EM version of the Tipping & Bishop(1999b) algorithm set out in Porta et al. (2005) to fit ourPPCA model. We use the delta sum over flux sum ratioas in Figure 2 for the reconstruction error to describe theprediction error. Figure 9 illustrates the distribution of pre-diction error. The distribution is roughly symmetrical andis centred on zero. The body of the distribution extends to ± error but ∼ of objects are accounted for within ± error showing that high fidelity prediction is possi-ble for most objects. Figure 10 provides some examples ofpredicted data for different bands. It is noteworthy that byusing information contained in the eigengalaxies, PPCA issuccessful in estimating total flux even when it bears no re-semblance to that of the other bands. The ability to predictmissing images offers a route to ‘filling in’ missing data, suchas predicting photometric data points which are absent inthe observations in order to reconstruct missing parts of agalaxy’s SED. Given a survey with a large number of objects with diversevariety, and an interest in galaxies of a specific kind of ob-ject, it is useful to be able to present an exemplar galaxyand use it to quickly search for all other galaxies with simi-lar features. The utility and suitability of a similarity searchfor any particular use case will depend primarily on how theobjects are being described, and how the similarity betweentheir descriptions is being calculated. For example, Protopa-pas et al. (2006) use cross-correlation as a proxy for similar-ity between light curves for the purpose of detecting outlierswhereby light curves with the lowest average similarity aredefined as outliers. Sart et al. (2010) utilise dynamic timewarps (Berndt & Clifford 1994) to measure the similaritybetween light curves. Hocking et al. (2018) compare variousmeasures, including Euclidean distance and Pearson’s cor-relation coefficient, and settle on using cosine distance tomeasure similarity from a description generated by growingneural gas prior to hierarchical clustering.The eigengalaxy framework projects galaxies to real rowvectors in an 12D space. The dimensions are not arbitrarybut represent the eigengalaxies which together reconstructthe galaxies whilst preserving 96% of variance. Further, thevalue of each dimension is easy to interpret: it representsthe weights required for each eigengalaxy to reconstruct the Available in Python package pyppca .MNRAS , 1– ????
Starlink satellite trails, etc. Inthese cases we can consider how well we can predict the miss-ing data using eigengalaxies. Tipping & Bishop (1999b) de-fine an expectation maximisation (EM) algorithm for PPCAin the presence of missing data which works by treating miss-ing values as jointly distributed with the latent variables andmaximising the expectation of the joint likelihood function.As a demonstration, we randomly omit a band with equalprobability from our dataset for 5% of rows chosen at ran-dom. We use the data as processed in Section 2.2 and anefficient variational EM version of the Tipping & Bishop(1999b) algorithm set out in Porta et al. (2005) to fit ourPPCA model. We use the delta sum over flux sum ratioas in Figure 2 for the reconstruction error to describe theprediction error. Figure 9 illustrates the distribution of pre-diction error. The distribution is roughly symmetrical andis centred on zero. The body of the distribution extends to ± error but ∼ of objects are accounted for within ± error showing that high fidelity prediction is possi-ble for most objects. Figure 10 provides some examples ofpredicted data for different bands. It is noteworthy that byusing information contained in the eigengalaxies, PPCA issuccessful in estimating total flux even when it bears no re-semblance to that of the other bands. The ability to predictmissing images offers a route to ‘filling in’ missing data, suchas predicting photometric data points which are absent inthe observations in order to reconstruct missing parts of agalaxy’s SED. Given a survey with a large number of objects with diversevariety, and an interest in galaxies of a specific kind of ob-ject, it is useful to be able to present an exemplar galaxyand use it to quickly search for all other galaxies with simi-lar features. The utility and suitability of a similarity searchfor any particular use case will depend primarily on how theobjects are being described, and how the similarity betweentheir descriptions is being calculated. For example, Protopa-pas et al. (2006) use cross-correlation as a proxy for similar-ity between light curves for the purpose of detecting outlierswhereby light curves with the lowest average similarity aredefined as outliers. Sart et al. (2010) utilise dynamic timewarps (Berndt & Clifford 1994) to measure the similaritybetween light curves. Hocking et al. (2018) compare variousmeasures, including Euclidean distance and Pearson’s cor-relation coefficient, and settle on using cosine distance tomeasure similarity from a description generated by growingneural gas prior to hierarchical clustering.The eigengalaxy framework projects galaxies to real rowvectors in an 12D space. The dimensions are not arbitrarybut represent the eigengalaxies which together reconstructthe galaxies whilst preserving 96% of variance. Further, thevalue of each dimension is easy to interpret: it representsthe weights required for each eigengalaxy to reconstruct the Available in Python package pyppca .MNRAS , 1– ???? (2020) Uzeirbegovic et al.
Figure 10.
An example of using PPCA for image imputation. In each figure, the first row shows the original galaxy across all threebands, the second row shows the same galaxy with a random band censored, the third row shows the censored galaxy predicted byPPCA. It is noteworthy that by using information contained in the eigengalaxies, PPCA is successful in estimating brightness correctlyeven when it bares no resemblance to that of the other bands, as illustrated in the left most image.
Figure 11.
Examples of similarity searches. In each 1.8 (cid:48)(cid:48) thumbnail image (an RGB composite of the F160W, F125W, and F814W bands),the exemplar galaxy is given in the top left, followed by 24 of its nearest neighbours by Euclidean distance in the 12D eigengalaxy space.The image shows searches for spirals, edge-on spirals, ellipticals and recent mergers from left to right respectively. original galaxies as a linear combination of eigengalaxies.The closer row-vector weights are to each other, the moresimilar their constitution in terms of eigengalaxies and thusthe more similar their reconstruction. In this case, Euclideandistances are preferable over, say, cosine distance or a corre-lation measure, because we want to be sensitive primarily tothe magnitude of row vector differences. Formally, let r beour exemplar galaxy expressed as a row vector and let X beour dataset (an N × matrix). Let Y represent the columnvector of Euclidean distances of length N . Y n is given by: Y n = (cid:169)(cid:173)(cid:171) (cid:213) i = (cid:0) X n , i − r i (cid:1) (cid:170)(cid:174)(cid:172) / . (1)Let I be the indices of Y . We sort I with respect tothe ascending order of Y , which gives us all objects in orderof similarity to the exemplar. To demonstrate, we pick anexemplar galaxy and implement the procedure above. Fig-ure 11 illustrates four examples of similarity searches, show-ing the 24 most similar galaxies to the exemplar. The imageshows searches for spirals, edge-on spirals, ellipticals and re-cent mergers from left to right respectively. Most of the fea-tures are invariant to arbitrary brightness and the objectregistration has excluded aspects such as galaxy rotation,rendering this elementary algorithm remarkably effective. Unsupervised clustering is particularly useful for massive as-tronomy datasets because it provides a method of investigat-ing very large collections of objects efficiently. If the cluster-ing method is effective at grouping objects with similar fea-tures together, then one can study a dataset by characteris-ing its morphological clusters rather than examining each ob-ject separately. There is some precedence for this in astron-omy. For example, Martin et al. (2020) follow Hocking et al.(2018) in using growing neural gas and hierarchical cluster-ing directly on pixel data to identify structurally distinctclusters. Almeida et al. (2010) and Almeida & Prieto (2013)utilise k -means to classify spectra from SDSS into fewer basetypes. Valenzuela & Pichara (2018) use k -medoids to clus-ter and map sequences of light curve segments to variationaltrees.The interpretation of the eigengalaxy framework de-scribed in Section 2.4 naturally extends to a simple unsuper-vised clustering treatment. We can create a distance matrix(an N × N matrix in which each cell indicates a distance be- In k -medoids, data points become cluster centres, unlike k -means where the cluster centre is not necessarily correspondentto a data point. MNRAS , 1– ?? (2020) igengalaxies tween object j at row j and object i at row i ) by calculatingthe Euclidean distance between every galaxy. The distancematrix provides an input for a broad range of unsupervisedclustering algorithms. This provides us with an opportunityto define and discover groups of galaxies based on the sim-ilarity of their multi-band morphologies as encoded by theeigengalaxies.To demonstrate this, we produce a distance matrix forthe GZ-CANDELS dataset as outlined above. We use affin-ity propagation (Dueck 2009) which produces 536 clustersfrom the total sample of 10,243 objects, with clusters vary-ing in size from 1 to 316 with a median size of 4. Affinitypropagation has a tunable ‘preference’ parameter that al-lows the coarseness of clustering to be adjusted; here weuse the default value (the median of the distance matrix).Figure 12 illustrates samples of galaxies from four morpho-logical clusters. It is noteworthy that affinity propagation isan exemplar-based algorithm, so that each cluster is actuallycharacterised by an exemplar galaxy. Thus, in this instancethe whole dataset is summarised by 536 exemplars (5% ofthe sample) which could, in principle, be sorted by rarity(see Section 3.1) and then viewed as one large image. Thismassive reduction in the scale of the full dataset highlightsthe efficiency gains that can be made by using this methodfor exploring the extremely large imaging surveys of the fu-ture. All the methods described above utilise the same frameworkand can be used together to create additional capabilitiessimply based on the eigengalaxy weights in the score ma-trix. There are many possible combinations but we highlightthree: • Missing value prediction → all methods. Predictingmissing band values, dead pixels, etc. can be used to ‘com-plete’ data so that it may be considered on par with therest of the data. It may then be used with any of the othermethods. • Outlier detection → similarity search / clustering / clas-sification. We may discover an interesting outlier (e.g. a rareobject such as a gravitationally lensed galaxy) of which wewould like to find many more examples (similarity search).We may want to cluster the outliers to determine a self-similar morphological clusters. We may also want to label atraining set of outliers, train a classifier and then select onlythe outliers of interest. • Clustering → missing value prediction. To make the pre-diction of missing values more accurate, we may first gener-ate eigengalaxies using the available bands and use it to cre-ate a clustering. This produces a set of self-similar classes towhich missing values have more in common than the wholepopulation. For each cluster we could then separately deriveeigengalaxies and predict missing values for objects withinthat cluster. The GZ-CANDELS dataset is relatively small and can beeasily processed. In this section we consider problems andadaptations which may be necessary to use these methods with extremely large datasets, such as LSST (Robertsonet al. 2017),
Euclid (Refregier et al. 2010) and the SquareKilometre Array (Weltman et al. 2020) which may containbillions of observed objects. We assert that the eigengalaxyframework offers a novel solution to processing such big databy making it possible to find outliers, search for and clus-ter objects using only the reduced score matrix form of adataset. For example, if LSST was to provide a correspond-ing score matrix, eigengalaxies and galaxy likelihoods, allthe methods described here would be available based on asmall fraction of the full dataset.An important consideration in making the methods pre-sented here feasible for big data is the number of eigengalax-ies required to achieve a practically useful level of explainedvariance. Aggarwal et al. (2001) provides a deeper discus-sion on common issues for algorithms operating in high di-mensional space. In our case we would expect the followingproblems given too many eigengalaxies:(i) The score matrix form of the data may itself be toolarge to be useful. In our case, as the number of eigengalaxiesapproaches 10,243, the total size of the data converges ontothe total size of the cut-out collection.(ii) Euclidean distance loses interpretability and thepower to distinguish objects as the number of dimensionsgrows.(iii) Some techniques like fully calculating a distance ma-trix or calculating the similarity of every object with refer-ence to an exemplar would not be computationally scalable.Taking LSST as an example, we would expect that thenumber of eigengalaxies required to explain, say, 96% vari-ance to be more than in our GZ-CANDELS experiments fortwo primary reasons: (i) it is a deeper survey and it there-fore observes more low mass galaxies, the morphological mixof which is yet unknown (e.g. Martin et al. 2019), and (ii)we would expect bright galaxies to have more detail suchas extended debris and tidal features (e.g. Duc et al. 2011)and therefore exhibit more variance. However, the additionalsources of variance are limited, and we would expect a rea-sonable asymptotic number of eigengalaxies to emerge foruseful level of explained variance at relatively small samplesizes.The kind of variance that needs to be captured andthe sufficient ratio of it to retain ultimately depends on theintended application. If the number of eigengalaxies is toohigh we could explore various ways to reduce the variancenot required for the intended purpose in order to stay withinan eigengalaxy quota at a useful explained variance ratio.Potentially applicable techniques include using smaller cut-outs, down-sampling pixels, using fewer bands, combiningbands, stretching the range of the flux densities to make dif-ferences less subtle, and partitioning the data by some othervariable (e.g. brightness, location, etc.) and then conductingthe analysis per partition. If the number of eigengalaxies wasworkable from a data size perspective but a problem for Eu-clidean distance then we could use other distance measuresmore robust in high dimensional space. Finally, in every casewe would need to replace exhaustive similarity searches withmore scalable methods such as the nearest neighbour algo-rithm (Beis & Lowe 1997), and affinity propagation wouldneed to be replaced with a less expensive clustering method
MNRAS , 1– ????
MNRAS , 1– ???? (2020) Uzeirbegovic et al.
Figure 12.
Composite image samples of four morphological clusters from a total of 536 created using affinity propagation clustering ofa distance matrix defined by pairwise Euclidean distances in 12D eigengalaxy space. Each 1.8 (cid:48)(cid:48) thumbnail image is an RGB compositeof the F160W, F125W, and F814W bands. such as k -means (or k -medoids if Euclidean distance was un-acceptable). We have demonstrated how a large sample of galaxies can bedescribed in terms of a small set of template ‘eigengalaxies’.Eigengalaxies can be considered to be the image-space equiv-alent of ‘basis vectors’, interpreted as orthogonal features.This offers a completely data-driven approach to character-ising galaxy morphology. Using a sample of 10,243 galax-ies from the Galaxy Zoo–CANDELS survey, chosen to liein the deep GOODS–S field, we use three-band
HST
ACSand WFC3 imaging to derive just 12 eigengalaxies that aresufficient to explain 96% of the morphological variance inthe sample. How eigengalaxies correspond to real featuresin galaxies can be explored by examining samples whereparticular eigengalaxies are emphasized or de-emphasized.For example, we have shown how the eigengalaxies can bemapped to semantically convenient labels using Galaxy Zooclassifications. We have further shown how eigengalaxies canbe embedded into a probabilistic framework and so enablea variety of useful applications. We explored four potentialapplications of our framework:(i) we have shown how PPCA can be used to assign likeli-hoods and identify outliers as objects with a low probabilityof being generated.(ii) we have shown how PPCA can be used to predictmissing values in imaging, for example due to bad data orpartial coverage in a given band.(iii) we have shown how the projection of eigengalaxies onto a 12D space naturally facilitates similarity searches, wheregalaxies can be sorted relative to their Euclidean distancefrom an exemplar, thus quickly returning samples of galaxiesthat are morphologically similar to the exemplar as definedby their eigengalaxy components.(iv) we have shown how the Euclidean distances betweengalaxies in the 12D space can be compiled into a distancematrix that can provide the input for unsupervised clus-tering algorithms to discover groups of similar objects. Wehave demonstrated this using affinity propagation to showhow morphologically similar groupings can be identified inlarge samples.Finally, we have described how the methods may be used together in more advanced use cases and how workingin eigengalaxy space may present a novel solution to outlierdetection, as well as search and clustering problems in forth-coming massive imaging datasets such as LSST. We arguethat these results and illustrations underscore the suitabilityof our PCA based probabilistic eigengalaxy framework forthe study of morphology, especially in the era of big dataastronomy, where representational efficiency and relevancewill pay dividends.Code for all results and figures can be found at https://emiruz.com/eigengalaxies . ACKNOWLEDGEMENTS
J.E.G. is supported by the Royal Society. S.K acknowledgesa Senior Research Fellowship from Worcester College Ox-ford. This work is based on observations taken by the CAN-DELS Multi-Cycle Treasury Program with the NASA/ESAHST, which is operated by the Association of Universities forResearch in Astronomy, Inc., under NASA contract NAS5-26555. This research made use of Astropy, a community-developed core Python package for Astronomy (Astropy Col-laboration et al. 2013; Price-Whelan et al. 2018). REFERENCES
Abraham R. G., Valdes F., Yee H. K. C., van den Bergh S., 1994,ApJ, 432, 75Aggarwal C. C., Hinneburg A., Keim D. A., 2001, in Internationalconference on database theory. pp 420–434Almeida J. S., Prieto C. A., 2013, The Astrophysical Journal, 763,50Almeida J. S., Aguerri J. A. L., Munoz-Tun´on C., De Vicente A.,2010, The Astrophysical Journal, 714, 487Anderson B., Moore A., Connolly A., Nichol R., 2004, in Pro-ceedings of the tenth ACM SIGKDD international conferenceon Knowledge discovery and data mining. pp 40–48Astropy Collaboration et al., 2013, A&A, 558, A33Bamford S. P., et al., 2009, MNRAS, 393, 1324Baron D., Poznanski D., 2017, Monthly Notices of the Royal As-tronomical Society, 465, 4530Beck M. R., et al., 2018, MNRAS, 476, 5516 , 1– ?? (2020) igengalaxies Beis J. S., Lowe D. G., 1997, in Proceedings of IEEE computer so-ciety conference on computer vision and pattern recognition.pp 1000–1006Berndt D. J., Clifford J., 1994, in KDD workshop. pp 359–370Bluck A. F. L., Mendel J. T., Ellison S. L., Moreno J., Simard L.,Patton D. R., Starkenburg E., 2014, MNRAS, 441, 599Bundy K., Ellis R. S., Conselice C. J., 2005, ApJ, 625, 621Cappellari M., et al., 2011, MNRAS, 416, 1680Cheng T.-Y., et al., 2019, arXiv e-prints, p. arXiv:1908.03610Codis S., Pichon C., Devriendt J., Slyz A., Pogosyan D., DuboisY., Sousbie T., 2012, MNRAS, 427, 3320Conselice C. J., 2003, ApJS, 147, 1Conselice C. J., 2006, ApJ, 638, 686Darg D. W., et al., 2010, MNRAS, 401, 1043De La Calleja J., Fuentes O., 2004, Monthly Notices of the RoyalAstronomical Society, 349, 87Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441Djorgovski S. G., Mahabal A. A., Donalek C., Graham M. J.,Drake A. J., Moghaddam B., Turmon M., 2012, arXiv e-prints,Dressler A., et al., 1997, ApJ, 490, 577Duc P.-A., et al., 2011, MNRAS, 417, 863Dueck D., 2009, Affinity propagation: clustering data by passingmessages. CiteseerDutta H., Giannella C., Borne K., Kargupta H., 2007, in Pro-ceedings of the 2007 SIAM International Conference on DataMining. pp 473–478Fukugita M., Shimasaku K., Ichikawa T., Gunn J., et al., 1996,Technical report, The Sloan digital sky survey photometricsystem. SCAN-9601313Goulding A. D., et al., 2018, Publications of the AstronomicalSociety of Japan, 70, S37Grogin N. A., et al., 2011, The Astrophysical Journal SupplementSeries, 197, 35Hocking A., Geach J. E., Sun Y., Davey N., 2018, MNRAS, 473,1108Hubble E. P., 1926, ApJ, 64, 321Huertas-Company M., et al., 2015, ApJS, 221, 8Jackson R. A., Martin G., Kaviraj S., Laigle C., DevriendtJ. E. G., Dubois Y., Pichon C., 2019, MNRAS, 489, 4679Jackson R. A., Martin G., Kaviraj S., Laigle C., De-vriendt J., Dubois Y., Pichon C., 2020, arXiv e-prints, p.arXiv:2004.00023Kaviraj S., 2010, MNRAS, 406, 382Kaviraj S., 2014, MNRAS, 440, 2944Kaviraj S., et al., 2007, ApJS, 173, 619Kaviraj S., et al., 2012, MNRAS, 423, 49Kaviraj S., Devriendt J., Dubois Y., Slyz A., Welker C., PichonC., Peirani S., Le Borgne D., 2015, MNRAS, 452, 2845Kaviraj S., Martin G., Silk J., 2019, MNRAS, 489, L12Koekemoer A. M., et al., 2011, The Astrophysical Journal Sup-plement Series, 197, 36Lackner C. N., Gunn J. E., 2012, MNRAS, 421, 2277Lahav O., et al., 1995, Science, 267, 859Le Borgne J.-F., et al., 2003, Astronomy & Astrophysics, 402, 433Li C., Wang T.-G., Zhou H.-Y., Dong X.-B., Cheng F.-Z., 2005,The Astronomical Journal, 129, 669Liaw A., Wiener M., et al., 2002, R news, 2, 18Lintott C., et al., 2011, MNRAS, 410, 166Lotz J. M., Primack J., Madau P., 2004, AJ, 128, 163Ma Z., et al., 2019, The Astrophysical Journal Supplement Series,240, 34Martin G., et al., 2018a, MNRAS, 476, 2801Martin G., Kaviraj S., Devriendt J. E. G., Dubois Y., Pichon C.,2018b, MNRAS, 480, 2266Martin G., et al., 2019, MNRAS, 485, 796Martin G., Kaviraj S., Hocking A., Read S. C., Geach J. E., 2020,MNRAS, 491, 1408 Menanteau F., Ford H. C., Motta V., Ben´ıtez N., Martel A. R.,Blakeslee J. P., Infante L., 2006, AJ, 131, 208Menou K., 2018, arXiv e-prints, p. arXiv:1811.06374Odewahn S. C., Cohen S. H., Windhorst R. A., Philip N. S., 2002,ApJ, 568, 539Ostrovski F., et al., 2017, MNRAS, 465, 4325Peirani S., Crockett R. M., Geen S., Khochfar S., Kaviraj S., SilkJ., 2010, MNRAS, 405, 2327Peth M. A., et al., 2016, MNRAS, 458, 963Porta J. M., Verbeek J. J., Kr¨ose B. J., 2005, Autonomous Robots,18, 59Postman M., et al., 2005, ApJ, 623, 721Price-Whelan A. M., et al., 2018, AJ, 156, 123Protopapas P., Giammarco J., Faccioli L., Struble M., Dave R.,Alcock C., 2006, Monthly Notices of the Royal AstronomicalSociety, 369, 677Refregier A., Amara A., Kitching T. D., Rassat A., ScaramellaR., Weller J., 2010, arXiv e-prints, p. arXiv:1001.0061Robertson B. E., et al., 2017, preprint, ( arXiv:1708.01617 )Ross D. A., Lim J., Lin R.-S., Yang M.-H., 2008, Internationaljournal of computer vision, 77, 125Sart D., Mueen A., Najjar W., Keogh E., Niennattrakul V., 2010,in 2010 IEEE International Conference on Data Mining. pp1001–1006Scarlata C., et al., 2007, ApJS, 172, 406Schawinski K., et al., 2014, MNRAS, 440, 889Schawinski K., Zhang C., Zhang H., Fowler L., Santhanam G. K.,2017, MNRAS, 467, L110S´ersic J. L., 1963, Boletin de la Asociacion Argentina de Astrono-mia La Plata Argentina, 6, 41Simard L., et al., 2002, ApJS, 142, 1Simmons B. D., et al., 2016, Monthly Notices of the Royal Astro-nomical Society, p. stw2587Simmons B. D., et al., 2017, MNRAS, 464, 4420Sirovich L., Kirby M., 1987, Josa a, 4, 519Skibba R. A., et al., 2009, MNRAS, 399, 966Skrutskie M., et al., 2006, The Astronomical Journal, 131, 1163Smethurst R. J., et al., 2015, MNRAS, 450, 435Soo J., et al., 2018, Monthly Notices of the Royal AstronomicalSociety, 475, 3613Strateva I., et al., 2001, AJ, 122, 1861Tipping M. E., Bishop C. M., 1999a, Neural computation, 11, 443Tipping M. E., Bishop C. M., 1999b, Journal of the Royal Statis-tical Society: Series B (Statistical Methodology), 61, 611Valenzuela L., Pichara K., 2018, Monthly Notices of the RoyalAstronomical Society, 474, 3259Walmsley M., Ferguson A. M. N., Mann R. G., Lintott C. J.,2019, MNRAS, 483, 2968Weltman A., et al., 2020, Publ. Astron. Soc. Australia, 37, e002Wild V., et al., 2014, Monthly Notices of the Royal AstronomicalSociety, 440, 1880Willett K. W., et al., 2015, MNRAS, 449, 820Willett K. W., et al., 2017, MNRAS, 464, 4176Wollaeger R. T., et al., 2018, MNRAS, 478, 3298de Vaucouleurs G., 1948, Annales d’Astrophysique, 11, 247MNRAS , 1– ????