Soft clustering analysis of galaxy morphologies: A worked example with SDSS
René Andrae, Peter Melchior, and Matthias Bartelmann
Max-Planck-Institut für Astronomie, Königstuhl 17, 69117 Heidelberg, Germany; e-mail: [email protected]
Institut für Theoretische Astrophysik, Zentrum für Astronomie, Albert-Ueberle-Str. 2, 69120 Heidelberg, Germany
Submitted to A&A on Feb 01 2010
Abstract
Context.
The huge and still rapidly growing number of galaxies in modern sky surveys raises the need for an automated and objective classification method. Unsupervised learning algorithms are of particular interest, since they discover classes automatically.
Aims.
We briefly discuss the pitfalls of oversimplified classification methods and outline an alternative approach called 'clustering analysis'.
Methods.
We categorise different classification methods according to their capabilities. Based on this categorisation, we present a probabilistic classification algorithm that automatically detects the optimal classes preferred by the data. We explore the reliability of this algorithm in systematic tests. Using a small sample of bright galaxies from the SDSS, we demonstrate the performance of this algorithm in practice. We are able to disentangle the problems of classification and parametrisation of galaxy morphologies in this case.
Results.
We give physical arguments that a probabilistic classification scheme is necessary. The algorithm we present produces reasonable morphological classes and object-to-class assignments without any prior assumptions.
Conclusions.
There are sophisticated automated classification algorithms that meet all necessary requirements, but a lot of work is still needed on the interpretation of the results.
Key words.
Galaxies; Surveys; Methods: data analysis, statistical
1. Introduction
Classification of objects is typically the first step towards scientific understanding, since it brings order to a previously unorganised set of observational data and provides standardised terms to describe objects. These standardised terms are usually qualitative, but they can also be quantitative, which makes them accessible to mathematical analysis. A famous example of a successful classification from the field of astrophysics is the Hertzsprung-Russell diagram, where stars exhibit distinct groups in the colour-magnitude diagram that represent their different evolutionary stages. For the same reason, galaxy classification is an important conceptual step towards understanding the physical properties, formation and evolution scenarios of galaxies.

With the advent of modern sky surveys containing millions (e.g. SDSS, COSMOS, PanSTARRS, GAMA) or even billions (e.g. LSST) of galaxies, the classification of these galaxies is becoming more and more problematic. The vast amount of data excludes the hitherto common practice of visual classification and clearly calls for an automated classification scheme that is more efficient and more objective. In this work, we present an algorithm for automated and probabilistic classification, where the classes are discovered automatically, too. However, the intention of this work is not to come up with 'yet another morphological classification scheme', but rather to demonstrate how it could be done alternatively to the standard practice of classification in astrophysics. Besides, we are unable to present a full solution to the problem of morphological galaxy classification, since there is still no accepted method for parametrising arbitrary galaxy morphologies (cf. Andrae et al. in prep.). In addition to the lack of convincing classification schemes, this is why many experts are very sceptical about the subject of classifying galaxy morphologies as a whole. As the parametrisation of galaxy spectra is more reliable, spectral classifications have become more accepted.

In the remaining part of this introduction, we first give an overview of modern automated classification methods and work out a categorisation of these methods. We describe our parametrisation of galaxy morphologies using shapelets (Réfrégier 2003) in Sect. 2. In Sect. 3 we present the algorithm we are using, which has been introduced before by Yu et al. (2005) in the field of pattern recognition. We extensively investigate the reliability of this classification algorithm in Sect. 4. Such a study has not been undertaken by Yu et al. (2005). In Sect. 5 we present a worked example with a small sample of 1,520 bright galaxies from the SDSS. The objects in this sample are selected such that no practical problems with parametrisation arise, as we want to disentangle the problems of classification and parametrisation as much as possible. The aim of this worked example is not to do science with the resulting classes or data-to-class assignments, but to demonstrate that such an algorithm indeed produces reasonable results. We conclude in Sect. 6.

Table 1.
Overview of different classification and clustering algorithms with examples.

Type   Classification                             Clustering
Hard   nearest neighbour,                         K-means,
       Fisher's linear discriminant analysis      spectral clustering,
                                                  kernel PCA
Soft   naïve Bayes,                               Gaussian mixture models
       linear/quadratic discriminant analysis,
       neural networks

Soft (probabilistic) algorithms are always model-based, whereas hard algorithms are not necessarily. Soft algorithms can always be turned into hard algorithms, but not vice versa. The list of example algorithms is not complete.
In Table 1 we give an overview of different classification methods and some example algorithms. The two criteria for this categorisation are:

1. Is the data-to-class assignment probabilistic (soft) or not (hard)?
2. Are the classes specified a priori (classification) or discovered automatically (clustering)?

Not all algorithms fit into this categorisation, namely those that do not directly assign classes to objects (e.g. self-organising maps).

The algorithm we are going to present is a soft algorithm, i.e. the data-to-class assignment is probabilistic (cf. next section). The reason is that in the case of galaxy morphologies, it is obvious that the classes will not be clearly separable. We rather expect the galaxies to be more or less homogeneously distributed in some parameter space, with the classes appearing as local overdensities and exhibiting potentially strong overlap. As we demonstrate in Sect. 4.2, hard algorithms break down in this case, producing biased classification results. There are physical reasons to expect overlapping classes: First, the random inclination and orientation angles w.r.t. the line of sight induce a continuous transition of apparent axis ratios, apparent steepness of the radial light profiles, and the ratio of light coming from bulge and disk components. Second, observations of galaxies show that there are indeed transitional objects between different morphological types. For instance, there are transitional objects between early- and late-type galaxies in the 'green valley' of the colour bimodality (e.g. Strateva et al. 2001; Baldry et al. 2004), which is also reproduced in simulations (Croton et al. 2006). Hence, we have to draw the conclusion that hard algorithms are generically inappropriate for analysing galaxy morphologies. This conclusion is backed up by practical experience, since even various specialists usually do not agree in hard visual classifications (e.g. Bamford et al. 2009). In fact, the outcome of multi-person visual classifications becomes a probability distribution automatically.

Furthermore, our algorithm is a clustering algorithm, i.e. we do not specify the morphological classes a priori, but let the algorithm discover them. This approach is called 'unsupervised learning' and it is the method of choice if we are uncertain about the type of objects we will find in a given data sample. In the context of clustering analysis, classes are referred to as clusters, and we adopt this terminology in this article.
Let O denote an object and x its parametrisation. Furthermore, let c_k denote a single class out of k = 1, ..., K possible classes; then prob(c_k | x) denotes the probability of class c_k given the object O represented by x. This conditional probability prob(c_k | x) is called the class posterior and is computed using Bayes' theorem,

  \mathrm{prob}(c_k\,|\,\mathbf{x}) = \frac{\mathrm{prob}(c_k)\,\mathrm{prob}(\mathbf{x}\,|\,c_k)}{\mathrm{prob}(\mathbf{x})} .    (1)

The marginal probability prob(c_k) is called class prior and prob(x | c_k) is called class likelihood. The denominator prob(x) acts as a normalisation factor. The class prior and likelihood are obtained from a generative model (Sect. 3.3). Prior and posterior satisfy the following obvious normalisation constraints,

  \sum_{k=1}^{K} \mathrm{prob}(c_k) = 1 \qquad\text{and}\qquad \sum_{k=1}^{K} \mathrm{prob}(c_k\,|\,\mathbf{x}) = 1 ,    (2)

which ensure that each object is definitely assigned to some class. In the case of hard assignments, both posterior prob(c_k | x) and likelihood prob(x | c_k) are replaced by Kronecker symbols.
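As an illustration, the following minimal sketch (Python) computes class posteriors according to Eqs. (1) and (2); the priors and likelihoods are hypothetical numbers, not taken from any real model:

```python
import numpy as np

def class_posteriors(priors, likelihoods):
    """Class posteriors via Bayes' theorem, Eq. (1).

    priors      : array of shape (K,), prob(c_k), summing to 1
    likelihoods : array of shape (K,), prob(x | c_k) for one object x
    """
    joint = priors * likelihoods   # prob(c_k) * prob(x | c_k)
    return joint / joint.sum()     # divide by prob(x); Eq. (2) then holds

# Hypothetical numbers for a three-class problem:
post = class_posteriors(np.array([0.5, 0.3, 0.2]),
                        np.array([0.1, 0.4, 0.2]))
print(post, post.sum())            # posteriors sum to 1
```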
2. Parametrising galaxy morphologies with shapelets
We parametrise galaxy morphologies in terms of shapelets (Réfrégier 2003). Shapelets are a scaled version of two-dimensional Gauss-Hermite polynomials and form a set of complete basis functions that are orthonormal on the interval [−∞, ∞]. A given galaxy image I(x) can be decomposed into a linear superposition of basis functions B_{m,n}(x/β), i.e.

  I(\mathbf{x}) = \sum_{m,n=0}^{\infty} c_{m,n}\, B_{m,n}(\mathbf{x}/\beta) ,    (3)

where the c_{m,n} denote the expansion coefficients that contain the morphological information and β > 0 denotes the scale radius. In practice, the expansion has to be truncated at some maximum order N_max < ∞, which depends on the object's signal-to-noise ratio and resolution. This means Eq. (3) is an approximation only,

  I(\mathbf{x}) \approx \sum_{m,n=0}^{N_{\max}} c_{m,n}\, B_{m,n}(\mathbf{x}/\beta) .    (4)

We use the C++ algorithm by Melchior et al. (2007) to estimate N_max, the scale radius and the linear coefficients, which was shown to be faster and more accurate than the IDL algorithm by Massey & Réfrégier (2005).

It was shown by Melchior et al. (2009b) that the limitation of the number of basis functions in Eq. (4) can lead to severe modelling failures and misestimations of galaxy shapes in case of objects with low signal-to-noise ratios. They identified two origins of these biases: First, the Gaussian profile of shapelets does not match the true profiles of galaxies, which are typically much steeper. Second, the shapelet basis functions are intrinsically spherical, i.e. they have problems in modelling highly eccentric objects. However, in this demonstration we consider only galaxies with high signal-to-noise ratios, where we can use many basis functions such that the impact of these biases is negligible. We demonstrate this in Fig. 1, where we show the shapelet reconstructions of a face-on disk, an edge-on disk and an elliptical galaxy drawn from the sample presented in Sect. 5.1. The reconstruction of the face-on disk galaxy (top row) is excellent, leaving essentially uncorrelated noise in the residuals. However, the reconstructions of the edge-on disk galaxy (centre row) and the elliptical galaxy (bottom row) exhibit ring-like artefacts that originate from the steep light profiles of the elliptical and the edge-on disk along the minor axis. Such modelling failures appear systematically and do not introduce additional scatter into the results, i.e. similar galaxies are affected in a similar way. However, since shapelet models do not capture steep and strongly elliptical galaxies very well, we are aware that our algorithm has less discriminatory power for galaxies of this kind.
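For concreteness, a minimal sketch of Eqs. (3) and (4) follows (Python). This is not the C++ code of Melchior et al. (2007): here the scale radius β, the centroid and N_max are assumed to be known rather than estimated, and the triangular truncation m + n ≤ N_max is an assumption consistent with the coefficient counts quoted later (91 for N_max = 12, 153 for N_max = 16).

```python
import numpy as np
from math import factorial, pi, sqrt
from scipy.special import eval_hermite   # physicists' Hermite polynomials H_n

def shapelet_1d(n, x):
    """Dimensionless 1D Gauss-Hermite basis function."""
    norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
    return norm * eval_hermite(n, x) * np.exp(-0.5 * x**2)

def shapelet_2d(m, n, x1, x2, beta):
    """2D Cartesian shapelet basis function B_{m,n}(x/beta)."""
    return shapelet_1d(m, x1 / beta) * shapelet_1d(n, x2 / beta) / beta

def decompose(img, beta, n_max):
    """Least-squares estimate of the coefficients c_{m,n} of Eq. (4)."""
    ny, nx = img.shape
    x1, x2 = np.meshgrid(np.arange(nx) - nx / 2.0,
                         np.arange(ny) - ny / 2.0)
    # triangular truncation: (N_max + 1)(N_max + 2)/2 basis functions
    orders = [(m, n) for m in range(n_max + 1) for n in range(n_max + 1 - m)]
    basis = np.stack([shapelet_2d(m, n, x1, x2, beta).ravel()
                      for m, n in orders], axis=1)
    coeffs, *_ = np.linalg.lstsq(basis, img.ravel(), rcond=None)
    return orders, coeffs
```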
The coefficients form a vector space and we denote them as vectors x. In first-order approximation, these coefficient vectors are independent of the size of the object, which was encoded by the scale radius β. Moreover, we can also make x invariant against the image flux, since Eq. (3) implies that for a constant scalar α ≠ 0 the transformation x → αx changes the image flux by this same factor of α. Therefore, if we demand x·x = 1, then differing image fluxes will have no impact on the shapelet coefficients. This implies that morphologies are a direction in shapelet coefficient space and the corresponding coefficient vectors lie on the surface of a hypersphere with unit radius. We can thus measure distances between morphologies on this surface via the angle spanned by their (normalised) coefficient vectors,

  d(\mathbf{x}_1, \mathbf{x}_2) = \angle(\mathbf{x}_1, \mathbf{x}_2) = \arccos(\mathbf{x}_1 \cdot \mathbf{x}_2) .    (5)

Employing the polar representation of shapelets (Massey & Réfrégier 2005), we can apply rotations and parity flips to shapelet models. We can estimate the object's orientation angle from the second moments of its light distribution (e.g. Melchior et al. 2007) and then use this estimate to align all models. This ensures invariance of the coefficients against random orientations. Additionally, we can break the degeneracy between left- and right-handed morphologies by applying parity flips such that the distance of two objects is minimised. These transformations in model space do not suffer from pixellation errors and increase the local density of similar objects in shapelet space.
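In code, Eq. (5) is essentially a one-liner (Python; a sketch that omits the rotation and parity alignment just described):

```python
import numpy as np

def spherical_distance(c1, c2):
    """Angular distance of Eq. (5) between two coefficient vectors.

    Each vector is normalised to unit length first, so the distance is
    invariant under changes of flux (and of size, to first order).
    """
    u1 = c1 / np.linalg.norm(c1)
    u2 = c2 / np.linalg.norm(c2)
    # clip guards against round-off pushing the dot product outside [-1, 1]
    return np.arccos(np.clip(np.dot(u1, u2), -1.0, 1.0))
```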
3. Soft Clustering Algorithm
We now present the soft clustering algorithm of Yu et al. (2005). Before we explain the details, we want to give a brief outline of the general method. The basic idea is to assign similarities to pairs of objects, so we first explain how to measure similarities of galaxy morphologies and what a similarity matrix is. These pairwise similarities are then interpreted by a probabilistic model, which provides our generative model. We also present the algorithm that fits the model to the similarity matrix.
Instead of analysing the data in shapelet space, we compute a similarity matrix by assigning similarities to any two data points. This approach is an alternative to working directly in the sparsely populated shapelet space or employing a method for dimensionality reduction. If we have N data points x_n, then this similarity matrix will be an N × N symmetric matrix. It is this similarity matrix to which we are going to apply the soft clustering analysis. Based on the pairwise distances in shapelet coefficient space (Eq. (5)), we estimate pairwise similarities up to a constant factor as

  W_{mn} \propto s^{-\left(d(\mathbf{x}_m,\mathbf{x}_n)/d_{\max}\right)^{\alpha}} .    (6)

Here d_max denotes the maximum distance between any two objects in the given data sample, while the exponent α > 0 and the base s > 1 are tuning parameters. We explain how to choose α and s in Sect. 4.3. This definition ensures that 0 < W_{mn} ≤ W_{mm}, since d(x_m, x_m) = 0. Note that this similarity measure is invariant under changes of size, flux, orientation, and parity of the galaxy morphology.

Square symmetric similarity matrices have a very intuitive interpretation: They represent a weighted undirected graph. Figure 2 shows a sketch of such a graph. The data points x_n are represented symbolically as nodes. The positions of these nodes are usually arbitrary; it is neither necessary nor helpful to arrange them according to the true locations of the data points in parameter space. Any two data nodes x_m and x_n are connected by an edge, which is assigned a weight W_{mn}. Obviously, all the weights W_{mn} form an N × N matrix W, and if this matrix is symmetric, i.e. W_{mn} = W_{nm}, the edges will have no preferred direction. In this case, the weighted graph is undirected. In graph theory the matrix of weights W is called the adjacency matrix, and we can interpret the similarity matrix as the adjacency matrix of a weighted undirected graph.

Inspecting Fig. 2, we now introduce some important concepts. First, we note that there is also an edge connecting a node with itself. This edge is weighted by the 'self-similarity' W_{nn}. These self-similarities are usually non-zero and have to be taken into account in order to satisfy the normalisation constraints (cf. Eq. (8)).
Figure 1.
Examples of shapelet models of three galaxies from SDSS (g band). Shown are the original images (left column), the shapelet models (centre column) and the residuals (right column) of a face-on disk galaxy (top row), an edge-on disk galaxy (centre row) and an elliptical galaxy (bottom row). Note the different plot ranges of the residual maps. The shapelet decomposition used N_max = 16, i.e. 153 basis functions.

Second, we define the degree d_n of a data node x_n as the sum of the weights of all edges connected with x_n, i.e.

  d_n = \sum_{m=1}^{N} W_{mn} .    (7)

We can interpret the degree d_n as measuring the connectivity of data node x_n in the graph. For instance, we can detect outlier objects by their low degree, since they are very dissimilar to all other objects. Third, we note that we can rescale all similarities by a constant scalar factor C > 0 such that

  \sum_{m,n=1}^{N} W_{mn} = \sum_{n=1}^{N} d_n = 1 .    (8)

This constraint ensures the normalisation of the probabilistic model we are going to set up for our soft clustering analysis of the similarity matrix.
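A minimal sketch of Eqs. (6)-(8) (Python; the tuning parameters α and s are inputs here, chosen as described in Sect. 4.3):

```python
import numpy as np

def similarity_matrix(dist, alpha, s):
    """Normalised similarity matrix of Eq. (6) from pairwise distances.

    dist : symmetric (N, N) matrix of distances, e.g. from Eq. (5)
    """
    w = s ** (-(dist / dist.max()) ** alpha)
    return w / w.sum()        # rescale so that Eq. (8) holds

def degrees(w):
    """Node degrees d_n of Eq. (7); they sum to 1 after normalisation."""
    return w.sum(axis=0)
```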
Figure 2.
Sketch of a weighted undirected graph.
The data nodes x_n are connected by edges. For the sake of visibility, only the edges connecting x_1 are shown. The edges are undirected and weighted by the similarity of the two connected nodes.

Figure 3.
Sketch of a bipartite graph.
The bipartite graph contains two sets of nodes, X = {x_1, ..., x_N} and C = {c_1, ..., c_K}. Edges connect nodes from different sets only and are weighted by an adjacency matrix B. Not all edges are shown.

We need a probabilistic model of the similarity matrix W that can be interpreted in terms of a soft clustering analysis. Such a model was proposed by Yu et al. (2005). As similarity matrices are closely related to graphs, this model is motivated from graph theory, too. The basic idea of this model is that the similarity of two data points x_m and x_n is induced by both objects being members of the same clusters. This is the basic hypothesis of any classification approach: Objects from the same class are more similar than objects from different classes.

In detail, we model a weighted undirected graph (Fig. 2) and its similarity matrix by a bipartite graph (Fig. 3). A bipartite graph is a graph whose nodes can be divided into two disjoint sets X = {x_1, ..., x_N} of data nodes and C = {c_1, ..., c_K} of cluster nodes, such that the edges in the graph only connect nodes from different sets. Again, the edges are weighted and undirected, where the weights B_{nk} form an N × K rectangular matrix B, the bipartite-graph adjacency matrix. The bipartite-graph model for the similarity matrix then reads

  \hat{W}_{mn} = \sum_{k=1}^{K} \frac{B_{nk} B_{mk}}{\lambda_k} ,    (9)

with the cluster priors λ_k = Σ_{n=1}^N B_{nk}. A detailed derivation is given in the following section. This model induces the pairwise similarities via two-hop transitions X → C → X (cf. Yu et al. 2005). The numerator accounts for the strength of the connections of both data nodes to a certain cluster. The impact of the denominator is that the common membership to a cluster of small degree is considered more decisive. Obviously, the model defined by Eq. (9) is symmetric, as is the similarity matrix itself. The normalisation constraint on W as given by Eq. (8) translates via the bipartite-graph model to

  \sum_{k=1}^{K} \sum_{n=1}^{N} B_{nk} = \sum_{k=1}^{K} \lambda_k = 1 .    (10)

These constraints need to be respected by the fit algorithm. Having fitted the bipartite-graph model to the given data similarity matrix, we compute the cluster posterior probabilities

  \mathrm{prob}(c_k\,|\,\mathbf{x}_n) = \frac{\mathrm{prob}(\mathbf{x}_n, c_k)}{\mathrm{prob}(\mathbf{x}_n)} = \frac{B_{nk}}{\sum_{l=1}^{K} B_{nl}} ,    (11)

which are the desired soft data-to-cluster assignments. Obviously, K cluster posteriors are assigned to each data node x_n and the normalisation constraint Σ_{k=1}^K prob(c_k | x_n) = 1 is satisfied.

Here we give a derivation of the bipartite-graph model of Eq. (9) that is more detailed than in Yu et al. (2005). The ansatz is to identify the similarity Ŵ_{mn} with the joint probability

  \hat{W}_{mn} = \mathrm{prob}(\mathbf{x}_m, \mathbf{x}_n) .    (12)

This interprets Ŵ_{mn} as the probability to find x_m and x_n in the same cluster. Eq. (8) ensures the normalisation Σ_{m,n=1}^N prob(x_m, x_n) = 1. As we do not know which particular cluster induces the similarity, we have to marginalise over all cluster nodes in Fig. 3,

  \mathrm{prob}(\mathbf{x}_m, \mathbf{x}_n) = \sum_{k=1}^{K} \mathrm{prob}(\mathbf{x}_m, \mathbf{x}_n, c_k) .    (13)

With this marginalisation we have switched from the weighted undirected graph to our bipartite-graph model. Applying Bayes' theorem yields

  \mathrm{prob}(\mathbf{x}_m, \mathbf{x}_n) = \sum_{k=1}^{K} \mathrm{prob}(\mathbf{x}_n\,|\,c_k)\, \mathrm{prob}(\mathbf{x}_m, c_k) ,    (14)

where we have used

  \mathrm{prob}(\mathbf{x}_n\,|\,\mathbf{x}_m, c_k) = \mathrm{prob}(\mathbf{x}_n\,|\,c_k) ,    (15)

since x_m and x_n are not directly connected in the bipartite graph, i.e. they are statistically independent.
This is the only assumption in this derivation, and it implies that all statistical dependence is induced by the clusters. Using Bayes' theorem once more yields

  \mathrm{prob}(\mathbf{x}_m, \mathbf{x}_n) = \sum_{k=1}^{K} \frac{\mathrm{prob}(\mathbf{x}_n, c_k)\, \mathrm{prob}(\mathbf{x}_m, c_k)}{\mathrm{prob}(c_k)} .    (16)

We identify the bipartite-graph adjacency matrix in analogy to Eq. (12),

  B_{nk} = \mathrm{prob}(\mathbf{x}_n, c_k) ,    (17)

with its marginalisation

  \lambda_k = \mathrm{prob}(c_k) = \sum_{n=1}^{N} \mathrm{prob}(\mathbf{x}_n, c_k) = \sum_{n=1}^{N} B_{nk} .    (18)

The marginalised probabilities λ_k are the cluster priors of the cluster nodes c_k in the bipartite graph. Moreover, the λ_k are the degrees of the cluster nodes.

In order to fit the bipartite-graph model defined by Eq. (9) to a given similarity matrix, we perform some simplifications. First, we note that we can rewrite Eq. (9) using matrix notation,

  \hat{W} = B \cdot \Lambda^{-1} \cdot B^T ,    (19)

where B is the N × K bipartite-graph adjacency matrix and Λ = diag(λ_1, ..., λ_K) is the K × K diagonal matrix of cluster degrees. This notation enables us to employ fast and efficient algorithms from linear algebra. We change variables by

  B = H \cdot \Lambda ,    (20)

where H is an N × K matrix. The elements of H can be interpreted as the cluster likelihoods, since H_{nk} = B_{nk}/λ_k = prob(x_n, c_k)/prob(c_k) = prob(x_n | c_k). Using these new variables H and Λ, the model Ŵ of the data similarity matrix W is given by

  \hat{W} = H \cdot \Lambda \cdot H^T ,    (21)

where we have eliminated the matrix inversion and reduced the nonlinearity to some extent. The normalisation constraints of Eq. (10) translate to H as

  \sum_{n=1}^{N} H_{nk} = \sum_{n=1}^{N} \mathrm{prob}(\mathbf{x}_n\,|\,c_k) = 1 \qquad \forall\, k = 1, \ldots, K .    (22)

The normalisation constraints on H and Λ are now decoupled, and we can treat both matrices as independent of each other. As H is an N × K matrix and Λ a K × K diagonal matrix, we have K(N + 1) model parameters. In comparison to this number, we have N(N + 1)/2 independent elements in the data similarity matrix due to its symmetry. Hence, a reasonable fit situation requires N ≫ K in order to constrain all model parameters.

The data similarity matrix W is fitted by maximising the logarithmic likelihood function log L of the bipartite-graph model. Yu et al. (2005) give a derivation of this function based on the theory of random walks on graphs. Their result is

  \log \mathcal{L}(\Theta\,|\,W) = \sum_{m,n=1}^{N} W_{mn} \log \mathrm{prob}(\mathbf{x}_m, \mathbf{x}_n\,|\,\Theta) ,    (23)

where Θ = {H_{11}, ..., H_{NK}, λ_1, ..., λ_K} denotes the set of K(N + 1) model parameters and prob(x_m, x_n | Θ) = Σ_{k=1}^K H_{mk} λ_k H_{nk} = Ŵ_{mn} is the model. If we remember that W_{mn} = prob(x_m, x_n), then we see that log L is the cross entropy of the true probability distribution W_{mn} = prob(x_m, x_n) and the modelled distribution Ŵ_{mn} = prob(x_m, x_n | Θ). Consequently, maximising log L maximises the information our model contains about the data similarity matrix.

Directly maximising log L is too hard, since the fit parameters are subject to the constraints given by Eqs. (10) and (22). We use an alternative approach that makes use of the expectation-maximisation (EM) algorithm, which is an iterative fit routine. Given an initial guess of the model parameters, the EM algorithm provides a set of algebraic update equations to compute an improved estimate of the optimal parameters that automatically respects the normalisation. These update equations are (Bilmes 1997; Yu et al. 2005)

  \lambda_k^{\mathrm{new}} = \lambda_k \sum_{m,n=1}^{N} \frac{W_{mn}}{(H \Lambda H^T)_{mn}}\, H_{mk} H_{nk} ,    (24)

  H_{nk}^{\mathrm{new}} \propto H_{nk}\, \lambda_k \sum_{m=1}^{N} \frac{W_{mn}}{(H \Lambda H^T)_{mn}}\, H_{mk} .    (25)
The H_{nk}^{new} have to be normalised by hand, whereas the λ_k^{new} are already normalised. Each iteration step updates all the model parameters, which has time complexity O(K · N²) for K clusters and N data nodes. We initialise all the cluster degrees to λ_k = 1/K, whereby we trivially satisfy the normalisation condition and simultaneously ensure that no cluster is initialised as virtually absent. The H_{nk} are initialised randomly and normalised 'by hand'.

Now, we want to briefly discuss the convergence properties of the EM algorithm. It has been shown (e.g. Redner & Walker 1984) that the EM algorithm is guaranteed to converge to a local maximum of log L under mild conditions. Indeed, it was shown that the EM algorithm is monotonically converging, i.e. each iteration step is guaranteed to increase log L. Therefore, after each iteration step, we check how much log L was increased compared to the previous step. If log L changed by less than a fixed small factor, we consider the EM algorithm to have converged. This convergence criterion was chosen based on systematic tests like those discussed in Sect. 4. Finally, we note that the fit results are not unique, since the ordering of the clusters is purely random.
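The fit routine is compact enough to sketch in full (Python; a minimal sketch, not the authors' implementation — the iteration cap, tolerance and random initialisation are assumptions):

```python
import numpy as np

def fit_bipartite_graph(W, K, n_iter=10_000, tol=1e-9, seed=None):
    """EM fit of the bipartite-graph model of Eqs. (21)-(25).

    W : (N, N) symmetric similarity matrix, normalised as in Eq. (8)
    K : number of clusters
    Returns the cluster likelihoods H, the cluster priors lam,
    and the cluster posteriors of Eq. (11).
    """
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    lam = np.full(K, 1.0 / K)        # no cluster starts virtually absent
    H = rng.random((N, K))
    H /= H.sum(axis=0)               # normalise 'by hand', Eq. (22)

    log_like = -np.inf
    for _ in range(n_iter):
        R = W / ((H * lam) @ H.T)    # ratio W_mn / (H Lambda H^T)_mn
        lam_new = lam * np.einsum('mk,mn,nk->k', H, R, H)   # Eq. (24)
        H_new = H * lam * (R @ H)                           # Eq. (25)
        H_new /= H_new.sum(axis=0)
        H, lam = H_new, lam_new
        new_log_like = np.sum(W * np.log((H * lam) @ H.T))  # Eq. (23)
        if new_log_like - log_like < tol:   # monotonic convergence
            break
        log_like = new_log_like

    B = H * lam                                       # Eq. (20)
    posteriors = B / B.sum(axis=1, keepdims=True)     # Eq. (11)
    return H, lam, posteriors
```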
Figure 4.
Estimating the optimal number of clusters for the data sample shown in Fig. 6. (a) SSR(K) as a function of the number K of clusters. (b) Mean angular changes ⟨Δ(K)⟩ averaged over ten fits.

In this section we demonstrate how to estimate the optimal number of clusters for a given data set, which is a crucial part of any clustering analysis. It is essential to estimate the optimal number of clusters with due caution. This is a problem of assessing nonlinear models, and there are no theoretically justified methods, only heuristic approaches. Common heuristics are the Bayesian information criterion,

  \mathrm{BIC} = -2 \log \mathcal{L} + p \log N ,    (26)

and Akaike's information criterion,

  \mathrm{AIC} = -2 \log \mathcal{L} + 2p ,    (27)

where p and N denote the number of model parameters and the number of data points, respectively. As we have seen in Sect. 3.5, the bipartite-graph model involves K(N + 1) model parameters. Consequently, BIC and AIC are not applicable, since log L is not able to compensate for the large impact of the penalty terms. This inability of log L is likely to originate from the sparse data population in the high-dimensional parameter space. Another tool of model assessment is cross-validation, but this is computationally infeasible in this case.

We now explain how to compare bipartite-graph models of different complexities heuristically, i.e. how to estimate the optimal number of clusters. This heuristic employs the sum of squared residuals,

  \mathrm{SSR}(K) = \sum_{m=1}^{N} \sum_{n=1}^{m} \left( \frac{W_{mn} - \sum_{k=1}^{K} H_{mk} \lambda_k H_{nk}}{W_{mn}} \right)^2 .    (28)

The definition puts equal emphasis on all elements. If we left out the denominator in Eq. (28), the SSR would emphasise deviations of elements with large values, whereas elements with small values would be neglected. However, both large and small values of pairwise similarities are decisive. We estimate the optimal K via the position of a kink in the function SSR(K) (cf. Fig. 4). Such a kink arises if adding a further cluster does not lead to a significant improvement in the similarity-matrix reconstruction.

We demonstrate this procedure in Fig. 4 by using the toy example of Figs. 6 and 7, which is composed of six nicely separable clusters. We fit bipartite-graph models to the similarity matrix shown in Fig. 7, with K ranging from 1 to 15. The resulting SSR values are shown in panel (a) of Fig. 4. In fact, SSR(K) exhibits two prominent kinks, at K = 3 and K = 6, rather than a single one. Obviously, for K = 3, the clustering algorithm groups the four nearby clusters together, thus resulting in three clusters. For K = 6, it is able to resolve this group of 'subclusters'.

We can construct a more quantitative measure by computing the angular change Δ(K) of log SSR(K) at each K,

  \Delta(K) = \arctan\left[\log \mathrm{SSR}(K-1) - \log \mathrm{SSR}(K)\right] - \arctan\left[\log \mathrm{SSR}(K) - \log \mathrm{SSR}(K+1)\right] .    (29)

As K is an integer, log SSR(K) is a polygonal chain and thus an angular change is well defined. A large positive angular change then indicates the presence of a kink in SSR(K). However, we can even do better by fitting the similarity matrix several times for each K and averaging the angular changes. The results of the fits differ slightly, since the model parameters are randomly initialised each time. These mean angular changes are shown in panel (b) of Fig. 4, averaged over 20 fits for each K. First, for large K the mean angular changes are consistent with zero, i.e. in this domain increasing K decreases SSR(K) but does not improve the fit systematically. Second, for K = 3 and K = 6 the mean angular changes deviate significantly from zero.
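A sketch of this heuristic (Python; `fit_bipartite_graph` refers to the EM sketch above, and the logarithm base in Eq. (29) is assumed to be natural):

```python
import numpy as np

def ssr(W, H, lam):
    """Relative sum of squared residuals, Eq. (28), over m >= n."""
    resid = (W - (H * lam) @ H.T) / W
    m, n = np.tril_indices(W.shape[0])
    return np.sum(resid[m, n] ** 2)

def angular_changes(ssr_values):
    """Angular change of log SSR(K), Eq. (29), for K = 2 ... K_max - 1.

    ssr_values[i] is SSR(K = i + 1); a large positive result flags a kink.
    """
    log_ssr = np.log(np.asarray(ssr_values))
    return (np.arctan(log_ssr[:-2] - log_ssr[1:-1])
            - np.arctan(log_ssr[1:-1] - log_ssr[2:]))
```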
For K = 2 and K = 4, the mean angular changes are negative, which corresponds to 'opposite' kinks in the SSR spectrum and is due to K = 3 being a very good grouping of the data. For large K these detections may be less definite due to the flattening of SSR(K). Therefore, we may systematically underestimate the optimal number of clusters. Moreover, this toy example also demonstrates that there may be more than a single advantageous grouping of the data, and there may be disadvantageous groupings. If there are multiple detections of advantageous groupings, it may be difficult to judge which grouping is the best. In the worst case, we may not find any signal of an advantageous grouping, which would either imply that our given sample is composed of objects of the same type or that the data does not contain enough information about the grouping. Unfortunately, this scheme of estimating the optimal number of clusters is extremely inefficient from a computational point of view. This is a severe disadvantage for very large data sets. Moreover, though this heuristic is working well, the significance of the mean angular changes is likely to be strongly influenced by the variance caused by the algorithm's initialisation. (It is not possible to compute the angular change for K = 1, but this case is not a reasonable grouping anyway under the assumption that there are objects of different types in the given data sample.)

As the work of Kelly & McKay (2004, 2005) is very close to our own work, we want to discuss it in some detail and work out the differences. The authors applied a soft clustering analysis to the first data release of SDSS. In Kelly & McKay (2004) they decomposed r-band images of 3,037 galaxies into shapelets, using the IDL shapelet code by Massey & Réfrégier (2005). In Kelly & McKay (2005) they extended this scheme to all five photometric bands u, g, r, i, z of SDSS, thereby also taking into account colour information. Afterwards, they used a principal component analysis (PCA) to reduce the dimensionality of their parameter space. In Kelly & McKay (2004) the reduction was from 91 to 9 dimensions and in Kelly & McKay (2005) from 455 to 2 dimensions. Then they fitted a mixture-of-Gaussians model (Bilmes 1997) to the compressed data, where each Gaussian component represents a cluster. They were able to show that the resulting clusters exhibited a reasonable correlation to the traditional Hubble classes.

Reducing the parameter space with PCA and also using a mixture-of-Gaussians model are both problematic from our point of view. First, PCA relies on the assumption that those directions in parameter space that carry the desired information do also carry a large fraction of the total sample variance. This is neither guaranteed nor can it be tested for in practice. Second, galaxy morphologies are not expected to be normally distributed. Therefore, using a mixture-of-Gaussians model is likely to misestimate the data distribution. Nonetheless, the work by Kelly & McKay (2004, 2005) was a landmark, both concerning their usage of a probabilistic algorithm and conceptually, by applying a clustering analysis to the first data release of SDSS.

In contrast to Kelly & McKay (2004, 2005), we do not reduce the dimensionality of the parameter space and then apply a clustering algorithm to the reduced data. We also do not try to model the data distribution in the parameter space, which would be virtually impossible due to its high dimensionality (curse of dimensionality, cf.
Bellman 1961). Rather, we use a similarity matrix, which has two major advantages: First, we do not rely on any compression technique such as PCA. Second, we cannot make any mistakes by choosing a potentially wrong model for the data distribution, since we model the similarity matrix. There are two sources of potential errors in our method:

1. Estimation of pairwise similarities (Eq. (6)). This is hampered by our lack of knowledge about the metric in the morphological space and it is in some sense similar to mismodelling.
2. Modelling the similarity matrix by a bipartite-graph model. As the only assumption in the derivation of the bipartite-graph model is Eq. (15), this fails if and only if a significant part of the pairwise similarity is not induced by the clusters, but rather by e.g. observational effects. However, any other classification method (automated or not) will have problems in this situation, too.
4. Systematic Tests
In this section we conduct systematic tests using artificial data samples that are specifically designed to investigate the impact of certain effects. First, we demonstrate that hard classification schemes cause problems with subsequent parameter estimation. Furthermore, we investigate the impact of non-optimal similarity measures, two-cluster separation, noise and cluster cardinalities on the clustering results.
We start by describing the artificial data sets that we are going to use. Furthermore, we describe the diagnostics by which we assess the performance of the clustering algorithm. The data sets are always composed of two clusters, where the number of example objects drawn from each cluster may differ. The clusters are always designed as p-variate Gaussian distributions, i.e.

  \mathrm{prob}(\mathbf{x}\,|\,\boldsymbol{\mu}, \Sigma) = \frac{\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \cdot \Sigma^{-1} \cdot (\mathbf{x}-\boldsymbol{\mu})\right]}{\sqrt{(2\pi)^p \det \Sigma}} ,    (30)

where μ and Σ denote the mean vector and the covariance matrix, respectively.

Knowing the true analytic form of the underlying probability distributions, we are able to assess the probabilistic data-to-cluster assignments proposed by the clustering algorithm. For two clusters A and B, the true data-to-cluster assignment of some data point x to cluster k = A, B is given by the cluster posterior

  \mathrm{prob}(k\,|\,\mathbf{x}) = \frac{\mathrm{prob}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \Sigma_k)}{\mathrm{prob}(\mathbf{x}\,|\,\boldsymbol{\mu}_A, \Sigma_A) + \mathrm{prob}(\mathbf{x}\,|\,\boldsymbol{\mu}_B, \Sigma_B)} .    (31)

The numerator prob(x | μ_k, Σ_k) is the cluster likelihood. The cluster priors prob(A) = prob(B) = 1/2 are flat and cancel out. For a given data set of N objects, these true cluster posteriors are compared to the clustering results using the expectation values of the zero-one loss function,

  \langle \mathcal{L}_{01} \rangle = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 0 & \text{if}\;\; \mathrm{prob}_{\mathrm{fit}}(C_n\,|\,\mathbf{x}_n) > \mathrm{prob}_{\mathrm{fit}}(\neg C_n\,|\,\mathbf{x}_n) \\ 1 & \text{else} \end{cases} ,    (32)

and of the squared-error loss function,

  \langle \mathcal{L}_{\mathrm{SE}} \rangle = \frac{1}{N} \sum_{n=1}^{N} \left( \mathrm{prob}_{\mathrm{fit}}(C_n\,|\,\mathbf{x}_n) - \mathrm{prob}_{\mathrm{true}}(C_n\,|\,\mathbf{x}_n) \right)^2 ,    (33)

where C_n denotes the correct cluster label of object x_n and ¬C_n the false label. The zero-one loss function is the misclassification rate, whereas the squared-error loss function is sensitive to misestimations of the cluster posteriors that do not lead to misclassifications. As the two clusters are usually well separated in most of the following tests, the true maximum cluster posteriors are close to 100%. Therefore, misestimation means underestimation of the maximum posteriors, which is quantified by \sqrt{\langle \mathcal{L}_{\mathrm{SE}} \rangle}.
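In code, these two diagnostics read (Python; fitted and true posteriors as assumed inputs):

```python
import numpy as np

def zero_one_loss(post_fit, correct):
    """Misclassification rate, Eq. (32).

    post_fit : (N, K) fitted cluster posteriors
    correct  : (N,) indices of the true cluster labels
    """
    return np.mean(np.argmax(post_fit, axis=1) != correct)

def squared_error_loss(post_fit, post_true, correct):
    """Squared-error loss of the correct-cluster posteriors, Eq. (33)."""
    n = np.arange(len(correct))
    return np.mean((post_fit[n, correct] - post_true[n, correct]) ** 2)
```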
Figure 5.
Break-down of hard classifications in case of overlapping clusters.
Deviation μ̂_A − μ_A of estimated and true means vs. two-cluster separation Δx for class A, for the hard estimator (red line), the soft estimator (blue line), and the predicted bias of the hard estimator for Δx → 0 (dashed line).

In this first test, we want to demonstrate that the hard cuts that are automatically introduced when using hard classification or clustering algorithms can lead to systematic misestimations of parameters, i.e. biases. This is a general comment in order to support our claim that hard data-to-class assignments are generically inappropriate for overlapping classes. We are not yet concerned with our soft-clustering algorithm here. We use two one-dimensional Gaussians with means μ_A and μ_B, variable two-cluster separation Δx = μ_A − μ_B, and constant variance σ² = 1. From each Gaussian cluster we then draw N = 10,000 objects. From the resulting data sample we estimate the means μ̂_k of the two Gaussians and compare with the true means μ_k. The results are averaged over 1,000 realisations of data samples.

Figure 5 shows the deviations of the estimated from the true means when using a hard cut at x = 0 (red line) and a weighted mean (blue line). A hard cut at x = 0, which assigns all data points with x < 0 to one cluster and all data points with x > 0 to the other, leads to the estimator

  \hat{\mu}_k^{\mathrm{hard}} = \frac{1}{N_k} \sum_{n=1}^{N_k} x_{k,n} .    (34)

As Fig. 5 shows, this estimator is strongly biased in case of overlapping clusters (Δx → 0). For a hard cut at x = 0, we can predict this bias analytically from the expectation value

  \langle x \rangle_{A/B} = \mp \frac{2}{\sqrt{2\pi}} \int_0^{\infty} \mathrm{d}x\; x\, e^{-x^2/2} = \mp \sqrt{2/\pi} \approx \mp 0.8    (35)

of a unit Gaussian truncated at x = 0. This bias is shown as the dashed line in Fig. 5, where for Δx = 0 also μ_{A/B} = ∓Δx/2 → 0. If we instead use the soft estimator

  \hat{\mu}_k^{\mathrm{soft}} = \frac{\sum_{n=1}^{N} \mathrm{prob}(k\,|\,x_n)\, x_n}{\sum_{n=1}^{N} \mathrm{prob}(k\,|\,x_n)} ,    (36)

then we will get an unbiased estimate despite the overlap, as is evident from Fig. 5. This comparison demonstrates the break-down of hard algorithms in case of overlapping clusters.
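A compact simulation of this comparison (Python; a sketch under the stated setup, not the authors' exact code):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
dx = 1.0                                   # two-cluster separation
mu_a, mu_b = -dx / 2, dx / 2
x = np.concatenate([rng.normal(mu_a, 1, 10_000),
                    rng.normal(mu_b, 1, 10_000)])

# Hard estimator, Eq. (34): cut at x = 0
mu_a_hard = x[x < 0].mean()

# Soft estimator, Eq. (36): weight by the true posterior, Eq. (31)
post_a = norm.pdf(x, mu_a, 1) / (norm.pdf(x, mu_a, 1) + norm.pdf(x, mu_b, 1))
mu_a_soft = np.sum(post_a * x) / np.sum(post_a)

print(mu_a_hard, mu_a_soft)  # the hard estimate is biased, the soft one is not
```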
Figure 6.
Artificial data sample with six clusters (top) and the matrix of pairwise Euclidean distances (bottom).
Each cluster has an underlying bivariate Gaussian distribution with covariance matrix Σ = diag(1, …).

We now explain how to optimise the similarity measure defined in Eq. (6) and what 'optimal' means. Given the N × N symmetric matrix of pairwise distances d(x_m, x_n), we can tune the similarity measure by adjusting the two parameters α and s. Tuning the similarity measure has to be done with care, since there are two undesired limiting cases: First, for α → ∞, the resulting similarity matrix approaches a constant, i.e. W_{mn} = N^{-2} for all elements, since d(x_m, x_n) ≤ d_max. This case prefers K = 1 clusters, independent of any grouping in the data.
Second, for α → 0, the similarity matrix approaches the step matrix defined by

  S_{mn} \propto \begin{cases} 1 & \text{if}\; m = n \\ s^{-1} & \text{if}\; m \neq n \end{cases} ,    (37)

which is normalised such that Σ_{m,n=1}^N S_{mn} = 1. This case prefers K = N clusters. The optimal similarity measure should be as different as possible from these two worst cases.

Figure 7.
Estimating the optimal similarity measure for the example data of Fig. 6.
Top panel: Modified Manhattan distance C (Eq. (38)) for s = 1.01 (cyan line), s = 1.03 (blue line) and one larger value of s. The other panel shows the resulting similarity matrix (cf. text).

We choose α and s such that the modified Manhattan distance to the constant matrix,

  C = \sum_{m=1}^{N} \sum_{n=1}^{m} \left| W_{mn} - N^{-2} \right| ,    (38)

is large. Figure 7 demonstrates how to tune the similarity measure using the artificial data set from the toy example of Fig. 6. The basis is the N × N symmetric distance matrix shown in the bottom panel of Fig. 6. For three different values of s, the top panel shows C as a function of α. Obviously, C(α) exhibits a maximum and can thus be used to choose α. A value of s very close to 1 keeps the similarity matrix close to constant, whereas a large s downweights off-diagonal terms in W according to Eq. (37). Thus, we also prefer an s that is not too close to 1, and s = 1.03 (blue curve) is the compromise of the three scale parameters shown in Fig. 7. Note that the choice of s is not an optimisation but a heuristic. Although the artificial data set of Fig. 6 and its distance matrix are very special, we experienced that C(α) as shown in Fig. 7 is representative of the general case.

The resulting similarity matrix is shown in the right panel of Fig. 7 and exhibits a block-like structure, since we have ordered the data points in the set. This is just for the sake of visualisation and does not affect the clustering results. We clearly recognise six blocks along the diagonal, because the within-cluster similarities are always larger than the between-cluster similarities. Furthermore, we recognise a large block of four clusters in the bottom right corner that are quite similar to each other, whereas the remaining two clusters are more or less equally dissimilar to all other clusters. Consequently, the similarity matrix indeed represents all the features of the data set shown in Fig. 6.
We now demonstrate, first, that the optimal similarity measure indeed captures the crucial information on the data and, second, what happens if we do not use the optimal similarity measure. We use an artificial data set composed of two one-dimensional Gaussian clusters, both with unit variance and a two-cluster separation of Δx = 3. We sample 100 example objects from each cluster and compute the matrix of pairwise distances using the Euclidean distance measure. This data set and its distance matrix remain unchanged. For a constant parameter s = 2.25, we vary the exponent α in the similarity measure defined by Eq. (6). For each value of α, we fit bipartite-graph models with K = 1, 2 and 3 to the resulting similarity matrix, averaging the results over 15 fits each time. (We need to fit bipartite-graph models using K = 1, 2 and 3 in order to compute the angular change of SSR(K) at K = 2.)

Results of this test are shown in Fig. 8. Panel (a) shows the modified Manhattan distance C to a constant matrix. This curve is very similar to Fig. 7. There is a prominent peak, indicating the optimal similarity measure. If the similarity measure is very non-optimal, then the similarity matrix will be close to a constant or step matrix, i.e. it poorly constrains the bipartite-graph model. In this case, we expect to observe overfitting effects, i.e. low residuals of the reconstruction and results with high variance. The computation times are longer, too, since the nonlinear model parameters can exhibit degeneracies, thereby slowing down the convergence. Counterintuitively, we seek a large value of SSR in this test, since a similarity matrix which captures the information content of the data well is harder to fit. Indeed, the SSR values shown in panel (c) of Fig. 8 are significantly lower for non-optimal α's and peak near the optimal α. As expected, the mean computation times shown in panel (b) are minimal for the optimal similarity measure. Panel (d) shows how the evidence for two clusters evolves. Near the optimal α, the evidence for two clusters also shows a local maximum. The misclassification rate shown in panel (e) is insensitive to α over a broad range, but approaches a rate of 50% rather abruptly for extremely non-optimal similarity measures. The squared-error loss shown in panel (f) is more sensitive to non-optimalities. It exhibits a minimum for the optimal α and grows monotonically for non-optimal values.

The most important conclusion to draw from this test is that our method of choosing α and s for the similarity measure defined in Sect. 3.1 is indeed 'optimal', in the sense that it minimises both the misclassification rate and the squared-error loss. Additionally, we see that using the optimal similarity measure can also reduce computation times by orders of magnitude.

Figure 8.
Impact of non-optimal similarity measures on clustering results. (a) Modified Manhattan distance C (Eq. (38)). (b) Mean computation time per fit, without errorbars. (c) Mean SSR(K) values of the resulting fits for K = 2. (d) Mean angular change of SSR(K) at K = 2. (e) Mean misclassification rate (solid line) and 50% misclassification rate (dashed line). (f) Mean squared-error loss of maximum cluster posteriors.

As we have to expect overlapping clusters in the context of galaxy morphologies, we now investigate the impact of the two-cluster overlap on the clustering results. The data sets used are always composed of 100 example objects drawn from two one-dimensional Gaussian clusters, both with unit variance. The two-cluster separation Δx is varied from 1 to 1000. For each data set, we compute the matrix of pairwise Euclidean distances and then automatically compute the optimal similarity matrix by optimising α using a constant s = 2.25, as described in Sect. 4.3.
To each similarity matrix we fit bipartite-graph models with K = 1, 2 and 3 clusters. Furthermore, we fit a K-means algorithm with K = 1, 2 and 3 to each data set in order to compare the results of both clustering algorithms. For each configuration, the results are averaged over 50 fits.

Results of this test are summarised in Fig. 9. Panel (a) shows the mean evidence for two clusters, based on the angular changes in SSR(K) for the bipartite-graph model and on the within-cluster scatter for the K-means algorithm. For decreasing separation Δx, i.e. increasing overlap, the evidence for two clusters decreases for both algorithms, as is to be expected. (Note that these two curves cannot be compared directly; their agreement for Δx < 20 is a coincidence.) As panel (b) reveals, the misclassification rates for K-means and the bipartite-graph model are both in agreement with the misclassification rate expected in the ideal case (black curve).
For two one-dimensional Gaussians with means ±Δx/2, the theoretical misclassification rate is given by

  \langle \mathcal{L}_{\mathrm{theo}} \rangle = \int_{-\infty}^{0} \mathrm{d}x\; \mathrm{prob}\!\left( x \,\Big|\, \mu = +\frac{\Delta x}{2},\, \sigma^2 = 1 \right) ,    (39)

which measures the overlap of both Gaussians. In the limit Δx = 0, this yields ⟨L_theo⟩ = 1/2. The explanation for the excellent performance of both K-means and the bipartite-graph model is that in this case the clusters have equal cardinalities and are spherical. Nevertheless, the results of the K-means algorithm are biased due to the hard data-to-cluster assignment.
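Equation (39) is simply the Gaussian tail beyond the decision boundary at x = 0; in code (Python):

```python
from scipy.stats import norm

def theoretical_misclassification(dx):
    """Overlap of two unit-variance Gaussians at +/- dx/2, Eq. (39)."""
    return norm.cdf(0.0, loc=dx / 2.0, scale=1.0)

print(theoretical_misclassification(0.0))  # 0.5 in the fully overlapping limit
print(theoretical_misclassification(3.0))  # ~0.067 for the test of Sect. 4.3
```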
Panel (c) of Fig. 9 shows the mean squared-error loss of the bipartite-graph models. First, the general trend is that the squared-error loss increases for decreasing two-cluster separation. This is due to the growing amount of overlap confusing the bipartite-graph model. Second, for Δx ≲ 4, the squared-error loss decreases significantly. This effect can be explained as follows: For very small separations, the overlap is so strong that even the true cluster posteriors are both close to 50%. Therefore, the fitted cluster posteriors scatter around 50%, too, thereby reducing the squared error. Third, the squared error settles at a constant value of ⟨L_SE⟩ ≈ 0.045 at large separations.
In this case, the true maximum cluster posteriors are essentially 100%, so this corresponds to a systematic underestimation of the maximum posteriors of √⟨L_SE⟩ ≈ 21%. (We do not compare with K-means here, since K-means is a hard algorithm and the squared-error loss is no reasonable score function in this case.)

Figure 9.
Impact of two-cluster overlap on clustering results for the K-means algorithm and the bipartite-graph model. (a) Mean angular change of SSR(K) (bipartite-graph model) and within-cluster scatter (K-means) at K = 2. (b) Mean misclassification rates of K-means and bipartite-graph model (see text) compared to the theoretical prediction. All curves coincide. (c) Mean squared-error loss of the bipartite-graph model.

For large two-cluster separation, this bias does not lead to misclassifications, as is evident from panel (b) in Fig. 9. This bias originates from the fact that any two objects have a finite distance and thus a non-vanishing similarity. This test further demonstrates that the bipartite-graph model yields convincing results. This is most evident in the misclassification rate, which is in excellent agreement with the theoretical prediction of the best possible score.

As observational data is subject to noise, we now investigate the response of the clustering results to noise on the similarity matrix. We simulate the noise by adding a second dimension y to the data. The two clusters are bivariate Gaussian distributions, both with σ_x = 1, with a two-cluster separation of Δx = 10, and with Δy = 0. We vary the size of the variance in y-direction over a broad range and fit bipartite-graph and K-means models using K = 1, 2 and 3. The results are averaged over 50 fits for each value of σ_y.

Figure 10.
Impact of noise variance σ_y on clustering results for the K-means algorithm and the bipartite-graph model. (a) Mean angular change of SSR(K) (bipartite-graph model) and within-cluster scatter (K-means) at K = 2. (b) Mean misclassification rate of K-means and bipartite-graph model. (c) Mean squared-error loss of the bipartite-graph model.

Results of this test are shown in Fig. 10. The evidence for two clusters (panel (a)) rapidly degrades for increasing variance, for the bipartite-graph model as well as for the K-means algorithm, as is to be expected. Inspecting the misclassification rate (panel (b)) reveals that both algorithms are insensitive to σ_y until a critical variance is reached where both misclassification rates increase abruptly. For the two algorithms this break-down happens at somewhat different critical values of σ_y, of the order of the two-cluster separation Δx. For growing σ_y, the two clusters become more extended in y-direction, until it becomes favourable to split the data along y = 0 rather than along x = 0. This also explains why the misclassification rate is 50% in this regime. Consequently, the abrupt break-down originates from the setup of this test. Sampling more objects from each cluster might have prevented this effect, but would have increased the computational effort drastically. Moreover, this also demonstrates that isotropic distance measures are problematic. Using e.g. a diffusion distance (e.g. Richards et al. 2009) may solve this problem. The break-down is less abrupt in the mean squared-error loss (panel (c)), since ⟨L_SE⟩ is also sensitive to posterior misestimations that do not lead to misclassifications. We conclude that the bipartite-graph model is fairly insensitive to noise of this kind over a broad range, until the setup of this test breaks down.

Typically, different types of galaxy morphologies have different abundances in a given data sample. For instance, Bamford et al. (2009) observe different type fractions of early-type and spiral galaxies in the Galaxy Zoo project. Therefore, we now investigate how many objects of a certain kind are needed in order to detect them as a cluster. The concept of a number of objects being members of a certain cluster is poorly defined in the context of soft clustering. We generalise this concept by defining the cardinality of a cluster c_k,

  \mathrm{card}(k) = \sum_{n=1}^{N} \mathrm{prob}(c_k\,|\,\mathbf{x}_n) .    (40)

This definition satisfies Σ_{k=1}^K card(k) = N, since the cluster posteriors are normalised. In the case of hard clustering, Eq. (40) reduces to simple number counts, where the cluster posteriors become Kronecker symbols. We use two clusters, both one-dimensional Gaussians with unit variance and a fixed two-cluster separation of Δx = 10. We then vary the number of objects drawn from each cluster such that the resulting data set always contains 200 objects. For each data set, we compute two different similarity matrices: First, we compute the similarity matrix using the optimal α for a constant s = 2.0, according to the recipe given in Sect. 4.3.
This similarity measure is adapted to every data set (adaptive similarity measure). Second, we compute the similarity matrix using s = 2.0 and a fixed value of α, which is
optimal for the data set composed in equal parts of objects from both clusters. This similarity measure is the same for all data sets (constant similarity measure). To each of the two similarity matrices we fit a bipartite-graph model using K = 2 and average the results over 50 fits.

The results are summarised in Fig. 11. Panel (a) shows the dependence of the misclassification rate on the cardinality of cluster A. For the adaptive similarity measure, the bipartite-graph model breaks down if one cluster contributes less than 10% to the data set. For the constant similarity measure, it breaks down if one cluster contributes less than 3%. The same behaviour is evident from the squared-error loss in panel (b). This problem is caused by the larger group in the data set dominating the statistics of the modified Manhattan distance C defined by Eq. (38). This is a failure of the similarity measure, not of the bipartite-graph model. The constant similarity measure stays 'focussed' on the difference between the two clusters, and its break-down at 3% signals the limit down to which clusters are detectable with the bipartite-graph model.

Figure 11.
Impact of cardinalities on clustering results. (a) Mean misclassification rate. (b) Mean squared-error loss. (c) Correlation of estimated and true cluster cardinality.

Panel (c) in Fig. 11 shows the correlation of the measured cluster cardinality to the true cluster cardinality. For the constant similarity measure, both quantities correlate well. In contrast to this, for the adaptive similarity measure the two quantities do not correlate at all. Again, the adaptive similarity measure is dominated by the larger group, i.e. the similarities between the large and the small group are systematically too high. This leads to a systematic underestimation of the maximum cluster posteriors (cf. panel (b)), since for a two-cluster separation of Δx = 10 the true posteriors are essentially 100%, as shown by Fig. 9c. This also affects the cluster cardinalities defined by Eq. (40). If the cluster overlap is stronger, then this bias is likely to lead to misclassifications, too.

We conclude that the optimal similarity measure defined in Sect. 4.3 fails to discover groups that contribute 10% or less to the complete data sample. A different similarity measure may solve this problem, but the optimal similarity measure has the advantage of minimising the misclassification rate and the squared-error loss for the discovered groups.
5. Worked Example with SDSS Galaxies
In this section we present our worked example with SDSS galaxies. First, we describe the sample of galaxies we analyse. Before applying the bipartite-graph model to the whole sample, we apply it to a small subsample of visually classified galaxies to prove that it is working not only for simple simulated data but also for real galaxy morphologies. Again, we emphasise that this is just meant as a demonstration, so parametrisation and sample selection are idealised.
Fukugita et al. (2007) derived a catalogue of 2,253 bright galaxies with Petrosian magnitude in the r band brighter than r_P = 16 from the Third Data Release of the SDSS (Abazajian et al. 2005). We analyse only the g-band imaging of this sample, which is sensitive to HII regions and spiral arm structures. We expect that objects that are bright in r are also bright in the neighbouring g band. Therefore, all these objects have a high signal-to-noise ratio, i.e. the shapelet decomposition can employ a maximum order sufficiently large to reduce possible modelling problems.

Apart from the g-band imaging data, we also retrieved further morphological information from the SDSS database, namely the Petrosian radii r_50 and r_90 containing 50% and 90% of the Petrosian flux, the ratios of isophotal semi-major and semi-minor axes, and the logarithmic likelihoods of the best-fitting de Vaucouleurs and exponential-disk profiles. Given the Petrosian radii r_50 and r_90, we define the concentration index in analogy to Conselice (2003),

  C = 5 \log \left( \frac{r_{90}}{r_{50}} \right) .    (41)

For compact objects, such as elliptical galaxies, this concentration index is large, whereas it is smaller for extended objects with slowly decreasing light profiles, such as disk galaxies.

We then reduce the data sample in three steps: First, we sort out peculiar objects, i.e. objects that are definitely not galaxies, blended objects and objects that were cut in the mosaic. All these objects have no viable galaxy morphologies. This was done by visual inspection of all objects. Second, we decompose all images into shapelets using the same maximum order N_max = 12 (91 expansion coefficients) for all objects. The shapelet code performs several internal data-processing steps, namely estimating the background noise and subtracting the potentially non-zero noise mean, image segmentation and masking of multiple objects, and estimating the object centroid position (cf. Melchior et al. 2007). Third, we sort out objects for which the shapelet reconstruction does not provide reasonable models. This is done by discarding all objects whose best fits have a reduced χ² outside a narrow interval around unity. We check that the morphological information contained in the original data set and the reduced data set does not differ systematically, by comparing the sample distributions of Petrosian radii, axis ratios, concentration indices, and logarithmic likelihoods of best-fitting de Vaucouleurs and exponential-disk profiles. All objects are large compared to the point-spread function (PSF) of SDSS, such that a PSF deconvolution as described in Melchior et al. (2009a) is not necessary. This means we analyse apparent instead of intrinsic morphologies, but both are approximately the same.
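For reference, Eq. (41) in code (Python; the base-10 logarithm is assumed, as in Conselice 2003):

```python
import numpy as np

def concentration_index(r50, r90):
    """Concentration index of Eq. (41).

    Large for compact (elliptical-like) light profiles, small for
    extended (disk-like) ones.
    """
    return 5.0 * np.log10(r90 / r50)
```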
Figure 12. Mean angular changes ⟨∆(K)⟩ of the bipartite-graph model for a data set composed of edge-on disks, face-on disks and ellipticals.

We use a small data set of 84 galaxies, which we visually classified as edge-on disks, face-on disks or ellipticals (28 objects per type). As these 84 galaxies were very large and very bright, we decomposed them anew using a maximum order of N_max = 16, resulting in 153 shapelet coefficients per object. Figure 1 shows one example object and its shapelet reconstruction for each type. This data set exhibits a strong grouping, and we demonstrate that the bipartite-graph model indeed discovers the edge-on disks, face-on disks and ellipticals automatically, without any further assumptions.

The estimation of the number of clusters is shown in Fig. 12. The mean angular changes of SSR(K), averaged over 20 fits, indeed reveal only one significant kink, at K = 3. The lowest value of SSR at K = 3 is SSR ≈ 48, which corresponds to an RMS residual (cf. Eq. (28)) of

\sqrt{\mathrm{SSR} / [N(N+1)/2]} \approx 11.6\%. \quad (42)

The denominator N(N+1)/2 is the number of independent elements in the symmetric similarity matrix.

We conclude from Fig. 12 that the bipartite-graph model indeed favours three clusters. However, we still have to prove that the similarity matrix contains sufficient information on the data and that the bipartite-graph model discovers the correct classes. For K = 3, the cluster posteriors populate a two-dimensional plane, because they are subject to a normalisation constraint. This plane is shown in Fig. 13. Indeed, the distribution of cluster posteriors exhibits an excellent grouping of ellipticals, edge-on disks and face-on disks. The three clusters are well separated, apart from two objects labelled as edge-on disks but assigned to the cluster of ellipticals. A second visual inspection of these two "outliers" revealed that we had initially misclassified them as edge-on disks. The excellent results are particularly impressive if we remember that we analysed 84 data points distributed in a 153-dimensional parameter space. Moreover, it is very encouraging that the soft-clustering analysis did indeed recover the ellipticals, face-on and edge-on disks automatically.

In order to get an impression of how good these results actually are, we compare the cluster posterior plane to results obtained from PCA.

Figure 13. Cluster posterior space of the bipartite-graph model for edge-on disks, face-on disks and ellipticals. The triangle defines the subspace allowed by the normalisation constraint of the posteriors. The corners of the triangle mark the points of 100% posterior probability. The * indicates the point where all three posteriors are equal. Colours encode a-priori classifications unknown to the algorithm.
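The triangle of Fig. 13 is simply the 2-simplex on which normalised three-cluster posteriors live. A minimal sketch of the projection (the corner placement is a plotting convention of ours, not taken from the paper):

```python
import numpy as np

# For K = 3 the posteriors (p1, p2, p3) satisfy p1 + p2 + p3 = 1, so they
# populate a two-dimensional triangle. A barycentric combination of three
# fixed corner points maps them onto the plane of Fig. 13.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def to_triangle(posteriors):
    """Map an (N, 3) array of cluster posteriors to 2-d plot coordinates."""
    return np.asarray(posteriors) @ corners  # barycentric combination

pts = to_triangle([[1, 0, 0], [0, 0, 1], [1/3, 1/3, 1/3]])
print(pts)  # corners for pure memberships; the centre '*' for equal posteriors
```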
Figure 14. Comparing Fig. 13 with results of PCA for edge-on disks, face-on disks and ellipticals. Parameter space spanned by the first two principal components, which together carry 66.6% of the total variance. Colours encode a-priori classifications unknown to the PCA algorithm.

Therefore, we estimate the covariance matrix Σ of the data sample in shapelet-coefficient space and diagonalise it. Only the first 83 eigenvalues of Σ are non-zero, since the 84 data objects poorly constrain the 153 × 153 covariance matrix. The first two principal components carry 66.6% of the total sample variance, and Fig. 14 displays the parameter space spanned by them. Obviously, PCA performs well in reducing the parameter space from 153 dimensions down to two, since the ellipticals, face-on and edge-on disks exhibit a good grouping. (Note that PCA only reduces the parameter space; it does not assign classes to objects.) However, the bipartite-graph model provides much more compact and well-separated groups. This is due to the degeneracies we have broken when we computed the minimal spherical distances as described in Sect. 2. In the case of PCA, these degeneracies are unbroken and introduce additional scatter.
Figure 15. Estimating the number of clusters in the data set of Fukugita et al. (2007).
Mean angular changes ⟨∆(K)⟩ are averaged over 15 fits.

In both Figs. 13 and 14 we notice that the group of ellipticals is significantly more compact than the groups of face-on and edge-on disks. This is caused by three effects: First, as discussed in Sect. 2, our parametrisation of elliptical galaxies is problematic, thereby introducing common artefacts for all objects of this type. These common features are then picked up by the soft-clustering algorithm. Ironically, the problems of the parametrisation help to discriminate the types in this case. Second, we described in Sect. 2 how to make our morphological distance measure invariant against various random quantities, namely image size, image flux, orientation angle and handedness. However, the distance measure, and thereby the similarity measure, is not invariant against the inclination angle w.r.t. the line of sight, which introduces additional scatter into the clustering results. We expect that the impact of this random effect is smaller for ellipticals than for disk galaxies. Third, disk galaxies usually exhibit complex substructures (e.g. spiral arms or star-forming regions), whereas elliptical galaxies do not. Consequently, the intrinsic morphological scatter of disk galaxies is larger than for ellipticals.

We now present the soft-clustering results from analysing all 1,520 bright galaxies from the reduced data set of Fukugita et al. (2007). We have chosen the similarity measure with s = 1.02 and the corresponding optimal α; the curves C(α) are of the same generic form as before. Fit results of the similarity matrix for K ranging from 1 to 20 are shown in Table 2 and Fig. 15. There are significant deviations of the mean angular changes from zero for K = 3 and K = 8. The signal at K = 2 is ignored, since the SSR value is very high (cf. Table 2).

First, we investigate the clustering results for K = 3, where we have SSR ≈ 5,146 (cf. Table 2), corresponding to an RMS residual of ≈ 6.7% (cf. Eq. (42)) for the similarity-matrix reconstruction. In Fig. 16 we show the top five example objects for each of the three clusters, together with a histogram of the distribution of cluster posteriors.
K     minimal SSR    mean angular change (degrees)
1     39,220         –
2     12,313         14.… ± …
3     5,146          22.… ± …
4     4,144          −… ± …
5     3,868          2.… ± …
6     3,155          2.… ± …
7     2,617          −… ± …
8     2,254          5.… ± …
9     2,…            −… ± …
10    …              −… ± …
11    …,790          0.… ± …
12    …,661          1.… ± …
13    …,593          0.… ± …
14    …,532          0.… ± …
15    …,476          0.… ± …
16    …,430          0.… ± …
17    …,405          0.… ± …
18    …,383          0.… ± …
19    …,365          0.… ± …
20    …,348          –

Table 2. Fitting the similarity matrix of 1,520 objects. We present the minimal SSR value out of 15 fits and the mean angular change averaged over 15 fits. (Ellipses mark digits that are not recoverable here.)

Inspecting the example images, we clearly see that the first cluster is obviously composed of face-on disk galaxies, whereas the second cluster contains ellipticals. The third cluster is the cluster of edge-on disk galaxies, or disks with high inclination angles. However, a blended object has been misclassified into this cluster, too. There are still some blended objects left that we failed to remove, since when sorting out blended objects we visually inspected the images in reduced resolution. The cluster posteriors for K = 3 are very informative: First, we notice that objects from cluster 1 typically have very low posteriors in cluster 2 and intermediate posteriors in cluster 3, i.e. face-on disks are more similar to edge-on disks than to ellipticals. Second, objects from cluster 2 have low posteriors in all other clusters. Third, objects from cluster 3 tend to be more similar to objects in cluster 2, i.e. edge-on disks are more similar to ellipticals. This is probably due to their higher light concentration and steep light profiles.

These results demonstrate that the clustering analysis indeed yields reasonable results for realistic data sets. Furthermore, the results for three clusters are very similar to the clustering scheme of Sect. 5.2. However, three clusters are not enough to describe the data faithfully. This is evident from the much larger SSR value for K = 3 compared to K = 8, and from Fig. 17, where we show the resulting cluster posterior space for K = 3. Large parts of the available posterior space remain empty, whereas the central region is crowded. This behaviour is due to the lack of complexity in the bipartite-graph model and strongly suggests that more clusters are necessary.
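The two diagnostics behind Table 2 are cheap to compute: the RMS residual of Eq. (42) and the kink angle of the SSR(K) curve. A minimal sketch (the angular-change function is our reading of ⟨∆(K)⟩ as the change of direction of the piecewise-linear SSR curve, not code from the paper; in practice SSR and K would be rescaled to comparable units first):

```python
import numpy as np

def rms_residual(ssr, n):
    """RMS residual of the similarity-matrix fit, cf. Eq. (42):
    sqrt(SSR / (N(N+1)/2)), the denominator counting the independent
    elements of the symmetric N x N similarity matrix."""
    return np.sqrt(ssr / (n * (n + 1) / 2.0))

def angular_changes(ssr):
    """Change of direction of the piecewise-linear SSR(K) curve at each
    interior K; a pronounced kink flags a candidate number of clusters."""
    slopes = np.diff(np.asarray(ssr, dtype=float))  # slope between consecutive K
    angles = np.degrees(np.arctan(slopes))          # direction angle per segment
    return np.abs(np.diff(angles))                  # kink angle at K = 2, 3, ...

print(rms_residual(5146, 1520))  # ~0.067, i.e. the ~6.7% quoted for K = 3
```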
For K = 8 we have SSR ≈ 2,254 (cf. Table 2), which corresponds to an RMS residual of ≈ 4.4% (cf. Eq. (42)) for the similarity-matrix reconstruction. We show the ten top example objects for each cluster in Fig. 18. First, we notice that the resulting grouping is excellent. However, it is difficult to understand the differences between some clusters. Clusters 1 and 5 obviously contain objects with high ellipticities, e.g. edge-on disks, but what is their difference? Is it the bulge dominance, which is much weaker in cluster 1 than in 5?
Figure 16. Top example objects for K = 3 clusters. Each row corresponds to a cluster. We also show the distribution of its cluster posteriors beneath each object. Cluster 1 seems to contain face-on disks, cluster 2 compact objects, and cluster 3 edge-on disks.
Figure 17. Cluster posterior space for K = 3. Projected cluster posteriors are displayed 10% translucent such that their density becomes visible. See Fig. 13 for an explanation of the topology of this plot.
Do the clusters differ in their radial light profiles? What is the difference between clusters 2 and 7, which are both face-on disks? Of particular interest are clusters 3 and 8, which both seem to contain roundish and compact objects. However, the posterior histograms reveal a highly asymmetric relation: Objects from cluster 3 also prefer cluster 8 above all other clusters. Nevertheless, most of the top examples of cluster 8 have extremely low posteriors in cluster 3, i.e. association with cluster 3 is highly disfavoured. Although we cannot explain this result without further investigation, it is interesting that the algorithm picked up such a distinctive signal.

As we have access to the isophotal axis ratio and the concentration index (cf. Eq. (41)) for all objects, we investigate their distributions for the clusters.
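Selecting "top examples" per cluster, as in Figs. 16 and 18, amounts to ranking objects by their cluster posteriors. A minimal sketch (the ranking rule is our assumption of how such galleries are typically built, not a detail given in the text):

```python
import numpy as np

def top_examples(posteriors, k, n_top=10):
    """Indices of the n_top objects with the highest posterior for cluster k.

    posteriors : (N, K) array of object-to-cluster posteriors.
    Returns indices sorted from the most to the least confident member.
    """
    order = np.argsort(posteriors[:, k])[::-1]  # descending posterior
    return order[:n_top]

# Random posteriors stand in for the real clustering output:
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(8), size=1520)        # 1,520 objects, K = 8 clusters
print(top_examples(p, k=0, n_top=10))
```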
Figure 18. Top example objects for K = 8 clusters. Each row corresponds to a cluster. For each object, we also show the histogram of the distribution of its cluster posteriors beneath it. The objects were aligned in shapelet space, not in real space.
Figure 19. Mean axis ratios and concentration indices for the clusters of Fig. 18. Weighted means were computed from the top 100 example objects of each cluster. We show the contours of 1σ and take into account possible correlations.

Figure 19 shows the mean axis ratios and the mean concentration indices for all eight clusters, averaged over the 100 top examples. The cluster with the highest mean axis ratio is cluster 1, which is the cluster of edge-on disk galaxies. The cluster with the lowest concentration index is cluster 7, which is the cluster of face-on disk galaxies that exhibit extended smooth light profiles. Clusters 3, 4, 5 and 8 have the largest concentration indices. As is evident from Fig. 18, these clusters are indeed composed of rather compact objects that seem to be elliptical galaxies. However, there is no decisive distinction in Fig. 19. This is not necessarily a flaw in the clustering results, but more likely caused by concentration and axis ratio being an insufficient parametrisation scheme (cf. Andrae et al. in prep.).

It seems that the resulting classification scheme is essentially face-on disk, edge-on disk and elliptical. If we increase the number of clusters, we get further diversification that may be caused by bulge dominance or inclination angles. We emphasise again that our primary goal is to demonstrate that our method discovers morphological classes and provides data-to-class assignments that are reasonable.
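The per-cluster summary behind Fig. 19 is a weighted mean and covariance of (axis ratio, concentration index). A minimal sketch, assuming the weights are the cluster posteriors of the top examples (the text does not specify the weighting; the input values are illustrative):

```python
import numpy as np

def weighted_mean_cov(x, w):
    """Weighted mean and covariance of points x (N, 2) with weights w (N,).
    The covariance keeps the cross term, i.e. the possible correlation
    between axis ratio and concentration index (the 1-sigma ellipse)."""
    w = np.asarray(w, dtype=float) / np.sum(w)       # normalise the weights
    mean = w @ x
    d = x - mean
    cov = (w[:, None] * d).T @ d / (1.0 - np.sum(w**2))  # unbiased weighted cov
    return mean, cov

# Illustrative (axis ratio, C) values for one cluster's top examples:
x = np.column_stack([[0.30, 0.35, 0.28], [2.9, 3.1, 3.0]])
mean, cov = weighted_mean_cov(x, w=[0.9, 0.8, 0.7])
print(mean, np.sqrt(np.diag(cov)))  # ellipse centre and its 1-sigma widths
```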
6. Conclusions
We briefly summarise our most important arguments and results:

– Galaxy evolution, the process of observation, and the experience with previous classification attempts strongly suggest a probabilistic ("soft") classification. Hard classifications appear to be generically inappropriate.

– There are two distance-based soft-clustering algorithms that have been applied to galaxy morphologies so far: Gaussian mixture models by Kelly & McKay (2004, 2005) and the bipartite-graph model by Yu et al. (2005) presented in this work. The weak points of the Gaussian mixture model are the dimensionality reduction and its assumption of Gaussianity. The weakness of the bipartite-graph model is the definition of the similarity measure.

– The shapelet formalism, our similarity measure, and the bipartite-graph model produce reasonable clusters and data-to-cluster assignments for real galaxies. The automated discovery of classes corresponding to face-on disks, edge-on disks and elliptical galaxies without any prior assumptions is impressive and demonstrates the great potential of clustering analysis. Moreover, the automatically discovered classes have a qualitatively different meaning compared to pre-defined classes, since they represent the grouping that is preferred by the given data sample itself.

– Random effects such as orientation angle and inclination are a major obstacle, since they introduce additional scatter into a parametrisation of galaxy morphologies.

– For data sets containing N galaxies, the computation times scale as O(N²). Nevertheless, we experienced that a clustering analysis is feasible for data sets containing up to N = 10,000 galaxies without employing supercomputers. We conclude that a clustering analysis on a data set of one million galaxies is possible using supercomputers.

– It is possible to enhance this method by setting up a classifier based on the classes found by the soft-clustering analysis, thereby improving the time complexity from O(N²) to O(N) (a sketch follows at the end of this section).

– The method presented in this paper is not limited to galaxy morphologies. For instance, it could possibly be applied to automated star-galaxy classification or AGN detection.

The bottom line of this paper is that automatic discovery of morphological classes and object-to-class assignments (clustering analysis) does work and is less prejudiced and time-consuming than visual classifications, though the interpretation of the results is still an open issue. Especially when analysing new data samples for the first time, clustering algorithms are more objective than using pre-defined classes and visual classifications. The advantages of such a sophisticated statistical algorithm justify its considerable complexity.
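To illustrate the O(N²) → O(N) idea mentioned above: once clusters have been discovered on a manageable subset, new objects can be assigned by comparison against K cluster prototypes instead of against all N objects. A minimal nearest-prototype sketch (the prototype construction is our illustrative choice, not a scheme specified in the paper; random numbers stand in for shapelet-based feature vectors):

```python
import numpy as np

def fit_prototypes(features, posteriors):
    """Posterior-weighted mean feature vector per cluster: (K, d) prototypes.
    Built once from the O(N^2) soft-clustering run on a subset."""
    w = posteriors / posteriors.sum(axis=0)   # normalise weights per cluster
    return w.T @ features

def classify(features, prototypes):
    """Assign each new object to the nearest prototype: O(N*K), i.e. O(N)."""
    d2 = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d2, axis=1)

rng = np.random.default_rng(2)
X_train = rng.normal(size=(84, 153))             # clustered subset
P = rng.dirichlet(np.ones(3), size=84)           # its soft cluster posteriors
protos = fit_prototypes(X_train, P)
print(classify(rng.normal(size=(5, 153)), protos))  # labels for 5 new objects
```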
Acknowledgements.
RA thanks Coryn Bailer-Jones for his comments that have been particularly useful to improve this work. Furthermore, RA thanks Knud Jahnke and Eric Bell for helpful discussions, especially on the interpretation of the results. RA is funded by a Klaus-Tschira scholarship. PM is supported by the DFG Priority Programme 1177.
References
Abazajian, K., Adelman-McCarthy, J. K., Agüeros, M. A., et al. 2005, AJ, 129, 1755
Andrae, R., Melchior, P., & Jahnke, K., in prep.
Baldry, I. K., Glazebrook, K., Brinkmann, J., et al. 2004, ApJ, 600, 681
Bamford, S. P., Nichol, R. C., Baldry, I. K., et al. 2009, MNRAS, 393, 1324
Bellman, R. 1961, Adaptive Control Processes: A Guided Tour (Princeton University Press)
Bilmes, J. A. 1997, International Computer Science Institute, technical report TR-97-021
Conselice, C. J. 2003, The Relationship between Stellar Light Distributions of Galaxies and Their Formation Histories
Croton, D. J., Springel, V., White, S. D. M., et al. 2006, MNRAS, 365, 11
Fukugita, M., Nakamura, O., Okamura, S., et al. 2007, AJ, 134, 579
Kelly, B. C. & McKay, T. A. 2004, AJ, 127, 625
Kelly, B. C. & McKay, T. A. 2005, AJ, 129, 1287
Massey, R. & Réfrégier, A. 2005, MNRAS, 363, 197