[PDF] Rapid sorting of radio galaxy morphology using Haralick features

Abstract

We demonstrate the use of Haralick features for the automated classification of radio galaxies. The set of thirteen Haralick features represent an extremely compact non-parametric representation of image texture, and are calculated directly from imagery using the Grey Level Co-occurrence Matrix (GLCM). The GLCM is an encoding of the relationship between the intensity of neighbouring pixels in an image. Using 10,000 sources detected in the first data release of the LOFAR Two-metre Sky Survey (LoTSS), we demonstrate that Haralick features are highly efficient, rotationally invariant descriptors of radio galaxy morphology. After calculating Haralick features for LoTSS sources, we employ the fast density-based hierarchical clustering algorithm HDBSCAN to group radio sources into a sequence of morphological classes, illustrating a simple methodology to classify and label new, unseen galaxies in large samples. By adopting a 'soft' clustering approach, we can assign each galaxy a probability of belonging to a given cluster, allowing for more flexibility in the selection of galaxies according to combinations of morphological characteristics and for easily identifying outliers: those objects with a low probability of belonging to any cluster in the Haralick space. Although our demonstration focuses on radio galaxies, Haralick features can be calculated for any image, making this approach also relevant to large optical imaging galaxy surveys.

Full PDF

MMNRAS , 1–9 (2020) Preprint 2 February 2021 Compiled using MNRAS L A TEX style ﬁle v3.0

Rapid sorting of radio galaxy morphology using Haralick features

Kushatha Ntwaetsile ★ and James E. Geach Centre for Astrophysics Research, Department of Physics, Astronomy & Mathematics, University of Hertfordshire, College Lane, Hatﬁeld, AL10 9AB, UK

ABSTRACT

We demonstrate the use of Haralick features for the automated classiﬁcation of radio galaxies. The set of thirteen Haralickfeatures represent an extremely compact non-parametric representation of image texture, and are calculated directly fromimagery using the Grey Level Co-occurrence Matrix (GLCM). The GLCM is an encoding of the relationship between theintensity of neighbouring pixels in an image. Using 10,000 sources detected in the ﬁrst data release of the LOFAR Two-metreSky Survey (LoTSS), we demonstrate that Haralick features are highly eﬃcient, rotationally invariant descriptors of radio galaxymorphology. After calculating Haralick features for LoTSS sources, we employ the fast density-based hierarchical clusteringalgorithm HDBSCAN to group radio sources into a sequence of morphological classes, illustrating a simple methodology toclassify and label new, unseen galaxies in large samples. By adopting a ‘soft’ clustering approach, we can assign each galaxy aprobability of belonging to a given cluster, allowing for more ﬂexibility in the selection of galaxies according to combinations ofmorphological characteristics and for easily identifying outliers: those objects with a low probability of belonging to any clusterin the Haralick space. Although our demonstration focuses on radio galaxies, Haralick features can be calculated for any image,making this approach also relevant to large optical imaging galaxy surveys.

Key words: methods: data analysis – methods: statistical – galaxies: radio galaxies

We are entering the era of ‘big’ observational astronomy, with sur-veys such as the Dark Energy Survey (DES Abbott et al. 2016),Vera Rubin Observatory Legacy Survey of Space and Time (LSSTIvezić et al. 2008),

Euclid (Laureĳs et al. 2010), Australian SquareKilometer Array Pathﬁnder (ASKAP) (DeBoer et al. 2009). In theselarge surveys, simply curating the data volume becomes a signiﬁcantchallenge and interesting scientiﬁc question in its own right. For ex-ample, in the radio, the Square Kilometer Array (SKA Schilizzi et al.2010) will be the largest big data project in the world, producing1000 petabytes of data each day and about 5 zettabytes of data peryear (Farnes et al. 2018). Like all the large surveys of present andfuture, within the vast datasets lie new parameter spaces and newdiscoveries to be made.Machine learning is becoming an important statistical tool forastronomers seeking to eﬃciently analyse and derive meaning frommassive data sets, with ‘unsupervised’ methods in particular showingpromise in various applications, particularly in automatic classiﬁca-tion, including the separation of starbursts from active galactic nuclei(Geach 2012), optical morphology (Hocking et al. 2018; Martin et al.2020), variable stars (Armstrong et al. 2015; Valenzuela & Pichara2017) and light curves (Aguirre et al. 2018). Other key applicationsare in parameter estimation (e.g. redshifts, (Geach 2012)), anomalyand outlier detection (Baron & Poznanski 2016; Pruzhinskaya et al.2019). Although not without limitations, unsupervised techniquesare attractive for astronomical data exploration because they do not ★ E-mail: [email protected] require labelled training sets, unlike, for example, neural network-based classiﬁcation approaches (e.g. Bailer-Jones et al. 1998; Ballet al. 2004). Instead, unsupervised techniques generally rely on thedata itself to identify patterns (e.g. clusters of self-similar objects)in some pre-deﬁned ‘feature space’ that describes the n -dimensionalvolume where the data are deﬁned. At a simple level, this could justrepresent a vector of observations (e.g. photometry across a set ofﬁlters), but often it is desirable to project the data into another space,and that choice is a human one.In astronomy the two main types of data usually encountered arecatalogues (e.g. Bertin & Arnouts 1996), and imaging. Workingwith catalogues is eﬃcient, since they are a signiﬁcantly reducedrepresentation of the original data; for example in many astronomicalimages the majority of the pixels do not contain useful signal (at leastnot at ﬁrst glance). On the other hand, working with the imaging datadirectly is preferred in some cases, since certain information is noteasily captured, or completely missing from catalogues. An exampleof this is in morphological classiﬁcation of galaxies. Although thereare various simple parametric measures of morphology based on thedistribution of pixel brightness (e.g. Gini, 𝑀 (Lotz et al. 2004) ;CAS (Conselice et al. 2013)), unsupervised machine learning oﬀersopportunities to extract more ﬁne detail – and perform automaticclassiﬁcation – directly from imaging data.For example, Uzeirbegovic et al. (2020) use Principal ComponentAnalysis (PCA) to project Hubble Space Telescope thumbnail imagesof high redshift galaxies into a 12-dimensional space where galaxiesare represented as weighted combinations of ‘eigengalaxies’. Theauthors show how the set of weights for each galaxy can be clusteredto produce meaningful morphological classiﬁcations that ultimately © a r X i v : . [ a s t r o - ph . I M ] F e b Ntwaetsile & Geach required no human intervention to partition. In another example,Hocking et al. (2018) and Martin et al. (2020) present an approachthat uses a hierarchical clustering technique to group together smallimage patches into self-similar clusters, representing each imagepatch as a vector containing the Fourier transform of each patch torepresent the spatial frequency distribution of pixel brightness inseveral optical bands. The clustering resulted in a set of visually puremorphological classes, with properties that can be linked directlywith crowd sourced (i.e. human) classiﬁcations from the Galaxy Zooproject (Fortson et al. 2011), illustrating the power and potential ofunsupervised techniques in producing interpretable morphologicalclassiﬁcations directly from imaging data. This is important becauseeven with mega crowd sourcing, it is unlikely that a large cohortof citizen scientists (Bonney et al. 2009) will be able to classifythe billions of objects to be detected by the likes of LSST,

Euclid and SKA. Carbon footprint aside, machines are inexhaustible andconsistent.Classiﬁcation of galaxy morphology has been performed over theyears using a wide variety of techniques of increasing sophistication.Traditionally visual inspection of galaxies in the optical was carriedout using Hubble’s classiﬁcation scheme (Hubble 1926; de Vau-couleurs 1957). For clusters of galaxies, the Bautz–Morgan (BM)system was developed (McHardy & Riley 1974) to classify the mor-phology of clusters according to the diﬀerence in apparent magnitudebetween the brightest and next brightest members of the cluster. In theradio, Fanaroﬀ & Riley (1974) introduced the famous Fanaroﬀ-Rileyclassiﬁcation scheme that divides radio galaxies into two classesbased on the relative brightness distribution of within the radio lobesand jets. Fundamentallty, these early classiﬁcation techniques acrossthe electromagnetic spectrum were based on the relative distributionof pixel (or photographic plate exposure) brightness, with humanbeings ultimately performing the classiﬁcation.As astronomy became increasingly data driven over the twenti-eth century, automated morphological classiﬁcations using machinelearning were introduced. Artiﬁcial neural networks were trained toclassify galaxies based on their morphology from the early 1990s(Storrie-Lombardi et al. 1992; Naim et al. 1995; Lahav et al. 1996).Since then an increasing number of studies have applied machinelearning for the classiﬁcation of radio galaxy morphologies includingthe use of Self Organised Maps (Galvin et al. 2019; Ralph et al. 2019;Mostert et al. 2020) and Convolutional Neural Networks (Aniyan &Thorat 2017; Wu et al. 2018; Ma et al. 2019). More recently, ‘capsulenetworks’ have also been used for the morphological classiﬁcationof radio galaxies (Lukic et al. 2019; Katebi et al. 2019). In thiswork we contribute to the eﬀort of seeking more eﬃcient means ofautomatically classiﬁying and mining large imaging data sets in theradio using ‘Haralick’ features (Haralick et al. 1973), which provide acompact representation of image texture. Although a well-establishedtool in computer vision, Haralick features have not yet been widelyexploited in radio astronomy. (although see Bazell 2000).The paper is organised as follows: an introduction to Haralickfeatures is presented in Section 2, in Section 3 we describe the dataset we use as a demonstration of the technique from the LOFARTwo-meter Sky Survey, and the Hierarchical Density Based SpatialClustering of Applications with Noise (HDBSCAN) algorithm weadopt to perform classiﬁcation. We present the results in Section 4and make our conclusions in Section 5.

Haralick features were introduced by Haralick et al. (1973) as a sta-tistical method of examining image texture. They consider the rela-tionship between the intensity of neighbouring pixels and are usefulin classifying similar regions in an image given the local spatialdistribution of pixel intensities(Ohmshankar & Paul 2014). Haral-ick features are based on the so-called Grey Level Co-occurrenceMatrix (GLCM). To compute the Haralick features, one must ﬁrstcompute the GLCM, which can then be used to evaluate the Haralickcoeﬃcients (Sieler et al. 2010).The GLCM 𝑝 is a square matrix where each dimension corre-sponds to the number of greyscale intensity values in the image. Foran 8-bit depth, this would result in a 256 matrix, so the bit depth isnormally reduced to 3, such that 𝑝 is an 8 × 𝑝 ( 𝑖, 𝑗 ) is calculated as the frequency that a pixel with intensity 𝑖 hasa pixel with intensity 𝑗 at a given spatial oﬀset. This oﬀset is nor-mally the pixel in the same row but adjacent column, but other oﬀsetscan be used. Therefore 𝑝 encodes the relative spatial distribution ofgreyscale levels in an image. Mathematically, for an image 𝐼 of size 𝑀 × 𝑁 pixels: 𝑝 ( 𝑖, 𝑗 ) = 𝑀 ∑︁ 𝑟 = 𝑁 ∑︁ 𝑐 = (cid:40) 𝐼 ( 𝑟, 𝑐 ) = 𝑖 and 𝐼 ( 𝑟 + Δ 𝑥, 𝑐 + Δ 𝑦 ) = 𝑗 Δ 𝑥 and Δ 𝑦 are the desired oﬀset. Typically we have ( Δ 𝑥 = , Δ 𝑦 = 𝑝 for the diﬀerent directional oﬀsetsand averaging, rotational invariance can be encoded. Finally, 𝑝 isnormalised by the total number of comparisons made such that theGLCM represents a probability distribution.With 𝑝 deﬁned, we can deﬁne 14 Haralick features that describe thetextural properties of the image, or rather, an image section (Haralicket al. 1973; Salhi et al. 2018). The features are deﬁned as follows: Angular Second Moment 𝑓 = ∑︁ 𝑖 ∑︁ 𝑗 𝑝 ( 𝑖, 𝑗 ) (1) Contrast 𝑓 = ∑︁ 𝑖 ∑︁ 𝑗 ( 𝑖 − 𝑗 ) 𝑝 ( 𝑖, 𝑗 ) (2) Entropy 𝑓 = − ∑︁ 𝑖 ∑︁ 𝑗 𝑝 ( 𝑖, 𝑗 ) log ( 𝑝 ( 𝑖, 𝑗 )) (3) Correlation 𝑓 = 𝜎 𝑥 𝜎 𝑦 ∑︁ 𝑖 ∑︁ 𝑗 ( 𝑖 𝑗 ) 𝑝 ( 𝑖, 𝑗 ) − 𝜇 𝑥 𝜇 𝑦 (4)where 𝜇 𝑥,𝑦 and 𝜎 𝑥,𝑦 are the means and standard deviations of thecorresponding row or column in 𝑝 : 𝜇 𝑥 = ∑︁ 𝑖 𝑗 𝑖𝑝 ( 𝑖, 𝑗 ) (5) 𝜇 𝑦 = ∑︁ 𝑖 𝑗 𝑗 𝑝 ( 𝑖, 𝑗 ) (6) MNRAS , 1–9 (2020) aralick features 𝜎 𝑥 = √︄∑︁ 𝑖 𝑗 ( 𝑖 − 𝜇 𝑥 ) 𝑝 ( 𝑖, 𝑗 ) (7) 𝜎 𝑦 = √︄∑︁ 𝑖 𝑗 ( 𝑗 − 𝜇 𝑦 ) 𝑝 ( 𝑖, 𝑗 ) (8) Variance 𝑓 = ∑︁ 𝑖 ∑︁ 𝑗 ( 𝑗 − 𝜇 ) 𝑝 ( 𝑖, 𝑗 ) (9) Homogeneity (a.k.a. Inverse Diﬀerence Moment) 𝑓 = ∑︁ 𝑖 ∑︁ 𝑗 𝑝 ( 𝑖, 𝑗 ) + ( 𝑖 − 𝑗 ) (10) Sum Average 𝑓 = 𝑁 ∑︁ 𝑖 = 𝑖𝑝 𝑥 + 𝑦 ( 𝑖 ) (11)where 𝑁 is the number of grey levels and 𝑝 𝑥 + 𝑦 ( 𝑖 ) corresponds to 𝑝 𝑥 + 𝑦 ( 𝑖 ) = ∑︁ 𝑥 + 𝑦 = 𝑖 𝑝 ( 𝑖, 𝑗 ) (12) Sum Entropy 𝑓 = 𝑁 ∑︁ 𝑖 = 𝑝 𝑥 + 𝑦 ( 𝑖 ) log ( 𝑝 𝑥 + 𝑦 ( 𝑖 )) (13) Sum Variance 𝑓 = 𝑁 ∑︁ 𝑖 = ( 𝑖 − 𝑓 ) 𝑝 𝑥 + 𝑦 ( 𝑖 ) (14) Diﬀerence Entropy 𝑓 = − 𝑁 − ∑︁ 𝑖 = 𝑝 𝑥 − 𝑦 ( 𝑖 ) log ( 𝑝 𝑥 − 𝑦 ( 𝑖 )) (15)where 𝑝 𝑥 − 𝑦 ( 𝑖 ) = ∑︁ 𝑥 − 𝑦 = 𝑖 𝑝 ( 𝑖, 𝑗 ) (16) Diﬀerence Variance 𝑓 = 𝑁 − ∑︁ 𝑖 = 𝑖 𝑝 𝑥 − 𝑦 ( 𝑖 ) (17) Information Measure of Correlation 1 𝑓 = HXY − HXY1max{HX, HY} (18)

Information Measure of Correlation 2 𝑓 = √︁ − exp (− ( HXY2 − HXY ) (19) Figure 1.

Condensed tree visualisation indicating the 10 persistent morpho-logical clusters identiﬁed by HDBSCAN in a sample of the 10,000 brightestLoTSS-DR1 galaxies. Reading from top to bottom, the root contains all of thesample, and this branches oﬀ down the hierarchy; only clusters with a numberof points exceeding a certain critical threshold (the minimum cluster size)persist, which is parameterised by the 𝜆 value, which represents the inverseof the mutual reachability distance between points in the Haralick space (SeeSection 3.2 for further details). where HX and HY are entropies of 𝑝 𝑥 and 𝑝 𝑦 :HXY = − ∑︁ 𝑖 ∑︁ 𝑗 𝑝 ( 𝑖, 𝑗 ) log ( 𝑝 ( 𝑖, 𝑗 )) (20)HXY1 = − ∑︁ 𝑖 ∑︁ 𝑗 𝑝 ( 𝑖, 𝑗 ) log ( 𝑝 𝑥 ( 𝑖 ) 𝑝 𝑦 ( 𝑗 )) (21)HXY2 = − ∑︁ 𝑖 ∑︁ 𝑗 𝑝 𝑥 ( 𝑖 ) 𝑝 𝑦 ( 𝑗 ) log ( 𝑝 𝑥 ( 𝑖 ) 𝑝 𝑦 ( 𝑗 )) (22) Maximal Correlation Eﬃcient

With the deﬁnition 𝑄 ( 𝑖, 𝑗 ) = ∑︁ 𝑘 𝑝 ( 𝑖, 𝑘 ) 𝑝 ( 𝑗, 𝑘 ) 𝑝 𝑥 ( 𝑖 ) 𝑝 𝑦 ( 𝑘 ) (23)The maximal correlation coeﬃcient 𝑓 is deﬁned as the square rootof the second largest eigenvalue of 𝑄 . This is generally consideredto be numerically unstable and we only consider the ﬁrst 13 Haralickfeatures. To demonstrate the eﬃcacy of Haralick features in representing radiogalaxy morphology, we use the ﬁrst data release from the LOw Fre-quency ARray (LOFAR) Two-metre Sky Survey (LoTSS), released

MNRAS000

MNRAS000 , 1–9 (2020)

Ntwaetsile & Geach in 2018 (Shimwell et al. 2019, LoTSS-DR1). LoTSS is a surveyof the northern sky over 120–168 MHz wide-area survey deliveringcontinuum imaging at an angular resolution of 𝜃 = (cid:48)(cid:48) . LoTSS-DR1comprises just 2% of the total survey area, with 424 square degrees ofimaging covering the Hobby-Eberly Telescope Dark Energy Exper-iment (HETDEX) ‘Spring Field’ (10 h m –15 h m and 45 ◦ –57 ◦ ).The median depth is 71 𝜇 Jy beam − at 144 MHz, with a 90% com-pleteness for point sources at 0.45 mJy. A total of 325,694 sourcesare catalogued across 58 mosaics covering the HETDEX ﬁeld. Werefer the reader to Shimwell et al. (2019) for a thorough descriptionof LoTSS and the DR1 data products.Using the LoTSS-DR1 catalogue, we extract thumbnail cutoutsaround the position of the top 10,000 sources ranked by total ﬂux.We adopt a window size of 64 ×

64 pixels (96 ×

96 arcseconds)which we found to be appropriate for comfortably encompassing themajority of sources, with the exception of some of the very brightestand extended emission features. Each image is minmax normalised,put on an 8-bit grey scale and the Haralick features computed (notethat we compute the features for all pixel oﬀsets and average theresult). Finally, each source is represented by a 13-element vectorcorresponding to the 13 Haralick features as described in Section 2. Inthe next section we describe an approach to cluster the feature vectorsto provide groupings of galaxies with similar Haralick features.

HDBSCAN is an unsupervised machine learning algorithm that isbased on DBSCAN (Density-based Spatial Clustering of Applica-tions with Noise), ﬁrst introduced by Ester et al. (1996). It is adensity-based clustering algorithm that assumes clusters are charac-terised by ‘islands’ of high density in the sea of the parameter space,such that clusters are regarded as data partitions that have a higherdensity than their surroundings. HDBSCAN (Campello et al. 2013)takes forward the DBSCAN concept by introducing a hierarchy to theclustering, with ‘persistent’ clusters ﬁnally extracted from the hier-archical tree. Following Malzer & Baum (2019), HDBSCAN worksas follows:(i) The data is a set of vectors. In our case, each vector 𝑥 has13-elements representing the Haralick features of each thumbnailimage.(ii) Consider the ‘core’ distance for a point 𝑥 , core 𝑘 ( 𝑥 ) , whichdeﬁnes the distance to the 𝑘 th nearest neighbour, which is an eﬃ-cient measure of density. Low values of core 𝑘 ( 𝑥 ) correspond to highdensity and vice versa.(iii) The ‘mutual reachability distance’ between two points 𝑎 and 𝑏 is deﬁned as 𝑑 mut ( 𝑎, 𝑏 ) = min { core 𝑘 ( 𝑎 ) , core 𝑘 ( 𝑏 ) , 𝑑 ( 𝑎, 𝑏 )} . Here 𝑑 ( 𝑎, 𝑏 ) is the distance between 𝑎 and 𝑏 according to some metric(e.g. Euclidean). The idea here is that the mutual reachability distanceallows points in dense regions stay close together, and those in lessdense regions to be pushed away. A mutual reachability graph can beused to represent the data set, with vertices as the data objects andweighted edges as connections.(iv) The mutual reachability graph is used to construct the mini-mum spanning tree, and sorting its edges by the mutual reachabilitydistance results in a hierarchical tree structure. The hierarchy ofconnected components is deﬁned by sorting the edges of the treeby distance in reverse order, describing a dendogram. This is thestructure from which clusters will be identiﬁed.(v) We wish to extract ‘ﬂat’ clusters from the hierarchy, and this inprinciple achieved by slicing through the dendogram. Unfortunately,it is not clear at what point to make the cut, and if one does choose a single cut point (i.e. a given distance), that would eﬀectively selectclusters with the same density. HDBSCAN allows the extractionof clusters of variable density, eﬀectively cutting the dendogram atdiﬀerent points.(vi) First the cluster tree is condensed into a simpler structure.Considering the single main trunk which contains all of the datapoints, the tree splits into branches. A condensed cluster hierarchy canbe described by considering the number of points kept in each branchas it splits. Here we introduce the key parameter of minimum clustersize. If a given branch splits into two, with one branch containingfewer points than the minimum cluster size means, the larger branch‘persists’ and the smaller split branch ‘falls out’ of the cluster. If abranch splits into two with both branches exceeding the minimumcluster size, both new branches persist, and so-on.(vii) Clusters are extracted on the notion of persistence in thehierarchy. The parameter 𝜆 = 𝑑 − is deﬁned, and each cluster has a 𝜆 birth (the point at which the cluster split oﬀ) and 𝜆 death (the pointwhen the cluster split into other clusters). In each cluster, we have 𝜆 𝑝 describing when each point fell out of the cluster (or was splitoﬀ into a new cluster), 𝜆 birth ≤ 𝜆 𝑝 ≤ 𝜆 death . Cluster stability 𝑆 isdeﬁned as the sum of 𝜆 𝑝 − 𝜆 birth for all points in the cluster. Toextract clusters the following procedure is followed. First, select allleaves as clusters. Then, working through the hierarchy, consider thestability of a parent cluster 𝑆 𝑝 and its 𝑛 descendants 𝑆 , , ,...,𝑛𝑑 . If 𝑆 𝑝 > (cid:205) 𝑛𝑖 = 𝑆 𝑖𝑑 we unselect all the descendants. If 𝑆 𝑝 < (cid:205) 𝑛𝑖 = 𝑆 𝑖𝑑 thenthe cluster stability is set such that 𝑆 𝑝 = (cid:205) 𝑛𝑖 = 𝑆 𝑖𝑑 . At the root nodewe have our set of selected clusters. Any point in the sample thatdoes not fall into one of the selected clusters is deﬁned as ‘noise’.(viii) The selected clusters can be used to label points. Further-more, the deﬁnition of 𝜆 𝑝 within a cluster, when normalised between0 and 1 provides a means of characterising a probability that a givenpoint belongs to the cluster, or alternatively a measure of the strengthof membership.In this work we use the Python implementation of HDBSCAN .As described above, there are some key parameters to set whenapplying the algorithm, and we experimented with their eﬀect on theclustering results. We found that a simple Euclidean distance metricwas eﬀective and computationally eﬃcient, and that other in-builtdistance metrics did not dramatically alter the results. The parameter‘minimum cluster size’ and ‘minimum number of samples’ are themost critical. The former sets the minimum size of a cluster, asdescribed in (v) above. We adopt a minimum cluster size of 64; inexperimentation, lower sizes result in a larger number of clusters withvery similar visual morphologies split over multiple labels and largervalues result in too few clusters to make useful/meaningful classes.Our experimentation is based on visual inspection of the galaxiesin the clusters returned for diﬀerent minimum cluster sizes. Thebehaviour tends to be, as the minimum cluster size is reduced, someclusters are generated that contain a small (by deﬁnition) numberof objects that are visually similar to some larger ‘main’ cluster. Toillustrate an example, if the minimum cluster size is reduced to 16,then we generate 40 clusters in total. The ﬁnal two clusters – bydeﬁnition morphologically similar – contain 196 and 3612 objectsrespectively, such that there is clearly a dominant cluster present.Since adjacent clusters are similar in the Haralick space, they arelikely small ‘islands’ broken oﬀ in the clustering space that nowsatisfy the more lenient minimum cluster size. As we describe in thefollowing section, an adoption of ‘soft clustering’ mitigates the need https://hdbscan.readthedocs.io MNRAS , 1–9 (2020) aralick features Figure 2.

Thumbnails of LoTSS-DR1 sources sorted by similar Haralick features. Each row represents a cluster estimated by HDBSCAN (Section 3), andfor clarity we show the top 16 examples (ranked by membership strength, with the ﬁrst thumbnail in each row representing the most representative member).Each image subtends 96 ×

96 arcseconds and is linearly scaled between the 0.1-99.9th percentiles. The rows are sequential in terms of their extraction fromthe HDBSCAN hierarchical tree structure, and it is clear that the procedure essentially provides a morphological sorting ranging from radio galaxies withpronounced jets and lobes to point sources. Clustering of 10,000 objects took approximately half a second. to deﬁne had cluster membership boundaries (in eﬀect, every objecthas some ﬁnite probability of belonging to every other cluster).In some sense, the total number of clusters required is a matterof taste or practical need. Indeed, one could argue that the desirednumber of labels –i.e. the level of granularity in morphological dis-tinction –is use-case speciﬁc; for example, a limiting case might bethe simple separation of point and extended sources, in which casejust two clusters would suﬃce.The minimum number of samplescontrols how conservative the clustering is, manifested as the frac-tion of the data that is labelled as noise. This is generally set to thesame value as the minimum cluster size.HDBSCAN is incredibly fast: for the 10,000 sample size consid-ered in this demonstration, the clustering time was just 0.5 seconds.Although performance is obviously hardware dependent, HDBSCANoﬀers high performance for the clustering of large samples, makingthe analysis of the very large radio galaxy imaging catalogues ofthe future (e.g. SKA and pathﬁnders) a tractable problem. In Fig-ure 3 we show the scaling of clustering time with sample size for our13-element features.HDBSCAN has already been applied in astronomy. For transientdiscovery, Webb et al. (2020) present an application in transientdiscovery. The authors evaluated 85,553 min-cadenced light curvesfrom the Deeper, Wider, Faster program (Andreoni & Cooke 2017).They were able to isolate anomalous sources for further analysis

Figure 3.

Clustering time as a function of sample size for the 13-elementHaralick feature vectors. We compare the timing (in milliseconds) for clus-tering that includes the calculation of the minimum spanning tree (MST) andprediction data. The prediction data allow one to use the trained model topredict the labels of new, unseen feature vectors. MNRAS000

Ntwaetsile & Geach

Figure 4.

As Figure 2, but showing a larger sample of objects within cluster 1 (left) and cluster 10 (right). The images are organised row-wise in order ofmembership probability. The nature of the extraction of ﬂat clusters from the condensed tree structure produced by HDBSCAN means that the sequence ofcluster labels (in this case 1–10) forms a natural morphological sort, from highly extended and complex radio morphologies in cluster 1 to compact point sourcesin cluster 10. including the discovery of seven uncatalogued variables and twostellar ﬂare events, including a rarely observed ultrafast ﬂare. Lo-gan & Fotopoulou (2020) used HDBSCAN to identify clusters ofstars, galaxies, and quasars in a multidimensional space deﬁned byphotometry. Using a set of 50,000 spectroscopically labelled objectsfrom the Sloan Digital Sky Survey (Stoughton et al. 2002) the au-thors found that careful attribute selection is a vital part of accurateclassiﬁcation with hdbscan. They optimized the hyperparameters andinput attributes of three separate hdbscan runs, each to select a par-ticular object class and, thus, treat the output of each separate run asa binary classiﬁer. They then consolidated the output to give the ﬁnalclassiﬁcation. They used F1 scores to measure the performance oftheir classiﬁer and they obtained F1 scores of 98.9, 98.9, and 93.13respectively. Finally, Jayasinghe et al. (2019) presented the MilkyWay Project second data release (DR2) and an updated data reduc-tion pipeline. The authors aggregated about 3 million classiﬁcationsfrom volunteers to produce the DR2 catalogue, which contained2600 infrared ‘bubbles’ and nearly 600 candidate bow shock-drivingstars. HDBSCAN was used to create the bubble catalogue, identify-ing clusters within a set of one million classiﬁcations in less than 5minutes on a standard desktop computer, again illustrating the speedperformance of HDBSCAN when dealing with very large datasets.

Since most of the morphological diversity is in the brightest radiosources (e.g. those with extended jets and lobes), in our demonstrationwe take the top 10,000 sources ranked by total ﬂux in the LoTSS-DR1 catalogue. We run HDBSCAN with a minimum cluster size of64 which results in 10 clusters containing 33% of the sample. Figure1 shows the condensed tree visualisation that shows the branchingof the sample from the main root. Examples of the representative morphologies in each cluster are shown in Figure 2. Interestingly,the ordering of the cluster labels 1–10 form a progression: cluster 1contains the most extended sources with prominent jets and lobes.As one examines successive clusters it is clear that the morphologytransitions, encompassing more compact structures including closepairs of similar brightness (e.g. hotspots), pairs of sources dominatedby one brighter component, and ﬁnally to isolated point sources. Toemphasise this, in Figure 4 we show a larger sample of the top 256most likely members of cluster 1 and cluster 10.One drawback of HDBSCAN, at least in this application, is thelarge fraction of sources that are labelled as ‘noise’ – i.e. sources notassigned to any cluster. This is an outcome of the conservative natureof HDBSCAN and the ‘hard edge’ nature of the clusters – pointsclose to the edge but just falling out of the selection are classiﬁed asnoise. We could have achieved a lower noise fraction by reducing theminimum cluster size, such that a larger number of clusters wouldhave persisted, however this has the drawback that the cluster labelsbecome less useful. An alternative approach, which we take here, is touse the concept of ‘soft’ clustering to assign every source to its mostlikely cluster. Soft clustering avoids the hard boundaries imposed bytraditional clustering algorithms, and instead takes a probabalisticview: each point has a probability of belonging to a given cluster.In HDBSCAN, soft clustering is achieved via a modiﬁcation of theoutlier score, which is based on the Global-Local Outlier Score fromHierarchies (GLOSH) algorithm (Campello et al. 2015), and com-bines this with a measure of distance from a given cluster to producean estimate of the probability that any given point belongs to anyof the ﬁxed clusters extracted from the condensed tree. We can thensimply assign cluster labels for every point by taking the most likelycluster it belongs to, and extract samples of galaxies for each labelusing probability thresholds. This fuzzy clustering approach is attrac-tive because it recognises the fact that real galaxies may have sharedcharacteristics (e.g. jet plus point source versus jet without point

MNRAS , 1–9 (2020) aralick features Figure 5.

As Figure 2, but showing examples of galaxies assigned to clusters via soft clustering that HDBSCAN originally failed to classify (i.e. ‘noise’). Incluster 1 we can see that the sources are typically large and diﬀuse systems comparable to the size of the thumbnail. The general appearance of the sources ineach row follows the trend seen in Figure 2, but it is clear that the sample contains ‘sources’ contaminated by imaging artefacts and background features (e.g.ripples) that might help explain the reason they were originally classiﬁed as ‘noise’ in the hard clustering. source). In eﬀect, it is the distribution of membership probabilitiesfor a given source that describes the source morphology. Parameteris-ing morphology this way allows more complex morphological-basedselections to be made that combine the full ‘spectrum’ of morpho-logical types.In Figure 5 we duplicate Figure 2, but this time show examples ofsources originally classiﬁed as noise by HDBSCAN. Soft clusteringallows us to assign the most likely cluster membership, but this vi-sualisation helps us understand the nature of these ‘noisy’ galaxies.In cluster 1 (top row) we can see that the sources are all rather largeand diﬀuse radio galaxies; real sources but rare in the catalogue andtherefore likely fall foul of the minimum cluster size to be designatedan independent cluster. In the sequence of rows corresponding eachcluster label we can see the same trend of ‘extended/lobed’ systemsthrough to ‘compact/point-like’ systems as in Figure 2, indicatingthat the soft clustering assignment still respects the original classi-ﬁcation scheme. However, thumbnails often exhibit contaminationfrom artifacts (e.g. from the interferometric imaging, such as ripples,striations and halos due to nearby bright sources), or the presenceof low signal-to-noise features such as diﬀuse emission. In thesecases we would expect the Haralick features to encode the imagetextures associated with these properties. The soft clustering assign-ment, along with the book-keeping regarding the sources’ originalclassiﬁcation allows us to easily identify such outliers.How do the diﬀerent clusters compare to the basic observational properties of the members? We consider cluster members in each ofthe 10 clusters with high >

90% membership probabilities and com-pare the distributions of total ﬂux, major axis and ratio of major tominor axis as described by Shimwell et al. (2019). Figure 6 com-pares the distributions. The distribution of ﬂux within each cluster isremarkably consistent, although there is some variation in the brighttails of each cluster’s distribution. As expected, there is a strongercorrelation with the shape measurements (although notably broadand overlapping distributions in each case). For example, there is aprogression with the numerical cluster label (which, recall, is de-pendent on the point in the tree where the cluster was extracted, seeFigure 1). For example, the most extended sources are in cluster 1,with the most compact in the ﬁnal cluster 10. It is clear that it wouldbe challenging to reproduce the same morphological groupings sim-ply by using the basic catalogue measurements of ﬂux density, majorand minor axis. Finally, we note that, since clusters are extractedin a sequential fashion from the tree, and that the characteristics ofsources in cluster 10 are ‘far away’ from those in cluster 1 (in Haral-ick space) means that HDBSCAN has produced a meaningful sortingof radio galaxies by morphology.The Haralick features are trivial to compute direct from imagery The fraction of sources with >

90% membership probability across allclusters is 10%. At >

50% membership probability the fraction is 40–50%.MNRAS000

50% membership probability the fraction is 40–50%.MNRAS000 , 1–9 (2020)

Ntwaetsile & Geach

Figure 6.

A comparison of the distribution of (top) total 144 MHz ﬂux,(middle) major axis and (bottom) ratio of major to minor axis for clustermembers with a probability of membership exceeding 90%. The hard cutoﬀs in the ﬂux distributions is a result of the original ﬂux-based selection.Horizontal lines indicate the median values. via the GLCM, and – as demonstrated here – essentially provide acompact description of source morphology. The Haralick feature val-ues could be trivially included in source catalogues at relatively littlememory expense. Care must be taken however: since the features arecalculated from pixel values, they cannot be easily cross-comparedbetween surveys (for example), since they also encode informationabout the noise properties of the image. That eﬀect is not examinedin detail here since we have focused on very high signal-to-noisesources for the present demonstration.

With a focus on radio galaxies, we have shown how Haralick featurescan be used as simple, compact and non-parametric descriptors of source morphology. Haralick features are an established tool in com-puter vision, and simply encapsulate local greyscale image texturewith 13 numbers evaluated from the Grey Level Co-occurance Ma-trix. Since morphological variation is in essence due to diﬀerences inthe spatial distribution of pixel intensities in imaging, our hypothesiswas that Haralick features could be used to represent morphology.Using sources from the LOFAR Two-metre Sky Survey Shimwellet al. (2019), we calculate Haralick features within 96 (cid:48)(cid:48) apertures forthe 10,000 brightest catalogued sources, spanning extended systemswith prominent lobes and jets, to compact point sources. We ﬁnd dis-tinct morphological clusters within the ‘Haralick space’ using the fastHDBSCAN clustering algorithm, which distinguishes ten main mor-phological classes. Rather than ﬁxing cluster assignment, we adoptsoft clustering which, rather than giving points a speciﬁc cluster la-bel, provides a probability of cluster membership. This allows for farmore ﬂexibility in morphological selection, given that often galaxieswill share morphological characteristics across multiple labels. Asclusters are extracted from a hierarchical structure, and the distancebetween clusters in the hierarchy depends on morphology (as en-coded by Haralick features) the sequence of cluster labels providedby HDBSCAN are sorted in a meaningful way. In our example, clus-ter 1 contains the most extended sources with prominent lobes andjets, and the ﬁnal cluster 10 contains isolated and compact sources.The probabalistic nature of the cluster assignments also allow oneto identify morphological outliers, making the detection of unusualgalaxies (or artifacts) trivial. HDBSCAN oﬀers another advantage:it is an incredibly fast clustering algorithm. In our demonstration,the clustering and classiﬁcation of 10,000 sources took just half asecond. Here we focus on radio morphologies, but of course Haralickfeatures could be calculated for any set of imaging data (e.g. optical),and can be extended to more than two dimensions, allowing one toencode spectral/colour information as part of the classiﬁcation.The exploration and curation of the giant imaging datasets of thefuture (e.g. SKA in the radio, LSST in the optical) requires eﬃcientmethodologies, both in terms of computational expense and storageconsiderations. We argue that Haralick features are a valuable ad-dition to non-parametric methods of morphological classiﬁcation ofgalaxies, with hierarchical soft clustering of the Haralick features of agiven sample providing a convenient means of sorting and classifyinggalaxies in the morphological parameter space.

ACKNOWLEDGEMENTS

KN is supported by the Development in Africa with Radio Astron-omy (DARA-BIG DATA) scheme which is a Newton Fund projectdelivered through the UK Science and Technology Facilities Council(STFC) (ST/R001898/1). JEG acknowledges support from the RoyalSociety (URF/R/180014).

DATA AVAILABILITY

The data and code used to produce the results presented inthis paper can be found on Github at https://github.com/KushathaNtwaetsile/HaralickFeatures . REFERENCES

Abbott T., et al., 2016, Monthly Notices of the Royal Astronomical Society,460, 1270MNRAS , 1–9 (2020) aralick features Aguirre C., Pichara K., Becker I., 2018, Monthly Notices of the Royal Astro-nomical Society, 482, 5078Andreoni I., Cooke J., 2017, Proceedings of the International AstronomicalUnion, 14, 135–138Aniyan A. K., Thorat K., 2017, The Astrophysical Journal Supplement Series,230, 20Armstrong D. J., et al., 2015, Monthly Notices of the Royal AstronomicalSociety, 456, 2260Bailer-Jones C. A. L., Irwin M., Von Hippel T., 1998, Monthly Notices of theRoyal Astronomical Society, 298, 361Ball N. M., Loveday J., Fukugita M., Nakamura O., Okamura S., BrinkmannJ., Brunner R. J., 2004, Monthly Notices of the Royal AstronomicalSociety, 348, 1038Baron D., Poznanski D., 2016, Monthly Notices of the Royal AstronomicalSociety, 465, 4530Bazell D., 2000, Monthly Notices of the Royal Astronomical Society, 316,519Bertin E., Arnouts S., 1996, A&AS, 117, 393Bonney R., Cooper C. B., Dickinson J., Kelling S., Phillips T., RosenbergK. V., Shirk J., 2009, BioScience, 59, 977Campello R. J. G. B., Moulavi D., Sander J., 2013, in Pei J., Tseng V. S., CaoL., Motoda H., Xu G., eds, Advances in Knowledge Discovery and DataMining. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 160–172Conselice C. J., Mortlock A., Bluck A. F. L., Grützbauch R., Duncan K.,2013, MNRAS, 430, 1051DeBoer D., et al., 2009, Proceedings of the IEEE, 97, 1507Ester M., Kriegel H.-P., Sander J., Xu X., 1996, KDD-96 ProceedingsFanaroﬀ B. L., Riley J. M., 1974, Monthly Notices of the Royal AstronomicalSociety, 167, 31PFarnes J., Mort B., Dulwich F., Salvin S., Armour W., 2018, Galaxies, 6Fortson L., et al., 2011, Advances in Machine Learning and Data Mining forAstronomyGalvin T. J., et al., 2019, Publications of the Astronomical Society of thePaciﬁc, 131, 108009Geach J. E., 2012, Monthly Notices of the Royal Astronomical Society, 419,2633Haralick R., Shanmugan K., Dinstein I., 1973, IEEE Transactions on Systems,MAN and Cybernetics, 6Hocking A., Geach J. E., Sun Y., Davey N., 2018, Monthly Notices of theRoyal Astronomical Society, 473, 1108Hubble E. P., 1926, ApJ, 64, 321Ivezić v., et al., 2008, LSST: from science drivers to reference design andanticipated data products ( arXiv:0805.2366v4 )Jayasinghe T., et al., 2019, Monthly Notices of the Royal Astronomical Soci-ety, 488, 1141Katebi R., Zhou Y., Chornock R., Bunescu R., 2019, Monthly Notices of theRoyal Astronomical Society, 486, 1539Lahav O., Nairn A., Sodré L. J., Storrie-Lombardi M. C., 1996, MonthlyNotices of the Royal Astronomical Society, 283, 207Laureĳs R. J., Duvet L., Escudero Sanz I., Gondoin P., Lumb D. H., Ooster-broek T., Saavedra Criado G., 2010, in Oschmann Jacobus M. J., ClampinM. C., MacEwen H. A., eds, Society of Photo-Optical InstrumentationEngineers (SPIE) Conference Series Vol. 7731, Space Telescopes and In-strumentation 2010: Optical, Infrared, and Millimeter Wave. p. 77311H,doi:10.1117/12.857123Logan C., Fotopoulou S., 2020, Astronomy and AstrophysicsLotz J. M., Primack J., Madau P., 2004, AJ, 128, 163Lukic V., Brüggen M., Mingo B., Croston J. H., Kasieczka G., Best P. N.,2019, Monthly Notices of the Royal Astronomical Society, 487, 1729Ma Z., et al., 2019, The Astrophysical Journal Supplement Series, 240, 34Malzer C., Baum M., 2019, arXiv e-prints, p. arXiv:1911.02282Martin G., Kaviraj S., Hocking A., Read S. C., Geach J. E., 2020, MonthlyNotices of the Royal Astronomical Society, 491, 1408McHardy I. M., Riley J. M., 1974, Monthly Notices of the Royal AstronomicalSociety, 169, 527Mostert R. I. J., et al., 2020, Unveiling the rarest morphologies of the LOFARTwo-metre Sky Survey radio source population with self-organised maps( arXiv:2011.06001 ) Naim A., Lahav O., Sodré L. J., Storrie-Lombardi M. C., 1995, MonthlyNotices of the Royal Astronomical Society, 275, 567Ohmshankar S., Paul C., 2014, 2nd International Conference on CurrentTrends in Engineering and Technology, ICCTET 2014, pp 409–413Pruzhinskaya M. V., Malanchev K. L., Kornilov M. V., Ishida E. E. O.,Mondon F., Volnova A. A., Korolev V. S., 2019, Monthly Notices of theRoyal Astronomical Society, 289, 3591Ralph N. O., et al., 2019, Publications of the Astronomical Society of thePaciﬁc, 131, 108011Salhi K., Jaara E., Alaoui M., Alaoui Y., 2018, in 2018 International Con-ference on Electronics, Control, Optimization and Computer Science(ICECOCS). pp 1–4Schilizzi R., Dewdney P., Lazio T., 2010. pp 773318–773318,doi:10.1117/12.856344Shimwell T. W., et al., 2019, Astronomy & Astrophysics, 622, A1Sieler L., Tanougast C., Bouridane A., 2010, Microprocessors and Microsys-tems, 34Storrie-Lombardi M. C., Lahav O., Sodré L. J., Storrie-Lombardi L. J., 1992,Monthly Notices of the Royal Astronomical Society, 259, 8PStoughton C., et al., 2002, The Astronomical Journal, v.123, 485-548 (2002)Uzeirbegovic E., Geach J. E., Kaviraj S., 2020, Monthly Notices of the RoyalAstronomical Society, 498, 4021–4032Valenzuela L., Pichara K., 2017, Monthly Notices of the Royal AstronomicalSociety, 474, 3259Webb S., et al., 2020, Monthly Notices of the Royal Astronomical Society,498, 3077Wu C., et al., 2018, Monthly Notices of the Royal Astronomical Society, 482,1211de Vaucouleurs G., 1957, Monthly Notices of the Royal Astronomical Society,117, 225This paper has been typeset from a TEX/L A TEX ﬁle prepared by the author.MNRAS000