Automatic identification of outliers in Hubble Space Telescope galaxy images

Abstract

Rare extragalactic objects can carry substantial information about the past, present, and future universe. Given the size of astronomical databases in the information era it can be assumed that very many outlier galaxies are included in existing and future astronomical databases. However, manual search for these objects is impractical due to the required labor, and therefore the ability to detect such objects largely depends on computer algorithms. This paper describes an unsupervised machine learning algorithm for automatic detection of outlier galaxy images, and its application to several Hubble Space Telescope fields. The algorithm does not require training, and therefore is not dependent on the preparation of clean training sets. The application of the algorithm to a large collection of galaxies detected a variety of outlier galaxy images. The algorithm is not perfect in the sense that not all objects detected by the algorithm are indeed considered outliers, but it reduces the dataset by two orders of magnitude to allow practical manual identification. The catalogue contains 147 objects that would be very difficult to identify without using automation.


MNRAS 000, 1–11 (2020). Preprint 8 January 2021. Compiled using MNRAS LaTeX style file v3.0


Lior Shamir, ★ Kansas State University, Manhattan, KS 65506, USA

Accepted XXX. Received YYY; in original form ZZZ


Key words: catalogues – galaxies: peculiar – methods: data analysis

★ E-mail: [email protected]

While most galaxies can be classified into known morphological types, some galaxies do not fit any of these common morphologies, and are considered “peculiar”. The “peculiarity” of a galaxy is normally determined by its visual appearance, and the classification of a galaxy as peculiar is not strictly defined (Nairn & Lahav 1997). However, these galaxies can carry important information about galaxy evolution (Gillman et al. 2020), and are therefore of scientific importance (Bettoni et al. 2001; Casasola et al. 2004; Abraham & van den Bergh 2001).

One of the first notable attempts to profile peculiar galaxies was the Atlas of Peculiar Galaxies (Arp 1966; Arp & Madore 1975), which was prepared manually. Other notable efforts to prepare catalogues of peculiar galaxies include the catalogue of collisional ring galaxies (Madore et al. 2009). Digital sky surveys provided very large datasets of galaxies, making the identification of peculiar galaxies more efficient. For instance, Kaviraj (2010) used a set of 70 early-type peculiar systems in Sloan Digital Sky Survey (SDSS) stripe 82. Another example is the catalogue of Nair & Abraham (2010), providing information about the morphology of ∼ . · galaxies. During the preparation of the catalogue, numerous peculiar galaxies were identified. Taylor et al. (2005) compiled a collection of 142 galaxies that included spiral, irregular, and interacting galaxies by using the Vatican Advanced Technology Telescope. But because the analysis was performed manually it was limited by the number of galaxies that were analyzed (Nair & Abraham 2010).

Because manual analysis is naturally slow, it cannot handle very large databases of galaxies, or requires very substantial efforts. For instance, the catalogue of Arp (1966) took about 14 years to complete.
In an attempt to increase the throughput of the detection of peculiar galaxies, crowdsourcing was used by allowing volunteers to annotate galaxies, leading to the identification of “Hanny’s Voorwerp” (Lintott et al. 2009). That approach also led to the identification of a high number of ring galaxies (Finkelman et al. 2012; Buta 2017).

The Hubble Space Telescope (HST) provides deeper and much more detailed images of galaxies, including objects that cannot be analyzed morphologically by Earth-based sky surveys such as SDSS and the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS). Therefore, HST allows the identification of peculiar galaxies at much higher redshifts compared to Earth-based surveys. Although HST surveys are smaller than Earth-based surveys such as SDSS, surveys such as the Cosmic Evolution Survey (COSMOS) still contain more than 2 · galaxies (Scoville et al. 2007).

While current sky surveys such as SDSS, Pan-STARRS, and the Dark Energy Survey (DES) are already far too large to allow comprehensive manual analysis, future digital sky surveys such as the Vera Rubin Observatory will acquire even more data and a far higher number of celestial objects. To allow using these data effectively, methods based on computer analysis of galaxy images have been proposed. These include model-driven methods such as GALFIT (Peng et al. 2002), GIM2D (Simard 1999), CAS (Conselice 2003), Gini (Abraham et al. 2003), Ganalyzer (Shamir 2011), and SpArcFiRe (Davis & Hayes 2014), and methods based on machine learning (Shamir 2009; Huertas-Company et al. 2009; Banerji et al. 2010; Kuminski et al. 2014; Dieleman et al. 2015; Graham 2019; Mittal et al. 2019; Hosny et al. 2020; Cecotti 2020; Cheng et al. 2020). The application of these methods led to catalogues (Huertas-Company et al. 2015a,b; Shamir & Wallin 2014; Kuminski & Shamir 2016; Goddard & Shamir 2020).
Machine learning algorithms were also used to identify unusual galaxies, such as galaxy mergers (Margalef-Bentabol et al. 2020), showing that galaxy mergers can be identified automatically even when training a machine learning system with just regular isolated galaxies.

Model-driven approaches were used in the past to detect specific types of galaxies such as ring galaxies (Timmis & Shamir 2017; Shamir 2020) or gravitational lenses (Jacobs et al. 2019). Comparing these algorithms to datasets prepared manually showed that computers were not able to achieve the same level of completeness as manual detection, but can compensate for that weakness by their ability to scan much larger datasets (Shamir 2020). The main weakness of model-driven algorithms is that they can be developed only when the morphology of interest is known, and therefore cannot detect unknown objects of types that have not been observed before.

Machine learning is often applied by training a system from the data rather than tailoring a specific algorithm. The majority of machine learning methods proposed for galaxy image analysis are based on supervised machine learning, in which a machine learning system is trained with annotated “ground truth” to classify new unseen samples. Such supervised machine learning systems might not be effective for detecting outlier galaxy images that have not been seen before, as no samples are available to train such systems. To be able to identify peculiar galaxies automatically, a machine learning system needs to be able to identify forms of galaxies that are not present in the dataset with which the system was trained. Therefore, for the identification of such outlier galaxies, unsupervised machine learning is required. Additionally, it needs to be able to filter false positives effectively, as due to the large number of objects even a small false positive rate would lead to a very high number of false positives, making such a system impractical.

Non-parametric approaches such as deep convolutional neural networks (DCNNs) have been adjusted to the task of outlier image detection. One of the common approaches to outlier image detection using deep neural networks is by using auto-encoders, such that outliers can be detected by the reconstruction loss (Amarbayasgalan et al. 2018; Chen et al. 2018); auto-encoders were also applied to outlier galaxy detection (Margapuri et al. 2020). Deep neural networks have shown promising performance for the task of identifying merging systems in datasets of isolated galaxies (Margalef-Bentabol et al. 2020). Since in a universe of isolated galaxies a merging system would be considered an outlier, the performance of the algorithm is an indication of the ability to detect outlier galaxies.

While deep neural networks provide promising performance in detecting outlier galaxies, they also require large clean training sets, and their “black box” nature makes it more difficult to identify the specific elements that lead certain galaxies to be marked as outliers. The purpose of this work is to use algorithms that do not require labeling, so that galaxies of types that are not known can also be detected. To perform unsupervised machine learning of galaxy images, each galaxy image is converted into a comprehensive set of numerical image content descriptors that reflect the visual content of the image. That is, each image is represented by a vector of numbers that correspond to the visual content. The set of numerical image content descriptors (Shamir et al. 2008) has shown efficacy in the analysis of galaxy images (Shamir 2009; Kuminski et al. 2014; Kuminski & Shamir 2016), including certain tasks in unsupervised analysis of galaxy images (Shamir 2012; Shamir et al. 2013; Schutter & Shamir 2015).

In summary, the set of numerical image content descriptors includes edge statistics, the Radon transform (Lim 1990), texture descriptors such as Tamura textures (Tamura et al. 1978), Haralick textures (Haralick et al.
1973), and Gabor textures (Fogel & Sagi 1989), the distribution of pixel intensities, multi-scale histograms (Hadjidemetriou et al. 2001), Zernike polynomials (Teague 1980), the Gini coefficient (Abraham et al. 2003), image entropy, Chebyshev statistics, and box-counting fractals (Wu et al. 1992). These numerical image content descriptors are described in detail in (Shamir et al. 2008, 2010, 2013; Schutter & Shamir 2015; Shamir 2016).

To obtain more information from each galaxy image, the numerical image content descriptors are extracted from the raw pixels, but also from several image transforms. These include the Fourier transform, Chebyshev transform, Wavelet (symlet 5, level 1) transform, and combinations of these transforms (Shamir et al. 2008, 2010). The source code of the method is open and publicly available (Shamir 2017).

When using a high number of numerical content descriptors, it is expected that some of them would not reflect information relevant to the difference between regular and irregular galaxies. Since the algorithm also aims at identifying types of galaxies that have not been seen before, previously collected data cannot be used for that task. To rank and weight the content descriptors by the information they provide in identifying outlier galaxy images without using annotated samples, the entropy of each feature f is used as shown in Equation 1.

$$ W_f = -\sum_{i} P_i \cdot \log P_i , \qquad (1) $$

where $P_i$ is the frequency of the values in the i-th bin of a 10-bin histogram of the values of that feature. $W_f$ is the entropy of the feature, which is used as the weight.
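Equation 1 can be illustrated with a minimal sketch. The function name `feature_entropy`, the equal-width binning between the minimum and maximum value, and the natural logarithm are assumptions made for illustration only; the text above specifies just a 10-bin histogram and does not state the binning scheme or the base of the logarithm.

```python
import math

def feature_entropy(values, bins=10):
    """Entropy of a histogram of one feature's values (Equation 1).

    Assumptions (not stated in the text): equal-width bins spanning
    [min, max] of the values, and the natural logarithm.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # every value falls into a single bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is clamped into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    # Empty bins contribute nothing to the sum.
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)
```

A feature whose values concentrate in a few bins yields a low entropy, while values spread evenly over all 10 bins yield the maximum, log(10) under the natural-logarithm assumption.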
When the entropy of the feature is low, the feature values are more consistent, and that consistency can be used as an indication that the numerical content descriptor is informative for reflecting the morphology of the galaxies in the dataset.

The dissimilarity between each pair of galaxies can be computed by using the Earth Mover’s Distance (EMD), which is an effective way of comparing vectors, and is commonly used in machine learning tasks (Rubner et al. 2000; Ruzon & Tomasi 2001). EMD can be conceptualized as an optimization problem in which the solution is the minimum work required to fill a set of holes in space with the mass of Earth, and the unit of work is the work required to move an Earth unit by a distance unit. Equation 2 shows the EMD optimization problem.

$$ Work(X, Y, F) = \sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j} \, d_{i,j} , \qquad (2) $$

where X and Y are the weighted feature vectors $(W_{x_1}, x_1), \ldots, (W_{x_n}, x_n)$ of size n, $f_{i,j}$ is the flow between $X_i$ and $Y_j$, $d_{i,j}$ is the ground distance between $X_i$ and $Y_j$, and $W$ is the vector of weights determined for all features by Equation 1. The flow F is the solution of the following linear programming problem:

$$ \sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j} = \min\left( \sum_{i=1}^{n} W_{x_i} , \sum_{j=1}^{n} W_{y_j} \right) $$

With the following constraints:

$$ W_{x_i} \geq \sum_{j=1}^{n} f_{i,j} , \qquad W_{y_j} \geq \sum_{i=1}^{n} f_{i,j} $$

The earth mover’s distance between X and Y is then defined as:

$$ EMD(X, Y) = \frac{Work(X, Y, F)}{\sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j}} $$

More details about the EMD vector comparison can be found in (Rubner et al. 2000; Ruzon & Tomasi 2001). The EMD is used to measure the distance between the histograms of all sets of numerical image content descriptors described in (Shamir et al. 2008, 2010). The sum of all distances of all histograms determines the distance between the two galaxies. The distances measured between different pairs of galaxies can be compared to distances between other pairs of galaxies to provide an estimation of the level of similarity or difference between each pair of galaxies in a dataset.

Once the similarity between each pair of galaxy images can be measured, the outlier galaxy images can be detected. A simple way of identifying outlier galaxy images is by identifying the galaxy x that satisfies

$$ \max_{x} \left( \min_{y \neq x} d(x, y) \right) . $$

That is, the galaxy image that is the most likely to be an outlier is the galaxy whose distance to its most similar galaxy is the highest compared to all other galaxies. However, that criterion might lead to undetected outlier galaxies. When a dataset is large, it is possible that even a rare galaxy type will appear more than once in that dataset. For instance, the dataset of the Vera Rubin Observatory is expected to image ∼ extragalactic objects, and therefore even a rare one-in-a-million object is expected to be present in that dataset ∼ times. Therefore, even rare objects might have one or more objects in the dataset that are similar to them. That can lead to a short distance between these objects and their nearest neighbor, and consequently to the inability of the algorithm to identify them as outlier objects.

To avoid a situation in which a small number of outlier objects that are similar to each other are not detected, the distances of the object from all other objects are sorted, and the R-th shortest distance is used as the minimum distance between the object and all other objects in the dataset. That means that if R-1 objects that are similar to the target object exist in the dataset, the distances between these objects and the target object will not affect the results. By using the rank R, a small number of objects that are similar to a certain object will not prevent the detection of that object. The value of R should be determined based on the size of the dataset. The larger the dataset is, the more likely it is that a certain rare object will have other objects in the dataset that are similar to it. Therefore, a larger dataset will require a higher value of R to be able to detect outlier galaxies.

The R parameter is used by the algorithm to control the rank of the neighbor by which the distance of the sample from the dataset is measured. The value of R makes it possible to reduce the impact of galaxies with elements that are less common in the dataset.
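The two steps described above, measuring a distance between histograms and scoring each object by its R-th shortest distance, can be sketched as follows. This is a simplified illustration and not the published implementation: `emd_1d` covers only the special case of two histograms over the same bins, normalized to equal total mass, with a unit ground distance between adjacent bins (in that case the EMD reduces to the sum of absolute cumulative differences), and it ignores the per-feature weights of Equation 1. Both function names are hypothetical.

```python
def emd_1d(hist_a, hist_b):
    """EMD between two histograms defined over the same bins.

    Simplifying assumptions: both histograms have positive total mass,
    are normalized to mass 1, and adjacent bins are one distance unit
    apart. The work is then the mass carried across each bin boundary.
    """
    total_a, total_b = sum(hist_a), sum(hist_b)
    carried, work = 0.0, 0.0
    for a, b in zip(hist_a, hist_b):
        carried += a / total_a - b / total_b  # surplus moved to the next bin
        work += abs(carried)
    return work

def rth_neighbor_scores(distances, r):
    """For each object, return its r-th shortest distance to the others.

    `distances` is a square matrix (list of lists) of pairwise distances;
    the diagonal (distance of an object to itself) is skipped.
    """
    scores = []
    for i, row in enumerate(distances):
        others = sorted(d for j, d in enumerate(row) if j != i)
        scores.append(others[r - 1])
    return scores
```

With r = 1 the score is the plain nearest-neighbor distance of the simple criterion; larger values of r tolerate up to r - 1 similar objects before an object stops looking isolated. Objects with the highest scores are the outlier candidates.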
Outlier detection algorithms might be dependent on the distribution of the samples in the dataset. For instance, if most galaxies in the dataset are small, larger galaxies might be identified as outliers. However, due to the R parameter, the distance between a sample and the rest of the dataset is determined by the distance between the sample and its R-th closest neighbor. Therefore, in the case of an uneven distribution of the size of the objects such that large objects are rare, the presence of more than R large objects in the dataset should theoretically prevent large objects from being identified as outliers due to their size alone. That is, if more than R large galaxies are present in the dataset, the R-th neighbor of a large galaxy is expected to be a large galaxy, and therefore the distance that determines whether the sample is an outlier should not become large merely because the R-th neighbor is small. Due to noise and the imperfectness of the distances it is expected that exactly R large galaxies might not be sufficient to avoid large objects being identified as outliers, but in large datasets the number of large objects is expected to be much higher than the value of R, and therefore the R-th nearest neighbor itself is expected to be a large object. That should ensure that even if large objects are the minority of the objects in the dataset, large objects will not be identified as outliers for that reason alone.

The data is taken from several HST fields that make up the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey (CANDELS). CANDELS (Grogin et al. 2011; Koekemoer et al. 2011) covers five fields, which are GOODS-N, GOODS-S, EGS, UDS, and COSMOS (Grogin et al. 2011). Sources were detected by applying SExtractor (Bertin & Arnouts 1996) on the F814W band and selecting sources with 4𝜎 or higher magnitude compared to the background. The sources were then separated by using the Subimage tool of Montage (Berriman et al. 2004). The images were FITS images with dimensions of 122 × 122 pixels, and these images were converted to TIFF format for the image processing. The total number of objects was 176,808. The redshift and g magnitude distribution of these objects is shown in Figure 1.

The method described in Section 2 was applied to the data described in Section 3. The method assigns each galaxy that it analyzes a score of “peculiarity”, and therefore makes it possible to find the galaxies that are the most likely to be indeed peculiar. The 1,100 galaxy images with the highest likelihood to be peculiar according to the method were examined manually, making a selection of outlier candidates and rejecting ∼86% of the galaxies that were detected by the algorithm. Figure 3 shows examples of objects that were detected by the algorithm as peculiar, but are not peculiar objects by manual inspection. As the figure shows, these objects include objects that are not rare and can be considered false positives, as well as objects whose peculiarity is not clear or is not of astronomical origin. Because the “peculiarity” of a galaxy is not strictly defined, it is possible that some objects of interest were rejected, but the prevalence of such objects is expected to be low.
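The reduction reported here can be checked with simple arithmetic. The sketch below uses only numbers stated in this paper (176,808 objects, 1,100 inspected images, and the 147 catalogue objects mentioned in the abstract); it assumes that all 147 catalogue objects come from the 1,100 inspected images, and the variable names are illustrative.

```python
total_objects = 176_808   # objects in the dataset (Section 3)
inspected = 1_100         # top-ranked images examined manually
catalogue = 147           # objects in the final catalogue (abstract)

# Factor by which the algorithm shrinks the set that needs manual inspection.
reduction_factor = total_objects / inspected

# Fraction of the inspected images rejected by the manual step,
# assuming every catalogue object was among the inspected images.
rejected_fraction = 1 - catalogue / inspected

print(f"reduction factor: {reduction_factor:.1f}")
print(f"rejected fraction: {rejected_fraction:.3f}")
```

The reduction factor of roughly 160 agrees with the stated two orders of magnitude, and the rejected fraction of roughly 0.87 agrees with the ∼86% figure.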


Figure 1. The redshift and g magnitude distribution of the objects (panels: frequency versus redshift z, and frequency versus g magnitude).

Figure 2. The top 10 outlier galaxy images as ranked by the algorithm.

The algorithm reduces the data by selecting a subset in which the frequency of outlier galaxy images is far higher than in the entire dataset, making the manual analysis practical also for larger datasets. Figure 4 shows the number of objects detected manually among the objects detected automatically, ranked by their distance as described in Section 2. Naturally, the number of detected objects increases when the number of objects being inspected manually gets larger. But the graph also shows that the frequency of detected objects is higher among the galaxies with lower rank, therefore making it practical to perform manual analysis of the results.

Figure 3. Objects identified by the algorithm as outliers that are not indeed outlier galaxy images.

Figure 4. The frequency of the number of objects detected manually among the objects detected by the algorithm (axes: number of peculiar objects versus number of objects inspected).

Tables 1 through 9 show the galaxies identified by the algorithm after manually removing ∼86% of the objects that are not peculiar galaxies. Figures 5 through 14 show the corresponding images of the objects in the tables. The objects are separated into different types such as gravitational lenses, ring galaxies, objects with embedded point sources, interacting systems, objects with linear features, one-arm galaxies, galaxies with detached segments, tidally distorted interacting galaxies, and other galaxies.

Figure 8 shows edge-on galaxies with dust lanes. These systems are not necessarily considered peculiar, but according to the results these forms are relatively rare in the HST sample. Figure 5 shows detected galaxies with embedded objects. Giant clumps of stars embedded in galaxies are not rare in 0 . < 𝑧 <

Figure 5. Images of the detected objects listed in Table 1.

sky surveys, and therefore in many cases galaxies that seem visually peculiar in HST do not seem unusual when observed using Earth-based instruments. Figure 15 shows several objects in HST, SDSS and Pan-STARRS. As the comparison shows, SDSS and Pan-STARRS do not give sufficient detail to identify the morphological features of these galaxies.

Table 2 shows objects suspected as gravitational lenses. None of these suspected gravitational lenses are included in the CASTLES survey of gravitational lenses (Kochanek et al. 1999), the catalogue of gravitational lens candidates in SDSS (Inada et al. 2012), or a survey powered by a group finding algorithm (Wilson et al. 2016). Two of the objects, 23 and 24, are included in the gravitational lens catalogue of Faure et al. (2008).

Because the analysis is based on numerical image content descriptors, it makes it possible to identify descriptors that can discriminate between regular galaxy images and the outlier images. To identify these descriptors, the image content descriptors of the 147 outlier images were compared to the descriptors of 745 random regular images. The comparison was done using the Linear Discriminant Analysis (LDA) scores, which can identify the features that can discriminate between the two classes. Table 10 shows some of the numerical image content descriptors with the highest LDA scores, and the means and standard deviations of the regular and outlier galaxy images. The description of the specific descriptors is provided in (Shamir et al. 2008, 2010, 2013; Schutter & Shamir 2015; Shamir 2016).

Figure 6. Galaxies with clumps of stars that are not of apparent unusual morphology. These galaxies are very common in the HST sample, and were not detected by the algorithm.

Figure 7. Images of the objects suspected as gravitational lenses listed in Table 2.

As the table shows, descriptors such as edge area and Tamura texture coarseness exhibit significant differences between regular and outlier galaxy images. An interesting observation is the fractality, computed by using box counting as described in (Lynch et al. 1991; Shamir et al. 2009). The fractality is much lower among outlier galaxy images compared to the regular images. That indicates that regular galaxies have higher fractality, which drops in the case of outlier images.
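The comparison of descriptors between the two classes can be illustrated with a per-descriptor discriminant score. The text does not give the exact formula behind the LDA scores, so the sketch below uses the common two-class Fisher ratio (squared difference of the class means divided by the sum of the class variances) as an assumed stand-in; the function name `fisher_score` is hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(regular, outlier):
    """Two-class Fisher ratio for a single descriptor.

    Assumed stand-in for the LDA score: a high value means the class
    means are far apart relative to the spread within each class.
    """
    between = (mean(regular) - mean(outlier)) ** 2
    within = pvariance(regular) + pvariance(outlier)
    if within == 0:
        # Degenerate case: no spread within either class.
        return float("inf") if between > 0 else 0.0
    return between / within
```

Descriptors would then be ranked by this score, and the top-ranked ones reported together with their class means and standard deviations, as in Table 10.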

ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
1    189.1622  62.1883   2    215.1411  52.9442   3    189.1906  62.2452
4    188.9980  62.1668   5    53.15489  -27.857   6    150.3146  1.68395
7    150.1713  1.62978   8    150.2772  1.91924   9    150.0289  1.88902
10   149.9010  1.85073   11   150.0291  2.03546   12   149.9216  2.20603
13   150.0966  2.50137   14   150.0506  2.47750   15   150.7432  2.66317
16   53.05503  -27.699   17   189.0773  62.2508   18   53.06687  -27.883
19   189.1166  62.2854   20   53.07838  -27.878   21   189.1230  62.1130

Table 1. The coordinates of detected objects with embedded point sources.

ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
22   149.8789  2.57436   23   150.1594  2.69273   24   150.0772  2.64584
25   53.00104  -27.770   26   34.40478  -5.2248

Table 2. Right ascension and declination (in degrees) of the galaxies suspected as gravitational lenses detected in the dataset.

ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
27   215.3761  53.1241   28   149.8313  1.59189   29   150.0589  1.74697
30   150.0610  1.64515   31   150.3063  1.81053   32   149.8813  1.88521
33   149.8668  2.05173   34   150.2041  2.80623   35   189.0973  62.2924
36   53.07250  -27.822

Table 3. Celestial coordinates of objects that are possible edge-on galaxies with dust lanes.

Figure 8. Images of the detected edge-on galaxies with dust lanes listed in Table 3.

As discussed in Section 2, the value of R is used to avoid the impact of rare objects that have similar objects in the dataset. When analyzing large datasets, even a rare object is expected to appear more than once in the dataset. Therefore, if two rare objects that are very different from all other objects are present in the dataset, it could be that each one of them will be a similar neighbor to the other object. When using the distance from the closest neighbor, the similarity between the two objects will assign each of the objects a relatively short distance, and therefore these objects might not be detected as peculiar.

To show the impact of the value of R, a simple experiment was done such that the 44 galaxies shown in Figure 11 were combined with galaxies 40 through 44 shown in Figure 9. In that dataset, the ring galaxies are the regular images. When running the algorithm with R set to 1 and observing the top 10 outliers returned by the algorithm, only objects 40 and 43 are detected among the top five outliers. That does not change when setting the value of R to 2 or 3. But when the value of R is set to 4, all objects 40 through 44 are detected among the top 10 outliers.
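The effect of R in this kind of experiment can be reproduced on toy data. The sketch below is not the experiment reported above: it replaces galaxies with hypothetical one-dimensional values and the EMD-based distance with the absolute difference, purely to show how a small group of mutually similar rare objects hides from the nearest-neighbor criterion and emerges once R reaches the size of the group.

```python
def outlier_ranking(points, r):
    """Indices of 1-D points ordered by r-th nearest-neighbor distance,
    most isolated first. Absolute difference stands in for the distance."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(dists[r - 1])
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)

# Hypothetical data: ten "regular" values, one mildly unusual value
# (index 10), and three rare values that resemble each other (11-13).
points = [0.1 * k for k in range(10)] + [2.0] + [10.0, 10.1, 10.2]

for r in (1, 2, 3):
    print(r, outlier_ranking(points, r)[:3])
```

With r equal to 1 or 2 the three rare values shield each other and the mildly unusual value is ranked first; with r equal to 3 the three rare values take the top three positions.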

Figure 9. Images of the detected objects with linear features listed in Table 4.

The ability of an algorithm to detect outlier galaxies is clearly a function of the redshift. Closer objects are generally brighter and can be observed in better detail compared to distant objects. It is expected that many objects with rare morphology at high redshift would not be identified as outliers by an algorithm or even by manual observation due to their small size and faint magnitude. Figure 16 shows the number of objects selected by the algorithm in each redshift range divided by the total number of objects detected by the algorithm. It also shows the number of objects determined as outlier candidates after manual inspection in each redshift range, divided by the total number of outlier candidates.

As the figure shows, the fraction of the objects selected after manual inspection is higher in the lower redshift ranges compared to the general population of objects selected by the algorithm, and lower in the higher redshift ranges of 𝑧 > 1. That distribution shows that a higher number of objects in the higher redshifts are detected by the algorithm but rejected after manual inspection, which indicates that in the higher redshifts the algorithm is less effective in identifying outlier galaxy candidates compared to the lower redshifts. That pattern can be expected given that galaxies at higher redshifts tend to be more difficult to inspect visually.
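The two curves compared in Figure 16 are fractions per redshift range, each normalized by the size of its own population. A minimal sketch of that computation is given below; the function name, the half-open bins, and the example redshift values are hypothetical and do not come from the catalogue.

```python
def fraction_per_bin(redshifts, edges):
    """Fraction of objects in each [edges[k], edges[k+1]) redshift range,
    relative to the total number of objects in the list."""
    counts = [0] * (len(edges) - 1)
    for z in redshifts:
        for k in range(len(counts)):
            if edges[k] <= z < edges[k + 1]:
                counts[k] += 1
                break
    total = len(redshifts)
    return [c / total for c in counts]

# Hypothetical example values, for illustration only.
edges = [0.0, 0.5, 1.0, 2.0]
selected_by_algorithm = [0.2, 0.4, 0.7, 1.5]
print(fraction_per_bin(selected_by_algorithm, edges))
```

Applying the same function to the redshifts of the algorithm's selection and to those of the manually confirmed candidates gives two distributions that can be compared bin by bin.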

ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
37   215.2375  53.0477   38   215.2534  53.0987   39   215.1405  53.0041
40   189.1852  62.1949   41   189.2101  62.3432   42   53.12404  -27.878
43   189.2438  62.1366   44   189.2722  62.1792   45   150.6507  1.64533
46   189.0420  62.2196   47   188.9525  62.1982

Table 4. Coordinates of detected objects with linear features.

ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
48   214.8828  52.8360   49   189.2587  62.3045   50   150.6613  1.64342
51   150.2551  1.88673   52   150.1958  1.88558   53   189.3390  62.1921
54   189.0522  62.2440   55   189.1127  62.2995   56   34.31388  -5.2024

Table 5. Coordinates of objects that are possible one-arm galaxies.

ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
57   214.9668  52.8542   58   189.1523  62.2768   59   214.6585  52.7311
60   214.6977  52.6933   61   214.6236  52.7394   62   214.6959  52.7275
63   215.0946  52.9053   64   214.9997  52.9886   65   215.0564  53.0715
66   215.3878  53.1364   67   215.1373  53.0894   68   53.12042  -27.757
69   189.2676  62.2110   70   189.2785  62.1685   71   150.1785  1.62206
72   150.4450  1.72180   73   150.3991  1.62907   74   150.1361  1.67666
75   149.7003  1.67250   76   150.1600  1.92169   77   150.1017  2.05326
78   149.9335  2.04432   79   149.8189  2.07964   80   150.6226  2.24475
81   149.6272  2.19739   82   150.4789  2.40455   83   150.2822  2.46019
84   150.2822  2.46019   85   149.8462  2.85215   86   149.7706  2.80442
87   189.3322  62.1755   88   53.19657  -27.863   89   189.3906  62.2292
90   53.22027  -27.854   91   53.05081  -27.679   92   189.1096  62.1963
93   189.1269  62.2739   94   189.1349  62.1262   95   189.1359  62.1229
96   53.01292  -27.718   97   34.32599  -5.2154   98   34.39667  -5.2660
99   34.26670  -5.1327   100  34.32900  -5.1332

Table 6. Galaxies with ring features.

Figure 10. Images of the detected objects listed in Table 5.

Sky surveys can acquire substantial amounts of information that includes a very large number of galaxies. While it can be assumed that these databases contain rare objects of scientific interest, it is difficult to identify these objects among a large number of objects. Here an automatic method was applied to HST data, and identified several unusual extragalactic objects. While the last step is manual, the algorithm reduces the data by two orders of magnitude, making the manual analysis practical. The objects identified by the algorithm can be used as targets in future studies.

The catalogue is clearly incomplete, as the algorithm is not able to identify all rare objects of interest. For instance, just two of the 67 gravitational lenses included in the catalogue of Faure et al. (2008) were detected. However, since it is based on automation, it does not require substantial labor, and can therefore be applied in cases where the databases are far too large to allow manual analysis.

With the increasing importance of large-field surveys such as the ground-based Vera Rubin Observatory and the space-based Euclid, it is clear that manual analysis will not be sufficient to fully utilize the extreme imaging power of these instruments. While the efficacy of computer analysis cannot yet meet the accuracy level of manual analysis by an expert, computer analysis is required to approach these extremely large databases, and the ability to use the data acquired by current and future sky surveys is largely dependent on the availability and advancement of algorithms that can practically analyze these data.


ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
101  215.2793  53.1822   102  149.9231  1.72376   103  149.7927  1.62935
104  149.6892  1.64355   105  150.6343  1.81815   106  150.4635  1.88330
107  150.4138  1.84758   108  150.5652  2.16613

Table 7. Spiral galaxies with detached segments.

ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
109  53.10084  -27.831   110  215.2063  53.1576   111  53.02481  -27.751
112  189.0039  62.2173   113  189.2672  62.3234   114  150.3090  1.91672
115  150.6876  1.97088   116  149.8947  2.20815   117  149.8947  2.20815
118  150.6845  2.54897   119  149.9031  2.82170   120  189.0672  62.2663
121  189.1247  62.2343   122  34.44746  -5.2467

Table 8. Tidally distorted interacting pairs.

ID   RA        Dec       ID   RA        Dec       ID   RA        Dec
123  215.0012  52.9636   124  189.1776  62.3058   125  215.0720  52.9071
126  214.9257  52.9287   127  215.2056  52.9864   128  215.2513  53.1415
129  53.11494  -27.767   130  150.7439  1.61616   131  150.1504  1.59564
132  149.8757  1.61034   133  150.7142  1.75447   134  149.8485  1.79248
135  149.7127  1.77889   136  149.9777  1.83432   137  149.8485  1.79248
138  150.4372  1.99945   139  150.5846  2.19190   140  149.7653  2.26561
141  150.3849  2.40657   142  150.2759  2.45195   143  189.4095  62.2583
144  189.4532  62.2233   145  189.0520  62.1946   146  34.36355  -5.2133
147  34.31276  -5.1375

Table 9. Other galaxies.

Descriptor    Regular mean    Outlier mean
Edge area 2796 ±

171 14578 ± ± ± ±

98 16.79 ± ±

96 16.13 ± ±

89 13.32 ± ±

94 15.46 ± ±

105 18.57 ± ±

91 14.06 ± Table 10.

The image content descriptors with the highest LDA separationbetween regular and outlier images.
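Since the catalogue objects are listed by RA and Dec, a natural first step when using them as targets is to cross-match a position against the tables. The following is a minimal sketch, assuming the tabulated values are decimal degrees; the function names and the 1 arcsec default radius are hypothetical choices made for illustration, and the three entries are rows copied from the tables above.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def match(targets, ra, dec, radius_arcsec=1.0):
    """Return the IDs of all targets within radius_arcsec of (ra, dec)."""
    radius_deg = radius_arcsec / 3600.0
    return [
        obj_id
        for obj_id, (t_ra, t_dec) in targets.items()
        if angular_separation_deg(t_ra, t_dec, ra, dec) <= radius_deg
    ]


# ID: (RA, Dec), assumed to be in decimal degrees.
targets = {
    101: (215.2793, 53.1822),
    109: (53.10084, -27.831),
    123: (215.0012, 52.9636),
}

print(match(targets, 215.2793, 53.1822))
```

The haversine form is used because it stays numerically stable at the small separations typical of cross-matching. For a full catalogue, a dedicated astronomy library would normally replace this linear scan.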

ACKNOWLEDGMENT

I would like to thank the anonymous reviewer for the insightful comments that helped to improve the manuscript. The research was funded by NSF grant AST-1903823.

DATA AVAILABILITY

The data underlying this article are available in the article. The research is based on observations made with the NASA/ESA Hubble Space Telescope, and obtained from the Hubble Legacy Archive, which is a collaboration between the Space Telescope Science Institute (STScI/NASA), the Space Telescope European Coordinating Facility (ST-ECF/ESA) and the Canadian Astronomy Data Centre (CADC/NRC/CSA).

Figure 11. Images of detected galaxies with ring features listed in Table 6.

Figure 12. Images of the detected spiral galaxies with detached segments listed in Table 7.

Figure 13. Images of the tidally distorted object candidates listed in Table 8.

Figure 14. Images of the other detected objects listed in Table 9.

Figure 15. Comparison of the images of objects 128, 139 and 135 in HST (left), SDSS (middle), and Pan-STARRS (right). As expected, the comparison shows that the Earth-based sky surveys do not provide sufficient detail to identify the morphology of these galaxies.

[Plot axes: Fraction, Z; legend: Algorithm selection, Outlier candidates]

Figure 16. The fraction of objects detected by the algorithm in different redshift ranges compared to the total number of objects detected by the algorithm, and the fraction of outlier candidates in each redshift range compared to the total number of outlier candidates determined after manual inspection. The fraction of outlier candidates is higher at the lower redshifts and decreases at the higher redshifts.

REFERENCES

Abraham R. G., van den Bergh S., 2001, Science, 293, 1273
Abraham R. G., Van Den Bergh S., Nair P., 2003, Astrophysical Journal, 588, 218
Amarbayasgalan T., Jargalsaikhan B., Ryu K. H., 2018, Applied Sciences, 8, 1468
Arp H., 1966, Astrophysical Journal Supplement Series, 14, 1
Arp H. C., Madore B. F., 1975, The Observatory, 95, 212
Banerji M., et al., 2010, Monthly Notices of the Royal Astronomical Society, 406, 342
Berriman G., et al., 2004, in Astronomical Data Analysis Software and Systems. p. 593
Bertin E., Arnouts S., 1996, Astronomy and Astrophysics, 117, 393
Bettoni D., Galletta G., García-Burillo S., Rodríguez-Franco A., 2001, Astronomy & Astrophysics, 374, 421
Buta R. J., 2017, Monthly Notices of the Royal Astronomical Society, 471, 4027
Casasola V., Bettoni D., Galletta G., 2004, Astronomy & Astrophysics, 422, 941
Cecotti H., 2020, International Journal of Machine Learning and Cybernetics, pp 1–15
Chen Z., Yeo C. K., Lee B. S., Lau C. T., Jin Y., 2018, Neurocomputing, 309, 192
Cheng T.-Y., et al., 2020, Monthly Notices of the Royal Astronomical Society, 493, 4209
Conselice C. J., 2003, Astrophysical Journal Supplement Series, 147, 1
Davis D. R., Hayes W. B., 2014, Astrophysical Journal, 790, 87
Dieleman S., Willett K. W., Dambre J., 2015, Monthly Notices of the Royal Astronomical Society, 450, 1441
Faure C., et al., 2008, Astrophysical Journal Supplement Series, 176, 19
Finkelman I., Funes SJ J. G., Brosch N., 2012, Monthly Notices of the Royal Astronomical Society, 422, 2386
Fogel I., Sagi D., 1989, Biological Cybernetics, 61, 103
Gillman S., et al., 2020, Monthly Notices of the Royal Astronomical Society, 492, 1492
Goddard H., Shamir L., 2020, Astrophysical Journal Supplement Series, 251, 28
Graham A. W., 2019, Monthly Notices of the Royal Astronomical Society, 487, 4995
Grogin N. A., et al., 2011, Astrophysical Journal Supplement Series, 197, 35
Guo Y., et al., 2015, Astrophysical Journal, 800, 39
Hadjidemetriou E., Grossberg M. D., Nayar S. K., 2001, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp I–I
Haralick R. M., Shanmugam K., Dinstein I. H., 1973, IEEE Transactions on Systems, Man, and Cybernetics, pp 610–621
Hosny K., Elaziz M., Selim I., Darwish M., 2020, Astronomy and Computing, p. 100383
Huertas-Company M., et al., 2009, Astronomy and Astrophysics, 497, 743
Huertas-Company M., et al., 2015a, arXiv preprint arXiv:1509.05429
Huertas-Company M., et al., 2015b, arXiv preprint arXiv:1506.03084
Inada N., et al., 2012, Astronomical Journal, 143, 119
Jacobs C., et al., 2019, Monthly Notices of the Royal Astronomical Society, 484, 5330
Kaviraj S., 2010, Monthly Notices of the Royal Astronomical Society, 406, 382
Kochanek C., Falco E., Impey C., Lehár J., McLeod B., Rix H.-W., 1999, in AIP Conference Proceedings. pp 163–175
Koekemoer A. M., et al., 2011, Astrophysical Journal Supplement Series, 197, 36
Kuminski E., Shamir L., 2016, Astrophysical Journal Supplement Series, 223, 20
Kuminski E., George J., Wallin J., Shamir L., 2014, Publications of the Astronomical Society of the Pacific, 126, 959
Lim J. S., 1990, New Haven: Prentice Hall
Lintott C. J., et al., 2009, Monthly Notices of the Royal Astronomical Society, 399, 129
Lynch J., Hawkes D., Buckland-Wright J., 1991, Physics in Medicine & Biology, 36, 709
Madore B. F., Nelson E., Petrillo K., 2009, Astrophysical Journal Supplement Series, 181, 572
Margalef-Bentabol B., Huertas-Company M., Charnock T., Margalef-Bentabol C., Bernardi M., Dubois Y., Storey-Fisher K., Zanis L., 2020, arXiv:2003.08263
Margapuri V. S. K., Shamir L., Thapa B., 2020, in 29th International Conference on Software Engineering and Data Engineering.
Mittal A., Soorya A., Nagrath P., Hemanth D. J., 2019, Earth Science Informatics, pp 1–17
Nair P. B., Abraham R. G., 2010, Astrophysical Journal Supplement Series, 186, 427
Nairn A., Lahav O., 1997, Monthly Notices of the Royal Astronomical Society, 286, 969
Peng C. Y., Ho L. C., Impey C. D., Rix H.-W., 2002, Astronomical Journal, 124, 266
Rubner Y., Tomasi C., Guibas L. J., 2000, International Journal of Computer Vision, 40, 99
Ruzon M. A., Tomasi C., 2001, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 1281
Schutter A., Shamir L., 2015, Astronomy and Computing, 12, 60
Scoville N., et al., 2007, Astrophysical Journal Supplement Series, 172, 1
Shamir L., 2009, Monthly Notices of the Royal Astronomical Society, 399, 1367
Shamir L., 2011, Astrophysical Journal, 736, 141
Shamir L., 2012, Journal of Computational Science, 3, 181
Shamir L., 2016, Publications of the Astronomical Society of the Pacific, 129, 024003
Shamir L., 2017, Astrophysics Source Code Library, p. ascl:1704.002
Shamir L., 2020, Monthly Notices of the Royal Astronomical Society, 491, 3767
Shamir L., Wallin J., 2014, Monthly Notices of the Royal Astronomical Society, 443, 3528
Shamir L., Orlov N., Eckley D. M., Macura T., Johnston J., Goldberg I. G., 2008, Source Code for Biology and Medicine, 3, 13
Shamir L., Ling S. M., Scott W., Hochberg M., Ferrucci L., Goldberg I. G., 2009, Osteoarthritis and Cartilage, 17, 1307
Shamir L., Macura T., Orlov N., Eckley D. M., Goldberg I. G., 2010, ACM Transactions on Applied Perception, 7, 1
Shamir L., Holincheck A., Wallin J., 2013, Astronomy and Computing, 2, 67
Simard L., 1999, in Photometric Redshifts and the Detection of High Redshift Galaxies. p. 325
Tamura H., Mori S., Yamawaki T., 1978, IEEE Transactions on Systems, Man, and Cybernetics, 8, 460
Taylor V. A., Jansen R. A., Windhorst R. A., Odewahn S. C., Hibbard J. E., 2005, Astrophysical Journal, 630, 784
Teague M. R., 1980, Journal of the Optical Society of America, 70, 920
Timmis I., Shamir L., 2017, Astrophysical Journal Supplement Series, 231, 2
Wilson M. L., Zabludoff A. I., Ammons S. M., Momcheva I. G., Williams K. A., Keeton C. R., 2016, Astrophysical Journal, 833, 194
Wu C.-M., Chen Y.-C., Hsieh K.-S., 1992, IEEE Transactions on Medical Imaging, 11, 141