In search of the weirdest galaxies in the Universe

University of Groningen
Kapteyn Astronomical Institute
Master Thesis Astronomy

Author: Job Formsma (S2758717)
Supervisors: Prof. dr. Reynier Peletier, Teymoor Saifollahi

July 16, 2020

Abstract

Weird galaxies are outliers that have either unknown or very uncommon features making them different from the normal sample. These galaxies are very interesting as they may provide new insights into current theories, or can be used to form new theories about processes in the Universe. Interesting outliers are often found by accident, but this will become increasingly more difficult with future big surveys generating an enormous amount of data. This creates the need for machine learning detection techniques to find the interesting weird objects. In this work, we inspect the galaxy spectra of the third data release of the Galaxy And Mass Assembly survey and look for the weird outlying galaxies using two different outlier detection techniques. First, we apply distance-based Unsupervised Random Forest on the galaxy spectra using the flux values as input features. Spectra with a high outlier score are inspected and divided into different categories such as blends, quasi-stellar objects, and BPT outliers. We also experiment with a reconstruction-based outlier detection method using a variational autoencoder and compare the results of the two different methods. At last, we apply dimensionality reduction techniques on the output of the methods to inspect the clustering of similar spectra. We find that both unsupervised methods extract important features from the data and can be used to find many different types of outliers.

1 Introduction
The most interesting discoveries in Astronomy are the accidental and unintended discoveries of unknown objects never seen before. These unknown unknowns are interesting as they can give new insights into our understanding of physical processes in the Universe, or introduce new questions about current theories in Astronomy. Finding these unknown objects is not easy as, by definition, there is no indication of what to look for. Whenever an unknown object is found, it may easily be disregarded as an error, as it might look very different from the expected observations. It is up to the mindset of the observer to either investigate the unknown finding or disregard it as an anomaly and move on with the main objective of their observations. An excellent example is the discovery of pulsars (Hewish et al., 1968), where due to good knowledge of the equipment and a persistent open mindset (Norris, 2017) the new type of astronomical object could be discovered. A more recent famous example is
Hanny's Voorwerp (Józsa et al., 2009), which was accidentally found while looking through an enormous amount of data with the citizen science project Galaxy Zoo. This resulted in an extensive search for more Voorwerpjes, of which multiple have since been found, giving insight into the ionization processes of Active Galactic Nuclei (Keel et al., 2012). With the increasing size of the already big astronomical surveys and the vast amount of data that comes from them, finding the interesting unknown unknowns by accident will become more difficult. The traditional analysis of survey data becomes too much work to do by hand, resulting in the need to develop new, or use existing, visualization techniques.
In the last decade, machine learning has been widely used in Astronomy (Baron, 2019). Applications range from supervised tasks, such as classifying galaxy types, to the unsupervised tasks of clustering similar objects.

Supervised learning is used to assign a specific class to a set of data. A basic example is the classification of galaxy types based on their morphology. Information about the shape of the galaxy can be learned from pictures, where a (non-)linear relationship can be found between pixel values and galaxy types. Once the underlying relationship is found, the model can be applied to new data to predict galaxy morphology classes. Learning this relationship requires prior knowledge of the data before training, as the classes have to be known beforehand. For galaxy types, this labeling is often done by eye by an expert in the field, or via citizen-science projects like Galaxy Zoo. Supervised machine learning can be a very strong method to automate the classification of new data, as shown in Ksoll et al. (2018), where three different supervised methods were successfully applied to identify Pre-Main-Sequence stars using photometric observations.
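As a toy illustration of such a supervised classifier: the sketch below trains a random forest on two invented "classes" separated in two made-up colour features. The data, class meanings, and feature values are fabricated for the example and are not taken from Ksoll et al. (2018).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for labeled photometric data: two fabricated classes
# separated in two invented colour features.
n = 500
colours = np.vstack([
    rng.normal([0.5, 1.0], 0.2, size=(n, 2)),   # class 0, e.g. "spiral"
    rng.normal([1.2, 0.4], 0.2, size=(n, 2)),   # class 1, e.g. "elliptical"
])
labels = np.repeat([0, 1], n)

X_train, X_test, y_train, y_test = train_test_split(
    colours, labels, random_state=0)

# Learn the (non-)linear relation between features and class labels,
# then predict classes for unseen data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

On this well-separated toy set the classifier is nearly perfect; real morphological classification is of course much harder.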
Unsupervised learning is used to learn (complex) relationships that exist in a data set. In contrast to supervised learning, there are no pre-defined classes, resulting in very different applications: clustering of similar objects, dimensionality reduction, and anomaly detection. These applications are arguably more important for scientific research than classification via supervised learning, as they can be used to extract new information from a data set. For example, Turner et al. (2018) applied a simple clustering method using nearest neighbors on galaxy feature data to find galaxies in their different evolutionary paths. This robust method of early data inspection can easily be scaled to large data sets. This is an excellent application of an unsupervised method, as they extracted information from a data set without the need for prior knowledge or labeled data.

1.2 Outlier Detection

One of the most interesting applications of unsupervised learning is novelty detection, or outlier detection. Outliers are objects that have either weird or uncommon features, making them very different from the normal sample. Many different types of algorithms can be applied to big data sets, which learn the underlying rules of common features and find objects that fall outside of these rules. Most of the outliers will consist of noisy or erroneous data, as these are simply not comparable to the real data. Outlier detection is interesting for Astronomy as it can be used to find uncommon objects, or common events that only happen on small timescales. Targeted observations of these events can be very difficult, as one does not know where to look at which time.
Fortunately, the Universe is very big and with the use of big surveys, such as the Sloan Digital Sky Survey (SDSS, Blanton et al., 2017) or Galaxy And Mass Assembly (GAMA, Baldry et al., 2017), these rare events can still be observed by accident.

Next to the uncommon events and short events, a third interesting outlier category is the 'unknown unknowns'. This category consists of objects that have never been observed before and that are, when found, not, or only partly, understood. Finding these objects is exciting for Astronomers, as new information can always confirm or shake up theories about processes and events in the Universe. Searching for 'unknown unknowns' in a data set is not a trivial task, as it is a very advanced application of unsupervised learning with no indication of what to look for. Also, different outlier detection techniques may identify different objects as outliers, as the methods inherently differ from each other. While each algorithm finds its own interesting outliers, there is no single method to identify all the interesting objects at once. A review of novelty detection methods is presented in Pimentel et al. (2014), where five different types of outlier detection techniques are described and investigated. In this work, we will apply two of these methods, distance based and reconstruction based novelty detection, on galaxy spectra to find outlying objects.
Distance based outlier detection methods use an algorithm to determine distances between objects via their features or values and assign a score for their dissimilarities. An outlier score can be determined for every object using these distances, representing its difference with all other objects. The objects with the highest overall outlier scores can be considered the outliers of the data set. Baron & Poznanski (2016) introduced a general distance based outlier detection algorithm and applied it to galaxy spectra. This resulted in the discovery of many interesting outliers, and their algorithm is used in our research.
Reconstruction based algorithms learn the underlying rules in the data and reconstruct the data using these rules. The difference between the original input and the reconstructed data is used to compute a reconstruction error. As the model is not trained to reconstruct the unknown or uncommon features in outliers, the reconstruction error can be used as an outlier score. The application of reconstruction based methods was tested in Ichinohe & Yamada (2019) and Portillo et al. (2020), where variational autoencoders are applied on spectra and used for outlier detection.

Both outlier detection methods result in an outlier score for each object, often normalized between 0 and 1. It is up to the user to determine a cut-off value, above which all objects are considered to be outliers.

1.3 Project Outline
In this project, we will use two outlier detection methods to find weird interesting objects in a big astronomical survey. We try to find weird objects by looking at the spectra in the third data release of the Galaxy and Mass Assembly (GAMA) survey (Baldry et al., 2017). Spectra contain a lot of information and physical parameters of the observed objects and are therefore well suited for outlier research. A description of the GAMA data and its content can be found in Section 2. First, we inspect the data for possible instrumental errors, highly uncertain redshift calculations, and inaccurate flux calibrations. We apply the distance-based outlier detection algorithm developed in Baron & Poznanski (2016) on our GAMA data. In their work, they applied an algorithm based on Unsupervised Random Forest (URF, Shi & Horvath, 2005) on SDSS galaxy spectra and could find many outlying spectra, such as spectra with extreme line emission ratios or galaxies with supernovae. Many of their findings had not been reported earlier in the literature. A comparison between this method and a few other conventional outlier detection algorithms is provided in Reis et al. (2019b), which states that overall the URF performs best. A thorough explanation and our implementation of the URF can be found in Section 3. In that section, we also test the algorithm using known outliers and self-generated weird spectra. The results of the algorithm are inspected and the weirdest spectra are investigated in Section 4.

Next to the distance based URF method, we experiment with a reconstruction based outlier detection technique utilizing a neural network. We apply an Information Maximizing Variational Autoencoder (Zhao et al., 2017) on the galaxy spectra, which is a modified version of a variational autoencoder (Kingma & Welling, 2013). This method was demonstrated on SDSS data in Portillo et al. (2020) and can map spectra to a (very) low dimension and fully reconstruct different types of galaxies.
We use the reconstruction error and low dimensional representation to determine outliers and compare the results of both methods in Section 4.

In addition to the search for outliers, we also inspect the high dimensional output of the outlier detection methods using t-SNE (van der Maaten & Hinton, 2008), a visualization technique used to reduce high-dimensional data to a two or three-dimensional map. This was also done in Reis et al. (2018), where they applied t-SNE on the output of the URF algorithm. They showed that the high dimensional output contains a lot of complex information about the data. With the map, the structure of the data can be projected and used to group and find similar galaxies. We show the application of t-SNE and the resulting maps in Section 5.

The main objective of this research is to find the outliers and weirdest objects in the GAMA survey, but we also want to promote the use of outlier detection methods on spectroscopic data. Spectra are very interesting to look at as they trace many physical properties of the object. In the near future, new surveys like WEAVE (Dalton et al., 2014) and 4MOST (de Jong et al., 2019) will generate an enormous amount of spectra. Finding interesting outliers or anomalies becomes more difficult as more spectra have to be inspected. Convincing people to join citizen science projects might also be difficult, as spectra do not look as interesting as the sky pictures used in projects like Galaxy Zoo. To cope with the amount of data we need good machine learning techniques to automatically find the interesting outliers in future surveys. We use and compare two outlier detection algorithms and explore their usability for the search for outliers in the Universe.
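A minimal sketch of such a t-SNE map, using an invented toy distance matrix in place of the URF output (all sizes and values below are fabricated for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy distance matrix for 60 objects in two artificial groups: small
# within-group distances, larger between-group distances, standing in
# for the URF distance matrix described later in the text.
rng = np.random.default_rng(2)
n = 60
labels = np.repeat([0, 1], n // 2)
dist = rng.uniform(0.6, 1.0, size=(n, n))
same = labels[:, None] == labels[None, :]
dist[same] = rng.uniform(0.0, 0.3, size=int(same.sum()))
dist = (dist + dist.T) / 2            # symmetrize
np.fill_diagonal(dist, 0.0)

# t-SNE accepts a precomputed distance matrix directly; with
# metric="precomputed" the init must be "random", not "pca".
embedding = TSNE(n_components=2, metric="precomputed",
                 init="random", perplexity=10,
                 random_state=0).fit_transform(dist)
# Objects from the same group should land close together in the 2D map.
```

The same call applies unchanged to a real URF distance matrix, which is how the maps in Section 5 can be produced.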
2 Data
We use data from the third data release (DR3) of the Galaxy and Mass Assembly (GAMA) survey (Baldry et al., 2017). The GAMA survey is a spectroscopic and multiwavelength photometric survey, with the main objective to study cosmic structures on a scale of 1 kpc to 1 Mpc (Driver et al., 2009). This includes the study of galaxy clusters, groups, and mergers to test the cold dark matter paradigm of structure formation. The third data release consists of over 150000 spectra of objects with a reliable heliocentric redshift (Baldry et al., 2014) and other additional photometric parameters (Hill et al., 2011). The objects are located in three equatorial and two southern observational regions covering a total area of 286 deg². Primary observations were done using the AAOmega spectrograph at the Anglo-Australian Telescope (Smith et al., 2004).

In total, the GAMA survey consists of more spectra than the observed spectra of the AAOmega spectrograph. Before the survey started, an input catalog of the objects in the observation regions was assembled to make a highly complete redshift survey. The object locations were drawn from the SDSS DR6 photometric observations and UKIDSS (Lawrence et al., 2007), with magnitude limits in the r, z, and K_AB bands. We select the spectra by setting the SURVEY_CODE to 5, which represents spectra from the main observations, and
IS_SBEST to be true, such that we only have the best spectrum for every single object. Another important parameter is the nQ value, which represents the normalized quality factor of the redshift value fitted to the spectra. For quality research, we only use spectra with a value of nQ ≥ 3, as advised in Hopkins et al. (2013). The full data query can be found in Appendix A, which gives a total of just over 130000 usable spectra. The wavelength range of the instrument, and therefore the observed wavelength range of the objects, is between 3750 Å and 8850 Å over 4954 pixels.

Figure 1: Overview of distributions of GAMA parameters
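The selection described above can be sketched as a simple table filter. The column names and example rows below are assumptions mirroring the quoted catalogue parameters, not the actual GAMA query (which is given in Appendix A):

```python
import pandas as pd

# Hypothetical miniature catalogue; columns mirror the selection
# parameters from the text (SURVEY_CODE, IS_SBEST, NQ).
catalogue = pd.DataFrame({
    "SPECID":      ["a", "b", "c", "d"],
    "SURVEY_CODE": [5, 5, 1, 5],
    "IS_SBEST":    [True, False, True, True],
    "NQ":          [4, 4, 3, 2],
})

# Main-survey spectra only, best spectrum per object, reliable redshift.
selection = catalogue[
    (catalogue["SURVEY_CODE"] == 5)
    & (catalogue["IS_SBEST"])
    & (catalogue["NQ"] >= 3)
]
```

Only spectrum "a" survives all three cuts in this toy table; on the real catalogue the same three cuts yield the ~130000 usable spectra.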
There is a vast amount of different spectra in the GAMA survey. We quickly inspect some parameters that are already provided for each spectrum: a fitted redshift parameter and a signal to noise ratio SNR. The distributions of these parameters are shown in Figure 1. Looking at the redshift distribution, we can see that the median redshift of GAMA objects is around z ≈ 0.2, and that a group of objects has an assigned redshift of z = 0. Objects with this assigned redshift value are: foreground stars, point-like quasar candidates which turned out to be stars, galaxy-star blends, and bad spectra with a wrongly fitted redshift value. As a test, we run these objects through our reconstruction based clustering algorithm. This resulted in the classification of 1567 M-dwarfs, 144 O-stars, and 1118 stars that are either of type A, F or G. These spectra will be left out of the final classification of galactic outliers in Section 4.

2.2 Pre-processing

In this project, we use the flux values of the spectra as the input of the outlier detection algorithms to compute an outlier score for each spectrum indicating its weirdness. The output of the algorithms is thus entirely based on the shape of, and features in, the spectrum. To get the best unbiased results, all the raw flux values could be used as input for the outlier detection algorithm. However, the raw data contains a lot of bad data points, or noisy continuum as shown in Figure 2, which interfere with the results. A few bad pixels, for example due to a cosmic ray, can cause a high outlier score for an otherwise normal spectrum. We apply a few pre-processing steps on the raw data to make it suitable for our project. We try to keep the processed data as close to the raw data as possible, as this will give the best relation between an outlier score and the observed spectrum.

First, we mask three noisy regions in all spectra. The blue end of the spectra up to the observed wavelength of 4050 Å is masked due to low flux values as a result of high sky absorption.
We also mask the region between 5570 Å and 5585 Å due to high residual sky emission seen in most of the spectra. At last, the region with wavelengths larger than 8780 Å is masked as there are noisy points at the end of most spectra. Other regions with bad pixels, which are not at the end of the spectra, are linearly interpolated, to simulate as if those regions are normal.

Next to removing the bad pixels, we smooth the data by applying a low pass filter. The raw spectra show a lot of flux fluctuations in the parts where there are no emission lines, especially in the blue end of the spectra. Due to the randomness of these fluctuations, spectra can be assigned a high outlier score without the presence of real physical features. We smooth the flux values in the spectra via convolution with a Gaussian kernel using the Astropy package in Python. The kernel can be described as

g(x, σ) = 1 / (√(2π) σ) · exp(−x² / (2σ²)),   (1)

where σ denotes the standard deviation in terms of pixels. This kernel suppresses the low-level fluctuations, while keeping the shape of the emission lines intact. We test different kernel sizes, to find an optimal balance between suppressing noise and keeping the data close to the original. This is seen in Figure 2, where the kernels with different pixel widths are applied on the raw spectra and shown on top of each other. The important tasks are keeping the shape of the emission lines as authentic as possible, while reducing the fluctuations on the continuum level. We see that the Gaussian kernel g(x,
3) is a good trade-off, but we will also run tests with smaller and bigger kernel widths.

Redshift is a non-linear variable, so we want to avoid working in the observed frame. We shift the spectra to their rest-frame wavelengths and interpolate all spectra to a common grid. As flux values will be used as feature input for the outlier detection algorithms, it is important to have all emission lines at the same input point. We interpolate all spectra between the rest-frame wavelengths of 3500 Å and 7500 Å to 8000 data points, giving a slightly oversampled resolution of 0.5 Å per data point. This is to ensure that there are enough data points to probe the width and separation of emission lines for all redshifted spectra. Missing flux values at the ends of the spectra are extrapolated as a flat line with the average value of nearby points. At last, the full spectrum is normalized by dividing by the median of the flux values. We use three different subsets in this project based on the signal to noise ratio (SNR) of the spectra. An overview of the subsets is shown in Table 1.
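The pre-processing chain (smoothing with Equation (1), shifting to the rest frame, resampling, and median normalization) can be sketched as follows. The toy spectrum and helper functions are invented for illustration; the thesis itself uses astropy.convolution's Gaussian kernel, for which the plain-NumPy convolution below is an equivalent stand-in:

```python
import numpy as np

def gaussian_kernel(sigma_pix, half_width=12):
    """Equation (1) sampled on a pixel grid, normalized to unit sum."""
    x = np.arange(-half_width, half_width + 1)
    k = np.exp(-0.5 * (x / sigma_pix) ** 2)
    return k / k.sum()

def preprocess(flux, obs_wave, z, rest_grid, sigma_pix=3):
    """Smooth, de-redshift, resample to a common grid, median-normalize."""
    smoothed = np.convolve(flux, gaussian_kernel(sigma_pix), mode="same")
    rest_wave = obs_wave / (1.0 + z)          # shift to the rest frame
    resampled = np.interp(rest_grid, rest_wave, smoothed)
    return resampled / np.median(resampled)   # median normalization

# Toy spectrum: flat continuum plus one emission line observed at 6563(1+z) Å.
z = 0.1
obs_wave = np.linspace(3750, 8850, 4954)
flux = np.ones_like(obs_wave)
flux[np.argmin(np.abs(obs_wave - 6563 * (1 + z)))] += 20.0
rest_grid = np.linspace(3500, 7500, 8000)     # 0.5 Å per data point
processed = preprocess(flux, obs_wave, z, rest_grid)
# The line now sits near rest-frame 6563 Å on the common grid.
```

After this step every spectrum, whatever its redshift, has the same 8000 feature positions, so a given wavelength bin means the same thing for all inputs to the outlier detection algorithms.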
Figure 2: Different Gaussian kernel widths tested. The labels of the smoothed spectra refer to the Gaussian kernel given in Equation (1).

https://docs.astropy.org/en/stable/convolution/

2.3 Bad Spectra

During early inspection of the data, a few weird spectra were commonly found. These spectra are shown in Figure 3 and trace four of the most recurring types of bad spectra. First, there are many fringed spectra, shown by the top spectrum of Figure 3. These sine-like oscillations arise from shortcomings in the AAOmega instrument and are present in observations of a few specific fibers. The fringing arises from air gaps in the glue at the connection point of the fibers and the prism, causing high-frequency oscillations and poor removal of sky features (Hopkins et al., 2013). For most fringed spectra, the redshift could still be determined as emission lines are still visible. However, as the continuum clearly does not trace the physical properties of a galaxy, we exclude these fringed galaxies from our outlier detection. As the fringed galaxies have different shapes of oscillations, there is also no robust way to "defringe" all galaxies in a fully automated manner.

The next three classes of bad spectra are solely due to reduction errors. The second spectrum of Figure 3 shows a very big jump in the continuum at 5700 Å. The left and right parts of the spectra are observed by different arms of the instrument and have to be normalized to match each other. Several spectra have this so-called bad splice, which is due to poor flat fielding in one of the arms (Hopkins et al., 2013). Just as with the fringed spectra, a reliable redshift can be determined as emission and absorption features are visible. The third spectrum of Figure 3 shows obvious sky emission lines that have not been subtracted correctly.

The last group of bad spectra shows an unexplained number of spurious lines in a small observing region in the sky. Also, a big continuum error between 6000 Å and 7000 Å can be seen.
This effect has not been reported before and there seems to be no information about the source of the weird features. This type of bad spectra is found in the specific observed region G15_Y1_DN1_nnn, with G15_Y1 the observed field, DN1 the field identifier and nnn the fiber identifier. As each fiber in this observed field shows the same errors, we call this group the DN1 field error.

Initial random inspection only gave us a few spectra of each group. However, as these spectra are very different from normal spectra, the outlier detection algorithms naturally picked up these spectra as outliers. By iterative application of the algorithm and removal of the bad spectra, we found all spectra belonging to the above groups. In total, we found that 4.5% of the GAMA spectra are either affected by fringing, bad splicing, or other reduction errors. In the GAMA spectroscopic analysis paper (Hopkins et al., 2013) the amount of bad spectra was reported to be around 3%, which shows that we have found a few more than earlier reported. Furthermore, we find a total of 150 bad DN1 field errors.
Name     Subset       Objects   GG Spectra   med{SNR}   med{z}
SNR2.5   SNR ≥ 2.5    …         …            …          …
…        …            …         …            …          …
SNR10    SNR ≥ 10     23120     21226        12.98      0.158

Table 1: Subsets of GAMA AAOmega spectra. The GG Spectra column indicates the size of the subsets with only good galactic spectra, excluding stars and bad spectra.

Figure 3: Groups of bad spectra in the GAMA survey. From top to bottom: fringed spectrum, bad splice between instrument arms, sky emission, DN1 field error. Pictures taken from the GAMA DR3 Single Object Viewer.

3 Outlier Detection Methods
We apply two different algorithms on the GAMA spectra to find outliers. The first algorithm we use is a distance-based algorithm using Unsupervised Random Forest (URF, Shi & Horvath, 2005), which was developed in Baron & Poznanski (2016) and used to detect outliers in SDSS spectra. The behavior of the algorithm is inspected via different tests to determine its ability to detect outliers and its sensitivity to noisy data. We will show the used data sets and explain the application of the algorithm on the data. We also use a reconstruction based algorithm using an Information Maximizing Variational Autoencoder (InfoVAE, Zhao et al., 2017), which was used as an experiment in Portillo et al. (2020) to synthesize SDSS spectra. We apply this algorithm as an experiment on GAMA spectra to find outliers, i.e. spectra which cannot be correctly reconstructed, and to learn about the application of neural networks on astronomical data. In this section, we only look at the application process and outputs of the algorithms. The outliers of the GAMA survey are discussed in the next section.
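The reconstruction-based scoring introduced in Section 1 can be illustrated with a minimal sketch; toy arrays stand in for the spectra and their InfoVAE reconstructions, and the function name is ours:

```python
import numpy as np

def reconstruction_outlier_scores(spectra, reconstructions):
    """Mean squared reconstruction error per spectrum, scaled to [0, 1]."""
    errors = np.mean((spectra - reconstructions) ** 2, axis=1)
    return (errors - errors.min()) / (errors.max() - errors.min())

# Toy example: the "model" reproduces all but the last spectrum well.
rng = np.random.default_rng(1)
spectra = rng.normal(1.0, 0.1, size=(5, 100))
recon = spectra + rng.normal(0.0, 0.01, size=(5, 100))
recon[-1] += 0.5                          # one poorly reconstructed spectrum
scores = reconstruction_outlier_scores(spectra, recon)

cutoff = 0.5                              # user-chosen threshold
outliers = np.where(scores > cutoff)[0]   # index of the badly fit spectrum
```

This mirrors the general recipe: normalize the scores, pick a cut-off, and treat everything above it as an outlier; the interesting work is then in inspecting those objects.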
The URF is used to generate the pair-wise distances between all objects by looking at the features in the data, constructing a distance matrix. As shown in Baron & Poznanski (2016), this distance matrix traces a lot of information about the objects and can be used to determine an overall distance (or weirdness) score for each object. The fundamental part of the algorithm is the Random Forest (RF, Breiman, 2001). An RF is normally used in a supervised setting, in which an ensemble of decision trees learns the rules in the data and determines the class labels via a majority vote. Our data is not labeled, so we use the RF in an unsupervised way as described in Shi & Horvath (2005). Instead of assigning different labels to the input data, we generate synthetic spectra based on our real data and let the RF learn the difference between the real spectra and the synthesized spectra. The RF is trained with a combination of the real and synthetic data to learn the covariance in the real data and the importance of different features. The synthetic data is generated by random sampling of each feature over all spectra. This gives spectra-like shapes representing the important features in the data, while also including minor details. A comparison between the real spectra and the generated synthetic spectra is shown in Figure 4. Some aligned features such as Hydrogen and Oxygen emission lines can be traced in both the real and synthetic data.
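The synthetic data generation, i.e. sampling each feature independently over all spectra, amounts to a column-wise shuffle of the data matrix. A sketch with toy data and a hypothetical function name:

```python
import numpy as np

def synthesize(real, rng):
    """Synthetic spectra built by sampling each feature (wavelength bin)
    independently over all real spectra: a per-column permutation."""
    fake = np.empty_like(real)
    for j in range(real.shape[1]):
        fake[:, j] = rng.permutation(real[:, j])
    return fake

rng = np.random.default_rng(3)
real = rng.normal(1.0, 0.1, size=(100, 50))
real[:, 20] += 5.0                    # a shared "emission line" feature
synthetic = synthesize(real, rng)
```

The marginal distribution of every wavelength bin is preserved exactly (the strong feature at column 20 survives), but the correlations between bins are destroyed, which is precisely what lets the RF separate real from synthetic.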
Figure 4: Comparison between real data and synthetic data. 1000 spectra are stacked from top to bottom with the input data points from left to right. The bottom picture is synthetic data generated from the top picture. Normalized flux values are displayed by the color map.

With the real and synthetic spectra, we prepare multiple chunks of 10000 objects via random sampling without replacement to train the Random Forest. In each chunk the objects are unique, while they can still be included in multiple chunks. We train 200 decision trees on each chunk and combine these to make up the full Random Forest. This approach improves the quality and the computational time of the algorithm, as we train multiple times on small amounts of data. The number of features used in the trees is the square root of the size of the input data, which was also used in Baron & Poznanski (2016).

After training, only the real spectra are applied to the RF to trace their similarities. The spectra propagate through all the decision trees and eventually end up in a terminal leaf, normally indicating an assigned label in supervised applications. In our application of the RF, the terminal nodes indicate if the input was either real or synthetic. This is visualized in Figure 5, where a small Random Forest is shown composed of only five decision trees. The terminal leaves have an identifying number in each decision tree, and a pair-wise similarity between two spectra can be determined by counting how often they both end up in the same terminal leaf. This value ranges from zero, indicating no similarity, to a maximum value of the number of decision trees, indicating total similarity. This value is normalized by the number of trees, and the distance score between the two objects is defined by subtracting the similarity score from 1. This yields a distance score between 0 and 1 for every pair of objects, giving a distance matrix.
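A compact sketch of the two steps just described, chunked training and the leaf-based distance matrix, using toy data and small sizes in place of the real chunks of 10000 objects and 200 trees (the trusted-tree refinement from Baron & Poznanski 2016, discussed below, is already included):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the spectra: correlated "real" data and
# column-shuffled "synthetic" data, as described in the text.
base = rng.normal(size=(300, 1))
real = base + rng.normal(0.0, 0.1, size=(300, 6))
synthetic = np.column_stack(
    [rng.permutation(real[:, j]) for j in range(real.shape[1])])
X = np.vstack([real, synthetic])
y = np.repeat([1, 0], len(real))          # 1 = real, 0 = synthetic

# Chunked training: several small forests merged into one big forest.
chunks, trees_per_chunk = 3, 50
forest = None
for _ in range(chunks):
    idx = rng.permutation(len(X))[:200]   # unique objects per chunk
    rf = RandomForestClassifier(n_estimators=trees_per_chunk,
                                max_features="sqrt").fit(X[idx], y[idx])
    if forest is None:
        forest = rf
    else:
        forest.estimators_ += rf.estimators_
forest.n_estimators = len(forest.estimators_)

# Pair-wise similarity: how often two real spectra share a terminal leaf,
# counting only "trusted" trees that classify both spectra as real.
leaves = forest.apply(real)               # (n_objects, n_trees) leaf ids
votes = np.stack([t.predict(real) for t in forest.estimators_], axis=1)
n = len(real)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        trusted = (votes[i] == 1) & (votes[j] == 1)
        sim = (np.mean(leaves[i, trusted] == leaves[j, trusted])
               if trusted.any() else 0.0)
        dist[i, j] = dist[j, i] = 1.0 - sim

scores = dist.mean(axis=1)                # one outlier score per spectrum
```

Merging the `estimators_` lists of separately trained forests is a known scikit-learn idiom for this kind of chunked training; the merged forest behaves as a single ensemble in `apply` and `predict`.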
The distance matrix contains information about the clustering of the spectra and can be used to trace a single outlier score for each spectrum by averaging over the dissimilarities.

A small improvement of the URF proposed and used in Baron & Poznanski (2016) is to only count the terminal leaves of trees that label the spectra correctly as real. Decision trees that label either of the two spectra in the pair-wise comparison as synthetic should be excluded from the similarity comparison, as they apparently cannot be trusted to identify the real features of the spectra. A very detailed description of the URF algorithm and a nice example in 2D space is given in Baron & Poznanski (2016), giving an excellent visual insight into the application of the algorithm. We implement our own code for the URF, based on the example code of the 2D example. It is fully written in Python using the scikit-learn Random Forest implementation and is run on a 64 core computer. The full algorithm can be found on our Github page.

Figure 5:
Very basic Random Forest with 5 decision trees. The applied input propagates through the decision nodes and ends up in a terminal leaf indicating whether it is either real or fake. An identifying number is assigned to each terminal leaf in each decision tree.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier
https://github.com/Formsma/GAMA-outliers

We first apply the URF algorithm on the SNR10 subset with 23120 objects. After the synthetic data generation the number of objects is doubled, giving a data input of (46240 x 8000) pixels. As stated in Baron & Poznanski (2016), dimensionality reduction of this input does not increase the quality of the outlier results. The first runs of the URF algorithm resulted in a lot of bad spectra and not very obvious outliers, so we want to explore how the URF works. Unsupervised machine learning can often end up in stirring the algorithm parameters until the result is satisfactory, but we prefer to explore how the algorithm works and reacts to different kinds of spectra.

To test the response of the algorithm, and therefore the outlier scores of the different spectra, we apply a few different versions of our data set. We first try different levels of convolution and see if the quality of the results is correlated with the amount of smoothing. Afterwards, we test different input sizes for the features of the spectra by changing the amount of interpolation. These tests only show how the URF responds to the data structure, but we also want to test its ability to find outliers. We apply the GAMA data and look in the results for the already known weird spectra reported in the GAMA database. Also, we generate fake spectra, which are inserted into our data set acting as outliers. Note that these fake spectra are different from the synthetic spectra shown in Figure 4, as the fake spectra are composed of our own invented shapes, while the synthetic spectra are derived from random sampling of our data set.
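Hypothetical examples of such hand-made fake spectra; the shapes below are our own invention for illustration and are not the exact fake spectra used in the tests:

```python
import numpy as np

grid = np.linspace(3500, 7500, 8000)      # the common rest-frame grid

# Invented outlier shapes: a spectrum with a strong line at an
# arbitrary wavelength, and one with a fringing-like sine continuum.
def gaussian_line(center, width=5.0, amplitude=10.0):
    return 1.0 + amplitude * np.exp(-0.5 * ((grid - center) / width) ** 2)

fake_line = gaussian_line(5000.0)                 # line where none belongs
fake_sine = 1.0 + 0.5 * np.sin(grid / 50.0)       # sine-shaped continuum

# Normalize like the real spectra and append to the data set before
# running the URF; a working detector should give these high scores.
fake_spectra = np.vstack([fake_line, fake_sine])
fake_spectra /= np.median(fake_spectra, axis=1, keepdims=True)
```

Because the injected spectra are known in advance, their ranks in the outlier score distribution give a direct, controlled check of the algorithm's sensitivity.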
As discussed in Section 2, the smoothing of the data is important to suppress noise on the continuum level, while keeping the emission line shapes intact. We smooth the data via convolution using a Gaussian kernel and apply the different kernel sizes shown in Figure 2 using Equation (1). We apply the kernels on the
SNR10 subset and run the resulting data sets through the URF algorithm. The distributions of the outlier scores are shown in Figure 6. In terms of differences, we can see that the larger the kernel, the smoother the score distribution and the lower the average score. This is very logical, as smoothing removes noisy fluctuations in the data, giving more similar continuum shapes. In every histogram, the main peak with the highest number of scores traces the most similar galaxies. However, we also observe one or two smaller peaks inside the distribution. After inspection, we can conclude that these smaller peaks trace the clustering of very blue and red galaxies, clustered mostly based on their continuum shape. Every distribution shows these separate peaks, except that of the g(x, 3) kernel, for which both the blue and red galaxies are found in the single main peak. This will be more useful for our search for outliers, as these spectra are mixed and ordered on other features instead of only the continuum. With these results, and the knowledge that the g(x, 3) kernel preserves the emission and absorption line structure as discussed in Section 2.2, we decide to use this kernel in all future runs of the algorithm.

Figure 6:
Distributions of different kernels used for smoothing.

The second data structure test is the variation of the input size of the data into the RF. This is altered via the interpolation done on the original spectra. The more data points there are, the better the width of features can be determined. The URF does not know that emission lines can be fitted with a Gaussian, so having many data points per emission line lets the URF learn about its width. During the project various numbers of data points were used, where we eventually settled on a total of 8000. This gives around 20 data points for a strong Hα line, while more subtle lines have at least 10 data points. By lowering the number of data points we cannot trace the width of emission and absorption features accurately. Increasing the number of data points results in increased storage space and computational time, while not increasing the quality of the results, as most points are interpolated from the observations.

With the set number of input data points, we also comment on the number of data points used in the decision trees. Not all 8000 data points are used during learning, but a selection of data points is made in each decision tree. Only data points with a high gain are used, which describes the amount of information that can be learned from each data point. The selection is done by taking the square root of the input size as the number of decision points. This is the most common method when using Random Forests, and was also used in Baron & Poznanski (2016). In our tests with varying the number of features used for decisions, we observe that using more than the default number of data points results in clustering based on the continuum.
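The `max_features` choice and the resulting per-feature importances can be sketched with scikit-learn; the toy data below is invented, with only one feature carrying the class signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy real-vs-synthetic set where only feature 3 carries the signal.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 8))
y = (X[:, 3] > 0).astype(int)

# max_features="sqrt" selects sqrt(n_features) candidate data points
# at every split, the default choice discussed in the text.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X, y)

# Normalized importance of every input data point, as plotted in Figure 7.
importance = rf.feature_importances_ / rf.feature_importances_.max()
# Feature 3 dominates; the uninformative features get low importance.
```

For the real URF the same `feature_importances_` array, one value per wavelength bin, is what reveals whether the forest splits first on the continuum or on the emission lines.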
Keeping the number of features at the square root of the input size ensures that the algorithm learns the covariance of the data well while we use a minimum number of points and avoid overfitting.

After training the URF, an interesting question is: which of the data points are used as features in the random forest, and how important are the flux values at specific wavelengths? As the URF consists of many decision trees, we cannot simply display all the splits in the trees in a figure. However, we can display the feature importance of each data point and get a simple view of the usage of the features. In Figure 7 the normalized feature importance is shown for the input data points. High values indicate a high-entropy split and are used as first splits in the trees, while low-entropy splits are used as final deciders of the class. We can see that the entropy is highest on a continuum level, while the importance is lower for the well-known emission and absorption lines. This overview shows that the initial decision is made on the continuum of the spectra, and the spectral lines decide the final classification. For the lower kernels, the importance is higher at the red end of the spectra. This is probably due to differences in the redshift of the spectra, where the ends of the spectra can have sharp cutoffs. This suggests a correlation between outlier scores and redshift, which could be observed very marginally.

Figure 7: Relative feature importance of the input features, indicating which data points are used first. Hydrogen emission lines are indicated by red lines, Oxygen emission lines by blue lines.

The last test we perform is training with the spectra divided over signal to noise ratio (SNR) bins. As most spectra have a relatively low SNR (see Fig. 1), the URF will mostly train on those spectra. As a result, the weak emission lines present in higher quality spectra will not be used as an important feature, and important information is lost.
Also, the noisy continuum of a low signal to noise ratio spectrum will have a big influence on its outlier score. By dividing the spectra into bins and training the URF on each individual bin, we expect to obtain better results in terms of important features. This method was suggested and used in Baron & Poznanski (2016) to reduce the influence of low SNR spectra.

For the
SNR10 subset, no difference was found when dividing into bins, as all the spectra were already of good quality. For the
SNR5 and
SNR2.5 subsets, the training was done including bins, as a correlation between the SNR and weirdness score could be observed. However, even with the binning of the spectra on their SNR, the URF algorithm is still biased towards the low SNR spectra. This was also found in Reis et al. (2019b), where different algorithms are compared. Nonetheless, even with a bias towards low SNR spectra, the algorithm can still find weird spectra among the high SNR spectra, as can be seen in Figure 11. The ranked distribution shows that weird galaxies are also found among the high SNR spectra.
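The per-bin training step can be sketched as follows; the bin edges, SNR values, and the `train_urf` call are hypothetical placeholders, as the thesis does not specify them:

```python
import numpy as np

rng = np.random.default_rng(1)
snr = rng.uniform(2.5, 30.0, size=1000)  # hypothetical per-spectrum SNR values

# assumed bin edges; each bin is then handled by a separate URF run so that
# low-SNR spectra do not dominate the learned features
edges = np.array([2.5, 5.0, 10.0, 20.0, np.inf])
bin_index = np.digitize(snr, edges) - 1  # bin 0..3 for each spectrum

bins = [np.where(bin_index == b)[0] for b in range(len(edges) - 1)]
# for members in bins:
#     train_urf(spectra[members])  # hypothetical per-bin training call
```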
The last tests we perform look at already known and some self-made outliers. As outlier detection is a search for the unknown unknowns, we cannot directly see whether the algorithm is producing good results using the GAMA data. By inspecting the already known outliers and inserting spectra (or spectrum-like) shapes into the data set, we can trace the outcome of different features in the data.

In the GAMA database, there is already a
COMMENT flag added to some spectra with a note or remark. These comments were added during the redshift fitting and random inspection of the spectra by the GAMA team. They trace some bad spectra, but also mention some weird spectral shapes and physical events such as AGNs. We apply the URF on the
SNR5 subset and overlay the known fringed spectra, bad splices, and other flagged spectra on top of the results.

We also make six types of fake spectra, divided into two classes. The first class is composed of real spectra which are transformed, as seen in the top three spectra of Figure 8. Spectra are flipped from left to right (LR) by rotating around the central feature, so that the rightmost feature becomes the leftmost feature. Also, some spectra are flipped top to bottom (TB) by rotating along the horizontal axis, so that emission features become absorption features. At last, we insert Gaussian features in some spectra, representing extra fake emission lines. These three types can trace the sensitivity of the URF to either the total continuum or the emission lines. The second class of fake data has nothing to do with spectra, but merely consists of some extreme cases we put in to see how the algorithm responds. The first type is a sine wave, closely resembling the fringed spectra of the data set. The other two spectra are random noise and a flat line. We generate 50 spectra of each type in both classes, resulting in a total of 300 fake spectra.

The scores of all flagged spectra, from the GAMA team and our self-made ones, are shown on the score distributions of the
SNR5 subset in Figure 9. Overall we see that the outliers are all assigned a high score, except for the added emission lines. This difference is probably due to the randomness of the location of these extra emission lines, and the fact that the other outliers are very weird based on their continuum. As the URF was trained on all the extreme outliers in the data set, the trained network has not learned all the real important features of the data. A run without these fake spectra would give better results, but this test already showed that the weird galaxies can be found.
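The six fake types described above can be generated along these lines; the wavelength grid, amplitudes, and line width are our own illustrative values, not those of the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 8000)  # normalized wavelength axis
# stand-in "real" spectrum: smooth continuum plus noise
base = 1.0 + 0.2 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.02, 8000)

# class 1: transformed real spectra
flip_lr = base[::-1]                          # left-right flip
flip_tb = 2.0 * np.median(base) - base        # top-bottom flip: emission -> absorption
line = 0.8 * np.exp(-0.5 * ((x - rng.uniform(0, 1)) / 0.002) ** 2)
extra_line = base + line                      # extra fake Gaussian emission line

# class 2: extreme non-spectral shapes
sine = 1.0 + 0.5 * np.sin(60.0 * np.pi * x)   # resembles the fringed spectra
noise = rng.normal(1.0, 0.3, 8000)            # random noise
flat = np.ones(8000)                          # flat line

fakes = np.stack([flip_lr, flip_tb, extra_line, sine, noise, flat])
```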
Figure 8:
Types of self-made fake spectra.

Panels of Figure 9: (a) Fringed (from GAMA), (b) Reduction error (from GAMA), (c) Left-right flip, (d) Top-bottom flip, (e) Extra emission lines, (f) Sine, (g) Random noise, (h) Single line.
Figure 9:
Score distribution of the objects in the
SNR5 subset. The locations of the known outliers and fake data are indicated by red lines. Known outliers in (a) and (b) are taken from the GAMA comments. For the rest of the plots, the types of fake data are shown in Figure 8.

3.1.3 Searching for the Unknown
To ensure real physical outlier results, an important step is to remove the bad spectra from the data set, as these are natural outliers and clutter the real interesting results. This also means that the algorithm does not have to learn from those outliers, so the Random Forest is only trained on real features. During the project we flagged the spectra that belong to the groups of Section 2.3 and also remove spectra with a redshift z < …. The SNR2.5 subset results are shown in Figure 11. Due to the bias towards lower signal to noise ratio spectra, finding the reason why those spectra have a high outlier score becomes increasingly more difficult, as they contain many fluctuations on a continuum level.

In Figure 12, we show a visual representation of the spectra sorted on their weirdness scores after being run through the algorithm. In each color map, the objects with the highest outlier scores are found at the bottom of the map. This can also be seen from the more chaotic spectra found there. We also observe clustering on the continuum shape of extreme blue and red galaxies, as can be seen in Figure 12. Even with extensive testing and hyperparameter optimization, this problem persisted; it will be discussed in Section 6. However, we could still find interesting outliers, which are inspected in depth in Section 4.
Figure 10:
Score distributions of objects in the different subgroups. The 100 weirdest objects are found on the right side of the red line.

Figure 11:
Relation between the distance score from the URF and the signal to noise ratio of the objects for all subsets. From top to bottom: SNR10, SNR5 and SNR2.5. Note the increased bias towards low SNR objects for the SNR2.5 subset.

Figure 12:
Normalized flux of the spectra of subsets (left to right)
SNR10, SNR5 and
SNR2.5, sorted on their outlier score in each subset. The weirdest spectra can be found at the bottom of each plot. Note the presence of clustering on extreme blue and extreme red spectra in all subsets. Due to the large number of spectra in the subsets, our 250 inspected outliers are only a single line of pixels in these plots.

3.2 Variational Autoencoder
In addition to the URF algorithm, we also experiment with the application of a reconstruction-based outlier detection technique on the spectra. This technique uses a variational autoencoder (Kingma & Welling, 2013), which can reduce data to a (very) low dimension and fully reconstruct the input. The difference between the reconstructed and input data can be used to determine an outlier score. The method was demonstrated on X-ray spectra in Ichinohe & Yamada (2019) and was reported as a good approach to find outliers. On top of outlier detection, the variational autoencoder can also be used to generate synthetic spectra using the low-dimensional representation in the trained network. Varying the input on the low-dimensional space traces the important features in the data and can be used to construct spectra. The variational autoencoder can thus be used for outlier research, while it is also very useful for dimensionality reduction and feature importance analysis.

The variational autoencoder is a variant of the normal autoencoder (Hinton G.E., 2006). An autoencoder is an unsupervised method that utilizes a neural network architecture to learn a (non-linear) encoding of data. It consists of two connected parts: an encoder that maps the input data to a low-dimensional representation, and a decoder that can fully reconstruct the original data based on this representation. As the encoded dimension is (much) lower than the original dimension, the autoencoder has to learn the important features from the data to be able to fully reconstruct the input. The simplest form of an autoencoder is a neural network of three fully connected layers, as shown in Figure 13. There is an input layer for the data, a single intermediate layer with a lower dimension than the input layer, and an output layer with the same dimension as the input layer.
The layers are fully connected, in the sense that all the neurons in each layer are connected to all neurons in the next layer via a (non-)linear function. The network is trained by adjusting the biases and weights of these functions to minimize a loss function, which is often the reconstruction error between the input and the output. The dimensions of the layers, the specific function used between the layers, and its biases and weights describe the full autoencoder. Two nice examples of the application of autoencoders in Astronomy are the inspection of properties of stellar spectra (Yang & Li, 2015) and the classification of light curves of variable stars (Tsang & Schultz, 2019). These networks are composed of multiple layers between the input and the low-dimensional layer to learn the more complex structures in the data, giving a more complex neural network than the one shown in Figure 13.
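A minimal numerical sketch of the 4-2-4 autoencoder of Figure 13, here with purely linear layers and plain gradient descent on the reconstruction error; the toy data, learning rate, and iteration count are our own choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# toy data: 200 four-dimensional samples that actually live near a 2-D subspace
basis = rng.normal(size=(2, 4))
X = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 4))

# encoder W1 (4 -> 2) and decoder W2 (2 -> 4), trained by gradient descent on the MSE
W1 = 0.1 * rng.normal(size=(4, 2))
W2 = 0.1 * rng.normal(size=(2, 4))
lr = 0.01
for _ in range(3000):
    Z = X @ W1                         # encode to the 2-D intermediate layer
    err = Z @ W2 - X                   # reconstruction residual
    gW2 = Z.T @ err / len(X)           # gradient of the loss w.r.t. the decoder
    gW1 = X.T @ (err @ W2.T) / len(X)  # gradient w.r.t. the encoder
    W1 -= lr * gW1
    W2 -= lr * gW2

# small residual: the 2-D bottleneck captures the 2-D structure of the data
mse = np.mean((X @ W1 @ W2 - X) ** 2)
```

With non-linear activations between the layers, as in the networks cited above, the same bottleneck can capture curved structures as well.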
Figure 13:
Basic autoencoder. The input layer x with dimension 4 is encoded to the intermediate layer z of dimension 2 via the connected neurons. The output layer x' is reconstructed from the lower dimension back to the dimension of the input.

Figure 14: Variational autoencoder. The intermediate layers are two parallel layers representing the mean µ and log variance σ of a Gaussian generative inference model. The decoder samples the latent vector z to reconstruct the input.

Variational autoencoders are a modified version of autoencoders in two technical ways. First, the encoder and decoder are not fully connected. Instead of encoding the input values to a lower dimension, the input gets mapped onto a distribution, and the decoder samples from this distribution to reconstruct the input. This gives the neural network architecture a Bayesian modeling approach, where the best model is learned for the best representation of the data. The second modification is that the loss of the network is not solely based on the reconstruction error, but also includes a loss term that restrains the mapped distribution of the low-dimensional latent variables. The addition of the mapped distribution means that the layout of the variational autoencoder differs from the normal autoencoder, as can be seen in Figure 14.

The variational autoencoder consists of two coupled independent models: a recognition model as an encoder and a generative model as a decoder (Kingma & Welling, 2019). The models work together to compress the input into a lower latent dimension. The latent dimension consists of latent variables, which are unobserved variables that represent the observed data. For example, a physical unobserved variable in the data could be the star formation phase of a galaxy, learned from the ratios of the different emission lines. To relate the observed data to the unobserved latent variables, the decoder tries to model the underlying processes in the data via the model p_θ(x).
This is done via the marginal likelihood

    p_θ(x) = ∫ p_θ(x, z) dz,    (2)

where p_θ(x, z) describes the decoder in the variational autoencoder and z denotes the latent variables. The structure of the decoder is described as

    p_θ(x, z) = p_θ(z) p_θ(x | z),    (3)

with p_θ(z) the prior distribution of the latent variables. For the prior we use a Gaussian latent space. This is a simple model, yet it works very well in most applications. An overview of deeper generative models is provided in Kingma & Welling (2019).

Figure 15: Bayesian representation of the variational autoencoder.

The encoder is an inference model q_φ(z | x) that compresses the input data into two variables µ and σ, which represent the mean and log variance for the Gaussian inference model in the decoder. The variational autoencoder is trained by minimizing the following loss function, the evidence lower bound (ELBO):

    L_vae(φ, θ, x) = L(x, x') + D_KL(q_φ(z | x) || p(z)).    (4)

The first part of this loss function is the normal reconstruction loss of the autoencoder. The second part is the Kullback–Leibler (KL, Kullback & Leibler, 1951) divergence of the variational autoencoder, which traces two distances. By definition it traces the divergence between the approximate posterior and the true posterior of the data, representing how well the model can approximate the data. Additionally, it traces the difference between the encoder and decoder by comparing the encoder q_φ(z | x) with the expected input for the decoder p_θ(z). A full Bayesian representation of the variational autoencoder is shown in Figure 15.

The decoder is a probabilistic inference model, sampling the latent vector z from the encoded means µ and log variance σ.
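This sampling step is commonly written via the reparameterization trick; the NumPy function below is our own illustration of the operation the decoder input goes through, not code from the thesis:

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Writing the sample this way keeps the randomness in eps, so the network
    weights that produce mu and log_var remain differentiable during training.
    """
    eps = rng.normal(size=np.shape(mu))
    return mu + np.exp(0.5 * np.asarray(log_var)) * eps
```

Drawing many samples for a fixed mu and log_var recovers the encoded Gaussian.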
The sampling means that the reconstructed data can be drawn from a continuous distribution, resulting in the ability to interpolate between the learned features. For example, if a variational autoencoder is trained on galactic spectra, then the latent dimension can be used to generate synthetic spectra of any shape by choosing specific values as input. In the ideal case, the latent variables would represent real physical parameters of galaxies that can be tuned, but that requires a very complex network. Just as with the normal autoencoder, if the variational autoencoder is applied to high-dimensional data, then there are multiple layers in the encoder and decoder to learn the complex patterns in the data.

We use a modified version of the variational autoencoder called the Information Maximizing Variational Autoencoder (InfoVAE, Zhao et al., 2017). This method was used as an initial demonstration of variational autoencoders on galaxy spectra in Portillo et al. (2020) and showed good results in greatly reducing the dimension of the spectra to only 6 variables and finding a few outliers. InfoVAE addresses two potential problems with the application of variational autoencoders on big data problems. First, the KL divergence is not strong enough to map the different kinds of spectra to a representative distribution. This is especially a problem if the input dimension is much larger than the latent space dimension, as in our case. Also, the KL divergence term does not take into account that, for a complex network, the encoder can always match the prior of the decoder. This means that the network has not learned any meaningful latent variables that describe the features of the data, and will result in similar findings as a normal autoencoder.

The modifications of InfoVAE are applied via adjustments of the ELBO function. An extra term, the Maximum Mean Discrepancy (MMD), is added with additional weighting factors.
This term is based on the idea that two distributions are identical if all their moments are the same. It compares the outcome of the encoder q_φ(z) with the expected distribution p_θ(z) and forces them to be similar via the training of the variational autoencoder. The exact definitions of D_KL and D_MMD as applied in our algorithm are thoroughly described in Kingma & Welling (2019) and Zhao et al. (2017), and are briefly shown in Appendix B. The full ELBO function to minimize is given by

    L_InfoVAE(φ, θ, x, α, λ) = L(x, x') + (1 − α) D_KL(q_φ(z | x) || p_θ(z)) + (α + λ − 1) D_MMD(q_φ(z) || p_θ(z)),    (5)

where the exact values of the weighting factors α and λ have to be found via hyperparameter optimization. As recommended in Zhao et al. (2017) we will use α = 0, which means that the KL divergence term is the same as in (4). For λ we test different values in the range λ = 5−15 to search for the best latent variable representation.
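The MMD term can be estimated from samples with a kernel; the sketch below uses a Gaussian (RBF) kernel and the simple biased estimator, with the bandwidth sigma as an assumed free parameter:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # pairwise Gaussian kernel values between the rows of a and b
    d2 = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd(a, b, sigma=1.0):
    """Maximum Mean Discrepancy between samples a ~ q(z) and b ~ p(z).

    Zero when the two sample sets coincide; positive when the
    kernel-weighted moments of the two distributions differ.
    """
    return (rbf_kernel(a, a, sigma).mean() + rbf_kernel(b, b, sigma).mean()
            - 2.0 * rbf_kernel(a, b, sigma).mean())
```

During training, one batch would be drawn from the N(0, I) prior and the other from the encoder output, and the weighted mmd value added to the loss.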
We train the InfoVAE with the
SNR5 subset that was also used for the URF implementation. This subset contains around 63839 good spectra with a high signal to noise ratio, giving a big data set with high quality data for this experimental method. We test two different networks, one with an input dimension of 8000 points as used with the URF, and one with an input dimension of 1000 points as used in Portillo et al. (2020). Lowering the input size reduces the information put into the network, but greatly improves the training speed. We compare the reconstruction capability of the networks and check whether they can be used for outlier detection.

The neural network structure in the variational autoencoder is very important for the performance of the models and has to be carefully tuned to get the best results. We tested different networks with multiple layers of varying sizes. The lowest reconstruction errors were found with three fully connected layers with the dimensions 1000-500-100, converting the 8000 input points to 6 latent variables. Even with the low number of 6 latent variables, galactic spectra can be fully reconstructed with low errors, as shown in Portillo et al. (2020). The network parameters used in their work are very similar to our optimal network, suggesting that we also have a good network. For the network with only 1000 input points the network structure is the same, but the first encoding layer is used as the input layer, omitting the first encoding step.

We build the neural network with Python using Keras, a high-level API for the Tensorflow 2 platform. An overview of the full network is shown in Figure 41 in Appendix C. The layers are connected with the non-linear ReLU activation function (Nair & Hinton, 2010), which can learn the non-linear relationships between the input points. The output layer uses a linear activation, as some spectra can have negative values due to normalization. The network is trained by minimizing the loss function shown in (5), using the built-in Adam optimizer with default values.
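The layer layout described above (8000 → 1000 → 500 → 100 → 6 and back) can be sketched in Keras as follows. This is our own minimal reconstruction of the architecture, with only the reconstruction loss wired in; the KL and MMD terms of Equation (5) would be added as extra loss terms:

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 6

def sampling(args):
    # reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
    mu, log_var = args
    eps = tf.random.normal(shape=tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

inp = layers.Input(shape=(8000,))
h = layers.Dense(1000, activation="relu")(inp)
h = layers.Dense(500, activation="relu")(h)
h = layers.Dense(100, activation="relu")(h)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)
z = layers.Lambda(sampling)([z_mean, z_log_var])

h = layers.Dense(100, activation="relu")(z)
h = layers.Dense(500, activation="relu")(h)
h = layers.Dense(1000, activation="relu")(h)
out = layers.Dense(8000, activation="linear")(h)  # linear: spectra can be negative

vae = tf.keras.Model(inp, out)
vae.compile(optimizer="adam", loss="mse")  # reconstruction part only in this sketch
```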
The code can be found on the Github page. Without the correct scaling parameters, the MMD loss dominates the loss function by a few orders of magnitude, prohibiting the neural network from learning anything from the data. The best performance was found without using the MMD loss during training. This differs from what is reported in Portillo et al. (2020), but as they did not show the loss minimization plots during training, we cannot make a direct comparison between the performances.

Figure 16:
Loss minimization during training for both the training and validation data
The trained variational autoencoder consists of two models, the recognition model and the generative model. We can split these models and use one at a time, either converting the spectra to their latent space representation using the encoder, or generating synthetic spectra from (random) latent variables using the decoder. We first inspect the encoded latent space representation of the data set to see whether the network has learned features in the data. All the spectra are converted to their 6-dimensional latent space vectors, where ideally each variable or a combination of variables traces the features in the data. We compute the mean and variance of each individual latent variable and inspect the distribution of the values. In Figure 17 we show the distribution of latent space values for all spectra. If the distribution of a latent variable is similar to the normal distribution N(0, I) of the prior, then the latent variable represents important features of the data set. If it is not similar, it is either not used for reconstruction, or used as a support variable for a complex relationship in the data. For 4 of the 6 latent variables, we see a good or similar shape of the distribution of the latent variable, indicating learned features.

Figure 17: Distribution of the values of each latent variable. If the distribution is similar to the prior of the Gaussian inference model, then the latent variable is used to trace important features.

We generate synthetic data using the information from the distributions and apply it to the decoder. If we take the mean of the values for each latent variable and decode it to a spectrum, the decoded spectrum should represent the most common type of spectrum found in the data. By tuning a single latent variable at a time using the variance of the distribution, we can see the features that each individual latent variable traces.
In Figure 18 we show spectra generated from the latent space by changing only a single variable at a time and using the mean values of the other latent variables. The value of the varied latent variable is shown in the individual plots. This method of displaying the latent variable changes is taken directly from Portillo et al. (2020), so we can compare the different features found by our network and theirs.

For 3 of our latent variables, we can easily see which features of the spectra depend on the values of the latent variables, as line intensities or continuum shapes vary a lot. The other 3 do not show major differences at first sight, but show more subtle differences in absolute values on a continuum level. For one of these latent variables the difference between the synthetic spectra is difficult to see in the representation of Figure 18, but the continuum is either slightly flattened or curved depending on the value. For the other two we see a slight difference in the ratios between different emission lines and different heights of the continuum flux.

From both Figure 17 and Figure 18 we can see that the latent variables whose distributions match the prior show the most obvious features when they are varied. For example, the latent variable that matches the prior perfectly shows significant variation in the emission line strengths and continuum shape. Unfortunately, we could not train the network such that all 6 latent variables match the prior. As shown in Figure 16 and discussed in 3.2.1, the MMD loss should have helped to force the distributions to match the prior, but this only resulted in worse performance when applied in our algorithm. In Portillo et al. (2020), all latent variables trace more obvious features such as line broadening or very big changes in line emission ratios. Unfortunately, even with extensive comparison between our method and theirs, we could not reproduce this.
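The single-variable scans described above amount to the following procedure; `decode` stands for the trained generative half (a hypothetical callable mapping a latent vector to a spectrum), and the ±2σ scan range is our own choice:

```python
import numpy as np

def scan_latent(decode, mu, sigma, index, n_steps=5):
    """Decode spectra while varying one latent variable around the data-set mean.

    decode : callable mapping a latent vector of shape (6,) to a spectrum
    mu     : mean latent vector over all spectra
    sigma  : per-variable standard deviation of the latent values
    index  : which latent variable to vary
    """
    values = np.linspace(mu[index] - 2 * sigma[index],
                         mu[index] + 2 * sigma[index], n_steps)
    spectra = []
    for v in values:
        z = mu.copy()
        z[index] = v  # all other latent variables stay at their mean
        spectra.append(decode(z))
    return np.array(spectra)
```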
Figure 18: Synthetic spectra generated from the trained variational autoencoder. In each plot, we change only a single latent variable at a time. This shows the relation between the inference model and the latent space. Note that this is a representation similar to Figure 14 in Portillo et al. (2020), allowing a comparison of the performances of the networks.

In Figure 19 we cross-correlate two of the latent variables. As shown, with only these two latent variables we can already represent a good range of types of spectra, from blue galaxies to red galaxies, both including and excluding emission lines. All 6 latent variables trace one or multiple common features of the spectra, and combining them lets us generate most types of galaxies found in our GAMA subset. As outliers have either uncommon features or weird lines, we are confident that the variational autoencoder cannot reconstruct those correctly, resulting in high outlier scores.

Figure 19:
Synthetic spectra generated from the trained variational autoencoder, changing multiple latent variables. With only two latent variables a huge range of spectra can be generated. Note that no y-axis values are given, as the absolute flux values are not real. We only want to show the different shapes and features of the spectra using these two latent variables.

3.2.3 Reconstruction Capability
We have shown that the variational autoencoder is capable of generating synthetic data given only 6 parameters. However, it is also important to check whether the spectra are reconstructed correctly. This will also be used for outlier detection, as uncommon features are not learned by the neural network. We will look at the reconstruction of a few selected spectra, with either very common or uncommon shapes. We also compare the reconstruction capability between the networks with 8000 and 1000 input points.

First, we look at three spectra with clear emission lines, to see whether those are reconstructed correctly. In Figure 20 the spectra are shown together with their reconstructions by the variational autoencoders with inputs of 8000 or 1000 data points. From top to bottom we show increasingly complex emission lines. Overall, the continuum is traced very well for each spectrum. Spectrum
G12 Y1 DX1 211 has high emission in the most common star formation lines and shows good results in the reconstructed version. In
G09 Y4 249 125 the Balmer emission lines are broadened due to activity in the galaxy. The reconstructed version can trace some of the broadening of the lines, but does not reconstruct it completely correctly. For
G09 Y1 DS2 065, the additional complex structure at the Balmer lines is not reconstructed at all, giving a high reconstruction error. The more complex the emission lines, the larger the reconstruction error, tracing a possible outlier.

In Figure 21 we show the spectra of a very 'normal' galaxy and a blended system in which an M-dwarf sits in front of a galaxy. Spectrum
G12 Y1 DS1 268 is reconstructed very accurately, with only very small fluctuations on the continuum level being missed. On the other hand, spectrum
G15 Y2 003 165 shows a big difference between the input and the reconstruction. This is due to the foreground spectrum of the M-dwarf interfering with the flux of the galaxy. For all blends, as the full spectrum is redshifted using the emission lines of the background galaxy, the features of the M-dwarf will always be in seemingly random locations.
Figure 20:
Reconstruction capability of the trained variational autoencoder for increasingly complex Balmer lines.

Figure 21:
Reconstruction capability of the trained network for a blended object. The top spectrum is a very good reconstruction of a single galaxy, while the bottom spectrum is a galaxy with an M-dwarf star in the foreground. While the x-axes indicate the rest-frame wavelength, the foreground flux of the M-dwarf is shifted by the redshift of the background galaxy.

These examples of uncommon spectra show that the trained variational autoencoder can reconstruct common features in the data very accurately, but fails to reconstruct extra complex structures or blended objects. The reconstruction errors of these objects are higher than those for normal spectra, giving us great confidence in the usage of the variational autoencoder in the search for outliers. Comparing the reconstructed spectra from the variational autoencoders with 8000 and 1000 input points, we do not see any major differences in reconstruction errors. On closer inspection, we do see an increased performance of the 8000-input network at the heights of the emission lines. We could show many more examples of badly reconstructed spectra, but we refer the reader to Section 4, in which the types of outliers are discussed.
The variational autoencoder is used as a reconstruction-based outlier detection method. This uses the fact that it has learned the important features of the majority of the data, while lacking knowledge of the weird shapes or features found in outliers. Besides computing the reconstruction error, we also look at the latent space representation of all spectra. This low-dimensional latent space contains, just as the distance matrix of the URF method, important information about the spectra. We can use this information to find outliers inside the latent space, by looking at points far from the mean of the distribution.

We start by looking at the reconstruction error of the spectra, defined by a function that computes the differences between the input flux values and the reconstructed flux values. Many existing metrics can be used, but one could also construct one's own. We use a metric comparable to the one used in Ichinohe & Yamada (2019), where the outlier score is defined via a logarithmic chi-square statistic.

(a) Outlier scores computed via the reconstruction error of Equation (6). (b)
Outlier scores computed from the latent variables using the LocalOutlierFactor method.
Figure 22:
Outlier score distributions based on the output of the variational autoencoder. The 100 weirdest objects are found on the right side of the red lines.

We use a slightly modified version of their metric, omitting the logarithmic scaling, and apply

    median{ ((I − R) / √I)² },    (6)

on the reconstructed spectra of the SNR5 subset as the outlier score, where I is the input and R the reconstructed output. Another metric suggested in Ichinohe & Yamada (2019) is the maximum difference between the reconstruction and input, max(|R − I|), which can trace a single high emission line that is not correctly reconstructed. Application of this simpler metric on our data yielded outliers consisting almost exclusively of spectra with a single extremely high emission line, most probably originating from bad reduction or a cosmic ray event. We normalize the output of (6) such that weird spectra are assigned a score close to 1. The score distribution is shown in Figure 22a, which shows a high peak of spectra with a similar reconstruction error and a tail of outlying spectra with a high reconstruction error, indicating the outliers. Just as with the URF, we also look at the relationship between the outlier scores and the signal to noise ratio of the spectra to investigate any bias. In Figure 23 we can see that there is no bias towards lower signal to noise ratio spectra, as was found with the URF method.

Figure 23:
Relation between the reconstruction outlier score and the signal to noise ratio of the spectra. In comparison to the URF, we do not see any bias towards low signal to noise ratio spectra.

We also look at the latent space variables that trace the encoded features of the input spectra. As the variational autoencoder is not trained to encode uncommon or weird features correctly, spectra containing those can end up with very different latent space vectors in comparison to the normal spectra. Using the latent space variables to find outliers is computationally beneficial, as not all the spectra have to be reconstructed. However, the latent space has to be explored with a simple unsupervised clustering method instead, which adds to the computational time. A second disadvantage is that we cannot see why a spectrum is assigned a high outlier score, so reconstruction is still necessary in the end. We still want to explore this method to examine the capabilities of outlier detection on spectra in an encoded space.

In Portillo et al. (2020), they briefly suggested and used the unsupervised outlier detection method LocalOutlierFactor (LOF, Breunig et al., 2000), an easy-to-use algorithm found in the scikit-learn package in Python. In short, the method computes the local density of objects by looking at their k nearest neighbors. For each object, its local density is compared to that of its neighbors, and a score is assigned based on this difference. The Python-based method outputs a negative outlier factor for each object. We define our outlier score as log(−1 · NOF), where NOF is the negative outlier factor, and get the score distribution shown in Figure 22b. Normal objects are found at an outlier score of 0, while outlying objects have higher values. Similar to the reconstruction score, a high peak of normal galaxies is found with a tail of outlying galaxies.
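Both score definitions can be sketched together on toy data; the array names, shapes, and the base-10 logarithm are our own assumptions, not the thesis code:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)

# toy stand-ins: I = input spectra, R = reconstructions, Z = 6-D latent vectors
I = 1.0 + np.abs(rng.normal(0.0, 0.1, size=(500, 200)))
R = I + rng.normal(0.0, 0.05, size=I.shape)
Z = rng.normal(size=(500, 6))
R[0] += 2.0  # one badly reconstructed object ...
Z[0] += 8.0  # ... that is also isolated in latent space

# reconstruction-based score, Equation (6): median of ((I - R)/sqrt(I))^2 per spectrum
recon_score = np.median(((I - R) / np.sqrt(I)) ** 2, axis=1)
recon_score /= recon_score.max()  # weirdest spectra score close to 1

# latent-space score: log of -1 times the negative outlier factor from LOF
lof = LocalOutlierFactor(n_neighbors=20).fit(Z)
latent_score = np.log10(-1.0 * lof.negative_outlier_factor_)
```

Normal objects have a negative outlier factor near −1, so their latent-space score sits near 0, while isolated objects get clearly positive values.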
The relation between the LOF score and the signal to noise ratio of each spectrum is shown in Figure 24, where again no correlation is found between the outliers and their quality.

We show the spectra sorted on their outlier score for both the reconstruction-based method and the latent space method in Figure 25. In the overview, we see no obvious clustering of similar types of spectra as was found in the overview in Figure 12. The main reason for this difference is that the URF algorithm is distance-based, using pair-wise distances of the spectra as the basis for the outlier scores. The variational autoencoder reconstructs all spectra individually, lacking any pair-wise information, resulting in less clustering of similar spectra.
Figure 24:
Relation between the LOF score and the signal to noise ratio of the spectra. In comparison to the URF, we do not see any bias towards low signal to noise ratio spectra.

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor

Figure 25: Normalized flux of the spectra of subset
SNR5 sorted on their outlier score. The weirdest spectra with the highest outlier scores are found at the bottom of the maps.
Left: sorted on outlier score determined via reconstruction error.
Right: sorted on outlier score determined via LocalOutlierFactor.

At last, we show the correlation between the outlier scores computed from the latent space vectors and the outlier scores computed from the reconstruction errors in Figure 26. While both methods find their own types of outliers, we do see an overlap of outliers in the grey-colored area. We also see that there is not a very strong correlation between the scores. This suggests that, while the latent space vector represents the important features of the spectra, the decoder adds another dependency to the types of outliers we find. Also, while the encoder will map known features to a specific latent space vector, uncommon or weird features might be mapped to a random value as they are unknown to the encoder. Another possible reason for the difference is the specific metric we used to compute the outlier scores. We inspect the types of outliers found by both methods in Section 4.

Figure 26:
Correlation between outlier score methods for the variational autoencoder.
4 Outliers
We inspect the spectra that have the highest outlier scores. The primary search for outliers was done with the URF algorithm, so we inspect those first and compare them with the results of the variational autoencoder afterward. As we used multiple subsets, we have to define a method to find the weirdest spectra in the GAMA survey. First, we look at the highest quality subset
SNR10 and inspect the weirdest 100 spectra. Afterward, we inspect the
SNR5 subset for another 100 weird spectra we did not find in the
SNR10 subset. At last, we inspect the
SNR2.5 subset, in which we had a lot of difficulty finding out why some spectra were assigned a high score. Therefore, we only inspect the weirdest 50 spectra from this lowest quality subset. This gives a total of 250 unique weird spectra when combined. We will discuss these decisions in Section 6. We expect to find many outliers that can be explained, but hope to also find interesting outliers that we do not understand. There are definitely more interesting outliers in the GAMA survey that we do not inspect in this Section, but these are up to the reader to find, as we provide a full list of the outlier scores on the Github page.
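The subset combination described above can be sketched as follows; the identifier and score arrays are placeholders for the per-subset output of the outlier algorithm, not the thesis code:

```python
import numpy as np

def combine_outliers(subsets, n_per_subset):
    """Walk through the subsets from highest to lowest quality and
    collect, per subset, the top-n weirdest spectra that were not
    already selected from an earlier (higher-quality) subset."""
    selected, seen = [], set()
    for (ids, scores), n in zip(subsets, n_per_subset):
        order = np.argsort(scores)[::-1]  # highest (weirdest) score first
        new = [ids[i] for i in order if ids[i] not in seen][:n]
        selected.extend(new)
        seen.update(new)
    return selected
```

With `n_per_subset = [100, 100, 50]` applied to the SNR10, SNR5, and SNR2.5 score lists, this yields the 250 unique spectra inspected here.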
Before diving into the outliers, we note that a lot of the spectra are weird due to instrumental or reduction errors. These spectra naturally came up with the highest weirdness scores and had to be removed manually to ensure the remaining outliers are interesting. This iterative procedure resulted in a total of 2926 spectra that showed fringing, not earlier reported by the GAMA team. Moreover, 1283 spectra showed a bad splice in the continuum and were unusable, also not earlier reported. Including the ND1 field and (sky) reduction errors, we conclude that at least 4.5% of the AAOmega spectra in the GAMA survey cannot be used for big data projects, unless a reliable method can be developed to correct all the spectra. This percentage is slightly higher than earlier reported in Hopkins et al. (2013). We attempted to defringe the fringed spectra, but this introduced additional artifacts in the data due to the differences between the fringes. The full list of bad spectra is provided on the Github page to be used in other projects.

It can be very difficult to determine why a spectrum was assigned a high outlier score, as it might not be obvious why the algorithm has set a specific score for a spectrum. We try our best to investigate what makes the outlying spectra interesting and comment on their presence in the literature using SIMBAD (Wenger et al., 2000). First, we classify a great number of spectra via visual inspection, as there are obvious features like extreme emission lines or very broad features. For the more difficult spectra, we look in-depth at the flux values of the known emission lines, additional emission lines, or other weird features in the continuum. We find many similar types of spectra as in Baron & Poznanski (2016) and the classification groups will be very similar. The different types of outliers are extensively described in the next parts and tables of the inspected spectra are provided in Appendix D.
Overall, we divide the spectra into 4 groups that define the main characteristics of the spectra. The first group is composed of spectra with unusual velocity structures or broadened emission lines, tracing different active processes in the galaxies. The second group consists of spectra with extremely strong emission lines, uncommon emission lines, or unusual emission line ratios. Some spectra show characteristics of two object types, like a star in front of a galaxy, and are grouped as blends. The remaining spectra are grouped mainly on their unusual continuum shapes, or contain weird non-physical features.

Figure 27:
Three types of spectra with unusual features from kinematic processes inside the galaxy.
G09 Y4 226 135 is a QSO at relatively high redshift showing broad Ly-α and Carbon emission. The spectra G09 Y4 215 016 and
G09 Y1 DS2 065 show broadening of, or additional structure at, the Balmer lines, tracing active processes in the galaxy.
The first and most obvious outliers are due to different kinematic processes in the galaxies giving rise to unusual velocity structures or line broadening. These are the easiest to inspect, as there is clearly some additional structure in an otherwise normal spectrum. Among these spectra, we find many Quasi-Stellar Objects (QSOs), different Active Galactic Nuclei (AGNs), and a few spectra that show minor broadening in their emission lines.

In total, we have 50 spectra with features that can be traced back to different kinematic processes going on in the galaxies. We find 14 Quasi-Stellar Objects at relatively high redshift with Carbon and Silicon emission, of which 7 are Lyman-α emitters. These high redshift spectra are actually not found due to the emission lines, but are assigned a high outlier score as they are a flat line in our interpolated wavelength range. Of the spectra in our wavelength range, we find 27 spectra with additional (complex) structure at either Hα or all the Balmer lines. At last, 9 objects have slight broadening at the most common emission lines.

The 14 high redshift QSOs are easily found, as they fall outside of our interpolation range. Note that the term 'high redshift' is only relative to the other objects in the GAMA survey. These objects could also be found by looking at the redshift values, but the GAMA survey provides no information about the flux of the emission lines in these spectra and classification still has to be done via visual inspection. These spectra show broad emission lines for Carbon, while the spectra at slightly higher redshift also show broad Silicon, Lyman-α, and even Lyman-γ emission. Out of these 14, only two were earlier reported in the 2dF QSO survey (Croom et al., 2004) and 9 are flagged as QSO in the GAMA database.
We show an example of the QSO spectrum G09 Y4 226 135 in Figure 27, in which we can easily see the broad emission lines.

In addition to high redshift QSOs, we find 27 spectra with unusual velocity structure features originating from active galaxies. Of these spectra, 19 have extreme broadening of the Balmer lines and 8 show a more complex structure around the Balmer lines. Two examples of broadening and an asymmetric additional structure at the Balmer lines are shown in Figure 27. The extreme broadening of Balmer lines is due to active galactic nuclei, of which most spectra are of Seyfert-1 galaxies. In addition to the broadened Balmer lines, we also observe other velocity structures in the continuum of
G09 Y4 215 016. These cannot directly be traced to specific emission lines and probably trace outflows of the active galaxy. The complex structure in
G09 Y1 DS2 065 is found at both the Hα and Hβ lines, indicating a relation between the structure and the Balmer lines. A different spectrum with this structure was also found in Baron & Poznanski (2016) and was modeled to be a combination of both broad Balmer emission seen in Seyfert-1 galaxies, and Balmer absorption in supernova ejecta (Faran et al., 2014). Most of the spectra are already reported in the quasar catalogs of Rakshit et al. (2017) and Toba et al. (2014), but 9 of these spectra have either no notable references or are classified as stars.

A special mention is the spectrum G09 Y6 090 043, which showed a broad emission structure at the MgII emission line, not recognized by GAMA. This line is often very variable (Homan et al., 2019), so this spectrum could be an example of an observation of an event that only exists on a short timescale. However, as there are no other spectra of this galaxy in other surveys, this variability cannot be checked.

At last, we find 9 spectra with a slight broadening of emission lines or minor additional structure at the Hα and NII lines. The spectrum G12 Y1 DS2 080 showed minor broadening of high emission lines, tracing star formation and some activity. This spectrum is of a rare type called
Green Bean
Galaxies and is one of the 14 spectra studied in Prescott & Sanderson (2019). These Green Beans have a very blue continuum, making them a subtype of Type 2 AGNs.
Figure 28:
Three types of spectra classified based on their emission lines. From top to bottom: extreme star formation and BPT outlier, Iron emission line presence, and a post-starburst spectrum. Emission lines: red = Hydrogen, blue = Oxygen, orange = Helium, grey = Iron.

4.1.2 Emission Lines
Most of the outliers show a range of different emission lines with different intensities, but do not show any obvious unusual velocity structures. This large group is composed of spectra that have very strong or uncommon emission lines, including spectra with unusual emission line ratios tracing the different star formation phases. We find 114 spectra with very high emission lines of the common elements: Hydrogen, Helium, Oxygen, Nitrogen, and Sulfur. Of these, 29 are outliers on the BPT diagram (Baldwin et al., 1981) and can be classified via their location on the diagram. Another interesting type of galaxy in this group is the post-starburst (E+A) galaxy. The spectra of these 8 galaxies show Hδ absorption tracing recent star formation, sometimes still showing ongoing star formation via strong Hα emission. At last, we have 4 galaxies showing either uncommon or unknown emission lines in their spectra.

Many spectra show signs of star formation, as seen by the dominating strong Balmer emission lines present in the spectra. Additional to the Balmer lines, there are often many other emission lines. This often makes the continuum look like a flat line, as can be seen in spectrum G02 Y5 114 291 in Figure 28. The heights and ratios of all these emission lines can be used to study the processes in the galaxies and trace the star formation phase or presence of active galaxies. We compute the ratios of a few emission lines and plot them on the BPT diagram. The BPT diagram is a popular method to classify emission line spectra, as it can classify spectra very well based on simple observations. We use the provided emission line fluxes from the GAMA survey where available and compute the line emission ratios of NII/Hα against OIII/Hβ. In Figure 29 we show the line emission ratios of all weird objects that belong to our kinematics and emission line groups. In the background, the distribution of the emission line ratios of all GAMA spectra is shown as a comparison.
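This line-ratio classification can be sketched as below (a simplified stand-in for the thesis code); the demarcation curves are the published Kewley et al. (2001) and Kauffmann et al. (2003) relations, and the region between the two curves is commonly labeled composite:

```python
import numpy as np

def bpt_class(nii, halpha, oiii, hbeta):
    """Classify a single spectrum on the BPT diagram from its
    NII[6584], Halpha, OIII[5007] and Hbeta line fluxes."""
    x = np.log10(nii / halpha)
    y = np.log10(oiii / hbeta)
    # Kauffmann et al. (2003) empirical star-formation boundary
    if x < 0.05 and y < 0.61 / (x - 0.05) + 1.3:
        return "star-forming"
    # Kewley et al. (2001) theoretical maximum-starburst line
    if x < 0.47 and y < 0.61 / (x - 0.47) + 1.19:
        return "composite"
    return "AGN"
```

Points below the Kauffmann curve fall on the star formation track, points above the Kewley curve in the AGN region, matching the division shown in Figure 29.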
For some objects, we have no emission line information, as they fall outside of the observed wavelength range due to high redshift.

Figure 29:
BPT diagram of the ratio NII[6548]/Hα versus OIII/Hβ. Both axes are in log space and the background is composed of the distribution of these ratios for all GAMA spectra. Two theoretical lines (Kewley et al., 2001 and Kauffmann et al., 2003) probe the division between the star formation region and active galactic nuclei. The colored points are spectra from the groups described in Section 4.1.1 (red) and Section 4.1.2 (green).

Many of the high emission line spectra, shown as green in Figure 29, sit along the star formation track. The star formation track and AGN area in the BPT plot are divided by the purple (Kauffmann et al., 2003) and orange (Kewley et al., 2001) lines originating from earlier studies on galaxies. As the points in the BPT diagram indicate the classes we assigned to them, we can see that some of them are wrong, as a few green points lie within the AGN region. This shows that some of the spectra we initially grouped by eye on the strong emission lines also have active processes in the galaxy. In total, there are 29 spectra that are at the extreme ends of the BPT diagram and can be considered weird as they have very extreme line ratios. Only 10 of the spectra with strong emission lines are classified as either an emission-line Galaxy or HII Galaxy in the literature.

A special type of spectrum not captured by the BPT diagram, but one with interesting emission line ratios, originates from post-starburst galaxies. These spectra have a strong Hδ absorption line, which is an indication of an A-star dominated galaxy. These galaxies are often called E+A galaxies, as it was originally thought that this type is only composed of Elliptical galaxies with A-type stars. Models suggest that this absorption is only possible for galaxies with a recent burst of star formation that now undergo either no, or passive, evolution of star formation (Goto et al., 2003).
We can differentiate two sub-types by looking at the presence of Hα and Oxygen emission lines tracing ongoing star formation. We find 8 spectra with strong Hδ absorption tracing a recent burst of star formation. Of these, 3 spectra show ongoing star formation, as emission lines are present. An example of an E+A galaxy, spectrum G15 Y4 210 008, is shown in Figure 28.

In addition to spectra with high emission lines, we also find 17 spectra with either weird line ratios or additional unknown lines. Of these 17 spectra, 10 lack any emission line other than the Hα-NII structure. We also find 3 spectra that have a complex NII emission structure dominating over Hα. Unfortunately, these spectra do not show up on the BPT diagram in Figure 29, as there is no flux information for these spectra due to the noisy additional structure. The last 4 spectra include uncommon or weird single emission lines, and we look at these individually.

Spectrum G09 Y1 CN2 290 has numerous Iron lines along its broad emission lines. The Iron lines are present at 5303Å, 5722Å, 6088Å, and 6376Å, as can be seen in Figure 28. These coronal lines originate from gas exposed to X-ray radiation and are investigated in Wang et al. (2012). The Iron lines fade out over a time scale of years, suggesting a transient nature. This is again a nice example of an observation of an event on a short time scale.

In both the spectra
G15 Y1 CS2 180 and
G12 Y1 GND1 125 we observe a single unusually strong emission line which aligns with an Iron line. The spectra are shown in Figure 30, where the single emission lines sit at 5304Å and 5435Å, tracing FeXIV and FeI respectively. In contrast to spectrum
G09 Y1 CN2 290, there is only a single Iron line, suggesting either a different origin of the emission or absorption of the other emission lines along the line of sight. Also, the FeI line is very easy to ionize and will only be visible in very cool regions, making it a very weird observation.

Figure 30:
Two spectra with unusual single emission lines.

The last spectrum with an uncommon line is
G12 Y1 HND1 169, which shows strong NI[5200] emission along the usual emission lines. This line is unusual, as atomic Nitrogen is easy to ionize and therefore not often observed. It can only exist in cool regions inside the galaxy shielded from ionizing sources. None of the spectra with unusual extra lines have been mentioned or investigated in the literature. As we primarily search for the outliers, we will not elaborate on the sources of these unusual lines.
We find 8 spectra that show features of two different objects. Almost all of the blends are composed of a foreground star in front of a galaxy. There are 6 spectra with an M-dwarf and 1 with an A-type star in front of a galaxy. The spectrum of an M-dwarf can easily be recognized via its characteristic features. In Figure 31 we show the spectrum
G15 Y2 003 165, redshifted based on the emission lines of the background galaxy. In spectrum
G09 Y5 018 163 (not shown here), an A-type star is found in front of a galaxy, showing very blue emission, stellar absorption lines, and galactic emission lines. Visual inspection at the coordinates of the blended spectra gives nice pictures of the stars in front of the galaxies. For the spectrum in Figure 31, the target object was actually the foreground star itself, as can be seen by the central marker in the image. At last, we also find 1 blended object in which a small galaxy is in front of a bigger galaxy. While the foreground galaxy is targeted, it is unknown whether the emission lines originate from the foreground or background galaxy, as the galaxies are almost at the same distance. None of the blended objects are found in the literature.
Figure 31:
Example of an M-dwarf star in front of a galaxy. Note that we can see the characteristic star spectrum including emission lines of a galaxy. Balmer emission is indicated by red, OIII by blue, NII by green, and SII by yellow lines.

Figure 32:
Two examples of galaxies we classify as extremely blue or diagonal red.
The last group of weird spectra is composed of (weird) features on a continuum level. This group mainly consists of extremely blue and diagonal red galaxies, as well as spectra with weird continuum shapes. In total, we have 33 spectra that probably have a high outlier score due to their continuum. Of these, 21 spectra have very high flux values at the blue end and 7 spectra are diagonal lines with high flux values at the red end of the spectra, as can be seen in Figure 32. These two types have high outlier scores as the URF algorithm showed clustering on a continuum level. This could also be seen in the visualizations of the spectra sorted on their weirdness scores in Figure 12. Along with these extreme galaxies, we also have 3 galaxies with high continuum flux at both the blue and red end, making a sort of valley shape. In contrast to this, we also have a hill-shaped spectrum with no emission lines present.

Most of these spectra are unwanted as outliers, as they are outliers solely due to their continuum shape, without interesting features. However, we did find an object with a very weird structure, which is shown in Figure 33. Spectrum
G15 Y6 082 045 has a very large unexplained feature around 7000Å. This spectrum had the highest outlier score in all of our runs and absorbed much of our time thinking about what this unknown unknown could be. Sadly, this weird feature does not trace any physical process going on in the galaxy, as we eventually found three more spectra that show a very similar structure at the same position in the observed frame. Looking at those spectra combined, we also notice that this feature goes along with a bad splice. We classified the other three spectra as bad spectra, but keep
G15 Y6 082 045 as an outlier, as it could have been a very interesting find had it been real.
Figure 33:
Spectrum with a seemingly weird velocity structure. We also show an SDSS DR9 picture of the galaxy, showing no extreme outflows.

Figure 34:
Overview of the types of outliers.
In our search for the weirdest galaxies in GAMA, we found many different types of galaxies. The overall findings are summarized in Figure 34. Along with the real outliers, we also find 19 spectra that contain weird reduction errors or artifacts from reduced skylines causing a high outlier score. These spectra are different from the systematic bad spectra we flagged earlier and remain in the data set. Overall, the URF algorithm could find a lot of outliers, as shown in the previous sections. However, as can be seen in the pie chart and in the visualizations of the algorithm runs (see Fig. 12), we also find a few spectra that contain no interesting features. These spectra are clustered on their continuum, as many spectra with either high blue flux values or high red flux values are found together. As this behavior was not inspected or mentioned in Baron & Poznanski (2016), we do not know if this is a problem or feature of the algorithm itself or due to our implementation.
We also investigate the weirdest spectra found by the variational autoencoder. We will not investigate the top outliers as in-depth as done with the URF, but rather look at the performance of the variational autoencoder and the different types of outliers it can find. Therefore, we will not provide the tables of the weirdest spectra as done with the URF method, but refer the reader to the outlier scores provided on the Github page. We computed the outlier scores with the variational autoencoder using two different methods: via the reconstruction error, and by applying LocalOutlierFactor (LOF) on the latent space variables. Using the reconstruction-based outlier scores, we find 14 spectra within the weirdest 100 objects that were already found with the URF algorithm and were inspected in Section 4.1. For the LOF-based scores, this is 15 spectra. Looking at the weirdest spectra from both the scores of the variational autoencoder, and the inspected spectra from the URF, we find 5 spectra that are outliers in all of the methods used in this research. Of these spectra, 2 have extremely high emission lines, and 3 show active processes.

4.2.1 Reconstruction Error
The most common way to use the variational autoencoder for outlier detection is by looking at the reconstruction error of the individual objects. Spectra containing either uncommon or weird features will not be reconstructed properly and their reconstruction errors can be used as outlier scores. In Section 3.2.3 we showed the reconstruction capability of the variational autoencoder on two types of spectra: blended spectra with a foreground star and galaxy, and spectra of different types of active processes in the galaxy. These examples showed that spectra with complex structure or emission at uncommon wavelengths will have a high reconstruction error, resulting in a high outlier score. We expect to find many spectra with these types of broad features, as those generate high errors. We also expect to find spectra with many strong or uncommon emission lines, as these can give a high outlier score via the squared relationship.

We inspect the 100 weirdest spectra with the highest reconstruction errors to look at the global performance of this outlier detection method. As we can inspect the reconstructed output of the variational autoencoder, this method makes it very easy to investigate why the input spectra are considered outliers. Of the outlying spectra, 64 spectra have a high outlier score due to either a weird continuum shape, or weird features originating from reduction errors. We show a few selected spectra in Figure 35, including their reconstructed versions from the variational autoencoder. A (slightly) weird continuum shape can result in a high outlier score, as almost all data points have a high reconstruction error. Among the reduction errors, we find many single high emission lines originating from inaccurate sky reduction or possible cosmic ray events. Also, the weird spectrum
G15 Y6 082 045 (see Fig. 33) is again found as an outlier. Note that, while we also found many outliers based on their continuum with the URF method, the continua we find with the variational autoencoder always have very unusual shapes. The outlying continua found by the URF were mostly based on the clustering of very blue or red galaxies.
Figure 35:
Examples of spectra with high reconstruction error, primarily based on their continuum shapes. From top to bottom: a spectrum with a higher continuum at high wavelengths than reconstructed, a very nice edge-on spiral with a more complex continuum than reconstructed, and noisy fluctuations at low wavelengths contributing significantly to the reconstruction score.

Figure 36:
Selection of outliers found via the reconstruction error of the variational autoencoder, not found by the URF method. From top to bottom: blended spectrum containing an M-dwarf, AGN, and strong emission lines. Balmer line locations are indicated with the red line.

The other 36 outliers, which contain no major reduction errors or unusual continuum shapes, are very similar to the types of spectra found by the URF. A selection of these outliers is shown in Figure 36. As expected, we find many spectra with high or broad emission lines, including spectra with complex structures at the emission lines. We also find two E+A galaxies, of which only one was found by the URF. At last, we find 8 blended objects consisting of an M-dwarf in front of a galaxy.

Overall, using the reconstruction errors as the outlier scores gives many weird spectra with unusual continuum shapes. This is expected, as for a weird continuum shape all of the data points contribute to the reconstruction error. We also find many interesting outliers, similar to the types of outliers found by the URF method. Using the reconstruction errors of the variational autoencoder as an outlier score can give good results, given that all the spectra with reduction errors are removed and weird continuum shapes are considered interesting.
The second method of outlier detection using the variational autoencoder is looking at the clustering of the latent space variables. This method completely omits the inference model of the variational autoencoder, using only the encoded latent variables to find outliers. Outlying spectra have either uncommon or weird features that are not recognized by the encoder, and might be encoded to a weird latent space vector. If we have found the outlying latent space vectors, then we have to use the inference model to decode the spectrum to try to understand why it is an outlier. So, while not using the inference model in the search for outliers, we still need it to investigate the outcome.

Figure 37:
Spectrum showing double-peaked emission lines. An extra blue-shifted component can be seen at the Oxygen emission lines (indicated by blue).

We inspect the 100 weirdest latent space vectors by reconstructing them to spectra using the decoder. In comparison to the reconstruction errors as outlier scores, we quickly see that there is a lower number of outliers based on their continuum or containing reduction errors. A total of 42 spectra are outliers due to this, while for 58 spectra we see other interesting features. The major difference between the reconstruction-based and the latent-based outlier scores is that we find more outliers based on their emission line strengths and ratios. For example, there is not a single spectrum containing a blend of two objects, while we do find many spectra with weird emission line ratios. A few spectra show very strong OII emission relative to the other emission lines, or strong NII emission in comparison to Hα. On top of this, we find a type of spectrum that we have not found before with any other method. As shown in Figure 37, spectrum G15 Y3 007 236 shows double-peaked emission lines. The oxygen lines show an extra blue-shifted component, while the Hα structure and SII emission lines are more complex than expected. We note, again, that the outlier scores of all spectra can be found on the Github page.

Comparing the outlier types found via the reconstruction error and the latent space, we note one fundamental difference. The reconstruction-based method finds more outliers with an unusual continuum, as can be seen by the number of blends and weird continuum shapes that were found. This follows naturally from how the reconstruction error is defined, as all the data points contribute to it. Also, it might be that the encoder cannot recognize an unusual continuum and maps it to random latent space variables, resulting in fewer spectra with an unusual continuum using the latent space approach.
Depending on the types of outliers the user wants to find, both methods can be used to find interesting outliers.

5 Visualization
Data visualization is very important in both understanding and communicating complex data. Especially in big data projects with many objects, visualizing important features of all the data at once can be very complicated. The outlier detection algorithms we have used in this research generate high dimensional data that is not easily visualized. For example, the Unsupervised Random Forest algorithm produces a complex high dimensional distance matrix with the pair-wise distances between all objects. Reis et al. (2018) produced a similar distance matrix of APOGEE spectra and visualized it using t-distributed Stochastic Neighbor Embedding (t-SNE, van der Maaten & Hinton, 2008). t-SNE reduces high dimensional data to a 2D map while preserving the distances and similarities between the objects. With the maps, they showed that the distance matrix contains information on different physical properties of the objects.

We visualize the high dimensional data from the URF and the variational autoencoder on a 2D map using t-SNE and inspect the clustering of similar objects. The distance matrix contains a lot of information about the objects and we expect to find many relations in the maps. For the variational autoencoder, the spectra are already mapped to the 6-dimensional latent space. These can also be mapped and visualized on a 2D map to inspect the clustering of similar spectra.

t-SNE is a non-linear dimensionality reduction algorithm capable of reducing high dimensional data to two or three dimensions for visualization. We will reduce our data to two dimensions and visualize the results on a map. The parameter space of the map itself does not trace any physical properties, but similar objects are found in clusters on the map. t-SNE tries to preserve the nearest neighbors of each point, while forcing them to a lower dimension. Objects far away from each other are less similar, but there is no relationship between the absolute distance between the points and their dissimilarity.
The first step of the t-SNE algorithm is constructing a pair-wise probability distribution indicating the similarity between objects. The second step is to define a similar distribution of points in a two- or three-dimensional map and minimize the KL divergence between the two distributions. The KL divergence was mentioned and explained in Section 3.2.

We apply t-SNE on our data using the Python implementation in the scikit-learn package. Two important hyperparameters are the perplexity and the learning rate. The perplexity is related to the number of nearest neighbors used as similar objects. This controls the size of clusters in the final mapped representation of the input. A higher perplexity results in more global clustering, while a low perplexity shows clusters on smaller scales. This parameter is very sensitive and has a big influence on the results. The learning rate is, as in any other machine learning application, important for the convergence of the algorithm. If the learning rate is set too high, the final map will look like a ball in which all points have the same distance from each other. If the learning rate is set too low, most of the points are found in dense clusters with a few outliers. In our application, we tried different values for these parameters and we will show the best visualization of our data. The best visualization is a t-SNE map that shows clustering of similar objects on both a global and a local scale. We try not to have any dense regions due to overfitting, or any significant sparse regions without data points. For the distance matrix, we get a map that fits these requirements using a perplexity of 200 and a learning rate value of 5000.

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Figure 38: Left: distance matrix from all objects in
SNR5 subset (including bad spectra and stars). The objects are sorted on signal to noise ratio, so we can even see the bias of the URF algorithm as higher distance scores are shown in the top left corner. Right: t-SNE representation of the distance matrix including flagged objects. Note the grouping of similar objects.
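As a concrete sketch, t-SNE can be run directly on a precomputed distance matrix with the scikit-learn implementation mentioned above. The toy distance matrix below stands in for the much larger URF output, and the small perplexity is only needed because of the tiny sample size (for the real SNR5 matrix we used a perplexity of 200 and a learning rate of 5000):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# Toy stand-in for the URF distance matrix: pairwise distances
# between 50 random "spectra" with 20 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 20))
dist_matrix = squareform(pdist(features))

# metric="precomputed" tells t-SNE the input is already a distance
# matrix; init must then be "random". Perplexity must stay below
# the number of samples, hence the small value for this toy set.
tsne = TSNE(n_components=2, metric="precomputed", init="random",
            perplexity=5, learning_rate=200.0, random_state=0)
embedding = tsne.fit_transform(dist_matrix)
print(embedding.shape)  # one 2D coordinate per object
```

Each row of `embedding` is then one point on maps such as Figure 38.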
The URF outlier detection method produces a distance matrix with the pair-wise distance between each object. We used this distance matrix to compute the distance (weirdness) score of each individual object and inspect the weirdest objects. We will reduce the distance matrix of the SNR5 subset to a 2D map and inspect the clustering of similar objects. The SNR5 subset is composed of good signal to noise ratio spectra, and its big size makes it a nice data set for the visualization.

The t-SNE map of the distance matrix is shown in Figure 38, which represents the full SNR5 subset including bad spectra and stars. As stated earlier in Section 3, the URF algorithm assigns higher distance scores to lower signal to noise ratio galaxies. This can be seen in the distance matrix, where higher distance scores are assigned to the lower SNR galaxies in the top-left corner. In the t-SNE map in Figure 38 three groups of spectra are marked: the fringed spectra, spectra with reduction errors, and the stars in the GAMA survey. We included these spectra to show that the URF algorithm is capable of grouping similar types of galaxies based on their common features.

For the best results, we flagged the bad spectra and stars during this research to ensure that they do not interfere with learning the features of good spectra. This resulted in the removal of 5667 spectra from the SNR5 subset, giving a slightly smaller distance matrix after a new run with the URF algorithm. We use t-SNE on the new distance matrix to compute the map of only good spectra. As the dimension of the distance matrix has been lowered, the new t-SNE map will look different from the map shown in Figure 38. The new map is shown multiple times in Figure 39 with features of the objects to inspect the clustering in the map and the correlations between different parameters and the t-SNE space.

Figure 39:
Mapped distribution of the distance matrix via t-SNE. The objects are mapped onto a two-dimensional plane in which similar objects are clustered locally. Different data parameters or galaxy feature data are shown as color maps to indicate the amount of information the distance matrix contains.

In Figure 39, we show the t-SNE map of the SNR5 subset with the following parameters as color map, from top-left to bottom-right: the 250 outliers, the weirdness score, the signal to noise ratio of the spectra, the color of the objects, the D4000 break index, Hα line emission, SII line emission, and OIII line emission. While the signal to noise ratio is not a physical parameter of the objects, it is included to show the correlation between the weirdness score and the quality of a spectrum. The color maps are chosen as they represent the characteristics and features of the spectra. As the distance matrix is reduced from a high dimension to only two dimensions, much hidden information is lost. Still, we find parameters that show structure on the map, and we comment on each of them and explain the findings.

The weirdness score overview in map (b) shows three regions in which the scores are the highest. These regions can also be traced back by the highlighted outliers in map (a), in which the marked points are the 250 outliers that were inspected in Section 4. The outliers are found in different places in the t-SNE map with big distances in-between them, suggesting that they are composed of different types of outliers. This corresponds to the different groups of outliers we found, and similar outliers in these groups are found in small clusters on the t-SNE map. We also included a color map of the signal to noise ratio of each object in map (c). The higher signal to noise spectra are found at the outskirts of the map. When looking at maps (b) and (c), we can see the correlation between the low signal to noise spectra and high outlier scores. In map (d) the colors of the objects are shown using the SDSS g and r-band photometry. Galaxies with similar color are also grouped along the edge of the t-SNE map. There is also a relation between the high signal to noise ratio and the color. Red galaxies tend to have an overall higher signal to noise ratio, as the signal is better at higher wavelengths relative to lower wavelengths.

At last, we have four maps based on the features in the spectra. These maps trace the physical properties of the spectra and can be used to trace the information contained in the distance matrix. The D4000 break index is used to determine the star formation characteristics of galaxies. It is defined as the flux at the wavelengths 4050 Å–4250 Å divided by the flux at the wavelengths 3750 Å–3950 Å (Bruzual A., 1983). These regions trace many lines that correspond to either young bright stars or an older population of metal-rich stars. An interesting observation is that, while there is a clear distinction between the two types of galaxies, for both types we have a region with high weirdness scores and objects marked as outliers as seen in (a). The following three maps (f), (g), and (h) show the flux values of the emission lines Hα, SII, and OIII respectively. These lines are present during star formation and in a young star population. Note that for the Hα line and the OIII line, the regions of the highest intensities do not always correspond to each other. They do overlap in the rightmost part of the map, indicating the star formation region.

The clustering of the groups of outliers indicates that there are more similar objects nearby. Inspecting the nearby points indeed gives us more similar objects. The region with star-forming galaxies is fairly global and easily spotted in the flux maps. More star-forming galaxies are easily found by inspecting points close to the marked outliers. The BPT outliers are clustered at the tip of the star formation region. The quasars and active galaxies can not be found on a global scale and reside in a small group. More interesting spectra containing very broad emission lines or extra structure could be found by inspecting the points close to our outliers with unusual velocity structure.
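A minimal sketch of the D4000 computation following the Bruzual A. (1983) definition; the wavelength grid and the flat spectrum with an artificial break are made up for illustration:

```python
import numpy as np

# Hypothetical rest-frame spectrum: wavelength grid (Å) and flux.
wave = np.linspace(3500, 7500, 8000)
flux = np.ones_like(wave)
flux[wave > 4000] = 1.8          # fake a strong 4000 Å break

def d4000(wave, flux):
    """D4000 break index: mean flux in 4050-4250 Å divided by
    the mean flux in 3750-3950 Å (Bruzual A., 1983)."""
    blue = flux[(wave >= 3750) & (wave <= 3950)].mean()
    red = flux[(wave >= 4050) & (wave <= 4250)].mean()
    return red / blue

print(round(d4000(wave, flux), 2))  # 1.8 for this toy spectrum
```

Values well above 1 indicate a strong break, typical of older stellar populations.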
Overall, the t-SNE map of the distance matrix is very useful in finding more interesting spectra, while also looking very pretty.

Latent Variables

We also want to test the dimensionality reduction of the 6 latent variables representing the encoded spectra. In Portillo et al. (2020), the latent variables were plotted using a corner plot, showing the pair-wise relationships between them. Instead, we will use the latent space vectors in a similar way as the distance matrix of the URF and try to make a nice t-SNE map. As we could find interesting outliers with the latent space vectors, we expect to find a good map in which we can find more interesting outliers. We apply t-SNE on the encoded spectra and used different learning rates and perplexity values to find the best map. The best t-SNE map we found is shown in Figure 40, using a perplexity of 200 and a learning rate of 1000.

Overall, we see the same types of relations as shown for the distance matrix in Figure 39. The outliers, as shown in (a), seem to be at random places, but are mostly found at the edges of the whole map or at the edges of dense regions in the map. This can also be seen from the LOF outlier scores of the spectra in (b). The flux values of the emission lines Hα, OIII and SII are shown in (f), (h) and (g) respectively. The objects with high flux values for these emission lines are mostly found in the bottom-right corner of the map, tracing the star formation region of the t-SNE map. Many similar objects can be found in the star-forming region of the distance map, which was also possible with the map in Figure 39. Unfortunately, we could not find any more double peaked emission lines in the neighborhood of the found spectrum. The maps show a very nice global distribution and clusters of the different parameters, but the 6 latent parameters contain less information about the spectra than the distance matrix, as the encoder and decoder also contain much information about relations in the data.
As a result, there is almost no clustering of similar objects on a small scale. The best t-SNE map is made with the very detailed distance matrix, but Figure 40 shows that the latent variables also nicely trace features and cluster similar objects on a larger scale.

Figure 40:
Mapped distribution of the latent vectors via t-SNE. Different data parameters or galaxy feature data are used as color maps. We do not show (c) and (d), as the objects are the same as already shown in Figure 39.
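As an illustration of how LOF scores such as those in panel (b) can be obtained, a minimal sketch with the scikit-learn implementation on stand-in latent vectors (the data and the planted outliers are synthetic, not our trained encoder output):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy stand-in for the 6-dimensional latent vectors of the
# encoded spectra (the real ones come from the trained encoder).
rng = np.random.default_rng(2)
latents = rng.normal(size=(500, 6))
latents[:5] += 6.0               # plant a few obvious outliers

# LOF compares each point's local density with that of its
# neighbors; negative_outlier_factor_ is roughly -1 for inliers
# and much more negative for outliers.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(latents)
scores = -lof.negative_outlier_factor_  # higher = more outlying

print(np.argsort(scores)[-5:])  # indices of the 5 strongest outliers
</antml_code>```

Objects with the highest LOF scores are the ones sitting in sparse regions of the latent space.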
Discussion
In this section, we discuss the decisions made during the project and interpret our results. We comment on the acquisition of the GAMA data and the pre-processing steps done on the spectra. We discuss our application of the URF outlier detection method and compare the results with Baron & Poznanski (2016). The choices made in defining our outlier groups are also discussed, along with a comparison of the different outliers found by the two detection methods. At last, we comment on the t-SNE maps and discuss the application of all our methods for future projects.

While using a major part of the GAMA survey, we did not use all spectra of the survey. We limited ourselves to the data of the main observations originating from a single instrument, making up 78% of the spectra. The decision was made for the following two reasons. First, all spectra are taken with the same instrument and have the same resolution and overall data shapes. If data from multiple instruments were used, the outlier detection methods could be biased towards the data of a single instrument. Secondly, we only have to work with the instrumental errors of a single instrument, making masking of bad spectra easier.

The GAMA survey is composed of objects at a broad range of redshifts. To learn the important features, the spectra had to be converted to their rest frame such that all emission lines are aligned. To ensure the same input space for each spectrum, the spectra were interpolated onto rest frame wavelengths between 3500 Å and 7500 Å. This ensured that the interesting emission lines are captured for all the objects at different redshifts and excluded regions without flux values. In an ideal world, we would use the raw flux values of the spectra for the outlier search, as they ensure the best relationship between what is observed and an assigned score. However, many spectra showed either bad regions or noisy parts that had to be accounted for.
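A minimal sketch of this pre-processing, assuming a hypothetical observed spectrum and illustrative grid sizes and kernel width:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical observed spectrum of a galaxy at redshift z.
z = 0.15
obs_wave = np.linspace(3700, 8800, 5000)          # Å, observed frame
obs_flux = np.ones_like(obs_wave) + 0.05 * np.sin(obs_wave / 50.0)

# Shift to the rest frame and resample every spectrum onto the
# same 3500-7500 Å grid, so all emission lines line up and each
# object has an identical input shape.
rest_wave = obs_wave / (1.0 + z)
grid = np.linspace(3500, 7500, 4000)
flux = np.interp(grid, rest_wave, obs_flux)

# Light Gaussian smoothing (kernel width in pixels is illustrative)
# suppresses noisy continuum fluctuations while keeping line shapes.
flux_smooth = gaussian_filter1d(flux, sigma=2)
print(grid.shape, flux_smooth.shape)
```

The resampled, smoothed flux vectors are then what the outlier detection methods receive as input features.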
Therefore, we masked bad regions and smoothed the spectra with a Gaussian kernel. The fluctuations in the continuum were removed successfully and the emission line shapes stayed intact, but the absolute flux values were slightly lowered. This had no impact on the results, as all fluxes were lowered by the same ratio. We think we made a good trade-off between the pre-processing of the data and keeping it close to the observed values.

We implemented the full URF algorithm in Python using multiprocessing to boost its performance. A basic example of the algorithm was provided in Baron & Poznanski (2016), in which they described simple steps to build the algorithm. While the URF outlier detection method was already proven to be working in their work, we also wanted to explore how the algorithm works. Therefore, we applied different tests by changing the input shape of the data, using different hyperparameters in the Random Forest, and inserting self-made spectra that look very weird. The tests showed that spectra with broad features are easily found as outliers, while unusual single lines do not necessarily show up as outliers. Overall, the tests gave good insight into the processing of the data and the application of the URF.

Initial runs on all our GAMA data only showed outliers due to very noisy fluctuations. We observed that the URF algorithm is biased towards low signal to noise ratio spectra, which was coincidentally confirmed during this project by Reis et al. (2019b). To ensure that we find the best outliers, we used three different subsets based on the signal to noise ratio of the spectra. While the subset with the high signal to noise spectra showed the best results, we also had to use the lower quality spectra to ensure we find the weirdest spectra in the GAMA survey. As we ran the algorithm on different subsets based on a lower limit of their signal to noise ratio, we had to carefully define which spectra are considered weird. For the three subsets
SNR10, SNR5 and SNR2.5, in which the number indicates the lower limit of the signal to noise ratio, we defined the weirdest spectra as the 100 highest scoring spectra of SNR10, the 100 highest scoring spectra of SNR5 not found earlier, and the 50 highest scoring spectra of SNR2.5 not found earlier. While this seems arbitrary, it allowed us to learn the workings of the algorithm, as we first applied good quality data to test the algorithm. Additionally, we only used 50 spectra of the SNR2.5 subset, as the results were not great due to the bias towards lower signal to noise galaxies.

We also noticed clustering on the continuum of spectra in the outlier scores, as many similar blue or red spectra are grouped together. These also came up as the weirdest spectra, but were in themselves not very interesting. This behavior was not reported in Baron & Poznanski (2016), while Reis et al. (2019b) show similar behavior in the score distribution. Overall, interesting weird spectra still dominated the results, and we could still inspect spectra of many different types of objects.

The inspected outliers are grouped and categorized by their most obvious features. Note that, while we inspected the weirdest spectra thoroughly, we can not guarantee that we always noticed the exact reason why the URF algorithm assigned a high outlier score. As many features are obvious to spot by eye, it is entirely possible that the URF found a weird combination of flux values that determined its high score. We are however very confident that we grouped and classified the 250 weirdest spectra correctly. We provide a full list of the weirdest spectra on the Github page for the reader to explore themselves. We found many different types of spectra among the outliers, such as: additional complex structure at emission lines, line broadening due to activity, blends of different objects, and BPT outliers. While the tests showed that the algorithm does not assign high outlier scores to galaxies with unusual lines, we still find a few spectra with unusual Iron lines and NI[5200] emission. Many types of outliers we found are similar to the outliers found in Baron & Poznanski (2016), but unfortunately there was no overlap between our GAMA data and their SDSS data, so no direct comparisons could be made. As most weird spectra showed interesting features, we have great confidence in the application of the URF outlier detection method on spectroscopic data.
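For reference, the core URF construction (real versus shuffled synthetic data, similarity from shared leaves) can be sketched as follows; the toy data, forest size, and exact score definition are illustrative rather than our full implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Minimal sketch of the Unsupervised Random Forest idea: label the
# real objects as one class and synthetic objects (each feature
# shuffled independently, destroying correlations) as the other,
# then derive pairwise similarity from shared terminal leaves.
rng = np.random.default_rng(3)
real = rng.normal(size=(200, 30))
real[:, 0] = real[:, 1] * 2 + rng.normal(0.0, 0.1, 200)  # correlated features

synthetic = real.copy()
for j in range(synthetic.shape[1]):
    rng.shuffle(synthetic[:, j])     # marginals kept, correlations broken

X = np.vstack([real, synthetic])
y = np.r_[np.ones(len(real)), np.zeros(len(synthetic))]
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Similarity: fraction of trees in which two real objects end up in
# the same leaf; distance = 1 - similarity, and the weirdness score
# of an object is its mean distance to all others.
leaves = forest.apply(real)          # (n_objects, n_trees) leaf indices
sim = (leaves[:, None, :] == leaves[None, :, :]).mean(-1)
weirdness = (1.0 - sim).mean(axis=1)
print(weirdness.shape)
```

Objects that rarely share leaves with anything else get the highest weirdness scores.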
One could argue that for all the weird spectra we found, simple models specific to each type of object could be used instead. While this is true for known objects, the outlier detection methods can find all types of objects, including unknowns. Unfortunately, we did not find any Nobel prize winning unknown unknown in our research.

We also applied a more experimental outlier detection method on the galaxy spectra of the GAMA data, using a reconstruction-based variational autoencoder to inspect more outliers and learn about the application of neural networks on this type of data. Using a reconstruction-based method would make it easier for us to investigate why a spectrum is considered an outlier. We applied our self-made algorithm in a similar way as in Portillo et al. (2020), but with the primary goal of finding outliers. During training, we noticed that the MMD loss term, which is supposed to help the network with learning, did not contribute enough. At first, we did not include a scaling term in the loss function that uses the batch size as a multiplier. However, when we did implement it correctly the results worsened significantly and were unusable. We could not find why this was happening and applied the InfoVAE as a normal variational autoencoder. We used the trained network to compute outlier scores for the spectra in the SNR5 subset.
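For concreteness, the two regularization terms discussed above (written out in Appendix B) can be sketched in numpy; the batch size, latent dimension, and kernel width below are illustrative stand-ins, not our training configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: encoder outputs for a batch of 4 "spectra" with
# J = 6 latent variables, plus samples from the N(0, I) prior.
mu = rng.normal(scale=0.1, size=(4, 6))
log_var = rng.normal(scale=0.1, size=(4, 6))
z_q = mu + np.exp(0.5 * log_var) * rng.normal(size=(4, 6))
z_p = rng.normal(size=(200, 6))

# Closed-form KL divergence against the N(0, I) prior, summed over
# the J latent dimensions and averaged over the batch.
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1).mean()

def gaussian_kernel(a, b, sigma=1.0):
    # k(z, z') = exp(-|z - z'|^2 / (2 sigma^2)) for all pairs
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * sigma**2))

# MMD between the encoded batch and samples from the prior.
mmd = (gaussian_kernel(z_p, z_p).mean()
       + gaussian_kernel(z_q, z_q).mean()
       - 2 * gaussian_kernel(z_q, z_p).mean())

print(kl >= 0, mmd >= 0)
```

Both terms are non-negative here: the closed-form KL is zero only for a perfectly standard-normal posterior, and the biased MMD estimate is a squared distance between kernel mean embeddings.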
Summary and Conclusion
We applied two different outlier detection methods on spectroscopic data from the GAMA survey to find weird galaxies in the Universe. The first method was a distance-based outlier detection technique based on an Unsupervised Random Forest. This method computes a pair-wise distance between all objects, resulting in a high-dimensional distance matrix. We used the distance matrix to determine an outlier score for each individual object and inspected the 250 weirdest galaxies. We found many different types of interesting outliers, such as: quasi-stellar objects, active galaxies, blends of different objects, and BPT outliers. The method also assigned high outlier scores to objects clustered on their continuum, and we observed a bias towards low signal to noise ratio spectra, which was also confirmed by others during the span of this project. Neither of these issues had any significant impact on the quality of the outliers. The second method is a reconstruction-based outlier detection technique using a variational autoencoder. The network was trained using the spectra and learns to reconstruct the important and common features. Spectra with weird or uncommon features will have a high reconstruction error and are considered outliers. We used the reconstruction errors, and the encoded representation of the spectra, to compute outlier scores and compared the outcome with the Unsupervised Random Forest. While the methods are inherently very different, they both found similar types of outliers and even had some overlap in the weirdest spectra.

We visualized the high dimensional output of the algorithms on a two-dimensional map using the dimensionality reduction technique t-SNE. This gave a great overview of all the information contained in the output and gave a nice insight into the relations found in the data. We could point out multiple correlations in the output using galaxy feature data.
The maps could also be used to find even more interesting outliers by looking at points clustered on the map.

In this project, we successfully showed the application of two outlier detection techniques for spectroscopic data. These will be very useful for future surveys in which a lot of objects will be observed. Combining these methods with good visualization techniques gives a very robust way of finding many interesting objects. While we did not find the weirdest galaxy in the Universe in this project, we found many different interesting objects via a single method without the need for prior knowledge of any of them. Outlier detection is a very interesting application of unsupervised machine learning, and will be essential for finding the unknown unknowns of the future.

References
Baldry I. K., et al., 2010, MNRAS, 404, 86
Baldry I. K., et al., 2014, MNRAS, 441, 2440
Baldry I. K., et al., 2017, MNRAS, 474, 3875–3888
Baldwin J. A., Phillips M. M., Terlevich R., 1981, Publications of the Astronomical Society of the Pacific, 93, 5
Baron D., 2019, Machine Learning in Astronomy: a practical overview (arXiv:1904.07248)
Baron D., Poznanski D., 2016, MNRAS, 465, 4530–4555
Blanton M. R., et al., 2017, AJ, 154, 28
Breiman L., 2001, Machine Learning, 45, 5–32
Breunig M., Kriegel H.-P., Ng R., Sander J., 2000, ACM Sigmod Record, 29, 93
Bruzual A. G., 1983, ApJ, 273, 105
Colless M., et al., 2003, The 2dF Galaxy Redshift Survey: Final Data Release (arXiv:astro-ph/0306581)
Croom S. M., Smith R. J., Boyle B. J., Shanks T., Miller L., Outram P. J., Loaring N. S., 2004, MNRAS, 349, 1397
Dalton G., et al., 2014, Ground-based and Airborne Instrumentation for Astronomy V
Driver S. P., Liske J., Graham A. W., 2007, The Millennium Galaxy Catalogue: Science highlights (arXiv:astro-ph/0701468)
Driver S. P., Norberg P., Baldry I. K., Bamford S. P., Hopkins A. M., Liske J., Loveday J., Peacock J. A., 2009, Astronomy & Geophysics, 50, 5.12–5.19
Faran T., et al., 2014, MNRAS, 445, 554–569
Goto T., et al., 2003, Publications of the Astronomical Society of Japan, 55, 771–787
Hewish A., Bell S. J., Pilkington J. D. H., Scott P. F., Collins R. A., 1968, Nature, 217, 709
Hill D. T., et al., 2011, MNRAS, 412, 765
Hinton G. E., Salakhutdinov R. R., 2006, Science, 313, 504
Homan D., Macleod C. L., Lawrence A., Ross N. P., Bruce A., 2019, Behaviour of the MgII 2798Å Line Over the Full Range of AGN Variability (arXiv:1910.11364)
Hopkins A. M., et al., 2013, MNRAS, 430, 2047
Ichinohe Y., Yamada S., 2019, MNRAS, 487, 2874
Józsa G. I. G., et al., 2009, A&A, 500, L33–L36
Kauffmann G., et al., 2003, MNRAS, 346, 1055–1077
Keel W. C., et al., 2012, The Astronomical Journal, 144, 66
Kewley L. J., Dopita M. A., Sutherland R. S., Heisler C. A., Trevena J., 2001, The Astrophysical Journal, 556, 121–140
Kingma D. P., Welling M., 2013, arXiv e-prints, p. arXiv:1312.6114
Kingma D. P., Welling M., 2019, Foundations and Trends in Machine Learning, 12, 307–392
Ksoll V. F., et al., 2018, MNRAS
Kullback S., Leibler R. A., 1951, Ann. Math. Statist., 22, 79
Lawrence A., et al., 2007, Monthly Notices of the Royal Astronomical Society, 379, 1599–1617
Nair V., Hinton G. E., 2010, in Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML'10. Omnipress, Madison, WI, USA, p. 807–814
Norris R. P., 2017, PASA, 34, e007
Pimentel M. A., Clifton D. A., Clifton L., Tarassenko L., 2014, Signal Processing, 99, 215
Portillo S. K. N., Parejko J. K., Vergara J. R., Connolly A. J., 2020, arXiv e-prints, p. arXiv:2002.10464
Prescott M. K. M., Sanderson K. N., 2019, ApJ, 885, 40
Rakshit S., Stalin C. S., Chand H., Zhang X.-G., 2017, ApJ, 229, 39
Reis I., Poznanski D., Baron D., Zasowski G., Shahaf S., 2018, MNRAS, 476, 2117–2136
Reis I., Rotman M., Poznanski D., Prochaska J. X., Wolf L., 2019a, Effectively using unsupervised machine learning in next generation astronomical surveys (arXiv:1911.06823)
Reis I., Baron D., Shahaf S., 2019b, The Astronomical Journal, 157, 16
Shi T., Horvath S., 2005, Journal of Computational and Graphical Statistics, 15
Smith G. A., et al., 2004, AAOmega: a multipurpose fiber-fed spectrograph for the AAT. Moorwood, Alan F. M. and Iye, Masanori, pp 410–420, doi:10.1117/12.551013
Toba Y., et al., 2014, ApJ, 788, 45
Tsang B. T.-H., Schultz W. C., 2019, The Astrophysical Journal, 877, L14
Turner S., et al., 2018, MNRAS, 482, 126–150
Wang T.-G., Zhou H.-Y., Komossa S., Wang H.-Y., Yuan W., Yang C., 2012, The Astrophysical Journal, 749, 115
Wenger M., et al., 2000, A&A, 143, 9
Yang T., Li X., 2015, MNRAS, 452, 158–168
Zhao S., Song J., Ermon S., 2017, arXiv e-prints, p. arXiv:1706.02262
de Jong R. S., et al., 2019, 4MOST: Project overview and information for the First Call for Proposals (arXiv:1903.02464)
van der Maaten L., Hinton G., 2008, Visualizing Data using t-SNE

Acknowledgements

I would like to thank my supervisors Teymoor Saifollahi (MSc) and prof. dr. Reynier Peletier for providing and helping me with this project. This master research project was a great way of combining the interesting field of Astronomy with the fast developing field of data science and machine learning. This project allowed me to expand my knowledge of the application of machine learning techniques on complex data, and I am grateful that this is possible at the Kapteyn Institute. I want to thank my supervisors for their feedback, and especially thank Teymoor for our weekly (online) meetings where we shared and discussed our findings.

Also, I would like to thank prof. dr. Michael Biehl and dr. Kerstin Bunte for their knowledgeable input at the start of the project about data pre-processing and the application of the machine learning techniques. At last, I also want to thank my girlfriend, parents, sister and friends for their gezelligheid, discussions and support during the project.

Appendix A: GAMA Data
SQL query for
Appendix B: Evidence Lower Bound in Practice
We briefly relate the theory behind the Evidence Lower Bound (ELBO) to its application in the code of the variational autoencoder. The following functions and theory are briefly summarized from Zhao et al. (2017), Ichinohe & Yamada (2019) and Kingma & Welling (2019).
Kullback-Leibler Divergence
The Kullback-Leibler Divergence (KL Divergence) assigns a distance between the approximate posterior and the true posterior, and determines the distance between the ELBO and the marginal likelihood. As the prior is set to be the normal distribution p(z) = N(0, I), the KL Divergence is

D_{\mathrm{KL}}\left(q(z \,|\, x_i) \,\|\, p(z)\right) = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log\left[ (\sigma_{ij})^2 \right] - (\mu_{ij})^2 - (\sigma_{ij})^2 \right), \qquad (7)

where \mu_{ij} and \sigma_{ij} are the j-th elements of the mean \mu and variance \sigma calculated in their corresponding layer for each spectrum i in the data set. For reference, the variable J denotes the number of latent space variables and is set to J = 6 in our variational autoencoder.

Maximum-Mean Discrepancy
The Maximum-Mean Discrepancy (MMD) computes the distance between two probability distributions by comparing their moments. It is efficiently implemented using a kernel trick, with k any positive definite kernel. The MMD between the output q(z_\phi) of the encoder and the prior p(z_\theta) of the decoder is

D_{\mathrm{MMD}}\left(q(z) \,\|\, p(z)\right) = \mathbb{E}_{p(z),p(z')}[k(z, z')] + \mathbb{E}_{q(z),q(z')}[k(z, z')] - 2\,\mathbb{E}_{q(z),p(z')}[k(z, z')], \qquad (8)

with \mathbb{E} the expectation value of the output of the kernel k. We use a Gaussian kernel that computes the similarity of the two samples as

k(z, z') = \exp\left( -\frac{|z - z'|^2}{2\sigma^2} \right). \qquad (9)

The MMD loss is computed in our variational autoencoder by comparing the moments of the sampled latent variable layer (z sampler) with the normal distribution N(0, I) of the prior.

Appendix C: Variational Autoencoder

Figure 41:
The layers and their sizes of our optimal variational autoencoder with input size 8000. For the network with input size 1000, the network looks similar, with the first hidden layer as the input layer. The question marks in the layer sizes indicate the batch size, which can be varied.

Appendix D: Weirdest Galaxies in the Universe
In the following tables, we list the weirdest spectra we found with the URF algorithm. Additional information relevant for each group is shown, and comments are provided that determined the placement of the spectra in their particular group.
SPECID            z      Comments
G02 Y5 083 265    0.185  M dwarf in front of Galaxy
G09 Y1 AX2 002    0.006  M dwarf in front of Galaxy
G12 Y1 DS2 320    0.179  M dwarf in front of Galaxy
G15 Y1 BS2 010    0.357  M dwarf in front of Galaxy
G15 Y2 003 165    0.049  M dwarf in front of Galaxy
G15 Y2 013 324    0.082  M dwarf in front of Galaxy
G15 Y3 021 053    0.235  M dwarf in front of Galaxy
G09 Y5 018 163    0.040  Bright star in front of Galaxy
G09 Y1 HS1 004    0.018  Tiny galaxy in front of big one, Hα source unknown

Table 2: Blended objects
SPECID            z      Comments
G09 Y1 FS1 085    0.199  Broadening of emission lines
G09 Y4 212 273    0.158  Broadening of emission lines
G09 Y6 063 013    0.334  Broadening of emission lines
G12 Y1 DS2 080    0.306  Broadening of emission lines
G12 Y1 IS1 080    0.282  Broadening of emission lines
G15 Y2 007 169    0.330  Broadening of emission lines
G12 Y1 BX1 127    0.153  Complex and high NII
G02 Y3 015 210    0.043  Hα and NII are complex lines
G15 Y4 206 189    0.180  Hα and NII are complex lines

Table 3: Minor broadening of, or minor additional structure in, emission lines
SPECID            z      Comments
G02 Y5 080 045    0.075  Additional structure at Balmer lines
G12 Y1 BN1 366    0.243  Additional structure at Balmer lines
G15 Y1 BN2 004    0.270  Additional structure at Balmer lines
G15 Y6 093 351    0.317  Additional structure at Balmer lines
G09 Y6 090 043    0.549  Additional structure at Hβ and broad MgII
G02 Y3 005 005    0.330  Broadened Balmer lines
G02 Y3 015 281    0.325  Broadened Balmer lines
G02 Y5 094 233    0.157  Broadened Balmer lines
G02 Y5 111 257    0.331  Broadened Balmer lines
G09 Y1 BS1 152    0.176  Broadened Balmer lines
G09 Y4 208 024    0.295  Broadened Balmer lines
G12 Y6 040 320    0.106  Broadened Balmer lines
G15 Y1 FN2 029    0.331  Broadened Balmer lines
G15 Y1 GN1 046    0.144  Broadened Balmer lines
G15 Y4 236 287    0.323  Broadened Balmer lines
G15 Y5 003 263    0.300  Broadened Balmer lines
G15 Y5 029 255    0.279  Broadened Balmer lines
G15 Y5 061 182    0.220  Broadened Balmer lines
G15 Y6 088 158    0.140  Broadened Balmer lines
G15 Y6 100 268    0.235  Broadened Hα
G09 Y3 015 229    0.762  Broadened Hβ
G09 Y1 DS2 065    0.225  Extra complex structure around Balmer lines
G12 Y1 FN1 127    0.270  Extra complex structure around Balmer lines
G12 Y1 IS2 153    0.103  Extra structure at Balmer lines
G09 Y4 215 016    0.284  Extremely broadened Balmer lines
G12 Y1 AT 117     0.135  Extremely broadened Balmer lines and fringing
G09 Y4 249 125    0.260  Extremely broadened Hα
G15 Y6 083 065    1.345  High redshift quasar: CIII and extra structure
G09 Y3 015 281    1.128  High redshift quasar: CIII and MgII
G12 Y3 006 046    1.535  High redshift quasar: CIII and MgII
G09 Y2 017 284    0.906  High redshift quasar: CIII and MgII
G15 Y4 212 020    1.885  High redshift quasar: CIV high and sharp
G15 Y5 035 226    2.031  High redshift quasar: CIV high and sharp
G15 Y6 094 029    1.995  High redshift quasar: CIV high and sharp
G09 Y4 226 135    2.563  High redshift quasar: Lyman-α and broad C
G12 Y6 073 125    2.232  High redshift quasar: Lyman-α and Silicon
G09 Y4 201 018    2.102  High redshift quasar: Lyman-α emitter
G09 Y4 212 196    2.238  High redshift quasar: Lyman-α emitter
G15 Y6 086 131    3.059  High redshift quasar: Lyman-α emitter
G15 Y4 203 235    2.932  High redshift quasar: Lyman-γ emitter
G15 Y5 021 245    2.837  High redshift quasar: Lyman-γ emitter

Table 4: Quasi-Stellar Objects. These objects are either at high redshift or show activity via extremely broadened emission lines.
SPECID  z  Comments
G02 Y4 021 114  0.006
G02 Y4 029 209  0.012
G02 Y4 034 330  0.056
G02 Y4 040 299  0.026
G02 Y4 041 266  0.183
G02 Y4 043 105  0.078
G02 Y4 043 105  0.078
G02 Y5 079 023  0.015
G02 Y5 082 208  0.073
G02 Y5 091 065  0.078
G02 Y5 114 257  0.083
G02 Y5 114 291  0.166
G02 Y6 004 067  0.297
G09 Y1 AS2 047  0.077
G09 Y1 BS1 271  0.052
G09 Y1 CX2 244  0.028
G09 Y1 EN2 104  0.013
G09 Y1 EN2 256  0.035
G09 Y1 ES2 147  0.070
G09 Y1 FS1 037  0.123
G09 Y1 FS1 089  0.028
G09 Y1 FS2 387  0.018
G09 Y1 IS2 176  0.076
G09 Y2 042 055  0.024
G09 Y4 205 318  0.123
G09 Y4 207 223  0.332
G09 Y4 215 073  0.065
G09 Y4 251 217  0.012
G12 Y1 AN1 102  0.027
G12 Y1 AN1 392  0.018
G12 Y1 AS2 046  0.118
G12 Y1 CN1 078  0.006
G12 Y1 CN1 102  0.040
G12 Y1 CND1 285  0.103
G12 Y1 CS2 163  0.013
G12 Y1 DND1 283  0.011
G12 Y1 DX1 211  0.140
G12 Y1 FX1 031  0.080
G12 Y1 GN1 263  0.044
G12 Y1 HS2 086  0.190
G12 Y1 IS1 301  0.021
G12 Y1 IS2 193  0.088
G12 Y1 ND1 188  0.027
G12 Y3 014 242  0.179
G12 Y4 210 280  0.249
G12 Y6 056 022  0.312
G12 Y6 060 295  0.049
G15 Y1 BS1 108  0.066
G15 Y1 CN1 059  0.052
G15 Y1 CN2 291  0.116
G15 Y1 CS1 112  0.133
G15 Y1 CS1 149  0.051
G15 Y1 CX1 125  0.091
G15 Y1 DS2 286  0.029
G15 Y1 FS1 068  0.103
G15 Y1 FS1 102  0.037
G15 Y1 FS2 134  0.082
G15 Y1 GS1 095  0.180
G15 Y1 GX2 192  0.027
G15 Y1 IN1 030  0.035
G15 Y1 IN2 106  0.119
G15 Y2 001 254  0.030
G15 Y2 013 055  0.041
G15 Y2 022 031  0.049
G15 Y3 005 049  0.176
G15 Y3 005 049  0.176
G15 Y3 014 270  0.086
G15 Y3 021 322  0.130
G15 Y3 037 237  0.156
G15 Y3 050 003  0.244
G15 Y4 203 262  0.087
G15 Y4 214 347  0.045
G15 Y4 216 266  0.061
G15 Y4 224 130  0.063
G15 Y4 234 096  0.185
G15 Y5 003 334  0.084
G15 Y6 073 241  0.058
G15 Y6 073 391  0.073
G15 Y6 075 075  0.139
G15 Y6 088 096  0.048
G15 Y6 088 271  0.107
G15 Y6 097 155  0.042
G15 Y6 099 116  0.369
G15 Y6 100 125  0.032
G15 Y6 100 221  0.025
G02 Y3 018 304  0.139  BPT Outlier
G02 Y4 002 108  0.005  BPT Outlier
G02 Y4 042 264  0.007  BPT Outlier
G02 Y5 061 198  0.013  BPT Outlier
G09 Y1 CN1 168  0.132  BPT Outlier
G09 Y1 DX1 117  0.008  BPT Outlier
G09 Y4 207 303  0.246  BPT Outlier
G09 Y4 207 339  0.157  BPT Outlier
G09 Y4 207 347  0.190  BPT Outlier
G09 Y4 207 378  0.273  BPT Outlier
G09 Y4 231 266  0.012  BPT Outlier
G12 Y1 AN1 124  0.076  BPT Outlier
G12 Y1 AN1 254  0.004  BPT Outlier
G12 Y1 AS2 119  0.006  BPT Outlier
G12 Y1 BS1 058  0.072  BPT Outlier
G12 Y1 CN1 079  0.013  BPT Outlier
G12 Y1 IS1 078  0.006  BPT Outlier
G12 Y1 IS2 123  0.008  BPT Outlier
G12 Y1 ND8 039  0.022  BPT Outlier
G12 Y6 050 014  0.004  BPT Outlier
G15 Y1 CX1 053  0.030  BPT Outlier
G15 Y1 GN1 001  0.032  BPT Outlier
G15 Y3 009 097  0.026  BPT Outlier
G15 Y3 016 128  0.033  BPT Outlier
G15 Y3 047 035  0.037  BPT Outlier
G15 Y4 205 288  0.007  BPT Outlier
G15 Y4 234 172  0.035  BPT Outlier
G15 Y6 088 244  0.038  BPT Outlier
G15 Y6 089 265  0.099  BPT Outlier
Table 5: Star-forming galaxies: objects with very high and narrow Balmer lines. A few of these objects are also BPT outliers, as commented on in Section 4.
SPECID  z  Comments
G09 Y4 240 259  0.084  Dominating NII
G02 Y3 013 192  0.070  Dominating NII, no Hα
G09 Y1 CN2 290  0.209  High emission lines and Iron lines
G02 Y3 016 160  0.055  Low Hα and no Oxygen
G12 Y1 GND1 125  0.050  Mysterious emission line at 5435 Å
G12 Y1 HND1 169  0.199  NII 5200 emission and other emission lines
G02 Y3 014 029  0.056  NII much bigger than Hα
G09 Y2 016 087  0.052  Only Hα/NII structure
G12 Y1 CX1 394  0.132  Only Hα/NII structure
G12 Y1 DS2 363  0.132  Only Hα/NII structure
G12 Y6 052 104  0.038  Only Hα/NII structure
G15 Y1 FS2 063  0.081  Only Hα/NII structure
G15 Y4 204 298  0.136  Only Hα/NII structure
G09 Y4 207 090  0.245  Only Hα/NII structure
G09 Y6 093 017  0.019  Only Hα/NII structure
G09 Y4 207 260  0.246  Only Hα/NII structure
G15 Y1 CS2 180  0.139  Additional Iron line at 5304 Å

Table 6: Galaxies grouped by unusual extra lines or weird line fluxes.
SPECID  z  Comments
G09 Y4 215 048  0.194
G09 Y4 215 057  0.238
G12 Y1 AN1 085  0.294
G12 Y1 CS1 087  0.084  Minor absorption in Hδ and Hα emission
G12 Y2 041 076  0.133
G15 Y4 210 008  0.186  Hα emission
G09 Y1 FN2 235  0.293
G12 Y1 CN1 175  0.177  Hα emission

Table 7: Spectra showing Hδ absorption.

SPECID  z  Comments
G09 Y1 IS2 022  0.055  Blue galaxy
G15 Y1 DS2 125  0.086  Blue galaxy
G15 Y4 211 119  0.068  Blue galaxy
G12 Y1 BD1 108  0.264  Blue galaxy
G15 Y1 CS1 071  0.085  Blue galaxy
G15 Y1 DS2 102  0.138  Blue galaxy
G02 Y4 043 333  0.206  Diagonal blue end with sky lines in red
G09 Y4 211 013  0.186  Diagonal red
G09 Y5 015 117  0.035  Diagonal red
G12 Y1 DS2 367  0.165  Diagonal red
G02 Y3 018 233  0.087  Diagonal red, no emission lines but strong NaID
G09 Y4 207 387  0.405  Diagonal red, no emission lines but strong NaID
G12 Y4 205 395  0.238  Diagonal up to 7000 Å and then flat
G02 Y4 043 265  0.260  High blue end
G09 Y4 215 246  0.018  High blue end
G15 Y1 DS2 097  0.053  High blue end
G15 Y1 DS2 106  0.193  High blue end
G15 Y1 DS2 123  0.179  High blue end
G15 Y1 DS2 124  0.053  High blue end
G15 Y4 202 230  0.204  High blue end
G15 Y4 202 266  0.099  High blue end
G15 Y4 202 314  0.025  High blue end
G15 Y4 204 220  0.085  High blue end
G15 Y6 074 314  0.027  High blue end
G15 Y6 088 239  0.025  High blue end
G15 Y4 237 091  0.224  Mountain shape
G09 Y1 DS2 261  0.108  Noisy blue end
G15 Y4 204 386  0.122  Noisy blue end
G15 Y4 204 396  0.053  Noisy blue end
G02 Y3 019 369  0.323  Valley shape
G09 Y4 212 315  0.079  Valley shape
G09 Y4 212 321  0.040  Valley shape
G15 Y6 082 045  0.025  Very weird feature
Table 8: Outlying spectra based on their continuum. These are either extremely blue or diagonally red galaxies, grouped together by the clustering of the URF algorithm, or spectra with weird features at the continuum level.
SPECID  z  Comments
G15 Y2 009 099  0.256  Blue end of spectrum diagonal
G12 Y1 HS2 140  0.074  Dominating spurious line
G12 Y1 CND1 092  0.060  Noisy continuum
G15 Y2 020 394  0.179  Reduction error with sky line 5578
G15 Y4 215 093  0.312  Sky absorption at 5578
G15 Y6 081 395  0.052  Sky absorption at 5578
G02 Y5 079 376  0.142  Sky lines not reduced correctly
G12 Y1 AS2 129  0.076  Sky lines not reduced correctly
G12 Y1 GS1 117  0.186  Sky lines not reduced correctly
G15 Y2 008 251  0.145  Sky lines not reduced correctly
G15 Y4 207 028  0.249  Spectral arms peak at splice
G09 Y1 AN1 378  0.201  Very dominant sky absorption
G09 Y1 AX2 247  0.190  Very dominant sky absorption
G09 Y1 CN1 233  0.166  Very dominant sky absorption
G15 Y1 BS2 216  0.135  Very dominant sky absorption
G02 Y5 060 294  0.013  Weird absorption feature (all fibre 294?)
G15 Y2 009 294  0.166  Weird absorption feature (all fibre 294?)
G15 Y2 020 294  0.208  Weird absorption feature (all fibre 294?)
G15 Y3 051 294  0.144  Weird absorption feature (all fibre 294?)
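For readers who wish to work with these appendix tables programmatically, the sketch below shows one possible way to split a row into its SPECID, redshift, and comment. This is a hypothetical helper (the names `ROW_RE` and `parse_row` are our own, not part of any GAMA tooling); it assumes the SPECID is the four whitespace-separated tokens preceding the redshift, and that the redshift is the first decimal number in the row.

```python
import re

# Hypothetical parser for appendix rows such as
#   "G15 Y2 009 099 0.256 Blue end of spectrum diagonal"
# Assumes: SPECID = four tokens before the redshift; z = first decimal number;
# everything after z is the free-text comment (possibly empty).
ROW_RE = re.compile(
    r"^(?P<specid>\S+ \S+ \S+ \S+) (?P<z>\d\.\d+)\s*(?P<comment>.*)$"
)

def parse_row(row: str):
    """Split one table row into (specid, z, comment)."""
    m = ROW_RE.match(row.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {row!r}")
    return m.group("specid"), float(m.group("z")), m.group("comment")

specid, z, comment = parse_row("G15 Y2 009 099 0.256 Blue end of spectrum diagonal")
# specid == "G15 Y2 009 099", z == 0.256
```

Rows without a comment, such as "G09 Y4 215 048 0.194" in Table 7, yield an empty comment string.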