Where is Waldo (and his friends)? A comparison of anomaly detection algorithms for time-domain astronomy
Juan Rafael Martínez-Galarza, Federica Bianco, Dennis Crake, Kushal Tirumala, Ashish A. Mahabal, Matthew J. Graham, Daniel Giles
MNRAS, 1–21 (2020). Preprint 16 September 2020. Compiled using MNRAS LaTeX style file v3.0
Center for Astrophysics | Harvard & Smithsonian
Department of Physics and Astronomy, Sharp Laboratory, University of Delaware, Newark, DE 19716, USA
Biden School of Public Policy and Administration, 184 Graham Hall, University of Delaware, Newark, DE 19716, USA
Data Science Institute, University of Delaware, Tower at STAR, Newark, DE 19713, USA
Division of Physics, Mathematics and Astronomy, California Institute of Technology, Pasadena, CA 91125, USA
School of Physics and Astronomy, University of Southampton, Hampshire, SO17 1BJ, UK
Institute for Astronomy, University of Edinburgh, Royal Observatory, Blackford Hill, Edinburgh, EH9 3HJ, UK
Astronomy Department, The Adler Planetarium, Chicago, IL 60605, USA
Physics Department, Illinois Institute of Technology, Chicago, IL 60616, USA
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Our understanding of the Universe has progressed through deliberate, targeted studies of known phenomena, like the supernova campaigns that enabled the discovery of the accelerated expansion of the Universe, as much as through serendipitous, unexpected discoveries. The discovery of the Jovian moons, and of interstellar objects like 1I/'Oumuamua, forced us to rethink the framework through which we explain the Universe and to develop new theories. Recent surveys, like the Catalina Real-time Transient Survey and the Zwicky Transient Facility, and upcoming ones, like the Rubin Legacy Survey of Space and Time, explore the parameter space of astrophysical transients at all time scales, from hours to years, and offer the opportunity to discover new, unexpected phenomena. In this paper, we investigate strategies to identify novel objects and to contextualize them within large time-series data sets, in order to facilitate the discovery of new objects, new classes of objects, and the physical interpretation of their anomalous nature. We compare tree-based and manifold-learning algorithms for anomaly detection as applied to a data set of light curves from the Kepler observatory that includes the bona fide anomalous Boyajian's star. We assess the impact of pre-processing and feature engineering schemes, and investigate the astrophysical nature of the objects that our models identify as anomalous by augmenting the Kepler data with Gaia color and luminosity information. We find that multiple models, used in combination, are a promising strategy not only to identify novel time series, but also to find objects that share phenomenological and astrophysical characteristics with them, facilitating the interpretation of their anomalous characteristics.
Key words: methods: data analysis, methods: statistical, stars: flare, stars: peculiar (except chemically peculiar)

"An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism" – Hawkins (1980)
Scientific discovery in astronomy can be thought of as happening via two complementary approaches. The question-driven approach attempts to provide answers to questions that have already been conceived based on our present knowledge of existing theories, models, or phenomena. Cosmological supernovae (type Ia) are a good example. They are relatively well-known objects to us in terms of their energetic output, redshift distribution, and spectral properties, and we design surveys with observational parameters fine-tuned to find them, based on those known properties. The exploration approach, on the other hand, attempts to give us the capability to find objects that are unknown, unexpected, or extremely rare, either by expanding the space of observational parameters (for example, by increasing the spatial resolution available to us with a larger telescope) or by employing novel ways to dissect the increasingly complex data sets that are becoming available to us. These unexpected discoveries often require additions or modifications to our theoretical apparatus and, on occasion, force us to formulate new hypotheses and to pose questions we did not have before.

This approach has led to serendipitous discoveries, which are prevalent in astronomy and which have produced significant breakthroughs. Recent examples include the discovery of peculiar light curves in
Kepler data, such as KIC 8462852, commonly known as Boyajian's star (Boyajian et al. 2016), the discovery of the interstellar object 1I/'Oumuamua (Meech et al. 2017), and the detection of quasi-periodic oscillations in the X-ray light curve of the galaxy GSN 069 (Miniutti et al. 2019), among many others. The question then becomes: how do we make serendipity "systematic" (Giles & Walkowicz 2019) in order to increase the chance of discovery in the era of large astronomical data sets? A related question, at the core of this paper, is whether it is possible, once a serendipitous discovery of a remarkable object has been made, to find all analogs to that object in a given data set.

To find analogs of rare objects of interest, we want to find groups of objects in multi-dimensional parameter spaces that separate out from the rest: "outliers" or "anomalies". Outliers are traditionally identified as data points that lie beyond some threshold distance from the bulk of the population of other data points in some representative space. The most common limit is three or five times the population scatter, usually measured by the standard deviation. Assuming a Gaussian distribution, this equates to flagging one in 370 (3σ) or one in 1.74 million (5σ) data points as "outliers". However, there is no reason to think that an arbitrary data set in an arbitrary data space should be described by a Gaussian distribution. In fact, Aggarwal & Yu (2001) have shown that, as the number of dimensions increases, proximity-related measures of similarity between objects become less meaningful, since in a high-dimensional space objects are more isolated from each other and are more likely to show up as proximity-based outliers. This is our motivation to compare methods that rely on proximity measures with those that take a different approach to describe the nature of anomalies.
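The quoted flagging rates follow directly from the two-sided Gaussian tail probability and can be checked with a short aside (our illustration, using scipy; not part of the paper's analysis):

```python
# Sanity check: how often a 3-sigma or 5-sigma clipping rule flags a point
# actually drawn from the bulk Gaussian distribution.
from scipy.stats import norm

for n_sigma in (3, 5):
    p_tail = 2 * norm.sf(n_sigma)      # P(|x - mu| > n_sigma * sigma)
    print(f"{n_sigma} sigma: 1 in {1 / p_tail:,.0f}")
```

For 3σ this gives roughly 1 in 370, and for 5σ roughly 1 in 1.74 million, matching the figures above.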
It also highlights the importance of comparing different approaches to feature selection, since the outlying condition of certain objects will depend on where they live in the specific multi-dimensional feature space constructed and explored.

By construction, many algorithms designed to detect unknown objects, or anomalies, are unsupervised. Supervised learning takes place by the association of specific feature values with known labels of the objects in a training set. Unsupervised learning, on the other hand, does not rely on labeled objects, but on clustering of objects in the feature space. Several unsupervised algorithms for anomaly detection have been used in astronomy, including approaches that use Euclidean proximity/clustering information to isolate anomalies (Giles & Walkowicz 2019; Dutta et al. 2007; Henrion et al. 2013), and approaches that use more complex representations of the data that do not involve their projection into a Euclidean space, such as neural networks and ensemble methods (Baron & Poznanski 2017; Margalef-Bentabol et al. 2020; Druetto et al. 2019). There are also significant differences in the feature engineering aspects of those algorithms. While some of them require a feature extraction step that produces a set of synthetic features to represent the data, others work directly on the data points, whether light curves, spectra, or images.
While each of these methods can perform well for specific data sets, one could ask whether different methods applied to the same data set find the same anomalies.

A comprehensive understanding of the subject of anomaly detection in astronomy can only be achieved once we compare different methods, understand what type of anomalies these methods find as a function of their architecture and of feature engineering choices, and test our results against the selection of specific, bona fide anomalies. In principle, astrophysical anomalies are objects that are rare or previously unobserved, and therefore of utmost interest, or at least instances of known phenomena in extremely rare stages of evolution. Previous work has made significant contributions towards the goal of finding analogs to Boyajian's star and other anomalies. For example, Giles & Walkowicz (2019) apply a clustering method to a set of synthetic features derived from Kepler light curves and demonstrate that their method is capable of identifying anomalies such as Boyajian's star, as well as cataclysmic variables. Schmidt (2019) uses a photometric selection method to isolate analogs of KIC 8462852 by looking for light curve dips in All Sky Automated Survey for Supernovae (ASAS-SN, Kochanek et al. 2017) data, finding about 20 similar objects that deserve follow-up studies.

With a systematic comparison of methods, we aim to provide the most effective recipes to find the most compelling objects and all of their candidate analogs. In this paper we compare and combine three different anomaly detection methods: the Unsupervised Random Forest, and two dimensionality reduction methods combined with proximity clustering, t-SNE and UMAP. We apply these methods to a test data set comprising light curves from the Kepler
Space Observatory, with a general goal that is two-fold.

• First, we want to test whether these methods are successful at finding a bona fide anomaly, Boyajian's star, and to determine which specific features make objects like Boyajian's star anomalous.

• Second, we attempt to answer the question: given a scientifically compelling light curve, such as that of Boyajian's star, an object with truly unusual time behavior whose physical properties and origin are yet to be explained, can we find all of its analogs in a given data set?

We evaluate the performance of each method by studying the location of the bona fide object in a ranked anomaly list. We also associate the anomaly scores derived using each of the three methods with specific features of the light curves. Along the way, we propose an empirical definition of what an anomaly is for a Kepler-like data set, and link the features that make a particular object anomalous to its general astrophysical properties. In order to find the analogs of a particular object, we explore how the features and anomaly scores from different methods can be combined to identify objects that share similar properties with a given object of interest, identified as anomalous. The goal is to establish a
general anomaly detection mechanism to find analogs not only to Boyajian's star, but to any light curve of interest.

The Kepler telescope generated evenly sampled time series. This is possible with a space mission, but impossible with ground-based surveys, due to weather, moon phase, visibility, and, generally, optimization of survey strategies (LSST Science Collaboration et al. 2017). Surveys like the Catalina Real-time Transient Survey (Drake et al. 2012) and the Zwicky Transient Facility (Bellm et al. 2019) are an invaluable reservoir of transients, all measured with unevenly sampled light curves. The Rubin Legacy Survey of Space and Time (Rubin LSST, Ivezić et al. 2019) will generate an unprecedentedly large data set including tens of billions of stars measured with 800 unevenly sampled data points. To test if our results generalize to unevenly sampled data sets, such as the Rubin LSST data, we generate a second data set from the Kepler data by sub-sampling the original time series.

Most of the Kepler targets are stars, a significant fraction of which have Gaia counterparts that allow us to place them in a Hertzsprung–Russell diagram. In this paper, we also explore the correlation between being an anomaly and having specific stellar properties. This allows us to discuss anomalies in terms of their astrophysical properties and to explore the use of anomaly detection to identify rare objects of specific types in large data sets.

This paper is organized as follows. In section 2 we describe the different outlier detection methods used in the paper. In section 3 we describe the data set used to test the performance of those methods, and discuss the impact of different pre-processing choices on the process of finding outliers.
In section 4 we show the results of applying the different outlier detection algorithms to the Kepler light curves. We focus the discussion on how the important features are distributed for "normal" objects and for outliers, and on the potential of combining several anomaly detection methods. Finally, in section 5, we discuss our findings in the context of the astrophysical properties of the anomalous objects. Code and data that support this work are available in a GitHub repository.

In this section we briefly introduce the three anomaly detection methods we employed in our analysis.
We describe two tree-based ensemble methods for anomaly detection: Isolation Forest (IF) and Unsupervised Random Forest (URF). The first method is used here to measure the impact of pre-processing, while the latter leads to the successful identification of our bona fide anomaly, and its results are further investigated.

In order to understand how similarity is defined in the IF and URF methods, we ought to briefly describe the basics of Random Forest (RF, Ho 1995) classification (the code for this work is available at https://github.com/fedhere/DtUhackOutliers). A RF is an ensemble of decision trees. Each of these trees is a model in which the final prediction is based on a series of comparisons of the values of the predictors (features) against threshold values. Every tree, therefore, corresponds to a partition of the feature space by axis-aligned lines, where the class of each final partition is given by the class of the majority of objects in that partition. These methods are inherently "greedy" and cannot exhaustively explore the parameter space (consider, for example, that a single continuous variable offers an infinite number of binary splits), and are therefore subject to overfitting. As a result, decision trees have high variance (different trees give different results). Ensemble approaches reduce the variance of these methods: each tree in a RF uses a random subset of objects from the training set and a subset of features. The forest then predicts the class of an object as the majority-vote prediction of all the trees in the ensemble.

Isolation Forests (IF) are ensemble methods based on Isolation Trees (Liu et al. 2012). An IF is a forest of isolation trees, each of which partitions the data set based on randomly selected features and randomly selected splitting points for each feature. The anomaly score is proportional to the number of splits required to isolate an object (a smaller score indicates a more anomalous object; negative scores indicate "outliers"), averaged over all trees in the forest.
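This scoring can be sketched with scikit-learn's IsolationForest (a minimal illustration on random stand-in data, not the paper's Kepler pipeline; the contamination value is the one adopted in this work):

```python
# Minimal Isolation Forest sketch: each row is a feature vector (the paper
# uses the flux values themselves as features). Values are random stand-ins.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # 500 "light curves", 20 features
X[0] += 8.0                               # plant one obvious outlier

clf = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
clf.fit(X)
scores = clf.score_samples(X)             # lower score = more anomalous

# The planted outlier should rank as the most anomalous object.
print(int(np.argmin(scores)))             # -> 0
```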
Anomalies require fewer partitions, i.e., they are easy to isolate. The average number of partitions over a large number of random trees can therefore be considered a measure of similarity to the bulk of the data. We use the scikit-learn (Pedregosa et al. 2011) implementation (Buitinck et al. 2013) of this method and adopt the default parameters: the default number of samples per tree (max_samples) and 100 trees. The "contamination" parameter sets the expectation for the fraction of objects in the set that are outliers, and it is set to contamination=0.1. These parameters were demonstrated in Liu et al. (2012) to be effective under a large range of circumstances. The IF anomaly score is intuitively interpretable. For this reason, we choose this method, applied to features as close as possible to the data (the flux values themselves), to examine the impact of pre-processing choices on anomaly detection schemes.

The unsupervised random forest (URF) method for anomaly detection was first introduced by Shi & Horvath (2006) in the context of tumor discovery, using data sets comprising tumor marker expressions, and has been adapted for the first time in astrophysics to look for anomalous objects in a large (∼
distribution has been washed out by the random sampling. The similarity score is then used in a second stage to find those objects that are most dissimilar with respect to the bulk of the data. In the context of a multi-dimensional feature space, the weirdness of an object can be thought of as the average of the pair-wise dissimilarity between that particular object and all the other objects in the data set. Note that this dissimilarity measure is not necessarily associated with distance in a Euclidean space, because objects can be isolated in only a few of the many possible dimensions of the feature space, and also because of the random nature of the forest. Once the RF is trained, the original data set alone is propagated through the forest model, and the similarity between two given objects in the data set is measured as a normalized count of the trees in which the two objects ended up in the same leaf node and with the same class. Formally, the dissimilarity between the i-th and j-th objects in the data set can be calculated as:

$D_{ij} = 1 - N_{\rm leaf} / N_{\rm tree}$,   (1)

where $N_{\rm leaf}$ is the number of trees for which the (i, j) pair were both classified as real in the same leaf node, and $N_{\rm tree}$ is the total number of trees. The weirdness of object i is then the average value of $D_{ij}$ over all possible pairings of i with all other objects j in the data set.

In this work, we implement the URF using the scikit-learn (Pedregosa et al. 2011) architecture, following the general recipe described in Baron & Poznanski (2017). The URF has several important hyper-parameters that require fine-tuning in order to avoid over-fitting (and which of course depend on the features fed to the model, see subsection 3.2). Among the most important hyperparameters are the number of trees in the forest ($N_{\rm tree}$) and the maximum depth of each tree, max_depth, which is the length of the longest path from the root of the tree to a leaf.
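The similarity computation of Eq. (1) can be sketched as follows (our illustration, not the authors' code: the synthetic class is built by shuffling each feature column independently, following the general URF recipe, and for simplicity the sketch counts only leaf co-occurrence, without the same-class condition):

```python
# Illustrative URF similarity/weirdness sketch on random stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_real = rng.normal(size=(300, 10))

# Synthetic class: same marginal distributions, joint structure washed out.
X_synth = np.column_stack([rng.permutation(col) for col in X_real.T])

X = np.vstack([X_real, X_synth])
y = np.r_[np.ones(len(X_real)), np.zeros(len(X_synth))]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Leaf index of every real object in every tree.
leaves = forest.apply(X_real)                     # shape (n_real, n_trees)

# Pairwise similarity: fraction of trees in which i and j share a leaf.
S = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
D = 1.0 - S                                       # dissimilarity D_ij
weirdness = D.mean(axis=1)                        # average over all pairs
```

Objects with the largest `weirdness` values are the candidate anomalies.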
We performed the tuning of these and other parameters, such as the number of predictors to randomly select at each split and the minimum leaf node size, through a validation process that involved a grid search and an 80-20 training-test split. We corroborated that, shortly after reaching our adopted values of $N_{\rm tree}$ and max_depth, the accuracy stops improving; we did not go beyond these values in order to avoid significant overfitting.

Next, we describe two related manifold learning methods: t-SNE and UMAP. These methods are designed to visualize high-dimensional data in a lower-dimensional (typically two-dimensional) space where reciprocal distances in the high-dimensional space are (optimally) preserved. As we detail in subsection 4.2, an anomaly score can be defined by looking at the distribution of distances to the nearest neighbor in the converged distribution for both the t-SNE and UMAP methods. This is the anomaly score that we use in our subsequent analysis.

The t-distributed Stochastic Neighbor Embedding method, or t-SNE, was introduced by Maaten & Hinton (2008), improving upon the well-known nonlinear dimensionality reduction algorithm SNE (Hinton & Roweis 2003). SNE works by converting multidimensional Euclidean distances into conditional probabilities that represent the similarities between data points. In other words, suppose we have a data point $x_i$ in the high-dimensional space. Consider a normal distribution of distances from $x_i$, wherein points near $x_i$ have a higher probability density under the distribution and farther points have a lower probability density. The similarity between $x_i$ and another data point $x_{i'}$ is then the conditional probability $p_{x_{i'}|x_i}$ that $x_i$ will choose $x_{i'}$ as a neighbor under the normal distribution just described. We then replicate the process for the lower-dimensional space, obtaining another set of conditional probabilities.
SNE then attempts to minimize the Kullback-Leibler (KL) divergence (or relative entropy; Kullback & Leibler 1951) between the two probability distributions using gradient descent. However, SNE is computationally very expensive, largely because of the asymmetry imparted by the use of the KL divergence as the distance metric; t-SNE attempts to resolve this issue by looking at a "symmetric" SNE, specifically a symmetric version of the cost function with similarly simple gradients. t-SNE also redefines the lower-dimensional distribution using a Student t-distribution in place of the Gaussian distribution, to solve the crowding problem, which stems from the fact that there is not enough area in a two-dimensional plot to accurately embed distances between points that are close, leading to loss of information.

Uniform Manifold Approximation and Projection (UMAP) is a nonlinear dimensionality reduction technique introduced recently in McInnes et al. (2018). Most dimensionality reduction techniques have a very similar structure: they aim to find some low-dimensional representation of data that minimizes information loss relative to the high-dimensional data set. UMAP works as follows: say we have data points $x_1, \ldots, x_n$. We create a k-neighbor weighted graph by considering the k neighbors of each $x_i$, and adding an edge to the graph with a defined weight $w$ that depends on the diameter of the k-neighborhood of $x_i$ and the distance between $x_i$ and its closest neighbor. Note that the weight, as defined in McInnes et al. (2018), is not symmetric. This is handled by, given $a = w(x_i, x_j)$ and $b = w(x_j, x_i)$, defining a new symmetrized weight $w'(x_i, x_j) = a + b - ab$. The same process is then repeated in the lower-dimensional space, resulting in a new weight function for the lower-dimensional space.
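The embedding-plus-nearest-neighbor scoring used with these manifold methods can be sketched with scikit-learn's t-SNE (UMAP, via the umap-learn package, is used analogously); the data and hyper-parameters here are illustrative stand-ins, not the paper's:

```python
# t-SNE embedding followed by a nearest-neighbour anomaly score.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))          # stand-in feature vectors

# 2-D embedding that (approximately) preserves reciprocal distances.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Anomaly score: distance to the nearest neighbour in the embedded plane
# (column 0 of the returned distances is the point itself).
dists, _ = NearestNeighbors(n_neighbors=2).fit(emb).kneighbors(emb)
score = dists[:, 1]                     # larger = more isolated

top10 = np.argsort(score)[::-1][:10]    # most anomalous candidates
```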
Then UMAP minimizes the cross-entropy between the two weight functions, as specified by the cost function, so that the lower-dimensional weights encapsulate (as closely as possible) the information carried by the higher-dimensional weights. As in t-SNE, this optimization is done via stochastic gradient descent. For more specifics, we refer the reader to McInnes et al. (2018).

In this section, we describe the data set used for the comparison of the anomaly detection methods, as well as the
[Figure 1 panels: "Full dense", "Detail dense", "Detail sparse"; y-axis: Normalized flux; x-axis: Time [BKJD]; targets kplr008880472, kplr008108932, kplr009051076.]
Figure 1.
Examples of normalized Kepler light curves used in this paper. The left panels show the full 85-day-long Q16 dense light curves for three different Kepler targets. The middle panels show a detail of the same light curves plotted over a period of 10 Julian days. The right panels show the same detail, but this time for the corresponding sparse version of the light curves. The corresponding Kepler source IDs are shown. Note the data gap early on in each light curve, due to missing data in the Kepler archive.

feature extraction approaches used prior to the application of those methods to the resulting features.
The data set considered here comprises 2500 light curves from the Kepler telescope archive. These light curves were obtained from the Mikulski Archive for Space Telescopes (MAST) by performing a random search of light curves in Quarter 16. They are already detrended from spacecraft effects and have a uniform cadence, with one photometry point obtained every 30 minutes, and cover a total period of approximately 85 days, starting at Barycentric Kepler Julian Date (BKJD) 1472 and ending at BKJD 1558, for a total of 3534 measurements per light curve. We used the most up-to-date pipeline-processed data (PDCsap flux), according to the Kepler Data Release Notes and the Kepler Data Processing Handbook (Jenkins 2017), and applied minimal processing of the light curves ourselves (but see the discussion on pre-processing in subsection 3.3). Fluxes are given in relative flux units, as provided in the standard Kepler detrended light curves delivered by the MAST archive. We normalized the fluxes to have a mean value of 1, i.e., we divided them by their unnormalized mean value. These normalized fluxes are not standardized, i.e., their standard deviations are those resulting from this normalization process, without any further adjustments. We do not expect the latter to have a large impact on our results, as the differences in standard deviations due to varying distances to the targets are typically smaller than any measurable signal, as our results in subsection 5.2 show. In Figure 1 we show a small sample of the light curves studied here, in order to illustrate the wide range of variability properties that Kepler targets have.

We have limited this study to only 2500 light curves since this is essentially a proof of concept.
Rather than finding all anomalies in the entire Kepler data set, or in similar data sets such as TESS (which will be the topic of a subsequent paper), here we want to demonstrate the strengths and weaknesses of each method, to assess which specific light curve properties affect the anomaly score in each case, and to learn how to identify groups of light curves with similar properties. Our sample size (about 2% of the entire Kepler data set) contains enough information about light curve diversity, while at the same time providing a manageable number of objects, thus allowing us to quickly identify patterns and associations.

In order to assess the effect of uneven sampling and sparsity of the light curves on our ability to find anomalies, we have also produced a separate data set of 2500 degraded light curves by subsampling the full light curves at random positions in time, keeping only 10% of the data points. The selected time stamps are the same for all light curves. We refer to this second set of light curves as the sparse set, as opposed to the dense set of the original light curves. Three sparse light curves are shown in Figure 1, right panel.
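The normalization and degradation steps described above can be sketched as follows (illustrative random arrays with the paper's shapes, not the real Kepler fluxes):

```python
# Normalize each light curve to unit mean, then keep a random 10% of epochs,
# using the same epochs for every light curve.
import numpy as np

rng = np.random.default_rng(0)
n_curves, n_points = 2500, 3534
flux = rng.normal(loc=1000.0, scale=5.0, size=(n_curves, n_points))

dense = flux / flux.mean(axis=1, keepdims=True)   # mean flux of 1 per curve

keep = np.sort(rng.choice(n_points, size=n_points // 10, replace=False))
sparse = dense[:, keep]                           # shared time stamps
```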
Extracting features from astronomical light curves (and, more generally, from time series and other one-dimensional data) is a crucial aspect of setting up a successful classification or outlier detection algorithm and, in general, any machine learning model, either supervised or unsupervised. The decision regarding which features to use should be based upon the predictive power of those features and on how efficiently they can be used to split the data set into multiple classes, or, in other words, on how well the set of features represents the different types of variability of the original data set. The choice of the best features to use is far from obvious.

In the particular case of supervised tree methods, since at each step the model deals with only a single feature, and features are never combined in a mathematical sense, the predictions are generally robust to covariance. Consider the case where two features are completely collinear: the trees in the forest will use one or the other based on an initial random selection. When encountering the second feature, the trees will disregard it, since the information carried by the feature has already been used in splits based on the other, collinear feature. However, this affects the feature importance evaluation. Trees assess the importance of features based on how much each feature contributes to splitting. In the collinear example, the second feature does not ultimately contribute, and therefore its covariance with other features has suppressed its importance. We must therefore be careful about selecting the features, not only to maximize the efficiency of our methods, but especially if we intend to use feature extraction to physically characterize the objects. (A file, HR-Observables.npy, listing the Kepler IDs of all light curves used in this study, is available at https://github.com/juramaga/kepler_anomalies_urf.)

Feature extraction is particularly challenging when the sampling of a time series is sparse and uneven, especially if
Extracting features from astronomical light curves (andmore generally from time series and other one-dimensionaldata) is a crucial aspect in setting up a successful classifi-cation or outlier detection algorithm, and, in general, anymachine learning model, either supervised or unsupervised.The decision regarding which features to use should be basedupon the predictive power of these features and on how ef-ficiently they can be used to split the data set into multipleclasses, or in other words, on how well this set of featuresrepresent the different types of variability of the originaldata set. The choice of the best features to use is far fromobvious.In the particular case of supervised tree methods, sinceat each step the model deals with only a single feature andfeatures are never combined in a mathematical sense, thepredictions are generally robust to covariance. Consider thecase when two features are completely collinear: the trees inthe forest will use one or the other based on an initial randomselection. When encountering the second feature the treeswill disregard it since the information carried by the featurehas already been used in splits based on the other, collinearfeature. However, this affects the feature importance evalu-ation. Trees assess the importance of features based on howmuch each feature contributes to splitting. In the covariantexample, the second collinear feature does not ultimatelycontribute and therefore its covariance with other featureshas suppressed its importance. 
We must therefore be careful about selecting the features, not only to maximize the efficiency of our methods, but especially if we intend to use feature extraction to physically characterize the objects. (In https://github.com/juramaga/kepler_anomalies_urf a file, HR-Observables.npy, is available that lists the Kepler IDs of all light curves used in this study.)

Feature extraction is particularly challenging when the sampling of a time series is sparse and uneven, especially if the sampled epochs are different for different light curves. In that case, any phase information is lost, and the individual photometric measurements are rendered much less useful for the analysis. Yet this is the case for many existing astronomical time domain surveys (SDSS, York et al. 2000; ASAS-SN, Kochanek et al. 2017; etc.), and will be the case for upcoming surveys such as the Rubin Legacy Survey of Space and Time (Rubin LSST, Ivezić et al. 2019). Additionally, depending on the number of light curves to analyze and the specific features to be extracted, the process of feature extraction can be computationally expensive. It is therefore desirable to use feature extraction methods that do not require heavy processing.

A number of feature extraction packages are available for time series. One such code is the Feature Analysis for Time Series (FATS) package (Nun et al. 2015), which is able to evaluate over 40 features, taking the epochs, magnitudes, magnitude errors, and filters of a time series as input, and extracting statistical features such as means, standard deviations, linear trends, variability indexes, skewness, kurtosis, etc. One issue with this approach to feature extraction is that not all of the included features are properly defined for all light curves, since there might be missing data, a different number of elements, etc. As a result, these features will be undefined or unreliable for at least a significant fraction of the light curves.
Therefore, using them as the input for an anomaly detection algorithm will significantly bias the result towards unreliable artifacts. Rather than relying on second-order features, it seems more appropriate to remain as close as possible to the data themselves, therefore avoiding any biases introduced by the extraction process. In this paper, we have chosen to use the light curve points themselves, the power spectrum of the time series (based on the Fourier transform of the light curve), or a combination of both as the input features. Using the power spectrum (periodogram) is desirable especially for light curves that do not share a common phase reference. Periodogram analysis can also be used to find the best period for periodic variables. However, period-finding itself is a significant challenge (Graham et al. 2013), and therefore we do not use periods, but the periodogram values themselves, as the features. The periodogram of a light curve consists of its Fourier transform, and it measures the signal power as a function of angular frequency. Lomb (1976) and Scargle (1982) generalized the concept of the periodogram for the case of unevenly sampled data. Their normalized periodogram can be written as:

P_N(\omega) = \frac{1}{2\sigma_y^2} \left[ \frac{\left[\sum_k (y_k - \hat{y}) \cos\omega(t_k - \tau)\right]^2}{\sum_k \cos^2\omega(t_k - \tau)} + \frac{\left[\sum_k (y_k - \hat{y}) \sin\omega(t_k - \tau)\right]^2}{\sum_k \sin^2\omega(t_k - \tau)} \right],   (2)

where \sigma_y^2 is the variance of the photometry y_k, \hat{y} is its mean, and \tau is a time offset that orthogonalizes the model and makes the expression independent of a time translation. As demonstrated in Lomb (1976), this expression is fundamentally equivalent to estimating the harmonic content given a least-squares fit to a sinusoidal model consisting of a single component.
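Equation 2 can be implemented directly in numpy; the sketch below is for illustration (the function name and grids are ours, and the normalization, with the 1/(2σ_y²) prefactor, follows Scargle 1982):

```python
import numpy as np

def lomb_scargle_norm(t, y, omegas):
    """Normalized Lomb-Scargle periodogram (Equation 2).
    t: epochs (possibly unevenly sampled); y: fluxes; omegas: angular freqs."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()              # y_k - y_hat
    var = y.var()                  # sigma_y^2
    power = np.empty(len(omegas))
    for i, w in enumerate(omegas):
        # tau orthogonalizes the sine and cosine terms (Lomb 1976)
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = ((yc @ c) ** 2 / (c @ c) +
                    (yc @ s) ** 2 / (s @ s)) / (2 * var)
    return power
```

For a pure sinusoid, the periodogram peaks at the input frequency even when the sampling is uneven, which is the property that makes it usable as a phase-independent feature set.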
It can therefore easily be computed using a fast Fourier transform. Given a light curve, we can evaluate the periodogram in Equation 2 for a discrete number of frequencies, and use the values of the periodogram evaluated at each of these frequencies as the features for our classifier. This is conceptually similar to using the individual pixels in a spectrum as the features, except that by using the same array of frequencies for all light curves, we can compare them against the same absolute reference. This approach can also be used for multi-band, non-simultaneous light curves, since a different periodogram can be calculated for each filter, and a final array of features can be obtained by concatenating the single-band periodograms.

We have extracted periodograms for all light curves to cover a logarithmic range of frequencies corresponding to periods between one hour (twice the Kepler cadence) and 90 days (the approximate duration of the observations). This range covers not only the periods sampled by the observations, but also the typical timescales of different stellar variability phenomena. For example, an HST survey of the variability properties of luminous ( M_I < − ) stars in M51 (Conroy et al. 2018) finds that the variability fraction for these is ∼ , with many stars showing typical timescales between 1 and 100 days. More generally, the most common pulsating variable stars have characteristic timescales that range from a few minutes to a couple of years (Eyer & Mowlavi 2008). In Figure 2 we show examples of the periodograms computed for a sub-sample of relatively normal light curves in the Kepler data set, together with the periodogram of Boyajian's star, our bona-fide anomaly. Note that, with respect to the periodogram of Boyajian's star, "normal" light curves have flatter, featureless power spectra.
We have constructed a feature vector for each light curve composed of the light curve points themselves, augmented with these periodograms, for a total of 6534 features per light curve. One further feature extraction method, the DMDT representation proposed in Mahabal et al. (2017), is adopted for the manifold learning-based anomaly detection (see subsection 2.2) and is described in subsection 4.2.
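A sketch of assembling such a feature vector, using scipy's `lombscargle` for the power spectrum (the helper name and grid size are illustrative, not the exact values used in the paper):

```python
import numpy as np
from scipy.signal import lombscargle

def light_curve_features(t, flux, n_freq=100, p_min=1 / 24.0, p_max=90.0):
    """Concatenate the normalized light-curve points with a periodogram
    evaluated on a log-spaced grid of periods (~1 hour to ~90 days)
    shared by all light curves, so features line up across objects."""
    t = np.asarray(t, dtype=float)
    flux = np.asarray(flux, dtype=float)
    norm = (flux - flux.mean()) / flux.std()      # standardized fluxes
    # common angular-frequency grid: same absolute reference for every object
    omegas = 2 * np.pi / np.logspace(np.log10(p_min), np.log10(p_max), n_freq)
    power = lombscargle(t, norm, omegas)
    return np.concatenate([norm, power])
```

Because every object is evaluated on the same frequency grid, the resulting vectors can be stacked into a single matrix and fed to any of the anomaly detection methods discussed below.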
Anomaly detection methodologies identify sources whose properties stand out with respect to the population of all other objects in a data set. These sources might represent examples of unknown types of objects, in which case they require the formulation of new physical hypotheses to explain their properties, or instances of rare evolutionary stages of known types. In both cases, they represent an expansion of our discovery space.

But anomaly detection can also reveal instrumental or processing artifacts. In the specific case of light curves, such artifacts can include spurious trends in the light curve baselines or bad pixels that result in artificial spikes or dips in the time series. Therefore, in order to characterize the effect of data artifacts on our analysis, we explore how different pre-processing approaches affect our ability to find anomalies. Consider the results of applying anomaly detection algorithms to our set of normalized light curves both before and after further pre-processing. In this case, we use the Isolation Forest (IF) method (section 2). By comparing the
Figure 2.
Example periodograms for ten Kepler Q16 light curves, with Boyajian's star indicated in thick blue. Note the particular structure of Boyajian's periodogram, indicating the pronounced dips in the light curve. The logarithmic frequency range covers periods from 1 hour to about 90 days. All periodograms are obtained over the same array of frequencies.

anomaly scores of the pre-processed light curves with those of the original light curves, we expect to learn to what extent instrumental artifacts are picked up as anomalies, and what is the best way to remove those artifacts in order to identify astrophysical anomalies.

In Figure 3 we show the 25 most anomalous light curves, and the 25 least anomalous light curves, according to a similarity score derived from the IF method. The fact that prominent spikes consisting of a single bright datapoint are present in all of the 25 least anomalous light curves raises an alarm. These spikes can be due to artifacts (e.g. cosmic rays), in which case they would not have an astrophysical cause, or they may be very high energy events, such as stellar flares (e.g.
Paudel et al. 2019). Yet they dominate the information content in the standardized light curves, and therefore the similarity score of objects, potentially hiding less prominent features that contain real astrophysical information. We note that we obtain qualitatively similar anomaly scores with the IF and the URF (subsubsection 2.1.2) when the methods are applied to the standardized light curves: namely, the least anomalous light curves all show prominent spikes.

We investigate the sensitivity of the anomaly score to pre-processing by removing these spikes using two different approaches, and then comparing the anomaly scores obtained for the original (normalized) and both sets of pre-processed light curves. The first spike-removal approach involves low-pass filtering: we take the light curve's rolling average over a window of 10 points (10% of the light curve), thereby suppressing any high-frequency variations. We call this set of low-pass filtered light curves the
LPlcv ’s. The second approach involves directly removing the outlying datapoints by replacing every point outside of a 3 σ deviation range from the light curve with the corresponding value of an LPlcv generated with a window size of 15 points. We refer to this last set of pre-processed time series as
Cleanlcv ’s. A subset of 30 light curves for each pre-processing choice (original normalized light curves, LPlcv ’s, and Cleanlcv ’s) is shown in Figure 4.

The anomaly scores of the normalized light curves do not strongly correlate with the scores for either of the two pre-processing methods (left and center panels, Figure 5). The Pearson's r correlation coefficients between each of the two sets of pre-processed light curves and the unprocessed version are r ∼ −0.10 and r ∼ −0.16: a weak but statistically significant inverse proportionality for Cleanlcv and
LPlcv respectively, at a 3 σ level ( p -value < . ); the method does measure related anomaly scores for all sets. On the other hand, the two pre-processed light curve sets result in very similar anomaly scores (right panel, Figure 5). In this case, the Pearson's r value is 0.96. Although the IF-reported "outlier" threshold, where the anomaly score becomes negative, is somewhat arbitrary (for example, the decision_function does not take into account the contamination parameter; see https://github.com/scikit-learn/scikit-learn/issues/8693), it is worth noting that all scores are positive for the unprocessed light curve set, while the pre-processed light curves yield anomalies, consistently with the implicit request set up in the choice of parameters ( contamination=0.1 ).

Altogether, this demonstrates that single spikes dominate the model's decision when assigning an anomaly score, and may thus hide important anomalous behavior of lesser flux amplitude: truly anomalous light curves (or at least those without the most obvious data artifacts) are only found by these methods when such artifacts have been effectively removed. Unless flares and similar short-lived, high-significance events are the specific scientific goal, such features should be removed to reveal more subtle anomalies.

We further investigate the scores in order to understand why the presence of spikes is associated with a low anomaly score in the original light curve set. We look at the correlation of the anomaly scores for each pre-processing scheme with three simple light curve statistics: the standard deviation ( σ_t ) of the original light curve, the standard deviation of the low-pass filtered version ( σ_tLP ), and the normalized flux range over the standard deviation, R_SNR = [max(f_t) − min(f_t)] / σ_t. This last statistic is significantly impacted by the presence of spikes. We find a strong correlation between the score obtained on the
Figure 3.
Standardized, dense, partial (5-day) Kepler Q16 light curves with corresponding normality scores derived from the Isolation Forest method.
Upper panel:
The 25 most normal light curves.
Bottom panel: the 25 least normal light curves. The normality score is reported for each light curve: a larger value indicates a more normal light curve. The highest and lowest datapoints in each light curve are marked by blue dots. The anomaly score is dominated by single-point events, which may be due to artifacts or to extremely high energy, short duration events such as flares; these are rare, and therefore arguably should be contributing to the anomaly estimate.

Figure 4.
Example light curves after pre-processing using three different approaches.
Top panel: standardized light curves, as provided by the Kepler pipeline, mean-subtracted and divided by the light curve standard deviation.
Middle panel: LPlcv ’s: low-pass filtered versions of the light curves in the top panel, filtered with a rolling window of five hours.
Bottom panel: Cleanlcv ’s: cleaned light curves, in which the spikes have been replaced by the local mean calculated with a rolling window of five hours.

normalized time series and R_SNR (Pearson's r ∼ . , p -value < . ; r ∼ − . and 0.04, respectively). Conversely, we find a strong anti-correlation of the score of the pre-processed time series, both LPlcv and
Cleanlcv , with the standard deviation of the original light curve and of the pre-processed versions (all p -values < . ), as well as with the R_SNR metric; but for the latter, the distribution is far more even, still with a significant lack of inliers at high R_SNR.

In the presence of many light curves with significant spikes, therefore, the least anomalous objects are largely dominated by these artifacts, as they are not easily isolated in the multi-dimensional space of all light curve points. Once the spikes are removed, however, the least anomalous objects are those with smaller flux variance and a white noise spectrum. More severe perturbations of the light curves are then associated with anomalies. Pre-processing of light curves to remove artifacts is, therefore, a necessary step in anomaly detection analysis of light curves. Alternatively, anomaly detection can be leveraged as an effective way of identifying data artifacts and high-significance spikes.

The results in this section are reproducible: the data and code supporting this section can be found on the project GitHub repository at https://github.com/fedhere/DtUhackOutliers/blob/master/scripts/IsolationsForestPreprocessing.ipynb

In this section we present the results of applying the outlier detection methods detailed in section 2 to the set of features described in the previous section. For each of the methods, we present results for both the dense and sparse versions of the Kepler Q16 light curves.
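The pre-processing and scoring steps above can be sketched as follows (a minimal illustration on synthetic light curves; the window sizes follow the text, while all other names and values are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def preprocess(flux, lp_window=10, clean_window=15, nsigma=3.0):
    """The two spike-removal schemes described in the text: a low-pass
    filtered light curve (rolling mean over 10 points, "LPlcv") and a
    cleaned one where >3-sigma outliers are replaced by a local mean
    computed with a 15-point window ("Cleanlcv")."""
    s = pd.Series(np.asarray(flux, dtype=float))
    lp = s.rolling(lp_window, center=True, min_periods=1).mean()
    local = s.rolling(clean_window, center=True, min_periods=1).mean()
    outlier = (s - s.mean()).abs() > nsigma * s.std()
    clean = s.where(~outlier, local)
    return lp.to_numpy(), clean.to_numpy()

# synthetic stand-in for the light curve set: white noise plus sharp spikes
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 100))
raw[rng.integers(0, 500, 50), rng.integers(0, 100, 50)] += 20.0
clean = np.array([preprocess(lc)[1] for lc in raw])

# score the cleaned set with an Isolation Forest, as in the text
clf = IsolationForest(n_estimators=200, contamination=0.1, random_state=0)
scores = clf.fit(clean).decision_function(clean)  # higher = more "normal"
rsnr = (raw.max(axis=1) - raw.min(axis=1)) / raw.std(axis=1)  # R_SNR statistic
```

With `contamination=0.1`, the `predict` threshold flags roughly 10% of the set as anomalous, while `decision_function` returns the underlying continuous score discussed in the text.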
The URF method has been described in subsubsection 2.1.2. We have used the individual light curve points as the input features, augmented with the power values of the periodogram, as described in subsection 3.2 (Equation 2). We processed the light curves by dividing them by the light curve mean (but not modifying the standard deviation) and by removing data artifacts, such as bad pixels and sharp peaks, using the low-pass filtering technique described in subsection 3.3. The hyper-parameters of the URF were tuned to maximize classification accuracy in training: the accuracy in discerning real from synthetic light curves (see subsubsection 2.1.2) is maximized with N_tree = . Propagating the original data set through the forest, we then estimate final weirdness scores for each of the 2500 light curves, in both the dense and sparse scenarios. The total number of features used for the dense light curves was 6534, whereas for the sparse light curves it was 637. Here we present the resulting weirdness scores, the feature distribution as a function of weirdness score, and an association between weirdness scores and properties of the light curves; i.e., we try to answer the question: what makes a Kepler light curve anomalous?

Having optimized the parameters as described, we ran the light curves through our model 20 times to measure its stability. The weirdness scores are not identical, but very similar among runs, confirming that the model is indeed stable. We then construct a single weirdness ranking by averaging the weirdness score of each object across runs. The resulting list is the final output of our analysis.

In Figure 7 we show the histogram of weirdness scores for all 2500 dense light curves, averaged over 20 realizations. A large bump of relatively normal light curves peaks at a weirdness of about 0.62, whereas a smaller, sharper peak of weird objects is seen at a weirdness of about 0.97. About 15% of the objects populate the "weird" peak, with weirdness scores higher than 0.95.
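A weirdness score of this kind can be sketched as follows, following the general unsupervised random forest recipe (train a forest to separate the real data from a synthetic set drawn from the shuffled marginals, then score objects by how rarely they share leaves with the rest of the data); the implementation details below are illustrative, not the paper's exact code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def urf_weirdness(X, n_trees=200, seed=0):
    """Sketch of an Unsupervised Random Forest weirdness score for a
    feature matrix X (one row per light curve)."""
    rng = np.random.default_rng(seed)
    # synthetic data: each column sampled independently from its marginal,
    # which destroys the correlations present in the real data
    synth = np.column_stack([rng.permutation(col) for col in X.T])
    Xall = np.vstack([X, synth])
    y = np.r_[np.ones(len(X)), np.zeros(len(synth))]
    rf = RandomForestClassifier(n_estimators=n_trees,
                                random_state=seed).fit(Xall, y)
    leaves = rf.apply(X)                  # (n_objects, n_trees) leaf indices
    n = len(X)
    weird = np.empty(n)
    for i in range(n):
        # proximity of object i to every object: fraction of shared leaves
        prox = (leaves == leaves[i]).mean(axis=1)
        weird[i] = 1.0 - (prox.sum() - 1.0) / (n - 1)
    return weird
```

Objects that rarely co-occur in leaves with the rest of the sample (i.e., that are easily isolated in the feature space) receive weirdness scores close to 1, while objects in dense clusters score lower.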
Boyajian's star itself has a weirdness score coinciding with the 0.97 peak, which puts it in the list of the top 5% most anomalous objects. Our goal here is to understand what drives the weirdness, i.e., to find which features and which feature thresholds can be used to differentiate normal from anomalous objects, and to evaluate whether objects of a certain kind, e.g., with Boyajian's star-like behavior, can be singled out in large data sets based on this type of analysis. More specifically, we want to answer the following question: given a light curve with certain properties (e.g., the prominent irregular dips in Boyajian's star's light curve), can we find all light curves in a data set that are like that light curve of interest?

The total range of derived URF weirdness score values for a given data set (in Figure 7 this range corresponds to about ( . , . )) is related to how clustered together the data set objects appear in the multi-dimensional space of the chosen features. If an object has a weirdness score close to 0,
Figure 5.
Comparison of anomaly scores produced by the Isolation Forest method for light curves pre-processed differently: standardized, low-pass filtered, and cleaned by replacing 3 σ outliers with a local mean value. The correlation between the scores generated for standardized light curves and for light curves where sharp features are removed is low and inverse (Pearson's r is reported in each panel), regardless of whether sharp features are low-pass filtered or cleaned by replacing the values with a local mean. Conversely, the specific choice of pre-processing (by low-pass filtering or by replacement) has little influence on the anomaly score. (Regions of high point density are visualized as contours instead of scatter points.)

Figure 6.
The SNR of the light curves, measured as R_SNR = [max(f_t) − min(f_t)] / σ_t, plotted against the anomaly score measured by the Isolation Forest model for standardized light curves, light curves that have been low-pass filtered, and light curves where spikes are replaced by the value of the local mean. The dependency of the score on R_SNR is weak for the standardized light curves, and significant for the low-pass filtered and cleaned versions.

that means that the majority of objects in the data set have very similar features. On the other hand, if an object has a weirdness score close to 1, that means that only a small fraction of the data set objects have all their features resembling those of the object in question. This object is typically well isolated in the multi-dimensional space of features, and therefore it represents an anomaly. In the particular case of the Kepler data set presented here, the fact that no light curve has a weirdness score lower than ∼ . indicates that there is not a single cluster of "normal" objects. Rather, objects are more sparsely distributed in the feature space, with different "types" of normality. However, the clearly bi-modal distribution of the weirdness scores indicates that there is a minority group of objects that are significantly distinct and isolated with respect to all the other objects. These are the anomalous objects that we are after.

An animation showing the evolution of light curve and periodogram shapes as a function of increasing weirdness score helps us understand what drives the URF measure of weirdness. In Figure 8 we show four snapshots of this animation. Each of the panels shows ten light curves (left) and ten periodograms (right) of objects with similar scores, centered at some representative values: ∼ . , ∼ . , ∼ . (this corresponds to the location of Boyajian's star), and ∼ . .
Boyajian's star is shown in thick blue for reference.

The population of objects with relatively low weirdness scores, corresponding to the first peak in the distribution (see Figure 7), is dominated by flat light curves with small ( < . ) flux fluctuations. They show no frequency structure, and can be thought of as constant fluxes with a white noise component. As we move into the plateau seen in the distribution at weirdness scores of about 0.85 (second row from top in Figure 8), the normalized flux fluctuations are

(The animation is available at http://bit.ly/kepleranomalies_evolution)

Figure 7.
A histogram of weirdness scores measured by the Unsupervised Random Forest for dense Kepler Q16 light curves. The vertical line indicates the location of Boyajian's star.

similar in magnitude to the ones observed in the first peak, but periodic modulations start to appear, with characteristic periods of about 1 day, represented by visible harmonics in Fourier space. The distribution of power is no longer uniform across frequencies, with more power now contained at low frequencies. As we approach the peak of anomalous objects, deviations from the mean flux become significantly larger (∼ ), and harmonic light curves with periods of several days to a few weeks start to appear. At the peak of anomalous objects, where Boyajian's star is located (third row in Figure 8), the modulating variability timescales are comparable with the entire duration of the light curves (∼ days), and fluctuations in the relative flux reach 5% or more. Often these modulations are accompanied by additional periodic signals at shorter timescales, represented by several harmonics in Fourier space. Low frequencies dominate the power even more clearly in this regime. Finally, the weirdest light curves have fluctuations in flux of 15% or more and several harmonic frequencies indicating rapid oscillations, with no long-period modulations present.

From this perspective, Boyajian's star is remarkable in more than one way. Besides the clear dips seen in the light curve, which deviate more than 20% from the mean flux, there are also shallower (0.5%) modulations with characteristic timescales comparable to the duration of the entire 90-day Kepler coverage. Additionally, there are higher frequency oscillations with characteristic periods of about 1 day. These modulations result in a complex power spectrum dominated by the large timescales, some power bumps at short timescales, and a couple of dips associated with the typical duration of the main dips in the light curve (thick blue line in all panels of Figure 8).
Which features are most informative when selecting anomalous objects? In Figure 9 we show the distribution of feature importances derived during training of the URF. Feature importance is a measure of how often a particular feature is used when partitioning the feature space during training, and therefore a measure of how important that feature is in isolating anomalous objects. We note that only a small number of features, those with the highest importances, drive the weirdness score. Those features can be associated with individual light curve points that deviate significantly from the normalized mean value (e.g. dips, flares, and high amplitude modulations), but because of the lack of a common phase reference for all light curves, no single light curve point is generally more important than the others. Rather, if many of the light curve points deviate from the mean, the weirdness score increases as a result of the combined effect of all points. Something different happens for the features representing power spectrum values. The characteristic timescales driving the URF weirdness score are associated with either high frequencies (corresponding to a few hours) or low frequencies (corresponding to a few weeks to months). These are, respectively, the typical timescales of instability strip pulsating stars, such as RR Lyrae and δ Scuti stars, and of long-period variable stars. As pointed out previously, however, stellar pulsations are not the only type of anomalous behavior picked up by the URF. The combination of long (weeks/months) modulations with prominent dips and flares, as is the case for Boyajian's star (or certain types of eclipsing binaries), can also result in anomalous behavior recognizable by this particular algorithm.

The anomalous nature of certain light curves is more evident when we consider how their important features are distributed with respect to the bulk of the object population. Figure 10 shows scatter plots for several of the most important features of Figure 9.
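Importances of this kind are the standard impurity-based importances of the trained forest; a sketch of retrieving them and splitting them back into light-curve and periodogram blocks (all sizes, names, and data here are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy stand-in: 50 "light-curve point" features followed by 20 "periodogram"
# features, with the label driven entirely by one periodogram bin
rng = np.random.default_rng(1)
n_flux, n_freq = 50, 20
X = rng.normal(size=(400, n_flux + n_freq))
y = (X[:, n_flux + 3] > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
imp = rf.feature_importances_                 # normalized to sum to 1
flux_imp, psd_imp = imp[:n_flux], imp[n_flux:]  # split back into the blocks
top = np.argsort(imp)[::-1][:5]               # most informative features
```

In this toy case the single informative periodogram bin dominates the importance ranking, mirroring how a handful of power-spectrum features drive the weirdness score in Figure 9.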
We note that, when represented by these features, anomalous light curves appear as if drawn from a different distribution than the rest of the objects. More specifically, anomalous light curves have a smaller fraction of their spectral power contained in short-period oscillations with respect to normal objects, and slightly more power contained in long-period oscillations. They also typically deviate significantly from the bulk of normalized flux values. Boyajian's star stands out as an object with extreme deviations from the mean flux and a moderately complex low-pass filtered spectrum.

A relevant question is: what is the minimum amount of information needed in order to determine the anomalous nature of a light curve? We investigate this issue here by assessing to what extent we are still able to identify anomalous light curves when only a fraction of the light curve points is considered. We randomly sampled the light curves, keeping only 10% of the measurements, and performed the same URF analysis that was used for the dense light curves. We then ranked these sparse light curves according to their URF weirdness score, just as before. In Figure 11 we show the comparison of URF weirdness scores between the dense and the sparse light curves. We note that the distribution of scores is very similar in both cases.

Anomalous light curves, as determined by their magnitudes and periodograms, have high scores in both the dense and sparse data sets, and there is a strong correlation across the full range of scores. However, in the case of the sparse light curves, the scores are distributed over a slightly larger range, suggesting that a larger relative frac-
Figure 8.
Light curve profiles (left panels) and periodograms (right panels) for selected objects as a function of weirdness score. The weirdness score increases from the top, with each row showing ten randomly selected objects with URF scores near certain representative values (see also the animation at http://bit.ly/kepleranomalies_evolution).
Short periods (hours)Long periods (weeks) I m p o r t a n c e Figure 9.
Feature importance distribution for the URF. The vertical line marks the boundary between normalized fluxes (left) andfrequencies (right). Each dot represents a feature, and they are color-coded by normalized importance. MNRAS , 1–21 (2020) nomaly detection PSD-1 (short period) P S D - ( s h o r t p e r i o d ) PSD-3 (long period) P S D - ( s h o r t p e r i o d ) Normalized flux 1 N o r m a l i z e d f l u x Normalized flux 1 P S D - ( s h o r t p e r i o d ) Figure 10.
Two-dimensional distributions of several important features within the URF model for normal (orange) and anomalous(black) light curves. Boyajian’s star is indicated by the large red star
Figure 11.
Comparison of the URF weirdness scores between dense and light curves.
Left:
Scatter plot between the weirdness scoresfor dense and sparse light curves, color-coded by the average of the dense and sparse scores. The diagonal line is the identity line.
Right:
Comparison of the score histograms, with dense light curves in blue and sparse ligth curves in red.. tion of light curves are identified as “normal” when sparselysampled light curve are considered. Also, there is more struc-ture in the peak of very anomalous objects in the case ofdense light curves, which implies that some anomalous fea-tures are missed if only sparse light curves are used. Overall,these results indicate that anomalous behavior can be iden- tified even when light curves are sparsely sampled, down to10% of Kepler’s sampling frequency of 30 minutes. The URF results are reproducible and presented at the GitHubrepository at https://github.com/juramaga/kepler_anomalies_urf
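The 10% subsampling experiment described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `subsample_light_curve` helper and the toy sinusoidal light curve are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def subsample_light_curve(times, fluxes, keep_frac=0.1):
    """Randomly keep a fraction of the measurements, preserving time order."""
    n = len(times)
    keep = max(1, int(keep_frac * n))
    idx = np.sort(rng.choice(n, size=keep, replace=False))
    return times[idx], fluxes[idx]

# Toy light curve at Kepler's ~30-minute long cadence (times in days)
t_dense = np.arange(0.0, 90.0, 30.0 / (60.0 * 24.0))
f_dense = 1.0 + 0.01 * np.sin(2.0 * np.pi * t_dense / 3.5)

t_sparse, f_sparse = subsample_light_curve(t_dense, f_dense)
```

Each subsampled light curve would then be re-featurized (normalized fluxes plus periodogram) and scored by the same URF; the dense-versus-sparse score comparison of Figure 11 can be quantified with, e.g., `scipy.stats.spearmanr`.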
Figure 12.
A plot of raw DMDT representations of dense light curves, along with their t-SNE scores.
Figure 13.
A plot of raw DMDT representations of dense light curves, along with their UMAP scores.
Figure 14.
A plot of raw DMDT representations of sparse light curves, along with their t-SNE scores.
Figure 15.
A plot of raw DMDT representations of sparse light curves, along with their UMAP scores.
Figure 16.
A plot of the raw DMDT representation of Boyajian's star from the dense light curve data set (left) and the sparse data set (right). The axes show the bins used when creating the DMDT, and those bins are the same across Figure 12, Figure 13, Figure 14, and Figure 15.

t-SNE and UMAP (as described in subsubsection 2.2.1 and subsubsection 2.2.2, respectively) are separately applied to a representative feature data set, namely the DMDT image representation of light curves introduced in Mahabal et al. (2017). Light curves are sequences of stellar brightness as a function of time, and the
DMDT is a light curve image representation that simply considers differences in magnitude — DM — against differences in time — DT. The magnitude differences and time differences are then binned, resulting in a 2D image representation of the light curve (we refer the reader to Mahabal et al. (2017) for more information on the specifics of the DMDT representation of light curves). The output feature space is the pixel matrix of the DMDT images (its dimensions come from the bins we chose for the DMDT representation, which are shown explicitly in Figure 16). We show examples of DMDT representations for dense light curves in Figure 12 and Figure 13, and sparse light curves in Figure 14 and Figure 15. For reference, Boyajian's star's DMDT representation is shown in Figure 16 for dense and sparse light curves. We then flatten the matrices into one-dimensional arrays, each array representing a light curve. We consider this our starting high-dimensional space. We then run t-SNE and UMAP (separately) on this representative feature set to reduce the dimensionality of the data to two dimensions.

For t-SNE, we use the Python sklearn implementation with hyperparameters (n_components=2, perplexity=200, learning_rate=50.0, early_exaggeration=5.0) for dense light curves, and similar hyperparameters for sparse light curves (with the only difference being early_exaggeration=20.0). n_components is simply the dimension of the space we want to map into. perplexity is related to the number of neighbors considered around a given data point (described in McInnes et al. 2018), defining a notion of similarity. learning_rate affects the gradient descent portion of the t-SNE algorithm, with too fast a learning rate resulting in ball-like clusters where neighbors are equidistant, and too slow a learning rate creating dense cloud clusters. early_exaggeration controls how tightly points are clumped in the embedding space, so that we can control the visualization of the high-dimensional data (this affects all points similarly, so it mostly impacts visualization, and not the overall similarity results from the algorithm). We varied the values of the three hyperparameters over different experiments, recording the KL divergence of each model after training on the full data set. We then set the hyperparameters based on the model with the lowest KL divergence. We show the resulting embeddings for both dense and sparse light curves in the left panels of Figure 17.

For UMAP, we use the implementation provided in McInnes et al. (2018), with hyperparameters (n_neighbors=200, min_dist=0.4, learning_rate=0.25) for dense light curves.
We used hyperparameters (n_neighbors=200, min_dist=0.1, learning_rate=0.8) for sparse light curves. n_neighbors is similar to perplexity in the t-SNE algorithm, controlling the number of neighbors considered when defining the weights between data points in the weighted graph. min_dist is similar to the early_exaggeration parameter in the t-SNE algorithm, controlling how tightly points are clumped in the embedding space (to create more interpretable visualizations). As with t-SNE, we varied the hyperparameters for the model and picked the model that gave the most stable embeddings. The resulting UMAP embeddings, for both dense and sparse light curves, are shown in the left panels of Figure 18.

t-SNE and UMAP

We note that in the 2D embedding plot for t-SNE (top left panel of Figure 17), there is a cluster with the majority of the objects arranged along a main curved line, as well as objects that lie further away from that main cluster. Since t-SNE works by closely reproducing the probability distribution q_{j|i} of point i picking point j as a neighbor in the high-dimensional space, we can build a distribution based on the pairwise Euclidean distances between points in the 2-dimensional space of the embeddings. Specifically, for each point we consider its distance to the nearest neighbor. We then consider the distribution formed by normalizing these distances (top right panel of Figure 17) and take all points that are at least . σ from the mean. These points are those that are furthest from the cluster and, by the Euclidean metric, they are the anomalies in the t-SNE method. Objects appear much less clustered together when the same method is applied to the sparse light curves, as can be seen in the bottom left panel of Figure 17. In this case, a "main" cluster of objects is not so easily identified. Yet, the same Euclidean approach can be used to determine which objects are anomalies.
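The embedding-plus-nearest-neighbor scoring described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the random matrix stands in for the flattened DMDT images, the perplexity is shrunk to suit the toy sample size (the text quotes perplexity=200 for the full data set), and the 2.5σ cut is a placeholder for the threshold, which is garbled in the source.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy stand-in for the flattened DMDT images: 300 "light curves", 64 pixels each
X = rng.normal(size=(300, 64))

# t-SNE embedding into 2D; perplexity is reduced for this toy sample size
emb = TSNE(n_components=2, perplexity=30, learning_rate=50.0,
           early_exaggeration=5.0, init="random",
           random_state=0).fit_transform(X)

# Distance of each point to its nearest neighbor in the 2D embedding
# (n_neighbors=2 because the closest match of a point is itself)
nn = NearestNeighbors(n_neighbors=2).fit(emb)
dist, _ = nn.kneighbors(emb)
d_nn = dist[:, 1]

# Normalize the distances and flag points several sigma above the mean
z = (d_nn - d_nn.mean()) / d_nn.std()
outlier_idx = np.where(z > 2.5)[0]
```

A UMAP embedding can be scored identically by replacing the `TSNE` call with `umap.UMAP(n_neighbors=200, min_dist=0.4).fit_transform(X)` from the umap-learn package.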
The distribution of distances is shown in the bottom right panel of Figure 17. Unlike the URF scores for dense and sparse light curves, the distance scores for dense and sparse light curves in t-SNE (and UMAP) are only weakly correlated.

For the UMAP embeddings, we perform the same procedure described above and build a distribution of nearest neighbor distances, which is shown in the right panel of Figure 18. We select outliers by taking all light curves at least . σ from the mean minimum nearest neighbor distance. We show examples of these outliers in Figure 19. Similarly to the t-SNE results,
Figure 17.
Left: all dense (top) and sparse (bottom) light curves in the 2D space obtained by running t-SNE. Large green dots indicate outliers found via the method described in subsubsection 2.2.1. A yellow dot indicates the position of Boyajian's star. Right: distribution of nearest neighbor distances for the t-SNE embedding of dense (top) and sparse (bottom) light curves, for dimension d = .
Figure 18.
Left:
A plot of all light curves in the 2D space obtained by running UMAP: the dense (top) and sparse (bottom) sets. Large green dots indicate outliers found via the method described in subsubsection 2.2.2. A yellow dot indicates the position of Boyajian's star.
Right: distribution of nearest neighbor distances for the UMAP embedding of dense (top) and sparse (bottom) light curves, for dimension d = .

the embeddings map for UMAP looks more sparsely distributed over the reduced two-dimensional feature space.

The results that we have obtained so far indicate that not all anomaly detection methods are equally successful at identifying relevant astrophysical anomalies in Kepler light curves. Unlike the URF method, where taking away the majority of points does not affect our capability to find the same anomalies, the correlation between the anomaly scores for dense and sparse light curves is very weak for both t-SNE and UMAP. Also, each of these two methods produces different anomaly scores than the URF method (as discussed below).

Figure 19.
A sample of the dense light curve outliers found with t-SNE (left three columns) and UMAP (right three columns). Figure 20.
The Kepler light-curve set: plotted in orange are the light curves that appear as an isolated set of objects in Figure 17, an island centered on coordinates (-4, 10); they all show an upward trend. No other embedding or method showed this cluster as clearly.
We also note that the anomaly scores are much less strongly correlated between t-SNE and UMAP, for either dense or sparse light curves. This implies that each of the three methods finds different types of anomalies.

The t-SNE method is successful at selecting light curves with rapid (∼ day) oscillations modulated by a pronounced envelope of lower frequency. UMAP, on the other hand, preferentially selects non-periodic, low amplitude variations with characteristic time scales of several weeks. However, both of them seem to perform poorly when confronted with the sparsely sampled light curves, as no clear pattern emerges regarding the anomalies found for the sparse data set. A connected island of points is visibly separated from the rest of the light curves in the sparse embedding (centered on coordinates (4, -10) in the bottom left panel of Figure 17). When inspected, these light curves showed an upward trend, as shown in Figure 20. This demonstrates again that methods designed or commonly used for anomaly detection and clustering are also powerful techniques to reveal pre-processing mistakes or omissions (see subsection 3.3 and also Kim et al. 2009). Interestingly, the optimal way to isolate these points is the t-SNE sparse embedding, as the island is not obviously visible in any of the other embeddings generated with manifold-learning methods (Figure 17 and Figure 18).

The proximity-based scores for the t-SNE and UMAP methods do not appear to be optimal in identifying bona fide anomalies such as Boyajian's star, since the associated anomaly score is not high for Boyajian's star with these methods. Boyajian's star also does not stick out in the t-SNE and UMAP embeddings (Figure 17, Figure 18, Figure 21, and Figure 22). This is likely due to the fact that, as pointed out in the introduction, in a sufficiently large feature space, any proximity-based measure of similarity becomes meaningless.
While dimensionality reduction is expected to partially reduce this problem by ignoring features that are less relevant, this appears to be insufficient to identify the properties that make Boyajian's star an anomalous object, at least if only the distance-based scores are considered.

But this is not the end of the story for t-SNE and UMAP. The features resulting from the projection of the light curves into a 2D feature space using t-SNE and UMAP provide significant insight regarding the frequency and amplitude properties of the light curves, as can be demonstrated by confronting them with the URF weirdness scores. In Figure 22 we show again the 2D space of t-SNE and UMAP features for the sparse light curves, but this time color-coded by the independently determined URF weirdness scores. We observe a very clear correlation between where a light curve lives in the 2D embeddings and how anomalous it is according to the URF method.

Of relevance is the fact that, although there is a clear gradient in weirdness score as we move between different sectors of the t-SNE and UMAP embeddings, there are also isolated light curves with high weirdness scores (black points in Figure 21 and Figure 22) that live in regions of the feature space with a lower density of anomalous objects. This is the case for Boyajian's star. This last point is of particular importance, because we have established in the analysis of the URF results (subsubsection 4.1.1) that, whereas Boyajian's star belongs to the population of anomalies, it has very particular features that are unlike any other light curve in the data set.

The fact that Boyajian's star's t-SNE and UMAP feature values do not correlate with the URF weirdness as in the case of the large majority of the other URF anomalies indicates that the additional information that we need to identify it as a unique anomaly comes from combining its high URF weirdness score with the features derived from the t-SNE and UMAP embeddings.
The strongest candidates for being Boyajian's star analogs are therefore those objects that not only have similarly high URF scores, but that also have similar t-SNE and UMAP features. This is of course not unique to Boyajian's star: given any anomalous light curve of interest, we can find its analogs (if they exist) by performing a similar search.
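A search of this kind can be sketched as follows. The `find_analogs` helper and all variable names are hypothetical; the thresholds mirror those quoted for Figure 23 (sparse URF weirdness above 0.95, proximity in the 2D UMAP embedding).

```python
import numpy as np

def find_analogs(embedding, weirdness, target_idx, score_cut=0.95, k=4):
    """Return indices of up to k candidate analogs of a target anomaly:
    objects whose URF weirdness exceeds score_cut, ranked by Euclidean
    distance to the target in the 2D embedding."""
    candidates = np.where(weirdness > score_cut)[0]
    candidates = candidates[candidates != target_idx]
    d = np.linalg.norm(embedding[candidates] - embedding[target_idx], axis=1)
    return candidates[np.argsort(d)[:k]]

# Toy example: object 0 plays the role of Boyajian's star
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.2, 0.0], [10.0, 10.0]])
scores = np.array([0.99, 0.96, 0.97, 0.50, 0.96])
analogs = find_analogs(emb, scores, target_idx=0, k=2)
# object 3 is nearby but not weird enough; objects 1 and 2 are the top analogs
```

The same function works for any anomalous light curve of interest: only the target index and the score threshold change.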
Figure 21.
The combined features of the t-SNE (left) and UMAP (right) methods, color-coded by the URF weirdness score for dense light curves. The location of Boyajian's star is indicated by the cross mark. Figure 22.
The combined features of the t-SNE (left) and UMAP (right) methods, color-coded by the URF weirdness score for sparse light curves. The location of Boyajian's star is indicated by the cross mark.

In Figure 23 we show the closest analogs to Boyajian's star according to this combined approach, when the sparse light curves are used. We observe that light curves in this group have average amplitude (∼
Figure 23.
The light curves of four anomalous objects (orange lines) that a) have sparse URF weirdness larger than 0.95 and b) are close to Boyajian's star (black line) in the 2D UMAP embedding for sparse light curves. For reference, we also show a normal, flat, noise-dominated light curve (grey line). We adjust the range of relative fluxes to make the variations more visible.

In Figure 24 we show the anomalous dense light curves (URF weirdness score larger than 0.93), using two embedded features from t-SNE and one
Figure 24.
Left:
A 3-dimensional visualization of the anomalous dense light curves in the feature space of t-SNE and UMAP. The gray points correspond to the main cluster of outliers, all with weirdness scores higher than 0.93. The colored symbols represent three groups of outliers from this cluster. Right:
Example light curves from the three colored subsets in the left-hand visualization. The upper panel corresponds to red squares, the middle panel to diamonds, and the bottom panel to triangles. A 3D visualization is available at http://bit.ly/kepleranomalies_3Dspace .

from UMAP, and a viewing angle that shows most of them aligning along a single curved line. We have color-coded three different groups of anomalies with respect to this main cluster, and plotted example light curves from each group. We note that, depending on where these anomalies are located with respect to the bulk of the objects, they can be either extremely slow, high amplitude variables (red group), moderate amplitude, semi-periodic oscillators (blue group), or extremely fast and extremely high amplitude periodic oscillators (green group). Such clustering of the light curves into different groups is enabled by adopting our combined approach.

Anomaly detection in time-domain astronomy enables the discovery of unique light curves that force us to ask new questions and to formulate new hypotheses. One example of such an anomalous object is that of Boyajian's star, which has become a gold standard in the search for anomalous light curves. Yet, the very concept of what an anomalous light curve is lacks a coherent empirical definition, since the anomalous nature of a light curve depends on the features used to characterize it and the method employed to isolate it. We propose a definition of anomaly not in terms of the projected Euclidean distance between data set objects in a given feature space, but in terms of statistical sampling: an anomaly is an object that is more likely to have been drawn from a different distribution with respect to the bulk of the objects in the multi-dimensional space of the features. We further show that combining methods to detect anomalies
Figure 25.
A Hertzsprung-Russell diagram for the Kepler objects considered in this work, constructed from Gaia photometry and estimated luminosities. The points are color-coded by the URF weirdness score for the dense light curves.

may lead to the identification of new classes of objects, in addition to the identification of true statistical anomalies. We have performed a comparison of several feature engineering and anomaly detection approaches in order to provide insight into what constitutes an anomaly in time-domain surveys, and to assess the effectiveness of different approaches in identifying different types of anomalies. We have used Boyajian's star as a bona fide reference for our study, and demonstrated that the success of different methods in identifying it as anomalous depends on the particular combination of features and anomaly scores used: we employ tree-based methods on a high-dimensional space of flux and
power values (subsection 2.1), and manifold learning methods that decrease the dimensionality of the feature spaces (subsection 2.2). We find that proximity-based scores such as nearest neighbors in a Euclidean space are not well suited for finding true anomalies in multi-dimensional time-domain feature spaces, but that a combination of low-dimensional representations of light curves with an anomaly score based on an ensemble forest method is promising for finding astrophysical rarities and new classes of objects.

We have shown the effect of poor pre-processing on our ability to find anomalous light curves. High anomaly scores will be dominated by objects with large flux excursions, such as spikes or bad pixels, as we have discussed in subsection 3.3.

The physical quantities that enabled the identification of our bona fide anomaly are the time scale and the relative intensity of light curve variations. Anomalies distribute differently in the space of spectral power and magnitude differences (Figure 10). In terms of the frequency of the variations, Kepler anomalies have a larger fractional representation of short (a few hours) or long (months) characteristic timescales (Figure 9).

The remarkable correlation between the location of a light curve in the space of t-SNE and UMAP embedded 2-dimensional features and its URF weirdness score, derived independently from features that represent the light curves in terms of their characteristic frequencies and amplitudes (Figure 22 and Figure 21), indicates that it is possible to identify meaningful anomalies in a low-dimensionality representation of light curves. The apparent complexity of the Kepler light curves, with a broad range of frequencies, amplitudes, transient behavior, and periodicity, can therefore be represented by a handful of numbers, with anomalies occupying specific regions in these representations.
These embedded representations correlate well with independently determined anomaly scores derived using the URF method. While the trends in the anomaly scores are broadly consistent for the three methods, some individual anomalies as classified by the URF are instead classified as "normal" if a proximity-based score of the embedded features is used. The embedded features alone are a better diagnostic for the anomalous nature of these objects.

As Figure 24 illustrates, anomalous light curves that share some characteristics cluster together in specific regions of this space. The URF weirdness score alone (or any other anomaly score) does not allow for this type of clustering, since light curves with similar behavior do not necessarily neighbor each other in weirdness score, although they belong to the same structure in the space of embedded features. This means that, given an anomalous light curve of interest (e.g. Boyajian's star), it is possible to find its closest analogs in a given data set by combining the URF weirdness score with the embedded features. This is of paramount importance for the exploration of large time-domain survey data sets, such as the TESS data set, or the upcoming Vera Rubin Telescope survey.

We also note that it is possible to find a very similar set of anomalous objects even if the light curves are sampled at a considerably sparser cadence of 10% of the original cadence, as demonstrated by Figure 11. This implies that, despite the sparse sampling, the low-dimensional representations of the relevant features contain the necessary information to identify these anomalies, including relatively rapid oscillations. Given that ground-based time-domain surveys cannot perform with a cadence as regular as that of Kepler (30 minutes between data points), this result is promising for the identification of anomalies at much sparser cadences.
To what extent is the anomalous nature of a light curve related to the astrophysical properties of the associated object? Our results show that anomalous light curves can be associated with known astrophysical processes, such as intrinsic stellar pulsations (e.g. Arras et al. 2006), flaring events (Paudel et al. 2018), or extrinsic eclipsing binary phenomena (Prsa & Eclipsing Binary Working Group 2013; Prša et al. 2016), but also with unexplained phenomena, such as the case of Boyajian's star. Given an identified anomaly, can we tell to which of these two scenarios it belongs?

In order to investigate this, we have obtained Gaia DR2 colors and derived luminosities (Gaia Collaboration et al. 2016, 2018) for all the objects in our sample with available Gaia measurements, and produced a Hertzsprung-Russell diagram that we have color-coded according to the URF anomaly score. We show this diagram in Figure 25. Our results indicate that anomalies can be uncommon objects of known astrophysical nature, such as members of the instability strip (RR Lyrae stars, Cepheids, etc.), flaring M-dwarfs, or members of the asymptotic giant branch that show periodic oscillations of either short or long periods. In fact, there appears to be a gradient of anomaly scores along the giant branch, with the most anomalous objects being, on average, the cooler giants.

But anomalies can also be stars that are unremarkable based on their luminosities and temperatures alone, and whose anomalous behavior is more likely due to extrinsic factors. Such is the case of eclipsing binaries, which are fairly well understood, but also of Boyajian's star, an object that has required us to come up with new models in order to explain its behavior (see e.g. Boyajian et al. 2016). These objects lie along the Main Sequence and are not expected to show atypical intrinsic behavior.

The correlation between anomaly score, embedded low-dimensionality features, and astrophysical properties indicates a possible path toward new discoveries. In particular, the application of a similar analysis to TESS light curves will allow us to identify scientifically compelling light curves for further analysis. We are currently performing this analysis on TESS data (Crake et al., in prep.). Note that this type of analysis is not limited to light curves with regular cadences. Our experimental setup allows us to adjust for differences in light curve lengths, cadences, and gaps between light curve points. Both the spectral decomposition and the
DMDT images can be produced for light curves of any length and cadence.

In summary, we have shown that, while different anomaly detection methods do not necessarily identify the same anomalies, a reasonable empirical and standard approach to anomaly detection in time-domain surveys is possible. Such an approach combines dimensionality reduction with a random forest-based anomaly detection algorithm, and greatly simplifies the analysis, allowing the identification of the most remarkable anomalies and the phenomenological reasons for their anomaly score. It also allows us to group light curves according to their similarity, and paves the way for an analog-finding method that will find siblings to any light curve of astrophysical interest. Our work is complementary to previous methods that have attempted to identify time-domain anomalies (e.g. Giles & Walkowicz 2019) and further generalizes the issue of anomaly detection in time-domain astronomy by putting these methods in context.
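As a concrete sketch of why the dmdt representation is cadence-agnostic, the toy example below builds a dmdt image from an irregularly sampled synthetic light curve. The construction follows the pairwise-difference idea of Mahabal et al. (2017); the bin edges and the light curve itself are illustrative assumptions, not the choices used in this work.

```python
import numpy as np

def dmdt_image(t, m, dt_bins, dm_bins):
    """2D histogram of (dt, dm) over all ordered pairs of light-curve points.
    Only pairwise differences enter, so any length and cadence is admissible."""
    i, j = np.triu_indices(len(t), k=1)  # all pairs with i < j
    dt = t[j] - t[i]
    dm = m[j] - m[i]
    hist, _, _ = np.histogram2d(dt, dm, bins=[dt_bins, dm_bins])
    # Normalise so images are comparable across light curves of different length.
    return hist / hist.sum()

# Irregularly sampled toy light curve: a low-amplitude oscillation plus noise.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 90.0, size=200))             # days, no fixed cadence
m = 12.0 + 0.05 * np.sin(2 * np.pi * t / 3.7) + rng.normal(0.0, 0.01, 200)

img = dmdt_image(t, m,
                 dt_bins=np.logspace(-2, 2, 24),          # ~15 min to 100 days
                 dm_bins=np.linspace(-0.2, 0.2, 24))      # magnitudes
print(img.shape)  # (23, 23)
```

The resulting fixed-size image can then be fed to any of the anomaly detectors discussed above, regardless of how the underlying light curve was sampled.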
ACKNOWLEDGEMENTS
We would like to thank the organizers and participants of the Detecting the Unexpected workshop that took place at STScI in 2017. The ideas for this work came from a hack during that workshop and have also produced other papers. In particular, we thank Lucianne Walkowicz for a continuous exchange of ideas and for proposing the original hack. We also thank Dalya Baron for useful insight about the use of the URF method. We thank the original hackers' team, which included Kelle Cruz and Umaa Rebbapragada. In carrying out this research we have used the scikit-learn Python package. This paper includes data collected by the Kepler mission and obtained from the MAST data archive at the Space Telescope Science Institute (STScI). Funding for the Kepler mission is provided by the NASA Science Mission Directorate. STScI is operated by the Association of Universities for Research in Astronomy, Inc., under NASA contract NAS 5-26555.
REFERENCES
Aggarwal C. C., Yu P. S., 2001, in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. pp 37–46
Arras P., Townsley D. M., Bildsten L., 2006, ApJ, 643, L119
Baron D., Poznanski D., 2017, MNRAS, 465, 4530
Bellm E. C., et al., 2019, PASP, 131, 018002
Boyajian T. S., et al., 2016, MNRAS, 457, 3988
Buitinck L., et al., 2013, arXiv preprint arXiv:1309.0238
Conroy C., et al., 2018, ApJ, 864, 111
Drake A. J., et al., 2012, in Griffin E., Hanisch R., Seaman R., eds, IAU Symposium Vol. 285, New Horizons in Time Domain Astronomy. pp 306–308 (arXiv:1111.2566), doi:10.1017/S1743921312000889
Druetto A., Roberti M., Cancelliere R., Cavagnino D., Gai M., 2019, in International Work-Conference on Artificial Neural Networks. pp 390–401
Dutta H., Giannella C., Borne K., Kargupta H., 2007, in Proceedings of the 2007 SIAM International Conference on Data Mining. SIAM
Eyer L., Mowlavi N., 2008, in Journal of Physics Conference Series. p. 012010 (arXiv:0712.3797), doi:10.1088/1742-6596/118/1/012010
Gaia Collaboration et al., 2016, A&A, 595, A1
Gaia Collaboration et al., 2018, A&A, 616, A1
Giles D., Walkowicz L., 2019, MNRAS, 484, 834
Graham M. J., Drake A. J., Djorgovski S. G., Mahabal A. A., Donalek C., Duan V., Maker A., 2013, MNRAS, 434, 3423
Henrion M., Hand D. J., Gandy A., Mortlock D. J., 2013, Statistical Analysis and Data Mining: The ASA Data Science Journal, 6, 53
Hinton G. E., Roweis S. T., 2003, in Advances in Neural Information Processing Systems. pp 857–864
Ho T. K., 1995, in Proceedings of 3rd International Conference on Document Analysis and Recognition. pp 278–282
Ivezić Ž., et al., 2019, ApJ, 873, 111
Jenkins J. M., 2017, Kepler Data Processing Handbook: Philosophy and Scope, Kepler Science Document KSCI-19081-002
Kim D.-W., Protopapas P., Alcock C., Byun Y.-I., Bianco F. B., 2009, MNRAS, 397, 558
Kochanek C. S., et al., 2017, PASP, 129, 104502
Kullback S., Leibler R. A., 1951, Ann. Math. Statist., 22, 79
LSST Science Collaboration et al., 2017, arXiv e-prints, p. arXiv:1708.04058
Liu F. T., Ting K. M., Zhou Z.-H., 2012, ACM Transactions on Knowledge Discovery from Data (TKDD), 6, 1
Lomb N. R., 1976, Astrophysics and Space Science, 39, 447
Maaten L. v. d., Hinton G., 2008, Journal of Machine Learning Research, 9, 2579
Mahabal A., Sheth K., Gieseke F., Pai A., Djorgovski S. G., Drake A. J., Graham M. J., 2017, in 2017 IEEE Symposium Series on Computational Intelligence (SSCI). pp 1–8
Margalef-Bentabol B., Huertas-Company M., Charnock T., Margalef-Bentabol C., Bernardi M., Dubois Y., Storey-Fisher K., Zanis L., 2020, arXiv preprint arXiv:2003.08263
McInnes L., Healy J., Melville J., 2018, arXiv e-prints, p. arXiv:1802.03426
Meech K. J., et al., 2017, Nature
Miniutti G., et al., 2019, Nature, 573, 381
Nun I., Protopapas P., Sim B., Zhu M., Dave R., Castro N., Pichara K., 2015, arXiv e-prints, p. arXiv:1506.00010
Paudel R. R., Gizis J. E., Mullan D. J., Schmidt S. J., Burgasser A. J., Williams P. K. G., Berger E., 2018, ApJ, 861, 76
Paudel R. R., Gizis J. E., Mullan D., Schmidt S. J., Burgasser A. J., Williams P. K., Youngblood A., Stassun K., 2019, MNRAS, 486, 1438
Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
Prsa A., Eclipsing Binary Working Group 2013, in Giants of Eclipse. p. 40102
Prša A., et al., 2016, ApJS, 227, 29
Scargle J. D., 1982, ApJ, 263, 835
Schmidt M., 2019, arXiv e-prints, p. arXiv:1907.02574
Shi T., Horvath S., 2006, Journal of Computational and Graphical Statistics, 15, 118
York D. G., et al., 2000, AJ, 120, 1579
This paper has been typeset from a TEX/LaTeX file prepared by the author.