Automated identification of transiting exoplanet candidates in NASA Transiting Exoplanet Survey Satellite (TESS) data with machine learning methods
Leon Ofman a,b,∗, Amir Averbuch c,d, Adi Shliselberg d, Idan Benaun d, David Segev d, Aron Rissman d

a Dept. of Physics, Catholic University of America, Washington, DC 20064, USA
b NASA GSFC, Code 671, Greenbelt, Maryland 20771, USA
c School of Computer Science, Tel Aviv University, Tel Aviv, Israel
d ThetaRay, 8 Hanagar Street, Hod HaSharon, Israel
Abstract
A novel artificial intelligence (AI) technique that uses machine learning (ML) methodologies and combines several algorithms developed by ThetaRay, Inc. is applied to NASA's Transiting Exoplanet Survey Satellite (TESS) dataset to identify exoplanetary candidates. The AI/ML ThetaRay system is trained initially with Kepler exoplanetary data and validated with confirmed exoplanets before its application to TESS data. Existing and new features of the data, based on various observational parameters, are constructed and used in the AI/ML analysis by employing semi-supervised and unsupervised machine learning techniques. By the application of the ThetaRay system to 10,803 light curves of threshold crossing events (TCEs) produced by the TESS mission, obtained from the Mikulski Archive for Space Telescopes, we uncover 39 new exoplanetary candidate (EPC) targets. This study demonstrates for the first time the successful application of combined multiple AI/ML-based methodologies to a large astrophysical dataset for rapid automated classification of EPCs.

Keywords: Exoplanet detection methods - Transit photometry - Computational Methods - Machine Learning

∗ Corresponding author

Preprint submitted to New Astronomy, February 23, 2021
1. Introduction
The Transiting Exoplanet Survey Satellite (TESS) (Ricker et al., 2014) was launched by NASA on April 18, 2018 with the primary objective of an all-sky survey of more than 200,000 nearby stars in search of transiting exoplanets using high-precision photometry, producing light curves with a 2-minute cadence. The TESS Objects of Interest (TOI) have been released periodically and archived at the Mikulski Archive for Space Telescopes (MAST, https://archive.stsci.edu/). The TOI list includes planetary candidates, as well as potential planetary candidates and other astrophysical targets, including false positives, comprising the database used for searching for confirmed exoplanets. As of March 23, 2020, TESS has released 1766 TOIs with 43 confirmed planets and 412 false positives (see https://tess.mit.edu/publications/).

Previously, the Kepler Space Telescope, launched by NASA in 2009, was designed to determine the occurrence frequency of Earth-sized planets. Towards this objective, Kepler observed about 200,000 stars with high photometric precision, discovering thousands of transiting exoplanets and exoplanetary candidates (Borucki et al., 2010; Jenkins et al., 2010a; Koch et al., 2010; Christiansen et al., 2012). During the prime mission (2009 May 2 - 2013 May 21), Kepler pointed at a single field of view of about 115 square degrees in the constellations of Cygnus and Lyra. The many periodic signals detected by Kepler were processed using the Kepler Science Processing Pipeline (Jenkins et al., 2010b) and assembled into a database of threshold crossing events (TCEs). Direct human input was required to remove false positives and instrumental effects from this database. However, the resulting TCE database contains data produced by many possible sources, such as eclipsing binaries, background eclipsing binaries and many other possible false alarm sources, in addition to a small fraction of exoplanetary candidates (EPCs), and still requires considerable analysis for confirmed identification of exoplanets.

Recently, Shallue and Vanderburg (2018) identified transiting exoplanets in Kepler satellite data using a Deep Learning (DL) algorithm based on training convolutional neural networks with the Google-Vizier system (Golovin et al., 2017). Shallue and Vanderburg (2018) trained the neural networks to classify whether a given light curve signal is a signature of a transiting exoplanet, with a low false positive rate. Using their algorithm, they identified multi-planet resonant chains around Kepler-80 and Kepler-90. Later, the extended Kepler K2 mission, which started in Nov. 2013, was designed to use the remaining Kepler capabilities after the completion of the prime mission, following the technical failures of the reaction wheels. During this observation phase the photometric accuracy was reduced, and the pointing varied in different regions of the sky. Nevertheless, Dattilo et al. (2019) used a similar automated technique, based on the Shallue and Vanderburg (2018) study, applied to K2 mission data, identifying two previously unknown exoplanets.

Automated classification methods for transiting exoplanets in TESS data have been developed using machine learning (ML) techniques in several studies (e.g., Ansdell et al., 2018; Zucker and Giryes, 2018; Yu et al., 2019; Osborn et al., 2020) that demonstrate the usefulness and feasibility of this approach with various degrees of improved classification performance.
In this paper, we describe an application of novel algorithms that combine several ML approaches and low rank matrix decomposition, including algorithms that identify anomalies in high dimensional big data using an augmentation approach. This method, which utilizes semi-supervised and unsupervised learning, was developed by ThetaRay, Inc. (https://thetaray.com/) for uncovering financial crimes and for cyber and Internet of Things (IoT) security, and was applied here to the search for transiting EPCs reported in this study. Using Kepler data with confirmed exoplanets for the algorithm training phase and validation, the ThetaRay platform was applied to TESS data, yielding 39 new EPCs out of nearly 11,000 TCEs and demonstrating the feasibility and utility of this new platform.

The paper is organized as follows: Section 2 discusses the ML methods, Section 3 presents the resulting exoplanet classification in TESS data, and Section 4 contains the discussion and conclusions. Details of the ThetaRay algorithms are described in the Appendix.
2. Machine Learning Methods
2.1. ThetaRay Algorithm
In the present study we utilize ThetaRay AI-based fintech algorithms, commercially developed for anomaly detection (financial crimes) in financial institutions, cyber security, and IoT for the smooth operation of critical infrastructure installations. Since transiting-exoplanet light curves are rare and appear in only a small fraction of all observed Kepler or TESS stellar light curves, they are classified as 'anomalies' in our analysis, and the ThetaRay system utilizes the strengths of its algorithms to identify transiting EPCs in the large number of TCEs. To identify these 'anomalies', or exoplanet light curves, ThetaRay's algorithms generate a data-driven 'normal' profile of the ingested data and simultaneously identify anomalies, also called abnormal events, providing forensics that categorize each event based on its features. This is done autonomously by the algorithm, without the need for rules or signatures. ThetaRay's algorithmic engine utilizes techniques drawn from a wide variety of mathematical disciplines, such as harmonic analysis, diffusion geometry and stochastic processing, low rank matrix decomposition, randomized algorithms in general and randomized linear algebra in particular, geometric measure theory, manifold learning, neural networks/deep learning, and compact representation by dictionaries. One approach models the data as a diffusion process, using the Brownian motion of a random walk process to geometrize the data. There is no need for any semantic understanding of the processed data, nor are there any predefined rules, heuristics or weights in the system. The diffused collected dataset is then converted into a Markov matrix through a normalized graph-Laplacian and modeled as a stochastic process that is applied in many dimensions (possibly thousands) - see the Appendix for additional details of the algorithms.

2.2. Kepler Satellite Data ML Training
We have focused on light curves produced by the Kepler space telescope, which collected the light curves of ∼200,000 stars; we used the Kepler TCE catalog from the NASA Exoplanet Archive (https://exoplanetarchive.ipac.caltech.edu/). We obtained the TCE labels from the catalog's "av_training_set" column, which has three possible values: planet candidate (PC), astrophysical false positive (AFP) and non-transiting phenomenon (NTP). We ignored TCEs with the "unknown" label (UNK). These labels were produced by manual vetting and other diagnostics. We obtained additional data on the TCEs, such as the planet number, the radius of the planet, and the interval between consecutive planetary transits, from the MAST TESS archive (https://archive.stsci.edu/missions-and-data/transiting-exoplanet-survey-satellite-tess) for data labeling and use in our analysis.

2.2.1. Features

Feature engineering is the process of using data domain knowledge to create features by manipulating the data through mathematical and statistical relations (for examples, see section 2.2.4) of the various components, in order to improve the performance of the AI/ML algorithms. The feature engineering process includes deciding which features to develop, creating the features, checking how the features work with the model, improving the features as needed, and going back to deciding on or creating additional data features until the ML/AI algorithm results are optimized. We applied the feature engineering process to our dataset and created new features, in addition to the existing features available in MAST, in order to provide more information quantifying various aspects of the data used by the AI/ML algorithm in the present analysis. We produced a total of 424 features that were used for the analysis. We chose the combination of features that provided the best results under the capabilities of ThetaRay's system, validated in the training step. In the feature engineering process, we tested the effectiveness of different combinations of features under the limits of ThetaRay's system.
2.2.2. Description of the Variables and Labels

Additional TCE data were downloaded from MAST. We narrowed the data down to only the fields required for the present task, such as the planet number, the radius of the planet, and the interval between consecutive planetary transits, and selected the relevant data from all the fields in "Data Columns in the Kepler TCE Table" (https://exoplanetarchive.ipac.caltech.edu/docs/API_tce_columns.html) using visualization of the variables (especially KDE plots, see below). Below is the description of the variables and labels used in our analysis.

• Unique key - concatenation of Kepler ID and Planet Number. Kepler ID is a target identification number, as listed in the Kepler Input Catalog (KIC). The KIC was derived from a ground-based imaging survey of the Kepler field conducted prior to launch. The survey's purpose was to identify stars for the Kepler exoplanet survey by magnitude and color. The full catalog of 13 million sources can be searched at the MAST archive. The subset of 4 million targets found on the Kepler CCDs can be searched via the Kepler Target Search form.
  – Kepler Input Catalog (KIC) (Brown et al., 2011).
  – MAST archive - http://archive.stsci.edu/kepler/kic10/search.php.
  – Kepler Target Search form - http://archive.stsci.edu/kepler/kepler_fov/search.php.

• av_training_set - Autovetter Training Set Label. If the TCE was included in the training set, the training label encodes what is believed to be the "true" classification, and takes a value of either PC, AFP or NTP. The TCEs in the UNKNOWN class sample are marked UNK. Training labels are given a value of NULL for TCEs not included in the training set. For more detail about how the training set is constructed, see the Autovetter Planet Candidate Catalog for Q1-Q17 Data Release 24 (KSCI-19091): https://exoplanetarchive.ipac.caltech.edu/docs/KSCI-19091-001.pdf.

• tce_prad - Planetary Radius (Earth radii). The radius of the planet, obtained from the product of the planet-to-stellar radius ratio and the stellar radius.

• tce_max_mult_ev - Multiple Event Statistic (MES). The maximum calculated value of the MES. TCEs that meet the maximum MES threshold criterion and other criteria listed in the TCE release notes are delivered to the Data Validation (DV) module of the data analysis pipeline for transit characterization and the calculation of the statistics required for disposition. A TCE exceeding the maximum MES threshold is removed from the time-series data and the SES and MES statistics are recalculated. If a second TCE exceeds the maximum MES threshold, it is also propagated through the DV module, and the cycle is iterated until no more events exceed the criteria. Candidate multi-planet systems are found this way. Users of the TCE table can exploit the maximum MES statistic to help filter and sort samples of TCEs for the purposes of discerning the event quality, determining the likelihood of planet candidacy, or assessing the risks of observational follow-up.
  – DV module - http://archive.stsci.edu/kepler/manuals/KSCI-19081-001_Data_Processing_Handbook.pdf

• tce_period - Orbital Period (days). The interval between consecutive planetary transits.

• tce_time0bk - Transit Epoch (BJD) - 2,454,833.0. The time corresponding to the center of the first detected transit in Barycentric Julian Day (BJD), minus a constant offset of 2,454,833.0 days. The offset corresponds to 12:00 on Jan 1, 2009 UTC.

• tce_duration - Transit Duration (hrs). The duration of the observed transits.
Duration is measured from first contact between the planet and star until last contact. Contact times are typically computed from a best-fit model produced by a Mandel and Agol (2002) model fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris.

• tce_model_snr - Transit Signal-to-Noise (SNR). Transit depth normalized by the mean uncertainty in the flux during the transits.

• av_pred_class - Autovetter Predicted Classification. Predicted classifications, which are the 'optimum MAP classifications.' Values are either PC, AFP, or NTP.

• tce_depth - Transit Depth (ppm). The fraction of stellar flux lost at the minimum of the planetary transit. Transit depths are typically computed from a best-fit model produced by the Mandel and Agol (2002) model fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris.

• tce_impact - Impact Parameter. The sky-projected distance between the center of the stellar disc and the center of the planet disc at conjunction, normalized by the stellar radius.

• local_view - vector of length 201: a 'local view' of the TCE. It shows the shape of the transit in detail (close-up of the transit event).

2.2.3. Visualization of Kepler Data

We investigated the Kepler data and visualized the variables with the Pandas package in Python. For example, we visualized the distributions of the numerical variables per class using KDE (Kernel Density Estimation) plots. In Figure 1 we show several interesting examples with a gap between the curves labeled 'Planets' and 'Not planets', as identified by the ThetaRay system and validated against the Kepler training set. It can be concluded that these features are significant for candidate exoplanet identification, and we therefore included them in the model. If both curves coincide, it can be concluded that the behavior is the same for the labels 'planets' and 'not planets', and we therefore chose not to include such features in the model. A sketch of these visual checks is given after Figure 1.

Another example of our analysis is demonstrated by the 'heat map', which is basically a color-coded matrix, where the correlation value between a pair of features is used to color each cell of the matrix, representing the relative value of that cell. If there is a high correlation between any variables, the dimension of the data can be reduced. The various features are labeled on the axes. Naturally, the cells on the main diagonal, which indicate identity correlation, are light colored. It is evident from the 'heat map' shown in Figure 2 that most off-diagonal features are weakly correlated. The only significant off-diagonal correlation is between av_training_set - the training labels, i.e., the label encoding what is believed to be the "true" classification if the TCE was included in the training set - and av_pred_class - the predicted classifications, which are the optimum MAP (maximum a posteriori) classifications. In fact, this field does not provide analysis information for the data but is used as a forensic feature. The forensic features are not used in the detection itself.
Figure 1: The distributions of the numerical variables using KDE (Kernel Density Estimation) plots, where the blue curves are labeled 'Planet' and the orange curves are labeled 'Not a planet', from Kepler data. When there is a significant difference between the curves, it can be concluded that these features are more significant for planet identification, and we therefore included them in the model. If both curves coincide, it can be concluded that the behavior is not statistically different between the two populations. The plotted variables are (a) tce_period, (b) tce_duration, (c) tce_time0bk, (d) tce_model_snr (see text for their definitions).
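For concreteness, the following is a minimal sketch (not the authors' production code) of the two visual checks described above, assuming a pandas DataFrame loaded from a hypothetical kepler_tce.csv with the column names used in this section:

```python
import pandas as pd
import matplotlib.pyplot as plt

tce = pd.read_csv("kepler_tce.csv")                  # hypothetical file name
tce["is_planet"] = tce["av_training_set"].eq("PC")   # PC vs. AFP/NTP

# Per-class KDE plots (as in Figure 1): a visible gap between the two curves
# suggests the feature separates planets from non-planets and is worth keeping.
for col in ["tce_period", "tce_duration", "tce_time0bk", "tce_model_snr"]:
    fig, ax = plt.subplots()
    tce.loc[tce["is_planet"], col].plot.kde(ax=ax, label="Planet")
    tce.loc[~tce["is_planet"], col].plot.kde(ax=ax, label="Not a planet")
    ax.set_title(col)
    ax.legend()

# Correlation 'heat map' (as in Figure 2): strongly correlated off-diagonal
# pairs are candidates for dimension reduction.
corr = tce.select_dtypes("number").corr()
plt.matshow(corr.to_numpy())
plt.colorbar()
plt.show()
```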
2.2.4. New Features

New features were developed, based on the original Kepler dataset obtained from MAST, to optimize the analysis with the ThetaRay algorithm. These features were constructed from the original dataset as described below, using the phase-folded "Local View" and "Global View" light curves (see, e.g., Shallue and Vanderburg, 2018).

• global_view - the original vector of length 2001, a 'global view' of the TCE that shows the characteristics of the light curve over an entire orbital period. Because of the size limitations of ThetaRay's system, we performed dimension reduction: we represented groups of 20 columns of the 'global view' by computing the average and the standard deviation of those columns, for a total of 200 new "global view" features.

• spline_bkspace - the break-point spacing in time units used for the best-fit spline. We chose the optimal spacing of spline breakpoints for each light curve by fitting splines with different breakpoint spacings, calculating the Bayesian Information Criterion (BIC, Schwarz (1978)) for each spline, and choosing the breakpoint spacing that minimized the BIC.

Figure 2: The 'heat map' of some of the features (or parameters) used in the ThetaRay algorithm. The intensity scale indicates the magnitude of the correlation between the features, which facilitates determining the dimensionality of the dataset (see text).

Below is a brief description of the new features that were computed for each TCE "Global View" and "Local View" light curve (a sketch of these computations follows the list):

• loc_mean - average of the "Local View" light curve.
• loc_std - standard deviation of the "Local View" light curve.
• loc_25% - 25th percentile of the "Local View" light curve.
• loc_75% - 75th percentile of the "Local View" light curve.
• loc_max - maximum value of the "Local View" light curve.
• glob_mean - average of the original "Global View" light curve.
• glob_std - standard deviation of the original "Global View" light curve.
• glob_25% - 25th percentile of the original "Global View" light curve.
• glob_75% - 75th percentile of the original "Global View" light curve.
• glob_max - maximum value of the original "Global View" light curve.
• zScore_loc_min - minimum value of the Z-score of the "Local View" light curve with a window of 10.
• zScore_loc_max - maximum value of the Z-score of the "Local View" light curve with a window of 10.
• zScore_glob_min - minimum value of the Z-score of the "Global View" light curve with a window of 100.
• zScore_glob_max - maximum value of the Z-score of the "Global View" light curve with a window of 100.
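The following is a minimal numpy sketch of these computations, under stated assumptions: the exact grouping of the 2001 'global view' points into 100 groups of 20 and the precise rolling Z-score convention used in ThetaRay's pipeline are not public, so both are illustrative.

```python
import numpy as np

def reduce_global_view(global_view, group=20):
    """Mean and std over consecutive groups of `group` columns.

    2000 of the 2001 'global view' points -> 100 groups -> 200 features
    (dropping the last point to get an even grouping is an assumption)."""
    v = np.asarray(global_view, dtype=float)[:2000]
    blocks = v.reshape(-1, group)                    # shape (100, 20)
    return np.concatenate([blocks.mean(axis=1), blocks.std(axis=1)])

def windowed_zscore_extrema(curve, window):
    """Min/max of a rolling Z-score (zScore_loc_* with window=10,
    zScore_glob_* with window=100); the exact convention is assumed."""
    c = np.asarray(curve, dtype=float)
    scores = []
    for i in range(len(c) - window + 1):
        w = c[i:i + window]
        if w.std() > 0:
            scores.append((w[-1] - w.mean()) / w.std())
    scores = np.array(scores)
    return scores.min(), scores.max()

local_view = np.random.randn(201)                    # stand-in for a real TCE curve
features = {
    "loc_mean": local_view.mean(),
    "loc_std": local_view.std(),
    "loc_25%": np.percentile(local_view, 25),
    "loc_75%": np.percentile(local_view, 75),
    "loc_max": local_view.max(),
}
features["zScore_loc_min"], features["zScore_loc_max"] = \
    windowed_zscore_extrema(local_view, window=10)
```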
2.2.5. Working on ThetaRay's System

We built in the ThetaRay platform an "analysis chain", a multi-staged flowchart composed of three main stages: Data Source, Data Frame and Analysis. The data is organized into data sources that are uploaded to ThetaRay's platform. We created data frames in the system with a wrangling method (data wrangling is the process of cleaning, structuring and enriching raw data into a desired format, with the intent of making it more appropriate and valuable for modeling) and split the data randomly in the ThetaRay system such that 80% is allocated for training and 20% for testing. The training procedure generates a profile that is fed into different types of analyses using the ThetaRay augmented and unsupervised algorithms, to find the best parameters that maximize the Area Under the ROC Curve (AUC) in each chain, where the ROC (Receiver Operating Characteristic) curve is a standard evaluation metric for testing a classification model's performance. After the analysis and review of these results were completed, the data was processed again after modification and fine tuning of the internal parameters of the system, to improve the results, and the identification was then executed again. A sketch of this split-and-score step is given below.
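The following minimal sketch uses scikit-learn's IsolationForest as a stand-in for ThetaRay's proprietary detection chain; the 80/20 split and the AUC metric follow the text, while everything else (data, model choice) is illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 424))                 # stand-in for the 424 features
y = (rng.random(1000) < 0.05).astype(int)        # rare 'anomalies' (planet candidates)

# 80% train / 20% test, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

model = IsolationForest(random_state=0).fit(X_train)
scores = -model.score_samples(X_test)            # higher score = more anomalous
print("AUC:", roc_auc_score(y_test, scores))     # the quantity maximized per chain
```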
We obtained 10,803 light curves of TCEs produced by the TESS mission from MAST (http://archive.stsci.edu/). We wanted to use the same model we built based on Kepler's data in order to find potential exoplanets (anomalies) in the new data from TESS. To use the same models for the two different satellites, we had to convert the TESS data to the same structure as the Kepler data. Therefore, we performed additional steps to prepare the light curves to be used as inputs to our system. We generated a set of TFRecord files for the TCEs. Each file contains global_view, local_view and spline_bkspace representations, as in Kepler. We also created in Python the following data files:

• global_view - vector of length 2001 that shows the characteristics of the light curve over an entire orbital period.

• local_view - vector of length 201 that shows the shape of the transit in detail (phase-folded close-up of the transit event).

• more_features - includes
  – ticid - TESS ID of the target star.
  – planetNumber - TCE number within the target star.
  – planetRadiusEarthRadii - same meaning as tce_prad in Kepler data.
  – spline_bkspace.
  – mes - same meaning as tce_max_mult_ev in Kepler data.
  – orbitalPeriodDays - same meaning as tce_period in Kepler data.
  – transitEpochBtjd - same meaning as tce_time0bk in Kepler data.
  – transitDurationHours - same meaning as tce_duration in Kepler data.
  – transitDepthPpm - same meaning as tce_depth in Kepler data.
  – minImpactParameter - same meaning as tce_impact in Kepler data.

TESS data is unlabeled, so the av_training_set and av_pred_class fields do not exist in the TESS data; we therefore filled these fields with zeros. The tce_model_snr feature exists in the Kepler data but not in the TESS data, so we calculated its value as the ratio of transitDepthPpm to transitDepthPpm_err.

• Describe files - include count, mean, std, min, max, 25th percentile, median (50%), and 75th percentile. These quantities were computed on each original data row of the global_view and local_view files, and on each scaled row of these files.

Following the generation of the dataset in the form of Comma Separated Values (CSV) files, we applied the same manipulation to global_view as for the Kepler data in order to reduce the dimensions, and used the analogous 424 features produced from the TESS data, as for the Kepler data, for the analysis on
ThetaRay's system. Following this step, we applied the Detection algorithm to the TESS data according to the saved model from Kepler, and used the results for the classification and mapping of the TESS light-curve TCE data.
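The TESS-to-Kepler field conversion described above can be sketched as follows; the DataFrame, file name and the spelling of the error column are assumptions, while the mapping itself follows the list in the text:

```python
import pandas as pd

tess = pd.read_csv("tess_tce.csv")               # hypothetical file name

# TESS -> Kepler field names, following the list above.
rename_map = {
    "planetRadiusEarthRadii": "tce_prad",
    "mes": "tce_max_mult_ev",
    "orbitalPeriodDays": "tce_period",
    "transitEpochBtjd": "tce_time0bk",
    "transitDurationHours": "tce_duration",
    "transitDepthPpm": "tce_depth",
    "minImpactParameter": "tce_impact",
}
kepler_like = tess.rename(columns=rename_map)

# TESS data are unlabeled: the label fields are filled with zeros.
kepler_like["av_training_set"] = 0
kepler_like["av_pred_class"] = 0

# tce_model_snr is absent from TESS data; it is approximated by the ratio of
# the transit depth to its uncertainty (error-column spelling is assumed).
kepler_like["tce_model_snr"] = (
    tess["transitDepthPpm"] / tess["transitDepthPpm_err"])
```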
3. Results: Transiting Exoplanet Detection
The first results of the ThetaRay algorithm produced around 90 preliminary identifications of EPCs that were further manually vetted, reducing the number of confirmed EPCs by about a factor of two. Local view light curves were used together with the planetary candidate parameters to vet the algorithm's output. In the manual vetting, physical indicators such as non-typical 'local view' light curves (i.e., v-shapes and other non-planetary periodic features), extremely large planetary radius, and very low signal-to-noise were used. The parameters of the remaining 39 EPCs identified by the ThetaRay system from the TESS database of 10,803 TCEs are given in Table 1. In Figure 3 we show the "local view" light curves of eight selected exoplanetary candidates identified using the
ThetaRay algorithm. The TESS Input Catalog ID number (TIC ID), along with several parameters (tce_prad, tce_period, tce_depth, defined in section 2.2.2), are indicated on each panel for the identified EPCs. Of the 39 validated cases, only two have planetary radius (tce_prad) of r_p < R_Earth (TIC IDs 307210830 and 259377017), and a total of eight EPCs were identified with r_p < 2 R_Earth. Another 15 identified EPCs were similar in size to or larger than Jupiter, with r_p ≥ 11 R_Earth. We find the following properties of the 39 cases:

• The orbital periods (tce_period) of the identified EPCs range from 0.38 d to just under 23 d.

• The transit depth (tce_depth) varied by about an order of magnitude, from ∼986 ppm to ∼10,000 ppm.

• The impact parameter (tce_impact) and the transit durations (tce_duration) covered the ranges listed in Table 1.

• In four cases the identified EPCs suggest multiple planetary systems with 2 and 3 planets.
4. Discussion and Conclusions
The TESS satellite provides observations of a large number (200,000) of stellar light curves with high photometric precision over the whole sky, divided into observing sectors, with the aim of detecting transiting Earth-sized planets. The stellar objects were selected to be among the brightest and closest to our solar system. The large dataset of nearly 27 gigabytes per day is then processed in the science data pipeline, providing nearly 11,000 TCEs as of the time of writing this paper. Further analysis of the TCEs is required to find confirmed exoplanets, or exoplanetary candidates for more in-depth processing. Evidently, this formidable data analysis task is difficult, if not impossible, to carry out manually. A feasible approach to the TESS data analysis is based on recently developed automated identification techniques, customized for transiting exoplanetary candidate identification, utilizing AI/ML methods based on DL neural networks combined with the anomaly identification methods reported in the present study. These EPCs could then be vetted further with targeted observations and data analysis.

In this study we apply a novel algorithm developed by ThetaRay, Inc. for cybersecurity and anomaly identification in financial systems. The advantage of this AI/ML system over other machine learning methods is the combination of semi-supervised and unsupervised anomaly-detection methodologies that do not rely on predefined rules or signatures.

Table 1: Some of the parameters (see text) of identified exoplanetary candidates (EPCs) from the TESS mission data archive at http://archive.stsci.edu/ using the ThetaRay system.
[Figure 3: eight panels of "local view" light curves (normalized flux vs. time bin, 0-250) for TIC IDs 101948569, 422655579, 178155732, 219403686, 270677759, 453767182, 308994098, and 423275733, each annotated with its tce_prad, tce_period, and tce_depth values.]
Figure 3: "Local view" normalized phase-folded light curves of selected exoplanetary candidates from Table 1, with the parameters tce_prad (the radius in terms of R_Earth), tce_period (in days), and tce_depth (in ppm) indicated on the corresponding panels. The typical eclipsing exoplanetary light-curve temporal shape is evident.

By the application of the ThetaRay algorithm to the TESS TCEs, we report 39 new planetary candidates in a wide range of sizes, from below Earth's radius to super-Jupiter radii, and planetary periods ranging from 0.38 d to just under 23 d. We demonstrate that the combination of DL neural networks with anomaly-identification mathematical techniques provides an efficient AI/ML algorithm for the rapid automated search of transiting exoplanet candidate light curves. Although we find that we need to apply manual vetting to reduce the number of false positives, the total number of EPC identifications is manageable for secondary manual vetting of the relatively small number of light curves, and this approach provides the desired identification results. In future applications, the ThetaRay algorithm could be further optimized for transiting exoplanet identification, for example by including informed ML steps, potentially reducing further the false-positive rate in this application and providing a new tool for analyzing TESS TCE data.
Acknowledgment
The resources for this research were provided by
ThetaRay, Inc.
LO would like to acknowledge the hospitality of the Department of Geosciences, Tel Aviv University.

Appendix
The classification of light curves as exoplanetary candidates in this paper is achieved by using the analytic platform of ThetaRay that is described in this appendix. This platform processes high dimensional big data to identify anomalous behavior in comparison to a normal profile. This anomaly detection tool is used in the present application for the classification of EPCs in the TESS TCE database. The normal profile is training-data driven, and its generation is explained below. In the present study we used the Kepler TCE data as the training dataset, as described in section 2.2. This appendix describes some of the algorithms that were utilized in this study for identifying anomalies in big data using augmentation, semi-supervised and unsupervised algorithms. The same core algorithms for anomaly identification are capable of identifying anomalies in cyber (malware), industrial malfunction (IoT) and financial (crimes) data. The algorithms were applied for the first time to astrophysical data in this study. These algorithms are part of the ThetaRay (https://thetaray.com/) core technology portfolio to fight financial crimes (Shabat et al., 2018a). The algorithms are housed in the ThetaRay Computational Platform, which enables efficient data manipulation and processing. The reported results were obtained by executing these algorithms on the ThetaRay platform.
Appendix A. Semi-supervised processing via augmentation: Introduction
For background and context, we briefly describe the ThetaRay system's current commercial applications, which have now been expanded and applied to an astrophysical dataset. The ThetaRay system is designed to provide fast and accurate analytic solutions for identifying emerging risk/crime (classified as anomalies) in financial data, discovering new opportunities, and exposing blind spots within these large, complex, high dimensional datasets. These AI-based algorithms radically reduce false positives and are uniquely able to uncover "unknown unknowns" (threats that one is not aware of, without even knowing that one is not aware of them). ThetaRay provides constructive solutions to anomaly detection challenges via its analytic platform designed for big data, uncovering previously unknown risks with industry-low false positive rates, in real time, enabling fast forensics.

In this project, we assume that some labels are given for the Kepler TCE data, which is a dataset related to the TESS TCEs, but not for the TESS data. An augmentation algorithm, which is considered a learning method, generates a new data frame based on the provided labels. The new data frame then serves as input to unsupervised algorithms. We apply four unsupervised algorithms to the augmented data: a geometric-based algorithm denoted by NY (see section Appendix C.1), an algebraic-based algorithm denoted by LU (see section Appendix C.2), a hybrid of LU and NY denoted by DK, and a neural network denoted by AE.

The augmentation method is based on a neural network. The default network (which can be user-adjusted) consists of one input layer (the analysis data frame), three hidden layers and one output layer. All the layers are connected through "weights" that are automatically tuned during the learning (optimization) process until the network output layer values are close to the values of the provided labels. After optimization, the third hidden layer becomes the new data frame as well as the input to the unsupervised algorithms that are outlined in section Appendix B, some of which are described in detail in section Appendix C (a code sketch of this augmentation step is given at the end of this appendix).

ThetaRay's platform covers detection and monitoring of several verticals, with current emphasis on financial crimes, by supplying an end-to-end solution. ThetaRay provides an un- and semi-supervised, real-time, agnostic, AI-based financial crimes detection platform based on anomaly detection algorithms for "unknown unknowns". Rule-based technology, which is very popular among anomaly detection tools, is intended for what is known, when one knows what to look for. ThetaRay's detection is achieved by un- and semi-supervised automatic methods that are not based on rules, patterns, signatures, heuristics, data semantics of the features or any prior domain expertise, and it provides a high detection rate with very low false positives. ThetaRay's methodologies within its Analytics Platform are based on unbiased detection through a series of randomized, advanced AI-based algorithms that can process any number of data features; the results can be explained and justified, and anomalies can be traced back to the features that triggered them, so the system is not a black box. Thus, the platform enables tracking past events and the features that triggered the occurrence of anomalies. ThetaRay's system operates under the assumption that one does not know what to look for or what to ask. This allows the technology to potentially detect every type of anomaly before the rules are discovered automatically. For efficient processing of the algorithms, the system uses off-the-shelf hardware components; the inherent parallelism in the algorithms is exploited through GPU utilization. The platform contains advanced and interactive visualization of the input and output phases of the data analysis. The detection approach is data driven; thus, no pre-existing models are assumed to exist. This makes the approach universal and generic and opens the way for different applications without introducing bias, limitations, or unfounded preconceptions into the processing, a property well suited for large astrophysical datasets. Mathematical and physical justifications for most of the available algorithms in the system are given below.

The input training data can be enriched by a given limited set of labels. This increases the detection rate and reduces the false alarm rate, and is part of the semi-supervised algorithms. Both semi- and unsupervised algorithms are used: currently, the platform contains eight different unsupervised algorithms for data without labels and three different semi-supervised algorithms for data with partial labels within the detection engine. The results are fused to produce one solution. ThetaRay combines the strengths of unsupervised and semi-supervised techniques to identify anomalies in the data. Unsupervised learning assumes that there are no labels for the various data components. Semi-supervised learning frameworks have made significant progress in training machine learning models with limited labeled data in the image domain. Augmented unsupervised learning can be used side by side with semi-supervised learning. The augmentation algorithms generate a new data frame based on the analysis data frame and the provided labels; the new data frame is then the input for all the selected unsupervised algorithms. Labels are binary, with the minority of the labels (known anomalies) marked as "1" and the remainder, the majority of unknown cases, assigned "0". The augmentation process enables covering both the known and the unknown with a relative balance between them. The ThetaRay system allows for configuration of the underlying input features, algorithms and detection logic for each application. Technically, augmentation is a neural network-based process which generates a new data frame based on the input data frame and the binary labels provided by the application (in the present case, stellar light-curve data).
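A minimal PyTorch sketch of the augmentation step follows. The layer widths, optimizer and training schedule are illustrative assumptions; the structure (one input layer, three hidden layers, one output layer trained against the binary labels, with the third hidden layer exported as the new data frame) follows the description above.

```python
import torch
import torch.nn as nn

class Augmenter(nn.Module):
    def __init__(self, n_features, widths=(128, 64, 32)):   # widths are assumptions
        super().__init__()
        self.h1 = nn.Sequential(nn.Linear(n_features, widths[0]), nn.ReLU())
        self.h2 = nn.Sequential(nn.Linear(widths[0], widths[1]), nn.ReLU())
        self.h3 = nn.Sequential(nn.Linear(widths[1], widths[2]), nn.ReLU())
        self.out = nn.Linear(widths[2], 1)       # output layer compared to labels

    def forward(self, x):
        z = self.h3(self.h2(self.h1(x)))         # third hidden layer activations
        return self.out(z), z

X = torch.randn(1000, 424)                       # stand-in analysis data frame
y = (torch.rand(1000, 1) < 0.05).float()         # binary labels: rare anomalies = 1

model = Augmenter(X.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):                             # weights tuned until outputs match labels
    opt.zero_grad()
    logits, _ = model(X)
    loss_fn(logits, y).backward()
    opt.step()

with torch.no_grad():
    _, new_frame = model(X)                      # new data frame for the unsupervised stage
```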
Appendix B. Unsupervised algorithms: General description

NY:
This algorithm (see Figure B.4) is based on the diffusion maps (DM) methodology (Coifman and Lafon, 2006a) and is primarily a non-linear dimension reduction process. The anomaly identification procedure takes place inside the lower dimensional space (manifold) that is determined automatically during the training phase. An out-of-sample extension procedure (Coifman and Lafon, 2006b) is applied in the identification phase to each multidimensional data point that did not participate in the training phase, to determine whether it belongs to the manifold (low dimensional space, classified as normal) or deviates from it (classified as anomalous).

The NY algorithm, which is based on DM, geometrizes the input training data. DM analyzes the ambient space (training data) and determines automatically where the data actually resides in the embedded space. We can visualize the input training data (ambient space) as a matrix of size m × n, where m is the number of multidimensional data points (the number of rows in the matrix) and each row is of dimension n, the number of columns in the matrix. The input data is assumed to be sampled from a low dimensional manifold (embedded space) that captures the dependencies between the observable parameters. DM reduces in a non-linear way the dimension of the ambient space, which is the training data. The dimensionality reduction by DM is based on local affinities between multidimensional data points and on a non-linear embedding of the ambient space into a lower dimensional space, described as a manifold, using a low rank matrix decomposition. The non-parametric nature of this analysis uncovers the important underlying factors of the input data and reveals the intrinsic geometry of the data represented by the embedded manifold. This manifold describes geometrically what we classify as the normal profile of the ambient data. Newly arrived multidimensional data points, which did not participate in the training procedure, are embedded into the lower dimensional space by the application of an out-of-sample extension algorithm. If the embedded multidimensional data point falls into the manifold, it is classified as normal; otherwise it is classified as abnormal (anomalous). See section Appendix C.1 for more details.

[Figure B.4: NY algorithm flow chart. Training: input high dimensional data → normalization to obtain a Markov matrix → extraction of the eigenvalues and eigenvectors of the Markov matrix → generation of the embedded manifold from the eigenvalues and eigenvectors. Detection: receive a newly arrived data point → if it belongs to the embedded manifold it is classified as normal, otherwise as abnormal (anomalous).]

LU:
Based on a randomized low-rank matrix decomposition (Shabat et al., 2018b). This algorithm builds a dictionary from the training data. Then, each newly arrived multidimensional data point that is not well described (not well spanned) by the dictionary is classified as an anomalous data point.

The randomized LU (RLU) algorithm is an algebraic approach applied to an input matrix A of size m × n with an intrinsic dimension k smaller than n; k can be computed automatically or given. RLU is a low rank matrix decomposition which enables the identification of anomalies using a dictionary constructed from the training data. RLU forms a low rank matrix approximation of A such that PAQ ≈ LU, where P and Q are orthogonal permutation matrices, and L and U are lower and upper triangular matrices, respectively. A dictionary is then constructed according to D = Pᵀ L (ᵀ denotes the matrix transpose). Thus, D is a linear combination of the input matrix and a representation of the normal data. It is also used in the identification step to classify newly arrived multidimensional data points that did not participate in the training phase: a new incoming multidimensional data point x which satisfies ‖DD†x − x‖ < ε is classified as normal; otherwise, it is classified as anomalous. Here, D† is the pseudo-inverse of D and ε is a quantity defined in the training phase. When applied to a matrix A of size m × n, the RLU decomposition reduces the number m of multidimensional data points, resulting in a reduced-measurements matrix of size k × n, where k < n < m. Although the algorithm is randomized, it has been proven in Shabat et al. (2018b) that the probability of the RLU approximation producing a large error is very small. See section Appendix C.2 for more details.

DK:
The DK algorithm relies on successive applications of LU and NY. Assume the size of a given training matrix is m data points (rows) by n features (columns). RLU (described in section Appendix C.2) is applied to reduce the number of features n substantially through the application of a random projection (Johnson and Lindenstrauss, 1984). Then NY (described in section Appendix C.1) is applied: the matrix is embedded into a lower dimensional space and the NY anomaly identification procedure is invoked in this embedded space.

AE:
This is a variational autoencoder (AE) algorithm. An AE is a machine learning tool designed to generate complex models of data after careful distribution modeling of example data. In neural network language, an AE consists of an encoder component and a decoder component. We assume that the input dataset is generated from an underlying unobserved (latent) representation. Given an input dataset, the encoder part of the AE approximates the distribution of the latent variables. The algorithm then sets the distribution parameters of the latent layers in a manner that maximizes the likelihood of generating or reconstructing the input data in the decoder section. As soon as the distribution of the latent variables is approximated, we can sample from this distribution to generate an approximate representation of the input data. Since normality consists of and is defined by most of the data points, those will be well approximated by the AE, while anomalies will be poorly modeled. Therefore, by comparing the original sample with the reconstructed (generated) data, we can calculate a similarity score that enables us to detect anomalies. The goal is to use the AE as a denoising autoencoder: it allows us to encode a sample into the latent space and then reconstruct it. By comparing the original sample to the reconstruction, we calculate a score that enables us to classify a data point as anomalous. Since we use the AE for anomaly detection, we calculate these scores for both the input and the output.
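To make the LU dictionary idea above concrete, the following is a minimal numpy/scipy sketch. It uses scipy's ordinary LU factorization as a stand-in for the randomized rank-revealing LU of Shabat et al. (2018b), and treats the columns of the training matrix as the data points so that the dimensions of the classification test work out; both choices are simplifying assumptions.

```python
import numpy as np
from scipy.linalg import lu

# Training matrix: here columns are taken to be the data points (an assumed
# convention so that the dimensions of the spanning test below match).
A = np.random.randn(30, 500)
p, l, u = lu(A)                                  # scipy convention: A = p @ l @ u
k = 10                                           # assumed intrinsic rank
D = (p @ l)[:, :k]                               # dictionary D = P^T L, rank-k truncation
D_pinv = np.linalg.pinv(D)

def is_anomalous(x, eps=1.0):
    """Normal if ||D D^+ x - x|| < eps (x is well spanned by the dictionary);
    eps would be set in the training phase and is arbitrary here."""
    return np.linalg.norm(D @ (D_pinv @ x) - x) >= eps

print(is_anomalous(np.random.randn(30)))
```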
Appendix C. Unsupervised algorithms: Mathematical description
Appendix C.1. Diffusion geometry: Background
DM is a kernel-based method for manifold learning that can reveal the intrinsic structures in data and embed them in a low dimensional space. The DM-based approach computes the diffusion geometry: a spectral embedding of the data points provides coordinates that are used to interpolate and approximate the pointwise diffusion-map embedding of the data.

Manifold learning approaches are often used for modeling and uncovering intrinsic low dimensional structure in high dimensional data. DM is a method that captures data manifolds with random walks that propagate through non-linear pathways in the data. Transition probabilities of a Markovian diffusion process (how to compute them is explained later) define an intrinsic diffusion distance metric that is amenable to a low dimensional embedding. By arranging the transition probabilities in a row-stochastic diffusion operator, and taking its leading eigenvalues and eigenvectors, one can derive a small set of coordinates in which diffusion distances are approximated by Euclidean distances and intrinsic manifold structures are revealed.

In more detail, the NY algorithm uncovers the internal geometry of the input training data, denoted A. The use of geometric considerations speeds up the anomaly detection computation significantly. The supporting theory is as follows. The goal is to detect anomalies in A and in newly arrived n-dimensional data points that did not participate in the training data A. During the training procedure, the size of n, which is also called the dimension of A, is automatically reduced; this procedure is called dimensionality reduction. Dimensionality reduction, as explained later, is achieved without damaging the quality and coherency of the data in A; moreover, there is no loss of data. Dimensionality reduction is just a different representation of the training data that, automatically and without any human intervention, reduces the dimension according to the data and uncovers the real dimension where the training data actually resides.

In general, anomaly detection is based on the notion of similarities (or affinities) between the m high dimensional data points (the rows of the matrix A). How do we detect anomalies in this big data efficiently, without introducing bias and without damaging the data? Dimensionality reduction of n is needed. How is this reduction achieved? The following provides the rationale for why geometrization of the training data A, and tracking the movement of newly arrived data points, identify a low dimensional manifold for learning. It is founded mathematically on the preservation of the quality and the integrity (completeness) of the data in A.

The assumption is that the processed data is imbalanced: high densities of n-dimensional samples (rows in the matrix A) represent normal data; otherwise the data is classified as anomalous (abnormal), since the majority of the data is normal and is thus classified as having high density.

Theory: how to find the low dimensional space (manifold)? It is proved that if A is sampled from a manifold of low intrinsic dimension then, as n (the dimension) tends to infinity, the defined random walk, which travels between all the data samples, converges to a diffusion process over the manifold. This is the key to processing A as a diffusion process, which guarantees an efficient scan of the data through randomization without the introduction of bias.
Three complementary approaches for dimensionality reduction - diffusion distances between n-dimensional samples, randomization, and manifold learning - emerge from this observation (theorem): 1. A (huge) kernel matrix B of size m × m is constructed from the distances among all the n-dimensional samples (rows); the distances are diffusion distances. 2. A random walk is applied to the entries of B; this random walk guarantees that there is no bias in the utilization of the distances in B. 3. Diffusion maps (DM) link the matrix B to a lower dimensional space (manifold) via diffusion processing; the dimension of the embedded manifold represents the reduction of n.

Geometrization of the training data - outline of the approach: The NY algorithm is based on geometric uncovering of a low dimensional manifold in the ambient space (the original space represented by A) by the application of DM to that ambient space. The input data is assumed to be sampled from a manifold of low intrinsic dimension that captures the dependencies between the observable parameters (n-dimensional features). DM reduces the dimension n of the training data. It is based on local affinities between multidimensional data points and on a non-linear embedding of the ambient space into a lower dimensional space, described as a manifold, using a low rank matrix decomposition. The non-parametric nature of this analysis uncovers the important underlying factors of the input data and reveals the intrinsic geometry of the data represented by the embedded manifold. This manifold describes geometrically what we classify as the normal profile in the ambient data. Newly arrived n-dimensional data points, which did not participate in the training procedure, are embedded into the lower dimensional space by the application of an out-of-sample extension algorithm. If the embedded n-dimensional data point falls into the manifold, where most of the normal data reside, it is classified as normal; otherwise it is classified as abnormal (anomalous). The exchange of data between the ambient space and the manifold, where the detection takes place, does not degrade the coherency and the completeness of the data, and preserves the geometrical relations (affinities) between the two spaces - ambient and embedded (manifold).

Appendix C.1.1. Diffusion geometry: outline

Let $X = \{x_1, \ldots, x_n\}$ be a dataset and let $k : X \times X \to \mathbb{R}$ be a symmetric point-wise positive kernel that defines a connected, undirected and weighted graph over $X$. Then a random walk over $X$ is defined by the $n \times n$ row-stochastic transition probability matrix $P = D^{-1}K$, where $K$ is the $n \times n$ matrix whose entries are $K_{ij} := k(x_i, x_j)$, $i, j = 1, \ldots, n$, and $D$ is the $n \times n$ diagonal degrees matrix whose $i$-th element is $d(i) := \sum_{j=1}^{n} k(x_i, x_j)$, $i = 1, \ldots, n$. The vector $d \in \mathbb{R}^n$ is referred to as the degrees vector of the graph defined by $k$.

The associated time-homogeneous random walk $X(t)$ is defined via the conditional probabilities on its state space $X$: assuming that the process starts at time $t = 0$, then for any time point $t \in \mathbb{N}$, $P(X(t) = x_j \mid X(0) = x_i) = P^t_{ij}$, where $P^t_{ij}$ is the $(i,j)$-th entry of the $t$-th power of the matrix $P$. As long as the process is aperiodic, it has a unique stationary distribution $\hat{d} \in \mathbb{R}^n$, which is the steady state of the process, i.e., $\hat{d}(j) = \lim_{t \to \infty} P^t_{ij}$, regardless of the initial state $X(0)$.
This steady state is the probability distribution resulting from the $\ell^1$ normalization of the degrees vector $d$, i.e.,
$$\hat{d} = \frac{d}{\|d\|_1} \in \mathbb{R}^n, \qquad \mathrm{(C.1)}$$
where $\|d\|_1 := \sum_{i=1}^{n} d(i)$. The diffusion distances at time $t$ are defined by the metric $D^{(t)} : X \times X \to \mathbb{R}$,
$$D^{(t)}(x_i, x_j) := \left\| P^t(i,:) - P^t(j,:) \right\|_{\ell^2(\hat{d}^{-1})} = \sqrt{\sum_{k=1}^{n} \left( P^t_{ik} - P^t_{jk} \right)^2 / \hat{d}(k)}, \quad i, j = 1, \ldots, n. \qquad \mathrm{(C.2)}$$
By definition, $P^t(i,:)$, the $i$-th row of $P^t$, is the probability distribution over $X$ after $t$ time steps, given that the initial state is $X(0) = x_i$. Therefore, the diffusion distance $D^{(t)}(x_i, x_j)$ from Eq. C.2 measures the difference between two propagations along $t$ time steps: the first originates in $x_i$ and the second in $x_j$. Weighting the metric by the inverse of the steady state ascribes high weight to similar probabilities on rare states, and vice versa. Thus, a family of diffusion geometries is defined by Eq. C.2, each corresponding to a single time step $t$.

Due to the above interpretation, the diffusion distances are naturally utilized for multiscale clustering, since they uncover the connectivity properties of the graph across time. In Bérard et al. (1994); Coifman and Lafon (2006a) it has been proven that, under some conditions, if $X$ is sampled from a manifold of low intrinsic dimension then, as $n$ tends to infinity, the defined random walk converges to a diffusion process over the manifold.
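A minimal numpy sketch of the quantities just defined follows: the Markov matrix $P = D^{-1}K$, the steady state of Eq. C.1, and the diffusion distance of Eq. C.2. The Gaussian kernel and its width are arbitrary assumptions for illustration.

```python
import numpy as np

X = np.random.randn(200, 5)                      # n data points in R^5 (stand-in)
sigma = 1.0                                      # assumed kernel width

# Symmetric positive kernel K_ij = k(x_i, x_j) and degrees d(i) = sum_j K_ij.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))
d = K.sum(axis=1)
P = K / d[:, None]                               # row-stochastic: P = D^{-1} K

d_hat = d / d.sum()                              # steady state, Eq. (C.1)

def diffusion_distance(i, j, t=2):
    """D^(t)(x_i, x_j) of Eq. (C.2): weighted l2 distance between rows of P^t."""
    Pt = np.linalg.matrix_power(P, t)
    return np.sqrt((((Pt[i] - Pt[j]) ** 2) / d_hat).sum())

print(diffusion_distance(0, 1))
```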
Appendix C.2. Randomized LU decomposition: An algorithm for dictionary construction

A dictionary construction algorithm is presented. It is based on a low-rank matrix factorization achieved by the application of the randomized LU decomposition (Shabat et al., 2018b) to training data. This method is fast, scalable, parallelizable, consumes low memory, outperforms the SVD in these categories, and also works extremely well on large sparse matrices. In contrast to existing methods, the randomized LU decomposition constructs an under-complete dictionary, which simplifies both the construction and the classification of newly arrived multidimensional data points. The dictionary construction is generic and fits different applications.

The randomized LU algorithm, which is applied to a given training data matrix $A \in \mathbb{R}^{m \times n}$ of $m$ multidimensional data points and $n$ features, decomposes $A$ into two matrices $L$ and $U$. The size of $L$ is determined by the decaying spectrum of the singular values of $A$ and is bounded by $\min\{n, m\}$; both $L$ and $U$ are of full rank.

The randomized LU decomposition algorithm (see Figure C.5) computes the rank-$k$ LU approximation of a full matrix (Algorithm 1). The main building blocks of the algorithm are random projections and the Rank Revealing LU decomposition (RRLU) (Pan, 2000), used to obtain a stable low-rank approximation of an input matrix $A$ that serves as training data. In Figure C.5, part II describes the generation of the dictionary by calling part I, which describes the flow of the randomized LU decomposition. The end of the execution of I means that the training is completed. The dictionaries are the input of II, which performs the identification: a newly arrived data point that did not participate in the training is either spanned by the dictionary (classified as normal) or not (classified as anomalous).

[Figure C.5: II calls the construction of a dictionary D via the randomized LU decomposition described in I. The inputs to the algorithm are a matrix A and its rank k (see I). They are submitted to the randomized LU, which generates the following outputs: permutation matrices P and Q, and lower and upper triangular matrices L and U, respectively. The dictionary D and its pseudo-inverse are constructed from the lower matrix and the permutation matrix. Then a newly arrived data point that did not participate in the training is either spanned by D, and therefore classified as normal, or otherwise classified as abnormal (anomalous).]

The RRLU algorithm, used in Algorithm 1, reveals the connection between the LU decomposition of a matrix and its singular values. Similar algorithms exist for rank revealing QR decompositions (see, for example, Gu and Eisenstat (1996)).

Theorem Appendix C.1 (Pan (2000)). Let $A$ be an $m \times n$ matrix ($m \gg n$). Given an integer $1 \leq k < n$, the following factorization
$$PAQ = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I_{n-k} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} \qquad \mathrm{(C.3)}$$
holds, where $L_{11}$ is lower triangular with ones on the diagonal, $U_{11}$ is upper triangular, and $P$ and $Q$ are orthogonal permutation matrices. Let $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_n \geq 0$ be the singular values of $A$; then
$$\sigma_k \geq \sigma_{\min}(L_{11}U_{11}) \geq \frac{\sigma_k}{k(n-k)+1}, \qquad \mathrm{(C.4)}$$
and
$$\sigma_{k+1} \leq \|U_{22}\|_2 \leq (k(n-k)+1)\,\sigma_{k+1}. \qquad \mathrm{(C.5)}$$

Based on Theorem Appendix C.1, we have the following definition:

Definition Appendix C.1 (RRLU rank-$k$ approximation, denoted $\mathrm{RRLU}_k$). Given an RRLU decomposition (Theorem Appendix C.1) of a matrix $A$ with an integer $k$ (as in Eq. C.3) such that $PAQ = LU$, the RRLU rank-$k$ approximation is defined by taking $k$ columns from $L$ and $k$ rows from $U$ such that
$$\mathrm{RRLU}_k(PAQ) = \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \end{pmatrix}, \qquad \mathrm{(C.6)}$$
where $L_{11}$, $L_{21}$, $U_{11}$, $U_{12}$, $P$ and $Q$ are defined in Theorem Appendix C.1.

Lemma Appendix C.2 (Shabat et al. (2018b), RRLU approximation error). The error of the $\mathrm{RRLU}_k$ approximation of $A$ is
$$\|PAQ - \mathrm{RRLU}_k(PAQ)\| \leq (k(n-k)+1)\,\sigma_{k+1}. \qquad \mathrm{(C.7)}$$

Algorithm 1 describes the flow of the RLU decomposition algorithm.

Appendix C.2.1. Randomized LU Based Classification Algorithm
Algorithm 1 describes the flow of the randomized LU (RLU) decomposition algorithm.

Appendix C.2.1. Randomized LU Based Classification Algorithm

Based on Section Appendix C.2, we apply the randomized LU decomposition (Algorithm 1) to the matrix A, yielding PAQ ≈ LU. The outputs P and Q are orthogonal permutation matrices.

Algorithm 1: Randomized LU Decomposition
Input: matrix A of size m × n to decompose; k, the rank of A; l, the number of columns to use (for example, l = k + 5).
Output: matrices P, Q, L, U such that ‖PAQ − LU‖ ≤ O(σ_{k+1}(A)), where P and Q are orthogonal permutation matrices, L and U are the lower and upper triangular matrices, respectively, and σ_{k+1}(A) is the (k + 1)-th singular value of A.
1. Create a matrix G of size n × l whose entries are i.i.d. Gaussian random variables with zero mean and unit standard deviation.
2. Y ← AG.
3. Apply the RRLU decomposition (see Pan (2000)) to Y such that PYQ_y = L_yU_y.
4. Truncate L_y and U_y by choosing the first k columns and the first k rows, respectively: L_y ← L_y(:, 1:k) and U_y ← U_y(1:k, :).
5. B ← L_y† PA (L_y† is the pseudo-inverse of L_y).
6. Apply an LU decomposition to B with column pivoting: BQ = L_bU_b.
7. L ← L_yL_b.
8. U ← U_b.
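A minimal sketch of Algorithm 1 follows, reusing lu_complete_pivoting() from the previous sketch in place of the RRLU step; the column-pivoted LU of step 6 is obtained from a row-pivoted LU of B^T via scipy.linalg.lu. This is our illustration of the published algorithm (Shabat et al., 2018b), not the ThetaRay implementation:

import numpy as np
from scipy.linalg import lu

def randomized_lu(A, k, l=None, rng=None):
    """Sketch of Algorithm 1. Returns index arrays p, q and factors L, U
    with A[p][:, q] ~ L @ U up to O(sigma_{k+1}(A)). Relies on
    lu_complete_pivoting() from the previous sketch as the RRLU step."""
    m, n = A.shape
    l = k + 5 if l is None else l                  # oversampling, e.g. l = k + 5
    rng = np.random.default_rng() if rng is None else rng
    G = rng.standard_normal((n, l))                # step 1: Gaussian test matrix
    Y = A @ G                                      # step 2: random projection, m x l
    p, _, L_y, _ = lu_complete_pivoting(Y)         # step 3: P Y Q_y = L_y U_y
    L_y = L_y[:, :k]                               # step 4: first k columns of L_y
    B = np.linalg.pinv(L_y) @ A[p, :]              # step 5: B = L_y^+ P A  (k x n)
    # step 6: column-pivoted LU of B via row-pivoted LU of B.T:
    # B.T = P_b @ L_t @ U_t  =>  B @ P_b = U_t.T @ L_t.T = L_b @ U_b
    P_b, L_t, U_t = lu(B.T)
    q = np.argmax(P_b, axis=0)                     # column permutation as indices
    L = L_y @ U_t.T                                # step 7: L = L_y L_b
    U = L_t.T                                      # step 8: U = U_b
    return p, q, L, U

# usage on a synthetic matrix with a decaying spectrum
rng = np.random.default_rng(1)
m, n, k = 500, 30, 10
Q1, _ = np.linalg.qr(rng.standard_normal((m, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (Q1 * 2.0 ** -np.arange(n)) @ Q2.T
p, q, L, U = randomized_lu(A, k, rng=rng)
print(np.linalg.norm(A[p][:, q] - L @ U, 2))       # ~ sigma_{k+1} = 2**-k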
Theorem Appendix C.3 shows that P^T L forms (up to a certain accuracy) a basis for A. This is the key property of the classification algorithm.

Theorem Appendix C.3 (Shabat et al. (2018b)). Given a matrix A, let its randomized LU decomposition be PAQ ≈ LU. Then the error of representing A by P^T L satisfies

\|(P^TL)(P^TL)^\dagger A - A\| \le \left(\sqrt{nl\beta^2\gamma^2 + 1} + 2\sqrt{nl}\,\beta\gamma\,(k(n-k)+1)\right)\sigma_{k+1}(A). \tag{C.8}

Let x be a multidimensional data point and let D = P^T L be a dictionary. The distance between x and the dictionary D is defined by dist(x, D) ≜ ‖DD†x − x‖, where D† is the pseudo-inverse of the matrix D. If dist(x, D) ≤ ε, then x is classified as normal; otherwise, it is classified as anomalous.
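The classification rule can be sketched as follows. One caveat: the text stores the m data points as rows of A, while D = P^T L spans a subspace of R^m; for a newly arrived feature vector x to be comparable with span(D), the sketch below arranges training points as columns of X (i.e., it works with the transpose). That arrangement, the threshold eps, and all names are our assumptions, not the ThetaRay implementation:

import numpy as np

def build_dictionary(X, k, rng=None):
    """Training: dictionary D = P^T L from the randomized LU of X, where
    columns of X are training points (features x samples)."""
    p, q, L, U = randomized_lu(X, k, rng=rng)   # sketch of Algorithm 1 above
    D = L[np.argsort(p), :]                     # P^T L: undo the row permutation
    return D, np.linalg.pinv(D)

def dist_to_dictionary(x, D, D_pinv):
    """dist(x, D) = ||D D^+ x - x||: residual of projecting x onto span(D)."""
    return np.linalg.norm(D @ (D_pinv @ x) - x)

def classify(x, D, D_pinv, eps):
    """Normal if x is (approximately) spanned by the dictionary, else anomalous."""
    return "normal" if dist_to_dictionary(x, D, D_pinv) <= eps else "anomalous"

# usage on synthetic data: normal points lie in a k-dimensional subspace
rng = np.random.default_rng(2)
n_features, n_samples, k = 20, 500, 5
basis = rng.standard_normal((n_features, k))
X = basis @ rng.standard_normal((k, n_samples))     # training matrix
D, D_pinv = build_dictionary(X, k, rng=rng)
x_in = basis @ rng.standard_normal(k)               # lies in the training subspace
x_out = rng.standard_normal(n_features)             # generic off-subspace point
print(classify(x_in, D, D_pinv, eps=1e-6))          # -> normal
print(classify(x_out, D, D_pinv, eps=1e-6))         # -> anomalous (w.h.p.)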
References

Ansdell, M., Ioannou, Y., Osborn, H.P., Sasdelli, M., 2018 NASA Frontier Development Lab Exoplanet Team, Smith, J.C., Caldwell, D., Jenkins, J.M., Räissi, C., Angerhausen, D., 2018 NASA Frontier Development Lab Exoplanet Mentors, 2018. Scientific Domain Knowledge Improves Exoplanet Transit Classification with Deep Learning. Astrophys. J. Lett. 869, L7. arXiv:1810.13434.

Bérard, P., Besson, G., Gallot, S., 1994. Embedding Riemannian manifolds by their heat kernel. Geometric and Functional Analysis GAFA 4, 373–398.

Borucki, W.J., Koch, D., Basri, G., Batalha, N., Brown, T., Caldwell, D., Caldwell, J., Christensen-Dalsgaard, J., Cochran, W.D., DeVore, E., Dunham, E.W., Dupree, A.K., Gautier, T.N., Geary, J.C., Gilliland, R., Gould, A., Howell, S.B., Jenkins, J.M., Kondo, Y., Latham, D.W., Marcy, G.W., Meibom, S., Kjeldsen, H., Lissauer, J.J., Monet, D.G., Morrison, D., Sasselov, D., Tarter, J., Boss, A., Brownlee, D., Owen, T., Buzasi, D., Charbonneau, D., Doyle, L., Fortney, J., Ford, E.B., Holman, M.J., Seager, S., Steffen, J.H., Welsh, W.F., Rowe, J., Anderson, H., Buchhave, L., Ciardi, D., Walkowicz, L., Sherry, W., Horch, E., Isaacson, H., Everett, M.E., Fischer, D., Torres, G., Johnson, J.A., Endl, M., MacQueen, P., Bryson, S.T., Dotson, J., Haas, M., Kolodziejczak, J., Van Cleve, J., Chandrasekaran, H., Twicken, J.D., Quintana, E.V., Clarke, B.D., Allen, C., Li, J., Wu, H., Tenenbaum, P., Verner, E., Bruhweiler, F., Barnes, J., Prsa, A., 2010. Kepler Planet-Detection Mission: Introduction and First Results. Science 327, 977.

Brown, T.M., Latham, D.W., Everett, M.E., Esquerdo, G.A., 2011. Kepler Input Catalog: Photometric Calibration and Stellar Classification. Astron. J. 142, 112. arXiv:1102.0342.

Catanzarite, J.H., 2015. Autovetter Planet Candidate Catalog for Q1–Q17 Data Release 24. KSCI-19091-001, NASA Ames Research Center, Moffett Field, CA.

Christiansen, J.L., Jenkins, J.M., Caldwell, D.A., Burke, C.J., Tenenbaum, P., Seader, S., Thompson, S.E., Barclay, T.S., Clarke, B.D., Li, J., Smith, J.C., Stumpe, M.C., Twicken, J.D., Cleve, J.V., 2012. The derivation, properties, and value of Kepler's combined differential photometric precision. Publications of the Astronomical Society of the Pacific 124, 1279–1287. URL: https://doi.org/10.1086/668847.

Coifman, R.R., Lafon, S., 2006a. Diffusion maps. Applied and Computational Harmonic Analysis 21, 5–30.

Coifman, R.R., Lafon, S., 2006b. Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Applied and Computational Harmonic Analysis 21, 31–52.

Coughlin, J.L., Mullally, F., Thompson, S.E., Rowe, J.F., Burke, C.J., Latham, D.W., Batalha, N.M., Ofir, A., Quarles, B.L., Henze, C.E., Wolfgang, A., Caldwell, D.A., Bryson, S.T., Shporer, A., Catanzarite, J., Akeson, R., Barclay, T., Borucki, W.J., Boyajian, T.S., Campbell, J.R., Christiansen, J.L., Girouard, F.R., Haas, M.R., Howell, S.B., Huber, D., Jenkins, J.M., Li, J., Patil-Sabale, A., Quintana, E.V., Ramirez, S., Seader, S., Smith, J.C., Tenenbaum, P., Twicken, J.D., Zamudio, K.A., 2016. Planetary Candidates Observed by Kepler. VII. The First Fully Uniform Catalog Based on the Entire 48-month Data Set (Q1–Q17 DR24). Astrophys. J. Supp. 224, 12. arXiv:1512.06149.

Dattilo, A., Vanderburg, A., Shallue, C.J., Mayo, A.W., Berlind, P., Bieryla, A., Calkins, M.L., Esquerdo, G.A., Everett, M.E., Howell, S.B., Latham, D.W., Scott, N.J., Yu, L., 2019. Identifying Exoplanets with Deep Learning. II.
Two New Super-Earths Uncovered by a Neural Network in K2 Data. Astron. J. 157, 169. arXiv:1903.10507.

Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., Sculley, D., 2017. Google Vizier: A Service for Black-Box Optimization. ACM, ISBN 978-1-4503-4887-4/17/08, 1487.

Gu, M., Eisenstat, S.C., 1996. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing 17, 848–869.

Jenkins, J.M., Caldwell, D.A., Chandrasekaran, H., Twicken, J.D., Bryson, S.T., Quintana, E.V., Clarke, B.D., Li, J., Allen, C., Tenenbaum, P., Wu, H., Klaus, T.C., Cleve, J.V., Dotson, J.A., Haas, M.R., Gilliland, R.L., Koch, D.G., Borucki, W.J., 2010a. Initial Characteristics of Kepler Long Cadence Data for Detecting Transiting Planets. Astrophys. J. Lett. 713, L120–L125. URL: https://doi.org/10.1088/2041-8205/713/2/L120.

Jenkins, J.M., Caldwell, D.A., Chandrasekaran, H., Twicken, J.D., Bryson, S.T., Quintana, E.V., Clarke, B.D., Li, J., Allen, C., Tenenbaum, P., Wu, H., Klaus, T.C., Middour, C.K., Cote, M.T., McCauliff, S., Girouard, F.R., Gunter, J.P., Wohler, B., Sommers, J., Hall, J.R., Uddin, A.K., Wu, M.S., Bhavsar, P.A., Cleve, J.V., Pletcher, D.L., Dotson, J.A., Haas, M.R., Gilliland, R.L., Koch, D.G., Borucki, W.J., 2010b. Overview of the Kepler Science Processing Pipeline. Astrophys. J. Lett. 713, L87–L91. URL: https://doi.org/10.1088/2041-8205/713/2/L87.

Johnson, W.B., Lindenstrauss, J., 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, 1.

Koch, D.G., Borucki, W.J., Basri, G., Batalha, N.M., Brown, T.M., Caldwell, D., Christensen-Dalsgaard, J., Cochran, W.D., DeVore, E., Dunham, E.W., Gautier, T.N., Geary, J.C., Gilliland, R.L., Gould, A., Jenkins, J., Kondo, Y., Latham, D.W., Lissauer, J.J., Marcy, G., Monet, D., Sasselov, D., Boss, A., Brownlee, D., Caldwell, J., Dupree, A.K., Howell, S.B., Kjeldsen, H., Meibom, S., Morrison, D., Owen, T., Reitsema, H., Tarter, J., Bryson, S.T., Dotson, J.L., Gazis, P., Haas, M.R., Kolodziejczak, J., Rowe, J.F., Cleve, J.E.V., Allen, C., Chandrasekaran, H., Clarke, B.D., Li, J., Quintana, E.V., Tenenbaum, P., Twicken, J.D., Wu, H., 2010. Kepler Mission Design, Realized Photometric Performance, and Early Science. Astrophys. J. Lett. 713, L79–L86. URL: https://doi.org/10.1088/2041-8205/713/2/L79.

Mandel, K., Agol, E., 2002. Analytic Light Curves for Planetary Transit Searches. Astrophys. J. Lett. 580, L171–L175. arXiv:astro-ph/0210099.

Osborn, H.P., Ansdell, M., Ioannou, Y., Sasdelli, M., Angerhausen, D., Caldwell, D., Jenkins, J.M., Räissi, C., Smith, J.C., 2020. Rapid classification of TESS planet candidates with convolutional neural networks. Astron. Astrophys. 633, A53. arXiv:1902.08544.

Pan, C.T., 2000. On the existence and computation of rank-revealing LU factorizations. Linear Algebra and its Applications 316, 199–222.

Ricker, G.R., Winn, J.N., Vanderspek, R., Latham, D.W., Bakos, G.Á.,
Bean, J.L., Berta-Thompson, Z.K., Brown, T.M., Buchhave, L., Butler, N.R., Butler, R.P., Chaplin, W.J., Charbonneau, D., Christensen-Dalsgaard, J., Clampin, M., Deming, D., Doty, J., De Lee, N., Dressing, C., Dunham, E.W., Endl, M., Fressin, F., Ge, J., Henning, T., Holman, M.J., Howard, A.W., Ida, S., Jenkins, J., Jernigan, G., Johnson, J.A., Kaltenegger, L., Kawai, N., Kjeldsen, H., Laughlin, G., Levine, A.M., Lin, D., Lissauer, J.J., MacQueen, P., Marcy, G., McCullough, P.R., Morton, T.D., Narita, N., Paegert, M., Palle, E., Pepe, F., Pepper, J., Quirrenbach, A., Rinehart, S.A., Sasselov, D., Sato, B., Seager, S., Sozzetti, A., Stassun, K.G., Sullivan, P., Szentgyorgyi, A., Torres, G., Udry, S., Villasenor, J., 2014. Transiting Exoplanet Survey Satellite (TESS). Volume 9143 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, p. 914320.

Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464. URL: https://doi.org/10.1214/aos/1176344136.

Shabat, G., Segev, D., Averbuch, A., 2018a. Uncovering unknown unknowns in financial services big data by unsupervised methodologies: Present and future trends, in: Proceedings of Machine Learning Research, KDD 2017 Workshop on Anomaly Detection in Finance, pp. 8–19.

Shabat, G., Shmueli, Y., Aizenbud, Y., Averbuch, A., 2018b. Randomized LU decomposition. Applied and Computational Harmonic Analysis 44, 246–272.

Shallue, C.J., Vanderburg, A., 2018. Identifying Exoplanets with Deep Learning: A Five-planet Resonant Chain around Kepler-80 and an Eighth Planet around Kepler-90. Astron. J. 155, 94. arXiv:1712.05044.

Yu, L., Vanderburg, A., Huang, C., Shallue, C.J., Crossfield, I.J.M., Gaudi, B.S., Daylan, T., Dattilo, A., Armstrong, D.J., Ricker, G.R., Vanderspek, R.K., Latham, D.W., Seager, S., Dittmann, J., Doty, J.P., Glidden, A., Quinn, S.N., 2019. Identifying Exoplanets with Deep Learning. III. Automated Triage and Vetting of TESS Candidates. Astron. J. 158, 25. arXiv:1904.02726.

Zucker, S., Giryes, R., 2018. Shallow Transits—Deep Learning. I. Feasibility Study of Deep Learning to Detect Periodic Transits of Exoplanets. Astron. J. 155, 147. arXiv:1711.03163.