Advanced Astroinformatics for Variable Star Classification
by

Kyle Burton Johnston

Master of Space Sciences
Department of Physics and Space Sciences
College of Science, Florida Institute of Technology
2006

Bachelor of Astrophysics
Department of Physics and Space Sciences
College of Science, Florida Institute of Technology
2004

A dissertation submitted to Florida Institute of Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Space Sciences

Melbourne, Florida
April, 2019

© Copyright 2019 Kyle Burton Johnston. All Rights Reserved.
The author grants permission to make single copies.

We the undersigned committee hereby approve the attached dissertation.

Title: Advanced Astroinformatics for Variable Star Classification
Author: Kyle Burton Johnston

Saida Caballero-Nieves, Ph.D., Assistant Professor, Aerospace, Physics and Space Sciences (Committee Chair)
Adrian M. Peter, Ph.D., Associate Professor, Computer Engineering and Sciences (Outside Committee Member)
Véronique Petit, Ph.D., Assistant Professor, Physics and Astronomy
Eric Perlman, Ph.D., Professor, Aerospace, Physics and Space Sciences
Daniel Batcheldor, Ph.D., Professor and Department Head, Aerospace, Physics and Space Sciences

ABSTRACT

Title: Advanced Astroinformatics for Variable Star Classification
Author: Kyle Burton Johnston
Major Advisor: Saida Caballero-Nieves, Ph.D.

This project outlines the complete development of a variable star classification algorithm methodology. With the advent of Big Data in astronomy, professional astronomers are left with the problem of how to manage large amounts of data and how this deluge of information can be studied in order to improve our understanding of the universe.
While our focus will be on the development of machine learning methodologies for the identification of variable star type based on light curve data and associated information, one of the goals of this work is the acknowledgment that the development of a true machine learning methodology must include not only a study of what goes into the service (features, optimization methods) but also a study of how we understand what comes out of the service (performance analysis). The complete development of a beginning-to-end system development strategy is presented as the following individual developments: simulation, training, feature extraction, detection, classification, and performance analysis. We propose that a complete machine learning strategy for use in the upcoming era of big data from the next generation of big telescopes, such as LSST, must consider this type of design integration.

Table of Contents
Abstract iii
List of Figures xi
List of Tables xxi
Abbreviations xxiii
Acknowledgments xxvi
Dedication xxvii

1 Introduction 1

2.1.1 β Persei (EA, Detached Binaries) 28
2.1.2 β Lyrae (EB, Semi-Detached Binary) 29
2.1.3 W Ursae Majoris (EW, Contact Binary) 32
2.2 Pulsating Variables 34
2.2.1 Cepheid (I & II) 38
2.2.2 RR Lyr (a, b, & c) 39
2.2.3 Delta Scuti & SX Phe 40
7 Multi-View Classification of Variable Stars Using Metric Learning 195

8.1 Variable Star Analysis 237
8.2 Future Development 242
8.2.1 Standard Data Sets 242
8.2.2 Simulation 244
8.3 Results 246
A Chapter 4: Broad Class Performance Results 286
B Chapter 5: Optimization Analysis Figures 292
B.1 Chapter 5: Performance Analysis Tables 295
C Chapter 7: Additional Performance Comparison 297

List of Figures

2.3 Conceptual overview of β Persei—EA type—eclipsing binaries (Detached Binaries). Top figure is a representation of the binary orientation with respect to a viewer and the resulting observed light curve. Bottom left: representation of the positions of the gravitational potential well for the two components in a binary. Bottom right: depicts the Roche lobe envelope of the system. 29
2.4 Example β Persei (EA, Detached Binaries) Phased Time Domain Curve, light curve generated from LINEAR data (LINEAR ID: 10013411). 30
2.5 Conceptual overview of β Lyrae—EB type—eclipsing binaries (Semi-Detached Binaries). Top figure is a representation of the binary orientation with respect to a viewer and the resulting observed light curve. Bottom left: representation of the positions of the gravitational potential well for the two components in a binary. Bottom right: depicts the Roche lobe envelope of the system. 31
2.6 Example β Lyrae (EB, Semi-Detached Binary) Phased Time Domain Curve, light curve generated from Kepler data (KIC: 02719436). 32
2.7 Overview of EW, Contact Binary. Top figure is a representation of the binary orientation with respect to a viewer and the resulting observed light curve. Bottom left: representation of the positions of the gravitational potential well for the two components in a binary. Bottom right: depicts the Roche lobe envelope of the system. 33
2.8 Example W Ursae Majoris (EW, Contact Binaries) Phased Time Domain Curve, light curve generated from Kepler data (KIC: 6024572). 34
2.9 Conceptual representation of the Cepheid instability strip cycle.
Top figure: the cycle of energy absorption, atmospheric heating, atmospheric expansion, and energy release (cyclic process). Bottom figure: the resulting light curve caused by the cyclic process. 38
2.10 Example phased time domain waveform. Left: Example RR Lyr (ab) (LINEAR ID: 10021274); Right: Example RR Lyr (c) (LINEAR ID: 10032668); data collected from the LINEAR dataset. 39
2.11 Example Delta Scuti phased time domain curve, data collected from the LINEAR dataset (LINEAR ID: 1781225). 41
3.1 Example Raw Time Domain Curves. Left: Kepler Data (KIC: 6024572); Right: LINEAR (LINEAR ID: 10003298). 44
3.2 Histogram of the number of observations per target star for targets in the Villanova Kepler Eclipsing Binary Catalogue (x-axis: number of observations for a given star; y-axis: number of stars with that number of observations). Graph shows the expected data size per light curve (∼60000 points). 45
3.3 Left: Raw Time Domain Data; Right: Corresponding Transformed Phased Time Domain Data (LINEAR ID: 10003298). 49
3.4 Left: Transformed Phased Time Domain Data; Right: Corresponding SUPER-SMOOTHER Phased Time Domain Data (Kepler Data, KIC: 5820209). 51
3.5 Example of community standard features used in classification. 53
3.6 High-level categories and examples of data discovery and machine learning techniques. (1) Class discovery is used to generate groupings from previously analyzed data, (2) correlation discovery is used to construct models of inference based on prior analysis, (3) anomaly discovery is used to determine if new observations follow historical patterns, and (4) link discovery is used to make inferences based on collections of data. 55
3.7 The HR Diagram Test Data to be used in the demonstration of standard classifiers.
The original dataset shown here contains 45 solar neighborhood stars on an HR diagram (Red: Main Sequence Stars, Green: White Dwarfs, Blue: Supergiants and Giants). Data shown here is for demonstration purposes. 57
3.8 High-level categories and examples of correlation discovery techniques. (1) Rule-based learning uses simple if/else decisions to make inferences about data, (2) simple machine learning uses statistical models to make inferences about data, (3) artificial neural networks combine multiple simple machine learning models to make inferences about data, and (4) adaptive learning allows for the iterative training/retraining of statistical models as more data become available or as definitions change. 58
3.9 Decision space example; for this example the figure's axes are merely an example dimension (x1, x2) to represent a generic bivariate space. 60
3.10 An example k-NN decision space: the new data point (in green) is compared to the training data (in red and blue), and distances are determined between the observation and the training data. The circles (solid and dashed) represent the nearest neighbor boundaries, k = 3 and k = 5 respectively. Notice how the classification of the new observation changes between these two settings (from red to blue). 62
3.11 k-NN decision space example in our standard generic (x1, x2) coordinate system. 63
3.12 PWC example showing the transition between points with kernels mapped to their spatial (x, y) coordinates (Left) to the approximated probability distribution heat map (Right). 65
3.13 PWC decision space example in our standard generic (x1, x2) coordinate system. 65
3.14 A simple example of a decision tree generated based on expert knowledge.
The solid circles represent decision nodes (simple if/else statements), and the empty circles represent terminal (leaf) nodes. Note that "yes" always goes to the left child node. 67
3.15 An example Simple Decision Tree with parent (t) and left and right children (tL and tR). 68
3.16 CART decision space example in our standard generic (x1, x2) coordinate system. 69
3.17 An example diagram for a Single-Layer Perceptron with a single outer layer and an inner layer with three perceptrons. 72
3.18 An example diagram for a Multilayer Perceptron; the left figure is the multilayer structure, the right is the perceptron/activation node (with summation and transformation). 73
3.19 MLP decision space in our standard generic (x1, x2) coordinate system. 74
3.20 Correlation of features between two different surveys; note the features collected from each survey were independent, but on some of the same targets. Light colors represent no correlation between features; dark colors represent high correlation. 77
3.21 Example of the change in distance between points as a result of Metric Learning optimization. Note, the left figure is the Euclidean case; the right figure is with a tailored metric. 81
3.22 Diagram demonstrating the goal of Large-Margin Nearest Neighbor Metric Learning; the objective function is constructed such that observations from the same class are brought closer together, while observations from different classes are pushed further apart [Weinberger et al., 2009]. 83
4.1 Left: PCA applied to the ASAS+Hipp+OGLE dataset, with the broad class labels identified and the first two principal components plotted.
Right: only the stars classified as pulsating are highlighted. 99
4.2 SPCA applied to the ASAS+Hipp+OGLE dataset. 100
4.3 ROC vs. PR AUC plot for the Multi-Layer Perceptron classifier; generic class types (Eruptive, Giants, Cepheids, RRlyr, Other Pulsing, Multi-star, and Other) are colored. The line y = x is plotted for reference (dashed line). 110
4.4 ROC vs. PR AUC plot for the Random Forest classifier; generic class types (Eruptive, Giants, Cepheids, RRlyr, Other Pulsing, Multi-star, and Other) are colored. The line y = x is plotted for reference (dashed line). Left: shows the breakdown per generic class type; Right: shows the difference between populations with more than 55 members in the initial training dataset. 112
4.5 Fraction of anomalous points found in the training dataset as a function of the Gaussian kernel spread used in the Kernel-SVM. 115
4.6 Plot of OC-SVM results applied to training data only. 116
4.7 OC-SVM testing of the testing data. 117
5.1 Example state space representation. 135
5.2 UCR phased light curves. Classes are given by number only: 1 = blue line (Eclipsing Binaries), 2 = green small dashed line (Cepheid), 3 = red big dashed line (RR Lyr). 144
5.3 ECVA reduced feature space using the UCR Star Light Curve Data. 145
5.4 First two Extended Canonical Variates for the time-domain feature space. 150
5.5 OC-PWC kernel width optimization for LINEAR data. 153
6.1 Conceptual overview of a β Lyrae—EB type—eclipsing binary (Semi-Detached Binary) that has the O'Connell Effect, an asymmetry of the maxima. The figure is a representation of the binary orientation with respect to a viewer and the resulting observed light curve.
161
6.2 An example phased light curve of an eclipsing binary with the O'Connell effect (KIC: 10861842). The light curve has been phased such that the global minimum (cooler in front of hotter) is at lag 0 and the secondary minimum (hotter in front of cooler) is at approximately lag 0.5. The side-by-side binary orientations are at approximately 0.25 and 0.75. Note that the maxima, corresponding to the side-by-side orientations, have different values. 162
6.3 An example phased light curve (top) and the transformed distribution field (bottom) of an Eclipsing Binary with the O'Connell effect (KIC: 7516345). 173
6.4 The mean (solid) and a 1-σ standard deviation (dashed) of the distribution of O'Connell effect Eclipsing Binary phased light curves discovered via the proposed detector out of the Kepler Eclipsing Binary catalog. 181
6.5 The phased light curves of the discovered OEEB data from Kepler, clustered via k-means applied to the DF feature space. The cluster number used is based on trial and error, and the unsupervised classification has been implemented here only to highlight morphological similarities. The left four plots represent clusters 1–4 (top to bottom), and the right four plots represent clusters 5–8 (top to bottom). 183
6.6 OER versus ∆m for discovered Kepler O'Connell effect Eclipsing Binaries. This relationship between OER and ∆m was also demonstrated in [McCartney, 1999]. 184
6.7 Distribution of phased–smoothed light curves from the set of discovered LINEAR targets that demonstrate the OEEB signature. LINEAR targets were discovered using the Kepler trained detector. 187
6.8 OER versus ∆m for the discovered OEEB in the LINEAR data set. This relationship between OER and ∆m was also demonstrated in [McCartney, 1999] and is similar to the distribution found in Figure 6.6.
188
7.1 Parameter optimization of the Distribution Field feature space (Left: UCR Data, Right: LINEAR Data). Heat map colors represent misclassification error (dark blue—lower, bright yellow—higher). 226
7.2 Parameter optimization of the SSMM feature space (Left: UCR Data, Right: LINEAR Data). Heat map colors represent misclassification error (dark blue—lower, bright yellow—higher). 226
7.3 DF feature space after ECVA reduction from LINEAR (Contact Binary/blue circles, Algol/red +, RRab/green points, RRc/black squares, Delta Scu/SX Phe/magenta diamonds); off-diagonal plots represent comparisons between two different features, on-diagonal plots represent the distribution of classes within a feature (one dimensional). 228
7.4 SSMM feature space after ECVA reduction from LINEAR (Contact Binary/blue circles, Algol/red +, RRab/green points, RRc/black squares, Delta Scu/SX Phe/magenta diamonds); off-diagonal plots represent comparisons between two different features, on-diagonal plots represent the distribution of classes within a feature (one dimensional). 229
7.5 DF (Left) and SSMM (Right) feature space after ECVA reduction from UCR. Class names (1, 2, and 3) are based on the classes provided by the originating source and the UCR database. 230
8.1 A rough outline of the Variable Star Analysis Library (JVarStar) bundle functional relationships. Notice that the generic math and utility bundles flow down to more specific functional designs such as the clustering algorithms. 238
A.1 Random Forest: mtry = 8, ntree = 100, (Top Left) Pulsating, (Top Right) Erupting, (Bottom Left) Multi-Star, (Bottom Right) Other 287
A.2 SVM: (Top Left) Pulsating, (Top Right) Erupting, (Bottom Left) Multi-Star, (Bottom Right) Other
288
A.3 kNN: (Top Left) Pulsating, (Top Right) Erupting, (Bottom Left) Multi-Star, (Bottom Right) Other 289
A.4 MLP: (Top Left) Pulsating, (Top Right) Erupting, (Bottom Left) Multi-Star, (Bottom Right) Other 290
A.5 MLP: Individual Classification, Performance Analysis 291
A.6 kNN: Individual Classification, Performance Analysis 291
B.1 Classifier Optimization for UCR Data 293
B.2 Classifier Optimization for LINEAR Data 294

List of Tables
> 1% of the total return set) are in bold 120
4.10 Precision Rate Estimates Per Class Type (in fractions), Bolded Classes are those with Precision <
80% 122

5.1 Distribution of LINEAR Data across Classes 147
5.2 Misclassification Rates of Feature Spaces from Testing Data 151
6.1 Collection of KIC of Interest (30 Total) 168
6.2 Collection of KIC Not of Interest (121 Total) 168
6.3 Metric Measurements from the Discovered O'Connell Effect Eclipsing Binaries from the Kepler Data Set 185
6.4 Discovered OEEBs from LINEAR 186
6.5 Comparison of Performance Estimates across the Proposed Classifiers (Based on Testing Data) 190
7.1 The cross-validation process for LM L tunable values 230
7.2 LINEAR confusion matrix via LM L; entries are counts (percent) 230
7.3 UCR confusion matrix via LM L; entries are counts (percent) 231
7.4 The cross-validation process for LM L-MV tunable values 231
7.5 LINEAR confusion matrix via LM L-MV; entries are counts (percent) 232
7.6 UCR confusion matrix via LM L-MV; entries are counts (percent) 232
7.7 F1-Score metrics for the proposed classifiers with respect to LINEAR and UCR datasets 234
B.1 Confusion Matrix for Classifiers Based on UCR Starlight Data 295
B.2 Confusion Matrix for Classifiers Based on LINEAR Starlight Data 296
C.1 LINEAR confusion matrix, LM L-MV - LM L 297
C.2 UCR confusion matrix, LM L-MV - LM L 298

List of Symbols, Nomenclature, or Abbreviations
Abbreviations

CCD Charge-Coupled Device
SDSS Sloan Digital Sky Survey
2MASS Two Micron All-Sky Survey
DENIS Deep Near Infrared Survey of the Southern Sky
VISTA Visible and Infrared Survey Telescope for Astronomy
ASAS All Sky Automated Survey
LINEAR Lincoln Near-Earth Asteroid Research
Pan-STARRS Panoramic Survey Telescope and Rapid Response System
LSST Large Synoptic Survey Telescope
AAVSO American Association of Variable Star Observers
IR Infrared
RF Radio Frequency
L-S Lomb-Scargle Periodogram
LSC LINEAR Supervised Classification
FFT Fast Fourier Transform
SSMM Slotted Symbolic Markov Modeling
OCD O'Connell Effect Detector using Push-Pull Learning
DF Distribution Fields
ADASS Astronomical Data Analysis Software and Systems
OGLE Optical Gravitational Lensing Experiment
CoRoT Convection, Rotation and planetary Transits
SN Supernovae
DSP Digital Signal Processing
DFT Discrete Fourier Transformation
k-NN k Nearest Neighbor
PWC Parzen Window Classifier
CART Classification and Regression Tree
RBF-NN Radial Basis Function Neural Network
MLP Multi-Layer Perceptron
SLP Single-Layer Perceptron
RF Random Forest
SVD Singular Value Decomposition
LMNN Large Margin Nearest Neighbors
PCA Principal Component Analysis
MMC Mahalanobis Metric Learning with Application for Clustering
NCA Neighborhood Components Analysis
ITML Information-Theoretic Metric Learning
S&J Schultz & Joachims Learning
EWMA Exponentially Weighted Moving Average
CCA Canonical Correlation Analysis
EDA Exploratory Data Analysis
SVM Support Vector Machines
KSVM Kernel Support Vector Machines
ROC Receiver Operating Characteristic Curve
PR Precision-Recall Curve
AUC Area Under the Curve
fp False Positive
tp True Positive
OC-SVM One Class Support Vector Machines
OC-PWC One Class Parzen Window Classifier
ANOVA Analysis of Variance
PAA Piecewise Aggregation Approximation
PLS2 Partial Least Squares
UCR University of California Riverside
CDF Cumulative Distribution Function
QDA Quadratic Discriminant Analysis
ECVA Extended Canonical
Variate Analysis
CVA Canonical Variate Analysis
DWT Discrete Wavelet Transformation
KELT Kilodegree Extremely Little Telescope
OEEBs O'Connell Effect Eclipsing Binaries
OER O'Connell Effect Ratio
LCA Light Curve Asymmetry
SNR Signal to Noise Ratio
slf4j Simple Logging Facade for Java
GCVS General Catalog of Variable Stars

Acknowledgements
The author is grateful for the valuable machine learning discussions with S. Wiechecki Vergara and R. Haber. Initial astroinformatics interest was provided by H. Oluseyi. Editing and review were provided by G. Langhenry and H. Monteith. Custom scientific graphics were provided by S. Herndon. Inspiration provided by C. Role.

Research was partially supported by Perspecta, Inc. This material is based upon work supported by the NASA Florida Space Grant under a 2018 Dissertation And Thesis Improvement Fellowship (No. 202379). The LINEAR program is sponsored by the National Aeronautics and Space Administration (NRA Nos. NNH09ZDA001N, 09-NEOO09-0010) and the United States Air Force under Air Force Contract FA8721-05-C-0002. This material is based upon work supported by the National Science Foundation under Grant No. CNS 09-23050. This research has made use of NASA's Astrophysics Data System.

Dedication
I would like to thank all of those who supported me in my pursuit of this work and this goal.

First, to my loving wife Caroline and son Edison, who have been eternally patient, supportive, and loving with me over all these years. Second, to my parents and family members who have been with me time and time again, helping me along this journey; thank you for your love and support.

Thanks to Dr. Saida Caballero-Nieves, Dr. Véronique Petit, and Dr. Adrian Peter for the support and encouragement. Thanks also to Dr. Stephen Wiechecki Vergara and Dr. Kevin Hutchenson, who encouraged me to start this journey to begin with.

Thanks also to Dr. Cori Fletcher and Dr. Trisha Doyle, the best graduate school friends/support system I could have asked for. To Dr. Nicole Silvestri, the first astronomer I ever worked for, I dedicate Appendix D of this work.

To those reading, this work was meant to support yours; use it, expand on it, learn from it.

The Cloths of Heaven

"Had I the heavens' embroidered cloths,
Enwrought with golden and silver light,
The blue and the dim and the dark cloths
Of night and light and the half-light,
I would spread the cloths under your feet:
But I, being poor, have only my dreams;
I have spread my dreams under your feet;
Tread softly because you tread on my dreams."
– W. B. Yeats

Chapter 1

Introduction
With the advent of digital astronomy, new benefits and new challenges have been presented to the modern-day astronomer. While data are captured in a more efficient and accurate manner using digital means, the efficiency of data retrieval has led to an overload of scientific data for processing and storage. That means that more stars, in more detail, are captured per night; but increasing data capture begets exponentially increasing data processing. Database management, digital signal processing, automated image reduction, and statistical analysis of data have all made their way to the forefront of tools for modern astronomy. This cross-disciplinary approach of leveraging statistical analysis and data mining methods to analyze astronomical data is often referred to as astrostatistics or astroinformatics.

The data captured by the modern astronomer can take on many forms but fall into two basic categories: observed, that is, resulting from the physical detection of either particles or waves emanating from an astrophysical source, or simulated, that is, resulting from computation or synthetic representation of a hypothetical astrophysical system. Observed data can be from almost any point on the electromagnetic spectrum (photons), can be any emanation of particles (e.g., neutrinos), and, more recently, can even result from more exotic emanations, such as gravitational waves. Within photon detection, we can further break down the categories into specific techniques: imaging, photometry, spectroscopy, polarimetry, and so on. These methods further analyze the character of the emanation, be it the rate (flux) at which the photons are emitted, the energy of the individual photons, the relative distribution of photons emitted, or the phase orientation of the photons.
Each emanation type–method pairing allows for the inferred measurement of certain associated properties of the astrophysical source of interest. This study focuses on optical time-domain analysis—the measurement of how radiation flux in the visible band changes over time. A star whose observed flux changes measurably over time is a variable star.

The requirements for variability are fairly straightforward: the change in flux over time must exceed in amplitude any other variations from contributions that are not the source and must change at a rate that is large enough to be statistically noticeable. Thus a star might vary, but the amplitude of the variation might be negligible compared to background noise variation observed along with the star, or the resolution of the sensor (sensitivity) might be lower than what is needed to discern a change in amplitude, and therefore no variation would have been observed. Alternatively, a star might vary, but the change as a function of time might be so long as to be unnoticeable in the local time frame or too short compared to the sample rate of the observations, causing the star's amplitude to appear constant. To some degree, all stars have a flux output that is variable; much of what we categorize as "variability" is dependent on the equipment we are using to make observations, the received energy, and the structure of the variation itself. It is precisely the dependence on equipment to detect variability that results in the pressing need for more advanced methods for time domain analysis.
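The detectability conditions described here (amplitude versus noise, variation rate versus sampling) can be made concrete with a small numerical sketch. This is an illustrative aside, not a method from this work; the function name, the chi-square test, and the threshold are all assumptions.

```python
# Hedged sketch (not this work's method): a star is flagged as "variable"
# only when its flux scatter is statistically separable from measurement
# noise. All names and thresholds here are illustrative.
import numpy as np

def is_variability_detectable(fluxes, flux_err, threshold=5.0):
    """Reduced chi-square of a constant-flux model.

    A star whose scatter is consistent with noise alone (reduced
    chi-square near 1) appears constant, even if it physically varies
    below the sensor's sensitivity.
    """
    weights = 1.0 / flux_err**2
    mean_flux = np.average(fluxes, weights=weights)
    chi2 = np.sum(((fluxes - mean_flux) / flux_err) ** 2)
    return chi2 / (fluxes.size - 1) > threshold

rng = np.random.default_rng(42)
t = np.linspace(0.0, 10.0, 200)   # 10 days, regular sampling
sigma = 0.01                      # 1% photometric error
err = np.full(t.size, sigma)

# A constant star and a 5%-amplitude sinusoidal variable (0.5-day period)
quiet = 1.0 + rng.normal(0.0, sigma, t.size)
loud = 1.0 + 0.05 * np.sin(2 * np.pi * t / 0.5) + rng.normal(0.0, sigma, t.size)

print(is_variability_detectable(quiet, err))  # False: scatter ~ noise
print(is_variability_detectable(loud, err))   # True: scatter >> noise
```

Note that the same physical variable star would fail this test if the photometric error grew to the size of its amplitude, which is exactly the equipment dependence described above.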
It is no surprise that one of the first variable stars, Omicron Ceti (Mira), was both bright and varied slowly; this allowed for discovery by eye of the variation [Wilk, 1996]. As more advanced methods of detection have become common, stars that vary faster, are dimmer, or have a smaller amplitude variation are all now discoverable. The improvement to detector efficiency allowed the necessary exposure rate to decrease while leaving the signal-to-noise ratio of the observation of the same star unchanged, thus increasing the sampling rate of the observation. Similarly, the economics of astronomical observations have become favorable to increasing the sampling rate, be it through the decreasing cost of detectors (CCDs) or the ability to create detectors of increasing size. Even telescope automation has had a hand in increasing the sampling rate, allowing for more sky to be observed, more frequently, without the added expense of having a human in the loop. Between being able to increase the sampling rate for all observations, the increase in image size, and the prevalence of larger detector optics, the increase in astronomical wide-field survey projects has resulted in an exponential increase in the number of stellar observations in general.

Surveys such as the Two Micron All-Sky Survey [2MASS, Cutri et al., 2003], the Deep Near Infrared Survey of the Southern Sky [DENIS, Epchtein et al., 1997], the Visible and Infrared Survey Telescope for Astronomy [VISTA, Sutherland et al., 2015], and the Sloan Digital Sky Survey [SDSS, Abazajian et al., 2009] attempt to image (photometry) a wide field in their respective frequency regions. While some surveys are designed for photometric depth, others attempt to observe variability, either in position or in brightness.
The All Sky Automated Survey (ASAS) was designed to measure brightness changes specifically, while surveys like the Lincoln Near-Earth Asteroid Research (LINEAR) survey and the Catalina Sky Survey were originally designed to be used for near-Earth object tracking but have since been exploited for stellar variability detection. This all-sky, time-domain data collection, in its extreme, can be best demonstrated by two surveys: the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) and the Large Synoptic Survey Telescope (LSST):

• Pan-STARRS is designed as a near-Earth asteroid detection survey. Each image taken requires approximately 2 GB of storage, with exposure times between 30 and 60 s and with an additional minute or so used for computer processing. Since images are taken on a continuous basis, the total data collection is roughly 10 TB from the PS4 telescopes every night. The very large field of view of the telescope and the short exposure times enable approximately 6000 square degrees of sky to be imaged every night. Roughly the entire observable sky can be imaged in a period of 40 hours (or approximately 10 hours per night over 4 days). Given the need to avoid times when the Moon is bright, this means that an area equivalent to the entire sky is surveyed four times a month.
• LSST is designed specifically for astronomical observation and has a 3.2-gigapixel (GP) prime focus digital camera that will take a 15 s exposure every 20 s. Repointing such a large telescope (including settling time) within 5 s requires an exceptionally short and stiff structure. This in turn implies a very small f-number, which requires very precise focusing of the camera. The focal plane will be 64 cm in diameter and will include 189 CCD detectors, each of 16 megapixels (MP). Allowing for maintenance, bad weather, and other contingencies, the camera is expected to take more than 200,000 pictures (1.28 PB uncompressed) per year.

It is apparent from the estimates of per-year/per-night output from two of the most modern all-sky surveys that the rate of data output vastly exceeds the rate at which astronomical analysis can be performed by hand. Seeing as there is no indication that this trend of big data will change anytime soon, automated processing is a necessity if meaningful use is to be made of the data. This includes automated reduction (magnitudes, images, colors, etc.), association (matching objects across surveys), and interpretation of observed objects.

Beyond automated reduction being necessary for processing such a large data influx, automated reduction also lends itself to being more consistent and deterministic than having a human in the loop. Likewise, increasing scale requires improved error reduction for analysis methods to be justifiably efficient.
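The LSST figures quoted above can be sanity-checked with back-of-envelope arithmetic. The 2 bytes/pixel readout depth below is an assumption (16-bit raw images) introduced here for illustration; it happens to reproduce the 1.28 PB/year estimate exactly.

```python
# A back-of-envelope check (not from this work) of the LSST data-rate
# figures: a 3.2-gigapixel camera at an assumed 2 bytes/pixel, over
# ~200,000 exposures per year, gives ~1.28 PB of raw, uncompressed images.
pixels = 3.2e9                # 3.2 GP focal plane
bytes_per_pixel = 2           # 16-bit raw readout (assumption)
images_per_year = 200_000

bytes_per_image = pixels * bytes_per_pixel
petabytes_per_year = bytes_per_image * images_per_year / 1e15

print(f"{bytes_per_image / 1e9:.1f} GB per image")   # 6.4 GB per image
print(f"{petabytes_per_year:.2f} PB per year")       # 1.28 PB per year

# Cadence: one 15 s exposure every 20 s over ~10 usable hours/night gives
# ~1800 exposures per night before weather and maintenance losses.
exposures_per_night = 10 * 3600 // 20
print(exposures_per_night)                           # 1800
```

The cadence estimate shows why the 200,000 images/year figure already folds in substantial downtime: 1800 exposures/night over 365 nights would otherwise exceed 650,000.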
Our project here focuses on the automated categorization of variable stars and the ability to determine what kind of variable we are observing based on its time-domain signal and a prelabeled set of comparison data. This was once a manual process: given a light curve, an expert in the field of stellar variability would compare, by eye, the data to known data sources and determine an estimated label. This sort of one-by-one analysis is still done as part of the development of training data for many of the supervised classification efforts currently in development, and for all of the data used in this discussion.
In general, the combination of signal processing and classification results in the generic linear flow shown in Figure 1.1.

Figure 1.1: A standard system engineering design (flow chart) of an automated remote detection algorithm

The individual measurements of flux over time constitute the signal observed. Some form of signal conditioning can be applied (but need not be) to the signal; this is often an attempt to normalize the waveform or to reduce the noise via filtering. The measurements made can be transformed or mapped to new representations, called features, which are optimal or of interest for a given goal. For consistency, we make the following definitions:

• The astrophysical object being measured is the source.
• The tool performing the measuring is the receiver.

• A path was taken by the photons to get from the source to the receiver.

• The signal is the measurement of the flux, and that signal, measured over time, results in the time-domain waveform.

• The collection of measurements operated on is referred to here as the feature space (e.g., Fourier domain, wavelet domain, time-domain statistics, photometry, etc.).

Much of the initial effort is the disentanglement of the three functions (source, path, sensor) from one another to understand the source function. These challenges, however, are not unique to astronomy. This effort focuses on addressing four specific issues: class space definition, incomplete measurements, continuous signal and secondary information, and performance evaluation.

While the AAVSO keeps a catalog of variable star types [Samus' et al., 2017], this listing is dynamic. Variable star classifications have been added, removed, and updated since its inception [Percy, 2007]. These changes have included the discovery of new variable stars and new variable types, the determination that two classes are the same class, and the determination that one class is two different classes. This was the case with Delta Scuti stars [Sterken and Jaschek, 2005]: originally classified as RR Lyrae subtype RRs, they were eventually identified as their own class. Furthermore, a set of high-frequency Delta Scuti stars that were found to be metal poor were subdivided into an additional class, now called SX Phoenicis stars.

We present Figure 1.2 for a high-level view of variability types: some variability classes are uniquely different, resulting from different physical processes (Beta Cep vs. eclipsing binaries, instability vs. occultation), while some variability classes result from the same physical process but with different periods. For example, Cepheids and SX Phoenicis stars both result from pulsations caused by He ionization, but with different periods resulting from differences in temperature and metallicity.
Furthermore, some variable classes result from similar physical processes but different underlying causes. For example, both Beta Cep and Cepheid variables pulsate as a result of instability, but one from Fe ionization and the other from He ionization (respectively).

Figure 1.2: Variability in the Hertzsprung-Russell (HR) Diagram. A pulsation HR Diagram showing many classes of pulsating stars for which asteroseismology is possible [Kurtz, 2005].

In short, the definitions of variability are based on a mix of physical underlying parameter boundaries that have been empirically set by experts and on similar time-domain features. This results in ambiguity that can complicate the variable star classification problem. The expert definitions have an additional effect on the construction of the classifier: biases from observations affect the sampling of each individual class. For example, brighter variables are more likely to be observed and therefore, without additional filtering, will be over-represented compared to other variables. This will result in a class imbalance [Longadge and Dongre, 2013] and will ultimately degrade the performance of the classifier, especially in the case of multiclass classifiers [see Johnston and Oluseyi, 2017]. How we evaluate the performance of classifiers, and compare classifiers knowing there is a class imbalance, is of major concern.

Stellar variable time series data can roughly be described as passively observed time series snippets, extracted from what is a contiguous signal over multiple nights or sets of observations. The continuous nature of the time series provides both complications and opportunities for time series analysis. The time series signature has the potential to change over time, and new observations mean increased opportunity for an unstable signature over the long term. If the time signature does not change, then new observations will result in additive information that will be used to further define the signature function associated with the class.
Implementing a methodology that will address both issues, the potential for change and the potential for additional information, would be beneficial. If the sampling were regular (and continuous), short-time Fourier transforms (spectrograms) or periodograms would be ideal, as they are direct transforms from the time domain to the frequency domain. The data analyzed cannot necessarily be represented (perfectly) in Fourier space, and while the wavelet version of the spectrogram, the scalogram [Rioul and Vetterli, 1991, Szatmary et al., 1994, Bolós and Benítez, 2014], could be used, the data are often irregularly sampled, further complicating the analysis. Methods for obtaining regularly spaced samples from irregular samples are known [Broersen, 2009, Rehfeld and Kurths, 2014]; however, these methods have unforeseen effects on the frequency-domain signature being extracted, thereby corrupting the signature pattern.

Astronomical time series data are also frequently irregular; that is, there is no associated fixed ∆t over the whole of the data that is consistent with the observation. Even when there is a consistent observation rate, this rate is often broken up because of a given observational plan, daylight interference, or weather-related constraints. Whatever classification method is used must be able to handle various irregular sampling rates and observational dropouts, without introducing biases and artifacts into the derived feature space that will be used for classification. Most analysis methods require regularized samples.
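Of the frequency-analysis routes just mentioned, the Lomb–Scargle periodogram is the most common for irregularly sampled variable star data. A minimal classical implementation (a NumPy sketch, not the code used in this work) shows how the dominant frequency can be recovered without any resampling:

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle periodogram for irregularly sampled data."""
    y = y - y.mean()
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        # time offset tau makes the estimate invariant to time shifts
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c, s = np.cos(w * (t - tau)), np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c)**2 / (c @ c) + (y @ s)**2 / (s @ s))
    return power

# Irregularly sampled sinusoid: 0.7 cycles/day, no fixed sample spacing
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 30.0, 120))   # 120 random epochs over 30 days
y = np.sin(2 * np.pi * 0.7 * t) + 0.1 * rng.standard_normal(120)

freqs = np.linspace(0.05, 2.0, 2000)
best = freqs[np.argmax(lomb_scargle(t, y, freqs))]
print(f"recovered frequency: {best:.3f} cycles/day")  # close to 0.7
```

For production work, `astropy.timeseries.LombScargle` provides a fast, well-tested implementation of the same estimator.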
Analysis methods that do not require regular samples either require some form of transformation from irregular to regular sample rates by a defined methodology or apply some assumption about the time-domain function that generated the variation to begin with (such as a Lomb–Scargle periodogram). Irregular sampling solutions [Bos et al., 2002, Broersen, 2009] to address this problem can be defined in one of three ways:

• slotting methods, which model points along the timeline using fuzzy or hard models [Rehfeld et al., 2011, Rehfeld and Kurths, 2014];

• resampling estimators, which use interpolation to generate the missing points and obtain a consistent sample rate;

• Lomb–Scargle periodogram–like estimators, which apply a model or basis function across the time series and maximize the coefficients of the basis function to find an accurate representation of the time series.

Astronomical surveys produce equally astronomical amounts of data. Processing these data is a time-intensive process, and not necessarily a repeatable one; while unsupervised classification will not be discussed as part of this analysis, and though we assume that the initial construction of training data requires a human in the loop, the act of using the training data needs to be automated. Not only does the effort need to leverage automated computational processing, but said processing must have an error rate that is meaningful with respect to the scale and cost of the survey itself.

For example, if we have a survey of 100 stars, 10% of which are of a specific class type (e.g., type A), and we have a false alarm rate of 1%, it can be expected that from the original survey, ∼
11 stars will be selected as being of interest, and of these 11 stars, 1 star is likely to be falsely identified. If a manual identification process takes 30 min to look at each star, then 5.5 hours later, a small team can confirm that there is one star that was inappropriately labeled, and at $7.25 an hour, the cost of reevaluation becomes ∼$40. If the scale of the survey is 1 million objects, then 110,000 objects will be identified as of interest, and of those, 10,000 objects will be false alarms. Using the cost estimates outlined, reviewing all stars identified as of interest, given a standard 2080-hour work year, would require 26.4 years and roughly $400,000 at minimum wage.

In this simplified example, only detection is considered; if the problem is total classification of all sources, the resources necessary to process the work manually become unmanageable. Thus, to resolve what is primarily a resource problem, supervised classification is necessary. Servers are faster and less expensive than humans-in-the-loop, and algorithms are standardizable, manageable, and defensible in terms of their decision-making processes; thus the particular task of consistency and repetition is well suited to automation. Likewise, performance estimates are quantifiable, and often, characteristics such as false alarm and missed detection rates are manageable. We present here an answer to the big data variable star astrophysics problem: a well-documented, high-performance, automated supervised classification algorithm tuned with cost in mind.

This project focuses on the supervised machine learning domain as it applies to astrophysics. This research intends to construct a supervised classification system for variable star observations that is tailored to the unique challenges faced by the astrophysics community.
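The back-of-the-envelope review-cost scaling above is easy to reproduce; the sketch below uses the figures quoted in the text (30 min per star, $7.25/hr, 2080 hr/yr), with the 1% false alarm rate applied to the full survey, as in the 100-star example.

```python
# Reproduce the manual-review cost estimate from the text.
n_objects = 1_000_000
class_fraction = 0.10       # 10% are the class of interest
false_alarm_rate = 0.01     # applied to the full survey, as in the text

hits = n_objects * class_fraction           # 100,000 true objects of interest
false_alarms = n_objects * false_alarm_rate # 10,000 falsely flagged
flagged = hits + false_alarms               # 110,000 objects to review

hours = flagged * 0.5                       # 30 min per star
years = hours / 2080                        # standard 2080-hour work year
cost = hours * 7.25                         # minimum wage

print(f"flagged: {flagged:,.0f}")
print(f"review effort: {years:.1f} person-years, ${cost:,.0f}")
```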
To accomplish this will require a focus on an interdisciplinary approach, split between modern astrophysics and machine learning research (i.e., astrostatistics and astroinformatics).

We will be dealing with cases where the data we are using to train have been hand-reviewed by an expert. We assume that the additional information provided by the expert is correct; likewise, we make the assumption that the data themselves are not defective (e.g., photometry from two different stars listed as one); handling of mislabeled data is possible but is beyond the scope of this specific research effort. Given a new input observation, the algorithm shall generate an estimate of label, that is, of the type of variable star the new observation corresponds to, given training data.

We focus on everything else, following the trend established by the literature. In areas where novel research occurs, the culmination of the efforts results in an associated publication on the topic (and therefore the contribution). This work is organized as follows:

1. We review stellar variability in chapter 2 and signal processing and machine learning in chapter 3. This establishes a baseline of necessary understanding for introducing the other key components.

2. In chapter 4, we discuss system design and performance of an automated classifier.

(a)
Description. We test various industry-standard methods that have been published within the astroinformatics community and provide improved performance analysis estimates geared toward the challenges faced by astronomers (e.g., class imbalance, large class space, low population representation, high variance in class pattern). This includes extending the supervised classification system into an anomaly detection algorithm, to be used in the discovery and identification of new, previously unobserved variable star representations.

(b)
New developments. Our novel contributions to the field included the testing of LINEAR data against multiple classifiers, one-vs.-all/multiclass classification performance comparison, detector performance quantification via ROC-AUC and PR-AUC, and the application of detectors to unlabeled LINEAR data.

(c)
Article. The work in this chapter was published in Johnston, K. B., & Oluseyi, H. M. (2017). Generation of a supervised classification algorithm for time-series variable stars with an application to the LINEAR dataset.
New Astronomy, 52, 35–47.

(d)
LSC. The code developed in this chapter was published in Johnston [2018]. LSC (LINEAR Supervised Classification) trains a number of classifiers, including random forest and K-nearest neighbor, to classify variable stars and compares the results to determine which classifier is most successful. Written in R, the package includes anomaly detection code for testing the application of the selected classifier to new data, thus enabling the creation of highly reliable data sets of classified variable stars.

3. In chapter 5, we discuss novel feature space implementation.

(a) Description. Looking beyond traditional time series feature extraction methodologies (FFT and light curve folding), we instead focus on time-invariant feature spaces and novel digital signal processing methods tailored to variable star observations. Ideally, these are feature spaces that are easily implemented for various survey conditions, rapid to compute given various sizes of observations, and easy to optimize in terms of the linear separability of the identified variable star class space.

(b)
New developments. Our novel contributions to the field included the SSMM, as well as the application of SSMM to LINEAR and UCR data.

(c)
Article. The work in this chapter was published in Johnston, K. B., & Peter, A. M. (2017). Variable star signature classification using slotted symbolic Markov modeling.
New Astronomy, 50, 1–11. In addition, the work here was presented as the poster Johnston, K. B., & Peter, A. M. (2016). Variable star signature classification using slotted symbolic Markov modeling. Presented at AAS 227, Kissimmee, FL.
(LSC code: https://github.com/kjohnston82/LINEARSupervisedClassification)

(d)
SSMM. The code developed in this chapter was published in Johnston and Peter [2018]. SSMM (Slotted Symbolic Markov Modeling) reduces time-domain stellar variable observations to classify stellar variables. The method can be applied to both folded and unfolded data and does not require time warping for waveform alignment. Written in MATLAB, the performance of the supervised classification code is quantifiable and consistent, and the rate at which new data are processed is dependent only on the computational processing power available.

4. In chapter 6, we discuss a detector for O'Connell-type eclipsing binaries (using metric learning and DF features).

(a) New developments. Our novel contributions to the field included the development of an O'Connell-type eclipsing binary detector based on Kepler data and the detection of new O'Connell-type eclipsing binaries to be used in defining the variable star category (from the LINEAR and Kepler datasets).

(b)
Article. The work in this chapter is to be published as Johnston, K. B., et al. (2019). A detection metric designed for O'Connell effect eclipsing binaries.
Computational Astrophysics and Cosmology. In addition, the work here was presented as the poster Johnston, K. B., et al. (2018).
Learning a novel detection metric for the detection of O'Connell effect eclipsing binaries. Presented at AAS 231, National Harbor, MD.
(SSMM code: https://github.com/kjohnston82/SSMM)

(c) OCD. The code developed in this chapter was published in "O'Connell Effect Detector using Push-Pull Learning," Johnston, Kyle B.; Haber, Rana. OCD (O'Connell Effect Detector using Push-Pull Learning) detects eclipsing binaries that demonstrate the O'Connell effect. This time-domain signature extraction methodology uses a supporting supervised pattern detection algorithm. The methodology maps stellar variable observations (time-domain data) to a new representation known as distribution fields (DF), the properties of which enable efficient handling of issues such as irregular sampling and multiple values per time instance. Using this representation, the code applies a metric learning technique directly on the DF space capable of specifically identifying the stars of interest; the metric is tuned on a set of labeled eclipsing binary data from the Kepler survey, targeting particular systems exhibiting the O'Connell effect. This code is useful for large-scale data volumes such as those expected from next-generation telescopes such as LSST.
(OCD code: https://github.com/kjohnston82/OCDetector)

5. In chapter 7, we discuss multi-view classification of variable stars using metric learning.

(a) Description. The processing and analysis of time series features via advanced classification means, introducing to the astrostatistics community improved methods beyond the current standard.

(b) New developments. Our novel contributions to the field included the development of a large-margin multi-view metric learning for matrix variates, applying Barzilai and Borwein [1988] to matrix data.

(c)
Article. The work in this chapter is to be published as Johnston, K. B., et al. (2019). Variable star classification using multi-view metric learning.
Monthly Notices of the Royal Astronomical Society. In addition, the work here was presented as the poster Johnston, K. B., et al. (2018).
Variable star classification using multi-view metric learning. Presented at ADASS Conference XXVIII, College Park, MD.

(d)
JVarStar. The code developed in this chapter was published in Johnston et al. [2019]. It contains Java translations of code designed specifically for the analysis and supervised classification of variable stars.
(code: https://github.com/kjohnston82/VariableStarAnalysis)

We should note that much of the inspiration for this research comes from the dissertation of Debosscher [2009], "Automated Classification of Variable Stars: Application to the OGLE and CoRoT Databases"; this PhD dissertation has provided many of the current astroinformatics efforts with a baseline methodology and performance. Standard methods (classifiers) were implemented and compared based on standard feature spaces (specifically, frequency domain and "expert"-identified features). While we note (specifically in Johnston and Oluseyi [2017]) that there are some gaps in the implementation of the supervised classification design, overall, the template Debosscher proposes is sound. We seek here to extend much of the research Debosscher outlined as well as the associated papers resulting from it.

Chapter 2
Variable Stars

The goal for this effort is the classification of variable stars via analysis of raw photometric light curve data. For an informative breakdown of different types of variable stars, see Eyer and Blake [2005]; they categorize variability into extrinsic (something else is causing the variability) and intrinsic (the object is the source of the variability). Extrinsic sources include both asteroids (reflecting light) and stars (eclipsing, rotation/star spots, and the result of microlensing). Intrinsic sources include both active galactic nuclei (radio quiet and radio loud) and stars (eclipsing, eruptive, cataclysmic, pulsation, and secular).

This outline is helpful when considering what types of features are useful for differentiation between class types.
For example, most of the types listed under "cataclysmic" are going to be impulsive, and stars in the pulsating category will have repetitive signatures that may or may not have a consistent frequency or even amplitude modulation. What is required for a comprehensive classification design is the selection of diverse features that can provide utility for a variety of targets of interest. The "tree of variability" attempts to categorize the stellar variable types based on the physical cause of variation [Figure 2.1; Eyer and Blake, 2005, Gaia Collaboration et al., 2019].

Figure 2.1: A graphical representation of the "Tree of Variability", i.e., the relationship between various astronomical variable types with respect to their physical cause [Eyer and Blake, 2005]

We can also categorize the stellar variables from a signal point of view as a function of their signal type (see Figure 2.2), including the following:

• stationary processes: variation that is random but "stable" (the statistics of the time-domain signal are consistent at any point in time, e.g., the Sun);

• cyclostationary processes: the signal varies cyclically in time (e.g., RR Lyr, eclipsing binary/EB);

• impulsive processes: signal variation is a sudden increase (or decrease) in signal, but the change does not cyclically occur (e.g., supernova);

• nonstationary processes: a change of underlying time-domain statistics that is neither impulsive nor cyclic (e.g., Be star).

Figure 2.2: Statistical Signal Examples: Nonstationary Signal (blue), Stationary Signal (green), Cyclostationary (red), and Impulsive (orange) [Nun et al., 2014]

While our interest is in those stars that have cyclostationary signal processes, our analysis is applicable to the other processes as well (supernovae/SN, for example). These definitions are local definitions, that is, based on the data observed.
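These four signal categories are easy to illustrate with synthetic series, in the spirit of Figure 2.2 (a sketch with made-up parameters, not the survey data itself):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 1000)

# stationary: random variation with time-invariant statistics
stationary = 0.1 * rng.standard_normal(t.size)

# cyclostationary: statistics repeat with a fixed period (e.g., RR Lyr, EB)
cyclostationary = np.sin(2 * np.pi * 0.5 * t) + 0.1 * rng.standard_normal(t.size)

# impulsive: a sudden, non-repeating brightening (e.g., supernova)
impulsive = np.where(t > 5.0, np.exp(-(t - 5.0)), 0.0)

# nonstationary: statistics drift with time (e.g., Be star)
nonstationary = (0.05 * t) * rng.standard_normal(t.size)

# the stationary series has the same variance early and late,
# while the nonstationary series' variance grows with time
early, late = slice(0, 300), slice(700, 1000)
ratio = np.var(stationary[early]) / np.var(stationary[late])
grows = np.var(nonstationary[late]) > np.var(nonstationary[early])
print(f"stationary early/late variance ratio: {ratio:.2f}")  # ~1
print(f"nonstationary variance grows: {grows}")
```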
These are real, physical, evolving sources, and the change from one type to a different type is always possible given changes in the physical processes of the source, or just given new observations. A valid question, however, with regard to the identification, and specifically the grouping, of different types of variable stars is: why do we care? One reason is the general identification of types; being able to say "these two are similar and these other two are dissimilar" has its own utility in gaining a deeper understanding of the universe. In addition, the variability of some stars has been found to be directly linked to other physical parameters (luminosity, stellar mass, rotation rate, etc.). This has been shown for cases such as eclipsing binaries, RR Lyrae, SN, and Beta Cephei. In these cases, the variation of luminosity observed can be tied back to some extrinsic or intrinsic value. Many variations result from changes to the star's atmosphere, what we will call "source variation." If we narrow down the tree of variability to just stars and just the most likely causes, we can highlight the following cases:

• eruptive: Eruptive variables are stars varying in brightness because of violent processes and flares occurring in their chromospheres and coronae. The light changes are usually accompanied by shell events or mass outflow in the form of stellar winds of variable intensity and/or by interaction with the surrounding interstellar medium.

• pulsating: Pulsating variables are stars showing periodic expansion and contraction of their surface layers. The pulsation may be radial or nonradial. A radially pulsating star remains spherical in shape, while in the case of nonradial pulsations, the star's shape periodically deviates from a sphere, and even neighboring zones of its surface may have opposite pulsation phases.
• rotation: For variable stars with nonuniform surface brightness and/or ellipsoidal shape whose variability is caused by axial rotation with respect to the observer, the nonuniformity of the surface brightness distribution may be caused by the presence of spots. Nonuniformity could also be caused by some thermal or chemical inhomogeneity of the atmosphere caused by a magnetic field whose axis is not coincident with the rotation axis.

• cataclysmic (explosive and novalike): This includes variable stars showing outbursts caused by thermonuclear burst processes in their surface layers or deep in their interiors.

• eclipsing: Some variations result from how the light gets from the star to the observer, that is, dimming resulting from the light being absorbed or occulted by another object, what we will call "path variation." Eclipsing binaries and eclipsing planetary systems are two types of variables that fall into this category and that are addressed in this study.

We will not discuss in additional detail all variable star subgroups, but we address here those variable types that are common to our studies, specifically eclipsing binaries and pulsating variables. Much of our decision to select specific variable types has been influenced by representation in surveys. The construction of a classifier when only a few (sometimes one) training data points are available is difficult at best and ill advised or wrong at worst.

The multiplicity of stars is fairly well studied [Mathieu, 1994, Duchêne and Kraus, 2013]; a high number of stars exists in binary or greater systems. The statistical distribution of binary orbital separation is a current topic of survey research [Poveda et al., 2007], but an approximation of the distribution for main sequence visual binaries as the log of the semi-major axis, based on Öpik [1924], is often cited.
This assumed distribution is a gross approximation, and variations exist with respect to the star-forming region that generated the binary and the stage of life that the binary is in.

A number of methods can be used to determine if a target star is in a multiple system, radial velocity measurements being one of the main ones. Alternatively, a companion star can be detected if the target star exhibits variability resulting from occultation, that is, the cooler of the pair (the companion) eclipsing the hotter of the pair (the primary). This will result in a cyclostationary periodic time series curve, and at a given primary period, the star will exhibit decreases in flux caused by the stars going in front of one another. These decreases are dependent on a number of factors: the relative size differences of the two stars, their differing effective temperatures, and the inclination of the binary with respect to the viewer.

If we assume no other photometric variability factors (flares, spots, etc.), we can establish some basic points of interest on the phased light curve of a binary star and will take Figure 2.4 as an example. For this discussion, we will assume the binary has a dimmer and a brighter star and will refer to the brighter star as the primary and the dimmer as the secondary. Out of eclipse, we define the projected areas of the stars to be $A_p$ and $A_s$ for the primary and secondary; likewise, we define the brightnesses of the stars to be $B_p$ and $B_s$. The total observed power radiated is then $\sim \sum A_i B_i$, and so if $x$ is the maximum observed flux, then $x = A_p B_p + A_s B_s$. Likewise, if we call the total received power at first and second minima $y$ and $z$, then $y = (A_p - A_s k_p) B_p + A_s B_s$ and $z = \max(0, A_s - A_p k_s) B_s + A_p B_p$, where $k_p$ and $k_s$ represent some fraction that allows for an incomplete occultation as a result of inclination.
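A quick numerical check of these definitions (a sketch with arbitrary illustrative values for the areas, brightnesses, and occultation fractions):

```python
# Toy two-star eclipse model from the text: x is the out-of-eclipse flux,
# y and z the fluxes at the primary and secondary minima.
A_p, A_s = 1.0, 0.6          # projected areas (arbitrary units)
B_p, B_s = 1.0, 0.4          # surface brightnesses (primary is brighter)
k_p, k_s = 1.0, 1.0          # complete occultation (edge-on system)

x = A_p * B_p + A_s * B_s                          # maximum observed flux
y = (A_p - A_s * k_p) * B_p + A_s * B_s            # secondary blocks primary
z = max(0.0, A_s - A_p * k_s) * B_s + A_p * B_p    # primary blocks secondary

# the primary minimum (y) is deeper, since the brighter star is blocked
print(f"x = {x:.2f}, y = {y:.2f}, z = {z:.2f}")

# the primary eclipse depth reduces to A_s * k_p * B_p
print(f"x - y = {x - y:.2f} = A_s*k_p*B_p = {A_s * k_p * B_p:.2f}")
```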
We can relate the differences in flux observed to physical relationships between the two stars in the binary via

$x - z = A_s B_s - \max(0, A_s - A_p k_s) B_s$   (2.1)

and

$x - y = A_s k_p B_p$;   (2.2)

thus we can define relationships between the relative depths of the eclipses and the physical properties (surface area and luminosity). This is what makes eclipsing binaries so interesting: they provide insight into the components of the system. Assuming a $k$ of 1, the transition from maximum to minimum is a linear relationship. There are of course a myriad of ways, physically, that these assumptions could go wrong: partial eclipses (as discussed), limb darkening (from the stellar atmosphere), and other atmospheric effects (flares, reflection, spots, etc.) can all affect the relationships defined here and must be taken into consideration prior to analysis of the system properties.

Using limb darkening as an example: given a star with a meaningful stellar atmosphere, it is known that the intensity with respect to an observer is not uniform over the observed surface. This can be shown for the simple case of a semi-infinite atmosphere from which radiation emerges. Following Hubeny and Mihalas [2014], the emergent intensity for a given frequency can be written as Equation 2.3,

$I_\nu(0, \mu) = \int_0^\infty S_\nu(t_\nu) \exp(-t_\nu/\mu)\, dt_\nu/\mu$,   (2.3)

where $\mu$ is the angle of incidence ($\mu = \cos\theta$), $\nu$ is the frequency of the light, $S_\nu$ is the source function of the atmosphere, and $t_\nu$ is the optical depth. Assuming a grey atmosphere (opacity that does not vary with frequency) that is also in radiative equilibrium, Hubeny and Mihalas [2014] show (their equation 17.4) that $S(\tau) = J(\tau)$; the source function equals the mean intensity. Using Hubeny and Mihalas [2014] equation 17.14, i.e.,
the Eddington approximation,

$J(\tau) = 3H\left(\tau + \tfrac{2}{3}\right)$,   (2.4)

equation 2.3 becomes

$I(0, \mu) = 3H\left(\mu + \tfrac{2}{3}\right)$,   (2.5)

and the limb-darkening function (the ratio of the intensity of the star with respect to the intensity at the center of the star) can be shown to be

$I(0, \mu)/I(0, 1) = \tfrac{3}{5}\left(\mu + \tfrac{2}{3}\right)$.   (2.6)

Given our prior case of $k = 1$, i.e., in-plane occultation, we can see that the decrease in intensity will no longer be a linear function but will instead depend on $\mu = \cos\theta$, resulting in curvature in the light curve.

The general catalog of variable stars [GCVS, Samus' et al., 2017] identifies three traditional categories of eclipsing binaries: EA, EB, and EW. These represent a gradient of how close the binaries are and, more importantly, how the Roche lobes have been filled, thus affecting the light curve shape. Roche lobes are defined as the region around a star in which stellar mass is still gravitationally bound. When stars expand, they can reach sizes that exceed this limit, resulting in matter, in the binary case, transferring from one star to the other. Additionally, RS Canum is also identified as a subtype but will not be addressed here.

β Persei (EA, Detached Binaries)
Like many variable star types, these are named after their prototype, Algol (β Persei). Algol's variability has been known since the late 1700s; it represents the set of binaries known as semidetached eclipsing variables. Variability is consistent; there is little mass transfer, resulting in very little effect on the orbital parameters. Likewise, both components are usually nearly circular in shape. Figure 2.3 provides an overview of the EA binary example: the top line is a graph of the light curve as well as the expected binary orientation with respect to the observer (represented as the eye); the bottom left and right figures are diagrams showing the expected mass transfer relationship; here, specifically, no mass transfer is expected. This may not be the case when the secondary is much dimmer/cooler than the primary and therefore contributes little to nothing to the overall light curve, in which case the secondary might be highly distorted.

Figure 2.3: Conceptual overview of β Persei—EA type—eclipsing binaries (Detached Binaries). Top figure is a representation of the binary orientation with respect to a viewer and the resulting observed light curve. Bottom left: representation of the positions of the gravitational potential well for the two components in a binary. Bottom right: depicts the Roche lobe envelope of the system.

These systems are not limited to being associated with a given evolutionary stage [Sterken and Jaschek, 2005]; various compositions include two main sequence stars (CM Lac), two evolved components with no Roche lobe overflow (AR Lac), one evolved and one overflowing (RZ Cas), and one evolved and one not evolved (V 1379 Aql). An example is shown in Figure 2.4.

The variable class is identified by the constant maxima as well as by the sharp peak minima. The minima need not be the same depth, though they can be. Periodicity can range from fractions of days to multiple years.

β Lyrae (EB, Semi-Detached Binary)
Interestingly, EB binaries are not a consistent population. Identified by the smooth transition from maximum to minimum, with uneven minima, the causes of these binaries vary. Some of these targets have highly eccentric orbits; others have varying degrees of Roche lobe filling; and others have mismatched evolved pairs.

Figure 2.4: Example β Persei (EA, Detached Binaries) phased time-domain curve; light curve generated from LINEAR data (LINEAR ID: 10013411).

Figure 2.5 provides an overview of the EB binary example: the top line is a graph of the light curve as well as the expected binary orientation with respect to the observer (represented as the eye); the bottom left and right figures are diagrams showing the expected mass transfer relationship; here, specifically, the larger star has filled its Roche lobe and begun to transfer mass to the companion.

Figure 2.5: Conceptual overview of β Lyrae—EB type—eclipsing binaries (Semi-Detached Binaries). Top figure is a representation of the binary orientation with respect to a viewer and the resulting observed light curve. Bottom left: representation of the positions of the gravitational potential well for the two components in a binary. Bottom right: depicts the Roche lobe envelope of the system.

In short, the archaic nature of the classification system has lumped together a number of stars that look similar but on further inspection have a wide variety of associated physical parameters. We will not enumerate examples here of these various similar cases; however, we do provide an example of an EB type binary in Figure 2.6.

Figure 2.6: Example β Lyrae (EB, Semi-Detached Binary) phased time-domain curve; light curve generated from Kepler data (KIC: 02719436)
As opposed to the semidetached binaries, contact binaries are those systems where the components fill their Roche lobes. Mass transfer does occur, and because of their closeness in proximity, their periods are relatively short (∼ …).

The term pulsating variable encompasses a large swath of variable subgroups. For our purposes here, we focus on those stars that lie along the instability strip (Cepheid-like). While stars are normally in hydrostatic equilibrium, there are cases where the radiant energy generated by the star's core is converted into stored energy via ionization in a He shell (usually). The cycle of energy generation, storage, cooling, and heating results in repeated radial expansion and contraction. This instability mechanism in stars is the result of a partial ionization zone; this ionization zone allows for the storage of energy, causing the cyclical process.

We can describe the underlying dynamics of a mass shell as follows [Padmanabhan, 2001]:

m\ddot{r} = 4\pi r^2 P - \frac{GMm}{r^2}. \qquad (2.7)

When in equilibrium, m\ddot{r} = 0 and r = r_{eq}. We can adiabatically perturb the shell about this equilibrium value to describe small changes:

m\frac{d^2(\delta r)}{dt^2} = 4\pi r_{eq}^2 P_{eq}\left(2\frac{\delta r}{r} + \frac{\delta P}{P}\right) + 2\frac{GMm}{r_{eq}^2}\frac{\delta r}{r}; \qquad (2.8)

this defines the radial displacement of the shell. Furthermore, we can substitute the pressure for the density using the adiabatic relationship (\delta P/P) = \gamma(\delta\rho/\rho) = -3\gamma(\delta r/r), where \rho r^3 = const based on mass conservation. Substituting into equation 2.8, we find:

\frac{d^2(\delta r)}{dt^2} = -\frac{GM}{r^3}(3\gamma - 4)\,\delta r = -\omega^2\,\delta r. \qquad (2.9)

Thus the radius of the shell is oscillatory about the equilibrium radius, with angular frequency \omega^2 = (GM/r^3)(3\gamma - 4). More generally, the equation of motion for a shell, written per unit mass in Lagrangian form, is

\ddot{r} = -\frac{Gm}{r^2} - 4\pi r^2\frac{\partial P}{\partial m}.
(2.10)

Similar to how we perturbed the radius, we can perturb the other underlying physical parameters (P and \rho); that is,

P(m,t) = P_0(m) + P_1(m,t) = P_0(m)\left[1 + p(m)e^{i\omega t}\right], \qquad (2.11)

r(m,t) = r_0(m) + r_1(m,t) = r_0(m)\left[1 + x(m)e^{i\omega t}\right], \qquad (2.12)

\rho(m,t) = \rho_0(m) + \rho_1(m,t) = \rho_0(m)\left[1 + d(m)e^{i\omega t}\right]. \qquad (2.13)

Linearizing the perturbation and substituting into equation 2.10, we get

\frac{P_0}{\rho_0}\frac{\partial p}{\partial r} = \omega^2 r_0 x + g_0(p + 4x), \qquad (2.14)

where g_0 = Gm/r_0^2; similarly, via linearization of \partial r/\partial m = 1/(4\pi r^2\rho), we get

r_0\frac{\partial x}{\partial r} = -3x - d. \qquad (2.15)

Given the adiabatic relationship p = \gamma d, we can write equation 2.15 as

p = -3\gamma x - \gamma r_0\frac{\partial x}{\partial r}. \qquad (2.16)

Based on these differential equations, we can establish the governing equation as

\frac{\partial}{\partial r}\left(\gamma\frac{\partial x}{\partial r}\right) + \frac{4}{r}\frac{\partial}{\partial r}(\gamma x) - \frac{\rho_0 g_0}{P_0}\gamma\frac{\partial x}{\partial r} + \frac{\rho_0}{P_0}\left[\frac{g_0}{r}(4 - 3\gamma) + \omega^2\right]x = 0. \qquad (2.17)

Rearranging equation 2.17 and rewriting it in second-order differential equation form, we get equation 2.18:

x'' + \left(\frac{4}{r} - \frac{\rho_0 g_0}{P_0}\right)x' + \frac{\rho_0}{\gamma P_0}\left[\omega^2 + (4 - 3\gamma)\frac{g_0}{r}\right]x = 0. \qquad (2.18)

We can change the independent variable r to z = Ar, define the polytropic equation of state as P = K\rho^{(n+1)/n}, and give the gravitational potential as \Phi (with w = \Phi/\Phi_c, where \Phi_c is the central potential). Therefore

g = \frac{\partial\Phi}{\partial r} = A\,\Phi_c\frac{dw}{dz}, \quad A^2 = \frac{4\pi G}{[(n+1)K]^n}(-\Phi_c)^{n-1}, \quad \rho = \left[\frac{-\Phi_c w}{(n+1)K}\right]^n, \quad \frac{\rho}{P} = K^{-1}\rho^{-1/n} = -\frac{n+1}{\Phi_c w}. \qquad (2.19)

We can substitute the definitions in equation 2.19 into our second-order equation to get

\frac{d^2x}{dz^2} + \left(\frac{4}{z} + \frac{n+1}{w}\frac{dw}{dz}\right)\frac{dx}{dz} + \left[\Omega^2 - \frac{(4 - 3\gamma)(n+1)}{\gamma z}\frac{dw}{dz}\right]\frac{x}{w} = 0, \qquad (2.20)

where \Omega^2 = \frac{n+1}{4\pi G\gamma\rho_c}\omega^2 is a dimensionless frequency for the polytrope and \rho_c is the central density. The period of oscillation can then be given as

\Pi\sqrt{\bar{\rho}} = \left[\frac{(n+1)\pi}{\Omega^2 G\gamma}\,\frac{\bar{\rho}}{\rho_c}\right]^{1/2}. \qquad (2.21)

Thus, for a fixed mode of oscillation, the period is dependent on density.
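As a quick numerical illustration of the dispersion relation in equation 2.9, the sketch below evaluates the fundamental radial frequency for approximate solar values (mass, radius, and an ideal monatomic adiabatic index); the numbers are stand-ins for demonstration only, not values used elsewhere in this work.

```python
import math

# Illustrative check of equation 2.9: omega^2 = (3*gamma - 4) * G * M / r^3,
# using approximate solar values (assumed for demonstration only).
G = 6.674e-11      # gravitational constant [m^3 kg^-1 s^-2]
M = 1.989e30       # shell-enclosed mass [kg], roughly one solar mass
R = 6.957e8        # shell radius [m], roughly one solar radius
gamma = 5.0 / 3.0  # adiabatic index of an ideal monatomic gas

omega_sq = (3.0 * gamma - 4.0) * G * M / R**3
period_hours = 2.0 * math.pi / math.sqrt(omega_sq) / 3600.0
```

For a Sun-like star this gives a fundamental period on the order of a few hours, consistent with the familiar result that the period scales with the inverse square root of the mean density.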
This allows the estimation of relationships between period, luminosity, color, and mass [Iben, 1971, Freedman et al., 1994, Alcock et al., 1998] for radially pulsating stars. Beyond the He instability strip members, there are pulsating groups such as those along the iron ionization region [e.g., Beta Cep, SPB; Iglesias et al., 1987], as well as stars along the white dwarf evolutionary curve that experience pulsations (e.g., DOV, DBV, PNNV). For more information, we recommend Percy [2007].

Figure 2.9: Conceptual representation of the Cepheid instability strip cycle. Top: the cycle of energy absorption, atmospheric heating, atmospheric expansion, and energy release (cyclic process). Bottom: the resulting light curve caused by the cyclic process.

The cyclic, sawtooth light curve (Figure 2.9) makes these pulsating variables fairly easy to identify and to discriminate. Additionally, because the variability is rooted in intrinsic physical parameters, that is, the expansion and contraction of the star, in some cases it is possible to link the periodicity of the star to luminosity and other parameters. Cepheids can be used in this manner and therefore have utility as standard candles.

The Cepheid group of pulsating variables lies along the instability strip (see Figure 1.2). They are yellow supergiant pulsating variables with periods between 1 and 100 days. The group is further divided into populations I and II; pop. I are younger and more massive than the Sun, whereas pop. II are older and less massive. Our effort here has focused on variables with shorter period oscillations; hence Cepheids (I & II) are not included as part of our analysis.
The group of stars referred to as RR Lyr are pulsating variables that lie along the instability strip. They have periods between 0.1 and 1.0 days, with spectral types between A and F. They are further divided into three subtypes: a, b, and c. RR Lyr (a) have mid-range periods, RR Lyr (b) have long periods, and RR Lyr (c) have shorter periods. The RR Lyr (a & b) type stars have very similar light curves; in the phased domain, they are skewed (or asymmetrical) (see Figure 2.10, left), while RR Lyr (c) have a much more symmetrical, nearly sinusoidal shape in the phase domain (see Figure 2.10, right).

Figure 2.10: Example phased time domain waveforms. Left: example RR Lyr (ab) (LINEAR ID: 10021274). Right: example RR Lyr (c) (LINEAR ID: 10032668). Data collected from the LINEAR dataset.

2.2.3 Delta Scuti / SX Phe
Further down the HR diagram (decreasing luminosity) and along the instability strip are Delta Scutis; these are pulsating variables of type A to F, with short periods (on the order of 0.02–0.3 days). Variable stars of this type have smaller amplitudes compared to RR Lyr and Cepheids and are often more complex in their light curves; specifically, they can express multiple periodicities. The underlying relationship between periodicity and luminosity that we would expect from stars on the Cepheid strip still holds and is based on the fundamental frequency of the radial pulsation [Petersen and Christensen-Dalsgaard, 1999]. An example of a Delta Scuti time curve is provided in Figure 2.11.

In a similar spectral class and location on the HR diagram are older and more metal-poor stars, referred to as SX Phoenicis variables [Cohen and Sarajedini, 2012]. These tend to have larger amplitudes and a shorter range of periods (0.03–0.08 days) but still have a very similar phased light curve shape compared to the Delta Scuti. We have grouped these two classes together for the purposes of this analysis. Similar to Delta Scutis, the SX Phoenicis also have a period–luminosity relationship that can be empirically determined and exploited [Santolamazza et al., 2001].

Figure 2.11: Example Delta Scuti phased time domain curve; data collected from the LINEAR dataset (LINEAR ID: 1781225).

Chapter 3

Tools and Methods
We have discussed the variable stars targeted by this effort, and while we have mentioned that they are variable and have even presented their light curves, we have not made a more formal definition of what exactly we are measuring. Time domain data are simply a set of measurements sampled at some interval. The measurement can be any of the standard astronomical observations (magnitude, color, radial velocity, spectrum, polarimetry, and so on) sampled over time [Kitchin, 2003]. Within this work, we focus on single-color photometry. Because of the monotonic relationship between flux and magnitude (a logarithmic transform), the difference between training a classifier in the magnitude domain or the flux domain is moot. Thus we will use both LINEAR magnitudes (in the V-band) and Kepler normalized (and corrected) flux values for training.

While time domain data can be derived from any type of measurement, the analysis performed within the context of this work focuses entirely on photometry in the optical or visual domain. For large, modern surveys, this usually involves a space-based (Kepler) or ground-based (LINEAR) telescope. Depending on the survey, there might be one (LINEAR) or many (SDSS) CCDs, and often the number of detectors will correlate with the number of filters used. Automated slewing, tracking, and targeting all allow for the development of automated surveys. The result is effectively "movies" of the stars, a plurality of digital images of swaths of sky. In addition to the automated mechanical and electric components of modern surveys, an initial digital automation also is common. Kepler, for example [Jenkins et al., 2010, Christiansen et al., 2013], has an automated processing pipeline that operates on the digital images, detecting the power distributions of individual sources and measuring the apparent magnitude of each source based on known calibration operations. Quintana et al. [2010] address CCD calibration (i.e., dark, bias, flat, smear, etc.), and Twicken et al.
[2010] discuss both the initial photometry analysis (flux estimation) and the pre-search data conditioning (artifact removal). The result is a light curve relatively free of systematic noise. While the references provided here are specific to Kepler and its space-based survey mission, the generic process of calibration, conditioning, and analysis is common across all surveys. Likewise, there are processes unique to the Kepler mission that do not occur in ground-based missions: for example, Kepler performs a smear correction because it has no shutter, and it does not need to perform an atmospheric correction because it is in space. Much of the preprocessing that the surveys perform (the processing pipeline, as it is often referred to) is designed to condition the observed raw data into a format that is common across all surveys (clean, corrected, magnitude-adjusted, time domain data).

3.1 DSP with Irregularly Sampled Data
Although they are not an issue with Kepler data, ground-based surveys often suffer from irregular sampling rates; that is, the time between individual samples is not consistent, and downtime occurs because of the electronics. The lack of a shutter on Kepler means there is no downtime between images, that is, no loss of photons while the CCD reads out; while this means that part of the CCD calibration must include a smear correction operation, it also means that regular sampling rates are possible within a given continuous observation effort. Breaks in time domain curves still occur as targets are revisited. Ground-based surveys, however, must contend with a rotating Earth (day–night breaks), and most imaging systems have a shutter to contend with. The result is an irregularly sampled time domain curve. A comparison of Kepler and LINEAR raw time series data and sample rates is shown in Figure 3.1.

Figure 3.1: Example raw time domain curves. Left: Kepler data (KIC: 6024572). Right: LINEAR data (LINEAR ID: 10003298).

It is apparent from the figure that the Kepler data have not only a regular sample rate but also fewer breaks and more samples than the LINEAR data. Even within surveys, the number of samples can differ per target; a histogram of the number of samples per target for all of the Kepler eclipsing binary data pulled from the Villanova database is provided in Figure 3.2.

Figure 3.2: Histogram of the number of observations per target star for targets in the Villanova Kepler Eclipsing Binary Catalogue (x-axis: number of observations for a given star; y-axis: number of stars with that number of observations). The graph shows the expected data size per light curve (∼60,000 points).

These differences in the number of observations within a given survey, irregularity of sampling, differing time scales, and likewise differing phases all preclude the application of standard classification methodologies directly to the time domain data.
If one of our goals is a universal classification utility, we are left with the quandary of how to compare like-to-like both within and across surveys. Time domain transforms allow us to map our raw time domain observations to a feature space in which comparisons can be made, specifically measurements of distance and similarity.

The field of digital signal processing (DSP) contains a number of these transformations: Fourier transformations, wavelet transformations, and autocorrelation, to name a few. As Fulcher et al. [2013] discuss, methods for time series analysis can take on a multitude of forms, with each method designed to quantify some specific value or set of values. As Fulcher et al. [2013] show, not all methods are unique, and many correlate with one another; this is especially true within analysis families (linear correlations, statistical distributions, stationarity evaluations, entropy measures, etc.). The astronomical community, initially focusing on those variable stars that have a consistent, repeating signature, commonly uses Fourier domain statistics to evaluate time domain signals. The transformation into the frequency domain has a twofold effect: (1) mapping to the Fourier (frequency) domain means that within a given range of frequencies, variable stars can be compared one to one, and (2) the data can be phased or folded on themselves about a primary period if one is found, again allowing variable stars to be compared one to one in the phase domain.

This application of the Fourier transformation is really an effort to generate a primary period for the variable star, a process referred to here as period finding. Astronomers have developed many methods for period finding since the late 1970s, when time domain astronomy first became feasible. While the Fourier transformation has a much longer history [Heath, 2018], the algorithms designed around it, such as the discrete Fourier transform (DFT) and fast Fourier transform (FFT), require regular sample rates to operate.
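Because the DFT and FFT cannot be applied directly to uneven time grids, period finding on survey data uses estimators built for irregular sampling. As an illustration only (the implementation used in this work is in Java, in the Variable Star package), a minimal NumPy sketch of the Lomb normalized periodogram [Scargle, 1982] might look like the following; the trial signal and frequency grid are invented for the demonstration.

```python
import numpy as np

def lomb_scargle(t, h, freqs):
    """Lomb normalized periodogram (Scargle 1982) for irregularly sampled data.

    t, h  : sample times and measurements (equal-length arrays)
    freqs : trial frequencies in cycles per unit time
    Returns spectral power normalized by the sample variance.
    """
    h = h - h.mean()
    var = h.var(ddof=1)
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        # The offset tau makes the estimate invariant to shifts of the time origin.
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = ((h @ c) ** 2 / (c @ c) + (h @ s) ** 2 / (s @ s)) / (2 * var)
    return power

# Irregularly sampled sinusoid with an assumed 0.35-day period
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 20.0, 300))
h = 1.0 + 0.5 * np.sin(2 * np.pi * t / 0.35) + 0.05 * rng.normal(size=300)
freqs = np.linspace(0.5, 10.0, 2000)
best_period = 1.0 / freqs[np.argmax(lomb_scargle(t, h, freqs))]
```

Despite the uneven sampling, the peak of the periodogram recovers the injected period to within the resolution of the frequency grid.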
The method of Lomb–Scargle [Lomb, 1976, Scargle, 1982] was one of the first to address the issues faced by astronomers. Since then, the number of methods has increased, leveraging a variety of different techniques [Graham et al., 2013b].

Most period-finding algorithms are roughly methods of spectral transformation with an associated peak/max/min-finding algorithm, and they include such methods as the discrete Fourier transform, wavelet decomposition, least squares approximation, string length, autocorrelation, conditional entropy, and autoregressive methods. Graham et al. [2013b] review these transformation methods (with respect to period finding) and find that the optimal period-finding algorithm is different for different stars. Given the history and confidence associated with the Lomb–Scargle method, it was selected as the main method for generating a primary period within this work.

The Lomb–Scargle algorithm computes the Lomb normalized periodogram (spectral power as a function of frequency) of a sequence H of N data points sampled at times T, which are not necessarily evenly spaced [Scargle, 1982]. T and H must be vectors of equal size. The routine calculates the spectral power for an increasing sequence of frequencies (in reciprocal units of the time array T) up to some high-frequency threshold, an input constant times the average Nyquist frequency; the user additionally supplies an oversampling factor, typically OFAC ≥
4. The returned values are arrays of the frequencies considered (f), the associated spectral power (P), and the estimated significance of the power values (σ).

Although the implementation outlined as part of this research is based on that described in Teukolsky et al. [1992], section 13.8, rather than using trigonometric recurrences, this implementation takes advantage of Java's array operators to calculate the exact spectral power as defined in equation 13.8.4 on page 577 of Teukolsky et al. [1992]. For more information, our implementation of the Lomb–Scargle algorithm is provided as part of the Variable Star package.

One of the key transformations applied to variable star data is a folding of the observations. The procedure is straightforward: a Fourier transform or similar operation generates the distribution of frequencies in the observed time domain data; a maximum power is found [Graham et al., 2013b], and the data are phased about this main period. Maximum power, or dominant period, in the Fourier domain is a proxy for consistency of frequency over the observations. The result is a figure like the one presented in Figure 3.3, with a domain bound on [0,
1] and a y-axis range that depends on the original observations or on the transformations applied by the analyst (such as the min-max transformation applied here).

The period found is referred to as the primary period and is the dominant period expressed by the variable. This method allows for a simplification of the cyclic variation of stellar brightness, be it from transiting eclipses, radial pulsation, or other causes that produce a cyclostationary signal. Of course, no signal is perfectly repeated, and variations exist from cycle to cycle caused by either intrinsic or extrinsic factors. The noise of the detector or the atmosphere can cause changes from cycle to cycle. Stars can and often do vary for multiple different reasons [Percy, 2007]. (The relevant implementations in the Variable Star package are fit.astro.vsa.analysis.feature.LombNormalizedPeriodogram and fit.astro.vsa.analysis.feature.SignalConditioning.)

Within any given interval (x, x + dx) in phase, the amplitude (f(x)) has a distribution. Ideally, the central tendency of the distribution is the expected amplitude if only the main period existed. If we assume that this main-period signal is a defining characteristic to be used in the classification of variable stars, then we require a procedure to isolate or extract the true underlying period of variation. To this end, we briefly review Friedman's variable span smoother, that is, the SUPER-SMOOTHER algorithm [Friedman, 1984].

Let us assume we have bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n); the smoothing function can be given as equation 3.1:

y_i = s(x_i) + r_i, \qquad (3.1)

where s(x_i) is the signal and r_i is the residual error. Likewise, the optimal smoothing function is the one that minimizes the sum of the squared residuals \sum_i r_i^2:

\min_s \sum_i \left(y_i - s(x_i)\right)^2. \qquad (3.2)

A simple smoothing function would be the boxcar averaging algorithm [Holcomb and Norberg, 1955]: a fixed window (the span) is slid across the domain; at each iteration, an average is found and used as the smoothed estimate.
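The folding step described above reduces to mapping each observation time onto a phase in [0, 1) modulo the primary period. A minimal sketch, with an assumed known period and reference epoch, is:

```python
import numpy as np

def fold(t, period, t0=0.0):
    """Map observation times onto phase in [0, 1) about a primary period.

    t0 is the reference epoch (phase-zero point); both arguments here are
    assumed known, e.g. from a periodogram peak.
    """
    return ((t - t0) / period) % 1.0

# Observations one period (2.5 days) apart land on the same phase when folded.
t = np.array([0.5, 3.0, 5.5, 8.0])
phases = fold(t, 2.5)
```

Samples separated by an integer number of periods collapse onto the same phase, which is exactly what makes the phased plot comparable across cycles.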
Friedman [1984] proposed an advancement on this idea: a variable bandwidth smoother. Given a defined span J, we can define a local linear estimator, our smoothing function, as equation 3.3:

\hat{y}_k = \hat{\alpha} + \hat{\beta}x_k, \quad k = 1, ..., n. \qquad (3.3)

This linear model is constructed from local fits of points, similar to the boxcar average, over i - J, ..., i + J, with x_i \le x_{i+1} for i = J, ..., n - J. The variable span smoother attempts to optimize with respect to equation 3.4:

\min_{s,J} \sum_i \left(y_i - s(x_i \mid J(x_i))\right)^2. \qquad (3.4)

Similar to the constant-span case, the local linear smoother is applied with several discrete values of J in the range 0 < J < n; optimal solutions for s and J are found using a leave-one-out cross-validation procedure [Duda et al., 2012]. Friedman [1984] originally recommended three values, J = 0.05n, 0.2n, 0.5n, intended to reproduce the three main parts of the frequency spectrum of f(x) (low-, medium-, and high-frequency components). The algorithm selects the best span based on the error analysis proposed in equation 3.4 for each input x_i; these "best" estimates are then smoothed using the medium-frequency span. The output is a smooth estimate of the input data, as demonstrated in Figure 3.4.

Figure 3.4: Left: transformed phased time domain data. Right: corresponding SUPER-SMOOTHER phased time domain data (Kepler data, KIC: 5820209).

The information we observe comes from raw signals: amplitude in time (time series), spectral amplitudes (color), spectra, and so on. As we have discussed, the raw time domain signal can be difficult to work with, so analysis usually involves a transformation to a different representation (transform, mapping, reduction). Traditional variable star astronomy focuses on "folding" the time domain data (light curve), that is, transforming the data such that subsequent phases of cyclostationary signals are overlapping.
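The span-selection idea behind the variable span smoother can be sketched with a deliberately simplified stand-in: local means (rather than Friedman's local linear fits and per-point span blending) evaluated at the three trial spans, with leave-one-out residuals choosing among them. This is an illustration of the cross-validation step, not the SUPER-SMOOTHER algorithm itself.

```python
import numpy as np

def loo_running_mean(y, half_span):
    """Leave-one-out running mean: estimate y[i] from its neighbors, excluding y[i]."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half_span), min(n, i + half_span + 1)
        window = np.concatenate([y[lo:i], y[i + 1:hi]])
        out[i] = window.mean()
    return out

def best_span_smooth(y, span_fractions=(0.05, 0.2, 0.5)):
    """Pick the trial span that minimizes the leave-one-out residual sum of squares."""
    best = None
    for frac in span_fractions:
        half_span = max(1, int(frac * len(y) / 2))
        fit = loo_running_mean(y, half_span)
        rss = np.sum((y - fit) ** 2)
        if best is None or rss < best[0]:
            best = (rss, fit)
    return best[1]

# Noisy sinusoid standing in for a phased light curve (invented test signal)
x = np.linspace(0.0, 1.0, 200)
rng = np.random.default_rng(1)
noisy = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)
smooth = best_span_smooth(noisy)
```

The cross-validated span suppresses the noise without flattening the underlying cycle; the full algorithm additionally blends the per-point span choices with the medium span, as described above.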
This is done by finding the main (primary) frequency of the repeated signal; the data are then "phased" together, resulting in a phased plot, as shown in Figure 3.3.

We are still left with the challenge of what to do with the information we have generated so far: raw time domain data, a frequency representation of the time domain data (i.e., the periodogram), and a phased data plot, none of which can be used for one-to-one comparison yet. As mentioned, the time domain data can be of unequal sizes (and phases); the frequency domain representation will not be consistent within a class (eclipsing binaries, for example, can have a wide range of periods); and the phased data representation, while on a constrained domain space, can have unequal samples, thus precluding one-to-one comparisons. Additional transformations are necessary, then, for a similarity analysis.

Community efforts in the reduction of the time domain data have mostly focused on the development of independent metrics derived from one of the three outputs discussed so far (raw data, frequency data, or phased data). Expert selection of measurable values that are consistent within a given class space (variable star type) has resulted in a multitude of features. These have included the following:

• time domain statistics: the quantification of statistics associated with the raw time domain data. The statistics could be as simple as means, standard deviations, kurtosis, skew, max, min, and so on. They could also be local statistics, on either the phased or the unphased data.

• frequency domain statistics: most common in the literature is the statistical reduction of the transformation representations, specifically ratios (e.g., f_2/f_1), differences, levels, and so on.

Most, if not all, standard transformations of this nature stem from an original reference, Debosscher [2009], and have since been added to over time to produce a set of upward of 60 features.
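A handful of the simple time domain statistics named above can be computed directly from a light curve; the sketch below is illustrative only (it is not the Debosscher feature set, and the input series is synthetic).

```python
import numpy as np

def time_domain_features(y):
    """A few simple time domain statistics of the kind used as classification features."""
    y = np.asarray(y, dtype=float)
    mu, sigma = y.mean(), y.std()
    z = (y - mu) / sigma            # standardized values
    return {
        "mean": mu,
        "std": sigma,
        "skew": float(np.mean(z ** 3)),       # third standardized moment
        "kurtosis": float(np.mean(z ** 4)),   # fourth standardized moment (3 for a Gaussian)
        "amplitude": (y.max() - y.min()) / 2.0,
    }

# Synthetic stand-in for a magnitude series
rng = np.random.default_rng(2)
feats = time_domain_features(rng.normal(0.0, 1.0, 10000))
```

For a pure Gaussian series the skew is near 0 and the kurtosis near 3, which is why departures from those values carry class information for asymmetric, sawtooth-like light curves.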
A graphical representation of these expertly selected features is provided in Figure 3.5.

Figure 3.5: Example of community standard features used in classification.

These features, however, to the best of our knowledge, are statistical measures that have been implemented to distill the complex time domain data into singular measures; they are not necessarily optimized or selected for the purpose of maximizing the ability of a classifier to differentiate between variable stars. The work presented here will propose additional transformations that might be useful in discriminating variable stars.

Mahabal, A. (2016).
Complete classification conundrum. Presented at Statistical Challenges in Modern Astronomy VI, June 6–10, Carnegie Mellon University.

3.4 Machine Learning and Astroinformatics

One of the key tools of this research effort is the leveraging of the past 30+ years of machine learning advancements in the construction of an algorithm for the automated classification or detection of variable stars of interest. The tools reviewed here represent a small fraction of the total possible methods; we focus on those that have found favor in the astroinformatics community, or those that are predecessors of the methods we have proposed and implemented. "Astroinformatics: A 21st Century Approach to Astronomy" [Borne et al., 2009] is one of the first community calls for a codification of this field of study. Submitted to the Astronomy and Astrophysics Decadal Survey, it advocates for the formal creation, recognition, and support of a major new discipline called "astroinformatics," defined as "the formalization of data-intensive astronomy and astrophysics for research and education" [Borne, 2010]. More specifically, this is the field of study that includes

• large database management for astrophysical data
• high-performance computing for astrophysics simulations
• data mining, visualization, and exploratory data analysis specific to astrophysical data
• machine learning for astrophysical problems and observations

We focus here on the fourth bullet point, machine learning. Machine learning, in the context of astroinformatics and specifically our goal (data discovery), can be further broken down into several methods of data discovery [see Figure 3.6; Booz Allen Hamilton, 2013].

Figure 3.6: High-level categories and examples of data discovery and machine learning techniques.
(1) Class discovery is used to generate groupings from previously analyzed data; (2) correlation discovery is used to construct models of inference based on prior analysis; (3) anomaly discovery is used to determine whether new observations follow historical patterns; and (4) link discovery is used to make inferences based on collections of data.

First, with regard to class discovery: how can we find categories of objects based on observed traits and known prior information? Efforts in this direction focus on unsupervised classification (clustering) methods. Second, with regard to correlation discovery: based on what we know and the trends we are able to map and understand, how can we understand new observations? Work here usually falls under the category of supervised classification, which implies training algorithms based on expertly labeled data to infer labels for newly observed data. Third is anomaly discovery, the "unknown unknowns": how do we construct an algorithm to determine when new data fall out of range of our prior observed trends? Last is association or link discovery: training an algorithm to make connections that are interesting or helpful based on a predesigned set of rules.

For demonstration purposes, we have constructed a test set with which to demonstrate some standard classifiers. Figure 3.7 is a plot of 45 solar neighborhood stars on an HR diagram, which we have expanded by adding 200 artificially generated data points. This expansion was performed via data augmentation, specifically a bivariate jittering of the original data with Gaussian noise; thus each upsampled, artificially generated data point has some basis in the original 45 "templates." These data are intended for demonstration purposes only.

Figure 3.7: The HR diagram test data to be used in the demonstration of standard classifiers. The original dataset shown here contains 45 solar neighborhood stars on an HR diagram (red: main sequence stars; green: white dwarfs; blue: supergiants and giants).
Data shown here are for demonstration purposes.

This joint dataset of artificial and real observations will be used going forward to demonstrate the concepts presented.
This work will address three of the four machine learning categories described earlier: class discovery in the form of clustering and correlation discovery (both discussed here), as well as some initial anomaly detection proposed in section 4.3.3. Correlation discovery can be further broken down into a few basic categories with respect to complexity of implementation; these are presented in Figure 3.8.

Figure 3.8: High-level categories and examples of correlation discovery techniques. (1) Rule-based learning uses simple if/else decisions to make inferences about data; (2) simple machine learning uses statistical models to make inferences about data; (3) artificial neural networks combine multiple simple machine learning models to make inferences about data; (4) adaptive learning allows for the iterative training/retraining of statistical models as more data become available or as definitions change.

We focus on the construction of simple machine learning functions (see Figure 3.8), as they tend to lend themselves better to transparency, which is a necessity for scientific implementation. Simple artificial neural networks are implemented as part of this work.

As a baseline, we adopt the notation and nomenclature of Hastie et al. [2009] for our overview of the subject of machine learning. Where we focus our discussion on metric learning specifically, we will borrow notation from Bellet et al. [2015]. Let us define a simple generic model: \hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j\hat{\beta}_j. Here we have made some observations as part of an experiment, of both an input (X_j, also referred to as a feature or attribute) and an output or response \hat{Y}. Our model is defined both by its functional form and by the parameters \hat{\beta}_j. The parameters are learned from our training data, the set of pairs (X_j, y_j) observed as part of the experiment. The feature measured could be a value, a set of values, a continuous series, a matrix, a string, and so on.
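The generic linear model above, fit by least squares to training pairs, can be sketched in a few lines; the coefficients and synthetic data here are invented purely for illustration.

```python
import numpy as np

# Fit y_hat = beta_0 + sum_j x_j * beta_j by least squares on synthetic,
# noiseless training pairs (the true coefficients 1.0, 2.0, -0.5 are invented).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

A = np.hstack([np.ones((100, 1)), X])   # prepend the intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
```

With noiseless responses the learned parameters recover the generating coefficients exactly, which makes the role of the \hat{\beta}_j parameters concrete before classification (qualitative responses) is introduced.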
Note that when the response is qualitative, you might see (X_j, g_i), denoting that instead of the response being a continuous variable, g_i is from the set G of labels or classes; these are discrete options, such as (1, 2, 3, ...), (a, b, c, ...), or (red, green, blue, ...), and are representative descriptions of what was being measured. Figure 3.9 gives an example of the boundaries resulting from the development of a classifier and the associated regions of class space.

Figure 3.9: Decision space example. The figure's axes (x1, x2) are merely example dimensions representing a generic bivariate space.

The k-nearest neighbor algorithm estimates a classification label based on the "closest" samples provided in the training data [Altman, 1992]. If {x_n} is a set of training data of size n, then we find the distance between a new pattern and each pattern in the training set based on a given measure of distance. Often, the distance between points is estimated via the L_p-norm, given here as equation 3.5:

\|x - y\|_p = \left(|x_1 - y_1|^p + |x_2 - y_2|^p + \dots + |x_n - y_n|^p\right)^{1/p}. \qquad (3.5)

The value of p can be adjusted but is frequently given as 2 (Euclidean distance). This is not the only distance measure; a much more complete survey is given in Cha [2007], covering not just the L_p-norm family but also lesser-known comparison families (intersection, inner product, squared-chord, χ², and Shannon's entropy). Based on the distance measurement, the algorithm classifies the new pattern according to the majority of class labels among the k closest training samples. The pattern can be rejected ("unknown") when there is no majority or when the distance exceeds some threshold value defined by the user.

For example, Figure 3.10 shows the training set (in red and blue) and the test case (in green). If k = 1, the test case would be classified as "red," as the nearest training sample is red.
If k = 5, the test case would be classified as "blue," as the closest five measurements (dashed circle) contain three blue and two red (majority vote wins).

Figure 3.10: Example of k-NN classification with k = 3 and k = 5, respectively. Notice how the classification of the new observation changes between these two settings (from red to blue). (Figure adapted from https://en.wikipedia.org/wiki/K-nearest neighbors algorithm.)

While k-NN is nonparametric, posterior probabilities are estimable when k is large and the density of points is large. If we call v the volume of space contained by the k nearest points (the area of the circles in Figure 3.10), then P(x | ω_i) ∼ k_i/(Nv), where k_i is the number of matching patterns and N is the number of training samples. A variation of the k-nearest neighbor algorithm is to weight the contributions of the class votes (weighted k-NN). Weighting is often a function of distance; points that are farther away have less effect on the class decision [Hechenbichler and Schliep, 2004]. Using the training data, we have applied the k-NN algorithm (implemented in fit.astro.vsa.utilities.ml.knn) and determined that k = 1 is the optimal solution. Figure 3.11 is a plot of the decision space and the test data, data which were not used in training.

Figure 3.11: Decision space and test data, shown in the (x1, x2) coordinate system.

We are interested in f(x | c = m), that is, the distribution of the observable for a given class type. Often we either do not know the distribution, or the distribution is known but not easily expressible (e.g., not Gaussian). We can identify a good approximation of the distribution using kernel density estimation (KDE; aka Parzen window density estimation). Similar to the k-NN classifier, the Parzen window classifier (PWC) is based on defining similarity. The classifier leverages KDE to approximate the probability density function for each class distribution [Rosenblatt, 1956].
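The k-NN rule described above, the L_p-norm of equation 3.5 plus a majority vote, can be sketched as follows; the toy training points and labels are invented for the demonstration and are unrelated to the HR diagram test set.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, labels, x_new, k=3, p=2):
    """Classify x_new by majority vote among its k nearest training points,
    using the L_p-norm (p=2 gives Euclidean distance) as the measure."""
    dists = np.sum(np.abs(X_train - x_new) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated classes in a generic bivariate feature space
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
labels = ["red", "red", "red", "blue", "blue", "blue"]
pred = knn_classify(X_train, labels, np.array([0.95, 1.0]), k=3)
```

A rejection threshold or distance-based vote weights (the weighted k-NN variant) would slot in where the vote counts are tallied.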
The one-dimensional KDE can easily be extended to the multivariate case (implemented in fit.astro.vsa.utilities.ml.pwc), given here as equation 3.6:

$$\hat{f}(x \mid x_n, S) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{S^D} \, K\!\left( \frac{x - x_n}{S} \right), \quad (3.6)$$

where $D$ is the dimensionality of the $x$ vector and $K(u)$ is some function we can define as a window or "kernel." For each new point $x$, an estimate of $\hat{f}(x \mid x_n, S)$ can be generated per class. Using the prior approximations, we can estimate the posterior probability that the observed pattern belongs to class $i$ as

$$\hat{p}(\omega_i \mid x) = \frac{\hat{f}(x \mid \omega_i)\, \hat{p}(\omega_i)}{\sum_{k=1}^{J} \hat{f}(x \mid \omega_k)\, \hat{p}(\omega_k)}. \quad (3.7)$$

The decision space is then based on a comparison of the probability density estimate per class and the prior probabilities [Parzen, 1962, Duda et al., 2012], given here as equation 3.8 for the two-class case:

$$\hat{f}(x \mid \omega_1) > \frac{P(\omega_2)}{P(\omega_1)} \, \hat{f}(x \mid \omega_2); \quad (3.8)$$

see Figure 3.12 for an example (after commons.wikimedia.org), plotted in our standard generic $(x_1, x_2)$ coordinate system.

3.4.5 Decision Tree Algorithms

Let us discuss the notion of a classification decision tree algorithm, a set of implemented rules that make discrimination decisions. To set up the idea of a classification decision tree, we provide the following example: I have a basket of fruit (apple, orange, grape), and I wish to describe the process by which I separate the fruit into piles of similar fruits. How do I make my decisions as to which fruit goes into which pile? I could weigh each piece of fruit; grapes are the lightest, but apples and oranges might weigh the same (within the variance of their weights). I could measure the size of the pieces of fruit, but I would run into similar concerns. The most distinguishing feature is the color of the fruit, and the algorithm would go something like "look at the fruit: if red, put into basket A; if orange, put into basket B; if purple or green, put into basket C." These logical decisions are binary (if/else), and we can graph this algorithm as a series of binary decisions; see Figure 3.14.
Each decision node is a binary decision maker that divides the original population (parent) into two new populations (children): true to the left group, false to the right group. The final baskets (e.g., blue circles in Figure 3.14), or populations, are referred to as terminal nodes.

Figure 3.14: A simple example of a decision tree generated based on expert knowledge. The solid circles represent decision nodes (simple if/else statements), and the empty circles represent terminal nodes. Note that "yes" always goes to the left child node.
The classification and regression tree (CART) is a method for generating (optimizing) the binary decision tree for the purposes of classification using measures of impurity [Breiman et al., 1984]. When the algorithm makes a decision, the split ($S$) is defined as a function of both the dimension of the vector ($d$) and the threshold of the decision (some value). If we look at the simplest tree, then the parent node is given as $t$, and the two child nodes are $t_L$ and $t_R$.

Figure 3.15: An example simple decision tree with parent ($t$) and left and right children ($t_L$ and $t_R$).

The impurity $i$ of any given node is estimated using an impurity metric. Entropy and Gini diversity are two popular methods for determining an impurity metric. The change in impurity is given as equation 3.9:

$$\Delta i(S, t) = i(t) - [P_L \cdot i(t_L) + P_R \cdot i(t_R)], \quad (3.9)$$

where $P_L = N_L / N_t$ and $p(j \mid t) = N_j / N_t$. Entropy-based impurity is given as equation 3.10:

$$i(t) = -\sum_j p(j \mid t) \cdot \log(p(j \mid t)). \quad (3.10)$$

The CART algorithm attempts to maximize the change in impurity of the parent, or equivalently to maximize the purity of the children. These iterations continue until each terminal node contains a cluster that is pure. The cross-validation process is then implemented to prune the tree (remove nodes from the tree and collapse upward) to minimize the cross-validation error. The resulting classification function is given as $T(x; \theta_b)$, the tree of binary (if/else) decisions ($\theta_b$) that have been optimized based on the input data set. The output of any given tree is the set of posterior probabilities, that is, likelihoods that the input vector $x$ is of a given class type $\hat{C}(x)$, which we show as equation 3.11:

$$\hat{C}(x) = \max \left( T(x; \theta_b) \right). \quad (3.11)$$
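The split-selection quantities of equations 3.9 and 3.10 can be sketched directly. This is an illustrative sketch with hypothetical function names, not the dissertation's CART implementation; a full CART would scan candidate splits and keep the one maximizing the impurity change.

```python
import math

def entropy_impurity(labels):
    """Entropy impurity i(t) of a node (equation 3.10)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log(p) for p in probs)

def impurity_change(parent, left, right):
    """Change in impurity for a candidate split (equation 3.9):
    i(t) - [P_L * i(t_L) + P_R * i(t_R)]."""
    n = len(parent)
    return entropy_impurity(parent) - (
        len(left) / n * entropy_impurity(left)
        + len(right) / n * entropy_impurity(right)
    )
```

A perfect split of a balanced two-class parent recovers the parent's full entropy as the impurity change, while a split that leaves both children mixed yields zero change.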
Figure 3.16 is a plot of a decision space and the test data (which were not used in training) based on the CART decisions; note that the if/else decisions result in a purely linear decision boundary (cuts along a given dimension).

Figure 3.16: CART decision space example in our standard generic $(x_1, x_2)$ coordinate system.

Bootstrap aggregation can be leveraged to reduce the variance normally associated with the application of the CART algorithm (implemented in fit.astro.vsa.utilities.ml.cart) to a given data set. A random subset of $N$ points is drawn to generate a given set of input training data; CART is applied to this subset of data to generate the tree $T_b(x)$. This operation is performed $B$ number of times to generate a set of noisy but unbiased models of the class decision space. This set of trees is the random forest classifier. The algorithm classifies by simply taking a consensus decision across all trees (majority rule). Hastie et al. [2009] state that if the variance of the output of an individual tree is $\sigma^2$, then the variance of the average output of the random forest is equation 3.12:

$$\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2, \quad (3.12)$$

where $\rho$ is the positive pairwise correlation between trees. As the number of trees increases, the second term goes to zero, and because $\rho \in [0, 1]$, $\rho\sigma^2 \leq \sigma^2$; the random forest classifier will have a reduced or equal variance compared to the individual CART algorithm. This means that increasing the value of $B$ will have diminishing returns, as $\frac{1 - \rho}{B} \to$
0. The optimal value of $B$ can be found via cross-validation.

We consider the binary classification case where the labels of the observations, whatever they might be, have been mapped to a 0 or 1. Logistic regression [Hastie et al., 2009] models the posterior probability of the classes, given the input observations, via a linear function in $x$. By specifying the model as a logit transform, the continuous variable input can be mapped to the domain $[0, 1]$:

$$Pr(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}. \quad (3.13)$$

The parameter set $\theta = \{\beta_0, \beta\}$ can be solved for via optimization methods; specifically, a loss function can be established as the sum of the log posterior probabilities (log-likelihood):

$$l(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta). \quad (3.14)$$

The function can be optimized via the Newton–Raphson algorithm [Heath, 2018], where the Hessian matrix and derivatives are derived; the optimization process is then

$$\beta^{new} = \beta^{old} - \left( \frac{\partial^2 l(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial l(\beta)}{\partial \beta}. \quad (3.15)$$

The result of the optimization and the decision generation process is a linear decision line (or set of linear decision lines in the case of the multiclass problem). Based on the model in equation 3.13, a simple threshold value can be set that determines the mapping to our binary decision (0 or 1). This simple transform model is the single-layer perceptron (SLP) model given graphically in Figure 3.17, also known as a neuron (implemented in fit.astro.vsa.utilities.ml.lrc).

Multilayer perceptron (MLP) classification is the extension of the single-layer case in terms of both numbers of perceptrons and layers of perceptrons [Rumelhart et al., 1985]. Specifically, MLP establishes one or more "hidden layers," each one containing a finite set of transformations where each individual transformation is a single-layer perceptron. As demonstrated in Figure 3.18, the input layer (the set of observations) feeds the first hidden layer (the set of SLPs). In most cases, all nodes of one layer feed all nodes of the next.
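Stepping back to logistic regression, equations 3.13 through 3.15 can be exercised end to end in one dimension. This is an illustrative sketch under simplifying assumptions (scalar input, fixed iteration count, hypothetical function name), not a production fitter; it works the Newton step of equation 3.15 out by hand for the two-parameter case.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit Pr(y=1|x) = 1/(1+exp(-(b0 + b1*x))) by Newton-Raphson (equation 3.15).

    `xs` are scalars, `ys` are 0/1 labels; data should not be perfectly
    separable, or the coefficients diverge.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0   # entries of the negative Hessian
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```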
While not shown, the output of the first hidden layer could feed a second hidden layer, the second a third, and so on. The last layer is the output of the algorithm, either an estimated label (classification) or a point estimate (regression). The resulting MLP decision space is plotted in our standard generic $(x_1, x_2)$ coordinate system.

As a means of standard clustering based on vector-variate input (e.g., $\mathbb{R}^{m \times 1}$), k-means is both a straightforward operation and an industry standard [Duda et al., 2012]. The algorithm falls into the category of descent optimization. The algorithm iterates in an attempt to define clusters, that is, groupings of similar inputs where similarity is based on a set of defined rules associated with distance. Although we can define distance however we see fit, the standard k-means clustering algorithm uses the Euclidean distance; this need not be an absolute requirement, however, as any estimation of central tendency could be used here [Modak et al., 2018]. Clusters, for the purpose of the algorithm, are defined by a single vector, that is, the centroid of the cluster. Membership $C(i)$ of any given input point to a cluster is based on the closest cluster centroid ($m_k$). The algorithm optimizes the within-point scatter given the cluster definitions. This optimization problem is defined as

$$\min_{C, \{m_k\}_1^K} \sum_{k=1}^{K} N_k \sum_{C(i) = k} \|x_i - m_k\|^2. \quad (3.16)$$

The algorithm operates as follows:

1. Each data point is randomly assigned a cluster membership.
2. For a given cluster assignment (membership of each point to a cluster based on the closest centroid), the center of the cluster is estimated (mean of members).
3. Based on the means estimated in the prior step, the points are reassigned membership based on the closest cluster centroid: $C(i) = \arg\min_{1 \leq k \leq K} \|x_i - m_k\|^2$.
4. Steps 2 and 3 are iterated until membership among the input data set is unchanged.

The resulting output is a list of clusters, their center vectors, and the associated members.
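The four steps above can be sketched as a minimal Lloyd's-style iteration. This is an illustrative sketch with hypothetical names, not the dissertation's KMeansClustering implementation; initial centroids are seeded from random sample points rather than from a random membership assignment.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means with Euclidean distance (equation 3.16).

    `points` is a list of tuples; returns (centroids, memberships).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centers (step 1, simplified)

    def nearest(p):
        return min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

    members = [nearest(p) for p in points]
    for _ in range(iters):
        # Step 2: recompute each centroid as the mean of its members.
        for j in range(k):
            cluster = [p for p, m in zip(points, members) if m == j]
            if cluster:
                centroids[j] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
        # Step 3: reassign points to the closest centroid.
        new_members = [nearest(p) for p in points]
        # Step 4: stop when membership no longer changes.
        if new_members == members:
            break
        members = new_members
    return centroids, members
```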
Because of the random nature of the initial assignment, the final assignment of cluster number can vary from application to application of the algorithm to the same data set. Likewise, results can vary depending on the number of clusters used in the operation. As part of our analysis, we propose a feature space that is matrix-variate (see section 7.3.4), and likewise our analysis of the resulting observations needs to be able to accept inputs that are on the $\mathbb{R}^{m \times n}$ space. To that end, we propose a k-means algorithm (implemented in fit.astro.vsa.common.utilities.clustering.KMeansClustering) that uses the Frobenius norm instead of the Euclidean. The algorithm can then be directly applied to the observed feature data.

The key to useful feature space reduction is the transformation of an observed raw data set into a new domain with a smaller dimensionality but with roughly the same amount of useful information [Cassisi et al., 2012]. What quantifies "useful" is a point of order that can be disputed, but PCA defines this as a function of correlation [Einasto et al., 2011]. As an example, a feature space in the domain $\mathbb{R}^{n \times 1}$ where all $n$ dimensions are perfectly correlated contains redundant information. So we ask, how much of a "difference" is each feature making? How much am I getting for my effort? How many features do I need? What are the diminishing returns? As an example, we present the Palomar Transient Factory (PTF) and Catalina Real-Time Transient Survey (CRTS) features, two different surveys using two different feature spaces. A correlation matrix of features is presented in Figure 3.20, and it is apparent that there are features that do not correlate (light colors) and features that almost certainly do (dark colors). The feature comparison draws on Mahabal, A. (2016), Complete Classification Conundrum, presented at Statistical Challenges in Modern Astronomy VI, June 6–10, Carnegie Mellon University.

The observational data set $X$ is decomposed via $X = U \Sigma W^T$; a number of computationally efficient means of solving for these components exist [e.g., Bosner, 2006]. The set of principal components is then $T = XW$, where each column of $T$ is a principal component, with the first column being the component that has maximum variation. A transformation of $X$ using PCA for the purposes of dimensionality reduction can be performed via $T_L \leftarrow X W_L$, where $T_L$ has been truncated to the first $L$ number of dimensions.

Much of machine learning, in particular the preparation of input data prior to training, involves the transformation or mapping of the original observed data and feature space into a domain that is favorable to achieving the machine learning goals. This might mean a dimensionality increase, as would be the case with a radial basis function neural network [Park and Sandberg, 1991, Faloutsos et al., 1994]; the input data set of feature dimension $\mathbb{R}^{m \times 1}$ is compared to a set of neurons that numbers larger than the dimensionality of the original feature space ($n > m$). The components of the neuron can vary depending on implementation, but generally the radial basis function is commonly taken to mean the Gaussian kernel given here as equation 3.17:

$$\rho(x_i; c) = \exp \left( -\beta \|x_i - c\|^2 \right), \quad (3.17)$$

where $\beta$ is the width of the kernel and $c$ is the centroid. The value and number of centroids are provided by the user, though they are often a set or subset of the input training data. The width of the kernel can be optimized based on training procedures.
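The RBF expansion of equation 3.17 can be sketched as a simple feature map: each input is replaced by its vector of kernel activations against the chosen centroids. This is an illustrative sketch with hypothetical function names, assuming fixed centroids and a fixed width $\beta$.

```python
import math

def rbf_activation(x, center, beta=1.0):
    """Gaussian radial basis function of equation 3.17: exp(-beta * ||x - c||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-beta * sq)

def rbf_features(x, centers, beta=1.0):
    """Map an input vector to the activations of a set of RBF neurons."""
    return [rbf_activation(x, c, beta) for c in centers]
```

When the number of centers exceeds the input dimension, this realizes the dimensionality increase described above; a linear classifier applied to the expanded space can then carve nonlinear boundaries in the original space.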
Likewise, it can be naively fixed, as can the weight of each neuron output, similar to the multilayer perceptron classifier discussed in section 3.4.6.1. (The PCA transformation described above is implemented in fit.astro.vsa.analysis.feature.PCA.) By applying a classifier that might generate linear decision criteria (e.g., logistic regression) to this expanded $\mathbb{R}^{n \times 1}$ feature space, the linear decision mapped back
to the original $\mathbb{R}^{m \times 1}$ space can be nonlinear, resulting in a more generic or adaptable classifier.

Conversely, we can decrease the dimensionality of the observational feature space by selecting a set of neurons that numbers smaller than the dimensionality of the original feature space ($n < m$). Let us suppose that the data are already clustered or grouped together prior to implementation of a dimensionality reduction methodology [Park et al., 2003]. This would be the case in the supervised classification problem where the data have an associated label (class); this would also be the case if k-means or some other unsupervised clustering methodology had been applied to unlabeled data (cluster number). Assuming a vector-variate input (e.g., $\mathbb{R}^{m \times 1}$), the centroid of any given grouping can be given as the average of the members:

$$c = \frac{1}{N} \sum_{i=1}^{N} a_i. \quad (3.18)$$

We could just as easily use the median or other measurements of the cluster centroid. The result is the set of centroids that represent our neurons, $C = [c_1, c_2, \dots, c_k] \in \mathbb{R}^{m \times k}$, which can be used to transform our original observations into a new $k$-dimensional representation:

$$Y = \left[ \|x_i - c_1\|, \|x_i - c_2\|, \dots, \|x_i - c_k\| \right]. \quad (3.19)$$

This is a simplified version of the algorithm presented by Park et al. [2003], who include additional weighting of the individual contributions of each neuron determined by an optimization process. Likewise, we have included here only the Euclidean distance ($\|x_i - c\|$), but other distance measurements could easily be included or used in its stead [Cha, 2007].

Metric learning has its roots in the understanding of how and why observations are considered similar. The very idea of similarity is based around the numerical measurement of distance, and the computation of a distance is generated via application of a distance function. Bellet et al.
[2015] define a distance function as follows: over a set $X$, a distance is a pairwise function $d : X \times X \to \mathbb{R}$ that satisfies the following conditions for all $x, x', x'' \in X$:

1. $d(x, x') = d(x', x)$ [symmetry]
2. $d(x, x') \geq 0$ [nonnegativity]
3. $d(x, x') = 0$ if and only if $x = x'$ [identity of indiscernibles]
4. $d(x, x'') \leq d(x, x') + d(x', x'')$ [triangle inequality]

These are standard requirements for a distance and collectively are referred to as the distance axioms. Bellet et al. [2015] further define the specific metric distance as equation 3.20:

$$d_M(x, x') = \sqrt{(x - x')^T M (x - x')}, \quad (3.20)$$

where $X \subseteq \mathbb{R}^n$ and the metric is required to be $M \in S^n_+$, where $S^n_+$ is the cone of symmetric positive semi-definite $n \times n$ real-valued matrices. Metric learning seeks to optimize this distance via manipulation of the metric $M$, based on available side data (see Figure 3.21).

Figure 3.21: Example of the change in distance between points as a result of metric learning optimization. The left figure shows the Euclidean case; the right figure shows a tailored metric.

How the optimization occurs and what is considered important, that is, the construction of the objective function, together compose the underlying difference between the various metric learning algorithms. The side information leveraged is defined as the set of labeled data $\{x_i, y_i\}_{i=1}^n$; furthermore, the triplet is defined as $(x_i, x_j, x_k)$, where $x_i$ and $x_j$ have the same label but $x_i$ and $x_k$ do not. It is expected, then, based on the definition of similarity and distance, that $d(x_i, x_j) < d(x_i, x_k)$, that is, that the distances between similar labels are smaller than the distances between dissimilar labels.
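Equation 3.20 reduces to a quadratic form under a square root; a minimal sketch (hypothetical function name, metric passed as nested lists) makes the role of $M$ concrete: the identity matrix recovers the Euclidean distance, while other positive semi-definite matrices stretch or shrink particular directions.

```python
import math

def metric_distance(x, y, M):
    """Metric distance d_M(x, x') = sqrt((x - x')^T M (x - x')) (equation 3.20).

    M must be symmetric positive semi-definite; M = I gives the Euclidean case.
    """
    d = [a - b for a, b in zip(x, y)]
    # Quadratic form d^T M d.
    quad = sum(d[i] * M[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(quad)
```

Scaling one diagonal entry of $M$ is exactly the kind of change a metric learning algorithm makes: it magnifies distances along discriminative feature directions.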
Methods like LMNN [Weinberger et al., 2009] leverage this comparison to bring similar things closer together while pushing dissimilar things further apart. Given the metric learning optimization process, the result is a tailored distance metric and associated distance function (equation 3.20). This distance function is then leveraged in a standard k-NN classification algorithm. The k-NN algorithm estimates a classification label based on the closest samples provided in training [Altman, 1992]. If $\{x_n\}$ is a training set of size $n$, then we find the distance between a new pattern $x_i$ and each pattern in the training set. The new pattern is classified depending on the majority of class labels among the closest $k$.

Bellet et al. [2015] review a number of metric learning algorithms, in addition to a suite of algorithms for augmenting and improving the outlined classification techniques. Methods include the fundamental MMC method [Xing et al., 2003] and the NCA method [Goldberger et al., 2005], both of which follow the procedure of establishing an objective function and then attempting to optimize the metric with respect to the side information available (input data). That being said, the Large-Margin Nearest Neighbor (LMNN) metric learning algorithm [Weinberger et al., 2009] is a metric learning design that generalizes well to most situations. It is also easily extendable, making it a key component in the development of many other, more complex designs. Many of the more advanced methods discussed by Bellet et al. [2015] extend the original LMNN design in some fashion. Simply put, the goal of LMNN is to bring things that are similar closer together while pushing things that are different further away from one another; for an example, see Figure 3.22. This is done via construction of an objective function with respect to the metric distance (implemented in fit.astro.vsa.utilities.ml.metriclearning), given as equation 3.21:

$$\min_{M \in S^d_+, \; \xi \geq 0} \; (1 - \mu) \sum_{(x_i, x_j) \in S_{lmnn}} d_M(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk} \quad \text{s.t.}$$
$$d_M(x_i, x_k) - d_M(x_i, x_j) \geq 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k), \quad (3.21)$$

where $d_M(x_i, x_j) = \|x_i - x_j\|_M$ and $\mu \in [0, 1]$ weights the pull and push terms. In practice, the optimization is performed via the decomposition $M = L^T L$, a replacement of the inequality constraint (i.e., slack function) with the hinge loss function ($[z]_+ = \max(z, 0)$), and an initialization of the metric ($M = I$). This limitation to those points within some limited neighborhood distance allows for a more rapid computation of the optimal metric, with little to no loss of performance (the points that are important are the points that are used). The resulting transformation is given as the following objective function:

$$\varepsilon(L) = (1 - \mu) \sum_{ij} \eta_{ij} \|L(x_i - x_j)\|^2 + \mu \sum_{ijk} \eta_{ij} (1 - y_{ik}) \left[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_k)\|^2 \right]_+, \quad (3.22)$$

where $\eta_{ij} = 1$ when $x_i$ and $x_j$ are in the same neighborhood and $y_{ik} = 1$ when $x_i$ and $x_k$ are of the same class type (so the factor $(1 - y_{ik})$ activates only for differently labeled pairs). Optimization occurs via gradient descent (equation 3.23), where the derivative of the objective function is found and used to iterate the solution:

$$L^{(t+1)} = L^{(t)} - \beta \frac{\partial \varepsilon(L)}{\partial L}. \quad (3.23)$$

A full derivation and algorithm design can be found in Weinberger et al. [2009] (implemented in fit.astro.vsa.utilities.ml.metriclearning.lmnn).

The LMNN algorithm is the basis on which a large number of proposed metric learning algorithms have been constructed. These additional algorithms introduce additions to the objective function or changes to the training and design of the classifier to provide additional flexibility. For example, the S&J metric learning algorithm [Schultz and Joachims, 2004] introduces a regularization term as part
of the loss function for the purpose of favoring a lower-complexity metric. Kernel-LMNN [Chatpatanasiri et al., 2008] and multi-metric LMNN [Weinberger and Saul, 2008] learn a kernel transformation and multiple metrics (via tree partitioning), respectively, to generate a much more complex decision space. Bottou and Bousquet [2008] develop a distributed/stochastic version by estimating an overall metric $M$ by training on smaller segments of a larger training set via averaging or EWMA, depending on whether all the data are available or whether the training process is continuous. Kedem et al. [2012] develop gradient-boosted LMNN, a generalization of LMNN that learns a distance in a nonlinear projection space defined by some transformation based on gradient-boosted regression trees. Likewise, Parameswaran and Weinberger [2010] reformulate LMNN to learn a shared metric based on a given number of $T$ tasks (different data sets). These formulations and extensions allow the underpinning design of LMNN to be extensible to many applications, and while they are mentioned here as examples, the designs we present can of course be modified with these or many of the other designs [Bellet et al., 2015], based on the needs of the task.

We address the following classification problem: given a set of expertly labeled side data containing $C$ different classes, where measurements can be made on the classes in question to extract a set of features for each observation, how do we define a distance metric that optimizes the misclassification rate? As discussed, we have identified a number of features and feature spaces that may provide utility in discriminating between various types of stellar variables. How do we combine this information and generate a decision space, or rather, how do we define the distance $d_{ij} = (x_i - x_j)' M (x_i - x_j)$ when $x_i$ contains two matrices (SSMM or DF in our case)?
Specifically, we attempt to construct a distance metric based on multiple attributes of different dimensions (e.g., $\mathbb{R}^{m \times n}$ and $\mathbb{R}^{m \times 1}$). To respond to this challenge, we investigate the utility of multi-view learning. For our purposes here, we specify each individual measurement as the feature and the individual extractions or representations of the underlying measurement as the view. As an example, if provided the color of a variable star in ugriz, the individual measurements of $u - g$ or $r - i$ shall be referred to here as the features, but the collective set of colors is the view.

Xu et al. [2013] review multi-view learning and outline some basic definitions. Multi-view learning treats the individual views separately but also provides some functionality for joint learning, where the importance of each view is dependent on the others. As an alternative to multi-view learning, the multiple views could be transformed into a single view, usually via concatenation. The costs and benefits of single-view versus multi-view learning are discussed by Xu et al. [2013] and are beyond the scope of this discussion. Likewise, classifier fusion [Tax and Duin, 2001, Tax and Muller, 2003, Kittler et al., 1998] could be viewed as an alternative to multi-view learning. Here each view would be independently learned and result in an independent classification algorithm. The results of the set of these classifiers are combined (mixing of posterior probability) to result in a singular estimate of classification/label. This is similar to the operation of a random forest classifier; that is, results from multiple individual trees combine to form a joint estimate. The single-view learning with concatenation, multi-view learning, and classifier fusion designs can be differentiated by when the joining of the views is considered: before, during, or after training.

Multi-view learning can be roughly divided into three topic areas: (1) co-training, (2) multiple-kernel learning, and (3) subspace learning.
Each method attempts to consider all views during the training process. Multiple-kernel learning algorithms attempt to exploit kernels that naturally correspond to different views and combine kernels either linearly or nonlinearly to improve learning performance [Gönen and Alpaydın, 2011]. Subspace learning uses canonical correlation analysis (CCA) or a similar method to generate an optimal latent representation of two views, which can be trained on directly. This CCA method can be performed multiple times for the case in which many views exist; it also frequently results in a dimensionality that is lower than the original space [Hotelling, 1936, Akaho, 2006, Zhu et al., 2012]. This work will focus on the method of co-training, specifically metric co-training. Large-margin multi-view metric learning (LM³L) [Hu et al., 2014, 2018] is an example of metric co-training; the designed objective function simultaneously minimizes the objective of each individual view as well as the difference between view distances. LM³L is reviewed here, allowing us to establish some basic definitions regarding multi-view learning for the general astronomical community.
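As a preview of the formalism that follows, the weighted multi-view distance at the heart of LM³L (a sum of per-view quadratic-form distances, each with its own metric and weight) can be sketched as below. This is an illustrative sketch with a hypothetical function name; the weights and per-view metrics are assumed given, whereas LM³L learns both.

```python
def multiview_distance(xi_views, xj_views, metrics, weights):
    """Weighted sum of per-view metric distances.

    Each view k contributes w_k * (x_i^k - x_j^k)^T M_k (x_i^k - x_j^k);
    views may have different dimensionalities, each with its own metric M_k.
    """
    total = 0.0
    for xi, xj, M, w in zip(xi_views, xj_views, metrics, weights):
        d = [a - b for a, b in zip(xi, xj)]
        quad = sum(d[r] * M[r][c] * d[c]
                   for r in range(len(d)) for c in range(len(d)))
        total += w * quad
    return total
```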
The multi-view metric distance is defined as:

$$d_M(x_i, x_j) = \sum_{k=1}^{K} w_k \, (x_i^k - x_j^k)^T M_k (x_i^k - x_j^k), \quad (3.24)$$

where $K$ is the number of views and $x_i^k$ is the $i$th observation of the $k$th view for a given input. A weight of importance $w_k$ is estimated for each view via our co-training, as is the metric for the view, $M_k$. Note that each view can have a different dimensionality, as each $M_k$ is uniquely defined for the view for which it was trained. It is apparent that the distance is some weighted average of the individual view distances. The objective function for LM³L [Hu et al., 2014, 2018] is defined as:

$$\min_{M_1, \dots, M_K} J = \sum_{k=1}^{K} w_k^p I_k + \lambda \sum_{k,l=1, \, k \neq l}^{K} \sum_{i,j} \left( d_{M_k}(x_i^k, x_j^k) - d_{M_l}(x_i^l, x_j^l) \right)^2, \quad \text{s.t. } \sum_{k=1}^{K} w_k = 1, \; w_k \geq 0, \quad (3.25)$$

where $I_k$ is the large-margin objective for the individual view. The first term optimizes each view in isolation, while the second penalizes disagreement between the pairwise distances computed in different views. The optimization is constrained such that each metric must remain positive semi-definite, $M_k \succeq$
0, which can be slow, depending on the methodology used. Hu et al. [2014] transform the metric $M_k$, following Weinberger et al. [2009], via the decomposition $M = L^T L$. This allows for unconstrained optimization of the objective function with respect to the decomposed matrix $L_k$; the matrix $L_k$ can then be used to generate an appropriate metric $M_k$. The gradient of the LM³L optimization function can be shown as:

$$\frac{\partial J}{\partial L_k} = 2 L_k \left[ w_k^p \sum_{i,j} y_{ij} \, h'[z] \, C_{ij}^k + \lambda \sum_{l=1, \, l \neq k}^{K} \sum_{i,j} \left( d_{M_k}(x_i^k, x_j^k) - d_{M_l}(x_i^l, x_j^l) \right) C_{ij}^k \right], \quad (3.27)$$

where $C_{ij}^k = (x_i^k - x_j^k)(x_i^k - x_j^k)^T$ is the outer product of the differences and $h'[z]$ is the derivative of the hinge loss function. The algorithm operates as a two-step process (alternating optimization) between the optimization of the decomposed metrics $L_k$ and the weighting between the views $w_k$. First, the weights are fixed, $w = [w_1, w_2, \dots, w_K]$, and the metrics $M_k$ are updated. The iterative update to the $L_k$ estimate is generated via the gradient for each view:

$$L_k^{(t+1)} = L_k^{(t)} - \beta \frac{\partial J}{\partial L_k}. \quad (3.28)$$

Second, the metrics $M_k$ are fixed with the updated values, and the individual weights, $w = [w_1, w_2, \dots, w_K]$, are estimated. To estimate the update for the weights, a Lagrange function is constructed in equation 3.29:

$$La(w, \eta) = \sum_{k=1}^{K} w_k^p I_k + \lambda \sum_{k,l=1, \, k \neq l}^{K} \sum_{i,j} \left( d_{M_k}(x_i^k, x_j^k) - d_{M_l}(x_i^l, x_j^l) \right)^2 - \eta \left( \sum_{k=1}^{K} w_k - 1 \right). \quad (3.29)$$

The following chapter was originally published in New Astronomy, 52, 35–47. The procedure outlined in this paper follows the standard philosophy for the generation of a supervised pattern classification algorithm as professed in Duda et al. [2012] and Hastie et al. [2004]: exploratory data analysis, training and testing of supervised classifiers, comparison of classifiers in terms of performance, and application of the classifier. Our training data are derived from a set of three well-known variable star surveys: the ASAS survey [Pojmanski et al., 2005], the Hipparcos survey [Perryman et al., 1997], and the OGLE dataset [Udalski et al., 2002].
Data used for this study must meet a number of criteria:

1. Each star shall have differential photometric data in the ugriz system.
2. Each star shall have variability in the optical channel (band) that exceeds some fixed threshold with respect to the error in amplitude measurement.
3. Each star shall have a consistent class label, should multiple surveys address the same star.

These requirements reduce the total training set down to 2,054 datasets with 32 unique class labels. The features extracted are based on Fourier frequency domain coefficients [Deb and Singh, 2009], statistics associated with the time domain space, and differential photometric metrics; see Richards et al. [2012] for a table of all 68 features with descriptions. The 32 unique class labels can be further generalized into four main groups: eruptive, multi-star, pulsating, and "other" [Debosscher, 2009]. The breakdown of characterizations for the star classes follows these classifications:

• Pulsating
  – Giants: Mira, Semireg PV, RV Tauri, Pop. II Cepheid, Multi. Mode Cepheid
  – RR Lyrae: FO, FM, and DM
  – "Others": Delta Scuti, Lambda Bootis, Beta Cephei, Slowly Pulsating B, Gamma Doradus, SX Phe, Pulsating Be
• Erupting: Wolf-Rayet, Chemically Peculiar, Per. Var. SG, Herbig AE/BE, S Doradus, RCB, and Classical T-Tauri
• Multi-Star: Ellipsoidal, Beta Persei, Beta Lyrae, W Ursae Maj.
• Other: Weak-Line T-Tauri, SARG B, SARG A, LSP, RS CVn

The a priori distribution of stellar classes is given in Table 4.1 for the broad classes and in Table 4.2 for the unique classes.

Table 4.1: Broad Classification of Variable Types in the Training and Testing Dataset

  Type        Count   % Dist
  Multi-Star  514     0.25
  Other       135     0.07
  Pulsating   1179    0.57
  Erupting    226     0.11

Table 4.2: Unique Classification of Variable Types in the Training and Testing Dataset

  Class Type                % Dist   Class Type               % Dist
  a. Mira                   8.0%     m. Slowly Puls. B        1.5%
  b1. Semireg PV            4.9%     n. Gamma Doradus         1.4%
  b2. SARG A                0.7%     o. Pulsating Be          2.4%
  b3. SARG B                1.4%     p. Per. Var. SG          2.7%
  b4. LSP                   2.6%     q. Chem. Peculiar        3.7%
  c. RV Tauri               1.2%     r. Wolf-Rayet            2.0%
  d. Classical Cepheid      9.9%     r1. RCB                  0.6%
  e. Pop. II Cepheid        1.3%     s1. Class. T Tauri       0.6%
  f. Multi. Mode Cepheid    4.8%     s2. Weak-line T Tauri    1.0%
  g. RR Lyrae FM            7.2%     s3. RS CVn               0.8%
  h. RR Lyrae FO            1.9%     t. Herbig AE/BE          1.1%
  i. RR Lyrae DM            2.9%     u. S Doradus             0.3%
  j. Delta Scuti            6.5%     v. Ellipsoidal           0.6%
  j1. SX Phe                0.3%     w. Beta Persei           8.7%
  k. Lambda Bootis          0.6%     x. Beta Lyrae            9.8%
  l. Beta Cephei            2.7%     y. W Ursae Maj.          5.9%

It has been shown [Rifkin and Klautau, 2004] that how the classification of a multi-class problem is handled can affect the performance of the classifier: whether the classifier is constructed to process all 32 unique classes at the same time, whether 32 different classifiers (detectors) are trained individually and the results are combined after application, or whether a staged approach is best, where a classifier is trained on the four broad classes first and then a secondary classifier is trained on the unique class labels in each broad class [Debosscher, 2009]. The a priori distribution of classes, the number of features to use, and the number of samples in the training set are key factors in determining which classification procedure to use. This dependence is often best generalized as the "curse of dimensionality" [Bellman, 1961], a set of problems that arise in machine learning that are tied to attempting to quantify a signature pattern for a given class, when the combination of a low number of training samples and high feature dimensionality results in a sparsity of data. Increasing sparsity results in a number of performance problems with the classifier, most of which amount to decreased generality (over-trained classifier) and decreased performance (low precision or high false alarm rate). Various procedures have been developed to address the curse of dimensionality; most often, some form of dimensionality reduction technique is implemented or a general reframing of the classification problem is performed.
For this effort, a reframing of the classification problem will be performed to address these issues.

Prior to the generation of the supervised classification algorithm, an analysis of the training dataset is performed. This exploratory data analysis [EDA, Tukey, 1977] is used here to understand the class separability prior to training and to help the developer gain some insight into what should be expected in terms of performance of the final product. For example, if during the course of the EDA it is found that the classes are linearly separable in the given dimensions using the training data, then we would expect a high-performing classifier to be possible. Likewise, initial EDA can be useful in understanding the distribution of the classes in the given feature space, answering questions like: Are the class distributions multi-dimensional Gaussian? Do the class distributions have erratic shapes? Are they multi-modal? Not all classifiers are good for all situations, and often an initial qualitative EDA can help narrow down the window of which classifiers should be investigated and provide additional intuition to the analyst.

Principal Component Analysis (PCA) is one of many methods [Duda et al., 2012], and often the one most cited, when EDA of multi-dimensional data is being performed. Via orthogonal transformation, PCA rotates the feature space into a new representation where the feature dimensions are organized such that the first dimension (the principal component) has the largest possible variance, given the feature space.
This version of PCA is the simplest and most straightforward; there are numerous variants, all of which attempt a similar maximization process (e.g., of variance, of correlation, of between-group variance) but may also employ an additional transformation (e.g., manifold mapping, use of the "kernel trick", etc.). Using the broad categories defined for the variable star populations, PCA is performed in R using the FactoMineR package [Lê et al., 2008], and the first two components are plotted (see Figure 4.1).

Figure 4.1: Left: PCA applied to the ASAS+Hipp+OGLE dataset, with the broad class labels identified and the first two principal components plotted. Right: Only the stars classified as pulsating are highlighted.

The PCA transformation is not enough to separate out the classes; however, the graphical representation of the data does provide some additional insight about the feature space and the distribution of classes. The eruptive and multi-star populations appear to have a single mode in the dimensions presented in Figure 4.1, while the pulsating and the "other" categories appear to be much more irregular in shape. Further analysis addressing just the pulsating class shows that the distribution of stars with this label is spread across the whole of the feature space (Figure 4.1). In this representation of the feature space there is significant overlap across all classes. Even if other methods of dimensionality reduction were implemented, for example Supervised-PCA [Bair et al., 2006], linear separation of classes without dimensional transformation is not possible. Application of SPCA results in Figure 4.2, and is also provided in movie form as digitally accessible media.
https://github.com/kjohnston82/LINEARSupervisedClassification/AohSPCA.mp4

Figure 4.2: SPCA applied to the ASAS+Hipp+OGLE dataset (legend: Pulsating, Multi-Star, Unknown, Eruptive)

This non-Gaussian, non-linearly separable class space requires further transformation to improve the separation of classes, or a classifier which performs said mapping into a space where the classes have improved separability. Four classifiers are briefly discussed which address these needs. All algorithms are implemented in the R language, version 3.1.2 (2014-10-31) "Pumpkin Helmet", and operations are run on the x86_64-w64-mingw32/x64 (64-bit) platform. Four classifiers are initially investigated: k-Nearest Neighbor (k-NN), support vector machine (SVM), random forest (RF), and multi-layer perceptron (MLP). The k-Nearest Neighbor algorithm implemented is based on the k-NN algorithm outlined by Duda et al. [2012] and Altman [1992], with allowance for distance measurements using both the L1 (taxicab) and L2 (Euclidean) norms (see Equation 4.1):

‖x‖_p = (|x_1|^p + |x_2|^p + ... + |x_n|^p)^(1/p),  p = 1, 2, ..., ∞.  (4.1)

The testing set is used to determine both the optimal distance method and the k value, i.e., the number of nearest neighbors to count. A number of SVM packages exist [Karatzoglou et al., 2005]; the e1071 package [Dimitriadou et al., 2008] is used in this study and was the first implementation of SVM in R. It has been shown to perform well and contains a number of additional SVM utilities beyond the algorithm trainer that make it an ideal baseline SVM algorithm for performance testing. SVM decision boundaries are hyperplanes, linear cuts that split the feature space into two sections; the optimal hyperplane is the one that has the largest distance to the nearest training data (i.e., maximum margin). Various implementations of the original algorithm exist, including the Kernel SVM [Boser et al., 1992] used here for this study with the Gaussian kernel.
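Equation (4.1) can be transcribed directly, and the resulting L^p distance dropped into a minimal k-NN vote. This is an illustrative Python sketch (the dissertation's k-NN is implemented in R); the helper names are mine.

```python
# Direct transcription of Equation (4.1): the L^p norm, with p=1 giving the
# taxicab distance and p=2 the Euclidean distance used by the k-NN classifier.
def lp_norm(x, p):
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def knn_predict(X, y, query, k=3, p=2):
    # Rank training points by L^p distance to the query, vote among the k nearest.
    order = sorted(range(len(X)),
                   key=lambda i: lp_norm([a - b for a, b in zip(X[i], query)], p))
    votes = [y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

print(lp_norm([3, 4], 2))  # 5.0 (Euclidean)
print(lp_norm([3, 4], 1))  # 7.0 (taxicab)
```

As in the text, both k and p become tuning parameters to be selected against the testing set.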
KSVM uses the so-called "kernel trick" to project the original feature space into a higher dimension, resulting in hyperplane decision boundaries that are non-linear in the original space, a beneficial functionality should one find that the classes of interest are not linearly separable. The multi-layer perceptron supervised learning algorithm (MLP) falls into the family of neural network classifiers. The classifier can be simply described as layers (stages) of perceptrons (nodes), where each perceptron performs a different transformation on the same dataset. These perceptrons often employ simple transformations (i.e., logit, sigmoid, etc.) to go from the original input feature space into a set of posterior probabilities. The construction of these layers and transformations is beyond the scope of this article; for more information on neural networks, backpropagation, error minimization, and the design of the classifier, the reader is invited to review such texts as Rumelhart et al. [1986]. This study makes use of the R library RSNNS for the construction and analysis of the MLP classifier used; see Bergmeir and Benítez [2012]. Lastly, random forests are the conglomeration of sets of classification and regression trees (CARTs). The CART algorithm, made popular by Breiman et al. [1984], generates decision spaces by segmenting the feature space dimension by dimension.
Given an initial training set, the original CART is trained such that each decision made maximally increases the purity of the two resulting populations. Each subsequent node either similarly divides the received population into two new populations with improved class purity, or is a terminal node, where no further splits are made and instead a class estimate is provided. A detailed discussion of how the CART algorithm is trained, and of the various impurity measures that can be used in the decision-making process, is given in Breiman et al. [1984] as well as in other standard pattern classification texts (Hastie et al., 2004; Duda et al., 2012). Random forests are the conglomeration of these CART classification algorithms, trained on variations of the same training set [Breiman, 2001]. This ensemble classifier constructs a set of CART algorithms, each one trained on a reduction of the original training set; this variation results in each CART algorithm in the set being slightly different. Given a new observed pattern applied to the set of CART classifiers, a set of decisions is generated. The Random Forest classifier combines these estimated class labels to generate a unified class estimate. This study makes use of the randomForest package in R; see Liaw and Wiener [2002]. The training of all four classifier types proceeds with roughly the same procedure. Following the one-vs.-all methodology for multi-class classification, a class type of interest is identified as either broad or unique, and the original training set is split equally into a training set and a testing set, with the a priori population distributions approximately equal to the population distribution of the combined training set. Adjustable parameters for each classifier are identified (the RF number of trees, the Kernel-SVM kernel spread, the k-NN k and p values, and the MLP number of units in the hidden layers), and then the classifier is initially trained and tested against the testing population.
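The CART split criterion described above (choosing the cut that maximally increases purity) can be sketched with Gini impurity, one of the impurity measures treated in Breiman et al. [1984]. This is an illustrative Python sketch, not the randomForest internals; function names are mine.

```python
# One CART split decision: scan candidate thresholds dimension by dimension
# and keep the cut minimizing the weighted Gini impurity of the two children.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    best = (None, None, float("inf"))   # (dimension, threshold, weighted impurity)
    for d in range(len(X[0])):
        for t in sorted({x[d] for x in X}):
            left = [yi for xi, yi in zip(X, y) if xi[d] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[d] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (d, t, score)
    return best

X = [[1.0, 9.0], [2.0, 8.0], [8.0, 1.0], [9.0, 2.0]]
y = ["a", "a", "b", "b"]
print(best_split(X, y))   # a clean cut on dimension 0: weighted impurity 0.0
```

A full CART recurses on each child until a purity or depth criterion stops it; the random forest then repeats this over resampled training sets with a random feature subset (mtry) considered at each split.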
Parameters are then adjusted, subsequent classifiers are trained, and the misclassification error is found as a function of the parameter adjustments. Those parameters resulting in a trained classifier with a minimal amount of error are implemented. For each classifier, two quantifications of performance are generated: a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve. Fawcett [2006] outlines both and discusses the common uses of each. Both concepts plot two performance statistics for a given classification algorithm against a changing threshold value, which for this study is a critical probability that the posterior probability of the class of interest (the target stellar variable) is compared against. These curves can be generated when the classifier is cast as a "two-class" problem, where one of the classes is the target (class of interest) while the other is not. For any two-class classifier the metrics highlighted here can be generated, and they are a function of the decision space selected by the analyst. Frequently the acceptance threshold, i.e., the requirement that the hypothesized class have a posterior probability greater than some λ, is selected based on the errors of the classifier. Many generic classification algorithms are designed such that the false positive (fp) rate and 1 − true positive (tp) rate are both minimized. Often this practice is ideal; however, the problem faced in the instances addressed in this article requires additional considerations. We note two points:

1. When addressing the unique class types, there are a number of stellar variable populations which are much smaller than others. This so-called class imbalance has been shown [Fawcett, 2006] to cause problems with performance analysis if not handled correctly. Some classification algorithms adjust for this imbalance, but often additional considerations must be made, specifically when reporting performance metrics.

2.
Minimization of both errors, or minimum-error-rate classification, is often based on the zero-one loss function. In this case, it is assumed that the cost of a false positive (declaring a detection when there was none) is the same as that of a false negative (declaring no detection when there was one). If the goal of this study is to produce a classifier that is able to classify new stars from very large surveys, some of which contain millions of stars, the cost of returning a large number of false alarms is much higher than the cost of missing some stars in some classes. Especially when class separation is small, if the application of the classifier results in significant false alarms, the inundation of an analyst with bad decisions will likely result in a general distrust of the classifier algorithm.

The ROC curve expresses the adjustment of the errors as a function of the decision criterion. Likewise, the PR curve expresses the adjustment of precision (the percentage of true positives out of all positive decisions made) and recall (the true positive rate) as a function of the decision criterion. By sliding along the ROC or PR curve, we can change the performance of the classifier. Note that increasing the true positive rate causes an increase in the false positive rate as well (and vice versa). A common practice is to fix one of the metrics, such as the false positive rate, across all classifiers used [Scharf, 1991]. Similar to the ROC curve, the PR curve demonstrates how performance varies between precision and recall for a given value of the threshold. It is apparent that the PR and ROC curves are related [Davis and Goadrich, 2006]: both have a true-positive rate as an axis (TP rate and recall are equivalent), both are functions of the threshold used in the determination of the estimated class for a new pattern (discrimination), and both are based on the confusion matrix and the associated performance metrics.
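The curve construction just described can be sketched directly: each threshold on the posterior probability yields one (FPR, TPR) operating point for the ROC curve and one (recall, precision) point for the PR curve. This is an illustrative Python stand-in (the dissertation's analysis is in R); the scores and labels below are toy values.

```python
# ROC and PR operating points as a function of the decision threshold,
# following Fawcett [2006]: sweep the threshold over the classifier scores.
def roc_pr_points(scores, labels, thresholds):
    points = []
    P = sum(labels)            # number of positives
    N = len(labels) - P        # number of negatives
    for t in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        tpr = tp / P                                     # recall
        fpr = fp / N
        precision = tp / (tp + fp) if tp + fp else 1.0
        points.append((t, fpr, tpr, precision))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # toy posterior probabilities
labels = [1, 1, 0, 1, 0, 0]               # 1 = class of interest
for t, fpr, tpr, prec in roc_pr_points(scores, labels, [0.5, 0.2]):
    print(t, fpr, tpr, prec)
```

Sweeping a dense grid of thresholds and plotting (fpr, tpr) gives the ROC curve; plotting (tpr, precision) gives the PR curve.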
Thus fixing the false alarm rate fixes not only the true positive rate, but also the precision of the classifier. If the interest were a general comparison of classifiers, instead of selecting a specific performance level, Fawcett [2006] suggests that the computation of the Area Under the Curve (AUC) reduces either the PR or ROC curve to a single "performance" estimate that represents the classifier as a whole. The ROC-AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. The PR-AUC of a classifier is roughly the mean precision of the classifier. Both ROC and PR curves should be considered when evaluating a classifier [Davis and Goadrich, 2006], especially when class imbalances exist. For this study, the best performing classifier will be the one that maximizes both the ROC-AUC and the PR-AUC. Likewise, when the final performance of the classifier is proposed, the false positive rate and precision will be reported and used to make assumptions about the decisions made by the classification algorithm. Based on the foundation of performance analysis methods discussed (ROC and PR curves, and AUC), the study analyzes classification algorithms applied to both the broad and unique (individual) class labels.

4.3.2.1 Broad Classes — Random Forest

Initially an attempt was made to adjust both the number of trees (ntree) and the number of variables randomly sampled as candidates at each split (mtry). Based on the recommendation of Breiman et al. [1984], mtry was set to √M, where M is the number of features. The parameter ntree was set to 100, based on the work performed by Debosscher [2009]. Classifiers were then generated based on the training sample, and the testing set was used to generate the ROC and PR AUC for each one-vs.-all classifier.
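The rank-statistic interpretation of ROC-AUC stated above (the probability that a random positive outscores a random negative) can be computed directly from that definition. An illustrative Python sketch with toy scores; names are mine.

```python
# ROC-AUC from its rank interpretation: the fraction of (positive, negative)
# pairs in which the positive instance receives the higher score (ties half).
def roc_auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.7, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0, 0]))  # 8/9
```

Numerically integrating the ROC curve (e.g., by the trapezoid rule over swept thresholds) gives the same value; the pairwise form above makes the probabilistic reading explicit.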
The associated curves are given in Appendix A; the resulting AUC estimates are in Table 4.3.

Table 4.3: ROC/PR AUC estimates based on training and testing for the Random Forest classifier.

AUC   Pulsating  Eruptive  Multi-Star  Other
ROC   0.971      0.959     0.992       0.961
PR    0.979      0.788     0.986       0.800

4.3.2.2 Broad Classes — Kernel SVM

Instead of using the "hard" class estimates common with SVM usage, the "soft" estimates, i.e., posterior probabilities, are used. This allows for the thresholding necessary to construct the PR and ROC curves. Kernel spreads of 0.001, 0.01, and 0.1 were tested (set as the variable gamma in R); the associated PR and ROC curves are given in Appendix A. It was found that 0.1 was optimal for the feature space (using Gaussian kernels). The resulting AUC estimates are in Table 4.4.

Table 4.4: ROC/PR AUC estimates based on training and testing for the Kernel SVM classifier.

AUC   Pulsating  Eruptive  Multi-Star  Other
ROC   0.938      0.903     0.979       0.954
PR    0.952      0.617     0.968       0.694

4.3.2.3 Broad Classes — k-NN

It was found that, for the training set used, increasing performance was gained with increasing values of k nearest neighbors. Gains in performance were limited after k = 4, and a value of k = 10 was selected for training. The order of the polynomial defined in the generation of distance (via the L^p norm) was varied between 1 and 3, with decreasing performance found for p > 3. The associated PR and ROC curves were generated for values of p < 4 and are given in Appendix A, along with the resulting AUC estimates.

4.3.2.4 Broad Classes — MLP

There are two variables associated with MLP algorithm training: the number of units in the hidden layers (size) and the number of parameters for the learning function to use (learnParam). It was found that for this dataset the learnParam value had little effect on the performance of the classifier, and it was taken to be 0.1 for implementation here.
The variable size did have an effect; an initial study of values between 4 and 18 demonstrated that the best performance occurred between 4 and 8. PR and ROC curves for these values were generated and are in Appendix A; the resulting AUC estimates for size values of 4, 6, and 8 are in Table 4.6.

Table 4.6: ROC/PR AUC estimates based on training and testing for the MLP classifier.

size  AUC  Pulsating  Eruptive  Multi-Star  Other
4     ROC  0.928      0.694     0.914      0.585
      PR   0.916      0.120     0.869      0.183
6     ROC  0.933      0.751     0.888      0.473
      PR   0.909      0.139     0.797      0.123
8     ROC  0.920      0.706     0.914      0.529
      PR   0.903      0.175     0.854      0.159

Analysis of the broad classes provided insight into the potential of a staged classifier. The performance of the broad classification algorithms does not suggest that the supervised variable star classification problem would benefit from a staged design. The RF classifier performed best across all broad classes and against all other classifiers, but still had significant error; had the broad classes separated perfectly, further analysis of the staged design would have been warranted. Instead, two-class classifiers designed based on the unique classes are explored. Similar to the broad classification methodology, the training sample is separated into a training set and a testing set for each unique class type for training in a two-class classifier. Again, the testing data is used to minimize the misclassification error and find optimal parameters for each of the classifiers. Each classifier is then optimal for the particular class of interest. With the change of design, a change of performance analysis is also necessary. With nearly ten times the number of classes, a comparison of ROC and PR curves per classifier type and per class type requires a methodology that allows the information to be plotted on a single plot (direct comparison). Keeping with the discussion outlined by Davis and Goadrich [2006], we plot ROC vs. PR AUC for each classifier (Figure 4.3 as an example).

Figure 4.3: ROC vs.
PR AUC plot for the Multi-Layer Perceptron classifier; generic class types (Eruptive, Giants, Cepheids, RRlyr, Other Pulsing, Multi-star, and other) are colored. The line y = x is plotted for reference (dashed line).

The set of these performance analysis graphs is in Appendix A. A comparison of the general performance of each classifier can be derived from the generation of the AUC for each of the performance curves. Here, the quantification of general performance for a classifier is given either as the mean precision across all class types or via non-parametric analysis of the AUC. The non-parametric analysis used is compiled as follows: for each class of stars, the average performance across classifiers is found; if for a classifier the performance is greater than the mean, an assignment of +1 is given, else −1. Over all classes, the summation of assignments is taken, as given in Table 4.7.

Table 4.7: Performance analysis of individual classifiers

                 ROC-AUC            PR-AUC
            Mean   Non-Para.   Mean   Non-Para.
KNN-Poly-1  0.884      2       0.530      8
SVM         0.905     -4       0.407    -26
MLP         0.894      2       0.470     -8
RF          0.948     22       0.595     14

It is apparent that the RF classifier out-performs the other three classification algorithms, using both the mean of precision and the non-parametric comparison of the AUC statistics. The plot comparing ROC-AUC and PR-AUC for the Random Forest classifier is presented in Figure 4.4. Based on Figure 4.4, it is observed that star populations of similar class types do not necessarily cluster together. Additionally, it is apparent that the original size of the population in the training set, while having some effect on the ROC-AUC, has a major effect on the resulting PR-AUC. Figure 4.4b demonstrates that for those classes with an initial population of more than 55 (an empirically chosen value), the precision is expected to be greater than 70%.
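The non-parametric scoring compiled in Table 4.7 can be sketched directly from its description: per class, each classifier earns +1 if its AUC beats the across-classifier mean for that class, else −1, and scores are summed over classes. Illustrative Python with toy AUC values (the names and numbers below are mine, not the dissertation's results).

```python
# Non-parametric comparison of classifiers: +1 per class when a classifier's
# AUC exceeds the mean AUC of all classifiers for that class, -1 otherwise.
def nonparametric_score(auc_by_classifier):
    # auc_by_classifier: {name: [auc for class 0, auc for class 1, ...]}
    names = list(auc_by_classifier)
    n_classes = len(next(iter(auc_by_classifier.values())))
    totals = {name: 0 for name in names}
    for j in range(n_classes):
        mean_j = sum(auc_by_classifier[n][j] for n in names) / len(names)
        for n in names:
            totals[n] += 1 if auc_by_classifier[n][j] > mean_j else -1
    return totals

aucs = {"RF":  [0.95, 0.90, 0.99],
        "SVM": [0.90, 0.85, 0.95],
        "kNN": [0.88, 0.92, 0.93]}
print(nonparametric_score(aucs))
```

The summed score rewards classifiers that are consistently above average across many classes, complementing the mean-AUC column, which a few easy classes can dominate.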
Surprisingly, though, for classes with an initial population of 55 or less, the limits of precision are less predictable and in fact appear to be random with respect to the training size of the class of interest.

Figure 4.4: ROC vs. PR AUC plot for the Random Forest classifier; generic class types (Eruptive, Giants, Cepheids, RRlyr, Other Pulsing, Multi-star, and other) are colored. The line y = x is plotted for reference (dashed line). Left: the breakdown per generic class type; Right: the difference between populations with more than 55 members in the initial training dataset.

Thus, without further training data or feature space improvements, the performance statistics graphed in Figure 4.4 are the statistics that will be used as part of the application of the classifier to the LINEAR dataset; performance statistics for the other classifiers are given in Appendix A.

In addition to the pattern classification algorithms outlined, the procedure outlined here includes the construction of a One-Class Support Vector Machine (OC-SVM) for use as an anomaly detector. The pattern classification algorithms presented and compared as part of this analysis partition the entire decision space. For the random forest, k-NN, MLP, and SVM two-class classifier algorithms, there is no consideration of deviations of patterns beyond the training set observed, i.e., absolute distance from population centers. All of the algorithms investigated consider relative distances, i.e., is the new pattern P closer to the class center of B or of A? Thus, even if an anomalous pattern is observed by a new survey, the classifier will attempt to estimate a label for the observed star based on the labels it knows.
To address this concern, a one-class support vector machine is implemented as an anomaly detection algorithm. Lee and Scott [2007] describe the design and construction of such an algorithm. Similar to the Kernel SVM discussed previously, the original dimensionality is expanded using the kernel trick (Gaussian kernels), allowing complex regions to be more accurately modeled. For the OC-SVM, the training data labels are adjusted such that all entered data is of class type one (+1). A single input pattern at the origin is artificially set as class type two (−1). The result is the lassoing, or dynamic encompassing, of known data patterns. The lasso boundary represents the division between known (previously observed) regions of feature space and unknown (not previously observed) regions. New patterns observed with feature vectors occurring in this unknown region are considered anomalies, or patterns without support, and the estimated labels returned from the supervised classification algorithms should be questioned, regardless of the associated posterior probability of the label estimate [Schölkopf et al., 2001]. The construction of the OC-SVM to be applied as part of this analysis starts with the generation of two datasets (training and testing) from the ASAS + Hipp + OGLE training data.
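The lasso idea can be sketched with scikit-learn's OneClassSVM, which implements the Schölkopf et al. [2001] formulation cited above; this is an assumption-laden stand-in (Python rather than R, and the Schölkopf rather than the Lee and Scott variant), with synthetic data, not the dissertation's pipeline.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train on "known" patterns only; the learned boundary encloses the support
# of the training distribution, and points outside it are flagged -1 (anomaly).
rng = np.random.default_rng(42)
known = rng.normal(0.0, 1.0, size=(500, 2))      # stand-in for training features

oc = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(known)
inside = int(oc.predict([[0.0, 0.0]])[0])        # +1: supported by training data
outside = int(oc.predict([[8.0, 8.0]])[0])       # -1: anomalous, no support

# Mirroring the tuning curves of Figures 4.5/4.7: sweep the nu parameter and
# record the fraction of held-out points (same parent population) rejected.
test_pts = rng.normal(0.0, 1.0, size=(500, 2))
rejected = {nu: float(np.mean(OneClassSVM(kernel="rbf", gamma=0.1, nu=nu)
                              .fit(known).predict(test_pts) == -1))
            for nu in (0.01, 0.05, 0.3)}
print(inside, outside, rejected)
```

In sklearn's parameterization, nu upper-bounds the fraction of training points treated as outliers, so the rejected fraction on held-out data grows with it; the dissertation tunes the analogous parameter to maximally accept the testing set.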
The initial training set is provided to the OC-SVM algorithm [Lee and Scott, 2007], which generates the decision space (lasso). This decision space is tested against the training dataset, and the fraction of points declared to be anomalous is plotted against the spread of the kernel used in the OC-SVM (Figure 4.5).

Figure 4.5: Fraction of anomalous points found in the training dataset as a function of the Gaussian kernel spread used in the kernel SVM

Because of the hyper-dimensionality, the OC-SVM algorithm is unable to perfectly encapsulate the training data; however, a minimum can be found and estimated. The first two principal components of the training data feature space are plotted for visual inspection (Figure 4.6), highlighting those points that were called "anomalous" based on a nu value (kernel spread) of 0.001. Less than 5% of the points are labeled as anomalies (falsely, since all training points belong to known classes).

Figure 4.6: Plot of OC-SVM results applied to training data only

Further testing is performed on the anomaly space using the second dataset generated. As both datasets originate from the same parent population, the OC-SVM algorithm parameter (nu) is tuned to a value that maximally accepts the testing points (Figure 4.7).

Figure 4.7: OC-SVM testing of the testing data

The minimum fraction was found at a nu of 0.03. The OC-SVM was applied to the LINEAR dataset with the optimal kernel spread. All 192,744 samples were processed, with 58,312 "False" (anomalous) and 134,432 "True" (expected) decisions made; i.e., 30% of the LINEAR dataset is considered anomalous based on the joint ASAS+Hipp+OGLE training dataset feature space.

4.4 Application of Supervised Classifier to LINEAR Dataset

For application to the LINEAR dataset, an RF classifier is constructed based on the training set discussed previously. The classifiers are designed using the one-vs.-all methodology, i.e., each stellar class has its own detector (i.e.,
overlap in estimated class labels is possible); therefore 32 individual two-class classifiers (detectors) are generated. The individual classification method (one-vs.-all) allows each given star to have multiple estimated labels (e.g., multiple detectors returning a positive result for the same observation). The one-vs.-all methodology also allows the training step of the classification to be more sensitive to stars that might have been under-represented in the training sample, improving the performance of the detector overall. Based on the testing performance results (ROC and PR curves) presented for the individual classifiers, the critical statistic used for the RF decision process was tuned such that a 0.5% false positive rate is expected when applied to the LINEAR dataset. In addition to the RF classifier, an OC-SVM anomaly detection algorithm was trained and used to determine whether samples from the LINEAR dataset are anomalous with respect to the joint ASAS+OGLE+Hipp dataset. Applying the RF classifier(s) and the OC-SVM algorithm to the LINEAR dataset, the following was found using a threshold setting corresponding to a false alarm rate of 0.5% (see the ROC curve analysis). Given an initial set of LINEAR data (192,744 samples), Table 4.8 was constructed based on the results of the application of the isolated one-vs.-all RF classifiers only.

Table 4.8: Initial results from the application of the RF classifier(s)

Class Type              Est. Pop   Class Type              Est. Pop
a. Mira                     3256   m. Slowly Puls. B              2
b1. Semireg PV                 7   n. Gamma Doradus            2268
b2. SARG A                  4291   o. Pulsating Be            14746
b3. SARG B                    30   p. Per. Var. SG              284
b4. LSP                       10   q. Chem. Peculiar             10
c. RV Tauri                 5642   r. Wolf-Rayet               3970
d. Classical Cepheid          31   r1. RCB                     1253
e. Pop. II Cepheid           326   s1. Class. T Tauri         17505
f. Multi. Mode Cepheid       556   s2. Weak-line T Tauri       4945
g. RR Lyrae FM             13470   s3. RS CVn                 40512
h. RR Lyrae FO              1276   t. Herbig AE/BE             1358
i. RR Lyrae DM              9800   u. S Doradus                2185
j. Delta Scuti               493   v. Ellipsoidal               132
j1. SX Phe                  9118   w. Beta Persei               481
k. Lambda Bootis              69   x. Beta Lyrae                  2
l. Beta Cephei              2378   y. W Ursae Maj.             1365

103,628 stars were not classified (~54%), and of those 11,619 were considered "anomalous". 57,848 stars were classified only once (~30%), and of those 23,397 were considered "anomalous". 31,268 stars were classified with multiple labels (~16%), and of those 23,296 were considered "anomalous". The set of stars that were both classified once and did not have anomalous patterns (34,451) is broken down by class type in Table 4.9.

Table 4.9: Initial results from the application of the RF classifier(s) and the OC-SVM anomaly detection algorithm; classes that are major returned classes (> …)

Class Type              Est. Pop  % Total   Class Type              Est. Pop  % Total
a. Mira                       15    0.04%   m. Slowly Puls. B              2   0.002%
b1. Semireg PV                 1   0.002%   n. Gamma Doradus            2268     3.8%
b2. SARG A                  1362     4.0%   o. Pulsating Be            14746    0.61%
b3. SARG B                     0       0%   p. Per. Var. SG              284    0.26%
b4. LSP                        1   0.002%   q. Chem. Peculiar             10       0%
c. RV Tauri                  538     1.6%   r. Wolf-Rayet               3970     6.2%
d. Classical Cepheid           2   0.006%   r1. RCB                     1253    0.01%
e. Pop. II Cepheid            50    0.15%   s1. Class. T Tauri         17505     5.4%
f. Multi. Mode Cepheid       286    0.83%   s2. Weak-line T Tauri       4945     3.3%
g. RR Lyrae FM              2794     8.1%   s3. RS CVn                 40512    46.6%
h. RR Lyrae FO               710     2.1%   t. Herbig AE/BE             1358    0.33%
i. RR Lyrae DM              2350     6.8%   u. S Doradus                2185     1.7%
j. Delta Scuti                 8    0.02%   v. Ellipsoidal               132    0.08%
j1. SX Phe                  1624     4.7%   w. Beta Persei               481    0.42%
k. Lambda Bootis               1   0.002%   x. Beta Lyrae                  2   0.006%
l. Beta Cephei                25    0.07%   y. W Ursae Maj.             1365     3.1%

The listing of the individual discovered populations is provided digitally. Two classes were not detected confidently in the LINEAR dataset: SARG B and Chemically Peculiar. This does not mean that these stars are not contained in the LINEAR dataset. Similarly, those stars that were not classified are not necessarily in a "new" class of stars.
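The threshold tuning used in this application (fixing the RF critical statistic so that a 0.5% false alarm rate is expected) can be sketched as a quantile of the negative-class scores. Illustrative Python with toy scores; the helper name is mine.

```python
# Pick the decision cut so that only a target fraction of known negatives
# (here 0.5%, matching the false alarm rate used in the text) score above it.
def threshold_for_fpr(negative_scores, target_fpr):
    s = sorted(negative_scores)
    idx = min(len(s) - 1, int(round((1.0 - target_fpr) * len(s))))
    return s[idx]

neg = [i / 1000.0 for i in range(1000)]   # toy scores for known negatives
t = threshold_for_fpr(neg, 0.005)
achieved = sum(1 for v in neg if v > t) / len(neg)
print(t, achieved)   # achieved false alarm rate is at or below the 0.5% target
```

In the full pipeline the negative scores come from the testing set of each one-vs.-all detector, and each detector gets its own threshold so that all 32 operate at the same fixed false alarm rate.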
There are a number of possibilities why these stars were not found in the survey, including:

https://github.com/kjohnston82/LINEARSupervisedClassification

Class Type              Precision  Est. Pop
a. Mira                      0.94        14
m. Slowly Puls. B            0.91         0
b1. Semireg PV               0.97         0
n. Gamma Doradus             0.88      1159
b2. SARG A                   0.76
r1. RCB                      0.73
s1. Class. T Tauri           0.75
s2. Weak-line T Tauri        0.75
s3. RS CVn                   0.74
i. RR Lyrae DM               0.73
u. S Doradus                 0.67
δ Scuti                      0.95         7
v. Ellipsoidal               0.78
j1. SX Phe                   0.73

This paper has demonstrated the construction and application of a supervised classification algorithm on variable star data. Such an algorithm processes observed stellar features and produces quantitative estimates of the stellar class label. Using a hand-processed (verified) dataset derived from the ASAS, OGLE, and Hipparcos surveys, initial training and testing sets were derived. The trained one-vs.-all algorithms were optimized using the testing data via minimization of the misclassification rate. From application of the trained algorithm to the testing data, performance estimates can be quantified for each one-vs.-all algorithm. The Random Forest supervised classification algorithm was found to be superior for the feature space and class space operated in. Similarly, a one-class support vector machine was trained and designed as an anomaly detector. With the classifier and anomaly detection algorithm constructed, both were applied to a set of 192,744 LINEAR data points. Of the original samples, setting the threshold of the RF classifier using a false alarm rate of 0.5%, 34,451 unique stars were classified only once in the one-vs.-all scheme and were not identified by the anomaly detection algorithm. The total population is partitioned into the individual stellar variable classes; each subset of LINEAR IDs corresponding to the matched patterns is stored in a separate file and is accessible to the reader.
While less than 18% of the LINEAR data was classified, the class labels estimated have a high probability of being the true class, based on the performance statistics generated for the classifier and the threshold applied to the classification process. Further improvement in the initial training dataset is necessary if the requirements of the supervised classification algorithm are to be met (100% classification of new data). A larger training dataset, with more representation (support), is needed to improve the class space representation used by the classifier and to reduce the size of the "anomalous" decision region. Specifically, additional examples of the under-sampled variable stars, enough to perform k-fold cross-validation, would yield improved performance and increased generality of the classifier. An improved feature space could also benefit the process, if new features were found to provide additional linear separation for certain classes, such as those presented in [Johnston and Peter, 2017]. However, additional dimensionality without reduction of superfluous features is warned against, as it may only worsen the performance issues of the classifier. Instead, investigation into the points found to be anomalous in under-sampled classes, and determination of whether they are indeed of the class reported by the classifier designed here, would be of benefit, as these points would serve not only to bolster the number of training points used in the algorithm, but also to increase the size (and support) of the individual class spaces. Implementation of these concepts, with a mindfulness of the changing performance of the supervised classification algorithm, could result in performance improvements across the class space.

Chapter 5

Novel Feature Space Implementation

A methodology for the reduction of stellar variable observations (time-domain data) into a novel feature space representation is introduced.
The proposed methodology, referred to as Slotted Symbolic Markov Modeling (SSMM), has a number of advantages over other classification approaches for stellar variables. SSMM can be applied to both folded and unfolded data. Also, it does not need time-warping for alignment of the waveforms. Given the reduction of a survey of stars into this new feature space, the problem of using prior patterns to identify newly observed patterns can be addressed via classification algorithms. These methods have two large advantages over manual classification procedures: the rate at which new data is processed depends only on the computational processing power available, and the performance of a supervised classification algorithm is quantifiable and consistent [Johnston and Peter, 2018]. The remainder of this paper is structured as follows. First, the data, prior efforts, and challenges uniquely associated with classification of stars via stellar variability are reviewed. Second, the novel methodology, SSMM, is outlined, including the feature space and signal conditioning methods used to extract the unique time-domain signatures. Third, a set of classifiers (random forest/bagged decision tree, k-nearest neighbor, and Parzen window classifier) is trained and tested on the extracted feature space using both a standardized stellar variability dataset and the LINEAR dataset. Fourth, performance statistics are generated for each classifier, and the methods are compared and contrasted. Lastly, an anomaly detection algorithm is generated using the so-called one-class Parzen window classifier and the LINEAR dataset. The result is a demonstration of the SSMM methodology as a competitive feature space reduction technique for use in supervised classification algorithms.
Many prior studies on time-domain variable star classification [Debosscher, 2009, Barclay et al., 2011, Blomme et al., 2011, Dubath et al., 2011, Pichara et al., 2012, Pichara and Protopapas, 2013, Graham et al., 2013a, Angeloni et al., 2014, Masci et al., 2014] rely on periodicity-domain feature space reductions. Debosscher [2009] and Templeton [2004] review a number of feature spaces and efforts to reduce the time-domain data, most of which implement Fourier techniques, primarily the Lomb–Scargle (L-S) method [Lomb, 1976, Scargle, 1982], to estimate the primary periodicity [Eyer and Blake, 2005, Deb and Singh, 2009].

(Lightly edited from the original paper: Johnston, K. B., & Peter, A. M. (2017). Variable star signature classification using slotted symbolic Markov modeling. New Astronomy, 50, 1–11.)

The discussion of the Slotted Symbolic Markov Modeling (SSMM) algorithm encompasses the analysis, reduction, and classification of data. The a priori distribution of class labels is roughly even for both studies; therefore, the approach uses a multi-class classifier. Should the class labels become unbalanced with additional data, other approaches are possible [Rifkin and Klautau, 2004]. Data-specific challenges, associated with astronomical time series observations, have been identified as needing to be addressed as part of the algorithm design.

Stellar variable time series data can roughly be described as passively observed time series snippets, extracted from what is a contiguous signal (star shine) over multiple nights or sets of observations. The time series signatures have the potential to change over time, and new observations increase the opportunity to observe an unstable signature over the long term. Astronomical time series data is also frequently irregular, i.e., there is often no fixed Δt that is consistent across the whole of the observations.
Even when there is a consistent observation rate, this rate is often broken up because of observational constraints. The stellar variable moniker covers a wide variety of variable types: stationary (consistently repeating identical patterns), non-stationary (patterns that increase/decrease in frequency over time), non-regular variances (variances that change over the course of time, shape changes), as well as both Fourier and non-Fourier sequences/patterns. Pure time-domain signals do not lend themselves to signature identification and pattern matching, as their domain is infinite in terms of potential discrete data (dimensionality). Not only must a feature space representation be found, but the dimensionality should not increase with increasing data.

Based on these outlined data/domain-specific challenges (continuous time series, irregular sampling, and varied signature representations), this paper develops a feature space extraction methodology that characterizes the shape of the periodic stellar variable signature. A number of methods have been demonstrated that fit this profile [Grabocka et al., 2012, Fu, 2011, Fulcher et al., 2013]; however, many of these methods focus on identifying a specific time series shape sequence in a long(er) continuous time series, and not necessarily on the differentiation between time series sequences. To address these domain-specific challenges, the following methodology outline is implemented:

1. To address the irregular sampling rate, a slotting methodology is used [Rehfeld et al., 2011]: Gaussian kernel window slotting with overlap. The slotting methodology generates estimates of amplitudes at regularized points, with the result being an up-sampled conditioned waveform.
This has been shown to be useful in the modeling and reconstruction of variability dynamics [Rehfeld and Kurths, 2014, Kovačević et al., 2015, Huijse et al., 2011], and is similar to the methodologies used to perform Piecewise Aggregate Approximation [Keogh et al., 2001].

2. To reduce the conditioned time series into a usable feature space, the amplitudes of the conditioned time series are mapped to a discrete state space based on a standardized alphabet. The result is the state space representation of the time-domain signal, and is similar to the methodologies used to perform Symbolic Aggregate Approximation [Lin et al., 2007].

3. The state space transitions are then modeled as a first-order Markov chain [Ge and Smyth, 2000], and the state transition probability matrix (Markov Matrix) is generated, a procedure unique to this study. It will be shown that a mapping of the transitions from observation to observation provides an accurate and flexible characterization of the stellar variability signature.

The Markov Matrix is vectorized, and the resulting vector is the signature pattern (feature vector) used in the classification of time-domain signals for this study. It should be noted that if the underlying signature is too sparsely sampled, the feature space transform will not capture the shapes or features of interest, as is true of any time-domain transform (e.g., Fourier methods will not capture frequency content above the Nyquist limit). Many of the transforms commonly used to combat the problems of irregular samples or low sample density make additional assumptions, such as the underlying waveform shape (e.g., Box-fitting Least Squares [Siverd et al., 2012]) or the frequency of occurrence (based on physical parameters). Many of these assumptions are oriented towards specific target detection; as such, the false alarm and missed detection rates are tied specifically to those assumptions.
A collection and comparison of these frequency sampling methods and the associated assumptions can be found in Graham et al. [2013b].

Each waveform is conditioned using the slotting resampling methodology for irregularly sampled waveforms outlined in Rehfeld et al. [2011]. The slotting procedure acts as follows:

1. A set of evenly spaced (in time) windows, with size w, is generated; the windows can be overlapping or adjacent.

2. For the observed samples within each window (slot), the difference in time between the observation point and the center of the window is computed.

3. A Gaussian kernel with width s weights the contribution of each observation; an estimate of the amplitude for the new point is generated from the weighted amplitudes of the contributing observations.

For this implementation, an overlapping slot (75% overlap) was used, meaning that the window width is larger than the distance between slot centers. This methodology is effectively kernel smoothing with slotting [Li and Racine, 2007], with a slot width of w. This w is optimized via cross-validation of the data; anecdotally, however, the median sample rate of the waveforms is often the best estimate, as such a rate would capture at least one point in each slot when applied to continuous observation data. Initial testing (ANOVA) demonstrated that the misclassification rate varied little with changes in w about the median.

Let the set of samples {y(t_n)}, n = 1, ..., N, where t_1 < t_2 < ... < t_N, be the initial time series dataset. The observed time series data is standardized (subtract the mean, divide by the standard deviation), and then the slotting procedure is applied.
If x[i] ← y(t_i), i = 1, ..., N, then the algorithm to generate the slotted time-domain data is given in Algorithm 1.

Algorithm 1 Gaussian Kernel Slotting

procedure GaussianKernelSlotting(x[i], t[i], w, λ)
    x'[i] ← (x[i] − mean(x[i])) / std(x[i])        ▷ Standardize amplitudes
    t[i] ← t[i] − min(t[i])                         ▷ Start at time origin
    slotCenters ← w : max(t[i]) + w                 ▷ Make slot locations
    timeSeriesSets ← []                             ▷ Initialize time series sets
    slotSet ← []                                    ▷ Make an empty slot set
    while i < length(slotCenters) do                ▷ Compute slots
        idx ← all t in interval [slotCenters[i] − w, slotCenters[i] + w]
        inSlotX ← x'[idx]
        inSlotT ← t[idx]
        if inSlot is empty then                     ▷ There is a gap
            if slotSet is empty then                ▷ Move to where data is
                currentPt ← find next t > slotCenters[i] + w
                i ← find last slotCenters < t[currentPt]
            else                                    ▷ Store the slotted estimates
                add slotSet to structure timeSeriesSets
                slotSet ← []
            end if
        else
            weights ← exp(−((inSlotT − slotCenters[i]) · λ)²)
            meanAmp ← sum(weights · inSlotX) / sum(weights)
            add meanAmp to the current slotSet
        end if
        i ← i + 1
    end while
end procedure

If it is assumed that the conditioned standardized waveform segments have an amplitude distribution that approximates a Gaussian distribution (which they will not, but that is irrelevant to the effort), then, using a methodology similar to Symbolic Aggregate Approximation [Lin et al., 2007, 2012], an alphabet (state space) is defined that extends between ±2σ and will encompass roughly 95% of the amplitudes observed. This need not always be the case, but the advantage of the standardization of the waveform is that, with some degree of confidence, the information from the waveform is contained roughly between ±2σ.
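The slotting idea can be illustrated with a minimal Python sketch. This is not the dissertation's MATLAB implementation: it simplifies Algorithm 1 (fixed symmetric windows, a plain Gaussian weight in the time offset, gaps handled by splitting segments), and the function and parameter names are illustrative only.

```python
import numpy as np

def gaussian_kernel_slotting(t, x, w, lam):
    """Resample an irregularly sampled series onto regular slot centers.

    t, x : observation times and amplitudes (1-D arrays)
    w    : slot half-width (slot centers spaced w apart; the 75% overlap
           of the text would use a wider window than the spacing)
    lam  : inverse-width of the Gaussian kernel weighting the observations
    Returns a list of contiguous slotted segments (gaps split segments).
    """
    x = (x - x.mean()) / x.std()           # standardize amplitudes
    t = t - t.min()                        # shift the time origin to zero
    centers = np.arange(w, t.max() + w, w)

    segments, current = [], []
    for c in centers:
        idx = (t >= c - w) & (t <= c + w)  # observations inside this slot
        if not idx.any():                  # gap: close the current segment
            if current:
                segments.append(np.array(current))
                current = []
            continue
        weights = np.exp(-((t[idx] - c) * lam) ** 2)
        current.append(np.sum(weights * x[idx]) / np.sum(weights))
    if current:
        segments.append(np.array(current))
    return segments
```

Each returned segment plays the role of one entry of `timeSeriesSets` in Algorithm 1, i.e., a contiguous run of slotted amplitude estimates.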
Figure 5.1 demonstrates an eight-state translation; the alphabet will be significantly more resolved than this for astronomical waveforms.

Figure 5.1: Example State Space Representation

The resolution of the alphabet granularity is determined via cross-validation, to find an optimal resolution for a given survey. The set of state transitions, i.e., the transformation of the conditioned signal, is used to populate a transition probability matrix, or first-order Markov Matrix.

The transition state frequencies are estimated only for signal measured between empty slots; transitions are not evaluated between day-night periods, or between slews (changes in observation directions during a night), and are only evaluated for continuous observations. Each continuous set of conditioned waveforms (with slotting and state approximation applied) is used to populate an empty matrix P, with dimensions r × r, where r is the number of states. The matrix is populated using the following rules:

• N_ij is the number of observation pairs x[t] and x[t + 1] with x[t] in state s_i and x[t + 1] in state s_j

• N_i is the number of observation pairs x[t] and x[t + 1] with x[t] in state s_i and x[t + 1] in any one of the states j = 1, ..., r

The now-populated matrix P is a transition frequency matrix, with each row i representing a frequency distribution (histogram) of transitions out of the state s_i. The transition probability matrix is approximated by converting the elements of P using P_ij = N_ij / N_i. The resulting matrix is often described as a first-order Markov Matrix [Ross, 2013].
State changes are based only on the observation-to-observation amplitude changes; the matrix is a representation of the linearly interpolated sequence [Ge and Smyth, 2000]. Furthermore, the matrix is vectorized, similar to image analysis methods, into a feature space vector whose dimensions depend on the resolution and bounds of the states. The algorithm to process the time-domain conditioned data is given in Algorithm 2.

Algorithm 2 Markov Matrix Generation

procedure MarkovMatrixGeneration(timeSeriesSets, s)
    markovMatrix ← []
    for i := 1 to length of timeSeriesSets do
        markovMatrixPrime ← []
        currentSlotSet ← timeSeriesSets[i]
        for k := 2 to length of currentSlotSet do
            idxIn ← find state containing currentSlotSet[k − 1]
            idxOut ← find state containing currentSlotSet[k]
            markovMatrixPrime[idxIn, idxOut] ← markovMatrixPrime[idxIn, idxOut] + 1
        end for
        markovMatrix ← markovMatrix + markovMatrixPrime
    end for
    N_i ← sum along rows of markovMatrix
    for j := 1 to length of s do
        if N_i ≠ 0 then
            markovMatrix[:, j] ← N_ij / N_i        ▷ Estimate Markov Matrix
        end if
    end for
end procedure

The resulting Markov Matrix is vectorized into a feature vector given by:

\[
P_i = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1r} \\ p_{21} & p_{22} & \cdots & p_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ p_{r1} & p_{r2} & \cdots & p_{rr} \end{bmatrix} \;\Rightarrow\; \mathrm{vec}(P_i) = [\, p_{11}\;\; p_{12}\;\; \cdots\;\; p_{21}\;\; \cdots\;\; p_{rr} \,], \tag{5.1}
\]

where P_i is the Markov Chain of the i-th input training set, and x_i = vec(P_i) is the i-th input vectorized training pattern. When using the Markov matrix representation, the resolution of the state set needs to be fine enough to avoid the loss of information that results from over-generalization. However, if the state resolution is too fine, the sparsity of the transition matrix will result in a shape signature that is too dependent on noise and the "individualness" of the specific waveform to be of any use.
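Steps 2 and 3 (amplitude discretization, transition counting, row normalization, and vectorization per Eq. 5.1) can be sketched in Python. This is a simplified stand-in for Algorithm 2, not the dissertation's code: the alphabet is assumed to span [−1, 1] with illustrative bin edges, and names are hypothetical.

```python
import numpy as np

def markov_feature(segments, n_states=12):
    """Map standardized amplitudes to a discrete alphabet, count
    state-to-state transitions within each contiguous segment,
    row-normalize (P_ij = N_ij / N_i), and vectorize the Markov matrix."""
    edges = np.linspace(-1.0, 1.0, n_states - 1)   # interior bin edges
    counts = np.zeros((n_states, n_states))
    for seg in segments:
        states = np.digitize(seg, edges)           # amplitude -> state index
        for a, b in zip(states[:-1], states[1:]):  # observation-to-observation
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts),
                      where=row_sums > 0)          # rows with N_i = 0 stay zero
    return probs.reshape(-1)                       # vec(P): n_states**2 dims
```

For a 12-state alphabet this yields the 144-dimensional feature vector discussed in the text, before the ECVA reduction.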
Thus additional processing is necessary for further analysis; even a small set of states (12 × 12) will result in a feature vector with high dimensionality (144 dimensions). While a window and overlap size is assumed for the slotting, to address the irregular sampling of the time series data, there are two adjustable features associated with this analysis: the kernel width associated with the slotting and the state space (alphabet) resolution. It is apparent that a range of resolutions and kernel widths need to be tested to determine the best performance given a generic supervised classifier.

For these purposes a rapid initial classification algorithm, General Quadratic Discriminant Analysis [Duda et al., 2012], was implemented to estimate the misclassification rate (wrong decisions/total decisions). To reduce the large, sparse feature vector resulting from the unpacking of the Markov Matrix, we applied a supervised dimensionality reduction technique, extended canonical variates analysis (ECVA) [Nørgaard et al., 2006]. The methodology for ECVA has roots in principal component analysis (PCA). PCA is a procedure performed on large multidimensional datasets with the intent of rotating a set of possibly correlated dimensions into a set of linearly uncorrelated variables [Scholz, 2006]. The transformation results in a dataset where the first principal component (dimension) has the largest possible variance. PCA is an unsupervised methodology: a priori labels for the data being processed are not taken into consideration, and while a reduction in feature dimensionality is obtained and the variance is maximized, the operation may not maximize the linear separability of the class space.

In contrast to PCA, canonical variate analysis does take class labels into consideration. The variation between groups is maximized, resulting in a transformation that benefits the goal of separating classes.
Given a set of data x with g different classes, n_i observations of each class, and r × r dimensions in each observation, and following Johnson et al. [1992], the within-group and between-group covariance matrices are defined as:

\[
S_{\text{within}} = \frac{1}{n - g} \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' \tag{5.2}
\]

\[
S_{\text{between}} = \frac{1}{g - 1} \sum_{i=1}^{g} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})', \tag{5.3}
\]

where n = Σ_{i=1}^{g} n_i, x̄_i = (1/n_i) Σ_{j=1}^{n_i} x_ij, and x̄ = (1/n) Σ_{i=1}^{g} n_i x̄_i. CVA attempts to maximize the function:

\[
J(w) = \frac{w' S_{\text{between}} \, w}{w' S_{\text{within}} \, w}, \tag{5.4}
\]

which is solvable so long as S_within is non-singular; this need not be the case, especially when analyzing multicollinear data. When the dimensions of the observed patterns are multicollinear, additional considerations need to be made. Nørgaard et al. [2006] outline a methodology for handling these cases in CVA. Partial least squares analysis, PLS2 [Wold, 1939], is used to solve the above linear equation, resulting in an estimate of w and, given that, an estimate of the canonical variates (the reduced dimension set). The application of ECVA to our vectorized Markov Matrices results in a reduced feature space of dimension g − 1.

Two datasets are addressed here: the first is the STARLIGHT dataset from the UCR time series database; the second is published data from the LINEAR survey. The UCR time series dataset is used to baseline the time-domain feature extraction methodology proposed; it is compared to the results published on the UCR website.
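The classical CVA step of Eqs. (5.2)–(5.4) can be sketched in Python as a generalized eigenproblem. Note this is plain CVA, assuming S_within is non-singular; it does not implement the PLS2-based ECVA treatment of the multicollinear case, and all names are illustrative.

```python
import numpy as np

def canonical_variates(X, y, n_components=None):
    """Classical CVA: find directions w maximizing w'S_b w / w'S_w w.

    X : (n, d) array of patterns; y : (n,) integer class labels.
    Assumes S_within is nonsingular (multicollinear data needs ECVA/PLS2).
    """
    classes = np.unique(y)
    g, (n, d) = len(classes), X.shape
    xbar = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)              # within-group scatter
        Sb += len(Xc) * np.outer(mc - xbar, mc - xbar)  # between-group
    Sw /= (n - g)
    Sb /= (g - 1)
    # Directions maximizing J(w) are eigenvectors of Sw^{-1} Sb
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    k = n_components if n_components else (g - 1)  # at most g-1 variates
    W = vecs.real[:, order[:k]]
    return X @ W                                   # reduced (n, k) space
```

The g − 1 bound on the informative variates follows from the rank of S_between, matching the reduced dimensionality quoted in the text.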
The UCR time series data contains only time-domain data that has already been folded and put into magnitude-phase space; no photometric data from either SDSS or 2MASS, nor star identifications for these data, could be recovered, and only three class types are provided, which are defined only by number. The second dataset, the LINEAR survey, provides an example of a modern large-scale astronomical survey: it contains time-domain data that has not been folded or otherwise manipulated, is already associated with SDSS and 2MASS photometric values, and has five identified stellar variable types. For each dataset, the state space resolution and the kernel widths for the slotting methods are optimized using 5-fold cross-validation. The performances of three classifiers, on only the time-domain dataset for the UCR data and on the mixture of time-domain data and color data for the LINEAR survey, are estimated using 5-fold cross-validation and testing. The performances of the classifiers are then compared. Finally, an anomaly detection algorithm is trained and tested for the LINEAR dataset.

The training set is used for 5-fold cross-validation, and a set of three classification algorithms is tested [Hastie et al., 2009, Duda et al., 2012]: k-Nearest Neighbor (k-NN), Parzen Window Classifier (PWC), and Random Forest (RF). Cross-validation is used to determine optimal classification parameters (e.g., kernel width) for each of the classification algorithms. The k-NN and PWC algorithms were implemented by the authors in MATLAB, based on the algorithm outlines in Duda et al. [2012] and Hastie et al. [2009]. Code is accessible via GitHub.

The k-nearest neighbor algorithm is a non-parametric classification method; it uses a voting scheme based on an initial training set to determine the estimated label [Altman, 1992]. For a given new observation, the L2 Euclidean distance is found between the new observation and all points in the training set.
The distances are sorted, and the k closest training sample labels are used to determine the estimated label of the new observation (majority rule). Cross-validation is used to find an optimal k value, where k is any integer greater than zero.

(Code: https://github.com/kjohnston82/SSMM)

PWC

Parzen window classification is a technique for non-parametric density estimation, which can also be used for classification [Parzen, 1962, Duda et al., 2012]. Using a given kernel function, the technique approximates a given training set distribution via a linear combination of kernels centered on the observed points. The PWC algorithm (much like k-NN) does not require a training phase, as the data points are used explicitly to infer a decision space. Rather than choosing the k nearest neighbors of a test point and labeling the test point with the weighted majority of its neighbors' votes, one can consider all points in the voting scheme and assign their weights by means of the kernel function. With Gaussian kernels, the weight decreases exponentially with the square of the distance, so far-away points are practically irrelevant. Cross-validation is necessary, however, to determine an optimal value of h, the "width" of the radial basis function (or whatever kernel is being used).

To generate the random forest classifier, the TreeBagger algorithm in MATLAB is implemented. The algorithm generates n decision trees on the provided training sample. The n decision trees operate on any new observed pattern, and the decisions made by each tree are conglomerated (majority rule) to generate a combined estimated label.
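The Parzen-window voting scheme described above can be sketched in a few lines of Python. This is a minimal illustration (not the authors' MATLAB code): every training point votes for its label with a Gaussian weight in its distance to the test point, which reduces to k-NN-like behavior as h shrinks.

```python
import numpy as np

def pwc_predict(X_train, y_train, X_test, h):
    """Parzen-window classification with Gaussian kernels: each training
    point votes for its label with weight exp(-||x - x_i||^2 / (2 h^2));
    the class with the largest summed weight wins. The kernel width h
    would be chosen by cross-validation."""
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        d2 = np.sum((X_train - x) ** 2, axis=1)      # squared L2 distances
        w = np.exp(-d2 / (2.0 * h ** 2))             # Gaussian kernel weights
        scores = [w[y_train == c].sum() for c in classes]
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)
```

Because all training points participate (weighted by distance) rather than a fixed k, far-away points contribute negligibly, as noted in the text.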
To generate Breiman's random forest algorithm [Breiman et al., 1984], the parameter NVarToSample is provided a value (other than 'all'), and a random set of variables is used to generate the decision trees; see the MATLAB TreeBagger documentation for more information.

Comparison to Standard Set (UCR)

The UCR time-domain datasets are used to baseline classification methodologies [Keogh et al., 2011]. The UCR time-domain datasets [Protopapas et al., 2006] are derived from a set of Cepheid, RR Lyrae, and eclipsing binary stars. The time-domain datasets were phased (folded) via the primary period and smoothed using the SUPER-SMOOTHER algorithm [Reimann, 1994] by the Protopapas study prior to being provided to the UCR database. The waveforms received from UCR are amplitude as a function of phase; the SUPER-SMOOTHER algorithm was also used [Protopapas et al., 2006] to produce regular samples (in the amplitude vs. phase space). The sub-groups of each of the three classes are combined together in the UCR data (i.e., RR (ab) + RR (c) = RR); similarly, the data is taken from two different studies (OGLE and MACHO). A plot of the phased light curves is given in Figure 5.2.

Figure 5.2: UCR Phased Light Curves. Classes are given by number only: 1 = Blue Line (Eclipsing Binaries), 2 = Green Small Dashed Line (Cepheid), 3 = Red Big Dashed Line (RR Lyr)

Class analysis is a secondary effort when applying the outlined methodology to the UCR dataset; the primary concern is a demonstration of the performance of the supervised classification methodology with respect to the baseline performance reported by UCR, which implements a simple waveform nearest-neighbor algorithm. The folded waveforms are treated identically to the unfolded waveforms in terms of the processing presented. Values of phase were generated to accommodate the slotting technique, thereby allowing the functionality developed to be used for both amplitude vs. time (LINEAR) as well as amplitude vs. phase (UCR).
The slotting, state space representation, Markov Matrix, and ECVA flow is implemented in exactly the same way. As there are only three classes in the dataset, the ECVA algorithm results in a dimensionality of only two (g − 1). The resulting reduced feature space is shown in Figure 5.3.

Figure 5.3: ECVA reduced feature space using the UCR Star Light Curve Data

Each classifier is then trained only on the ECVA-reduced time-domain feature space. The resulting optimization analysis, based on the 5-fold cross-validation, is presented in Figures B.1a, B.1b and B.1c. Depending on the methodology used, cross-validation estimates a small misclassification error.

The SSMM methodology presented does no worse than the 1-NN presented by Keogh et al. [2011] and appears to provide some increase in performance. The procedure described operates on folded data as well as unfolded data and does not need time-warping for alignment of the waveform, demonstrating the flexibility of the method. The procedure not only separated out the classes outlined, but also found additional clusters of similarity in the dataset. Whether these clusters correspond to the sub-groupings reported by the original generating source (RR (ab) and RR (c), etc.) is not known, as object identification is not provided by the UCR dataset.

For the analysis of the proposed algorithm design, the LINEAR dataset is parsed into training, cross-validation, and test sets of time series data from the LINEAR survey that has been verified and for which accurate photometric values are available [Sesar et al., 2011, Palaversa et al., 2013]. From the starting sample of 7,194 LINEAR variables, a clean sample of 6,146 time series datasets and their associated photometric values was used for classification.
Stellar class type is limited further to the top five most populous classes: RR Lyr (ab), RR Lyr (c), Delta Scuti / SX Phe, Contact Binaries, and Algol-Like Stars with 2 Minima, resulting in a set of 6,086 observations. The distribution of stellar classes is presented in Table 5.1.

Table 5.1: Distribution of LINEAR Data across Classes

Type            | Count | Percentage
Algol           |   287 |  5.6
Contact Binary  |  1805 | 35.6
Delta Scuti     |    68 |  1.3
RRab            |  2189 | 43.0
RRc             |   737 | 14.5

In support of the supervised classification algorithm, artificial datasets have been generated and introduced into the training/testing set. These artificial datasets are a representation of stars without variability. This introduction of artificial data is done for the same reasons the training of the anomaly detection algorithm is performed:

• The LINEAR dataset implemented only represents five of the most populous variable star types [Richards et al., 2012]; thus the class space defined by these classes is incomplete.

• Even if the class space were complete, studies such as Debosscher [2009] and Dubath et al. [2011] have shown that many stellar variable populations are under-sampled.

• Similarly, many of the studies focus on stellar variables only, and do not include non-variable stars. While filters are often applied to separate variable and non-variable stars (Chi-Squared specifically; Sesar et al. 2013), these are not perfect methods for removing non-variable populations, and could result in an increase in false alarms.

The artificial time series is generated with a Gaussian random amplitude distribution. In addition to the randomly generated time-domain information, photometric information is also generated. The photometric measurements used to classify the stars are used to generate empirical distributions (histograms) of each of the feature vectors. These histograms are turned into cumulative distribution functions (CDFs).
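Drawing artificial photometric values from such empirical CDFs amounts to inverse-transform sampling. A minimal Python sketch follows; the helper name and the interpolation of the empirical quantile function are illustrative, not the dissertation's implementation.

```python
import numpy as np

def sample_empirical(values, n_samples, rng=None):
    """Inverse-transform sampling from the empirical CDF of `values`:
    draw u ~ U(0, 1) and return the empirical quantile at u, interpolated
    over the sorted observed sample. Used here to mimic the observed
    photometric feature distributions when building artificial stars."""
    if rng is None:
        rng = np.random.default_rng()
    sorted_vals = np.sort(np.asarray(values))
    # plotting positions approximating the empirical CDF
    probs = (np.arange(len(sorted_vals)) + 0.5) / len(sorted_vals)
    u = rng.uniform(0.0, 1.0, n_samples)
    return np.interp(u, probs, sorted_vals)
```

Because the quantile function is interpolated over the observed sample, all draws stay within the observed range while reproducing the empirical distribution's shape.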
The artificially generated photometric patterns are generated by sampling from these empirical distribution functions. Sampling is performed via the Inverse Transform method [Law and Kelton, 1991]. These artificial datasets are treated identically in processing to the other observed waveforms.

In addition to the time-domain data, color data is obtainable for the LINEAR dataset, resulting from the efforts of large photometric surveys such as SDSS and 2MASS. These additional features are merged with the reduced time-domain feature space, resulting in an overall feature space. For this study, the optical SDSS filters (ugriz) and the IR filters (J, K) are used to generate the color features: u − g, g − i, i − K, and J − K. The color magnitudes are corrected for ISM extinction using E(B − V) from the SFD maps and the extinction curve shape from Berry et al. [2012]. In addition to these color features, bulk time-domain statistics are also generated: logP is the log of the primary period derived from the Fourier domain space, magMed is the median LINEAR magnitude, and ampl, skew, and kurt are the amplitude, skewness, and kurtosis of the observed light curve distribution. These additional features are included for the analysis of the LINEAR dataset. See the electronic supplement (Combined LINEAR Features, Extra-Figure-CombinedLINEARFeatures.fig) for a plot matrix of the combined feature space.

It is assumed that the state space resolution that minimizes the misclassification rate using QDA will likewise minimize the misclassification rate using any of the other classification algorithms. The slot width was taken to be 0.015 and a kernel spread of 0.01 was used. Using the optimal amplitude state space resolution (0.03), a three-dimensional plot (the first three ECVA parameters) is constructed; see the electronic supplement for the associated movie (ECVA Feature LINEAR Movie, ExtendedCanonicalVariates.mp4).
Figure 5.4 shows a plot of the first two extended canonical variates.

Figure 5.4: First two Extended Canonical Variates for the Time-Domain Feature Space (classes: Algol, Contact Binary, Delta Scu / SX Phe, No Variability, RRab, RRc)

Based on the merged feature space, the optimal parameters for the k-NN, PWC, and Random Forest classifiers are generated. The cross-validation optimization figures for each are presented in Figures B.2a, B.2b and B.2c, respectively. Testing was performed on a pre-partitioned set, separate from the training and cross-validation populations. The transformations applied to the training and cross-validation data were also applied to the testing data. After optimal parameters have been found for both the resolution of the Markov Model and the classification algorithms, the testing set is used to estimate the confusion matrix. Confusion matrices are generated and given in Appendix B; true labels are shown in the left column and estimated labels in the top row (Tables B.2a, B.2b and B.2c). Further analysis was performed comparing the classification capability of a supervised classifier with only the SSMM features (post-ECVA analysis), with only the traditional feature spaces, and with all three feature spaces (photometric data, frequency and time statistics, and SSMM), to show the relative performance of SSMM (Table 5.2).

Table 5.2: Misclassification Rates of Feature Spaces from Testing Data

                    | 1-NN | PWC  | RF
All Features        | 0.01 | 0.01 | 0.01
Color and Frequency | 0.01 | 0.02 | 0.01
SSMM Only           | 0.03 | 0.04 | 0.03

Comparable performance is obtained using just the SSMM feature space compared to the color and frequency space, and for PWC, a small increase is obtained when the features are combined.

In addition to the pattern classification algorithm outlined, the procedure includes the construction of an anomaly detector.
The pattern classification algorithms presented as part of this analysis partition the entire decision space based on the known class types provided in the LINEAR dataset. For many supervised classifier algorithms, and indeed those presented here, there is no consideration of deviations of patterns beyond the training set observed, i.e., of absolute distance from population centers. All of the algorithms investigated consider relative distances, i.e., is the new pattern P closer to the class center of B or of A? Thus, even if an anomalous pattern is observed by a new survey, the classifier will attempt to estimate a label for the observed star based on the labels it knows. To address this concern, a one-class anomaly detection algorithm is implemented.

Anomaly detection and novelty detection methods are descriptions of similar processes with the same intent, i.e., the detection of new observations outside of the class space established by training. These methods have been proposed for stellar variable implementations prior to this analysis [Protopapas et al., 2006]. Tax [2001] and Tax and Muller [2003] outline the implementation of a number of classifiers for One-Class (OC) classification, i.e., novelty or anomaly detection. Here, the PWC algorithm (described earlier) is transformed into an OC anomaly detection algorithm. The result is the "lassoing" or dynamic encompassing of known data patterns.

The lasso boundary represents the division between known (previously observed) regions of feature space and unknown (not-previously observed) regions. New patterns observed with feature vectors occurring in this unknown region are considered anomalies, or patterns without support, and the estimated labels returned from the supervised classification algorithms should be questioned, despite the associated posterior probability of the label estimate. This paper implements the DD Toolbox designed by Tax and the PR toolbox [Duin et al., 2007].
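The one-class "lassoing" idea can be sketched in Python as a Parzen density estimate with a rejection threshold. This is an illustrative stand-in for the DD Toolbox's OC-PWC (names and the 5% rejection default are assumptions, not the toolbox's API): new points whose estimated density falls below the threshold lie outside the lasso and are flagged as anomalies.

```python
import numpy as np

def oc_pwc_threshold(X_train, h, reject_fraction=0.05):
    """Fit a one-class Parzen-window detector: estimate the kernel density
    at each training point (leave-one-out) and set the acceptance threshold
    so that `reject_fraction` of the training points fall below it."""
    n = len(X_train)
    d2 = np.sum((X_train[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * h ** 2))
    np.fill_diagonal(K, 0.0)                  # leave-one-out: drop self-kernel
    dens = K.sum(axis=1) / (n - 1)
    return np.quantile(dens, reject_fraction)

def oc_pwc_score(X_train, X_new, h):
    """Parzen density of new points under the training set; scores below
    the threshold mark patterns without support (outside the 'lasso')."""
    d2 = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * h ** 2)).sum(axis=1) / len(X_train)
```

A label returned by the supervised classifiers would then only be trusted when the corresponding OC-PWC score clears the threshold.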
The resulting error curve generated from the cross-validation of the PWC-OC algorithm resembles a threshold model (probit); the point which minimizes both the error and the kernel width is found (Figure 5.5).

Figure 5.5: OC-PWC Kernel Width Optimization for LINEAR Data

This point (minimizing both error and kernel width) gives the optimal kernel width (2.5). The estimated misclassification rate of the detector is determined via evaluation of the testing set.

Given only time series data (no photometric data, frequency data, or time-domain statistics), for the classes and the LINEAR observations made (resolution of amplitude and frequency rate of observations), a ~2% misclassification rate is found with the various, more general, classifiers. The kernel width of the slots used to account for irregular sampling and the state space resolution are major factors in performance; as mentioned, we have assumed a slotting width that is a function of the survey (median of the continuous sample rate). This has been found to work well for the LINEAR dataset; however, cross-validation could be performed with other datasets to determine if additional optimization is possible. With the addition of photometric data, the misclassification rate is reduced further.

The Slotted Symbolic Markov Modeling (SSMM) methodology developed has been able to generate a feature space which separates variable stars by class (supervised classification). This methodology has the benefit of being able to accommodate irregular sampling rates, dropouts, and some degree of time-domain variance. It also provides a fairly simple methodology for the feature space generation necessary for classification. One of the major advantages of the methodology used is that a signature pattern (the transition state model) is generated and updated with new observations.
The transition frequency matrix for each star is accumulated, given new observations, and the probability transition matrix is re-estimated. The methodology's ability to perform is based on the input data sampling rate, photometric error, and, most importantly, the uniqueness of the time-domain patterns expressed by the variable stars of interest.

The analysis presented has demonstrated that the SSMM methodology's performance is comparable to the UCR baseline performance analysis, if not slightly better. In addition, the translation of the feature space has demonstrated that the original suggestion of three classes might not be correct; a number of additional clusters are revealed, as are some potential misclassifications in the training set. The performance of four separate classifiers trained on the UCR dataset is examined. It has been shown that the methodology presented is comparable to direct distance methods (the UCR baseline). It is also shown that the methodology presented is more flexible. The LINEAR dataset provides more opportunity to demonstrate the proposed methodology. The larger class space, unevenly sampled data with dropouts, and color data all provide additional challenges to be addressed. After optimization, the misclassification rate is roughly ∼2%.

Further research is outlined in three main focus topics: dataset improvement, methodology improvement, and simulation/performance analysis. The limitations of the dataset and class space used for this study are known.
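The accumulate-and-re-estimate step described above can be sketched as follows. This is a minimal illustration, assuming the light curve has already been slotted and symbolized into integer states upstream; the function names are ours.

```python
import numpy as np

def update_counts(counts, symbols):
    """Accumulate transition frequencies from a new symbolized observation sequence.

    counts  : (S, S) transition frequency matrix, updated in place.
    symbols : sequence of integer state labels in [0, S).
    """
    for a, b in zip(symbols[:-1], symbols[1:]):
        counts[a, b] += 1
    return counts

def transition_matrix(counts):
    """Re-estimate the row-stochastic probability transition matrix from counts."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave never-visited states as zero rows
    return counts / totals
```

Because the frequency matrix is simply accumulated, the signature pattern for a star can be refined incrementally as new observations arrive, without reprocessing the full light curve.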
Future efforts will include a more complete class space, as well as more data to support under-represented class types. Specifically, datasets such as the Catalina Real-Time Transient Survey [Drake et al., 2009] will provide greater depth and completeness as a prelude to the data sets that will be available from the Panoramic Survey Telescope & Rapid Response System (Pan-STARRS) and the Large Synoptic Survey Telescope (LSST).

In addition to improving the underlying training data used, the methodology outlined will also be researched to determine if more optimal methods are available. Exploring the effects of a variable-size state space for the translation could potentially yield performance improvements, as could a comparison of slotting methods (e.g., box slots vs. Gaussian slots vs. other kernels or weighting schemes). Likewise, implementations beyond supervised classification (e.g., unsupervised classification) were not explored as part of this analysis. How the feature space outlined in this analysis would lend itself to clustering or expectation-maximization algorithms is yet to be determined.

In future work, how sampling rates and photometric errors affect the ability to represent the underlying time-domain functionality will be explored using synthetic time-domain signals. Simulation of the expected time-domain signals will allow for an estimation of the performance of other spectral methods (DWT/DFT for irregular sampling), which will in turn allow for an understanding of the benefits and drawbacks of each methodology, relative to both class type and observational conditions. This type of analysis would require the modeling and development of synthetic stellar variable functions to produce reasonable (and varied) time-domain signatures.
Chapter 6

Detector for O'Connell Type EBs

With the rise of large-scale surveys, such as Kepler, the Transiting Exoplanet Survey Satellite (TESS), the Kilodegree Extremely Little Telescope (KELT), the Square Kilometre Array, the Large Synoptic Survey Telescope (LSST), and Pan-STARRS, a fundamental working knowledge of statistical data analysis and data management is necessary to reasonably process astronomical data. The ability to mine these data sets for new and interesting astronomical information opens a number of scientific windows that were once closed by poor sampling, in terms of both number of stars (targets) and depth of observations (number of samples).

This section focuses on the development of a novel, modular time-domain signature extraction methodology and its supporting supervised pattern detection algorithm for variable star detection. The design could apply to any number of variable star types that exhibit consistent periodicity (cyclostationarity) in their flux; examples include most Cepheid-type stars (RR Lyr, SX Phe, Gamma Dor, etc.) as well as other eclipsing binary types. Nonperiodic variables would require a different feature space [Johnston and Peter, 2017], but the underlying detection scheme could still be relevant. Herein we present the design's utility by targeting eclipsing binaries that demonstrate a feature known as the O'Connell effect.

We have selected O'Connell effect eclipsing binaries (OEEBs) to initially demonstrate our detector design. We highlight OEEBs here because they compose a subclass of a specific type of variable star (eclipsing binaries). Subclass detection provides an extra layer of complexity for our detector to handle. We demonstrate our detector design on Kepler eclipsing binary data from the Villanova catalog, allowing us to train and test against different subclasses in the same parent variable class type.
We train our detector design on Kepler eclipsing binary data and apply the detector to a different survey—the Lincoln Near-Earth Asteroid Research asteroid survey [LINEAR, Stokes et al., 2000]—to demonstrate the algorithm's ability to discriminate and detect our targeted subclass given not just the parent class but other classes as well.

Classifying variable stars relies on proper selection of feature spaces of interest and a classification framework that can support the linear separation of those features. Selected features should quantify the telltale signature of the variability—its structure and information content. Prior studies to develop both features and classifiers include expert-selected feature efforts [Debosscher, 2009, Sesar et al., 2011, Richards et al., 2012, Graham et al., 2013a, Armstrong et al., 2016, Mahabal et al., 2017, Hinners et al., 2018], automated feature selection efforts [McWhirter et al., 2017, Naul et al., 2018], and unsupervised methods for feature extraction [Valenzuela and Pichara, 2018, Modak et al., 2018]. The astroinformatics community-standard features include quantification of statistics associated with the time-domain photometric data, Fourier decomposition of the data, and color information in both the optical and IR domains [Nun et al., 2015, Miller et al., 2015].
The number of individual features commonly used is upward of 60 and growing [Richards et al., 2011], as the number of variable star types increases and as a result of further refinement of classification definitions [Samus' et al., 2017]. We seek here to develop a novel feature space that captures the signature of interest for the targeted variable star type.

The detection framework here maps time-domain stellar variable observations to an alternate distribution field (DF) representation [Sevilla-Lara and Learned-Miller, 2012] and then develops a metric learning approach to identify OEEBs. Based on the matrix-valued DF feature, we adopt a metric learning framework to directly learn a distance metric [Bellet et al., 2015] on the space of DFs. We can then utilize the learned metric as a measure of similarity to detect new OEEBs based on their closeness to other OEEBs. We present our metric learning approach as a competitive push–pull optimization, where DFs corresponding to known OEEBs influence the learned metric to measure them as being nearer in the DF space. Simultaneously, DFs corresponding to non-OEEBs are pushed away and result in large measured distances under the learned metric.

This section is structured as follows. First, we review the targeted stellar variable type, discussing the type signatures expected. Second, we review the data used in our training, testing, and discovery process as part of our demonstration of design. Next, we outline the novel proposed pipeline for OEEB detection; this review includes the feature space used, the designed detector/classifier, and the associated implementation of an anomaly detection algorithm [Chandola et al., 2009]. Then, we apply the algorithm, trained on the expertly selected/labeled Villanova Eclipsing Binary catalog OEEB targets, to the rest of the catalog with the purpose of identifying new OEEB stars. We present the results of the discovery process using a mix of clustering and derived statistics.
We apply the Villanova Eclipsing Binary catalog trained classifier, without additional training, to the LINEAR data set. We provide the results of this cross-application, i.e., the set of discovered OEEBs. For comparison, we detail two competing approaches. We develop training and testing strategies for our metric learning framework, and finally, we conclude with a summary of our findings and directions for future research.

The O'Connell effect [O'Connell, 1951] is defined for eclipsing binaries as an asymmetry in the maxima of the phased light curve (see Figure 6.2). This maxima asymmetry is unexpected, as it suggests an orientation dependency in the brightness of the system. Similarly, the consistency of the asymmetry over many orbits is also surprising, as it suggests that the maxima asymmetry has a dependence on the rotation of the binary system. While the cause of the O'Connell effect is not fully understood, researchers have offered a number of explanations. Additional data and modeling are necessary for further investigation [McCartney, 1999]. Several theories propose to explain the effect, including starspots, gas stream impact, and circumstellar matter [McCartney, 1999]. The work by Wilsey and Beaky [2009] outlines each of these theories and demonstrates how the observed effects are generated by the underlying physics.

Figure 6.1: Conceptual overview of a β Lyrae (EB-type) eclipsing binary (semi-detached binary) that exhibits the O'Connell effect, an asymmetry of the maxima. The figure is a representation of the binary orientation with respect to a viewer and the resulting observed light curve.

• Starspots result from chromospheric activity, causing a consistent decrease in brightness of the star when viewed as a point source. While magnetic surface activity will cause both flares (brightening) and spots (darkening), flares tend to be transient, whereas spots tend to have longer-term effects on the observed binary flux.
Thus, between the two, starspots are the favored hypothesis for causing long-term consistent asymmetry; binary simulations (such as the Wilson–Devinney code) can often be used to model O'Connell effect binaries via the inclusion of a large starspot [Zboril and Djurasevic, 2006].

• Gas stream impact results from matter transferring between stars (smaller to larger) through the L1 point and onto a specific position on the larger star, resulting in a consistent brightening on the leading/trailing side of the secondary/primary.

Figure 6.2: An example phased light curve of an eclipsing binary with the O'Connell effect (KIC: 10861842). The light curve has been phased such that the global minimum (cooler in front of hotter) is at lag 0 and the secondary minimum (hotter in front of cooler) is at approximately lag 0.5. The side-by-side binary orientations are at approximately 0.25 and 0.75. Note that the maxima, corresponding to the side-by-side orientations, have different values.

• The circumstellar matter theory proposes to describe the increase in brightness via free-falling matter being swept up, resulting in energy loss and heating, again causing an increase in amplitude. Alternatively, circumstellar matter in orbit could result in attenuation, i.e., the difference in maximum magnitude of the phased light curve results from dimming and not brightening.

In McCartney [1999], the sample was limited to six star systems: GSC 03751-00178, V573 Lyrae, V1038 Herculis, ZZ Pegasus, V1901 Cygni, and UV Monocerotis. Researchers have used standard eclipsing binary simulations [Wilson and Devinney, 1971] to demonstrate the proposed explanations for each light curve instance and estimate the parameters associated with the physics of the system.
Wilsey and Beaky [2009] noted other cases of the O'Connell effect in binaries, which have since been described physically; in some cases, the effect varied over time, whereas in other cases, the effect was consistent over years of observation and over many orbits. The effect has been found in overcontact, semidetached, and near-contact systems.

While one of the key visual differentiators of the O'Connell effect is ∆m_max, this alone could not be used as a means for detection, as the targets trained on or applied to are not guaranteed to be (a) eclipsing binaries and (b) periodic. One of the goals we are attempting to highlight is the transformation of expert qualitative target selection into quantitative machine learning methods.

6.1.2 Characterization of OEEB

We develop a detection methodology for a specific target of interest—OEEB—defined as an eclipsing binary where the light curve (LC) maxima are consistently at different amplitudes over the span of observation. Beyond differences in maxima, and a number of published examples, little is defined as a requirement for identifying the O'Connell effect [Wilsey and Beaky, 2009, Knote, 2015].

McCartney [1999] provides some basic indicators/measurements of interest in relation to OEEB binaries: the O'Connell effect ratio (OER), the difference in maximum amplitudes (∆m), the difference in minimum amplitudes, and the light curve asymmetry (LCA). The metrics are based on the smoothed phased light curves. The OER is calculated as Equation 6.1:

\mathrm{OER} = \frac{\sum_{i=1}^{n/2} \left( \bar{I}_i - \bar{I}_1 \right)}{\sum_{i=n/2+1}^{n} \left( \bar{I}_i - \bar{I}_1 \right)}, \quad (6.1)

where the min-max amplitude measurements for each star are grouped into phase bins (n = 500) and the mean amplitude in each bin is \bar{I}_i. An OER > 1 indicates that the first half of the phased light curve contains more total flux than the second; for the min-max scaled, registered curve, \bar{I}_1 = 0. The difference in max amplitude is calculated as Equation 6.2:

\Delta m = \max_{t < 0.5} \left( f(t)_N \right) - \max_{t \geq 0.5} \left( f(t)_N \right), \quad (6.2)

where we have estimated the maximum in each half of the phased light curve.
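As a worked illustration, the OER and ∆m indicators can be computed directly from the binned, min-max scaled curve. This is a sketch under stated assumptions: the curve is registered with the primary minimum in the first bin (so that bin serves as the reference amplitude, per McCartney [1999]), `n` need not be 500 for the illustration, and the function names are ours.

```python
import numpy as np

def oer(I):
    """O'Connell effect ratio (Eq. 6.1) from binned mean amplitudes I.

    I : length-n array of per-bin mean min-max amplitudes, registered so the
        primary minimum (the reference bin) is first.
    """
    n = len(I)
    ref = I[0]  # registered minimum; ~0 for a min-max scaled curve
    return np.sum(I[: n // 2] - ref) / np.sum(I[n // 2:] - ref)

def delta_m(phase, f_N):
    """Difference of the two maxima (Eq. 6.2): first half minus second half."""
    return f_N[phase < 0.5].max() - f_N[phase >= 0.5].max()
```

An `oer` above 1 and a nonzero `delta_m` together indicate that the leading half of the phased curve is brighter, the qualitative signature an expert looks for by eye.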
The LCA is calculated as Equation 6.3:

\mathrm{LCA} = \sqrt{ \sum_{i=1}^{n/2} \frac{ \left( \bar{I}_i - \bar{I}_{(n+1-i)} \right)^2 }{ \bar{I}_i^2 } }. \quad (6.3)

As opposed to the measurement of OER, the LCA measures the deviance from symmetry of the two peaks. Defining descriptive metrics or functional relationships (i.e., bounds of distributions) requires a larger sample than is presently available. An increased number of identified targets of interest is required to provide the sample size needed for a complete statistical description of the O'Connell effect. The quantification of these functional statistics allows for an improved understanding of not just the standard definition of the targeted variable star but also the population distribution as a whole. These estimates allow for empirical statements to be made regarding the differences in light curve shapes among the variable star types investigated. The determination of an empirically observed distribution, however, requires a significant sample to generate meaningful descriptive statistics for the various metrics.

In this effort, we highlight the functional shape of the phased light curve as our defining feature of OEEB stars. The prior metrics identified are selected or reduced measures of this functional shape. We propose here that, as opposed to training a detector on the preceding indicators, we use the functional shape of the phased light curve by way of the distribution field to construct our automated system.

6.2 Variable Star Data

As a demonstration of design, we apply the proposed algorithm to a set of predefined, expertly labeled eclipsing binary light curves. We focus on two surveys of interest: first, the Kepler Villanova Eclipsing Binary catalog, from which we derive our initial training data as well as our initial discovery (unlabeled) data, and second, the Lincoln Near-Earth Asteroid Research (LINEAR) survey, which we treat as unlabeled data.
Leveraging the Kepler pipeline already in place, and using the data from the Villanova Eclipsing Binary catalog [Kirk et al., 2016], this study focuses on a set of predetermined eclipsing binaries identified from the Kepler catalog. From this catalog, we developed an initial, expertly derived, labeled data set of proposed targets of interest identified as OEEB. Likewise, we generated a set of targets identified as "not interesting" based on our expert definitions, i.e., intuitive inference.

Using the Eclipsing Binary catalog [Kirk et al., 2016], we identified a set of 30 targets of interest and 121 targets of noninterest via expert analysis—by-eye selection based on researchers' interests. Specific target identification is listed in a supplementary digital file at the project repository (./supplement/KeplerTraining.xlsx). We use this set of 151 light curves for training and testing.

6.2.1.1 Light Curve/Feature Space

Prior to feature space processing, the raw observed photometric time-domain data are conditioned and processed. Operations include long-term trend removal, artifact removal, initial light curve phasing, and initial eclipsing binary identification; these actions were performed prior to the effort demonstrated here, by the Eclipsing Binary catalog (our work uses all 2875 long-cadence light curves available as of the date of publication). The functional shape of the phased light curve is selected as the feature to be used in the machine learning process, i.e., detection of targets of interest. While the data have already been conditioned by the Kepler pipeline, added steps are taken to allow for similarity estimation between phased curves. Friedman's SUPERSMOOTHER algorithm [Friedman, 1984, VanderPlas and Ivezić, 2015] is used to generate a smooth 1-D functional curve from the phased light curve data.
The smoothed curves are normalized via the min-max scaling of Equation 6.4:

f(\phi)_N = \frac{ f(\phi) - \min\left( f(\phi) \right) }{ \max\left( f(\phi) \right) - \min\left( f(\phi) \right) }, \quad (6.4)

where f(ϕ) is the smoothed phased light curve, f is the amplitude from the database source, ϕ is the phase with ϕ ∈ [0, 1), and f(ϕ)_N is the min-max scaled amplitude. (Note that we will use the terms f(ϕ)_N and min-max amplitude interchangeably throughout this article.) We use the minimum of the smoothed phased light curve as a registration marker, and both the smoothed and unsmoothed light curves are aligned such that lag/phase zero corresponds to the minimum amplitude (eclipse minimum; see McCartney [1999]).

6.2.1.2 Training/Testing Data

The labeled training data are provided as part of the supplementary digital project repository. We include the SOI and NON-SOI Kepler identifiers (KIC) here.

Table 6.1: Collection of KIC of Interest (30 Total)

Table 6.2: Collection of KIC Not of Interest (121 Total)

The 2,000+ eclipsing binaries left in the Kepler Eclipsing Binary catalog are left as unlabeled targets. We use our described detector to "discover" targets of interest, i.e., OEEBs. The full set of Kepler data is accessible via the Villanova Eclipsing Binary website (http://keplerebs.villanova.edu/).

For analyzing the proposed algorithm design, the LINEAR data set is also leveraged as an unknown "unlabeled" data set ripe for OEEB discovery [Sesar et al., 2011, Palaversa et al., 2013]. From the starting sample of 7,194 LINEAR variables, we used a clean sample of 6,146 time series data sets for detection.
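The conditioning steps of Section 6.2.1.1, min-max scaling (Equation 6.4) and registration of the eclipse minimum to phase zero, can be sketched as follows. Smoothing via SUPERSMOOTHER is assumed done upstream; the function names are ours.

```python
import numpy as np

def minmax_scale(f):
    """Eq. 6.4: scale a smoothed phased light curve onto [0, 1]."""
    return (f - f.min()) / (f.max() - f.min())

def register(phase, f):
    """Shift phases (mod 1) so the curve minimum sits at phase 0 (eclipse minimum)."""
    shift = phase[np.argmin(f)]
    return np.mod(phase - shift, 1.0), f
```

After these two steps, every curve lives on the same [0, 1] × [0, 1] grid, which is what makes similarity estimation between phased curves (and the distribution field transform later in the chapter) well defined.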
The stellar class types are limited further to the top five most populous classes—RR Lyr (ab), RR Lyr (c), Delta Scuti / SX Phe, Contact Binaries, and Algol-like stars with two minima—resulting in a set of 5,086 observations.

Unlike the Kepler Eclipsing Binary catalog, the LINEAR data set contains targets other than (but does include) eclipsing binaries; the data set we used [Johnston and Peter, 2017] includes Algols (287), Contact Binaries (1805), Delta Scuti (68), and RR Lyr (ab: 2189, c: 737). The light curves are much more poorly sampled; the uncertainty in the functional shape results from lower SNR (ground survey) and poor sampling. The distribution of stellar classes is presented in Table 5.1. The full data sets used at the time of this publication from the Kepler and LINEAR surveys are available from the associated public repository (github.com/kjohnston82/OCDetector).

Relying on previous designs in astroinformatics to develop a supervised detection algorithm [Johnston and Oluseyi, 2017], we propose a design that tailors the requirements specifically toward detecting OEEB-type variable stars.

6.3.1 Prior Research

Many prior studies on time-domain variable star classification [Debosscher, 2009, Barclay et al., 2011, Blomme et al., 2011, Dubath et al., 2011, Pichara et al., 2012, Pichara and Protopapas, 2013, Graham et al., 2013a, Angeloni et al., 2014, Masci et al., 2014] rely on periodicity-domain feature space reductions. Debosscher [2009] and Templeton [2004] review a number of feature spaces and a number of efforts to reduce the time-domain data, most of which implement Fourier techniques, primarily the Lomb–Scargle (L-S) method [Lomb, 1976, Scargle, 1982], to estimate the primary periodicity [Eyer and Blake, 2005, Deb and Singh, 2009, Richards et al., 2012, Park and Cho, 2013, Ngeow et al., 2013].

The studies on classification of time-domain variable stars often further reduce the folded time-domain data into features that provide maximal linear separability of classes.
These efforts include expert-selected feature efforts [Debosscher, 2009, Sesar et al., 2011, Richards et al., 2012, Graham et al., 2013a, Armstrong et al., 2016, Mahabal et al., 2017, Hinners et al., 2018], automated feature selection efforts [McWhirter et al., 2017, Naul et al., 2018], and unsupervised methods for feature extraction [Valenzuela and Pichara, 2018, Modak et al., 2018]. The astroinformatics community-standard features include quantification of statistics associated with the time-domain photometric data, Fourier decomposition of the data, and color information in both the optical and IR domains [Nun et al., 2015, Miller et al., 2015]. The number of individual features commonly used is upward of 60 and growing [Richards et al., 2011], as the number of variable star types increases and as a result of further refinement of classification definitions [Samus' et al., 2017].

Curiously, aside from efforts to construct a classification algorithm from the time-domain data directly [McWhirter et al., 2017], few efforts in astroinformatics have looked at features beyond those described here—mostly Fourier-domain transformations or time-domain statistics. Considering the depth of possibility for time-domain transformations [Fu, 2011, Grabocka et al., 2012, Cassisi et al., 2012, Fulcher et al., 2013], it is surprising that the community has focused on just a few transforms. Similarly, the astroinformatics community has focused on only a few classifiers, limited mostly to standard classifiers and, specifically, decision tree algorithms, such as random forest–type classifiers.

Here we propose an implementation that simplifies the traditional design: limiting ourselves to a one-versus-all approach [Johnston and Oluseyi, 2017] targeting a variable type of interest; limiting ourselves to a singular feature space—the distribution field of the phased light curve—based on Helfer et al.
[2015] as a representation of the functional shape; and introducing a classification/detection scheme that is based on similarity, with transparent results [Bellet et al., 2015], and that can be further extended, allowing for the inclusion of an anomaly detection algorithm.

As stated, this analysis focuses on detecting OEEB systems based on their light curve shape. The OEEB signature has a cyclostationary signal, a functional shape that repeats with a consistent frequency. The signature can be isolated using a process of period finding, folding, and phasing [Graham et al., 2013b]; the Villanova catalog provides the estimated "best period." The proposed feature space transformation will focus on the quantification or representation of this phased functional shape. This particular implementation design makes the most intuitive sense, as visual inspection of the phased light curve is the way experts identify these unique sources.

As discussed, prior research on time-domain data identification has varied between generating machine-learned features [Gagniuc, 2017], implementing generic features [Masci et al., 2014, Palaversa et al., 2013, Richards et al., 2012, Debosscher, 2009], and looking at shape- or functional-based features [Haber et al., 2015, Johnston and Peter, 2017, Park and Cho, 2013]. This analysis will leverage the distribution field transform to generate a feature space that can be operated on; a distribution field (DF) is defined [Helfer et al., 2015, Sevilla-Lara and Learned-Miller, 2012] as Equation 6.5:

\mathrm{DF}_{ij} = \frac{ \sum \left[\, y_j < f(x_i \leq \phi \leq x_{i+1})_N < y_{j+1} \,\right] }{ \sum \left[\, y_j < f(\phi)_N < y_{j+1} \,\right] }, \quad (6.5)

where [ ] is the Iverson bracket [Iverson, 1962], given as

[P] = \begin{cases} 1 & P = \text{true} \\ 0 & \text{otherwise} \end{cases}, \quad (6.6)

and y_j and x_i are the corresponding normalized amplitude and phase bin edges, respectively, where x_i = 0, 1/n_x, 2/n_x, \ldots, 1, y_j = 0, 1/n_y, 2/n_y, \ldots, 1, n_x is the number of time bins, and n_y is the number of amplitude bins.
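Equation 6.5 amounts to a normalized 2-D histogram of the phased, min-max scaled samples. A minimal sketch follows, assuming (per the right-stochastic description in the text) that counts are normalized within each phase bin; the exact bin-edge conventions here are illustrative.

```python
import numpy as np

def distribution_field(phase, amp, nx, ny):
    """Sketch of Eq. 6.5: bin (phase, min-max amplitude) samples on [0,1]x[0,1]
    and normalize each phase-bin row so the DF is row stochastic."""
    H, _, _ = np.histogram2d(phase, amp, bins=[nx, ny],
                             range=[[0.0, 1.0], [0.0, 1.0]])
    rows = H.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0  # leave empty phase bins as zero rows
    return H / rows
```

Each row of the result is the empirical amplitude distribution within one phase bin, which is exactly the per-column distribution idea of Sevilla-Lara and Learned-Miller [2012] applied to a phased light curve.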
The result is a right stochastic matrix, i.e., the rows sum to 1. The bin numbers, n_x and n_y, are optimized by cross-validation as part of the classification training process. Smoothed phased data—generated by SUPERSMOOTHER—are provided to the DF algorithm. We found this implementation to produce a more consistent classification process. We found that the min-max scaling normalization, when outliers are present, can produce final patterns that focus more on the outlier than on the general functionality of the light curve. Likewise, we found that using the unsmoothed data in the DF algorithm resulted in a classification that was too dependent on the scatter of the phased light curve. Although at first glance that would not appear to be an issue, this implementation resulted in light curve resolution having a large impact on the classification performance—in fact, a higher impact than the shape itself. An example of this transformation is given in Figure 6.3.

Figure 6.3: An example phased light curve (top) and the transformed distribution field (bottom) of an eclipsing binary with the O'Connell effect (KIC: 7516345).

Though the DF exhibits properties that a detection algorithm can use to identify specific variable stars of interest, it alone is not sufficient for our ultimate goal of automated detection. Rather than vectorizing the DF matrix and treating it as a feature vector for standard classification techniques, we treat the DF as the matrix-valued feature that it is [Helfer et al., 2015]. This allows for the retention of row and column dependence information that would normally be lost in the vectorization process [Ding and Dennis Cook, 2018].

At its core, the proposed detector is based on the definition of similarity and, more formally, a definition of distance. Consider the example triplet "x is more similar to y than to z," i.e., the distance between x and y in the feature space of interest is smaller than the distance between x and z.
The field of metric learning focuses on defining this distance in a given feature space to optimize a given goal, most commonly the reduction of the error rate associated with the classification process. Given the selected feature space of DF matrices, the distance between two matrices X and Y [Bellet et al., 2015, Helfer et al., 2015] is defined as Equation 6.7:

d(X, Y) = \| X - Y \|_M = \mathrm{tr}\left\{ (X - Y)^T M (X - Y) \right\}. \quad (6.7)

M is the metric that we will be attempting to optimize, where M \succeq 0. The objective function is given by Equation 6.8:

E = \frac{1 - \lambda}{N_c - 1} \sum_{i,j} \left\| \mathrm{DF}_i^c - \mathrm{DF}_j^c \right\|_M - \frac{\lambda}{N - N_c} \sum_{i,k} \left\| \mathrm{DF}_i^c - \mathrm{DF}_k^c \right\|_M + \gamma \| M \|_F, \quad (6.8)

where N_c is the number of training data in class c; λ and γ are variables to control the importance of push versus pull and of regularization, respectively; and the triplet {DF_i^c, DF_j^c, DF_k^c} defines the relationship between similar and dissimilar observations, i.e., DF_i^c is similar to DF_j^c and dissimilar to DF_k^c, as per the definitions outlined in Bellet et al. [2015]. Clearly there are three basic components: a pull term, which is small when the distance between similar observations is small; a push term, which is small when the distance between dissimilar observations is large; and a regularization term, which is small when the Frobenius norm (\| M \|_F = \sqrt{\mathrm{tr}(M M^H)}) of M is small. Thus the algorithm attempts to bring similar distribution fields closer together, while pushing dissimilar ones farther apart, while attempting to minimize the complexity of the metric M. The regularizer on the metric M guards against overfitting and consequently enhances the algorithm's ability to generalize, i.e., allows for operation across data sets. This regularization strategy is similar to popular regression techniques like lasso and ridge [Hastie et al., 2009]. The additional parameters λ and γ weight the importance of the push–pull terms and the metric regularizer, respectively. These free parameters are typically tuned via standard cross-validation techniques on the training data.
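The matrix distance of Equation 6.7 is straightforward to evaluate; below is a minimal sketch (the function name is ours). With M = I it reduces to the squared Frobenius distance, which is what makes the learned M an interpretable reweighting of the phase bins.

```python
import numpy as np

def df_distance(X, Y, M):
    """Eq. 6.7: trace-form metric distance between two DF matrices under metric M.

    X, Y : (nx, ny) distribution field matrices.
    M    : (nx, nx) symmetric PSD metric; np.eye(nx) recovers the
           squared Frobenius distance.
    """
    D = X - Y
    return np.trace(D.T @ M @ D)
```
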
The objective function represented by Equation 6.8 is quadratic in the unknown metric M; hence it is possible to obtain the following closed-form solution to the minimization of the Equation 6.8 objective function:

M = \frac{\lambda}{\gamma (N - N_c)} \sum_{i,k} \left( \mathrm{DF}_i^c - \mathrm{DF}_k^c \right) \left( \mathrm{DF}_i^c - \mathrm{DF}_k^c \right)^T - \frac{1 - \lambda}{\gamma (N_c - 1)} \sum_{i,j} \left( \mathrm{DF}_i^c - \mathrm{DF}_j^c \right) \left( \mathrm{DF}_i^c - \mathrm{DF}_j^c \right)^T. \quad (6.9)

Equation 6.9 does not guarantee that M is positive semi-definite (PSD). To ensure this property, we can apply the following straightforward projection step after calculating M to ensure the requirement of M \succeq 0:

1. eigendecompose the metric: M = U^T \Lambda U;
2. generate \Lambda_+ = \max(0, \Lambda), i.e., select the positive eigenvalues;
3. reconstruct the metric M: M = U^T \Lambda_+ U.

This projected metric is used in the classification algorithm. The metric learned from this push–pull methodology is used in conjunction with a standard k-nearest neighbor (k-NN) classifier.

The traditional k-NN algorithm is a nonparametric classification method; it uses a voting scheme based on an initial training set to determine the estimated label [Altman, 1992]. For a given new observation, the L2 Euclidean distance is found between the new observation and all points in the training set. The distances are sorted, and the k closest training sample labels are used to determine the new observed sample's estimated label (majority rule). Cross-validation is used to find an optimal k value, where k is any integer greater than zero.

The k-NN algorithm estimates a classification label based on the closest samples provided in training. For our implementation, the distance between a new pattern DF_i and each pattern in the training set is found using the optimized metric, as opposed to the identity metric that would have been used in the L2 Euclidean distance case. The new pattern is classified depending on the majority of the closest k class labels. The distance between patterns is given in Equation 6.7, using the learned metric M.
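The three projection steps above can be sketched as follows, using `numpy.linalg.eigh` (which returns the eigenvalues and orthonormal eigenvectors of a symmetric matrix). The explicit symmetrization line is our addition, a guard against numerical asymmetry in the closed-form M; the function name is ours.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    M = 0.5 * (M + M.T)           # guard against numerical asymmetry
    w, U = np.linalg.eigh(M)      # eigendecomposition: M = U diag(w) U^T
    w_plus = np.maximum(w, 0.0)   # select the positive eigenvalues
    return (U * w_plus) @ U.T     # reconstruct with the clipped spectrum
```

The projected matrix is the nearest PSD matrix to M in the Frobenius norm, so the learned metric is altered as little as possible while guaranteeing nonnegative distances.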
The new OEEB systems discovered by the method of automated detection proposed here can be used to further investigate their frequency of occurrence, provide constraints on existing light curve models, and provide parameters to look for these systems in future large-scale variability surveys like LSST.

The algorithm implements fivefold cross-validation [Duda et al., 2012]; the algorithmic details associated with the cross-validation process are beyond the scope of this article, but in short, (1) the algorithm splits each class in the labeled data in half, with one half used in training and the other in testing; (2) the training data are further subdivided into five partitions; and (3) as the algorithm is trained, these partitions are used to generate a training set using four of the five partitions and a cross-validation set using the fifth.

The minimization of the misclassification rate is used to optimize the floating parameters in the design, such as the number of x-bins, the number of y-bins, and the k-value. Some parameters are more sensitive than others; often this insensitivity is related to the loss function, the feature space, or the data themselves. For example, the γ and λ values weakly affected the optimization, while the bin sizes and k-values had a stronger effect. The cross-validation process was then reduced to optimizing the n_x, n_y, and k values; the n_x and n_y values were tested over values 20–40 in steps of 5. The set of optimized parameters is given as γ = 1.0, λ = 0. , n_x = 25, n_y = 35, and k = 3. Given the optimization of these floating variables in all three algorithms, the testing data are then applied to the optimal designs.

The algorithm is applied to the Villanova Eclipsing Binary catalog entries that were not identified as either "Of Interest" or "Not of Interest," i.e., unlabeled for the purposes of our targeted goal.
The trained and tested data sets are combined into a single training set for application; the primary method (push–pull metric classification) is used to optimize a metric based on the optimal parameters found during cross-validation, and the system is applied to the entire Villanova Eclipsing Binary data set (2875 curves). On the basis of the results demonstrated in [Johnston and Oluseyi, 2017], the algorithm additionally conditions the detection process on a maximal distance allowed between a new unlabeled point and the training data set in the feature space of interest.

This anomaly detection algorithm is based on the optimized metric; a maximum distance between data points is derived from the training data set, and we use a fraction (0.75) of that maximum distance as a limit to separate "known" from "unknown." The value of the fraction was initially determined via trial and error, based on our experience with the data set and the goal of minimizing false alarms (which were visually apparent). This further restricts the algorithm to classifying only those targets that exist in "known space." The k-NN algorithm generates a distance dependent on the optimized metric; by restricting the distances allowed, we can leverage the algorithm to produce the equivalent of an anomaly detection algorithm.

The resulting paired algorithm (detector + distance limit) will produce estimates of "interesting" versus "not interesting," given new, unlabeled data. Our algorithm currently will not produce confidence estimates associated with the label. Confidence associated with detection can be a delicate subject, both for the scientists developing the tools and for the scientists using them.
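The distance-limit gate can be sketched as follows; this is a minimal Python illustration that uses a plain Euclidean distance as a stand-in for the learned metric distance (function names are hypothetical):

```python
import numpy as np

def fit_distance_gate(train_X, metric_dist, fraction=0.75):
    """Anomaly threshold: a fraction of the maximum pairwise distance
    observed within the training set (0.75 here, per the text)."""
    n = len(train_X)
    d_max = max(metric_dist(train_X[i], train_X[j])
                for i in range(n) for j in range(i + 1, n))
    return fraction * d_max

def is_known(x, train_X, metric_dist, threshold):
    """A new point is 'known' only if it lies within the threshold of
    at least one training point; otherwise it is flagged as an anomaly."""
    return min(metric_dist(x, t) for t in train_X) <= threshold

# Euclidean stand-in for the learned metric distance:
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
thr = fit_distance_gate(X, euclid)          # 0.75 * sqrt(2)
assert is_known([0.1, 0.1], X, euclid, thr)
assert not is_known([10.0, 10.0], X, euclid, thr)
```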
Here we have focused on implementing a k-NN algorithm with an optimized metric (i.e., metric learning); posterior probabilities of classification can be estimated from k-NN output [Duda et al., 2012] as k_c/(n · volume), but linking these posterior probability estimates to "how confident am I that this is what I think this is" is often not the best choice of description. Confidence in our detections will be a function of the original PPML classification algorithm performance, the training set used, the confidence in the labeling process, and the anomaly detection algorithm we implemented. Even k_c/(n · volume) would not be a completely accurate description in our scenario. Some researchers [Dalitz, 2009] have worked on linking "confidence" in k-NN classifiers to the distance between points. Our introduction of an anomaly detection algorithm into the design thus allows a developer/user to limit the false alarm rate by introducing a maximum acceptable distance, providing some control over confidence in the classification results; see [Johnston and Oluseyi, 2017] for more information.

Once we remove the discovered targets that were also in the initial training data, the result is a conservative selection of 124 potential targets of interest, listed in a supplementary digital file at the project repository. We here present an initial exploratory data analysis performed on the phased light curve data. At a high level, the mean and standard deviation of the discovered curves are presented in Figure 6.4. A more in-depth analysis of the meaning of the distribution functional shapes is left for future study. Such an effort would include additional observations (spectroscopic and photometric additions would be helpful) as well as analysis using binary simulator code such as Wilson–Devinney [Prša and Zwitter, 2005]. It is noted that, in general, there are some morphological consistencies across the discovered targets:
1. In the majority of the discovered OEEB systems, the first maximum following the primary eclipse is greater than the second maximum following the secondary eclipse.

[Figure 6.4: Mean (solid) and ±1σ standard deviation (dashed) of the distribution of O'Connell effect eclipsing binary phased light curves discovered via the proposed detector out of the Kepler Eclipsing Binary catalog. Data: ./supplement/AnalysisOfClusters.xlsx]

2. The light curve relative functional shape from the primary eclipse (minimum) to the primary maximum is fairly consistent across all discovered systems.

3. The difference in relative amplitude between the two maxima does not appear to be consistent, nor is the difference in relative amplitude between the minima.

We perform additional exploratory data analysis on the discovered group via subgroup partitioning with unsupervised clustering. The k-means clustering algorithm with matrix-variate distances, presented as part of the comparative methodologies, is applied to the discovered data set (their DF feature space). This clustering is presented to provide more detail on the discovered group's morphological shapes. The associated 1-D curves generated by the SUPERSMOOTHER algorithm are presented with respect to their respective clusters (clusters 1–8) in Figure 6.5.

The clusters generated were initialized with random starts; thus, additional iterations can potentially result in different groupings. The calculated metric values and the cluster numbers for each star are presented in the supplementary digital file. A plot of the measured metrics, as well as estimated values of period and temperature (as reported by the Villanova Kepler Eclipsing Binary database), is given with respect to the cluster assigned by k-means. Following figure 4.6 in [McCartney, 1999], the plot of OER versus Δm is isolated and presented in Figure 6.6. The linear relationship between OER and Δm reported in [McCartney, 1999] is apparent in the discovered Kepler data as well.
The data set here extends over a broader range in both OER and Δm than that reported by [McCartney, 1999], likely resulting from our additional application of min-max amplitude scaling. A gap in the Δm distribution is also apparent in Figure 6.6.

[Figure 6.6: OER versus Δm for the discovered systems; see ./Figures/ReducedFeaturesKeplerAll Temp.png]

[Table 6.3: Cluster statistics (Δm, σ_Δm/Δm, OER, σ_OER/OER, LCA, σ_LCA/LCA). Note: metrics are based on the [McCartney, 1999] proposed values of interest.]

The clusters identified by the k-means algorithm applied to the DF feature space roughly correspond to groupings in the OER/Δm feature space (clustering along the diagonal). The individual cluster statistics (mean and relative error) with respect to the metrics measured here are given in Table 6.3. All of the clusters have a positive mean Δm, save for cluster 6. The morphological consistency within a cluster is visually apparent in Figure 6.5, and also in the relative error of LCA, with clusters 5 and 7 being the least consistent. The next step will include applications to other surveys.

We further demonstrate the algorithm with an application to a separate, independent survey. Machine learning methods have been applied to classifying variable stars observed by the LINEAR survey [Sesar et al., 2011]; while these methods have focused on leveraging Fourier domain coefficients and photometric measurements {u, g, r, i, z} from SDSS, the data also include best estimates of period, as all of the variable stars trained on had cyclostationary signatures. It is then trivial to extract the phased light curve for each star and apply our Kepler-trained detector to the data to generate "discovered" targets of interest.

Table 6.4: Discovered OEEBs from LINEAR

The discovered targets are aligned, and the smoothed light curves are presented in Figure 6.7. Note that the LINEAR IDs are presented in Table 6.4 and as a supplementary digital file at the project repository. Application of our Kepler-trained detector to LINEAR data results in 24 "discovered" OEEBs.
These include four targets with a negative O'Connell effect. Similar to the Kepler discovered data set, we plot the OER/Δm features using lower-resolution phase binnings (n = 20) and see that the distribution and relationship from [McCartney, 1999] hold here as well (see Figure 6.8). The discovered LINEAR targets are listed in the supplementary file ./supplement/LINEARDiscovered.xlsx.

The pairing of the DF feature space and push–pull matrix metric learning represents a novel design; thus it is difficult to draw conclusions about the performance of the design, as there are no similar studies that have trained on this particular data set, targeted this particular variable type, used this feature space, or used this classifier. As we discussed earlier, classifiers that implement matrix-variate features are uncommon; we therefore compare against two more standard designs: k-NN with the L2 distance applied to the phased light curves (Method A) and a k-means representation with quadratic discriminant analysis (QDA) (Method B). Method A is similar to the UCR [Chen et al., 2015b] time series data baseline algorithm, reported as part of the database. Provided here is a direct k-NN classification algorithm applied directly to the smoothed, aligned, regularly sampled phased light curve. This regular sampling is generated via interpolation of the smoothed data set and is required because the nearest neighbor algorithm requires a one-to-one distance. Standard procedures can then be followed [Hastie et al., 2009]. Method B borrows from Park et al. [2003], transforming the matrix-variate data into vector-variate data via estimation of distances between our training set and a smaller set of exemplar mean DFs generated via unsupervised learning. Distances were found using the Frobenius norm of the difference between the two matrices.

Whereas Method A uses neither the DF feature representation nor the metric learning methodology, Method B uses the DF feature space but not the metric learning methodology. This presents a problem, however, as most standard out-of-the-box classification methods require a vector input.
Indeed, many methodologies, even when faced with a matrix input, choose to vectorize the matrix. An alternative to this implementation is a secondary transformation into a lower-dimensional feature space. Following the work of Park et al. [2003], we implement a matrix-distance k-means algorithm (i.e., k-means with a Frobenius norm) to generate estimates of clusters in the DF space. Observations are transformed by finding the Euclidean distance between each training point and each of the k mean matrices discovered. The resulting set of k distances is treated as the input pattern, allowing the use of the standard QDA algorithm [Duda et al., 2012]. The performances of both the proposed methodology and the two comparative methodologies are presented in Table 6.5. The algorithms are available as open source code, along with our novel implementation, at the project repository.

Table 6.5: Comparison of Performance Estimates across the Proposed Classifiers (Based on Testing Data)

             PPML    Method A    Method B
Error rate   12.5%   15.6%       12.7%

We present the performance of the main novel feature space/classification pairing as well as the two additional implementations that rely on more standard methods. Here we have evaluated performance based on the misclassification rate, i.e., 1 − accuracy, given by Fawcett [2006] as 1 − correct/total. The method we propose has a marginally better misclassification rate (Table 6.5) and has the added benefits of (1) not requiring unsupervised clustering, which can be inconsistent, and (2) providing nearest neighbor estimates, allowing for direct comparison. These performance estimate values are dependent on the initially selected training and testing data. They have been averaged and optimized via cross-validation; however, with so little initial training data, and with a selection process in which training and testing data are randomized, performance estimates may vary.
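Method B's transformation can be sketched as follows. This is an illustrative numpy version (a toy k-means, not the dissertation's MATLAB implementation); the resulting k-vector of distances would then feed a standard vector classifier such as QDA:

```python
import numpy as np

def kmeans_frobenius(mats, k, iters=50, seed=0):
    """Toy k-means on matrix-variate data using the Frobenius norm,
    as in the Method B vectorization described above."""
    rng = np.random.default_rng(seed)
    mats = np.asarray(mats)                       # shape (n, rows, cols)
    centers = mats[rng.choice(len(mats), k, replace=False)]
    for _ in range(iters):
        # Frobenius distance from every matrix to every center
        d = np.linalg.norm(mats[:, None] - centers[None], axis=(2, 3))
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = mats[assign == j].mean(axis=0)
    return centers

def to_distance_features(mats, centers):
    """Each matrix becomes a k-vector of Frobenius distances to the
    exemplar means; a standard vector classifier (e.g., QDA) follows."""
    mats = np.asarray(mats)
    return np.linalg.norm(mats[:, None] - centers[None], axis=(2, 3))
```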
Of course, increases in training data will result in increased confidence in performance results. We have not included computational times as part of this analysis, as they tend to be dependent on the system they are run on. We can anecdotally report that, on the system used as part of this research (MacBook Pro, 2.5 GHz Intel i7, 8 GB RAM), the training optimization of our proposed feature extraction and PPML classification took less than 5–10 minutes total to run, with variation depending on whatever else was running in the background. Use of the classifiers on unlabeled data resulted in a classification in fractions of a second per star. We should note, however, that this algorithm will speed up if implemented on a parallel processing system, as much of the time taken in the training process resulted from linear algebra operations that can be parallelized.

The DF representation maps deterministic, functional stellar variable observations to a stochastic matrix, with the rows summing to unity. The inherently probabilistic nature of DFs provides a robust way to mitigate interclass variability and handle the irregular sampling rates associated with stellar observations. Because the DF feature is indifferent to sampling density, so long as all points along the functional shape are represented, the trained detection algorithm we generate and demonstrate in this article can be trained on Kepler data but directly applied to the LINEAR data, as shown in section 6.4.3.

The algorithm, including comparison methodologies, designed feature space transformations, classifiers, utilities, and so on, is publicly available at the project repository (https://GitHub.com/kjohnston82/OCDetector); all code was developed in MATLAB and was run on MATLAB 9.3.0 (R2017b). The operations included here can be executed via calling individual functions.

This design is modular enough to be applied as is to other types of stars and star systems that are cyclostationary in nature.
With a change in feature space, specifically one that is tailored to the target signatures of interest and based on prior experience, this design can be replicated for other targets that do not demonstrate a cyclostationary signal (i.e., impulsive, nonstationary, etc.) and even for targets of interest that are not time variable in nature but have a consistent observable signature (e.g., spectrum, photometry, image point-spread function). One of the advantages of attempting to identify the O'Connell effect eclipsing binary is that one needs only the phased light curve, and thus the dominant period allowing a phasing of the light curve, to perform the feature extraction and thus the classification. The DF process here allows for a direct transformation into a single feature space that focuses on functional shape.

For other variable stars, a multiview approach might be necessary: descriptions of the light curve signal across multiple transformations (e.g., wavelet and DF), across representations (e.g., polarimetry and photometry), or across frequency regimes (e.g., optical and radio) would be required in the process of properly defining the variable star type. The solution to this multiview problem is neither straightforward nor well understood [Akaho, 2006]. Multiple options have been proposed. (Project code: https://GitHub.com/kjohnston82/VariableStarAnalysis) k-d trees, as well as other feature space partitioning methods, have been shown to reduce the computational requirements.

The method we have outlined here has demonstrated the ability to detect targets of interest given a training set consisting of expertly labeled light curve training data. The procedure presents two new functionalities: the distribution field, a shape-based feature space, and the push–pull matrix metric learning algorithm, a metric learning algorithm derived from LMNN that allows for matrix-variate similarity comparisons.
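As an illustration of the k-d tree partitioning mentioned above, a minimal build-and-query sketch (pure Python/numpy; not the dissertation's implementation):

```python
import numpy as np

def build_kdtree(points, idx=None, depth=0):
    """Minimal k-d tree: split alternately on each coordinate axis."""
    if idx is None:
        idx = list(range(len(points)))
    if not idx:
        return None
    axis = depth % points.shape[1]
    idx = sorted(idx, key=lambda i: points[i][axis])
    mid = len(idx) // 2
    return {"i": idx[mid], "axis": axis,
            "left": build_kdtree(points, idx[:mid], depth + 1),
            "right": build_kdtree(points, idx[mid + 1:], depth + 1)}

def nearest(node, points, q, best=None):
    """Branch-and-bound nearest-neighbor search over the tree."""
    if node is None:
        return best
    d = float(np.linalg.norm(points[node["i"]] - q))
    if best is None or d < best[0]:
        best = (d, node["i"])
    axis = node["axis"]
    diff = q[axis] - points[node["i"]][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, points, q, best)
    if abs(diff) < best[0]:            # hypersphere crosses the split plane
        best = nearest(far, points, q, best)
    return best

pts = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [9.0, 2.0]])
tree = build_kdtree(pts)
d, i = nearest(tree, pts, np.array([1.1, 0.9]))
assert i == 2                          # (1, 1) is the closest point
```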
A comparison to less novel, more standard methods was demonstrated on a Kepler eclipsing binary sub-dataset that was labeled by an expert in the field of O'Connell effect binary star systems. The performance of the three methods is presented; the proposed methodology (DF + push–pull metric learning) is comparable to or outperforms the other methods. As a demonstration, the design is applied to Kepler eclipsing binary data and LINEAR data. Furthermore, the increase in the number of systems, and the presentation of the data, allows us to make additional observations about the distribution of curves and trends within the population. Future work will involve the analysis of these statistical distributions, as well as inference as to their physical meaning.

The new OEEB systems we discovered by the method of automated detection proposed here can be used to further investigate their frequency of occurrence, provide constraints on existing light curve models, and provide parameters to look for these systems in future large-scale variability surveys such as LSST. Although the effort here targets OEEBs as a demonstration, it need not be limited to those particular targets. We could use the DF feature space along with the push–pull metric learning classifier to construct a detector for any variable stars with periodic variability. Furthermore, any variable star type (e.g., supernovae, RR Lyrae, Cepheids, eclipsing binaries) can be targeted using this classification scheme, given an appropriate feature space transformation allowing for quantitative evaluation of similarity. This design could also be directly applicable to exoplanet discovery, either via light curve detection (e.g., to detect eclipsing exoplanets) or via machine learning applied to other means (e.g., spectral analysis).
Chapter 7

Multi-View Classification of Variable Stars Using Metric Learning

The classification of variable stars relies on a proper selection of feature spaces of interest and a classification framework that can support the linear separation of those features. Features should be selected that quantify the telltale signature of the variability: its structure and information content. Prior studies generated feature spaces such as the Slotted Symbolic Markov Model [SSMM, Johnston and Peter, 2017], the Fourier transform [Deb and Singh, 2009], the wavelet transformation, and the Distribution Field [DF, Johnston et al., forthcoming]; these attempt to completely differentiate, or linearly separate, the various variable star classes. These efforts include expert-selected feature efforts [Debosscher, 2009, Sesar et al., 2011, Richards et al., 2012, Graham et al., 2013a, Armstrong et al., 2016, Mahabal et al., 2017, Hinners et al., 2018], automated feature selection efforts [McWhirter et al., 2017, Naul et al., 2018], and unsupervised methods for feature extraction [Valenzuela and Pichara, 2018, Modak et al., 2018].

The astroinformatics-community standard features include quantifications of statistics associated with the time domain photometric data, Fourier decomposition of the data, and color information in both the optical and infrared domains [Nun et al., 2015, Miller et al., 2015]. The number of individual features commonly used is upwards of 60 and growing [Richards et al., 2011] as the number of variable star types increases, and as a result of further refinement of classification definitions [Samus' et al., 2017].

Similarly, these efforts are limited in either size or scope by the goals of the survey from which the training data were originally derived [Angeloni et al., 2014], the developer's or scientist's research interests [Pérez-Ortiz et al., 2017, McCauliff et al., 2015], or a subset of the top five to ten most frequent class types Kim and Bailer-Jones [2016], Pashchenko et al.
[2018], Naul et al. [2018]. In our research, no efforts were found in the literature that address all variables identified by Samus' et al. [2017]; most address some subset. For an informative breakdown of the different types of variable stars, see Eyer and Blake [2005]. As surveys become more complete and more dense in observations, the complexity of the problem is likely to grow [Bass and Borne, 2016].

Herein lies the complication of expertly selected feature sets: their original function is keyed to the original selection of variable stars of interest. Additionally, the features selected are often collinear, resulting in little to no new information or separability despite the increase in dimensionality and the additional computational power needed to manage the data [D'Isanto et al., 2016]. Growing the feature dimensionality, via either additional feature space transformations or the addition of information resulting from multi-messenger astronomy, increases the sparsity of the training data's representation of the class feature distribution. This requires increasingly complex classifier designs to support both the dimensionality and the potentially nonlinear class-space separation.

Curiously, aside from efforts to construct a classification algorithm from the time domain data directly [McWhirter et al., 2017], few efforts in astroinformatics have looked at features beyond those described above, mostly Fourier domain transformations or time domain statistics. Considering the depth of possibility for time domain transformations [Fu, 2011, Grabocka et al., 2012, Cassisi et al., 2012, Fulcher et al., 2013], it is surprising that the community has focused on just a few transforms.
Similarly, the astroinformatics community has focused on just a few classifiers as well, limited to mostly standard classifiers, specifically decision tree algorithms such as random forest classifiers. Prior studies have initially addressed the potential of using metric learning as a means for classification of variable stars [Johnston et al., forthcoming]. Metric learning has a number of benefits that are advantageous to the astronomer:

• Metric learning uses nearest neighbor (k-NN) classification to generate the decision space [Hastie et al., 2009, Duda et al., 2012]; k-NN provides instant clarity into the reasoning behind the classifier's decision (based on similarity: "x_i is closer to x_j than to x_k").

• Metric learning leverages side information (the supervised labels of the training data) to improve the metric, i.e., a transformation of the distance between points that favors a specific goal: similar points closer together, different points further apart, simplicity of the metric, feature dimensionality reduction, etc. This side data is based on previously observed and analyzed data; thus decisions have a grounding in expert identification as opposed to black-box machine learning [Bellet et al., 2015]. Dimensionality reduction in particular can be helpful for handling feature spaces that are naturally sparse.

• k-NN can be supported by other algorithmic structures, such as data partitioning methods, to allow for a rapid response time in assigning labels to new observations, despite relying on a high number of training data [Faloutsos et al., 1994].

• An anomaly detection functionality [Chandola et al., 2009], which has been shown to be necessary to generate meaningful classifications [see Johnston and Peter, 2017, Johnston and Oluseyi, 2017], is easily constructed from the k-NN metric learning framework.

The following procedure requires only processed (artifact-removed) time domain data, i.e., light curves.
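The nearest-neighbor rule under a learned metric, described in the first bullet above, can be sketched as follows (illustrative numpy; `knn_predict` is a hypothetical name, and the metric M would come from the learning step):

```python
import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, M, k=3):
    """k-NN vote under a learned (PSD) metric M:
    d_M(x, x') = sqrt((x - x')^T M (x - x'))."""
    diffs = X_train - x_new
    # quadratic form per training point: sum_j sum_l diffs[i,j] M[j,l] diffs[i,l]
    d = np.sqrt(np.einsum("ij,jl,il->i", diffs, M, diffs))
    nearest = np.argsort(d)[:k]                  # k most similar training points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
label = knn_predict(np.array([0.2, 0.5]), X_train, y_train, np.eye(2))
```

With the identity matrix in place of M this reduces to plain Euclidean k-NN; substituting the optimized metric changes which neighbors count as "close."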
The features extracted and used for classification are based only on the time domain data. This paper outlines a number of novel developments in the area of time-domain variable star classification that are of major benefit to the developer/researcher. First, we demonstrate both the Slotted Symbolic Markov Model [Johnston and Peter, 2017] and the Distribution Field [Johnston et al., forthcoming] transforms as viable feature spaces for the classification of variable stars on their own; SSMM requires no phasing of the time domain data but still provides a feature that is shape based, while DF allows for consideration of the whole phased waveform without additional picking and choosing of metrics from the waveform [i.e., see Richards et al., 2012].

Second, we demonstrate that metric learning is a viable means of classification of variable stars, with dramatic benefits to the user. Metric learning decisions have an implicit traceability: the ability to follow from the classifier's decision, to the weights associated with each individual feature used as part of the classification, to the nearest neighbors used in making the decision, provides a clear idea of why the classifier made the decision. This direct comparison of newly observed data with prior observations, and the justification via historical comparison, makes this method ideal for astronomical, and indeed scientific, applications.

Lastly, this paper introduces multi-view learning as a methodology that can provide a major benefit to the astronomical community. Astronomy often deals with multiple transformations (e.g., Fourier domain, wavelet domain, statistical, etc.) and multiple domains of data types (visual, radio frequency, high energy, particle, etc.). The ability to handle, and just as importantly to co-train an optimization algorithm on, multiple domains of data will be necessary as the multitude of data grows.
The project software is provided publicly at the associated GitHub repository (https://github.com/kjohnston82/VariableStarAnalysis). This design will be generic enough that it can be transferred from project to project, survey to survey, and class space to class space, with a minimal change in features, while still being able to maximize performance of the classifier with respect to targeted project goals. This paper is organized as follows: (1) summarize current stellar variable classification efforts, the features currently in use, and the machine learning methodologies exercised; (2) review the features used (statistical, SSMM, and DF); (3) review the classification methods (k-NN, LM3L, and LM3L-MV); (4) demonstrate our optimization of the feature extraction algorithm for our datasets, leveraging "simple" classification methods (k-NN) and cross-validation processes; (5) demonstrate our optimization of classification parameters for LM3L and LM3L-MV via cross-validation; and (6) report on the performance of the feature/classifier pairing. Our proposal is an implementation of both the feature extraction and the classifier for the purposes of multi-class identification that can handle raw observed data.

We present an initial set of time domain feature extraction methods; the design demonstrated is modular in nature, allowing a user to append or substitute feature spaces that an expert has found to be of utility in the identification of variable stars.
Although our initial goal is variable star identification, given a separate set of features this method could be applied to other astroinformatics problems (e.g., image classification for galaxies, spectral identification for stars or comets, etc.). While we demonstrate the classifier with a multi-class classification design, which is common in the astroinformatics references we have provided, the design here can easily be transformed into a one-vs-all design [Johnston and Oluseyi, 2017] for the purposes of generating a detector or classifier tailored specifically to a user's needs [Johnston et al., forthcoming].

7.2.1 Signal Conditioning

Required are features that can respond to the various signal structures that are unique to the classes of interest (i.e., phased light curve shape, frequency distribution, phase distribution, etc.). Our implementation starts with raw data (such as astronomical light curves) as primary input, which are then mapped into a specific feature space. To support these transformations, a set of signal conditioning methods is implemented for the two new feature spaces presented below. These techniques are based on the methods presented in Johnston and Peter [2017] and are fairly common in the industry. The data that are leveraged, with respect to classification of the waveform, are on the order of hundreds of observations over multiple cycles. While the data are not cleaned as part of the upfront process, the features that are implemented are robust enough not to be affected by intermittent noise. The raw waveform is left relatively unaffected; however, smoothing does occur on the phased waveform to generate a new feature vector, i.e., a phased smoothed waveform.

The phased waveform is generated via folding the raw waveform about a period found to best represent the cyclical process [Graham et al., 2013b]. The SUPERSMOOTHER algorithm [Friedman, 1984] is used to smooth the phased data into a functional representation.
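The fold-then-smooth step can be sketched as follows; this numpy illustration substitutes a simple running median for Friedman's SUPERSMOOTHER (the real pipeline uses the latter):

```python
import numpy as np

def phase_fold(t, y, period):
    """Fold a light curve about a known period; phase in [0, 1)."""
    phase = np.mod(t, period) / period
    order = np.argsort(phase)
    return phase[order], y[order]

def moving_median(y, width=5):
    """Crude stand-in for SUPERSMOOTHER: a running median over the
    phase-sorted samples."""
    half = width // 2
    pad = np.pad(y, half, mode="edge")
    return np.array([np.median(pad[i:i + width]) for i in range(len(y))])

t = np.linspace(0, 10, 500)
y = np.sin(2 * np.pi * t / 2.5)        # toy signal, period = 2.5
phase, y_phased = phase_fold(t, y, 2.5)
y_smooth = moving_median(y_phased)
```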
Additionally, in some cases the originating survey/mission will perform some of these signal conditioning processes as part of its analysis pipeline (e.g., Kepler). This includes outlier removal, period finding, and long-term trend removal. Most major surveys include a processing pipeline; our modular analysis methods provide a degree of flexibility that allows the implementer to take advantage of these pre-applied processes. Of specific use: while our SSMM feature extraction does not require a phased waveform, the DF feature does; thus period finding methods are of importance.

Most of the period finding algorithms are methods of spectral transformation with an associated peak/max/min finding algorithm and include such methods as the discrete Fourier transform, wavelet decomposition, least squares approximations, string length, auto-correlation, conditional entropy, and auto-regressive methods. Graham et al. [2013b] review these transformation methods (with respect to period finding) and find that the optimal period finding algorithm differs for different types of variable stars. The Lomb–Scargle method was selected as the main method for generating a primary period for this implementation. Our implementation of the Lomb–Scargle algorithm is provided as part of the Variable Star package (fit.astro.vsa.analysis.feature.LombNormalizedPeriodogram).

For our investigation we have selected feature spaces that quantify the functional shape of a repeated signal, i.e., a cyclostationary signal, but are dynamic enough to handle impulsive-type signals (e.g., supernovae) as well. This particular implementation design makes the most intuitive sense: visual inspection of the light curve is how experts identify these sources. Prior research on time domain data identification has varied between generating machine-learned features [Bos et al., 2002, Broersen, 2009, Blomme et al., 2011, Bolós and Benítez, 2014, Gagniuc, 2017], implementing generic features [e.g.,
Fourier domain features; Debosscher, 2009, Richards et al., 2012, Palaversa et al., 2013, Masci et al., 2014], and looking at shape- or functional-based features [e.g., DF, SSMM; Park and Cho, 2013, Haber et al., 2015].

7.2.2.1 Slotted Symbolic Markov Models (SSMM)

The Slotted Symbolic Markov Model (SSMM) is useful in the differentiation between variable star types [Johnston and Peter, 2017]. The time domain slotting described in Rehfeld et al. [2011] is used to regularize the sampling of the photometric observations. The resulting regularly sampled waveform is transformed into a state space [Lin et al., 2007, Bass and Borne, 2016]; thus the result of the conditioning is the stochastic process {y_n, n = 1, 2, ...}. The stochastic process is then used to populate the empty matrix P [Ge and Smyth, 2000]; the elements of P are populated as the transition state probabilities (equation 7.1):

P\{y_{n+1} = j \mid y_n = i, y_{n-1} = i_{n-1}, \ldots, y_1 = i_1, y_0 = i_0\} = P_{ij} \quad (7.1)

The populated matrix P is the SSMM feature and is often described as a first-order Markov matrix.

7.2.2.2 Distribution Field (DF)

A distribution field (DF) is an array of probability distributions, where the probability at each element is defined as [Helfer et al., 2015] (equation 7.2):

DF_{ij} = \frac{\sum_p \left[ y_j < x'(x_i \le p \le x_{i+1}) < y_{j-1} \right]}{\sum_p \left[ y_j < x'(p) < y_{j-1} \right]} \quad (7.2)

Note, [ · ] is the Iverson bracket [Iverson, 1962], and y_j and x_i are the corresponding normalized amplitude and phased time bins, respectively. The result is a 2-D histogram that is a right stochastic matrix, i.e., the rows sum to one. Bin numbers are optimized by cross-validation as part of the classification training process. Separately, SSMM itself is an effective feature for discriminating variable star types, as shown by Johnston and Peter [2017]. Similarly, DF has been shown to be a valuable feature for discriminating time domain signatures; see Helfer et al. [2015] and Johnston et al. [forthcoming].
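Equation 7.2 amounts to a row-normalized 2-D histogram over (phase bin, amplitude bin). A minimal numpy sketch (the bin counts n_x, n_y are the cross-validated parameters discussed above; the function name is hypothetical):

```python
import numpy as np

def distribution_field(phase, amp, n_x=25, n_y=35):
    """Build the DF: a 2-D histogram over (phase bin, amplitude bin),
    normalized so each phase-bin row sums to one (right stochastic)."""
    x_edges = np.linspace(0.0, 1.0, n_x + 1)
    y_edges = np.linspace(np.min(amp), np.max(amp), n_y + 1)
    H, _, _ = np.histogram2d(phase, amp, bins=[x_edges, y_edges])
    row_sums = H.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # guard against empty phase bins
    return H / row_sums
```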
The classification methodology known as metric learning has its roots in the understanding of how and why observations are considered similar. The very idea of similarity is based on the numerical measurement of distance, and a distance is computed via application of a distance function. Bellet et al. [2015] define the metric distance as equation 7.3:

d_M(x, x') = \sqrt{(x - x')^T M (x - x')}, \quad (7.3)

where x, x' ∈ X ⊆ R^d and the metric is required to satisfy M ∈ S_+^d, with S_+^d the cone of symmetric positive semi-definite (PSD) d × d real-valued matrices. Metric learning seeks to optimize this distance, via manipulation of the metric M, based on available side data. How the optimization occurs, what is focused on, and what is considered important, i.e., the construction of the objective function, is the underlying difference between the various metric learning algorithms.

The side information is defined as the set of labeled data {x_i, y_i}_{i=1}^n. Furthermore, a triplet is defined as (x_i, x_j, x_k), where x_i and x_j have the same label but x_i and x_k do not. It is expected then, based on the definition of similarity and distance, that d(x_i, x_j) < d(x_i, x_k), i.e., that the distances between similarly labeled points are smaller than the distances between dissimilar ones. Methods such as LMNN [Weinberger et al., 2009] use this inequality to define an objective function that optimizes the metric to bring similar things closer together while pushing dissimilar things further apart.

Given the metric learning optimization process, the result is a tailored distance metric and associated distance function (equation 6.7). This distance function is then used in a standard k-NN classification algorithm. The k-NN algorithm estimates a classification label based on the closest samples provided in training [Altman, 1992]. Given a training set of n patterns, we find the distance between a new pattern x_i and each pattern in the training set.
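Equation 7.3 and the triplet inequality can be illustrated directly; the diagonal metric below is a hypothetical example, not a learned result:

```python
import numpy as np

def d_M(x, xp, M):
    """Metric distance of equation 7.3: sqrt((x - x')^T M (x - x'))."""
    diff = np.asarray(x) - np.asarray(xp)
    return float(np.sqrt(diff @ M @ diff))

# A learned metric can reorder neighbors: stretching the second axis less
# than the first makes x_j (same class) closer than x_k (different class).
x_i, x_j, x_k = np.array([0., 0.]), np.array([0., 2.]), np.array([1.5, 0.])
I = np.eye(2)
M = np.diag([4.0, 0.25])               # hypothetical learned metric
assert d_M(x_i, x_k, I) < d_M(x_i, x_j, I)    # Euclidean: x_k is nearer
assert d_M(x_i, x_j, M) < d_M(x_i, x_k, M)    # learned: x_j is nearer
```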
The new pattern is classified according to the majority of class labels in the closest $k$ data points.

7.3 Challenges Addressed

In the application of the LM³L algorithm to our data we found a number of challenges not specified by the original paper that required attention. Some of these challenges were a direct result of our views (vectorization of matrix-variate data), and some resulted from practical application (hinge loss functionality and step-size optimization).

While the original LM³L paper does not specify details with respect to the implementation of the hinge loss functionality used, the numerical implementation of both the maxima and the derivative of the maxima are of critical importance. For the implementation here, the hinge loss functionality is approximated using Generalized Logistic Regression [Zhang and Oles, 2001, Rennie and Srebro, 2005]. Should a different approximation of hinge loss be requested, care should be given to the implementation, as definitions from various public sources are not consistent. For our purposes, the Generalized Logistic Regression is used to approximate the hinge loss ($h[z] \approx g_+(z, \phi)$) and is defined as equation 7.4:

$$g_+(z, \phi) = \begin{cases} 0 & z \le -10 \\ z & z \ge 10 \\ \frac{1}{\phi}\log\left(1 + \exp(\phi z)\right) & -10 < z < 10 \end{cases} \quad (7.4)$$

The derivative of the Generalized Logistic Regression is then given as equation 7.5:

$$\frac{\partial g_+(z, \phi)}{\partial z} = \begin{cases} 0 & z \le -10 \\ 1 & z \ge 10 \\ \frac{\exp(\phi z)}{1 + \exp(\phi z)} & -10 < z < 10 \end{cases} \quad (7.5)$$

For practical reasons (underflow/overflow) the algorithm is presented as a piecewise function; this is necessary because of the exponential in the functionality. In addition, the public literature is not consistent on the definition of the hinge loss approximation, specifically the relationship between the notations $[z]_+$, $h[z]$, $\max(0, z)$, and $g_+(z, \phi)$; usually the inconsistency is with respect to the input, i.e., $z$, $-z$, or $1 - z$. We have explicitly stated our implementation here to eliminate any confusion.
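A minimal NumPy sketch of the piecewise smooth-hinge approximation in equations 7.4 and 7.5; the production code is Java, and the $\pm 10$ cutoffs follow the piecewise definition above:

```python
import numpy as np

def g_plus(z, phi=2.0):
    """Generalized-logistic (softplus) approximation to the hinge max(0, z),
    implemented piecewise to avoid overflow in exp() (equation 7.4)."""
    z = np.asarray(z, dtype=float)
    out = np.where(z >= 10.0, z, 0.0)            # saturated branches
    mid = (z > -10.0) & (z < 10.0)
    soft = np.log1p(np.exp(phi * np.clip(z, -10.0, 10.0))) / phi
    return np.where(mid, soft, out)

def g_plus_deriv(z, phi=2.0):
    """Derivative of the approximation (equation 7.5): the logistic
    sigmoid of phi*z on the interior, 0 or 1 on the saturated branches."""
    z = np.asarray(z, dtype=float)
    out = np.where(z >= 10.0, 1.0, 0.0)
    mid = (z > -10.0) & (z < 10.0)
    sig = 1.0 / (1.0 + np.exp(-phi * np.clip(z, -10.0, 10.0)))
    return np.where(mid, sig, out)
```

At $z = 0$ the approximation evaluates to $\log(2)/\phi$ with slope $1/2$, and for large $|z|$ it matches the exact hinge.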
While LM³L provides an approximate "good" step size to use, in practice it was found that a single number was not necessarily useful. While the exact reasons why a constant step size was not beneficial were not investigated, the following challenges were identified:

1. The possibility of convergence was very sensitive to the step size.

2. Small step sizes that did result in a consistent optimization resulted in a very slow convergence.

3. While an attempt could be made to find an optimal step size with respect to all views, it seems unlikely this would occur given the disparate nature of the views we have selected (distribution field, photometric color, time domain statistics, etc.).

4. For the metric learning methods used here (in both the standard and the proposed algorithms), the objective function magnitude scales with the number of training data sets, the view dimensions, and the number of views, as is apparent from the individual component of LM³L: $\sum_{i,j} h\left[\tau_k - y_{ij}\left(\mu_k - d_{M_k}(x_{ki}, x_{kj})\right)\right]$. With an increasing number of training data, the objective function will increase, and the gradient component ($w_k^p \sum_{i,j} y_{ij}\, h'[z]\, C_{kij}$) will similarly be affected. This means that computational overflows could occur just by increasing the number of training data used.

In lieu of a single estimate, we propose a dynamic estimate of the step size per iteration per view. A review of step-size and gradient descent optimization methods [Ruder, 2016] suggests a number of out-of-the-box solutions to the question of speed (specifically methods such as Mini-Batch gradient descent). The question of dynamic step size requires more development; in particular, while methods exist, they are almost entirely focused on vector-variate optimization. Barzilai and Borwein [1988] outline a method for dynamic step size estimation that has its basis in secant root finding; the method described is extended here to allow for matrix-variate cases.
The gradient descent update for our metric learning algorithm is given as equation 7.6:

$$\mathbf{L}^{(t+1)} = \mathbf{L}^{(t)} - \beta \frac{\partial J}{\partial \mathbf{L}} \quad (7.6)$$

In the spirit of Barzilai and Borwein, herein known as the BB-step method, the descent algorithm is reformulated as equation 7.7:

$$\lambda_k = \arg\min_{\lambda} \left\| \Delta\mathbf{L} - \lambda\, \Delta g(\mathbf{L}) \right\|_F \quad (7.7)$$

where $\lambda_k$ is a dynamic step size to be estimated per iteration and per view, $\Delta g(\mathbf{L}) = \nabla f(\mathbf{L}^{(t)}) - \nabla f(\mathbf{L}^{(t-1)})$, and $\Delta\mathbf{L} = \mathbf{L}^{(t)} - \mathbf{L}^{(t-1)}$. With the Frobenius norm defined via $\|\mathbf{A}\|_F^2 = \mathrm{Tr}(\mathbf{A}\mathbf{A}^H)$, the BB-step condition can be written as equation 7.8:

$$\frac{\partial}{\partial \lambda} \mathrm{Tr}\left[ \left(\Delta\mathbf{L} - \lambda\,\Delta g(\mathbf{L})\right)\left(\Delta\mathbf{L} - \lambda\,\Delta g(\mathbf{L})\right)^H \right] = 0 \quad (7.8)$$

Based on the Matrix Cookbook [Petersen et al., 2008], equation 7.8 can be transformed into equation 7.9:

$$\mathrm{Tr}\left[ -\Delta g(\mathbf{L}) \left[\Delta\mathbf{L} - \lambda\,\Delta g(\mathbf{L})\right]^H - \left[\Delta\mathbf{L} - \lambda\,\Delta g(\mathbf{L})\right] \Delta g(\mathbf{L})^H \right] = 0 \quad (7.9)$$

With some algebra, equation 7.9 can be turned into a solution for our approximation of the optimal step size, given here as equation 7.10:

$$\hat{\lambda} = \frac{1}{2} \cdot \frac{\mathrm{Tr}\left[ \Delta g(\mathbf{L}) \cdot \Delta\mathbf{L}^H + \Delta\mathbf{L} \cdot \Delta g(\mathbf{L})^H \right]}{\mathrm{Tr}\left[ \Delta g(\mathbf{L}) \cdot \Delta g(\mathbf{L})^H \right]} \quad (7.10)$$

It is elementary to show that our methodology extends to $\Delta g(\mathbf{L}_k) = \nabla f(\mathbf{L}_k^{(t)}) - \nabla f(\mathbf{L}_k^{(t-1)})$ and $\Delta\mathbf{L}_k = \mathbf{L}_k^{(t)} - \mathbf{L}_k^{(t-1)}$; likewise we can estimate $\hat{\lambda}_k$ per view, so long as the estimates of both the gradient and the objective function are monitored at each iteration. While this addresses our observations, it should be noted that the fourth challenge outlined (scaling with increasing features and training data) was only partially addressed. Specifically, the above methodology does not address an initial guess of $\lambda_k$; in multiple cases it was found that this initial value was set too high, causing our optimization to diverge. Providing an initial metric in the form of $\sigma \mathbf{I}$, where $0 < \sigma < 1$, was used to offset a $J$ value (from the objective function) that was too high (overflow problems). Care should be taken to set both the initial $\lambda_k$ and $\sigma$ to avoid problems.
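The BB-step estimate of equation 7.10 can be exercised on a toy matrix-variate problem; the quadratic objective, the matrix A, and the initial fixed step below are illustrative stand-ins, not the LM³L objective:

```python
import numpy as np

def bb_step(L_prev, L_curr, g_prev, g_curr):
    """Matrix-variate Barzilai-Borwein step size (equation 7.10)."""
    dL = L_curr - L_prev
    dg = g_curr - g_prev
    num = 0.5 * np.trace(dg @ dL.T + dL @ dg.T)
    return num / np.trace(dg @ dg.T)

# Toy problem: minimize f(L) = 0.5 * ||L - A||_F^2, whose gradient is L - A
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
grad = lambda L: L - A

L = np.zeros((3, 4))
g = grad(L)
L_prev, g_prev = L.copy(), g
L = L - 0.1 * g                      # one fixed-step iteration to seed the BB update
for _ in range(50):
    g = grad(L)
    if np.linalg.norm(g) < 1e-12:    # converged; BB denominator would vanish
        break
    lam = bb_step(L_prev, L, g_prev, g)
    L_prev, g_prev = L.copy(), g
    L = L - lam * g
```

For this quadratic the secant estimate is exact, so the iteration converges to A almost immediately; in the LM³L setting the same update is simply applied per view and per iteration.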
The features focused on as part of our implementation include both vector-variate and matrix-variate views. The matrix-variate views require transformation from their matrix domain to a vectorized domain for implementation in the LM³L framework. The matrix-variate to vector-variate transformation implemented here is outlined in Johnston and Peter [2017]: the matrix is transformed to a vector domain as $\mathrm{vec}(\mathbf{X}_{ki}) = x_{ki}$. A dimensionality reduction process is implemented, as some of the matrices are large enough to result in large sparse vectors (e.g., a $20 \times 20$ DF matrix becomes a 400-element vector). To reduce the large sparse feature vector resulting from the unpacking of the matrix, we applied a supervised dimensionality reduction technique commonly referred to as extended canonical variate analysis (ECVA) [Nørgaard et al., 2006].

The methodology for ECVA has roots in principal component analysis (PCA). PCA is a procedure performed on large multidimensional datasets with the intent of rotating a set of possibly correlated dimensions into a set of linearly uncorrelated variables [Scholz, 2006]. The transformation results in a dataset where the first principal component (dimension) has the largest possible variance. PCA is an unsupervised methodology, i.e., known labels for the data being processed are not taken into consideration; thus a reduction in feature dimensionality will occur, and while it maximizes the variance, it might not maximize the linear separability of the class space. In contrast to PCA, canonical variate analysis (CVA) does take class labels into consideration. The variation between groups is maximized, resulting in a transformation that benefits the goal of separating classes. Given a set of data $x$ with $g$ different classes and $n_i$ observations of each class, and following Johnson et al. [1992], the within-group and between-group covariance matrices are defined as equations 7.11 and 7.12, respectively.
$$\mathbf{S}_{\mathrm{within}} = \frac{1}{n - g} \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' \quad (7.11)$$

$$\mathbf{S}_{\mathrm{between}} = \frac{1}{g - 1} \sum_{i=1}^{g} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})' \quad (7.12)$$

where $n = \sum_{i=1}^{g} n_i$, $\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}$, and $\bar{x} = \frac{1}{n} \sum_{i=1}^{g} n_i \bar{x}_i$. CVA attempts to maximize equation 7.13:

$$J(w) = \frac{w' \mathbf{S}_{\mathrm{between}}\, w}{w' \mathbf{S}_{\mathrm{within}}\, w} \quad (7.13)$$

The equation is solvable so long as $\mathbf{S}_{\mathrm{within}}$ is non-singular, which need not be the case, especially when analyzing multicollinear data. When the dimensions of the observed patterns are multicollinear, additional considerations need to be made. Nørgaard et al. [2006] outline a methodology (ECVA) for handling these cases in CVA. Partial least squares analysis, PLS2 [Wold, 1939], is used to solve the above linear equation, resulting in an estimate of $w$ and, given that, an estimate of the canonical variates (the reduced dimension set). The application of ECVA to our vectorized matrices results in a reduced feature space of dimension $g - 1$; this reduced dimensional feature space, per view, is then used in the LM³L classifier.

We address the following classification problem: given a set of expertly labeled side data containing $C$ different classes (e.g., variable star types), where measurements can be made on the classes in question to extract a set of feature spaces for each observation, how do we define a distance metric that optimizes the misclassification rate? Specifically, how can this be done within the context of variable star classification based on the observation of photometric time-domain data? We have identified a number of features that may provide utility in discriminating between various types of stellar variables. We review how to combine this information together and generate a decision space; or rather, how to define the distance $d_{ij} = (x_i - x_j)' \mathbf{M} (x_i - x_j)$ when $x_i$ contains two matrices (SSMM or DF in our case).
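Stepping back to the ECVA front end: the CVA scatter construction of equations 7.11 and 7.12 and the Rayleigh-quotient objective of equation 7.13 can be sketched as follows. This illustrates plain CVA via a generalized eigenproblem with a non-singular within-group scatter; it does not reproduce ECVA's PLS2 handling of the multicollinear case:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-group (eq. 7.11) and between-group (eq. 7.12) scatter.
    X is (n, d); y holds integer class labels."""
    classes = np.unique(y)
    n, d = X.shape
    g = len(classes)
    xbar = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        D = Xc - mc
        Sw += D.T @ D
        diff = (mc - xbar)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    return Sw / (n - g), Sb / (g - 1)

def leading_canonical_direction(Sw, Sb):
    """Direction maximizing J(w) = w'Sb w / w'Sw w (eq. 7.13) via the
    generalized eigenproblem Sb w = J Sw w; Sw is assumed non-singular."""
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    return np.real(vecs[:, np.argmax(np.real(vals))])

# Two toy classes separated along the first axis
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)) + [5.0, 0.0],
               rng.normal(size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
Sw, Sb = scatter_matrices(X, y)
w = leading_canonical_direction(Sw, Sb)
```

With two classes, the single canonical direction recovered aligns with the axis of class separation, consistent with the $g - 1$ dimensionality of the reduced space.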
Specifically, we attempt to construct a distance metric based on multiple attributes of different dimensions (e.g., $\mathbb{R}^{m \times n}$ and $\mathbb{R}^{m \times 1}$). To respond to this challenge we investigate the utility of multi-view learning. For our purposes here we specify each individual measurement as the feature, and the individual transformations or representations of the underlying measurement as the feature space. Views are the generic independent collections of these features or feature spaces. Thus, if provided the color of a variable star in ugriz, the individual measurements of $u - g$ or $r - i$ are referred to here as features, but the collective set of colors is the feature space. Our methodology allows us to define sets of collections of these features and/or feature spaces as independent views; for example, all of the ugriz measurements, the vectorized DF measurement, the concatenation of time-domain statistics and colors, or the reduced (selected) sampling of Fourier spectra could each be individual views. These views are defined a priori by the expert.

Xu et al. [2013] and Kan et al. [2016] review multi-view learning and outline some basic definitions. Multi-view learning treats the individual views separately, but also provides some functionality for joint learning, where the importance of each view is dependent on the others. As an alternative to multi-view learning, the multiple views could be transformed into a single view, usually via concatenation. The cost-benefit analysis of concatenated single-view vs. multi-view learning is discussed in Xu et al. [2013] and is beyond the scope of this paper. Classifier fusion [Kittler et al., 1998, Tax, 2001, Tax and Muller, 2003] could be considered as another alternative to multi-view learning, with each view independently learned, resulting in an independent classification algorithm.
The results of this set of classifiers are combined together (mixing of posterior probability) to produce a single estimate of classification/label; this is similar to the operation of a Random Forest classifier, i.e., results from multiple individual trees are combined together to form a joint estimate. We differentiate between the single-view learning with concatenation, multi-view learning, and classifier fusion designs based on when the join of the views is considered in the optimization process: before, during, or after.

Multi-view learning can be divided into three topics: 1) co-training, 2) multiple-kernel learning, and 3) subspace learning. Each method attempts to consider all views during the training process. Multiple-kernel learning algorithms attempt to exploit kernels that naturally correspond to different views and combine kernels either linearly or non-linearly to improve learning performance [Gönen and Alpaydın, 2011, Kan et al., 2016]. Sub-space learning uses canonical correlation analysis (CCA), or a similar method, to generate an optimal latent representation of two views which can be trained on directly. The CCA method can be iterated multiple times based on the number of views; this process will frequently result in a dimensionality that is lower than the original space [Hotelling, 1936, Akaho, 2006, Zhu et al., 2012, Kan et al., 2016].

This work will focus on a method of co-training, specifically metric co-training. Large Margin Multi-Metric Learning (LM³L) [Hu et al., 2014, 2018] is an example of metric co-training; the designed objective function minimizes the objective of the individual views, as well as the difference between view distances, simultaneously. The full derivation of this algorithm is outlined in Hu et al. [2014], and the algorithm for optimization for LM³L is given as their Algorithm 1.
This algorithm is implemented in Java and is available as part of the software distribution. Our implementation also includes additional considerations not discussed in the original reference. These considerations were found to be necessary based on challenges discovered when we applied the LM³L algorithm to our data. The challenges and our responses are discussed in Appendix A.

In addition to the implementation of LM³L, we have developed a matrix-variate version as well (section 7.3.4). This matrix-variate classifier is novel with respect to multi-view learning methods and is one of two matrix-variate metric learning methods that we know of, the other being Push-Pull Metric Learning [Helfer et al., 2015].

7.3.4 Large Margin Multi-Metric Learning with Matrix Variates

The literature on metric learning methods is fairly extensive (see Bellet et al. [2015] for a review); however, all of the methods presented so far focus on the original definition based in $X \subseteq \mathbb{R}^{d \times 1}$, i.e., vector-variate learning. While the handling of matrix-variate data has been addressed here, the method requires a transformation (vec($x$) and then ECVA), which sidesteps the problem of directly operating on matrix-variate data. The literature on matrix-variate classification and operations is fairly sparse. The idea of a metric learning supervised classification methodology based on matrix-variate data is novel.

Most of the matrix-variate research has some roots in the work by Hotelling [1936] and Dawid [1981]. There are some key modern references to be noted as well: Ding and Cook [2014] and Ding and Dennis Cook [2018] address matrix-variate PCA and matrix-variate regression (matrix predictor and response), Dutilleul [1999] and Zhou et al. [2014] address the mathematics of the matrix normal distribution, and Safayani and Shalmani [2011] address matrix-variate CCA. Developing a matrix-variate metric learning algorithm requires a formal definition of distance for matrix-variate observations, i.e., where $X \subseteq \mathbb{R}^{p \times q}$.
Glanz and Carvalho [2013] define the matrix normal distribution as $\mathbf{X}_i \sim \mathcal{MN}(\mu, \Sigma_s, \Sigma_c)$, where $\mathbf{X}_i$ and $\mu$ are $p \times q$ matrices, $\Sigma_s$ is a $p \times p$ matrix defining the row covariance, and $\Sigma_c$ is a $q \times q$ matrix defining the column covariance. Equivalently, the relationship between the matrix normal distribution and the vector normal distribution is given as equation 7.14:

$$\mathrm{vec}(\mathbf{X}_i) \sim \mathcal{N}(\mathrm{vec}(\mu),\ \Sigma_c \otimes \Sigma_s). \quad (7.14)$$

The matrix-variate normal distribution is defined as equation 7.15 [Gupta and Nagar, 2000]:

$$P(\mathbf{X}_i; \mu, \Sigma_s, \Sigma_c) = (2\pi)^{-\frac{pq}{2}} \left| \Sigma_c \otimes \Sigma_s \right|^{-\frac{1}{2}} \exp\left\{ -\tfrac{1}{2}\, \mathrm{vec}(\mathbf{X}_i - \mu)^T (\Sigma_c \otimes \Sigma_s)^{-1}\, \mathrm{vec}(\mathbf{X}_i - \mu) \right\}. \quad (7.15)$$

This distribution holds for the features that we are using as part of this study, at least within the individual classes. The Mahalanobis distance between our observations is then defined for the matrix-variate case as equations 7.16 to 7.18:

$$d_{\Sigma}(\mathbf{X}, \mathbf{X}') = \mathrm{vec}(\mathbf{X} - \mathbf{X}')^T (\Sigma_c \otimes \Sigma_s)^{-1}\, \mathrm{vec}(\mathbf{X} - \mathbf{X}') \quad (7.16)$$
$$= \mathrm{vec}(\mathbf{X} - \mathbf{X}')^T\, \mathrm{vec}\left( \Sigma_s^{-1} (\mathbf{X} - \mathbf{X}')\, \Sigma_c^{-1} \right) \quad (7.17)$$
$$= \mathrm{tr}\left[ \Sigma_c^{-1} (\mathbf{X} - \mathbf{X}')^T\, \Sigma_s^{-1} (\mathbf{X} - \mathbf{X}') \right]. \quad (7.18)$$

This last form of the distance between matrices is used in our development of a metric learning methodology. Similar to the development of LM³L and the outline of Torresani and Lee [2007], we develop a metric learning algorithm for matrix-variate data. First, the Mahalanobis distance for the matrix-variate multi-view case is recast as equation 7.19:

$$d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kj}) = \mathrm{tr}\left[ \mathbf{U}_k (\mathbf{X}_{ki} - \mathbf{X}_{kj})^T \mathbf{V}_k (\mathbf{X}_{ki} - \mathbf{X}_{kj}) \right]; \quad (7.19)$$

where $\mathbf{U}_k$ and $\mathbf{V}_k$ represent the inverse covariance of the column and row, respectively.
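The equivalence between the trace form (equation 7.18) and the vectorized Kronecker form (equation 7.16) can be checked numerically. The dimensions and covariances below are illustrative; note that the column-stacking vec convention is what pairs with $\Sigma_c \otimes \Sigma_s$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 3, 4                 # illustrative row/column dimensions

def random_spd(n):
    """Generate a random symmetric positive-definite covariance."""
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)

Sigma_s = random_spd(p)     # row covariance (p x p)
Sigma_c = random_spd(q)     # column covariance (q x q)
D = rng.normal(size=(p, q)) - rng.normal(size=(p, q))   # X - X'

# Trace form, equation 7.18
d_trace = np.trace(np.linalg.inv(Sigma_c) @ D.T @ np.linalg.inv(Sigma_s) @ D)

# Vectorized Kronecker form, equation 7.16; flatten(order="F") is
# column-stacking vec(), matching Sigma_c (x) Sigma_s
v = D.flatten(order="F")
d_vec = v @ np.linalg.inv(np.kron(Sigma_c, Sigma_s)) @ v
```

The trace form never materializes the $pq \times pq$ Kronecker matrix, which is why it is the form carried forward into the metric learning development.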
The individual view objective function is constructed similarly to the LMNN [Weinberger et al., 2009] methodology; we define a push (equation 7.20) and a pull (equation 7.21) as:

$$\mathrm{push}_k = \gamma \sum_{j \rightsquigarrow i, l} \eta_{kij} (1 - y_{il}) \cdot h\left[ d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kj}) - d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kl}) + 1 \right], \quad (7.20)$$

$$\mathrm{pull}_k = \sum_{i,j} \eta_{kij} \cdot d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kj}); \quad (7.21)$$

where $y_{il} = 1$ if and only if $y_i = y_l$ and $y_{il} = 0$ otherwise, and $\eta_{kij} = 1$ if and only if $x_i$ and $x_j$ are targeted neighbors of similar label $y_i = y_j$. For a more in-depth discussion of target neighbors, see Torresani and Lee [2007].

Furthermore, we include regularization terms [Schultz and Joachims, 2004] with respect to $\mathbf{U}_k$ and $\mathbf{V}_k$ as part of the objective function design; these are defined as $\lambda \|\mathbf{U}_k\|_F$ and $\lambda \|\mathbf{V}_k\|_F$, respectively. The inclusion of regularization terms in our objective function helps promote sparsity in the learned metrics. Favoring sparsity can be beneficial when the dimensionality of the feature spaces is high, and can help lead to a more generic and stable solution. The sub-view objective function is then equation 7.22:

$$\min_{\mathbf{U}_k, \mathbf{V}_k} I_k = (1 - \gamma) \sum_{i,j} \eta_{kij} \cdot d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kj}) + \gamma \sum_{j \rightsquigarrow i, l} \eta_{kij} (1 - y_{il}) \cdot h\left[ d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kj}) - d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kl}) + 1 \right] + \lambda \|\mathbf{U}_k\|_F + \lambda \|\mathbf{V}_k\|_F; \quad (7.22)$$

where $\lambda > 0$. Following the LM³L design, the joint objective function is equation 7.23:

$$\min_{\mathbf{U}_k, \mathbf{V}_k} J_k = w_k^p I_k + \mu \sum_{q=1, q \neq k}^{K} \sum_{i,j} \left( d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kj}) - d_{\mathbf{U}_q, \mathbf{V}_q}(\mathbf{X}_{qi}, \mathbf{X}_{qj}) \right)^2; \quad (7.23)$$

where $\sum_{k=1}^{K} w_k = 1$; the first term is the contribution of the individual $k$th view, while the second term is designed to minimize the distance difference between attributes.

This objective design is solved using a gradient descent solver. To enforce the requirements $\mathbf{U}_k \succ 0$ and $\mathbf{V}_k \succ 0$, the metrics are decomposed as $\mathbf{U}_k = \mathbf{\Gamma}_k^T \mathbf{\Gamma}_k$ and $\mathbf{V}_k = \mathbf{N}_k^T \mathbf{N}_k$. The gradient of the objective function with respect to the decomposed matrices $\mathbf{\Gamma}_k$ and $\mathbf{N}_k$ is estimated.
The unconstrained optimum is found using the gradient of the decomposed matrices; the $\mathbf{U}_k$ and $\mathbf{V}_k$ matrices are then reconstituted at the end of the optimization process. We reformulate the matrix-variate distance as equation 7.24:

$$d_{\mathbf{\Gamma}_k, \mathbf{N}_k}(\mathbf{\Delta}_{kij}) = \mathrm{tr}\left[ \mathbf{\Gamma}_k^T \mathbf{\Gamma}_k (\mathbf{\Delta}_{kij})^T \mathbf{N}_k^T \mathbf{N}_k (\mathbf{\Delta}_{kij}) \right]; \quad (7.24)$$

For ease we make the following additional definitions: $d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kj}) = d_{kij}$, $\mathbf{X}_{ki} - \mathbf{X}_{kj} = \mathbf{\Delta}_{kij}$, $\mathbf{A}_{kij} = (\mathbf{\Delta}_{kij})^T \mathbf{N}_k^T \mathbf{N}_k (\mathbf{\Delta}_{kij})$, and $\mathbf{B}_{kij} = \mathbf{\Delta}_{kij} \mathbf{\Gamma}_k^T \mathbf{\Gamma}_k (\mathbf{\Delta}_{kij})^T$. Note that $\mathbf{A}_{kij} = (\mathbf{A}_{kij})^T$ and $\mathbf{B}_{kij} = (\mathbf{B}_{kij})^T$. Additionally, we identify the gradients of the distance, equations 7.25 and 7.26:

$$\frac{\partial d_{kij}}{\partial \mathbf{\Gamma}_k} = 2\, \mathbf{\Gamma}_k \mathbf{A}_{kij} \quad (7.25)$$

$$\frac{\partial d_{kij}}{\partial \mathbf{N}_k} = 2\, \mathbf{N}_k \mathbf{B}_{kij}, \quad (7.26)$$

as being pertinent for the derivation. We give the gradient of the individual view objective $I_k$ as equations 7.27 and 7.28:

$$\frac{\partial I_k}{\partial \mathbf{\Gamma}_k} = 2\, \mathbf{\Gamma}_k \left( (1 - \gamma) \sum_{i,j} \eta_{kij} \cdot \mathbf{A}_{kij} + \gamma \sum_{j \rightsquigarrow i, l} \eta_{kij} (1 - y_{il}) \cdot h'[z] \cdot \left[ \mathbf{A}_{kij} - \mathbf{A}_{kil} \right] + \lambda \mathbf{I} \right) \quad (7.27)$$

$$\frac{\partial I_k}{\partial \mathbf{N}_k} = 2\, \mathbf{N}_k \left( (1 - \gamma) \sum_{i,j} \eta_{kij} \cdot \mathbf{B}_{kij} + \gamma \sum_{j \rightsquigarrow i, l} \eta_{kij} (1 - y_{il}) \cdot h'[z] \cdot \left[ \mathbf{B}_{kij} - \mathbf{B}_{kil} \right] + \lambda \mathbf{I} \right), \quad (7.28)$$

and the gradient of the joint objective as equations 7.29 and 7.30:

$$\frac{\partial J_k}{\partial \mathbf{\Gamma}_k} = w_k^p \frac{\partial I_k}{\partial \mathbf{\Gamma}_k} + 4 \mu\, \mathbf{\Gamma}_k \sum_{q=1, q \neq k}^{K} \sum_{i,j} (d_{kij} - d_{qij})\, \mathbf{A}_{kij} \quad (7.29)$$

$$\frac{\partial J_k}{\partial \mathbf{N}_k} = w_k^p \frac{\partial I_k}{\partial \mathbf{N}_k} + 4 \mu\, \mathbf{N}_k \sum_{q=1, q \neq k}^{K} \sum_{i,j} (d_{kij} - d_{qij})\, \mathbf{B}_{kij}. \quad (7.30)$$

To estimate the update for the weights, we solve the Lagrange function given as equation 7.31:

$$La(w, \eta) = \sum_{k=1}^{K} w_k^p I_k + \mu \sum_{k,l=1, k \neq l}^{K} \sum_{i,j} (d_{kij} - d_{lij})^2 - \eta \left( \sum_{k=1}^{K} w_k - 1 \right) \quad (7.31)$$

Setting $\partial La / \partial w_k = 0$ and enforcing the constraint $\sum_{k=1}^{K} w_k = 1$ yields the weight update used in the algorithm below. The complete optimization procedure is summarized as follows:

Require: training views $\{\mathbf{X}_{ki}\}$, tunable parameters, $\rho \ge 1$, convergence tolerance $\epsilon$
Ensure: $\mathbf{U}_k$, $\mathbf{V}_k$ for $k = 1, \ldots, K$
while $|J^{(t)} - J^{(t-1)}| > \epsilon$ do
  for $k = 1, \ldots, K$ do
    Solve $\nabla J_k(\mathbf{X}_{ki}; \mathbf{U}_k, \mathbf{V}_k) = \left[ \frac{\partial J_k}{\partial \mathbf{\Gamma}_k}, \frac{\partial J_k}{\partial \mathbf{N}_k} \right]$
    $\hat{\beta}_k^t = \frac{1}{2} \cdot \mathrm{Tr}\left[ \Delta g(\mathbf{\Gamma}_k) \cdot \Delta\mathbf{\Gamma}_k^H + \Delta\mathbf{\Gamma}_k \cdot \Delta g(\mathbf{\Gamma}_k)^H \right] / \mathrm{Tr}\left[ \Delta g(\mathbf{\Gamma}_k) \cdot \Delta g(\mathbf{\Gamma}_k)^H \right]$
    $\hat{\kappa}_k^t = \frac{1}{2} \cdot \mathrm{Tr}\left[ \Delta g(\mathbf{N}_k) \cdot \Delta\mathbf{N}_k^H + \Delta\mathbf{N}_k \cdot \Delta g(\mathbf{N}_k)^H \right] / \mathrm{Tr}\left[ \Delta g(\mathbf{N}_k) \cdot \Delta g(\mathbf{N}_k)^H \right]$
    $\mathbf{\Gamma}_k^{(t+1)} = \mathbf{\Gamma}_k^{(t)} - \beta_k^{(t)} \frac{\partial J_k}{\partial \mathbf{\Gamma}_k}$
    $\mathbf{N}_k^{(t+1)} = \mathbf{N}_k^{(t)} - \kappa_k^{(t)} \frac{\partial J_k}{\partial \mathbf{N}_k}$
  end for
  for $k = 1, \ldots, K$ do
    $w_k = (1/I_k)^{1/(p-1)} / \sum_{q=1}^{K} (1/I_q)^{1/(p-1)}$
    $J_k = w_k^p I_k + \mu \sum_{q=1, q \neq k}^{K} \sum_{i,j} \left( d_{\mathbf{U}_k, \mathbf{V}_k}(\mathbf{X}_{ki}, \mathbf{X}_{kj}) - d_{\mathbf{U}_q, \mathbf{V}_q}(\mathbf{X}_{qi}, \mathbf{X}_{qj}) \right)^2$
    $J^{(t)} = J^{(t)} + J_k$
  end for
end while
return $\mathbf{U}_k = \mathbf{\Gamma}_k^T \mathbf{\Gamma}_k$ and $\mathbf{V}_k = \mathbf{N}_k^T \mathbf{N}_k$

7.4 Implementation

We developed a supporting functional library in Java (java-jdk/11.0.1), relying on a number of additional publicly available scientific and mathematical open source packages, including the Apache foundation commons packages (e.g., Commons Math [Foundation, 2018b] and Commons Lang [Foundation, 2018c]) and the JSOFA package, to support our designs. The overall functionality is supported at a high level by the following open source packages:

• Maven is used to manage dependencies and produce executable functionality from the project [Foundation, 2018a]
• JUnit is used to support library unit test management [Team, 2018a]
• slf4j is used as a logging framework [Team, 2017]
• MatFileRW is used for I/O handling [Team, 2018b]

We recommend reviewing the vsa-parent .pom file included as part of the software package for a more comprehensive review of the functional dependencies. Versions are subject to upgrades as development proceeds beyond this publication.
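The distance-gradient identity of equation 7.25 can be verified numerically; a sketch with hypothetical dimensions, checking $\partial d / \partial \mathbf{\Gamma} = 2\mathbf{\Gamma}\mathbf{A}$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, r = 3, 4, 3                 # illustrative dimensions; Delta is p x q

Gamma = rng.normal(size=(r, q))   # U = Gamma^T Gamma (q x q)
N = rng.normal(size=(r, p))       # V = N^T N (p x p)
Delta = rng.normal(size=(p, q))   # X_ki - X_kj

def dist(Gamma, N, Delta):
    """Decomposed matrix-variate distance, equation 7.24."""
    return np.trace(Gamma.T @ Gamma @ Delta.T @ (N.T @ N) @ Delta)

# Analytic gradient, equation 7.25: 2 Gamma A, with A = Delta^T N^T N Delta
A = Delta.T @ (N.T @ N) @ Delta
grad_analytic = 2.0 * Gamma @ A

# Central finite differences over each entry of Gamma
eps = 1e-6
grad_fd = np.zeros_like(Gamma)
for i in range(Gamma.shape[0]):
    for j in range(Gamma.shape[1]):
        E = np.zeros_like(Gamma)
        E[i, j] = eps
        grad_fd[i, j] = (dist(Gamma + E, N, Delta)
                         - dist(Gamma - E, N, Delta)) / (2 * eps)
```

Because the distance is quadratic in the entries of $\mathbf{\Gamma}$, the central difference is exact up to roundoff, making this a tight unit-test style check of the derivation.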
Execution of the code was performed on a number of platforms, including a personal laptop (MacBook Pro, 2.5 GHz Intel Core i7, macOS Mojave) and an institutional high performance computer (Florida Institute of Technology, BlueShark HPC; https://it.fit.edu/information-and-policies/computing/blueshark-supercomputer-hpc/). The development of the library and functionality in Java allows the functionality presented here to be applied regardless of platform. We are not reporting processing times as part of this analysis, as the computational times varied depending on the platform used.

Similar to Johnston and Peter [2017], we use the University of California Riverside Time Series Classification Archive (UCR) and the Lincoln Near-Earth Asteroid Research (LINEAR) dataset to demonstrate the performance of our feature space classifier. The individual datasets are described as follows:

• UCR: We baseline the investigated classification methodologies [Keogh et al., 2011] using the UCR time domain datasets. The UCR time domain dataset STARLIGHT [Protopapas et al., 2006] is derived from a set of Cepheid, RR Lyrae, and eclipsing binary stars. This time-domain dataset is phased (folded) via the primary period and smoothed using the SUPER-SMOOTHER algorithm [Reimann, 1994] by the Protopapas study prior to being provided to the UCR database. Note that the sub-groups of each of the three classes are combined together in the UCR data (i.e., RR Lyr (ab) + RR Lyr (c) = RR).

• LINEAR: The original LINEAR database is subsampled; we select time series data that has been verified and for which accurate photometric values are available [Sesar et al., 2011, Palaversa et al., 2013]. This subsampled set is parsed into separate training and test sets. From the starting sample of 7,194 LINEAR variables, a clean sample of 6,146 time series datasets and their associated photometric values is used.
Stellar class-type is limited further to the top five most populous classes: RR Lyr (ab), RR Lyr (c), δ Scuti / SX Phe, contact binaries, and Algol-like stars with two minima, resulting in a set of 6,086 observations.

Training data subsets are generated as follows: UCR already defines a training and test set; the LINEAR data is split into a training and test set using a predefined algorithm (random assignment, with nearly equal representation of classes in training and test). We used 5-fold cross-validation for both datasets; the partitions in the 5-fold algorithm are populated via random assignment. For more details on the datasets themselves, a baseline for performance, and additional references, see Johnston and Peter [2017].

The time domain transformations we selected require parameter optimization (resolution, kernel size, etc.); each survey can potentially have a slightly different optimal set of transformation parameters with respect to the underlying survey parameters (e.g., sample rate, survey frequency, number of breaks over all observations, etc.). While we could include the parameter optimization in the cross-validation process for the classifier, this would be highly computationally challenging, specifically for classifiers that require iterations, as we would be handling an increasing number of permutations with each iteration, over an unknown number of iterations. To address this problem, the feature space is cross-validated on the training dataset, and k-NN classification is used (a fixed temporary k value allows little to no tuning) to estimate the misclassification error with the proposed feature space parameters. The optimized features are then used as givens for the cross-validation process in optimizing the intended classifier.
Some loss of performance will likely occur, but considering that the final classifier design is based on k-NN as well, it is expected to be minor.

Because of the multi-dimensional nature of our feature space, we propose the following method for feature optimization: per class, we generate a mean representation of the feature (given the fraction of data being trained on); all data (training and cross-validation) are then transformed via Park et al. [2003] into a distance representation, i.e., the difference between the observed feature and each of the means is generated. Note that for the matrix feature spaces, the Frobenius norm is used. Alternatively, we could have generated means based on unsupervised clustering (k-means); while not used in this study, this functionality is provided as part of the code. We found that the performance in the unsupervised case was very sensitive to the initial number of clusters used. For the LINEAR and UCR datasets, the results with respect to optimization of the feature (DF and SSMM) parameters were found to be roughly the same. A k-NN algorithm is applied to the reduced feature space, 5-fold classification is then used to generate the estimate of error, and the misclassification results are presented as a response map over the feature parameters (Figure 7.1 and Figure 7.2). We select the optimum values for each feature space based on a minimization over both the LINEAR and UCR data. These values are estimated to be 30 × 25 for the DF (x, y) parameters; the corresponding SSMM (resolution × scale) optimum is taken from the minimum in Figure 7.2.

Figure 7.1: Parameter optimization of the Distribution Field feature space (Left: UCR Data, Right: LINEAR Data). Heat map colors represent misclassification error (dark blue lower, bright yellow higher).

Figure 7.2: Parameter optimization of the SSMM feature space (Left: UCR Data, Right: LINEAR Data). Heat map colors represent misclassification error (dark blue lower, bright yellow higher).

The implementation of LM³L is applied to the UCR and LINEAR datasets. Based on the number of views associated with each feature set, the underlying classifier will be different (e.g., UCR does not contain color information, and it also has only three classes compared to LINEAR's five). The features SSMM, DF, and statistical representations (mean, standard deviation, kurtosis, etc.) are computed for both datasets. Color and the time domain representations provided with the LINEAR data are also included as additional views.

To allow for the implementation of the vector-variate classifier, the dimensionality of the SSMM and DF features is reduced via vectorization of the matrix and then processing by the ECVA algorithm, resulting in a dimensionality of k − 1, where k is the number of classes. We note that without this processing via ECVA, the times for the optimization became prohibitively long; this is similar to the implementation of IPMML given in Zhou et al. [2016]. SSMM and DF features are generated with respect to the LINEAR dataset (Park's transformation is not applied here), the feature space is reduced via ECVA, and the results are given in Figure 7.3 (DF) and Figure 7.4 (SSMM). Similarly, the SSMM and DF features are generated for the UCR dataset (Park's transformation is not applied here) and the feature space reduced via ECVA; the results are plotted in Figure 7.5 (DF, left; SSMM, right). The dimensions given in the figures are reduced dimensions resulting from the ECVA transform and therefore do not necessarily have meaningful descriptions (besides $x_1, x_2, x_3, \ldots, x_n$). These reduced feature spaces are used as input to the LM³L algorithm.

The individual views are standardized (subtract the mean and divide by the standard deviation). Cross-validation of LM³L is used to optimize the three tunable parameters and the one parameter associated with the k-NN.
The LM³L authors recommend some basic parameter starting points; our analysis includes investigating these values as well as an upper (+1) and lower (−1) level about each parameter, over a set of odd k-NN values in [1, 19]. The optimization only needs to occur for each set of tunable values, as the misclassification for a given k value can be evaluated separately; this experiment is outlined in Table 7.1. Cross-validation is performed to both optimize for our application and investigate the sensitivity of the classifier to adjustment of these parameters. For a breakdown of the cross-validation results, see the associated datasets and spreadsheet provided as part of the digital supplement. Based on the cross-validation process, the following optimal parameters are found:

• LINEAR: k-NN(11), τ(1.0), µ(5.0), λ(0.1)
• UCR: k-NN(9), τ(1.0), µ(5.0), λ(0.1)

The classifier is then trained using the total set of training data along with the optimal parameters selected. As a reminder, the λ parameter controls the importance of regularization, the µ parameter controls the importance of pairwise distance in the optimization process, and γ controls the balance between push and pull.

Figure 7.3: DF feature space after ECVA reduction from LINEAR (Contact Binary: blue circles, Algol: red +, RRab: green points, RRc: black squares, Delta Scu/SX Phe: magenta diamonds). Off-diagonal plots represent comparisons between two different features; on-diagonal plots represent the distribution of classes within a feature (one dimensional).

Figure 7.4: SSMM feature space after ECVA reduction from LINEAR (Contact Binary: blue circles, Algol: red +, RRab: green points, RRc: black squares, Delta Scu/SX Phe: magenta diamonds). Off-diagonal plots represent comparisons between two different features; on-diagonal plots represent the distribution of classes within a feature (one dimensional).

Figure 7.5: DF (left) and SSMM (right) feature space after ECVA reduction from UCR. Class names (1, 2, and 3) are based on the classes provided by the originating source and the UCR database.

Table 7.1 outlines the cross-validation grid for the LM³L tunable parameters: seven candidate levels for each of τ, µ, and λ.

The trained classifier is applied to the test data; the resulting confusion matrices are presented in Table 7.2 (LINEAR) and Table 7.3 (UCR).

Table 7.2: LINEAR confusion matrix via LM³L; entries are counts (percent).

                    RR Lyr (ab)   Delta Scu/SX Phe  Algol        RR Lyr (c)    Contact Binary  Missed
RR Lyr (ab)         1081 (0.992)  0 (0.000)         0 (0.000)    6 (0.006)     1 (0.001)       2 (0.002)
Delta Scu/SX Phe    0 (0.000)     23 (0.852)        0 (0.000)    2 (0.074)     2 (0.074)       0 (0.000)
Algol               1 (0.007)     0 (0.000)         108 (0.788)  0 (0.000)     28 (0.204)      0 (0.000)
RR Lyr (c)          23 (0.062)    0 (0.000)         1 (0.003)    343 (0.925)   4 (0.011)       0 (0.000)
Contact Binary      3 (0.003)     0 (0.000)         29 (0.033)   9 (0.010)     832 (0.952)     1 (0.001)

Table 7.3: UCR confusion matrix via LM³L; entries are counts (percent).

     2             3             1            Missed
2    2296 (0.996)  9 (0.004)     0 (0.000)    0 (0.000)
3    17 (0.004)    4621 (0.972)  116 (0.023)  0 (0.000)
1    8 (0.007)     375 (0.319)   794 (0.675)  0 (0.000)

The implementation of LM³L-MV is applied to the UCR and LINEAR datasets. The features SSMM, DF, and statistical representations (mean, standard deviation, kurtosis, etc.) are computed for both datasets. Similar to the LM³L procedure, color and the time domain representations provided with the LINEAR data are also included as additional views. The implementation of the matrix-variate classifier allows us to avoid the vectorization and feature reduction (ECVA) step. The individual views are standardized prior to optimization. Also similar to the LM³L procedure, cross-validation of LM³L-MV is used to optimize the three tunable parameters and the one parameter associated with the k-NN.
The table of explored tunable parameters is given in Table 7.4.

Table 7.4: The cross-validation grid for the LM3L-MV tunable values (λ, µ, γ), with candidate settings indexed 1-7.

Based on the cross-validation process, the following optimal parameters are found (and their cross-validation error estimates):

• LINEAR: k-NN(15), λ (0.5), µ (0.5), γ (0.5)
• UCR: k-NN(19), λ (0.5), µ (1.0), γ (0.5)

The classifier is then trained using the total set of training data along with the optimal parameters selected. The trained classifier is applied to the test data; the resulting confusion matrices are presented in Table 7.5 and Table 7.6.

Table 7.5: LINEAR confusion matrix via LM3L-MV; entries are counts (percent)

True \ Predicted    RR Lyr (ab)    Delta Scu/SX Phe   Algol        RR Lyr (c)    Contact Binary   Missed
RR Lyr (ab)         1074 (0.985)   0 (0.000)          1 (0.001)    15 (0.014)    0 (0.000)        0 (0.000)
Delta Scu/SX Phe    1 (0.037)      24 (0.889)         0 (0.000)    2 (0.074)     0 (0.000)        0 (0.000)
Algol               3 (0.022)      0 (0.000)          104 (0.759)  1 (0.007)     29 (0.212)       0 (0.000)
RR Lyr (c)          23 (0.059)     0 (0.000)          1 (0.003)    343 (0.930)   4 (0.008)        0 (0.000)
Contact Binary      3 (0.003)      0 (0.000)          29 (0.035)   9 (0.001)     832 (0.958)      1 (0.002)

Table 7.6: UCR confusion matrix via LM3L-MV; entries are counts (percent)

True \ Predicted    2              3              1             Missed
2                   2298 (0.997)   6 (0.003)      0 (0.000)     1 (~0.000)
3                   4 (0.001)      4450 (0.936)   300 (0.063)   0 (0.000)
1                   3 (0.003)      467 (0.397)    707 (0.601)   0 (0.000)

The matrix-variate and the vector-variate versions do not perform very differently under the conditions provided, given the data observed. However, as a reminder, the LM3L implementation includes a feature-reduction methodology (ECVA) that our LM3L-MV does not. The ECVA front end was necessary because the dimensionality of the unreduced input vectors results in features and metrics that are computationally prohibitive.
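The confusion matrices reported in Tables 7.5 and 7.6 (and the F1 scores reported later) reduce to simple counting over (true, predicted) label pairs. The sketch below illustrates that bookkeeping on made-up labels; the class indices and data are hypothetical, not the LINEAR or UCR predictions.

```java
/** Confusion-matrix bookkeeping of the kind used for the result tables. */
public class ConfusionStats {

    /** counts[i][j] = number of samples of true class i predicted as class j. */
    static int[][] confusion(int[] truth, int[] pred, int nClasses) {
        int[][] counts = new int[nClasses][nClasses];
        for (int n = 0; n < truth.length; n++) counts[truth[n]][pred[n]]++;
        return counts;
    }

    /** F1 score for one class: harmonic mean of precision and recall. */
    static double f1(int[][] counts, int c) {
        int tp = counts[c][c], fp = 0, fn = 0;
        for (int i = 0; i < counts.length; i++) {
            if (i != c) {
                fp += counts[i][c];  // predicted c but truly another class
                fn += counts[c][i];  // truly c but predicted another class
            }
        }
        double precision = tp / (double) (tp + fp);
        double recall    = tp / (double) (tp + fn);
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        int[] truth = {0, 0, 0, 1, 1, 2};
        int[] pred  = {0, 0, 1, 1, 1, 2};
        int[][] cm = confusion(truth, pred, 3);
        System.out.println("F1(class 0) = " + f1(cm, 0)); // 0.8
    }
}
```

The "percent" entries in the tables are each count divided by its row (true-class) total; a macro-averaged F1, as in Table 7.7, is just the mean of `f1` over all classes.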
It is not entirely surprising that our two competing methodologies perform similarly; the LM3L algorithm has the benefit of being able to reduce the matrix-variate spaces ahead of time via ECVA, and thus processes the SSMM and DF feature spaces in a lower dimension (c − 1, for c classes). F1-score comparisons of the LM3L and LM3L-MV algorithm designs are given in Table 7.7. Detailed computations associated with all analyses are included as part of the digital supplement.

It should be noted that ECVA has its limitations, observed anecdotally on more than one occasion during the initial analysis when the full dataset was provided to the procedure.

Table 7.7: F1-score metrics for the proposed classifiers with respect to the LINEAR and UCR datasets

Classifier            UCR      LINEAR
k-NN Multi-view MV    0.725    0.574
k-NN Multi-View       0.691    0.506
k-NN Concatenated     0.650    0.427

(The table also compares LM3L, LM3L-MV, IPMML, RF-DF, RF-SSMM, and RF-TimeStatistics; the full set of values is provided in the digital supplement.)

The LM3L implementation iterated at a much faster rate with the same amount of data compared to the LM3L-MV algorithm. The matrix multiplication operations associated with the matrix distance computation are more computationally expensive than the simpler vector metric distance computation; however, many computational languages have been optimized for matrix multiplication (e.g., MATLAB, Mathematica, CUDA). Again, the time the ECVA algorithm takes to operate up front saves the LM3L iterations time. In general, both algorithms perform well with respect to misclassification rate, but both also require concessions to handle the scale and scope of the feature spaces used. The cost of most of these concessions can be mitigated with additional machine learning strategies, some of which we have begun to implement here (parallel computation, for example).

7.5 Conclusions

The classification of variable stars relies on a proper selection of features of interest and a classification framework that can support the linear separation of those features. Features should be selected that quantify the signature of the variability, i.e.
its structure and information content. Here, two features with utility in providing discriminatory capabilities, the SSMM and DF feature spaces, are studied. The feature extraction methodologies are applied to the LINEAR and UCR datasets, along with a standard set of time-domain descriptive statistics and, in the case of the LINEAR dataset, a combination of ugriz colors. To support the set of high-dimensionality features, or views, multi-view metric learning is investigated as a viable design. Multi-view learning provides an avenue for integrating multiple transforms to generate a superior classifier. The structure of multi-view metric learning allows a number of modern computational designs to be used to support increasing scale and scope (e.g., parallel computation); these considerations can be leveraged given the parameters of the experiment designed or the project in question.

Presented, in addition to an implementation of a standard multi-view metric learning algorithm (LM3L) that works with a feature space that has been vectorized and reduced in dimension, is a multi-view metric learning algorithm designed to work with matrix-variate views. This new classifier design does not require transformation of the matrix-variate views ahead of time, and instead operates directly on the matrix data. The development of both algorithm designs (matrix-variate and vector-variate) with respect to the targeted experiment of interest (discrimination of time-domain variable stars) highlighted a number of challenges to be addressed prior to practical application. In overcoming these challenges, it was found that the novel classifier design (LM3L-MV) performed on the order of the staged (vectorization + ECVA + LM3L) classifier. Future research will include investigating overcoming high-dimensionality matrix data (e.g.
SSMM), improving the parallelization of the design presented, and implementing community-standard workarounds for large datasets (i.e., on-line learning, stochastic/batch gradient descent methods, k-d trees, etc.).

Chapter 8

Conclusions

We focused our efforts on the field of astroinformatics, specifically machine learning relating to time-domain features and variable star identification. We outlined a vertically integrated analysis: the collection of training data and the development of a time-domain feature space, a classification/detection-optimized algorithm, and a performance analysis procedure that can properly represent the classifier performance. We assume that the survey will handle much of the signal conditioning and detection (front-end logic). Most of what we are designing is the "back end" and includes both development and testing (verification/validation).

The development of the Variable Star Analysis (VarStar) library is a capstone for this research. The VarStar library (https://github.com/kjohnston82/VariableStarAnalysis) is a Java library and contains not only the algorithms developed in this work (e.g., LM3L-MV) but also the set of fundamental mathematics and supporting functions necessary for computation. The code can be split into three basic categories: bindings (data objects that act as containers for similar information), utilities (mathematics, machine learning, and other generic tools), and analysis (features and transforms used in our research, as well as executable functions that were developed for the research). The design of the library functions flows from the math bundle downward, with reliance on a multitude of third-party open-source packages. The flow design is given in Figure 8.1.

Figure 8.1: A rough outline of the Variable Star Analysis Library (JVarStar) bundle functional relationships.
Notice that the generic math and utility bundles flow down to more specific functional designs, such as the clustering algorithms.

We briefly outline the contents of the library functionality; for more detail, please see the code itself. The underlying library is actively developed in Java (java-jdk/11.0.1) and relies on a number of additional publicly available scientific and mathematical open-source packages, including the Apache Foundation Commons packages (e.g., Math Commons [Foundation, 2018b] and Commons Lang [Foundation, 2018c]) and the JSOFA package [Harrison, 2016]. The Apache Math Commons package specifically provides the Java objects necessary for handling vector and matrix mathematics; linear algebra within the VSA library depends on this package and on the VectorOperations/MatrixOperations classes that extend the Math Commons functionality for use.

The VSA math bundle contains low-level functionality, such as numeric tests (e.g., is even? is odd?) and numerical constants (e.g., 4π) that are common in scientific applications. The math bundle also contains geometric functionality and algorithms useful in scientific applications, such as the oriented Graham scan [Graham, 1972] for generating 2-D convex hulls given a set of distributed points, algorithms for random distribution generation on surfaces, and an implementation of the QuickHull3D algorithm [Barber et al., 1996] for generating convex hulls, and intersections of convex hulls, given a set of distributed points in N-D space. Fundamental object and geometry mathematics is also included, and the bundle has classes for cone and plane shapes.
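As one concrete example of the geometry utilities described above, a 2-D convex hull can be computed in a few dozen lines. The sketch below is an independent illustration, not the library's code: it uses Andrew's monotone chain, a close cousin of the oriented Graham scan the bundle implements, with the same cross-product orientation test at its core.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** 2-D convex hull via Andrew's monotone chain (O(n log n)). */
public class Hull2D {

    /** Cross product of (b - a) x (c - a); > 0 means a counter-clockwise turn. */
    static double cross(double[] a, double[] b, double[] c) {
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]);
    }

    /** Returns hull vertices in counter-clockwise order. */
    static List<double[]> hull(double[][] pts) {
        double[][] p = pts.clone();
        Arrays.sort(p, (u, v) -> u[0] != v[0] ? Double.compare(u[0], v[0])
                                              : Double.compare(u[1], v[1]));
        List<double[]> h = new ArrayList<>();
        // Build the lower hull, then the upper hull, popping clockwise turns.
        for (int pass = 0; pass < 2; pass++) {
            int start = h.size();
            for (double[] q : (pass == 0 ? Arrays.asList(p) : reversed(p))) {
                while (h.size() >= start + 2
                        && cross(h.get(h.size() - 2), h.get(h.size() - 1), q) <= 0)
                    h.remove(h.size() - 1);
                h.add(q);
            }
            h.remove(h.size() - 1); // last point repeats as the next pass's first
        }
        return h;
    }

    static List<double[]> reversed(double[][] p) {
        List<double[]> r = new ArrayList<>(Arrays.asList(p));
        Collections.reverse(r);
        return r;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {1, 0}, {1, 1}, {0, 1}, {0.5, 0.5}};
        System.out.println(hull(pts).size()); // interior point is excluded
    }
}
```

For the unit square with one interior point, the hull contains the four corners only; interior points are popped by the orientation test.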
Similarly, the math bundle contains linear and nonlinear solver methods, including a polynomial solver, a weighted multiple linear regression solver, and functionality that uses or ingests functionality inherent to the Apache Math Commons analysis bundle (see http://commons.apache.org/proper/commons-math/userguide/analysis.html for more information).

The I/O utilities bundle provides support for accessing and writing data into and out of multiple formats, including the .mat file format (MATLAB). The VSA design relies on the storage of data in .mat format, which allows for easy analysis of results in a scripting language (MATLAB); the storage of data files in .mat format is also efficient, as the MatFileRW functionality used for I/O handling [Team, 2018b] provides the ability to store structures and other fundamental data objects. More complex data I/O is also handled; this requires, however, that data objects be serializable.

While there is a singular machine learning bundle, the clustering bundle, data handling bundle, and swarm optimization bundle also contain machine learning functionality for the user but have been split apart for the sake of development. The machine learning bundle is a Java implementation of the MATLAB project MatLearn and contains several standard machine learning algorithms, such as the classification and regression tree (CART), the logistic regression classifier (LRC), the k-NN algorithm, linear and quadratic discriminant analysis (LDA/QDA), canonical variate analysis (CVA), Parzen window classification (PWC), and a set of metric learning algorithms (LMNN, NCA, S&J, MMC, ITML, etc.).
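The weighted multiple linear regression solver mentioned above amounts to solving the normal equations (XᵀWX)β = XᵀWy. The sketch below is a stdlib-only illustration of that computation (the library itself builds on the Apache Math Commons linear algebra objects); the small Gaussian-elimination solve and the example data are assumptions for demonstration.

```java
/** Weighted least squares: solve (X^T W X) beta = X^T W y. */
public class WeightedLSQ {

    /** X is n x p (include a column of ones for an intercept);
     *  w holds the n observation weights. */
    static double[] fit(double[][] X, double[] y, double[] w) {
        int n = X.length, p = X[0].length;
        double[][] A = new double[p][p]; // accumulates X^T W X
        double[] b = new double[p];      // accumulates X^T W y
        for (int i = 0; i < n; i++)
            for (int j = 0; j < p; j++) {
                b[j] += w[i] * X[i][j] * y[i];
                for (int k = 0; k < p; k++)
                    A[j][k] += w[i] * X[i][j] * X[i][k];
            }
        return solve(A, b);
    }

    /** Gaussian elimination with partial pivoting (modifies A and b). */
    static double[] solve(double[][] A, double[] b) {
        int p = b.length;
        for (int col = 0; col < p; col++) {
            int piv = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
            double[] tRow = A[col]; A[col] = A[piv]; A[piv] = tRow;
            double tVal = b[col]; b[col] = b[piv]; b[piv] = tVal;
            for (int r = col + 1; r < p; r++) {
                double f = A[r][col] / A[col][col];
                for (int c = col; c < p; c++) A[r][c] -= f * A[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[p];
        for (int r = p - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < p; c++) s -= A[r][c] * x[c];
            x[r] = s / A[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // y = 2 + 3x with equal weights; design columns are [1, x].
        double[][] X = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {2, 5, 8, 11};
        double[] w = {1, 1, 1, 1};
        double[] beta = fit(X, y, w);
        System.out.printf("intercept=%.3f slope=%.3f%n", beta[0], beta[1]);
    }
}
```

With unit weights this reduces to ordinary least squares; down-weighting noisy photometric points is the typical use in light-curve fitting.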
As discussed, this is also where we have developed the LM3L-MV algorithm. The machine learning bundle also contains fundamental functionalities, such as classes for distance measurement (e.g., MetricDistance), performance utilities for the evaluation of classifiers (such as confusion matrix functionality), and classes designed to support supervised classification (training, cross-validation, and testing).

The data handling bundle contains functionality for managing data relationships between pattern, label, and view; classes here support the sorting and organization of these objects. (On serializable objects, see https://docs.oracle.com/javase/tutorial/jndi/objects/serial.html; MatLearn is available at https://github.com/kjohnston82/MatLearn.)

During development of the LM3L-MV algorithm, some research was performed on the topic of particle swarm optimization as a means of improving the iterative optimization design, moving away from the standard gradient descent algorithm. Ultimately this avenue was not pursued, but the functionality developed for the optimization remains in the swarm optimization bundle.

Beyond the fundamental building blocks that the utilities package represents is the analysis package, which contains both the unique feature spaces we have developed and more standard feature spaces, such as the Lomb–Scargle transform. These feature transformations are then tied together in the Variable Star Analysis package; this bundle handles the reading in of data, processing of the individual raw waveforms, application of the features to the data, data handling and workflow management of the pattern/label pairings, and training of the targeted supervised classification algorithm.

The Maven functionality that stitches all of the packages together also handles dependency management and provides the ability to compile and generate executables. This allows for the development and distribution of executable training algorithms and .jar functionality that can be transferred and batch run for training purposes.
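The distance-measurement classes described above generalize Euclidean distance to a learned metric of the form d_M(x, y) = sqrt((x − y)ᵀ M (x − y)) for a positive semi-definite matrix M. The class and method names below are illustrative, not the library's API; a minimal stdlib sketch:

```java
/** Learned-metric (Mahalanobis-style) distance d_M(x, y). */
public class MetricDistanceSketch {

    /** sqrt((x - y)^T M (x - y)); M must be positive semi-definite
     *  so the quadratic form under the square root is non-negative. */
    static double distance(double[][] M, double[] x, double[] y) {
        int d = x.length;
        double[] diff = new double[d];
        for (int i = 0; i < d; i++) diff[i] = x[i] - y[i];
        double q = 0.0;
        for (int i = 0; i < d; i++)
            for (int j = 0; j < d; j++)
                q += diff[i] * M[i][j] * diff[j];
        return Math.sqrt(q);
    }

    public static void main(String[] args) {
        double[][] identity = {{1, 0}, {0, 1}}; // reduces to Euclidean distance
        System.out.println(distance(identity,
                new double[]{0, 0}, new double[]{3, 4})); // 5.0
    }
}
```

Metric learning algorithms such as those in the bundle (LMNN, ITML, etc.) fit M from labeled pairs; the distance computation itself stays this simple, which is what makes a k-NN classifier over the learned metric practical.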
Furthermore, the Java library structure and Maven project management will allow others to interface with all or part of the VSA design based on user needs. The project software is provided publicly at the associated GitHub repository (https://github.com/kjohnston82/VariableStarAnalysis). The overall functionality is supported at a high level by the following open-source packages: Maven is used to manage dependencies and produce executable functionality from the project [Foundation, 2018a], JUnit is used to support library unit-test management [Team, 2018a], and slf4j is used as a logging framework [Team, 2017]. A more complete view of the dependencies and versions can be found in the VSA-parent .pom file included as part of the software package.

As demonstrated, our research has produced new features of use, a new classifier, a review of design for supervised classification systems (including methods of performance analysis), and the application of these methods in the construction of a detector. Additionally, this work has produced a body of code that has been made open to the public for further development and use. Avenues of future development include research into additional features, classifiers, and detector designs, expanding upon what we have produced so far. Additionally, we identify two specific efforts that are necessary to improving our designs: increasing the amount of standard variable star data and developing a synthetic stellar variable waveform generator.

With the development of new feature extraction methodologies and classification techniques, estimations of performance are necessary. In lieu of real data, it is often beneficial to develop an empirical simulator to generate synthetic signatures that can be used to test the designed system.
The development of a synthetic simulator has a number of benefits: training data are always available, and the development of a simulator requires understanding the defining qualities of a classification type (here, variable type); we must answer the question, what makes this particular variable type unique in observation?

So, how do we test the performance of either the feature extraction methodologies or the supervised classification methodologies we are generating? Johnston and Oluseyi [2017] highlight standards in performance analysis methods for supervised classification (performance metrics). These performance estimates are dependent on the initial labels given for the training data, and training data labels are completely dependent on hand-labeling of survey data (supervised). We ask the question: how can we provide data to the supervised classification algorithm for which we know the label with certainty? To address this challenge, we propose developing a synthetic stellar variable generator.

Developing such an algorithm requires understanding the basic definition of each stellar variable type (what makes type A unique from other types). Proper synthesis requires understanding the distribution of features and codified descriptions of the variable star types; these in turn require a methodology to formulate a distribution of the features. This includes both the random generation of scalar features and the development of a random generator to handle time-domain functions. The generated features would also need to be correlated with one another (we are trying to interpolate between variable star examples). In addition to the generator, we would also construct an algorithm for the removal of portions of the synthetic signals to approximate various survey conditions: time spent on target, day/night breaks, error in magnitude (noise model), and so on.
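The survey-condition degradation just described (irregular sampling, day/night breaks, magnitude noise) can be prototyped in a few lines. The sketch below generates a noisy sinusoidal variable sampled only during a nightly window, purely as an illustration of the proposed generator; the period, amplitude, noise level, visits-per-night, and the 0.4-day "night" fraction are all arbitrary assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy synthetic variable-star light curve with survey effects applied. */
public class SyntheticLightCurve {

    /** Returns {times, mags}: a sinusoid observed only during the first
     *  0.4 of each day (the "night" window) with Gaussian magnitude noise.
     *  Times within each night are drawn at random, i.e., irregular sampling. */
    static double[][] generate(double periodDays, double amplitudeMag,
                               int nDays, double noiseSigma, long seed) {
        Random rng = new Random(seed);
        List<Double> t = new ArrayList<>();
        List<Double> m = new ArrayList<>();
        for (int day = 0; day < nDays; day++) {
            for (int obs = 0; obs < 10; obs++) {            // ~10 visits/night
                double time = day + 0.4 * rng.nextDouble(); // night window only
                double mag = amplitudeMag
                        * Math.sin(2 * Math.PI * time / periodDays)
                        + noiseSigma * rng.nextGaussian();  // noise model
                t.add(time);
                m.add(mag);
            }
        }
        double[][] out = new double[2][t.size()];
        for (int i = 0; i < t.size(); i++) {
            out[0][i] = t.get(i);
            out[1][i] = m.get(i);
        }
        return out; // times are unsorted within a night; a real generator would sort
    }

    public static void main(String[] args) {
        double[][] lc = generate(0.57, 0.5, 30, 0.02, 42L);
        System.out.println(lc[0].length + " samples");
    }
}
```

The same skeleton accommodates the proposed archetype/generator split: replace the sinusoid with an archetype waveform, and draw period, amplitude, and noise from the class's feature distributions rather than fixing them.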
We define two representations:

• the archetype: the fundamental representation of what makes type A that type; this might be the "first" observed variable star of the type, a pinnacle example
• the generator: the set of feature distributions that describes the range of the variable star class type

We propose to construct such a setup to experiment with, and test, our supervised classification system. The injection of synthetic data would allow us to determine the extrema that will still produce a result in classification (minimum frequency, minimum amplitude changes, inconsistent cyclic pattern, etc.). Synthetic models can provide indications as to which features work when, and which survey conditions are necessary for the classification algorithm to operate. To develop prior probability estimates and feature-space likelihood distributions, we use labeled survey data from the identified (and available) surveys. We can also use stellar models to help guide our distribution estimates, especially when we are attempting to model the extrema of any given class.

We have attempted an initial effort; however, based on our initial research, we have identified a set of challenges. As discussed by Sterken and Jaschek [2005], there is no standard compendium of variable stars. The American Association of Variable Star Observers monitors, coordinates, and defines variable star types, but it does not manage a single "encyclopedia" of archetypes. Without a singular standard collection (and, in some cases, not even a standard definition) of some variable star types, the generation of archetypes for all variable star types is impractical (a completeness issue). Similarly, interpolation across functional shapes is a leading-edge technology.

We outline the following developments of this research:

1. System Design and Performance of an Automated Classifier

(a) Publication. Johnston, K. B., & Oluseyi, H. M. (2017).
Generation of a supervised classification algorithm for time-series variable stars with an application to the LINEAR dataset. New Astronomy, 52, 35-47.

(b) LSC [ascl:1807.033]. Supervised classification of time-series variable stars. Johnston, Kyle B. (https://github.com/kjohnston82/LINEARSupervisedClassification)

i. LSC (LINEAR Supervised Classification) trains a number of classifiers, including random forest and k-nearest neighbor, to classify variable stars and compares the results to determine which classifier is most successful. Written in R, the package includes anomaly detection code for testing the application of the selected classifier to new data, thus enabling the creation of highly reliable data sets of classified variable stars.

(c) Results

i. We have demonstrated the construction and application of a supervised classification algorithm on variable star data. Such an algorithm will process observed stellar features and produce quantitative estimates of stellar class labels. Using a hand-processed (verified) data set derived from the ASAS, OGLE, and Hipparcos surveys, an initial training and testing set was derived.

ii. The trained one-vs.-all algorithms were optimized using the testing data via minimization of the misclassification rate. From application of the trained algorithm to the testing data, performance estimates were generated.

2. (a) Publication. Johnston, K. B., & Peter, A. M. (2017). Variable star signature classification using slotted symbolic Markov modeling. New Astronomy, 50, 1-11.

(b) Poster. Johnston, K. B., & Peter, A. M. (2016). Variable star signature classification using slotted symbolic Markov modeling. Presented at AAS 227, Kissimmee, FL.

(c) SSMM [ascl:1807.032]. Slotted symbolic Markov modeling for classifying variable star signatures. Johnston, Kyle B.; Peter, Adrian M.

i. SSMM (slotted symbolic Markov modeling) reduces time-domain stellar variable observations to classify stellar variables.
The method can be applied to both folded and unfolded data and does not require time warping for waveform alignment. Written in MATLAB, the performance of the supervised classification code is quantifiable and consistent, and the rate at which new data are processed is dependent only on the computational processing power available. (Code: https://github.com/kjohnston82/SSMM)

(d) Results

i. The SSMM methodology developed has been able to generate a feature space that separates variable stars by class (supervised classification). This methodology has the benefit of being able to accommodate irregular sampling rates, dropouts, and some degree of time-domain variance. It also provides a fairly simple methodology for feature-space generation, necessary for classification.

ii. One of the major advantages of the methodology used is that a signature pattern (the transition state model) is generated and updated with new observations.

iii. The performance of four separate classifiers trained on the UCR data set is examined. It has been shown that the methodology presented is comparable to direct distance methods (the UCR baseline). It is also shown that the methodology presented is more flexible.

3. (a) Publication. Johnston, K. B., et al. (2019). A detection metric designed for O'Connell effect eclipsing binaries. Computational Astrophysics and Cosmology. Manuscript in review.

(b) Poster. Johnston, K. B., et al. (2018). Learning a novel detection metric for the detection of O'Connell effect eclipsing binaries. Presented at AAS 231, National Harbor, MD.

(c) OCD. O'Connell Effect Detector Using Push-Pull Learning. Johnston, Kyle B.; Haber, Rana.

i. OCD (O'Connell effect detector using push-pull learning) detects eclipsing binaries that demonstrate the O'Connell effect. This time-domain signature extraction methodology uses a supporting supervised pattern detection algorithm.
The methodology maps stellar variable observations (time-domain data) to a new representation known as distribution fields (DF), the properties of which enable efficient handling of issues such as irregular sampling and multiple values per time instance. Using this representation, the code applies a metric learning technique directly on the DF space capable of specifically identifying the stars of interest; the metric is tuned on a set of labeled eclipsing binary data from the Kepler survey, targeting particular systems exhibiting the O'Connell effect. This code is useful for large-scale data volumes such as those expected from next-generation telescopes like LSST. (Code: https://github.com/kjohnston82/OCDetector)

(d) Results

i. A modular design is developed that can be used to detect types of stars and star systems that are cyclostationary in nature. With a change in feature space, specifically one that is tailored to the target signatures of interest and based on prior experience, this design can be replicated for other targets that do not demonstrate a cyclostationary signal (e.g., impulsive, nonstationary) and even for targets of interest that are not time-variable in nature but have a consistent observable signature (e.g., spectrum, photometry, image point-spread function).

ii. The method outlined here has demonstrated the ability to detect targets of interest given a training set consisting of expertly labeled light curve training data.

iii. The procedure presents two new functionalities: the DF, a shape-based feature space, and the Push-Pull Matrix Metric Learning algorithm.

4. (a) Paper. Johnston, K. B., et al. (2019). Variable star classification using multi-view metric learning. Computational Astrophysics and Cosmology. Manuscript in review.

(b) Poster. Johnston, K. B., et al. (2018). Variable star classification using multi-view metric learning. Presented at ADASS Conference XXVIII, College Park, MD.

(c) Java Project VariableStarAnalysis.
Contains Java translations of code designed specifically for the analysis and supervised classification of variable stars (https://github.com/kjohnston82/VariableStarAnalysis).

(d) Results

i. The feature extraction methodologies are applied to the LINEAR and UCR datasets, along with a standard set of time-domain descriptive statistics and, in the case of the LINEAR dataset, a combination of ugriz colors.

ii. To support the set of high-dimensionality features, or views, multi-view metric learning is investigated as a viable design. Multi-view learning provides an avenue for integrating multiple transforms to generate a superior classifier.

iii. The structure of multi-view metric learning allows a number of modern computational designs to be used to support increasing scale and scope (e.g., parallel computation); these considerations can be leveraged given the parameters of the experiment designed or the project in question.

iv. A new classifier design that does not require transformation of the matrix-variate views ahead of time is presented. This classifier operates directly on the matrix data.

v. The development of both algorithm designs (matrix-variate and vector-variate) with respect to the targeted experiment of interest (discrimination of time-domain variable stars) highlighted a number of challenges to be addressed prior to practical application. In overcoming these challenges, it was found that the novel classifier design (LM3L-MV) performed on the order of the staged (vectorization + ECVA + LM3L) classifier.

Our efforts in astroinformatics have yielded code, identified previously unlabeled stellar variables, proved out a new feature space and classifier, and established methodologies to be used in future variable star identification efforts.

Bibliography

K. N. Abazajian, J. K. Adelman-McCarthy, M. A. Agüeros, S. S. Allam, C. Allende Prieto, D. An, K. S. J. Anderson, S. F. Anderson, J. Annis, N. A. Bahcall, et al. The Seventh Data Release of the Sloan Digital Sky Survey. ApJS, 182:543-558, June 2009. doi: 10.1088/0067-0049/182/2/543.

S. Akaho. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071, 2006.

C. Alcock, R. A. Allsman, D. R.
Alves, et al. The MACHO Project LMC Variable Star Inventory. VII. The Discovery of RV Tauri Stars and New Type II Cepheids in the Large Magellanic Cloud. AJ, 115:1921-1933, May 1998. doi: 10.1086/300317.

N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175-185, 1992.

R. Angeloni, R. Contreras Ramos, M. Catelan, et al. The VVV Templates Project. Towards an automated classification of VVV light-curves. I. Building a database of stellar variability in the near-infrared. A&A, 567:A100, July 2014. doi: 10.1051/0004-6361/201423904.

D. J. Armstrong, J. Kirk, K. W. F. Lam, et al. K2 variable catalogue - II. Machine learning classification of variable stars and eclipsing binaries in K2 fields 0-4. MNRAS, 456:2260-2272, February 2016. doi: 10.1093/mnras/stv2836.

National Research Council. New Worlds, New Horizons in Astronomy and Astrophysics, 2010.

E. Bair, T. Hastie, D. Paul, and R. Tibshirani. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473), 2006.

N. M. Ball and R. J. Brunner. Data mining and machine learning in astronomy. International Journal of Modern Physics D, 19(07):1049-1106, 2010.

C. B. Barber, D. P. Dobkin, and H. Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):469-483, 1996.

T. Barclay, G. Ramsay, P. Hakala, et al. Stellar variability on time-scales of minutes: results from the first 5 yr of the Rapid Temporal Survey. MNRAS, 413:2696-2708, June 2011. doi: 10.1111/j.1365-2966.2011.18345.x.

J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988.

G. Bass and K. Borne. Supervised ensemble classification of Kepler variable stars. MNRAS, 459:3721-3737, July 2016. doi: 10.1093/mnras/stw810.

A. Bellet, A. Habrard, and M. Sebban. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1):1-151, 2015.

R.
Bellman. Adaptive Control Processes: A Guided Tour, volume 4. Princeton University Press, Princeton, 1961.

C. Bergmeir and J. M. Benítez. Neural networks in R using the Stuttgart neural network simulator: RSNNS. Journal of Statistical Software, 46(7):1-26, 2012.

M. Berry, Ž. Ivezić, B. Sesar, et al. The Milky Way Tomography with Sloan Digital Sky Survey. IV. Dissecting Dust. ApJ, 757:166, October 2012. doi: 10.1088/0004-637X/757/2/166.

J. Blomme, L. M. Sarro, F. T. O'Donovan, et al. Improved methodology for the automated classification of periodic variable stars. MNRAS, 418:96-106, November 2011. doi: 10.1111/j.1365-2966.2011.19466.x.

V. J. Bolós and R. Benítez. The Wavelet Scalogram in the Study of Time Series. In Advances in Differential Equations and Applications, volume 4, pages 147-154. SEMA SIMAI Springer Series, October 2014. doi: 10.1007/978-3-319-06953-1_15.

Booz Allen Hamilton. The Field Guide to Data Science. Booz Allen Hamilton, 2013.

S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the 8th SIAM International Conference on Data Mining, pages 243-254. SIAM, October 2008.

K. Borne, A. Accomazzi, J. Bloom, et al. Astroinformatics: A 21st Century Approach to Astronomy. In astro2010: The Astronomy and Astrophysics Decadal Survey, 2009.

K. D. Borne. Astroinformatics: data-oriented astronomy research and education. Earth Science Informatics, 3(1-2):5-17, 2010.

R. Bos, S. de Waele, and P. M. T. Broersen. Autoregressive spectral estimation by application of the Burg algorithm to irregularly sampled data. IEEE Transactions on Instrumentation and Measurement, 51(6):1289-1294, December 2002.

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152.
ACM, 1992.

N. Bosner. Fast Methods for Large Scale Singular Value Decomposition. PhD thesis, Department of Mathematics, University of Zagreb, 2006.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161-168, 2008.

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. CRC Press, 1984.

P. M. Broersen. Practical aspects of the spectral analysis of irregularly sampled data with time-series models. IEEE Transactions on Instrumentation and Measurement, 58(5):1380-1388, 2009.

S. Carliles, T. Budavári, S. Heinis, et al. Random Forests for Photometric Redshifts. ApJ, 712:511-515, March 2010. doi: 10.1088/0004-637X/712/1/511.

C. Cassisi, P. Montalto, M. Aliotta, et al. Similarity measures and dimensionality reduction techniques for time series data mining. In Advances in Data Mining Knowledge Discovery and Applications. InTech, 2012.

S. Cha. Comprehensive survey on distance/similarity measures between probability density functions. City, 1(2):1, 2007.

V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

R. Chatpatanasiri, T. Korsrilabutr, P. Tangchanachaianan, and B. Kijsirikul. On kernelizing Mahalanobis distance learning algorithms. arXiv preprint, 2008.

Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista. The UCR time series classification archive, July 2015.

J. L. Christiansen, B. D. Clarke, C. J. Burke, et al. Measuring Transit Signal Recovery in the Kepler Pipeline. I. Individual Events. ApJS, 207:35, August 2013. doi: 10.1088/0067-0049/207/2/35.

R. E. Cohen and A. Sarajedini. SX Phoenicis period-luminosity relations and the blue straggler connection. MNRAS, 419:342-357, January 2012.
doi: 10.1111/j.1365-2966.2011.19697.x.

M. S. Connelley, B. Reipurth, and A. T. Tokunaga. The Evolution of the Multiplicity of Embedded Protostars. I. Sample Properties and Binary Detections. AJ, 135:2496-2525, June 2008. doi: 10.1088/0004-6256/135/6/2496.

R. M. Cutri, M. F. Skrutskie, S. van Dyk, et al. VizieR Online Data Catalog: 2MASS All-Sky Catalog of Point Sources (Cutri+ 2003). VizieR Online Data Catalog, 2246, June 2003.

C. Dalitz. Reject options and confidence measures for kNN classifiers. Schriftenreihe des Fachbereichs Elektrotechnik und Informatik, Hochschule Niederrhein, 8:16-38, 2009.

J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233-240. ACM, 2006.

A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265-274, 1981.

S. Deb and H. P. Singh. Light curve analysis of variable stars using Fourier decomposition and principal component analysis. A&A, 507:1729-1737, December 2009. doi: 10.1051/0004-6361/200912851.

J. Debosscher. Automated Classification of Variable Stars: Application to the OGLE and CoRoT Databases. PhD thesis, Institute of Astronomy, Katholieke Universiteit Leuven, Belgium, 2009.

E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. Misc functions of the Department of Statistics (e1071), TU Wien. R package, pages 1-5, 2008.

S. Ding and R. D. Cook. Dimension folding PCA and PFC for matrix-valued predictors. Statistica Sinica, 24(1):463-492, 2014.

S. Ding and R. D. Cook. Matrix variate regressions and envelope models.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(2):387–408, 2018.
A. D'Isanto, S. Cavuoti, M. Brescia, et al. An analysis of feature relevance in the classification of astronomical transients with machine learning methods. MNRAS, 457:3119–3132, April 2016. doi: 10.1093/mnras/stw157.
S. G. Djorgovski, A. A. Mahabal, C. Donalek, M. J. Graham, A. J. Drake, B. Moghaddam, and M. Turmon. Flashes in a Star Stream: Automated Classification of Astronomical Transient Events. In E-Science (e-Science), 2012 IEEE 8th International Conference on, September 2012.
A. J. Drake, S. G. Djorgovski, A. Mahabal, et al. First Results from the Catalina Real-Time Transient Survey. ApJ, 696:870–884, May 2009. doi: 10.1088/0004-637X/696/1/870.
P. Dubath, L. Rimoldini, M. Süveges, et al. Random forest automated supervised classification of Hipparcos periodic variable stars. MNRAS, 414:2602–2617, July 2011. doi: 10.1111/j.1365-2966.2011.18575.x.
G. Duchêne and A. Kraus. Stellar Multiplicity. ARA&A, 51:269–310, August 2013. doi: 10.1146/annurev-astro-081710-102602.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2012.
R. P. W. Duin, P. Juszczak, P. Paclik, et al. PRTools4.1, a Matlab toolbox for pattern recognition. Delft University of Technology, 2007.
P. Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2):105–123, 1999.
M. Einasto, L. J. Liivamägi, E. Saar, et al. SDSS DR7 superclusters. Principal component analysis. A&A, 535:A36, November 2011. doi: 10.1051/0004-6361/201117529.
N. Epchtein, B. de Batz, L. Capoani, L. Chevallier, E. Copet, P. Fouqué, P. Lacombe, T. Le Bertre, S. Pau, D. Rouan, S. Ruphy, G. Simon, D. Tiphène, W. B. Burton, E. Bertin, E. Deul, H. Habing, J. Borsenberger, M. Dennefeld, F. Guglielmo, C. Loup, G. Mamon, Y. Ng, A. Omont, L. Provost, J.-C. Renault, F. Tanguy, S. Kimeswenger, C. Kienel, F. Garzon, P. Persi, M. Ferrari-Toniolo, A. Robin, G.
Paturel, I. Vauglin, T. Forveille, X. Delfosse, J. Hron, M. Schultheis, I. Appenzeller, S. Wagner, L. Balazs, A. Holl, J. Lépine, P. Boscolo, E. Picazzio, P.-A. Duc, and M.-O. Mennessier. The deep near-infrared southern sky survey (DENIS). The Messenger, 87:27–34, March 1997.
L. Eyer and C. Blake. Automated classification of variable stars for All-Sky Automated Survey 1-2 data. MNRAS, 358:30–38, March 2005. doi: 10.1111/j.1365-2966.2005.08651.x.
C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast Subsequence Matching in Time-Series Databases, volume 23. ACM, 1994.
T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
Apache Software Foundation. Apache Maven, 2018a. URL https://maven.apache.org/index.html.
Apache Software Foundation. Apache Math Commons, 2018b. URL http://commons.apache.org/proper/commons-math/.
Apache Software Foundation. Apache Commons Lang, 2018c. URL https://commons.apache.org/proper/commons-lang/.
W. L. Freedman, B. F. Madore, J. R. Mould, et al. Distance to the Virgo cluster galaxy M100 from Hubble Space Telescope observations of Cepheids. Nature, 371:757–762, October 1994. doi: 10.1038/371757a0.
J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.
J. H. Friedman. A variable span smoother. Technical report, Laboratory for Computational Statistics, Stanford University, CA, 1984.
T. Fu. A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1):164–181, 2011.
B. D. Fulcher, M. A. Little, and N. S. Jones. Highly comparative time-series analysis: the empirical structure of time series and their methods. Journal of the Royal Society Interface, 10(83):20130048, 2013.
P. A. Gagniuc.
Markov Chains: From Theory to Implementation and Experimentation. John Wiley & Sons, 2017.
Gaia Collaboration, L. Eyer, L. Rimoldini, M. Audard, R. I. Anderson, K. Nienartowicz, F. Glass, O. Marchal, M. Grenon, N. Mowlavi, et al. Gaia Data Release 2. Variable stars in the colour-absolute magnitude diagram. A&A, 623:A110, March 2019. doi: 10.1051/0004-6361/201833304.
X. Ge and P. Smyth. Deformable Markov model templates for time-series pattern matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81–90. ACM, 2000.
H. Glanz and L. Carvalho. An expectation-maximization algorithm for the matrix normal distribution. arXiv preprint arXiv:1309.6609, 2013.
J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pages 513–520, 2005.
M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.
J. Grabocka, A. Nanopoulos, and L. Schmidt-Thieme. Invariant time-series classification. In Machine Learning and Knowledge Discovery in Databases, pages 725–740. Springer, 2012.
M. J. Graham, S. G. Djorgovski, A. A. Mahabal, C. Donalek, and A. J. Drake. Machine-assisted discovery of relationships in astronomy. MNRAS, 431:2371–2384, May 2013a. doi: 10.1093/mnras/stt329.
M. J. Graham, A. J. Drake, S. G. Djorgovski, et al. A comparison of period finding algorithms. MNRAS, 434:3423–3444, October 2013b. doi: 10.1093/mnras/stt1264.
R. L. Graham. An efficient algorithm for determining the convex hull of a finite planar set. Information Processing Letters, 1:132–133, 1972.
A. Gupta and D. Nagar. Matrix Variate Distributions. Monographs and Surveys in Pure and Applied Mathematics, 2000.
R. Haber, A. Rangarajan, and A. M. Peter. Discriminative Interpolation for Classification of Functional Data.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 20–36. Springer, 2015.
P. Harrison. JSOFA, a pure Java translation of the SOFA library, 2016.
T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. The Journal of Machine Learning Research, 5:1391–1415, 2004.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.
M. T. Heath. Scientific Computing: An Introductory Survey, volume 80. SIAM, 2018.
K. Hechenbichler and K. Schliep. Weighted k-nearest-neighbor techniques and ordinal classification. Sonderforschungsbereich, 386, 2004.
E. Helfer, B. Smith, R. Haber, and A. Peter. Statistical Analysis of Functional Data. Technical report, Florida Institute of Technology, 2015.
T. A. Hinners, K. Tat, and R. Thorp. Machine Learning Techniques for Stellar Light Curve Classification. AJ, 156:7, July 2018. doi: 10.3847/1538-3881/aac16d.
D. F. Holcomb and R. E. Norberg. Nuclear spin relaxation in alkali metals. Physical Review, 98(4):1074, 1955.
H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
J. Hu, J. Lu, J. Yuan, and Y. Tan. Large margin multi-metric learning for face and kinship verification in the wild. In Asian Conference on Computer Vision, pages 252–267. Springer, 2014.
J. Hu, J. Lu, Y. Tan, J. Yuan, and J. Zhou. Local large-margin multi-metric learning for face and kinship verification. IEEE Transactions on Circuits and Systems for Video Technology, 28(8):1875–1891, 2018.
I. Hubeny and D. Mihalas. Theory of Stellar Atmospheres. 2014.
P. Huijse, P. A. Estevez, P. Zegers, J. C. Principe, and P. Protopapas. Period Estimation in Astronomical Time Series Using Slotted Correntropy. IEEE Signal Processing Letters, 18:371–374, June 2011. doi: 10.1109/LSP.2011.2141987.
P. Huijse, P. A. Estevez, P. Protopapas, P. Zegers, and J. C. Principe.
An Information Theoretic Algorithm for Finding Periodicities in Stellar Light Curves. IEEE Transactions on Signal Processing, 60:5135–5145, October 2012. doi: 10.1109/TSP.2012.2204260.
I. Iben, Jr. Globular Cluster Stars: Results of Theoretical Evolution and Pulsation Studies Compared with the Observations. PASP, 83:697, December 1971. doi: 10.1086/129210.
C. A. Iglesias, F. J. Rogers, and B. G. Wilson. Reexamination of the metal contribution to astrophysical opacity. ApJ, 322:L45–L48, November 1987. doi: 10.1086/185034.
K. E. Iverson. A programming language. In Proceedings of the May 1-3, 1962, Spring Joint Computer Conference, pages 345–351. ACM, 1962.
J. M. Jenkins, D. A. Caldwell, H. Chandrasekaran, et al. Overview of the Kepler Science Processing Pipeline. ApJ, 713:L87–L91, April 2010. doi: 10.1088/2041-8205/713/2/L87.
R. A. Johnson, D. W. Wichern, et al. Applied Multivariate Statistical Analysis, volume 4. Prentice Hall, Englewood Cliffs, NJ, 1992.
K. B. Johnston. LSC: Supervised classification of time-series variable stars. Astrophysics Source Code Library, July 2018.
K. B. Johnston and H. M. Oluseyi. Generation of a supervised classification algorithm for time-series variable stars with an application to the LINEAR dataset. New A, 52:35–47, April 2017. doi: 10.1016/j.newast.2016.10.004.
K. B. Johnston and A. M. Peter. SSMM: Slotted Symbolic Markov Modeling for classifying variable star signatures. Astrophysics Source Code Library, July 2018.
K. B. Johnston and A. M. Peter. Variable Star Signature Classification using Slotted Symbolic Markov Modeling. New A, 50:1–11, January 2017. doi: 10.1016/j.newast.2016.06.001.
K. B. Johnston, S. M. Caballero-Nieves, A. M. Peter, V. Petit, and R. Haber. JVarStar: Variable Star Analysis Library. Astrophysics Source Code Library, April 2019.
K. B. Johnston, R. Haber, S. M. Caballero-Nieves, A. M. Peter, V. Petit, and M. Knote. A novel detection metric designed for identification of O'Connell effect eclipsing binaries.
Forthcoming.
M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen. Multi-view discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):188–194, 2016.
A. Karatzoglou, D. Meyer, and K. Hornik. Support vector machines in R. Journal of Statistical Software, 2005.
D. Kedem, S. Tyree, F. Sha, et al. Non-linear metric learning. In Advances in Neural Information Processing Systems, pages 2573–2581, 2012.
E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems, 3(3):263–286, 2001.
E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification/clustering homepage, 2011.
D.-W. Kim and C. A. L. Bailer-Jones. A package for the automated classification of periodic variable stars. A&A, 587:A18, March 2016. doi: 10.1051/0004-6361/201527188.
Kim and D. Kang. Fire detection system using random forest classification for image sequences of complex background. Optical Engineering, 52(6):067202, 2013.
B. Kirk, K. Conroy, A. Prša, et al. Kepler Eclipsing Binary Stars. VII. The Catalog of Eclipsing Binaries Found in the Entire Kepler Data Set. AJ, 151:68, March 2016. doi: 10.3847/0004-6256/151/3/68.
C. R. Kitchin. Astrophysical Techniques. Philadelphia: Institute of Physics Publishing, 2003.
J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
M. Knote. Study of Eclipsing Binary Systems NSVS 732240 and NSVS 5726288 (Abstract). Journal of the American Association of Variable Star Observers (JAAVSO), 43:258, December 2015.
A. Kovačević, L. Č. Popović, A. I. Shapovalova, et al. Time Delay Evolution of Five Active Galactic Nuclei. Journal of Astrophysics and Astronomy, 36:475–493, December 2015. doi: 10.1007/s12036-015-9366-5.
W. Krzanowski. Principles of Multivariate Analysis.
Oxford University Press, 2000.
D. W. Kurtz. Asteroseismology: Past, Present and Future. Journal of Astrophysics and Astronomy, 26:123, June 2005. doi: 10.1007/BF02702322.
L. Lan, S. Haidong, Z. Wang, and S. Vucetic. An Active Learning Algorithm Based on Parzen Window Classification. In JMLR: Workshop and Conference Proceedings, pages 99–112, 2011.
A. M. Law and W. D. Kelton. Simulation Modeling and Analysis, volume 2. McGraw-Hill, New York, 1991.
S. Lê, J. Josse, F. Husson, et al. FactoMineR: an R package for multivariate analysis. Journal of Statistical Software, 25(1):1–18, 2008.
G. Lee and C. Scott. Nested support vector machines. IEEE Transactions on Signal Processing, 58(3):1648–1660, 2010.
G. Lee and C. D. Scott. The one class support vector machine solution path. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 2, pages II-521. IEEE, 2007.
Q. Li and J. S. Racine. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007.
A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 2–11. ACM, 2003.
J. Lin, E. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov., 15(2):107–144, October 2007. ISSN 1384-5810. doi: 10.1007/s10618-007-0064-z. URL http://dx.doi.org/10.1007/s10618-007-0064-z.
J. Lin, S. Williamson, K. Borne, and D. DeBarr. Pattern recognition in time series. Advances in Machine Learning and Data Mining for Astronomy, 1:617–645, 2012.
N. R. Lomb. Least-squares frequency analysis of unequally spaced data. Ap&SS, 39:447–462, February 1976. doi: 10.1007/BF00648343.
J. P. Long, E. C. Chi, and R. G. Baraniuk.
Estimating a Common Period for a Set of Irregularly Sampled Functions with Applications to Periodic Variable Star Data. arXiv e-prints, December 2014.
R. Longadge and S. Dongre. Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707, 2013.
Y. Luo, T. Liu, D. Tao, and C. Xu. Decomposition-based transfer distance metric learning for image classification. IEEE Transactions on Image Processing, 23(9):3789–3801, 2014.
A. Mahabal, K. Sheth, F. Gieseke, et al. Deep-learnt classification of light curves. In Computational Intelligence (SSCI), 2017 IEEE Symposium Series on, pages 1–8. IEEE, 2017.
C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.
F. J. Masci, D. I. Hoffman, C. J. Grillmair, and R. M. Cutri. Automated Classification of Periodic Variable Stars Detected by the Wide-field Infrared Survey Explorer. AJ, 148:21, July 2014. doi: 10.1088/0004-6256/148/1/21.
R. D. Mathieu. Pre-Main-Sequence Binary Stars. ARA&A, 32:465–530, 1994. doi: 10.1146/annurev.aa.32.090194.002341.
S. A. McCartney. A 2-D Model for the O'Connell Effect in W Ursae Majoris Systems. PhD thesis, The University of Oklahoma, 1999.
S. D. McCauliff, J. M. Jenkins, J. Catanzarite, C. J. Burke, J. L. Coughlin, J. D. Twicken, P. Tenenbaum, S. Seader, J. Li, and M. Cote. Automatic Classification of Kepler Planetary Transit Candidates. ApJ, 806:6, June 2015. doi: 10.1088/0004-637X/806/1/6.
D. McNamara. Luminosities of SX Phoenicis, Large-Amplitude Delta Scuti, and RR Lyrae Stars. PASP, 109:1221–1232, November 1997. doi: 10.1086/133999.
P. R. McWhirter, I. A. Steele, D. Al-Jumeily, A. Hussain, and M. M. B. R. Vellasco. The classification of periodic light curves from non-survey optimized observational data through automated extraction of phase-based visual features. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 3058–3065. IEEE, 2017.
A. A. Miller, J. S. Bloom, J. W. Richards, et al.
A Machine-learning Method to Infer Fundamental Stellar Parameters from Photometric Light Curves. ApJ, 798:122, January 2015. doi: 10.1088/0004-637X/798/2/122.
A. A. Miranda, Y. Le Borgne, and G. Bontempi. New routes from minimal approximation error to principal components. Neural Processing Letters, 27(3):197–207, 2008.
S. Modak, T. Chattopadhyay, and A. K. Chattopadhyay. Unsupervised classification of eclipsing binary light curves through k-medoids clustering. arXiv e-prints, January 2018.
B. Naul, J. S. Bloom, F. Pérez, and S. van der Walt. A recurrent neural network for classification of unevenly sampled variable stars. Nature Astronomy, 2:151–155, November 2018. doi: 10.1038/s41550-017-0321-z.
C. Ngeow, S. Lucchini, S. Kanbur, et al. Preliminary analysis of ULPC light curves using Fourier decomposition technique. In Space Science and Communication (IconSpace), 2013 IEEE International Conference on, pages 7–12. IEEE, 2013.
L. Nørgaard, R. Bro, F. Westad, and S. B. Engelsen. A modification of canonical variates analysis to handle highly collinear multivariate data. Journal of Chemometrics, 20(8-10):425–435, 2006.
I. Nun, K. Pichara, P. Protopapas, and D.-W. Kim. Supervised Detection of Anomalous Light Curves in Massive Astronomical Catalogs. ApJ, 793:23, September 2014. doi: 10.1088/0004-637X/793/1/23.
I. Nun, P. Protopapas, B. Sim, M. Zhu, et al. FATS: Feature Analysis for Time Series. arXiv e-prints, May 2015.
D. J. K. O'Connell. The so-called periastron effect in eclipsing binaries (summary). MNRAS, 111:642, 1951. doi: 10.1093/mnras/111.6.642.
E. Öpik. Statistical Studies of Double Stars: On the Distribution of Relative Luminosities and Distances of Double Stars in the Harvard Revised Photometry North of Declination -31 deg. Publications of the Tartu Astrofizica Observatory, 25, 1924.
T. Padmanabhan. Theoretical Astrophysics - Volume 2, Stars and Stellar Systems. Cambridge University Press, July 2001. doi: 10.2277/0521562414.
L. Palaversa, Ž. Ivezić, L. Eyer, et al.
Exploring the Variable Sky with LINEAR. III. Classification of Periodic Light Curves. AJ, 146:101, October 2013. doi: 10.1088/0004-6256/146/4/101.
S. Parameswaran and K. Q. Weinberger. Large margin multi-task metric learning. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2, NIPS'10, pages 1867–1875, USA, 2010. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2997046.2997104.
H. Park, M. Jeon, and J. B. Rosen. Lower dimensional representation of text data based on centroids and least squares. BIT Numerical Mathematics, 43(2):427–448, 2003.
J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3(2):246–257, 1991.
M. J. Park and S. S. Cho. Functional Data Classification of Variable Stars. CSAM (Communications for Statistical Applications and Methods), 20(4):271–281, 2013.
E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962.
I. N. Pashchenko, K. V. Sokolovsky, and P. Gavras. Machine learning search for variable stars. MNRAS, 475:2326–2343, April 2018. doi: 10.1093/mnras/stx3222.
Patterson. Measures of dependence for matrices, 2015.
J. R. Percy. Understanding Variable Stars. Cambridge University Press, May 2007.
M. F. Pérez-Ortiz, A. García-Varela, A. J. Quiroz, et al. Machine learning techniques to select Be star candidates. An application in the OGLE-IV Gaia south ecliptic pole field. A&A, 605:A123, September 2017. doi: 10.1051/0004-6361/201628937.
M. A. C. Perryman, L. Lindegren, J. Kovalevsky, et al. The HIPPARCOS Catalogue. A&A, 323:L49–L52, July 1997.
J. O. Petersen and J. Christensen-Dalsgaard. Pulsation models of delta Scuti variables. II. Delta Scuti stars as precise distance indicators. A&A, 352:547–554, December 1999.
K. B. Petersen, M. S. Pedersen, et al. The matrix cookbook. Technical University of Denmark, 7(15):510, 2008.
K. Pichara and P. Protopapas.
Automatic Classification of Variable Stars in Catalogs with Missing Data. ApJ, 777:83, November 2013. doi: 10.1088/0004-637X/777/2/83.
K. Pichara, P. Protopapas, D.-W. Kim, et al. An improved quasar detection method in EROS-2 and MACHO LMC data sets. MNRAS, 427:1284–1297, December 2012. doi: 10.1111/j.1365-2966.2012.22061.x.
G. Pojmanski. The All Sky Automated Survey. Catalog of about 3800 Variable Stars. Acta Astron., 50:177–190, June 2000.
G. Pojmanski. The All Sky Automated Survey. Catalog of Variable Stars. I. 0 h - 6 h Quarter of the Southern Hemisphere. Acta Astron., 52:397–427, December 2002.
G. Pojmanski, B. Pilecki, and D. Szczygiel. The All Sky Automated Survey. Catalog of Variable Stars. V. Declinations 0 deg - +28 deg of the Northern Hemisphere. Acta Astron., 55:275–301, September 2005.
A. Poveda, C. Allen, and A. Hernández-Alcántara. The Frequency Distribution of Semimajor Axes of Wide Binaries: Cosmogony and Dynamical Evolution. In W. I. Hartkopf, P. Harmanec, and E. F. Guinan, editors, Binary Stars as Critical Tools and Tests in Contemporary Astrophysics, volume 240 of IAU Symposium, pages 417–425, August 2007. doi: 10.1017/S1743921307004383.
P. Protopapas, J. M. Giammarco, L. Faccioli, M. F. Struble, R. Dave, and C. Alcock. Finding outlier light curves in catalogues of periodic variable stars. MNRAS, 369:677–696, June 2006. doi: 10.1111/j.1365-2966.2006.10327.x.
A. Prša and T. Zwitter. A Computational Guide to Physics of Eclipsing Binaries. I. Demonstrations and Perspectives. The Astrophysical Journal, 628:426–438, July 2005. doi: 10.1086/430591.
E. V. Quintana, J. M. Jenkins, B. D. Clarke, et al. Pixel-level calibration in the Kepler Science Operations Center pipeline. In Software and Cyberinfrastructure for Astronomy, volume 7740 of Proceedings of the SPIE, page 77401X, July 2010. doi: 10.1117/12.857678.
U. Rebbapragada, K. Lo, K. L. Wagstaff, et al. Classification of ASKAP VAST Radio Light Curves. In E. Griffin, R. Hanisch, and R.
Seaman, editors, New Horizons in Time Domain Astronomy, volume 285 of IAU Symposium, pages 397–399, April 2012. doi: 10.1017/S1743921312001196.
K. Rehfeld and J. Kurths. Similarity estimators for irregular and age-uncertain time series. Climate of the Past, 10(1):107–122, 2014.
K. Rehfeld, N. Marwan, J. Heitzig, and J. Kurths. Comparison of correlation analysis techniques for irregularly sampled time series. Nonlinear Processes in Geophysics, 18(3):389–404, 2011.
J. D. Reimann. Frequency Estimation Using Unequally-Spaced Astronomical Data. PhD thesis, University of California, Berkeley, January 1994.
J. D. M. Rennie and N. Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186. Kluwer, Norwell, MA, 2005.
L. W. Renninger and J. Malik. When is scene identification just texture recognition? Vision Research, 44(19):2301–2311, 2004.
D. E. Rumelhart, J. L. McClelland, PDP Research Group, et al. Parallel distributed processing: Exploration in the microstructure of cognition, 1986.
J. W. Richards, D. L. Starr, N. R. Butler, et al. On Machine-learned Classification of Variable Stars with Sparse and Noisy Time-series Data. ApJ, 733:10, May 2011. doi: 10.1088/0004-637X/733/1/10.
J. W. Richards, D. L. Starr, A. A. Miller, et al. Construction of a Calibrated Probabilistic Classification Catalog: Application to 50k Variable Sources in the All-Sky Automated Survey. ApJS, 203:32, December 2012. doi: 10.1088/0067-0049/203/2/32.
R. Rifkin and A. Klautau. In defense of one-vs-all classification. The Journal of Machine Learning Research, 5:101–141, 2004.
O. Rioul and M. Vetterli. Wavelets and signal processing. IEEE SP Magazine, 8(4):14–38, 1991. doi: 10.1109/79.91217.
M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pages 832–837, 1956.
S. M. Ross.
Applied Probability Models with Optimization Applications. Courier Corporation, 2013.
S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, University of California San Diego, La Jolla, Institute for Cognitive Science, 1985.
M. Safayani and M. T. M. Shalmani. Matrix-variate probabilistic model for canonical correlation analysis. EURASIP Journal on Advances in Signal Processing, 2011(1):748430, 2011.
N. N. Samus', E. V. Kazarovets, O. V. Durlevich, et al. General catalogue of variable stars: Version GCVS 5.1. Astronomy Reports, 61:80–88, January 2017. doi: 10.1134/S1063772917010085.
H. Sana, S. E. de Mink, A. de Koter, et al. Binary Interaction Dominates the Evolution of Massive Stars. Science, 337:444, July 2012. doi: 10.1126/science.1223344.
P. Santolamazza, M. Marconi, G. Bono, F. Caputo, S. Cassisi, and R. L. Gilliland. Linear Nonadiabatic Properties of SX Phoenicis Variables. ApJ, 554:1124–1140, June 2001. doi: 10.1086/321408.
J. D. Scargle. Studies in astronomical time series analysis. II - Statistical aspects of spectral analysis of unevenly spaced data. ApJ, 263:835–853, December 1982. doi: 10.1086/160554.
L. L. Scharf. Statistical Signal Processing, volume 98. Addison-Wesley, Reading, MA, 1991.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
M. Scholz. Approaches to Analyse and Interpret Biological Profile Data. PhD thesis, University of Potsdam, Germany, 2006.
M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems, pages 41–48, 2004.
B. Sesar, J. S.
Stuart, Ž. Ivezić, et al. Exploring the Variable Sky with LINEAR. I. Photometric Recalibration with the Sloan Digital Sky Survey. AJ, 142:190, December 2011. doi: 10.1088/0004-6256/142/6/190.
B. Sesar, Ž. Ivezić, J. S. Stuart, et al. Exploring the Variable Sky with LINEAR. II. Halo Structure and Substructure Traced by RR Lyrae Stars to 30 kpc. AJ, 146:21, August 2013. doi: 10.1088/0004-6256/146/2/21.
L. Sevilla-Lara and E. Learned-Miller. Distribution fields for tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1910–1917. IEEE, 2012.
H. Shi, Y. Luo, C. Xu, and Y. Wen. Manifold Regularized Transfer Distance Metric Learning. In Proceedings of the British Machine Vision Conference (BMVC), pages 158.1–158.11. BMVA Press, September 2015. ISBN 1-901725-53-7. doi: 10.5244/C.29.158. URL https://dx.doi.org/10.5244/C.29.158.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Routledge, 2018.
R. J. Siverd, T. G. Beatty, J. Pepper, et al. KELT-1b: A Strongly Irradiated, Highly Inflated, Short Period, 27 Jupiter-mass Companion Transiting a Mid-F Star. ApJ, 761:123, December 2012. doi: 10.1088/0004-637X/761/2/123.
M. Sokolova and G. Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4):427–437, 2009.
E. R. Stanway, J. J. Eldridge, and G. D. Becker. Stellar population effects on the inferred photon density at reionization. MNRAS, 456:485–499, February 2016. doi: 10.1093/mnras/stv2661.
C. Sterken and C. Jaschek. Light Curves of Variable Stars: A Pictorial Atlas. Cambridge University Press, 2005.
G. H. Stokes, J. B. Evans, H. E. M. Viggh, F. C. Shelly, and E. C. Pearce. Lincoln Near-Earth Asteroid Program (LINEAR). Icarus, 148:21–28, November 2000. doi: 10.1006/icar.2000.6493.
W. Sutherland, J. Emerson, G. Dalton, E. Atad-Ettedgui, S. Beard, R. Bennett, N. Bezawada, A. Born, M. Caldwell, P. Clark, S. Craig, D. Henry, P. Jeffers, B. Little, A. McPherson, J. Murray, M.
Stewart, B. Stobie, D. Terrett, K. Ward, M. Whalley, and G. Woodhouse. The Visible and Infrared Survey Telescope for Astronomy (VISTA): Design, technical overview, and performance. A&A, 575:A25, March 2015. doi: 10.1051/0004-6361/201424973.
K. Szatmary, J. Vinko, and J. Gal. Application of wavelet analysis in variable star research. I. Properties of the wavelet map of simulated variable star light curves. A&AS, 108:377–394, December 1994.
R. Tagliaferri, G. Longo, S. Andreon, et al. Neural Networks for Photometric Redshifts Evaluation. Lecture Notes in Computer Science, 2859:226–234, 2003. doi: 10.1007/978-3-540-45216-4_26.
D. M. J. Tax. One-class Classification. PhD thesis, Delft University of Technology, 2001.
D. M. J. Tax. DDtools, the Data Description Toolbox for Matlab, July 2014. Version 2.1.1.
D. M. J. Tax and R. P. W. Duin. Combining one-class classifiers. In International Workshop on Multiple Classifier Systems, pages 299–308. Springer, 2001.
D. M. J. Tax and K. R. Muller. Feature extraction for one-class classification. In Proceedings of the ICANN/ICONIP, pages 342–349, 2003.
JUnit Team. JUnit, 2018a. URL https://junit.org.
MatFileRW Team. MatFileRW. GitHub, 2018b. URL https://github.com/diffplug/matfilerw.
Quality Open Software Team. Simple Logging Facade for Java (SLF4J), 2017.
Skymind Team. ND4J, 2018c. URL https://deeplearning4j.org/docs/latest/.
M. Templeton. Time-Series Analysis of Variable Star Data. Journal of the American Association of Variable Star Observers (JAAVSO), 32:41–54, June 2004.
D. Terrell, J. Gross, and W. R. Cooney. A BVRcIc Survey of W Ursae Majoris Binaries. AJ, 143:99, April 2012. doi: 10.1088/0004-6256/143/4/99.
S. A. Teukolsky, B. P. Flannery, W. H. Press, and W. T. Vetterling. Numerical Recipes in C. SMR, 693(1):59–70, 1992.
C. Torrence and G. P. Compo. A practical guide to wavelet analysis. Bulletin of the American Meteorological Society, 79(1), 1998.
L. Torresani and K. Lee. Large margin component analysis.
In Advances in Neural Information Processing Systems, pages 1385–1392, 2007.
J. W. Tukey. Exploratory Data Analysis. Reading, Mass., 1977.
J. D. Twicken, H. Chandrasekaran, J. M. Jenkins, et al. Presearch data conditioning in the Kepler Science Operations Center pipeline. In Software and Cyberinfrastructure for Astronomy, volume 7740 of Proc. SPIE, page 77401U, July 2010. doi: 10.1117/12.856798.
A. Udalski, I. Soszynski, M. Szymanski, et al. The Optical Gravitational Lensing Experiment. Cepheids in the Magellanic Clouds. IV. Catalog of Cepheids from the Large Magellanic Cloud. Acta Astron., 49:223–317, September 1999.
A. Udalski, B. Paczynski, K. Zebrun, et al. The Optical Gravitational Lensing Experiment. Search for Planetary and Low-Luminosity Object Transits in the Galactic Disk. Results of 2001 Campaign. Acta Astron., 52:1–37, March 2002.
L. Valenzuela and K. Pichara. Unsupervised classification of variable stars. MNRAS, 474:3259–3272, March 2018. doi: 10.1093/mnras/stx2913.
J. T. VanderPlas and Ž. Ivezić. Periodograms for Multiband Astronomical Time Series. The Astrophysical Journal, 812(1):18, October 2015. doi: 10.1088/0004-637X/812/1/18.
C. Wang, C. Chi, W. Zhou, and R. K. Wong. Coupled Interdependent Attribute Analysis on Mixed Data. In AAAI, pages 1861–1867, 2015.
K. Q. Weinberger and L. K. Saul. Fast solvers and efficient implementations for distance metric learning. In Proceedings of the 25th International Conference on Machine Learning, pages 1160–1167. ACM, 2008.
K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1577069.1577078.
S. R. Wilk. Mythological Evidence for Ancient Observations of Variable Stars. Journal of the American Association of Variable Star Observers (JAAVSO), 24:129–133, 1996.
N. J. Wilsey and M. M. Beaky. Revisiting the O'Connell Effect in Eclipsing Binary Systems.
Society for Astronomical Sciences Annual Symposium, 28:107, May 2009.
D. R. Wilson and T. R. Martinez. Improved heterogeneous distance functions. J. Artif. Int. Res., 6(1):1–34, January 1997. ISSN 1076-9757. URL http://dl.acm.org/citation.cfm?id=1622767.1622768.
R. E. Wilson and E. J. Devinney. Realization of Accurate Close-Binary Light Curves: Application to MR Cygni. ApJ, 166:605, June 1971. doi: 10.1086/150986.
H. Wold. A Study in Analysis of Stationary Time Series. Journal of the Royal Statistical Society, 102(2):295–298, 1939.
E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 521–528, 2003.
C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.
N. Ye. A Markov chain model of temporal behavior for anomaly detection. In , volume 166, page 169, West Point, NY, 2000.
M. Zboril and G. Djurasevic. Progress Report on the Monitoring Active Late-Type Stars in 2005/2006 and the Analysis of V523 Cas. Serbian Astronomical Journal, 173:89, December 2006. doi: 10.2298/SAJ0673089Z.
T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5–31, 2001.
H. Zhou and L. Li. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):463–483, 2014.
L. Zhou, H. Wang, Z. M. Lu, T. Nie, and K. Zhao. Face Recognition Based on LDA and Improved Pairwise-Constrained Multiple Metric Learning Method. Journal of Information Hiding and Multimedia Signal Processing, 7(5):1092, 2016.
S. Zhou et al. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532–562, 2014.
X. Zhu, Z. Huang, H. T. Shen, et al. Dimensionality reduction by mixed kernel canonical correlation analysis.
Pattern Recognition, 45(8):3003–3016, 2012.

Appendix A
Chapter 4: Broad Class Performance Results

Figure A.1: Random Forest (mtry = 8, ntree = 100): (Top Left) Pulsating, (Top Right) Erupting, (Bottom Left) Multi-Star, (Bottom Right) Other
Figure A.2: SVM: (Top Left) Pulsating, (Top Right) Erupting, (Bottom Left) Multi-Star, (Bottom Right) Other
Figure A.3: kNN: (Top Left) Pulsating, (Top Right) Erupting, (Bottom Left) Multi-Star, (Bottom Right) Other
Figure A.4: MLP: (Top Left) Pulsating, (Top Right) Erupting, (Bottom Left) Multi-Star, (Bottom Right) Other
Figure A.5: MLP: Individual Classification, Performance Analysis
Figure A.6: kNN: Individual Classification, Performance Analysis

Appendix B
Chapter 5: Optimization Analysis Figures

Figure B.1: Classifier Optimization for UCR Data. Each panel plots misclassification rate (fold cross-validation) against the tuning parameter: (a) number of neighbors k for the nearest neighbor classifiers, (b) Gaussian kernel spread for the Parzen window classifier (PWC), (c) number of trees generated for the random forest.
Figure B.2: Classifier Optimization for LINEAR Data. Panels and axes as in Figure B.1: (a) nearest neighbor classifiers, (b) Parzen window classifier, (c) random forest.

B.1 Chapter 5: Performance Analysis Tables

Table B.1: Confusion Matrix for Classifiers Based on UCR Starlight Data

(a) 1-NN
True \ Est    1        2        3
1             0.86     0.003    0.13
2             0.0      0.99     0.008
3             0.031    0.002    0.97

(b) PWC
True \ Est    1        2        3
1             0.82     0.003    0.18
2             0.00     0.97     0.035
3             0.16     0.004    0.84

(c) RF
True \ Est    1        2        3
1             0.91     0.003    0.082
2             0.0      0.99     0.005
3             0.004    0.0007   0.99

Table B.2: Confusion Matrix
for Classifiers Based on LINEAR Starlight Data

(a) 1-NN
True \ Est        Algol   Contact Binary   Delta Scuti   No Variation   RRab   RRc
Algol             0.76    0.20             0.0           0.0            0.0    0.04
Contact Binary    0.03    0.95             0.005         0.005          0.01   0.0
Delta Scuti       0.0     0.0              0.88          0.12           0.0    0.0
No Variation      0.0     0.0              0.01          0.99           0.0    0.0
RRab              0.0     0.005            0.0           0.0            0.95   0.045
RRc               0.0     0.03             0.0           0.0            0.14   0.83

(b) PWC
True \ Est        Algol   Contact Binary   Delta Scuti   No Variation   RRab   RRc
Algol             0.97    0.01             0.0           0.0            0.02   0.0
Contact Binary    0.0     0.99             0.0           0.0            0.0    0.01
Delta Scuti       0.0     0.0              0.94          0.06           0.0    0.0
No Variation      0.0     0.0              0.0           1.0            0.0    0.0
RRab              0.0     0.01             0.0           0.0            0.99   0.0
RRc               0.0     0.01             0.0           0.0            0.0    0.99

(c) RF
True \ Est        Algol   Contact Binary   Delta Scuti   No Variation   RRab   RRc
Algol             0.93    0.07             0.0           0.0            0.0    0.04
Contact Binary    0.0     0.99             0.0           0.0            0.0    0.0
Delta Scuti       0.0     0.0              0.94          0.0            0.0    0.06
No Variation      0.0     0.02             0.0           0.98           0.0    0.0
RRab              0.0     0.0              0.0           0.0            1.0    0.05
RRc               0.0     0.0              0.0           0.0            0.0    1.0

Appendix C
Chapter 7: Additional Performance Comparison

Table C.1: LINEAR confusion matrix, LML-MV - LML Misclassification Rate
True \ Est           RR Lyr (ab)   Delta Scu / SX Phe   Algol   RR Lyr (c)   Contact Binary   Missed
RR Lyr (ab)          -7            0                    1       9            -1               -2
Delta Scu / SX Phe   1             1                    0       0            -2               0
Algol                2             0                    -4      1            1                0
RR Lyr (c)           -1            0                    0       2            -1               0
Contact Binary       0             0                    2       -8           5                1

L-MV - LM3
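The row-normalized confusion matrices above report, for each true class, the fraction of samples assigned to each estimated class. A minimal sketch of how such fractions and per-class misclassification rates are derived from raw counts; the `counts` values here are illustrative only, not the dissertation's data:

```python
# Raw per-class counts (illustrative values, not taken from the tables above).
# Row i holds the counts of true-class-i samples assigned to each estimated class.
counts = [
    [86, 3, 11],   # true class 1
    [0, 99, 1],    # true class 2
    [3, 1, 96],    # true class 3
]

# Row-normalize: entry (i, j) becomes the fraction of true-class-i samples
# labeled as class j, the form reported in Tables B.1 and B.2.
fractions = [[c / sum(row) for c in row] for row in counts]

# Per-class misclassification rate is one minus the diagonal fraction.
misclass_rate = [1.0 - fractions[i][i] for i in range(len(fractions))]
```

Note that each row of `fractions` sums to one by construction, which is why the off-diagonal entries in a row bound that class's misclassification rate.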