Pavlos Protopapas
Harvard University
Publication
Featured research published by Pavlos Protopapas.
Very Large Data Bases | 2009
Eamonn J. Keogh; Li Wei; Xiaopeng Xi; Michail Vlachos; Sang-Hee Lee; Pavlos Protopapas
Shape matching and indexing is an important topic in its own right, and is a fundamental subroutine in most shape data mining algorithms. Given the ubiquity of shape, shape matching is an important problem with applications in domains as diverse as biometrics, industry, medicine, zoology and anthropology. The distance/similarity measure used for shape matching must be invariant to many distortions, including scale, offset, noise, articulation, partial occlusion, etc. Most of these distortions are relatively easy to handle, either in the representation of the data or in the similarity measure used. However, rotation invariance is noted in the literature as being an especially difficult challenge. Current approaches typically try to achieve rotation invariance in the representation of the data, at the expense of discrimination ability, or in the distance measure, at the expense of efficiency. In this work, we show that we can take the slow but accurate approaches and dramatically speed them up. On real-world problems our technique can take current approaches and make them four orders of magnitude faster without false dismissals. Moreover, our technique can be used with any of the dozens of existing shape representations and with all the most popular distance measures, including Euclidean distance, dynamic time warping and Longest Common Subsequence. We further show that our indexing technique can be used to index star light curves, an important type of astronomical data, without modification.
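To make the "slow but accurate" baseline concrete, the sketch below computes a rotation-invariant Euclidean distance by brute force: it tries every circular shift of one series and keeps the smallest distance. It assumes each shape has already been converted to a fixed-length 1-D series (for example a profile traced around the outline), so that rotating the shape corresponds to circularly shifting the series. The function names are hypothetical, and this is only the exhaustive baseline, not the indexing technique the abstract describes.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rotation_invariant_distance(query, candidate):
    """Return (distance, shift) minimizing Euclidean distance over all
    circular shifts of `candidate`. Both inputs are lists of floats.

    Cost is O(n^2) per pair, which is why exhaustive search scales poorly.
    """
    if len(query) != len(candidate):
        raise ValueError("series must have equal length")
    best, best_shift = math.inf, 0
    for shift in range(len(candidate)):
        # Circularly shift the candidate by `shift` positions.
        rotated = candidate[shift:] + candidate[:shift]
        d = euclidean(query, rotated)
        if d < best:
            best, best_shift = d, shift
    return best, best_shift


# Toy example: the second series is the first one, circularly shifted.
shape = [0.0, 1.0, 3.0, 1.0, 0.0, 0.5]
rotated_shape = shape[2:] + shape[:2]
dist, shift = rotation_invariant_distance(shape, rotated_shape)
```

Because the candidate is an exact rotation of the query, the minimum distance is zero at the shift that undoes the rotation. Swapping `euclidean` for another measure leaves the shift loop unchanged.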
Machine Learning | 2009
Umaa Rebbapragada; Pavlos Protopapas; Carla E. Brodley; Charles Alcock
Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data, that outputs a ranked list of both global and local anomalies. It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. Our method is able to scale to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets. We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and existing time series anomaly detection methods for unphased data, and show that PCAD’s reported anomalies are comparable to or better than all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.
Monthly Notices of the Royal Astronomical Society | 2006
Pavlos Protopapas; J. M. Giammarco; L. Faccioli; Mitchell F. Struble; Rahul Surendra Dave; Charles Alcock
We present a methodology to discover outliers in catalogues of periodic light curves. We use a cross-correlation as the measure of ‘similarity’ between two individual light curves, and then classify light curves with lowest average ‘similarity’ as outliers. We performed the analysis on catalogues of periodic variable stars of known type from the MACHO and OGLE projects. This analysis was carried out in Fourier space and we established that our method correctly identifies light curves that do not belong to those catalogues as outliers. We show how an approximation to this method, carried out in real space, can scale to large data sets that will be available in the near future such as those anticipated from the Panoramic Survey Telescope & Rapid Response System (Pan-STARRS) and Large Synoptic Survey Telescope (LSST).
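The ranking idea in this abstract can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' pipeline: it assumes every light curve has been folded on its period and resampled onto a common phase grid, it uses the peak of the normalized circular cross-correlation as the similarity, and it evaluates that correlation by brute force in real space rather than in Fourier space. All function names are hypothetical.

```python
import math


def normalize(series):
    """Subtract the mean and scale to unit norm."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0:
        raise ValueError("constant series has no defined correlation")
    return [x / norm for x in centered]


def max_cross_correlation(a, b):
    """Peak circular cross-correlation over all lags, in [-1, 1]."""
    a, b = normalize(a), normalize(b)
    n = len(a)
    return max(
        sum(a[i] * b[(i + lag) % n] for i in range(n))
        for lag in range(n)
    )


def rank_outliers(curves):
    """Return (index, average similarity) pairs, least similar first."""
    scores = []
    for i, ci in enumerate(curves):
        sims = [max_cross_correlation(ci, cj)
                for j, cj in enumerate(curves) if j != i]
        scores.append((i, sum(sims) / len(sims)))
    return sorted(scores, key=lambda item: item[1])


# Toy catalogue: three phase-shifted sinusoids and one spike-shaped curve.
n = 32
sine = [math.sin(2 * math.pi * k / n) for k in range(n)]
shifted = sine[5:] + sine[:5]
shifted2 = sine[11:] + sine[:11]
spike = [0.0] * n
spike[3] = 1.0
curves = [sine, shifted, shifted2, spike]
ranking = rank_outliers(curves)
```

Taking the maximum over lags makes the similarity insensitive to phase offsets, so the three sinusoids score as near-identical and the spike-shaped curve comes first in `ranking`. The all-pairs loop is quadratic in the number of curves, which is the scaling problem the abstract's approximation is meant to address.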
Monthly Notices of the Royal Astronomical Society | 2007
J. M. Diego; Max Tegmark; Pavlos Protopapas; H. B. Sandvik
We describe a method to estimate the mass distribution of a gravitational lens and the position of the sources from combined strong and weak lensing data. The algorithm combines weak and strong lensing data in a unified way producing a solution which is valid in both the weak and the strong lensing regimes. The method is non-parametric, allowing the mass to be located anywhere in the field of view. We study how the solution depends on the choice of basis used to represent the mass distribution. We find that combining weak and strong lensing information has two major advantages: it alleviates the need for priors and/or regularization schemes for the intrinsic size of the background galaxies (this assumption was needed in previous strong lensing algorithms) and it reduces (although does not remove) biases in the recovered mass in the outer regions where the strong lensing data are less sensitive. The code is implemented into a software package called Weak & Strong Lensing Analysis Package (WSLAP) which is publicly available at http://darwin.cfa.harvard.edu/SLAP/.
The Astrophysical Journal | 2011
Dae-Won Kim; Pavlos Protopapas; Yong Ik Byun; Charles Alcock; Roni Khardon; M. Trichas
We present a new quasi-stellar object (QSO) selection algorithm using a Support Vector Machine (SVM), a supervised classification method, on a set of extracted time series features including period, amplitude, color, and autocorrelation value. We train a model that separates QSOs from variable stars, non-variable stars, and microlensing events using 58 known QSOs, 1629 variable stars, and 4288 non-variables in the MAssive Compact Halo Object (MACHO) database as a training set. To estimate the efficiency and the accuracy of the model, we perform a cross-validation test using the training set. The test shows that the model correctly identifies ~80% of known QSOs with a 25% false-positive rate. The majority of the false positives are Be stars. We applied the trained model to the MACHO Large Magellanic Cloud (LMC) data set, which consists of 40 million light curves, and found 1620 QSO candidates. During the selection none of the 33,242 known MACHO variables were misclassified as QSO candidates. In order to estimate the true false-positive rate, we crossmatched the candidates with astronomical catalogs including the Spitzer Surveying the Agents of a Galaxy's Evolution (SAGE) LMC catalog and a few X-ray catalogs. The results further suggest that the majority of the candidates, more than 70%, are QSOs.
The Astronomical Journal | 2010
Federica B. Bianco; Z.-W. Zhang; M. J. Lehner; S. Mondal; S.-K. King; J. Giammarco; M. Holman; N. K. Coehlo; Jen-Hung Wang; Charles Alcock; Tim Axelrod; Yong-Ik Byun; W. P. Chen; K. H. Cook; R. Dave; I. de Pater; Dong-Woo Kim; Typhoon Lee; H. C. Lin; Jack J. Lissauer; S. L. Marshall; Pavlos Protopapas; John A. Rice; Megan E. Schwamb; Shiang-Yu Wang; Chih Yi Wen
We have analyzed the first 3.75 years of data from the Taiwanese American Occultation Survey (TAOS). TAOS monitors bright stars to search for occultations by Kuiper Belt objects (KBOs). This data set comprises 5 × 10^5 star hours of multi-telescope photometric data taken at 4 or 5 Hz. No events consistent with KBO occultations were found in this data set. We compute the number of events expected for the Kuiper Belt formation and evolution models of Pan & Sari, Kenyon & Bromley, Benavidez & Campo Bagatin, and Fraser. A comparison with the upper limits we derive from our data constrains the parameter space of these models. This is the first detailed comparison of models of the KBO size distribution with data from an occultation survey. Our results suggest that the KBO population is composed of objects with low internal strength and that planetary migration played a role in the shaping of the size distribution.
The Astronomical Journal | 2007
Lorenzo Faccioli; Charles Alcock; Kem Holland Cook; Gabriel E. Prochter; Pavlos Protopapas; David Syphers
We present a new sample of 4634 eclipsing binary stars in the Large Magellanic Cloud (LMC), expanding on a previous sample of 611 objects, and a new sample of 1509 eclipsing binary stars in the Small Magellanic Cloud (SMC). Both samples were identified in the light curve database of the MACHO project. We perform a cross correlation with the OGLE-II LMC sample, finding 1236 matches. A cross correlation with the OGLE-II SMC sample finds 698 matches. We then compare the LMC subsamples corresponding to the center and the periphery of the LMC and find only minor differences between the two populations. These samples are sufficiently large and complete that statistical studies of the binary star populations are possible.
Monthly Notices of the Royal Astronomical Society | 2005
Pavlos Protopapas; Raul Jimenez; Charles Alcock
We present an algorithm that allows fast and efficient detection of transits, including planetary transits, from light-curves. The method is based on building an ensemble of fiducial models and compressing the data using the MOPED compression algorithm. We describe the method and demonstrate its efficiency by finding planet-like transits in simulated Panoramic Survey Telescope & Rapid Response System (Pan-STARRS) light-curves. We show that our method is independent of the size of the search space of transit parameters. In large sets of light-curves, we achieve speed-up factors of the order of 10^3 times over an optimized adaptive search in the χ^2 space. We discuss how the algorithm can be used in forthcoming large surveys like Pan-STARRS and the Large Synoptic Survey Telescope (LSST), and how it may be optimized for future space missions like Kepler and COROT where most of the processing must be done on board.
IEEE Computational Intelligence Magazine | 2014
Pablo Huijse; Pablo A. Estévez; Pavlos Protopapas; Jose C. Principe; Pablo Zegers
Time-domain astronomy (TDA) is facing a paradigm shift caused by the exponential growth of the sample size, data complexity and data generation rates of new astronomical sky surveys. For example, the Large Synoptic Survey Telescope (LSST), which will begin operations in northern Chile in 2022, will generate a nearly 150 Petabyte imaging dataset of the southern hemisphere sky. The LSST will stream data at rates of 2 Terabytes per hour, effectively capturing an unprecedented movie of the sky. The LSST is expected not only to improve our understanding of time-varying astrophysical objects, but also to reveal a plethora of yet unknown faint and fast-varying phenomena. To cope with a change of paradigm to data-driven astronomy, the fields of astroinformatics and astrostatistics have been created recently. The new data-oriented paradigms for astronomy combine statistics, data mining, knowledge discovery, machine learning and computational intelligence, in order to provide the automated and robust methods needed for the rapid detection and classification of known astrophysical objects as well as the unsupervised characterization of novel phenomena. In this article we present an overview of machine learning and computational intelligence applications to TDA. Future big data challenges and new lines of research in TDA, focusing on the LSST, are identified and discussed from the viewpoint of computational intelligence/machine learning. Interdisciplinary collaboration will be required to cope with the challenges posed by the deluge of astronomical data coming from the LSST.
Astronomy and Astrophysics | 2014
Dae-Won Kim; Pavlos Protopapas; Coryn A. L. Bailer-Jones; Yong-Ik Byun; Seo-Won Chang; J.-B. Marquette; Min-Su Shin
The EPOCH (EROS-2 periodic variable star classification using machine learning) project aims to detect periodic variable stars in the EROS-2 light curve database. In this paper, we present the first result of the classification of periodic variable stars in the EROS-2 LMC database. To classify these variables, we first built a training set by compiling known variables in the Large Magellanic Cloud area from the OGLE and MACHO surveys. We crossmatched these variables with the EROS-2 sources and extracted 22 variability features from 28 392 light curves of the corresponding EROS-2 sources. We then used the random forest method to classify the EROS-2 sources in the training set. We designed the model to separate not only