Everything we'd like to do with LSST data, but we don't know (yet) how

Astroinformatics, Proceedings IAU Symposium No. 325, 2017, M. Brescia, eds.
Željko Ivezić, Andrew J. Connolly & Mario Jurić
Department of Astronomy, University of Washington, Box 351580, Seattle, WA 98195-1580, USA
email: [email protected]
Abstract.
The Large Synoptic Survey Telescope (LSST), the next-generation optical imaging survey sited at Cerro Pachón in Chile, will provide an unprecedented database of astronomical measurements. The LSST design, with an 8.4 m (6.7 m effective) primary mirror, a 9.6 sq. deg. field of view, and a 3.2 Gigapixel camera, will allow about 10,000 sq. deg. of sky to be covered twice per night, every three to four nights on average, with a typical 5-sigma depth for point sources of r = 24.5 (AB). With over 800 observations in ugrizy bands over a 10-year period, these data will enable a deep stack reaching r = 27.5 (about 5 magnitudes deeper than SDSS) as well as faint time-domain astronomy. The measured properties of newly discovered and known astrometric and photometric transients will be publicly reported within 60 sec after observation. The vast database of about 30 trillion observations of 40 billion objects will be mined for the unexpected and used for precision experiments in astrophysics. In addition to a brief introduction to LSST, we discuss a number of astro-statistical challenges that need to be overcome to extract maximum information and science results from the LSST dataset.

Keywords. surveys, galaxies, stars: statistics
1. Introduction
The last decade has seen fascinating observational progress in optical imaging surveys. The SDSS dataset is currently being greatly extended by ongoing surveys such as Pan-STARRS (Kaiser et al. 2010) and the Dark Energy Survey (Flaugher 2008). The Large Synoptic Survey Telescope (LSST) is the most ambitious survey currently planned in the visible band (for a brief overview, see Ivezić et al. 2008a). The unparalleled LSST survey power is due to its large étendue (see Figure 1).

The goals of the LSST are driven by four key science themes: probing dark energy and dark matter, taking an inventory of the Solar System, exploring the transient optical sky, and mapping the Milky Way. The LSST will be a large, wide-field ground-based system designed to obtain multiple images covering the sky visible from Cerro Pachón in northern Chile. The system, with an 8.4 m (6.7 m effective) primary mirror, a 9.6 deg² field of view, and a 3.2 Gigapixel camera, will allow, on average, about 10,000 deg² of sky to be covered using pairs of 15-second exposures in two photometric bands every three nights, with a typical 5σ depth for point sources of r ∼ 24.5. The system is designed to yield high image quality as well as superb astrometric and photometric accuracy†. The survey area will cover 30,000 deg² with δ < +34.5°, and will be imaged multiple times in six bands, ugrizy, covering the wavelength range 320–1050 nm. About 90% of the observing time will be devoted to a deep-wide-fast survey mode which will observe an 18,000 deg² region over 800 times (summed over all six bands) during the anticipated

† For detailed specifications, please see the LSST Overview Paper, Ivezić et al. (2008a), and the LSST Science Requirements Document (LSST Science Collaboration 2011).
Figure 1.
A comparison of the primary mirror size and the field-of-view size for the LSST and Gemini South telescopes. The product of the primary mirror size and the field-of-view size, the so-called étendue (or grasp), a characteristic that determines the speed at which a system can survey a given sky area to a given flux limit, is much larger for LSST. Figure courtesy of Chuck Claver.
10 years of operations, and yield a coadded map to r ∼ 27.5.
2. LSST Data Analysis Challenges
The LSST project will deliver data products that will enable a large number of cutting-edge science programs (Jurić et al. 2016). Nevertheless, depending on the topic, the path from LSST data products to science results and journal papers may sometimes require additional challenging analysis work. These challenges, representative of the era of Big Data, stem from:
• Large data volumes (petabytes)
• Large numbers of objects (billions)
• Highly multi-dimensional spaces (thousands)
• Unknown statistical distributions
• Time-series data (irregular sampling)
• Heteroscedastic errors, truncated, censored and missing data
• Unreliable quantities (e.g. unknown systematics and random errors)
“Everything we’d like to do with LSST data, but we don’t know (yet) how” is a catchy title but somewhat inaccurate. First, we most certainly do not include here “everything”, and second, we and our LSST colleagues already have at least some ideas for how to approach most of the problems discussed below. We hope that this contribution will help motivate others to join us in this thinking, and to engage in the work needed to maximize LSST’s scientific yields.

To begin and stimulate this conversation, we have selected a few topics where substantial preparatory work is needed to optimally analyze datasets at the LSST scale. These are:
(a) Interpreting spectral energy distributions (SEDs)
(b) Identifying moving objects
(c) Characterizing and classifying variable stars
(d) Understanding systematic measurement uncertainties
(e) Characterizing astrophysical simulations and astrophysical systematics
(f) Devising new or enhanced algorithms to process LSST data.
We emphasize that many other members of the LSST Science Collaborations contributed to the formulation of this, by all means, incomplete list. In the remainder of this section, we discuss these topics in a bit more detail.

2.1. Interpretation of spectral energy distributions (SEDs)
Efficient and robust interpretation of time-resolved multi-band photometry for “billions and billions” of objects is bound to yield unprecedented science results. The combination of required measurement precision and relatively wide bandpasses will require careful interpretation of LSST data.

A broad-band photometric system, such as LSST, aims to deliver the calibrated in-band flux

F_b = ∫ F_ν(λ) φ_b(λ) dλ,   (2.1)

where F_ν(λ) is the specific flux of an object at the top of the atmosphere and φ_b(λ) is the normalized system response for the given band,

φ_b(λ) = λ⁻¹ S_b(λ) / ∫ λ⁻¹ S_b(λ) dλ   (2.2)

(the λ⁻¹ term reflects the fact that CCDs are photon-counting devices). Here, S_b(λ) is the overall atmosphere + system throughput,

S_b(λ) = S_b^sys(λ) × S_b^atm(λ).   (2.3)

Numerous science programs can be cast as constraining the possible forms of the true SED F_ν(λ) given the measured broad-band fluxes F_b and the normalized system response φ_b(λ), with b = (u, g, r, i, z, y). Because of the integration over broad bandpasses, forward modeling using a trial SED (either empirical or model based) is typically superior to “correcting data” (fluxes, positions, sizes). Examples of such programs, where SEDs presumably depend on relevant astrophysical parameters, include:
(a) photo-z algorithms: the observed galaxy and quasar SEDs depend on the redshift of an intrinsic SED (due to expansion of the universe, source evolution, and intergalactic extinction; see e.g. Bolzonella et al. 2000);
(b) photometric parallax for stars, where measured colors can be used to constrain the effective temperature and luminosity (e.g. Jurić et al. 2008);
(c) photometric metallicity for stars (trained using spectroscopic metallicities, see Ivezić et al. 2008b); and
(d) interstellar extinction along the line of sight for stars in the Milky Way disk (see, e.g., Berry et al. 2012).

There are a number of open issues that are being worked on by the community:
• What are the relative advantages and disadvantages of machine learning methods compared to methods based on fitting SED templates (both empirical and simulated)? How can we incorporate ancillary data (and priors) within photo-z methods, for example utilizing the angular cross-correlation of photometric and spectroscopic samples of galaxies (Newman 2008)?
• What are the impacts of heteroscedastic noise, priors, and truncated and censored data?
• How much will “per-visit processing” of LSST data help (due to varying bandpasses φ_b(λ) caused by the unavoidable variations in S_b^atm(λ))?
• What is the best way to handle posterior probability density functions (pdfs)? How much is gained compared to simple (e.g. maximum likelihood) point estimates? What are the optimal compression algorithms for pdfs?
• How should parameter covariances be handled (the same question applies to essentially every topic discussed below)?
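As an illustration of the forward-modeling approach behind Eqs. (2.1)–(2.3), and of the point-estimate versus full-pdf question above, the following sketch synthesizes in-band fluxes for a toy redshifted SED and constrains the redshift on a grid. The Gaussian bandpasses, the step-function SED, and all numerical values are illustrative assumptions, not LSST throughputs:

```python
import numpy as np

rng = np.random.default_rng(3)

lam = np.linspace(320.0, 1050.0, 2000)            # wavelength grid [nm]
dlam = lam[1] - lam[0]

def response(center, width=40.0):
    """Normalized photon-counting response phi_b (Eq. 2.2) for a toy
    Gaussian throughput S_b (the product in Eq. 2.3, folded into one factor)."""
    S_b = np.exp(-0.5 * ((lam - center) / width) ** 2)
    w = S_b / lam                                  # 1/lambda: CCDs count photons
    return w / np.sum(w * dlam)

# Rough stand-ins for the ugrizy band centers (illustrative only).
bands = [response(c) for c in (360, 480, 620, 750, 870, 1000)]

def sed(z):
    """Toy rest-frame SED with a spectral break at 400 nm, redshifted by (1+z)."""
    lam_rest = lam / (1.0 + z)
    return np.where(lam_rest < 400.0, 0.2, 1.0)

def in_band_fluxes(z):
    # F_b = integral of F_nu(lambda) * phi_b(lambda) dlambda (Eq. 2.1)
    return np.array([np.sum(sed(z) * phi * dlam) for phi in bands])

# Simulate one noisy observation and fit redshift on a grid (flat prior).
z_true, sigma = 0.5, 0.02
obs = in_band_fluxes(z_true) + rng.normal(0.0, sigma, len(bands))

z_grid = np.linspace(0.0, 1.5, 301)
chi2 = np.array([np.sum((in_band_fluxes(z) - obs) ** 2) for z in z_grid]) / sigma**2
post = np.exp(-0.5 * (chi2 - chi2.min()))
post /= post.sum()                                 # full posterior pdf over z

z_ml = z_grid[np.argmin(chi2)]                     # maximum-likelihood point estimate
z_mean = np.sum(z_grid * post)                     # one summary of the full pdf
print(z_ml, z_mean)
```

Keeping the normalized array `post` (rather than only `z_ml`) is a minimal example of retaining the full pdf; how to compress billions of such pdfs efficiently is exactly the open question raised above.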
2.2. Moving objects
The catalogs generated by LSST will increase the known number of small bodies in the Solar System by a factor of 10–100, among all populations (Jones et al. 2016). The median number of observations for Main Belt asteroids will be on the order of 200–300, allowing sparse lightcurve inversion to determine rotation periods, spin axes, and shape information. The current strawman for the LSST survey strategy is to obtain two visits of the same field per night (each “visit” being a pair of back-to-back 15 s exposures), separated by about 30 minutes, and covering the entire observable sky every 3–4 days throughout the observing season.

The main reason for two observations per night is to help associate observations of the same moving object from different nights, as follows. The typical distance between two nearby asteroids on the Ecliptic, at the faint fluxes probed by LSST, is a few arcminutes (counts are dominated by Main Belt asteroids). Typical asteroid motion over several days is much larger (of the order of a degree or more) and thus, without additional information, detections of individual objects are “scrambled”. However, with two detections per night, the motion vector can be estimated. The motion vector makes the linking problem much easier because positions from one night can be approximately extrapolated to future (or past) nights.

There are several interesting open questions regarding moving objects:
• Cadence optimization: are two visits per night really needed? Would a substantial increase in computing power solve the association problem with just a single detection per night?
• How robust and efficient would a full Bayesian approach be for characterizing the orbits of asteroids (see, e.g., Virtanen et al. 2001)?
• How computationally hard is it to deploy the shift-and-coadd method for KBOs and more distant objects on an LSST-scale dataset?
• What are the most robust and efficient methods for sparse lightcurve inversion of several million asteroids?
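The intra-night linking idea described above can be sketched in a few lines: two detections separated by about 30 minutes yield a motion vector, which is then extrapolated across nights and matched against candidate detections. All positions, rates, and the matching outcome below are made-up illustrative numbers (degrees and degrees/day), not real asteroid data:

```python
import numpy as np

p1 = np.array([10.000, 5.000])     # first detection (RA, Dec) at t = 0 d
p2 = np.array([10.005, 5.001])     # second detection ~30 minutes later
dt_pair = 30.0 / (60.0 * 24.0)     # 30 minutes expressed in days

v = (p2 - p1) / dt_pair            # estimated motion vector [deg/day]

# Extrapolate three nights ahead and compare with next-night detections.
t_next = 3.0
predicted = p2 + v * (t_next - dt_pair)

candidates = np.array([
    [10.72, 5.15],                 # the same object, three nights later
    [11.90, 4.20],                 # an unrelated asteroid, over a degree away
])
dist = np.linalg.norm(candidates - predicted, axis=1)
best = int(np.argmin(dist))
print(best, dist[best])            # candidate 0 matches well; candidate 1 does not
```

Without the motion vector (i.e., with single detections per night), every few-arcminute neighbor would be an equally plausible match, which is the combinatorial explosion the two-visit cadence is designed to avoid.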
2.3. Variable stars
Early in the survey, LSST will be discovering about 100,000 variable stars per night at high Galactic latitudes (Ridgway et al. 2014), and probably many more at low latitudes (though the forecast is less certain). The total number of variable stars to be discovered by LSST is of the order of several hundred million (the total number of detected and measured stars will be about 20 billion). In addition, about 1000 new supernovae are expected to be discovered every observing night. A number of statistical questions need to be answered for the full exploitation of this dataset:
• How to best distinguish regular (periodic) from irregular variability when the data are sampled irregularly and when the variability may be wavelength dependent?
• How to distinguish short from long variability timescales?
• What are the best methods for the robust detection of variability, and for anomaly detection†? Recent developments in compressed sensing and deep learning have the potential to revolutionize the analysis of variability and transient detection. By exploiting the sparseness of the data and a careful choice of the models that might be fit to these data, it may be possible to characterize and classify sources in a way that is both flexible and robust to noise.
• What are the best methods for characterization and classification of a broad range of variability (especially in the case of sparse data early in the survey)? How do machine learning methods compare to light curve template-based methods? Are there metrics that will enable a general classification scheme for identifying sources that might need follow-up observations?
• Is it possible to further optimize the cadence to enhance discoveries and characterization of variable stars?
• What is the impact of heteroscedastic noise, astrophysical priors, and truncated and censored data?
• Can light curve and object characterization and classification be done directly in the database?
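As a minimal illustration of period finding with irregular sampling, the sketch below applies a simple phase-dispersion statistic (in the spirit of phase dispersion minimization) to a toy, irregularly sampled sinusoidal light curve. The cadence, noise level, and trial grid are all illustrative assumptions; real LSST light curves will be sparser, multi-band, and heteroscedastic, which is precisely why better methods are needed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy irregularly sampled light curve: a sinusoid plus Gaussian noise.
true_period = 0.731                            # days
t = np.sort(rng.uniform(0.0, 60.0, 120))       # irregular observation times [d]
mag = np.sin(2 * np.pi * t / true_period) + rng.normal(0.0, 0.1, t.size)

def phase_dispersion(period, t, y, nbins=10):
    """Mean within-bin variance of the light curve folded at a trial period;
    small values indicate a coherent phased curve."""
    phase = (t / period) % 1.0
    bins = np.minimum((phase * nbins).astype(int), nbins - 1)
    disp = 0.0
    for b in range(nbins):
        sel = y[bins == b]
        if sel.size > 1:
            disp += sel.var() * sel.size
    return disp / y.size

# Scan a grid of trial periods and pick the one minimizing the dispersion.
trial = np.linspace(0.5, 1.0, 2001)
scores = np.array([phase_dispersion(p, t, mag) for p in trial])
best_period = trial[np.argmin(scores)]
print(best_period)
```

The grid spacing here is chosen finer than P²/T (the natural period resolution for a baseline T), a detail that matters at LSST scale where millions of such scans would be run.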
2.4. Systematic measurement uncertainties
Due to the large number of objects in LSST samples, many science programs, including cosmology, will be sensitive to systematic errors. In many cases the volume of the available data will mean that systematics are the dominant source of uncertainty (that is, given billions of objects measured a thousand times, how do we know that sqrt(N) error scaling will still hold in this regime?). The primary goals include:
(a) ensuring that the astrometry can be measured with statistical and systematic errors at the milliarcsec level,
(b) ensuring that the photometry can be measured with statistical and systematic errors at the millimag level,
(c) measuring galaxy shapes (e.g., for use in cosmic shear analysis) with the PSF known across the focal plane to a level where the autocorrelation of PSF residuals is smaller than the level specified in the LSST Science Requirements Document.
These effects will need to be quantified as functions of position on the sky, position on the focal plane, observing conditions (e.g., atmospheric seeing, sky brightness), and object properties (e.g., brightness, colors, size). Some of the open questions include:
• How can the impact of unknown SEDs be quantified?
• What is the impact of the atmosphere (due to variable seeing and transmissivity, differential chromatic refraction, and intrinsic stochasticity)?
• How can we robustly quantify both multiplicative and additive errors in galaxy shear measurements?
• How can we control systematic errors in photometric redshifts?
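The sqrt(N) question above can be made concrete with a two-component error model: a per-object random error that averages down as 1/sqrt(N), and a calibration systematic shared by all objects that does not average down at all. The specific error values below are illustrative, not LSST requirements:

```python
import numpy as np

sigma_rand = 0.02      # per-object random photometric error (20 mmag, assumed)
sigma_sys = 0.001      # shared calibration systematic (1 mmag, assumed)

# Total error on a sample mean: random part shrinks with N, systematic does not.
totals = []
for n in [1e2, 1e4, 1e6, 1e9]:
    total = np.hypot(sigma_rand / np.sqrt(n), sigma_sys)
    totals.append(total)
    print(f"N = {n:.0e}: total error on the mean = {total:.2e}")

# The crossover occurs near N ~ (sigma_rand / sigma_sys)**2 = 400 objects;
# beyond that, adding objects barely helps and systematics dominate.
```

With billions of objects, LSST is deep in the systematics-dominated regime for this toy model, which is why the calibration goals listed above are stated as systematic-error requirements rather than statistical ones.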
2.5. Astrophysical simulations and astrophysical systematics
The expected precision of the LSST measurements and their resulting constraints on cosmological and astrophysical models require the development of simulation and modelling tools of equal or better precision. These tools will need to provide predictions for what the LSST will observe (in order to define effective survey strategies for the LSST), interpret the observations in the context of physically motivated models, and generate multiple realizations of simulated data.

† Extensive tools for doing both template-matched and “model-independent” detection of variability have been recently developed in the context of LIGO.

• How do we support the generation of large-scale simulations? The computational resources required to generate cosmological simulations, and in particular series of simulations for characterizing the covariance of cosmological models, are large and could exceed the resources available to individual investigators.
• How do we share simulations in a manner similar to the availability of observational data? Often the sizes of useful simulated datasets must exceed those of the observational datasets and are already approaching the PB scale. Transferring the generated simulations or mock catalogs from supercomputing centers to where they might be analyzed will stress academic network capacities.
• What is the impact of baryonic effects on dark matter halo profiles? The current generation of hydrodynamical simulations does not simulate large cosmological volumes. Approximations, where lower-resolution or dark-matter-only models are used to identify regions of interest in the simulation that are then re-simulated at higher resolution, can lead to biases in any derived correlations, as they are not representative volumes of the universe.
• What are the main feedback mechanisms in galaxy formation and what is the best way to handle nonlinear galaxy bias?
• How can we best address intrinsic alignments of galaxy shapes with the density field?
• How can we best extract the information about the evolution of the Milky Way galaxy using LSST measurements of 20 billion stars?
• How can we best extract the information about the evolution of the Solar System using LSST measurements of a few million asteroids?
2.6. LSST System Enhancements and New Algorithms
LSST is an automated facility that will deliver not only raw images, but also fully reduced data products (calibrated single-epoch images, multiple flavors of co-adds, and a variety of catalogs). Its cadence will be optimized to enable a balanced science return across the four key science themes (Section 1). Its data products have been designed to enable the derivation of a large fraction of those results without the need for end users to fully understand the details of the LSST instrument, data reduction, and algorithms, or to begin from raw pixel data.

To make this possible, the LSST project is making a major investment in computing infrastructure, software, and algorithm development. Yet it is quite clear that more and better are always possible; even marginal improvements in performance (of both hardware and software) could yield significant additional science returns. While some of the open issues listed below are already being addressed by groups both within and outside the LSST construction project, substantial further research could be done. Again, these are simply the most obvious examples; the list is by no means complete.
• Observing strategy (cadence) optimization can yield improvements in total open-shutter time for the survey, but can also improve the utility of angular and temporal sampling functions and dithering patterns (Delgado et al. 2014). It is, therefore, important to develop a scheduling algorithm that can efficiently address potential evolution of the LSST observing system and evolution of its science drivers. The LSST Project is developing a scheduling algorithm that meets the survey requirements, but the complexity of the problem and the potential return on investment† argue for further research.
• LSST does not plan to deliver specialized crowded-field reductions or catalogs; images of crowded regions of the Milky Way will be processed with the same code utilized elsewhere, though perhaps with different priors used in the object detection and deblending stages (i.e., to a very good approximation, every object observed towards the Galactic center is a star). A purpose-built (multi-epoch capable) crowded-field code capable of dealing with LSST source densities and data volumes would tremendously enhance the scientific return of LSST’s Galactic dataset.
• No LSST data products have been explicitly designed to enable the detection and characterization of diffuse (e.g., ISM) or extremely low surface brightness structures (e.g., the LSB galaxies recently discovered by projects such as Dragonfly). Developing specialized codes to enable such processing may add significant value to the LSST dataset.
• Complex galaxy models (e.g., tidal tails of merged galaxies) will not be fit by the standard LSST pipelines. Such a tool would greatly help in understanding the gravitational potential around judiciously selected galaxies.
• Forward modeling of images on a per-visit basis (termed Multifit in the LSST Data Management context) is superior to the analysis of co-added images (because of varying observing conditions) and will be done by LSST. A particularly interesting problem is that of simultaneous forward modelling of data from different datasets (e.g., LSST and WFIRST). While there is substantial ongoing development (e.g. the Tractor code, see Lang et al. 2016), including within the LSST Project, many statistical and other issues‡ remain open and will require substantial further research to find the optimal approach.
• At the required precision level, the LSST point spread function (PSF) will depend on time, instrument state, source position, and source color (more precisely, on in-band SED shape); see Meyers & Burchat (2015).
Robust and precise determination of the PSF will therefore be a rather non-trivial undertaking. The LSST project is required to characterize the PSF to the degree described in the LSST Science Requirements Document, but further improvements may be possible.
• A shift-and-stack algorithm (for co-adding images along arbitrary space-time trajectories) that could be efficiently deployed for large datasets would likely have a major impact on outer Solar System science. The LSST Data Management (DM) system will not deliver shift-and-stack pipelines or data products, but these could be easily built on top of the open-source code LSST DM will deliver.
• Image differencing will be used to detect transient sources in the LSST data stream. In order to control the false-positive rate, new sophisticated algorithms will have to be developed to account for varying observing conditions (e.g., the treatment of differential chromatic refraction effects due to varying airmass, as well as color-dependent PSFs).
• The SEDs and other properties of newly discovered transients will be poorly known initially. It is not yet clear what characterization and classification algorithms would be best for separating the most interesting transients that require prompt follow-up from the background of much more numerous transients which can be analyzed on much longer timescales without significant loss of science outcome.
• While there are well-developed methods for the classification of light curves of variable stars (e.g. Richards et al. 2011), transient classification with sparse data is a much harder problem.

† For example, just a 1% effective improvement in LSST scheduling is roughly equivalent to ∼$4M in operational cost.
‡ For example, blended objects present major algorithmic challenges, and a discussion of their treatment, which is currently an open research area, would warrant a paper on its own.
• Jointly processing data from LSST and other surveys (e.g., Euclid or WFIRST) would certainly result in a superior dataset compared to the one produced individually by either of these projects (for details see, e.g., Jain et al. 2015). It is not clear, however, how exactly to implement these ideas in practice, especially given that the survey overlap will be significantly truncated (either by position on the sky, e.g., for WFIRST, or by brightness, e.g., for Gaia).
• Finally, LSST data processing will be performed in the context of a relatively traditional HPC-like computing facility utilizing proven, low-risk technologies (e.g., HTCondor, Pegasus). Similarly, LSST catalog data will be served to the users by way of relational databases (albeit of an advanced, distributed kind). Research into alternative models of processing (e.g., making use of the public cloud or workflow systems like Apache Spark) or data storage and serving (e.g., NoSQL databases, or next-generation experiments such as SciSQL) would be of great interest. If successful, these efforts could significantly enhance the ability of the community to perform affordable large-scale catalog computations or even image reprocessings.

A number of the use cases above may be possible for users to run at the LSST Data Access Centers. LSST has reserved approximately 10% of its total capacity to enable end-user analyses and the generation of added-value data products.

Furthermore, many of the use cases would be best tackled by enhancing existing LSST pipelines or building completely new functionality on top of the one already provided by LSST. All of the source code for the LSST pipelines will be publicly available, enabling these kinds of endeavors.

Finally, LSST Operations have been built with the assumption that, in addition to the work within the facility, the community will make new discoveries and breakthroughs in areas of algorithms and data products.
Such enhancements, developed by the community, can be incorporated into standard LSST processing, thus becoming a part of the official LSST alert streams and/or data releases.
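As one example of the community-built functionality discussed above, the shift-and-stack idea can be sketched in a few lines: co-add a sequence of noisy frames along a trial space-time trajectory so that a moving source too faint to detect in any single exposure accumulates signal in the stack. The image sizes, motion rates, and fluxes below are illustrative; a real implementation would search a large grid of trial trajectories over LSST-scale image sets:

```python
import numpy as np

rng = np.random.default_rng(1)

n_epochs, size = 20, 64
vx, vy = 1.0, 0.5                 # assumed source motion [pixels/epoch]
x0, y0 = 10, 20                   # starting pixel position of the source
flux, noise = 1.0, 1.0            # per-epoch SNR of 1: invisible per frame

# Simulate noisy frames containing a faint moving point source.
images = rng.normal(0.0, noise, (n_epochs, size, size))
for k in range(n_epochs):
    images[k, y0 + int(vy * k), x0 + int(vx * k)] += flux

def shift_and_stack(images, vx, vy):
    """Co-add frames after undoing a trial linear motion (integer pixel shifts)."""
    stack = np.zeros_like(images[0])
    for k, im in enumerate(images):
        stack += np.roll(np.roll(im, -int(vy * k), axis=0), -int(vx * k), axis=1)
    return stack / len(images)

stack = shift_and_stack(images, vx, vy)
# Along the correct trajectory the source SNR grows roughly as flux*sqrt(n)/noise,
# while a wrong trajectory spreads the signal over many pixels.
print(stack[y0, x0])
```

The computational challenge noted in Section 2.2 comes from the size of the trajectory grid, not from any single stack, which is why an efficient large-scale implementation would be so valuable for outer Solar System science.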
3. Discussion
Due to the size and complexity of the LSST dataset, and the susceptibility of many of its major science programs to systematics in both measured quantities and astrophysical predictions, substantial preparatory work is required to enable the full exploitation of the LSST dataset. The bottleneck for science will not be the size of the dataset but instead our ability to extract useful and reliable information from the data.

Here we have summarized some of the most obvious research directions required to enhance the LSST science outcome. The main anticipated work areas include:
• advanced astronomical digital image processing,
• statistical modeling and analysis,
• data mining and machine learning,
• high performance computing,
• astrophysical simulations, and
• multi-dimensional and temporal data visualization.
LSST data analysis and the development of the fields of astro-informatics & astro-statistics will be closely intertwined. This synergy will open numerous opportunities for people with “Big Data” skills. Prospective LSST science users, across all disciplines, should collaborate and coordinate. By working jointly we can make the LSST great, and maximize the tremendous potential and science return of its dataset.

Acknowledgements
This material is based upon work supported in part by the National Science Foundation through Cooperative Agreement 1258333 managed by the Association of Universities for Research in Astronomy (AURA), and the Department of Energy under Contract No. DE-AC02-76SF00515 with the SLAC National Accelerator Laboratory. Additional LSST funding comes from private donations, grants to universities, and in-kind support from LSSTC Institutional Members. We thank Gregory Dubois-Felsmann for his careful reading and excellent comments.
References
Berry, M., Ivezić, Ž., Sesar, B., et al. 2012, Astrophysical Journal, 757, 166
Bolzonella, M., Miralles, J. M. & Pelló, R. 2000, Astronomy & Astrophysics, 363, 476
Connolly, A. J., Angeli, G. Z., Chandrasekharan, S., et al. 2014, Proceedings of the SPIE, Volume 9150, id. 915014
Delgado, F., Saha, A., Chandrasekharan, S., et al. 2014, Proceedings of the SPIE, Volume 9150, id. 915015
DESC; Dark Energy Science Collaboration (DESC) Science Roadmap, 2015, http://lsst-desc.org/sites/default/files/DESC_SRM_V1.pdf
Eyer, L., Evans, D. W., Mowlavi, N., et al. 2015, ArXiv:
Flaugher, B. 2008, in A Decade of Dark Energy: Spring Symposium, Proceedings of the conferences held May 5-8, 2008 in Baltimore, Maryland (USA), ed. N. Pirzkal & H. Ferguson
Ivezić, Ž., Tyson, J.A., Acosta, E., et al. 2008a, ArXiv:
Ivezić, Ž., Sesar, B., Jurić, M., et al. 2008b, Astrophysical Journal, 684, 287
Jain, B., Spergel, D., Bean, R., et al. 2015, ArXiv:
Jones, R.L., Jurić, M. & Ivezić, Ž. 2016, Proceedings of the IAU, 318, 282, ArXiv:
Jurić, M., Ivezić, Ž., Brooks, A., et al. 2008, Astrophysical Journal, 673, 864
Jurić, M., Kantor, J., Lim, K-T., et al. 2016, ASP Conf. Ser., in press, ArXiv:
Kaiser, N., Burgett, W., Chambers, K., et al. 2010,