A scalable transient detection pipeline for the Australian SKA Pathfinder VAST survey
Sergio Pintaldi, Adam Stewart, Andrew O'Brien, David Kaplan, Tara Murphy
arXiv [astro-ph.IM]

Sydney Informatics Hub, The University of Sydney, NSW 2006, Australia; [email protected]
Sydney Institute for Astronomy, School of Physics, The University of Sydney, NSW 2006, Australia
University of Wisconsin-Milwaukee, Department of Physics, Milwaukee, WI, USA
ARC Centre of Excellence for Gravitational Wave Discovery (OzGrav), Hawthorn, Victoria, Australia
Abstract.
The Australian Square Kilometre Array Pathfinder (ASKAP) collects images of the sky at radio wavelengths with an unprecedented field of view, combined with a high angular resolution and sub-millijansky sensitivities. The large quantity of data produced is used by the ASKAP Variables and Slow Transients (VAST) survey science project to study the dynamic radio sky. Efficient pipelines are vital in such research, where searches often form a ‘needle in a haystack’ type of problem to solve. However, the existing pipelines developed among the radio-transient community are not suitable for the scale of ASKAP datasets.

In this paper we provide a technical overview of the new “VAST Pipeline”: a modern and scalable Python-based data pipeline for transient searches, using up-to-date dependencies and methods. The pipeline allows source association to be performed at scale using the Pandas DataFrame interface and the well-known Astropy crossmatch functions. The Dask Python framework is used to parallelise operations as well as scale them both vertically and horizontally, by means of a cluster of workers. A modern web interface for data exploration and querying has also been developed using the latest Django web framework combined with Bootstrap.
1. Introduction
The ASKAP Survey for Variables and Slow Transients (VAST; Murphy et al. 2013) is the study of astrophysical transient and variable phenomena at radio wavelengths, such as flare stars and supernovae, using the new Australian Square Kilometre Array Pathfinder (ASKAP; Hotan et al. submitted) telescope. The large field of view of ASKAP means that large areas of the radio sky can be surveyed regularly at sub-millijansky sensitivities. This has not been possible with previous radio telescopes, and means that ASKAP is now providing an unprecedented view of the dynamic radio sky. For example, the first shallow all-sky survey completed by ASKAP, the Rapid ASKAP Continuum Survey (RACS; McConnell et al. 2020), provided 2.81 million source measurements, compared to 0.2 million sources detected in the previous best survey of the southern sky (SUMSS; Mauch et al. 2003). This creates a data challenge for VAST, where regular epochs of the sky will result in the need to construct lightcurves for millions of astrophysical sources, while also being able to swiftly identify the small percentage of sources that exhibit transient behaviour, in addition to providing a visualisation solution for such a large and rich dataset.

Searching for transient and variable sources in image-plane radio data has unique challenges compared to other wavelengths, such as optical astronomy, because of inconsistencies that can occur between images. This means that methods such as ‘image differencing’ can be inaccurate or difficult to produce, especially for short integration observations. Hence, a common technique is to perform source extraction on the images and to then associate the extractions into unique source time series data, which can then be analysed for variability and transient behaviour. Currently, the main open-source software for image-based transient detection well known to the radio astronomy community is the LOFAR Transients Pipeline (“TraP”; Swinbank et al. 2015).

The TraP is a robust and powerful software package that can be used generally with radio image data from any radio facility, as demonstrated by previous successful transient and variable searches with data from a range of telescopes (Stewart et al. 2016; Driessen et al. 2020; Sarbadhicary et al. 2020). However, when using the TraP with ASKAP survey data such as the VAST pilot survey, it became apparent that, due to the scale of the data, the TraP performance did not meet our requirements. For example, a five-image, full-sensitivity ASKAP dataset, i.e. five images of the same 30 deg² region of sky each with approximately 30 000 sources, has a processing time of 95 min. The main issues were:

• TraP uses the PySE (Carbone et al. 2018) source extractor to measure the properties of the sources found in the images. However, source catalogues are a standard data product of the ASKAP processing pipeline, hence this step was no longer required and removing it could potentially save a significant amount of processing time. The ability to read in catalogues, however, is not a feature in TraP (v4.0).
• TraP also has a ‘forced extractions’ feature, in which PySE is used to forcefully extract a Gaussian measurement of a source that was not detected in a certain image in order to ‘fill in’ the gap in the light curve. The high source count with ASKAP meant that this step could also take considerable processing time.
• Most source association operations are performed in the database, using the SQL language, which can become slow when there is a high number of sources. In addition, the logic written in SQL makes debugging difficult.
• The software is written in Python 2.7, which reached end of life in January 2020, meaning that improvements gained by switching to Python 3 were not easily reachable without a considerable migration effort.

To solve these issues, we have developed a modern and scalable Python code base that incorporates and builds upon TraP features to efficiently process ASKAP data. The modern technology stack allows us to perform fast mass source association using Pandas DataFrame and well-known Astropy crossmatch functions. We have also developed a web interface under the same code base, from which a user can run the pipeline and, importantly, visualise and query the results.

The main features of the developed pipeline are discussed in Section 2, while an overview of its architecture is illustrated in Section 3.
2. Pipeline Processing and Features
The main objective of the pipeline is to take the ASKAP images and associated data products (refer to Section 2.1) and perform the following steps:

• load and ingest the ASKAP images and related files;
• associate all the source measurements into single astronomical objects;
• perform forced extractions to fill in non-detected sources;
• calculate aggregate metrics for the objects;
• output the results to a database and parquet files.

The main VAST Pipeline technology stack is as follows: Python 3.6+, Postgres 12+, Astropy 4+, Django 3+, Dask 2+ and Bootstrap 4. The following sections describe each step of the pipeline in more detail, along with other pipeline components.
2.1. Image and Data Ingestion

Raw data from the telescope is processed by the ASKAP observatory and we require four of the standard data products: (i) the image file (FITS), (ii) the source catalogue with the extracted source measurements from the image (text), (iii) the noise file (FITS) and (iv) the background map file (FITS). Data products ii, iii and iv are produced by “Selavy” (Whiting & Humphreys 2012), which is the default source-finder software for the ASKAP images. In the ingestion phase, the pipeline reads the image file FITS header for metadata such as the date of the observation and sky location, performs some data harmonisation operations on the source catalogue (e.g. error correction), uploads the measurements to the database and also writes them to parquet files, and finally uses the noise image to obtain background root-mean-square (RMS) statistics of the radio image. Once an image has been ingested by the pipeline, the information is always retained and the image is available for any other pipeline runs, i.e. measurements are never duplicated in the database.
2.2. Source Association

The source association operations make use of Astropy (Robitaille et al. 2013) crossmatch and sky search functions, while all the data manipulations are performed in Pandas DataFrame (McKinney et al. 2010). In general, the association process follows the same logic as that used by the TraP. This includes using the weighted average right ascension and declination values when associating and the handling of ‘one-to-many’, ‘many-to-one’ and ‘many-to-many’ association types (a standard association is defined as ‘one-to-one’). We refer the reader to Section 4.4 of Swinbank et al. (2015) for a detailed explanation of the association steps and types. We have implemented a minor improvement in that any relationships formed between sources because of the association types are recorded as part of the pipeline output. There are currently three association methods available:

1. Basic: this mode uses the Astropy crossmatch function (match_to_catalog_sky) for a given fixed radius and provides only the closest match to each object. ‘One-to-many’ association types are possible and are recorded.
2. Advanced: this mode uses the Astropy search around the sky function (search_around_sky), in which all matching sources are found within the user-specified search radius, i.e. not only the closest as in the basic method. ‘One-to-many’, ‘many-to-one’ and ‘many-to-many’ association types are possible and are recorded. ‘Many-to-many’ associations are reduced to either ‘one-to-one’ or ‘one-to-many’ association types.
3. de Ruiter: in this mode the association is performed similarly to the advanced mode, but the de Ruiter radius (Scheers 2011) is calculated and used in the source association process using a user-specified de Ruiter radius limit. The same association types as found in advanced are also possible in this mode.

By default the association is performed on an image-by-image basis in observation date order; however, two features exist to help decrease the processing time in some cases. First, it is possible to run association in parallel if the data set has images from distinct, separate regions of the sky to analyse. For example, if a survey consisted of a sequence of images over time centred at RA = …, Dec = … and another at RA = …, Dec = –60, the pipeline can associate these two regions separately (in parallel) and collect the results at the end.

The second feature is the ability to group images together to form a single epoch. Surveys such as the VAST Pilot Survey commonly mosaic a large region of the sky in one observing session, and then observe the same area of sky again at a later time in another session. Depending on the science case, it can be beneficial to consider each observing session as one complete epoch to associate to the following epoch obtained from the next session, instead of considering each individual image that makes up the full survey area, i.e. the default method. Users are able to define which images should be considered as a single epoch, for which the pipeline will then merge the source catalogues, remove duplicate measurements for each epoch, and perform association on the combined catalogues.
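As an illustration of the de Ruiter mode, the following is a minimal sketch of the dimensionless de Ruiter radius in its commonly quoted form: the positional offset in each coordinate divided by the combined positional uncertainties. It is not taken from the pipeline code. The function names, the tuple layout `(ra, dec, ra_err, dec_err)` and the limit value used in the usage lines are assumptions made for this example, and all quantities are assumed to be in degrees.

```python
import math


def de_ruiter_radius(ra1, dec1, ra1_err, dec1_err, ra2, dec2, ra2_err, dec2_err):
    """Dimensionless de Ruiter radius between two positions (inputs in degrees)."""
    # Wrap the RA difference into [-180, 180) so positions on either side
    # of RA = 0 are compared correctly.
    d_ra = (ra1 - ra2 + 180.0) % 360.0 - 180.0
    d_dec = dec1 - dec2
    # Scale the RA offset by cos(mean declination) to get an on-sky offset.
    mean_dec = math.radians(0.5 * (dec1 + dec2))
    ra_term = (d_ra * math.cos(mean_dec)) ** 2 / (ra1_err ** 2 + ra2_err ** 2)
    dec_term = d_dec ** 2 / (dec1_err ** 2 + dec2_err ** 2)
    return math.sqrt(ra_term + dec_term)


def match_candidates(source, measurements, limit):
    """Return (index, radius) for every measurement within `limit`, closest first.

    `source` and each measurement are hypothetical 4-tuples
    (ra, dec, ra_err, dec_err).
    """
    hits = []
    for i, meas in enumerate(measurements):
        r = de_ruiter_radius(*source, *meas)
        if r <= limit:
            hits.append((i, r))
    return sorted(hits, key=lambda hit: hit[1])


# Usage: one existing source and two new measurements (illustrative values).
source = (10.0, -30.0, 0.001, 0.001)
new_measurements = [
    (10.0, -30.001, 0.001, 0.001),  # about 0.7 combined-sigma away
    (10.5, -30.0, 0.001, 0.001),    # far away
]
hits = match_candidates(source, new_measurements, limit=5.68)
```

Because every measurement inside the limit is returned, rather than only the closest, a sketch like this naturally produces the ‘one-to-many’ and ‘many-to-one’ cases that the text describes; resolving them is left out here.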
As an example of the gains from these two features, the VAST Pilot Survey phase 1 consists of 12 epochs, split over 6 distinct sky regions, and contains 8.3 million individual source measurements. If parallel association is used with the default image-by-image association, then the total association step processing time is 150 min. If the epoch mode is activated, grouping together the images from each of the 12 epochs, the association step time is reduced to 11 min.

2.2.1. Ideal Coverage and New Sources

Following association, the pipeline calculates the ideal coverage each source should have, which allows the pipeline to determine: (i) which sources have a non-detection in an image where the location of the source is covered, and (ii) which sources are flagged as ‘new’ sources. A ‘new’ source is a source that was not present in the first observation of the data set and has a signal-to-noise ratio such that it should have been detected. These are important to flag as they are likely transient sources.

Sources in group (i) that have non-detections are passed on to the forced extraction step (see Section 2.3). For new sources, group (ii), a few more steps are taken to provide useful metrics for post-analysis. In the initial determination step above, the source is flagged as new if the signal-to-noise ratio (SNR) of the first measurement’s peak flux in time is above a user-defined threshold (default of 5) compared to the lowest RMS value recorded in the image that provided the non-detection. However, the RMS values of different regions in the image can vary drastically. Hence, to help the user decide if the ‘new’ source is significant, for each new source the pipeline measures the RMS value of the non-detection image at the precise location of the source in question. The SNR is then recalculated using the precise RMS value and is attached to the source in the pipeline outputs. This helps the end-user filter out weak ‘new’ sources that are likely to be erroneous.
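The two-stage ‘new’ source check described above can be sketched in a few lines. This is only an illustration of the logic as stated in the text: the function name, its arguments and the flux values are hypothetical, and the fluxes and RMS values are assumed to share the same units.

```python
def new_source_snr(first_peak_flux, image_min_rms, local_rms, threshold=5.0):
    """Return (is_new, refined_snr) for a source missing from an earlier image.

    first_peak_flux: peak flux of the first detection in time
    image_min_rms:   lowest RMS anywhere in the non-detection image
    local_rms:       RMS of that image at the source position
    threshold:       SNR threshold (the text quotes a default of 5)
    """
    # Stage 1: optimistic test against the best (lowest) noise in the image.
    coarse_snr = first_peak_flux / image_min_rms
    is_new = coarse_snr > threshold
    # Stage 2: for flagged sources, recompute the SNR with the local noise
    # so that weak candidates can be filtered out afterwards.
    refined_snr = first_peak_flux / local_rms if is_new else None
    return is_new, refined_snr


# Usage: a 2.0 mJy detection, image minimum RMS 0.25 mJy, local RMS 0.5 mJy.
flag, snr_at_position = new_source_snr(2.0, 0.25, 0.5)
```

In this example the source passes the coarse test (SNR of 8) but has a refined SNR of only 4, which is the kind of weak ‘new’ source an end-user may wish to filter out.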
2.3. Forced Extractions

The sources that have a non-detection in an image where they should have been detected, group (i) sources in Section 2.2.1, are submitted to have measurements extracted from the image in question. The pipeline has an option to place a user-defined minimum threshold (default of 3) on the SNR of the source for the forced extraction to take place. The SNR is determined using the peak flux of the first measurement in time against the minimum RMS value of the non-detection image. This provides some control over disregarding forced extractions for sources that are not expected to have been detected even in the best region of the image. This helps lower the computation time, especially for data sets with images with vastly different median RMS values.

The forced extractions are performed using a custom-written function that uses a Gaussian the same size as the point-spread function of the telescope in the observation. Extractions from images are also performed in parallel using Dask.

2.4. Aggregate and Variability Metrics

The pipeline calculates various aggregate statistics for each unique source, such as maximum and minimum flux values, total number of true detections and forced measurements, the average compactness (flux_int / flux_peak) and the number of relations. Two variability metrics are also included in the calculations: the V (coefficient of variation) and η (weighted reduced χ² significance of the variation) metrics, as described in Swinbank et al. (2015). These are calculated for both the integrated and peak flux values.

In addition to the η and V metrics, the pipeline also calculates the metrics commonly used to perform a ‘two-epoch’ variability analysis (Mooley et al. 2016), which has the advantage of being able to define variability on more specific timescales. For each source, the V_s and m metrics are calculated for every unique pair of measurements at different timestamps. The metrics for each unique pair in the data set are uploaded to the database and written to parquet.
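The variability metrics named above can be illustrated with a short, standard-library-only sketch. The formulas below are the commonly used definitions (V as the sample standard deviation over the mean, η as the reduced χ² about the weighted mean, and the two-epoch V_s and m as the flux difference normalised by the combined uncertainty and by the mean flux respectively); they are stated here as assumptions rather than copied from the pipeline, and all function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def v_metric(fluxes):
    """Coefficient of variation: sample standard deviation over the mean."""
    return stdev(fluxes) / mean(fluxes)


def eta_metric(fluxes, errors):
    """Reduced chi-squared of the fluxes about their weighted mean."""
    weights = [1.0 / err ** 2 for err in errors]
    weighted_mean = sum(w * f for w, f in zip(weights, fluxes)) / sum(weights)
    chi2 = sum(w * (f - weighted_mean) ** 2 for w, f in zip(weights, fluxes))
    return chi2 / (len(fluxes) - 1)


def two_epoch_metrics(flux1, err1, flux2, err2):
    """Return (V_s, m) for one pair of measurements."""
    # V_s: flux difference in units of the combined uncertainty.
    vs = (flux1 - flux2) / math.hypot(err1, err2)
    # m: flux difference as a fraction of the mean flux (modulation index).
    m = 2.0 * (flux1 - flux2) / (flux1 + flux2)
    return vs, m


def pair_metrics(fluxes, errors):
    """Compute (V_s, m) for every unique pair of measurements of one source."""
    return {
        (i, j): two_epoch_metrics(fluxes[i], errors[i], fluxes[j], errors[j])
        for i, j in combinations(range(len(fluxes)), 2)
    }


# Usage: a three-point light curve with illustrative values.
fluxes = [10.0, 5.0, 9.0]
errors = [3.0, 4.0, 3.0]
pairs = pair_metrics(fluxes, errors)
```

A light curve with N measurements yields N(N-1)/2 pairs, which is why storing the per-pair metrics separately from the per-source aggregates is a natural design choice.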
The pipeline also uses a user-defined threshold V_s value (default of 4.3) to assign the most significant metric pair to the source itself, to allow for swift filtering on the sources themselves. For example, if a source has pairs of V_s = …, m = …8, and V_s = …, m = …, V_s = …, m = …5, the pipeline assigns the V_s = …, m = … pair to the source.

2.5. Parallelisation with Dask

The
Dask framework is used to parallelise operations on DataFrames such as the computation of metrics, merging DataFrames, and groupby-apply operations. It is also used by the parallel source association mode and in the forced source extractions, among other minor pipeline features. The use of Dask enables out-of-RAM operations, as Dask is able to manage the RAM limits, and more importantly it provides horizontal scalability with the use of a cluster of workers.
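The pattern behind the parallel source association mode (split the images by sky region, process each region independently, collect the results at the end) can be sketched with the Python standard library. Note that this sketch deliberately uses `concurrent.futures` as a stand-in and is not Dask code; the image dictionaries, the field names and the placeholder `associate_region` function are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def associate_region(images):
    """Placeholder for the date-ordered association of one sky region.

    Here it only returns the image names in observation order.
    """
    ordered = sorted(images, key=lambda image: image["date"])
    return [image["name"] for image in ordered]


def associate_by_region(images, max_workers=4):
    """Group images by sky region and process the regions concurrently."""
    groups = defaultdict(list)
    for image in images:
        groups[image["region"]].append(image)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per region; regions are independent of each other.
        futures = {
            region: pool.submit(associate_region, region_images)
            for region, region_images in groups.items()
        }
        # Collect the per-region results once every task has finished.
        return {region: future.result() for region, future in futures.items()}


# Usage with two illustrative regions.
images = [
    {"name": "A_epoch2", "region": "A", "date": "2019-10-29"},
    {"name": "B_epoch1", "region": "B", "date": "2019-08-28"},
    {"name": "A_epoch1", "region": "A", "date": "2019-08-27"},
]
results = associate_by_region(images)
```

The submit-then-gather structure is the part that carries over to a cluster of workers; a thread pool on one machine offers no horizontal scaling, which is the main capability the text attributes to Dask.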
2.6. Web Interface

The results of a pipeline run can be explored in a website developed using the Python Django framework, in which a Bootstrap 4 template was used. A Postgres 12+ database is used to serve the data to the web app, and to store the data model as well as relationships between the data entities (e.g. sources and images). The Q3C Postgres plugin (Koposov & Bartunov 2006) is used to add cone search functionality in the data query. The interface enables:

• Displaying general statistics about all the pipeline runs (e.g. total number of images processed);
• Inspecting and configuring single pipeline runs: e.g. changing the configuration file, initialising a pipeline run with chosen parameters and a list of images to process;
• Starting or re-running a pipeline run with the computation in the background. This functionality is enabled by the Django extension app Django-Q;
• Displaying tables related to images, measurements, sources and pipeline runs;
• Displaying detailed image information: name, positions, axis and RMS; sky region preview with Aladin Lite (Boch & Fernique 2014); user comments; extracted measurements table;
• Displaying detailed measurement information: position data, axis and other data; sky region preview with Aladin Lite; postage stamp cutout using JS9 (Mandel & Vikhlinin 2020); user comments; sources and siblings tables;
• Displaying detailed source information: source metrics and data such as position, flux, η and V metrics; light curve using Bokeh (Bokeh Development Team 2020); external catalogue crossmatches and references using the Astroquery package (Ginsburg et al. 2019); postage stamp cutouts of the associated measurements (using the JS9 library); related sources; measurements (the single source data points over time associated into the same astronomical object); sky region preview (using the Aladin Lite service);
• Exporting tables to Excel or CSV data format;
• Commenting and tagging: for example, a source can be tagged as a particular astronomical object;
• “Starring” a source: a user can add a source to their own favourite sources list, with a comment.

All outputs of a pipeline run are also saved in parquet files for raw data access and computation in the Dask cluster. Saving in parquet format also makes the results easily portable and accessible, in that they can easily be transported to local users’ own machines and, for example, be loaded into a Jupyter Notebook environment using packages such as Pandas or Vaex (Breddels & Veljanoski 2018) for offline post-processing.

Figure 1. VAST Pipeline architecture and stack.
3. Pipeline Architecture
A general schematic of the pipeline architecture is shown in Figure 1. The back-end as well as the front-end stacks are shown, on top of the storage and the computing platforms. The pipeline is a web application developed in Python using the Django web framework. Although Apache Hadoop and its ecosystem of tools (Landset et al. 2015), such as Apache Spark, are the most predominant tools in “Big Data” applications, we chose to adopt Python and its ecosystem of tools (i.e. Pandas DataFrame and Dask) as the main pipeline technology stack due to their simplicity to develop and maintain, and popularity among the scientific community, in addition to astronomy-specific libraries such as Astropy. Dask provides similar features to Spark, both from a data processing as well as a scalability point of view (see “Dask vs Spark”: https://docs.dask.org/en/latest/spark.html).

The basic data relationship between images and sources is represented in Figure 2:
• Image: the ASKAP sky image taken at a specific time, with the main image file and related noise, background map and extracted measurements files.
• Measurement: a single source measurement at a specific time. Multiple measurements belong to a single image.
• Source: collection of associated measurements over time and images, that belong to the very same unique astronomical object.

[Figure 2 diagram: ASKAP IMAGE, MEASUREMENT and SOURCE entities]
Figure 2. VAST Pipeline image data relationship.
This data schema avoids duplication of data when processing the images with different settings. For example, when using two different choices for the association radius, the pipeline uses the same measurement objects, but will create different source objects due to the different association settings. This design choice was possible as the source finder is external to the pipeline. On the other hand, this design still enables processing an image with two different source-finder settings: the two versions will be treated as two different image entities in the pipeline, even though they are the same physical image.
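The image, measurement and source relationship described above can be sketched with plain Python dataclasses. This is not the pipeline's Django data model: the class fields, the `toy_associate` function (a crude box match, not one of the pipeline's association methods) and the run names are hypothetical, chosen only to show that two runs with different association radii can share the same measurement objects while producing different source objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Image:
    name: str
    source_finder_settings: str = "default"


@dataclass(frozen=True)
class Measurement:
    image: Image
    ra: float
    dec: float
    peak_flux: float


@dataclass
class Source:
    run: str
    measurements: List[Measurement] = field(default_factory=list)


def toy_associate(run, measurements, radius_deg):
    """Toy association: attach each measurement to the first source whose
    first measurement lies within a box of half-width radius_deg."""
    sources = []
    for meas in measurements:
        for src in sources:
            first = src.measurements[0]
            if (abs(meas.ra - first.ra) <= radius_deg
                    and abs(meas.dec - first.dec) <= radius_deg):
                src.measurements.append(meas)
                break
        else:
            # No existing source is close enough: start a new one.
            sources.append(Source(run=run, measurements=[meas]))
    return sources


# Measurements are created once, at ingestion time.
epoch1 = Image("field_A_epoch1")
epoch2 = Image("field_A_epoch2")
measurements = [
    Measurement(epoch1, 10.000, -30.000, 1.2),
    Measurement(epoch2, 10.002, -30.001, 1.1),
    Measurement(epoch2, 10.020, -30.000, 0.8),
]

# Two runs with different association settings reuse the same measurements.
tight = toy_associate("run_tight", measurements, radius_deg=0.005)
loose = toy_associate("run_loose", measurements, radius_deg=0.05)
```

Here the tight run yields two sources and the loose run yields one, yet both reference the identical `Measurement` instances, mirroring the no-duplication property described in the text.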
4. Summary
The VAST Pipeline is modern and scalable detection software for the discovery of radio transient and variable sources in data from the ASKAP telescope. It is built upon legacy community-developed software, but with a modern and simpler architecture. The results have been validated against those produced by the TraP, and by individual custom-developed code bases, developed by researchers and students at the University of Sydney. It is able to process the VAST Pilot Survey phase 1 data (924 images, 8.3 million measurements, and ∼
5. Acknowledgements
This tool was developed in collaboration with the Sydney Informatics Hub (SIH), a core research facility at the University of Sydney. DK and AO are supported by NSF grant AST-1816492. TM acknowledges the support of the Australian Research Council through grants FT150100099 and DP190100561. Parts of this research were conducted by the Australian Research Council Centre of Excellence for Gravitational Wave Discovery (OzGrav), project number CE170100004. The authors thank the creators of the SB Admin 2 Bootstrap dashboard for making the template freely available.
References
Boch, T., & Fernique, P. 2014, in Astronomical Data Analysis Software and Systems XXIII, edited by N. Manset, & P. Forshay, vol. 485 of Astronomical Society of the Pacific Conference Series, 277
Bokeh Development Team 2020, Bokeh: Python library for interactive visualization. URL https://bokeh.org/
Breddels, M. A., & Veljanoski, J. 2018, A&A, 618, A13
Carbone, D., Garsden, H., Spreeuw, H., Swinbank, J. D., van der Horst, A. J., Rowlinson, A., Broderick, J. W., Rol, E., Law, C., Molenaar, G., & Wijers, R. A. M. J. 2018, Astronomy and Computing, 23, 92
Driessen, L. N., McDonald, I., Buckley, D. A. H., Caleb, M., Kotze, E. J., Potter, S. B., Rajwade, K. M., Rowlinson, A., Stappers, B. W., Tremou, E., et al. 2020, MNRAS, 491, 560
Ginsburg, A., Sipőcz, B. M., Brasseur, C., Cowperthwaite, P. S., Craig, M. W., Deil, C., Groener, A. M., Guillochon, J., Guzman, G., Liedtke, S., et al. 2019, The Astronomical Journal, 157, 98
Koposov, S., & Bartunov, O. 2006, in Astronomical Data Analysis Software and Systems XV, edited by C. Gabriel, C. Arviset, D. Ponz, & S. Enrique, vol. 351 of Astronomical Society of the Pacific Conference Series, 735
Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. 2015, Journal of Big Data, 2, 24
Mandel, E., & Vikhlinin, A. 2020, ericmandel/js9 v3.0 upgrade to ES6, add rot90, flip, and more. URL https://doi.org/10.5281/zenodo.3598590
Mauch, T., Murphy, T., Buttery, H. J., Curran, J., Hunstead, R. W., Piestrzynski, B., Robertson, J. G., & Sadler, E. M. 2003, Monthly Notices of the Royal Astronomical Society, 342, 1117
McConnell, D., Hale, C. L., Lenc, E., Banfield, J. K., Heald, G., Hotan, A. W., Leung, J. K., Moss, V. A., Murphy, T., O’Brien, A., et al. 2020, Publications of the Astronomical Society of Australia, 37
McKinney, W., et al. 2010, in Proceedings of the 9th Python in Science Conference (Austin, TX), vol. 445, 51
Mooley, K., Hallinan, G., Bourke, S., Horesh, A., Myers, S., Frail, D., Kulkarni, S., Levitan, D., Kasliwal, M., Cenko, S., et al. 2016, The Astrophysical Journal, 818, 105
Murphy, T., Chatterjee, S., Kaplan, D. L., Banyer, J., Bell, M. E., Bignall, H. E., Bower, G. C., Cameron, R. A., Coward, D. M., Cordes, J. M., et al. 2013, Publications of the Astronomical Society of Australia, 30
Robitaille, T. P., Tollerud, E. J., Greenfield, P., Droettboom, M., Bray, E., Aldcroft, T., Davis, M., Ginsburg, A., Price-Whelan, A. M., Kerzendorf, W. E., et al. 2013, Astronomy & Astrophysics, 558, A33
Sarbadhicary, S. K., Tremou, E., Stewart, A. J., Chomiuk, L., Peters, C., Hales, C., Strader, J., Momjian, E., Fender, R., & Wilcots, E. M. 2020, arXiv e-prints
Scheers, L. 2011, Transient and variable radio sources in the LOFAR sky: an architecture for a detection framework
Stewart, A. J., Fender, R. P., Broderick, J. W., Hassall, T. E., Muñoz-Darias, T., Rowlinson, A., Swinbank, J. D., Staley, T. D., Molenaar, G. J., Scheers, B., et al. 2016, MNRAS, 456, 2321