MNRAS 000, 1–17 (0000) Preprint 22 February 2021 Compiled using MNRAS LaTeX style file v3.0
Transient-optimised real-bogus classification with Bayesian Convolutional Neural Networks — sifting the GOTO candidate stream
T. L. Killestein,★ J. Lyman, D. Steeghs, K. Ackley, M. J. Dyer, K. Ulaczyk, R. Cutter, Y.-L. Mong, D. K. Galloway, V. Dhillon, P. O'Brien, G. Ramsay, S. Poshyachinda, R. Kotak, R. P. Breton, L. K. Nuttall, E. Pallé, D. Pollacco, E. Thrane, S. Aukkaravittayapun, S. Awiphan, U. Burhanudin, P. Chote, A. Chrimes, E. Daw, C. Duffy, R. Eyles-Ferris, B. Gompertz, T. Heikkilä, P. Irawati, M. R. Kennedy, A. Levan, S. Littlefair, L. Makrygianni, D. Mata Sánchez, S. Mattila, J. Maund, J. McCormac, D. Mkrtichian, J. Mullaney, E. Rol, U. Sawangwit, E. Stanway, R. Starling, P. A. Strøm, S. Tooke, K. Wiersema, S. C. Williams

Department of Physics, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK
School of Physics & Astronomy, Monash University, Clayton VIC 3800, Australia
Department of Physics and Astronomy, University of Sheffield, Sheffield S3 7RH, UK
School of Physics & Astronomy, University of Leicester, University Road, Leicester LE1 7RH, UK
Armagh Observatory & Planetarium, College Hill, Armagh, BT61 9DG
National Astronomical Research Institute of Thailand, 260 Moo 4, T. Donkaew, A. Maerim, Chiangmai 50180, Thailand
Department of Physics & Astronomy, University of Turku, Vesilinnantie 5, FI-20014 Turku, Finland
Jodrell Bank Centre for Astrophysics, Department of Physics and Astronomy, The University of Manchester, Manchester M13 9PL, UK
University of Portsmouth, Portsmouth PO1 3FX, UK
Instituto de Astrofísica de Canarias, E-38205 La Laguna, Tenerife, Spain
Department of Astrophysics/IMAPP, Radboud University, Nijmegen, The Netherlands
Finnish Centre for Astronomy with ESO (FINCA), Quantum, Vesilinnantie 5, University of Turku, FI-20014 Turku, Finland
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Large-scale sky surveys have played a transformative role in our understanding of astrophysical transients, only made possible by increasingly powerful machine learning-based filtering to accurately sift through the vast quantities of incoming data generated. In this paper, we present a new real-bogus classifier based on a Bayesian convolutional neural network that provides nuanced, uncertainty-aware classification of transient candidates in difference imaging, and demonstrate its application to the datastream from the GOTO wide-field optical survey. Not only are candidates assigned a well-calibrated probability of being real, but also an associated confidence that can be used to prioritise human vetting efforts and inform future model optimisation via active learning. To fully realise the potential of this architecture, we present a fully-automated training set generation method which requires no human labelling, incorporating a novel data-driven augmentation method to significantly improve the recovery of faint and nuclear transient sources. We achieve competitive classification accuracy (FPR and FNR both below 1%) compared against classifiers trained with fully human-labelled datasets, whilst being significantly quicker and less labour-intensive to build. This data-driven approach is uniquely scalable to the upcoming challenges and data needs of next-generation transient surveys. We make our data generation and model training codes available to the community.
Key words: methods: data analysis – surveys – techniques: photometric

★ E-mail: [email protected]
Transient astronomy seeks to identify new or variable objects in the night sky, and characterise them to learn about the underlying mechanisms that power them and govern their evolution. This variability can occur on timescales of milliseconds to years, and at luminosities ranging from stellar flares to luminous supernovae that outshine their host galaxy (Kulkarni 2012; Villar et al. 2017). Through observations of optical transient sources we have obtained evidence of the explosive origins of heavy elements (e.g. Abbott et al. 2017b; Pian et al. 2017), traced the accelerating expansion of our Universe across cosmic time (e.g. Perlmutter et al. 1999), and located the faint counterparts of some of the most distant and energetic astrophysical events known: gamma-ray bursts (e.g. Tanvir et al. 2009). Requiring multiple observations of the same sky area to detect variability, transient surveys naturally generate vast quantities of data that require processing, filtering, and classification – this has driven the development of increasingly powerful techniques bolstered by machine learning to meet the demands of these projects.

Many of the earliest prototypical transient surveys began as galaxy-targeted searches, performed with small field-of-view instruments. In the early stages of these surveys candidate identification was performed manually, with humans 'blinking' images to look for varying sources. This process is time-consuming and error-prone, and represented a bottleneck in the survey dataflow which heavily limited the sky coverage of these surveys. The first 'modern' transient surveys (e.g. LOSS; Filippenko et al. 2001) used early forms of difference imaging to detect candidates in the survey data, automating the candidate detection process and enabling both faster response times and greater sky coverage.
LOSS proved extremely successful, discovering over 700 supernovae in its first decade of operation and providing a homogeneous sample that has proven useful in constraining supernova rates for the local Universe (Leaman et al. 2011; Li et al. 2011).

Difference imaging has since emerged as the dominant method for the identification of new sources in optical survey data. With this method, an input image has a historic reference image subtracted to remove static, unvarying sources. Transient sources in this difference image appear as residual flux, which can be detected and measured photometrically using standard techniques. Various algorithms have been proposed for optical image subtraction, either attempting to match the point spread function (PSF) and spatially-varying background between an input and reference image (Alard & Lupton 1998; Becker 2015), or accounting for the mismatch statistically (Zackay et al. 2016) to enable clean subtraction. Difference imaging also provides an effective way to robustly discover and measure variable sources in crowded fields (Wozniak 2000).

Driven by both improvements in technology (large-format CCDs, wide-field telescopes) and difference imaging algorithms, large-scale synoptic sky surveys came to the fore. In this mode, significant areas of sky can be covered each night to a useful depth and candidate transient sources automatically flagged. This has driven an exponential growth in discoveries of transients, with over 18,000 discovered in 2019 alone. Wide-field surveys such as the Zwicky Transient Facility (ZTF; Bellm et al. 2019), PanSTARRS1 (PS1; Chambers et al. 2016), the Asteroid Terrestrial-impact Last Alert System (ATLAS; Tonry et al. 2018), and the All Sky Automated Survey for SuperNovae (ASAS-SN; Shappee et al. 2014) have proven to be transformative, collectively discovering hundreds of new transients per night.
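The PSF-matching subtraction described above can be sketched in a few lines: blur the sharper reference to the science image's seeing, subtract, and look for residual flux. This is a minimal illustration on synthetic data with idealised Gaussian PSFs (all positions and fluxes are invented), not the hotpants algorithm itself.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(42)

# Synthetic static star field shared by both epochs (positions/fluxes made up).
truth = np.zeros((64, 64))
ys, xs = rng.integers(5, 59, 15), rng.integers(5, 59, 15)
truth[ys, xs] = rng.uniform(100, 500, 15)

sigma_ref, sigma_sci = 1.0, 2.0                 # seeing (Gaussian PSF widths)
reference = gaussian_filter(truth, sigma_ref)   # good-seeing historic image

transient = np.zeros_like(truth)
transient[40, 25] = 300.0                       # new source in the science epoch
science = gaussian_filter(truth + transient, sigma_sci)

# PSF matching: convolve the sharper reference with the quadrature-difference
# kernel, then subtract. Static sources cancel; the transient leaves residual flux.
kernel_sigma = np.sqrt(sigma_sci**2 - sigma_ref**2)
difference = science - gaussian_filter(reference, kernel_sigma)

peak = np.unravel_index(np.argmax(difference), difference.shape)
```

Because Gaussian blurs compose in quadrature, the static field cancels almost exactly, and the brightest residual pixel coincides with the injected transient.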
https://wis-tns.weizmann.ac.il/

With the ability to repeatedly and rapidly tile large areas of sky in order to search for new and varying sources, the follow-up of optical counterparts to poorly localised external triggers became possible, in the process ushering in the age of multi-messenger astronomy. An early example was the detection of optical counterparts to Fermi gamma-ray bursts by the Palomar Transient Factory (PTF; Law et al. 2009). Typical localisation regions from the Fermi GBM instrument (Meegan et al. 2009) were of order 100 square degrees at this time, representing a significant challenge to successfully locate comparatively faint (𝑟 ∼ 19) GRB afterglows. Of the 35 high-energy triggers responded to, 8 were located in the optical (Singer et al. 2015), demonstrating the emerging effectiveness of synoptic sky surveys for this work.

Another recent highlight has been the detection of an optical counterpart to a TeV-scale astrophysical neutrino detected by the IceCube facility (Aartsen et al. 2017). Recent and historical wide-field optical observations of the localisation area combined with high-energy constraints from Fermi enabled the identification of a flaring blazar, believed to be responsible for the alert (IceCube-170922A; IceCube Collaboration et al. 2018). This rapidly increasing survey capability has culminated recently in the landmark discovery of a multi-messenger counterpart to the gravitational wave (GW) event GW170817 (Abbott et al. 2017a,b).
For many years, the rate of difference image detections generated per night by sky surveys has significantly exceeded the capacity of teams of humans to manually vet and investigate each one. This has motivated the development of algorithmic filtering of new sources, to reject the most obvious false positives and reduce the incoming datastream to something tractable by human vetting. With the growing scale and depth of modern sky surveys, simple static cuts on source parameters cannot keep pace with the rate of candidates, with high false positive rates leading to substantial contamination by artefacts. This situation has motivated the development of machine learning (ML) and deep learning (DL) classifiers, which can extract subtle relationships between the input data/features and perform more effective filtering of candidates. The dominant paradigm for this task has so far been the real-bogus formalism (e.g. Bloom et al. 2012), which formulates this filtering as a binary classification problem. Genuine astrophysical transients are designated 'real' (score 1), whereas detector artefacts, subtraction residuals and other distractors are labelled as 'bogus' (score 0). A machine learning classifier can then be trained using these labels with an appropriate set of inputs to make predictions about the nature of a previously-unseen (by the classifier) source within an image.

This real-bogus classification is only one step in a transient detection pipeline. Having established the candidates appearing as astrophysically real sources, further filtering is required to determine if they are scientifically interesting, or distractors – the definition of "interesting" is naturally governed by the science goals of the survey. This process draws in contextual information from existing catalogues, historical evolution, and more fine-grained classification routines.
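The binary real-bogus formulation above (real = 1, bogus = 0, a supervised classifier trained on per-candidate inputs) can be sketched with a random forest on feature vectors. The three features and their distributions here are invented for illustration only; they are not GOTO's actual feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical per-candidate features: FWHM (arcsec), negative-pixel
# fraction, and elongation. Distributions chosen only to be separable.
real  = np.column_stack([rng.normal(3.0, 0.4, n),
                         rng.uniform(0.00, 0.05, n),
                         rng.normal(1.1, 0.1, n)])
bogus = np.column_stack([rng.normal(1.5, 1.0, n),
                         rng.uniform(0.05, 0.60, n),
                         rng.normal(2.0, 0.8, n)])

X = np.vstack([real, bogus])
y = np.concatenate([np.ones(n), np.zeros(n)])   # real = 1, bogus = 0

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[::2], y[::2])                         # even rows: training set
accuracy = clf.score(X[1::2], y[1::2])          # odd rows: held-out test set
```

`clf.predict_proba` then yields the real-bogus score on which a survey pipeline would place its ingestion cut.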
The last step before triggering follow-up and further study (at least currently) is human inspection of the remaining candidates. No single filtering step is 100% efficient in removing false positives/low significance detections, thus human vetting is required to identify promising candidates and screen out any bogus detections that have made it this far. Real-bogus classification is the most crucial step, reducing the volume of candidates that later steps must
process and the amount of bogus candidates that humans must eventually sift through to find interesting objects – a balance between sensitivity (to avoid missing detections irretrievably) and specificity (avoiding floods of low-quality candidates) must be reached.

Real-bogus classification is a well-studied problem, beginning with early transient surveys (Romano et al. 2006; Bailey et al. 2007), and evolving both in complexity and performance with the increasing demands placed on it by larger and deeper sky surveys such as PTF (Brink et al. 2013), PanSTARRS1 (Chambers et al. 2016), and the Dark Energy Survey (Goldstein et al. 2015). Early classifiers were generally built on decision tree-based predictors such as random forests (Breiman 2001), using a feature vector as input. Feature vectors comprise extracted information about a given candidate, and often include broad image-level statistics designed to maximally separate real and bogus detections in the feature space. Examples include the source full-width half maximum computed from the 2D profile, noise levels, and negative pixel counts. More elaborate features can be composed via linear combinations of these quantities, which may exploit correlations and symmetries. Another method of deriving features is to compute compressed numerical representations of the source via Zernike/shapelet decomposition (Ackley et al. 2019).

However, feature selection can represent a bottleneck to increasing performance. Features are typically selected by humans to encode the salient details of a given detection, attempting to find a compromise between classification accuracy and speed of evaluation.
This introduces the possibility of missing salient features entirely, or choosing a sub-optimal combination of them.

Directly using pixel intensities as a feature representation avoids choosing features entirely, instead training on flattened and normalised input images (Wright et al. 2015; Mong et al. 2020); these have demonstrated improved accuracy over fixed-feature classifiers. However, this approach quickly (quadratically) becomes inefficient for large inputs. Using a smaller input size means information on the surrounding area of each detection is unavailable, limiting the visible context and affecting classification accuracy as a result.

Recently, convolutional neural networks (CNNs; LeCun et al. 1995) have led to a paradigm shift in the field of computer vision and machine learning, which has been transformative in the way we process, analyse, and classify image data across all disciplines. CNNs use learnable convolutional filters known as kernels to replace feature selection. These filters are cross-correlated with the input images to generate 'feature maps', effectively compact feature representations. Through the training process, the filter parameters are optimised to extract the most salient details of the inputs, which can then be fed into fully-connected layers to perform classification or regression. In this way, the model can select its own feature representations, avoiding the bottleneck of human selection. Multiple layers can be combined to achieve greater representational power, known as deep learning (LeCun et al. 2015). Recent work using CNNs has demonstrated state-of-the-art performance at real-bogus classification (Gieseke et al. 2017; Cabrera-Vives et al. 2017; Duev et al. 2019; Turpin et al. 2020).
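The kernel-to-feature-map step described above is just a cross-correlation followed by a nonlinearity. A minimal sketch, with a hand-set centre-surround kernel standing in for a learned one and a synthetic PSF-like blob standing in for a real stamp:

```python
import numpy as np
from scipy.signal import correlate2d

# A PSF-like blob in a 21x21 stamp (synthetic; real inputs are image cutouts).
yy, xx = np.mgrid[0:21, 0:21]
stamp = np.exp(-((yy - 10)**2 + (xx - 10)**2) / (2 * 2.0**2))

# One hand-set 3x3 centre-surround kernel. In a CNN, a bank of such kernels
# is learned, and each is cross-correlated with the input per the text.
kernel = -np.ones((3, 3)) / 8.0
kernel[1, 1] = 1.0

# Cross-correlate, then apply a ReLU nonlinearity -> one feature map.
feature_map = np.maximum(correlate2d(stamp, kernel, mode='same'), 0.0)

peak = np.unravel_index(np.argmax(feature_map), feature_map.shape)
```

The centre-surround kernel responds most strongly at the compact source's centre, which is exactly the kind of representation later fully-connected layers consume.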
CNNs are also efficiently parallelisable, making them suitable for high-volume data processing tasks. Whilst providing substantial accuracy improvements over previous techniques, deep learning is particularly reliant upon large and high-quality training sets to minimise overfitting, arising from the high number of model parameters. Although augmentation and regularisation techniques can minimise this risk, they are no substitute for a larger dataset. The performance of any classifier is ultimately limited by the error rate on the training labels, so it is important to also ensure the dataset is accurately labelled. Making a large, pure, and diverse training set can be among the most challenging parts of developing a machine learning algorithm, and significant effort has been focused on this area in recent years.

Traditionally the 'gold standard' for machine learning datasets across computer science and astronomy has been human-labelled data, as this represents the ground truth for any supervised learning task. Use of citizen science has proven to be particularly effective, leveraging large numbers of participants and ensembling their individual classifications to provide higher-accuracy training sets for machine learning through collaborative schemes such as Zooniverse (Lintott et al. 2008; Mahabal et al. 2019). However, even in large teams, human labelling of large-scale datasets is time-consuming and inefficient, requiring hundreds to thousands of hours spent collectively to build a dataset of a suitable size and purity. Specifically for real-bogus classification, there are also issues with completeness and accuracy for human labelling of very faint transients close to the detection limit. These faint transients are where a classifier has potential to be the most helpful, so if the training set is fundamentally biased in this regime, any classifier predictions will be similarly limited.
To go beyond human-level performance, we cannot solely rely on human labelling; additional information is required. One specific aspect of astronomical datasets that can be leveraged to address both issues discussed above is the availability of a diverse range of contextual data about a given source. Sizeable catalogues of known variable stars, galaxies, high-energy sources, asteroids, and many other astronomical objects are freely available and can be queried directly to identify and provide a more complete picture of the nature of a given source.

Significant effort is being invested in data processing techniques for transient astronomy in anticipation of the Vera C. Rubin Observatory (Ivezić et al. 2019), due to begin survey operations in 2022. Via the Legacy Survey of Space and Time (LSST), the entire southern sky will be surveyed down to a depth of 𝑟′ ∼
The Gravitational-Wave Optical Transient Observer (GOTO; Steeghs et al. 2021) is a wide-field optical array, designed specifically to rapidly survey large areas of sky in search of the weak kilonovae and afterglows associated with gravitational wave counterparts. The work we present in this paper was conducted during the GOTO prototype stage, using data taken with a single 'node' of telescopes situated at the Roque de los Muchachos observatory on La Palma. Each node comprises 8 co-mounted fast astrograph OTAs (optical tube assemblies), combining to give a ∼40 square degree field of view in a single pointing. GOTO performs surveys using a custom wide 𝐿 band filter (approximately equivalent to 𝑔′ + 𝑟′). Science and reference images are aligned using the spalipy code (https://github.com/Lyalpha/spalipy). Image subtraction is performed on the aligned science and reference images with the hotpants algorithm (Becker 2015) to generate a difference image. To locate residual sources in the difference image, source extraction is performed using SExtractor (Bertin & Arnouts 1996). Detections in the difference image are referred to as 'candidates' through the remainder of this paper. For each candidate, a set of small stamps are cut out from the main science, template and difference images, and this forms the input to the GOTO real-bogus classifier. This process and proposed improvements are discussed in more detail in Section 2.1. From here, candidates that pass a cut on real-bogus score (using a preliminary classifier) are ingested into the GOTO Marshall – a central website for GOTO collaborators to vet, search and follow up candidates (Lyman et al., in prep.).

In line with the principal science goals of the GOTO project, the real-bogus classifier discussed in this work is constructed specifically to maximise the recovery rate of extragalactic transients and other explosive events such as cataclysmic variable outbursts. Small-scale stellar variability can be easily detected via difference imaging, but is better studied through the aggregated source light curves.

An operational requirement for the current version of this classifier is the ability to perform consistently across multiple different hardware configurations. During classifier development, the GOTO prototype used two different types of optical tube design, each with varying optical characteristics that led to different point spread functions, distortion patterns, and background levels/patterns.
Due to limited data availability, training a classifier for each individual OTA (or group of OTAs of the same type) was not viable. This requirement adds an additional operational challenge over survey programs such as the Zwicky Transient Facility (ZTF; Bellm et al. 2019) and PanSTARRS1 (PS1; Chambers et al. 2016), which use a static, single-telescope design. If acceptable results can be achieved with this heterogeneous hardware configuration, then further performance gains can be expected when the design GOTO hardware configuration is deployed. This will use telescopes of consistent design and improved optical quality, meaning less model capacity needs to be directed towards keeping the classification performance stable across a diverse ensemble of optical distortions.

In this paper, we propose an automated training set generation procedure that enables large, minimally contaminated, and diverse datasets to be produced in less time than human labelling and at larger scales. This procedure also introduces a data-driven augmentation scheme to generate synthetic training data that can be used to significantly improve the performance of any classifier on extragalactic transients of all types, but with particular effectiveness for nuclear transients. Using this improved training data, we apply Bayesian convolutional neural networks (BCNNs) to astronomical real-bogus classification for the first time, providing uncertainty-aware predictions that measure classifier confidence, in addition to the typical real-bogus score. This opens up promising future directions for more complex classification tasks, as well as optimally utilising the predictions of human labellers. We emphasise that although this classifier is discussed in the context of GOTO and our associated science needs, the techniques discussed are fully general and could easily be applied to real-bogus classification at other projects. Our code, gotorb, is made freely available online with this in mind.
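One common way such uncertainty-aware predictions are obtained is Monte Carlo dropout: dropout is left on at prediction time, the network is run many times per candidate, and the spread of the stochastic scores measures model confidence. A minimal numpy sketch of the mechanics (the tiny two-layer network and its random weights are placeholders, not the paper's BCNN):

```python
import numpy as np

rng = np.random.default_rng(1)

# Untrained placeholder weights standing in for a trained network; only
# the Monte Carlo dropout mechanics are the point of this sketch.
W1 = rng.normal(0.0, 0.5, (16, 8))
W2 = rng.normal(0.0, 0.5, (8, 1))

def mc_forward(x, p_drop=0.3):
    """One stochastic forward pass with dropout left ON at prediction time."""
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    h = h * (rng.random(h.shape) > p_drop)      # random dropout mask
    return 1.0 / (1.0 + np.exp(-(h @ W2)))      # sigmoid real-bogus score

x = rng.normal(size=(1, 16))                    # one candidate's inputs
scores = np.array([mc_forward(x)[0, 0] for _ in range(200)])

score = scores.mean()        # real-bogus probability estimate
confidence = scores.std()    # spread across passes -> model uncertainty
```

A candidate with a middling mean score but a large spread is exactly the kind of object worth routing to human vetters or an active-learning queue.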
The 'real' content of our training set is composed of minor planets, similar to Smith et al. (2020). Assuming the sky motion is large (but not so large that the source is trailed), these objects are typically detected in the science image but not the template image, which provides a clean subtraction residual resembling an explosive transient. Due to the large pixels of the GOTO detectors and short exposure times of each sub-image, very few asteroids move sufficiently quickly to trail. We estimate that sky motions of 1 arcsec per minute or greater will lead to trailing.

There are significant numbers of asteroids detectable down to the GOTO limiting magnitude. Using ephemerides from the astorb database (Moskovitz et al. 2019), based on observations reported to the Minor Planet Center, difference image detections can be robustly cross-matched to minor planets in the

https://github.com/GOTO-OBS/gotorb
[Figure 1: histogram of number counts versus GOTO 𝐿-band magnitude for minor planets and synthetic transients.]

Figure 1. Magnitude distribution of the minor planets (MP) used to build our training set. Bright-end number densities are dominated by the true magnitude distribution of the minor planets, where the faint-end density is constrained by the GOTO limiting magnitude. The magnitude distribution of synthetic transients (SYN) is a sub-sample of the minor planet magnitude distribution, except with a cut at 𝐿 ∼ 16 to avoid unrealistically bright objects.

field. This provides a significant pool of high-confidence, unique, and diverse difference image detections from which to build a clean training set.

We use the online SkyBoT cone search (Berthier et al. 2006, 2016) to retrieve the positions and magnitudes of all minor planets within the field of view of each GOTO image, then cross-match this table with all valid difference image detections using a 1 arcsec threshold value to identify the asteroids present in the image. The ephemerides provided are of sufficient quality that this is adequate to match even faint (𝐿 ∼ 20) asteroids. To avoid spurious cross-matches, only asteroids brighter than the 5-sigma limiting magnitude of the image are considered. An alternative offline cone search is made accessible via the pympc Python package (https://pypi.org/project/pympc/), which the code can fall back on if SkyBoT is unavailable. Using minor planets, the training set can reliably be extended to fainter magnitudes, where the performance of human vetters begins to significantly decrease. Figure 1 illustrates the magnitude distribution of minor planets used to construct the training set.

To create the bogus content of our training set, we randomly sample detections in the difference image following Brink et al. (2013). Bogus detections overwhelmingly dominate such random samples, though some real sources such as variable stars will be missed with this procedure, and we develop tools to identify them retrospectively after model training in Section 3.3.

To improve the classifier's resistance to specific challenging subtypes of data poorly represented in our algorithmically generated training set, we inject human-labelled detections into the dataset. More specifically, candidates from the GOTO Marshall (discussed in full in Lyman et al., in prep.) are included, which were misidentified by the classifier in the pipeline at the time as real and later labelled as bogus by human vetters. The previous classifier was a rapidly-deployed prototype CNN similar in design to that presented here, trained on a smaller dataset of minor planets and random bogus detections. These detections are included to allow the classifier to screen out artefacts missed by the prototype image processing pipeline, including satellite trails and highly wind-shaken PSFs. This artificially increases the diversity of the bogus component of the training set, as these edge-case detections would rarely be selected by naive random sampling and so be poorly represented within the model.
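The 1 arcsec cross-match against the ephemeris table can be sketched with a flat-sky small-angle separation (all coordinates below are invented for illustration; a production implementation might instead use astropy's `match_to_catalog_sky`):

```python
import numpy as np

ARCSEC = 1.0 / 3600.0   # degrees

# Difference-image detections (degrees) -- positions invented for illustration.
det_ra  = np.array([150.0001, 150.0500, 150.1000])
det_dec = np.array([  2.00005,  2.05000,  2.10000])

# Minor-planet ephemeris positions, as a SkyBoT-style cone search would return.
mp_ra  = np.array([150.0000, 150.1001])
mp_dec = np.array([  2.0000,   2.1001])

# Small-angle, flat-sky separation with the cos(dec) factor on RA.
cos_dec = np.cos(np.deg2rad(det_dec))[:, None]
sep = np.hypot((det_ra[:, None] - mp_ra) * cos_dec,
               det_dec[:, None] - mp_dec)

# A detection is labelled 'real' if any minor planet lies within 1 arcsec.
is_minor_planet = sep.min(axis=1) < 1.0 * ARCSEC
```

Here the first and third detections fall within the threshold of an ephemeris position and would be ingested as clean 'real' examples.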
Although these detections represent a small fraction of the overall training set (∼ …), […].

For each detection identified for inclusion in our training/validation/test sets, a series of stamps are cut out from the larger GOTO image, centred on the difference image residual. In common with previous CNN-based classifiers, we use small cutouts of the median-stacked science and template images, as well as the resultant difference image after image subtraction. The size of these stamps is an important model hyperparameter, which we explore in more detail in Section 3.1. An example of the model inputs for a synthetic source is illustrated in Figure 2.

An important addition to our network's inputs compared to previous work is a peak-to-peak (p2p) layer. This is included to characterise variability across the individual images that make up a median-stacked science image, and is calculated as the peak-to-peak (maximum value minus minimum value) variation of each pixel, computed across all individual images that composed the median stack. To ensure consistent alignment across all individual stamps and remove any jitter, we cut out the region based on the RA/Dec coordinates of the source detection in the median stack. This additional layer provides an effective discriminator for spurious transient events such as cosmic ray hits and satellite trails. If sufficiently bright, these are not removed by the simple median stacking in the current pipeline, due to the small number of sub-frames used. This is particularly problematic for cosmic ray hits, which are convolved with a Gaussian kernel for image subtraction and so appear PSF-like in the difference image. This can create convincing artifacts which are difficult to identify without access to the individual image level information.
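The p2p computation described above can be sketched in a few lines (a minimal sketch, assuming the aligned sub-frame cutouts are already stacked into a NumPy array; array names are illustrative):

```python
import numpy as np

def p2p_layer(subframes: np.ndarray) -> np.ndarray:
    """Per-pixel peak-to-peak (max - min) variation across the
    individual images that form a median-stacked science image.

    subframes : array of shape (n_subframes, ny, nx), cut out at the
    same RA/Dec so the stamps are aligned.
    """
    return np.ptp(subframes, axis=0)

# A steady source present in all sub-frames shows little p2p signal,
# while a cosmic ray hit in a single sub-frame stands out strongly.
stack = np.zeros((4, 5, 5))
stack[:, 2, 2] = 1.0   # steady source: present in every sub-frame
stack[0, 1, 1] = 50.0  # cosmic ray: single sub-frame only
p2p = p2p_layer(stack)
```

This is why the layer discriminates cosmic rays and satellite trails well: a transient persists across sub-frames and so has low p2p amplitude, whereas single-frame artifacts produce a strong p2p residual.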
In testing, this reduced the false positive rate on the test set: compared to an otherwise identical model trained without the p2p layer, there is a 2–3% decrease in false positive rate.

For all of the above steps, stamps extending beyond the edge of the detector have missing areas filled in with a small constant intensity level (10^−…), to distinguish them quantitatively from masked (i.e. saturated) pixels, which are assigned a value of zero in the difference image by the pipeline.
Figure 2. Example data format for a set of idealised synthetic images of a single Gaussian source newly appearing in the science image. We apply a naive convolution of the science image with the template PSF and vice versa in producing the difference image, for visualisation purposes. From left to right: science median, template median, difference image, pixel-wise peak-to-peak variation across contributing images to the science median. Cutouts are 55x55 pixels square, corresponding to a side length of 1.1 arcminutes.

The specific intensity level chosen for this offsetting is not important, and we choose our value to be well above machine precision (significant enough to influence the gradients) but well below the typical background level. To ensure that the classifier remains numerically stable in later training steps, each stack of stamps undergoes layer-wise L2 normalisation to reduce the input's magnitude: each stamp has the mean subtracted, and is then divided through by the L2 norm (√(x·x)).

Although asteroids provide a convenient source of PSF-like residuals to train on, it is important to note that they cannot fully replicate genuine transients. Asteroids are markedly simpler for a classifier to learn and discriminate, since they lack the complex background of a host galaxy. The main goal of this classifier is to detect extragalactic transients, so adapting the training set to maximise performance on these objects is important. An ideal approach would be to add a large number of genuine transients into the training set. However, GOTO has not been on-sky long enough to collect a suitably large set of these detections, and we only build the training set from the previous year of data. Even assuming every supernova over the past year is robustly detected in our data, this would still yield a number of transients significantly smaller than the target size of our training set.
This would create a severely imbalanced dataset, which could in principle be used, but with reduced classification performance. Using spectroscopically confirmed transients may also inject an element of observational bias into our training set, as events that have favourable properties for spectroscopy (in nearby galaxies, offset from their host, bright) are preferentially selected for follow-up (Bloom et al. 2012). Instead, we reserve a set of real, spectroscopically confirmed transients GOTO has detected (∼900 as of August 2020) for benchmarking purposes, as they represent a valuable insight into real-world performance and can be used to directly evaluate the effectiveness of any transient augmentation scheme we employ, as in Section 4.2.

PSF injection has been used heavily in prior work to generate synthetic detections for testing recovery rates and simulating the feasibility of observations. This process can be computationally intensive, involving construction of an effective PSF (ePSF) by combining multiple isolated sources, or fitting an approximating function (e.g. a Gaussian) to sources in the image. The ePSF model can then be scaled and injected into the image to simulate a new source. By injecting sources in close proximity to galaxies in individual images, then propagating these through the data reduction pipeline, synthetic transients could be generated in a realistic way. However, the fast optical design of GOTO makes this a complex task, as the PSF varies as a function of source position on the detector. Sources in the corners of an image display mild coma which, combined with wind-shake and other optical distortion, can lead to unusual PSFs that are not accurately reproduced by the mean PSF. In principle this could be accounted for by computing PSFs for sub-regions of a given image, or by assuming some spatially-varying kernel to fit for, but this would add sizeable overheads to the injection process and will always be an approximation.

Recent techniques such as generative adversarial networks (GANs, Goodfellow et al. 2014) have shown promise in generating novel training examples that can be used to address class imbalance and scarcity in training sets (Mariani et al. 2018), and have recently started to be applied to astrophysical problems (Yip et al. 2019). However, these networks are computationally expensive, complex to train and to interpret, and do not fully remove the need for large datasets.
A robust, human-interpretable method for generating synthetic examples is a better approach for the noisy, diverse datasets used in real-bogus classification. We propose a novel technique for synthesising realistic transients that can be used to significantly improve transient-specific performance compared to a pure minor planet training set, without requiring PSF injection or other CPU-intensive approaches. For each minor planet detected in an image, the GLADE galaxy catalogue (Dálya et al. 2018) is queried for nearby galaxies within a set angular distance of 10 arcminutes, chosen such that the PSFs of sources within this region are consistent. Pre-built indices are used via catsHTM (Soumagnac & Ofek 2018) to accelerate querying GLADE. The algorithm chooses the brightest galaxy (minimum B-band magnitude) within range, then generates a cutout stamp with a randomly chosen x, y offset relative to the galaxy centre. For the implementation within this work, the x, y pixel offsets are drawn from a uniform distribution U(−…, …), chosen to fully cover the range of offsets for nearby galaxies. Sources that are completely detached from any host galaxy are better represented by the minor planet component of the training set. This ensures that a diverse range of transient configurations (nuclear, offset, orphaned) are sampled. The minor planet and galaxy stamps are then directly summed to produce the synthetic transient. For the purposes of real-bogus classification, accurately matching the measured transient host-offset distribution is not crucial. The host offset distribution contains implicit and difficult-to-quantify biases resulting from the specific selection functions of the transient surveys that populate it – it does not accurately reflect the underlying distribution of astrophysical transients.
By choosing from a uniform distribution, we instead aim to attain consistent performance across a wide range of host offsets that overlap with the range inferred from the transient host offset distribution.

The original individual images for each component are retrieved to correctly compute the peak-to-peak variation of the combined stamp. Model inputs are pre-processed and undergo L2 normalisation (as discussed in Section 2.1) prior to training and inference, so additional background flux introduced by this method does not affect the model inputs. The noise characteristic of this combined stamp is not straightforward to compute, due to the highly correlated noise present in the difference image and varying intensity levels, and could be higher or lower depending on the specific stamps – with the straightforward Gaussian case providing a √… […] ∼constant, as the stamp scale is far smaller than the overall frame scale.
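The stamp-summing step can be sketched as follows (a minimal sketch, assuming cutouts are already extracted as NumPy arrays; the function and variable names are illustrative, not from the gotorb package):

```python
import numpy as np

rng = np.random.default_rng(42)

def make_synthetic_transient(mp_stamp, galaxy_image, gal_xy, max_offset=20):
    """Sum a minor-planet difference-image stamp onto a galaxy cutout.

    mp_stamp     : (s, s) stamp centred on a minor planet detection.
    galaxy_image : larger image containing the chosen GLADE galaxy.
    gal_xy       : (x, y) pixel position of the galaxy centre.
    max_offset   : half-width of the uniform offset distribution (pixels).
    """
    s = mp_stamp.shape[0]
    # Draw a uniform x, y offset so nuclear, offset and orphaned
    # configurations are all sampled.
    dx, dy = rng.integers(-max_offset, max_offset + 1, size=2)
    cx, cy = gal_xy[0] + dx, gal_xy[1] + dy
    half = s // 2
    gal_stamp = galaxy_image[cy - half:cy + half + 1,
                             cx - half:cx + half + 1]
    # Direct sum: preserves the local PSF and the difference-image noise.
    return gal_stamp + mp_stamp

# Toy example: flat background galaxy cutout plus a point-like residual.
galaxy = np.full((101, 101), 5.0)
mp = np.zeros((25, 25))
mp[12, 12] = 100.0
synth = make_synthetic_transient(mp, galaxy, gal_xy=(50, 50))
```

Because both components are real detections from the same region of the same image, the summed stamp inherits the local PSF and noise properties for free – the key advantage over PSF injection.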
Figure 3. Randomly selected sample of synthetic transients generated with our algorithm, displayed in the same format as in Figure 2. Significant variations in the PSF are visible due to sampling directly from the image, improving classifier resilience.

Naturally, this breaks down in the presence of nebulosity/galaxy light, but this represents an overwhelmingly small fraction of the sky. We also reject all minor planets with L < 16, as these are significantly brighter than the selected host galaxy, so are better represented by the pure minor planet candidates. This also cuts down significantly on saturated detections of dubious quality. This choice has no detrimental effect on bright-end performance, as discussed in Section 4. A random sample of synthetic transients generated with this approach is shown in Figure 3. Our method bears some similarity, in retrospect, to the approach of Cabrera-Vives et al. (2017), who added stamps from the science image into difference images to simulate detections in 'random' locations. Our approach uses confirmed difference image detections of MPs and puts them in more purposeful locations, whilst preserving the noise characteristics of the difference stamp.

This approach has strong advantages over simply injecting transients into galaxies. By selecting only galaxies close to each minor planet, the PSF is preserved and consistent, regardless of how distorted it may be. Injection-based methods require estimation or assumption of the image PSF, which is typically a parameterised function determined by fitting isolated sources. Given the variation in PSF across images and across individual unit telescopes, this would be a computationally intensive task, and would likely lead to poorer results compared to using minor planets. However, using only these synthetic transients introduces unintended behaviour in the trained model that significantly degrades classification performance if not remedied.
Metalabel              Train    Test
Minor planet           72992    8133
Synthetic transient    40192    4521
Random bogus          177556   19645
Galaxy residual        28040    3190
Marshall bogus         24577    2662
Total                 343357   38151

Table 1. Breakdown of the composition of our dataset, partitioned according to training and test sets. The validation dataset is not shown, but is composed of 10% of the training dataset, chosen randomly at training time.

Since every synthetic transient in the training set is associated with a host galaxy by design, the model will over time learn to associate all detections coincident with galaxies as real, as there is no loss penalty for doing so. To resolve this, we also inject galaxy residuals as bogus detections, randomly sampling from the remaining GLADE catalogue matches at a 1:1 transient:galaxy residual ratio. This way, the model learns that the salient features of these detections are not the galaxy, but the PSF-like detection embedded in them.
Using the techniques developed in the sections above, we build our training set with GOTO prototype data from 01-01-2019 to 01-01-2020. This ensures that our performance generalises well across a range of possible conditions – with PSF shape and limiting magnitude being the most important properties that benefit from this randomisation. A breakdown of training set proportions and properties is given in Table 1. Our code is fully parallelised at image level, meaning that a full training set of ∼… […]

As a starting point, we follow the braai classifier of Duev et al. (2019) in using a downsized version of the VGG16 CNN architecture of Simonyan & Zisserman (2014). This network architecture has proven to be very capable across a variety of machine learning tasks, and is relatively simple to implement and tweak. It uses conv-conv-pool blocks as the primary component – two convolutions are applied in sequence to extract both simple and compound features, then the resultant feature map is reduced in size by a factor of 2 by 'pooling', taking the maximum value of each 2x2 group of pixels. The architecture also uses small kernels (3x3) for performance. These structures are illustrated in Figure 4. We use the configuration as presented in Duev et al. (2019) for development, but later conduct a large-scale hyperparameter search to fine-tune the performance to our specific dataset (Section 3.1). The primary inputs to the classifier are small cut-outs of the science, template, difference, and p2p images, as discussed in Section 2.1, which we refer to as stamps.

The sample weights for real and bogus examples are adjusted to account for the class imbalance in our dataset, set to the reciprocal of the number of examples with each label.
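The per-class weighting described above amounts to the following (a minimal sketch; the counts come from the Table 1 metalabels, and the rescaling so the mean weight is 1 is an illustrative convention, not necessarily the exact one used):

```python
# Class weights: reciprocal of the number of examples per label,
# rescaled so the average weight over the training set is 1.
counts = {"real": 113184, "bogus": 230173}
total = sum(counts.values())

weights = {label: total / (len(counts) * n) for label, n in counts.items()}
# Each example of the rarer class contributes proportionally more
# to the loss, compensating for the class imbalance.
```

With this convention each class contributes equally to the total loss, regardless of how many examples it has.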
Figure 4. Block schematic of the optimal neural network architecture found by hyperparameter optimisation in Section 3.1: conv, conv, pool (2x2), conv, conv, pool (4x4), flatten, dense, with tensor dimensions (55,55,4) → (53,53,24) → (51,51,24) → (25,25,56) → (23,23,56) → (21,21,56) → (1400,) → (208,) → (1,). Each block represents a 3D image tensor, either as input to the network or the product of a convolution operation generating an 'activation map'. Classification is performed using the scalar output of the neural network. Not illustrated for clarity are the dropout masks applied between each layer, and the activation layers. Base figure produced with nnsvg (LeNail 2019).

Class weights are not adjusted on a per-batch basis, as our training set is only mildly imbalanced. For regularisation, we apply a penalty to the loss based on the L2 norm of each weight matrix. This penalises exploding gradients and promotes stability in the training phase. L1 regularisation was trialled but did not produce significantly better results. We also use spatial dropout (Tompson et al. 2015) between all convolutions, which provides some regularisation but is primarily used for the purposes of uncertainty estimation (see Section 3.3) – a small dropout probability of ∼… is used.
A full training run of ∼170 epochs takes around 10 hours. Inference is significantly quicker, with an average throughput of 7,500 candidates per second with no model ensembling performed. Our model training code is freely available via the gotorb Python package (https://github.com/GOTO-OBS/gotorb), which includes the full range of tunable parameters and model optimisations we implement.

To achieve the maximum performance possible with a given neural network, we conduct a search over the model hyperparameters to assess which combinations lead to the best classification accuracy and model throughput. Initially, the ROC-AUC score (Fawcett 2006) was used as the metric to optimise, as in many cases this is a more indicative performance metric than others; however, this did not translate directly into improvements in classification performance. We conjecture this may be due to the score-invariant nature of the ROC-AUC statistic – it only captures the probability that a randomly selected real example will rank higher than a randomly selected bogus example, which is independent of the specific real-bogus threshold chosen. We instead opt to use the accuracy score, as this directly maps to the quantity we want to maximise in our model.

Data-based hyperparameters (training set composition, stamp size, data augmentation) are optimised iteratively by hand due to computational constraints. An approximate real-bogus ratio between 1:2 and 1:3 was found to be optimal, with greater values giving better bogus performance at the cost of recovery of real detections – we opt for 1:2 in the final dataset. The overall dataset size was found to be the biggest determinant of classification accuracy, with larger datasets showing improved performance – although this increase was subject to diminishing returns with larger and larger datasets. We chose a training set of O(4 × 10^5) examples, as this was roughly the largest dataset we could fit into RAM on the training nodes – naturally this could be increased further by reading data from disk on demand, but given that CPUs were used for training, there was a need to minimise input pipeline latencies as much as possible to compensate.
Model performance was found to be relatively insensitive to the ratio of synthetic transients to minor planets, as long as there were at least 10,000 of both in the training set.
Figure 5. Classifier performance (ROC-AUC score) on the test set of a 330,000-example training set, as a function of input stamp size. Each point is the average of 3 independent training runs on the same input training set, with the shaded region representing the 1σ confidence interval.

We use the Hyperband algorithm (Li et al. 2017), as implemented in the Keras-Tuner package (O'Malley et al. 2019). This algorithm implements a random search with intelligent allocation of computational resources, partially training brackets of candidate models and only selecting the best fraction of each bracket to continue training. In testing, this consistently outperformed both naive random search and Bayesian optimisation in terms of final performance. Table 2 illustrates the region of (hyper)parameter space we choose to conduct our search over. The upper limits for the neuron/filter parameters are set purely by computational constraints – networks above this threshold take too long to evaluate and train, and so are excluded. We also set an upper limit of 500,000 on the number of model parameters.
Continuous
Hyperparameter          Min      Max      Prior    Selected
Block 1 filters (N1)    8        32       linear   24
Block 2 filters (N2)    N1       64       linear   56
N_fc                    64       512      linear   208
Dropout rate            10^−…    …        …        ….×10^−…
Learning rate           10^−…    10^−…    log      6×10^−…
Regulariser penalty     10^−…    10^−…    log      2.…×10^−…

Discrete
Hyperparameter          Choices                    Selected
Kernel initialiser      He, Glorot                 Glorot
Kernel regulariser      L1, L2                     L2
Activation function     ReLU, LeakyReLU, ELU       LeakyReLU
Table 2. Hyperparameter space over which the optimisation search was conducted, split by numerical and categorical variables. The final adopted values are given in the rightmost column.

This avoids overly complex models and promotes small but efficient architectures. Based on initial experimentation, we require that the number of convolutional filters in the second block be greater than or equal to the number in the first block. This ensures that the largest (and most computationally expensive) convolution operations are performed on tensors that have been max-pooled and are thus smaller, reducing execution time. To maximise performance across all possible deployment architectures, the numbers of convolutional filters and fully-connected layer neurons are constrained to be multiples of 8. This is one of the requirements for fully leveraging optimised GPU libraries (such as cuDNN, Chetlur et al. 2014), and also enables use of specialised hardware accelerators such as tensor cores in the future. Conveniently, this discretisation also makes the hyperparameter space more tractable to explore.

This search took around 1 month to complete on a single 32-core compute node, and sampled 828 unique parameter configurations. The three top-scoring models were then retrained from random initialisation through to early stopping to validate their performance, and to confirm that the hyperparameter combinations led to stable and consistent results. The top three scoring models achieved accuracies on the hyperparameter validation set of 98.88, 98.64 and 98.54% respectively. Some of the candidate models had to be pruned from the list due to excessive overfitting. The best model was then selected based on the minimum test set error. Our final model achieved a test set class-balanced accuracy of 98.… ± 0.02% (F1 score 0.… ± 0.…).

Uncertainty estimation in neural networks is an open problem, but is of critical importance for a range of applications. Traditional deterministic neural networks output a single score per class, between 0 and 1. This single value would be sufficient to provide a measure of confidence, if properly calibrated. However, neural networks are often regarded as providing over-confident predictions in general.
Worse, they can provide misidentifications at high confidence. Giving neural networks the ability to make nuanced predictions, and to account for their own uncertainty in decision making, is a potentially powerful improvement that we discuss in more detail over the next sections.

It is important to be specific, and to distinguish between epistemic (systematic) and aleatoric (random) uncertainty for the purposes of our classification problem (Kendall & Gal 2017). Aleatoric uncertainty is captured by the classifier's score value, and originates from noise in the input data. More crucial for our application is quantifying the epistemic uncertainty – that is, the uncertainty in our choice of the neural network's model weights. This epistemic source of error is directly quantifiable through Bayesian neural networks, and in later sections this is the error, confidence, or uncertainty we refer to and attempt to quantify. In the Bayesian framework, this can be achieved by casting model parameters as probability distributions, and using the mechanics of Bayesian statistics to marginalise the neural network output over these distributions, in the process finding the score posterior. In this way, the uncertainty inherent in model selection can be quantified. There are various approximate and exact approaches to achieve this, which we outline below.

Dropout (Srivastava et al. 2014) provides a useful form of regularisation in neural networks. At each training step, a fraction p (a tunable hyperparameter) of neuron weights are randomly set to zero, decreasing the effective number of parameters of the model. In this way, overfitting can be prevented and generalisation accuracy can be increased. In traditional neural networks, dropout is not active at inference time, so that all neurons are used for making predictions.
However, Gal & Ghahramani (2015a) demonstrate the profound result that training and evaluating neural networks with dropout is equivalent to performing the approximate Bayesian inference discussed above, with multiple evaluations being equivalent to Monte Carlo integration of the posterior distribution. This is directly applicable to convolutional neural networks via the Monte Carlo dropout technique (Gal & Ghahramani 2015b; referred to as MCDropout for brevity from now on).

Alternative approaches to uncertainty estimation exist (e.g. Bayes by Backprop, Blundell et al. 2015), which instead perform the approximate Bayesian inference directly by casting neuron weights as distributions with associated hyperparameters, then updating these according to the backpropagated gradients (as in deterministic NNs). In this work, we opt to use MCDropout for computational efficiency and for maximal compatibility with existing network architectures and software. No changes to the training loop are required, and only a simple wrapper is needed at inference to perform multiple predictions with dropout enabled. The only significant additional computational cost of a Bayesian neural network using the MCDropout technique over a deterministic CNN is at inference time, as multiple samples need to be drawn to approximate the posterior. This performance overhead can be mitigated with suitable batching of the dataset.

The ability of neural networks to learn complex, non-linear representations in high-dimensional vector spaces is well known and utilised throughout machine learning. However, estimation of the uncertainty of the products of neural networks is often a barrier to their implementation in scientific applications, where well-grounded determination of errors is important.
MCDropout provides a principled way to introduce this. Although a comparatively new technique, Bayesian neural networks show emerging promise across a variety of astronomical classification and regression tasks – including supernova light curve classification (Möller & de Boissière 2020), efficient learning of galaxy morphology (Walmsley et al. 2020), and age estimation of stars for galactic archaeology (Ciucă et al. 2020).

There is disagreement in the literature on the precise nature of a Bayesian neural network and how to implement it 'properly', from approximate variational inference as used here, to applying some variant of Markov Chain Monte Carlo sampling over the weight and bias parameters of the neural network. However, what is relevant for the implementation in this work is that examples the classifier is unconfident about are assigned lower confidence scores than obviously real/bogus detections. More complex tests, such as confirming that the classifier's confidence matches the actual confidence of the dataset or some human-derived uncertainty score, are beyond the scope of the introductory work presented here.

Whilst these posterior predictions are informative to human vetters, converting them to a single informative summary parameter that captures the overall uncertainty is more useful for integration into pipelines and enabling coarse filtering of candidates. To convert the posterior distributions into meaningful information about the confidence of a given prediction, we utilise the information entropy H. For a binary classification problem, the generic entropy formula can be reduced to

H(p) = −p log p − (1 − p) log(1 − p)

where p is the probability of a given detection being real (the real-bogus score). The entropy is maximised for p = 0.5, where the probability of being real vs. bogus is equal, i.e. the classifier prediction carries no useful information. We define the classifier confidence C in terms of the average entropy of the posterior distribution samples, scaling to confidences in the range [0, 1] with the relation

C = 1 − (1/N) Σ_{i=1}^{N} H_i

where N is the number of posterior samples and H_i is the binary entropy of the i-th posterior sample. This metric is equivalent to the second term of the BALD acquisition function of Houlsby et al. (2011), and is chosen as it is pre-normalised to [0, 1], unlike the standard deviation or similar metrics. Naturally, the uncertainties we derive here are correlated with the actual output score, but the multiple samples provide sufficient dispersion that this metric is useful for assessing model confidence. In future implementations, these raw posterior samples (or some approximating distribution parameters, to reduce data needs) could be fed directly into downstream, more specialised classification tools, enabling them to make use of the real-bogus classifier's probabilistic predictions in their own score/posterior.

One immediate advantage of Bayesian neural networks over deterministic neural networks is the ability to improve classification performance through model ensembling. Figure 6 illustrates the gain in accuracy observed by averaging the predictions of our BNN, as a function of the number of posterior samples. Although small, this is a definite improvement over single-evaluation predictions, and is likely constrained by our dataset. For the majority of positive and negative examples, the model is highly confident about the assigned RB score, so averaging over the posteriors does not improve them significantly. This increase in performance is likely to be greater on more complex (multi-class) classification problems, or in scenarios where significantly less training data is available.
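The confidence metric defined above can be sketched in a few lines (assuming base-2 logarithms, so that H, and hence C, lies in [0, 1]; the posterior samples here are illustrative):

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def confidence(posterior_samples):
    """C = 1 - (1/N) sum_i H_i over MCDropout posterior samples."""
    return 1.0 - np.mean(binary_entropy(np.asarray(posterior_samples)))

# Tightly peaked posterior near 1 -> confident 'real' prediction;
# samples scattered around 0.5 -> the prediction carries little information.
confident = confidence([0.98, 0.99, 0.97, 0.99])
uncertain = confidence([0.45, 0.55, 0.50, 0.60])
```

Because each posterior sample is scored separately before averaging, a classifier whose dropout realisations disagree is penalised even if its mean score is extreme.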
Figure 6.
Classification accuracy on the test set from Section 2.3 as a function of the number of posterior samples averaged. Each point is the average of 10 model runs, with the shaded area corresponding to the 2σ confidence interval. The BCNN quickly recovers the performance of a deterministic CNN within statistical uncertainty (∼99% accuracy, ±0.03%; F1: 0.9877) and provides additional information in the form of confidence. No significant improvement in classification accuracy is obtained beyond 10 samples, remaining consistent out to 50 samples.

any downstream candidate evaluation tools, providing an additional metric to inform decisions. Objects with both a high score and high confidence are very likely to be genuine, and so can be prioritised in human vetting of candidates. This means more time can be spent looking at marginal candidates, while obvious detections can be identified quickly. Confidence provides a metric complementary to the pure real-bogus score, which can help alleviate some of the issues with the poor dynamic range observed in the classifier outputs at low/high scores. Classification is still performed on the consensus real-bogus score derived from the posterior, with the confidence score intended to aid human decision-making. In Figure 7 we illustrate some example candidates, their associated real-bogus scores, and the score posteriors.

Classifier confidence is also a useful tool for the training and development process, providing deeper insight into the functioning of the classifier and the associated training set. Predictive uncertainty provides a useful heuristic for cleaning datasets of mislabelled data. Misclassified detections for which the classifier returns a high confidence are very likely to be mislabelled, as the confidence score partly reflects having seen large numbers of similar detections in the training set. These frames can be actively prioritised in any human relabelling efforts, or fixed cuts on the confidence can be used to perform this in a semi-automated way.
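As a minimal sketch (not the authors' pipeline code; function names and array shapes are illustrative), the entropy-based confidence metric described above can be computed from a set of MC Dropout posterior samples as follows. Base-2 logarithms are assumed, so that H, and hence C, is normalised to [0, 1]:

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Binary information entropy, base 2 so H(p) lies in [0, 1]."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def confidence(posterior_samples):
    """Classifier confidence C = 1 - (1/N) * sum_i H_i.

    posterior_samples: shape (N,) array of real-bogus scores for a single
    detection, each from an independent stochastic forward pass.
    """
    return 1.0 - binary_entropy(np.asarray(posterior_samples)).mean()

# A confident prediction: all samples near 1, so C is close to 1.
c_conf = confidence([0.99, 0.98, 0.995, 0.99])
# An uninformative prediction: samples near 0.5, so C is close to 0.
c_unsure = confidence([0.5, 0.48, 0.52, 0.5])
```

The classification itself would still use the mean of the posterior samples as the consensus real-bogus score; C is a separate summary of how peaked the individual samples are.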
This 'optimal relabelling' scheme provides a method for human vetters and machine learning models to collaboratively and iteratively refine noisy labels. Our label noise arises because humans are imperfect judges of real/bogus and interpret the vetting rubric in different ways, leading to inconsistencies that can harm model performance.

We demonstrate the effectiveness of this procedure on the training set built in this work by training the model first on the uncleaned dataset, then attempting to relabel the misclassified detections in the training and test sets, ordered by decreasing confidence. This amounts to a substantial task of 3580 stamps, which would take a prohibitively long time to relabel by hand, notwithstanding the possibility of human bias in the relabelling. We instead propose a heuristic relabelling scheme based on the BALD score
Figure 7.
A selection of example posteriors, taken from real GOTO data. The majority of predictions are highly confident, so we select examples of increasing confidence score (C) to display here. Plotted is a Gaussian kernel-density estimate constructed from 500 posterior samples. The green line indicates the correct label for each candidate, with the black line indicating the mean of the distribution. The dashed line indicates P_real = 0.5.

of Houlsby et al. (2011), which leverages the simple nature of binary classification. The model is first trained on the 'unclean' dataset generated with the approaches in Section 2.3; the BALD score is then evaluated over the misidentifications in the test and training sets. From here, a new set of labels is derived by flipping the labels of those examples that have a BALD score less than (and thus confidence higher than) the median, effectively accepting the prediction of the classifier over the human vetter. This approach is naturally capable of incorrectly flipping the labels of accurately labelled stamps, but imposing this cut in classifier confidence ensures that the majority of relabelled stamps each round correspond to regions of classifier parameter space that are well covered by the training set, and so are classified at high confidence. This method effectively trades active human labelling time for passive background computational time, and can be applied iteratively, as suggested above, to progressively improve the quality of the dataset labelling. We manually checked a subset of the sources selected to be relabelled to verify these were sensible, and indeed found they were mislabelled
detections that had leaked through the quality cuts we applied. After one round of the heuristic relabelling routine outlined above, the class-balanced accuracy achieved on the classifier test set improved markedly, from ∼98% (±0.02%) to ∼99% (±0.01%), with a corresponding improvement in F1 score.

Machine learning algorithms acquire inherent and often subtle biases based on the training set used in their construction. Given the automated nature of our dataset generation, it is particularly important to verify that performance is consistent across a range of parameters of interest, such as transient magnitude. Some care is required in choosing the test set for evaluating classifier performance in a real-world setting, as the training set has been augmented with both human-labelled data and fully synthetic data. Although a low FPR/FNR on the validation and test data is encouraging, since this data is artificially made more difficult for the classifier to learn, it is not directly representative of the performance we should expect in deployment, as a non-negligible component of it is synthetic. Performance characterisation should therefore be reinforced with extensive testing on representative samples of GOTO data. A particular focus is to confirm that the synthetic augmentation scheme we implement leads to genuine improvements in the classifier's recovery rate of real transient detections. We also emphasise that in the following sections we effectively test the performance of the real-bogus classifier in isolation: the 'real-world' detection efficiency is the product of the efficiencies of multiple pipeline stages, most crucially image subtraction and source extraction. Exploring the impact of these steps is beyond the scope of this paper, and is left to future work.

In the following sections, we use 'accuracy' to refer to the class-balanced accuracy, as it is more appropriate for our mildly imbalanced classification task. We also quote results based on the mean scores of 10 posterior samples (motivated by the saturation observed in Figure 6), since individual evaluations of a Bayesian neural network using MC Dropout are based on weaker classifiers due to the presence of dropout. Typical uncertainties (estimated as the standard deviation) on the metrics below are small.
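As a hedged sketch of the median-confidence label-flipping pass described in the relabelling scheme above (not the authors' code; names, dict layout, and the toy data are invented for illustration):

```python
import numpy as np

def relabel_by_confidence(labels, preds, confidences):
    """One round of heuristic relabelling.

    labels:      (M,) human labels in {0, 1}
    preds:       (M,) classifier real-bogus scores in [0, 1]
    confidences: (M,) per-example confidence C in [0, 1]

    Among the misclassified examples, flips the labels of those whose
    confidence exceeds the median confidence of the misclassified set,
    i.e. accepts the classifier's prediction over the human vetter.
    """
    labels = np.asarray(labels).copy()
    hard = (np.asarray(preds) >= 0.5).astype(int)
    wrong = hard != labels                 # misidentifications only
    conf = np.asarray(confidences)
    cut = np.median(conf[wrong])           # median over the misclassified set
    flip = wrong & (conf > cut)            # high-confidence disagreements
    labels[flip] = hard[flip]
    return labels, flip

# Toy example: indices 0, 3 and 4 disagree with the human label; only
# index 0 is a high-confidence disagreement, so only it is flipped.
labels = [0, 1, 0, 1, 0, 1]
preds = [0.9, 0.95, 0.2, 0.1, 0.6, 0.8]
confs = [0.99, 0.9, 0.9, 0.3, 0.5, 0.8]
new_labels, flipped = relabel_by_confidence(labels, preds, confs)
```

Applied iteratively (retrain, re-score, relabel), this trades active human labelling time for passive computation, as described in the text.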
To provide a more granular view of the classifier performance, we further split the test set into two groups for the purposes of evaluation. The first comprises only the minor planet and random bogus detections. We also test a synthetic transient/galaxy residual test set, to verify that the classifier can genuinely discriminate between galaxies and galaxies hosting transients. This also reveals any strong performance differences between the two main positive classes, which could skew metrics evaluated on the whole dataset. For both test sets, the human-inspected Marshall data is deliberately excluded, since it is significantly more challenging for the classifier than normal detections and does not accurately reflect the true data distribution.

The best-scoring classifier after hyperparameter optimisation shows excellent performance, attaining balanced accuracies of 99.49% (F1: 0.9935) and 99.
19% (F1: 0.9925) on the minor planet and synthetic transient test datasets respectively. Figure 9 illustrates the false positive and false negative rates for the classifier on both the minor planet and transient datasets, as a function of the real-bogus threshold chosen. There is a clear difference in false negative rate between the minor planet and transient datasets, reflecting the increased difficulty associated with the complex host morphology of the transient examples. The classifier displays a notable skew in the FPR/FNR equality point towards lower values. This is a result of the Marshall injections in the training set, which are harder to learn than the random bogus detections due to being misclassified by the previous classifier. This does not affect classification accuracy, and could be corrected by applying a power transform to the classifier output if required, conditioned on the validation set.

Figure 8. Example class-clustering (left) and confidence (right) maps generated from the classifier's test set. Each colour in the left panel represents a specific sub-class of detections (glxresid, marshall, mp, randjunk, syntransient), while colour on the right represents classifier confidence. The top legend gives the classes corresponding to each colour in the left panel. Regions of low confidence in the right panel tend to correspond to cluster boundaries in the left, where there is more uncertainty about which class each example belongs to.

Figure 9. False positive/negative rate evaluated on the test set, excluding Marshall examples. Performance metrics are split between minor planets and synthetic transients. The grey dashed line (MMCE) represents the full-dataset mean misclassification error, which is below 1% between real-bogus scores of 0.1–0.6. Inset: confusion matrix, evaluated on the full test set. There is a slight difference in the false negative rates achieved between the minor planets and synthetic transients, reflecting the increased difficulty posed by complex host morphology and subtraction residuals.

Given the spatially variable optical characteristics present in the GOTO prototype, it is important to confirm that our classifier provides good performance across the full detector, and not simply in the centre where distortion is minimal. In Figure 10 we plot the class-balanced accuracy score as a function of radial position on the detector, using a series of radial bins chosen to equalise source density. These radial bins are scaled by the maximum value (corresponding to the image corner) to provide a scale-free measurement of detector position. Class-balanced accuracy is used here because the real-bogus fraction varies as a function of detector position, and care must be taken to account for this. We find a consistent performance of ∼99% out to a fractional radial distance of 0.7, with a slight drop of 1% at the far edge of the image. This is primarily due to the severe distortion found in the image corners of the GOTO prototype optical tubes, which produces very challenging detections (abnormal PSFs, strong vignetting) for both source extraction and real-bogus classification. Some contribution to this performance decrease likely comes from good-quality sources close to the edge of the image, or close to the edge of the science-template overlap; reliably estimating these sources and their contribution to the numbers in each bin is a complex task. Suffering only a 1% decrease in performance in these extremely challenging conditions demonstrates the overall robustness of the classifier. With the significantly improved optical quality of the GOTO design-specification OTAs, we anticipate that future versions of our classifier trained on data from the upgraded system will display a constant (within statistical error) classification accuracy as a function of detector position.
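A hedged sketch of the radial-binning evaluation described above, using quantile edges so each concentric bin contains equal numbers of sources, and class-balanced accuracy within each bin. Function and variable names are illustrative, not the pipeline's:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Class-balanced accuracy: the mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
    return float(np.mean(recalls))

def radial_accuracy(x, y, y_true, y_pred, n_bins=5, centre=(0.0, 0.0)):
    """Balanced accuracy in concentric radial bins of equal source counts.

    Radii are scaled by the maximum radius (the image corner), giving a
    fractional detector position in [0, 1].
    """
    r = np.hypot(np.asarray(x) - centre[0], np.asarray(y) - centre[1])
    r = r / r.max()
    # Quantile edges place equal numbers of sources in each bin.
    edges = np.quantile(r, np.linspace(0.0, 1.0, n_bins + 1))
    accs, mids = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (r >= lo) & (r <= hi)
        accs.append(balanced_accuracy(np.asarray(y_true)[mask],
                                      np.asarray(y_pred)[mask]))
        mids.append(0.5 * (lo + hi))
    return np.array(mids), np.array(accs)
```

Balanced accuracy is used per bin for the reason given in the text: the real/bogus mixture changes with detector position, so raw accuracy would conflate class balance with performance.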
To provide the most accurate assessment of transient-specific classifier performance, and to further confirm that our algorithmically-generated training set generalises well, we assemble a test set of genuine astrophysical transients.

Figure 10. Class-balanced accuracy evaluated on the test set as a function of detector position. We use a series of concentric radial bins, chosen to contain equal numbers of sources (3441 per bin) for uniform statistics. We scale the radius by the detector size to give a relative picture of performance. The drop in performance at large radial distances is primarily caused by the extreme optical distortion present in the early GOTO prototype; that only a minor drop of 1% in accuracy occurs in these challenging conditions demonstrates the very robust performance of our classifier. With the design-specification GOTO optics, we anticipate this curve will be level within error.

This test set was found by cross-matching a list of all spectroscopically confirmed supernovae reported to the Transient Name Server (TNS) since January 2019 with the GOTO master candidate table. Those with an associated GOTO candidate within 3 arcsec, with TNS discovery magnitude greater than the GOTO source magnitude, and only found in GOTO data after the formal TNS discovery date are accepted. With these cuts, purity is favoured over completeness, a deliberate choice to ensure that the test set is as clean of false positives as possible. This yields 877 known transients recovered in the GOTO prototype data. The whole-sample recovery rate is ∼97%. Figure 11 (top panel) shows the recovery rate as a function of L-band discovery magnitude: the classifier maintains excellent performance across the full magnitude range of detections accessible to GOTO, even towards fainter magnitudes. Our galaxy augmentation scheme provides up to a 30% improvement in recovery rate at the faint end.
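As a hedged illustration of the purity-oriented TNS cross-match cuts described above (not the pipeline's code; the dict layout, field names, and helper functions are invented, while the 3 arcsec, magnitude, and date criteria are those stated in the text):

```python
import numpy as np

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (arcsec) between sky positions in degrees,
    via the haversine formula (ample precision at arcsecond scales)."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    d = 2.0 * np.arcsin(np.sqrt(
        np.sin((dec2 - dec1) / 2.0) ** 2
        + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2.0) ** 2))
    return np.degrees(d) * 3600.0

def accept_candidate(cand, tns):
    """Purity-first cuts: positional match within 3 arcsec, TNS discovery
    magnitude greater (fainter) than the GOTO source magnitude, and the
    GOTO detection occurring only after the formal TNS discovery date."""
    sep = angular_sep_arcsec(cand["ra"], cand["dec"], tns["ra"], tns["dec"])
    return (sep < 3.0
            and tns["disc_mag"] > cand["mag"]
            and cand["jd"] > tns["disc_jd"])
```

In practice a library cross-match (e.g. catalogue matching in astropy) would replace the hand-rolled separation, but the cut logic is the same.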
Figure 11.
Top panel: recovery rate (TPR) as a function of GOTO discovery magnitude, at a fixed real-bogus threshold of 0.5. The dashed line indicates the performance of a classifier with a similarly sized training set, but with only minor planet detections. Error bars are derived directly from the classifier score posteriors. The number of detections per bin is written below each bar; note the sharp drop-off in the number of detections at the faint end. Bottom panel: recovery rate of transients that can be reliably associated with a host galaxy (as cross-matched with WISExSuperCosmos; Bilicki et al. 2016) as a function of host offset. As above, error bars are derived from the classifier score posteriors, and a similarly-sized minor planet-based classifier is plotted for comparison. There is a marked improvement in the recovery rate for very small host offsets, particularly for nuclear transients.

The bottom panel of Figure 11 shows a marked improvement in sensitivity to nuclear transients, considered to be a more difficult transient morphology to detect. Motivated by the typical RMS astrometric noise level of GOTO images, we adopt a fixed threshold of 0.5 arcsec to distinguish between nuclear and offset transients. We find a 13% improvement in recovery rate for nuclear transients.

Although the main transient sources of interest for GOTO will overwhelmingly be fainter than the saturation level, the classifier also performs well at the bright end: for transients with L ≲ 10, 100% are recovered, although small-number statistics limits the usefulness of this metric. This bright-end testing demonstrates the excellent dynamic range of the classifier, showing high (>90%) recovery rates from 10th to 20th magnitude.

Through the host-offset distribution choice we make, we expect to generate a reasonable number of synthetic transients at zero offset, so this region of parameter space should not be empty in the training set. To test performance in this regime we repeated the procedure outlined in Section 2.2, but with the host-offsetting routine disabled, generating synthetic detections overlapping the galaxy nucleus only. This produced 5,100 synthetic nuclear transients, with a magnitude distribution consistent with that in Figure 1. Testing our model against this dataset (with the negative examples being galaxy residuals, as in Section 2.3), we obtain 97.5% accuracy with a high recovery rate (TPR), supporting the use of the RB score as a proxy for P_real (the probability a given source is real) in such implementations.

One significant benefit of using a Bayesian neural network is a built-in indicator of out-of-distribution data, that is, data poorly represented by or unseen in the training set. For input data that is completely different from the training set, the classifier will return a low confidence score, which can then be used to remove or deprioritise the candidate in downstream applications. This confidence can also be used to optimise candidate vetting efforts, with the highest-confidence candidates being a natural choice to prioritise over lower-confidence, lower-quality detections.

Figure 12. Top panel: classifier calibration curve, illustrating how well the classifier's output score corresponds to probability. The mean of 20 samples and the 1σ confidence interval are plotted to show that individual draws from the posterior remain well-calibrated. Bottom panel: score distribution for both real and bogus examples, with the relative scarcity of examples at intermediate RB scores evident.

In principle, the task-specific knowledge encoded in our trained network weights can be used to accelerate the training of similar real-bogus classifiers through transfer learning, and in principle increase generalisation (Yosinski et al. 2014). This requires that the same input data structure is used and that no model hyperparameters are changed. However, we caution that training in this way is susceptible to local minima and does not offer the opportunity to change the model hyperparameters that training from scratch does; in Section 3.1 we demonstrated the sizeable performance improvements that a full hyperparameter search can yield, and so encourage this.

The techniques and framework we implement in this paper are naturally extensible to more challenging astronomical classification tasks, such as those outlined at the end of Section 1.1. A key focus is more fine-grained classification: being able to distinguish variable stars, supernovae, nuclear transients, and other astrophysical objects of interest in an automated (and, crucially, accurate) way. Figure 8 already hints at this being a fruitful approach, as we see evidence of morphological differentiation within both the positive and negative classes through the emergence of smaller sub-clusters. Similarly, leveraging the wealth of contextual information available from astrophysical surveys in a principled, informative, and efficient way within the framework of deep learning poses an open challenge, with potentially significant gains possible.
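A minimal sketch of the reliability-curve computation behind a calibration plot like Figure 12's top panel: bin the output scores, and compare each bin's mean predicted score to the empirical fraction of real examples it contains. Names are illustrative, and this is not the authors' plotting code:

```python
import numpy as np

def calibration_curve(scores, labels, n_bins=10):
    """Reliability curve: for each score bin, the mean predicted score and
    the empirical fraction of positives (P_real) in that bin. A
    well-calibrated classifier lies close to the diagonal."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    mean_score, frac_pos = [], []
    for b in range(n_bins):
        m = idx == b
        if m.any():  # skip empty bins rather than emit NaNs
            mean_score.append(scores[m].mean())
            frac_pos.append(labels[m].mean())
    return np.array(mean_score), np.array(frac_pos)
```

Averaging this curve over repeated posterior draws, as in the figure, would show whether individual stochastic forward passes remain calibrated, not just the ensemble mean.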
We aim to address these challenges, among others, in the development of future generations of the classifier we implement here.

We demonstrate a data-driven approach to generating large, low-contamination training sets, which, along with our novel augmentation scheme, can be used to train high-performance, transient-optimised real-bogus classifiers. By combining real PSFs from minor planets with galaxies, we generate realistic synthetic transients that provide a measurable improvement in the recovery of genuine astrophysical transients. This technique is computationally lightweight, easily implemented, and directly applicable to a variety of both current and future transient survey streams and datasets.

We also demonstrate the efficacy of Bayesian neural networks for the first time in real-bogus classification, and the unique insights that confidence estimation can bring to the real-bogus problem. Being able to assign epistemic confidences to classifier predictions, in addition to the more typical real-bogus score, provides another parameter that human vetters further downstream can use to identify promising candidate detections; this can potentially be used in future to further automate decision-making in the context of follow-up and reporting. Techniques such as this, which minimise human involvement in data gathering and labelling, will become increasingly important in the new 'big-data' era of astronomy that large-scale projects such as the Rubin Observatory and the SKA will bring about.

Our classifier demonstrates excellent performance across a wide magnitude range, with a missed detection rate of 0.5% at a fixed 1% false positive rate, and up to a 30% improvement in the recovery rate of astrophysical transients at the challenging faint end. This has the potential to markedly increase the number of faint transients GOTO can discover, and significantly improves the prospects for detecting the kilonova afterglows of the gravitational-wave driven mergers GOTO was designed to find. We anticipate that improvements to the quality and stability of GOTO's hardware and dataflow will bring significant performance gains for the real-bogus classifier presented here.

GOTO is due to undergo significant expansion over the coming years, with a final configuration of 4 installations spread across a northern (La Palma) and southern (Siding Spring) site, providing a high-cadence datastream covering almost the whole sky down to 20th magnitude every 2–3 days.
The tools developed in this work have generated a classifier that is capable of handling and sifting the accompanying volume of candidate transient detections with robust accuracy and high sensitivity.
ACKNOWLEDGEMENTS
We thank the anonymous referee for their insightful comments, which helped improve the quality of this manuscript. The Gravitational-wave Optical Transient Observer (GOTO) project acknowledges the support of the Monash-Warwick Alliance; Warwick University; Monash University; Sheffield University; the University of Leicester; Armagh Observatory & Planetarium; the National Astronomical Research Institute of Thailand (NARIT); the University of Turku; the University of Manchester; the University of Portsmouth; the Instituto de Astrofísica de Canarias (IAC) and the Science and Technology Facilities Council (STFC). DS, KU, BG and JDL acknowledge support from the STFC via grants ST/T007184/1, ST/T003103/1 and ST/P000495/1. JDL acknowledges support from a UK Research and Innovation Fellowship (MR/T020784/1). RPB, MRK and DMS acknowledge support from the ERC under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 715051; Spiders). POB and RS acknowledge support from the STFC.

This research made use of Astropy, a community-developed core Python package for Astronomy (Astropy Collaboration et al. 2013; Price-Whelan et al. 2018), and scikit-learn (Pedregosa et al. 2011). Data in astorb.dat were originally provided by NASA grant NAG5-4741 (PI E. Bowell) and the Lowell Observatory endowment, and more recently by NASA PDART grant NNX16AG52G (PI N. Moskovitz). This research has made use of IMCCE's SkyBoT VO tool, and of data and/or services provided by the International Astronomical Union's Minor Planet Center.

DATA AVAILABILITY
The gotorb code is made freely available at https://github.com/GOTO-OBS/gotorb, along with validation examples for testing. Accompanying observational data used in this work will be made available via upcoming GOTO public data releases.
REFERENCES
Aartsen M. G., et al., 2017, Journal of Instrumentation, 12, P03012
Abadi M., et al., 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems
Abbott B. P., et al., 2017a, Phys. Rev. Lett., 119, 161101
Abbott B. P., et al., 2017b, ApJ, 848, L12
Ackley K., Eikenberry S. S., Yildirim C., Klimenko S., Garner A., 2019, AJ, 158, 172
Alard C., Lupton R. H., 1998, ApJ, 503, 325
Astropy Collaboration et al., 2013, A&A, 558, A33
Bailey S., Aragon C., Romano R., Thomas R. C., Weaver B. A., Wong D., 2007, ApJ, 665, 1246
Becker A., 2015, HOTPANTS: High Order Transform of PSF ANd Template Subtraction (ascl:1504.004)
Bellm E. C., et al., 2019, PASP, 131, 018002
Berthier J., Vachier F., Thuillot W., Fernique P., Ochsenbein F., Genova F., Lainey V., Arlot J. E., 2006, SkyBoT, a new VO service to identify Solar System objects, p. 367
Berthier J., Carry B., Vachier F., Eggl S., Santerne A., 2016, MNRAS, 458, 3394
Bertin E., Arnouts S., 1996, A&AS, 117, 393
Bilicki M., et al., 2016, ApJS, 225, 5
Bloom J. S., et al., 2012, PASP, 124, 1175
Blundell C., Cornebise J., Kavukcuoglu K., Wierstra D., 2015, arXiv e-prints, p. arXiv:1505.05424
Breiman L., 2001, Machine Learning, 45, 5
Brink H., Richards J. W., Poznanski D., Bloom J. S., Rice J., Negahban S., Wainwright M., 2013, MNRAS, 435, 1047
Cabrera-Vives G., Reyes I., Förster F., Estévez P. A., Maureira J.-C., 2017, ApJ, 836, 97
Carrasco-Davis R., et al., 2020, arXiv e-prints, p. arXiv:2008.03309
Chambers K. C., et al., 2016, arXiv e-prints, p. arXiv:1612.05560
Chetlur S., Woolley C., Vandermersch P., Cohen J., Tran J., Catanzaro B., Shelhamer E., 2014, arXiv e-prints, p. arXiv:1410.0759
Chollet F., et al., 2015, Keras, https://keras.io
Ciucă I., Kawata D., Miglio A., Davies G. R., Grand R. J. J., 2020, arXiv e-prints, p. arXiv:2003.03316
Dálya G., et al., 2018, MNRAS, 479, 2374
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Duev D. A., et al., 2019, MNRAS, 489, 3582
Fawcett T., 2006, Pattern Recognition Letters, 27, 861
Filippenko A. V., Li W. D., Treffers R. R., Modjaz M., 2001, in Paczynski B., Chen W.-P., Lemme C., eds, Astronomical Society of the Pacific Conference Series Vol. 246, IAU Colloq. 183: Small Telescope Astronomy on Global Scales. p. 121
Fossey S. J., Cooke B., Pollack G., Wilde M., Wright T., 2014, Central Bureau Electronic Telegrams, 3792, 1
Gal Y., Ghahramani Z., 2015a, arXiv e-prints, p. arXiv:1506.02142
Gal Y., Ghahramani Z., 2015b, arXiv e-prints, p. arXiv:1506.02158
Gal Y., Islam R., Ghahramani Z., 2017, arXiv e-prints, p. arXiv:1703.02910
Gieseke F., et al., 2017, MNRAS, 472, 3101
Goldstein D. A., et al., 2015, AJ, 150, 82
Gompertz B. P., et al., 2020, MNRAS, 497, 726
Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., 2014, arXiv e-prints, p. arXiv:1406.2661
Heinze A. N., et al., 2018, AJ, 156, 241
Houlsby N., Huszár F., Ghahramani Z., Lengyel M., 2011, arXiv e-prints, p. arXiv:1112.5745
IceCube Collaboration et al., 2018, Science, 361, eaat1378
Ivezić Ž., et al., 2019, ApJ, 873, 111
Kendall A., Gal Y., 2017, arXiv e-prints, p. arXiv:1703.04977
Kingma D. P., Ba J., 2014, arXiv e-prints, p. arXiv:1412.6980
Kulkarni S. R., 2012, arXiv e-prints, p. arXiv:1202.2381
Law N. M., et al., 2009, PASP, 121, 1395
LeCun Y., Bengio Y., et al., 1995, The Handbook of Brain Theory and Neural Networks, 3361, 1995
LeCun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436
LeNail A., 2019, The Journal of Open Source Software, 4, 747
Leaman J., Li W., Chornock R., Filippenko A. V., 2011, MNRAS, 412, 1419
Li W., et al., 2011, MNRAS, 412, 1441
Li L., Jamieson K., DeSalvo G., Rostamizadeh A., Talwalkar A., 2017, The Journal of Machine Learning Research, 18, 6765
Lintott C. J., et al., 2008, MNRAS, 389, 1179
Maaten L. v. d., Hinton G., 2008, Journal of Machine Learning Research, 9, 2579
Mahabal A., et al., 2019, PASP, 131, 038002
Mariani G., Scheidegger F., Istrate R., Bekas C., Malossi C., 2018, arXiv e-prints, p. arXiv:1803.09655
Meegan C., et al., 2009, ApJ, 702, 791
Möller A., de Boissière T., 2020, MNRAS, 491, 4277
Mong Y.-L., et al., 2020, arXiv e-prints, p. arXiv:2008.10178
Moskovitz N., Schottland R., Burt B., Wasserman L., Mommert M., Bailen M., Grimm S., 2019, in EPSC-DPS Joint Meeting 2019. pp EPSC–DPS2019–644
Niculescu-Mizil A., Caruana R., 2005, in Proceedings of the 22nd International Conference on Machine Learning. pp 625–632
O'Malley T., Bursztein E., Long J., Chollet F., Jin H., Invernizzi L., et al., 2019, Keras Tuner, https://github.com/keras-team/keras-tuner
Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
Perlmutter S., et al., 1999, ApJ, 517, 565
Pian E., et al., 2017, Nature, 551, 67
Price-Whelan A. M., et al., 2018, AJ, 156, 123
Reyes E., Estévez P. A., Reyes I., Cabrera-Vives G., Huijse P., Carrasco R., Forster F., 2018, in 2018 International Joint Conference on Neural Networks (IJCNN). pp 1–8
Rhodes B., 2019, Skyfield: High precision research-grade positions for planets and Earth satellites generator (ascl:1907.024)
Romano R. A., Aragon C. R., Ding C., 2006, in 2006 5th International Conference on Machine Learning and Applications (ICMLA'06). pp 77–82
Shappee B. J., et al., 2014, ApJ, 788, 48
Simonyan K., Zisserman A., 2014, arXiv e-prints, p. arXiv:1409.1556
Singer L. P., et al., 2015, ApJ, 806, 52
Smith K. W., et al., 2020, PASP, 132, 085002
Soumagnac M. T., Ofek E. O., 2018, PASP, 130, 075002
Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., 2014, The Journal of Machine Learning Research, 15, 1929
Steeghs D., et al., 2021, The Gravitational-wave Optical Transient Observer (GOTO): prototype performance and prospects for transient science, in prep.
Tanvir N. R., et al., 2009, Nature, 461, 1254
Tompson J., Goroshin R., Jain A., LeCun Y., Bregler C., 2015, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 648–656
Tonry J. L., et al., 2018, PASP, 130, 064505
Turpin D., et al., 2020, MNRAS
Villar V. A., Berger E., Metzger B. D., Guillochon J., 2017, ApJ, 849, 70
Walmsley M., et al., 2020, MNRAS, 491, 1554
Wozniak P. R., 2000, Acta Astron., 50, 421
Wright D. E., et al., 2015, MNRAS, 449, 451
Yip K. H., et al., 2019, in AAS/Division for Extreme Solar Systems Abstracts. p. 305.04
Yosinski J., Clune J., Bengio Y., Lipson H., 2014, arXiv e-prints, p. arXiv:1411.1792
Zackay B., Ofek E. O., Gal-Yam A., 2016, ApJ, 830, 27

This paper has been typeset from a TEX/LaTeX file prepared by the author.