MNRAS 000, 1–17 (0000) Preprint 22 February 2021 Compiled using MNRAS LaTeX style file v3.0
Transient-optimised real-bogus classification with Bayesian Convolutional Neural Networks — sifting the GOTO candidate stream
T. L. Killestein,★ J. Lyman, D. Steeghs, K. Ackley, M. J. Dyer, K. Ulaczyk, R. Cutter, Y.-L. Mong, D. K. Galloway, V. Dhillon, P. O'Brien, G. Ramsay, S. Poshyachinda, R. Kotak, R. P. Breton, L. K. Nuttall, E. Pallé, D. Pollacco, E. Thrane, S. Aukkaravittayapun, S. Awiphan, U. Burhanudin, P. Chote, A. Chrimes, E. Daw, C. Duffy, R. Eyles-Ferris, B. Gompertz, T. Heikkilä, P. Irawati, M. R. Kennedy, A. Levan, S. Littlefair, L. Makrygianni, D. Mata Sánchez, S. Mattila, J. Maund, J. McCormac, D. Mkrtichian, J. Mullaney, E. Rol, U. Sawangwit, E. Stanway, R. Starling, P. A. Strøm, S. Tooke, K. Wiersema, S. C. Williams

Department of Physics, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK
School of Physics & Astronomy, Monash University, Clayton VIC 3800, Australia
Department of Physics and Astronomy, University of Sheffield, Sheffield S3 7RH, UK
School of Physics & Astronomy, University of Leicester, University Road, Leicester LE1 7RH, UK
Armagh Observatory & Planetarium, College Hill, Armagh, BT61 9DG
National Astronomical Research Institute of Thailand, 260 Moo 4, T. Donkaew, A. Maerim, Chiangmai 50180, Thailand
Department of Physics & Astronomy, University of Turku, Vesilinnantie 5, FI-20014 Turku, Finland
Jodrell Bank Centre for Astrophysics, Department of Physics and Astronomy, The University of Manchester, Manchester M13 9PL, UK
University of Portsmouth, Portsmouth PO1 3FX, UK
Instituto de Astrofísica de Canarias, E-38205 La Laguna, Tenerife, Spain
Department of Astrophysics/IMAPP, Radboud University, Nijmegen, The Netherlands
Finnish Centre for Astronomy with ESO (FINCA), Quantum, Vesilinnantie 5, University of Turku, FI-20014 Turku, Finland
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Large-scale sky surveys have played a transformative role in our understanding of astrophysical transients, only made possible by increasingly powerful machine learning-based filtering to accurately sift through the vast quantities of incoming data generated. In this paper, we present a new real-bogus classifier based on a Bayesian convolutional neural network that provides nuanced, uncertainty-aware classification of transient candidates in difference imaging, and demonstrate its application to the datastream from the GOTO wide-field optical survey. Not only are candidates assigned a well-calibrated probability of being real, but also an associated confidence that can be used to prioritise human vetting efforts and inform future model optimisation via active learning. To fully realise the potential of this architecture, we present a fully-automated training set generation method which requires no human labelling, incorporating a novel data-driven augmentation method to significantly improve the recovery of faint and nuclear transient sources. We achieve competitive classification accuracy (FPR and FNR both below 1%) compared against classifiers trained with fully human-labelled datasets, whilst being significantly quicker and less labour-intensive to build. This data-driven approach is uniquely scalable to the upcoming challenges and data needs of next-generation transient surveys. We make our data generation and model training codes available to the community.
Key words: methods: data analysis – surveys – techniques: photometric

★ E-mail: [email protected]
Transient astronomy seeks to identify new or variable objects in the night sky, and characterise them to learn about the underlying mechanisms that power them and govern their evolution. This variability can occur on timescales of milliseconds to years, and at luminosities ranging from stellar flares to luminous supernovae that outshine their host galaxy (Kulkarni 2012; Villar et al. 2017). Through observations of optical transient sources we have obtained evidence of the explosive origins of heavy elements (e.g. Abbott et al. 2017b; Pian et al. 2017), traced the accelerating expansion of our Universe across cosmic time (e.g. Perlmutter et al. 1999), and located the faint counterparts of some of the most distant and energetic astrophysical events known: gamma-ray bursts (e.g. Tanvir et al. 2009). Requiring multiple observations of the same sky area to detect variability, transient surveys naturally generate vast quantities of data that require processing, filtering, and classification – this has driven the development of increasingly powerful techniques bolstered by machine learning to meet the demands of these projects.

Many of the earliest prototypical transient surveys began as galaxy-targeted searches, performed with small field-of-view instruments. In the early stages of these surveys candidate identification was performed manually, with humans 'blinking' images to look for varying sources. This process is time-consuming and error-prone, and represented a bottleneck in the survey dataflow which heavily limited the sky coverage of these surveys. The first 'modern' transient surveys (e.g. LOSS; Filippenko et al. 2001) used early forms of difference imaging to detect candidates in the survey data, automating the candidate detection process and enabling both faster response times and greater sky coverage.
LOSS proved extremely successful, discovering over 700 supernovae in its first decade of operation and providing a homogeneous sample that has proven useful in constraining supernova rates for the local Universe (Leaman et al. 2011; Li et al. 2011).

Difference imaging has since emerged as the dominant method for the identification of new sources in optical survey data. With this method, an input image has a historic reference image subtracted to remove static, unvarying sources. Transient sources in this difference image appear as residual flux, which can be detected and measured photometrically using standard techniques. Various algorithms have been proposed for optical image subtraction, either attempting to match the point spread function (PSF) and spatially-varying background between an input and reference image (Alard & Lupton 1998; Becker 2015), or accounting for the mismatch statistically (Zackay et al. 2016) to enable clean subtraction. Difference imaging also provides an effective way to robustly discover and measure variable sources in crowded fields (Wozniak 2000).

Driven by both improvements in technology (large-format CCDs, wide-field telescopes) and difference imaging algorithms, large-scale synoptic sky surveys came to the fore. In this mode, significant areas of sky can be covered each night to a useful depth and candidate transient sources automatically flagged. This has driven an exponential growth in discoveries of transients, with over 18,000 discovered in 2019 alone. Wide-field surveys such as the Zwicky Transient Facility (ZTF; Bellm et al. 2019), PanSTARRS1 (PS1; Chambers et al. 2016), the Asteroid Terrestrial-impact Last Alert System (ATLAS; Tonry et al. 2018), and the All Sky Automated Survey for SuperNovae (ASAS-SN; Shappee et al. 2014) have proven to be transformative, collectively discovering hundreds of new transients per night.
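The PSF-matching subtraction described above can be sketched in a few lines: blur the sharper reference to the science image's seeing, subtract, and look for residual flux. This is a minimal illustration on synthetic data with idealised Gaussian PSFs (all positions and fluxes are invented), not the hotpants algorithm itself.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(42)

# Synthetic static star field shared by both epochs (positions/fluxes made up).
truth = np.zeros((64, 64))
ys, xs = rng.integers(5, 59, 15), rng.integers(5, 59, 15)
truth[ys, xs] = rng.uniform(100, 500, 15)

sigma_ref, sigma_sci = 1.0, 2.0                 # seeing (Gaussian PSF widths)
reference = gaussian_filter(truth, sigma_ref)   # good-seeing historic image

transient = np.zeros_like(truth)
transient[40, 25] = 300.0                       # new source in the science epoch
science = gaussian_filter(truth + transient, sigma_sci)

# PSF matching: convolve the sharper reference with the quadrature-difference
# kernel, then subtract. Static sources cancel; the transient leaves residual flux.
kernel_sigma = np.sqrt(sigma_sci**2 - sigma_ref**2)
difference = science - gaussian_filter(reference, kernel_sigma)

peak = np.unravel_index(np.argmax(difference), difference.shape)
```

Because Gaussian blurs compose in quadrature, the static field cancels almost exactly, and the brightest residual pixel coincides with the injected transient.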
https://wis-tns.weizmann.ac.il/

With the ability to repeatedly and rapidly tile large areas of sky in order to search for new and varying sources, the follow-up of optical counterparts to poorly localised external triggers became possible, in the process ushering in the age of multi-messenger astronomy. An early example was the detection of optical counterparts to Fermi gamma-ray bursts by the Palomar Transient Factory (PTF; Law et al. 2009). Typical localisation regions from the Fermi GBM instrument (Meegan et al. 2009) were of order 100 square degrees at this time, representing a significant challenge to successfully locate comparatively faint (𝑟 ∼ 19) GRB afterglows. Of the 35 high-energy triggers responded to, 8 were located in the optical (Singer et al. 2015), demonstrating the emerging effectiveness of synoptic sky surveys for this work.

Another recent highlight has been the detection of an optical counterpart to a TeV-scale astrophysical neutrino detected by the IceCube facility (Aartsen et al. 2017). Recent and historical wide-field optical observations of the localisation area combined with high-energy constraints from Fermi enabled the identification of a flaring blazar, believed to be responsible for the alert (IceCube-170922A; IceCube Collaboration et al. 2018). This rapidly increasing survey capability has culminated recently in the landmark discovery of a multi-messenger counterpart to the gravitational wave (GW) event GW170817 (Abbott et al. 2017a,b).
For many years, the rate of difference image detections generated per night by sky surveys has significantly exceeded the capacity of teams of humans to manually vet and investigate each one. This has motivated the development of algorithmic filtering of new sources, to reject the most obvious false positives and reduce the incoming datastream to something tractable by human vetting. With the growing scale and depth of modern sky surveys, simple static cuts on source parameters cannot keep pace with the rate of candidates, with high false positive rates leading to substantial contamination by artefacts. This situation has motivated the development of machine learning (ML) and deep learning (DL) classifiers, which can extract subtle relationships between the input data/features and perform more effective filtering of candidates. The dominant paradigm for this task has so far been the real-bogus formalism (e.g. Bloom et al. 2012), which formulates this filtering as a binary classification problem. Genuine astrophysical transients are designated 'real' (score 1), whereas detector artefacts, subtraction residuals and other distractors are labelled as 'bogus' (score 0). A machine learning classifier can then be trained using these labels with an appropriate set of inputs to make predictions about the nature of a previously-unseen (by the classifier) source within an image.

This real-bogus classification is only one step in a transient detection pipeline. Having established the candidates appearing as astrophysically real sources, further filtering is required to determine if they are scientifically interesting, or distractors – the definition of "interesting" is naturally governed by the science goals of the survey. This process draws in contextual information from existing catalogues, historical evolution, and more fine-grained classification routines.
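The binary real-bogus formulation above (real = 1, bogus = 0, a supervised classifier trained on per-candidate inputs) can be sketched with a random forest on feature vectors. The three features and their distributions here are invented for illustration only; they are not GOTO's actual feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical per-candidate features: FWHM (arcsec), negative-pixel
# fraction, and elongation. Distributions chosen only to be separable.
real  = np.column_stack([rng.normal(3.0, 0.4, n),
                         rng.uniform(0.00, 0.05, n),
                         rng.normal(1.1, 0.1, n)])
bogus = np.column_stack([rng.normal(1.5, 1.0, n),
                         rng.uniform(0.05, 0.60, n),
                         rng.normal(2.0, 0.8, n)])

X = np.vstack([real, bogus])
y = np.concatenate([np.ones(n), np.zeros(n)])   # real = 1, bogus = 0

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[::2], y[::2])                         # even rows: training set
accuracy = clf.score(X[1::2], y[1::2])          # odd rows: held-out test set
```

`clf.predict_proba` then yields the real-bogus score on which a survey pipeline would place its ingestion cut.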
The last step before triggering follow-up and further study (at least currently) is human inspection of the remaining candidates. No single filtering step is 100% efficient in removing false positives/low significance detections, thus human vetting is required to identify promising candidates and screen out any bogus detections that have made it this far. Real-bogus classification is the most crucial step, reducing the volume of candidates that later steps must
process and the amount of bogus candidates that humans must eventually sift through to find interesting objects – a balance between sensitivity (to avoid missing detections irretrievably) and specificity (avoiding floods of low-quality candidates) must be reached.

Real-bogus classification is a well-studied problem, beginning with early transient surveys (Romano et al. 2006; Bailey et al. 2007), and evolving both in complexity and performance with the increasing demands placed on it by larger and deeper sky surveys such as PTF (Brink et al. 2013), PanSTARRS1 (Chambers et al. 2016), and the Dark Energy Survey (Goldstein et al. 2015). Early classifiers were generally built on decision tree-based predictors such as random forests (Breiman 2001), using a feature vector as input. Feature vectors comprise extracted information about a given candidate, and often include broad image-level statistics designed to maximally separate real and bogus detections in the feature space. Examples include the source full-width half maximum computed from the 2D profile, noise levels, and negative pixel counts. More elaborate features can be composed via linear combinations of these quantities, which may exploit correlations and symmetries. Another method of deriving features is to compute compressed numerical representations of the source via Zernike/shapelet decomposition (Ackley et al. 2019).

However, feature selection can represent a bottleneck to increasing performance. Features are typically selected by humans to encode the salient details of a given detection, attempting to find a compromise between classification accuracy and speed of evaluation.
This introduces the possibility of missing salient features entirely, or choosing a sub-optimal combination of them.

Directly using pixel intensities as a feature representation avoids choosing features entirely, instead training on flattened and normalised input images (Wright et al. 2015; Mong et al. 2020); these have demonstrated improved accuracy over fixed-feature classifiers. However, this approach quickly (quadratically) becomes inefficient for large inputs. Using a smaller input size means information on the surrounding area of each detection is unavailable, limiting the visible context and affecting classification accuracy as a result.

Recently, convolutional neural networks (CNNs; LeCun et al. 1995) have led to a paradigm shift in the field of computer vision and machine learning, which has been transformative in the way we process, analyse, and classify image data across all disciplines. CNNs use learnable convolutional filters known as kernels to replace feature selection. These filters are cross-correlated with the input images to generate 'feature maps', effectively compact feature representations. Through the training process, the filter parameters are optimised to extract the most salient details of the inputs, which can then be fed into fully-connected layers to perform classification or regression. In this way, the model can select its own feature representations, avoiding the bottleneck of human selection. Multiple layers can be combined to achieve greater representational power, known as deep learning (LeCun et al. 2015). Recent work using CNNs has demonstrated state-of-the-art performance at real-bogus classification (Gieseke et al. 2017; Cabrera-Vives et al. 2017; Duev et al. 2019; Turpin et al. 2020).
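The kernel-to-feature-map step described above is just a cross-correlation followed by a nonlinearity. A minimal sketch, with a hand-set centre-surround kernel standing in for a learned one and a synthetic PSF-like blob standing in for a real stamp:

```python
import numpy as np
from scipy.signal import correlate2d

# A PSF-like blob in a 21x21 stamp (synthetic; real inputs are image cutouts).
yy, xx = np.mgrid[0:21, 0:21]
stamp = np.exp(-((yy - 10)**2 + (xx - 10)**2) / (2 * 2.0**2))

# One hand-set 3x3 centre-surround kernel. In a CNN, a bank of such kernels
# is learned, and each is cross-correlated with the input per the text.
kernel = -np.ones((3, 3)) / 8.0
kernel[1, 1] = 1.0

# Cross-correlate, then apply a ReLU nonlinearity -> one feature map.
feature_map = np.maximum(correlate2d(stamp, kernel, mode='same'), 0.0)

peak = np.unravel_index(np.argmax(feature_map), feature_map.shape)
```

The centre-surround kernel responds most strongly at the compact source's centre, which is exactly the kind of representation later fully-connected layers consume.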
CNNs are also efficiently parallelisable, making them suitable for high-volume data processing tasks. Whilst providing substantial accuracy improvements over previous techniques, deep learning is particularly reliant upon large and high-quality training sets to minimise overfitting, arising from the high number of model parameters. Although augmentation and regularisation techniques can minimise this risk, they are no substitute for a larger dataset. The performance of any classifier is ultimately limited by the error rate on the training labels, so it is important to also ensure the dataset is accurately labelled. Making a large, pure, and diverse training set can be among the most challenging parts of developing a machine learning algorithm, and significant effort has been focused on this area in recent years.

Traditionally the 'gold standard' for machine learning datasets across computer science and astronomy has been human-labelled data, as this represents the ground truth for any supervised learning task. Use of citizen science has proven to be particularly effective, leveraging large numbers of participants and ensembling their individual classifications to provide higher-accuracy training sets for machine learning through collaborative schemes such as Zooniverse (Lintott et al. 2008; Mahabal et al. 2019). However, even in large teams, human labelling of large-scale datasets is time-consuming and inefficient, requiring hundreds to thousands of hours spent collectively to build a dataset of a suitable size and purity. Specifically for real-bogus classification, there are also issues with completeness and accuracy for human labelling of very faint transients close to the detection limit. These faint transients are where a classifier has potential to be the most helpful, so if the training set is fundamentally biased in this regime, any classifier predictions will be similarly limited.
To go beyond human-level performance, we cannot solely rely on human labelling; additional information is required. One specific aspect of astronomical datasets that can be leveraged to address both issues discussed above is the availability of a diverse range of contextual data about a given source. Sizeable catalogues of known variable stars, galaxies, high-energy sources, asteroids, and many other astronomical objects are freely available and can be queried directly to identify and provide a more complete picture of the nature of a given source.

Significant effort is being invested in data processing techniques for transient astronomy in anticipation of the Vera C. Rubin Observatory (Ivezić et al. 2019), due to begin survey operations in 2022. Via the Legacy Survey of Space and Time (LSST), the entire southern sky will be surveyed down to a depth of 𝑟′ ∼
The Gravitational-Wave Optical Transient Observer (GOTO; Steeghs et al. 2021) is a wide-field optical array, designed specifically to rapidly survey large areas of sky in search of the weak kilonovae and afterglows associated with gravitational wave counterparts. The work we present in this paper was conducted during the GOTO prototype stage, using data taken with a single 'node' of telescopes situated at the Roque de los Muchachos observatory on La Palma. Each node comprises 8 co-mounted fast astrograph OTAs (optical tube assemblies), combining to give a ∼40 square degree field of view in a single pointing. GOTO performs surveys using a custom wide 𝐿 band filter (approximately equivalent to 𝑔′ + 𝑟′). Science and reference images are aligned using the spalipy code (https://github.com/Lyalpha/spalipy). Image subtraction is performed on the aligned science and reference images with the hotpants algorithm (Becker 2015) to generate a difference image. To locate residual sources in the difference image, source extraction is performed using SExtractor (Bertin & Arnouts 1996). Detections in the difference image are referred to as 'candidates' through the remainder of this paper. For each candidate, a set of small stamps are cut out from the main science, template and difference images, and this forms the input to the GOTO real-bogus classifier. This process and proposed improvements are discussed in more detail in Section 2.1. From here, candidates that pass a cut on real-bogus score (using a preliminary classifier) are ingested into the GOTO Marshall – a central website for GOTO collaborators to vet, search and follow up candidates (Lyman et al., in prep.).

In line with the principal science goals of the GOTO project, the real-bogus classifier discussed in this work is constructed specifically to maximise the recovery rate of extragalactic transients and other explosive events such as cataclysmic variable outbursts. Small-scale stellar variability can be easily detected via difference imaging, but is better studied through the aggregated source light curves.

An operational requirement for the current version of this classifier is the ability to perform consistently across multiple different hardware configurations. During classifier development, the GOTO prototype used two different types of optical tube design, each with varying optical characteristics that led to different point spread functions, distortion patterns, and background levels/patterns.
Due to limited data availability, training a classifier for each individual OTA (or group of OTAs of the same type) was not viable. This requirement adds an additional operational challenge over survey programs such as the Zwicky Transient Facility (ZTF; Bellm et al. 2019) and PanSTARRS1 (PS1; Chambers et al. 2016), which use a static, single-telescope design. If acceptable results can be achieved with this heterogeneous hardware configuration, then further performance gains can be expected when the design GOTO hardware configuration is deployed. This will use telescopes of consistent design and improved optical quality, meaning less model capacity needs to be directed towards keeping the classification performance stable across a diverse ensemble of optical distortions.

In this paper, we propose an automated training set generation procedure that enables large, minimally contaminated, and diverse datasets to be produced in less time than human labelling and at larger scales. This procedure also introduces a data-driven augmentation scheme to generate synthetic training data that can be used to significantly improve the performance of any classifier on extragalactic transients of all types, but with particular effectiveness for nuclear transients. Using this improved training data, we apply Bayesian convolutional neural networks (BCNNs) to astronomical real-bogus classification for the first time, providing uncertainty-aware predictions that measure classifier confidence, in addition to the typical real-bogus score. This opens up promising future directions for more complex classification tasks, as well as optimally utilising the predictions of human labellers. We emphasise that although this classifier is discussed in the context of GOTO and our associated science needs, the techniques discussed are fully general and could easily be applied to real-bogus classification at other projects. Our code, gotorb, is made freely available online with this in mind.
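One common way such uncertainty-aware predictions are obtained is Monte Carlo dropout: dropout is left on at prediction time, the network is run many times per candidate, and the spread of the stochastic scores measures model confidence. A minimal numpy sketch of the mechanics (the tiny two-layer network and its random weights are placeholders, not the paper's BCNN):

```python
import numpy as np

rng = np.random.default_rng(1)

# Untrained placeholder weights standing in for a trained network; only
# the Monte Carlo dropout mechanics are the point of this sketch.
W1 = rng.normal(0.0, 0.5, (16, 8))
W2 = rng.normal(0.0, 0.5, (8, 1))

def mc_forward(x, p_drop=0.3):
    """One stochastic forward pass with dropout left ON at prediction time."""
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    h = h * (rng.random(h.shape) > p_drop)      # random dropout mask
    return 1.0 / (1.0 + np.exp(-(h @ W2)))      # sigmoid real-bogus score

x = rng.normal(size=(1, 16))                    # one candidate's inputs
scores = np.array([mc_forward(x)[0, 0] for _ in range(200)])

score = scores.mean()        # real-bogus probability estimate
confidence = scores.std()    # spread across passes -> model uncertainty
```

A candidate with a middling mean score but a large spread is exactly the kind of object worth routing to human vetters or an active-learning queue.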
The 'real' content of our training set is composed of minor planets, similar to Smith et al. (2020). Assuming the sky motion is large (but not so large that the source is trailed), these objects are typically detected in the science image but not the template image, which provides a clean subtraction residual resembling an explosive transient. Due to the large pixels of the GOTO detectors and short exposure times of each sub-image, very few asteroids move sufficiently quickly to trail. We estimate that sky motions of 1 arcsec per minute or greater will lead to trailing.

There are significant numbers of asteroids detectable down to the GOTO limiting magnitude. Using ephemerides from the astorb database (Moskovitz et al. 2019), based on observations reported to the Minor Planet Center, difference image detections can be robustly cross-matched to minor planets in the

https://github.com/GOTO-OBS/gotorb
[Figure 1: histogram of number counts versus GOTO 𝐿-band magnitude for minor planets and synthetic transients.]

Figure 1. Magnitude distribution of the minor planets (MP) used to build our training set. Bright-end number densities are dominated by the true magnitude distribution of the minor planets, where the faint-end density is constrained by the GOTO limiting magnitude. The magnitude distribution of synthetic transients (SYN) is a sub-sample of the minor planet magnitude distribution, except with a cut at 𝐿 ∼ 16 to avoid unrealistically bright objects.

field. This provides a significant pool of high-confidence, unique, and diverse difference image detections from which to build a clean training set.

We use the online SkyBoT cone search (Berthier et al. 2006, 2016) to retrieve the positions and magnitudes of all minor planets within the field of view of each GOTO image, then cross-match this table with all valid difference image detections using a 1 arcsec threshold value to identify the asteroids present in the image. The ephemerides provided are of sufficient quality that this is adequate to match even faint (𝐿 ∼ 20) asteroids. To avoid spurious cross-matches, only asteroids brighter than the 5-sigma limiting magnitude of the image are considered. An alternative offline cone search is made accessible via the pympc Python package (https://pypi.org/project/pympc/), which the code can fall back on if SkyBoT is unavailable. Using minor planets, the training set can reliably be extended to fainter magnitudes, where the performance of human vetters begins to significantly decrease. Figure 1 illustrates the magnitude distribution of minor planets used to construct the training set.

To create the bogus content of our training set, we randomly sample detections in the difference image following Brink et al. (2013). Bogus detections overwhelmingly dominate such random samples, though some real sources such as variable stars will be missed with this procedure, and we develop tools to identify them retrospectively after model training in Section 3.3.

To improve the classifier's resistance to specific challenging subtypes of data poorly represented in our algorithmically generated training set, we inject human-labelled detections into the dataset. More specifically, candidates from the GOTO Marshall (discussed in full in Lyman et al., in prep.) are included, which were misidentified by the classifier in the pipeline at the time as real and later labelled as bogus by human vetters. The previous classifier was a rapidly-deployed prototype CNN similar in design to that presented here, trained on a smaller dataset of minor planets and random bogus detections. These detections are included to allow the classifier to screen out artefacts missed by the prototype image processing pipeline, including satellite trails and highly wind-shaken PSFs. This artificially increases the diversity of the bogus component of the training set, as these edge-case detections would rarely be selected by naive random sampling and so be poorly represented within the model.
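The 1 arcsec cross-match against the ephemeris table can be sketched with a flat-sky small-angle separation (all coordinates below are invented for illustration; a production implementation might instead use astropy's `match_to_catalog_sky`):

```python
import numpy as np

ARCSEC = 1.0 / 3600.0   # degrees

# Difference-image detections (degrees) -- positions invented for illustration.
det_ra  = np.array([150.0001, 150.0500, 150.1000])
det_dec = np.array([  2.00005,  2.05000,  2.10000])

# Minor-planet ephemeris positions, as a SkyBoT-style cone search would return.
mp_ra  = np.array([150.0000, 150.1001])
mp_dec = np.array([  2.0000,   2.1001])

# Small-angle, flat-sky separation with the cos(dec) factor on RA.
cos_dec = np.cos(np.deg2rad(det_dec))[:, None]
sep = np.hypot((det_ra[:, None] - mp_ra) * cos_dec,
               det_dec[:, None] - mp_dec)

# A detection is labelled 'real' if any minor planet lies within 1 arcsec.
is_minor_planet = sep.min(axis=1) < 1.0 * ARCSEC
```

Here the first and third detections fall within the threshold of an ephemeris position and would be ingested as clean 'real' examples.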
Although these detections represent a small fraction of the overall training set (∼ …), […].

For each detection identified for inclusion in our training/validation/test sets, a series of stamps are cut out from the larger GOTO image, centred on the difference image residual. In common with previous CNN-based classifiers, we use small cutouts of the median-stacked science and template images, as well as the resultant difference image after image subtraction. The size of these stamps is an important model hyperparameter, which we explore in more detail in Section 3.1. An example of the model inputs for a synthetic source is illustrated in Figure 2.

An important addition to our network's inputs compared to previous work is a peak-to-peak (p2p) layer. This is included to characterise variability across the individual images that make up a median-stacked science image, and is calculated as the peak-to-peak (maximum value minus minimum value) variation of each pixel, computed across all individual images that composed the median stack. To ensure consistent alignment across all individual stamps and remove any jitter, we cut out the region based on the RA/Dec coordinates of the source detection in the median stack. This additional layer provides an effective discriminator for spurious transient events such as cosmic ray hits and satellite trails. If sufficiently bright, these are not removed by the simple median stacking in the current pipeline, due to the small number of sub-frames used. This is particularly problematic for cosmic ray hits, which are convolved with a Gaussian kernel for image subtraction and so appear PSF-like in the difference image. This can create convincing artifacts which are difficult to identify without access to the individual image level information.
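The p2p computation described above can be sketched in a few lines (a minimal sketch, assuming the aligned sub-frame cutouts are already stacked into a NumPy array; array names are illustrative):

```python
import numpy as np

def p2p_layer(subframes: np.ndarray) -> np.ndarray:
    """Per-pixel peak-to-peak (max - min) variation across the
    individual images that form a median-stacked science image.

    subframes : array of shape (n_subframes, ny, nx), cut out at the
    same RA/Dec so the stamps are aligned.
    """
    return np.ptp(subframes, axis=0)

# A steady source present in all sub-frames shows little p2p signal,
# while a cosmic ray hit in a single sub-frame stands out strongly.
stack = np.zeros((4, 5, 5))
stack[:, 2, 2] = 1.0   # steady source: present in every sub-frame
stack[0, 1, 1] = 50.0  # cosmic ray: single sub-frame only
p2p = p2p_layer(stack)
```

This is why the layer discriminates cosmic rays and satellite trails well: a transient persists across sub-frames and so has low p2p amplitude, whereas single-frame artifacts produce a strong p2p residual.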
In testing, this reduced the false positive rate on the test set: compared to an otherwise identical model trained without the p2p layer, there is a 2–3% decrease in false positive rate.

For all of the above steps, stamps extending beyond the edge of the detector have missing areas filled in with a small constant intensity level (10^−…), to distinguish them quantitatively from masked (i.e. saturated) pixels, which are assigned a value of zero in the difference image by the pipeline.
Figure 2. Example data format for a set of idealised synthetic images of a single Gaussian source newly appearing in the science image. We apply a naive convolution of the science image with the template PSF and vice versa in producing the difference image, for visualisation purposes. From left to right: science median, template median, difference image, pixel-wise peak-to-peak variation across contributing images to the science median. Cutouts are 55x55 pixels square, corresponding to a side length of 1.1 arcminutes.

The specific intensity level chosen for this offsetting is not important, and we choose our value to be well above machine precision (significant enough to influence the gradients) but well below the typical background level. To ensure that the classifier remains numerically stable in later training steps, each stack of stamps undergoes layer-wise L2 normalisation to reduce the input's magnitude: each stamp has the mean subtracted, and is then divided through by the L2 norm (√(x·x)).

Although asteroids provide a convenient source of PSF-like residuals to train on, it is important to note that they cannot fully replicate genuine transients. Asteroids are markedly simpler for a classifier to learn and discriminate, since they lack the complex background of a host galaxy. The main goal of this classifier is to detect extragalactic transients, so adapting the training set to maximise performance on these objects is important. An ideal approach would be to add a large number of genuine transients into the training set. However, GOTO has not been on-sky long enough to collect a suitably large set of these detections, and we only build the training set from the previous year of data. Even assuming every supernova over the past year is robustly detected in our data, this would still yield a number of transients significantly smaller than the target size of our training set.
This would create a severely imbalanced dataset, which could in principle be used, but with reduced classification performance. Using spectroscopically confirmed transients may also inject an element of observational bias into our training set, as events that have favourable properties for spectroscopy (in nearby galaxies, offset from their host, bright) are preferentially selected for follow-up (Bloom et al. 2012). Instead, we reserve a set of real, spectroscopically confirmed transients GOTO has detected (∼900 as of August 2020) for benchmarking purposes, as they represent a valuable insight into real-world performance and can be used to directly evaluate the effectiveness of any transient augmentation scheme we employ, as in Section 4.2.

PSF injection has been used heavily in prior work to generate synthetic detections for testing recovery rates and simulating the feasibility of observations. This process can be computationally intensive, involving construction of an effective PSF (ePSF) by combining multiple isolated sources, or fitting an approximating function (e.g. a Gaussian) to sources in the image. The ePSF model can then be scaled and injected into the image to simulate a new source. By injecting sources in close proximity to galaxies in individual images, then propagating these through the data reduction pipeline, synthetic transients could be generated in a realistic way. However, the fast optical design of GOTO makes this a complex task, as the PSF varies as a function of source position on the detector. Sources in the corners of an image display mild coma which, combined with wind-shake and other optical distortion, can lead to unusual PSFs that are not accurately reproduced by the mean PSF. In principle this could be accounted for by computing PSFs for sub-regions of a given image, or by assuming some spatially-varying kernel to fit for, but this would add sizeable overheads to the injection process and will always be an approximation.

Recent techniques such as generative adversarial networks (GANs, Goodfellow et al. 2014) have shown promise in generating novel training examples that can be used to address class imbalance and scarcity in training sets (Mariani et al. 2018), and have recently started to be applied to astrophysical problems (Yip et al. 2019). However, these networks are computationally expensive, complex to train and to interpret, and do not fully remove the need for large datasets.
A robust, human-interpretable method for generating synthetic examples is a better approach for the noisy, diverse datasets used in real-bogus classification. We propose a novel technique for synthesising realistic transients that can be used to significantly improve transient-specific performance compared to a pure minor planet training set, without requiring PSF injection or other CPU-intensive approaches. For each minor planet detected in an image, the GLADE galaxy catalogue (Dálya et al. 2018) is queried for nearby galaxies within a set angular distance of 10 arcminutes, chosen such that the PSFs of sources within this region are consistent. Pre-built indices are used via catsHTM (Soumagnac & Ofek 2018) to accelerate querying GLADE. The algorithm chooses the brightest galaxy (minimum B-band magnitude) within range, then generates a cutout stamp with a randomly chosen x, y offset relative to the galaxy centre. For the implementation within this work, the x, y pixel offsets are drawn from a uniform distribution U(−…, …), chosen to fully cover the range of offsets for nearby galaxies. Sources that are completely detached from any host galaxy are better represented by the minor planet component of the training set. This ensures that a diverse range of transient configurations (nuclear, offset, orphaned) are sampled. The minor planet and galaxy stamps are then directly summed to produce the synthetic transient. For the purposes of real-bogus classification, accurately matching the measured transient host-offset distribution is not crucial. The host offset distribution contains implicit and difficult-to-quantify biases resulting from the specific selection functions of the transient surveys that populate it – it does not accurately reflect the underlying distribution of astrophysical transients.
By choosing from a uniform distribution, we instead aim to attain consistent performance across a wide range of host offsets that overlap with the range inferred from the transient host offset distribution.

The original individual images for each component are retrieved to correctly compute the peak-to-peak variation of the combined stamp. Model inputs are pre-processed and undergo L2 normalisation (as discussed in Section 2.1) prior to training and inference, so additional background flux introduced by this method does not affect the model inputs. The noise characteristic of this combined stamp is not straightforward to compute, due to the highly correlated noise present in the difference image and varying intensity levels, and could be higher or lower depending on the specific stamps – with the straightforward Gaussian case providing a √… […] ∼constant, as the stamp scale is far smaller than the overall frame scale.
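The stamp-summing step can be sketched as follows (a minimal sketch, assuming cutouts are already extracted as NumPy arrays; the function and variable names are illustrative, not from the gotorb package):

```python
import numpy as np

rng = np.random.default_rng(42)

def make_synthetic_transient(mp_stamp, galaxy_image, gal_xy, max_offset=20):
    """Sum a minor-planet difference-image stamp onto a galaxy cutout.

    mp_stamp     : (s, s) stamp centred on a minor planet detection.
    galaxy_image : larger image containing the chosen GLADE galaxy.
    gal_xy       : (x, y) pixel position of the galaxy centre.
    max_offset   : half-width of the uniform offset distribution (pixels).
    """
    s = mp_stamp.shape[0]
    # Draw a uniform x, y offset so nuclear, offset and orphaned
    # configurations are all sampled.
    dx, dy = rng.integers(-max_offset, max_offset + 1, size=2)
    cx, cy = gal_xy[0] + dx, gal_xy[1] + dy
    half = s // 2
    gal_stamp = galaxy_image[cy - half:cy + half + 1,
                             cx - half:cx + half + 1]
    # Direct sum: preserves the local PSF and the difference-image noise.
    return gal_stamp + mp_stamp

# Toy example: flat background galaxy cutout plus a point-like residual.
galaxy = np.full((101, 101), 5.0)
mp = np.zeros((25, 25))
mp[12, 12] = 100.0
synth = make_synthetic_transient(mp, galaxy, gal_xy=(50, 50))
```

Because both components are real detections from the same region of the same image, the summed stamp inherits the local PSF and noise properties for free – the key advantage over PSF injection.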
Figure 3. Randomly selected sample of synthetic transients generated with our algorithm, displayed in the same format as in Figure 2. Significant variations in the PSF are visible due to sampling directly from the image, improving classifier resilience.

Naturally, this breaks down in the presence of nebulosity/galaxy light, but this represents an overwhelmingly small fraction of the sky. We also reject all minor planets with L < 16, as these are significantly brighter than the selected host galaxy, so are better represented by the pure minor planet candidates. This also cuts down significantly on saturated detections of dubious quality. This choice has no detrimental effect on bright-end performance, as discussed in Section 4. A random sample of synthetic transients generated with this approach is shown in Figure 3. Our method bears some similarity, in retrospect, to the approach of Cabrera-Vives et al. (2017), who added stamps from the science image into difference images to simulate detections in 'random' locations. Our approach uses confirmed difference image detections of MPs and puts them in more purposeful locations, whilst preserving the noise characteristics of the difference stamp.

This approach has strong advantages over simply injecting transients into galaxies. By selecting only galaxies close to each minor planet, the PSF is preserved and consistent, regardless of how distorted it may be. Injection-based methods require estimation or assumption of the image PSF, which is typically a parameterised function determined by fitting isolated sources. Given the variation in PSF across images and across individual unit telescopes, this would be a computationally intensive task, and would likely lead to poorer results compared to using minor planets. However, using only these synthetic transients introduces unintended behaviour in the trained model that significantly degrades classification performance if not remedied.
Metalabel              Train    Test
Minor planet           72992    8133
Synthetic transient    40192    4521
Random bogus          177556   19645
Galaxy residual        28040    3190
Marshall bogus         24577    2662
Total                 343357   38151

Table 1. Breakdown of the composition of our dataset, partitioned according to training and test sets. The validation dataset is not shown, but is composed of 10% of the training dataset, chosen randomly at training time.

Since every synthetic transient in the training set is associated with a host galaxy by design, the model will over time learn to associate all detections coincident with galaxies as real, as there is no loss penalty for doing so. To resolve this, we also inject galaxy residuals as bogus detections, randomly sampling from the remaining GLADE catalogue matches at a 1:1 transient:galaxy residual ratio. This way, the model learns that the salient features of these detections are not the galaxy, but the PSF-like detection embedded in them.
Using the techniques developed in the sections above, we build our training set with GOTO prototype data from 01-01-2019 to 01-01-2020. This ensures that our performance generalises well across a range of possible conditions – with PSF shape and limiting magnitude being the most important properties that benefit from this randomisation. A breakdown of training set proportions and properties is given in Table 1. Our code is fully parallelised at image level, meaning that a full training set of ∼… […]

As a starting point, we follow the braai classifier of Duev et al. (2019) in using a downsized version of the VGG16 CNN architecture of Simonyan & Zisserman (2014). This network architecture has proven to be very capable across a variety of machine learning tasks, and is relatively simple to implement and tweak. It uses conv-conv-pool blocks as the primary component – two convolutions are applied in sequence to extract both simple and compound features, then the resultant feature map is reduced in size by a factor of 2 by 'pooling', taking the maximum value of each 2x2 group of pixels. The architecture also uses small kernels (3x3) for performance. These structures are illustrated in Figure 4. We use the configuration as presented in Duev et al. (2019) for development, but later conduct a large-scale hyperparameter search to fine-tune the performance to our specific dataset (Section 3.1). The primary inputs to the classifier are small cut-outs of the science, template, difference, and p2p images, as discussed in Section 2.1, which we refer to as stamps.

The sample weights for real and bogus examples are adjusted to account for the class imbalance in our dataset, set to the reciprocal of the number of examples with each label.
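The per-class weighting described above amounts to the following (a minimal sketch; the counts come from the Table 1 metalabels, and the rescaling so the mean weight is 1 is an illustrative convention, not necessarily the exact one used):

```python
# Class weights: reciprocal of the number of examples per label,
# rescaled so the average weight over the training set is 1.
counts = {"real": 113184, "bogus": 230173}
total = sum(counts.values())

weights = {label: total / (len(counts) * n) for label, n in counts.items()}
# Each example of the rarer class contributes proportionally more
# to the loss, compensating for the class imbalance.
```

With this convention each class contributes equally to the total loss, regardless of how many examples it has.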
Figure 4. Block schematic of the optimal neural network architecture found by hyperparameter optimisation in Section 3.1: conv, conv, pool (2x2), conv, conv, pool (4x4), flatten, dense, with tensor dimensions (55,55,4) → (53,53,24) → (51,51,24) → (25,25,56) → (23,23,56) → (21,21,56) → (1400,) → (208,) → (1,). Each block represents a 3D image tensor, either as input to the network or the product of a convolution operation generating an 'activation map'. Classification is performed using the scalar output of the neural network. Not illustrated for clarity are the dropout masks applied between each layer, and the activation layers. Base figure produced with nnsvg (LeNail 2019).

Class weights are not adjusted on a per-batch basis, as our training set is only mildly imbalanced. For regularisation, we apply a penalty to the loss based on the L2 norm of each weight matrix. This penalises exploding gradients and promotes stability in the training phase. L1 regularisation was trialled but did not produce significantly better results. We also use spatial dropout (Tompson et al. 2015) between all convolutions, which provides some regularisation but is primarily used for the purposes of uncertainty estimation (see Section 3.3) – a small dropout probability of ∼… is used.
A full training run of ∼170 epochs takes around 10 hours. Inference is significantly quicker, with an average throughput of 7,500 candidates per second with no model ensembling performed. Our model training code is freely available via the gotorb Python package (https://github.com/GOTO-OBS/gotorb), which includes the full range of tunable parameters and model optimisations we implement.

To achieve the maximum performance possible with a given neural network, we conduct a search over the model hyperparameters to assess which combinations lead to the best classification accuracy and model throughput. Initially, the ROC-AUC score (Fawcett 2006) was used as the metric to optimise, as in many cases this is a more indicative performance metric than others; however, this did not translate directly into improvements in classification performance. We conjecture this may be due to the score-invariant nature of the ROC-AUC statistic – it only captures the probability that a randomly selected real example will rank higher than a randomly selected bogus example, which is independent of the specific real-bogus threshold chosen. We instead opt to use the accuracy score, as this directly maps to the quantity we want to maximise in our model.

Data-based hyperparameters (training set composition, stamp size, data augmentation) are optimised iteratively by hand due to computational constraints. An approximate real-bogus ratio between 1:2 and 1:3 was found to be optimal, with greater values giving better bogus performance at the cost of recovery of real detections – we opt for 1:2 in the final dataset. The overall dataset size was found to be the biggest determinant of classification accuracy, with larger datasets showing improved performance – although this increase was subject to diminishing returns with larger and larger datasets. We chose a training set of O(4 × 10^5) examples, as this was roughly the largest dataset we could fit into RAM on the training nodes – naturally this could be increased further by reading data from disk on demand, but given that CPUs were used for training, there was a need to minimise input pipeline latencies as much as possible to compensate.
Model performance was found to be relatively insensitive to the ratio of synthetic transients to minor planets, as long as there were at least 10,000 of both in the training set.
Figure 5. Classifier performance (ROC-AUC score) on the test set of a 330,000-example training set, as a function of input stamp size. Each point is the average of 3 independent training runs on the same input training set, with the shaded region representing the 1σ confidence interval.

We use the Hyperband algorithm (Li et al. 2017), as implemented in the Keras-Tuner package (O'Malley et al. 2019). This algorithm implements a random search with intelligent allocation of computational resources, partially training brackets of candidate models and only selecting the best fraction of each bracket to continue training. In testing, this consistently outperformed both naive random search and Bayesian optimisation in terms of final performance. Table 2 illustrates the region of (hyper)parameter space we choose to conduct our search over. The upper limits for the neuron/filter parameters are set purely by computational constraints – networks above this threshold take too long to evaluate and train, and so are excluded. We also set an upper limit of 500,000 on the number of model parameters.
Continuous
Hyperparameter          Min      Max      Prior    Selected
Block 1 filters (N1)    8        32       linear   24
Block 2 filters (N2)    N1       64       linear   56
N_fc                    64       512      linear   208
Dropout rate            10^−…    …        …        ….×10^−…
Learning rate           10^−…    10^−…    log      6×10^−…
Regulariser penalty     10^−…    10^−…    log      2.…×10^−…

Discrete
Hyperparameter          Choices                    Selected
Kernel initialiser      He, Glorot                 Glorot
Kernel regulariser      L1, L2                     L2
Activation function     ReLU, LeakyReLU, ELU       LeakyReLU
Table 2. Hyperparameter space over which the optimisation search was conducted, split by numerical and categorical variables. The final adopted values are given in the rightmost column.

This avoids overly complex models and promotes small but efficient architectures. Based on initial experimentation, we require that the number of convolutional filters in the second block be greater than or equal to the number in the first block. This ensures that the largest (and most computationally expensive) convolution operations are performed on tensors that have been max-pooled and are thus smaller, reducing execution time. To maximise performance across all possible deployment architectures, the numbers of convolutional filters and fully-connected layer neurons are constrained to be multiples of 8. This is one of the requirements for fully leveraging optimised GPU libraries (such as cuDNN, Chetlur et al. 2014), and also enables use of specialised hardware accelerators such as tensor cores in the future. Conveniently, this discretisation also makes the hyperparameter space more tractable to explore.

This search took around 1 month to complete on a single 32-core compute node, and sampled 828 unique parameter configurations. The three top-scoring models were then retrained from random initialisation through to early stopping to validate their performance, and to confirm that the hyperparameter combinations led to stable and consistent results. The top three scoring models achieved accuracies on the hyperparameter validation set of 98.88, 98.64 and 98.54% respectively. Some of the candidate models had to be pruned from the list due to excessive overfitting. The best model was then selected based on the minimum test set error. Our final model achieved a test set class-balanced accuracy of 98.… ± 0.02% (F1 score 0.… ± 0.…).

Uncertainty estimation in neural networks is an open problem, but is of critical importance for a range of applications. Traditional deterministic neural networks output a single score per class, between 0 and 1. This single value would be sufficient to provide a measure of confidence, if properly calibrated. However, neural networks are often regarded as providing over-confident predictions in general.
Worse, they can provide misidentifications at high confidence. Giving neural networks the ability to make nuanced predictions, and to account for their own uncertainty in decision making, is a potentially powerful improvement that we discuss in more detail over the next sections.

It is important to be specific, and to distinguish between epistemic (systematic) and aleatoric (random) uncertainty for the purposes of our classification problem (Kendall & Gal 2017). Aleatoric uncertainty is captured by the classifier's score value, and originates from noise in the input data. More crucial for our application is quantifying the epistemic uncertainty – that is, the uncertainty in our choice of the neural network's model weights. This epistemic source of error is directly quantifiable through Bayesian neural networks, and in later sections this is the error, confidence, or uncertainty we refer to and attempt to quantify. In the Bayesian framework, this can be achieved by casting model parameters as probability distributions, and using the mechanics of Bayesian statistics to marginalise the neural network output over these distributions, in the process finding the score posterior. In this way, the uncertainty inherent in model selection can be quantified. There are various approximate and exact approaches to achieve this, which we outline below.

Dropout (Srivastava et al. 2014) provides a useful form of regularisation in neural networks. At each training step, a fraction p (a tunable hyperparameter) of neuron weights are randomly set to zero, decreasing the effective number of parameters of the model. In this way, overfitting can be prevented and generalisation accuracy can be increased. In traditional neural networks, dropout is not active at inference time, so that all neurons are used for making predictions.
However, Gal & Ghahramani (2015a) demonstrate the profound result that training and evaluating neural networks with dropout is equivalent to performing the approximate Bayesian inference discussed above, with multiple evaluations being equivalent to Monte Carlo integration of the posterior distribution. This is directly applicable to convolutional neural networks via the Monte Carlo dropout technique (Gal & Ghahramani 2015b; referred to as MCDropout for brevity from now on).

Alternative approaches to uncertainty estimation exist (e.g. Bayes by Backprop, Blundell et al. 2015), which instead perform the approximate Bayesian inference directly by casting neuron weights as distributions with associated hyperparameters, then updating these according to the backpropagated gradients (as in deterministic NNs). In this work, we opt to use MCDropout for computational efficiency and for maximal compatibility with existing network architectures and software. No changes to the training loop are required, and only a simple wrapper is needed at inference to perform multiple predictions with dropout enabled. The only significant additional computational cost of a Bayesian neural network using the MCDropout technique over a deterministic CNN is at inference time, as multiple samples need to be drawn to approximate the posterior. This performance overhead can be mitigated with suitable batching of the dataset.

The ability of neural networks to learn complex, non-linear representations in high-dimensional vector spaces is well known and utilised throughout machine learning. However, estimation of the uncertainty of the products of neural networks is often a barrier to their implementation in scientific applications, where well-grounded determination of errors is important.
MCDropout provides a principled way to introduce this. Although a comparatively new technique, Bayesian neural networks show emerging promise across a variety of astronomical classification and regression tasks – including supernova light curve classification (Möller & de Boissière 2020), efficient learning of galaxy morphology (Walmsley et al. 2020), and age estimation of stars for galactic archaeology (Ciucă et al. 2020).

There is disagreement in the literature on the precise nature of a Bayesian neural network and how to implement it 'properly', from approximate variational inference as used here, to applying some variant of Markov Chain Monte Carlo sampling over the weight and bias parameters of the neural network. However, what is relevant for the implementation in this work is that examples the classifier is unconfident about are assigned lower confidence scores than obviously real/bogus detections. More complex tests, such as confirming that the classifier's confidence matches the actual confidence of the dataset or some human-derived uncertainty score, are beyond the scope of the introductory work presented here.

Whilst these posterior predictions are informative to human vetters, converting them to a single informative summary parameter that captures the overall uncertainty is more useful for integration into pipelines and enabling coarse filtering of candidates. To convert the posterior distributions into meaningful information about the confidence of a given prediction, we utilise the information entropy H. For a binary classification problem, the generic entropy formula can be reduced to

H(p) = −p log p − (1 − p) log(1 − p)

where p is the probability of a given detection being real (the real-bogus score). The entropy is maximised for p = 0.5, where the probability of being real vs. bogus is equal, i.e. the classifier prediction carries no useful information. We define the classifier confidence C in terms of the average entropy of the posterior distribution samples, scaling to confidences in the range [0, 1] with the relation

C = 1 − (1/N) Σ_{i=1}^{N} H_i

where N is the number of posterior samples and H_i is the binary entropy of the i-th posterior sample. This metric is equivalent to the second term of the BALD acquisition function of Houlsby et al. (2011), and is chosen as it is pre-normalised to [0, 1], unlike the standard deviation or similar metrics. Naturally, the uncertainties we derive here are correlated with the actual output score, but the multiple samples provide sufficient dispersion that this metric is useful for assessing model confidence. In future implementations, these raw posterior samples (or some approximating distribution parameters, to reduce data needs) could be fed directly into downstream, more specialised classification tools, enabling them to make use of the real-bogus classifier's probabilistic predictions in their own score/posterior.

One immediate advantage of Bayesian neural networks over deterministic neural networks is the ability to improve classification performance through model ensembling. Figure 6 illustrates the gain in accuracy observed by averaging the predictions of our BNN, as a function of the number of posterior samples. Although small, this is a definite improvement over single-evaluation predictions, and is likely constrained by our dataset. For the majority of positive and negative examples, the model is highly confident about the assigned RB score, so averaging over the posteriors does not improve them significantly. This increase in performance is likely to be greater on more complex (multi-class) classification problems, or in scenarios where significantly less training data is available.
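The confidence metric defined above can be sketched in a few lines (assuming base-2 logarithms, so that H, and hence C, lies in [0, 1]; the posterior samples here are illustrative):

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def confidence(posterior_samples):
    """C = 1 - (1/N) sum_i H_i over MCDropout posterior samples."""
    return 1.0 - np.mean(binary_entropy(np.asarray(posterior_samples)))

# Tightly peaked posterior near 1 -> confident 'real' prediction;
# samples scattered around 0.5 -> the prediction carries little information.
confident = confidence([0.98, 0.99, 0.97, 0.99])
uncertain = confidence([0.45, 0.55, 0.50, 0.60])
```

Because each posterior sample is scored separately before averaging, a classifier whose dropout realisations disagree is penalised even if its mean score is extreme.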
Figure 6.
Classification accuracy on the test set from Section 2.3 as a function of the number of posterior samples averaged. Each point is the average of 10 model runs, with the shaded area corresponding to the 2σ confidence interval. The BCNN quickly recovers the performance of a deterministic CNN within statistical uncertainty (∼99% accuracy, ±0.03%; F1: 0.9877) and provides additional information in the form of confidence. No significant improvement in classification accuracy is obtained beyond 10 samples, remaining consistent out to 50 samples.

any downstream candidate evaluation tools, providing an additional metric to inform decisions. Objects with both a high score and high confidence are very likely to be genuine, and so can be prioritised in human vetting of candidates. This means more time can be spent looking at marginal candidates, while obvious detections can be identified quickly. Confidence provides a metric complementary to the pure real-bogus score, which can help alleviate some of the issues with the poor dynamic range observed in the classifier outputs at low/high scores. Classification is still performed on the consensus real-bogus score derived from the posterior, with the confidence score intended to aid human decision-making. In Figure 7 we illustrate some example candidates, their associated real-bogus scores, and the score posteriors.

Classifier confidence is also a useful tool for the training and development process, providing deeper insight into the functioning of the classifier and the associated training set. Predictive uncertainty provides a useful heuristic for cleaning datasets of mislabelled data. Misclassified detections for which the classifier returns a high confidence are very likely to be mislabelled, as the confidence score partly reflects having seen large numbers of similar detections in the training set. These frames can be actively prioritised in any human relabelling efforts, or fixed cuts on the confidence can be used to perform this in a semi-automated way.
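As a minimal sketch (not the authors' pipeline code; function names and array shapes are illustrative), the entropy-based confidence metric described above can be computed from a set of MC Dropout posterior samples as follows. Base-2 logarithms are assumed, so that H, and hence C, is normalised to [0, 1]:

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Binary information entropy, base 2 so H(p) lies in [0, 1]."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def confidence(posterior_samples):
    """Classifier confidence C = 1 - (1/N) * sum_i H_i.

    posterior_samples: shape (N,) array of real-bogus scores for a single
    detection, each from an independent stochastic forward pass.
    """
    return 1.0 - binary_entropy(np.asarray(posterior_samples)).mean()

# A confident prediction: all samples near 1, so C is close to 1.
c_conf = confidence([0.99, 0.98, 0.995, 0.99])
# An uninformative prediction: samples near 0.5, so C is close to 0.
c_unsure = confidence([0.5, 0.48, 0.52, 0.5])
```

The classification itself would still use the mean of the posterior samples as the consensus real-bogus score; C is a separate summary of how peaked the individual samples are.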
This 'optimal relabelling' scheme provides a method for human vetters and machine learning models to collaboratively and iteratively refine noisy labels. Our label noise arises because humans are imperfect judges of real/bogus and interpret the vetting rubric in different ways, leading to inconsistencies that can harm model performance.

We demonstrate the effectiveness of this procedure on the training set built in this work by training the model first on the uncleaned dataset, then attempting to relabel the misclassified detections in the training and test sets, ordered by decreasing confidence. This amounts to a substantial task of 3580 stamps, which would take a prohibitively long time to relabel by hand, notwithstanding the possibility of human bias in the relabelling. We instead propose a heuristic relabelling scheme based on the BALD score
Figure 7.
A selection of example posteriors, taken from real GOTO data. The majority of predictions are highly confident, so we select examples of increasing confidence score (C) to display here. Plotted is a Gaussian kernel-density estimate constructed from 500 posterior samples. The green line indicates the correct label for each candidate, with the black line indicating the mean of the distribution. The dashed line indicates P_real = 0.5.

of Houlsby et al. (2011), which leverages the simple nature of binary classification. The model is first trained on the 'unclean' dataset generated with the approaches in Section 2.3; the BALD score is then evaluated over the misidentifications in the test and training sets. From here, a new set of labels is derived by flipping the labels of those examples that have a BALD score less than (and thus confidence higher than) the median, effectively accepting the prediction of the classifier over the human vetter. This approach is naturally capable of incorrectly flipping the labels of accurately labelled stamps, but imposing this cut in classifier confidence ensures that the majority of relabelled stamps each round correspond to regions of classifier parameter space that are well covered by the training set, and so are classified at high confidence. This method effectively trades active human labelling time for passive background computational time, and can be applied iteratively, as suggested above, to progressively improve the quality of the dataset labelling. We manually checked a subset of the sources selected to be relabelled to verify these were sensible, and indeed found they were mislabelled
detections that had leaked through the quality cuts we applied. After one round of the heuristic relabelling routine outlined above, the class-balanced accuracy achieved on the classifier test set improved markedly, from ∼98% (±0.02%) to ∼99% (±0.01%), with a corresponding improvement in F1 score.

Machine learning algorithms acquire inherent and often subtle biases based on the training set used in their construction. Given the automated nature of our dataset generation, it is particularly important to verify that performance is consistent across a range of parameters of interest, such as transient magnitude. Some care is required in choosing the test set for evaluating classifier performance in a real-world setting, as the training set has been augmented with both human-labelled data and fully synthetic data. Although a low FPR/FNR on the validation and test data is encouraging, since this data is artificially made more difficult for the classifier to learn, it is not directly representative of the performance we should expect in deployment, as a non-negligible component of it is synthetic. Performance characterisation should therefore be reinforced with extensive testing on representative samples of GOTO data. A particular focus is to confirm that the synthetic augmentation scheme we implement leads to genuine improvements in the classifier's recovery rate of real transient detections. We also emphasise that in the following sections we effectively test the performance of the real-bogus classifier in isolation: the 'real-world' detection efficiency is the product of the efficiencies of multiple pipeline stages, most crucially image subtraction and source extraction. Exploring the impact of these steps is beyond the scope of this paper, and is left to future work.

In the following sections, we use 'accuracy' to refer to the class-balanced accuracy, as it is more appropriate for our mildly imbalanced classification task. We also quote results based on the mean scores of 10 posterior samples (motivated by the saturation observed in Figure 6), since individual evaluations of a Bayesian neural network using MC Dropout are based on weaker classifiers due to the presence of dropout. Typical uncertainties (estimated as the standard deviation) on the metrics below are small.
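As a hedged sketch of the median-confidence label-flipping pass described in the relabelling scheme above (not the authors' code; names, dict layout, and the toy data are invented for illustration):

```python
import numpy as np

def relabel_by_confidence(labels, preds, confidences):
    """One round of heuristic relabelling.

    labels:      (M,) human labels in {0, 1}
    preds:       (M,) classifier real-bogus scores in [0, 1]
    confidences: (M,) per-example confidence C in [0, 1]

    Among the misclassified examples, flips the labels of those whose
    confidence exceeds the median confidence of the misclassified set,
    i.e. accepts the classifier's prediction over the human vetter.
    """
    labels = np.asarray(labels).copy()
    hard = (np.asarray(preds) >= 0.5).astype(int)
    wrong = hard != labels                 # misidentifications only
    conf = np.asarray(confidences)
    cut = np.median(conf[wrong])           # median over the misclassified set
    flip = wrong & (conf > cut)            # high-confidence disagreements
    labels[flip] = hard[flip]
    return labels, flip

# Toy example: indices 0, 3 and 4 disagree with the human label; only
# index 0 is a high-confidence disagreement, so only it is flipped.
labels = [0, 1, 0, 1, 0, 1]
preds = [0.9, 0.95, 0.2, 0.1, 0.6, 0.8]
confs = [0.99, 0.9, 0.9, 0.3, 0.5, 0.8]
new_labels, flipped = relabel_by_confidence(labels, preds, confs)
```

Applied iteratively (retrain, re-score, relabel), this trades active human labelling time for passive computation, as described in the text.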
To provide a more granular view of the classifier performance, we further split the test set into two groups for the purposes of evaluation. The first comprises only the minor planet and random bogus detections. We also test a synthetic transient/galaxy residual test set, to verify that the classifier can genuinely discriminate between galaxies and galaxies hosting transients. This also reveals any strong performance differences between the two main positive classes, which could skew metrics evaluated on the whole dataset. For both test sets, the human-inspected Marshall data is deliberately excluded, since it is significantly more challenging for the classifier than normal detections and does not accurately reflect the true data distribution.

The best-scoring classifier after hyperparameter optimisation shows excellent performance, attaining balanced accuracies of 99.49% (F1: 0.9935) and 99.
19% (F1: 0.9925) on the minor planet and synthetic transient test datasets respectively. Figure 9 illustrates the false positive and false negative rates for the classifier on both the minor planet and transient datasets, as a function of the real-bogus threshold chosen. There is a clear difference in false negative rate between the minor planet and transient datasets, reflecting the increased difficulty associated with the complex host morphology of the transient examples. The classifier displays a notable skew in the FPR/FNR equality point towards lower values. This is a result of the Marshall injections in the training set, which are harder to learn than the random bogus detections due to being misclassified by the previous classifier. This does not affect classification accuracy, and could be corrected by applying a power transform to the classifier output if required, conditioned on the validation set.

Figure 8. Example class-clustering (left) and confidence (right) maps generated from the classifier's test set. Each colour in the left panel represents a specific sub-class of detections (glxresid, marshall, mp, randjunk, syntransient), while colour on the right represents classifier confidence. The top legend gives the classes corresponding to each colour in the left panel. Regions of low confidence in the right panel tend to correspond to cluster boundaries in the left, where there is more uncertainty about which class each example belongs to.

Figure 9. False positive/negative rate evaluated on the test set, excluding Marshall examples. Performance metrics are split between minor planets and synthetic transients. The grey dashed line (MMCE) represents the full-dataset mean misclassification error, which is below 1% between real-bogus scores of 0.1–0.6. Inset: confusion matrix, evaluated on the full test set. There is a slight difference in the false negative rates achieved between the minor planets and synthetic transients, reflecting the increased difficulty posed by complex host morphology and subtraction residuals.

Given the spatially variable optical characteristics present in the GOTO prototype, it is important to confirm that our classifier provides good performance across the full detector, and not simply in the centre where distortion is minimal. In Figure 10 we plot the class-balanced accuracy score as a function of radial position on the detector, using a series of radial bins chosen to equalise source density. These radial bins are scaled by the maximum value (corresponding to the image corner) to provide a scale-free measurement of detector position. Class-balanced accuracy is used here because the real-bogus fraction varies as a function of detector position, and care must be taken to account for this. We find a consistent performance of ∼99% out to a fractional radial distance of 0.7, with a slight drop of 1% at the far edge of the image. This is primarily due to the severe distortion found in the image corners of the GOTO prototype optical tubes, which produces very challenging detections (abnormal PSFs, strong vignetting) for both source extraction and real-bogus classification. Some contribution to this performance decrease likely comes from good-quality sources close to the edge of the image, or close to the edge of the science-template overlap; reliably estimating these sources and their contribution to the numbers in each bin is a complex task. Suffering only a 1% decrease in performance in these extremely challenging conditions demonstrates the overall robustness of the classifier. With the significantly improved optical quality of the GOTO design-specification OTAs, we anticipate that future versions of our classifier trained on data from the upgraded system will display a constant (within statistical error) classification accuracy as a function of detector position.
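A hedged sketch of the radial-binning evaluation described above, using quantile edges so each concentric bin contains equal numbers of sources, and class-balanced accuracy within each bin. Function and variable names are illustrative, not the pipeline's:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Class-balanced accuracy: the mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
    return float(np.mean(recalls))

def radial_accuracy(x, y, y_true, y_pred, n_bins=5, centre=(0.0, 0.0)):
    """Balanced accuracy in concentric radial bins of equal source counts.

    Radii are scaled by the maximum radius (the image corner), giving a
    fractional detector position in [0, 1].
    """
    r = np.hypot(np.asarray(x) - centre[0], np.asarray(y) - centre[1])
    r = r / r.max()
    # Quantile edges place equal numbers of sources in each bin.
    edges = np.quantile(r, np.linspace(0.0, 1.0, n_bins + 1))
    accs, mids = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (r >= lo) & (r <= hi)
        accs.append(balanced_accuracy(np.asarray(y_true)[mask],
                                      np.asarray(y_pred)[mask]))
        mids.append(0.5 * (lo + hi))
    return np.array(mids), np.array(accs)
```

Balanced accuracy is used per bin for the reason given in the text: the real/bogus mixture changes with detector position, so raw accuracy would conflate class balance with performance.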
To provide the most accurate assessment of transient-specific classifier performance, and to further confirm that our algorithmically-generated training set generalises well, we assemble a test set of genuine astrophysical transients.

Figure 10. Class-balanced accuracy evaluated on the test set as a function of detector position. We use a series of concentric radial bins, chosen to contain equal numbers of sources (3441 per bin) for uniform statistics. We scale the radius by the detector size to give a relative picture of performance. The drop in performance at large radial distances is primarily caused by the extreme optical distortion present in the early GOTO prototype; that only a minor drop of 1% in accuracy occurs in these challenging conditions demonstrates the very robust performance of our classifier. With the design-specification GOTO optics, we anticipate this curve will be level within error.

This test set was found by cross-matching a list of all spectroscopically confirmed supernovae reported to the Transient Name Server (TNS) since January 2019 with the GOTO master candidate table. Those with an associated GOTO candidate within 3 arcsec, with TNS discovery magnitude greater than the GOTO source magnitude, and only found in GOTO data after the formal TNS discovery date are accepted. With these cuts, purity is favoured over completeness, a deliberate choice to ensure that the test set is as clean of false positives as possible. This yields 877 known transients recovered in the GOTO prototype data. The whole-sample recovery rate is ∼97%. Figure 11 (top panel) shows the recovery rate as a function of L-band discovery magnitude: the classifier maintains excellent performance across the full magnitude range of detections accessible to GOTO, even towards fainter magnitudes. Our galaxy augmentation scheme provides up to a 30% improvement in recovery rate at the faint end.
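As a hedged illustration of the purity-oriented TNS cross-match cuts described above (not the pipeline's code; the dict layout, field names, and helper functions are invented, while the 3 arcsec, magnitude, and date criteria are those stated in the text):

```python
import numpy as np

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (arcsec) between sky positions in degrees,
    via the haversine formula (ample precision at arcsecond scales)."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    d = 2.0 * np.arcsin(np.sqrt(
        np.sin((dec2 - dec1) / 2.0) ** 2
        + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2.0) ** 2))
    return np.degrees(d) * 3600.0

def accept_candidate(cand, tns):
    """Purity-first cuts: positional match within 3 arcsec, TNS discovery
    magnitude greater (fainter) than the GOTO source magnitude, and the
    GOTO detection occurring only after the formal TNS discovery date."""
    sep = angular_sep_arcsec(cand["ra"], cand["dec"], tns["ra"], tns["dec"])
    return (sep < 3.0
            and tns["disc_mag"] > cand["mag"]
            and cand["jd"] > tns["disc_jd"])
```

In practice a library cross-match (e.g. catalogue matching in astropy) would replace the hand-rolled separation, but the cut logic is the same.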
Figure 11.
Top panel: recovery rate (TPR) as a function of GOTO discovery magnitude, at a fixed real-bogus threshold of 0.5. The dashed line indicates the performance of a classifier with a similarly sized training set, but with only minor planet detections. Error bars are derived directly from the classifier score posteriors. The number of detections per bin is written below each bar; note the sharp drop-off in the number of detections at the faint end. Bottom panel: recovery rate of transients that can be reliably associated with a host galaxy (as cross-matched with WISExSuperCosmos; Bilicki et al. 2016) as a function of host offset. As above, error bars are derived from the classifier score posteriors, and a similarly-sized minor planet-based classifier is plotted for comparison. There is a marked improvement in the recovery rate for very small host offsets, particularly for nuclear transients.

The bottom panel of Figure 11 shows a marked improvement in sensitivity to nuclear transients, considered to be a more difficult transient morphology to detect. Motivated by the typical RMS astrometric noise level of GOTO images, we adopt a fixed threshold of 0.5 arcsec to distinguish between nuclear and offset transients. We find a 13% improvement in recovery rate for nuclear transients.

Although the main transient sources of interest for GOTO will overwhelmingly be fainter than the saturation level, the classifier also performs well at the bright end: for transients with L ≲ 10, 100% are recovered, although small-number statistics limits the usefulness of this metric. This bright-end testing demonstrates the excellent dynamic range of the classifier, showing high (>90%) recovery rates from 10th to 20th magnitude.

Through the host-offset distribution choice we make, we expect to generate a reasonable number of synthetic transients at zero offset, so this region of parameter space should not be empty in the training set. To test performance in this regime we repeated the procedure outlined in Section 2.2, but with the host-offsetting routine disabled, generating synthetic detections overlapping the galaxy nucleus only. This produced 5,100 synthetic nuclear transients, with a magnitude distribution consistent with that in Figure 1. Testing our model against this dataset (with the negative examples being galaxy residuals, as in Section 2.3), we obtain 97.5% accuracy with a high recovery rate (TPR), supporting the use of the RB score as a proxy for P_real (the probability a given source is real) in such implementations.

One significant benefit of using a Bayesian neural network is a built-in indicator of out-of-distribution data, that is, data poorly represented by or unseen in the training set. For input data that is completely different from the training set, the classifier will return a low confidence score, which can then be used to remove or deprioritise the candidate in downstream applications. This confidence can also be used to optimise candidate vetting efforts, with the highest-confidence candidates being a natural choice to prioritise over lower-confidence, lower-quality detections.

Figure 12. Top panel: classifier calibration curve, illustrating how well the classifier's output score corresponds to probability. The mean of 20 samples and the 1σ confidence interval are plotted to show that individual draws from the posterior remain well-calibrated. Bottom panel: score distribution for both real and bogus examples, with the relative scarcity of examples at intermediate RB scores evident.

In principle, the task-specific knowledge encoded in our trained network weights can be used to accelerate the training of similar real-bogus classifiers through transfer learning, and in principle increase generalisation (Yosinski et al. 2014). This requires that the same input data structure is used and that no model hyperparameters are changed. However, we caution that training in this way is susceptible to local minima and does not offer the opportunity to change the model hyperparameters that training from scratch does; in Section 3.1 we demonstrated the sizeable performance improvements that a full hyperparameter search can yield, and so encourage this.

The techniques and framework we implement in this paper are naturally extensible to more challenging astronomical classification tasks, such as those outlined at the end of Section 1.1. A key focus is more fine-grained classification: being able to distinguish variable stars, supernovae, nuclear transients, and other astrophysical objects of interest in an automated (and, crucially, accurate) way. Figure 8 already hints at this being a fruitful approach, as we see evidence of morphological differentiation within both the positive and negative classes through the emergence of smaller sub-clusters. Similarly, leveraging the wealth of contextual information available from astrophysical surveys in a principled, informative, and efficient way within the framework of deep learning poses an open challenge, with potentially significant gains possible.
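A minimal sketch of the reliability-curve computation behind a calibration plot like Figure 12's top panel: bin the output scores, and compare each bin's mean predicted score to the empirical fraction of real examples it contains. Names are illustrative, and this is not the authors' plotting code:

```python
import numpy as np

def calibration_curve(scores, labels, n_bins=10):
    """Reliability curve: for each score bin, the mean predicted score and
    the empirical fraction of positives (P_real) in that bin. A
    well-calibrated classifier lies close to the diagonal."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    mean_score, frac_pos = [], []
    for b in range(n_bins):
        m = idx == b
        if m.any():  # skip empty bins rather than emit NaNs
            mean_score.append(scores[m].mean())
            frac_pos.append(labels[m].mean())
    return np.array(mean_score), np.array(frac_pos)
```

Averaging this curve over repeated posterior draws, as in the figure, would show whether individual stochastic forward passes remain calibrated, not just the ensemble mean.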
We aim to address these challenges, among others, in the development of future generations of the classifier we implement here.

We demonstrate a data-driven approach to generating large, low-contamination training sets, which, along with our novel augmentation scheme, can be used to train high-performance, transient-optimised real-bogus classifiers. By combining real PSFs from minor planets with galaxies, we generate realistic synthetic transients that provide a measurable improvement in the recovery of genuine astrophysical transients. This technique is computationally lightweight, easily implemented, and directly applicable to a variety of both current and future transient survey streams and datasets.

We also demonstrate the efficacy of Bayesian neural networks for the first time in real-bogus classification, and the unique insights that confidence estimation can bring to the real-bogus problem. Being able to assign epistemic confidences to classifier predictions, in addition to the more typical real-bogus score, provides another parameter that human vetters further downstream can use to identify promising candidate detections; this can potentially be used in future to further automate decision-making in the context of follow-up and reporting. Techniques such as this, which minimise human involvement in data gathering and labelling, will become increasingly important in the new 'big-data' era of astronomy that large-scale projects such as the Rubin Observatory and the SKA will bring about.

Our classifier demonstrates excellent performance across a wide magnitude range, with a missed detection rate of 0.5% at a fixed 1% false positive rate, and up to a 30% improvement in the recovery rate of astrophysical transients at the challenging faint end. This has the potential to markedly increase the number of faint transients GOTO can discover, and significantly improves the prospects for detecting the kilonova afterglows of the gravitational-wave driven mergers GOTO was designed to find. We anticipate that improvements to the quality and stability of GOTO's hardware and dataflow will bring significant performance gains for the real-bogus classifier presented here.

GOTO is due to undergo significant expansion over the coming years, with a final configuration of 4 installations spread across a northern (La Palma) and southern (Siding Spring) site, providing a high-cadence datastream covering almost the whole sky down to 20th magnitude every 2–3 days.
The tools developed in this work have generated a classifier that is capable of handling and sifting the accompanying volume of candidate transient detections with robust accuracy and high sensitivity.
ACKNOWLEDGEMENTS
We thank the anonymous referee for their insightful comments, which helped improve the quality of this manuscript. The Gravitational-wave Optical Transient Observer (GOTO) project acknowledges the support of the Monash-Warwick Alliance; Warwick University; Monash University; Sheffield University; the University of Leicester; Armagh Observatory & Planetarium; the National Astronomical Research Institute of Thailand (NARIT); the University of Turku; the University of Manchester; the University of Portsmouth; the Instituto de Astrofísica de Canarias (IAC) and the Science and Technology Facilities Council (STFC). DS, KU, BG and JDL acknowledge support from the STFC via grants ST/T007184/1, ST/T003103/1 and ST/P000495/1. JDL acknowledges support from a UK Research and Innovation Fellowship (MR/T020784/1). RPB, MRK and DMS acknowledge support from the ERC under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 715051; Spiders). POB and RS acknowledge support from the STFC.

This research made use of Astropy, a community-developed core Python package for Astronomy (Astropy Collaboration et al. 2013; Price-Whelan et al. 2018), and scikit-learn (Pedregosa et al. 2011). Data in astorb.dat were originally provided by NASA grant NAG5-4741 (PI E. Bowell) and the Lowell Observatory endowment, and more recently by NASA PDART grant NNX16AG52G (PI N. Moskovitz). This research has made use of IMCCE's SkyBoT VO tool, and of data and/or services provided by the International Astronomical Union's Minor Planet Center.

DATA AVAILABILITY
The gotorb code is made freely available at https://github.com/GOTO-OBS/gotorb, along with validation examples for testing. Accompanying observational data used in this work will be made available via upcoming GOTO public data releases.
REFERENCES
Aartsen M. G., et al., 2017, Journal of Instrumentation, 12, P03012
Abadi M., et al., 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems
Abbott B. P., et al., 2017a, Phys. Rev. Lett., 119, 161101
Abbott B. P., et al., 2017b, ApJ, 848, L12
Ackley K., Eikenberry S. S., Yildirim C., Klimenko S., Garner A., 2019, AJ, 158, 172
Alard C., Lupton R. H., 1998, ApJ, 503, 325
Astropy Collaboration et al., 2013, A&A, 558, A33
Bailey S., Aragon C., Romano R., Thomas R. C., Weaver B. A., Wong D., 2007, ApJ, 665, 1246
Becker A., 2015, HOTPANTS: High Order Transform of PSF ANd Template Subtraction (ascl:1504.004)
Bellm E. C., et al., 2019, PASP, 131, 018002
Berthier J., Vachier F., Thuillot W., Fernique P., Ochsenbein F., Genova F., Lainey V., Arlot J. E., 2006, SkyBoT, a new VO service to identify Solar System objects, p. 367
Berthier J., Carry B., Vachier F., Eggl S., Santerne A., 2016, MNRAS, 458, 3394
Bertin E., Arnouts S., 1996, A&AS, 117, 393
Bilicki M., et al., 2016, ApJS, 225, 5
Bloom J. S., et al., 2012, PASP, 124, 1175
Blundell C., Cornebise J., Kavukcuoglu K., Wierstra D., 2015, arXiv e-prints, p. arXiv:1505.05424
Breiman L., 2001, Machine Learning, 45, 5
Brink H., Richards J. W., Poznanski D., Bloom J. S., Rice J., Negahban S., Wainwright M., 2013, MNRAS, 435, 1047
Cabrera-Vives G., Reyes I., Förster F., Estévez P. A., Maureira J.-C., 2017, ApJ, 836, 97
Carrasco-Davis R., et al., 2020, arXiv e-prints, p. arXiv:2008.03309
Chambers K. C., et al., 2016, arXiv e-prints, p. arXiv:1612.05560
Chetlur S., Woolley C., Vandermersch P., Cohen J., Tran J., Catanzaro B., Shelhamer E., 2014, arXiv e-prints, p. arXiv:1410.0759
Chollet F., et al., 2015, Keras, https://keras.io
Ciucă I., Kawata D., Miglio A., Davies G. R., Grand R. J. J., 2020, arXiv e-prints, p. arXiv:2003.03316
Dálya G., et al., 2018, MNRAS, 479, 2374
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Duev D. A., et al., 2019, MNRAS, 489, 3582
Fawcett T., 2006, Pattern Recognition Letters, 27, 861
Filippenko A. V., Li W. D., Treffers R. R., Modjaz M., 2001, in Paczynski B., Chen W.-P., Lemme C., eds, Astronomical Society of the Pacific Conference Series Vol. 246, IAU Colloq. 183: Small Telescope Astronomy on Global Scales. p. 121
Fossey S. J., Cooke B., Pollack G., Wilde M., Wright T., 2014, Central Bureau Electronic Telegrams, 3792, 1
Gal Y., Ghahramani Z., 2015a, arXiv e-prints, p. arXiv:1506.02142
Gal Y., Ghahramani Z., 2015b, arXiv e-prints, p. arXiv:1506.02158
Gal Y., Islam R., Ghahramani Z., 2017, arXiv e-prints, p. arXiv:1703.02910
Gieseke F., et al., 2017, MNRAS, 472, 3101
Goldstein D. A., et al., 2015, AJ, 150, 82
Gompertz B. P., et al., 2020, MNRAS, 497, 726
Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., 2014, arXiv e-prints, p. arXiv:1406.2661
Heinze A. N., et al., 2018, AJ, 156, 241
Houlsby N., Huszár F., Ghahramani Z., Lengyel M., 2011, arXiv e-prints, p. arXiv:1112.5745
IceCube Collaboration et al., 2018, Science, 361, eaat1378
Ivezić Ž., et al., 2019, ApJ, 873, 111
Kendall A., Gal Y., 2017, arXiv e-prints, p. arXiv:1703.04977
Kingma D. P., Ba J., 2014, arXiv e-prints, p. arXiv:1412.6980
Kulkarni S. R., 2012, arXiv e-prints, p. arXiv:1202.2381
Law N. M., et al., 2009, PASP, 121, 1395
LeCun Y., Bengio Y., et al., 1995, The Handbook of Brain Theory and Neural Networks, 3361, 1995
LeCun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436
LeNail A., 2019, The Journal of Open Source Software, 4, 747
Leaman J., Li W., Chornock R., Filippenko A. V., 2011, MNRAS, 412, 1419
Li W., et al., 2011, MNRAS, 412, 1441
Li L., Jamieson K., DeSalvo G., Rostamizadeh A., Talwalkar A., 2017, The Journal of Machine Learning Research, 18, 6765
Lintott C. J., et al., 2008, MNRAS, 389, 1179
Maaten L. v. d., Hinton G., 2008, Journal of Machine Learning Research, 9, 2579
Mahabal A., et al., 2019, PASP, 131, 038002
Mariani G., Scheidegger F., Istrate R., Bekas C., Malossi C., 2018, arXiv e-prints, p. arXiv:1803.09655
Meegan C., et al., 2009, ApJ, 702, 791
Möller A., de Boissière T., 2020, MNRAS, 491, 4277
Mong Y.-L., et al., 2020, arXiv e-prints, p. arXiv:2008.10178
Moskovitz N., Schottland R., Burt B., Wasserman L., Mommert M., Bailen M., Grimm S., 2019, in EPSC-DPS Joint Meeting 2019. pp EPSC–DPS2019–644
Niculescu-Mizil A., Caruana R., 2005, in Proceedings of the 22nd International Conference on Machine Learning. pp 625–632
O'Malley T., Bursztein E., Long J., Chollet F., Jin H., Invernizzi L., et al., 2019, Keras Tuner, https://github.com/keras-team/keras-tuner
Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
Perlmutter S., et al., 1999, ApJ, 517, 565
Pian E., et al., 2017, Nature, 551, 67
Price-Whelan A. M., et al., 2018, AJ, 156, 123
Reyes E., Estévez P. A., Reyes I., Cabrera-Vives G., Huijse P., Carrasco R., Forster F., 2018, in 2018 International Joint Conference on Neural Networks (IJCNN). pp 1–8
Rhodes B., 2019, Skyfield: High precision research-grade positions for planets and Earth satellites generator (ascl:1907.024)
Romano R. A., Aragon C. R., Ding C., 2006, in 2006 5th International Conference on Machine Learning and Applications (ICMLA'06). pp 77–82
Shappee B. J., et al., 2014, ApJ, 788, 48
Simonyan K., Zisserman A., 2014, arXiv e-prints, p. arXiv:1409.1556
Singer L. P., et al., 2015, ApJ, 806, 52
Smith K. W., et al., 2020, PASP, 132, 085002
Soumagnac M. T., Ofek E. O., 2018, PASP, 130, 075002
Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., 2014, The Journal of Machine Learning Research, 15, 1929
Steeghs D., et al., 2021, The Gravitational-wave Optical Transient Observer (GOTO): prototype performance and prospects for transient science, in prep.
Tanvir N. R., et al., 2009, Nature, 461, 1254
Tompson J., Goroshin R., Jain A., LeCun Y., Bregler C., 2015, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 648–656
Tonry J. L., et al., 2018, PASP, 130, 064505
Turpin D., et al., 2020, MNRAS
Villar V. A., Berger E., Metzger B. D., Guillochon J., 2017, ApJ, 849, 70
Walmsley M., et al., 2020, MNRAS, 491, 1554
Wozniak P. R., 2000, Acta Astron., 50, 421
Wright D. E., et al., 2015, MNRAS, 449, 451
Yip K. H., et al., 2019, in AAS/Division for Extreme Solar Systems Abstracts. p. 305.04
Yosinski J., Clune J., Bengio Y., Lipson H., 2014, arXiv e-prints, p. arXiv:1411.1792
Zackay B., Ofek E. O., Gal-Yam A., 2016, ApJ, 830, 27

This paper has been typeset from a TEX/LaTeX file prepared by the author.