Galaxy Zoo DECaLS: Detailed Visual Morphology Measurements from Volunteers and Deep Learning for 314,000 Galaxies
Mike Walmsley, Chris Lintott, Tobias Géron, Sandor Kruk, Coleman Krawczyk, Kyle W. Willett, Steven Bamford, William Keel, Lee S. Kelvin, Lucy Fortson, Karen L. Masters, Vihang Mehta, Brooke D. Simmons, Rebecca Smethurst, Elisabeth M. Baeten, Christine Macmillan
MNRAS, 1–18 (2021). Preprint 18 February 2021. Compiled using MNRAS LaTeX style file v3.0
Affiliations: Oxford Astrophysics, Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford OX1 3RH, UK; European Space Agency, ESTEC, Keplerlaan 1, NL-2201 AZ Noordwijk, The Netherlands; Institute of Cosmology and Gravitation, University of Portsmouth, Dennis Sciama Building, Burnaby Road, Portsmouth PO1 3FX, UK; School of Physics and Astronomy, University of Minnesota, 116 Church St SE, Minneapolis, MN 55455, USA; Centre for Astronomy & Particle Theory, School of Physics and Astronomy, University of Nottingham, University Park, Nottingham NG7 2RD, UK; Dept. of Physics and Astronomy, University of Alabama, Tuscaloosa, AL 35487, USA; Department of Astrophysical Sciences, Princeton University, 4 Ivy Lane, Princeton, NJ 08544, USA; Minnesota Institute for Astrophysics, University of Minnesota, 116 Church St SE, Minneapolis, MN 55455, USA; Department of Physics and Astronomy, Haverford College, 370 Lancaster Avenue, Haverford, PA 19041, USA; Department of Physics, Lancaster University, Bailrigg, Lancaster LA1 4YB, UK; Citizen Scientist, Zooniverse c/o University of Oxford, Keble Road, Oxford OX1 3RH, UK
Last updated XXX; in original form XXX
ABSTRACT
We present Galaxy Zoo DECaLS: detailed visual morphological classifications for Dark Energy Camera Legacy Survey images of galaxies within the SDSS DR8 footprint. Deeper DECaLS images (𝑟 = 23.6, compared with 𝑟 = 22.2 from SDSS) reveal morphology not previously visible, particularly faint features such as tidal debris and weak bars.

Key words: methods: data analysis, galaxies: bar, galaxies: bulges, galaxies: disc, galaxies: interaction, galaxies: general
★ Contact e-mail: [email protected]

Morphology is a key driver and tracer of galaxy evolution. For example, bars are thought to move gas inwards (Sakamoto et al. 1999), driving and/or shutting down star formation (Sheth et al. 2004; Jogee et al. 2005), and bulges are linked to global quenching (Masters et al. 2011; Fang et al. 2013; Bluck et al. 2014) and inside-out quenching (Spindler et al. 2017; Lin et al. 2019). Morphology also traces other key drivers, such as the merger history of a galaxy. Mergers support galaxy assembly (Wang et al. 2011; Martin et al. 2018), though their relative contribution is an open question (Casteels et al. 2014), and may create tidal features, bulges, and disks, allowing past mergers to be identified (Hopkins et al. 2010; Fontanot et al. 2011; Kaviraj 2014; Brooks & Christensen 2015).

Unpicking the complex interplay between morphology and galaxy evolution requires measurements of detailed morphology in large samples. While modern surveys reveal exquisite morphological detail, they image far more galaxies than scientists can visually classify. Galaxy Zoo solves this problem by asking members of the public to volunteer as ‘citizen scientists’ and provide classifications through a web interface. Galaxy Zoo has provided morphology measurements for surveys including SDSS (Lintott et al. 2008; Willett et al. 2013) and large HST programs (Simmons et al. 2017b; Willett et al. 2017).

Knowing the morphology of homogeneous samples of hundreds of thousands of galaxies supports science only possible at scale. The catalogues produced by the collective effort of Galaxy Zoo volunteers have been used as the foundation of a large number of studies of galaxy morphology (see Masters 2019 for a review), with the method's ability to provide estimates of confidence alongside classification especially valuable. Galaxy Zoo measures subtle effects in large populations (Masters et al. 2010; Willett et al. 2015; Hart et al. 2017); identifies unusual populations that challenge standard astrophysics (Simmons et al. 2013; Tojeiro et al. 2013; Kruk et al. 2017); and finds unexpected and interesting objects that provide unique data on broader galaxy evolution questions (Lintott et al. 2009; Cardamone et al. 2009; Keel et al. 2015).

Here, we present the first volunteer classifications of galaxy images collected by the Dark Energy Camera Legacy Survey (DECaLS; Dey et al. 2018). This work represents the first systematic engagement of volunteers with low-redshift images as deep as those provided by DECaLS, and thus a more reliable catalogue of detailed morphology than has hitherto been available. These detailed classifications include the presence and strength of bars and bulges, the count and winding of spiral arms, and indications of recent or ongoing mergers. Our volunteer classifications were sourced over three separate Galaxy Zoo DECaLS (GZD) classification campaigns (GZD-1, GZD-2, and GZD-5), which classified galaxies first released in DECaLS Data Releases 1, 2, and 5 respectively. The key practical differences are that GZD-5 uses an improved decision tree aimed at better identification of mergers and weak bars, and includes galaxies with just 5 total votes as well as galaxies with 40 or more. Across all campaigns, we collected 7,496,325 responses from Galaxy Zoo volunteers, recording 30 or more classifications in at least one campaign for 139,919 galaxies and fewer (approximately 5 classifications) for an additional 173,870 galaxies, totalling 313,789 classified galaxies.

For the first time in a Galaxy Zoo data release, we also provide automated classifications made using Bayesian deep learning (Walmsley et al. 2020). By using our volunteer classifications to train a deep learning algorithm, we can make detailed classifications for all 313,789 galaxies in our target sample, providing morphology measurements faster than would be possible by relying on volunteers alone. Bayesian deep learning allows us to learn from uncertain volunteer responses and to estimate the uncertainty of our predictions. Our classifier predicts posteriors for how volunteers would have answered all decision tree questions (excluding the final ‘Is there anything odd?’ question, as it is multiple-choice), with an accuracy comparable to asking 5 to 15 volunteers, depending on the question, and, for some questions, exceeding 99% accuracy on galaxies where the volunteers are confident (volunteer vote fractions below 0.2 or above 0.8).

In Section 2, we describe the observations used and the creation of RGB images suitable for classification. In Section 3, we give an overview of the volunteer classification process and detail the new decision trees used. In Section 4, we investigate the effects of improved imaging and improved decision trees, and we compare our results to other morphological measurements. Then, in Section 5, we describe the design and performance of our automated classifier: an ensemble of Bayesian convolutional neural networks. Finally, in Section 6, we provide guidance (and example code) for effective use of the classifications.
Our galaxy images are created from data collected by the DECaLS survey (Dey et al. 2018). DECaLS uses the Dark Energy Camera (DECam; Flaugher et al. 2015) at the 4m Blanco telescope at Cerro Tololo Inter-American Observatory, near La Serena, Chile. DECam has a roughly hexagonal 3.2 square degree field of view with a pixel scale of 0.262 arcsec per pixel. The median point spread function FWHM is 1.″29, 1.″18, and 1.″11 for 𝑔, 𝑟, and 𝑧, respectively.

The DECaLS survey contributes targeting images for the upcoming Dark Energy Spectroscopic Instrument (DESI). DECaLS is responsible for the DESI footprint in the Southern Galactic Cap (SGC) and the 𝛿 ≤ 34° region of the Northern Galactic Cap (NGC), totalling 10,480 square degrees. 1130 square degrees of the SGC DESI footprint are already being imaged by DECam through the Dark Energy Survey (DES; The Dark Energy Survey Collaboration 2005), so DECaLS does not repeat this part of the DESI footprint. DECaLS implements a 3-pass strategy to tile the sky. Each pass is slightly offset (approx. 0.1-0.6 deg). The choice of pass and exposure time for each observation is optimised in real time based on the observing conditions recorded for the previous targets, as well as the interstellar dust reddening, sky position, and estimated observing conditions of possible next targets. This allows a near-uniform depth across the survey. In DECaLS DR1, DR2, and DR5, from which our images are drawn, the median 5𝜎 point source depths for areas with 3 observations were approximately (AB) 𝑔 = 24.65, 𝑟 = 23.61, and 𝑧 = 22.84. The DECaLS survey completed observations in March 2019.

We identify galaxies in the DECaLS imaging using the NASA-Sloan Atlas v1.0.0 (NSA). As the NSA was derived from SDSS DR8 imaging (Aihara et al. 2011), this data release only includes galaxies which are within both the DECaLS and SDSS DR8 footprints. In effect, we are using deeper DECaLS imaging of the galaxies previously imaged in SDSS DR8. This ensures our morphological measurements have a wealth of ancillary information derived from SDSS and related surveys, and allows us to measure any shift in classifications vs. Galaxy Zoo 2 using the subset of SDSS DR8 galaxies classified both in this work and in Galaxy Zoo 2 (Sec. 4). Figure 1 shows the resulting GZ DECaLS sky coverage. NSA v1.0.0 was not published, but the values of the columns used here are identical to those in NSA v1.0.1, released in SDSS DR13 (Albareti et al. 2017); only the column naming conventions are different.

Selecting galaxies with the NSA introduces two implicit cuts. First, the NSA primarily includes galaxies brighter than 𝑚𝑟 = 17.77. Galaxies fainter than 𝑚𝑟 = 17.
77 are included only if they are in deeper survey areas (e.g. Stripe 82) or were measured using ‘spare’ fibres after all brighter galaxies in a given field were covered; we suggest researchers enforce their own magnitude cut according to their science case. (The remaining DESI footprint is being imaged by DECaLS’ companion surveys, MzLS and BASS; Dey et al. 2018.)
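Such a science-case-specific magnitude cut takes only a few lines. In this sketch, the rows and the field name `mag_r` are hypothetical placeholders rather than the released catalogue schema:

```python
# Hypothetical rows from a galaxy catalogue; "mag_r" is a placeholder
# for the catalogue's r-band magnitude column.
catalogue = [
    {"iauname": "J000001", "mag_r": 16.2},
    {"iauname": "J000002", "mag_r": 17.9},
    {"iauname": "J000003", "mag_r": 14.8},
]

# Enforce a magnitude cut appropriate to the science case, here the
# SDSS spectroscopic limit of m_r = 17.77 discussed above.
bright = [row for row in catalogue if row["mag_r"] < 17.77]
```

The same pattern applies to any other per-galaxy cut a researcher wishes to impose.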
Figure 1.
Sky coverage of GZ DECaLS (equatorial coordinates), resulting from the imaging overlap of DECaLS DR5 and SDSS DR8, shown in red. Darker areas indicate more galaxies. Sky coverage of Galaxy Zoo 2, which used images sourced from SDSS DR7, shown in light blue. The NSA includes galaxies imaged by SDSS DR8, including galaxies newly imaged at the Southern Galactic Cap (approx. 2500 deg²).

Second, the NSA only covers redshifts of 𝑧 = 0.15 or below. To these implicit cuts, we add an explicit cut requiring a Petrosian radius (PETROTHETA) of at least 3 arcseconds, to ensure the galaxy is sufficiently extended for meaningful classification.

For each galaxy, if the coordinates had been imaged in the 𝑔, 𝑟 and 𝑧 bands, and the galaxy passed the selection cuts above, we acquired a combined FITS cutout of the 𝑔𝑟𝑧 bands and used it to create a 424 × 424 pixel square galaxy image. GZD-1 and GZD-2 acquired 424 × 424 pixel square FITS cutouts directly from the cutout service. To ensure that galaxies typically fit well within a 424 pixel image, cutouts were downloaded with an interpolated pixel scale 𝑠 of

𝑠 = max(min(𝑝50 × 0.04, 𝑝90 × 0.02), 0.1)    (1)

where 𝑝50 is the Petrosian 50%-light radius and 𝑝90 is the Petrosian 90%-light radius.

For GZD-5, to avoid banding artifacts caused by the interpolation method of the DECaLS cutout service, each FITS image was downloaded at the fixed native telescope resolution of 0.262 arcsec per pixel, with enough pixels to cover the same area as 424 pixels at the interpolated pixel scale 𝑠. These individually-sized FITS images were then resized locally up to the interpolated pixel scale 𝑠 by Lanczos interpolation (Lanczos 1938). Image processing is otherwise identical between campaigns. Galaxies with incomplete imaging, defined as more than 20% missing pixels in any band, were discarded. For GZD-1/2, 92,960 of 101,252 galaxies had complete imaging (91.8%). For GZD-5, 216,106 of 247,746 galaxies not in DECaLS DR1/2 had complete imaging (87.2%).

(Cutouts were downloaded up to a maximum of 512 pixels per side; highly extended galaxies were downloaded at reduced resolution such that the FITS had exactly 512 pixels per side. Note that these galaxy counts do not sum to the total number of galaxies classified across both campaigns because some galaxies are shared between campaigns.)

We convert the measured 𝑔𝑟𝑧 fluxes into RGB images. To use the 𝑔𝑟𝑧 bands as RGB colours, we multiply the flux values in each band by 125.0, 71.43, and 52.63, respectively. These numbers were chosen by eye such that typical subjects show an appropriate range of colour once mapped to RGB channels.

For background pixels with very low flux, and therefore high variance in the proportion of flux per band, naively colouring by the measured flux creates a speckled effect (Willett et al. 2017). As an extreme example, a pixel with 1 photon in the 𝑔 band and no photons in 𝑟 or 𝑧 would be rendered entirely red.
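The size-dependent pixel-scale rule of Equation 1 can be sketched as a small function. The scale factors and the 0.1 arcsec floor below reflect our reading of Equation 1 and should be treated as illustrative; check the released pipeline code before reuse:

```python
def interpolated_pixel_scale(p50, p90, min_scale=0.1):
    """Pixel scale (arcsec/pixel) for a 424-pixel cutout.

    p50, p90: Petrosian 50%- and 90%-light radii in arcsec.
    The scale factors 0.04 and 0.02 and the floor are our reading of
    Equation 1; treat them as assumptions, not the released pipeline.
    """
    return max(min(p50 * 0.04, p90 * 0.02), min_scale)
```

Large galaxies get a coarser scale so they still fit in the 424-pixel frame, while the floor stops small galaxies being oversampled beyond the native resolution.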
To remove these colourful speckles, we desaturate pixels with very low flux. We first estimate the total per-pixel photon count 𝑁, assuming an exposure time of 90 seconds per band and a mean photon wavelength of 600 nm. Poisson statistics imply the standard deviation on the total mean flux in that pixel is proportional to √𝑁. For pixels with a standard deviation below 100, we scale the per-band deviation from the mean per-pixel flux by a factor of 1% of the standard deviation. The effect is to reduce the saturation of low-flux pixels in proportion to the standard deviation of the total flux. Mathematically, we set

𝑋′_ijc = 𝑋_ij + 𝛼(𝑋_ijc − 𝑋_ij), where 𝛼 = min(0.01 √(𝑋_ij 𝑇/𝜆), 1)    (2)

where 𝑋_ijc and 𝑋′_ijc are the flux at pixel 𝑖𝑗 in channel 𝑐 before and after desaturation, 𝑋_ij is the mean flux across bands at pixel 𝑖𝑗, 𝑇 is the mean exposure time (here, 90 seconds) and 𝜆 is the mean photon wavelength (here, 600 nm).

Pixel values were scaled by sinh⁻¹(𝑥) to compensate for the high dynamic range typically found in galaxy flux, creating images which can show both bright cores and faint outer features. To remove the very brightest and darkest pixels, we linearly rescale the pixel values to lie on the (−0.5, 255.5) interval and then clip them to 0 and 255 respectively. We use these final values to create an RGB image using pillow (Kemenade et al. 2020).

The images will be available on Zenodo at https://dx.doi.org/10.5281/zenodo.4196267. As of this arXiv preprint, the images have not yet been uploaded.

Volunteer classifications for GZ DECaLS were collected during three campaigns. GZD-1 and GZD-2 classified all 99,109 galaxies passing the criteria above from DECaLS DR1 and DR2, respectively. GZD-1 ran from September 2015 to February 2016, and GZD-2 from April 2016 to February 2017. GZD-5 classified 262,000 DECaLS DR5-only galaxies passing the criteria above. GZD-5 ran from March 2017 to October 2020.
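The desaturation of Equation 2 and the arcsinh stretch described in Section 2 can be sketched with NumPy. The normalisation and clipping details below are assumptions for illustration, not the released pipeline:

```python
import numpy as np

def desaturate_low_flux(img, exposure_s=90.0, wavelength_nm=600.0):
    """Reduce colour saturation of low-signal pixels (cf. Equation 2).

    img: array of shape (H, W, 3) holding flux per band.
    """
    mean_flux = img.mean(axis=2, keepdims=True)           # mean across bands
    # Rough per-pixel photon count and its Poisson standard deviation.
    photons = np.clip(mean_flux * exposure_s / wavelength_nm, 0, None)
    alpha = np.minimum(0.01 * np.sqrt(photons), 1.0)      # Eq. 2 weight
    # alpha -> 1 for bright pixels (colour kept), -> 0 for faint (grey).
    return mean_flux + alpha * (img - mean_flux)

def to_display(img):
    """arcsinh-stretch and map to 8-bit values (bounds are assumptions).

    Assumes a non-constant input image.
    """
    stretched = np.arcsinh(img)
    lo, hi = stretched.min(), stretched.max()
    scaled = (stretched - lo) / (hi - lo) * 255.0
    return np.clip(scaled, 0, 255).astype(np.uint8)
```

A faint one-band pixel comes out nearly grey, while a bright pixel keeps its measured colour, which is exactly the behaviour the text describes.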
GZD-5 used more complex retirement criteria aimed at improving our automated classification (Sec. 3.1) and an improved decision tree aimed at better identification of weak bars and minor mergers (Sec. 4.2).

This iteration of the Galaxy Zoo project used the infrastructure made available by the Zooniverse platform; in particular, the open source Panoptes platform (The Zooniverse Team 2020). The platform allows for the rapid creation of citizen science projects, and presents participating volunteers with one image at a time from a subject set, chosen either randomly or through the criteria described in Sec. 3.1.
How many volunteer classifications should each galaxy receive? Ideally, all galaxies would receive enough classifications to be confident in the average response (i.e. the vote fraction) while still classifying all the target galaxies within a reasonable timeframe. However, the size of modern surveys makes this increasingly impractical. Collecting 40 volunteer classifications for all 314,000
Figure 2. GZD-1, GZD-2 and GZD-5 classification counts, excluding implausible classifications (Sec. 4.3.1). GZD-1 galaxies have approximately 40-60 classifications, GZD-2 approximately 40, and GZD-5 either approximately 5 or approximately 30-40. 5.9% of GZD-5 galaxies received more than 40 classifications due to mistaken duplicate uploads.

galaxies in this data release would have taken around eight years without further promotion efforts. The larger data sets of future surveys will only be more challenging. In anticipation of future classification demands, we have therefore implemented a variable retirement rate here (motivated and described further in Walmsley et al. 2020). Unlike previous data releases, GZ DECaLS galaxies each received different numbers of classifications (Figure 2). Beginning part-way through GZD-5, we prioritise classifications for the galaxies expected to most improve our machine learning models, and rely more heavily on those models for classifying the remainder.

For GZD-1 and GZD-2, all galaxies received at least 40 classifications (as with previous data releases). GZD-1 galaxies have between 40 and 60 classifications, selected at random, while GZD-2 galaxies all have approximately 40. For GZD-5, galaxies classified until June 2019 also received approximately 40 classifications. From June 2019, we introduced an active learning system. Using active learning, galaxies expected to be the most informative for training our deep learning model received 40 classifications, and all remaining galaxies received at least 5 classifications. Galaxies receiving 5 classifications of ‘artifact’ were retired at that point.

By ‘most informative’, we mean the galaxies which, if classified, would most improve the performance of our model. We describe our method for estimating which galaxies would be most informative in Walmsley et al. (2020).
Briefly, we use a convolutional neural network to make repeated predictions for the probability that 𝑘 of 𝑁 total volunteers select a given answer. For each prediction, we randomly permute the network with MC Dropout (Gal 2016), approximating (roughly) training many networks to make predictions on the same dataset. It can be shown that, under some assumptions, the most informative galaxies will be those with confidently different predictions under each MC Dropout permutation; that is, where the permuted networks confidently disagree (Houlsby 2014). We emphasise that the number of classifications each galaxy received under active learning is not random. For details on handling this and other selection effects, see Sec. 6.

The questions and answers we ask our volunteers define the measurements we can publish. It is therefore critical that the Galaxy Zoo decision tree matches the science goals of the research community. The questions in a given Galaxy Zoo workflow are designed to be answerable even by a classifier with little or no astrophysical background. This motivates a focus primarily on the appearance of the galaxy, rather than incorporating physical interpretations which would require prior knowledge of galaxies. As an example, the initial question in all decision trees from Galaxy Zoo 2 onwards has asked the viewer to distinguish primarily between “smooth” and “featured” galaxies, rather than “elliptical” and “disk” galaxies. This distinction between descriptive and interpretive classification is not always perfectly enforced. For example, the “features” response to the initial question is worded as “features or disk”, and a later question asks whether the galaxy is “merging or disturbed”, which requires some interpretation. To aid classifiers, all iterations of Galaxy Zoo have therefore included illustrative icons in the classification interface.
Additional help is also available; in the current project, the interface includes a brief tutorial, a detailed field guide with multiple examples of each type of galaxy, and specific help text available for each individual classification task.

The largest workflow change between Galaxy Zoo versions was between the original Galaxy Zoo (GZ1) and Galaxy Zoo 2 (GZ2). GZ1 presented classifiers with a single task per galaxy: a choice between smooth/elliptical, multiple versions of featured/disk (including edge-on, face-on, and directionality of spiral structure), and merger. GZ2 re-classified the brightest quarter of the GZ1 sample in much greater detail, using a branched, multi-task decision tree. Subsequent changes to the decision tree for different versions of Galaxy Zoo have been mostly iterative in nature, driven in part by the data itself and in part by experience-based reflection which revealed minor adjustments that could help classifiers provide more accurate information. As an example of the former, a new branch was added for GZ-Hubble and GZ-CANDELS to capture information on star-forming clumps in classifications of higher-redshift galaxies. As an example of the latter, the final two tasks of GZ2 have been adjusted over multiple versions to facilitate reliable identification of rare features. Such adjustments have generally been minimised to avoid complicating comparisons with previous campaigns.

The decision tree used for GZD-1 and GZD-2 has three modifications vs. the Galaxy Zoo 2 decision tree (Willett et al. 2013). The ‘Can’t Tell’ answer to ‘How many spiral arms are there?’ was removed, the number of answers to ‘How prominent is the central bulge?’ was reduced from four to three, and ‘Is the galaxy currently merging, or is there any sign of tidal debris?’ was added as a standalone question.

For GZD-5, we made three further changes. Several Galaxy Zoo studies (e.g. Skibba et al. 2012; Masters et al. 2012; Willett et al. 2013; Kruk et al.
2018) found that galaxies selected with 0.2 < 𝑝_bar < 0.5 in GZ2 correspond to ‘weak bars’ when compared with expert classifications such as those in Nair & Abraham (2010). Therefore, to increase the detection of bars, we changed the possible answers to the “Does this galaxy have a bar?” question from ‘Yes’ or ‘No’ to ‘Strong’, ‘Weak’ or ‘No’. We define a strong bar as one that is clearly visible and extends across a large fraction of the size of the galaxy. A weak bar is smaller and fainter relative to the galaxy, and can appear more oval than a strong bar, while still being longer in one direction than the other. Our definition of strong vs. weak bar is similar to that of Nair & Abraham (2010), with the exception that they also have an ‘intermediate’ classification. We
added examples of galaxies with ‘weak bars’ to the Field Guide and provided a new icon for this classification option, as shown in Figure 3.

Second, to allow for more fine-grained measurements of bulge size, we increased the number of “How prominent is the central bulge?” answers from three (‘No’, ‘Obvious’, ‘Dominant’) to five (‘No Bulge’, ‘Small’, ‘Moderate’, ‘Large’, ‘Dominant’). We also re-included the ‘Can’t Tell’ answer.

Third, we modified the ‘Merging’ question from ‘Merging’, ‘Tidal’, ‘Both’, or ‘None’ to the more phenomenological ‘Merging’, ‘Major Disturbance’, ‘Minor Disturbance’, or ‘No’. Our goal was to present more direct answers to our volunteers and to better distinguish major and minor mergers, supporting recent scientific interest in the role of major and minor mergers in mass assembly (López-Sanjuan et al. 2010; Kaviraj 2013), black hole accretion (Alexander & Hickox 2012; Simmons et al. 2017a), and morphology (Hopkins et al. 2009; Lotz et al. 2011; Lofthouse et al. 2017).
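The branched structure described above lends itself to a simple programmatic representation. The sketch below encodes a handful of GZD-5 questions as a mapping from answers to follow-up questions; the identifiers are abbreviations and the routing between questions is an illustrative assumption, not the released workflow definition:

```python
# Partial, illustrative sketch of the GZD-5 decision tree. A value of
# None means that answer ends the branch.
gzd5_tree = {
    "smooth-or-featured": {
        "smooth": "how-rounded",
        "featured-or-disk": "disk-edge-on",
        "artifact": None,
    },
    "disk-edge-on": {
        "yes-edge-on-disk": "edge-on-bulge",
        "no-something-else": "bar",
    },
    # GZD-5 splits the old Yes/No bar answers into Strong/Weak/No.
    "bar": {answer: "spiral-arms" for answer in ("strong", "weak", "no")},
    # Bulge prominence expanded from three answers to five in GZD-5.
    "bulge-prominence": {
        answer: "merging"
        for answer in ("no-bulge", "small", "moderate", "large", "dominant")
    },
    # The GZD-5 merging question is phenomenological.
    "merging": {
        answer: None
        for answer in ("merging", "major-disturbance", "minor-disturbance", "no")
    },
}

def next_question(question, answer):
    """Route one volunteer response through the (partial) tree."""
    return gzd5_tree[question][answer]
```

Representing the tree as data rather than code makes it straightforward to compare GZD-1/2 and GZD-5 trees programmatically.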
We made this final “merger” change two months after launching GZD-5; 6,722 GZD-5 galaxies (2.7%) were fully classified before that date and so do not have responses from volunteers to this question.

We also made several improvements to the illustrative icons shown for each answer. These icons are the most visible guide for volunteers as to what each answer means (complementing the tutorial, help text, field guide, and ‘Talk’ forum). Figure 3 shows the GZD-5 decision tree with new icons as shown to volunteers. The decision tree used in GZD-1 and GZD-2 is shown in Figure B1.

For the ‘Smooth or Featured?’ question, we changed the ‘Smooth’ icon to include three example galaxies at various ellipticities, and the ‘Featured’ icon to include an edge-on disk rather than a ring galaxy. For ‘Edge On?’, we replaced the previous tick icon with a new descriptive icon, and the previous cross icon with the ‘Smooth’ icon above. We also modified the text to no longer specify ‘exactly’ edge on, and renamed the answers from ‘Yes’ and ‘No’ to ‘Yes - Edge On Disk’ and ‘No - Something Else’. For ‘Bulge?’, we created new icons to match the change from four to five answers. For ‘Bar?’, we replaced the previous tick and cross icons with new descriptive icons for ‘Strong Bar’, ‘Weak Bar’ and ‘No Bar’. For ‘Merger?’, we added new descriptive icons to match the updated answers.

Changes to the decision tree complicate comparisons with other Galaxy Zoo projects. As we show in the following sections, the available answers will affect the sensitivity of volunteers to certain morphological features, and so morphology measurements made with different decision trees may not be directly comparable. This difficulty in comparison has historically required us to be conservative in our changes to the decision tree. However, the advent of effective automated classifications allows us to retrospectively make classifications using any preferred decision tree.
Specifically, in this work, we train our automated classifier to predict what volunteers would have said using the GZD-5 decision tree, for galaxies which were originally classified by volunteers using the GZD-1/2 decision tree (Section 5.1).

Figure 3. Classification decision tree for GZD-5, with new icons as shown to volunteers. Questions shaded with the same colours are at the same level of branching in the tree; grey questions have zero dependent questions, green one, blue two, and purple three.

The images used in GZ DECaLS are deeper and higher resolution than were available for GZ2. The GZ2 primary sample (Willett et al. 2013) uses images from SDSS DR7 (Abazajian et al. 2009), which are 95% complete to r = 17.77, with a median seeing of 1.4″ and a plate scale of 0.396″ per pixel (York et al. 2000). In contrast, GZ DECaLS uses images from DECaLS DR2 to DR5, which have a median 5σ point source depth of r = 23.6, a seeing better than 1.3″, and a pixel scale of 0.262″ per pixel (Dey et al. 2018).

We expect the improved imaging to reveal morphology not previously visible, particularly for features which are faint (e.g. tidal features, low surface brightness spiral arms) or intricate (e.g. weak bars, flocculent spiral arms). Our changes to the decision tree (Sec. 3.2) were partly made to better exploit this improved imaging. To investigate the consequences of improved imaging, we compare galaxies classified in both GZ2 and GZ DECaLS. Galaxies will typically be classified by both projects if they are inside both the SDSS DR7 Legacy catalogue (i.e. the source GZ2 catalogue) and DECaLS DR5 footprints (broadly, North Galactic Cap galaxies in the overlapping declination range) and match the selection criteria of each project (see Willett et al. 2013 and Sec. 2.2). GZ2's r < 17.0 magnitude limit is stricter than the GZ DECaLS selection, so the cross-classified galaxies are typically relatively bright.
Figure 4. Comparison of 'Featured' fraction for galaxies classified in both GZ2 and GZ DECaLS. Ambiguous galaxies are consistently reported as more featured in GZ DECaLS, which we attribute to the significantly improved imaging depth of DECaLS.
We find that volunteers successfully recognise newly-visible morphological features. Figure 4 compares the distribution of vote fractions for 'Is this galaxy smooth or featured?' between GZ2 and GZ DECaLS. Ambiguous galaxies, with 'featured' fractions between approximately 0.25 and 0.75, are consistently reported as more featured (median absolute increase of 0.13, median percentage increase of 22%) with the deeper GZ DECaLS images.

The shift towards featured galaxies is an accurate response to the new images, rather than a systematic effect from (for example) a changing population of volunteers. Figure 5 compares the GZ2 and GZ DECaLS images of a random sample of galaxies drawn from the 1000 cross-classified galaxies with the largest increase in 'featured' fraction. In all of these galaxies (and for a clear majority of galaxies in similar samples), volunteers are correctly recognising newly visible detailed features.

We observe a similar pattern in the vote fractions of spiral arms and bars for featured galaxies. For galaxies consistently considered featured (i.e. where both projects reported a 'featured' vote fraction of at least 0.5), the median vote fraction for spiral arms increased from 0.84 to 0.9, and for bars from 0.21 to 0.24. This suggests that even for galaxies where some details were already visible (and which were hence considered featured), improved imaging makes our volunteers more likely to identify specific features.

We argue that the improved depth of DECaLS (r = 23.6, vs. r = 22.2 for SDSS) is the most likely cause. The colour images also changed from gri bands (SDSS) to grz bands (DECaLS), which might make older stars more prominent. We expect the difference in seeing to be negligible here.

Comparing classifications made using the same possible answers on the same galaxies shows how improved DECaLS imaging leads to ambiguous galaxies being correctly reported as more featured, and to spiral arms and bars being reported with more confidence. However, volunteers are also sensitive to which questions are asked and how those questions are asked. We measure the impact of our changes to the decision tree 'Bar' question for GZD-5 in the next section.

Figure 5. GZ2 and GZ DECaLS images for 6 galaxies drawn randomly from the 1000 galaxies classified in both projects with the largest increase in 'featured' vote fraction (reported fractions shown in red). The increased fraction accurately reflects the increased visibility of detailed morphology from improved imaging.

Figure 6. Left: Distribution of the fraction of GZD-1/2 volunteers answering 'Yes' (not 'No') to 'Does this galaxy have a bar?', split by expert classification from NA10 as barred (blue) or unbarred (orange). Right: as left, but for GZD-5 volunteers answering 'Strong' or 'Weak' (not 'No'). Volunteers are substantially better at identifying barred galaxies using the GZD-5 three-answer question.
To measure the effect of the new decision tree on bar sensitivity, we compare the classifications made using each tree against expert classifications. Nair & Abraham 2010 (hereafter NA10) classified all 14,034 SDSS DR4 galaxies at 0.01 < z < 0.05 with g < 16. Of these, we select galaxies with f_featured > 0.25 (as measured by GZD-5), giving a featured sample of 807 galaxies classified by NA10, GZD-1/2, and GZD-5.

Figure 6 compares volunteer classifications for expert-labelled calibration galaxies made using each tree. We find that barred and unbarred galaxies are significantly better separated with the Strong/Weak/None answers than with the Yes/No answers. Of 220 Nair-identified bars (of any type), 184 (84%) receive a majority vote for being barred by volunteers using the new tree, up from 120 (55%) with the previous tree.

NA10 classified barred galaxies into six subtypes: Strong, Intermediate, Weak, Nuclear, Ansae, and Peanut (plus None, implicitly). We can use the first three subtypes as a measurement of expert-classified bar strength, and therefore evaluate how our volunteers respond to bars of different strengths. Following the approach to defining summary metrics of Masters et al. (2019), we summarise the bar vote fractions into a single volunteer estimate of bar strength, B_vol = f_strong + 0.5 f_weak, and compare the distribution of B_vol for each expert-classified bar strength (Figure 7). We find that the volunteer bar strength estimates increase smoothly with expert-classified bar strength, though individual galaxies vary substantially. This suggests that typical bar strengths in galaxy samples can be successfully inferred from volunteer votes.

Figure 7. Distributions of volunteer bar strength estimates, B_vol = f_strong + 0.5 f_weak, split by expert-classified (NA10) bar strength. Individual galaxies are shown with rug plots (15 Strong, 110 Intermediate, 87 Weak, and 377 None). Volunteer bar strength estimates increase smoothly with expert-classified bar strength, though individual galaxies vary substantially.

The addition of the 'weak bar' answer in GZD-5 significantly improves sensitivity to bars compared with previous versions of the decision tree. Additionally, volunteer votes across the three answers may be used to infer bar strength. We hope that the detailed bar classifications in our catalogue will help researchers better understand the properties of strong and weak bars and their influence on host galaxies.
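The summary statistic B_vol = f_strong + 0.5 f_weak can be computed directly from the published vote fractions. A minimal sketch (the function name is ours, not from the catalogue):

```python
import numpy as np

def bar_strength(f_strong, f_weak):
    """Summarise the GZD-5 bar vote fractions into a single volunteer
    estimate of bar strength, B_vol = f_strong + 0.5 * f_weak
    (following the Masters et al. 2019 style of summary metric)."""
    return np.asarray(f_strong, dtype=float) + 0.5 * np.asarray(f_weak, dtype=float)

# A galaxy with 60% 'Strong', 20% 'Weak' and 20% 'No' bar votes:
print(bar_strength(0.6, 0.2))
```

The same function applies element-wise to whole catalogue columns, since the inputs are converted to numpy arrays.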
Galaxy Zoo data releases have previously included two post-hoc modifications to the volunteer classifications: volunteer weighting, to reduce the influence of strongly atypical volunteers, and redshift debiasing, to estimate the vote fractions a galaxy might have received had it been observed at a specific redshift. We describe each modification below.
Volunteer weighting, as introduced in Galaxy Zoo 2 (Willett et al. 2013), assigns each volunteer an aggregation weight of (initially) one, and iteratively reduces that weight for volunteers who typically disagree with the consensus. This method affects relatively few volunteers and therefore causes only a small shift in vote fractions. In Galaxy Zoo 2, for example, approximately 95% of volunteers had a weighting of one (i.e. were unaffected), 94.8% of galaxies had a change in vote fraction of no more than 0.1 for any question, and the mean change in vote fraction across all questions and galaxies was 0.0032.

The most significant change in final vote fractions is caused by down-weighting rare (approx. 1%) volunteers who repeatedly disagree with the consensus by answering 'artifact' at implausibly high rates (including 100%) for many galaxies. Answering artifact ends the classification and shows the next galaxy, and so we hypothesise that these rare volunteers are primarily interested in seeing many galaxies rather than contributing meaningful classifications. There are very few such volunteers, but because answering artifact allows classifications to be submitted very quickly, they have an outsize effect on the aggregate vote fractions.

Figure 8 shows the distribution of reported artifact rates for volunteers with at least 150 total classifications. We expect the true fraction of artifacts to be less than 0.1, and the vast majority of volunteers report artifact rates consistent with this. However, the distribution is bimodal, with a small second peak around 1.0 (i.e. volunteers reporting every galaxy as an artifact). To remove the implausible mode, we discard the classifications of volunteers with at least 150 total classifications and reported artifact rates greater than 0.5. In GZD-1/2, 1.1% (643) of volunteers are excluded, discarding 11% (483,081) of classifications. In GZD-5, 0.03% (543) of volunteers are excluded, discarding 5.3% (249,592) of classifications.

We investigated the possibility of other groups of atypical volunteers giving similar answers across questions by analysing the per-user vote fractions with either a two-dimensional visualisation using UMAP (McInnes et al. 2020) or with clustering using HDBSCAN (McInnes et al. 2017). We find no strong evidence that such clusters exist.

Figure 8. Distribution of reported 'artifact' rates by volunteer (i.e. how often each volunteer answered 'artifact' over all the galaxies they classified), for GZD-1/2 and GZD-5. The vast majority report artifact rates consistent with those of the authors (below 0.1), but a very small subset report implausibly high artifact rates (> 0.5) and consequently have their classifications discarded. Only volunteers with at least 150 classifications are shown; the distribution for volunteers with fewer classifications is not bimodal.
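The artifact-rate cut described above amounts to a group-and-filter over the raw classification records. A pandas sketch; the column names here are hypothetical, not the published catalogue schema:

```python
import pandas as pd

# One row per (volunteer, galaxy) classification, with a flag for whether
# the volunteer answered 'artifact'. Toy data: volunteer 3 is the
# implausible case (150 artifacts in 160 classifications).
classifications = pd.DataFrame({
    "volunteer_id": [1] * 3 + [2] * 200 + [3] * 160,
    "is_artifact":  [0] * 3 + [0] * 190 + [1] * 10 + [1] * 150 + [0] * 10,
})

stats = classifications.groupby("volunteer_id")["is_artifact"].agg(
    n_classifications="size", artifact_rate="mean"
)

# Discard volunteers with at least 150 classifications AND an artifact
# rate above 0.5, as in the text.
excluded = stats[(stats["n_classifications"] >= 150)
                 & (stats["artifact_rate"] > 0.5)].index
kept = classifications[~classifications["volunteer_id"].isin(excluded)]
```

On the toy data, only volunteer 3 is excluded; volunteers with few classifications (volunteer 1) or plausible artifact rates (volunteer 2) are untouched.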
Galaxies at higher redshifts appear fainter and smaller on the sky, making it harder to detect detailed morphological features than if the galaxy were closer. This creates a bias in visual classifications (whether human or automated) where galaxies of the same intrinsic morphology are less likely to be classified as having detailed features as redshift increases (Bamford et al. 2009). Redshift debiasing is an attempt to mitigate this bias by estimating how a galaxy would appear if it were at a fixed low redshift.

Figure 9. Fraction of GZD-5 galaxies with f > 0.5 for each answer, as a function of redshift, before ('cleaned') and after ('debiased') debiasing (58,916 galaxies). For most questions and answers, debiasing successfully flattens the redshift trends. For 'Smooth or Featured' and 'Bulge Prominence', redshift debiasing overcorrects.

We assume that galaxy morphologies over our redshift range (0.02 < z < 0.15, approximately 1.5 Gyr) do not evolve significantly for galaxies of similar intrinsic brightness and physical size, and so, for a luminosity-limited sample, any change we observe in the vote fraction distribution as a function of redshift is purely a consequence of imaging. If so, we can estimate the vote fractions which would be observed if each galaxy were at low redshift by modifying the vote fractions of higher-redshift galaxies such that they have the same overall distribution as their low-redshift counterparts of similar brightness and size.

We base the debiasing on a luminosity-limited sample, selected between 0.02 < z < 0.15 and with absolute magnitudes down to M_r = −23. We consider the galaxies with at least 30 votes for the first question ('Smooth or Featured') after volunteer weighting (above), for a total of 87,617 galaxies in GZD-1/2 and 58,916 galaxies in GZD-5. For each question, separately, we define a subset of galaxies to which we apply the debiasing procedure. Each subset is defined using a cut on the vote fractions of the preceding questions (e.g. f_featured × f_not edge-on > 0.5) and a minimum number of votes N for the question, to ensure both that the question is relevant and that the galaxy has been classified by a significant number of people. We bin the subset of galaxies by M_r, log(R) and z for each answer in turn. We use the voronoi_2d_binning package from Cappellari & Copin (2003) to ensure the bins have an approximately equal number of galaxies (with a minimum of 50). We then match vote fraction distributions on a bin-by-bin basis, such that the cumulative distribution of vote fractions at each redshift is shifted to be similar to that of the lowest redshift sample.

When comparing different morphological types, some of the systematic errors in the debiasing may cancel out. Uncertainties in the debiasing will also decrease as the sample size increases. For these reasons, we strongly suggest that users of the debiased classifications only use them to consider populations of galaxies rather than individual galaxies or small samples, and consider that there may still be some residual trends and uncertainties that are hard to model with current methods.

Combining citizen science with automated classification allows us to do better science than with either alone. The clearest benefit is that automated classification scales well with sample size. For GZ DECaLS, classifying all 311,488 suitable galaxies using volunteers alone is infeasible; collecting 40 classifications per galaxy, the standard from previous Galaxy Zoo projects, would take around eight years without further promotion efforts, by which time we expect new surveys to start. Automated classification also evolves: as the quality of our models improves, so too will the quality of our classifications.
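The bin-by-bin matching of vote fraction distributions used for redshift debiasing above can be illustrated with a one-bin quantile-mapping sketch. The published procedure additionally bins galaxies in M_r, log(R) and z with voronoi_2d_binning; this simplified function is ours, not the authors' code:

```python
import numpy as np

def debias_to_low_z(f_high, f_low):
    """Map high-redshift vote fractions onto the low-redshift vote
    fraction distribution by matching empirical CDFs (quantile mapping),
    within a single (M_r, log R) bin."""
    f_high = np.asarray(f_high, dtype=float)
    f_low_sorted = np.sort(np.asarray(f_low, dtype=float))
    # Empirical CDF rank of each high-z galaxy within its own sample.
    ranks = np.searchsorted(np.sort(f_high), f_high, side="right") / len(f_high)
    # Read off the low-z vote fraction at the same quantile.
    idx = np.clip((ranks * len(f_low_sorted)).astype(int) - 1,
                  0, len(f_low_sorted) - 1)
    return f_low_sorted[idx]
```

Because the mapping is monotonic, it preserves the ranking of galaxies within a bin while shifting the distribution to match the low-redshift reference, which is the behaviour the debiasing requires.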
Automated classification is also replicable from scratch without requiring a crowd: other researchers may run our open-source code and recover our classifications (within stochasticity), or create equivalent classifications for newly-imaged galaxies.

Finally, and of particular relevance to researchers using this data release, automated classification allows us to retroactively update the decision tree. Because our classifier learns to make predictions from GZD-5 classifications, using the improved tree with better detection of mergers and weak bars, we can then predict what our volunteers would have said for the GZD-1 and GZD-2 galaxies had we been using the improved tree at that time.

Our specific automated classification approach offers several qualitative benefits over previous work. First, through careful consideration of uncertainty, we can both learn from uncertain volunteer responses and predict posteriors (rather than point estimates) for new galaxies. Second, by predicting the answers to every question with a single model (similarly to Dieleman et al. 2015, and unlike more recent work, e.g. Domínguez Sánchez et al. 2018; Khan et al. 2019; Walmsley et al. 2020), we improve performance by sharing representations between tasks (Caruana 1997): intuitively, knowing how to recognise spiral arms can also help you count them. Learning from every galaxy to predict every answer uses our valuable volunteer effort as efficiently as possible. This is particularly effective because we aim to predict detailed morphology, and hence learn to create a detailed representation of each galaxy.
We require a model which can:

(i) Learn efficiently from volunteer responses of varying (i.e. heteroskedastic) uncertainty.
(ii) Predict posteriors for those responses on new galaxies, for every question.

In previous work (Walmsley et al. 2020) we modelled volunteer responses as being binomially distributed and trained our model to make maximum likelihood estimates using the loss function

L = k log f^w(x) + (N − k) log(1 − f^w(x))    (3)

where, for some target question, k is the number of responses (successes) for some target answer, N is the total number of responses (trials) to all answers, and f^w(x) = ρ̂ is the predicted probability of a volunteer giving that answer.

This Binomial assumption, while broadly successful, broke down for galaxies with vote fractions k/N close to 0 or 1, where the Binomial likelihood is extremely sensitive to f^w(x), and for galaxies where the question asked was not appropriate (e.g. predicting if a featureless galaxy has a bar). Instead, in this work, the model predicts a distribution p(ρ | f^w(x)), and ρ is then drawn from that distribution.

For binary questions, one could parametrise p(ρ | f^w(x)) with the Beta distribution (being flexible and defined on the unit interval), and predict the Beta distribution parameters f^w(x) = (α̂, β̂) by maximising the marginal likelihood

L = ∫ Bin(k | ρ, N) Beta(ρ | α, β) dρ    (4)

where the Binomial and Beta distributions are conjugate and hence this integral can be evaluated analytically.

In practice, we would like to predict the responses to questions with more than two answers, and hence we replace each distribution with its multivariate counterpart: Beta(ρ | α, β) with Dirichlet(ρ | α), and Binomial(k | ρ, N) with Multinomial(k | ρ, N).
L_q = ∫ Multi(k | ρ, N) Dirichlet(ρ | α) dρ    (5)

where k, ρ and α are now all vectors with one element per answer.

The Dirichlet-Multinomial distribution is much more flexible than the Binomial, allowing our model to express uncertainty through wider posteriors and confidence through narrower posteriors. We believe this is a novel approach.

For the base architecture, we use the EfficientNet B0 model (Tan & Le 2019). The EfficientNet family of models includes several architectural advances over the standard convolutional neural network architectures commonly used within astrophysics (e.g. Huertas-Company et al. 2015; Dieleman et al. 2015; Khan et al. 2019; Cheng et al. 2020; Ferreira et al. 2020), including auto-ML-derived structure (Tan et al. 2019; He et al. 2019), depthwise convolutions (Howard et al. 2017), bottleneck layers (Iandola et al. 2017), and squeeze-and-excitation optimisation (Hu et al. 2018). The EfficientNet B0 model was identified using multi-objective neural architecture search (Tan et al. 2019), optimising for both accuracy and FLOPS (i.e. the computational cost of prediction). This balancing of accuracy and FLOPS is particularly useful for astrophysics researchers with limited access to GPU resources, leading to a model capable of making reliable predictions on hundreds of millions of galaxies. In short, the architecture is similar to traditional convolutional neural networks, being composed of a series of convolutional blocks of decreasing resolution and increasing channels. Each convolutional block uses mobile inverted bottleneck convolutions following MobileNetV2 (Sandler et al. 2018), which combine computationally efficient depthwise convolutions with residual connections between bottlenecks (as opposed to residual connections between blocks with many channels, as in e.g. ResNet; He et al. 2016).
EfficientNet B0 has 5.3 million parameters. We modify the final EfficientNet B0 layer output units to give predictions smoothly between 1 and 100 (using softmax activation), which is appropriate for the Dirichlet parameters α. α elements below 1 can lead to bimodal 'horseshoe' posteriors, and α elements above approximately 100 can lead to extremely confident predictions at extreme ρ, both of which are implausible for galaxy morphology posteriors. These constraints may cause the most extreme galaxies to have predicted vote fractions which are slightly less extreme than volunteers would record, but we do not anticipate this affecting practical use; whether a galaxy is extremely likely to have a bar or merely highly likely is rarely of scientific consequence.

We would like our single model to predict the answer to every question in the Galaxy Zoo tree. To do this, our architecture uses one output unit per answer (i.e. for 13 questions with a total of 20 answers, we use 20 output units). We calculate the (negative log) likelihood per question (Eqn. 5), and then, treating the errors in the model's answers to each question as independent events, calculate the total log likelihood as

log L = Σ_q log L_q(k_q, N_q, f^w_q)    (6)

where, for question q, N_q is the total number of answers, k_q is the observed votes for each answer, and f^w_q is the values of the output units corresponding to those answers (which we interpret as the Dirichlet α parameters in Eqn. 5).

We train our model using the GZD-5 volunteer classifications. Because the training set includes both active-learning-selected galaxies receiving at least 40 classifications and the remaining GZD-5 galaxies with around 5 classifications, it is crucial that the model is able to learn efficiently from labels of varying uncertainty. Unlike Walmsley et al.
(2020), which trained one model per question and needed to filter out galaxies where that question may not be appropriate, we can predict answers to all questions and learn from all labelled galaxies.

We train and evaluate our models using the 249,581 (98.5%) GZD-5 galaxies with at least three volunteer classifications. Learning from galaxies with even fewer (one or two) classifications should be possible in principle, but we do not attempt it here, as we do not expect galaxies with so few classifications to be significantly informative. The Dirichlet concentrations (distribution parameters) used to calculate our metrics are predicted by three identically-trained models, each making 5 forward passes with random dropout configurations and augmentations. We ensemble all 15 forward passes by simply taking the mean posterior given the total votes recorded, which may be interpreted as the posterior of an equally-weighted mixture of Dirichlet-Multinomial distributions. This mean posterior can then be used to calculate credible intervals (error bars) and in standard statistical analyses. We develop our approach using a conventional 80/20 train-test split, and make a new split before calculating the final metrics reported here.

For the published automated classifications, where we aim simply to make the best predictions possible rather than to test performance, we train on all 249,581 galaxies with at least 3 votes (98.5%). We also train five rather than three models to maximise performance. Training each model on an NVIDIA V100 GPU takes around 24 hours. We then make predictions (using the updated GZD-5 schema) on all 313,789 galaxies in all campaigns. Each prediction (forward pass) takes approx.
6 ms, equating to approx. 160 ms for each published posterior.

Starting from the galaxy images shown to volunteers (Section 2.3), we take an average over channels to remove colour information and avoid biasing our morphology predictions (Walmsley et al. 2020), then resize and save the images as 300×300×1 matrices. We then apply random augmentations when loading each image into memory, creating a unique randomly-modified image to be used as input to the network. We first apply random horizontal and vertical flips, followed by an aliased rotation by a random angle in the range (0, π), with missing pixels being filled by reflection at the boundaries. Finally, we crop the image about a random centroid to 224×224 pixels, effectively zooming in slightly towards a random off-centre point. We also apply these augmentations at test time to marginalise our posteriors over any unlearned variance. We train using the Adam optimizer (Kingma & Ba 2015) and a batch size of 128. We end training once the model loss fails to improve for 10 consecutive epochs.

Code for our deep learning classifier is available to the reviewer(s) and will be made open-source on publication.
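The Dirichlet-Multinomial marginal in Eqn 5 has a closed form in terms of gamma functions. A minimal numpy/scipy sketch of evaluating it (not the authors' published training code, which optimises this loss with respect to the network outputs α):

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_nll(k, alpha):
    """Negative log likelihood of observed votes k (one entry per answer)
    under a Dirichlet-Multinomial with concentrations alpha, i.e. Eqn 5
    with the integral over rho evaluated analytically."""
    k = np.asarray(k, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n = k.sum()
    log_like = (
        gammaln(alpha.sum()) - gammaln(n + alpha.sum())   # prior normalisation
        + gammaln(n + 1) - gammaln(k + 1).sum()           # multinomial coefficient
        + (gammaln(k + alpha) - gammaln(alpha)).sum()     # per-answer terms
    )
    return -log_like

# With alpha = (1, 1) the model is maximally uncertain: every split of
# N = 5 votes between two answers is equally likely, probability 1/6,
# so the NLL is ln(6) for any k.
print(dirichlet_multinomial_nll([3, 2], [1, 1]))
```

Working in log-gamma space keeps the evaluation numerically stable even for large vote counts or concentrations.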
6 ms, equating to approx. 160 ms for each published posterior.

Starting from the galaxy images shown to volunteers (Section 2.3), we take an average over channels to remove color information and avoid biasing our morphology predictions (Walmsley et al. 2020), then resize and save the images as 300x300x1 matrices. We then apply random augmentations when loading each image into memory, creating a unique randomly-modified image to be used as input to the network. We first apply random horizontal and vertical flips, followed by an aliased rotation by a random angle in the range (0, 𝜋), with missing pixels being filled by reflection at the boundaries. Finally, we crop the image about a random centroid to 224x224 pixels, effectively zooming in slightly towards a random off-center point. We also apply these augmentations at test time to marginalise our posteriors over any unlearned variance. We train using the Adam (Kingma & Ba 2015) optimizer and a batch size of 128. We end training once the model loss fails to improve for 10 consecutive epochs.

Code for our deep learning classifier is available to the reviewer(s) and will be made open-source on publication.

Our model successfully predicts posteriors for volunteer votes to each question. We show example posteriors for a question with two answers, ‘Does this galaxy have spiral arms?’ (Yes/No), in Fig. 10, and a question with three answers, ‘Does this galaxy have a bar?’ (Strong/Weak/None), in Fig. 11. In Appendix A, we provide a gallery of the galaxies with the highest expected vote fractions for a selection of answers, to visually demonstrate the quality of the most confident machine classifications.

To aid intuition for the typical performance, we reduce both the vote fraction labels and the posteriors down to discrete classifications by rounding the vote fractions and mean posteriors to 0 or 1, and calculate classification metrics (Table 1) and confusion matrices (Figure 13).
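The per-question loss in Eqn. 6 can be made concrete with a short numpy/scipy sketch of the Dirichlet-Multinomial negative log-likelihood, summed over questions. This is our own illustrative implementation, not the released training code; function names are ours.

```python
import numpy as np
from scipy.special import gammaln


def dirichlet_multinomial_nll(alpha, k):
    """Negative log-likelihood of observed votes k (one entry per answer)
    under Dirichlet-Multinomial concentration parameters alpha."""
    alpha, k = np.asarray(alpha, float), np.asarray(k, float)
    n, a = k.sum(), alpha.sum()
    log_pmf = (
        gammaln(n + 1) - gammaln(k + 1).sum()    # multinomial coefficient
        + gammaln(a) - gammaln(n + a)            # normalising terms
        + (gammaln(k + alpha) - gammaln(alpha)).sum()
    )
    return -log_pmf


def total_loss(alphas_by_question, votes_by_question):
    """Sum the per-question NLL, treating questions as independent (Eqn. 6)."""
    return sum(dirichlet_multinomial_nll(a, k)
               for a, k in zip(alphas_by_question, votes_by_question))
```

As a sanity check, with uniform concentrations 𝛼 = (1, 1) the Dirichlet-Multinomial is uniform over the N+1 possible vote splits, so 3 ‘Yes’ votes out of 10 has likelihood 1/11.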
Here and throughout this section, to remove galaxies for which the question is not relevant, we only count galaxies where at least half the volunteers were asked that question. We report two sets of classification metrics: metrics for all (relevant) galaxies, and metrics only for galaxies where the volunteers are confident (defined as having a vote fraction for one answer above 0.8, following Domínguez Sánchez et al. 2019).

The performance on confident galaxies is useful to measure because such galaxies have a clear correct label. For such galaxies, performance is near-perfect; we achieve better than 99% accuracy for most questions, with the lowest accuracy (for spiral arm count) being 98.6%. The confusion matrices reflect this, showing little notable confusion for any question.

Reported performance on all galaxies will be lower than on confident galaxies, as the correct labels are uncertain. Our measured vote fractions are approximations of the theoretical ‘true’ vote fractions (as we cannot ask infinitely many volunteers), and many galaxies are genuinely ambiguous and do not have a meaningful ‘correct’ answer. No classifier should achieve perfect accuracy on galaxies where the volunteers themselves are not confident. Nonetheless, performance is more than sufficient for scientific use; accuracy ranges from 77.4% (spiral arm count) to 98.7% (disk edge-on). We observe some moderate confusion between similar answers, particularly between No or Weak bar, Moderate or Large bulges, and Two or Three spiral arms, which matches our intuition for the answers that volunteers might confuse and so likely reflects ambiguity in the training data. More surprisingly, there is also confusion between Two spiral arms and Can't Tell. Figure 18 shows random examples of spirals where the most common volunteer answer was Two, but the classifier predicted Can't Tell, and vice versa.
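The metric split described above (all galaxies versus confident galaxies) can be sketched as follows, using synthetic fractions in place of the real catalogue. The arrays and threshold structure here are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
observed = rng.uniform(0, 1, n)                   # hypothetical volunteer 'Yes' fractions
predicted = np.clip(observed + rng.normal(0, 0.05, n), 0, 1)  # hypothetical model mean posteriors

# Round fractions to discrete 0/1 labels, then score all vs. confident galaxies.
labels, preds = observed.round(), predicted.round()
accuracy_all = (labels == preds).mean()

# 'Confident': one answer's vote fraction is above 0.8 (for a binary question,
# the 'Yes' fraction is either above 0.8 or below 0.2).
confident = (observed > 0.8) | (observed < 0.2)
accuracy_confident = (labels[confident] == preds[confident]).mean()
```

Because misclassifications cluster near a vote fraction of 0.5, accuracy on the confident subset is at least as high as on the full sample, mirroring the near-perfect confident-galaxy metrics in Table 1.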
In both cases, the galaxies generally have diffuse or otherwise subtle spiral arms embedded in a bright disk, confusing both human and machine. This highlights the difficulty of using classification metrics to assess performance on ambiguous galaxies.

We can mitigate the ambiguity in classifications of galaxies by measuring regression metrics on the vote fractions, without rounding to discrete classifications. Figure 16 shows the mean deviations between the model predictions (mean posteriors) and the observed vote fractions, by question, for test set galaxies with approximately 40 volunteer responses. Performance is again excellent, with the predictions typically well within 10% of the observed vote fractions. Predicting spiral arm count is relatively challenging, as noted above. Predicting the answer ‘None’ (i.e. not a merger) to the ‘Merger’ question is also challenging, perhaps because of the rarity of counter-examples.

The volunteer vote fractions against which we compare our predictions are themselves uncertain for most galaxies. We aim to predict the true vote fraction, i.e. the vote fraction from 𝑁 → ∞

Figure 10.
Posteriors for ‘Does this galaxy have spiral arms?’, split by ensemble model (bold colors) and, within each model, dropout forward passes (faded colors). The number of volunteers answering ‘Yes’ (not known to the classifier) is shown with a black dashed line. Galaxies are selected at random from the test set, provided the spiral question is relevant (defined as a vote fraction of 0.5 or more for the preceding answer, ‘Featured’). The image presented to volunteers is shown to the right. The model input is a cropped, downsized, greyscale version (Sec 5.1). The Dirichlet-Multinomial posteriors are strictly only defined at integer votes; for visualisation only, we show the Γ-generalised posteriors between integer votes.
Question             Count   Accuracy  Precision  Recall   F1
Smooth Or Featured   11346   0.9352    0.9363     0.9352   0.9356
Disk Edge On         3803    0.9871    0.9871     0.9871   0.9871
Has Spiral Arms      2859    0.9349    0.9364     0.9349   0.9356
Bar                  2859    0.8185    0.8095     0.8185   0.8110
Bulge Size           2859    0.8419    0.8405     0.8419   0.8409
How Rounded          6805    0.9314    0.9313     0.9314   0.9313
Edge On Bulge        506     0.9111    0.9134     0.9111   0.8996
Spiral Winding       1997    0.7832    0.8041     0.7832   0.7874
Spiral Arm Count     1997    0.7742    0.7555     0.7742   0.7560
Merging              11346   0.8798    0.8672     0.8798   0.8511

(a) Classification metrics for all galaxies

Question             Count   Accuracy  Precision  Recall   F1
Smooth Or Featured   3495    0.9997    0.9997     0.9997   0.9997
Disk Edge On         3480    0.9980    0.9980     0.9980   0.9980
Has Spiral Arms      2024    0.9921    0.9933     0.9921   0.9924
Bar                  543     0.9945    0.9964     0.9945   0.9951
Bulge Size           237     1.0000    1.0000     1.0000   1.0000
How Rounded          3774    0.9968    0.9968     0.9968   0.9968
Edge On Bulge        258     0.9961    0.9961     0.9961   0.9961
Spiral Winding       213     0.9906    1.0000     0.9906   0.9953
Spiral Arm Count     659     0.9863    0.9891     0.9863   0.9871
Merging              3108    0.9987    0.9987     0.9987   0.9987

(b) Classification metrics for galaxies where volunteers are confident
Table 1.
Classification metrics on all galaxies (above) or on galaxies where volunteers are confident for that question (i.e. where one answer has a vote fraction above 0.8). Multi-class precision, recall and F1 scores are weighted by the number of true galaxies for each answer. Classifications on confident galaxies are near-perfect.

volunteers, but we only know the vote fraction from 𝑁 volunteers. However, 387 pre-active-learning galaxies were erroneously uploaded twice or more, and so received more than 75 classifications each. We can compare our predictions against these confidently-known galaxies. We can also calculate the deviations from asking fewer (𝑁 <
75) volunteers by artificially truncating the number of votes collected, and ask: how many volunteer responses to that question would we need to have errors similar to those of our model? Figure 17 shows the model and volunteer deviations for a representative selection of questions; the model predictions are as accurate as asking that question of around 10 volunteers. The model is, in this strict sense, slightly superhuman. The actual number of volunteers who would need to be shown a galaxy to achieve equivalent accuracy will be higher for questions only asked given certain previous answers (i.e. all but ‘Smooth or Featured?’ and ‘Merger?’), as some volunteers will give different answers to preceding questions and so not be asked that question.

We can also measure whether our posteriors correctly estimate this uncertainty. As a qualitative test, Figure 19 shows a random selection of galaxies binned by ‘Smooth or Featured’ vote fraction prediction entropy, measuring the model's uncertainty. Prediction entropy is calculated as the (discrete) Shannon entropy over all possible combinations of votes, assuming 10 total votes for this question (our results are robust to other choices of total votes). Unusual, inclined or poorly-scaled galaxies have highly uncertain (high entropy) votes, while smooth and especially clearly featured galaxies have confident (low entropy) votes. The most uncertain galaxies (not shown) are so poorly scaled (due to incorrect estimation of the Petrosian radius in the NASA-Sloan Atlas) that they are barely visible. These results match our intuition and demonstrate that our posteriors provide meaningful uncertainties.

More quantitatively, Figure 20 shows the calibration of our posteriors for the two binary questions in GZD-5: ‘Edge-on Disk’ and ‘Has Spiral Arms’. A well-calibrated posterior dominated by data (i.e. where the prior has minimal effect) will include the measured value within any bounds as often as the total probability within those bounds.
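The prediction entropy used above can be sketched as follows for a two-answer question, for which the Dirichlet-Multinomial reduces to SciPy's Beta-Binomial (the actual ‘Smooth or Featured’ question has three answers; this two-answer version is a simplification of ours).

```python
import numpy as np
from scipy.stats import betabinom


def prediction_entropy(alpha, beta, n_votes=10):
    """Shannon entropy (nats) of the predicted vote distribution for a
    two-answer question, over all possible splits of n_votes votes."""
    pmf = betabinom.pmf(np.arange(n_votes + 1), n_votes, alpha, beta)
    pmf = pmf[pmf > 0]
    return -(pmf * np.log(pmf)).sum()


# A confident prediction has low entropy; uniform concentrations (1, 1)
# give a uniform distribution over the 11 possible outcomes.
assert prediction_entropy(90, 2) < prediction_entropy(2, 2)
assert np.isclose(prediction_entropy(1, 1), np.log(11))
```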
We calculate calibration by, for each galaxy, iterating through each symmetric highest posterior density credible interval (i.e. starting from the posterior peak and moving the bounds outwards) and recording both the total probability inside the bounds and whether the recorded volunteer vote is inside the bounds. We then group (bin) by total probability and record the empirical frequency with which the votes lie within bounds of that total probability. We find that calibration is excellent. Our classifier is correctly uncertain.

The ultimate measure of success is whether our predictions are useful for science. Masters et al. (2019) (hereafter M19) used GZ2 classifications to investigate the relationship between bulge size and winding angle and found - contrary to a conventional view of the Hubble sequence - no strong correlation. We repeat this analysis using our (deeper) DECaLS data, using either volunteer or automated classifications, to check whether the automated classifications lead to the same science results as the volunteers.

Specifically, we select a clean sample of face-on spiral galaxies using M19's vote fraction cuts on 𝑓_feat, 𝑓_not-edge-on, 𝑓_spiral-yes and 𝑓_merging=none, plus an 𝑓_odd cut to remove galaxies with ongoing mergers or with otherwise disturbed features. For the volunteer vote fractions, we can only use either GZD-1/2 or GZD-5 classifications, since the former decision tree had three bulge size answers and the latter had five; we choose GZD-5 to benefit from the added precision of additional answers. To avoid selection effects (Sec. 6.2) we only use galaxies classified prior to active learning being activated. For the automated classifications, we use a model trained on GZD-5 to predict GZD-5 decision tree vote fractions (including the five bulge answers) for every GZ DECaLS galaxy (313,789).
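The calibration procedure described above (highest-density credible sets on a discrete vote posterior, binned by total probability) might be sketched as below. This is our own simplified illustration: the paper uses symmetric intervals about the posterior peak, while here we use highest-density sets for brevity.

```python
import numpy as np


def hpd_coverage(pmf, k_observed):
    """Total probability of the smallest highest-density set of vote counts
    that contains the observed volunteer vote count k_observed."""
    order = np.argsort(pmf)[::-1]                # vote counts, densest first
    cutoff = np.where(order == k_observed)[0][0]
    return pmf[order[:cutoff + 1]].sum()


def calibration_curve(pmfs, k_obs, bins=10):
    """Empirical frequency with which observed votes fall inside credible
    sets of each total probability (cf. Figure 20)."""
    coverage = np.array([hpd_coverage(p, k) for p, k in zip(pmfs, k_obs)])
    edges = np.linspace(0, 1, bins + 1)
    expected = 0.5 * (edges[:-1] + edges[1:])    # bin centres
    observed = np.array([np.mean(coverage <= hi) for hi in edges[1:]])
    return expected, observed
```

For a well-calibrated classifier, the observed frequencies track the expected total probabilities: votes should fall inside a 50% credible set about half the time.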
This allows us to expand our sample size from 5,378 galaxies using GZD-5 volunteers only to 43,672 galaxies using our automated classifier.

We calculate bulge size and spiral winding following Eqns. 1 and 3 in M19, trivially generalising the bulge size calculation to allow for five bulge size answers:

W_avg = 0.5 𝑓_medium + 1.0 𝑓_tight    (7)

B_avg = 0.25 𝑓_small + 0.5 𝑓_moderate + 0.75 𝑓_large + 1.0 𝑓_dominant    (8)

Both classification methods find no correlation between bulge size and spiral winding, consistent with M19. Figure 21 shows the distribution of bulge size against spiral winding using either volunteer vote fractions or the deep learning predictions (expected fractions) for the sample of featured face-on galaxies selected above. The distributions are indistinguishable, with the automated method offering a substantially larger (approx. 8x) sample size. We hope this demonstrates the accuracy and scientific value of our automated classifier.
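The bulge size and winding statistics of Eqns. 7 and 8 reduce to weighted sums of the vote fraction columns. A minimal sketch, assuming equally spaced weights across the ordered answers (with ‘loose’ and ‘none’ implicitly weighted zero); the function names are ours:

```python
def winding_avg(f_medium, f_tight):
    # W_avg (Eqn. 7): 'loose' carries weight 0, answers equally spaced thereafter.
    return 0.5 * f_medium + 1.0 * f_tight


def bulge_avg(f_small, f_moderate, f_large, f_dominant):
    # B_avg (Eqn. 8): 'none' carries weight 0, five answers equally spaced.
    return 0.25 * f_small + 0.5 * f_moderate + 0.75 * f_large + 1.0 * f_dominant
```

Because the vote fractions for each question sum to at most 1, both statistics lie in [0, 1], with higher values indicating tighter winding or larger bulges.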
We release two volunteer catalogues and two automated catalogues, available at https://dx.doi.org/10.5281/zenodo.4196267.

gz_decals_volunteers_ab includes the volunteer classifications for 92,960 galaxies from GZD-1 and GZD-2. Classifications are made using the GZD-1/2 decision tree (Fig. B1). All galaxies received at least 40 classifications, and consequently have approximately 30-40 after volunteer weighting (Sec. 4.3.1). This catalogue is ideal for researchers needing standard morphology measurements on a reasonably large sample, with minimal complexity. 33,124 galaxies in this catalogue were also previously classified in GZ2; the GZD-1/2 classifications are better able to detect faint features due to the deeper DECaLS imaging, and so should be preferred.

gz_decals_volunteers_c includes the volunteer classifications from GZD-5. Classifications are made using the improved GZD-5 decision tree, which adds more detail for bars and mergers (Sec. 4.2).
This catalogue includes 253,286 galaxies, but the galaxies do not all have the same number of classifications. 59,337 galaxies have at least 30 classifications (after denoising), and the remainder have far fewer (approximately 5). The selection effects determining how many classifications each galaxy receives are detailed below in Sec. 6.2. This catalogue may be useful to researchers who prefer a larger sample than gz_decals_volunteers_ab at the cost of more uncertainty and the introduction of selection effects, or who need detailed bar or merger measurements for a small number of galaxies. We use gz_decals_volunteers_c to train our deep learning classifier.

The automated classifications are made using our Bayesian deep learning classifier, trained on gz_decals_volunteers_c to predict the answers to the GZD-5 decision tree for all GZ DECaLS galaxies (including those in GZD-1 and GZD-2). gz_decals_auto_posteriors contains the predicted posteriors for each answer - specifically, the Dirichlet concentration parameters that encode the posteriors. We hope this catalogue will be helpful to researchers analysing galaxies in Bayesian frameworks. gz_decals_auto_fractions reduces those posteriors to the automated equivalent of previous Galaxy Zoo data releases, containing the expected vote fractions (mean posteriors). Note that not all vote fractions are relevant for every galaxy; we suggest assessing relevance using the estimated fraction of volunteers that would have been asked each question, which we also include. We hope this catalogue will be useful to researchers seeking detailed morphology classifications on the largest possible sample, who might benefit from error bars but do not need full posteriors.

We also release Jupyter notebooks showing how to use each catalogue (full link on publication).
These demonstrate how to load and query each catalogue with pandas (McKinney 2010), and how to create callable posteriors from the Dirichlet concentration parameters. The automated catalogues may be interactively explored at https://share.streamlit.io/mwalmsley/galaxy-poster/gz_decals_mike_walmsley.py.
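As one way to build a callable posterior from the released Dirichlet concentrations, note that the marginal of a Dirichlet distribution along a single component is a Beta distribution. The concentration values below are hypothetical, not drawn from the catalogue; the released notebooks are the authoritative reference.

```python
import numpy as np
from scipy.stats import beta


def answer_posterior(concentrations, answer_index):
    """Posterior over one answer's 'true' vote fraction, given the Dirichlet
    concentrations for all answers to that question. The single-component
    marginal of Dirichlet(a_1..a_K) is Beta(a_i, sum(a) - a_i)."""
    a = np.asarray(concentrations, float)
    return beta(a[answer_index], a.sum() - a[answer_index])


# Hypothetical concentrations for the ('strong', 'weak', 'no') bar answers:
posterior = answer_posterior([2.0, 10.0, 8.0], answer_index=1)
expected_fraction = posterior.mean()     # alpha_weak / alpha_total
lo, hi = posterior.interval(0.9)         # 90% credible interval (error bars)
```

The expected vote fraction recovered this way (𝛼_i / Σ𝛼) matches the values published in gz_decals_auto_fractions as the mean posterior.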
The GZD-1/2 catalogue reports at least 40 classifications for all galaxies imaged by DECaLS DR1/2 and passing the appropriate selection cuts (Section 2.2). Additional classifications above 40 are assigned independently of the galaxy properties. The selection function for total classifications in the GZD-5 catalogue is more complex. In practice, if you require a strictly random sample of GZD-5 galaxies with more than five volunteer classifications, you should exclude galaxies where ‘random_selection’ is False. You may also consider using the posteriors from our deep learning classifier, which are comparable across all GZ DECaLS galaxies (Section 5). Below, we describe the GZD-5 total classification selection effects.

Early galaxies were initially uploaded row-by-row from the NASA-Sloan Atlas, each (eventually) receiving 40 classifications. We also uploaded two additional subsets. For the first, 1355 galaxies were targeted for classification to support an external research project. Of these, 1145 would have otherwise received five classifications. These 1145 galaxies with additional classifications are identified with the ‘targeted’ group and should be excluded. For the second, we reclassified the 1497 galaxies classified in both GZD-1/2 and the Nair & Abraham (2010) expert visual morphology classification catalogue, to measure the effect of our new decision tree (results are shown in Sec. 4.2). Both the GZD-1/2 and GZD-5 classifications are reported in the respective catalogues (Section 6). Similarly to the targeted galaxies, 651 of these calibration galaxies would have otherwise received five classifications; they are identified with the ‘calibration’ group and should be excluded.

We then implemented active learning (Sec 3.1), prioritising 6,939 galaxies from the remaining pool of 199,496 galaxies not yet uploaded. These galaxies are identified with the groups ‘active_priority’ (selected for 40 classifications) and ‘active_baseline’ (the remainder).
For a strictly random selection, both groups should be excluded, leaving the galaxies classified prior to the introduction of active learning.

Finally, we note that 14,960 (5.9%) of GZD-5 galaxies received more than 40 classifications due to being erroneously uploaded more than once. The images are identical, and so we report the aggregate classifications across all uploads of the same galaxy.
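Excluding the non-random groups is a one-line pandas filter on the ‘random_selection’ column described above. The toy catalogue and the ‘group’/‘iauname’ column names here are illustrative assumptions; consult the released notebooks for the real schema.

```python
import pandas as pd

# Toy stand-in for the GZD-5 catalogue, with hypothetical identifiers.
catalogue = pd.DataFrame({
    'iauname': ['J0001', 'J0002', 'J0003', 'J0004'],
    'group': ['random', 'targeted', 'calibration', 'active_priority'],
    'random_selection': [True, False, False, False],
})

# A strictly random sample: the flag excludes the targeted, calibration
# and active-learning groups in one step.
random_sample = catalogue[catalogue['random_selection']]
```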
The most appropriate usage of the Galaxy Zoo DECaLS vote fractions depends on the specific science case. Many galaxies have ambiguous vote fractions (e.g. roughly similar vote fractions for both disk and elliptical morphologies) because of observational limitations like image resolution, or because the galaxy morphology is truly in-between the available answers (perhaps because the galaxy has an unusual feature such as polar rings, Moiseev et al. 2011, or because the galaxy is undergoing a morphological transition). To make best use of such galaxies, we recommend that, where possible, readers use the vote fractions as statistical weights in their analysis. For example, when investigating the differences in the stellar mass distributions of elliptical and disk galaxies, the disk (elliptical) vote fractions can be used as weights when plotting the distributions, resulting in the galaxies with the highest vote fraction for disk (elliptical) morphology dominating the resulting distribution. This ensures that each galaxy contributes to the analysis, without excluding galaxies with ambiguous vote fractions. For examples of using vote fractions as weights, see Smethurst et al. (2015) and Masters et al. (2019).

Using the vote fractions as weights is not appropriate for all science cases: for example, if galaxies of a particular morphology need to be isolated to form a sample for observational follow-up (e.g. overlapping pairs, see Keel et al. 2013, and ‘bulgeless’ galaxies, see Simmons et al. 2017a; Smethurst et al. 2019), or if the fraction of a certain morphological type of galaxy is to be calculated (e.g. bar fraction, see Simmons et al. 2014). These science cases require a
cut on the appropriate vote fraction to be chosen. However, readers should be aware that making cuts on the vote fractions is a crude method for identifying galaxies of certain morphologies and will result in an incomplete sample.

Table 2 shows our suggested cuts for populations of common interest, based on visual inspection by the authors and chosen for high specificity (low contamination) at the cost of low sensitivity (completeness). We urge the reader to adjust these cuts to suit the sensitivity and specificity of their science case, to add additional cuts to better select their desired population, and to make their own visual inspection to verify that the selected population is as intended. For a full analysis, we once again suggest the reader avoid cuts by appropriately weighting ambiguous galaxies, or take advantage of the posteriors provided by our automated classifier.
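The weighting approach recommended above, as an alternative to hard cuts, can be sketched with numpy histograms. The stellar masses and vote fractions below are synthetic stand-ins, not catalogue values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
stellar_mass = rng.normal(10.5, 0.5, n)   # hypothetical log stellar masses
f_disk = rng.uniform(0, 1, n)             # hypothetical disk vote fractions

# Weight each galaxy by its disk (or elliptical) vote fraction instead of
# cutting the sample: ambiguous galaxies still contribute, in proportion
# to how disk-like or elliptical-like the volunteers judged them.
disk_hist, edges = np.histogram(stellar_mass, bins=20, weights=f_disk)
elliptical_hist, _ = np.histogram(stellar_mass, bins=edges, weights=1 - f_disk)
```

Each galaxy contributes a total weight of one across the two distributions, so no galaxy is discarded and no hard threshold is imposed.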
What does a classification mean? The comparison of GZ2 and GZ DECaLS images (Fig. 5) highlights that our classifications aim to characterise the clear features of an image, and not what an expert might infer from that image. For example, volunteers might see an image of a galaxy which is broadly smooth, and so answer smooth, even though our astronomical understanding might suggest that the faint features around the galaxy core are likely indicative of spiral arms that would be revealed given deeper images. This situation occurs for several galaxies in Fig. 5. These ‘raw’ classifications will be most appropriate for researchers working on computer vision or on particularly low-redshift, well-resolved galaxies. The redshift-debiased classifications, which are effectively an estimate of galaxy features not clearly seen in the image, will be most appropriate for researchers especially interested in fainter features or studying links between our estimated intrinsic visual morphologies and other galaxy properties.

We showed in Sec. 4.2 that changing the answers available to volunteers significantly improves our ability to identify weak bars. This highlights that our classifications are only defined in the context of the answers presented. One cannot straightforwardly compare classifications made using different decision trees. Our scientific interests and our understanding of volunteers both evolve, and so our decision trees must also evolve to match them. However, only the last few years of volunteer classifications will use the latest decision tree (based on previous data releases), placing an upper limit on the number of galaxies with compatible classifications at any one time. Our automated classifier resolves this here by allowing us to retrospectively apply the GZD-5 decision tree (with better weak bar detection, among other changes) to galaxies only classified by volunteers in GZD-1 and GZD-2.
This flexibility ensures that Galaxy Zoo will remain able to answer the most pertinent research questions at scale.

We have shown (Sec. 5.2) that our automated classifier is generally highly accurate, well-calibrated, and leads to at least one equivalent science result. However, we cannot exclude the possibility of unexpected systematic biases or of adversarial behaviour from particular images. Avoiding subtle biases and detecting overconfidence on out-of-distribution data remain open computer science research questions, often driven by important terrestrial applications (Szegedy et al. 2014; Hendrycks & Gimpel 2017; Eykholt et al. 2018; Smith & Gal 2018; Geirhos et al. 2019; Ren et al. 2019; Yang et al. 2020; Margalef-Bentabol et al. 2020). Volunteers also have biases (e.g. a slight preference for recognising left-handed spirals, Land et al. 2008) and struggle with images of an adversarial nature (e.g. confusing edge-on disks with cigar-shaped ellipticals), though these can often be discovered and resolved through discussion with the community and by adapting the website.

We believe the future of morphology classification is in the thoughtful combination of volunteers and machine learning. Such combinations will be more than just faster; they will be replicable, uniform, error-bounded, and quick to adapt to new tasks. They will let us ask new questions - draw the spiral arms, select the bar length, separate the merging galaxies pixelwise - which would be infeasible with volunteers alone for all but the smallest samples (e.g. Lingard et al. 2020). And they will find the interesting, unusual and unexpected galaxies which challenge our understanding and inspire new research directions.

The best combination of volunteer and machine is unclear.
Our experiment with active learning is one possible approach, but (when compared to random selection) it suffers from implementation complexity, an unknown selection function, and no guarantee of - or even a clear final measurement of - an improvement in model performance. Many other approaches have been suggested in astrophysics (Wright et al. 2017; Beck et al. 2018; Wright et al. 2019; Dickinson et al. 2019; Martin et al. 2020; Lochner & Bassett 2020) and in citizen science and human-computer interaction more broadly (Chang et al. 2017; Wilder et al. 2020; Liu et al. 2020; Bansal et al. 2019). We will continue to search for and experiment with strategies to create the most effective contribution to research by volunteers.
We have presented Galaxy Zoo DECaLS: detailed galaxy morphology classifications for 311,488 galaxies imaged by DECaLS DR5 and within the SDSS DR11 footprint. Classifications were collected from volunteers on the Zooniverse citizen science platform over three campaigns, GZD-1, GZD-2, and GZD-5, where GZD-5 used an improved decision tree leading to better identification of weak bars and mergers. All galaxies receive at least five volunteer classifications; galaxies in GZD-1 and GZD-2 receive at least 40, while in GZD-5 only a prioritised subset receive 40. Volunteer classifications are then used to train a deep learning classifier to classify all galaxies. This classifier is able to both learn from uncertain volunteer responses and predict full posteriors, rather than point estimates, for what volunteers would have said. We show that the deep learning classifier is accurate and well-calibrated. We release both volunteer and automated classifications at https://dx.doi.org/10.5281/zenodo.4196267.
APPENDIX A: GALAXIES WITH CONFIDENT AUTOMATED CLASSIFICATIONS
To intuitively demonstrate the performance of our automated classifier, we show, for a selection of detailed morphology questions, the galaxies with the most confident automated classifications for that question. We show the galaxies with the highest mean posterior for being strongly barred (Fig. A1), edge-on and bulgeless (Fig. A2), one-armed spirals (Fig. A3), loosely wound spirals (Fig. A4) and mergers (Fig. A5). We present the galaxies here as shown to Galaxy Zoo volunteers (in color and at 424x424 pixel resolution), but the model makes predictions on more challenging greyscale 224x224 pixel images (see Sec. 5.1).
[Table 2; column headings: Population, Approx. Cut, Q. Votes, Notes. Populations include Featured Disk and others, selected by cuts on 𝑓_featured, 𝑓_smooth, 𝑓_strong bar, 𝑓_weak bar, 𝑓_strong bar + 𝑓_weak bar, 𝑓_merger, 𝑓_major disturb. and 𝑓_minor disturb.; threshold values not legible here.]

Table 2.
Suggested cuts for rough identification of galaxy populations, based on visual inspection by the authors. Q. votes is the minimum number of total votes for that question; for example, to identify strong bars, require at least 20 total votes to the question ‘Does this galaxy have a bar?’. This ensures enough votes to calculate reliable vote fractions.
APPENDIX B: GZD-1/2 DECISION TREE
Figure B1 shows the Galaxy Zoo decision tree used for the earlier GZD-1 and GZD-2 DECaLS campaigns. This tree is based on the tree used for Galaxy Zoo 2 (Willett et al. 2013) with three modifications: the ‘Can't Tell’ answer to ‘How many spiral arms are there?’ was removed, the number of answers to ‘How prominent is the central bulge?’ was reduced from four to three, and ‘Is the galaxy currently merging, or is there any sign of tidal debris?’ was added as a standalone question. Please see Sec. 3.2 for a full discussion.
APPENDIX C: CATALOGUE SAMPLE ROWS
Tables C1 and C2 present sample rows from the volunteer and automated morphology catalogues respectively. The volunteer data shown is from GZD-5; the GZD-1/2 catalogue follows an equivalent schema. For brevity, we show only columns for a single question (‘Bar’) and a single answer (‘Weak’); other questions and answers follow an identical pattern. A full description of all columns is available on data.galaxyzoo.org.
ACKNOWLEDGEMENTS
The data in this paper are the result of the efforts of the Galaxy Zoo volunteers, without whom none of this work would be possible. Their efforts are individually acknowledged at http://authors.galaxyzoo.org. We would also like to thank our volunteer translators: Mei-Yin Chou, Antonia Fernández Figueroa, Rodrigo Freitas, Na’ama Hallakoun, Lauren Huang, Alvaro Menduina, Beatriz Mingo, Verónica Motta, João Retrê, and Erik Rosenberg.

We would like to thank Dustin Lang for creating the legacysurvey.org cutout service and for contributing image processing code. We also thank Sugata Kaviraj and Matthew Hopkins for helpful discussions.

MW acknowledges funding from the Science and Technology Facilities Council (STFC) Grant Code ST/R505006/1. We also acknowledge support from STFC under grant ST/N003179/1. RJS acknowledges funding from Christ Church, University of Oxford. LF acknowledges partial support from US National Science Foundation award OAC 1835530; VM and LF acknowledge partial support from NSF AST 1716602.

This research made use of the open-source Python scientific computing ecosystem, including SciPy (Jones et al. 2001), Matplotlib (Hunter 2007), scikit-learn (Pedregosa et al. 2011), scikit-image (van der Walt et al. 2014) and Pandas (McKinney 2010). This research made use of Astropy, a community-developed core Python package for Astronomy (The Astropy Collaboration et al. 2018). This research made use of TensorFlow (Abadi et al. 2015).

The Legacy Surveys consist of three individual and complementary projects: the Dark Energy Camera Legacy Survey (DECaLS; NSF’s OIR Lab Proposal ID
iauname              ra      dec    bar_total-votes  bar_weak  bar_weak_fraction  bar_weak_debiased
J112953.88-000427.4  172.47  -0.07  16               1         0.06               0.15
J104325.29+190335.0  160.86  19.06  2                0         0.00               0.00
J104629.54+115415.1  161.62  11.90  4                2         0.50               -
J082950.68+125621.8  127.46  12.94  0                0         -                  -
J122056.00-015022.0  185.23  -1.84  3                0         0.00               -
Table C1.
Sample of GZD-5 volunteer classifications, with an illustrative subset of columns. Columns: ‘iauname’, galaxy identifier from the NASA-Sloan Atlas; RA and Dec, similarly; ‘Bar’ question total votes for all answers; ‘Bar’ question votes for the ‘Weak’ answer; fraction of ‘Bar’ question votes for the ‘Weak’ answer; estimated fraction after applying redshift debiasing (Sec. 4.3.2). Other questions and answers follow the same pattern (not shown for brevity).

iauname              RA      Dec    bar_preceding-fraction  bar_weak_concentrations       bar_weak_fraction_ml
J112953.88-000427.4  172.47  -0.07  0.14                    [6.158, 5.0723, 5.4842, ...   0.09
J104325.29+190335.0  160.86  19.06  0.13                    [4.3723, 4.5933, 4.8582...    0.07
J100927.56+071112.4  152.36  7.19   0.58                    [9.3129, 10.3911, 8.4791...   0.40
J143254.45+034938.1  218.23  3.83   0.55                    [13.2981, 12.2639, 8.8957...  0.26
J135942.73+010637.3  209.93  1.11   0.77                    [15.6247, 15.6893, 14.72....  0.28
Table C2.
Sample of automated classifications (GZD-5 schema), with an illustrative subset of columns. Columns: ‘iauname’, galaxy identifier from the NASA-Sloan Atlas; RA and Dec, similarly; predicted vote fraction to the answer preceding ‘Bar’ (‘Disk Edge On = No’), for estimating relevance; Dirichlet concentrations defining the predicted posterior for the ‘Bar’ question and ‘Weak’ answer (see Sec. 5); predicted fraction of ‘Bar’ question votes for the ‘Weak’ answer derived from those concentrations. Other questions and answers follow the same pattern (not shown for brevity).
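As a sketch of how a predicted fraction can be derived from concentrations: the mean of a Dirichlet posterior for answer k is the normalised concentration, alpha_k / sum(alpha). The example below uses a single hypothetical set of concentrations; how the released catalogue pools concentrations across ensemble members and dropout forward passes is not reproduced here.

```python
# Sketch: recover a predicted vote fraction from Dirichlet
# concentrations, as reported in Table C2. For one set of
# concentrations (strong, weak, none) for the 'Bar' question,
# the posterior mean vote fraction for 'Weak' is
# alpha_weak / sum(alpha). Concentrations here are hypothetical.

def dirichlet_mean_fraction(concentrations, answer_index):
    """Posterior mean vote fraction for one answer of one question."""
    return concentrations[answer_index] / sum(concentrations)

alpha = [2.0, 3.0, 5.0]  # (strong, weak, none), one forward pass
frac_weak = dirichlet_mean_fraction(alpha, 1)
print(round(frac_weak, 2))  # 3 / 10 = 0.3
```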
Carlos Chagas Filho de Amparo a Pesquisa do Estado do Rio de Janeiro, Conselho Nacional de Desenvolvimento Cientifico e Tecnologico and the Ministerio da Ciencia, Tecnologia e Inovacao, the Deutsche Forschungsgemeinschaft and the Collaborating Institutions in the Dark Energy Survey. The Collaborating Institutions are Argonne National Laboratory, the University of California at Santa Cruz, the University of Cambridge, Centro de Investigaciones Energeticas, Medioambientales y Tecnologicas-Madrid, the University of Chicago, University College London, the DES-Brazil Consortium, the University of Edinburgh, the Eidgenossische Technische Hochschule (ETH) Zurich, Fermi National Accelerator Laboratory, the University of Illinois at Urbana-Champaign, the Institut de Ciencies de l'Espai (IEEC/CSIC), the Institut de Fisica d'Altes Energies, Lawrence Berkeley National Laboratory, the Ludwig-Maximilians Universitat Munchen and the associated Excellence Cluster Universe, the University of Michigan, the National Optical Astronomy Observatory, the University of Nottingham, the Ohio State University, the University of Pennsylvania, the University of Portsmouth, SLAC National Accelerator Laboratory, Stanford University, the University of Sussex, and Texas A&M University.

BASS is a key project of the Telescope Access Program (TAP), which has been funded by the National Astronomical Observatories of China, the Chinese Academy of Sciences (the Strategic Priority Research Program "The Emergence of Cosmological Structures" Grant
DATA AVAILABILITY
The data underlying this article are available via Zenodo at https://dx.doi.org/10.5281/zenodo.4196267. Any future data updates will be released using DOI versioning. The code underlying this article will be available on GitHub on publication.
REFERENCES
Abadi M., et al., 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Abazajian K. N., et al., 2009, The Astrophysical Journal Supplement Series, 182, 543
Aihara H., et al., 2011, The Astrophysical Journal Supplement Series, 193, 29
Albareti F. D., et al., 2017, The Astrophysical Journal Supplement Series, 233, 25
Alexander D. M., Hickox R. C., 2012, New Astronomy Reviews, 56, 93
Bamford S. P., et al., 2009, Monthly Notices of the Royal Astronomical Society, 393, 1324
Bansal G., Nushi B., Kamar E., Lasecki W. S., Weld D. S., Horvitz E., 2019, Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7, 19
Beck M. R., et al., 2018, Monthly Notices of the Royal Astronomical Society, 476, 5516
Bluck A. F., Trevor Mendel J., Ellison S. L., Moreno J., Simard L., Patton D. R., Starkenburg E., 2014, Monthly Notices of the Royal Astronomical Society, 441, 599
Brooks A., Christensen C., 2015, in Galactic Bulges. Vol. 418, pp 317–353, doi:10.1007/978-3-319-19378-6_12
Cappellari M., Copin Y., 2003, Monthly Notices of the Royal Astronomical Society, 342, 345
Cardamone C., et al., 2009, Monthly Notices of the Royal Astronomical Society, 399, 1191
Caruana R., 1997, Machine Learning, 28, 41
Casteels K. R. V., et al., 2014, Monthly Notices of the Royal Astronomical Society, 445, 1157
Chang J. C., Amershi S., Kamar E., 2017, Conference on Human Factors in Computing Systems - Proceedings, 2017-May, 2334
Cheng T. Y., et al., 2020, Monthly Notices of the Royal Astronomical Society, 493, 4209
Dey A., et al., 2018, eprint arXiv:1804.08657
Dickinson H., Fortson L., Scarlata C., Beck M., Walmsley M., 2019, Proceedings of the International Astronomical Union
Dieleman S., Willett K. W., Dambre J., 2015, Monthly Notices of the Royal Astronomical Society, 450, 1441
Domínguez Sánchez H., et al., 2018, Monthly Notices of the Royal Astronomical Society, 476, 3661
Domínguez Sánchez H., et al., 2019, Monthly Notices of the Royal Astronomical Society, 484, 93
Eykholt K., et al., 2018, in Conference on Computer Vision and Pattern Recognition. http://arxiv.org/abs/1707.08945
Fang J. J., Faber S. M., Koo D. C., Dekel A., 2013, Astrophysical Journal, 776, 63
Ferreira L., Conselice C. J., Duncan K., Cheng T.-Y., Griffiths A., Whitney A., 2020, The Astrophysical Journal, 895, 115
Flaugher B., et al., 2015, Astronomical Journal, 150, 150
Fontanot F., de Lucia G., Wilman D., Monaco P., 2011, Monthly Notices of the Royal Astronomical Society, 416, 409
Gal Y., 2016, PhD thesis, University of Cambridge
Geirhos R., Michaelis C., Wichmann F. A., Rubisch P., Bethge M., Brendel W., 2019, in 7th International Conference on Learning Representations, ICLR 2019. http://arxiv.org/abs/1811.12231
Hart R. E., et al., 2016, Monthly Notices of the Royal Astronomical Society, 461, 3663
Hart R. E., Bamford S. P., Casteels K. R. V., Kruk S. J., Lintott C. J., Masters K. L., 2017, Monthly Notices of the Royal Astronomical Society, 468, 1850
He K., Zhang X., Ren S., Sun J., 2016, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 770–778, doi:10.1109/CVPR.2016.90, http://arxiv.org/abs/1512.03385
He X., Zhao K., Chu X., 2019, arXiv preprint
Hendrycks D., Gimpel K., 2017, in International Conference on Learning Representations. http://arxiv.org/abs/1610.02136
Hopkins P. F., Cox T. J., Younger J. D., Hernquist L., 2009, Astrophysical Journal, 691, 1168
Hopkins P. F., et al., 2010, Astrophysical Journal, 715, 202
Houlsby N., 2014, PhD thesis, doi:10.1007/BF03167379
Howard A. G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T., Andreetto M., Adam H., 2017, arXiv preprint
Hu J., Shen L., Sun G., 2018, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 7132–7141, doi:10.1109/CVPR.2018.00745, http://arxiv.org/abs/1709.01507
Huertas-Company M., et al., 2015, Astrophysical Journal Supplement Series, 221, 8
Hunter J. D., 2007, Computing in Science and Engineering, 9, 99
Iandola F. N., Han S., Moskewicz M. W., Ashraf K., Dally W. J., Keutzer K., 2017, International Conference on Learning Representations
Jogee S., Scoville N. Z., Kenney J. D. P., 2005, The Astrophysical Journal, 630, 837
Jones E., Oliphant T., Pearu P., et al., 2001, SciPy: Open Source Scientific Tools for Python
Kaviraj S., 2013, Monthly Notices of the Royal Astronomical Society: Letters, 437
Kaviraj S., 2014, Monthly Notices of the Royal Astronomical Society, 440, 2944
Keel W. C., Manning A. M., Holwerda B. W., Mezzoprete M., Lintott C. J., Schawinski K., Gay P., Masters K. L., 2013, Publications of the Astronomical Society of the Pacific, 125, 2
Keel W. C., et al., 2015, Astronomical Journal, 149, 155
Kemenade H. v., et al., 2020, python-pillow/Pillow 7.1.2, doi:10.5281/ZENODO.3766443
Khan A., Huerta E. A., Wang S., Gruendl R., Jennings E., Zheng H., 2019, Physics Letters B, 795, 248
Kingma D. P., Ba J. L., 2015, in 3rd International Conference on Learning Representations, ICLR 2015
Kruk S. J., et al., 2017, Monthly Notices of the Royal Astronomical Society, 469, 3363
Kruk S. J., et al., 2018, Monthly Notices of the Royal Astronomical Society, 473, 4731
Lanczos C., 1938, Journal of Mathematics and Physics, 17, 123
Land K., et al., 2008, Monthly Notices of the Royal Astronomical Society, 388, 1686
Lin L., et al., 2019, The Astrophysical Journal, 872, 50
Lingard T. K., et al., 2020, arXiv
Lintott C. J., et al., 2008, Monthly Notices of the Royal Astronomical Society, 389, 1179
Lintott C. J., et al., 2009, Monthly Notices of the Royal Astronomical Society, 399, 129
Liu A., Guerra S., Fung I., Matute G., Kamar E., Lasecki W., 2020, in The Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020. pp 2432–2442, doi:10.1145/3366423.3380306
Lochner M., Bassett B. A., 2020, arXiv
Lofthouse E. K., Kaviraj S., Conselice C. J., Mortlock A., Hartley W., 2017, Monthly Notices of the Royal Astronomical Society, 465, 2895
López-Sanjuan C., Balcells M., Pérez-González P. G., Barro G., Gallego J., Zamorano J., 2010, Astronomy and Astrophysics, 518, A20
Lotz J. M., Jonsson P., Cox T. J., Croton D., Primack J. R., Somerville R. S., Stewart K., 2011, Astrophysical Journal, 742, 103
Margalef-Bentabol B., Huertas-Company M., Charnock T., Margalef-Bentabol C., Bernardi M., Dubois Y., Storey-Fisher K., Zanisi L., 2020, Detecting outliers in astronomical images with deep generative networks, doi:10.1093/mnras/staa1647, https://arxiv.org/abs/2003.08263
Martin G., Kaviraj S., Devriendt J. E., Dubois Y., Pichon C., 2018, Monthly Notices of the Royal Astronomical Society, 480, 2266
Martin G., Kaviraj S., Hocking A., Read S. C., Geach J. E., 2020, Monthly Notices of the Royal Astronomical Society, 491, 1408
Masters K. L., 2019, Proceedings of the International Astronomical Union, 14, 205
Masters K. L., et al., 2010, Monthly Notices of the Royal Astronomical Society, 404, 792
Masters K. L., et al., 2011, Monthly Notices of the Royal Astronomical Society, 411, 2026
Masters K. L., et al., 2012, Monthly Notices of the Royal Astronomical Society, 424, 2180
Masters K. L., et al., 2019, Monthly Notices of the Royal Astronomical Society, 487, 1808
McInnes L., Healy J., Astels S., 2017, The Journal of Open Source Software, 2, 205
McInnes L., Healy J., Melville J., 2020, arXiv preprint
McKinney W., 2010, Data Structures for Statistical Computing in Python, http://conference.scipy.org/proceedings/scipy2010/mckinney.html
Moiseev A. V., Smirnova K. I., Smirnova A. A., Reshetnikov V. P., 2011, Monthly Notices of the Royal Astronomical Society, 418, 244
Nair P. B., Abraham R. G., 2010, The Astrophysical Journal Supplement Series, 186, 427
Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
Ren J., et al., 2019, Technical report, Likelihood Ratios for Out-of-Distribution Detection
Sakamoto K., Okumura S. K., Ishizuki S., Scoville N. Z., 1999, Bar-driven Transport of Molecular Gas to Galactic Centers and its Consequences, doi:10.1086/307910
Sandler M., Howard A., Zhu M., Zhmoginov A., Chen L. C., 2018, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 4510–4520, doi:10.1109/CVPR.2018.00474, http://arxiv.org/abs/1801.04381
Sheth K., Blain A. W., Kneib J.-P., Frayer D. T., van der Werf P. P., Knudsen K. K., 2004, The Astrophysical Journal, 614, L5
Simmons B., et al., 2013, Monthly Notices of the Royal Astronomical Society, 429, 2199
Simmons B. D., et al., 2014, Monthly Notices of the Royal Astronomical Society, 445, 3466
Simmons B. D., Smethurst R. J., Lintott C., 2017a, Monthly Notices of the Royal Astronomical Society, 12, 1
Simmons B. D., et al., 2017b, Monthly Notices of the Royal Astronomical Society, 464, 4420
Skibba R. A., et al., 2012, Monthly Notices of the Royal Astronomical Society, 423, 1485
Smethurst R. J., et al., 2015, Monthly Notices of the Royal Astronomical Society, 450, 435
Smethurst R. J., Simmons B. D., Lintott C. J., Shanahan J., 2019, Monthly Notices of the Royal Astronomical Society, 489, 4016
Smith L., Gal Y., 2018, arXiv preprint
Spindler A., et al., 2017, Monthly Notices of the Royal Astronomical Society, 23, 1
Szegedy C., Zaremba W., Sutskever I., Bruna J., Erhan D., Goodfellow I., Fergus R., 2014, in 2nd International Conference on Learning Representations, ICLR 2014
Tan M., Le Q. V., 2019, in 36th International Conference on Machine Learning, ICML 2019. pp 10691–10700, http://arxiv.org/abs/1905.11946
Tan M., Chen B., Pang R., Vasudevan V., Sandler M., Howard A., Le Q. V., 2019, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019-June, 2815
The Astropy Collaboration et al., 2018, The Astronomical Journal, 156, 123
The Dark Energy Survey Collaboration, 2005, The Dark Energy Survey, http://arxiv.org/abs/astro-ph/0510346
The Zooniverse Team, 2020, zooniverse/panoptes: Zooniverse API to support user defined volunteer research projects, https://github.com/zooniverse/panoptes
Tojeiro R., et al., 2013, Monthly Notices of the Royal Astronomical Society, 432, 359
Walmsley M., et al., 2020, Monthly Notices of the Royal Astronomical Society, 491, 1554
Wang J., et al., 2011, Monthly Notices of the Royal Astronomical Society, 413, 1373
Wilder B., Horvitz E., Kamar E., 2020, in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. pp 1526–1533, doi:10.24963/ijcai.2020/212
Willett K. W., et al., 2013, Monthly Notices of the Royal Astronomical Society, 435, 2835
Willett K. W., et al., 2015, Monthly Notices of the Royal Astronomical Society, 449, 820
Willett K. W., et al., 2017, Monthly Notices of the Royal Astronomical Society, 464, 4176
Wright D. E., et al., 2017, Monthly Notices of the Royal Astronomical Society, 472, 1315
Wright D. E., Fortson L., Lintott C., Laraia M., Walmsley M., 2019, ACM Transactions on Social Computing, 2, 1
Yang K., Qinami K., Fei-Fei L., Deng J., Russakovsky O., 2020, in FAT* 2020 - Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. pp 547–558, doi:10.1145/3351095.3375709, http://arxiv.org/abs/1912.07726
York D. G., et al., 2000, The Astronomical Journal, 120, 1579
van der Walt S., Schönberger J. L., Nunez-Iglesias J., Boulogne F., Warner J. D., Yager N., Gouillart E., Yu T., 2014, PeerJ, 2, e453

This paper has been typeset from a TEX/LaTeX file prepared by the author.

Figure 11.
Posteriors for ‘Does this galaxy have a bar?’, for the same random galaxies selected in Fig. 10. Each point is colored by the predicted probability of volunteers giving that many ‘Strong’, ‘Weak’, and (implicitly, as the total answers is fixed) ‘None’ votes. The volunteer answer (not known to the classifier) is circled. For clarity, only the mean posterior across all models and dropout forward passes is shown.
Figure 12.
[Panels, with columns ‘All Galaxies’ (left) and ‘High Volunteer Confidence’ (right): (a) ‘Smooth or Featured’; (b) Edge On Disk; (c) ‘Has Spiral Arms’; (d) ‘Spiral Arm Count’; (e) ‘Spiral Arm Winding’.]
Figure 13.
Confusion matrices for each question, made on the test set of 49,700 randomly-selected galaxies with at least three volunteer votes. Discrete classifications are made by rounding the vote fraction (label) and mean posterior (prediction) to the nearest integer. The matrices then show the counts of rounded predictions (x axis) against rounded labels (y axis). We report confusion matrices for all 49,700 galaxies (left) or only for galaxies where the volunteers are confident in that question, defined as having the vote fraction for one answer above 0.8 (right). Such confident galaxies are expected to have a clearly correct label, making correct and incorrect predictions straightforward to measure but also making the classification task easier. Continued below.
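The discretisation described in the caption can be sketched in a few lines: round each vote fraction (label) and each mean posterior (prediction) to the nearest integer, then count the (label, prediction) pairs. The fractions below are illustrative, not drawn from the catalogue:

```python
# Sketch of building a 2x2 confusion count from vote fractions by
# rounding, as in the caption above. Plain Python rather than the
# paper's actual evaluation code; input fractions are made up.

def confusion_counts(label_fractions, predicted_fractions):
    """Counts of (rounded label, rounded prediction) pairs."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for label, pred in zip(label_fractions, predicted_fractions):
        counts[(round(label), round(pred))] += 1
    return counts

labels = [0.9, 0.2, 0.6, 0.1]  # observed vote fractions
preds = [0.8, 0.4, 0.4, 0.3]   # mean posterior predictions
print(confusion_counts(labels, preds))
```

Note that rounding discards information, which is why the continued caption below Figure 15 recommends using the full vote fractions or posteriors where possible.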
Figure 14.
[Panels, with columns ‘All Galaxies’ (left) and ‘High Volunteer Confidence’ (right): (f) ‘Bar’; (g) ‘Bulge Size’; (h) ‘Merging’; (i) ‘Edge On Bulge Shape’; (j) Edge On Bulge Ellipticity.]
Figure 15.
Confusion matrices, continued from above. To avoid the loss of information from rounding, we encourage researchers not to treat Galaxy Zoo classifications as discrete, and instead to use the full vote fractions or posteriors where possible.
[Figure 16 axes: vote fraction mean deviation, per answer: smooth-or-featured_smooth, smooth-or-featured_featured-or-disk, smooth-or-featured_artifact, disk-edge-on_yes, disk-edge-on_no, has-spiral-arms_yes, has-spiral-arms_no, bar_strong, bar_weak, bar_no, bulge-size_dominant, bulge-size_large, bulge-size_moderate, bulge-size_small, bulge-size_none, how-rounded_round, how-rounded_in-between, how-rounded_cigar-shaped, edge-on-bulge_boxy, edge-on-bulge_none, edge-on-bulge_rounded, spiral-winding_tight, spiral-winding_medium, spiral-winding_loose, spiral-arm-count_1, spiral-arm-count_2, spiral-arm-count_3, spiral-arm-count_4, spiral-arm-count_more-than-4, spiral-arm-count_cant-tell, merging_none, merging_minor-disturbance, merging_major-disturbance, merging_merger.]
Figure 16.
Mean absolute deviations between the model predictions and the observed vote fractions, by question, for the test set galaxies with approximately 40 volunteer responses. The model is typically well within 10% of the observed vote fractions.
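The metric in Figure 16 is simply the average of the absolute differences between predicted and observed vote fractions for a given answer. A minimal sketch, with made-up fractions:

```python
# Sketch of the mean-absolute-deviation metric in Figure 16:
# average |predicted fraction - observed fraction| over the test
# galaxies for one answer. Input fractions here are illustrative.

def mean_abs_deviation(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

obs = [0.8, 0.1, 0.5, 0.3]   # observed vote fractions
pred = [0.7, 0.2, 0.4, 0.3]  # model-predicted fractions
print(round(mean_abs_deviation(obs, pred), 3))  # 0.075
```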
[Figure 17 panels: ‘Smooth Or Featured’ (answers Smooth, Featured-or-disk, Artifact), ‘Bar’ (Strong, Weak, No), ‘Has Spiral Arms’ (Yes), and ‘Bulge Size’ (Large, Moderate, Small, None). Each panel plots mean error vs. all votes against the truncated number of votes (0 to 20).]
Figure 17.
Mean errors vs. the true (N > 75 votes) vote fractions, for either a truncated number of volunteers (up to N = 20, solid) or the automated classifier (dashed). Asking only a few volunteers gives a noisy estimate of the true vote fraction. Asking more volunteers reduces this noise. For some number of volunteers, the noise in the vote fraction is similar to the error of the automated classifier, meaning they make classifications of similar accuracy; this number is where the solid and dashed lines intersect. We find the automated classifier has a similar accuracy to approx. 5 to 15 volunteers, depending on the question.
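The noise-versus-volunteers comparison can be illustrated with a small simulation: draw k votes from a fixed true fraction and measure how far the resulting vote fraction typically lands from the truth. This is a simplification (the paper truncates real volunteer responses rather than simulating votes):

```python
import random

# Sketch of the comparison in Figure 17: how noisy is a vote fraction
# estimated from only k volunteers? We simulate votes as independent
# draws from a fixed true fraction, a simplification of truncating
# real volunteer responses.
random.seed(0)

def mean_error_from_k_votes(true_fraction, k, trials=20000):
    total = 0.0
    for _ in range(trials):
        votes = sum(random.random() < true_fraction for _ in range(k))
        total += abs(votes / k - true_fraction)
    return total / trials

# More volunteers -> less noise in the estimated vote fraction.
few = mean_error_from_k_votes(0.3, 5)
many = mean_error_from_k_votes(0.3, 40)
print(few > many)  # True
```

Where the classifier's mean error falls between these two noise levels determines how many volunteers it is effectively worth.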
Figure 18.
Random spiral galaxies where the classifier confuses the most likely volunteer vote for spiral arm count between ‘2’ and ‘Can’t Tell’. Above: galaxies where the classifier predicted ‘2’ but more volunteers answered ‘Can’t Tell’. Below: vice versa, galaxies where the classifier predicted ‘Can’t Tell’ but more volunteers answered ‘2’. Red text shows the volunteer (vol.) and machine-learning-predicted (ML) vote fractions for each answer. Counting the spiral arms is challenging, even for the authors. This highlights the difficulty in assessing performance by reducing the posteriors to classifications and then comparing against uncertain true labels.
Figure 19.
Galaxies binned by ‘Smooth or Featured’ vote prediction entropy, measuring the model’s uncertainty in the votes. Bins (columns) are equally spaced (boundaries noted above). Five random galaxies are shown per bin. Unusual, inclined or poorly-scaled galaxies have highly uncertain (high entropy) votes, while smooth and especially clearly featured galaxies have confident (low entropy) votes, matching our intuition and demonstrating that our posteriors provide meaningful uncertainties.
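The entropy used to rank galaxies measures how spread out the predicted vote distribution is. The sketch below applies the standard Shannon entropy to a probability vector over the three ‘Smooth or Featured’ answers; the paper's entropy is over the full predicted vote-count distribution, and the probabilities here are illustrative:

```python
import math

# Sketch of the entropy ranking in Figure 19: higher entropy means
# a more uncertain predicted vote distribution. Probabilities are
# illustrative, not taken from the catalogue.

def entropy(probs):
    """Shannon entropy (nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.9, 0.05, 0.05]  # clearly featured: low entropy
uncertain = [0.4, 0.35, 0.25]  # ambiguous galaxy: high entropy
print(entropy(confident) < entropy(uncertain))  # True
```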
[Figure 20 panels: ‘Disk Edge On’ and ‘Has Spiral Arms’; x-axis: credible interval width; y-axis: ratio of galaxies in interval.] Figure 20.
Calibration curves for the two binary GZ DECaLS questions. The x-axis shows the credible interval width; for data-dominated posteriors, roughly (e.g.) 30% of galaxies should have vote fractions within their 30% credible interval. The y-axis shows what percentage actually do fall within each interval width. We split calibration by galaxies with few votes (and hence typically wider posteriors) and more votes (narrower posteriors). Only credible intervals with at least 100 measurements are shown. Calibration for both questions is excellent.

[Figure 21 panels: Volunteers (N=5378) and Automated (N=43672); axes: average bulge size (B avg) vs. average spiral winding (W avg).]
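The calibration check amounts to a coverage test: for a given credible interval width, count how often the observed vote fraction actually falls inside the interval. The sketch below uses toy intervals rather than the model's real posterior intervals:

```python
# Sketch of the coverage computation behind Figure 20: the fraction
# of galaxies whose observed vote fraction falls inside their
# credible interval. Intervals and fractions here are toy values,
# not the model's actual posterior intervals.

def coverage(observed, intervals):
    """Fraction of observations falling inside their (lo, hi) interval."""
    inside = sum(lo <= obs <= hi for obs, (lo, hi) in zip(observed, intervals))
    return inside / len(observed)

observed = [0.30, 0.80, 0.55, 0.10]
intervals = [(0.25, 0.40), (0.60, 0.75), (0.50, 0.70), (0.05, 0.20)]
print(coverage(observed, intervals))  # 0.75
```

A well-calibrated model has coverage close to the nominal interval width: for example, about 30% of galaxies should land inside their 30% credible intervals.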
Figure 21.
Distribution of bulge size vs. spiral winding, using responses from volunteers (left) or our automated predictions (right). We observe no clear correlation between bulge size and spiral winding, consistent with M19. The distributions are consistent between volunteers and our automated method. We hope this demonstrates the accuracy and scientific value of our automated classifier.
Figure A1.
Galaxies automatically classified as most likely (highest mean posterior) to be strongly barred.
Figure A2.
Galaxies automatically classified as most likely (highest mean posterior) to be edge-on with no bulge.
Figure A3.
Galaxies automatically classified as most likely (highest mean posterior) to have exactly one spiral arm.
Figure A4.
Galaxies automatically classified as most likely (highest mean posterior) to have loosely wound spiral arms.
Figure A5.
Galaxies automatically classified as most likely (highest mean posterior) to be mergers.
[Figure B1 content (questions with answer options):
T00: Is the galaxy simply smooth and rounded, with no sign of a disk? (Smooth; Features or disk; Star or artifact)
T01: Could this be a disk viewed edge-on? (Yes; No)
T02: Is there a sign of a bar feature through the centre of the galaxy? (Bar; No bar)
T03: Is there any sign of a spiral arm pattern? (Spiral; No spiral)
T04: How prominent is the central bulge, compared with the rest of the galaxy? (No bulge; Obvious; Dominant)
T05: Is the galaxy currently merging or is there any sign of tidal debris? (Merging; Tidal debris; Both; Neither)
T06: Do you see any of these odd features in the image? (None; Ring; Lens or arc; Dust lane; Irregular; Overlapping; Other)
T07: How rounded is it? (Completely round; In between; Cigar shaped)
T08: Does the galaxy have a bulge at its centre? If so, what shape? (Rounded; Boxy; No bulge)
T09: How tightly wound do the spiral arms appear? (Tight; Medium; Loose)
T10: How many spiral arms are there? (1; 2; 3; 4; More than 4)]
Figure B1.
Decision tree used for GZD-1 and GZD-2, based on the Galaxy Zoo 2 decision tree. The GZD-5 decision tree is shown in Figure 3.