[PDF] Galaxy Zoo DECaLS: Detailed Visual Morphology Measurements from Volunteers and Deep Learning for 314,000 Galaxies

Abstract

We present Galaxy Zoo DECaLS: detailed visual morphological classifications for Dark Energy Camera Legacy Survey images of galaxies within the SDSS DR8 footprint. Deeper DECaLS images (r=23.6 vs. r=22.2 from SDSS) reveal spiral arms, weak bars, and tidal features not previously visible in SDSS imaging. To best exploit the greater depth of DECaLS images, volunteers select from a new set of answers designed to improve our sensitivity to mergers and bars. Galaxy Zoo volunteers provide 7.5 million individual classifications over 314,000 galaxies. 140,000 galaxies receive at least 30 classifications, sufficient to accurately measure detailed morphology like bars, and the remainder receive approximately 5. All classifications are used to train an ensemble of Bayesian convolutional neural networks (a state-of-the-art deep learning method) to predict posteriors for the detailed morphology of all 314,000 galaxies. When measured against confident volunteer classifications, the networks are approximately 99% accurate on every question. Morphology is a fundamental feature of every galaxy; our human and machine classifications are an accurate and detailed resource for understanding how galaxies evolve.

Full PDF

MMNRAS , 1–18 (2021) Preprint 18 February 2021 Compiled using MNRAS L A TEX style ﬁle v3.0

Galaxy Zoo DECaLS: Detailed Visual Morphology Measurementsfrom Volunteers and Deep Learning for 314,000 Galaxies

Mike Walmsley ★ , Chris Lintott , Tobias Géron , Sandor Kruk , Coleman Krawczyk ,Kyle W. Willett , Steven Bamford , William Keel , Lee S. Kelvin ,Lucy Fortson Karen L. Masters , Vihang Mehta , Brooke D. Simmons , Rebecca Smethurst ,Elisabeth M. Baeten , Christine Macmillan , Oxford Astrophysics, Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford, OX1 3RH, UK European Space Agency, ESTEC, Keplerlaan 1, NL-2201 AZ, Noordwĳk, The Netherlands Institute of Cosmology and Gravitation, University of Portsmouth Dennis Sciama Building, Burnaby Road, Portsmouth, PO1 3FX, UK School of Physics and Astronomy, University of Minnesota, 116 Church St SE, Minneapolis, MN 55455, USA Centre for Astronomy & Particle Theory, School of Physics and Astronomy, University of Nottingham, University Park, Nottingham, NG7 2RD, UK Dept. of Physics and Astronomy, University of Alabama, Tuscaloosa, AL 35487, USA Department of Astrophysical Sciences, Princeton University, 4 Ivy Lane, Princeton, NJ 08544, USA Minnesota Institute for Astrophysics, University of Minnesota, 116 Church St SE, Minneapolis, MN 55455, USA Department of Physics and Astronomy, Haverford College, 370 Lancaster Avenue, Haverford, PA 19041, USA Department of Physics, Lancaster University, Bailrigg, Lancaster, LA1 4YB, UK Citizen Scientist, Zooniverse c/o University of Oxford, Keble Road, Oxford OX1 3RH, UK

Last updated XXX; in original form XXX

ABSTRACT

We present Galaxy Zoo DECaLS: detailed visual morphological classiﬁcations for Dark En-ergy Camera Legacy Survey images of galaxies within the SDSS DR8 footprint. DeeperDECaLS images ( 𝑟 = . 𝑟 = . Key words: methods: data analysis, galaxies: bar, galaxies: bulges, galaxies: disc, galaxies:interaction, galaxies: general

Morphology is a key driver and tracer of galaxy evolution. For ex-ample, bars are thought to move gas inwards (Sakamoto et al. 1999)driving and/or shutting down star formation (Sheth et al. 2004; Jogeeet al. 2005), and bulges are linked to global quenching (Masters et al.2011; Fang et al. 2013; Bluck et al. 2014) and inside-out quenching(Spindler et al. 2017; Lin et al. 2019). Morphology also traces otherkey drivers, such as the merger history of a galaxy. Mergers supportgalaxy assembly (Wang et al. 2011; Martin et al. 2018), though their ★ Contact e-mail: [email protected] relative contribution is an open question (Casteels et al. 2014), andmay create tidal features, bulges, and disks, allowing past mergersto be identiﬁed (Hopkins et al. 2010; Fontanot et al. 2011; Kaviraj2014; Brooks & Christensen 2015).Unpicking the complex interplay between morphology andgalaxy evolution requires measurements of detailed morphology inlarge samples. While modern surveys reveal exquisite morphologi-cal detail, they image far more galaxies than scientists can visuallyclassify. Galaxy Zoo solves this problem by asking members of thepublic to volunteer as ‘citizen scientists’ and provide classiﬁcationsthrough a web interface. Galaxy Zoo has provided morphology mea-surements for surveys including SDSS (Lintott et al. 2008; Willett © a r X i v : . [ a s t r o - ph . GA ] F e b M. Walmsley et al et al. 2013) and large HST programs (Simmons et al. 2017b; Willettet al. 2017).Knowing the morphology of homogeneous samples of hun-dreds of thousands of galaxies supports science only possible atscale. The catalogues produced by the collective eﬀort of GalaxyZoo volunteers have been used as the foundation of a large numberof studies of galaxy morphology (see Masters 2019 for a review),with the method’s ability to provide estimates of conﬁdence along-side classiﬁcation especially valuable. Galaxy Zoo measures subtleeﬀects in large populations (Masters et al. 2010; Willett et al. 2015;Hart et al. 2017); identiﬁes unusual populations that challenge stan-dard astrophysics (Simmons et al. 2013; Tojeiro et al. 2013; Kruket al. 2017); and ﬁnds unexpected and interesting objects that pro-vide unique data on broader galaxy evolution questions (Lintottet al. 2009; Cardamone et al. 2009; Keel et al. 2015).Here, we present the ﬁrst volunteer classiﬁcations of galaxyimages collected by the Dark Energy Camera Legacy Survey (DE-CaLS, Dey et al. 2018). This work represents the ﬁrst systematicengagement of volunteers with low-redshift images as deep as thoseprovided by DECaLS, and thus represents a more reliable catalogueof detailed morphology than has hitherto been available. These de-tailed classiﬁcations include the presence and strength of bars andbulges, the count and winding of spiral arms, and the indicationsof recent or ongoing mergers. Our volunteer classiﬁcations weresourced over three separate Galaxy Zoo DECaLS (GZD) classiﬁ-cation campaigns, GZD-1, GZD-2, and GZD-5, which classiﬁedgalaxies ﬁrst released in DECaLS Data Releases 1, 2, and 5 re-spectively. The key practical diﬀerences are that GZD-5 uses animproved decision tree aimed at better identiﬁcation of mergersand weak bars, and includes galaxies with just 5 total votes aswell as galaxies with 40 or more. Across all campaigns, we collect7,496,325 responses from Galaxy Zoo volunteers, recording 30 ormore classiﬁcations in at least one campaign for 139,919 galax-ies and fewer (approximately 5 classiﬁcations) for an additional173,870 galaxies, totalling 313,789 classiﬁed galaxies.For the ﬁrst time in a Galaxy Zoo data release, we also pro-vide automated classiﬁcations made using Bayesian deep learning(Walmsley et al. 2020). By using our volunteer classiﬁcations totrain a deep learning algorithm, we can make detailed classiﬁca-tions for all 313,789 galaxies in our target sample, providing mor-phology measurements faster than would be possible than relyingon volunteers alone. Bayesian deep learning allows us to learn fromuncertain volunteer responses and to estimate the uncertainty of ourpredictions. Our classiﬁer predicts posteriors for how volunteerswould have answered all decision tree questions , with an accuracycomparable to asking 5 to 15 volunteers, depending on the question,and, for some questions, exceeding 99% accuracy on galaxies wherethe volunteers are conﬁdent (volunteer vote fractions below 0.2 orabove 0.8).In Section 2, we describe the observations used and the creationof RGB images suitable for classiﬁcation. In Section 3, we givean overview of the volunteer classiﬁcation process and detail thenew decision trees used. In Section 4, we investigate the eﬀects ofimproved imaging and improved decision trees, and we compare ourresults to other morphological measurements. Then, in Section 5,we describe the design and performance of our automated classiﬁer- an ensemble of Bayesian convolutional neural networks. Finally, Excluding the ﬁnal ‘Is there anything odd?’ question as it is multiple-choice in Section 6, we provide guidance (and example code) for eﬀectiveuse of the classiﬁcations.

Our galaxy images are created from data collected by the DECaLSsurvey (Dey et al. 2018). DECaLS uses the Dark Energy Camera(DECam, Flaugher et al. 2015) at the 4m Blanco telescope at CerroTololo Inter-American Observatory, near La Serena, Chile. DECamhas a roughly hexagonal 3.2 square degree ﬁeld of view with a pixelscale of 0.262 arcsec per pixel. The median point spread functionFWHM is 1 . (cid:48)(cid:48)

29, 1 . (cid:48)(cid:48)

18 and 1 . (cid:48)(cid:48)

11 for 𝑔 , 𝑟 , and 𝑧 , respectively.The DECaLS survey contributes targeting images for the up-coming Dark Energy Spectroscopic Instrument (DESI). DECaLSis responsible for the DESI footprint in the Southern Galactic Cap(SGC) and the 𝛿 (cid:54)

34 region of the Northern Galactic Cap (NGC),totalling 10,480 square degrees . 1130 square degrees of the SGCDESI footprint are already being imaged by DECam through theDark Energy Survey (DES, The Dark Energy Survey Collaboration2005) so DECaLS does not repeat this part of the DESI footprint.DECaLS implements a 3-pass strategy to tile the sky. Each pass isslightly oﬀset (approx 0.1-0.6 deg). The choice of pass and exposuretime for each observation is optimised in real time based on the ob-serving conditions recorded for the previous targets, as well as theinterstellar dust reddening, sky position, and estimated observingconditions of possible next targets. This allows a near-uniform depthacross the survey. In DECaLS DR1, DR2, and DR5, from whichour images are drawn, the median 5 𝜎 point source depths for areaswith 3 observations was approximately (AB) 𝑔 =24.65, 𝑟 =23.61, and 𝑧 =22.84 . The DECaLS survey completed observations in March2019. We identify galaxies in the DECaLS imaging using the NASA-Sloan Atlas v1.0.0 (NSA). As the NSA was derived from SDSSDR8 imaging (Aihara et al. 2011), this data release only includesgalaxies which are within both the DECaLS and SDSS DR8 foot-print. In eﬀect, we are using deeper DECaLS imaging of the galaxiespreviously imaged in SDSS DR8. This ensures our morphologicalmeasurements have a wealth of ancillary information derived fromSDSS and related surveys, and allows us to measure any shift inclassiﬁcations vs. Galaxy Zoo 2 using the subset of SDSS DR8galaxies classiﬁed both in this work and in Galaxy Zoo 2 (Sec.4). Figure 1 shows the resulting GZ DECaLS sky coverage. NSAv1.0.0 was not published but the values of the columns used here areidentical to those in NSA v1.0.1, released in SDSS DR13 (Albaretiet al. 2017); only the column naming conventions are diﬀerent.Selecting galaxies with the NSA introduces two implicit cuts.First, the NSA primarily includes galaxies brighter than 𝑚 𝑟 = . 𝑚 𝑟 = .

77 are included only if they are in deeper survey areas (e.g.Stripe82) or were measured using ‘spare’ ﬁbres after all brightergalaxies in a given ﬁeld were covered; we suggest researchersenforce their own magnitude cut according to their science case. The remaining DESI footprint is being imaged by DECaLS’ companionsurveys, MzLS and BASS (Dey et al. 2018)000

Sky coverage of GZ DECaLS (equatorial coordinates), resultingfrom the imaging overlap of DECaLS DR5 and SDSS DR8, shown in red.Darker areas indicate more galaxies. Sky coverage of Galaxy Zoo 2, whichused images sourced from SDSS DR7, shown in light blue. The NSA in-cludes galaxies imaged by SDSS DR8, including galaxies newly imaged atthe Southern Galactic Cap (approx. 2500 deg ) Second, the NSA only covers redshifts of 𝑧 = .

15 or below. Tothese implicit cuts, we add an explicit cut requiring Petrosian ra-dius (

PETROTHETA ) of at least 3 arcseconds, to ensure the galaxy issuﬃciently extended for meaningful classiﬁcation.For each galaxy, if the coordinates had been imaged in the 𝑔 , 𝑟 and 𝑧 bands, and the galaxy passed the selection cuts above,we acquired a combined FITS cutout of the 𝑔𝑟𝑧 ×

424 pixel squaregalaxy images. GZD-1 and GZD-2 acquired 424 ×

424 pixel squareFITS cutouts directly from the cutout service. To ensure that galaxiestypically ﬁt well within a 424 pixel image, cutouts were downloadedwith an interpolated pixel scale 𝑠 of 𝑠 = max ( min ( 𝑝 ∗ . , 𝑝 ∗ . ) , . ) (1)where 𝑝 is the Petrosian 50%-light radius and 𝑝 is the Petrosian90%-light radius.For GZD-5, to avoid banding artifacts caused by the inter-polation method of the DECaLS cutout service, each FITS imagewas downloaded at the ﬁxed native telescope resolution of 0.262arcsec per pixel , with enough pixels to cover the same area as424 pixels at the interpolated pixel scale 𝑠 . These individually-sizedFITS were then resized locally up to the interpolated pixel scale 𝑠 by Lanczos interpolation (Lanczos 1938). Image processing isotherwise identical between campaigns. Galaxies with incompleteimaging, deﬁned as more than 20% missing pixels in any band, werediscarded. For GZD-1/2, 92,960 of 101,252 galaxies had completeimaging (91.8%). For GZD-5, 216,106 of 247,746 galaxies not inDECaLS DR1/2 had complete imaging (87.2%) . We convert the measured grz ﬂuxes into RGB images. To use the grz bands as RGB colors, we multiply the ﬂux values in each bandby 125.0, 71.43, and 52.63, respectively. These numbers are chosen Up to a maximum of 512 pixels per side. Highly extended galaxies weredownloaded at reduced resolution such that the FITS had exactly 512 pixelsper side. Note that these numbers do not sum to the total number of galaxiesclassiﬁed across both campaigns because some galaxies are shared betweencampaigns. by eye such that typical subjects show an appropriate range of coloronce mapped to RGB channels.For background pixels with very low ﬂux, and therefore highvariance in the proportion of ﬂux per band, naively colouring bythe measured ﬂux creates a speckled eﬀect (Willett et al. 2017).As an extreme example, a pixel with 1 photon in the 𝑔 band andno photons in 𝑟 or 𝑧 would be rendered entirely red. To removethese colourful speckles, we desaturate pixels with very low ﬂux.We ﬁrst estimate the total per-pixel photon count 𝑁 assuming anexposure time of 90 seconds per band and a mean photon frequencyof 600nm. Poisson statistics imply the standard deviation on thetotal mean ﬂux in that pixel is proportional to √ 𝑁 . For pixels with astandard deviation below 100, we scale the per-band deviation fromthe mean per-pixel ﬂux by a factor of 1% of the standard deviation.The eﬀect is to reduce the saturation of low-ﬂux pixels in proportionto the standard deviation of the total ﬂux. Mathematically, we set 𝑋 (cid:48) 𝑖 𝑗𝑐 = 𝑋 𝑖 𝑗 + 𝛼𝑋 𝑖 𝑗𝑐 where 𝛼 = min ( . √︃ 𝑋 𝑖 𝑗 𝑇 / 𝜆, ) (2)where 𝑋 𝑖 𝑗𝑐 and 𝑋 (cid:48) 𝑖 𝑗𝑐 are the ﬂux at pixel 𝑖 𝑗 in channel 𝑐 beforeand after desaturation, 𝑋 𝑖 𝑗 is the mean ﬂux across bands at pixel 𝑖 𝑗 , 𝑇 is the mean exposure time (here, 90 seconds) and 𝜆 is the meanphoton wavelength (here, 600 nm).Pixel values were scaled by sinh − ( 𝑥 ) to compensate for thehigh dynamic range typically found in galaxy ﬂux, creating imageswhich can show both bright cores and faint outer features. To removethe very brightest and darkest pixels, we linearly rescale the pixelvalues to lie on the (− . , ) interval and then clip the pixelvalues to 0 and 255 respectively. We use these ﬁnal values to createan RGB image using pillow (Kemenade et al. 2020).The images will available on Zenodo athttps://dx.doi.org/10.5281/zenodo.4196267. As of this arxivpreprint, the images have not yet been uploaded. Volunteer classiﬁcations for GZ DECaLS were collected duringthree campaigns. GZD-1 and GZD-2 classiﬁed all 99,109 galax-ies passing the criteria above from DECALS DR1 and DR2, re-spectively. GZD-1 ran from September 2015 to February 2016,and GZD-2 from April 2016 to February 2017. GZD-5 classiﬁed262,000 DECALS DR5-only galaxies passing the criteria above.GZD-5 ran from March 2017 to October 2020. GZD-5 used morecomplex retirement criteria aimed at improving our automated clas-siﬁcation (3.1) and an improved decision tree aimed at better iden-tiﬁcation of weak bars and minor mergers (4.2).This iteration of the Galaxy Zoo project used the infrastructuremade available by the Zooniverse platform; in particular, the opensource Panoptes platform (The Zooniverse Team 2020). The plat-form allows for the rapid creation of citizen science projects, andpresents participating volunteers with one of a subject set of imageschosen either randomly, or through criteria described in section 3.1.

How many volunteer classiﬁcations should each galaxy receive?Ideally, all galaxies would receive enough classiﬁcations to be con-ﬁdent in the average response (i.e. the vote fraction) while stillclassifying all the target galaxies within a reasonable timeframe.However, the size of modern surveys make this increasingly im-practical. Collecting 40 volunteer classiﬁcations for all 314,000

MNRAS , 1–18 (2021)

M. Walmsley et al

0 5,000 G a l a x i e s GZD-1020000.0 G a l a x i e s GZD-20 20 40 60 80 100

Classifications per galaxy G a l a x i e s GZD-5

Figure 2.

GZD-1, GZD-2 and GZD-5 classiﬁcation counts, excluding im-plausible classiﬁcations (Sec. 4.3.1). GZD-1 has approximately 40-60 classi-ﬁcations, GZD-2 has approximately 40, and GZD-5 has either approximately5 or approximately 30-40. 5.9% of GZD-5 galaxies received more than 40classiﬁcations due to mistaken duplicate uploads. galaxies in this data release would have taken around eight yearswithout further promotion eﬀorts. The larger data sets of futuresurveys will only be more challenging. In anticipation of futureclassiﬁcation demands, we have therefore implemented a variableretirement rate here (motivated and described further in Walmsleyet al. 2020). Unlike previous data releases, GZ DECaLS galaxieseach received diﬀerent numbers of classiﬁcations (Figure 2). Begin-ning part-way through GZD-5, we prioritise classiﬁcations for thegalaxies expected to most improve our machine learning models,and rely more heavily on those models for classifying the remainder.For GZD-1 and GZD-2, all galaxies received at least 40 clas-siﬁcations (as with previous data releases). GZD-1 galaxies havebetween 40 and 60 classiﬁcations, selected at random, while GZD-2 galaxies all have approximately 40. For GZD-5, galaxies classiﬁeduntil June 2019 also received approximately 40 classiﬁcations. FromJune 2019, we introduced an active learning system. Using activelearning, galaxies expected to be the most informative for trainingour deep learning model received 40 classiﬁcations, and all remain-ing galaxies received at least 5 classiﬁcations. Galaxies receiving 5classiﬁcations of ‘artifact’ were retired at that point.By ‘most informative’, we mean the galaxies which, if classi-ﬁed, would most improve the performance of our model. We de-scribe our method for estimating which galaxies would be most in-formative in Walmsley et al. (2020). Brieﬂy, we use a convolutionalneural network to make repeated predictions for the probability that 𝑘 of 𝑁 total volunteers select a given answer. For each prediction, werandomly permute the network with MC Dropout (Gal 2016), ap-proximating (roughly) training many networks to make predictionson the same dataset. It can be shown that, under some assumptions,the most informative galaxies will be the galaxies with conﬁdentlydiﬀerent predictions under each MC Dropout permutation; that is,where the permuted networks conﬁdently disagree (Houlsby 2014).We emphasise that the number of classiﬁcations each galaxyreceived under active learning is not random . For details on handlingthis and other selection eﬀects, see Sec. 6. The questions and answers we ask our volunteers deﬁne the mea-surements we can publish. It is therefore critical that the Galaxy Zoodecision tree matches the science goals of the research community.The questions in a given Galaxy Zoo workﬂow are designedto be answerable even by a classiﬁer with little or no astrophysicalbackground. This motivates a focus primarily on the appearance ofthe galaxy, rather than incorporating physical interpretations whichwould require prior knowledge of galaxies. As an example, theinitial question in all decision trees from Galaxy Zoo 2 onwardshas asked the viewer to distinguish primarily between “smooth”and “featured” galaxies, rather than “elliptical” and “disk” galaxies.This distinction between descriptive and interpretive classiﬁcationis not always perfectly enforced. For example, the “features” re-sponse to the initial question is worded as “features or disk”, anda later question asks whether the galaxy is “merging or disturbed”,which requires some interpretation. To aid classiﬁers, all iterationsof Galaxy Zoo have therefore included illustrative icons in the clas-siﬁcation interface. Additional help is also available; in the currentproject, the interface includes a brief tutorial, a detailed ﬁeld guidewith multiple examples of each type of galaxy, and speciﬁc helptext available for each individual classiﬁcation task.The largest workﬂow change between Galaxy Zoo versions wasbetween the original Galaxy Zoo (GZ1) and Galaxy Zoo 2 (GZ2).GZ1 presented classiﬁers with a single task per galaxy, a choicebetween smooth/elliptical, multiple versions of featured/disk (in-cluding edge-on, face-on, and directionality of spiral structure), andmerger. GZ2 re-classiﬁed the brightest quarter of the GZ1 samplein much greater detail, including a branched, multi-task decisiontree. Subsequent changes to the decision tree for diﬀerent versionsof Galaxy Zoo have been mostly iterative in nature, driven in partby the data itself and in part by experience-based reﬂection whichrevealed minor adjustments that could help classiﬁers provide moreaccurate information. As an example of the former, a new branch wasadded for GZ-Hubble and GZ-CANDELS to capture informationon star-forming clumps in classiﬁcations of higher-redshift galax-ies. As an example of the latter, the ﬁnal 2 tasks of GZ2 have beenadjusted over multiple versions to facilitate reliable identiﬁcation ofrare features. Such adjustments have generally been minimized toavoid complicating comparisons with previous campaigns.The decision tree used for GZD-1 and GZD-2 has three mod-iﬁcations vs. the Galaxy Zoo 2 decision tree (Willett et al. 2013).The ‘Can’t Tell’ answer to ‘How many spiral arms are there?’ wasremoved, the number of answers to ‘How prominent is the centralbulge?’ was reduced from four to three, and ‘Is the galaxy cur-rently merging, or is there any sign of tidal debris?’ was added as astandalone question.For GZD-5, we made three further changes. Several GalaxyZoo studies (e.g. Skibba et al. 2012; Masters et al. 2012; Willettet al. 2013; Kruk et al. 2018) found that galaxies selected with0.2< 𝑝 bar <0.5 in GZ2 correspond to ‘weak bars’ when comparedwith expert classiﬁcation such as those in Nair & Abraham (2010).Therefore, to increase the detection of bars, we changed the possibleanswers to the ”Does this galaxy have a bar?” question from ‘Yes’or ‘No’ to ‘Strong’, ‘Weak’ or ‘No’. We deﬁne a strong bar as onethat is clearly visible and extending across a large fraction of thesize of the galaxy. A weak bar is smaller and fainter relative to thegalaxy, and can appear more oval than the strong bar, while stillbeing longer in one direction than the other. Our deﬁnition of strongvs. weak bar is similar that of Nair & Abraham (2010), with theexception that they also have an ‘intermediate’ classiﬁcation. We MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release added examples of galaxies with ‘weak bars’ to the Field Guideand provided a new icon for this classiﬁcation option, as shown inFigure 3.Second, to allow for more ﬁne-grained measurements of bulgesize, we increased the number of "How prominent is the centralbulge?" answers from three (‘No’, ‘Obvious’, ‘Dominant’) to ﬁve(‘No Bulge’, ‘Small’, ‘Moderate’, ‘Large’, ‘Dominant’). We alsore-included the ‘Can’t Tell’ answer.Third, we modiﬁed the ‘Merging’ question from ‘Merging’,‘Tidal’, ‘Both’, or ‘None’, to the more phenomenological ‘Merg-ing’, ‘Major Disturbance’, ‘Minor Disturbance’, or ‘No’. Our goalwas to present more direct answers to our volunteers and to betterdistinguish major and minor mergers, to support recent scientiﬁcinterest in the role of major and minor mergers on mass assem-bly (López-Sanjuan et al. 2010; Kaviraj 2013), black hole accretion(Alexander & Hickox 2012; Simmons et al. 2017a), and morphology(Hopkins et al. 2009; Lotz et al. 2011; Lofthouse et al. 2017). Wemade this ﬁnal "merger" change two months after launching GZD-5;6722 GZD-5 galaxies (2.7%) were fully classiﬁed before that dateand so do not have responses from volunteers to this question.We also make several improvements to the illustrative iconsshown for each answer. These icons are the most visible guidefor volunteers as to what each answer means (complementing thetutorial, help text, ﬁeld guide, and ‘Talk’ forum). Figure 3 shows theGZD-5 decision tree with new icons as shown to volunteers. Thedecision tree used in GZD-1 and GZD-2 is shown in Figure B1.For the ‘Smooth or Featured?’ question, we changedthe‘Smooth’ icon to include three example galaxies at various el-lipticities, and the ‘Featured’ icon to include an edge-on disk ratherthan a ring galaxy. For ‘Edge On?’, we replaced the previous tickicon with a new descriptive icon, and the previous cross icon with the‘Smooth’ icon above. We also modiﬁed the text to no longer specify‘exactly’ edge on, and renamed the answers from ‘Yes’ and ‘No’ to‘Yes - Edge On Disk’ and ’No - Something Else’. For ‘Bulge?’, wecreated new icons to match the change from four to ﬁve answers.For ‘Bar’, we replaced the previous tick and cross icons with newdescriptive icons for ‘Strong Bar’, ‘Weak Bar’ and ‘No Bar’. For‘Merger?’, we added new descriptive icons to match the updatedanswers.Changes to the decision tree complicate comparisons otherGalaxy Zoo projects. As we show in the following sections, theavailable answers will aﬀect the sensitivity of volunteers to certainmorphological features, and so morphology measurements madewith diﬀerent decision trees may not be directly comparable. Thisdiﬃculty in comparison has historically required us to be conser-vative in our changes to the decision tree. However, the advent ofeﬀective automated classiﬁcations allows us to retrospectively makeclassiﬁcations using any preferred decision tree. Speciﬁcally, in thiswork, we train our automated classiﬁer to predict what volunteerswould have said using the GZD-5 decision tree, for galaxies whichwere originally classiﬁed by volunteers using the GZD-1/2 decisiontree (Section 5.1). The images used in GZ DECaLS are deeper and higher resolutionthan were available for GZ2. The GZ2 primary sample (Willett et al.2013) uses images from SDSS DR7 (Abazajian et al. 2009), whichare 95% complete to 𝑟 = . . (cid:48)(cid:48) Figure 3.

Classiﬁcation decision tree for GZD-5, with new icons as shownto volunteers. Questions shaded with the same colours are at the same levelof branching in the tree; grey have zero dependent questions, green one, bluetwo, and purple three. plate scale of 0 . (cid:48)(cid:48)

396 per pixel (York et al. 2000). In contrast, GZDECaLS uses images from DECaLS DR2 to DR5, which have amedian 5 𝜎 point source depth of 𝑟 = .

6, a seeing better than 1 . (cid:48)(cid:48) . (cid:48)(cid:48)

262 per pixel(Dey et al. 2018) .We expect the improved imaging to reveal morphology notpreviously visible, particularly for features which are faint (e.g.tidal features, low surface brightness spiral arms) or intricate (e.g.weak bars, ﬂocculent spiral arms). Our changes to the decision tree(Sec. 3.2) were partly made to better exploit this improved imaging.To investigate the consequences of improved imaging, we com-pare galaxies classiﬁed in both GZ2 and GZ DECalS. Galaxies willtypically be classiﬁed by both projects if they are inside both theSDSS DR7 Legacy catalogue (i.e. the source GZ2 catalogue) andDECaLS DR5 footprints (broadly, North Galactic Cap galaxies with − < 𝛿 <

0) and match the selection criteria of each project (seeWillett et al. 2013 and Sec. 2.2). GZ2’s 𝑟 < . , 1–18 (2021) M. Walmsley et al

GZ2 "Featured" fraction G Z D E C a L S " F e a t u r e d " f r a c t i o n Figure 4.

Comparison of ‘Featured’ fraction for galaxies classiﬁed in bothGZ2 and GZ DECaLS. Ambiguous galaxies are consistently reported asmore featured in GZ DECaLS, which we attribute to the signiﬁcantly im-proved imaging depth of DECaLS.

We ﬁnd that volunteers successfully recognise newly-visiblemorphology features. Figure 4 compares the distribution of votefractions to "Is this galaxy smooth or featured?" for GZ2 and GZDECaLS. Ambiguous galaxies, with ‘featured’ fractions betweenapprox. 0.25 and 0.75, are consistently reported as more featured(median absolute increase of 0.13, median percentage increase of22%) with the deeper GZ DECaLS images.The shift towards featured galaxies is an accurate responseto the new images, rather than systematics from (for example) achanging population of volunteers. Figure 5 compares the GZ2and GZ DECaLS images of a random sample of galaxies drawnfrom the 1000 cross-classiﬁed galaxies with the largest increase in‘featured’ fraction. In all of these galaxies (and for a clear majorityof galaxies in similar samples), volunteers are correctly recognisingnewly visible detailed features.We observe a similar pattern in the vote fractions of spiral armsand bars for featured galaxies. For galaxies consistently consideredfeatured (i.e. where both projects reported a ‘featured’ vote fractionof at least 0.5), the median vote fraction for spiral arms increasedfrom 0.84 to 0.9, and for bars from 0.21 to 0.24. This suggests thateven for galaxies where some details were already visible (and hencewere considered featured), improved imaging makes our volunteersmore likely to identify speciﬁc features.We argue the improved depth of DECaLS ( 𝑟 = . 𝑟 = . 𝑔𝑟𝑖 bands (SDSS) to 𝑔𝑟𝑧 bands (DECaLS), which might makeolder stars more prominent. We expect the diﬀerence in seeing tobe negligible here.Comparing classiﬁcations made using the same possible an-swers on the same galaxies shows how improved DECaLS imagingleads to ambiguous galaxies being correctly reported as more fea-tured, and to spiral arms and bars being reported with more conﬁ- Figure 5.

GZ2 and GZ DECaLS images for 6 galaxies drawn randomlyfrom the 1000 galaxies classiﬁed in both projects with the largest increasein ‘featured’ vote fraction (reported fractions shown in red). The increasedfraction accurately reﬂects the increased visibility of detailed morphologyfrom improved imaging. MNRAS000

Nair: BarNair: No Bar 0.0 0.2 0.4 0.6 0.8 1.0

Bar "Strong" + "Weak" fractionsGZD-5 (New Tree)

Nair: BarNair: No Bar

Figure 6.

Left: Distribution of fraction of GZD-1/2 volunteers answering‘Yes’ (not ‘No’ to ‘Does this galaxy have a bar?’, split by expert classiﬁcationfrom NA10 of barred (blue) or unbarred (orange). Right: as left, but forGZD-5 volunteers answering ‘Strong’ or ‘Weak’ (not ‘No’). Volunteers aresubstantially better at identifying barred galaxies using the GZD-5 three-answer question. dence. However, volunteers are also sensitive to which questions areasked and how those questions are asked. We measure the impactof our changes to the decision tree ‘Bar’ question for GZD-5 in thenext section.

To measure the eﬀect of the new decision tree on bar sensitivity,we compare the classiﬁcations made using each tree against expertclassiﬁcations. Nair & Abraham 2010 (hereafter NA10) classiﬁedall 14,034 SDSS DR4 galaxies at 0 . < 𝑧 < .

05 with 𝑔 < 𝑓 featured > .

25 (as measured by GZD-5), selecting a featuredsample of 807 galaxies classiﬁed by NA10, GZD-1/2, and GZD-5.Figure 6 compares volunteer classiﬁcations for expert-labelledcalibration galaxies made using each tree. We ﬁnd that barredand unbarred galaxies are signiﬁcantly better separated with theStrong/Weak/None answers than with Yes/No answers. Of 220 Nair-identiﬁed bars (of any type), 184 (84%) receive a majority vote forbeing barred by volunteers using the new tree, up from 120 (55%)with the previous tree.NA10 classiﬁed barred galaxies into ﬁve subtypes: Strong,Intermediate, Weak, Nuclear, Ansae, and Peanut (plus None, im-plicitly). We can use the ﬁrst three subtypes as a measurement ofexpert-classiﬁed bar strength, and therefore evaluate how our volun-teers respond to bars of diﬀerent strengths. Following the approachto deﬁning summary metrics of Masters et al. (2019), we summarisethe bar vote fractions into a single volunteer estimate of bar strength, 𝐵 vol = 𝑓 strong + . 𝑓 weak , and compare the distribution of 𝐵 for eachexpert-classiﬁed bar strength (Figure 7). We ﬁnd that the volunteerbar strength estimates increase smoothly with expert-classiﬁed barstrength, though individual galaxies vary substantially. This sug-gests that typical bar strength in galaxy samples can be successfullyinferred from volunteer votes. N o r m . C o un t Nair: Strong0246 N o r m . C o un t Nair: Interm.0246 N o r m . C o un t Nair: Weak0.0 0.2 0.4 0.6 0.8 1.0

Bar "Strong" fraction + 0.5 "Weak" fraction N o r m . C o un t Nair: None

Figure 7.

Distributions of volunteer bar strength estimates, 𝐵 vol = 𝑓 strong + . 𝑓 weak , split by expert-classiﬁed (NA10) bar strength. Individual galaxiesare shown with rug plots (15 Strong, 110 Intermediate, 87 Weak, and 377None). Volunteer bar strength estimates increase smoothly with expert-classiﬁed bar strength, though individual galaxies vary substantially. The addition of the ‘weak bar’ answer in GZD-5 signiﬁcantlyimproves sensitivity to bars compared with previous versions of thedecision tree. Additionally, volunteer votes across the three answersmay be used to infer bar strength. We hope that the detailed bar clas-siﬁcations in our catalogue will help researchers better understandthe properties of strong and weak bars and their inﬂuence on hostgalaxies.

Galaxy Zoo data releases have previously included two post-hocmodiﬁcations to the volunteer classiﬁcations; volunteer weighting,to reduce the inﬂuence of strongly atypical volunteers, and redshiftdebiasing, to estimate the vote fractions a galaxy might have re-ceived had it been observed at a speciﬁc redshift. We describe eachmodiﬁcation below.

Volunteer weighting, as introduced in Galaxy Zoo 2 (Willett et al.2013), assigns each volunteer an aggregation weight of (initially)one, and iteratively reduces that weight for volunteers who typicallydisagree with the consensus. This method aﬀects relatively fewvolunteers and therefore causes only a small shift in vote fractions- in Galaxy Zoo 2, for example, approximately 95% of volunteershad a weighting of one (i.e. unaﬀected), 94.8% of galaxies had achange in vote fraction of no more than 0.1 for any question, andthe mean change in vote fraction across all questions and galaxieswas 0.0032.The most signiﬁcant change in ﬁnal vote fractions is causedby down-weighting rare (approx. 1%) volunteers who repeatedlydisagree with consensus by answering ‘artifact’ at implausibly highrates (including 100%) for many galaxies. Answering artifact ends

MNRAS , 1–18 (2021)

M. Walmsley et al

GZD-1/2

Reported "Artifact" Rate by Volunteer

GZD-5 V o l un t ee r s ( N > o n l y ) Figure 8.

Distribution of reported ‘artifact’ rates by volunteer (i.e. how ofteneach volunteer answered ‘artifact’ over all the galaxies they classiﬁed). Thevast majority report artifact rates consistent with those of the authors (below0.1), but a very small subset report implausibly high artifact rates ( > . ) and consequently have their classiﬁcations discarded. Only volunteers withat least 150 classiﬁcations are shown; the distribution for volunteers withfewer classiﬁcations is not bimodal. the classiﬁcation and shows the next galaxy, and so we hypothesisethat these rare volunteers are primarily interested in seeing manygalaxies rather than contributing meaningful classiﬁcations. Thereare very few such volunteers, but because answering artifact allowsclassiﬁcations to be submitted very quickly, they have an outsizeeﬀect on the aggregate vote fractions.Figure 8 shows the distribution of reported artifact rates forvolunteers with at least 150 total classiﬁcations. We expect the truefraction of artifacts to be less than 0.1, and the vast majority ofvolunteers report artifact rates consistent with this. However, thedistribution is bimodal, with a small second peak around 1.0 (i.e.volunteers reporting every galaxy as an artifact). To remove the im-plausible mode, we discard the classiﬁcations of volunteers with atleast 150 total classiﬁcations and reported artifact rates greater than0.5. In GZD-1/2, 1.1% (643) of volunteers are excluded, discarding11% (483,081) of classiﬁcations. In GZD-5, 0.03% (543) volunteersare excluded, discarding 5.3% (249,592) of classiﬁcations.We investigated the possibility of other groups of atypical vol-unteers giving similar answers across across questions by analysingthe per-user vote fractions with either a two-dimensional visualisa-tion using UMAP (McInnes et al. 2020) or with clustering usingHDBSCAN (McInnes et al. 2017). We ﬁnd no strong evidence thatsuch clusters exist. Galaxies at higher redshifts appear fainter and smaller on the sky,making it harder to detect detailed morphological features than ifthe galaxy were closer. This creates a bias in visual classiﬁcations(whether human or automated) where galaxies of the same intrinsicmorphology are less likely to be classiﬁed as having detailed featuresas redshift increases (Bamford et al. 2009). Redshift debiasing is smoothfeaturedstar/artefact cleaneddebiased edge-onface-on/inclined unbarredweak barstrong bar no spiral armsspiral arms no bulgesmall bulgemoderate bulgelarge bulgedominant bulge tight spiralmoderate spiralloose spiral rounded bulgeboxy bulgeno bulge circularin-betweencigar shaped no mergerminor disturbancemajor disturbancemerger redshift f r a c t i o n ( f > . ) Figure 9.

Number of GZD-5 galaxies with 𝑓 > . 𝑓 > .

5, 58,916 galaxies. For most questions and answers,debiasing successfully ﬂattens the redshift trends. For ‘Smooth or Featured’and ‘Bulge Prominence’, redshift debiasing overcorrects. an attempt to mitigate this bias by estimating how a galaxy wouldappear if it were at a ﬁxed low redshift (here, 𝑧 = . . < 𝑧 < .

15, approx-imately 1.5 Gyr) do not evolve signiﬁcantly for galaxies of similarintrinsic brightness and physical size, and so, for a luminosity-limited sample, any change we observe to the vote fraction distri-bution as a function of redshift is purely a consequence of imaging.If so, we can estimate the vote fractions which would be observedif each galaxy were at low redshift by modifying the vote fractionsof higher-redshift galaxies such that they have the same overall dis-tribution as their low-redshift counterparts in brightness and size.We base the debiasing on a luminosity-limited sample , selectedbetween 0 . < 𝑧 < .

15 and − . > 𝑀 𝑟 > −

23. We considerthe galaxies with at least 30 votes for the ﬁrst question (‘Smooth orFeatured’) after volunteer weighting (above), for a total of 87,617galaxies in GZD-1/2 and 58,916 galaxies in GZD-5. For each ques-tion, separately, we deﬁne a subset of galaxies to which we applythe debiasing procedure.Each subset is deﬁned using a cut of 𝑓 > . 𝑓 feat × 𝑓 not edge-on > . 𝑁 >

MNRAS000

MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release galaxy has been classiﬁed by a signiﬁcant number of people. Webin the subset of galaxies by 𝑀 𝑟 , log ( 𝑅 ) and 𝑧 for each answer inturn. We use the voronoi_2d_binning package from Cappellari& Copin (2003) to ensure the bins will have an approximately equalnumber of galaxies (with a minimum of 50). We then match votefraction distributions on a bin-by-bin basis, such that the cumulativedistribution of vote fractions at each redshift is shifted to be similarto that of the lowest redshift sample (0 . < 𝑧 < . 𝑓 feat > . comparing diﬀerentmorphological types, some of the systematic errors in the debiasingmay cancel out. Uncertainties in the debiasing will also decrease asthe sample size increases.For these reasons, we strongly suggest that users of the debiasedclassiﬁcations only use them to consider populations of galaxiesrather than individual or small samples, and to consider that theremay still be some residual trends and uncertainties that are hard tomodel with current methods. Combining citizen science with automated classiﬁcation allows usto do better science than with either alone. The clearest beneﬁt isthat automated classiﬁcation scales well with sample size. For GZ DECaLS, classifying all 311,488 suitable galaxies using volunteersalone is infeasible; collecting 40 classiﬁcations per galaxy, the stan-dard from previous Galaxy Zoo projects, would take around eightyears without further promotion eﬀorts - by which time we expectnew surveys to start. Automated classiﬁcation also evolves - as thequality of our models improves, so too will the quality of our clas-siﬁcations. And automated classiﬁcation is replicable from scratchwithout requiring a crowd - other researchers may run our open-source code and recover our classiﬁcations (within stochasticity),or create equivalent classiﬁcations for newly-imaged galaxies.Finally, and of particular relevance to researchers using thisdata release, automated classiﬁcation allows us to retroactively up-date the decision tree. Because our classiﬁer learns to make pre-dictions from GZD-5 classiﬁcations, using the improved tree withbetter detection of mergers and weak bars, we can then predict whatour volunteers would have said for the GZD-1 and GZD-2 galaxies had we been using the improved tree at that time .Our speciﬁc automated classiﬁcation approach oﬀers severalqualitative beneﬁts over previous work. First, through careful con-sideration of uncertainty, we can both learn from uncertain volun-teer responses and predict posteriors (rather than point estimates)for new galaxies. Second, by predicting the answers to every ques-tion with a single model (similarly to Dieleman et al. 2015, andunlike more recent work e.g. Domínguez Sánchez et al. 2018; Khanet al. 2019; Walmsley et al. 2020), we improve performance bysharing representations between tasks (Caruana 1997) - intuitively,knowing how to recognise spiral arms can also help you count them.Learning from every galaxy to predict every answer uses our valu-able volunteer eﬀort as eﬃciently as possible. This is particularlyeﬀective because we aim to predict detailed morphology, and hencelearn to create a detailed representation of each galaxy.

We require a model which can:(i) Learn eﬃciently from volunteer responses of varying (i.e.heteroskedastic) uncertainty.(ii) Predict posteriors for those responses on new galaxies, forevery question.In previous work (Walmsley et al. 2020) we modelled volunteerresponses as being binomially distributed and trained our model tomake maximum likelihood estimates using the loss function L = 𝑘 log 𝑓 𝑤 ( 𝑥 ) + ( 𝑁 − 𝑘 ) log ( − 𝑓 𝑤 ( 𝑥 )) (3)where, for some target question, 𝑘 is the number of responses (suc-cesses) of some target answer, 𝑁 is the total number of responses(trials) to all answers, and 𝑓 𝑤 ( 𝑥 ) = ˆ 𝜌 is the predicted probabilityof a volunteer giving that answer.This Binomial assumption, while broadly successful, brokedown for galaxies with vote fractions 𝑘𝑁 close to 0 or 1, wherethe Binomial likelihood is extremely sensitive to 𝑓 𝑤 ( 𝑥 ) , and forgalaxies where the question asked was not appropriate (e.g. predictif a featureless galaxy has a bar). Instead, in this work, the modelpredicts a distribution 𝑝 ( 𝜌 | 𝑓 𝑤 ( 𝑥 )) and 𝜌 is then drawn from thatdistribution.For binary questions, one could parametrise 𝑝 ( 𝜌 | 𝑓 𝑤 ( 𝑥 )) withthe Beta distribution (being ﬂexible and deﬁned on the unit interval),and predict the Beta distribution parameters 𝑓 𝑤 ( 𝑥 ) = ( ˆ 𝛼, ˆ 𝛽 ) byminimising MNRAS , 1–18 (2021) M. Walmsley et al L = ∫ Bin ( 𝑘 | 𝜌, 𝑁 ) Beta ( 𝜌 | 𝛼, 𝛽 ) 𝑑𝛼𝑑𝛽 (4)where the Binomial and Beta distributions are conjugate and hencethis integral can be evaluated analytically.In practice, we would like to predict the responses to questionswith more than two answers, and hence we replace each distributionwith its multivariate counterpart; Beta( 𝜌 | 𝛼, 𝛽 ) with Dirichlet( (cid:174) 𝜌 | (cid:174) 𝛼 ) ,and Binomial( 𝑘 | 𝜌, 𝑁 ) with Multinomial( (cid:174) 𝑘 | (cid:174) 𝜌, 𝑁 ). L 𝑞 = ∫ Multi ( (cid:174) 𝑘 | (cid:174) 𝜌, 𝑁 ) Dirichlet ( (cid:174) 𝜌 | (cid:174) 𝛼 ) 𝑑 (cid:174) 𝛼 (5)where (cid:174) 𝑘, (cid:174) 𝜌 and (cid:174) 𝛼 are now all vectors with one element per answer.The Dirichlet-Multinomial distribution is much more ﬂexi-ble than the Binomial, allowing our model to express uncertaintythrough wider posteriors and conﬁdence through narrower posteri-ors. We believe this is a novel approach.For the base architecture, we use the EﬃcientNet B0 model(Tan & Le 2019). The EﬃcientNet family of models includes sev-eral architectural advances over the standard convolutional neu-ral network architectures commonly used within astrophysics (e.g.Huertas-Company et al. 2015; Dieleman et al. 2015; Khan et al.2019; Cheng et al. 2020; Ferreira et al. 2020), including auto-ML-derived structure (Tan et al. 2019; He et al. 2019), depthwise con-volutions (Howard et al. 2017), bottleneck layers (Iandola et al.2017), and squeeze-and-excitation optimisation (Hu et al. 2018).The EﬃcientNet B0 model was identiﬁed using multi-objectiveneural architecture search (Tan et al. 2019), optimising for bothaccuracy and FLOPS (i.e. computational cost of prediction). Thisbalancing of accuracy and FLOPS is particularly useful for astro-physics researchers with limited access to GPU resources, leadingto a model capable of making reliable predictions on hundreds ofmillions of galaxies. In short, the architecture is similar to traditionalconvolutional neural networks, being composed of a series of con-volutional blocks of decreasing resolution and increasing channels.Each convolutional block uses mobile inverted bottleneck convolu-tions following MobileNetV2 (Sandler et al. 2018), which combinecomputationally eﬃcient depthwise convolutions with residual con-nections between bottlenecks (as opposed to residual connectionsbetween blocks with many channels, as in e.g. ResNet (He et al.2016)). EﬃcientNet B0 has 5.3 million parameters.We modify the ﬁnal EﬃcientNet B0 layer output units to givepredictions smoothly between 1 and 100 (using softmax activation),which is appropriate for Dirichlet parameters (cid:174) 𝛼 . (cid:174) 𝛼 elements below1 can lead to bimodal ‘horseshoe’ posteriors, and (cid:174) 𝛼 elements aboveapproximately 100 can lead to extremely conﬁdent predictions inextreme 𝜌 , both of which are implausible for galaxy morphologyposteriors. These constraints may cause the most extreme galaxiesto have predicted vote fractions which are slightly less extreme thanvolunteers would record, but we do not anticipate this to aﬀectpractical use; whether a galaxy is extremely likely to have a bar ormerely highly likely is rarely of scientiﬁc consequence.We would like our single model to predict the answer to everyquestion in the Galaxy Zoo tree. To do this, our architecture usesone output unit per answer (i.e. for 13 questions with a total of 20answers, we use 20 output units). We calculate the (negative log)likelihood per question (Eqn. 5), and then, treating the errors in themodel’s answers to each question as independent events, calculatethe total loss as log L = ∑︁ 𝑞 L 𝑞 ( (cid:174) 𝑘 𝑞 , 𝑁 𝑞 , (cid:174) 𝑓 𝑤𝑞 ) (6)where, for question 𝑞 , 𝑁 𝑞 is the total answers, (cid:174) 𝑘 𝑞 is the observedvotes for each answer, and (cid:174) 𝑓 𝑤𝑞 is the values of the output unitscorresponding to those answers (which we interpret as the Dirichlet (cid:174) 𝛼 parameters in Eqn. 5).We train our model using the GZD-5 volunteer classiﬁcations.Because the training set includes both active-learning-selectedgalaxies receiving at least 40 classiﬁcations and the remaining GZD-5 galaxies with around 5 classiﬁcations, it is crucial that the modelis able to learn eﬃciently from labels of varying uncertainty. Un-like Walmsley et al. (2020), which trained one model per questionand needed to ﬁlter galaxies where that question asked may not beappropriate, we can predict answers to all questions and learn fromall labelled galaxies.We train or evaluate our models using the 249,581 (98.5%)GZD-5 galaxies with at least three volunteer classiﬁcations. Learn-ing from galaxies with even fewer (one or two) classiﬁcations shouldbe possible in principle, but we do not attempt it here as we do notexpect galaxies with so few classiﬁcations to be signiﬁcantly infor-mative. The Dirichlet concentrations (distribution parameters) usedto calculate our metrics are predicted by three identically-trainedmodels, each making 5 forward passes with random dropout con-ﬁgurations and augmentations. We ensemble all 15 forward passesby simply taking the mean posterior given the total votes recorded,which may be interpreted as the posterior of an equally-weightedmixture of Dirichlet-Multinomial distributions. This mean poste-rior can then be used to calculate credible intervals (error bars)and in standard statistical analyses. We develop our approach usinga conventional 80/20 train-test split, and make a new split beforecalculating the ﬁnal metrics reported here.For the published automated classiﬁcations, where we aimsimply to make the best predictions possible rather than to testperformance, we train on all 249,581 galaxies with at least 3 votes(98.5%). We also train ﬁve rather then three models to maximiseperformance. Training each model on an NVIDIA V100 GPU takesaround 24 hours. We then make predictions (using the updatedGZD-5 schema) on all 313,789 galaxies in all campaigns. Eachprediction (forward pass) takes approx. 6ms, equating to approx.160ms for each published posterior.Starting from the galaxy images shown to volunteers (Section2.3), we take an average over channels to remove color informationand avoid biasing our morphology predictions (Walmsley et al.2020), then resize and save the images as 300x300x1 matrices. Wethen apply random augmentations when loading each image intomemory, creating a unique randomly-modiﬁed image to be used asinput to the network. We ﬁrst apply random horizontal and verticalﬂips, followed by an aliased rotation by a random angle in therange (0, 𝜋 ), with missing pixels being ﬁlled by reﬂection on theboundaries. Finally, we crop the image about a random centroid to224x224 pixels, eﬀectively zooming in slightly towards a randomoﬀ-center point. We also apply these augmentations at test time tomarginalise our posteriors over any unlearned variance. We trainusing the Adam (Kingma & Ba 2015) optimizer and a batch size of128. We end training once the model loss fails to improve for 10consecutive epochs.Code for our deep learning classiﬁer is available to the re-viewer(s) and will be made open-source on publication. MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release Our model successfully predicts posteriors for volunteer votes toeach question. We show example posteriors for a question withtwo answers, ‘Does this galaxy have spiral arms’ (Yes/No), in Fig.10, and a question with three answers, ‘Does this galaxy have abar’ (Strong/Weak/None), in Fig. 11. In Appendix A, we provide agallery of the galaxies with the highest expected vote fractions for aselection of answers, to visually demonstrate the quality of the mostconﬁdent machine classiﬁcations.To aid intuition for the typical performance, we reduce boththe vote fraction labels and the posteriors down to discrete classi-ﬁcations by rounding the vote fractions and mean posteriors to 0or 1, and calculate classiﬁcation metrics (Table 1) and confusionmatrices (Figure 13). Here and throughout this section, to removegalaxies for which the question is not relevant, we only count galax-ies where at least half the volunteers were asked that question. Wereport two sets of classiﬁcation metrics; metrics for all (relevant)galaxies, and only for galaxies where the volunteers are conﬁdent(deﬁned as having a vote fraction for one answer above 0.8, follow-ing Domínguez Sánchez et al. 2019).The performance on conﬁdent galaxies is useful to measurebecause such galaxies have a clear correct label. For such galaxies,performance is near-perfect; we achieve better than 99% accuracyfor most questions, with the lowest accuracy (for spiral arm count)being 98.6%. The confusion matrices reﬂect this, showing littlenotable confusion for any question.Reported performance on all galaxies will be lower than onconﬁdent galaxies as the correct labels are uncertain. Our measuredvote fractions are approximations of the theoretical ‘true’ vote frac-tions (as we cannot ask inﬁnitely many volunteers), and many galax-ies are genuinely ambiguous and do not have a meaningful ‘correct’answer. No classiﬁer should achieve perfect accuracy on galaxieswhere the volunteers themselves are not conﬁdent. Nonetheless, per-formance is more than suﬃcient for scientiﬁc use; accuracy rangesfrom 77.4% (spiral arm count) to 98.7% (disk edge on). We observesome moderate confusion between similar answers, particularly be-tween No or Weak bar, Moderate or Large bulges, and Two or Threespiral arms, which matches our intuition for the answers that volun-teers might confuse and so likely reﬂects ambiguity in the trainingdata. More surprisingly, there is also confusion between Two spiralarms and Can’t Tell. Figure 18 shows random examples of spiralswhere the most common volunteer answer was Two, but the classi-ﬁer predicted Can’t Tell, and vica versa. In both cases, the galaxiesgenerally have diﬀuse or otherwise subtle spiral arms embedded ina bright disk, confusing both human and machine. This highlightsthe diﬃculty in using classiﬁcation metrics to assess performanceon ambiguous galaxies.We can mitigate the ambiguity in classiﬁcations of galaxies bymeasuring regression metrics on the vote fractions, without round-ing to discrete classiﬁcations. Figure 16 shows the mean deviationsbetween the model predictions (mean posteriors) and the observedvote fractions, by question, for test set galaxies with approximately40 volunteer responses. Performance is again excellent, with thepredictions typically well within 10% of the observed vote frac-tions. Predicting spiral arm count is relatively challenging, as notedabove. Predicting answers to the ‘Merger’ question of ‘None’ (i.e.not a merger) is also challenging, perhaps because of the rarity ofcounter-examples.The volunteer vote fractions against which we compare ourpredictions are themselves uncertain for most galaxies. We aim topredict the true vote fraction, i.e. the vote fraction from lim 𝑁 →∞ p ( v o t e s ) p ( v o t e s ) p ( v o t e s ) p ( v o t e s ) "Has Spiral Arms" votes p ( v o t e s ) Figure 10.

Posteriors for ‘Does this galaxy have spiral arms?’, split byensemble model (bold colors) and, within each model, dropout forwardpasses (faded colors). The number of volunteers answering ‘Yes’ (not knownto classiﬁer) is shown with a black dashed line. Galaxies are selected atrandom from the test set, provided the spiral question is relevant (deﬁnedas a vote fraction of 0.5 or more to the preceding answer, ‘Featured’). Theimage presented to volunteers is shown to the right. The model input is acropped, downsized, greyscale version (Sec 5.1). The Dirichlet-Multinomialposteriors are strictly only deﬁned at integer votes; for visualisation only, weshow the Γ -generalised posteriors between integer votes.MNRAS , 1–18 (2021) M. Walmsley et al

Question Count Accuracy Precision Recall F1Smooth Or Featured 11346 0.9352 0.9363 0.9352 0.9356Disk Edge On 3803 0.9871 0.9871 0.9871 0.9871Has Spiral Arms 2859 0.9349 0.9364 0.9349 0.9356Bar 2859 0.8185 0.8095 0.8185 0.8110Bulge Size 2859 0.8419 0.8405 0.8419 0.8409How Rounded 6805 0.9314 0.9313 0.9314 0.9313Edge On Bulge 506 0.9111 0.9134 0.9111 0.8996Spiral Winding 1997 0.7832 0.8041 0.7832 0.7874Spiral Arm Count 1997 0.7742 0.7555 0.7742 0.7560Merging 11346 0.8798 0.8672 0.8798 0.8511(a) Classiﬁcation metrics for all galaxiesQuestion Count Accuracy Precision Recall F1Smooth Or Featured 3495 0.9997 0.9997 0.9997 0.9997Disk Edge On 3480 0.9980 0.9980 0.9980 0.9980Has Spiral Arms 2024 0.9921 0.9933 0.9921 0.9924Bar 543 0.9945 0.9964 0.9945 0.9951Bulge Size 237 1.0000 1.0000 1.0000 1.0000How Rounded 3774 0.9968 0.9968 0.9968 0.9968Edge On Bulge 258 0.9961 0.9961 0.9961 0.9961Spiral Winding 213 0.9906 1.0000 0.9906 0.9953Spiral Arm Count 659 0.9863 0.9891 0.9863 0.9871Merging 3108 0.9987 0.9987 0.9987 0.9987(b) Classiﬁcation metrics for galaxies where volunteers are conﬁdent

Table 1.

Classiﬁcation metrics on all galaxies (above) or on galaxies wherevolunteers are conﬁdent for that question (i.e. where one answer has a votefraction above 0.8). Multi-class precision, recall and F1 scores are weightedby the number of true galaxies for each answer. Classiﬁcations on conﬁdentgalaxies are near-perfect.. volunteers, but we only know the vote fraction from 𝑁 volunteers.However, 387 pre-active-learning galaxies were erroneously up-loaded twice or more, and so received more than 75 classiﬁcationseach. We can compare our predictions against these conﬁdently-known galaxies. We can also calculate the deviations from askingfewer ( 𝑁 <<

75) volunteers by artiﬁcially truncating the numberof votes collected, and ask - how many volunteer responses to thatquestion would we need to have errors similar to that of our model?Figure 17 shows the model and volunteer deviations for a represen-tative selection of questions; the model predictions are as accurateas asking that question to around 10 volunteers. The actual numberof volunteers needed to be shown that galaxy to achieve equivalentaccuracy will be higher for questions only asked given certain pre-vious answers (i.e. all but ‘Smooth or Featured?’ and ‘Merger?’),as some will give diﬀerent answers to preceding questions and sonot be asked that question.We can also measure if our posteriors correctly estimate thisuncertainty. As a qualitative test, Figure 19 shows a random se-lection of galaxies binned by ‘Smooth or Featured’ vote fractionprediction entropy, measuring the model’s uncertainty. Predictionentropy is calculated as the (discrete) Shannon entropy over allpossible combinations of votes, assuming 10 total votes for thisquestion (our results are robust to other choices of total votes). Un-usual, inclined or poorly-scaled galaxies have highly uncertain (highentropy) votes, while smooth and especially clearly featured galax-ies have conﬁdent (low entropy) votes. The most uncertain galaxies(not shown) are so poorly scaled (due to incorrect estimation of The model is, in this strict sense, slightly superhuman. the Petrosian radius in the NASA-Sloan Atlas) that they are barelyvisible. These results match our intuition and demonstrate that ourposteriors provide meaningful uncertainties.More quantitatively, Figure 20 shows the calibration of ourposteriors for the two binary questions in GZD-5 - ‘Edge-on Disk’and ‘Has Spiral Arms’. A well-calibrated posterior dominated bydata (i.e. where the prior has minimal eﬀect) will include the mea-sured value within any bounds as often as the total probabilitywithin those bounds. We calculate calibration by, for each galaxy,iterating through each symmetric highest posterior density credi-ble interval (i.e. starting from the posterior peak and moving thebounds outwards) and recording both the total probability insidethe bounds and whether the recorded volunteer vote is inside thebounds. We then group (bin) by total probability and record theempirical frequency with which the votes lie within bounds of thattotal probability. We ﬁnd that calibration is excellent. Our classiﬁeris correctly uncertain.The ultimate measure of success is whether our predictionsare useful for science. Masters et al. (2019) (hereafter M19), whichused GZ2 classiﬁcations to investigate the relationship betweenbulge size and winding angle and found - contrary to a conventionalview of the Hubble sequence - no strong correlation. We repeat thisanalysis using our (deeper) DECaLS data, using either volunteer orautomated classiﬁcation, to check if the automated classiﬁcationslead to the same science results as the volunteers.Speciﬁcally, we select a clean sample of face-on spiral galaxiesusing M19’s vote fraction cuts of 𝑓 feat > . 𝑓 not-edge-on > . 𝑓 spiral-yes > . 𝑓 merging=none > . 𝑓 odd cut, to remove galaxies with ongoingmergers or with otherwise disturbed features. For the volunteer votefractions, we can only use either GZD-1/2 or GZD-5 classiﬁcations,since the former decision tree had three bulge size answers and thelatter had ﬁve; we choose GZD-5 to beneﬁt from the added precisionof additional answers. To avoid selection eﬀects (Sec. 6.2) we onlyuse galaxies classiﬁed prior to active learning being activated. Forthe automated classiﬁcations, we use a model trained on GZD-5 topredict GZD-5 decision tree vote fractions (including the ﬁve bulgeanswers) for every GZ DECaLS galaxy (313,798). This allows us toexpand our sample size from 5,378 galaxies using GZD-5 volunteersonly to 43,672 galaxies using our automated classiﬁer.We calculate bulge size and spiral winding following Eqn. 1and 3 in M19, trivially generalising the bulge size calculation toallow for ﬁve bulge size answers: 𝑊 avg = . 𝑓 medium + . 𝑓 tight (7) 𝐵 avg = . 𝑓 small + . 𝑓 moderate + . 𝑓 large + . 𝑓 dominant (8)Both classiﬁcation methods ﬁnd no correlation between bulgesize and spiral winding, consistent with M19. Figure 21 showsthe distribution of bulge size against spiral winding using eithervolunteer predictions (fractions) or the deep learning predictions(expected fractions) for the sample of featured face-on galaxiesselected above. The distributions are indistinguishable, with theautomated method oﬀering a substantially larger (approx 8x) samplesize. We hope this demonstrates the accuracy and scientiﬁc valueof our automated classiﬁer. MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release We release two volunteer catalogues and two automated catalogues,available at https://dx.doi.org/10.5281/zenodo.4196267. gz_decals_volunteers_ab includes the volunteer classiﬁ-cations for 92,960 galaxies from GZD-1 and GZD-2. Classiﬁcationsare made using the GZD-1/2 decision tree (Fig. B1). All galaxiesreceived at least 40 classiﬁcations, and consequently have approxi-mately 30-40 after volunteer weighting (Sec. 4.3.1). This catalogueis ideal for researchers needing standard morphology measurementson a reasonably large sample, with minimal complexity. 33,124galaxies in this catalogue were also previously classiﬁed in GZ2;the GZD-1/2 classiﬁcations are better able to detect faint featuresdue to deeper DECaLS imaging, and so should be preferred. gz_decals_volunteers_c includes the volunteer classiﬁca-tions from GZD-5. Classiﬁcations are made using the improvedGZD-5 decision tree which adds more detail for bars and merg-ers (Sec. 4.2). This catalogue includes 253,286 galaxies, but eachgalaxy does not have the same number of classiﬁcations. 59,337galaxies have at least 30 classiﬁcations (after denoising), and the re-mainder have far fewer (approximately 5). The selection eﬀects forhow many classiﬁcations each galaxy receives are detailed belowin Sec. 6.2. This catalogue may be useful to researchers who pre-fer a larger sample than gz_decals_volunteers_ab at the costof more uncertainty and the introduction of selection eﬀects, orwho need detailed bar or merger measurements for a small numberof galaxies. We use gz_decals_volunteers_c to train our deeplearning classiﬁer.The automated classiﬁcations are made using our Bayesiandeep learning classiﬁer, trained on gz_decals_volunteers_c to predict the answers to the GZD-5 decision tree for allGZ DECaLS galaxies (including those in GZD-1 and GZD-2). gz_decals_auto_posteriors contains the predicted posteriorsfor each answer - speciﬁcally, the Dirichlet concentration param-eters that encode the posteriors. We hope this catalogue will behelpful to researchers analysing galaxies in Bayesian frameworks. gz_decals_auto_fractions reduces those posteriors to theautomated equivalent of previous Galaxy Zoo data releases, con-taining the expected vote fractions (mean posteriors). Note that notall vote fractions are relevant for every galaxy; we suggest assessingrelevance using the estimated fraction of volunteers that would havebeen asked each question, which we also include. We hope this cat-alogue will be useful to researchers seeking detailed morphologyclassiﬁcations on the largest possible sample, who might beneﬁtfrom error bars but do not need full posteriors.We also release Jupyter notebooks showing how to use eachcatalogue on (full link on publication). Thesedemonstrate how to load and query each catalogue with pandas (McKinney 2010), and how to create callable posteriors from theDirichlet concentration parameters.The automated catalogues may be interactivelyexplored at https://share.streamlit.io/mwalmsley/galaxy-poster/gz_decals_mike_walmsley.py.

The GZD-1/2 catalogue reports at least 40 classiﬁcations for allgalaxies imaged by DECaLS DR1/2 and passing the appropriateselection cuts (Section 2.2). Additional classiﬁcations above 40are assigned independently of the galaxy properties. The selection function for total classiﬁcations in the GZD-5 catalogue is morecomplex. In practice, if you require a strictly random sample ofGZD-5 galaxies with more than ﬁve volunteer classiﬁcations, youshould exclude galaxies where ‘random_selection’ is False. You mayalso consider using the posteriors from our deep learning classiﬁer,which are comparable across all GZ DECaLS galaxies (Section 5).Below, we describe the GZD-5 total classiﬁcation selection eﬀects.Early galaxies were initially uploaded row-by-row from theNASA-Sloan Atlas, each (eventually) receiving 40 classiﬁcations.We also uploaded two additional subsets. For the ﬁrst, 1355 galax-ies were targeted for classiﬁcation to support an external researchproject. Of these, 1145 would have otherwise received ﬁve clas-siﬁcations. These 1145 galaxies with additional classiﬁcations areidentiﬁed with the ‘targeted’ group and should be excluded. For thesecond, we reclassiﬁed the 1497 galaxies classiﬁed in both GZD-1/2 and the Nair & Abraham (2010) expert visual morphologyclassiﬁcation catalogue to measure the eﬀect of our new decisiontree (results are shown in Sec. 4.2). Both the GZD-1/2 and GZD-5classiﬁcations are reported in the respective catalogues (Section 6.Similarly to the targeted galaxies, 651 of these calibration galaxieswould have otherwise received ﬁve classiﬁcations, are identiﬁedwith the ‘calibration’ group, and should be excluded.We then implemented active learning (Sec 3.1), prioritising6,939 galaxies from the remaining pool of 199,496 galaxies notyet uploaded. The galaxies are identiﬁed with the groups ‘ac-tive_priority’ (selected for 40 classiﬁcations) and ‘active_baseline’(the remainder). For a strictly random selection, both groups shouldbe excluded, leaving the galaxies classiﬁed prior to the introductionof active learning.Finally, we note that 14,960 (5.9%) of GZD-5 galaxies receivedmore than 40 classiﬁcations due to being erroneously uploaded morethan once. The images are identical and so we report the aggregateclassiﬁcations across all uploads of the same galaxy.

The most appropriate usage of the Galaxy Zoo DECaLS vote frac-tions depends on the speciﬁc science case. Many galaxies have am-biguous vote fractions (e.g. roughly similar vote fractions for bothdisk and elliptical morphologies) because of observational limita-tions like image resolution, or because the galaxy morphology istruly in-between the available answers (perhaps because the galaxyhas an unusual feature such as polar rings, Moiseev et al. 2011, orbecause the galaxy is undergoing a morphological transition). Tomake best use of such galaxies, we recommend that, where possible,readers use the vote fractions as statistical weights in their analysis.For example, when investigating the diﬀerences in the stellar massdistributions of elliptical and disk galaxies, the disk (elliptical) votefractions can be used as weights when plotting the distributions,resulting in the galaxies with the highest vote fraction for disk(elliptical) morphology dominating the resulting distribution. Thisensures that each galaxy contributes to the analysis, without exclud-ing galaxies with ambiguous vote fractions. For examples of usingvote fractions as weights, see Smethurst et al. (2015) and Masterset al. (2019).Using the vote fractions as weights is not appropriate for allscience cases. For example, if galaxies of a particular morphologyneed to be isolated to form a sample for observational follow-up (e.g.overlapping pairs, see Keel et al. 2013, and ‘bulgeless’ galaxies, seeSimmons et al. 2017a; Smethurst et al. 2019), or if the fraction ofa certain morphological type of galaxy is to be calculated (e.g. barfraction, see Simmons et al. 2014). These science cases require a

MNRAS , 1–18 (2021) M. Walmsley et al cut on the appropriate vote fraction to be chosen. However, readersshould be aware that making cuts on the vote fractions is a crudemethod to identify galaxies of certain morphologies and will resultin an incomplete sample.Table 2 shows our suggested cuts for populations of commoninterest, based on visual inspection by the authors and chosen forhigh speciﬁcity (low contamination) at the cost of low sensitivity(completeness). We urge the reader to adjust these cuts to suit thesensitivity and speciﬁcity of their science case, to add additionalcuts to better select their desired population, and to make their ownvisual inspection to verify the selected population is as intended.For a full analysis, we once again suggest the reader avoid cuts byappropriately weighting ambiguous galaxies, or take advantage ofthe posteriors provided by our automated classiﬁer.

What does a classiﬁcation mean? The comparison of GZ2 and GZDECaLS images (Fig. 5) highlights that our classiﬁcations aim tocharacterise the clear features of an image, and not what an expertmight infer from that image. For example, volunteers might see animage of a galaxy which is broadly smooth, and so answer smooth,even though our astronomical understanding might suggest that thefaint features around the galaxy core are likely indicative of spiralarms that would be revealed given deeper images. This situationoccurs in several galaxies in Fig. 5. These ‘raw’ classiﬁcations willbe most appropriate for researchers working on computer vision oron particularly low-redshift, well-resolved galaxies. The redshift-debiased classiﬁcations, which are eﬀectively an estimate of galaxyfeatures not clearly seen in the image, will be most appropriatefor researchers especially interested in fainter features or studyinglinks between our estimated intrinsic visual morphologies and othergalaxy properties.We showed in Sec. 4.2 that changing the answers availableto volunteers signiﬁcantly improves our ability to identify weakbars. This highlights that our classiﬁcations are only deﬁned inthe context of the answers presented. One cannot straightforwardlycompare classiﬁcations made using diﬀerent decision trees. Ourscientiﬁc interests and our understanding of volunteers both evolve,and so our decision trees must also evolve to match them. However,only the last few years of volunteer classiﬁcations will use the latestdecision tree (based on previous data releases), placing an upperlimit on the number of galaxies with compatible classiﬁcations atany one time. Our automated classiﬁer resolves this here by allowingus to retrospectively apply the GZD-5 decision tree (with betterweak bar detection, among other changes) to galaxies only classiﬁedby volunteers in GZD-1 and GZD-2. This ﬂexibility ensures thatGalaxy Zoo will remain able to answer the most pertinent researchquestions at scale.We have shown (5.2) that our automated classiﬁer is generallyhighly accurate, well-calibrated, and leads to at least one equiv-alent science result. However, we cannot exclude the possibilityof unexpected systematic biases or of adversarial behaviour fromparticular images. Avoiding subtle biases and detecting overconﬁ-dence on out-of-distribution data remain open computer science re-search questions, often driven by important terrestrial applications(Szegedy et al. 2014; Hendrycks & Gimpel 2017; Eykholt et al.2018; Smith & Gal 2018; Geirhos et al. 2019; Ren et al. 2019; Yanget al. 2020; Margalef-Bentabol et al. 2020). Volunteers also havebiases (e.g a slight preference for recognising left-handed spirals,Land et al. 2008) and struggle with images of an adversarial nature (e.g. confusing edge-on disks with cigar-shaped ellipticals), thoughthese can often be discovered and resolved through discussion withthe community and by adapting the website.We believe the future of morphology classiﬁcation is in thethoughtful combination of volunteers and machine learning. Suchcombinations will be more than just faster; they will be replicable,uniform, error-bounded, and quick to adapt to new tasks. They willlet us ask new questions - draw the spiral arms, select the barlength, separate the merging galaxies pixelwise - which would beinfeasible with volunteers alone for all but the smallest samples (e.g.Lingard et al. 2020). And they will ﬁnd the interesting, unusual andunexpected galaxies which challenge our understanding and inspirenew research directions.The best combination of volunteer and machine is unclear. Ourexperiment with active learning is one possible approach, but (whencompared to random selection) suﬀers from complexity to imple-ment, an unknown selection function, and no guarantee - or evenclear ﬁnal measurement of - an improvement in model performance.Many other approaches are suggested in astrophysics (Wright et al.2017; Beck et al. 2018; Wright et al. 2019; Dickinson et al. 2019;Martin et al. 2020; Lochner & Bassett 2020) and in citizen scienceand human-computer interaction more broadly (Chang et al. 2017;Wilder et al. 2020; Liu et al. 2020; Bansal et al. 2019). We willcontinue to search for and experiment with strategies to create themost eﬀective contribution to research by volunteers.

We have presented Galaxy Zoo DECaLS; detailed galaxy mor-phology classiﬁcations for 311,488 galaxies imaged by DECaLSDR5 and within the SDSS DR11 footprint. Classiﬁcations werecollected from volunteers on the Zooniverse citizen science plat-form over three campaigns, GZD-1, GZD-2, and GZD-5, whereGZD-5 used an improved decision tree leading to better identi-ﬁcation of weak bars and mergers. All galaxies receive at leastﬁve volunteer classiﬁcations; galaxies in GZD-1 and GZD-2 re-ceive at least 40, while in GZD-5 only a prioritised subset receive40. Volunteer classiﬁcations are then used to train a deep learn-ing classiﬁer to classify all galaxies. This classiﬁer is able to bothlearn from uncertain volunteer responses and predict full posteri-ors, rather than point estimates, for what volunteers would havesaid. We show that the deep learning classiﬁer is accurate and well-calibrated. We release both volunteer and automated classiﬁcationsat at https://dx.doi.org/10.5281/zenodo.4196267.

APPENDIX A: GALAXIES WITH CONFIDENTAUTOMATED CLASSIFICATIONS

To intuitively demonstrate the performance of our automated clas-siﬁer, we show, for a selection of detailed morphology questions,the galaxies with the most conﬁdent automated classiﬁcations forthat question. We show the galaxies with the highest mean posteriorfor being strongly barred (Fig. A1), edge-on and bulgeless (Fig A2),one-armed spirals (Fig. A3), loosely wound spirals (Fig. A4) andmergers (Fig. A5). We present the galaxies here as shown to GalaxyZoo volunteers (in color and at 424x424 pixel resolution), but themodel makes predictions on more challenging greyscale 224x224pixel images (see Sec. 5.1).

MNRAS000

MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release Population Approx. Cut Q. Votes NotesFeatured Disk featured > . featured > . smooth > . strong bar > . weak bar > . strong bar + weak bar > . merger > . merger > . major disturb. > . minor disturb. > . Table 2.

Suggested cuts for rough identiﬁcation of galaxy populations, based on visual inspection by the authors. Q. votes is the minimum number of totalvotes for that question; for example, to identify strong bars, require at least 20 total votes to the question ‘Does this galaxy have a bar?’. This ensures enoughvotes to calculate reliable vote fractions.

APPENDIX B: GZD-1/2 DECISION TREE

Figure B1 shows the Galaxy Zoo decision tree used for the earlierGZD-1 and GZD-2 DECaLS campaigns. This tree is based on thetree used for Galaxy Zoo 2 (Willett et al. 2013) with three mod-iﬁcations; the ‘Can’t Tell’ answer to ‘How many spiral arms arethere?’ was removed, the number of answers to ‘How prominent isthe central bulge?’ was reduced from four to three, and ‘Is the galaxycurrently merging, or is there any sign of tidal debris?’ was addedas a standalone question. Please see Sec. 3.2 for a full discussion.

APPENDIX C: CATALOGUE SAMPLE ROWS

Tables C1 and C2 present sample rows from the volunteer andautomated morphology catalogues respectively. The volunteer datashown is from GZD-5; the GZD-1/2 catalogue follows an equivalentschema. For brevity, we show only columns for a single question(‘Bar’) and a single answer (‘Weak’); other questions and answersfollow an identical pattern. A full description of all columns isavailable on data.galaxyzoo.org.

ACKNOWLEDGEMENTS

The data in this paper are the result of the eﬀorts of theGalaxy Zoo volunteers, without whom none of this work wouldbe possible. Their eﬀorts are individually acknowledged athttp://authors.galaxyzoo.org. We would also like to thank our vol-unteer translators; Mei-Yin Chou, Antonia Fernández Figueroa, Ro-drigo Freitas, Na’ama Hallakoun, Lauren Huang, Alvaro Menduina,Beatriz Mingo, Verónica Motta, João Retrê, and Erik Rosenberg.We would like to thank Dustin Lang for creating the lega-cysurvey.org cutout service and for contributing image processingcode. We also thank Sugata Kaviraj and Matthew Hopkins for help-ful discussions.MW acknowledges funding from the Science and TechnologyFunding Council (STFC) Grant Code ST/R505006/1. We also ac-knowledge support from STFC under grant ST/N003179/1.RJS acknowledges funding from Christ Church, University ofOxford.LF acknowledges partial support from US National Science Foundation award OAC 1835530; VM and LF acknowledge partialsupport from NSF AST 1716602.This research made use of the open-source Python scientiﬁccomputing ecosystem, including SciPy (Jones et al. 2001), Mat-plotlib (Hunter 2007), scikit-learn (Pedregosa et al. 2011), scikit-image (van der Walt et al. 2014) and Pandas (McKinney 2010).This research made use of Astropy, a community-developedcore Python package for Astronomy (The Astropy Collaborationet al. 2018).This research made use of TensorFlow (Abadi & et. al 2015).The Legacy Surveys consist of three individual and complementaryprojects: the Dark Energy Camera Legacy Survey (DECaLS; NSF’sOIR Lab Proposal ID

MNRAS , 1–18 (2021) M. Walmsley et al iauname ra dec bar_total-votes bar_weak bar_weak_fraction bar_weak_debiasedJ112953.88-000427.4 172.47 -0.07 16 1 0.06 0.15J104325.29+190335.0 160.86 19.06 2 0 0.00 0.00J104629.54+115415.1 161.62 11.90 4 2 0.50 -J082950.68+125621.8 127.46 12.94 0 0 - -J122056.00-015022.0 185.23 -1.84 3 0 0.00 -

Table C1.

Sample of GZD-5 volunteer classiﬁcations, with illustrative subset of columns. Columns: ‘iauname’ galaxy identiﬁer from NASA-Sloan Atlas; RAand Dec, similarly; ‘Bar’ question total votes for all answers; ‘Bar’ question votes for ‘Weak’ answer; fraction of ‘Bar’ question votes for ‘Weak’ answer;estimated fraction after applying redshift debiasing (Sec. 4.3.2). Other questions and answers follow the same pattern (not shown for brevity).iauname RA Dec bar_preceding-fraction bar_weak_concentrations bar_weak_fraction_mlJ112953.88-000427.4 172.47 -0.07 0.14 [6.158, 5.0723, 5.4842, ... 0.09J104325.29+190335.0 160.86 19.06 0.13 [4.3723, 4.5933, 4.8582... 0.07J100927.56+071112.4 152.36 7.19 0.58 [9.3129, 10.3911, 8.4791... 0.40J143254.45+034938.1 218.23 3.83 0.55 [13.2981, 12.2639, 8.8957... 0.26J135942.73+010637.3 209.93 1.11 0.77 [15.6247, 15.6893, 14.72.... 0.28

Table C2.

Sample of automated classiﬁcations (GZD-5 schema), with illustrative subset of columns. Columns: ‘iauname’ galaxy identiﬁer from NASA-SloanAtlas; RA and Dec, similarly; predicted vote fraction to the answer preceding‘Bar’ (‘Disk Edge On = No’), for estimating relevance; Dirichlet concentrationsdeﬁning the predicted posterior for the ‘Bar’ question and ‘Weak’ answer (see Sec. 5); predicted fraction of ‘Bar’ question votes for the ‘Weak’ answer derivedfrom those concentrations. Other questions and answers follow the same pattern (not shown for brevity).

Carlos Chagas Filho de Amparo a Pesquisa do Estado do Rio deJaneiro, Conselho Nacional de Desenvolvimento Cientiﬁco e Tec-nologico and the Ministerio da Ciencia, Tecnologia e Inovacao, theDeutsche Forschungsgemeinschaft and the Collaborating Institu-tions in the Dark Energy Survey. The Collaborating Institutions areArgonne National Laboratory, the University of California at SantaCruz, the University of Cambridge, Centro de Investigaciones Ener-geticas, Medioambientales y Tecnologicas-Madrid, the Universityof Chicago, University College London, the DES-Brazil Consor-tium, the University of Edinburgh, the Eidgenossische TechnischeHochschule (ETH) Zurich, Fermi National Accelerator Laboratory,the University of Illinois at Urbana-Champaign, the Institut de Cien-cies de l’Espai (IEEC/CSIC), the Institut de Fisica d’Altes Energies,Lawrence Berkeley National Laboratory, the Ludwig-MaximiliansUniversitat Munchen and the associated Excellence Cluster Uni-verse, the University of Michigan, the National Optical AstronomyObservatory, the University of Nottingham, the Ohio State Univer-sity, the University of Pennsylvania, the University of Portsmouth,SLAC National Accelerator Laboratory, Stanford University, theUniversity of Sussex, and Texas A&M University.BASS is a key project of the Telescope Access Program (TAP),which has been funded by the National Astronomical Observatoriesof China, the Chinese Academy of Sciences (the Strategic PriorityResearch Program "The Emergence of Cosmological Structures"Grant

DATA AVAILABILITY

The data underlying this article are available via Zenodo athttps://dx.doi.org/10.5281/zenodo.4196267. Any future data up-dates will be released using DOI versioning. The code underlyingthis article will be available on Github on publication.

REFERENCES

Abadi M., et. al ., 2015, TensorFlow: Large-Scale Machine Learning on Het-erogeneous Distributed Systems,

Abazajian K. N., et al., 2009, The Astrophysical Journal Supplement Series,182, 543Aihara H., et al., 2011, The Astrophysical Journal Supplement Series, 193,29Albareti F. D., et al., 2017, The Astrophysical Journal Supplement Series,233, 25Alexander D. M., Hickox R. C., 2012, New Astronomy Reviews, 56, 93Bamford S. P., et al., 2009, Monthly Notices of the Royal AstronomicalSociety, 393, 1324Bansal G., Nushi B., Kamar E., Lasecki W. S., Weld D. S., Horvitz E.,2019, Proceedings of the AAAI Conference on Human Computationand Crowdsourcing, 7, 19Beck M. R., et al., 2018, Monthly Notices of the Royal Astronomical Society,476, 5516Bluck A. F., Trevor Mendel J., Ellison S. L., Moreno J., Simard L., PattonD. R., Starkenburg E., 2014, Monthly Notices of the Royal AstronomicalSociety, 441, 599Brooks A., Christensen C., 2015, in , Vol. 418, Galactic Bulges. pp 317–353,doi:10.1007/978-3-319-19378-6_12Cappellari M., Copin Y., 2003, Monthly Notices of the Royal AstronomicalSociety, 342, 345Cardamone C., et al., 2009, Monthly Notices of the Royal AstronomicalSociety, 399, 1191Caruana R., 1997, Machine Learning, 28, 41 MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release Casteels K. R. V., et al., 2014, Monthly Notices of the Royal AstronomicalSociety, 445, 1157Chang J. C. J. C., Amershi S., Kamar E., 2017, Conference on HumanFactors in Computing Systems - Proceedings, 2017-May, 2334Cheng T. Y., et al., 2020, Monthly Notices of the Royal Astronomical Society,493, 4209Dey A., et al., 2018, eprint arXiv:1804.08657Dickinson H., Fortson L., Scarlata C., Beck M., Walmsley M., 2019, Pro-ceedings of the International Astronomical UnionDieleman S., Willett K. W., Dambre J., 2015, Monthly Notices of the RoyalAstronomical Society, 450, 1441Domínguez Sánchez H., et al., 2018, Monthly Notices of the Royal Astro-nomical Society, 476, 3661Domínguez Sánchez H., et al., 2019, Monthly Notices of the Royal Astro-nomical Society, 484, 93Eykholt K., et al., 2018, in Conference on Computer Vision and PatternRecognition. http://arxiv.org/abs/1707.08945

Fang J. J., Faber S. M., Koo D. C., Dekel A., 2013, Astrophysical Journal,776, 63Ferreira L., Conselice C. J., Duncan K., Cheng T.-Y., Griﬃths A., WhitneyA., 2020, The Astrophysical Journal, 895, 115Flaugher B., et al., 2015, Astronomical Journal, 150, 150Fontanot F., de Lucia G., Wilman D., Monaco P., 2011, Monthly Notices ofthe Royal Astronomical Society, 416, 409Gal Y., 2016, PhD thesis, University of CambridgeGeirhos R., Michaelis C., Wichmann F. A., Rubisch P., Bethge M., BrendelW., 2019, in 7th International Conference on Learning Representa-tions, ICLR 2019. International Conference on Learning Representa-tions, ICLR, http://arxiv.org/abs/1811.12231

Hart R. E., et al., 2016, Monthly Notices of the Royal Astronomical Society,461, 3663Hart R. E., Bamford S. P., Casteels K. R. V., Kruk S. J., Lintott C. J., MastersK. L., 2017, Monthly Notices of the Royal Astronomical Society, 468,1850He K., Zhang X., Ren S., Sun J., 2016, in Proceedings of theIEEE Computer Society Conference on Computer Vision andPattern Recognition. No. 3. IEEE Computer Society, pp 770–778, doi:10.1109/CVPR.2016.90, http://arxiv.org/pdf/1512.03385v1.pdfhttp://arxiv.org/abs/1512.03385

He X., Zhao K., Chu X., 2019, Arxiv preprintHendrycks D., Gimpel K., 2017, in International Conference on LearningRepresentations. http://arxiv.org/abs/1610.02136

Hopkins P. F., Cox T. J., Younger J. D., Hernquist L., 2009, AstrophysicalJournal, 691, 1168Hopkins P. F., et al., 2010, Astrophysical Journal, 715, 202Houlsby N., 2014, PhD thesis, doi:10.1007/BF03167379, http://ezproxy.nottingham.ac.uk/login?url=http://search.proquest.com/docview/1779546086?accountid=8018%5Cnhttp://sfx.nottingham.ac.uk/sfx_local/?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&genre=dissertations+&+theses&sid=ProQ:ProQue

Howard A. G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T.,Andreetto M., Adam H., 2017, Arxiv preprintHu J., Shen L., Sun G., 2018, in Proceedings of the IEEE Computer SocietyConference on Computer Vision and Pattern Recognition. IEEE Com-puter Society, pp 7132–7141, doi:10.1109/CVPR.2018.00745, http://arxiv.org/abs/1709.01507

Huertas-Company M., et al., 2015, Astrophysical Journal, Supplement Se-ries, 221, 8Hunter J. D., 2007, Computing in Science and Engineering, 9, 99Iandola F. N., Han S., Moskewicz M. W., Ashraf K., Dally W. J., KeutzerK., 2017, International Conference on Learning RepresentationsJogee S., Scoville N. Z., Kenney J. D. P., 2005, The Astrophysical Journal,Volume 630, Issue 2, pp. 837-863., 630, 837Jones E., Oliphant T., Pearu P., Others 2001, SciPy: Open Source ScientiﬁcTools for Python,

Kaviraj S., 2013, Monthly Notices of the Royal Astronomical Society: Let- ters, 437Kaviraj S., 2014, Monthly Notices of the Royal Astronomical Society, 440,2944Keel W. C., Manning A. M., Holwerda B. W., Mezzoprete M., LintottC. J., Schawinski K., Gay P., Masters K. L., 2013, Publications of theAstronomical Society of the Paciﬁc, 125, 2Keel W. C., et al., 2015, Astronomical Journal, 149, 155Kemenade H. v., et al., 2020, python-pillow/Pillow 7.1.2,doi:10.5281/ZENODO.3766443Khan A., Huerta E. A., Wang S., Gruendl R., Jennings E., Zheng H., 2019,Physics Letters, Section B: Nuclear, Elementary Particle and High-Energy Physics, 795, 248Kingma D. P., Ba J. L., 2015, in 3rd International Conference on LearningRepresentations, ICLR 2015 - Conference Track Proceedings. Interna-tional Conference on Learning Representations, ICLRKruk S. J., et al., 2017, Monthly Notices of the Royal Astronomical Society,469, 3363Kruk S. J., et al., 2018, Monthly Notices of the Royal Astronomical Society,473, 4731Lanczos C., 1938, Journal of Mathematics and Physics, 17, 123Land K., et al., 2008, Monthly Notices of the Royal Astronomical Society,388, 1686Lin L., et al., 2019, The Astrophysical Journal, 872, 50Lingard T. K., et al., 2020, arXivLintott C. J., et al., 2008, Monthly Notices of the Royal Astronomical Society,389, 1179Lintott C. J., et al., 2009, Monthly Notices of the Royal Astronomical Society,399, 129Liu A., Guerra S., Fung I., Matute G., Kamar E., Lasecki W., 2020, inThe Web Conference 2020 - Proceedings of the World Wide WebConference, WWW 2020. Association for Computing Machinery, Inc,New York, NY, USA, pp 2432–2442, doi:10.1145/3366423.3380306, https://dl.acm.org/doi/10.1145/3366423.3380306

Lochner M., Bassett B. A., 2020, arXivLofthouse E. K., Kaviraj S., Conselice C. J., Mortlock A., Hartley W., 2017,Monthly Notices of the Royal Astronomical Society, 465, 2895López-Sanjuan C., Balcells M., Pérez-González P. G., Barro G., Gallego J.,Zamorano J., 2010, Astronomy and Astrophysics, 518, A20Lotz J. M., Jonsson P., Cox T. J., Croton D., Primack J. R., Somerville R. S.,Stewart K., 2011, Astrophysical Journal, 742, 103Margalef-Bentabol B., Huertas-Company M., Charnock T., Margalef-Bentabol C., Bernardi M., Dubois Y., Storey-Fisher K., Zanisi L.,2020, Detecting outliers in astronomical images with deep generativenetworks., doi:10.1093/mnras/staa1647, https://arxiv.org/abs/2003.08263

Martin G., Kaviraj S., Devriendt J. E., Dubois Y., Pichon C., 2018, MonthlyNotices of the Royal Astronomical Society, 480, 2266Martin G., Kaviraj S., Hocking A., Read S. C., Geach J. E., 2020, MonthlyNotices of the Royal Astronomical Society, 491, 1408Masters K. L., 2019, Proceedings of the International Astronomical Union,14, 205Masters K. L., et al., 2010, Monthly Notices of the Royal AstronomicalSociety, 404, 792Masters K. L., et al., 2011, Monthly Notices of the Royal AstronomicalSociety, 411, 2026Masters K. L., et al., 2012, Monthly Notices of the Royal AstronomicalSociety, 424, 2180Masters K. L., et al., 2019, Monthly Notices of the Royal AstronomicalSociety, 487, 1808McInnes L., Healy J., Astels S., 2017, The Journal of Open Source Software,2, 205McInnes L., Healy J., Melville J., 2020, Arxiv preprintMcKinney W., 2010, Data Structures for Statistical Comput-ing in Python, http://conference.scipy.org/proceedings/scipy2010/mckinney.html

Moiseev A. V., Smirnova K. I., Smirnova A. A., Reshetnikov V. P., 2011,Monthly Notices of the Royal Astronomical Society, 418, 244Nair P. B., Abraham R. G., 2010, The Astrophysical Journal SupplementMNRAS , 1–18 (2021) M. Walmsley et al

Series, 186, 427Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825Ren J., et al., 2019, Technical report, Likelihood Ratios for Out-of-Distribution DetectionSakamoto K., Okumura S. K., Ishizuki S., Scoville2 N. Z., 1999,Technical Report 2, Bar-driven Transport of Molecular Gas toGalactic Centers and its Consequences, http://stacks.iop.org/0004-637X/525/i=2/a=691http://iopscience.iop.org/article/10.1086/307910/pdf , doi:10.1086/307910. , http://stacks.iop.org/0004-637X/525/i=2/a=691http://iopscience.iop.org/article/10.1086/307910/pdf

Sandler M., Howard A., Zhu M., Zhmoginov A., Chen L. C., 2018,in Proceedings of the IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition. IEEE Computer Society,pp 4510–4520, doi:10.1109/CVPR.2018.00474, http://arxiv.org/abs/1801.04381

Sheth K., Blain A. W., Kneib J.-P., Frayer D. T., van der Werf P. P., KnudsenK. K., 2004, The Astrophysical Journal, 614, L5Simmons B., et al., 2013, Monthly Notices of the Royal Astronomical Soci-ety, 429, 2199Simmons B. D., et al., 2014, Monthly Notices of the Royal AstronomicalSociety, 445, 3466Simmons B. D., Smethurst R. J., Lintott C., 2017a, Monthly Notices of theRoyal Astronomical Society, 12, 1Simmons B. D., et al., 2017b, Monthly Notices of the Royal AstronomicalSociety, 464, 4420Skibba R. A., et al., 2012, Monthly Notices of the Royal AstronomicalSociety, 423, 1485Smethurst R. J., et al., 2015, Monthly Notices of the Royal AstronomicalSociety, 450, 435Smethurst R. J., Simmons B. D., Lintott C. J., Shanahan J., 2019, MonthlyNotices of the Royal Astronomical Society, 489, 4016Smith L., Gal Y., 2018, Arxiv preprintSpindler A., et al., 2017, Monthly Notices of the Royal Astronomical Society,23, 1Szegedy C., Zaremba W., Sutskever I., Bruna J., Erhan D., Goodfellow I.,Fergus R., 2014, in 2nd International Conference on Learning Repre-sentations, ICLR 2014 - Conference Track Proceedings. InternationalConference on Learning Representations, ICLRTan M., Le Q. V., 2019, in 36th International Conference on Machine Learn-ing, ICML 2019. pp 10691–10700, http://arxiv.org/abs/1905.11946

Tan M., Chen B., Pang R., Vasudevan V., Sandler M., Howard A., Le Q. V.,2019, Proceedings of the IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition, 2019-June, 2815The Astropy Collaboration et al., 2018, The Astronomical Journal, 156, 123The Dark Energy Survey Collaboration 2005, The Dark Energy Survey, http://arxiv.org/abs/astro-ph/0510346

The Zooniverse Team 2020, zooniverse/panoptes: Zooniverse API to sup-port user deﬁned volunteer research projects, https://github.com/zooniverse/panoptes

Tojeiro R., et al., 2013, Monthly Notices of the Royal Astronomical Society,432, 359Walmsley M., et al., 2020, Monthly Notices of the Royal AstronomicalSociety, 491, 1554Wang J., et al., 2011, Monthly Notices of the Royal Astronomical Society,413, 1373Wilder B., Horvitz E., Kamar E., 2020, in Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence. Inter-national Joint Conferences on Artiﬁcial Intelligence, pp 1526–1533,doi:10.24963/ĳcai.2020/212Willett K. W., et al., 2013, Monthly Notices of the Royal AstronomicalSociety, 435, 2835Willett K. W., et al., 2015, Monthly Notices of the Royal AstronomicalSociety, 449, 820Willett K. W., et al., 2017, Monthly Notices of the Royal AstronomicalSociety, 464, 4176Wright D. E., et al., 2017, Monthly Notices of the Royal Astronomical Society, 472, 1315Wright D. E., Fortson L., Lintott C., Laraia M., Walmsley M., 2019, ACMTransactions on Social Computing, 2, 1Yang K., Qinami K., Fei-Fei L., Deng J., Russakovsky O., 2020, in FAT*2020 - Proceedings of the 2020 Conference on Fairness, Accountability,and Transparency. Association for Computing Machinery, Inc, pp 547–558, doi:10.1145/3351095.3375709, http://arxiv.org/abs/1912.07726http://dx.doi.org/10.1145/3351095.3375709

York D. G., et al., 2000, The Astronomical Journal, 120, 1579van der Walt S., Schönberger J. L., Nunez-Iglesias J., Boulogne F., WarnerJ. D., Yager N., Gouillart E., Yu T., 2014, PeerJ, 2, e453This paper has been typeset from a TEX/L A TEX ﬁle prepared by the author.MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release Strong Bar W e a k B a r Strong Bar W e a k B a r Strong Bar W e a k B a r Strong Bar W e a k B a r Strong Bar W e a k B a r Figure 11.

Posteriors for ‘Does this galaxy have a bar?’, for the same randomgalaxies selected in Fig. 10. Each point is colored by the predicted probabilityof volunteers giving that many ‘Strong’, ‘Weak’, and (implicitly, as the totalanswers is ﬁxed) ‘None’ votes. The volunteer answer (not known to classiﬁer)is circled. For clarity, only the mean posterior across all models and dropoutforward passes is shown.MNRAS , 1–18 (2021) M. Walmsley et al

Figure 12.

All Galaxies High Volunteer Conﬁdence(a) ‘Smooth or Featured’(b) Edge On Disk(c) ‘Has Spiral Arms’(d) ‘Spiral Arm Count’(e) ‘Spiral Arm Winding’

Figure 13.

Confusion matrices for each question, made on the test set of 49,700 randomly-selected galaxies with at least three volunteer votes. Discreteclassiﬁcations are made by rounding the vote fraction (label) and mean posterior (prediction) to the nearest integer. The matrices then show the counts ofrounded predictions (x axis) against rounded labels (y axis). We report confusion matrices for all 49,700 galaxies (left) or only for galaxies where the volunteersare conﬁdent in that question, deﬁned as having the vote fraction for one answer above 0.8 (right). Such conﬁdent galaxies are expected to have a clearly correctlabel, making correct and incorrect predictions straightforward to measure but also making the classiﬁcation task easier. Continued below.MNRAS000

All Galaxies High Volunteer Conﬁdence(f) ‘Bar’(g) ‘Bulge Size’(h) ‘Merging’(i) ‘Edge On Bulge Shape’(j) Edge On Bulge Ellipticity

Figure 15.

Confusion matrices, continued from above. To avoid the loss of information from rounding, we encourage researchers not to treat Galaxy Zooclassiﬁcations as discrete, and instead to use the full vote fractions or posteriors where possible.MNRAS , 1–18 (2021) M. Walmsley et al

Vote Fraction Mean Deviation smooth-or-featured_smoothsmooth-or-featured_featured-or-disksmooth-or-featured_artifactdisk-edge-on_yesdisk-edge-on_nohas-spiral-arms_yeshas-spiral-arms_nobar_strongbar_weakbar_nobulge-size_dominantbulge-size_largebulge-size_moderatebulge-size_smallbulge-size_nonehow-rounded_roundhow-rounded_in-betweenhow-rounded_cigar-shapededge-on-bulge_boxyedge-on-bulge_noneedge-on-bulge_roundedspiral-winding_tightspiral-winding_mediumspiral-winding_loosespiral-arm-count_1spiral-arm-count_2spiral-arm-count_3spiral-arm-count_4spiral-arm-count_more-than-4spiral-arm-count_cant-tellmerging_nonemerging_minor-disturbancemerging_major-disturbancemerging_merger

Figure 16.

Mean absolute deviations between the model predictions and theobserved vote fractions, by question, for the test set galaxies with approxi-mately 40 volunteer responses. The model is typically well within 10% ofthe observed vote fractions.

Truncated number of votes M e a n e rr o r v s . a ll v o t e s Smooth Or Featured

SmoothFeatured-or-diskArtifact0 5 10 15 20

Truncated number of votes M e a n e rr o r v s . a ll v o t e s Bar

StrongWeakNo0 5 10 15 20

Truncated number of votes M e a n e rr o r v s . a ll v o t e s Has Spiral Arms

Yes0 5 10 15 20

Truncated number of votes M e a n e rr o r v s . a ll v o t e s Bulge Size

LargeModerateSmallNone

Figure 17.

Mean errors vs. the true (

𝑁 >

75) vote fractions for either atruncated ( 𝑁 = 𝑁 =

20) number of volunteers (solid) or the automatedclassiﬁer (dashed). Asking only a few volunteers gives a noisy estimate ofthe true vote fraction. Asking more volunteers reduces this noise. For somenumber of volunteers, the noise in the vote fraction is similar to the errorof the automated classiﬁer, meaning they make classiﬁcations of similaraccuracy; this number is where the solid and dashed lines intersect. We ﬁndthe automated classiﬁer has a similar accuracy to approx. 5 to 15 volunteers,depending on the question. MNRAS000

Random spiral galaxies where the classiﬁer confuses the mostlikely volunteer vote for spiral arm count between ‘2’ and ‘Can’t Tell’.Above: galaxies where the classiﬁer predicted ‘2’ but more volunteers an-swered ‘Can’t Tell’. Below: vica versa, galaxies where the classiﬁer predicted‘Can’t Tell’ but more volunteers answered ‘2’. Red text shows the volunteer(vol.) and machine-learning-predicted (ML) vote fractions for each answer.Counting the spiral arms is challenging, even for the authors. This high-lights the diﬃculty in assessing performance by reducing the posteriors toclassiﬁcations and then comparing against uncertain true labels.

Figure 19.

Galaxies binned by ‘Smooth or Featured’ vote prediction entropy,measuring the model’s uncertainty in the votes. Bins (columns) are equallyspaced (boundaries noted above). Five random galaxies are shown per bin.Unusual, inclined or poorly-scaled galaxies have highly uncertain (highentropy) votes, while smooth and especially clearly featured galaxies haveconﬁdent (low entropy) votes, matching our intuition and demonstrating thatour posteriors provide meaningful uncertainties.MNRAS , 1–18 (2021) M. Walmsley et al

0% 20% 40% 60% 80% 100%

Credible interval width R a t i o i n i n t e r v a l Disk Edge On N N Credible interval width R a t i o i n i n t e r v a l Has Spiral Arms N N Figure 20.

Calibration curves for the two binary GZ DECaLS questions.The 𝑥 -axis shows the credible interval width - for data-dominated posteriors,roughly (e.g.) 30% of galaxies should have vote fractions within their 30%credible interval. The 𝑦 -axis shows what percentage actually do fall withineach interval width. We split calibration by galaxies with few votes (andhence typically wider posteriors) and more votes (narrower posteriors). Onlycredible intervals with at least 100 measurements are shown. Calibration forboth questions is excellent. B avg W a v g Volunteers (N=5378) B avg W a v g Automated (N=43672)

Figure 21.

Distribution of bulge size vs. spiral winding, using responsesfrom volunteers (left) or our automated predictions (right). We observeno clear correlation between bulge size and spiral winding, consistent withM19. The distributions are consistent between volunteers and our automatedmethod. We hope this demonstrates the accuracy and scientiﬁc value of ourautomated classiﬁer. MNRAS000

Galaxies automatically classiﬁed as most likely (highest mean posterior) to be strongly barred.MNRAS , 1–18 (2021) M. Walmsley et al

Figure A2.

Galaxies automatically classiﬁed as most likely (highest mean posterior) to be edge-on with no bulge.MNRAS000

Galaxies automatically classiﬁed as most likely (highest mean posterior) to be edge-on with no bulge.MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release Figure A3.

Galaxies automatically classiﬁed as most likely (highest mean posterior) to have exactly one spiral arm.MNRAS , 1–18 (2021) M. Walmsley et al

Figure A4.

Galaxies automatically classiﬁed as most likely (highest mean posterior) to have loosely wound spiral arms.MNRAS000

Galaxies automatically classiﬁed as most likely (highest mean posterior) to have loosely wound spiral arms.MNRAS000 , 1–18 (2021) alaxy Zoo DECaLS Data Release Figure A5.

Galaxies automatically classiﬁed as most likely (highest mean posterior) to be mergers.MNRAS , 1–18 (2021) M. Walmsley et al

A0: Smooth A1: Featuresor disk A2: Star orartifactA0: Yes A1: NoA0: Bar A1: No barA0: Spiral A1: No spiralA0: Nobulge A2: Obvious A3:DominantA1: Ring A2: Lens orarc A6:OverlappingA4: Irregular A5: OtherA3: DustlaneA0:Completelyround A1: Inbetween A2: Cigarshaped A0:Rounded A1: Boxy A2: Nobulge A0: Tight A1: Medium A2: LooseA0: 1 A1: 2 A2: 3 A3: 4 A4: Morethan 4A0: NoneA0: Merging A1: Tidaldebris A2: Both A3: Neither

T00: Is the galaxy simply smooth and rounded, with no sign of a disk?T01: Could this be a disk viewed edge-on?T02: Is there a sign of a bar feature through thecentre of the galaxy?T03: Is there any sign of a spiral arm pattern?T04: How prominent is the central bulge, compared with the rest of thegalaxy?T05: Is the galaxy currently merging or is there any sign of tidal debris?T06: Do you see any of these odd features in the image?T07: How rounded is it? T08: Does the galaxy have a bulgeat its centre? If so, what shape? T09: How tightly wound do thespiral arms appear?T10: How many spiral arms are there?End1st Tier Question2nd Tier Question3rd Tier Question4th Tier Question

Figure B1.

Decision tree used for GZD-1 and GZD-2, based on the Galaxy Zoo 2 decision tree. The GZD-5 decision tree is shown in Figure 3.MNRAS000