[PDF] DeepSZ: Identification of Sunyaev-Zel'dovich Galaxy Clusters using Deep Learning

Abstract

Galaxy clusters identified from the Sunyaev Zel'dovich (SZ) effect are a key ingredient in multi-wavelength cluster-based cosmology. We present a comparison between two methods of cluster identification: the standard Matched Filter (MF) method in SZ cluster finding and a method using Convolutional Neural Networks (CNN). We further implement and show results for a `combined' identifier. We apply the methods to simulated millimeter maps for several observing frequencies for an SPT-3G-like survey. There are some key differences between the methods. The MF method requires image pre-processing to remove point sources and a model for the noise, while the CNN method requires very little pre-processing of images. Additionally, the CNN requires tuning of hyperparameters in the model and takes as input, cutout images of the sky. Specifically, we use the CNN to classify whether or not an 8 arcmin \times 8 arcmin cutout of the sky contains a cluster. We compare differences in purity and completeness. The MF signal-to-noise ratio depends on both mass and redshift. Our CNN, trained for a given mass threshold, captures a different set of clusters than the MF, some of which have SNR below the MF detection threshold. However, the CNN tends to mis-classify cutouts whose clusters are located near the edge of the cutout, which can be mitigated with staggered cutouts. We leverage the complementarity of the two methods, combining the scores from each method for identification. The purity and completeness of the MF alone are both 0.61, assuming a standard detection threshold. The purity and completeness of the CNN alone are 0.59 and 0.61. The combined classification method yields 0.60 and 0.77, a significant increase for completeness with a modest decrease in purity. We advocate for combined methods that increase the confidence of many lower signal-to-noise clusters.

Full PDF

MMNRAS , 000–000 (2021) Preprint 1 March 2021 Compiled using MNRAS L A TEX style ﬁle v3.0

DeepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clustersusing Deep Learning

Z. Lin (cid:63) , N. Huang , C. Avestruz , , W. L. K. Wu , , S. Trivedi , J. Caldeira , B. Nord , , Department of Computer Science, University of Illinois at Urbana–Champaign, Urbana, IL 61801, USA Department of Physics, University of California, Berkeley, CA 94720, USA Department of Physics, University of Michigan, Ann Arbor, MI 48109, USA Leinweber Center for Theoretical Physics, University of Michigan, Ann Arbor, MI 48109, USA Kavli Institute for Cosmological Physics, University of Chicago, Chicago, IL 60637, USA SLAC National Accelerator Laboratory & KIPAC, 2575 Sand Hill Road, Menlo Park, CA 94025 CSAIL, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA Fermi National Accelerator Laboratory, P.O. Box 500, Batavia, IL 60510, USA

ABSTRACT

Galaxy clusters identiﬁed from the Sunyaev Zel’dovich (SZ) eﬀect are a key ingredientin multi-wavelength cluster-based cosmology. We present a comparison between twomethods of cluster identiﬁcation: the standard Matched Filter (MF) method in SZcluster ﬁnding and a method using Convolutional Neural Networks (CNN). We furtherimplement and show results for a ‘combined’ identiﬁer. We apply the methods tosimulated millimeter maps for several observing frequencies for an SPT-3G-like survey.There are some key diﬀerences between the methods. The MF method requires imagepre-processing to remove point sources and a model for the noise, while the CNNmethod requires very little pre-processing of images. Additionally, the CNN requirestuning of hyperparameters in the model and takes as input, cutout images of thesky. Speciﬁcally, we use the CNN to classify whether or not an 8 arcmin × Key words: cosmic microwave background, galaxy clusters, cosmology, deep learning

The abundance of galaxy clusters is sensitive to cos-mological parameters (Allen et al. 2011). Galaxy clustershave provided cosmological constraints with data from mul-tiple wavelengths including X-ray (Vikhlinin et al. 2009),microwave (Bocquet et al. 2019; Planck Collaboration et al.2016b), and optical (Costanzi et al. 2019). One potential (cid:63)

Contact e-mail: [email protected] systematic uncertainty in cluster-based cosmology is the se-lection function of observed clusters. Clusters observed inmillimeter maps have one of the better understood selectionfunctions, providing a sample selected based on SZ signalsigniﬁcance, which is highly correlated with mass.Galaxy clusters are collections of galaxies ensconced ina halo of dark matter, which provides most of the gravi-tational potential. Amidst the galaxies in a cluster, there © a r X i v : . [ a s t r o - ph . C O ] F e b Lin et al. exists a hot intracluster medium that emits in the X-rays (viaBremsstrahlung), and which makes them observable in themillimeter through the Sunyaev-Zel’dovich (SZ) eﬀect. TheSZ eﬀect is an upscattering of cosmic microwave background(CMB) photons that shifts the CMB black-body spectrumalong the line of sight of a galaxy cluster (Carlstrom et al.2002). The SZ eﬀect is independent of redshift and dependentonly on the temperature of the intracluster medium, a quan-tity strongly correlated with cluster mass. An SZ-selectedgalaxy cluster sample therefore provides what is close to amass-limited selection function, which is straightforward toincorporate in cosmological analyses. SZ-selected clustershave resulted in a number of astrophysical studies as well.Follow-up observations probe the physics of the intraclustermedium and cluster galaxies from data in other wavelengths.SZ cluster follow-up include

Chandra (McDonald et al. 2017)or XMM-Newton (Bulbul et al. 2019) in the X-ray and the

Hubble Space Telescope (Strazzullo et al. 2019) in the opticaland

Spitzer (Strazzullo et al. 2019) or

Herschel (Zohren et al.2019) in the infrared.The traditional method of identifying SZ clusters is todeploy a matched ﬁlter based method (MF) on the maps(Melin et al. 2006, 2012), which identiﬁes regions in themaps that maximize the signal-to-noise ratio correspondingto the ﬁlter shape. The method has successfully identiﬁedcosmological samples in maps constructed with survey datafrom the South Pole Telescope (Bleem et al. 2015; Huanget al. 2019), Planck (Planck Collaboration et al. 2016a), andthe Atacama Cosmology Telescope (Hilton et al. 2018). TheSPT-3G camera, deployed in 2017, dramatically increasesmapping speed over previous cameras. It is expected thatthere will be 5000 cluster detections at 97% purity in the 1500sq. deg. survey area (Benson et al. 2014). Next-generationexperiments, like CMB-S4, will have lower noise and willlikely be able to see even more objects (Abazajian et al.2016).Convolutional Neural Networks (CNNs) are quickly be-coming an essential tool for cosmology and astrophysics(Ntampaka et al. 2019a). CNNs have already been used forboth CMB analyses, e.g. by Caldeira et al. (2019); Krachmal-nicoﬀ and Tomasi (2019); Hortua et al. (2019), and analysesrelated to galaxy clusters. Recent applications of CNNs togalaxy clusters include mass estimations from mock X-rayimages (Ntampaka et al. 2019b) and velocity dispersion distri-butions (Ho et al. 2019). Complementary to mass estimationanalyses, Green et al. (2019) used machine learning methodsto identify physically relevant features in X-ray images thatcorrespond to a galaxy cluster dynamical state. However, ap-plications of machine learning to galaxy cluster observablesin the CMB are still emerging.Examples of machine learning to galaxy clusters in theCMB include Hurier et al. (2017), where neural networksproduced ﬁltered and cleaned maps to improve resultingcluster catalogs with lower mass thresholds and thereforehigher redshifts, a proof-of-concept deep learning applicationto identify SZ clusters in Planck survey data (Bonjean 2020),and to cluster mass estimates from CMB lensing (Gupta andReichardt 2020). We emphasize that our work is the ﬁrstto explicitly use deep neural networks for the identiﬁcationof SZ clusters in their image space from millimeter maps inthe absence of additional map ﬁltering or cleaning steps. Inparticular, we use convolutional neural networks (CNNs) to identify cluster-containing cutouts of the simulated CMB sky.We compare our results with the standard cluster-ﬁndingmethod in the CMB, which has complementary performance,and explore the beneﬁts of combining cluster-ﬁnding methodswith an example implementation of such combinations andthe resulting comparisons.In this work, the CNN classiﬁcation is binary, with theCNN output corresponding to a rank-ordered likelihood for acutout of the sky to contain a cluster above a mass thresholdversus not. However, the standard MF cluster-ﬁnding methodassigns detection signiﬁcance as a function of SNR, whichincreases with cluster mass. As such, an apple-to-applescomparison between the two methods is admittedly artiﬁcial.We do devise a consistent way of comparing the two methodsthat highlights their respective strengths. But, we note that,for a future work, a CNN regression method that predictsthe mass of a cluster would more naturally lend itself to anapples-to-apples comparison with the standard MF output.We highlight a few innovative areas of this work. We havedevised a new and eﬀective training approach for extremelyunbalanced samples. We introduce a metric, the F1 score, thatassesses the combined completeness and purity of the clustersample. The F1 score eﬃciently summarizes the eﬀectivenessof the cluster ﬁnder, enabling a comparison between the CNNand MF methods. We also use the F1 score to evaluate acombined method that incorporate both outputs from theCNN and the MF. One can apply this approach to combineother machine-learning methods with a standard method ofcluster-ﬁnding.The paper is organized as follows. § § § §

4. The codes weused are published on https://github.com/deepskies/deepsz.

In this section, we discuss the origin and preparation of thedata.

We take simulations of the microwave sky from Sehgal et al.(2010), which are built on top of an N-body simulation.Brieﬂy, the N-body simulation used for the sky simula-tion has box size L = 1000 h − Mpc, with 1024 particleswith particle mass 6 . × h − M (cid:12) and softening length (cid:15) = 16 . h − kpc. These simulations provide a full-sky real-ization of the lensed CMB, galactic dust, point sources, andthe SZ eﬀect (both kinematic and thermal) for observingfrequencies 27, 30, 39, 44, 70, 93, 100, 143, 145, 219, 225,280, 353 GHz.The sky simulations include the SZ signal from galaxyclusters, with halo virial masses and locations identiﬁed in theN-body simulation using a friends-of-friends (FoF) haloﬁnder,with a linking length 0.2 of the mean interparticle spacing.The data products we use include separate all-sky maps for MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning each component, as well as catalogs for the locations of N-body halos, and point sources. The halos and point sourcesare only unique on one octant of the sky (the other octantsuse various reﬂections of the catalogs). We therefore restrictour search to only one octant of the sky. We further restrictour search to the 90, 148, and 219 GHz channels, motivatedby typical ground-based CMB telescopes. In addition tothe simulated sky signals, we create white noise realizationsto imitate the eﬀect of instrumental noise. The instrumentnoise levels are 2.8, 2.6, and 6.6 µ K-arcmin for 90 GHz, 148GHz, and 219 GHz maps, respectively — consistent withprojected performance for the SPT-3G camera (Benson et al.2014). All data used in our analysis can be downloaded from https://lambda.gsfc.nasa.gov/toolbox/tb_sim_ov.cfm

For the purposes of our search, everything except the SZsignal from high-mass halos is a noise term. On large angularscales ( (cid:38)

10 arcmin), the maps are dominated by the CMB(because our maps are in units relative to the CMB tempera-ture, the unlensed CMB map for each band is identical). Onarcminute scales, the maps are dominated by point sources.Thermal sources (dusty star-forming galaxies and galacticdust) are brighter at higher frequencies, while radio sources(radio-loud galaxies, typically AGN) are brighter at lowerfrequencies. The SZ signal from galaxy clusters occupies thespace between point sources and the CMB, with angularscales typically between one and ten arcminutes. In the threebands used in this work, the SZ signal is negative, and mostsigniﬁcant at 90 GHz. The 219 GHz channel is aligned withthe null of the SZ spectrum, and has very little thermal SZsignal. These simulations contain thermal SZ contributionsfrom a large number of low-mass clusters, which are well be-low the detection threshold. We must take these into accountas an additional noise term. The kinematic SZ signal is muchsmaller than any of the other noise terms.To make these simulations more realistic, we include anoise term based on the predicted instrumental noise fromthe full SPT-3G survey (as noted above). However, thereare additional instrumental eﬀects that we have ignored,which must be accounted for in real data. First, we have usedthe simulations at their native resolution (approximately0 . (cid:48) ∼ . (cid:48)

0. Second, the map-makingprocess used by the SPT collaboration includes time-domainﬁltering to remove low frequency noise due to both theinstrument and the atmosphere. In the map domain, theﬁltering is approximately represented by an anistropic ﬁlterthat preferentially removes long wavelength modes in theR.A. direction. Finally, the remaining noise is not white, butincreases at larger scales. None of these eﬀects are includedin our simulations.

Data preparation for the CNN model is simpler than forthe Matched Filter. Model training for the network doesnot require any special pre-processing of the maps such asnormalization, or point source removal . For example, insteadof aﬀecting manual normalization, the usual steps in trainingCNNs such as convolution, batch normalization etc., implic-itly process the images in a manner suitable for the prediction.On the other hand, the Matched Filter method does in fact require data preprocessing, such as point source removal, asdescribed in Section 2.1.To construct maps for the CNN, we take componentsand simply add them together. First, we start with mapswith just the CMB and tSZ. We then add instrument noiseto the maps. We then progressively add infrared galaxies,radio sources, and galactic dust emissions to the cutoutscumulatively to facilitate investigation into which of thesecomponents have the largest eﬀect on cluster identiﬁcation.The maps used for the MF are described in § × z (cid:62) .

25 and with virial mass aboveM vir (cid:62) × M (cid:12) . Furthermore, we condition a cutout as“positive” only if the cluster position is located within the6 × unbalanced dataset since the classof “Negative” labels has many more objects than the classof “Positive” labels. Unbalanced datasets pose an additionalchallenge when training neural networks. We discuss thischallenge and our approach in a later section ( § training set is comprised ofcutouts with center right ascension (RA) coordinate above0 . ×

90 deg, the test set comprised of cutouts with center RA

MNRAS000

MNRAS000 , 000–000 (2021)

Lin et al. GH z CMB + tSZ +kSZ +Infrared Galaxies +Radio Galaxies +Galactic Dust GH z GH z K (a) GH z CMB + tSZ +kSZ +Infrared Galaxies +Radio Galaxies +Galactic Dust GH z GH z K (b) Figure 1.

A positive (top) and negative (bottom) sample input to the neural network with three channels at 90GHz (1st row in top/bottom),148 GHz (2nd row) and 219 GHz maps (3rd row). We show the cutout with diﬀerent components (CMB, tSZ, kSZ, IR galaxies, radiogalaxis, galactic dust) gradually added here. All cutouts contain instrument noise, which is 2.8 µ K-arcmin 2.6 µ K-arcmin and 6.6 µ K-arcmin for 90 GHz, 148 GHz and 219 GHz maps respectively, consistent with projected performance for the upcoming CMB experimentSPT-3G. Each pixel is 0.25 x 0.25 arcmin. Each cutout has 32x32 pixels. X-axis is ra, and Y-axis is dec. The position of the cluster ismarket with a red ”X”. MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning coordinate below 0 . ×

90 deg, and the remaining cutoutsgrouped into the validation set. To avoid training a networkon the periphery of a cluster, in the training set we removethe negative cutouts that are adjacent to a positive one. Wedo not do this on the validation or test set. The ﬁnal numbersof cutouts in the training, validation and test sets are 601,400,105,183, and 56,637 respectively.Fig. 1 shows two sets of example maps at 90 GHz, 148GHz and 219 GHz. The top half of this ﬁgure shows cutoutmaps with a galaxy cluster, where the true position of thecluster is marked with a red “X”. The lower panel of thisﬁgure shows example cutout maps that do not contain agalaxy cluster. From left to right, we add additional sourcesof noise. The addition of IR and radio galaxies increases thevisual inhomogeneity of the image.We next discuss the impact of high ﬂux sources. Figure 2shows an example cluster-containing cutout from our samplethat happens to also host a high ﬂux radio galaxy. Thecolumn on the left is the image with only the CMB+tSZcomponents simulated, similar to the left-most column ofFigure 1. The column on the right has all other componentsadded, similar to the right-most column of Figure 1. In theabsence of the high ﬂux radio galaxy (left column), the tSZsignal from the galaxy cluster is visually identiﬁable in the90 GHz and 148 GHz frequencies and the colormap for theseimages spans a few hundred µ K. In contrast, the images thatinclude the high ﬂux radio galaxy (right column) span thethousands of µ K, and the tSZ signal from the galaxy clusteris barely identiﬁable in the only the 90 GHz frequency withthe same linear scaling. The high ﬂux galaxy is the mainvisible feature in these images, seen as two very bright pixelsnear the center of the cluster.The MF method requires a preprocessing step of point-source removal in order to identify the high signal-to-noiseco-located pixels of the tSZ signal from galaxy clusters. Manyclassical machine learning methods also require some sortof preprocessing, such as rescaling and normalization, tooptimally perform. We emphasize that the CNN methodwe use does not require any such manual preprocessing andtakes as input, pixel data from images like the high ﬂuxgalaxy containing cluster. This cutout is an example of a“True Positive” identiﬁed cluster by our method.

We next describe distinct methods of SZ galaxy cluster detec-tion — ﬁrst the canonical method, Matched Filter (MF), andthen an alternative method via Deep Convolutional NeuralNetworks (CNNs). We are developing the methods in simula-tions, and therefore know the ground truth of all the clusters,including, namely, the properties of their underlying darkmatter haloes. To assess cluster-ﬁnding eﬃcacy, we mustmatch clusters to haloes, which we describe after the detec-tion methods. We then note the diﬀerences and similaritiesof the methods in principle, in practical implementation, andcompare their eﬃcacies.

We applied a matched-ﬁlter-based method to the simulatedmicrowave maps. The matched-ﬁlter is a standard method GH z CMB + tSZ K GH z GH z All Components K Figure 2.

We provide an example cutout that contains a cluster,and also happens to host a high-ﬂux infrared galaxy. The columnon the left is the image with only the CMB+tSZ components, andthe column on the right is the image with all components addingon the kSZ, infrared and radio galaxies, and galactic dust. Similarto Fig. 1, top to bottom rows correspond to the 90, 148, and 219GHz bands. The addition of a high-ﬂux infrared galaxy increasesthe colorbar range by an order of magnitude, weakening the tSZsignal from the galaxy cluster relative to the rest of the cutout.The CNN takes input images, such as those in the right-handcolumn, with no pre-processing. to identify galaxy clusters from their SZ signature, providinga baseline to compare our deep learning model. It uses thespectral and spatial characteristics of the SZ signal fromgalaxy clusters to diﬀerentiate them from noise. This methodwas developed in Melin et al. (2006). We apply the matchedﬁlter method as applied to data collected with the SouthPole Telescope (e.g. Vanderlinde et al. 2010 (hereafter V10),Bleem et al. 2015, and Huang et al. 2019 (hereafter H19)). Inthis section we provide a descriptive overview of the methodand leave more mathematical and detailed treatments to theabove references.A matched ﬁlter is a Fourier-domain ﬁlter which opti-mally ﬁlters for a given input proﬁle, and a given set of noisepower spectra. The input proﬁle is weighted by the inverseof the noise power spectra. This weighting scheme providesan optimal ﬁlter under two conditions: the noise terms areGaussian and stationary (that is, the noise power spectrado not vary over the map). In order to maximize signal-to-noise on SZ clusters, our matched ﬁlter also combines themaps across frequency bands. For each band, we model thefollowing noise terms:

CMB:

The CMB contributes noise primarily at large angu-lar scales. We calculate the CMB power spectrum using theCode for Anisotropies in the Microwave Background (CAMB;Lewis et al. 2000) with the best ﬁt ΛCDM parameters fromWMAP7+SPT (Komatsu et al. 2011; Keisler et al. 2011). We

MNRAS000

MNRAS000 , 000–000 (2021)

Lin et al. also include a term for the kinematic SZ, based on Shirokoﬀet al. (2011). These terms have the same amplitude in eachband. The simulations used in this work generate the CMBrealization based on WMAP5 ΛCDM parameters.

Point Sources:

The brightest point sources contribute noiseon the smallest scales, but the background of unresolvedsources contributes to a much broader range of scales. Wemodel the combination of two source populations: one fromdusty galaxies, which are brighter at higher observing fre-quencies; and one from radio-loud sources, which are dimmerat higher observing frequencies. We assume that the spa-tial power spectrum of the combined source population isﬂat in C (cid:96) -space. Each frequency is normalized such that D (cid:96) = (cid:96) ( (cid:96) + 1) / (2 π ) C (cid:96) = (2.7, 8.8, 71.) µ K CMB2 at (90, 148,219) GHz and (cid:96) = 3000. These values were chosen to matchthe amplitudes of the point source power spectra from thesimulated maps used in this work.

SZ Background:

There is a contribution to the noise fromdim, undetected SZ sources. We model this as ﬂat in D (cid:96) -space,with D = 3 . µ K CMB2 at 90 GHz, and the remainingbands scaled from this value using the non-relativistic formof the SZ frequency spectrum.

Instrumental Noise:

Instrumental noise is also ﬂat in C (cid:96) -space, with amplitudes given in § β -model (Cavaliere and Fusco-Femiano 1976) with β = 1.V10 explored more complex models, but found no increase inthe eﬃcacy of the matched ﬁlter. The β -model has one freeparameter, which sets the angular scale of the proﬁle. Wecreate 12 diﬀerent ﬁlters to account for clusters of diﬀerentangular sizes.Our ﬁltering procedure produces a set of 12 maps (onefor each proﬁle) in signal to noise units. We run a peak-ﬁnding algorithm which groups connected pixels above agiven SN threshold. Each group’s position is calculated bytaking the SN-weighted mean of the pixel positions. Finally,for groups with detections in multiple output maps, we takethe group with the highest SN. These detections make up theﬁnal cluster candidate list. Each candidate has a location,SN (which is used as a proxy for mass; see e.g. H19), andthe proﬁle which maximized its SN. The machine learning model we have chosen for this workis a deep convolutional neural network, which is known toperform well in image classiﬁcation tasks. Compared withthe matched ﬁlter method, neural-network-based methodsrequire less image pre-processing, and are very ﬂexible in thatthe same architecture can be used in diﬀerent tasks with littlemodiﬁcation. For example, model structures developed in classiﬁcation problems are often used for regression problems.In fact, even the weights trained on one task can often helpin the training of a diﬀerent task. With this in mind, wewill use a well-known architecture as a starting point for ournetwork structure.

Our network design is based on ResNet50 (He et al. 2016),which is a powerful and popular deep convolutional neuralnetwork architecture (LeCun et al. 1998).A typical neural network takes in a multi-dimensionalarray (e.g. a RGB image as a 3D array), and outputs ascalar or array, depending on the use case. In our binaryclassiﬁcation case, the output is a real number denoting the“score”, which should be a rank-preserving function of theprobability that a cutout contains a target. A deep neuralnetwork usually consists of several layers, and in the simplestcase, each layer has several neurons , each of which computesa linear combination of its inputs, and then passes the valuethrough a non-linear function. This output then forms theinput of the next layer.A deep convolutional neural network (DCNN or CNN)is characterized by the presence of convolutional layers, pos-sibly among others. A convolutional layer takes in an imageor a batch of images as the input. A layer contains small(usually several pixels by several pixels) learnable ﬁlters, andconvolves each ﬁlter with the input using a sliding window,usually with a stride, to get a feature map. This feature mapoutput is then passed through an element-wise non-linearfunction. Beyond the ﬁrst layer, the input are feature mapsgenerated by previous layers instead of images. A convolu-tional layer has a few notable distinctions compared with afully connected layer: It reduces the number of parameters(parameter sharing), and, partly because of this, its outputis less sensitive to translation in the input ( translation in-variant ), which proves to be a very desirable property inapplications like image classiﬁcation. After a convolutionallayer, a typical CNN will have a pooling layer that takes themaximum activation in a small region (usually 2x2 pixels).Strides in the convolutional layer and pooling layer are twoways to down-sample the feature maps, and by repeatingthese steps, the neurons in the deeper layers can “see” a largerand larger region of the original image.If layers are just stacked to build a very deep neuralnetwork with no other modiﬁcations, we often see an degra-dation in accuracy, since the parameters become very diﬃcultto optimize as the network becomes deeper. Note that whilethe number of parameters grows with the number of layers,the phenomenon referred to here is manifested in a higher training error, so this is not related to overﬁtting. A

ResidualNetwork (ResNet) is a DCNN that aims to mitigate thisissue. A Residual Network has several residual blocks, eachconsisting of a few convolutional layers and a shortcut con-nection from the ﬁrst layer to the last within the block. Theidea is that the shortcut connection behaves like an identityfunction, and the layers inside a residual block will only needto add what has not been learned by layers prior to the block:the “residual” of this identity map. Shortcut connections thenallow us to make models deeper without degrading perfor-mance. This is because simply adding identity layers to a

MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning model (that is, if the additional layers that are skipped addnothing new) will not increase the training error.ResNet50, proposed in the original ResNet paper, is apopular ResNet model used in many diﬀerent image classiﬁ-cation tasks. It has 50 convolutional or fully connected layersin total, grouped into several residual blocks. The depth ofResNet50 was optimized to train on relatively big cutouts.However, to cater to the small image sizes in our dataset, weeﬀect the following changes:1. We remove the ﬁrst convolution layer (kernel size 7,stride 2, padding 3), and replace it with 2 smaller convolutionlayers with kernel size 3 and padding 1. The ﬁrst convolutionlayer has a stride of 1 whereas the second has a stride of 2.Like the original ResNet50, we have batch normalization andReLU after each convolution layer, and we also have a maxpooling layer of stride 2 after the second convolution layer.2. Instead of having a 5th stage, we put 2 fully connectedlayers of size 256 on top of stage 4’s output, and put aprediction and softmax for 2 classes afterwards.Fig. 3 shows the structure of the network visually. As mentioned earlier, our data is highly unbalanced. Theratio of negative cutouts to positive cutouts is over a factorof 50. To address this challenge, we experiment with sev-eral training strategies. We ﬁrst manually assign diﬀerentweights to positive and negative samples to give comparableimportance to the positive and negative subsamples. Sim-ilarly, we oversample the positive samples to establish aneﬀective negative-to-positive ratio closer to 1. However, thesestrategies result in a trade-oﬀ. If we set the eﬀective ratio low(closer to a balanced sample), then the network carries a biasunrepresentative of the real sample. On the other hand, if theeﬀective ratio is too high, then the network is very diﬃcult totrain. This diﬃculty comes from the fact that a blind guessof all negatives can already produce a good accuracy andloss. Moreover, iterating the process over diﬀerent ratios isvery time-consuming. To solve this problem, we devise thefollowing strategy:(i) Let us denote r as the eﬀective ratio of the number ofnegative cutouts to the number of positive cutouts. S r is asample where the positive cutouts are over-sampled such thatthe negative-to-positive eﬀective ratio is r . One can thinkof this as a data-loader of the same training set that givespositive cutouts a higher probability of being sampled.(ii) We start by training the model on S , a sample wherethe there is an equal probability of drawing a negative ora positive cutout. We train the model with a batch size of128, using cross-entropy loss. We evaluate the loss on theover-sampled validation set every 100 batches and continuetraining until the loss on the validation set converges (i.e.does not decrease for 1000 batches). We consider this as anepoch. Note that each epoch can have diﬀerent sizes.

As anexample, the ﬁrst epoch took 2400 batches to complete.(iii) In each epoch, we use stochastic gradient descent fortraining with an L2 weight decay rate set to 0.003. We set thelearning rate to 5 × − , with a linear warm-up schedule inthe ﬁrst 500 batches in each epoch (linearly increasing from 5 / × − to 5 × − ). After 1000 batches, we decrease thelearning rate to 5 × − .(iv) At the end of each epoch, we save the best modelweight so far, feed the original validation set (no over-sampling involved) to this model and save the results. Wethen increase r by one and continue training starting withthis model weight.(v) After training on S , we go back and pick the modelweights that best perform on the entire validation set. Inthis experiment, the best performing model that we select isfrom S .We can consider this to be a dynamic stratiﬁed sampling.The entire training process, including evaluation of the testand validation sets at the end of each ratio, took approxi-mately 8 hours on a GTX 1060 GPU. We ﬁrst show someof the training and validation cross-entropy loss statisticsduring the training process in the top panel of Fig. 4. We alsooverlay the ratio at each iteration in the plot. In the bottompanel of Fig. 4 we show the training and validation accuracyadjusted for the ratio, along with the ratio at each iteration.Unless speciﬁed, all numbers reported are for cutouts with allnoise components. We perform a breakdown of these resultsin Section §

3. Finally, we perform the classical Platt Scalingcalibration method for binary classiﬁcation, (Platt (1999)),on the ﬁnal output, so that we can interpret the output asprobabilities. This is equivalent to running a logistic regres-sion using output on the validation set. All CNN outputs inthis paper are the calibrated numbers.

The MF assigns detections to speciﬁc coordinates, while theCNN detects whether or not each cutout contains a cluster.To compare two methods, we need to match MF detectionsto cutouts.MF-identiﬁed clusters whose centers lie near the edgeof a cutout may have an identiﬁed center just inside thecutout, but a true center outside the cutout region. In thesecases, the cutout containing the MF identiﬁed cluster positionwould not be a “cluster-containing” cutout, despite the actualcorrespondence. These cases would lead to a comparison thatunderstates the MF performance. To ensure correspondenceand to make a fair comparison between methods, we performthe following:(i) We match MF detections to the largest cluster withina 1-arcmin radius of the detection. The radius was chosen tomaximize MF’s performance.(ii) If the detection corresponds to a real cluster, we matchthe detection to the cutout corresponding to the true clustercenter. If the detection does not correspond to a real cluster,we match the detection to the cutout containing the MFidentiﬁed center.(iii) We assign the corresponding signal-to-noise-ratio(SNR) of the MF detection to the cutout to which the de-tection is assigned in the step above. If multiple detectionsmap to the same cutout, the largest SNR is assigned to thatcutout.

MNRAS000

MNRAS000 , 000–000 (2021)

Lin et al. data3x3 Conv-BN-ReLUchannels: 643x3 Conv(s=2)-BN-ReLUchannels: 64Max Pool(s=2)Stage 1 (3 Blocks)channels:(in,bottleneck,out) = (64,64,256)1st block stride=1Stage 2 (4 Blocks)channels:(in,bottleneck,out) = (256,128,512)1st block stride=2Stage 3 (6 Blocks)channels:(in,bottleneck,out) = (512,256,1024)1st block stride=2FC-ReLUchannels: 256FC-ReLUchannels: 256FC-ReLU-Softmaxchannels: 2prediction:( P (positive), P (negative)) Stage 1 input3x3 Conv(s=1)-BN-ReLUchannels: 64 (bottleneck)3x3 Conv-BN-ReLUchannels: 64 (bottleneck)3x3 Conv-BNchannels: 2563x3 Conv(s=1)-BNchannels: 256 elementwise add - ReLU Exactly the same the 2ndConv: Convolution. Stride is 1 (s=1) unless speciﬁedBN: Batch NormalizationReLU: Rectiﬁed Linear UnitFC: Fully Connected

Figure 3.

Visualization of the modiﬁed ResNet. There are three stages with residual blocks, mostly like the original design, and we onlyshow the details of the ﬁrst one for simplicity. In the ﬁrst block within each stage, we have two Conv-BN-ReLU layers with a smallnumber of channels (bottleneck) for better learning, and another Conv-BN layer with more channels, whose output is taken a sum withthe skip connection. In the following blocks the skip connection does not perform any operation.

Below, we present results of both methods applied to themock data set with all noise components. Table 1 summa-rizes all the confusion matrices after appropriate thresholdselection. These results use thresholds that maximize theF1 score of the methods applied to the validation set (see3.2 for the detailed procedures). Solely focusing on purityor completeness as a metric can be misleading; the F1 scoreoﬀers a fairer platform for comparison. We see that the F1score performance of MF and CNN are comparable. Addi-tionally, the Ensemble methods can signiﬁcantly improve theperformance measured by any of the metrics. The Ensemblemethods either simultaneously improve both purity and com-pleteness (AND ensemble), or greatly improve one withoutsacriﬁcing the other (PROD ensemble).

In this subsection, we describe one possible systematic in ourcomparison of the two methods. To assign a performancemetric to a cluster detection method, we need a mass thresh-old for a “true cluster”. There are otherwise multiple halosin any given patch of the simulated sky, some of which sitabove that threshold and some of which sit below. Due to hi-erarchical structure formation, there are more objects belowa mass cut than above. An artifact of a cutoﬀ is that therewill likely be some confusion near the selected threshold.In particular, the matched ﬁlter does not assume a cleancutoﬀ in mass when identifying a cluster, but rather identiﬁesa cluster according to the SNR. The SNR produced by thematched ﬁlter scales with cluster mass (and weakly withredshift, see Bleem et al. (2015), and H19) with a ∼

20% log-

MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning L o ss N e g a ti v e - t o - P o s iti v e R a ti o Train LossValid LossNegative-to-Positive Ratio (a) A cc u r ac y N e g a ti v e - t o - P o s iti v e R a ti o Train AccuracyValid AccuracyBlind GuessNegative-to-Positive Ratio (b)

Figure 4.

In this ﬁgure, (a) shows the loss history, and (b) shows the accuracy history as a function of the number of batches (iterations).

Method Prediction Has Cluster (Truth) No Cluster (Truth) Precision (Purity) Recall (Completeness) F1MF Has Cluster 1133 729 0.61 0.61 0.61No Cluster 712 102609CNN Has Cluster 1118 780 0.59 0.61 0.60No Cluster 727 102558MF+CNN Has Cluster 1283 625 0.67 0.70 0.68Ensemble AND No Cluster 561 102713MF+CNN Has Cluster 1429 946 0.60 0.77 0.68Ensemble rankproduct No Cluster 416 102392

Table 1.

A summary of all the confusion matrices. All confusion matrices are given by selecting the threshold for each method on thevalidation set to maximize F1 score, and then apply the thresholds on the test set. normal scatter in the scaling relation (see e.g. Bocquet et al.2019). To brieﬂy describe the relationship, at ﬁxed mass, thetemperature of the cluster gas increases with redshift there-fore increasing the tSZ as well. And, at larger angular sizes,the cluster size becomes comparable to the smallest CMBﬂuctuations. The MF downweights such modes, ultimatelycontributing to a redshift dependence in the MF performance.We included a redshift threshold of 0.25 (see Section 1.2) tosomewhat address such issues.We can conjecture that the CNN will also get confusedby an object whose mass is near the threshold. We assume amass threshold when labeling a cutout as “cluster-containing”.But some halos below the mass threshold may have a higherMF SNR or CNN score than halos above the detection thresh-old. This ambiguity aﬀects the precision of both methods. Nonetheless, we ﬁx the threshold of each method in ourcomparison in an eﬀort to fairly assess each.

In Section 3.1, we discussed how the ambiguity in deﬁningtrue positives in the sample might aﬀect the metrics formodel performance. Another aspect of performance metriccomparisons is assessing the trade-oﬀ that occurs when wevary the classiﬁcation threshold; we can increase the numberof true positives, improving the completeness of a sample, atthe cost of increasing the number of false positives, worseningthe purity of a sample. The cost-beneﬁt comparison becomesmore complex in an imbalanced dataset, where there is alarge diﬀerence between the number of true positives and thenumber of true negatives. This is the case for the number of

MNRAS000

MNRAS000 , 000–000 (2021) Lin et al. F S c o r e C oun t s MFCNNMF Score Dist.CNN Score Dist.

Figure 5.

F1 score as a function of the threshold for S/N ratiofrom MF and probability from CNN (on the validation set). Wepick the thresholds for each method here and later evaluate themon the test sets. The distribution of the scores (S/N ratio for MF,and probability for CNN) are shown in the background. We alsodid not include any MF SNRs below 3.0 because there are toomany of them. cluster containing cutouts; galaxy clusters are relatively rareobjects in the overall footprint of the sky.The F1 score is a common choice of performance metricin problems with imbalanced samples. If a classiﬁer blindlypredicted everything to be positive (or negative) and therewere many more true positives (negatives) in a sample, wewould see a misleadingly high accuracy. The F1 score accountsfor the imbalance, deﬁned as the harmonic mean of precisionand recall:F1 = 2precision − + recall − = 2TP2TP + FN + FP , (1)In our particular use case, a classiﬁer could identify all cutoutsas non-cluster-containing and achieve a high accuracy. Butthe F1 score would equal zero, indicating the lack of informa-tion provided by this classiﬁcation scheme. A high F1 scoreindicates a classiﬁer with high precision while detecting asmany clusters as possible despite the imbalance between thetwo classes in our dataset. We therefore emphasize the F1score in subsequent discussions of model performance.Note we advocate for F1 as a useful metric, regardless ofthe method classiﬁer. We can use the F1 score to determine anoptimal threshold for positive predictions for a given classiﬁer.We do that by looking at how F1 varies as a function of thethreshold for positive identiﬁcation of that classiﬁer. Themaximum F1 score corresponds to a threshold that maximizespositively identiﬁed clusters while minimizing false positives,useful for any cluster analysis that requires further follow-upobservations. The peak F1 score also provides a single metricto compare any classiﬁer.Fig. 5 shows the F1 score on the validation set of MF andCNN as a function of the threshold for positive identiﬁcation.The MF has a threshold at some SNR, and the correspond-ing F1 score is shown with the purple line. The CNN hasthresholds of probability output with the corresponding F1score shown in the black line. The shaded purple and blackhistograms respectively show the number of objects within agiven SNR bin and probability bin.From this ﬁgure, we see that the peak F1 score of MF applied to the validation set is 0.60, which corresponds to aSNR threshold of 13.4, the 98.2% rank percentile of the MFSNR of all cutouts. That same SNR threshold results in anF1 score of 0.62 when applied to the MF SNR of the test set.We can also determine that the peak F1 score of CNN appliedto the validation set is 0.60, corresponding to a probabilitythreshold of 0.367, the 98.0% rank percentile of the CNNscores of all cutouts. That same probability threshold resultsin an F1 score of 0.60 on the test set. Using the F1 score tocompare the two methods, we see that they do comparablywell. We summarize the metrics in Table 1.Note, if we consider true-positives with lower clustermass thresholds, the F1 score of the MF as a function ofSNR would shift. At ﬁxed SNR, as we decrease the massthreshold, purity (precision) will increase, but completeness(recall) will decrease. As a result, the direction of the shiftof F1 at each SNR value will be determined jointly by thesetwo eﬀects. But, the peak of the F1 score curve, i.e. theoptimal SNR threshold, will systematically shift to lowerSNR values since true clusters include lower mass objects.However, our analysis does not include these results, sincewe want to compare the MF results with a given CNN modelthat was trained at a given mass and redshift threshold.It would be computationally expensive to iteratively trainmore CNN models to compare with the MF performance atdiﬀerent mass thresholds. Furthermore, we will have multipleclusters per cutout, complicating our comparison metrics(see discussion in Sec 1.2.)The F1 score also provides a means to combining the twomethods for cluster identiﬁcation. We evaluated two ways tocombine the predictions, an EnsemblePROD method and an

EnsembleAND method described below. • EnsemblePROD method: Here, we form one single scoreby multiplying the rank-percentile of CNN and MF scores,similar to an interaction term in regression tasks. Supposewe have N cutouts, and one CNN score p CNNi and one MFSNR s MFi for each cutout i ∈ [ N ]. The CNN rank percentileis deﬁned as q CNNi := N (cid:80) j ∈ [ N ] [ p CNNj (cid:54) p CNNi ], and theMF rank percentile is deﬁned similarly - just the percentile itsMF SNR falls in the sample. The combined rank percentileis simply their product: q Ensemblei := q CNNi q MFi . • EnsembleAND method: Here, we classify a cutout basingon a logical AND condition basing on CNN and MF scores.Unlike previous methods, this method requires 2 thresholds(one for CNN and one for MF), and classify a cutout aspositive if both thresholds are met.We evaluate the metrics for the combined classiﬁersusing the same metrics and procedures as we did for theindividual methods. We choose the threshold on validationset and evaluate the performance on the test set.Fig. 6a shows the results of the F1 score curve for

Ensem-blePROD as a function of the rank product, deﬁned above.The peak of the curve corresponds to the rank product thatresults in the maximum F1 score. We summarize the metricsin Table 1. Recall that the thresholds picked on the validationset for the individual methods are 98.2% for MF and 98.0%for CNN. The new rank-product threshold, 93.8%, is at the97.7%-percentile of all rank-products, which is very closeto the two standalone threshold ranks. In other words,

En-semblePROD gives a similar number of positive predictions(compared with MF and CNN), but the overall F1 score, as

MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning shown in Table 1, is greatly improved. This suggests that EnsemblePROD successfully incorporates information fromboth methods.Fig. 6b shows the results of the F1 score for

Ensemble-AND , color-coding the F1 score in the space of the MF andCNN thresholds for positive identiﬁcation. We summarize themetrics in Table 1. As expected, the two thresholds are signif-icantly lower than the counterparts in the standalone MF orCNN methods, which is a direct result of the AND logic weapply here. This however suggests that CNN predicts certainlow-MF-SNR cutouts with higher probabilities and the MFmeasured a higher SNR for certain low CNN probabilitycutouts; the two methods complement one another.We now compare the performance of the ensemble meth-ods with the the standalone on the test set. The F1 scoreallows for an overall comparison along a single dimension.Both ensemble methods, with F1 scores of 0.69 and 0.68respectively, lead to signiﬁcant improvement over standaloneMF or CNN method, with F1 scores of 0.61 and 0.60 respec-tively. This illustrates the strength in combining methods,particularly with the use of F1 scores. In Section 3.3, weinvestigate how the two methods complement one another.

We compare the CNN and MF performance using thresholdsthat maximize the respective F1 score of each method. Withthis, we arrive at the precision and recall values provided inTable 1. The ﬁrst two rows of the table show the standaloneperformance of each the MF and CNN. The precision of theCNN is slightly worse than the MF, and the recall slightlyimproved. We want to emphasize again that only looking ateither precision or recall here would be misleading, as ourevaluation metric is mainly F1 score. For example, it’s truethat the

EnsemblePROD method does not improve precision,but that’s because it trades (on the validation set) precisionfor a much higher recall to improve the F1 score.Given the diﬀerences inherent to each method, we furtherexplore if the two methods are complementary to one another,preferentially selecting (or missing) clusters with particularattributes. To further investigate, we looked at two sets ofclusters in the extremes of the CNN and MF performance:(i)

Clusters with high MF SNRs but low CNN scores:

Theleft panel of Fig. 9a shows R vir as a function of clusterredshift, color coded by t SZ . We show the clusters whoseCNN score is below 0 . .

4. The rightpanel shows the position of each cluster within the cutouts.These clusters are very close to the edge of the center of thecutout, which is a 6 × Clusters with high CNN scores but low MF SNRs:

Theleft panel of Fig. 9b shows R vir as a function of clusterredshift, color coded by t SZ . We show the 73 clusters whoseCNN score is above 0 . < . Given the slightly worse overall performance of the CNN, weexamine noise components as a potential cause to the lowerperformance. Fig. 8 shows the performance of the standaloneCNN with diﬀerent levels of noise (diﬀerent components),quantiﬁed with a ROC (receiver operating characteristic)curve. With each added noise component, the area under thecorresponding ROC curve (AUC) decreases, as one mightexpect. However, the diﬀerence is not drastic varying onlyfrom AUC=0.99 to 0.98. The resulting test

F1 we measureby using validation thresholds with each added level of noiseis 0.64 for cmb+tsz, 0.63 for +ksz, 0.61 for +IR galaxies,0.6 for +radio galaxies, and 0.61 for +galactic dust. In bothcases, infrared galaxies and radio galaxies seem to aﬀect theperformance of CNN the most. We conclude that the CNN isrelatively robust to the noise components in the microwavesky.

As described in Section 3.3, we found that many of thelow-CNN-probability and high-MF-SNR cutouts have theircluster very close to the edge. We re-generated cutouts withthe same cluster at varying distance from the center, andapply the same trained network on these shifted cutouts.Fig. 9c shows the positions of these clusters in the newcutouts. Interestingly, we did ﬁnd a very obvious negativecorrelation between the prediction score and the clusters’distance from the center of the cutouts, as shown in Fig. 9c.This suggests potential room of improvement with somediﬀerent ways to apply the network. A potential algorithmto maximize the performance of the CNN could be to simplyoverlap cutouts by 1/4 the width of a cutout. This way,each cluster is within the middle two quarters of at leastfour cutouts. While several cluster containing cutouts wouldsuﬀer from the edge eﬀect, each cluster would be positivelyidentiﬁed in some subset of the cutouts.

We now compare the cluster identiﬁcation methods showingthe completeness of each method, i.e. their performance as afunction of cluster properties.Figure 10 shows the completeness, also known as re-call, of each method as a function of cluster mass. This isthe fraction of true positives identiﬁed by a given methodout of the total number of cluster images evaluated. Eachline color corresponds to a diﬀerent identiﬁcation method:black, purple, and blue respectively correspond to CNN,MF, and EnsemblePROD. The solid lines show the com-pleteness for the entire sample, and the dashed lines forclusters above z > .

25. Recall, we impose both a mass cutof M halo (cid:62) × M (cid:12) and a redshift cut of z > .

25 forcutouts labeled as “cluster-containing” in our training setfor the CNN classiﬁer. We mark this threshold with a redvertical line. For comparison, the blue histogram illustratesthe underlying halo mass function of the simulated galaxycluster sample that we used to produce the cutouts, plottedas “Counts of Objects” (right y-axis labels) as a function of

MNRAS000

MNRAS000 , 000–000 (2021) Lin et al. F S c o r e (a) F1 score on the validation set if we form a single score bymultiplying the rank percentile of CNN score and MF S/Nratio. The validation set selects the rank product threshold tobe 93.8%. M F T h r e s ho l d F S c o r e (b) F1 score on the validation set if we require both MF andCNN’s score to pass a certain threshold. The optimal point iswhen CNN probability is greater than 0.26 and MF S/N ratiogreater than 7.9 (marked with a black cross). Figure 6.

Here we show the performance of 2 Ensemble methods: EnsemblePROD (6a) and EnsembleAND (6b). Their performancesare very similar: Both achieve an F1 score of 0.69 on the validation set, and on the test set. Both methods, however, noticeablyoutperform CNN or MF standalone.

S/N=97.34 CNN Prob=0.86 Mvir=7.24e+14 C NN T P MF TP

S/N=10.09 CNN Prob=0.73 Mvir=2.23e+14

MF FN

S/N=32.60 CNN Prob=0.06 Mvir=2.10e+14 C NN F N S/N=9.42 CNN Prob=0.26 Mvir=2.16e+14

Figure 7.

Examples of cluster-containing cutouts correctly clas-siﬁed by the combined method EnsemblePROD. The picturesare generated by treating the three channels (90/148/219 GHz)of input as RGB channels. For each channel, the pixel value isre-scaled to [0, 255], with the same channel (e.g. red channel) ofall cutouts sharing the same color scale. The top (bottom) row hashigh (low) probabilities assigned by the CNN classiﬁer, which weidentify as a CNN ”true positive”/TP (CNN ”false negative”/FN).The left (right) column has high (low) signal to noise as measuredby the MF algorithm, which we identify as a MF ”true positive”,MF TP (MF ”false negative”, MF FN). While the bottom rightpanel might be identiﬁed as both a CNN FN and a MF FN, theEnsemblePROD correctly identiﬁes this cutout because of its rela-tively high ranking in the CNN and MF scores. Note that althoughthe score given by CNN in the bottom left is only 0.06, we cansee that it is a high percentile already because CNN gives mostcutouts a score of essentially 0 (see the histogram in Fig. 5) the virial mass bin. To guide the eye for comparison, thebottom panel shows the ratio of the recall of CNN and MFwith the recall of EnsemblePROD with the same redshiftselection.In the range of 1 . (cid:46) M halo M (cid:12) (cid:46) . z > .

25, meaningthat many of the high mass objects were also not labeled as“cluster-containing”.Figure 11 shows the corresponding selection as a functionof redshift. Here, we only show the completeness curves ofeach method for clusters that are above our mass cut for“true”, M halo (cid:54) × M (cid:12) . The red vertical line marks theredshift threshold for cutouts labeled “cluster-containing”.For clusters below z (cid:46) .

9, the EnsemblePROD out-performs both the CNN and the MF in completeness. Athigher redshifts, the completeness of the sample selectedby the MF is comparable to that of the sample selected byEnsemblePROD, both exceeding the completeness of theCNN, which remains relatively constant until z ∼ .

5. TheEnsemblePROD has a larger F1 value with similar com-pleteness at high redshifts as the MF, thereby illustratingthat information from the CNN helps to maintain purity incluster identiﬁcation. In fact, for both lower mass and lowerredshift galaxy clusters, the completeness of the sample im-proved when we combine the two methods. This improvementsuggests that the MF and the CNN are picking up comple-mentary features in our galaxy cluster sample.To better assess the complementarity, we visualize theperformance of each method in Figure 12. Here, we show thedistribution of the virial mass as a function of three clusterparameters in our sample. From top to bottom, we show thevirial mass as a function of redshift, virial radius, and angular

MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning F cmb+tsz peak=0.67+ksz peak=0.66+ir_pts peak=0.64+rad_pts peak=0.63+dust peak=0.62 (a) T r u e P o s iti v e R a t e ROC cmb+tsz AUC=0.99+ksz AUC=0.99+ir_pts AUC=0.99+rad_pts AUC=0.98+dust AUC=0.98 (b)

Figure 8.

Metrics: 8a shows the curves of F1 score as a function ofthe threshold on

CNN probability for diﬀerent levels of noises. 8bshows the ROCs of CNN’s predictions on diﬀerent levels of noises. size of the cluster. From left to right, we show the percent oftrue positives identiﬁed by the CNN, EnsemblePROD, andMF, where we color code the parameter space by the truepositive percentage. Darker shades of green correspond tohigher true positive rates in that region of parameter space.Each pixel in the color coded parameter space correspondsto at least 3 clusters from our sample. For reference, the hor-izontal dashed line indicates the mass threshold for clustersin cutouts that we labeled as “cluster-containing”.If we look at the performance of the methods in theVirial Mass vs. Redshift space, we can see that the CNNpositively identiﬁes more of the clusters at low redshifts whosemasses lie just above the mass cut than the MF identiﬁes(i.e. 2 × M (cid:12) (cid:46) M vir (cid:46) × M (cid:12) and z (cid:46) . × M (cid:12) (cid:46) M vir (cid:46) × M (cid:12) and z (cid:38) MNRAS000 , 000–000 (2021) Lin et al. V i r i a l R a d i u s ( M p c ) tSZ (arcmin^2) R e l a ti v e D ec (a) (Left) The virial radius as a function of redshift colorcoded by their tSZ of 88 clusters with a high MF score, but lowranked score from the CNN. (Right) Position of the clusters in the cutout shown on the right. These clusters span the massand redshift range, but tend to sit close to the edges of the cutouts that the CNN evaluated. V i r i a l R a d i u s ( M p c ) tSZ (arcmin^2) R e l a ti v e D ec (b) (Left) The virial radius as a function of redshift color-coded by their tSZ of 73 clusters with a high CNN score, but lowMF S/N ratio. (Right) Position of the clusters in the cutout shown on the right. V i r i a l R a d i u s ( M p c ) tSZ (arcmin^2) R e l a ti v e D ec A v e r a g e C NN S c o r e MeanQuantile 0.10Quantile 0.90 (c) To investigate whether an edge eﬀect is indeed the main driver of the low prediction probability on these clusters, we randomlyregenerated cutouts with these clusters located at varying distances from the center of the cutouts. On average, the prediction value givenby CNN is a decreasing function of the distance between the center of the cluster and the center of the cutout. We ﬁrst divide the 88clusters in 9a into 11 groups. For each group, we randomly regenerate cutouts for these clusters such that the cluster has diﬀerent distancefrom and angle relative to the cutout center (Middle), and then compute the curve of prediction score as a function of distance from centerfor these randomly generated cutouts. Finally, we plot the mean, 10th percentile and 90th percentile (over the 11 groups) of these curves .

Figure 9.

MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning Mass0.00.20.40.60.81.0 C o m p l e t e n e ss ( R eca ll ) C oun t s o f O b j ec t s CNNCNN (redshift>0.25)MFMF (redshift>0.25)EnsemblePRODEnsemblePROD (redshift>0.25)10 Mass ( )0.60.81.0

CNN/EnsemblePRODCNN/EnsemblePROD (redshift>0.25)MF/EnsemblePRODMF/EnsemblePROD (redshift>0.25)

Figure 10.

Top panel: Blue histogram shows underlying mass function of the simulated galaxy cluster sample in the cutouts (Countsof Objects vs. Mass). Lines show the completeness curves for each method, with the dashed lines corresponding to the completenesscurve for objects that satisfy our redshift threshold of z > .

25 used for training. The purple dashed lines correspond to the MF, blackdashed the CNN, and blue dashed the EnsemblePROD. We can see that the EnsemblePROD completeness curve sits above both methodsuntil ∼ × M (cid:12) , where there are few objects. The decrease of the CNN curve at the highest masses could be due to the decreasedsample size for training in the corresponding mass range. Bottom panel: Ratio of CNN (or MF) completeness curve to the correspondingEnsemblePROD completeness curve with the same line style and color as the top panel. We also include solid lines for the full samplewithout the redshift threshold, but since the impact of the redshift threshold is small, they are very similar. C o m p l e t e n e ss ( R eca ll ) C oun t s o f O b j ec t s CNN (mass>2e14)MF (mass>2e14)EnsemblePROD (mass>2e14)

Figure 11.

Blue histogram shows the underlying number of galaxy clusters in each redshift bin that satisfy our mass threshold of M halo > × M (cid:12) used for training (Counts of Objects vs. Redshift). Lines show the completeness curves for each method on thesubset of objects over the same mass threshold. The EnsemblePROD completeness curve sits above the other methods until z ≈ .

9, atwhich point it is comparable to MF.MNRAS000

9, atwhich point it is comparable to MF.MNRAS000 , 000–000 (2021) Lin et al.

In this paper, we compare the performance of a matched ﬁlter(MF) method and a convolutional neural network (CNN) inthe task of identifying galaxy clusters in mock millimetermaps of the cosmic microwave background. For the neuralnetwork, we use a modiﬁed version of ResNet, a architectureused in popular image classiﬁcation. We use simulated mi-crowave maps at 90, 148, and 219 GHz channels with addedobservational components to train and test our CNN. Wealso use the F1 score (see section 3.2), a quantity that ac-counts for both precision (purity) and recall (completeness),to compare method performance and to deﬁne an identiﬁca- tion procedure that combines both the MF and CNN. Weﬁnd the following: • At the selected redshift and mass thresholds, the CNNdoes comparably to the MF (see Table 1). The precision(purity) of the CNN is slightly lower, but the the recall(completeness) slightly higher. • The CNN achieved comparable performance in the ab-sence of standard image pre-processing, e.g. normalization,point source subtraction, etc. • A cluster identiﬁcation procedure that combines boththe MF and CNN scores signiﬁcantly improves performance(e.g. see Figures 10, 11 and 12), indicating complementaritybetween the methods. • We note that the cutout nature of thetrain/test/validation dataset for the CNN impactsthe model performance; clusters further from the cutoutcenter are more diﬃcult for the CNN to identify (seesection 3.4.2). An algorithm that minimizes the distancebetween the cluster center and the cutout center wouldfurther improve the CNN performance.We note that some cutouts contributing to false positiveidentiﬁcation are cutouts that contain clusters just below themass threshold of M vir = 2 × that we label as “cluster-containing”. The cluster-ﬁnding task diﬀers from most otherstandard classiﬁcation applications in that galaxy clustersmay be deﬁned on a continuum; the separation betweengalaxy clusters and galaxy groups is largely deﬁnition-based.Admittedly, galaxy clusters would be ideal objects to applyregression methods that predict continuous values of clusterparameters. We leave this to a follow-up paper.We also emphasize the use of the F1 score as a mecha-nism for apples-to-apples cluster-ﬁnding method comparisonsand method combinations. The F1 score plays a distinct roleof enabling a performance comparison between the two meth-ods considered. The purity and completeness of a methoddepends on a threshold for positive identiﬁcation. Shifts inthat threshold for a given method will change the quotedpurity and completeness. We therefore choose a threshold foreach method that maximizes the F1 score in that method.In other words, our cluster-ﬁnding comparison compares thebest version of the CNN and MF to one another, using the F1score as a metric for “best”. We additionally use the F1 scoreas a metric that allows us to combine the two cluster-ﬁndingmethods to further improve performance. We present a usecase of the F1 score as a comparison metric and combina-tion mechanism that can be used as a template for othercluster-ﬁnding comparisons and combinations.Finally, we comment on the complementarity of MFand CNN. MF inherently relies on an understanding of theexpected signal and noise in the ﬁlter deﬁnition. The CNNrelies on an understanding of the data when generating thetraining data set, but does not rely on any assumptions inthe network architecture nor did it require that the dataundergo preprocessing steps such as point source removal.Another distinguishing aspect between the two is that the MFhas an explicit fully analytic formulation as a discriminativemodel, while the CNN has a quasi-analytic (semi-parametric)formulation via the simulation data used to train the model. Acommon criticism of machine learning methods, particularlyneural networks, is in the lack of interpretability. While theMF method has physically motivated, or a priori physically MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning denoted, features the method is designed to detect, the CNNpicks up on a non-linear combination of features that do notnecessarily have obvious physical motivation. Last, we notethat the MF method had taken human time to calibrate.While use of the CNN did not require signiﬁcant calibration,the human time went into developing the CNN architectureand training the model. MNRAS000

We plot the distribution of cluster properties and color code by true positive identiﬁcation of each method. The left columncorresponds to true positive identiﬁcation by our CNN, center by the rank product, and right by MF. The black dashed line illustratesour chosen mass cut for what we labeled as true positives for the training set in our CNN. The color coding of % Positive colors thepercent of clusters (cutout containing clusters) in that cluster property bin that was positively identiﬁed by each method. The propertybins have at least 3 clusters from which we calculate a percentage. One key feature to note is that the CNN method has a relativelystronger mass-limited selection function that is driven by our mass cut in labeling true positives. This selection does not strongly dependon redshift, virial radius, or angular size. On the other hand, the matched ﬁlter preferentially picks out lower mass clusters that are morecompact, which are those at higher redshifts. The combined method, using the rank product, also has a selection function more alignedwith our chosen mass cut.

Author Contributions • Lin : led the design and experiments of the standaloneNN and ensemble methods, prepared the cutouts from rawdata after appropriate checks, performed the evaluation anddiagnostic analysis for MF, NN and ensemble models, con-tributed to manuscript writing, and kept the work goingthrough challenging phases. • Huang : optimized and ran the matched ﬁlter, producinga catalog for comparison, and contributed to manuscriptwriting. • Avestruz : contributed to the initial project conception,project management, manuscript writing, and student men-torship, including guidance on analysis direction. • Wu : generated simulations for testing and validation;did initial classiﬁcation; contributed to manuscript writing. • Trivedi : prototyped initial architecture and experiments, provided guidance on subsequent deep learning componentsof the project. • Caldeira : contributed to comparisons between results ofthe two methods and manuscript writing. • Nord : developed the initial project conception, helpedassemble the research team, contributed to technical develop-ment, project management, manuscript writing, and studentmentorship, including guidance on analysis direction.We would like to thank Arya Farahi for discussionsthat helped improve this manuscript. We also thank PhilMansﬁeld for help on manuscript presentation at the endof the project. CA would like to thank support from theLSA Collegiate Fellows program and the Leinweber Centerfor Theoretical Physics at the University of Michigan. Thisproject was supported in part by NSF-AAG awards AST-200994 and AST-2009121. During part of the project, STwas supported by by the National Science Foundation under

MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning Grant No. DMS-1439786 while he was in residence at theInstitute for Computational and Experimental Research inMathematics in Providence, RI, during the non-linear algebraprogram.We acknowledge the https://deepskieslab.com as a com-munity of multi-domain experts and collaborators whohave facilitated an environment of open discussion, idea-generation, and collaboration. This community was impor-tant for the development of this project. We would also liketo extend our thanks to the SPT collaboration for providingthe matched ﬁlter implementation.This manuscript has been authored by Fermi ResearchAlliance, LLC under Contract No. DE-AC02-07CH11359with the U.S. Department of Energy, Oﬃce of Science, Oﬃceof High Energy Physics. This material is based upon worksupported by the National Science Foundation GraduateResearch Fellowship under Grant No. DGE 1752814.

MNRAS000

MNRAS000 , 000–000 (2021) Lin et al.

REFERENCES

Abazajian, K.N., Adshead, P., Ahmed, Z., et al., 2016.CMB-S4 Science Book, First Edition. arXiv e-prints ,arXiv:1610.02743 arXiv:1610.02743 .Allen, S.W., Evrard, A.E., Mantz, A.B., 2011. CosmologicalParameters from Observations of Galaxy Clusters. ARA&A49, 409–470. doi:doi:10.1146/annurev-astro-081710-102514, arXiv:1103.4829 .Benson, B.A., Ade, P.A.R., Ahmed, Z., et al., 2014. SPT-3G: anext-generation cosmic microwave background polarizationexperiment on the South Pole telescope, in: Millimeter, Sub-millimeter, and Far-Infrared Detectors and Instrumentationfor Astronomy VII, p. 91531P. doi:doi:10.1117/12.2057305, arXiv:1407.2973 .Bleem, L.E., Stalder, B., de Haan, T., et al., 2015. GalaxyClusters Discovered via the Sunyaev-Zel’dovich Eﬀect inthe 2500-Square-Degree SPT-SZ Survey. ApJS 216, 27.doi:doi:10.1088/0067-0049/216/2/27, arXiv:1409.0850 .Bocquet, S., Dietrich, J.P., Schrabback, T., et al., 2019. ClusterCosmology Constraints from the 2500 deg SPT-SZ Survey:Inclusion of Weak Gravitational Lensing Data from Magellanand the Hubble Space Telescope. Astrophysical Journal 878,55. doi:doi:10.3847/1538-4357/ab1f10, arXiv:1812.01679 .Bonjean, V., 2020. Deep learning for Sunyaev-Zel’dovich de-tection in Planck. A&A 634, A81. doi:doi:10.1051/0004-6361/201936919, arXiv:1911.10778 .Bulbul, E., Chiu, I.N., Mohr, J.J., et al., 2019. X-Ray Proper-ties of SPT-selected Galaxy Clusters at 0.2 < z < 1.5Observed with XMM-Newton. Astrophysical Journal 871, 50.doi:doi:10.3847/1538-4357/aaf230, arXiv:1807.02556 .Caldeira, J., Wu, W.L.K., Nord, B., et al., 2019. DeepCMB: Lens-ing reconstruction of the cosmic microwave background withdeep neural networks. Astronomy and Computing 28, 100307.doi:doi:10.1016/j.ascom.2019.100307, arXiv:1810.01483 .Carlstrom, J.E., Holder, G.P., Reese, E.D., 2002. Cosmologywith the Sunyaev-Zel’dovich Eﬀect. ARA&A 40, 643–680.doi:doi:10.1146/annurev.astro.40.060401.093803, arXiv:astro-ph/0208192 .Cavaliere, A., Fusco-Femiano, R., 1976. X-rays from hot plasmain clusters of galaxies. A&A 49, 137.Costanzi, M., Rozo, E., Simet, M., et al., 2019. Methods forcluster cosmology and application to the SDSS in prepa-ration for DES Year 1 release. MNRAS 488, 4779–4800.doi:doi:10.1093/mnras/stz1949.Green, S.B., Ntampaka, M., Nagai, D., et al., 2019. UsingX-ray morphological parameters to strengthen galaxy clus-ter mass estimates via machine learning. arXiv e-prints ,arXiv:1908.02765 arXiv:1908.02765 .Gupta, N., Reichardt, C.L., 2020. Mass Estimation of GalaxyClusters with Deep Learning II: CMB Cluster Lensing. arXive-prints , arXiv:2005.13985 arXiv:2005.13985 .He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learningfor image recognition, in: 2016 IEEE Conference on ComputerVision and Pattern Recognition, CVPR 2016, Las Vegas, NV,USA, June 27-30, 2016, pp. 770–778. URL: https://doi.org/10.1109/CVPR.2016.90 , doi:doi:10.1109/CVPR.2016.90.Hilton, M., Hasselﬁeld, M., Sif´on, C., et al., 2018. The Ata-cama Cosmology Telescope: The Two-season ACTPol Sunyaev-Zel’dovich Eﬀect Selected Cluster Catalog. ApJS 235, 20.doi:doi:10.3847/1538-4365/aaa6cb, arXiv:1709.05600 .Ho, M., Rau, M.M., Ntampaka, M., et al., 2019. A Ro-bust and Eﬃcient Deep Learning Method for DynamicalMass Measurements of Galaxy Clusters. arXiv e-prints ,arXiv:1902.05950 arXiv:1902.05950 .Hortua, H.J., Volpi, R., Marinelli, D., Malag`o, L., 2019.Parameters Estimation for the Cosmic Microwave Back-ground with Bayesian Neural Networks. arXiv e-prints , arXiv:1911.08508 arXiv:1911.08508 .Huang, N., Bleem, L.E., Stalder, B., et al., 2019. GalaxyClusters Selected via the Sunyaev-Zel’dovich Eﬀect inthe SPTpol 100-Square-Degree Survey. arXiv e-prints ,arXiv:1907.09621 arXiv:1907.09621 .Hurier, G., Aghanim, N., Douspis, M., 2017. MILCANN : A neuralnetwork assessed tSZ map for galaxy cluster detection. arXive-prints , arXiv:1702.00075 arXiv:1702.00075 .Keisler, R., Reichardt, C.L., Aird, K.A., et al., 2011. A Mea-surement of the Damping Tail of the Cosmic Microwave Back-ground Power Spectrum with t he South Pole Telescope. Astro-physical Journal 743, 28. doi:doi:10.1088/0004-637X/743/1/28, arXiv:1105.3182 .Komatsu, E., Smith, K.M., Dunkley, J., et al., 2011. Seven-year Wilkinson Microwave Anisotropy Probe (WMAP) Ob-servations: Cosmological Interpre tation. ApJS 192, 18–+.doi:doi:10.1088/0067-0049/192/2/18, arXiv:1001.4538 .Krachmalnicoﬀ, N., Tomasi, M., 2019. Convolutional neuralnetworks on the HEALPix sphere: a pixel-based algorithmand its application to CMB data analysis. A&A 628, A129.doi:doi:10.1051/0004-6361/201935211, arXiv:1902.04083 .LeCun, Y., Bottou, L., Bengio, Y., Haﬀner, P., 1998.Grandient-based learning applied to document recogni-tion, in: Proceedings of IEEE 86 (11) (1998), pp. 2278–2324. URL: http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf .Lewis, A., Challinor, A., Lasenby, A., 2000. Eﬃcient Computa-tion of Cosmic Microwave Background Anisotropies in ClosedFriedmann-Robe rtson-Walker Models. Astrophysical Journal538, 473–476. doi:doi:10.1086/309179.McDonald, M., Allen, S.W., Bayliss, M., et al., 2017. The Re-markable Similarity of Massive Galaxy Clusters from z 0 to1.9. Astrophysical Journal 843, 28. doi:doi:10.3847/1538-4357/aa7740, arXiv:1702.05094 .Melin, J.B., Aghanim, N., Bartelmann, M., et al., 2012. A com-parison of algorithms for the construction of SZ cluster cata-logues. A&A 548, A51. doi:doi:10.1051/0004-6361/201015689, arXiv:1210.1416 .Melin, J.B., Bartlett, J.G., Delabrouille, J., 2006. Catalog ex-traction in SZ cluster surveys: a matched ﬁlter approach.A&A 459, 341–352. doi:doi:10.1051/0004-6361:20065034, arXiv:arXiv:astro-ph/0602424 .Ntampaka, M., Avestruz, C., Boada, S., et al., 2019a. The Role ofMachine Learning in the Next Decade of Cosmology. BAAS51, 14. arXiv:1902.10159 .Ntampaka, M., ZuHone, J., Eisenstein, D., et al., 2019b. A DeepLearning Approach to Galaxy Cluster X-Ray Masses. Astro-physical Journal 876, 82. doi:doi:10.3847/1538-4357/ab14eb, arXiv:1810.07703 .Planck Collaboration, Ade, P.A.R., Aghanim, N., et al., 2016a.Planck 2015 results. XXVII. The second Planck cata-logue of Sunyaev-Zeldovich sources. A&A 594, A27.doi:doi:10.1051/0004-6361/201525823, arXiv:1502.01598 .Planck Collaboration, Ade, P.A.R., Aghanim, N., et al., 2016b.Planck 2015 results. XXIV. Cosmology from Sunyaev-Zeldovich cluster counts. A&A 594, A24. doi:doi:10.1051/0004-6361/201525833, arXiv:1502.01597 .Platt, J.C., 1999. Probabilistic outputs for support vector ma-chines and comparisons to regularized likelihood methods, in:ADVANCES IN LARGE MARGIN CLASSIFIERS, MIT Press.pp. 61–74.Sehgal, N., Bode, P., Das, S., et al., 2010. Simulations ofthe Microwave Sky. Astrophysical Journal 709, 920–936.doi:doi:10.1088/0004-637X/709/2/920, arXiv:0908.0540 .Shirokoﬀ, E., Reichardt, C.L., Shaw, L., et al., 2011. ImprovedConstraints on Cosmic Microwave Background SecondaryAnisotropies from the Comple te 2008 South Pole TelescopeData. Astrophysical Journal 736, 61–+. doi:doi:10.1088/0004-MNRAS , 000–000 (2021) eepSZ: Identiﬁcation of Sunyaev-Zel’dovich Galaxy Clusters using Deep Learning arXiv:1012.4788 .Strazzullo, V., Pannella, M., Mohr, J.J., et al., 2019. Galaxy popu-lations in the most distant SPT-SZ clusters. I. Environmentalquenching in massive clusters at 1.4 (cid:46) z (cid:46) arXiv:1807.09768 .Vanderlinde, K., Crawford, T.M., de Haan, T., et al., 2010. GalaxyClusters Selected with the Sunyaev-Zel’dovich Eﬀect from2008 South Pole Telescope Observations. Astrophysical Jour-nal 722, 1180–1196. doi:doi:10.1088/0004-637X/722/2/1180, arXiv:1003.0003 .Vikhlinin, A., Kravtsov, A.V., Burenin, R.A., et al., 2009. Chan-dra Cluster Cosmology Project III: Cosmological Param-eter Constraints. Astrophysical Journal 692, 1060–1074.doi:doi:10.1088/0004-637X/692/2/1060, arXiv:0812.2720 .Zohren, H., Schrabback, T., van der Burg, R.F.J., et al., 2019.Optical follow-up study of 32 high-redshift galaxy cluster can-didates from Planck with the William Herschel Telescope.MNRAS 488, 2523–2542. doi:doi:10.1093/mnras/stz1838, arXiv:1906.08174 .MNRAS000