Catalog-free modeling of galaxy types in deep images: Massive dimensional reduction with neural networks
Florian Livet, Tom Charnock, Damien Le Borgne, Valérie de Lapparent
Astronomy & Astrophysics manuscript no. main, ©ESO 2021, February 3, 2021
F. Livet⋆, T. Charnock⋆⋆, D. Le Borgne⋆⋆⋆, V. de Lapparent⋆⋆⋆⋆
Institut d’Astrophysique de Paris, Sorbonne Université, CNRS, UMR 7095, 98 bis boulevard Arago, 75014 Paris, France
ABSTRACT
Context.
Current models of galaxy evolution are constrained by the analysis of catalogs containing the flux and size of galaxies extracted from multiband deep fields. However, these catalogs contain inevitable observational and extraction-related biases which can be highly correlated. In practice, taking all of these effects into account simultaneously is difficult, and the derived models are therefore inevitably biased too.
Aims.
To address this issue, we use robust likelihood-free methods for the inference of luminosity function parameters, made possible via massive compression of multiband images using artificial neural networks. This technique makes the use of catalogs unnecessary when comparing observed and simulated multiband deep fields and constraining model parameters. Because of the efficient data compression, the method does not suffer from the binning of the observables that is inherent to the use of catalogs.
Methods.
A forward modeling approach generates galaxies of multiple types depending on luminosity function parameters and paints them on photometric multiband deep fields, including both the instrumental and observational characteristics. The simulated and the observed images present the same selection effects and can therefore be properly compared. We train a fully convolutional neural network to extract the most model-parameter-sensitive summary statistics out of these realistic simulations, shrinking the dimensionality of the summary space down to the number of parameters in the model. Finally, using the trained network to compress both observed and simulated deep fields, the model parameter values are constrained through Population Monte Carlo likelihood-free inference.
Results.
Using synthetic photometric multiband deep fields similar to the CFHTLS and WIRDS D1/D2 deep fields and massively compressing them through the convolutional neural network, we demonstrate the robustness, accuracy, and consistency of this new catalog-free inference method. We are able to constrain the parameters of the luminosity functions of different types of galaxies, and our results are fully compatible with the classic catalog extraction approaches.
Key words.
Methods: numerical / Methods: statistical / Galaxies: elliptical and lenticular, cD / Galaxies: spiral / Galaxies: evolution / Galaxies: luminosity function, mass function
1. Introduction
Traditionally, the study of galaxy evolution is based on the analysis of large sets of photometric surveys with long exposure times and a wide range of bands. These data are used with the goal of answering specific questions about galaxy evolution: what are the star-formation histories, the accretion histories, the morphological transformations, and the merging rates for galaxies of various types and masses, and what is the role of environment in these various characteristics?
In a classic approach, one extracts photometric catalogs of galaxies from the images and uses the catalogs to model how their luminosities, colors, and sizes jointly evolve with redshift. However, catalogs suffer from a variety of incompleteness effects. Rigorous studies based on flux-limited samples can be performed, but they are not optimal: by construction, in flux-limited catalogs the very faint galaxies near the background noise are ignored, which induces a Malmquist bias if this effect is not dealt with properly (Malmquist 1922, 1925). Moreover, some galaxies can overlap, and confusion is often a limiting effect at low luminosities (Condon 1974), mainly at long wavelengths. Poor segmentation, morphological biases, or even the lack of detection of galaxies can be a consequence of cosmological dimming, which decreases the surface brightness of an object by a factor (1 + z)⁻⁴, where z is the redshift (Tolman 1934; Calvi et al. 2014). Another statistical selection effect is the Eddington bias (Eddington 1913): at a given apparent flux, the number of fainter galaxies with a flux overestimation is larger than the number of brighter galaxies with a flux underestimation. The stellar contamination of deep surveys (Pearson et al.
2014) is another limitation of faint-galaxy catalogs, because some stars can be misidentified as galaxies and contaminate the results. Finally, redshift affects the luminosities and colors of objects, and a K-correction must be applied to correct for such effects (Hogg et al. 2002, 2003).
Moreover, these various selection effects are correlated in subtle ways that are very difficult to express analytically, and some may be spatially variable within a survey (Bernardi et al. 2017, Appendix B). As a result, the classic approach of extracting catalogs leads to a complex inverse problem of trying to correct for the effects of the correlated biases of these catalogs (Marzke 1998; Taghizadeh-Popp et al. 2015). If this correction is not successful, the models derived from these catalogs tend to be biased in a non-trivial way.
In order to avoid solving the hard inverse problem of model inference from catalogs, one may use forward modeling coupled with Approximate Bayesian Computation (ABC). This technique has been used in many fields and for various applications, for example in cosmology (Akeret et al. 2015; Kacprzak et al. 2018), in Initial Mass Function modeling (Cisewski-Kehe et al. 2019), to study type Ia supernovae (Weyant et al. 2013), and in galaxy evolution modeling (Carassou et al. 2017; Tortorelli et al. 2020). In the specific case of galaxy evolution, the forward modeling approach simulates realistic deep multiband images and compares them to observed data in order to constrain the parameters of the model.
Placing constraints on the parameters of galaxy population models is difficult when extracting catalogs from deep surveys, which contain a huge number of galaxies with various fluxes, shapes, and sizes. There are a number of source extraction packages in the literature, for example Source Extractor (Bertin & Arnouts 1996, 2010). In a recent paper, Carassou et al.
(2017) developed a method for binning extracted catalogs in fluxes and sizes in order to infer the parameters of the luminosity functions used in their model. There, they successfully measured the evolution parameters of one- and two-galaxy-type luminosity functions. However, the binning of the apparent magnitude distributions had to be limited to 10 intervals per band (among 8 bands), which nevertheless meant that the study took place in a very sparsely populated multi-dimensional space. Moreover, catalogs are affected by a number of extraction biases: the model PSF, the type of magnitude used to perform the extraction (isophotal, aperture, or model), the determination of the sky background, the segmentation process, and the confusion of sources. Consequently, the analytical form of the likelihood is unknown for the parameters of the luminosity functions of the various populations of galaxies, and the information from such deep surveys can be lost by compressing them down to catalogs. To circumvent all of these problems simultaneously, we describe here a novel method for massive data compression using neural networks, which enables the observation/model comparison to be made directly on the multi-wavelength images, therefore completely bypassing the biases induced by source extraction. We use the Information Maximising Neural Network (IMNN) algorithm (Charnock et al. 2018), which fits a neural network sensitive only to the effects of the model parameters on the simulations obtained from our forward model. We implement this methodology for the first time on deep and large multi-wavelength images of galaxies: the 1 deg² Canada-France-Hawaii Telescope Legacy Survey (CFHTLS) D1 deep field, observed both in the optical, using the MegaPrime instrument in the u′, g′, r′, i′, z′ filters, and in the near-infrared, using the WIRCam instrument in the J, H, Ks filters. We used the final releases of the TERAPIX-processed data for each survey, T0007 for CFHTLS (Hudelot et al. 2012) and T0002 for WIRDS (Bielby et al.
2012); the WIRDS images are re-binned to match the 0.186 arcsec/pixel of the optical images. An RGB image of part of the D1 deep field in the optical is shown in Fig. 1, which highlights the complexity of the data available for studying the various populations of galaxies. There is a huge wealth of information, distributed in such a way that it is impossible to construct a likelihood describing the probability of obtaining such a field.
The paper is divided as follows: Sect. 2 introduces the forward model used to create simulated multiband deep fields and discusses its features: luminosity function parameterization, spectral energy distributions (SED) of a bulge+disk decomposition, internal dust extinction, stars, reddening, etc. In Sect. 3 and Sect. 4 we present the compression of deep fields using the IMNN and discuss its characteristics: Fisher information, summary statistics, Gaussian and non-Gaussian likelihoods, and quasi-maximum likelihood estimators. This includes the description of the neural network architecture and the loss function we used to fit the network. In Sect. 5, we describe Approximate Bayesian Computation (ABC) and the Population Monte Carlo (PMC) algorithm in the context of compression via neural networks. Sect. 6 presents a toy application with only two parameters (φ* and M*) of the luminosity function of one spiral population. Sect. 7 demonstrates a more realistic application to constrain the density parameter φ* of both elliptical and spiral galaxies on parts of the observed CFHTLS and WIRDS D1 deep field in 8 photometric bands, as well as validating the methods on simulated data. In this paper, a ΛCDM cosmology is adopted with Ω_Λ = 0.7, Ω_M = 0.3 and H₀ = 70 km s⁻¹ Mpc⁻¹.
2. The forward model
In this section we describe, in detail, the forward model that we use to produce mock images. These images are then used to train the IMNN (Sect. 4.6) and also to perform the approximate Bayesian inference (Sect. 5).
We create mock photometric catalogs of galaxies with a code based on the
Stuff program (Bertin 2010), rewritten from C to Python by F. Livet and now accounting for: (i) distinct extinction in the bulge and disk components; (ii) suppression of sample variance, allowing one to generate images with variations in the luminosity function parameters by only adding or suppressing galaxies depending on the sign of the change in the luminosity function at each absolute magnitude. This results in common galaxies at the same coordinates and with identical bulge and disk parameters (see Sect. 6.2). The aim of this code is to produce realistic photometric multi-wavelength catalogs of the properties of galaxies from various populations. Each galaxy of these mock catalogs is decomposed as the sum of a bulge and a disk with the following properties: its apparent magnitudes in each observational band (the SDSS u′, g′, r′, i′, z′ and the near-infrared J, H, Ks, used for the CFHTLS and WIRDS surveys respectively), its apparent bulge-to-total light ratio in the g′ band used as reference (defining its Hubble type), the characteristic sizes of its bulge and disk, its disk inclination (which is used to determine its internal extinction), and its redshift. The code uses a set of galaxy types, each defined by a specific luminosity function, an intrinsic bulge-to-total luminosity ratio B/T, the spectral energy distributions (SED) of the bulge and disk, bulge and disk scale lengths satisfying (with some dispersion) the observed scaling laws, and extinction laws. The galaxies of each type are then generated with Poisson draws in small redshift bins of fixed comoving width (in h⁻¹ Mpc, where h = H₀/100 is the reduced Hubble constant) from z = 0 to z = 20.
Each of our modeled galaxy populations is described by its luminosity function, that is, the number of galaxies per unit comoving volume of the Universe and per magnitude interval, as a function of absolute magnitude M. This probability distribution function is commonly fitted by a Schechter profile (Schechter 1976):

Φ(M) = 0.4 ln(10) φ* [10^{0.4(M* − M)}]^{α+1} exp[−10^{0.4(M* − M)}]   (1)
Fig. 1. A small region of the 1 deg² CFHTLS D1 deep field in RGB (using the i′, r′, g′ filters). This shows the diversity of sizes, shapes, and colors which we can use to perform a deep statistical study of the various populations of galaxies. The image reveals the complexity of the distribution of objects and the huge wealth of information that is available in such a field. Note that this complexity presents a barrier in the methods which we use to constrain the parameters of the luminosity functions for each population of galaxies.

The Schechter profile has three parameters: φ*, the normalization density (in Mpc⁻³ mag⁻¹); α, the faint-end slope (e.g. for α < −1, the density of faint galaxies increases with magnitude); and M*, a characteristic absolute magnitude. The luminosity function is redshift-dependent (Lilly et al. 1995), thus two other parameters are introduced, φ*_evol and M*_evol, defined as

M* = M*₀ + M*_evol log(1 + z)   (2)

φ* = φ*₀ 10^{φ*_evol log(1 + z)}   (3)

where z is the redshift at which the luminosity function is measured.
It has been shown that the luminosity function may evolve differently for different populations of galaxies (Zucca et al. 2006), so these five parameters may be different for each galaxy population. In Sects. 6 and 7 we will focus on constraining M* and φ* for spirals, and φ*₀ for spirals and ellipticals.
The SEDs used to model the diversity of galaxies are sampled from the template library of Coleman et al. (1980): "E" and "Irr". These two SEDs are assigned to the bulge and the disk of a galaxy, respectively, for a given morphological type, in variable proportions (using B/T), which leads to a large range of colours for the individual galaxies; an example is shown in Fig. 2. In the current implementation we keep the SEDs and the B/T ratios fixed with redshift for all galaxy types. However, the evolution of the SEDs of individual galaxies is allowed through the evolution of the type-dependent luminosity functions and the different extinction effects (see Sect.
2.6). The de Vaucouleurs profile (de Vaucouleurs 1953) describes how the surface brightness I_B (in cd m⁻²) of a bulge/elliptical varies as a function of the apparent distance R (in kpc) from the center:

I_B(R) = I_e exp[−7.67 ((R/R_e)^{1/4} − 1)]   (4)

where R_e (in kpc) is the half-light radius or effective radius (i.e. the isophote containing half of the total luminosity of the bulge), and I_e (in cd m⁻²) is the surface brightness at R_e. The same profile can be written in terms of magnitude as

μ_B(R) = M_B + 8.33 (R/R_e)^{1/4} + 5 log(R_e) + const.   (5)

where μ_B(R) is the bulge surface brightness (in mag kpc⁻²) at radius R, M_B is the bulge absolute magnitude, and the additive constant sets the zero point.
It has been shown that the average effective radius follows an empirical relation with the absolute magnitude of the bulge (Binggeli et al. 1984):

⟨R_e⟩ = R₀ − p (M_B − M₀)   (6)

where R₀ (in h⁻¹ kpc) and M₀ ≈ −20 are constants, and the slope p takes one value for bright bulges and another for faint ones:

p = { p_bright, if M_B < M₀ ; p_faint, otherwise }   (7)
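To make the light profiles concrete, here is a minimal numpy sketch of the de Vaucouleurs bulge profile of Eq. (4), together with the exponential disk profile used for disks in the next subsection (Eq. 9). The function and argument names are ours:

```python
import numpy as np

def de_vaucouleurs(R, I_e, R_e):
    # Bulge surface brightness (Eq. 4); by construction I(R_e) = I_e,
    # and half of the total bulge light falls within R_e.
    return I_e * np.exp(-7.67 * ((R / R_e) ** 0.25 - 1.0))

def exponential_disk(R, I_0, h_D):
    # Disk surface brightness (Eq. 9); I(0) = I_0, scale length h_D.
    return I_0 * np.exp(-R / h_D)
```

The de Vaucouleurs profile is much more centrally concentrated: between R = R_e and R = 2 R_e it falls by a factor exp[7.67(2^{1/4} − 1)] ≈ 4.3, versus e ≈ 2.7 for a disk with h_D = R_e.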
We allow the effective bulge radius to evolve with redshift by a (1 + z)^{γ_B} factor, where γ_B is a constant (Trujillo et al. 2006; Williams et al. 2010). Furthermore, it has been shown that the intrinsic flattening q of the bulge follows a normal distribution (Sandage et al. 1970). Finally, the absolute magnitude of the bulge M_B is related to the total absolute magnitude of the galaxy M through the relation

M_B = M − 2.5 log(B/T)   (8)

where B/T is the bulge-to-total light ratio, which is assumed not to evolve for galaxies of a given type.
The exponential profile describes how the surface brightness I_D (in cd m⁻²) of a disk varies as a function of the apparent distance R (in kpc) from the center:

I_D(R) = I₀ exp(−R/h_D)   (9)

where h_D (in kpc) is the disk scale length and I₀ (in cd m⁻²) is the surface brightness at the center. As for the bulge, the same profile can be written in terms of magnitude as

μ_D(R) = M_D + 1.086 (R/h_D) + 5 log(h_D) + const.   (10)

where μ_D(R) is the disk surface brightness (in mag kpc⁻²) at radius R and M_D = M − 2.5 log(1 − B/T) is the disk absolute magnitude. A log-normal relation linking the disk scale length to its absolute magnitude can be fitted (de Jong & Lacey 2000):

h_D = h₀ exp[β M_D + N(0, σ)]   (11)

where h₀ (in h⁻¹ kpc) and β are constants, and N(0, σ) is a random number following a normal distribution with zero mean and standard deviation σ. We allow the disk scale length to evolve with redshift by a (1 + z)^{γ_D} factor, where γ_D is a constant (Trujillo et al. 2006; Williams et al. 2010).
The internal extinction by dust is applied separately for the bulge and the disk using the interstellar extinction curve (Fitzpatrick & Massa 2007) of the Milky Way in the wavelength range from the UV to the near infrared:
SED(λ) = SED₀(λ) exp[−κ(i, ω, λ) τ(λ)]   (12)

where SED(λ) is the extincted SED of the bulge/disk, SED₀(λ) is the face-on non-extincted SED of the bulge/disk, τ(λ) is the extinction curve, and κ(i, ω, λ) is a coefficient depending on the wavelength λ, the inclination i, and the total central opacity ω of the disk.

Fig. 2. Example of the evolution of the bulge and disk SEDs from the far-UV to the near-IR. The solid red curve shows the generic E template SED of an elliptical galaxy of Coleman et al. (1980) at redshift z = 0 and with no extinction. The dashed red curve shows the same SED at redshift z = 0.5 and with extinction (inclination i = 45°, see Sect. 2.6). The solid blue curve shows the generic Irr template SED of Coleman et al. (1980), used here to model the disks of spirals, at redshift z = 0 and with no extinction. The dashed blue curve shows the same SED at redshift z = 0.65 and with extinction (i = 45°). This shows that even if these SEDs do not evolve explicitly, the evolution of the type-dependent luminosity functions and the different extinction effects implicitly allow for an SED evolution of individual galaxies. The solid curves are normalized so that the integral of the template SED multiplied by the response of the reference SDSS g′ band equals 1. The 8 filters used in this paper (u′, g′, r′, i′, z′, J, H, Ks) are shown at the bottom, with their responses rescaled for display.

In order to determine the value of κ(i, ω, λ), we use the attenuation-inclination relations of Popescu et al. (2011) to compute the differences between extincted and non-extincted magnitudes for both the bulge and the disk. Then, we interpolate these relations to obtain the values of κ(i, ω, λ) for a random inclination i between 0 and π/2 rad. To determine the total central opacity ω of the galaxy, we use the values obtained from the Markov Chain Monte Carlo (MCMC) analysis of Chevallard et al. (2013) on the SDSS Data Release 7 of Abazajian et al. (2009):

ω = { ω_discs, for bulgeless galaxies ; ω_bulges, for galaxies with bulges }   (13)

In our applications of Sects. 6 and 7, we only use the value for galaxies with bulges, because we consider elliptical and/or spiral populations of galaxies and we model both populations with a bulge (B/T = 1 for the elliptical galaxies and a smaller, fixed B/T for the spiral galaxies). Finally, in each redshift bin, we extinct the magnitudes of the bulge and of the disk for the effect of the intergalactic medium on a source (Madau et al. 1996).
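The attenuation step of Eq. (12) amounts to a wavelength-by-wavelength multiplication, with κ interpolated from a tabulated attenuation-inclination relation. A schematic numpy version follows; the array names and the use of plain linear interpolation are our simplification of the procedure described above:

```python
import numpy as np

def kappa_at(i, incl_grid, kappa_grid):
    # Interpolate a tabulated attenuation-inclination relation
    # (tabulated on incl_grid, in radians) at inclination i.
    return np.interp(i, incl_grid, kappa_grid)

def extinct_sed(sed0, tau, kappa):
    # Apply internal dust extinction to a face-on SED (Eq. 12):
    # sed0 and tau are sampled on the same wavelength grid.
    return sed0 * np.exp(-kappa * tau)
```

With κ = 0 (a perfectly face-on, transparent case) the SED is returned unchanged; any positive κ only dims it.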
In order to obtain realistic deep fields, we added stars using the Besançon model (Robin et al. 2003) web service, which includes its recent improvements (Robin et al. 2012, 2014; Bienaymé et al. 2015; Amôres et al. 2017). This model was run at the coordinates of the CFHTLS D1 and D2 deep fields in the 8 photometric bands. When using a smaller image than the 1 deg² deep field, a Poisson random draw is performed to select only the relative number of stars in the sub-area. An example of the star counts in the i′ band is shown in Fig. 12 in Sect. 7. The stars represent only a few percent of the objects below the completeness limit of g′ ≃ 26 mag.
In order to correct the apparent magnitudes of the galaxies for the effect of dust in the Milky Way at the specific location of the CFHTLS D1 deep field, we correct each observed magnitude with the parameterization of Fitzpatrick & Massa (1990) in the UV and spline fitting in the optical and IR, with an R_V factor of 3.1 and an average E(B − V) of 0.02, using the Python extinction package. These correction values are given in Table 1.
The catalogs of galaxy magnitudes and sizes generated following the above procedure are converted into images using
SkyMaker (Bertin 2009).
SkyMaker renders the modeled galaxies on a high-resolution pixel grid and convolves them with an automatically computed, realistic Point Spread Function (Bertin 2011). The model is then sub-sampled at the final image resolution and centered at some specified coordinates.
SkyMaker finally adds the sky background, saturation, photon (Poisson) noise, and read-out (Gaussian) noise. The various parameters used for each passband are shown in Table 1 (note that the long exposure times of the CFHTLS Deep fields are reflected in the high effective gains with a 1-second exposure time).
3. Compression of deep fields
In the last two decades, there has been tremendous progress in the field of pattern recognition and classification on images. In 2004, it was shown that standard neural networks can be greatly accelerated by using GPUs, with an implementation 20 times faster than the same implementation on CPUs (Oh & Jung 2004). Convolutional neural networks were shown to outperform more traditional machine learning methods for images (Cireşan et al. 2010): a deep convolutional neural network won the ImageNet Large Scale Visual Recognition Challenge in 2012 (Krizhevsky et al. 2012).
To compress the images, we use a neural network whose parameters are fit via the maximization of the logarithm of the determinant of the Fisher information matrix (Charnock et al. 2018). Because of the multi-scale properties and the various shapes of the galaxies, we have decided to use a fully convolutional inception architecture; see Sect. 4 for a full description.

https://github.com/kbarbary/extinction

In the following, X is a set of observations, Θ, d and μ are vectors, and C and F are matrices. The Fisher information is the amount of information that a set of observations X contains about a set of model parameters Θ in a model describing the data. The probability of the data in the model is described by a likelihood function L, whose value L(X|Θ) describes how likely some observations X are, given a particular value of the model parameters Θ.
We briefly define the score of a set of observations X, with likelihood function depending on a set of parameters Θ:

S(X, Θ) = ∇_Θ log L(X|Θ)   (14)

The score indicates the sensitivity of the likelihood function to small changes in the parameters of the model. The variance of the score is the Fisher information matrix at some value of the parameters Θ:

F(X, Θ) = E[S(X, Θ′) S(X, Θ′)ᵀ | Θ′ = Θ]   (15)

i.e. the expectation of the observed information at a given set of parameter values Θ′ = Θ.
As shown in Alsing & Wandelt (2018), and following the MOPED algorithm (Heavens et al. 2000), a compression scheme can be designed which is optimal in terms of lossless data compression for data where the likelihood function is known. To do so, we note that the information inequality states that the variance of some observable statistic t(Θ) of parameters Θ is bounded by

E[t(Θ′) t(Θ′)ᵀ | Θ′ = Θ] ≥ ∇_Θ E[t(Θ′)ᵀ | Θ′ = Θ] F⁻¹ ∇_Θ E[t(Θ′) | Θ′ = Θ]   (16)

By equating the summary statistic with the score, t(Θ) = S(X, Θ), we notice that the left-hand side of the equation is equivalent to the Fisher information defined in Eq. (15). Since the gradient with respect to the parameters commutes with the expectation value,

∇_Θ E[t(Θ′) | Θ′ = Θ] = E[∇_{Θ′} ∇_{Θ′}ᵀ log L(X|Θ′) | Θ′ = Θ] = −F(X, Θ).   (17)

Substituting this back into Eq. (16) shows that using the score function as a summary statistic saturates the information inequality (Lehmann & Casella 1998) and therefore constitutes an optimal, lossless compression. By maximizing the likelihood function with respect to the parameters, i.e. finding the parameters Θ_opt at which the score is zero, we obtain a quasi-maximum likelihood estimator which, when using the expectation value of the score for the compression, is equivalent to

Θ_opt = Θ + F⁻¹(X, Θ) S(X, Θ).   (18)
Table 1.
Parameters for the
SkyMaker program.
Passband                        u′      g′      r′      i′      z′      J       H       Ks
Effective wavelength (μm)       0.355   0.475   0.640   0.776   0.925   1.253   1.631   2.146
Effective gain (e−/ADU)         1026    7095    6919    6807    2148    1116    2379    2134
Effective exposure time (s)     1       1       1       1       1       1       1       1
Saturation level (ADU)          6466    2710    3088    4231    12331   68068   114324  110884
Read-out noise (e−)             5       5       5       5       5       30      30      30
Seeing FWHM (arcsec)            0.833   0.817   0.771   0.744   0.718   0.65    0.65    0.65
Sky level (AB mag/arcsec²)      22.2    21.7    20.8    20.0    19.1    16.7    15.4    15.4
Zero-point (AB mag)             30      30      30      30      30      30      30      30
Milky Way reddening (AB mag)    0.097   0.075   0.049   0.036   0.027   0.016   0.010   0.007

First, let us suppose we observe n independent and identically distributed data vectors d₁, …, d_n, with density depending on a set of p parameters Θ = {θ₁, …, θ_p}. Let us suppose that the data are described by a Gaussian likelihood function, for which we know the parametric equation, ∀i ∈ {1, …, n}:

−2 log L(d_i | Θ) = (d_i − μ)ᵀ C⁻¹ (d_i − μ) + log(2π det C)   (19)

where
– L(d_i | Θ) is the likelihood function describing the probability of a single data observation d_i given parameters Θ;
– μ = (μ₁, …, μ_q)ᵀ is the mean vector over all the data, depending on Θ, with ∀α ∈ {1, …, q}: μ_α = (1/n) Σ_{k=1}^{n} d_{k,α};
– C is the covariance matrix of the data, with ∀α, β ∈ {1, …, q}: C_{αβ} = 1/(n−1) Σ_{k=1}^{n} (d_{k,α} − μ_α)(d_{k,β} − μ_β).
Assuming the covariance to be independent of the parameters Θ (see Sect. 4.5 for a full explanation) and replacing the Gaussian likelihood of Eq. (19) in Eq. (15) yields the Fisher information matrix:

F = ∇_Θ μᵀ C⁻¹ ∇_Θ μ   (20)

Now, using score compression to massively reduce the dimensionality of our data, we obtain quasi-maximum likelihood estimates Θ̃ using

Θ̃ ≈ Θ′ + F⁻¹(d, Θ′) (∇_Θ μ(Θ′))ᵀ C⁻¹ (d − μ(Θ′))   (21)

which have a variance that saturates the Cramér-Rao bound, given that Θ′ is close to the peak of the likelihood.
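Equations (20) and (21) are straightforward to implement. The toy below (the names are ours, not the paper's code) builds the Fisher matrix of a linear Gaussian model and takes one quasi-maximum-likelihood step from a fiducial point; for a linear model the step lands exactly on the true parameters:

```python
import numpy as np

def fisher_gaussian(dmu, Cinv):
    # Eq. (20): F = (grad mu)^T C^{-1} (grad mu),
    # with dmu of shape (q, p): derivatives of the q means w.r.t. p parameters.
    return dmu.T @ Cinv @ dmu

def qmle_step(theta_fid, d, mu_fid, dmu, Cinv):
    # Eq. (21): one Newton-like update from the fiducial parameters.
    F = fisher_gaussian(dmu, Cinv)
    return theta_fid + np.linalg.solve(F, dmu.T @ Cinv @ (d - mu_fid))

# Linear toy model: mu(theta) = A @ theta, identity covariance.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
theta_true = np.array([0.3, -1.2])
theta_hat = qmle_step(np.zeros(2), A @ theta_true, np.zeros(3), A, np.eye(3))
# theta_hat recovers [0.3, -1.2] to machine precision.
```

For non-linear models the same step is only approximate and requires Θ′ to be close to the likelihood peak, as stated above.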
In practice, this means that the data have been compressed down to the dimensionality of the number of parameters in the model without losing any information (in terms of the Fisher information matrix), as long as the data are Gaussianly distributed, as with the MOPED algorithm (Heavens et al. 2000).
For a problem such as the inference of model parameters from deep galactic fields, the likelihood is not Gaussian and, worse, is not known. Developing in parallel the idea of score compression (Alsing & Wandelt 2018) and the MOPED algorithm (Heavens et al. 2000), the method used here (Charnock et al. 2018) considers a neural network, f : d → x, which compresses the data set d = (d₁, …, d_n) to a set of summary statistics x = (x₁, …, x_n), under the constraint of maximizing the Fisher information. Each piece of data d_i can be very large (in our application, each simulation is a large multi-band deep-field image) but can be compressed, in practice losslessly, down to the number of model parameters p. Since each piece of data d_i is described in our model by the parameters Θ, the corresponding summary statistic x_i is also dependent on Θ. The neural network f can be seen as a function which transforms the original unknown likelihood function L(d_i | Θ) of the data into an asymptotically Gaussian likelihood function L(x_i | Θ) of the summary statistics, ∀i ∈ {1, …, n}:

−2 log L(x_i | Θ) = (x_i − μ_f)ᵀ C_f⁻¹ (x_i − μ_f) + log(2π det C_f)   (22)

where, with the same formalism as in Sect. 3.4:
– f(d_i) = x_i;
– L(x_i | Θ) is the likelihood function for a single summary statistic x_i, obtained from the data d_i via the transformation f, given parameters Θ;
– μ_f is the mean vector of the summary statistics.
– C_f is the covariance matrix of the summary statistics.
Consequently, the Fisher information matrix, ignoring parameter dependence of the covariance, is:

F = ∇_Θ μ_fᵀ C_f⁻¹ ∇_Θ μ_f   (23)

Such a transformation yields the same quasi-maximum likelihood estimator Θ̃ as Eq. (21), but for the summary statistics. In this case, the original data d are compressed to obtain their summary statistics f(d) = x. Therefore, by fitting the parameters of a neural network to maximize the logarithm of the determinant of Eq. (23), we can approximate score compression for general, non-linear likelihoods and thereby compress our large multi-band images to approximately sufficient statistic vectors.
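The fitting criterion just described can be mimicked numerically: estimate the Fisher matrix of the network summaries (Eq. 23) from simulations run at the fiducial parameters and at parameters shifted by ±Δθ, then minimise the negative log-determinant. The sketch below uses finite differences and numpy only; the IMNN itself backpropagates through this quantity, and all names here are ours:

```python
import numpy as np

def summary_fisher(x_fid, x_plus, x_minus, dtheta):
    # x_fid: (n_sims, p) summaries at the fiducial parameters.
    # x_plus, x_minus: (p, n_sims, p) summaries at theta_i +/- dtheta_i.
    # Returns the (p, p) Fisher matrix of Eq. (23).
    C = np.cov(x_fid, rowvar=False)  # covariance of the summaries
    dmu = (x_plus.mean(axis=1) - x_minus.mean(axis=1)) / (2.0 * dtheta[:, None])
    return dmu @ np.linalg.inv(C) @ dmu.T

def imnn_loss(F):
    # Minimising -log det F pushes the network toward summaries
    # that carry maximal Fisher information.
    sign, logdet = np.linalg.slogdet(F)
    return -logdet
```

In training, the compression network is updated by gradient descent on imnn_loss, so that the summaries become maximally sensitive to the model parameters.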
Fig. 3. Example of a convolutional layer: a small kernel, extending through the full depth of the input image, is convolved with the input, giving an output of size 28 × 28. With such convolutional layers, fewer parameters are needed, and translationally invariant features can be fitted, i.e. the location of relevant features is not important.
4. Neural network description and training
To compress the images, we need to choose a neural network architecture and fit its parameters (i.e. weights and biases). The following sections describe this process.
Traditionally, the building block of neural networks is the perceptron (Rosenblatt 1962). A perceptron takes several inputs, x₁, x₂, …, x_n, and produces a single output. Rosenblatt introduced weights, w₁, w₂, …, w_n, real numbers expressing the importance of the respective inputs in determining the output. A bias b is added, and a non-linear activation function ψ gives the output y:

y = ψ( Σ_{i=1}^{n} w_i x_i + b )

Perceptrons are generally assembled together to form a fully connected layer, where each neuron of the input is connected to several neurons. These layers can then be stacked, such that the outputs of each layer are used as the inputs of subsequent layers. This configuration has a limit in terms of computational time, because the number of weights adds up quickly, particularly when the network is composed of many fully connected layers.
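The perceptron equation above, and its vectorised fully connected form, take only a few lines of numpy (a generic textbook sketch, not the network used in this paper):

```python
import numpy as np

def perceptron(x, w, b, psi=np.tanh):
    # y = psi(sum_i w_i x_i + b): weighted sum, bias, non-linearity.
    return psi(np.dot(w, x) + b)

def dense_layer(x, W, b, psi=np.tanh):
    # A fully connected layer: each of the m outputs sees all n inputs,
    # so W has shape (m, n) and the layer costs m * (n + 1) parameters.
    return psi(W @ x + b)
```

The m * (n + 1) parameter count per layer is exactly why stacking many fully connected layers becomes expensive, as noted above.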
The parameters of a convolutional layer consist of a set of tunable, discrete convolutional kernels, often known as filters. Normally, each filter is small spatially (along width and height) but extends through the full depth of the input volume, i.e. along the different bands of the deep-field images. Intuitively, the network fits filters that activate on some type of visual feature in the input, such as a pattern with some orientation or a patch with some particular color. When dealing with high-dimensional inputs such as multi-band images, it is impractical to connect all inputs to all neurons in any particular layer, due both to the high dimensionality of the input and to the distribution of informative features not being location-dependent. Convolutional layers allow us to connect each neuron to only a local region of the input volume (see Fig. 3).
It is common to periodically insert a pooling layer between successive convolutional layers in a network. The function of a pooling layer is to progressively reduce the spatial size in order to work at different scales: the first layer works on the smallest scales, then, after pooling, the convolution kernels work on a larger scale, and so on. The pooling layer operates independently on every depth slice of the input and re-sizes it spatially by taking the maximum or the average of a running local patch of the output of the convolutional layer.
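A minimal NumPy sketch of such a pooling operation (the 2 × 2 non-overlapping patch size is illustrative; it reduces the pixel count by a factor 4):

```python
import numpy as np

def pool2d(x, patch=2, mode="max"):
    """Pool one (H, W) depth slice over non-overlapping patch x patch
    windows, taking either the maximum or the average of each window."""
    h, w = x.shape
    blocks = x[:h - h % patch, :w - w % patch].reshape(
        h // patch, patch, w // patch, patch)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
mx = pool2d(x, 2, "max")   # each entry is the max of a 2x2 window
av = pool2d(x, 2, "mean")  # each entry is the average of a 2x2 window
```

For a multi-band input, the same function would be applied independently to every depth slice, as described above.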
Galaxies can be of different sizes and shapes, hence salient parts in the image can have extremely large variations in size. Because of this huge variation in the distribution in scale of the information, choosing the correct kernel size for the convolution operation is delicate but critical. Moreover, very deep networks (several layers with many neurons) are prone to over-fitting, and naively stacking large convolution operations is computationally expensive. A way to deal with these considerations has been developed (Szegedy et al. 2015) using a so-called inception block: a parallel mixture of convolutions and pooling layers (see an example in Fig. 4 with a series of 5 consecutive inception blocks). Inception blocks allow one to study the input data at different scales in parallel and to extract features of different sizes from the same input. A succession of inception blocks can be used to create a full network architecture called an inception network. The network architecture we use in the applications is shown in Fig. 4.

Because distant galaxies are generally spread over only a few pixels in size, we use small convolution kernels in each of the layers. As part of the inception architecture, 1 × 1 convolutions are performed before each of the larger kernel-sized convolutions to reduce the depth, hence the number of kernels, of each block. Each convolution has an output depth of 8 filters, which keeps the total number of parameters relatively low (12 800 parameters) compared to traditional inception networks (with millions of parameters). At the end of each inception block, we concatenate the outputs of the convolutions and then calculate an average pooling, using the mean over 2 × 2 patches of the output, to decrease the resolution of the image by a factor 4. This process is applied several times until the output is of the expected dimension, i.e. the number of parameters of the model. Specifically, the full input data used in the applications described in Sects.
6 and 7 are massively compressed by the network down to the number of parameters in the corresponding luminosity function model. Our inception network is fully convolutional: we do not use any fully-connected layer, so that complete translational invariance is incorporated. We also use a pre-processing layer to reduce the dynamical range of the flux of the pixels using an inverse hyperbolic sine function. (Note that the indicated kernels in Fig. 4 actually correspond to factorized kernels performed in parallel and then concatenated; this reduces the number of weights but is strictly equivalent, see Szegedy et al. 2015.)

Normally, the loss function is a measure of the fit of the neural network to the data. However, with the IMNN, the loss function is a measure of the amount of information that is extracted from the data. We use the IMNN to maximise the Fisher information of the output summaries obtained from the network. As a consequence, during the training, we measure the determinant of the Fisher matrix, which we want to maximise. However, since the Fisher information is invariant under linear rescaling of the summaries, the magnitude of the summaries needs to be controlled, which can be done by constraining the covariance matrix (see below). We therefore define two loss functions:

Λ_F = − log [det F]    (24)
Λ_C = ½ (||C_f − I||_F + ||C_f⁻¹ − I||_F)    (25)

where ||·||_F is the Frobenius norm. The first loss function, Λ_F, measures the Fisher information, and the second loss function, Λ_C, measures the difference of the covariance from the identity matrix. We are trying to find a network parameter configuration such that both loss functions are minimal. Minimising Λ_F maximises the information, whilst minimising Λ_C provides a covariance of the summaries which is close to the identity matrix.
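The two objectives can be sketched in NumPy as follows; the ½ prefactor in Eq. (25) is a reconstruction from the extraction-damaged equation, so it should be read as an assumption:

```python
import numpy as np

def loss_fisher(F):
    """Eq. (24): minimising -log det F maximises the Fisher information."""
    return -np.log(np.linalg.det(F))

def loss_covariance(C):
    """Eq. (25): penalise summaries whose covariance (and its inverse)
    drift away from the identity matrix. The 1/2 prefactor is assumed."""
    I = np.eye(C.shape[0])
    return 0.5 * (np.linalg.norm(C - I, "fro")
                  + np.linalg.norm(np.linalg.inv(C) - I, "fro"))

lf = loss_fisher(np.diag([2.0, 2.0]))   # -log 4: more information, lower loss
lc = loss_covariance(np.eye(3))         # 0: covariance already at identity
```

When the summaries' covariance equals the identity, the second loss vanishes and only the Fisher term drives the training.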
Keeping the covariance relatively close to the identity fixes the scale of the summary statistics whilst providing a covariance which is mostly independent of the parameters, which justifies dropping the covariance term in Eqs. (20) and (23). There exist several techniques (Hwang et al. 1979) to achieve the minimization of such a multi-objective problem (for example linear scalarization, ε-constraint, etc.). Here, we use a continuously penalized optimization in which we combine the loss functions Λ_F and Λ_C as follows:

Λ = Λ_F + r_{Λ_C} Λ_C    (26)

where

r_{Λ_C} = λ Λ_C / ( Λ_C + exp(− α Λ_C) )    (27)

is a function with user-defined parameters λ and α. As a result, when the covariance is far from the identity, the r_{Λ_C} function is large and the optimization concentrates on bringing the covariance and its inverse back to the identity.

Following the description of the IMNN code (https://bitbucket.org/tomcharnock/imnn; Charnock et al. 2018), to train the network when exact gradients of the simulator are not available, we need to choose some fiducial parameter values, Θ_fid, and some deviation about the fiducial, ΔΘ_fid. From these values, we generate a training set of images (using the method explained in Sect. 3) consisting of:
– a first collection of images generated with the fiducial parameters Θ_fid;
– a second collection of images generated with Θ_fid + ΔΘ_fid;
– a third collection of images generated with Θ_fid − ΔΘ_fid.
As illustrated in Fig. 5, we pass the training set of images through the network to obtain the summary statistics, and we compute the quantities of interest: the covariance matrix from the fiducial set of images, Fig. 4.
Fully convolutional inception network with 5 inception blocks used to perform the compression. An inception block is composed of parallelized convolutions (with several kernel sizes) which simultaneously process the same input at different scales to extract different features; the full output is then concatenated. After each inception block, the input is compressed with a pooling layer to decrease the resolution by a factor 4. The last convolution allows for a compression down to the number of parameters of the model. The current size is given after each inception block pooling. Fig. 5.
Schematic description of the training of the network to obtain the covariance matrix and the derivative of the mean vector, which are used in Eq. (23) to obtain the Fisher information matrix. The top/middle/bottom lists of images form what we call the training set and are generated with fiducial parameters Θ_fid (respectively with Θ_fid + ΔΘ_fid and Θ_fid − ΔΘ_fid). From the first collection of images we compute the covariance matrix, and from the second and third collections of images we compute the derivative of the mean vector.

the numerical derivative of the mean of the summary statistics vector from the perturbed parameter sets and, with these, the determinant of the Fisher information matrix. At each iteration of the training phase, the training set of images is passed through the network before running the back-propagation, i.e. calculating the gradient of the loss function and using the chain rule to compute the effect of each network parameter on the value of the loss function, such that the parameters can be updated, thereby minimizing the loss. Consequently, as the training advances, the network summarizes the data while retaining more and more information about the model parameters. Furthermore, the network becomes increasingly insensitive to parameter-independent noise, providing a function which calculates extremely robust summary statistics. We use another independent set of different images which the network did not train on in order to validate the performance: the validation set. To update the network parameters at each iteration, we use the Adam optimizer (Kingma & Ba 2014) and/or the stochastic gradient descent method. In this training process, we do not know the expected Fisher information of the images, so a recommendation is to continue the training until the loss reaches a plateau (or starts to decrease) when evaluated on the validation set, indicating that the network has compressed the data with the maximum information available.
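The quantities computed from the training set can be sketched as follows: a hedged NumPy estimate of the Fisher matrix from the fiducial and perturbed summaries, assuming the Gaussian form F = ∂μᵀ C⁻¹ ∂μ with the covariance term dropped, as described above:

```python
import numpy as np

def fisher_from_summaries(s_fid, s_plus, s_minus, dtheta):
    """Estimate the Fisher matrix from network summaries.
    s_fid   : (n_sims, n_summaries) summaries at the fiducial parameters
    s_plus  : (n_params, n_sims, n_summaries) summaries at theta_fid + dtheta
    s_minus : (n_params, n_sims, n_summaries) summaries at theta_fid - dtheta
    dtheta  : (n_params,) offsets used for the central differences."""
    # Covariance of the fiducial summaries (first collection of images).
    C = np.atleast_2d(np.cov(s_fid, rowvar=False))
    # Numerical derivative of the mean summary (second and third collections).
    dmu = (s_plus.mean(axis=1) - s_minus.mean(axis=1)) / (2.0 * dtheta[:, None])
    return dmu @ np.linalg.inv(C) @ dmu.T  # F = dmu C^-1 dmu^T
```

During training, the determinant of this matrix is the quantity being maximised through Eq. (24).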
At this point the summaries are Gaussianly distributed in regions close to the fiducial choice of parameter values, even if the likelihood of the actual data is non-Gaussian (Charnock et al. 2018). If the values of the inferred parameters are far from the fiducial choice of training parameters, it can be wise to choose another set of fiducial values and to train the network closer to the previously derived values.
5. Approximate Bayesian computation (ABC) and population Monte Carlo (PMC)
Since the likelihood function of our model is unknown, we can use approximate Bayesian computation (ABC) to bypass its evaluation by considering the density of forward simulations around some observed data. The first ABC-related ideas date back to the 1980s (Rubin 1984), with a sampling method that yields, asymptotically, the posterior distribution. ABC-rejection sampling, in its most basic form, generates n simulations, {D̂_i(Θ_i) | i ∈ [1, n]}, from model parameters Θ following their prior distribution p(Θ), and compares them to the observed data D. Simulations that are distinctly different from the observed data D are considered unlikely to have been generated from the same parameter values describing D, and the associated simulation parameter values are rejected. In more precise terms, the sample D̂_i is accepted with tolerance ε ≥ 0 if:

ρ(D̂_i, D) ≤ ε    (28)

where ρ is a distance which measures the discrepancy between D̂_i and D. Note that, in general, the form of the distance measure ρ can be difficult to choose for ABC (amounting to a similar interpretation as a choice of likelihood function), but it is well motivated when using the IMNN (Charnock et al. 2018). The probability of generating a data set D̂_i with a small distance to D typically decreases as the dimensionality of the data increases. A common approach to somewhat alleviate this problem is to replace D with a set of lower-dimensional summary statistics S(D), which are selected to capture the relevant information in D. The full relevance of Sects. 3 and 4 on obtaining the summary statistics thus becomes clear in this context, and S is the score compression provided by the IMNN in our application. The acceptance criterion in the ABC rejection algorithm then becomes:

ρ(S(D̂_i), S(D)) ≤ ε    (29)

where the summary statistics are derived using the compression described in Sect. 3.
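The acceptance criterion above can be sketched as a generic ABC-rejection loop; the toy problem (inferring the mean of a Gaussian from its sample mean) and all names here are illustrative, not the paper's simulator:

```python
import numpy as np

def abc_rejection(obs_summary, prior_sample, simulate_summary, distance,
                  eps, n=2000, seed=0):
    """ABC rejection (Eq. (29)): draw from the prior, simulate, summarize,
    and accept parameter sets whose summary lies within eps of the data."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n):
        theta = prior_sample(rng)
        if distance(simulate_summary(theta, rng), obs_summary) <= eps:
            accepted.append(theta)
    return np.array(accepted)

# Toy check: the accepted draws concentrate around the observed summary.
post = abc_rejection(
    obs_summary=0.0,
    prior_sample=lambda rng: rng.uniform(-5, 5),
    simulate_summary=lambda t, rng: rng.normal(t, 1.0, 50).mean(),
    distance=lambda a, b: abs(a - b),
    eps=0.3)
```

In the application, `simulate_summary` would be the forward model of Sect. 2 followed by the IMNN compression, and `distance` the Fisher-metric distance defined next.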
In our application, the distance between a summary statistic of a simulation, x̂_i, and that of the observed data, x = S(D), is defined by the Fisher information of the previously trained network (Charnock et al. 2018):

ρ(x̂_i, x) = ( Θ̃(x̂_i) − Θ̃(x) )ᵀ F ( Θ̃(x̂_i) − Θ̃(x) )    (30)

This distance measure is justified by the quasi-maximum-likelihood estimates described in Eq. (21), because that space should be asymptotically Euclidean (with the Fisher information as the metric) once the network has converged and the summary statistics become Gaussianly distributed (Charnock et al. 2018).

Even when using summary statistics, the acceptance rate of the ABC is low for a small ε because we sample from the whole multi-dimensional prior distribution, so finding a new set of parameters whose simulations are more similar to the data is unlikely, which makes the sampling from the prior distribution inefficient and hence slow. We therefore use the population Monte Carlo (PMC) procedure described in the next section.

The goal of the population Monte Carlo (PMC) procedure is to have a higher density of draws in the more probable regions of the posterior distribution. The different steps of the PMC used here are:
– Step 1: from the prior distribution, we draw N (user-defined) sets of parameters. For each set of parameters in this sample:
  ⋆ Step 1a: we simulate the multi-band images.
  ⋆ Step 1b: we compress the simulated images with the IMNN to reduce the dimensional complexity.
  ⋆ Step 1c: we compute the distance between the summarised simulation and the summary of the observed data using Eq. (30).
  ⋆ Step 1d: we define a weighting which corresponds to the probability of this set of parameters under the prior distribution.
– Step 2: we compute the mean vector and the weighted covariance matrix of the N sets of parameters. We set the R counter to 0.
– Step 3: from the N sets of parameters, we identify the Q% (user-defined) of the compressed simulations that have the largest distance from the summary of the observed data. For each set of parameters in this sub-sample:
  ⋆ Step 3a: we resample it from a proposal distribution, which is a Gaussian using the current parameter values as the mean and the weighted covariance matrix from Step 2. Each time we go through this step, we increment the R counter.
  ⋆ Step 3b: we simulate the multi-band images at this new proposed set of parameters.
  ⋆ Step 3c: we compress the simulated images with the IMNN to reduce the dimensional complexity.
  ⋆ Step 3d: we compute the distance between the summarised simulation and the summary of the observed data using Eq. (30).
  ⋆ Step 3e: if this distance is smaller than the previous distance, we keep this set of parameters. Otherwise, we go back to Step 3a.
– Step 4: each weight is updated using the probability of the corresponding resampled set of parameters and the weighted covariance matrix of Step 2: each initial weight (under the prior distribution) is divided by the normalized Gaussian using the difference between the new and initial parameter sets as the mean and the weighted covariance matrix from Step 2.
– Step 5:
As long as the counter R is smaller than the user-defined threshold M, we go back to Step 2 and re-identify the Q% of simulations with the largest distance from the observed data. Otherwise, the PMC finishes.

The Q% in Step 3 is a user choice, and we chose values allowing us to somewhat parallelise the procedure. If the number of draws M is large enough, when the procedure stops, the distribution of the N sets of parameters has become approximately stationary, and it can then be considered as a good approximation of the posterior (Del Moral et al. 2012).

In the application of Sect. 6, we use Q = 75%, N = 1000 and we adopt M = 50000; therefore, at each iteration the procedure tries to re-draw 750 samples and finishes when the number of attempts in the same iteration is higher than M = 50000.

In the application of Sect. 7, we use Q = 25%, N = 500 and we adopt M = 5000; therefore, at each iteration the procedure tries to re-draw 125 samples and finishes when the number of attempts in the same iteration is higher than M = 5000.
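Steps 1 to 5 can be condensed into a schematic NumPy loop. This is a simplified sketch: the importance-weight update of Step 4 is omitted, and the toy simulator (summaries equal to the parameters plus noise) stands in for the image pipeline:

```python
import numpy as np

def fisher_distance(t_sim, t_obs, F):
    """Squared Fisher-metric distance of Eq. (30) between the parameter
    estimates from a simulation and from the observed data."""
    d = np.asarray(t_sim) - np.asarray(t_obs)
    return float(d @ F @ d)

def pmc(obs_summary, prior_sample, simulate_summary, F,
        n=200, q=0.25, max_draws=4000, seed=0):
    """Schematic PMC following Steps 1-5 (Step 4's weight update omitted):
    repeatedly move the worst q fraction of the population with a Gaussian
    proposal built from the population covariance, accepting only moves
    that reduce the distance to the observed data."""
    rng = np.random.default_rng(seed)
    theta = np.array([prior_sample(rng) for _ in range(n)])       # Step 1
    dist = np.array([fisher_distance(simulate_summary(t, rng), obs_summary, F)
                     for t in theta])                              # Steps 1a-1c
    draws = 0
    while draws < max_draws:                                       # Step 5
        cov = np.cov(theta, rowvar=False) + 1e-9 * np.eye(theta.shape[1])  # Step 2
        for i in np.argsort(dist)[-int(q * n):]:                   # Step 3
            while draws < max_draws:
                prop = rng.multivariate_normal(theta[i], cov)      # Step 3a
                draws += 1
                d = fisher_distance(simulate_summary(prop, rng),
                                    obs_summary, F)                # Steps 3b-3d
                if d < dist[i]:                                    # Step 3e
                    theta[i], dist[i] = prop, d
                    break
    return theta

# Toy check: the population should contract around the observed summary.
posterior = pmc(np.zeros(2),
                lambda rng: rng.uniform(-3, 3, 2),
                lambda t, rng: t + rng.normal(0, 0.1, 2),
                np.eye(2))
```

Because only the worst fraction of the population is moved at each iteration, the maximum distance shrinks monotonically, progressively squeezing the whole population toward the high-density region of the posterior.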
6. Application to M* and φ* for spirals only

In order to prove that the methodology works, we use simulated deep fields with only 1 spiral population of galaxies with 2 free parameters of the luminosity function, M* and φ*, while the slope is fixed to α = −1.…, see Eq. (1). The parameters do not evolve with redshift: M*_evol = 0 and φ*_evol = 0. We use a B/T = 0.…, with the "E" SED for the bulge and the "Irr" SED for the disk (see Fig. 2). As explained in Sect. 4.6, we choose the following fiducial parameter values and their offsets for the numerical derivatives to fit the network:

Θ_fid,1 = M* = −20,    ΔΘ_fid,1 = ΔM* = 0.…
Θ_fid,2 = log(φ*) = −2.…,    ΔΘ_fid,2 = Δ log(φ*) = 0.…    (31)

where φ* is given for H₀ = 100 km s⁻¹ Mpc⁻¹. Fig. 6 shows the 5 theoretical luminosity functions used when making the images to train the network. Fig. 7 shows the impact on the RGB image (using the i′, r′, g′ filters) of a simulation when changing only one of the two parameters, M* or φ*. We see that the density of spirals is boosted both for the Δ log(φ*) increase in log(φ*) and for the ΔM* decrease in M*. Consequently, we expect a strong correlation between these two parameters.

To fit the network parameters (i.e. weights and biases), we follow the prescription described in Sect. 4.6 and generate 200 simulations (in the 8 photometric bands: u′, g′, r′, i′, z′, J, H, Ks) with parameters at fiducial values. For the images used for the numerical derivatives, we generate 50 simulations (in each of the 8 bands) for each parameter at Θ_fid + ΔΘ_fid and another 50 simulations per parameter at Θ_fid − ΔΘ_fid (see Eq. (31)). Note that for the

Fig. 6.
Theoretical luminosity functions of the spiral population with the perturbed values of the parameters used to train the network. The parameters used for each curve are listed in Eq. (31). The explicit features in the data, generated from these functions, provide the differences that the IMNN will become sensitive to when training, therefore mapping the effect of log(φ*) and M* to informative summaries. Fig. 7.
Color images (g′, r′, i′ filters) showing the effect of the perturbed values from fiducial for each parameter, as listed in Eq. (31). For each subplot we only add/remove the offset for one parameter listed in Eq. (31) and keep the other at its fiducial value. We see that decreasing the value of M* (top right) increases the number of galaxies. We also note that decreasing the value of log(φ*) decreases the number of galaxies, so these two parameters are highly correlated. Table 2.
Optimizer and learning rate used during the training.

epoch    optimizer    learning rate
0        Adam         0.02
100      SGD          10⁻⁶
200      SGD          10⁻⁶
600      SGD          10⁻⁶
…        SGD          down to 10⁻⁷

Notes:
SGD stands for stochastic gradient descent.

simulations for the numerical derivatives, each parameter is varied independently and in turn, keeping the other variable parameters fixed at their fiducial values. Furthermore, the initial seeds are fixed to be the same for each individual pair of Θ_fid + ΔΘ_fid and Θ_fid − ΔΘ_fid, and varied for every pair in the set, such that the galaxies are generated at the same location in the image and with identical flux and shape parameters; i.e. there are 50 individual seeds for the 200 simulations needed for the 50 simulations generated with (M* + ΔM*, φ*), (M* − ΔM*, φ*), (M*, φ* + Δφ*) and (M*, φ* − Δφ*). This ensures that the only changes from one image of the pair to the other are that galaxies appear or disappear according to the luminosity function. The suppression of this variance allows us to make fewer simulations for the derivatives than for the fiducial simulations used to calculate the covariance matrix. This yields a total of
200 + 2 × 2 × 50 = 400 images (in the 8 bands) for the training set, and we use the same number for the validation set. Each simulation is a deep field image as in Fig. 7. The optimizer techniques and the corresponding learning rates used during the training are given in Table 2. We use this training scheme to quickly find a transformation that forces the covariance matrix of the summaries to be approximately the identity matrix, and then explore the more complex landscape of the parameter-dependent information. The architecture of the network is that of Fig. 4, which yields a total of 12,170 network parameters to be fitted. Because of the number and size of the images, with this configuration, one epoch of training (i.e. when the whole training and validation sets are passed through the network and the network parameters are updated) takes about 40 seconds to run (on an NVIDIA QUADRO RTX 8000 45GB GPU). We train the network for almost 14000 epochs.

Fig. 8. Top: evolution of the determinant of the Fisher information matrix during almost 14000 epochs. The blue curve represents the information extracted from the training set and the orange curve from the validation set. The determinant of the Fisher information is maximized on the training set and is comparable to the determinant of the Fisher information calculated with the validation set. This confirms that the two sets are representative samples. The training is stopped when the Fisher information of the validation set becomes relatively constant.
Bottom: evolution of the determinants of the covariance matrix (solid line) and of the inverse of the covariance matrix (dashed line) for the training set (blue) and for the validation set (orange). The values become constant quickly, which shows that the network is suppressing the parameter dependence of the covariance. Note that, due to the strong correlation between the parameters, the covariance matrix cannot exactly diagonalize to the identity matrix, but the fact that it is constant shows that the parameter dependence is small, which is the only requirement for the IMNN.

At this point, the extracted information is sufficient to stop the training, and we accept this as the amount of information we are willing to extract from the data given the training time.
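The seed-matching scheme used for the derivative simulations can be illustrated with a toy simulator; the source model and all names here are hypothetical, purely to show how a shared seed suppresses realisation variance:

```python
import numpy as np

def simulate_field(n_sources, seed, npix=32):
    """Toy field simulator: the seed fixes the positions and fluxes of a
    pool of candidate sources, while the parameter only controls how many
    of them are actually painted on the image."""
    rng = np.random.default_rng(seed)
    xy = rng.integers(0, npix, size=(100, 2))   # candidate positions, seed-fixed
    flux = rng.lognormal(0.0, 1.0, 100)         # candidate fluxes, seed-fixed
    img = np.zeros((npix, npix))
    for (x, y), f in zip(xy[:n_sources], flux[:n_sources]):
        img[x, y] += f
    return img

# Matched seeds: the pair differs only by sources appearing/disappearing,
# so the numerical derivative is not swamped by realisation noise.
plus = simulate_field(60, seed=42)
minus = simulate_field(40, seed=42)
diff = plus - minus   # non-zero only where the 20 extra sources landed
```

With unmatched seeds, every pixel of the difference image would carry shot noise from independently placed galaxies, and far more simulations would be needed for the derivatives to converge.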
Once the network is trained, we run both an ABC and a PMC procedure to test its reliability (see Sect. 5). To demonstrate that our method works, we replace what is referred to as "observed data" in Sect. 5 with a simulation called "fake data", generated with parameters set to the fiducial parameter values in Eq. (31):

Θ_fake,1 = M* = −20
Θ_fake,2 = log(φ*) = −2.…    (32)

where φ* is given for H₀ = 100 km s⁻¹ Mpc⁻¹. We use the formalism of Sect. 5 and choose a uniform prior for the two parameters:

M* ∈ [−24, −16]
log(φ*) ∈ [−4, −0.…]    (33)

From these prior intervals, we uniformly draw an initial sample of 5000 pairs of parameters and generate a simulated deep field (see Sect. 2) for each pair. We then pass both the fake data and the 5000 simulated fields through the network in order to compress them. The bottom left plot of Fig. 9 shows the values

Fig. 9. Results of the ABC procedure. Bottom left:
Values of the parameters for the 5000 prior simulations (orange) and for the simulations with minimal distance (blue) to the fake data. The red dotted lines are the parameter values at which the fake data was generated. We retrieve the strong correlation between the two parameters. Top left and bottom right:
1D marginal distributions of the parameter values.

of these parameters for the entire 5000 prior simulations in orange, and we decide to keep only the simulations, in blue, for which the distance to the fake data is minimal. The red dotted lines are the parameter values used to generate the fake data. One can notice that we retrieve the strong correlation between the two parameters. Because we do not know what value of ε is needed to properly approximate the posterior here, we use the PMC procedure to determine it automatically.

For the PMC, we use the same prior for the two parameters as in Eq. (33) and we apply the PMC procedure of Sect. 5.2 with N = 1000, M = 50000 and Q = 75% for re-sampling the N sets of parameters. This yields in total around 232577 draws and generated simulations. The results are shown in Fig. 10: the 2D distribution (bottom left) and the 1D projected distributions concentrate around the values of the parameters used to generate the fake data (i.e. M* = −20 and log(φ*) = −2.…). The procedure clearly identifies the strong correlation between the two parameters, both affecting the density of objects, and is able to infer that the parameter values used to generate the simulation lie in the bulk of the probability density. This application illustrates that the network is able to summarise the effects in the images of the two correlated parameters and that it is able to robustly constrain the possible values of the parameters used to generate the fake data.
7. Application to φ*₀ for ellipticals and spirals

Here we consider a more realistic model with two populations of galaxies, ellipticals and spirals, which can then be compared to the observed data for the CFHTLS and WIRDS D1 deep field. We are only inferring the density parameter φ*₀ of the luminosity
Fig. 10. Results of the PMC procedure. Bottom left: distribution of the parameter values for the final 1000 points obtained by the PMC. The black/dark grey/light grey regions show the contours, and the red dotted lines show the values of the parameters that were used to generate the fake data.
Top left and bottom right:
1D marginal distributions of the parameters and their kernel density estimates in black. The procedure converged around the parameter values used to generate the fake data. There is evidence here, in 2D, of a large uncertainty due to the correlation between the parameters, whose effect is evident in the data (see Fig. 7).

Θ_fid,1 = log(φ*₀,E) = −2.…
ΔΘ_fid,1 = Δ log(φ*₀,E) = 0.…
Θ_fid,2 = log(φ*₀,Sp) = −2.…
ΔΘ_fid,2 = Δ log(φ*₀,Sp) = 0.…    (34)

where φ* is given for H₀ = 100 km s⁻¹ Mpc⁻¹. The other luminosity parameters are set to be constant, with a redshift evolution of the φ* and M* parameters for both populations: we use the values from López-Sanjuan et al. (2017, see Table 3) and list them in Table 3. Note that we consider a factor of 10 smaller offset value of φ*₀,Sp for the spiral galaxies than φ*₀,E for the ellipticals when generating the simulations for the numerical derivatives (see Eq. (34)), in order to account for the higher density of spirals than ellipticals (see Fig. 14). In this paper, we work in the reference SDSS g′-band for the absolute magnitude, so we apply a correction of B − g′ = 0.… for the elliptical population and B − g′ = 0.… for the spiral population, as suggested by Table 7 of Fukugita et al. (1995).

As described in Sect. 2.7, in order to be able to compare the simulations to real observations, we include stars in our simulations using the Besançon model of Robin et al. (2003). Fig. 11 shows the total differential counts for the D1+D2 deep fields and our forward model including stars in all bands, and Fig. 12 the

Fig. 11.
Total differential counts in all 8 bands (u′, g′, r′, i′, z′, J, H, Ks from top to bottom) for both CFHTLS observations (histograms) from MegaPrime D1+D2 (Hudelot et al. 2012) and from WIRCam (4 deep fields) (Bielby et al. 2012), compared with our fiducial model (empty circles). For clarity, the counts in each band are regularly offset vertically downwards by 1 dex from u′ to Ks. Stars are accounted for in u′, g′, r′, i′, z′ but omitted in this plot for J, H, Ks (both for data and model). This graph shows that the magnitudes of the galaxies in the forward model are in good agreement with the observations down to their completeness limits in all 8 photometric bands.

decomposition into stars and both types of galaxies in the i′ band only. One can visually inspect the similarity between the model and the observations in the total density of objects. For computational efficiency, we use only a 3.17 arcmin sub-image (with pixels of 0.186 arcsec) of the full 1 deg² D1 deep field in the 8 photometric bands u′, g′, r′, i′, z′, J, H, Ks to infer the values of φ*₀ for both the elliptical and spiral populations. Fig. 13 shows the effects on a color image of the g′, r′, i′ simulations for the elliptical population (left column) and the spiral population (right column) of changing the fiducial parameters by their respective positive (top images) and negative (middle images) offsets from the fiducial, which are used to calculate the numerical derivatives. The bottom image of each column is the difference of the two images above it and shows the galaxies which appear in the images used to calculate the numerical derivatives of the summaries using the IMNN. One can see in the bottom left image that only elliptical galaxies remain when only φ*₀ for the ellipticals is changed (these galaxies appear as yellow/red because of the SED used to model them), as opposed to the blue galaxies of the spiral population remaining in the bottom right. Fig.
14 shows the 6 theoretical luminosity functions, at z = 0 and z = 2, used to train the network. One can see that the density of bright red ellipticals is lower at z = 2 compared to z = 0: indeed, there were fewer galaxies with early-type SEDs in the past than today, because they were still forming their stars and therefore had bluer SEDs. We also see that we expect a relatively larger density of faint spirals compared to ellipticals, both at z = 0 and z = 2. The curves for the perturbed luminosity functions of spirals are closer together because the offset ΔΘ is a factor 10 smaller than for the ellipticals.

As in the previous application, we generate the simulations similarly, with 200 fiducial images of 3.17 arcmin and
Fig. 12.
Differential counts (stars and galaxies) in the i′ band for the CFHTLS D1+D2 fields, matching our fiducial model decomposed into stars (Besançon model), elliptical galaxies and spiral galaxies down to the completeness limit. This graph shows the dominance of stars and spiral galaxies at the bright end and faint end, respectively, and the difficulty of constraining the modeling of elliptical galaxies because of their smaller number. At the faint end, the completeness of the CFHTLS source extractions is limited to i ≃ 26, whereas the fiducial model shows the fainter input source distribution. Table 3.
Fixed parameters of the luminosity functions for the elliptical and spiral populations.

        B/T     M*       α        φ*_evol    M*_evol
E       …       −….68    −0.53    −1.37      −1.…
Sp      0.…     −….71    −1.29    −0.03      −1.…
Notes: parameter values are given for H₀ = 100 km s⁻¹ Mpc⁻¹ and for z = 0; see Eqs. (2) and (3) for the definition of the last 2 columns.

pixels, and seed-matched images for the numerical derivatives. We use the same inception architecture, with the Adam optimizer (learning rate of 0.04) for the first 4000 epochs of the training, then the stochastic gradient descent optimizer (learning rate of 10⁻⁶) for the remaining epochs. We train the network on an NVIDIA TITAN X 12 GB GPU for almost 10000 epochs.

Our ultimate goal is to infer the parameters on the 1 deg² D1 deep field. However, we limit the analysis here to deep fields of size 3.17 arcmin for computational efficiency, and randomly choose 5 such regions of the D1, see Fig. 17. In order to confirm that the network is trained properly and is performing well, we also generate 5 simulations with the fiducial parameters of Table 3 and the same angular size as the CFHTLS data, which we consider as fake data (see Fig. 16). The goal is to obtain the individual posterior distributions of the 5 observed data images, then to compute a joint posterior to constrain the values of the parameters and extract the confidence intervals. The same procedure is applied to the 5 fake data; the difference is that we know the values of the parameters that were used to generate the data. We choose 5 images in both cases as a good compromise between the computational time and the accuracy of the joint posterior.

We use a uniform prior distribution with the same lower bound for the amplitudes of the luminosity functions of the elliptical and spiral galaxies, but different upper bounds:

log(φ*₀,E) ∈ [−4, −0.…]
log(φ*₀,Sp) ∈ [−4, −1]    (35)

The choice of a smaller upper bound of the prior for the spirals is made because the luminosity function for this population has a steep faint-end slope (see Table 3 and Fig. 14), which can lead to very long generation times for the simulations with high values of the amplitude φ*₀,Sp.
Also, previous studies such as López-Sanjuan et al. (2017) have shown that log(𝜙∗0,Sp) > −1 is very unlikely, and indeed, this is what we also find from Fig. 20. We wish to note that cutting possible regions of parameter space due to computational resources is not a statistically rigorous process, but the information from López-Sanjuan et al. (2017) gives us reason to believe that it is acceptable to truncate here.

Fig. 18 shows the results of the ABC procedure with 5000 draws of the parameters following the uniform prior of Eq. (35). One can choose a distance threshold from the data to accept or reject some of the parameters, as shown by the green/blue/red points in the bottom left plot. The number of accepted points decreases when this threshold is low, but their 2D region also becomes more concentrated. As a consequence, the corresponding 1D marginalized distribution of log(𝜙∗0,E) (top left) varies significantly depending on this choice of distance threshold. This is due to not densely sampling from a stationary distribution, a problem which can occur when sparsely drawing from the prior and not choosing a small enough 𝜖-ball in which to accept simulations. In order to properly approximate the posterior, the distance threshold must approach 𝜖 = 0 and more simulations need to be run for the ABC until a steady distribution is reached. This highlights the need for a PMC that automatically approaches the posterior, which we describe in the following.

Here, we apply a parallelized version of the PMC procedure to the 5 sub-images of both the fake and observed data. We use the same priors as in Eq. (35) and we apply the PMC procedure of Sect. 5.2 with 𝑁 = 500 and 𝑀 = 5000 and a threshold for re-sampling the 𝑁 sets of parameters. This yields in total around 60 000 draws and generated simulations per sub-image. The results are shown in Fig. 19 for the fake data and Fig. 20 for the observed data. As shown by the 2D distribution of points (bottom left) and the 1D projected Gaussian kernel density estimates, both for the observed and fake data, the posteriors concentrate around similar regions of parameter values for the 5 sub-images of each sample since the data come from the same generative process, whilst the differences in the posteriors are due to the independent environment (or realisation) containing more or less constraining power. We provide the 68%, 95%, and 99% confidence intervals for the 1D marginal distributions.

Article number, page 14 of 23

Fig. 13. Left: the bottom image is the difference of the two images above it, which are simulated by only changing the fiducial parameter of the elliptical population to the perturbed parameter values by +ΔΘfid (top) and −ΔΘfid (middle). Only yellow/red elliptical galaxies remain, which confirms that 𝜙∗0,E affects the density of ellipticals in the simulations. Right: the bottom image is the difference of the two images above it, which are simulated by only changing the fiducial parameter of the spiral population to the perturbed parameter values by +ΔΘfid (top) and −ΔΘfid (middle). Only blue spiral galaxies remain, which confirms that 𝜙∗0,Sp affects the density of spirals in the simulations. These RGB color images are obtained from the 𝑖′, 𝑟′, 𝑔′ filters.

Fig. 14.
Theoretical luminosity functions of the elliptical (blue, orange, and green) and spiral (red, purple, and brown) populations at redshift 𝑧 = 0 (top) and 𝑧 = 2 (bottom); note that the legend applies to both panels. The parameters used for each luminosity function are the fiducial parameters of Eq. (34) and those of Table 3. These curves show the higher integrated density of spiral galaxies compared to elliptical galaxies. The steep faint-end slope of the spiral population implies that there are many faint spiral galaxies in the images, which is not the case for the elliptical population. There is also a redshift effect which decreases the proportion of ellipticals/spirals when looking at distant objects.

Note that since the distribution is not Gaussian in 2D, neither its covariance nor the 1D marginal standard deviations are sufficient to wholly encapsulate the information in this posterior distribution. This further gives us good reason to consider such likelihood-free inference methods. The parameter values used to generate the fake data images are shown with red dotted lines in Fig. 19, and can be compared to the posterior distributions. All results, namely the parameter values used to generate the data (for the fake data only) and the 68%, 95%, and 99% 1D confidence intervals for 𝜙∗0,E and 𝜙∗0,Sp, are listed in Tables 4 and 5 for the 5 individual sub-images.

Fig. 15. Top: evolution of the determinant of the Fisher information matrix during almost 10 000 epochs. The blue curve represents the information extracted from the training set and the orange curve from the validation set. The curves increase, which means that the network learns more about the two parameters as the training continues. The training was stopped when the validation curve flattened, suggesting that the network has converged. Bottom: evolution of the determinant of the covariance matrix (solid line) and the inverse of the covariance matrix (dashed line). The blue curves are for the training set and the orange curves for the validation set. The training curves reach 1 very fast, which shows that the loss is stabilized and that the magnitude of the summaries is under control. The validation curve oscillates while remaining very close to identity, which is a sign that there is some weak parameter dependence in the covariance.

We emphasize that for both the observed and fake data, Figs. 19 and 20 show narrower 1D marginalized distributions for the spiral population than for the elliptical population. This comes from the steep faint-end slope of the luminosity function for the spirals: a small change in log(𝜙∗0,Sp) has a big impact on the number of faint spirals, which can therefore be more easily distinguished. It is not the case for the other parameter: a large change in log(𝜙∗0,E) is needed to considerably alter the number of elliptical galaxies, which equates to greater uncertainty on this parameter.

Comparison of Figs. 19 and 20 shows that while the posterior distributions of the spirals have a similar dispersion for the fake and observed data, the posterior distributions of the elliptical population are narrower for the observed data than for the fake data. This is an effect of the small number of ellipticals in an image: with a small value for log(𝜙∗0,E), there are very few elliptical galaxies in the image, and this lack of a statistically significant sample increases the width of the posterior due to the lack of available information. In comparison, the posterior is narrower for the observed data because more elliptical galaxies appear to be present: the peak of the joint posterior is at log(𝜙∗0,E) = −1.…, compared to a lower value of log(𝜙∗0,E) = −2.… for the fake data (see Tables 4 and 5). The summaries provided by the network encode information about the relation of the number of ellipticals and
Fig. 16.
RGB images using the 𝑔 ′ , 𝑟 ′ , 𝑖 ′ filters for 5 random simulations of 3.17 arcmin at the fiducial luminosity function parameters of Table 3,used to validate the method. Table 4.
1D confidence intervals for the 5 sub-images of the fake data and the initial values at which the fake data was generated.
                            Initial    68%               95%               99%
fake 1   log(𝜙∗0,E)    -2.09    [-3.20, -1.63]    [-3.86, -1.23]    [-3.99, -0.93]
         log(𝜙∗0,Sp)   -2.04    [-2.11, -1.93]    [-2.23, -1.86]    [-2.45, -1.86]
fake 2   log(𝜙∗0,E)    -2.09    [-2.86, -1.94]    [-3.67, -1.76]    [-4.00, -1.59]
         log(𝜙∗0,Sp)   -2.04    [-2.09, -1.98]    [-2.16, -1.93]    [-2.31, -1.86]
fake 3   log(𝜙∗0,E)    -2.09    [-2.78, -1.97]    [-3.56, -1.76]    [-3.93, -1.75]
         log(𝜙∗0,Sp)   -2.04    [-2.10, -2.00]    [-2.15, -1.96]    [-2.21, -1.91]
fake 4   log(𝜙∗0,E)    -2.09    [-3.15, -1.60]    [-3.94, -1.27]    [-3.99, -0.85]
         log(𝜙∗0,Sp)   -2.04    [-2.11, -1.92]    [-2.26, -1.85]    [-2.92, -1.85]
fake 5   log(𝜙∗0,E)    -2.09    [-2.60, -1.76]    [-3.50, -1.58]    [-3.95, -1.58]
         log(𝜙∗0,Sp)   -2.04    [-2.09, -1.99]    [-2.17, -1.95]    [-2.23, -1.88]
chain-rule joint
         log(𝜙∗0,E)    -2.09    [-2.55, -1.76]    [-3.40, -1.54]    [-3.90, -1.51]
         log(𝜙∗0,Sp)   -2.04    [-2.08, -1.99]    [-2.14, -1.94]    [-2.20, -1.91]

Note: parameter values are given for 𝐻 = 100 km s⁻¹ Mpc⁻¹ and for 𝑧 = 0 (𝜙∗ is in units of Mpc⁻³ mag⁻¹).
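The interval columns of Table 4 can be read off posterior samples by simple quantile cuts. A minimal sketch, assuming equal-tailed intervals (the paper does not state whether its intervals are equal-tailed or highest-density, so this is one plausible convention) on a hypothetical Gaussian stand-in for the 1D PMC samples:

```python
import numpy as np

def credible_interval(samples, level):
    """Equal-tailed interval containing a fraction `level` of the samples."""
    half_tail = (1.0 - level) / 2.0
    return np.quantile(samples, half_tail), np.quantile(samples, 1.0 - half_tail)

# Toy posterior samples for log(phi*_{0,Sp}) (hypothetical Gaussian stand-in).
rng = np.random.default_rng(1)
samples = rng.normal(-2.04, 0.05, size=100_000)

# The three levels reported in Tables 4 and 5.
intervals = {level: credible_interval(samples, level) for level in (0.68, 0.95, 0.99)}
```

The intervals are nested by construction, mirroring the widening of the 68%, 95%, and 99% columns across the tables.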
Fig. 17.
RGB images using the 𝑔′, 𝑟′, 𝑖′ filters for 5 random regions of 3.17 arcmin within the 1 deg² CFHTLS D1 deep field that are used to infer the luminosity function parameters of the elliptical and spiral galaxies, namely the logarithm of their amplitudes log(𝜙∗0,E) and log(𝜙∗0,Sp). Table 5.
1D confidence intervals for the 5 sub-images of the D1 observed data and the joint data.
                       68%               95%               99%
real 1   log(𝜙∗0,E)    [-1.75, -1.47]    [-1.94, -1.35]    [-2.12, -1.29]
         log(𝜙∗0,Sp)   [-2.08, -1.95]    [-2.17, -1.89]    [-2.31, -1.83]
real 2   log(𝜙∗0,E)    [-2.01, -1.60]    [-2.31, -1.43]    [-2.89, -1.26]
         log(𝜙∗0,Sp)   [-2.08, -1.93]    [-2.17, -1.85]    [-2.27, -1.53]
real 3   log(𝜙∗0,E)    [-1.92, -1.58]    [-2.16, -1.47]    [-2.51, -1.33]
         log(𝜙∗0,Sp)   [-2.17, -2.02]    [-2.24, -1.96]    [-2.46, -1.96]
real 4   log(𝜙∗0,E)    [-2.03, -1.63]    [-2.30, -1.47]    [-2.76, -1.31]
         log(𝜙∗0,Sp)   [-2.06, -1.94]    [-2.15, -1.89]    [-2.22, -1.84]
real 5   log(𝜙∗0,E)    [-1.93, -1.54]    [-2.28, -1.34]    [-2.90, -1.24]
         log(𝜙∗0,Sp)   [-2.13, -1.96]    [-2.23, -1.88]    [-2.44, -1.73]
chain-rule joint
         log(𝜙∗0,E)    [-1.84, -1.58]    [-2.05, -1.48]    [-2.39, -1.38]
         log(𝜙∗0,Sp)   [-2.10, -2.00]    [-2.15, -1.96]    [-2.23, -1.92]

Note: parameter values are given for 𝐻 = 100 km s⁻¹ Mpc⁻¹ and for 𝑧 = 0 (𝜙∗ is in units of Mpc⁻³ mag⁻¹).
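The PMC refinement that replaces fixed-ε rejection can be sketched as follows. The toy simulator, population size, number of iterations, and the median-distance rule for shrinking the threshold are all illustrative choices, and the importance reweighting of a full PMC (each particle weighted by prior density over kernel mixture density) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
observed = np.array([-2.0, -2.0])             # toy compressed data summaries

def simulate(theta):
    # Hypothetical stand-in for the forward model + network compression.
    return theta + rng.normal(0.0, 0.1, size=2)

def distance(theta):
    return np.linalg.norm(simulate(theta) - observed)

N = 200                                       # particles kept per iteration (illustrative)
pop = rng.uniform(-4.0, -1.0, size=(N, 2))    # iteration 0: draws from the uniform prior
dist = np.array([distance(t) for t in pop])
eps_history = []

for _ in range(4):                            # PMC iterations
    eps = np.median(dist)                     # shrink the acceptance threshold
    eps_history.append(eps)
    cov = 2.0 * np.cov(pop.T)                 # perturbation kernel from the population
    new_pop, new_dist = [], []
    while len(new_pop) < N:
        t = pop[rng.integers(N)] + rng.multivariate_normal(np.zeros(2), cov)
        if ((t < -4.0) | (t > -1.0)).any():
            continue                          # reject proposals outside the prior support
        d = distance(t)
        if d < eps:                           # accept only simulations closer to the data
            new_pop.append(t)
            new_dist.append(d)
    pop, dist = np.array(new_pop), np.array(new_dist)
```

Because each new population is accepted under the previous threshold, the median distance (and hence ε) decreases monotonically, which is the "automatic approach to the posterior" contrasted with plain ABC in the text.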
Fig. 18. Results of the ABC procedure. Bottom left: the parameter values corresponding to 5000 simulations (dots) are drawn from our random uniform prior (orange) of Eq. (35). The colored points are those with a small distance 𝜌 (Eq. (30)) to the observed data (second image of Fig. 17, from a CFHTLS Deep field): the closest points in blue, green, and red, for successively smaller acceptance thresholds. Top left and bottom right: marginalized distributions of the distance selections in the bottom-left panel. The points with the smallest distances appear to be around (log(𝜙∗0,E), log(𝜙∗0,Sp)) = (−2, −2). One can see that the number of accepted points decreases and is more concentrated when the acceptance threshold for the distance is low, which is reflected in the corresponding 1D posteriors: with this ABC procedure (contrary to the PMC), the shape of the posterior is very sensitive to the acceptance. In effect, we should increase the number of simulations whilst decreasing the size of the 𝜖-ball of the points that are kept until we reach a stationary distribution which approximates the posterior. However, this is very expensive due to sampling from the entire prior distribution, where most of the samples are not representative of the observed data.

…spirals in the images to the model parameters, and therefore can be used for the inference of this property using the PMC. One can note that for the observed data sub-image labeled as "real 3" in Fig. 20, the 1D confidence interval of the parameter log(𝜙∗0,Sp) is slightly displaced compared to the other observed sub-images, moreover towards lower densities of spirals. This can be explained by the presence of a large spiral galaxy in this observed image in Fig. 17. This galaxy covers ∼100 × 100 pixels (∼1% of the full sub-image), which reduces the amount of informative data with which to correlate the number of detected spirals; they therefore seem fewer because of the smaller area in which they are visible. This illustrates the fact that the inference is correct even if the PMC procedure happens to be biased because of statistically anomalous data. This is the reason why we use 5 sub-images of the D1 deep field, with the goal of improving the robustness of the results.

Fig. 19.
Posterior distributions of the two parameters log(𝜙∗0,E) and log(𝜙∗0,Sp) for the 5 images of fake data (with different colors from blue to purple) and for the joint-PMC (black) described in Sect. 7.6. The 68%, 95%, and 99% contours of the joint-PMC are plotted in black, grey, and light grey respectively in the bottom left panel. The parameters used to generate the fake data are indicated with red dashed lines. The 5 individual 1D posteriors for each image peak in the same region of parameter values, which indicates that the most likely parameters are consistent among the 5 images. The deviation between the different posteriors arises from the fact that these fields are stochastically sampled from a random process and so statistical differences exist in the data. The joint posterior is tighter and shows how likely the parameters would be if we considered the 5 images simultaneously.

Because we only have individual posteriors for each sub-image, we need to combine them together to derive a unique posterior. Unfortunately, one cannot combine posterior chains in a simple way. A rigorous way to achieve such a posterior is to use the Bayesian chain rule:

p(θ | X₁, …, Xₙ) ∝ p(θ) p(X₁ | θ) p(X₂ | X₁, θ) ⋯ p(Xₙ | X₁, …, Xₙ₋₁, θ)   (36)

where θ is a set of parameters and X₁, …, Xₙ are the n individual pieces of data. The chain rule allows one to obtain the posteriors sequentially: ∀ i ∈ {1, …, n−1}, let us suppose we have obtained the posterior distribution (via ABC or PMC) p(θ | X₁, …, Xᵢ); then:
1. Consider p(θ | X₁, …, Xᵢ), derived from the pieces of data X₁, …, Xᵢ, as an approximate proposal distribution for θ.
2. Use that proposal distribution of θ for the inference, using ABC or PMC, given the new piece of data X_{i+1}.
3. Derive a new posterior distribution p(θ | X₁, …, X_{i+1}), which can be used in turn as the proposal distribution for the next piece of data.

We run a parallelized version of such a sequential PMC for the 5 sub-images, both for the fake and observed data. The resulting joint posteriors are shown as black lines (called 'KDE joint') in Figs. 19 and 20 and can be compared to the posterior distributions obtained on the individual sub-images. In these figures, the 68%, 95%, and 99% contours are drawn using the joint posterior in black, grey, and light grey respectively.

Fig. 20. Posterior distributions of the two parameters log(𝜙∗0,E) and log(𝜙∗0,Sp) for the 5 images of observed data (blue to purple) and for the joint-PMC (black) described in Sect. 7.6. The 68%, 95%, and 99% contours of the joint-PMC are plotted in black, grey, and light grey respectively in the bottom left panel. The 5 individual 1D posteriors for each sub-image peak in the same region of parameter values, which indicates that the most likely parameters are consistent among the 5 images. The differences in the posteriors obtained from the different images come from the fact that the observed data come from different patches of the sky with statistically different amounts of information in the patches due to their independent environments. The joint posterior is tighter and shows how likely the parameters would be if we considered the 5 images simultaneously.

The 1D 68%, 95%, and 99% confidence intervals for the joint posteriors are listed at the end of Table 4 for the fake data and of Table 5 for the observed data. One can note that the confidence intervals are narrower for the joint posteriors because more information is available, therefore allowing us to draw tighter constraints on the parameter values. Despite a different sample and no catalog extraction, when comparing our results to López-Sanjuan et al. (2017), we obtain for the dominant spiral population a 1-σ confidence interval comparable to theirs.

Here we compare our results with other measurements of the luminosity functions for the elliptical (or red, or quiescent) galaxy population and for the spiral (or blue, or star-forming) population obtained from deep multi-band galaxy surveys: López-Sanjuan et al. (2017), Brown et al. (2007), Beare et al. (2015), Salimbeni et al. (2008), Drory et al. (2009), and Faber et al. (2007).
In these studies, the two populations are identified either:
– by comparing the observed magnitudes in different bands,
– by finding the best fit SEDs to each galaxy.
These methods are affected by several biases because of the catalog extraction process (see Sect. 1) and because of the choice of threshold used to split galaxies into red or blue.

Fig. 21.
The inferred luminosity functions derived for the elliptical (top) and the spiral (bottom) populations using the confidence intervals are drawn in blue. Our results are compared with the results of López-Sanjuan et al. (2017) in green, Brown et al. (2007) in pink, Beare et al. (2015) in red, Salimbeni et al. (2008) in brown, Drory et al. (2009) in purple, and Faber et al. (2007) in orange. Our 𝑔′ magnitudes are converted into the Johnson B-band absolute magnitude, at redshift 𝑧 = 0 and for 𝐻 = 100 km s⁻¹ Mpc⁻¹, using the 𝐵 − 𝑔′ colors for ellipticals and spirals listed in Table 7 of Fukugita et al. (1995). These graphs show that our derived luminosity function for the spiral population agrees well with the other existing surveys, whereas there is some tension among the elliptical luminosity functions, due to small-number statistics and a one-component luminosity function here versus two components for those with a rising faint end.

Fig. 21 shows the luminosity function using the confidence intervals of the joint posterior over the 5 sub-images of the CFHTLS+WIRDS D1 field in blue for both the elliptical (top) and spiral population (bottom), compared to the results from the other articles. Fig. 21 shows a good agreement between our results and the other surveys for the spiral population (bottom panel), for which the network was able to precisely constrain the luminosity function (narrow confidence interval). The top panel of Fig. 21 shows that for the elliptical population, there are larger differences between all analyses including ours. This is likely partly due to Poisson noise: the number density of galaxies in the elliptical population is typically a factor of 10 to 100 smaller than for the spiral population in a given magnitude bin, introducing more dispersion. Moreover, as pointed out in several of the quoted analyses, the chosen samples of red/blue galaxies contain misidentified galaxies, which affects the smaller elliptical population more.

The top panel of Fig. 21 also indicates evidence for an excess of red faint galaxies: López-Sanjuan et al. (2017), Drory et al. (2009), and Salimbeni et al. (2008) actually use the sum of two Schechter functions to better model the luminosity function of the elliptical population (and its rising faint end). We suspect that this apparent excess of faint elliptical galaxies, to which our method is in principle quite sensitive, tends to "pull" our single-Schechter luminosity function for the elliptical population upwards, so that it is systematically higher than the luminosity functions derived by Brown et al. (2007), Beare et al. (2015), and Faber et al. (2007). In that case, our limited model of two galaxy populations with a single evolving Schechter luminosity function for each population is too coarse. Nevertheless, the rise of the elliptical luminosity function at the faint end could be caused by some of the faint spiral galaxies, which are numerous, being categorized as red galaxies by some source extraction method. Conversely, some of the elliptical galaxies in the other analyses could have been misclassified by some source extraction method as blue galaxies, leading to a systematically too low amplitude for the elliptical luminosity function.
Also, dusty star-forming galaxies (which have red colors) may be modeled as ellipticals in our analysis (due to the resemblance of the strong starburst to a bulge-dominated object in the image), whereas they may be classified as spirals in the other analyses. Both effects would affect only the luminosity function of elliptical galaxies, as spiral types vastly dominate in number over the elliptical types.
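The sequential chain-rule combination of Eq. (36) can be checked on a conjugate Gaussian toy model (entirely hypothetical: five noisy "sub-image" measurements of a single parameter with known noise). Feeding each posterior in as the prior for the next piece of data reproduces the posterior from all five measurements at once, and tightens the constraint, as observed for the joint-PMC in Figs. 19 and 20:

```python
import numpy as np

rng = np.random.default_rng(3)
true_theta, sigma = -2.0, 0.2                       # toy parameter and per-measurement noise
data = true_theta + rng.normal(0.0, sigma, size=5)  # one summary per "sub-image"

# Sequential update: the posterior after X_1..X_i is the prior for X_{i+1}.
mu, tau2 = 0.0, 100.0                               # broad initial prior N(mu, tau2)
for x in data:
    post_var = 1.0 / (1.0 / tau2 + 1.0 / sigma**2)  # conjugate normal update
    mu = post_var * (mu / tau2 + x / sigma**2)
    tau2 = post_var

# Joint update using all 5 measurements at once: same answer by Eq. (36).
joint_var = 1.0 / (1.0 / 100.0 + len(data) / sigma**2)
joint_mu = joint_var * (0.0 / 100.0 + data.sum() / sigma**2)
```

In the paper the sequential step is carried out by PMC with the previous posterior as the proposal distribution; the Gaussian toy only illustrates why the ordering of the sub-images does not matter for conditionally independent data.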
8. Conclusions and perspectives
In this paper, we introduce a novel method to infer robust constraints on the luminosity functions of elliptical and spiral galaxies using a massive compression of panchromatic images through a neural network and likelihood-free (simulation-based) inference. This approach directly analyzes multi-band deep-field pixel maps using neural networks and performs likelihood-free inference without the need for any source catalog. The use of simulated images in similar "observing" conditions as the data, taking into account the complex selection biases that affect the survey images (see Sect. 1), as a central part of the inference process, and the direct comparison of the simulations to the observed CFHTLS deep field, is the key contribution of our approach.

In this article, synthetic populations of elliptical and spiral galaxies are simulated using a forward model of galaxy evolution, sampled from luminosity functions for each population and decomposed into their bulge and disk via a set of SEDs. The forward model is made even more realistic by adding the internal extinction by dust of each component of the galaxies, the reddening caused by the Milky Way, and the stars from the Besançon stellar model. The
SkyMaker software is then used to paint these simulated galaxies and stars on realistic panchromatic images in the optical 𝑢′, 𝑔′, 𝑟′, 𝑖′, 𝑧′ MegaPrime filters and the near-infrared 𝐽, 𝐻, 𝐾s WIRCam filters, making sure to reproduce the instrumental selection effects of the CFHTLS and WIRCam images (the parameters of the forward model are summarized in Table 6). The images simulated through this process are then used to train a fully-convolutional Inception network to extract information about the parameters of each luminosity function. The network is trained to maximize the Fisher information of the images and enables a drastic reduction of the dimensionality of the images down to the number of parameters of the model. Contrary to usual deep learning approaches, training the neural network with only 400 images (200 for the fiducial parameters, and 200 for the derivatives) is sufficient to obtain very good results. After the network is trained on simulated deep fields, ABC/PMC procedures are run, starting from a uniform prior, to infer the parameters of the luminosity functions of data images. The approach is applied both to fake simulated images and to sub-images of the observed D1 deep field. We have shown that, on the fake data, we can robustly infer the parameters used to generate the simulation: log(𝜙∗) and 𝑀∗ for a spiral population in Sect. 6, and log(𝜙∗) for spiral and elliptical galaxies in Sect. 7. We also developed a joint-PMC procedure in order to infer the parameters using multiple sub-images at once. This method proves its efficiency in constraining the non-trivial and correlated parameters of the two luminosity functions of the elliptical and spiral populations of galaxies: the amplitude and the characteristic magnitude.
Using the likelihood-free inference and the compressed network summaries, we were able to infer possible input values of the parameters for simulated data, as well as the values of the model parameters describing the generative process for sub-images of the observed CFHTLS D1 data. The derived luminosity functions are in good agreement with those for the elliptical (or red, or quiescent) galaxy population and for the spiral (or blue, or star-forming) population obtained from deep multi-band galaxy surveys by López-Sanjuan et al. (2017), Brown et al. (2007), Beare et al. (2015), Salimbeni et al. (2008), Drory et al. (2009), and Faber et al. (2007), especially for the spiral population. For the elliptical population, information about the excess of faint galaxies, which some authors model as the sum of two Schechter functions, was encoded via the IMNN and yields a higher amplitude for this galaxy type.

Here we have illustrated the method using only two parameters of the luminosity functions of two galaxy types, but in principle, the usual 5 parameters of evolving luminosity functions (amplitude, characteristic magnitude, faint-end slope, amplitude evolution, and characteristic magnitude evolution) could be simultaneously inferred from observed data. Further realism in the simulations is still required, including all galaxy types that significantly contribute to the observed deep fields, like lenticular galaxies; it is not clear whether a population of irregular galaxies should be considered; but dividing the spiral galaxies into early- and late-type spirals (with a large bulge-to-total ratio for Sa-Sb types, and a small ratio for Sc-Sd types) may also bring useful information if these types evolve differently from redshift 𝑧 = 1. In the model used here, a galaxy is decomposed as a bulge and a disk, which is too simplistic for dusty galaxies. F. Livet is currently refining the model by decomposing the disk as a thin and a thick disk to better model the color gradients of spiral galaxies. Our current forward model uses 2 non-evolving SEDs of Coleman et al. (1980) to model the bulges and disks of spiral galaxies, but it can be improved with evolving scenarios of galaxies, and of their bulges and disks, such as those of the Pegase model of Fioc & Rocca-Volmerange (2019).

In any case, the IMNN compresses the input to the same dimension as the parameter space and, thanks to this extreme compression, one can potentially investigate a large number of physical parameters. The limiting step would be that the ABC/PMC procedure might lead to a very long exploration of the parameter space due to its high dimensionality, but we could use the pydelfi approach of Alsing et al. (2019) to explore other likelihood-free inference techniques which are less expensive in terms of the number of simulations necessary. Obviously, the computing time will be the limiting factor for a realistic multi-population analysis of one full CFHTLS Deep field.
Contributions
Florian Livet wrote the bulk of the paper, upgraded and opti-mized the code for the generation of the forward simulations toa more realistic version (dust internal extinction, galaxies at thesame locations), contributed to the development of the IMNN(penalized loss), wrote and optimized the inception architecture(with Tensorflow), developed a parallelized version of the PMC(with Docker environment and Tensorflow-Serving) and adaptedthe PMC technique to obtain the joint-posterior (Bayesian chainrule). He also ran the analysis of the parameters of the forwardmodel and their optimization to train the inception networks inboth applications, and he compared the results to other measure-ments from deep surveys. Tom Charnock led the statistical anal-ysis and the code development for the IMNN and contributed toall the statistical analyses and methods involved in the IMNN andPMC, and to the writing of the article. He was integral in substan-tiating technical and methodological concepts in the article andprovided both practical and theoretical expertise on the IMNNand PMC. Valérie de Lapparent and Damien Le Borgne are bothsupervisors of the PhD of Florian Livet. They provided practicaland theoretical expertise on the astrophysical and statistical as-pects of the analysis, and contributed to the writing of the article.Damien Le Borgne provided several Python codes and his codingexpertise to Florian Livet. He also generated the figures with theobserved and predicted galaxy and stellar number counts.
Acknowledgments
The authors thank Emmanuel Bertin for managing the Morpho cluster used to generate all the simulations and perform all the network trainings, for his available programs (Stuff and SkyMaker), and for his help in creating a Docker environment. This work uses the available final releases of the TERAPIX processed data: T0007 for CFHTLS (Hudelot et al. 2012) and T0002 for WIRDS (Bielby et al. 2012). Our forward model was extended using the stellar distributions of the Besançon model (Robin et al. 2003) web service. TC acknowledges financial support from the Sorbonne Univ. Emergence fund, 2019-2020.
References
Abazajian, K. N., Adelman-McCarthy, J. K., Agüeros, M. A., et al. 2009, ApJS, 182, 543
Akeret, J., Refregier, A., Amara, A., Seehars, S., & Hasner, C. 2015, J. Cosmology Astropart. Phys., 2015, 043
Alsing, J., Charnock, T., Feeney, S., & Wandelt, B. 2019, MNRAS, 488, 4440
Alsing, J. & Wandelt, B. 2018, MNRAS, 476, L60
Amôres, E. B., Robin, A. C., & Reylé, C. 2017, A&A, 602, A67
Beare, R., Brown, M. J. I., Pimbblet, K., Bian, F., & Lin, Y.-T. 2015, ApJ, 815, 94
Bernardi, M., Fischer, J. L., Sheth, R. K., et al. 2017, MNRAS, 468, 2569
Bertin, E. 2009, Mem. Soc. Astron. Italiana, 80, 422
Bertin, E. 2010, Stuff: Simulating "Perfect" Astronomical Catalogues
Bertin, E. 2011, Astronomical Society of the Pacific Conference Series, Vol. 442, Automated Morphometry with SExtractor and PSFEx, ed. I. N. Evans, A. Accomazzi, D. J. Mink, & A. H. Rots, 435
Bertin, E. & Arnouts, S. 1996, A&AS, 117, 393
Bertin, E. & Arnouts, S. 2010, SExtractor: Source Extractor
Bielby, R., Hudelot, P., McCracken, H. J., et al. 2012, A&A, 545, A23
Bienaymé, O., Robin, A. C., & Famaey, B. 2015, A&A, 581, A123
Binggeli, B., Sandage, A., & Tarenghi, M. 1984, AJ, 89, 64
Brown, M. J. I., Dey, A., Jannuzi, B. T., et al. 2007, ApJ, 654, 858
Calvi, V., Stiavelli, M., Bradley, L., Pizzella, A., & Kim, S. 2014, ApJ, 796, 102
Carassou, S., de Lapparent, V., Bertin, E., & Le Borgne, D. 2017, A&A, 605, A9
Charnock, T., Lavaux, G., & Wandelt, B. D. 2018, Phys. Rev. D, 97, 083004
Chevallard, J., Charlot, S., Wandelt, B., & Wild, V. 2013, MNRAS, 432, 2061
Cireşan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. 2010, Neural Computation, 22, 3207, pMID: 20858131
Cisewski-Kehe, J., Weller, G., & Schafer, C. 2019, arXiv e-prints, arXiv:1904.11306
Coleman, G. D., Wu, C. C., & Weedman, D. W. 1980, ApJS, 43, 393
Condon, J. J. 1974, ApJ, 188, 279
de Jong, R. S. & Lacey, C. 2000, ApJ, 545, 781
de Vaucouleurs, G. 1953, MNRAS, 113, 134
Del Moral, P., Doucet, A., & Jasra, A. 2012, Statistics and Computing, 22, 1009
Drory, N., Bundy, K., Leauthaud, A., et al. 2009, ApJ, 707, 1595
Eddington, A. S. 1913, MNRAS, 73, 359
Faber, S. M., Willmer, C. N. A., Wolf, C., et al. 2007, ApJ, 665, 265
Fioc, M. & Rocca-Volmerange, B. 2019, A&A, 623, A143
Fitzpatrick, E. L. & Massa, D. 1990, ApJS, 72, 163
Fitzpatrick, E. L. & Massa, D. 2007, ApJ, 663, 320
Fukugita, M., Shimasaku, K., & Ichikawa, T. 1995, PASP, 107, 945
Heavens, A. F., Jimenez, R., & Lahav, O. 2000, MNRAS, 317, 965
Hogg, D. W., Baldry, I. K., Blanton, M. R., & Eisenstein, D. J. 2002, arXiv e-prints, astro-ph
Hogg, D. W., Blanton, M. R., Eisenstein, D. J., et al. 2003, ApJ, 585, L5
Hudelot, P., Cuillandre, J. C., Withington, K., et al. 2012, VizieR Online Data Catalog, II/317
Hwang, C.-L., Masud, A. S. M. M., & Paidy, S. R. 1979, Multiple objective decision making: methods and applications: a state-of-the-art survey (Berlin: Springer)
Kacprzak, T., Herbel, J., Amara, A., & Réfrégier, A. 2018, J. Cosmology Astropart. Phys., 2018, 042
Kingma, D. P. & Ba, J. 2014, arXiv e-prints, arXiv:1412.6980
Krizhevsky, A., Sutskever, I., & Hinton, G. E. 2012, in Advances in Neural Information Processing Systems 25, ed. F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Curran Associates, Inc.), 1097–1105
Lehmann, E. L. & Casella, G. 1998, Theory of Point Estimation, 2nd edn. (New York, NY, USA: Springer-Verlag)
Lilly, S. J., Tresse, L., Hammer, F., Crampton, D., & Le Fevre, O. 1995, ApJ, 455, 108
López-Sanjuan, C., Tempel, E., Benítez, N., et al. 2017, A&A, 599, A62
Madau, P., Ferguson, H. C., Dickinson, M. E., et al. 1996, MNRAS, 283, 1388
Malmquist, K. G. 1922, Meddelanden fran Lunds Astronomiska Observatorium Serie I, 100, 1
Malmquist, K. G. 1925, Meddelanden fran Lunds Astronomiska Observatorium Serie I, 106, 1
Marzke, R. O. 1998, Astrophysics and Space Science Library, Vol. 231, The Galaxy Luminosity Function at Zero Redshift: Constraints on Galaxy Formation, ed. D. Hamilton, 23
Oh, K.-S. & Jung, K. 2004, Pattern Recognition, 37, 1311
Pearson, C. P., Serjeant, S., Oyabu, S., et al. 2014, MNRAS, 444, 846
Popescu, C. C., Tuffs, R. J., Dopita, M. A., et al. 2011, A&A, 527, A109
Robin, A. C., Marshall, D. J., Schultheis, M., & Reylé, C. 2012, A&A, 538, A106
Robin, A. C., Reylé, C., Derrière, S., & Picaud, S. 2003, A&A, 409, 523
Robin, A. C., Reylé, C., Fliri, J., et al. 2014, A&A, 569, A13
Rosenblatt, F. 1962, Principles of neurodynamics. Perceptrons and the theory of brain mechanisms (Washington DC, USA: Spartan Books)
Rubin, D. B. 1984, The Annals of Statistics, 12, 1151
Salimbeni, S., Giallongo, E., Menci, N., et al. 2008, A&A, 477, 763
Sandage, A., Freeman, K. C., & Stokes, N. R. 1970, ApJ, 160, 831
Schechter, P. 1976, ApJ, 203, 297
Szegedy, C., Liu, W., Jia, Y., et al. 2015, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. 2015, arXiv e-prints, arXiv:1512.00567
Taghizadeh-Popp, M., Fall, S. M., White, R. L., & Szalay, A. S. 2015, ApJ, 801, 14
Tolman, R. C. 1934, Relativity, Thermodynamics, and Cosmology
Tortorelli, L., Fagioli, M., Herbel, J., et al. 2020, arXiv e-prints, arXiv:2001.07727
Trujillo, I., Förster Schreiber, N. M., Rudnick, G., et al. 2006, ApJ, 650, 18
Weyant, A., Schafer, C., & Wood-Vasey, W. M. 2013, ApJ, 764, 116
Williams, R. J., Quadri, R. F., Franx, M., et al. 2010, ApJ, 713, 738
Zucca, E., Ilbert, O., Bardelli, S., et al. 2006, A&A, 455, 879
Table 6.