A Composite Likelihood Approach for Inference under Photometric Redshift Uncertainty

M. M. Rau, C. B. Morrison, S. J. Schmidt, S. Wilson, R. Mandelbaum, Y.-Y. Mao, for the LSST Dark Energy Science Collaboration

McWilliams Center for Cosmology, Department of Physics, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Department of Astronomy, University of Washington, Box 351580, Seattle, WA 98195, USA
School of Computer Science and Statistics, Lloyd Institute, Trinity College, Dublin, Ireland
Department of Physics, University of California, Davis, CA 95616, USA
Department of Physics and Astronomy, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA

MNRAS, 1-22 (2015). Preprint 6 January 2021. Compiled using MNRAS LaTeX style file v3.0.
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Obtaining accurately calibrated redshift distributions of photometric samples is one of the great challenges in photometric surveys like LSST, Euclid, HSC, KiDS, and DES. We combine the redshift information from the galaxy photometry with constraints from two-point functions, utilizing cross-correlations with spatially overlapping spectroscopic samples. Our likelihood framework is designed to integrate directly into a typical large-scale structure and weak lensing analysis based on two-point functions. We discuss efficient and accurate inference techniques that allow us to scale the method to the large samples of galaxies to be expected in LSST. We consider statistical challenges like the parametrization of redshift systematics, discuss and evaluate techniques to regularize the sample redshift distributions, and investigate techniques that can help to detect and calibrate sources of systematic error using posterior predictive checks. We evaluate and forecast photometric redshift performance using data from the CosmoDC2 simulations, within which we mimic a DESI-like spectroscopic calibration sample for cross-correlations. Using a combination of spatial cross-correlations and photometry, we show that we can provide calibration of the mean of the sample redshift distribution to an accuracy consistent with the LSST-Y1 science requirements for weak lensing and large-scale structure probes.

Key words: keyword1 - keyword2 - keyword3
With ongoing and future large area photometric surveys like the Dark Energy Survey (DES; e.g., Abbott et al. 2018b), the Kilo-Degree Survey (KiDS; e.g., Hildebrandt et al. 2017), the Hyper Suprime-Cam survey (HSC; e.g., Aihara et al. 2018), the Rubin Observatory Legacy Survey of Space and Time (LSST; e.g., Ivezić et al. 2019), the Roman Space Telescope (e.g., Spergel et al. 2015) and Euclid (e.g., Laureijs et al. 2011), modern cosmology has entered the era of precision cosmology, where it becomes increasingly important to accurately account for sources of systematic bias and uncertainty (e.g., Mandelbaum 2018). Large area photometric surveys constrain cosmological parameters and the growth of structure using two-point statistics of galaxy and shear fields (see e.g. Hildebrandt et al. 2017; Uitert et al. 2017; Abbott et al. 2018a; Joudaki et al. 2018; Hikage et al. 2019; Heymans et al. 2020). Using only the broadband photometry of galaxies allows for a limited accuracy in the estimated redshifts. In photometric surveys, we therefore typically consider two-point statistics of density fields that have been projected along the line-of-sight, i.e., in the redshift direction. They are then compared with the corresponding weak lensing (WL) and large scale structure (LSS) theory predictions in a likelihood framework. These theory predictions have to account for the line-of-sight projection, and therefore depend on the redshift distributions of the galaxies in the sample, which have to be accurately modelled and calibrated (see e.g. Huterer et al. 2006; Hoyle et al. 2018; Tanaka et al. 2018; Hildebrandt et al. 2020; Joudaki et al. 2020).

A primary goal of large area photometric survey programs is to map the growth of structure and expansion history of the Universe, and thereby constrain the dark energy equation of state via the distance-redshift and growth-redshift relations (see e.g., Albrecht et al. 2006, p.
31), which both enter the WL and LSS modelling. Note that these fundamental relationships within our cosmological model are redshift dependent, as are some key sources of theoretical uncertainty, such as the galaxy-dark matter bias model (see e.g. Matarrese et al. 1997; Clerkin et al. 2015; Chang et al. 2016; Simon & Hilbert 2018; Prat et al. 2018). The inferred ensemble redshift distributions for samples of galaxies can therefore exhibit a degeneracy with cosmological or astrophysical parameters. Inaccurate distance, or redshift, measurements based on the photometry of the galaxies are therefore important modelling systematics in these surveys (e.g. Ma et al. 2006; Bernstein & Huterer 2010). We therefore must exploit all data sources that have the potential to break these degeneracies to perform efficient and accurate inference.

The two methods available to constrain the redshifts of galaxies in the absence of accurate spectroscopic measurements are "template fitting" methods and empirical methods that "learn" the mapping between photometry and redshift (for a recent review, see Salvato et al. 2019). SED fitting methods fit the galaxy photometry with models of the galaxy spectral energy distribution (SED; e.g., Arnouts et al. 1999; Benítez 2000; Ilbert et al. 2006; Feldmann et al. 2006; Greisel et al. 2015; Leistedt et al. 2016; Malz & Hogg 2020). Machine learning-based methods infer photometric redshifts by constructing a density estimate for the conditional distribution of the galaxy redshifts given their photometry (Tagliaferri et al. 2003; Collister & Lahav 2004; Gerdes et al. 2010; Carrasco Kind & Brunner 2013; Bonnett 2015; Rau et al. 2015; Hoyle 2016). Combinations of both these techniques have also been investigated (Speagle & Eisenstein 2015; Hoyle et al. 2015). Unfortunately the accuracy of these techniques is limited, since they suffer from different sources of systematic error.
Template fitting approaches can be systematically biased if fits are constructed using sets of spectral energy distributions that are not representative of all galaxies in the sample. In contrast, photometric redshift techniques that require a training set can produce systematically biased results due to incomplete spectroscopic training samples. It is particularly difficult to obtain representative spectroscopic data due to the long exposure times that are necessary to obtain accurate spectroscopic redshifts for faint sources (see e.g. Huterer et al. 2014; Newman et al. 2015).

Instead of inferring photometric redshifts by fitting models for the spectral energy distribution, we can also infer photometric redshift information using spatial cross-correlations between photometric samples and spectroscopic samples (e.g. Newman 2008; Ménard et al. 2013; McQuinn & White 2013; Scottez et al. 2016; Raccanelli et al. 2017; Morrison et al. 2017; Davis et al. 2017; Gatti et al. 2018). Cross-correlation methods measure the spatial cross-correlation between a reference sample with accurate redshift information, typically a spectroscopic galaxy catalog, and photometric samples that do not have accurate redshift information. Ignoring cosmic magnification effects (see e.g. Scranton et al. 2005), the expected spatial cross-correlation is only nonzero for samples at the same redshift. By cross-correlating subsamples of the spectroscopic sample that are selected in thin redshift slices with these photometric catalogs and comparing the resulting signals, we can reconstruct the redshift distribution of the unknown photometric sample.

It is important to highlight the different sources of systematic uncertainty in these two approaches: the measurement of spatial cross-correlations requires that the sample with unknown redshift information and the reference sample overlap spatially and cover the same redshift range.
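The clustering-redshift idea sketched above can be illustrated with a deliberately idealized, noiseless toy in Python. All numbers below are invented, and the redshift-independent galaxy biases assumed here are exactly the quantities a real analysis must model with care:

```python
import numpy as np

# Idealized clustering-redshift toy: recover the redshift distribution
# n(z) of a photometric sample from its cross-correlation amplitudes
# with thin spectroscopic redshift slices, assuming constant bias.
z_grid = np.linspace(0.05, 1.95, 20)           # centers of thin spec-z slices
dz = z_grid[1] - z_grid[0]
true_nz = np.exp(-0.5 * ((z_grid - 0.8) / 0.3) ** 2)
true_nz /= true_nz.sum() * dz                  # normalize to unit integral

b_phot, b_spec, w_dm = 1.3, 1.8, 0.05          # constant biases, DM amplitude
w_cross = b_phot * b_spec * w_dm * true_nz     # noiseless signal per slice

# With redshift-independent prefactors, normalizing the cross-correlation
# amplitudes to unit integral recovers n(z) exactly.
nz_hat = w_cross / (w_cross.sum() * dz)
```

If either bias varies with redshift, the product of the two bias functions multiplies the estimate and is fully degenerate with n(z); this degeneracy is discussed in the text.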
However, the spectroscopic calibration sample does not have to cover the same color/magnitude space as the unknown photometric sample. It is, however, important to accurately model the redshift-dependent galaxy-dark matter bias of both the photometric sample and the spectroscopic calibration sample, since the redshift-dependent ratio between these two functions is completely degenerate with the photometric redshift distribution to be inferred. In contrast, template-based redshift inference requires a complete set of templates but no calibration sample. Checking a fitted model can also, in principle, use the color space alone, by comparing the photometry generated by the fitted templates with the measurements. In practice this approach has limitations. The generation of SED model templates is challenging and often requires spectroscopic reference data for some galaxies. Furthermore, degeneracies between galaxy type and galaxy redshift can make the aforementioned color-based approach ill-defined. Thus, while template fitting does not require spectroscopic data to infer redshifts of galaxies, in practice it is often necessary for building and evaluating models. Finally, empirical techniques that construct photometric redshift estimates by "learning" from a spectroscopic calibration dataset require reference data that does not have to spatially overlap, but needs to be representative in color-redshift space.

Besides spatial correlations of galaxy clustering, we can also use other two-point statistics from, e.g., weak gravitational lensing (e.g. Benjamin et al. 2013; Stölzner et al. 2020). There also exists a considerable literature on how photometric redshift uncertainty can be treated in the individual cosmological probes (McLeod et al. 2017; Hoyle & Rau 2019) or how one can combine template fitting and cross-correlation measurements (Alarcon et al. 2020b; Sánchez & Bernstein 2019; Jones & Heavens 2019; Rau et al. 2020). Shortly before this paper was submitted for publication, Myles et al.
(2020), Gatti et al. (2020) and Cawthon et al. (2020) presented the redshift inference scheme for the DES Y3 analyses, which combines a cross-correlation and shear-ratio data vector with redshift information derived using an empirical mapping of broad-band "Wide field" photometry to spatially smaller calibration fields with narrow-band photometric and spectroscopic redshift information.

This paper presents a composite likelihood approach to jointly constrain photometric redshift distributions using information from both the available photometry and the clustering of galaxies. We focus on statistical challenges in this inference. In particular, the parts of the model that utilize the photometry of galaxies can pose computational challenges, since the likelihood depends on measurements of all galaxies in the sample. We therefore derive an efficient methodology that facilitates inference of redshift distributions within this computationally expensive part of the model. Redshift inference based on noisy photometry is an inverse problem, and the inference scheme requires careful regularization to achieve good probability coverage. We therefore describe several regularization techniques and evaluate their respective merits in numerical experiments. Information from the spatial distribution of galaxies can then be incorporated within the composite likelihood framework by efficient MCMC sampling. We test our methodology using data from the CosmoDC2 (Korytov et al. 2019) simulated extragalactic catalog. While some of the inference techniques developed in this paper can also be used in the context of an empirical mapping to a small-area calibration field, our primary goal is to facilitate inference using physical SED modelling that utilizes a likelihood that jointly describes photometry and spatial information for all observed galaxies.
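Structurally, a composite likelihood of this kind multiplies (adds, in log space) a per-galaxy photometric term and a clustering term that share the same redshift-distribution parameters. The following sketch uses synthetic numbers and invented variable names; it illustrates only the structure, not the paper's actual likelihoods:

```python
import numpy as np

# Structural sketch of a composite log-likelihood: a histogram
# parametrization pi of n(z) enters both a photometric term (a mixture
# over redshift bins for each galaxy) and a clustering term (a Gaussian
# likelihood for a cross-correlation data vector). All inputs are mock.
rng = np.random.default_rng(0)
n_bins, n_gal = 5, 1000
# p_phot[i, k]: likelihood of galaxy i's photometry if it lies in bin k
p_phot = rng.random((n_gal, n_bins))
w_data = rng.random(n_bins)                 # mock cross-correlation vector
cov = np.eye(n_bins) * 0.01                 # mock measurement covariance

def log_like_photometric(pi):
    # marginalize each galaxy's bin membership over the mixture weights pi
    return np.sum(np.log(p_phot @ pi))

def log_like_clustering(pi, amp=1.0):
    # toy model: w(z_k) = amp * pi_k (constant-bias approximation)
    r = w_data - amp * pi
    return -0.5 * r @ np.linalg.solve(cov, r)

def composite_log_like(pi):
    # both terms constrain the same parameters pi
    return log_like_photometric(pi) + log_like_clustering(pi)

value = composite_log_like(np.full(n_bins, 1.0 / n_bins))
```

In an MCMC analysis, `composite_log_like` is the quantity that would be sampled over the (suitably regularized) weights `pi`.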
Inference under spatial variations in photometry or redshift information will be addressed in the course of the paper and in § 10.

The paper is structured as follows: § 2 describes the simulated galaxy samples used in this work, while § 3 gives a brief introduction into inverse problems and deconvolution by discussing a simple toy model for photometric redshift inference. The following sections describe our inference methodology in detail: § 4 starts with a description of the photometric likelihood, where we also discuss several regularization schemes, and § 5 formulates the cross-correlation likelihood. Both of these parts are then combined in a composite likelihood framework in § 6. § 7 discusses aspects of model evaluation and parametrization of systematics. We then apply our methodology to the simulated data in § 8. § 9 summarizes our findings. § 10 closes the paper with a discussion of future work.

We use data from the CosmoDC2 simulated extragalactic catalog (Korytov et al. 2019) in this work. CosmoDC2 is a mock extragalactic catalog based on a trillion-particle N-body simulation with a box size of 4.
225 Gpc, the "Outer Rim" run (Heitmann et al. 2019). The simulated catalog covers 440 deg² of sky area and spans a redshift range 0 < z ≤
3. Galaxies are assigned to the halo catalog and supplemented with additional galaxies based on the assumption of a power-law extrapolation of the sub-halo mass function at lower masses. The resulting catalog exhibits a number count slope consistent with that of the Hyper Suprime-Cam Deep survey (Aihara et al. 2018) down to an r-band magnitude of r ∼
28, well beyond the apparent magnitudes that will be utilized in this paper. The galaxy catalog uses a combination of empirical and semi-analytic modelling, utilizing the Galacticus (Benson 2012) and GalSampler codes (Hearin et al. 2020). For more details on the catalog generation and properties we refer the reader to Korytov et al. (2019).

In § 2.1 we will describe the particular selection of photometric data and the photometric redshift catalog used in this work. § 2.2 describes the generation of the reference spectroscopic sample.
The photometric sample consists of mock galaxies from the LSST-DESC "CosmoDC2" synthetic sky catalog (Korytov et al. 2019). The catalogs do not contain stars or AGN, so star-galaxy separation and non-thermal contamination are not an issue in this data set. Observations consist of magnitudes in the six ugrizy
Rubin Observatory filters. Simulated photometric errors were added to the six bands using a simple model designed to match the expected photometric S/N due to depth, seeing, airmass, and sky brightness at the completion of the full 10-year Wide Fast Deep survey (Ivezić et al. 2019). All galaxies are assumed to be isolated, i.e. blending effects are not modeled. We restrict the sample to galaxies brighter than an i_LSST-band magnitude limit corresponding to an i_LSST-band signal-to-noise ratio (S/N) of ∼20. We make this cut because redshift estimates for lower-S/N objects degrade rapidly below this level. We additionally reserve a small set of galaxies as redshift calibration data.

Template Fitting Redshifts
We use the publicly available Bayesian photometric redshift code BPZ (Benítez 2000) to compute redshift estimates for our simulated galaxies. BPZ is a template-based redshift estimation code that estimates redshift by computing model fluxes from a set of template SEDs and evaluating the resulting χ² when compared to observed fluxes. BPZ includes the optional application of a bivariate Bayesian prior over the joint distribution of type/SED and apparent magnitude in the redshift estimation, though we do not employ the prior in this investigation.

To construct a template set we begin with the empirical SED catalog of Brown et al. (2014). We then use the ESP software package (Kalmbach & Connolly 2017), which constructs a principal component basis set from the empirical SEDs and uses photometric training data to construct the final SED templates via Gaussian Processes. The final template set used by BPZ consists of the 129 empirical templates and 100 additional templates output from ESP. These templates roughly, but not perfectly, span the observed range of colors for the sample. We compute the likelihoods for all SEDs by comparing the observed fluxes to model fluxes evaluated on a grid of redshift spanning 0 < z <
3. The one-dimensional posterior distributions for each galaxy, marginalized over template type, comprise our final template fitting redshift estimates.
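The grid evaluation and marginalization described above can be sketched as follows. The model fluxes here are random placeholders standing in for redshifted template SEDs integrated through the filter curves, so this illustrates only the χ²-grid machinery, not BPZ itself:

```python
import numpy as np

# Sketch of grid-based template fitting: chi-squared of observed vs.
# model fluxes on a (template, redshift) grid, then marginalization
# over template type to get a 1D redshift posterior. Model fluxes are
# random placeholders; a real code redshifts template SEDs through the
# survey filter curves.
rng = np.random.default_rng(1)
n_templates, n_z, n_bands = 8, 300, 6
z_grid = np.linspace(0.0, 3.0, n_z)
model_flux = rng.random((n_templates, n_z, n_bands))     # placeholder grid
obs_flux = model_flux[3, 150] + rng.normal(0, 0.01, n_bands)
sigma = np.full(n_bands, 0.01)                           # flux uncertainties

chi2 = np.sum(((obs_flux - model_flux) / sigma) ** 2, axis=-1)
like = np.exp(-0.5 * (chi2 - chi2.min()))                # (template, z) grid
post_z = like.sum(axis=0)                                # flat prior over type
post_z /= post_z.sum()                                   # 1D posterior in z
```

Because the mock observation was generated from the grid point (template 3, z index 150), the marginalized posterior peaks there.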
Machine Learning-based Redshifts
We use the Python version of the publicly available FlexCode package (Izbicki & Lee 2017; available at https://github.com/tpospisi/flexcode), combined with the XGBoost algorithm (Chen & Guestrin 2016), to compute photometric redshifts; we will refer to this combination as FlexZBoost. FlexZBoost estimates the conditional density in redshift for each galaxy by fitting to an orthonormal set of basis functions (in this case cosines) via regression with XGBoost. To further refine the estimates, 25 per cent of the training data is reserved as a validation set to determine optimal values for trimming extraneous low-level peaks in the likelihood, and a "sharpening" parameter of the form p(z) ∝ p(z)^α that adjusts the overall width of the density estimates to best match the data. For this analysis we use 35 cosine basis functions, and a sharpening parameter, chosen via cross-validation, of 1.
4. Given the representative training data used in this experiment, we expect very accurate redshift estimates from the FlexZBoost algorithm.
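The peak-trimming and sharpening steps can be sketched on a gridded density estimate as follows. The function name is invented and the threshold value is illustrative, not the cross-validated one:

```python
import numpy as np

# Sketch of the two post-processing steps described above for a density
# estimate p(z) on a regular grid: trimming low-level spurious peaks
# below a relative threshold `delta`, and "sharpening" via
# p(z) -> p(z)**alpha (alpha > 1 narrows peaks), then renormalizing.
def trim_and_sharpen(pz, z_grid, delta=0.05, alpha=1.4):
    dz = z_grid[1] - z_grid[0]
    pz = np.where(pz < delta * pz.max(), 0.0, pz)   # drop low-level bumps
    pz = pz ** alpha                                # narrow the surviving peaks
    return pz / (pz.sum() * dz)                     # renormalize to a PDF

z = np.linspace(0.0, 3.0, 301)
# main peak at z = 1 plus a small spurious bump at z = 2.5
raw = np.exp(-0.5 * ((z - 1.0) / 0.1) ** 2) \
    + 0.02 * np.exp(-0.5 * ((z - 2.5) / 0.05) ** 2)
pz = trim_and_sharpen(raw, z)
```

With these settings the low-amplitude bump is removed entirely, while the dominant peak is narrowed and the result still integrates to one.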
The simulated reference spectroscopic sample is selected to mimic, in broad strokes, the sample selections of the Dark Energy Spectroscopic Instrument (DESI; DESI Collaboration et al. 2016, Zhou et al. 2020a, Zhou et al. 2020b). It consists of a set of four samples with increasing mean redshift: a magnitude-limited sample defined by an i_LSST-band cut; a Luminous Red Galaxy (LRG) sample; an Emission Line Galaxy (ELG) sample; and finally a high-redshift Quasar (QSO) sample. We show the redshift and i_LSST-band magnitude distributions of these subsamples in Fig. 1. The LRG, ELG, and QSO samples are selected such that their density per redshift matches that of the DESI samples (priv. comm. Rongpu Zhou and Jeffrey Newman). This sample is distinct from the redshift calibration data mentioned in the previous section.

We construct the magnitude-limited sample by imposing the i_LSST-band magnitude cut. To approximate the LRG, ELG and QSO galaxy samples, we use the values of the stellar mass, star formation rate, and black hole mass times Eddington ratio as proxies for objects that are LRG-, ELG-, and QSO-like respectively. Our goal with these samples is to select galaxies that will have differing bias properties and mimic the complexities of the DESI sample in this regard, while matching the density and signal-to-noise we would expect with a DESI-like sample. We thus use these simple truth quantities from the simulation rather than recreate the full color selection of a true, simulated DESI sample. The QSO, ELG, and
Figure 1. Left: Redshift probability density functions of the galaxy populations that constitute the DESI-like spectroscopic reference sample. Right: Corresponding i-band magnitude distributions.
LRG samples are all selected to be fainter than the i_LSST cut of the magnitude-limited sample, so that they are independent of it. Additionally, QSOs and ELGs are selected with upper limits in i_LSST-band magnitude and in redshift. We construct these selections in a ∼50 deg² test area in the CosmoDC2 simulations and apply them to the full 300 deg² area.

As we will see in detail in the following sections, the photometric redshift problem is a deconvolution problem, where the redshift distribution of a sample of photometrically observed galaxies is inferred from their noisy photometric measurements. To give the reader an intuitive understanding of deconvolution problems, we present a short introduction into the classical deconvolution problem. A similar description in the context of photometric redshift estimation can be found in Padmanabhan et al. (2005). We close this section by discussing the limitations of the toy model considered here and motivate the likelihood inference framework presented in the following.
Consider three vectors of random variables Z, Z_p and ε with dimension N_gal, which denotes the photometric sample size. Z and Z_p denote the true and photometric redshifts of the galaxies in the sample, and ε the residual error between both quantities. The additive noise model that connects these random variables is given as:

Z_p = Z + ε .    (1)

The probability densities associated with these random variables are:

z_i ∼ p_z    (2)
z_p,i ∼ p_pz    (3)
ε_i ∼ p_ε ,    (4)

where i ∈ {1, . . . , N_gal} and "∼" connects the realization of a random variable on the left-hand side with the probability density function (PDF) on the right-hand side from which this realization is drawn.

The random variable ε_i is assumed to be independently and identically distributed, as well as independent of z_i. These assumptions do not hold in the photometric redshift scenario, as the noise very clearly depends on the color, and therefore redshift, of the galaxy. However, in the following toy model, we adopt these assumptions for simplicity. The theory can be easily extended towards input-dependent noise (see e.g. Meister 2009) without changing the intuition presented in this section.

In order to derive an estimator for p_z, we use the convolution theorem that connects the PDF of the sum of independent random variables with the convolution of their densities. We can therefore write:

p_pz = p_z ∗ p_ε = ∫ p_z(z_p − z) p_ε(z) dz = ∫ p_ε(z_p − z) p_z(z) dz .    (5)

Note that the probability densities p_z and p_pz are both redshift distributions of samples of galaxies. They differ since (p_pz, p_z) denote the sample distributions of (photometric, true or spectroscopic) galaxy redshifts. Thus p_pz will be broader, since the error in the redshift, drawn from p_ε, is convolved with the true redshift distribution. A Fourier-based approach is not necessary; concretely, the likelihood framework presented in the following section works in real space.
A Fourier description for the classical deconvolution problem is, however, analytically tractable and provides a clear picture of the nature of the problem and the importance of regularization.
The Fourier transform of a probability distribution is the characteristic function. We will denote the characteristic functions of $(f_z, f_{p,z}, f_\epsilon)$ as $(f^{\rm ft}_z, f^{\rm ft}_{p,z}, f^{\rm ft}_\epsilon)$. Given a sample drawn from a PDF, e.g., the sample of photometric redshifts of $n_{\rm gal}$ galaxies, we can estimate $f^{\rm ft}_{p,z}$ as
$$\hat{f}^{\rm ft}_{p,z}(t) = \frac{1}{n_{\rm gal}} \sum_{j=1}^{n_{\rm gal}} \exp\left(i t z_{p,j}\right) \, . \quad (6)$$
The argument $t$ of the characteristic function could be interpreted as a kind of redshift-frequency if we treat the redshift of a galaxy as a 'time parameter'. Under the assumption of independence between $\mathbf{Z}$ and $\boldsymbol{\epsilon}$ we can write:
$$f^{\rm ft}_{p,z}(t) = f^{\rm ft}_z(t) \, f^{\rm ft}_\epsilon(t) \, . \quad (7)$$
Therefore an estimator for $f^{\rm ft}_z(t)$ is given as:
$$\hat{f}^{\rm ft}_z(t) = \frac{1}{f^{\rm ft}_\epsilon(t)} \frac{1}{n_{\rm gal}} \sum_{j=1}^{n_{\rm gal}} \exp\left(i t z_{p,j}\right) \, , \quad (8)$$
where we assume $f^{\rm ft}_\epsilon(t)$ is known and nonzero everywhere. We note that this estimator is consistent and unbiased (Meister 2009). The error term $f^{\rm ft}_\epsilon(t)$ here acts as a 'filter' to weight down small-scale modes in the distribution. However, we note that the term $1/f^{\rm ft}_\epsilon$ can become large when $f^{\rm ft}_\epsilon$ is small. As a consequence the inverse Fourier transform
$$\hat{f}_z(z) = \frac{1}{2\pi} \int \exp(-i t z) \, \hat{f}^{\rm ft}_z(t) \, \mathrm{d}t \, , \quad (9)$$
is neither integrable nor square integrable. Loosely speaking this implies that the parameter space that describes the shape of $f_z(z)$ does not have to be bounded. We will see this effect also for the more complex model considered in the later sections of this work.
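The estimator of Eq. (6) and the division by $f^{\rm ft}_\epsilon$ in Eq. (8) can be sketched numerically as follows; this is a toy example with an assumed Gaussian noise law and a hypothetical sample shape:

```python
import numpy as np

rng = np.random.default_rng(0)
n_gal, sigma_eps = 5000, 0.05
z_true = rng.gamma(4.0, 0.15, n_gal)
z_phot = z_true + rng.normal(0.0, sigma_eps, n_gal)

t = np.linspace(-40.0, 40.0, 401)

# Eq. (6): empirical characteristic function of the photometric redshifts.
ecf_phot = np.array([np.exp(1j * ti * z_phot).mean() for ti in t])

# Characteristic function of the (assumed known) Gaussian noise.
cf_eps = np.exp(-0.5 * sigma_eps**2 * t**2)

# Eq. (8): naive deconvolution estimator for f_z^ft. Dividing by cf_eps
# amplifies the estimator noise at large |t| -- the ill-posedness above.
cf_true_est = ecf_phot / cf_eps
```

At $t = 0$ any characteristic function equals one, while at large $|t|$ the amplification of noise by $1/f^{\rm ft}_\epsilon$ becomes visible in the magnitude of `cf_true_est`.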
We reiterate that while the estimator $\hat{f}^{\rm ft}_z$ has very desirable properties, the inverse transformation is not well defined, hence deconvolution problems are part of a larger class of 'inverse problems'. In order to obtain well-defined results, we therefore have to perform regularization, either by regulating the shape/parametrization of $f_{p,z}$ (e.g. using kernel methods), projecting $f_z$ onto a suitable basis like wavelet functions, or by directly restricting the $1/f^{\rm ft}_\epsilon$ term, as implemented in a Ridge method (e.g. Meister 2009, § 2.2.3). We will not discuss the details of these methods and refer to the literature for a more detailed explanation (e.g. Meister 2009). It is, however, instructive to study the functional form of one of these regularized estimators. Making the ansatz of a kernel density estimate for the photometric redshift PDF, one can show that the deconvolved density $\hat{f}_z$ can be estimated as (e.g. Meister 2009):
$$\hat{f}_z(z) = \frac{1}{2\pi} \int \exp(-i t z) \left( \frac{K^{\rm ft}(t b)}{f^{\rm ft}_\epsilon(t)} \right) \frac{1}{n_{\rm gal}} \sum_{j=1}^{n_{\rm gal}} \exp\left(i t z_{p,j}\right) \mathrm{d}t \, , \quad (10)$$
where $b$ denotes the bandwidth and $K^{\rm ft}$ the Fourier transform of the kernel function that enters the kernel density estimation ansatz for $f_{p,z}$. (Here, $\hat{\cdot}$ denotes an estimator for the respective function.) We see that by restricting the shape of the density $f_{p,z}$ to a kernel density estimate whose smoothness is governed by the parameter $b$, we regularize the $1/f^{\rm ft}_\epsilon(t)$ term by a multiplicative factor that renders the inverse Fourier transformation both integrable and square integrable, assuming a bounded, compactly supported and non-vanishing $K^{\rm ft}$. The bandwidth parameter governs the tradeoff between the bias, or 'smoothness', of the density estimate and its variance. Choosing a larger bandwidth washes out small-scale noise in the reconstructed density.
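A minimal numerical sketch of the regularized estimator in Eq. (10), assuming a sinc kernel whose Fourier transform is an indicator function on $|tb| \leq 1$ (one possible compactly supported choice, used here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_gal, sigma_eps = 5000, 0.05
z_true = rng.gamma(4.0, 0.15, n_gal)
z_phot = z_true + rng.normal(0.0, sigma_eps, n_gal)

b = 0.04                              # bandwidth: larger b -> smoother, more biased
t = np.linspace(-1.0 / b, 1.0 / b, 801)
dt = t[1] - t[0]

cf_eps = np.exp(-0.5 * sigma_eps**2 * t**2)
ecf_phot = np.array([np.exp(1j * ti * z_phot).mean() for ti in t])

# Eq. (10) with a sinc kernel: K^ft(tb) = 1 for |tb| <= 1, zero otherwise,
# so the kernel simply truncates the integral to |t| <= 1/b. The truncation
# can introduce mild ringing (small negative excursions) in f_hat.
z_grid = np.linspace(0.0, 2.0, 201)
f_hat = (np.exp(-1j * np.outer(z_grid, t)) @ (ecf_phot / cf_eps)).real * dt / (2.0 * np.pi)
```

Shrinking `b` widens the $t$ range and lets the $1/f^{\rm ft}_\epsilon$ amplification back in, reproducing the bias-variance tradeoff discussed above.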
In the limit of vanishing bandwidth, we would again obtain an ill-posed inverse Fourier transformation. In the following sections, we will apply regularization techniques that restrict the functional form of the redshift distribution, following a similar idea as presented in closed form in Eq. (10) for the classical deconvolution problem. We note that inverse problems like the classical deconvolution problem also appear in several other scenarios, like the measurement of shapes, where the point-spread function (PSF) convolves the galaxies' light profiles and leads to a loss of information. While the considered toy model of the photometric redshift problem is analytically tractable, it does not describe the realistic situation. Besides the relatively simple extension towards a galaxy-dependent photometric noise, the noise distribution $f_\epsilon$ is, in photometric redshift estimation, given as a joint likelihood between the photometry of all galaxies in the sample, which depends on the additional parameters that enter the SED modelling. The redshift of each galaxy is a parameter that enters its likelihood, and the sample redshift distribution is its prior. Furthermore, the model does not properly account for the spatial distribution of galaxies. The clustering of galaxies does not only constrain their redshift distribution, but connects to SED modelling, with nuisance parameters that describe, e.g., galaxy-dark matter bias. We structure the discussion of the likelihoods used in this work based on the following roadmap: we first describe our likelihood framework for the photometry of galaxies given a set of templates in § 4. We reiterate that, in contrast to the classical deconvolution problem, the 'error distribution' in the full photometric redshift problem is based on the joint photometric likelihood. We will therefore base our estimator and inference on this likelihood instead of the characteristic function.
Despite these methodological differences, we note that the necessity for regularization is the same as in the analytically tractable classical deconvolution problem. The considered regularization schemes are described in § 4.2. A particular challenge in the context of large-area photometric surveys is the necessity to scale the inference to a large number of galaxies. In Appendices § A and § B we derive an efficient inference framework based on the Laplace approximation that facilitates fast probabilistic deconvolution. We will use this deconvolution methodology in the following sections § 4 to § 8.

The spectral energy distributions of distant galaxies are a complex superposition of spectral components from their stellar populations. The SED of the galaxy can be uniquely mapped to a given redshift $z$, which allows us to predict the galaxy flux as a function of redshift in a given optical filter band $F(\lambda)$ by
$$f_i(z, \alpha) = \int F(\lambda) \, {\rm SED}_i(\lambda, z, \alpha) \, \mathrm{d}\lambda \, , \quad (11)$$
where ${\rm SED}_i(\lambda, z, \alpha)$ is the Spectral Energy Distribution template in units of ${\rm erg}/{\rm cm}^2/{\rm s}/{\rm \AA}$. The parameter $\alpha$ denotes additional free parameters in the SED template models, such as galaxy age, type, or red continuum slope. For a given set of photometric filters $F(\lambda)$ we obtain a mapping between the redshift $z$ of the galaxy and a vector of fluxes $\mathbf{f}$. We will denote this mapping as $\mathcal{T}(z, \alpha)$. Assuming that the measurements of photometry for different galaxies are independent, we can make the ansatz for the joint likelihood of fluxes of a galaxy sample $\hat{\mathbf{F}}$:
$$p(\hat{\mathbf{F}} \,|\, \mathbf{z}, \boldsymbol{\alpha}) = \prod_{i=1}^{n_{\rm gal}} \mathcal{N}\!\left(\hat{\mathbf{f}}_i \,|\, \mathcal{T}(z_i, \alpha_i), \boldsymbol{\Sigma}_i\right) \, . \quad (12)$$
Here, $\boldsymbol{\Sigma}_i$ denotes the measurement covariance matrix of the flux measurements $\hat{\mathbf{f}}_i$, and $\hat{\mathbf{F}}$ denotes the set of all flux measurements of the galaxies. We assume Gaussian uncertainties here, where $\mathcal{N}(x, \mu, \Sigma)$ denotes the normal distribution.
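The mapping of Eq. (11) and the Gaussian likelihood of Eq. (12) can be sketched with a toy filter curve and toy SED; all shapes and parameter values here are hypothetical illustrations, not the templates or filters used in this work:

```python
import numpy as np

# Toy setup: a Gaussian filter transmission curve and a simple rest-frame
# SED with one emission-line-like feature (both hypothetical).
lam = np.linspace(3000.0, 11000.0, 2000)            # wavelength grid [Angstrom]
filt = np.exp(-0.5 * ((lam - 6000.0) / 400.0)**2)   # filter F(lambda)

def sed_rest(lam_rest):
    continuum = np.exp(-lam_rest / 8000.0)
    line = 2.0 * np.exp(-0.5 * ((lam_rest - 4000.0) / 30.0)**2)
    return continuum + line

def model_flux(z):
    # Eq. (11): the redshifted SED integrated against the filter curve.
    sed_obs = sed_rest(lam / (1.0 + z))
    return np.sum(filt * sed_obs) * (lam[1] - lam[0])

def log_like(flux_obs, z, sigma):
    # Eq. (12): independent Gaussian flux likelihood, one band and one galaxy.
    r = flux_obs - model_flux(z)
    return -0.5 * (r / sigma)**2 - 0.5 * np.log(2.0 * np.pi * sigma**2)
```

In this toy, the rest-frame line at 4000 Å enters the filter near $z = 0.5$, so the predicted flux and the likelihood both peak there, which is exactly the redshift information the mapping $\mathcal{T}(z, \alpha)$ encodes.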
The parameter $\alpha$ can either be a galaxy-specific index that selects a certain template from a pre-specified number of models, or a physical property of galaxies. The prior on the parameters $z$ and $\alpha$ must account for their correlation. An example of a possible parametrization in the case of a galaxy-specific template index would be a two-dimensional histogram. However, other parametrizations are possible, especially if additional parameters that change the shape of the base templates are included in the template set. In this work, we will consider the simplest case, where we use a multidimensional histogram prior in which each histogram cell denotes a combination of redshift bin and discretized $\alpha$ parameter value, which for example could indicate a template selection. The histogram index $j$ runs over all histogram bins $\{j : 0 < j \leq N_{\rm tot}\}$, where $N_{\rm tot} = N_{\rm bins} \times N_{\rm parameters}$. The prior on the corresponding histogram heights, denoted as $n_{B,j}$ and corresponding to the interval $I_j$ in the $z$-$\alpha$ parameter space, reads:
$$p(z, \alpha) = \sum_{j=1}^{N_{\rm tot}} n_{B,j} \, [(z, \alpha) \in I_j] \, . \quad (13)$$
Here $[\gamma]$ denotes the Iverson bracket, which is (0, 1) if the proposition $\gamma$ is (false, true). We note that $\mathbf{n}_B$ parametrizes the joint distribution of redshift histograms and the $\alpha$ parameter. For simplicity we will in the following omit the marginalization over $\alpha$ and refer to $\mathbf{n}_B$ as the parameters of the sample redshift distribution. The reason is that in this paper we do not add additional parameters to parametrize the SEDs over which we need to marginalize. In applications like weak gravitational lensing and galaxy clustering we are mainly interested in estimating the redshift distribution of a sample of galaxies, here referred to as the base sample and parametrized by the vector $\mathbf{n}_B$. It is therefore useful to marginalize over the redshifts of individual galaxies. We note that if the posterior of individual galaxy redshifts is important, we can always post-sample using the final posterior on $\mathbf{n}_B$, based on Eq.
(34), which then also includes information from galaxy clustering. The posterior distribution of the sample redshift distribution given $\hat{\mathbf{F}}$ is then:
$$p(\mathbf{n}_B \,|\, \hat{\mathbf{F}}) \propto p(\mathbf{n}_B) \prod_{i=1}^{n_{\rm gal}} \int \mathrm{d}z_i \, p(z_i \,|\, \mathbf{n}_B) \, \mathcal{N}\!\left(\hat{\mathbf{f}}_i \,|\, \mathcal{T}(z_i), \boldsymbol{\Sigma}_i\right) \, . \quad (14)$$
Discretizing the integral and using Eq. (13) we obtain
$$p(\mathbf{n}_B \,|\, \hat{\mathbf{F}}) \propto p(\mathbf{n}_B) \prod_{i=1}^{n_{\rm gal}} \sum_{j=1}^{N_{\rm tot}} n_{B,j} \int_{z_{j,{\rm L}}}^{z_{j,{\rm U}}} \mathrm{d}z_i \, \mathcal{N}\!\left(\hat{\mathbf{f}}_i \,|\, \mathcal{T}(z_i), \boldsymbol{\Sigma}_i\right) \, . \quad (15)$$
The histogram heights $n_{B,j} = \rho_{B,j} / \Delta z$ can be expressed as the ratio between $\rho_{B,j}$ and the histogram width $\Delta z$, assuming equal-sized redshift bins. (The independence assumption can be violated due to effects such as blending of nearby galaxy light profiles or flux calibration errors.) The vector $\boldsymbol{\rho}_B$ has the properties $\sum_{j=1}^{N_{\rm tot}} \rho_{B,j} = 1$ and $0 \leq \rho_{B,j} \leq$
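In practice the per-galaxy bin integrals in Eq. (15) can be precomputed once, after which evaluating the log-posterior is a matrix-vector product. A sketch, with a synthetic likelihood table and a flat prior as stated assumptions:

```python
import numpy as np

def log_posterior(n_B, A, log_prior=lambda n: 0.0):
    # Eq. (15) in log form: the per-galaxy evidence is sum_j n_Bj * A_ij, where
    # A[i, j] is the integral of N(fhat_i | T(z), Sigma_i) over redshift bin j,
    # precomputed once per galaxy. A flat prior is assumed by default.
    per_gal = A @ n_B
    return log_prior(n_B) + np.sum(np.log(per_gal))

# Synthetic likelihood table: each galaxy's flux likelihood concentrates in
# one redshift bin, with a little leakage into the others.
rng = np.random.default_rng(3)
n_gal, n_bins = 2000, 3
rho_true = np.array([0.2, 0.5, 0.3])
true_bin = rng.choice(n_bins, size=n_gal, p=rho_true)
A = np.full((n_gal, n_bins), 0.05)
A[np.arange(n_gal), true_bin] = 0.9
```

The posterior then prefers histogram heights close to the proportions actually present in the sample, as expected from Eq. (15).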
1, and therefore lies on the simplex. Our first choice for a distribution on the simplex for $p(\boldsymbol{\rho}_B)$ (and therefore $p(\mathbf{n}_B)$) was the Dirichlet distribution. During the course of this project we have applied a mean-field variational inference scheme that uses the Dirichlet as the variational distribution, as well as a Gibbs sampling scheme based on the Dirichlet-Multinomial conjugacy for posterior inference. We found that the variational inference scheme yielded underestimated error bars, likely due to the restricted covariance structure of the Dirichlet. Moreover, the sampling approach did not scale well to the large galaxy samples expected for the first-year LSST observations. Specifically, the computational workload to update redshift variables for samples of this size is very large, and while subsampling techniques provide a possible mitigation, they can lead to biased inferences (Quiroz et al. 2018). Furthermore, the application of sampling techniques requires a sufficiently large trace to ensure convergence. This can be difficult to ensure in this case. In order to provide a more flexible distributional ansatz than the Dirichlet, while still maintaining the computational advantages of a mean-field variational inference scheme, we decided to develop a scheme that is based on the logit-normal distribution (Atchison & Shen 1980), as explained in the following section. While these considerations motivate our choice of method, we note that this should not discredit alternative approaches based on sampling or variational inference in general. We will perform a more detailed analysis of convergence and probability coverage of multiple inference techniques in future work.

The problem specified by Eq. (14) is a deconvolution problem that extends the simple toy model considered in § 3. The 'noise' PDF is now given by a joint likelihood that can depend on a complex set of parameters.
Furthermore, while the discussion in § 3 focused on deriving an estimator for the deconvolved density, the focus here is to infer posteriors using efficient inference techniques. We present the detailed description and derivation of the inference pipeline in Appendices A and B. The final form of Eq. (14) is then given in the form of a logit-normal posterior:
$$p(\mathbf{n}_B \,|\, \hat{\mathbf{F}}) \propto \frac{1}{\sqrt{|2\pi \boldsymbol{\Sigma}_y|} \, \Delta z^{N_{\rm bins}} \prod_{j=1}^{N_{\rm bins}} n_{B,j}} \exp\!\left( -\frac{1}{2} \left( \log\!\left( \frac{\mathbf{n}_{B,-N_{\rm bins}}}{n_{B,N_{\rm bins}}} \right) - \boldsymbol{\mu}_{y,{\rm ML}} \right)^{\!T} \boldsymbol{\Sigma}_y^{-1} \left( \log\!\left( \frac{\mathbf{n}_{B,-N_{\rm bins}}}{n_{B,N_{\rm bins}}} \right) - \boldsymbol{\mu}_{y,{\rm ML}} \right) \right) \, , \quad (16)$$
where $\Delta z$ denotes the histogram bin width. The estimation of the covariance $\boldsymbol{\Sigma}_y$ and mean vector $\boldsymbol{\mu}_{y,{\rm ML}}$ is detailed in Appendices A and B. However, we note that this formalism derives the Hessian $\mathbf{H} = \boldsymbol{\Sigma}_y^{-1}$, and obtaining the covariance matrix $\boldsymbol{\Sigma}_y$ requires matrix inversion. The subscript 'y' here refers to the variable transformation
$$\mathbf{y}(\boldsymbol{\rho}) = \left[ \log\left( \rho_1 / \rho_{N_{\rm bins}} \right), \dots, \log\left( \rho_{N_{\rm bins}-1} / \rho_{N_{\rm bins}} \right) \right] \, , \quad (17)$$
(The Dirichlet is the conjugate prior to the multinomial distribution, which can make sampling and inference easier. Concretely, if a Dirichlet prior is set on the probabilities of the multinomial likelihood (which are its parameters), the posterior over these probabilities is again a Dirichlet. However, conjugacy does not imply that the prior is ideal in all circumstances.)
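The transformation of Eq. (17) and its inverse, i.e. the additive log-ratio transform with the last bin as reference and its softmax-like inverse, can be written compactly; a minimal sketch:

```python
import numpy as np

def to_logit(rho):
    # Eq. (17): additive log-ratio transform, last bin as the reference.
    return np.log(rho[:-1] / rho[-1])

def from_logit(y):
    # Inverse transform: maps R^(N_bins - 1) back onto the simplex.
    e = np.concatenate([np.exp(y), [1.0]])
    return e / e.sum()
```

The round trip recovers the simplex vector exactly, and the origin of logit space maps to the uniform histogram.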
We discuss this transformation in more detail in Appendix B. The vector $\mathbf{n}_{B,-N_{\rm bins}}$ here denotes the vector $\mathbf{n}_B$ excluding the last entry $n_{B,N_{\rm bins}}$, where we assume equal-sized redshift histogram bins. Like the classical deconvolution problem, the inference of Eq. (14) is an inverse problem.
We can therefore expect that there exists a parameter vector $\boldsymbol{\rho} \in \Delta$, where $\Delta$ denotes the simplex space, that has a high likelihood (relative to the maximum-likelihood value), but a large distance from the true $\boldsymbol{\rho}_{\rm opt}$. Furthermore, as we have seen in § 3, the solutions of inverse problems do not have to be bounded (or even well defined), which implies that uncertainties can be arbitrarily large (see, e.g., Kuusela 2016). Regularization, detailed in the following section, is therefore a key aspect of our inference pipeline.

In this section we describe techniques that we employ to regularize the deconvolution problem. As instabilities arise when the histogram width is of the same order as the uncertainty in the redshift of the individual galaxies, picking broader bins reduces these artifacts (see e.g. Kuusela 2016). Considering our toy model in Eq. (10), we see that if $K^{\rm ft}(tb)$ is narrower than $f^{\rm ft}_\epsilon(t)$, there can be values of $t$ where their ratio, and therefore the integrand in Eq. (10), can become large or even unbounded. Subject to the aforementioned limitations, this behaviour generalizes to the deconvolution problem considered here. In the following, we will denote this scheme as the 'Wide Bin' method. As will be seen in the following section, this simple scheme can lead to posteriors that are biased and too narrow. We therefore consider alternative approaches. The 'Merging Bin Regularization' scheme (Kuusela 2016) uses a very thin initial histogram binning. This will likely result in the aforementioned typical instabilities of inverse problems, but can avoid biases of the Wide-Bin regularization (or other smoothing) schemes. However, we must ensure that the optimization of the maximum-likelihood solution converges to a global maximum. We therefore run multiple optimizations with different initial conditions and pick the best solution. Furthermore, the Hessian $\mathbf{H}$ can have a very high condition number.
Even though it is possible to sample from the resulting posteriors without the matrix inverse using MCMC sampling (only the inverse covariance enters the PDF), sampling is more efficient using the standard Box-Muller method (Box & Muller 1958), which requires an inverse.

Tikhonov Regularization
We perform the matrix inversion of the Hessian $\mathbf{H}$ using Tikhonov regularization (see e.g. Kress 1998, pp. 86-90). Here, we treat the matrix inversion as a system of linear equations constructed from the Hessian and the column-wise inverse Hessian/unit matrix, respectively. (In our case we note that all parameter values $\boldsymbol{\rho} \in \Delta$ are bounded, because $\Delta$ (with a chosen metric) is a bounded metric space. However, the solutions in logit space (see § B) do not have to be bounded. We will use the term 'regularization' not only in the context of Bayesian statistics, where it is often implemented in the form of a prior, but in general to describe methods that restrict the complexity of parameters or functions.) The instability of the problem can lead to very small entries in the Hessian that imply large entries (in
Figure 2.
Illustration of the impact of the 'Merging Bin' regularization on the posterior of the sample photometric redshift distribution. We show $1\sigma$ intervals. The x-axis shows the redshift value $z$, the y-axis the value of the $\mathbf{n}_B$ parameters. The error bars are the [16, 84] percentiles, which would correspond to $1\sigma$ intervals for a normal distribution. The black dashed curve shows the spectroscopic redshift distribution. The red contour shows the result of the 'Merging Bin' regularization with 30 bins applied to the 'Fiducial' contours that are binned using 50 bins. We refer for a detailed explanation to § 8 and Fig. 7, which shows and discusses the 'Fiducial' case as 'Small Sample (50k)'.

absolute value) in its inverse. As a regularization, one can add a penalty term and reformulate the problem as a minimization
$$\min\left\{ \left\| \mathbf{H} \mathbf{a}_i - \mathbf{e}_i \right\|^2 + \left\| \boldsymbol{\Gamma} \mathbf{a}_i \right\|^2 \right\} \, , \quad (18)$$
where $\mathbf{a}_i$ and $\mathbf{e}_i$ denote column $i$ of the inverse Hessian and the unit matrix, respectively. The matrix $\boldsymbol{\Gamma} = \alpha \mathbb{1}$ is the Tikhonov matrix, with $\alpha$ the regularization parameter and $\mathbb{1}$ the unit matrix. The regularization term penalizes large values of $\mathbf{a}_i$, which regularizes the inversion and reduces the condition number. The analytic solution to this minimization problem is given as:
$$\mathbf{c}_i = \left( \mathbf{H}^T \mathbf{H} + \boldsymbol{\Gamma}^T \boldsymbol{\Gamma} \right)^{-1} \mathbf{H}^T \mathbf{e}_i \, . \quad (19)$$
We note that we introduce the Tikhonov regularization here predominantly as a way to regularize the matrix inversion of the Hessian. We recommend selecting the parameter $\alpha$ to be just as large as necessary to perform this inversion accurately. Tikhonov regularization can be used as the main regularization in inverse problems; however, we find that the merging bin regularization scheme performs much better in terms of producing well-calibrated $\mathbf{n}_B$ posteriors. We reiterate that the idea of the merging bin regularization scheme proposed by Kuusela (2016) is to deliberately start with histogram bins that are too small and lead to a noisy deconvolved density.
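Eq. (19) can be implemented directly; the sketch below, with a hypothetical ill-conditioned diagonal Hessian, shows how the $\alpha$ term damps the exploding mode:

```python
import numpy as np

def tikhonov_inverse(H, alpha):
    # Column-wise regularized inverse, Eq. (19) with Gamma = alpha * identity:
    # each column c_i minimizes ||H a_i - e_i||^2 + alpha^2 ||a_i||^2.
    n = H.shape[0]
    lhs = H.T @ H + alpha**2 * np.eye(n)
    return np.linalg.solve(lhs, H.T)   # solves for all columns e_i at once

# Toy ill-conditioned Hessian (hypothetical): the tiny eigenvalue would give
# an exact-inverse entry of 1e8; the alpha term damps that mode.
H = np.diag([1.0, 1.0, 1e-8])
H_inv_reg = tikhonov_inverse(H, alpha=1e-4)
```

For the well-conditioned directions the regularized inverse agrees with the exact one to high accuracy, while the unstable direction is suppressed, which is exactly the reduction of the condition number described above.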
We then exploit the characteristic noise pattern in the deconvolved distribution, where bins that overshoot, i.e. are larger than the true value, are immediately followed by bins that undershoot. This results in an alternating or 'zig-zag' pattern of the deconvolved density. Merging these neighboring bins then helps to 'stabilize' the deconvolved distribution. We therefore sample from a posterior obtained assuming a finely binned histogram and merge neighboring bins, which compensates for the noise effect. We can then in principle directly use these samples in the cross-correlation likelihood. Nonetheless, it is computationally more efficient to remap these samples to a regular grid with the same or very similar resolution as the binning used for the cross-correlations, since treating the finely binned histogram heights as free parameters would not add more information due to the resolution loss. We found that the template fitting posterior after merging bin regularization can again be well described by a logit-normal distribution that we fit using Assumed Density Filtering.
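Merging neighboring bins of each posterior draw is a simple reshape-and-average; a sketch (averaging rather than summing keeps the height normalization, since the merged bins are correspondingly wider):

```python
import numpy as np

def merge_bins(samples, factor):
    # Average groups of `factor` neighboring histogram bins in each draw.
    # samples: (n_draws, n_bins) array of n_B draws; n_bins must be divisible
    # by factor. Averaging heights preserves sum(height * width) because the
    # merged bins are `factor` times wider.
    n_draws, n_bins = samples.shape
    return samples.reshape(n_draws, n_bins // factor, factor).mean(axis=2)
```

Applied to a 'zig-zag' draw such as alternating heights, pairwise merging flattens the oscillation while leaving the normalization untouched.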
Assumed Density Filtering
To fit the logit-normal distribution to the sampled and merged posterior samples, we work in logit space and make a Gaussian ansatz
$$q(\mathbf{y}) = \mathcal{N}(\mathbf{y} \,|\, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, . \quad (20)$$
We can directly generate samples from the true distribution by sampling from the original, finely binned, logit-normal distribution with subsequent merging, i.e. averaging, of neighboring bins, and then transforming back to logit space. We will denote this true distribution as $p_{\rm true}(\mathbf{y})$. Assumed density filtering then commences by minimizing the Kullback-Leibler divergence between our ansatz $q(\mathbf{y})$ and $p(\mathbf{y})$,
$$\mathrm{KL}(p \,||\, q) = \int \mathrm{d}\mathbf{y} \left( p(\mathbf{y}) \log p(\mathbf{y}) - p(\mathbf{y}) \log q(\mathbf{y}) \right) \, . \quad (21)$$
After optimizing $\mathrm{KL}(p \,||\, q)$ for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, we can show that the optimum is reached if
$$\int \mathrm{d}\mathbf{y} \, p(\mathbf{y}) \, \mathbf{y} = \boldsymbol{\mu} \, , \quad (22)$$
and
$$\boldsymbol{\Sigma} = \int \mathrm{d}\mathbf{y} \, p(\mathbf{y}) \, (\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})^T \, . \quad (23)$$
We see that assumed density filtering reduces to moment matching in logit space when we apply the sample mean and sample covariance estimators to samples from $p_{\rm true}(\mathbf{y})$. We note that this is in general true for distributions of the exponential family (see, e.g., Ranganathan 2004). To summarize, we perform the inference scheme described in the previous section using a fine histogram binning. Subsequently we sample from the posterior after regularizing the inverse Hessian using Tikhonov regularization. We merge neighboring bins from each posterior draw until the noise is reduced and we obtain a smooth probability distribution. We illustrate this process in Fig. 2, which shows the impact of the 'Merging Bin' regularization scheme on the posterior of the sample photometric redshift distribution. Comparing the grey contours ('Fiducial') that use 50 bins with the red contours ('Merging Bin') that merge neighboring bins to a binning of 30 bins, we see the much smoother shape and the elimination of the 'zig-zag' pattern present in the grey contours. We defer a more thorough explanation of the methodology and sample to § 8 and Fig.
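Since the KL optimum of Eqs. (22)-(23) reduces to moment matching, fitting the Gaussian ansatz to merged logit-space draws is only two lines; a sketch:

```python
import numpy as np

def adf_fit(y_samples):
    # Assumed density filtering for a Gaussian ansatz, Eqs. (22)-(23): the
    # KL(p||q) optimum is moment matching, i.e. the sample mean and sample
    # covariance of the (logit-space) merged posterior draws.
    mu = y_samples.mean(axis=0)
    cov = np.cov(y_samples, rowvar=False)
    return mu, cov
```

Applied to draws from a known Gaussian, the fit recovers the generating moments up to sampling noise, which is all the ansatz of Eq. (20) requires.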
7, which discusses the result shown in the grey contours under the abbreviation 'Small Sample (50k)'. Based on our experience, we propose to initially merge neighboring bins until we obtain a bin size of the order of the average $\pm 1\sigma$ range of the individual galaxy redshift distributions. Subsequently we merge fewer bins until the aforementioned characteristic 'zig-zag' noise pattern appears. This can be identified as the limiting resolution we can obtain. We note that it is important to distinguish patterns due to 'real' line-of-sight structure from those due to the aforementioned noise in the deconvolution. If the pattern appears gradually with increasing resolution (merging fewer bins), it is indicative of statistically significant line-of-sight structure. If the deconvolved density suddenly becomes unstable in a 'zig-zag' pattern when fewer bins are merged, we have reached a resolution limit. Using assumed density filtering under the ansatz of a logit-normal distribution, we finally reparametrize our model on the final redshift grid. The merging process described above is largely based on inspecting when the instabilities vanish. There are certainly more principled alternatives. In a classical deconvolution problem, like the one presented in § 3, one could use a bootstrap estimate of the bias and variance of the reconstruction with respect to the over-smoothed photometric redshift distribution. This is consistent with the approach taken by Padmanabhan et al. (2005) based on the recommendation in Craig & Brown (1986). We use a joint likelihood between the photometry and spatial information to produce posteriors for the photometric sample redshift distribution and not a 'point prediction'. Furthermore, our 'measured data' is the photometry and spatial information of galaxies. Accordingly, our model selection, of which regularization is a part, must reproduce the measured photometry and spatial distribution, e.g., as measured by the correlation functions of galaxies.
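The 'zig-zag' pattern can be quantified, for example, by the fraction of direction flips between neighboring bins; the helper below is our own illustrative heuristic for such an inspection, not part of the pipeline described above:

```python
import numpy as np

def zigzag_fraction(heights):
    # Fraction of interior bins where the deconvolved density alternates
    # direction (over-/undershoot); values near 1 indicate the noise pattern,
    # values near 0 a smooth reconstruction.
    d = np.diff(heights)
    flips = np.signbit(d[1:]) != np.signbit(d[:-1])
    return float(flips.mean())
```

A fully alternating histogram scores 1, a monotonic one scores 0, so tracking this fraction while merging fewer bins gives a rough handle on when the resolution limit is reached.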
In the Bayesian context, this would translate into the usage of posterior predictive checks (PPC), discussed in § 7.1. While the path to development of a more principled selection of the hyperparameters of regularization is known, it will require a thorough investigation. We defer this to future work, using the aforementioned 'rule-of-thumb' methodology as an interim solution.

In order to include information about the spatial distribution of galaxies in the likelihood, we consider spatial cross-correlations between photometric and spectroscopic samples. Spatial correlations measure the excess probability over random to find two galaxies separated by a certain distance. This can be exploited to extract redshift information for galaxy samples (see e.g. Newman 2008; Ménard et al. 2013; McQuinn & White 2013; Scottez et al. 2016; Raccanelli et al. 2017; Morrison et al. 2017; Davis et al. 2017; Gatti et al. 2018) for which we do not have accurate redshift information, i.e. photometric galaxy samples, using spatially overlapping spectroscopic catalogs. The idea is to select the spectroscopic samples in thin redshift slices and estimate the cross-correlation between these redshift-selected samples and the full photometric galaxy sample. As discussed in the following, the resulting signal will then be proportional to the photometric redshift distribution at that redshift. Fig. 3 illustrates the basic idea of cross-correlation redshift inference. We consider two galaxy samples: a reference sample 'R' and a base galaxy sample 'B'. The reference sample contains galaxies with accurate, often spectroscopic, redshift measurements; the base galaxy sample consists of galaxies observed in broad-band photometric filters. As the base/reference samples are typically photometric/spectroscopic samples, we use these terms interchangeably in the text.
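The proportionality between the slice-wise cross-correlation signal and the sample redshift distribution can be illustrated with a deliberately simplified toy: independent Gaussian overdensity fields per slice, unit galaxy-dark matter bias, and a noiseless reference sample (all assumptions of this sketch, not of the method itself):

```python
import numpy as np

rng = np.random.default_rng(7)
n_cells, n_slices = 20_000, 5
rho = np.array([0.10, 0.35, 0.25, 0.20, 0.10])  # toy base-sample redshift distribution

# One independent toy overdensity field per thin spectroscopic redshift slice.
delta = rng.normal(0.0, 0.3, size=(n_slices, n_cells))

# Projected base-sample surface density: every slice contributes, weighted by rho.
base = (rho[:, None] * (1.0 + delta)).sum(axis=0)

# Cross-correlating each slice's field with the base counts isolates the
# contribution of that slice, i.e. an amplitude proportional to rho there.
w_x = np.array([np.mean(delta[k] * (base - base.mean())) for k in range(n_slices)])
rho_est = w_x / w_x.sum()
```

Because the slice fields are independent here, only the matching term survives the average, and the normalized cross-correlation amplitudes recover the toy redshift distribution up to sampling noise.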
The redshift distribution of the base sample is illustrated by the red distribution, while the binned reference sample redshift distributions for simplicity are shown as tophat functions (unlike the simulated samples we use to test our methodology). A single cross-correlation is then obtained by cross-correlating a single tophat selection with the full base sample. Multiple measurements (We note, however, that the reference sample does not have to be a spectroscopic dataset, as multi-band, narrow-filter photometric observations (Alarcon et al. 2020a), or photometric redshifts of redMaGiC samples (see e.g. Gatti et al. 2018), also allow for reasonable redshift accuracy.)
To fit the logit-normal distribution to the sampled and merged posterior samples, we work in logit space and make a Gaussian ansatz

q(y) = N(y | \mu, \Sigma) . (20)

We can directly generate samples from the true distribution by sampling from the original, finely binned, logit-normal distribution with subsequent merging, i.e. averaging, of neighboring bins, and then transforming back to logit space. We will denote this true distribution as p_{true}(y). Assumed density filtering then commences by minimizing the Kullback-Leibler divergence between p_{true}(y) and our ansatz q(y),

KL(p_{true} || q) = \int dy \left[ p_{true}(y) \log p_{true}(y) - p_{true}(y) \log q(y) \right] . (21)

After optimizing KL(p_{true} || q) with respect to \mu and \Sigma, we can show that the optimum is reached if

\mu = \int dy \, p_{true}(y) \, y (22)

and

\Sigma = \int dy \, p_{true}(y) \, (y - \mu)(y - \mu)^{T} . (23)

We see that assumed density filtering reduces to moment matching in logit space when we apply the sample mean and sample covariance estimators to samples from p_{true}(y). We note that this is in general true for distributions of the exponential family (see, e.g., Ranganathan 2004).

To summarize, we perform the inference scheme described in the previous section using a fine histogram binning. Subsequently we sample from the posterior after regularizing the inverse Hessian using Tikhonov regularization. We merge neighboring bins from each posterior draw until the noise is reduced and we obtain a smooth probability distribution. We illustrate this process in Fig. 2, which shows the impact of the 'Merging Bin' regularization scheme on the posterior of the sample photometric redshift distribution. Comparing the grey contours ('Fiducial'), which use 50 bins, with the red contours ('Merging Bin'), which merge neighboring bins down to 30 bins, we see the much smoother shape and the elimination of the 'zig-zag' pattern present in the grey contours. We defer a more thorough explanation of the methodology and sample to § 8 and Fig. 7, which discusses the result shown in the grey contours under the abbreviation 'Small Sample (50k)'.

Based on our experience, we propose to initially merge neighboring bins until we obtain a bin size of the order of the average ±1σ range of the individual galaxy redshift distributions. Subsequently we merge fewer bins, until the aforementioned characteristic 'zig-zag' noise pattern appears. This can be identified as the limiting resolution we can obtain. We note that it is important to distinguish patterns due to 'real' line-of-sight structure from those due to the aforementioned noise in the deconvolution. If the pattern appears gradually with increasing resolution (merging fewer bins), it is indicative of statistically significant line-of-sight structure. If the deconvolved density suddenly becomes unstable in a 'zig-zag' pattern when fewer bins are merged, we have reached a resolution limit. Using assumed density filtering under the ansatz of a logit-normal distribution, we finally reparametrize our model on the final redshift grid.

The merging process described above is largely based on inspecting when the instabilities vanish. There are certainly more principled alternatives. In a classical deconvolution problem, like the one presented in § 3, one could use a bootstrap estimate of the bias and variance of the reconstruction with respect to the over-smoothed photometric redshift distribution. This is consistent with the approach taken by Padmanabhan et al. (2005), based on the recommendation in Craig & Brown (1986). We, however, use a joint likelihood between the photometry and spatial information to produce posteriors for the photometric sample redshift distribution, not a 'point prediction'. Furthermore, our 'measured data' are the photometry and spatial information of galaxies. Accordingly, our model selection, of which regularization is a part, must reproduce the measured photometry and spatial distribution, e.g., as measured by the correlation functions of galaxies.
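The 'merge, then refit by moment matching' procedure of Eqs (20)-(23) is straightforward to prototype. The following is a minimal sketch, not the paper's implementation: the fine-binned posterior, the pairing of neighboring bins, and all numerical values are hypothetical stand-ins, and the logit map used here fixes the last bin as the reference.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(y):
    """Map a logit-space vector y (length K-1, last bin as reference) to the simplex."""
    z = np.concatenate([y, [0.0]])
    e = np.exp(z - z.max())
    return e / e.sum()

def to_logit(p):
    """Inverse map: simplex vector (length K) -> logit space, last bin as reference."""
    return np.log(p[:-1]) - np.log(p[-1])

# Hypothetical fine-binned logit-normal posterior (stand-in for the sampled posterior).
K_fine = 50
mu_fine = rng.normal(0.0, 0.3, K_fine - 1)
y_samples = rng.multivariate_normal(mu_fine, 0.02 * np.eye(K_fine - 1), size=5000)
p_fine = np.apply_along_axis(softmax, 1, y_samples)

# 'Merging Bin' regularization: average pairs of neighboring bins, then renormalize.
p_merged = 0.5 * (p_fine[:, 0::2] + p_fine[:, 1::2])
p_merged /= p_merged.sum(axis=1, keepdims=True)

# Assumed density filtering reduces to moment matching in logit space (Eqs 22 and 23):
y_merged = np.apply_along_axis(to_logit, 1, p_merged)
mu_hat = y_merged.mean(axis=0)               # sample-mean estimator of Eq (22)
Sigma_hat = np.cov(y_merged, rowvar=False)   # sample-covariance estimator of Eq (23)
```

The refit logit-normal N(y | mu_hat, Sigma_hat) then parametrizes the posterior on the coarser grid.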
In the Bayesian context, this translates into the use of posterior predictive checks (PPC), discussed in § 7.1. While the path to a more principled selection of the regularization hyperparameters is known, it will require a thorough investigation. We defer this to future work, using the aforementioned 'rule-of-thumb' methodology as an interim solution.

To include information about the spatial distribution of galaxies in the likelihood, we consider spatial cross-correlations between photometric and spectroscopic samples. Spatial correlations measure the excess probability, over random, of finding two galaxies separated by a certain distance. This can be exploited to extract redshift information for galaxy samples (see e.g. Newman 2008; Ménard et al. 2013; McQuinn & White 2013; Scottez et al. 2016; Raccanelli et al. 2017; Morrison et al. 2017; Davis et al. 2017; Gatti et al. 2018) for which we do not have accurate redshift information, i.e. photometric galaxy samples, using spatially overlapping spectroscopic catalogs. The idea is to select the spectroscopic sample in thin redshift slices and estimate the cross-correlation between these redshift-selected samples and the full photometric galaxy sample. As discussed in the following, the resulting signal is proportional to the photometric redshift distribution at that redshift.

Fig. 3 illustrates the basic idea of cross-correlation redshift inference. We consider two galaxy samples: a reference sample 'R' and a base galaxy sample 'B'. The reference sample contains galaxies with accurate, often spectroscopic, redshift measurements; the base galaxy sample consists of galaxies observed in broad-band photometric filters. As the base/reference samples are typically photometric/spectroscopic samples, we use these terms interchangeably in the text.
The redshift distribution of the base sample is illustrated by the red distribution, while the binned reference sample redshift distributions are, for simplicity, shown as tophat functions (unlike the simulated samples we use to test our methodology). A single cross-correlation is then obtained by cross-correlating a single tophat selection with the full base sample. Multiple measurements therefore 'slice' through the redshift distribution of the base sample, illustrated in Fig. 3 by the arrows and the grey hatched tophat slices. We note, however, that the reference sample does not have to be a spectroscopic dataset; multi-band, narrow-filter photometric observations (Alarcon et al. 2020a) or photometric redshifts of redMaGiC samples (see e.g. Gatti et al. 2018) also allow for reasonable redshift accuracy.

Figure 3. Illustration of the construction of the cross-correlation data vector. A 'Reference Sample' that can be precisely selected in redshift is moved over a 'Base Sample', which can either be a sample without redshift information (photometric sample) or the 'Reference Sample' itself. In this illustration 10 cross-correlations would be estimated, constituting the cross-correlation data vector.

As described in detail in Schmidt et al. (2013), Morrison et al. (2017) and Ménard et al. (2013), we measure the over-density, compared with a spatially random distribution of points, of photometric galaxies around each galaxy in the spectroscopic sample, within an annulus of physical scale \Delta r = [r_{min}, r_{max}]. The theoretical model for the cross-correlation function between the spectroscopic reference sample in tophat bin i and the photometric base sample is given as

w_{RB,i} \propto b_{R,i} \, b_{B,i} \left( \frac{n_{B,i}}{z_{iH} - z_{iL}} \right) w_{DM,i} . (24)

Here, b_{R,i} and b_{B,i} denote the values of the redshift-dependent galaxy-dark matter bias of the reference (R) and base (B) samples. The normed histogram bin heights of the base, or photometric, sample redshift distribution are denoted as n_{B,i}, where \sum_i n_{B,i} = 1, and the size of a redshift bin is given as z_{iH} - z_{iL}. The term w_{DM,i} denotes the contribution of dark-matter clustering to the cross-correlation signal, which depends on the cosmological model.

We see that the modelling of the cross-correlation signal depends on the product of two redshift-dependent galaxy-dark matter bias functions that are completely degenerate with the set of parameters n_B that parametrizes the redshift distribution of the base sample. Furthermore, since w_{DM,i} depends on the cosmological model, it would be computationally expensive to sample over these parameters.

To reduce the impact of the cosmological model, we combine the cross-correlations w_RB with the correlations w_RR of the spectroscopic sample. We therefore correlate the spectroscopic sample with itself in a manner analogous to what was just described, i.e. by correlating tophat-selected spectroscopic samples with the full spectroscopic sample. The corresponding theory prediction then reads

w_{RR,i} \propto (b_{R,i})^2 \left( \frac{n_{R,i}}{z_{iH} - z_{iL}} \right) w_{DM,i} , (25)

where n_{R,i} is the normalized histogram height of the (full) spectroscopic sample redshift distribution. Both correlation function measurements (\hat{w}_RB and \hat{w}_RR) just described are assumed to individually follow a Gaussian likelihood.

Based on these definitions and approximations, and the considerations in the previous section, we construct a likelihood based on the ratio between \hat{w}_RB and \hat{w}_RR. Under the assumption of a diagonal covariance matrix for \hat{w}_RB and \hat{w}_RR, we can construct the random variable \Theta with components for bin i

\Theta^{meas}_i = \frac{\hat{w}_{RB,i}}{\hat{w}_{RR,i}} . (26)

We reiterate that both \hat{w}_{RB,i} and \hat{w}_{RR,i} are described by a Gaussian likelihood; their respective means and standard deviations are given as \mu_{RB,i}, \mu_{RR,i} and \sigma_{RB,i}, \sigma_{RR,i}, respectively.
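The cancellation of the dark-matter term in the ratio of Eqs (24) and (25) can be checked numerically. The sketch below uses toy bias values and histogram heights (all hypothetical); the function names are ours, and the proportionality constants are set to unity.

```python
import numpy as np

def w_cross(b_R, b_B, n_B, dz, w_DM):
    """Model for w_RB,i, Eq (24): b_R,i * b_B,i * (n_B,i / dz_i) * w_DM,i (up to a constant)."""
    return b_R * b_B * (n_B / dz) * w_DM

def w_auto(b_R, n_R, dz, w_DM):
    """Model for w_RR,i, Eq (25): b_R,i^2 * (n_R,i / dz_i) * w_DM,i (up to a constant)."""
    return b_R**2 * (n_R / dz) * w_DM

# Toy inputs (hypothetical values, 5 redshift bins).
dz = np.full(5, 0.1)
b_R = np.array([1.2, 1.3, 1.4, 1.5, 1.6])
b_B = np.array([1.0, 1.1, 1.2, 1.3, 1.4])
n_B = np.array([0.1, 0.3, 0.3, 0.2, 0.1])
n_R = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# Two different dark-matter clustering terms: the ratio is unchanged, because
# w_DM divides out and the ratio equals (b_B n_B)/(b_R n_R) in each bin.
w_DM_1 = np.linspace(1.0, 0.5, 5)
w_DM_2 = np.linspace(2.0, 0.1, 5)
theta_1 = w_cross(b_R, b_B, n_B, dz, w_DM_1) / w_auto(b_R, n_R, dz, w_DM_1)
theta_2 = w_cross(b_R, b_B, n_B, dz, w_DM_2) / w_auto(b_R, n_R, dz, w_DM_2)
```

This illustrates why the ratio construction suppresses the dependence on the cosmological model, provided the redshift bins are thin enough that w_DM is common to both measurements.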
The theoretical prediction for the transformed random variable \Theta^{meas}_i is then

\Theta^{theo}_i(b_{B,i}, b_{R,i}, n_{B,i}, n_{R,i}) = \frac{b_{B,i} \, n_{B,i}}{b_{R,i} \, n_{R,i}} , (27)

and its likelihood, the density of a ratio of two independent Gaussian random variables, is

p(\Theta^{meas}_i | \Theta^{theo}_i) = \frac{b(\Theta^{meas}_i) \, d(\Theta^{meas}_i)}{a^3(\Theta^{meas}_i)} \frac{1}{\sqrt{2\pi} \, \sigma_{RB,i} \, \sigma_{RR,i}} \left[ \Phi\left( \frac{b(\Theta^{meas}_i)}{a(\Theta^{meas}_i)} \right) - \Phi\left( -\frac{b(\Theta^{meas}_i)}{a(\Theta^{meas}_i)} \right) \right] + \frac{1}{a^2(\Theta^{meas}_i) \, \pi \, \sigma_{RB,i} \, \sigma_{RR,i}} \exp\left( -\frac{c}{2} \right) . (28)

Here \Phi(z) denotes the cumulative distribution function of the zero-mean, unit-variance normal distribution, and

a(\Theta) = \sqrt{ \Theta^2 / \sigma^2_{RB,i} + 1 / \sigma^2_{RR,i} } , (29)

b(\Theta) = \mu_{RB,i} \, \Theta / \sigma^2_{RB,i} + \mu_{RR,i} / \sigma^2_{RR,i} , (30)

d(\Theta) = \exp\left( \frac{b^2(\Theta) - c \, a^2(\Theta)}{2 \, a^2(\Theta)} \right) , (31)

c = \mu^2_{RB,i} / \sigma^2_{RB,i} + \mu^2_{RR,i} / \sigma^2_{RR,i} . (32)

For the following discussion we define n = p / (z_H - z_L), i.e. the variables p and n refer to histogram heights normalized to sum to unity and to unit area, respectively. The variables n_B and n_R refer to the histogram heights of the base and reference samples. This likelihood assumes independence between neighboring redshift bins; we can, however, expect a degree of correlation, especially for lower redshift bins, due to magnification effects. Eq. (28) is approximately independent of the modelling of the w_DM term, assuming that we pick sufficiently thin redshift bins to 'divide out' the redshift dependence of the dark matter clustering term w_DM. In reality, we expect that the likelihood will deviate from the Gaussian assumption (see e.g. Hahn et al. 2019). Furthermore, while the covariance matrix will be dominated by the diagonal, we can expect that the cross-correlation measurements in different bins will be correlated; the assumption of a diagonal covariance matrix is thus an approximation.

Based on the aforementioned independence assumption between redshift bins, the joint likelihood for all bins i now reads

p(\Theta | b_B, b_R, n_R, n_B) = \prod_{i=1}^{N_{bins}} p(\Theta_i | b_{B,i}, b_{R,i}, n_{R,i}, n_{B,i}) . (33)

The function that describes the set of ratios b_{B,i} / b_{R,i} will be denoted as C(z, \Delta r) and depends both on redshift and on the size of the annulus \Delta r. For a selected annulus size we will use the abbreviation C(z).

To formulate a joint likelihood for the data vector of both galaxy positions and photometry, we use the composite likelihood ansatz (e.g. Varin et al. 2011), i.e. the product of the marginal likelihoods of the photometry \hat{F} and of the vector of cross-correlation ratios \Theta:

p(\hat{F}, \Theta | n_{B,sys}, n_B, n_R, C(z)) = p(\hat{F} | n_B)^{w_1} \, p(\Theta | n_{B,sys}, C(z))^{w_2} , (34)

where the w_i are weights that can be selected to improve the efficiency of the estimation (see e.g. Varin et al. 2011) by increasing the influence of one part of the composite likelihood over the other. Furthermore, the composite likelihood can be conditioned on auxiliary parameters such as the field. For simplicity we consider here only the simple case of unit weights. A more complete treatment would use the joint data vectors of \hat{F} and \Theta for the selected galaxy samples, which would include covariances between the \Theta measurements for different tomographic bins. Furthermore, the quality and number of available photometric bands can change for different spatial areas. Similarly, we would need to construct a joint data vector of \hat{F} and \Theta that incorporates these covariances. This can be modelled either analytically (e.g. Stoyan & Stoyan 1994; Sánchez et al. 2020) or by using spatial resampling techniques. In this work we concentrate on the composite likelihood as given in Eq. (34) and defer extensions of the method to future work.

Parameter inference is only a single step in a full statistical analysis and needs to be combined with additional analysis steps.
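Stepping back to the ratio likelihood: Eqs (28)-(32) have the structure of the classical density for the ratio of two independent Gaussians (Hinkley 1969), which can be verified numerically. A minimal sketch with toy moments (function names and values are ours, not the paper's code):

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ratio_pdf(theta, mu_x, sig_x, mu_y, sig_y):
    """Density of X/Y for independent X~N(mu_x, sig_x^2), Y~N(mu_y, sig_y^2),
    with the same a, b, c, d structure as Eqs (28)-(32)."""
    a = math.sqrt(theta**2 / sig_x**2 + 1.0 / sig_y**2)          # Eq (29)
    b = mu_x * theta / sig_x**2 + mu_y / sig_y**2                # Eq (30)
    c = mu_x**2 / sig_x**2 + mu_y**2 / sig_y**2                  # Eq (32)
    d = math.exp((b**2 - c * a**2) / (2.0 * a**2))               # Eq (31)
    term1 = (b * d) / (a**3 * math.sqrt(2.0 * math.pi) * sig_x * sig_y) \
        * (norm_cdf(b / a) - norm_cdf(-b / a))
    term2 = math.exp(-c / 2.0) / (a**2 * math.pi * sig_x * sig_y)
    return term1 + term2

# Monte Carlo cross-check with toy moments (hypothetical values).
rng = np.random.default_rng(1)
samples = rng.normal(1.0, 0.1, 500_000) / rng.normal(2.0, 0.1, 500_000)
grid = np.linspace(0.2, 0.9, 1001)
pdf = np.array([ratio_pdf(t, 1.0, 0.1, 2.0, 0.1) for t in grid])
```

With these toy moments the density integrates to unity over the bulk of its support and peaks near the ratio of the means, in agreement with the Monte Carlo draws.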
We need to ensure that parameters can be uniquely inferred and that the posterior does not exhibit flat regions or strong degeneracies, which can make the application of MCMC techniques difficult (see e.g. Rothenberg 1971; Raue et al. 2013). Furthermore, one needs to investigate the sensitivity of the results to changes in the prior and likelihood. Finally, one has to judge whether the inferred posteriors are sensible in the context of the cosmological/astrophysical model and evaluate whether the fitted model is a good representation of the observed data. This last step is the topic of this section. § 7.1 describes posterior predictive checks as a means to evaluate the goodness of fit of the model, and in § 7.2 we propose a method to parametrize systematics due to biased photometric likelihoods.
The idea of posterior predictive checks (PPC) is to simulate synthetic data from a fitted model, which are then compared with the original measurements as an internal consistency check. For example, several approaches exist that allow us to estimate the quality of probability calibration based on the distribution of posterior predictive p-values (e.g. Gelman et al. 1996). This paper provides only a short discussion of posterior predictive testing, which is still an area of active research. The basic idea of model checking is to investigate whether data predicted by the fitted model are representative of the observed data. The classical approach, for example developed for linear regression, uses an analytical probability distribution for the test statistic (for example the F distribution) and evaluates the tail probability to test how much of an 'outlier' the observed data is, given the fitted model. This approach can be problematic if the model is complex (an example being models that do not have the form of a generalized linear model; see, e.g., Gelman et al. 1996), which makes it difficult to derive an analytic sampling distribution for a suitable statistic given a fitted model. Further problems arise due to the significant influence of outliers, or if parameters are subject to boundary conditions. Posterior predictive checks extend this classical approach within the Bayesian framework.

Starting from the composite likelihood defined in Eq. (34), the posterior predictive distribution reads

p(D^{rep} | D) = \iint dn_B \, dC(z) \; p(D^{rep} | n_B, C(z)) \, p(n_B, C(z) | D) , (35)

where D = \{ \Theta(w_{BR}, w_{RR}), \hat{F} \} and D^{rep} denotes the replicated, or predicted, measurements sampled from the fitted model. We note that our model specifies only the ratio between w_RB and w_RR. However, we can always sample replications for one of these quantities using the measurement of the other, via the data transformation specified in Eq. (26).

Posterior predictive checking is particularly useful as it allows us to assess model calibration quality and predictive accuracy without spectroscopic (or accurate multiband photometric) validation data. This is a decisive advantage of specifying the photometric likelihood over empirical approaches based on machine learning or, more specifically, conditional density estimation. By modelling the data-generating process of SED evolution and redshifting, we can generate new photometry using a fitted physical model, giving us the opportunity to develop statistical tests based on the predictive distribution to probe the quality of the photometric likelihood calibration.

To avoid confusion, it is important to clarify that posterior predictive checking and model comparison, while often methodologically similar, have different goals. Posterior predictive checks aim to provide internal consistency tests for a given model and inference framework. Model comparison/combination has the goal of comparing/combining multiple models based on some measure of fitting accuracy. However, model comparison and combination can also be based on posterior predictive accuracy (see e.g. Gelman et al. 2013). In the development of new models and the evaluation of directions for improvement, both of these concepts work together.

An important prerequisite for the fitting of complex statistical models is the evaluation of model parameter degeneracies.
We have discussed the fact that the parameters that describe the redshift-dependent galaxy-dark matter bias and the sample redshift distribution enter the clustering likelihood in a completely degenerate way. Accordingly, completely different n_B-C(z) combinations will produce the same data distribution.
This limits the information that can be obtained on n_B, depending on the prior information imposed on C(z). A combination with the photometric likelihood can help to break these degeneracies, as long as the SED modelling itself does not exhibit strong degeneracies, e.g. between the SED type and the redshift of galaxies. A dedicated study of model checking in color space is left for future work.

In § 8.3 we will discuss how miscalibrated likelihoods can lead to systematic biases and uncertainties in the deconvolution operation. Our goal in this section is to include in the model a simple transformation that parametrizes these systematics. A simple choice is a Gaussian convolution kernel, which modifies the sample redshift distribution as

p(z | n_{B,sys}) = \int d\eta \, p(\eta) \int N(z | z', \eta) \, p(z' | n_B) \, dz' . (36)

Here n_{B,sys} denotes the parameters that describe the sample redshift distribution after the convolution is applied to the original sample redshift distribution p(z | n_B), where the convolution kernel is a Gaussian with a shift \Delta\mu in the mean and standard deviation \Delta\sigma, \eta = (\Delta\mu, \Delta\sigma):

N(z | z', \eta) = \frac{1}{\sqrt{2\pi \Delta\sigma^2}} \exp\left[ -\frac{1}{2} \left( \frac{z - z' - \Delta\mu}{\Delta\sigma} \right)^2 \right] . (37)

Assuming the same histogram parametrization for p(z | n_B) as for p(z | n_{B,sys}), we see that this implies an affine transformation n_B \rightarrow A(\eta) \cdot n_B, where the matrix A is given as

A_{ij} = \Phi\left( z_{iH} \,\Big|\, \frac{z_{jH} + z_{jL}}{2} + \Delta\mu, \Delta\sigma \right) - \Phi\left( z_{iL} \,\Big|\, \frac{z_{jH} + z_{jL}}{2} + \Delta\mu, \Delta\sigma \right)
= \frac{1}{2} \left[ \mathrm{erf}\left( \frac{z_{iH} - \frac{z_{jH} + z_{jL}}{2} - \Delta\mu}{\sqrt{2} \, \Delta\sigma} \right) - \mathrm{erf}\left( \frac{z_{iL} - \frac{z_{jH} + z_{jL}}{2} - \Delta\mu}{\sqrt{2} \, \Delta\sigma} \right) \right] , (38)

where \Phi(x | \mu, \sigma) denotes the cumulative distribution function of a normal distribution with mean \mu and standard deviation \sigma, and erf the error function. We choose a Gaussian smoothing kernel since it has been shown that unbiased cosmological inference from measurements of weak lensing and LSS critically depends on accurate recovery of the mean and standard deviation of the photometric sample redshift distribution; biases in both of these statistics can be parametrized using this kernel. In contrast to the normal distribution, which admits a closed-form solution under an affine transformation, the logit-normal does not have this property. However, we empirically find that we can approximate the shape of the distribution after an affine transformation by a logit-normal distribution to good accuracy. For a given set of \eta, we can find the updated parameter values by assumed density filtering.

While marginalization using sampling techniques is possible, we choose a computationally efficient approximation and marginalize over a discrete model set that consists of different smoothing sizes \Delta\sigma_i and \Delta\mu = 0:

p(n_B, C(z) | \Theta, \hat{F}) = \sum_{\Delta\sigma_i} p(\Delta\sigma_i | \Theta, \hat{F}) \, p(n_B, C(z) | \Theta, \hat{F}, \Delta\sigma_i) . (39)

This assumes that there is no systematic shift in the sample redshift distribution, and that deviations from the true underlying distribution are due to miscalibrated but, on average, unbiased individual galaxy likelihoods. Furthermore, we will assume that p(\Delta\sigma_i | \Theta, \hat{F}) = p(\Delta\sigma_i), i.e., the smoothing size is a prior choice, independent of the data, that can be calibrated on simulations. For simplicity we will use a flat prior p(\Delta\sigma) here.

Figure 4. Directed graphical representation of the statistical model described in this paper. Empty/filled circles denote random variables that are latent/observed. N(·) denotes the dimensionality of the random variable. Boxes encapsulate random variables with the same dimensionality. Solid/dotted lines indicate random/deterministic relationships between random variables.

Here we review and summarize our complete model. We review the structure of all components in § 7.3.1 and the inference strategy in § 7.3.2.
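The affine smoothing of Eq. (38) and the flat-prior marginalization of Eq. (39) can be sketched together. The bin edges, the populated bin, and the Δσ grid below are hypothetical choices for illustration; the source bins are evaluated at their midpoints, as in Eq. (38).

```python
import numpy as np
from math import erf, sqrt

def smoothing_matrix(z_edges, d_mu, d_sigma):
    """Affine representation of the Gaussian smoothing kernel, Eq (38):
    n_B -> A(eta) n_B, with source bins evaluated at their midpoints."""
    zL, zH = z_edges[:-1], z_edges[1:]
    mid = 0.5 * (zL + zH)
    A = np.empty((mid.size, mid.size))
    for i in range(mid.size):
        for j in range(mid.size):
            hi = erf((zH[i] - mid[j] - d_mu) / (sqrt(2.0) * d_sigma))
            lo = erf((zL[i] - mid[j] - d_mu) / (sqrt(2.0) * d_sigma))
            A[i, j] = 0.5 * (hi - lo)
    return A

z_edges = np.linspace(0.0, 2.0, 31)      # 30 histogram bins (toy grid)
n_B = np.zeros(30)
n_B[10] = 1.0                            # a single populated bin

# Eq (38): one smoothed histogram per kernel width (Delta mu = 0).
d_sigma_grid = [0.02, 0.05, 0.10]
smoothed = [smoothing_matrix(z_edges, 0.0, ds) @ n_B for ds in d_sigma_grid]

# Eq (39) with a flat prior p(Delta sigma_i): the marginal result is the
# equally weighted average of the conditional results.
n_marginal = np.mean(smoothed, axis=0)
```

With Δμ = 0 the transformation preserves the mean of the histogram while broadening it; a non-zero Δμ shifts the mean, which is exactly the pair of systematics the kernel is designed to parametrize.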
Fig. 4 summarizes the joint inference strategy presented in the previous sections in a directed graphical model. Each random variable is denoted by a circle; probabilistic/deterministic relationships are denoted by solid/dotted lines. Boxes around random variables denote the dimensionality of the random variable. For example, the color vector f_i is an N_Filter-dimensional random variable for N_Galaxy galaxies in N_tomo tomographic bins. Filled circles denote observed random variables, in our case the photometry \hat{F} and the cross-correlation ratios \Theta.

The graphical model is structured into three parts: the left part represents the photometric likelihood, the middle describes our treatment of systematics, and the right part describes the clustering redshift likelihood, which depends on the spectroscopic redshift distribution and the redshift-dependent galaxy-dark matter bias ratio.

The structure of the graph illustrates the construction of the model via the composite likelihood ansatz discussed in § 6. It separates the two data sources \hat{F} and \Theta into the left and right parts of the graph. As we are mainly interested in performing inferences on the n_B variables, we marginalize over the z variables, which provides significant computational advantages. The mapping between n_B and n_{B,sys} takes the form of a deterministic transformation and is therefore indicated by a dotted line.

While n_R is here treated as a random variable, its 'shot noise' uncertainties are very small for the considered sample sizes of the spectroscopic sample. We therefore decided to fix its value to the maximum likelihood value, i.e., the histogram height.

The presented model consists of two likelihood terms and a deterministic transformation n_B \rightarrow n_{B,sys}. Starting with the photometric likelihood, we employ the inference scheme detailed in Appendices A and B, which results in a posterior p(n_B | \hat{F}) defined in Eq. (16).
We then employ the transformation detailed in § 7.2 to parametrize systematics in the inferred posterior arising from biased photometric likelihoods. This yields a systematics-corrected posterior p(n_{B,sys} | \hat{F}). The final combination with the clustering likelihood term, p(n_{B,sys}, C(z) | \hat{F}, \Theta) \propto p(\Theta | n_{B,sys}, C(z)) \, p(n_{B,sys} | \hat{F}), is then performed using a Markov chain Monte Carlo (MCMC) sampling approach.

We update the parameters (n_{B,sys}, C(z)) in two sampling blocks: the set of parameters that describe the redshift distribution of the base sample, n_{B,sys}, and the parameters c that describe the evolution of the redshift-dependent galaxy-dark matter bias ratio C(z). Concretely, we iteratively sample from the conditional distributions p(n_{B,sys} | \hat{F}, \Theta, c) and then from p(c | \hat{F}, \Theta, n_{B,sys}). This means that we sample each parameter block in turn, while holding the other parameter block fixed. The sampling method that can be employed to update each parameter block is flexible. We use a Metropolis-Hastings sampling scheme to sample the c parameters. To sample from the conditional p(n_{B,sys} | \hat{F}, \Theta, c) we also employ a Metropolis scheme; however, we perform the sampling not in terms of the n_{B,sys} parameters, but in logit space, i.e. in terms of the y parameters that are connected with n_{B,sys}, or their normalized analog p_sys, via Eq. (B1). In this way we can utilize proposal distributions that are defined in real space to sample a distribution defined on the simplex. We reiterate that the posterior p(n_B | \hat{F}) has, in our framework, an analytical form, and sampling from it is therefore very efficient. However, if we include a treatment of systematics or a clustering redshift likelihood in the inference, we need to employ sampling approaches, because the posterior no longer has a closed-form solution.
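The two-block update scheme can be sketched on a toy target. Below, a correlated two-dimensional Gaussian stands in for the joint posterior, with scalars `y` and `c` playing the roles of the logit-space histogram parameters and the bias-ratio parameters; this is a generic illustration of Metropolis-within-Gibbs blocking, not the paper's sampler, and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_post(y, c):
    """Toy stand-in for log p(n_B,sys, C(z) | F, Theta): a correlated Gaussian,
    with y the logit-space histogram parameter and c the bias-ratio parameter."""
    return -0.5 * (y**2 + c**2 + y * c)

y, c, chain = 0.0, 0.0, []
for _ in range(30000):
    # Block 1: Metropolis update of y with c held fixed.
    y_prop = y + rng.normal(0.0, 1.5)
    if np.log(rng.random()) < log_post(y_prop, c) - log_post(y, c):
        y = y_prop
    # Block 2: Metropolis update of c with y held fixed.
    c_prop = c + rng.normal(0.0, 1.5)
    if np.log(rng.random()) < log_post(y, c_prop) - log_post(y, c):
        c = c_prop
    chain.append((y, c))
chain = np.asarray(chain[5000:])  # discard burn-in
```

For this target the chain recovers the expected anti-correlation between the two blocks, mirroring the n_B,sys-C(z) degeneracy discussed above.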
To demonstrate the eο¬ectiveness of our inference methodology, weconsider an idealized setup which allows us to forecast the con-straints on the sample redshift distribution that we can expect froma DESI-like spectroscopic survey overlapping with the LSST Y10footprint. We assume in this section that the composite likelihoodis well calibrated, both in the clustering and in the template ο¬t-ting part. This assumption will likely not hold in practice and we We tried several approaches like Hamiltonian MCMC or Elliptical Slicesampling. All approaches work well; we discuss here the structurally simplestscheme.
Figure 5.
The top three panels show cross-correlation measurements between the photometric base sample and the spectroscopic reference sample, w_RB, and correlation function measurements of the spectroscopic reference sample, w_RR, for three different annuli (see § 5) as a function of redshift. The errorbars correspond to the ±1σ measurement errors of the correlation function measurements. The lowest panel plots the true sample redshift probability density function of the spectroscopic reference sample ('Spec.') and the photometric base sample ('Phot.').

We therefore study the impact of likelihood mis-specification on the inference in § 8.3. In particular, we will evaluate the performance of our methodology by comparing with science requirements of the first year ('LSST Y1') and the 10th year ('LSST Y10') of the LSST data release defined in The LSST Dark Energy Science Collaboration et al. (2018). We note that we will utilize a mock simulation of galaxy photometry likelihoods in § 8.2 instead of SED likelihoods constructed on the simulated photometry, because the simulated photometry of the DC2 simulations showed a discontinuous color-redshift mapping, which induced an unrealistically large error in our template fitting results. In § 8.3, which will discuss aspects of model checking and will not interpret results in the context of LSST science requirements, we will use both the Machine Learning and Template Fitting methods described in § 2.1.

We use the software package 'the-wizz' (Morrison et al. 2017; https://github.com/morriscb/the-wizz) to measure cross-correlations between the reference (spectroscopic) and base (photometric) samples, w_RB, and correlations of the reference sample, w_RR, in 20 equally spaced redshift bins, corresponding to 3042 Mpc comoving distance at the mean redshift of ⟨z⟩ = 0.88. Fig. 5 shows these measurements in the three top panels for three annuli (see § 5) extending out to 10 Mpc. In the lowest panel we show the sample redshift probability density functions for the reference and base samples. The correlations w_RR are larger than the cross-correlations w_RB in all three panels, implying an on-average lower-than-unity ratio b_B(z)/b_R(z). The errorbars increase with redshift due to the decreasing number of galaxies, leading to a larger shot noise error. Consider the shape of w_RR and w_RB for the largest annulus in the low redshift range of z < 1.0.
We see that the measurements of w_RB and w_RR roughly resemble the shape of the reference and base sample redshift distributions (lowest panel), implying a roughly linear ratio b_B(z)/b_R(z) in this redshift range (compare with Fig. 6). For smaller annuli, there is a change in slope around the redshift at which w_RB and w_RR are approximately equal. Here, the sample redshift distribution of the base sample is larger than the one of the reference sample. Since the QSO sample will have a larger galaxy-dark matter bias than the base sample, we can expect b_B(z)/b_R(z) < 1. Since our measurements were obtained on the 300 deg² CosmoDC2 simulation, while the final analysis will use the much larger overlap between DESI and LSST Y10, we scale the error on these measurements by a factor of 4 in the following analyses.

We would like to generate a cross-correlation likelihood that allows us full control over the imposed redshift-dependent galaxy-dark matter bias ratio model and that has roughly the correct error width for the full DESI area. Furthermore, since the mean of the ratio distribution depends on both the mean and the variance of w_RB and w_RR, scaling the measurement error will induce biases in the mean of this ratio and therefore in the reconstructed sample photometric redshift distribution. To correct for possible biases that would occur if the measurement errors were naively scaled, and to allow for better control over the redshift-dependent galaxy-dark matter bias ratio, we first fit the galaxy-dark matter bias ratio C(z) to the original data within these 20 bins. We show the results of this fit in Fig. 6 for different ranges in physical distance and indicate the redshift range of the different spectroscopic samples by vertical lines, where these limits are meant to guide the eye and do not constitute sharp breaks (compare with Fig. 1).
We see that within these redshift ranges, C(z) is a smooth function and can be fitted by a 3rd degree Chebyshev polynomial, C(z) = Σ_{i=0}^{3} c_i T_i(z). Here, T_i(z) denote the Chebyshev polynomials and c_i the expansion coefficients. Within each of the three redshift ranges, we perform a regression fit to the median of the C(z) posterior due to its heavy tails. For the following analysis we select the median annulus. We use the measurements w_RR and w_RB defined in Eq. (26) and forecast a new data vector for w_RB, while holding the measurement of w_RR fixed. This amounts to multiplying the ratio w_RB/w_RR by a constant for each redshift bin that compensates for the difference in the mean of the reconstructed sample photometric redshift distribution before and after we impose the fitted redshift-dependent galaxy-dark matter bias model and perform the scaling. In this way we ensure that our ratio distribution is self-consistent with the photometric sample redshift distribution. We then use these adjusted measurements in the composite likelihood (Eq. 34). This correction is necessary because we would otherwise merely use noisy measurements with wrongly decreased errorbars, which would lead to biases in the probability calibration of any inference.

(Footnote: The uncertainty in the correlation function measurements will likely differ from this factor of four scaling in the real data. Our treatment of the cross-correlation data vector here is approximate and will be complemented in future work.)
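A fit of this form, a 3rd degree Chebyshev expansion of the bias ratio C(z) within one redshift range, can be sketched with NumPy's Chebyshev utilities. The bias-ratio values below are a synthetic stand-in for the posterior median, and the mapping of z onto the Chebyshev domain [-1, 1] is our own choice; the paper does not specify these details.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# synthetic stand-in for the posterior median of C(z) in one redshift range
z = np.linspace(0.6, 1.2, 20)
ratio_median = 0.9 - 0.15 * (z - 0.9) + 0.05 * (z - 0.9) ** 2  # assumed smooth trend

# map z onto the Chebyshev domain [-1, 1] before fitting
zlo, zhi = z.min(), z.max()
x = 2.0 * (z - zlo) / (zhi - zlo) - 1.0

coeffs = C.chebfit(x, ratio_median, deg=3)  # c_i in C(z) = sum_i c_i T_i(z)
fit = C.chebval(x, coeffs)

max_err = np.max(np.abs(fit - ratio_median))  # a cubic reproduces a quadratic exactly
```

Since the stand-in trend is quadratic, the cubic least-squares fit recovers it to machine precision; for the heavy-tailed posteriors in the paper, the fit targets the median rather than the mean.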
[Figure 6 here: galaxy-dark matter bias ratio b_B(z)/b_R(z) as a function of redshift z ('Galaxy-DM Bias Ratio Scale/z-Dependence, ±1σ'), with the Mag-Lim, LRG, ELG, and QSO sample ranges indicated.]
Figure 6.
Galaxy-dark matter bias ratio as a function of redshift for different scales. We show here the [16, 84] percentiles, which correspond to ±1σ for a Normal distribution. The redshift ranges of the different spectroscopic subsamples are indicated by vertical lines. Within these ranges, the bias ratio is a relatively smooth function of redshift, indicating a smooth redshift dependence of the galaxy-dark matter bias of the photometric sample. At the borders of these ranges, the bias ratio curves are discontinuous.

Furthermore, we want to have control over the underlying redshift-dependent galaxy-dark matter bias model to eliminate any additional specification error. Our setup should therefore merely be seen as an approximate forecast of the constraining power that cross-correlations will add to the composite likelihood and a demonstration of the inference methodology. It is not an accurate treatment of galaxy-dark matter bias or of the correlation function measurements expected in the final LSST measurement. For this we would require a more realistic simulation of the DESI-like spectroscopic sample, the final area, and a much better understanding of the galaxy-dark matter bias of each galaxy population, all of which are subjects of active investigation in the field.
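Schematically, the redshift information carried by these measurements follows the standard clustering-redshift relation, w_RB(z)/w_RR(z) ∝ n_B(z) b_B(z) / [n_R(z) b_R(z)]. The sketch below inverts this relation on self-consistent toy inputs; all arrays are assumed for illustration and this is a textbook-style schematic, not the paper's Eq. (26).

```python
import numpy as np

def clustering_z_estimate(w_rb, w_rr, n_r, bias_ratio):
    """Schematic clustering-z recovery: n_B(z) proportional to
    (w_RB/w_RR) * n_R(z) / [b_B(z)/b_R(z)], normalized per bin."""
    raw = (w_rb / w_rr) * n_r / bias_ratio
    return raw / raw.sum()

# toy self-consistent inputs on a 20-bin grid (all assumed, for illustration)
z = np.linspace(0.05, 1.95, 20)
n_b = np.exp(-0.5 * ((z - 0.8) / 0.3) ** 2); n_b /= n_b.sum()
n_r = np.exp(-0.5 * ((z - 1.0) / 0.4) ** 2); n_r /= n_r.sum()
bias_ratio = 0.9 - 0.1 * z           # b_B/b_R < 1, as expected for a QSO reference
w_rr = 0.2 * np.exp(-z)              # arbitrary positive reference autocorrelation
w_rb = bias_ratio * (n_b / n_r) * w_rr

n_b_hat = clustering_z_estimate(w_rb, w_rr, n_r, bias_ratio)  # recovers n_b
```

The sketch makes explicit why the bias ratio C(z) = b_B/b_R must be modelled: any unmodelled redshift dependence in it is absorbed directly into the recovered n_B(z).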
For the photometric part of the composite likelihood we assume a redshift scaling of σ(z) = 0.02(1 + z), where z denotes the true, or spectroscopic, redshift. This scaling is a photometric redshift performance benchmark for LSST frequently adopted in the literature (e.g. Graham et al. 2020) and defined in the LSST science requirements document (https://docushare.lsst.org/docushare/dsweb/Get/LPM-17, page 4). We then generate a mock catalog by sampling values from the true sample redshift distribution and generate a catalog of mock likelihoods by scattering these values within this redshift error model. We reiterate that we assume here that the redshift likelihood constructed from the galaxies' photometry, mimicked here by the aforementioned redshift error model, is perfectly known. We note that this is an idealized assumption that we impose to demonstrate the methodology described in the previous sections.

Tab. 1 summarizes the different configurations we use in this work. In particular, we investigate posteriors obtained using several different sample sizes and regularization techniques. The first column shows the abbreviation used in the text, the second the generated sample size of the mock catalog, and the third the corresponding figure.

Figure 7.
Left panel:
Posteriors of the sample redshift probability density function p(z) of the photometric sample (short: photometric redshift distribution), parametrized by the parameters n_B, for the different setups listed in Tab. 1. The x-axis shows the redshift value z, the y-axis the value of the n_B parameters. The errorbars are the [16, 84] percentiles, which would correspond to 1σ intervals for a normal distribution. The black dashed curve shows the spectroscopic redshift distribution in the binning used by the cross-correlation measurements. We consider four cases, and refer to Tab. 1 and § 8 for details on the experimental setup. We highlight a variance-dominated posterior, 'Small Sample (50k)', which shows a characteristic alternating, or 'zig-zag', pattern, as well as a comparison between the cyan 'Medium Sample (500k)' and 'Medium Sample (500k), Fid. Setup & WX' posteriors. Here, the latter includes a cross-correlation ('WX') data vector in its likelihood. This reduces the error especially in the high-redshift tail of the distributions. Right panel:
The y-axis shows the relative difference between the posterior of the photometric redshift distribution, parametrized by the n_B^post parameters, and the spectroscopic redshift distribution n_B^true (black dashed curve in the left panel).

Abbreviation                            Sample Size   Figure   Tikhonov Reg. α   Initial Binning   Effective Binning   WX
Small Sample (50k)                      50k           Fig. 7   0.1               50                50                  -
Medium Sample (500k)                    500k          Fig. 7   0.08              50                31                  -
Large Sample (5000k) Oversmoothing      5000k         Fig. 7   0.0001            25                25                  -
Medium Sample (500k), Fid. Setup & WX   500k          Fig. 7   0.08              50                31                  ✓
Tik. Regul. Low                         5000k         Fig. 8   0.0001            50                25                  -
Tik. Regul. Medium                      5000k         Fig. 8   1                 50                25                  -
Tik. Regul. High                        5000k         Fig. 8   10                50                25                  -
Oversmoothing                           5000k         Fig. 8   0.0001            25                25                  -
Tik. Regul. Low + WX                    5000k         Fig. 8   0.0001            50                25                  ✓
Table 1.
Summary of the different configurations that we test in this work. The first column lists the abbreviations, the second the generated sample size, and the third refers to the Figure where the setup is analysed. The next columns list the value of the Tikhonov regularization parameter α (see § 4.2.1), the number of initial bins, the effective bin number after (potentially) applying merging bin regularization (see § 4.2.1), and an indicator of whether the composite likelihood includes the cross-correlation data (see § 5).

The columns 'Tikhonov Reg. α', 'Initial Binning' and 'Effective Binning' list the value of the Tikhonov regularization parameter α (see § 4.2.1), as well as the initial and effective bin numbers, i.e., the histogram bin number after the merging bin regularization scheme. The final column indicates whether the cross-correlation data vector is included in the composite likelihood.

(Footnote: As an approximate rule, one can expect a noisy deconvolution, if no prior is applied, when the size of the bins is smaller than the ±1σ range of the individual galaxy redshift likelihoods for moderate sample sizes. For our redshift range and photometric redshift scatter this would imply an effective number of bins of around 30.)

Fig. 7 shows a selection of posteriors using setups from Tab. 1. The left-hand panel shows the obtained probability density function, the right panel the relative difference between these posteriors and the spectroscopic redshift distribution that is shown as the black dashed line in both panels. The errorbars are the [16, 84] percentiles, corresponding to a Gaussian ±1σ interval. The 'Small Sample (50k)' setup highlights a noisy, i.e., variance-dominated, deconvolution, where we clearly see the comparatively large and fluctuating errorbars in Fig. 7. We note that imposing a smoothing method will reduce these features.
However, this can come at the expense of additional biases, as discussed later, and the characteristic covariance structure in the posterior is not a priori problematic, as long as draws from the posteriors are bounded and well-defined. Merging bin regularization exploits this anti-correlation structure to provide an 'objective' regularization without the need to carefully motivate an external prior or smoothing model choice.

An alternative that provides additional physical motivation is the inclusion of clustering redshift measurements in the composite likelihood. This can be seen by comparing the posteriors from the 'Medium Sample (500k)' and the 'Medium Sample (500k), Fid. Setup & WX' cases. The corresponding results in Fig. 7 show that for the same regularization, the inclusion of the cross-correlation data in the likelihood decreases the uncertainties, which is especially visible in the high-redshift tail. We note that these results are dependent on the chosen galaxy-dark matter bias model. As discussed at the beginning of this section, the parametrization used here is very flexible, and the effective number of parameters can likely be reduced if a more physical model is chosen. In this regard, we can view the presented reduction in the statistical error due to clustering redshifts as conservative. As mentioned previously, biases due to ill-motivated regularization choices play an important role, especially for large sample sizes, where the statistical error is small. We illustrate this here in the 'Large Sample (5000k) Oversmoothing' case, by deliberately choosing a coarser binning without merging bin regularization. We clearly see that the statistical error is quite small, with the bias dominating.

In order to investigate the quality of probability calibration, we consider the posterior distribution over the mean values of photometric sample redshift distributions drawn from the posterior of n_B.
This is a reasonable choice, as it has been shown that accurate modelling of weak lensing and LSS critically depends on accurate recovery of the posterior mean.

(Footnote: Concretely, we draw a number of n_B realizations that each parametrize a photometric sample redshift distribution and evaluate the mean on each of these distributions.)

Fig. 8 shows five boxplots that each visualize the distribution of the posterior mean for a different setup under consideration. The box edges denote the [16, 84] percentiles, and the whiskers, i.e., the thin vertical lines with short horizontal edges, represent the [2.5, 97.5] percentiles. The horizontal line within the box represents the median. The x-axis shows the different scenarios, as listed in Tab. 1; the y-axis shows the value of the posterior mean. The middle solid black line corresponds to the mean of the true redshift distribution, shown as the dashed black line in the left panel of Fig. 7. We reiterate that all results have been obtained using a mock catalog containing 5000k galaxies. The (dashed/dotted), (grey/magenta) horizontal lines represent the requirement values for (Y1/Y10), (LSS/WL) measurements as given in the LSST DESC Science Requirements Document (DESC SRD, The LSST Dark Energy Science Collaboration et al. 2018). We note that the DESC SRD considers a tomographic analysis and not a single bin, as we do here. We therefore restrict ourselves to a qualitative comparison. Furthermore, it should be noted that higher order moments of the photometric sample redshift distribution will also correlate with cosmological parameters, especially for a clustering likelihood (see e.g. Nicola et al. 2020; Hadzhiyska et al. 2020), and our metric is therefore bound to be incomplete. Redefining these metrics and requirements is the subject of ongoing work.

(Footnote: We note that the original definition of the boxplot uses a different definition of the box size and the whiskers. We refer to Wickham & Stryjewski (2012) for more details.)

We consider three scenarios: 'Tik. Regul. Low', 'Tik. Regul. Medium' and 'Tik. Regul. High'. As can be seen from Tab. 1, these scenarios differ by their value of the Tikhonov regularization parameter α. With increasing α, the error bars decrease and the bias in the results increases. When comparing this with the 'Oversmoothing' results, we see the same pattern. This similarity in behavior arises because both a large α and a choice of large bins reduce the variance of each bin.

Finally, we show the impact of including the cross-correlation measurements in the data vector in the 'Tik. Regul. Low + WX' scenario, which adds clustering information to the 'Tik. Regul. Low' scenario. When comparing these two cases, we see that the distribution of the posterior mean is now symmetric and reasonably centered within the science requirements. We note that the uncertainties are still much larger compared with the previously considered, strongly regularized cases. This shows that while the effect of reducing the variance of the posteriors is similar when using regularization or when including cross-correlation data in the composite likelihood, the posteriors can be much better calibrated in the latter case. Using a smoothing, or regularization, method essentially makes assumptions about the true shape of the distribution without strict data evidence. In contrast, adding cross-correlations to the composite likelihood adds this information in a physical, data-driven way.

Another effect that needs consideration is the increase in the intrinsic estimator bias due to the 'downsampling' of the probability density function to a lower resolution, e.g., by picking a larger bin width or by imposing a different regularization or smoothing scheme. This loss in resolution implies that we inadvertently limit the accuracy with which small scale structure can be reconstructed in the density field along the line-of-sight.
As demonstrated and studied in detail in Rau et al. (2017), this effect can lead to biases in the cosmological parameter inference that are often small, but that would need scrutiny for upcoming data analyses. Since we gave a detailed description of this effect in Rau et al. (2017), including schemes to detect and mitigate it, we do not focus on it in detail here. However, this effect can be illustrated for the current setup, since the merging bin regularization downsamples the resolution to a relatively coarse grid of 31 bins. Furthermore, consider the distribution of true redshifts discretized using the 20 bin grid used to obtain the cross-correlation measurements. Since we use this distribution, i.e. the black curve in Fig. 7, as a reference, we also have to consider its intrinsic discretization error. Concretely, when comparing the mean estimated from this curve with the sample mean, we obtain a difference of 0.0079. While this is of the same order as the Y1 science requirements in Fig. 8, Y10 requirements will necessitate an increase in sample size or the inclusion of cross-correlation constraints that will allow us to perform inference at a higher resolution. Due to the slow expected convergence of deconvolution estimators with sample size (see e.g. Carroll & Hall 1988), it is likely that an increase in sample size of several orders of magnitude will be necessary. This is attainable given the large number of observed galaxies in LSST Y10, and our methodology can scale to large sample sizes. However, in order to reach the sample sizes that are expected for LSST observations, we need to develop an implementation that optimizes storage space and uses an efficient parallelization strategy, which is beyond the scope of this work. Alternatively, we could use a different scheme that employs a continuous model like logistic Gaussian processes (Rau et al. 2020) or Dirichlet processes.
The convergence of these density estimators will likely be better; however, they will also require additional computational overhead in the inference. A detailed study of estimator convergence is needed to settle on a recommendation and prove significant improvement over the simple histogram scheme employed here; we will leave this for future work.

Most importantly, however, it is likely that systematic errors due to the miscalibration and mis-specification of the composite likelihood, either by a suboptimal galaxy-dark matter bias model or due to miscalibrated SED likelihoods, will lead to an error budget that dominates the aforementioned errors. If the mis-specification can be parametrized and marginalized over, the variance of the parameter posteriors will be increased; otherwise it will lead to biases in the resulting parameter posteriors.

Figure 8. Boxplot illustration of the mean of the posterior sample redshift probability density function of the photometric sample (short: posterior mean) for the different experimental setups listed in Tab. 1 and detailed in § 8. The x-axis lists the different scenarios, the y-axis the value of the posterior mean. The box shows the [16, 84] percentiles; the vertical lines with horizontal edges (whiskers) show the [2.5, 97.5] percentiles, corresponding to the 1σ and 2σ intervals for the normal distribution. The horizontal line in the box is the median. The (dashed/dotted), (grey/magenta) lines correspond to the requirement on the uncertainty of the posterior mean as quoted in the LSST DESC Science Requirements Document (DESC SRD, The LSST Dark Energy Science Collaboration et al. 2018) for (Y1/Y10) (LSS/WL) measurements. The central, solid grey line is the mean of the true redshift distribution, shown as the dashed black line in the left panel of Fig. 7. All results have been obtained using a mock catalog of 5000k galaxies with photometric redshift scatter that is perfectly calibrated. We highlight the decrease in the statistical error and potential increase in systematic bias for larger regularization, going from the leftmost to the fourth case. The rightmost boxplot shows the impact of including clustering redshift information in the likelihood.
In the following section we will discuss these sources of error. We will showcase the usage of posterior predictive checks as a way to detect miscalibration and suggest procedures for consistent model checking and refinement.

We discussed and showcased our inference methodology in the previous section using idealized data. To complement the discussion, this section highlights how miscalibrated likelihoods can lead to biases in the inferred sample redshift distribution and how posterior predictive checks can be used to detect these issues. To mimic well-calibrated photometric likelihoods we use conditional density estimates from the FlexZboost package (Dalmasso et al. 2020). We note that these conditional densities are not photometric likelihoods in the sense of Eq. (34). The free parameters in the photometric likelihood are the redshifts of the galaxies and parameters that describe properties of the Spectral Energy Distribution (SED). In conditional density estimation, non-physical parameters describe a flexible model that provides a mapping between photometry and redshift. This flexible model is then fitted to known calibration data. Thus, while in SED fitting the distribution of redshift constitutes a posterior distribution, conditional density estimation treats it as a predictive distribution (often without marginalizing over the modelling uncertainty). However, since the goal of this subsection is to demonstrate potential systematic biases and uncertainties in the deconvolution operation, this difference is not of great importance here.

Figure 9. We plot the residual between the replicated and true scale-averaged cross-correlation measurement w_RB between a DESI-like spectroscopic reference ('R') and photometric base ('B') sample as a function of redshift. The horizontal line with errorbars shows the uncertainties in the original measurement. The green/magenta contours show the scenario where we marginalize over all parameters c that parametrize the redshift-dependent galaxy-dark matter bias ratio function C(z) (see § 8.1), without the systematics kernel, for the unbiased/biased (f = 0.8) cases. The yellow/blue contours consider the biased scenario (f = 0.8) with a fixed C(z), and do/do not marginalize over the systematics kernel. The error bars and contours show the [5, 95] percentiles. We see that the replicated measurements do not show significant tension with the original measurements if we either marginalize over the systematic ('Biased f=0.8 Sys. Marg.') or if we use a flexible redshift-dependent galaxy-dark matter bias model ('Biased Marg. Bias f=0.8'). Only if the form of the galaxy-dark matter bias is known to good precision (in this case we hold its values fixed) are PPCs using cross-correlations sensitive in detecting tensions.

In order to simulate the impact that a population of galaxies with inaccurately calibrated photometric redshift likelihoods has on the deconvolved redshift distribution, we consider a redshift range within which a fraction (f = 0.8) of the galaxies have a likelihood from the BPZ code, and only 20% retain their conditional density predictions from the FlexZboost code. We picked this setup because the BPZ predictions within this redshift range, while being inferior to the FlexZboost predictions, still have an acceptable quality. We perform this experiment by selecting a range in redshift because we will perform posterior predictive checks (PPC) using cross-correlations and need to control where we would expect systematics. This will allow us to disentangle model misspecification
issues from the systematics, e.g. from the 'noisy' deconvolution described previously. We note that the quality of photometric redshift likelihoods does not sharply change with redshift in this way in real photometric samples. Instead, photometric redshift quality is a complex function in color space that strongly depends, e.g., on the quality of the photometry, the number of available bands, the amount of calibration data, and the template set. Modelling this accurately is beyond the scope of this work and would require the measurement of cross-correlations in color cells and the extension of PPC to the full composite likelihood, which includes sampling the photometry of galaxies in these color cells. Using the mean of the FlexZboost conditional densities for the selection instead of the true redshifts would 'smooth out' the quality of the likelihoods as a function of redshift at the boundaries of the selected interval.

We infer the histogram parameters n_B and the parameters that govern the galaxy-dark matter bias ratio evolution, c, following the Chebyshev basis expansion described in § 8.1. For simplicity we deconvolve the redshift distribution on the same 20 bin redshift grid used in the cross-correlation data vector. For 500k galaxies, this leads to very small statistical errorbars in the deconvolution. As a simplification we can then fix the n_B posterior to its maximum likelihood value. As mentioned in the previous section, this oversmoothing will lead to biases in the n_B posteriors. However, since we only perform a posterior predictive analysis with respect to the clustering likelihood, which is less constraining than the photometric likelihood, the systematics incurred by these simplifications, and the underestimation of statistical error, are sub-dominant compared with the overall statistical error budget from the correlation function measurements.

Fig. 9 shows the sampled cross-correlation measurements from the fitted joint model, as residuals with respect to the original measurements. We showcase four scenarios. In the unbiased case we use FlexZBoost PDFs and marginalize over all c parameters. Due to the good calibration of these Machine Learning-produced conditional distributions, we obtain very similar results compared with the previous section. The reason for this success is, of course, the representative training set that would not be available in a practical application. Furthermore, we consider three scenarios with f = 0.8. The scenario shown in yellow fixes the galaxy-dark matter bias parameters (c), but marginalizes over the parameters of the systematics kernel. The blue/magenta lines fix/marginalize over the c parameters, but do not include the systematics kernel correction.

As can be seen in Fig. 9, the treatment of galaxy-dark matter bias has a profound impact on the consistency between the replicated and original cross-correlation measurements. Within the errors, the results are consistent for all scenarios except the one without systematics kernel correction that uses a fixed C(z) model. As shown by the yellow line, these biases can be corrected by the systematics kernel marginalization. This illustrates that if sufficient information about C(z) is available, the clustering likelihood alone can allow for powerful posterior predictive checks. If this is not the case, consistency tests of redshift distributions with respect to clustering redshift measurements can be misleading. Provided sufficient information on the galaxy-dark matter bias, we can parametrize the biases in the deconvolved density estimate using, e.g., a convolution with a kernel function as described in § 7.2. We show these results as the yellow lines ('Biased f = 0.8 Sys. Marg.'), where we convolve the n_B vector with a Gaussian kernel function whose width we vary over a grid of 40 steps. We see that this correction can compensate for the misspecified likelihoods even in the case of a fixed C(z) model. The degeneracy between the redshift-dependent galaxy-dark matter bias ratio model and the n_B parameters highlights the importance of performing posterior predictive checks in color space to provide additional information on the redshift distribution. However, this requires careful modelling of SEDs and the development of a transparent, reproducible analysis framework that additionally includes tests for parameter degeneracies and a model comparison framework. This is beyond the scope of this work, but will be addressed in a future paper.
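The PPC logic used in this section can be sketched generically: replicate the data vector from the fitted model many times and flag bins where the observed measurement falls far in the tails of the replicated distribution. The Gaussian noise model, the 20 per cent bias, and the 1 per cent tail threshold below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def ppc_tail_prob(w_meas, w_rep):
    """Per-bin posterior predictive tail probability: the fraction of
    replicated measurements exceeding the observed value. Values close
    to 0 or 1 indicate tension between model and data."""
    return (w_rep > w_meas[None, :]).mean(axis=0)

rng = np.random.default_rng(1)
z = np.linspace(0.05, 1.95, 20)
w_true = 0.1 * np.exp(-z)        # toy 'observed' cross-correlation amplitude
sigma = 0.005                    # assumed per-bin measurement error

# replicated data vectors: a well-specified and a 20%-biased model
w_rep_ok = w_true + rng.normal(0.0, sigma, (5000, z.size))
w_rep_bad = 0.8 * w_true + rng.normal(0.0, sigma, (5000, z.size))

p_ok = ppc_tail_prob(w_true, w_rep_ok)    # near 0.5 everywhere: no tension
p_bad = ppc_tail_prob(w_true, w_rep_bad)  # tiny in the low-z bins: tension
n_tense_ok = int(np.sum((p_ok < 0.01) | (p_ok > 0.99)))
n_tense_bad = int(np.sum((p_bad < 0.01) | (p_bad > 0.99)))
```

As in Fig. 9, the biased model is only flagged where the bias is large compared to the measurement error, which is why a fixed, well-known C(z) is needed for the check to be sensitive.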
Accurate photometric redshift inference is one of the most important challenges in large area photometric surveys like LSST, DES, HSC, or KiDS. As discussed in detail in § 3, photometric redshift inference is, from a statistical point of view, a deconvolution problem, where an underlying true redshift distribution is convolved with an SED model-dependent error distribution given by the photometric likelihood. The deconvolution inference of sample redshift distributions is not new (e.g. Padmanabhan et al. 2005; Leistedt et al. 2016; Malz & Hogg 2020), and spatial information has also been incorporated into the inference (e.g. Alarcon et al. 2020b; Sánchez & Bernstein 2019; Jones & Heavens 2019; Rau et al. 2020). We extended these prior works by developing a fast approximate inference scheme for deconvolution that combines redshift information from both the photometry and the spatial distribution of galaxies in terms of a composite likelihood ansatz. We particularly provided a discussion of regularization techniques and the tradeoff between bias and variance in the Bayesian context for medium to large sample sizes. In particular, our goal is to include the treatment of photometric redshift via the likelihood of the galaxies' photometry in the current cosmological inference framework, which is based on correlation functions. The main reason for our likelihood choice is to allow easy integration into the likelihood inference framework based on two-point statistics of galaxy density and shear fields. This is more difficult for the other approaches presented in Alarcon et al. (2020b) and Sánchez & Bernstein (2019) since, in the currently demonstrated form, the redshift information from cross-correlating the overlapping spectroscopic sample is included via an estimator and not via a likelihood (that would depend on cosmological parameters). While the works of Padmanabhan et al. (2005), Leistedt et al.
(2016); Malz & Hogg (2020) are structurally similar interms of the treatment of the photometric likelihood, they do notdiscuss the eο¬ect of including clustering information. It is notewor-thy that the early work by Padmanabhan et al. (2005) provides anexcellent, explicit discussion of regularization, which is the maindiο¬culty in the photometric redshift problem, although not in thecontext of probability calibration and the aforementioned joint in-ference framework. We summarized our approach to the challengesthat arise in the estimation of redshift distributions for samples ofgalaxies in the context of photometric surveys. Concretely, we con-sidered the combination of photometric information with two-pointstatistics, the scalability and regularization of the deconvolution in-ference in the large sample scenario, and investigate the impact of
systematics from misspecified individual galaxy photometric likelihoods, proposing parametrizations for these systematics. These achievements lay the foundations for future extensions that we discuss in the next section. In § 4.1 we described our inference methodology, designed to facilitate inference on the large galaxy catalogs expected in LSST. The scheme uses a Laplace approximation in logit space and performs inference using an iterative scheme of expectation-maximization update equations. This provides computational advantages over sampling approaches. Additionally, this methodology facilitates fast joint inference with a cross-correlation data vector (see § 5) that we included in a composite likelihood ansatz. As highlighted in § 6, this provides the possibility of additional extensions that include two-point statistics from cosmological weak lensing and galaxy-galaxy lensing measurements. As we discussed in § 3, ensemble redshift distribution inference based on a photometric likelihood is a deconvolution problem, which requires regularization to yield bounded and well-defined results. In this context, we discussed a regularization scheme that consists of a combination of Tikhonov regularization with (more importantly) a scheme that merges neighboring bins to exploit the characteristic covariance structure in the deconvolved densities. In agreement with the findings of the original paper by Kuusela (2016) that proposed and applied this scheme to the Poisson inverse problem, we find that the 'Merging Bin' scheme leads to better calibrated results compared with Tikhonov regularization and with an oversmoothing scheme that selects a coarser redshift binning for the sample redshift histograms. In order to test and discuss the quality of our posterior inference, we used data from the CosmoDC2 simulations to generate a spectroscopic DESI-like sample and a photometric mock catalog that uses an LSST-like photometric error model.
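The Tikhonov component of such a regularization scheme can be illustrated with a short sketch. This is a generic penalized least-squares deconvolution, not the paper's implementation; the second-difference smoothness operator `L` and the strength parameter `tau` are our illustrative choices:

```python
import numpy as np

def tikhonov_deconvolve(A, d, tau):
    """Solve min_n ||A n - d||^2 + tau * ||L n||^2 via the normal
    equations, where A maps the true bin heights n to the observed
    data vector d and L is a second-difference smoothness penalty."""
    n_bins = A.shape[1]
    L = (np.diag(-2.0 * np.ones(n_bins))
         + np.diag(np.ones(n_bins - 1), 1)
         + np.diag(np.ones(n_bins - 1), -1))
    return np.linalg.solve(A.T @ A + tau * L.T @ L, A.T @ d)
```

In contrast, the merging-bin scheme coarsens the binning adaptively rather than penalizing curvature, which is why it can avoid the oversmoothing biases an overly large `tau` would induce.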
These simulations allowed us to test the impact of a spectroscopic calibration sample with an inhomogeneous galaxy population as a function of redshift. We found that the ratio between the redshift-dependent galaxy-dark matter bias of the photometric and the spectroscopic sample is a smooth function of redshift if the spectroscopic calibration sample consists of a single galaxy population, and is discontinuous if the galaxy population strongly changes. We therefore employed a step-wise smooth function based on a Chebyshev polynomial expansion to parametrize this ratio. In § 8 we performed a forecast of redshift inference performance on ideal data, assuming perfectly calibrated individual galaxy redshift likelihoods. We found that, using the aforementioned merging bin regularization, we were able to produce accurate posteriors of ensemble redshift distributions. We reiterate that using other regularization schemes, like an overly large Tikhonov regularization parameter or an oversmoothing approach that picks overly wide histogram bins, can lead to significant biases in the recovered posterior mean. When compared with the DESC science requirements for WL and large scale structure measurements in terms of the mean of the photometric sample redshift distribution, we found that we can meet the DESC SRD Y1 goals and remain consistent with the DESC SRD Y10 goals with 5000k galaxies, if cross-correlations are included in the joint composite likelihood. In practical applications, however, Spectral Energy Distribution (SED) templates for galaxies will be subject to modelling biases that cannot be well calibrated using spectroscopic data (see e.g. Hartley et al. 2020). We therefore proposed to use posterior predictive checks (PPC) as a means to evaluate the quality of our inference. Here, we compared replications of the data sampled from the fitted model with the original measurement to evaluate model goodness-of-fit.
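The step-wise smooth parametrization of the bias ratio described above can be sketched as an independent Chebyshev expansion on each redshift segment, with discontinuities allowed at the segment boundaries. The helper name, segment boundaries, and coefficient values below are illustrative placeholders, not the paper's fit:

```python
import numpy as np
from numpy.polynomial import chebyshev

def bias_ratio(z, coeff_segments, z_breaks, z_range=(0.0, 2.0)):
    """Step-wise smooth bias-ratio model: one Chebyshev expansion per
    redshift segment, allowing a jump where the spectroscopic galaxy
    population changes (at each entry of z_breaks)."""
    z = np.asarray(z, dtype=float)
    edges = np.concatenate(([z_range[0]], np.asarray(z_breaks, dtype=float),
                            [z_range[1]]))
    out = np.empty_like(z)
    for i, c in enumerate(coeff_segments):
        lo, hi = edges[i], edges[i + 1]
        mask = (z >= lo) & (z <= hi)
        # map each segment to [-1, 1], the natural Chebyshev domain
        x = 2.0 * (z[mask] - lo) / (hi - lo) - 1.0
        out[mask] = chebyshev.chebval(x, c)
    return out
```

For example, `bias_ratio(z, [c_lowz, c_highz], [1.0])` would model a population change at $z = 1$ with separate low- and high-redshift expansions.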
Specifically, cross-correlation redshift inference is often used to calibrate photometric redshifts obtained using photometry (Newman 2008; Johnson et al. 2016; Davis et al. 2017). In § 8.3 we demonstrated that PPC of cross-correlation measurements can detect systematic biases in the recovered sample redshift distribution if the galaxy-dark matter bias of the photometric and spectroscopic samples is known to sufficient accuracy. In order to parametrize potential biases in the sample redshift distribution posteriors caused by misspecified photometric likelihoods, particularly over-deconvolution effects that lead to overly narrow redshift distributions, we proposed a simple Gaussian filter that, as demonstrated in § 8.3, was able to correct these biases.
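Schematically, a posterior predictive check of this kind compares a test statistic of replicated data vectors, simulated from posterior draws, against the statistic of the observed measurement. The following generic sketch is ours; the function names and choice of statistic are hypothetical, not the paper's pipeline:

```python
import numpy as np

def ppc_pvalue(posterior_samples, simulate_data, observed, stat, seed=0):
    """Posterior predictive check: for each posterior draw theta,
    simulate a replicated data vector and compare stat(replication)
    against stat(observed). Returns the posterior predictive p-value,
    the fraction of replications at least as extreme as the data."""
    rng = np.random.default_rng(seed)
    t_obs = stat(observed)
    t_rep = np.array([stat(simulate_data(theta, rng))
                      for theta in posterior_samples])
    return float(np.mean(t_rep >= t_obs))
```

A p-value near 0 or 1 flags that the fitted model fails to replicate the measured cross-correlation data vector, which is how the PPCs in § 8.3 expose misspecified likelihoods.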
10 FUTURE WORK
In future work, it will be important to extend the inference scheme developed in this paper. We plan to consider a range of extensions, e.g., iterated nested Laplace approximations (Bornkamp 2011) in logit space, the usage of more flexible distributions that can be fitted using variational inference schemes, as well as the development of specialized subsampling MCMC schemes. The different techniques will be evaluated in combination with regularization approaches based on the quality of their probability coverage. Another extension, particularly to reduce the bias in the density estimation, is to consider other parametrizations for the deconvolved density, either by employing density estimators with better mean squared error scaling, like kernel density estimators and basis function expansions, or by using methods such as logistic Gaussian processes (e.g. Rau et al. 2020). The combination of photometric and clustering information can be extended by connecting the modelling of SEDs and redshift-dependent galaxy-dark matter bias modelling via the luminosity function, as shown in van Daalen & White (2018). This also has the potential to reduce the degeneracy between SED and redshift-dependent galaxy-dark matter bias systematics. Finally, we note that the data quality of the photometry will not be the same in all areas on the sky. In order to include these field-to-field variations in the composite likelihood framework, we can, for example, condition the likelihood on the field and include a corresponding data covariance in the likelihood by employing either resampling techniques (see e.g. Davison & Hinkley 2013) or theoretical modelling (e.g. Stoyan & Stoyan 1994; Sánchez et al. 2020). To conclude, we have presented an efficient photometric redshift inference framework that combines information from both the photometry and the spatial distribution of galaxies. The methodology is designed to scale well to large samples.
We complement this framework with methods for regularization, model checking and redshift systematics parametrization. The forecasts we performed using CosmoDC2 data give us confidence that, with the additional improvements described here, the methodology presented will enable accurate and well-calibrated redshift inference for LSST and other ongoing and future large area photometric surveys.
SOFTWARE
Besides software referenced directly in the text, we performed the analyses in this work using the following software packages: the python language (van Rossum 1995), scipy (Virtanen et al. 2020), numpy (Harris et al. 2020), jupyter notebook (Kluyver et al. 2016),
ipython (Perez & Granger 2007), matplotlib (Hunter 2007) and pandas (Wes McKinney 2010).

DATA AVAILABILITY STATEMENT
The cosmoDC2 extragalactic catalog is publicly available at https://portal.nersc.gov/project/lsst/cosmoDC2/_README.html. The ancillary catalogs (photo-z and DESI-like selection for cosmoDC2) and other derived data underlying this article will be shared on reasonable request to the corresponding author. The source code that implements the algorithms presented in this article will be made available via Zenodo.
ACKNOWLEDGEMENTS
This paper has undergone internal review in the LSST Dark Energy Science Collaboration. We would like to thank the internal reviewers David Alonso, Will Hartley and David Kirkby for their insightful comments.

MMR led and planned the project from the initial idea and motivation to the experimental design. He performed the analyses in interaction with the coauthors, and prepared the paper. CBM contributed the development of clustering redshifts software, the creation of clustering redshifts photometric and spectroscopic samples, and the creation of clustering redshift data products. SJS constructed the photometric redshift catalogs and contributed to writing the corresponding portions of the paper. SW provided feedback on the method development and manuscript. RM advised on motivation, scope, experimental design, and analysis, and contributed to the editing of the paper draft. YYM developed the access tools for cosmoDC2 and photo-z data products.

MMR and RM are supported by DOE grant DE-SC0010118 and NSF grant IIS-1563887. SJS acknowledges support from DOE grant DE-SC0009999 and NSF/AURA grant N56981C. MMR was supported in part by the Simons Foundation (Simons Investigator Award, PI: Mandelbaum). This material is based upon work supported in part by the National Science Foundation through Cooperative Agreement 1258333 managed by the Association of Universities for Research in Astronomy (AURA), and the Department of Energy under Contract No. DE-AC02-76SF00515 with the SLAC National Accelerator Laboratory. Additional LSST funding comes from private donations, grants to universities, and in-kind support from LSSTC Institutional Members. Support for YYM was provided by NASA through the NASA Hubble Fellowship grant no. HST-HF2-51441.001 awarded by the Space Telescope Science Institute, which is operated by the Association of Universities for Research in Astronomy, Incorporated, under NASA contract NAS5-26555.

The DESC acknowledges ongoing support from the Institut National de Physique Nucléaire et de Physique des Particules in France; the Science & Technology Facilities Council in the United Kingdom; and the Department of Energy, the National Science Foundation, and the LSST Corporation in the United States. DESC uses resources of the IN2P3 Computing Center (CC-IN2P3, Lyon/Villeurbanne, France) funded by the Centre National de la Recherche Scientifique; the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; STFC DiRAC HPC Facilities, funded by UK BIS National E-infrastructure capital grants; and the UK particle physics grid, supported by the GridPP Collaboration. This work was performed in part under DOE Contract DE-AC02-76SF00515.
REFERENCES
Abbott T. M. C., et al., 2018a, Phys. Rev. D, 98, 043526
Abbott T. M. C., et al., 2018b, ApJS, 239, 18
Aihara H., et al., 2018, PASJ, 70, S4
Alarcon A., et al., 2020a, arXiv e-prints, p. arXiv:2007.11132
Alarcon A., Sánchez C., Bernstein G. M., Gaztañaga E., 2020b, MNRAS, 498, 2614
Albrecht A., et al., 2006, arXiv e-prints, pp astro-ph/0609591
Arnouts S., Cristiani S., Moscardini L., Matarrese S., Lucchin F., Fontana A., Giallongo E., 1999, MNRAS, 310, 540
Atchison J., Shen S., 1980, Biometrika, 67, 261
Benítez N., 2000, ApJ, 536, 571
Benjamin J., et al., 2013, MNRAS, 431, 1547
Benson A. J., 2012, New Astron., 17, 175
Bernstein G., Huterer D., 2010, MNRAS, 401, 1399
Bishop C. M., 2006, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg
Bonnett C., 2015, MNRAS, 449, 1043
Bornkamp B., 2011, Journal of Computational and Graphical Statistics, 20, 656
Box G. E. P., Muller M. E., 1958, Ann. Math. Statist., 29, 610
Brown M. J. I., et al., 2014, ApJS, 212, 18
Carrasco Kind M., Brunner R. J., 2013, MNRAS, 432, 1483
Carroll R. J., Hall P., 1988, Journal of the American Statistical Association, 83, 1184
Cawthon R., et al., 2020, arXiv e-prints, p. arXiv:2012.12826
Chang C., et al., 2016, MNRAS, 459, 3203
Chen T., Guestrin C., 2016, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16. ACM, New York, NY, USA, pp 785–794, doi:10.1145/2939672.2939785, http://doi.acm.org/10.1145/2939672.2939785
Clerkin L., Kirk D., Lahav O., Abdalla F. B., Gaztañaga E., 2015, MNRAS, 448, 1389
Collister A. A., Lahav O., 2004, Publications of the Astronomical Society of the Pacific, 116, 345
Craig I. J. D., Brown J. C., 1986, Inverse problems in astronomy. A guide to inversion strategies for remotely sensed data
DESI Collaboration et al., 2016, arXiv e-prints, p. arXiv:1611.00036
Dalmasso N., Pospisil T., Lee A. B., Izbicki R., Freeman P. E., Malz A. I., 2020, Astronomy and Computing, 30, 100362
Davis C., et al., 2017, arXiv e-prints, p. arXiv:1710.02517
Davison A. C., Hinkley D. V., 2013, Bootstrap Methods and Their Application. Cambridge University Press, USA
Feldmann R., et al., 2006, MNRAS, 372, 565
Gatti M., et al., 2018, MNRAS, 477, 1664
Gatti M., et al., 2020, arXiv e-prints, p. arXiv:2012.08569
Gelman A., Meng X.-L., Stern H., 1996, Statistica Sinica, pp 733–807
Gelman A., Hwang J., Vehtari A., 2013, arXiv e-prints, p. arXiv:1307.5928
Gerdes D. W., Sypniewski A. J., McKay T. A., Hao J., Weis M. R., Wechsler R. H., Busha M. T., 2010, ApJ, 715, 823
Graham M. L., et al., 2020, AJ, 159, 258
Greisel N., Seitz S., Drory N., Bender R., Saglia R. P., Snigula J., 2015, MNRAS, 451, 1848
Hadzhiyska B., Alonso D., Nicola A., Slosar A., 2020, arXiv e-prints, p. arXiv:2007.14989
Hahn C., Beutler F., Sinha M., Berlind A., Ho S., Hogg D. W., 2019, MNRAS, 485, 2956
Harris C. R., et al., 2020, Nature, 585, 357–362
Hartley W. G., et al., 2020, MNRAS, 496, 4769
Hearin A., Korytov D., Kovacs E., Benson A., Aung H., Bradshaw C., Campbell D., LSST Dark Energy Science Collaboration 2020, MNRAS, 495, 5040
Heitmann K., et al., 2019, ApJS, 245, 16
Heymans C., et al., 2020, arXiv e-prints, p. arXiv:2007.15632
Hikage C., et al., 2019, PASJ, 71, 43
Hildebrandt H., et al., 2017, MNRAS, 465, 1454
Hildebrandt H., et al., 2020, arXiv e-prints, p. arXiv:2007.15635
Hoyle B., 2016, Astronomy and Computing, 16, 34
Hoyle B., Rau M. M., 2019, MNRAS, 485, 3642
Hoyle B., Rau M. M., Bonnett C., Seitz S., Weller J., 2015, MNRAS, 450, 305
Hoyle B., et al., 2018, MNRAS, 478, 592–610
Hunter J. D., 2007, Computing in Science Engineering, 9, 90
Huterer D., Takada M., Bernstein G., Jain B., 2006, MNRAS, 366, 101
Huterer D., Lin H., Busha M. T., Wechsler R. H., Cunha C. E., 2014, MNRAS, 444, 129
Ilbert O., et al., 2006, A&A, 457, 841
Ivezić Ž., et al., 2019, ApJ, 873, 111
Izbicki R., Lee A. B., 2017, Electron. J. Statist., 11, 2800
Johnson A., et al., 2016, MNRAS, 465, 4118
Jones D. M., Heavens A. F., 2019, MNRAS, 483, 2487
Joudaki S., et al., 2018, MNRAS, 474, 4894
Joudaki S., et al., 2020, A&A, 638, L1
Kalmbach J. B., Connolly A. J., 2017, AJ, 154, 277
Kluyver T., et al., 2016, in Loizides F., Schmidt B., eds, Positioning and Power in Academic Publishing: Players, Agents and Agendas. pp 87–90
Korytov D., et al., 2019, ApJS, 245, 26
Kress R., 1998, Numerical Analysis. Graduate Texts in Mathematics, Springer New York, https://books.google.com.na/books?id=R6182rh0tKEC
Kuusela M. J., 2016, PhD thesis, Lausanne, EPFL, doi:10.5075/epfl-thesis-7118, http://infoscience.epfl.ch/record/220015
Laureijs R., et al., 2011, arXiv e-prints, p. arXiv:1110.3193
Leistedt B., Mortlock D. J., Peiris H. V., 2016, MNRAS, 460, 4258
Ma Z., Hu W., Huterer D., 2006, ApJ, 636, 21
Malz A. I., Hogg D. W., 2020, arXiv e-prints, p. arXiv:2007.12178
Mandelbaum R., 2018, ARA&A, 56, 393
Matarrese S., Coles P., Lucchin F., Moscardini L., 1997, MNRAS, 286, 115
McLeod M., Balan S. T., Abdalla F. B., 2017, MNRAS, 466, 3558
McQuinn M., White M., 2013, MNRAS, 433, 2857
Meister A., 2009, Deconvolution Problems in Nonparametric Statistics. Lecture Notes in Statistics, Springer Berlin Heidelberg, https://books.google.de/books?id=ItGkJPZQj-MC
Ménard B., Scranton R., Schmidt S., Morrison C., Jeong D., Budavari T., Rahman M., 2013, arXiv e-prints, p. arXiv:1303.4722
Morrison C. B., Hildebrandt H., Schmidt S. J., Baldry I. K., Bilicki M., Choi A., Erben T., Schneider P., 2017, MNRAS, 467, 3576
Myles J., et al., 2020, arXiv e-prints, p. arXiv:2012.08566
Neal R. M., Hinton G. E., 1993, in Learning in Graphical Models. Kluwer Academic Publishers, pp 355–368
Newman J. A., 2008, ApJ, 684, 88
Newman J. A., et al., 2015, Astroparticle Physics, 63, 81
Nicola A., et al., 2020, J. Cosmology Astropart. Phys., 2020, 044
Padmanabhan N., et al., 2005, MNRAS, 359, 237
Pawitan Y., 2001, In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford science publications, OUP Oxford, https://books.google.com/books?id=M-3pSCVxV5oC
Perez F., Granger B. E., 2007, Computing in Science Engineering, 9, 21
Prat J., et al., 2018, MNRAS, 473, 1667
Quiroz M., Villani M., Kohn R., Tran M.-N., Dang K.-D., 2018, arXiv e-prints, p. arXiv:1807.08409
Raccanelli A., Rahman M., Kovetz E. D., 2017, MNRAS, 468, 3650
Ranganathan A., 2004, Assumed Density Filtering,
Rau M. M., Seitz S., Brimioulle F., Frank E., Friedrich O., Gruen D., Hoyle B., 2015, MNRAS, 452, 3710
Rau M. M., Hoyle B., Paech K., Seitz S., 2017, MNRAS, 466, 2927
Rau M. M., Wilson S., Mandelbaum R., 2020, MNRAS, 491, 4768
Raue A., Kreutz C., Theis F., Timmer J., 2013, Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences, 371, 20110544
Rothenberg T. J., 1971, Econometrica, 39, 577
Salvato M., Ilbert O., Hoyle B., 2019, Nature Astronomy, 3, 212
Sánchez C., Bernstein G. M., 2019, MNRAS, 483, 2801
Sánchez C., Raveri M., Alarcon A., Bernstein G. M., 2020, MNRAS
Schmidt S. J., Ménard B., Scranton R., Morrison C., McBride C. K., 2013, MNRAS, 431, 3307
Scottez V., et al., 2016, MNRAS, 462, 1683
Scranton R., et al., 2005, ApJ, 633, 589
Simon P., Hilbert S., 2018, A&A, 613, A15
Speagle J. S., Eisenstein D. J., 2015, arXiv e-prints, p. arXiv:1510.08073
Spergel D., et al., 2015, arXiv e-prints, p. arXiv:1503.03757
Stölzner B., Joachimi B., Korn A., Hildebrandt H., Wright A. H., 2020, arXiv e-prints, p. arXiv:2012.07707
Stoyan D., Stoyan H., 1994, Fractals, Random Shapes and Point Fields: Methods of Geometrical Statistics. Wiley Series in Probability and Statistics, Wiley, https://books.google.com/books?id=Dw3vAAAAMAAJ
Tagliaferri R., Longo G., Andreon S., Capozziello S., Donalek C., Giordano G., 2003, Neural Networks for Photometric Redshifts Evaluation. pp 226–234, doi:10.1007/978-3-540-45216-4_26
Tanaka M., et al., 2018, PASJ, 70, S9
The LSST Dark Energy Science Collaboration et al., 2018, arXiv e-prints, p. arXiv:1809.01669
Uitert E., et al., 2017, MNRAS, 476
Varin C., Reid N., Firth D., 2011, Statist. Sinica, pp 5–42
Virtanen P., et al., 2020, Nature Methods, 17, 261
Wes McKinney 2010, in Stéfan van der Walt, Jarrod Millman, eds, Proceedings of the 9th Python in Science Conference. pp 56–61, doi:10.25080/Majora-92bf1922-00a
Wickham H., Stryjewski L., 2012, Technical report, 40 years of boxplots. had.co.nz
Zhou R., et al., 2020a, arXiv e-prints, p. arXiv:2001.06018
Zhou R., et al., 2020b, Research Notes of the American Astronomical Society, 4, 181
van Daalen M. P., White M., 2018, MNRAS, 476, 4649
van Rossum G., 1995, Technical Report CS-R9526, Python tutorial. Centrum voor Wiskunde en Informatica (CWI), Amsterdam
APPENDIX A: DERIVING THE E-M UPDATE EQUATIONS
We start the discussion with an intuitive motivation for the theoretical foundation of the E-M algorithm. Assume a 'system' that consists of hidden variables $Y$ and observed variables $Z$. We wish to find a set of parameters $\theta$ that maximize the joint distribution of both variables given $\theta$. We know from statistical physics that the free energy of this system, $F(q, \theta)$, should be minimized; it depends on the distribution of hidden variables, or states, $q(y)$, and the parameters of the conditional $p(y \,|\, z, \theta)$. The E-M algorithm performs this minimization iteratively, where we assume an initial choice for $\theta$. In the E-step we choose a distribution $q(y)$, while holding $\theta$ fixed, so that $F(q, \theta^{\rm old})$ is minimized. In the subsequent M-step we hold $q$ fixed, but choose $\theta$ in a way that $F(q^{\rm old}, \theta)$ is minimized. This procedure is iterated until the free energy does not change much with additional iterations, i.e., the scheme converges. In practical calculations the connection with the variational free energy is not often used, but it is a useful concept to build up an intuitive understanding of the method. We refer the interested reader to Neal & Hinton (1993) for a more detailed explanation. In the following we describe the derivation of the E-M algorithm in the concrete context of finding the maximum likelihood solution of our photometric likelihood. For that we will use a different notation; however, the intuition remains unchanged if we associate the free energy (up to a sign) with the term $\mathcal{L}(q, \boldsymbol{\rho})$ in Eq. (A3). To derive the Expectation-Maximization algorithm, we first introduce the parameter vector $\boldsymbol{\zeta}_i$ for each galaxy $i$: an $N_{\rm tot}$-dimensional vector that indicates bin membership in the '1-hot encoding' scheme.
This means that for each galaxy we have an $N_{\rm tot}$-dimensional binary vector, where a '1' ('0') indicates that the galaxy resides (does not reside) in the respective redshift bin. For the remainder of this section we will work with the normed histogram bin heights $\boldsymbol{\rho} = \mathbf{n}_B \, \Delta z$ that parametrize the prior distribution over $z$ and $t$ as defined in Eq. (13). Denoting by $\mathbf{Z}$ the collection of $\boldsymbol{\zeta}$ vectors of all galaxies, the prior distribution over $\mathbf{Z}$ is then given as
$$ p(\mathbf{Z} \,|\, \boldsymbol{\rho}) \propto \prod_{k=1}^{N_{\rm tot}} \prod_{i=1}^{N_{\rm gal}} \rho_k^{\zeta_{i,k}} \, , \quad {\rm (A1)} $$
where $\sum_{k=1}^{N_{\rm tot}} \rho_k = 1$. Using $\mathbf{Z}$ we can write the conditional distribution of the measured photometry given $\mathbf{Z}$ as
$$ p(\tilde{\mathbf{F}} \,|\, \mathbf{Z}, \boldsymbol{\rho}) \propto \prod_{j=1}^{N_{\rm gal}} \prod_{k=1}^{N_{\rm tot}} \left( \int_{z_{k,{\rm lo}}}^{z_{k,{\rm hi}}} \mathrm{d}z_j \int_{t_{k,{\rm lo}}}^{t_{k,{\rm hi}}} \mathrm{d}t_j \; p\big(\mathbf{f}_j \,\big|\, T(z_j, t_j), \Sigma_j\big) \right)^{\zeta_{j,k}} \, . \quad {\rm (A2)} $$
It is important at this point to note that a marginalization over the parameter vectors $\boldsymbol{\zeta}_j$ for all galaxies will yield the second, i.e. the likelihood, term in Eq. (15). To derive the iterative optimization scheme we first consider the decomposition of the posterior as
$$ \log p(\boldsymbol{\rho} \,|\, \tilde{\mathbf{F}}) = \mathcal{L}(q, \boldsymbol{\rho}) + \mathrm{KL}(q \,\|\, p) + \log p(\boldsymbol{\rho}) - \log p(\tilde{\mathbf{F}}) \, , \quad {\rm (A3)} $$
where
$$ \mathcal{L}(q, \boldsymbol{\rho}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \log\!\left( \frac{p(\tilde{\mathbf{F}}, \mathbf{Z} \,|\, \boldsymbol{\rho})}{q(\mathbf{Z})} \right) \quad {\rm (A4)} $$
$$ \mathrm{KL}(q \,\|\, p) = - \sum_{\mathbf{Z}} q(\mathbf{Z}) \log\!\left( \frac{p(\mathbf{Z} \,|\, \tilde{\mathbf{F}}, \boldsymbol{\rho})}{q(\mathbf{Z})} \right) \, . \quad {\rm (A5)} $$
This decomposition implies an iterative scheme to maximize $\log p(\boldsymbol{\rho} \,|\, \tilde{\mathbf{F}})$. Given an initial parameter vector $\boldsymbol{\rho}^{\rm old}$, we first minimize $\mathrm{KL}(q \,\|\, p)$ in the 'E-step', which directly implies $q(\mathbf{Z}) = p(\mathbf{Z} \,|\, \tilde{\mathbf{F}}, \boldsymbol{\rho}^{\rm old})$. In the 'M'-step we fix the distribution $q(\mathbf{Z})$ and maximize $\mathcal{L}(q, \boldsymbol{\rho})$. This maximization directive is then given as
$$ \mathcal{L}(q, \boldsymbol{\rho}) = Q\big(\boldsymbol{\rho}, \boldsymbol{\rho}^{\rm old}\big) = \sum_{\mathbf{Z}} p(\mathbf{Z} \,|\, \tilde{\mathbf{F}}, \boldsymbol{\rho}^{\rm old}) \log p(\tilde{\mathbf{F}}, \mathbf{Z} \,|\, \boldsymbol{\rho}) + {\rm const.} \, , \quad {\rm (A6)} $$
which is the expectation of the data log-likelihood with respect to $\mathbf{Z}$. After a new parameter vector $\boldsymbol{\rho}^{\rm new}$ is obtained, we continue with the 'E'-step holding $\boldsymbol{\rho}^{\rm new}$ fixed. This process is continued until convergence. The following derivation proceeds in analogy to the derivation of the E-M update equations for the Gaussian mixture model (see Bishop 2006); we refer to bins as defined in Eq. (13). E-step:
Given an old parameter vector $\boldsymbol{\rho}^{\rm old}$ we evaluate
$$ q(\mathbf{Z} \,|\, \tilde{\mathbf{F}}, \boldsymbol{\rho}^{\rm old}) \propto \prod_{j=1}^{N_{\rm gal}} \prod_{k=1}^{N_{\rm tot}} \left( \rho_k^{\rm old} \int_{z_{k,{\rm lo}}}^{z_{k,{\rm hi}}} \mathrm{d}z_j \int_{t_{k,{\rm lo}}}^{t_{k,{\rm hi}}} \mathrm{d}t_j \; p\big(\mathbf{f}_j \,\big|\, T(z_j, t_j), \Sigma_j\big) \right)^{\zeta_{j,k}} \, . \quad {\rm (A7, A8)} $$
M-step:
In the maximization step of the algorithm we want to maximize the expected data log-likelihood with respect to the parameter $\mathbf{Z}$. In the following we abbreviate the bin-integrated photometric likelihood of galaxy $j$ as $I_{jk} = \int_{z_{k,{\rm lo}}}^{z_{k,{\rm hi}}} \mathrm{d}z_j \int_{t_{k,{\rm lo}}}^{t_{k,{\rm hi}}} \mathrm{d}t_j \; p(\mathbf{f}_j \,|\, T(z_j, t_j), \Sigma_j)$. Given the updated posterior $q(\mathbf{Z} \,|\, \tilde{\mathbf{F}}, \boldsymbol{\rho}^{\rm old})$, this expectation is given as
$$ Q\big(\boldsymbol{\rho}^{\rm new}, \boldsymbol{\rho}^{\rm old}\big) = \sum_{j=1}^{N_{\rm gal}} \sum_{k=1}^{N_{\rm tot}} E\big[\zeta_{jk}\big] \Big( \log \rho_k + \log I_{jk} \Big) \, , \quad {\rm (A9)} $$
where
$$ E\big[\zeta_{jk}\big] = \frac{\rho_k^{\rm old} \, I_{jk}}{\sum_{l=1}^{N_{\rm tot}} \rho_l^{\rm old} \, I_{jl}} \, . \quad {\rm (A10)} $$
We optimize $Q(\boldsymbol{\rho}^{\rm new}, \boldsymbol{\rho}^{\rm old})$ under the constraint $\sum_k \rho_k = 1$ using a Lagrange multiplier $\lambda$:
$$ \tilde{Q}\big(\boldsymbol{\rho}^{\rm new}, \boldsymbol{\rho}^{\rm old}\big) = Q\big(\boldsymbol{\rho}^{\rm new}, \boldsymbol{\rho}^{\rm old}\big) + \lambda \left( \sum_k \rho_k - 1 \right) \, . \quad {\rm (A11)} $$
Equating $\partial_{\rho_k} \tilde{Q}(\boldsymbol{\rho}^{\rm new}, \boldsymbol{\rho}^{\rm old}) = 0$, performing a summation over $k$, and using the summation constraint on $\boldsymbol{\rho}$, we obtain
$$ \lambda = - N_{\rm gal} \, . \quad {\rm (A12)} $$
This leads to the update equations for the E-M scheme that are iterated until we reach convergence in $\boldsymbol{\rho}$ (in practice, until the log-likelihood changes only by an extremely small amount):
$$ \nu_k^{t} = \rho_k^{t-1} \sum_{i=1}^{N_{\rm gal}} \frac{I_{ik}}{\sum_{l=1}^{N_{\rm tot}} \rho_l^{t-1} \, I_{il}} \quad {\rm (A13)} $$
$$ \rho_k^{t} = \frac{\nu_k^{t}}{\sum_{l} \nu_l^{t}} \, . \quad {\rm (A14)} $$
In Appendix B we derive a Laplace approximation to the posterior based on this optimization scheme. We apply and discuss this scheme in § 4 and § 8.

APPENDIX B: DERIVING THE LAPLACE APPROXIMATION
In the previous appendix we derived an iterative scheme to obtain maximum likelihood estimates of the vector of normed histogram heights $\boldsymbol{\rho}_{\rm ML}$ based on the E-M algorithm. We note that the E-M algorithm is guaranteed to produce a maximum likelihood estimate $\boldsymbol{\rho}_{\rm ML}$ that lies on the simplex. The direct application of the Laplace approximation will effectively estimate Gaussian errors on these values. Applying this approximation around $\boldsymbol{\rho}_{\rm ML}$ will lead to posteriors that reach to negative values, i.e. the posterior draws are not guaranteed to lie on the simplex. To extend the Laplace approximation to random variables that lie on the simplex, we first consider a mapping from simplex space to $\mathbb{R}^{N_{\rm bins}-1}$. This mapping is realized by the additive logistic transformation. For $\mathbf{y} \in \mathbb{R}^{N_{\rm bins}-1}$ we define the function
$$ \sigma(\mathbf{y}) = \left( \frac{e^{y_1}}{1 + \sum_{i=1}^{N_{\rm bins}-1} e^{y_i}}, \; \ldots, \; \frac{e^{y_{N_{\rm bins}-1}}}{1 + \sum_{i=1}^{N_{\rm bins}-1} e^{y_i}}, \; \frac{1}{1 + \sum_{i=1}^{N_{\rm bins}-1} e^{y_i}} \right)^{T} , \quad {\rm (B1)} $$
with its inverse
$$ \mathbf{y}(\boldsymbol{\rho}) = \big[ \log(\rho_1/\rho_{N_{\rm bins}}), \; \ldots, \; \log(\rho_{N_{\rm bins}-1}/\rho_{N_{\rm bins}}) \big] . \quad {\rm (B2)} $$
We see that the transformed variables $\mathbf{y}$ are now defined in real space, and we can perform the Laplace approximation as usual. Assuming a flat prior in logistic space, we can directly utilize the invariance of the maximum likelihood estimate under variable transformations (see e.g. Pawitan 2001) and approximate the posterior as
$$ p(\mathbf{y} \,|\, \tilde{\mathbf{F}}) = \mathcal{N}\big(\mathbf{y} \,\big|\, \boldsymbol{\mu}^{\mathbf{y}}_{\rm ML}, \, \boldsymbol{\Sigma}_{\mathbf{y}}\big) , \quad {\rm (B3)} $$
where
$$ \boldsymbol{\mu}^{\mathbf{y}}_{\rm ML} = \mathbf{y}(\boldsymbol{\rho}_{\rm ML}) \quad {\rm (B4)} $$
and
$$ \boldsymbol{\Sigma}_{\mathbf{y}} = - \mathbf{H}^{-1} \Big|_{\mathbf{y} = \mathbf{y}_{\rm ML}} . \quad {\rm (B5)} $$
Here $\mathbf{H}$ is the Hessian of the log-likelihood (the second term in Eq. 15) as a function of $\mathbf{y}$, evaluated at $\mathbf{y}(\boldsymbol{\rho}_{\rm ML})$. Writing $\boldsymbol{\rho} = \sigma(\mathbf{y})$, the components of the Hessian are given as
$$ H_{ab} = - \sum_{i=1}^{N_{\rm gal}} \frac{\Big( \sum_{k=1}^{N_{\rm tot}} \frac{\partial \rho_k}{\partial y_a} I_{ik} \Big) \Big( \sum_{k=1}^{N_{\rm tot}} \frac{\partial \rho_k}{\partial y_b} I_{ik} \Big)}{\Big( \sum_{k=1}^{N_{\rm tot}} \rho_k I_{ik} \Big)^{2}} + \sum_{i=1}^{N_{\rm gal}} \frac{\sum_{k=1}^{N_{\rm tot}} \frac{\partial^{2} \rho_k}{\partial y_a \partial y_b} I_{ik}}{\sum_{k=1}^{N_{\rm tot}} \rho_k I_{ik}} , \quad {\rm (B6, B7)} $$
where
$$ I_{ik} = \int_{z_{k,{\rm lo}}}^{z_{k,{\rm hi}}} \mathrm{d}z_i \int_{t_{k,{\rm lo}}}^{t_{k,{\rm hi}}} \mathrm{d}t_i \; p\big(\tilde{\mathbf{f}}_i \,\big|\, T(z_i, t_i), \Sigma_i\big) . \quad {\rm (B8)} $$
The first and second order derivatives evaluate to
$$ \frac{\partial \rho_k}{\partial y_a} = \begin{cases} \rho_a (1 - \rho_a) , & k = a \;\wedge\; k < N_{\rm bins} \\ - \rho_k \rho_a , & k \neq a \;\wedge\; k < N_{\rm bins} \\ - \rho_a \big/ \big( 1 + \sum_{z=1}^{N_{\rm bins}-1} e^{y_z} \big) , & k = N_{\rm bins} \end{cases} \quad {\rm (B9)} $$
and
$$ \frac{\partial^{2} \rho_k}{\partial y_b \, \partial y_a} = \begin{cases} \frac{\partial \rho_a}{\partial y_b} - 2 \rho_a \frac{\partial \rho_a}{\partial y_b} , & k = a \;\wedge\; k < N_{\rm bins} \\ - \frac{\partial \rho_k}{\partial y_b} \rho_a - \rho_k \frac{\partial \rho_a}{\partial y_b} , & k \neq a \;\wedge\; k < N_{\rm bins} \\ - \Big( \frac{\partial \rho_a}{\partial y_b} - \rho_a \rho_b \Big) \big/ \big( 1 + \sum_{z=1}^{N_{\rm bins}-1} e^{y_z} \big) , & k = N_{\rm bins} . \end{cases} \quad {\rm (B10)} $$
Transformed into probability, or simplex, space, this posterior is then identified as a logit-normal distribution,
$$ p(\boldsymbol{\rho} \,|\, \tilde{\mathbf{F}}) \propto \frac{1}{\sqrt{|2\pi \boldsymbol{\Sigma}_{\mathbf{y}}|}} \, \frac{1}{\prod_{i=1}^{N_{\rm bins}} \rho_i} \exp\left( - \frac{1}{2} \left( \log\Big( \frac{\boldsymbol{\rho}_{-N_{\rm bins}}}{\rho_{N_{\rm bins}}} \Big) - \boldsymbol{\mu}^{\mathbf{y}}_{\rm ML} \right)^{T} \boldsymbol{\Sigma}^{-1}_{\mathbf{y}} \left( \log\Big( \frac{\boldsymbol{\rho}_{-N_{\rm bins}}}{\rho_{N_{\rm bins}}} \Big) - \boldsymbol{\mu}^{\mathbf{y}}_{\rm ML} \right) \right) , \quad {\rm (B11, B12)} $$
where $\boldsymbol{\rho}_{-N_{\rm bins}}$ denotes $\boldsymbol{\rho}$ without its last component and the logarithm acts element-wise. We note that the logit-normal is a probability distribution on the simplex, just as the Dirichlet. In fact, the Dirichlet can be approximated well by a logit-normal (Atchison & Shen 1980); however, the logit-normal allows for a more complex covariance structure. The scheme developed in this appendix is applied and analysed in § 4 and § 8.

This paper has been typeset from a TeX/LaTeX file prepared by the author.
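As a closing illustration, the two appendix schemes combine into a short numerical pipeline: E-M maximization of the histogram heights (Eqs. A13, A14) followed by sampling the logit-space Laplace approximation (Eqs. B1-B3). This is a minimal numpy sketch under the assumption that the bin-integrated likelihood matrix `I` and the logit-space covariance `cov_y` have already been computed; the function names are ours:

```python
import numpy as np

def em_histogram_heights(I, n_iter=500, tol=1e-10):
    """E-M updates of Eqs. (A13)-(A14) for the normed histogram heights.
    I[i, k] is the photometric likelihood of galaxy i integrated over
    redshift bin k, precomputed once outside the loop."""
    n_gal, n_bins = I.shape
    rho = np.full(n_bins, 1.0 / n_bins)       # uniform start on the simplex
    for _ in range(n_iter):
        resp = rho * I                        # E-step responsibilities, Eq. (A10)
        resp /= resp.sum(axis=1, keepdims=True)
        rho_new = resp.sum(axis=0) / n_gal    # M-step, Eqs. (A13)-(A14)
        if np.max(np.abs(rho_new - rho)) < tol:
            return rho_new
        rho = rho_new
    return rho

def to_logit(rho):
    """Additive logistic transform y(rho), Eq. (B2)."""
    return np.log(rho[:-1] / rho[-1])

def to_simplex(y):
    """Inverse transform sigma(y), Eq. (B1): back onto the simplex."""
    e = np.exp(y)
    return np.concatenate((e, [1.0])) / (1.0 + e.sum())

def sample_laplace_posterior(rho_ml, cov_y, n_samples, seed=0):
    """Draw posterior samples: Gaussian in logit space (Eq. B3),
    mapped back to the simplex, i.e. a logit-normal (Eq. B11)."""
    rng = np.random.default_rng(seed)
    ys = rng.multivariate_normal(to_logit(rho_ml), cov_y, size=n_samples)
    return np.array([to_simplex(y) for y in ys])
```

Every sample returned by `sample_laplace_posterior` lies on the simplex by construction, in contrast to a Laplace approximation applied directly to the histogram heights.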