PHAT: PHoto-z Accuracy Testing

Abstract

Here we introduce PHAT, the PHoto-z Accuracy Testing programme, an international initiative to test and compare different methods of photo-z estimation. Two different test environments are set up, one (PHAT0) based on simulations to test the basic functionality of the different photo-z codes, and another one (PHAT1) based on data from the GOODS survey. The accuracy of the different methods is expressed and ranked by the global photo-z bias, scatter, and outlier rates. Most methods agree well on PHAT0 but produce photo-z scatters that can differ by up to a factor of two even in this idealised case. A larger spread in accuracy is found for PHAT1. Few methods benefit from the addition of mid-IR photometry. Remaining biases and systematic effects can be explained by shortcomings in the different template sets and the use of priors on the one hand and an insufficient training set on the other hand. Scatters of 4-8% in Delta_z/(1+z) were obtained, consistent with other studies. However, somewhat larger outlier rates (>7.5% with Delta_z/(1+z)>0.15; >4.5% after cleaning) are found for all codes. There is a general trend that empirical codes produce smaller biases than template-based codes. The systematic, quantitative comparison of different photo-z codes presented here is a snapshot of the current state-of-the-art of photo-z estimation and sets a standard for the assessment of photo-z accuracy in the future. The rather large outlier rates reported here for PHAT1 on real data should be investigated further since they are most probably also present (and possibly hidden) in many other studies. The test data sets are publicly available and can be used to compare new methods to established ones and help in guiding future photo-z method development. (abridged)

Astronomy & Astrophysics manuscript no. 14885
© ESO 2018
June 4, 2018

PHAT: PHoto-z Accuracy Testing

H. Hildebrandt, S. Arnouts, P. Capak, L. A. Moustakas, C. Wolf, F. B. Abdalla, R. J. Assef, M. Banerji, N. Benítez, G. B. Brammer, T. Budavári, S. Carliles, D. Coe, T. Dahlen, R. Feldmann, D. Gerdes, B. Gillis, O. Ilbert, R. Kotulla, O. Lahav, I. H. Li, J.-M. Miralles, N. Purger, S. Schmidt, and J. Singal

Leiden Observatory, Leiden University, Niels Bohrweg 2, 2333CA Leiden, The Netherlands; e-mail: [email protected]
Canada-France-Hawaii Telescope Corporation, Kamuela, HI 96743, USA
Spitzer Science Center, 314-6, California Institute of Technology, 1201 E. California Blvd, Pasadena, CA, 91125, USA
Jet Propulsion Laboratory, California Institute of Technology, MS 169-327, Pasadena, CA 91109, USA
Department of Physics, University of Oxford, DWB, Keble Road, Oxford, OX1 3RH, UK
Department of Physics and Astronomy, University College London, Gower Street, London WC1E 6BT, UK
Department of Astronomy, The Ohio State University, 4055 McPherson Lab, 140 W. 18th Avenue, Columbus, OH 43210, USA
Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge, CB3 0HA, UK
Instituto de Astrofísica de Andalucía (CSIC), Apdo. 3044, 18008 Granada, Spain
Department of Astronomy, Yale University, New Haven, CT 06520-8101, USA
Department of Physics and Astronomy, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA
Department of Computer Science, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA
Space Telescope Science Institute, 3700 San Martin Drive, Baltimore, MD 21218, USA
Department of Physics, Institute of Astronomy, ETH Zürich, Wolfgang-Pauli-Strasse 16, 8093 Zürich, Switzerland
Department of Physics, University of Michigan, Ann Arbor, Michigan 48109, USA
Department of Physics and Astronomy, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, N2L 3G1, Canada
Laboratoire d'Astrophysique de Marseille, CNRS-Université d'Aix-Marseille, 38 rue Frédéric Joliot-Curie, 13388 Marseille Cedex 13, France
Centre for Astrophysics Research, University of Hertfordshire, College Lane, Hatfield AL10 9AB, UK
Department of Astronomy, University of Wisconsin-Madison, 475 N Charter St, Madison, WI 53706, USA
Centre for Astrophysics & Supercomputing, Swinburne University of Technology, PO Box 218, Hawthorn, VIC 3122, Australia
Institut d'Estudis Andorrans, Avda Rocafort 21-23, AD600 Sant Julià de Lòria, Andorra
Department of Physics of Complex Systems, Eötvös Loránd University, Pf. 32, H-1518 Budapest, Hungary
Physics Department, University of California, 1 Shields Avenue, Davis, CA 95616, USA
Kavli Institute for Particle Astrophysics and Cosmology, SLAC National Accelerator Laboratory, Menlo Park, CA 94025, USA

Received ; accepted

ABSTRACT

Context.

Photometric redshifts (photo-z's) have become an essential tool in extragalactic astronomy. Many current and upcoming observing programmes require great accuracy of photo-z's to reach their scientific goals.

Aims.

Here we introduce PHAT, the PHoto-z Accuracy Testing programme, an international initiative to test and compare different methods of photo-z estimation.

Methods.

Two different test environments are set up, one (PHAT0) based on simulations to test the basic functionality of the different photo-z codes, and another one (PHAT1) based on data from the GOODS survey including 18-band photometry and ∼ [...]

Results.

The accuracy of the different methods is expressed and ranked by the global photo-z bias, scatter, and outlier rates. While most methods agree very well on PHAT0 there are differences in the handling of the Lyman-α forest for higher redshifts. Furthermore, different methods produce photo-z scatters that can differ by up to a factor of two even in this idealised case. A larger spread in accuracy is found for PHAT1. Few methods benefit from the addition of mid-IR photometry. The accuracy of the other methods is unaffected or suffers when IRAC data are included. Remaining biases and systematic effects can be explained by shortcomings in the different template sets (especially in the mid-IR) and the use of priors on the one hand and an insufficient training set on the other hand. Some strategies to overcome these problems are identified by comparing the methods in detail. Scatters of 4-8% in Δz/(1+z) were obtained, consistent with other studies. However, somewhat larger outlier rates (>7.5% with Δz/(1+z) > 0.15; >4.5% after cleaning) are found for all codes that can only partly be explained by AGN or issues in the photometry or the spec-z catalogue. Some outliers were probably missed in comparisons of photo-z's to other, less complete spectroscopic surveys in the past. There is a general trend that empirical codes produce smaller biases than template-based codes.

Conclusions.

The systematic, quantitative comparison of different photo-z codes presented here is a snapshot of the current state-of-the-art of photo-z estimation and sets a standard for the assessment of photo-z accuracy in the future. The rather large outlier rates reported here for PHAT1 on real data should be investigated further since they are most probably also present (and possibly hidden) in many other studies. The test data sets are publicly available and can be used to compare new, upcoming methods to established ones and help in guiding future photo-z method development.

1. Introduction

The estimation of redshifts from photometry alone is an old idea (Baum 1962; Puschell et al. 1982; Koo 1985; Loh & Spillar 1986; Connolly et al. 1995). It has come a long way from being a rarely used technique for special kinds of objects to a major tool now widely used for a multitude of observational programmes.

Not only can this photometric redshift (photo-z) approach yield redshifts of fainter objects than accessible by spectroscopy, but the efficiency in terms of the number of objects with redshift estimates per unit telescope time is also greatly increased. These two properties make photo-z's extremely attractive for observing programmes depending on redshifts for a large number of faint galaxies if these redshifts do not have to be as precise as spectroscopic redshifts (spec-z's).

Still, the requirements on the accuracy of photo-z's for upcoming surveys are formidable. Photo-z's are essential in constraining dark energy (DE) by weak gravitational lensing and can be used for other DE probes such as galaxy clustering, supernovae of type Ia, and the mass function of galaxy clusters as well (Albrecht et al. 2006; Peacock et al. 2006). Surveys of galaxy formation and evolution also depend on photo-z's to study these processes as a function of environment and to probe to fainter levels than with spectroscopy alone.
To fully exploit the power of these huge, future data sets, photo-z's with a very low level of residual systematics are needed (e.g. Huterer et al. 2006).

There are many aspects which influence the performance of photo-z's. The choice of an observing strategy sets the theoretical limit for the accuracy. Choosing the filters and distributing the available observing time over the different filters to reach certain depths can have a great impact on photo-z's. Accurate photometric calibration is of great importance, as is the removal of the effects of the different point-spread functions (PSFs) in the different bands. Varying column densities of galactic dust over the survey area have to be accounted for before a photo-z code can be expected to perform at its best.

Here we would like to ignore all these effects as much as possible and concentrate on the last link in the chain, the photo-z methods themselves. It is clear that the two regimes (data and method) cannot be separated cleanly because there are connections between the two. For example, it is highly likely that one method of photo-z estimation will perform better than a second method on one particular data set while the situation may well be reversed on a different data set. Whenever such a situation arises in the following we will try to alert the reader to that.

The methodology behind photo-z's is developing fast with ever more complex methods yielding results of increasing accuracy. In this context it is important to set a standard to compare the different methods to each other in order to make quantitative statements about their differences and to take a snapshot of today's state-of-the-art. Such comparisons and rankings can then be used to identify the most promising approaches and to concentrate on their further improvement.

In this paper we present an international initiative named PHAT (PHoto-z Accuracy Testing) which was initiated to carry out such a quantitative comparison.
A very similar initiative has been carried out for shape measurement algorithms in the Shear TEsting Program (STEP; Heymans et al. 2006; Massey et al. 2007) and led to important improvements in the methodology of measuring galaxy shapes for weak gravitational lensing applications. Similar but much more limited blind tests of photo-z's have been performed by Hogg et al. (1998) on spectroscopic data from the Keck telescope on the Hubble Deep Field (HDF), by Hildebrandt et al. (2008) on spectroscopic data from the VIMOS VLT Deep Survey (VVDS; Le Fèvre et al. 2004) and the FORS Deep Field (FDF; Noll et al. 2004), and by Abdalla et al. (2008b) on the sample of Luminous Red Galaxies from the SDSS-DR6.

Send offprint requests to: H. Hildebrandt
Footnote to the title: Based on observations obtained with the Hubble Space Telescope, the Spitzer Space Telescope, the Keck Observatory, the Kitt Peak National Observatory, the Subaru Telescope, the Palomar Observatory, and the University of Hawaii 2.2-meter telescope.

In the framework of PHAT we provide standardised test environments to the photo-z community which consist of simulated or observed photometric catalogues together with additional material like filter curves, SED templates, and training sets. These data sets can be used in a blind (or semi-blind, i.e. with support of a training set) test by the participants to estimate redshifts with their favourite codes. Two such test steps have been carried out so far. The first one, called PHAT0, is based on a highly idealised simulation representing an easy case to test the most basic elements of photo-z estimation and to identify possible low-level discrepancies between the methods. The second test, called PHAT1, is based on real data originating from the Great Observatories Origins Deep Survey (GOODS, Giavalisco et al. 2004) representing a much more complex environment pushing photo-z codes to their limits and revealing more systematic difficulties.

PHAT was conceived as an open competition. The test data sets are publicly available via the PHAT website and all major photo-z groups in the astronomical community were informed of the initiative via email. Furthermore, PHAT was advertised at several meetings and workshops to increase its visibility. The photo-z codes presented here were not selected by the PHAT coordinators but reflect the interest of the community in such a competition. This strategy led to an impressive response, with 21 participants submitting results obtained with 17 different photo-z codes. After a large number of results had been collected for each test data set, the results of all codes were published on the PHAT website. However, the test data sets are still kept blind (i.e. the individual redshifts are withheld) to allow further participants to meet the same conditions.

First we briefly summarise every photo-z method that was used within PHAT (Sect. 2). Then in Sect. 3 & 4 the motivation behind the tests, the data sets, and the results are described in detail for PHAT0 and PHAT1, respectively. In Sect. 5 we conclude and give an outlook to future activities within PHAT. We use AB magnitudes throughout.
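The accuracy statistics named in the abstract (bias, scatter, and outlier rate in Δz/(1+z)) can be illustrated with a short Python sketch. The function name `photoz_stats`, the sign convention Δz = z_phot − z_spec, and the choice to compute bias and scatter only on the non-outliers are assumptions made here for illustration. Only the 0.15 outlier threshold is taken from the abstract.

```python
import numpy as np

def photoz_stats(z_phot, z_spec, outlier_cut=0.15):
    """Summarise photo-z accuracy for one code on one catalogue.

    Assumed conventions (not taken from the paper):
    dz = (z_phot - z_spec) / (1 + z_spec); bias and scatter are the
    mean and standard deviation of dz after removing outliers.
    """
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    dz = (z_phot - z_spec) / (1.0 + z_spec)

    # Objects beyond the cut are counted as outliers.
    is_outlier = np.abs(dz) > outlier_cut
    good = dz[~is_outlier]

    return {
        "bias": float(good.mean()),
        "scatter": float(good.std()),
        "outlier_rate": float(is_outlier.mean()),
    }

if __name__ == "__main__":
    # Tiny made-up example: the last object is a catastrophic outlier.
    print(photoz_stats([0.5, 1.1, 0.9, 3.0], [0.5, 1.0, 1.0, 2.0]))
```

Separating the outlier rate from the scatter keeps a handful of catastrophic failures from dominating the scatter estimate, which is one common reason to report the three numbers together.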

2. Methods

In the following we describe the different methods that were used to estimate photo-z's from the catalogues presented in Sect. 3 & 4. A summary of the methods can also be found in Table 1 together with the three-letter acronyms that are used in the remainder of the paper to identify the codes. The third small letter indicates whether the code belongs to the empirical codes (-e), which are trained on the colours of a sub-sample of objects with accurate redshift estimates (e.g. spec-z's), or to the codes fitting SED templates to the observed photometry (-t). It should be noted that this distinction is somewhat fuzzy. A number of codes include ingredients from both regimes. We just keep this terminology because it has been widely used in the literature. For a more rigorous description of the underlying concepts in photo-z methods and their common properties see Budavári (2009).

Footnote: Most empirical codes offer the flexibility of also using any other photometric observable, e.g. size, concentration, or surface brightness. Since we only use magnitudes in PHAT we skip this detail in the remainder of Sect. 2.

Note that the descriptions of the different template sets of the template SED fitting codes in the following subsections only apply to PHAT1. For PHAT0 the template set was provided and it was used by every participant with a template-based code.

BPZ (BP-t)

BPZ (Bayesian Photo-z's; Benítez 2000) introduced the use of Bayesian inference and priors to photometric redshift estimation. The code uses a prior P(z, T | m), which gives the probability that a galaxy of apparent magnitude m has redshift z and SED type T. As an example of how the prior works, bright objects and ellipticals are assumed unlikely to be at high redshift. For each galaxy, this information is combined (in a Bayesian manner) with the likelihood P(C | z, T) of observing the galaxy colours C for each redshift and SED pair, yielding the final P(z, T | C, m). By marginalising over T, P(z) is obtained along with the most likely redshift z_b and its uncertainties. For the PHAT tests, BPZ version 1.99.3 is used, a slightly updated version of that used in the Coe et al. (2006) UDF analysis.

– Templates: The Coe et al. (2006) SED templates are used with BPZ, which include a CWW+SB SED template set (similar to that used in PHAT0 with Kinney et al.'s SB1 replaced by SB3) as introduced in Benítez (2000) and re-calibrated by Benítez et al. (2004) plus two younger starburst templates from Bruzual & Charlot (2003) added in Coe et al. (2006). Note that the empirical CWW+SB templates as well as the synthetic BC03 templates include emission lines. No dust extinction was added to the BC03 templates. Between each adjacent pair of the eight templates two interpolated templates are added, for a total of 22 templates. Beyond 25600 Å, the majority of the templates are undefined and must be extrapolated. Thus it cannot be expected that these templates provide good fits to IRAC photometry of low redshift objects.
– Prior: For PHAT0, a flat prior is used. The prior was calculated by Benítez (2000) based on objects with spec-z in the CFRS (Lilly et al. 1995) and HDF-N (Williams et al. 1996). It was shown to yield results superior to the "flat" prior implicitly assumed by maximum likelihood (or "frequentist") methods.
– Training: No training with the model-z's / spec-z's was performed.

BPZ (BP2-t)
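As background for this second BPZ run, the Bayesian combination that both BPZ configurations share (likelihood times magnitude-dependent prior, marginalised over SED type) can be sketched in a few lines of Python. This is a minimal illustration on a discrete grid, not BPZ's actual code: the function name `posterior_redshift` and the array layout (rows for redshift, columns for SED type) are assumptions.

```python
import numpy as np

def posterior_redshift(likelihood, prior):
    """Return P(z | C, m) on a redshift grid for a single galaxy.

    likelihood[i, j] ~ P(C | z_i, T_j)   (colours given redshift and type)
    prior[i, j]      ~ P(z_i, T_j | m)   (prior given apparent magnitude)
    Both arrays are hypothetical inputs with shape (n_z, n_T).
    """
    joint = np.asarray(likelihood, dtype=float) * np.asarray(prior, dtype=float)

    # Marginalise over SED type T to obtain an unnormalised P(z).
    p_z = joint.sum(axis=1)

    total = p_z.sum()
    if total <= 0.0:
        raise ValueError("posterior vanishes everywhere on the grid")
    return p_z / total

if __name__ == "__main__":
    z_grid = np.array([0.5, 1.5])
    like = [[0.2, 0.1], [0.4, 0.3]]
    prior = [[1.0, 1.0], [0.5, 0.0]]   # made-up prior disfavouring high z
    p = posterior_redshift(like, prior)
    print("P(z):", p, "most likely z:", z_grid[np.argmax(p)])
```

With a flat prior (all entries equal) the result reduces to the marginalised likelihood, which is the maximum-likelihood limit mentioned in the Prior item for BP-t.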

BPZ is run on PHAT1 a second time with a different template set and additional training.

– Templates: The second library (Benítez 2010, in preparation) uses as starting point a set of 6 templates from PEGASE (Fioc & Rocca-Volmerange 1997) selected to be similar to the Coe et al. (2006) templates. This library is further calibrated using the FIREWORKS photometry and spectroscopic redshifts from Wuyts et al. (2008). Note that these templates include emission lines and dust extinction.
– Prior: Same as BP-t.
– Training: The templates are compared to the photometry of the spec-z training set and new zero points are estimated, as in Coe et al. (2006). We also measure the amount of excess scatter in the predicted vs measured colours compared with that expected from the catalogue photometric errors and typical template uncertainties (Brammer et al. 2008). This excess scatter is included in the photo-z estimation as a zero point uncertainty.

EAZY (EA-t)

EAZY (Brammer et al. 2008) is a template-fitting code designed to produce unbiased photometric redshift estimates for deep multi-wavelength surveys that lack representative calibration samples with spectroscopic redshifts.

– Templates: EAZY uses a unique template set derived using the non-negative matrix factorisation algorithm (Sha et al. 2007; Blanton & Roweis 2007) trained on synthetic photometry from the semi-analytic light-cone produced by De Lucia & Blaizot (2007). These templates can be considered the principal component spectra of all galaxies at 0 < z < [...] differences between local and high-redshift galaxy samples. EAZY is able to reproduce complex star-formation histories by fitting non-negative linear combinations of the templates. The templates include emission lines following the prescription of Ilbert et al. (2009).
– Template error function: Template mismatch is addressed with a "template error function", which assigns lower weights at rest-frame wavelengths where the template calibration is uncertain or where the templates are not expected to fully reproduce observed galaxy colours. This feature is particularly important when using mid-IR (IRAC) photometry, which samples wavelengths where the observed emission can be dominated by non-stellar (i.e. dust) sources not included in the templates.
– Prior: EAZY adopts a prior equal to the normalised redshift distribution of galaxies in the De Lucia & Blaizot (2007) semi-analytic light-cone at a given apparent R or K magnitude. This is akin to a luminosity prior under the assumption that the light-cone reasonably reproduces the galaxy luminosity function.
– Training: No training with the model-z's / spec-z's was performed.

GALEV and GAZELLE (GA-t)

GAZELLE (Kotulla & Fritze 2009, Kotulla, in preparation) is based on a χ² minimisation algorithm to compare the observed SEDs to a large library of GALEV evolutionary synthesis models (Kotulla et al. 2009). GAZELLE also accounts for inherent uncertainties in the model grid, e.g. due to uncertainties in the stellar evolution data and stellar spectral libraries, by assuming a 0.[...]

– Templates: GALEV includes a full suite of emission lines (Anders & Fritze 2003), a detailed treatment of the attenuation due to intergalactic HI (Madau 1995) and optionally a chemical evolution model. This combination makes it possible to estimate not only photometric redshifts but, at the same time, physical parameters (stellar masses, star formation rates, etc.) for each galaxy in a consistent manner. Masses and mass-dependent parameters are computed by scaling model values with the scaling factor derived from matching the overall normalisation of the template fluxes relative to the observed fluxes. For the PHAT1 run the model grid included 5 undisturbed models for E and Sa-Sd type galaxies supplemented with a set of 21 models encountering a strong starburst at galaxy ages of 0.[...] = 8; for the undisturbed models a chemically consistent evolution (see Kotulla & Fritze 2009 for details) is chosen, for the burst models a metallicity fixed to half the solar value is used. All templates include the full evolution from the onset of star formation until the present day and the Calzetti et al. (2000) dust extinction description is chosen. Emission lines are included as well.
– Filter weighting: To avoid complications at wavelengths beyond the rest-frame K-band where dust emission becomes increasingly important, only filters that cover the rest-frame K-band or shorter wavelengths are included, effectively ignoring some of the Spitzer filters at low redshift.
– Prior: No prior is included that might affect the resulting redshift distribution.
– Training: No training with the model-z's / spec-z's was performed.

GOODZ (GO-t)

The GOODZ code (Dahlen et al. 2010, in preparation) is a developed version of the code used by Dahlen et al. (2005, 2007) to calculate photometric redshifts in the GOODS-S. The code is based on the template fitting method and allows the inclusion of Bayesian priors based on the expected shape of the galaxy luminosity function. Similar to this investigation, GOODZ uses the four empirical templates from Coleman et al. (1980) and two templates from Kinney et al. (1996; their templates SB2 and SB3). The code also uses available spectroscopic redshifts to correct for offsets between fluxes extracted in different filters or instruments. Such offsets may be significant when combining data from different instruments with varying PSF or pixel scales and, if uncorrected, may lead to increased scatter or biases in the photometric redshifts. The spectroscopic redshifts are also used to adjust the input set of template SEDs using a method similar to Ilbert et al. (2006).

– Templates: GOODZ is only run on PHAT0 so that no individual template set is associated with this code.
– Prior: No prior was used.
– Training: No training with the model-z's was performed.

Hyperz (HY-t)

Hyperz is a publicly available code based on SED template fitting using a standard χ² minimisation method. The code uses the observed fluxes of an object in a set of given filters and compares them with the theoretical fluxes of galaxies in the same filters obtained from template spectra, either synthetic or empirical, taking into account the observational uncertainties but also possible hidden observational effects such as reddening or IGM opacity. It computes not only a best-fit solution which minimises the differences, and therefore a most probable photometric redshift, but also a full probability function as a function of redshift. The code and the method have been tested and described extensively in Bolzonella et al. (2000) and further practical description can be found in its user's manual. Hyperz comes with a given set of templates, filters, reddening laws and Lyman forest modelling but can be easily adapted to use any kind of parameters that would fit the needs of the user. Because of its simplicity, Hyperz has been extensively used and tested since its launch, and has even been used beyond the pure computation of photometric redshifts.

– Templates: Hyperz comes with two standard template sets, one based on the synthetic stellar population library of Bruzual & Charlot (1993) and the other one consisting of the four empirical templates from Coleman et al. (1980). For the PHAT1 test, the latter empirical library was chosen and it was supplemented with two starburst templates from Kinney et al. (1996) (templates from both libraries include emission lines). This set of six basic templates was further enlarged by applying different amounts of extinction to the templates according to the Calzetti et al. (2000) dust extinction law.
– Prior: No prior was used.
– Training: No training with the model-z's / spec-z's was performed.

Kernelz (KR-t)

This method is a hybrid incorporating aspects of both template-based and empirical codes, though it is most similar in design to BPZ and other Bayesian methods. As in standard template-based codes, model colours are computed for a set of galaxy SEDs at a set of fixed redshifts. However, this grid of colours is then treated as if its points were individual galaxies. For each test galaxy the points are weighted by a factor that is akin to a Bayesian prior, accounting for the expected probability of seeing such a galaxy given the apparent magnitude and type of the test point. Redshifts are then estimated using kernel regression, constructing a weighted average redshift, with weights proportional to their proximity to the template points in colour space. The kernel bandwidth is chosen by cross-validation using the training set of galaxies with known redshifts. Results presented here represent code that is still in development. Details of the kernel regression method for both empirical and hybrid techniques will be described in Schmidt & Brewer (in prep). A promising extension that improves the method by allowing for data adaptive kernels will be described in Udaltsova & Schmidt (in prep). A public release of the code is also in the works.

– Templates: Because Kernelz was still in development when the results were submitted, simple templates from Coleman et al. (1980) and Kinney et al. (1996) (both of which include emission lines) with some extrapolation to IRAC wavelengths were used.
– Prior: An empirical prior trained on data from VVDS was used. In practice, this is very similar to the prior described in Ilbert et al. (2006).
– Training: The spectroscopic data were used only to choose the kernel bandwidth; no tweaking of templates or zero points was performed.
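The kernel regression step described for Kernelz can be sketched as a prior-weighted, kernel-weighted average over the template grid. The sketch below assumes a Gaussian kernel and Euclidean distance in colour space, neither of which is specified in the text, and the function and argument names are hypothetical rather than part of the Kernelz code.

```python
import numpy as np

def kernel_redshift(colours, grid_colours, grid_z, grid_weights, bandwidth):
    """Weighted-average redshift for one test galaxy.

    colours      : (n_colours,) observed colours of the test galaxy
    grid_colours : (n_grid, n_colours) model colours of the template grid
    grid_z       : (n_grid,) redshift of each grid point
    grid_weights : (n_grid,) prior-like weight of each grid point
    bandwidth    : kernel width in colour space (e.g. set by cross-validation)
    """
    colours = np.asarray(colours, dtype=float)
    grid_colours = np.asarray(grid_colours, dtype=float)
    grid_z = np.asarray(grid_z, dtype=float)
    grid_weights = np.asarray(grid_weights, dtype=float)

    # Squared Euclidean distance from the test galaxy to every grid point.
    d2 = ((grid_colours - colours) ** 2).sum(axis=1)

    # Gaussian kernel (an assumption), multiplied by the prior-like weights.
    w = grid_weights * np.exp(-0.5 * d2 / bandwidth**2)

    return float((w * grid_z).sum() / w.sum())

if __name__ == "__main__":
    grid = [[0.0, 0.0], [1.0, 1.0]]
    print(kernel_redshift([0.5, 0.5], grid, [0.5, 1.5], [1.0, 1.0], 0.3))
```

A small bandwidth makes the estimate follow the nearest grid point, while a large one averages over much of the grid. If every kernel weight underflows to zero the ratio is undefined, so a practical version would need a guard for that case.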

Le Phare (LP-t)

The public code Le Phare (Arnouts et al. 2002; Ilbert et al. 2006) is primarily dedicated to estimating photo-z's, but it can also be used to estimate physical parameters like stellar masses and infrared luminosities. Le Phare is based on a standard template fitting procedure. The templates are redshifted and integrated through the instrumental transmission curves. The opacity of the IGM is taken into account and internal extinction can be added as a free parameter to each galaxy. The photo-z's are obtained by comparing the modelled fluxes and the observed fluxes with a χ² merit function. A probability distribution function is associated with each photo-z.

For the PHAT1 sample, we adopted a configuration similar to the one used in the COSMOS field (Ilbert et al. 2009):

– Templates: The set of templates was generated by Polletta et al. (2007) with the code GRASIL (Silva et al. 1998). The 9 galaxy templates of Polletta et al. (2007) include 3 SEDs of elliptical galaxies and 6 templates of spiral galaxies (S0, Sa, Sb, Sc, Sd, Sdm). Those were complemented with 12 additional blue templates generated with Bruzual & Charlot (2003). Four different dust extinction laws were applied (Prevot et al. 1984; Calzetti et al. 2000, and an additional bump at 2175 Å), depending on the considered template. Emission lines were added to the templates using relations between the UV continuum, the star formation rate and the emission line fluxes (Kennicutt 1998).
– Prior: No prior on the redshift distribution was applied. However, no redshift solution which would produce a galaxy brighter than M(B) = −24 was allowed. Such a prior would create catastrophic failures for some QSOs, but this run was not explicitly intended to estimate photo-z's for QSOs (no AGN templates were included), although the PHAT1 catalogue contains some (see below).
– Training: An automatic calibration of the zero-points was performed using the spec-z sample. The calibration is obtained by comparing the observed and modelled fluxes (Ilbert et al. 2006). The calibration is done iteratively until convergence in the zero-point values is reached. This step helps in removing bias.
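The χ² comparison of modelled and observed fluxes that Le Phare shares with the other template-fitting codes above can be sketched as follows. The sketch assumes a free overall normalisation per model, solved analytically, and takes the grid of model fluxes as given. Redshifting the templates and integrating them through the filter curves is not shown, and the name `chi2_template_fit` is hypothetical, not part of any of the codes.

```python
import numpy as np

def chi2_template_fit(flux, flux_err, model_flux):
    """Fit every (template, redshift) model to one galaxy's photometry.

    flux, flux_err : (n_bands,) observed fluxes and their 1-sigma errors
    model_flux     : (n_models, n_bands) predicted fluxes, one row per
                     (template, redshift) combination
    Returns the best row index, its normalisation, and all chi^2 values.
    """
    flux = np.asarray(flux, dtype=float)
    model_flux = np.asarray(model_flux, dtype=float)
    w = 1.0 / np.asarray(flux_err, dtype=float) ** 2

    # Best-fitting normalisation of each model (minimises chi^2 analytically).
    scale = (model_flux * flux * w).sum(axis=1) / (model_flux**2 * w).sum(axis=1)

    # chi^2 of each scaled model against the observed fluxes.
    resid = flux - scale[:, None] * model_flux
    chi2 = (resid**2 * w).sum(axis=1)

    best = int(np.argmin(chi2))
    return best, float(scale[best]), chi2

if __name__ == "__main__":
    models = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]   # made-up model grid
    best, scale, chi2 = chi2_template_fit([6.0, 4.0, 2.0], [1.0, 1.0, 1.0], models)
    print(best, scale, chi2)
```

A redshift probability distribution of the kind mentioned above is often obtained from such a grid by taking exp(−χ²/2) at each redshift and normalising. Whether a given code does exactly this is not stated in the text.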

LRT (LR-t)

LRT (Low-Resolution Spectral Templates; Assef et al. 2008, 2010) is a set of subroutines intended for estimating K-corrections and photometric redshifts using a basis of empirical low resolution SED templates (hence LRT) for galaxies and AGNs. In this basis, every galaxy is represented by a non-negative linear combination of three empirically determined SED templates that resemble an elliptical, an Sbc spiral and an Im irregular galaxy. Given the nature of the tests in the PHAT initiative, the AGN SED template was not used. For the PHAT0 testing phase, the LRT subroutines were modified to do a simple χ² minimisation to fit each template to the data separately rather than fitting a non-negative combination of them.

– Templates: The templates were derived from the extensive broad-band and spectroscopic observations of the NOAO Deep Wide-Field Survey (Jannuzi & Dey 1999) Boötes field and range in wavelength between 0.03 and 30 µm. In the PHAT1 testing phase, the LRT subroutines were used with the SED templates derived in Assef et al. (2008) which have a shorter wavelength range (0.1–10 µm) than the newer versions presented in Assef et al. (2010). These newer SED templates also integrate an AGN component with variable extinction.
– Prior: For estimating photometric redshifts, the LRT subroutines also use a simple luminosity function prior, which is by default based on the R-band luminosity function of Lin et al. (1996).
– Training: No training with the model-z's / spec-z's was performed.

Originating from the template-based method described in Csabai et al. (2003), this method uses synthetic colours calculated from the given spectral energy distribution templates. A common approach for template fitting is to take a small number of spectral templates T and choose the best fit by optimising the likelihood of the fit as a function of redshift, type, and luminosity, p(z, T, L). Here a variant of this method is used that incorporates a continuous distribution of spectral templates, enabling the error function in redshift and type to be well defined.

– Templates: This code is only run on PHAT0 so that no individual template set is associated with this code.
– Prior: No prior was used.
– Training: No training with the model-z's was performed.

ZEBRA (ZE-t & ZE2-t)

ZEBRA (Zurich Extragalactic Bayesian Redshift Analyzer; Feldmann et al. 2006) is a freely available, open source photometric redshift code based on a SED template-fitting approach. Built on top of a traditional Maximum Likelihood ansatz, it introduces and combines several novel methods that help to improve the accuracy of photometric redshift estimates for galaxies and AGNs (see e.g. Oesch et al. 2010; Luo et al. 2010, for some recent applications). First, ZEBRA is able to detect and correct photometric offsets in the input catalogue. Second, ZEBRA can use spectroscopic redshifts on a small fraction of the photometric sample to iteratively correct the original set of input templates. This template correction step has been shown to be a crucial ingredient in decreasing the bias, the scatter, and the number of outliers in the redshift estimation (e.g. Feldmann et al. 2006; Mobasher et al. 2007). Third, when run in Bayesian mode ZEBRA computes the prior in redshift-template space in a self-consistent manner from the input catalogues and the redshift-template likelihood functions. This prior is consequently used to derive the posterior probability distribution of each input object.

Here, since ZEBRA participates only in PHAT0, it is run in its basic Maximum Likelihood mode and with the provided templates. The following set of parameters is used. The redshifts are allowed to vary in steps of 0.002 from 0 to 4. The filter bands are mildly smoothed using a top-hat filter with FWHM of 20 Å. Finally, the spectral flux densities weighted with photon energy, not photon counts, are computed using the –flux-type =

– Templates: ZEBRA is only run on PHAT0 so that no individual template set is associated with this code.

– Prior: No prior was used.

– Training: No training with the model-z's was performed.

ANNz (AN-e)

ANNz (Collister & Lahav 2004) is an empirical photo-z code based on artificial neural networks. Such a network is made up of several layers, each consisting of a number of nodes. The first layer receives the galaxy magnitudes as inputs, while the last layer outputs the estimated photometric redshift. The layers in between can consist of any number of nodes each. The nodes are inter-connected, and every connection carries a 'weight', which is a free parameter in the parametrisation. When a network is trained, the weights of all node connections are determined by minimising a cost function E. To avoid over-fitting, every network is tested on a validation set of galaxies whose spectroscopic redshifts are also known. The network with the lowest value of E as calculated on the validation set is selected and the photometric sample is run through it for redshift estimation. An error bar is assigned to each photo-z via a chain rule (see Collister & Lahav 2004, for details). Neural networks have been used e.g. for the estimation of photo-z's for the SDSS (Collister et al. 2007; Oyaizu et al. 2008; Abdalla et al. 2008b), as well as for forecasts of photometric redshifts for future surveys like the Dark Energy Survey (Banerji et al. 2008) and Euclid (Abdalla et al. 2008a).

A neural network architecture of N:2N:2N:1 was used for the PHAT tests, where N is the number of filters for which there are input magnitudes. Different architectures were tested, but this did not lead to any substantial improvement in the results. The choice of architecture is fully justified by tests done in Firth et al. (2003) and Collister & Lahav (2004).

BDT (DT-e)

The Boosted Decision Tree (BDT) algorithm (Gerdes et al. 2010) is a training-set-based method that combines an ensemble of weak classifiers into a single, powerful classifier. The spectroscopic training set is first divided into redshift bins whose width is approximately half the expected photo-z resolution of the algorithm for the given sample. We have found that a finer binning choice does not improve the resolution. For each bin, a set of trees is trained to recognise as "signal" those galaxies whose redshift falls within the bin in question, and as "background" those that fall more than 2σ away from the signal bin, where σ is the iteratively determined photo-z resolution. As training variables we use the observed magnitudes in each band. The process of constructing an individual tree begins with a root node containing all the training galaxies. The root node is then split into two subsamples by placing a cut on the one variable that best separates the sample into signal and background. Each new node is subsequently split in this way until the nodes reach a certain minimum size. The result is a tree containing nodes with predominantly signal and predominantly background galaxies. The process of "boosting" iteratively repeats this process, giving higher weight to galaxies that were initially misclassified. The overall signal probability of a galaxy is then obtained by combining the classification output from approximately 50 trees in each photo-z bin, where higher weight is given to trees with lower misclassification rates in the training set.

The method produces a photo-z probability for each galaxy as a function of redshift. This method therefore yields not only an estimate of the best photo-z and error, but a reconstruction of the full redshift PDF, P(z). In Gerdes et al. (2010) it was shown that the BDT algorithm improves upon the default photo-z's in the SDSS spectroscopic sample, and that the PDFs yield a more accurate reconstruction of the redshift distribution N(z).
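The signal/background labelling for the trees of a single redshift bin can be sketched as follows. This is only an illustration of the rule described above, not the BDT code of Gerdes et al. (2010): the function name `label_for_bin`, the bin edges, and the value of σ are hypothetical, and measuring the 2σ distance from the bin edges (rather than from the bin centre) is an assumption.

```python
def label_for_bin(z_spec, z_lo, z_hi, sigma):
    """Label one training galaxy for the trees of the bin [z_lo, z_hi).

    Returns "signal" if the redshift lies inside the bin, "background" if it
    lies more than 2*sigma outside the bin, and None otherwise (assumption:
    galaxies in between are left out of the training for this bin).
    """
    if z_lo <= z_spec < z_hi:
        return "signal"
    # Distance to the nearest bin edge (assumed reference point).
    distance = z_lo - z_spec if z_spec < z_lo else z_spec - z_hi
    if distance > 2.0 * sigma:
        return "background"
    return None


# Hypothetical example: a bin of width 0.05, i.e. half of sigma = 0.1.
redshifts = [0.10, 0.45, 0.52, 0.60, 0.90]
labels = [label_for_bin(z, 0.50, 0.55, 0.1) for z in redshifts]
```

Galaxies that are neither signal nor background for a given bin simply do not enter the training of that bin's trees in this sketch.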
Empirical χ² (EC-e)

The method of Wolf (2009) derives PDFs from empirical models and is a subclass of kernel regression methods. It mimics a template-based χ² technique with the main difference that an empirical dataset is used in place of the template grid. Each object in the empirical set contributes to the observed object with a quantified probability. The PDF of redshifts thus obtained can be used in its entirety or investigated for ambiguities. Here, it is just reduced to an expectation value and RMS in redshift. Any kernel approach requires choosing a kernel function, which also acts as a smoothing scale on the discrete empirical model grid. Here, we used a Gaussian kernel function with σ_m = 0.1 mag. However, a χ² method is correctly implemented if the kernel function applied to the model makes its density distribution match that of the observed sample (see the matched error scale in Sect. 6 of Wolf 2009, for details). As a consequence, redshift distributions of object samples can potentially be reconstructed accurately to within the Poisson noise of the sample sizes, which would also imply no bias exceeding random noise.

This empirical method compares the observed colours to the reference set. The estimation method first searches the colour space for the k nearest neighbours of every object in the estimation set (i.e. the galaxies for which we want to estimate redshift) and then estimates the redshift by fitting a local low-order polynomial to these points. An improved version of this code uses a k-d tree index for fast nearest-neighbour search (Csabai et al. 2007). It was used to calculate photometric redshifts for the SDSS Data Release 7 (Abazajian et al. 2009). The advantage of this method over a template-based method might be the better estimation accuracy, but it cannot extrapolate, so the completeness of the reference set is crucial. For this reason, we have used the large training set available for the PHAT0 test.

The estimation was done on the large, simulated data set using 150 nearest neighbours. A small number of outliers was automatically excluded from the regression on the neighbour sets.

This empirical photo-z method is based on Li & Yee (2008), which uses a polynomial fit so that the galaxy redshift is expressed as the sum of its magnitudes and colours. Unlike in Li & Yee (2008), where the training set galaxies are divided into several fixed colour-magnitude cells, here the coefficients of the photo-z polynomial are derived individually for each galaxy by choosing a subset of training set galaxies whose magnitudes and colours are closest to the input galaxy. They are chosen based on quadratically summed ranks of colour and magnitude differences between the training set galaxies and the input galaxy. All magnitudes and independent colours are used. Note that each training set galaxy has an equal weight in the fit. This may introduce a redshift bias for input galaxies near the edges of the colour-magnitude distributions. Therefore, a better approach would be to assign weights to the chosen training-set galaxies based on the inverse value of their final rank, but this has not been implemented for PHAT.

The RT-e method by Carliles et al. (2010) is based on Random Forests, which are an empirical, non-parametric regression technique. A Random Forest builds an ensemble average of randomised regression tree redshift estimates. Bootstrap samples are created by sampling from the training set with replacement, and each regression tree is trained on its own bootstrap sample. Given a new test object, each regression tree produces its own redshift estimate, and these estimates are averaged to yield the final Random Forest redshift estimate. This technique also results in Gaussian errors, and this behaviour has a strong theoretical statistical explanation. Intuitively speaking, a given new galaxy can be considered to be drawn from the space of inputs (colours, magnitudes, etc.) by redshifts. This space is the event space, and for that new galaxy one can hypothesise the existence of a distribution over the event space, unique to that galaxy, which reflects the similarity of the new galaxy (minus the unknown redshift) to any given point in the event space. The Random Forest approximates this distribution per object, and the process results in easily computable per-object error parameter estimates.

For the PHAT tests a leaf size of 5 was chosen and 50 trees were used.
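The bootstrap-and-average idea behind a Random Forest can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the RT-e code of Carliles et al. (2010): all function names are hypothetical, the "leaf size of 5" is interpreted as a minimum number of training objects per leaf, the random choice of split variables used by real Random Forests is omitted, and the training data are synthetic.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, min_leaf=5):
    """Grow a regression tree on rows of (features, redshift)."""
    best = None
    if len(rows) >= 2 * min_leaf:
        for j in range(len(rows[0][0])):
            ordered = sorted(rows, key=lambda r: r[0][j])
            for i in range(min_leaf, len(ordered) - min_leaf + 1):
                lo, hi = ordered[i - 1][0][j], ordered[i][0][j]
                if lo == hi:
                    continue  # cannot cut between identical values
                left, right = ordered[:i], ordered[i:]
                cost = sse([z for _, z in left]) + sse([z for _, z in right])
                if best is None or cost < best[0]:
                    best = (cost, j, 0.5 * (lo + hi), left, right)
    if best is None:
        # Leaf node: predict the mean redshift of its training objects.
        return sum(z for _, z in rows) / len(rows)
    _, j, cut, left, right = best
    return (j, cut, grow_tree(left, min_leaf), grow_tree(right, min_leaf))


def predict_tree(tree, x):
    """Follow the cuts down to a leaf and return its redshift."""
    while isinstance(tree, tuple):
        j, cut, left, right = tree
        tree = left if x[j] <= cut else right
    return tree


def grow_forest(rows, n_trees=50, min_leaf=5, seed=0):
    """Train each tree on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [grow_tree([rng.choice(rows) for _ in rows], min_leaf)
            for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the redshift estimates of all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)


# Synthetic toy data (assumption): redshift depends linearly on one colour.
data_rng = random.Random(1)
train = []
for _ in range(80):
    mag, colour = data_rng.uniform(18.0, 24.0), data_rng.uniform(0.0, 2.0)
    train.append(((mag, colour), 0.5 * colour + data_rng.gauss(0.0, 0.02)))

forest = grow_forest(train)            # 50 trees, minimum leaf size 5
z_est = predict_forest(forest, (21.0, 1.0))
```

The spread of the individual tree estimates around `z_est` is the kind of quantity from which a per-object error estimate could be derived.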

The primary motivation for the development of this code was to treat additional available galaxy information beyond photometric data, for example shape parameters, on an equal footing with the photometric data (as was done in e.g. Collister & Lahav 2004; Ball et al. 2004). The package, although still undergoing modification, is a multi-layer perceptron neural network for the IDL environment. The IDL code can be modified relatively easily, and could in principle be optimised for a variety of input data situations. As training convergence is relatively slow in this network, it is most useful in situations where a robust training set is available from the outset.

As implemented here, the network has an input layer of neurons which accepts the magnitudes in each band. The input layer treats all input information on an equal footing, normalising across all objects in the training set so that the inputs for each neuron on the input layer are distributed between 0 and 1. There are two hidden layers of 30 neurons each, and an output layer with a single neuron obtaining a value between 0 and 1 which is a proxy for the estimated redshift, with the linear conversion defined during the training when the known redshifts of the training set are supplied subject to the conversion.
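The input normalisation and the linear redshift conversion described above can be illustrated with a simple min-max scaling. This is a sketch only: the SN-e package is written in IDL, the function names and toy numbers below are hypothetical, and min-max scaling is an assumed way of mapping the inputs and redshifts to the interval between 0 and 1.

```python
def fit_minmax(values):
    """Return (lo, hi) of one training column; hi > lo is assumed."""
    return min(values), max(values)


def to_unit(value, lo, hi):
    """Map a value linearly so that the training range becomes [0, 1]."""
    return (value - lo) / (hi - lo)


def from_unit(u, lo, hi):
    """Invert the linear map, e.g. network output -> redshift."""
    return lo + u * (hi - lo)


# Toy training set (made-up numbers): rows are galaxies, columns are bands.
train_mags = [[22.1, 21.3], [24.0, 23.5], [19.8, 19.9]]
train_z = [0.4, 1.2, 0.1]

# One (lo, hi) pair per band, derived from the training set only.
band_ranges = [fit_minmax(col) for col in zip(*train_mags)]
z_range = fit_minmax(train_z)

# Network inputs and training targets, all between 0 and 1.
inputs = [[to_unit(m, lo, hi) for m, (lo, hi) in zip(row, band_ranges)]
          for row in train_mags]
targets = [to_unit(z, *z_range) for z in train_z]

# The same linear conversion turns an output value back into a redshift.
z_back = [from_unit(t, *z_range) for t in targets]
```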

3. PHAT0 - a highly idealised simulation

The lowest algorithmic level of the codes can be tested if the photometry is bias-free and everything except for the redshifts is provided. In this way the choice of template sets, the use of priors, etc. do not play a role and code-specific problems can be disentangled from other effects. To this end, simulations with synthetic photometry are set up with the LP-t photo-z code (see Sect. 2.8).

In order to keep things simple, PHAT0 is based on a very limited template set and a long wavelength baseline. A noise-free catalogue with accurate synthetic colours is provided as well as a catalogue with a low level of additional noise. Furthermore, we added a very large training set to ensure that empirical photo-z algorithms also find an ideal environment. The ingredients are detailed in the following.

Everything but the redshifts for the test data set was revealed to the participants. In particular, the template set (Sect. 3.2.1) and the filter curves (Sect. 3.2.2) were provided, and details about the construction of the catalogues (Sect. 3.2.3 & 3.2.4; e.g. the IGM recipe used) were revealed. The participants were explicitly asked to use those ingredients if applicable to make their setup as comparable to the simulation setup as possible.

Fig. 1. Template set used for the PHAT0 test (arbitrary flux normalisation).

The empirical template set by Coleman et al. (1980) has been used extensively in different photo-z studies. As in the case of LP-t (Ilbert et al. 2006) and BP-t (Benítez 2000), we decided to supplement this template set with two templates for starburst galaxies from Kinney et al. (1996). The template SEDs are displayed in Fig. 1.

It should be noted that the choice of the template set is not critical in this test because the template set is provided to the participants using template-based codes and the very large training set (see below) densely covers the whole SED-redshift space. This particular set is chosen here because it is one of the most widely used sets for photo-z's in its original, extended, and modified (re-calibrated) form. Participants using template-based codes were explicitly asked to use this particular template set for the PHAT0 test and to switch off any priors within their codes.

For the PHAT0 test we want to avoid systematic effects that can arise in photo-z's because of an insufficient coverage in wavelength. For example, colour-redshift degeneracies (see e.g. Benítez 2000) can occur between high and low redshift if infrared (IR) and/or ultraviolet (UV) bands are not available.

Thus, the filter set used here spans the whole range from near-UV to mid-IR (see Fig. 2). We choose the ugriz-bands from MEGACAM mounted at the CFHT (Boulade et al. 2003), the YJHK-bands of UKIDSS (Lawrence et al. 2007), and the two bluer bands of the IRAC camera mounted on the Spitzer Space Telescope (Fazio et al. 2004). Again this choice is not too critical since the filter curves are provided, one of the tests does not include any noise at all, and the other one includes just a low level of noise in the photometry.


Table 1. Methods used for photo-z estimation within PHAT

| Acronym | Participant | Code | Reference | Public |
|---|---|---|---|---|
| BP-t | Coe, D. | BPZ, Bayesian Photometric Redshifts | Benítez (2000); Coe et al. (2006) | √ a |
| BP2-t | Benitez, N. | BPZ, Bayesian Photometric Redshifts | Benítez (2000); Benítez 2010 in prep. | √ a |
| EA-t | Brammer, G. | EAZY, Easy and Accurate Redshifts from Yale | Brammer et al. (2008) | √ b |
| GA-t | Kotulla, R. | GALEV, GALaxy EVolution | Kotulla et al. (2009) | √ c |
| GO-t | Dahlen, T. | GOODZ | Dahlen et al. (2005, 2007) | |
| HY-t | Miralles, J.-M. | Hyperz | Bolzonella et al. (2000) | √ d |
| KR-t | Schmidt, S. | Kernelz, Kernel Regression | Schmidt & Brewer (in prep) | |
| LP-t | Arnouts, S.; Ilbert, O. | Le Phare | Ilbert et al. (2006) | √ e |
| LR-t | Assef, R. | LRT, Low-Resolution Spectral Templates | Assef et al. (2008, 2010) | √ f |
| PT-t | Purger, N. | Template Repair | Adelman-McCarthy et al. (2007) | √ g |
| ZE-t | Feldmann, R. | ZEBRA, Zurich Extragalactic Bayesian Redshift Analyzer | Feldmann et al. (2006) | √ h |
| ZE2-t | Gillis, B. | ZEBRA, Zurich Extragalactic Bayesian Redshift Analyzer | Feldmann et al. (2006) | √ h |
| AN-e | Abdalla, F.; Banerji, M. | ANNz, Artificial Neural Network | Collister & Lahav (2004) | √ i |
| DT-e | Gerdes, D. | BDT, Boosted Decision Trees | Gerdes et al. (2010) | |
| EC-e | Wolf, C. | Empirical χ² | Wolf (2009) | |
| PN-e | Purger, N. | Nearest-Neighbour Fit | Abazajian et al. (2009) | √ g |
| PO-e | Li, I. H. | Polynomial Fit | Li & Yee (2008) | |
| RT-e | Carliles, S. | Regression Trees | Carliles et al. (2010) | √ j |
| SN-e | Singal, J. | Neural Network | - | √ k |

a http://acs.pha.jhu.edu/~txitxo/ ; version 1.99.3 used for PHAT:
b c
d http://webast.ast.obs-mip.fr/hyperz/
e f
g http://skyserver.elte.hu/PhotoZ/
h i j k

One of the simplest tests one can think of is to compare the redshift estimates of different codes for data with infinite signal-to-noise (S/N) and thus perfect colours. In this way the agreement of the basic interpolation and convolution algorithms in template-based codes can be tested. Any differences found in such a basic test will probably propagate to more realistic setups.

We use the LP-t code as a reference to create such a catalogue evenly distributed over the six templates and over the redshift range 0 < z < [...] effect of absorption by the intergalactic medium (IGM) following the recipe by Madau (1995). The model redshifts were revealed to the participants for this test.

It should be noted that inaccurate redshift estimates from one of the codes only mean that this particular code does not agree perfectly with LP-t. Which of the two codes is inaccurate (or whether even both are inaccurate) cannot be decided with such a test.

To study the influence of noise on the results, a more realistic catalogue is set up as well. We adopt a parametric form for the signal-to-noise as a function of magnitude which behaves as a power law at bright magnitudes and an exponential at faint magnitudes. The transition regime is defined by the parameters ($m_\star$, $\mathrm{err}_\star$). At magnitude $m \le m_\star$, we adopt

$$\mathrm{err}(m) = \mathrm{err}_\star \times 10^{\,0.4\,(\alpha_{\rm bright}+1)\,(m-m_\star)},$$

and at magnitude $m \ge m_\star$, we use

$$\mathrm{err}(m) = \frac{\mathrm{err}_\star}{2.72}\,\exp\!\left(10^{\,\alpha_{\rm faint}\,(m-m_\star)}\right),$$

where $\alpha_{\rm bright}$ and $\alpha_{\rm faint}$ are the slopes at bright and faint magnitudes, respectively. The adopted values for each filter are reported in Table 2, while the behaviour of the signal-to-noise (S/N = 1.086/err) for the different passbands is shown in Fig. 3 (colour coded from the u band, in cyan, to 4.5 µm, in red). The noisy magnitudes are randomly drawn assuming a Gaussian distribution in flux with mean and standard deviation (flux, err(flux)).

To generate the simulated catalogue, the galaxies are distributed according to r-band luminosity functions for the different spectral types. However, for simplicity in the comparison of the different codes, we do not apply any dust attenuation for the star-forming galaxies and we do not let the luminosity functions evolve with redshift. Thus, this simulated catalogue is not expected to provide a realistic distribution of low and high redshift galaxies. Note that we do include the averaged Lyman absorption by the intergalactic medium as a function of redshift, following Madau (1995), which will affect the blue bands at high redshift. The catalogue has been cut to objects brighter than r = [...] S/N sources are included. The redshift distribution attains a smooth shape with a peak at intermediate redshifts and few objects beyond z = [...] ∼11 000 objects for which the redshifts are not revealed to the participants. Furthermore, a much larger training set of ∼170 000 objects with exactly the same properties as the original catalogue is provided.

Fig. 2. Transmission curves of the filter set used for the PHAT0 test

Fig. 3. Signal-to-noise model used for the PHAT0 test
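A minimal sketch of such a noise model is given below. It is not the LP-t implementation. The numerical constants (the factor 0.4 and the "+1" in the bright branch, the division by e ≈ 2.72 in the faint branch, and the 1.086 in the S/N relation) are assumptions of this sketch, chosen so that both branches meet at (m⋆, err⋆); the function names and the flux zero point are hypothetical. The parameter values are those listed for the r band in Table 2.

```python
import math
import random


def mag_error(m, m_star, err_star, alpha_bright, alpha_faint):
    """Magnitude error: power law for m <= m_star, exponential beyond.

    Assumed functional form; both branches give err_star at m = m_star.
    """
    if m <= m_star:
        return err_star * 10.0 ** (0.4 * (alpha_bright + 1.0) * (m - m_star))
    return err_star / math.e * math.exp(10.0 ** (alpha_faint * (m - m_star)))


def signal_to_noise(err):
    """S/N for a magnitude error; 1.086 = 2.5 / ln(10), rounded."""
    return 1.086 / err


def noisy_magnitude(m, err, rng):
    """Draw a noisy magnitude from a Gaussian distribution in flux."""
    flux = 10.0 ** (-0.4 * m)        # arbitrary zero point (assumption)
    flux_err = flux * err / 1.086    # magnitude error -> flux error
    noisy_flux = rng.gauss(flux, flux_err)
    if noisy_flux <= 0.0:
        return None                  # no magnitude for non-positive flux
    return -2.5 * math.log10(noisy_flux)


# r-band parameters from Table 2: m_star, err_star, alpha_bright, alpha_faint.
R_BAND = (26.0, 0.2, -0.25, 0.22)

rng = random.Random(42)
err_22 = mag_error(22.0, *R_BAND)
m_noisy = noisy_magnitude(22.0, err_22, rng)
```

In this sketch an object at m = m⋆ has err = 0.2 mag, i.e. S/N ≈ 5.4, and the error grows rapidly for fainter objects.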

In the following we will present the results of three different template-based codes on the noise-free catalogue that were submitted after the release. The training of empirical codes on noise-free data often does not make sense. That is probably the reason [...]

Table 2. Filters used for the PHAT0 test

| Filter | Instrument | m⋆ | err⋆ | α_bright | α_faint |
|---|---|---|---|---|---|
| u | MEGACAM@CFHT | 27.0 | 0.2 | −0.25 | 0.22 |
| g | MEGACAM@CFHT | 26.0 | 0.2 | −0.25 | 0.22 |
| r | MEGACAM@CFHT | 26.0 | 0.2 | −0.25 | 0.22 |
| i | MEGACAM@CFHT | 26.0 | 0.2 | −0.25 | 0.22 |
| z | MEGACAM@CFHT | 26.0 | 0.2 | −0.25 | 0.22 |
| Y | WFCAM@UKIRT | 26.0 | 0.2 | −0.25 | 0.22 |
| J | WFCAM@UKIRT | 26.0 | 0.2 | −0.25 | 0.22 |
| H | WFCAM@UKIRT | 26.0 | 0.2 | −0.25 | 0.22 |
| K | WFCAM@UKIRT | 26.0 | 0.2 | −0.25 | 0.22 |
| 3.6 µm | IRAC@Spitzer | 25.0 | 0.2 | −0.25 | 0.22 |
| 4.5 µm | IRAC@Spitzer | 25.0 | 0.2 | −0.25 | 0.22 |

Fig. 5. Opacity curves used by LP-t (solid) and HY-t and BP-t (dashed) for a redshift of z = [...]

[...] z_model against the redshift estimate z_phot and the redshift difference ∆z = z_model − z_phot.

The ZE2-t code shows nearly perfect agreement with LP-t in this test in terms of redshift estimates. This strongly suggests that the basic interpolation of the filter and template curves and their subsequent convolution by the two codes leads to colour estimates that agree very well. The modelled attenuation of the IGM also seems to be identical in both codes.

Up to a redshift of z ∼ [...]/BP-t is close to perfect as well. For higher redshifts there are considerable discrepancies between LP-t on the one hand and HY-t and BP-t on the other hand.

A further analysis shows that especially the blue templates with considerable UV flux get assigned grossly wrong redshift estimates. At a redshift of z ∼ [...] Lyman-α line enters our filter set. These two facts suggest that the handling of the IGM, i.e. the opacity of the Lyman-α forest, is implemented differently in the codes. Although all codes refer to the paper of Madau (1995), it turns out that HY-t and BP-t use an analytic approximation of the opacity curve. As described in that paper, the opacity curve can be approximated by a step function with depression factors D_A and D_B shortward of Lyman-α and Lyman-β, respectively, and a complete absorption shortward of the Lyman limit. LP-t uses the full opacity curve instead (binned for redshift intervals of ∆z = [...]

Fig. 4. Results of the PHAT0 test for the noise-free catalogue

The scatter around the mean opacity curve for a given redshift is rather large (see Fig. 3 of Madau 1995) due to clustering of the IGM. Thus, for practical applications we do not expect either method to be superior to the other as long as a direct relation between opacity and redshift is assumed. To account for the greatly varying optical depth of the IGM for different lines of sight at a fixed redshift in a realistic application, one certainly would have to vary opacity as another free parameter. The discrepancies reported here just appear in this artificial test without noise and with a fixed opacity-redshift relation. However, different residuals between model and observation might well be present in applications of photo-z codes with a fixed opacity-redshift relation to real data.

We select the best fit or most likely photo-z estimate from each method. Some methods provide estimates of confidence in their photo-z's in the form of redshift uncertainties or probability distributions P(z) and/or template quality-of-fit measurements like χ². These can help identify and prune those photo-z estimates most likely to be outliers. However, these confidence measures are not performed consistently or universally among the various methods, so we do not consider them here.

The error distribution of photo-z's is usually non-Gaussian with extended tails and some catastrophic outliers with grossly wrong redshift estimates. To summarise this distribution by a few numbers is not always possible. Here we express the photo-z accuracy in terms of the mean and the RMS scatter of the quantity ∆z = z_model − z_phot (after rejection of outliers), and an outlier rate, as was done in many former studies. These statistics for the different codes can be found in Table 3. Figure 8 shows the scatter and outlier values in comparison. We define all objects with a redshift estimate that differs by more than 0.1 from the model redshift, i.e. |∆z| = |z_model − z_phot| > 0.1, as outliers. We refer the reader to the diagrams in Figs. 6 & 7 showing the complete error distribution.

Table 3. Results for the PHAT0 catalogue with noise

| Acronym | bias | scatter | outlier rate a |
|---|---|---|---|
| LP-t | 0.000 | 0.010 | 0.[...] |
| [...] | −0.005 | 0.011 | 0.[...] |
| [...] | −0.001 | 0.012 | 0.[...] |
| [...] | 0.000 | 0.014 | 0.[...] |
| [...] | 0.000 | 0.012 | 0.[...] |
| [...] | −0.002 | 0.013 | 0.[...] |
| [...] | 0.000 | 0.011 | 0.[...] |
| [...] | −0.005 | 0.011 | 0.[...] |
| [...] | 0.000 | 0.011 | 0.[...] |
| [...] | −0.005 | 0.011 | 0.[...] |
| [...] | 0.000 | 0.011 | 0.[...] |
| [...] | −0.004 | 0.019 | 0.[...] |
| [...] | 0.000 | 0.017 | 0.[...] |
| [...] | 0.001 | 0.019 | 1.[...] |
| [...] | 0.000 | 0.013 | 0.[...] |
| [...] | −0.005 | 0.049 | 18.[...] |

a Outliers are defined as objects with |∆z| = |z_model − z_phot| > 0.1.
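The three summary statistics can be computed as in the following sketch. The function name and the toy numbers are hypothetical, and taking the RMS scatter about the mean of the retained objects (rather than about zero) is an assumption, since the text does not specify it.

```python
import math


def photoz_statistics(z_model, z_phot, outlier_cut=0.1):
    """Return (bias, scatter, outlier_rate) for dz = z_model - z_phot.

    Objects with |dz| > outlier_cut count as outliers and are rejected
    before the bias (mean) and the RMS scatter are computed.
    """
    dz = [zm - zp for zm, zp in zip(z_model, z_phot)]
    kept = [d for d in dz if abs(d) <= outlier_cut]
    if not kept:
        raise ValueError("all objects are outliers")
    outlier_rate = 1.0 - len(kept) / len(dz)
    bias = sum(kept) / len(kept)
    # RMS about the mean of the retained objects (assumed convention).
    scatter = math.sqrt(sum((d - bias) ** 2 for d in kept) / len(kept))
    return bias, scatter, outlier_rate


# Toy example: three good estimates and one catastrophic outlier.
bias, scatter, outlier_rate = photoz_statistics(
    [1.0, 1.0, 1.0, 1.0], [0.99, 1.01, 1.00, 1.50]
)
```

Because outliers are removed before the scatter is computed, a looser or tighter outlier cut shifts objects between the two statistics, so the scatter and the outlier rate should always be read together.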

Fig. 6. Results of the PHAT0 test for the catalogue with noise, z_phot vs. z_model. Note that LP-t (top-left panel) was used to create the simulations and should be regarded as a reference.

In order to set a standard to which the performance of all other codes can be compared, we run LP-t on the catalogue with noise that was created by the code itself. It is reasonable to regard the accuracy reached by LP-t on this catalogue as a theoretical limit set by the amount of noise put in (see Sect. 3.2.4). The results are displayed in the first panels of Figs. 6 & 7 alongside the results from the other codes.

The numbers in Table 3 and the observed error distributions displayed in Figs. 6 & 7 suggest that most codes tested here perform similarly to LP-t. Note that there is some degeneracy between the scatter values and the outlier rates. No significant bias is produced by any of the codes. All bias values are smaller than 0.5%. Looking at the scatter values and outlier rates, four different groups can be identified:

1. A large number of codes (AN-e, BP-t, GO-t, EA-t, LR-t, RT-e, PT-t, ZE-t, ZE2-t) perform very similarly to LP-t with scatter values only slightly larger and outlier rates that are very similar or even smaller. This can be regarded as essentially identical performance because the low numbers of outliers are strongly affected by shot noise. Note that the outlier rates of these codes correspond to 0−[...] ∼11 000 objects!
2. Some other codes (GA-t, HY-t, PN-e) show larger values in both statistics than LP-t, but the differences are still minor and not very significant.
3. The codes DT-e and PO-e yield scatter values that are larger by a factor of two and outlier rates that are much larger than the LP-t statistics, with DT-e yielding a smaller outlier rate than PO-e.
4. SN-e performs worse but is still in the development phase.

In the following we discuss the problems occurring in the last two groups.

Fig. 7. Results of the PHAT0 test for the catalogue with noise, ∆z = z_model − z_phot vs. z_model. Note that LP-t (top-left panel) was used to create the simulations and should be regarded as a reference.

The panels for DT-e in Figs. 6 & 7 clearly show that the code performs very similarly to the codes from groups 1 & 2 for redshifts z_model ≲ [...]. For larger redshifts the training set becomes more and more sparse. The division into branches of the decision tree hence becomes less precise. For the highest redshift interval only one branch is established, so that objects from a rather large range in z_model are all assigned the same z_phot. This particular feature of the DT-e code leads to the slightly worse statistics reported in Table 3.

The empirical code PO-e (see Sect. 2.16) is based on a second-order polynomial fit of the colour-redshift relation. This leads to a very limited number of degrees of freedom (66 in the PHAT0 case with 11 bands) compared to the number of objects in the training set. Not all the information included in the training set can be reflected by the 66 coefficients, so that this empirical code performs worse in this test than other empirical codes (e.g. AN-e) that feature many more degrees of freedom. Note that PO-e was trained on a much smaller training set with ∼[...]

Fig. 8. Scatter and outlier values for the catalogue with noise of PHAT0. The inset shows the region in the lower left as a blow-up, but due to shot noise the performance of most of the codes in the inset should be regarded as identical.

The SN-e code was developed for a low redshift (z < [...] and/or noisy data that is photometric only, as was the case with the PHAT datasets. However, it was useful to examine its unoptimised performance with the PHAT data, as an indication of the extent to which optimisation of the network characteristics to a given input data scheme matters.

4. PHAT1 - a test on GOODS data

The estimation of photo-z's is special in the sense that the desired answer can in principle be obtained through spectroscopic observations. Thus, we have an accurate benchmark to which we can compare photo-z's and we do not have to rely fully on simulations. This is a very different situation from other estimation problems in astronomy, e.g. the estimation of shapes of galaxies for weak gravitational lensing, where accurate knowledge of the intrinsic shape is inaccessible for comparison.

Given the high complexity of the photo-z approach and the multiple factors that influence the results, it is reasonable to test the photo-z codes on real photometric data of objects that have also been observed spectroscopically for precise redshift measurements. In this way the tendency of simulations to idealise certain aspects of real data can be avoided.

As a note of caution it should, however, be mentioned that comparisons of photo-z's to spec-z's might well draw a somewhat idealised picture of photo-z performance. The currently available spectroscopic catalogues are only highly complete at bright magnitudes. For fainter magnitudes the fraction of high-quality spectroscopic redshift measurements decreases. As Hildebrandt et al. (2008) showed, the objects missing in the spec-z catalogues are likely the ones for which photo-z estimation is also harder and photo-z accuracy is worse. We chose the GOODS-N field also for the reason that it is one of the regions of the sky with the most complete spectroscopy down to faint limits.

The imaging data for this test are part of the Great Observatories Origins Deep Survey northern field (GOODS-N, Giavalisco et al. 2004). The original four-band, optical ACS data are complemented with images at other wavelengths from a variety of instruments. See Table 4 for a summary. In total, there are data in 18 bands covering the near-UV to the mid-IR.

The photometry used in the PHAT1 test is drawn from Capak et al. (2004), which includes U, B_J, V_J, R_C, I_C, z′ and HK′ photometry. Deep J and H band photometry taken with ULBCAM on the UH2.2m (Wang et al. 2006) and K_s band photometry taken with WIRC on Palomar (Bundy et al. 2005) were added by first PSF matching and then measuring photometry in 3″ diameter apertures using the method described in Capak et al. (2004). The GOODS-ACS photometry in F435W (B), F606W (V + R), F775W (i′), and F850LP (z′) along with the IRAC data (Moustakas et al., private communication) were added by positionally matching the catalogues provided by the GOODS team with the Capak et al. (2004) catalogues using a 1″ matching radius. Following recommended practice, the SExtractor MAG_AUTO magnitudes were used for the ACS data, while the aperture-corrected 3.[...]″ diameter aperture magnitudes were used for IRAC.

For this stage of testing we wanted to use publicly available data that could be obtained with minimal effort by an average researcher. The results of this test illustrate the critical role that photometric methods play in obtaining good photo-z's. We strongly recommend care in obtaining photometry across images with variable and very different PSFs. Images should be aligned, the PSFs matched, and fluxes measured in consistent apertures, and care should be taken to ensure noise estimates are correct (Capak et al. 2004, 2007; Wolf et al. 2004; Fernández-Soto et al. 2001). As illustrated by our test on one of the best studied fields in the sky, correctly measured pan-chromatic photometry is not generally available. Users will likely have to, and probably should, measure their own photometry to ensure the best results. This is made simpler by automated tasks such as ColorPro (Coe et al. 2006), which measures PSF-matched aperture photometry for a combination of space- and ground-based data, while more complicated routines such as TFIT (Laidler et al. 2007) fit high-resolution galaxy images using the local PSF for each image.

Bulk photometric offsets were removed by minimising the offset between the predicted and measured photometric points as a function of rest frame magnitude as described in Capak et al. (2007). The resulting photometry has mean systematic offsets between photometric bands smaller than 0.01 mag. However, close inspection of the photometric catalogue shows that there is a fraction of objects which show a rather large discrepancy between the ACS and the SUPRIMECAM photometry in the optical. Those objects are essentially evenly distributed in redshift. A fraction of 15% (10%) of the objects shows a difference of > [...] (> [...]) [...], as displayed in Fig. 9. Some of these objects might be variable, while others might be affected by different blending in the space- and ground-based bands. We do not filter these objects because they are also included in photometric catalogues that are routinely used for many science projects. We want to provide estimates of photo-z accuracy that are as close to reality as possible, and such mismatches of photometry from different instruments (or also different bands of the same instrument) are not exceptions but rather the norm. Such issues reflect the complex problem of obtaining a good photometric catalogue from multi-band imaging data taken with different cameras and/or taken under different observing conditions. But we will comment upon the impact of these objects on global photo-z performance in the following sections and mention some strategies to prune them.

Note that this procedure is only mildly dependent on the Capak et al. (2007) template set used for the re-calibration because the redshift range of the training sample is broad. For a given template SED the same rest frame wavelength corresponds to many different observer's frame wavelengths so that systematic features in a template get distributed evenly over many filters. Only BP-t, HY-t, and KR-t use template sets that are somewhat similar to the Capak et al. (2007) template set.

Fig. 9. Difference between the average ACS (mean of F606W, F775W, and F850LP) and average SUPRIMECAM (mean of RIz) magnitudes as a function of redshift in the PHAT1 catalogue.

The photometric catalogue is matched to different spectroscopic catalogues from Cowie et al. (2004), Wirth et al. (2004), Treu et al. (2005), and Reddy et al. (2006). This yields a total of 1984 objects with 18-band photometry and spectroscopic redshifts. We randomly select a quarter of those objects as a training set, i.e. for the release of the catalogue the spectroscopic redshifts of one quarter of the objects are revealed. The magni-

which includes spec-z's from Cohen et al. (1996, 2000); Cohen (2001); Phillips et al. (1997); Lowenthal et al. (1997, 1998); Dickinson (1998); Liu et al. (1999); Barger et al. (2000, 2001, 2003); Steidel et al. (1996, 2003)

which includes spec-z's from Blain et al. (2004)

It should be noted that this is a fairly small training set for such a large redshift range. It cannot be expected that empirical codes perform

Table 4.

Filters used for the PHAT1 test.

Filter Instrument m lim . ;AB U MOSAIC@KPNO-4m 27.1 a B SUPRIMECAM@Subaru 26.9 a V SUPRIMECAM@Subaru 26.8 a R SUPRIMECAM@Subaru 26.6 a I SUPRIMECAM@Subaru 25.6 a Z SUPRIMECAM@Subaru 25.4 a F435W ACS@HST 27.8 b F606W ACS@HST 27.8 b F775W ACS@HST 27.1 b F850LP ACS@HST 26.6 b J [email protected] 24.1 c H [email protected] 23.1 c HK [email protected] 22.1 a K WIRC@Hale-5m 22.5 d . µ m IRAC@Spitzer 25.8 e . µ m IRAC@Spitzer 25.8 e . µ m IRAC@Spitzer 23.0 e . µ m IRAC@Spitzer 23.0 ea σ in a circular aperture with a diameter of 3 (cid:48)(cid:48) b σ in a circular aperture with a diameter of 0 . (cid:48)(cid:48) c σ for a point-source d σ for a Gaussian proﬁle with FWHM = . (cid:48)(cid:48) e σ for a point-source tude and redshift distributions are shown in Fig. 10. Note that thecatalogue is highly complete down to R ∼

24. The PHAT1 cat-alogue does not only contain normal galaxies. There is a smallnumber of AGN in the sample which we explicitly decided toinclude.The participants are asked to run their codes twice on theprovided catalogue, once including the IRAC bands and oncewithout the IRAC bands. This is done because many templatesets are inaccurate in the mid-IR and we do not want this e ﬀ ectto dominate the comparisons. Unlike in PHAT0 the participantsusing template-based codes were asked to choose the best pos-sible template set for their code in PHAT1. Thus, template setsdi ﬀ er between the di ﬀ erent “-t” methods here. We use a similar set of statistics as for the PHAT0 test to charac-terise the performance of the photo- z ’s on the PHAT1 data withtwo di ﬀ erences: – We report the bias and scatter of ∆ z (cid:48) = z spec − z phot + z spec . – Outliers are deﬁned as objects with | ∆ z (cid:48) | > . and thescatter and outlier values are plotted in Fig. 11 for the full sam-ple and for an R <

24 magnitude-limited sample. The full er-ror distributions are displayed in Figs. 12 & 13 for the 14-bandcase (i.e. without the IRAC bands). The results for the empiricalcodes only include the non-training objects whereas the resultsfor the template-based codes include all objects. We checkedthe performance of the template-based codes on the training andnon-training sample and found no signiﬁcant di ﬀ erences. as well on such a data set as template-based codes. This should not beregarded as a deﬁciency in the codes but rather a deﬁciency in the data. In Table 6 results are presented for a relaxed deﬁnition of outliersbeing objects with | ∆ z (cid:48) | > . z Accuracy Testing 15
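The statistics defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `photoz_stats` is hypothetical, and the choice to compute bias and scatter as the mean and population standard deviation of Δz′ over the non-outliers is an assumption, since the exact estimators are defined in the PHAT0 part of the paper and not in this section.

```python
from statistics import mean, pstdev


def photoz_stats(z_spec, z_phot, outlier_cut=0.15):
    """Return (bias, scatter, outlier_fraction) for paired redshift lists.

    Assumption (not taken from the text): bias and scatter are the mean and
    the population standard deviation of dz' over the non-outliers only.
    """
    if len(z_spec) != len(z_phot) or not z_spec:
        raise ValueError("need two non-empty lists of equal length")
    # Normalised residual dz' = (z_spec - z_phot) / (1 + z_spec)
    dz = [(zs - zp) / (1.0 + zs) for zs, zp in zip(z_spec, z_phot)]
    # Objects with |dz'| above the cut count as outliers
    core = [d for d in dz if abs(d) <= outlier_cut]
    outlier_fraction = 1.0 - len(core) / len(dz)
    if not core:
        raise ValueError("all objects are outliers; bias and scatter undefined")
    return mean(core), pstdev(core), outlier_fraction


if __name__ == "__main__":
    # Toy input: the last object is a catastrophic outlier
    bias, scatter, f_out = photoz_stats([0.5, 1.0, 2.0, 1.0], [0.5, 1.1, 2.0, 2.5])
    print(f"bias={bias:+.4f} scatter={scatter:.4f} outliers={100 * f_out:.1f}%")
```

Passing `outlier_cut=0.5` would correspond to the relaxed outlier definition used for Table 6.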

Fig. 10. R-band magnitude (left) and redshift (right) distributions of the PHAT1 catalogue (solid) and the training sub-sample (dotted).

Table 5. Results for the PHAT1 catalogue with and without the IRAC bands, and for all objects and a magnitude-limited sample with R < 24.

Code    18-band                      14-band                      18-band, R < 24          14-band, R < 24
        bias    scatter outl.(a)     bias    scatter outl.(a)     bias    scatter outl.(a) bias    scatter outl.(a)
BP-t    -0.046  0.060   30.9 (27.7)  0.011   0.048   11.4 (7.1)   -0.053  0.055   31.3     0.012   0.044   6.7
BP2-t   0.003   0.041   10.4 (7.5)   0.004   0.041   10.2 (7.8)   0.003   0.035   6.4      0.005   0.035   5.9
EA-t    0.020   0.042   11.6 (5.9)   0.022   0.042   13.5 (7.1)   0.021   0.037   7.0      0.023   0.037   8.8
GA-t    -0.009  0.061   23.1 (18.1)  0.016   0.059   19.3 (15.5)  -0.012  0.059   18.3     0.018   0.057   14.6
HY-t    -0.001  0.058   18.5 (15.2)  0.018   0.055   14.7 (10.1)  -0.002  0.055   15.7     0.019   0.054   10.9
KR-t    -0.008  0.053   19.7 (13.3)  -0.006  0.053   16.7 (9.8)   -0.010  0.049   15.4     -0.008  0.050   9.2
LP-t    0.004   0.040   7.7 (4.9)    0.009   0.038   9.2 (4.7)    0.005   0.036   3.9      0.009   0.034   4.5
LR-t    0.024   0.061   14.8 (12.9)  0.038   0.055   18.8 (15.9)  0.021   0.058   9.2      0.039   0.051   14.4
AN-e    -0.010  0.074   31.0 (29.0)  -0.006  0.078   38.5 (36.5)  -0.013  0.071   24.4     -0.007  0.076   32.8
EC-e    -0.001  0.067   18.4 (15.3)  0.002   0.066   16.7 (13.3)  -0.006  0.064   14.5     -0.003  0.064   13.5
PO-e    -0.009  0.052   18.0 (14.5)  -0.007  0.051   13.7 (9.4)   -0.009  0.047   10.7     -0.008  0.046   7.1
RT-e    -0.009  0.066   21.4 (19.0)  -0.008  0.067   24.2 (21.6)  -0.012  0.063   16.4     -0.012  0.064   18.4

(a) Percentage of objects with |Δz′| = |z_spec − z_phot| / (1 + z_spec) > 0.15. The numbers for the cleaned sample excluding objects with discrepant ACS/SUPRIMECAM photometry are given in brackets.
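The cleaned sample in brackets excludes objects whose ACS and SUPRIMECAM photometry disagree. A minimal sketch of such a cut follows, assuming the comparison of Fig. 9 (mean of the F606W, F775W, and F850LP magnitudes against the mean of the R, I, and z magnitudes). The dictionary keys, the function names, and the threshold value are hypothetical; in particular the threshold is a free parameter here and is not the value used by the authors.

```python
def mean_mag(mags):
    """Arithmetic mean of a list of magnitudes."""
    return sum(mags) / len(mags)


def is_discrepant(acs_mags, suprime_mags, threshold):
    """True if the mean ACS and mean SUPRIMECAM magnitudes differ by more than threshold."""
    return abs(mean_mag(acs_mags) - mean_mag(suprime_mags)) > threshold


def clean_sample(catalogue, threshold):
    """Keep only objects with consistent space- and ground-based photometry.

    Each object is a dict with hypothetical keys "acs" (F606W, F775W, F850LP)
    and "suprime" (R, I, z), both lists of AB magnitudes.
    """
    return [obj for obj in catalogue
            if not is_discrepant(obj["acs"], obj["suprime"], threshold)]


if __name__ == "__main__":
    catalogue = [
        {"id": 1, "acs": [24.0, 23.8, 23.6], "suprime": [24.1, 23.9, 23.7]},
        {"id": 2, "acs": [24.0, 24.0, 24.0], "suprime": [25.0, 25.0, 25.0]},
    ]
    # The threshold of 0.5 mag is an illustrative choice only
    kept = clean_sample(catalogue, threshold=0.5)
    print([obj["id"] for obj in kept])
```

Outlier rates for the cleaned sample would then be obtained by recomputing the statistics on the objects that survive the cut.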

The most striking feature in Fig. 12 and Table 5 is the large fraction of outliers (> 9% of the total sample) with catastrophically wrong photo-z’s. This fraction is higher than typical literature estimates. It should be emphasised that some of the objects included here are unusual in the sense that they have SEDs different from normal galaxies (e.g. AGNs). A small fraction is also influenced by blending effects in the ground-based bands or variability, so that there is a mismatch between the ACS and the SUPRIMECAM optical photometry. There may also be a very small number of objects with wrong spec-z’s. But the bulk of the outliers are real. If we reject objects which have discrepant photometry between ACS and SUPRIMECAM (see Sect. 4.2) the outlier rates decrease considerably, as indicated by the values in brackets in Table 5. The bias is largely unaffected by this filtering and the scatter values do not decrease by more than 10% (both not given in Table 5). We also test the most accurate code in PHAT1 (LP-t) without ACS photometry. The statistics of the problematic objects do not improve significantly although excluding ACS removes the discrepancy between overlapping optical filters. This suggests that most of the outliers amongst these objects are not just outliers because their photometry is corrupted, but rather because it is intrinsically harder to estimate photo-z’s for them. We leave the detailed characterisation of these peculiar objects (their morphology, SEDs, remaining photometric issues, etc.) to a future study.

Many codes seem to have problems with correctly identifying the redshifts of objects from the Reddy et al. (2006) sample with 1.[...] ≲ z ≲ 3. We explicitly decided to include those objects in the test in order not to artificially idealise the situation. PHAT was conceived to give a realistic picture of what can be achieved with today’s techniques. The outliers reported here are present in deep photometric catalogues and it is a delicate task for every scientist to remove those or account for their effect. The fact that literature values of outlier rates are usually smaller reflects the difficulty of a blind test, but it most probably also reflects that our combined spec-z catalogue, explicitly including objects from the so-called “redshift desert”, is more complete and representative than some other commonly used catalogues. Especially at R < 24 our spec-z catalogue is highly complete, and also for this bright cut the outlier rates are rather large for most codes (see Table 5 and the right panel of Fig. 11).

Fig. 11. Scatter and outlier values for the 14- (crosses) and 18-band (squares) PHAT1 case. The arrows indicate the effect of adding the IRAC bands on photo-z accuracy. The left panel shows the statistics for all objects and the right panel the ones for all objects with a magnitude of R < 24.

Fig. 12. Results of the PHAT1 test with 14 bands (i.e. excluding IRAC bands), z_phot vs. z_spec. Objects with R ≥ 24 are labelled in red.

Table 6. Same as Table 5 but with a relaxed criterion for outliers.

Code    18-band                      14-band                      18-band, R < 24          14-band, R < 24
        bias    scatter outl.(a)     bias    scatter outl.(a)     bias    scatter outl.(a) bias    scatter outl.(a)
BP-t    -0.084  0.122   5.9 (5.0)    0.016   0.085   4.8 (5.0)    -0.098  0.112   5.8      -0.098  0.112   5.8
BP2-t   0.009   0.084   3.8 (2.4)    0.011   0.081   3.6 (2.4)    0.008   0.072   1.5      0.008   0.072   1.5
EA-t    0.023   0.088   4.2 (2.0)    0.026   0.092   5.5 (2.0)    0.024   0.074   1.9      0.024   0.074   1.9
GA-t    -0.014  0.125   8.7 (5.9)    0.030   0.106   7.7 (5.9)    -0.026  0.115   5.4      -0.026  0.115   5.4
HY-t    -0.011  0.116   4.9 (4.2)    0.027   0.098   4.8 (4.2)    -0.016  0.109   3.5      -0.016  0.109   3.5
KR-t    -0.015  0.114   8.6 (5.9)    -0.003  0.105   6.9 (5.9)    -0.024  0.101   6.6      -0.024  0.101   6.6
LP-t    0.003   0.079   2.3 (1.4)    0.011   0.079   3.7 (1.4)    0.005   0.060   1.0      0.005   0.060   1.0
LR-t    0.028   0.104   4.5 (4.0)    0.054   0.098   7.6 (4.0)    0.023   0.087   2.5      0.023   0.087   2.5
AN-e    -0.036  0.151   3.1 (2.4)    -0.035  0.173   4.2 (2.4)    -0.047  0.130   1.4      -0.047  0.130   1.4
EC-e    -0.007  0.120   3.6 (3.1)    -0.003  0.114   3.6 (3.1)    -0.015  0.106   1.9      -0.015  0.106   1.9
PO-e    -0.013  0.124   3.1 (2.3)    0.001   0.107   2.3 (2.3)    -0.020  0.098   1.2      -0.020  0.098   1.2
RT-e    -0.031  0.126   3.2 (2.8)    -0.028  0.137   3.6 (2.8)    -0.034  0.111   1.4      -0.034  0.111   1.4

(a) Percentage of objects with |Δz′| = |z_spec − z_phot| / (1 + z_spec) > 0.5. The numbers for the cleaned sample excluding objects with discrepant ACS/SUPRIMECAM photometry are given in brackets.

Table 7. Same as Table 5 but in two different redshift bins.

Code    18-band, z_spec ≤ [...]      14-band, z_spec ≤ [...]      18-band, z_spec > [...]      14-band, z_spec > [...]
        bias    scatter outl.(a)     bias    scatter outl.(a)     bias    scatter outl.(a)     bias    scatter outl.(a)
BP-t    -0.050  0.055   31.4 (27.5)  0.013   0.044   7.2 (4.1)    -0.019  0.074   28.0 (28.9)  -0.001  0.075   35.3 (27.5)
BP2-t   0.003   0.035   6.8 (4.9)    0.005   0.035   6.5 (4.5)    0.001   0.071   30.7 (25.1)  0.001   0.075   31.0 (31.3)
EA-t    0.021   0.037   9.9 (3.9)    0.022   0.038   11.9 (4.9)   0.014   0.065   21.3 (19.9)  0.024   0.062   22.7 (22.3)
GA-t    -0.010  0.060   19.7 (14.6)  0.018   0.057   16.4 (12.9)  0.003   0.071   42.7 (42.7)  0.008   0.073   35.0 (34.1)
HY-t    -0.003  0.055   16.5 (12.9)  0.018   0.054   12.3 (8.9)   0.014   0.072   29.7 (30.8)  0.021   0.062   28.0 (18.5)
KR-t    -0.012  0.047   16.8 (11.8)  -0.011  0.050   10.5 (6.1)   0.026   0.072   35.7 (24.2)  0.042   0.062   51.3 (36.0)
LP-t    0.005   0.037   6.2 (3.2)    0.008   0.034   6.8 (2.8)    0.002   0.059   15.7 (16.6)  0.014   0.057   23.0 (18.0)
LR-t    0.023   0.059   10.1 (8.3)   0.039   0.053   15.1 (12.0)  0.028   0.079   41.3 (45.0)  0.037   0.070   39.7 (43.1)
AN-e    -0.017  0.070   27.6 (25.5)  -0.010  0.076   33.6 (31.6)  0.051   0.078   50.7 (53.2)  0.045   0.077   66.4 (70.3)
EC-e    -0.003  0.065   16.1 (12.9)  -0.000  0.064   14.5 (11.4)  0.015   0.077   32.3 (32.3)  0.015   0.077   29.5 (26.6)
PO-e    -0.012  0.049   12.6 (9.6)   -0.011  0.047   9.4 (6.0)    0.019   0.075   48.3 (48.3)  0.026   0.074   37.7 (32.7)
RT-e    -0.016  0.062   19.6 (17.0)  -0.014  0.064   21.1 (18.6)  0.040   0.072   31.8 (32.9)  0.039   0.071   41.9 (42.4)

(a) Percentage of objects with |Δz′| = |z_spec − z_phot| / (1 + z_spec) > 0.15. The numbers for the cleaned sample excluding objects with discrepant ACS/SUPRIMECAM photometry are given in brackets.

There are means of identifying outliers (poor fits, broad redshift-probability functions, etc.) and photometric catalogues can often be cleaned (e.g. by extraction flags) to yield much lower outlier rates. Depending on the science application such a filtering can be more or less applicable. For example, we showed that rejecting objects with problematic photometry can improve the situation considerably. However, photo-z’s are often used in a rather blind fashion without extensive checking (often due to a lack of spec-z comparisons) and filtering. Some science applications also rely on redshifts for all objects, not allowing for filtering. For those kinds of applications the raw numbers reported by PHAT1 in Table 5 are more informative than the cleaned ones given in brackets.

The best performance on this data set is achieved by the LP-t, BP2-t, EA-t, and BP-t codes, with LP-t showing the smallest scatter and outlier rates. The empirical PO-e code follows closely. While EA-t and BP-t also performed nicely on the PHAT0 test with noise (LP-t was used for the creation of the PHAT0 simulations), the good results for PO-e came as a surprise because this code ranked next to last in the PHAT0 test with noise. The sparse training set of PHAT1 (∼500 objects) is apparently large enough to fully exploit the capabilities of PO-e because there are not too many degrees of freedom involved here. In contrast, the empirical AN-e code that was in the top group for PHAT0 basically fails on PHAT1. The training set of PHAT1 is too sparse to train the neural network over this large redshift range. Neural networks are generally very good at interpolating smooth functions. However, the colour-redshift mapping of galaxies is highly complex in many places. Furthermore, there are ambiguities (also called colour-redshift degeneracies; Benítez 2000) in a catalogue spanning a large redshift range, i.e. objects with very different redshifts and very similar colours. In general, neural networks such as the one used in AN-e are not prepared to deal with such ambiguities since they only assign one output redshift to a particular point in colour space.

The top group of five (LP-t, BP2-t, EA-t, BP-t, and PO-e) is followed by HY-t, KR-t, LR-t, GA-t, EC-e, and RT-e in approximately this order. HY-t, KR-t and LR-t show some more or less pronounced, peculiar features with a number of objects being assigned very similar photo-z’s (horizontal features in Fig. 12). These features certainly have a large influence on the statistics and prevent those codes from performing as well as the top group although their error distribution in the core looks very similar. GA-t and EC-e clearly show a larger scatter in the core of the error distribution. The distribution for EC-e is smoother but with a larger width, resulting in the largest scatter (excluding AN-e).

Fig. 13. Similar to Fig. 12 but showing Δz = z_spec − z_phot vs. z_spec.

It is obvious that the empirical codes produce biases that are smaller by typically a factor of two compared to the template-based codes. The data-model match is by construction better in the empirical case. A mismatch in the template-based case can be due to both slightly inaccurate templates and slightly inaccurate photometry. It should be noted that it is very hard to achieve an accurate photometric cross-calibration spanning the whole wavelength range from the UV to the mid-IR. EC-e, which was designed with the goal of being as bias-free as possible, indeed shows by far the smallest bias. The combination of a machine-learning algorithm and the proper use of PDFs pays off here.

In Figs. 14 & 15 the results for the 18-band case (i.e. with IRAC bands included) are presented. The statistics are also listed in Table 5 and the scatter and outlier values for the different codes are plotted in Fig. 11 in comparison to the ones of the 14-band case. It is immediately obvious, especially from Fig. 11, that not all codes benefit from adding the IRAC photometry. Only LP-t, EA-t, LR-t, RT-e, and AN-e show some improvement when adding this information about the observed-frame mid-IR SEDs of the objects. The outlier rates of LP-t and EA-t decrease by ∼15% compared to the 14-band case, making them by far the best codes in this test, together with BP2-t, which basically shows the same performance as with 14 bands. RT-e also improves slightly in scatter and outlier rate with 18 bands compared to 14 bands. The bias and outlier rate of LR-t are decreased somewhat but with the trade-off of a slightly larger scatter. AN-e does not perform as poorly with 18 bands as with 14 bands but is still the least accurate code in this test.

PO-e, KR-t, HY-t, GA-t, and EC-e show slightly worse performance than in the 14-band case with approximately conserved order. BP-t, however, shows a huge increase in the number of outliers by ∼[...] z performance. Most of the Coe et al. (2006) templates are undefined and must be extrapolated for λ > [...] z bias. Re-calibration of the IRAC zeropoints with this template set improves the situation somewhat, but is not done here for simplicity. The good performance of BP2-t shows that it is not the code but the template set that makes the difference here.
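The effect of adding the IRAC bands can be checked directly against the full-sample outlier rates of Table 5. The short sketch below reads the quoted ∼15% as a relative change of the outlier rate, which is an interpretation rather than something stated explicitly in the text; the dictionary and function names are hypothetical, and only four of the twelve codes are included.

```python
# Full-sample outlier rates in percent from Table 5: (14-band, 18-band)
OUTLIER_RATES = {
    "LP-t": (9.2, 7.7),
    "EA-t": (13.5, 11.6),
    "BP2-t": (10.2, 10.4),
    "BP-t": (11.4, 30.9),
}


def relative_change(rate_14, rate_18):
    """Relative change in percent of the outlier rate when going from 14 to 18 bands."""
    return 100.0 * (rate_18 - rate_14) / rate_14


if __name__ == "__main__":
    for code, (rate_14, rate_18) in OUTLIER_RATES.items():
        print(f"{code:6s} {relative_change(rate_14, rate_18):+7.1f}%")
```

Negative values indicate codes that benefit from the mid-IR photometry; a value near zero corresponds to a code whose performance is essentially unchanged.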

Fig. 14.

Similar to Fig. 12 but for 18 bands (i.e. including IRAC bands).

The performance shown by the best codes in the semi-blind PHAT1 test with low bias and scatter values in the 4–8% [...] z catalogue besides the presence of objects with unusual SEDs and some problems with the combination of space-based and ground-based photometry. It should be noted that the PHAT1 spectroscopic catalogue represents a very deep sample and is not purely magnitude-limited. However, such depths are commonly used in photometric studies in extragalactic astronomy. We cannot fully quantify the fraction of outliers that are due to photometry problems on the one hand or due to intrinsically problematic objects with strange SEDs on the other hand. But the test of LP-t without ACS data described in Sect. 4.3 suggests that most of the problem seen here is connected to the latter.

Differences in the accuracy of the codes for the 14-band case can mostly be attributed to differences in the template sets and priors for the SED-fitting codes on the one hand and differences in the training schemes for the empirical codes on the other hand. It is not the aim of this study to explain all the features seen in this comparison. Rather we want to provide a snapshot of what current codes are capable of in a semi-blind application.

It is striking that half of the codes perform worse with the IRAC photometry included. In particular, the low-z performance suffers in this case. For the template-based codes this can be explained by insufficient knowledge of the template SEDs in the mid-IR. (This is mostly due to insufficient modelling of dust emission features from PAHs.) If the templates do not represent reality, additional data cannot be expected to lead to an improvement. EA-t, the only template-based code that really benefits from the information in the IRAC bands, differs from the other template-based codes in the sense that it uses a template error function (see Brammer et al. 2008, for a detailed description). This feature weights the measurements in the different bands according to the estimated accuracy of the template at the rest-frame wavelength that corresponds to the effective wavelength of a given filter at a particular redshift step before computing the χ². This hard-coded template error function assigns a low accuracy to the mid-IR spectral region of the templates so that the IRAC bands do not influence the χ² at low z. At higher redshifts, however, when IRAC probes the rest-frame near-IR or optical where templates are more accurate, the information is used and can improve the photo-z’s. That is reflected in the lower bias and outlier fraction for EA-t in the 18-band case when compared to the 14-band case. BP2-t employs a filter error based on the scatter between the photometry of best-fit models and observed photometry in a particular filter on the spectroscopic training set. This essentially down-weights the IRAC bands. In general the mid-IR behaviour of the advanced template sets used by LP-t, BP2-t, and EA-t seems to be more realistic than the extrapolations employed for some other sets, leading to better performance with 18 bands.

Fig. 15. Similar to Fig. 13 but for 18 bands (i.e. including IRAC bands).

The lower bias values produced by the empirical codes suggest that there are still systematic inaccuracies in most template sets. With a sufficient training set such inaccuracies can be repaired by re-calibrating the templates, e.g. with the approach described in Budavári et al. (2000). Such a better data-model match is demonstrated by BP2-t, which consistently shows the lowest bias of all template-based methods; this is, however, still somewhat larger than the values for EC-e.
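The down-weighting of the IRAC bands by a template error function, as described above for EA-t, can be illustrated with a toy χ². This is a sketch under explicit assumptions and not the EA-t implementation: the template error is assumed to enter as a fractional flux uncertainty added in quadrature to the photometric error, and the step-shaped `template_error` function with its 2.5 µm break and its two values is invented purely for illustration (see Brammer et al. 2008 for the actual definition).

```python
def template_error(rest_wavelength_um):
    """Hypothetical fractional template uncertainty.

    Small where templates are assumed reliable (rest-frame optical/near-IR),
    large in the rest-frame mid-IR. The break and both values are invented.
    """
    return 0.05 if rest_wavelength_um < 2.5 else 1.0


def chi2(fluxes, errors, model_fluxes, eff_wavelengths_um, z):
    """Chi-square with the template error added in quadrature per band."""
    total = 0.0
    for flux, err, model, lam in zip(fluxes, errors, model_fluxes, eff_wavelengths_um):
        rest = lam / (1.0 + z)  # rest-frame wavelength probed by this filter
        variance = err ** 2 + (template_error(rest) * flux) ** 2
        total += (flux - model) ** 2 / variance
    return total


if __name__ == "__main__":
    # One IRAC 8.0 micron band with a poorly matching template flux
    low_z = chi2([10.0], [1.0], [14.0], [8.0], z=0.2)   # rest-frame mid-IR
    high_z = chi2([10.0], [1.0], [14.0], [8.0], z=3.0)  # rest-frame near-IR
    print(f"chi2 contribution at z=0.2: {low_z:.3f}, at z=3.0: {high_z:.3f}")
```

With these toy numbers the same data-model mismatch contributes far less to the χ² at low redshift, where the band samples the poorly known rest-frame mid-IR, than at high redshift, which is the qualitative behaviour described in the text.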

5. Conclusions

With PHAT we provide a snapshot of the photo-z accuracy achievable with today’s methods in semi-blind tests. Most major photo-z codes used in the current literature are included in the challenge presented here.

A first test, PHAT0, on highly idealised simulations yields good agreement between the different codes (16 participants in total), especially in comparison to the LP-t code that was used to create the simulations. Differences are found in the handling of the opacity of the IGM, which are most likely unimportant for practical applications (as long as only broad photometric bands are used).

The PHAT1 test, based on real photometric and spectroscopic data from the GOODS survey, represents a much more difficult test environment including many of the challenges encountered in practical applications. As expected, the results from twelve participants show a larger fluctuation in accuracy, but a general convergence is seen for most codes, i.e. scatter values and outlier rates are within a factor of two of the best code in the test. While the best codes perform to expectations in terms of bias and scatter, some other codes show remaining biases due to a template set that does not perfectly fit the data or due to an insufficient training set. Half of the codes do not benefit from adding mid-IR photometry from the Spitzer Space Telescope. This finding strongly suggests that there is considerable inaccuracy in some of the template sets in the rest-frame mid-IR region of the SEDs. The rather large outlier rates reported in this test should be taken seriously since most of these problematic objects are also present in purely magnitude-limited photometric samples, but not necessarily in commonly used spec-z catalogues, which are incomplete at fainter magnitudes. Cleaning of the catalogues is still necessary for PHAT1 to reach an outlier rate below ∼5% for the best code in the test. More detailed future studies (possibly in the framework of PHAT) are needed to identify the nature of this problem and quantify the contributions from multi-colour photometry issues on the one hand and objects with intrinsically unusual SEDs on the other hand. We believe that solving the problem of these outliers lies at the core of future photo-z improvements. It is clear that improved spec-z catalogues which are as complete as possible will be indispensable for such an effort. Some science applications that do not rely on complete samples of galaxies (e.g. dark energy studies with weak gravitational shear) can greatly benefit from efficient cleaning of galaxy catalogues. There are ways of considerably improving photo-z accuracy by rejecting objects with unreliable estimates. It is, however, beyond the scope of this study to present strategies on how to optimise catalogues for different science applications and how to quantify those improvements.

Photo-z accuracy is of paramount importance for a large number of future science projects, ranging from galaxy evolution to cosmology. The differences in the performance of the different photo-z codes presented here will have a direct impact on the power of photometric surveys to answer those scientific questions. We did not quantify the impact of photo-z accuracy here, but it should be noted that there is still some way to go before photo-z’s reach the accuracy required for e.g. future full-sky dark energy surveys.

The test environments used in this study are publicly available and can be used to assess the performance of future methods in comparison to the results presented here in a quantitative and unbiased way.

Acknowledgements.

We would like to thank JPL/Caltech for hospitality and support during the 2008 PHAT workshop. We are grateful to the large number of colleagues who made PHAT a success through discussions, criticism, and encouragement. Special thanks go to Mike Hudson, who came up with the acronym “PHAT”. HH would like to thank in particular Catherine Heymans, Konrad Kuijken, Ludovic van Waerbeke, and Peter Schneider for supporting the PHAT effort. HH was supported by the European DUEL RTN, project MRTN-CT-2006-036133. The work of LAM and DC was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with NASA. LAM acknowledges support by the NASA ATFP program. CW was supported by an STFC Advanced Fellowship. NP acknowledges support from NKTH:Polanyi and KCKHA005 grants.

References