TPZ : Photometric redshift PDFs and ancillary information by using prediction trees and random forests
Mon. Not. R. Astron. Soc., 1–21 (0000) Printed 1 April 2013 (MN LaTeX style file v2.2)
Matias Carrasco Kind⋆ and Robert J. Brunner
Department of Astronomy, University of Illinois, Urbana, IL 61820 USA
ABSTRACT
With the growth of large photometric surveys, accurately estimating photometric redshifts, preferably as a probability density function (PDF), and fully understanding the implicit systematic uncertainties in this process has become increasingly important. In this paper, we present a new, publicly available, parallel, machine learning algorithm that generates photometric redshift PDFs by using prediction trees and random forest techniques, which we have named TPZ. This new algorithm incorporates measurement errors into the calculation while also dealing efficiently with missing values in the data. In addition, our implementation of this algorithm provides supplementary information regarding the data being analyzed, including unbiased estimates of the accuracy of the technique without resorting to a validation data set, identification of poor photometric redshift areas within the parameter space occupied by the spectroscopic training data, a quantification of the relative importance of the variables used to construct the PDF, and a robust identification of outliers. This extra information can be used to optimally target new spectroscopic observations and to improve the overall efficacy of the redshift estimation. We have tested TPZ on galaxy samples drawn from the SDSS main galaxy sample and from the DEEP2 survey, obtaining excellent results in each case. We also have tested our implementation by participating in the PHAT1 project, a blind photometric redshift contest, finding that TPZ performs comparably to, if not better than, other empirical photometric redshift algorithms. Finally, we discuss the various parameters that control the operation of TPZ, the specific limitations of this approach, and an application of photometric redshift PDFs.
Key words: galaxies: distances and redshifts – galaxies: statistics – surveys – methods: data analysis – methods: statistical
⋆ E-mail: [email protected]

Late time cosmological measurements are often made by carefully measuring the three-dimensional distribution of galaxies. For these measurements, the distance between the galaxy and the observer is most accurately made by using a spectroscopic redshift. However, spectroscopic measurements are considerably more difficult to obtain, and are, therefore, more expensive than photometric measurements, as they require long exposures in order to achieve sufficient signal-to-noise over a wide wavelength range. As an example, while the Sloan Digital Sky Survey (SDSS; York et al. 2000) has taken millions of spectroscopic redshifts of galaxies to high precision (Aihara et al. 2011), the same survey has obtained detailed photometric measurements for a much larger sample of galaxies in considerably less time. This dichotomy will only grow with ongoing and planned surveys that are dominated by photometric-only observations.

As a result, considerable attention has been focused on the estimation of redshifts by applying statistical techniques to the photometric observations of sources through different filters. These photometric redshift (hereafter photo-z) estimation techniques have become crucial for modern, multi-band digital surveys; and this need for fast and accurate photo-z estimation is becoming even more important for large photometric surveys like the Dark Energy Survey (DES) and the Large Synoptic Survey Telescope (LSST), which are probing galaxies that are often too faint to be spectroscopically observed. Adopting a photo-z approach allows cosmological measurements on galaxy samples that are currently at least a hundred times larger than comparable spectroscopic samples, that have relatively simple and uniform selection functions, and that extend to fainter flux limits and larger angular scales, thus probing much larger cosmic volumes. In summary, photo-z techniques provide a much higher number of galaxies with redshift estimates per unit telescope time than spectroscopic surveys (Hildebrandt et al. 2010).

The estimation of galaxy redshifts using multi-band photometry was first performed by Baum (1962), while Koo (1985) and Loh & Spillar (1986) were the first to compute galaxy redshifts by using digital photometric observations from charge coupled devices. In the last fifteen years, however, the estimation of redshifts from broadband photometry has grown significantly. Presently, there are many different methods for computing photometric redshifts (see, e.g., Wang et al. 2008; Hildebrandt et al. 2010; Abdalla et al. 2011, for an updated comparison of current photometric redshift methods and public codes).
These techniques can be broadly categorized as either template fitting algorithms or empirical training algorithms. The template fitting algorithms (e.g., Benítez 2000; Bolzonella et al. 2000; Csabai et al. 2003; Ilbert et al. 2006; Feldmann et al. 2006; Assef et al. 2010) can use either empirical (e.g., Coleman et al. 1980; Assef et al. 2010) or synthetic spectral templates (e.g., Bruzual & Charlot 2003). These techniques estimate a photometric redshift by finding the best match between the observed magnitudes or colors and the synthetic magnitudes or colors from the suite of templates that are sampled across the expected redshift range of the photometric observations.

Empirical training methods use a spectroscopic training data set to calibrate an algorithm that can be quickly applied to new photometric observations. Initially, the training set was used to map a polynomial function between the colors and the redshift (e.g., Connolly et al. 1995; Brunner et al. 1997). More recently, this process has been extended to machine learning algorithms, including artificial neural networks (e.g., Collister & Lahav 2004; Oyaizu et al. 2008b), boosted decision trees (e.g., Gerdes et al. 2010), random forests (e.g., Carliles et al. 2010), nearest neighbors (e.g., Ball et al. 2007, 2008; Lima et al. 2008), self-organized maps (e.g., Geach 2012; Way & Klose 2012), spectral connectivity analysis (e.g., Freeman et al. 2009), Gaussian processes (e.g., Way et al. 2009; Bonfield et al. 2010), support vector machines (e.g., Wadadekar 2005), and the Quasi-Newton Algorithm (e.g., Cavuoti et al. 2012).
While only a few of these photo-z methods are publicly available, they all perform to a similar accuracy and provide only a single redshift estimate rather than a full redshift probability density function for each galaxy.

The template fitting methods, which leverage model galaxy spectral energy distributions (SEDs), have been used extensively and are often preferred since, once implemented, they can be readily applied to new data by simply adopting the appropriate photometric filter transmission functions. Given a representative sample of template galaxy spectra, most of these techniques can reliably predict a photo-z, although the use of training data that include known redshifts can improve these predictions (e.g., Benítez 2000; Ilbert et al. 2006). These techniques, however, are not exempt from uncertainties due to measurement errors on the survey filter transmission curves, mismatches when fitting the observed magnitudes or colors to template SEDs, and color-redshift degeneracies. Furthermore, template techniques generally become less reliable at high redshift, where the uncertainties in galaxy SEDs increase, since the templates are often calibrated using low redshift galaxies.

On the other hand, when provided with a high quality spectroscopic training sample, empirical training techniques have been shown to have similar or even better performance (Collister & Lahav 2004). In addition, empirical techniques are generally simpler to apply to different data sets and frequently provide an improved quantification of any uncertainties, which can be encoded in a photo-z probability density function (PDF).
They also have the additional advantage that it is easier to include extra information, such as galaxy profiles, concentrations, angular sizes, or environmental properties, in addition to magnitudes or colors. These methods, however, are only reliable within the limits of the training data, and sufficient caution must be exercised when extrapolating these algorithms beyond those limits.

As the demand for more accurate photo-z methods has grown, techniques have branched out into new areas in order to improve the accuracy of photo-z estimation. While a complete understanding of the systematic uncertainties is needed for a reliable and accurate machine learning photo-z algorithm (see, e.g., Oyaizu et al. 2008a, for a discussion of photometric redshift errors), other issues have recently been recognized in the effort to generate the most accurate photometric redshifts. For example, Cunha et al. (2012a,b) analyzed the effect of systematics within the spectroscopic training data set that is used to estimate a galaxy photo-z. Likewise, other functionality that a modern photo-z algorithm should provide includes an identification of outliers in the training set that lead to an incorrect estimation of a photo-z, an identification of the features within the training data that most strongly affect a photo-z estimate, and an identification of areas of parameter space (e.g., magnitudes, colors, and redshift ranges) that are undersampled by the training data. The last two features are important to the design of photometric surveys, as they provide useful information to optimally and efficiently guide follow-up spectroscopy to generate the scientifically most useful training data set for these algorithms.

Of course, we estimate a galaxy's redshift so that it can be used in a subsequent analysis. A number of cosmological measurements, such as galaxy clustering, weak lensing, baryon acoustic oscillations, and the mass function of galaxy clusters (see, e.g., Ho et al.
2012; Reid et al. 2010; Jee et al. 2013), among others, depend strongly on both the number of targeted galaxies in the sample and the accuracy of the measured distances to the galaxies. Given the growth of photometric-only surveys, these cosmological measurements will require the use of reliable photometric redshifts and a complete understanding of their uncertainties. As a result, photo-z methods will be most effective going forward if they can not only robustly provide a reliable redshift estimate but also a full redshift probability density function.

In addition, the extra information in a redshift PDF can be used to improve or enhance a particular cosmological measurement. For example, Myers et al. (2009) have shown that by using the full redshift PDF within a two-point angular quasar correlation function, as opposed to simply using a single redshift estimate, their measurement is improved by a factor of nearly four, which is equivalent to increasing the survey volume by a similar factor. Likewise, Mandelbaum et al. (2008) discuss how the accuracy of photo-z estimates and the inclusion of the photo-z PDF affect the calibration for weak lensing studies. Other recent studies (see, e.g., Sheth 2007; van Breukelen & Clewley 2009) have also demonstrated how a cosmological measurement can be improved by using a photo-z PDF. However, given the lack of reliable photo-z PDF estimation techniques, this remains an underutilized tool.

In this work, we address these issues by introducing TPZ (Trees for Photo-Z), a new, Python-based, parallel, machine learning code for estimating photometric redshift PDFs by using prediction trees and random forest techniques (Breiman et al. 1984; Breiman 2001). Our approach is an ensemble learning method that generates several classifiers and combines their results into a final output.
Prediction trees partition the multi-dimensional space recursively into smaller regions; the partitioning terminates when a leaf contains only a few elements. Within these final leaves, our algorithm can leverage a simple model for the actual prediction, by using, for example, the mean value for a regression or the mode in a voting process as used in a classification scheme. Likewise, the basic idea of a random forest method is to use bootstrap samples from the training data to build a set of prediction trees. These trees are constructed by selecting the best split point from a random subsample of the dimensions (e.g., magnitudes or colors) along which the data are subdivided. By aggregating the predictions from this forest of trees, we produce a more accurate estimate. In our implementation, we incorporate the errors on the measured attributes by perturbing the galaxy parameters by their uncertainties. We repeat this process, generating multiple individual new observations of each galaxy that are subsequently combined into a final PDF, which can be used as desired to estimate a single redshift and its associated error. In addition, our implementation of this technique naturally incorporates data with missing values and also provides extra meta information, such as an unbiased estimate of the prediction error, a measure of the relative importance of the parameters used in the photo-z estimation as a function of redshift, an identification of regions where the training data provide poor predictions, and an identification of galaxies that are likely outliers.

This paper is organized as follows. We first describe the photo-z method presented herein, then introduce the data used to test it, present an analysis of the results, and discuss the capabilities of our approach. Finally, in §5, we conclude with a summary of our main results and a discussion of the TPZ algorithm.
Among the different non-linear methods that are used to compute photometric redshifts, prediction trees are one of the simplest yet most accurate techniques. Supervised learning methods using prediction trees, either classification or regression, have been shown to be among the most accurate algorithms for both low- and high-dimensional data (Caruana et al. 2008). They are also fast, can easily deal with missing data, and have similarities with other non-parametric techniques. For example, prediction trees are similar to k-nearest-neighbor (kNN) algorithms in that they both group data points with similar characteristics. However, kNN algorithms use test data to identify similar points within the training set while keeping the parameter k fixed, even though some points might have a very different number of similar neighbors. Prediction trees, on the other hand, have terminal leaves that bound regions of the parameter space where the predictions (i.e., redshifts) and their properties (e.g., magnitudes) are similar. As both the quantity and identity of test data can vary between leaf (or terminal) nodes, prediction trees are known as adaptive nearest-neighbor methods (Breiman et al. 1984).

Prediction trees are built by asking a sequence of questions that recursively split the data, frequently into two branches, until a terminal leaf is created that meets a stopping criterion (e.g., a minimum leaf size). The small region bounding the data in the terminal leaf node represents a specific subsample of the entire data with similar properties. Within this leaf, a model is applied that provides a fairly comprehensible prediction, especially in situations where many variables may exist that interact in a nonlinear manner, as is often the case with photo-z estimation.
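As a concrete illustration of this recursive partitioning, a minimal regression tree can be grown in a few lines of Python. This sketch is ours, not the TPZ implementation; the dictionary-based node structure, the variance-based split score, and all function names are illustrative choices:

```python
import numpy as np

def grow_tree(X, z, min_leaf=5):
    """Recursively split (X, z) along the dimension and value that best
    separate the redshifts, stopping when a node reaches the minimum
    leaf size; each terminal leaf stores a simple model (here, the mean)."""
    if len(z) <= min_leaf or np.all(z == z[0]):
        return {"leaf": True, "z": z.mean()}
    best = None
    for d in range(X.shape[1]):                    # scan every dimension
        for s in np.unique(X[:, d])[:-1]:          # candidate split values
            mask = X[:, d] <= s
            score = (mask.sum() * z[mask].var()
                     + (~mask).sum() * z[~mask].var())
            if best is None or score < best[0]:
                best = (score, d, s)
    if best is None:                               # no valid split remains
        return {"leaf": True, "z": z.mean()}
    _, d, s = best
    mask = X[:, d] <= s
    return {"leaf": False, "dim": d, "split": s,
            "left": grow_tree(X[mask], z[mask], min_leaf),
            "right": grow_tree(X[~mask], z[~mask], min_leaf)}

def predict(tree, x):
    """Answer the sequence of questions to reach a terminal leaf."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["dim"]] <= tree["split"] else tree["right"]
    return tree["z"]
```

Each call to `grow_tree` poses one "question" (is attribute `d` below `s`?), mirroring the recursive splitting described above.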
A visualization of an example tree generated by our technique is shown in Figure 1.

There are two classes of prediction trees (Breiman et al. 1984), classification and regression, both of which are implemented in TPZ.

(i) Classification Trees (also called Decision Trees): As the name suggests, this type of prediction tree is designed to classify or predict a discrete category from the data. Each terminal leaf contains data that belong to one or more classes. The prediction can be either a point prediction based on the mode of the classes inside that leaf, or distributional, assigning probabilities to each category based on their empirically estimated relative frequencies. For example, in our photo-z technique we use the magnitudes or colors of galaxies to determine the probability that a galaxy lies either inside or outside a specific redshift bin (a detailed explanation of the algorithm is presented below). The dimension along which a node is split is chosen to maximize the Information Gain (I_G), which is defined in terms of the impurity degree index I_d:

I_G(T, M) = I_d(T) - \sum_{m \in \mathrm{values}(M)} \frac{|T_m|}{|T|} I_d(T_m)    (1)

where T is the training data in a given node, M is one of the possible dimensions (e.g., magnitudes) along which the node may be split, m are the possible values of a specific dimension M (in the case of magnitudes, m might represent two or more magnitude bins), |T| and |T_m| are, respectively,
Figure 1.
A simplified example of a binary prediction tree plotted radially. The initial node is close to the center of the figure. The splitting process terminates when a stopping criterion is reached. Individual colors represent the unique variable (e.g., fixed aperture g or r magnitudes or colors) used for the splitting at each node. Each leaf provides a specific prediction based on the information contained within that terminal node (gray triangles in the figure). The subpanel corresponds to a zoomed-in region of the tree.

the size of the total training data and the number of objects for a given subset m within the current node, and I_d is the function that represents the degree of impurity of the information.

There are three standard methods to compute the impurity index (I_d). The first method is to use the information entropy, which is defined in the expected manner (similar to thermodynamics):

I_d(T) \equiv H(T) = -\sum_{i=1}^{n} f_i \log f_i    (2)

where i is the class to be predicted (e.g., inside or outside a redshift bin), the sum is over all n possible classes (two in our example), and f_i is the fraction of the training data belonging to class i. The same definition applies for a subset of the data T_m.

The second option is to measure the Gini impurity (G). In this case, a leaf is considered pure if all the data contained within it have the same class. The Gini impurity can be computed inside each node:

I_d(T) \equiv G(T) = \sum_{i=1}^{n} \sum_{j \neq i} f_i f_j    (3)

where f_i and f_j are the fractions of the training data of class i or j. The same equation applies for a subset of T along one particular dimension M. Since the f_i are the fractions for all possible classes, we have \sum_i f_i = 1 and, therefore, \sum_{j \neq i} f_j = 1 - f_i. As a result, Equation 3 can be simplified to

I_d(T) \equiv G(T) = 1 - \sum_{i=1}^{n} f_i^2    (4)
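These impurity indices (together with the classification error defined below) and the information gain of Equation 1 can be sketched in Python as follows. This is a minimal illustration with our own function names, assuming integer class labels and a binary split:

```python
import numpy as np

def entropy(labels):
    """Information entropy H(T) = -sum_i f_i log f_i (Eq. 2)."""
    if len(labels) == 0:
        return 0.0
    f = np.bincount(labels) / len(labels)
    f = f[f > 0]                       # 0 log 0 is taken as 0
    return -np.sum(f * np.log(f))

def gini(labels):
    """Gini impurity G(T) = 1 - sum_i f_i^2 (Eq. 4)."""
    if len(labels) == 0:
        return 0.0
    f = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(f ** 2)

def classification_error(labels):
    """Classification error C_E(T) = 1 - max f_i (Eq. 5)."""
    if len(labels) == 0:
        return 0.0
    f = np.bincount(labels) / len(labels)
    return 1.0 - f.max()

def information_gain(x, labels, split, impurity=gini):
    """I_G (Eq. 1) for a binary split of attribute x at `split`."""
    n = len(labels)
    left, right = labels[x <= split], labels[x > split]
    return (impurity(labels)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))
```

For a two-class node with equal fractions, `gini` returns 0.5 and `entropy` returns log 2, the maxima visible in Figure 2; a split that separates the classes perfectly drives both child impurities to zero, maximizing the gain.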
Figure 2.
Impurity index I_d for a two-class example as a function of the probability of one of the classes, f, using the information entropy (blue), Gini impurity (green), and classification error (red). In all cases, the impurity is at its maximum when the fraction of data within a node with class 1 is 0.5, and zero when all data are in the same category.

The third method is to simply measure the impurity degree by using the classification error (C_E):

I_d(T) \equiv C_E(T) = 1 - \max\{f_i\}    (5)

where the maximum value is taken among the fractions f_i within the data T that have class i. During the tree construction, the data are scanned over each dimension to determine the split point that maximizes the information gain as defined by Equation 1, and the attribute that maximizes this information gain overall is selected. For example, Figure 2 shows these three impurity indices, for a node with data that are categorized into only two classes, as a function of the fraction of the data having a specific class. If all of the data belong to a specific class, the impurity is zero. On the other hand, if half of the data have one class and the remaining data all belong to the other class, the impurity is at its maximum. Our implementation can calculate any of these three impurity indices, and any one of them can be selected for the construction of the prediction trees. Alternatively, the index providing the highest information gain at a given node can be selected.

(ii) Regression Trees:
A second type of prediction tree is used when the quantity to be predicted is continuous; since it does not use discrete classes, we instead fit a regression model to the data inside a leaf. The construction of a regression tree follows the same structure as the classification tree, and once again a node is generally divided into two branches (i.e., a binary tree). There are two primary differences, however, between regression and decision trees. First, each leaf has training data with different redshift values; the prediction value is based on a regression model covering these points. Usually, the mean of the training redshifts is returned, so each prediction is no longer a discrete classification, but is instead an estimation of a continuous variable. Second, the procedure used to select the best dimension to split for a regression tree is based on the minimization of the sum of the squared errors, which for a node T is given by

S(T) = \sum_{m \in \mathrm{values}(M)} \sum_{i \in m} (z_i - \hat{z}_m)^2    (6)

where m are the possible values (bins) of the dimension M, z_i are the values of the target variable on each branch/bin m, and \hat{z}_m is the specific prediction model used. In the case of the arithmetic mean, we have \hat{z}_m = \frac{1}{n_m} \sum_{i \in m} z_i, where n_m is the number of members on branch m. This allows us to rewrite Equation 6 as

S(T) = \sum_{m \in \mathrm{values}(M)} n_m V_m    (7)

where V_m is the variance of the z_i within branch m. At each node in our tree, we scan all dimensions to identify the split point that minimizes S(T). The splitting dimension that has the lowest value of S is selected as the splitting direction, and this procedure is repeated until either some threshold in S is reached or any new nodes would contain less than the predefined minimum leaf size.
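The split criterion of Equations 6 and 7 can be sketched as follows. This is an illustrative fragment rather than the TPZ implementation; for brevity, candidate split points are taken directly from the observed attribute values:

```python
import numpy as np

def split_cost(z, x, s):
    """S(T) for a binary split of attribute x at s: each branch predicts
    its mean redshift, so the cost reduces to sum_m n_m * V_m (Eq. 7)."""
    cost = 0.0
    for branch in (z[x <= s], z[x > s]):
        if len(branch) > 0:
            cost += len(branch) * branch.var()   # n_m * V_m
    return cost

def best_split(z, X):
    """Scan all dimensions and candidate split points, returning the
    (dimension, value, cost) triple that minimizes S(T)."""
    best = (None, None, np.inf)
    for d in range(X.shape[1]):
        for s in np.unique(X[:, d])[:-1]:        # candidate split values
            c = split_cost(z, X[:, d], s)
            if c < best[2]:
                best = (d, s, c)
    return best
```

Because the branch prediction is the arithmetic mean, minimizing the summed squared errors of Equation 6 is exactly equivalent to minimizing the variance-weighted form of Equation 7, which is what the code computes.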
Random forest is an ensemble learning algorithm that first generates many prediction trees and subsequently combines their predictions together. It is one of the most accurate empirically trained learning techniques for both low- and high-dimensional data (Caruana et al. 2008). The idea is simple: given a training sample T containing N objects that have M attributes (e.g., survey magnitudes), create N_T bootstrap samples of size N (i.e., N randomly selected objects with replacement). From these samples, we create the corresponding N_T prediction trees without pruning them back.

If all the variables are examined when deciding the best point to split, the method is called bagging (Breiman 1996). An additional layer of randomness can be added to the bagging process by choosing the best split point from among a random subsample of m* < M variables at each node, where m* is kept fixed during the process. The value of m* is an adjustable parameter that is directly related to the strength of a tree (a strong tree has a low error rate) and the correlation between any two trees (the more correlated the trees, the higher the forest error rate). Increasing or reducing m* has the same effect on both features. Of course, we want to select the optimal value of m*. A good starting point is to set m* ≈ √M, although the accuracy of the algorithm is, in the end, not very sensitive to this parameter for a large number of trees and a relatively small number of dimensions. After constructing all of the prediction trees, a final and robust prediction is calculated by combining all N_T estimates together.

Breiman (2001) first introduced this algorithm and showed that this technique performs very well when compared to many other learning techniques.
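The bagging procedure with random feature subsets might be sketched as below. For brevity each "tree" is only a depth-one stump, and all names are our own rather than TPZ's:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(X, z, m_star):
    """A depth-one regression 'tree': choose the best split among a random
    subsample of m_star dimensions and predict the branch means."""
    dims = rng.choice(X.shape[1], size=m_star, replace=False)
    best = None
    for d in dims:
        for s in np.unique(X[:, d])[:-1]:
            left, right = z[X[:, d] <= s], z[X[:, d] > s]
            score = len(left) * left.var() + len(right) * right.var()
            if best is None or score < best[0]:
                best = (score, d, s, left.mean(), right.mean())
    if best is None:                    # degenerate node: predict the mean
        return (0, -np.inf, z.mean(), z.mean())
    return best[1:]

def stump_predict(stump, X):
    d, s, lo, hi = stump
    return np.where(X[:, d] <= s, lo, hi)

def random_forest(X, z, n_trees=25):
    """Grow N_T trees on bootstrap samples of size N, each split chosen
    from a random subset of m* ~ sqrt(M) dimensions."""
    N, M = X.shape
    m_star = max(1, int(np.sqrt(M)))
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)   # bootstrap, with replacement
        forest.append(train_stump(X[idx], z[idx], m_star))
    return forest

def forest_predict(forest, X):
    """Aggregate the N_T per-tree estimates (the mean for regression)."""
    return np.mean([stump_predict(t, X) for t in forest], axis=0)
```

The two sources of randomness, bootstrap resampling and the per-node feature subsample of size m*, are what decorrelate the trees before their estimates are aggregated.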
This technique is robust against overfitting (i.e., there is no limit on the number of trees, N_T, in the forest), it runs efficiently on large data sets, it can generate an internal unbiased estimate of the error, and it can provide extra information about the relative importance of the input variables and the internal structure of the training data.

Given a training set T, this extra, ancillary information can be calculated prior to the computation of the photo-z PDFs. As a result, we can use this a priori information to explore the efficacy of different parameter combinations while also obtaining an estimate of the bias and variance of the photo-z prediction. This is done by using out-of-bag (OOB) samples, which consist of the random sample of data that are left out of each tree. In the process of growing a forest, N_T trees are created using bootstrap samples of size N. In each of these samples, about one-third of the data are not used when constructing a tree, and are instead used as a test sample for the recently built tree. The test results created by using these OOB data are combined together to obtain estimators of the error which, when built using a sufficiently large number of trees in the forest, have been shown to be unbiased and as accurate as those from a validation set of the same size as the training set (Breiman 1996). This removes, therefore, the need for a separate validation sample that can introduce a bias into the final result. This method also has the advantage of using the full spectroscopic data to compute PDFs.

The OOB data can also be used to estimate the relative importance of each attribute or dimension to the photo-z calculation.
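The OOB bookkeeping can be illustrated as follows. As a stand-in for a full prediction tree, each tree here is a 1-nearest-neighbour predictor over its bootstrap sample (in keeping with the adaptive nearest-neighbour view of trees), so this is a sketch rather than the TPZ implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def oob_error(X, z, n_trees=50):
    """Each bootstrap sample leaves out roughly one-third of the objects
    (the out-of-bag set), which serve as the test set for that tree; the
    per-object OOB predictions are then combined into an error estimate."""
    N = len(z)
    oob_sum = np.zeros(N)   # running sum of OOB predictions per object
    oob_n = np.zeros(N)     # number of trees for which object was OOB
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)          # bootstrap sample
        oob = np.setdiff1d(np.arange(N), idx)     # ~N/e objects left out
        for i in oob:
            # 1-NN stand-in for the tree trained on this bootstrap sample
            nn = idx[np.argmin(np.abs(X[idx] - X[i]))]
            oob_sum[i] += z[nn]
            oob_n[i] += 1
    seen = oob_n > 0
    resid = oob_sum[seen] / oob_n[seen] - z[seen]
    return np.sqrt(np.mean(resid ** 2))
```

Since every object is out-of-bag for roughly a third of the trees, each one accumulates many independent test predictions, which is why no separate validation sample is needed.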
Estimating the relative importance of each attribute provides an elegant method to identify and remove attributes that do not contribute significantly, thereby reducing the noise and the dimensionality of the problem. This also has the benefits of increasing the performance of the implementation, improving our understanding of the complexity of the interaction between different attributes, and improving the identification of new training data from, for example, follow-up observations. This relative importance is estimated for each attribute by first quantifying any variations in the prediction error when the OOB data are permuted only along the specific attribute, leaving the others unchanged. This process is repeated across all trees, and the end result is the average increment in the error, relative to the unperturbed variables, over all the trees in the entire forest.

Another item we can construct is the proximity matrix, Prox(i, j), which is a symmetric, positive definite matrix that gives the fraction of trees in the forest in which elements i and j fall in the same terminal leaf. This matrix is constructed tree by tree by running all the data, both the OOB data and the data used for growing, down each tree. When galaxies i and j land in the same leaf, their proximity is increased by one. At the end, all the proximities are normalized by the total number of trees; therefore, similar galaxies will tend to have higher proximities than dissimilar ones. This matrix can be computed for the training set, the test set, or both together. Since this matrix quantifies the relative similarity between galaxies, it can be used to identify outliers within a data set.
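Given the terminal leaf reached by every object in every tree, the proximity matrix and a simple proximity-based outlier score can be computed as follows (a schematic version; the array layout and function names are our own):

```python
import numpy as np

def proximity_matrix(leaf_ids):
    """Prox(i, j): fraction of trees in which objects i and j land in the
    same terminal leaf; leaf_ids[t, i] is the leaf reached by object i
    in tree t."""
    n_trees, n_obj = leaf_ids.shape
    prox = np.zeros((n_obj, n_obj))
    for t in range(n_trees):
        # pairwise comparison of leaf assignments for tree t
        prox += leaf_ids[t][:, None] == leaf_ids[t][None, :]
    return prox / n_trees

def outlier_scores(prox):
    """Squared sum of each object's proximities to all others; objects
    with the lowest scores have few similar neighbours and can be
    flagged as potential outliers."""
    off_diag = prox.copy()
    np.fill_diagonal(off_diag, 0.0)
    return (off_diag ** 2).sum(axis=1)
```

The full TPZ proximity matrix is built by running objects down actual trees; only the accumulation and normalization steps are shown here.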
For example, by computing the squared sum of all proximities for each galaxy, we can algorithmically identify galaxies with few neighbors by selecting the sources with the lowest values, which can be flagged for further inspection.

To build or apply prediction trees, the data cannot have missing values for any of the attributes used to construct the trees (e.g., the most important survey magnitudes). To include more data in the classification process, we can use the proximity matrix to estimate any missing values or to replace highly uncertain values. We do this in an iterative process, by performing the forest growing step of the algorithm and replacing the missing attribute at each pass. We select the replacement value by computing the average parameter value from the k nearest galaxies; we can also inversely weight these galaxies by their respective distances. This process continues until we have obtained convergence or until a fixed number of iterations have been performed.

By using the proximity matrix, OOB error estimates, and the relative importance of different attributes, we can also identify zones where the photo-z prediction is either poor or is loosely constrained by the training data. In either case, this knowledge is of vital importance when deciding which galaxies to target spectroscopically in order to optimally improve a training sample. One way this feature is implemented is by using the two most important attributes to map the areas of parameter space by their prediction error. This map can guide the identification of new data that increases the efficacy of the training sample by targeting those galaxies that minimize the prediction error in undersampled areas, thereby more effectively utilizing limited spectroscopic follow-up observations.

Two previous works have utilized prediction trees for photo-z calculations. Carliles et al.
(2008, 2010) predicted photometric redshifts and their errors by using a random forest built with the regression tree package for R, using the mean value as their leaf model. They used a subset of the main galaxy sample from the SDSS Data Release 6 (Adelman-McCarthy et al. 2008) catalog with colors as their attributes. They demonstrated that random forest methods are well suited to the photo-z estimation problem, as they obtained results comparable to other machine learning methods, and they publicly released their R scripts. They did not, however, take full advantage of the ancillary information provided by the random forest technique, nor did they produce probability density functions.

Gerdes et al. (2010) have developed a new technique, called ArborZ, to compute photometric redshifts using boosted decision trees (BDT). These classification trees are constructed in a similar manner to our classification trees, as discussed in §, and their results were compared against the photo-z estimates provided by Oyaizu et al. (2008b) in the case of the SDSS data, and by ANNz (Collister & Lahav 2004) for the DES.

Our approach, detailed below, extends these previous results to create a new publicly available method that uses random forests to compute PDFs by using classification and/or regression trees. Our approach also uses the extra information encoded within the measurement errors, generates extra, ancillary information describing the spectroscopic training sample, and provides better control of the uncertainties. We also, therefore, are able to examine the importance of the attributes used to grow the trees, and to identify areas in the attribute space where the training data are dominated by shot noise statistics.

Our implementation of prediction trees with random forests for photometric redshift PDF prediction, TPZ, is written in the Python programming language and uses MPI for parallel communication to run efficiently on distributed memory systems. As shown in Figure 3, our implementation is divided into three steps:

Data Pre-processing
The first step prepares the data for the construction of the prediction trees. First, we optionally perform a principal component analysis (PCA) of the data in order to reduce strong correlations between attributes. This PCA transformation can reduce the dimensionality of the input data prior to the training, which can be important for large data sets with many attributes. This step also includes the replacement of missing values, which we do iteratively, finding that 5-10 iterations lead to convergence on the missing values. We next generate N_R training samples by perturbing the measured values according to the error on each variable, which we assume to be normally distributed. In this manner, we incorporate the measurement errors into the prediction tree construction, reduce the bias on the proximity matrices, and introduce randomness into the construction of the trees in a controlled manner.

Random Forest Construction
The second step is the actual construction of the random forest, where we generate fully grown prediction trees. We construct N_T trees by using bootstrapping for each perturbed sample in the set of N_R training samples created in the first step. This step can be done several times with a smaller number of trees to both explore the parameter space and gain insight into the internal structure of the data prior to building the final prediction trees. Finally, this step can also produce the ancillary information that characterizes the performance of our code prior to estimating the final photo-z values.

Photo-z PDF Construction
The final step uses the newly generated prediction trees to create individual photo-z PDFs for each source in the application data set. This process involves running each source down each tree, testing the source at each node until we arrive at a terminal leaf, where we make a prediction. At the end, we combine all of the forest predictions into a probability density function.
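The core of this step, walking one source down a tree to its terminal leaf, can be sketched as follows (an illustrative node layout chosen for this sketch; TPZ's internal tree structure will differ):

```python
def predict_leaf(node, galaxy):
    """Descend a prediction tree: at each internal node, compare the
    galaxy's value of the split attribute against the node's cut and
    branch left or right until a terminal leaf is reached."""
    # internal nodes are dicts, terminal leaves are lists of redshifts
    while isinstance(node, dict):
        branch = 'left' if galaxy[node['attr']] <= node['cut'] else 'right'
        node = node[branch]
    return node  # the redshift values stored in the terminal leaf
```

Repeating this descent over every tree in the forest and pooling the returned leaf contents yields the raw material for the PDF.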
TPZ can use either type of prediction tree within its random forests, classification or regression; the actual implementation details only differ after the first step.
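That shared first step, generating the N_R perturbed training samples, can be sketched as (a minimal illustration assuming, as in the text, normally distributed measurement errors):

```python
import numpy as np

def perturbed_samples(attrs, errors, n_r, seed=42):
    """Draw n_r perturbed copies of the training attributes, each value
    redrawn from a normal distribution centred on the measurement with
    the quoted error as its standard deviation."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(n_r,) + attrs.shape)
    return attrs[None, :, :] + noise * errors[None, :, :]
```

Each of the n_r slices then seeds its own set of bootstrapped trees, so the measurement uncertainties propagate into the forest in a controlled way.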
Figure 3. A simplified representation of TPZ; the details of each subprocess are described more fully within the text. Note that each tree drawn in the figure represents a full random forest, with N_T bootstrap samples for every one of the N_R randomly perturbed samples. The big circles containing galaxies represent a terminal leaf, which is used directly to make a prediction for each new galaxy.

Classification Mode:

In this mode, the spectroscopic sample is divided into several redshift bins that either have a fixed width (or, equivalently, a fixed resolution), which allows a variable number of galaxies within each redshift bin, or have a fixed number of galaxies per redshift bin, which means our redshift bins are of variable width. Within each bin, we create a forest of classification trees, as described above, using the perturbed samples as well as the bootstrap samples. These trees classify an object as either lying inside or outside a bin. If we were to use all of the training data within each bin, we would both decrease the overall performance of our implementation, due to the larger data volume, and increase the chance of catastrophic errors, since most of the data would lie outside the bin of interest.

We address these issues by following an approach similar to that used by Gerdes et al. (2010). For each bin, we identify all sources that lie inside the bin; the number of galaxies with class inside is n_in. We next select f_out × n_in galaxies, assigned the class outside, whose spectroscopic redshifts lie outside the bin by at least z_out times the bin width δz; that is, galaxies with class outside fall at least z_out × δz from the boundaries of the bin. This allows a better distinction between the class inside and the class outside than we would have if we included objects located very near to these boundaries.
In the end, each bin will have (1 + f_out) n_in galaxies available for training the forest. If the training set is limited, wider bins can be used in order to have a sufficient number of training galaxies per bin. Furthermore, these bins can even be allowed to overlap by some amount; this overlap can be taken into account when building the photo-z PDFs by normalizing by the fraction of wider bins that overlap with each other. After all of the forests are created for all of the bins, the test data are run down each tree in each forest, which assigns either the class inside or outside to the test source. After combining all of the assigned classes from the forest, we assign a probability for the source to belong to that redshift bin, which is simply the number of times the source was assigned the inside class divided by the total number of trees. By repeating this process for each bin and renormalizing the subsequent result, we generate a photo-z PDF for the source.
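The per-bin probability and the final renormalization can be sketched as follows (a minimal illustration; `inside_votes` holds, for each redshift bin, the number of trees that assigned the class inside):

```python
import numpy as np

def classification_pdf(inside_votes, n_trees):
    """Probability per redshift bin: the number of trees voting `inside`
    divided by the number of trees in that bin's forest, renormalized so
    the resulting photo-z PDF sums to one."""
    p = np.asarray(inside_votes, dtype=float) / n_trees
    return p / p.sum()
```

For example, with forests of 100 trees per bin, vote counts of 50, 25, and 25 across three bins yield bin probabilities of 0.5, 0.25, and 0.25.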
Regression Mode:
In this mode, we use all available training data to fully grow each tree. For each perturbed sample, N_T trees are created using the methodology explained in §. To compute a photo-z, the test data are run down each tree in the forest. Each tree returns a set of spectroscopic redshift measurements that, after conversion to a given resolution, are converted into a PDF by normalizing by the total number of objects returned. All trees have the same weight when constructing the PDF, as do the values of the terminal leaves identified in each tree. If a single value is desired, a mean value and its error can be returned via the standard methods by aggregating all of the relevant values returned by the different trees.

The choice between these modes will depend on the characteristics of the data being analyzed. On average, the regression mode runs faster than the classification mode for a given accuracy, and it is also better suited for data that are not uniformly distributed. The classification mode, on the other hand, provides a better characterization of the data as a function of redshift, since it creates its own random forest in each bin, unlike the regression mode, where a single forest is created using the full range in redshift. The classification mode is also better suited for uniformly distributed data and can provide reliable and robust prior probabilities in a Bayesian framework when using wider redshift bins. When faced with a high quality and rich training set, both modes will provide similar accuracies and error rates, but the regression mode, being faster, would generally be preferred.

Figure 3 shows a simplified workflow of our TPZ implementation. Each tree in this figure represents an entire forest, where the single tree results are averaged to obtain a final prediction. The classification mode predicts the probability that a source lies within each bin, thereby building up a photo-z PDF, while the regression mode keeps all sources found on a terminal leaf and combines their values to construct a photo-z PDF at the desired resolution. For both modes, ancillary information can be provided, and both modes share the same data pre-processing steps.
To demonstrate the capabilities and the efficacy of the different parameter configurations of the TPZ code, we have used several different photometric and spectroscopic data sets that vary both in quantity and quality. In this section, we briefly discuss these data sets and the specific data samples from each that we used in the testing process, which is further described in §.

The Sloan Digital Sky Survey (SDSS; York et al. 2000) conducted, in its phase I and phase II, a photometric survey in the optical bands u, g, r, i, and z that covered almost 10,000 square degrees, or approximately one-fourth of the entire sky. The resultant photometric catalog contains photometry of over 10^8 galaxies, making the SDSS one of the largest surveys ever performed. The SDSS also conducted a spectroscopic survey of targets selected from the SDSS photometric catalog, obtaining spectra of about 10^6 low redshift galaxies. In this paper, we use a subset of the Main Galaxy Sample (MGS; Strauss et al. 2002) from the Data Release 7 catalog (Abazajian et al. 2009). Specifically, we selected 55,000 galaxies by using the online CasJobs website (http://casjobs.sdss.org/CasJobs/). This spectroscopic data set ranges from z ≈ 0.02 up to z ≈ 0.3, and as attributes we use the four colors u−g, g−r, r−i, and i−z. We use the SDSS colors, as opposed to the more commonly used magnitudes, for this particular test both to demonstrate the flexibility of TPZ and to generate scientifically more interesting ancillary information.

The PHoto-z Accuracy Testing (PHAT; Hildebrandt et al. 2010) project first compared the performance and systematics of different photo-z codes on synthetic data (PHAT0) that was specifically created for a contest, and more recently did so with real data (PHAT1) in a similar manner, thereby providing a more realistic comparison based on real measurements. The PHAT project provides filter responses for photo-z estimation by SED-fitting methods and a training data set for photo-z estimation by empirical methods. The true redshifts of the test data are not public, which provides a more reliable, blind comparison between different approaches (see Hildebrandt et al. 2010, for more details about the contest). In this paper, we use the PHAT1 data, which consist of real observations selected from the Great Observatories Origins Deep Survey Northern field (GOODS-N; Giavalisco et al. 2004). These data include photometry from the original ACS four-band data, F435W (B), F606W (V+R), F775W (i'), and F850LP (z'), which have been cross-matched with photometry from Capak et al. (2004), including U (from KPNO), B_J, V_J, R_C, I_C, and z' (from SUBARU), and HK' (from QUIRC). In addition, the PHAT1 photometry includes deep J and H bands (from ULBCAM; Wang et al. 2006), K_S (from WIRC; Bundy et al. 2005), and the four Spitzer IRAC bands at 3.6, 4.5, 5.8, and 8.0 µm. This photometric catalog was cross-matched with all available spectroscopic GOODS-N data (Cowie et al. 2004; Wirth et al. 2004; Treu et al. 2005; Reddy et al. 2006), producing a final data set of eighteen-band photometry and spectroscopy for 1,984 galaxies.

For the contest, only 515 galaxy redshifts were published for use as training data; the remaining redshifts were unpublished and used internally by the PHAT project to conduct a blind comparison test. Despite the limited training data, multiple authors submitted photo-z predictions, and the results were published in Hildebrandt et al. (2010). As the contest had already been completed when we started this work, we were unable to participate; however, as discussed in §, we have applied TPZ to the PHAT1 data in the same blind manner.

The Deep Extragalactic Evolutionary Probe (DEEP) is a multi-phase, deep spectroscopic survey performed with the Keck telescopes. Phase I used the LRIS instrument (Oke et al. 1995), while phase II used the DEIMOS spectrograph (Faber et al. 2003). The DEEP2 Galaxy Redshift Survey is a magnitude limited spectroscopic survey of objects with R_AB < 24.1, observed in the B, R, and I bands, and it has been recently extended
by cross-matching the data to other photometry databases. In this work, we use the Data Release 4 (Matthews et al. 2013), the latest DEEP2 release, which includes secure and accurate spectroscopy for over 38,000 sources. The photometry for the sources in this catalog was expanded by using two u, g, r, i, and z surveys: the Canada-France-Hawaii Telescope Legacy Survey (CFHTLS; Gwyn 2012) and the SDSS. For additional details about the photometric extension of the DEEP2 catalog, see Matthews et al. (2013).

To use the DEEP2 data with TPZ, we selected sources with secure redshifts (ZQUALITY ≥ 3) that were securely classified as galaxies, have no bad flags, and have full photometry. Even though the filter responses are similar, the u, g, r, i, and z photometry come from two different surveys and are thus not identical. We therefore treat those galaxies with SDSS photometry in fields 2, 3, and 4 of the DEEP2 target areas independently from those in field 1 with CFHTLS photometry. In the end, this leaves us with a total of 19,699 galaxies with eight-band photometry and redshifts, of which we use 10,000 for training and hold the rest for testing.

Figure 4. Photometric vs. spectroscopic redshift for all SDSS MGS test galaxies using regression mode (Left) and classification mode (Center). (Right) A comparison of the bias (upper panel) and the scatter (lower panel) as a function of redshift for the SDSS MGS data by using the regression mode (blue dots) and the classification mode (green squares).

In this section, we apply the photo-z estimation technique presented in § to the data sets described in §3. Since the point of this paper is to introduce the TPZ algorithm and our associated implementation, we use these three different data sets to highlight different features of the code. Thus we do not apply TPZ uniformly to each data set, and the three subsections herein are necessarily different.
We first apply TPZ to the SDSS main galaxy sample, using both the regression and the classification methods as explained in §.

Table 1. A comparison between the regression mode and the classification mode for the SDSS MGS galaxies with different confidence level restrictions.

Implementation        <∆z> [10^−…]   σ_∆z [10^−…]   Fraction^a
Reg All                   −0.08          2.25         100%
Class All                  2.18          2.46         100%
Reg zConf > …             −0.20          2.18         98.2%
Class zConf > …           −0.15          2.34         94.2%
Reg zConf > 0.75          −0.33          1.97         91.0%
Class zConf > 0.75        −0.80          2.20         73.5%
Reg zConf > …             −0.23          1.76         67.3%
Class zConf > …           −0.92          1.82         34.7%

^a Fraction of galaxies remaining after a cut on zConf.

The left and central panels of Figure 4 show the photometric versus spectroscopic redshifts for the galaxies held out for testing from the SDSS MGS, for the regression and classification modes, respectively. Both implementations show similar performance in the central part of the redshift distribution; however, there are differences at both the low and high redshift ends of this sample. The right panel shows both the mean of the bias, defined as ∆z = z_spec − z_phot, and its scatter in eight redshift bins. The regression mode performs slightly better in all redshift bins, especially in the first and last bins, where the classification mode shows systematic errors. This error arises from the lack of training data at those redshifts for the classification mode: although we allow some overlap between bins, we keep the bin size constant, which can result in large differences in the number of training objects per bin. This reduction is most pronounced in the lowest and highest redshift bins, which results in a lower accuracy and a higher scatter. At the low redshift end we are also affected by the fact that a predicted redshift cannot be negative, thus introducing a positive skew into the predicted redshift values at very low redshifts.

Since both implementation modes produce photo-z PDFs, we can compute confidence levels, zConf, around the mean (or mode) of each individual PDF. To simplify
comparisons with past results, we define zConf as the integrated probability between z_phot ± σ_TPZ (1 + z_phot). We select σ_TPZ = 0.03 as an approximation to the intrinsic scatter of the algorithm when applied to the data, which can be computed by using the OOB data. Of course, we could define zConf in some other manner, but the results would be relatively unaffected. Figure 5 presents four different PDFs taken from the SDSS MGS, each with a different confidence level, shown as a bounded gray area under each PDF curve. In this example, we measured zConf around the mean of each PDF, and the actual spectroscopic redshifts are shown as vertical dashed lines for reference. From this figure, we see that zConf provides a reasonable summary of the concentration of the PDF and can, therefore, be used to further restrict a photo-z sample by selecting only those PDFs with a zConf value above some threshold. In general, as shown in this figure, lower confidence values are strongly correlated with less accurate predictions. Nevertheless, it is still possible to have a small fraction of galaxies with high-zConf PDFs that are estimated at the wrong redshift. We discuss the zConf parameter and its use in identifying a clean galaxy sample in further detail in §.

Figure 5. Four example PDFs produced by TPZ for the SDSS MGS, selected with different values of zConf. The higher the value of zConf, the more narrowly concentrated the PDF is about the mean. The vertical dashed line corresponds to the spectroscopic value of each test galaxy, and the gray area encloses the confidence level.

Table 1 compares the two modes for increasingly restrictive cuts on zConf. As before, we see that, on average, the regression mode outperforms the classification mode on this data set, although the difference is reduced when we apply a cut on the confidence level. Interestingly, at more restrictive zConf cuts, the performance of both modes is similar; however, the number of galaxies remaining in the regression mode sample is higher. Note that since these are averaged values over the sample, any minor change implies a significant change in individual calculations. As a result, we believe that making a cut on zConf results in a cleaner sample, as shown by the improved performance metrics for either implementation mode.
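On a discrete redshift grid, the zConf definition used above can be sketched as (a minimal illustration of the definition, not the TPZ code):

```python
import numpy as np

def z_conf(z_grid, pdf, z_phot, sigma_tpz=0.03):
    """Integrated probability of the PDF within
    z_phot +/- sigma_tpz * (1 + z_phot); sigma_tpz approximates the
    intrinsic scatter of the algorithm (0.03 for the SDSS MGS test)."""
    half_width = sigma_tpz * (1.0 + z_phot)
    inside = np.abs(z_grid - z_phot) <= half_width
    return pdf[inside].sum() / pdf.sum()
```

A narrowly concentrated PDF returns a value near one, while a broad or multi-modal PDF returns a low value and can be removed by a threshold cut.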
The difference in the fraction of galaxies that remain in each sample indicates that, on average, PDFs generated by the classification implementation are broader than PDFs generated by the regression implementation. This result is reasonable, as the classification mode bins the redshift space and provides probabilities for all bins, which can produce a more sparse distribution. In the classification mode the probabilities are computed individually for each redshift bin, which could be important and easily extended to build a prior distribution for use in a Bayesian method. Since the regression mode was shown to be more accurate for the SDSS (see, e.g., Figure 4 and Table 1), we use the mean of the PDF as calculated by the regression mode on the SDSS MGS data in the rest of this section, unless otherwise indicated.

We can broadly compare our use of zConf to define clean galaxy samples with other published results; we note that a direct, one-to-one comparison is problematic due to the different training sets and attributes used in computing photometric redshifts for the SDSS main galaxy sample. If we take zConf ≥ 0.75, we keep 91% of the data and compute the fractions of galaxies with |∆z| < z_i, where z_i = 0.001, 0.002, and 0.003, as 45.2%, 73.0%, and 89.8%, respectively. These values compare favorably to those of Laurino et al. (2011) who, even though they used an extended catalog, compute these same values to be 43.4%, 72.4%, and 86.9%, with a mean bias of <∆z> = 15 × 10^− and σ_∆z = 1. × 10^− (these latter values can be compared with our results shown in Table 1). Finally, we note that making a strict cut of ∆z > 0.006 identifies an outlier fraction of 1.54%, while other groups, also using extended catalogs, have reported values of 1.9% (Gerdes et al. 2010) and 2.6% (Oyaizu et al. 2008b).
As detailed in §, for each tree we use the OOB data to compute the mean bias (i.e., <∆z>) and its standard deviation (i.e., σ_∆z); finally, we average these metrics over all trees. The top panel of Figure 6 presents the mean bias as a function of redshift taken both from the test data (blue line) and from the OOB data used during the training process (green line), while the bottom panel presents the standard deviation in each redshift bin. The RMS of these values provides an approximation to the intrinsic error and scatter of the TPZ code, which can be used to compute confidence levels. From the OOB data, we compute the RMS of the bias to be 0.….

Figure 6. (Top) The averaged ∆z as a function of redshift for all test galaxies from the SDSS MGS (blue circles) and from the OOB (Out-Of-Bag) data computed individually for each tree and subsequently averaged over the forest (green squares). (Bottom) The standard deviation of ∆z as a function of redshift for the test set (blue circles) and the OOB data (green squares). In this case, the OOB data provide an unbiased upper limit for these metrics.

The OOB data were not used to train a particular tree, yet the full data are used when building the forest through the bootstrap samples. If we had instead run all of the training data through the forest after it was constructed, without using the OOB approach, we would have obtained biased (although lower) values for these metrics; such an approach would thus not provide a prior estimate of the accuracy of TPZ.
With the OOB data, we compute these unbiased estimates a priori, exclusively from the training set, without the need for a validation set, allowing us to take full advantage of all available spectroscopic data.

The OOB data can also be used to compute the relative importance of each attribute, which is done by permuting each of the attributes in the OOB data when training the tree. The result of this process can be directly compared with the unperturbed case using the OOB data, as shown in Figure 7. In this figure, the left panel shows the relative importance factor, computed by using the absolute value of the OOB bias as the comparison metric, for the four colors used to build the regression trees for the MGS sample. In this plot, a factor of one implies that the attribute acts as a random variable, since a perturbation along that direction produces no change; any value greater than one produces a change in the bias, making it larger and therefore less accurate.

From this figure, we see that the g−r color shows the largest relative importance factor, close to four, meaning that the absolute bias, on average, changes by this factor when this color is randomly perturbed. On the other hand, the i−z color is, on average, the least relevant attribute in this context, with a relative importance factor less than 1.5. Due to the limited number of attributes in this test, however, removing this last color actually produces slightly worse results; in the general case, when more attributes are present, removing less important variables will improve the results. While this result might seem counter-intuitive, it arises naturally from the random nature of the tree construction: since only m attributes (e.g., three) are randomly selected to decide the split dimension, an attribute with overall low importance can occasionally be selected to split a node.
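This permutation test can be sketched as follows (a minimal illustration with a generic `predict` function standing in for a grown tree; in TPZ the calculation is done per tree on its own OOB data):

```python
import numpy as np

def importance_factors(predict, attrs, z_spec, seed=0):
    """Relative importance I_A per attribute: the mean absolute bias
    after randomly permuting that attribute alone, divided by the
    unperturbed mean absolute bias.  I_A near one means the attribute
    behaves like a random variable; larger values mean it carries
    more information."""
    rng = np.random.default_rng(seed)
    base = np.mean(np.abs(z_spec - predict(attrs)))
    factors = []
    for j in range(attrs.shape[1]):
        shuffled = attrs.copy()
        rng.shuffle(shuffled[:, j])  # permute one attribute, keep the rest
        factors.append(np.mean(np.abs(z_spec - predict(shuffled))) / base)
    return np.array(factors)
```

Permuting an attribute the predictor never uses leaves the bias unchanged (a factor of exactly one), while permuting an informative attribute inflates it.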
By omitting attributes with lower importance, we force the trees to be built from attributes with greater information content, thereby improving the accuracy of the prediction.

Another interesting point is that the relative importance of both of the aforementioned colors remains consistent, independent of redshift, while the other two colors show variation (i.e., u−g and r−i exchange importance ratings more than once), although they are overall consistent with each other. This behavior is mainly due to important spectral features, such as the 4000 Å break, passing between different filters, which TPZ identifies algorithmically as important indicators of a galaxy's redshift. We see this result from another perspective in the central panel of Figure 7, which presents the RMS of the relative importance, sorted by rank, for the four colors computed by using the absolute bias (blue line) and the variance (red line). Both metrics rank the attributes in the same order, and either can be used to compute their importance for the data set. Perturbing the attributes produces a stronger effect on the absolute bias than on the scatter, mainly because when we perturb one dimension we lose information and thereby increase the likelihood that a galaxy will end up in a random branch of the tree, especially for an important attribute. This would likely lead to a misclassification, which directly affects the mean absolute bias.

Relative Importance
The importance rank can also be used to better understand the training data, to check whether it is possible to reduce the dimensionality of the problem, and to identify areas of the mapped parameter space where new training data can be most effectively incorporated. This latter point can be accomplished by identifying the leaf nodes, and the galaxies contained therein, for each tree and computing their accuracy in predicting the OOB data, along with their proximity matrices; by averaging these results over all trees, we obtain the desired map.

For example, by using the two most important attributes previously identified for the SDSS MGS (g−r and u−g), we present a heat map in the right panel of Figure 7 that encodes the binned performance of these two attributes, where higher values indicate lower predictive success in that bin. In this plot, we see there are a few bins where performance is markedly lower (blue and light blue squares) and several areas that are lower than average (the yellow bins). On the other hand, there are two areas where the predictive power is quite high (deep orange-red), which are likely the result of the known color bi-modality of SDSS galaxies (Strateva et al. 2001), where early-type galaxies lie in the upper right part of this plot and late-type galaxies lie in the bottom left. The areas in this heat map where the predictive performance is low can be caused by a lack of training data, by galaxies with color degeneracies, or by galaxies with higher than normal magnitude errors. As a result, these areas can be prioritized for follow-up observations to improve the performance of the photo-z estimation.

Figure 7. (Left): The attribute importance factor I_A as a function of redshift for the four attributes (SDSS colors) used in this analysis, for the bias only. This factor quantifies how much the metric changes as we permute the attributes one at a time. (Central): RMS of the relative importance factor as a function of the attributes, computed by using the bias (blue) and the scatter (red). (Right): A heat map constructed by using the two most important attributes, which indicates areas of parameter space where the photo-z prediction is poor. The higher the value (i.e., the bluer) in a region, the more training data are needed to increase the accuracy of photo-z estimation within that region. These zones might also contain outliers or galaxies with bad photometry.

Identifying new training data
Previously, we had stated that the relative importance of the different attributes, graphically shown in the heat map in the right panel of Figure 7, could be used to optimally identify new training data. We test this assumption by first randomly selecting 1,000 galaxies as our training set, in order to simulate a poor training set, so that we can quantify the effects of both randomly adding new data and selectively adding new data by using the relative importance. We perform this test by first adding 1,000 new galaxies and then by adding 2,000 galaxies, computing the mean normalized bias, defined as ∆z' = (z_spec − z_phot)/(1 + z_spec), and its standard deviation as we change the size of the training set, using the four color attributes from the SDSS MGS and a forest with 100 prediction trees.

We summarize these test results in Table 2. As shown in the table, selecting galaxies from those zones with lower accuracy, as indicated by the heat map, produces more accurate predictions than adding galaxies randomly. In fact, even adding 1,000 galaxies by using the heat map produces slightly better performance than adding 2,000 galaxies randomly. These results indicate that it is more important to selectively add galaxies in areas where the prediction is poor than to simply increase the size of the training set.

We continue this process by repeatedly adding either 1,000 or 2,000 new galaxies to the training set. As the bottom panel of Figure 9 for the SDSS MGS demonstrates, after about 5,000 galaxies (or half the size of our full training set), the performance metric shows little variation, which is also reflected in the last row of Table 2, where the metrics for the 15,000 galaxy training set are presented for comparison. This test demonstrates how current and future photometric surveys can optimally construct training sets by either selectively using existing observations or by obtaining new spectroscopic observations to improve the photo-z estimation.

Table 2. A comparison of the performance of TPZ for the SDSS MGS when extra data are added to the training set either randomly or by selectively using ancillary information (columns: number of training galaxies, <∆z'>, σ_∆z').

Error distribution
After applying TPZ to the SDSS MGS, we can estimate photo-z errors directly from the estimated PDF by computing either the mean, the mode, or some other statistic of this distribution. As a demonstration, we calculate the error σ as the region of the photo-z PDF, centered on the mean, that contains 68% of the cumulative probability. We next calculate the distribution of these standard errors by computing (z_phot − z_spec)/σ for each PDF, which is shown as the black points in Figure 8. For unbiased standard error estimates, this distribution should be normally distributed with zero mean and unit variance. When we fit our measured points, we obtain a Gaussian with mean equal to 0.112 and width 0.949, which is shown by the solid green curve. This simple error estimate is quite close to the unbiased expectation, as we would expect for any reliable technique. The fit is not a perfect Gaussian due to a slightly extended tail on the left hand side of the distribution. We interpret this as a manifestation of the very narrow PDFs we have obtained and of the fact that the SDSS MGS is concentrated at lower redshifts, where most photo-z techniques suffer from a small tendency to over-predict the photometric redshifts, as shown in the left panel of Figure 4.
The photometric standardized error, ( z phot − z spec ) /σ , for the MGS galaxies (black dots) using the meanof each individual PDF and the best fit Gaussian with µ = 0 . σ = 0 .
949 (solid green curve).
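As a concrete sketch of this error estimate, the snippet below (illustrative only; the function name, grid, and toy PDF are our own, not TPZ's API) widens a window centered on the PDF mean until it encloses 68% of the probability:

```python
import numpy as np

def pdf_sigma_68(z_grid, pdf, frac=0.68):
    """Half-width of the window centered on the PDF mean that encloses
    `frac` of the probability -- the simple error estimate used above."""
    dz = z_grid[1] - z_grid[0]           # uniform grid assumed
    pdf = pdf / (pdf.sum() * dz)         # normalize to unit area
    mean = (z_grid * pdf).sum() * dz
    for half in np.arange(dz, z_grid[-1] - z_grid[0], dz):
        mask = np.abs(z_grid - mean) <= half
        if (pdf[mask].sum() * dz) >= frac:
            return half
    return np.nan

# toy check: for a Gaussian PDF the 68% half-width is ~ its own sigma
z = np.linspace(0.0, 2.0, 2001)
pdf = np.exp(-0.5 * ((z - 0.5) / 0.05) ** 2)
sigma = pdf_sigma_68(z, pdf)
pull = (0.52 - 0.50) / sigma   # (z_phot - z_spec) / sigma for one galaxy
```

Applying this to every galaxy yields the sample of standardized errors, (z_phot − z_spec)/σ, that is histogrammed and fit with a Gaussian in Figure 8.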
Size of forest
When we construct a forest for prediction, one parameter that must be specified is the number of trees to build. This is important, as the more trees in the forest, the higher the computational demands, which slows the training process and the construction of photo-z PDFs. Thus, we test the performance of TPZ for the SDSS MGS by varying the number of trees built for our forest for a fixed-size training sample. As before, we compute the mean of the absolute bias and its standard deviation, and present how these quantities vary as the number of trees in our forest changes for a fixed training size of 10,000 galaxies.

These results are presented in the top panel of Figure 9, which shows that our algorithm does become more accurate as the number of trees increases. However, after around 100 trees, the predictive power of the forest shows little variation, indicating that this is a reasonable number of trees for this prediction process. Breiman (2001) demonstrated that, as the number of trees in a random forest increases, any margin function will converge to a limit value. Thus, as expected, we see our generalized error value converging. This implies that, in comparison to other machine learning methods, our technique does not over-fit the data as more trees are added.
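TPZ's own training loop is not reproduced here; as an illustrative analogue (scikit-learn's RandomForestRegressor on synthetic colors, both our own choices rather than the TPZ implementation), the convergence of the scatter with forest size can be checked like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# synthetic stand-in for the SDSS colors: 4 attributes with a smooth
# (noisy) mapping to a mock "spectroscopic redshift"
n = 3000
colors = rng.uniform(0.0, 2.0, size=(n, 4))
z_spec = (0.1 + 0.3 * colors[:, 0] + 0.1 * np.sin(3.0 * colors[:, 1])
          + 0.02 * rng.normal(size=n))
train, test = slice(0, 2000), slice(2000, None)

scatter = {}
for n_trees in (5, 100):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(colors[train], z_spec[train])
    dz = (z_spec[test] - rf.predict(colors[test])) / (1.0 + z_spec[test])
    scatter[n_trees] = dz.std()   # normalized-bias scatter, as in Figure 9
```

Plotting the scatter over a denser grid of forest sizes reproduces the qualitative behavior of the top panel of Figure 9: rapid improvement at first, then a plateau.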
Training size
Once we know the optimal number of trees that must be built for our forest, we next need to know the optimal size of our training set. By using 100 trees (as determined in the previous section), we vary the size of our training set and present the results in the bottom panel of Figure 9. As shown in this figure, the accuracy of TPZ for predicting photo-z does not change significantly after using around 70% of the galaxies. This is an interesting result, one that our approach quantifies in an elegant manner, but which will obviously vary between different data sets. Fundamentally, as the training set increases in size, the prediction accuracy also increases, until most of the multi-dimensional parameter space has been sampled and little extra information is added by new training galaxies.

Figure 9. The absolute mean normalized bias, defined as |∆z′| = |z_spec − z_phot|/(1 + z_spec), and its scatter as a function of the number of trees in the forest, keeping the training set fixed (top). The same two values as a function of the size of the training set, keeping the number of trees fixed at 100, for galaxies in the SDSS MGS (bottom).

Of course, in this test we have not used the relative importance of our parameter attributes, as shown, for example, in the central panel of Figure 7. By manually selecting additional data, we should be able to reduce the values of these metrics significantly, which is discussed in the next section. But even in our current approach, we expect that some of our test data are not well represented in our training set, which will limit the accuracy of this approach. We see this as an opportunity, however, as we can compute a cross-data proximity matrix by using the trained forest to identify galaxies within the test data that are isolated, with few neighbors in the parameter space. Once identified, these galaxies could be treated individually by using, for example, other photo-z estimation techniques (see, e.g., Carrasco Kind & Brunner 2013, in preparation).

PHAT1

We also tested TPZ on the PHAT1 dataset, a blind comparison of photo-z techniques. In this contest, only a limited quantity of training data are provided; we have approximately 500 galaxies to train our algorithm for the approximately 1,500 galaxies that form the validation sample. These data also have a sparse redshift distribution, extending up to z ∼ 6. Despite these limitations, we applied TPZ to this training data, submitted our results to the contest, and obtained the resulting performance metrics from the PHAT leader (H. Hildebrandt, private communication). We present our specific results in Table 3, which can be compared directly with the results shown in Table 5 of the PHAT paper (Hildebrandt et al. 2010).
Table 3. The TPZ results for the PHAT1 catalogue, both with and without the IRAC bands, and for all galaxies and for a magnitude-limited sample with R < 24. Note that these are the same statistics presented in Table 5 of Hildebrandt et al. (2010) for other photo-z estimation techniques.

Run                    bias^a    scatter^b    outlier rate^c
18 bands               −0.002    0.055        ≈14%
14 bands               −0.007    0.055        ≈12%
18 bands, R < 24       −0.004    0.055        ≈11%
14 bands, R < 24       −0.009    0.054        ≈9%

^a bias is defined as ∆z′ = (z_spec − z_phot)/(1 + z_spec)
^b RMS of the bias ∆z′
^c outliers are defined as objects with |∆z′| > 0.15

We computed validation results for four different photometric samples: by using all eighteen photometric bands, by omitting the Spitzer photometry and using only fourteen photometric bands, and by creating magnitude-limited (R < 24) versions of each of these two galaxy samples. For these validation runs, we use the regression mode to create a forest of 150 trees with m∗ = 4 (as described earlier), and we did not use the zConf parameter, so that we could more directly compare our results to the other competitors. In the end, the TPZ results are among the most accurate photo-z predictions, especially when compared to other empirical training codes. Interestingly enough, TPZ even outperforms some template photo-z techniques, which are supposedly better suited for this particular challenge due to the dearth of training data and the large redshift range covered by the validation sample. These results show that even in less than ideal conditions, TPZ provides a robust estimation of photometric redshifts. Note that, due to the lack of training data and the extended redshift distribution of the validation sample, we did not generate ancillary information for these data by using the OOB approach.

DEEP2
We have also tested TPZ by using the DEEP2 redshift survey data, which extend to much higher redshifts than the SDSS MGS. We follow a similar analysis to the one we used with the SDSS MGS, and after we compute the photo-z PDFs, we select only those galaxies with zConf > 0.7, which includes about 81% of the galaxies. We find that the average bias, using ∆z′ = (z_spec − z_phot)/(1 + z_spec), is −0.007 with σ_∆z′ = 0.059, and an outlier rate, defined as |∆z′| > 0.15, of 2.9%. We know of no previous photo-z analyses of these data. In Figure 10, we compare the photo-z computed by using the mean of each individual PDF with the spectroscopic redshift for the 7,856 galaxies. In this figure, we also show the median, shown by the black dots, and the tenth and ninetieth percentiles, shown by the black error bars, within spectroscopic bins.

Figure 10. The TPZ photo-z (with zConf > 0.7) versus the spectroscopic redshift for the DEEP2 galaxies. The black dots show the median z_phot, and the error bars correspond to the tenth and ninetieth percentiles within a given spectroscopic bin.

The agreement is good across all redshifts, and both the isodensity contours and the error bars indicate that there are few outliers or catastrophic photo-z. However, at both ends of the distribution, we see several bins where the photo-z results are less accurate: they are systematically higher for the first two bins and systematically lower for the last two bins. This effect is often seen with empirical techniques, as the spectroscopic training samples are often less complete at these redshifts; see, e.g., the redshift distribution in Figure 15. Another effect causing this skewness is that estimated photometric redshifts cannot be negative, so our probability distribution cannot be symmetrical at the low redshift end. Another possible explanation for the low redshift systematic is the effect of galaxy inclination and the induced extinction on photo-z prediction, as shown recently by Yip et al. (2011).

Likewise, the systematic underestimation at higher redshifts is likely affected by the fact that many of these galaxies are near the limit of the photometry and thus have higher than average magnitude errors. In combination with the lower density of training data, this will reduce the efficacy of our photo-z technique. To understand this effect, recall that our trees are built from objects whose photometry is sampled by assuming a normal distribution defined by the magnitude and magnitude error from the bootstrap samples. As the magnitude error increases, the range of possible values to sample increases, thereby producing a sparser sampling for this galaxy within our forest. Since there are few galaxies with redshifts above 1.3 in the training data, the branches of the forest for high-z galaxies are mainly dominated by training galaxies with redshifts closer to 1.
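This error-driven sampling can be sketched as follows (a minimal illustration with made-up magnitudes and an illustrative function name, not the TPZ implementation itself):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturbed_sample(mags, mag_errs):
    """One tree's view of the training set: a bootstrap resample whose
    magnitudes are jittered by N(0, mag_err), per the scheme above."""
    idx = rng.integers(0, len(mags), size=len(mags))  # sample with replacement
    return mags[idx] + rng.normal(0.0, mag_errs[idx])

# effect of the jitter alone: a faint galaxy with a large magnitude error
# is spread over a far wider magnitude range across trees than a bright
# one, diluting its contribution to any particular branch
bright = 20.0 + rng.normal(0.0, 0.02, size=1000)  # sigma = 0.02 mag
faint = 24.5 + rng.normal(0.0, 0.30, size=1000)   # sigma = 0.30 mag
```

Near the survey limit, the large per-tree jitter is what thins out the high-z branches described above.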
As we build the PDFs for the high-z galaxies, the PDFs will be positively skewed, and thus the mean value of each PDF will tend to be at lower redshift values. We demonstrate this skewness in Figure 11, which shows the average skewness of the photo-z PDFs and the one-sigma error as a function of the spectroscopic redshift. These quantities are computed from the third standardized moment:

S_k = ∫ ((z − z̄)/σ_z)³ p(z) dz    (8)

with

z̄ = ∫ z p(z) dz  and  σ_z² = ∫ (z − z̄)² p(z) dz,    (9)

where the integrals are computed over the redshift domain, and p(z) is the photo-z PDF.

Figure 11. The skewness of the photo-z PDF as a function of spectroscopic redshift. The solid black line is the mean of the skewness, and the pink shaded region corresponds to the one-σ interval. Positive skewness indicates a PDF skewed to lower redshifts.

We can see that for redshifts up to 1.1 the average skewness is very close to zero, showing a small trend to negative values, which will, on average, produce lower values for the mean photo-z. At higher redshifts, however, there is a clear increase in the average skewness, which will tend to produce lower values for the mean of the PDF. It is important to note that even though these PDFs may be (slightly) skewed, they still predict sufficient probability near the true redshift, information that is overlooked by other methods that use one-point predictions. On the other hand, a catastrophic photo-z would have a symmetric PDF centered near the wrong redshift, which is not what we observe here.

Relative Importance
By using OOB data, we have computed metrics from the training data that we compare, in Table 4, to the metrics obtained from the test data after the photo-z distributions were computed. The first two rows of this table show the complete results for all attributes for the DEEP2 galaxies. From this we see that there is strong agreement between the OOB and test data results for both the bias and the variance. We also computed the relative importance of the eight photometric bands and the RG attribute, which is the estimated R-band radius of an object in 0.207" pixels (i.e., the sigma of the Gaussian fit to the light distribution).

In the left panel of Figure 12, we present the attribute importance factor as a function of redshift for the three most and the two least important attributes.

Table 4. A comparison of the accuracy of photo-z prediction by using different attribute combinations from the DEEP2 data for all test galaxies. The first row gives the metrics for TPZ using only OOB data, which are comparable to the values obtained from the full test data, shown in the second row. The remaining rows provide these metrics for training data that have had the indicated number of attributes removed from the calculation.

Attribute Selection             <|∆z′|>    σ_∆z′
All attributes (OOB metrics)    0.052      0.053
All attributes                  0.047      0.049
Remove 2 least important        0.044      0.046
Remove 2 most important         0.061      0.068
Remove 4 least important        0.044      0.048
Remove 4 most important         0.070      0.084

From this figure, we see that the R band and the r band are the most important attributes for making a photo-z prediction, similar to the g − r color for the SDSS MGS. This demonstrates, by a purely statistical analysis, that in the optical regime the R band is the most effective attribute. These attributes show a peak in their importance between redshifts of 0.3 and 0.5. We attribute this increase to the presence of the 4000 Å break within these filters at these redshifts. Likewise, the next two most important attributes are the i band and the z band, which are likely important for the same reason, albeit over a slightly higher redshift range.

On the other hand, the two least important attributes are the B band and the RG attribute. As shown in this figure, the RG attribute does not contribute to the photo-z prediction, instead acting like a random variable and thus likely introducing extra noise into the calculation. We also see no clear evidence that the effect of this attribute changes with redshift. At low redshifts, this attribute could be affected by inclination angle or spectral type, while at higher redshifts, galaxies tend to be fainter and thus have smaller angular sizes.
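The permutation test behind these importance rankings is compact in code. In the sketch below, a toy linear model stands in for the trained forest, and all names are illustrative rather than TPZ's own:

```python
import numpy as np

rng = np.random.default_rng(1)

def permutation_importance(predict, X, y, n_repeats=5):
    """Importance of each attribute: the increase in mean |bias| when
    that column is shuffled, i.e. the permutation scheme described above.
    `predict` is any fitted model's prediction function."""
    base = np.abs(y - predict(X)).mean()
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break column j only
            scores.append(np.abs(y - predict(Xp)).mean())
        importances[j] = np.mean(scores) - base
    return importances

# toy model: "redshift" depends only on column 0; column 1 is pure noise
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = 0.8 * X[:, 0]
imp = permutation_importance(lambda A: 0.8 * A[:, 0], X, y)
```

Shuffling the informative column inflates the mean absolute bias, while shuffling the noise column leaves it unchanged, which is exactly how an uninformative attribute such as RG reveals itself.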
Presumably, these cumulative effects combine to erase any important information this attribute might provide to the photo-z calculation.

In the right panel of Figure 12, we present the mean relative importance for each attribute, computed from the changes in the mean (blue circles) and the scatter (red squares) when that attribute is permuted, similar to the central panel of Figure 7. Once again, both metrics agree with the importance ranking. In order to characterize the attributes and their computed ranking of importance, we have made the following tests. First, we removed the two least important attributes, after which we removed the two most important attributes, reincorporating the previously removed attributes. We then repeated this process, alternately removing the four least and the four most important attributes. In each case, we test whether TPZ is able to correctly recognize the attribute importances. These results are summarized in Table 4, where we use the absolute mean value of ∆z′ and its dispersion.

Figure 12. (Left): The variable importance factor, I_A, as a function of redshift for the three most and the two least important attributes (i.e., DEEP2 magnitudes), using the bias to quantify this importance. This index specifies how important an attribute is to the calculation of a metric when we permute the attributes one at a time. (Right): The RMS of the attribute importance factor as a function of the attributes, computed by using the bias (blue) and its scatter (red). Both of these metrics capture the same relative attribute importance.

Not surprisingly, we see that removing the two least important attributes does, in effect, improve the precision of TPZ while also making the code run faster, since we have fewer dimensions to check when splitting nodes within the tree, less data to keep in memory when building the tree (thus improving cache access), and faster random realizations of the input parameters, since there are fewer dimensions to sample. Yet, removing four attributes shows a slight decrease in the overall performance; in this case we have removed too much information. While this decrease might seem rather small, since we randomly select attributes when splitting nodes within the trees, by removing four attributes we have increased the scatter, as we are losing information. On the other hand, removing the most important attributes significantly affects the results, regardless of how many attributes we remove. As we would expect, the reason is clear: since these attributes carry the most information needed to subdivide the multidimensional parameter space in order to produce accurate photo-z, removing them negatively impacts the performance of TPZ.

In a further control test, we added two extra artificial variables to the data set, one of which is strongly dependent on the source redshift, i.e., a function of redshift, while the second is a uniformly distributed random variable. After computing their importance rankings, we can see from Figure 13 that TPZ recognized these two extra attributes and put them at the extreme limits of the importance ranking. The most important variable ranks at about eight, the random variable ranks at one as expected, while the r magnitude is ranked with a value close to two. We notice that the variable RG is very close to one, and therefore to a random variable. As discussed above, we can safely remove this variable from our calculation, as it does not provide any useful information. The legend on the plot also indicates the descending order in importance, in concordance with Figure 12.

Missing data
One interesting capability of TPZ is that it can be used to replace attributes in data that are either missing attributes or have attributes with large uncertainties. As discussed earlier, we estimate photo-zs for the clean test sample by using all three training samples: the original, clean sample (i.e., the control), the replaced attribute sample, and the cut sample. Likewise, we use the clean training sample to replace missing attributes and estimate photo-zs for the bad test sample. We present the results of these tests in Table 5, where we compare the photo-z estimation for the clean sample with the replaced and cut samples. For this comparison, we use ∆z_pp = z_phot,clean − z_phot,other and ∆mag = mag_clean − mag_other, along with their variances, where "other" can be either the replaced or the cut sample. As shown in this table, the replaced value sample produces, on average, superior photo-zs to the cut sample. Likewise, we have estimated robust photo-zs for the bad test sample, which significantly increases the size of our resulting test data. Dealing with missing attributes is important, especially when a spectroscopic training sample is limited or when cross-matching between incomplete catalogs is carried out in order to develop a more complete catalog for photo-z estimation.

Table 5. Photo-z estimation metrics to demonstrate the robustness of our missing attribute technique. The first two rows show the average bias and its variance between the estimated photo-z and the replaced magnitude when either removing or recovering bad data, in comparison with photo-zs predicted using the original, clean sample. The last row shows the same metrics calculated by using the clean test sample, but for data missing in the test sample as compared to the clean original sample.

Recovered train         <∆z_pp>^a    σ_z_pp^b    ∆mag     σ_mag
with removed data       −1.27        3.5         –        –
with recovered data     0.40         1.6         0.021    0.094
Recovered test          <∆z_pp>      σ_z_pp      ∆mag     σ_mag
with recovered data     0.72         4.5         0.033    0.12

^a in units of 10^−
^b in units of 10^−

Figure 13. The variable importance factor, I_A, as a function of redshift for the most and least important attributes, using the bias to quantify the importance ranking. As a control test, we added two artificial variables: an attribute that is a function of the spectroscopic redshift, and a uniformly distributed random attribute. TPZ is able to recognize these two extra attributes and rank them accordingly, as shown by the figure legend.

Photo-z PDFs and zConf
As discussed earlier, the zConf parameter can be used to identify galaxies with narrow, concentrated photo-z PDFs, which ideally will result in galaxy samples that have the most accurate photo-z estimates. The zConf parameter is demonstrated for DEEP2 galaxies in the left panel of Figure 14, which shows four representative photo-z PDFs selected with different values of zConf, as measured about the mean of each PDF. Both this figure and Figure 5, which presents four photo-z PDFs by using SDSS data, highlight the fact that wide and sparse distributions have low zConf values, while narrower PDFs have higher zConf values.

The goal of a parameter like zConf is to algorithmically identify galaxies that have, on average, the most accurate photo-z estimates. To test this hypothesis, we used all available DEEP2 training data to build our prediction trees and estimate photo-zs for the DEEP2 test sample. From this sample, we applied three zConf cuts: 0.5, 0.7, and 0.9, and calculated the bias and scatter as a function of redshift for the three resulting galaxy samples. We compare these results to the bias and scatter when no zConf cut is applied in the right hand panel of Figure 14. As shown in this figure, both the mean absolute bias and the scatter are reduced as zConf is increased, independent of redshift.

A simple, intuitive zConf threshold would be 0.5, as this selects galaxies that have a 50% probability that their photo-z estimate lies within the limits imposed by ±σ_TPZ(1 + z_phot). Higher values would provide more accurate results at the expense of reduced statistical power (i.e., a smaller final catalog). In Figure 14, for example, cuts on zConf at 0.5, 0.7 and 0.9 keep 90%, 76% and 38% of the galaxies from the original catalog, respectively. Alternatively, given the OOB data predictive results, a required accuracy or number density can be used to identify a suitable value of zConf.

N(z): An application of photo-z PDFs
Most of the results we have presented within this paper have been based on the estimation of a single metric computed from the photo-z PDF, for example the mean or mode. Obviously, using a single value to represent the PDF wastes significant information, but since many photo-z applications mimic spectroscopic redshift applications, new approaches must be developed to capitalize on the full information content of a photo-z PDF. As a result, we present a simple, yet very important application that uses the full photo-z PDFs: estimating the galaxy redshift distribution, N(z). This function is a fundamental measurement and is very important to a number of cosmological applications, including weak lensing tomography (e.g., Mandelbaum et al. 2008; Jee et al. 2012) and projecting three-dimensional theoretical power spectra to angular clustering measurements (Hayes et al. 2012).

We compute the normalized galaxy redshift distribution, N(z), for all the galaxies in the DEEP2 sample (i.e., no DEEP2 redshift confidence selection cut was applied), shown in Figure 15 as the shaded gray area. As demonstrated by this figure, most galaxies in this spectroscopic survey were selected to have redshifts between 0.6 and 1.2. Next, we compute the binned photometric redshift distribution by using the mean value from each photo-z PDF, shown by the red curve. While this curve does trace the gross features of the underlying spectroscopic redshift distribution, it fails to capture the full detail and can be significantly different at certain redshifts, including at the mode. For comparison, we show in black the photo-z PDF redshift distribution that we obtain by simply stacking the individual PDFs together. With this simple approach, we obtain a more accurate representation of the true sample redshift distribution. Here we have used all the galaxies, without selecting galaxies by their confidence level. This demonstrates that all individual PDFs computed with TPZ carry important information about the underlying distribution.

Figure 14. (Left): Same as Figure 5, but for four example galaxies taken from DEEP2. The vertical dashed line indicates the spectroscopic redshift and the gray area the zConf value. (Right): The absolute normalized bias and the scatter for galaxy samples defined by different zConf cuts, by using the mean of the photo-z PDF as our estimate.

These differences are more clearly exposed in the bottom panel of Figure 15, where we show the absolute fractional error, (N_phot − N_spec)/N_spec, as a function of redshift, using the same color scheme as before. From this figure, we see that the stacked PDF has a smaller error at almost all redshifts. In addition, the photo-z PDF redshift distribution is considerably smoother and looks more like a fit to the spectroscopic sample, which is another benefit of using the full photo-z PDF. For this particular demonstration, the photo-z PDFs used a bin size of 0.002, while the spectroscopic and photometric redshift distributions used a bin size of 0.03. Of course, we can generate smoother distributions for either the spectroscopic or the photo-z mean value redshift distributions by reducing the bin size; the trade-off, however, is that we run the risk of increasing the shot noise in the resulting distribution.

CONCLUSIONS

In this work we have presented and publicly released TPZ, a new, parallel, machine learning photo-z Python code that computes photo-z PDFs and also provides ancillary information about the photometric data. TPZ is based on the construction of prediction trees and, consequently, a random forest. Overall, TPZ is a three step algorithm that first preprocesses the data, completes galaxies with missing photometric values in an efficient manner, and also incorporates measurement errors. A photo-z PDF can be generated from the prediction trees in one of two modes: classification or regression. Both modes produce similar accuracies, but the regression mode is preferred when the training data are either poorly sampled or not uniformly distributed.
On the other hand, the classification mode provides a detailed synopsis of the redshift distribution that can be used to construct priors for use with other photo-z techniques. (TPZ is publicly available at http://lcdm.astro.illinois.edu/research/TPZ.html.)

We demonstrated the efficacy of the TPZ algorithm and its implementation by applying this new code to three different data sets: the SDSS main galaxy sample, the PHAT1 blind challenge, and the DEEP2 survey. With the SDSS MGS, we demonstrated that using confidence levels is important, as they improve the overall accuracy of our photo-z sample by selecting those galaxies with narrow PDFs. This technique is unique in the sense that it does not need a separate validation set, yet provides ancillary information by using OOB data. We have shown that with these data we obtain unbiased estimates of both the bias and the dispersion, which are very similar to the same values obtained from the test data for both the SDSS MGS and DEEP2. Obviously, this result is extremely important when working with data that have unknown redshifts.

TPZ not only provides these prior metrics, but it also provides a ranking of the relative importance of the different photometric attributes used by the code. This completely statistical process recovers what is naturally expected from physical consideration of these different attributes. With this importance ranking, we can construct a heat map of the different locations in parameter space that produce poor photo-z estimations. Furthermore, we demonstrated that by adding new, manually selected data we can produce more accurate photo-z predictions than by simply adding new galaxies randomly. This implies that we can optimally identify new training data for current and future photometric surveys, such as DES or LSST, in order to improve their photo-z predictions.

The attribute importance can also be used to remove those attributes that are least important, thereby improving the computational speed.
In addition, we demonstrated that the performance metrics converge as the number of trees in the forest increases, providing a further method to reduce the computational time, since we have a direct measure of the minimum forest size. Likewise, we also demonstrated by using the SDSS MGS that these same metrics converge with the number of training galaxies for a fixed forest size. Thus, except for adding in manually selected training data to improve areas with poor photo-z prediction, we have an explicit limit for the number of training galaxies needed.

Figure 15. (Top): The redshift distribution for the full DEEP2 spectroscopic sample of galaxies (shaded gray histogram), computed from the mean value of the individual photo-z PDFs (red curve), and computed by stacking the individual photo-z PDFs (black curve). (Bottom): The residual absolute error between the spectroscopic redshift distribution and the two photo-z redshift distributions, shown using the same color scheme.

Finally, with this technique we found that the error distribution was characterized by a Gaussian distribution with a mean very close to zero and a variance very close to one, indicating that the source of errors is relatively unbiased.

We ran our code on the PHAT1 blind challenge with excellent results; even with limited training data we were able to compute accurate photo-zs that were comparable to, if not better than, other empirical techniques as well as some SED fitting techniques. By using the DEEP2 redshift data, we tested TPZ over a large redshift range, obtaining very accurate results. In particular, we were able to identify the most important attributes, which in this case were the R band magnitude followed by the I band magnitude, and the least important attributes, which in this case were the RG attribute and the B band magnitude. Despite these impressive results, we still have slightly systematically biased photo-zs at very low and very high redshifts, which we primarily believe are caused by the low number of training data at these redshifts, and also by the fact that photo-z estimates cannot be negative. We also see a positive skewness in the photo-z PDFs at high redshifts. We believe this result is due to the fact that these galaxies tend to be fainter and have larger magnitude errors. These larger magnitude errors produce a sparser forest at higher redshifts, which is manifested as a lower photo-z PDF mean value at these same redshifts.

We have also demonstrated how the zConf parameter can be used to select galaxy samples that have improved photo-z estimates with minimal outliers.
A target value for this useful parameter can be set to a desired photo-z precision, either by calculating the value expected from the OOB data or as required by a specific cosmological measurement. Likewise, we have demonstrated how TPZ can efficiently handle missing data within a catalog. By artificially generating bad or missing parameter values within both the training and the testing data sets, we were not only able to robustly recover the missing parameters but, more importantly, to obtain new photo-z estimates that are consistent with the photo-z estimates from the original, full data set. Therefore, this technique increases the power of photo-z estimation by recovering missing data from the training catalog, as well as the power of our resulting sample statistics by recovering missing data from the application data set.

Finally, by calculating the normalized distribution of galaxies as a function of redshift, we were able to demonstrate the advantages of using a full photo-z PDF as opposed to using one single estimator of the PDF or any other point metric. Specifically, by simply stacking each individual PDF, we recover the underlying galaxy redshift distribution to a much higher precision than by simply using the mean of each individual photo-z PDF.

In conclusion, we note that since TPZ is an empirical algorithm, it is inherently dependent on the quality of its training data. Thus, as is the case with all empirical algorithms, TPZ is limited by the available spectroscopic training data. Furthermore, the application of TPZ to regions of parameter space beyond the limits of the training data (i.e., extrapolation) will be less reliable.
We do note, however, that TPZ provides ancillary information that can be investigated to better understand the limitations imposed by the training set, to identify the optimal locations within the application data space where new training data will be most useful, and to quantify the possible errors associated with the extrapolation of this technique. Another approach to improve the efficacy of photo-z estimation beyond the limits of a spectroscopic training sample would be to include the predictions from different, non-empirical techniques in a meta-classifier. We will explore this approach in future work on this topic.

ACKNOWLEDGEMENTS
The authors thank the referee for a careful reading of the manuscript. The authors also gratefully acknowledge the use of the parallel computing resource provided by the Computational Science and Engineering Program at the University of Illinois. The CSE computing resource, provided as part of the Taub cluster, is devoted to high performance computing in engineering and science. This work also used resources from the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.

MCK has been supported by the Computational Science and Engineering (CSE) fellowship at the University of Illinois at Urbana-Champaign. RJB has been supported in part by the Institute for Advanced Computing Applications and Technologies faculty fellowship at the University of Illinois.

Funding for the DEEP2 Galaxy Redshift Survey has been provided by NSF grants AST-95-09298, AST-0071048, AST-0507428, and AST-0507483, as well as NASA LTSA grant NNG04GC89G.

The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.