TPZ : Photometric redshift PDFs and ancillary information by using prediction trees and random forests
Mon. Not. R. Astron. Soc., 1–21 (0000) Printed 1 April 2013 (MN LaTeX style file v2.2)
Matias Carrasco Kind⋆ and Robert J. Brunner
Department of Astronomy, University of Illinois, Urbana, IL 61820 USA
ABSTRACT
With the growth of large photometric surveys, accurately estimating photometric redshifts, preferably as a probability density function (PDF), and fully understanding the implicit systematic uncertainties in this process has become increasingly important. In this paper, we present a new, publicly available, parallel, machine learning algorithm that generates photometric redshift PDFs by using prediction trees and random forest techniques, which we have named TPZ. This new algorithm incorporates measurement errors into the calculation while also dealing efficiently with missing values in the data. In addition, our implementation of this algorithm provides supplementary information regarding the data being analyzed, including unbiased estimates of the accuracy of the technique without resorting to a validation data set, identification of poor photometric redshift areas within the parameter space occupied by the spectroscopic training data, a quantification of the relative importance of the variables used to construct the PDF, and a robust identification of outliers. This extra information can be used to optimally target new spectroscopic observations and to improve the overall efficacy of the redshift estimation. We have tested TPZ on galaxy samples drawn from the SDSS main galaxy sample and from the DEEP2 survey, obtaining excellent results in each case. We also have tested our implementation by participating in the PHAT1 project, a blind photometric redshift contest, finding that TPZ performs comparably to, if not better than, other empirical photometric redshift algorithms. Finally, we discuss the various parameters that control the operation of TPZ, the specific limitations of this approach, and an application of photometric redshift PDFs.
Key words: galaxies: distances and redshifts – galaxies: statistics – surveys – methods: data analysis – methods: statistical
⋆ E-mail: [email protected]

Late time cosmological measurements are often made by carefully measuring the three-dimensional distribution of galaxies. For these measurements, the distance between the galaxy and the observer is most accurately made by using a spectroscopic redshift. However, spectroscopic measurements are considerably more difficult to obtain, and are, therefore, more expensive than photometric measurements, as they require long exposures in order to achieve sufficient signal-to-noise over a wide wavelength range. As an example, while the Sloan Digital Sky Survey (SDSS; York et al. 2000) has taken millions of spectroscopic redshifts of galaxies to high precision (Aihara et al. 2011), the same survey has obtained detailed photometric measurements for a much larger sample of galaxies in considerably less time. This dichotomy will only grow with ongoing and planned surveys that are dominated by photometric-only observations.

As a result, considerable attention has been focused on the estimation of redshifts by applying statistical techniques to the photometric observations of sources through different filters. These photometric redshift (hereafter photo-z) estimation techniques have become crucial for modern, multi-band digital surveys; and this need for fast and accurate photo-z estimation is becoming even more important for large photometric surveys like the Dark Energy Survey (DES) and the Large Synoptic Survey Telescope (LSST), which are probing galaxies that are often too faint to be spectroscopically observed. Adopting a photo-z approach allows cosmological measurements on galaxy samples that are currently at least a hundred times larger than comparable spectroscopic samples, that have relatively simple and uniform selection functions, and that extend to fainter flux limits and larger angular scales, thus probing much larger cosmic volumes. In summary, photo-z techniques provide a much higher number of galaxies with redshift estimates per unit telescope time than spectroscopic surveys (Hildebrandt et al. 2010).

The estimation of galaxy redshifts using multi-band photometry was first performed by Baum (1962), while Koo (1985) and Loh & Spillar (1986) were the first to compute galaxy redshifts by using digital photometric observations from charge coupled devices. In the last fifteen years, however, the estimation of redshifts from broadband photometry has grown significantly. Presently, there are many different methods for computing photometric redshifts (see, e.g., Wang et al. 2008; Hildebrandt et al. 2010; Abdalla et al. 2011, for an updated comparison of current photometric redshift methods and public codes).
These techniques can be broadly categorized as either template fitting algorithms or empirical training algorithms. The template fitting algorithms (e.g., Benítez 2000; Bolzonella et al. 2000; Csabai et al. 2003; Ilbert et al. 2006; Feldmann et al. 2006; Assef et al. 2010) can use either empirical (e.g., Coleman et al. 1980; Assef et al. 2010) or synthetic spectral templates (e.g., Bruzual & Charlot 2003). These techniques estimate a photometric redshift by finding the best match between the observed magnitudes or colors and the synthetic magnitudes or colors from the suite of templates that are sampled across the expected redshift range of the photometric observations.

Empirical training methods use a spectroscopic training data set to calibrate an algorithm that can be quickly applied to new photometric observations. Initially, the training set was used to map a polynomial function between the colors and the redshift (e.g., Connolly et al. 1995; Brunner et al. 1997). More recently, this process has been extended to machine learning algorithms, including artificial neural networks (e.g., Collister & Lahav 2004; Oyaizu et al. 2008b), boosted decision trees (e.g., Gerdes et al. 2010), random forests (e.g., Carliles et al. 2010), nearest neighbors (e.g., Ball et al. 2007, 2008; Lima et al. 2008), self-organized maps (e.g., Geach 2012; Way & Klose 2012), spectral connectivity analysis (e.g., Freeman et al. 2009), Gaussian processes (e.g., Way et al. 2009; Bonfield et al. 2010), support vector machines (e.g., Wadadekar 2005), and the Quasi-Newton Algorithm (e.g., Cavuoti et al. 2012).
While only a few of these photo-z methods are publicly available, they all perform to a similar accuracy and provide only a single redshift estimate rather than a full redshift probability density function for each galaxy.

The template fitting methods, which leverage model galaxy spectral energy distributions (SEDs), have been used extensively and are often preferred since, once implemented, they can be readily applied to new data by simply adopting the appropriate photometric filter transmission functions. Given a representative sample of template galaxy spectra, most of these techniques can reliably predict a photo-z, although the use of training data that include known redshifts can improve these predictions (e.g., Benítez 2000; Ilbert et al. 2006). These techniques, however, are not exempt from uncertainties due to measurement errors on the survey filter transmission curves, mismatches when fitting the observed magnitudes or colors to template SEDs, and color-redshift degeneracies. Furthermore, template techniques generally become less reliable at high redshift, where the uncertainties in galaxy SEDs increase, since the templates are often calibrated using low redshift galaxies.

On the other hand, when provided with a high quality spectroscopic training sample, empirical training techniques have been shown to have similar or even better performance (Collister & Lahav 2004). In addition, empirical techniques are generally simpler to apply to different data sets and frequently provide an improved quantification of any uncertainties, which can be encoded in a photo-z probability density function (PDF).
They also have the additional advantage that it is easier to include extra information, such as galaxy profiles, concentrations, angular sizes, or environmental properties, in addition to magnitudes or colors. These methods, however, are only reliable within the limits of the training data, and sufficient caution must be exercised when extrapolating these algorithms beyond those limits.

As the demand for more accurate photo-z methods has grown, techniques have branched out into new areas in order to improve the accuracy of photo-z estimation. While a complete understanding of the systematic uncertainties is needed for a reliable and accurate machine learning photo-z algorithm (see, e.g., Oyaizu et al. 2008a, for a discussion of photometric redshift errors), other issues have recently been recognized in the effort to generate the most accurate photometric redshifts. For example, Cunha et al. (2012a,b) analyzed the effect of systematics within the spectroscopic training data set that is used to estimate a galaxy photo-z. Likewise, other functionality that a modern photo-z algorithm should provide includes an identification of outliers in the training set that lead to an incorrect estimation of a photo-z, an identification of the features within the training data that most strongly affect a photo-z estimate, and an identification of areas of parameter space (e.g., magnitudes, colors, and redshift ranges) that are undersampled by the training data. The last two features are important to the design of photometric surveys, as they provide useful information to optimally and efficiently guide follow-up spectroscopy to generate the scientifically most useful training data set for these algorithms.

Of course, we estimate a galaxy's redshift so that it can be used in a subsequent analysis. A number of cosmological measurements, such as galaxy clustering, weak lensing, baryon acoustic oscillations, and the mass function of galaxy clusters (see, e.g., Ho et al.
2012; Reid et al. 2010; Jee et al. 2013), among others, depend strongly on both the number of targeted galaxies in the sample and the accuracy of the measured distances to the galaxies. Given the growth of photometric-only surveys, these cosmological measurements will require the use of reliable photometric redshifts and a complete understanding of their uncertainties. As a result, photo-z methods will be most effective going forward if they can not only robustly provide a reliable redshift estimate but also a full redshift probability density function.

In addition, the extra information in a redshift PDF can be used to improve or enhance a particular cosmological measurement. For example, Myers et al. (2009) have shown that by using the full redshift PDF within a two-point angular quasar correlation function, as opposed to simply using a single redshift estimate, their measurement is improved by a factor of nearly four, which is equivalent to increasing the survey volume by a similar factor. Likewise, Mandelbaum et al. (2008) discuss how the accuracy of photo-z estimates and the inclusion of the photo-z PDF affect the calibration for weak lensing studies. Other recent studies (see, e.g., Sheth 2007; van Breukelen & Clewley 2009) have also demonstrated how a cosmological measurement can be improved by using a photo-z PDF. However, given the lack of reliable photo-z PDF estimation techniques, this remains an underutilized tool.

In this work, we address these issues by introducing TPZ (Trees for Photo-Z), a new, Python-based, parallel, machine learning code for estimating photometric redshift PDFs by using prediction trees and random forest techniques (Breiman et al. 1984; Breiman 2001). Our approach is an ensemble learning method that generates several classifiers and combines their results into a final output.
Prediction trees partition the multi-dimensional space recursively into smaller regions; the partitioning terminates when a leaf contains only a few elements. Within these final leaves, our algorithm can leverage a simple model for the actual prediction, by using, for example, the mean value for a regression or the mode in a voting process as used in a classification scheme. Likewise, the basic idea of a random forest method is to use bootstrap samples from the training data to build a set of prediction trees. These trees are constructed by selecting the best split point from a random subsample of the dimensions (e.g., magnitudes or colors) along which the data are subdivided. By aggregating the predictions from this forest of trees, we produce a more accurate estimate. In our implementation, we incorporate the errors on the measured attributes by perturbing the galaxy parameters by their uncertainties. We repeat this process, generating multiple individual new observations of each galaxy that are subsequently combined into a final PDF, which can be used as desired to estimate a single redshift and its associated error. In addition, our implementation of this technique naturally incorporates data with missing values and also provides extra meta information, such as an unbiased estimate of the prediction error, a measure of the relative importance of the parameters used in the photo-z estimation as a function of redshift, an identification of regions where the training data provide poor predictions, and an identification of galaxies that are likely outliers.

This paper is organized as follows. We first describe the photo-z method presented herein, then introduce the data used to test it, present an analysis of the results, and discuss the capabilities of our approach. Finally, in §5, we conclude with a summary of our main results and a discussion of the TPZ algorithm.
Among the different non-linear methods that are used to compute photometric redshifts, prediction trees are one of the simplest yet most accurate techniques. Supervised learning methods using prediction trees, either classification or regression, have been shown to be among the most accurate algorithms for both low- and high-dimensional data (Caruana et al. 2008). They are also fast, can easily deal with missing data, and have similarities with other non-parametric techniques. For example, prediction trees are similar to k-nearest-neighbor (kNN) algorithms in that they both group data points with similar characteristics. However, kNN algorithms use test data to identify similar points within the training set while keeping the parameter k fixed, even though some points might have a very different number of similar neighbors. Prediction trees, on the other hand, have terminal leaves that bound regions of the parameter space where the predictions (i.e., redshifts) and their properties (e.g., magnitudes) are similar. As both the quantity and identity of test data can vary between leaf (or terminal) nodes, prediction trees are known as adaptive nearest-neighbor methods (Breiman et al. 1984).

Prediction trees are built by asking a sequence of questions that recursively split the data, frequently into two branches, until a terminal leaf is created that meets a stopping criterion (e.g., a minimum leaf size). The small region bounding the data in the terminal leaf node represents a specific subsample of the entire data with similar properties. Within this leaf, a model is applied that provides a fairly comprehensible prediction, especially in situations where many variables may exist that interact in a nonlinear manner, as is often the case with photo-z estimation.
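As a concrete illustration of this recursive partitioning, a minimal regression tree can be grown in a few lines of Python. This sketch is ours, not the TPZ implementation; the dictionary-based node structure, the variance-based split score, and all function names are illustrative choices:

```python
import numpy as np

def grow_tree(X, z, min_leaf=5):
    """Recursively split (X, z) along the dimension and value that best
    separate the redshifts, stopping when a node reaches the minimum
    leaf size; each terminal leaf stores a simple model (here, the mean)."""
    if len(z) <= min_leaf or np.all(z == z[0]):
        return {"leaf": True, "z": z.mean()}
    best = None
    for d in range(X.shape[1]):                    # scan every dimension
        for s in np.unique(X[:, d])[:-1]:          # candidate split values
            mask = X[:, d] <= s
            score = (mask.sum() * z[mask].var()
                     + (~mask).sum() * z[~mask].var())
            if best is None or score < best[0]:
                best = (score, d, s)
    if best is None:                               # no valid split remains
        return {"leaf": True, "z": z.mean()}
    _, d, s = best
    mask = X[:, d] <= s
    return {"leaf": False, "dim": d, "split": s,
            "left": grow_tree(X[mask], z[mask], min_leaf),
            "right": grow_tree(X[~mask], z[~mask], min_leaf)}

def predict(tree, x):
    """Answer the sequence of questions to reach a terminal leaf."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["dim"]] <= tree["split"] else tree["right"]
    return tree["z"]
```

Each call to `grow_tree` poses one "question" (is attribute `d` below `s`?), mirroring the recursive splitting described above.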
A visualization of an example tree generated by our technique is shown in Figure 1.

There are two classes of prediction trees (Breiman et al. 1984), classification and regression, both of which are implemented in TPZ.

(i) Classification Trees (also called Decision Trees): As the name suggests, this type of prediction tree is designed to classify or predict a discrete category from the data. Each terminal leaf contains data that belong to one or more classes. The prediction can be either a point prediction based on the mode of the classes inside that leaf, or distributional, assigning probabilities to each category based on their empirically estimated relative frequencies. For example, in our photo-z technique we use the magnitudes or colors of galaxies to determine the probability that a galaxy lies either inside or outside a specific redshift bin (a detailed explanation of the algorithm is presented below). The dimension along which a node is split is chosen to maximize the Information Gain (I_G), which is defined in terms of the impurity degree index I_d:

I_G(T, M) = I_d(T) - \sum_{m \in \mathrm{values}(M)} \frac{|T_m|}{|T|} I_d(T_m)    (1)

where T is the training data in a given node, M is one of the possible dimensions (e.g., magnitudes) along which the node may be split, m are the possible values of a specific dimension M (in the case of magnitudes, m might represent two or more magnitude bins), |T| and |T_m| are, respectively,
Figure 1.
A simplified example of a binary prediction tree plotted radially. The initial node is close to the center of the figure. The splitting process terminates when a stopping criterion is reached. Individual colors represent the unique variable (e.g., fixed aperture g or r magnitudes or colors) used for the splitting at each node. Each leaf provides a specific prediction based on the information contained within that terminal node (gray triangles in the figure). The subpanel corresponds to a zoomed-in region of the tree.

the size of the total training data and the number of objects for a given subset m within the current node, and I_d is the function that represents the degree of impurity of the information.

There are three standard methods to compute the impurity index (I_d). The first method is to use the information entropy, which is defined in the expected manner (similar to thermodynamics):

I_d(T) \equiv H(T) = -\sum_{i=1}^{n} f_i \log f_i    (2)

where i is the class to be predicted (e.g., inside or outside a redshift bin), the sum is over all n possible classes (two in our example), and f_i is the fraction of the training data belonging to class i. The same definition applies for a subset of the data T_m.

The second option is to measure the Gini impurity (G). In this case, a leaf is considered pure if all the data contained within it have the same class. The Gini impurity can be computed inside each node:

I_d(T) \equiv G(T) = \sum_{i=1}^{n} \sum_{j \neq i} f_i f_j    (3)

where f_i and f_j are the fractions of the training data of class i or j. The same equation applies for a subset of T along one particular dimension M. Since the f_i are the fractions for all possible classes, we have \sum_i f_i = 1 and, therefore, \sum_{j \neq i} f_j = 1 - f_i. As a result, Equation 3 can be simplified to

I_d(T) \equiv G(T) = 1 - \sum_{i=1}^{n} f_i^2    (4)
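These impurity indices (together with the classification error defined below) and the information gain of Equation 1 can be sketched in Python as follows. This is a minimal illustration with our own function names, assuming integer class labels and a binary split:

```python
import numpy as np

def entropy(labels):
    """Information entropy H(T) = -sum_i f_i log f_i (Eq. 2)."""
    if len(labels) == 0:
        return 0.0
    f = np.bincount(labels) / len(labels)
    f = f[f > 0]                       # 0 log 0 is taken as 0
    return -np.sum(f * np.log(f))

def gini(labels):
    """Gini impurity G(T) = 1 - sum_i f_i^2 (Eq. 4)."""
    if len(labels) == 0:
        return 0.0
    f = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(f ** 2)

def classification_error(labels):
    """Classification error C_E(T) = 1 - max f_i (Eq. 5)."""
    if len(labels) == 0:
        return 0.0
    f = np.bincount(labels) / len(labels)
    return 1.0 - f.max()

def information_gain(x, labels, split, impurity=gini):
    """I_G (Eq. 1) for a binary split of attribute x at `split`."""
    n = len(labels)
    left, right = labels[x <= split], labels[x > split]
    return (impurity(labels)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))
```

For a two-class node with equal fractions, `gini` returns 0.5 and `entropy` returns log 2, the maxima visible in Figure 2; a split that separates the classes perfectly drives both child impurities to zero, maximizing the gain.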
Figure 2.
Impurity index I_d for a two-class example as a function of the probability of one of the classes, f, using the information entropy (blue), Gini impurity (green), and classification error (red). In all cases, the impurity is at its maximum when the fraction of data within a node with class 1 is 0.5, and zero when all data are in the same category.

The third method is to simply measure the impurity degree by using the classification error (C_E):

I_d(T) \equiv C_E(T) = 1 - \max\{f_i\}    (5)

where the maximum value is taken among the fractions f_i within the data T that have class i. During the tree construction, the data are scanned over each dimension to determine the split point that maximizes the information gain as defined by Equation 1, and the attribute that maximizes this information gain overall is selected. For example, Figure 2 shows these three impurity indices, for a node with data that are categorized into only two classes, as a function of the fraction of the data having a specific class. If all of the data belong to a specific class, the impurity is zero. On the other hand, if half of the data have one class and the remaining data all belong to the other class, the impurity is at its maximum. Our implementation can calculate any of these three impurity indices, and any one of them can be selected for the construction of the prediction trees. Alternatively, the index providing the highest information gain at a given node can be selected.

(ii) Regression Trees:
A second type of prediction tree is used when the quantity to be predicted is continuous; since it does not use discrete classes, we instead fit a regression model to the data inside a leaf. The construction of a regression tree follows the same structure as the classification tree, and once again a node is generally divided into two branches (i.e., a binary tree). There are two primary differences, however, between regression and decision trees. First, each leaf has training data with different redshift values; the prediction value is based on a regression model covering these points. Usually, the mean of the training redshifts is returned, so each prediction is no longer a discrete classification, but is instead an estimation of a continuous variable. Second, the procedure used to select the best dimension to split for a regression tree is based on the minimization of the sum of the squared errors, which for a node T is given by

S(T) = \sum_{m \in \mathrm{values}(M)} \sum_{i \in m} (z_i - \hat{z}_m)^2    (6)

where m are the possible values (bins) of the dimension M, z_i are the values of the target variable on each branch/bin m, and \hat{z}_m is the specific prediction model used. In the case of the arithmetic mean, we have \hat{z}_m = \frac{1}{n_m} \sum_{i \in m} z_i, where n_m is the number of members on branch m. This allows us to rewrite Equation 6 as

S(T) = \sum_{m \in \mathrm{values}(M)} n_m V_m    (7)

where V_m is the variance of the z_i within branch m. At each node in our tree, we scan all dimensions to identify the split point that minimizes S(T). The splitting dimension that has the lowest value of S is selected as the splitting direction, and this procedure is repeated until either some threshold in S is reached or any new nodes would contain less than the predefined minimum leaf size.
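The split criterion of Equations 6 and 7 can be sketched as follows. This is an illustrative fragment rather than the TPZ implementation; for brevity, candidate split points are taken directly from the observed attribute values:

```python
import numpy as np

def split_cost(z, x, s):
    """S(T) for a binary split of attribute x at s: each branch predicts
    its mean redshift, so the cost reduces to sum_m n_m * V_m (Eq. 7)."""
    cost = 0.0
    for branch in (z[x <= s], z[x > s]):
        if len(branch) > 0:
            cost += len(branch) * branch.var()   # n_m * V_m
    return cost

def best_split(z, X):
    """Scan all dimensions and candidate split points, returning the
    (dimension, value, cost) triple that minimizes S(T)."""
    best = (None, None, np.inf)
    for d in range(X.shape[1]):
        for s in np.unique(X[:, d])[:-1]:        # candidate split values
            c = split_cost(z, X[:, d], s)
            if c < best[2]:
                best = (d, s, c)
    return best
```

Because the branch prediction is the arithmetic mean, minimizing the summed squared errors of Equation 6 is exactly equivalent to minimizing the variance-weighted form of Equation 7, which is what the code computes.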
Random forest is an ensemble learning algorithm that first generates many prediction trees and subsequently combines their predictions together. It is one of the most accurate empirically trained learning techniques for both low- and high-dimensional data (Caruana et al. 2008). The idea is simple: given a training sample T containing N objects that have M attributes (e.g., survey magnitudes), create N_T bootstrap samples of size N (i.e., N randomly selected objects with replacement). From these samples, we create the corresponding N_T prediction trees without pruning them back.

If all the variables are examined when deciding the best point to split, the method is called bagging (Breiman 1996). An additional layer of randomness can be added to the bagging process by choosing the best split point from among a random subsample of m* < M variables at each node, where m* is kept fixed during the process. The value of m* is an adjustable parameter that is directly related to the strength of a tree (a strong tree has a low error rate) and the correlation between any two trees (the more correlated the trees, the higher the forest error rate). Increasing or reducing m* has the same effect on both features. Of course, we want to select the optimal value of m*. A good starting point is to set m* ≈ √M, although the accuracy of the algorithm is, in the end, not very sensitive to this parameter for a large number of trees and a relatively small number of dimensions. After constructing all of the prediction trees, a final and robust prediction is calculated by combining all N_T estimates together.

Breiman (2001) first introduced this algorithm and showed that this technique performs very well when compared to many other learning techniques.
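The bagging procedure with random feature subsets might be sketched as below. For brevity each "tree" is only a depth-one stump, and all names are our own rather than TPZ's:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(X, z, m_star):
    """A depth-one regression 'tree': choose the best split among a random
    subsample of m_star dimensions and predict the branch means."""
    dims = rng.choice(X.shape[1], size=m_star, replace=False)
    best = None
    for d in dims:
        for s in np.unique(X[:, d])[:-1]:
            left, right = z[X[:, d] <= s], z[X[:, d] > s]
            score = len(left) * left.var() + len(right) * right.var()
            if best is None or score < best[0]:
                best = (score, d, s, left.mean(), right.mean())
    if best is None:                    # degenerate node: predict the mean
        return (0, -np.inf, z.mean(), z.mean())
    return best[1:]

def stump_predict(stump, X):
    d, s, lo, hi = stump
    return np.where(X[:, d] <= s, lo, hi)

def random_forest(X, z, n_trees=25):
    """Grow N_T trees on bootstrap samples of size N, each split chosen
    from a random subset of m* ~ sqrt(M) dimensions."""
    N, M = X.shape
    m_star = max(1, int(np.sqrt(M)))
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)   # bootstrap, with replacement
        forest.append(train_stump(X[idx], z[idx], m_star))
    return forest

def forest_predict(forest, X):
    """Aggregate the N_T per-tree estimates (the mean for regression)."""
    return np.mean([stump_predict(t, X) for t in forest], axis=0)
```

The two sources of randomness, bootstrap resampling and the per-node feature subsample of size m*, are what decorrelate the trees before their estimates are aggregated.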
This technique is robust against overfitting (i.e., there is no limit on the number of trees, N_T, in the forest), it runs efficiently on large data sets, it can generate an internal unbiased estimate of the error, and it can provide extra information about the relative importance of the input variables and the internal structure of the training data.

Given a training set T, this extra, ancillary information can be calculated prior to the computation of the photo-z PDFs. As a result, we can use this a priori information to explore the efficacy of different parameter combinations while also obtaining an estimate of the bias and variance of the photo-z prediction. This is done by using out-of-bag (OOB) samples, which consist of the random sample of data that are left out of each tree. In the process of growing a forest, N_T trees are created using bootstrap samples of size N. In each of these samples, about one-third of the data are not used when constructing a tree, and are instead used as a test sample for the recently built tree. The test results created by using these OOB data are combined together to obtain estimators of the error which, when built using a sufficiently large number of trees in the forest, have been shown to be unbiased and as accurate as those from a validation set of the same size as the training set (Breiman 1996). This removes, therefore, the need for a separate validation sample that can introduce a bias into the final result. This method also has the advantage of using the full spectroscopic data to compute PDFs.

The OOB data can also be used to estimate the relative importance of each attribute or dimension to the photo-z calculation.
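The OOB bookkeeping can be illustrated as follows. As a stand-in for a full prediction tree, each tree here is a 1-nearest-neighbour predictor over its bootstrap sample (in keeping with the adaptive nearest-neighbour view of trees), so this is a sketch rather than the TPZ implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def oob_error(X, z, n_trees=50):
    """Each bootstrap sample leaves out roughly one-third of the objects
    (the out-of-bag set), which serve as the test set for that tree; the
    per-object OOB predictions are then combined into an error estimate."""
    N = len(z)
    oob_sum = np.zeros(N)   # running sum of OOB predictions per object
    oob_n = np.zeros(N)     # number of trees for which object was OOB
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)          # bootstrap sample
        oob = np.setdiff1d(np.arange(N), idx)     # ~N/e objects left out
        for i in oob:
            # 1-NN stand-in for the tree trained on this bootstrap sample
            nn = idx[np.argmin(np.abs(X[idx] - X[i]))]
            oob_sum[i] += z[nn]
            oob_n[i] += 1
    seen = oob_n > 0
    resid = oob_sum[seen] / oob_n[seen] - z[seen]
    return np.sqrt(np.mean(resid ** 2))
```

Since every object is out-of-bag for roughly a third of the trees, each one accumulates many independent test predictions, which is why no separate validation sample is needed.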
Estimating the relative importance of each attribute provides an elegant method to identify and remove attributes that do not contribute significantly, thereby reducing the noise and the dimensionality of the problem. This also has the benefits of increasing the performance of the implementation, improving our understanding of the complexity of the interaction between different attributes, and improving the identification of new training data from, for example, follow-up observations. This relative importance is estimated for each attribute by first quantifying any variations in the prediction error when the OOB data are permuted only along the specific attribute, leaving the others unchanged. This process is repeated across all trees, and the end result is the average increment in the error, relative to the unperturbed variables, over all the trees in the entire forest.

Another item we can construct is the proximity matrix, Prox(i, j), which is a symmetric, positive definite matrix that gives the fraction of trees in the forest in which elements i and j fall in the same terminal leaf. This matrix is constructed tree by tree by running all the data, both the OOB data and the data used for growing, down each tree. When galaxies i and j land in the same leaf, their proximity is increased by one. At the end, all the proximities are normalized by the total number of trees; therefore, similar galaxies will tend to have higher proximities than dissimilar ones. This matrix can be computed for the training set, the test set, or both together. Since this matrix quantifies the relative similarity between galaxies, it can be used to identify outliers within a data set.
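Given the terminal leaf reached by every object in every tree, the proximity matrix and a simple proximity-based outlier score can be computed as follows (a schematic version; the array layout and function names are our own):

```python
import numpy as np

def proximity_matrix(leaf_ids):
    """Prox(i, j): fraction of trees in which objects i and j land in the
    same terminal leaf; leaf_ids[t, i] is the leaf reached by object i
    in tree t."""
    n_trees, n_obj = leaf_ids.shape
    prox = np.zeros((n_obj, n_obj))
    for t in range(n_trees):
        # pairwise comparison of leaf assignments for tree t
        prox += leaf_ids[t][:, None] == leaf_ids[t][None, :]
    return prox / n_trees

def outlier_scores(prox):
    """Squared sum of each object's proximities to all others; objects
    with the lowest scores have few similar neighbours and can be
    flagged as potential outliers."""
    off_diag = prox.copy()
    np.fill_diagonal(off_diag, 0.0)
    return (off_diag ** 2).sum(axis=1)
```

The full TPZ proximity matrix is built by running objects down actual trees; only the accumulation and normalization steps are shown here.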
For example, by computing the squared sum of all proximities for each galaxy, we can algorithmically identify galaxies with few neighbors by selecting the sources with the lowest values, which can be flagged for further inspection.

To build or apply prediction trees, the data cannot have missing values for any of the attributes used to construct the trees (e.g., the most important survey magnitudes). To include more data in the classification process, we can use the proximity matrix to estimate any missing values or to replace highly uncertain values. We do this in an iterative process, by performing the forest growing step of the algorithm and replacing the missing attribute at each pass. We select the replacement value by computing the average parameter value from the k nearest galaxies; we can also inversely weight these galaxies by their respective distances. This process continues until we have obtained convergence or until a fixed number of iterations have been performed.

By using the proximity matrix, OOB error estimates, and the relative importance of different attributes, we can also identify zones where the photo-z prediction is either poor or is loosely constrained by the training data. In either case, this knowledge is of vital importance when deciding which galaxies to target spectroscopically in order to optimally improve a training sample. One way this feature is implemented is by using the two most important attributes to map the areas of parameter space by their prediction error. This map can guide the identification of new data that increases the efficacy of the training sample by targeting those galaxies that minimize the prediction error in undersampled areas, thereby more effectively utilizing limited spectroscopic follow-up observations.

Two previous works have utilized prediction trees for photo-z calculations. Carliles et al.
(2008, 2010) predicted photometric redshifts and their errors by using a random forest built with the regression tree package for R, using the mean value as their leaf model. They used a subset of the main galaxy sample from the SDSS Data Release 6 (Adelman-McCarthy et al. 2008) catalog with colors as their attributes. They demonstrated that random forest methods are well suited to the photo-z estimation problem, as they obtained results comparable to other machine learning methods, and they publicly released their R scripts. They did not, however, take full advantage of the ancillary information provided by the random forest technique, nor did they produce probability density functions.

Gerdes et al. (2010) have developed a new technique, called ArborZ, to compute photometric redshifts using boosted decision trees (BDT). These classification trees are constructed in a similar manner to our classification trees, as discussed in §, and their results were compared against the photo-z estimates provided by Oyaizu et al. (2008b) in the case of the SDSS data, and by ANNz (Collister & Lahav 2004) for the DES.

Our approach, detailed below, extends these previous results to create a new publicly available method that uses random forests to compute PDFs by using classification and/or regression trees. Our approach also uses the extra information encoded within the measurement errors, generates extra, ancillary information describing the spectroscopic training sample, and provides better control of the uncertainties. We also, therefore, are able to examine the importance of the attributes used to grow the trees, and to identify areas in the attribute space where the training data are dominated by shot noise statistics.

Our implementation of prediction trees with random forests for photometric redshift PDF prediction, TPZ, is written in the Python programming language and uses MPI for parallel communication to run efficiently on distributed memory systems. As shown in Figure 3, our implementation is divided into three steps:

Data Pre-processing
The first step prepares the data for the construction of the prediction trees. First, we optionally perform a principal component analysis (PCA) of the data in order to reduce strong correlations between attributes. This PCA transformation can reduce the dimensionality of the input data prior to the training, which can be important for large data sets with many attributes. This step also includes the replacement of missing values, which we do iteratively, finding that 5-10 iterations lead to convergence on the missing values. We next generate N_R training samples by perturbing the measured values according to the error on each variable, which we assume to be normally distributed. In this manner, we incorporate the measurement errors into the prediction tree construction, reduce the bias on the proximity matrices, and introduce randomness into the construction of the trees in a controlled manner.

Random Forest Construction
The second step is the actual construction of the random forest, where we generate fully grown prediction trees. We construct N_T trees by using bootstrapping for each perturbed sample in the set of N_R training samples created in the first step. This step can be done several times with a smaller number of trees to both explore the parameter space and gain insight into the internal structure of the data prior to building the final prediction trees. Finally, this step can also produce the ancillary information that characterizes the performance of our code prior to estimating the final photo-z values.

Photo-z PDF Construction
The final step uses the newly generated prediction trees to create individual photo-z PDFs for each source in the application data set. This process involves running each source down each tree, testing the source at each node until we arrive at a terminal leaf, where we make a prediction. At the end, we combine all of the forest predictions into a probability density function.
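The core of this step, walking one source down a tree to its terminal leaf, can be sketched as follows (an illustrative node layout chosen for this sketch; TPZ's internal tree structure will differ):

```python
def predict_leaf(node, galaxy):
    """Descend a prediction tree: at each internal node, compare the
    galaxy's value of the split attribute against the node's cut and
    branch left or right until a terminal leaf is reached."""
    # internal nodes are dicts, terminal leaves are lists of redshifts
    while isinstance(node, dict):
        branch = 'left' if galaxy[node['attr']] <= node['cut'] else 'right'
        node = node[branch]
    return node  # the redshift values stored in the terminal leaf
```

Repeating this descent over every tree in the forest and pooling the returned leaf contents yields the raw material for the PDF.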
TPZ can use either type of prediction tree within its random forests, classification or regression; the actual implementation details only differ after the first step.
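That shared first step, generating the N_R perturbed training samples, can be sketched as (a minimal illustration assuming, as in the text, normally distributed measurement errors):

```python
import numpy as np

def perturbed_samples(attrs, errors, n_r, seed=42):
    """Draw n_r perturbed copies of the training attributes, each value
    redrawn from a normal distribution centred on the measurement with
    the quoted error as its standard deviation."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(n_r,) + attrs.shape)
    return attrs[None, :, :] + noise * errors[None, :, :]
```

Each of the n_r slices then seeds its own set of bootstrapped trees, so the measurement uncertainties propagate into the forest in a controlled way.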
Figure 3. A simplified representation of TPZ; the details of each subprocess are described more fully within the text. Note that each tree drawn in the figure represents a full random forest, with N_T bootstrap samples for every one of the N_R randomly perturbed samples. The big circles containing galaxies represent a terminal leaf, which is used directly to make a prediction for each new galaxy.

Classification Mode:

In this mode, the spectroscopic sample is divided into several redshift bins that either have a fixed width (or, equivalently, a fixed resolution), which allows a variable number of galaxies within each redshift bin, or have a fixed number of galaxies per redshift bin, which means our redshift bins are of variable width. Within each bin, we create a forest of classification trees, as described above, using the perturbed samples as well as the bootstrap samples. These trees classify an object as either lying inside or outside a bin. If we were to use all of the training data within each bin, we would both decrease the overall performance of our implementation, due to the larger data volume, and increase the chance of catastrophic errors, since most of the data would lie outside the bin of interest.

We address these issues by following an approach similar to that used by Gerdes et al. (2010). For each bin, we identify all sources that lie inside the bin; the number of galaxies with class inside is n_in. We next select f_out × n_in galaxies, assigned the class outside, whose spectroscopic redshifts lie outside the bin by at least z_out times the bin width δz; that is, galaxies with class outside fall at least z_out × δz from the boundaries of the bin. This allows a better distinction between the class inside and the class outside than we would have if we included objects located very near to these boundaries.
In the end, each bin will have (1 + f_out) n_in galaxies available for training the forest. If the training set is limited, wider bins can be used in order to have a sufficient number of training galaxies per bin. Furthermore, these bins can even be allowed to overlap by some amount; this overlap can be taken into account when building the photo-z PDFs by normalizing by the fraction of wider bins that overlap with each other. After all of the forests are created for all of the bins, the test data are run down each tree in each forest, which assigns either the class inside or outside to the test source. After combining all of the assigned classes from the forest, we assign a probability for the source to belong to that redshift bin, which is simply the number of times the source was assigned the inside class divided by the total number of trees. By repeating this process for each bin and renormalizing the subsequent result, we generate a photo-z PDF for the source.
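The per-bin probability and the final renormalization can be sketched as follows (a minimal illustration; `inside_votes` holds, for each redshift bin, the number of trees that assigned the class inside):

```python
import numpy as np

def classification_pdf(inside_votes, n_trees):
    """Probability per redshift bin: the number of trees voting `inside`
    divided by the number of trees in that bin's forest, renormalized so
    the resulting photo-z PDF sums to one."""
    p = np.asarray(inside_votes, dtype=float) / n_trees
    return p / p.sum()
```

For example, with forests of 100 trees per bin, vote counts of 50, 25, and 25 across three bins yield bin probabilities of 0.5, 0.25, and 0.25.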
Regression Mode:
In this mode, we use all available training data to fully grow each tree. For each perturbed sample, N_T trees are created using the methodology explained in §. To compute a photo-z, the test data are run down each tree in the forest. Each tree returns a set of spectroscopic redshift measurements that, after conversion to a given resolution, are converted into a PDF by normalizing by the total number of objects returned. All trees have the same weight when constructing the PDF, as do the values of the terminal leaves identified in each tree. If a single value is desired, a mean value and its error can be returned via the standard methods by aggregating all of the relevant values returned by the different trees.

The choice between these modes will depend on the characteristics of the data being analyzed. On average, the regression mode runs faster than the classification mode for a given accuracy, and it is also better suited for data that are not uniformly distributed. The classification mode, on the other hand, provides a better characterization of the data as a function of redshift, since it creates its own random forest in each bin, unlike the regression mode, where a single forest is created using the full range in redshift. The classification mode is also better suited for uniformly distributed data and can provide reliable and robust prior probabilities in a Bayesian framework when using wider redshift bins. When faced with a high quality and rich training set, both modes will provide similar accuracies and error rates, but the regression mode, being faster, would generally be preferred.

Figure 3 shows a simplified workflow of our TPZ implementation. Each tree in this figure represents an entire forest, where the single tree results are averaged to obtain a final prediction. The classification mode predicts the probability that a source lies within each bin, thereby building up a photo-z PDF, while the regression mode keeps all sources found on a terminal leaf and combines their values to construct a photo-z PDF at the desired resolution. For both modes, ancillary information can be provided, and both modes share the same data pre-processing steps.
To demonstrate the capabilities and the efficacy of the different parameter configurations of the TPZ code, we have used several different photometric and spectroscopic data sets that vary both in quantity and quality. In this section, we briefly discuss these data sets and the specific data samples from each that we used in the testing process, which is further described in §.

The Sloan Digital Sky Survey (SDSS; York et al. 2000) conducted, in its phase I and phase II, a photometric survey in the optical bands u, g, r, i, and z that covered almost 10,000 square degrees, or approximately one-fourth of the entire sky. The resultant photometric catalog contains photometry of over 10^8 galaxies, making the SDSS one of the largest surveys ever performed. The SDSS also conducted a spectroscopic survey of targets selected from the SDSS photometric catalog, obtaining spectra of about 10^6 low redshift galaxies. In this paper, we use a subset of the Main Galaxy Sample (MGS; Strauss et al. 2002) from the Data Release 7 catalog (Abazajian et al. 2009). Specifically, we selected 55,000 galaxies by using the online CasJobs website (http://casjobs.sdss.org/CasJobs/). This spectroscopic data set ranges from z ≈ 0.02 up to z ≈ 0.3, and as attributes we use the four colors u−g, g−r, r−i, and i−z. We use the SDSS colors, as opposed to the more commonly used magnitudes, for this particular test both to demonstrate the flexibility of TPZ and to generate scientifically more interesting ancillary information.

The PHoto-z Accuracy Testing (PHAT; Hildebrandt et al. 2010) project first compared the performance and systematics of different photo-z codes on synthetic data (PHAT0) that was specifically created for a contest, and more recently did so with real data (PHAT1) in a similar manner, thereby providing a more realistic comparison based on real measurements. The PHAT project provides filter responses for photo-z estimation by SED-fitting methods and a training data set for photo-z estimation by empirical methods. The true redshifts of the test data are not public, which provides a more reliable, blind comparison between different approaches (see Hildebrandt et al. 2010, for more details about the contest). In this paper, we use the PHAT1 data, which consist of real observations selected from the Great Observatories Origins Deep Survey Northern field (GOODS-N; Giavalisco et al. 2004). These data include photometry from the original ACS four-band data, F435W (B), F606W (V+R), F775W (i'), and F850LP (z'), which have been cross-matched with photometry from Capak et al. (2004), including U (from KPNO), B_J, V_J, R_C, I_C, and z' (from SUBARU), and HK' (from QUIRC). In addition, the PHAT1 photometry includes deep J and H bands (from ULBCAM; Wang et al. 2006), K_S (from WIRC; Bundy et al. 2005), and the four Spitzer IRAC bands at 3.6, 4.5, 5.8, and 8.0 µm. This photometric catalog was cross-matched with all available spectroscopic GOODS-N data (Cowie et al. 2004; Wirth et al. 2004; Treu et al. 2005; Reddy et al. 2006), producing a final data set of eighteen-band photometry and spectroscopy for 1,984 galaxies.

For the contest, only 515 galaxy redshifts were published for use as training data; the remaining redshifts were unpublished and used internally by the PHAT project to conduct a blind comparison test. Despite the limited training data, multiple authors submitted photo-z predictions, and the results were published in Hildebrandt et al. (2010). As the contest had already been completed when we started this work, we were unable to participate; however, as discussed in §, we have applied TPZ to the PHAT1 data in the same blind manner.

The Deep Extragalactic Evolutionary Probe (DEEP) is a multi-phase, deep spectroscopic survey performed with the Keck telescopes. Phase I used the LRIS instrument (Oke et al. 1995), while phase II used the DEIMOS spectrograph (Faber et al. 2003). The DEEP2 Galaxy Redshift Survey is a magnitude limited spectroscopic survey of objects with R_AB < 24.1, observed in the B, R, and I bands, and it has been recently extended
by cross-matching the data to other photometry databases. In this work, we use the Data Release 4 (Matthews et al. 2013), the latest DEEP2 release, which includes secure and accurate spectroscopy for over 38,000 sources. The photometry for the sources in this catalog was expanded by using two u, g, r, i, and z surveys: the Canada-France-Hawaii Telescope Legacy Survey (CFHTLS; Gwyn 2012) and the SDSS. For additional details about the photometric extension of the DEEP2 catalog, see Matthews et al. (2013).

To use the DEEP2 data with TPZ, we selected sources with secure redshifts (ZQUALITY ≥ 3) that were securely classified as galaxies, have no bad flags, and have full photometry. Even though the filter responses are similar, the u, g, r, i, and z photometry come from two different surveys and are thus not identical. We therefore treat those galaxies with SDSS photometry in fields 2, 3, and 4 of the DEEP2 target areas independently from those in field 1 with CFHTLS photometry. In the end, this leaves us with a total of 19,699 galaxies with eight-band photometry and redshifts, of which we use 10,000 for training and hold the rest for testing.

Figure 4. Photometric vs. spectroscopic redshift for all SDSS MGS test galaxies using regression mode (Left) and classification mode (Center). (Right) A comparison of the bias (upper panel) and the scatter (lower panel) as a function of redshift for the SDSS MGS data by using the regression mode (blue dots) and the classification mode (green squares).

In this section, we apply the photo-z estimation technique presented in § to the data sets described in §3. Since the point of this paper is to introduce the TPZ algorithm and our associated implementation, we use these three different data sets to highlight different features of the code. Thus we do not apply TPZ uniformly to each data set, and the three subsections herein are necessarily different.
We first apply TPZ to the SDSS main galaxy sample, using both the regression and the classification methods as explained in §.

Table 1. A comparison between the regression mode and the classification mode for the SDSS MGS galaxies with different confidence level restrictions.

Implementation        <∆z> [10^−…]   σ_∆z [10^−…]   Fraction^a
Reg All                   −0.08          2.25         100%
Class All                  2.18          2.46         100%
Reg zConf > …             −0.20          2.18         98.2%
Class zConf > …           −0.15          2.34         94.2%
Reg zConf > 0.75          −0.33          1.97         91.0%
Class zConf > 0.75        −0.80          2.20         73.5%
Reg zConf > …             −0.23          1.76         67.3%
Class zConf > …           −0.92          1.82         34.7%

^a Fraction of galaxies remaining after a cut on zConf.

The left and central panels of Figure 4 show the photometric versus spectroscopic redshifts for the galaxies held out for testing from the SDSS MGS, for the regression and classification modes, respectively. Both implementations show similar performance in the central part of the redshift distribution; however, there are differences at both the low and high redshift ends of this sample. The right panel shows both the mean of the bias, defined as ∆z = z_spec − z_phot, and its scatter in eight redshift bins. The regression mode performs slightly better in all redshift bins, especially in the first and last bins, where the classification mode shows systematic errors. This error arises from the lack of training data at those redshifts for the classification mode: although we allow some overlap between bins, we keep the bin size constant, which can result in large differences in the number of training objects per bin. This reduction is most pronounced in the lowest and highest redshift bins, which results in a lower accuracy and a higher scatter. At the low redshift end we are also affected by the fact that a predicted redshift cannot be negative, thus introducing a positive skew into the predicted redshift values at very low redshifts.

Since both implementation modes produce photo-z PDFs, we can compute confidence levels, zConf, around the mean (or mode) of each individual PDF. To simplify
comparisons with past results, we define zConf as the integrated probability between z_phot ± σ_TPZ (1 + z_phot). We select σ_TPZ = 0.03 as an approximation to the intrinsic scatter of the algorithm when applied to the data, which can be computed by using the OOB data. Of course, we could define zConf in some other manner, but the results would be relatively unaffected. Figure 5 presents four different PDFs taken from the SDSS MGS, each with a different confidence level, shown as a bounded gray area under each PDF curve. In this example, we measured zConf around the mean of each PDF, and the actual spectroscopic redshifts are shown as vertical dashed lines for reference. From this figure, we see that zConf provides a reasonable summary of the concentration of the PDF and can, therefore, be used to further restrict a photo-z sample by selecting only those PDFs with a zConf value above some threshold. In general, as shown in this figure, lower confidence values are strongly correlated with less accurate predictions. Nevertheless, it is still possible to have a small fraction of galaxies with high-zConf PDFs that are estimated at the wrong redshift. We discuss the zConf parameter and its use in identifying a clean galaxy sample in further detail in §.

Figure 5. Four example PDFs produced by TPZ for the SDSS MGS, selected with different values of zConf. The higher the value of zConf, the more narrowly concentrated the PDF is about the mean. The vertical dashed line corresponds to the spectroscopic value of each test galaxy, and the gray area encloses the confidence level.

Table 1 compares the two modes for increasingly restrictive cuts on zConf. As before, we see that, on average, the regression mode outperforms the classification mode on this data set, although the difference is reduced when we apply a cut on the confidence level. Interestingly, at more restrictive zConf cuts, the performance of both modes is similar; however, the number of galaxies remaining in the regression mode sample is higher. Note that since these are averaged values over the sample, any minor change implies a significant change in individual calculations. As a result, we believe that making a cut on zConf results in a cleaner sample, as shown by the improved performance metrics for either implementation mode.
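On a discrete redshift grid, the zConf definition used above can be sketched as (a minimal illustration of the definition, not the TPZ code):

```python
import numpy as np

def z_conf(z_grid, pdf, z_phot, sigma_tpz=0.03):
    """Integrated probability of the PDF within
    z_phot +/- sigma_tpz * (1 + z_phot); sigma_tpz approximates the
    intrinsic scatter of the algorithm (0.03 for the SDSS MGS test)."""
    half_width = sigma_tpz * (1.0 + z_phot)
    inside = np.abs(z_grid - z_phot) <= half_width
    return pdf[inside].sum() / pdf.sum()
```

A narrowly concentrated PDF returns a value near one, while a broad or multi-modal PDF returns a low value and can be removed by a threshold cut.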
The difference in the fraction of galaxies that remain in each sample indicates that, on average, PDFs generated by the classification implementation are broader than PDFs generated by the regression implementation. This result is reasonable, as the classification mode bins the redshift space and provides probabilities for all bins, which can produce a more sparse distribution. In the classification mode the probabilities are computed individually for each redshift bin, which could be important and easily extended to build a prior distribution for use in a Bayesian method. Since the regression mode was shown to be more accurate for the SDSS (see, e.g., Figure 4 and Table 1), we use the mean of the PDF as calculated by the regression mode on the SDSS MGS data in the rest of this section, unless otherwise indicated.

We can broadly compare our use of zConf to define clean galaxy samples with other published results; we note that a direct, one-to-one comparison is problematic due to the different training sets and attributes used in computing photometric redshifts for the SDSS main galaxy sample. If we take zConf ≥ 0.75, we keep 91% of the data and compute the fractions of galaxies with |∆z| < z_i, where z_i = 0.001, 0.002, and 0.003, as 45.2%, 73.0%, and 89.8%, respectively. These values compare favorably to those of Laurino et al. (2011) who, even though they used an extended catalog, compute these same values to be 43.4%, 72.4%, and 86.9%, with a mean bias of <∆z> = 15 × 10^− and σ_∆z = 1. × 10^− (these latter values can be compared with our results shown in Table 1). Finally, we note that making a strict cut of ∆z > 0.006 identifies an outlier fraction of 1.54%, while other groups, also using extended catalogs, have reported values of 1.9% (Gerdes et al. 2010) and 2.6% (Oyaizu et al. 2008b).
As detailed in §, for each tree we use the OOB data to compute the mean bias (i.e., <∆z>) and its standard deviation (i.e., σ_∆z); finally, we average these metrics over all trees. The top panel of Figure 6 presents the mean bias as a function of redshift taken both from the test data (blue line) and from the OOB data used during the training process (green line), while the bottom panel presents the standard deviation in each redshift bin. The RMS of these values provides an approximation to the intrinsic error and scatter of the TPZ code, which can be used to compute confidence levels. From the OOB data, we compute the RMS of the bias to be 0.….

Figure 6. (Top) The averaged ∆z as a function of redshift for all test galaxies from the SDSS MGS (blue circles) and from the OOB (Out-Of-Bag) data computed individually for each tree and subsequently averaged over the forest (green squares). (Bottom) The standard deviation of ∆z as a function of redshift for the test set (blue circles) and the OOB data (green squares). In this case, the OOB data provide an unbiased upper limit for these metrics.

The OOB data were not used to train a particular tree, yet the full data are used when building the forest through the bootstrap samples. If we had instead run all of the training data through the forest after it was constructed, without using the OOB approach, we would have obtained biased (although lower) values for these metrics; such an approach would thus not provide a prior estimate of the accuracy of TPZ.
With the OOB data, we compute these unbiased estimates a priori, exclusively from the training set, without the need for a validation set, allowing us to take full advantage of all available spectroscopic data.

The OOB data can also be used to compute the relative importance of each attribute, which is done by permuting each of the attributes in the OOB data when training the tree. The result of this process can be directly compared with the unperturbed case using the OOB data, as shown in Figure 7. In this figure, the left panel shows the relative importance factor, computed by using the absolute value of the OOB bias as the comparison metric, for the four colors used to build the regression trees for the MGS sample. In this plot, a factor of one implies that the attribute acts as a random variable, since a perturbation along that direction produces no change; any value greater than one produces a change in the bias, making it larger and therefore less accurate.

From this figure, we see that the g−r color shows the largest relative importance factor, close to four, meaning that the absolute bias, on average, changes by this factor when this color is randomly perturbed. On the other hand, the i−z color is, on average, the least relevant attribute in this context, with a relative importance factor less than 1.5. Due to the limited number of attributes in this test, however, removing this last color actually produces slightly worse results; in the general case, when more attributes are present, removing less important variables will improve the results. While this result might seem counter-intuitive, it arises naturally from the random nature of the tree construction: since only m attributes (e.g., three) are randomly selected to decide the split dimension, an attribute with overall low importance can occasionally be selected to split a node.
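This permutation test can be sketched as follows (a minimal illustration with a generic `predict` function standing in for a grown tree; in TPZ the calculation is done per tree on its own OOB data):

```python
import numpy as np

def importance_factors(predict, attrs, z_spec, seed=0):
    """Relative importance I_A per attribute: the mean absolute bias
    after randomly permuting that attribute alone, divided by the
    unperturbed mean absolute bias.  I_A near one means the attribute
    behaves like a random variable; larger values mean it carries
    more information."""
    rng = np.random.default_rng(seed)
    base = np.mean(np.abs(z_spec - predict(attrs)))
    factors = []
    for j in range(attrs.shape[1]):
        shuffled = attrs.copy()
        rng.shuffle(shuffled[:, j])  # permute one attribute, keep the rest
        factors.append(np.mean(np.abs(z_spec - predict(shuffled))) / base)
    return np.array(factors)
```

Permuting an attribute the predictor never uses leaves the bias unchanged (a factor of exactly one), while permuting an informative attribute inflates it.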
By omitting attributes with lower importance, we force the trees to be built from attributes with greater information content, thereby improving the accuracy of the prediction.

Another interesting point is that the relative importance of both of the aforementioned colors remains consistent, independent of redshift, while the other two colors show variation (i.e., u−g and r−i exchange importance ratings more than once), although they are overall consistent with each other. This behavior is mainly due to important spectral features, such as the 4000 Å break, passing between different filters, which TPZ identifies algorithmically as important indicators of a galaxy's redshift. We see this result from another perspective in the central panel of Figure 7, which presents the RMS of the relative importance, sorted by rank, for the four colors computed by using the absolute bias (blue line) and the variance (red line). Both metrics rank the attributes in the same order, and either can be used to compute their importance for the data set. Perturbing the attributes produces a stronger effect on the absolute bias than on the scatter, mainly because when we perturb one dimension we lose information and thereby increase the likelihood that a galaxy will end up in a random branch of the tree, especially for an important attribute. This would likely lead to a misclassification, which directly affects the mean absolute bias.

Relative Importance
The importance rank can also be used to better understand the training data, to check whether it is possible to reduce the dimensionality of the problem, and to identify areas of the mapped parameter space where new training data can be most effectively incorporated. This latter point can be accomplished by identifying the leaf nodes, and the galaxies contained therein, for each tree and computing their accuracy in predicting the OOB data, along with their proximity matrices; by averaging these results over all trees, we obtain the desired map.

For example, by using the two most important attributes previously identified for the SDSS MGS (g−r and u−g), we present a heat map in the right panel of Figure 7 that encodes the binned performance of these two attributes, where higher values indicate lower predictive success in that bin. In this plot, we see there are a few bins where performance is markedly lower (blue and light blue squares) and several areas that are lower than average (the yellow bins). On the other hand, there are two areas where the predictive power is quite high (deep orange-red), which are likely the result of the known color bi-modality of SDSS galaxies (Strateva et al. 2001), where early-type galaxies lie in the upper right part of this plot and late-type galaxies lie in the bottom left. The areas in this heat map where the predictive performance is low can be caused by a lack of training data, by galaxies with color degeneracies, or by galaxies with higher than normal magnitude errors. As a result, these areas can be prioritized for follow-up observations to improve the performance of the photo-z estimation.

Figure 7. (Left): The attribute importance factor I_A as a function of redshift for the four attributes (SDSS colors) used in this analysis, for the bias only. This factor quantifies how much the metric changes as we permute the attributes one at a time. (Central): RMS of the relative importance factor as a function of the attributes, computed by using the bias (blue) and the scatter (red). (Right): A heat map constructed by using the two most important attributes, which indicates areas of parameter space where the photo-z prediction is poor. The higher the value (i.e., the bluer) in a region, the more training data are needed to increase the accuracy of photo-z estimation within that region. These zones might also contain outliers or galaxies with bad photometry.

Identifying new training data
Previously, we had stated that the relative importance of the different attributes, graphically shown in the heat map in the right panel of Figure 7, could be used to optimally identify new training data. We test this assumption by first randomly selecting 1,000 galaxies as our training set, in order to simulate a poor training set, so that we can quantify the effects of both randomly adding new data and selectively adding new data by using the relative importance. We perform this test by first adding 1,000 new galaxies and then by adding 2,000 galaxies, computing the mean normalized bias, defined as ∆z' = (z_spec − z_phot)/(1 + z_spec), and its standard deviation as we change the size of the training set, using the four color attributes from the SDSS MGS and a forest with 100 prediction trees.

We summarize these test results in Table 2. As shown in the table, selecting galaxies from those zones with lower accuracy, as indicated by the heat map, produces more accurate predictions than adding galaxies randomly. In fact, even adding 1,000 galaxies by using the heat map produces slightly better performance than adding 2,000 galaxies randomly. These results indicate that it is more important to selectively add galaxies in areas where the prediction is poor than to simply increase the size of the training set.

We continue this process by repeatedly adding either 1,000 or 2,000 new galaxies to the training set. As the bottom panel of Figure 9 for the SDSS MGS demonstrates, after about 5,000 galaxies (or half the size of our full training set), the performance metric shows little variation, which is also reflected in the last row of Table 2, where the metrics for the 15,000 galaxy training set are presented for comparison. This test demonstrates how current and future photometric surveys can optimally construct training sets by either selectively using existing observations or by obtaining new spectroscopic observations to improve the photo-z estimation.

Table 2. A comparison of the performance of TPZ for the SDSS MGS when extra data are added to the training set either randomly or by selectively using ancillary information (columns: number of training galaxies, <∆z'>, σ_∆z').

Error distribution
After applying TPZ to the SDSS MGS, we can estimate photo-z errors directly from the estimated PDF by computing either the mean, the mode, or some other statistic of this distribution. As a demonstration, we calculate the error σ as the region of the photo-z PDF, centered on the mean, that contains 68% of the cumulative probability. We next calculate the distribution of these standard errors by computing (z_phot − z_spec)/σ for each PDF, which is shown as the black points in Figure 8. For unbiased standard error estimates, this distribution should be normally distributed with zero mean and unit variance. When we fit our measured points, we obtain a Gaussian with mean equal to 0.112 and width 0.949, which is shown by the solid green curve. This simple error estimate is quite close to the unbiased expectation, as we would expect for any reliable technique. The fit is not a perfect Gaussian due to a slightly extended tail on the left hand side of the distribution. We interpret this as a manifestation of the very narrow PDFs we have obtained and of the fact that the SDSS MGS is concentrated at lower redshifts, where most photo-z techniques suffer from a small tendency to over-predict the photometric redshifts, as shown in the left panel of Figure 4.
The photometric standardized error, ( z phot − z spec ) /σ , for the MGS galaxies (black dots) using the meanof each individual PDF and the best fit Gaussian with µ = 0 . σ = 0 .
949 (solid green curve).
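As a concrete sketch of this error estimate, the snippet below (illustrative only; the function name, grid, and toy PDF are our own, not TPZ's API) widens a window centered on the PDF mean until it encloses 68% of the probability:

```python
import numpy as np

def pdf_sigma_68(z_grid, pdf, frac=0.68):
    """Half-width of the window centered on the PDF mean that encloses
    `frac` of the probability -- the simple error estimate used above."""
    dz = z_grid[1] - z_grid[0]           # uniform grid assumed
    pdf = pdf / (pdf.sum() * dz)         # normalize to unit area
    mean = (z_grid * pdf).sum() * dz
    for half in np.arange(dz, z_grid[-1] - z_grid[0], dz):
        mask = np.abs(z_grid - mean) <= half
        if (pdf[mask].sum() * dz) >= frac:
            return half
    return np.nan

# toy check: for a Gaussian PDF the 68% half-width is ~ its own sigma
z = np.linspace(0.0, 2.0, 2001)
pdf = np.exp(-0.5 * ((z - 0.5) / 0.05) ** 2)
sigma = pdf_sigma_68(z, pdf)
pull = (0.52 - 0.50) / sigma   # (z_phot - z_spec) / sigma for one galaxy
```

Applying this to every galaxy yields the sample of standardized errors, (z_phot − z_spec)/σ, that is histogrammed and fit with a Gaussian in Figure 8.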
Size of forest
When we construct a forest for prediction, one parameter that must be specified is the number of trees to build. This is important, as the more trees in the forest, the higher the computational demands, which slows the training process and the construction of photo-z PDFs. Thus, we test the performance of TPZ for the SDSS MGS by varying the number of trees built for our forest for a fixed-size training sample. As before, we compute the mean of the absolute bias and its standard deviation, and present how these quantities vary as the number of trees in our forest changes for a fixed training size of 10,000 galaxies.

These results are presented in the top panel of Figure 9, which shows that our algorithm does become more accurate as the number of trees increases. However, after around 100 trees, the predictive power of the forest shows little variation, indicating that this is a reasonable number of trees for this prediction process. Breiman (2001) demonstrated that, as the number of trees in a random forest increases, any margin function will converge to a limit value. Thus, as expected, we see our generalized error value converging. This implies that, in comparison to other machine learning methods, our technique does not over-fit the data as more trees are added.
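TPZ's own training loop is not reproduced here; as an illustrative analogue (scikit-learn's RandomForestRegressor on synthetic colors, both our own choices rather than the TPZ implementation), the convergence of the scatter with forest size can be checked like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# synthetic stand-in for the SDSS colors: 4 attributes with a smooth
# (noisy) mapping to a mock "spectroscopic redshift"
n = 3000
colors = rng.uniform(0.0, 2.0, size=(n, 4))
z_spec = (0.1 + 0.3 * colors[:, 0] + 0.1 * np.sin(3.0 * colors[:, 1])
          + 0.02 * rng.normal(size=n))
train, test = slice(0, 2000), slice(2000, None)

scatter = {}
for n_trees in (5, 100):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(colors[train], z_spec[train])
    dz = (z_spec[test] - rf.predict(colors[test])) / (1.0 + z_spec[test])
    scatter[n_trees] = dz.std()   # normalized-bias scatter, as in Figure 9
```

Plotting the scatter over a denser grid of forest sizes reproduces the qualitative behavior of the top panel of Figure 9: rapid improvement at first, then a plateau.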
Training size
Once we know the optimal number of trees that must be built for our forest, we next need to know the optimal size of our training set. By using 100 trees (as determined in the previous section), we vary the size of our training set and present the results in the bottom panel of Figure 9. As shown in this figure, the accuracy of TPZ for predicting photo-z does not change significantly after using around 70% of the galaxies. This is an interesting result, one that our approach quantifies in an elegant manner, but which will obviously vary between different data sets. Fundamentally, as the training set increases in size, the prediction accuracy also increases, until most of the multi-dimensional parameter space has been sampled and little extra information is added by new training galaxies.

Figure 9. The absolute mean normalized bias, defined as |∆z′| = |z_spec − z_phot|/(1 + z_spec), and its scatter as a function of the number of trees in the forest, keeping the training set fixed (top). The same two values as a function of the size of the training set, keeping the number of trees fixed at 100, for galaxies in the SDSS MGS (bottom).

Of course, in this test we have not used the relative importance of our parameter attributes, as shown, for example, in the central panel of Figure 7. By manually selecting additional data, we should be able to reduce the values of these metrics significantly, which is discussed in the next section. But even in our current approach, we expect that some of our test data are not well represented in our training set, which will limit the accuracy of this approach. We see this as an opportunity, however, as we can compute a cross-data proximity matrix by using the trained forest to identify galaxies within the test data that are isolated, with few neighbors in the parameter space. Once identified, these galaxies could be treated individually by using, for example, other photo-z estimation techniques (see, e.g., Carrasco Kind & Brunner 2013, in preparation).

PHAT1

We also tested TPZ on the PHAT1 dataset, a blind comparison of photo-z techniques. In this contest, only a limited quantity of training data are provided; we have approximately 500 galaxies to train our algorithm for the approximately 1,500 galaxies that form the validation sample. These data also have a sparse redshift distribution, extending up to z ∼ 6. Despite these limitations, we applied TPZ to this training data, submitted our results to the contest, and obtained the resulting performance metrics from the PHAT leader (H. Hildebrandt, private communication). We present our specific results in Table 3, which can be compared directly with the results shown in Table 5 of the PHAT paper (Hildebrandt et al. 2010).
Table 3. The TPZ results for the PHAT1 catalogue, both with and without the IRAC bands, and for all galaxies and for a magnitude-limited sample with R < 24. Note that these are the same statistics presented in Table 5 of Hildebrandt et al. (2010) for other photo-z estimation techniques.

Run                    bias^a    scatter^b    outlier rate^c
18 bands               −0.002    0.055        ≈14%
14 bands               −0.007    0.055        ≈12%
18 bands, R < 24       −0.004    0.055        ≈11%
14 bands, R < 24       −0.009    0.054        ≈9%

^a bias is defined as ∆z′ = (z_spec − z_phot)/(1 + z_spec)
^b RMS of the bias ∆z′
^c outliers are defined as objects with |∆z′| > 0.15

We computed validation results for four different photometric samples: by using all eighteen photometric bands, by omitting the Spitzer photometry and using only fourteen photometric bands, and by creating magnitude-limited (R < 24) versions of each of these two galaxy samples. For these validation runs, we use the regression mode to create a forest of 150 trees with m∗ = 4 (as described earlier), and we did not use the zConf parameter, so that we could more directly compare our results to the other competitors. In the end, the TPZ results are among the most accurate photo-z predictions, especially when compared to other empirical training codes. Interestingly enough, TPZ even outperforms some template photo-z techniques, which are supposedly better suited for this particular challenge due to the dearth of training data and the large redshift range covered by the validation sample. These results show that even in less than ideal conditions, TPZ provides a robust estimation of photometric redshifts. Note that, due to the lack of training data and the extended redshift distribution of the validation sample, we did not generate ancillary information for these data by using the OOB approach.

DEEP2
We have also tested TPZ by using the DEEP2 redshift survey data, which extend to much higher redshifts than the SDSS MGS. We follow a similar analysis to the one we used with the SDSS MGS, and after we compute the photo-z PDFs, we select only those galaxies with zConf > 0.7, which includes about 81% of the galaxies. We find that the average bias, using ∆z′ = (z_spec − z_phot)/(1 + z_spec), is −0.007 with σ_∆z′ = 0.059, and an outlier rate, defined as |∆z′| > 0.15, of 2.9%. We know of no previous photo-z analyses of these data. In Figure 10, we compare the photo-z computed by using the mean of each individual PDF with the spectroscopic redshift for the 7,856 galaxies. In this figure, we also show the median, shown by the black dots, and the tenth and ninetieth percentiles, shown by the black error bars, within spectroscopic bins.

Figure 10. The TPZ photo-z (with zConf > 0.7) versus the spectroscopic redshift for the DEEP2 galaxies. The black dots show the median z_phot, and the error bars correspond to the tenth and ninetieth percentiles within a given spectroscopic bin.

The agreement is good across all redshifts, and both the isodensity contours and the error bars indicate that there are few outliers or catastrophic photo-z. However, at both ends of the distribution, we see several bins where the photo-z results are less accurate: they are systematically higher for the first two bins and systematically lower for the last two bins. This effect is often seen with empirical techniques, as the spectroscopic training samples are often less complete at these redshifts; see, e.g., the redshift distribution in Figure 15. Another effect causing this skewness is that estimated photometric redshifts cannot be negative, so our probability distribution cannot be symmetrical at the low redshift end. Another possible explanation for the low redshift systematic is the effect of galaxy inclination and the induced extinction on photo-z prediction, as shown recently by Yip et al. (2011).

Likewise, the systematic underestimation at higher redshifts is likely affected by the fact that many of these galaxies are near the limit of the photometry and thus have higher than average magnitude errors. In combination with the lower density of training data, this will reduce the efficacy of our photo-z technique. To understand this effect, recall that our trees are built from objects whose photometry is sampled by assuming a normal distribution defined by the magnitude and magnitude error from the bootstrap samples. As the magnitude error increases, the range of possible values to sample increases, thereby producing a sparser sampling for this galaxy within our forest. Since there are few galaxies with redshifts above 1.3 in the training data, the branches of the forest for high-z galaxies are mainly dominated by training galaxies with redshifts closer to 1.
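This error-driven sampling can be sketched as follows (a minimal illustration with made-up magnitudes and an illustrative function name, not the TPZ implementation itself):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturbed_sample(mags, mag_errs):
    """One tree's view of the training set: a bootstrap resample whose
    magnitudes are jittered by N(0, mag_err), per the scheme above."""
    idx = rng.integers(0, len(mags), size=len(mags))  # sample with replacement
    return mags[idx] + rng.normal(0.0, mag_errs[idx])

# effect of the jitter alone: a faint galaxy with a large magnitude error
# is spread over a far wider magnitude range across trees than a bright
# one, diluting its contribution to any particular branch
bright = 20.0 + rng.normal(0.0, 0.02, size=1000)  # sigma = 0.02 mag
faint = 24.5 + rng.normal(0.0, 0.30, size=1000)   # sigma = 0.30 mag
```

Near the survey limit, the large per-tree jitter is what thins out the high-z branches described above.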
As we build the PDFs for the high-z galaxies, the PDFs will be positively skewed, and thus the mean value of each PDF will tend to be at lower redshift values. We demonstrate this skewness in Figure 11, which shows the average skewness of the photo-z PDFs and the one-sigma error as a function of the spectroscopic redshift. These quantities are computed from the third standardized moment:

S_k = ∫ ((z − z̄)/σ_z)³ p(z) dz    (8)

with

z̄ = ∫ z p(z) dz  and  σ_z² = ∫ (z − z̄)² p(z) dz,    (9)

where the integrals are computed over the redshift domain, and p(z) is the photo-z PDF.

Figure 11. The skewness of the photo-z PDF as a function of spectroscopic redshift. The solid black line is the mean of the skewness, and the pink shaded region corresponds to the one-σ interval. Positive skewness indicates a PDF skewed to lower redshifts.

We can see that for redshifts up to 1.1 the average skewness is very close to zero, showing a small trend to negative values, which will, on average, produce lower values for the mean photo-z. At higher redshifts, however, there is a clear increase in the average skewness, which will tend to produce lower values for the mean of the PDF. It is important to note that even though these PDFs may be (slightly) skewed, they still predict sufficient probability near the true redshift, information that is overlooked by other methods that use one-point predictions. On the other hand, a catastrophic photo-z would have a symmetric PDF centered near the wrong redshift, which is not what we observe here.

Relative Importance
By using OOB data, we have computed metrics from the training data that we compare, in Table 4, to the metrics obtained from the test data after the photo-z distributions were computed. The first two rows of this table show the complete results for all attributes for the DEEP2 galaxies. From this we see that there is strong agreement between the OOB and test data results for both the bias and the variance. We also computed the relative importance of the eight photometric bands and the RG attribute, which is the estimated R-band radius of an object in 0.207" pixels (i.e., the sigma of the Gaussian fit to the light distribution).

In the left panel of Figure 12, we present the attribute importance factor as a function of redshift for the three most and the two least important attributes.

Table 4. A comparison of the accuracy of photo-z prediction by using different attribute combinations from the DEEP2 data for all test galaxies. The first row gives the metrics for TPZ using only OOB data, which are comparable to the values obtained from the full test data, shown in the second row. The remaining rows provide these metrics for training data that have had the indicated number of attributes removed from the calculation.

Attribute Selection             <|∆z′|>    σ_∆z′
All attributes (OOB metrics)    0.052      0.053
All attributes                  0.047      0.049
Remove 2 least important        0.044      0.046
Remove 2 most important         0.061      0.068
Remove 4 least important        0.044      0.048
Remove 4 most important         0.070      0.084

From this figure, we see that the R band and the r band are the most important attributes for making a photo-z prediction, similar to the g − r color for the SDSS MGS. This demonstrates, by a purely statistical analysis, that in the optical regime the R band is the most effective attribute. These attributes show a peak in their importance between redshifts of 0.3 and 0.5. We attribute this increase to the presence of the 4000 Å break within these filters at these redshifts. Likewise, the next two most important attributes are the i band and the z band, which are likely important for the same reason, albeit over a slightly higher redshift range.

On the other hand, the two least important attributes are the B band and the RG attribute. As shown in this figure, the RG attribute does not contribute to the photo-z prediction, instead acting like a random variable and thus likely introducing extra noise into the calculation. We also see no clear evidence that the effect of this attribute changes with redshift. At low redshifts, this attribute could be affected by inclination angle or spectral type, while at higher redshifts, galaxies tend to be fainter and thus have smaller angular sizes.
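The permutation test behind these importance rankings is compact in code. In the sketch below, a toy linear model stands in for the trained forest, and all names are illustrative rather than TPZ's own:

```python
import numpy as np

rng = np.random.default_rng(1)

def permutation_importance(predict, X, y, n_repeats=5):
    """Importance of each attribute: the increase in mean |bias| when
    that column is shuffled, i.e. the permutation scheme described above.
    `predict` is any fitted model's prediction function."""
    base = np.abs(y - predict(X)).mean()
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break column j only
            scores.append(np.abs(y - predict(Xp)).mean())
        importances[j] = np.mean(scores) - base
    return importances

# toy model: "redshift" depends only on column 0; column 1 is pure noise
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = 0.8 * X[:, 0]
imp = permutation_importance(lambda A: 0.8 * A[:, 0], X, y)
```

Shuffling the informative column inflates the mean absolute bias, while shuffling the noise column leaves it unchanged, which is exactly how an uninformative attribute such as RG reveals itself.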
Presumably, these cumulative effects combine to erase any important information this attribute might provide to the photo-z calculation.

In the right panel of Figure 12, we present the mean relative importance for each attribute, computed from the changes in the mean (blue circles) and the scatter (red squares) when that attribute is permuted, similar to the central panel of Figure 7. Once again, both metrics agree with the importance ranking. In order to characterize the attributes and their computed ranking of importance, we have made the following tests. First, we removed the two least important attributes, after which we removed the two most important attributes, reincorporating the previously removed attributes. We then repeated this process, alternately removing the four least and the four most important attributes. In each case, we test whether TPZ is able to correctly recognize the attribute importances. These results are summarized in Table 4, where we use the absolute mean value of ∆z′ and its dispersion.

Figure 12. (Left): The variable importance factor, I_A, as a function of redshift for the three most and the two least important attributes (i.e., DEEP2 magnitudes), using the bias to quantify this importance. This index specifies how important an attribute is to the calculation of a metric when we permute the attributes one at a time. (Right): The RMS of the attribute importance factor as a function of the attributes, computed by using the bias (blue) and its scatter (red). Both of these metrics capture the same relative attribute importance.

Not surprisingly, we see that removing the two least important attributes does, in effect, improve the precision of TPZ while also making the code run faster, since we have fewer dimensions to check when splitting nodes within the tree, less data to keep in memory when building the tree (thus improving cache access), and faster random realizations of the input parameters, since there are fewer dimensions to sample. Yet, removing four attributes shows a slight decrease in the overall performance; in this case we have removed too much information. While this decrease might seem rather small, since we randomly select attributes when splitting nodes within the trees, by removing four attributes we have increased the scatter, as we are losing information. On the other hand, removing the most important attributes significantly affects the results, regardless of how many attributes we remove. As we would expect, the reason is clear: since these attributes carry the most information needed to subdivide the multidimensional parameter space in order to produce accurate photo-z, removing them negatively impacts the performance of TPZ.

In a further control test, we added two extra artificial variables to the data set, one of which is strongly dependent on the source redshift, i.e., a function of redshift, while the second is a uniformly distributed random variable. After computing their importance rankings, we can see from Figure 13 that TPZ recognized these two extra attributes and put them at the extreme limits of the importance ranking. The most important variable ranks at about eight, the random variable ranks at one as expected, while the r magnitude is ranked with a value close to two. We notice that the variable RG is very close to one, and therefore to a random variable. As discussed above, we can safely remove this variable from our calculation, as it does not provide any useful information. The legend on the plot also indicates the descending order in importance, in concordance with Figure 12.

Missing data
One interesting capability of TPZ is that it can be used to replace attributes in data that are either missing attributes or have attributes with large uncertainties. As discussed earlier, we estimate photo-zs for the clean test sample by using all three training samples: the original, clean sample (i.e., the control), the replaced attribute sample, and the cut sample. Likewise, we use the clean training sample to replace missing attributes and estimate photo-zs for the bad test sample. We present the results of these tests in Table 5, where we compare the photo-z estimation for the clean sample with the replaced and cut samples. For this comparison, we use ∆z_pp = z_phot,clean − z_phot,other and ∆mag = mag_clean − mag_other, along with their variances, where "other" can be either the replaced or the cut sample. As shown in this table, the replaced value sample produces, on average, superior photo-zs to the cut sample. Likewise, we have estimated robust photo-zs for the bad test sample, which significantly increases the size of our resulting test data. Dealing with missing attributes is important, especially when a spectroscopic training sample is limited or when cross-matching between incomplete catalogs is carried out in order to develop a more complete catalog for photo-z estimation.

Table 5. Photo-z estimation metrics to demonstrate the robustness of our missing attribute technique. The first two rows show the average bias and its variance between the estimated photo-z and the replaced magnitude when either removing or recovering bad data, in comparison with photo-zs predicted using the original, clean sample. The last row shows the same metrics calculated by using the clean test sample, but for data missing in the test sample as compared to the clean original sample.

Recovered train         <∆z_pp>^a    σ_z_pp^b    ∆mag     σ_mag
with removed data       −1.27        3.5         –        –
with recovered data     0.40         1.6         0.021    0.094
Recovered test          <∆z_pp>      σ_z_pp      ∆mag     σ_mag
with recovered data     0.72         4.5         0.033    0.12

^a in units of 10^−
^b in units of 10^−

Figure 13. The variable importance factor, I_A, as a function of redshift for the most and least important attributes, using the bias to quantify the importance ranking. As a control test, we added two artificial variables: an attribute that is a function of the spectroscopic redshift, and a uniformly distributed random attribute. TPZ is able to recognize these two extra attributes and rank them accordingly, as shown by the figure legend.

Photo-z PDFs and zConf
As discussed earlier, the zConf parameter can be used to identify galaxies with narrow, concentrated photo-z PDFs, which ideally will result in galaxy samples that have the most accurate photo-z estimates. The zConf parameter is demonstrated for DEEP2 galaxies in the left panel of Figure 14, which shows four representative photo-z PDFs selected with different values of zConf, as measured about the mean of each PDF. Both this figure and Figure 5, which presents four photo-z PDFs by using SDSS data, highlight the fact that wide and sparse distributions have low zConf values, while narrower PDFs have higher zConf values.

The goal of a parameter like zConf is to algorithmically identify galaxies that have, on average, the most accurate photo-z estimates. To test this hypothesis, we used all available DEEP2 training data to build our prediction trees and estimate photo-zs for the DEEP2 test sample. From this sample, we applied three zConf cuts: 0.5, 0.7, and 0.9, and calculated the bias and scatter as a function of redshift for the three resulting galaxy samples. We compare these results to the bias and scatter when no zConf cut is applied in the right hand panel of Figure 14. As shown in this figure, both the mean absolute bias and the scatter are reduced as zConf is increased, independent of redshift.

A simple, intuitive zConf threshold would be 0.5, as this selects galaxies that have a 50% probability that their photo-z estimate lies within the limits imposed by ±σ_TPZ(1 + z_phot). Higher values would provide more accurate results at the expense of reduced statistical power (i.e., a smaller final catalog). In Figure 14, for example, cuts on zConf at 0.5, 0.7 and 0.9 keep 90%, 76% and 38% of the galaxies from the original catalog, respectively. Alternatively, given the OOB data predictive results, a required accuracy or number density can be used to identify a suitable value of zConf.

N(z): An application of photo-z PDFs
Most of the results we have presented within this paper have been based on the estimation of a single metric computed from the photo-z PDF, for example the mean or mode. Obviously, using a single value to represent the PDF wastes significant information, but since many photo-z applications mimic spectroscopic redshift applications, new approaches must be developed to capitalize on the full information content of a photo-z PDF. As a result, we present a simple, yet very important application that uses the full photo-z PDFs: estimating the galaxy redshift distribution, N(z). This function is a fundamental measurement and is very important to a number of cosmological applications, including weak lensing tomography (e.g., Mandelbaum et al. 2008; Jee et al. 2012) and projecting three-dimensional theoretical power spectra to angular clustering measurements (Hayes et al. 2012).

We compute the normalized galaxy redshift distribution, N(z), for all the galaxies in the DEEP2 sample (i.e., no DEEP2 redshift confidence selection cut was applied), shown in Figure 15 as the shaded gray area. As demonstrated by this figure, most galaxies in this spectroscopic survey were selected to have redshifts between 0.6 and 1.2. Next, we compute the binned photometric redshift distribution by using the mean value from each photo-z PDF, shown by the red curve. While this curve does trace the gross features of the underlying spectroscopic redshift distribution, it fails to capture the full detail and can be significantly different at certain redshifts, including at the mode. For comparison, we show in black the photo-z PDF redshift distribution that we obtain by simply stacking the individual PDFs together. With this simple approach, we obtain a more accurate representation of the true sample redshift distribution. Here we have used all the galaxies, without selecting galaxies by their confidence level. This demonstrates that all individual PDFs computed with TPZ carry important information about the underlying distribution.

Figure 14. (Left): Same as Figure 5, but for four example galaxies taken from DEEP2. The vertical dashed line indicates the spectroscopic redshift and the gray area the zConf value. (Right): The absolute normalized bias and the scatter for galaxy samples defined by different zConf cuts, by using the mean of the photo-z PDF as our estimate.

These differences are more clearly exposed in the bottom panel of Figure 15, where we show the absolute fractional error, (N_phot − N_spec)/N_spec, as a function of redshift, using the same color scheme as before. From this figure, we see that the stacked PDF has a smaller error at almost all redshifts. In addition, the photo-z PDF redshift distribution is considerably smoother and looks more like a fit to the spectroscopic sample, which is another benefit of using the full photo-z PDF. For this particular demonstration, the photo-z PDFs used a bin size of 0.002, while the spectroscopic and photometric redshift distributions used a bin size of 0.03. Of course, we can generate smoother distributions for either the spectroscopic or the photo-z mean value redshift distributions by reducing the bin size; the trade-off, however, is that we run the risk of increasing the shot noise in the resulting distribution.

CONCLUSIONS

In this work we have presented and publicly released TPZ, a new, parallel, machine learning photo-z Python code that computes photo-z PDFs and also provides ancillary information about the photometric data. TPZ is based on the construction of prediction trees and, consequently, a random forest. Overall, TPZ is a three step algorithm that first preprocesses the data, completes galaxies with missing photometric values in an efficient manner, and also incorporates measurement errors. A photo-z PDF can be generated from the prediction trees in one of two modes: classification or regression. Both modes produce similar accuracies, but the regression mode is preferred when the training data are either poorly sampled or not uniformly distributed.
On the other hand, the classification mode provides a detailed synopsis of the redshift distribution that can be used to construct priors for use with other photo-z techniques. (TPZ is publicly available at http://lcdm.astro.illinois.edu/research/TPZ.html.)

We demonstrated the efficacy of the TPZ algorithm and its implementation by applying this new code to three different data sets: the SDSS main galaxy sample, the PHAT1 blind challenge, and the DEEP2 survey. With the SDSS MGS, we demonstrated that using confidence levels is important, as they improve the overall accuracy of our photo-z sample by selecting those galaxies with narrow PDFs. This technique is unique in the sense that it does not need a separate validation set, yet provides ancillary information by using OOB data. We have shown that with these data we obtain unbiased estimates of both the bias and the dispersion, which are very similar to the same values obtained from the test data for both the SDSS MGS and DEEP2. Obviously, this result is extremely important when working with data that have unknown redshifts.

TPZ not only provides these prior metrics, but it also provides a ranking of the relative importance of the different photometric attributes used by the code. This completely statistical process recovers what is naturally expected from physical consideration of these different attributes. With this importance ranking, we can construct a heat map of the different locations in parameter space that produce poor photo-z estimations. Furthermore, we demonstrated that by adding new, manually selected data we can produce more accurate photo-z predictions than by simply adding new galaxies randomly. This implies that we can optimally identify new training data for current and future photometric surveys, such as DES or LSST, in order to improve their photo-z predictions.

The attribute importance can also be used to remove those attributes that are least important, thereby improving the computational speed.
In addition, we demonstrated that the performance metrics converge as the number of trees in the forest increases, providing a further method to reduce the computational time, since we have a direct measure of the minimum forest size. Likewise, we also demonstrated by using the SDSS MGS that these same metrics converge with the number of training galaxies for a fixed forest size. Thus, except for adding in manually selected training data to improve areas with poor photo-z prediction, we have an explicit limit for the number of training galaxies needed.

Figure 15. (Top): The redshift distribution for the full DEEP2 spectroscopic sample of galaxies (shaded gray histogram), computed from the mean value of the individual photo-z PDFs (red curve), and computed by stacking the individual photo-z PDFs (black curve). (Bottom): The residual absolute error between the spectroscopic redshift distribution and the two photo-z redshift distributions, shown using the same color scheme.

Finally, with this technique we found that the error distribution was characterized by a Gaussian distribution with a mean very close to zero and a variance very close to one, indicating that the source of errors is relatively unbiased.

We ran our code on the PHAT1 blind challenge with excellent results; even with limited training data we were able to compute accurate photo-zs that were comparable to, if not better than, other empirical techniques as well as some SED fitting techniques. By using the DEEP2 redshift data, we tested TPZ over a large redshift range, obtaining very accurate results. In particular, we were able to identify the most important attributes, which in this case were the R band magnitude followed by the I band magnitude, and the least important attributes, which in this case were the RG attribute and the B band magnitude. Despite these impressive results, we still have slightly systematically biased photo-zs at very low and very high redshifts, which we primarily believe are caused by the low number of training data at these redshifts, and also by the fact that photo-z estimates cannot be negative. We also see a positive skewness in the photo-z PDFs at high redshifts. We believe this result is due to the fact that these galaxies tend to be fainter and have larger magnitude errors. These larger magnitude errors produce a sparser forest at higher redshifts, which is manifested as a lower photo-z PDF mean value at these same redshifts.

We have also demonstrated how the zConf parameter can be used to select galaxy samples that have improved photo-z estimates with minimal outliers.
A target value for this useful parameter can be set to a desired photo-z precision, either by calculating the value expected from the OOB data or as required by a specific cosmological measurement. Likewise, we have demonstrated how TPZ can efficiently handle missing data within a catalog. By artificially generating bad or missing parameter values within both the training and the testing data sets, we were not only able to robustly recover the missing parameters but, more importantly, to obtain new photo-z estimates that are consistent with the photo-z estimates from the original, full data set. Therefore, this technique increases the power of photo-z estimation by recovering missing data from the training catalog, as well as the power of our resulting sample statistics by recovering missing data from the application data set.

Finally, by calculating the normalized distribution of galaxies as a function of redshift, we were able to demonstrate the advantages of using a full photo-z PDF as opposed to using one single estimator of the PDF or any other point metric. Specifically, by simply stacking each individual PDF, we recover the underlying galaxy redshift distribution to a much higher precision than by simply using the mean of each individual photo-z PDF.

In conclusion, we note that since TPZ is an empirical algorithm, it is inherently dependent on the quality of its training data. Thus, as is the case with all empirical algorithms, TPZ is limited by the available spectroscopic training data. Furthermore, the application of TPZ to regions of parameter space beyond the limits of the training data (i.e., extrapolation) will be less reliable.
We do note, however, that TPZ provides ancillary information that can be investigated to better understand the limitations imposed by the training set, to identify the optimal locations within the application data space where new training data will be most useful, and to quantify the possible errors associated with the extrapolation of this technique. Another approach to improve the efficacy of photo-z estimation beyond the limits of a spectroscopic training sample would be to include the predictions from different, non-empirical techniques in a meta-classifier. We will explore this approach in future work on this topic.

ACKNOWLEDGEMENTS
The authors thank the referee for a careful reading of the manuscript. The authors also gratefully acknowledge the use of the parallel computing resource provided by the Computational Science and Engineering Program at the University of Illinois. The CSE computing resource, provided as part of the Taub cluster, is devoted to high performance computing in engineering and science. This work also used resources from the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.

MCK has been supported by the Computational Science and Engineering (CSE) fellowship at the University of Illinois at Urbana-Champaign. RJB has been supported in part by the Institute for Advanced Computing Applications and Technologies faculty fellowship at the University of Illinois.

Funding for the DEEP2 Galaxy Redshift Survey has been provided by NSF grants AST-95-09298, AST-0071048, AST-0507428, and AST-0507483, as well as NASA LTSA grant NNG04GC89G.

The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.