Improving Correlation Function Fitting with Ridge Regression: Application to Cross-Correlation Reconstruction
Preprint typeset using LaTeX style emulateapj v. 5/2/11

Daniel J. Matthews and Jeffrey A. Newman
Department of Physics and Astronomy, University of Pittsburgh, 3941 O'Hara Street, Pittsburgh, PA 15260
[email protected], [email protected]

Accepted to The Astrophysical Journal
ABSTRACT

Cross-correlation techniques provide a promising avenue for calibrating photometric redshifts and determining redshift distributions using spectroscopy which is systematically incomplete (e.g., current deep spectroscopic surveys fail to obtain secure redshifts for 30-50% or more of the galaxies targeted). In this paper we improve on the redshift distribution reconstruction methods from our previous work by incorporating full covariance information into our correlation function fits. Correlation function measurements are strongly covariant between angular or spatial bins, and accounting for this in fitting can yield substantial reductions in errors. However, the covariance matrices used in these calculations are frequently determined from a relatively small set (dozens rather than hundreds) of subsamples or mock catalogs, resulting in noisy covariance matrices whose inversion is ill-conditioned and numerically unstable. We present here a method of conditioning the covariance matrix known as ridge regression which results in a better behaved inversion than other techniques common in large-scale structure studies. We demonstrate that ridge regression significantly improves the determination of correlation function parameters. We then apply these improved techniques to the problem of reconstructing redshift distributions. By incorporating full covariance information, applying ridge regression, and changing the weighting of fields in obtaining average correlation functions, we obtain reductions in the mean redshift distribution reconstruction error of as much as ∼40% compared to previous methods. In an appendix, we provide a description of POWERFIT, an IDL code for performing power-law fits to correlation functions with ridge regression conditioning that we are making publicly available.
Subject headings: galaxies: distances and redshifts — large-scale structure of the universe — surveys — cosmology: observations

1. INTRODUCTION
Many of the cosmological measurements to be performed with future photometric surveys will require extremely well-characterized redshift distributions (Albrecht et al. 2006; Huterer et al. 2006; Ma et al. 2006). A key challenge will be to obtain redshift information for galaxies from large multi-wavelength samples, which include objects that are too faint for spectroscopy or fail to yield secure redshifts. A conventional approach is to use large sets of spectroscopic redshifts to establish a relation between redshift and color information to calibrate photometric redshifts (Connolly et al. 1995; Lima et al. 2008; Budavári 2009; Freeman et al. 2009). However, at the depths probed by these surveys, existing large redshift samples have all been highly incomplete, with a strong dependence of success rate on both redshift and galaxy properties (Cooper et al. 2006).

Newman (2008) described a new technique for calibrating photometric redshifts (commonly referred to as photo-z's) using cross-correlations, which exploits the fact that galaxies at similar redshifts tend to cluster with each other; in Matthews & Newman (2010) (from here on referred to as MN10) we tested this technique using realistic mock catalogs which include the impact of bias evolution and cosmic variance. We showed that for objects in a photometric redshift bin (e.g., selected using some photo-z-based algorithm), we can recover the true redshift distribution, φ_p(z), by measuring the two-point angular cross-correlation between objects in that bin and a bright spectroscopic sample in the same region of the sky, as a function of spectroscopic z. The cross-correlation signal at a given redshift will depend on both the intrinsic clustering of the samples with each other and the fraction of the photometric sample that lies at that redshift (i.e., φ_p). Autocorrelation measurements give information about the intrinsic clustering of each sample and can be used to break this degeneracy.

In MN10, we assumed for convenience that correlation function measurements in different angular/radial bins were completely independent. However, analytical models as well as simulations have shown that the covariance between bins is significant (Bernstein 1994; Zehavi et al. 2005; Crocce et al. 2011). Incorporating all available information about this covariance should provide better constraints on the correlation function parameters used in reconstructing φ_p(z). In this paper we improve on the methods of MN10 by accounting for this covariance.

However, the inversion of covariance matrices calculated from relatively small sample sizes (e.g. a modest number of mock catalogs or jackknife regions) is not well behaved: modest noise in a covariance matrix can yield large variations in its inverse. We therefore also incorporate ridge regression, a method of conditioning covariance matrices (i.e., stabilizing the calculation of their inverse) which is common in the statistics literature but novel to correlation function analyses, into our methods. We will then optimize the reconstruction of φ_p(z) by varying the level of this conditioning. Throughout this paper we assume a flat ΛCDM cosmology with Ω_m = 0.3, Ω_Λ = 0.7, and Hubble parameter H_0 = 100 h km s⁻¹ Mpc⁻¹; we have assumed h = 0.72, matching the Millennium simulations used for this work, where it is not explicitly included in formulae.

The structure of this paper is as follows. In §2 we describe the catalogs and data sets used. In §3 we provide a summary of the reconstruction technique used in Matthews & Newman (2010), as well as a description of the changes implemented for this paper. In §4 we detail the optimization of the reconstruction achieved using a risk analysis. We summarize the results and discuss the gains from ridge regression in §5. Finally, we include an appendix describing the public release of POWERFIT, an IDL program which performs fits for power-law correlation function parameters incorporating covariance information and conditioning via either ridge regression or singular value trimming.

2. DATA SETS
To test this method, it is necessary to construct two samples of galaxies, one with known redshift ("spectroscopic") and the other unknown ("photometric"). We have done this using mock DEEP2 Redshift Survey light cones produced by Darren Croton. A total of 24 light cones were constructed by taking lines-of-sight through the Millennium Simulation halo catalog (Lemson & Virgo Consortium 2006), with the redshift of the simulation cube used increasing with distance from the observer (Kitzbichler & White 2007). The light cones were then populated with galaxies using a semi-analytic model whose parameters were chosen to reproduce local galaxy properties (Croton et al. 2006). Each light cone covers the range 0.1 < z < 1.5 and subtends 0.5 × 2.0 degrees on the sky.

The spectroscopic sample is constructed by randomly selecting 60% of the objects with R-band magnitude R < 24.1, resembling the selection probability and depth of the DEEP2 Galaxy Redshift Survey. The mean number of spectroscopic objects in a single light cone is ≈35,000. The photometric sample is constructed by selecting objects with a Gaussian probability as a function of redshift, with ⟨z⟩ = 0.75 and σ_z = 0.20. This emulates choosing a set of objects which have been placed in a single photometric redshift bin by some algorithm with Gaussian errors. Although the selection function used is a Gaussian, the net z distribution of the photometric sample will not be a pure Gaussian, since the redshift distribution of the mock catalog we select from is not uniform. After applying this Gaussian selection to the mock catalog, we randomly reject half of the selected objects in order to reduce calculation time. The mean number of objects in the final photometric sample over the 24 light cones is ≈44,000. Throughout this paper, quantities related to the spectroscopic sample will be labelled with subscript 's', while those related to the photometric sample will be labelled with subscript 'p'.

3. METHODS
3.1. Correlation Functions
The basic quantities we will use to determine redshift distributions are the real-space two-point correlation function and the angular two-point correlation function. The real-space two-point correlation function ξ(r) is a measure of the excess probability dP (above that for a random distribution) of finding a galaxy in a volume dV, at a separation r from another galaxy: dP = n[1 + ξ(r)]dV (Peebles 1980), where n is the mean number density of the sample. The angular two-point correlation function w(θ) is a measure of the excess probability dP of finding a galaxy in a solid angle dΩ, at a separation θ on the sky from another galaxy: dP = Σ[1 + w(θ)]dΩ (Peebles 1980), where Σ is the mean number of galaxies per steradian.

To calculate correlation functions we bin the distance between object pairs in a given catalog to get the pair counts as a function of separation (either real-space separation for ξ(r), or θ separation on the sky for w(θ)). We similarly count the number of pairs in each bin between objects in a particular catalog and those in a randomly-distributed catalog, as well as the number of pairs between two random catalogs; we then calculate correlation functions by applying the Landy & Szalay estimator (Landy & Szalay 1993). The random catalogs are constructed to have the same shape on the sky (and, in the case of ξ(r), the same redshift distribution) as the corresponding data catalog, but contain objects distributed uniformly on the sky (constant number of objects per solid angle, taking spherical geometry into account). In all cases we use a random catalog with ∼10 times the number of objects in the corresponding data catalog. More details of our correlation function measurement methods are provided in MN10.
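To make the estimator concrete, the following minimal Python sketch computes w(θ) from pair counts. This is an illustrative analogue only (the analysis code used in this paper is IDL), and the brute-force flat-sky pair counting and function names are our own simplifications; production analyses use tree codes and proper spherical geometry.

```python
import numpy as np

def pair_counts(a, b, bins):
    """Brute-force pair counts between positions a and b (arrays of shape
    [N, 2], e.g. coordinates in degrees on a small patch). Bins should start
    above zero so that self-pairs (at zero separation) are excluded."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    return np.histogram(d, bins=bins)[0]

def landy_szalay(data, rand, bins):
    """w(theta) = (DD - 2DR + RR) / RR (Landy & Szalay 1993), with each
    count normalized by its total number of (ordered) pairs."""
    nd, nr = len(data), len(rand)
    dd = pair_counts(data, data, bins) / (nd * (nd - 1.0))
    dr = pair_counts(data, rand, bins) / (nd * float(nr))
    rr = pair_counts(rand, rand, bins) / (nr * (nr - 1.0))
    return (dd - 2.0 * dr + rr) / rr
```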
Throughout this paper we will model ξ(r) as a power law, ξ(r) = (r/r_0)^(−γ), which is an adequate assumption (within the errors for the samples considered here) from ∼0.1 to ∼10 h⁻¹ comoving Mpc for both observed samples out to z ∼ 1.5. If ξ(r) = (r/r_0)^(−γ), then w(θ) = Aθ^(1−γ), where A ∝ r_0^γ (Peebles 1980).

However, since the observed mean galaxy density in a field is not necessarily representative of the universal mean density, measurements of w(θ) from a particular field will in general deviate from the global mean by an additive offset known as the integral constraint. To remove this effect, we fit w(θ) using a power law minus a constant; i.e., we consider models of the form w(θ) = Aθ^(1−γ) − C, where C is the integral constraint. The specific correlation measurements that we use in the calculation of φ_p(z) are the autocorrelation of the spectroscopic sample as a function of separation and redshift, ξ_ss(r, z); the angular autocorrelation of the photometric sample, w_pp(θ); and the angular cross-correlation of the two samples with each other as a function of spectroscopic redshift, w_sp(θ, z). As we will describe below, the redshift distribution of the photometric sample can be determined using the fit parameters (in particular r_0, γ, and A) of each of these three correlation functions.

3.2. Reconstructing φ_p(z)

Modeling ξ(r) as a power law, we can determine the relationship between the angular cross-correlation function w_sp(θ, z) and the redshift distribution. Following the derivation in Newman (2008) (cf. eq. 4 there), the cross-correlation between the photometric sample and spectroscopic objects at redshift z can be written as

w_sp(θ, z) = φ_p(z) H(γ_sp) r_0,sp^(γ_sp) θ^(1−γ_sp) D(z)^(1−γ_sp) / (dl/dz),    (1)

where H(γ) = Γ(1/2) Γ((γ−1)/2) / Γ(γ/2) (here Γ(x) is the standard Gamma function) and φ_p(z) is the probability distribution function of the redshift of an object in the photometric sample. The angular size distance, D(z), and the comoving distance to redshift z, l(z), are determined from the basic cosmology. We can see that since w_sp(θ, z) ∝ φ_p(z) r_0,sp^(γ_sp), there is a degeneracy in the cross-correlation signal between the redshift distribution and the intrinsic clustering of the two samples with each other (characterized by r_0,sp and γ_sp). We can estimate these cross-correlation parameters from the autocorrelation measurements of each sample using the assumption of linear biasing, for which ξ_sp = (ξ_ss ξ_pp)^(1/2).

We now outline the details of the particular procedures which gave the best reconstruction of φ_p(z) in MN10. First, we need to calculate the autocorrelation parameters of each sample. To determine how ξ_ss evolves with redshift we bin the spectroscopic objects in redshift and measure the two-point correlation function in each z-bin. We calculate ξ_ss using the as-observed redshifts from the simulation, which are affected by redshift-space distortions in the line-of-sight direction (Hamilton 1998). To minimize this effect it is common to calculate ξ as a function of the projected separation, r_p, and the line-of-sight separation, π, and then integrate along the line of sight to obtain the projected correlation function,

w_p(r_p) = 2 ∫_0^∞ ξ((r_p² + π²)^(1/2)) dπ    (2)
         = r_p (r_0/r_p)^γ H(γ),    (3)

where H(γ) is defined following equation (1). By measuring w_p(r_p) in multiple z-bins and fitting with equation (3), we can determine r_0,ss(z) and γ_ss(z). We found that employing a linear fit of r_0,ss and γ_ss as a function of z resulted in a better recovery of φ_p(z) than using each bin's value directly. We then calculate the angular autocorrelation of the photometric sample, w_pp(θ), and fit to obtain the parameters A_pp and γ_pp. We use the autocorrelation parameters along with an initial guess of r_0,pp to calculate r_0,sp^(γ_sp) = (r_0,ss^(γ_ss) r_0,pp^(γ_pp))^(1/2). We found that the best results were obtained by assuming the redshift dependence of the scale length is similar for each sample, i.e. r_0,pp(z) ∝ r_0,ss(z), with an initial guess of r_0,pp(z) = r_0,ss(z).

For the angular cross-correlation between the two samples, w_sp(θ, z), we bin the spectroscopic sample into small bins in redshift and, in each bin, measure the cross-correlation between objects in that z-bin and all objects in the photometric sample. We can then fit to obtain the parameters A_sp, γ_sp, and C_sp. However, we found a significant degeneracy between these parameters when fitting. To remove this degeneracy, we fix γ_sp = (γ_ss + γ_pp)/2 in each z-bin, and only fit for the amplitude and integral constraint. We choose this estimate for γ_sp because the clustering of the samples with each other is expected to be intermediate to the intrinsic clustering of each sample. Using the resulting values of A_sp and γ_sp, as well as the initial guess for r_0,sp^(γ_sp), we obtain an initial guess of the redshift distribution φ_p(z) using equation (1). Using this φ_p(z), along with A_pp and γ_pp, we can redetermine r_0,pp using Limber's equation (Peebles 1980), which we use to redetermine r_0,sp^(γ_sp) and thus φ_p(z). This process is repeated until convergence is reached.
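Given per-bin amplitudes A_sp(z) from the cross-correlation fits, equation (1) can be inverted directly for φ_p(z). A minimal Python sketch (illustrative only; the function names are ours, and the arrays D_A and dl_dz are assumed to be precomputed from the adopted cosmology in units consistent with r0_sp):

```python
import numpy as np
from scipy.special import gamma as Gamma

def H(g):
    """H(gamma) = Gamma(1/2) Gamma((gamma - 1)/2) / Gamma(gamma/2), cf. eq. (1)."""
    return Gamma(0.5) * Gamma((g - 1.0) / 2.0) / Gamma(g / 2.0)

def phi_p_from_amplitudes(A_sp, gamma_sp, r0_sp, D_A, dl_dz):
    """Invert eq. (1): with w_sp(theta, z) = A_sp(z) * theta**(1 - gamma_sp),
    phi_p(z) = A_sp(z) (dl/dz) / [H(gamma_sp) r0_sp**gamma_sp D_A(z)**(1 - gamma_sp)].
    All array arguments are tabulated over the spectroscopic z-bins."""
    return A_sp * dl_dz / (H(gamma_sp) * r0_sp**gamma_sp * D_A**(1.0 - gamma_sp))
```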
A more detailed description of this technique, including an error analysis for the resulting reconstructions, can be found in MN10.

We have implemented an additional step in the reconstruction of φ_p(z) for this paper that was not employed by MN10. For each measurement, after fixing γ_sp and fitting for A_sp and C_sp in each z-bin, we performed a smooth fit to the measured values of C_sp(z) as a function of redshift. Using the same γ_sp but fixing C_sp at the predicted values for each bin, we then refit for A_sp. We obtained the best results from a Gaussian fit to C_sp, although simply smoothing the measured C_sp(z) values with a boxcar average also resulted in significant gains in reconstruction accuracy. We initially tested these techniques for MN10, but they did not improve the reconstruction there, and in some z-bins made it worse. However, after incorporating covariance information into our analyses, this additional step significantly reduced errors in the reconstruction of φ_p(z), likely because the determination of C_sp for each redshift bin is now more accurate.

We have also made a change in the methods used to calculate average correlation measurements from multiple light cones, as illustrated in the sketch below. In MN10 this was done by summing the pair counts over all of the fields and using the total pair counts in the Landy & Szalay estimator. However, in the course of this paper we found that this method overestimates the mean correlation by more heavily weighting those light cones which are overdense at a particular redshift: they will both contain more pairs and, generally, exhibit stronger clustering than a randomly-selected region of the universe. For this paper, we instead determine the average correlation by calculating the correlation function in each field individually and then performing an unweighted average of those measurements. This change had little effect on the autocorrelation function of the photometric sample, w_pp(θ), mainly because the larger volume sampled means that the density varies less from field to field. The projected autocorrelation of the spectroscopic sample, w_p(r_p), and the cross-correlation measurements, w_sp(θ, z), were significantly affected by this change, however, exhibiting lower average correlation strengths.
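The contrast between the two averaging schemes, sketched in Python for illustration (a simplification of the actual pipeline: here dd, dr, and rr are per-field pair counts of shape [n_fields, n_bins] that are assumed to be already normalized by their total pair numbers):

```python
import numpy as np

def field_averaged_w(dd, dr, rr):
    """Two ways to combine per-field, pre-normalized pair counts into a mean
    correlation function. Pooling counts before applying Landy-Szalay (the
    MN10 approach) up-weights overdense fields; this paper instead averages
    the per-field estimates with equal weight."""
    pooled = (dd.sum(0) - 2 * dr.sum(0) + rr.sum(0)) / rr.sum(0)   # MN10-style
    unweighted = ((dd - 2 * dr + rr) / rr).mean(axis=0)            # this paper
    return pooled, unweighted
```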
3.3. Fitting Parameters Using Full Covariance Information

In MN10 we fit for the various correlation function parameters (r_0,ss, γ_ss, etc.) assuming that there is no covariance between measurements in different angular/r_p bins. We determined best-fit parameters by performing a χ² minimization where the errors used were given by the standard deviation of the correlation function measurements in each of the 24 mock light cones; i.e., the fitting assumed that the relevant covariance matrices were all diagonal. However, analytical models as well as simulations have shown that the off-diagonal elements of the covariance matrix are non-negligible (Bernstein 1994; Zehavi et al. 2005; Crocce et al. 2011). We have confirmed this to be the case by calculating the full covariance matrices of correlation function measurements in the 24 fields. Therefore, in MN10 we were not exploiting the full covariance information when fitting for the correlation function parameters. By incorporating this information into our fitting process, we should expect to obtain more accurate results.

In order to calculate the parameters using the full covariance matrix we used χ² minimization as in MN10, but in this case we calculate χ² values taking the covariance into account:

χ² = (y − ỹ)^T C⁻¹ (y − ỹ),    (4)

where C is the covariance matrix, y is the observed correlation function data in each bin, and ỹ is the expected value according to a given model. As an example, for w(θ) equation (4) becomes:

χ² = [w(θ) − (Aθ^(1−γ) − C)]^T C⁻¹ [w(θ) − (Aθ^(1−γ) − C)].    (5)

We start by minimizing equation (5) for the case of fixed γ. In that case, this minimization is simply linear regression where θ^(1−γ) is the independent variable, and A and −C are the standard 'slope' and 'intercept'. Minimizing χ² analytically to obtain the parameters of a linear fit is straightforward; thus for fixed γ we can readily determine the best-fit A and C via standard formulae. Alternatively, to fit for all three parameters simultaneously we can repeat the linear fit process for different values of γ, and then determine the value of γ which minimizes χ²; a sketch of this procedure follows. We use this fitting method to determine the parameters of the angular autocorrelation of the photometric sample, w_pp(θ), and of each z-bin of the angular cross-correlation, w_sp(θ, z). For the projected real-space autocorrelation function, we see from equation (3) that w_p(r_p) ∝ r_p^(1−γ) (i.e. the same as the relation between w(θ) and θ), so the fitting method is the same except that we force the intercept to be equal to zero and only fit for γ and A. We then find r_0 using the conversion A = r_0^γ H(γ) from equation (3).
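For fixed γ, the minimization of equation (5) has a closed-form generalized-least-squares solution; a Python sketch of that fit, plus the γ scan just described, follows. This is illustrative only (the released implementation is the IDL code POWERFIT described in the appendix), and the function names are ours.

```python
import numpy as np

def fit_fixed_gamma(theta, w, cov, gamma):
    """Generalized least squares for w(theta) = A*theta**(1-gamma) - C at
    fixed gamma (cf. eqs. 4 and 5): linear in the basis [theta**(1-gamma), -1]
    with parameter vector (A, C). Returns the parameters and the chi^2."""
    X = np.column_stack([theta**(1.0 - gamma), -np.ones_like(theta)])
    cov_inv = np.linalg.inv(cov)   # condition cov first in practice (Sec. 3.3.1)
    params = np.linalg.solve(X.T @ cov_inv @ X, X.T @ cov_inv @ w)
    resid = w - X @ params
    return params, resid @ cov_inv @ resid

def fit_power_law(theta, w, cov, gamma_grid):
    """Repeat the linear fit over a grid of gamma values and keep the gamma
    minimizing chi^2, as described in Sec. 3.3."""
    results = [(fit_fixed_gamma(theta, w, cov, g), g) for g in gamma_grid]
    (params, chi2), gamma = min(results, key=lambda r: r[0][1])
    return params[0], gamma, params[1], chi2   # A, gamma, C, chi2
```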
Figure 1 compares a fit assuming no covariance for one measurement of w_pp(θ) from the simulation (averaging w_pp from 4 of the 24 mock fields) to a fit using the full covariance matrix.

FIG. 1.— An example of fitting a power law-integral constraint model to a measurement of the angular autocorrelation of the photometric sample, w_pp(θ), from Millennium catalog mock light cones. The solid line is a fit assuming no covariance between angular bins, while the dashed line is a fit using the full covariance matrix; both are fit over the same range of θ.

The covariance matrices we use for fitting are calculated using correlation measurements from the 24 mock light cones; each is therefore a sample covariance matrix and not the 'true' underlying C. It can be shown that, while the sample covariance matrix is an unbiased estimator of C, the inverse of the sample covariance matrix is in fact a biased estimator of the inverse of the true covariance matrix (Hartlap et al. 2007). The amount of bias depends on the size of the sample used to calculate the covariance matrix; in our case, this is the number of mock catalogs (24). However, this bias can be corrected for (assuming Gaussian statistics and statistically independent measurements) simply by rescaling the inverse sample covariance matrix by a constant factor; this will not, therefore, affect the location of any χ² minimum. We apply this bias correction where relevant in our analysis below.
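The text does not quote the rescaling factor, but the standard result of Hartlap et al. (2007), for a covariance matrix estimated from n independent realizations and inverted for a data vector of p bins, is

C⁻¹_unbiased = [(n − p − 2)/(n − 1)] Ĉ⁻¹,

where Ĉ⁻¹ is the inverse of the sample covariance matrix. With the n = 24 mock catalogs used here this is a non-negligible rescaling, though, as noted above, a constant rescaling of C⁻¹ leaves the location of the χ² minimum unchanged.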
3.3.1. Conditioning the Covariance Matrix

Since we are using a covariance matrix calculated from a modest number of light cones (in effect, a 'measured' covariance matrix with only a limited number of samples), noise and numerical instabilities cause difficulties when calculating C⁻¹. We found the inversion of C to be much better behaved when using coarser bins in θ and r_p than employed in MN10. For both w_p(r_p) and w(θ) we doubled the bin size in log space, i.e. we use bins with Δlog(r_p) = 0.2 and Δlog(θ) = 0.2. Increasing the bin size further did not yield significant improvements.

To reduce the impact of noise in our measured covariance matrix further, we investigated several methods of conditioning the matrix (i.e., modifying the covariance matrix to improve the robustness of its inversion), and looked at how varying the conditioning improved the reconstruction. One commonly-applied method involves performing a singular value decomposition (SVD) of the covariance matrix and setting the singular values below some threshold (and their inverses) equal to zero (Jackson 1972; Wiggins 1972). This is equivalent to performing an eigenmode analysis and trimming any unresolved modes, as is done, for instance, in McBride et al. (2011).

We also tried conditioning the covariance matrix using a technique commonly known as ridge regression (Hoerl & Kennard 1970). This involves adding a small value to all of the diagonal elements of the covariance matrix before inverting, which reduces the impact of noise in the off-diagonal elements and makes the inversion more stable. We parameterized this conditioning by calculating the median of the diagonal elements of the covariance matrix and adding a fraction f of that median value to the diagonal. We obtained better results from ridge regression than from zeroing out singular values (see §4.1 below), and it is therefore the primary method used throughout the rest of this paper.

At first glance it may seem that applying ridge regression to the covariance matrix should be detrimental to determining the actual values of correlation function parameters: we are effectively assuming by fiat that the effective covariance matrix to be used in calculating χ² differs from what was measured. Since ridge regression yields larger values for the diagonal elements of the covariance matrix than the data themselves would suggest, the results are equivalent to a situation with larger nominal measurement uncertainties (and hence broader χ² minima) than implied by the original covariance matrix. However, when C is determined from a limited set of measurements, C⁻¹ tends to differ significantly from the true inverse. Hence, using the standard covariance matrix in fitting should lead to measurements with nominally tighter errors than ridge regression techniques, but those measurements may in fact be significantly offset from the true value of the parameter we are attempting to determine.
This can cause the parameter results to have a larger spread about the true value than is optimal. When we add some degree of ridge regression, the inverse of the covariance matrix is better behaved, and hence is less likely to yield a discrepant result. By varying the strength of the ridge regression conditioning, we can choose different tradeoffs between the bias and the variance of parameter estimates. In general, we want both of these contributions to be small; in the next section we investigate what degree of conditioning minimizes their sum. A minimal sketch of the conditioning step itself is given below.
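The following Python sketch mirrors the parameterization described above, with the SVD-based alternative included for comparison (illustrative only; the function names are ours):

```python
import numpy as np

def ridge_condition(cov, f):
    """Ridge-regression conditioning as parameterized in this paper:
    C_ii -> C_ii + f * median(diag(C)) before inversion."""
    out = cov.copy()
    out[np.diag_indices_from(out)] += f * np.median(np.diag(cov))
    return out

def svd_conditioned_inverse(cov, threshold):
    """The comparison method of Sec. 3.3.1: invert via SVD, zeroing all
    singular values (and their inverses) below the given threshold."""
    U, s, Vt = np.linalg.svd(cov)
    s_inv = np.where(s >= threshold, 1.0 / s, 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T
```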
4. RISK OPTIMIZATION

In this section we will evaluate how the conditioning of the covariance matrix affects the determination of correlation function parameters and, ultimately, the reconstruction of φ_p(z). By doing so, we will be able to optimize the reconstruction of the true redshift distribution of the photometric sample. We assess this by measuring the integrated mean squared error, i.e. the variance plus the bias squared. This is commonly referred to in the statistics literature as the 'risk'. By focusing on the risk in some quantity we are optimizing for the minimum combined effect of variance and bias: either large random errors or a large bias would lead to a large risk. We hence define the risk to be R(X) = ⟨(X − X_true)²⟩, where X − X_true is the difference between the measured parameter value and its true value. At times we will also refer to the fractional risk of a parameter, which we define as R̃(X) = ⟨(X − X_true)²⟩/X_true². Since we utilize three different types of correlation measurements in the reconstruction of φ_p(z), we look at how changing the level of conditioning of the covariance matrix affects each one individually.

4.1. Optimizing Fits to w_pp(θ)

We optimized the conditioning of the covariance matrix for the autocorrelation of the photometric sample using a Monte Carlo simulation in which we use the covariance matrix of w_pp calculated from the 24 fields (i.e., the 24 different light cones) as our "true" covariance matrix, and then use it to generate realizations of correlated noise about a selected model. To do this we first find the eigenvalues and eigenvectors of the covariance matrix. We create uncorrelated Gaussian noise with variances equal to the eigenvalues, and then apply the transformation matrix constructed from the eigenvectors to this noise. This technique yields mock data with correlated noise corresponding exactly to the "true" covariance matrix (here, the covariance matrix of the 24 mock fields); a sketch of this procedure is given below. For the true model we use γ_true = 1.58, with A_true and C_true set to approximately the mean parameters measured from the simulation.
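A Python sketch of the noise-generation step (illustrative; here cov would be the "true" covariance matrix of w_pp estimated from the 24 fields, and model_w the noiseless model evaluated at the bin centers):

```python
import numpy as np

def correlated_realizations(model_w, cov, n_real, rng):
    """Generate n_real mock measurements of w(theta) with correlated noise:
    draw independent Gaussian deviates with variances equal to the
    eigenvalues of cov, then rotate them by the eigenvector matrix, so the
    noise covariance is exactly cov (Sec. 4.1)."""
    vals, vecs = np.linalg.eigh(cov)          # cov is symmetric
    vals = np.clip(vals, 0.0, None)           # guard against roundoff
    white = rng.normal(size=(n_real, vals.size)) * np.sqrt(vals)
    return model_w + white @ vecs.T           # shape (n_real, n_bins)

# e.g.: mocks = correlated_realizations(model_w, cov24, 24, np.random.default_rng(0))
```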
In MN10 we used the 24 mock light cones to generate 10 'measurements' by randomly selecting four fields at a time and finding the average w(θ) for those fields. In order to simulate this we used the method for generating correlated noise described above to create 24 realizations of single-field w(θ) measurements, and then generated 10 randomly selected 'pick-4 measurements' from those 24 realizations; we will refer to each set of 24 new realizations (and its derived products) as a 'run' below. For each run we use the set of 24 realizations to calculate a measured covariance matrix, which will differ from the true covariance matrix used to generate the noise. The uncertainty in an estimate of the covariance matrix from the 24 realizations should be worse than the errors in realistic applications, making this treatment conservative. This is because the area covered by photometric surveys will in general be much larger than for the spectroscopic sample, which will result in a better-constrained covariance matrix for the autocorrelation of the photometric sample; however, for the mock catalogs used here the spectroscopic and photometric areas are identical. The resulting 'measured' covariance matrix for a given run is then used to fit power-law parameters to each of that run's pick-4 measurements by minimizing χ² (cf. equation 4). For this and all other correlation function fits described herein we used the IDL code POWERFIT, which is being publicly released as an accompaniment to this paper.

We begin by evaluating how the reconstruction of the amplitude, A, changes as we vary the conditioning. The integral constraint exhibits similar behavior to the amplitude since it is proportional to the correlation strength; we are in any event not as concerned with the behavior of C since it is essentially a nuisance parameter. For simplicity, we fix γ at the true value for each run and only fit for A and C. We calculate the risk on A by performing 10 runs, where for each run we (see the sketch following this list):

1. Created 24 realizations of w(θ) as described above;
2. Generated 10 pick-4 measurements, randomly selecting four realizations at a time from the 24 and calculating their mean w(θ);
3. Fit each pick-4 measurement for A and C using the covariance matrix calculated from the 24 realizations created in step 1;
4. Calculated the mean fractional risk on A over the 10 pick-4 measurements, R̃(A) = ⟨(A − A_true)²⟩/A_true².
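A compact sketch of one such run, reusing the ridge_condition and fit_fixed_gamma helpers sketched earlier (for brevity we draw the correlated noise with NumPy's multivariate_normal, which is statistically equivalent to the eigendecomposition approach above; this is an illustration of the procedure, not the authors' code):

```python
import numpy as np

def run_fractional_risk(model_w, theta, true_cov, gamma_true, A_true, f,
                        rng, n_fields=24, n_picks=10):
    """One 'run' of the Sec. 4.1 test: returns the fractional risk on A
    over the run's pick-4 measurements at ridge conditioning level f."""
    # Step 1: 24 single-field realizations of w(theta) with correlated noise
    fields = model_w + rng.multivariate_normal(
        np.zeros_like(model_w), true_cov, size=n_fields)
    # The run's 'measured' covariance; its overall normalization does not
    # shift the chi^2 minimum, so the 1/4 factor for a 4-field mean is cosmetic
    meas_cov = ridge_condition(np.cov(fields, rowvar=False) / 4.0, f)
    # Steps 2-3: fit each pick-4 average for A and C at fixed gamma
    A_hat = []
    for _ in range(n_picks):
        pick = fields[rng.choice(n_fields, size=4, replace=False)].mean(axis=0)
        (A, C), _ = fit_fixed_gamma(theta, pick, meas_cov, gamma_true)
        A_hat.append(A)
    # Step 4: mean fractional risk on A
    A_hat = np.asarray(A_hat)
    return np.mean((A_hat - A_true) ** 2) / A_true**2
```

Scanning f over a grid and taking the median and maximum of this quantity over many runs reproduces the kind of curves shown in Figure 2.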
We can perform the fits and calculate the fractional risk on A while applying varying levels of conditioning on the covariance matrix. We parameterize the ridge regression conditioning using a variable f, which we define as the fraction of the median value amongst diagonal elements of the covariance matrix which is added to the diagonal elements; i.e., we replace the (i, i) element of the covariance matrix, C_ii, by C_ii + f × median(C_ii). For comparison, we also calculate the fractional risk on A while varying the singular value threshold for the SVD conditioning described in §3.3.1, where all singular values below the threshold, and their inverses, are set to zero.

FIG. 2.— A test of the impact of the conditioning of the covariance matrix on the results of fitting the amplitude of the correlation function, A. We plot the square root of the fractional median risk (solid line) and of the maximum risk (dashed line) on A as a function of the degree of conditioning. We define the risk as the total mean squared error, i.e. the variance plus the bias squared. (Left panel) We condition using ridge regression: we add a fraction f of the median of the diagonal covariance matrix elements to all diagonal elements in order to stabilize the inversion of the covariance matrix. (Right panel) We condition by inverting using singular value decomposition (SVD), setting all singular values below some threshold to zero. The median values are from a single set of 10 runs, but the maximum risk line is the mean of the results from 10 sets of 10 runs, as the maximum risk varied significantly from run to run. Errors on the median are plotted, but are very small and not visible. The conditioning has a much larger effect on the maximum risk, and we therefore use a minimax optimization: i.e., choose the parameter values which make the maximum risk as small as possible. Using ridge regression, both the median and maximum optimized risk are smaller than for the SVD method. We therefore use ridge regression as our primary conditioning technique in this paper; the optimum results in fitting w_pp(θ) are achieved for f ∼ 3%.

Figure 2 shows the square root of the median and maximum fractional risk amongst the 10 runs as a function of both f and the singular value threshold. In both cases we see that the conditioning has a much stronger effect on the maximum risk than it does on the median. We therefore perform a minimax optimization; i.e., we choose the conditioning that minimizes the maximum risk. Looking at the level of conditioning corresponding to this minimax optimization for each method, we see that the median and maximum risk are both smaller for the ridge regression conditioning. In addition, with the SVD method the maximum risk is much more sensitive to changes in the threshold around its optimized value. Small changes from the optimized threshold value in either direction can have a significant effect on the maximum risk, while the maximum risk curve for the ridge regression method is relatively flat in the vicinity of the optimized value. We therefore use ridge regression conditioning for the remainder of the calculations. By adding a few percent conditioning to our covariance matrix with the ridge regression method, we can significantly decrease the maximum risk without significantly worsening the median risk. The optimized value of f strikes a balance between the need for conditioning to stabilize inversion and the desire not to distort the relative impact of diagonal and off-diagonal covariance matrix elements, which would lead to inappropriate weighting of different data points in calculating χ².

FIG. 3.— Contour plot showing the distribution of the median values of A − A_true and C − C_true from each of the 10 runs described in §4.1, where A and C are the fit parameters for w(θ) = Aθ^(1−γ) − C, and A_true and C_true are the model values adopted there. For each distribution we show the 1σ and 2σ contours. The solid contours correspond to the fit parameters when using the full covariance matrix with the optimized conditioning (f = 3%). The dashed contours show the distribution resulting from fits with the same techniques as MN10, where we assume no covariance and fit over a smaller θ range. We are most concerned with errors in the amplitude; it is clear there is a significant improvement in the recovery of the actual value of A when the full covariance information is exploited.

Figure 3 shows a contour plot of the median values of A − A_true vs. C − C_true amongst all pick-4 measurements for each of the 10 runs using the optimized conditioning (f = 3%). In MN10, although we had measured the correlation function out to larger separations, we only fit over a restricted range of θ. In that case, fitting over the smaller range reduced the error in A, and thus improved the reconstruction. When using the full covariance matrix for the fit, we found that fitting over the full range of θ yielded even smaller parameter errors, as seen in Figure 3. By utilizing covariance information in our fitting, we can robustly incorporate correlation measurements from larger scales which were useless (or even detrimental) when ignoring the covariance.

4.2. Optimizing Fits to w_p(r_p)

We used a different method to optimize the conditioning for the projected correlation function of the spectroscopic sample. As described in §2, this sample was constructed by selecting 60% of the objects with R < 24.1.
We calculated the risk for the autocorrelation parameters by creating multiple samples in which a different 60% of the objects are chosen each time, and comparing these to the results for a sample containing 100% of the objects. This differs from the method used in §4.1 in that we are actually performing the correlation measurements on the simulations, rather than generating model noise based on a covariance matrix calculated from the simulation. In the case of w_pp(θ) it was more difficult to determine the true values of w(θ) (required for calculating the risk) to significantly greater accuracy than individual measurements, and therefore we relied on synthetic techniques for that analysis. Here, we have a 'truth' measurement which is much better than the fits resulting from any set of 60% of the bright objects in only four fields, so we can measure the risk robustly without relying on simulated measurements. When calculating the reconstruction of φ_p(z) we measure the parameters of a fit to w_p(r_p) in multiple redshift bins; for simplicity, in this section we focus on a single z-bin in the middle of the redshift range. We generated 10 pick-4 measurements of w_p(r_p) from the full sample and fit each measurement to the functional form given in equation (3), employing the full covariance matrix calculated from the 24 fields to determine r_0 and γ. As in MN10, we fit over the range 0.1 < r_p < 10 h⁻¹ Mpc. Since the covariance matrix calculated from the full sample should be more stable than for the 60% subsets due to its smaller noise, we initially performed the fits with zero conditioning and used that as our 'truth'. The median values of the parameter measurements for the full sample amongst the 24 different fields were used as estimates of the true parameter values. We then calculate the risk on r_0 and γ by performing 100 runs, where for each run we:

1. Constructed samples from each of the 24 mock fields by randomly selecting 60% of the objects with R < 24.1;
2. Generated 10 pick-4 measurements, randomly selecting four fields at a time from the 24 and calculating their mean w_p(r_p);
3. Fit each pick-4 measurement for r_0 and γ using the covariance matrix calculated from the w_p(r_p) values measured using the 24 samples constructed in step 1;
4. Calculated the mean fractional risk on both parameters, i.e. R̃(r_0) = ⟨(r_0 − r_0,true)²⟩/r_0,true² and R̃(γ) = ⟨(γ − γ_true)²⟩/γ_true², over the 10 pick-4 measurements.

In step 3 we calculate the covariance matrix from 24 fields, which is more fields than we would actually have if we were to do cross-correlation reconstruction with current datasets at z ∼ 1. However, it is likely comparable to the level to which we should be able to determine the covariance matrix using current-generation deep mock catalogs, particularly since fit results will be sensitive to the relative values of covariance matrix elements, but not their absolute normalization. For each run we calculate the fractional risk on both parameters for varying levels of conditioning. (In MN10, we corrected w_p(r_p) for the fact that ξ_ss(r_p, π) is not in actuality measured to infinite line-of-sight separation. This was not done for this test, as the correction will affect the parameters of the full sample and its subsets in a similar way, so any trends in the risk should not be affected; skipping it saved significant calculation time.)

FIG. 4.— The square root of the fractional median risk (error bars) and maximum risk (dashed line) on r_0,ss (upper curves) and γ_ss (lower curves) as a function of the degree of conditioning, for 100 runs in which a different 60% of the objects with R < 24.1 are selected each time. The maximum risk is minimized near f = 3.5%.

Figure 4 shows the square root of the median and maximum fractional risk on r_0 and γ amongst the 100 runs as a function of the conditioning. For both parameters we see a slight dip in the median risk over the 100 runs at modest values of f, while the maximum risk is minimized near f ∼ 3.5%.

4.3. Optimizing φ_p(z) Reconstruction
After optimizing the fits for the autocorrelation measurements, we then looked at how conditioning the cross-correlation covariance matrices affects the overall reconstruction of φ_p(z). Since the uncertainty in φ_p(z) is dominated by the uncertainty in w_sp(θ, z), this conditioning should have the greatest impact on the reconstruction. We generate 10 pick-4 measurements by averaging the correlation measurements from four randomly selected fields out of the 24, which we then use to calculate φ_p(z). For calculating the risk, we know the true redshift distribution in each field perfectly from the simulation, so we do not need to rely on synthetic techniques as in §4.1. Since the fits for both w_pp(θ) and w_p(r_p) were best with a few percent ridge regression conditioning (§4.1, §4.2), for simplicity we adopt f = 3.5% as the optimal conditioning in both cases.

For each pick-4 measurement, we determine the autocorrelation parameters of the photometric sample by fitting the w_pp(θ) from the selected 4 fields using the optimally conditioned covariance matrix calculated from the 24 fields. All three parameters (A_pp, γ_pp, and C_pp) are left free and fit simultaneously. To measure the evolution of the correlation function parameters of the spectroscopic sample, we calculated w_p(r_p) in 10 z-bins covering the full redshift range of the sample. In each z-bin we calculate the covariance matrix from the 24 fields and fit each pick-4 measurement using the optimal conditioning to determine r_0,ss(z) and γ_ss(z).

In one redshift bin, the values of r_0,ss and γ_ss obtained with these methods were significantly different from the values determined when assuming no covariance. We investigated the likelihood contours in detail and found they were not well behaved; not only were the median parameter values different from the result with no covariance, the standard deviation of the 10 pick-4 measurements proved to be an underestimate of the uncertainty in that bin, which had significant effects when performing an error-weighted linear fit to r_0,ss(z) and γ_ss(z). We attempted a variety of methods for estimating the errors in that bin, with poor results. However, we found that fitting over a shorter range in r_p, rather than the full 0.1 < r_p < 10 h⁻¹ Mpc, gave better behaved values (more consistent with the values in other redshift bins or those obtained when ignoring covariance) and improved the reconstruction. For consistency we fit over this shorter range for all bins where z < 0.8. As in MN10, we continue to fit over the range 1 < r_p < 10 h⁻¹ Mpc for z > 0.8, as in the Millennium simulations (though less so in real datasets) w_p(r_p) diverges significantly from a power law at smaller separations at these redshifts.

While the conditioning of the fits for the autocorrelation parameters was kept the same for each measurement, we varied the conditioning of the cross-correlation fits to see how it affects the reconstruction. We bin the spectroscopic sample into small redshift bins covering its full range (up to z = 1.39) and measure w_sp(θ) in each bin. At each level of conditioning we:

1. Calculated the covariance matrix of w_sp(θ) in each redshift bin from the 24 fields and applied the ridge regression conditioning to each matrix;
2. Generated 10 pick-4 measurements, randomly selecting four fields at a time from the 24 and calculating their mean w_sp(θ, z);
3. In each z-bin, fit the pick-4 measurements for A_sp and C_sp, fixing γ_sp as described in §3.2, using the covariance matrices calculated in step 1;
4. Combined A_sp(z) and the optimized autocorrelation parameters for each pick-4 measurement to calculate the probability distribution function, φ_p(z), applying equation (1);
5. For each pick-4 measurement, calculated the mean risk on φ_p(z), R(φ_p(z)) = ⟨(φ_p(z) − φ_p,true(z))²⟩, over the range 0.2 < z < 1.2. This was done in two ways:
(a) Using the overall mean φ_p(z) of the 24 fields as φ_p,true(z);
(b) Using the mean φ_p(z) from the particular 4 fields used in a given measurement as φ_p,true(z);
6. Calculated the mean R(φ_p(z)) over the 10 pick-4 measurements for both types of risk.

FIG. 5.— The square root of the mean risk on the reconstructed φ_p(z) over the range 0.2 < z < 1.2, as a function of the conditioning applied to the covariance matrix of w_sp(θ) in each redshift bin. The solid line is the risk compared to the overall mean of the 24 fields, and the star symbol is the corresponding risk using the methods of MN10. The dashed red (gray) line is the risk defined by comparing each measurement to the mean redshift distribution of the particular 4 fields used, and the red (gray) diamond symbol is the corresponding risk using the previous method. Both are at or near their minimum value at a conditioning of a few percent. The decrease in the risk when comparing to the overall mean is much greater, though improvements are significant regardless of the measure used.

In step 5, we calculate the risk over a slightly limited redshift range to eliminate bins where noise dominates the measurements, which diluted our ability to assess the impact of ridge regression. Figure 5 shows both mean risks as a function of the conditioning, compared to the risk using methods identical to MN10. We optimized for the mean risk over the redshift range rather than the maximum risk, as the latter was dominated by random outliers (due to the smaller number of objects in the redshift bins used, errors in w_sp(θ, z) are much larger, and hence random excursions extend further, than for the autocorrelations). Both techniques indicate that the minimum risk is obtained at around a few percent conditioning. There is a substantial improvement in both measures, but particularly in the risk comparing the redshift distribution for the four chosen fields to the overall (e.g. universal) mean. Figure 6 shows the reconstruction for 3.5% conditioning (i.e. the same for all three fits) as well as the variance and bias, and compares to the reconstruction using methods identical to MN10. The decrease in the variance is significant in each redshift bin, while the bias is relatively unchanged in all but a few z-bins. By incorporating full covariance information and ridge regression methods, the square root of the fractional risk is kept small across the reconstruction.

5. SUMMARY AND CONCLUSION
In this paper we have improved on the cross-correlation techniques presented in Matthews & Newman (2010) by incorporating full covariance information. In addition, we have demonstrated the improvements that result from incorporating ridge regression in fitting for correlation function parameters. Conditioning using ridge regression allowed us to obtain a more stable inversion of the covariance matrix by reducing the impact of noise in the off-diagonal elements, resulting in better estimates of the correlation function parameter values; results were significantly better than with other commonly-used methods such as zeroing out small singular values in a singular value decomposition of the covariance matrix. We analyzed how this conditioning affected the integrated mean squared error, i.e. the risk, for these parameter measurements, and in doing so optimized the cross-correlation technique for recovering the redshift distribution of a photometric sample with unknown redshifts. We also found that we gain a significant improvement in the reconstruction by adding a step to the recipe described in MN10: we now perform a smooth fit for the amplitude of the integral constraint of the cross-correlation measurements as a function of redshift, C_sp(z). We then refit for the amplitude of the cross-correlation, A_sp, with C_sp fixed at the smooth fit value in each z-bin.

FIG. 6.— The reconstruction of φ_p(z) using 3.5% conditioning for fits to all three correlation measurements (i.e. w_pp(θ), w_p(r_p), and w_sp(θ, z)). In the top panel, the solid line is the mean true distribution of the 24 fields, the star symbols are the median values of the 10 pick-4 measurements obtained using the methods of MN10, and the diamonds are the median values for the optimized reconstruction using the full covariance matrix for the fits (with error bars). The middle panel compares the standard deviation of the 10 pick-4 measurements in each bin using the methods from MN10 (solid line) to the improved reconstruction (dashed line), while the bottom panel compares the bias. The errors are significantly smaller in each bin, while the bias is comparable when full covariance information is used. These results are not significantly changed for moderate changes in f.

We tested the effect of the ridge regression technique on the calculation of parameter values for both w(θ) and w_p(r_p) and found that it had a much more significant impact on the maximum risk found over multiple runs than on the median risk. In other words, it yields a great improvement in the worst-case errors, but smaller improvements in more typical cases. For w(θ) the square root of the maximum fractional risk in the amplitude, A, for fixed γ decreased by ∼35% on average at a few percent conditioning. For w_p(r_p) we found a similar decrease for r_0,ss, while the improvement for γ_ss was somewhat smaller. Overall, we obtained a ∼42% decrease in the mean of the square root of the risk on the recovered φ_p(z) compared to the overall (i.e. universal) mean φ_p(z), and a ∼16% decrease when comparing the recovered φ_p(z) to the mean of the actual φ_p(z) for the particular four fields used in the measurement.

In this paper, as well as in MN10, we utilized an artificial selection for objects in a photometric redshift bin that consisted simply of selecting objects with a Gaussian probability centered at a mean redshift. However, in real samples, photometric redshift errors should depend on galaxy type (and hence biasing) and will not necessarily lead to Gaussian selection probabilities or uniform evolution of bias with true redshift within the selected sample. Therefore, in a future paper we will test this improved cross-correlation technique using these same mock catalogs, but instead of assuming a redshift distribution, we will use simulated photometry to measure photometric redshifts and try to recover the redshift distribution of various photo-z selected bins. The following paper in this series will apply these techniques to test photometric redshifts for galaxies in the CANDELS multiwavelength survey (Grogin et al. 2011; Koekemoer et al. 2011).

This paper has benefited from helpful discussions with Larry Wasserman, Chris Genovese, Peter Freeman, Chad Schafer, Ann Lee, Nikhil Padmanabhan, and Michael Wood-Vasey. It was supported by the United States Department of Energy Early Career program via grant DE-SC0003960.

APPENDIX

POWERFIT CODE
In the course of this analysis we developed a short IDL function designed to fit for the parameters of a power-law plus constant model using full covariance information, with or without conditioning of the covariance matrix. Given arrays containing the independent variable values x, the dependent variable values y, and the covariance matrix of the y values, C, it determines the best-fit parameters for a function of the form y = ax^b + c via χ² minimization (cf. equation 4). It outputs the best-fit parameter values in the form of a three-element array, i.e. [a, b, c]. POWERFIT calculates the fit parameters as described in §3.3. If the exponent, b, is fixed, the best-fit values of a and c are calculated analytically using standard linear regression formulae. To fit for all three parameters simultaneously, POWERFIT instead uses the AMOEBA function (distributed with IDL, and based on the routine amoeba described in Numerical Recipes in C; Press et al. 1992) to search for the exponent value that minimizes the χ² of the fit.

POWERFIT optionally allows the user to fix either the exponent value, b, the constant, c, or both at specified values when calculating the fit. It is also possible to condition the covariance matrix using either of the methods described in §3.3.1. For ridge regression conditioning, the user must provide a value for f, the fraction of the median of the diagonal elements of the covariance matrix to add to the diagonal elements before inverting. For SVD conditioning, the required input is the singular value threshold; any singular values below that threshold, as well as their inverses, are set equal to zero before calculating the inversion. The code is suitable for any application where a power law or power law plus constant model is fit to data with a known covariance matrix; it can be downloaded at ….
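Since POWERFIT is distributed as IDL source, we do not reproduce its calling sequence here. The following Python sketch of the same strategy (an analytic linear solution for a and c inside a one-dimensional search over the exponent b, standing in for IDL's AMOEBA) may be useful as a reference; it is our illustration, not the released code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gls_linear(x, y, cov_inv, b):
    """Best-fit (a, c) for y = a*x**b + c at fixed exponent b, solved
    analytically by generalized least squares; returns (a, c) and chi^2."""
    X = np.column_stack([x**b, np.ones_like(x)])
    p = np.linalg.solve(X.T @ cov_inv @ X, X.T @ cov_inv @ y)
    r = y - X @ p
    return p, r @ cov_inv @ r

def powerfit_like(x, y, cov, bracket=(-3.0, -0.1)):
    """Fit all three parameters of y = a*x**b + c: a 1-D minimization over
    the exponent b, with (a, c) obtained analytically at each trial value.
    Condition cov first if desired (Sec. 3.3.1)."""
    cov_inv = np.linalg.inv(cov)
    res = minimize_scalar(lambda b: gls_linear(x, y, cov_inv, b)[1],
                          bracket=bracket)
    (a, c), chi2 = gls_linear(x, y, cov_inv, res.x)
    return np.array([a, res.x, c]), chi2    # [a, b, c], as POWERFIT returns
```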