Estimating sparse precision matrices
Nikhil Padmanabhan¹, Martin White², Harrison H. Zhou³, Ross O'Connell⁴
¹ Dept. of Physics, Yale University, New Haven, CT 06511
² Dept. of Physics and Astronomy, U.C. Berkeley, Berkeley, CA 94720
³ Dept. of Statistics, Yale University, New Haven, CT 06511
⁴ McWilliams Center for Cosmology, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA
27 February 2018
ABSTRACT
We apply a method recently introduced in the statistical literature to directly estimate the precision matrix from an ensemble of samples drawn from a corresponding Gaussian distribution. Motivated by the observation that cosmological precision matrices are often approximately sparse, the method allows one to exploit this sparsity of the precision matrix to more quickly converge to an asymptotic $1/\sqrt{N_{\rm sim}}$ rate, while simultaneously providing an error model for all of the terms. Such an estimate can be used as the starting point for further regularization efforts which can improve upon the $1/\sqrt{N_{\rm sim}}$ limit above, and incorporating such additional steps is straightforward within this framework. We demonstrate the technique with toy models and with an example motivated by large-scale structure two-point analyses, showing significant improvements in the rate of convergence. For the large-scale structure example we find errors on the precision matrix which are factors of 5 smaller than for the sample precision matrix for thousands of simulations or, alternatively, convergence to the same error level with more than an order of magnitude fewer simulations.

Key words: cosmological parameters — large-scale structure of Universe
1 INTRODUCTION

Frequently in astrophysics and cosmology the final step in any analysis is to compare some summary of the data to the predictions of a theoretical model (or models). In order to make this comparison, a model for the statistical errors in the compressed form of the data is required. The most common assumption is that the errors are Gaussian distributed, and the covariance matrix, $C_{ij}$, is supplied along with the data. For a precise and meaningful comparison of theory and observation, the theoretical model, the summary statistics derived from the data and $C_{ij}$ must all be accurately determined. A great deal of work has gone into all three areas.

There are three broad methods for determining such a covariance matrix. First, there may be an analytical or theoretical description of $C_{ij}$ that can be used (see Niklas Grieb et al. 2015; O'Connell et al. 2015; Pearson & Samushia 2015, for some recent examples and further references). Secondly, we may attempt to infer $C_{ij}$ from the data itself. This is often referred to as an 'internal error estimate', and techniques such as the jackknife (Tukey 1958) or bootstrap (Efron 1979) are commonly used. Thirdly, we may attempt to infer $C_{ij}$ by Monte-Carlo simulation. This is often referred to as an external error estimate. Combinations of these methods are also used.

Our focus will be on the third case, where the covariance matrix is determined via Monte-Carlo simulation. The major limitation of such methods is that the accuracy of the covariance matrix is limited by the number of Monte-Carlo samples that are used, with the error typically scaling as $N_{\rm sim}^{-1/2}$ (see, e.g., Dodelson & Schneider 2013; Taylor et al. 2013; Taylor & Joachimi 2014, for recent discussions in the cosmology context). These errors manifest themselves both as 'noise' and 'bias' in the sample covariance matrix and the inverse covariance matrix (which we shall refer to as the 'precision matrix' from now on and denote by $\Psi$).

We consider the specific case where the precision matrix is sparse, either exactly or approximately. This may happen even when the covariance matrix is dense, and occurs generically when the correlations in the covariance matrix decay as a power law (or faster). It is worth emphasizing that any likelihood analysis requires the precision matrix, and not the covariance matrix. We present an algorithm that can exploit this sparsity structure of the precision matrix with relatively small numbers of simulations.

The outline of the paper is as follows. In Section 2 we introduce our notation and set up the problem we wish to address. Section 3 introduces the statistical method for entrywise estimation of the precision matrix, plus a series of refinements to the method which improve the convergence. Section 4 presents several numerical experiments which emphasize the issues and behavior of the algorithm, both for a toy model which illustrates the principles behind the algorithm and for a model based on galaxy large-scale structure analyses. We finish in Section 5 with a discussion of our results, the implications for analysis, and directions for future investigation.
Table 1. Summary of the notation used in this paper.

Notation               Description
$N(\mu, C)$            Normal distribution with mean $\mu$, covariance $C$
$x \sim D$             $x$ distributed as $D$
$\lVert\cdot\rVert_F$  Frobenius matrix norm (see Eq. 1)
$\lVert\cdot\rVert$    Spectral matrix norm (see after Eq. 1)
$A^t$                  Transpose of $A$
2 PRELIMINARIES

We will consider $p$-dimensional observations, $x$, with a covariance matrix $C$ and its inverse, the precision matrix $\Psi$; we differ from the usual statistical literature, where the covariance and precision matrices are represented by $\Sigma$ and $\Omega$, in order to match the usage in cosmology more closely. Since we will need to consider both estimates of these matrices as well as their (unknown) true values, we denote estimates with hats. Individual elements of the matrix are represented by the corresponding lowercase Greek letters, e.g. $\psi_{ij}$. We will also want to consider normalized elements of the precision matrix; we will abuse notation here and define these by $r_{ij} \equiv \psi_{ij}/\sqrt{\psi_{ii}\psi_{jj}}$.

We denote the normal distribution with mean $\mu$ and covariance $C$ by $N(\mu, C)$. The notation $x \sim D$ denotes a random variable $x$ with probability distribution $D$. The Frobenius norm of a matrix is defined by

$\lVert A \rVert_F \equiv \Big(\sum_i \sum_j a_{ij}^2\Big)^{1/2} = \left(\operatorname{Tr} A A^t\right)^{1/2}$   (1)

while the spectral norm, denoted by $\lVert\cdot\rVert$, is the largest singular value of the matrix. Table 1 summarizes our notation.

The problem we consider is estimating the $p \times p$ precision matrix, $\Psi$, from $d$ independent samples $x_i$, where $1 \leq i \leq d$ and where $x_i$ is a $p$-dimensional vector assumed to be drawn from a Gaussian distribution. The usual approach to this problem has been to compute the sample covariance matrix

$S = \frac{1}{d-1} \sum_{i=1}^{d} (\Delta x_i)(\Delta x_i)^t$   (2)

where the superscript $t$ is the transpose, and $\Delta x_i \equiv x_i - (1/d)\sum_{j=1}^d x_j$ is the difference vector. An unbiased estimate of the precision matrix is then

$\hat\Psi = \frac{d-p-2}{d-1}\, S^{-1}$   (3)

where the prefactor accounts for the mean of the inverse-Wishart distribution (for a first application in cosmology, see Hartlap et al. 2007). We use $d$ instead of $N_{\rm sim}$ in what follows for brevity.

3 METHODS

Our approach in this paper uses the observation that precision matrices in cosmology are often very structured and sparse. Unfortunately, this structure is hard to exploit if computing the precision matrix involves the intermediate step of computing the covariance matrix. Our approach uses a technique, pointed out in Ren et al. (2015), to directly compute the precision matrix from an ensemble of simulations. Unlike that work, which was interested in an arbitrary sparsity pattern, the structure of cosmological precision matrices is expected to be more regular, significantly simplifying the problem.

The steps in our algorithm are:

(i) Estimate the elements of the precision matrix entrywise.
(ii) Smooth the precision matrix.
(iii) Ensure positive-definiteness of the resulting precision matrix.

Each of these is discussed in detail below. It is worth emphasizing that the key insight here is the first step, and it is easy to imagine variants of the remaining steps that build off an entrywise estimate of the precision matrix.
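For later comparison, the following is a minimal sketch of the baseline estimator of Eqs. 2 and 3 in Python; the layout of the input (one simulation per row) is an assumption of the example:

```python
import numpy as np

def sample_precision(x):
    """Baseline estimator of Eqs. 2-3 from d samples (the rows of x).

    The (d - p - 2)/(d - 1) prefactor is the inverse-Wishart debiasing
    factor of Hartlap et al. (2007); it requires d > p + 2.
    """
    d, p = x.shape
    dx = x - x.mean(axis=0)                 # Delta x_i = x_i - sample mean
    S = dx.T @ dx / (d - 1)                 # sample covariance matrix, Eq. 2
    return (d - p - 2) / (d - 1) * np.linalg.inv(S)   # Eq. 3
```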
3.1 Entrywise estimation of the precision matrix

Consider a random vector $x_i = (Z_1, Z_2, \ldots, Z_p)$ drawn from a multivariate normal distribution with mean 0 (this is trivially generalized) and covariance matrix $C$. Imagine partitioning the components into two sets $A$ and $A^c \equiv \{1, \ldots, p\} \setminus A$, with $Z_A$ denoting the subset of components in set $A$. Consider the probability of $Z_A$ conditioned on $Z_{A^c}$ (see Appendix A for some useful identities):

$P(Z_A | Z_{A^c}) = N\left(-\Psi_{A,A}^{-1}\, \Psi_{A,A^c}\, Z_{A^c},\ \Psi_{A,A}^{-1}\right)$   (4)

where $\Psi_{A,B}$ represents the submatrix indexed by the sets of indices $A$ and $B$. This equation can be interpreted as linearly regressing $Z_A$ on $Z_{A^c}$:

$Z_A = \beta Z_{A^c} + e_A$   (5)

where $\beta = -\Psi_{A,A}^{-1}\Psi_{A,A^c}$ and $\langle e_A e_A^t \rangle = \Psi_{A,A}^{-1}$. This interpretation is key to the algorithm presented here: the inverse of the covariance of the residuals of the above linear regression is an estimate of a subset of the full precision matrix.

The above equation also demonstrates how to make use of the sparsity of the precision matrix. Note that the submatrix $\Psi_{A,A^c}$ (that appears in $\beta$) inherits the sparsity of the precision matrix, and therefore one only need regress on a small number of elements of $Z_{A^c}$. To illustrate this with a concrete example, consider a tridiagonal $\Psi$ and $A = \{1\}$. Eqs. 4 and 5 imply that $Z_1 = \beta Z_2 + e$ where, in this case, $\beta$ and $e$ are simply scalars, and $\langle e^2 \rangle = \psi_{1,1}^{-1}$. Given measurements $(Z_1^{(1)}, Z_2^{(1)}, \cdots), (Z_1^{(2)}, Z_2^{(2)}, \cdots), \ldots, (Z_1^{(d)}, Z_2^{(d)}, \cdots)$, we do an ordinary least-squares fit for $\beta$, estimate the residual variance $\langle \hat e^2 \rangle$, and use $\hat\psi_{1,1} = \langle \hat e^2 \rangle^{-1}$. The linear regression in this case requires $d \gg 1$ observations to robustly determine $\beta$ and $e$, compared with $d \gg p$ observations, where $p$ is the rank of the precision matrix; this can translate into a significant reduction in the required number of simulations.

We can now write down the first step of our algorithm. For every pair $1 \leq i < j \leq p$, linearly regress $Z_{\{i,j\}}$ on $Z_{\{k,l,\ldots\}}$, where the $k, l, \ldots$ are determined by the sparsity pattern of the matrix. For the cases considered here, we perform an ordinary least-squares regression, which guarantees that $\hat\beta$ and $\hat e$ are independent (Greene 2003). From the residuals, we form the $2\times2$ covariance matrix, which can be inverted to get an estimate of the precision matrix elements $\psi_{ii}$, $\psi_{ij}$ and $\psi_{jj}$:

$\begin{pmatrix} \hat\psi_{ii} & \hat\psi_{ij} \\ \hat\psi_{ji} & \hat\psi_{jj} \end{pmatrix}^{-1} = \frac{1}{d-K} \sum \begin{pmatrix} \hat e_i^2 & \hat e_i \hat e_j \\ \hat e_j \hat e_i & \hat e_j^2 \end{pmatrix}$   (6)

where the sum is over the $d$ observations (the index is suppressed to avoid confusion) and $K$ is the number of variables being regressed on. While it is possible to directly estimate $\psi_{ij}$ this way, we have found it more robust to estimate $r_{ij} = \psi_{ij}/\sqrt{\psi_{ii}\psi_{jj}}$. This combination reduces the finite-sample corrections (for instance, the $d-K$ factor cancels in its computation) and, as we demonstrate in the next section, achieves its asymptotic distribution, with $\hat r_{ij} - r_{ij}$ Gaussian with width $(1 - r^2)/\sqrt{d}$, for relatively small values of $d$. We therefore use these pairwise regressions to compute only the $r_{ij}$.
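A sketch of this pairwise regression step for a single pair $(i, j)$ is given below; the construction of the banded regressor set is our own, with the band of half-width $k - 1$ matching the definition of $k$ used in Section 4:

```python
import numpy as np

def entrywise_pair(x, i, j, k):
    """Estimate the 2x2 precision block for the pair (i, j) via OLS (Eq. 6).

    x is a (d, p) array of mean-subtracted samples; the regressors are
    the components within k - 1 of either i or j (the assumed band),
    excluding i and j themselves.
    """
    d, p = x.shape
    nbrs = sorted({m for c in (i, j)
                   for m in range(max(0, c - k + 1), min(p, c + k))} - {i, j})
    Z = x[:, nbrs]                                  # (d, K) design matrix
    beta, *_ = np.linalg.lstsq(Z, x[:, [i, j]], rcond=None)
    e = x[:, [i, j]] - Z @ beta                     # (d, 2) residuals
    K = len(nbrs)
    psi = np.linalg.inv(e.T @ e / (d - K))          # invert residual covariance
    r_ij = psi[0, 1] / np.sqrt(psi[0, 0] * psi[1, 1])   # normalized element
    return psi, r_ij
```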
In order to compute the $\psi_{ii}$, we repeat the above regression analysis for $A = \{i\}$. Note that we could use the values of $\psi_{ii}$ calculated in the previous step, but working in the single-variable case simplifies the analysis of the properties of the estimator. Defining

$s^2 = \frac{1}{d-K} \sum \hat e_i^2$   (7)

one can show (Greene 2003) that

$(d - K)\, s^2\, \psi_{ii} \sim \chi^2_{d-K}$   (8)

and therefore the estimator

$\hat\psi_{ii} = \frac{d - K - 2}{\sum \hat e_i^2}$   (9)

is distributed as a scaled inverse $\chi^2$ distribution with $d - K$ degrees of freedom.

At the end of this first step, our estimate of the precision matrix can be written as

$\hat\Psi = D\, \tilde{\mathcal{R}}\, D$   (10)

where $D$ is a diagonal matrix with $D_{ii} = \sqrt{\hat\psi_{ii}}$, while $\tilde{\mathcal{R}}$ has $\hat r_{ij}$ on the off-diagonals and 1 on the diagonal.

3.2 Smoothing the precision matrix

We expect the covariance and precision matrices we encounter in cosmology to be relatively "smooth", which we can therefore hope to use to reduce the noise in the entrywise estimates. Such a smoothing operation can take many forms. If we could construct a model for the covariance/precision matrix, one could directly fit the entrywise estimates to the model. For example, on large scales, one might use a Gaussian model with free shot-noise parameters (Xu et al. 2012; O'Connell et al. 2015). In the absence of a model, one might use a non-parametric algorithm to smooth the precision matrix, respecting the structure of the matrix. For the examples we consider in this paper, we use a cubic smoothing spline and smooth along the off-diagonals of the matrix, with the degree of smoothing automatically determined from the data using a cross-validation technique. Since we believe that such a smoothing procedure may be generically useful, Appendix C contains an explicit description of the algorithm.

3.3 Ensuring positive definiteness

The estimate of the precision matrix above is "close" to the true precision matrix in a Frobenius sense. However, since the matrix was estimated entrywise, there is no guarantee that $\tilde{\mathcal{R}}$ (and therefore $\hat\Psi$) is positive definite (the diagonal matrices $D$ are positive definite by construction). Our goal in this section is to find a refined estimate $\mathcal{R}$ that is both positive definite and close to $\tilde{\mathcal{R}}$.

A natural approach is to choose $\mathcal{R}$ to maximize its posterior $P(\mathcal{R}|S)$ given an observed sample covariance matrix and the prior on $\mathcal{R}$ from $\tilde{\mathcal{R}}$. Again assuming Gaussianity, the likelihood of $S$ (ignoring irrelevant constants) is

$P(S|\Psi) \propto \exp\left[\frac{d}{2}\left(\log\det\Psi - \operatorname{tr} S\Psi\right)\right]$   (11)

while we take the prior on $\mathcal{R}$ to be

$P(\mathcal{R}) \propto \exp\left[-\frac{d}{2}\,\lVert \mathcal{R} - \tilde{\mathcal{R}} \rVert_F^2\right]$   (12)

where we assume that the error on $r_{ij}$ is $d^{-1/2}$. We ignore the $r$ dependence of the error to avoid potentially biasing these results with noise from the estimate; note that this error estimate is a conservative one. For similar reasons, we hold the diagonal matrix $D$ fixed. Putting this together, our maximization problem can be written as

$\mathcal{R} = \underset{r_{ij},\ (ij)\in\mathcal{J},\ \mathcal{R}\succ 0}{\operatorname{argmax}} \left[\log\det\mathcal{R} - \operatorname{tr}\left(D S D\, \mathcal{R}\right) - \lVert \mathcal{R} - \tilde{\mathcal{R}} \rVert_F^2\right]$   (13)

where $\mathcal{J}$ is the set of indices of the nonzero elements of $\mathcal{R}$ (as determined by the sparsity of the matrix) and $\mathcal{R}\succ 0$ restricts the solution to the space of positive definite matrices. We also observe that, while we have fixed the relative weights of the likelihood and prior terms here based on the expected error on $r$, it is possible in principle to weight these contributions differently. We leave this possibility open for future applications. We also note that this refinement step is reminiscent of the Scout estimator of Witten & Tibshirani (2009).

We perform this optimization using standard techniques; details of the algorithm are in Appendix B.
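The first-stage estimate that seeds this refinement is assembled from the entrywise pieces of Eqs. 6 and 9; a minimal sketch (the argument names are ours):

```python
import numpy as np

def assemble_precision(psi_diag, r):
    """Form Psi_hat = D R~ D (Eq. 10): psi_diag holds the diagonal
    estimates of Eq. 9, and r is the symmetric matrix of normalized
    entrywise estimates, with ones on the diagonal and zeros outside
    the assumed band."""
    D = np.diag(np.sqrt(psi_diag))
    return D @ r @ D
```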
4 NUMERICAL EXPERIMENTS

4.1 Models

We consider two examples of covariance (and precision) matrices in this paper. First, in order to build intuition, we consider a tridiagonal precision matrix $\Psi = (\psi_{ij})$, where $\psi_{ii} = 2$, $\psi_{i,i+1} = \psi_{i-1,i} = -1$ and zero otherwise. The structure of the resulting covariance matrix is shown in Fig. 1; unlike the precision matrix, the covariance matrix has long-range support. The structure of this precision matrix, and the corresponding covariance matrix, is qualitatively similar to the cosmological examples we will consider later. We fix $p = 100$; this is similar to the number of parameters in covariance matrices for current surveys.

Second, we use an example more closely related to the covariance matrices one might expect in galaxy redshift surveys. Galaxy redshift surveys typically involve negligible measurement uncertainties, but suffer instead from both sampling variance and shot noise. These can be estimated within the framework of a theory, but a theory for the formation of non-linear objects is not currently understood. Instead we use a linear-theory approximation. Specifically, we compute the covariance matrix of the multipoles of the correlation function, $\xi_\ell(s)$, assuming Gaussian fluctuations evolved according to linear theory. The assumed cosmology and power spectrum are of the $\Lambda$CDM family, with standard values of $\Omega_m$, $h$, $n_s$ and $\sigma_8$. The objects are assumed to be linearly biased with $b = 2$, and shot noise is added appropriate for a number density of $\bar n = 4\times10^{-4}\,h^3\,{\rm Mpc}^{-3}$. We evaluate $C_{ij}$ in 100 bins, equally spaced in $s$, for both the monopole and quadrupole moments of the correlation function, interleaved to form a $200\times200$ matrix. Fig. 2 plots the corresponding precision matrix. This is clearly dominated by a narrow banded structure, and is similar in character to our first case.
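A sketch of the toy model setup (the random seed and the choice of $d = 500$ mock samples are arbitrary choices for illustration):

```python
import numpy as np

# Toy model: tridiagonal precision matrix with 2 on the diagonal and
# -1 on the first off-diagonals, for p = 100.
p = 100
Psi = 2.0 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)
C = np.linalg.inv(Psi)   # dense covariance with long-range support (Fig. 1)

# Draw d mock "simulations" from N(0, C) to feed the estimators above.
x = np.random.default_rng(0).multivariate_normal(np.zeros(p), C, size=500)
```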
Figure 1. A density plot of the covariance matrix for our toy tridiagonal precision matrix, with 2 along the diagonal and -1 on the first off-diagonals.
Figure 2. A representation of the precision matrix for the cosmologically motivated example. We plot the 25th, 50th, and 75th rows of the matrix, demonstrating that the matrix is clearly diagonally dominant and very sparse. The inset shows that the full matrix has the same structure as the individual rows plotted here.

4.2 Entrywise estimates

We start by characterizing the accuracy with which we can estimate individual entries of the precision matrix.
Figure 3. Histograms of the recovered values of a representative diagonal element $\psi_{ii}$ for different values of $d$, correcting for the finite-sample bias. We assume that the matrix is banded to $k = 25$ in all cases. The solid [red] line is the expected distribution of these values.
Figure 4. Same as Fig. 3, except for a representative off-diagonal element $r_{ij}$.

As we discussed previously, we separate out the measurements of the diagonal elements of $\Psi$ from the off-diagonal elements; for the latter, we compute $r_{ij} \equiv \psi_{ij}/\sqrt{\psi_{ii}\psi_{jj}}$. Figs. 3 and 4 show the distributions of the recovered values for two representative entries of the precision matrix of our toy model. The different panels correspond to different numbers of simulations $d$, while each of the histograms is constructed from an ensemble of 1000 such realizations. We find good agreement with the theoretically expected distributions of both $\psi_{ii}$ and $r_{ij}$, with the distribution for $r_{ij}$ close to the asymptotically expected Gaussian even for relatively small numbers of simulations. All of these results assumed a $k = 25$ banded structure (i.e. 24 non-zero upper/lower off-diagonals); the results for different choices of this banding parameter are qualitatively similar. Holding the number of simulations $d$ fixed, the scatter in the estimates decreases with decreasing $k$, since one is regressing on a smaller number of variables in this case.

Given these entrywise estimates (each of which will be individually noisy), we turn to the problem of "smoothing" away this noise to improve our covariance estimates. If one had an a priori model for the structure of the precision matrix, one could directly fit to it. For instance, in the case of our toy model, one might use the fact that the matrix has constant off-diagonals. However, in general, one might only expect the matrix to be smooth (nearby entries with similar values); in such a case, using e.g. smoothing splines can provide a reasonably generic solution.
Figure 5. A demonstration of the impact of smoothing on the elements of the precision matrix for the cosmological model. We plot the odd elements of the first off-diagonal [top, circles] and the even elements of the second off-diagonal [bottom, squares] (see text for details of how the covariance matrix is packed). The precision matrix was estimated assuming $k = 20$ and $d = 2000$ simulations. The points are the raw entrywise estimates, the dashed line is the true value, while the solid line is the smoothed version. Note that smoothing can significantly reduce the variance of the estimates. More complicated smoothing schemes could reduce the bias at the very edges.

As a worked non-trivial example, we consider smoothing an estimate of our cosmological model. Given that the underlying galaxy correlation function is a smooth function, we would expect the off-diagonals of the precision matrix to be smooth. However, given that our matrix was constructed by interleaving the monopole and quadrupole correlation functions, we'd expect this interleaving to persist in the off-diagonals. We therefore smooth the odd and even elements of each off-diagonal separately. Fig. 5 shows the results for two example minor diagonals, comparing them to the raw entrywise estimates as well as the true values. For the majority of the points, smoothing significantly reduces the noise in the entrywise estimates. It can also miss features in the model, if those variations are smaller than the noise. We observe this effect at the edges of the top curves, where the smoothed curves miss the drop-off for the initial and final points. This is clearly a function of the relative sizes of these effects, as is evident from the lower sets of curves, where the smoothed estimates track the variations at the edges better.

Our final step is to ensure a positive definite precision matrix using the algorithm presented in the previous section. Fig. 6 shows the final results, plotting an example row from the cosmological case. Our estimator tracks the oscillating structure in the precision matrix, and is clearly less noisy than the direct sample precision matrix. This improvement directly translates into improved performance of the precision matrix.
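A sketch of this interleaved smoothing is below; we use scipy's `UnivariateSpline` as a stand-in for the GCV-tuned cubic spline of Appendix C, so the smoothing factor `s` is an assumption here rather than being chosen by cross-validation:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth_offdiagonal(R, m, s=None):
    """Smooth the m-th off-diagonal of the entrywise estimate R,
    fitting the odd and even elements separately to respect the
    interleaving of the monopole and quadrupole bins."""
    y = np.diagonal(R, offset=m).copy()
    f = np.empty_like(y)
    for parity in (0, 1):                     # even, then odd elements
        t = np.arange(len(y))[parity::2]
        f[parity::2] = UnivariateSpline(t, y[parity::2], k=3, s=s)(t)
    out = R.copy()
    rows = np.arange(len(y))
    out[rows, rows + m] = f                   # write back symmetrically
    out[rows + m, rows] = f
    return out
```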
4.3 Determining the banding

We now turn to the problem of determining the banding parameter $k$. Our goal here isn't to be prescriptive, but to develop a simple guide. We anticipate that choosing an appropriate banding will depend on the particular problem, as well as on some numerical experimentation.

Since we do not expect the off-diagonals of the precision matrix to be exactly zero, choosing a banding reflects a trade-off between bias and variance. By setting small off-diagonals to zero, our estimated precision matrix will, by construction, be biased.
Figure 6. A part of the 70th row of the cosmological precision matrix, normalized by $\sqrt{\psi_{ii}\psi_{jj}}$. The filled circles connected by solid lines show the true precision matrix, the filled squares show the sample precision matrix, while the open circles are the estimate developed in this paper. The precision matrix was estimated from $d = 1000$ simulations, and we assume a banding of $k = 20$ to estimate the matrix (short dashed lines). The fiducial $k = 15$ banding that we use for this matrix is marked by the long dashed lines.
Figure 7. Selecting the banding for the toy model, with each row representing whether a particular off-diagonal was consistent with zero or not. Each row represents an off-diagonal, with the index of the off-diagonal on the y-axis, while each column represents a set of $d = 500$ simulations used to estimate the precision matrix. In order to estimate the precision matrix, we assume a banding of $k = 25$. Unfilled boxes show diagonals consistent with zero, gray boxes show diagonals that are inconsistent with zero at the 95% level but not at 99%, while black represents diagonals inconsistent with zero at the 99% level. An index of one represents the first off-diagonal. The top panel considers the unsmoothed matrix, while the lower panel considers the smoothed case. The tridiagonal nature of the matrix is clearly apparent here, with the first off-diagonal clearly nonzero, and no structure evident for the remaining cases. We choose $k = 3$ as our fiducial choice for both cases.
Table 2. An example of the thresholds at the 95% level for the unsmoothed and smoothed off-diagonals (see text for details on how these are computed). While the thresholds increase with $m$ for the unsmoothed case, they decrease for the smoothed case, since the increase in the number of points better constrains the spline.

m      Unsmoothed    Smoothed
70     2.47          0.75
80     2.52          0.71
90     2.56          0.67
100    2.59          0.63
However, any estimate of these off-diagonal elements from a finite number of simulations will be dominated by the noise in the measurements. This also implies that the banding will, in general, depend on the number of simulations (except in the case that the matrix is truly sparse); we will see this explicitly in the next section.

We propose a simple thresholding scheme to determine whether an entire off-diagonal is consistent with zero or not: we set an off-diagonal to zero if the maximum absolute value of all its elements is less than a pre-determined threshold. For the unsmoothed estimate of the precision matrix, we determine this threshold as follows. Assume an off-diagonal has $m$ elements $x_i$, all drawn from Gaussians with mean zero and a known variance (but with arbitrary correlations). Then the probability of exceeding a threshold value $X$ can be bounded as follows:

$P(\max x_i > X) \leq \sum_i P(x_i > X) = m\, P(x > X)$   (14)

where $P(x > X)$ is the complementary cumulative Gaussian probability distribution. Choosing an appropriate failure rate $m P(x > X)$ (i.e. the rate of mis-classifying a zero off-diagonal as non-zero) determines the appropriate threshold to use. We follow a similar procedure for the smoothed estimate of the precision matrix, although the choice of the threshold is complicated by the smoothing procedure and the correlations between different elements on the same off-diagonal. We therefore estimate the threshold by Monte Carlo, simulating $m$ random Gaussian variables, fitting them with a cubic spline and then tabulating the distribution of maximum values. This process ignores the correlations between the points, and so we expect our estimates to be only approximate. Table 2 shows an example of these thresholds as a function of $m$.

Figs. 7 and 8 plot, for a set of simulations, which off-diagonals are determined to be non-zero using the above criteria. Each row of these figures corresponds to an off-diagonal, whereas each column represents an independent set of simulations from which a precision matrix can be estimated. Since our procedure assumes a banding of the precision matrix, we start by selecting a conservative choice of the band. Filled boxes show off-diagonals that are inconsistent with zero, with the shading representing the confidence level for this. A banded matrix in this figure would appear as a filled set of rows.

We start with the unsmoothed cases first (upper panels). The tridiagonal nature of the toy precision matrix is clear, with only a single row clearly visible. The cosmological example is less sharp, but a central band is still clear. The smoothed cases (lower panels) are noisier, because our estimates of the thresholds ignored correlations. Even with this increased noise, we discern no trend suggesting an increased banding for the toy example. For the cosmological example, however, the non-zero band is clearly increased, implying that one must work out to a larger band. This is to be expected, since the elements of the precision matrix in this case are not exactly zero. Reducing the noise (by smoothing) reduces the threshold below which one may estimate an element of the precision matrix to be zero. We find a similar trend with increasing numbers of simulations (discussed further below).
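Both thresholds are simple to compute; a sketch ($\alpha = 0.05$ corresponds to the 95% level, and the Monte-Carlo variant ignores correlations, as in the text):

```python
import numpy as np
from scipy.stats import norm
from scipy.interpolate import UnivariateSpline

def threshold_unsmoothed(m, alpha=0.05):
    """Union-bound threshold of Eq. 14: X such that m P(x > X) = alpha,
    in units of the standard deviation of the entries."""
    return norm.isf(alpha / m)

def threshold_smoothed(m, alpha=0.05, n_mc=2000, s=None, seed=0):
    """Monte-Carlo threshold for the smoothed case: spline-fit pure
    Gaussian noise and tabulate the maximum absolute smoothed value."""
    rng = np.random.default_rng(seed)
    t = np.arange(m)
    maxima = [np.max(np.abs(UnivariateSpline(t, rng.standard_normal(m),
                                             k=3, s=s)(t)))
              for _ in range(n_mc)]
    return np.quantile(maxima, 1.0 - alpha)
```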
Figure 8. Same as Fig. 7, except for the cosmological model. The number of realizations per column is $d = 1000$, and the banding assumed in the estimate is $k = 30$. The difference between the smoothed and unsmoothed cases is more apparent here: smoothing requires us to estimate more off-diagonals, due to the reduction in the noise. We choose $k = 10$ and $k = 15$ for the unsmoothed and smoothed cases respectively.

A shortcoming of the above is the lack of an objective, automatic determination of the banding. We do not have such a prescription at this time (and it isn't clear that such a prescription is possible in general). We therefore advocate some level of numerical experimentation when determining the appropriate band.
4.4 Performance

We now turn to the performance of our precision matrix estimates as a function of the number of simulations input. There are a number of possible metrics for quantifying "performance". The most pragmatic of these would be to propagate the estimates of the precision matrix through parameter fitting, and see how the errors in the precision matrix affect the errors in the parameters of interest. The disadvantage of this approach is that it is problem-dependent and therefore hard to generalize. We defer such studies to future work focused on particular applications.

A more general approach is to quantify the "closeness" of our precision matrix estimates to the truth using a "loss" function. There are a variety of ways to do this, each of which tests different aspects of the matrix. We consider five such loss functions here:
(i) Frobenius norm:

$\lVert\Delta\Psi\rVert_F \equiv \lVert\Psi - \hat\Psi\rVert_F$   (15)

This is an entrywise test of the elements of the precision matrix, and is our default loss function. Clearly, reducing the Frobenius norm will ultimately improve any estimates derived from the precision matrix, but it is likely not the most efficient way to do so. In the basis where $\Delta\Psi$ is diagonal, the Frobenius norm is just the RMS of the eigenvalues, and can be used to set a (weak) bound on the error on $\chi^2$.

(ii) Spectral norm:

$\lVert\Delta\Psi\rVert \equiv \lVert\Psi - \hat\Psi\rVert$   (16)

This measures the largest singular value (maximum absolute eigenvalue) of $\Delta\Psi$. This yields a very simple bound on the error in $\chi^2$: $\lVert\Delta\Psi\rVert\, |x|^2$, where $x$ is the difference between the model and the observations.

(iii) Inverse test: a different test would be to see how well $\hat\Psi$ approximates the true inverse of $C$. A simple measure of this would be to compute $\lVert C\hat\Psi - I\rVert_F = \lVert C\hat\Psi - C\Psi\rVert_F = \lVert C\Delta\Psi\rVert_F$. However, this measure is poorly behaved. In particular, it isn't invariant under transposes, although one would expect $\hat\Psi$ to be an equally good left and right inverse. To solve this, we use the symmetrized version $\lVert C^{1/2}\hat\Psi C^{1/2} - I\rVert_F$, although for brevity we continue to denote it by $\lVert C\Delta\Psi\rVert_F$.

(iv) $\chi^2$ variance: given that parameter fits are often done by minimizing a $\chi^2$ function, we can aim to minimize the error in this function due to an error in the precision matrix. If we define $\Delta\chi^2 = x^t\left(\Delta\Psi\right)x$, where as before $x$ is the difference between the data and the model, we define the $\chi^2$ loss as the RMS of $\Delta\chi^2$. In order to compute this, we need to specify how $x$ is distributed. There are two options here. The first comes from varying the input parameters to the model, while the second comes from the noise in the data. The former is application-dependent, and we defer specific applications to future work. The second suggests $x \sim N(0, C)$, in which case we find

$\sigma\left(\Delta\chi^2\right) = \left[2\operatorname{Tr}\left(\Delta\Psi C\, \Delta\Psi C\right) + \left(\operatorname{Tr}\Delta\Psi C\right)^2\right]^{1/2}$   (17)

(v) Kullback-Leibler (KL) divergence: our final loss function is the Kullback-Leibler divergence between $C$ and $\hat\Psi$, defined by

${\rm KL} \equiv \frac{1}{2}\left[\operatorname{Tr}(C\hat\Psi) - \dim(C) - \log\det(C\hat\Psi)\right]\,.$   (18)

The Kullback-Leibler divergence can be interpreted as the expectation of the Gaussian likelihood ratio of $x$ computed with the estimated and true precision matrices. (A divergence is a generalization of a metric which need not be symmetric or satisfy the triangle inequality; formally, a divergence on a space $X$ is a non-negative function on the Cartesian product space $X \times X$ which is zero only on the diagonal.) As in the inverse-variance case, we assume $x \sim N(0, C)$, which captures the variation in the data.

Figures 9 and 10 show these different losses for our toy and cosmological models; we plot the ratio of the loss obtained using the sample precision matrix to the loss obtained using the techniques presented in this paper. The figures show the improvement averaged over fifty realizations, although the improvement for individual realizations is similar. For all choices of loss function, we find that the techniques presented here yield a factor of a few improvement over simply inverting the sample covariance matrix. The largest gains come from exploiting the sparsity of the precision matrix and directly estimating it from the simulations. The secondary smoothing step yields a smaller relative (although clear) improvement. Given the similar behaviour of all the loss functions, we focus on the Frobenius norm below for brevity.
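For reference, the five loss functions above are a few lines each; a sketch assuming a positive definite $C$:

```python
import numpy as np

def losses(Psi_hat, Psi, C):
    """Evaluate the five loss functions of Eqs. 15-18."""
    dPsi = Psi - Psi_hat
    frob = np.linalg.norm(dPsi, 'fro')                            # Eq. 15
    spec = np.linalg.norm(dPsi, 2)                                # Eq. 16
    w, V = np.linalg.eigh(C)
    Ch = (V * np.sqrt(w)) @ V.T                                   # C^{1/2}
    inv = np.linalg.norm(Ch @ Psi_hat @ Ch - np.eye(len(C)), 'fro')
    A = dPsi @ C
    chi2 = np.sqrt(2.0 * np.trace(A @ A) + np.trace(A) ** 2)      # Eq. 17
    M = C @ Psi_hat
    kl = 0.5 * (np.trace(M) - len(C) - np.linalg.slogdet(M)[1])   # Eq. 18
    return frob, spec, inv, chi2, kl
```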
We now turn to how the loss depends on the number of simulations $d$; Figs. 11 and 12 summarize these results for both models considered here. The improvements seen above are immediately clear here. At a fixed number of simulations, one obtains a more precise estimate of the precision matrix; conversely, a significantly smaller number of simulations is required to reach a target precision. We also note that we can obtain estimates of the precision matrix with $d < p$ simulations.
Figure 9. The improvement over the sample precision matrix, Loss(sample)/Loss, for different norms for the toy tridiagonal precision matrix. From left to right, each group plots the improvement for the sample precision matrix (1 by construction), the unsmoothed precision matrix and the smoothed precision matrix. We assume a $k = 3$ banding in both the unsmoothed and smoothed cases, and all cases assume $d = 500$. We average over 50 such realizations in this figure, although the improvements for individual realizations are similar.
Figure 10. The same as Fig. 9, but for the cosmological model. The unsmoothed covariance matrix assumes $k = 10$, while the smoothed covariance matrix uses $k = 15$. All three cases use $d = 1000$.
Figure 11. The Frobenius loss $\lVert\Delta\Psi\rVert_F$ for our toy model as a function of sample size $d$, for the sample precision matrix (blue circles) and our estimate of the precision matrix with and without the intermediate smoothing step (green diamonds and orange squares respectively). The latter two cases assume a banding of $k = 3$. The dashed lines show a $d^{-1/2}$ trend, which is attained more quickly by the estimates presented in this work than by the sample precision matrix. Recall that this is a $100\times100$ matrix: one cannot estimate the sample precision matrix with $d < 100$. This restriction isn't present for the estimator presented here.
Figure 12. Analogous to Fig. 11, except now for the cosmological model. As in the previous case, the circles [blue] show the sample precision matrix, the squares [orange] our unsmoothed estimate with $k = 10$, the diamonds [green] our smoothed estimate with $k = 15$, and the triangles [red] our smoothed estimate with $k = 30$. Unlike the toy model, we see our estimators flatten out with increasing $d$ at fixed values of $k$. This reflects the fact that the precision matrix here is not strictly sparse, and that with increasing values of $d$ one can reliably estimate more elements of the matrix.

Given that we assume the precision matrix is sparse, this isn't surprising, although we re-emphasize that the usual procedure of first computing a covariance matrix and then inverting it doesn't easily allow exploiting this property. Analogous to the fact that the sample precision matrix requires $d > p$ to obtain an estimate, our approach requires $d > K$ to perform the linear regressions. The gains of the method come from the fact that $K$ can be significantly smaller than $p$.

We can also see how the accuracy of the precision matrix scales with the number of simulations. For $d \gg p$, the error scales as $d^{-1/2}$, as one would expect from simple averaging. However, as $d$ gets to within a factor of a few of $p$, the error deviates significantly from this scaling; at $d < p$ the error is formally infinite, since the sample covariance matrix is no longer invertible.

The $d$ dependence for our estimator is more involved. For the case of the toy model, we find that the error lies on the $d^{-1/2}$ scaling for the range of $d$ we consider here, with the smoothed estimator performing ~10% better.
Note that we do not converge faster than $d^{-1/2}$; that scaling is set by the fact that we are still averaging over simulations, but the prefactor is significantly smaller. At fixed banding, the cosmological model starts close to a $d^{-1/2}$ scaling, but then flattens off as $d$ increases. This follows from the fact that the true precision matrix is not strictly zero for the far off-diagonals, and therefore the error in the estimator has a minimum bound. However, the appropriate banding will be a function of $d$, since as the number of simulations increases, one will be able to estimate more off-diagonal terms. We see this explicitly in the figure, where the $k = 30$ curve starts to outperform the $k = 15$ curve at large $d$.

Motivated by this, we consider how the various losses vary as a function of the assumed banding at a fixed number of simulations. For the toy model, we find a well-defined minimum $k$ value. Intriguingly, except for the Frobenius norm, it is one larger than the true value. For the cosmological case, the answer is less clear. All the norms suggest a minimum $k$, but have very weak $k$ dependence after that, with possibly multiple maxima. This suggests that the particular choice of $k$ will tend to be application specific (and require some numerical experimentation). We defer a more careful study of this to later work.
Figure 13. Matrix losses as a function of the choice of the banding parameter $k$ for the toy model. The dashed [red], dotted [blue] and solid [green] lines correspond to the KL, $\lVert C\Delta\Psi\rVert_F$ and $\lVert\Delta\Psi\rVert_F$ losses respectively. The losses are all normalized to the minimum value of the loss. Recall that $k$ is defined including the main diagonal (e.g. $k = 2$ corresponds to one off-diagonal). All of these are computed for $d = 500$ simulations. We find a well-defined minimum here, although for two of the three norms it is at $k = 3$ and not $k = 2$.
Figure 14. Same as Fig. 13, but for the cosmological model. The number of simulations used here is $d = 1000$. Unlike the toy model, there isn't a clear optimal banding here, with only small changes in the loss for larger bandings.

5 DISCUSSION

This paper describes a method recently introduced in the statistics literature to directly estimate the precision matrix from an ensemble of samples, for the case where we have some information about the sparsity structure of this matrix. This allows one to obtain higher fidelity estimates of the precision matrix with relatively small numbers of samples.

The key result in this paper is the description of an algorithm to directly estimate the precision matrix without going through the intermediate step of computing a sample covariance matrix and then inverting it. It is worth emphasizing that this estimator is completely general and does not rely on the sparsity of the precision matrix. The estimator does allow us to exploit the structure of the precision matrix directly; we then use this property for the specific case where the precision matrix is sparse. However, we anticipate that this algorithm may be useful even beyond the specific cases we consider here.

We also demonstrate the value of regularizing elementwise estimates of the precision matrix. Although this is not the first application of such techniques to precision (and covariance) matrices, we present a concrete implementation using smoothing splines, including how the regularizing parameters may be automatically determined from the data.

We demonstrate our algorithm with a series of numerical experiments. The first of these, with an explicitly constructed sparse precision matrix, allows us to both demonstrate and calibrate every aspect of our algorithm. Our second set of experiments uses the covariance/precision matrix for the galaxy two-point correlation function, and highlights some of the real-world issues that one might encounter, including the fact that the precision matrices may not be exactly sparse. In all cases our method improves over the naive estimation of the sample precision matrix by a significant factor (see e.g. Figs. 9 and 10). For almost any measure comparing the estimate to the truth, we find factors of several improvement, with estimates based on 100 realizations with our method outperforming the sample precision matrix from 2000 realizations (see Figs. 11 and 12). The errors in our method still scale as $N_{\rm sim}^{-1/2}$, just as for an estimator based on the sample covariance matrix. However, our approach achieves this rate for a smaller number of simulations, and with a significantly smaller prefactor.

A key assumption in our results is that the precision matrix may be well approximated by a banded, sparse matrix. This approximation expresses a trade-off between bias and noise. Banding yields a biased estimate of the precision matrix, but eliminates the noise in the estimate. We present a thresholding procedure to estimate this banding, and find that the width of the band increases with increasing sample size, as one would expect. Our banded approximation is similar in spirit to the tapering of the precision matrix proposed by Paz & Sánchez (2015). A hybrid approach may be possible; we defer this to future investigations.

Realistic precision matrices may combine a dominant, approximately sparse component with a subdominant, non-sparse component.
For instance, in the case of the correlation function, the non-sparse component can arise even in Gaussian models from fluctuations near the scale of the survey volume, from nonlinear effects, as well as from the effect of modes outside the survey volume (Harnois-Déraps & Pen 2012; de Putter et al. 2012; Takada & Hu 2013; Kayo et al. 2013; Mohammed & Seljak 2014). In these cases, we imagine combining our estimate of the dominant term with an estimate of the non-sparse component (perhaps taken directly from the sample covariance matrix). The key insight here is that our method may be used to estimate the dominant term with higher fidelity. We defer a detailed study of methods for estimating the non-sparse component, and of combining the estimates, to later work.

The computational requirements for the next generation of surveys are in large part driven by simulations for estimating covariance and precision matrices. We present an approach that may significantly reduce the number of simulations required for classes of precision matrices. The ultimate solution to this problem will likely involve combining model-agnostic approaches, like the one presented in this work, with improved models for covariance matrices.

ACKNOWLEDGMENTS
NP thanks Daniel Eisenstein, Andrew Hearin and Uroš Seljak for discussions on covariance matrices. NP and MW thank the Aspen Center for Physics, supported by National Science Foundation grant PHY-1066293, where part of this work was completed. NP is supported in part by DOE DE-SC0008080. HHZ is supported in part by NSF DMS-1507511 and DMS-1209191.
APPENDIX A: USEFUL LINEAR ALGEBRA RESULTS
For completeness, we include a few key linear algebra results used in this paper. We refer the reader to Petersen & Pedersen (2012) for a convenient reference.
A1 The Inverse of a Partitioned Matrix
Suppose

$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$   (A1)

then

$A^{-1} = \begin{pmatrix} B_1^{-1} & -A_{11}^{-1} A_{12} B_2^{-1} \\ -B_2^{-1} A_{21} A_{11}^{-1} & B_2^{-1} \end{pmatrix}$   (A2)

where

$B_1 \equiv A_{11} - A_{12} A_{22}^{-1} A_{21}$   (A3)

$B_2 \equiv A_{22} - A_{21} A_{11}^{-1} A_{12}$   (A4)

A2 Conditional Distributions of a Multivariate Gaussian

If $x \sim N(\mu, C)$ with

$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}$   (A5)

$\mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}$   (A6)

$C = \begin{pmatrix} C_a & C_c \\ C_c^T & C_b \end{pmatrix}$   (A7)

then $p(x_a | x_b) = N(\mu'_a, C'_a)$ with

$\mu'_a = \mu_a + C_c C_b^{-1} (x_b - \mu_b)$   (A8)

$C'_a = C_a - C_c C_b^{-1} C_c^T$   (A9)

Note that the conditional covariance matrix is the Schur complement of the block matrices.

APPENDIX B: AN ALGORITHM FOR MAXIMUM LIKELIHOOD REFINEMENT
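These identities are easy to verify numerically; a short check with a randomly generated covariance (the sizes are arbitrary):

```python
import numpy as np

# Check Eq. A9 and Eq. 4: the conditional covariance of x_a given x_b is
# the Schur complement C_a - C_c C_b^{-1} C_c^T, which also equals the
# inverse of the corresponding block of the precision matrix.
rng = np.random.default_rng(1)
p, a = 6, 2
M = rng.standard_normal((p, p))
C = M @ M.T + p * np.eye(p)          # random positive-definite covariance
Psi = np.linalg.inv(C)
Ca, Cc, Cb = C[:a, :a], C[:a, a:], C[a:, a:]
cond_cov = Ca - Cc @ np.linalg.inv(Cb) @ Cc.T
assert np.allclose(cond_cov, np.linalg.inv(Psi[:a, :a]))
```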
The maximum likelihood problem is equivalent to solving

$\left[\mathcal{R}^{-1} - D S D - 2\left(\mathcal{R} - \tilde{\mathcal{R}}\right)\right]_{\mathcal{J}} = 0\,.$   (B1)

A variant of this problem was considered in Bakonyi & Woerdeman (1995), where they consider the solution of the above problem without the prior constraint. They suggest using a Newton-Raphson algorithm to solve this; we reproduce this below, including the changes needed for our modified problem.

If $\mathcal{J} = (i_1, j_1), \ldots, (i_s, j_s)$ are the nonzero indices in $\mathcal{R}$, we define the vectors

$x = \left(\mathcal{R}_{i_1 j_1}, \ldots, \mathcal{R}_{i_s j_s}\right)$   (B2)

$y = \left(y_1, \ldots, y_s\right)$   (B3)

$y_p = \left[\mathcal{R}^{-1} - D S D - 2\left(\mathcal{R} - \tilde{\mathcal{R}}\right)\right]_{i_p j_p}$   (B4)

where $x$ is just the list of elements of $\mathcal{R}$ we are varying, and $y$ are the residuals from our desired solution. The Hessian $H$ is an $s$-by-$s$ matrix with elements

$H_{pq} = (\mathcal{R}^{-1})_{i_p j_q}(\mathcal{R}^{-1})_{i_q j_p} + (\mathcal{R}^{-1})_{i_q i_p}(\mathcal{R}^{-1})_{j_q j_p} + 2\delta_{pq}\,.$   (B5)

The minimization proceeds by iterating the following steps until the residual $\lVert y \rVert_\infty$ has reached a pre-determined tolerance:

(i) Compute the minimization direction $v$ by solving $y = Hv$.
(ii) Compute $\delta = \sqrt{v^T y}$, and set the step size $\alpha = 1$ if $\delta < 1/4$; otherwise $\alpha = 1/(1+\delta)$.
(iii) Update $x \rightarrow x + \alpha v$, and use this to update $\mathcal{R}$ and $y$. If the update yields an $\mathcal{R}$ that is not positive definite, we backtrack along this direction, reducing $\alpha$ by a factor of 2 each time, until positive definiteness is restored. In practice this happens rarely, and early in the iteration, and a single backtrack step restores positive definiteness.

The above algorithm is very efficient, converging in a small number of iterations for the cases we consider. However, a major computational cost is in inverting the Hessian (even though we do not explicitly compute the inverse, there is a substantial cost to solving the linear system). For a $k$-banded matrix, $s = n(k-1) - k(k-1)/2$; for some of the cases we consider, this easily reaches several thousand elements. Inverting the Hessian every iteration is then computationally too expensive, and we transition over to the BFGS algorithm (Nocedal & Wright 2000) for larger problem sizes. Conceptually, the BFGS algorithm (and others of its class) uses first-derivative information to approximate the Hessian and its inverse. At each step of the iteration, the inverse Hessian is updated using a rank-2 update (Eq. 6.17 in Nocedal & Wright 2000), based on the change in the parameters and the gradient. Since the algorithm works by directly updating the inverse Hessian, we avoid the computational cost when computing the minimization direction.

We implement Algorithm 6.1 of Nocedal & Wright (2000) and refer the reader there for a complete description. We limit ourselves here to highlighting the particular choices we make. The first is the starting approximation to the Hessian; we use the identity matrix for simplicity. While we may get better performance with an improved guess, this choice does not appear to adversely affect convergence, and we adopt it for simplicity. The second choice is how to choose the step size in the one-dimensional line minimization. Motivated by our success in the Newton-Raphson case, we use the same step-size choice, with the additional condition that we also backtrack if the step does not sufficiently reduce the error relative to the previous iteration; this last condition prevents overshooting the minimum. Note that the Wolfe conditions (Nocedal & Wright 2000) are automatically satisfied with this algorithm due to the convexity of our function.
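A compact sketch of the Newton-Raphson variant follows (dense linear algebra throughout, so suitable only for small index sets $\mathcal{J}$; `idx` lists the off-diagonal pairs $(i, j)$, $i < j$, inside the band):

```python
import numpy as np

def refine_precision(R_tilde, DSD, idx, tol=1e-8, max_iter=50):
    """Solve Eq. B1 for the refined R by the damped Newton iteration above."""
    R = R_tilde.copy()
    s = len(idx)
    for _ in range(max_iter):
        Rinv = np.linalg.inv(R)
        y = np.array([Rinv[i, j] - DSD[i, j] - 2.0 * (R[i, j] - R_tilde[i, j])
                      for i, j in idx])                       # residuals, Eq. B4
        if np.max(np.abs(y)) < tol:
            break
        H = np.empty((s, s))                                  # Hessian, Eq. B5
        for q, (iq, jq) in enumerate(idx):
            for p_, (ip, jp) in enumerate(idx):
                H[p_, q] = (Rinv[ip, jq] * Rinv[iq, jp]
                            + Rinv[iq, ip] * Rinv[jq, jp] + 2.0 * (p_ == q))
        v = np.linalg.solve(H, y)                             # step (i)
        delta = np.sqrt(v @ y)
        alpha = 1.0 if delta < 0.25 else 1.0 / (1.0 + delta)  # step (ii)
        while True:                                           # step (iii)
            Rnew = R.copy()
            for (i, j), dv in zip(idx, v):
                Rnew[i, j] += alpha * dv
                Rnew[j, i] += alpha * dv
            if np.all(np.linalg.eigvalsh(Rnew) > 0.0):
                break
            alpha *= 0.5            # backtrack to restore positive definiteness
        R = Rnew
    return R
```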
Finally, we verify that both algorithms converge to thesame answer for the same problems.
APPENDIX C: CUBIC SMOOTHING SPLINES AND CROSS-VALIDATION
We summarize the construction of cubic smoothing splines as used here; our treatment follows Craven & Wahba (1979) (see also Reinsch 1967; de Boor 2001).

Consider a function $y_i \equiv y(i)$ evaluated at $n$ evenly spaced points $i = 1, \ldots, n$ (even spacing is not necessary, but is what is relevant for our application). We aim to find a function $f(x)$ that minimizes

$p \sum_{i=1}^n \left(f(i) - y_i\right)^2 + \int_1^n dx\, \left(f''(x)\right)^2$   (C1)

where $p$ balances fidelity to the observed points against the smoothness of the function, and is an input parameter. Under appropriate conditions (square integrability and continuous second derivatives), the solution to this is a cubic spline with knots at $i = 1, \ldots, n$. Our goal therefore reduces to determining $p$ and $f_i \equiv f(i)$.

Craven & Wahba (1979) suggest a variation on cross-validation to determine the value of $p$. In ordinary cross-validation, one removes a point at a time from the data and minimizes the squared deviation (over all points) of the spline prediction for the dropped point from its actual value. While conceptually straightforward, actually performing this cross-validation is operationally cumbersome. Craven & Wahba (1979) suggest a weighted variant of ordinary cross-validation (generalized cross-validation) that, both theoretically and experimentally, has very similar behaviour, and is straightforward to compute. We outline this method below.

The $f_i$ are determined from the $y_i$ by

$\begin{pmatrix} f_1 \\ \vdots \\ f_n \end{pmatrix} = A(p) \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$   (C2)

where $A(p)$ is an $n \times n$ matrix that depends on $p$. Following Reinsch (1967) and Craven & Wahba (1979), we construct $A(p)$ as follows:

(i) Construct the $n \times (n-2)$ second-difference matrix $Q$ with non-zero elements $q_{i,i} = q_{i+2,i} = 1$ and $q_{i+1,i} = -2$, where $i = 1, \ldots, n-2$.
(ii) Construct the $(n-2) \times (n-2)$ tridiagonal matrix $T$ with non-zero elements $t_{ii} = 4/3$ and $t_{i,i+1} = t_{i+1,i} = 2/3$.
(iii) Compute $F = Q T^{-1/2}$. Construct its singular value decomposition $F = U D V^t$, with $D$ a diagonal matrix with the $n-2$ singular values $d_i$, and $U$ and $V$ being $n \times (n-2)$ and $(n-2) \times (n-2)$ orthogonal matrices respectively.
(iv) Define the $n-2$ values $z_j$ by $z = U^t y$, where $y$ are the $y_i$ arranged in a column vector as above. Define $\tilde d_i \equiv d_i^2/(d_i^2 + p)$.
(v) The generalized cross-validation function $V(p)$ is given by

$V(p) = \frac{1}{n}\sum_{i=1}^{n-2} \tilde d_i^2 z_i^2 \bigg/ \left(\frac{1}{n}\sum_{i=1}^{n-2} \tilde d_i\right)^2\,.$   (C3)

We minimize this function using a simple linear search in $\log p$; the minimum determines the value of $p$.
(vi) The cubic spline matrix is then given by $I - A = U \tilde{D} U^t$, where $\tilde{D}$ is an $(n-2) \times (n-2)$ diagonal matrix with $\tilde d_i$ on the diagonal.

REFERENCES
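A sketch of this construction is below. The entries of $T$ follow the values quoted above; note that rescaling $T$ only reparameterizes $p$, so its overall normalization does not affect the cross-validated smoother:

```python
import numpy as np

def gcv_spline_smooth(y, p_grid=np.logspace(-6, 8, 141)):
    """Cubic smoothing spline with the GCV choice of p (steps i-vi)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Q = np.zeros((n, n - 2))                       # step (i): second differences
    for i in range(n - 2):
        Q[i, i], Q[i + 1, i], Q[i + 2, i] = 1.0, -2.0, 1.0
    T = (np.diag(np.full(n - 2, 4.0 / 3.0))        # step (ii); values from text
         + np.diag(np.full(n - 3, 2.0 / 3.0), 1)
         + np.diag(np.full(n - 3, 2.0 / 3.0), -1))
    w, V = np.linalg.eigh(T)
    F = Q @ (V * w ** -0.5) @ V.T                  # step (iii): F = Q T^{-1/2}
    U, d, _ = np.linalg.svd(F, full_matrices=False)
    z = U.T @ y                                    # step (iv)
    def gcv(p):                                    # step (v), Eq. C3
        dt = d ** 2 / (d ** 2 + p)
        return (np.sum(dt ** 2 * z ** 2) / n) / (np.sum(dt) / n) ** 2
    p = min(p_grid, key=gcv)                       # linear search in log p
    dt = d ** 2 / (d ** 2 + p)
    return y - U @ (dt * z)                        # step (vi): f = A(p) y
```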
Bakonyi M., Woerdeman H. J., 1995, SIAM Journal on Matrix Analysis and Applications, 16, 369
Craven P., Wahba G., 1979, Numerische Mathematik, 31, 377
de Boor C., 2001, A Practical Guide to Splines, Revised Edition. Springer
de Putter R., Wagner C., Mena O., Verde L., Percival W. J., 2012, JCAP, 4, 19
Dodelson S., Schneider M. D., 2013, Phys. Rev. D, 88, 063537
Efron B., 1979, Ann. Statist., 7, 1
Greene W. H., 2003, Econometric Analysis, 5th edn. Prentice Hall
Harnois-Déraps J., Pen U.-L., 2012, MNRAS, 423, 2288
Hartlap J., Simon P., Schneider P., 2007, A&A, 464, 399
Kayo I., Takada M., Jain B., 2013, MNRAS, 429, 344
Mohammed I., Seljak U., 2014, MNRAS, 445, 3382
Niklas Grieb J., Sánchez A., Salazar-Albornoz S., dalla Vecchia C., 2015, ArXiv e-prints
Nocedal J., Wright S. J., 2000, Numerical Optimization. Springer
O'Connell R., Eisenstein D., Vargas M., Ho S., Padmanabhan N., 2015, ArXiv e-prints
Paz D. J., Sánchez A. G., 2015, MNRAS, 454, 4326
Pearson D. W., Samushia L., 2015, ArXiv e-prints
Petersen K. B., Pedersen M. S., 2012, The Matrix Cookbook
Reinsch C. H., 1967, Numerische Mathematik, 10, 177
Ren Z., Sun T., Zhang C.-H., Zhou H. H., 2015, The Annals of Statistics, 43, 991
Takada M., Hu W., 2013, Phys. Rev. D, 87, 123504
Taylor A., Joachimi B., 2014, MNRAS, 442, 2728
Taylor A., Joachimi B., Kitching T., 2013, MNRAS, 432, 1928
Tukey J. W., 1958, Ann. Math. Statist., 29, 614
Witten D. M., Tibshirani R., 2009, Journal of the Royal Statistical Society Series B, 71, 615
Xu X., Padmanabhan N., Eisenstein D. J., Mehta K. T., Cuesta A. J., 2012, MNRAS, 427, 2146