How do Markov approximations compare with other methods for large spatial data sets?
David Bolin^{a,∗}, Finn Lindgren^{b}

^{a} Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden
^{b} Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway

∗ Corresponding author. Tel.: +46 46 2227974; fax: +46 46 2224623. Email address: [email protected] (David Bolin)
Abstract
The Matérn covariance function is a popular choice for modeling dependence in spatial environmental data. Standard Matérn covariance models are, however, often computationally infeasible for large data sets. In this work, recent results for Markov approximations of Gaussian Matérn fields based on Hilbert space approximations are extended using wavelet basis functions. These Markov approximations are compared with two of the most popular methods for efficient covariance approximations: covariance tapering and the process convolution method. The results show that, for a given computational cost, the Markov methods have a substantial gain in accuracy compared with the other methods.
Key words:
Matérn covariances, Kriging, Wavelets, Markov random fields, Covariance tapering, Process convolutions, Computational efficiency
1 Introduction

The traditional methods in spatial statistics were typically developed without any consideration of computational efficiency. In many of the classical applications of spatial statistics in the environmental sciences, the cost of obtaining measurements limited the size of the data sets to ranges where computational cost was not an issue. Today, however, with the increasing use of remote sensing satellites producing many large climate data sets, computational efficiency is often a crucial property.

In recent decades, several techniques for building computationally efficient models have been suggested. In many of these techniques, the main assumption is that a latent, zero mean Gaussian process $X(\mathbf{s})$ can be expressed, or at least approximated, through some finite basis expansion

$$X(\mathbf{s}) = \sum_{j=1}^{n} w_j \xi_j(\mathbf{s}), \qquad (1)$$

where $w_j$ are Gaussian random variables and $\{\xi_j\}_{j=1}^{n}$ are pre-defined basis functions. The justification for using these basis expansions is usually that they converge to the true spatial model as $n$ tends to infinity. However, for a finite $n$, the choice of the weights and basis functions will greatly affect the approximation error and the computational efficiency of the model. Hence, if one wants an accurate model for a given computational cost, asymptotic arguments are insufficient.
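To make the expansion (1) concrete, the following minimal Python sketch evaluates such a field; the Gaussian-bump basis functions and the independent weights are hypothetical illustrations (in the methods discussed below, the basis and the distribution of the weights are chosen much more carefully):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical basis: n Gaussian bumps xi_j centered on a regular 1-D grid.
n = 20
centers = np.linspace(0.0, 10.0, n)

def xi(s, c, width=0.8):
    """A Gaussian bump centered at c, standing in for a generic basis function."""
    return np.exp(-0.5 * ((s - c) / width) ** 2)

# Gaussian weights w_j (i.i.d. here for illustration; in general correlated).
w = rng.standard_normal(n)

# Evaluate X(s) = sum_j w_j xi_j(s) on a fine grid of locations.
s = np.linspace(0.0, 10.0, 500)
X = sum(w[j] * xi(s, centers[j]) for j in range(n))
```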
If the process $X(\mathbf{s})$ has a discrete spectral density, one can obtain an approximation of the form (1) by truncating the spectral expansion of the process. Another way to obtain an, in some sense optimal, expansion of the form (1) is to use the eigenfunctions of the covariance function of the latent field $X(\mathbf{s})$ as a basis, which is usually called the Karhunen-Loève (KL) transform. The problem with the KL transform is that analytic expressions for the eigenfunctions are only known in a few simple cases, which are often insufficient to represent the covariance structure in real data sets. Numerical approximations of the eigenfunctions can be obtained for a given covariance function; however, the covariance function is in most cases not known, but has to be estimated from data. In these cases, it is infeasible to use the KL expansion in the parameter estimation, which is often the most computationally demanding part of the analysis. The spectral representation has a similar problem, since the computationally efficient methods are usually restricted to stationary models with gridded data and are not applicable in more general situations. Thus, to be useful for a broad range of practical applications, the methods should be applicable to a wide family of stationary covariance functions and be extendable to nonstationary covariance structures.

One method that fulfills these requirements is the process convolution approach (Barry and Ver Hoef, 1996, Higdon, 2001, Cressie and Pavlicová, 2002, Rodrigues and Diggle, 2010). In this method, the stochastic field $X(\mathbf{s})$ is defined as the convolution of a Gaussian white noise process with some convolution kernel $k(\mathbf{s})$. This convolution is then approximated with a sum of the form (1) to obtain a discrete model representation. Process convolution approximations are computationally efficient if a small number of basis functions can be used, but in practice, this will often give a poor approximation of the continuous convolution model.

A popular method for creating computationally efficient approximations is covariance tapering (Furrer et al., 2006). This method cannot be written as an approximation of the form (1); the idea is instead to taper the true covariance to zero beyond a certain range by multiplying the covariance function with some compactly supported taper function (Gneiting, 2002). This facilitates the use of sparse matrix techniques that increase the computational efficiency, at the cost of replacing the original model with a different model, which can lead to problems depending on the spatial structure of the data locations. However, the method is applicable to both stationary and nonstationary covariance models, and instead of choosing the set of basis functions in (1), the taper range and the taper function have to be chosen.

Nychka et al. (2002) used a wavelet basis in the expansion (1) and showed that, by allowing for some correlation among the random variables $w_j$, one gets a flexible model that can be used for estimating nonstationary covariance structures. As a motivating example, they showed that, using a wavelet basis, computationally efficient approximations to the popular Matérn covariance functions can be obtained using only a few nonzero correlations for the weights $w_j$. The approximations were, however, obtained numerically, and no explicit representations were derived.

Rue and Tjelmeland (2002) showed that general stationary covariance models can be closely approximated by Markov random fields, by numerically minimizing the error in the resulting covariances. Song et al. (2008) extended the method by applying different loss criteria, such as minimizing the spectral error or the Kullback-Leibler divergence.
A drawback of these methods is that, just as for the KL and wavelet approaches, the numerical optimisation must in general be performed for each distinct parameter configuration.

Recently, Lindgren and Rue (2007) derived an explicit method for producing computationally efficient approximations to the Matérn covariance family. The method uses the fact that a random process on $\mathbb{R}^d$ with a Matérn covariance function is a solution to a certain stochastic partial differential equation (SPDE). By considering weak solutions to this SPDE with respect to some set of local basis functions $\{\xi_j\}_{j=1}^{n}$, an approximation of the form (1) is obtained, where the stochastic weights have a sparse precision matrix (inverse covariance matrix) that can be written directly as a function of the parameters, without any need for costly numerical calculations. The method is also extendable to more general stationary and nonstationary models by extending the generating SPDE (Lindgren et al., 2011, Bolin and Lindgren, 2011).

In this paper, we use methods from Lindgren and Rue (2007) and Lindgren et al. (2011) to algebraically compute the weights $w_j$ for wavelet based approximations to Gaussian Matérn fields (Section 3). For certain wavelet bases, the weights form a Gaussian Markov random field (GMRF), which greatly increases the computational efficiency of the approximation. For other wavelet bases, such as the one used in Nychka et al. (2002), the weights can be well approximated with a GMRF.

In order to evaluate the practical usefulness of the different approaches, a detailed analysis of the computational aspects of the spatial prediction problem is performed (Section 2 and Section 4). The results show that the GMRF methods are more efficient and accurate than both the process convolution approach and the covariance tapering method.

2 Kriging

As a motivating example for why computational efficiency is important, consider spatial prediction. The most widely used method for spatial prediction is commonly known as linear kriging in geostatistics. Let $Y(\mathbf{s})$ be an observation of a latent Gaussian field $X(\mathbf{s})$ under mean zero Gaussian measurement noise $E(\mathbf{s})$, uncorrelated with $X$ and with some covariance function $r_E(\mathbf{s},\mathbf{t})$,

$$Y(\mathbf{s}) = X(\mathbf{s}) + E(\mathbf{s}), \qquad (2)$$

and let $\mu(\mathbf{s})$ and $r(\mathbf{s},\mathbf{t})$ be the mean value function and covariance function of $X(\mathbf{s})$, respectively. Depending on the assumptions on $\mu(\mathbf{s})$, linear kriging is usually divided into simple kriging (if $\mu$ is known), ordinary kriging (if $\mu$ is unknown but independent of $\mathbf{s}$), and universal kriging (if $\mu$ is unknown and can be expressed as a linear combination of some deterministic basis functions). To limit the scope of this article, parameter estimation will not be considered, and to simplify the notation, we let $\mu(\mathbf{s}) \equiv 0$. It should, however, be noted that all results in later sections regarding computational efficiency also hold in the cases of ordinary kriging and universal kriging. For more details on kriging, see e.g. Stein (1999) or Schabenberger and Gotway (2005).

Let $r(\mathbf{s},\mathbf{t})$ have some parametric structure, and let the vector $\gamma$ contain all covariance parameters. Let $Y$ be a vector containing the observations, $X$ be a vector containing $X(\mathbf{s})$ evaluated at the measurement locations $\mathbf{s}_1,\ldots,\mathbf{s}_m$, and let $\hat{X}$ be a vector containing $X(\mathbf{s})$ at the locations $\hat{\mathbf{s}}_1,\ldots,\hat{\mathbf{s}}_{\hat m}$ for which the kriging predictor should be calculated.
With $\bar{X} = (X^\top, \hat{X}^\top)^\top$, one has $X = A\bar{X}$ and $\hat{X} = \hat{A}\bar{X}$ for two selection matrices $A$ and $\hat{A}$, and the model can now be written as

$$\bar{X} \mid \gamma \sim \mathsf{N}(0, \Sigma_{\bar X}), \qquad Y \mid \bar{X} \sim \mathsf{N}(A\bar{X}, \Sigma_E),$$

where $\Sigma_{\bar X}$ is the covariance matrix of $\bar{X}$ and $\Sigma_E$ contains the covariances $r_E(\mathbf{s}_i,\mathbf{s}_j)$. It is straightforward to show that $\bar{X} \mid Y, \gamma \sim \mathsf{N}(\hat\Sigma A^\top \Sigma_E^{-1} Y, \hat\Sigma)$, where $\hat\Sigma = (\Sigma_{\bar X}^{-1} + A^\top \Sigma_E^{-1} A)^{-1}$, and the well-known expression for the kriging predictor is now given by the conditional mean

$$\mathsf{E}(\hat{X} \mid Y, \gamma) = \hat{A}\hat\Sigma A^\top \Sigma_E^{-1} Y = \hat{A}\Sigma_{\bar X}A^\top(A\Sigma_{\bar X}A^\top + \Sigma_E)^{-1}Y = \Sigma_{\hat{X}X}(\Sigma_X + \Sigma_E)^{-1}Y = \Sigma_{\hat{X}X}\Sigma_Y^{-1}Y, \qquad (3)$$

where the elements on row $i$ and column $j$ of $\Sigma_{\hat{X}X}$ and $\Sigma_Y$ are given by the covariances $r(\hat{\mathbf{s}}_i,\mathbf{s}_j)$ and $r(\mathbf{s}_i,\mathbf{s}_j) + r_E(\mathbf{s}_i,\mathbf{s}_j)$, respectively. To get the standard expression for the variance of the kriging predictor, the Woodbury identity is used on $\hat\Sigma$:

$$\mathsf{V}(\hat{X} \mid Y, \gamma) = \hat{A}(\Sigma_{\bar X}^{-1} + A^\top\Sigma_E^{-1}A)^{-1}\hat{A}^\top = \hat{A}\Sigma_{\bar X}\hat{A}^\top - \hat{A}\Sigma_{\bar X}A^\top(A\Sigma_{\bar X}A^\top + \Sigma_E)^{-1}A\Sigma_{\bar X}\hat{A}^\top = \Sigma_{\hat X} - \Sigma_{\hat{X}X}\Sigma_Y^{-1}\Sigma_{\hat{X}X}^\top.$$
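As a minimal illustration of (3) and the variance expression, the following Python sketch computes the simple kriging predictor with dense matrices; the exponential covariance (the Matérn with $\nu = 0.5$) and all sizes and parameter values are arbitrary choices for the example:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

def exp_cov(h, phi=1.0, kappa=5.0):
    """Matern covariance with nu = 1/2, i.e. the exponential covariance."""
    return phi * np.exp(-kappa * h)

m, m_hat, sigma2 = 200, 50, 0.01             # observations, prediction sites, noise variance
s_obs = rng.uniform(0, 1, size=(m, 2))        # measurement locations s_1, ..., s_m
s_pred = rng.uniform(0, 1, size=(m_hat, 2))   # kriging locations

Sigma_Y = exp_cov(cdist(s_obs, s_obs)) + sigma2 * np.eye(m)  # Sigma_X + Sigma_E
Sigma_xX = exp_cov(cdist(s_pred, s_obs))                     # cross-covariances r(s_hat_i, s_j)
Y = rng.multivariate_normal(np.zeros(m), Sigma_Y)            # synthetic observations

# Kriging predictor (3) and its variance; the dense solve costs O(m^3).
X_hat = Sigma_xX @ np.linalg.solve(Sigma_Y, Y)
V = exp_cov(cdist(s_pred, s_pred)) - Sigma_xX @ np.linalg.solve(Sigma_Y, Sigma_xX.T)
```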
If there are no simplifying assumptions on $\Sigma_X$, the computational cost of calculating the kriging predictor is $\mathcal{O}(\hat{m}m^2 + m^3)$, and the cost of calculating the variance is even higher. This means that the number of operations needed for the kriging prediction grows as the cube of the number of measurements, so these computations are not feasible for large data sets.

The methods described in Section 1 all make different approximations in order to reduce the computational cost of calculating the kriging predictor and its variance. These different approximations, and their impact on the computational cost, are described in more detail in Section 4; however, to get a general idea of how the computational efficiency can be increased, consider the kriging predictor for a model of the form (1). The field $\bar{X}$ can then be written as $\bar{X} = \bar{B}w \sim \mathsf{N}(0, \bar{B}\Sigma_w\bar{B}^\top)$, where column $i$ of the matrix $\bar{B}$ contains the basis function $\xi_i(\mathbf{s})$ evaluated at all measurement locations and all locations where the kriging prediction is to be calculated. Let $B = A\bar{B}$ and $\hat{B} = \hat{A}\bar{B}$ be the matrices containing the basis functions evaluated at the measurement locations and the kriging locations, respectively. The kriging predictor is then

$$\mathsf{E}(\hat{X} \mid Y, \gamma) = \hat{B}(\Sigma_w^{-1} + B^\top\Sigma_E^{-1}B)^{-1}B^\top\Sigma_E^{-1}Y. \qquad (4)$$

If the measurement noise is Gaussian white noise, $\Sigma_E$ is diagonal and easy to invert. If $\Sigma_w^{-1}$ is either known or easy to calculate, the most expensive calculation in (4) is to solve $u = (\Sigma_w^{-1} + B^\top\Sigma_E^{-1}B)^{-1}B^\top\Sigma_E^{-1}Y$. This is a linear system of $n$ equations, where $n$ is the number of basis functions used in the approximation. Thus, the easiest way of reducing the computational cost is to choose $n \ll m$, which is what is done in the convolution approach. Another approach is to ensure that $(\Sigma_w^{-1} + B^\top\Sigma_E^{-1}B)$ is a sparse matrix. Sparse matrix techniques can then be used to calculate the kriging predictor, and the computational cost can be reduced without reducing the number of basis functions in the approximation. If a wavelet basis is used, $B^\top\Sigma_E^{-1}B$ will be sparse, and in Section 3, it is shown that the precision matrix $Q_w = \Sigma_w^{-1}$ can also be chosen as a sparse matrix by using the Hilbert space approximation technique of Lindgren et al. (2011).
3 Markov approximations of Gaussian Matérn fields

In the remainder of this paper, the focus is on the family of Matérn covariance functions (Matérn, 1960) and the computational efficiency of some different techniques for approximating Gaussian Matérn fields. This section shows how wavelet bases can be used in the Hilbert space approximation technique of Lindgren et al. (2011) to obtain computationally efficient Matérn approximations.

3.1 The Matérn covariance family

Because of its versatility, the Matérn covariance family is the most popular choice for modeling spatial data (Stein, 1999). There are a few different parameterizations of the Matérn covariance function in the literature, and the one most suitable in our context is

$$r(\mathbf{h}) = \frac{2^{1-\nu}\phi}{(4\pi)^{d/2}\,\Gamma(\nu + d/2)\,\kappa^{2\nu}}\,(\kappa\|\mathbf{h}\|)^{\nu} K_{\nu}(\kappa\|\mathbf{h}\|), \qquad (5)$$

where $\nu$ is a shape parameter, $\kappa$ a scale parameter, $\phi$ a variance parameter, and $K_\nu$ is a modified Bessel function of the second kind of order $\nu > 0$. With this parametrization, the variance of a field with this covariance is $r(\mathbf{0}) = \phi\,\Gamma(\nu)(4\pi)^{-d/2}\Gamma(\nu + d/2)^{-1}\kappa^{-2\nu}$, and the associated spectral density is

$$S(\boldsymbol{\omega}) = \frac{\phi}{(2\pi)^d(\kappa^2 + \|\boldsymbol{\omega}\|^2)^{\nu + d/2}}. \qquad (6)$$

For the special case $\nu = 0.5$, the Matérn covariance function is the exponential covariance function. The smoothness of the field increases with $\nu$, and in the limit as $\nu \to \infty$, the covariance function is a Gaussian covariance function if $\kappa$ is also scaled accordingly, which gives an infinitely differentiable field.

3.2 Hilbert space approximations

As noted by Whittle (1963), a random process with the covariance (5) is a solution to the SPDE

$$(\kappa^2 - \Delta)^{\alpha/2} X(\mathbf{s}) = \phi^{1/2}\,\mathcal{W}(\mathbf{s}), \qquad (7)$$

where $\mathcal{W}(\mathbf{s})$ is Gaussian white noise, $\Delta$ is the Laplacian, and $\alpha = \nu + d/2$. The key idea in Lindgren et al. (2011) is to approximate the solution to the SPDE using a basis expansion of the form (1). The starting point of the approximation is to consider the stochastic weak formulation of the SPDE,

$$\{\langle b_i, (\kappa^2-\Delta)^{\alpha/2}X\rangle,\; i = 1,\ldots,n_b\} \stackrel{d}{=} \{\langle b_i, \phi^{1/2}\mathcal{W}\rangle,\; i = 1,\ldots,n_b\}. \qquad (8)$$

Here $\stackrel{d}{=}$ denotes equality in distribution, $\langle f, g\rangle = \int f(\mathbf{s})g(\mathbf{s})\,d\mathbf{s}$, and equality should hold for every finite set of test functions $\{b_i,\; i = 1,\ldots,n_b\}$ from some appropriate space. A finite element approximation of the solution $X$ is then obtained by representing it as a finite basis expansion of the form (1), where the stochastic weights are calculated by requiring (8) to hold for only a specific set of test functions $\{b_i,\; i = 1,\ldots,n\}$, and $\{\xi_i\}$ is a set of predetermined basis functions. We illustrate the more general results from Lindgren et al. (2011) with the special case $\alpha = 2$, where one uses $b_i = \xi_i$ and one then has

$$\langle \xi_i, (\kappa^2-\Delta)X\rangle = \sum_{j=1}^{n} w_j\,\langle \xi_i, (\kappa^2-\Delta)\xi_j\rangle. \qquad (9)$$

By introducing the matrix $K$ with elements $K_{i,j} = \langle \xi_i, (\kappa^2-\Delta)\xi_j\rangle$ and the vector $w = (w_1,\ldots,w_n)^\top$, the left hand side of (8) can be written as $Kw$. Since, by Lemma 1 in Lindgren et al. (2011),

$$\langle \xi_i, (\kappa^2-\Delta)\xi_j\rangle = \kappa^2\langle \xi_i, \xi_j\rangle - \langle \xi_i, \Delta\xi_j\rangle = \kappa^2\langle \xi_i, \xi_j\rangle + \langle \nabla\xi_i, \nabla\xi_j\rangle,$$

the matrix $K$ can be written as the sum $K = \kappa^2 C + G$, where $C_{i,j} = \langle \xi_i, \xi_j\rangle$ and $G_{i,j} = \langle \nabla\xi_i, \nabla\xi_j\rangle$. The right hand side of (8) can be shown to be Gaussian with mean zero and covariance $\phi C$, and one thus gets that $w \sim \mathsf{N}(0, \phi K^{-1}CK^{-1})$.

For the second fundamental case, $\alpha = 1$, Lindgren et al. (2011) show that $w \sim \mathsf{N}(0, \phi K^{-1})$, and for higher order $\alpha \in \mathbb{N}$, the weak solution is obtained recursively using these two fundamental cases. For example, if $\alpha = 4$, the solution to $(\kappa^2-\Delta)^2 X(\mathbf{s}) = \phi^{1/2}\mathcal{W}(\mathbf{s})$ is obtained by solving $(\kappa^2-\Delta)X(\mathbf{s}) = \tilde{X}(\mathbf{s})$, where $\tilde{X}$ is the solution for the case $\alpha = 2$. This results in a precision matrix for the weights, $Q_\alpha$, defined recursively as

$$Q_\alpha = KC^{-1}Q_{\alpha-2}C^{-1}K, \qquad \alpha = 3, 4, \ldots, \qquad (10)$$

where $Q_1 = \phi^{-1}K$ and $Q_2 = \phi^{-1}K^\top C^{-1}K$. Thus, all Matérn fields with $\nu + d/2 \in \mathbb{N}$ can be approximated through this procedure. For more details, see Lindgren and Rue (2007) and Lindgren et al. (2011).
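The recursion (10) translates directly into code. The sketch below builds $Q_\alpha$ for a one-dimensional piecewise linear basis on a uniform grid, assuming a mass-lumped (diagonal) $C$ so that the result stays sparse, and ignoring boundary corrections; the grid size and spacing are arbitrary defaults:

```python
import numpy as np
import scipy.sparse as sp

def precision_Q(alpha, kappa, phi, n=200, h=0.05):
    """Precision matrix Q_alpha of the weights w for a 1-D piecewise linear
    basis with n functions on a grid with spacing h, via the recursion (10).
    Uses a mass-lumped (diagonal) C; boundary effects are ignored."""
    C_inv = sp.diags(np.full(n, 1.0 / h)).tocsc()   # lumped C has diagonal elements h
    G = sp.diags([-np.ones(n - 1) / h,               # stiffness matrix <grad xi_i, grad xi_j>
                  2.0 * np.ones(n) / h,
                  -np.ones(n - 1) / h], offsets=[-1, 0, 1]).tocsc()
    K = kappa**2 * sp.diags(np.full(n, h)).tocsc() + G   # K = kappa^2 C + G
    # Base cases Q_1 = K/phi and Q_2 = K' C^{-1} K / phi, then step up by two.
    Q = K / phi if alpha % 2 == 1 else (K.T @ C_inv @ K) / phi
    for _ in range((alpha - 1) // 2):
        Q = K @ C_inv @ Q @ C_inv @ K
    return Q
```

For a Matérn field on $\mathbb{R}$ with $\nu = 1.5$, for instance, $\alpha = \nu + d/2 = 2$, so `precision_Q(2, kappa, phi)` gives the sparse precision matrix of the weights.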
The results from Rue and Tjelmeland (2002) show that accurate Markov approximations exist also for other $\nu$-values, and one approximate approach to finding explicit expressions for such models was given in the authors' response in Lindgren et al. (2011). However, in many practical applications $\nu$ cannot be estimated reliably (Zhang, 2004), and using only a discrete set of $\nu$-values is not necessarily a significant restriction.

3.3 Wavelet basis functions

In the previous section, nothing was said about how the basis functions $\{\xi_i\}$ should be chosen. The following sections, however, show that wavelet bases have many desirable properties which make them suitable for use in the Hilbert space approximations on $\mathbb{R}^d$. In this section, a brief introduction to multiresolution analysis and wavelets is given.

A multiresolution analysis on $\mathbb{R}$ is a sequence of closed approximation subspaces $\{V_j\}_{j\in\mathbb{Z}}$ of functions in $L^2(\mathbb{R})$ such that $V_j \subset V_{j+1}$, $\mathrm{cl}\,\bigcup_{j\in\mathbb{Z}} V_j = L^2(\mathbb{R})$, and $\bigcap_{j\in\mathbb{Z}} V_j = \{0\}$, where $\mathrm{cl}$ denotes the closure, and $f(s) \in V_j$ if and only if $f(2^{-j}s) \in V_0$. This last requirement is the multiresolution requirement, because it implies that all the approximation spaces $V_j$ are scaled versions of the space $V_0$. A multiresolution analysis is generated starting from a function usually called a father function or a scaling function. The function $\varphi \in L^2(\mathbb{R})$ is called a scaling function for $\{V_j\}_{j\in\mathbb{Z}}$ if it satisfies the two-scale relation

$$\varphi(s) = \sum_{k\in\mathbb{Z}} p_k\,\varphi(2s - k) \qquad (11)$$

for some square-summable sequence $\{p_k\}_{k\in\mathbb{Z}}$, and the translates $\{\varphi(s-k)\}_{k\in\mathbb{Z}}$ form an orthonormal basis for $V_0$. Given the multiresolution analysis $\{V_j\}_{j\in\mathbb{Z}}$, the wavelet spaces $\{W_j\}_{j\in\mathbb{Z}}$ are then defined as the orthogonal complements of $V_j$ in $V_{j+1}$ for each $j$, and one can show that $W_j$ is the span of $\{\psi(2^j s - k)\}_{k\in\mathbb{Z}}$, where the wavelet $\psi$ is defined as $\psi(s) = \sum_{k\in\mathbb{Z}} (-1)^k p_{1-k}\,\varphi(2s - k)$. Given the spaces $W_j$, $V_j$ can be decomposed as the direct sum

$$V_j = V_0 \oplus W_0 \oplus W_1 \oplus \cdots \oplus W_{j-1}. \qquad (12)$$

Several choices of scaling functions have been presented in the literature. Among the most widely used constructions are the B-spline wavelets (Chui and Wang, 1992) and the Daubechies wavelets (Daubechies, 1992), which both have several desirable properties for our purposes.

The scaling functions of the B-spline wavelets are $m$:th order B-splines with knots at the integers. Because of this, there exist closed form expressions for the corresponding wavelets, and the wavelets have compact support, since the $m$:th order scaling function has support on $(0, m+1)$. The wavelets are orthogonal at different scales, but translates at the same scale are not orthogonal. This property is usually referred to as semi-orthogonality.

The Daubechies wavelets form a hierarchy of compactly supported orthogonal wavelets that are constructed to have the highest number of vanishing moments for a given support width. This generates a family of wavelets with an increasing degree of smoothness. Except for the first Daubechies wavelet, there are no closed form expressions for these wavelets; however, for practical purposes, this is not a problem, because the exact values of the wavelets at dyadic points can be obtained very fast using the cascade algorithm (Burrus et al., 1988), as sketched below. In this work, the DB3 wavelet is used because it is the first wavelet in the family that has one continuous derivative. The DB3 wavelet and its scaling function are shown in Figure 1.
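The cascade algorithm is available in standard wavelet libraries; for example, assuming the PyWavelets package (`pywt`), the DB3 scaling function and wavelet of Figure 1 can be tabulated at dyadic points as follows:

```python
import pywt

# Tabulate the DB3 scaling function phi and wavelet psi at dyadic points
# with the cascade algorithm; `level` sets the dyadic resolution 2^(-level).
phi, psi, x = pywt.Wavelet('db3').wavefun(level=8)
```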
3.4 Computing the C and G matrices

To use the Hilbert space approximation for a given basis, the precision matrix $Q_\alpha$ of the weights has to be calculated. By (10), we only have to be able to calculate the matrices $C$ and $G$ to build the precision matrix for any $\alpha \in \mathbb{N}$. The elements of these matrices are inner products between the basis functions:

$$C_{i,j} = \int \xi_i(\mathbf{s})\xi_j(\mathbf{s})\,d\mathbf{s}, \qquad G_{i,j} = \int (\nabla\xi_i(\mathbf{s}))^{\top}\nabla\xi_j(\mathbf{s})\,d\mathbf{s}. \qquad (13)$$

This section shows how these elements can be calculated for the DB3 wavelets and the B-spline wavelets. When using a wavelet basis in practice, one always has to choose a finest scale, $J$, to work with.
[Figure 1: The DB3 scaling function and wavelet.]

Given that the subspace $V_J$ is used as an approximation of $L^2(\mathbb{R})$, one can use two different bases. Either one works with the direct basis for $V_J$, which consists of scaled and translated versions of the father function $\varphi(s)$, or one uses the multiresolution decomposition (12). In what follows, both these cases are considered.

Daubechies wavelets on R

For the Daubechies wavelets, the matrix $C$ is the identity matrix, since these wavelets form an orthonormal basis for $L^2(\mathbb{R})$. Thus, only the matrix $G$ has to be calculated. If the direct basis for $V_J$ is used, $G$ contains inner products of the form

$$\langle \nabla\varphi(2^J s - k), \nabla\varphi(2^J s - l)\rangle = 2^J\,\langle \nabla\varphi(s), \nabla\varphi(s - l + k)\rangle \equiv 2^J\,\Lambda(k - l). \qquad (14)$$

Because the scaling function has compact support on $[0, 2N-1]$, these inner products are only non-zero if $k - l \in [-(2N-2), 2N-2]$. Thus, the matrix $G$ is sparse, which implies that the weights $w$ in (1) form a GMRF. Since there are no closed form expressions for the Daubechies wavelets, there is no hope of finding a closed form expression for the non-zero inner products (14). Furthermore, standard numerical quadrature for calculating the inner products is too inaccurate due to the highly oscillating nature of the gradients. However, utilizing properties of the wavelets, one can calculate an approximation of the inner product to arbitrary precision by solving a system of linear equations. It is outside the scope of this paper to present the full method, but the basic principle is to construct a system of linear equations by using the scaling and moment equations for the wavelets. This system is then solved using, for example, LU factorization. For details, see Latto et al. (1991). Using this technique for the DB3 wavelets, the following nonzero values of $\Lambda(\eta)$ are obtained:

$$\Lambda(0) = 5.{\ldots}, \quad \Lambda(\pm 1) = -{\ldots}, \quad \Lambda(\pm 2) = 0.{\ldots}, \quad \Lambda(\pm 3) = -{\ldots}, \quad \Lambda(\pm 4) = -{\ldots}.$$

These values are calculated once and tabulated for constructing the $G$ matrix, which is a band matrix with $2^J\Lambda(0)$ on the main diagonal, $2^J\Lambda(\pm 1)$ on the first off-diagonals, et cetera; a sketch of this construction is given below.

If the multiresolution decomposition (12) is used as a basis for $V_J$, one also needs the inner products $\langle \nabla\psi(2^j s - k), \nabla\psi(2^i s - l)\rangle$, $i, j \in \mathbb{Z}$. Because of the two-scale relation (11), every wavelet $\psi(2^j s - k)$ can be written as a finite sum of the scaling function at scale $J$. Using this property, the $G$ matrix can be constructed efficiently using only the already computed $\Lambda$ values.
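As an illustration, the following sketch assembles the banded $G$ matrix for the direct DB3 basis from tabulated connection coefficients; the numerical values of $\Lambda$ below are placeholders only (the exact values, partly lost in this version of the text, are computed once via the linear system of Latto et al. (1991)):

```python
import numpy as np
import scipy.sparse as sp

# Placeholder connection coefficients Lambda(eta) for DB3 -- illustrative
# values only, NOT the tabulated ones from Latto et al. (1991).
Lam = {0: 5.0, 1: -3.0, 2: 0.9, 3: -0.1, 4: -0.01}

def G_direct_db3(J, n):
    """Band matrix G for the direct DB3 basis at scale J: the entry on
    off-diagonal k - l is 2^J * Lambda(k - l), nonzero only for |k - l| <= 4."""
    offsets = list(range(-4, 5))
    diags = [2.0**J * Lam[abs(off)] * np.ones(n - abs(off)) for off in offsets]
    return sp.diags(diags, offsets=offsets).tocsc()
```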
[Figure 2: The non-zero elements in the G matrices for a multiresolution DB3 basis with five layers of wavelets (nz = 55798) and the corresponding direct basis (nz = 8368).]

Figure 2 shows the structure of the $G$ matrices for a multiresolution DB3 basis with five layers of wavelets and the corresponding direct basis. Note that there are far fewer non-zero elements in the precision matrix for the direct basis. Hence, it is more computationally efficient to use the direct basis instead of the multiresolution basis.

B-spline wavelets on R

For the B-spline wavelets, the matrices $C$ and $G$ can be calculated directly from the closed form expressions for the basis functions and their derivatives. When a direct basis is used on $\mathbb{R}$, $C$ is a band matrix with bandwidth $m+1$ if the $m$:th order spline wavelet is used. For example, for $m = 1$, calculating (13) gives

$$C_{i,j} = 2^{-J}\cdot\begin{cases} 2/3, & i = j,\\ 1/6, & |i-j| = 1,\\ 0, & \text{otherwise},\end{cases} \qquad G_{i,j} = 2^{J}\cdot\begin{cases} 2, & i = j,\\ -1, & |i-j| = 1,\\ 0, & \text{otherwise}.\end{cases}$$

Since the expression for the precision matrix of the weights $w$ contains the inverse of $C$, it is a dense matrix. Hence, $C^{-1}$ has to be approximated with a sparse matrix if $Q$ is to be sparse. This issue is addressed in Lindgren et al. (2011) by lowering the integration order of $\langle\xi_i,\xi_j\rangle$, which results in an approximate, diagonal $C$ matrix, $\tilde{C}$, with diagonal elements $\tilde{C}_{ii} = \sum_{k=1}^{n} C_{ik}$. In Section 4, the effect of this approximation on the covariance approximation for the basis expansion is studied in some detail. For the multiresolution basis, the matrices are block diagonal, and this approximation is not applicable.

Wavelet bases on R^d

The easiest way of constructing a wavelet basis for $L^2(\mathbb{R}^d)$ is to use the tensor product functions generated by $d$ one-dimensional wavelet bases. If $\varphi$ is the scaling function for a multiresolution analysis on $\mathbb{R}$, the father function can be written as $\bar\varphi(x_1,\ldots,x_d) = \prod_{i=1}^{d}\varphi(x_i)$. The scalar product $\langle\nabla\bar\varphi(\mathbf{x}), \nabla\bar\varphi(\mathbf{x}+\boldsymbol\eta)\rangle$, where $\boldsymbol\eta$ now is a multi-integer shift in $d$ dimensions, can then be written as

$$\langle\nabla\bar\varphi(\mathbf{x}), \nabla\bar\varphi(\mathbf{x}+\boldsymbol\eta)\rangle = \sum_{i=1}^{d}\int_{\mathbb{R}^d} \frac{\partial\varphi(x_i)}{\partial x_i}\,\frac{\partial\varphi(x_i+\eta_i)}{\partial x_i}\prod_{j\neq i}\varphi(x_j)\varphi(x_j+\eta_j)\,d\mathbf{x} = \sum_{i=1}^{d}\Lambda(\eta_i)\prod_{j\neq i}\int_{\mathbb{R}}\varphi(x_j)\varphi(x_j+\eta_j)\,dx_j.$$

This expression looks rather complicated, but it implies a very simple Kronecker structure for $G_d$, the $G$ matrix in $\mathbb{R}^d$. For example, in $\mathbb{R}^2$ and $\mathbb{R}^3$,

$$G_2 = G\otimes C + C\otimes G, \qquad G_3 = G\otimes C\otimes C + C\otimes G\otimes C + C\otimes C\otimes G,$$

where $G$ and $C$ are the $G$ and $C$ matrices for the corresponding one-dimensional basis and $\otimes$ denotes the Kronecker product. Similarly, $C_2 = C\otimes C$ and $C_3 = C\otimes C\otimes C$. These expressions hold both if the direct basis for $V_J$ is used and if the multiresolution construction (12) is used for the one-dimensional spaces. For Daubechies wavelets, the $C$ matrix is the identity matrix for all $d \geq 1$. This also holds for the direct B-spline basis if the diagonal approximation is used for $C$.
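These Kronecker formulas are straightforward to apply; a sketch, assuming the one-dimensional $C$ and $G$ are already available as sparse matrices (for instance from the snippets above):

```python
import scipy.sparse as sp

def C_G_2d(C1, G1):
    """Mass and stiffness matrices for the tensor-product basis on R^2:
    C2 = C (x) C and G2 = G (x) C + C (x) G, with (x) the Kronecker product."""
    C2 = sp.kron(C1, C1).tocsc()
    G2 = (sp.kron(G1, C1) + sp.kron(C1, G1)).tocsc()
    return C2, G2
```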
4 Comparison of the methods

As discussed in Section 2, computational efficiency is often an important aspect in practical applications. However, the computation time for obtaining, for example, an approximate kriging prediction is in itself not that interesting unless one also knows how accurate the prediction is. In this section, we therefore compare the wavelet Markov approximations with two other popular methods, covariance tapering and process convolutions, with respect to their accuracy and computational efficiency when used for kriging.

Before the comparison, we give a brief introduction to the process convolution method and the covariance tapering method and discuss the methods' computational properties. As mentioned in Section 2, the computational cost for the kriging prediction for a single location based on $m$ observations is $\mathcal{O}(m^3)$. In what follows, the corresponding computational costs for the three different approximation methods are presented. We start with the wavelet Markov approximations and then look at the process convolutions and the covariance tapering method. After this, an initial comparison of the different wavelet approximations is performed in Section 4.4, and the full kriging comparison is presented in Sections 4.5-4.6.

4.1 Wavelet Markov approximations

When using a wavelet basis, one can either work with the direct basis for the approximation space $V_J$ or do the wavelet decomposition into the direct sum of $V_0$ and the wavelet spaces $W_0, \ldots, W_{J-1}$. If one is only interested in the approximation error, the decomposition into wavelet spaces is not necessary, and it is more efficient to work in the direct basis for $V_J$, since this results in a precision matrix with fewer nonzero elements. We therefore only use the direct bases for $V_J$ in the comparisons in this section.

The wavelet approximations are of the form (1), so Equation (4) is used to calculate the kriging predictor. However, since an explicit expression for the precision matrix of the weights $w$ exists for this method, we rewrite the equation as

$$\mathsf{E}(\hat{X}\mid Y,\gamma) = \hat{B}(Q_w + B^\top Q_E B)^{-1}B^\top Q_E Y,$$

where $Q_E = \Sigma_E^{-1}$ is diagonal if $E$ is Gaussian white noise. If the number of kriging locations is small, the computationally demanding step is again to solve a system of the form $u = (Q_w + B^\top Q_E B)^{-1}v$; a code sketch is given below. Now, if the Daubechies wavelets or the Markov approximated spline wavelets are used, both $Q_w$ and $B^\top Q_E B$ are sparse and positive definite matrices. The system is therefore most efficiently solved using Cholesky factorization, forward substitution, and back substitution. The forward and back substitutions are much faster than calculating the Cholesky triangle $L$, so the computational complexity of the kriging predictor is determined by the calculation of $L$. Because of the sparsity structure, this computational cost is in general $\mathcal{O}(n)$, $\mathcal{O}(n^{3/2})$, and $\mathcal{O}(n^2)$ for problems in one, two, and three dimensions, respectively (see Rue and Held, 2005). If the spline bases are used without the Markov approximation, the computational cost instead is $\mathcal{O}(n^3)$, since $Q_w$ then is dense. It should be noted here that any basis could be used in the SPDE approximation, but in order to get good computational properties, we need both $Q_w$ and $B^\top Q_E B$ to be sparse. This is the reason why, for example, Fourier bases are not appropriate in the SPDE formulation, since $B$ in that case always is a dense matrix.
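The following sketch solves the rewritten system with sparse matrices. An optimized implementation would use a sparse Cholesky factorization; since SciPy has no built-in sparse Cholesky, a sparse LU factorization stands in here, and all inputs are assumed given (for instance $Q_w$ from the recursion sketched in Section 3):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def kriging_markov(Q_w, B, B_hat, Y, sigma2):
    """Kriging predictor E(X_hat | Y) = B_hat (Q_w + B' Q_E B)^{-1} B' Q_E Y
    for Gaussian white measurement noise with variance sigma2.
    Q_w, B, B_hat are assumed to be scipy.sparse matrices."""
    Q_E = sp.eye(B.shape[0]) / sigma2           # diagonal noise precision
    A = (Q_w + B.T @ Q_E @ B).tocsc()           # sparse, positive definite
    u = splu(A).solve(B.T @ (Q_E @ Y))          # LU standing in for Cholesky
    return B_hat @ u
```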
4.2 Process convolutions

In the process convolution method, the Gaussian random field $X(\mathbf{s})$ on $\mathbb{R}^d$ is specified as a process convolution

$$X(\mathbf{s}) = \int k(\mathbf{s}, \mathbf{u})\,B(d\mathbf{u}), \qquad (15)$$

where $k$ is some deterministic kernel function and $B$ is a Brownian sheet. One of the advantages of this construction is that nonstationary fields are easily constructed by allowing the convolution kernel to depend on location. If, however, the process is stationary, we have $k(\mathbf{s}, \mathbf{u}) = k(\mathbf{s} - \mathbf{u})$, and the covariance function of $X$ is $r(\boldsymbol\tau) = \int k(\mathbf{u} - \boldsymbol\tau)k(\mathbf{u})\,d\mathbf{u}$. Thus, the covariance function and the kernel $k$ are related through

$$k = \mathcal{F}^{-1}\left((2\pi)^{d/2}\sqrt{\mathcal{F}(r)}\right) = \mathcal{F}^{-1}\left((2\pi)^{d/2}\sqrt{S}\right),$$

where $S$ is the spectral density of $X(\mathbf{s})$ and $\mathcal{F}$ denotes the Fourier transform (Higdon, 2001). Since the spectral density for a Matérn covariance function in dimension $d$ with parameters $\nu$, $\phi$, and $\kappa$ is given by (6), one finds that the corresponding kernel is a Matérn covariance function with shape parameter $\nu_k = \nu/2 - d/4$ and scale parameter $\kappa_k = \kappa$, with a variance parameter $\phi_k$ determined by $\phi$.

An approximation of (15) which is commonly used in convolution based modeling is

$$X(\mathbf{s}) \approx \sum_{j=1}^{n} k(\mathbf{s} - \mathbf{u}_j)\,w_j,$$

where $\mathbf{u}_1,\ldots,\mathbf{u}_n$ are some fixed locations in the domain and $w_j$ are independent zero mean Gaussian variables with variances equal to the area associated with each $\mathbf{u}_j$. Thus, this approximation is of the form (1), with basis functions $\xi_j(\mathbf{s}) = k(\mathbf{s} - \mathbf{u}_j)$. When this approximation is used, Equation (4) is used to calculate the kriging predictor. Because the basis functions in the expansion are Matérn covariance functions, the matrices $B$ and $\hat{B}$ are dense. Hence, although $\Sigma_E$ and $\Sigma_w^{-1}$ are diagonal matrices, one still has to solve a system of the form $u = (\Sigma_w^{-1} + B^\top\Sigma_E^{-1}B)^{-1}v$, where $(\Sigma_w^{-1} + B^\top\Sigma_E^{-1}B)$ is a dense $n$ by $n$ matrix. The computational cost for both constructing and inverting the matrix is $\mathcal{O}(mn^2 + n^3)$, where $n$ is the number of basis functions used in the basis expansion. For kriging prediction at $\hat{m}$ locations, the total computational complexity is $\mathcal{O}(\hat{m}n + mn^2 + n^3)$.

4.3 Covariance tapering

Covariance tapering is not a method for constructing covariance models, but a method for approximating a given covariance model to increase the computational efficiency. The idea is simply to taper the true covariance, $r(\boldsymbol\tau)$, to zero beyond a certain range, $\theta$, by multiplying the covariance function with some compactly supported positive definite taper function $r_\theta(\boldsymbol\tau)$. Using the tapered covariance, $r_{\mathrm{tap}}(\boldsymbol\tau) = r_\theta(\boldsymbol\tau)r(\boldsymbol\tau)$, the matrix $\Sigma_Y$ in the expression for the kriging predictor (3) is sparse, which facilitates the use of sparse matrix techniques that increase the computational efficiency. The taper function should, of course, also be chosen such that the basic shape of the true covariance function is preserved, and of special importance for asymptotic considerations is that the smoothness at the origin is preserved.

Furrer et al. (2006) studied the accuracy and numerical efficiency of tapered Matérn covariance functions, and to be able to compare their results with the Matérn approximations obtained by the wavelet Hilbert space approximations and the process convolution method, we use their choice of taper functions:

$$\text{Wendland}_1:\quad r_\theta(\boldsymbol\tau) = \left(\max\left[1 - \frac{\|\boldsymbol\tau\|}{\theta},\, 0\right]\right)^4\left(1 + 4\frac{\|\boldsymbol\tau\|}{\theta}\right),$$

$$\text{Wendland}_2:\quad r_\theta(\boldsymbol\tau) = \left(\max\left[1 - \frac{\|\boldsymbol\tau\|}{\theta},\, 0\right]\right)^6\left(1 + 6\frac{\|\boldsymbol\tau\|}{\theta} + \frac{35\|\boldsymbol\tau\|^2}{3\theta^2}\right).$$

These taper functions were first introduced by Wendland (1995). For dimension $d \leq 3$, the Wendland$_1$ function is a valid taper function for the Matérn covariance function if $\nu \leq 1.5$, and the Wendland$_2$ function is a valid taper function if $\nu \leq 2.5$. Furrer et al. (2006) found that Wendland$_1$ was slightly better than Wendland$_2$ for a given $\nu$, so we use Wendland$_1$ for all cases when $\nu \leq 1.5$ and Wendland$_2$ when $1.5 < \nu \leq 2.5$.
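A direct transcription of these taper functions (a sketch; the exponents follow the standard Wendland forms used by Furrer et al. (2006)):

```python
import numpy as np

def wendland1(h, theta):
    """Wendland_1 taper: (1 - h/theta)_+^4 (1 + 4 h/theta)."""
    t = np.maximum(1.0 - h / theta, 0.0)
    return t**4 * (1.0 + 4.0 * h / theta)

def wendland2(h, theta):
    """Wendland_2 taper: (1 - h/theta)_+^6 (1 + 6 h/theta + 35 h^2/(3 theta^2))."""
    t = np.maximum(1.0 - h / theta, 0.0)
    return t**6 * (1.0 + 6.0 * h / theta + 35.0 * h**2 / (3.0 * theta**2))

# The tapered covariance is the elementwise product with the true covariance,
# e.g. r_tap = wendland1(h, theta) * r(h) for a matrix of distances h.
```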
If a tapered Matérn covariance is used, the kriging predictor can be written as

$$\mathsf{E}(\hat{X}\mid Y,\gamma) = \Sigma^{\mathrm{tap}}_{\hat{X}X}\left(\Sigma^{\mathrm{tap}}_{X} + \Sigma_E\right)^{-1}Y,$$

where the elements on row $i$ and column $j$ of $\Sigma^{\mathrm{tap}}_{\hat{X}X}$ and $\Sigma^{\mathrm{tap}}_{X}$ are given by $r_{\mathrm{tap}}(\hat{\mathbf{s}}_i,\mathbf{s}_j)$ and $r_{\mathrm{tap}}(\mathbf{s}_i,\mathbf{s}_j)$, respectively. Since the tapered covariance is zero for lags larger than the taper range $\theta$, many of the elements in $\Sigma^{\mathrm{tap}}_{X}$ will be zero. Thus, the three step approach used for the wavelet Markov approximations can be used to solve the system $u = (\Sigma^{\mathrm{tap}}_{X} + \Sigma_E)^{-1}Y$ efficiently. Since the number of non-zero elements in row $i$ of $\Sigma^{\mathrm{tap}}_{X}$ is determined by the number of measurement locations at a distance smaller than $\theta$ from location $\mathbf{s}_i$, the computational cost is determined both by the taper range and by the spacing of the observations. Thus, if the measurements are irregularly spaced, it is hard to get a precise estimate of the computational cost. However, for given measurement locations, the taper range can be chosen such that the average number of neighbors of the measurement locations is some fixed number $k_\theta$. The cost of the Cholesky factorization is then similar to the cost for a GMRF with $m$ nodes and a neighborhood size $k_\theta$.

4.4 Covariance approximation

For practical applications of any of the approximation methods discussed here, one is often mostly interested in producing kriging predictions which are close to the optimal predictions. The error one makes in the kriging prediction is closely related to the method's ability to reproduce the true Matérn covariance function. There are many different wavelet bases one could consider using in the Markov approximation method, and before we consider the kriging problem, we compare in this section some of these bases with respect to their ability to reproduce the Matérn covariance function, so that we can choose only a few of the best methods for the comparison in the next section. As a reference, we also include the process convolution approximation in this comparison.

A natural measure of the error in the covariance approximation is a standardized $L^2$ norm of the difference between the true and approximate covariance functions,

$$\epsilon_r(\mathbf{s}) = \frac{\int \left(r(\mathbf{s},\mathbf{u}) - \hat{r}(\mathbf{s},\mathbf{u})\right)^2 d\mathbf{u}}{\int r(\mathbf{s},\mathbf{u})^2\,d\mathbf{u}}. \qquad (16)$$

Note here that the true covariance function $r(\mathbf{s},\mathbf{u})$ is stationary and isotropic, while the approximate covariance function $\hat{r}(\mathbf{s},\mathbf{u})$ for the basis expansion (1) generally is not. For the wavelet approximations and the process convolutions, $\epsilon_r$ is periodic in $\mathbf{s}$, since the approximation error in general is smaller where the basis functions are centered, and we therefore use the mean value of $\epsilon_r(\mathbf{s})$ over the studied region as the measure of the covariance error; a numerical sketch of this measure is given below.
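The error measure (16) can be approximated numerically, for instance by Monte Carlo integration. A sketch, assuming callables `r_true(s, u)` and `r_approx(s, u)` that are vectorized over the rows of `u` (the volume factors of the two integrals cancel in the ratio):

```python
import numpy as np

def covariance_error(r_true, r_approx, s, lo, hi, n_mc=5000, seed=0):
    """Monte Carlo estimate of (16): the standardized squared L2 distance
    between true and approximate covariances at location s, integrating u
    uniformly over the square [lo, hi] x [lo, hi]."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(lo, hi, size=(n_mc, 2))   # integration points
    num = np.mean((r_true(s, u) - r_approx(s, u)) ** 2)
    den = np.mean(r_true(s, u) ** 2)
    return num / den
```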
We use the different methods to approximate the covariance function for a Matérn field on a square region in $\mathbb{R}^2$. The computational complexity of the kriging predictions depends on the number of basis functions, $n$, used in the approximations. For the Markov approximated spline bases and the Daubechies 3 basis, this complexity is $\mathcal{O}(n^{3/2})$, whereas it is $\mathcal{O}(n^3)$ for the spline bases without the Markov approximation and for the process convolution method. We therefore use more basis functions for the $\mathcal{O}(n^{3/2})$ methods than for the other methods, so that the covariance errors of the methods are compared at approximately equal computational cost.

Figure 3 shows the covariance error for the different methods as a function of the approximate range, $\sqrt{8\nu}/\kappa$, of the true covariance function for three different values of $\nu$. There are several things to note in this figure:

1. The covariance error decreases for all methods as the range of the true covariance function increases. This is not surprising, since the error will be small if the distance between the basis functions (which is kept fixed) is small compared with the true range.
2. The solid lines correspond to Markov approximations, which have computational complexity $\mathcal{O}(n^{3/2})$ for calculating the kriging predictor, and the approximations with computational complexity $\mathcal{O}(n^3)$ have dashed lines in the figure.
3. There is no convolution kernel estimate for $\nu = 1$, since the convolution kernel has a singularity at zero in this case. For the other cases, the locations $\{\mathbf{u}_j\}$ of the kernel basis functions were placed on a regular lattice in the region.
4. The error one makes with the Markov approximation of the spline bases becomes larger with increasing order of the splines. Note that the third order spline basis is best without the approximation, whereas the first order spline basis is best if the Markov approximation is used.

It is clear from the figure that the Markov approximations have a much lower covariance error for the same computational complexity. Among these, the Daubechies 3 basis is best for large ranges, whereas the Markov approximated first order spline basis is best for short ranges. The higher order spline bases have larger covariance errors, so from now on we focus on the first order spline basis and the Daubechies 3 basis.
[Figure 3: Numerical approximations of the L² norm (16) as a function of approximate range for ν = 1, 2, 3 and different bases in R² (Daubechies 3, first to third order splines with and without the Markov approximation, and the process convolution). More basis functions are used for the bases with Markov structure (solid lines) than for the other bases (dashed lines), which gives approximately the same computational complexity for kriging prediction.]

4.5 Kriging comparison

In the previous section, several bases were compared with respect to their ability to approximate the true covariance function when used in an approximation of the form (1) of a Gaussian Matérn field. The comparison showed that the Daubechies 3 (DB3) basis and the Markov approximated linear spline (S1) basis are the most accurate for a given computational complexity. In this section, the spatial prediction errors for these two wavelet Markov approximations are compared with the process convolution method and the covariance tapering method. In the comparisons, note that the S1 basis is essentially the same type of piecewise linear basis as used in Lindgren et al. (2011), so the results here also apply to that paper.
Simulation setup
Let $X(\mathbf{s})$ be a Matérn field with shape parameter $\nu$ (chosen later as 1, 2, or 3) and approximate correlation range $r$ (varied later over a grid of values). The range $r$ determines $\kappa$ through the relation $\kappa = \sqrt{8\nu}\,r^{-1}$, and the variance parameter $\phi = 4\pi\,\Gamma(\nu+1)\,\kappa^{2\nu}\,\Gamma(\nu)^{-1}$ is chosen such that the variance of $X(\mathbf{s})$ is 1; this parameter mapping is sketched in code below. We measure $X$ at measurement locations chosen at random from a uniform distribution on a square in $\mathbb{R}^2$, using the measurement equation (2), where $E(\mathbf{s})$ is Gaussian white noise, uncorrelated with $X$, with a fixed standard deviation.

Given these measurements, spatial prediction of $X$ to all locations on a lattice of equally spaced points in the square is performed using the optimal kriging predictor, the wavelet Markov approximations, the process convolution method, and the covariance tapering method. For each approximate method, the sum of squared differences between the optimal kriging prediction and the approximate method's kriging prediction is used as the measure of kriging error.

We compare the methods for $\nu = 1, 2, 3$, and for each $\nu$ we test a set of equally spaced ranges. For a given $\nu$ and a given range, a number of data sets are simulated, and the average kriging error is calculated for each method based on these data sets.
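The mapping from $(\nu, r)$ to the SPDE parameters in the simulation setup can be written as the following small helper (a sketch for $d = 2$, using the range definition $r = \sqrt{8\nu}/\kappa$):

```python
import numpy as np
from scipy.special import gamma

def matern_params(nu, r):
    """kappa and phi for a unit-variance Matern field in d = 2:
    kappa = sqrt(8 nu) / r and phi = 4 pi Gamma(nu+1) kappa^(2 nu) / Gamma(nu)."""
    kappa = np.sqrt(8.0 * nu) / r
    phi = 4.0 * np.pi * gamma(nu + 1.0) * kappa ** (2.0 * nu) / gamma(nu)
    return kappa, phi
```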
Choosing the number of basis functions

To obtain a fair comparison between the different methods, the number of basis functions for each method should be chosen such that the computation time for the kriging prediction is equal for the different methods. The computations needed for calculating the prediction can be divided into three main steps as follows.

Step 1. Build all matrices necessary to calculate the kriging predictor, except $M$ in Step 3.

Step 2. Solve the matrix inverse problem for the given method:

S1, DB3 and process convolution: $u = (\Sigma_w^{-1} + B^\top\Sigma_E^{-1}B)^{-1}B^\top\Sigma_E^{-1}Y$,
Tapering: $u = (\Sigma^{\mathrm{tap}}_{X} + \Sigma_E)^{-1}Y$,
Optimal kriging: $u = (\Sigma_X + \Sigma_E)^{-1}Y$.

Step 3. Depending on which method is used, build $M = \hat{B}$, $M = \Sigma^{\mathrm{tap}}_{\hat{X}X}$, or $M = \Sigma_{\hat{X}X}$, and calculate the kriging predictor $\hat{X} = Mu$.

For the optimal kriging predictor, and in some cases for the tapering method, the matrix $M$ cannot be calculated and stored at once due to memory constraints if the number of measurements is large. Each element of $\hat{X}$ is then constructed separately as $\hat{X}_i = M_i u$, where $M_i$ is a row of $M$. It is then natural to include the time it takes to build the rows of $M$ in the time it takes to calculate $\hat{X}$, which is the reason for including the time it takes to build $M$ in Step 3 instead of Step 1.

The computation time for the first step is very dependent on the actual implementation, and we therefore focus on the computation time for the last two steps when choosing the number of basis functions. If one only does kriging prediction at a few locations, the second step dominates the computation time, whereas the third step can dominate if kriging is done at many locations. To get results that are easier to interpret, we choose the number of basis functions such that the time for the matrix inverse problem in Step 2 is similar for the different methods.

Now, since the computational complexity of Step 2 is $\mathcal{O}(n^3)$ for the convolution method and $\mathcal{O}(n^{3/2})$ for the Markov methods, one might think that if $n$ basis functions are used in the convolution method and $n^2$ basis functions are used for the Markov methods, the computation times would be equal. Unfortunately, it is not that simple. If two different methods have the same computational complexity, their computation times scale in the same way when $n$ is increased; however, the actual computation time for a fixed $n$ can be quite different for the two methods. For example, DB3 is approximately 6 times more computationally demanding than S1 for the same number of basis functions. The reason is that the DB3 basis functions have larger support than the S1 basis functions, which causes the matrices $B$ and $\Sigma_w^{-1}$ for DB3 to contain several times as many nonzero elements as for S1 for the same number of basis functions. The relative computation time, however, stays roughly constant as $n$ is increased for both methods.
[Figure 4: Kriging errors for the different methods as functions of the true covariance function's range, for ν = 1, 2, 3. For each range, the values are calculated as the mean over the simulated data sets, and the bands around the curves show the estimate plus and minus the standard deviation of the samples.]

To get approximately the same computation time for Step 2 for the different approximation methods, the number of basis functions for S1 is fixed, and since DB3 is approximately six times more computationally demanding, the number of basis functions for that method is set correspondingly. As mentioned in Lindgren et al. (2011), one should extend the area somewhat to avoid boundary effects from the SPDE formulation used in the Markov methods. We therefore expand the area by two times the range in each direction, which results in a slightly higher number of basis functions used in the computations.

The computation time for S1 and DB3 increases if $\nu$ is increased, since the precision matrix of the weights contains more nonzero elements for larger values of $\nu$. The number of basis functions for the convolution method, placed on a regular lattice in the kriging area, is therefore chosen separately for $\nu = 2$ and $\nu = 3$. For the tapering method, we chose the tapering range $\theta$ such that the expected number of measurements within a circle of radius $\theta$ around each kriging location is similar to the number of neighbors of the weights in the S1 method. For $\nu = 1$, $\nu = 2$, and $\nu = 3$, this gives tapering ranges that result in approximately the same number of nonzero elements in the tapered covariance matrix as in the precision matrix $Q$ for the S1 basis.

[Figure 5: An example of an optimal kriging prediction and predictions using the S1 basis, the convolution basis, and a tapered covariance when ν = 2. The predictions are calculated on a regular grid in the square. The number of basis functions and the tapering range are chosen such that the total time for Step 2 and Step 3 is approximately equal for the different methods.]
Results

The average kriging errors for the different methods are shown in Figure 4 as functions of the true covariance function's approximate range $r$. The value for a given $\nu$ and $r$ is an average over the simulated data sets. The convolution kernels are singular if $\nu = 1$, so there is no convolution estimate for this case. The tapering estimate is best for short ranges, which is not surprising, since the covariance matrix of the measurements is not changed much by the tapering if the true range is shorter than the tapering range. For larger ranges, however, the tapering method has a larger error than the other methods. One reason for this is that the tapered covariance function is very different from the true covariance function if the true range is much larger than the tapering range. Another reason is that the prediction at all locations that do not have any measurements closer than the tapering range is zero in the tapering method. The convolution method has a similar problem if the effective range of the basis functions is smaller than the distance between the basis functions. In this case, the estimates at all locations that are not close to the center of some basis function have a large bias towards zero. These problems can clearly be seen in Figure 5, where the optimal kriging prediction and the predictions for S1, the tapering method, and the convolution method are shown for an example with $\nu = 2$.

The computation times for the different methods are shown in Table 1. These computation times are obtained using an implementation in Matlab on a computer with a 3.33 GHz Intel Xeon X5680 processor. As intended, the time for Step 2 is similar for the different methods, whereas there is a larger difference between the computation times for Step 3, because the computation time for the kriging prediction scales differently with the number of kriging locations for the different methods. Note that the wavelet methods are less computationally demanding than the tapering method and the convolution method when doing kriging to many locations, because the matrix $M$ in Step 3 can be constructed without costly covariance function evaluations.

As mentioned previously, the computation time for Step 1 is very dependent on the actual implementation. However, as for Step 3, the Markov methods' matrices can be constructed without any covariance function evaluations, which is the reason for the faster computation time. One thing to note here is that if the parameters are changed (for example when doing parameter estimation), one does not have to construct all matrices again in the Markov methods, as one has to do for the other two methods.

In conclusion, S1 is both faster and has a smaller kriging error for all ranges compared with DB3 and the convolution method, and compared with the tapering method it has a smaller kriging error for all but very short ranges. Since the tapering method's computational cost varies with the tapering range, we conclude this section with a study of how changing the tapering range changes the results, in order to get a better understanding of which of S1 and the tapering method is to be preferred.

ν = 1      Optimal      DB3          S1           Conv.        Taper
Step 1     .68 (6. )    .490 (0. )   .423 (0. )   --           .771 (0. )
Step 2     .074 (0. )   .113 (0. )   .088 (0. )   --           .117 (0. )
Step 3     .48 (6. )    .293 (0. )   .248 (0. )   --           .051 (0. )
Total      .23 (8. )    .896 (0. )   .759 (0. )   --           .939 (0. )

ν = 2
Step 1     .19 (6. )    .600 (0. )   .489 (0. )   .961 (0. )   .184 (1. )
Step 2     .327 (0. )   .228 (0. )   .203 (0. )   .217 (0. )   .247 (0. )
Step 3     .94 (6. )    .310 (0. )   .260 (0. )   .942 (0. )   .319 (0. )
Total      .45 (9. )    .138 (0. )   .951 (0. )   .120 (0. )   .750 (1. )

ν = 3
Step 1     .75 (6. )    .759 (0. )   .569 (0. )   .656 (1. )   .413 (1. )
Step 2     .468 (0. )   .394 (0. )   .377 (0. )   .390 (0. )   .421 (0. )
Step 3     .36 (6. )    .315 (0. )   .266 (0. )   .522 (1. )   .460 (0. )
Total      .58 (9. )    .468 (0. )   .213 (0. )   .57 (1. )    .30 (1. )

Table 1: Average computation times (in seconds) for the results in Figure 4. The values are based on the simulations for each value of ν, with standard deviations shown in parentheses.

4.6 Varying the tapering range

As shown above, the S1 method is preferable to the DB3 method and the convolution method in all our test cases, whereas the tapering method had a smaller kriging error for very short ranges. Since this was found using a fixed tapering range, chosen such that the computation time for Step 2 would be similar to that of the other methods, we now look at what happens if the tapering range is varied while the true range is kept fixed.

The setup is the same as in the previous comparison: a Matérn field with $\nu = 2$, variance 1, and approximate range $r$ is measured at randomly chosen locations in a square in $\mathbb{R}^2$. The difference is that we now keep these parameters fixed but instead vary the tapering range over a grid of values. We generate a number of data sets and calculate the kriging predictions for the S1 method and the tapering method for all values of the tapering range. Based on these estimates, the average kriging error is calculated for S1 and for each tapering estimate.

The results can be seen in Figure 6. The kriging errors are shown in the left panels and the computation times in the right panels. The blue lines represent the S1 method, which obviously does not depend on the tapering range, and the yellow lines represent the tapering method. In the right panels, the solid lines show the time for Step 2 in the computations, and the dashed lines show the total time for Step 2 and Step 3. In the upper two panels, a long true range $r$ is used. In this case, S1 is more accurate than the tapering method for all tapering ranges tested, which is not surprising considering the previous results. In the bottom panels of the figure, a short true range is used. This is a case where the tapering method was more accurate than S1 in the previous study, and we see here that the tapering method is more accurate for tapering ranges above a certain value, while the time for Step 2 is smaller than for S1 for all tapering ranges below a larger value. Thus, by choosing the tapering range between these two values, the tapering method is both more accurate and has a smaller computation time for Step 2.

The accuracy of the tapering method increases if the ratio between the tapering range and the true range is increased, and the computation time depends on how the distance between the measurements compares with the tapering range.

[Figure 6: The kriging errors (left) and the computation time for Step 2 (right) for the covariance tapering method (yellow lines) as functions of the taper range, with the values for the S1 basis (blue lines) shown for comparison. The range of the true covariance function is longer in the upper panels than in the lower panels. The colored lines are averages over the simulated data sets, and the grey bands indicate the standard deviation of these samples. The solid lines in the right panels show the computation time for Step 2, whereas the dashed lines show the total computation time for Step 2 and Step 3 when calculating the kriging predictions using the two methods.]
If the distance between the measurements is large, the tapering method is fast, whereas it is slower if the distance is small. Thus, the situation where the tapering method performs best is when the true covariance range is short compared with the distance between the measurements. However, also in the case when the true range is small, the total time it takes to calculate the tapering prediction is larger than the time it takes to calculate the S1 prediction, unless the number of kriging locations is small.

In this work, the taper functions that Furrer et al. (2006) found to be best for each value of $\nu$ are used, but the results may be improved by using other taper functions. Changing the taper function will, however, not change the fact that the prediction at all locations that do not have any measurements closer than the tapering range is zero in the tapering method, and that the tapered covariance function is very different from the true covariance function if the tapering range is short compared with the true range. Finally, the results for all methods could be improved by finding optimal parameters for the approximate models instead of using the parameters of the true Matérn covariance. For the tapering method, however, Furrer et al. (2006) found that this only changed the relative accuracy by one or two percent.

5 Conclusions
Because of the increasing number of large environmental data sets, there is a need for computationally efficient statistical models. To be useful for a broad range of practical applications, the models should contain a wide family of stationary covariance functions and be extendable to nonstationary covariance structures, while still allowing efficient calculations for large problems.

The SPDE formulation of the Matérn family of covariance functions has these properties, as it can be extended to more general nonstationary spatial models (see Bolin and Lindgren, 2011, Lindgren et al., 2011, for details on how this can be done), and allows for efficient and accurate Markov model representations. In addition, as shown by the simulation comparisons, these Markov methods are more efficient and accurate than both the process convolution approach and the covariance tapering method for approximating Matérn fields.

Depending on the context in which a model is used, different aspects are important for making it computationally efficient. If, for example, the model is used in MCMC simulations, one should be able to efficiently generate samples from the model given the parameters, and if the parameters are estimated in a numerical maximum likelihood procedure, one must be able to evaluate the likelihood efficiently. To limit the scope of this article, only the computational aspects of kriging were considered. However, for practical applications, parameter estimation is likely the most computationally demanding part of the analysis. If maximum likelihood estimation is performed using numerical optimization of the likelihood, matrix inverses similar to the one in Step 2 in Table 1 have to be computed in each iteration of the optimization, and it is therefore important that these inverses can be calculated efficiently. We have not discussed estimation here, but the Markov methods are likely most efficient in this situation as well, because they do not require costly Bessel function evaluations when calculating the likelihood. However, this is left for future research to investigate in more detail. An introduction to maximum likelihood estimation using the SPDE formulation can be found in Bolin and Lindgren (2011) and Lindgren et al. (2011).

Finally, some relevant methods, such as Cressie and Johannesson (2008) and Banerjee et al. (2008), were not included in the comparison in order to keep it relatively short, and also because they are difficult to compare with the methods discussed here. It would be interesting to include more methods in the comparison, but we leave this for future work.
References
Banerjee, S., Gelfand, A.E., Finley, A.O., Sang, H., 2008. Gaussian predictive process models for large spatial data sets. J. Roy. Statist. Soc. Ser. B 70, 825-848.

Barry, R.P., Ver Hoef, J.M., 1996. Blackbox kriging: Spatial prediction without specifying variogram models. J. Agr. Biol. Environ. Statist. 1, 297-322.

Bolin, D., Lindgren, F., 2011. Spatial models generated by nested stochastic partial differential equations, with an application to global ozone mapping. Ann. Appl. Statist. 5, 523-550.

Burrus, C., Gopinath, R., Guo, H., 1988. Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice-Hall, New York.

Chui, C.K., Wang, J.Z., 1992. On compactly supported spline wavelets and a duality principle. Transactions of the American Mathematical Society 330, 903-915.

Cressie, N., Johannesson, G., 2008. Fixed rank kriging for very large spatial data sets. J. Roy. Statist. Soc. Ser. B 70, 209-226.

Cressie, N., Pavlicová, M., 2002. Calibrated spatial moving average simulations. Statist. Model. 2, 267-279.

Daubechies, I., 1992. Ten Lectures on Wavelets (CBMS-NSF Regional Conference Series in Applied Mathematics). Soc. for Industrial & Applied Math.

Furrer, R., Genton, M.G., Nychka, D., 2006. Covariance tapering for interpolation of large spatial datasets. J. Comput. Graph. Statist. 15, 502-523.

Gneiting, T., 2002. Compactly supported correlation functions. J. Multivariate Anal. 83, 493-508.

Higdon, D., 2001. Space and space-time modeling using process convolutions. Technical Report 01-03. Duke University, Durham, NC.

Latto, A., Resnikoff, H.L., Tenenbaum, E., 1991. The evaluation of connection coefficients of compactly supported wavelets, in: Proceedings of the French-USA Workshop on Wavelets and Turbulence, Springer-Verlag.

Lindgren, F., Rue, H., 2007. Explicit construction of GMRF approximations to generalised Matérn fields on irregular grids. Technical Report 12. Centre for Mathematical Sciences, Lund University, Lund, Sweden.

Lindgren, F., Rue, H., Lindström, J., 2011. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach (with discussion). J. Roy. Statist. Soc. Ser. B 73, 423-498.

Matérn, B., 1960. Spatial variation. Meddelanden från statens skogsforskningsinstitut 49.

Nychka, D., Wikle, C., Royle, J.A., 2002. Multiresolution models for nonstationary spatial covariance functions. Statist. Model. 2, 315-331.

Rodrigues, A., Diggle, P.J., 2010. A class of convolution-based models for spatio-temporal processes with non-separable covariance structure. Scand. J. Statist. 37, 553-567.

Rue, H., Held, L., 2005. Gaussian Markov Random Fields: Theory and Applications. Volume 104 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC.

Rue, H., Tjelmeland, H., 2002. Fitting Gaussian Markov random fields to Gaussian fields. Scand. J. Statist. 29, 31-49.

Schabenberger, O., Gotway, C.A., 2005. Statistical Methods for Spatial Data Analysis. Chapman & Hall/CRC.

Song, H.R., Fuentes, M., Ghosh, S., 2008. A comparative study of Gaussian geostatistical models and Gaussian Markov random field models. J. Multivariate Anal. 99, 1681-1697.

Stein, M.L., 1999. Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York.

Wendland, H., 1995. Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree. Adv. Comput. Math. 4, 389-396.

Whittle, P., 1963. Stochastic processes in several dimensions. Bull. Internat. Statist. Inst. 40, 974-994.

Zhang, H., 2004. Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. J. Amer. Statist. Assoc. 99, 250-261.