Sparse Cholesky matrices in spatial statistics
Abhirup Datta ∗ Department of Biostatistics, Johns Hopkins University
Abstract
Gaussian Processes (GP) are a staple in the toolkit of a spatial statistician. Well-documented computing roadblocks in the analysis of large geospatial datasets using Gaussian Processes have now largely been mitigated via several recent statistical innovations. Nearest Neighbor Gaussian Processes (NNGP) have emerged as one of the leading candidates for such massive-scale geospatial analysis owing to their empirical success. This article reviews the connection of NNGP to sparse Cholesky factors of the spatial precision (inverse-covariance) matrix. The focus of the review is on these sparse Cholesky matrices, which are versatile and have recently found many diverse applications beyond the primary usage of NNGP for fast parameter estimation and prediction in spatial (generalized) linear models. In particular, we discuss applications of sparse NNGP Cholesky matrices to address multifaceted computational issues in spatial bootstrapping, simulation of large-scale realizations of Gaussian random fields, and extensions to non-parametric mean function estimation of a Gaussian Process using random forests. We also review a sparse-Cholesky-based model for areal (geographically-aggregated) data that addresses long-established interpretability issues of existing areal models. Finally, we highlight some yet-to-be-addressed issues of such sparse Cholesky approximations that warrant further research.
Keywords:
Cholesky matrix, large geospatial data, Nearest Neighbor Gaussian Process, sparse methods, spatial statistics.
∗ AD was supported by NSF award DMS-1915803. Email: [email protected]

1 Introduction

Spatially-indexed data are commonly encountered in many diverse fields of research including forestry, climatology, environmental health, ecology, infectious disease epidemiology, neuro-imaging, etc. The objective of spatial statistics has primarily been to develop models and algorithms that can utilize the location information in the data to improve statistical inference. In particular, spatial (generalized) linear models using Gaussian Processes (GP) have become a staple tool for such geospatial analysis. Let $s_1, \ldots, s_n$ denote $n$ locations and let $y(s_i)$ and $x(s_i)$ respectively denote the univariate response and a $p \times 1$ vector of covariates at the $i$th location. A linear-mean GP model for such geospatial data specifies $y(\cdot) \sim GP(x(\cdot)^\top \beta, \Sigma(\cdot,\cdot))$ where $x(\cdot)^\top \beta$ is the linear mean function and $\Sigma(\cdot,\cdot)$ is the covariance function (Banerjee et al., 2014; Cressie and Wikle, 2015). This endows the responses with a linear mean $E(y(s)) = x(s)^\top \beta$ and the covariance $\mathrm{Cov}(y(s), y(s')) = \Sigma(s, s')$. If $y = (y(s_1), \ldots, y(s_n))^\top$ denotes the vector of observed responses and $X$ denotes the corresponding design matrix created by stacking up the $x(s_i)^\top$'s, then the GP model implies
$$y \sim N(X\beta, \Sigma) \quad \text{where } \Sigma = (\Sigma(s_i, s_j)). \qquad (1)$$
The resulting multivariate Gaussian likelihood
$$p(y) \propto \frac{1}{\sqrt{\det(\Sigma)}} \exp\left( -\frac{1}{2} (y - X\beta)^\top \Sigma^{-1} (y - X\beta) \right), \qquad (2)$$
is used to infer about $\beta$ and the parameters in the spatial covariance function $\Sigma$. Computing this likelihood involves storing the $n \times n$ covariance matrix $\Sigma$, requiring $O(n^2)$ storage. It also involves computing the inverse and the determinant of $\Sigma$. These operations are typically performed via the Cholesky decomposition and require $O(n^3)$ computation time or floating point operations (FLOPs). Neither the storage nor the computing demands of a GP likelihood can be afforded by typical personal computers even for moderately large $n$.
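To fix ideas, the following is a minimal sketch (an illustration, not code from the paper) of evaluating (2) for an exponential covariance via a dense Cholesky factorization; the two commented lines are exactly the $O(n^2)$ storage and $O(n^3)$ time bottlenecks described above.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_loglik(y, X, beta, coords, sigma2=1.0, phi=1.0, tau2=0.1):
    # Exponential covariance: Sigma_ij = sigma2*exp(-phi*||s_i - s_j||) + tau2*1(i == j)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    Sigma = sigma2 * np.exp(-phi * d) + tau2 * np.eye(len(y))  # O(n^2) storage
    L = cholesky(Sigma, lower=True)                            # O(n^3) FLOPs
    r = solve_triangular(L, y - X @ beta, lower=True)          # whitened residuals
    # log p(y) = -(1/2) r'r - sum_i log L_ii - (n/2) log(2*pi), per (2)
    return -0.5 * r @ r - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)
```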
Over the last few decades the spatial statistics community has attacked this big-GP problem from many fronts and offered many different and efficient solutions to ease the computational burden. Methods include sparse nearest neighbor approximations (Vecchia, 1988; Stein et al., 2004; Datta et al., 2016a), low-rank approximations (Banerjee et al., 2008; Cressie and Johannesson, 2008), sparse-plus-low-rank methods (Ma et al., 2019), multi-resolutional approaches (Katzfuss, 2017), data partitioning (Barbian and Assunção, 2017; Guhaniyogi and Banerjee, 2018), covariance tapering (Furrer et al., 2006; Kaufman et al., 2008), stochastic partial differential equations (Lindgren et al., 2011), composite likelihoods (Bevilacqua and Gaetan, 2015; Eidsvik et al., 2014), and grid-based methods (Nychka et al., 2015; Guinness and Fuentes, 2017; Stroud et al., 2017), among others. A comprehensive review of all these methods is beyond the scope of this paper, but we refer the readers to Sun et al. (2012); Bradley et al. (2016); Banerjee (2017); Heaton et al. (2019); Banerjee (2020) for reviews and comparisons of the methods.

In this manuscript, we focus on reviewing Nearest Neighbor Gaussian Processes (NNGP) (Datta et al., 2016a) based on Vecchia's approximation (Vecchia, 1988). This strand of literature, reviewed in Section 2, has itself become sufficiently large owing to the empirical success of the method. The traditional use of this method has primarily been fast parameter estimation and prediction for geospatial data in a GP-based linear model. As discussed above, there are many alternative methods to achieve these tasks, and the recent data analysis comparisons of Heaton et al. (2019) demonstrated that many of the methods achieve highly competitive performance. However, the NNGP method naturally yields sparse Cholesky factors of the spatial precision (inverse-covariance) matrices, which have a wider range of applications beyond parametric estimation and prediction in the spatial linear model. This review focuses primarily on some such novel applications of these sparse Cholesky factor matrices in spatial statistics.

In Section 3, we first expand on three recent applications of sparse Cholesky matrices in geospatial data: resampling or bootstrap for spatial data (Section 3.1), large-scale simulation of Gaussian random fields (Section 3.2), and non-parametric mean function estimation for Gaussian Processes using random forests (Section 3.3). We demonstrate how sparse Cholesky matrices are used to resolve multi-faceted computational issues in each of these applications. Our fourth and final example (Section 3.4) digresses from the setting of geospatial (point-referenced) data and considers areal (geographically-aggregated) data. We review a new class of models using sparse Cholesky factors for such data that improves over existing approaches in terms of parameter interpretability and model performance while remaining computationally scalable. We conclude in Section 4 with a discussion of a few yet-to-be-addressed aspects of the NNGP method that open up future avenues of research.
2 Nearest Neighbor Gaussian Processes

In a seminal paper, Vecchia (1988) proposed leveraging the spatial structure encoded in GP covariance functions to obtain a computationally scalable approximation of the GP likelihood (2). The approach proceeds by rewriting (2) as
$$p(y) = p(y(s_1))\, p(y(s_2) \mid y(s_1)) \cdots p(y(s_n) \mid y(s_1), \ldots, y(s_{n-1})) = p(y(s_1)) \prod_{i=2}^{n} p(y(s_i) \mid y(H(s_i))), \qquad (3)$$
where $H(s_i) = (s_1, \ldots, s_{i-1})$ and $y(H(s_i))$ is the vector formed by stacking $y(s)$ for $s \in H(s_i)$. Most popular choices of the covariance function $\Sigma$ conform to the 'first law of geography', i.e., ensure that proximal things are more similar than distant ones. Hence, the magnitude of the spatial covariance decays with distance, and the conditioning sets $H(s_i)$ will contain many members far away from $s_i$ that offer minimal information about it. Vecchia (1988) therefore proposed approximating (3) with
$$p(y(s_1)) \prod_{i=2}^{n} p(y(s_i) \mid y(N(s_i))), \qquad (4)$$
where $N(s_i)$ is a set of up to $m$ nearest neighbors of $s_i$ chosen from $H(s_i)$, and $y(N(s_i))$ is defined similarly to $y(H(s_i))$. For two sets $A, B \subseteq \{1, \ldots, n\}$, let $\Sigma(A, B)$ denote the sub-matrix of $\Sigma$ with rows and columns respectively indexed by $A$ and $B$. Then under the GP model (1), the nearest neighbor likelihood (4) reduces to
$$N\big(y(s_1) \mid x(s_1)^\top \beta, \Sigma(s_1, s_1)\big) \prod_{i=2}^{n} N\big(y(s_i) \mid x(s_i)^\top \beta + b_i^\top (y(N(s_i)) - X(N(s_i))\beta),\; f_i\big),$$
where $b_i = \Sigma(N(s_i), N(s_i))^{-1} \Sigma(N(s_i), s_i)$, $f_i = \Sigma(s_i, s_i) - \Sigma(s_i, N(s_i))\, b_i$, and $X(N(s_i))$ is the design matrix corresponding to $y(N(s_i))$. Parameter estimation proceeds by maximizing the pseudo-likelihood (4). The restriction of the neighbor sets $N(s_i)$ to comprise at most $m$ locations ensures we only need to store and invert $m \times m$ matrices $\Sigma(N(s_i), N(s_i))$, thereby reducing the storage and computing requirements for evaluating the likelihood (4) to $O(nm^2)$ and $O(nm^3)$ respectively.

Stein et al. (2004) demonstrated that, as (4) is the product of correctly specified conditional densities $p(y(s_i) \mid y(N(s_i)))$, the score function from (4) yields a set of unbiased estimating equations for the parameters. They also generalized this approximation in several ways, including using block conditional densities instead of the univariate ones, exploring choices of neighbor sets beyond nearest neighbors, and formulating a Restricted Maximum Likelihood (REML) approach for parameter estimation.

The GP model (1) is conceived from the additive spatial regression model $E(y(s_i)) = x(s_i)^\top \beta + w(s_i)$, where $x(s_i)^\top \beta$ is the linear covariate effect and $w(s_i)$ is a smooth spatial effect explaining residual structured variation in $y(s_i)$. Modeling $w(\cdot)$ as a Gaussian process with zero mean and covariance function $C$, and assuming additive Gaussian iid errors $\epsilon_i$, leads to (1) with the covariance function $\Sigma(\cdot,\cdot) = C(\cdot,\cdot) + \tau^2 \delta(\cdot,\cdot)$, where $\tau^2$ is the error variance and $\delta$ is the white noise covariance function, i.e., $\delta(s, s') = I(s = s')$. Thus the GP regression (1) is the marginal form of the hierarchical mixed effects model
$$y(s_i) = x(s_i)^\top \beta + w(s_i) + \epsilon(s_i), \quad w = (w(s_1), \ldots, w(s_n))^\top \sim N(0, C) \text{ where } C = (C(s_i, s_j)), \quad \epsilon(s_i) \overset{iid}{\sim} N(0, \tau^2). \qquad (5)$$
The original nearest neighbor approximation of Vecchia was directly applied to the data likelihood (2) for $y$. Often, inference on the latent spatial surface $w(s)$ is of interest for scientists to understand structured variation in the response beyond what is explained by the covariates. Datta et al. (2016a) generalized the idea of Vecchia (1988) from a data likelihood approximation to 'Nearest Neighbor Gaussian Processes' (NNGP): a new valid class of multivariate Gaussian distributions and Gaussian random fields that can be used to conduct fast spatial inference on observed or latent processes. The key to this extension is the following connection of the nearest neighbor approximation to sparse Cholesky matrices.

For any Gaussian Process $w(\cdot)$ with zero mean and covariance function $C$, akin to (4), the nearest neighbor approximation of the likelihood of realizations of the process $w(\cdot)$ at $s_1, \ldots, s_n$ is given by
$$p(w(s_1)) \prod_{i=2}^{n} p(w(s_i) \mid w(N(s_i))). \qquad (6)$$
The term $p(w(s_i) \mid w(N(s_i)))$ is the conditional density $N(w(s_i) \mid b_i^\top w(N(s_i)), f_i)$ where
$$b_i = C(N(s_i), N(s_i))^{-1} C(N(s_i), s_i), \qquad f_i = C(s_i, s_i) - C(s_i, N(s_i))\, b_i, \qquad (7)$$
and can be considered as the likelihood from the generative model $w(s_i) = b_i^\top w(N(s_i)) + N(0, f_i)$. Thus the expression in (6) can be equivalently written as the likelihood from the model
$$\begin{aligned} w(s_1) &= \eta_1 \\ w(s_2) &= b_{21} w(s_1) + \eta_2 \\ w(s_3) &= b_{31} w(s_1) + b_{32} w(s_2) + \eta_3 \\ &\;\;\vdots \\ w(s_n) &= b_{n1} w(s_1) + b_{n2} w(s_2) + \cdots + b_{n,n-1} w(s_{n-1}) + \eta_n, \end{aligned} \qquad (8)$$
where $\eta_i \overset{ind}{\sim} N(0, f_i)$ with $f_1 = C(s_1, s_1)$, and $b_{ij} = 0$ if $s_j$ is not a neighbor of $s_i$ and equals the $l$th element of $b_i$ if $s_j$ is the $l$th neighbor of $s_i$. We can stack the equations in (8) to obtain the matrix equation
$$w = Bw + \eta, \qquad (9)$$
where $w = (w(s_1), \ldots, w(s_n))^\top$, $\eta = (\eta_1, \ldots, \eta_n)^\top \sim N(0, F)$ with $F = \mathrm{diag}(f_1, \ldots, f_n)$, and $B = (b_{ij})$ is a strictly lower triangular matrix. From (9), we have
$$(I - B)\, w = \eta \iff w = (I - B)^{-1} \eta \sim N\big(0, (I - B)^{-1} F (I - B)^{-\top}\big). \qquad (10)$$
Thus Datta et al. (2016a) noted that the nearest neighbor approximation of Vecchia (1988) corresponds to a generative multivariate Gaussian model $w \sim N(0, \widetilde{C})$ for the process realizations, where $\widetilde{C} = (I - B)^{-1} F (I - B)^{-\top}$. This new model essentially replaces the model $w \sim N(0, C)$ where $C = (C(s_i, s_j \mid \theta))$, which corresponds to the full GP likelihood. The precision matrix
$$\widetilde{C}^{-1} = (I - B)^\top F^{-1} (I - B) \qquad (11)$$
admits a Cholesky decomposition $L^\top L$ with the lower triangular Cholesky factor
$$L = F^{-1/2} (I - B). \qquad (12)$$
As $B$ has at most $m$ non-zero elements per row and $F^{-1/2}$ is a diagonal matrix, the Cholesky factor $L = (l_{ij})$ is equally row-sparse. Computing $L$ involves only computing the $b_i$'s and $f_i$'s from (7), thus requiring $O(nm^3)$ FLOPs and $O(nm^2)$ storage, as opposed to $O(n^3)$ FLOPs and $O(n^2)$ storage for computing the Cholesky factor of the full GP matrix $C$. Subsequent to computing $L$, the likelihood for $w$ involves computing quadratic forms and the determinant of $\widetilde{C}^{-1}$. This is straightforward, as any quadratic form $u^\top \widetilde{C}^{-1} v = (Lu)^\top (Lv)$ and, due to row-sparsity of $L$, the multiplication $Lu$ only uses $O(nm)$ additional FLOPs. Similarly, $\det(\widetilde{C}^{-1}) = \prod_i l_{ii}^2$. Therefore, computing the whole likelihood only requires linear storage and time.

The generative approach of Datta et al. (2016a) using the sparse Cholesky factor has several benefits beyond the data likelihood approximation of Vecchia (1988).
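The construction in (7)-(12) is straightforward to prototype. Below is a minimal sketch, assuming an exponential covariance and Euclidean nearest neighbors among previously ordered points; function and variable names are illustrative and not from any package. It builds $B$ and $F$ row by row, assembles the sparse factor $L = F^{-1/2}(I - B)$, and evaluates the log-density of $w$ without ever forming the dense $n \times n$ covariance matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, identity, diags

def cov(a, b, sigma2=1.0, phi=3.0):
    # exponential covariance sigma2 * exp(-phi * ||a_i - b_j||) between two point sets
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def nngp_factor(coords, m=10, **kw):
    n = len(coords)
    rows, cols, vals = [], [], []
    f = np.empty(n)
    f[0] = cov(coords[:1], coords[:1], **kw)[0, 0]
    for i in range(1, n):
        dists = np.linalg.norm(coords[:i] - coords[i], axis=1)
        N = np.argsort(dists)[:m]                       # N(s_i): m nearest among s_1..s_{i-1}
        CNN = cov(coords[N], coords[N], **kw)
        CNi = cov(coords[N], coords[i:i + 1], **kw)[:, 0]
        b = np.linalg.solve(CNN, CNi)                   # b_i in (7): one m x m solve
        f[i] = cov(coords[i:i + 1], coords[i:i + 1], **kw)[0, 0] - CNi @ b  # f_i in (7)
        rows += [i] * len(N); cols += list(N); vals += list(b)
    B = csr_matrix((vals, (rows, cols)), shape=(n, n))  # strictly lower triangular
    L = diags(1.0 / np.sqrt(f)) @ (identity(n, format='csr') - B)  # L = F^{-1/2}(I - B), (12)
    return L, f

def nngp_logdensity(L, f, w):
    u = L @ w  # O(n m) row-sparse product; w' Ctilde^{-1} w = ||L w||^2
    return -0.5 * (u @ u + np.log(f).sum() + len(w) * np.log(2 * np.pi))
```

Each row costs one $m \times m$ solve, which is the source of the $O(nm^3)$ total; the log-determinant is free since $\det(\widetilde{C}) = \prod_i f_i$.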
One can consider any hierarchical spatial model but simply replace the Gaussian Process prior for the spatial random effects $w$ with an NNGP prior. For example, in (5), as the $y(s_i)$'s are independent conditional on $w$ and $\beta$, the joint likelihood
$$N(y \mid X\beta + w, \tau^2 I) \times N(w \mid 0, \widetilde{C}) \qquad (13)$$
can be evaluated efficiently even for large spatial data. Augmenting this with priors for the other parameters ($\beta$, $\tau^2$, and the parameters of the covariance function $C$) facilitates standard Bayesian inference on the latent effects $w$. One can also proceed with frequentist estimation using the EM algorithm by treating $w$ as the missing data.

Another benefit of NNGP is prediction of the latent process at new locations. Datta et al. (2016a) specified the conditional distribution of $w(s) \mid w$ at new locations $s \notin S = \{s_1, \ldots, s_n\}$ as
$$w(s) \mid w \overset{ind}{\sim} N\big(C(s, N(s))\, C(N(s), N(s))^{-1} w(N(s)),\; C(s, s) - C(s, N(s))\, C(N(s), N(s))^{-1} C(N(s), s)\big). \qquad (14)$$
The prediction distribution is basically equivalent to kriging independently (conditional on $w$) at each new location $s \notin S$ using the $m$ nearest neighbors $N(s)$ of $s$ in $S$, instead of all of $S$. Vecchia (1992) considered similar nearest neighbor-based kriging, but only for point predictions as opposed to entire prediction distributions. Katzfuss et al. (2020a) extended the independent nearest-neighbor kriging to joint kriging, which improved prediction quality.

Datta et al. (2016a) demonstrated that Equations (10) and (14) complete the specification of a valid Gaussian Process over the entire domain, which was referred to as the Nearest Neighbor Gaussian Process (NNGP). In any hierarchical model, NNGP can be used to replace the GP prior to ensure fast computation by leveraging the sparse Cholesky factor $L$, and to proceed with fast and full Bayesian inference on all parameters and the latent process $w(\cdot)$, and Bayesian predictive inference on the outcome process $y(\cdot)$. More detailed reviews of the method are available in Datta et al. (2016b) and Banerjee (2017).
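As a companion to the factor-building sketch above, the following hedged sketch implements the independent nearest-neighbor kriging in (14); cov() and coords are as in the previous sketch, and all names are illustrative.

```python
import numpy as np

def nngp_predict(s0, coords, w, m=10, **kw):
    # N(s0): the m nearest observed locations to the new location s0
    N = np.argsort(np.linalg.norm(coords - s0, axis=1))[:m]
    CNN = cov(coords[N], coords[N], **kw)
    CN0 = cov(coords[N], s0[None, :], **kw)[:, 0]
    a = np.linalg.solve(CNN, CN0)                              # kriging weights
    mean = a @ w[N]                                            # E[w(s0) | w] in (14)
    var = cov(s0[None, :], s0[None, :], **kw)[0, 0] - a @ CN0  # Var[w(s0) | w] in (14)
    return mean, var
```

Each prediction costs one $m \times m$ solve, so predicting at $n_0$ new locations takes $O(n_0 m^3)$ time after the neighbor search.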
There has been considerable recent work related to Vecchia's approximation and NNGP. Gramacy and Apley (2015) developed a 'Local approximation GP', which extends nearest-neighbor based kriging equations to non-stationary covariance functions. More classes of non-stationary covariance models with nearest neighbor approximations have been implemented recently in Risser and Turek (2019). Stroud et al. (2017) used the NNGP precision matrix $\widetilde{C}^{-1}$ as a pre-conditioner for the full GP matrix $C$ to solve for unknowns $u$ in linear equations of the form $Cu = z$ arising in conditional GP simulations. Schäfer et al. (2020) demonstrated that, given the ordering and the neighbor sets, the Cholesky factor $L$ is the optimal one in terms of Kullback-Leibler distance between the full GP distribution and a sparse Cholesky based distribution.

The quality of the nearest neighbor approximation and the resulting sparse Cholesky factor depends on the choice of data ordering. As spatial data do not have any natural ordering, simple orderings like sorting along some coordinate are often adopted. Guinness (2018) studied alternate choices of ordering and observed that certain well-principled orderings, or sometimes even random orderings, can lead to more efficient parameter estimation. Katzfuss and Guinness (2021) considered joint orderings of the responses $y$ and the latent random effects $w$ in a large class of models which they referred to as the 'Sparse Generalized Vecchia (SGV)'. They demonstrated that the original Vecchia approximation for the response process and NNGP for the latent process (and many other popular spatial models) can be unified under the umbrella of SGV, as they arise from imposing different relative orderings of $y$ and $w$.

There has been substantial investigation of the relative merits of using this approximation on the response process (Vecchia, 1988) versus the latent process (Datta et al., 2016a). Endowing the latent process $w$ with an NNGP prior and marginalizing $w$ out yields
$$y \sim N(X\beta, \widetilde{C} + \tau^2 I). \qquad (15)$$
While the hierarchical model (13) is equivalent to the marginal model (15), unfortunately the latter is not directly amenable to scalable computing. This is because, even though $\widetilde{C}^{-1}$ has a sparse Cholesky factor, the same cannot be said of $(\widetilde{C} + \tau^2 I)^{-1}$. The Bayesian implementation of Datta et al. (2016a) avoids this issue by sampling the latent $w$ sequentially from (13) in a Gibbs sampler. However, for large data, sequential sampling of $n$ latent random effects substantially increases the MCMC dimension and can lead to convergence issues. Motivated by these sampling issues, Finley et al. (2019) used the NNGP prior to directly model the response process, facilitating a Bayesian model-based analogue of Vecchia's original likelihood approximation. However, Katzfuss and Guinness (2021) also showed that using NNGP on the latent process leads to a better approximation of the full GP than using Vecchia's approximation on the response process. Zhang et al. (2019) proposed a solution that retains this advantage of the latent NNGP while circumventing the sequential sampling of $w$. They resourcefully use the sparse Cholesky factor of $\widetilde{C}^{-1}$ in a Bayesian conjugate gradient approach that enjoys the aforementioned benefits of the latent NNGP model ((13) or (15)) while enabling fast block sampling of $w$.

Multiple software packages implementing Vecchia's approximation and Nearest Neighbor Gaussian Processes are now publicly available, including the CRAN R-packages spNNGP (Finley et al., 2020a), BRISC (Saha and Datta, 2018b), GPvecchia (Katzfuss et al., 2020b), and BayesNSGP (Turek and Risser, 2019). Many extensions of the method have also been developed, including generalizations to spatio-temporal settings (Jones and Zhang, 1997; Datta et al., 2016c), multivariate settings (Taylor-Rodriguez et al., 2019), non-Gaussian outcomes (Finley et al., 2020b; Zilber and Katzfuss, 2019), spatio-temporal filtering (Jurek and Katzfuss, 2020), and multivariate cumulative distribution functions of Gaussian Processes (Nascimento and Shaby, 2020). Most of these have focused on parametric modeling, estimation and prediction using the spatial GP linear model. In the next Section, we discuss some novel applications of the NNGP sparse Cholesky factors.
3 Applications of the sparse NNGP Cholesky factor

3.1 Bootstrap for spatial data

This Section reviews a fast parametric bootstrap developed in Saha and Datta (2018a) for inference (interval estimates) on the parameters in the spatial regression model (1) or (5). The parameters consist of the regression coefficient $\beta$, the noise variance $\tau^2$, and the parameters $\theta$ specifying the covariance function $C$. To obtain interval estimates of the parameters, one can opt for a Bayesian implementation, sampling from the joint likelihood (13) augmented by priors for the parameters using MCMC methods (Banerjee et al., 2014). These provide full posterior distributions of all the parameters, from which one can obtain point and interval estimates of any function of the parameters. This strategy was adopted in the latent- (Datta et al., 2016a; Zhang et al., 2019) and response- (Finley et al., 2019) NNGP models. However, the sequential nature of MCMC typically requires many thousands of iterations, leading to prolonged analysis times despite the speedup per likelihood evaluation afforded by the use of NNGP.

One can also leverage the asymptotic distribution of the parameter MLEs (maximum likelihood estimates) in a frequentist setup. For spatial data, there are two paradigms of asymptotics. For 'infill asymptotics', where data are observed with increasing density in a fixed spatial domain, individual parameters may not be consistently estimable and only certain functions of the spatial parameters are identifiable (Chen, 2002; Zhang, 2004; Tang et al., 2019). The results for the 'increasing domain' setting, where the spatial domain expands along with increased sample sizes, are more in line with traditional asymptotics. Mardia and Marshall (1984) established asymptotic normality of parameters for a wide choice of covariance functions. However, asymptotic interval estimates rely on the asymptotic covariance of the parameters, and these involve the Hessian of the full GP likelihood, which is computationally infeasible for large $n$. The issue persists even when using the nearest-neighbor approximation (4) to the likelihood, as one needs to use a computationally onerous sandwich-variance estimator (Stein et al., 2004).

A general alternative method for finite-sample inference on parameters is bootstrapping. Bootstrapping can be done in an embarrassingly parallel fashion, thereby easing some of the computational burden of the aforementioned approaches. However, spatial data are correlated, violating the fundamental principle of resampling iid units used in the bootstrap. Even if one wishes to proceed with a bootstrap ignoring this correlation, it is unclear how to define the GP covariance matrix for a bootstrapped dataset, as the correlation between two resamples of the same data unit $(y, x, s)$ will be one since they share the same location.

Olea and Pardo-Igúzquiza (2011) proposed a parametric bootstrap for the spatial regression model (1). As $\mathrm{Cov}(y) = \Sigma$, we have $\mathrm{Cov}(\Sigma^{-1/2} y) = I$, where $\Sigma^{-1/2}$ denotes the Cholesky factor of the precision matrix $\Sigma^{-1}$. Thus, the residual vector $r = \Sigma^{-1/2}(y - X\beta)$ has iid $N(0, 1)$ entries, and one can resample from them to create bootstrapped residual vectors $r^{(1)}, \ldots, r^{(B)}$, where $B$ is the bootstrap sample size.
Subsequently, generating $y^{(b)} = X\beta + \Sigma^{1/2} r^{(b)}$ for $b = 1, \ldots, B$ creates bootstrapped datasets having the same distribution as the original data. If $\psi$ denotes the entire set of parameters, one can run the MLE for each dataset $y^{(b)}$ in parallel to obtain the sample of MLEs $\psi^{(1)}, \ldots, \psi^{(B)}$, from which one can obtain interval estimates. As $\beta$ and $(\theta, \tau^2)$ (which parametrize $\Sigma$) are unknown, in practice they are replaced by their MLEs in this algorithm.

The parametric bootstrap of Olea and Pardo-Igúzquiza (2011) requires the Cholesky factors of both $\Sigma$ and $\Sigma^{-1}$, which use $O(n^3)$ operations. Hence, despite being amenable to embarrassingly parallel computing, the algorithm cannot be used for large spatial datasets. Saha and Datta (2018a) noted that replacing the full GP with NNGP achieves both in $O(n)$ time. To see this, similar to (10), let $\widetilde{\Sigma}$ denote the NNGP covariance matrix approximating $\Sigma$, i.e., $\widetilde{\Sigma} = (I - B_y)^{-1} F_y (I - B_y)^{-\top}$, where $B_y$ and $F_y$ are defined like $B$ and $F$ respectively, but for the covariance function $\Sigma$ of the response process $y(\cdot)$ instead of the covariance function $C$ of the latent process $w(\cdot)$. Hence, akin to (12), the Cholesky factor is obtained as
$$\widetilde{\Sigma}^{-1/2} = L_y = F_y^{-1/2} (I - B_y) \qquad (16)$$
using linear storage and time. As $\widetilde{\Sigma}^{-1/2} \approx \Sigma^{-1/2}$, we have $\mathrm{Cov}(\widetilde{\Sigma}^{-1/2} y) \approx I$. So Saha and Datta (2018a) proposed using the former to decorrelate $y$.

We provide a small illustration below to assess the quality of this approximate decorrelation of spatial data generated from a full GP. We generated 1000 locations randomly on a unit square and simulated 10000 datasets, each one being a realization of a GP on those 1000 locations. We used two choices of the covariance function: the exponential covariance function and the smoother Matérn covariance with smoothness 3/2. For each dataset, we computed the decorrelated vector using $\widetilde{\Sigma}^{-1/2}$ from (16) and plotted the sample correlation matrix of the decorrelated vectors in the right column of Figure 1. We observe from Figure 1 (right) that for both choices of the covariance function, the sample covariance of the NNGP-decorrelated datasets is close to the identity matrix. To further confirm this, in the bottom row we plot the densities of the diagonal and off-diagonal elements of the decorrelated covariance matrix. Once again, for both choices of covariance function, the density of the off-diagonal elements concentrates around 0 while that of the diagonal elements concentrates around 1. Thus, swapping the dense inverse Cholesky factor $\Sigma^{-1/2}$ with the NNGP analog $\widetilde{\Sigma}^{-1/2}$ achieves fast decorrelation without significant loss of information.

Figure 1: Sample correlation matrices of the NNGP-decorrelated datasets (right column) for the exponential and the Matérn 3/2 covariance functions; (bottom) density of diagonal and off-diagonal entries of the sample decorrelated matrix.

The second part of the bootstrap algorithm requires correlating back the resampled errors $r^{(b)}$ to get the bootstrapped datasets $y^{(b)} = X\beta + \Sigma^{1/2} r^{(b)}$. While it is natural to consider replacing $\Sigma^{1/2}$ with the NNGP approximation
$$\widetilde{\Sigma}^{1/2} = L_y^{-1} = (I - B_y)^{-1} F_y^{1/2}, \qquad (17)$$
unlike $L_y$, its inverse $L_y^{-1}$ is not directly available. Instead, Saha and Datta (2018a) proposed an $O(n)$-time algorithm to compute products of the form $L_y^{-1} v$ required in the correlate-back step. Note that $u = L_y^{-1} v$ can be obtained by solving the triangular system $L_y u = v$ for $u$. Now $L_y = \widetilde{\Sigma}^{-1/2}$ has already been computed. From (8), we know that $L_y = (l^{(y)}_{ij})$ is lower triangular with at most $m$ non-zero off-diagonal elements per row. Hence, one can solve for $u = (u_1, \ldots, u_n)^\top$ by forward substitution:
$$u_1 = v_1 / l^{(y)}_{11}, \qquad u_i = \Big(v_i - \sum_{j < i} l^{(y)}_{ij} u_j\Big) \Big/ l^{(y)}_{ii}, \quad i = 2, \ldots, n. \qquad (18)$$
Since each row of $L_y$ has at most $m$ off-diagonal non-zero entries, each sum in (18) involves at most $m$ terms, and the entire solve requires only $O(nm)$ FLOPs. The bootstrap can now be performed using the BRISC bootstrap function of the BRISC CRAN R-package (Saha and Datta, 2018b).
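The forward substitution (18) is a standard sparse triangular solve. Below is a hedged sketch using SciPy, with L_y the sparse factor built as in the Section 2 sketch but for the response covariance (illustrative names, not the BRISC internals).

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def correlate_back(L_y, v):
    # Forward substitution (18): u_i = (v_i - sum_{j<i} l_ij u_j) / l_ii.
    # Each row of L_y has at most m off-diagonal nonzeros, so this is O(n m).
    return spsolve_triangular(L_y.tocsr(), v, lower=True)
```

A bootstrapped dataset is then obtained as $y^{(b)} = X\widehat{\beta} + \texttt{correlate\_back}(L_y, r^{(b)})$.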
3.2 Simulation of Gaussian random fields

Simulating large datasets from a full GP is also computationally prohibitive. This is because generating realizations from $N(0, \Sigma)$ also involves either the Cholesky or the eigenvalue decomposition of $\Sigma$. Both require $O(n^2)$ storage and $O(n^3)$ time, and cannot be accomplished using personal computing resources even for moderately large datasets. To illustrate, for the set of experiments detailed below, a personal computer with 16 GB of memory could not generate random draws from a full GP for $n \geq 10{,}000$ (see Table 1).

Simulating from an NNGP is essentially the same as the correlate-back step of the bootstrap described in Section 3.1, as the NNGP covariance matrix $\widetilde{\Sigma}$ well approximates the full GP covariance $\Sigma$. If $z \sim N(0, I)$, we have $y^{(sim)} = \widetilde{\Sigma}^{1/2} z \sim N(0, \widetilde{\Sigma}) \approx N(0, \Sigma)$. Thus, a random draw $y^{(sim)}$ of size $n$ from an NNGP only involves generating $n$ iid $N(0, 1)$ draws $z_1, \ldots, z_n$ and computing the product $\widetilde{\Sigma}^{1/2} z$. As outlined in Section 3.1, this product can be obtained by first computing the reverse Cholesky factor $L_y = \widetilde{\Sigma}^{-1/2}$ and then solving for $y^{(sim)}$ in the triangular system $L_y y^{(sim)} = z$ using the sparse back-solve of (18). The entire process once again remains linear in storage and time.

Figure 2 assesses the quality of simulation of random fields using the full GP and NNGP for $n = 1000$ randomly chosen locations and two choices of covariance functions (exponential and Matérn 3/2). For each setting, 10000 random draws are generated and the sample covariance matrix is plotted both for the full GP and NNGP. We see that for each choice of the covariance function, the heat-maps of the sample covariance matrices look similar for the full GP and NNGP draws. Both methods display sharper decay for the exponential covariance function, and slower decay for the smoother Matérn 3/2 function. For a more quantitative assessment of the approximation, we took the difference between the full GP and NNGP sample covariance matrices and plotted the density of these differences in the bottom row of Figure 2. We see that for both choices of covariance function, these densities peak around 0, demonstrating the closeness of the approximation.

We also compare the run times for generating draws using the full GP and NNGP in Table 1. We use the same two choices of covariance functions and sample sizes of $n = 1000$, $2500$, $5000$, $10000$ and $100000$. We see that the NNGP run times grow slightly faster than linearly as $n$ increases. This is because, while the computation and storage for NNGP are linear in $n$, the method requires the neighbor sets as inputs. There is a one-time cost of finding these sets of $m$ nearest neighbors for each location. This step can be time-consuming for large $n$, as detailed in Finley et al. (2020b). The times for NNGP presented in Table 1 include the timing for this neighbor-search step, which explains the discrepancy. However, even with this expensive one-time operation, the simulation times for NNGP are impressive, being able to generate 10000 draws of 100000 realizations of the random field in less than 15 minutes on a personal computer.

Simulations from NNGP can now be performed using the BRISC simulation function of the BRISC CRAN R-package (Saha and Datta, 2018b).
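In code, each draw is one more sparse triangular solve, reusing correlate_back() from Section 3.1 (a sketch with illustrative names, not the BRISC implementation).

```python
import numpy as np

def nngp_simulate(L_y, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal(L_y.shape[0])  # z ~ N(0, I)
    # solves L_y y_sim = z, so Cov(y_sim) = (L_y' L_y)^{-1} = Sigma~
    return correlate_back(L_y, z)
```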
Figure 2: Simulation of Gaussian random fields using (left column) the full GP and (right column) NNGP for the exponential covariance function (top row) and the Matérn 3/2 covariance function (middle row). The bottom row plots the density of the entries of the difference matrix between the sample covariance matrices generated from the full GP and NNGP.

Table 1: Times (in seconds) for generating 10000 random field draws using NNGP and the full GP; NA indicates the draws could not be generated with 16 GB of memory.

Covariance function | Sample size | NNGP | full GP
exponential | 1000 | 3 | 12
exponential | 2500 | 6 | 83
exponential | 5000 | 11 | 464
exponential | 10000 | 31 | NA
exponential | 100000 | 831 | NA
Matérn 3/2 | … | … | …

3.3 Random forests for spatial data

The hierarchical model (5) is a linear mixed-effects model with a linear covariate effect $x(s)^\top \beta$ and a spatial random effect $w(s)$ modeled as a GP. In recent years, owing to the progress of Geographical Information Systems (GIS), we have witnessed a deluge of large geospatial datasets in many research fields. These large datasets have offered the opportunity to generalize the strong linearity assumption about the covariate effect to more general non-linear classes of functions $h(x(s))$, and to use data-intensive machine learning methods for their estimation. A spatial non-linear mixed effects model is given by
$$y(s_i) = h(x(s_i)) + w(s_i) + \epsilon(s_i), \quad w \sim N(0, C), \quad \epsilon(s_i) \overset{iid}{\sim} N(0, \tau^2). \qquad (19)$$
Akin to (1), integrating out the spatial random effects leads to the marginal model
$$y \sim N(h, \Sigma) \quad \text{where } h = (h(x(s_1)), \ldots, h(x(s_n)))^\top. \qquad (20)$$
Basis functions offer a natural avenue for such non-linear spatial mixed models. We can simply replace the linear covariate effect $x^\top \beta$ with $h(x) = B(x)^\top \gamma$, where $B(x)$ is the basis expansion at $x$ for a choice of basis $B$ and $\gamma$ are the unknown basis coefficients. As the problem remains linear in the parameters $\gamma$, this generalization does not necessitate any change in estimation strategies. However, basis functions are generally not ideal for modeling certain classes of functions (e.g., functions with discontinuities like step functions or piecewise continuous functions). More importantly, they suffer from the curse of dimensionality and are often inadequate if the number of covariates is greater than 3 or 4 (Taylor and Einbeck, 2013).

Random forests (RF) (Breiman, 2001) have become one of the mainstays of non-parametric function estimation. RF has been shown to be an asymptotically consistent function estimation method (Scornet et al., 2015), but the guarantees are established only under iid errors. Despite this, RF has been used extensively in spatial or temporal settings where the data are correlated. Such applications have often ignored the data dependence, not using the spatial locations or the spatial covariance $\Sigma$, and simply fitting the classic RF algorithm using only $y$ and $X$. Saha et al. (2020) demonstrated that ignoring this spatial information degrades the performance of RF. Hengl et al. (2018) developed a 'spatial random forests (spRF)' approach that uses the pairwise distances between a location and all other locations as additional covariates. For $n$ locations, this adds $n - 1$ additional covariates, considerably inflating the dimensionality of the problem for large $n$.

Saha et al. (2020) instead proposed directly estimating $h$ in the spatial non-linear mixed model (19) using a novel random forest algorithm that accounts for the spatial dependence. The RF estimate is the average of many regression trees (Breiman et al., 1984). The key observation for the extension of random forests to spatial data is that the creation of a regression tree can be equivalently characterized as a greedy OLS optimization algorithm. To elaborate, a tree estimate (with $K$ leaf nodes) of $h$ can be represented using an $n \times K$ binary membership matrix $Z$, whose $(i, j)$th element is 1 if the $i$th observation belongs to the $j$th leaf node, and a vector of values $\beta_1, \ldots, \beta_K$, where $\beta_k$ is the value of the estimate representing the $k$th node.
Thus, at the $r$th iteration of the algorithm, $h$ is modeled as $Z^{(r)} \beta^{(r)}$ for the current membership matrix $Z^{(r)}$ and representative values $\beta^{(r)}$. To find the best choices of $Z^{(r)}$ and $\beta^{(r)}$, one optimizes the ordinary least squares (OLS) loss
$$(\widehat{Z}^{(r)}, \widehat{\beta}^{(r)}) = \arg\min_{Z^{(r)} \in \mathcal{C}^{(r)},\, \beta^{(r)}} \; \| y - Z^{(r)} \beta^{(r)} \|^2. \qquad (21)$$
Here $\mathcal{C}^{(r)}$ is the class of eligible design matrices given the previous set of nodes in the tree and the covariate values. The optimal $\widehat{Z}^{(r)}$ gives the next set of nodes created in the tree, while the corresponding $\widehat{\beta}^{(r)}$ gives the updated set of node representative values. This process of tree creation is continued until a termination criterion is met (maximum number of nodes, minimum number of members per node, etc.).

Saha et al. (2020) noted that the OLS optimization in (21) is equivalent to obtaining the MLE of $(Z^{(r)}, \beta^{(r)})$ from the model $y \sim N(Z^{(r)} \beta^{(r)}, I)$. Under spatial dependence, we have $\mathrm{Cov}(y) = \Sigma$. To incorporate this spatial covariance, one should consider the model (20) with $h = Z^{(r)} \beta^{(r)}$, i.e., $y \sim N(Z^{(r)} \beta^{(r)}, \Sigma)$. Thus, for a given $\Sigma$, they proposed optimizing a generalized least squares (GLS) loss
$$(\widehat{Z}^{(r)}, \widehat{\beta}^{(r)}) = \arg\min_{Z^{(r)} \in \mathcal{C}^{(r)},\, \beta^{(r)}} \; (y - Z^{(r)} \beta^{(r)})^\top \Sigma^{-1} (y - Z^{(r)} \beta^{(r)}). \qquad (22)$$
The new GLS loss accounts for the spatial dependence in the data. For the optimal choice of $\widehat{Z}^{(r)}$, the node representatives are given by the GLS estimate
$$\widehat{\beta}^{(r)} = \left( \widehat{Z}^{(r)\top} \Sigma^{-1} \widehat{Z}^{(r)} \right)^{-1} \widehat{Z}^{(r)\top} \Sigma^{-1} y. \qquad (23)$$
The optimization criterion (22) and the node representatives (23) complete the algorithm for one regression tree. To create a forest from trees in iid settings, data are resampled and a regression tree is estimated for each resampled dataset; the trees are then averaged to create the forest estimate. However, as discussed in Section 3.1, direct resampling of spatially correlated data violates the assumptions of the bootstrap, and it is unclear what $\Sigma$ would be for a resampled dataset, as the correlation between two resamples of the same data unit $(y_i, x(s_i), s_i)$ would be 1.

The GLS approach to random forests offers a synergistic solution to the resampling problem. A GLS regression between $y$ and some design matrix $Z$, with a covariance matrix $\Sigma$, can be thought of as an OLS regression between the decorrelated vector $y^* = \Sigma^{-1/2} y$ and $\Sigma^{-1/2} Z$. As in Section 3.1, for spatial data it makes sense to resample the decorrelated vector $y^*$. Thus, for a resampling matrix $P_t$ used to create the $t$th regression tree, we can simply modify (22) for the resample and use
$$(\widehat{Z}^{(r)}, \widehat{\beta}^{(r)}) = \arg\min_{Z^{(r)} \in \mathcal{C}^{(r)},\, \beta^{(r)}} \; (y - Z^{(r)} \beta^{(r)})^\top \Sigma^{-\top/2} P_t^\top P_t \Sigma^{-1/2} (y - Z^{(r)} \beta^{(r)}) \qquad (24)$$
for partitioning of the nodes. The corresponding node representatives are now given by
$$\widehat{\beta}^{(r)} = \left( \widehat{Z}^{(r)\top} \Sigma^{-\top/2} P_t^\top P_t \Sigma^{-1/2} \widehat{Z}^{(r)} \right)^{-1} \widehat{Z}^{(r)\top} \Sigma^{-\top/2} P_t^\top P_t \Sigma^{-1/2} y. \qquad (25)$$
Note that both the partitioning criterion (24) and the node values (25) require computing the Cholesky factor $\Sigma^{-1/2}$. For large $n$, this would not be feasible. Hence, for practical implementation, once again the NNGP Cholesky factor $\widetilde{\Sigma}^{-1/2}$ is used in all the steps, thereby ensuring linear time and storage for the algorithm.

Saha et al. (2020) referred to this new random forest algorithm as 'RF-GLS', owing to its strong connection to generalized least squares.
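Via decorrelation, the GLS split search reduces to ordinary least squares. Below is a hedged sketch (not the RandomForestsGLS implementation; names illustrative) evaluating (22)-(23) for one candidate membership matrix Z, with L the sparse NNGP factor approximating $\Sigma^{-1/2}$ from the earlier sketches.

```python
import numpy as np

def gls_loss(L, Z, y):
    # Decorrelate with L ~ Sigma^{-1/2}: the GLS fit (22)-(23) becomes OLS on (LZ, Ly)
    ys, Zs = L @ y, L @ Z
    beta, *_ = np.linalg.lstsq(Zs, ys, rcond=None)  # node representatives (23)
    r = ys - Zs @ beta
    return r @ r, beta  # GLS loss (22) for this candidate split, and node values
```

A tree search loops this over the candidate splits in $\mathcal{C}^{(r)}$ and keeps the minimizer; the resampled criterion (24) is obtained by applying the resampling matrix $P_t$ to the decorrelated quantities $Ly$ and $LZ$.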
Using numerical studies across a wide span of settings, they showed that RF-GLS substantially improves over naive RF for estimating the mean function $h$, and over both naive RF and the spRF of Hengl et al. (2018) for spatial prediction. Asymptotic consistency of RF-GLS was also established for a wide range of dependent processes, including data generated from a full Matérn GP and fitted using the sparse NNGP Cholesky factor. The method also works for function estimation under other types of dependence, e.g., autoregressive time series, which also yield sparse Cholesky factors, and has been implemented in the R-package RandomForestsGLS (Saha et al., 2021).

3.4 Sparse Cholesky models for areal data

Most applications of sparse Cholesky matrices in spatial statistics, as discussed in this review, pertain to the paradigm of point-referenced or geospatial data. However, the idea has also been exploited recently to propose a new class of generative and interpretable models for areal (geographically-aggregated) data. Each unit in such a dataset represents a geographical region (e.g., counties or states). The data can be envisioned as consisting of triplets $(y_i, x_i, R_i)$, where $R_i$ is the $i$th region, and $y_i$ and $x_i$ are the corresponding response and covariates. A generalized linear mixed model commonly used to analyze such data is specified as
$$g(E(y_i)) = x_i^\top \beta + w_i, \quad i = 1, \ldots, n, \qquad (26)$$
where $g$ is a suitable link function for the outcome type, $x_i^\top \beta$ is the fixed covariate effect, and $w_i$ is an area-specific random effect. The random effect vector $w = (w_1, \ldots, w_n)^\top$ is typically endowed with a multivariate Gaussian prior with the covariance (or precision) matrix capturing the geographical information.

Unlike point-referenced data, where the location of each unit is a 2- or 3-dimensional coordinate vector, it is challenging to encapsulate the information of an entire geographical area $R_i$ by some representative coordinates. Hence, a typical approach to capturing spatial structure in areal data is to represent the geography in terms of a graph $\mathcal{G}$ with vertices given by the set of regions, and an adjacency matrix $A = (A_{ij})$ specified to capture the relative spatial information of the regions. One common way of doing this is setting $A_{ij} = 1$ if regions $i$ and $j$ share a border. Subsequently, to impose smoothness across neighboring regions (which share an edge in the graph), a popular prior for $w$ is the 'Conditional Autoregressive (CAR)' model
$$w \sim N\big(0, \sigma^2 (D - \rho A)^{-1}\big), \qquad (27)$$
where $D = \mathrm{diag}(m_1, \ldots, m_n)$ is a diagonal matrix with $m_i$ being the number of neighbors of $R_i$, $\sigma^2$ is the marginal variance parameter, and $\rho$ controls the degree of spatial smoothness.

The CAR model was originally proposed by Besag (1974) via the conditional distributions
$$w_i \mid w_{-i} \sim N\Big( \frac{1}{m_i} \sum_{j \sim i} w_j,\; \frac{\sigma^2}{m_i} \Big), \qquad (28)$$
where $w_{-i} = (w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n)^\top$ and $j \sim i \iff A_{ij} = 1$. The conditional model can be shown to be equivalent to (27) for $\rho = 1$ and offers a much clearer insight into why this model effectuates spatial smoothing: the conditional mean of $w_i$ is the mean of its neighbors, and the conditional precision is proportional to the number of neighbors. However, for $\rho = 1$, the precision matrix $D - A$ is singular for all graphs. This is an improper prior and is referred to as the 'Improper or Intrinsic CAR (ICAR)'. The ICAR model was later extended to include the parameter $\rho < 1$, which makes the distribution proper. However, several studies have highlighted the lack of interpretability of $\rho$ and its complicated association with the spatial correlation induced (Wall, 2004; Assuncao and Krainski, 2009; Banerjee et al., 2014).
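These issues can be verified directly by computing (27) on a small graph. Below is a sketch along the lines of the 3 × 3 grid illustration discussed next (illustrative code, not from the paper).

```python
import numpy as np

def grid_adjacency(k=3):
    # adjacency matrix of a k x k grid graph: edges between bordering cells
    A = np.zeros((k * k, k * k))
    for r in range(k):
        for c in range(k):
            i = r * k + c
            if c + 1 < k:
                A[i, i + 1] = A[i + 1, i] = 1  # horizontal edge
            if r + 1 < k:
                A[i, i + k] = A[i + k, i] = 1  # vertical edge
    return A

A = grid_adjacency()
D = np.diag(A.sum(axis=1))
for rho in (0.2, 0.5, 0.9):
    V = np.linalg.inv(D - rho * A)        # CAR covariance (27) with sigma2 = 1
    sd = np.sqrt(np.diag(V))
    print(rho, np.round(np.diag(V), 3))   # marginal variances differ by vertex degree
    print('corner-edge corr:', round(V[0, 1] / (sd[0] * sd[1]), 3))  # nonlinear in rho
```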
In addition, we also note that the resulting covariance matrix from the CAR model is heteroskedastic. To illustrate these issues, we study the CAR covariance matrix $(D - \rho A)^{-1}$ for a simple 3 × 3 grid graph, grouping the vertices by their number of neighbors and grouping the neighbor pairs (edges) analogously. Figure 3 (middle) then plots the CAR variance for each vertex group as a function of $\rho$. We see that the variances are not the same across groups, i.e., the CAR model is not homoskedastic. Regions with a higher number of neighbors are endowed with lower marginal variance. While it is reasonable to model the conditional variance of a unit (given its neighbors) to be smaller if there are more neighbors, the same logic should not apply to the marginal variance. Hence, this heteroskedasticity is problematic. Figure 3 (right) plots the correlation between neighboring units as a function of $\rho$. We see that the correlation is again not the same for each neighbor pair, and that the relationship between this correlation and $\rho$ is highly non-linear. This prohibits any direct interpretation of $\rho$, as has been highlighted in the aforementioned studies.

Figure 3: (left) A 3 × 3 grid graph with grouped vertices and edges, (middle) CAR-induced marginal variances for each vertex group as a function of $\rho$, (right) CAR-induced neighbor-pair correlations for each edge group as a function of $\rho$.

Datta et al. (2019) proposed a new class of models based on sparse Cholesky factors that circumvents these issues of CAR models. The construction is identical to (8) and (10), which can be viewed as a general method to create multivariate Gaussian distributions via Cholesky factors. Specific choices of $B$ and $F$ lead to different models. In the context of the geospatial modeling discussed in the previous sections, the central step was to select the neighbor sets. Given the neighbor sets, the matrices $B$ and $F$ are automatically determined from the covariance matrix $C$ of the full GP joint distribution which the NNGP is approximating.

For areal data, the neighbor sets are naturally defined by the graph $\mathcal{G}$ as regions sharing an edge in the adjacency matrix. Hence, Datta et al. (2019) proposed setting $b_{ij} = 0$ in (8) unless $j \sim i$ and $j < i$. However, the central choice here involves specifying the non-zero $b_{ij}$'s and the $f_i$'s. Unlike NNGP, which used the covariance matrix $C$ from the full GP covariance, for areal data there is no suitable joint distribution from which to derive $B$ and $F$. Using the CAR covariance matrix would lead to an approximation of it that inherits its aforementioned drawbacks. If the graph were a tree (acyclic graph), then a valid candidate from which to derive $B$ and $F$ is the class of autoregressive covariances on trees, i.e., $C = (\rho^{d_{ij}})$, where $d_{ij}$ denotes the shortest path distance between vertices $i$ and $j$ on the graph (Basseville et al., 2006). However, graphs generated by specifying edges between geographical regions sharing a border will typically not be acyclic, and such a $C$ will not be positive definite. One option would be to approximate the whole graph with a minimal spanning tree (MST), but such an approximation leads to large errors by leaving out many edges (Sudderth, 2002).

Datta et al. (2019) noted that specifying the $i$th row of $B$ and $f_i$ only requires a working assumption on the joint distribution over the sub-graph $\mathcal{G}_i$ corresponding to the vertices $i \cup \{j \mid j \sim i, j < i\}$. They then used an autoregressive covariance matrix on the MST $\mathcal{T}_i$ of this sub-graph, which has only edges between $i$ and all $j \sim i, j < i$. This leads to the following choices:
$$b_{ij} = I(j \sim i, j < i)\, \frac{\rho}{1 + (n_{<i} - 1)\rho^2}, \qquad f_i = \frac{1 - \rho^2}{1 + (n_{<i} - 1)\rho^2}, \qquad (29)$$
where $n_{<i} = |\{j \mid j \sim i, j < i\}|$ denotes the number of such directed neighbors of region $i$ (with $f_i = 1$ when $n_{<i} = 0$).
For $\rho = 1$, the conditional mean resulting from (29) becomes $\frac{1}{n_{<i}} \sum_{j \sim i,\, j < i} w_j$, i.e., $w_i \mid w_1, \ldots, w_{i-1}$ is centered at the average of its directed neighbors, mirroring the ICAR conditional distributions (28) restricted to directed neighbors. Datta et al. (2019) referred to this model as the 'Directed acyclic graph autoregression (DAGAR)' model, because of its reliance on autoregressive covariances and because the undirected graph $\mathcal{G}$, combined with the ordering imposed to construct the sparse Cholesky factor, essentially yields a directed acyclic graph. Several nice properties of the DAGAR model have been established with regards to parameter interpretability and computation. For any tree or grid graph, DAGAR covariance matrices were shown to be homoskedastic, and $\rho$ could be directly interpreted as the exact correlation between each pair of neighboring regions. If the graph $\mathcal{G}$ has $e$ edges, the storage and computational cost of evaluating the DAGAR likelihood is $O(n + e)$, the same as the sparsity of the Cholesky factor $F^{-1/2}(I - B)$, or equivalently of $B$. Simulation studies covering a wide range of scenarios showed that DAGAR models tend to perform better than CAR models for low or moderate spatial correlation in the data, while the performance of the two classes of models is similar in the presence of strong spatial correlation. The model has subsequently been generalized to include multivariate outcomes (Gao et al., 2020).
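A sketch assembling the DAGAR ingredients follows; this is a hedged reading of (29) for a given adjacency matrix and ordering, not the authors' implementation (names illustrative).

```python
import numpy as np

def dagar_factor(A, rho):
    # b_ij and f_i per (29), derived from the autoregressive covariance on
    # the star-shaped MST of each sub-graph G_i
    n = A.shape[0]
    B = np.zeros((n, n))
    f = np.ones(n)  # f_i = 1 when region i has no directed neighbors
    for i in range(1, n):
        past = np.flatnonzero(A[i, :i])  # directed neighbors j ~ i with j < i
        k = len(past)
        if k > 0:
            B[i, past] = rho / (1 + (k - 1) * rho ** 2)       # b_ij in (29)
            f[i] = (1 - rho ** 2) / (1 + (k - 1) * rho ** 2)  # f_i in (29)
    return np.diag(1 / np.sqrt(f)) @ (np.eye(n) - B), f       # precision = L'L, as in (11)
```

On a path graph, for instance, `np.diag(np.linalg.inv(L.T @ L))` is identically one and every neighbor-pair correlation equals $\rho$, consistent with the homoskedasticity and interpretability properties noted above.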
4 Discussion

Vecchia's approximation, NNGP and the resulting sparse Cholesky matrices have become increasingly popular in recent years owing to their spectacular empirical success in mimicking inference from full Gaussian Processes while being extremely scalable. In this manuscript we reviewed these methods and discussed in greater detail some recent and diverse applications of these sparse Cholesky factors beyond their traditional usage in scalable estimation and prediction for geospatial data. We conclude the manuscript with a discussion of some yet-to-be-addressed aspects of this approach, which open up potential avenues of future research.

4.1 Ordering and homoskedasticity

As discussed in Section 2.3, the sparse Cholesky factors generated from NNGP rely on a predetermined ordering of the locations. The order dependence is evident from the construction in (10) and can affect the performance of the method in terms of parameter efficiency (Guinness, 2018). For a stationary covariance function $C$, the full GP covariance matrix $C$ is homoskedastic, i.e., $C_{ii} = \sigma^2$, the spatial variance. The impact of the NNGP approximation on homoskedasticity has not been studied. We show empirically here that the NNGP precision matrices are not homoskedastic in general and that the degree of departure from homoskedasticity can be order-dependent.

We generated 200 locations randomly on a unit square, and used an exponential covariance function with $\sigma^2 = 1$ and $\phi = 1$, and $m = 5$ nearest neighbors. Figure 4 plots the variance of the $w_i$'s against the ordering $i$, where $w = (w_1, \ldots, w_n)^\top \sim N(0, \widetilde{C})$. The left figure is for coordinate-based ordering, and we see that with increasing location order $i$ there is a sharp decline in the NNGP variances from the true variance of 1. For units ordered towards the very end, these variances are as low as 0.75. The right figure is for random ordering. The lack of homoskedasticity is less prominent here, with a worst-case variance of around 0.9. However, both the magnitude and the frequency of the decline in the NNGP variances still generally increase with the order of the unit $i$ (i.e., towards the right side of the $x$-axis).
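This check takes a few lines given the factor-building sketch from Section 2; the settings below mirror the experiment in the text (illustrative code).

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.random((200, 2))  # 200 random locations on the unit square
orderings = {'coordinate': np.argsort(coords[:, 0]),
             'random': rng.permutation(200)}
for name, order in orderings.items():
    # nngp_factor() is the sketch from Section 2; sigma2 = phi = 1, m = 5 as in the text
    L, f = nngp_factor(coords[order], m=5, sigma2=1.0, phi=1.0)
    Ctilde = np.linalg.inv((L.T @ L).toarray())  # NNGP covariance C~
    print(name, np.diag(Ctilde).min().round(3))  # < 1 indicates loss of homoskedasticity
```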
These results show that NNGP variances often underestimate the full GP variance, and that the extent of this is heavily dependent on the choice of ordering. This under-estimation of the variance can potentially manifest itself in aggressive prediction intervals with poor coverage for units ordered towards the end. If the ordering is geographical (like coordinate-based), this may imply that the quality of prediction can degrade in regions corresponding to higher orderings. For random ordering, as the units ordered towards the end will be spread out randomly throughout the whole region, the quality of prediction intervals for a specific region is less likely to be impacted, although overall prediction quality may still suffer. Finally, as the number of neighbors $m$ is increased, the quality of the NNGP approximation improves and, for all choices of ordering, the NNGP variances are much closer to 1. However, it is unknown whether there exists an ordering and neighbor-selection strategy that can exactly preserve homoskedasticity.

Figure 4: Spatial variances in NNGP plotted against the order-index of the location for coordinate-based ordering (left) and random ordering (right).

4.2 Nearest neighbors in high-dimensional input domains

Gaussian Processes are widely used as stochastic emulators of computer outputs (Currin et al., 1991), where the 'space' is not physical space but some abstract multi-dimensional parameter domain. In such cases the locations $s$ are high-dimensional, and each coordinate of the location (corresponding to a different parameter) has a different scale of variation. The choice of Euclidean distance-based neighbor sets used in Vecchia's approximation and NNGP is rooted in the assumption of isotropic covariance functions. These covariance functions ignore the relative scales of each coordinate of the location vector and are not suitable for GP-based emulators.

Separable covariance functions, which endow each coordinate with its own scale parameter, are widely popular for such emulation tasks using high-dimensional inputs (locations). If $s = (s^{(1)}, \ldots, s^{(K)})$ denotes a $K$-dimensional location, then a separable covariance function between two such locations $s_i$ and $s_j$ is given by
$$C(s_i, s_j \mid \theta) = \prod_{k=1}^{K} C_k\big(s_i^{(k)}, s_j^{(k)} \mid \theta_k\big), \qquad (30)$$
where the $C_k$'s are covariance functions for one-dimensional locations. For example, if each $C_k$ is chosen from the popular Gaussian family of covariances with decay parameter $\phi_k$, the resulting covariance function is given by
$$C(s_i, s_j \mid \theta) = \sigma^2 \exp\left( - \sum_{k=1}^{K} \phi_k \big( s_i^{(k)} - s_j^{(k)} \big)^2 \right). \qquad (31)$$
It is unclear how to guide the neighbor set selection of NNGP for such separable covariance functions on high-dimensional input spaces. Euclidean-distance based nearest neighbors of a given location no longer have the highest correlation with that location. The correlation contour is now highly dependent on the values of the decay parameters $\phi_1, \ldots, \phi_K$. For example, if $\phi_1 \gg \phi_k$ for all $k \neq 1$, selecting neighbors based on the first coordinate rather than the total Euclidean distance will lead to a much better approximation. As these parameters $\phi_k$ are unknown, it is challenging to set such a strategy a priori. One can pursue a dynamic neighbor-selection strategy, searching for locations that have the highest correlation with a given location for each new value of the $\phi_k$'s. However, such brute-force neighbor searches at each iteration of an MCMC or optimization algorithm will drastically slow down the method. Datta et al. (2016c) explored this problem in the context of spatio-temporal models, which also use anisotropic covariance functions, as variation along space and time also has different scales. They proposed a more restricted search perimeter that is guaranteed to contain the $m$ highest-correlated neighbors.
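For the Gaussian family (31) specifically, one way to illustrate the issue (a sketch of my own, not a method from the paper) is to note that highest-correlation neighbors are exactly nearest neighbors after rescaling each coordinate by $\sqrt{\phi_k}$, so the search must be redone whenever the $\phi_k$'s are updated.

```python
import numpy as np
from scipy.spatial import cKDTree

def correlation_neighbors(S, phi, m=10):
    # For the Gaussian covariance (31), correlation decreases monotonically in the
    # Euclidean distance after scaling coordinate k by sqrt(phi_k), so the m
    # highest-correlation neighbors are the m nearest neighbors in scaled inputs.
    scaled = S * np.sqrt(np.asarray(phi))
    _, idx = cKDTree(scaled).query(scaled, k=m + 1)
    return idx[:, 1:]  # drop each point itself
```

This rescaling trick is specific to the form (31); for general separable covariances (30), and within an MCMC where $\phi$ changes every iteration, repeating even a KD-tree build at each step adds substantial cost, which is precisely the difficulty discussed above.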
Whether such a strategy is computationally tenable for high-dimensional input spaces needs to be researched. As input points are further spread apart in higher dimensions, research also needs to be conducted to investigate how the established trade-offs between the quality of nearest neighbor (or highest-correlation) approximations and the size of the neighbor set hold up for such high-dimensional locations.

References

Assuncao, R. and Krainski, E. (2009). Neighborhood dependence in Bayesian spatial models. Biometrical Journal, 51:851–869.

Banerjee, S. (2017). High-dimensional Bayesian geostatistics. Bayesian Analysis, 12(2):583.

Banerjee, S. (2020). Modeling massive spatial datasets using a conjugate Bayesian linear modeling framework. Spatial Statistics, 37:100417. Frontiers in Spatial and Spatio-temporal Research.

Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical Modeling and Analysis for Spatial Data. CRC Press.

Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):825–848.

Barbian, M. H. and Assunção, R. M. (2017). Spatial subsemble estimator for large geostatistical data. Spatial Statistics, 22:68–88.

Basseville, M., Benveniste, A., Chou, K. C., Golden, S. A., Nikoukhah, R., and Willsky, A. S. (2006). Modeling and estimation of multiresolution stochastic processes. IEEE Trans. Inf. Theor., 38(2):766–784.

Besag, J. (1974). Spatial interaction and statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, 36:192–225.

Bevilacqua, M. and Gaetan, C. (2015). Comparing composite likelihood methods based on pairs for spatial Gaussian random fields. Statistics and Computing, 25(5):877–892.

Bradley, J. R., Cressie, N., Shi, T., et al. (2016). A comparison of spatial predictors when datasets could be very large. Statistics Surveys, 10:100–131.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and Regression Trees. CRC Press.

Chen, H. F. (2002). Stochastic Approximation and Its Applications. Dordrecht: Kluwer Academic Publishers.

Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226.

Cressie, N. and Wikle, C. K. (2015). Statistics for Spatio-temporal Data. John Wiley & Sons.

Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association, 86(416):953–963.

Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016a). Hierarchical Nearest-Neighbor Gaussian Process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.

Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016b). On nearest-neighbor Gaussian process models for massive spatial data. Wiley Interdisciplinary Reviews: Computational Statistics, 8(5):162–171.

Datta, A., Banerjee, S., Finley, A. O., Hamm, N. A., and Schaap, M. (2016c). Nonseparable dynamic nearest neighbor Gaussian process models for large spatio-temporal data with an application to particulate matter analysis. The Annals of Applied Statistics, 10(3):1286.

Datta, A., Banerjee, S., and Hodges, J. S. (2019). Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models. Bayesian Analysis, 14(4):1221–1244.
Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M., and Niemi, J. (2014). Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics, 23(2):295–315.

Finley, A., Datta, A., and Banerjee, S. (2020a). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.4.

Finley, A. O., Datta, A., and Banerjee, S. (2020b). R package for nearest neighbor Gaussian process models. arXiv preprint arXiv:2001.09111.

Finley, A. O., Datta, A., Cook, B. D., Morton, D. C., Andersen, H. E., and Banerjee, S. (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28:401–414.

Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis, 53(8):2873–2884.

Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3):502–523.

Gao, L., Banerjee, S., and Datta, A. (2020). Spatial modeling for correlated cancers using bivariate directed graphs. Annals of Cancer Epidemiology.

Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.

Guhaniyogi, R. and Banerjee, S. (2018). Meta-kriging: Scalable Bayesian modeling and inference for massive spatial datasets. Technometrics, 60(4):430–444.

Guinness, J. (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429.

Guinness, J. and Fuentes, M. (2017). Circulant embedding of approximate covariances for inference from Gaussian data on large lattices. Journal of Computational and Graphical Statistics, 26(1):88–97.

Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., et al. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425.

Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B., and Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 6:e5518.

Jones, R. H. and Zhang, Y. (1997). Models for continuous stationary space-time processes. In Modelling Longitudinal and Spatially Correlated Data, pages 289–298. Springer.

Jurek, M. and Katzfuss, M. (2020). Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering. arXiv preprint arXiv:2006.16901.

Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517):201–214.

Katzfuss, M. and Guinness, J. (2021). A general framework for Vecchia approximations of Gaussian processes. Statistical Science, 36(1):124–141.

Katzfuss, M., Guinness, J., Gong, W., and Zilber, D. (2020a). Vecchia approximations of Gaussian-process predictions. Journal of Agricultural, Biological and Environmental Statistics, 25(3):383–414.

Katzfuss, M., Jurek, M., Zilber, D., and Gong, W. (2020b). GPvecchia: Scalable Gaussian-Process Approximations. R package version 0.1.3.
Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103(484):1545–1555.

Lindgren, F., Rue, H., and Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498.

Ma, P., Konomi, B. A., and Kang, E. L. (2019). An additive approximate Gaussian process model for large spatio-temporal data. Environmetrics, 30(8):e2569.

Mardia, K. V. and Marshall, R. J. (1984). Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika, 71(1):135–146.

Nascimento, M. and Shaby, B. A. (2020). A Vecchia approximation for high-dimensional Gaussian cumulative distribution functions arising from spatial data. arXiv preprint arXiv:2007.15195.

Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015). A multiresolution Gaussian process model for the analysis of large spatial datasets. Journal of Computational and Graphical Statistics, 24(2):579–599.

Olea, R. A. and Pardo-Igúzquiza, E. (2011). Generalized bootstrap method for assessment of uncertainty in semivariogram inference. Mathematical Geosciences, 43(2):203–228.

Risser, M. D. and Turek, D. (2019). Bayesian nonstationary Gaussian process modeling: the BayesNSGP package for R. arXiv preprint arXiv:1910.14101.

Saha, A., Basu, S., and Datta, A. (2020). Random forests for dependent data. arXiv preprint arXiv:2007.15421.

Saha, A., Basu, S., and Datta, A. (2021). RandomForestsGLS: Random Forests for Dependent Data. R package version 0.1.1.

Saha, A. and Datta, A. (2018a). BRISC: bootstrap for rapid inference on spatial covariances. Stat, page e184.

Saha, A. and Datta, A. (2018b). BRISC: Fast Inference for Large Spatial Datasets Using BRISC. R package version 0.1.0.

Schäfer, F., Katzfuss, M., and Owhadi, H. (2020). Sparse Cholesky factorization by Kullback-Leibler minimization. arXiv preprint arXiv:2004.14455.

Scornet, E., Biau, G., Vert, J.-P., et al. (2015). Consistency of random forests. The Annals of Statistics, 43(4):1716–1741.

Stein, M. L. (2012). Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media.

Stein, M. L., Chi, Z., and Welty, L. J. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society, Series B, 66(2):275–296.

Stroud, J. R., Stein, M. L., and Lysen, S. (2017). Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. Journal of Computational and Graphical Statistics, 26(1):108–120.

Sudderth, E. B. (2002). Embedded trees: Estimation of Gaussian processes on graphs with cycles. http://cs.brown.edu/~sudderth/papers/sudderthMasters.pdf.

Sun, Y., Li, B., and Genton, M. G. (2012). Geostatistics for large datasets. In Advances and Challenges in Space-time Modelling of Natural Events, pages 55–77. Springer.

Tang, W., Zhang, L., and Banerjee, S. (2019). On identifiability and consistency of the nugget in Gaussian spatial process models. arXiv preprint arXiv:1908.05726.

Taylor, J. and Einbeck, J. (2013). Challenging the curse of dimensionality in multivariate local linear regression. Computational Statistics, 28(3):955–976.

Taylor-Rodriguez, D., Finley, A. O., Datta, A., Babcock, C., Andersen, H.-E., Cook, B. D., Morton, D. C., and Banerjee, S. (2019). Spatial factor models for high-dimensional and large spatial data: an application in forest variable mapping. Statistica Sinica, 29:1155.
Turek, D. and Risser, M. (2019). BayesNSGP: Bayesian Analysis of Non-Stationary Gaussian Process Models. R package version 0.1.1.

Vecchia, A. (1992). A new method of prediction for spatial regression models with correlated errors. Journal of the Royal Statistical Society: Series B (Methodological), 54(3):813–830.

Vecchia, A. V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50:297–312.

Wall, M. (2004). A close look at the spatial structure implied by the CAR and SAR models. Journal of Statistical Planning and Inference, 121:311–324.

Zhang, H. (2004). Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. Journal of the American Statistical Association, 99(465):250–261.

Zhang, L., Datta, A., and Banerjee, S. (2019). Practical Bayesian modeling and inference for massive spatial data sets on modest computing environments. Statistical Analysis and Data Mining: The ASA Data Science Journal, 12(3):197–209.

Zilber, D. and Katzfuss, M. (2019). Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. arXiv preprint arXiv:1906.07828.