Grid-Parametrize-Split (GriPS) for Improved Scalable Inference in Spatial Big Data Analysis
Michele Peruzzi, Sudipto Banerjee, David B. Dunson, Andrew O. Finley
Abstract
Rapid advancements in spatial technologies including Geographic Information Systems (GIS) and remote sensing have generated massive amounts of spatially referenced data in a variety of scientific and data-driven industrial applications. These advancements have led to a substantial, and still expanding, literature on the modeling and analysis of spatially oriented big data. In particular, Bayesian inferences for high-dimensional spatial processes are being sought in a variety of remote-sensing applications including, but not limited to, modeling next generation Light Detection and Ranging (LiDAR) systems and other remotely sensed data. Massively scalable spatial processes, in particular Gaussian processes (GPs), are being explored extensively for the increasingly encountered big data settings. Recent developments include GPs constructed from sparse Directed Acyclic Graphs (DAGs) with a limited number of neighbors (parents) to characterize dependence across the spatial domain. The DAG can be used to devise fast algorithms for posterior sampling of the latent process, but these may exhibit pathological behavior in estimating covariance parameters. While these issues are mitigated by considering marginalized samplers that exploit the underlying sparse precision matrix, these algorithms are slower, less flexible, and oblivious of structure in the data. The current article introduces the Grid-Parametrize-Split (GriPS) approach for conducting Bayesian inference in spatially oriented big data settings by a combination of careful model construction and algorithm design to effectuate substantial improvements in MCMC convergence. We demonstrate the effectiveness of our proposed methods through simulation experiments and subsequently undertake the modeling of LiDAR outcomes and production of their predictive maps using G-LiHT and other remotely sensed variables.
Keywords:
Big data; Bayesian modeling; directed acyclic graphs; Gaussian processes; sparse modeling.

∗ Department of Statistical Science, Duke University
† Department of Biostatistics, UCLA Fielding School of Public Health
‡ Department of Forestry, Michigan State University

Introduction
Geographic Information Systems (GIS) and related technologies such as remote sensors, satellite imaging and portable devices that are capable of collecting precise positioning information have spawned much interest in the production of digitized geographic information, which, in turn, has led to the amassing of massive amounts of spatial data. Rapidly increasing availability and capabilities of remote sensing technologies and GIS have brought about substantial developments in modeling and analyzing spatial data in diverse scientific and data-driven industrial applications including, but not limited to, natural and environmental sciences; economics; climate science; ecology; forestry; and public health. With the abundance of spatial big data problems in the sciences and engineering, GIS and spatial data science will likely occupy a central place in the data revolution engulfing us.

With the scientific and technological communities embracing an era where open-access data-rich environments provide extraordinary opportunities to understand spatial and temporal complexity of processes, statisticians are increasingly encountering inferential questions that require complex and computationally challenging models. Analysts working with spatial technologies customarily seek to (i) estimate the variations and associations among several spatially referenced outputs; and (ii) predict outputs at new spatial locations. A generic paradigm for spatial models posits a joint probability law

[data | process] × [process | parameters] × [parameters],   (1)

where each [·] represents a legitimate probability law. The above posits an underlying "process" that generates the data. Modeling this process in a rigorous and computationally efficient manner without sacrificing richness in inference presents several challenges. How exactly the underlying process is specified depends upon the data attributes relevant to inference.
We focus on point-referenced data, where spatial locations are indexed by coordinates on a map. The "process" is modeled as a spatial random field over the domain of interest and the observations are treated as a finite realization of this random field. Endowing the finite-dimensional realizations of a random field with a probability law that is multivariate normal leads to the pervasive Gaussian process (GP). The GP is, perhaps, the most conspicuous of process specifications and offers flexibility and richness in modeling. However, GPs incur onerous computational costs that make them impracticable for spatial big data analysis. The primary burden stems from storage and computations involving the massive spatial covariance matrices representing the dependencies inside the data. The elements of these matrices are covariance functions, or kernels, depending on parameters that describe the process. These covariance functions are nonlinear and, depending upon the scientific hypotheses the model represents (e.g., nonstationary dependence), can be computationally burdensome. For irregularly situated spatial locations, as is common in geostatistics, these covariance matrices are typically dense and offer no exploitable structure to facilitate computations. Even for a modestly large number of points by today's norms, the computational demands are prohibitive. Yet, spatial analysts encounter data sets with locations numbering in the millions.
The size of such data sets easily exceeds common CPU and memory configurations, which means we need to rely upon statistical models to enable such analyses within available computing resources. There exists a very large literature on scalable GP approximations for extending spatial and spatiotemporal inference to big data including, but not limited to, low-rank methods (Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2007; Banerjee et al., 2008; Cressie and Johannesson, 2008), covariance tapering (Furrer et al., 2006; Kaufman et al., 2008), domain partitioning (Sang and Huang, 2012; Stein, 2014), divide-and-conquer methods based on meta-posteriors (Guhaniyogi and Banerjee, 2018), recursive partitioning and hierarchical matrices (Ambikasaran et al., 2016; Katzfuss, 2017; Huang and Sun, 2018; Geoga et al., 2020; Jurek and Katzfuss, 2020), Gaussian Markov random fields (GMRF; Rue and Held, 2005), composite likelihood methods (Eidsvik et al., 2014), and local approximations (Gramacy and Apley, 2015). We refer the reader to Sun et al. (2011), Banerjee (2017), and Heaton et al. (2019) for in-depth reviews and comparisons of scalable geostatistical methods.

One particular approach that has been garnering much attention in spatial and spatiotemporal big data analysis stems from representing the joint probability law of a finite realization of the spatial process using a directed acyclic graph (DAG) constructed from the spatial locations as the nodes of the graph with parents of each node as spatial neighbors. Vecchia (1988) used this idea as a method to approximate the multivariate Gaussian likelihood function corresponding to a realized GP over a finite but large number of locations. Datta et al. (2016a) extend such finite-dimensional approximations to a well-defined spatial process, the Nearest-Neighbor Gaussian Process (NNGP), over continuous spatial domains by defining a set of reference locations over which the DAG is constructed.
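To fix ideas, a DAG of this kind can be sketched in a few lines. The following is a generic illustration of the nearest-neighbor (Vecchia-type) construction, not the authors' implementation; the simple coordinate ordering and the function name are assumptions for illustration (maximum-minimum distance orderings are also common in practice).

```python
import numpy as np

def vecchia_parents(coords, m):
    """For locations ordered by their first coordinate, return the ordering
    and each node's parent set: the (up to) m nearest preceding locations."""
    order = np.argsort(coords[:, 0])          # simple coordinate ordering
    coords = coords[order]
    parents = []
    for i in range(len(coords)):
        if i == 0:
            parents.append(np.array([], dtype=int))   # first node has no parents
            continue
        d = np.linalg.norm(coords[:i] - coords[i], axis=1)
        parents.append(np.argsort(d)[:m])     # m nearest previous neighbors
    return order, parents
```

Because every parent precedes its child in the ordering, the resulting graph is acyclic by construction.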
The conditional independence of the outputs given neighboring outputs that is imposed by the DAG yields a sparse precision (inverse of covariance) matrix resulting in a sparse likelihood approximation to the full GP likelihood (often called Vecchia's approximation in the spatial literature). Alternative methods to build the DAG have been considered (Quiroz et al., 2019; Katzfuss and Guinness, 2019). Recently, Peruzzi et al. (2020) introduced a class of meshed Gaussian processes (MGPs) for massive spatial data sets. MGPs are constructed by mapping the nodes of a pre-determined directed acyclic graph (DAG) to groups of spatial locations. Scalability to big data follows from special choices of DAGs; treed DAGs for multivariate outcomes can be associated to recursive domain partitioning (Peruzzi and Dunson, 2020), whereas DAGs with repeated patterns arise from domain tiling or axis-parallel tessellations. In the latter case, Peruzzi et al. (2020) show that additional computational speedups correspond to data on partly observed lattices when sampling the latent process a posteriori. However, inefficiencies associated with excessive data augmentation may arise when the data are not regularly spaced.

Bayesian inference about the latent process and the parameters governing it can be carried out by collecting dependent samples from their joint posterior distribution using Markov chain Monte Carlo (MCMC) methods. Features of the posterior distribution for the parameters of interest (e.g. their mean) are estimated using such dependent samples. Efficiency of MCMC depends on (i) the computational complexity (in terms of flops and time) of the algorithm used to generate a sample of size T, and (ii) the effective sample size (ESS) of such a sample. The ESS (which is typically T* < T) is the size of an i.i.d. sample from the posterior distribution with the same variance as the sample from the Markov chain. Let {θ₁, . . . , θ_T} be the posterior samples of a parameter θ obtained upon convergence of an MCMC algorithm. Then, ESS is found as T* = T/κ(θ), where κ(θ) is the autocorrelation time associated to the sequence θ_t, t = 1, . . . , T. Therefore, ESS measures the loss in efficiency due to the use of a Markov chain (Robert and Casella, 2005). A slowly mixing Markov chain, which exhibits high autocorrelations between successive samples, is linked to a lower ESS. Markov chains that mix poorly require substantially longer time to produce Monte Carlo samples that accurately characterize the posterior distribution. Unfortunately, mixing time increases with the dimension of the state space (see e.g. Dunson and Johndrow, 2020), which is particularly troubling in big data settings.

Poor mixing of MCMC in spatial contexts negatively impacts the quality of estimation and uncertainty quantification of the unknown covariance parameters when sampling the high-dimensional latent process. Marginalized or collapsed samplers exhibit considerable improvements in mixing because they obviate sampling the high-dimensional spatial effects, which are subsequently sampled after the posterior samples of the collapsed model have been obtained. Collapsed samplers are, however, effective only for Gaussian likelihoods and tend to be much slower as they entail more expensive matrix computations (Finley et al., 2019). Furthermore, the practical performance of these algorithms in quantifying posterior uncertainty about the covariance parameters may still be unsatisfactory. Improvements compared to (faster) non-marginalized samplers may be small. Coupled with the observation that ESS is frequently much higher for the latent process than it is for covariance parameters when considering unmarginalized samplers, one may attribute poor mixing to the absence of consistent estimators for the parameters in the spatial covariance function. This has
This hasbeen studied for the Matérn covariance kernel for in-fill or fixed-domain spatial asymptotics(Stein, 1990; Ying, 1991; Zhang, 2004), including in models with the nugget (Tang et al.,2020). In the Bayesian context, inconsistent estimation implies that the effect of the priordoes not diminish even as we increase our sample size by populating the fixed spatial domainwith locations. It is also well known that model parametrizations can impact performanceof MCMC (Gelfand et al., 1995; Liu and Wu, 1999; van Dyk and Meng, 2001; Yu andMeng, 2011; Kastner, 2014). The connections between parametrizations and efficiency inspatial big data contexts have not been analyzed, but we expect them to play an importantrole. For example, a linear model of coregionalization (LMC; Matheron, 1982; Wackernagel,2003) for modeling multivariate spatial dependence in a Bayesian regression model (seee.g. Zhang and Banerjee, 2020) is typically associated with a “noncentered” parametrization(Papaspiliopoulos et al., 2007) that inhibits posterior sampling efficiency in scenarios thatare common in geostatistical settings, e.g. when the measurement error variance is smallrelative to the process variance, but is otherwise computationally convenient.In this article, we combine three separate ideas to resolve the aforementioned issues.First, we enforce a gridded ( Gri ) reference set on an MGP model to ensure sizable scalabilityimprovements for non-gridded multivariate outcomes that may be misaligned. Second, we5arametrize ( P ) the LMC via Matérn covariances targeting explicit posterior sampling of themicroergodic parameter (Stein, 1999), which is well identified. Third, we overparametrizethe process covariance via a parameter splitting ( S ) strategy related to half-t priors (Gelman,2006). The latter two ideas are based on known results but, to the best of our knowledge,their impact on multivariate spatial regression models has not been considered. 
In the Matérn model, P and S imply that the sill σ², range 1/φ, and smoothness ν all have mixed centering. We then demonstrate that the combination of Grid, Parametrize, and Split, leading to a GriPS regression, results in substantial improvements in efficiency even when sampling very high-dimensional latent effects.

Our proposed methodology and related software advancements are aimed for broad application in global-scale public and private land surface monitoring efforts that rely on next generation Light Detection and Ranging (LiDAR) systems and other remotely sensed data. LiDAR provides high-dimensional characterization of land surface structures at point-referenced locations, e.g., topography, vegetation, ice/snow. Repeat measurements from these systems, and coupling with other remotely sensed data, facilitate change detection and change attribution, e.g., natural and human induced disturbances. Next generation LiDAR systems include the National Science Foundation's National Ecological Observatory Network (NEON) Airborne Observation Platform (Kampe et al., 2010), National Aeronautics and Space Administration (NASA) ICESat-2 (ICESat-2, 2020), Global Ecosystem Dynamics Investigation (GEDI; GEDI 2020), and Goddard LiDAR, Hyper-Spectral, and Thermal imager (G-LiHT; Cook et al. 2013). The GEDI mission sensor, mounted on the International Space Station (ISS), began collecting data in 2019 and, like ICESat-2 and G-LiHT, is key to global-scale mapping of forest change and quantifying forest loss contribution to atmospheric CO2 concentrations. Unlike traditional LiDAR data collection campaigns that exhaustively sample small regions, next generation LiDAR systems such as ICESat-2, G-LiHT, and GEDI cover large spatial extents by sparsely sampling locations across the domain.
While generating complete-coverage high resolution maps of forest outcomes is still of interest, there is also interest in creating predictive maps of the high-dimensional LiDAR signals over non-sampled locations.

The remainder of the article proceeds as follows. Section 2 provides an outline of a Bayesian multivariate spatial regression model and discusses the major bottlenecks that detract from efficiency in MCMC samplers for current state-of-the-art models in big data settings. We then introduce GriPS in Section 3 to circumvent such bottlenecks. Section 4.1 outlines the posterior sampling algorithm whose efficiency is demonstrated in Sections 4.2 and 5 through simulation experiments and subsequent analysis of LiDAR outcomes and production of their predictive maps using G-LiHT and other remotely sensed variables.

Let
D ⊂ ℝᵈ denote our spatial domain and ℓ ∈ D be a generic spatial location. A multivariate spatial regression model describes the relationship between a vector of spatial outputs and predictors while incorporating a multivariate spatial process and white noise. Thus,

y(ℓ) = X(ℓ)ᵀβ + w(ℓ) + ε(ℓ),   (2)

where y(ℓ) = (y₁(ℓ), y₂(ℓ), . . . , y_q(ℓ))ᵀ is the q × 1 vector of outcomes at location ℓ, X(ℓ)ᵀ = blockdiag{xᵢ(ℓ)ᵀ}ᵢ₌₁^q is a q × p matrix of spatially referenced predictors with each xᵢ(ℓ) being a pᵢ × 1 vector of predictors corresponding to yᵢ(ℓ) at location ℓ and p = Σᵢ₌₁^q pᵢ, β is the p × 1 vector of corresponding regression coefficients, and w(ℓ) and ε(ℓ) are q × 1 vectors of spatial random effects and random noise with elements wᵢ(ℓ) and εᵢ(ℓ), respectively, for i = 1, 2, . . . , q, such that ε(ℓ) ~iid N(0, D) with a diagonal D = diag(τ₁², . . . , τ_q²).

We model the uncountable set {w(ℓ) : ℓ ∈ D} as a q-variate multivariate Gaussian process, denoted as w(ℓ) ~ GP(0, C_θ(·, ·)), where C_θ(·, ·) is a q × q matrix-valued cross-covariance function indexed by parameters θ such that C_θ(ℓ, ℓ′) = [cov{wᵢ(ℓ), wⱼ(ℓ′)}] is the q × q matrix with (i, j)th element given by the covariance between wᵢ(ℓ) and wⱼ(ℓ′). Henceforth we suppress θ in the cross-covariance to simplify notation. A valid cross-covariance satisfies: (i) C(ℓ, ℓ′) = C(ℓ′, ℓ)ᵀ; and (ii) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ zᵢᵀ C(ℓᵢ, ℓⱼ) zⱼ > 0 for any integer n, any finite collection of points {ℓ₁, ℓ₂, . . . , ℓₙ}, and all zᵢ ∈ ℝ^q \ {0} (see, e.g., Genton and Kleiber, 2015). Over any finite set L ⊂ D the process has a multivariate normal distribution w_L ~ N(0, C_L), where w_L is the qn_L × 1 column vector and C_L is the qn_L × qn_L block matrix with the q × q matrix C(ℓᵢ, ℓⱼ) as its (i, j)th block for i, j = 1, . . . , n_L.

Figure 1: Common parametrizations of Bayesian spatial regression models: (a) centered parametrization; (b) noncentered parametrization.

Let T = {ℓ₁, . . . , ℓₙ} ⊂ D be the set of locations where at least one of our q spatial outcomes has been measured. We construct y_T = [y(ℓ₁)ᵀ, . . . , y(ℓₙ)ᵀ]ᵀ as the nq × 1 vector of y(ℓᵢ)'s over the n locations, analogously define w_T and ε_T, and let X_T = [X(ℓ₁) : . . . : X(ℓₙ)]ᵀ. A Bayesian hierarchical model is constructed from the above specifications as

p(β, w_T, τ², θ | y_T) ∝ p(τ², θ) × N(β; m_β, V_β) × N(w_T; 0, C_T) × N(y_T; X_T β + w_T, D_n),   (3)

where τ² = {τ₁², . . . , τ_q²} and D_n = blockdiag({D}ᵢ₌₁ⁿ). The likelihood in (3) corresponds to y = Xβ + w + ε, where the specification over T is implicit.

Bayesian inference proceeds by sampling from the posterior distribution in (3) using MCMC algorithms. Efficiency of MCMC algorithms can be measured by effective sample size (ESS) per unit time. ESS is related to the autocorrelation in the Markov chains and measures the size of an equivalent i.i.d. sample, i.e., producing the same Monte Carlo error in ergodic averages, as the dependent MCMC samples (Liu and Chen, 1995; Robert and Casella, 2005). Scalable models effectively reduce the time for each MCMC iteration, but issues adversely affecting efficient sampling persist. We consider, in particular, efficient inference for the cross-covariance parameters θ and the measurement error variances τ².
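The ESS diagnostic used throughout can be estimated from a chain's sample autocorrelations. The sketch below (assuming numpy; the truncation at the first nonpositive autocorrelation is one simple variant among several common estimators) computes T/κ with κ the estimated autocorrelation time:

```python
import numpy as np

def ess(chain):
    """Effective sample size T / kappa, with the autocorrelation time kappa
    estimated by summing sample autocorrelations until they first turn
    nonpositive (a simple initial-positive-sequence truncation)."""
    x = np.asarray(chain, dtype=float)
    T = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[T - 1:] / T   # lags 0, 1, ..., T-1
    rho = acov / acov[0]
    kappa = 1.0
    for t in range(1, T):
        if rho[t] <= 0:
            break
        kappa += 2.0 * rho[t]
    return T / kappa
```

For a nearly independent chain the estimate is close to T, while a strongly autocorrelated chain yields a much smaller value.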
Modeling the latent q-variate process w(·) in (2) requires specifying a valid cross-covariance function. We consider a linear model of coregionalization (LMC),

w(ℓ) = Σⱼ₌₁ᵏ λⱼ ωⱼ(ℓ) = Λω(ℓ),   (4)

where Λ = [λ₁ : . . . : λ_k] is a q × k full (column) rank matrix with (i, j)th entry λᵢⱼ, and each ωⱼ(ℓ) is a univariate spatial process with correlation function ρⱼ(ℓ, ℓ′) = ρ(ℓ, ℓ′; φⱼ). The k ≤ q components of ω(ℓ) are independent of each other (hence cov{ωⱼ(ℓ), ω_h(ℓ′)} = 0 whenever h ≠ j). This implies that ω(ℓ) is a multivariate spatial process with cross-correlation ρ(ℓ, ℓ′; ϕ) indexed by ϕ = {φ₁, . . . , φ_k} and C(ℓ, ℓ′; θ) = Λ ρ(ℓ, ℓ′) Λᵀ. Under stationarity, if ‖ℓ − ℓ′‖ = 0, then C(0; θ) = Λ ρ(0; ϕ) Λᵀ = ΛΛᵀ since ρ(0; ϕ) = I_k. Therefore, Λ identifies with the Cholesky factorization of C(0) and is lower-triangular with positive diagonal entries (see, e.g., Finley et al., 2008; Zhang and Banerjee, 2020, and references therein for Bayesian LMC models). Model (2) becomes y = Xβ + (Iₙ ⊗ Λ)ω + ε. Integrating out w = (Iₙ ⊗ Λ)ω from the Gaussian likelihood fetches the density p(y | X, β, τ², θ) = N(y; Xβ, C_θ + D_n), where C_θ = (Iₙ ⊗ Λ) ρ_ϕ (Iₙ ⊗ Λᵀ) and θ = {Λ, ϕ}, irrespective of whether we take w or ω as the latent process. Otherwise, different hierarchical model parametrizations can be considered (Papaspiliopoulos et al., 2007). A centered parametrization (Figure 1a) implies that θ and y are conditionally independent given w. A noncentered parametrization (Figure 1b) considers Λ and ω independent a priori.
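A small numerical sketch of the LMC construction in (4): with k independent latent processes (exponential correlations are an assumed illustrative choice of the ρⱼ, and the function name is hypothetical), the implied cross-covariance is Λ diag{ρⱼ} Λᵀ blockwise, and at distance zero it reduces to ΛΛᵀ.

```python
import numpy as np

def lmc_block_cov(coords, Lam, phis):
    """Blockwise LMC cross-covariance: the (i, j) block of the nq x nq matrix
    is C(l_i, l_j) = Lam diag_k{rho_k(l_i, l_j)} Lam^T, with exponential
    correlations rho_k(h) = exp(-phi_k * h) assumed for illustration."""
    n, q = len(coords), Lam.shape[0]
    C = np.zeros((n * q, n * q))
    for i in range(n):
        for j in range(n):
            h = np.linalg.norm(coords[i] - coords[j])
            rho = np.diag(np.exp(-np.asarray(phis) * h))
            C[i*q:(i+1)*q, j*q:(j+1)*q] = Lam @ rho @ Lam.T
    return C
```

The diagonal blocks equal ΛΛᵀ, consistent with identifying Λ as the lower-triangular Cholesky factor of C(0).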
A fully-noncentered parametrization can be built by writing w = (Iₙ ⊗ Λ) L_ϕ u, where L_ϕ L_ϕᵀ = ρ_ϕ and u is white noise; the graphical model reads u → y and θ → y. The difference between centered and noncentered parametrizations is best appreciated via a simple example. If q = 1, in the centered model we may let Λ = σ > 0 and assign prior σ² ~ InvG(σ²; a, b). Then, the full conditional posterior is p(σ² | w, y) = p(σ² | w) = InvG(σ²; a*, b*), where a* = a + n/2 and b* = b + wᵀR_φ⁻¹w/2, with R_φ the correlation matrix such that C_θ = σ²R_φ. In the noncentered models, Λ = λ > 0 can be assigned e.g. a truncated Gaussian prior N(λ; m, v) 1{λ > 0}, resulting in a truncated Gaussian full conditional posterior, with variance v* = (v⁻¹ + ωᵀω/τ²)⁻¹, mean m* = v*(v⁻¹m + ωᵀ(y − Xβ)/τ²), and truncation below zero. Updating ϕ in all models typically proceeds via Metropolis steps, hence the fully noncentered model provides no additional advantage and we will not consider it further. Specifically, we expect the relative efficiency of the noncentered models to be poor when the ratio between the noise variance and process variance is small (Papaspiliopoulos et al., 2007). As this commonly occurs in geospatial settings, the centered parametrization might lead to better MCMC performance. However, noncentered models are advantageous as they involve k a-priori independent processes rather than q dependent ones, leading to more convenient updates for Λ, especially when setting k < q as in a spatial factor model (Ren and Banerjee, 2013; Taylor-Rodriguez et al., 2019; Zhang and Banerjee, 2020).

The LMC of Section 2.2 requires the specification of the k correlation functions ρⱼ(·, ·). The Matérn model is a parsimonious but flexible choice in this context.
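The Matérn correlation used below is easy to evaluate numerically with standard Bessel-function routines. The following is a sketch (assuming scipy is available; not the authors' implementation), with the useful special case ν = 1/2 recovering the exponential correlation e^(−φh):

```python
import numpy as np
from scipy.special import gamma, kv

def matern_corr(h, phi, nu):
    """Matern correlation 2^(1-nu)/Gamma(nu) * (phi h)^nu * K_nu(phi h),
    equal to 1 at h = 0; K_nu is the modified Bessel function (2nd kind)."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.ones_like(h)
    pos = h > 0
    x = phi * h[pos]
    out[pos] = (2.0 ** (1.0 - nu) / gamma(nu)) * x ** nu * kv(nu, x)
    return out
```

Different (σ², φ) pairs sharing the same microergodic product σ²φ^(2ν) produce nearly indistinguishable interpolation under fixed-domain asymptotics, which is the identifiability issue discussed next.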
For a univariate spatial process ω(·) ~ GP(0, C_θ), one assumes that cov{ω(ℓ), ω(ℓ′)} = σ² ρ(ℓ, ℓ′; {φ, ν}), where

ρ(ℓ, ℓ′; {φ, ν}) = 2^(1−ν)/Γ(ν) · (φ‖ℓ − ℓ′‖)^ν K_ν(φ‖ℓ − ℓ′‖),   (5)

and K_ν is the modified Bessel function of the second kind of order ν. We know that θ = {σ², φ, ν} are weakly identified (Banerjee et al., 2014). In fixed-domain asymptotics, only the microergodic parameter σ²φ^(2ν) can be estimated consistently (Stein, 1990; Ying, 1991; Zhang, 2004; Tang et al., 2020). In practice, predictions at unknown locations are less sensitive to the individual parameters' values, since σ²φ^(2ν) matters more for spatial interpolation. Nevertheless, weak identifiability is expected to adversely impact MCMC mixing and convergence for {σ², φ, ν}. If one seeks to estimate the individual parameters and quantify uncertainty, informative priors become necessary. It is not uncommon in these contexts to fix some parameters to reasonable values, constrain the prior on a grid, or perform greedy searches for optimal values through cross-validation (Shirota et al., 2019; Banerjee, 2020).

Spatial dependence of the outcomes in (2) is modeled using the latent process. Integrating out the latent effects from (3) yields the marginal or collapsed Bayesian model

p(β, τ², θ | y) ∝ p(τ², θ) × N(β; μ_β, V_β) × N(y; Xβ, C_θ + D_n).   (6)

Now we can sample the unknown parameters {β, τ², θ} from (6).
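Each evaluation of the collapsed density in (6) reduces to a Gaussian log-likelihood. A minimal sketch (assuming numpy; function name is illustrative) evaluates log N(y; Xβ, K) through the factorization K = LFLᵀ with unit lower-triangular L and diagonal F:

```python
import numpy as np

def collapsed_loglik(y, Xb, K):
    """log N(y; Xb, K) via K = L F L^T (unit lower-triangular L, diagonal F):
    logdet K = sum_i log f_ii, and the quadratic form is sum_i u_i^2 / f_ii,
    where u solves the lower-triangular system L u = y - Xb."""
    n = len(y)
    R = np.linalg.cholesky(K)              # K = R R^T, R lower triangular
    f = np.diag(R) ** 2                    # diagonal of F
    L = R / np.sqrt(f)                     # rescale columns: unit diagonal
    u = np.linalg.solve(L, y - Xb)
    return -0.5 * (n * np.log(2 * np.pi) + np.log(f).sum() + (u ** 2 / f).sum())
```

The factorization itself, not the subsequent cheap sums, is the bottleneck that motivates the scalable constructions below.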
This is appealing because we avoid sampling the high-dimensional w, a step that is known to significantly encumber MCMC mixing and convergence for parameters that are highly correlated with w a posteriori. Even without w in (6), we can sample from p(w | y) by drawing one w from p(w | β, τ², θ, y), which is Gaussian, for each sampled value of {β, τ², θ} from (6).

Sampling from (6), however, entails (C_θ + D_n)⁻¹ and its determinant, which becomes prohibitive in big data settings. To further elucidate, the chief computational burden arises from the Cholesky decomposition of K_θ = C_θ + D_n or its inverse. Once the Cholesky decomposition has been computed, say chol(K_θ) = LFLᵀ, where L = [lᵢⱼ] is unit lower triangular (with lᵢᵢ = 1) and F = diag[fᵢᵢ] is diagonal with fᵢᵢ > 0, then

log N(y; Xβ, C_θ + D_n) = constant − (1/2) Σᵢ₌₁ⁿ log(fᵢᵢ) − (1/2) Σᵢ₌₁ⁿ uᵢ²/fᵢᵢ,   (7)

where uᵢ is the i-th element of the vector u that solves the lower-triangular system Lu = y − Xβ. Given the Cholesky decomposition, computing (7) is cheap (of O(nq)). However, the Cholesky decomposition itself involves O(n³q³) floating point operations (flops) and O(n²q²) in storage and, hence, is impracticable for very large values of nq.

One option to improve computational speed is to replace w(·) with a low-rank process w̃(·) with covariance function C̃_θ(ℓ, ℓ′) = C_θ(ℓ, S) C_θ⁻¹(S) C_θ(S, ℓ′), where S is a set of n_k knots (Banerjee et al., 2008). In this case, the Sherman-Woodbury-Morrison (SWM) identity and the matrix determinant lemma applied to the integrated covariance C̃_θ + D_n imply that one only needs to compute C_θ⁻¹(S), which is of size n_k × n_k rather than n × n.
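A numerical sketch of this SWM-based solve (the exponential covariance and the function names are assumptions for illustration): for K̃ = C_{·,S} C_S⁻¹ C_{S,·} + D with diagonal D, only an n_k × n_k matrix is factorized.

```python
import numpy as np

def expcov(A, B, phi=2.0, s2=1.0):
    """Toy exponential covariance between two sets of locations."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return s2 * np.exp(-phi * d)

def lowrank_solve(coords, knots, D_diag, rhs):
    """Solve (C_{.,S} C_S^{-1} C_{S,.} + D) x = rhs via the SWM identity:
    K^{-1} = D^{-1} - D^{-1} U (C_S + U^T D^{-1} U)^{-1} U^T D^{-1},
    where U = C(coords, S); only an n_k x n_k system is solved."""
    U = expcov(coords, knots)                          # n x n_k
    C_S = expcov(knots, knots) + 1e-8 * np.eye(len(knots))
    Dinv_rhs = rhs / D_diag
    Dinv_U = U / D_diag[:, None]
    small = C_S + U.T @ Dinv_U                         # n_k x n_k
    return Dinv_rhs - Dinv_U @ np.linalg.solve(small, U.T @ Dinv_rhs)
```

A small jitter stabilizes the knot covariance; the matrix determinant lemma handles the determinant analogously.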
Directly computing the SWM identity is not optimal and can be avoided using chol(C_θ(S)) (see Banerjee, 2017, for details). Greater scalability is achieved when w(·) is replaced by any DAG-based process such as an NNGP or an MGP. Here, C_θ is replaced by C̃_θ such that C̃_θ⁻¹ is sparse. Computations will ultimately depend on (C̃_θ⁻¹ + D_n⁻¹)⁻¹ and its determinant. This relies on sparse libraries (e.g. sparse Cholesky) which, unfortunately, can impede iterative computations in big data analysis and may be circumvented in some special circumstances (Peruzzi and Dunson, 2020). Alternatively, one can directly model the responses as a sparse GP, in which case the original cross-covariance function C_θ(ℓ, ℓ′) is first replaced by C̃_θ(ℓ, ℓ′) = C_θ(ℓ, ℓ′) + δ_{ℓ=ℓ′} D, and then the resulting full model's precision C̃_θ⁻¹ is directly replaced by a sparse counterpart. Now one does not need sparse Cholesky libraries, as the sparse precision and its determinant are computed directly from the DAG (Finley et al., 2019). This model does not include latent effects, but Vecchia approximations on the latent effects tend to produce inference closer to the full GP model than the response model (Katzfuss and Guinness, 2019). While marginalization avoids some issues related to model parametrization, it does not address the weak identifiability of the Matérn parameters in (5). Finally, the marginal and response models are only feasible when all outcomes are Gaussian. Sampling w a posteriori is, therefore, more generally applicable. Let us return to the Bayesian hierarchical model in (3) that explicitly includes w.
While β and w can be directly updated using Gibbs steps from their (Gaussian) full conditional distributions, and τ² can also be updated directly if we assign independent conjugate (Inverse-Gamma) priors to each τᵢ², the cross-covariance parameters θ do not render closed-form full conditional distributions. Therefore, we update θ using some stationarity-preserving MCMC method such as a Metropolis random walk or Hamiltonian Monte Carlo. These entail computing, up to a proportionality constant, the density p(w | θ) p(θ) = N(w; 0, C_θ) × p(θ), where p(θ) is any proper prior distribution for θ. Unfortunately, once again, we are faced with computing the inverse and determinant of a high-dimensional covariance matrix.

Scalable methods accrue significant benefits in computing p(w | θ). In particular, a "cubic" meshed GP (QMGP; Peruzzi et al., 2020) produces a reversal in the balance of computational costs: evaluations of p(w | θ) become faster than sampling w, typically by about an order of magnitude. In order to achieve such speedups, one requires that the so-called reference set S be defined on a regular grid. S is akin to the set of "knots" or "sensors" in low-rank methods, whose choice is not a trivial matter; see e.g. Krause et al. (2008) or Guhaniyogi et al. (2011). Standard QMGPs are associated with MCMC efficiency losses when a gridded reference set does not overlap with the measurement locations. In this case the size of the latent effects might greatly exceed the sample size, with negative impacts on MCMC mixing and convergence of the centered parameters. We clarify this point after building the QMGP that we subsequently use in the remainder of the article.

For any given finite set L of spatial locations and a fixed reference set S, let U = L \ S denote the non-reference locations. We partition the domain using an axis-parallel tessellation T = {D_j}ⱼ₌₁^M such that D = ∪ⱼ₌₁^M D_j, where M = ∏ᵢ₌₁^d Mᵢ.
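The axis-parallel tessellation can be realized by binning coordinates along each axis. A sketch over the unit hypercube (the domain bounds and the function name are assumptions for illustration) returns the flat index of the cell D_j containing each location:

```python
import numpy as np

def tessellation_cells(coords, M_axes, lo=0.0, hi=1.0):
    """Flat index of the axis-parallel cell D_j containing each location,
    with M_axes[i] equal-width intervals along axis i of [lo, hi]^d."""
    coords = np.asarray(coords, dtype=float)
    cell = np.zeros(len(coords), dtype=int)
    stride = 1
    for ax in range(coords.shape[1]):
        k = ((coords[:, ax] - lo) / (hi - lo) * M_axes[ax]).astype(int)
        cell += np.clip(k, 0, M_axes[ax] - 1) * stride   # row-major flat index
        stride *= M_axes[ax]
    return cell
```

Grouping reference and non-reference locations by the returned index yields the per-cell sets used next.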
We then partitionthe reference and non-reference sets into S = ∪ Mi =1 S i and U = ∪ Mi =1 U i , where S i = S ∩ D i and U i = U ∩ D i . Let the DAG G = { V , E } have nodes V = A ∪ B where A and12 are the M reference and non-reference nodes, respectively. T and G are linked via η : D → V which maps any reference location to a node in A and any non-referencelocation to a node in B . The parent set of nodes for v ∈ V is denoted as Pa [ v ] , and E = { Pa [ v ] } v ∈ V . By construction, we let Pa [ v ] ⊂ A if v ∈ B , i.e. non-reference nodes haveno children. Furthermore, in the cubic MGP’s specification of G we have | Pa [ v ] | ≤ d and | Ch [ v ] ∩ A | ≤ d whenever v ∈ A , where d is the dimension of D . For non-reference nodes,let η − ( U i ) = b i and η − ( S i ) = a i . Then, we specify Pa [ b i ] = { a i } —non-reference locations’parent set comprises reference locations in the same domain partition. These specificationsfor G imposes a coloring on G enabling parallel sampling of the latent random effects (see,e.g., Gonzalez et al., 2011), independently of observed locations T and d . At L , the newprocess has density p ∗ , p ∗ ( w L ) = M (cid:89) i =1 p ( w i | w [ i ] ) (cid:89) (cid:96) ∈ U i p ( w ( (cid:96) ) | w i ) , (8)where w i = w ( η − ( v i )) = w ( S i ) = ( w ( (cid:96) ) (cid:62) , . . . , w ( (cid:96) n Si ) (cid:62) ) (cid:62) , w [ i ] = w ( η − ( Pa [ v i ])) , using η − to find all reference locations mapped to the same node. We assume conditional inde-pendence at non-reference locations in this article. A Gaussian base process leads to (8) as aproduct of multivariate Gaussian densities; p ∗ ( w L ) is then Gaussian with a sparse precisionmatrix – refer to Peruzzi et al. 
(2020) for more details.

For an observed location ℓ ∈ T ∩ D_i, the standard MGP regression model is

y(ℓ) = X(ℓ)^⊤ β + w(ℓ) + ε(ℓ);  ε(ℓ) ~ N(0, D);
w(ℓ) = H_ℓ w_[ℓ] + ξ(ℓ);  ξ(ℓ) ~ N(0, R_ℓ),   (9)

where, denoting C_ℓ = C(ℓ, ℓ), the set of parent locations B_ℓ = η^{-1}(Pa[η(ℓ)]), and C_[ℓ] = C(B_ℓ, B_ℓ), we have H_ℓ = C_{ℓ,[ℓ]} C_[ℓ]^{-1} and R_ℓ = C_ℓ − C_{ℓ,[ℓ]} C_[ℓ]^{-1} C_{[ℓ],ℓ}. Since evaluating p(w | θ) is required in posterior sampling of θ, the primary computing cost lies in the inverses C_[ℓ]^{-1}. The number of such inverses and their sizes depend on T, S, and U, for a given G and T. If T is gridded, we let S ⊃ T be the observed grid plus the missing grid locations, and U = ∅. Only O(d) larger matrix inverses are then needed, rather than M, and w is sampled at |S| locations. In fact, with stationary cross-covariances, for a given node v_i one can find "prototypes" v_{i*} such that C_i = C_{i*} and C_[i] = C_[i*], and the number of prototype nodes is O(d). One computes the inverses for the prototypes and copies the result across the whole domain. If T is not gridded, then we still let S be a grid, again leading to O(d) large matrix inverses. However, now U = T and we sample w at |S| + |U| locations. Since n_S = |S| ≈ n is needed to avoid over-smoothing the spatial surface, we would be using about twice as many latent variables as we have observations. Updates of the centered parameters proceed by evaluating p(w_S | θ) × p(w_U | w_S, θ), which implies that a move from θ to θ′ must agree with the current high-dimensional sample of the latent process at both reference and non-reference locations. Consequently, MCMC convergence will be slow.
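The factorization in (8) can be checked numerically: with a Gaussian base process, the DAG density is a sum of small conditional Gaussian log-densities, and it recovers the exact joint whenever every node conditions on all of its predecessors. The sketch below (function names are mine) illustrates this for scalar nodes.

```python
import numpy as np

def cond_logpdf(wv, w_par, C_vv, C_vp, C_pp):
    """log p(w_v | w_parents) under a zero-mean Gaussian base process."""
    if C_pp.size == 0:                               # root node: no parents
        m, R = np.zeros_like(wv), C_vv
    else:
        H = np.linalg.solve(C_pp, C_vp.T).T          # H = C_vp C_pp^{-1}
        m = H @ w_par
        R = C_vv - H @ C_vp.T                        # conditional (residual) variance
    r = np.atleast_1d(wv - m)
    R = np.atleast_2d(R)
    sign, ld = np.linalg.slogdet(R)
    return -0.5 * r @ np.linalg.solve(R, r) - 0.5 * ld - 0.5 * len(r) * np.log(2 * np.pi)

def dag_logdensity(w, C, parents):
    """Sum of conditional log-densities: the log of a factorization like (8),
    here for scalar nodes with user-supplied parent sets."""
    total = 0.0
    for v, pa in enumerate(parents):
        pa = list(pa)
        total += cond_logpdf(w[v], w[pa],
                             C[v, v], C[np.ix_([v], pa)], C[np.ix_(pa, pa)])
    return total
```

With limited parent sets the result is an approximation to the joint; with full conditioning sets it is exact, which is the property the test below exercises.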
Clearly, the computational speedup from gridded reference sets may not compensate for such inefficiencies.

Grid, Parametrize, Split
We now generalize the computational advantages of QMGPs to non-gridded data, and rebuild the linear model of coregionalization to improve overall sampling efficiency. We assume, without loss of generality, that the observed locations T are irregularly spaced. We target raw speed in the first subsection and efficiency in the next.

Grid
We build the reference set S as a grid, regardless of T. If D = [0, 1]^d, we may take S = {(j_1/N_1, ..., j_d/N_d) : j_i ∈ {1, ..., N_i}, i = 1, ..., d}, so that the number of reference locations is n_S = ∏_{j=1}^d N_j. From (9) we can partly marginalize the latent process at the observed locations. For a location ℓ ∈ D_i, conditioning on realizations of the latent process at S, we have

y(ℓ) = X(ℓ)^⊤ β + H_ℓ w_[ℓ] + ε′(ℓ);  ε′(ℓ) ~ N(0, D + R_ℓ),   (10)

where w_[ℓ] are process realizations at locations η^{-1}(Pa[b_i]), since ℓ ∈ D_i and hence η(ℓ) = b_i, and H_ℓ and R_ℓ are q × q n_{S_i} and q × q matrices, respectively. For a given w_S, if ℓ ∈ S then R_ℓ = 0 and H_ℓ w_[ℓ] = w(ℓ). In other words, (10) generalizes QMGPs to non-gridded data without sampling at too many locations. Furthermore, while no issues arise if T and S partly overlap, in general they are disjoint.

Partial marginalization of the latent process removes the need to sample at non-reference locations. However, unlike full marginalization, the advantages induced by having S on a grid are maintained. Moreover, choosing a grid S such that n_S ≪ n leads to interpreting (10) as a "meshed" extension of the bias-adjusted predictive process (MPP; Finley et al., 2009). While computational considerations make it advantageous to choose S on a grid, (10) does not preclude alternatives. For example, typically T = S implies that prediction locations are not in the reference set, but here, assuming X(ℓ) are available, we can choose S as the prediction set. Then, sampling w_S yields the desired predictions. This feature can be used to produce model-based heat maps (i.e., raster images) of the latent process at any resolution.
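The effect of (10) can be seen in one dimension with an exponential covariance. In this sketch (the grid, φ, and τ² values are invented), conditioning on w_S turns the latent contribution at ℓ into the deterministic term H_ℓ w_[ℓ], while the marginalized remainder R_ℓ inflates the nugget; R_ℓ vanishes exactly when ℓ lies on the reference grid.

```python
import numpy as np

def exp_cov(h, sigma2=1.0, phi=2.0):
    """Exponential (Matérn nu = 0.5) covariance."""
    return sigma2 * np.exp(-phi * np.abs(h))

S = np.linspace(0.0, 1.0, 6)                 # gridded reference set on [0, 1]
rng = np.random.default_rng(1)
CS = exp_cov(np.abs(S[:, None] - S[None, :]))
wS = np.linalg.cholesky(CS) @ rng.normal(size=S.size)   # one draw of w_S

def partially_marginalized(l, parents, tau2=0.1):
    """Mean and variance of y(l) | w_S (with beta = 0) under (10):
    N(H_l w_[l], tau2 + R_l)."""
    Cpp = exp_cov(np.abs(S[parents][:, None] - S[parents][None, :]))
    cvp = exp_cov(np.abs(l - S[parents]))
    H = np.linalg.solve(Cpp, cvp)            # H_l
    R = exp_cov(0.0) - H @ cvp               # R_l, the marginalized remainder
    return H @ wS[parents], tau2 + R
```

At a reference location the inflated nugget reduces to the plain nugget τ²; off the grid the extra variance R_ℓ is strictly positive.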
Finally, while partial marginalization is available for Gaussian first-stage models, we expect analogous strategies to be similarly advantageous otherwise, taking advantage of the conditional independence of all outcomes given the latent process.

Parametrize a Matérn LMC, Split its variance
This subsection does not assume w is a QMGP. After letting ω̃(ℓ) = (ω̃_1(ℓ), ..., ω̃_k(ℓ))^⊤ and ω̃_j(ℓ) ~ GP(0, C̃_j), where

C̃_j(ℓ, ℓ′) = ρ(ℓ, ℓ′; {φ_j, ν_j}) / φ_j^{2ν_j},   (11)

we write the LMC as w(ℓ) = Λ ω(ℓ) = Λ Q Q^{-1} ω(ℓ) = Λ̃ ω̃(ℓ), where Q = diag{φ_1^{ν_1}, ..., φ_k^{ν_k}}. Λ̃ = Λ Q is thus a lower-triangular rescaling matrix whose j-th diagonal element is λ_jj φ_j^{ν_j}, which is the square root of the microergodic parameter (Stein, 1999); its better identifiability impacts MCMC efficiency positively. The immediate consequence of this change is that posterior samples of Λ will only be obtained via post-processing of MCMC output; we instead directly sample Λ̃ and ω̃.

The LMC is typically estimated using a noncentered parametrization, which, as we mentioned, is associated with poor mixing when the nuggets are small. We address this issue by parameter expansion (or splitting) of λ_jj: denoting Ψ = diag{λ_11, ..., λ_kk}, we find L = Λ Ψ^{-1} as a lower-triangular matrix with units along the diagonal. Letting J = diag{σ_1^{-1}, ..., σ_k^{-1}}, we write the LMC as w(ℓ) = Λ ω(ℓ) = L Ψ ω(ℓ). The split follows by writing w(ℓ) = Λ J J^{-1} ω(ℓ) = Λ̆ ω̆(ℓ), where Λ̆ = L Ψ J, ω̆_j(ℓ) ~ GP(0, σ_j^2 ρ_j), and ρ_j is as above. In practice, we sample ω̆ alongside Λ̆, whose j-th diagonal element is λ̆_jj = λ_jj / σ_j. The split entails an overparametrization of the variances, as we assign a prior on Λ̆ which is independent of the prior on J. If q = 1, we may assign a Gaussian prior on λ̆ centered at zero with truncation below zero, and an inverse Gamma prior on σ^2, to obtain a half-t prior for λ = λ̆ σ (Gelman, 2006). In the context of Bayesian factor models, a similar parameter expansion was introduced in Ghosh and Dunson (2009).

We build the model for w by combining the parametrization and splitting strategies:

w(ℓ) = Λ ω(ℓ) = Λ Q J J^{-1} Q^{-1} ω(ℓ) = A r(ℓ),   (12)

where A = Λ Q J is a q × k matrix with zeros in the upper-triangular part, and the k-variate process is r(·) ~ QMGP(0, K(·)), such that r(ℓ) = (r_1(ℓ), ..., r_k(ℓ))^⊤ and, for each j = 1, ..., k, r_j(ℓ) ~ QMGP(0, K_j(·)), where K_j(·) = σ_j^2 ρ_j(·) / φ_j^{2ν_j} with ρ_j(·) the Matérn correlation in (5). All independent QMGPs are built on the same domain partition T and the same graph G. The GriPS spatial regression model then becomes

y(ℓ) = X(ℓ)^⊤ β + Z_ℓ r_[ℓ] + ε′(ℓ);  ε′(ℓ) ~ N(0, D + Σ_ℓ),   (13)

where Z_ℓ = A H_ℓ and Σ_ℓ = A R_ℓ A^⊤. Writing (13) for locations T, we get

y = X β + Z r + ε′;  ε′ ~ N(0, D_n + Σ),   (14)

where ⊗ denotes the Kronecker product, Z = (I_n ⊗ A) H, and Σ = (I_n ⊗ A) R (I_n ⊗ A^⊤). In addition to the terms already appearing in (2), we now explicitly have r representing the n_S k × 1 vector of latent effects at the gridded reference set S. Furthermore, H is a sparse nk × n_S k matrix whose i-th row block is filled with H_{ℓ_i}, with zeros at columns corresponding to locations in S not in the parent set of ℓ_i.

Figure 2: DAG of the Bayesian spatial regression model using a PS parametrization. The relationship with a standard parametrization is summarised on the right: A = Λ Q J, with Q = diag{φ_1^{ν_1}, ..., φ_k^{ν_k}}, J = diag{σ_1^{-1}, ..., σ_k^{-1}}, and ϕ = {σ_j, φ_j, ν_j : j = 1, ..., k}.
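The maps between (Λ, ϕ) and the sampled matrix A, and the microergodic rescaling Λ̃ = ΛQ, amount to diagonal matrix algebra. A small numeric sketch (all values invented, k = 2):

```python
import numpy as np

# Hypothetical k = 2 LMC: Lambda lower triangular, Matérn phi and nu per margin,
# and expansion scales sigma from the variance split.
Lam = np.array([[1.2, 0.0],
                [0.4, 0.8]])
phi = np.array([2.0, 5.0])
nu = np.array([0.5, 1.5])
sig = np.array([1.3, 0.7])

Q = np.diag(phi ** nu)          # rescaling: diagonal elements phi_j^{nu_j}
J = np.diag(1.0 / sig)          # split: diagonal elements sigma_j^{-1}
A = Lam @ Q @ J                 # the matrix sampled directly in GriPS

# Post-processing recovers Lambda exactly:
Lam_back = A @ np.linalg.inv(J) @ np.linalg.inv(Q)

# The j-th diagonal of Lambda Q is the square root of the microergodic parameter:
microergodic_sqrt = np.diag(Lam @ Q)
```

Because Q and J are diagonal and positive, the map is invertible and preserves the lower-triangular zero pattern of Λ, so sampling A and post-processing back to Λ loses nothing.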
Since the k margins of the r process are independent, the version of H_{ℓ_i} that arranges it according to the process margins is a block-diagonal matrix whose j-th (of k) row block is K_j(ℓ, [ℓ]) K_j([ℓ])^{-1}. Similarly, R is a block-diagonal nk × nk matrix whose i-th block is R_{ℓ_i}, itself diagonal with j-th diagonal element K_j(ℓ, ℓ) − K_j(ℓ, [ℓ]) K_j([ℓ])^{-1} K_j([ℓ], ℓ). We find Σ as a block-diagonal matrix of dimension nq × nq with q × q blocks; the i-th block is A R_{ℓ_i} A^⊤. We complete the Bayesian model by assigning multivariate Gaussian prior distributions to β and the lower-triangular elements of A, and independent zero-mean Gaussian distributions truncated below zero to the diagonal elements of A. We also assign independent Inverse Gamma priors to each σ_j^2, j = 1, ..., k, and each τ_i^2, i = 1, ..., q. Figure 2 summarises the PS parametrization: GriPS regressions are noncentered on A and centered on ϕ = {σ_j, φ_j, ν_j : j = 1, ..., k}. Ultimately, we are interested in Λ = A J^{-1} Q^{-1}, meaning that a PS parametrization is simultaneously centered and noncentered on Λ, since J and Q include the centered parameters. Next, we outline the sampling algorithms for estimation and prediction.

Updating r proceeds following G. Taking reference node v_i ∈ A, let T_i be the locations having S_i as parent set, so that we can write y_{T_i} = X_{T_i} β + Z_{T_i} r_i + ε_{T_i}, with ε_{T_i} ~ N(0, D_{T_i} + Σ_{T_i}). Denote ỹ_{T_i} = I_{T_i}(y_{T_i} − X_{T_i} β) and D̃_{T_i} = I_{T_i}(D_{T_i} + Σ_{T_i}) I_{T_i}, where I_{T_i} is a diagonal matrix with zeros at elements corresponding to missing outcomes in T_i and ones elsewhere. Calling Ch[v_i] = {v_j : v_i ∈ Pa[v_j]}, we partition H_j = [H_{j,i} : H_{[j]∖i}] for each v_j ∈ Ch[v_i], and similarly r_[j] = (r_i^⊤, r_{[j]∖i}^⊤)^⊤, where the [j]∖i notation refers to the set Pa[v_j] \ {v_i}.
Then the full conditional distribution of r_i is Gaussian, with covariance G_i and mean g_i such that

G_i^{-1} = Z_{T_i}^⊤ D̃_{T_i}^{-1} Z_{T_i} + R_i^{-1} + Σ_{j ∈ Ch[v_i]} H_{j,i}^⊤ R_j^{-1} H_{j,i},
G_i^{-1} g_i = R_i^{-1} H_i r_[i] + Z_{T_i}^⊤ D̃_{T_i}^{-1} ỹ_{T_i} + Σ_{j ∈ Ch[v_i]} H_{j,i}^⊤ R_j^{-1} (r_j − H_{[j]∖i} r_{[j]∖i}),

where H_{j,i} denotes the block of H_j multiplying r_i. Updating β proceeds via its full conditional distribution, which is N(β; m_β^{(n)}, V_β^{(n)}), where V_β^{(n)-1} = V_β^{-1} + X^⊤ (D_n + Σ)^{-1} X and V_β^{(n)-1} m_β^{(n)} = V_β^{-1} m_β + X^⊤ (D_n + Σ)^{-1} (y − Z r). Updates for other unknowns may proceed via Metropolis steps: for A and τ, the centered parametrization results in the target density being the prior π(A) π(τ) multiplied by the following product of multivariate Gaussians,

∏_{ℓ_i ∈ T} N(y(ℓ_i); X(ℓ_i)^⊤ β + Z_{ℓ_i} r_[ℓ_i], D + Σ_{ℓ_i}) |_obs,   (15)

where we restrict each term to the observed outcomes, i.e., each term is an l-dimensional Gaussian corresponding to the l ≤ q observed outcomes at spatial location ℓ_i. For ϕ = {σ_j, φ_j, ν_j : j = 1, ..., k}, the Metropolis update following the centered parametrization targets the density obtained by multiplying the prior π(ϕ) by ∏_{i=1}^M N(r_i; H_i r_[i], R_i), which translates (8) for the GriPS coregionalization model. In our implementation, we use the robust adaptive Metropolis algorithm of Vihola (2012) to target an acceptance rate of about 0.23. We note that it is computationally cheaper to update A, τ, and σ^2 than φ and ν, as the latter two require recomputing the correlations. Finally, if T is an incomplete grid and we choose S ⊃ T as the grid which "fills the gaps" in T, then the regression error is ε ~ N(0, D_n).
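For a reference node without children, the full conditional above reduces to a conjugate Gaussian update, which can be verified against direct conditioning of the joint Gaussian of (r_i, y_{T_i}). A self-contained sketch with invented dimensions (the two routes must agree by standard Gaussian identities):

```python
import numpy as np

rng = np.random.default_rng(2)
k, m = 3, 5                        # latent block size, number of observations
Z = rng.normal(size=(m, k))        # Z_{T_i}
Dn = np.diag(rng.uniform(0.1, 0.5, size=m))           # D~_{T_i} (noise + Sigma_l)
Bm = rng.normal(size=(k, k))
R = Bm @ Bm.T + np.eye(k)          # R_i
m0 = rng.normal(size=k)            # H_i r_[i], the parent-given prior mean
y = rng.normal(size=m)             # y~_{T_i}, outcomes minus X beta

# Full conditional (precision form), as in the Gibbs update for a childless node:
Gi_inv = Z.T @ np.linalg.solve(Dn, Z) + np.linalg.inv(R)
gi = np.linalg.solve(Gi_inv, np.linalg.solve(R, m0) + Z.T @ np.linalg.solve(Dn, y))

# Independent check: condition the joint Gaussian of (r, y) directly.
Syy = Z @ R @ Z.T + Dn
mean_check = m0 + R @ Z.T @ np.linalg.solve(Syy, y - Z @ m0)
cov_check = R - R @ Z.T @ np.linalg.solve(Syy, Z @ R)
```

The precision-form update costs O(k^3) per node and parallelizes across same-color nodes of G, which is what makes the Gibbs sweep over r fast.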
In that case, with the prior distributions mentioned earlier, A and τ can be updated via their full conditionals; refer to the supplementary material for details. If S ≠ T, then those full conditionals are the correct updates for posterior sampling of the model y = X β + Z r + ε̃, where ε̃ ~ N(0, D_n), which is not (14) since Σ is omitted; that model is instead akin to a meshed version of the "unmodified" predictive process of Banerjee et al. (2008). We expect the performance of the two models to be very similar when S is large and closely approximates T. All applications in this article use Metropolis updates for (14) on non-gridded data.
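When the full conditionals of A and τ are available (the gridded case detailed in the supplement), the updates are standard conjugate steps. The sketch below (dimensions and hyperparameters are mine, and I write the Inverse-Gamma rate with the usual ½ Ê⊤Ê term) illustrates them for a single outcome row.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 2
F = rng.normal(size=(n, k))              # latent margins r at the n observed locations
A_row = np.array([0.9, -0.3])            # j-th row of A (its nonzero entries)
tau2 = 0.25
Yj = F @ A_row + rng.normal(scale=tau2 ** 0.5, size=n)   # outcome j minus X beta

# Inverse-Gamma full conditional for tau_j^2 (a, b are prior hyperparameters):
a, b = 2.0, 1.0
Ehat = Yj - F @ A_row                    # residuals at the current A
a_star = a + n / 2.0
b_star = b + 0.5 * Ehat @ Ehat

# Gaussian full conditional for the j-th row of A under a N(0, V_A) prior:
V_A = np.eye(k)
prec = np.linalg.inv(V_A) + F.T @ F / tau2
A_mean = np.linalg.solve(prec, F.T @ Yj / tau2)
```

With enough data the posterior mean of the row of A concentrates near the generating value, as the test below checks loosely.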
We illustrate the performance of GriPS in estimation and prediction in univariate spatial regressions using simulated data. Each generated dataset includes a realization of a Gaussian process with Matérn covariance (5), fixing the smoothness ν at its known value (either ν = 0.5 or ν = 1.5). The outcomes are observed at irregularly spaced locations in [0, 1]^2. The total number of observed locations is n = 5000. Our main goal is the efficient estimation of the posterior distribution of σ^2, φ, τ^2. Predictive performance is used to validate our methodology. We use a grid of n_out = 10000 spatial locations as our test set. We set S in GriPS models to overlap with the test set; we are thus sampling the spatial process at twice as many locations as there are observations. For ν = 0.5, we generate 50 datasets for each
combination of the considered values of σ^2 and φ, and similarly for ν = 1.5 with a different pair of φ values, for a total of 400 datasets, each of size n_all = 15000.

QMGPs based on GriPS are compared with unmodified "latent" QMGPs and NNGPs, both of which also sample the latent process, but without reparametrization or splitting. We also consider a "response" NNGP model which avoids sampling the latent process, and a stochastic partial differential equation approach (Lindgren et al., 2011) via integrated nested Laplace approximations (INLA; Rue et al., 2009) that avoids MCMC altogether. The SPDE-INLA approach scales to big data via a sparse GMRF approximation of the Gaussian field, which requires the user to define a mesh of points. We fix the mesh at the test set, which is also the reference set in GriPS. Carefully assessing MCMC convergence for each of the 400 datasets is impractical; hence, we fix the number of iterations to 5000 for all MCMC-based methods. In these settings, GriPS is almost always faster in computing 5000 MCMC iterations than the non-sampling-based SPDE-INLA method on the same task. We provide more details on the implemented methods in the supplementary material, along with additional figures and summaries of computing time for each approach.

Table 1: MCMC efficiency of GriPS in the estimation of Matérn parameters in univariate settings. For posterior samples of size 2500, we report effective sample size: absolute, per unit time, and relative to the NNGP response model.
Table 2: Accuracy in estimating the Matérn parameters φ, σ^2, τ^2 in univariate settings: root mean square error (RMSE), mean absolute percentage error (M%E), and uncertainty quantification in terms of coverage (UQ: Covg, with target 95%), averaged across 400 datasets.

Table 3: Accuracy in predictions at 10000 unobserved locations, averaged across 400 datasets, measured as root mean square prediction error (RMSPE), continuous ranked probability score (CRPS), and uncertainty quantification measured as coverage of 95% prediction intervals.

              RMSPE   CRPS    UQ: Covg   (ν = 0.5)
QMGP-GriPS    0.3489  0.1960  0.9445
QMGP latent   0.3512  0.1967
R-INLA

Figure 3: Effective sample size (log-scaled) in estimating σ^2, φ, τ^2 at different Matérn smoothness (ν = 0.5 or ν = 1.5), from an MCMC sample of size 2500.

Sampling efficiency in terms of effective sample size per unit time is shown in Table 1. As expected, the "latent" NNGP method is less efficient than its "response" counterpart. The latter displays performance similar to the unmodified QMGP model. On the other hand, the GriPS method is massively more efficient than all other methods when estimating σ^2 and φ: the relative efficiency improvements are between one and two orders of magnitude. These efficiency improvements are reflected in the accuracy of estimation, as shown in Table 2. In particular, GriPS provides more reliable uncertainty quantification, as evidenced by the better coverage of the posterior 95% credible intervals (uncertainty intervals for all 400 datasets are reported in the supplementary material). Furthermore, GriPS-based methods most accurately estimate the exponential covariance parameters, with the smoother ν = 1.5 slightly favoring an NNGP response model. However, in the latter scenario GriPS still achieves the best predictive performance, as shown in Table 3.

We have, thus far, considered unmodified QMGPs in our comparison, which use S = T with irregularly spaced locations. As a consequence, the net effect of the PS parametrization alone is unclear. We address this matter in the supplementary material by considering gridded data, in which case all QMGP models use the same reference set and thus all performance differences are attributable to the PS parametrization. The results of that analysis show that the efficiency improvements due to PS are even larger than shown in this section, with QMGP-GriPS more than two orders of magnitude more efficient than an NNGP model of the response in estimating σ^2 and φ.

Figure 4: At the top: tree cover in 2000 (left) and forest loss year (right) over the entire study area. At the bottom: the two outcomes (p.90 and f.cvr) shown for a smaller subregion. The strips along which the outcomes are recorded appear as thin lines in the top maps.

We consider a large set of remotely sensed light detection and ranging (LiDAR) data, collected using the National Aeronautics and Space Administration (NASA) Goddard's LiDAR, Hyperspectral, and Thermal (G-LiHT) Airborne Imager (Cook et al., 2013) in Summer 2014 over a large region in Alaska. The dataset includes observations of forest canopy height (labeled p.90 and measured in meters) and fractional forest cover (f.cvr) at 12,270,334 spatial locations, which we seek to jointly characterize in a multivariate regression model. Owing to the low flying altitude (355m) and narrow field of view (30°) of the sensor, the observed data locations are not uniformly spread out in the study area. Instead, they are highly concentrated along swaths less than 200m wide and separated by 9km of unobserved areas. This dataset has been considered in the context of univariate spatial regressions in Finley et al. (2019), to which we refer for additional details.

We select a sub-region of the data covering about 31.2% of the observed data, and sample 2.5 million spatial locations uniformly at random in this area to build the dataset we consider. The resulting study area includes 17 strips of observed data. The effective dimension of our data is thus nq = 5 × 10^6, accounting for q = 2.
We seek to make predictions of both outcomes over the full sub-region (including the wide areas separating the observed strips) based on the spatial regression model (14), using two covariates that are available across the entire study area as shown in Figure 4. The two covariates are: (i) percent forest cover measured in the year 2000 (labeled tc); and (ii) a forest-loss variable (loss) that gives the number of years since 2000 at which a location had a forest-destroying disturbance such as a fire. Outcomes and predictors are shown in Figure 4.

We build the reference set S for GriPS by taking a grid (colored in orange in Figure 5) of regularly spaced locations covering the selected study area. However, this grid is not dense enough to accurately characterize spatial dependence, since only a small portion of the reference locations are near the observed outcomes. For this reason, we augment S with an additional patterned grid (colored in blue in Figure 5) built to cover the 17 strips, but not the unobserved areas. Rather than aiming to perfectly overlap this grid with the data, we create a regular pattern of blue strips by equally spacing their centers. The slight off-centering of some data swaths along this pattern motivates us to expand each grid strip symmetrically in both directions.

We implement the QMGP model on S by partitioning each coordinate axis into regular intervals. This patterned setup ensures that only a few matrix inverses have to be computed at each iteration. Given that p.90 > 0 and 0 < f.cvr < 1, we fit (14) on p̃.90 = log(p.90) and analogously take f̃.cvr = log(f.cvr / (1 − f.cvr)). We ran MCMC for a total of 20,000 iterations, of which we discarded the first 5,000 as burn-in.
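The transformations used before fitting (14), and their inverses applied to posterior draws before summarizing, are simple enough to state in code (function names are mine):

```python
import numpy as np

def to_model_scale(p90, fcvr):
    """Map the outcomes to the real line before fitting (14):
    log for p.90 > 0, logit for 0 < f.cvr < 1."""
    return np.log(p90), np.log(fcvr / (1.0 - fcvr))

def to_data_scale(p90_t, fcvr_t):
    """Inverse transforms, applied to posterior samples before summarizing."""
    return np.exp(p90_t), 1.0 / (1.0 + np.exp(-fcvr_t))
```

Applying the inverse transform draw by draw (rather than to posterior summaries) is what keeps prediction intervals valid on the original outcome scale.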
Predictions on a test set made of 10,000 left-out observations are based on a thinned chain consisting of a total of 500 samples (memory constraints inhibit storage of a significantly larger number of samples) and are shown in Table 5. Full prediction maps over the whole study area, along with uncertainty bands, are shown in Figure 6. All predictions and their summaries have been computed after inverse-transforming all posterior samples from MCMC.

Figure 5: Construction of S by overlaying a grid of regularly spaced locations (orange) on a patterned grid (blue); alignment with the data swaths is achieved by rotating S.

We implement GriPS on two alternative partitions of the spatial domain. The first partitioning scheme tessellates the domain into 33 intervals along the horizontal and 500 along the vertical axis, resulting in a coarse partitioning of S. The second uses the same partitioning for the horizontal axis, but a finer subdivision for the vertical axis. Prediction results from the two implementations are reported in Table 5. All other figures and tables refer to the coarse-partition (33 × 500) model. We observe that the coarser partitioning strategy is conservative, as it results in relatively large regions including up to 140 spatial locations. Finer partitioning leads to faster computations at a minor cost in terms of predictive accuracy. Finally, Table 4 provides marginal summaries of the posterior distribution of the unknown covariance and regression parameters. Marginal posterior densities and the full Markov chains are available in the Appendix. We find that loss has a negative impact on both outcomes even after accounting for spatial variability via the bivariate latent process. The two margins are positively associated, as evidenced by the positive value of the off-diagonal term of Λ (their correlation at a distance h = 0 is estimated to be positive).
We have introduced a Grid-Parametrize-Split (GriPS) strategy for resolving inefficiencies in MCMC sampling of spatial regression models and demonstrated its validity as an alternative to methods not based on sampling the latent process a posteriori.

Figure 6: Predictions of p.90 and f.cvr and uncertainty quantification in terms of the width of 95% credible intervals.

Table 4: Estimated posterior summaries for the unknown covariance parameters θ, regression coefficients β, and measurement error variances τ^2 for the GriPS regression on the Alaskan LiDAR dataset.

Table 5: Prediction results over the test set, and timing, of the two implemented models.

                   33 × 500 partition        finer partition
                   p.90      f.cvr           p.90      f.cvr
RMSPE              1.7045    0.1181          1.7320    0.1197
Covg (95%)         0.9396    0.9384          0.9400    0.9401
Time/iteration     8.18s                     3.98s
Num. iterations    20,000                    30,000
Total time         45.46h                    33.17h

Our development of the methodology and the subsequent numerical experiments cogently illustrate the effectiveness of GriPS in conducting spatial big data analysis. Our focus on MGPs, Matérn covariances and LMCs, and Gaussian outcomes suggests several possible directions for future research. On one hand, we expect our insights to be applicable more generally to Bayesian estimation of spatial regressions with latent processes (not necessarily MGPs). Moreover, research may focus on efficient Bayesian parametrizations of covariance models alternative to the Matérn and/or the LMC (see, e.g., Gneiting et al., 2010; Genton and Kleiber, 2015; Zhang and Banerjee, 2020), and on applying similar ideas to non-Gaussian outcomes. Other natural extensions include adapting DAG-based processes to spatial-temporal data analysis. Challenges specific to spatial-temporal domains include identifying "neighbors" over space and time, as the scale of associations along space may differ significantly from that along time.
Dynamic NNGPs (Datta et al., 2016b) resolve this problem by estimating neighbors instead of fixing them. QMGPs have been implemented on spatiotemporal domains (Peruzzi et al., 2020), but efficient estimation of the covariance parameters via MCMC remains challenging. Extensions of GriPS to spatial-temporal domains could therefore circumvent some of these problems and possibly provide easier and more effective analysis of spatial-temporal big data.
Funding
M.P. and D.B.D. have received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 856506), and grant R01ES028804 of the United States National Institutes of Health (NIH). A.O.F. received funding from the U.S. National Science Foundation (NSF) grants DMS-1916395 and EF-1253225, and the NASA Carbon Monitoring System program. S.B. received funding from the National Science Foundation (NSF) through grants NSF/DMS 1916349 and NSF/IIS 1562303, and from the National Institute of Environmental Health Sciences (NIEHS) through grants R01ES030210 and R01ES027027.
A Full conditional updates of A and τ

If the data are gridded or we sample r at non-reference locations, full conditionals for A and τ are available. Let Y be the n × q matrix whose j-th column is Y_j = y_j − X_j β_j, i.e., the j-th outcome at locations T minus the static linear predictor for the same outcome. Similarly, denote by F the n × k matrix whose j-th column is the j-th margin of the r process. We can then write Y = F A^⊤ + E, where E is an n × q matrix such that vec{E} ~ N(0, I_n ⊗ D), with vec{·} the vectorization operator. For outcome j, let F_(j) subset F to the columns corresponding to the nonzero elements of the j-th row of A, denoted A_{j∖}. Then, with D = diag{τ_1^2, ..., τ_q^2}, the full conditional distribution of τ_j^2 is InvGamma(a*_j, b*_j) with parameters a*_j = a + n_j/2 and b*_j = b + (1/2) Ê_j^⊤ Ê_j, where Ê_j = Y_{j,obs} − F_{(j),obs} A_{j∖}^⊤ and Y_{j,obs} is the n_j × 1 vector collecting the observed elements of Y_j. Finally, A_{j∖} has a Gaussian full conditional distribution with precision V_{A_j}^{(n)-1} = V_{A_j}^{-1} + τ_j^{-2} F_{(j),obs}^⊤ F_{(j),obs} and mean m_{A_j}^{(n)} = V_{A_j}^{(n)} τ_j^{-2} F_{(j),obs}^⊤ Y_{j,obs}.

B Comparisons on synthetic gridded data
We compare QMGP-GriPS to unmodified QMGPs and NNGP models of the response on gridded data. Since the data grid can be used directly as the reference set, we can write y(ℓ) = X(ℓ)^⊤ β + w(ℓ) + ε(ℓ), where ε(ℓ) ~ N(0, D), and we therefore need not modify QMGPs. We thus relabel GriPS as QMGP-PS in this section.

While the overall simulation design is the same as in the main article, here we sample the univariate Gaussian process with Matérn covariance at n = 15625 spatial locations on a 125 × 125 grid, using D = [0, 1]^2 as our spatial domain. Then, all process realizations inside a rectangular sub-region and at a fraction of the remaining locations (chosen uniformly at random) are removed from the data and considered as test set for predictions. All other design choices remain the same, implying that once again our comparisons are based on 400 datasets.

We report MCMC efficiency summaries in Table 6 and estimation and prediction results in Table 7. Compared to the synthetic data analysis in the main article, and relative to unmodified QMGPs, QMGP-PS shows only a slight improvement over QMGP-GriPS.

Figure 7: Effective sample size in the gridded synthetic data application.

This is a positive result for GriPS since it implies that the loss in performance when choosing
S ≠ T is only minor, and attributes most of the improvements to the PS parametrization. Furthermore, the simplifications arising in QMGPs from gridded data result in much faster computations. Relative to NNGP models of the response, we show here an additional gain in sampling efficiency.

C Details on implemented models
In the synthetic data applications, the prior distribution on τ^2 was chosen as an inverse Gamma with common fixed hyperparameters for QMGP-GriPS, latent QMGPs, and both NNGP models. For latent QMGPs and NNGPs we used an inverse Gamma prior π(σ^2), whereas in GriPS this parameter is expanded, with prior π(σ^2) = Inv.Gamma(ε, ε) for small ε, and λ̆ assigned a zero-mean Gaussian prior truncated below zero. The SPDE-INLA approach was implemented for ν = 0.5 following Krainski et al. (2019), Chapter 2, setting α = ν + 1 = 1.5; setting ν = 1.5 was unsupported by inla.spde2.pcmatern in R-INLA. The priors on σ and φ were specified via tail probabilities of the form Pr(σ > σ_0) and Pr(φ < φ_0), as required by inla.spde2.pcmatern. The mesh for INLA was fixed at a grid which overlapped with the spatial locations of the test set; in other words, we used for INLA the same mesh as the reference set in the QMGP-GriPS method. Smaller meshes are associated with faster computations in INLA but negatively impact prediction and estimation accuracy (especially for the nugget τ^2).

All models were run on a Ubuntu 20.04 workstation with a 12-core, 24-thread AMD Ryzen 9 3900 CPU and 128GB of memory; R version 4.0.3 was linked to Intel MKL 2019.5-075, which includes efficient LAPACK (Anderson et al., 1999) and BLAS (Blackford et al., 2002) libraries for linear algebra. GriPS-based QMGPs were run on 12 threads using OpenMP (Dagum and Menon, 1998), all other models on 22 threads. Figure 8 summarises the compute times for each model, whereas Figure 9 reports all 95% posterior intervals for the estimation of the unknown covariance parameters.

Table 6: MCMC efficiency in estimating the Matérn parameters in univariate gridded data settings. For posterior samples of size 2500, we report effective sample size: absolute, per unit time, and relative to the NNGP response model.

Effective sample sizes reported in the tables throughout the article were calculated using the effectiveSize function of the R package coda (Plummer et al., 2006).
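Effective sample size can be approximated with a plain autocorrelation-sum estimator; this simplified sketch (the truncation rule and cutoff are mine, so numbers will differ slightly from coda's spectral estimator) illustrates the quantity reported in Tables 1 and 6.

```python
import numpy as np

def ess(x, cutoff=0.05):
    """Effective sample size n / (1 + 2 * sum_k rho_k), summing sample
    autocorrelations until they drop below 'cutoff' (a simplified rule)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var0 = xc @ xc / n
    s = 0.0
    for k in range(1, n // 3):
        rho = (xc[:-k] @ xc[k:]) / (n * var0)
        if rho < cutoff:
            break
        s += rho
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(6)
iid = rng.normal(size=5000)               # ESS close to n
ar = np.empty(5000)
ar[0] = 0.0
for t in range(1, 5000):                  # AR(1) chain with rho = 0.9
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
```

For the AR(1) chain the integrated autocorrelation time is (1 + ρ)/(1 − ρ) = 19, so its ESS should be roughly 5000/19, far below the nominal sample size; this is exactly the kind of gap the ESS columns in the tables quantify.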
Prediction       ν = 0.5                       ν = 1.5
                 RMSPE    CRPS     UQ: Covg    RMSPE    CRPS     UQ: Covg
QMGP latent      0.3374   0.1887   0.9506      0.4816   0.2607   0.9554
NNGP response    0.3432   0.1914   0.9778      0.2789   0.1512   0.9782

Table 7: Top: accuracy in estimating the Matérn parameters φ and σ², plus τ², in univariate settings—root mean square error (RMSE), mean absolute percentage error (M%E), and uncertainty quantification in terms of coverage (UQ: Covg, with target 95%), averaged across 400 datasets. Bottom: accuracy in predictions—root mean square prediction error (RMSPE), continuous ranked probability score (CRPS), and 95% coverage.
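The prediction metrics in Table 7 can all be computed directly from posterior predictive draws. The following sketch (Python, with simulated stand-in data rather than actual model output) computes RMSPE, a sample-based CRPS, and empirical 95% coverage for a calibrated Gaussian predictive distribution at a set of held-out locations.

```python
import numpy as np

def rmspe(y, pred_mean):
    # root mean square prediction error
    return float(np.sqrt(np.mean((y - pred_mean) ** 2)))

def coverage95(y, draws):
    # draws: (n_mcmc, n_loc) posterior predictive samples
    lo = np.quantile(draws, 0.025, axis=0)
    hi = np.quantile(draws, 0.975, axis=0)
    return float(np.mean((y >= lo) & (y <= hi)))

def crps(y, draws):
    # sample-based CRPS: E|X - y| - 0.5 * E|X - X'|, averaged over locations
    t1 = np.mean(np.abs(draws - y[None, :]), axis=0)
    t2 = np.mean(np.abs(draws[:, None, :] - draws[None, :, :]), axis=(0, 1))
    return float(np.mean(t1 - 0.5 * t2))

rng = np.random.default_rng(0)
n_loc, m, sd = 100, 200, 0.5
mu = rng.normal(size=n_loc)                              # latent predictive means
y = mu + sd * rng.normal(size=n_loc)                     # held-out truth
draws = mu[None, :] + sd * rng.normal(size=(m, n_loc))   # calibrated predictive draws

print(rmspe(y, draws.mean(axis=0)), crps(y, draws), coverage95(y, draws))
```

Because the simulated predictive distribution is correctly calibrated, the empirical coverage lands near the nominal 95% and CRPS is smaller than RMSPE, mirroring the pattern of the well-calibrated rows in Table 7.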
Figure 8: Compute time in seconds for each model (QMGP-GriPS, QMGP latent, NNGP response, NNGP latent, SPDE-INLA). The top panel refers to the analysis and models implemented in the main article, the bottom to those in this supplement.
Figure 9: Coverage of posterior 95% intervals in all 400 data sets in the synthetic data application appearing in the main article, for all models and for both ν = 0.5 and ν = 1.5. Orange lines represent the true values; blue lines are 95% (marginal) intervals that include the true value, grey lines are intervals that do not include the true value. Intervals are calculated using MCMC samples of size 2500. SPDE-INLA intervals are extracted from the approximated marginal posterior distributions.
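The σ-expansion underlying GriPS, described with the prior specifications above, can be illustrated with a small Monte Carlo sketch. This is in Python rather than the R used elsewhere; the multiplicative form σ = λ · σ_base, the unit variance of the half-Gaussian, and the value of ε (taken larger here than in the paper, for numerical stability of the illustration) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
eps = 0.1  # the paper uses a much smaller eps; this value keeps the sketch stable

# base variance: Inv.Gamma(eps, eps), drawn as 1 / Gamma(shape=eps, rate=eps)
sigma2_base = 1.0 / rng.gamma(shape=eps, scale=1.0 / eps, size=n)

# expansion parameter: zero-mean Gaussian restricted to the positive half-line
lam = np.abs(rng.normal(size=n))

# assumed multiplicative expansion: the working scale is lam * sigma_base
sigma = lam * np.sqrt(sigma2_base)

# the induced prior on the expanded scale is proper and diffuse
print(float(np.median(sigma)))
```

The point of the expansion is computational rather than inferential: the posterior of the identified quantity is unchanged, while the extra λ direction lets the sampler move more freely, improving the mixing documented in Table 6 and Figure 9.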
Additional details on the LiDAR data application

Figures 10 and 11 add to the posterior summaries reported in the main article.

Figure 10: Markov chains for the unknown parameters in the regression model for the LiDAR data.

Figure 11: Estimated marginal posterior densities of the covariance parameters θ = { Λ , φ } and regression parameters β and τ², with panels for each of the two LiDAR outcomes, f.cvr and p.90.

References

Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D. W., and O'Neil, M. (2016), "Fast Direct Methods for Gaussian Processes,"
IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 252–265, doi:10.1109/TPAMI.2015.2448083.

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. (1999), LAPACK Users' Guide, Philadelphia, PA: Society for Industrial and Applied Mathematics, 3rd ed.

Banerjee, S. (2017), "High-Dimensional Bayesian Geostatistics," Bayesian Analysis, 12, 583–614, doi:10.1214/17-BA1056R.

— (2020), "Modeling Massive Spatial Datasets Using a Conjugate Bayesian Linear Modeling Framework," Spatial Statistics, in press, doi:10.1016/j.spasta.2020.100417.

Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014), Hierarchical Modeling and Analysis for Spatial Data, 2nd edition, Chapman and Hall/CRC, doi:10.1201/b17115.

Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008), "Gaussian predictive process models for large spatial data sets," Journal of the Royal Statistical Society, Series B, 70, 825–848, doi:10.1111/j.1467-9868.2008.00663.x.

Blackford, L. S., Petitet, A., Pozo, R., Remington, K., Whaley, R. C., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., et al. (2002), "An updated set of basic linear algebra subprograms (BLAS)," ACM Transactions on Mathematical Software, 28, 135–151.

Cook, B. D., Corp, L. A., Nelson, R. F., Middleton, E. M., Morton, D. C., McCorkel, J. T., Masek, J. G., Ranson, K. J., Ly, V., and Montesano, P. M. (2013), "NASA Goddard's LiDAR, Hyperspectral and Thermal (G-LiHT) Airborne Imager," Remote Sensing, 5, 4045–4066, doi:10.3390/rs5084045.

Cressie, N. and Johannesson, G. (2008), "Fixed Rank Kriging for Very Large Spatial Data Sets," Journal of the Royal Statistical Society, Series B, 70, 209–226, doi:10.1111/j.1467-9868.2007.00633.x.

Dagum, L. and Menon, R. (1998), "OpenMP: an industry standard API for shared-memory programming," Computational Science & Engineering, IEEE, 5, 46–55.

Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016a), "Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets," Journal of the American Statistical Association, 111, 800–812, doi:10.1080/01621459.2015.1044091.

Datta, A., Banerjee, S., Finley, A. O., Hamm, N. A. S., and Schaap, M. (2016b), "Nonseparable dynamic nearest neighbor Gaussian process models for large spatio-temporal data with an application to particulate matter analysis," The Annals of Applied Statistics, 10, 1286–1316, doi:10.1214/16-AOAS931.

Dunson, D. and Johndrow, J. E. (2020), "The Hastings algorithm at fifty," Biometrika, 107, 1–23, doi:10.1093/biomet/asz066.

Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M., and Niemi, J. (2014), "Estimation and Prediction in Spatial Models With Block Composite Likelihoods," Journal of Computational and Graphical Statistics, 23, 295–315, doi:10.1080/10618600.2012.760460.

Finley, A. O., Banerjee, S., Ek, A. R., and McRoberts, R. E. (2008), "Bayesian multivariate process modeling for prediction of forest attributes," Journal of Agricultural, Biological, and Environmental Statistics, 13, 60, doi:10.1198/108571108X273160.

Finley, A. O., Datta, A., Cook, B. D., Morton, D. C., Andersen, H. E., and Banerjee, S. (2019), "Efficient Algorithms for Bayesian Nearest Neighbor Gaussian Processes," Journal of Computational and Graphical Statistics, 28, 401–414, doi:10.1080/10618600.2018.1537924.

Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009), "Improving the performance of predictive process modeling for large datasets," Computational Statistics and Data Analysis, 53, 2873–2884, doi:10.1016/j.csda.2008.09.008.

Furrer, R., Genton, M. G., and Nychka, D. (2006), "Covariance Tapering for Interpolation of Large Spatial Datasets," Journal of Computational and Graphical Statistics, 15, 502–523, doi:10.1198/106186006X132178.

GEDI (2020), "Global Ecosystem Dynamics Investigation LiDAR," https://gedi.umd.edu.

Gelfand, A. E., Sahu, S. K., and Carlin, B. P. (1995), "Efficient Parametrisations for Normal Linear Mixed Models," Biometrika, 82, 479–488, doi:10.2307/2337527.

Gelman, A. (2006), "Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper)," Bayesian Analysis, 1, 513–533, doi:10.1214/06-BA117A.

Genton, M. G. and Kleiber, W. (2015), "Cross-Covariance Functions for Multivariate Geostatistics," Statistical Science, 30, 147–163, doi:10.1214/14-STS487.

Geoga, C. J., Anitescu, M., and Stein, M. L. (2020), "Scalable Gaussian Process Computations Using Hierarchical Matrices," Journal of Computational and Graphical Statistics, 29, 227–237, doi:10.1080/10618600.2019.1652616.

Ghosh, J. and Dunson, D. B. (2009), "Default Prior Distributions and Efficient Posterior Computation in Bayesian Factor Analysis," Journal of Computational and Graphical Statistics, 18, 306–320, doi:10.1198/jcgs.2009.07145.

Gneiting, T., Kleiber, W., and Schlather, M. (2010), "Matérn Cross-Covariance Functions for Multivariate Random Fields," Journal of the American Statistical Association, 105, 1167–1177, doi:10.1198/jasa.2010.tm09420.

Gonzalez, J., Low, Y., Gretton, A., and Guestrin, C. (2011), "Parallel Gibbs Sampling: From Colored Fields to Thin Junction Trees," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, eds. Gordon, G., Dunson, D., and Dudík, M., Fort Lauderdale, FL, USA: PMLR, vol. 15 of Proceedings of Machine Learning Research, pp. 324–332.

Gramacy, R. B. and Apley, D. W. (2015), "Local Gaussian Process Approximation for Large Computer Experiments," Journal of Computational and Graphical Statistics, 24, 561–578, doi:10.1080/10618600.2014.914442.

Guhaniyogi, R. and Banerjee, S. (2018), "Meta-Kriging: Scalable Bayesian Modeling and Inference for Massive Spatial Datasets," Technometrics, 60, 430–444, doi:10.1080/00401706.2018.1437474.

Guhaniyogi, R., Finley, A. O., Banerjee, S., and Gelfand, A. E. (2011), "Adaptive Gaussian predictive process models for large spatial datasets," Environmetrics, 22, 997–1007, doi:10.1002/env.1131.

Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. (2019), "A Case Study Competition Among Methods for Analyzing Large Spatial Data," Journal of Agricultural, Biological and Environmental Statistics, 24, 398–425, doi:10.1007/s13253-018-00348-w.

Huang, H. and Sun, Y. (2018), "Hierarchical Low Rank Approximation of Likelihoods for Large Spatial Datasets," Journal of Computational and Graphical Statistics, 27, 110–118, doi:10.1080/10618600.2017.1356324.

ICESat-2 (2020), "Ice, Cloud, and Land Elevation Satellite-2," https://icesat-2.gsfc.nasa.gov/.

Jurek, M. and Katzfuss, M. (2020), "Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering," arXiv:2006.16901.

Kampe, T. U., Johnson, B. R., Kuester, M. A., and Keller, M. (2010), "NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure," Journal of Applied Remote Sensing, 4, 1–24, doi:10.1117/1.3361375.

Kastner, G. and Frühwirth-Schnatter, S. (2014), "Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models," Computational Statistics and Data Analysis, 76, 408–423, doi:10.1016/j.csda.2013.01.002.

Katzfuss, M. (2017), "A Multi-Resolution Approximation for Massive Spatial Datasets," Journal of the American Statistical Association, 112, 201–214, doi:10.1080/01621459.2015.1123632.

Katzfuss, M. and Guinness, J. (2019), "A general framework for Vecchia approximations of Gaussian processes," arXiv:1708.06302.

Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008), "Covariance Tapering for Likelihood-Based Estimation in Large Spatial Data Sets," Journal of the American Statistical Association, 103, 1545–1555, doi:10.1198/016214508000000959.

Krainski, E. T., Gómez-Rubio, V., Bakka, H., Lenzi, A., Castro-Camilo, D., Simpson, D., Lindgren, F., and Rue, H. (2019), Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA, CRC Press/Taylor and Francis Group.

Krause, A., Singh, A., and Guestrin, C. (2008), "Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies," Journal of Machine Learning Research, 8, 235–284.

Lindgren, F., Rue, H., and Lindström, J. (2011), "An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach," Journal of the Royal Statistical Society: Series B, 73, 423–498, doi:10.1111/j.1467-9868.2011.00777.x.

Liu, J. S. and Chen, R. (1995), "Blind Deconvolution via Sequential Imputations," Journal of the American Statistical Association, 90, 567–576, doi:10.1080/01621459.1995.10476549.

Liu, J. S. and Wu, Y. N. (1999), "Parameter Expansion for Data Augmentation," Journal of the American Statistical Association, 94, 1264–1274, doi:10.1080/01621459.1999.10473879.

Matheron, G. (1982), "Pour une analyse krigeante des données régionalisées," Technical report N.732, Centre de Géostatistique.

Papaspiliopoulos, O., Roberts, G. O., and Sköld, M. (2007), "A General Framework for the Parametrization of Hierarchical Models," Statistical Science, 22, 59–73, doi:10.1214/088342307000000014.

Peruzzi, M., Banerjee, S., and Finley, A. O. (2020), "Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains," Journal of the American Statistical Association, in press, doi:10.1080/01621459.2020.1833889.

Peruzzi, M. and Dunson, D. B. (2020), "Spatial multivariate trees for big data Bayesian regression," arXiv:2012.00943.

Plummer, M., Best, N., Cowles, K., and Vines, K. (2006), "CODA: Convergence Diagnosis and Output Analysis for MCMC," R News, 6, 7–11.

Quiroz, Z. C., Prates, M. O., and Dey, D. K. (2019), "Block Nearest Neighbor Gaussian processes for large datasets," arXiv:1604.08403.

Quiñonero-Candela, J. and Rasmussen, C. E. (2005), "A Unifying View of Sparse Approximate Gaussian process Regression," Journal of Machine Learning Research, 6, 1939–1959.

Ren, Q. and Banerjee, S. (2013), "Hierarchical Factor Models for Large Spatially Misaligned Data: A Low-Rank Predictive Process Approach," Biometrics, 69, 19–30, doi:10.1111/j.1541-0420.2012.01832.x.

Robert, C. P. and Casella, G. (2005), Monte Carlo Statistical Methods (Springer Texts in Statistics), Berlin, Heidelberg: Springer-Verlag.

Rue, H. and Held, L. (2005), Gaussian Markov Random Fields: Theory and Applications, Chapman & Hall/CRC, doi:10.1007/978-3-642-20192-9.

Rue, H., Martino, S., and Chopin, N. (2009), "Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations," Journal of the Royal Statistical Society: Series B, 71, 319–392, doi:10.1111/j.1467-9868.2008.00700.x.

Sang, H. and Huang, J. Z. (2012), "A full scale approximation of covariance functions for large spatial data sets," Journal of the Royal Statistical Society, Series B, 74, 111–132, doi:10.1111/j.1467-9868.2011.01007.x.

Shirota, S., Finley, A. O., Cook, B. D., and Banerjee, S. (2019), "Conjugate Nearest Neighbor Gaussian Process Models for Efficient Statistical Interpolation of Large Spatial Data," arXiv:1907.10109.

Snelson, E. and Ghahramani, Z. (2007), "Local and global sparse Gaussian process approximations," in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, vol. 2 of Proceedings of Machine Learning Research, pp. 524–531, http://proceedings.mlr.press/v2/snelson07a.html.

Stein, M. (1990), "Uniform Asymptotic Optimality of Linear Predictions of a Random Field Using an Incorrect Second-Order Structure," The Annals of Statistics, 18, 850–872, doi:10.1214/aos/1176347629.

Stein, M. L. (1999), Interpolation of spatial data: Some theory for Kriging, Springer Series in Statistics, New York: Springer-Verlag, doi:10.1007/978-1-4612-1494-6.

— (2014), "Limitations on low rank approximations for covariance matrices of spatial data," Spatial Statistics, 8, 1–19, doi:10.1016/j.spasta.2013.06.003.

Sun, Y., Li, B., and Genton, M. (2011), "Geostatistics for large datasets," in Advances and Challenges in Space-time Modelling of Natural Events, eds. Montero, J., Porcu, E., and Schlather, M., Berlin Heidelberg: Springer-Verlag, pp. 55–77, doi:10.1007/978-3-642-17086-7.

Tang, W., Zhang, L., and Banerjee, S. (2020), "On identifiability and consistency of the nugget in Gaussian spatial process models," arXiv:1908.05726.

Taylor-Rodriguez, D., Finley, A. O., Datta, A., Babcock, C., Andersen, H. E., Cook, B. D., Morton, D. C., and Banerjee, S. (2019), "Spatial factor models for high-dimensional and large spatial data: An application in forest variable mapping," Statistica Sinica, 29, 1155–1180, doi:10.5705/ss.202018.0005.

van Dyk, D. A. and Meng, X.-L. (2001), "The Art of Data Augmentation," Journal of Computational and Graphical Statistics, 10, 1–50, doi:10.1198/10618600152418584.

Vecchia, A. V. (1988), "Estimation and Model Identification for Continuous Spatial Processes," Journal of the Royal Statistical Society, Series B, 50, 297–312, doi:10.1111/j.2517-6161.1988.tb01729.x.

Vihola, M. (2012), "Robust adaptive Metropolis algorithm with coerced acceptance rate," Statistics and Computing, 22, 997–1008, doi:10.1007/s11222-011-9269-5.

Wackernagel, H. (2003), Multivariate Geostatistics: An Introduction with Applications, Springer, Berlin, doi:10.1007/978-3-662-05294-5.

Ying, Z. (1991), "Asymptotic properties of a maximum likelihood estimator with data from a Gaussian process," Journal of Multivariate Analysis, 36, 280–296, doi:10.1016/0047-259X(91)90062-7.

Yu, Y. and Meng, X.-L. (2011), "To Center or Not to Center: That Is Not the Question—An Ancillarity–Sufficiency Interweaving Strategy (ASIS) for Boosting MCMC Efficiency," Journal of Computational and Graphical Statistics, 20, 531–570, doi:10.1198/jcgs.2011.203main.

Zhang, H. (2004), "Inconsistent Estimation and Asymptotically Equal Interpolations in Model-Based Geostatistics," Journal of the American Statistical Association, 99, 250–261, doi:10.1198/016214504000000241.

Zhang, L. and Banerjee, S. (2020), "Spatial Factor Modeling: A Bayesian Matrix-Normal Approach for Misaligned Data," arXiv:2006.00595.