A Scalable Partitioned Approach to Model Massive Nonstationary Non-Gaussian Spatial Datasets
Benjamin Seiyon Lee and Jaewoo Park∗

Department of Statistics, George Mason University; Department of Statistics and Data Science, Yonsei University; Department of Applied Statistics, Yonsei University

November 30, 2020
Abstract
Nonstationary non-Gaussian spatial data are common in many disciplines, including climate science, ecology, epidemiology, and the social sciences. Examples include count data on disease incidence and binary satellite data on cloud mask (cloud/no cloud). Modeling such datasets as stationary spatial processes can be unrealistic since they are collected over large heterogeneous domains (i.e., spatial behavior differs across subregions). Although several approaches have been developed for nonstationary spatial models, these have focused primarily on Gaussian responses. In addition, fitting nonstationary models for large non-Gaussian datasets is computationally prohibitive. To address these challenges, we propose a scalable algorithm for modeling such data by leveraging parallel computing in modern high-performance computing systems. We partition the spatial domain into disjoint subregions and fit locally nonstationary models using a carefully curated set of spatial basis functions. Then, we combine the local processes using a novel neighbor-based weighting scheme. Our approach scales well to massive datasets (e.g., 1 million samples) and can be implemented in nimble, a popular software environment for Bayesian hierarchical modeling. We demonstrate our method on simulated examples and two large real-world datasets pertaining to infectious diseases and remote sensing.
Keywords:
Markov chain Monte Carlo, spatial partition, basis representation, parallel computing, nonstationary process, non-Gaussian spatial data

∗Corresponding Author: Department of Statistics and Data Science, Yonsei University; Department of Applied Statistics, Yonsei University, Seoul, 03722, Republic of Korea (e-mail: [email protected])

1 Introduction
Nonstationary spatial models have been used in a wide range of scientific applications, including disease modeling (Ejigu et al., 2020), remote sensing (Heaton et al., 2017), precision agriculture (Katzfuss, 2013), precipitation modeling (Risser and Calder, 2015), and air pollutant monitoring (Fuentes, 2002). Simple models assume second-order stationarity of the spatial process; however, this can be unrealistic since data are collected over heterogeneous domains, where the spatial processes can exhibit localized behaviors. Although several methods (cf. Fuentes, 2002; Risser and Calder, 2015; Heaton et al., 2017) have been developed for modeling nonstationary spatial data, these have focused primarily on Gaussian spatial data. Moreover, fitting these models poses both computational and inferential challenges, especially for large datasets. In this manuscript, we propose a scalable algorithm for fitting nonstationary non-Gaussian datasets. Our approach captures nonstationarity by partitioning the spatial domain and modeling local spatial processes using basis expansions. The new algorithm is computationally efficient in that: (1) partitioning the spatial domain permits parallelized computation on high-performance computing (HPC) systems; and (2) basis approximation of the spatial processes dramatically reduces the computational overhead.

There is a growing literature on modeling nonstationary spatial datasets. Weighted average methods (Fuentes, 2001; Kim et al., 2005; Risser and Calder, 2015) combine localized spatial models to reduce computational costs. Basis function approximations (Nychka et al., 2002; Katzfuss, 2013, 2017; Hefley et al., 2017) represent complex spatial processes using linear combinations of spatial basis functions. Higdon (1998) and Paciorek and Schervish (2006) represent a nonstationary process using convolutions of spatially varying kernel functions.
Owing to their spatial partitioning strategies, some of these approaches are amenable to massive spatial datasets. For example, Heaton et al. (2017) develop a computationally efficient approach for large nonstationary spatial data by partitioning the entire domain into disjoint sets using a hierarchical clustering algorithm. Katzfuss (2017) constructs basis functions at multiple levels of resolution based on recursive partitioning of the spatial region. Guhaniyogi and Banerjee (2018) propose a divide-and-conquer approach that generates a global posterior distribution by combining local posterior distributions from each subsample. Though these approaches scale well, they are limited to Gaussian responses.

Spatial generalized linear mixed models (SGLMMs) (Diggle et al., 1998) are a popular class of models designed for non-Gaussian spatial datasets. SGLMMs are widely used for both areal and point-referenced data, where latent Gaussian random fields account for the spatial correlations. However, fitting SGLMMs to massive spatial datasets is computationally demanding since the dimension of the correlated spatial process grows with an increasing number of observations. Although several computational methods (Banerjee et al., 2008; Hughes and Haran, 2013; Guan and Haran, 2018; Lee and Haran, 2019; Zilber and Katzfuss, 2020) have been proposed for large non-Gaussian spatial datasets, these methods assume second-order stationarity of the latent spatial processes.

In this manuscript, we propose a scalable approach for modeling massive nonstationary non-Gaussian spatial datasets. Our smooth mosaic basis approximation for nonstationary SGLMMs (SMB-SGLMMs) combines key ideas from weighted average approaches and basis approximations.
SMB-SGLMMs consist of four steps: (1) partition the spatial region using a spatial clustering algorithm (Heaton et al., 2017); (2) generate localized spatial basis functions; (3) fit a nonstationary basis-representation model to each partition; and (4) smooth the local processes using a distance-based weighting scheme (smooth mosaic). Due to the partitioning and localized model fitting, we can leverage parallel computing, which greatly increases the scalability of the SMB-SGLMM method. To our knowledge, this study is the first attempt to develop a scalable algorithm for fitting large nonstationary non-Gaussian spatial datasets. Furthermore, our method provides an automated mechanism for selecting appropriate spatial basis functions. We also provide ready-to-use code written in nimble (de Valpine et al., 2017), a software environment for Bayesian inference.

The outline for the remainder of this paper is as follows. In Section 2, we introduce several nonstationary modeling approaches; we discuss the potential extension of stationary SGLMMs to nonstationary SGLMMs and the associated challenges. In Section 3, we propose SMB-SGLMMs for massive spatial data, provide implementation details, and investigate the computational complexity of our method. In Section 4, we study the performance of SMB-SGLMMs through simulated data examples. In Section 5, we apply SMB-SGLMMs to malaria incidence data and binary cloud mask data from satellite imagery. We conclude with a discussion and summary in Section 6.
2 Nonstationary Modeling for Non-Gaussian Spatial Data
Let Z = {Z(s_i)}_{i=1}^N be the observed data and X ∈ R^{N×p} be the matrix of covariates at the spatial locations s = {s_i}_{i=1}^N in a spatial domain S ⊆ R². Let W = {W(s_i)}_{i=1}^N be a mean-zero Gaussian process with covariance matrix Σ ∈ R^{N×N}. Then SGLMMs can be defined as

g{E[Z | β, W]} := η = Xβ + W, W ∼ N(0, Σ), (1)

with link function g(·) and linear predictor η. Standard SGLMMs (Diggle et al., 1998) consider a second-order stationary Gaussian process for W because of its convenient mathematical framework. However, this assumption can be unrealistic for spatial processes over large heterogeneous domains (see Bradley et al. (2016) for a discussion). A natural extension of (1) is to model W as a nonstationary spatial process. There is an extensive literature on modeling nonstationary spatial data (Sampson, 2010), including: (1) weighted average methods, (2) basis function methods, and (3) process convolutions. Our method is motivated by these nonstationary modeling approaches.

Weighted average methods (Fuentes, 2001) divide the spatial region S into disjoint partitions and fit locally stationary models to each partition. For example, Kim et al. (2005) and Heaton et al. (2017) partition the spatial domain through Voronoi tessellation. Then, the global process is constructed by combining the locally stationary processes via a weighted average. The weights are computed using the distances between the observation locations and the 'centers' of the localized processes. These approaches scale well by taking advantage of parallel computation (cf. Risser and Calder, 2015; Heaton et al., 2017).

Basis function approaches represent the nonstationary covariance structure as an expansion of spatial basis functions {Φ_j(s)}_{j=1}^m. Let Φ be an N × m matrix whose columns correspond to the basis functions and whose rows correspond to the locations, Φ_{i,j} = Φ_j(s_i).
Then we can construct a nonstationary spatial process as

W ≈ Φδ, δ ∼ N(0, Σ_Φ),

where δ is the vector of basis coefficients. We approximate the covariance structure as Φ Σ_Φ Φ^⊤, which does not depend solely on the lag between locations; hence the process is nonstationary. Different types of basis functions have been used, for instance eigenfunctions obtained from the empirical covariance (Holland et al., 1999), multiresolution basis functions (Nychka et al., 2002, 2015; Katzfuss, 2017), and computationally efficient low-rank representations of nonstationary covariances (Katzfuss, 2013).

Process convolutions represent nonstationary spatial processes through convolutions of a spatially varying kernel function with Brownian motion. For an arbitrary s ∈ S,

W(s) = ∫_S K_s(u) dW(u),

where K_s(·) is a kernel function centered at location s and W(·) is a bivariate Brownian motion. Higdon (1998) uses bivariate Gaussian kernels under this framework. Several extensions have also been proposed, including closed-form nonstationary Matérn covariance functions (Paciorek and Schervish, 2006), extensions to multivariate spatial processes (Kleiber and Nychka, 2012), and computationally efficient local likelihood approaches (Risser and Calder, 2015).

We note that these nonstationary models have focused on Gaussian responses. Direct application of these methods to (1) is challenging because we cannot obtain closed-form maximum likelihood estimates by marginalizing out W. Within the Bayesian framework, updating the conditional posterior distributions requires a computational complexity of O(N³), which becomes infeasible even for moderately large datasets (e.g., binary satellite data with 100,000 observations). Although several computationally efficient approaches (cf.
Rue et al., 2009; Hughes and Haran, 2013; Guan and Haran, 2018; Lee and Haran, 2019; Zilber and Katzfuss, 2020) have been developed for non-Gaussian hierarchical spatial models, they assume stationarity of W. In what follows, we develop partitioned nonstationary models for non-Gaussian spatial data. Our method is computationally efficient and provides accurate predictions over large heterogeneous spatial domains.

3 Smooth Mosaic Basis Approximations for Nonstationary SGLMMs

We propose a smooth mosaic basis approximation for nonstationary SGLMMs (SMB-SGLMMs) designed for massive spatial datasets. We begin with an outline of our method:
Step 1. Partition the spatial domain into disjoint subregions.

Step 2. Construct data-driven basis functions for each subregion.

Step 3. Fit a locally nonstationary basis function model to each subregion in parallel.

Step 4. Construct the global nonstationary process as a weighted average of local processes.

SMB-SGLMMs are described in Figure 1. We provide the details in the following subsections.
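The four steps above can be sketched end-to-end in a schematic way. In the following Python sketch (our own illustration, not the paper's implementation), plain k-means stands in for the dissimilarity-based clustering of Step 1 and penalized least squares stands in for the MCMC fit of Step 3; only the thin plate spline basis mirrors the paper:

```python
import numpy as np

def partition(coords, K):
    """Step 1 (stand-in): assign each location to one of K subregions.
    The paper uses dissimilarity-based agglomerative clustering; a few
    Lloyd (k-means) iterations on coordinates keep this sketch short."""
    rng = np.random.default_rng(0)
    centers = coords[rng.choice(len(coords), K, replace=False)].copy()
    for _ in range(10):
        labels = np.argmin(((coords[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = coords[labels == k].mean(axis=0)
    return labels

def build_basis(coords_k, knots):
    """Step 2: thin plate spline basis r^2 log(r) evaluated at each knot."""
    r = np.linalg.norm(coords_k[:, None] - knots[None], axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = np.where(r > 0, r ** 2 * np.log(r), 0.0)  # r = 0 maps to 0
    return phi

def fit_local(z_k, phi_k, ridge=1.0):
    """Step 3 (stand-in): ridge-penalized least squares instead of MCMC,
    returning basis coefficients delta_k for one partition."""
    m = phi_k.shape[1]
    return np.linalg.solve(phi_k.T @ phi_k + ridge * np.eye(m), phi_k.T @ z_k)
```

Step 4, the distance-based weighted averaging of the fitted local surfaces, is described below; each `fit_local` call is independent across partitions, which is what makes the procedure embarrassingly parallel.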
Step 1. Partition the spatial domain into disjoint subregions
We use an agglomerative clustering approach (Heaton et al., 2017) to partition the spatial domain S into K subregions {S_k}_{k=1}^K, which satisfy ∪_{k=1}^K S_k = S. We fit a nonspatial generalized linear model (the glm function in R) using the responses Z and covariates X. From this fit we obtain the spatially correlated residuals {ε(s_i)}_{i=1}^N. For all i ≠ j, we calculate the dissimilarity between ε(s_i) and ε(s_j) as d_ij = |ε(s_i) − ε(s_j)| / ‖s_i − s_j‖, based on spatial finite differences (Banerjee and Gelfand, 2006). Heaton et al. (2017) assign locations with low dissimilarity values (d_ij) to the same partitions. The main idea is to separate locations with large pairwise dissimilarities (i.e., rapidly changing residual surfaces ε(s)). We initialize K = N, where each observation belongs to its own cluster. Then we combine two clusters if they are Voronoi neighbors and have the minimum pairwise dissimilarity. We repeat this procedure until we arrive at the desired K partitions (Figure 1(a)). We provide details about the clustering algorithm in the supplementary material.

Figure 1: Illustration of the partitioned nonstationary approach for a simulated W. (a) The nonstationary W is partitioned into 16 subregions; different colors indicate disjoint partitions. (b) For each partition, thin plate spline basis functions are constructed at knots; the basis functions represent distinct spatial patterns. (c) The local nonstationary model is fit to each partition using a linear combination of basis functions. (d) The global nonstationary process is obtained via a weighted average of the local processes.

Step 2. Construct data-driven basis functions for each subregion

For each partition, we generate a collection of spatial basis functions. We have Z_k = {Z(s) : s ∈ S_k} ∈ R^{N_k}, the observations belonging to S_k, where N = Σ_{k=1}^K N_k, and X_k is an N_k × p matrix of covariates.
Consider knots (grid points) {u_kj}_{j=1}^{m_k} over S_k (m_k ≪ N_k). These knots can define a wide array of spatial basis functions, such as radial basis functions (Nychka et al., 2015; Katzfuss, 2017) and eigenbasis functions (Banerjee et al., 2008). In this study, we consider thin plate splines defined as Φ_kj(s) = ‖s − u_kj‖² log(‖s − u_kj‖). Here Φ_k is the N_k × m_k matrix obtained by evaluating the basis functions at the N_k locations in S_k (Figure 1(b)). Although we focus on thin plate splines, different types of basis functions can be considered. Examples include eigenfunctions (Holland et al., 1999; Banerjee et al., 2013; Guan and Haran, 2018), radial basis functions (Nychka et al., 2015; Katzfuss, 2017), principal components (Higdon et al., 2008; Cressie, 2015), and Moran's basis functions (Hughes and Haran, 2013; Lee and Haran, 2019).

Step 3. Fit a locally nonstationary basis function model to each subregion in parallel
For each partition, we represent the spatial random effects as W_k ≈ Φ_k δ_k and model the conditional mean E[Z_k | β_k, Φ_k, δ_k] as

g{E[Z_k | β_k, Φ_k, δ_k]} := η_k = X_k β_k + Φ_k δ_k, δ_k ∼ N(0, Σ_{Φ_k}), (2)

where Σ_{Φ_k} is the covariance of the basis coefficients δ_k. Here we set Σ_{Φ_k} = σ²_k I_k, as in a discrete approximation of a nonstationary Gaussian process (Higdon, 1998). This basis representation approximates the covariance through σ²_k Φ_k Φ_k^⊤. Such an approximation can capture the nonstationary behavior of the spatial process through a linear combination of basis functions (Figure 1(c)). Since we typically choose m_k ≪ N_k, basis representations can drastically reduce computational costs by avoiding large matrix operations. For our simulated example (Section 4.1), we use m_k = 81 for a partition of size N_k = 13,000. Φ_k can also reduce correlations in δ_k, resulting in fast-mixing MCMC algorithms (Haran et al., 2003; Christensen et al., 2006). For an exponential family distribution f(·), the partition-specific hierarchical spatial model is as follows:

Data Model: Z_k | η_k ∼ f(η_k), g{E[Z_k | β_k, Φ_k, δ_k]} := η_k = X_k β_k + Φ_k δ_k
Process Model: δ_k | σ²_k ∼ N(0, σ²_k I_k)
Parameter Model: β_k ∼ p(β_k), σ²_k ∼ p(σ²_k) (3)

We complete the hierarchical model by assigning prior distributions to the model parameters β_k and σ²_k.

Step 4. Construct the global nonstationary process as a weighted average of the local processes

To construct the global process, we use a weighted average of the fitted local processes. Recall that Φ_k ∈ R^{N_k × m_k} is the basis function matrix consisting of thin plate splines Φ_kj(s) = ‖s − u_kj‖² log(‖s − u_kj‖) for s ∈ S_k, where {u_kj}_{j=1}^{m_k} are the knots over S_k. We now introduce additional notation: we define Φ̃_k ∈ R^{N × m_k} by evaluating Φ_kj(s) for all s ∈ S.
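Before the formal definitions, the weighted combination used in this step can be sketched in a few lines of Python. This is our own illustration: the function name is hypothetical, and we normalize the truncated exp(−distance) weights so they sum to one, which the proportionality in the paper's weighting scheme implies:

```python
import numpy as np

def combine_local_processes(s, centers, local_preds, radius=1.0):
    """Weighted average of K local predictions at a single location s.

    s           : (2,) query location
    centers     : (K, 2) representative points of the K local models
                  (hypothetical stand-in for the closest point of each S_k)
    local_preds : (K,) local predictions W_k(s)
    radius      : weighting radius; distances beyond it get zero weight
    """
    d = np.linalg.norm(centers - s, axis=1)
    w = np.where(d <= radius, np.exp(-d), 0.0)   # truncated exp(-distance)
    if w.sum() == 0.0:
        return float(local_preds[np.argmin(d)])  # fall back to nearest model
    return float(np.dot(w, local_preds) / w.sum())
```

A location far from every partition's representative point falls back to the nearest local model here; in practice the weighting radius is chosen so that every location is covered.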
Let Φ̃_k(s) ∈ R^{m_k} be the row of Φ̃_k corresponding to spatial location s ∈ S. Since W_k(s) ≈ Φ̃_k(s)^⊤ δ_k, we have

W(s) = Σ_{k=1}^K c_k(s) W_k(s) ≈ Σ_{k=1}^K c_k(s) Φ̃_k(s)^⊤ δ_k,
c_k(s) ∝ exp(−‖s − s̃_k‖) if ‖s − s̃_k‖ ≤ γ, and c_k(s) = 0 otherwise, (4)

where s̃_k ∈ S_k is the point of S_k closest to s. The weight c_k(s) decays with the distance between s and s̃_k; hence, shorter distances result in higher weights. We assign a zero weight if the distance exceeds a threshold, or weighting radius, γ. We present details about the choice of γ in Section 3.2. The weighted average of the local processes approximates the nonstationary global process (Figure 1(d)). Similarly, a global linear predictor η can be written as

η(s) = Σ_{k=1}^K [X(s)^⊤ β_k 1{s ∈ S_k} + c_k(s) Φ̃_k(s)^⊤ δ_k], (5)

where 1{s ∈ S_k} is an indicator function. Here X(s) ∈ R^p is the row of the covariate matrix X for location s, and β_k ∈ R^p is the corresponding vector of regression coefficients. Our method provides partition-varying estimates of β_k, because the fixed effects may exhibit spatially varying (nonstationary) behavior over large heterogeneous spatial domains. Therefore, as in Heaton et al. (2017), we report partition-varying β_k in our applications. If estimating a global β is of interest, one may consider divide-and-conquer algorithms such as consensus Monte Carlo (Scott et al., 2016) or the geometric median of subset posteriors (Minsker et al., 2017). Such methods provide the global posterior distribution of the fixed effects by combining subset posteriors.

Spatial prediction
Spatial prediction at unobserved locations is of great interest in many scientific applications. Let s* ∈ S be an arbitrary unobserved location. From the thin plate spline basis functions, we can construct a local basis as Φ_kj(s*) = ‖s* − u_kj‖² log(‖s* − u_kj‖), where {u_kj}_{j=1}^{m_k} are the knots in partition k. As in (5), we can also provide a global prediction:

η(s*) = Σ_{k=1}^K [X(s*)^⊤ β_k 1{s* ∈ S_k} + c_k(s*) Φ̃_k(s*)^⊤ δ_k]. (6)

Given posterior samples of {β_k, δ_k}_{k=1}^K, we can obtain the posterior predictive distribution of η(s*).

Tuning parameters

In this section, we provide automated heuristics for the tuning parameters. To implement SMB-SGLMMs, we need to specify: (1) the number of partitions K; (2) the location of the knots in each partition; and (3) a weighting radius γ for smoothing the local processes. In practice, we can set K ≤ C (the number of available cores) for parallel computation. Our method is heavily parallelizable, so computational walltimes tend to decrease with larger K. However, selecting a very large K may result in unreliable local estimates due to the small number of observations N_k within each partition. In our simulation study, we compare the performance of our approach for varying K. We then select the K that minimizes the out-of-sample root cross-validated mean squared prediction error (rCVMSPE). Based on the simulation results, SMB-SGLMM is robust to the choice of K.

To avoid overfitting, we use the lasso (Tibshirani, 1996) to select the appropriate number and location of the knots. Initially, we place m candidate knots {u_kj}_{j=1}^m uniformly over each partition S_k. We then run a lasso regression with response Z_k and covariates [X_k, Φ_k], where Φ_k is an N_k × m matrix. We impose an l1 penalty on only the basis coefficients δ_k, not the fixed effects β_k. We use the glmnet package (Friedman et al., 2010a) in R for the lasso regression.
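The knot-selection idea, penalizing only the basis coefficients while leaving the fixed effects unpenalized (glmnet's penalty.factor option), can be mimicked with a small proximal-gradient (ISTA) lasso. This is an illustrative sketch under a Gaussian working model, not the paper's glmnet-based implementation:

```python
import numpy as np

def select_knots(z, X, Phi, lam=1.0, n_iter=1000):
    """Return indices of basis columns (knots) retained by the lasso.

    ISTA on 0.5 * ||z - [X, Phi] coef||^2 + lam * ||delta||_1, where the
    soft-threshold is applied only to the basis coefficients delta,
    leaving the fixed effects beta unpenalized."""
    A = np.hstack([X, Phi])
    p = X.shape[1]
    coef = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1/L, L = ||A||_2^2
    for _ in range(n_iter):
        coef = coef - step * (A.T @ (A @ coef - z))  # gradient step
        d = coef[p:]                                 # soft-threshold delta only
        coef[p:] = np.sign(d) * np.maximum(np.abs(d) - step * lam, 0.0)
    return np.flatnonzero(coef[p:] != 0.0)
```

Knots whose coefficients are shrunk exactly to zero are discarded; the surviving columns form the partition's basis matrix for the MCMC stage.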
For basis selection, we retain the basis functions corresponding to the nonzero basis coefficients. Since we run the lasso regression independently for each partition, this step is embarrassingly parallel.

From a pre-specified set of values (e.g., γ ∈ {0.1, 0.25, 0.5, 1}), we choose the γ that yields the lowest rCVMSPE. Note that we choose γ upon completion of Steps 1-3, the computationally demanding parts of SMB-SGLMM. Since the calculations in (6) are inexpensive, there is very little additional cost associated with Step 4.

Computational complexity

We examine the computational complexity of SMB-SGLMM and illustrate how our approach scales with an increasing number of observations N. The three computationally demanding components are (1) basis selection (lasso), (2) MCMC for fitting the local processes, and (3) obtaining the global process. Here, parallelized computing is integral to the scalability of SMB-SGLMM. We provide the following discussion of computational costs and parallelization for each step:

1. Basis Selection:
In each partition, our method selects the m_k knots from m candidates using a regularization method (lasso). Based on results in Friedman et al. (2010a), the cost of the coordinate descent-based lasso is O(N_k m), where N_k is the number of observations in partition S_k. We can select the basis functions for each partition in parallel across K processors.

2. MCMC for local processes:
The computational cost is dominated by the matrix-vector multiplications Φ_k δ_k, where Φ_k is the N_k × m_k basis function matrix from the previous lasso step. The cost of this step is O(N_k m_k). We can fit the local processes in parallel across K processors.

3. Global Process:
We obtain the global process using the weighted averages in (4). This step requires O(N²) complexity to calculate a distance matrix, because the weights c_k(s) in (4) are based on the distances between observations. Computing c_k(s) requires a one-time computation of the distance matrix for all N locations, which can be readily parallelized across C available processors. We propose a novel way to "stream" the distances (Supplement) so that we can compute the weights c_k(s) without actually storing the final distance matrix (e.g., 8TB for N = 1 million).

Operations         Complexity
Basis selection    O(Σ_{k=1}^K N_k m / K)
MCMC               O(Σ_{k=1}^K N_k m_k / K)
Weighted average   O(N² / C)

Table 1: Computational complexity of SMB-SGLMMs. K is the number of partitions and C is the total number of available cores. N_k is the number of observations and m_k is the number of knots in each partition. Knots are selected from m candidate knots using a lasso.

Table 1 summarizes the complexity of SMB-SGLMM. Considering that the complexity of the stationary SGLMM is O(N³), SMB-SGLMM is fast and provides accurate predictions for nonstationary processes (details in Sections 4 and 5).

4 Simulated Examples

We implement SMB-SGLMMs on two simulated examples of massive (N = 100,000) spatial datasets. All models are implemented in nimble (de Valpine et al., 2017), a programming language for constructing and fitting Bayesian hierarchical models. Parallel computation is implemented through the parallel package in R. The computation times are based on a single 2.2 GHz Intel Xeon E5-2650v4 processor. All the code was run on the Pennsylvania State University Institute for Cyber Science-Advanced Cyber Infrastructure (ICS-ACI) high-performance computing infrastructure. Source code is provided in the Supplement.

Data are generated at 125,000 locations on a spatial domain S ⊂ R². We fit the spatial models using N = 100,000 observations and reserve the remaining N_cv = 25,000 for validation. Let Z = {Z(s_i) : s_i ∈ s}, where s = {s_1, ..., s_N}. Observations are generated using the SGLMM framework described in (1) with β = (1, 1). The random effects W = {W(s_i) : s_i ∈ s} are generated by convolving spatially varying kernel functions (Higdon, 1998; Paciorek and Schervish, 2006; Risser and Calder, 2015). For some s ∈ s and reference locations u_j ∈ D, we have W(s) = Σ_{j=1}^J K_s(u_j) V(u_j), where K_s(u_j) is a spatially varying Gaussian kernel function centered at reference location u_j and V(u_j) is a realization of Gaussian white noise. Additional details are provided in the Supplement. The binary dataset uses a Bernoulli data model with a logit link function logit(p) = log{p/(1 − p)}, and the count dataset is similarly generated using a Poisson data model with a log link function.

We model the localized processes using the hierarchical framework in (3). To complete the hierarchical model, we set priors following Hughes and Haran (2013): a mean-zero Gaussian prior on β and an inverse-gamma prior on σ²_k. We examine five partition settings K ∈ {4, 9, 16, 25, 36} and four weighting radii γ ∈ {0.1, 0.25, 0.5, 1}; in total, we study 5 × 4 = 20 cases. Basis selection is performed with the glmnet R package (Friedman et al., 2010b). We generate 100,000 samples from the posterior distribution π(β, δ, σ²) using a block random-walk Metropolis-Hastings algorithm with the adaptation routine from Shaby and Wells (2010). We examine predictive ability and computational cost, namely the rCVMSPE = sqrt{(1/N_cv) Σ_{i=1}^{N_cv} (Z_i − Ẑ_i)²} and the walltime required to run 100,000 iterations.

4.1 Count Data

Table 2 presents results for the out-of-sample prediction errors of the SMB-SGLMM approach. The results indicate that the performance of our approach is robust across different combinations of K and γ. For this example, predictive accuracy improves as we increase the number of partitions (K) and decrease the weighting radius; the combination K = 36 and γ = 0.1 yields the most accurate predictions. The computational cost is O(N_k m_k), where N_k and m_k are the number of observations and the number of selected basis functions (thin plate splines) within partition k, respectively. For the case where K = 36, the median number of basis functions per partition is 14, with a range of 4 to 169.

The localized parameter estimates of β are centered around the true parameter values β = (1,
1) (Supplement). For the case where the number of partitions K = 36, we also provide a map of the localized estimates of β = (β_1, β_2) in the Supplement.

Figure 2: True (top left) and predicted (top right) intensity surfaces for the simulated count data example, and true (bottom left) and predicted (bottom right) probability surfaces for the simulated binary data example. We set K = 36 and γ = 0.1 for the count data and K = 25 and γ = 0.1 for the binary data.

Weighting Radius (γ)                          Walltime
Partitions   0.1     0.25    0.5     1        (minutes)
4            1.079   1.109   1.147   1.204    113.13
9            1.060   1.084   1.137   1.228    67.92
16           1.059   1.073   1.104   1.196    65.19
25           1.057   1.073   1.127   1.576    65.34
36           –       –       –       –        –

Table 2: Root CVMSPE and total walltime (mins) for the simulated count data example. Rows denote the partition settings and columns correspond to the weighting radius (γ). We report the combined walltime for lasso, MCMC, and weighting.

4.2 Binary Data

In Table 3, we present prediction results for the binary simulated dataset. For this example, we observe that increasing the number of partitions K and reducing the weighting radius γ results in more accurate predictions and lower computational costs. Figure 2 includes the posterior predictive probability surface for the implementation yielding the lowest rCVMSPE (K = 25 and γ = 0.1). For the case where K = 25, the mean number of basis functions per partition is about 7. The localized parameter estimates of β are centered around the true parameter values β = (1, 1) (Supplement). For the case where the number of partitions K = 25, we provide a map of the localized estimates of β = (β_1, β_2).

Weighting Radius (γ)                          Walltime
Partitions   0.1     0.25    0.5     1        (minutes)
4            0.360   0.362   0.371   0.380    70.15
9            –       –       –       –        –
16           –       –       –       –        –
25           –       –       –       –        –
36           –       –       –       –        –

Table 3: Root CVMSPE and total walltime (mins) for the simulated binary data example. Rows denote the partition settings and columns correspond to the weighting radius (γ). We report the combined walltime for lasso, MCMC, and weighting.

5 Applications

In this section, we apply our method to two real-world datasets pertaining to malaria incidence in the African Great Lakes region and cloud cover from satellite imagery. For both large non-Gaussian nonstationary datasets, SMB-SGLMM provides accurate predictions within a reasonable timeframe.

5.1 Malaria Incidence Data

Malaria is a parasitic disease which can lead to severe illness and even death. Predicting occurrences at unobserved locations can be of significant interest for effective control interventions. We compiled malaria incidence data from the Demographic and Health Surveys of 2015 (ICF, 2020). The dataset contains malaria incidence (counts) from 4,
741 GPS clustersin nine contiguous countries in the African Great Lakes region: Burundi, the DemocraticRepublic of Congo, Malawi, Mozambique, Rwanda, Tanzania, Uganda, Zambia, and Zim-babwe. We use the population size, average annual rainfall, vegetation index of the region,and the proximity to water as spatial covariates. Under a spatial regression framework,Gopal et al. (2019) analyzes malaria incidence in Kenya using these environmental vari-ables. In this study, we extend this approach to multiple countries in the African GreatLakes region.We use N = 3 ,
973 observations to fit the model and save N cv = 948 observations for21eighting Radius ( γ ) WalltimePartitions 0.035 0.075 0.1 0.2 0.3 (minutes)2 55.1 62.7 69.4 78.6 88.2 33.23 58.6 69.7 76.8 95.2 132.1 19.94 56.9 70.7 82.1 82.2 94.2 16.9Table 4: Root CVMSPE and total walltime (mins) for the malaria incidence example.Rows denote the four partition classes and columns correspond to the chosen weightingradius. We report the combined walltime for lasso, MCMC, and weighting.Figure 3: Illustration of the malaria occurrence dataset for K = 2 and γ = 0 . K ∈ { , , } and γ ∈ { . , . , . , . , . } . For each partition, we set the numberof candidate knots to be approximately m = 500 and perform basis selection using lasso(Tibshirani, 1996). On average 49 . ,
000 iterations. Table 4 compares the rCVMSPE for each case. We observe that rCVMSPE increases with larger weighting radii γ, possibly due to oversmoothing at the partition boundaries. For smaller γ, the results are comparable across choices of K. For this example, setting K = 2 and γ = 0.035 yields the most accurate predictions. In Figure 3, the predicted intensities at the validation locations exhibit similar spatial patterns to the true count observations. We provide maps of the partition-varying coefficients (K = 2) in Figure 4. The smaller partition includes parts of northern Malawi, southern Tanzania, and northeastern Zambia. Here, the estimated rainfall coefficient is larger than in the other partition: while rainfall may increase malaria incidence throughout the region, this effect is more pronounced in the smaller partition.

Figure 4: Posterior mean estimates of the partition-varying (K = 2) fixed effects. Estimated coefficients for the covariates population (top left), vegetation (top right), water (bottom left), and rain (bottom right).

5.2 MODIS Cloud Mask Data

The National Aeronautics and Space Administration (NASA) launched the Terra satellite in December 1999 as part of the Earth Observing System. As in past studies (Sengupta and Cressie, 2013; Bradley et al., 2019), we model the cloud mask data captured by the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument onboard the Terra satellite. The response is a binary incidence of cloud mask at a 1 km × 1 km resolution. We use N = 1,000,000 observations to fit our model and reserve N_cv = 111,000 for validation. We model the binary observations as a nonstationary SGLMM via the SMB-SGLMM method. Similar to Sengupta and Cressie (2013) and Bradley et al. (2019), we include an intercept term and a vector of latitudes as the covariates and use a logit link function.

For the SMB-SGLMM approach, we vary the number of partitions K and the weighting radius γ ∈ {0.01, 0.025, 0.05, 0.1} for a total of 16 cases. For each partition, we begin with m = 1,000 knots (candidates), perform basis selection using lasso regression, and fit the model via MCMC.

Table 5: Misclassification rates for each combination of the number of partitions K and the weighting radius γ ∈ {0.01, 0.025, 0.05, 0.1}, with combined walltimes (hours). For K = 16 partitions, the rates are 0.192, 0.215, 0.229, and 0.266, with a walltime of 5.4 hours.

Figure 5 indicates that the binary observations and the predicted probability surface exhibit similar spatial patterns. We also provide the misclassification rate for each case in Table 5. The performance of SMB-SGLMMs is robust across different choices of K and γ. The results suggest that a moderate number of partitions (K = 25) and a smaller weighting radius (γ = 0.01) yield the most accurate predictions. The combined walltimes decrease when using more partitions; however, these are on the order of hours in all cases.
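The basis-selection step described above can be sketched as follows. This is a minimal illustration using scikit-learn's `Lasso` as a stand-in for the lasso regression of the paper (which uses glmnet-style coordinate descent); the dimensions, penalty level `alpha=0.1`, and the synthetic basis matrix `Phi` are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical stand-ins: n locations in one partition, m candidate knots.
n, m = 500, 100
Phi = rng.normal(size=(n, m))            # candidate basis matrix (one column per knot)
w_true = np.zeros(m)
w_true[:5] = 2.0                         # only a handful of knots truly contribute
y = Phi @ w_true + rng.normal(size=n)    # working response for the selection step

# The l1 penalty shrinks most coefficients exactly to zero; the surviving
# columns form the curated set of basis functions for this partition.
fit = Lasso(alpha=0.1).fit(Phi, y)
keep = np.flatnonzero(fit.coef_)         # indices of selected knots
Phi_selected = Phi[:, keep]
print(f"{len(keep)} of {m} candidate basis functions retained")
```

Only the selected columns `Phi_selected` enter the partition's hierarchical model, which is what keeps the per-partition MCMC cheap.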
Discussion

In this manuscript, we propose a scalable algorithm for modeling massive nonstationary non-Gaussian datasets. Existing approaches are limited to either stationary non-Gaussian or nonstationary Gaussian spatial data, but not both. Our method divides the spatial domain into disjoint partitions using a spatial clustering algorithm (Heaton et al., 2017). For each partition, we fit a localized model using a collection of thin plate spline basis functions.

Figure 5: Illustration for the MODIS cloud mask dataset. (left) True observations of cloud mask. (right) Posterior predictive probability surface (K = 25 and γ = 0.01).

Our approach can model N = 1 million binary observations within 4 hours. To our knowledge, this is the first method geared towards modeling nonstationary non-Gaussian spatial data at this scale.

The proposed framework can be extended to a wider range of spatial basis functions. In the literature, there exists a wide array of spatial basis functions, such as bi-square (radial) basis functions with varying resolutions (Cressie and Johannesson, 2008; Nychka et al., 2015; Katzfuss, 2017), empirical orthogonal functions (Cressie, 2015), predictive processes (Banerjee et al., 2008), Moran's basis functions (Griffith, 2003; Hughes and Haran, 2013), wavelets (Nychka et al., 2002; Shi and Cressie, 2007), Fourier basis functions (Royle and Wikle, 2005), and Gaussian kernels (Higdon, 1998). A closer examination of adopting Bayesian regularization methods (see O'Hara et al. (2009) for a detailed review) for selecting basis functions is also an interesting future research avenue.

Developing scalable methods for modeling nonstationary non-Gaussian spatio-temporal data is challenging. The partition-based basis function representation can be integrated into existing hierarchical spatio-temporal models. For example, we can approximate the nonstationary processes using a tensor product of spatial and temporal basis functions or by constructing data-driven space-time basis functions.

Acknowledgements
Jaewoo Park was supported by the Yonsei University Research Fund 2020-22-0501 and the National Research Foundation of Korea (NRF-2020R1C1C1A0100386811). The authors are grateful to Matthew Heaton, Murali Haran, John Hughes, and Whitney Huang for providing useful sample code and advice. The authors thank the anonymous reviewers for their careful review and valuable comments.

Supplementary Material for A Scalable Partitioned Approach to Model Massive Nonstationary Non-Gaussian Spatial Datasets
A Spatial Clustering Algorithm
Here, we provide the clustering algorithm (Heaton et al., 2017) in detail. We obtain residuals $\epsilon$ from a GLM fit with a response vector $Z \in \mathbb{R}^N$ and a covariate matrix $X \in \mathbb{R}^{N \times p}$. Let $\epsilon_k \in \mathbb{R}^{N_k}$ be the residuals belonging to cluster (partition) $S_k$. Then we can define the dissimilarity between two clusters as

$$d(S_{k_1}, S_{k_2}) = \left[ \frac{N_{k_1} N_{k_2}}{N_{k_1} + N_{k_2}} (\bar{\epsilon}_{k_1} - \bar{\epsilon}_{k_2})^2 \right] \bar{E},$$

where $\bar{\epsilon}_k$ is the average of $\epsilon_k$ and $\bar{E}$ is the average Euclidean distance between points in $S_{k_1}$ and $S_{k_2}$. Then the spatial clustering algorithm can be summarized as follows.

Algorithm 1
Spatial clustering algorithm (Heaton et al., 2017)

Initialize each location as its own cluster, $S_k = \{s_k\}$ for $k = 1, \dots, N$, so that we begin with $N$ clusters.

1. Find the pair of clusters $S_{k_1}, S_{k_2}$ with the minimum dissimilarity $d(S_{k_1}, S_{k_2})$ among Voronoi neighbors, i.e., $s_i \sim s_j$ for some $s_i \in S_{k_1}$ and $s_j \in S_{k_2}$.

2. Combine the two clusters, $S_{\min\{k_1,k_2\}} = S_{k_1} \cup S_{k_2}$, and set $S_{\max\{k_1,k_2\}} = \emptyset$.

Repeat Steps 1-2 until $K < N$ clusters remain.

We note that Algorithm 1 becomes computationally expensive as the number of observations increases. Following suggestions in Heaton et al. (2017), we perform clustering after aggregating the observations to a lattice $\{s^*_l\}_{l=1}^L$ ($L \ll N$). Here, $\mathcal{N}_l = \{s_i : \|s_i - s^*_l\| < \|s_i - s^*_m\| \text{ for all } m \neq l\}$ is the collection of observations whose closest lattice point is $s^*_l$, and $\bar{\epsilon}(s^*_l) = |\mathcal{N}_l|^{-1} \sum_{s_i \in \mathcal{N}_l} \epsilon(s_i)$. Then we apply Algorithm 1 to $\{\bar{\epsilon}(s^*_l)\}_{l=1}^L$ rather than to $\{\epsilon(s_i)\}_{i=1}^N$. Since the number of lattice points $L$ is much smaller than the number of observations $N$, the spatial clustering algorithm becomes computationally feasible. For instance, in our simulation studies we chose $L = 900$ for $N = 100{,}000$.

B Simulation of Nonstationary Spatial Random Effects
We describe how to generate the nonstationary spatial random effects from Section 4, $W = \{W(s_1), \dots, W(s_n)\}$. The nonstationary spatial random effects $W$ are generated by convolving a collection of spatially varying kernel functions (Higdon, 1998; Paciorek and Schervish, 2006; Risser and Calder, 2015). The construction procedure is broken down into four steps: (1) select locations for the "basis" and "reference" kernel functions; (2) construct "basis" kernels; (3) use the "basis" kernels to construct "reference" kernels on a finite grid; and (4) generate the nonstationary spatial random effects using the "reference" kernels.

In Step 1, we select $M$ "basis" locations $b = \{b_1, \dots, b_M\}$ on a coarse grid of evenly spaced locations over the spatial domain $\mathcal{D} \subset \mathbb{R}^2$. Similarly, we select $J$ "reference" locations $u = \{u_1, \dots, u_J\}$ on a finer grid of evenly spaced locations in $\mathcal{D}$. As in past studies (Higdon, 1998; Paciorek and Schervish, 2006; Risser and Calder, 2015), we typically select $M < J$. Figure 6 illustrates the placement of the "basis" and "reference" locations.

Figure 6: Locations (knots) for the "basis" (red) and "reference" (blue) kernel functions.

In Step 2, we construct Gaussian "basis" kernels centered at each basis location $b_m$ for $m = 1, \dots, M$. The "basis" kernels are defined as

$$K_m(x_i) = (2\pi)^{-1} |\Sigma_m|^{-1/2} \exp\left\{ -\frac{1}{2} (x_i - b_m)' \Sigma_m^{-1} (x_i - b_m) \right\},$$

where $b_m$ is a "basis" location, $x_i$ is the location of interest, and $\Sigma_m$ is the $2 \times 2$ covariance matrix of the $m$-th Gaussian "basis" kernel.

In Step 3, we construct the Gaussian kernels for the reference locations, $K_s(u_j)$ for $j = 1, \dots, J$, as a weighted average of the "basis" kernels $K_m(\cdot)$. The "reference" kernels are defined as

$$K_s(u_j) = \sum_{m=1}^{M} w_m(s) K_m(u_j),$$

where $w_m(s)$ are distance-based weights, $u_j$ are the "reference" locations, and $s$ is the spatial location of interest. Here, the weights satisfy $w_m(s) \propto \exp\{-\|s - b_m\|\}$ and $\sum_{m=1}^{M} w_m(s) = 1$.

Finally, in Step 4, we generate the nonstationary spatial random effects as

$$W(s) = \sum_{j=1}^{J} K_s(u_j) V(u_j),$$

where $K_s(u_j)$ is the spatially varying Gaussian kernel function centered at "reference" location $u_j$ and $V(u_j)$ is a realization of Gaussian white noise, $V(u_j) \sim N(0, \sigma_u^2)$.

In our implementation, we chose $M = 9$ "basis" locations $b = \{b_1, \dots, b_9\}$ and $J = 100$ "reference" locations $u = \{u_1, \dots, u_{100}\}$ on grids of evenly spaced locations in $\mathcal{D}$. The nine $2 \times 2$ kernel covariance matrices $\Sigma_1, \dots, \Sigma_9$ vary in scale and orientation across the basis locations; the resulting "basis" kernel functions $K_m(\cdot)$ are shown in Figure 7.

Figure 7: "Basis" kernel functions $K_m(\cdot)$ centered at locations $b_m$, $m = 1, \dots, 9$.
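The four-step construction can be sketched in a few lines of numpy. The covariance matrices `Sigmas` below are illustrative diagonal choices (the paper's exact $\Sigma_m$ values are not reproduced here), and the weight kernel follows the $\exp\{-\|s - b_m\|\}$ form given in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def grid(n_side):
    """Evenly spaced n_side x n_side grid on the unit square."""
    ax = np.linspace(0.05, 0.95, n_side)
    gx, gy = np.meshgrid(ax, ax)
    return np.column_stack([gx.ravel(), gy.ravel()])

M, J = 9, 100
b = grid(3)    # Step 1: M = 9 "basis" locations on a coarse grid
u = grid(10)   # Step 1: J = 100 "reference" locations on a finer grid

# Step 2: one 2x2 kernel covariance per basis location (illustrative values only).
Sigmas = [np.diag([0.05 + 0.01 * m, 0.08 - 0.005 * m]) for m in range(M)]

def K_basis(m, X):
    """Bivariate Gaussian "basis" kernel K_m evaluated at the rows of X."""
    d = X - b[m]
    Sinv = np.linalg.inv(Sigmas[m])
    quad = np.einsum('ij,jk,ik->i', d, Sinv, d)    # (x - b_m)' Sinv (x - b_m)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigmas[m])))

def simulate_W(S, sigma_u=1.0):
    """Steps 3-4: nonstationary random effects at the locations S (n x 2)."""
    V = rng.normal(scale=sigma_u, size=J)              # white noise at reference sites
    Kmu = np.stack([K_basis(m, u) for m in range(M)])  # (M, J): basis kernels at u
    dist = np.linalg.norm(S[:, None, :] - b[None, :, :], axis=2)  # (n, M)
    w = np.exp(-dist)
    w /= w.sum(axis=1, keepdims=True)                  # normalized distance weights
    Ks = w @ Kmu                                       # (n, J): reference kernels K_s(u_j)
    return Ks @ V                                      # W(s) = sum_j K_s(u_j) V(u_j)

S = rng.uniform(size=(50, 2))
W = simulate_W(S)
print(W.shape)   # (50,)
```

Because the effective kernel $K_s(\cdot)$ changes smoothly with $s$ through the weights $w_m(s)$, the resulting field $W$ has spatially varying (nonstationary) covariance.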
As mentioned in Section 3.1 (Step 4), we propose a parallelized method to compute the weights $c_k(s)$ for $k = 1, \dots, K$ and $s \in \mathcal{S}$. For a given location $s$, the weights are

$$c_k(s) \propto \exp\left\{ -\|s - \tilde{s}_k\| \right\},$$

where $\tilde{s}_k \in S_k$ is the point in partition $k$ with the shortest distance to $s$. The challenge lies in computing and storing the distances $\|s - \tilde{s}_k\|$. Naive implementations may simply compute the distance matrix between all $n$ locations, which requires $O(n^2)$ operations as well as $O(n^2)$ storage. In the MODIS example (Section 5.2), the distance matrix ($n$ = 1 million observations) demands 8 TB of storage.

We propose "streaming" the weight calculations $c_k(s)$ without storing the distance matrix. First, we parallelize over the $C$ available cores, where each core is assigned a location $s$. Then, we compute the distances between location $s$ and the remaining locations. This $n$-dimensional vector of distances (e.g., 8 MB for the MODIS case) can be stored in random access memory (RAM). Finally, we compute the weights $c_k(s)$ for each partition $k$. The task concludes once $c_k(s)$ has been computed for all $s$ in our dataset and for all partitions $k$.

D Partition-Varying Estimates

Figure 9: Maps of the spatially varying $\beta_0$ (left) and $\beta_1$ (right) parameters for the simulated Poisson dataset.

Figure 10: Maps of the spatially varying $\beta_0$ (left) and $\beta_1$ (right) parameters for the simulated binary dataset.

Figure 11: Maps of the spatially varying $\beta_0$ (left) and $\beta_1$ (right) parameters for the MODIS cloud mask dataset.

For $K = 4$ partitions, we report parameter estimates for $\beta_0$ and $\beta_1$ for the simulated count and binary datasets. This includes the posterior mean and 95% credible intervals for each partition.

For $K = 9$ partitions, we report parameter estimates for $\beta_0$ and $\beta_1$ for the simulated count and binary datasets.
This includes the posterior mean and 95% credible intervals for each partition.

For $K = 16$ partitions, we report parameter estimates for $\beta_0$ and $\beta_1$ for the simulated count and binary datasets. This includes the posterior mean and 95% credible intervals for each partition.

For $K = 25$ partitions, we report parameter estimates for $\beta_0$ and $\beta_1$ for the simulated count and binary datasets. This includes the posterior mean and 95% credible intervals for each partition.

For $K = 36$ partitions, we report parameter estimates for $\beta_0$ and $\beta_1$ for the simulated count and binary datasets. This includes the posterior mean and 95% credible intervals for each partition.

References

Banerjee, A., Dunson, D. B., and Tokdar, S. T. (2013). Efficient Gaussian process regression for large datasets.
Biometrika, 100(1):75–89.

Banerjee, S. and Gelfand, A. E. (2006). Bayesian wombling: Curvilinear gradient assessment under spatial process models. Journal of the American Statistical Association, 101(476):1487–1501.

Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):825–848.

Bradley, J. R., Cressie, N., Shi, T., et al. (2016). A comparison of spatial predictors when datasets could be very large. Statistics Surveys, 10:100–131.

Bradley, J. R., Holan, S. H., and Wikle, C. K. (2019). Bayesian hierarchical models with conjugate full-conditional distributions for dependent data from the natural exponential family. Journal of the American Statistical Association, 0(ja):1–29.

Christensen, O. F., Roberts, G. O., and Sköld, M. (2006). Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models. Journal of Computational and Graphical Statistics, 15(1):1–17.

Cressie, N. (2015). Statistics for Spatial Data. John Wiley & Sons.

Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226.

de Valpine, P., Turek, D., Paciorek, C., Anderson-Bergman, C., Temple Lang, D., and Bodik, R. (2017). Programming with models: writing statistical algorithms for general model structures with NIMBLE. Journal of Computational and Graphical Statistics, 26:403–413.

Diggle, P. J., Tawn, J., and Moyeed, R. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350.

Ejigu, B. A., Wencheko, E., Moraga, P., and Giorgi, E. (2020). Geostatistical methods for modelling non-stationary patterns in disease risk. Spatial Statistics, 35:100397.

Friedman, J., Hastie, T., and Tibshirani, R. (2010a). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.

Friedman, J., Hastie, T., and Tibshirani, R. (2010b). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

Fuentes, M. (2001). A high frequency kriging approach for non-stationary environmental processes. Environmetrics, 12(5):469–483.

Fuentes, M. (2002). Interpolation of nonstationary air pollution processes: a spatial spectral approach. Statistical Modelling, 2(4):281–298.

Gopal, S., Ma, Y., Xin, C., Pitts, J., and Were, L. (2019). Characterizing the spatial determinants and prevention of malaria in Kenya. International Journal of Environmental Research and Public Health, 16(24):5078.

Griffith, D. A. (2003). Spatial filtering. In Spatial Autocorrelation and Spatial Filtering, pages 91–130. Springer.

Guan, Y. and Haran, M. (2018). A computationally efficient projection-based approach for spatial generalized linear mixed models. Journal of Computational and Graphical Statistics, 27(4):701–714.

Guhaniyogi, R. and Banerjee, S. (2018). Meta-kriging: Scalable Bayesian modeling and inference for massive spatial datasets. Technometrics, 60(4):430–444.

Haran, M., Hodges, J. S., and Carlin, B. P. (2003). Accelerating computation in Markov random field models for spatial data via structured MCMC. Journal of Computational and Graphical Statistics, 12(2):249–264.

Heaton, M. J., Christensen, W. F., and Terres, M. A. (2017). Nonstationary Gaussian process models using spatial hierarchical clustering from finite differences. Technometrics, 59(1):93–101.

Hefley, T. J., Broms, K. M., Brost, B. M., Buderman, F. E., Kay, S. L., Scharf, H. R., Tipton, J. R., Williams, P. J., and Hooten, M. B. (2017). The basis function approach for modeling autocorrelation in ecological data. Ecology, 98(3):632–646.

Higdon, D. (1998). A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environmental and Ecological Statistics, 5(2):173–190.

Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008). Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103(482):570–583.

Holland, D. M., Saltzman, N., Cox, L. H., and Nychka, D. (1999). Spatial prediction of sulfur dioxide in the eastern United States. In geoENV II—Geostatistics for Environmental Applications, pages 65–76. Springer.

Hughes, J. and Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):139–159.

ICF (2004–2017, accessed July 1, 2020). Demographic and health surveys (various) [datasets]. Funded by USAID. Data retrieved from http://dhsprogram.com/data/available-datasets.cfm.

Katzfuss, M. (2013). Bayesian nonstationary spatial modeling for very large datasets. Environmetrics, 24(3):189–200.

Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517):201–214.

Kim, H.-M., Mallick, B. K., and Holmes, C. (2005). Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Association, 100(470):653–668.

Kleiber, W. and Nychka, D. (2012). Nonstationary modeling for multivariate spatial processes. Journal of Multivariate Analysis, 112:76–91.

Lee, B. S. and Haran, M. (2019). PICAR: An efficient extendable approach for fitting hierarchical spatial models. arXiv preprint arXiv:1912.02382.

Minsker, S., Srivastava, S., Lin, L., and Dunson, D. B. (2017). Robust and scalable Bayes via a median of subset posterior measures. The Journal of Machine Learning Research, 18(1):4488–4527.

Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015). A multiresolution Gaussian process model for the analysis of large spatial datasets. Journal of Computational and Graphical Statistics, 24(2):579–599.

Nychka, D., Wikle, C., and Royle, J. A. (2002). Multiresolution models for nonstationary spatial covariance functions. Statistical Modelling, 2(4):315–331.

O'Hara, R. B., Sillanpää, M. J., et al. (2009). A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85–117.

Paciorek, C. J. and Schervish, M. J. (2006). Spatial modelling using a new class of nonstationary covariance functions. Environmetrics, 17(5):483–506.

Risser, M. D. and Calder, C. A. (2015). Local likelihood estimation for covariance functions with spatially-varying parameters: the convoSPAT package for R. arXiv preprint arXiv:1507.08613.

Royle, J. A. and Wikle, C. K. (2005). Efficient statistical mapping of avian count data. Environmental and Ecological Statistics, 12(2):225–243.

Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392.

Sampson, P. (2010). Constructions for nonstationary spatial processes. In Gelfand, A. E., Diggle, P. J., Fuentes, M., and Guttorp, P. (eds.), The Handbook of Spatial Statistics, chapter 9.

Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88.

Sengupta, A. and Cressie, N. (2013). Hierarchical statistical modeling of big spatial datasets using the exponential family of distributions. Spatial Statistics, 4:14–44.

Shaby, B. and Wells, M. T. (2010). Exploring an adaptive Metropolis algorithm. Currently under review, 1(1):17.

Shi, T. and Cressie, N. (2007). Global statistical analysis of MISR aerosol data: a massive data product from NASA's Terra satellite. Environmetrics, 18(7):665–680.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Zilber, D. and Katzfuss, M. (2020). Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data.