[PDF] Multi-Scale Process Modelling and Distributed Computation for Spatial Data

Abstract

Recent years have seen a huge development in spatial modelling and prediction methodology, driven by the increased availability of remote-sensing data and the reduced cost of distributed-processing technology. It is well known that modelling and prediction using infinite-dimensional process models is not possible with large data sets, and that both approximate models and, often, approximate-inference methods, are needed. The problem of fitting simple global spatial models to large data sets has been solved through the likes of multi-resolution approximations and nearest-neighbour techniques. Here we tackle the next challenge, that of fitting complex, nonstationary, multi-scale models to large data sets. We propose doing this through the use of superpositions of spatial processes with increasing spatial scale and increasing degrees of nonstationarity. Computation is facilitated through the use of Gaussian Markov random fields and parallel Markov chain Monte Carlo based on graph colouring. The resulting model allows for both distributed computing and distributed data. Importantly, it provides opportunities for genuine model and data scaleability and yet is still able to borrow strength across large spatial scales. We illustrate a two-scale version on a data set of sea-surface temperature containing on the order of one million observations, and compare our approach to state-of-the-art spatial modelling and prediction methods.

Full PDF

MMulti-Scale Process Modelling and DistributedComputation for Spatial Data

Andrew Zammit-Mangion ∗ School of Mathematics and Applied Statistics, University of Wollongong,Wollongong, AustraliaORCID: orcid.org/0000-0002-4164-6866Jonathan RougierSchool of Mathematics, University of Bristol, Bristol, UKORCID: orcid.org/0000-0003-3072-7043

Abstract

Recent years have seen a huge development in spatial modelling and predictionmethodology, driven by the increased availability of remote-sensing data and thereduced cost of distributed-processing technology. It is well known that modellingand prediction using inﬁnite-dimensional process models is not possible with largedata sets, and that both approximate models and, often, approximate-inferencemethods, are needed. The problem of ﬁtting simple global spatial models to largedata sets has been solved through the likes of multi-resolution approximations andnearest-neighbour techniques. Here we tackle the next challenge, that of ﬁttingcomplex, nonstationary, multi-scale models to large data sets. We propose doingthis through the use of superpositions of spatial processes with increasing spatialscale and increasing degrees of nonstationarity. Computation is facilitated throughthe use of Gaussian Markov random ﬁelds and parallel Markov chain Monte Carlobased on graph colouring. The resulting model allows for both distributed comput-ing and distributed data. Importantly, it provides opportunities for genuine modeland data scaleability and yet is still able to borrow strength across large spatialscales. We illustrate a two-scale version on a data set of sea-surface temperaturecontaining on the order of one million observations, and compare our approach tostate-of-the-art spatial modelling and prediction methods.

Keywords:

Graph Colouring, Markov Chain Monte Carlo, Parallel Sampling,Spatial Statistics

Large spatial/spatio-temporal data sets are now centre-stage in several of the environ-mental sciences such as meteorology and glaciology. Two popular tools available to thespatial statistician to deal with such data are the hierarchical model and the closely-related notion of conditional independence (Cressie and Wikle, 2011, Section 2.1.5). In atwo layer, linear, Gaussian, data-process model, the widely adopted assumption that data ∗ Corresponding author: Tel.: +61-2-4221-5112; E-mail: [email protected] a r X i v : . [ s t a t . C O ] F e b re conditionally independent, given a low-dimensional underlying process, is suﬃcientfor developing inferential algorithms that scale linearly with the dimension of the data(and quadratically with the dimension of the process). Several methods capitalise on thisapproach for the spatial or spatio-temporal analysis of big data; these include ﬁxed rankkriging (Cressie and Johannesson, 2008), predictive processes (Banerjee et al., 2008), anda suite of approaches based on Gaussian Markov random ﬁeld (GMRF) approximationsto geostatistical models (e.g., Rue and Tjelmeland, 2002; Lindgren et al., 2011; Nychkaet al., 2015). For spatio-temporal variants see, for example, Sahu and Mardia (2005);Dewar et al. (2009); Cressie and Wikle (2011) and Wikle et al. (2019).In a two-layer data-process model, dimensionality reduction of the process needs tobe done with care, as the space of the latent functions that can be reproduced is no largerthan the span of the set of basis functions used for dimensionality reduction. If this spacedoes not contain the true signal, then any residual observed signal variability will be, atbest, attributed to other components in the hierarchical model, typically measurementerror or ﬁne-scale process variability. At worst, there is no such component and inferencesare over-conﬁdent (e.g., Zammit-Mangion and Cressie, 2020). As such, low-dimensionalmodels should only be used with big data if the analyst is conﬁdent that the true (hidden)process can be adequately reconstructed using the chosen basis-function representation.As a result, model approximations have been developed that ensure that several basisfunctions can be used. Two popular ones include one based on multi-scale GMRFs(Nychka et al., 2015), and one based on a multi-resolution approximation to the processcovariance function (Katzfuss, 2017). Unfortunately, models based on multi-scale GMRFsare currently hindered by the computational bottleneck of the required sparse Choleskyfactorisations, while neither approach is well-suited to approximate highly nonstationarymulti-scale processes. Rather, their motivation, and the associated inference methodsthat have been designed for them, are built on the premise that the underlying covariancefunction of the process is relatively simple; the approximations are made to be able toﬁt the model and predict with it when using large data sets. These methods work verywell when this is indeed the case. The next challenge is therefore to use models andappropriate inference methods for when we have big data and for when the underlyingprocess is multi-scale and highly nonstationary, as is the case in many applications.The model we propose in Section 2 models the process Y ( · ) as a sum of severalprocesses at various scales, where the degree of nonstationarity increases with the scaleof the process. We approximate these processes using GMRFs so that the conditionaldependence structure of the latent variables can be exploited for local processing and theuse of a parallel Gibbs sampler, as discussed in Section 3. In Section 4 we give somepractical guidelines on how to construct the model, and in Section 5 we demonstratethe two-scale variant on a data set containing on the order of one million observationsusing around one million basis functions, with the covariances constructed though severalhundreds of parameters to allow for local nonstationarity. Section 6 concludes. In this section we detail the model through its hierarchical structure, adopting the termi-nology of Berliner (1996). The top layer in the hierarchy is the data model (Section 2.1),the middle layer is the process model (Section 2.2), and the bottom layer is the param-eter model (Section 2.3). We often reference subsets of vectors and matrices through2uperscripts. Speciﬁcally, for a vector X , X I denotes the sub-vector of X constructedfrom the elements with indices in the set I . Sometimes we use subscripts to referencea speciﬁc sub-vector wherever using superscript notation would be tedious; for example X k is the k th component, or k th sub-vector, of X , where the component’s deﬁnition iscontext-speciﬁc. These ways of referencing can be used together; for example X { i } k refersto the i th element of the k th component of X . When conditioning we sometimes use thesymbol “rest” to denote the remainder of a sub-vector, and its precise deﬁnition is to betaken from the context. Consider, for example, the n -dimensional vector X . If we areconsidering the distribution of X S | X rest , where S ⊂ { , . . . , n } , then X rest ≡ X { ,...,n }\S . Assume one has measurements Z ≡ ( Z , . . . , Z m ) (cid:62) of some underlying zero-mean spatialprocess Y ( · ) on D ⊂ R or D ⊆ S , where S is the surface of the sphere. Our datamodel is given by Z l = Y ( s l ) + ε l , for l = 1 , . . . , m , where ε l is Gaussian measurement-error that is independent of Y ( · ) and that has variance σ ε ( s l ), and { s , . . . , s m } ⊂ D isa set of point-referenced measurement locations. In this article we assume that { ε l } areuncorrelated conditional on σ ε ( · ), and that m is large; in our case study in Section 5, m is on the order of one million. We deﬁne Y ( · ) to be the sum of ( K + 1) independent Gaussian processes (GPs), Y ( · ) = Y ( · ) + (cid:80) Kk =1 Y k ( · ) , where Y ( · ) is stationary with large spatial scale (i.e., a large range),and Y k ( · ) , k ≥ , is nonstationary with spatial scale decreasing with increasing k . In ourapproach K ≥

1, and in the case study of Section 5, K = 1.Without loss of generality we assume that Y k ( · ) , k = 0 , . . . , K, have zero expectation.We model the GPs through the stochastic partial diﬀerential equations( κ − ∆) α / ( τ Y ( s )) = W ( s ) , s ∈ D, (1)and, for k = 1 , . . . , K ,( κ k ( s ) − ∆) α k / ( τ k ( s ) Y k ( s )) = W k ( s ) , s ∈ D, (2)where W k ( · ) , k = 0 , . . . , K, are Gaussian spatial white noise processes, κ k are spatialscale parameters, τ k control the process variance, and ∆ is the Laplacian. When D ⊂ R ,the solution to (1) is a stationary GP with Mat´ern covariance function with smooth-ness parameter ν = α −

1, while that to (2) is a nonstationary process which, undersome regularity assumptions on κ k ( · ) and τ k ( · ), exhibits local Mat´ern behaviour withsmoothness parameter ν k = α k − ν k , k = 0 , . . . , K, are known.As in Lindgren et al. (2011) and Lindgren and Rue (2015), we project Y k ( · ) , k =0 , , . . . , K, onto a ﬁnite-dimensional basis through Y k ( · ) = a k ( · ) (cid:62) η k , where a k ( · ) is avector of r η k tent basis functions on a triangulation of D , T η k say (Scherer, 2017, Section12.6.2), and η k ∼ Gau( , Q − k ), where Q k is a sparse precision matrix. (Throughout,we use T β to denote the triangulation associated with the random quantity β ). Thetriangulations T η k , k = 0 , . . . , K, become ﬁner with increasing k since Y k ( · ) has spatialscale that decreases with k . Since each basis function corresponds to a vertex in the3riangulation, the coeﬃcients η k , k = 0 , . . . , K, which are independent across k , can beassociated with the vertices of the basis functions they weight; this property will facilitategraph partitioning and colouring in Section 3.After projecting onto our ﬁnite-dimensional subspace, we can write the measurementsas Z = K (cid:88) k =0 A k η k + ε , where A k ≡ ( a k ( s l ) : l = 1 , . . . , m ) (cid:62) , ε ∼ Gau( , Q − ε ) , and Q ε is diagonal withmeasurement-error precisions as non-zero elements. Collect all the unknown model pa-rameters used to construct Q ε and Q ≡ bdiag( Q , . . . , Q K ) into the vector θ . Let A ≡ ( A , . . . , A K ) and η ≡ ( η (cid:62) , . . . , η (cid:62) K ) (cid:62) . For this model, which is structurally iden-tical to the LatticeKrig model (Nychka et al., 2015), η | Z , θ also has a sparse precisionmatrix since Z is point referenced. However, both Z and η can contain millions ofelements so that storing, let alone computing with, A and this precision matrix, canbecome impractical. On the other hand GMRFs have the desirable property that theyadmit parallel-data and parallel-model conditional structures that can be used to facili-tate computation. More details on how we exploit this property are given in Section 3. As in Lindgren and Rue (2015) we do not consider the natural parameters τ , κ , and τ k ( · ) , κ k ( · ) , k = 1 , . . . , K, directly. Rather, we construct a parameter model for the processstandard deviations σ and { σ k ( · ) } , and the nominal ranges ρ and { ρ k ( · ) } . In our setting,for k = 1 , . . . , K (and similarly for k = 0), these are related to the natural parametersthrough log τ k ( · ) = 12 log (cid:18) Γ( ν k )4 π Γ( α k ) (cid:19) − log σ k ( · ) − ν k log κ k ( · ) , log κ k ( · ) = 12 log (8 ν k ) − log ρ k ( · ) , where Γ( · ) is the gamma function.We decompose the spatially-varying parameter log σ ε ( · ) as a weighted sum of r ε tentbasis functions b ε ( · ) constructed on a triangulation T ε on D . We also decompose log σ k ( · )and log ρ k ( · ) , k = 1 , . . . , K, as weighted sums of r θ k ≡ r σ k = r ρ k tent basis functions, b k ( · ),constructed on a triangulation T θ k ≡ T σ k = T ρ k on D . Note that here, and below, we usethe subscript θ k to identify quantities such as basis functions and triangulations that arecommon to both σ k ( · ) and ρ k ( · ). Since the parameter ﬁelds at scale k need to vary slowlyrelative to the process at that scale (Lindgren et al., 2011), the triangulation T θ k can bemuch coarser than T η k . These decompositions yield the following parameter models,log σ ε ( · ) = b ε ( · ) (cid:62) θ ε , log σ k ( · ) = b k ( · ) (cid:62) θ σ k , log ρ k ( · ) = b k ( · ) (cid:62) θ ρ k , for k = 1 , . . . , K , remembering that, for k = 0, σ and ρ are spatially-invariant.We assume that the elements of the parameter vectors θ X , X ∈ { ε, σ , ρ , . . . , σ K , ρ K } ,are independent and distributed according to Gaussian distributions. Speciﬁcally, we let4 { i } X | ω X , λ X IID ∼ Gau( ω X , λ X ) , i = 1 , . . . , r X , (3)where ω X = E( θ { i } X ), λ X = var( θ { i } X ), and r X is the dimension of θ X . Although we assumeindependence, a GMRF model for the parameters might also be used (see Monterrubio-G´omez et al., 2019), although this would introduce additional complexity to our updatingalgorithm of Section 3. The hyperparameters ω ρ k and λ ρ k need to reﬂect a prior judge-ment that the process range decreases with increasing k ; we provide some guidelines inSection 4 and an example of how this can be done in practice with a two-scale process inSection 5. Appropriate hyperparameters for constructing the prior distributions over θ ε and θ σ k can usually be deduced from domain knowledge and exploratory data analysis(see Section 5.2).We complete our parameter model by equipping the parameters appearing in thecoarse component, ρ and σ , with the prior lognormal distributions,log σ ∼ Gau( ω σ , λ σ ) , (4)log ρ ∼ Gau( ω ρ , λ ρ ) , (5)where reasonable values of ω σ and λ σ can generally be deduced from exploratory analysisof the data, while ω ρ and λ ρ can generally be ﬁxed using prior application knowledge(see Sections 4 and 5). As with k >

0, we assume that σ and ρ are independent.To summarise, the full set of parameters and the full conditional dependence structureof our model is given by the following directed acyclic graph (DAG): θ ε (cid:47) (cid:47) Z η (cid:56) (cid:56) η (cid:79) (cid:79) · · · η K (cid:106) (cid:106) ( σ , ρ ) (cid:79) (cid:79) ( θ σ , θ ρ ) (cid:79) (cid:79) · · · ( θ σ K , θ ρ K ) (cid:79) (cid:79) where the components of θ σ k and θ ρ k are mutually independent for k = 1 , . . . , K . The co-eﬃcients η construct a stationary process with spatially-invariant standard deviation andrange, σ and ρ , respectively. On the other hand, η , . . . , η K , construct nonstationaryprocesses with spatially-varying standard deviations and ranges. For these nonstationaryprocesses, the standard deviations and ranges are modelled through the coeﬃcients θ σ k and θ ρ k , k = 1 , . . . , K , where the individual elements are independent and their marginaldistributions are pre-speciﬁed. The measurement errors ε are assumed to be spatiallyuncorrelated and are thus modelled using just the basis-function coeﬃcients, θ ε . Our aim is to sample from the posterior distribution pr( η , . . . , η K , θ ε , θ , . . . , θ K | Z ),where θ ≡ ( σ , ρ ) (cid:62) , and θ k ≡ ( θ (cid:62) σ k , θ (cid:62) ρ k ) (cid:62) , k = 1 , . . . , K . The value of the model ispartly that it provides a ﬂexible representation of spatial nonstationarity, and partly thatits posterior distribution can be sampled from in a serial fashion using many Gibbs steps.Additionally, the Gibbs steps for η k and θ k , k = 1 , . . . , K, can be split and parallelised ifthey are too diﬃcult to sample directly. We hence have a parallel MCMC scheme thatcan scale well with both data size and model complexity.5n Section 3.1 we discuss Gibbs sampler updates for the process and parameter coef-ﬁcients in the ﬁrst scale; in Section 3.2 the updates for the process and parameter coeﬃ-cients in the other scales; in Section 3.3 a re-updating strategy for the process coeﬃcientsfor k >

0; in Section 3.4 the updates for the measurement-error variance parameters; andin Section 3.5 we summarise the sampler. k = 0 The process Y ( · ) captures large-scale eﬀects, and hence its triangulation T η can be coarserelative to T η k for k = 1 , . . . , K . Since θ ≡ ( σ , ρ ) (cid:62) and η can be expected to be highlycorrelated a posteriori (Knorr-Held and Rue, 2002) we sample them jointly in the Gibbssampler, which substantially improves the mixing of the Markov chain. Speciﬁcally, wesample from the full conditional distributionpr( η , θ | eelse) = pr( η | θ , eelse) × pr( θ | eelse excl . η ) , (6)by ﬁrst sampling from pr( θ | eelse excl . η ) and then from pr( η | θ , eelse), wherewe use ‘eelse’ as shorthand for ‘everything else.’ Here, and in the rest of the article, wealways sample using the most recently sampled values of the other variables.The conditional distribution of θ in (6) is a partially collapsed distribution, where η is integrated out:pr( θ | eelse excl . η ) = (cid:90) pr( η , θ | eelse) d η ∝ (cid:90) pr( Z | η , eelse) × pr( η | θ ) d η pr( θ ) . (7)The integrand in (7) is Gaussian, and therefore the integral can be evaluated analyticallyto give log pr( θ | eelse excl . η ) = 12 log | Q ε | + 12 log | Q |−

12 log | (cid:101) Q | − (cid:101) Z (cid:62) Q ε (cid:101) Z + 12 (cid:101) Z (cid:62) Q ε A (cid:101) µ + log pr( θ ) + const. , (8)where (cid:101) Q = A (cid:62) Q ε A + Q , (cid:101) µ = (cid:101) Q − A (cid:62) Q ε (cid:101) Z , and (cid:101) Z = Z − (cid:80) k> A k η k . The trian-gulation T η must be made suﬃciently coarse that factorising Q , as well as algebraicoperations of the form (cid:101) Q − X = Y , can be done on a single computing node. We providemore detailed guidelines on this in Section 4. All other operations can be computedin a distributed fashion from sums and products of smaller vectors and matrices corre-sponding to chunks of Z , { A k } , and { η k } . There is therefore no theoretical limit on thedata size and number of scales K that will preclude sampling from this distribution ina reasonable time-frame given suﬃcient parallel computing resources. This is also truefor all the sampling operations we outline below. The conditional distribution (8) doesnot have a recognisable distribution in θ , and so this update uses a Metropolis–Hastings6igure 1: Illustration of regions delineating the data footprint F (bold black line) andthe eﬀective process footprint T (bold red line) of θ I k (blue dot) and the support of itsassociated basis function (bold blue line). The black dots depict the data points, the redmesh depicts T η k , and the blue mesh depicts T θ k .step. The computational complexity of this step is dominated by the sparse Choleskyfactorisation of (cid:101) Q , which is approximately O ( r / η ) ﬂops.After updating θ we update η . The full conditional distribution of η is given by η | eelse ∼ Gau( (cid:101) µ , (cid:101) Q − ) , (9)where (cid:101) µ and (cid:101) Q have already been computed for (8). A sample of (9) is thereforeobtained ‘for free’ after the update of θ . k = 1 , . . . , K The update of θ k ≡ ( θ (cid:62) σ k , θ (cid:62) ρ k ) (cid:62) is more tricky than that of θ : The full conditionaldistribution of θ k has the same structure as that of θ given in (7), but since for k > T η k is ﬁne and r η k is large, the elements of the integral cannot beevaluated in memory, and so a sequential strategy is necessary.Let I index one element of θ σ k (equivalently θ ρ k ) and, with a slight abuse of notation,let θ I k denote the pair of elements ( θ I σ k , θ I ρ k ) (cid:62) . Let the set T be the eﬀective processfootprint of θ I k , deﬁned as the set of indices of η k for which the (prior) conditionaldistribution of η rest k excluding η T k , pr( η rest k | θ I k , θ rest k ), is approximately independent of θ I k (see (11) below). It might be helpful to picture η T k as a ‘halo’ around supp( b I k ), wheresupp( · ) denotes a function’s support; see Fig. 1. Further, let F be the data footprint of η T k , that is, the set of indices of the data points that lie in the subset of the domain given7y ∪ i ∈T supp( a { i } k ). Under the partitioning of the coeﬃcients { θ I k , θ rest k } , { η T k , η rest k } , andthe data { Z F , Z rest } , the relevant part of the full DAG is Z F Z rest η T k (cid:79) (cid:79) η rest k (cid:98) (cid:98) (cid:111) (cid:111) (cid:79) (cid:79) θ I k (cid:79) (cid:79) (cid:61) (cid:61) θ rest k (cid:97) (cid:97) (cid:79) (cid:79) (10)where we have used a dashed line to denote a conditional dependence relationship thatis weak, and one that can potentially be ignored for practical purposes.We jointly update { η T k , θ I k } via the full conditional distributionpr( η T k , θ I k | eelse) = pr( η T k | θ I k , eelse) × pr( θ I k | eelse excl. η T k ) , by ﬁrst sampling from pr( θ I k | eelse excl. η T k ) and subsequently from pr( η T k | θ I k , eelse).As with θ we have that pr( θ I k | eelse excl. η T k ) is a partially collapsed distribution givenby pr( θ I k | eelse excl. η T k ) = (cid:90) pr( θ I k , η T k | eelse) d η T k ∝ (cid:90) pr( Z F | η T k , eelse) pr( η T k | η rest k , θ I k , θ rest k ) × pr( η rest k | θ I k , θ rest k ) d η T k pr( θ I k ) . Since η k is a Gaussian Markov random ﬁeld it is computationally cheap to evaluate theconditional distribution pr( η T k | η rest k , θ I k , θ rest k ). However, it is computationally expensiveto evaluate the ‘marginal’ distribution pr( η rest k | θ I k , θ rest k ), as is required for this update,since it has a (large) precision matrix that is not sparse in general. We therefore resortto the approximation implied by the dashed line of the DAG in (10),pr( η rest k | θ I k , θ rest k ) ≈ pr( η rest k | (cid:101) θ I k , θ rest k ) , (11)where we have replaced θ I k by some value (cid:101) θ I k , which may depend on the value of θ rest k .This approximation is motivated by the fact that, notionally, θ I k controls the varianceparameters σ k ( · ) and ρ k ( · ) locally , and can therefore be expected to have diminishing eﬀecton the probability of η rest k by the marginalisation property of the Gaussian distribution.An attractive feature of this approximation is that it can be improved as much as needed,at the expense of increased computational cost, by increasing the size of the eﬀectiveprocess footprint T . The chosen eﬀective process footprint should be one for whichthe approximation (11) is acceptable in practice. In our application, setting T to bethe indices of the basis functions a k ( · ) that have their and their neighbours’ maximainside supp( b I k ) suﬃced, but more conservative choices could be considered if needed; seeSect. 4.3. 8ur approximation yieldspr( θ I k | eelse excl. η T k ) ∝∼ (cid:90) pr( Z F | η T k , eelse) pr( η T k | η rest k , θ I k , θ rest k ) d η T k × pr( θ I k ) , (12)the log of which is identical in structure to (8) with an added term due to the presenceof η rest k in the conditional distribution of η T k (Rue and Held, 2005, Sect. 2.2.5). Thisextra term does not alter the computational or memory complexity of the operations.In particular, the computational complexity is approximately O ( |T / | ) ﬂops, and thussampling will be feasible as long as |T | has the same order of magnitude as r η , which wehave set to be suﬃciently small to make it possible for these computations to be done ona single computing node.As with θ , this distribution has no recognisable form and requires a Metropolis–Hastings step. Further, the computations for this conditional distribution can once againbeneﬁt from a distributed data framework. However, since there are r θ k of them (onefor each parameter basis function), for the inference framework to be scaleable, the totalcomputational time required by these updates must not depend on r θ k . This is indeedthe case under our approximation as we now show.Take two sets of parameters θ I k and θ I k with associated eﬀective process footprints T and T for which η T k ⊥⊥ η T k | η rest k , θ k , and for which the associated data footprints F ∩ F = ∅ (this condition is trivially satisﬁed in our context where Z are point refer-enced and the { a k ( · ) } are tent basis functions). Partition θ k as { θ I k , θ I k , θ rest k } , η k as { η T k , η T k , η rest k } , and Z as { Z F , Z F , Z rest } . Then, the relevant part of the DAG wherewe now omit the weak conditional dependencies for clarity, is Z F Z rest Z F η T k (cid:79) (cid:79) η rest k (cid:98) (cid:98) (cid:111) (cid:111) (cid:79) (cid:79) (cid:60) (cid:60) (cid:47) (cid:47) η T k (cid:79) (cid:79) θ I k (cid:79) (cid:79) θ rest k (cid:61) (cid:61) (cid:97) (cid:97) (cid:79) (cid:79) θ I k (cid:79) (cid:79) (13)From the moralised version (Lauritzen, 1996, Chapter 2) of (13), we immediately see thatthe coeﬃcients { η rest k , θ rest k } separate the two groups { η T k , θ I k } and { η T k , θ I k } so thatpr( { η T k , θ I k } , { η T k , θ I k } | eelse) ∝∼ pr( η T k , θ I k | eelse) pr( η T k , θ I k | eelse)= pr( η T k | eelse) pr( θ I k | eelse excl. η T k ) × pr( η T k | eelse) pr( θ I k | eelse excl. η T k ) . Therefore, the update operations for θ I k and θ I k can be dispatched to diﬀerent coresin a multicore computing architecture, and the updates can be done in parallel. Wecan identify the sets of parameters that can be updated in parallel by colouring the fullconditional dependency graph of θ k under our chosen set of eﬀective footprints, such thatno conditionally dependent elements of θ k have the same colour. In a minimal setting,9here we choose the eﬀective process footprints as the subset of η k for which the maximaof the associated basis functions and those of its Markov blanket lie in the support of b I k ,the conditional dependency graph is the edge-preserving bijection of the spatial graph T θ k . With this choice of footprints we can then make use of the four-colour theorem(Gonthier, 2008) so that a sample of θ k can be done in exactly four steps, irrespective ofthe scale or data size. More conservative (i.e., larger) footprints will result in the need formore colours, and hence a lower degree of parallelisation, but still the number of colourswill be independent of scale and data size, rendering this model and inferential techniquescaleable. In our application we made use of the backtracking algorithm for colouring thegraph (Bender and Wilf, 1985).Consider now the joint update of the quantities { η T k , θ I k } . As in the joint updateof { η , θ } , a sample from the full conditional distribution of η T k is obtained ‘for free’since it is Gaussian with a mean vector and a precision matrix that are also used whensampling θ I k . One might be tempted to just skip this step, obtain a full sample of θ k ,and only then obtain a full sample of η k using, for example, the re-updating strategyin Sect. 3.3. However, doing so will result in a collapsed Gibbs sampler that does nottarget the correct stationary distribution (Van Dyk and Park, 2008); see Appendix Afor more details on why this is the case. It is therefore important that η T k is sampledconcurrently with θ I k although, as we discuss next, a re-updating of η k is necessary inpractice to improve convergence of the Markov chain. k = 1 , . . . , K When a block of η k gets updated conditional on everything else it is pinned at its bound-ary, because η k is locally smooth a priori and, in regions where data density is low, aposteriori as well. This results in slow mixing of the Markov chain, which was problematicin earlier versions of our implementation (see Appendix B for an illustration on a simplemodel). We now address this issue by re-sampling each η k in a separate Gibbs step.For each η k we tile the domain D into tiles so that each element of η k is then associatedwith exactly one tile, speciﬁcally the tile in which the basis function it weights is amaximum. Each tile is no larger than what can be processed on a single node, and so willbe associated with about r η basis functions, as per the update of η . The tiles must alsobe large enough such that the elements of η k associated with any two non-contiguous tiles(i.e., tiles that do not share a common boundary) are conditionally independent given theelements of η k associated with the other tiles. These tiles and their neighbours can beused to establish a supergraph made up of blocks of η k , where each block (correspondingto one tile) is conditionally independent of the rest given its neighbours. This supergraphis then coloured such that no two neighbouring blocks have the same colour; for point-referenced data this will require at most four colours. The colouring also corresponds toone on the tiles where no two tiles which share a common border have the same colour.As with the joint update of { η k , θ k } , this colouring allows us to develop a parallelsampler for η k . Speciﬁcally, consider two blocks of η k that are associated with two tiles,with indices given by T and T , respectively. When these two blocks are of the samecolour, η T k ⊥⊥ η T k | η rest k according to the structural zeros of Q k . Let F be the datafootprint of η T k . Similarly, let F be the data footprint of η T k . For point-referencedmeasurements it is straightforward to see that F ∩ F = ∅ so that we can partition Z as { Z F , Z F , Z rest } . The relevant part of the DAG for η k and Z is therefore simply the10igure 2: Example of three tilings on D = S that ensure that basis functions (not shown)are not in the vicinity of a boundary in at least one of the tiles.ﬁrst two lines of (13), from which we obtain η T k ⊥⊥ η T k | Z F , Z F , η rest k , eelse , (14)and therefore η T k and η T k can be updated in parallel (Wilkinson, 2006). The log ofthe full conditional distributions are identical in form to (8) with the same extra termmentioned in Sect. 3.2, which is simple to compute. Therefore, each of these updatesagain requires only a small chunk of data, matrices, and subsets of samples of the otherprocess coeﬃcients, so that each of these updates can also be done via a distributed-dataarchitecture. The computational complexity of the operations required to sample from η T k and η T k are approximately O ( |T / | ) and O ( |T / | ), respectively.Even though this sampling strategy re-updates η k , this coloured-tile scheme has gotits own boundaries close to which the process coeﬃcients will experience poor mixing. Weaddress this by using three tilings for each k , one after another in the Gibbs sampler. Eachtiling is shifted by about a third of a tile, relative to the one before. This ensures thatno element of η k is always near a boundary when Gibbs sampling. We show three suchtilings on S in Fig. 2. The resulting Gibbs sampler is an instance of a blocked samplerwhere the blocks are not disjoint; see, for example Jensen et al. (1995). Appendix Billustrates the beneﬁt of an alternating tiling scheme on a simple model. Consider now the parameters θ ε . The submodel for the Gibbs update of these parametersis made up of Z | η , θ ε ∼ Gau (cid:0) A η , Q − ε (cid:1) , (15)11nd the prior distribution (3) with X = ε , where Q − ε is diagonal with non-zero elements[exp( b ε ( s l ) (cid:62) θ ε )] , l = 1 , . . . , m . As with η k and θ k , elements of θ ε are spatially-indexedby the maxima of the basis functions they weight. Let I contain the index of one elementof θ ε . The data footprint F of θ I ε contains the indices of the data points that lie insupp( b I ε ). For two elements of θ ε with indices I and I , respectively, and for which F ∩ F = ∅ , the relevant part of the conditional independence graph is Z F Z rest Z F θ I ε (cid:79) (cid:79) θ rest ε (cid:79) (cid:79) (cid:98) (cid:98) (cid:60) (cid:60) θ I ε (cid:79) (cid:79) because the components of Z are independent given θ ε and everything else, and thecomponents of θ ε are IID in the prior.As with η , the condition F ∩ F = ∅ is essential, but this is easily satisﬁed inour context since b ε ( · ) are tent basis functions and the data are point-referenced. Inparticular, this condition is satisﬁed whenever θ I ε and θ I ε are not neighbours in thegraph implied by T ε . In this case, the full conditional dependence graph of θ ε is just anedge-preserving bijection of the spatial graph T ε , with θ ε as the vertices. We thereforealso colour it using four colours, and we update parameters of the same colour in parallel.The full conditional distribution,pr( θ I ε | eelse) ∝ pr( Z F | θ I ε , θ rest ε , η ) pr( θ I ε ) , (16)does not have a recognisable distribution, and so this update uses a Metropolis–Hastingsstep. Since the measurement errors are independent, computing the accept-reject ratiois computationally inexpensive. Algorithm 1 gives a summary of all the stages in one complete pass through the Gibbssampler. Each sample is generated conditional on Z and the most recent samples of theother random variables, again denoted here as ‘eelse’ for ‘everything else.’ When theconditional distribution to sample from is not available in closed form, the update isdone via an accept-reject step; in our application in Sect. 5 we use adaptive random walkMetropolis proposals. In Algorithm 1 we use the superscript ∗ to denote intermediatequantities: these quantities are discarded in the ﬁnal output of the Gibbs sampler.Steps 1–3 in the Gibbs sampler use distributed data, while steps 2 and 3 also useparallel computation to obtain a complete sample in a ﬁxed number of sweeps (usually4). It is worth re-iterating that, despite the re-updating of η k in step 2(b), updating η k for each colour in step 2(a) is still required in order to ensure the correct stationarydistribution is targeted when going through the colours in sequence (Van Dyk and Park,2008); see Appendix A. The favourable computational properties of the updates in Sect. 3 can only be takenadvantage of if sensible triangulations are used for the process and parameter decom-positions. Further, the separation of Y ( · ) into separate scales introduces identiﬁability12 lgorithm 1: Parallel Gibbs Sampler for the Multi-Scale Process Model

1. Sample { η , θ } | eelse.2. For k = 1 , . . . , K (a) Sample { η ∗ k , θ k } | eelse in parallel ( ≥ η k | eelse in parallel (4 colours).3. Sample θ ε | eelse in parallel (4 colours).4. Shift the tiling for each η k , k ≥ , to the next position and go to step 1.issues, akin to those seen in problems of spatial source separation (Nordhausen et al.,2015). Therefore, judicious construction of the triangulations and prior distributions onthe nominal ranges (that help to separate the scales in spectral space) are required. Ourinformal guidelines in this section are based on point-referenced measurements that haverelatively high signal-to-noise ratio; these guidelines may need to be slightly adjustedwhen this is not the case. Process:

Each process Y k ( · ) is associated with a triangulation T η k . Recall that, as k increases, the process scale decreases, and therefore we seek triangulations that becomeﬁner with increasing k . Constructing appropriate triangulations on which to establishthe multi-scale process is important. In Appendix C we empirically show the possibleconsequences of an inappropriate discretisation scheme for a two-scale process in one-dimensional space.A key design criterion when constructing these triangulations is the size of largestmatrix that can be eﬃciently factorised on a single computing node. Using hardware andlinear algebra libraries current as of the year 2020, sparse precision matrices of dimension50000 constructed from a second-order GMRF on a two-dimensional manifold (followingappropriate ﬁll-in reducing permutations) can be factorised extremely quickly (in aboutone second) on a single computing node. Since in our model η is updated as a whole(see (6)), a regular triangulation T η on D should therefore be constructed such that r η is between 10 and 10 .At the other end of the spectrum, the ﬁnest process triangulation, T η K , should be ﬁneenough that the size of the data footprint associated with each element of η K is small, andpreferably in the single digits. This guideline ensures that our multi-scale process modelis able to resolve the highest frequency components that may be extracted from the data.If we do not do this, then our predictions at the ﬁnest scale are possibly over-smoothed(i.e., the model is underﬁtting).The number of scales K to use is a modelling choice which can be guided by spectralconsiderations. A good rule of thumb is to ensure that each scale is associated withsignals with nominal ranges that approximately span an order of magnitude, and that thetriangulations can adequately reproduce signals with these ranges. On a two-dimensional13anifold, this rule of thumb implies that | η k | ≈ | η k − | , for k = 1 , . . . , K . In Sect. 5,we use two scales with r η on the order of 10 and r η on the order of 10 . Process parameters:

The update of θ I k in (12) requires factorising a square matrixwith number of columns equal to the size of its eﬀective process footprint, |T | , whichthus needs to be on the order of r η . Therefore, since T η k becomes ﬁner with increasing k ,this necessarily means that T θ k also needs to become ﬁner with increasing k . Speciﬁcally,the number of basis functions in a k ( · ) that have support intersecting b I k ( · ), for each I ,needs to be on the order of r η . This successive reﬁning of T θ k with k yields an attractivemodel where the degree of nonstationarity of the process increases with k . Measurement-error standard deviation:

The coarseness of the triangulation T ε reﬂects a prior belief on how quickly (spatially) the measurement-error variance changes;this is application-speciﬁc. Nominal ranges:

In Sect. 4.1 we associated T η k with nominal ranges spanning approx-imately an order of magnitude, and this choice can be reﬂected in our prior distributionfor the nominal ranges. We suggest that ω ρ and λ ρ are set such that a low percentileof pr( ρ ), the 2.5 percentile say, is no smaller than one to two times the average edgelength in T η . Similarly, we suggest that ω ρ k and λ ρ k are set such that a low percentile ofpr(exp( θ I ρ k )), for each I , is no smaller than one to two times the average edge length in T η k . This choice is motivated by the Nyquist–Shannon criterion, which is also used for ba-sis function placement by Zammit-Mangion et al. (2012). From a practical point of view,it encapsulates the fact that signal components with ﬁner scales cannot be captured bythis decomposition of Y k ( · ) on T η k . The upper 2.5 percentile of pr( ρ ), corresponding tothe highest plausible nominal range of the multi-scale process, can usually be set based onapplication considerations. The upper 2.5 percentile of pr(exp( θ I ρ k )) , k = 1 , , . . . , K, forall I , can be set to be close to, but higher than, the lower 2.5 percentie of pr(exp( θ I ρ k − )).This ensures some spectral overlap across the scales. Process standard deviations:

The prior distribution pr( σ ) can be uninformative orbased on results from an exploratory data analysis. Since in this model source separationis done via the prior distributions of ρ k , the hyperparameters ω σ k and λ σ k can be set suchthat pr(exp( θ I σ k )), for each I , is relatively uninformative or identical to pr( σ ). Measurement-error standard deviation:

The hyperparameters ω ε and λ ε encodethe plausible range of values of measurement-error standard deviations, and their choiceis therefore application-speciﬁc. The eﬀective process footprint of each θ I k is a design choice. In principle, the footprintneeds to be one such that the approximation (11) is acceptable. For example, one mightuse a spatial buﬀer around supp( b I k ) of width equal to the upper 97.5 percentile ofpr(exp( θ I ρ k )) to deﬁne the eﬀective process footprint of θ I k . If this footprint is so largethat sampling of θ I k becomes prohibitive, then tighter prior distributions (and hence14ore scales in order to ensure spectral coverage everywhere) would need to be used ifthis approximation is of concern. Alternatively, one might simulate a number of Markovchains with diﬀerent computationally inexpensive eﬀective process footprints in parallel,and verify that the target distribution is indeed (approximately) independent of thechosen footprint size. As discussed in Sect. 3.2, in our application we have taken aminimal approach and used relatively small eﬀective process footprints. This choice doesnot seem to have had a detrimental eﬀect on the quality of the probabilistic predictionswhich, as we show next in Sect. 5, are considerably better than those obtained fromcompeting models. A sophisticated distributed implementation on a multi-node cluster is required for thesampler to be used with data-set sizes on the order of tens or hundreds of millions andwith models with many (

K >

1) scales. However, a two-scale model with approximatelyone million basis functions and one million data points can be implemented on a standardmulti-core computing node using a straightforward implementation in R (R Core Team,2019), whereby the “chunks” required for each parallel sample (implemented via thefunction mclapply ) are loaded from disk when needed. This data size is still consideredfairly large in spatial analyses, and we therefore restrict ourselves to this case in thisarticle. An implementation that allows for orders of magnitude more data and basisfunctions is being developed and will be discussed elsewhere.We compare inferences from the two-scale variant of the multi-scale spatial processmodel with those from several other spatial models and approximation methods, specif-ically a coarse-scale SPDE model; a suite of independent ﬁne-scale SPDE models overspatially contiguous blocks of data; a single-scale model approximated using a full-scaleapproximation (FSA); a stationary multi-scale process model constructed using a rela-tively small number of basis functions, and a single-scale process model using the ap-proximations via a nearest-neighbour Gaussian process approximation. These modelsand approximations were chosen to show that several aspects of the proposed multi-scalemodel, namely the use of multiple scales, the use of a large number of basis functions,and the ability to model nonstationarity, are crucial to get good predictions in a typicalbig-data environmental spatial data analysis.Reproducible code and additional material, such as MCMC trace plots, are availablefrom https://github.com/andrewzm/SpatialMultiScale . As data we used global sea-surface temperature (SST) obtained from the Visible InfraredImaging Radiometer Suite (VIIRS) on board the Suomi National Polar-orbiting Partner-ship (Suomi NPP) weather satellite on October 14 2014 (Cao et al., 2013). We sampledone million data points at random from the complete data set of 238 million data points,intentionally leaving data out from an 8 ◦ × ◦ box centred at (155 ◦ W, 0 ◦ N) in the PaciﬁcOcean, and used these as observation/training data. We then sampled another one mil-lion data points at random from the remainder and used these for assessing the qualityof our predictions. After discarding data with invalid coordinate entries, and data points15igure 3: SST residuals used for training the model. Left panel: Global view of residuals,with the 8 ◦ × ◦ box in the Paciﬁc Ocean marked in black and the box delineating theregion depicted in the right panel marked in red. Right panel: Residuals in an area ofthe Western Paciﬁc around Papua New Guinea (note the diﬀerent colour scale).that fell outside our constructed meshes (see Sect. 5.2) we were left with about 900000data points in each of the training and validation data sets.We then detrended the training and validation data sets using a linear model with anintercept, the latitude coordinate, and the square of the latitude coordinate, as covariates.The residuals from the training data are what we call Z in Sect. 2, while we denote theresiduals from the validation data set as Z v . In Fig. 3 we show the residuals Z on theglobe together with the 8 ◦ × ◦ box (left panel) and a zoomed-in view of these residualsaround Papua New Guinea (right panel). We can see from the ﬁgure that the spatialdata are very irregularly-sampled in space, and that the distance from anywhere in theocean to the nearest data point can range from a few kms to thousands of kms. The dataalso reveal both large-scale and small-scale features that need to be modelled. We constructed the Delaunay triangulations on S with a coarse boundary around landmasses (used in a study by Simpson et al., 2016) using the inla.mesh.create functionin INLA , and ﬁxed ν = ν = 1. We determined the approximate number of vertices inthe triangulations as follows, using the guidelines in Sect. 4. First, since it is requiredto factorise Q we made sure that T η contains only a few tens of thousands of vertices,in our case it contains r η = 38274 vertices. Second, since Y ( · ) models the small-scalevariability and we are only considering two scales, we require T η to be spatially densewith respect to the observed data. In our case we constructed a triangulation T η with942349 vertices. Using this triangulation, the proportion of elements in η which havea data footprint size in the single digits is 92%. Finally, we want the nonstationarityin the ﬁne-scale process to vary smoothly in space, and we therefore let T ε = T θ haveonly a few hundred vertices, in our case 205 vertices. Using this triangulation the largesteﬀective process footprint is of size 27198, which is on the same order as r η as desired.The triangulations on a small region of the Paciﬁc Ocean are shown in Fig. 4.In order to re-sample η as discussed in Sect. 3.3, we constructed three partitioningsof η from three spatial tilings of D . The ﬁrst tiling was done using the third resolutionof an Icosahedral Snyder Equal Area Aperture 3 Hexagonal (ISEA3H) grid on the surfaceof the sphere (Sahr, 2008). The second and third tilings were then done by shifting the16igure 4: The triangulations T η (red), T η (blue), and T ε (= T θ ) (black) shown here fora small region in the Paciﬁc Ocean as marked by the red rectangle in the inset.ISEA3H grid north and east, respectively; see Fig. 2. To improve mixing of the Markovchain, spatial tiles that contained less than 100 data points or less than 200 basis functionsat scale k = 1, were merged with neighbouring tiles. The three tilings and colourings areshown in Fig. 5.We constructed the prior distributions following the guidelines of Sect. 4. As prior dis-tribution for ρ we used a lognormal distribution such that ( ρ , . , ρ , . ) = (300 , ρ ,q denotes the q th quantile of the distribution. Our choice of ρ , . stemsfrom us having a maximum edge length of 150 km when constructing T η , while our choiceof ρ , . presents a soft maximum on what we believe the spatial correlations in SSTs are(10000 km equates to approximately half the greatest east-west span of the Paciﬁc Ocean).On the other hand, as prior distribution for θ { i } ρ , i = 1 , . . . , r ρ , we used a Gaussian dis-tribution such that (exp( θ { i } ρ , . ) , exp( θ { i } ρ , . )) = (30 , T η , while the upper quantile was selected to ensure that there is suﬃcient spectral overlapbetween Y ( · ) and Y ( · ). As for the marginal process standard deviation, we used prior dis-tributions such that ( σ , . , σ , . ) = (exp( θ { i } σ k , . ) , exp( θ { i } σ k , . )) = (0 . , ◦ C, whereour upper quantile was selected from the fact that the empirical standard deviation of theSST residuals is approximately equal to 2 ◦ C. Finally, as prior for θ ε we used a Gaussiandistribution such that (exp( θ { i } ε, . ) , exp( θ { i } ε, . )) = (0 . , ◦ C, which reﬂects our priorbelief that it is unlikely that the measurement error standard deviations are less than0.5 ◦ C or larger than 5 ◦ C.As initial values for our parameters in the MCMC algorithm we used samples of θ , θ , and θ ε drawn from their respective prior distributions. As initial values for η weused least-squares estimates from the model Z = A η + ε , perturbed using independent17igure 5: Colourings of the three tilings of η .samples from a Gaussian distribution with mean zero and variance 0.04. As initial valuesfor η we used independent samples from a Gaussian distribution with mean zero andvariance 0.01. Our adaptive random-walk Metropolis proposals were initialised to havevariance 0.001; adaptation was done every 30 iterations for the ﬁrst 2000 iterations.The Gibbs sampler algorithm in Sect. 3.5 was run for 10000 iterations. The ﬁrst 5000samples were discarded as burn-in and the remaining 5000 were thinned by a factor of 50,to yield 100 samples from the posterior distribution over all the states and parameters.It took approximately 5 days of computing time to obtain 10000 samples using ourimplementation.The long run-time of our algorithm made a detailed test of MCMC convergence im-practical. Instead, we made an informal test of convergence by analysing samples of theprocess Y ( · ) evaluated at approximately 350000 points regularly interspersed throughoutthe ocean. Speciﬁcally, we computed the eﬀective sample sizes (see Eq. 11.8 of Gelmanet al., 2013) of the thinned chain at each of these locations, and compared the empiricaldensity of these sizes to that constructed from independent samples from a Gaussiandistribution, and to that constructed from traces that are ﬁrst-order autoregressive pro-cesses with auto-correlation coeﬃcient 0.1. In Fig. 6 we see that the empirical density isvery sensitive to auto-correlation, and that the eﬀective sample sizes from our samplerare virtually indistinguishable from those computed from independent samples. Thisindicates that most likely we do indeed have independent samples from the posteriordistribution. Visual inspection of a random selection of trace plots also suggested theabsence of auto-correlation.Traces of other model components were not all as uncorrelated: While, for example,traces for η , η , and θ ε , had median eﬀective sample sizes of 87, 90, and 90, respectively,those for θ σ and θ ρ had median eﬀective sample sizes of 52 and 46, respectively. Whilethe latter low eﬀective sample sizes imply that we should be cautious when making infer-ential statements on the nominal ranges and variances, they do not necessarily suggest18igure 6: Empirical density of the estimated eﬀective sample size ˆ n eﬀ from a chain withindependent samples from a normal distribution (black, solid line), a chain where eachtrace is a ﬁrst-order autoregressive process with auto-correlation parameter 0.1 (blue,dotted line), and the chain’s traces for Y ( · ) evaluated at approximately 350000 locationsusing our sampler (red, dashed line).that our process predictions could beneﬁt from a longer run. This is a consequence ofweak identiﬁability of the process model we are using; see Besag et al. (1995); Eberlyand Carlin (2000) and Gelfand et al. (2001) for more discussion. Moreover, the purposeof our approach is to predict, and for prediction the ultimate goal is to maximise out-of-sample performance. In this respect the key ﬁnding is that our approach is well-calibratedout-of-sample, and that it considerably outperforms other approaches, as we describe inSect. 5.4. We compared the predictions of the proposed model (SPDE2) to several other modelsand approximations commonly used in applications containing large spatial data sets.We describe these in more detail below.

Global coarse-scale model (SPDE0):

We used

INLA to ﬁt the SPDE (Lindgrenet al., 2011) in (1) to the entire data set via a GMRF on the coeﬃcients of the tentbasis functions on T η . We let ν = 1, and used the same prior distributions for ρ and σ as we did for our two-scale spatial model in Sect. 5.2. We compare to this model todemonstrate the beneﬁt of using a second, small-scale process when modelling such largespatial data sets. It took approximately 30 minutes of computing time to ﬁt the modeland obtain predictions with SPDE0. The nominal range for this model was estimated to19e approximately 800 km. Multiple spatially-independent small-scale processes (SPDE1-indep):

Weused

INLA to independently ﬁt SPDEs with spatially-invariant parameters to data that liein each of the tiles shown in Fig. 5, top panel. For each set of tile indices T in our modelSPDE2, we generated a ﬁne-resolution mesh that has a similar number of basis functionsas |T | , and a few more around the boundary of the partition to reduce boundary eﬀects.We let ν = 1, and used the same prior distributions for (the now spatially-invariant) σ and ρ in each tile as we did for our two-scale process model in Sect. 5.2. These indepen-dent models were used to predict the noisy process at validation data locations that liewithin their associated tile. Results from this model are used to demonstrate the beneﬁtof having a large-scale process that can borrow strength across large spatial scales whenmodelling large spatial data. It took approximately 6 hours to sequentially ﬁt the modelsand obtain predictions with SPDE1. Multi-scale stationary process with a relatively small number of basis func-tions via LatticeKrig (LTK):

We used

LatticeKrig (Nychka et al., 2015) to ﬁt a3-resolution LatticeKrig model on the cylinder comprising a total of 137577 basis func-tions. Note that many of these basis functions (approximately one third) are over land.We ﬁtted three LTK models with the parameter a.wght , which dictates the spatial range,ﬁxed to 4.01, 5.01, and 6.01, respectively. In Sect. 5.4 we only show results for the case a.wght = 6.01, which provided slightly better predictions than the other two cases (allcases gave very similar predictions). LTK models are generally limited to r = 100000 to200000 basis functions since they require factorisation of precision matrices of dimension r × r when being ﬁt. Results from this model are used to demonstrate that hundreds ofthousands of basis functions are still likely to be insuﬃcient when modelling large spa-tial data, despite the use of multiple scales. Fitting and predicting (via 30 conditionalsimulations) with LTK required approximately one day of computing time. Single-scale process approximated using a full-scale approximation (FSA):

The full-scale approximation (Sang and Huang, 2012) to a Mat´ern covariance functionwith smoothness ν = 1 was implemented on R using 100 knots randomly placed on thesurface of the sphere before projection onto the plane. The number of blocks on which toapproximate the residual ﬁeld was set to 9000 and data were attributed to each of theseblocks using a k-means clustering algorithm on the lon–lat coordinates of the data. Thischoice ensured that the number of data points in each block was computationally feasibleat about 100 data points per block. It took approximately two days to ﬁt and predictusing the FSA. The nominal range for this model was estimated to be approximately 850km. Single-scale process approximated using nearest neighbours (NNGP):

Weused the R package spNNGP to implement a conjugate version of the nearest-neighbourGaussian process (NNGP, Finley et al., 2019), where NNGPs are ﬁtted for several plausi-ble values of the latent-process range and variance, and cross-validation is used to selectthe best of these models for prediction. In each of these models an inverse-gamma priordistribution is used for the observation measurement-error variance, which yields closed-form predictive distributions that are quick to evaluate. In our implementation we set the20umber of neighbours to 15, the covariance function to a Mat´ern covariance function with ν = 1, and used a ﬁne grid for the latent-process range and variance, ensuring that aninterior point of this grid was selected as optimal. It took approximately three hours toﬁnd the optimal NNGP model through cross-validation and to predict using the optimalmodel. The nominal range for this model was estimated to be approximately 700 km.Results from the FSA and NNGP are used to show that single-scale process models,which use local methods when predicting, are not competitive against models like SPDE2which are able to borrow strength across large spatial scales when predicting over largegaps. In Fig. 7 we show predictions (posterior means) and prediction standard errors (posteriorstandard errors) for Y ( · ) using the two-scale spatial-process model SPDE2. The detail inthese maps is very high – the one million basis functions eﬀectively model the SST at a30 ×

30 km resolution globally. One should be cautious when interpreting the spatially-varying parameters (see Sect. 5.2), however it is evident that the posterior expectationsof θ ε , θ σ , and θ ρ vary considerably. In particular, the elements in E(exp( θ σ ) | Z ) hadan interquartile range of (0 . , . ◦ C while those in E(exp( θ ρ ) | Z ) had an interquartilerange of (40 , k = 1 (ranging from a few dozen to a few hundreds of kms) aresmaller than E( ρ | Z ) ≈ σ | Z ) = 1 . ◦ C, and therefore there is more power beingallocated to the signal at the coarse scale than at the ﬁne scale. Again, this was expectedsince power spectral densities tend toward zero at large frequencies. More intuition intothe behaviour of the process at the separate scales can be obtained by visualising predic-tions of Y ( · ) and Y ( · ) separately; we show one such plot centred on the Brazil–Malvinasconﬂuence in our online code repository.We compare the predictions from all the models in terms of the root-mean-squaredprediction error (RMSPE), continuous-ranked probability score (CRPS), 90% intervalscore (IS90), and the 90% coverage (Cov90) at the validation-data locations (Gneiting andRaftery, 2007). To assess the models’ ability to predict and quantify uncertainty correctlyat validation-data locations that are in regions of both dense and sparse training data,we split the validation data set into three: Z (1) v contains all the validation data outsidethe 8 ◦ × ◦ box in the Paciﬁc Ocean (see Sect. 5.1) that are in the vicinity of trainingdata (speciﬁcally, a 1 . ◦ × . ◦ box centred on each datum in Z (1) v contains at least onetraining datum within it); Z (2) v , which contains the validation data not in the vicinityof training data and outside the 8 ◦ × ◦ box; and Z (3) v that contains the validation datawithin the 8 ◦ × ◦ box. The dimensions of Z (1) v , Z (2) v , and Z (3) v were 897282 , Y ( · ) from the two-scale spatial-process model (SPDE2) when ﬁtted to the data shown in Fig. 3, left panel. Predictionsand prediction standard errors corresponding to the region depicted in Fig. 3, right panel,are shown in the bottom-left and bottom-right panels, respectively.and the single-scale models that use local methods when predicting (FSA and NNGP)give better predictions. All methods give the correct coverage. What we deduce fromthese results is well-known: When doing spatial prediction on large data sets there islittle beneﬁt in using a global model when predicting at locations close to observed data(unless the signal-to-noise-ratio is very low; see, e.g., Zammit-Mangion et al., 2018) ifthe target of the spatial analysis is pointwise process prediction. There is also likely to belittle beneﬁt to be gained from using nonstationary process models since the inferencesare predominantly data-driven and not model-driven. Rather, it is crucial to use a latentprocess model that is able to reproduce the ﬁne-scale variation in the underlying process.In the second six rows of Table 1 we show diagnostics for when prediction locationsare not in the vicinity (through the 1 . ◦ × . ◦ boxes described earlier) of observations.The models LTK and SPDE0 are now not compromised as much for their use of relativelyfew basis functions. On the other hand, since FSA and NNGP utilise only a few datapoints when predicting, they begin to give poorer predictions. SPDE1-indep performssurprisingly well in this scenario, but slightly worse than SPDE0. All methods performedworse than SPDE2. Finally, in the last six rows of Table 1 we show diagnostics forvalidation data inside the 8 ◦ × ◦ lon-lat box, while in Fig. 8 we show the predictionsat the data locations. We see that predicting using ‘nearest data’ generates unwantedartefacts in the case of the FSA and the NNGP. We expected LTK to perform well in thiscase; however, it appears to have over-ﬁt at the region’s boundary, resulting in a largenegative prediction in the region’s interior. The two-scale model outperforms the othersby quite some margin in terms of RMSPE and CRPS since it is able to correctly predict22able 1: Diagnostic results on the validation data Z (1) v (validation data in the vicinityof training data), Z (2) v (validation data not in the vicinity of training data), and Z (3) v (validation data in the 8 ◦ × ◦ box). Data Model RMSPE CRPS IS90 Cov90FSA 0.353 0.185 1.659 0.915LTK 0.378 0.200 1.752 0.913 Z (1) v NNGP

FSA 1.001 0.540 4.489 0.928LTK 0.763 0.430 4.065 0.959 Z (2) v NNGP 0.799 0.464 4.420 0.990SPDE0 0.699 0.377 3.170 0.928SPDE1-indep 0.728 0.382 3.278 0.935SPDE2

FSA 0.460 0.325 3.570 0.993LTK 0.463 0.339 3.748 0.986 Z (3) v NNGP 0.515 0.422 5.000 0.996SPDE0 0.425 0.282 2.926 0.995SPDE1-indep 0.461 0.252 1.830

SPDE2 the region of low SST anomaly traversing this box from west to east, while resolving ﬁnescales on the region’s boundary. Interestingly all models except SPDE1-indep, and to alesser extent SPDE2, are largely underconﬁdent in this unobserved area.

The majority of models and inference techniques developed for analysing massive spatialdata sets are not designed to exploit the multi-scale, nonstationary nature of the under-lying process. The result is the widespread use of techniques that are able to providegood predictions at the ﬁne scale (e.g., NNGP) or at the coarse scale when the number ofbasis functions for decomposition is capped for computational reasons, but not both. Inthis article we have proposed a multi-scale model where the degree of nonstationarity in-creases with the scale of the component processes, together with an approximate inferencealgorithm that is scaleable with both data and model size. We have shown that it outper-forms other state-of-the-art approaches that are amenable to big data scenarios, but thatare unable to capture the complexities of the underlying process due to either the use of afew basis functions, or the use of local methods that only take into account a small subsetof the data when predicting. We conclude that using information at multiple scales is im-portant for accurate prediction in environmental applications such as the analysis of SST,particularly in regions where remote sensing data is sparse. We used SPDEs to modelthe processes at each scale, but other spatial GMRFs (such as conditional autoregressivemodels) can be used instead. However, we anticipate that constructing appropriate grids,23igure 8: Process predictions at the validation locations Z (3) v inside the 8 ◦ × ◦ region(black box) in which data was left out. The left-out data in this region are shown in thebottom-left panel.modelling the nonstationary structure, and eliciting interpretable prior distributions forthe parameters would become more challenging.There are several avenues for future work. First, while the approximation (11) is asensible one under the guidelines of Sect. 4, we do not provide any theoretical bound onthe eﬀect of the approximation on the target distribution, which is not straightforwardto derive. Second, while our implementation in Sect. 5 used chunked-up data and modelswhen doing inference, these were saved and loaded to disk when sampling via graphcolouring. Sophisticated implementations include one where static objects (e.g., data andbasis-function matrices) reside permanently in memory on dedicated nodes (Katzfuss andHammerling, 2017), while communication of selected samples between the nodes is doneeﬃciently via message passing (e.g., using MPI). Such an implementation would speed upthe Gibbs sampler, and is necessary for the consideration of higher scales and data sets24hat are one to two orders of magnitude larger than that considered in Sect. 5. Third,there is considerable interest in making inference with large non-Gaussian spatial data.Parallelisation is still possible for this case, but our computational framework would needto be modiﬁed slightly since the process coeﬃcients cannot be sampled directly whenthe data is non-Gaussian. Speciﬁcally, when carrying out an update corresponding to acolour, the process coeﬃcients would have to be proposed and then accepted or rejectedjointly with the parameter coeﬃcients; see, for example, Knorr-Held and Rue (2002). Acknowledgements

We thank Yuliya Marchetti for providing the sea-surface temperature data set, BohaiZhang for providing the MATLAB code for the implementation of the FSA, Matt Mooresfor general discussions on improving MCMC mixing, and Quan Vu for providing com-ments on an early version of this manuscript. AZ–M was supported by the AustralianResearch Council (ARC) Discovery Early Career Research Award, DE180100203.

References

Banerjee S, Gelfand AE, Finley AO, Sang H (2008) Gaussian predictive process modelsfor large spatial data sets. Journal of the Royal Statistical Society: Series B 70:825–848Bender EA, Wilf HS (1985) A theoretical analysis of backtracking in the graph coloringproblem. Journal of Algorithms 6:275–282Berliner LM (1996) Hierarchical Bayesian time series models. In: Hanson KM, Silver RN(eds) Maximum Entropy and Bayesian Methods, Springer, New York, NY, pp 15–22Besag J, Green P, Higdon D, Mengersen K (1995) Bayesian computation and stochasticsystems. Statistical Science 10:3–41Cao C, Xiong J, Blonski S, Liu Q, Uprety S, Shao X, Bai Y, Weng F (2013) SuomiNPP VIIRS sensor data record veriﬁcation, validation, and long-term performancemonitoring. Journal of Geophysical Research: Atmospheres 118:11664–11678Cressie N, Johannesson G (2008) Fixed rank kriging for very large spatial data sets.Journal of the Royal Statistical Society: Series B 70:209–226Cressie N, Wikle CK (2011) Statistics for Spatio-Temporal Data. John Wiley and Sons,Hoboken, NJDewar M, Scerri K, Kadirkamanathan V (2009) Data-driven spatio-temporal modelingusing the integro-diﬀerence equation. IEEE Transactions on Signal Processing 57:83–91Eberly LE, Carlin BP (2000) Identiﬁability and convergence issues for Markov chainMonte Carlo ﬁtting of spatial models. Statistics in Medicine 19:2279–2294Finley AO, Datta A, Cook BC, Morton DC, Andersen HE, Banerjee S (2019) Eﬃcient al-gorithms for Bayesian nearest-neighbor Gaussian processes. Journal of Computationaland Graphical Statistics 28:401–414 25elfand AE, Carlin BP, Trevisani M (2001) On computation using Gibbs sampling formultilevel models. Statistica Sinica 11:981–1003Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2013) Bayesian DataAnalysis, 3rd edn. Chapman & Hall/CRC Press, Boca Raton, FLGneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association 102:359–378Gonthier G (2008) Formal proof – the four-color theorem. Notices of the AMS 55:1382–1393Jensen CS, Kjærulﬀ U, Kong A (1995) Blocking Gibbs sampling in very large probabilisticexpert systems. International Journal of Human-Computer Studies 42:647–666Katzfuss M (2017) A multi-resolution approximation for massive spatial datasets. Journalof the American Statistical Association 112:201–214Katzfuss M, Hammerling D (2017) Parallel inference for massive distributed spatial datausing low-rank models. Statistics and Computing 27:363–375Knorr-Held L, Rue H (2002) On block updating in Markov random ﬁeld models for diseasemapping. Scandinavian Journal of Statistics 29:597–614Lauritzen SL (1996) Graphical Models. Clarendon Press, Oxford, UKLindgren F, Rue H (2015) Bayesian spatial modelling with R-INLA. Journal of StatisticalSoftware 63(19)Lindgren F, Rue H, Lindstr¨om J (2011) An explicit link between Gaussian ﬁelds andGaussian Markov random ﬁelds: the stochastic partial diﬀerential equation approach.Journal of the Royal Statistical Society: Series B 73:423–498Monterrubio-G´omez K, Roininen L, Wade S, Damoulas T, Girolami M (2019) Poste-rior inference for sparse hierarchical non-stationary models. https://arxiv.org/abs/1804.01431

Nordhausen K, Oja H, Filzmoser P, Reimann C (2015) Blind source separation for spatialcompositional data. Mathematical Geosciences 47:753–770Nychka D, Bandyopadhyay S, Hammerling D, Lindgren F, Sain S (2015) A multires-olution Gaussian process model for the analysis of large spatial datasets. Journal ofComputational and Graphical Statistics 24:579–599R Core Team (2019) R: A Language and Environment for Statistical Computing. RFoundation for Statistical Computing, Vienna, AustriaRue H, Held L (2005) Gaussian Markov Random Fields: Theory and Applications. Chap-man and Hall/CRC Press, Boca Raton, FLRue H, Tjelmeland H (2002) Fitting Gaussian Markov random ﬁelds to Gaussian ﬁelds.Scandinavian Journal of Statistics 29:31–4926ahr K (2008) Location coding on icosahedral aperture 3 hexagon discrete global grids.Computers, Environment and Urban Systems 32:174–187Sahu SK, Mardia KV (2005) A Bayesian kriged Kalman model for short-term forecastingof air pollution levels. Journal of the Royal Statistical Society: Series C 54:223–244Sang H, Huang JZ (2012) A full scale approximation of covariance functions for largespatial data sets. Journal of the Royal Statistical Society: Series B 74:111–132Scherer POJ (2017) Computational Physics: Simulation of Classical and Quantum Sys-tems, 3rd edn. Springer, Cham, SwitzerlandSimpson D, Illian JB, Lindgren F, Sørbye SH, Rue H (2016) Going oﬀ grid: computa-tionally eﬃcient inference for log-Gaussian Cox processes. Biometrika 103:49–70Van Dyk DA, Park T (2008) Partially collapsed Gibbs samplers: Theory and methods.Journal of the American Statistical Association 103:790–796Wikle CK, Zammit-Mangion A, Cressie N (2019) Spatio-Temporal Statistics with R.Chapman & Hall/CRC, Boca Raton, FLWilkinson DJ (2006) Parallel Bayesian computation. In: Kontoghiorghes EJ (ed) Hand-book of Parallel Computation and Statistics, CRC Press: Boca Raton, FL, pp 477–508Zammit-Mangion A, Cressie N (2020) FRK: An R package for spatial and spatio-temporalprediction with large datasets. Journal of Statistical Software, in pressZammit-Mangion A, Sanguinetti G, Kadirkamanathan V (2012) Variational estimation inspatiotemporal systems from continuous and point-process observations. IEEE Trans-actions on Signal Processing 60:3449–3459Zammit-Mangion A, Cressie N, Shumack C (2018) On statistical approaches to generateLevel 3 products from satellite remote sensing retrievals. Remote Sensing 10:155

A Targeted distribution of the Markov chain

In Sect. 3.2 we asserted that it is important to update η k concurrently with θ k , eventhough η k is later re-updated (Sect. 3.3). If one does not do this, η rest k in (14) wouldbe ‘out of sync’ with the updated parameters θ k ; as a consequence, when η T k or η T k isupdated in (14), an incorrect distribution would be targeted. This phenomenon occurswhen marginalising (termed ‘trimming’ by Van Dyk and Park, 2008) in Gibbs samplers.We show the importance of resampling on a very simple spatial model, where wehave two sets of parameters, θ and θ , and two sets of basis-function coeﬃcients, η and η . In what follows we omit conditioning on the data Z , since this is implicit inall the distributions. We denote the target (posterior) distribution as pr ( Θ ) where Θ ≡ { θ , θ , η , η } . In MCMC we seek a transition kernel q ( Θ (cid:48) | Θ ) such thatpr ( Θ (cid:48) ) ≡ (cid:90) pr ( Θ ) q ( Θ (cid:48) | Θ ) d Θ = pr ( Θ (cid:48) ) . (A.17)If (A.17) holds, then we say that the Markov chain preserves the target distribution, pr .27n a vanilla Gibbs sampler, one constructs the transition kernel from full conditionaldistributions of the target distribution: q ( Θ (cid:48) | Θ ) = pr ( θ (cid:48) | θ , η , η ) pr ( θ (cid:48) | θ (cid:48) , η , η ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) . (A.18)Successive updating of the parameters in this fashion preserves the target distribution.To see this, substitute (A.18) in (A.17) to obtainpr ( Θ (cid:48) ) = (cid:90) (cid:90) (cid:90) (cid:90) pr ( θ , θ , η , η ) pr ( θ (cid:48) | θ , η , η ) × pr ( θ (cid:48) | θ (cid:48) , η , η ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) d θ d θ d η d η = (cid:90) (cid:90) (cid:90) pr ( θ , η , η ) pr ( θ (cid:48) | θ , η , η ) × pr ( θ (cid:48) | θ (cid:48) , η , η ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) d θ d η d η = (cid:90) (cid:90) pr ( θ (cid:48) , η , η ) pr ( θ (cid:48) | θ (cid:48) , η , η ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) d η d η = (cid:90) pr ( θ (cid:48) , θ (cid:48) , η ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) d η = pr ( η (cid:48) , θ (cid:48) , θ (cid:48) ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) , = pr ( θ (cid:48) , θ (cid:48) , η (cid:48) , η (cid:48) ) , as required. Now, this vanilla sampler does not mix well due to the correlation a posterioribetween η i and θ i , i = 1 , q ( Θ (cid:48) | Θ ) = pr ( θ (cid:48) | θ ) pr ( θ (cid:48) | θ (cid:48) ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) . (A.19)A similar treatment to the vanilla Gibbs case reveals that this only targets the correctdistribution if pr ( θ , θ , η , η ) = pr ( θ , θ ) pr ( η , η ), which is almost certainly notthe case in our spatial models. Therefore, (A.19) is not a target-preserving kernel. In ourMCMC scheme this is important: updating θ k in blocks and subsequently updating η k in blocks will not yield samples from the posterior distribution.A kernel which preserves the target can be constructed by updating η and η twice,with the intermediate quantities then discarded. As in Algorithm 1, denote these inter-mediate quantities as η ∗ and η ∗ , respectively, and consider the transition kernel q ( η ∗ , η ∗ , Θ (cid:48) | Θ ) = pr ( η ∗ , θ (cid:48) | θ , η ) pr ( η ∗ , θ (cid:48) | θ (cid:48) , η ∗ ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ∗ ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) . η { } from Sampler 1 (left) and Sampler 2 (right), respectively.This kernel preserves the target (posterior) distribution sincepr ( Θ (cid:48) ) = (cid:90) · · · (cid:90) pr ( θ , θ , η , η ) pr ( η ∗ , θ (cid:48) | θ , η ) × pr ( η ∗ , θ (cid:48) | θ (cid:48) , η ∗ ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ∗ ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) d θ d θ d η d η d η ∗ d η ∗ = (cid:90) (cid:90) (cid:90) (cid:90) pr ( η ∗ , θ (cid:48) , θ , η ) × pr ( η ∗ , θ (cid:48) | θ (cid:48) , η ∗ ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ∗ ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) d θ d η d η ∗ d η ∗ = (cid:90) (cid:90) pr ( η ∗ , θ (cid:48) , θ (cid:48) , η ∗ ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η ∗ ) × pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) d η ∗ d η ∗ = (cid:90) pr ( η (cid:48) , θ (cid:48) , θ (cid:48) , η ∗ ) pr ( η (cid:48) | θ (cid:48) , θ (cid:48) , η (cid:48) ) d η ∗ = pr ( θ (cid:48) , θ (cid:48) , η (cid:48) , η (cid:48) ) , as required. 29 lgorithm 2: Sampler 1 (Fixed tiling) For i = 1 , . . . , N

1. Sample η T | η T

2. Sample η T | η T Algorithm 3: Sampler 2 (Alternating tilings)

For i = 1 , . . . , N

1. If i is odd:(a) Sample η T | η T (b) Sample η T | η T

2. If i is even:(a) Sample η T | η T (b) Sample η T | η T , η T (c) Sample η T | η T B Simulation experiment illustrating the beneﬁt ofalternating tilings

Consider a GMRF η ∼ Gau( , Q − ), where Q is the sparse tridiagonal matrix Q = 1 σ v  − φ − φ φ − φ − φ φ − φ . . . . . . . . . − φ φ − φ − φ  , (B.20)where omitted entries are zero, φ is a length-scale parameter, and σ v a variance parameter.We consider the case where η ∈ R n with n = 99, and consider the following tilings, T ≡ { , . . . , } , T ≡ { , . . . , } , T ≡ { , . . . , } , T ≡ { , . . . , } , T ≡ { , . . . , } . Here, we compare Markov chain behaviour when sampling N times from this distributionusing the two samplers, Sampler 1 and Sampler 2, detailed in Algorithms 2 and 3, re-30igure 10: RMSPE (top panels) and CRPS (bottom panels) corresponding to Z (1) v (leftpanels) and Z (2) v (right panels) for varying basis-function widths ( δ and δ ) in the 1Dexperiment.spectively. Sampler 1 is a blocked Gibbs sampler which samples η using only T and T ,while Sampler 2 changes the tiling used, {T , T } or {T , T , T } , at each iteration.In our simulation experiment we simulated η using φ = 0 . σ v = 0 .

2, and let N = 10000. We then generated two Markov chains, one using Sampler 1 and anotherusing Sampler 2, and for each chain took the last 5000 samples and applied a thinningfactor of 2. In Fig. 9 we show the empirical auto-correlation functions computed from thetrace plots of η { } from both samplers. We see that samples of η { } from Sampler 1 arehighly correlated due to the proximity of this variable to the tiling boundary. Samplesof η { } from Sampler 2 are virtually uncorrelated. Hence, a system of shifting tiles in aGibbs sampler for spatial GMRFs (as done in Algorithm 1) can virtually eliminate anyauto-correlation that may appear due to tile boundaries. Note that a thinning factorgreater than the number of tilings needs to be used to eﬀectively remove any auto-correlation. C Simulation experiment illustrating the sensitivityof the predictions to the chosen basis functions

In this section we conduct a simulation experiment that demonstrates the eﬀect of a coarsediscretisation, and the corresponding basis-function representation, on the predictionperformance of the multi-scale model. The experiment is done in a one-dimensional,31wo-scale-process setting.Consider a 1D Gaussian process on D = [0 , C ( · ) and C ( · ). The exponential covariancefunctions have as parameters the variances σ k and ranges τ k , k = 0 ,

1, and are given by C k ( h ) = σ k exp( −(cid:107) h (cid:107) /τ k ) , k = 0 , . We model the process of interest Y ( · ) as a sum of the two processes Y k ( · ) = a k ( · ) (cid:62) η k , k = 0 ,

1, where a k ( · ) are basis functions and η k are basis-function coeﬃcients.Now, let the basis functions a k ( · ) , k = 0 , , be piecewise constants on a regular partition-ing of D . The basis functions have width δ and δ , respectively, where δ < δ . We areinterested in what the eﬀect of a poor choice for δ and δ is when predicting Y ( · ) fromnoisy observations Z .If δ k →

0, then the k th scale of the original process is reconstructed exactly if we let η k ∼ Gau( , Q − k ) , k = 0 , , where Q k is the (sparse) tridiagonal matrix given by (B.20)with σ v replaced with σ v,k and φ replaced with φ k . Here, φ k = exp( − δ k /τ k ) and σ v,k = σ k (1 − φ k ). In a simulation environment where we have access to σ k , τ k , k = 0 ,

1, we cantherefore obtain accurate GMRF representations for our processes at the individual scalesfor when δ k is small. We can then also see what happens as δ k grows (this correspondsto coarsening a triangulation in 2D).In our simulation environment we ﬁxed τ = 0 . , τ = 0 . , σ = 1 and σ = 0 .

05 andconducted 100 Monte Carlo simulations, where in each simulation we did the following:1. Randomly established an ‘unobserved’ region in D , D gap say, where | D gap | = 0 . D , with 1000 in D \ D gap and 100 in D gap withmeasurement-error variance σ ε = 0 . D \ D gap wereused as training data Z , the remaining 500 in D \ D gap as validation data Z (1) v , andthose 100 in D gap as validation data Z (2) v .3. For various values of δ and δ , we constructed Q and Q according to (B.20)and used these, the true measurement-error variance, and Z , to predict Y ( · ) at allvalidation data locations.4. Computed the RMSPE and CRPS at the validation locations.Each Monte Carlo simulation provided us with an RMSPE and a CRPS correspond-ing to a combination of { δ , δ } . We then averaged over the 100 simulations to provideaveraged RMSPEs and CRPSs corresponding to each combination of { δ , δ } . This ex-periment allows us to analyse the detrimental eﬀect of a large δ or δ on our predictions.The results from this experiment are summarised in Fig. 10. The ﬁgure clearly showsthat the RMSPE considerably increases in regions where data is dense ( Z (1) v , left panels)and δ is large; on the other hand δ does not have much of an eﬀect in these regions.The situation is reversed in regions where data is missing in large contiguous blocks ( Z (2) v ,right panels). Here δ does not play a big role while δ does. When doing simple kriging(with the exact model) the mean RMSPEs were 0.07 and 0.41, respectively, while themean CRPSs were 0.033 and 0.24, respectively, which are relatively close to what wasobtained with the smallest values we chose for δ and δ . Therefore, the way in whichwe discretise both scales are important, and both δ and δ should be made as small asneeded for their respective scale; in this experiment δ = 0 .

001 and δ = 0 ..