Graphical Gaussian Process Models for Highly Multivariate Spatial Data
Debangan Dey, Abhirup Datta*, and Sudipto Banerjee
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA
Department of Biostatistics, University of California Los Angeles
Abstract
For multivariate spatial (Gaussian) process models, common cross-covariance functions do not exploit graphical models to ensure process-level conditional independence among the variables. This is undesirable, especially for highly multivariate settings, where popular cross-covariance functions such as the multivariate Matérn suffer from a "curse of dimensionality" as the number of parameters and floating point operations scale up in quadratic and cubic order, respectively, in the number of variables. We propose a class of multivariate "graphical Gaussian Processes" using a general construction called "stitching" that crafts cross-covariance functions from graphs and ensures process-level conditional independence among variables. For the Matérn family of functions, stitching yields a multivariate GP whose univariate components are exactly Matérn GPs, and which conforms to process-level conditional independence as specified by the graphical model. For highly multivariate settings and decomposable graphical models, stitching offers massive computational gains and parameter dimension reduction. We demonstrate the utility of the graphical Matérn GP to jointly model highly multivariate spatial data using simulation examples and an application to air-pollution modelling.
Keywords:
Matérn Gaussian processes; graphical model; covariance selection; conditional independence.

* AD was supported by NSF award DMS-1915803. Email: [email protected]

1 Introduction
Multivariate spatial data abound in the natural and environmental sciences for studying features of the joint distribution of multiple spatially dependent variables (see, for example, Wackernagel, 2013; Cressie and Wikle, 2011; Banerjee et al., 2014). The objectives are to estimate spatial associations for each variable and associations among the variables. Let y(s) be a q × 1 vector of spatially-indexed dependent outcomes within any location s. A multivariate spatial regression model specifies the marginal distribution for each outcome as

y_j(s) = x_j(s)^T β_j + w_j(s) + ε_j(s), j = 1, 2, …, q,  (1)

where y_j(s) is the j-th element of y(s), x_j(s) and β_j are the predictors and slopes, w_j(s) is a zero-centred spatial process, and ε_j(s) ind~ N(0, τ_j^2) is random noise corresponding to outcome j. We focus upon the zero-centred multivariate Gaussian process (random field) w(s) = (w_1(s), w_2(s), …, w_q(s))^T with special attention to large q. The cross-covariance function of w(s) is a matrix-valued function on D × D that maps any pair of locations (s, s′) to the q × q matrix C(s, s′) = (C_ij(s, s′)) with (i, j)-th element C_ij(s, s′) = Cov(w_i(s), w_j(s′)). Cross-covariance functions must ensure that for any finite set of locations S = {s_1, …, s_n}, the nq × nq matrix C(S, S) = (C(s_i, s_j)) is positive definite (p.d.). Valid classes of cross-covariance functions have been comprehensively reviewed in Genton and Kleiber (2015).

We build upon Gneiting et al. (2010) and Apanasovich et al. (2012), who introduced multivariate Matérn cross-covariance functions, where the marginal covariance functions for each variable and the cross-covariance functions between each pair of variables are members of the Matérn family.
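The Matérn correlation function named above admits simple closed forms at half-integer smoothness; a minimal sketch follows (the decay parametrization H(h) = exp(-φh)·p(φh) is an assumption on our part, since conventions for the Matérn range/decay parameter vary across references):

```python
import math

def matern_corr(h, nu, phi):
    """Matérn correlation at distance h for half-integer smoothness nu
    and decay phi, using the standard closed forms.  (The decay
    parametrization is an assumption; conventions differ.)"""
    t = phi * h
    if nu == 0.5:
        return math.exp(-t)                       # exponential special case
    if nu == 1.5:
        return (1.0 + t) * math.exp(-t)
    if nu == 2.5:
        return (1.0 + t + t * t / 3.0) * math.exp(-t)
    raise NotImplementedError("general nu requires a Bessel-K evaluation")

# A valid correlation equals 1 at h = 0 and decays with distance:
print(matern_corr(0.0, 1.5, 2.0))
print(matern_corr(1.0, 1.5, 2.0) > matern_corr(2.0, 1.5, 2.0))
```

Larger ν gives smoother sample paths; general non-half-integer ν requires the modified Bessel function of the second kind.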
In its most general form, the multivariate Matérn is appealing as it ensures that each univariate process is a Matérn GP with its own range, smoothness and spatial variance. Constraints on the parameters are needed to ensure positive-definiteness. The parsimonious Matérn (Gneiting et al., 2010) imposes equality of the spatial range for all variables. Apanasovich et al. (2012) laid out more general sufficient conditions for the parameters, yielding a very broad class of multivariate Matérn covariances. However, all current multivariate Matérn models require estimating O(q^2) cross-covariance parameters, and the multivariate Gaussian likelihood involves the inverse and determinant of the dense nq × nq covariance matrix, which is prohibitive if n or q is large. Indeed, illustrations of the multivariate Matérn have mostly been restricted to applications with small q. Our work is also related to a conditional approach developed in Cressie and Zammit-Mangion (2016), where the univariate GPs are specified sequentially, conditional on the previous GPs, assuming some ordering of the q variables. However, this approach requires an ordering of the q variables, and does not attempt to retain conditional dependence relations from an inter-variable graph over a continuous domain, which we seek here.

We address the highly-multivariate setting with tens to hundreds of variables measured at each spatial location, which is becoming increasingly commonplace in the environmental and physical sciences. We specifically address some key properties of multivariate GPs that are deemed critical for handling highly multivariate data. First, we retain the flexibility to model and interpret spatial properties of each surface separately. Except for the multivariate Matérn, most other multivariate covariance functions fail to retain this property. Second, we construct the highly multivariate model by adapting graphical models to process-based settings.
While graphical models are extensively used to model multivariate dependencies in non-spatial settings, their use in spatial process modelling has primarily been to achieve scalability with respect to the number of locations (n). Multivariate covariance functions do not, in general, exploit an inter-variable graph to build a multivariate GP that conforms to process-level conditional independence among the variables. We introduce a method of stitching GPs together such that: (i) the marginal GP for each outcome agrees with the original process; and (ii) we retain the conditional independence of the processes implied by the inter-variable graph. Third, we focus on computational scalability with respect to the number of variables q. While substantial attention has been accorded to settings with a massive number of locations (n) (see Heaton et al., 2019, for a very recent review), the highly multivariate setting with large q fosters rather different computational issues. Likelihoods for popular covariance functions like the multivariate Matérn involve an O(q^2) parameter set and O(q^3) computations. Hence, even for a small number of locations n, most methods suffer from the curse of dimensionality stemming from optimizing in, or sampling from, such a high-dimensional space.

2 Method
Let G_V = (V, E_V) be a graph, where V is the set of indices for the variables and E_V is the set of edges. We will first define and derive a graphical Gaussian Process (GGP) from a given cross-covariance function such that the same marginal spatial covariance functions for each variable, and the same cross-covariance functions for pairs of variables (i, j) ∈ E_V, are retained. Following the development in Dahlhaus (2000) for multivariate discrete time-series, we define process-level conditional independence of multivariate continuous-space GPs. Let B ⊂ V and w_B(D) = {w_k(s) : k ∈ B, s ∈ D}, where each w_k(s) is a spatial process over the domain D. We define two processes w_i(·) and w_j(·) to be conditionally independent given the processes {w_k(·) | k ∈ V \ {i, j}} if Cov(z_iB(s), z_jB(s′)) = 0 for all s, s′ ∈ D and B = V \ {i, j}, where z_kB(s) = w_k(s) − E[w_k(s) | σ({w_j(s′) : j ∈ B, s′ ∈ D})] denotes the "residual" process.

Conditional independence boils down to zeros in the inverse spectral density matrix for stationary processes (see, e.g., Theorem 2.4 in Dahlhaus, 2000). If F(ω) = (f_ij(ω)) is the q × q spectral density matrix corresponding to the cross-covariance functions C_ij of a q × 1 stationary Gaussian process on some domain D, then two processes w_i(·) and w_j(·) are conditionally independent if and only if f^{ij}(ω) = 0 for almost all ω, where F(ω)^{-1} = (f^{ij}(ω)). The result is analogous to graphical Gaussian models, where the absence of an edge between two variables implies a zero in the corresponding entry of the precision matrix. We now define a graphical Gaussian Process (GGP) as follows.
Definition 2.1 (Graphical Gaussian Process). A q × 1 GP w(s) is a graphical Gaussian Process (GGP) with respect to a graph G_V = (V, E_V) when the univariate GPs w_i(·) and w_j(·) are conditionally independent for every (i, j) ∉ E_V.

Clearly, any collection of q independent GPs will trivially constitute a GGP with respect to the graph G_V. Our first key result (Theorem 2.2) shows that there exists a unique GGP corresponding to G_V from any given cross-covariance function. To prove this, we use the following lemma.

Lemma 2.1 (Covariance selection; Dempster, 1972). Given G_V = (V, E_V) and any positive definite matrix F = (F_ij) indexed by V × V, there exists a unique positive definite matrix F̃ = (F̃_ij) such that F̃_ij = F_ij for i = j or for (i, j) ∈ E_V, and (F̃^{-1})_ij = 0 for (i, j) ∉ E_V.

Proof. This is the main result developed in Dempster (1972).

The above is a seminal result on covariance selection that we use recurrently in this paper.
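Lemma 2.1 is constructive: F̃ can be computed by the iterative proportional scaling (IPS) algorithm of Speed et al. (1986), cited later in the paper. A minimal numpy sketch, assuming a clique-based update schedule (the function name and example values are our own illustration, not code from the paper):

```python
import numpy as np

def covariance_selection(F, cliques, n_iter=200):
    """IPS for Dempster's covariance selection: returns Ftilde matching F
    on the diagonal and on the given cliques, with (Ftilde^-1)_ij = 0
    outside them.  `cliques` must cover all vertices and edges of G."""
    q = F.shape[0]
    K = np.eye(q)                        # current precision estimate
    for _ in range(n_iter):
        for c in cliques:
            idx = np.ix_(c, c)
            S = np.linalg.inv(K)         # current covariance estimate
            # adjust the clique block so its marginal matches F
            K[idx] = K[idx] + np.linalg.inv(F[idx]) - np.linalg.inv(S[idx])
    return np.linalg.inv(K)

# Path graph 1 - 2 - 3 (0-indexed): cliques {0,1} and {1,2}.
F = np.array([[1.0, 0.5, 0.4],
              [0.5, 1.0, 0.5],
              [0.4, 0.5, 1.0]])
Ft = covariance_selection(F, [[0, 1], [1, 2]])
print(abs(np.linalg.inv(Ft)[0, 2]) < 1e-8)   # (1,3) precision entry is zero
```

For this decomposable path graph the solution has the closed form F̃_13 = F_12 F_23 / F_22 = 0.25, and IPS converges in one sweep; non-decomposable graphs genuinely need the iteration.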
Theorem 2.2.
Let C(h) = (C_ij(h)) be a q × q stationary cross-covariance function and let F(ω) = (f_ij(ω)) be the spectral density matrix corresponding to C(h). If the f_ii(ω) are square-integrable for all i, then, given any graph G_V = (V, E_V), there exists a unique q × 1 GGP (see Definition 2.1) with respect to G_V with cross-covariance function M(h) = (M_ij(h)) such that M_ij(h) = C_ij(h) for i = j and for all edges (i, j) ∈ E_V.

Proof. See Supplementary Materials.

Applying Theorem 2.2 to a multivariate Matérn GP with isotropic cross-covariance C(h) = (C_ij(h)), we obtain a GGP w(s) such that each univariate process w_i(s) is a GP with its Matérn covariance function C_ii(h). Furthermore, for each edge (i, j) ∈ E_V, the cross-covariance function between w_i(s) and w_j(s + h) also remains C_ij(h), i.e., also Matérn, and for edges (i, j) ∉ E_V the processes w_i(s) and w_j(s′) are conditionally independent for all s, s′ ∈ D.

Theorem 2.2, while of theoretical interest, is of limited practical value. For multivariate spatial data observed over a (possibly irregular) set of locations, the likelihood is specified by the cross-covariance function, and Theorem 2.2 does not offer a convenient way to generate cross-covariances in closed form. One could apply the iterative proportional scaling (IPS) algorithm (Speed et al., 1986) for covariance selection and obtain F̃(ω) for a large set of ω's by numerical integration. The cross-covariance function is then obtained by inverting the cross-spectral densities f̃_ij(ω). Since the spatial covariance parameters are unknown, this process has to be repeated in every iteration of optimizing or sampling from the likelihood, which is impractical. Hence, in the next section we pursue a more practicable approach, which we refer to as stitching.

2.2 Stitching of Gaussian Processes

We will construct GGPs by stitching together univariate GPs.
Stitching will work for any multivariate GP, but we motivate it with the multivariate Matérn model (Gneiting et al., 2010; Apanasovich et al., 2012). We first argue why no simple parametrization of the existing multivariate Matérn GPs yields a Matérn GGP with respect to G_V. The isotropic multivariate Matérn cross-covariance function on a d-dimensional domain is C_ij(s, s′) = σ_ij H_ij(‖s − s′‖), where H_ij(·) = H(· | ν_ij, φ_ij), H being the Matérn correlation function (Apanasovich et al., 2012). If θ_ij = {σ_ij, ν_ij, φ_ij}, then for a multivariate Matérn GP the i-th individual variable is a univariate Matérn GP with parameters θ_ii. This is attractive because it endows each univariate process with its own variance σ_ii, smoothness ν_ii, and spatial decay φ_ii. Another nice property is that under this model, Σ = (σ_ij) = Cov(w(s)) is the covariance matrix for w(s) within each location s. The remaining parameters, i.e., the cross-correlation parameters ν_ij and φ_ij for i ≠ j, are generally hard to interpret, especially since ν_ij does not correspond to the smoothness of any surface. Recent work by Kleiber (2017) on the concept of coherence, defined through the scaled spectral density matrix of a stationary multivariate process, has facilitated the interpretation of the cross-covariance parameters. The parsimonious multivariate Matérn model of Gneiting et al. (2010) is subsumed in this general specification as a sub-case with ν_ij = (ν_ii + ν_jj)/2 and φ_ij = φ.

To ensure a valid multivariate Matérn cross-covariance function, it is sufficient to constrain the intra-site covariance matrix Σ = (σ_ij) to be of the form (Theorem 1, Apanasovich et al., 2012)

σ_ij = b_ij Γ((ν_ii + ν_jj + d)/2) Γ(ν_ij) / (φ_ij^{2Δ_A + ν_ii + ν_jj} Γ(ν_ij + d/2)), where Δ_A ≥ 0 and B = (b_ij) > 0, i.e., is p.d.  (2)

This is equivalent to Σ being constrained as Σ = B ⊙ (γ_ij), where the γ_ij are constants collecting the terms in (2) involving only the ν_ij's and φ_ij's, and ⊙ denotes the Hadamard (element-wise) product. Similarly, the spectral density matrix takes the form F(ω) = B ⊙ (g_ij(ω)), where g_ij(ω) involves the parameters φ_ij and ν_ij.

The b_ij's are the O(q^2) parameters (free of the φ_ij's and ν_ij's) that are constrained to ensure B is positive definite. For independent and identically distributed multivariate data, when q is large, a sparse graphical model among the variables is typically enforced by setting elements of the q × q inverse-covariance (precision) matrix to zero. However, it is clear that zeros in B^{-1} or Σ^{-1} do not generally equate to zeros in F^{-1}(ω) for the multivariate Matérn. An exception occurs when each component is posited to have the same smoothness ν and the same spatial decay parameter φ, whence both Σ and F(ω) become proportional to B. Hence, zeros in B^{-1} (specified according to G_V) will correspond to zeros in Σ^{-1} and F^{-1}(ω), yielding a GGP with respect to G_V. However, assuming ν_ij = ν and φ_ij = φ for all (i, j) implies that the univariate GPs have the same smoothness and rate of spatial decay, which is restrictive. Beyond this "separable" case, there is, to the best of our knowledge, no known parameter choice for the multivariate Matérn GPs that will allow it to be a GGP with respect to a given graph G_V. This issue persists even when the GPs are restricted to a finite set of locations L. The covariance C(L, L) = Cov(w(L)) is defined through the ν_ij's, φ_ij's and b_ij's. Even when all parameters except the b_ij's are known, C(L, L) is a complicated function depending upon O(q^2) parameters in B, and zeros in B^{-1} do not correspond to zeros in C(L, L)^{-1}.
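To illustrate the constraint in (2), the map B ↦ Σ = B ⊙ (γ_ij) can be evaluated directly. A hypothetical sketch (the parameter values are arbitrary; this checks only that the resulting intra-site Σ is p.d., not the validity of the full cross-covariance function):

```python
import math
import numpy as np

def intra_site_sigma(B, nu, phi, d=2, Delta_A=0.0):
    """Map a p.d. matrix B = (b_ij) to Sigma = (sigma_ij) via (2):
    sigma_ij = b_ij * Gamma((nu_ii+nu_jj+d)/2) * Gamma(nu_ij)
               / (phi_ij^(2*Delta_A + nu_ii + nu_jj) * Gamma(nu_ij + d/2)).
    `nu` and `phi` are q x q arrays of smoothness and decay parameters."""
    q = B.shape[0]
    G = np.empty((q, q))
    for i in range(q):
        for j in range(q):
            G[i, j] = (math.gamma(0.5 * (nu[i, i] + nu[j, j] + d))
                       * math.gamma(nu[i, j])
                       / (phi[i, j] ** (2 * Delta_A + nu[i, i] + nu[j, j])
                          * math.gamma(nu[i, j] + 0.5 * d)))
    return B * G    # Hadamard product B ⊙ (gamma_ij)

# Parsimonious Matérn settings: nu_ij = (nu_ii + nu_jj)/2, common phi.
nus = np.array([0.5, 1.5, 2.5])
nu = 0.5 * (nus[:, None] + nus[None, :])
phi = np.full((3, 3), 2.0)
B = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
Sigma = intra_site_sigma(B, nu, phi)
print(np.all(np.linalg.eigvalsh(Sigma) > 0))   # Sigma is p.d. for this choice
```

By the Schur product theorem, Σ inherits positive definiteness from B whenever the Gamma-factor matrix (γ_ij) is itself positive semi-definite; the eigenvalue check above simply verifies it for this example.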
There are no obvious constraints on the parameters to enforce process-level conditional independence.

Instead of constraining parameters, we will "stitch" univariate GPs to build a multivariate GGP, given a valid q × q cross-covariance C(s, s′) and a graph G_V. We begin our construction on L, a finite, but otherwise arbitrary, set of locations in D. Our aim is to construct a q-variate GP w(·) with the following properties: (i) each w_i(L) is a realization from a univariate GP with covariance function C_ii(s, s′), i.e., Cov(w_i(L)) = C_ii(L, L); (ii) for any edge (i, j) ∈ E_V, the cross-covariances are preserved, i.e., Cov(w_i(L), w_j(L)) = C_ij(L, L); and (iii) w(·) retains the conditional independence relations specified by G_V. We model w(L) ~ N(0, M(L, L)), seeking a positive definite matrix M(L, L) such that (a) M_ii(L, L) = C_ii(L, L) for all i = 1, …, q, to satisfy (i); (b) M_ij(L, L) = C_ij(L, L) for all (i, j) ∈ E_V, to satisfy (ii); and (c) (M(L, L)^{-1})_ij = 0 for all (i, j) ∉ E_V, to build towards (iii). (Condition (c) only ensures conditional independence over L; process-level conditional independence over the entire domain D follows from the subsequent extension in (3), as proved in Theorem 2.3.)

We identify the existence of such a matrix M(L, L) as a covariance selection problem. Let G_L = (L, E_L) be the complete graph on the set of locations L. To ensure that the cross-covariances are preserved over L for (i, j) ∈ E_V and that the conditional independence among w(L) is inherited from G_V, we use the strong product G_V ⊠ G_L to build M(L, L). Here, G_V ⊠ G_L = (V × L, E_{V×L}) with V × L = {(v, l) : v ∈ V, l ∈ L}, and E_{V×L} comprises edges between vertex-pairs (v, l) and (v′, l′) based upon the following strong-product adjacency rules: (i) v = v′ and (l, l′) ∈ E_L; or (ii) l = l′ and (v, v′) ∈ E_V; or (iii) (v, v′) ∈ E_V and (l, l′) ∈ E_L. Lemma 2.1, applied with the positive definite matrix F = C(L, L) and the graph G_V ⊠ G_L, ensures the existence and uniqueness of a positive definite matrix F̃ = M(L, L) satisfying conditions (a), (b) and (c) above. In practice, M(L, L) can be obtained using an iterative proportional scaling algorithm (Speed et al., 1986; Xu et al., 2011).

Having built the finite-dimensional distribution of w(L) from G_V ⊠ G_L, we now extend it to a well-defined multivariate GP w(·) over the domain D that conforms to the conditional dependencies implied by G_V:

w_i(s) = C_ii(s, L) C_ii(L, L)^{-1} w_i(L) + z_i(s) for all s ∈ D \ L,  (3)

where each z_i(s) is a zero-centred Gaussian Process, independent across i and independent of w(L), with covariance function C_{ii|L}(s, s′) = C_ii(s, s′) − C_ii(s, L) C_ii(L, L)^{-1} C_ii(L, s′). The distribution of w_i(L) ~ N(0, C_ii(L, L)) and w_i(s) defined in (3) for s ∈ D \ L specifies a well-defined q-variate GP w(·) = (w_1(·), …, w_q(·))^T over D. The above construction, which we refer to as "stitching", ensures the following.

Theorem 2.3.
Given a cross-covariance function C(s, s′) = (C_ij(s, s′)) and an inter-variable graph G_V, stitching creates a valid multivariate GGP w(·) with positive-definite cross-covariance function M(s, s′) = (M_ij(s, s′)) such that:

(a) M_ii(s, s′) = C_ii(s, s′) for all s, s′ ∈ D and for each i = 1, …, q.
(b) If (i, j) ∈ E_V, then M_ij(s, s′) = C_ij(s, s′) for all s, s′ ∈ L.
(c) If (i, j) ∉ E_V, then the processes w_i(·) and w_j(·) are conditionally independent on D given w_B(·), where w_B(·) = {w_k(·) : k ∈ V \ {i, j}} are all the other processes.

Proof. See Supplementary Materials.
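The strong-product adjacency rules (i)-(iii) used in the construction above can be sketched in a few lines of pure Python (the function name and toy graphs are our own illustration):

```python
from itertools import product

def strong_product_edges(EV, EL, V, L):
    """Edge set of the strong product G_V ⊠ G_L used in stitching:
    (v,l) ~ (v',l') iff (i) v = v' and (l,l') in E_L, or
    (ii) l = l' and (v,v') in E_V, or (iii) (v,v') in E_V and (l,l') in E_L."""
    EV = {frozenset(e) for e in EV}
    EL = {frozenset(e) for e in EL}
    edges = set()
    for (v, l), (v2, l2) in product(product(V, L), repeat=2):
        if (v, l) == (v2, l2):
            continue
        rule1 = v == v2 and frozenset((l, l2)) in EL
        rule2 = l == l2 and frozenset((v, v2)) in EV
        rule3 = frozenset((v, v2)) in EV and frozenset((l, l2)) in EL
        if rule1 or rule2 or rule3:
            edges.add(frozenset([(v, l), (v2, l2)]))
    return edges

# Path graph G_V on 3 variables, complete graph G_L on 2 locations:
V, L = [1, 2, 3], ["s1", "s2"]
EV = [(1, 2), (2, 3)]
EL = [("s1", "s2")]
E = strong_product_edges(EV, EL, V, L)
print(len(E))   # 11 edges for P3 ⊠ K2
```

Note that vertices (1, s) and (3, s′) are never adjacent, reflecting the missing (1, 3) edge in G_V; covariance selection on this product graph then forces the corresponding precision entries of M(L, L) to zero.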
Figure 1: Stitching Gaussian Processes. Left: realizations of 4 univariate GPs. Right: realization of a multivariate (4-dimensional) GGP created by stitching together the 4 univariate GPs from the left figure using the strong product graph over the 4 variables and 3 locations.
Stitching, therefore, ensures that the marginal covariance function for each w_i(s) over D is exactly C_ii(s, s′). For variable pairs (i, j) included in the graphical model G_V, M_ij(s, s′) = C_ij(s, s′) for locations in L, which can be made arbitrarily dense in the domain. Finally, process-level conditional independence relations are established using G_V ⊠ G_L on L and then using independent processes z_i(·) for each variable i for the extension to D, without violating the specified conditional independence. In particular, if C(h) is the multivariate Matérn cross-covariance (Gneiting et al., 2010; Apanasovich et al., 2012), then stitching with G_V will produce a GGP w(s) on D such that each w_i(s) is a GP with a univariate Matérn covariance function. Furthermore, these Matérn processes satisfy the conditional independence specified by G_V.

The term "stitching" is motivated by the example in Figure 1. Figure 1(a) shows realizations of 4 univariate Matérn GPs w_i(·), i = 1, …, 4, each with a different smoothness and spatial range. Figure 1(b) shows a multivariate GGP constructed by stitching together the 4 processes at the 3 locations in L using a path-graph as G_V. Geometrically, this looks like stitching the four surfaces together at the locations L, while exactly preserving each univariate surface. The graph edges serve as the threads holding the surfaces together.

We point out differences between the GGP ensured by Theorem 2.2 and the one produced by stitching. For pairs of variables (i, j) ∈ E_V, the cross-covariance for the former is the same as the Matérn cross-covariance C_ij on the entire domain D, whereas for the latter this agreement holds only on L. In fact, for a pair s, s′ ∉ L and i ≠ j, it is straightforward to see that

M_ij(s, s′) = C_ii(s, L) C_ii(L, L)^{-1} M_ij(L, L) C_jj(L, L)^{-1} C_jj(L, s′).  (4)

Stitching exploits this fixed-rank cross-covariance to construct a practically implementable GGP with full-rank marginal covariances and process-level conditional independence inherited from G_V. On the other hand, unlike Theorem 2.2, stitching does not need the given cross-covariance C to be stationary, and can be used with asymmetric or even non-stationary C.

The reference set L is arbitrary and can, but need not, overlap with the set of data locations. If D_i, i = 1, 2, …, q, denotes the set of n_i data locations for the i-th variable, then the joint probability density of the w_i(D_i) and w(L) is specified by w(L) ~ N(0, M(L, L)) and

w_i(D_i) | w(L) ind~ N(C_ii(D_i, L) C_ii(L, L)^{-1} w_i(L), C_{ii|L}(D_i, D_i)) for i = 1, …, q.  (5)

The distribution of {w_i(D_i) | w(L), i = 1, …, q} involves a block-diagonal covariance matrix with variable-specific blocks. Therefore, it can be evaluated easily if all the n_i's are small. If some n_i's are large, one can use one of the many well-established variants for scaling up GPs to a very large number of locations (Heaton et al., 2019). For example, a nearest neighbour GP (NNGP; Datta et al., 2016) yields a sparse approximation of C_{ii|L}(D_i, D_i) with linear complexity, while the joint distribution still preserves the conditional independence implied by G_V.

Turning to the highly multivariate case, where q is large, note that {w_i(D_i) | w(L), i = 1, …, q} in (5) has q conditionally independent factors and is easy to compute. Hence, it is w(L) ~ N(0, M(L, L)) that presents the bottleneck. In particular, there are two challenges for large q. First, as discussed earlier, the multivariate Matérn C(h) required for stitching needs to constrain B = (b_ij) to be a p.d. matrix on an O(q^2)-dimensional parameter space.
Searching in such a high-dimensional space is difficult for large q, and verifying positive definiteness of B incurs an additional cost of O(q^3) floating point operations. Second, evaluating w(L) ~ N(0, M(L, L)) involves matrix operations for the nq × nq matrix M(L, L). While the precision matrix M(L, L)^{-1} is sparse because of G_V, its determinant is usually not available in closed form and the calculation can become prohibitive even for small n.

We now consider decomposable inter-variable graphs G_V in highly multivariate settings. For G_V = (V, E_V) and a triplet (A, B, O) of disjoint subsets of the vertex set V, O is said to separate A from B if every path from a vertex in A to a vertex in B passes through a vertex in O. If V = A ∪ B ∪ O, and O is a complete subset of V, then (A, B, O) is said to decompose G_V. The graph is said to be decomposable if it is complete or if there exists a proper decomposition (A, B, O) into decomposable subgraphs G_{A∪O} and G_{B∪O}. Decomposability is conspicuous in graphical models (see, e.g., Dobra et al., 2003; Wang and West, 2009), and fitting Bayesian graphical models is cumbersome for non-decomposable graphs (Roverato, 2002; Atay-Kayis and Massam, 2005).

For decomposable graphs we can significantly reduce the dimension of the parameter space, and the storage and computational burden of stitching, using Matérn GPs. An important property of a decomposable graph is that the cliques of the graph can be ordered into a perfect sequence (Lauritzen, 1996). Let K_1, …, K_p be a sequence of subsets of the vertex set V of an undirected graph G_V. Let F_m = K_1 ∪ ⋯ ∪ K_m and S_m = F_{m−1} ∩ K_m. The sequence {K_m} is said to be perfect if it satisfies the following: (i) for every l > 1, there is an m < l such that S_l ⊂ K_m; and (ii) the sets S_m are complete for all m. If G_V is decomposable, then it has a perfect clique sequence and the joint density of w(L) can be factorized as follows.

Corollary 3.0.1.
If G_V has a perfect clique sequence {K_1, K_2, …, K_p}, then the GGP likelihood can be decomposed as

f_M(w(L)) = ∏_{m=1}^{p} f_C(w_{K_m}(L)) / ∏_{m=2}^{p} f_C(w_{S_m}(L)),  (6)

where S_m = F_{m−1} ∩ K_m and F_{m−1} = K_1 ∪ ⋯ ∪ K_{m−1}, and f_M and f_C are the densities of Gaussian Processes with covariance functions M(s, s′) and C(s, s′), respectively.

Proof. See Supplementary Materials.

Corollary 3.0.1 helps us manage the dimension and constraints of the parameter space and the computational complexity. For an arbitrary graph G_V, the parameter space for the stitching covariance function M is the same as the parameter space {θ_ij | 1 ≤ i, j ≤ q} for the original covariance function C. For a decomposable graph G_V, the likelihood (6) and, in turn, the stitched GGP is specified only by the parameters {θ_ij | (i = j) or (i, j) ∈ E_V}. Therefore, the dimension of the parameter space reduces from O(q^2) to O(|E_V|), the number of edges in G_V, which is small for sparse graphs. When using a multivariate Matérn cross-covariance C for stitching, the parameter space for B in the stitched graphical Matérn is the intersection of the parameter spaces of the low-dimensional clique-specific multivariate Matérn covariance functions C_{K_1}, …, C_{K_p}. Hence, the parameter space becomes {b_ij | (i = j) or (i, j) ∈ E_V} and needs to satisfy the constraint that B_{K_l} = (b_ij)_{i,j ∈ K_l} is positive definite for all l = 1, …, p. This reduces the complexity of the parameter-space constraints from O(q^3) to at most O(p*(q*)^3), where q* is the largest clique size and p* is the maximum number of cliques sharing a common vertex.

The precision matrix of w(L) satisfies (see, e.g., Lemma 5.5 in Lauritzen, 1996)

M(L, L)^{-1} = Σ_{m=1}^{p} [C_{K_m ⊠ G_L}^{-1}]^{V×L} − Σ_{m=2}^{p} [C_{S_m ⊠ G_L}^{-1}]^{V×L},  (7)

where, for any symmetric matrix A = (a_ij) with rows and columns indexed by a subset U of V × L, A^{V×L} denotes a |V × L| × |V × L| matrix such that (A^{V×L})_ij = a_ij if i, j ∈ U, and (A^{V×L})_ij = 0 elsewhere. From (6) and (7) we see that we can avoid the large matrix M(L, L); all matrix operations are limited to the sub-matrices of M(L, L) corresponding to the cliques K_m ⊠ G_L and separators S_m ⊠ G_L. The entire process requires at most O(p n^3 (q*)^3) flops and O(p n^2 (q*)^2) storage, where p is the length of the perfect ordering. Table 1 summarizes these gains from stitching over decomposable graphs.

Table 1: Properties of any q-dimensional multivariate Matérn GP of Gneiting et al. (2010) or Apanasovich et al. (2012) and a multivariate graphical Matérn GP stitched using a decomposable graph G_V with largest clique size q*, length of perfect ordering p, and maximal number of cliques p* sharing a common vertex.

Model attributes | Multivariate Matérn | Graphical model Matérn
Number of parameters | O(q^2) | O(|E_V| + q)
Parameter constraints | O(q^3) | O(p*(q*)^3) (worst case)
Storage | O(n^2 q^2) | O(p n^2 (q*)^2)
Time complexity | O(n^3 q^3) | O(p n^3 (q*)^3)
Conditionally independent processes | No | Yes
Univariate components are Matérn GPs | Yes | Yes

In addition to the computational benefits described above, stitched GGP models are naturally amenable to parallel computing. In a Bayesian implementation of a stitched GGP model (described in Section S2.1 of the Supplement), we can exploit the underlying graphical model G_V and deploy a chromatic Gibbs sampler (Gonzalez et al., 2011) to simultaneously update batches of random variables in parallel. If we group all variable-specific parameters (regression coefficients, spatial parameters, noise variance and latent spatial random effects) into the parameter vector η_i, then, under a graph colouring of G_V, η_i and η_{i′} can be updated simultaneously if i and i′ share the same colour, as illustrated in Figure 2(a). This brings down the number of sequential steps in the sampling of the η_i's from q to the chromatic number χ(G_V). The chromatic sampling scheme is different for the b_ij's. Let G_E(G_V) = (E_V, E*) denote a graph whose vertices are the edges E_V. An edge ((i, j), (i′, j′)) in this new graph G_E(G_V) is in E* if {i, i′, j, j′} lie in some clique K of G_V. We can then batch the updates of the b_ij's based on a colouring of the graph G_E(G_V) (Figure 2(b)), and the number of such sequential batch updates will be the chromatic number χ(G_E(G_V)), a reduction from |E_V| sequential updates for the b_ij's.

Figure 2: Chromatic sampling for GGP. Left: colouring of a gem graph G_V between variable processes w_1, w_2, w_3, w_4, w_5, used for chromatic sampling of the marginal (variable-specific) parameters. Right: colouring of the corresponding edge graph G_E(G_V), used for chromatic sampling of the cross-covariance parameters b_ij.

Extensions
Graphical Gaussian Processes constructed from stitching are also natural candidates for non-separable (in space-time), non-stationary (in time) modelling of univariate or multivariate spatial time-series data. First, consider a univariate spatial time-series modelled as a Gaussian Process w(s, t) for s ∈ D evolving over a discrete set of time points t ∈ T = {1, 2, …, T}. We envision this as a T × 1 GP w(s) = (w_1(s), …, w_T(s))^T, where w_t(s) = w(s, t). Temporal evolution of the process is encapsulated using a directed acyclic graph (DAG), which, when moralized, produces an undirected graph G_T over T. We can then recast the spatial time-series model as a T × 1 GGP conforming to the conditional independence implied by G_T. A multivariate Matérn used for stitching will produce a GGP with each w_t(·) being a Matérn GP with parameters θ_tt. Time-specific spatial parameters enrich the model without imposing stationarity of the spatial process over time or space-time separability (Gneiting, 2002).

Any autoregressive (AR) structure over time corresponds to a decomposable moralized graph G_T. For example, the AR(1) model corresponds to a path graph with edges {(t, t+1) | t = 1, …, T−1}, q* = 2 and p* = 2. An AR(2) is specified by the DAG with edges t−2 → t and t−1 → t for all t ∈ {3, …, T} (Figure S1(a) in the Supplement), which, when moralized, yields the sparse decomposable graph G_T (with q* = 3) in Figure S1(b) of the Supplement. Hence, Corollary 3.0.1 ensures accrual of computational gains for GGP models for autoregressive spatial time-series. An added benefit of using the GGP is that the autoregression parameters need not be universal, but can be time-specific, thus relaxing another restrictive stationarity condition.

Multivariate spatial time-series can also be modelled using GGPs. We envision q variables recorded at T time-points, resulting in qT variables. We now specify G_{V×T} on the variable-time set.
Common specifications for multivariate time-series like graphical vector autoregressive (VAR) structures (Dahlhaus and Eichler, 2003) will yield decomposable G_{V x T}. For example, consider the non-separable graphical VAR of order 1 with q = 2, specified by the DAG (1, t-1) -> (1, t), (1, t-1) -> (2, t), and (2, t-1) -> (2, t) (Figure S1(c) of the Supplement). This yields the decomposable G_{V x T} in Figure S1(d) of the Supplement, also with q* = 3.

So far all our examples have involved isotropic (symmetric) multivariate Matern cross-covariances. Symmetry implies C_ij(s, s') = C_ij(s', s) for all i, j, s, s' and is not a necessary condition for validity of a cross-covariance function. Asymmetric cross-covariance functions have been discussed in Apanasovich and Genton (2010) and Li and Zhang (2011). An asymmetric cross-covariance C^a can be specified in terms of a symmetric cross-covariance C as C^a_ij(s, s') = C^a_ij(s - s') = C_ij(s - s' + (a_i - a_j)), where a_i, i = 1, ..., q, are distinct variable-specific parameters. As mentioned earlier, stitching will work with any valid cross-covariance function, and if C^a is used for the stitching, then the resulting graphical cross-covariance M^a will also be asymmetric, satisfying M^a_ij(s, s') = C^a_ij(s, s') for all (i, j) in E_V and s, s' in L.

We outline a Gibbs sampler in Section S2.1 of the Supplement for the multivariate spatial linear model in (1), where the latent q x 1 process w(s) is modelled as a GGP. If |L| = n, then the algorithm needs to sample all the O(nq) latent spatial random effects w_i(L) at each iteration. A common strategy for estimating process parameters in spatial linear models is to integrate out the spatial random effects w and directly use the marginalized (or collapsed) likelihood for the response process y(.) = (y_1(.), ..., y_q(.))^T, which is also a multivariate GP.
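Integrating out w changes the conditional-independence structure: adding a diagonal nugget covariance to a covariance with a sparse inverse generically fills in the inverse. A minimal numeric check (illustrative values; numpy only):

```python
import numpy as np

# Precision of (w1, w2, w3) that is Markov w.r.t. the path graph 1 - 2 - 3:
# the (1, 3) entry is exactly zero.
Q_w = np.array([[ 2.0, -0.8,  0.0],
                [-0.8,  2.0, -0.8],
                [ 0.0, -0.8,  2.0]])
C_w = np.linalg.inv(Q_w)        # Cov(w): dense, but with a sparse inverse

tau2 = 0.5
C_y = C_w + tau2 * np.eye(3)    # Cov(y) = Cov(w) + Cov(eps)
Q_y = np.linalg.inv(C_y)

print(abs(Q_w[0, 2]) < 1e-12)   # True: w1 _||_ w3 | w2
print(abs(Q_y[0, 2]) > 1e-3)    # True: conditional independence is lost for y
```

This is exactly the phenomenon discussed next: the zeros of Cov(w)^{-1} do not survive the addition of the nugget.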
However, w(.) modelled as a GGP does not ensure that the marginalized y(.) will be a GGP. We demonstrate this in Figure 3(a) with a path graph G_V between latent processes w_1(.), w_2(.) and w_3(.). The response processes y_i(.) = w_i(.) + eps_i(.) have a complete graph. This is because Cov(y) = Cov(w) + Cov(eps), and the zeros in Cov(w)^{-1} do not correspond to zeros in Cov(y)^{-1}. Hence, modelling the latent spatial process as a GGP and subsequently marginalizing is inconvenient because the marginalized likelihood for y will not factorize like (6).

Instead, we can directly create a GGP for the response process by stitching the marginal cross-covariance function Cov(y(s), y(s + h)) = C(h) + D(h) using G_V, where D(h) = diag(tau^2_1, ..., tau^2_q) I(h = 0) is the diagonal white-noise covariance function. If using a Matern

(a) GGP on the latent process. (b) GGP on the response process.
Figure 3:
Comparison of induced graphs for processes (obeying a path graph) from the marginalized model and the latent model. Blue edges indicate the dependencies modelled and red edges denote the marginal dependencies induced by the model construction.

cross-covariance C, the resulting GGP model for y(.) has, for each y_i(.), a univariate process with mean x_i(.)^T beta_i and covariance function C_ii(h) + tau^2_i I(h = 0). The cross-covariance between y_i(.) and y_j(.) is also Matern for (i, j) in E_V and locations in L. For (i, j) not in E_V, the response processes y_i(.) and y_j(.) will be conditionally independent. We denote the covariance function of this GGP by M* and outline the Gibbs sampler in Section S2.2 of the Supplement.

The response model drastically reduces the dimensionality of the sampler from O(nq + |E_V|) for the latent model to O(q + |E_V|). However, what we gain in terms of convergence of the chain is compromised in interpretation of the latent process. As we see in Figure 3(b), using a graphical model on the response process leads to a complete graph among the latent processes. If, however, conditional independence of the latent processes is not absolutely necessary, then the marginalized GGP model is a viable alternative for modelling highly multivariate spatial data.

We conducted multiple simulation experiments to compare five models: (a) PM: Parsimonious Multivariate Matern of Gneiting et al. (2010); (b) MM: Multivariate Matern of Apanasovich et al. (2012) with nu_ij = (nu_ii + nu_jj)/2, Delta_A = 0 and phi_ij = (phi_ii + phi_jj)/2; (c) BGML: Bayesian Graphical Matern on the latent process; (d) BGMR: Bayesian Graphical Matern on the response process (Sections 4.3 and S2.2); and (e) GM: Graphical Matern using classical maximum likelihood estimation for parameter estimation (using the co-ordinate descent algorithm outlined in Section S2.3 of the Supplement). All GGPs were stitched using the MM specification in (b).

Table 2:
Different simulation scenarios considered for the comparison between methods.
Set | q   | Graph G_V          | B                 | Nugget | Locations                                  | Data model | Fitted models
1   | 5   | Gem (Figure 2(a))  | Random            | No     | Same locations for all variables           | GM         | GM, MM, PM, BGMR
2   | 15  | Path               | b_{i-1,i} = rho_i | Yes    | Partial overlap in locations for variables | GM         | PM, BGML, BGMR
3A  | 100 | Path               | b_{i-1,i} = rho_i | Yes    | Partial overlap in locations for variables | GM         | BGML, BGMR
3B  | 100 | Path               | b_{i-1,i} = rho_i | Yes    | Partial overlap in locations for variables | MM         | BGML, BGMR
We consider the 4 settings in Table 2. In Set 1, we have q = 5 and generate data from a graphical Matern using a gem graph (Figure 2(a)) for stitching. We assumed no spatial misalignment, i.e., all variables are observed at each location in L. Hence, we used GM for estimation. As q is small, we could implement both PM and MM. Set 1 did not use a nugget, hence BGML and BGMR are the same for this set. For all scenarios, the number of locations was n = 250 and locations were generated uniformly from a grid within the unit square. For Set 2, we considered q = 15 spatially misaligned outcomes, a path graph G_V, and included the nugget processes to generate the responses. The models fitted are PM, BGML and BGMR. Sets 3A and 3B consider the highly multivariate case with q = 100 outcomes and a path graph among the variables. Neither PM nor MM can be implemented for such settings because B involves q(q+1)/2 = 5,050 parameters and likelihood evaluation requires inverting an nq x nq = 25,000 x 25,000 matrix in each iteration. The data are generated using GM in 3A and MM in 3B, the latter serving as a misspecified example.

We simulated 1 covariate x_j(s_i) for each variable j, generated independently from a N(0, 1) distribution, and the true regression coefficients beta_j from Unif(-2, 2) for j = 1, 2, ..., q. The phi_ii and sigma_ii were equispaced numbers in a fixed interval, while the b_ij's were chosen as in Table 2.

For all of the candidate models, each component of the q-variate process is a Matern GP. Hence, following the recommendation outlined in Apanasovich et al. (2012), the marginal parameters theta_ii for the univariate Matern processes were estimated a priori using only the data for the corresponding i-th variable. The BRISC package (Saha and Datta, 2018) was used for estimation.

(a) Set 1 (b) Set 2 (c) Set 3A (d) Set 3B
Figure 4:
Estimates of the rescaled cross-covariance parameters sigma_ij phi_ij^{2 nu_ij} (proportional to b_ij), (i, j) in E_V, for the 4 simulation sets. The horizontal pink lines in panels (a) and (b) indicate true parameter values.

We primarily focus on estimating the cross-covariance parameters b_ij, (i, j) in E_V, as they specify the cross-covariances in stitching. In Figure 4 we present estimates of sigma_ij phi_ij^{2 nu_ij}, which are the b_ij's rescaled to be on the same scale as the marginal microergodic parameters sigma_ii phi_ii^{2 nu_ii}. The variable-specific parameters are of lesser importance because stitching ensures that each univariate process is a Matern GP, similar to the multivariate Matern models. The estimates of the marginal spatial parameters are plotted in Figure S2 of the Supplement. The estimates of the regression coefficients beta_j were accurate for all models, and are not presented.

Figure 4(a) reveals that for Set 1, the MM, GM and BGMR all produce reasonable estimates of the true cross-covariance parameters, whereas the parsimonious Matern estimates are biased and more variable. For Set 2, the estimates of PM are once again biased, whereas both BGML and BGMR produce estimates much closer to the truth. For the highly multivariate setting of Set 3A, the GGP is still competitive, with both BGML and BGMR once again accurately estimating all the b_ij's for (i, j) in E_V. This is true even for the misspecified case of Set 3B, where both BGML and BGMR accurately estimated all the cross-covariances for variable pairs belonging to the graphical model.

We demonstrate an application of GGP for non-stationary (in time) and non-separable (in space-time) modelling of spatial time-series, as discussed in Section 4.1. We model daily levels of PM2.5
measured at monitoring stations across 11 states of the north-eastern US and Washington DC for a three-month period from February 1, 2020, until April 30, 2020. The data are publicly available from the website of the United States Environmental Protection Agency.

We selected n = 106 stations that had at least two months of measured data. Meteorological variables like temperature, barometric pressure, wind-speed and relative humidity are known to affect PM2.5 levels. Since not all of the pollutant monitoring stations record the meteorological covariates, we collected the weather information from the NOAA Local Climatological Database. Finally, we merged the weather information from EPA and NOAA and imputed the daily weather information at pollutant monitoring locations using multilevel B-spline smoothing.

Following Section 4.1, we view the spatial time-series at n = 106 locations and T = 90 days as a highly multivariate (90-dimensional) spatial dataset. Neither the parsimonious Matern nor the multivariate Matern was implementable, as they involve around 4,000 cross-covariance parameters and 9,540 x 9,540 matrix computations at each iteration.

We used a Matern GGP to model the data, with an AR(1) graphical model for smoothing across time, as exploratory analysis revealed strong autocorrelation between pollutant processes on consecutive days even after adjusting for all the covariates. The marginal parameters for
Figure 5: PM2.5 analysis: (a) Daily RMSPE for the 6 fortnightly analyses; (b) estimates of the auto-regression covariance from the graphical Matern for the full analysis; (c) estimates of the residual spatial processes (in ug/m^3) from BGMR for PM2.5 pollutant levels (after adjusting for covariates) in the north-east US for the first two weeks of February and the last two weeks of April.

day t were sigma_tt, phi_tt and tau^2_t. The autoregressive coefficient (cross-covariance parameter) between days t-1 and t was rho_t. Thus the GGP offered the flexibility to model non-separability across space and time, time-varying marginal spatial parameters and autoregressive coefficients. We implemented both the latent (BGML) and the response (BGMR) models.

We first present a sub-group analysis breaking the 90 days worth of data into fortnights. Data for each fortnight is only 14-dimensional and, hence, we are able to analyse each chunk separately using the parsimonious Matern. We compare the daily RMSPE based on hold-out data among the three methods. Figure 5(a) reveals that the three methods produce very similar predictive performance when analysing each fortnight of data separately. We analyse the full dataset using the GGP model, as other multivariate Matern GPs are precluded by the highly multivariate setting. The GGP model involves only T - 1 = 89 cross-covariance parameters. Since the largest clique size in an AR(1) graph is 2, the largest matrix we deal with is only 212 x 212. We present the estimates of the auto-covariance parameters in Figure 5(b). We see that there is large variation in the estimates of this parameter across time, with many spikes indicating high positive autocorrelation.
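The computational savings above rest on the clique-separator factorization of a Gaussian density over a decomposable graph (as in (6)): for the AR(1) path graph, the joint log-likelihood splits into 2 x 2 clique terms minus singleton separator terms. A minimal numeric check, using an illustrative covariance with banded inverse rather than the fitted model:

```python
import numpy as np

def gauss_logpdf(x, S):
    """Log-density of N(0, S) at x."""
    k = len(x)
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(S, x))

rng = np.random.default_rng(0)
T, rho = 6, 0.6
# Covariance S_ij = rho^|i-j| has a tridiagonal inverse, i.e. it is Markov
# w.r.t. the AR(1) path graph with cliques {t, t+1} and separators {t}.
S = rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
x = rng.standard_normal(T)

full = gauss_logpdf(x, S)
cliques = sum(gauss_logpdf(x[t:t + 2], S[t:t + 2, t:t + 2]) for t in range(T - 1))
seps = sum(gauss_logpdf(x[t:t + 1], S[t:t + 1, t:t + 1]) for t in range(1, T - 1))
print(np.isclose(full, cliques - seps))  # True: only 2 x 2 blocks were needed
```

The same identity, with blocks of size 2n instead of 2, is what keeps the full PM2.5 analysis down to 212 x 212 matrices.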
The estimates provide strong evidence in favour of non-stationarity across time.

Figure 5(c) presents the estimated average residual spatial surface y_P(s) = (1/|P|) sum_{t in P} (y_t(s) - x_t(s)^T beta-hat_t) over two choices of the time-period P: the first two weeks of February, 2020 (left) and the last two weeks of April, 2020 (right). These two periods represent the beginning and end of the time period for our study and also correspond to before and during the lock-downs imposed in the north-eastern US due to COVID-19. We observe a decrease in intensity of the residual process from February to April, suggesting a decrease in PM2.5 levels during this period even after accounting for the meteorological covariates.

We have addressed modelling highly multivariate spatial data using GGPs with cross-covariance functions that exploit graphical models to ensure process-level conditional independence among the variables, while preserving the attractive interpretation of the marginal covariance functions (e.g., the Matern family) for each univariate process. The existence of such processes is formally established and a pragmatic variant using "stitching" is developed. The highly multivariate setting is scaled by stitching with decomposable graphical models and Matern GPs.

This development has focused upon highly multivariate settings that require joint modelling of a very large number of spatially dependent variables. This "high-dimensional" problem is distinctly different from the burgeoning literature on high-dimensional problems referring to massive numbers of spatial locations. Nevertheless, there are some obvious connections between these two problems in light of the recent interest in DAG-based GPs for modelling the latter situation (Datta et al., 2016, 2018).
A future direction will be to simultaneously address the problem of big "n" and big "q" by merging the ideas of nearest-neighbor location graphs with sparse variable graphs in a modified stitching procedure.

References

Apanasovich, T. V. and Genton, M. G. (2010). Cross-covariance functions for multivariate random fields based on latent dimensions.
Biometrika, 97(1):15–30.

Apanasovich, T. V., Genton, M. G., and Sun, Y. (2012). A valid Matern class of cross-covariance functions for multivariate random fields with any number of components. Journal of the American Statistical Association, 107(497):180–193.

Atay-Kayis, A. and Massam, H. (2005). A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 92(2):317–335.

Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall/CRC, Boca Raton, FL, second edition.

Cramer, H. (1940). On the theory of stationary random processes. Annals of Mathematics, pages 215–230.

Cressie, N. and Zammit-Mangion, A. (2016). Multivariate spatial covariance models: a conditional approach. Biometrika, 103(4):915–935.

Cressie, N. A. C. and Wikle, C. K. (2011). Statistics for Spatio-Temporal Data. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ.

Dahlhaus, R. (2000). Graphical interaction models for multivariate time series. Metrika, 51(2):157–172.

Dahlhaus, R. and Eichler, M. (2003). Causality and graphical models in time series analysis. Oxford Statistical Science Series, pages 115–137.

Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.

Datta, A., Banerjee, S., Hodges, J. S., Gao, L., et al. (2018). Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models. Bayesian Analysis.

Dempster, A. P. (1972). Covariance selection. Biometrics, pages 157–175.

Dobra, A. et al. (2003). Markov bases for decomposable graphical models. Bernoulli, 9(6):1093–1108.

Genton, M. G. and Kleiber, W. (2015). Cross-covariance functions for multivariate geostatistics. Statistical Science, pages 147–163.

Gneiting, T. (2002). Nonseparable, stationary covariance functions for space–time data. Journal of the American Statistical Association, 97(458):590–600.

Gneiting, T., Kleiber, W., and Schlather, M. (2010). Matern cross-covariance functions for multivariate random fields. Journal of the American Statistical Association, 105(491):1167–1177.

Gonzalez, J., Low, Y., Gretton, A., and Guestrin, C. (2011). Parallel Gibbs sampling: From colored fields to thin junction trees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 324–332.

Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., et al. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425.

Kleiber, W. (2017). Coherence for multivariate random fields. Statistica Sinica, pages 1675–1697.

Lauritzen, S. L. (1996). Graphical Models, volume 17. Clarendon Press.

Li, B. and Zhang, H. (2011). An approach to modeling asymmetric multivariate spatial covariance structures. Journal of Multivariate Analysis, 102(10):1445–1453.

Parra, G. and Tobar, F. (2017). Spectral mixture kernels for multi-output Gaussian processes. In Advances in Neural Information Processing Systems, pages 6681–6690.

Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3):391–411.

Saha, A. and Datta, A. (2018). BRISC: bootstrap for rapid inference on spatial covariances. Stat, 7(1):e184.

Speed, T. P., Kiiveri, H. T., et al. (1986). Gaussian Markov distributions over finite graphs. The Annals of Statistics, 14(1):138–150.

Wackernagel, H. (2013). Multivariate Geostatistics: An Introduction with Applications. Springer Science & Business Media.

Wang, H. and West, M. (2009). Bayesian analysis of matrix normal graphical models. Biometrika, 96(4):821–834.

Xu, P.-F., Guo, J., and He, X. (2011). An improved iterative proportional scaling procedure for Gaussian graphical models. Journal of Computational and Graphical Statistics, 20(2):417–431.

Supplementary Materials for "Graphical Gaussian Process Models for Highly Multivariate Spatial Data"
Debangan Dey, Abhirup Datta, Sudipto Banerjee
S1 Proofs

Proof of Theorem 2.2.
For the original GP, for each omega, F(omega) = (f_ij(omega)) is a valid spectral density matrix. Therefore, following Cramer's Theorem (Cramer, 1940; Parra and Tobar, 2017), F(omega) is positive definite for (almost) every omega. Using Lemma 2.1 we derive a unique F~(omega) = (f~_ij(omega)), which is also positive definite and satisfies F~(omega)_ij = F(omega)_ij = f_ij(omega) for i = j or (i, j) in E_V, and (F~(omega)^{-1})_ij = 0 for (i, j) not in E_V. The square-integrability assumption on the f_ii(omega) is sufficient to ensure that int |f~_ij(omega)|^2 d omega < infinity, using the Cauchy-Schwarz inequality. Thus, we have a spectral density matrix F~(omega) which is positive definite for (almost) all omega, with f~_ii(omega) = f_ii(omega) > 0 for all i, omega, and int |f~_ij(omega)|^2 d omega < infinity for all i, j. By Cramer's theorem, there exists a GP w(.) with spectral density matrix F~(omega) and some cross-covariance function M. As, by construction, f~_ij(omega) = f_ij(omega) for i = j or (i, j) in E_V, we have M_ij = C_ij for i = j or (i, j) in E_V. Since (F~(omega)^{-1})_ij = 0 for (i, j) not in E_V and almost all omega, using the result of Dahlhaus (2000), w(s) has process-level conditional independence as specified by G_V, completing the proof.

Proof of Theorem 2.3.
For two arbitrary locations s_1, s_2 in D, we can calculate the covariance function from our construction as follows:

M_ij(s_1, s_2) = Cov( C_ii(s_1, L) C_ii(L, L)^{-1} w_i(L) + z_i(s_1), C_jj(s_2, L) C_jj(L, L)^{-1} w_j(L) + z_j(s_2) )
  = C_ii(s_1, L) C_ii(L, L)^{-1} Cov(w_i(L), w_j(L)) C_jj(L, L)^{-1} C_jj(L, s_2) + I(i = j) C_{ii|L}(s_1, s_2)
  = I(i = j) [ C_ii(s_1, L) C_ii(L, L)^{-1} C_ii(L, s_2) + C_{ii|L}(s_1, s_2) ]
    + I(i != j) C_ii(s_1, L) C_ii(L, L)^{-1} M_ij(L, L) C_jj(L, L)^{-1} C_jj(L, s_2)
  = I(i = j) C_ii(s_1, s_2) + I(i != j) C_ii(s_1, L) C_ii(L, L)^{-1} M_ij(L, L) C_jj(L, L)^{-1} C_jj(L, s_2)    (S1)

The second equality follows from the independence of z_i and z_j for i != j, the third equality uses M_ii(L, L) = C_ii(L, L), and the fourth uses the form of the conditional covariance function C_{ii|L} from (3). It is now immediate that w_i has the covariance function C_ii on the entire domain D, proving part (a).

If (i, j) in E_V and s_1, s_2 in L, then in (S1) we have M_ij(s_1, s_2) = C_ij(s_1, s_2) directly from the construction of M(L, L). This proves part (b).

To prove part (c), without loss of generality we consider only q = 3 processes w_1(s), w_2(s), w_3(s) constructed via stitching, with the assumption that (1, 3) is not in E_V. First, we will show that, for any two locations s_1, s_2 in D, w_1(s_1) is conditionally independent of w_3(s_2) given w_2(L), which we denote as w_1(s_1) _||_ w_3(s_2) | w_2(L).

As (1, 3) is not in E_V, the sets {1} x L = {(1, s) | s in L} and {3} x L are separated by {2} x L in the graph G_V x G_L. Hence, using the global Markov property of Gaussian graphical models, we have w_1(L) _||_ w_3(L) | w_2(L). For any s_1, s_2 in D we have, similar to (S1),

Cov(w_1(s_1), w_3(s_2) | w_2(L)) = C_11(s_1, L) C_11(L, L)^{-1} Cov(w_1(L), w_3(L) | w_2(L)) C_33(L, L)^{-1} C_33(L, s_2) = 0.

Hence w_1(s_1) _||_ w_3(s_2) | w_2(L) for any s_1, s_2 in D.
Now,

Cov(w_1(s_1), w_3(s_2) | sigma({w_2(s) | s in D}))
  = Cov(w_1(s_1), w_3(s_2) | sigma(w_2(L), {z_2(s) | s in D}))
  = Cov(w_1(s_1), w_3(s_2) | sigma(w_2(L))) = 0.    (S2)

The third equality follows from the fact that for any three random variables X, Y and Z such that X and Y are independent of Z, E(X | Y, Z) = E(X | Y). Equation (S2) establishes process-level conditional independence of w_1(.) and w_3(.) given w_2(.), thereby proving part (c).

Proof of Corollary 3.0.1. Recall from the construction of M(L, L) that the Gaussian random vector w(L) satisfies the graphical model G = G_V x G_L, where G_L is the complete graph on the n locations. The strong product graph G is decomposable, and K_m x G_L, m = 1, ..., p, form a perfect sequence for G with S_m x G_L, m = 2, ..., p, being the separators. Thus, using results (3.17) and (5.44) from Lauritzen (1996), we are able to factorize w(L) as in (6).

S2 Implementation
S2.1 Gibbs sampler for GGP model for the latent processes
Let y_i = (y_i(s_{i1}), y_i(s_{i2}), ..., y_i(s_{i n_i}))^T be the n_i x 1 vector of measurements for the i-th response or outcome over the set of n_i locations in D. Let X_i = (x_i(s_{i1}), x_i(s_{i2}), ..., x_i(s_{i n_i}))^T be the known n_i x p_i matrix of predictors on the set S_i = {s_{i1}, ..., s_{i n_i}}. We specify the spatial linear model as y_i = X_i beta_i + w_i + eps_i, where beta_i is the p_i x 1 vector of regression coefficients, eps_i is the n_i x 1 vector of normally distributed independent random errors with common marginal variance tau^2_i, and w_i is defined analogously to y_i for the latent spatial process corresponding to the i-th outcome. The distribution of each w_i is derived from the specification of w(s) as the q x 1 multivariate Matern GGP with respect to a decomposable G_V. Let {phi_ii, sigma_ii, tau^2_i | i = 1, ..., q} denote the marginal parameters for each component Matern process w_i(.).

We elucidate the sampler using a GGP constructed by stitching the simple multivariate Matern (Apanasovich et al., 2012), where nu_ij = (nu_ii + nu_jj)/2, Delta_A = 0 in (2) and phi_ij = (phi_ii + phi_jj)/2. Hence, the only additional cross-correlation parameters are {b_ij | (i, j) in E_V}. Any of the other multivariate Matern specifications in Apanasovich et al. (2012) that involve more parameters to specify the nu_ij's and phi_ij's can be implemented in a similar manner. We consider partial overlap between the variable-specific location sets and take L = union_i S_i as the reference set for stitching. If there is a total lack of overlap between the data locations for each variable, we can simply take L to be a set of locations sufficiently well distributed in the domain, and the Gibbs sampler can be designed analogously.

Conjugate priors are available for beta_i ~ind N(mu_i, tau^2_i V_i) and tau^2_i ~ind IG(a_i, b_i), where IG is the Inverse-Gamma distribution. There are no conjugate priors for the process parameters.
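The two conjugate updates can be sketched as follows (a minimal sketch on simulated data; it assumes the prior scaling beta_i ~ N(mu_i, tau^2_i V_i), which is what makes the tau^2_i-scaled posterior covariance below conjugate):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_beta(y, X, w, tau2, mu, V):
    """Draw beta | . ~ N((X'X + V^-1)^-1 (V^-1 mu + X'(y - w)),
                         tau2 (X'X + V^-1)^-1).  Returns (draw, mean)."""
    Vinv = np.linalg.inv(V)
    P = X.T @ X + Vinv
    mean = np.linalg.solve(P, Vinv @ mu + X.T @ (y - w))
    return rng.multivariate_normal(mean, tau2 * np.linalg.inv(P)), mean

def sample_tau2(y, X, w, beta, a, b):
    """Draw tau2 | . ~ IG(a + n/2, b + ||y - X beta - w||^2 / 2)."""
    r = y - X @ beta - w
    shape, rate = a + len(y) / 2, b + 0.5 * r @ r
    return 1.0 / rng.gamma(shape, 1.0 / rate)

n, p = 50, 2
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -0.5])
w = rng.standard_normal(n)
y = X @ beta_true + w + 0.3 * rng.standard_normal(n)

# Sanity check: with an essentially flat prior (V^-1 -> 0), the conditional
# mean is the least-squares fit of (y - w) on X.
_, mean = sample_beta(y, X, w, 0.09, np.zeros(p), 1e8 * np.eye(p))
ols = np.linalg.lstsq(X, y - w, rcond=None)[0]
print(np.allclose(mean, ols, atol=1e-4))  # True
```

The remaining (non-conjugate) process parameters are the ones requiring the Metropolis-within-Gibbs updates described below.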
For ease of notation, M_{a,b} denotes the submatrix of M indexed by the sets a and b, M_a = M_{a,a}, and M_{a|b} = M_a - M_{a,b} M_b^{-1} M_{b,a}. Similarly, w(a) denotes the vector stacking w_i(s) for all (i, s) in a. We denote cliques by K and separators by S in the perfect ordering of the graph G_V. The full-conditional distributions for the Gibbs updates of the parameters are as follows:

p(beta_i | .) ~ N( (X_i^T X_i + V_i^{-1})^{-1} (V_i^{-1} mu_i + X_i^T (y_i - w_i)), tau^2_i (X_i^T X_i + V_i^{-1})^{-1} );

p(tau^2_i | .) ~ IG( a_i + n_i/2, b_i + (y_i - X_i beta_i - w_i)^T (y_i - X_i beta_i - w_i)/2 );

p(sigma_ii, phi_ii, nu_ii | .) propto [ prod_{K containing i} |M_{K x L}|^{-1/2} exp( -w(K x L)^T M_{K x L}^{-1} w(K x L)/2 ) / prod_{S containing i} |M_{S x L}|^{-1/2} exp( -w(S x L)^T M_{S x L}^{-1} w(S x L)/2 ) ] x p(sigma_ii) p(phi_ii) p(nu_ii);

p(b_ij | .) propto [ prod_{K containing (i,j)} I(B_K > 0) |M_{K x L}|^{-1/2} exp( -w(K x L)^T M_{K x L}^{-1} w(K x L)/2 ) / prod_{S containing (i,j)} |M_{S x L}|^{-1/2} exp( -w(S x L)^T M_{S x L}^{-1} w(S x L)/2 ) ] x p(b_ij), for (i, j) in E_V.

To update the latent random effects w, let L = {s_1, ..., s_n} and o_i = (I(s_1 in S_i), ..., I(s_n in S_i))^T denote the vector of observation indicators for the i-th outcome. With x_i(L) = (x_i(s_1), ..., x_i(s_n))^T, and y_i(L) and w_i(L) defined similarly, we have

p(w_i(L) | .) ~ N( M_i^{-1} mu_i, M_i^{-1} ), where

M_i = (1/tau^2_i) diag(o_i) + sum_{K containing i} M^{-1}_{{i} x L | (K \ {i}) x L} - sum_{S containing i} M^{-1}_{{i} x L | (S \ {i}) x L},

mu_i = (y_i(L) - x_i(L) beta_i) o o_i / tau^2_i + sum_{K containing i} T_i(K) w((K \ {i}) x L) - sum_{S containing i} T_i(S) w((S \ {i}) x L),

T_i(A) = M^{-1}_{{i} x L | (A \ {i}) x L} M_{{i} x L, (A \ {i}) x L} M^{-1}_{(A \ {i}) x L}, for A in {K, S},

where "o" denotes the element-wise product. The Gibbs sampler evinces the multifaceted computational gains.
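The indicator I(B_K > 0) in the b_ij update only ever involves clique-level blocks of B. A sketch of this clique-wise positive-definiteness check (illustrative path-graph cliques; numpy only):

```python
import numpy as np

def is_pd(A):
    """Positive-definiteness test via an attempted Cholesky factorization."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

def cliquewise_pd(B, cliques):
    """Check B_K > 0 only on the cliques of a decomposable graph."""
    return all(is_pd(B[np.ix_(K, K)]) for K in cliques)

q = 100
cliques = [[i, i + 1] for i in range(q - 1)]    # path graph: q* = 2

B = np.eye(q)
for i in range(q - 1):                          # b_{i,i+1} = 0.4
    B[i, i + 1] = B[i + 1, i] = 0.4
print(cliquewise_pd(B, cliques))                # True

B[0, 1] = B[1, 0] = 1.5                         # an invalid 2 x 2 block
print(cliquewise_pd(B, cliques))                # False
```

For the path graph each check is a 2 x 2 Cholesky, so q - 1 constant-cost checks replace a single O(q^3) factorization of the full B.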
The constraints on the parameters B no longer require checking the positive-definiteness of B, which would require O(q^3) flops for each check. Instead, due to decomposability, it is enough to check for positive definiteness of the (at most q*-dimensional) sub-matrices B_K of B corresponding to the cliques of G_V. The largest matrix inversion across all these updates is of the order nq* x nq*, corresponding to the largest clique. The largest matrix that needs storing is also of dimension nq* x nq*. These result in an appreciable reduction of computation relative to any multivariate Matern model, which involves nq x nq matrices and positive-definiteness checks for q x q matrices at every iteration.

Finally, for generating predictive distributions, note that, as a part of the Gibbs sampler, we are simultaneously imputing w_i at the locations L \ S_i. Subsequently, we only need to sample y_i(L \ S_i) | . ~ N( X_i(L \ S_i) beta_i + w_i(L \ S_i), tau^2_i I ).

S2.2 Gibbs sampler for GGP model for the response processes
Let y(L) = (y_1(L)^T, ..., y_q(L)^T)^T, X(L) = bdiag(X_1(L), ..., X_q(L)), and beta = (beta_1^T, ..., beta_q^T)^T. We consider the joint likelihood

y(L) | X(L), beta, {phi_ii, sigma_ii, tau^2_i}_{i = 1, ..., q}, {b_ij}_{(i,j) in E_V} ~ N( X(L) beta, M*_{V x L} )    (S3)

and impute the missing data y_i(L \ S_i) in the sampler. Let T_i = {i} x (L \ S_i), U_i(A) = (A x L) \ T_i for A in {K, S}, and let beta(A) be the vector stacking up beta_j for j in A. Also, for any U subset of V x L, let X~(U) = bdiag( {X_j(U intersect ({j} x L)) | j such that U intersect ({j} x L) is non-empty} ). We have the following updates:

y_i(T_i) | . ~ N( X_i(T_i) beta_i + H_i^{-1} [ sum_{K containing i} M*^{-1}_{T_i | U_i(K)} M*_{T_i, U_i(K)} M*^{-1}_{U_i(K)} ( y(U_i(K)) - X~(U_i(K)) beta(K) ) - sum_{S containing i} M*^{-1}_{T_i | U_i(S)} M*_{T_i, U_i(S)} M*^{-1}_{U_i(S)} ( y(U_i(S)) - X~(U_i(S)) beta(S) ) ], H_i^{-1} ),

where H_i = sum_{K containing i} M*^{-1}_{T_i | U_i(K)} - sum_{S containing i} M*^{-1}_{T_i | U_i(S)}.

Once again the updates require inversion or storage of matrices of size at most nq* x nq*. The updates for the other parameters are similar to those in the sampler of Section S2.1 of the Supplement, with the cross-covariance M* replacing M. The only exception is tau^2_i, which no longer has a conjugate full conditional and is now updated using Metropolis random-walk steps within the Gibbs sampler, akin to the other spatial parameters.

S2.3 Co-ordinate descent
To conduct estimation and prediction using a GGP in a frequentist setting, we outline a co-ordinate descent algorithm for maximum likelihood estimation. We illustrate the implementation for the case where each of the q variables is measured at L. The case of spatial misalignment can be handled by an EM algorithm to impute the missing responses for each variable. For the frequentist setup, we use the GGP model for the response. From Corollary 3.0.1, the joint likelihood can be factored into sub-likelihoods corresponding to specific cliques and separators. Let theta^{(t)} denote the values of the spatial parameters theta at the t-th iteration, and M*_L = M*_L(theta) the GGP covariance matrix of y(V x L) from stitching. Let theta_ii = {sigma_ii, phi_ii, nu_ii}, theta_{-i} = theta \ theta_ii, theta_{-ij} = theta \ {b_ij}, and X~(L) := X~(V x L). We immediately have the following updates of the parameters:

beta^{(t+1)} = ( X~(L)^T M*^{-1}_L(theta^{(t)}) X~(L) )^{-1} X~(L)^T M*^{-1}_L(theta^{(t)}) y(L),

theta^{(t+1)}_ii = argmin_{theta_ii} [ sum_{K containing i} l_K(theta_ii) - sum_{S containing i} l_S(theta_ii) ], where for any A subset of V,

l_A(theta_ii) = log |M*_{A x L}(theta_ii, theta^{(t)}_{-i})| + ( y(A x L) - X~(A x L) beta(A) )^T M*^{-1}_{A x L}(theta_ii, theta^{(t)}_{-i}) ( y(A x L) - X~(A x L) beta(A) ),

b^{(t+1)}_ij = argmin_{b_ij} [ sum_{K containing (i,j)} ( l~_K(b_ij) - log I(B_K > 0) ) - sum_{S containing (i,j)} l~_S(b_ij) ], for (i, j) in E_V, where

l~_A(b_ij) = log |M*_{A x L}(b_ij, theta^{(t)}_{-ij})| + ( y(A x L) - X~(A x L) beta(A) )^T M*^{-1}_{A x L}(b_ij, theta^{(t)}_{-ij}) ( y(A x L) - X~(A x L) beta(A) ).

The update of beta involves the large nq x nq matrix M*^{-1}_L. However, from (7), M*^{-1}_L can be expressed as a sum of sparse matrices, each requiring at most O(n^2 q*^2) storage and computation, arising from inverting matrices of the form C_{K x L} + D_{K x L}.
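The beta update above is a generalized least squares step. A minimal numeric sketch (illustrative data; M stands in for M*_L):

```python
import numpy as np

def gls_step(X, M, y):
    """beta = (X' M^-1 X)^-1 X' M^-1 y, computed via Cholesky solves
    rather than an explicit inverse of M."""
    L = np.linalg.cholesky(M)
    Xw = np.linalg.solve(L, X)   # L^-1 X
    yw = np.linalg.solve(L, y)   # L^-1 y
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

rng = np.random.default_rng(2)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(n)

# With M = I the GLS step reduces to ordinary least squares:
print(np.allclose(gls_step(X, np.eye(n), y),
                  np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```

In the GGP setting the solves against M*_L are carried out block-wise via the clique-separator decomposition, which is what keeps each iteration scalable.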
For updates of the spatial parameters theta_ii and b_ij, co-ordinate descent moves along the respective parameter, optimizing the negative log-likelihood expressed in terms of the corresponding negative log-likelihoods of the cliques and separators containing that parameter. This process is iterated until convergence. Each iteration of the co-ordinate descent has the same parameter dimension, the same computation and storage costs, and the same parameter-constraint checks as each iteration of the Gibbs sampler, and hence is comparably scalable.

S3 Additional figures

(a) DAG for a univariate AR(2) model. (b) Moralized G_T for a univariate AR(2) model. (c) DAG for the graphical VAR model example of Section 4.1. (d) Moralized G_{V x T} for the graphical VAR model of Figure (c).
Figure S1:
Graphical models for autoregressive spatial time-series.

Figure S2:
Estimates of the marginal parameters sigma_ii phi_ii^{2 nu_ii}, i in V, for the 4 simulation settings. The horizontal pink lines in panels (a) and (b) indicate the true parameter values.