Modeling massive highly-multivariate nonstationary spatial data with the basis graphical lasso
Mitchell Krock, William Kleiber, Dorit Hammerling, Stephen Becker
January 8, 2021
Abstract
We propose a new modeling framework for highly multivariate spatial processes that synthesizes ideas from recent multiscale and spectral approaches with graphical models. The basis graphical lasso writes a univariate Gaussian process as a linear combination of basis functions weighted with entries of a Gaussian graphical vector whose graph is estimated from optimizing an $\ell_1$-penalized likelihood. This paper extends the setting to a multivariate Gaussian process where the basis functions are weighted with Gaussian graphical vectors. We motivate a model where the basis functions represent different levels of resolution and the graphical vectors for each level are assumed to be independent. Using an orthogonal basis grants linear complexity and memory usage in the number of spatial locations, the number of basis functions, and the number of realizations. An additional fusion penalty encourages a parsimonious conditional independence structure in the multilevel graphical model. We illustrate our method on a large climate ensemble from the National Center for Atmospheric Research's Community Atmosphere Model that involves 40 spatial processes.

Keywords: Multivariate Gaussian process, nonstationary, graphical lasso, spatial basis function, empirical orthogonal function, large ensemble, climate model
Department of Statistics, Rutgers University. Corresponding author e-mail: [email protected]
Department of Applied Mathematics, University of Colorado Boulder
Department of Applied Mathematics and Statistics, Colorado School of Mines

The past twenty years have seen a surge of interest in developing models for multivariate spatial processes. The major obstacle lies in defining valid cross-covariance functions that can characterize complex interactions between multiple processes; the primary difficulty is that the cross-covariance and marginal covariance functions must work together to provide a nonnegative definite matrix function. Most research has focused on exploring new models or new approaches for defining cross-covariances that are valid for a handful of processes. Many applied problems, such as those in statistical climatology, involve datasets with dozens to hundreds of variables, and most existing approaches fail as strategies to model and understand relevant dependencies between variables.

The main extant approaches to generating valid multivariate spatial models are reviewed by Genton and Kleiber (2015) and include covariance convolution and kernel convolution (Gaspari and Cohn, 1999; Majumdar and Gelfand, 2007; Majumdar et al., 2010; Ver Hoef and Barry, 1998), the linear model of coregionalization (LMC) (Goulard and Voltz, 1992; Schmidt and Gelfand, 2003; Wackernagel, 2003; Gelfand et al., 2004), latent dimensions (Apanasovich and Genton, 2010), dynamical models (Shaddick and Wakefield, 2002; Calder, 2007; Ippoliti et al., 2012), conditional Bayesian hierarchical structures (Le and Zidek, 2006; Pollice and Jona Lasinio, 2010), and direct specification as in the multivariate Matérn (Gneiting et al., 2010; Apanasovich et al., 2012) and nonstationary extensions (Kleiber and Nychka, 2012; Kleiber and Porcu, 2014).
All of these models are designed to handle a few variables, typically fewer than five. Some groups have recently turned attention to the highly-multivariate problem, which often also involves massive numbers of spatial locations. Even with valid covariance and cross-covariance functions specified, likelihood computations for $p$-variate Gaussian processes at $n$ spatial locations scale as $O(p^3 n^3)$, which is unacceptable for applications involving high-dimensional processes. One of the first works to tackle highly multivariate processes, Furrer and Genton (2011), introduces "aggregation cokriging," but this work only addresses prediction and not the modeling nor understanding of highly multivariate spatial data. Several adaptations of the LMC (Guinness, 2018; Taylor-Rodriguez et al., 2019; Bruinsma et al., 2020) are capable of handling large multivariate data, but each still requires constructing a nonstationary covariance model for latent processes. Kleiber et al. (2019) introduce an approach that generalizes LatticeKrig (Nychka et al., 2015) to the multivariate case and can handle massive spatial data, relying on compactly supported basis functions and spatial autoregressive Gaussian Markov random field models for stochastic coefficients. Dey et al. (2020) recently introduced an idea utilizing Gaussian graphical models that can approximate the multivariate Matérn, but, like the LMC, it does not directly address the issue of nonstationarity in the covariance and cross-covariance functions. Our approach also relies on Gaussian graphical models but avoids directly modeling the covariance and cross-covariance functions by using a basis decomposition. In addition to nonstationarity granted by judicious choice of basis functions and stochastic coefficient structure, employing orthogonal basis functions allows for rapid computation with non-gridded high-dimensional multivariate spatial data.

To introduce the basic ideas, let us fix some notation.
The $p$-variate observational Gaussian process model under consideration is
$$\begin{pmatrix} Y_1(\mathbf{s}) \\ \vdots \\ Y_p(\mathbf{s}) \end{pmatrix} = \begin{pmatrix} \mu_1(\mathbf{s}) \\ \vdots \\ \mu_p(\mathbf{s}) \end{pmatrix} + \begin{pmatrix} Z_1(\mathbf{s}) \\ \vdots \\ Z_p(\mathbf{s}) \end{pmatrix} + \begin{pmatrix} \varepsilon_1(\mathbf{s}) \\ \vdots \\ \varepsilon_p(\mathbf{s}) \end{pmatrix} \qquad (1)$$
or in vector form, $\mathbf{Y}(\mathbf{s}) = \boldsymbol{\mu}(\mathbf{s}) + \mathbf{Z}(\mathbf{s}) + \boldsymbol{\varepsilon}(\mathbf{s})$. Here, $\mathbf{Y}(\mathbf{s}) = (Y_1(\mathbf{s}), \ldots, Y_p(\mathbf{s}))^T$ is the observed process at location $\mathbf{s} \in \mathbb{R}^d$ with mean $\boldsymbol{\mu}(\mathbf{s}) = (\mu_1(\mathbf{s}), \ldots, \mu_p(\mathbf{s}))^T$ and spatially correlated stochastic variation $\mathbf{Z}(\mathbf{s}) = (Z_1(\mathbf{s}), \ldots, Z_p(\mathbf{s}))^T$, which we assume to be a multivariate Gaussian process. The observations are subject to noise $\boldsymbol{\varepsilon}(\mathbf{s}) = (\varepsilon_1(\mathbf{s}), \ldots, \varepsilon_p(\mathbf{s}))^T$, a mean zero multivariate white noise process with covariance matrix $\mathrm{Cov}(\boldsymbol{\varepsilon}(\mathbf{s}), \boldsymbol{\varepsilon}(\mathbf{s})) = \mathrm{diag}(\tau_1^2, \ldots, \tau_p^2)$.

A key difficulty in working with multivariate processes is in specifying the matrix-valued covariance, $C(\mathbf{s}_1, \mathbf{s}_2) = (C_{ij}(\mathbf{s}_1, \mathbf{s}_2))_{i,j=1}^p$, where $C_{ij}(\mathbf{s}_1, \mathbf{s}_2) = \mathrm{Cov}(Z_i(\mathbf{s}_1), Z_j(\mathbf{s}_2))$ are the direct and cross-covariance functions. This matrix function must be carefully constrained in order to be a nonnegative definite matrix function. In particular, for arbitrary locations $\{\mathbf{s}_i\}_{i=1}^n$, the block matrix $\Sigma$ with $(i,j)$th block $C(\mathbf{s}_i, \mathbf{s}_j)$ must be nonnegative definite. Genton and Kleiber (2015) give an overview of cross-covariance functions for multivariate geostatistics which is still relatively up-to-date.

We introduce a model for highly multivariate and nonstationary spatial data that also accommodates estimation and simulation strategies for large networks of observation locations. The essential ideas rely on representing the vector-valued process in a basis expansion with sparsity-inducing Gaussian graphical modeling of the stochastic coefficients. We propose a penalized likelihood framework for estimation and associated optimization algorithms.
The method is illustrated on a challenging climate data science problem involving $p = 40$ spatial processes from an atmospheric model at thousands of locations that exhibit strong nonstationarities and cross-process dependencies. Our model provides straightforward interpretations of cross-process dependences, which for the climate data example identify scientifically meaningful and justifiable relationships.

Our approach relies on a basis expansion of the spatially-correlated components of $\mathbf{Z}(\mathbf{s})$:
$$Z_i(\mathbf{s}) = \sum_{\ell=1}^{L} W_{i\ell}\, \phi_\ell(\mathbf{s}) \qquad (i = 1, \ldots, p) \qquad (2)$$
for some classes of basis functions $\{\phi_\ell\}$ and stochastic coefficients $\{W_{i\ell}\}$. For fixed $i$, this representation subsumes many popular approaches in the spatial statistical literature that have been primarily explored in the univariate setting, including discretized spectral methods, low rank approaches, and empirical orthogonal functions among others, despite potential limitations depending upon the choice of basis function (Stein, 2014). Few extensions to the multivariate case (2) have been made, with Kleiber et al. (2019) being a notable case. In Section 2.1, we show how the univariate basis graphical lasso directly extends to the multivariate setting (2), and in Section 2.2, we discuss an alternative which is more appropriate for massive multivariate spatial data.

Our modeling and optimization strategy follows from extensions to our prior work, which we discuss and connect to the multivariate problem in this section. In Krock et al. (2020), we introduced the basis graphical lasso (BGL) to model (2) for the univariate case $p = 1$. The goal of the BGL is to obtain a sparse nonparametric estimate for the precision matrix $\mathbf{Q}$ of the mean zero Gaussian graphical vector $\mathbf{W} = (W_1, \ldots, W_L)^T$ when $Z(\mathbf{s}) = \sum_{\ell=1}^{L} W_\ell \phi_\ell(\mathbf{s})$.
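As a concrete illustration, the expansion (2) is simply a matrix product between stochastic coefficients and evaluated basis functions. The sketch below is our own illustration, not the paper's code; the cosine basis and the dimensions are arbitrary stand-ins (the paper later uses empirical orthogonal functions).

```python
import numpy as np

# Minimal sketch of the basis expansion (2): Z_i(s) = sum_l W_il * phi_l(s).
# The cosine basis and the dimensions below are illustrative assumptions only.
rng = np.random.default_rng(0)
n, p, L = 200, 3, 10          # locations, variables, basis functions
s = np.linspace(0.0, 1.0, n)  # one-dimensional spatial locations

# n x L basis matrix with (k, l) entry phi_l(s_k)
Phi = np.cos(np.outer(s, np.pi * np.arange(1, L + 1)))

W = rng.standard_normal((p, L))   # stochastic coefficients W_il
Z = W @ Phi.T                     # row i holds the process Z_i at all locations
print(Z.shape)                    # (3, 200)
```

Stacking the coefficients as rows makes each component process a single matrix-vector product, which is the structure exploited throughout the paper.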
In other words, we build a Gaussian process by fitting a Gaussian graphical model to the random coefficients of fixed basis functions. Recall that the sparsity pattern of the precision matrix encodes conditional independencies between random variables, with $Q_{ij} = 0$ if and only if $W_i$ and $W_j$ are conditionally independent given all other entries of $\mathbf{W}$. This information is commonly visualized with an undirected graph known as a Gaussian graphical model, where vertices symbolize variables and the lack of an edge between two vertices indicates such a conditional independence (Rue and Held, 2005). Inspired by the graphical lasso (Friedman et al., 2008), we use an $\ell_1$-penalized likelihood to estimate the graphical model, but in the standard graphical lasso setting $\mathbf{W}$ is observed directly, whereas our model includes basis functions and noise. Our BGL method is viable for a very large sample size ($n$), and multiple realizations ($m$) are preferred but not required. Note that this formulation is opposite to the LMC, which is not typically used in a univariate setting but estimates the deterministic weights of individual Gaussian processes. We claim that the BGL generalizes to the multivariate setting and also that the optimization routine enjoys a similar computational framework.

With $\mathbf{W}_\ell = (W_{1\ell}, \ldots, W_{p\ell})^T \sim N(\mathbf{0}, \mathbf{Q}_\ell^{-1})$, the multivariate basis model (2) alternatively can be written
$$\mathbf{Z}(\mathbf{s}) = \sum_{\ell=1}^{L} \phi_\ell(\mathbf{s})\, \mathbf{W}_\ell \qquad (3)$$
so that each basis function is weighted with a $p$-variate random vector. Assuming the weight vectors are independent means that this model amounts to characterizing the inverse covariance matrices $\{\mathbf{Q}_1, \ldots, \mathbf{Q}_L\}$. In contrast, the LMC considers (3) where $\{\phi_1(\mathbf{s}), \ldots, \phi_L(\mathbf{s})\}$ are independent Gaussian processes with deterministic weights.
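Under independence of the weight vectors, model (3) induces the matrix-valued covariance $C(\mathbf{s}, \mathbf{s}') = \sum_\ell \phi_\ell(\mathbf{s})\phi_\ell(\mathbf{s}')\, \mathbf{Q}_\ell^{-1}$. The following numerical check is ours (arbitrary small dimensions, random SPD level covariances); it confirms the formula against the stacked Kronecker representation of the model at two locations.

```python
import numpy as np

# Sanity check (ours) that (3) with independent W_l ~ N(0, Q_l^{-1}) induces
# Cov(Z(s), Z(s')) = sum_l phi_l(s) phi_l(s') Q_l^{-1}.
rng = np.random.default_rng(1)
p, L = 2, 4
Phi2 = rng.standard_normal((2, L))   # rows: (phi_l(s))_l and (phi_l(s'))_l

# a random SPD covariance Q_l^{-1} for each level, stacked block-diagonally
Sigmas = []
for _ in range(L):
    A = rng.standard_normal((p, p))
    Sigmas.append(A @ A.T + p * np.eye(p))
blk = np.zeros((p * L, p * L))
for l, Sig in enumerate(Sigmas):
    blk[l * p:(l + 1) * p, l * p:(l + 1) * p] = Sig

# variance of the stacked vector (Z(s), Z(s')) under the model
K = np.kron(Phi2, np.eye(p))
V = K @ blk @ K.T

# analytic cross-covariance block from the formula above
C_formula = sum(Phi2[0, l] * Phi2[1, l] * Sigmas[l] for l in range(L))
print(np.allclose(V[:p, p:], C_formula))  # True
```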
We note that recent variants of the LMC (Taylor-Rodriguez et al., 2019; Bruinsma et al., 2020) can handle a large number of variables, and in fact orthogonal basis functions are also exploited for computational gains in the latter work, but such techniques ultimately amount to efficiently modeling spatially-dense univariate Gaussian processes (e.g., Datta et al. (2016); Titsias (2009)). The semiparametric latent factor model (Teh et al., 2005) shares a setup similar to the LMC where a multi-output Gaussian process is represented as a linear mixture of independent univariate Gaussian processes. With these models, weight vectors of the independent univariate Gaussian processes and (stationary) covariance parameters must be estimated. On the other hand, our approach associates each basis function with a $p$-variate Gaussian graphical model, which is much faster from a computational point of view and automatically produces nonstationary covariance and cross-covariance functions with straightforward interpretations.

Let us assume $\boldsymbol{\mu} \equiv \mathbf{0}$. At locations $\mathbf{s}_1, \ldots, \mathbf{s}_n$, form the observation vector $\mathbf{Y} = (\mathbf{Y}(\mathbf{s}_1)^T, \ldots, \mathbf{Y}(\mathbf{s}_n)^T)^T$. Suppose we have multiple independent realizations $\mathbf{Y}_1, \ldots, \mathbf{Y}_m$ of $\mathbf{Y}$. Up to multiplicative and additive constants not depending on the $np \times np$ variance-covariance matrix $\boldsymbol{\Sigma} = \mathrm{Var}(\mathbf{Y})$, the negative log-likelihood is
$$\log \det \boldsymbol{\Sigma} + \frac{1}{m} \sum_{i=1}^{m} \mathbf{Y}_i^T \boldsymbol{\Sigma}^{-1} \mathbf{Y}_i. \qquad (4)$$
Define the sample covariance $\mathbf{S} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{Y}_i \mathbf{Y}_i^T$. Using the cyclic property of trace, we rewrite the negative log-likelihood as
$$\log \det \boldsymbol{\Sigma} + \mathrm{tr}(\mathbf{S} \boldsymbol{\Sigma}^{-1}) \qquad (5)$$
to align with more prevalent notation in graphical lasso literature. Note that $\mathbf{Y}$ is simply a linear combination of the random coefficient vector $\mathbf{W} = (\mathbf{W}_1^T, \ldots, \mathbf{W}_L^T)^T$ and basis functions, plus noise. That is, $\mathbf{Y} = \boldsymbol{\Phi} \mathbf{W} + \boldsymbol{\varepsilon}$ for a basis matrix $\boldsymbol{\Phi}$, so $\boldsymbol{\Sigma} = \boldsymbol{\Phi} \mathbf{Q}^{-1} \boldsymbol{\Phi}^T + \mathbf{D}$.
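The passage from (4) to (5) is the cyclic-trace identity $\mathbf{Y}_i^T \boldsymbol{\Sigma}^{-1} \mathbf{Y}_i = \mathrm{tr}(\mathbf{Y}_i \mathbf{Y}_i^T \boldsymbol{\Sigma}^{-1})$. A quick numerical confirmation (ours, with arbitrary small dimensions):

```python
import numpy as np

# Check the rewrite from (4) to (5):
# (1/m) sum_i Y_i^T Sigma^{-1} Y_i = tr(S Sigma^{-1}), S = (1/m) sum_i Y_i Y_i^T.
rng = np.random.default_rng(2)
d, m = 5, 7
Y = rng.standard_normal((d, m))       # columns Y_1, ..., Y_m
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)       # an SPD stand-in for the covariance
Sigma_inv = np.linalg.inv(Sigma)

quad = np.mean([Y[:, i] @ Sigma_inv @ Y[:, i] for i in range(m)])
S = (Y @ Y.T) / m
print(np.isclose(quad, np.trace(S @ Sigma_inv)))  # True
```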
These matrices are defined explicitly in Section 3, but it is important to realize here that the matrix algebra produces the same optimization problem as in the univariate case. The original BGL solves the $\ell_1$-penalized maximum likelihood equation
$$\hat{\mathbf{Q}} \in \arg\min_{\mathbf{Q} \succeq 0} \; \log \det(\boldsymbol{\Phi} \mathbf{Q}^{-1} \boldsymbol{\Phi}^T + \mathbf{D}) + \mathrm{tr}\!\left(\mathbf{S} (\boldsymbol{\Phi} \mathbf{Q}^{-1} \boldsymbol{\Phi}^T + \mathbf{D})^{-1}\right) + \|\boldsymbol{\Lambda} \circ \mathbf{Q}\|_1. \qquad (6)$$
We use the notation $\mathbf{Q} \succeq 0$ to mean that $\mathbf{Q}$ is positive semidefinite. Here, $\|\boldsymbol{\Lambda} \circ \mathbf{Q}\|_1 = \sum_{i,j} \Lambda_{ij} |Q_{ij}|$, where the $\Lambda_{ij}$ are nonnegative penalty parameters that encourage sparsity in the estimate. Evaluating (6) requires an expensive $O(p^3 n^3)$ Cholesky decomposition. After applying the matrix determinant lemma, the Sherman-Morrison-Woodbury formula, and the cyclic property of trace, we can equivalently minimize
$$\log \det\!\left(\mathbf{Q} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi}\right) - \log \det \mathbf{Q} - \mathrm{tr}\!\left(\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{S} \mathbf{D}^{-1} \boldsymbol{\Phi} \, (\mathbf{Q} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi})^{-1}\right) + \|\boldsymbol{\Lambda} \circ \mathbf{Q}\|_1. \qquad (7)$$
Once the matrices $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi}$ and $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{S} \mathbf{D}^{-1} \boldsymbol{\Phi}$ are computed, evaluating (7) only requires Cholesky decompositions in the dimension $pL$, so we can reduce likelihood evaluations to $O(p^3 L^3)$. However, (7) is nonsmooth and nonconvex with respect to $\mathbf{Q}$, so the minimization problem is nontrivial.

Studying the convexity/concavity structure of (7) suggests a difference-of-convex (DC) algorithm where the next guess for $\mathbf{Q}$ is obtained by solving a convex optimization problem with the concave part linearized at the previous guess. Such an algorithm reads
$$\mathbf{Q}^{(j+1)} = \arg\min_{\mathbf{Q} \succeq 0} \left( -\log \det \mathbf{Q} + \mathrm{tr}\!\left(\boldsymbol{\Psi}^{(j)} \mathbf{Q}\right) + \|\boldsymbol{\Lambda} \circ \mathbf{Q}\|_1 \right) \qquad (8)$$
where the linearization matrix
$$\boldsymbol{\Psi}^{(j)} = (\mathbf{Q}^{(j)} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi})^{-1} + (\mathbf{Q}^{(j)} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{S} \mathbf{D}^{-1} \boldsymbol{\Phi} \, (\mathbf{Q}^{(j)} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi})^{-1} \qquad (9)$$
is a function of the previous guess $\mathbf{Q}^{(j)}$ and the aforementioned precomputed matrices. Since a DC algorithm such as (8) is a majorization-minimization algorithm, we are guaranteed that the guesses for $\mathbf{Q}$ create a nonincreasing sequence in the objective function (7).
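The equivalence of (6) and (7) can be checked numerically: by the matrix determinant lemma and Sherman-Morrison-Woodbury, the two objectives (without the penalty, which is shared) differ exactly by the $\mathbf{Q}$-independent constant $\log \det \mathbf{D} + \mathrm{tr}(\mathbf{S}\mathbf{D}^{-1})$. The sketch below is our own check with arbitrary small dimensions.

```python
import numpy as np

# Verify that the likelihood parts of (6) and (7) differ only by the
# Q-independent constant  log det D + tr(S D^{-1}).
rng = np.random.default_rng(3)
n, L = 8, 3
Phi = rng.standard_normal((n, L))
D = np.diag(rng.uniform(0.5, 2.0, n))
A = rng.standard_normal((L, L)); Q = A @ A.T + L * np.eye(L)
B = rng.standard_normal((n, n)); S = B @ B.T / n

Sigma = Phi @ np.linalg.inv(Q) @ Phi.T + D      # covariance in (6)
Dinv = np.linalg.inv(D)
M = Q + Phi.T @ Dinv @ Phi                      # small matrix in (7)

f6 = np.linalg.slogdet(Sigma)[1] + np.trace(S @ np.linalg.inv(Sigma))
f7 = (np.linalg.slogdet(M)[1] - np.linalg.slogdet(Q)[1]
      - np.trace(Phi.T @ Dinv @ S @ Dinv @ Phi @ np.linalg.inv(M)))
const = np.linalg.slogdet(D)[1] + np.trace(S @ Dinv)
print(np.isclose(f6, f7 + const))  # True
```

The constant drops out of the minimization, which is why (7) can replace (6) while only ever factoring matrices of dimension $pL$.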
Moreover, (8) is a well-studied problem known as the graphical lasso. Typically, the graphical lasso uses the sample covariance matrix of directly-observed, nonnoisy variables to produce a sparse inverse covariance matrix and accordingly a graphical model for the variables. Here we iteratively call the graphical lasso algorithm to estimate a graph for latent basis weights (with additive noise in the observational model), and the linearization matrix (9) acts as the sample covariance in the algorithm. We solve the graphical lasso with the second order method QUIC (Hsieh et al., 2014b).
Although the previous section shows that the BGL can be readily extended to the multivariate setting, the generalization is not well-motivated by a connection to standard multivariate spatial models, and moreover it will require burdensome matrix calculations in the dimension $pL$. In particular, the BGL must compute the linearization matrix (9) and substitute it into the graphical lasso at each step of the DC algorithm. With our climate data example we use $L = 2000$ basis functions and $p = 40$ variables; an 80,000 dimensional precision matrix is too large for this procedure. Advances in graphical modeling (Fattahi et al., 2019) may allow for estimation of graphs of this magnitude, but storing the dense linearization matrix poses an issue to further scalability. We conclude the paper with more commentary about this direct generalization (see Section 5), but here we propose a similar DC algorithm which is more feasible in a highly multivariate setting.

Our basic model still follows a penalized likelihood-based framework, minimizing
$$\hat{\mathbf{Q}} \in \arg\min_{\mathbf{Q} \succeq 0} \; \log \det(\boldsymbol{\Phi} \mathbf{Q}^{-1} \boldsymbol{\Phi}^T + \mathbf{D}) + \mathrm{tr}\!\left(\mathbf{S} (\boldsymbol{\Phi} \mathbf{Q}^{-1} \boldsymbol{\Phi}^T + \mathbf{D})^{-1}\right) + P(\mathbf{Q}) \qquad (10)$$
for some penalty $P$. However, an $\ell_1$ graphical lasso-type penalty by itself does not impose any regularity on the structure of coefficient graphs. To motivate our proposal, we recall some recent insights into multivariate modeling that will suggest an appropriate form for $P$. The multivariate spectral representation theorem states
$$\mathbf{Z}(\mathbf{s}) = \int \exp(i \boldsymbol{\omega}^T \mathbf{s}) \, \mathbf{W}(\mathrm{d}\boldsymbol{\omega}) \qquad (11)$$
for a mean zero stationary process $\mathbf{Z}(\mathbf{s})$, where $\mathbf{W}(\cdot)$ is a complex-valued mean zero random measure vector (Stein, 1999). Taking a discretization of the integral, we approximate
$$\int \exp(i \boldsymbol{\omega}^T \mathbf{s}) \, \mathbf{W}(\mathrm{d}\boldsymbol{\omega}) \approx \sum_{\ell} \cos(\boldsymbol{\omega}_\ell^T \mathbf{s} + \theta_\ell) \, \mathbf{W}_\ell \qquad (12)$$
to motivate writing (3) as a linear combination of independent random vectors $\mathbf{W}_\ell$.
The representation theorem is intimately linked to the spectral density matrix of $\mathbf{Z}(\mathbf{s})$, where we identify $\mathrm{Var}(\mathbf{W}_\ell)$ with $f(\boldsymbol{\omega}_\ell)$. One connection to our model is that $\mathbf{Q}_\ell^{-1}$ can be viewed as the spectral density matrix at frequency $\boldsymbol{\omega}_\ell$, but our approach estimates the inverse spectral density matrix in a Gaussian graphical framework, and moreover we will relax the tie to harmonic basis functions. Indeed, Kleiber (2017) provides interpretation and exploration of spectral coherence that will additionally motivate our penalized likelihood implementation. Before moving on, it is important to note that although we use the spectral representation theorem (11) to motivate the ensuing approach, our method is general and extends beyond harmonic basis expansions but with similar coherence-like interpretations of coefficient dependence. Guinness (2018) proposes a multivariate space-time model with flexible coherence structures which uses LMC in the spectral domain, but the covariance structure is stationary and the method is only viable for gridded data.

The multivariate basis graphical lasso model can be motivated with the same penalized likelihood context as in Section 2.1. Recall the model setup: we work under the additive model (1) with $\mathbf{Z}(\mathbf{s})$ specified as in the basis representation (3), and $\mathbf{W}_i$ and $\mathbf{W}_j$ are independent Gaussian graphical vectors for $i \neq j$. In particular, we assume that $\mathbf{Q}_\ell = \mathrm{Var}(\mathbf{W}_\ell)^{-1}$ is a sparse matrix defining a graphical structure at level $\ell$. If we consider each $\mathbf{Q}_\ell$ to correspond to an arbitrary sparse graphical model, then we propose $P$ as a graphical lasso regularization for each level:
$$P(\mathbf{Q}_1, \ldots, \mathbf{Q}_L) = \lambda \sum_{\ell=1}^{L} \sum_{i \neq j} |(\mathbf{Q}_\ell)_{ij}|. \qquad (13)$$
This penalty enforces sparsity for each precision matrix but not necessarily any similarity between levels of resolution.
Recent development in spectral coherence (Kleiber, 2017) suggests that we should expect the coherence of processes arising in practice to vary smoothly across levels. In addition to the $\ell_1$ sparsity penalty, we include an $\ell_1$ sequentially-fused penalty to encourage similarity of the conditional independence structure across adjacent levels of resolution:
$$P(\mathbf{Q}_1, \ldots, \mathbf{Q}_L) = \lambda \sum_{\ell=1}^{L} \sum_{i \neq j} |(\mathbf{Q}_\ell)_{ij}| + \rho \sum_{\ell=1}^{L-1} \sum_{i \neq j} |(\mathbf{Q}_\ell)_{ij} - (\mathbf{Q}_{\ell+1})_{ij}|. \qquad (14)$$
The new fusion penalty with tuning parameter $\rho$ penalizes precision matrices at adjacent levels if their off-diagonals do not have the same value. This formulation suggests a smoothly-varying graph structure and produces a parsimonious conditional independence structure of the random weights over all levels of resolution.

Assuming that $\mathbf{Q} = \mathrm{diag}(\mathbf{Q}_1, \ldots, \mathbf{Q}_L)$ does not change any reasoning leading to the BGL formulation from Section 2.1 but allows us to reduce computations on matrices of size $pL \times pL$ to computations on $L$ matrices of size $p \times p$. At each step of the DC algorithm (8), we simplify the minimization problem to
$$\arg\min_{\mathbf{Q}_\ell \succeq 0, \; \ell = 1, \ldots, L} \left( \sum_{\ell=1}^{L} -\log \det \mathbf{Q}_\ell + \mathrm{tr}(\boldsymbol{\Psi}_\ell \mathbf{Q}_\ell) \right) + P(\mathbf{Q}_1, \ldots, \mathbf{Q}_L) \qquad (15)$$
where $\boldsymbol{\Psi}_\ell$ is the $\ell$th block diagonal of the linearization matrix $\boldsymbol{\Psi}$, which depends upon the previous graph guesses. An efficient way to calculate $\boldsymbol{\Psi}_1, \ldots, \boldsymbol{\Psi}_L$ for orthogonal bases is presented in Section 3.

Using only the sparsity penalty (13) means (15) separates into $L$ independent graphical lasso problems with "sample covariance" matrices $\boldsymbol{\Psi}_1, \ldots, \boldsymbol{\Psi}_L$. Using the fusion penalty (14) gives no such separation of (15) into $L$ independent optimization problems at each DC iteration. Instead, we treat $\boldsymbol{\Psi}_1, \ldots$
$, \boldsymbol{\Psi}_L$ as an array of sample covariance matrices and substitute them into the fused multiple graphical lasso (FMGL) (Yang et al., 2015) for modeling multiple similar graphical models across multiple datasets. A related idea is the joint graphical lasso (Danaher et al., 2014), but their similarity-inducing regularization term penalizes all pairs of graphs rather than just adjacent graphs as in (14), suggesting behavior similar to white noise processes—an unreasonable assumption for most real spatial data applications. The associated algorithm uses a slower optimization approach which calculates eigendecompositions in R. The FMGL algorithm, written in Matlab, uses the same second order approximation as QUIC (Hsieh et al., 2014b) and also exhibits quadratic convergence. A general framework for second order optimization in high dimensional statistical modeling with regularization is developed in Hsieh et al. (2014a).
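Since the optimizer treats the penalty (14) as a plain function of the level-wise precision matrices, it can be transcribed directly. The sketch below is our own illustration (toy matrices, not the paper's code): an $\ell_1$ sparsity term on off-diagonals of each $\mathbf{Q}_\ell$ plus a fusion term on adjacent levels.

```python
import numpy as np

# Direct transcription (ours) of the penalty (14).
def fused_penalty(Qs, lam, rho):
    off = lambda M: M - np.diag(np.diag(M))     # zero out the diagonal
    sparsity = lam * sum(np.abs(off(Q)).sum() for Q in Qs)
    fusion = rho * sum(np.abs(off(Qs[l] - Qs[l + 1])).sum()
                       for l in range(len(Qs) - 1))
    return sparsity + fusion

Q1 = np.array([[2.0, 0.5], [0.5, 2.0]])
Q2 = np.array([[2.0, 0.0], [0.0, 2.0]])
print(fused_penalty([Q1, Q2], lam=1.0, rho=1.0))  # 2.0
```

With $\rho = 0$ the fusion term vanishes and the penalty decouples by level, which is exactly why (15) then splits into $L$ independent graphical lasso problems.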
We highlight some important aspects of implementing the multivariate BGL model. In practice, we independently estimate an error variance for each variable and define the diagonal matrix $\mathbf{D}$ accordingly. Matrices $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi}$ and $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{S} \mathbf{D}^{-1} \boldsymbol{\Phi}$ must be computed effectively—ignoring $\mathbf{D}^{-1}$, the naive matrix multiplications cost $O(p^3 n L^2)$ and $O(p^3 n^2 L)$, respectively. Moreover, the block diagonals of the linearization (9) must be retrieved without expending $O(p^3 L^3)$ flops for the matrix inverse $(\mathbf{Q} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi})^{-1}$. In Krock et al. (2020) with $p = 1$ this linearization step was not an issue—the computational bottleneck was iteratively solving the graphical lasso in the dimension of the basis functions $L$. Here, we are modeling $L$ graphs of dimension $p$, so the graphical modeling step may no longer be the bottleneck but rather the linearization (9). However, assuming that $\mathbf{Q}$ is block diagonal and basis functions are orthogonal makes this linearization step trivial; see Sections 3.0.1 and 3.2.

Define $\Phi$ as the $n \times L$ basis matrix with $(i,j)$ entry $\phi_j(\mathbf{s}_i)$. Observe that
$$\mathbf{Z}(\mathbf{s}_i) = \sum_{\ell=1}^{L} \phi_\ell(\mathbf{s}_i) \mathbf{W}_\ell = \mathbf{M} \Phi_i^T \qquad (i = 1, \ldots, n) \qquad (16)$$
where $\mathbf{M}$ has columns $\mathbf{W}_1, \ldots, \mathbf{W}_L$ and $\Phi_i^T$ is the $i$th column of $\Phi^T$. Introducing the $\mathrm{vec}(\cdot)$ operator, which stacks the columns of a matrix one-by-one into a vector, we write the process observation vector as
$$\mathbf{Z} = \begin{pmatrix} \mathbf{Z}(\mathbf{s}_1) \\ \vdots \\ \mathbf{Z}(\mathbf{s}_n) \end{pmatrix} = \begin{pmatrix} \mathbf{M} \Phi_1^T \\ \vdots \\ \mathbf{M} \Phi_n^T \end{pmatrix} = \mathrm{vec}(\mathbf{M} \Phi^T) \qquad (17)$$
where the last equation follows from (16.2.7) in Harville (1997). The vec operator cooperates with the Kronecker product $\otimes$ in the following way: $\mathrm{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^T \otimes \mathbf{A}) \mathrm{vec}(\mathbf{B})$ whenever $\mathbf{A}\mathbf{B}\mathbf{C}$ is well-defined. For us, this implies
$$\mathbf{Z} = (\Phi \otimes \mathbf{I}_p) \mathbf{W} \qquad (18)$$
since $\mathbf{W} = \mathrm{vec}(\mathbf{M})$ by construction. Thus we identify $\boldsymbol{\Phi} = \Phi \otimes \mathbf{I}_p$. Also note that $\mathbf{D}^{-1} = \mathbf{I}_n \otimes \mathrm{diag}(\tau_1^{-2}, \ldots, \tau_p^{-2})$ is a Kronecker product. Since $(\mathbf{A} \otimes \mathbf{C})(\mathbf{B} \otimes \mathbf{D}) = \mathbf{A}\mathbf{B} \otimes \mathbf{C}\mathbf{D}$ whenever $\mathbf{A}\mathbf{B}$ and $\mathbf{C}\mathbf{D}$ are well-defined, we have the Kronecker product representation $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi} = \Phi^T \Phi \otimes \mathrm{diag}(\tau_1^{-2}, \ldots$
$, \tau_p^{-2})$, which will be sparse for any choice of basis functions. In general, solving systems with $\mathbf{Q} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi}$ does not have an exploitable structure since $\mathbf{Q}$ is block diagonal yet $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi}$ is a Kronecker product. However, using orthogonal basis functions (i.e., $\Phi^T \Phi = \mathbf{I}_L$) means that $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi}$ is diagonal, so the first term in (9) is block diagonal and can be easily inverted in $O(L p^3)$.

The linearization also involves $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{S} \mathbf{D}^{-1} \boldsymbol{\Phi} = (\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{Y}_{\mathrm{data}})(\mathbf{Y}_{\mathrm{data}}^T \mathbf{D}^{-1} \boldsymbol{\Phi})/m$, where $\mathbf{Y}_{\mathrm{data}}$ is the data matrix with columns of realizations $\mathbf{Y}_1, \ldots, \mathbf{Y}_m$. Using $\mathrm{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^T \otimes \mathbf{A})\mathrm{vec}(\mathbf{B})$ again,
$$\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{Y}_i = \mathrm{vec}\!\left(\mathrm{diag}(\tau_1^{-2}, \ldots, \tau_p^{-2}) \, \mathrm{mat}(\mathbf{Y}_i) \, \Phi\right) \qquad (19)$$
where $\mathrm{mat}(\mathbf{Y}_i)$ is the $p \times n$ matrix with $\mathrm{vec}(\mathrm{mat}(\mathbf{Y}_i)) = \mathbf{Y}_i$. So $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{Y}_{\mathrm{data}}$ can be calculated in $O(mnpL)$ and has low storage cost $O(mpL)$. Once $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{Y}_{\mathrm{data}}$ and the block diagonals of $(\mathbf{Q} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi})^{-1}$ are computed, the block diagonals of the second term in (9) can be computed in $O(L p^2 m)$.

To summarize, for an orthogonal basis, only $\{\mathbf{Q}_1, \ldots, \mathbf{Q}_L\}$, $\{\tau_1^{-2}, \ldots, \tau_p^{-2}\}$, and $\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{Y}_{\mathrm{data}}$ must be stored in memory, and we can compute the block diagonals of (9) in $O(L p^3 + L p^2 m)$. These $L$ matrices of size $p \times p$ are then sent into the FMGL as an array of $L$ "sample covariance" matrices, the solution to the FMGL problem is the next guess for $\{\mathbf{Q}_1, \ldots, \mathbf{Q}_L\}$, and the entire procedure is repeated until $\|\mathbf{Q}^{(j+1)} - \mathbf{Q}^{(j)}\|_F / \|\mathbf{Q}^{(j)}\|_F < \epsilon$ for a small fixed tolerance $\epsilon$.

When considering the fusion penalty, a natural initial guess is the corresponding unfused estimate. With $\rho = 0$, the main optimization (15) amounts to solving the graphical lasso independently by level:
$$\arg\min_{\mathbf{Q}_\ell \succeq 0, \; \ell = 1, \ldots, L} \left( \sum_{\ell=1}^{L} -\log \det \mathbf{Q}_\ell + \mathrm{tr}(\boldsymbol{\Psi}_\ell \mathbf{Q}_\ell) + \lambda \sum_{i \neq j} |(\mathbf{Q}_\ell)_{ij}| \right). \qquad (20)$$
The entire algorithm has linear complexity and storage in $L$ in this case.
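To make the block-diagonal shortcut concrete, the sketch below (ours; small dimensions, random SPD blocks, not the paper's code) checks that the $\ell$th $p \times p$ diagonal block of the linearization (9), computed from the $\ell$th blocks alone, matches the corresponding block of the dense computation when $\Phi^T \Phi = \mathbf{I}_L$.

```python
import numpy as np

# Why orthogonal bases make (9) cheap: with Q block diagonal and
# Phi^T Phi = I_L, the matrix Q + Phi_bold^T D^{-1} Phi_bold is block
# diagonal, so each p x p block of Psi needs only small inverses.
rng = np.random.default_rng(5)
n, p, L = 12, 2, 3
Phi, _ = np.linalg.qr(rng.standard_normal((n, L)))   # orthonormal columns
tau2 = rng.uniform(0.5, 2.0, p)
T = np.diag(1.0 / tau2)

Qs = []
for _ in range(L):
    A = rng.standard_normal((p, p))
    Qs.append(A @ A.T + p * np.eye(p))
Q = np.zeros((p * L, p * L))
for l, Ql in enumerate(Qs):
    Q[l * p:(l + 1) * p, l * p:(l + 1) * p] = Ql

Phi_bold = np.kron(Phi, np.eye(p))
D_inv = np.kron(np.eye(n), T)
B = rng.standard_normal((n * p, n * p)); S = B @ B.T / (n * p)

# dense linearization (9)
M = Q + Phi_bold.T @ D_inv @ Phi_bold
G = Phi_bold.T @ D_inv @ S @ D_inv @ Phi_bold
Minv = np.linalg.inv(M)
Psi = Minv + Minv @ G @ Minv

# blockwise version: invert L small p x p matrices instead
ok = True
for l in range(L):
    sl = slice(l * p, (l + 1) * p)
    Ml_inv = np.linalg.inv(Qs[l] + T)    # l-th block of Phi_bold^T D^{-1} Phi_bold is T
    block = Ml_inv + Ml_inv @ G[sl, sl] @ Ml_inv
    ok = ok and np.allclose(block, Psi[sl, sl])
print(ok)  # True
```

The dense route forms and inverts a $pL \times pL$ matrix; the blockwise route touches only $L$ small blocks, which is the source of the linear-in-$L$ complexity claimed above.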
It is also easy to obtain the unpenalized maximum likelihood estimates for $\mathbf{Q}_1, \ldots, \mathbf{Q}_L$. Recall that $\mathbf{S}^{-1} = \arg\min_{\mathbf{Q} \succeq 0} -\log \det \mathbf{Q} + \mathrm{tr}(\mathbf{S}\mathbf{Q})$, assuming $\mathbf{S}$ is nonsingular. This means that our DC algorithm would simply invert the linearization matrix (or each block diagonal of the linearization matrix in the multivariate case) rather than substitute it into the graphical lasso. Again, $\mathbf{Q}_1, \ldots, \mathbf{Q}_L$ are independent, which implies the same storage and complexity as the unfused estimate in Section 3.0.1. This algorithm takes many more DC iterations to converge than either of the regularized estimates.

3.2 Estimating the Error Variances

Given a single variable with sample covariance $\mathbf{S}$, we minimize the following function jointly over $\tau^2$ and a few parameters describing a diagonal matrix $\mathbf{Q}$:
$$\log \det\!\left(\mathbf{Q} + \tau^{-2} \Phi^T \Phi\right) - \log \det \mathbf{Q} - \mathrm{tr}\!\left(\tau^{-4} \Phi^T \mathbf{S} \Phi \left(\mathbf{Q} + \tau^{-2} \Phi^T \Phi\right)^{-1}\right) + n \log \tau^2 + \tau^{-2} \mathrm{tr}(\mathbf{S}).$$
Note that this expression can be simplified with orthogonal bases, and the trace terms can be quickly computed as squared Frobenius norms. The diagonal parameterization of $\mathbf{Q}$ and parameter estimation is further discussed in Section 4.1. We record values for $\tau_1^2, \ldots, \tau_p^2$ in Table 2 in the Appendix which are used throughout the rest of the paper. A similar approach used in Krock et al. (2020) was found to be successful in recovering error variances even with a misspecified spatial covariance structure.

First we describe the cross-validation procedure for a pair of penalty parameters $(\lambda, \rho)$. Suppose we use $k$ folds and consider $t$ arbitrary pairs of penalties represented by $\{\Lambda_1, \ldots, \Lambda_t\}$. Let $\hat{\mathbf{Q}}_{\Lambda_j}(\mathbf{S})$ be the estimate we get from applying our algorithm with empirical covariance $\mathbf{S} = \frac{1}{m}\sum_{i=1}^m \mathbf{Y}_i \mathbf{Y}_i^T$ and penalty pair $\Lambda_j$. For $A \subseteq \{1, \ldots$
$, m\}$, let $\mathbf{S}_A = |A|^{-1} \sum_{i \in A} \mathbf{Y}_i \mathbf{Y}_i^T$. We seek $\Lambda$ so that $\alpha(\Lambda) = \ell(\hat{\mathbf{Q}}_\Lambda(\mathbf{S}), \mathbf{S})$ is small, where
$$\ell(\mathbf{Q}, \mathbf{S}) = \log \det\!\left(\mathbf{Q} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi}\right) - \log \det \mathbf{Q} - \mathrm{tr}\!\left(\boldsymbol{\Phi}^T \mathbf{D}^{-1} \mathbf{S} \mathbf{D}^{-1} \boldsymbol{\Phi} \, (\mathbf{Q} + \boldsymbol{\Phi}^T \mathbf{D}^{-1} \boldsymbol{\Phi})^{-1}\right) \qquad (21)$$
is the unpenalized likelihood function in (7). The cross-validation approach is to partition $\{1, \ldots, m\}$ into disjoint sets $\{A_1, \ldots, A_k\}$ and select $\hat{\Lambda} = \arg\min_{\Lambda \in \{\Lambda_1, \ldots, \Lambda_t\}} \hat{\alpha}(\Lambda)$, where $\hat{\alpha}(\Lambda) = k^{-1} \sum_{i=1}^{k} \ell(\hat{\mathbf{Q}}_\Lambda(\mathbf{S}_{A_i^c}), \mathbf{S}_{A_i})$.

Jointly searching over $\lambda$ and $\rho$ can quickly become unwieldy even when considering a small combination of sparsity and fusion penalties, as noted by Danaher et al. (2014), who instead suggest a dense search for $\lambda$ with $\rho = 0$ fixed and then a search for $\rho$ with that sparsity value fixed. The individual cross-validation for either $\lambda$ or $\rho$ follows the same idea: whichever penalty parameter has the lowest negative log-likelihood average across folds is selected.

4 Data Analysis
This section is broken into two main, but related, application and validation efforts. Both are done with the lens of the climate data problem: we begin with some exploratory analyses of the climate dataset, deriving reasonable basis functions and providing discussion to guide intuition for the ensuing model application. The second section is a detailed simulation study that tests our ability, under a similar setup as the climate example, to recover meaningful and relevant graphs for coefficients at different levels of basis functions under realistic assumptions on possible graph structures. The final section provides the full analysis of our model on the Community Atmosphere Model (CAM) data along with scientific interpretations of recovered graphical structures and some implied covariance and cross-covariance patterns.
We apply our method to a large climatological dataset from an ensemble study conducted at the National Center for Atmospheric Research (NCAR). Climate variability is typically assessed by examining a collection of numerical climate model simulations, which are computationally and economically expensive to produce. Relationships between variables at different spatial scales are crucial for scientific investigations; hence a scalable statistical model which can simulate multivariate processes could be a powerful tool for climatologists. Our method allows for efficient emulation and straightforward interpretation of complex geophysical model variable relationships, potentially filling this niche.

A climate model ensemble is typically a collection of climate simulations from the same numerical model using various initial conditions; ours is an extended version of the ensemble described in Baker et al. (2015) with $m = 343$ members. Data are recorded at $n = 48{,}602$ gridded spatial locations over the globe. There are a total of 164 variables available, the majority of which are three-dimensional, meaning they have a third dimension corresponding to 30 vertical atmospheric levels. For our study we only consider the two-dimensional surface variables, and a subset thereof. First, the variables are on different scales, so they are standardized with a pixelwise empirical mean and pixelwise empirical standard deviation. Histograms and Q-Q plots were consulted to remove strongly non-normal variables. Note that our data are yearly-averaged quantities, so a Gaussian assumption is generally reasonable. Potential variables were also removed if they were very strongly correlated, suggesting redundant information; e.g., when vectorized across space and realizations, the absolute correlation between two processes was above 0.
The $p = 40$ variables listed in Table 1 are grouped into five categories: aerosol variables, cloud variables, flux variables, precipitation variables, and transport/state variables. Throughout the rest of the document, aerosol variables are colored red, cloud variables are colored blue, flux variables are colored green, precipitation variables are colored purple, and transport/state variables are colored grey.

With the $p = 40$ variables in hand, our approach relies on first specifying a set of spatial basis functions. We construct such functions as empirical orthogonal functions (EOFs) (Wikle, 2010), which are widely used in the atmospheric and climate sciences. Typically, EOFs are used in a temporal context with a single variable. Let us consider a single spatiotemporal variable and suppose we have a matrix $\mathbf{B}$ of data with rows indexing $n$ spatial locations and columns indexing $t$ time points. If $\mathbf{B} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ is the (economy) SVD of the data matrix, the columns of the orthogonal matrix $\mathbf{U}$, referred to as EOFs, represent the normalized eigenvectors of the process empirical covariance matrix $\mathbf{B}\mathbf{B}^T$.

We compute the (economy) SVD of the $n \times pm$ = 48,602 × 13,
720 data matrix wherea row corresponds to a spatial location and contains the p standardized variables orderedsequentially by realization. Such an approach can be thought of as generating pooled EOFsthat describe common structure seen amongst all variables. Using the same notation
UDV^T for the SVD, we follow common practice and take the first L columns of U to form our n × L basis matrix Φ. Exploratory analysis suggests truncating after L = 2000 EOFs is a reasonable tradeoff between using a relatively small number of basis functions and explaining sufficient variance (97.2%). The first two pooled EOFs are displayed in Figure 1. We emphasize that pooling variables together to create EOFs is nontraditional and explains why the first two EOFs account for so little variability.

Variable | Description | Units | Category
AODVIS | Aerosol optical depth 550 nm | - | Aerosol
BURDEN1 | Aerosol burden mode 1 | kg/m² | Aerosol
BURDEN2 | Aerosol burden mode 2 | kg/m² | Aerosol
BURDEN3 | Aerosol burden mode 3 | kg/m² | Aerosol
BURDENBC | Black carbon aerosol burden | kg/m² | Aerosol
BURDENPOM | POM aerosol burden | kg/m² | Aerosol
BURDENSEASALT | Seasalt aerosol burden | kg/m² | Aerosol
BURDENSO4 | Sulfate aerosol burden | kg/m² | Aerosol
BURDENSOA | SOA aerosol burden | kg/m² | Aerosol
CDNUMC | Vertically-integrated droplet concentration | 1/m² | Cloud
CLDHGH | Vertically-integrated high cloud fraction | - | Cloud
CLDMED | Vertically-integrated mid-level cloud fraction | - | Cloud
CLDTOT | Vertically-integrated total cloud fraction | - | Cloud
FLDS | Downwelling longwave flux at surface | W/m² | Flux
FLNS | Net longwave flux at surface | W/m² | Flux
FLNSC | Clearsky net longwave flux at surface | W/m² | Flux
FLNT | Net longwave flux at top of model | W/m² | Flux
FLNTC | Clearsky net longwave flux at top of model | W/m² | Flux
FSDS | Downwelling solar flux at surface | W/m² | Flux
FSDSC | Clearsky downwelling solar flux at surface | W/m² | Flux
FSNS | Net solar flux at surface | W/m² | Flux
FSNSC | Clearsky net solar flux at surface | W/m² | Flux
FSNTC | Clearsky net solar flux at top of model | W/m² | Flux
FSNTOA | Net solar flux at top of atmosphere | W/m² | Flux
LHFLX | Surface latent heat flux | W/m² | Flux
LWCF | Longwave cloud forcing | W/m² | Cloud
PBLH | PBL height | m | Transport/State
PS | Surface pressure | Pa | Transport/State
QREFHT | Reference height humidity | kg/kg | Precipitation
SHFLX | Surface sensible heat flux | W/m² | Flux
SWCF | Shortwave cloud forcing | W/m² | Cloud
TAUX | Zonal surface stress | N/m² | Transport/State
TAUY | Meridional surface stress | N/m² | Transport/State
TGCLDCWP | Total grid-box cloud water path (liquid and ice) | kg/m² | Cloud
TGCLDIWP | Total grid-box cloud ice water path | kg/m² | Cloud
TGCLDLWP | Total grid-box cloud liquid water path | kg/m² | Cloud
TMQ | Total vertically integrated precipitable water | kg/m² | Precipitation
TREFHT | Surface air temperature at reference height | K | Transport/State
U10 | 10m wind speed | m/s | Transport/State
PRECT | PRECL Large-scale (stable) precipitation rate (liq + ice) plus PRECC Convective precipitation rate (liq + ice) | m/s | Precipitation

Table 1: Variable descriptions. Note that analysis happens on standardized, unitless data.

With this formulation it is important to clarify the role of the additive error term. Here an interpretation as a traditional spatial statistical nugget effect is not well-motivated; instead we think of ε as a fine-scale process which is at a smaller scale than the EOFs and absorbs the remaining variability unexplained by the pooling of variables. As noted in Wikle (2010), if enough eigenvectors are used to explain sufficient variation, then it is reasonable to assume that the EOF residuals are uncorrelated in space. This motivates the white
noise assumption on ε, which in turn suggests independently estimating τ_1, . . . , τ_p using the procedure described in Section 3.2. Namely, for each variable, we optimize the likelihood in Section 3.2 over two parameters (ψ, τ) under the assumption that the precision matrix has constant diagonal ψ. We also tried a similar diagonal assumption on the precision matrix but where the diagonal grows exponentially, with ℓth element ψ e^{αℓ}. An exponential parameterization reflects the fact that we expect smaller marginal precision values (i.e., higher variance) at lower levels of resolution and larger marginal precision values (i.e., lower variances) at higher frequencies. These low-dimensional optimization problems are solved with a standard L-BFGS algorithm. Values are shown in Table 2 in the Appendix along with a signal-to-noise ratio. The sum of the diagonal of Q^{-1} gives a quantification of the total signal variance, and we divide this by τ to get the signal-to-noise ratio η = ψ^{-1} ∑_{ℓ=1}^{L} (e^{-α})^ℓ / τ. In the case where Q has a constant diagonal ψ, the ratio is η = Lψ^{-1}/τ.

Figure 1: First two pooled EOFs of the standardized CAM data. The two EOFs account for 1.83% and 1.59% of the total variability of all 40 variables, respectively. The rightmost plot shows the cumulative percentage of variability explained by the EOFs, with L = 2000 EOFs capturing 97.2% of the total variance.

Figure 2(a) compares the estimated signal-to-noise ratio for the two types of diagonal Q. Results suggest that the constant diagonal explains more of the signal, which is somewhat counterintuitive. This result makes sense with respect to the formulas for η shown above, assuming ψ^{-1} and τ are equal, since ∑_{ℓ=1}^{L} (e^{-α})^ℓ < L when α > 0. Estimates for τ are indeed the same out to several decimal places, but estimates for ψ^{-1} are about four times smaller in the constant diagonal case (see Table 2). Our explanation is that the constant diagonal gives a smaller precision (higher variance) at higher frequencies than the exponential diagonal, hence the larger signal-to-noise ratio. Interesting patterns among the variables emerge in both plots in Figure 2. Many of the aerosol or cloud variables appear to be clustered together, and the pressure variable seems to be noticeably smoother than the rest, which is physically expected and visually the case when examining model output.
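The pooled-EOF basis construction in this section reduces to an economy SVD of the stacked data matrix. A minimal numpy sketch, with toy dimensions standing in for n = 48,602 and pm = 13,720 (the function name and toy sizes are illustrative, not from the paper):

```python
import numpy as np

def pooled_eof_basis(X, L):
    """Pooled-EOF basis: economy SVD of an n x (p*m) data matrix whose rows
    index spatial locations and whose columns stack all standardized
    variables and ensemble realizations.  Returns the first L left singular
    vectors (the EOFs) and the fraction of total variance they explain."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    var_explained = np.sum(d[:L] ** 2) / np.sum(d ** 2)
    return U[:, :L], var_explained

# Toy illustration: n = 60 locations, p = 3 variables, m = 20 realizations.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3 * 20))
Phi, frac = pooled_eof_basis(X, L=10)

# The EOF basis is orthonormal: Phi^T Phi = I_L.
assert np.allclose(Phi.T @ Phi, np.eye(10), atol=1e-10)
```

The orthonormality of Φ is what grants the linear complexity claimed in the abstract, so it is worth asserting in any implementation.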
Figure 2: (a) compares the signal-to-noise ratio under the two diagonal Q parameterizations. The line in (a) has slope one and passes through the origin. (b) compares smoothness α and signal-to-noise ratio η for the variables. See values in Table 2 in the Appendix.

4.2 Simulation Study

This section is dedicated to a simulation study that aims to approximate the behavior we expect to see in the actual climate data analysis. Our goal is to determine the ability of our estimation approach to recover meaningful and correct graphical structures from data simulated under the same setup as the available CAM data, as well as to assess the cross-validation scheme. In particular, we simulate from the model using the basis functions φ_1, . . . , φ_L and error variances τ_1, . . . , τ_p described in the previous section, with the same number of variables, ensemble members, and spatial locations as in CAM.

A main question is what graphs Q_1, . . . , Q_L to simulate from. The study is broken into two main pieces: the first is a case where all coefficient vectors have the same precision matrix; such an approach can be interpreted as approximating a multivariate white noise process. The second approach is more realistic, in which processes are constructed using values in Table 2 to mimic the smoothnesses exhibited in the actual dataset. Both options, described in more detail below, are governed by an initial precision matrix Q_0.

(a) White noise: each level of resolution has the same graph; Q_0 = Q_1 = Q_2 = · · · = Q_L.

(b) Spatially correlated: at each level ℓ the coefficients of the basis functions are generated according to the Hadamard product covariance model (s_ℓ s_ℓ^T) ◦ Q_0^{-1}, where s_ℓ = (ψ_1^{-1/2} e^{-α_1 ℓ/2}, . . . , ψ_p^{-1/2} e^{-α_p ℓ/2})^T and the smoothness parameters (α_1, . . . , α_p) and standard deviations (ψ_1^{-1/2}, . . . , ψ_p^{-1/2}) are from Table 2.
This scales the processes to behave with smoothnesses and variances that are exhibited by the standardized CAM variables and also preserves the sparsity pattern of the original graph Q_0 across all levels.

To specify Q_0, we first vectorize each variable over both space and realization and then calculate the corresponding correlation matrix. This correlation matrix, shown in Figure 3(a), is substituted into the standard graphical lasso to obtain Q_0. Note that the graphical lasso penalty here equals ξ ∑_{ℓ=1}^{L} ∑_{i≠j} |(Q_ℓ)_{ij}| and does not include the diagonal, to be consistent with (14). To be clear, the notation ξ is used here because this is a fundamentally different problem without basis functions, but ξ plays the same role as the previous sparsity penalty parameter λ. After some exploratory analysis, we consider a relatively dense initial graph corresponding to ξ = 0.2; see an illustration in Figure 3.

In each case, the setup of the simulated ensemble is the same size as the true Community Atmosphere Model: 343 ensemble members with 40 variables at 48,602 locations. To choose cross-validation parameters, we follow the steps from Section 3.3 and select a sparsity penalty with ρ = 0 fixed, followed by a fusion penalty with the optimal sparsity penalty fixed. With the same graphical model repeated across levels, we may expect a large ρ value to be optimal. Note, however, that the diagonals in (14) are unpenalized, so ρ → ∞ does not necessarily produce identical estimates at each level.
Figure 3: (a) shows the true pairwise correlations among the standardized CAM variables, taken across space and realization. (b) shows the precision matrix from substituting correlation matrix (a) into the graphical lasso problem with penalty ξ = 0.2. The resulting precision matrix (b) is 81% sparse, and its correlation matrix is shown in (c).
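The two ingredients of the simulation design, a graphical-lasso fit to a correlation matrix and the per-level Hadamard scaling of case (b), can be sketched as follows. This is a small ADMM solver for the graphical lasso with the off-diagonal-only penalty described above; the 4 × 4 correlation matrix, the smoothness/scale parameters, and the ADMM step `rho` are illustrative toy values, not the 40-variable CAM quantities or Table 2 entries:

```python
import numpy as np

def graphical_lasso_admm(S, xi, rho=1.0, n_iter=300):
    """ADMM for min_K -logdet(K) + tr(S K) + xi * sum_{i != j} |K_ij|.
    The diagonal is unpenalized, matching the penalty in the text."""
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    off = ~np.eye(p, dtype=bool)
    for _ in range(n_iter):
        # K-update: solve rho*K - K^{-1} = rho*(Z - U) - S by eigendecomposition
        lam, V = np.linalg.eigh(rho * (Z - U) - S)
        k = (lam + np.sqrt(lam ** 2 + 4 * rho)) / (2 * rho)
        K = (V * k) @ V.T
        # Z-update: soft-threshold the off-diagonal entries only
        Z = K + U
        Z[off] = np.sign(Z[off]) * np.maximum(np.abs(Z[off]) - xi / rho, 0.0)
        U += K - Z
    return Z

# Toy 4x4 correlation matrix standing in for the 40x40 CAM correlation matrix.
R = np.array([[1.0, 0.6, 0.1, 0.0],
              [0.6, 1.0, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.1],
              [0.0, 0.0, 0.1, 1.0]])
Q0 = graphical_lasso_admm(R, xi=0.2)

# Case (b): level-l coefficient covariance (s_l s_l^T) o Q0^{-1}, with
# illustrative smoothness and scale parameters in place of Table 2 values.
alpha = np.array([0.01, 0.02, 0.005, 0.03])
psi = np.array([1.0, 2.0, 0.5, 1.5])
Sigma0 = np.linalg.inv(Q0)
l = 100
s = psi ** -0.5 * np.exp(-alpha * l / 2.0)
Sigma_l = np.outer(s, s) * Sigma0   # Hadamard product; PSD by the Schur product theorem
```

The Hadamard product with the rank-one matrix s_ℓ s_ℓ^T equals diag(s_ℓ) Q_0^{-1} diag(s_ℓ), which is why the scaling preserves both positive definiteness and the sparsity pattern of Q_0 at every level.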
We show cross-validation results for the two cases and several other summarizing plots that describe the behavior of the estimated graphical models across level. Each case considered a 5-fold CV where 4/5 of the realizations were used for training and the remaining 1/5 were used for testing. Figure 4 shows cross-validation results for the sparsity parameter λ with ρ = 0 fixed. The negative log-likelihood has a minimizer which is destroyed with the addition of the sparsity penalty (13), which is counterintuitive since we would expect the penalized likelihood to select a sparser solution. In general it is difficult to say anything about (13) as a function of λ, since ‖Q_λ‖ decreases as λ increases. Observe that in the white noise case (a), λ = 1 is large enough to produce a diagonal precision matrix across all levels, as the penalized and unpenalized curves coincide. This is not the case in (b), where precision matrices grew exponentially with level instead of remaining constant; a much larger sparsity penalty is required to produce diagonal precision matrices. Similar curves to those in Figure 4 but for the fusion penalty (with the optimal sparsity value λ fixed) were jagged and gave little insight into which ρ values from the candidate set were preferred.
Figure 4: CV results for the sparsity penalty while ρ = 0 is fixed. The red line penalizes for sparsity as in equation (13). The minimizing value is λ = 0.05 in the white noise case (a) and λ = 0.08 in the correlated case (b).

Figure 5 shows a count of the number of zeros in the precision matrix over level. With ρ set to zero, the correct underlying graph structure is apparent in both cases. For (a), adding a fusion penalty of ρ = 11 produces the same graphical structure across all levels, while in (b), an even stronger fusion penalty ρ = 43 is unable to pin down the graphical structure. This is likely due to the low sparsity penalty λ = 0.08, which combines with the exponentially growing values in the precision matrix and the large fusion penalty to incorrectly fuse graph edges across adjacent levels of resolution.

Figure 5: Counting the nonzeros of {Q̂_1, . . . , Q̂_2000} for various penalties. Panels show the white noise case with fusion ρ = 0 and ρ = 11 (first column) and the correlated case with ρ = 0 and ρ = 43 (second column). The sparsity penalty is λ = 0.05 in the first column and λ = 0.08 in the second column. The true sparsity pattern corresponds to (b) in Figure 3.

Finally, in Figure 15 in the Appendix, we display the estimated marginal precision for each variable over level. In general, we found that the marginal precisions are largely dictated by the sparsity penalty and adding a fusion penalty has little effect. For the white noise case (a), the curves appear constant across level, as expected. For case (b), where the data-generating precision matrices increase exponentially with level, the curves appear linear on a log plot, so the exponential growth has been correctly modeled. The prominent grey line in (b) is the pressure variable, which yet again exhibits its smoothness with noticeably larger precision values at high frequency. In this simulation study, we were able to accurately estimate marginal precision values that were constant across level and also exponentially growing across level without making any assumptions about the behavior of these marginal precisions.
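The fusing behavior observed above, where a large ρ ties estimates at adjacent levels together without forcing the unpenalized diagonals to match, has a simple scalar analogue. The toy problem below is a stand-in for intuition only, not part of the estimator in (14): it minimizes two quadratic data-fit terms joined by a fusion penalty ρ|x − y|, and the estimates fuse exactly once ρ exceeds half the gap between the unpenalized solutions:

```python
import numpy as np

def fused_pair(a, b, rho):
    """Minimize 0.5*(x-a)^2 + 0.5*(y-b)^2 + rho*|x - y|.
    For rho >= |a - b|/2 the two estimates fuse to their common mean;
    below that threshold they are merely shrunk toward each other."""
    if rho >= abs(a - b) / 2.0:
        m = (a + b) / 2.0
        return m, m
    shrink = rho * np.sign(a - b)
    return a - shrink, b + shrink

# Analogue of the fusion penalty tying Q_l and Q_{l+1} together as rho grows.
x, y = fused_pair(3.0, 1.0, rho=0.5)    # shrunk but distinct
xf, yf = fused_pair(3.0, 1.0, rho=2.0)  # fused at the common mean
```

In the multilevel model the same mechanism acts entrywise on (Q_ℓ, Q_{ℓ+1}), which is why adjacent nonzeros (not only zeros) can be fused, and why a large ρ can propagate a neighbor across all levels.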
We proceed to the data analysis using the basis setup described in Section 4.1 and τ_1, . . . , τ_p from Table 2. The remaining question is what penalty parameters to use. We performed a 5-fold CV search over an integer grid of λ values. For each fold, the testing data was all realizations assigned to that fold (i.e., 1/5 of all realizations), and the training data was the rest (i.e., 4/5 of all realizations). We recorded the negative log-likelihood (i.e., (7) without the penalty) and its penalized version, but unfortunately there was little to infer from the results – both favored the lowest sparsity penalty λ = 1. Considering smaller sparsity values in the interval (0, 1) is an option, but this increases the computation time dramatically, unlike in the simulation study where such solutions were still computed fairly quickly.
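The realization-level 5-fold split can be sketched in a few lines; the seed and the fold construction below are illustrative, and the fitting step is left as a comment since the likelihood (7) is defined elsewhere in the paper:

```python
import numpy as np

m = 343                            # ensemble members
rng = np.random.default_rng(1)
perm = rng.permutation(m)
folds = np.array_split(perm, 5)    # five nearly equal folds of realizations

for k, test_idx in enumerate(folds):
    train_idx = np.setdiff1d(perm, test_idx)
    # Fit the penalized model on the realizations in train_idx, then record
    # the (penalized) negative log-likelihood on the realizations in test_idx.
    assert len(train_idx) + len(test_idx) == m
```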
With this cross-validation attempt and similar difficulties with penalties encountered in Danaher et al. (2014) in mind, we proceed by fixing several penalty pairs and examining the resulting modeling implications. Ideally, we would select a model with a sensible, interpretable graphical neighbor structure over levels. Differences between models with different graph structures may be minor, as different graphs can give approximately the same correlation structure.

In the supplemental material and the remainder of this document, we display several results for λ = 20, since this looked like an inflection point in a plot of λ versus the total graph sparsity percentage (see Figure 6(a)). The other plot in Figure 6 shows how the sparsity of the graphs corresponding to λ = 20 changes over levels. We also will occasionally compare λ = 20 results with those from λ = 1 to give an idea of how the implied graph structure changes with different penalties.

Figure 6: (a): total sparsity percentage of the graphs {Q_1, . . . , Q_L} as a function of λ. (b): sparsity percentage by level for λ = 20.

One notion of dependence we examine comes from averaging the estimated variable correlation matrix over space. To be precise, from (3) we see the process covariance matrix at location s is ∑_{ℓ=1}^{L} (φ_ℓ(s))² (Q_ℓ)^{-1}. This covariance matrix is converted into a correlation matrix, and the correlation matrices for each location are averaged. Figure 16 in the Appendix shows the result for the unpenalized model (i.e., the case with λ = 0 and ρ = 0 described in Section 3.1) as well as the penalized models.
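The per-location correlation averaging just described can be sketched directly from the formula cov(s) = ∑_ℓ φ_ℓ(s)² Q_ℓ^{-1}. The dimensions and random inputs below are toys standing in for n = 48,602, L = 2000, and p = 40:

```python
import numpy as np

def average_correlation(Phi, Sigmas):
    """Average over locations of the implied variable correlation matrix.
    Phi: n x L basis matrix; Sigmas: L x p x p array of the Q_l^{-1}.
    Location s has covariance sum_l Phi[s, l]**2 * Sigmas[l]."""
    C = np.einsum('sl,lij->sij', Phi ** 2, Sigmas)  # n x p x p covariances
    d = np.sqrt(np.einsum('sii->si', C))            # per-location std devs
    R = C / (d[:, :, None] * d[:, None, :])         # per-location correlations
    return R.mean(axis=0)

# Toy setup: n = 50 locations, L = 6 levels, p = 3 variables.
rng = np.random.default_rng(2)
Phi = np.linalg.qr(rng.standard_normal((50, 6)))[0]  # orthonormal basis
A = rng.standard_normal((6, 3, 3))
Sigmas = A @ A.transpose(0, 2, 1) + 3 * np.eye(3)    # SPD covariance per level
Rbar = average_correlation(Phi, Sigmas)
```

The single einsum over levels keeps the cost linear in n, L, and p², consistent with the complexity claims for the orthogonal-basis formulation.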
Different penalties produce minor differences in the implied correlation structure, but the overall similarity with Figure 3(a), which calculated variable correlations for the standardized CAM data by ignoring space and realization, suggests that our model correctly captures correlations between variables.

For the next set of model diagnostics, we turn away from correlations to look at precision matrices. Familiar plots of estimated marginal precisions by level are shown in Figure 7. Clearly, the exponential parameterization considered in the simulation study is not well-suited to our data. Just as in the simulation study, adding a sparsity penalty causes the pressure variable (colored grey) to have the highest marginal precisions at high levels of resolution, as we would expect from the smoothest variable. Overall, adding a sparsity penalty brings all marginal precisions down by an order of magnitude, with particularly strong shrinkage at lower levels of resolution. Given the focus on regularization, these results may not seem surprising, but the diagonals of the precision matrix are not penalized in any formulation we have considered. This shrinkage of marginal precisions can instead be attributed to the larger number of neighbors in the low-penalized graph structures, which means that the (conditional) precision will be higher than in estimates from higher penalties.
Figure 7: Marginal precision estimates (i.e., diagonals of Q̂_1, . . . , Q̂_2000) shown by level for the penalty choices no penalty, sparsity λ = 1, and sparsity λ = 20. Log scale on the y-axis. The bottom row shows a subset of six variables from the top row. The grey pressure variable is noticeably smoother than the rest after regularization is added.

Our penalized maximum likelihood procedure produces interpretable graphical models which imply conditional independencies among the variables at varying levels of resolution. Here we study some properties of the estimated graphical models. Figure 8 gives an idea of how the conditional independence structure changes with respect to level for different penalties. The left column counts the nonzeros of the estimated precision matrices Q̂_1, . . . , Q̂_2000. For λ = 20, we see several conditional independencies between variables that persist over all levels. The middle and right columns show how the conditional dependence neighborhood structure of two variables (BURDENSEASALT and PS) changes over level, and the impact of the fusion penalty is most apparent in these columns. Broadly speaking, the fusion penalty smooths out the neighbor pattern across levels. Note that the fusion penalty can fuse adjacent nonzeros rather than adjacent zeros and can cause a neighbor to fuse across all levels if ρ is large enough. Figure 9 displays estimated graphical models for different levels of resolution. At early levels corresponding to large-scale spatial patterns, the graphs are quite dense. Higher-frequency EOFs display interesting and reasonable patterns. For example, near the top of the ℓ = 150 graph we see groupings among many cloud and precipitation variables, and these connections are more evident at higher levels and even with different penalty parameters (not shown). For higher-level EOFs, past around ℓ = 500 with λ = 20, the graphs suggest variable independence.
This idea of complete independence is explored in Figure 10, where we display the first level at which each variable is independent and remains independent of all other variables. Note again a natural grouping of variable types, with transport and pressure variables achieving independence much earlier than finer-scale precipitation and cloud variables. Further, as could also be observed in the visualization of the conditional dependence structure for the two selected variables in Figure 8, the level at which independence occurs varies quite dramatically with the sparsity penalty, leading to roughly a four-fold increase in the number of connected levels going from λ = 20 to λ = 1.

Now we examine some spatial properties of our estimates. Figure 11 shows the estimated local standard deviations for a subset of six variables. Recall that the variables were standardized to have an empirical unit standard deviation, so these plots should be interpreted as potential bias corrections where the standardization fails to accurately describe the variability. Most striking is the El Niño effect apparent in the plot for the pressure variable PS. The pattern's presence is unsurprising since El Niño/La Niña are strongly tied to changes in pressure over the Pacific Ocean, and their relative infrequency likely requires more modeling care than just an empirical standardization. The finer-scale variables CDNUMC and PRECT are able to capture distinct behavior in mountainous regions (e.g., the Rocky Mountains and Himalayas).

In our last collection of figures, we return to correlation, but this time examine correlation over space as well as across variables. Figure 12 shows variable correlations as a function of space. To be precise, each image shows the correlation between the location marked in green and all other locations. Nonstationarity is evident from the difference in behaviors between land and ocean, and long-range negative correlation is seen to be a possible byproduct of this modeling scheme.
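The maps in Figure 12, and the cross-variable version in Figure 14, correspond to the implied covariance cov(Z_i(s0), Z_j(t)) = ∑_ℓ φ_ℓ(s0) φ_ℓ(t) (Q_ℓ^{-1})_{ij}. A sketch with toy dimensions, assuming the basis-expansion form of the process and omitting the fine-scale ε term for simplicity:

```python
import numpy as np

def correlation_map(Phi, Sigmas, s0, i, j):
    """Cross-correlation between variable i at reference location s0 and
    variable j at every location, implied by
    cov(Z_i(s0), Z_j(t)) = sum_l Phi[s0, l] * Phi[t, l] * Sigmas[l][i, j]."""
    cross = (Phi * Phi[s0]) @ Sigmas[:, i, j]   # covariance with s0, all t
    var_j = (Phi ** 2) @ Sigmas[:, j, j]        # variance of variable j at each t
    var_i0 = (Phi[s0] ** 2) @ Sigmas[:, i, i]   # variance of variable i at s0
    return cross / np.sqrt(var_i0 * var_j)

# Toy setup: n = 50 locations, L = 6 levels, p = 3 variables.
rng = np.random.default_rng(3)
Phi = np.linalg.qr(rng.standard_normal((50, 6)))[0]
A = rng.standard_normal((6, 3, 3))
Sigmas = A @ A.transpose(0, 2, 1) + 3 * np.eye(3)
cmap = correlation_map(Phi, Sigmas, s0=0, i=0, j=1)
```

Because the basis weights vary with location, the resulting correlation maps are automatically nonstationary; nothing in the computation assumes a distance-based covariance.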
In Figure 13, we display estimated local cross-correlations between a few of our selected variables. Expected negative and positive correlations between pairs of variables are correctly captured (e.g., between pressure and precipitation and between cloud droplet concentration and precipitation). Finally, in Figure 14, we show variable cross-correlations as a function of space. To be precise, each image shows the cross-correlation between the first variable at the location marked in green and the second variable at all other locations. The flexibility of the model is again apparent in its nonstationary behavior and various positive and negative cross-correlations. Such behavior is difficult to accommodate using extant models but readily available in our approach without any additional effort (Kleiber and Genton, 2013). Both the correlation and cross-correlation functions centered where El Niño/La Niña occur (see the right column in Figures 12 and 14) exhibit long-range dependence through the equator across the Pacific Ocean.

We conclude the data analysis with a brief commentary about timing results, assuming the orthogonal basis has been constructed and τ_1, . . . , τ_p are estimated. Note that the latter step is fairly quick using the technique in Section 3.2 with an orthogonal basis. The choice of penalty dictates the runtime of the DC algorithm. For the maximum likelihood estimate with no penalty, the tolerance ϵ = 0.05 is reached in fifteen DC iterations in five seconds. With sparsity penalty λ = 20, the algorithm converges in two DC iterations in forty-five seconds. With sparsity penalty λ = 1, the algorithm again requires two DC iterations but takes thirteen minutes. With fusion penalty ρ = 10, both estimates converged in one DC iteration using the ρ = 0 solution as the initial guess, but (λ, ρ) = (20, 10) took seven minutes while (λ, ρ) = (1, 10) took one minute. All experiments were performed in Matlab on a MacBook Pro with a 2.6 GHz Intel Core i7 processor.
Figure 8: Illustrating how graphical model neighborhoods behave for various penalty choices.Left column counts the nonzeros of ˆ Q , . . . , ˆ Q by level. Center and right columns showhow the neighbors of BURDENSEASALT and PS change over level.9 AODVIS BURDEN1 BURDEN2BURDEN3
Figure 9: Example estimated graphical models for λ = 20. Low level graphs are noisy, with the Level 1 graph containing 38.2% of all possible connections. Higher level graphs show reasonable variable clusters until eventually no graph edges exist. Panels show Level 1 (38.2% of possible edges), Level 150 (16.9%), and Level 350 (3.2%).
Figure 10: The first level at which a variable becomes (and remains) independent. Variables are ordered in increasing fashion according to the implied independence level for λ = 1. Results for λ = 20 accompany the taller bars and show a similar story for variable groups, but with earlier levels of independence.

Figure 11: Estimated local standard deviations for six variables. Standardization of the variables is reflected in the color scale, where unit values are colored white. Values above one suggest the empirical standardization does not explain enough variability.

Figure 12: Estimated spatial correlation functions for BURDENSEASALT (top row), PS (middle row), and PRECT (bottom row). The correlation function is centered over Tajikistan in the left column, the U.S. in the middle column, and the Pacific Ocean in the right column.

Figure 13: Estimated local cross-correlations for several pairs of variables. The color scale changes between rows to permit easier comparisons between positively correlated variables (top row) and between negatively correlated variables (bottom row).

Figure 14: Estimated spatial cross-correlation functions, again centered over Tajikistan in the left column, the U.S. in the middle column, and the Pacific Ocean in the right column. The top row shows PRECT and PS, which exhibit negative local cross-correlations, while the bottom row shows BURDENBC and BURDENPOM, which exhibit positive local cross-correlations.

Conclusion
We have presented a multivariate Gaussian process model which will be applicable in a variety of future endeavors. There are many benefits under this framework: covariance and cross-covariance functions are nonstationary, all likelihood calculations are exact, and computations and storage are cheap when $W_1, \ldots, W_L$ are independent and the basis functions are orthogonal.

Future experiments could relax the assumption of independent weights and estimate the entire sparse $pL \times pL$ precision matrix $Q$ as in Section 2.1 with the DC algorithm (8). Even if the basis functions are not orthogonal, $\Phi^T D^{-1} \Phi = \Phi^T \Phi \otimes \mathrm{diag}(\tau_1^{-2}, \ldots, \tau_p^{-2})$ is sparse due to the Kronecker product with a diagonal matrix. In this formulation, it is crucial to rewrite (9) and calculate the linearization term by solving linear systems with the sparse matrix $Q + \Phi^T D^{-1} \Phi$ or its sparse Cholesky decomposition. Note that the dense $pL \times pL$ linearization matrix must then be stored in memory, and advanced graphical lasso algorithms (Fattahi et al., 2019) must be explored for the subsequent graph estimation. Modeling dependence across basis functions would allow for more flexibility in the cross-covariances, since, under the independence assumption, $\mathrm{Cov}(Z_i(s), Z_j(s')) = \sum_{\ell=1}^{L} \phi_\ell(s) \left((Q_\ell)^{-1}\right)_{ij} \phi_\ell(s') = \mathrm{Cov}(Z_i(s'), Z_j(s))$ is symmetric.

We demonstrated that our model can easily fit a large climate ensemble and produce reasonable and interpretable results. Our method also easily scales to computer experiments with high-dimensional inputs (e.g., $s \in \mathbb{R}^d$ with large $d$).

Acknowledgements
This research was supported by NSF. Code is available upon request.

References
Apanasovich, T. and Genton, M. G. (2010), "Cross-covariance functions for multivariate random fields based on latent dimensions," Biometrika, 97, 15–30, URL https://doi.org/10.1093/biomet/asp078.

Apanasovich, T. V., Genton, M. G., and Sun, Y. (2012), "A valid Matérn class of cross-covariance functions for multivariate random fields with any number of components," J. Amer. Statist. Assoc., 107, 180–193, URL https://doi.org/10.1080/01621459.2011.643197.

Baker, A. H., Hammerling, D. M., Levy, M. N., Xu, H., Dennis, J. M., Eaton, B. E., Edwards, J., Hannay, C., Mickelson, S. A., Neale, R. B., Nychka, D., Shollenberger, J., Tribbia, J., Vertenstein, M., and Williamson, D. (2015), "A new ensemble-based consistency test for the Community Earth System Model (pyCECT v1.0)," Geoscientific Model Development, 8, 2829–2840, URL https://gmd.copernicus.org/articles/8/2829/2015/.

Bruinsma, W., Perim, E., Tebbutt, W., Hosking, J. S., Solin, A., and Turner, R. (2020), "Scalable exact inference in multi-output Gaussian processes," arXiv: Machine Learning.

Calder, C. A. (2007), "Dynamic factor process convolution models for multivariate space-time data with application to air quality assessment," Environ. Ecol. Stat., 14, 229–247, URL https://doi.org/10.1007/s10651-007-0019-y.

Danaher, P., Wang, P., and Witten, D. M. (2014), "The joint graphical lasso for inverse covariance estimation across multiple classes," J. R. Stat. Soc. Ser. B. Stat. Methodol., 76, 373–397, URL https://doi.org/10.1111/rssb.12033.

Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016), "Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets," Journal of the American Statistical Association, 111, 800–812, URL https://doi.org/10.1080/01621459.2015.1044091.

Dey, D., Datta, A., and Banerjee, S. (2020), "Graphical Gaussian process models for highly multivariate spatial data," arXiv: Statistics.

Fattahi, S., Zhang, R. Y., and Sojoudi, S. (2019), "Linear-time algorithm for learning large-scale sparse graphical models," IEEE Access.

Friedman, J., Hastie, T., and Tibshirani, R. (2008), "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, 9, 432–441.

Furrer, R. and Genton, M. G. (2011), "Aggregation-cokriging for highly multivariate spatial data," Biometrika, 98, 615–631, URL https://doi.org/10.1093/biomet/asr029.

Gaspari, G. and Cohn, S. E. (1999), "Construction of correlation functions in two and three dimensions," Quarterly Journal of the Royal Meteorological Society, 125, 723–757, URL https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/qj.49712555417.

Gelfand, A. E., Schmidt, A. M., Banerjee, S., and Sirmans, C. F. (2004), "Nonstationary multivariate process modeling through spatially varying coregionalization," Test, 13, 263–312, URL https://doi.org/10.1007/BF02595775, with discussion by Montserrat Fuentes, Dave Higdon and Bruno Sansó and a rejoinder by the authors.

Genton, M. G. and Kleiber, W. (2015), "Cross-covariance functions for multivariate geostatistics," Statist. Sci., 30, 147–163, URL https://doi.org/10.1214/14-STS487.

Gneiting, T., Kleiber, W., and Schlather, M. (2010), "Matérn cross-covariance functions for multivariate random fields," J. Amer. Statist. Assoc., 105, 1167–1177, URL https://doi.org/10.1198/jasa.2010.tm09420.

Goulard, M. and Voltz, M. (1992), "Linear coregionalization model: Tools for estimation and choice of cross-variogram matrix," Mathematical Geology, 24, 269–286.

Guinness, J. (2018), "Nonparametric spectral methods for multivariate spatial and spatial-temporal data," arXiv: Statistics.

Harville, D. A. (1997), Matrix Algebra from a Statistician's Perspective, Springer-Verlag, New York, URL https://doi.org/10.1007/b98818.

Hsieh, C.-J., Dhillon, I. S., Ravikumar, P. K., Becker, S., and Olsen, P. A. (2014a), "QUIC & DIRTY: A quadratic approximation approach for dirty statistical models," in Advances in Neural Information Processing Systems 27, eds. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Curran Associates, Inc., pp. 2006–2014, URL http://papers.nips.cc/paper/5578-quic-dirty-a-quadratic-approximation-approach-for-dirty-statistical-models.pdf.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014b), "QUIC: Quadratic approximation for sparse inverse covariance estimation," J. Mach. Learn. Res., 15, 2911–2947.

Ippoliti, L., Valentini, P., and Gamerman, D. (2012), "Space-time modelling of coupled spatiotemporal environmental variables," J. R. Stat. Soc. Ser. C. Appl. Stat., 61, 175–200, URL https://doi.org/10.1111/j.1467-9876.2011.01011.x.

Kleiber, W. (2017), "Coherence for multivariate random fields," Statist. Sinica, 27, 1675–1697.

Kleiber, W. and Genton, M. G. (2013), "Spatially varying cross-correlation coefficients in the presence of nugget effects," Biometrika, 100, 213–220.

Kleiber, W. and Nychka, D. (2012), "Nonstationary modeling for multivariate spatial processes," J. Multivariate Anal., 112, 76–91, URL https://doi.org/10.1016/j.jmva.2012.05.011.

Kleiber, W., Nychka, D., and Bandyopadhyay, S. (2019), "A model for large multivariate spatial data sets," Statist. Sinica, 29, 1085–1104.

Kleiber, W. and Porcu, E. (2014), "Nonstationary matrix covariances: compact support, long range dependence and quasi-arithmetic constructions," Stochastic Environmental Research and Risk Assessment, 29, 193–204.

Krock, M., Kleiber, W., and Becker, S. (2020), "Nonstationary modeling with sparsity for spatial data via the basis graphical lasso," Journal of Computational and Graphical Statistics, URL https://doi.org/10.1080/10618600.2020.1811103.

Le, N. D. and Zidek, J. V. (2006), Statistical Analysis of Environmental Space-Time Processes, Springer Series in Statistics, Springer, New York.

Majumdar, A. and Gelfand, A. E. (2007), "Multivariate spatial modeling for geostatistical data using convolved covariance functions," Math. Geol., 39, 225–245, URL https://doi.org/10.1007/s11004-006-9072-6.

Majumdar, A., Paul, D., and Bautista, D. (2010), "A generalized convolution model for multivariate nonstationary spatial processes," Statist. Sinica, 20, 675–695.

Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015), "A multiresolution Gaussian process model for the analysis of large spatial datasets," J. Comput. Graph. Statist., 24, 579–599, URL https://doi.org/10.1080/10618600.2014.914946.

Pollice, A. and Jona Lasinio, G. (2010), "A multivariate approach to the analysis of air quality in a high environmental risk area," Environmetrics, 21, 741–754, URL https://doi.org/10.1002/env.1059.

Rue, H. and Held, L. (2005), Gaussian Markov Random Fields: Theory and Applications, Monographs on Statistics and Applied Probability, Chapman & Hall/CRC.

Schmidt, A. M. and Gelfand, A. E. (2003), "A Bayesian coregionalization approach for multivariate pollutant data," Journal of Geophysical Research: Atmospheres, 108, URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2002JD002905.

Shaddick, G. and Wakefield, J. (2002), "Modelling daily multivariate pollutant data at multiple sites," J. Roy. Statist. Soc. Ser. C, 51, 351–372, URL https://doi.org/10.1111/1467-9876.00273.

Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory for Kriging, Springer-Verlag, New York.

Stein, M. L. (2014), "Limitations on low rank approximations for covariance matrices of spatial data," Spat. Stat., 8, 1–19, URL https://doi.org/10.1016/j.spasta.2013.06.003.

Taylor-Rodriguez, D., Finley, A., Datta, A., Babcock, C., Andersen, H., Cook, B., Morton, D., and Banerjee, S. (2019), "Spatial factor models for high-dimensional and large spatial data: An application in forest variable mapping," Statistica Sinica, 29, 1155–1180.

Teh, Y., Seeger, M. W., and Jordan, M. I. (2005), "Semiparametric latent factor models," in AISTATS.

Titsias, M. (2009), "Variational learning of inducing variables in sparse Gaussian processes," in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, eds. D. van Dyk and M. Welling, PMLR, volume 5 of Proceedings of Machine Learning Research, pp. 567–574, URL http://proceedings.mlr.press/v5/titsias09a.html.

Ver Hoef, J. M. and Barry, R. P. (1998), "Constructing and fitting models for cokriging and multivariable spatial prediction," J. Statist. Plann. Inference, 69, 275–294, URL https://doi.org/10.1016/S0378-3758(97)00162-6.

Wackernagel, H. (2003), Multivariate Geostatistics: An Introduction with Applications, Springer Berlin Heidelberg, URL https://books.google.com/books?id=Rhr7bgLWxx4C.

Wikle, C. K. (2010), "Low rank representations for spatial processes," in Handbook of Spatial Statistics, eds. A. E. Gelfand, P. Diggle, P. Guttorp, and M. Fuentes, Boca Raton: Chapman & Hall/CRC, pp. 107–118.

Yang, S., Lu, Z., Shen, X., Wonka, P., and Ye, J. (2015), "Fused multiple graphical lasso," SIAM J. Optim., 25, 916–943, URL https://doi.org/10.1137/130936397.

Figure 15: (a) Marginal precision estimates for the simulated white noise data. In this case, values should be constant across level. (b) Marginal precision estimates for the simulated correlated data, corresponding to the parameterization with exponentially growing precision matrices. Note the logarithmic scale on the y-axis, so the exponential growth is captured well. The noticeable grey line in (b) is the pressure variable.
Figure 16: The estimated process correlation matrix averaged over all observation locations for various penalty choices. Panels correspond to the true variable correlations, no penalty, and (Sparsity, Fusion) = (1, 0), (20, 0), (1, 10), and (20, 10). The top left panel is identical to Figure 3(a), which computed variable correlations for the standardized CAM data over all realizations and locations.
Variable        τ       α (exp.)  ψ (exp.)  η (exp.)  ψ (const.)  η (const.)
AODVIS          0.1285  0.0029    0.1046    1.91      0.2049      2.89
BURDEN1         0.1059  0.0030    0.1047    2.66      0.2043      4.27
BURDEN2         0.1490  0.0027    0.1094    1.36      0.2055      2.13
BURDEN3         0.1318  0.0029    0.1065    1.77      0.2049      2.74
BURDENBC        0.1217  0.0029    0.1067    2.05      0.2047      3.22
BURDENPOM       0.1283  0.0028    0.1070    1.86      0.2049      2.89
BURDENSEASALT   0.1404  0.0028    0.1076    1.55      0.2052      2.41
BURDENSO4       0.1327  0.0028    0.1063    1.76      0.2050      2.70
BURDENSOA       0.1265  0.0029    0.1072    1.86      0.2048      2.98
CDNUMC          0.2301  0.0020    0.1116    0.75      0.2088      0.87
CLDHGH          0.1947  0.0024    0.1072    0.94      0.2071      1.23
CLDMED          0.2238  0.0023    0.1109    0.70      0.2085      0.92
CLDTOT          0.2122  0.0024    0.1096    0.76      0.2079      1.03
FLDS            0.1558  0.0028    0.1065    1.28      0.2057      1.95
FLNS            0.1524  0.0029    0.1042    1.34      0.2056      2.04
FLNSC           0.1435  0.0028    0.1059    1.55      0.2053      2.31
FLNT            0.1257  0.0029    0.1037    2.04      0.2048      3.02
FLNTC           0.1293  0.0030    0.1073    1.75      0.2049      2.85
FSDS            0.1315  0.0028    0.1014    2.03      0.2049      2.75
FSDSC           0.1327  0.0029    0.1059    1.74      0.2050      2.70
FSNS            0.1363  0.0026    0.1030    1.94      0.2051      2.56
FSNSC           0.1761  0.0024    0.1087    1.12      0.2064      1.51
FSNTC           0.1947  0.0023    0.1114    0.91      0.2071      1.23
FSNTOA          0.1520  0.0025    0.1040    1.58      0.2055      2.05
LHFLX           0.2390  0.0021    0.1143    0.63      0.2092      0.80
LWCF            0.1563  0.0026    0.1033    1.46      0.2057      1.94
PBLH            0.2223  0.0022    0.1101    0.74      0.2084      0.93
PS              0.0625  0.0038    0.1138    5.22      0.2035      12.36
QREFHT          0.1387  0.0028    0.1066    1.60      0.2051      2.47
SHFLX           0.2286  0.0021    0.1126    0.70      0.2087      0.88
SWCF            0.1505  0.0026    0.1026    1.58      0.2055      2.09
TAUX            0.1835  0.0025    0.1092    0.99      0.2067      1.39
TAUY            0.1972  0.0023    0.1087    0.92      0.2072      1.20
TGCLDCWP        0.1730  0.0026    0.1045    1.17      0.2063      1.57
TGCLDIWP        0.2050  0.0023    0.1083    0.88      0.2076      1.11
TGCLDLWP        0.2014  0.0022    0.1088    0.93      0.2074      1.15
TMQ             0.1067  0.0031    0.1064    2.47      0.2043      4.21
TREFHT          0.1701  0.0026    0.1097    1.09      0.2062      1.63
U10             0.2124  0.0022    0.1106    0.80      0.2079      1.03
PRECT           0.2491  0.0020    0.1150    0.61      0.2098      0.73
Table 2: Estimates of the standard deviation $\tau$ with $L = 2000$ EOFs, obtained independently for each variable. For the exponential case, $Q$ is diagonal with $\ell$th element $\psi e^{\alpha \ell}$; for the constant case, $Q$ has constant diagonal $\psi$. Here $\eta$ is the signal-to-noise ratio.
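To make the two diagonal parameterizations in Table 2 concrete, the following is a minimal sketch, not the authors' code. The `process_variance` helper and its orthonormal-basis interpretation (each independent weight contributes roughly $1/q_\ell$ to the marginal variance) are our assumptions, and the sketch does not attempt to reproduce the $\eta$ column, whose exact definition is given earlier in the paper.

```python
import numpy as np

def diag_precision(L, psi, alpha=None):
    """Diagonal of Q: constant psi, or psi * exp(alpha * l) for l = 1, ..., L."""
    levels = np.arange(1, L + 1)
    if alpha is None:
        return np.full(L, psi)           # constant case
    return psi * np.exp(alpha * levels)  # exponential case

def process_variance(q):
    # Under an orthonormal basis, each independent weight with precision q_l
    # contributes roughly 1/q_l to the marginal variance of the process.
    return np.sum(1.0 / q)

# Illustrative inputs taken from the AODVIS row of Table 2
# (tau is the estimated noise standard deviation).
tau, psi, alpha = 0.1285, 0.1046, 0.0029
q = diag_precision(2000, psi, alpha)     # exponentially growing precisions
var_Z = process_variance(q)
```

On a log scale the exponential case gives a straight line in the level index, which matches the behavior described for Figure 15(b).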