A Vecchia Approximation for High-Dimensional Gaussian Cumulative Distribution Functions Arising from Spatial Data
Mauricio Nascimento^a and Benjamin A. Shaby^b

^a Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA; ^b Department of Statistics, Colorado State University, Fort Collins, CO 80523-1801, USA

CONTACT Mauricio Nascimento. Email: [email protected]
ARTICLE HISTORY
Compiled July 31, 2020
Abstract
We introduce an approach to quickly and accurately approximate the cumulative distribution function of multivariate Gaussian distributions arising from spatial Gaussian processes. This approximation is trivially parallelizable and simple to implement using standard software. We demonstrate its accuracy and computational efficiency in a series of simulation experiments, and apply it to analyzing the joint tail of a large precipitation dataset using a recently-proposed scale mixture model for spatial extremes. This dataset is many times larger than what was previously considered possible to fit using preferred inferential techniques.
KEYWORDS
Gaussian process; Scale mixture; Spatial extremes
1. Introduction
We introduce a trivially parallelizable approach to quickly and accurately approximate the cumulative distribution function (cdf) of multivariate Gaussian distributions with highly structured covariance matrices, such as those arising from spatial Gaussian processes. The multivariate Gaussian distribution is by far the most widely used for modeling multivariate and spatial data. To a large degree, its near universal adoption is the result of its simplicity; it is concisely and intuitively parametrized by a mean vector and pairwise dependence in the form of a covariance matrix. Prominent examples of its use include time series models like autoregressive and moving average models, which consider the joint distribution of the observations observed at discrete time points to be multivariate Gaussian, as well as geostatistics models, which consider spatially-indexed observations to be realizations of a Gaussian process, usually with a parsimoniously parametrized covariance structure. Even multivariate models that do not assume Gaussian responses often represent dependence using some kind of latent multivariate Gaussian distribution.

In most situations, likelihood-based inference on popular models just requires calculation of the joint density of all observations. The probability density function (pdf) of a multivariate Gaussian random variable is

f(x; μ, Σ) ≡ φ_{μ,Σ}(x) = (2π)^{−D/2} det(Σ)^{−1/2} exp{ −(1/2)(x − μ)′ Σ^{−1} (x − μ) },    (1)

where μ is the mean vector of length D and Σ is the D × D covariance matrix. In principle, there is nothing difficult about calculating this density; it simply requires commonplace operations like calculating an exponent, matrix determinant, matrix multiplication, and matrix inversion. However, this is not an easy task in practice when the dimension D is large. The complexity of calculating the determinant and inverse of a D × D matrix is typically O(D^3) for algorithms in common use. This means that for large values of D, the calculation of the pdf becomes prohibitive.

Computing the Gaussian cdf, which is a much more difficult problem, has received much less attention. The problem has increased in prominence recently with advances in spatial modeling of extreme events. State-of-the-art approaches for spatial extremes like Wadsworth and Tawn [21], Thibaud et al. [19], de Fondeville and Davison [3], and Huser and Wadsworth [12] all require high-dimensional Gaussian cdfs for inference. This turns out to be the dominant computational bottleneck, and all but de Fondeville and Davison [3] restricted their analyses to fewer than 20 spatial locations because larger datasets are computationally intractable using widely-used techniques for computing the Gaussian cdf. In real-world spatial applications, one should expect to see many more spatial locations, and existing approaches are not equipped to handle datasets of even moderate size.

Multivariate Gaussian cdfs appear in other contexts as well; for example, the densities of multivariate skewed Gaussian and t random variables are functions of the multivariate Gaussian cdf [1]. Here, we will focus on the case of spatial extremes. To make things concrete, we will use the example of the Gaussian scale mixture model from [14], although our computational strategy would work equally well in any context with highly-structured covariance matrices.

Most approaches to calculating multivariate Gaussian probabilities are intended for problems of small or moderate dimension. Genz [7] proposed a transformation from the original integral over R^D to an integral over a unit hypercube. Transforming to a finite region then allows the use of any standard numerical integration method. Genz [8] derived formulas to calculate bivariate and trivariate Gaussian cdfs with high precision using Gauss-Legendre numerical integration. The calculations are fast and precise but do not apply in higher dimensions. Miwa et al. [15] proposed a two-stage recursive approach to estimate the Gaussian cdf. Their approach does not scale to high dimensions because it requires a sum over a combinatorially exploding (in D) number of terms.

The most popular approach for approximating Gaussian cdfs in moderate dimensions was proposed by Genz and Bretz [9]. They describe the use of Monte Carlo (MC) and quasi-Monte Carlo (QM) methods to estimate the joint cdf. Their QM methods have smaller asymptotic errors than the MC versions, and hence are the more widely used.

More recently, Genton et al. [6] sped up the Genz and Bretz [9] QM algorithm by performing matrix computations with fast hierarchical matrix libraries [11]. As a follow-up, Cao et al. [2] combined hierarchical matrix computations with a blocking technique to further speed up computations. These approaches are much faster than their predecessors and work for Gaussian random variables with arbitrary covariance structures.
They lean heavily, however, on linking to specialized libraries for matrix operations. Our approach achieves speedups using a fundamentally different strategy, by specifically leveraging the properties of highly-structured covariance forms arising from, for example, time series or spatial data. It requires no exotic software, and is trivially parallelizable using simple tools in R.
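For concreteness, the following is a minimal sketch of how a Genz-Bretz QM estimate can be obtained in R. It uses the mvtnorm implementation of the algorithm rather than the mvPot implementation used in our analyses, and the grid, covariance, and evaluation point are illustrative assumptions, not values from the paper.

```r
## Minimal sketch: a Genz-Bretz QM estimate of a Gaussian cdf in R.
library(mvtnorm)

set.seed(1)
locs  <- as.matrix(expand.grid(1:15, 1:15))   # 15 x 15 grid of locations
Sigma <- exp(-as.matrix(dist(locs)) / 1)      # exponential covariance, range 1
x     <- rep(1, nrow(locs))                   # point at which to evaluate the cdf

## QM estimate of P(X < x) for X ~ N(0, Sigma); maxpts controls the sample size
p <- pmvnorm(upper = x, sigma = Sigma, algorithm = GenzBretz(maxpts = 25000))
log(p[1])
```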
2. A Vecchia Approximation for the Multivariate Gaussian Distribution Function
The multivariate Gaussian cdf that we wish to calculate is simply the integral of the pdf (1),

P(X < x; μ, Σ) ≡ Φ_{μ,Σ}(x) = ∫_{−∞}^{x_1} ··· ∫_{−∞}^{x_D} φ_{μ,Σ}(y) dy_1 ... dy_D.    (2)

To calculate the integral (2), one must resort to numerical techniques, as it is well-known that no closed form exists, even in a single dimension. In high dimensions, numerical integration is very difficult simply due to geometry and the curse of dimensionality. The difficulty is compounded in the case of the Gaussian cdf because while the curse of dimensionality requires an exponentially (in D) increasing number of evaluations of the integrand, the cost of each evaluation of the integrand itself grows as O(D^3). We seek a technique that simultaneously 1) reduces the effective dimension of the integral and 2) reduces the dimension of the pdf in the integrand.

2.1. Approximating the pdf

Vecchia [20] introduced a way to approximate high-dimensional Gaussian pdfs arising from spatial data, which is particularly amenable to modification for our purposes. The starting point of the Vecchia [20] approximation is to write the joint density as a product of cascading conditional densities,

f(x) = f(x_1) ∏_{i=2}^{D} f(x_i | x_{1:i−1}).    (3)

Here, f(x_1) is the univariate Gaussian density with mean μ_1 and variance Σ_{1,1}, and, for i = 2, ..., D, the conditional density f(x_i | x_{1:i−1}) is the univariate Gaussian density with mean μ_i + Σ_{[i,1:i−1]} Σ_{[1:i−1,1:i−1]}^{−1} (x_{1:i−1} − μ_{1:i−1}) and variance Σ_{i,i} − Σ_{[i,1:i−1]} Σ_{[1:i−1,1:i−1]}^{−1} Σ_{[1:i−1,i]}. The leading terms in this product are fast to calculate, but for terms corresponding to large i, the computations are nearly as burdensome as those of the original representation (1).

To help solve this problem, Vecchia [20] proposed an approximation to the full joint distribution, in the setting where the random vector X is observed from a spatial Gaussian process. He modified the cascading conditional representation (3) by replacing the conditioning on high-dimensional vectors x_{1:i−1} with conditioning on well-chosen vectors that have much smaller dimension. By limiting the conditioning sets to vectors of length m ≪ D, this strategy replaces expensive O(D^3) matrix operations with much faster O(m^3) matrix operations. The approximation of the joint density is then

f(x) ≈ f(x_1) ∏_{i=2}^{D} f(x_i | x_{N_i}),    (4)

where N_i is the conditioning set of size m (more precisely, min(m, i − 1)) chosen for the component x_i, for i = 2, ..., D.

A good choice for a conditioning set to approximate the complete conditional density of each x_i might be the m components that are most correlated with x_i. In the context where the random vector X = (X(s_1), ..., X(s_D))^T arises from a stationary spatial Gaussian process observed at locations s_1, ..., s_D, the components most correlated with X_i ≡ X(s_i) will be those observed at locations that are the m nearest neighbors to s_i (under covariance models in common use). Other strategies for constructing conditioning sets have also been explored [10,18].

Vecchia's approximation has been found to be quite accurate under many covariance models and sampling scenarios relevant to analysis of spatial Gaussian processes [10]. Moreover, it is very fast to compute, even using the most naive implementation. However, its power is fully realized when the D components of the product are computed in parallel, which is trivially easy to implement using standard tools in R.
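As an illustration, the following sketch (not the authors' code) computes the Vecchia log-density (4) for a zero-mean process, choosing each conditioning set N_i as the m nearest previously-ordered locations; the inputs locs, Sigma, and x are as in the earlier sketch.

```r
## Sketch of the Vecchia pdf approximation (4) for a zero-mean process.
## Each factor is a univariate Gaussian density with the conditional mean
## and variance given above, computed from an m-dimensional conditioning set.
vecchia_logpdf <- function(x, locs, Sigma, m = 10) {
  D  <- length(x)
  ll <- dnorm(x[1], mean = 0, sd = sqrt(Sigma[1, 1]), log = TRUE)
  for (i in 2:D) {
    ## m nearest neighbors of s_i among the previously-ordered locations
    d  <- sqrt(rowSums((locs[1:(i - 1), , drop = FALSE] -
                        matrix(locs[i, ], i - 1, 2, byrow = TRUE))^2))
    Ni <- order(d)[seq_len(min(m, i - 1))]
    w  <- Sigma[i, Ni, drop = FALSE] %*% solve(Sigma[Ni, Ni, drop = FALSE])
    mu_i <- as.numeric(w %*% x[Ni])                       # conditional mean
    v_i  <- Sigma[i, i] - as.numeric(w %*% Sigma[Ni, i])  # conditional variance
    ll   <- ll + dnorm(x[i], mean = mu_i, sd = sqrt(v_i), log = TRUE)
  }
  ll
}

vecchia_logpdf(x, locs, Sigma, m = 10)
```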
2.2. Approximating the cdf

Our approach to approximating the high-dimensional Gaussian cdf is to re-write the joint cdf as a telescoping product of conditional cdfs, analogously to (3), and then to approximate each complete conditional cdf with a cdf that conditions on a smaller collection of components, analogously to (4). In the case of the pdf, this strategy of choosing smaller conditioning sets eliminates the high-dimensional matrix computations required by (1), whereas in the case of the cdf, this strategy eliminates the need to compute the high-dimensional integral required by (2).

Specifically, we can re-write any joint cdf as

F(x) = P(X < x) = P(X_1 < x_1) ∏_{i=2}^{D} P(X_i < x_i | X_1 < x_1, ..., X_{i−1} < x_{i−1})
     = P(X_1 < x_1) ∏_{i=2}^{D} P(X_i < x_i | X_{1:i−1} < x_{1:i−1}).    (5)

Then, just as in the approximation to the pdf (4), in the cdf (5) each conditional probability in the product can be approximated by reducing the size of the conditioning set to at most m components. Thus, our Vecchia approximation for the Gaussian cdf is

F(x) ≈ P(X_1 < x_1) ∏_{i=2}^{D} P(X_i < x_i | X_{N_i} < x_{N_i})
     = P(X_1 < x_1) ∏_{i=2}^{D} P(X_i < x_i, X_{N_i} < x_{N_i}) / P(X_{N_i} < x_{N_i})
     = Φ(x_1) ∏_{i=2}^{D} Φ(x_{{i,N_i}}) / Φ(x_{N_i}),    (6)

where again N_i is the conditioning set of size min(m, i − 1) chosen for the component x_i, for i = 2, ..., D.

The approximation given by (6) reduces computational costs by replacing the D-dimensional integral in (2) with a series of much simpler integrals of dimension m + 1 and m, for m ≪ D. Furthermore, all of the elements in the product can be computed in parallel.

The multivariate cdfs in (6) still have to be evaluated numerically. For all but the smallest possible choices of m, best practices suggest using a QM method like that of Genz and Bretz [9] to approximate the numerator and denominator.

Similarly to the original Vecchia approximation to the Gaussian pdf, choosing the conditioning sets involves a trade-off; choose m too small and the accuracy of the approximation will suffer, but choose m too large and the computational benefits will diminish.
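The following sketch implements (6) directly, again for the zero-mean case and with nearest-neighbor conditioning sets. Each low-dimensional cdf is evaluated here with mvtnorm::pmvnorm, a stand-in for the mvPot QM routine used in Section 3.

```r
## Sketch of the Vecchia cdf approximation (6), zero-mean case. Each term is
## an (m+1)- or m-dimensional Gaussian cdf, evaluated by the Genz-Bretz
## algorithm in mvtnorm::pmvnorm().
library(mvtnorm)

vecchia_logcdf <- function(x, locs, Sigma, m = 10) {
  D  <- length(x)
  ll <- pnorm(x[1], sd = sqrt(Sigma[1, 1]), log.p = TRUE)   # log Phi(x_1)
  for (i in 2:D) {
    d  <- sqrt(rowSums((locs[1:(i - 1), , drop = FALSE] -
                        matrix(locs[i, ], i - 1, 2, byrow = TRUE))^2))
    Ni  <- order(d)[seq_len(min(m, i - 1))]   # conditioning set N_i
    idx <- c(i, Ni)
    ## log numerator Phi(x_{i,Ni}) and log denominator Phi(x_{Ni}) in (6)
    num <- log(pmvnorm(upper = x[idx], sigma = Sigma[idx, idx])[1])
    den <- if (length(Ni) > 1)
      log(pmvnorm(upper = x[Ni], sigma = Sigma[Ni, Ni])[1])
    else
      pnorm(x[Ni], sd = sqrt(Sigma[Ni, Ni]), log.p = TRUE)
    ll <- ll + num - den
  }
  ll
}

vecchia_logcdf(x, locs, Sigma, m = 10)
```

Because the D − 1 log ratios are mutually independent computations, the loop can be replaced by a parallel map, as illustrated in the next section.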
3. Simulation Study
To assess the accuracy and speed of this approximation, and to explore the trade-off inherent in the choice of m, we conduct a simulation study. Since the true value of the cdf is not available, the best we can do to check for accuracy is to see whether it is consistent with results obtained from direct use of the Genz and Bretz [9] QM approach. We simulate a Gaussian process observed on equally spaced grids of five different sizes, 15 × 15, 30 × 30, 50 × 50, 75 × 75 and 100 × 100. We consider two covariance models for the cdf estimation: an exponential model with range parameter 1 and an exponential model with range parameter 5, each with unit variance. This makes a total of 10 different scenarios. For each scenario, we used four different sizes of conditioning sets, choosing the m = 5, 10, 30 and 50 closest neighbors. For comparison, we computed the Genz and Bretz [9] QM method using 499 and 3,607 sample points. We use the implementation of the Genz and Bretz [9] algorithm in the mvPot package [4] for R. In principle, the accuracy and computational requirements of the QM method grow with the number of sample points (which, here, must be a prime number). Since the algorithms are stochastic, we repeated each calculation five times and plotted each replication as a dot in Figures 1, 2, 3, and 4.

Figure 1 shows the value of the estimated log cdf for all grid sizes and all estimation methods for the simulated Gaussian process with range parameter 1. The log cdf estimated with the Vecchia approximation increases with the number of neighbors until it stabilizes at 30 neighbors, after which it is consistent with the two QM approximations. This suggests that, under this scenario, it is advisable to use at least 30 neighbors in order to estimate the log cdf. For the two smaller grids, it appears that the Vecchia approximation has a similar variance to the QM approximation using 499 sample points, but a higher variance than the QM approximation using 3,607 sample points. For the larger grids, the Vecchia approximation appears to have a lower variance than both QM approximations. Figure 2 shows the same as Figure 1, but for exponential Gaussian processes with range parameter 5. The story is similar to the case with the shorter range process, except it appears that 50 neighbors may be necessary in order to stabilize the estimated log cdf. It may be the case that the number of neighbors necessary to accurately approximate the log cdf increases with the length of the dependence of the Gaussian process. Intuitively, this may occur because for processes with longer-range dependence, a smaller proportion of the information in the data may be captured by local approximations.

Figures 3 and 4 show the time required to approximate the log cdf, on a single core, for Gaussian processes with range parameters of 1 and 5, respectively. The computation time is influenced by both the number of observations and the number of neighbors used in the Vecchia approximation. Computational costs increase with the number of observations, for both the Vecchia and QM approximation methods, and also increase with the number of neighbors in the conditioning set. Oddly, the empirical computation time did not increase for the QM approximation with the larger set of sample points. For smaller grid sizes, the QM methods are faster than the Vecchia approximations, except when the size of the conditioning set is very small. For grids of size 50 × 50 and larger, computation time of the approximation using 30 neighbors was as fast as or faster than the QM method. When the number of observations is extremely large, as in the case of the 100 × 100 grid, the computation time was much lower for the Vecchia approximation compared to the QM approximation. This suggests that for high-dimensional datasets the use of the Vecchia approximation is preferable to the QM method, even if computations are done sequentially.
Since each term of the Vecchia cdf approximation (6) is independent of every other term, it is trivial to parallelize the computations. In practice, we compute all of the required low-dimensional Gaussian cdfs on the log scale, and then sum them at the end. In principle, the speedup should be linear in the number of cores used for the calculation. To explore this relationship, we compute the cdf approximation based on a Gaussian process observed at 10,000 locations, varying the number of compute cores used between 5 and 40. For each setup, we repeat the computation 15 times. Figure 5 shows the time required to compute the log cdf approximation. The computing time decreases with the number of cores. We observe roughly the expected linear relationship up to 20 cores, when a jump occurs before the time again decreases. We suspect that this behavior is a result of the particular hardware configuration we used, which consists of networked 20-core processors. That is, we guess that once an additional physical processor is engaged, which occurs beyond 20 cores, overhead costs increase and attenuate the expected computational gains. When 40 cores were used, it took less than 1 minute to compute the log cdf approximation for 10,000 observations. There are clearly some diminishing returns due to communication overhead, but in principle, this approximation could be made arbitrarily fast with a big enough computing system.
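A minimal sketch of this parallel scheme using only the base parallel package: logcdf_term() computes one log term of (6), and mclapply() distributes the terms across cores (forking, so Unix-alike systems). The inputs and the choice of 8 cores are illustrative.

```r
## Parallel evaluation of (6): one independent task per component.
library(parallel)
library(mvtnorm)

## The i-th term of (6) on the log scale (i = 1 gives log Phi(x_1)).
logcdf_term <- function(i, x, locs, Sigma, m) {
  if (i == 1) return(pnorm(x[1], sd = sqrt(Sigma[1, 1]), log.p = TRUE))
  d   <- sqrt(rowSums((locs[1:(i - 1), , drop = FALSE] -
                       matrix(locs[i, ], i - 1, 2, byrow = TRUE))^2))
  Ni  <- order(d)[seq_len(min(m, i - 1))]
  idx <- c(i, Ni)
  num <- log(pmvnorm(upper = x[idx], sigma = Sigma[idx, idx])[1])
  den <- if (length(Ni) > 1)
    log(pmvnorm(upper = x[Ni], sigma = Sigma[Ni, Ni])[1])
  else
    pnorm(x[Ni], sd = sqrt(Sigma[Ni, Ni]), log.p = TRUE)
  num - den
}

## Sum the per-component log terms, computed across 8 cores.
log_cdf <- sum(unlist(mclapply(seq_along(x), logcdf_term,
                               x = x, locs = locs, Sigma = Sigma, m = 30,
                               mc.cores = 8)))
```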
Figure 1. Estimated log cdf for exponential Gaussian processes with range parameter ρ = 1. The x-axis represents the different methods used for the cdf computation and the y-axis is the log cdf. Each point is an independent estimate of the log cdf, and each black point is the average over the replications. The Vecchia approximation seems to stabilize when at least 30 neighbors are used, and results in values that are consistent with the QM approximations.
Figure 2. Estimated log cdf for exponential Gaussian processes with range parameter ρ = 5. The x-axis represents the different methods used for the cdf computation and the y-axis is the log cdf. Each point is an independent estimate of the log cdf, and each black point is the average over the replications. For this process with longer-range dependence, the Vecchia approximation may not stabilize until at least 50 neighbors are used, when results become consistent with the QM approximations.

Figure 3. Time to estimate the cdf approximation for an exponential Gaussian process with range parameter ρ = 1. The x-axis represents the different approximation methods, and the y-axis is the computation time. Each point is an independent replication of the procedure, and the black point is the average over the replications.

Figure 4. Time to estimate the cdf approximation for an exponential Gaussian process with range parameter ρ = 5. The x-axis represents the different approximation methods, and the y-axis is the computation time. Each point is an independent replication of the procedure, and the black point is the average over the replications.

Figure 5. Time to compute the log cdf approximation for 10,000 locations, parallelized across different numbers of computing cores.
The representation defined by equation (5) and its approximation (6) calculate the joint probability as a product of univariate conditional distributions. However, it is also possible to write the full joint cdf as a cascading product of multivariate, rather than univariate, conditional cdfs. Under equation (6), it is necessary to calculate D univariate conditional probabilities, each of which requires an (m + 1)-dimensional cdf calculation. If instead we divide the components into q groups of p joint observations, such that q × p = D, we would only need to calculate the product of q conditional probabilities. However, doing so would make the dimensionality of each individual Gaussian cdf calculation in (6) between m + p and pm + p. So it would trade the cost of computing higher-dimensional cdf terms for the benefit of computing fewer terms. Such a trade-off could affect both the accuracy and computational efficiency of the approximation. Guinness [10] explored this possibility in the context of pdfs and found that it can be advantageous to consider multivariate conditional densities in the Vecchia density approximation. To explore the effect of calculating higher dimensional conditional probabilities, we calculate the log cdf approximation based on groupings of observations of different sizes; a code sketch of this grouped variant is given at the end of this section.

An additional consideration that could affect the accuracy and speed of the approximation is the construction of the conditioning sets. Using the nearest neighbors, as we have done above, requires the additional step of ordering the components by distance, which could be slow. Choosing randomly-selected conditioning sets could potentially speed up the computation by avoiding this sorting step.

Figures 6 and 7 show the estimated log cdf and time (in seconds) to compute the approximated log cdf, using the 100 × 100 grid. We used approximations based on joint conditional cdfs of dimension 2, 5, 10, 20, 30, and 50. For each grouping size, we constructed conditioning sets using 3 different methods. The first method conditions on the m most correlated observations (in this case simply the nearest neighbors) for each observation in the joint grouping, resulting in a conditioning set of size pm. The second method simply conditions on m random observations. The third method conditions on m random observations per element of the multivariate conditional calculation, again resulting in a conditioning set of size pm.
Figure 6. Estimated log cdf based on observations from an exponential Gaussian process with range parameter ρ = 1, using 3 different methods to select conditioning sets, and different dimensionalities of joint conditional observations.

From Figure 6 it is clear that simply conditioning on m random observations fails to yield an acceptable approximation. Performance can be improved by conditioning on more random observations, which is what the third method does. Method 3 shows somewhat improved behavior; however, it was only able to perform acceptably when both the dimensionality p of the joint conditional probability and the size of the conditioning set pm were large. It is clear from Figure 6 that conditioning on random neighbors is much less accurate than conditioning on the most highly correlated neighbors. For conditioning sets consisting of small numbers m of neighbors per element in the joint conditional probability, the use of a large group p of joint observations gave a better result, probably simply due to the fact that the total number pm of neighbors in the conditioning set was larger. However, when the number m of correlated neighbors gets large enough, the number p of joint observations does not seem to affect the result of the approximation.

Figure 7 shows the computation time required for all of the approximation schemes depicted in Figure 6. The clear trend is that choosing a small conditioning set of random observations is very fast (middle panel), using higher-dimensional joint conditional cdfs is slower than using lower-dimensional joint conditional cdfs (all panels), and for the same size conditioning set, the time required to find the nearest neighbors is not a major bottleneck (right and left panels). This conclusion differs from the exploration of the same issues, in the context of the pdf, found in Guinness [10]. There, using higher-dimensional joint conditional calculations was found to be beneficial, and the time required to find nearest neighbors was substantial enough to warrant the use of a fast approximate ordering algorithm.
In the case of the cdf approximation, code profiling confirmed that the time required to order the observations was insignificant, with the overwhelming majority of the computation time being used in calculating the lower-dimensional joint cdfs using the QM technique.
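To make the grouped variant concrete, here is a sketch (under the same zero-mean assumptions as the earlier sketches) that evaluates the cdf over blocks of size p, conditioning each block on the union of the m nearest previously-ordered locations of its elements, so that conditioning sets have size at most pm.

```r
## Sketch of the grouped Vecchia cdf approximation: blocks of p components,
## each conditioned on at most p*m earlier components chosen by distance.
library(mvtnorm)

grouped_logcdf <- function(x, locs, Sigma, m = 10, p = 5) {
  D      <- length(x)
  blocks <- split(seq_len(D), ceiling(seq_len(D) / p))   # q = ceiling(D/p) blocks
  done   <- integer(0)            # indices already incorporated
  ll     <- 0
  for (B in blocks) {
    if (length(done) == 0) {      # first block: unconditional p-dim cdf
      ll <- ll + log(pmvnorm(upper = x[B], sigma = Sigma[B, B])[1])
    } else {
      ## union over block elements of their m nearest earlier locations
      NB <- unique(unlist(lapply(B, function(i) {
        d <- sqrt(rowSums((locs[done, , drop = FALSE] -
                           matrix(locs[i, ], length(done), 2, byrow = TRUE))^2))
        done[order(d)[seq_len(min(m, length(done)))]]
      })))
      idx <- c(B, NB)
      ll  <- ll + log(pmvnorm(upper = x[idx], sigma = Sigma[idx, idx])[1]) -
                  log(pmvnorm(upper = x[NB], sigma = as.matrix(Sigma[NB, NB]))[1])
    }
    done <- c(done, B)
  }
  ll
}
```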
Figure 7. Time to estimate the log cdf for an exponential Gaussian process with range parameter ρ = 1, using 3 different methods to select conditioning sets, various numbers of neighbors, and various numbers of joint observations.
4. Example: A Gaussian Scale Mixture for Spatial Extremes
Recent advances in the statistics of extremal spatial phenomena have produced models that are flexible enough to accommodate both strong and weak spatial dependence in the far joint tails. One prominent strategy for achieving this is to construct scale mixtures of Gaussian processes, where the mixing distribution is chosen carefully so as to produce the desired tail dependence characteristics [12,14,16,17]. The preferred flavor of maximum likelihood inference for these models requires computing a Gaussian cdf whose dimension is roughly equal to the number of spatial locations in the dataset. Other state-of-the-art models for spatial extremes also rely on high-dimensional Gaussian cdfs [3,19,21]. To show the usefulness of our cdf approximation, we analyze data from precipitation gauges in Europe using the Gaussian scale mixture model from Huser et al. [14], which we describe below.

The class of scale mixtures of Gaussian processes is defined generically by

X(s) = R × W(s),    R ∼ F_R ⊥⊥ W(s).    (7)

Here, W(s) is a standard Gaussian process (i.e. with unit variance) on some domain D indexed by s ∈ D. For a collection of D observations, the finite dimensional distribution of the Gaussian component is W ∼ N_D(0, Σ(θ)), where Σ(θ) is a D × D covariance matrix constructed using a chosen covariance model that is indexed by parameter θ. The random scaling R comes from distribution F_R. The choice of F_R is critical and determines the strength of the tail dependence in the resulting model [5]. A key quantity for summarizing the strength of tail dependence is the conditional probability χ_u(s_i, s_j) = P{X(s_i) > u | X(s_j) > u}, for spatial locations s_i and s_j. If lim_{u→∞} χ_u(s_i, s_j) = 0 for all s_i, s_j ∈ D, we say that X(s) is asymptotically independent, while if lim_{u→∞} χ_u(s_i, s_j) > 0 for all s_i, s_j ∈ D, we say that X(s) is asymptotically dependent.

While many choices are available for the mixing distribution F_R, Huser et al. [14] suggest the parametric model defined by equation (8). When β > 0, the mixture process X(s) is asymptotically independent, and when β = 0, X(s) is asymptotically dependent. Therefore, this class of scale mixtures is rich enough to include both asymptotic independence and asymptotic dependence as nontrivial sub-models.

F_R(r) = 1 − exp{−γ(r^β − 1)/β} for β > 0;    F_R(r) = 1 − r^{−γ} for β = 0,    r ≥ 1.    (8)

To construct the likelihood for maximum likelihood estimation, we must integrate R out of the model (7). Equations (9) and (10) show the marginal multivariate cdf and pdf, respectively, for a finite collection of observations X from X(s) defined in (7). Here Φ_D represents the D-dimensional multivariate cdf of a Gaussian distribution with mean vector 0 and covariance matrix Σ(θ), and φ_D represents the D-dimensional multivariate pdf of a Gaussian distribution with mean 0 and covariance matrix Σ(θ). There are no closed forms for these expressions, so it is necessary to use numerical methods to evaluate the (one-dimensional) integrals.

G(x) = ∫_0^∞ Φ_D(x/r; Σ) f_R(r) dr    (9)

g(x) = ∫_0^∞ φ_D(x/r; Σ) r^{−D} f_R(r) dr.    (10)

The preferred strategy for maximum likelihood estimation of extremal dependence models is to treat all observations falling below a high threshold as left censored [13]. This leads to a favorable balance between using the data as efficiently as possible, while not allowing data in the bulk of the distribution to have a large effect on dependence estimation. The censored likelihood for each temporal replicate is obtained by taking one partial derivative of (9) for every observation that falls above the threshold. Thus, (10) is the relevant likelihood when all observations, at one particular temporal replicate, are above the threshold, so nothing is censored. However, since the threshold is chosen to be a high quantile to prioritize inference on the tail, most observations are usually censored for any temporal replicate. When all observations fall below the threshold, the relevant likelihood is (9).

Most often, in any temporal replicate, there will be a mixture of observations above and below the threshold. In this case, the relevant joint likelihood of x is defined by equation (11), which results from taking partial derivatives of (9) with respect to only the un-censored observations. If we let I be the set of points above the threshold and I^c be the points below, then

G_I(x) := ∂^{|I|}/∂x_I G(x) = ∫_0^∞ ∂^{|I|}/∂x_I Φ_D(x/r; Σ) f_R(r) dr
        = ∫_0^∞ Φ_{|I^c|}{ (x_{I^c} − Σ_{I^c;I} Σ_{I;I}^{−1} x_I)/r ; Σ_{I^c|I} } φ_{|I|}(x_I/r; Σ_{I;I}) r^{−|I|} f_R(r) dr,    (11)

where dependence of the covariance matrices on θ is suppressed for brevity, and the notation Σ_{A;B} refers to the rows and columns of Σ pertinent to the points in A and B. The matrix Σ_{I^c|I} = Σ_{I^c;I^c} − Σ_{I^c;I} Σ_{I;I}^{−1} Σ_{I;I^c} is the covariance matrix of the conditional normal distribution of the censored observations given the un-censored observations. The computational issue arises because the integrand in (11) contains a Gaussian cdf of dimension |I^c|, the number of censored observations in a temporal replicate. Again, for most replicates, this number |I^c| is close to the total number of observation locations D because the censoring threshold is chosen to be high, such that most observations fall below the threshold and are therefore censored.
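As an illustration of how (11) reduces to a one-dimensional numerical integral, consider the following sketch for the β > 0 case of (8), with f_R obtained by differentiating (8). The |I^c|-dimensional Gaussian cdf inside the integrand is exactly the term that the Vecchia approximation of Section 2 replaces in practice; here we call pmvnorm directly to keep the sketch short, and all variable names are illustrative.

```r
## Sketch of the censored likelihood contribution (11) for one replicate.
library(mvtnorm)

## Density of the mixing distribution (8) for beta > 0 (support r >= 1).
f_R <- function(r, gamma = 1, beta) {
  gamma * r^(beta - 1) * exp(-gamma * (r^beta - 1) / beta)
}

## x: data vector on the scale of X; I: indices above the threshold.
censored_lik <- function(x, I, Sigma, beta, gamma = 1) {
  Ic    <- setdiff(seq_along(x), I)
  SIIi  <- solve(Sigma[I, I, drop = FALSE])
  ScI   <- Sigma[Ic, I, drop = FALSE]
  Scond <- Sigma[Ic, Ic] - ScI %*% SIIi %*% t(ScI)    # Sigma_{I^c | I}
  resid <- as.numeric(x[Ic] - ScI %*% SIIi %*% x[I])  # x_{I^c} minus cond. mean
  integrand <- function(r) {
    sapply(r, function(ri)
      pmvnorm(upper = resid / ri, sigma = Scond)[1] *  # |I^c|-dim Gaussian cdf
        dmvnorm(x[I] / ri, sigma = Sigma[I, I]) *      # |I|-dim Gaussian pdf
        ri^(-length(I)) * f_R(ri, gamma, beta))
  }
  integrate(integrand, lower = 1, upper = Inf)$value
}
```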
Our dataset consists of weekly maximum precipitation observations between January 2000 and April 2019, in the western and central region of continental Europe, north of the Pyrenees, the Alps, and the Carpathians. The 6 countries we consider are Germany, Poland, the Netherlands, Belgium, the Czech Republic, and France. Figure 8 shows the locations of the observation stations distributed over Europe. This dataset consists of 1,006 weekly maxima from D = 528 weather stations. For context, the computational bottleneck from the Gaussian cdf limited the analysis in Huser et al. [14] to a dataset of D = 12 locations, even though the analysis was performed on a large high-performance computing cluster. We use the weekly maximum of the daily accumulations at each location to break temporal dependence that might arise from storms that persist for more than one day. Out of the 531,168 total observations, 32.6% were missing values. For each weekly maximum, only the available data was used for estimation, and all missing observations were disregarded.

Figure 8. D = 528 weather stations located over 6 European countries.

The covariance model we use for the underlying Gaussian processes is an anisotropic exponential, Σ_{ij}(θ) = exp{−h_{ij}/ρ}, where ρ is the range parameter and h_{ij} is the Mahalanobis distance between locations s_i and s_j. The Mahalanobis distance is parametrized as h_{ij}^2 = Ω Ω′, where

Ω = (s_i − s_j)′ [ cos(φ)  −sin(φ) ; sin(φ)  cos(φ) ]^{−1} [ 1  0 ; 0  A ]^{−1},

for rotation angle φ ∈ [0, π) and aspect ratio A > 1. Thus, after fixing the mixing parameter γ at 1, as it plays a much less significant role than the parameter β in determining tail dependence characteristics, we arrive at a total of 4 parameters to estimate, ψ = (β, ρ, φ, A)^T.

The first step in estimating the dependence is to transform the observations to be on the same marginal scale. To do this, we start by applying a rank transformation to standard uniform, independently for each station. That is, for each station k = 1, ..., D and each time point t = 1, ..., T, the observation X_{kt} on the uniform scale is

U_{kt} = rank(X_{kt}) / (T + 1).

We next choose a high threshold to be the 0.95 marginal empirical quantile at each location. Then, denoting the marginal cdf and pdf of each X_{kt}, respectively, as G_M(x) = ∫_0^∞ Φ(x/r) f_R(r) dr and g_M(x) = ∫_0^∞ φ(x/r) r^{−1} f_R(r) dr (we assume stationarity, so the marginal distribution is assumed to be the same at each location), and letting the vector v_t = (max{u_{1t}, 0.95}, ..., max{u_{Dt}, 0.95})^T, the censored copula likelihood for each time replicate t is

L(ψ; v_t) =
  G{G_M^{−1}(v_{1t}), ..., G_M^{−1}(v_{Dt})}    if all obs. are below the threshold,
  g{G_M^{−1}(v_{1t}), ..., G_M^{−1}(v_{Dt})} / ∏_{k=1}^{D} g_M{G_M^{−1}(v_{kt})}    if all obs. are above the threshold,
  G_{I_t}{G_M^{−1}(v_{1t}), ..., G_M^{−1}(v_{Dt})} / ∏_{k∈I_t} g_M{G_M^{−1}(v_{kt})}    if some obs. are above and some below the threshold.

Finally, the log likelihood across all time points t for the parameter vector ψ is

l(ψ; v) = ∑_{t=1}^{T} log L(ψ; v_t).

We found the maximum likelihood estimator (MLE) by applying the Nelder-Mead numerical optimizer in the R function optim. MLEs are shown in Table 1. The MLE for the mixing parameter β is 0.82, which in this context is fairly far away from zero, far enough to strongly suggest that the process is asymptotically independent. The MLEs for the anisotropy parameters suggest pronounced eccentricity. To interpret and visualize the estimated dependence model implied by the MLEs shown in Table 1, we plot level curves of the resulting χ_u function for u = 0.95 on the quantile scale, shown in Figure 8. Each ellipse represents a constant value of χ_{u=0.95}(s) = P{G_M[X(s)] > 0.95 | G_M[X(s_0)] > 0.95}, for an arbitrarily-chosen reference point s_0 near the center of the map. The level curves are ellipses due to the anisotropic construction, with the major axis roughly along a northeast-southwest orientation, and joint exceedances more likely with decreasing distance from s_0.

Parameter   MLE
ρ
β           0.82
φ
A

Table 1. Maximum likelihood estimates of dependence parameters.
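For completeness, a short sketch of the marginal pre-processing described above, assuming a T × D data matrix X_mat of weekly maxima (a hypothetical name) with NAs for missing values; with missing weeks disregarded, the per-station count of non-missing observations plays the role of T in the rank transform.

```r
## Rank-transform each station to the uniform scale and censor at the
## empirical 0.95 quantile, as in the text (missing values are disregarded).
U <- apply(X_mat, 2, function(x) rank(x, na.last = "keep") / (sum(!is.na(x)) + 1))
V <- pmax(U, 0.95)            # censored copula data; row t holds v_t
I_t <- which(V[1, ] > 0.95)   # uncensored stations in replicate t = 1
```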
5. Discussion
The main objective of this paper was to propose a fast approximation to high-dimensional Gaussian cdfs that arise from spatial Gaussian processes. We modified Vecchia's approximation for Gaussian pdfs to the context of Gaussian cdfs. Simulations showed that for large numbers of locations and relatively small conditioning sets, this approximation gives results consistent with state-of-the-art QM methods, and reduces computational time considerably, even when computations are performed sequentially. Furthermore, the approximation is trivially easy to code in parallel using standard R packages, and requires no linking to specialized software libraries.

We demonstrated the utility of our fast cdf approximation by using it to find maximum censored likelihood estimates for the scale mixture model of Huser et al. [14]. This model is attractive because of its flexible tail dependence characteristics, but is hampered by computational difficulties arising from the need to compute high-dimensional Gaussian cdfs during inference. We fit this model to a precipitation dataset consisting of over 500 spatial locations, whereas previous efforts using conventional QM techniques were limited to just 12 locations.

One drawback that we noticed during the data analysis is that conventional optimization routines had trouble converging, due to the stochastic nature of the likelihood objective function. For future studies, one possible approach to circumventing this problem is to use stochastic optimization algorithms, which may be better suited to optimizing random objective functions.

Acknowledgements
This research was supported in part by NSF grant DMS-1752280. Computations for this research were performed on the Institute for Computational and Data Sciences Advanced CyberInfrastructure (ICDS-ACI).
References

[1] Reinaldo B. Arellano-Valle and Adelchi Azzalini. On the unification of families of skew-normal distributions. Scand. J. Statist., 33(3):561–574, 2006. https://doi.org/10.1111/j.1467-9469.2006.00503.x
[2] Jian Cao, Marc G. Genton, David E. Keyes, and George M. Turkiyyah. Hierarchical-block conditioning approximations for high-dimensional multivariate normal probabilities. Stat. Comput., 29(3):585–598, 2019. https://doi.org/10.1007/s11222-018-9825-3
[3] R. de Fondeville and A. C. Davison. High-dimensional peaks-over-threshold inference. Biometrika, 105(3):575–592, 2018. https://doi.org/10.1093/biomet/asy026
[4] Raphael de Fondeville and Leo Belzile. mvPot: Multivariate Peaks-over-Threshold Modelling for Spatial Extreme Events, 2018. R package version 0.1.4. https://CRAN.R-project.org/package=mvPot
[5] Sebastian Engelke, Thomas Opitz, and Jennifer Wadsworth. Extremal dependence of random scale constructions. Extremes, 22(4):623–666, 2019. https://doi.org/10.1007/s10687-019-00353-3
[6] Marc G. Genton, David E. Keyes, and George Turkiyyah. Hierarchical decompositions for the computation of high-dimensional multivariate normal probabilities. J. Comput. Graph. Statist., 27(2):268–277, 2018. https://doi.org/10.1080/10618600.2017.1375936
[7] Alan Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1(2):141–149, 1992.
[8] Alan Genz. Numerical computation of rectangular bivariate and trivariate normal and t probabilities. Stat. Comput., 14(3):251–260, 2004. https://doi.org/10.1023/B:STCO.0000035304.20635.31
[9] Alan Genz and Frank Bretz. Computation of Multivariate Normal and t Probabilities, volume 195 of Lecture Notes in Statistics. Springer, Dordrecht, 2009. https://doi.org/10.1007/978-3-642-01689-9
[10] Joseph Guinness. Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429, 2018. https://doi.org/10.1080/00401706.2018.1437476
[11] Wolfgang Hackbusch. Hierarchical Matrices: Algorithms and Analysis, volume 49 of Springer Series in Computational Mathematics. Springer, Heidelberg, 2015. https://doi.org/10.1007/978-3-662-47324-5
[12] Raphaël Huser and Jennifer L. Wadsworth. Modeling spatial processes with unknown extremal dependence class. J. Amer. Statist. Assoc., 114(525):434–444, 2019. https://doi.org/10.1080/01621459.2017.1411813
[13] Raphaël Huser, Anthony C. Davison, and Marc G. Genton. Likelihood estimators for multivariate extremes. Extremes, 19(1):79–103, 2016. https://doi.org/10.1007/s10687-015-0230-4
[14] Raphaël Huser, Thomas Opitz, and Emeric Thibaud. Bridging asymptotic independence and dependence in spatial extremes using Gaussian scale mixtures. Spat. Stat., 21(part A):166–186, 2017. https://doi.org/10.1016/j.spasta.2017.06.004
[15] Tetsuhisa Miwa, A. J. Hayter, and Satoshi Kuriki. The evaluation of general non-centred orthant probabilities. J. R. Stat. Soc. Ser. B Stat. Methodol., 65(1):223–234, 2003. https://doi.org/10.1111/1467-9868.00382
[16] Samuel A. Morris, Brian J. Reich, Emeric Thibaud, and Daniel Cooley. A space-time skew-t model for threshold exceedances. Biometrics, 73(3):749–758, 2017. https://doi.org/10.1111/biom.12644
[17] Thomas Opitz. Modeling asymptotically independent spatial extremes based on Laplace random fields. Spat. Stat., 16:1–18, 2016. https://doi.org/10.1016/j.spasta.2016.01.001
[18] Michael L. Stein, Zhiyi Chi, and Leah J. Welty. Approximating likelihoods for large spatial data sets. J. R. Stat. Soc. Ser. B Stat. Methodol., 66(2):275–296, 2004.
[19] Emeric Thibaud, Juha Aalto, Daniel S. Cooley, Anthony C. Davison, and Juha Heikkinen. Bayesian inference for the Brown-Resnick process, with an application to extreme low temperatures. Ann. Appl. Stat., 10(4):2303–2324, 2016. https://doi.org/10.1214/16-AOAS980
[20] A. V. Vecchia. Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society. Series B (Methodological), 50(2):297–312, 1988.
[21] Jennifer L. Wadsworth and Jonathan A. Tawn. Efficient inference for spatial extreme value processes associated to log-Gaussian random functions. Biometrika, 101(1):1–15, 2014. https://doi.org/10.1093/biomet/ast042