Gaussian Process Learning via Fisher Scoring of Vecchia's Approximation
Joseph Guinness
Cornell University, Department of Statistics and Data Science
Abstract
We derive a single pass algorithm for computing the gradient and Fisher information of Vecchia's Gaussian process loglikelihood approximation, which provides a computationally efficient means for applying the Fisher scoring algorithm for maximizing the loglikelihood. The advantages of the optimization techniques are demonstrated in numerical examples and in an application to Argo ocean temperature data. The new methods are more accurate and much faster than an optimization method that uses only function evaluations, especially when the covariance function has many parameters. This allows practitioners to fit nonstationary models to large spatial and spatial-temporal datasets.
1 Introduction

The Gaussian process model is an indispensable tool for the analysis of spatial and spatial-temporal datasets and has become increasingly popular as a general-purpose model for functions. Because of its high computational burden, researchers have devoted substantial effort to developing numerical approximations for Gaussian process computations. Much of the work focuses on efficient approximation of the likelihood function. Fast likelihood evaluations are crucial for optimization procedures that require many evaluations of the likelihood, such as the default Nelder-Mead algorithm (Nelder and Mead, 1965) in the R optim function. The likelihood must be repeatedly evaluated in MCMC algorithms as well.

Compared to the amount of literature on efficient likelihood approximations, there has been considerably less development of techniques for numerically maximizing the likelihood (see Geoga et al. (2018) for one recent example). This article aims to address the disparity by providing:

1. Formulas for evaluating the gradient and Fisher information for Vecchia's likelihood approximation in a single pass through the data, so that the Fisher scoring algorithm can be applied. Fisher scoring is a modification of the Newton-Raphson optimization method, replacing the Hessian matrix with the Fisher information matrix.

2. Numerical examples with simulated and real data demonstrating the practical advantages that the new techniques provide over an optimizer that uses function evaluations alone.

Among the sea of Gaussian process approximations proposed over the past several decades, Vecchia's approximation (Vecchia, 1988) has emerged as a leader. It can be computed in linear time and with linear memory burden, and it can be parallelized. Maximizing the approximation corresponds to solving a set of unbiased estimating equations, leading to desirable statistical properties (Stein et al., 2004). It is general in that it requires neither gridded data nor a stationary model assumption. The approximation forms a valid multivariate normal model, and so it can be used for simulation and conditional simulation. As an approximation to the target model, it is highly accurate relative to competitors (Guinness, 2018). Vecchia's approximation also forms a conceptual hub in the space of Gaussian process approximations, since a generalization includes many well-known approximations as special cases (Katzfuss and Guinness, 2017). Lastly, there are publicly available R packages implementing it (Finley et al., 2017; Guinness and Katzfuss, 2018).

The numerical examples in this paper show that, in realistic data and model scenarios, the new techniques offer significant computational advantages over default optimization techniques. Although it is more expensive to evaluate the gradient and Fisher information in addition to the likelihood, the Fisher scoring algorithm converges in a small number of iterations, leading to a large advantage in total computing time over an optimization method that uses only the likelihood. For isotropic Matérn models, the speedup is roughly 2 to 4 times, and on more complicated models with more parameters, the new techniques can be more than 40 times faster. This is a significant practical improvement that will be attractive to practitioners choosing among various methods.

2 Vecchia's Approximation

Let s_1, ..., s_n be locations in a domain D. At each s_i, we observe a scalar response y_i, collected into the column vector y = (y_1, ..., y_n)^T. Along with the response, we observe covariates x_i = (x_{i1}, ..., x_{ip}), collected into an n × p design matrix X. In the Gaussian process model, we model y as a multivariate normal vector Y with expected value E(Y) = Xβ (β ∈ R^p) and covariance matrix E((Y − Xβ)(Y − Xβ)^T) = Σ_θ, where the (i, j) entry of Σ_θ is K_θ(s_i, s_j). The function K_θ is positive definite on D × D and depends on covariance parameters θ. The loglikelihood for β and θ is

log f_{β,θ}(y) = −(n/2) log(2π) − (1/2) log det Σ_θ − (1/2)(y − Xβ)^T Σ_θ^{−1} (y − Xβ).   (1)

Unless Σ_θ has some exploitable structure, evaluation of the loglikelihood involves storing the n² entries of Σ_θ and performing O(n³) floating point operations to obtain the Cholesky factor of Σ_θ, both of which are computationally prohibitive when n is large.
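To make the burden concrete, the following R sketch evaluates the exact loglikelihood (1) by Cholesky factorization; the O(n³) cost of chol() is exactly what Vecchia's approximation avoids. This is our own illustration, not code from the paper, and expcov is a stand-in exponential covariance.

# Exact Gaussian process loglikelihood (1) via Cholesky factorization.
# Illustrative only: 'expcov' is a stand-in exponential covariance function.
expcov <- function(locs, sigmasq, alpha) {
  D <- as.matrix(dist(locs))            # n x n distance matrix
  sigmasq * exp(-D / alpha)
}

gp_loglik <- function(y, X, beta, Sigma) {
  n <- length(y)
  L <- t(chol(Sigma))                   # lower triangular factor, O(n^3) flops
  r <- forwardsolve(L, y - X %*% beta)  # solves L r = y - X beta
  logdet <- 2 * sum(log(diag(L)))       # log det Sigma from the Cholesky diagonal
  -n / 2 * log(2 * pi) - logdet / 2 - sum(r^2) / 2
}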
Vecchia's loglikelihood approximation is a modification of the conditional representation of a joint density function. Let g(1) = ∅, let g(i) ⊂ (1, ..., i − 1), and let y_{g(i)} be the corresponding subvector of y. Vecchia's loglikelihood approximation is

ℓ(β, θ) = sum_{i=1}^n log f_{β,θ}(y_i | y_{g(i)}),   (2)

leading to computational savings when |g(i)| is small. As mentioned in the introduction, Vecchia's likelihood approximation corresponds to a valid multivariate normal distribution with mean Xβ and a covariance matrix Σ̃_θ. To motivate why obtaining the gradient and Fisher information poses an analytical challenge, consider the partial derivative of Vecchia's loglikelihood with respect to θ_j:

∂ℓ(β, θ)/∂θ_j = (1/2)(y − Xβ)^T Σ̃_θ^{−1} (∂Σ̃_θ/∂θ_j) Σ̃_θ^{−1} (y − Xβ) − (1/2) Tr( Σ̃_θ^{−1} ∂Σ̃_θ/∂θ_j ),   (3)

where ∂Σ̃_θ/∂θ_j is an n × n matrix of partial derivatives of Σ̃_θ with respect to θ_j. Not only is ∂Σ̃_θ/∂θ_j too large to store in memory, the covariances Σ̃_θ are not easily computable, nor are their partial derivatives. In the next section, we outline a simple reframing of Vecchia's likelihood that leads to a computationally tractable method of evaluating the gradient and Fisher information.
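To make (2) concrete, here is a minimal R sketch for a mean-zero process, conditioning each observation on at most m nearest previously ordered neighbors. The function and variable names are ours, not GpGp's, and covfun(l1, l2) stands for any user-supplied function returning the cross-covariance matrix between two sets of locations.

# Minimal sketch of Vecchia's loglikelihood (2), mean-zero case.
# covfun(l1, l2) must return the cross-covariance matrix of two location sets.
vecchia_loglik <- function(y, locs, covfun, m = 10) {
  n <- length(y)
  ll <- 0
  for (i in 1:n) {
    # g(i): up to m nearest previously ordered neighbors (empty for i = 1)
    g <- integer(0)
    if (i > 1) {
      d <- sqrt(colSums((t(locs[1:(i - 1), , drop = FALSE]) - locs[i, ])^2))
      g <- order(d)[1:min(m, i - 1)]
    }
    sii <- drop(covfun(locs[i, , drop = FALSE], locs[i, , drop = FALSE]))
    if (length(g) == 0) {
      cmean <- 0; cvar <- sii
    } else {
      A <- covfun(locs[g, , drop = FALSE], locs[g, , drop = FALSE])
      b <- covfun(locs[g, , drop = FALSE], locs[i, , drop = FALSE])
      cmean <- drop(crossprod(b, solve(A, y[g])))    # conditional mean
      cvar  <- sii - drop(crossprod(b, solve(A, b))) # conditional variance
    }
    ll <- ll + dnorm(y[i], mean = cmean, sd = sqrt(cvar), log = TRUE)
  }
  ll
}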
3 Derivations for Single Pass Algorithm

To derive formulas for the gradient and Fisher information, it is helpful to rewrite the conditional likelihoods in terms of marginals. To this end, define u_i = y_{g(i)} and v_i = (y_{g(i)}, y_i). Define the design matrices for u_i and v_i, respectively, as Q_i and R_i, and define the covariance matrices for u_i and v_i, respectively, as A_i and B_i (suppressing dependence on θ). The notation is chosen to follow the mnemonic device that the first of the two letters alphabetically is a subvector or submatrix of the second letter. Vecchia's loglikelihood can then be rewritten as

ℓ(β, θ) = sum_{i=1}^n [ log f_{β,θ}(v_i) − log f_{β,θ}(u_i) ]   (4)
= −(1/2) sum_{i=1}^n [ log det B_i − log det A_i ]   (5)
  −(1/2) sum_{i=1}^n [ (v_i − R_i β)^T B_i^{−1} (v_i − R_i β) − (u_i − Q_i β)^T A_i^{−1} (u_i − Q_i β) ] − (n/2) log(2π).   (6)

Our proposed algorithm for obtaining the likelihood, gradient, and Fisher information involves computing the following quantities in a single pass through the data:

(logdet) = sum_{i=1}^n ( log det B_i − log det A_i )   (7)
(dlogdet_j) = sum_{i=1}^n ( Tr(B_i^{−1} B_{i,j}) − Tr(A_i^{−1} A_{i,j}) )   (8)
(ySy) = sum_{i=1}^n ( v_i^T B_i^{−1} v_i − u_i^T A_i^{−1} u_i )   (9)
(XSy) = sum_{i=1}^n ( R_i^T B_i^{−1} v_i − Q_i^T A_i^{−1} u_i )   (10)
(XSX) = sum_{i=1}^n ( R_i^T B_i^{−1} R_i − Q_i^T A_i^{−1} Q_i )   (11)
(dySy_j) = −sum_{i=1}^n ( v_i^T B_i^{−1} B_{i,j} B_i^{−1} v_i − u_i^T A_i^{−1} A_{i,j} A_i^{−1} u_i )   (12)
(dXSy_j) = −sum_{i=1}^n ( R_i^T B_i^{−1} B_{i,j} B_i^{−1} v_i − Q_i^T A_i^{−1} A_{i,j} A_i^{−1} u_i )   (13)
(dXSX_j) = −sum_{i=1}^n ( R_i^T B_i^{−1} B_{i,j} B_i^{−1} R_i − Q_i^T A_i^{−1} A_{i,j} A_i^{−1} Q_i )   (14)
(Tr_jk) = sum_{i=1}^n [ Tr(B_i^{−1} B_{i,j} B_i^{−1} B_{i,k}) − Tr(A_i^{−1} A_{i,j} A_i^{−1} A_{i,k}) ],   (15)

where A_{i,j} and B_{i,j} are the matrices of partial derivatives of A_i and B_i, respectively, with respect to θ_j. The quantities having the form (d*_j) are simply the partial derivatives of the corresponding quantity (*) with respect to θ_j. Each of these quantities can be updated at each i = 1, ..., n, and so all can be evaluated in a single pass through the data. We refer to them collectively as our single-pass quantities.
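As an illustration of how the accumulation can be organized (our sketch, not the GpGp implementation), the loop below updates the quantities (7), (9), (10), and (11) for a single value of θ; the derivative quantities (8) and (12)-(15) accumulate in the same loop from the matrices A_{i,j} and B_{i,j}. The helper get_uv(i), which would return v_i, R_i, B_i, u_i, Q_i, and A_i, is hypothetical.

# Sketch: accumulate the single-pass quantities (7), (9), (10), (11).
# get_uv(i) is a hypothetical helper; p is the number of covariates.
single_pass <- function(n, p, get_uv) {
  logdet <- 0; ySy <- 0
  XSy <- matrix(0, p, 1); XSX <- matrix(0, p, p)
  for (i in 1:n) {
    s <- get_uv(i)   # list(v, R, B, u, Q, A); u is empty when g(i) is empty
    Bi_v <- solve(s$B, s$v)
    logdet <- logdet + as.numeric(determinant(s$B)$modulus)  # log det B_i
    ySy <- ySy + sum(s$v * Bi_v)
    XSy <- XSy + t(s$R) %*% Bi_v
    XSX <- XSX + t(s$R) %*% solve(s$B, s$R)
    if (length(s$u) > 0) {   # subtract the marginal terms for u_i
      Ai_u <- solve(s$A, s$u)
      logdet <- logdet - as.numeric(determinant(s$A)$modulus)
      ySy <- ySy - sum(s$u * Ai_u)
      XSy <- XSy - t(s$Q) %*% Ai_u
      XSX <- XSX - t(s$Q) %*% solve(s$A, s$Q)
    }
  }
  list(logdet = logdet, ySy = ySy, XSy = XSy, XSX = XSX)
}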
3.1 Profile Likelihood, Gradient, and Fisher Information

Given covariance parameter θ, denote the maximum Vecchia likelihood estimate of β as β̂(θ). Since β̂(θ) has a closed form expression (Section 3.2), we can maximize the profile likelihood ℓ(β̂(θ), θ) over θ alone. The profile likelihood can be written in terms of our single-pass quantities as

ℓ(β̂(θ), θ) = −(n/2) log(2π) − (1/2)(logdet) − (1/2)[ (ySy) − 2 (XSy)^T β̂(θ) + β̂(θ)^T (XSX) β̂(θ) ].   (16)

Therefore the partial derivatives can also be written in terms of the single-pass quantities as

∂ℓ(β̂(θ), θ)/∂θ_j = −(1/2)(dlogdet_j) − (1/2)[ (dySy_j) − 2 (dXSy_j)^T β̂(θ) + β̂(θ)^T (dXSX_j) β̂(θ) ] − (1/2)[ −2 (XSy)^T ∂β̂(θ)/∂θ_j + 2 β̂(θ)^T (XSX) ∂β̂(θ)/∂θ_j ],   (17)

where ∂β̂(θ)/∂θ_j is the column vector of partial derivatives of the p entries of β̂(θ) with respect to covariance parameter θ_j. The Fisher information for θ is

I(θ)_{jk} = (1/2) sum_{i=1}^n [ Tr(B_i^{−1} B_{i,j} B_i^{−1} B_{i,k}) − Tr(A_i^{−1} A_{i,j} A_i^{−1} A_{i,k}) ] = (1/2)(Tr_jk).

3.2 Profile Likelihood Estimate of Mean Parameters

It remains to be shown that β̂(θ) and ∂β̂(θ)/∂θ_j can be computed using our single-pass quantities. The profile likelihood estimate β̂(θ) satisfies ∂ℓ(β, θ)/∂β_j = 0 for every j = 1, ..., p. These partial derivatives are

( ∂ℓ(β, θ)/∂β_1, ..., ∂ℓ(β, θ)/∂β_p )^T = sum_{i=1}^n [ R_i^T B_i^{−1} (v_i − R_i β) − Q_i^T A_i^{−1} (u_i − Q_i β) ],   (18)

giving the equation

[ sum_{i=1}^n ( R_i^T B_i^{−1} R_i − Q_i^T A_i^{−1} Q_i ) ] β̂(θ) = [ sum_{i=1}^n ( R_i^T B_i^{−1} v_i − Q_i^T A_i^{−1} u_i ) ].   (19)

Therefore, the profile likelihood estimate of β is

β̂(θ) = (XSX)^{−1} (XSy),   (20)

a function of our single-pass quantities. Taking partial derivatives with respect to θ_j yields

∂β̂(θ)/∂θ_j = (XSX)^{−1} (dXSy_j) − (XSX)^{−1} (dXSX_j) (XSX)^{−1} (XSy),   (21)

also a function of our single-pass quantities.
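Putting the pieces together, one Fisher scoring iteration with the step-halving safeguard used in our numerical studies might look like the following sketch; obj(theta) is a hypothetical function returning the profile loglikelihood (16), gradient (17), and Fisher information at theta.

# One Fisher scoring update with step halving (illustrative sketch).
# obj(theta) is hypothetical, returning list(loglik, grad, fisher).
fisher_step <- function(theta, obj, max_halving = 10) {
  cur <- obj(theta)
  step <- solve(cur$fisher, cur$grad)  # Newton-type step, Hessian ~ Fisher info
  for (h in 0:max_halving) {
    theta_new <- theta + step / 2^h    # halve the step until loglik increases
    if (obj(theta_new)$loglik > cur$loglik) return(theta_new)
  }
  theta + 1e-4 * cur$grad              # fall back to a small gradient step (scale arbitrary)
}

Iteration repeats until the inner product between the step and the gradient falls below a tolerance, as described in the next section.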
4 Numerical Studies

This section contains timing results, comparing the R optim implementation of the Nelder-Mead algorithm to the Fisher scoring algorithm. In Fisher scoring, we reject steps that do not increase the loglikelihood, dividing the step size by 2 iteratively. If we cannot increase the loglikelihood along the Fisher step direction, we attempt to step along the gradient. For both Nelder-Mead and Fisher scoring, we first generate estimates using Vecchia's approximation with |g(i)| = 10 nearest neighbors, then refine the estimates using |g(i)| = 30. In Nelder-Mead, we evaluate only the likelihood, not the gradient and Fisher information. The Fisher scoring algorithm stops when the dot product between the step and the gradient is less than 10^{−4}. Default stopping criteria were used for the Nelder-Mead algorithm.

We simulate all datasets from the same model:

Y(s) = µ + Z(s) + ε(s),   (22)

where µ = 0, Z is a Gaussian process with exponential covariance function K(s_1, s_2) = σ² exp(−||s_1 − s_2||/α), and ε(s) are i.i.d. N(0, τ²) with τ² = 0.2. We take (σ², α) = (2, …). In addition to the exponential covariance with unknown variance and range, we estimate parameters in three covariance models that generalize the exponential:

K(s_1, s_2) = σ²/(Γ(ν) 2^{ν−1}) (||s_1 − s_2||/α)^ν K_ν(||s_1 − s_2||/α)   (23)
K(s_1, s_2) = σ²/(Γ(ν) 2^{ν−1}) ||L s_1 − L s_2||^ν K_ν(||L s_1 − L s_2||)   (24)
K(s_1, s_2) = exp( sum_{j=1}^J b_j (φ_j(s_1) + φ_j(s_2)) ) σ²/(Γ(ν) 2^{ν−1}) (||s_1 − s_2||/α)^ν K_ν(||s_1 − s_2||/α).   (25)

The first is an isotropic Matérn covariance function. The second is a geometrically anisotropic Matérn covariance, with anisotropy parameterized by the 2 × 2 matrix L. The third is a Matérn covariance with a nonstationary variance function. The nonstationary variances are defined in terms of pre-specified known basis functions φ_j and unknown parameters b_j. For identifiability purposes, the J = 8 basis functions are an orthogonal basis that is also orthogonal to a constant function. The orthogonal basis is formed by applying Gram-Schmidt orthogonalization to a set of Gaussian basis functions.

Excluding µ, which is estimated by profile maximum likelihood, but including the nugget variance τ², the four models have 3, 4, 6, and 12 unknown parameters. Each model has a multiplicative variance parameter σ². In the Nelder-Mead algorithm, we profile out σ², whereas in Fisher scoring we do not; we found that profiling σ² does not substantially influence convergence speed in Fisher scoring. All positive parameters are mapped to the real line by a log transform. Each model was fit to each dataset on an independent R process running on a single thread of an 8-core (16-thread) Intel Xeon W-2145 (3.7GHz, 4.5GHz Turbo) processor with 16GB RAM. Fifteen datasets were sent to each of 14 threads, yielding 210 simulated datasets.
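For concreteness, a single replicate can be simulated and fit as in the sketch below. The sample size and range value are arbitrary illustrative choices, and the GpGp call is indicative only, since argument and covariance-function names may differ across package versions.

# Simulate one dataset from model (22) and fit it (illustrative sketch).
# n and the range 0.2 are our choices; sigma^2 = 2 and tau^2 = 0.2 as in the text.
set.seed(1)
n <- 2000
locs <- cbind(runif(n), runif(n))            # uniform locations on the unit square
D <- as.matrix(dist(locs))
Sigma <- 2 * exp(-D / 0.2) + 0.2 * diag(n)   # exponential covariance plus nugget
y <- drop(t(chol(Sigma)) %*% rnorm(n))       # mu = 0
X <- matrix(1, n, 1)                         # constant mean covariate

# Fisher scoring fit via GpGp (argument names may vary by version):
# fit <- GpGp::fit_model(y, locs, X, covfun_name = "exponential_isotropic")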
Histograms of the timing results are given in Figure 1. Considering the median times, Fisher scoring is 2-3 times faster for the isotropic models (3 and 4 parameters), more than 10 times faster for the stationary Matérn model (6 parameters), and 47 times faster for the nonstationary model (12 parameters). There is also no noticeable loss in accuracy, which we can evaluate by comparing the maximum loglikelihoods returned by Fisher scoring to the loglikelihoods returned by Nelder-Mead. For the isotropic models, the two loglikelihoods never differed by more than 0.001 units. In the stationary model, the Fisher scoring loglikelihood was more than 0.01 larger than the Nelder-Mead loglikelihood in 11 of the 210 datasets, whereas the Nelder-Mead loglikelihood was never more than 0.01 larger than the Fisher scoring loglikelihood. For the nonstationary model, the Fisher scoring loglikelihoods were more than 0.01 larger than the Nelder-Mead loglikelihoods in 199 of the 210 datasets, whereas the Nelder-Mead loglikelihoods were more than 0.01 larger in just 3 of the 210 datasets. Not only does Nelder-Mead take nearly 50 times longer than Fisher scoring in the nonstationary model, it usually does not find the actual maximum likelihood estimates.

Figure 1: Results of optimization timing study. Each plot shows histograms of time (in seconds) until convergence for one of the four covariance functions (isotropic exponential, isotropic Matérn, stationary Matérn, nonstationary Matérn) over 210 replicates. The plotted numbers indicate the median time until convergence.
Surprisingly, the maximum likelihood estimate of the nugget variance can be a negative number. This is not an error of the optimization algorithm (a triumph rather); in these cases, the algorithm finds a negative nugget with a higher likelihood than any nonnegative nugget can produce. Negative nugget estimates occur most frequently when the data are evenly spaced and when the maximum likelihood estimate of the smoothness parameter is small. To discourage them, we add small penalties on the nugget and smoothness of the form

pen(τ²) = −0.01 log(1 + c/τ²),   pen(ν) = −0.01 log(1 + c/ν),

where c is a small positive constant.

The likelihood function also has difficulty jointly identifying variance and range parameters when the range parameter estimate is much larger than the maximum distance between points in the dataset. This is a theoretically well-studied problem (Zhang, 2004) that no optimization routine can overcome. We have found that penalizing large variance parameters helps improve convergence of Fisher scoring without sacrificing accuracy. We used the penalty

pen(σ²) = −log(1 + exp(σ²/σ̃² − a)),

where σ̃² is the estimate of the residual variance parameter in a least squares fit of the response to the constant covariate, and a is a fixed offset. This imposes essentially no penalty on the parameter unless it is several times larger than the least squares estimate, after which the penalty increases roughly linearly in σ². These two identifiability problems can be handled more elegantly in a Bayesian framework, but we do not pursue that here because identifiability is not the focus of this paper.
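In code, the penalties are simply added to the loglikelihood objective before optimization. The sketch below uses placeholder default values for the constants written c and a above.

# Penalized profile loglikelihood (sketch; the constants c and a are placeholders).
# theta: list with entries sigsq (sigma^2), nug (tau^2), smooth (nu).
penalized_loglik <- function(theta, loglik, sigsq_ols, c = 0.1, a = 6) {
  pen_nug    <- -0.01 * log(1 + c / theta$nug)               # discourage tiny/negative nugget
  pen_smooth <- -0.01 * log(1 + c / theta$smooth)            # discourage tiny smoothness
  pen_var    <- -log(1 + exp(theta$sigsq / sigsq_ols - a))   # discourage huge variance
  loglik(theta) + pen_nug + pen_smooth + pen_var
}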
5 Data

Figure 2: Plot of Argo data, and variograms evaluated at 10 km for the spatial-only model and the spatial-temporal model.

Argo is a global program that deploys floating ocean temperature sensors (International Argo Program, 2019). Each Argo float operates on a 10-day cycle, during which it descends to a 2000 m depth and returns to the surface, collecting temperature and salinity measurements along the depth profile. The floats drift freely in the horizontal direction with ocean currents. As of May 2019, 3,799 floats covered the globe. We analyze a subset of the observations collected at 100 dbar (approximately 100 m depth) between January 1 and March 31, 2016. Preprocessed data were provided by Mikael Kuusela and are described in more detail in Kuusela and Stein (2018). In total, there are 32,492 measurements over the three month period. The data are plotted in Figure 2.

We model the data from day t and location s on the sphere S² ⊂ R³ as

Y(s, t) = β_0 + β_1 L(s) + β_2 L(s)² + Z(s + Φ(s), t) + ε(s, t),

where L(s) is the latitude of location s, Z(s, t) is a Gaussian process with covariance function K_θ, Φ : R³ → R³ is a spatial warping function, and ε(s, t) are i.i.d. mean zero normals with variance τ². We consider both spatial and spatial-temporal models for K_θ:

K_θ((s_1, t_1), (s_2, t_2)) = σ²/(2^{ν−1} Γ(ν)) ( d_α(s_1, s_2) )^ν K_ν( d_α(s_1, s_2) ),
K_θ((s_1, t_1), (s_2, t_2)) = σ²/(2^{ν−1} Γ(ν)) ( d_α((s_1, t_1), (s_2, t_2)) )^ν K_ν( d_α((s_1, t_1), (s_2, t_2)) ).

The function d_α is Euclidean distance scaled by either a spatial range parameter or spatial and temporal range parameters:

d_α(s_1, s_2) = ||s_1 − s_2||/α_1,   d_α((s_1, t_1), (s_2, t_2)) = ( ||s_1 − s_2||²/α_1² + |t_1 − t_2|²/α_2² )^{1/2}.

The warping function Φ is assumed to be a linear combination of the gradients of the five spherical harmonic functions of degree 2, where the gradient is with respect to the three Euclidean coordinates. We use degree 2 because the degree 0 function is constant, and the degree 1 spherical harmonics have constant partial derivatives (as a function of s), and so degree 1 warpings simply translate all points by the same vector and do not affect the covariances. We also consider the special case of Φ(s) = 0 for all s, which corresponds to isotropic models in space and time. The spatial warping model has 9 parameters, while the space-time warping model has 10. The isotropic models have 4 and 5 parameters.
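A short sketch of the scaled space-time distance and the resulting Matérn covariance, using R's built-in besselK (function names are ours):

# Space-time Matern covariance via the scaled distance d_alpha.
# l1, l2: points c(x, y, z, t) with 3D Euclidean coordinates and time.
d_alpha <- function(l1, l2, alpha_s, alpha_t) {
  ds <- sqrt(sum((l1[1:3] - l2[1:3])^2)) / alpha_s   # spatial part
  dt <- abs(l1[4] - l2[4]) / alpha_t                 # temporal part
  sqrt(ds^2 + dt^2)
}

matern <- function(d, sigsq, nu) {
  if (d == 0) return(sigsq)                          # limit as d -> 0
  sigsq / (2^(nu - 1) * gamma(nu)) * d^nu * besselK(d, nu)
}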
                         loglikelihood                time (minutes)
Model                    Fisher Scoring  Nelder-Mead  Fisher Scoring  Nelder-Mead
Isotropic Spatial            −…           −….676           2.76           7.06
Warping Spatial              −…           −….912           6.56          79.96
Isotropic Space-Time         −…           −….039           3.62          10.61
Warping Space-Time            0           −2.038          12.13         190.31

Table 1: Optimization results for four models fit to Argo float data. Reported loglikelihoods are differences from the largest loglikelihood.

We fit each model using both Fisher scoring and Nelder-Mead, with the results given in Table 1. Fisher scoring is able to fit the space-time warping model in 12.13 minutes, whereas Nelder-Mead ran for 190 minutes and returned a loglikelihood value 2.038 units lower. In the spatial-only warping model, Fisher scoring finished in 6.56 minutes, whereas Nelder-Mead returned a loglikelihood value 0.01 lower after 80 minutes. The two methods produced nearly the same loglikelihoods on the isotropic models, with Fisher scoring running more than twice as fast. The results closely mirror the numerical study, where Fisher scoring had its largest improvements in both speed and accuracy when fitting models with many parameters. Finally, in Figure 2, we plot Var(Y(s, t) − Y(s + h, t)) as a function of s, with ||h|| = 10 km. The images show that the warping model produces an anisotropic variogram, with larger increment variances near the equator.
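The plotted increment variance is available in closed form from the fitted model, since Var(Y(s, t) − Y(s + h, t)) = K((s,t),(s,t)) + K((s+h,t),(s+h,t)) − 2 K((s,t),(s+h,t)) + 2τ². A sketch in our notation:

# Increment variance Var(Y(s,t) - Y(s+h,t)) from a fitted covariance K.
# K(p1, p2) is the fitted space-time covariance; tausq is the nugget variance.
increment_var <- function(s, t, h, K, tausq) {
  p1 <- c(s, t); p2 <- c(s + h, t)
  K(p1, p1) + K(p2, p2) - 2 * K(p1, p2) + 2 * tausq  # nugget contributes to both terms
}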
6 Discussion

We believe that practitioners will benefit from the availability of high quality algorithms for fitting nonstationary Gaussian process models to large spatial and spatial-temporal datasets. The methods are applicable to any covariance function that is differentiable with respect to its parameters. This is important because it separates the tasks of constructing models and developing methods for fitting the models, freeing us to select the most appropriate covariance function for the data rather than the most appropriate model for which a specialized method exists. The Fisher scoring algorithm, as well as anisotropic, nonstationary variance, and warping covariance functions, will be implemented in version 0.2.0 of the GpGp R package (Guinness and Katzfuss, 2018).
Acknowledgements
This work was supported by the National Science Foundation under grant No. 1613219 and the National Institutes of Health under grant No. R01ES027892.
References
Finley, A., Datta, A., Banerjee, S., and Mckim, A. (2017). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.1.

Geoga, C. J., Anitescu, M., and Stein, M. L. (2018). Scalable Gaussian process computations using hierarchical matrices. arXiv preprint arXiv:1808.03215.

Guinness, J. (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429.

Guinness, J. and Katzfuss, M. (2018). GpGp: Fast Gaussian Process Computation Using Vecchia's Approximation. R package version 0.1.0.

International Argo Program (2019). Online: accessed 2019-05-19.

Katzfuss, M. and Guinness, J. (2017). A general framework for Vecchia approximations of Gaussian processes. arXiv preprint arXiv:1708.06302.

Kuusela, M. and Stein, M. L. (2018). Locally stationary spatio-temporal interpolation of Argo profiling float data. Proceedings of the Royal Society A, 474(2220):20180400.

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7(4):308–313.

Stein, M. L., Chi, Z., and Welty, L. J. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):275–296.

Vecchia, A. V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Methodological), 50(2):297–312.

Zhang, H. (2004). Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. Journal of the American Statistical Association, 99(465):250–261.