Fixed-Domain Asymptotics Under Vecchia's Approximation of Spatial Process Likelihoods
A PREPRINT
Lu Zhang∗
Department of Statistics, Columbia University, New York, NY 10027
[email protected]

Wenpin Tang†‡
Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027
[email protected]

Sudipto Banerjee§
Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095-1772
[email protected]
February 16, 2021

ABSTRACT
Statistical modeling for massive spatial data sets has generated a substantial literature on scalable spatial processes based upon a likelihood approximation due to Vecchia [1988]. Vecchia's approximation for Gaussian process models enables fast evaluation of the likelihood by restricting dependencies at a location to its neighbors. We establish inferential properties of microergodic spatial covariance parameters within the paradigm of fixed-domain asymptotics when they are estimated using Vecchia's approximation. The conditions required to formally establish these properties are explored, theoretically and empirically, and the effectiveness of Vecchia's approximation is further corroborated from the standpoint of fixed-domain asymptotics. These explorations suggest some practical diagnostics for evaluating the quality of the approximation.

Keywords Fixed-domain asymptotics · Gaussian processes · Likelihood approximations · Matérn covariance function · Microergodic parameter estimation · Spatial statistics

∗ Lu Zhang is a Postdoctoral Researcher in the Department of Statistics at Columbia University, New York, U.S.A. URL: luzhangstat.github.io
† Wenpin Tang is an Assistant Professor in the Department of Industrial Engineering and Operations Research at Columbia University, New York, U.S.A.
‡ The first and second authors contributed equally to this paper.
§ Sudipto Banerjee is Professor and Chair of the Department of Biostatistics at the University of California, Los Angeles, USA. Mailing address: 650 Charles E. Young Drive South, CHS 51-254, Los Angeles, CA 90095-1772.
Geostatistical data are often modeled by treating observations as partial realizations of a spatial random field. We customarily model the random field $\{Y(s) : s \in \mathcal{D}\}$ over a bounded region $\mathcal{D} \subset \mathbb{R}^d$ as a Gaussian process (GP), denoted $Y(s) \sim GP(\mu_\beta(s), K_\theta(\cdot,\cdot))$, with mean $\mu_\beta(s)$ and covariance function $K_\theta(s_i, s_j) = \mathrm{cov}(y(s_i), y(s_j))$. The probability law for a finite set of locations $\chi = \{s_1, s_2, \ldots, s_n\}$ is given by $y \sim N(\mu_\beta, K_\theta)$, where $y = (y(s_i))$ and $\mu_\beta = (\mu_\beta(s_i))$ are $n \times 1$ vectors with elements $y(s_i)$ and $\mu_\beta(s_i)$, respectively, and $K_\theta = (K_\theta(s_i, s_j))$ is the $n \times n$ spatial covariance matrix whose $(i,j)$th element is the value of the covariance function $K_\theta(s_i, s_j)$. We consider the widely employed stationary Matérn covariance function [Matérn, 1986, Stein, 1999b] given by
\[ K_\theta(s, s') := \frac{\sigma^2 (\phi \|h\|)^{\nu}}{\Gamma(\nu)\, 2^{\nu-1}}\, \mathcal{K}_\nu(\phi \|h\|), \qquad \|h\| \ge 0, \tag{1.1} \]
where $h = s - s'$, $\sigma^2 > 0$ is called the partial sill or spatial variance, $\phi > 0$ is the scale or decay parameter, $\nu > 0$ is a smoothness parameter, $\Gamma(\cdot)$ is the Gamma function, $\mathcal{K}_\nu(\cdot)$ is the modified Bessel function of order $\nu$ [Abramowitz and Stegun, 1965, Section 10], and $\theta = \{\sigma^2, \phi, \nu\}$. The spectral density corresponding to (1.1), which we will need later, is
\[ f(u) = C \sigma^2 \phi^{2\nu} (\phi^2 + \|u\|^2)^{-(\nu + d/2)} \quad \text{for some } C > 0. \tag{1.2} \]
Likelihood-based inference for $\theta$ requires matrix computations in the order of $\sim n^3$ floating point operations (flops) and can become impracticable when the number of spatial locations, $n$, is very large. Writing $Y_n = (y_1, y_2, \ldots, y_n)^\top$, where $y_i := y(s_i)$ for $i = 1, 2, \ldots, n$ are the $n$ sampled measurements, we write the joint density $p(Y_n \mid \theta) := N(Y_n; \mu_\beta, K_\theta)$ as
\[ p(Y_n \mid \theta) = p(y_1; \theta) \prod_{i=2}^{n} p(y_i \mid y_{(i-1)}; \theta), \tag{1.3} \]
where $y_{(i)} = (y_1, \ldots, y_i)$.
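To make (1.1)–(1.3) concrete, the following sketch (ours, not from the paper) evaluates the Matérn covariance for the half-integer smoothness values $\nu = 1/2$ and $\nu = 3/2$, where the modified Bessel function reduces to elementary closed forms, and checks numerically that the telescoping factorization (1.3) reproduces the joint Gaussian log-likelihood; all function names are illustrative.

```python
import numpy as np

def matern(h, sigma2, phi, nu):
    """Matern covariance (1.1); closed forms for the half-integer cases
    nu = 1/2 and nu = 3/2 (general nu needs the Bessel function K_nu)."""
    h = np.abs(h)
    if nu == 0.5:
        return sigma2 * np.exp(-phi * h)
    if nu == 1.5:
        return sigma2 * (1.0 + phi * h) * np.exp(-phi * h)
    raise NotImplementedError("only nu in {1/2, 3/2} implemented here")

def joint_loglik(y, s, sigma2, phi, nu):
    """log N(y; 0, K_theta) computed directly from the joint density."""
    K = matern(s[:, None] - s[None, :], sigma2, phi, nu)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet
                   + y @ np.linalg.solve(K, y))

def sequential_loglik(y, s, sigma2, phi, nu):
    """The same quantity via the telescoping factorization (1.3)."""
    K = matern(s[:, None] - s[None, :], sigma2, phi, nu)
    ll = -0.5 * (np.log(2 * np.pi * K[0, 0]) + y[0] ** 2 / K[0, 0])
    for i in range(1, len(y)):
        w = np.linalg.solve(K[:i, :i], K[:i, i])
        mu = w @ y[:i]                 # E(y_i | y_(i-1))
        v = K[i, i] - K[:i, i] @ w     # conditional variance
        ll += -0.5 * (np.log(2 * np.pi * v) + (y[i] - mu) ** 2 / v)
    return ll

rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, 40)
K = matern(s[:, None] - s[None, :], 1.0, 5.0, 1.5)
y = np.linalg.cholesky(K + 1e-8 * np.eye(40)) @ rng.standard_normal(40)
ll_joint = joint_loglik(y, s, 1.0, 5.0, 1.5)
ll_seq = sequential_loglik(y, s, 1.0, 5.0, 1.5)
```

The two evaluations agree to numerical precision; the factorized form is the entry point for Vecchia's approximation, which simply shrinks each conditioning set.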
Vecchia [1988] suggested a simple approximation to (1.3) based upon the notion that it may not be critical to use all components of $y_{(i-1)}$ in $p(y_i \mid y_{(i-1)}; \theta)$. Instead, the joint density $p(Y_n \mid \theta)$ in (1.3) is approximated by
\[ \tilde{p}(Y_n \mid \theta) = p(y_1 \mid \theta) \prod_{i=2}^{n} p(y_i \mid S_{(i-1)}; \theta), \tag{1.4} \]
where $S_{(i)}$ is a subvector of $y_{(i)}$ for $i = 1, \ldots, n$. The density $\tilde{p}(Y_n \mid \theta)$ in (1.4) is called Vecchia's approximation and can be regarded as a quasi- or composite likelihood [Zhang, 2012, Eidsvik et al., 2014, Bachoc and Lagnoux, 2020]. Vecchia's approximation has attracted a remarkable amount of attention in recent times, and the literature is already too vast to be comprehensively reviewed here [see, e.g., Stein et al., 2004, Datta et al., 2016a,b, Guinness, 2018, Katzfuss et al., 2020, Katzfuss and Guinness, 2021]. Algorithmic developments in Bayesian and frequentist settings [Finley et al., 2019, Katzfuss et al., 2020] have enabled scalability to massive data sets, and (1.4) lies at the core of several methods that tackle "big data" problems in geospatial analysis.

Following the fixed-domain (infill) asymptotic paradigm for spatial inference [Stein, 1999a, Zhang and Zimmerman, 2005], we discuss inferential properties of the parameters in (1.1). In this setting, Zhang [2004] showed that not all parameters in $\theta$ admit consistent maximum likelihood estimators from the full Gaussian likelihood in (1.3) constructed with a stationary Matérn covariance function, but certain microergodic parameters are consistently estimated. Du et al. [2009, Theorem 5] formally established the asymptotic distributions of these microergodic parameters. Kaufman and Shaby [2013] addressed jointly estimating the spatial range and variance parameters in the Matérn family and the effect of a prefixed range on inference for relatively small sample sizes. All of the aforementioned work has been undertaken using (1.3).
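A minimal sketch of Vecchia's approximation (1.4) follows, with $S_{(i-1)}$ chosen as the $k$ nearest previously ordered locations; the code and its function names are our illustration, not the authors' implementation. Taking $k = n - 1$ recovers the full likelihood exactly, which gives a built-in sanity check; with $k \ll n$ the cost drops from $O(n^3)$ to $O(n k^3)$.

```python
import numpy as np

def vecchia_loglik(y, s, cov, k):
    """Vecchia's approximation (1.4): condition y_i on the k nearest
    previously ordered locations instead of the full history y_(i-1)."""
    K = cov(s[:, None] - s[None, :])
    ll = -0.5 * (np.log(2 * np.pi * K[0, 0]) + y[0] ** 2 / K[0, 0])
    for i in range(1, len(y)):
        prev = np.argsort(np.abs(s[:i] - s[i]))[:k]   # indices of S_(i-1)
        w = np.linalg.solve(K[np.ix_(prev, prev)], K[prev, i])
        v = K[i, i] - K[prev, i] @ w
        ll += -0.5 * (np.log(2 * np.pi * v) + (y[i] - w @ y[prev]) ** 2 / v)
    return ll

cov = lambda h: (1.0 + 5.0 * np.abs(h)) * np.exp(-5.0 * np.abs(h))  # Matern, nu = 3/2
rng = np.random.default_rng(1)
n = 200
s = np.linspace(0.0, 1.0, n)
K = cov(s[:, None] - s[None, :])
y = np.linalg.cholesky(K + 1e-8 * np.eye(n)) @ rng.standard_normal(n)

_, logdet = np.linalg.slogdet(K)
ll_full = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))
ll_all = vecchia_loglik(y, s, cov, n - 1)   # all predecessors: exact
ll_k10 = vecchia_loglik(y, s, cov, 10)      # only 10 neighbors: much cheaper
```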
Here, we formally establish the inferential properties of the estimates of microergodic parameters obtained from Vecchia's approximate likelihood in (1.4). We build our work on a brief but insightful discussion in Section 10.5.3 of Zhang [2012] regarding the inferential behaviour arising from (1.4). To the best of our knowledge, such explorations have not hitherto been formally undertaken. Following the aforementioned works in spatial asymptotics, we restrict attention to the infill or fixed-domain setting and focus on the inferential properties of the microergodic parameters for any given value of the smoothness parameter. More specifically, we explore criteria for asymptotic normality of the maximum likelihood estimates of the microergodic parameters obtained from Vecchia's approximation. In this regard, our work follows the paradigm laid out in Zhang [2012] in that we can no longer assume that the conditioning set is bounded. This distinguishes our intended contribution from that in Bachoc and Lagnoux [2020], where bounded conditioning sets are exploited to establish consistency results for some selected values of the smoothness parameter. On the other hand, we show that a different set of conditions can yield a closed-form asymptotic distribution for any given value of the smoothness parameter. For the subsequent development, it suffices to assume that $\mu_\beta(s) = 0$, i.e., the data have been detrended. Hence, we work with a zero-centred stationary Gaussian process with the Matérn covariance function in (1.1), a fixed smoothness parameter $\nu$, and sampling locations $\chi_n$ restricted to a bounded region.

The balance of this article is arranged as follows. One of our key results, Theorem 1, is presented in Section 2, providing general criteria for asymptotic normality of maximum likelihood estimates of microergodic parameters obtained from Vecchia's approximation. In Section 3, we demonstrate that these general criteria are implied by a condition on the conditioning-set size, which can grow much more slowly than the sample size. We numerically check the conclusions for one-dimensional cases and extend the discussion to two-dimensional cases in Section 4.1. A graphical method for diagnosing model fit when using Vecchia's approximation with finite samples is proposed and illustrated by simulation studies in Section 4.2.

Identifiability and consistent estimation of $\theta$ in (1.1) rely upon the equivalence and orthogonality of Gaussian measures. Two probability measures $P_1$ and $P_2$ on a measurable space $(\Omega, \mathcal{F})$ are said to be equivalent, denoted $P_1 \equiv P_2$, if they are absolutely continuous with respect to each other. Thus, $P_1 \equiv P_2$ implies that for all $A \in \mathcal{F}$, $P_1(A) = 0$ if and only if $P_2(A) = 0$. On the other hand, $P_1$ and $P_2$ are orthogonal, denoted $P_1 \perp P_2$, if there exists $A \in \mathcal{F}$ for which $P_1(A) = 1$ and $P_2(A) = 0$. While measures may be neither equivalent nor orthogonal, Gaussian measures are in general one or the other. For a Gaussian probability measure $P_\theta$ indexed by a set of parameters $\theta$, we say that $\theta$ is microergodic if $P_{\theta_1} \equiv P_{\theta_2}$ if and only if $\theta_1 = \theta_2$ [see, e.g., Stein, 1999b, Zhang, 2012].
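The role of microergodicity can be illustrated numerically. For $\nu = 1/2$ in one dimension, the Matérn model reduces to the Ornstein–Uhlenbeck process and $\sigma^2\phi$ is microergodic (anticipating the equivalence condition discussed next). In the sketch below (our own, with parameter values chosen purely for illustration), a model sharing the data-generating value of $\sigma^2\phi$ stays within $O(1)$ of the true log-likelihood on a fixed domain, while a model with a different $\sigma^2\phi$ falls behind at a rate proportional to $n$:

```python
import numpy as np

def ou_loglik(y, s, sigma2, phi):
    """Exact log-likelihood for Matern nu = 1/2 (Ornstein-Uhlenbeck),
    via its Markov factorization p(y_1) prod_i p(y_i | y_{i-1})."""
    ll = -0.5 * (np.log(2 * np.pi * sigma2) + y[0] ** 2 / sigma2)
    for i in range(1, len(y)):
        r = np.exp(-phi * (s[i] - s[i - 1]))
        v = sigma2 * (1.0 - r ** 2)
        ll += -0.5 * (np.log(2 * np.pi * v) + (y[i] - r * y[i - 1]) ** 2 / v)
    return ll

rng = np.random.default_rng(2)
n = 500
s = np.linspace(0.0, 1.0, n)
y = np.empty(n)
y[0] = rng.standard_normal()            # truth: sigma0^2 = 1, phi0 = 5
for i in range(1, n):
    r = np.exp(-5.0 * (s[i] - s[i - 1]))
    y[i] = r * y[i - 1] + np.sqrt(1.0 - r ** 2) * rng.standard_normal()

ll_true = ou_loglik(y, s, 1.0, 5.0)
ll_equiv = ou_loglik(y, s, 4.0, 1.25)   # same microergodic sigma^2 phi = 5
ll_orth = ou_loglik(y, s, 1.0, 1.25)    # different sigma^2 phi = 1.25
```

On this fixed domain the data can discriminate the product $\sigma^2\phi$ but not $\sigma^2$ and $\phi$ separately, which is exactly the phenomenon formalized by equivalence versus orthogonality of the two Gaussian measures.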
Two Gaussian probability measures defined by Matérn covariance functions $K_{\theta_1}(h)$ and $K_{\theta_2}(h)$, where $\theta_1 = \{\sigma_1^2, \phi_1, \nu\}$ and $\theta_2 = \{\sigma_2^2, \phi_2, \nu\}$, are equivalent if and only if $\sigma_1^2 \phi_1^{2\nu} = \sigma_2^2 \phi_2^{2\nu}$ [Theorem 2 in Zhang, 2004]. Consequently, while one cannot consistently estimate $\sigma^2$ or $\phi$ individually [Corollary 1 in Zhang, 2004] from full Gaussian process likelihood functions, $\sigma^2 \phi^{2\nu}$ is a microergodic parameter that can be consistently estimated.

If the oracle (data-generating) values of $\phi$ and $\sigma^2$ are $\phi_0$ and $\sigma_0^2$, respectively, then for any fixed value of the decay $\phi_1$, we know from Du et al. [2009, Theorem 5] that
\[ \sqrt{n}\,\big(\hat\sigma_n^2 \phi_1^{2\nu} - \sigma_0^2 \phi_0^{2\nu}\big) \xrightarrow{\;L\;} N\big(0,\; 2(\sigma_0^2 \phi_0^{2\nu})^2\big), \tag{2.1} \]
where $\hat\sigma_n^2$ is the maximum likelihood estimator of $\sigma^2$ from the full likelihood (1.3) with $\phi$ fixed at $\phi_1$. Let $\hat\sigma^2_{n,\mathrm{vecch}}$ be the maximum likelihood estimate of the variance $\sigma^2$,
\[ \hat\sigma^2_{n,\mathrm{vecch}} = \operatorname*{argmax}_{\sigma^2 \in \mathbb{R}^+}\; \tilde{p}(Y_n \mid \phi_1, \sigma^2), \tag{2.2} \]
where $\tilde{p}(\cdot)$ is the density (1.4). We develop the asymptotic equivalence of $\hat\sigma^2_{n,\mathrm{vecch}}$ with $\hat\sigma_n^2$. To proceed further, we introduce some notation. Assume that the target process $y(s) \sim GP(0, K_{\theta_0}(\cdot))$, where $K_{\theta_0}(h)$ is defined in (1.1) with a fixed $\nu$. Let $P_i$, $i = 0, 1$, denote the probability measure for $y(s) \sim GP(0, K_{\theta_i})$ with $\theta_i = \{\sigma_i^2, \phi_i, \nu\}$. Assume that $\sigma_1^2 = \sigma_0^2 \phi_0^{2\nu} / \phi_1^{2\nu}$ and let $E_j(\cdot)$ denote the expectation with respect to the probability measure $P_j$, $j = 0, 1$. We define
\[ e_{1,j} := y_1, \quad \mu_{i,j} := E_j(y_i \mid y_{(i-1)}), \quad e_{i,j} := y_i - \mu_{i,j}, \quad i = 2, \ldots, n; \]
\[ \tilde e_{1,j} := y_1, \quad \tilde\mu_{i,j} := E_j(y_i \mid S_{(i-1)}), \quad \tilde e_{i,j} := y_i - \tilde\mu_{i,j}, \quad i = 2, \ldots, n. \tag{2.3} \]
In Lemma 1 we derive a useful expression for $\hat\sigma^2_{n,\mathrm{vecch}}$ using the quantities in (2.3).

Lemma 1.
The estimate of $\sigma^2$ from Vecchia's likelihood approximation with fixed $\nu$ and $\phi = \phi_1$ can be expressed as
\[ \hat\sigma^2_{n,\mathrm{vecch}} = \frac{\sigma_1^2}{n} \sum_{i=1}^{n} \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2}. \tag{2.4} \]

Proof.
In Vecchia's approximation (1.4) with fixed $\nu$, $\phi = \phi_1$ and unknown $\sigma^2$ in $K_\theta(\cdot)$, $p(y_i \mid S_{(i-1)})$ is Gaussian with mean $\tilde\mu_{i,1} = \widetilde\Sigma_{12,1} (\widetilde\Sigma_{22,1})^{-1} S_{(i-1)}$ and variance $\widetilde\Sigma_{i,1} := \widetilde\Sigma_{11,1} - \widetilde\Sigma_{12,1} (\widetilde\Sigma_{22,1})^{-1} \widetilde\Sigma_{21,1}$, where
\[ \begin{pmatrix} \widetilde\Sigma_{11,1} & \widetilde\Sigma_{12,1} \\ \widetilde\Sigma_{21,1} & \widetilde\Sigma_{22,1} \end{pmatrix} \]
is the covariance matrix of $\big(y_i,\, S_{(i-1)}^\top\big)^\top$ under $\tilde p(\,\cdot\,; \phi_1, \sigma^2)$. Since $\tilde\mu_{i,1}$ does not depend on $\sigma^2$, and $\widetilde\Sigma_{i,1}$ can be expressed as $\sigma^2 \widetilde\Sigma^\dagger_{i,1}$, where $\widetilde\Sigma^\dagger_{i,1}$ does not depend upon $\sigma^2$, the conditional distribution $p(y_i \mid S_{(i-1)})$ under Vecchia's approximation is
\[ p(y_i \mid S_{(i-1)}) = \frac{1}{\sqrt{2\pi \sigma^2 \widetilde\Sigma^\dagger_{i,1}}} \exp\left( -\frac{\tilde e_{i,1}^2}{2 \sigma^2 \widetilde\Sigma^\dagger_{i,1}} \right). \]
A direct computation of (2.2) with any fixed $\phi_1$ yields (2.4), where we have used the fact $E_1 \tilde e_{i,1}^2 = \sigma_1^2 \widetilde\Sigma^\dagger_{i,1}$ on the right-hand side of (2.4).

Our main result builds on the discussion in Section 10.5.3 of Zhang [2012] to establish the following theorem, which explores the asymptotic distribution of $\hat\sigma^2_{n,\mathrm{vecch}}$.

Theorem 1.
Assume that either of the following conditions holds:
\[ \sum_{i=1}^{n} \frac{E_0(\tilde e_{i,1} - e_{i,0})^2}{E_1 \tilde e_{i,1}^2} = O(1) \quad \text{and} \quad \sum_{i=1}^{n} \left( \frac{E_0 e_{i,0}^2}{E_1 \tilde e_{i,1}^2} - 1 \right)^2 = O(1), \tag{2.5} \]
or
\[ \sum_{i=1}^{n} \frac{E_0(\tilde e_{i,1} - e_{i,0})^2}{E_0 e_{i,0}^2} = O(1) \quad \text{and} \quad \sum_{i=1}^{n} \left( \frac{E_1 \tilde e_{i,1}^2}{E_0 e_{i,0}^2} - 1 \right)^2 = O(1). \tag{2.6} \]
Then
\[ \sqrt{n}\,\big(\hat\sigma^2_{n,\mathrm{vecch}} \phi_1^{2\nu} - \sigma_0^2 \phi_0^{2\nu}\big) \xrightarrow{\;L\;} N\big(0,\; 2(\sigma_0^2 \phi_0^{2\nu})^2\big). \tag{2.7} \]
Before presenting the proof of Theorem 1, we state and prove the following lemma.

Lemma 2.
The assumptions in (2.5) imply that
\[ E_0\left[ \sum_{i=1}^{n} \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^{n} \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right] = o(\sqrt{n}). \tag{2.8} \]

Proof.
Note that
\[ \left| E_0\left[ \sum_{i=1}^{n} \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^{n} \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right] \right| = \left| \sum_{i=1}^{n} \left( \frac{E_0(\tilde e_{i,1} - e_{i,0} + e_{i,0})^2}{E_1 \tilde e_{i,1}^2} - 1 \right) \right| = \left| \sum_{i=1}^{n} \frac{E_0(\tilde e_{i,1} - e_{i,0})^2}{E_1 \tilde e_{i,1}^2} + \sum_{i=1}^{n} \left( \frac{E_0 e_{i,0}^2}{E_1 \tilde e_{i,1}^2} - 1 \right) \right| \le \sum_{i=1}^{n} \frac{E_0(\tilde e_{i,1} - e_{i,0})^2}{E_1 \tilde e_{i,1}^2} + \sum_{i=1}^{n} \left| \frac{E_0 e_{i,0}^2}{E_1 \tilde e_{i,1}^2} - 1 \right|, \]
where the second equality follows from the fact that $\tilde e_{i,1} - e_{i,0}$ and $e_{i,0}$ are independent under $P_0$. By the first condition in (2.5), we get $\sum_{i=1}^n E_0(\tilde e_{i,1} - e_{i,0})^2 / E_1 \tilde e_{i,1}^2 = O(1) = o(\sqrt n)$. Fix $\varepsilon > 0$. By the second condition in (2.5), there is $M > 0$ such that $\sum_{i > M} \big( E_0 e_{i,0}^2 / E_1 \tilde e_{i,1}^2 - 1 \big)^2 < \varepsilon$. So for $n > M$, we can use the Cauchy–Schwarz inequality to obtain
\[ \sum_{i=1}^{n} \left| \frac{E_0 e_{i,0}^2}{E_1 \tilde e_{i,1}^2} - 1 \right| = \sum_{i=1}^{M} \left| \frac{E_0 e_{i,0}^2}{E_1 \tilde e_{i,1}^2} - 1 \right| + \sum_{i=M+1}^{n} \left| \frac{E_0 e_{i,0}^2}{E_1 \tilde e_{i,1}^2} - 1 \right| \le \sum_{i=1}^{M} \left| \frac{E_0 e_{i,0}^2}{E_1 \tilde e_{i,1}^2} - 1 \right| + \sqrt{(n - M)\varepsilon}. \]
Dividing both sides by $\sqrt n$ reveals that $\limsup_{n \to \infty} \frac{1}{\sqrt n} \sum_{i=1}^n \big| E_0 e_{i,0}^2 / E_1 \tilde e_{i,1}^2 - 1 \big| \le \sqrt{\varepsilon}$. Since $\varepsilon > 0$ is arbitrary, it follows that $\sum_{i=1}^n \big| E_0 e_{i,0}^2 / E_1 \tilde e_{i,1}^2 - 1 \big| = o(\sqrt n)$, and we obtain (2.8).

We now present a proof of Theorem 1.
Proof of Theorem 1. Recall that $\sum_{i=1}^n e_{i,0}^2 / E_0 e_{i,0}^2 = Y_n^\top V_{n,0}^{-1} Y_n$, where $V_{n,0}$ is the covariance matrix of $Y_n$ under $P_0$. Using (2.4), derived in Lemma 1, we obtain
\[ \sqrt n \left( \hat\sigma^2_{n,\mathrm{vecch}} / \sigma_1^2 - 1 \right) = \frac{1}{\sqrt n} \left[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right] + \sqrt n \left( \frac{1}{n} Y_n^\top V_{n,0}^{-1} Y_n - 1 \right). \tag{2.9} \]
By the central limit theorem, we have $\sqrt n \big( \frac1n Y_n^\top V_{n,0}^{-1} Y_n - 1 \big) \xrightarrow{L} N(0, 2)$. We next show that condition (2.5) implies
\[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} = o(\sqrt n). \tag{2.10} \]
To prove this, it will be sufficient to show that
\[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} - E_0\left[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right] = O(1), \quad P_0\text{-almost surely}. \tag{2.11} \]
The result in (2.10) will then follow from Lemma 2. We turn to proving (2.11). Our argument relies on the equivalence of Gaussian sequences. Let $\widetilde P$ be the probability distribution corresponding to $\tilde p(\,\cdot\,; \phi_1, \sigma_1^2)$, and let $\rho_n := \tilde p(Y_n; \phi_1, \sigma_1^2) / p(Y_n; \phi_0, \sigma_0^2)$ be the Radon–Nikodym derivative of $\widetilde P$ with respect to $P_0$ evaluated at the realization $Y_n$ for a given $n$. If $\widetilde P$ is equivalent to $P_0$, then $\lim_{n\to\infty} \rho_n = \rho_\infty =: d\widetilde P / dP_0$ with $P_0$-probability 1 [see, e.g., Ibragimov and Rozanov, 1978, Section III.2.1]. Also, $P_0(0 < \rho_\infty < \infty) = 1$ and $-\infty < E_0(\log \rho_\infty) < \infty$. As a consequence,
\[ \log \rho_n = -\frac12 \log \frac{\det \widetilde V_{n,1}}{\det V_{n,0}} - \frac12 \left( \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right) = O(1), \]
\[ E_0(\log \rho_n) = -\frac12 \log \frac{\det \widetilde V_{n,1}}{\det V_{n,0}} - \frac12\, E_0\left[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right] = O(1), \]
where $\widetilde V_{n,1}$ (respectively, $V_{n,0}$) is the covariance matrix of $Y_n$ under $\widetilde P$ (respectively, $P_0$). Taking the difference of the above two equations yields (2.11) under the condition that $\widetilde P$ is equivalent to $P_0$. Using Theorem 5, Section VII.6 of Shiryaev [1996], we can conclude that
\[ \widetilde P \equiv P_0 \iff \sum_{i=1}^\infty \left[ \frac{E_0(\tilde e_{i,1} - e_{i,0})^2}{E_1 \tilde e_{i,1}^2} + \left( \frac{E_0 e_{i,0}^2}{E_1 \tilde e_{i,1}^2} - 1 \right)^2 \right] < \infty \iff \sum_{i=1}^\infty \left[ \frac{E_0(\tilde e_{i,1} - e_{i,0})^2}{E_0 e_{i,0}^2} + \left( \frac{E_1 \tilde e_{i,1}^2}{E_0 e_{i,0}^2} - 1 \right)^2 \right] < \infty. \tag{2.12} \]
Since the first equivalence in (2.12) is simply a reformulation of (2.5), we have established that condition (2.5) implies (2.11) and, hence, the result in (2.10).

The proof from condition (2.6) to (2.10) follows the proof from condition (2.5) to (2.10). We now break the quantity $\sum_{i=1}^n \tilde e_{i,1}^2 / E_1 \tilde e_{i,1}^2 - \sum_{i=1}^n e_{i,0}^2 / E_0 e_{i,0}^2$ in (2.10) into
\[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} - E_0\left[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right] \quad \text{and} \quad E_0\left[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right], \]
and replace the left-hand sides of (2.11) and (2.8) by these two quantities, respectively. The equivalence of $\widetilde P$ and $P_0$, along with Lemma 2, then shows that the replaced (2.8) holds. The proof that (2.6) implies the replaced (2.11) remains the same, except that we now use the second equivalence in (2.12). This completes the proof of Theorem 1.
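Lemma 1's closed form (2.4) and the conclusion (2.7) can be checked numerically. The sketch below (our own; $\nu = 1/2$, so that $\phi^{2\nu} = \phi$ and the conditional moments are available in closed form) computes $\hat\sigma^2_{n,\mathrm{vecch}}$ with a deliberately misspecified decay $\phi_1$ and verifies that the microergodic product $\hat\sigma^2_{n,\mathrm{vecch}}\,\phi_1$ still recovers $\sigma_0^2\phi_0$; all names and parameter values are illustrative.

```python
import numpy as np

def vecchia_sigma2_hat(y, s, phi1, k):
    """Closed-form maximizer (2.4) of Vecchia's likelihood over sigma^2,
    with the decay fixed at phi1 (Matern nu = 1/2, so 2*nu = 1)."""
    corr = lambda h: np.exp(-phi1 * np.abs(h))   # correlation (sigma^2 = 1)
    total = y[0] ** 2                            # first term: unit prior variance
    for i in range(1, len(y)):
        prev = np.arange(max(0, i - k), i)       # k consecutive predecessors
        R = corr(s[prev, None] - s[None, prev])
        r = corr(s[prev] - s[i])
        w = np.linalg.solve(R, r)
        total += (y[i] - w @ y[prev]) ** 2 / (1.0 - r @ w)
    return total / len(y)

rng = np.random.default_rng(3)
n = 1000
s = np.linspace(0.0, 1.0, n)
phi0 = 5.0                                       # truth: sigma0^2 phi0 = 5
y = np.empty(n)
y[0] = rng.standard_normal()
for i in range(1, n):
    r = np.exp(-phi0 * (s[i] - s[i - 1]))
    y[i] = r * y[i - 1] + np.sqrt(1.0 - r ** 2) * rng.standard_normal()

phi1 = 2.5                                       # misspecified decay
sig2_hat = vecchia_sigma2_hat(y, s, phi1, k=5)
microergodic = sig2_hat * phi1                   # estimates sigma0^2 * phi0 = 5
```

Although $\hat\sigma^2_{n,\mathrm{vecch}} \approx 2$ is far from $\sigma_0^2 = 1$, the product $\hat\sigma^2_{n,\mathrm{vecch}}\phi_1$ lands near $5$, within the $\sqrt{2}\,\sigma_0^2\phi_0/\sqrt{n}$ scale predicted by (2.7).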
Turning to the connection between Theorem 1 and predictive consistency of Vecchia's approximation in the sense of Kaufman and Shaby [2013, p. 478], note that $e_{i,0}$ and $\tilde e_{i,1}$ are the predictive errors for $y_i$ under the full model with correct parameters and under (1.4) with possibly incorrectly specified parameters $(\phi, \sigma^2) = (\phi_1, \sigma_1^2)$, respectively. A consequence of (2.5) or (2.6) is that $E_0 \tilde e_{i,1}^2 / E_0 e_{i,0}^2 \to 1$ as $i$ (and hence $n$) grows. Hence, (2.5) or (2.6) implies asymptotic normality of the estimates as well as predictive consistency.

It is possible to obtain further insights into Theorem 1 by considering the asymptotic normality of $\hat\sigma^2_{n,\mathrm{vecch}}$ for Matérn models with observations on the real line. Whilst the conditions (2.5) and (2.6) are, in general, analytically intractable due to the presence of $E_1 \tilde e_{i,1}^2$, we will show that (2.6) holds for Matérn models on $\mathbb{R}$. To simplify the presentation, we consider the fixed domain $D = [0, 1]$ and the sampled locations $\chi = \{ i/n : 0 \le i \le n \}$. Denote by $\delta = 1/n$ the spacing of $\chi$ and by $y_i = y(i\delta)$, $0 \le i \le n$, the observations. We define $S_{(i)} = S_{(i)}[k] := (y_i, y_{i-1}, \ldots, y_{i-k+1})$ for a positive integer $k$, so that $S_{(i)}[k]$ is the vector of $k$ consecutive observations backward from $y_i$.

Assumption 1.
Let $D = [0,1]$ and $\chi = \{ i\delta : 0 \le i \le n \}$ with $\delta = 1/n$. Then
\[ \sum_{i=1}^{n} \frac{E_0 (e_{i,1} - e_{i,0})^2}{E_0 e_{i,0}^2} = O(1). \tag{3.1} \]
Before stating the main result on $\mathbb{R}$, we demonstrate why this assumption is reasonable. We empirically investigate $\sum_{i=1}^n E_0(e_{i,1} - e_{i,0})^2 / E_0 e_{i,0}^2$ for increasing values of $n$. Figure 1 plots these values with $\chi = \{ i\delta : 0 \le i \le n \}$ for several values of $\nu$ and a wide range of $n$. As $n$ increases, the plotted curves tend to flatten, as is suggested by the assumption.

Figure 1: Trend of $\sum_{i=1}^n E_0(e_{i,1} - e_{i,0})^2 / E_0 e_{i,0}^2$ for the Matérn model when $\chi$ is a regular grid $\chi = \{ i\delta : 0 \le i \le n \}$. The parameter $\sigma^2$ in the Matérn covariogram equals 1, and the decay $\phi$ for each $\nu$ is set so that the correlation between two points equals 0.05 when their distance reaches 0.2.

Some additional explanation is also possible from a theoretical viewpoint. Since $P_0$ and $P_1$ are equivalent, Corollary 3.1 of Stein [1990a] implies that $E_0(e_{i,1} - e_{i,0})^2 / E_0 e_{i,0}^2 \to 0$ as $n, i \to \infty$, which is consistent with (3.1). By stationarity and symmetry of the Matérn model, $e_{i,j}$ is distributed as the error of the least-squares estimate of $y_0 := y(0)$ given the observations $y_{(i)} = (y_1, \ldots, y_i)^\top$. Hence, $e_{i,j}$ can be realized as $y_0 - E_j(y_0 \mid y_{(i)})$. Similarly, $\tilde e_{i,j}$ can be realized as $y_0 - E_j(y_0 \mid S_{(i-1)})$. Now consider the infinitely many sampled locations $\{ i\delta : i \ge 0 \}$, and extend the finite sample $Y_n := (y_1, \ldots, y_n)^\top$ to $Y := (y_1, y_2, \ldots)^\top$ with $y_i := y(i\delta)$. Let $e_{\infty,j}$ be the error $y_0 - E_j(y_0 \mid y_1, y_2, \ldots)$ for the infinite sequence $Y$. For $f_0$ (resp. $f_1$) the spectral density under $P_0$ (resp. $P_1$), it is easily seen that $f_0, f_1 \sim C \sigma_0^2 \phi_0^{2\nu} u^{-2\nu-1}$ as $u \to \infty$, and $(f_0 - f_1)/f_0 \asymp u^{-2}$. Therefore, by Theorem 2 of Stein [1999b],
\[ \frac{E_0 (e_{\infty,1} - e_{\infty,0})^2}{E_0 e_{\infty,0}^2} = O\Big( \delta^{\min(2\nu+1,\,4)} \log(\delta^{-1})^{\mathbf{1}_{\{\nu = 3/2\}}} \Big). \tag{3.2} \]
Intuitively, $e_{i,j} \approx e_{\infty,j}$ for large $i$.
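The summands in (3.1) can be computed exactly, without Monte Carlo, from kriging algebra: $e_{i,1} - e_{i,0}$ is a linear combination of past observations, so its second moment under $P_0$ is a quadratic form in the model-0 covariance matrix. The sketch below is our own illustration ($\nu = 3/2$ with $\phi_0 = 5$ and an equivalent $P_1$ chosen so that $\sigma_1^2\phi_1^{2\nu} = \sigma_0^2\phi_0^{2\nu}$; function names are ours):

```python
import numpy as np

def ratio_31(s, i, cov0, cov1):
    """The summand of (3.1): E_0(e_{i,1} - e_{i,0})^2 / E_0 e_{i,0}^2,
    computed exactly from kriging weights (no simulation needed)."""
    past = np.arange(i)
    K0 = cov0(s[:, None] - s[None, :])
    K1 = cov1(s[:, None] - s[None, :])
    w0 = np.linalg.solve(K0[np.ix_(past, past)], K0[past, i])
    w1 = np.linalg.solve(K1[np.ix_(past, past)], K1[past, i])
    e0_var = K0[i, i] - K0[past, i] @ w0          # E_0 e_{i,0}^2
    d = w0 - w1                                   # e_{i,1} - e_{i,0} = d' y_past
    return (d @ K0[np.ix_(past, past)] @ d) / e0_var

# Matern nu = 3/2; P_1 equivalent to P_0: sigma1^2 phi1^3 = sigma0^2 phi0^3
m32 = lambda sig2, phi: (lambda h: sig2 * (1.0 + phi * np.abs(h))
                         * np.exp(-phi * np.abs(h)))
phi0, phi1 = 5.0, 4.0
cov0 = m32(1.0, phi0)
cov1 = m32(phi0 ** 3 / phi1 ** 3, phi1)

s = np.linspace(0.0, 1.0, 101)                    # delta = 1/100
vals = [ratio_31(s, i, cov0, cov1) for i in (25, 50, 100)]
```

The ratios stay tiny and do not accumulate, in line with the flattening curves of Figure 1.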
It would not be unreasonable to speculate that a stronger result holds, in which (3.2) remains valid with $e_{\infty,j}$ replaced by $e_{i,j}$ for large $i$; this would imply (3.1). As indicated on p. 138 of Stein [1999a], obtaining the rate of $E_0(e_{i,1} - e_{i,0})^2 / E_0 e_{i,0}^2$ for any bounded domain $D$ is a highly non-trivial task. The only known results are obtained in Stein [1990b, 1999b] for particular values of $\nu$, and these also imply (3.1). In fact, we need that $E_0 e_{i,0}^2 = E_0 e_{\infty,0}^2 \big( 1 + O(i^{-\kappa}) \big)$ for any $\kappa > 0$. Hence, the only missing piece in the above heuristics is an estimate of $E_0 e_{i,0}^2 / E_0 e_{\infty,0}^2$, which we do not explore further here. Our result is stated as follows.

Theorem 2.
Let $D = [0,1]$, $\chi = \{ i\delta : 0 \le i \le n \}$ and $S_{(i)} = S_{(i)}[\min(i, n^\varepsilon)]$ for $\varepsilon \in (0,1)$. If (3.1) holds, then (2.6) also holds. Consequently, (2.7) holds for the Matérn model.

We make a few remarks before presenting a proof. Theorem 2 states that the asymptotic normality of the microergodic parameter $\sigma^2\phi^{2\nu}$ still holds under Vecchia's approximation with a neighborhood of size $k = n^\varepsilon \ll n$ (the sample size), where the computation of $\hat\sigma^2_{n,\mathrm{vecch}}$ is much cheaper. This justifies the validity of Vecchia's approximation for Matérn models from a fixed-domain perspective. The range $n^\varepsilon$ may not be optimal, and it might be possible to improve it to $k = O(\log n)$. However, we do not pursue this direction from a theoretical standpoint; a simulation study for the case $k = O(\log n)$ is provided in Section 4.1.

An interesting situation arises with $\nu = 1/2$, in which the process reduces to the Ornstein–Uhlenbeck process and $p(y_i \mid y_{(i-1)}) = p(y_i \mid y_{i-1})$. Therefore, (2.7) trivially holds for Vecchia's approximation with a neighborhood of size $k = 1 = O(1)$. It is, therefore, natural to enquire whether the asymptotic normality in (2.7) holds under Vecchia's approximation with $k = O(1)$ in general. If this were true, computational effort could be reduced further. Unfortunately, this need not be the case. For $\nu < 1/2$ and $k = O(1)$, a suitably normalized version of $\hat\sigma^2_{n,\mathrm{vecch}} \phi_1^{2\nu} - \sigma_0^2 \phi_0^{2\nu}$ converges to a non-Gaussian distribution [Bachoc and Lagnoux, 2020]. The cases $\nu > 1/2$ remain unresolved.

Now we turn to the proof of Theorem 2. The key to this analysis is the following proposition, which relies on a bound for $e_{\infty,j} - e_{i,j}$, i.e., the difference between the errors of the finite and infinite least-squares estimates [Baxter, 1962]. This line of study dates back to the work of Kolmogorov [1941]; see also Grenander and Szegő [1958], Ibragimov [1964], Dym and McKean [1970, 1976], Ginovian [1999] for related discussions.

Proposition 1.
Let $\kappa > 0$. There exist $C_j, C_\kappa > 0$ such that for $\delta < 1$ and $j = 0, 1$,
\[ E_j e_{\infty,j}^2 \sim C_j \delta^{2\nu}, \tag{3.3} \]
\[ E_j (e_{\infty,j} - e_{i,j})^2 \le C_\kappa \delta^{2\nu} i^{-\kappa}. \tag{3.4} \]

Proof.
Recalling from the discussion below (3.1) that $e_{i,j}$ and $e_{\infty,j}$ can be realized as $e_{i,j} = y_0 - E_j(y_0 \mid y_1, \ldots, y_i)$ and $e_{\infty,j} = y_0 - E_j(y_0 \mid y_1, y_2, \ldots)$, where $y_i := y(i\delta)$ is indexed by the nonnegative integers, we know from Stein [1999a, p. 77] that the spectral density of $Y$ under $P_j$ is
\[ f_j^\delta(u) = \frac{1}{\delta} \sum_{\ell=-\infty}^{\infty} f_j\left( \frac{u + 2\pi\ell}{\delta} \right) \quad \text{for } u \in (-\pi, \pi], \; j = 0, 1, \]
where $f_j$ is the spectral density defined by (1.2) corresponding to $P_j$. For $j = 0, 1$, $f_j(u) \sim C \sigma_j^2 \phi_j^{2\nu} u^{-2\nu-1}$ as $u \to \infty$. From Stein [1999a, p. 80, (17)], we obtain
\[ E_j e_{\infty,j}^2 \sim 2\pi C \sigma_j^2 \phi_j^{2\nu} \delta^{2\nu} \exp\left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\Big( \textstyle\sum_{\ell=-\infty}^{\infty} |u + 2\pi\ell|^{-2\nu-1} \Big)\, du \right), \]
which implies (3.3) with $C_j = 2\pi C \sigma_j^2 \phi_j^{2\nu} \exp\big( \frac{1}{2\pi} \int_{-\pi}^{\pi} \log( \sum_{\ell} |u + 2\pi\ell|^{-2\nu-1} )\, du \big)$.

Turning to (3.4), we know from Baxter [1962, p. 142, (15)] that
\[ E_j (e_{i,j} - e_{\infty,j})^2 = E_j e_{\infty,j}^2 \; E_j e_{i,j}^2 \sum_{m=i}^{\infty} |\varphi_{m,j}(0)|^2, \tag{3.5} \]
where the $\varphi_{m,j}(\cdot)$ are the Szegő polynomials associated with the spectral density $f_j^\delta$ [see Section 2.1 of Grenander and Szegő, 1958, for background]. Note that $E_j e_{\infty,j}^2 \sim C_j \delta^{2\nu}$ and $E_j e_{i,j}^2 \le E_j e_{1,j}^2 \to 0$ as $\delta \to 0$. It will therefore be sufficient to establish
\[ \sum_{m=0}^{\infty} m^\kappa |\varphi_{m,j}(0)|^2 \le D_\kappa \quad \text{for some } D_\kappa > 0, \tag{3.6} \]
in which case the identity (3.5) will imply (3.4). The key observation of Baxter [1962] (Theorem 2.3) is that (3.6) holds if the $\kappa$th moment of the Fourier coefficients associated with $f_j^\delta$ is bounded from above by $D'_\kappa$ for some $D'_\kappa > 0$, i.e.,
\[ \sum_{m=0}^{\infty} m^\kappa |c_{m,j}| < D'_\kappa, \quad \text{where } c_{m,j} := \frac{1}{2\pi} \int_{-\pi}^{\pi} f_j^\delta(u) e^{-imu}\, du. \]
A sufficient condition for the latter is that the $\kappa$th derivative of $f_j^\delta$ be integrable, with $\int_{-\pi}^{\pi} \big| \frac{d^\kappa}{du^\kappa} f_j^\delta(u) \big|\, du \le D''_\kappa$ for some $D''_\kappa > 0$ which does not depend on $\delta < 1$. Breaking the sum defining $f_j^\delta$ according to $\ell = 0$ and $\ell \ne 0$ produces
\[ \int_{-\pi}^{\pi} \left| \frac{d^\kappa}{du^\kappa} f_j^\delta(u) \right| du \le A \int_{-\infty}^{\infty} \big( 1 + |u|^{2\nu+1+\kappa} \big)^{-1}\, du + A' \delta^{2\nu} \sum_{\ell \ne 0} |\ell|^{-2\nu-1-\kappa}, \tag{3.7} \]
where $A, A' > 0$ are numerical constants. Hence, the right-hand side of (3.7) is bounded by $D''_\kappa = A \int_{-\infty}^{\infty} ( 1 + |u|^{2\nu+1+\kappa} )^{-1} du + A' \sum_{\ell \ne 0} |\ell|^{-2\nu-1-\kappa}$, which depends only on $\kappa$.

Proof of Theorem 2.
We fix $\kappa = 2/\varepsilon$, and use $C^{(1)}_\kappa, C^{(2)}_\kappa, \ldots$ to denote constants depending only on $\kappa$. Note that $E_0 e_{i,0}^2 = E_0 e_{\infty,0}^2 + E_0 (e_{\infty,0} - e_{i,0})^2$, since $e_{\infty,0}$ and $e_{\infty,0} - e_{i,0}$ are independent under $P_0$. By Proposition 1, we get $E_0 e_{i,0}^2 \sim C_0 \delta^{2\nu} ( 1 + C^{(1)}_\kappa i^{-\kappa} )$. Similarly,
\[ E_1 \tilde e_{i,1}^2 = E_1 e_{\infty,1}^2 + E_1 (\tilde e_{i,1} - e_{\infty,1})^2 \sim C_1 \delta^{2\nu} \big( 1 + C^{(2)}_\kappa \min(i, n^\varepsilon)^{-\kappa} \big), \]
because $\tilde e_{i,1}$ is realized as $y_0 - E_1\big( y_0 \mid S_{(i-1)}[k] \big)$ with $k = \min(i, n^\varepsilon)$; moreover, $C_1 = C_0$ because $\sigma_1^2 \phi_1^{2\nu} = \sigma_0^2 \phi_0^{2\nu}$. Therefore,
\[ \sum_{i=1}^n \left( \frac{E_1 \tilde e_{i,1}^2}{E_0 e_{i,0}^2} - 1 \right)^2 \le \sum_{i \ge n^\varepsilon} \left( C^{(3)}_\kappa n^{-\varepsilon\kappa} + C^{(1)}_\kappa i^{-\kappa} \right)^2 + \sum_{i=1}^n \left( C^{(4)}_\kappa i^{-\kappa} + C^{(1)}_\kappa i^{-\kappa} \right)^2 \le C^{(5)}_\kappa n^{1-2\varepsilon\kappa} + C^{(6)}_\kappa \sum_{i=1}^n i^{-2\kappa} \le C^{(7)}_\kappa. \]
Moreover, we have
\[ \frac{E_0 (\tilde e_{i,1} - e_{i,0})^2}{E_0 e_{i,0}^2} \le 3\, \frac{E_0 (\tilde e_{i,1} - e_{\infty,1})^2}{E_0 e_{i,0}^2} + 3\, \frac{E_0 (e_{i,1} - e_{\infty,1})^2}{E_0 e_{i,0}^2} + 3\, \frac{E_0 (e_{i,1} - e_{i,0})^2}{E_0 e_{i,0}^2}. \]
By the same argument as above, for the first two terms,
\[ \sum_{i=1}^n \frac{E_0 (\tilde e_{i,1} - e_{\infty,1})^2}{E_0 e_{i,0}^2} < C^{(8)}_\kappa \quad \text{and} \quad \sum_{i=1}^n \frac{E_0 (e_{i,1} - e_{\infty,1})^2}{E_0 e_{i,0}^2} < C^{(9)}_\kappa. \]
The last term, $\sum_{i=1}^n 3 E_0 (e_{i,1} - e_{i,0})^2 / E_0 e_{i,0}^2$, converges because of (3.1). This establishes (2.6) and, hence, (2.7) follows.

Based on (2.8) and (2.11) provided in Theorem 1, Theorem 2 has shown that
\[ c_n(\phi_0, \phi_1, k) = \frac{1}{\sqrt n} \left[ \sum_{i=1}^n \frac{\tilde e_{i,1}^2}{E_1 \tilde e_{i,1}^2} - \sum_{i=1}^n \frac{e_{i,0}^2}{E_0 e_{i,0}^2} \right] = o(1) \tag{4.1} \]
when $k = n^\varepsilon$ for $\varepsilon \in (0,1)$. Equation (4.1) induces the critical condition (2.10), resulting in the convergence in law in (2.7). Looking into the more challenging case $k = O(\log n)$, we extend the discussion of Theorem 2 by investigating the behaviour of $c_n(\phi_0, \phi_1, k)$ in (4.1) for a sequence of datasets with increasing sample sizes. Our experimental setup considers the study domains $D_1 = [0,1]$ with $n$ observations on the grid $\chi_1 = \{ i/(n-1) : 0 \le i \le n-1 \}$, and $D_2 = [0,1]^2$ with $n = n_s^2$ observations on the grid $\chi_2 = \{ ( i/(n_s - 1),\, j/(n_s - 1) ) : 0 \le i \le n_s - 1,\; 0 \le j \le n_s - 1 \}$. We order the locations on $\chi_2$ first by the second coordinate and then break ties by the first coordinate. We take $S_{(i)}[k]$ to be the $k$ nearest neighbors of $y_{i+1}$. In both studies we fix $\sigma^2 = 1$ and consider five smoothness values $\nu$; for each $\nu$ we choose the decay $\phi$ so that the correlation drops to a common small value at a fixed reference distance for each of $D_1$ and $D_2$.

For each fixed value of $\theta = \{\sigma^2, \phi, \nu\}$, we generate datasets with $Y_n$ being a realization from $y(s) \sim GP(0, K_\theta(\cdot))$ and calculate $c_n(\phi_0, \phi_1, k)$, with $k$ growing proportionally to $\log(n)$ and $\phi_1$ set to a fixed multiple of $\phi_0$. We record the mean and standard deviation of the values of $c_n(\phi_0, \phi_1, k)$ over the replicated datasets, and repeat this process for values of $n$ ranging from 64 to 2048 in the study on $D_1$; the study on $D_2$ follows that on $D_1$ with correspondingly increasing $n_s$. The code for this simulation study, along with the other simulation studies of this paper, is available at https://github.com/LuZhangstat/vecchia_consistency. Figure 2 presents five graphs for the study on $D_1$, one for each value of $\nu$, showing the mean and standard deviation of $c_n(\phi_0, \phi_1, k)$ for different values of $n$. The value of $c_n(\phi_0, \phi_1, k)$, as seen in Figure 2, decreases rapidly as the sample size increases, supporting
Figure 2: The mean of $c_n(\phi_0, \phi_1, k)$ for the simulations on $D_1 = [0,1]$. The error bars represent one standard deviation; the sample size $n$ takes values from 64 to 2048.

the main conclusion in Theorem 2. The corresponding graphs for the study on $D_2$ are presented in Figure 3. These graphs also reveal decreasing trends, but with gentler slopes compared to Figure 2.

Figure 3: The mean of $c_n(\phi_0, \phi_1, k)$ for the simulations on $D_2 = [0,1]^2$. The error bars represent one standard deviation.

We have also seen, from the proof of Theorem 1, that $c_n(\phi_0, \phi_1, k) = \sqrt n \big( \hat\sigma^2_{n,\mathrm{vecch}} / \sigma_1^2 - \hat\sigma^2_{0,n} / \sigma_0^2 \big)$, where $\hat\sigma^2_{0,n} = \operatorname{argmax}_{\sigma^2 \in \mathbb{R}^+} p(Y_n; \phi_0, \sigma^2)$ is the maximum likelihood estimator from (1.3) when fixing $\phi = \phi_0$. Hence, $c_n(\phi_0, \phi_1, k)$ also measures the discrepancy between $\hat\sigma^2_{n,\mathrm{vecch}} / \sigma_1^2$ and $\hat\sigma^2_{0,n} / \sigma_0^2$, and its decreasing trend indicates that inference based on Vecchia's approximation approaches inference based on the full likelihood as the sample size increases. This phenomenon reveals that Vecchia's approximation remains efficient when the neighbor size $k$ is substantially smaller than the sample size.
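The quantity $c_n(\phi_0, \phi_1, k)$ can be reproduced on a small scale. The sketch below (our own, with illustrative parameter values; $\nu = 1/2$, where the Markov property makes the full likelihood at $\phi_0$ available with a single predecessor) computes $c_n = \sqrt n\,(\hat\sigma^2_{n,\mathrm{vecch}}/\sigma_1^2 - \hat\sigma^2_{0,n}/\sigma_0^2)$ with $k \approx \log n$:

```python
import numpy as np

def sigma2_profile(y, s, phi, k):
    """MLE of sigma^2 with decay fixed at phi, conditioning each y_i on k
    consecutive predecessors; for nu = 1/2, k = 1 already reproduces the
    full likelihood (Markov property), larger k mimics general Vecchia."""
    total = y[0] ** 2
    for i in range(1, len(y)):
        prev = np.arange(max(0, i - k), i)
        R = np.exp(-phi * np.abs(s[prev, None] - s[None, prev]))
        r = np.exp(-phi * np.abs(s[prev] - s[i]))
        w = np.linalg.solve(R, r)
        total += (y[i] - w @ y[prev]) ** 2 / (1.0 - r @ w)
    return total / len(y)

rng = np.random.default_rng(4)
n = 1000
s = np.linspace(0.0, 1.0, n)
phi0, phi1 = 5.0, 7.5                   # phi1 is a fixed multiple of phi0
sig0_sq = 1.0
sig1_sq = sig0_sq * phi0 / phi1         # sigma1^2 phi1 = sigma0^2 phi0
y = np.empty(n)
y[0] = rng.standard_normal()
for i in range(1, n):
    r = np.exp(-phi0 * (s[i] - s[i - 1]))
    y[i] = r * y[i - 1] + np.sqrt(1.0 - r ** 2) * rng.standard_normal()

k = int(round(np.log(n)))               # neighbor size k = O(log n)
sig2_vecch = sigma2_profile(y, s, phi1, k)
sig2_full0 = sigma2_profile(y, s, phi0, 1)
c_n = np.sqrt(n) * (sig2_vecch / sig1_sq - sig2_full0 / sig0_sq)
```

Even at this modest $n$, $c_n$ is small and the microergodic estimate $\hat\sigma^2_{n,\mathrm{vecch}}\phi_1$ sits near $\sigma_0^2\phi_0$, consistent with (4.1) and the trends in Figures 2 and 3.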
We now turn to some practical data-analysis considerations in finite-sample settings. In practice, estimation using (1.4) is complicated by the fixed ordering of locations and the choices of $\{S_{(i)}\}_{i=1}^n$. Stein et al. [2004] and Guinness [2018] have provided excellent practical insights into these issues that can considerably improve finite-sample behaviour in certain settings. Datta et al. [2016b] consider dynamic Bayesian modeling with model-based selection of $\{S_{(i)}\}_{i=1}^n$ based upon the covariance function and implemented within MCMC algorithms. Here, we conduct some simpler simulation examples to evaluate Vecchia's approximation for fixed sample size, but with varying numbers of neighbors, based upon the properties of the normalized conditional residuals $\tilde e_{i,1} / (E_1 \tilde e_{i,1}^2)^{1/2}$ for $i = 1, 2, \ldots, n$, which were employed in the proofs in Section 2.

We first investigate the behaviour of $\{ e_{i,0} / (E_0 e_{i,0}^2)^{1/2} \}_{i=1}^n$. From (1.3), each $p(y_i \mid y_{(i-1)}; \theta)$ is a Gaussian density. Therefore, the standardized $e_{i,0}$, i.e., $\{ e_{i,0} / (E_0 e_{i,0}^2)^{1/2} \}_{i=1}^n$, can be viewed as independent samples from a standard normal distribution. When the approximation in (1.4) is close to (1.3), the estimate $\hat\theta$ of $\theta$ based on (1.4) should be close to $\theta$, and the conditional residuals, given a fixed set of neighbors, should exhibit small correlations. This, in turn, yields small correlations among the standardized $\tilde e_{i,1}$, i.e., $\{ \tilde e_{i,1} / (E_1 \tilde e_{i,1}^2)^{1/2} \}_{i=1}^n$, when $\phi_1 = \hat\phi$. Therefore, higher correlations among the fitted standardized residuals are indicative of a poorer approximation of (1.3) by (1.4).

We present two examples. In the first example, we specify the observed locations to be equispaced over $D_1 = [0,1]$ with $n = 300$ observations, and generate data from a zero-centered Gaussian process with the Matérn covariance function specified by $\nu = 2$, $\phi = 10$ and $\sigma^2 = 1$.
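This residual diagnostic can be sketched as follows (our own illustration on a 1-D example of this kind; for simplicity we use $\nu = 3/2$ and fix $\phi$ at the data-generating value rather than estimating it from the profile likelihood). Too few neighbors leave visible lag-1 autocorrelation in the standardized conditional residuals, while a moderate $k$ whitens them:

```python
import numpy as np

def std_residuals(y, s, cov, k):
    """Standardized conditional residuals e~_i / (E e~_i^2)^(1/2) from
    Vecchia's approximation with k nearest preceding neighbors."""
    K = cov(s[:, None] - s[None, :])
    out = [y[0] / np.sqrt(K[0, 0])]
    for i in range(1, len(y)):
        prev = np.argsort(np.abs(s[:i] - s[i]))[:k]
        w = np.linalg.solve(K[np.ix_(prev, prev)], K[prev, i])
        v = K[i, i] - K[prev, i] @ w
        out.append((y[i] - w @ y[prev]) / np.sqrt(v))
    return np.array(out)

def acf1(x):
    """Lag-1 sample autocorrelation."""
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

cov = lambda h: (1.0 + 5.0 * np.abs(h)) * np.exp(-5.0 * np.abs(h))  # Matern nu = 3/2
rng = np.random.default_rng(5)
n = 500
s = np.linspace(0.0, 1.0, n)
L = np.linalg.cholesky(cov(s[:, None] - s[None, :]) + 1e-8 * np.eye(n))
y = L @ rng.standard_normal(n)

r_k1 = acf1(std_residuals(y, s, cov, 1))  # too few neighbors: strong ACF
r_k5 = acf1(std_residuals(y, s, cov, 5))  # enough neighbors: near-white
```

In a full analysis one would plot the residual autocorrelation function at all lags, as in Figures 4 and 5, rather than inspect a single lag.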
In the second example, we specify observed locations over a square grid on D = [0, 1]² and generate data from a zero-centered Gaussian process with the Matérn covariance function specified by ν = 1.0, φ = 9.0 and σ² = 1.0.

For each data set, we estimate φ by maximizing the profile likelihood derived from (1.4),

log{PL(φ)} ∝ −(1/2) log[det{ρ_vecch(φ)}] − (n/2) log[(1/n) Y⊤{ρ_vecch(φ)}⁻¹ Y],    (4.2)

where log{PL(φ)} = log[sup_{σ²} {L(σ², φ)}] and ρ_vecch(φ) is the correlation matrix of Y under the probability distribution corresponding to (1.4). We take S(i) as the k nearest neighbors of y_{i+1}. Subsequent to fitting the model, we use the resulting estimates of {ẽ_{i,j}/(E ẽ_{i,j}²)^{1/2}}_{i=1}^n to conduct visual checks on the quality of the approximation.

Figure 4: Autocorrelation plots of {ẽ_{i,j}/(E ẽ_{i,j}²)^{1/2}}_{i=1}^n with (a) k = 2, (b) k = 4, (c) k = 5 and (d) k = 8 for the 1-D example in Section 4.2, where φ is estimated by maximizing the profile likelihood (4.2). The dashed lines indicate −0.05 and 0.05.

PREPRINT - FEBRUARY 16, 2021

Figures 4 and 5 provide autocorrelation plots of the fitted {ẽ_{i,j}/(E ẽ_{i,j}²)^{1/2}}_{i=1}^n with varying choices of k for the first and the second example, respectively. In Figure 4, the autocorrelations at all lags drop fairly sharply as soon as the number of neighbors k exceeds 4. Some periodic patterns are observed in the autocorrelation plot for the second example when k = 6, but the periodicity is much less pronounced as k approaches 10.

Figure 5: Autocorrelation plots of {ẽ_{i,j}/(E ẽ_{i,j}²)^{1/2}}_{i=1}^n with (a) k = 4, (b) k = 6, (c) k = 8 and (d) k = 10 for the 2-D example in Section 4.2, where φ is estimated by maximizing the profile likelihood (4.2). The dashed lines indicate −0.05 and 0.05.

We also fit Bayesian models to the above two data sets to compare the impact of k on inference for the unknown parameters based on the same data. Specifically, we first consider the full likelihood defined by (1.3) to obtain posterior estimates of {φ, σ²}. Next, we obtain the corresponding estimates using Vecchia's approximation with varying k and compare the posterior estimates of {φ, σ²} with those from the full likelihood. For all models, we assign a half-Normal prior for σ² and a Gamma prior for φ; both priors are weakly informative. All computations were performed using Hamiltonian Monte Carlo executed in the Rstan [Stan Development Team, 2020] computing environment, where we ran 3 parallel chains. Adequate convergence was diagnosed based upon the reported R̂ statistics and visual diagnostics showing good mixing of the chains, and posterior inference was based upon the post-warmup samples pooled across the 3 chains. The plots of the posterior densities of {φ, σ²} for the first and the second examples are provided in Figures 6 and 7, respectively.

We observe that the posterior densities of {φ, σ²} based on Vecchia's approximation with k ≥ 8 are practically indistinguishable from the posterior density based on the full likelihood in the first example.
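The profile likelihood (4.2) takes a convenient sequential form under Vecchia's approximation: with a unit-variance correlation kernel, the factorization yields conditional means m_i(φ) and variances v_i(φ), so that log det{ρ_vecch(φ)} = Σ_i log v_i(φ) and Y⊤{ρ_vecch(φ)}⁻¹Y = Σ_i (y_i − m_i)²/v_i, and σ² profiles out in closed form. A self-contained pure-Python sketch follows; the Matérn ν = 5/2 parameterization, the previous-k neighbor rule, and the grid search over φ are illustrative assumptions of ours, not the authors' implementation.

```python
import math

def matern_25(d, phi):
    """Matérn correlation for nu = 5/2 (assumed parameterization)."""
    a = math.sqrt(5.0) * phi * abs(d)
    return (1.0 + a + a * a / 3.0) * math.exp(-a)

def solve_spd(A, b):
    """Gaussian elimination with partial pivoting for small k x k systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][-1] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def profile_loglik(y, locs, phi, k):
    """log PL(phi) of (4.2) up to an additive constant, with sigma^2
    profiled out: sigma2_hat = (1/n) sum_i (y_i - m_i)^2 / v_i and
    log PL = -0.5 * sum_i log v_i - (n/2) * log sigma2_hat."""
    n, logdet, quad = len(y), 0.0, 0.0
    for i in range(n):
        S = list(range(max(0, i - k), i))  # previous-k neighbor set S(i)
        if not S:
            m, v = 0.0, 1.0
        else:
            R_SS = [[matern_25(locs[a] - locs[b], phi) for b in S] for a in S]
            R_Si = [matern_25(locs[a] - locs[i], phi) for a in S]
            w = solve_spd(R_SS, R_Si)
            m = sum(wj * y[sj] for wj, sj in zip(w, S))
            v = 1.0 - sum(wj * rj for wj, rj in zip(w, R_Si))
        logdet += math.log(v)
        quad += (y[i] - m) ** 2 / v
    sigma2_hat = quad / n
    return -0.5 * logdet - 0.5 * n * math.log(sigma2_hat), sigma2_hat

def estimate_phi(y, locs, k, grid):
    """Maximize the profile likelihood over a one-dimensional grid of phi."""
    return max(grid, key=lambda phi: profile_loglik(y, locs, phi, k)[0])
```

Once φ̂ is found, σ̂² = σ̂²(φ̂) recovers the variance, and the residuals can then be standardized at φ = φ̂ for the visual checks described above.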
The density plots in Figure 7 reveal that increasing k has less of an effect on the shape of the posterior distribution of {φ, σ²} once k ≥ 8 in the second example. These results are consistent with those from the autocorrelation plots.

The above two examples show that visual checking of {ẽ_{i,j}/(E ẽ_{i,j}²)^{1/2}}_{i=1}^n is a pragmatic tool for detecting inadequacies in Vecchia's approximation. While it works well in our examples, we caution that adequacy of model fit and related model selection questions are much broader issues that we do not comprehensively undertake here; they have been investigated extensively elsewhere [see, e.g., Stein et al., 2004, Datta et al., 2016a, Guinness, 2018].

Conclusions

We have developed insights into inference based on approximations of GP likelihoods due to Vecchia [1988] under fixed-domain asymptotics for geostatistical data analysis. We have formally established sufficient conditions under which such approximations achieve the same asymptotic efficiency as the full GP likelihood in estimating the parameters of the Matérn covariance function. The insights obtained here will enhance our understanding of the identifiability of process parameters and can also be useful for developing priors on the microergodic parameters in Bayesian modeling. The results derived here also offer insights into formally establishing posterior consistency of process parameters for a number of Bayesian models that have emerged from (1.4) [Datta et al., 2016a,b, Katzfuss and Guinness, 2021, Peruzzi et al., in press].
Figure 6: Posterior density of {φ, log(σ²)} fitted with (a) the full likelihood and Vecchia's approximation with (b) k = 4, (c) k = 8 and (d) k = 16 for the 1-D example in Section 4.2. We present the joint distribution of {φ, log(σ²)} instead of {φ, σ²} for a better illustration. All plots share the same colour scale and ranges of axes for ease of comparison.

Figure 7: Posterior density of {φ, log(σ²)} fitted with (a) the full likelihood and Vecchia's approximation with (b) k = 4, (c) k = 8 and (d) k = 16 for the 2-D example in Section 4.2. We present the joint distribution of {φ, log(σ²)} instead of {φ, σ²} for a better illustration. All plots share the same colour scale and ranges of axes for ease of comparison.

We anticipate that the current manuscript will generate further research on variants of geostatistical models. For example, it is conceivable that these results will lead to asymptotic investigations of covariance-tapered models [see, e.g., Wang et al., 2011] and to adapting some results, such as Theorems 2 and 3 in Kaufman and Shaby [2013] where φ is estimated, to the approximate likelihoods presented here. Finally, there is scope to specifically investigate fixed-domain inference for other likelihood approximations that extend or generalize (1.4) [see, e.g., Katzfuss and Guinness, 2021, Peruzzi et al., in press].
Acknowledgements
The work of the first and third authors was supported, in part, by funding from the National Science Foundation grants NSF/DMS 1916349 and NSF/IIS 1562303, and by the National Institute of Environmental Health Sciences (NIEHS) under grants R01ES030210 and 5R01ES027027.
References
M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Dover, 1965.

F. Bachoc and A. Lagnoux. Fixed-domain asymptotic properties of maximum composite likelihood estimators for Gaussian processes. J. Statist. Plann. Inference, 209:62–75, 2020.

Glen Baxter. An asymptotic result for the finite predictor. Math. Scand., 10:137–144, 1962.

A. Datta, S. Banerjee, A. O. Finley, and A. E. Gelfand. Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111:800–812, 2016a. URL http://dx.doi.org/10.1080/01621459.2015.1044091.

A. Datta, S. Banerjee, A. O. Finley, N. A. S. Hamm, and M. Schaap. Non-separable dynamic nearest-neighbor Gaussian process models for large spatio-temporal data with an application to particulate matter analysis. Annals of Applied Statistics, 10:1286–1316, 2016b. URL http://dx.doi.org/10.1214/16-AOAS931.

J. Du, H. Zhang, and V. S. Mandrekar. Fixed-domain asymptotic properties of tapered maximum likelihood estimators. The Annals of Statistics, 37(6A):3330–3361, 2009.

H. Dym and H. P. McKean. Extrapolation and interpolation of stationary Gaussian processes. Ann. Math. Statist., 41:1817–1844, 1970.

H. Dym and H. P. McKean. Gaussian Processes, Function Theory, and the Inverse Spectral Problem. Academic Press, New York-London, 1976. Probability and Mathematical Statistics, Vol. 31.

Jo Eidsvik, Benjamin A. Shaby, Brian J. Reich, Matthew Wheeler, and Jarad Niemi. Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics, 23(2):295–315, 2014. doi: 10.1080/10618600.2012.760460. URL https://doi.org/10.1080/10618600.2012.760460.

Andrew O. Finley, Abhirup Datta, Bruce C. Cook, Douglas C. Morton, Hans E. Andersen, and Sudipto Banerjee. Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28(2):401–414, 2019.

M. S. Ginovian. Asymptotic behavior of the prediction error for stationary random sequences. Izv. Nats. Akad. Nauk Armenii Mat., 34(1):18–36, 1999.

Ulf Grenander and Gabor Szegö. Toeplitz Forms and Their Applications. California Monographs in Mathematical Sciences. University of California Press, Berkeley-Los Angeles, 1958.

Joseph Guinness. Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429, 2018.

I. A. Ibragimov. On the asymptotic behavior of the prediction error. Theory of Probability & Its Applications, 9(4):627–634, 1964.

I. A. Ibragimov and Y. A. Rozanov. Gaussian Random Processes, volume 9 of Applications of Mathematics. Springer-Verlag, New York-Berlin, 1978. Translated from the Russian by A. B. Aries.

Matthias Katzfuss and Joseph Guinness. A general framework for Vecchia approximations of Gaussian processes. Statist. Sci., 36(1):124–141, 2021. doi: 10.1214/19-STS755. URL https://doi.org/10.1214/19-STS755.

Matthias Katzfuss, Joseph Guinness, Wenlong Gong, and Daniel Zilber. Vecchia approximations of Gaussian-process predictions. Journal of Agricultural, Biological and Environmental Statistics, 25(3):383–414, 2020.

C. G. Kaufman and B. A. Shaby. The role of the range parameter for estimation and prediction in geostatistics. Biometrika, 100(2):473–484, 2013.

A. Kolmogorov. Interpolation und Extrapolation von stationären zufälligen Folgen. Bull. Acad. Sci. URSS Sér. Math. [Izvestia Akad. Nauk. SSSR], 5:3–14, 1941.

B. Matérn. Spatial Variation. Springer-Verlag, 1986.

Michele Peruzzi, Sudipto Banerjee, and Andrew O. Finley. Highly scalable Bayesian geostatistical modeling via meshed Gaussian processes on partitioned domains. Journal of the American Statistical Association, in press. doi: 10.1080/01621459.2020.1833889. URL https://doi.org/10.1080/01621459.2020.1833889.

A. N. Shiryaev. Probability, volume 95 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 1996.

Stan Development Team. RStan: the R interface to Stan, 2020. URL http://mc-stan.org/. R package version 2.21.2.

M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer-Verlag, 1999a.

Michael L. Stein. Uniform asymptotic optimality of linear predictions of a random field using an incorrect second-order structure. Ann. Statist., 18(2):850–872, 1990a.

Michael L. Stein. Bounds on the efficiency of linear predictions using an incorrect covariance function. Ann. Statist., 18(3):1116–1138, 1990b.

Michael L. Stein. Predicting random fields with increasingly dense observations. Ann. Appl. Probab., 9(1):242–273, 1999b.

Michael L. Stein, Zhiyi Chi, and Leah J. Welty. Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):275–296, 2004.

A. V. Vecchia. Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50:297–312, 1988.

Daqing Wang and Wei-Liem Loh. On fixed-domain asymptotics and covariance tapering in Gaussian random field models. Electronic Journal of Statistics, 5:238–269, 2011.

Hao Zhang. Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. Journal of the American Statistical Association, 99(465):250–261, 2004.

Hao Zhang. Asymptotics and computation for spatial statistics. In Advances and Challenges in Space-time Modelling of Natural Events, pages 239–252. Springer, 2012.

Hao Zhang and Dale L. Zimmerman. Towards reconciling two asymptotic frameworks in spatial statistics.