One-dimensional Nonstationary Process Variance Function Estimation
Eunice J. Kim, Department of Mathematics and Statistics, Amherst College
and
Zhengyuan Zhu, Department of Statistics, Iowa State University

May 24, 2016
Abstract
Many spatial processes exhibit nonstationary features. We estimate a variance function from a single process observation where the errors are nonstationary and correlated. We propose a difference-based approach for a one-dimensional nonstationary process and develop a bandwidth selection method for smoothing that takes into account the correlation in the errors. The estimation results are compared to those of a local-likelihood approach proposed by Anderes and Stein (2011). A simulation study shows that our method has a smaller integrated MSE, easily fixes the boundary bias problem, and requires far less computing time than the likelihood-based method.
Keywords: difference-based; correlated errors

1 Introduction
The prevalence of mobile devices and the increase in storage capacity have brought about high demand for spatial data analysis. Many spatial processes exhibit nonstationary features, such as a non-constant mean and variance and a changing autocorrelation structure. We encounter these features in data from ecology, geology, meteorology, astronomy, and economics, to name a few. Specific examples include processes describing natural phenomena, such as species and mineral dispersal, wind fields, crop yields, and the cosmic microwave background, as well as human activities in aggregate, such as geolocated Internet search queries, real estate prices, and air pollution. When analyzing these data, it is not only important to estimate the overall trend in the process but also useful to construct reasonable interval estimates of the mean process and provide spatial prediction intervals.

In this paper, we are interested in estimating the variance function of a one-dimensional spatial process whose mean and variance functions are smooth and whose additive errors are correlated. We assume a fixed equidistant grid design for a one-dimensional process and consider a mixed-domain asymptotic framework to develop the theory for a variance function estimator. Brown and Levine (2007) have discussed asymptotic properties of nonparametric variance estimators formed by differencing. This article extends the scenario to nonstationary correlated error processes and discusses cross-validation for bandwidth selection. Our estimator requires estimating the correlation structure embedded in the data and adjusting the scale of the difference-based estimator using the estimated correlation. We describe the asymptotic properties of our estimator and evaluate its performance using a simulation study.

Section 2 discusses prior work on variance estimation for nonstationary processes. Section 3 defines a data model and the local variogram as a product of a variance function and a standard variogram function. Section 4 looks into the estimator of the local variogram function and its theoretical properties, and Section 5 presents the algorithm for variance function estimation. Section 6 evaluates the method through a simulation study, and Section 7 discusses the advantages of the difference-based variance function estimator by comparing it to a likelihood-based estimator and closes with future work.
2 Related Work
von Neumann et al. (1941) proposed using differences of successive observations to estimate the variance of independent and identically distributed (i.i.d.) errors. Seifert et al. (1993) and Wang et al. (2008) showed that the bias in variance estimation is reduced by differencing, as differencing cancels out a constant or a gradually changing mean function. Gasser et al. (1986) proposed second-order differencing to estimate a variance function from irregularly spaced observations, and Hall et al. (1990, 1991) used a differencing approach to estimate the variance of two-dimensional processes with i.i.d. errors in image processing. All of these works assume independence among the errors.

We propose a difference-based variance function estimator for one-dimensional processes where the errors are nonstationary and correlated. Under the same data model assumptions, Anderes and Stein (2011) proposed a likelihood-based method to estimate a smooth variance function and a correlation function. Their method can handle irregularly spaced data and is statistically efficient when the Gaussianity assumption holds for the observed process. Still, the computational burden is heavy, especially when selecting the bandwidth, as covariance matrices must be inverted in every iteration of the simulation. The Gaussian distributional assumption can also be stringent for observed nonstationary processes.

Hall and Carroll (1989) discussed the asymptotic risk of the difference-based variance function estimator in nonparametric regression depending on the smoothness of the variance function relative to that of the mean function. Brown and Levine (2007) re-examined the asymptotic properties of difference-based variance estimators with non-constant and independent errors, and Wang et al. (2008) derived the asymptotic minimax risk rate in terms of the degrees of differentiability of the mean and variance functions, $\alpha$ and $\beta$ respectively. When $\alpha \ge 1/4$, the convergence rate of the risk is $O(n^{-2\beta/(2\beta+1)})$, the same as in nonparametric regression with i.i.d. errors; when $\alpha < 1/4$, the risk is $O(n^{-4\alpha})$, still slightly larger than $O(n^{-1})$. In Section 4 we describe the asymptotic results for the correlated error scenario and compare them to the independent error scenario.

In signal processing, a band-pass filter provides local variance estimation under the assumption that the marginal mean function changes slowly. The method works well under second-order stationarity of the error process but not under nonstationarity. Since the shape of the filter and the passband interact with the underlying data process, the estimation result gets distorted systematically. Also, converting the output from the frequency domain may introduce bias. Therefore, working in the time domain for output in the same domain should result in more accurate estimation.

3 Data Model and Local Variogram

We define a continuously differentiable Lipschitz function class, a nonstationary data model, and the local variogram. When estimating a variance function using a nonparametric approach, the Lipschitz condition on the mean and variance functions of the data provides a minimum-smoothness baseline.
Definition 1
Let $c_1, c_2 > 0$. Denote $q' = q - \lfloor q \rfloor$, where $\lfloor q \rfloor$ is the largest integer less than $q$. We say that a function $f(x)$ is in the class $\Lambda_q(c_f)$ if for all $x, y \in (0,1)$,
$$\left| f^{(\lfloor q \rfloor)}(x) - f^{(\lfloor q \rfloor)}(y) \right| \le c_1 |x - y|^{q'}, \qquad \left| f^{(k)}(x) \right| \le c_2 \ \text{for } k = 0, \ldots, \lfloor q \rfloor,$$
and $c_f = \max(c_1, c_2)$.

Definition 2
If a function $f(x)$ is in the class $\Lambda_q(c_f)$ and there exists $\delta > 0$ such that $f(x) > \delta$ for all $x \in [0,1]$, we say the function is in $\Lambda_q^+(c_f)$.

Consider a nonstationary continuous process model
$$Z(s) = \mu(s) + \sigma(s) X(s) \qquad (1)$$
on $0 \le s \le 1$, without loss of generality, with a smooth mean function $\mu(s)$ and additive, correlated noise given as the product of a smooth standard deviation function $\sigma(s)$ and a second-order stationary process $\{X(s)\}$ where $E(X(s)) = 0$, $\mathrm{var}(X(s)) = 1$, and $\mathrm{cov}(X(s), X(s')) = \rho(|s-s'|;\theta)$ for all pairs of $s$ and $s'$ in the unit interval. We consider $\mu(s) \in \Lambda_q(c_f)$, $q \ge 0$, and $\sigma(s) \in \Lambda_\beta^+(c_f)$, $\beta \ge 2$. We assume the following general form of a correlation function:
$$\rho(|s-s'|;\theta) = \begin{cases} 1 & s = s' \\ 1 - |s-s'|^{\alpha}\theta + O(|s-s'|^{\alpha+2}) & s \ne s' \end{cases} \qquad (2)$$
where $\theta > 0$ and $0 < \alpha < 2$. The observations are taken at $s_i = (2i-1)/(2n)$, where the location is indexed by $i = 1, \ldots, n$. As a shorthand, we write $Z_i = Z(s_i)$, $\mu_i = \mu(s_i)$, $\sigma_i = \sigma(s_i)$, $\rho_h = \rho(h/n)$, and, to specify a parametric correlation function, $\rho_{s;\theta} = \rho(s;\theta)$. We use the following notation to denote the $j$th-order derivative of a function $\sigma(s)$: $\sigma^{(j)}(s) = d^j \sigma(x)/dx^j \,|_{x=s}$.

We expand the definition of the variogram, introduced by Matheron (1962), since differencing nonstationary process data requires local treatment. Using a 0-mean nonstationary process as a data model from (1), the variance at $s$ of a lag-$h$ first-order differenced process is
$$\mathrm{var}\left( Z\!\left(s - \tfrac{h}{2n}\right) - Z\!\left(s + \tfrac{h}{2n}\right) \right) = 2\sigma^2(s)(1-\rho_h) + 2\left(\sigma^{(1)}(s)\right)^2 (1+\rho_h) \left(\tfrac{h}{2n}\right)^2 + o\!\left(n^{-2}\right). \qquad (3)$$
The first term resembles the definition of a variogram but with a local variance; the derivatives of the smoothly changing local variance and other higher-order terms follow.

Definition 3
The local variogram $\gamma_L(s,h;\theta)$ is defined as the leading term of (3), i.e.,
$$\gamma_L(s,h;\theta) = \sigma^2(s)\left( 1 - \rho\!\left(\tfrac{h}{n};\theta\right) \right). \qquad (4)$$
The local variogram (4) is a product of a variance function and the variogram of a standardized process. A variogram represents spatial dispersion by taking lagged differences of a stationary process, and the local variogram describes the spatial dispersion of a nonstationary process. When the lag size $h$ is small in comparison to the number of observed points $n$ in a process, that is, in mixed-domain asymptotics, the higher-order terms vanish.

4 Local Variogram Estimator

We first define an estimator of the local variogram in Section 4.1 as a preliminary step to estimating a variance function. The bias and the variance of the estimator are derived in Sections 4.2 and 4.3 respectively. In Section 4.4 the asymptotic convergence rate of the point-wise mean square error of the local variogram estimator is compared to that of a standard nonparametric estimator with i.i.d. errors.

4.1 Difference-based kernel estimator

If $\{Z_i\}$ is an independent and identically distributed error process with mean 0 and variance 1, taking simple differences at lag $h$ of an equally-spaced process renders a correlated sequence $\{D_{i,h}\}_{i=1}^{n-h} = \{(Z_i - Z_{i+h})/\sqrt{2}\}_{i=1}^{n-h}$ where $E(D_{i,h}) = 0$ and $\mathrm{var}(D_{i,h}) = 1$ for $i = 1, \ldots, n-h$ and any positive integer $h < n$. Brown and Levine (2007) refer to the sequence $\{D_{i,h}\}_{i=1}^{n-h}$ as pseudo-residuals. The squared pseudo-residuals at different lags resemble a variogram since $\mathrm{var}(D_{i,h}) = E(D_{i,h}^2)$ by the 0-mean property. We derive an estimator of the local variogram from the differenced process, since the $\{Z_i\}$ we consider is nonstationary as in (1).

Let $K_\lambda$ represent a Gasser-Müller kernel with bandwidth $\lambda$, and restrict its order to be greater than $\beta$, the degree of differentiability of the variance function in $\{Z_i\}$.
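As a concrete aside (our numerical sketch, not part of the paper), the data model (1) and the pseudo-residuals $D_{i,h}$ above can be computed as follows; the function names and parameter choices are illustrative assumptions, and we use an exponential correlation for $\{X(s)\}$, corresponding to $\alpha = 1$:

```python
import numpy as np

def simulate_process(n, mu, sigma, theta, rng):
    """Simulate Z(s_i) = mu(s_i) + sigma(s_i) X(s_i) on the grid s_i = (2i-1)/(2n),
    where X is a zero-mean, unit-variance Gaussian process with correlation
    exp(-d / theta) at distance d (an exponential model, i.e. alpha = 1)."""
    s = (2.0 * np.arange(1, n + 1) - 1.0) / (2.0 * n)
    corr = np.exp(-np.abs(s[:, None] - s[None, :]) / theta)
    L = np.linalg.cholesky(corr + 1e-10 * np.eye(n))   # jitter for numerical safety
    X = L @ rng.standard_normal(n)
    return s, mu(s) + sigma(s) * X

def pseudo_residuals(Z, h):
    """Lag-h pseudo-residuals D_{i,h} = (Z_i - Z_{i+h}) / sqrt(2)."""
    Z = np.asarray(Z, dtype=float)
    return (Z[:-h] - Z[h:]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
s, Z = simulate_process(500, mu=lambda t: np.sin(2 * np.pi * t),
                        sigma=lambda t: 1.0 + t, theta=0.1, rng=rng)
D = pseudo_residuals(Z, h=1)        # sequence of length n - h
```

For i.i.d. standard normal errors the squared pseudo-residuals average to about 1, and the lag-$h$ difference largely cancels a smooth mean function, which is what makes $D_{i,h}^2$ a usable local measure of dispersion.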
Gasser et al. (1985) developed the kernel so that its moment conditions simplify the calculation of high-order terms in nonparametric estimators and so that the edge effect is easily removed by adjusting the kernels at the boundaries of the domain. Without loss of generality, we assume that the process is observed on the unit interval $[0,1]$. The $1/\sqrt{2}$ scaling ensures that $\mathrm{var}(D_{i,h})$ is of the same order as $\mathrm{var}(Z_i)$, and we match the $i$th pseudo-residual to the center of the observed pair of locations, that is, $(s_i + s_{i+h})/2 = s_i + h/(2n)$. We define the Gasser-Müller kernel estimator of the local variogram at location $s$ and lag $h$ as
$$\hat\gamma_{L,\lambda}(s,h) = \sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s)\, D_{i,h}^2. \qquad (5)$$
Note that in the local variogram estimator (5) the $i$th squared difference $D_{i,h}^2$ is associated with the kernel weight indexed by $i + h/2$. The shift in the index by $h/2$ centers the kernel weight at the midpoint of the pair of observations that form $D_{i,h}$. When $h = 1$, for example, the limits of the kernel are at $s_i$ and $s_{i+1}$, corresponding to the locations of the observations $Z(s_i)$ and $Z(s_{i+1})$, which form the $i$th pseudo-residual. If the kernel weights were indexed by $i$ instead of $i + h/2$, then the lower and upper integration limits would be $(s_{i-1} + s_i)/2$ and $(s_i + s_{i+1})/2$, off by $h/(2n)$.

4.2 Bias

Let $D_{i,h} = (Z_i - Z_{i+h})/\sqrt{2}$, $\delta_{i,h} = \mu_i - \mu_{i+h}$, and $g_{i,h} = \sigma_i^2 + \sigma_{i+h}^2 - 2\sigma_i\sigma_{i+h}\rho_h$ for $i = 1, \ldots, n-h$. The expected value of the local variogram estimator is
$$E(\hat\gamma_{L,\lambda}(s,h)) = \sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s)\, E\!\left(D_{i,h}^2\right) = \frac{1}{2}\sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s) \left\{ (\mu_i - \mu_{i+h})^2 + \sigma_i^2 + \sigma_{i+h}^2 - 2\sigma_i\sigma_{i+h}\rho_h \right\}.$$
The bias of the local variogram estimator is
$$\mathrm{bias}(\hat\gamma_\lambda(s,h)) = E(\hat\gamma_\lambda(s,h)) - (1-\rho_h)\sigma^2(s) = \sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s) \left\{ \frac{1}{2}\left(\delta_{i,h}^2 + g_{i,h}\right) - (1-\rho_h)\sigma^2(s) \right\}. \qquad (6)$$
Note that $(1-\rho_h) = O(n^{-\alpha})$ and $0 < \alpha < 2$.

Theorem 4.1
Assume the nonstationary data model (1) and the correlation function (2). The mean and variance functions $\mu(s)$ and $\sigma(s)$ are continuously differentiable Lipschitz functions (see Definitions 1 and 2) with $\mu(s) \in \Lambda_q(c_f)$, $q \ge 0$, and $\sigma(s) \in \Lambda_\beta^+(c_f)$, $\beta \ge 2$. The difference-based local variogram $m$th-order Gasser-Müller kernel estimator (5) at location $s$ and lag $h$ has an asymptotic bias of order
$$\mathrm{bias}(\hat\gamma_\lambda(s,h)) = \begin{cases} O(n^{-2} + n^{-2q} + n^{-\alpha-1}) & q, \beta < m \\ O(n^{-2} + n^{-2q} + n^{-\alpha-1}) + O(n^{-\alpha}\lambda^m) & q < m \le \beta \\ O(n^{-2} + n^{-2q} + n^{-\alpha-1}) + O(\lambda^m) & m \le q. \end{cases} \qquad (7)$$

Proof 1. To calculate the asymptotic bias we split (6) into two parts. The first part involves $\delta_{i,h}^2$, whose expansion is in (19) for $q \ge 1$ and in (20) for $0 \le q < 1$. Convolved with a Gasser-Müller kernel of order $m$ (Gasser et al. (1985)), the higher-order terms in $\delta_{i,h}^2$ cancel when the number of derivatives of the mean function satisfies $q \le m$:
$$\sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s)\, \delta_{i,h}^2 = \begin{cases} O(n^{-2}) + O(n^{-2q}) & q < m \\ O(n^{-2}) + O(n^{-2q}) + O(\lambda^m) & q \ge m. \end{cases} \qquad (8)$$

The second part of the bias is $g_{i,h} - \sigma^2(s)(1-\rho_h)$. In equation (21), the leading term in the expansion of $g_{i,h}$ about $s$ is the local variogram $\sigma^2(s)(1-\rho_h)$. Applying a Gasser-Müller kernel to the remaining higher-order terms in (21), we get the following:
$$\begin{aligned} \sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s) \left\{ g_{i,h} - \sigma^2(s)(1-\rho_h) \right\} &= \sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s) \left\{ (1-\rho_h)\left( 2\sigma_s\sigma_s^{(1)}\frac{h}{n} + \sigma_s\sigma_s^{(2)}\frac{h^2}{n^2} \right) + \frac{1}{2}\left( \sigma_s^{(1)}\frac{h}{n} \right)^{\!2} \right\} \\ &\quad + \sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s)\,(1-\rho_h) \sum_{j=1}^{\lfloor\beta\rfloor} \left\{ \frac{(\sigma_s^2)^{(j)}}{j!} + \frac{(\sigma_s^2)^{(j+1)}}{(j+1)!}\frac{h}{n} \right\} (s_i - s)^j \\ &\quad + \sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s)\, \frac{h^2}{n^2} \sum_{k=1}^{\lfloor\beta\rfloor} \sum_{j=1}^{k+1} c_k\, \sigma_s^{(j)} \sigma_s^{(k-j+2)} (s_i - s)^k + \sum_{i=1}^{n-h} K_{\lambda,i+h/2}(s)\, O\!\left(|s_i - s|^\beta\right) \\ &= \begin{cases} O(n^{-\alpha-1}) + O(n^{-2}) & \beta < m \\ O(n^{-\alpha-1}) + O(n^{-2}) + O(n^{-\alpha}\lambda^m) & \beta \ge m. \end{cases} \end{aligned} \qquad (9)$$

The bias of the local variogram is summarized in (7) by combining the results in (8) and (9).
As we see from the asymptotic results summarized in (7), the order of the bias depends on the differentiability $q$ and $\beta$ of the mean and variance functions respectively, the order $m$ of the kernel, and the smoothness of the process defined by $\alpha$. As $\alpha$ approaches 0, the data process is rough and behaves almost like an independent process. As $\alpha$ approaches 2, the smoothness increases toward the form of a process with a Gaussian correlation function. Also note that when $\alpha = 1$, the smoothness is equivalent to that of a process generated with an exponential correlation function, which we use in the simulation study.

Remark 1
We detail Theorem 4.1 in the order in which the results are listed in (7).

A. Assume that $m > q$ and $m > \beta$; in other words, the order of the kernel is greater than the degree of differentiability of both the mean and variance functions. Then, (A.i) when $\alpha < 1$ and $\frac{\alpha+1}{2} < q \le 1$, the bias is $O(n^{-\alpha-1})$; (A.ii) when $\alpha < 1$ and $q \le \frac{\alpha+1}{2}$, the bias is $O(n^{-2q})$; and (A.iii) when $\alpha \ge 1$ and $q \ge 1$, the bias is $O(n^{-2})$.

B. Assume that $q < m \le \beta$ and that $\lambda = O(n^{-x})$ where $0 < x < 1$. Then $O(n^{-\alpha}\lambda^m)$ is the order of the bias in the following three settings: (B.i) $q \ge 1$, $\alpha \le 1$, and $x < 1/m$; (B.ii) $q \ge 1$, $\alpha \ge 1$, and $x < (2-\alpha)/m$; and (B.iii) $\alpha < 1$, $q < \frac{\alpha+1}{2}$, and $x < (2q-\alpha)/m$. The remaining scenarios should refer to Case A.

C. Assume that $m \le q$ irrespective of $\beta$ and that $\lambda = O(n^{-x})$ where $0 < x < 1$. Then the bias is $O(\lambda^m)$ in the following three settings: (C.i) $q \ge 1$, $\alpha \ge 1$, and $x < 2/m$; (C.ii) $q < \min\!\left(1, \frac{\alpha+1}{2}\right)$ and $x < 2q/m$; and (C.iii) $\alpha < 1$, $\frac{\alpha+1}{2} < q$, and $x < (\alpha+1)/m$. The remaining scenarios should refer to Case A.

When $m$ is greater than $q$ and $\beta$, which is Case A, the asymptotic bias is the smallest. Therefore, we recommend choosing a high-order kernel function even though we do not know $q$ and $\beta$ in practice. Following this recommendation, it is very unlikely that one encounters the bias of order $O(\lambda^m)$ of Case C.

4.3 Variance

The variance of the local variogram estimator at location $s$ and lag $h$ is
$$\mathrm{var}(\hat\gamma_\lambda(s,h)) = \sum_{i=1}^{n-h}\sum_{j=1}^{n-h} K_{\lambda,i+h/2}(s)\, K_{\lambda,j+h/2}(s)\, \mathrm{cov}\!\left(D_{i,h}^2, D_{j,h}^2\right). \qquad (10)$$
Recall $D_{i,h} = (\delta_{i,h} + \sigma_i X_i - \sigma_{i+h} X_{i+h})/\sqrt{2}$, where $X_i$ is a stationary process with mean 0, variance 1, and correlation function $\mathrm{cov}(X_i, X_{i+h}) = \rho_h$. Let $\{X_i\}_{i=1}^n$ be a Gaussian process. Then $(\sigma_i X_i - \sigma_{i+h} X_{i+h})$ is distributed $\mathrm{Normal}(0, g_{i,h})$ and its fourth moment is $E(\sigma_i X_i - \sigma_{i+h} X_{i+h})^4 = 3g_{i,h}^2$.
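As a quick Monte-Carlo sanity check of the Gaussian fourth-moment fact used in the next display (our illustration, not part of the paper): if $Y \sim \mathrm{Normal}(0, g)$, then $E(Y^4) = 3g^2$.

```python
import numpy as np

# Monte-Carlo check that E(Y^4) = 3 g^2 for Y ~ Normal(0, g).
rng = np.random.default_rng(0)
g = 2.5                                   # plays the role of the variance g_{i,h}
Y = rng.normal(0.0, np.sqrt(g), size=500_000)
fourth_moment = np.mean(Y ** 4)           # should be close to 3 * g**2 = 18.75
```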
The variance of the squared pseudo-residual is
$$\mathrm{var}\!\left(D_{i,h}^2\right) = E\!\left(D_{i,h}^4\right) - E\!\left(D_{i,h}^2\right)^2 = \frac{1}{4}\left\{ \delta_{i,h}^4 + 6\delta_{i,h}^2 g_{i,h} + 3g_{i,h}^2 - \left(\delta_{i,h}^2 + g_{i,h}\right)^2 \right\} = \delta_{i,h}^2 g_{i,h} + \frac{1}{2} g_{i,h}^2.$$
The covariance between the $i$th and the $j$th squared differences is
$$\mathrm{cov}\!\left(D_{i,h}^2, D_{j,h}^2\right) = \frac{1}{4}\left\{ E\!\left((Z_i - Z_{i+h})^2 (Z_j - Z_{j+h})^2\right) - (\delta_{i,h}^2 + g_{i,h})(\delta_{j,h}^2 + g_{j,h}) \right\} = \delta_{i,h}\delta_{j,h} P_{ij} + \frac{1}{2} P_{ij}^2,$$
where $P_{ij} = \rho_{|i-j|}(\sigma_i\sigma_j + \sigma_{i+h}\sigma_{j+h}) - \rho_{|i-j-h|}\sigma_i\sigma_{j+h} - \rho_{|i-j+h|}\sigma_{i+h}\sigma_j$ for $i \ne j$. Note that when $i = j$, $P_{ii} = g_{i,h}$. The Taylor expansion of $P_{ij}$ about $s_i$ for any $i \ne j$ is
$$P_{ij} = \left(\frac{h}{n}\right)^2 \left(\sigma_i^{(1)}\right)^2 - \left(\frac{h}{n\theta}\right)^2 \sigma_i^2 + o\!\left(n^{-2}\right). \qquad (11)$$
In Theorem 4.2 we derive the asymptotic rate of convergence of the variance of the Gasser-Müller estimator of the local variogram expressed in (10).

Theorem 4.2
Assume the same conditions as in Theorem 4.1. The variance of the local variogram estimator $\hat\gamma_{L,\lambda}(s,h)$ in (5) is asymptotically of the order
$$\mathrm{var}(\hat\gamma_\lambda(s,h)) = O\!\left(\frac{1}{n\lambda}\right) O\!\left(n^{-2q-\alpha} + n^{-2\alpha}\right). \qquad (12)$$

Proof 2
Use Taylor expansions about $s$ of $\delta_{i,h}$, $g_{i,h}$, and $P_{ij}$ (details are in equations (19) and (21) in the Appendix and in (11), respectively), and obtain a Taylor expansion about $s$ of the variance in (10) at fixed lag $h$:
$$\begin{aligned} \mathrm{var}(\hat\gamma_\lambda(s,h)) &= \sum_{i=1}^{n-h} K_{\lambda,i+h/2}^2(s) \left( \delta_{i,h}^2 g_{i,h} + \frac{1}{2} g_{i,h}^2 \right) + 2\sum_{i>j=1}^{n-h-1} K_{\lambda,i+h/2}(s)\, K_{\lambda,j+h/2}(s) \left( \delta_{i,h}\delta_{j,h} P_{ij} + \frac{1}{2} P_{ij}^2 \right) \\ &= 2\sum_{i=1}^{n-h} K_{\lambda,i+h/2}^2(s) \left\{ \delta_{i,h}^2 (1-\rho_h) O(1) + (1-\rho_h)^2 O(1) \right\} + 2\sum_{i>j=1}^{n-h-1} K_{\lambda,i+h/2}(s)\, K_{\lambda,j+h/2}(s) \left\{ \delta_{i,h}\delta_{j,h}\, O(n^{-2}) + O(n^{-4}) \right\} \\ &= 2(1-\rho_h) \sum_{i=1}^{n-h} K_{\lambda,i+h/2}^2(s) \left\{ O(n^{-2} + n^{-2q}) + (1-\rho_h) O(1) \right\} + 2\sum_{i>j=1}^{n-h-1} K_{\lambda,i+h/2}(s)\, K_{\lambda,j+h/2}(s)\, O(n^{-4}). \end{aligned} \qquad (13)$$
Use the facts that $K_{\lambda,i+h/2} = O\!\left(\frac{1}{n\lambda}\right)$ and $\sum_i K_{\lambda,i+h/2}^2 = O\!\left(\frac{1}{n\lambda}\right)$ to reduce the last line of (13) to (12).

The correlation between $D_{i,h}^2$ and $D_{j,h}^2$ is
$$\mathrm{cor}\!\left(D_{i,h}^2, D_{j,h}^2\right) = \frac{\mathrm{cov}\!\left(D_{i,h}^2, D_{j,h}^2\right)}{\sqrt{\mathrm{var}\!\left(D_{i,h}^2\right)\mathrm{var}\!\left(D_{j,h}^2\right)}} = \frac{\delta_{i,h}\delta_{j,h} P_{ij} + \frac{1}{2} P_{ij}^2}{\sqrt{\left(\delta_{i,h}^2 g_{i,h} + \frac{1}{2} g_{i,h}^2\right)\left(\delta_{j,h}^2 g_{j,h} + \frac{1}{2} g_{j,h}^2\right)}} = O(n^{-4})/O(n^{-2\alpha}) = O\!\left(n^{-(4-2\alpha)}\right).$$
Note that the correlation between the squared pseudo-residuals $D_{i,h}^2$ and $D_{j,h}^2$ for $i \ne j$ converges to 0 under infill asymptotics. Simple differencing not only removes the features of the mean function but also drastically reduces the correlation in the data, the closer $\alpha$ is to 2. When $\delta_{i,h} = o(n^{-1})$ is negligible for all $i = 1, \ldots, n-h$, in other words when $\mu(s) \in \Lambda_q$ with $q < 1$, the expression for $\mathrm{cor}(D_{i,h}^2, D_{j,h}^2)$ reduces to
$$\mathrm{cor}\!\left(D_{i,h}^2, D_{j,h}^2\right) = \frac{\left(\frac{h}{n}\right)^4 \left\{ \frac{\sigma_i^2}{\theta^2} - \left(\sigma_i^{(1)}\right)^2 \right\}^2 + o(n^{-4})}{g_{i,h}\, g_{j,h}}.$$
Since $g_{i,h} = O(n^{-\alpha})$, the rate of convergence of the correlation is again $O\!\left(n^{-(4-2\alpha)}\right)$, where $0 < \alpha < 2$.

4.4 Point-wise risk

The point-wise risk of the local variogram estimator is the sum of the squared bias in (6) and the variance in (13). The asymptotic point-wise risk can be derived by combining the results of Theorems 4.1 and 4.2.

Theorem 4.3
Consider estimating the variance function of a one-dimensional nonstationary process with $n$ equally-spaced observations whose data model follows (1) and (2). We assume that $\mu(s) \in \Lambda_q$, $q \ge 0$, $\sigma(s) \in \Lambda_\beta$, $\beta \ge 2$, and that the bandwidth satisfies $\lambda = O(n^{-x})$ where $0 < x < 1$. When the order $m$ of the Gasser-Müller kernel is greater than both $q$ and $\beta$, the point-wise risk of the estimator of the local variogram in (5) and the corresponding asymptotic bandwidth rate are
$$\mathrm{Risk}(\hat\gamma_\lambda(s,h)) = \begin{cases} O(n^{-4q}) & \lambda \asymp n^{-(1+2\alpha-4q)} \\ O(n^{-4}) & \lambda \asymp n^{-(2\alpha-3)} \end{cases} \qquad (14)$$
given $\alpha/2 < q < \min\!\left(\frac{\alpha+1}{2}, 1\right)$ for the top case, and $q \ge 1$ and $\alpha > 3/2$ for the bottom case. When the order $m$ of the Gasser-Müller kernel is greater than only one of $q$ or $\beta$, the point-wise risk is
$$\mathrm{Risk}(\hat\gamma_\lambda(s,h)) = \begin{cases} O\!\left(n^{-2m(1+2\alpha)/(1+2m)}\right) & \lambda \asymp n^{-(1+2\alpha)/(1+2m)} \\ O\!\left(n^{-2\alpha-2m/(1+2m)}\right) & \lambda \asymp n^{-1/(1+2m)} \end{cases} \qquad (15)$$
given $\alpha < \min\!\left(2q, \frac{3}{2}\right)$ for the top and $\alpha < 2q$ for the bottom.

Proof 3
The asymptotic bias and variance are derived in (7) and (12) respectively.
$$\begin{aligned} \mathrm{Risk}(\hat\gamma_\lambda(s,h), \gamma(s,h)) &= \mathrm{bias}^2(\hat\gamma_\lambda(s,h)) + \mathrm{var}(\hat\gamma_\lambda(s,h)) \\ &= \begin{cases} O(n^{-4} + n^{-4q}) + O\!\left(\frac{1}{n\lambda}\right) O(n^{-2\alpha}) & q, \beta < m \\ O(n^{-4} + n^{-4q} + n^{-2\alpha}\lambda^{2m}) + O\!\left(\frac{1}{n\lambda}\right) O(n^{-2\alpha} + n^{-2q-\alpha}) & q < m \le \beta \\ O(n^{-4} + n^{-4q} + \lambda^{2m}) + O\!\left(\frac{1}{n\lambda}\right) O(n^{-2\alpha} + n^{-2q-\alpha}) & m \le q. \end{cases} \end{aligned} \qquad (16)$$

A. Assume that $m > q$ and $m > \beta$.
(i) When $q \ge 1$, $\alpha < 2q$ holds because $0 < \alpha < 2$, and (12) reduces to $O(n^{-2\alpha-1}\lambda^{-1})$. When the asymptotic bias is $O(n^{-2})$, the bandwidth condition is met and $\lambda \asymp n^{-(2\alpha-3)}$, which requires $3/2 < \alpha$.
(ii) When $\alpha/2 < q < 1$, the asymptotic order of the bias is $O(n^{-2q})$. With $\alpha < 2q$, the asymptotic variance is $O(n^{-2\alpha-1}\lambda^{-1})$. Then $\lambda \asymp n^{-(1+2\alpha-4q)}$.

B. Assume $q < m \le \beta$.
• When $\alpha < 2q$, the asymptotic variance is $O(n^{-2\alpha-1}\lambda^{-1})$.
(i) The bias is $O(n^{-\alpha}\lambda^m)$ when $\alpha < 2q$, and balancing it with the variance gives $\lambda \asymp n^{-1/(1+2m)}$.
(ii) The bias of $O(n^{-2})$ has conditions similar to A(i), where $q \ge 1$, $\alpha > 2 - \frac{m}{1+2m}$, and $\lambda \asymp n^{-(2\alpha-3)}$.
(iii) The bias of $O(n^{-2q})$ has conditions similar to A(ii), where $q \le 1$, $\alpha > 2q - \frac{m}{1+2m}$, and $\lambda \asymp n^{-(1+2\alpha-4q)}$.
• When $\alpha \ge 2q$, the asymptotic variance is $O(n^{-2q-\alpha-1}\lambda^{-1})$.
(iv) The bias is $O(n^{-\alpha}\lambda^m)$ only when $\alpha < 2q$, which contradicts $\alpha \ge 2q$.
(v) The bias is $O(n^{-2})$ only when $q > 1$, which, together with $\alpha \ge 2q$, contradicts $0 < \alpha < 2$.
(vi) The bias of $O(n^{-2q})$ suggests $\lambda \asymp n^{-(1+\alpha-2q)} = O(n^{-1})$, which contradicts $\lambda = O(n^{-x})$ with $x < 1$.

C. Assume $m \le q$.
(i) Assuming that the bias is $O(\lambda^m)$, we have $\lambda \asymp n^{-(1+2\alpha)/(1+2m)}$ when $\alpha < 2q$, and $\lambda \asymp n^{-(1+\alpha+2q)/(1+2m)}$ when $\alpha \ge 2q$. Checking the assumptions requires:
- $O(\lambda^m) > O(n^{-2}) \iff \frac{m(1+2\alpha)}{1+2m} < 2$, which, when $\alpha > 3/2$, implies $m < \frac{2}{2\alpha-3}$;
- $O(\lambda^m) > O(n^{-2q}) \iff \frac{m(1+2\alpha)}{1+2m} < 2q$, which holds under the given conditions since $\frac{(1+2\alpha)m}{1+2m} < \frac{(1+4q)m}{1+2m} < 2q$, where the second inequality holds because $m < 2q$.
(ii) The bias of $O(n^{-2})$ has the same case as A(i) and B(ii), and $\lambda \asymp n^{-(2\alpha-3)}$.
(iii) By the second check in C(i), the bias cannot be $O(n^{-2q})$.
(v) The bias of $O(n^{-2})$ does not hold otherwise, as in B(v).

Remark 2
In (14) of Theorem 4.3, the orders of the risk and of the bandwidth are continuous at $q = 1$. In (15) of Theorem 4.3, the risks $O\!\left(n^{-2m(1+2\alpha)/(1+2m)}\right)$ and $O\!\left(n^{-2\alpha-2m/(1+2m)}\right)$ are quite similar; in fact,
$$\frac{O\!\left(n^{-2m(1+2\alpha)/(1+2m)}\right)}{O\!\left(n^{-2\alpha-2m/(1+2m)}\right)} = O\!\left(n^{2\alpha/(1+2m)}\right) = o(n).$$
In both cases, the implied range of the smoothness parameter is $\alpha < 3/2$.

Remark 3
There is divergence of the risk when (i) $q \ge \beta$ and/or (ii) the process is very smooth with $\alpha \gtrsim 3/2$. In both cases the variance function is masked by the mean process. Given $m = \beta$, as $\alpha \to 0$ ($\alpha = 0$ suggests an independent process), the risk converges to $O\!\left(n^{-2\beta/(1+2\beta)}\right)$ in all three cases. This result is consistent with the nonparametric estimation of $\beta$-differentiable functions.

5 Variance Function Estimation

We are interested in estimating the variance function embedded in a nonstationary spatial process where $\beta > q$, i.e., the variance function has greater differentiability than the mean function, and the standardized spatial process is isotropic. The estimate of a variogram at a fixed lag $h$ is given by the average of the squared differences at that lag, and the estimate of the local variogram at a fixed location and distance is a local average of the squared differences for the given location. The concept of "local" or nearby locations can be defined by the bandwidth of a smoothing kernel, and this section discusses the bandwidth selection procedure.

It is well known in the nonparametric statistics literature that when the underlying data contain correlation in the additive errors, cross-validation requires an adjustment to the data or to the penalty term of the objective function. Altman (1990) proposed adjusting the weights of the correlated residuals. Han and Gu (2008) added a penalty term to the likelihood function to adjust for the correlation and then simultaneously estimated the bandwidth and the correlation parameters. Opsomer et al. (2001) compiled several proposals for bandwidth selection in nonparametric regression with correlated errors and addressed recent developments on the theoretical front. We choose leave-one-out cross-validation to minimize the mean square prediction error of the local variogram.

Recall that $D_{i,h}^2$ denotes the $i$th squared difference of the lag-$h$ process.
Let $d_{i,h}$ represent a realization of $D_{i,h}^2$, and define the raw deviance of the local variogram estimation at $s_{i+h/2}$:
$$\epsilon_i = d_{i,h} - \hat\gamma_L\!\left( s_i + \tfrac{h}{2n},\, h \right). \qquad (17)$$
Let the covariance matrix of the deviances be $C_\epsilon$, whose $(i,j)$ element is $\mathrm{cov}(\epsilon_i, \epsilon_j)$. We use it to de-correlate the raw deviances $\mathrm{resid}_\epsilon = (\epsilon_1, \cdots, \epsilon_{n-h})$ of the local variogram estimation and denote the de-correlated residuals by
$$\xi = C_\epsilon^{-1/2}\, \mathrm{resid}_\epsilon.$$
The choice of covariance model and parameter values is not sensitive for bandwidth estimation because the correlation in $\mathrm{resid}_\epsilon$ is weak. We assume an exponential covariance model with a range parameter $\phi = 0.01$, as follows: $\mathrm{cov}(\epsilon_i, \epsilon_j) = \exp\!\left( -\frac{|i-j|}{\phi n} \right)$. The algorithm for bandwidth selection is below.

1. Use a differencing lag $h = 1$. Create a set of bandwidths $\{\lambda_j\}$ whose values do not exceed half the range of the sample domain. Estimate the local variogram over the given domain using equation (5) for each $\lambda_j$.

2. Calculate $\mathrm{resid}_\epsilon$ in (17) for each $\lambda_j$.

3. Select a bandwidth using leave-one-out cross-validation, whose minimization of the overall mean squared error is approximated as follows:
$$\hat\lambda \leftarrow \arg\min_\lambda \sum_{i=1}^{n-h} \left( \frac{\xi_i}{1 - M_{(i,i)}} \right)^{\!2}.$$
Note that $M$ is the $(n-h) \times (n-h)$ smoothing matrix of $D_{i,h}^2$ and the $(i,i)$ element of $M$ is $M_{(i,i)} = K(0)$. (An important consideration in both local variogram and variance function estimation is to guarantee that they are non-negative or positive everywhere. When a bandwidth is small, the smoothing may result in negative values, most often near the boundaries. To address this problem, one may fix the bandwidth size for locations near the boundary.)

4. Use the bandwidth from step 3 to estimate the local variogram, and scale the de-trended original data by the square root of the estimate. Here, we assume $\{Z_i\}_{i=1}^N$ is the unaccounted, correlated variation in the process:
$$\left\{ Z_i^* \right\}_{i=1}^N \leftarrow \left\{ Z_i \right\}_{i=1}^N \Big/ \left\{ \sqrt{\hat\gamma_{L,\hat\lambda}(s_i, h)} \right\}_{i=1}^N.$$

5. Select a correlation model for $\{Z_i^*\}$ and estimate the correlation parameters $\theta$. Resize the local variogram by $\hat\sigma^{*2}$ estimated from $\{Z_i^*\}$.

6. Plug in the correlation model and the parameters to estimate the variance at any $s$:
$$\hat\sigma^2(s) \leftarrow \frac{\hat\gamma_{L,\hat\lambda}(s; h)\, \hat\sigma^{*2}}{1 - \hat\rho(h; \hat\theta)}.$$

A simulation study result is in the next section, and it includes the result of bandwidth selection. Since bandwidth selection is necessary in local estimation of functions, it is important to use computationally inexpensive approaches. The generalized cross-validation we use approximates the cross-validation score and is not very heavy on computing in comparison to full leave-one-out cross-validation. In Anderes and Stein (2011), bandwidth selection adds considerable computing complexity on top of the functional estimation, since their proposal is simulation-based and requires either inverting a large matrix or taking derivatives of an estimated function for every iteration of the simulation. We also need a matrix inversion for de-correlation, but it is set up to be performed once for the functional estimation. Also, there is no need to iterate over steps 3-6, which could otherwise be a needed optimization process, because the estimation of the correlation parameter is stable across a wide range of choices in constructing $C_\epsilon$.

6 Simulation Study

Here we compare the difference-based method and the likelihood-based method in terms of statistical and computing efficiencies. We also examine the effect of dependence in correlated errors on the functional estimates. We define the oracle bandwidth as the bandwidth that yields the minimum discretely integrated mean square error (DMSE) of the estimated function. To put the difference- and likelihood-based estimations on an equal footing, we assume that the correlation functions and the parameter values are known. We label the oracle bandwidths under such a set-up as 'Diff-$\lambda_O$' and 'Like-$\lambda_O$'. In applying the proposed method as described in Section 5, we select a bandwidth 'Diff-$\lambda^*$' and estimate the correlation parameters. Levine (2006) also proposes a bandwidth selection method for heteroskedastic but independent errors, and we refer to it as 'Levine-$\lambda$'.

6.1 Experimental design

Assume a data model $Z(s) = \mu(s) + \sigma(s)X(s)$ as in (1) and set $\mu(s) = 0$ to test the method directly on the correlated error processes, i.e., under a constant mean assumption. We fix the stationary error process $\{X(s)\}$ as a Gaussian process for analytical tractability. Gaussian processes are easy to simulate, and the likelihood-based approach should prefer a tractable likelihood function, so this setting gives it no disadvantage relative to the difference-based estimation. The dependence structure is generated using an exponential correlation function with a range parameter set at three levels, $\theta = 0.1$, $0.01$, and $0$; the last, in fact, corresponds to an independent error setting. The processes are generated on an equally spaced grid over the unit interval, 0 ≤ s ≤
1. Four sample sizes n = 100, 200, 500, and 1000 are used. Thestandard deviation functions are chosen to examine the effect of differentiability of themean versus the variance functions especially on the bandwidth selections. Here is thesummary of experimental details. Note that Anderes and Stein (2011) have used σ ( s ) in3(a), and we added 3(b).1. n = 100, 200, 500 and 10002. σ ( s ) : [0 , → R + and set s ∈ (cid:110) , n − , . . . , n − n − , (cid:111) . (a) an infinitely-differentiable function: σ ( s ) = 2 sin( s/ .
$15) + 2.8$,

(b) a step function: $\sigma(s) = 1 + \mathbf{1}\{s \ge 1/2\}$.
3. For the stationary error process $\{X(s)\}$, let $\mathrm{cor}\{X(s), X(s+h/n)\} = \exp\{-h/(n\theta)\}$ and set $\theta = 0.$
$1$, $0.01$, where $h$ is small and $0 \le s \le 1 - h/n$. For the independent setting, set $\mathrm{cor}\{X(s), X(s')\} = 0$ for any $0 \le s \ne s' \le$
$1$.

4. Draw 100 random samples of each random process.

Figure 1: The true standard deviation function is shown as a thick red line. The estimation results are shown as thin gray lines for three estimation methods crossed with three levels of sample size.

We define
$DMSE(\hat\sigma_\lambda) = \sum_{i=1}^{n} (\hat\sigma_{i,\hat\lambda} - \sigma_i)^2/n$ and $L_\infty(\hat\sigma_\lambda) = \max_i |\hat\sigma_{i,\hat\lambda} - \sigma_i|$. We estimate the variance functions at 100 equally spaced locations on [0, 1]. We then evaluate the estimated functions using the discretely integrated mean square error (DMSE) as an overall measure of functional estimation and the maximum absolute deviation, i.e. the supremum norm $L_\infty$, to represent the worst-case estimation.

Figure 1 shows several variance function estimation results where the true $\sigma(s)$ is sinusoidal. The thick red line represents the true function and the thin gray lines are the estimation results of several simulations. The first row implements our proposed method; the second row applies the same idea with oracle bandwidths and known covariance parameters; and the third uses Anderes and Stein (2011)'s likelihood-based method with oracle bandwidths and covariance parameters. As expected under an infill design, the estimation becomes more precise as the number of points $n$ increases. The likelihood-based functional estimates show more undulation than the difference-based results. In other words, the oracle bandwidths for the likelihood-based method are underestimated. They were selected to minimize the DMSE, so smoothing out the undulation would result in a larger DMSE than the current functional estimates achieve.

Figures 2 and 3 summarize the simulation results for the sinusoidal $\sigma(\cdot)$ using the DMSE and the maximum absolute deviation (MAX), respectively. The color of the boxplots indicates the estimation method: the proposed method is in white, the proposed method with $\lambda_O$ in red, and the likelihood-based method with $\lambda_O$ in blue. There are two sets of tri-colored boxplots for each sample size: the first set corresponds to weakly correlated processes (with $\theta = 0.01$) and the second set to strongly correlated processes (with $\theta = 0.1$). The Diff- and Likelihood-based methods perform similarly when the oracle bandwidths are plugged in and $n$ is less than 200. As $n$ grows, Diff has smaller DMSE and MAX than Likelihood. The differences are attributable to the under-smoothed functional estimates and to boundary effects, as the likelihood-based functional estimates are flat near the boundaries (as shown in the third row of Figure 1). Note that the summary measures show a reasonable range of values, considering that $\sigma(\cdot)$ ranges from 0.8 to 4.8; the DMSEs are mostly less than 0.5 and the MAXes are generally less than 1.5. The comparison of the two methods via Figures 2 and 3 shows that Diff is a simpler method with lower estimation risk than Likelihood.

As noted in (14) of Theorem 4.3, the greater the differentiability of the variance function, the faster the risk converges. To confirm the theoretical results on the asymptotic risk, the simulation study involved four variance functions of varying differentiability, but we omit the detailed results.

Figure 2: Comparing the proposed method (white boxplots), the difference-based method with oracle bandwidths (red), and the likelihood-based method with oracle bandwidths (blue) for the estimation of the sinusoidal $\sigma(\cdot)$ using DMSE.
Figure 3: Comparing the results of functional estimation using MAX, the $L_\infty$ norm.

Table 1: Bandwidth selection summary for the (a) sine and (b) step $\sigma(\cdot)$ function estimation. The oracle bandwidths, $\lambda_O$, achieve the minimum DMSE for both the difference- and likelihood-based methods; our proposed bandwidth selection is denoted $\lambda^*$; and Levine's method is included for comparison. Standard deviations are in parentheses.

Bandwidth   |        (a) Sine         |        (b) Step
n    Method | θ=0.1  θ=0.01  indep.   | θ=0.1  θ=0.01  indep.
100  Diff-λO  (.054) (.059) (.052)     (.071) (.084) (.076)
     Diff-λ∗  (.074) (.079) (.069)     (.126) (.087) (.074)
     Levine   0.356  0.455  0.420      0.360  0.467  0.418
              (.297) (.274) (.281)     (.304) (.267) (.289)
     Like-λO  (.054) (.055) (.033)     (.032) (.030) (.030)
200  Diff-λO  (.034) (.037) (.046)     (.050) (.060) (.066)
     Diff-λ∗  (.090) (.108) (.119)     (.126) (.143) (.163)
     Levine   0.234  0.380  0.347      0.248  0.369  0.334
              (.248) (.224) (.229)     (.249) (.230) (.217)
     Like-λO  (.034) (.028) (.021)     (.025) (.024) (.023)
500  Diff-λO  (.027) (.031) (.037)     (.042) (.042) (.047)
     Diff-λ∗  0.217  0.205  0.180      0.357  0.329  0.260
              (.107) (.117) (.111)     (.143) (.147) (.159)
     Levine   0.186  0.256  0.232      0.192  0.264  0.240
              (.186) (.164) (.165)     (.193) (.152) (.166)
     Like-λO  (.016) (.016) (.016)     (.019) (.016) (.017)
1000 Diff-λO  (.026) (.026) (.023)     (.033) (.033) (.038)
     Diff-λ∗  (.121) (.117) (.109)     (.159) (.157) (.165)
     Levine   0.180  0.289  0.174      0.199  0.288  0.191
              (.155) (.118) (.094)     (.157) (.123) (.092)
     Like-λO  (.013) (.011) (.013)     (.015) (.013) (.014)

When the true values (θ = 0.
01 and 0.1) of the range parameter were plugged in, the Diff (red) and Likelihood (blue) boxplots showed very similar estimation results in both DMSE and MAX. In practice, the covariance parameter estimation brings additional uncertainty to the estimation, as reflected in the wider range of DMSE and MAX in the white boxplots.

Table 1 summarizes the selected bandwidths for the sinusoidal and step $\sigma(\cdot)$ functions. We used a degree-6 Gasser-Müller kernel for the difference-based method and a Gaussian-based higher-order kernel for the likelihood-based method. The size of the oracle bandwidth is slightly smaller for a dependent process than for an independent process when n = 500 and 1000. Our bandwidth selection gives the opposite result, in that the independent error process receives the smallest average bandwidth. Since the range of selected bandwidths is wide, there are under-smoothed and over-smoothed functions in the top row of Figure 1, especially where n is large.

We have shown through a simulation study that difference-based estimation has a smaller DMSE than a likelihood-based approach. In nonparametric regression, boundary bias can be easily fixed by adjusting the objective function near the boundary, whereas for the likelihood-based method generalized estimating equations have been suggested. Another contrast between the two approaches is computing time. A difference-based method needs no matrix inversion, which reduces the computing time by $O(n^3)$ relative to a likelihood-based method, where $n$ is the length of the data process. The bandwidth selection idea of Anderes and Stein (2011) also requires a global covariance matrix inversion and increases the computing time by $O(mn^3)$, where $m$ is the number of simulations used to generate a globally stationary process to test against the observed nonstationary process. While their bandwidth selection ideas are insightful and useful when there is a specific data model that can be simulated, likelihood-based estimation is much more costly in computing time and power.

7 Summary and Future Work
We have developed a nonparametric variance function estimator for a one-dimensional nonstationary process whose underlying stationary correlation structure is isotropic. Under certain regularity conditions we can estimate the variance function directly by applying a difference filter to the data. We assume that the error process is additive. The mean can be estimated and then removed from the data in order to apply our method and estimate the variance function. A direct application of the method to compute pseudo-residuals is possible under a smoothly varying mean function, and this should reduce the bias caused by estimating the mean function. We have investigated the infill asymptotic properties of the local variogram estimator and have shown that the asymptotic rate of convergence depends on the smoothness of the mean function relative to that of the variance function and on the mean square differentiability of the data process.

We would like to extend the difference-based method to variance function estimation for a two-dimensional nonstationary random field. In such a setting the number of difference filter choices increases in shape and size, and the dependence among the filtered process would be stronger, leading to new challenges for estimation.
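As a concrete illustration of the difference-based pipeline summarized above, the following sketch assumes $\mu(s)=0$ and a known exponential correlation model with range parameter `theta`. It is not the paper's implementation: the name `sigma_hat` is ours, a Nadaraya-Watson smoother with a Gaussian kernel stands in for the degree-6 Gasser-Müller kernel used in the study, and the $\hat\sigma^*$ rescaling step is omitted.

```python
import numpy as np

def sigma_hat(z, lam, theta):
    """Difference-based sketch of the variance function estimator: half
    squared first differences are kernel-smoothed into a local variogram,
    then rescaled by the lag-1/n correlation of the assumed exponential
    model before taking the square root."""
    n = len(z)
    s = np.linspace(0.0, 1.0, n)
    d2 = 0.5 * np.diff(z) ** 2           # half squared differences
    mid = 0.5 * (s[:-1] + s[1:])         # location of each difference
    # Nadaraya-Watson weights (Gaussian kernel, bandwidth lam).
    w = np.exp(-0.5 * ((s[:, None] - mid[None, :]) / lam) ** 2)
    gamma = (w @ d2) / w.sum(axis=1)     # smoothed local variogram
    # E[half squared difference] is approximately sigma^2(s)(1 - rho) at
    # lag 1/n, so divide out the correlation factor; theta = 0 means
    # independent errors (rho = 0).
    rho = np.exp(-(1.0 / n) / theta) if theta > 0 else 0.0
    return np.sqrt(gamma / (1.0 - rho))  # sigma estimate on the grid
```

With correlated errors, only the rescaling factor $1-\rho$ changes; that adjustment is what distinguishes this estimator from difference-based estimators built for independent errors such as Levine (2006).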
SUPPLEMENTARY MATERIAL
Here is the detailed expansion of the variance of the $h$-lagged nonstationary process with smooth mean and variance functions. It details (3) in deriving the local variogram (4) as the main term of the expansion:
\begin{align*}
&\mathrm{var}\left( Z\!\left(s - \frac{h}{n}\right) - Z\!\left(s + \frac{h}{n}\right) \right)\\
&\quad = 2\left( \sigma^2(s) + \frac{(\sigma^2)''(s)}{2!}\left(\frac{h}{n}\right)^2 + \frac{(\sigma^2)^{(4)}(s)}{4!}\left(\frac{h}{n}\right)^4 + o\!\left(\left(\frac{h}{n}\right)^4\right) \right)\\
&\qquad - 2\rho_h \sum_{k=0}^{p} \left\{ (-1)^k \left(\frac{\sigma^{(k)}(s)}{k!}\right)^2 \left(\frac{h}{n}\right)^{2k} + 2 \sum_{\substack{i+j=2k,\\ i < j}} (-1)^i\, \frac{\sigma^{(i)}(s)}{i!}\, \frac{\sigma^{(j)}(s)}{j!} \left(\frac{h}{n}\right)^{2k} \right\}\\
&\quad = 2(1-\rho_h)\left\{ \sigma^2(s) + \frac{(\sigma^2)''(s)}{2!}\left(\frac{h}{n}\right)^2 + \frac{(\sigma^2)^{(4)}(s)}{4!}\left(\frac{h}{n}\right)^4 + o\!\left(\left(\frac{h}{n}\right)^4\right) \right\}\\
&\qquad + \rho_h \left\{ \sigma\!\left(s+\frac{h}{n}\right) - \sigma\!\left(s-\frac{h}{n}\right) \right\}^2\\
&\quad = 2\sigma^2(s)(1-\rho_h) + \left\{ (\sigma^2)''(s)(1-\rho_h) + 4\left(\sigma'(s)\right)^2 \rho_h \right\} \left(\frac{h}{n}\right)^2 + o\!\left(\left(\frac{h}{n}\right)^2\right). \tag{18}
\end{align*}
Recall that $\delta_{i,h} \le c_\mu (h/n)^q$. Under the condition that $\mu(\cdot) \in \Lambda_q(c_f)$ and $q \ge$
$0$, the Taylor expansion of $\delta_{i,h}$ about location $s$ when $q \ge 1$ is:
\begin{align*}
\delta_{i,h} &= \sum_{j=1}^{\lfloor q \rfloor} \frac{\mu^{(j)}_s}{j!} \left\{ (s_i - s)^j - (s_{i+h} - s)^j \right\} + O\!\left( |s_i - s|^q + |s_{i+h} - s|^q \right)\\
&= -\frac{h}{n} \sum_{j=1}^{\lfloor q \rfloor} \frac{\mu^{(j)}_s}{j!} \sum_{a=0}^{j-1} (s_i - s)^a (s_{i+h} - s)^{j-1-a} + O\!\left( |s_i - s|^q + |s_{i+h} - s|^q \right); \tag{19}
\end{align*}
and when $0 \le q <$
$1$, it is:
\begin{equation*}
\delta_{i,h} = c\left(\frac{i}{n}\right)^q - c\left(\frac{i+h}{n}\right)^q = O\!\left(n^{-q}\right). \tag{20}
\end{equation*}
A Taylor expansion of $g_{i,h}$ about location $s$ is:
\begin{align*}
g_{i,h} &= (1-\rho_h)\sigma_s^2 + \sigma_s \sum_{j=1}^{\lfloor \beta \rfloor} \frac{\sigma^{(j)}_s}{j!} \left\{ (s_i - s)^j + (s_{i+h} - s)^j \right\} + O\!\left( |s_i - s|^\beta \right)\\
&\quad + \sum_{l=1}^{\lfloor \beta/2 \rfloor} \left( \frac{\sigma^{(l)}_s}{l!} \right)^2 \left\{ (s_i - s)^{2l} + (s_{i+h} - s)^{2l} - 2\rho_h (s_i - s)^l (s_{i+h} - s)^l \right\}\\
&\quad + \sum_{m=3}^{\lfloor \beta \rfloor} \sum_{j=1}^{m-1} \left[ \frac{c_m\, \sigma^{(j)}_s \sigma^{(m-j)}_s}{m!} \left\{ (s_i - s)^m + (s_{i+h} - s)^m \right\} - \frac{\rho_h\, \sigma^{(j)}_s \sigma^{(m-j)}_s}{j!\,(m-j)!} (s_i - s)^j (s_{i+h} - s)^{m-j} \right] \tag{21}
\end{align*}
under the condition that $\sigma(\cdot) \in \Lambda^+_\beta$ and $\beta \ge$
$2$. Note that
\begin{align*}
\sum_{i=1}^{n-h} K_{\lambda,i}(s) &= \sum_{i=1}^{n-h} \int_{(s_i + s_{i-1})/2}^{(s_i + s_{i+1})/2} \frac{1}{\lambda}\, K^{(B)}\!\left( \frac{s-u}{\lambda} \right) du \tag{22}\\
&= O(n\lambda)\, O\!\left( \frac{1}{n\lambda} \right) = O(1). \tag{23}
\end{align*}
Then
\begin{equation*}
\sum_{i=1}^{n-h} K^2_{\lambda,i+h}(s) = O\!\left( \frac{1}{n\lambda} \right). \tag{24}
\end{equation*}

References

Altman, N. S. (1990). Kernel smoothing of data with correlated errors.
Journal of the American Statistical Association, 85(411):749–759.

Anderes, E. B. and Stein, M. L. (2011). Local likelihood estimation for nonstationary random fields.
Journal of Multivariate Analysis, 102:506–520.

Brown, L. D. and Levine, M. (2007). Variance estimation in nonparametric regression via the difference sequence method.
The Annals of Statistics, 35(5):2219–2232.

Gasser, T., Müller, H.-G., and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation.
Journal of the Royal Statistical Society, Series B, 47(2):238–252.

Gasser, T., Sroka, L., and Jennen-Steinmetz, C. (1986). Residual variance and residual pattern in nonlinear regression.
Biometrika, 73(3):625–633.

Hall, P. and Carroll, R. J. (1989). Variance function estimation in regression: the effect of estimating the mean.
Journal of the Royal Statistical Society, Series B, 51(1):3–14.

Hall, P., Kay, J. W., and Titterington, D. M. (1990). Asymptotically optimal difference-based estimation of variance in nonparametric regression.
Biometrika, 77:521–528.

Hall, P., Kay, J. W., and Titterington, D. M. (1991). On estimation of noise variance in two-dimensional signal processing.
Advances in Applied Probability, 23:476–495.

Han, C. and Gu, C. (2008). Optimal smoothing with correlated data.
Sankhyā: The Indian Journal of Statistics, 70-A(1):38–72.

Levine, M. (2006). Bandwidth selection for a class of difference-based variance estimators in the nonparametric regression: A possible approach.
Computational Statistics and Data Analysis, 50(12):3405–3431.

Matheron, G. (1962).
Traité de Géostatistique Appliquée, Tome I, volume 14. Mémoires du Bureau de Recherches Géologiques et Minières, Paris.

Neumann, J. V., Kent, R. H., Bellinson, H. R., and Hart, B. I. (1941). The mean square successive difference.
The Annals of Mathematical Statistics, 12(2):153–162.

Opsomer, J., Wang, Y., and Yang, Y. (2001). Nonparametric regression with correlated errors.
Statistical Science, 16(2):134–153.

Seifert, B., Gasser, T., and Wolf, A. (1993). Nonparametric estimation of residual variance revisited.
Biometrika, 80(2):373–383.

Wang, L., Brown, L. D., Cai, T. T., and Levine, M. (2008). Effect of mean on variance function estimation in nonparametric regression. The Annals of Statistics, 36(2):646–664.