A Low Rank Gaussian Process Prediction Model for Very Large Datasets
arXiv [stat.CO], Jun
Roberto Rivera
Abstract—Spatial prediction requires expensive computation to invert the spatial covariance matrix it depends on and also has considerable storage needs. This work concentrates on computationally efficient algorithms for prediction using very large datasets. A recent prediction model for spatial data known as Fixed Rank Kriging is much faster than kriging and can be easily implemented with fewer assumptions about the process. However, Fixed Rank Kriging requires the estimation of a matrix that must be positive definite, and the original estimation procedure cannot guarantee this property. We present a result that shows when a matrix subtraction of a given form will give a positive definite matrix. Motivated by this result, we present an iterative Fixed Rank Kriging algorithm that ensures positive definiteness of the matrix required for prediction, and show that under mild conditions the algorithm numerically converges. The modified Fixed Rank Kriging procedure is implemented to predict missing chlorophyll observations for very large regions of ocean color. Predictions are compared to those made by other well-known methods of spatial prediction.
I. INTRODUCTION
Gaussian process prediction (GPP) requires computations of O(n³), which represents a formidable amount of computation when n is very large. One way to overcome the computational burden has been using regression splines or penalized regression splines when the spatial dependence is modelled deterministically [1, 2]. In the case of the input dependence being modeled stochastically, several models have been suggested to reduce the computational burden. Among them we have tapering the covariance matrix [3], approximating the field using a Gaussian Markov random field [4], and doing computations in the spectral domain [5]. Often these stochastic models assume stationarity, an assumption that many times is justified for small sample sizes but likely to be too restrictive for large datasets. Moreover, published applications of these methods primarily consider scenarios with a few thousand monitoring locations. When tens of thousands of observations or more are available, many of the techniques become unfeasible. In this article, we aim to modify a recent approach called Fixed Rank Kriging [6]. In particular, we give a necessary and sufficient condition that ensures F = C − bD is positive definite. Motivated by this result, we propose a new algorithm applied to Fixed Rank Kriging to estimate the necessary parameter σ² and matrix K while ensuring positive definiteness of ˆK, the estimate of K. In this article we denote that a matrix F is positive definite by F ≻ 0. In sections II to II-B Fixed Rank Kriging is presented. This method makes predictions feasible even in the case of very large datasets, without assuming stationarity. In section III we introduce the new algorithm that ensures ˆK ≻ 0, and we show that this algorithm converges numerically under general conditions. This is followed by an application using ocean color data.

II. FIXED RANK KRIGING
Cressie and Johannesson [6] introduced a new way of obtaining spatial predictions. This method is called Fixed Rank Kriging (FRK), and mainly addresses two drawbacks of classical GPP. One is that GPP cannot be implemented in the case of very large datasets since, in general, the computations involving the inverse of the covariance matrix, Σ_K⁻¹, are of the order O(n³). Even a few thousand observations will make the GPP computations unfeasible. With FRK, the computations per prediction location are significantly faster while no longer requiring a stationarity assumption. This section briefly reviews the inner workings of FRK as presented in [6], which from now on is referenced as CJ08. The model expressed in terms of a hidden process is,

Y(s) = x(s)′β + W(s) + ε(s) = H(s) + ε(s)   (1)

where H(s) = x(s)′β + W(s) is the hidden process of interest. x(s)′β represents a random field mean and/or the effects of p covariates. For observations at n locations s_1, ..., s_n, let Y = (Y(s_1), ..., Y(s_n))′, and let X be an n × p matrix with kth column (X_k(s_1), ..., X_k(s_n))′ for k = 1, ..., p. Assume rank(X) = p with p < n. Furthermore, assume W(s) and ε(s*) are uncorrelated for locations s, s* ∈ D ⊂ ℝ^d. Also assume Var(ε(s_i)) = σ²v(s_i) for i = 1, ..., n, and that W(s) is a Gaussian process with Var(W(s)) < ∞ for all s ∈ D. If a fixed number r of basis functions is chosen such that Z(s) ≡ (Z_1(s), ..., Z_r(s))′ are the basis functions evaluated at location s, CJ08 represent the spatial covariance of W(s) between locations s_i, s_j by,

C(W(s_i), W(s_j)) = Z(s_i)′ K Z(s_j)   (2)

using a symmetric positive definite r × r matrix K, a matrix that will be estimated later on. Generally, (2) is not a function of distance between locations and is therefore a nonstationary covariance.
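To make the computational gain concrete, the following NumPy sketch (our illustration, not the paper's code; sizes, Z, K, and V are randomly generated stand-ins) builds the covariance in the low-rank form of equation (3) below and inverts it with the Sherman-Morrison-Woodbury identity (4), checking the result against the direct O(n³) inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 500, 10                             # n locations, r basis functions (r << n)

Z = rng.standard_normal((n, r))            # basis functions evaluated at the n locations
A = rng.standard_normal((r, r))
K = A @ A.T + np.eye(r)                    # symmetric positive definite r x r matrix
sigma2 = 1.5
v = np.ones(n)                             # diagonal of V (v(s_i) assumed known)

# Full covariance of Y, as in (3): Sigma_K = Z K Z' + sigma^2 V
Sigma = Z @ K @ Z.T + sigma2 * np.diag(v)

# Sherman-Morrison-Woodbury inverse, as in (4): only an r x r system is solved
Vinv = np.diag(1.0 / (sigma2 * v))         # inverse of the diagonal sigma^2 V
middle = np.linalg.inv(K) + Z.T @ Vinv @ Z # r x r matrix
Sigma_inv = Vinv - Vinv @ Z @ np.linalg.solve(middle, Z.T @ Vinv)

# Agrees with the direct inverse
assert np.allclose(Sigma_inv, np.linalg.inv(Sigma), atol=1e-6)
```

Only the r × r `middle` matrix and the diagonal of V are ever inverted, which is the source of FRK's speed.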
CJ08 use the eigen-decomposition of K to show that (2) can be interpreted as being similar to a truncated Karhunen-Loève expansion but with non-orthogonal functions, Z(·). Define Z as the n × r matrix with ith row given by Z(s_i)′ = (Z_1(s_i), ..., Z_r(s_i)), and let V be an n × n diagonal matrix with v(s_i) as the ith diagonal entry (v(s_i) is assumed known). Combining (1) and (2) gives the covariance matrix of Y in terms of K as,

Var_K(Y) ≡ Σ_K = ZKZ′ + σ²V.   (3)

By representing the n × n covariance matrix Σ_K as in (3), the Sherman-Morrison-Woodbury equation may be used [7, page 50],

Σ_K⁻¹ = (σ²V)⁻¹ − (σ²V)⁻¹Z(K⁻¹ + Z′(σ²V)⁻¹Z)⁻¹Z′(σ²V)⁻¹   (4)

Now calculation of Σ_K⁻¹ requires only the inversion of an r × r matrix K and an n × n diagonal matrix V. Predictions at locations s_o are now possible by using GPP with Σ_K⁻¹ specified as in (4). Thus by substitution into the GPP equation accounting for measurement error [8] we get,

ˆH(s_o) = x(s_o)′ˆβ + Z(s_o)′KZ′Σ_K⁻¹(Y − Xˆβ)   (5)

where ˆβ = (X′Σ_K⁻¹X)⁻¹X′Σ_K⁻¹Y, while the mean squared prediction error of ˆH(s_o) is,

σ²_K(s_o) = Z(s_o)′K Z(s_o) − Z(s_o)′KZ′Σ_K⁻¹ZK Z(s_o) + (x(s_o) − X′Σ_K⁻¹ZK Z(s_o))′(X′Σ_K⁻¹X)⁻¹(x(s_o) − X′Σ_K⁻¹ZK Z(s_o))   (6)

Both the covariance parameter matrix K and the scalar σ² need to be estimated to proceed with FRK.

A. Basis functions
Smoothing spline bases, regression spline bases, radial basis functions, eigenvectors, or wavelets are among the basis functions that could be used with FRK. In this work, we use the same bisquare basis functions as [6]. These revolve around the concept of gridding the locations at different resolutions and obtaining centroid locations. Basis functions at a coarse resolution/scale capture general global attributes of the process. The finer the scale becomes, the more local are the attributes captured by each basis function at that resolution. Another benefit of the bisquare basis function is that it has an intuitive interpretation: the closer a location is to a centroid, the closer to 1 is the basis function, while the further away it is, the closer to zero is the basis function. Furthermore, each basis function has local support. For prediction, computations of Z′A⁻¹Z and Z′a for any A⁻¹ and a are O(nr) for any basis function, where r is the total number of basis functions [6]. But when the matrix Z is sparse, as is the Z matrix resulting from using bisquare basis functions, then the operations required for prediction are O(kr) for k < r << n. Suppose an FRK model is designed for a square region of regularly spaced locations such that the covariance (2) is composed of two scales of variation, namely a coarse scale with 4 functions and a finer scale with 25 functions. Figure 1 displays the basis functions at the coarse scale. Similarly, Figure 2 displays the basis functions at the fine scale.

Fig. 1: This plot reveals the coarse scale basis functions for a model with 2 scales of variation in a square region with regularly spaced locations. The data is binned and the centroid locations are obtained according to the binning.
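A minimal sketch of a bisquare basis of this type is given below. The centroid grids and scale parameters are illustrative choices of ours, not the ones behind the figures; the standard bisquare form (1 − (d/scale)²)² inside the support radius is assumed.

```python
import numpy as np

def bisquare(s, centers, scale):
    """Bisquare basis functions: 1 at a centroid, decaying smoothly to 0 at
    distance `scale`, and exactly 0 beyond it (local support)."""
    d = np.linalg.norm(s - centers, axis=1)        # distances to all centroids
    return np.where(d < scale, (1.0 - (d / scale) ** 2) ** 2, 0.0)

# Two resolutions on the unit square: a coarse 2 x 2 grid and a fine 5 x 5 grid
coarse = np.array([[x, y] for x in (0.25, 0.75) for y in (0.25, 0.75)])
fine = np.array([[x, y] for x in np.linspace(0.1, 0.9, 5)
                        for y in np.linspace(0.1, 0.9, 5)])

s = np.array([0.25, 0.25])                          # a location to evaluate
row = np.concatenate([bisquare(s, coarse, 0.75),    # coarse scale: broad support
                      bisquare(s, fine, 0.3)])      # fine scale: local support
# `row` is one row of the n x r matrix Z (r = 4 + 25 = 29 here); most fine-scale
# entries are exactly zero, which is what makes Z sparse
```

Because each basis function vanishes outside its support radius, stacking such rows over many locations yields the sparse Z exploited above.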
Fig. 2: This plot reveals the fine scale basis functions for a model with 2 scales of variation in a square region with regularly spaced locations. The data is binned and the centroid locations are obtained according to the binning.

B. Parameter estimation
Positive definiteness of the covariance matrix is required to ensure invertibility of Σ_K and positive mean squared prediction errors [8]. CJ08 bin the data for the purpose of estimating the covariance. This way, the estimation and fitting of the covariance will depend on the number of bins M, and not on the number of locations n, where M << n. Assigning bin centers {u_m : m = 1, ..., M}, a neighborhood N(u_m) is defined. The neighborhood N(u_m) could be set up using some Euclidean distance as a threshold (independent of direction, a sort of circular neighborhood, or by distances in the horizontal and vertical directions). Then define the neighborhood indicator variable as,

w_mi = 1 if s_i ∈ N(u_m) and Y(s_i) is not missing, and w_mi = 0 otherwise.

Note that a missing value at location s_i implies a weight of zero. Furthermore, designate w_m = (w_m1, ..., w_mn)′, and let 1_n be the n-dimensional column vector of ones. Without any knowledge of the process we may use the ordinary least squares estimator ˆβ of β to detrend the data such that D(s_i) = Y(s_i) − x(s_i)′ˆβ. Let D = (D(s_1), ..., D(s_n))′. Then the average of the residuals within the mth bin is,

¯D(u_m) = w_m′D / w_m′1_n for m = 1, ..., M.

Let ¯D = (¯D(u_1), ..., ¯D(u_M))′. An estimate of the true covariance matrix of the binned residuals, Σ_M = Var(¯D), based on the empirical method of moments estimator has element (m, k) given by,

ˆΣ_M(u_m, u_k) = V_D(u_m) if m = k, and C_D(u_m, u_k) if m ≠ k   (7)

for m, k ∈ {1, ..., M}, where V_D(u_m) = Σ_{i=1}^n w_mi D(s_i)² / w_m′1_n and C_D(u_m, u_k) = ¯D(u_m)¯D(u_k). Matrix ˆΣ_M has the (m, k) elements seen in equation (7). The intention is to approximate ˆΣ_M by a function of the spatial dependence parameters K and σ². That is, we want a matrix ¯Σ(K, σ²) with entries ¯Σ(K, σ²)_mk,

Cov(¯D(u_m), ¯D(u_k)) = Cov(w_m′D/w_m′1_n, w_k′D/w_k′1_n)
≈ w_m′(ZKZ′ + σ²V)w_k / ((w_m′1_n)(w_k′1_n))   (8)
= ¯Z_m′K¯Z_k + σ²¯V*_mk

such that if ¯Z = (¯Z_1′, ..., ¯Z_M′)′ is the M × r matrix of binned basis functions, and ¯V = diag(¯V*_11, ..., ¯V*_MM), then ¯Σ(K, σ²) = ¯ZK¯Z′ + σ²¯V. In order to approximate ˆΣ_M by ¯Σ(K, σ²), CJ08 quantify the 'best' approximation in terms of the Frobenius norm. Specifically for FRK,

||ˆΣ_M − ¯Σ_M(K, σ²)||_F = ||ˆΣ_M − ¯ZK¯Z′ − σ²¯V||_F   (9)

is minimized to obtain an estimate of K and σ². (Instead of deriving ¯V*_mk strictly from (8), we use ¯V_mk = w_m′Vw_k/(w_m′1_n) to get a more appropriate estimator of σ². Although the σ² estimator has been rescaled, the resulting ˆK remains invariant to the rescaling. As a consequence, FRK predictions and their standard errors are unaffected.) To ease matrix computations, the QR-decomposition of ¯Z is performed. That is, ¯Z = QR, where Q is a column-wise orthonormal matrix of dimension M × r and R is a nonsingular r × r upper triangular matrix. Then the K that minimizes (9) is given by,

ˆK(σ²) = R⁻¹Q′(ˆΣ_M − σ²¯V)Q(R⁻¹)′   (10)

Therefore ˆK(σ²) depends on σ². Substituting (10) into ¯Σ_M(K, σ²) alters (9) such that ˆσ² is the estimator that minimizes,

Σ_{m,k} ((ˆΣ_M − QQ′ˆΣ_MQQ′)_mk − σ²(¯V − QQ′¯VQQ′)_mk)²   (11)

Note that, remarkably, the minimization problem for σ² is clearly of the form of a simple least squares problem with slope σ² and no intercept, and is therefore easy to solve (of course ˆσ² must be constrained to be positive since σ² is a variance parameter). When σ² = 0, ˆK is positive definite, while if σ² > 0 the matrix is positive definite conditional on the value of σ². There is no guarantee that the estimate ˆσ² will result in a positive definite matrix. In fact, Shi and Cressie (2007, page 671) state that sometimes ˆσ² needs to be adjusted so that ˆK is positive definite, but do not give further details.
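The closed-form steps (10)-(11) can be sketched as follows. This is our NumPy illustration with randomly generated stand-ins for the binned quantities, not real data; note that the slope of the no-intercept least squares problem (11) is just a ratio of Frobenius inner products.

```python
import numpy as np

rng = np.random.default_rng(1)
M, r = 50, 8                                   # M bins, r basis functions (r < M)

Zbar = rng.standard_normal((M, r))             # binned basis functions
Vbar = np.diag(rng.uniform(0.5, 1.5, M))       # binned measurement-error variances
S = rng.standard_normal((M, M))
Sigma_hat = S @ S.T / M                        # stand-in for the moment estimate (7)

Q, R = np.linalg.qr(Zbar)                      # Zbar = QR, Q is M x r column-orthonormal

def K_hat(sigma2):
    """Frobenius-norm minimizer (10): K(sigma2) = R^-1 Q'(Sigma_hat - sigma2 Vbar) Q R^-T."""
    Rinv = np.linalg.inv(R)
    return Rinv @ Q.T @ (Sigma_hat - sigma2 * Vbar) @ Q @ Rinv.T

# sigma^2 from (11): a no-intercept least-squares slope, truncated at zero
P = Q @ Q.T
A = Sigma_hat - P @ Sigma_hat @ P
B = Vbar - P @ Vbar @ P
sigma2_hat = max(0.0, np.sum(A * B) / np.sum(B * B))

K = K_hat(sigma2_hat)                          # symmetric by construction, but its
                                               # positive definiteness is NOT guaranteed
```

Nothing in this computation constrains the eigenvalues of `K`, which is exactly the problem the next section addresses.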
In this work, an algorithm is proposed to use the positive definiteness of ˆK as a constraint. We simulate a mean stationary Gaussian process with an exponential covariance function, a partial sill of 5.5, and a range of 1. The nugget was chosen to be 1.375, so that the signal to noise ratio was 4. A sequence of 1,500 potential values of σ² was used to then calculate ˆK(σ²), its smallest eigenvalue λ_min, and the sum of squares produced in (11) by each σ² value. A value of σ² just over 2 results from minimizing (11) when constrained only so that the estimator is positive, whereas a value just over 1 gives the smallest sum of squares such that λ_min > 0. Figure 3 shows how λ_min and the sum of squares (11) relate to σ². The issue at hand is clearly seen: the minimization of the sum of squares to find ˆσ² requires the constraint that ˆK is positive definite. Hence the solution is a subspace of the unconstrained sum of squares problem for ˆσ². Yet positive definiteness is a quadratic condition and appears tricky to implement as a constraint. However, it turns out that the positive definiteness of ˆK can be applied as an upper bound constraint on ˆσ². (Although [6, page 216] state that r ≤ M < n, for the estimation of the parameters we must have r ≠ M. If the number of bins M were equal to the number of basis functions r, this would imply that QQ′ = I, which would make the minimization problem (9) impossible to solve. Therefore for FRK it is required that r < M < n, with r < M ensuring column-wise orthogonality of Q.)

Fig. 3: Left panel displays the minimum eigenvalue λ_min of ˆK(ˆσ²) versus different values of ˆσ² plugged into (10). The right panel shows the sum of squares (11) for each value of ˆσ². Straight lines indicate the constrained solution for σ².

III. A NEW ALGORITHM TO ESTIMATE THE FRK PARAMETERS
Clearly from (10), if ˆσ² is too large, then ˆK will have negative diagonal entries. A symmetric matrix with negative entries in the diagonal cannot be positive definite [7, page 141]. We wish to impose positive definiteness as a constraint when minimizing (11). The following lemma leads us in the right direction.

Lemma 1. Define F = C − bD, where all matrices F, C, D are r × r real matrices, C ≻ 0 and D ≻ 0, and C and D are symmetric. Furthermore, assume F has distinct eigenvalues and that b is any constant such that b > 0. Then,

F ≻ 0 ⇔ b < e′Ce / e′De

where e is the eigenvector associated with the minimum eigenvalue of F, λ_1.

The interpretation of the lemma is that a subtraction of two positive definite symmetric matrices of a certain form gives us a matrix F that is positive definite under certain conditions, among them that the scalar b is not too big. The following corollary shows how ˆK is a special case of Lemma 1.

Corollary 1.
Assume ˆK in (10) has distinct eigenvalues λ_1 < ... < λ_r. Then ˆK is positive definite if and only if,

σ² < e′R⁻¹Q′ˆΣ_MQ(R⁻¹)′e / e′R⁻¹Q′¯VQ(R⁻¹)′e   (12)

where e is the r × 1 normalized eigenvector corresponding to the smallest eigenvalue λ_1 of ˆK.

Result (12) motivates the use of the positive definiteness requirement of ˆK as a linear constraint on ˆσ². The following algorithm iteratively estimates σ² and K.

FRK parameter estimation algorithm
1) Calculate Q, R, ¯V, and ˆΣ_M as described in section II-B.
2) Estimate σ² by minimizing equation (11) subject only to the constraint ˆσ² > 0. Start an iteration index at zero, g = 0, 1, 2, .... Set ˆσ²_g as the result from (11).
3) Calculate ˆK_g ≡ ˆK(ˆσ²_g) using (10).
4) Check whether ˆK_g ≻ 0. This is so if λ_min,g > 0. If it is, we stop here. If it is not, calculate an upper bound according to (12) for ˆσ²_g. Let the upper bound be ˆσ²_u,g.
5) Minimize (11) over ˆσ²_g, but now subject to both the greater-than-zero constraint and the upper bound ˆσ²_u,g constraint.
6) Repeat steps 3-5 above until ˆK_g ≻ 0.

This algorithm is a special case of the cutting plane algorithm developed by Tuy [9, 10]. What remains is to show that as g increases, λ_min,g of ˆK_g increases while the upper bound provided by (12) decreases. Now we are ready to show the numerical convergence of the proposed FRK algorithm by stating the following theorem.

Theorem 1. If λ_min,g is the minimum eigenvalue of ˆK_g at iteration g, ˆK_g has distinct eigenvalues λ_min,g, ..., λ_max,g for all g, and ˆσ²_u,g is the upper bound found in Step 4 of the FRK parameter estimation algorithm at iteration g, then λ_min,g > λ_min,g−1 if and only if ˆσ²_g < ˆσ²_g−1.

Returning to the mean stationary Gaussian process with an exponential covariance function adopted earlier, implementing our FRK parameter estimation algorithm using M = 100 results in a value of ˆσ² just over 1 after 8 iterations. Figure 4 demonstrates how λ_min,g and the sum of squares from (11) increase per iteration g, while the number of negative eigenvalues and ˆσ²_g decrease with every iteration.
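Steps 1-6 can be sketched as follows. This is our NumPy illustration of the algorithm, not the paper's Matlab code; the function name, the iteration cap, and the 0.999 shrink factor that keeps each iterate strictly inside the bound (12) are our choices. Because (11) is a quadratic in σ² with no intercept, re-minimizing it under the upper bound amounts to clipping the unconstrained slope.

```python
import numpy as np

def estimate_frk_params(Sigma_hat, Zbar, Vbar, max_iter=50):
    """Sketch of the iterative FRK parameter estimation (steps 1-6).

    Sigma_hat : (M, M) method-of-moments estimate from (7)
    Zbar      : (M, r) binned basis functions, r < M
    Vbar      : (M, M) diagonal matrix of binned error variances
    """
    Q, R = np.linalg.qr(Zbar)                          # step 1: QR of Zbar
    Rinv = np.linalg.inv(R)
    P = Q @ Q.T
    A = Sigma_hat - P @ Sigma_hat @ P
    B = Vbar - P @ Vbar @ P
    sigma2 = max(0.0, np.sum(A * B) / np.sum(B * B))   # step 2: slope of (11), kept >= 0

    for _ in range(max_iter):
        # step 3: K_hat(sigma2) from equation (10)
        K = Rinv @ Q.T @ (Sigma_hat - sigma2 * Vbar) @ Q @ Rinv.T
        eigvals, eigvecs = np.linalg.eigh(K)           # ascending eigenvalues
        if eigvals[0] > 0:                             # step 4: K_hat is positive definite
            return sigma2, K
        e = eigvecs[:, 0]                              # eigenvector of smallest eigenvalue
        t = Rinv.T @ e                                 # so e'R^-1 Q' M Q R^-T e = t'Q'M Q t
        upper = (t @ Q.T @ Sigma_hat @ Q @ t) / (t @ Q.T @ Vbar @ Q @ t)   # bound (12)
        # step 5: re-minimize (11) under 0 < sigma2 < upper, i.e. clip the slope
        sigma2 = max(0.0, min(sigma2, 0.999 * upper))  # step 6: loop back to step 3
    raise RuntimeError("K_hat not positive definite after max_iter iterations")
```

In the well-specified case, where `Sigma_hat` really has the form ¯ZK¯Z′ + σ²¯V with a positive definite K, the slope in step 2 recovers σ² and the loop exits immediately; the iterations only engage when the unconstrained estimate violates (12).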
Fig. 4: Upper left panel displays ˆσ²_g per iteration g for the simulated process H(s). Upper right panel shows λ_min,g. The lower panels present the number of negative eigenvalues of ˆK_g and the sum of squares (11) per iteration g, respectively.

Kang and Cressie [11] propose an empirical way to ensure that ˆK is positive definite by updating the eigenvalues of ˆΣ_M. To maintain the variability features of the original estimate ˆΣ_M they must choose specific values of a parameter in their algorithm. In our algorithm we do not change the estimate of spatial covariance and instead iterate among the estimates of ˆσ² and ˆK. Now that we can ensure that ˆK_g ≻ 0, the following steps are taken to attain the FRK prediction of a random field of interest:
1) Calculate the ordinary least squares estimate ˆβ = (X′X)⁻¹X′Y of β.
2) Obtain the residuals D = Y − Xˆβ according to the ordinary least squares estimate of β.
3) Set up the matrix of basis functions Z.
4) Next the stage is set to estimate the covariance matrix Σ_K as in (3).
5) Calculate the method of moments estimate ˆΣ_M of Σ_M using ¯D(u_m).
6) Obtain ¯Z and ¯V. Then perform the QR-decomposition of ¯Z.
7) Estimate σ² and ˆK according to the algorithm proposed in this section and shown to numerically converge in Theorem 1.
8) Predict the random field H(s_o) using (5) and its standard error with the square root of (6).

Several things are worth mentioning. First, it is implicitly assumed that the binning has been performed in such a way that missing values are not present in the binned version of the data. Otherwise weighting of the covariance estimates would not be possible (an empty bin results in V_D(u_m) = ∞ for that bin, as seen from (7)). More importantly, in the presence of missing values the positive definiteness of the matrix can no longer be guaranteed. A remedy could be to remove empty bins from (7) or impute them by some rule. Averages based on the x, y coordinates or local averaging are just some possibilities for imputing missing values. Cressie [8] discusses median polish, another tool that could be used for imputation of missing bins. Little and Rubin [12] provide a much broader emphasis on ways to deal with missing values.

Another noteworthy fact is that storage of the n × n matrix Σ_K⁻¹ is often required in classical geostatistics, a task of the order of O(n²). In addition to savings in computational time, FRK provides substantial reduction in storage.

Another point worth mentioning is that sometimes some bins may be considerably more variable than others. To account for differences in variability among the bins, the influence of potential outliers, and for bins that have more data, the parameter estimation can be weighted. This is analogous to the use of weighted least squares instead of ordinary least squares in variogram fitting for the same purpose. CJ08 also took this into consideration and, based on [see 8, page 95] and references therein, suggest

ˆa_m ≈ (w_m′1_n)^(1/2) / V_D(u_m)

as a weight. This translates into a rescaling of the minimization problem (9).
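Putting steps 6-8 together, prediction via (5) never needs the n × n inverse explicitly: Σ_K⁻¹ can be applied to vectors through the Woodbury form (4). The NumPy sketch below is our illustration with synthetic placeholder inputs; all names, sizes, and the trend covariates at the prediction location are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, p = 400, 12, 3

Z = rng.standard_normal((n, r))                # basis matrix at the n data locations
X = np.column_stack([np.ones(n), rng.uniform(size=(n, p - 1))])  # trend covariates
A = rng.standard_normal((r, r))
K = A @ A.T + np.eye(r)                        # estimated K (assumed positive definite)
sigma2, v = 0.8, np.ones(n)
Y = rng.standard_normal(n)                     # observations (placeholder data)

def Sigma_inv_mult(Bm):
    """Apply Sigma_K^-1 to the columns of Bm via (4), without an n x n inverse."""
    Vinv = 1.0 / (sigma2 * v)                  # diagonal of (sigma^2 V)^-1
    middle = np.linalg.inv(K) + (Z * Vinv[:, None]).T @ Z   # r x r system
    t = Vinv[:, None] * Bm
    return t - Vinv[:, None] * (Z @ np.linalg.solve(middle, Z.T @ t))

# GLS trend estimate and the FRK prediction (5) at a new location s_o
beta = np.linalg.solve(X.T @ Sigma_inv_mult(X),
                       X.T @ Sigma_inv_mult(Y[:, None])).ravel()
x_o = np.array([1.0, 0.5, 0.5])                # trend covariates at s_o (illustrative)
Z_o = rng.standard_normal(r)                   # basis functions evaluated at s_o
resid = (Y - X @ beta)[:, None]
H_o = x_o @ beta + Z_o @ (K @ (Z.T @ Sigma_inv_mult(resid))).ravel()
```

The per-prediction cost is dominated by the r × r solve, which is what makes FRK usable at the scale of the ocean color application below.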
If ¯A = diag(ˆa_1, ..., ˆa_M), then the norm is equivalent to ||¯A^(1/2)ˆΣ_M¯A^(1/2) − ¯A^(1/2)¯Σ_M(K, σ²)¯A^(1/2)||_F and,

ˆK_a = R_a⁻¹Q_a′¯A^(1/2)(ˆΣ_M − σ²¯V)¯A^(1/2)Q_a(R_a⁻¹)′   (13)

where we used ¯A^(1/2)¯Z = Q_aR_a. Therefore virtually no computational cost is added by using the weights. Furthermore, the previous results in this article still hold.

Corollary 2.
Assume ˆK_a in (13) has distinct eigenvalues λ_1 < ... < λ_r. Then ˆK_a is positive definite if and only if,

σ² < e′R_a⁻¹Q_a′¯A^(1/2)ˆΣ_M¯A^(1/2)Q_a(R_a⁻¹)′e / e′R_a⁻¹Q_a′¯A^(1/2)¯V¯A^(1/2)Q_a(R_a⁻¹)′e

where e is the r × 1 normalized eigenvector corresponding to the smallest eigenvalue λ_1 of ˆK_a.

The proof is almost identical to those of Lemma 1 and Corollary 1. Theorem 1 also still holds.

Fixed Rank Kriging is a method intended for very large datasets. Comparisons with GPP are made difficult for two reasons. First, if the dataset is too large, classical GPP is not feasible. On the other hand, if a dataset is small enough for GPP, it might be too small to estimate the spatial covariance properly using (7). For example, if n = 3,600, one should probably have at least 30 pixels per bin, which doesn't leave much room for a multiresolution basis.

Finally, since the inverse computation is no longer O(n³) when FRK is used, a likelihood approach that would more fully take into account the probability distribution of the data, in comparison to the method of moments approach, is plausible. Yet keep in mind that in this case one of the parameters of interest, K, is a matrix. Stein [13] fits a covariance of the form (3) by parameterizing K and then maximizing the likelihood.

IV. CHLOROPHYLL DATA
Ocean color observations enable scientists to study several biological and biogeochemical properties of the oceans. In part, ocean color can measure surface phytoplankton (microscopic ocean plants) since color in most of the world's oceans in the visible light region (wavelength of 400-700nm) varies with the concentration of chlorophyll and other components; i.e., the more phytoplankton present, the greater the concentration of plant pigments, ergo the greener the water [14]. Ocean color is crucial for: the study of organic matter produced by algae and bacteria, the study of the biochemistry of the ocean, the assessment of the role of the ocean in the carbon cycle, and the potential global warming trend. Based on prior and ongoing ocean color satellite missions, scientists can now study the spatial and temporal variability of the biological, chemical and physical processes that regulate ocean color around the globe. For example, [15] and [16] present studies of the spatial correlation of chlorophyll at the mesoscale (about 10-200km). Substantial progress has been made in analyzing these ocean color satellite data; see [14, 17, 18]. But many computationally intensive statistical challenges remain. The ocean color satellite datasets are massive, with hundreds of thousands of observations or more around the globe. Ocean processes are generally non-stationary in both space and time. Furthermore, due to many factors, satellite data comes with large amounts of missing data. In this section our goal is to compare the prediction capability of the modified Fixed Rank Kriging (FRK) algorithm to other well known prediction methods, namely: ordinary least squares, regression thin plate splines, and universal kriging, whenever the computations required for the latter are feasible.
A. Satellite data

The quantity of interest is the normalized water-leaving radiance L_wN(λ), a subsurface radiance reflected out of the ocean through an air-sea interface. Both SeaWiFS and MODIS-Aqua measure the water leaving radiance L_wN(λ) at several wavelengths λ. These two instruments offer different space-time resolutions, different sampling patterns, and different measurement uncertainties. Table I summarizes some traits that characterize each of the two sensors.

Sensor                 | SeaWiFS        | MODIS
Satellite              | Orbview-2      | Aqua
Data available since   | 09/04/1997     | 07/04/2002
Time equator is crossed| 12:20pm        | 1:30pm
Spectral bands         | 8              | 36
Swath width            | 2806 km        | 2330 km
Spectral coverage (nm) | 402-885        | 405-14,385
Resolution             | 1100 m         | 1000 m
Tilt                   | −20°, 0°, +20° | none

TABLE I: Traits of the SeaWiFS and MODIS-Aqua sensors.

The two sensors also differ in the spatial resolution of their Level 3 products at the equator; see [20]. The L_wN(λ) data used for our analysis is processed by a semi-analytical model developed by Garver, Siegel and Maritorena, GSM01 [18]. The algorithm expresses the water leaving radiance in terms of CHL and the Inherent Optical Properties (IOP): CDM and BBP. In short, the GSM01 model inverts observations of the normalized water leaving radiance spectrum, L_wN(λ), to estimate CHL, CDM and BBP [21]. Data from two NASA satellites are available. Specifically for SeaWiFS, the time period of September 04, 1997 through July 04, 2007 is available. Aqua results from July 04, 2002 through July 04, 2007 are also available. Missing observations result from orbital sampling, sun glint and cloud cover. Regions of the globe often have observations for only about 30% of the days (sometimes less) across all years. Figure 5 shows the satellite measurement pattern for the Aqua sensor for one day (December 31, 2002), and highlights the space-time missing data problem. The orbital track of the satellite is clearly visible (especially in the Southern Hemisphere), yet on occasion, regions that fall along the sampling track do not have observations.
In fact, there are very few observations in the far north, while more data is present in the far south. This pattern in the missing values is due to a seasonal effect on the missingness mechanism. We focus on open ocean waters, where GSM01 output is considered to be more reliable than that coming from coastal water reflectance [21]. Maritorena and Siegel [22] combine the L_wN(λ) values from SeaWiFS and Aqua to produce merged CHL and IOP data. This results in better spatial/temporal coverage and sometimes lower uncertainty surrounding the variables obtained (pixels with multiple observations are imputed a weighted average of satellite data according to their respective uncertainties). More recently, [23] included MERIS data as well in the merging procedure. We begin the analysis with the field of chlorophyll values and the chlorophyll dependence on spatial location.

Fig. 5: Global chlorophyll values for December 31st 2002 using the Aqua Sensor.

V. CHLOROPHYLL FOR A VERY LARGE REGION IN THE NORTH AND SOUTH PACIFIC
In this section, we examine the predictions obtained from three prediction methods: ordinary least squares (OLS), an additive model (AM), and fixed rank kriging (FRK). For this analysis, the observations consist of 8-day Sea-viewing Wide Field-of-view Sensor (SeaWiFS) and Moderate Resolution Imaging Spectroradiometer (MODIS) Aqua merged data resulting from the GSM01 algorithm. Campbell [24] discusses how chlorophyll follows approximately a lognormal distribution. We follow this result, model the chlorophyll on the natural logarithm scale, and denote Y(s_i) = log(CHL_i) for location (pixel) i. Hypothesis tests on β in a preliminary OLS model with x(s_i) = (1, s_i1, s_i2)′ as the ith row of the n × 3 matrix X, where s_i = (s_i1, s_i2) for i = 1, ..., n (where n = 3600), and with β = (β_0, β_1, β_2)′, implied a significant effect in both North-South and East-West directions. Due to the size of the data, the AM is implemented using thin plate regression splines. More precisely, the basis functions for the thin plate regression splines model are chosen by eigen-truncating the matrix with entries coming from the basis functions of thin plate splines [2]. The FRK procedure is implemented in Matlab using a 32-bit Linux machine.

A. Measure of predictor performance

n_t observations are put aside for testing the performance of a predictor. The remaining observations are used to fit the OLS, AM, and FRK models. Then the mean squared prediction error can be estimated by cross validation,

ˆMSPE(Υ_m) = (1/n_t) Σ_{i′=1}^{n_t} (Y(s*_{i′}) − ˆH_m(s*_{i′}))²   (14)

calculated at test locations s*_{i′} for n_t prediction locations, where Υ_m is the mth modeling procedure. We also note that (14) must be used with caution. Care is needed, since this method does not prove that a spatial model is correct, only that it is 'not grossly incorrect' [8].

B. Satellite data for regions of interest
The regions in the analysis are 130-155W by 5-30N in the North Pacific and 125-150W by 5-30S in the South Pacific, both of which include 90,000 locations. Kriging would require the storage of a 90,000 × 90,000 covariance matrix for each region, and the inversion of this matrix is O(90,000³). In the North Pacific region the 8-day data has 1,947 missing values for the time period starting at Julian day 73 of year 2003, and in the South Pacific 2,403 missing values for the time period starting at Julian day 177 of year 2007. These time periods are conveniently chosen so that we can separate different sizes of test data for the comparison of the prediction models, as we did in the previous section. Figure 6 displays the South Pacific region of interest. Specks of pixels with apparently high values relative to their neighbors can be seen; these are due to the coasts of the French Polynesia islands in that region. Clearly most of the variability occurs in a North-South direction, with an increasing trend in the northward direction. The North Pacific region also shows most of the variability in the North-South direction (but with an increasing trend in the southward direction). Based on these preliminary images, we assume a linear trend with the coordinates in each direction.
Fig. 6: This plot presents the log of CHL for the region between 125° and 150°W and between 5° and 30°S.

The AM is implemented using thin plate regression splines. Exploratory analysis led us to choose 100 basis functions to fit the thin plate regression spline model. Fixed Rank Kriging is implemented using bisquare basis functions with three scales of variation: 16, 64, and 225 functions respectively. The parameters K and σ² were estimated using M = 900, V = I, and following the new procedure detailed in section III. Missing values are omitted before the comparison of the predictors. Then 15% of the data are randomly removed to test each model. The remaining observations are used to fit the OLS, AM and FRK models, and then each model is used to predict the test data. The procedure is repeated 50 times. Then the ˆMSPE is obtained using (14). The study was also conducted using 25% and 50% of the observations as test data. Table II presents the results of the analysis for the North Pacific, while Table III does so for the South Pacific. OLS gives the highest estimates of
ˆMSPE. For both the North and South Pacific we find that FRK outperforms not only OLS but AM as well. The FRK method takes about 167 seconds to calculate ˆK and ˆσ² using the procedure introduced in section III and estimate the process at all 44,026 locations in the North Pacific (see Table II) when 50% of the original data is used to test the prediction method. A regression thin plate spline implements a basis function smoothed equally in all directions. Anisotropy in the large region could hinder any method based on such an assumption. Yet a tensor product of regression splines [2] in both spatial directions did not improve on the results of the thin plate regression splines shown here.

Model | ˆMSPE (15%)       | ˆMSPE (25%)       | ˆMSPE (50%)       | CPU time
OLS   | 0.0515 (6.50e-4) | 0.0517 (5.75e-4) | 0.0516 (3.63e-4) | 0.27
AM    | 0.0169 (3.54e-4) | 0.0169 (2.71e-4) | 0.0169 (1.93e-4) | 140.84
FRK   | 0.0099 (2.73e-4) | 0.0100 (1.88e-4) | 0.0100 (8.73e-5) | 167.68

TABLE II: Comparison of the mean squared prediction error obtained from the OLS, AM and FRK models for the region in the North Pacific when 15, 25 and 50% of the data are removed and used as test data. Each cell contains the mean ˆMSPE over all 50 simulated test data sets and the corresponding standard deviation in parentheses.
Model   MSPE (15%)          MSPE (25%)          MSPE (50%)
OLS     0.0598 (8.00e-4)    0.0597 (5.45e-4)    0.0597 (4.21e-4)
AM      0.0188 (7.33e-4)    0.0188 (5.16e-4)    0.0188 (3.04e-4)
FRK     0.0159 (6.83e-4)    0.0159 (4.73e-4)    0.0159 (2.63e-4)

TABLE III: Comparison of the mean squared prediction error obtained from the OLS, AM and FRK models for the region in the South Pacific when 15, 25 and 50% of the data are removed and used as test data. Each cell contains the mean MSPE over all 50 simulated test data sets and the corresponding standard deviation in parentheses.

Moreover, the mean MSPE of each prediction method barely changes as more missing values are present in the data. In the case of OLS and AM this may be because, as suspected from Figure 6, most of the variability occurs in the North-South direction and, if thought of deterministically, the association between chlorophyll and location is almost linear. Even when half of the data is removed for cross-validation, the number of remaining observations may be enough for the FRK predictor to perform well. Further study would be needed to confirm this.

FRK allows quick prediction of missing values in a massive region without assuming that the spatial association is stationary. Figure 7 displays how the ocean color image would look if missing values were predicted using FRK and imputed into the original image (right panel). The prediction method does not induce any unwanted discontinuities or extreme outliers in the image. Unfortunately, with the number of observations used in this section GPP cannot be used. We suspect that it would underperform FRK because in this scenario the spatial association may vary considerably across the region.

[Fig. 7 image: "log(chl) available" (left panel); "log(chl) with NAs replaced" (right panel).]
Fig. 7: Natural log of CHL for the region between 130° and 150°W and 5° and 30°N for 8-day data starting on Julian day 169 of year 2007 (left panel). Over 50% of the observations are missing. The available observations are used to fit a FRK model with M = 900 bins and r = 305 basis functions. The right panel shows the result of filling in the missing values using FRK predictions.

VI. FINAL REMARKS
FRK can be viewed as a Gaussian process random effects model that allows prediction in the case of very large datasets without assuming stationarity and without storing the n × n covariance matrix. We give a result showing when the difference of two matrices (of a certain form) is guaranteed to be a positive definite matrix. Based on that result, we have proposed an iterative algorithm that finds σ̂² such that K̂ is ensured to be positive definite. We prove that the algorithm always converges numerically. The results hold when parameter estimation is weighted according to bin variability.

In this work we apply the modified FRK algorithm to ocean color data. We find that the prediction method gives smooth predictions without assuming stationary spatial dependence and can make these predictions quickly, even when the data are very large. In comparison, thin plate regression splines require more computational time while, depending on the pattern and amount of missing data, not performing as well as FRK. Moreover, the FRK predictions appear to preserve the association between CHL and the IOPs (results not shown).

Many directions are possible for future work. It would be useful to see the performance of FRK when different basis functions and different values of r are used. Shi and Cressie [25] use wavelets as basis functions, yet efficient implementation of wavelets requires the absence of missing values; the authors impute the missing values to obtain the basis function matrix. How this step affects their final results is unclear and further study is warranted. We conjecture that a flexible basis function can be derived with the use of truncated tensor product basis functions. Spatio-temporal and multivariate extensions would also be useful. Other explanatory variables known to affect ocean color could be included in the prediction model: sea surface temperature, wind and in situ measurements of ocean color could be used.
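The positive-definiteness result and the iterative correction can be illustrated numerically. The sketch below is a toy version under stated assumptions: the matrices C and D stand in for R⁻¹Q′Σ̂_M Q(R⁻¹)′ and R⁻¹Q′V̄Q(R⁻¹)′, and the shrink-and-recheck update (halving b when K fails the test) is a generic stand-in for the specific update of section III, not the paper's exact rule.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spd(r):
    """Random symmetric positive definite r x r matrix."""
    A = rng.normal(size=(r, r))
    return A @ A.T + r * np.eye(r)

r = 5
C = random_spd(r)   # stands in for R^{-1} Q' Sigma_hat_M Q (R^{-1})'
D = random_spd(r)   # stands in for R^{-1} Q' V_bar Q (R^{-1})'

def min_eigpair(F):
    """Smallest eigenvalue of symmetric F and its eigenvector."""
    vals, vecs = np.linalg.eigh(F)  # eigh returns ascending eigenvalues
    return vals[0], vecs[:, 0]

def ensure_pd(C, D, b, shrink=0.5, tol=1e-10):
    """Shrink b until K = C - b*D is positive definite.

    Lemma 1: K is positive definite iff b < (e1' C e1) / (e1' D e1),
    with e1 the eigenvector of K for its smallest eigenvalue. Since D
    is positive definite, each reduction of b strictly raises the
    minimum eigenvalue of K (Theorem 1), so the loop terminates.
    The halving update is illustrative only (an assumption here).
    """
    while True:
        K = C - b * D
        lam_min, e1 = min_eigpair(K)
        if lam_min > tol:
            return b, K
        bound = (e1 @ C @ e1) / (e1 @ D @ e1)  # Lemma 1 upper bound
        b = shrink * min(b, bound)             # move strictly below it

b0 = 100.0                  # deliberately too large: C - b0*D is not PD
b, K = ensure_pd(C, D, b0)
print("final b:", b)
print("min eigenvalue of K:", np.linalg.eigvalsh(K)[0])  # now positive
```

The monotonicity that drives termination mirrors the proof of Theorem 1: K at the smaller b equals the previous K plus a positive multiple of the positive definite D, so its minimum eigenvalue can only increase.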
The effect of some of the properties (M, r, basis function class, etc.) of the FRK model on ocean color predictions needs further study. The FRK predictions of missing values will provide more coverage of the ocean. The biological, chemical and physical forcings could be studied at different spatial scales, taking into account the predicted observations and the uncertainty surrounding their estimation.

APPENDIX
A. Proof of Lemma 1

Proof.
Since F is symmetric (it is the result of subtracting two symmetric matrices), by the spectral theorem,

    F = EΛE′,  so that  E′FE = Λ,    (15)

where Λ = diag(λ_r, ..., λ_1) is the matrix with the eigenvalues of F and E is the associated eigenvector matrix. Equality in (15) holds since the eigenvalues are all distinct, implying the eigenvector matrix is orthogonal: E′E = I ⇒ E′ = E⁻¹. Note that (15) implies that

    e_1′ F e_1 = λ_1,    (16)

where λ_1 is the smallest eigenvalue of F and e_1 its associated normalized eigenvector. Assume F ≻ 0 to prove the upper bound on b, and then complete the proof by assuming the upper bound on b and showing that then F ≻ 0. Now, given that F is positive definite, all its eigenvalues are positive. Therefore by (16),

    e_1′ F e_1 > 0  ⇒  e_1′ (C − bD) e_1 > 0  ⇒  b < (e_1′ C e_1) / (e_1′ D e_1).

To complete the proof, assume b < (e_1′ C e_1) / (e_1′ D e_1). Then there exists a ∈ ℝ with a > 1 such that

    b = v / (a (e_1′ D e_1)),    (17)

where v = e_1′ C e_1. Note that v > 0 since C ≻ 0. Then from (16),

    λ_1 = e_1′ (C − bD) e_1.

Substituting (17) for b gives

    λ_1 = e_1′ ( C − v/(a (e_1′ D e_1)) D ) e_1 = v − v (e_1′ D e_1) / (a (e_1′ D e_1)) = v (1 − 1/a) = v (a − 1)/a > 0.

The last inequality holds because v > 0 and a > 1. Since the smallest eigenvalue of F is positive, F ≻ 0. ∎

B. Proof of Corollary 1

Since, as stated in section II, Σ̂_M and V̄ are positive definite, R⁻¹Q′Σ̂_M Q(R⁻¹)′ and R⁻¹Q′V̄Q(R⁻¹)′ are also positive definite [7, p. 141]. The proof of the corollary is exactly the same as for Lemma 1 with F = K̂, C = R⁻¹Q′Σ̂_M Q(R⁻¹)′, D = R⁻¹Q′V̄Q(R⁻¹)′, and b = σ̂².

C. Proof of Theorem 1

Proof.
Assume λ_min,g > λ_min,g−1, and denote the normalized eigenvector associated with the minimum eigenvalue by e* = e_min, so that the eigenvector associated with λ_min,g at iteration g is e*_g. Furthermore, let K̂_g = C − σ̂²_g D, where we define C = R⁻¹Q′Σ̂_M Q(R⁻¹)′ and D = R⁻¹Q′V̄Q(R⁻¹)′. Showing σ̂²_g < σ̂²_{g−1} is equivalent to demonstrating that if

    σ̂²_{g−1} = σ̂²_g + δ_g,    (18)

then δ_g > 0 for all g. Using (18) one can write

    K̂_g = C − (σ̂²_{g−1} − δ_g) D = K̂_{g−1} + δ_g D.    (19)

Now one has everything required for the first part of the proof. Assuming that all eigenvalues of K̂_g are distinct,

    e*′_{g−1} K̂_{g−1} e*_{g−1} = λ_min,g−1 < λ_min,g = e*′_g K̂_g e*_g ≤ e*′_{g−1} K̂_g e*_{g−1}    (20)
      = e*′_{g−1} (K̂_{g−1} + δ_g D) e*_{g−1}    (21)
      = e*′_{g−1} K̂_{g−1} e*_{g−1} + δ_g e*′_{g−1} D e*_{g−1}.

The first equality comes from (16) with F = K̂. The inequality afterwards is by assumption in this part of the proof. The inequality in (20) is due to the Rayleigh–Ritz theorem [26]. (21) is obtained by substituting for K̂_g with (19). From the last equation we see that, given that D is known to be positive definite, δ_g > 0 for all g, therefore showing that σ̂²_g < σ̂²_{g−1} by (18).

To complete the proof, now assume σ̂²_g < σ̂²_{g−1} and write σ̂²_g = σ̂²_{g−1} − δ_g with δ_g > 0. From (10) we may write K̂_g = C − σ̂²_g D. Then

    K̂_g = C − (σ̂²_{g−1} − δ_g) D = C − σ̂²_{g−1} D + δ_g D = K̂_{g−1} + δ_g D,

from which we deduce that λ_min(K̂_g) = λ_min(K̂_{g−1} + δ_g D). As a final step,

    λ_min(K̂_g) = λ_min(K̂_{g−1} + δ_g D) > λ_min(K̂_{g−1});

the last inequality is due to theorem 8.1.5 in [7, p. 396]. ∎

D. Proof of Corollary ?? and Corollary ??

The proofs are almost identical to those of Lemma 1 and Theorem 1, respectively.

REFERENCES

[1] C. Paciorek, "Computational techniques for spatial logistic regression with large data sets," Computational Statistics and Data Analysis, vol. 51, pp. 3631–3653, 2007.
[2] S. Wood, Generalized Additive Models: An Introduction with R. Chapman and Hall, 2006.
[3] R. Furrer, M. G. Genton, and D. Nychka, "Covariance tapering for interpolation of large spatial datasets," Journal of Computational and Graphical Statistics, vol. 15, pp. 502–523, 2006.
[4] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall, 2005.
[5] M. Fuentes, "Approximate likelihood for large irregularly spaced spatial data," Journal of the American Statistical Association, vol. 102, pp. 321–331, 2007.
[6] N. Cressie and G. Johannesson, "Fixed rank kriging for very large spatial data sets," Journal of the Royal Statistical Society, Series B, vol. 70, pp. 209–226, 2008.
[7] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Johns Hopkins, 1996.
[8] N. Cressie, Statistics for Spatial Data. Wiley, 1993.
[9] H. Tuy, "Concave programming under linear constraints," Soviet Mathematics, vol. 5, pp. 1437–1440, 1964.
[10] R. Horst and H. Tuy, Global Optimization: Deterministic Approaches. Springer-Verlag, 1990.
[11] E. L. Kang, N. Cressie, and T. Shi, "Using temporal variability to improve spatial mapping with application to satellite data," The Canadian Journal of Statistics, vol. 38, pp. 271–289, 2010.
[12] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, 2nd ed. Wiley, 2002.
[13] M. L. Stein, "A modeling approach for large spatial datasets," Journal of the Korean Statistical Society, vol. 37, pp. 3–10, 2008.
[14] D. A. Siegel, S. Maritorena, and N. B. Nelson, "Independence and interdependencies among global ocean color properties: Reassessing the bio-optical assumption," Journal of Geophysical Research, vol. 110, p. C07011, 2005.
[15] S. C. Doney, D. M. Glover, S. J. McCue, and M. Fuentes, "Mesoscale variability of Sea-viewing Wide Field-of-view Sensor (SeaWiFS) satellite ocean color: Global patterns and spatial scales," Journal of Geophysical Research, vol. 108, p. 3024, 2003.
[16] M. Fuentes, S. C. Doney, D. M. Glover, and S. J. McCue, "Spatial structure of the SeaWiFS ocean color data for the North Atlantic Ocean," in Studies in the Atmospheric Sciences, Lecture Notes in Statistics. Springer-Verlag, 2000.
[17] D. A. Siegel, S. C. Doney, and J. A. Yoder, "The North Atlantic spring phytoplankton bloom and Sverdrup's critical depth hypothesis," Science, vol. 296, pp. 730–733, 2002.
[18] S. Maritorena, D. A. Siegel, and A. R. Peterson, "Optimization of a semianalytical ocean color model for global-scale applications," Applied Optics, vol. 41, pp. 2705–2714, 2002.
[19] C. Pottier, V. Garçon, G. Larnicol, P. Schaeffer, and P. Y. Le Traon, "Merging SeaWiFS and MODIS ocean color data in North and Equatorial Atlantic using weighted averaging and objective analysis," IEEE Transactions on Geoscience and Remote Sensing, vol. 44, pp. 3436–3451, 2006.
[20] W. D. Robinson, G. M. Schmidt, C. R. McClain, and P. J. Werdell, "Changes made in the operational SeaWiFS processing," in SeaWiFS Postlaunch Calibration and Validation Analyses, Part 2, SeaWiFS Postlaunch Technical Report Series, C. M. et al., Ed., 2000, p. 1228.
[21] D. A. Siegel, S. Maritorena, N. B. Nelson, M. J. Behrenfeld, and C. R. McClain, "Colored dissolved organic matter and its influence on the satellite-based characterization of the ocean biosphere," Geophysical Research Letters, vol. 32, p. L20605, 2005.
[22] S. Maritorena and D. A. Siegel, "Consistent merging of satellite ocean color data sets using a bio-optical model," Remote Sensing of Environment, vol. 94, pp. 429–440, 2005.
[23] S. Maritorena, O. Hembise Fanton d'Andon, A. Mangin, and D. A. Siegel, "Merged satellite ocean color data products using a bio-optical model: Characteristics, benefits and issues," Remote Sensing of Environment, vol. 114, pp. 1791–1804, 2010.
[24] J. W. Campbell, "The lognormal distribution as a model for bio-optical variability in the sea," Journal of Geophysical Research, vol. 100, pp. 13,237–13,254, 1995.
[25] T. Shi and N. Cressie, "Global statistical analysis of MISR aerosol data: A massive data product from NASA's Terra satellite," Environmetrics, vol. 18, pp. 665–680, 2007.
[26] H. Lütkepohl,