Fast increased fidelity approximate Gibbs samplers for Bayesian Gaussian process regression
FFast increased fidelity approximate Gibbs samplers for BayesianGaussian process regression
Kelly R. Moran and Matthew W. Wheeler
Abstract
Gaussian processes (GPs) are common components in Bayesian non-parametric models. Their useis supported by efficient sampling algorithms, a rich methodological literature, and strong theoreticalgrounding. However, due to their prohibitive computation and storage demands, the use of exact GPs inBayesian models is limited to problems containing at most several thousand observations. Computationaland storage bottlenecks arise when sampling the GP. Sampling requires a matrix inversion and theCholesky factorization of the conditional covariance matrix; these operations scale at O ( n ) , where n is the number of unique inputs. Storage of individual matrices scales at O ( n ) , and can quicklyoverwhelm the resources of most modern computers for larger problems. To overcome these bottlenecks,we develop a sampling algorithm using H matrix approximation of the matrices comprising the GPposterior covariance. These matrices can approximate the true conditional covariance matrix withinmachine precision and allow for sampling algorithms that scale at O ( n log n ) time and storage demandsscaling at O ( n log n ) . We also describe how these algorithms can be used as building blocks to modelhigher dimensional surfaces at O ( dn log n ), where d is the dimension of the surface under consideration,using tensor products of one-dimensional GPs. Though various scalable processes have been proposedfor approximating Bayesian GP inference when n is large, to our knowledge, none of these methods showthat the approximation’s Kullback-Leibler divergence to the true posterior can be made arbitrarily smalland may be no worse than the approximation provided by finite computer arithmetic. In what follows,we describe H− matrices, give an efficient Gibbs sampler using these matrices for one-dimensional GPs,offer a proposed extension to higher dimensional surfaces, and investigate the performance of this fastincreased fidelity approximate GP, FIFA-GP, using both simulated and real data sets. Gaussian processes (GPs) provide flexible priors over smooth function spaces and are widely used in Bayesiannon-parametric modeling. Though such models have strong theoretical underpinnings (e.g., see [1] andreferences therein), they are limited in use due to the computational complexity required to sample fromthe posterior distribution, which grows at cubic time complexity and quadratic storage requirements inrelation to the number of unique observations. For n observations, estimation requires computing the inverse,determinant, and square-root decomposition of the n × n covariance matrix. These operations scale at O ( n )with O ( n ) memory requirements. When using Gibbs sampling, computational issues are compounded asreliable inference depends on multiple samples being drawn from the posterior distribution. In practice, theuse of GPs is limited to data sets containing at most several thousand observations.A large literature exists on scalable GPs. Common methods for approximating the likelihood from the fullGP include (a) using a subset of the training data [2, 3]; (b) introducing a grid of inducing points [4, 5, 6, 7] toapproximate the covariance matrix using a low-rank matrix that is scaled further using structure exploitingalgebra [8, 9]; (c) relying on local data to make predictions in a given region [10, 11, 12]. Methods based on(a) or (b) struggle to capture local structure in the data unless the subset size or number of inducing pointsapproach the number of training points. 
Methods based upon (c) tend to capture local structure well butmay struggle with global patterns; additionally, they do not offer an approximation to the global covariance.[13, 14] suggest methods for approximating posterior realizations of a function conditional on observed data,but do not address hyperparameter uncertainty or quantify the distributional similarity between the true and1 a r X i v : . [ m a t h . S T ] J un pproximating posterior. The bulk of existing approximation approaches do not address the issue of unifiedparameter estimation and function interpolation. Further, the majority of the extant literature focuses onfast approximate methods for computing the inverse of a matrix, but does not address estimating the square-root of this inverse, which is required to sample from the process. For a thorough review of these approaches,see [15].Though the majority of the above methodologies are used in a frequentist context for point estimation, ex-plicitly Bayesian models have been proposed to efficiently approximate samples from a GP while accountingfor hyperparameter uncertainty for large n . [16] uses a combination of inducing point and sparse kernelmethods to capture both local and global structure. Global structure is captured via a reduced rank approx-imation to the GP covariance matrix based on truncating the Karhunen–Lo`eve expansion of the process andsolving the resulting integral equation using the Nystr¨om method, and residual local structure is capturedvia a tapered kernel. Approximation fidelity is a function of taper length and number of inducing points.Computation using this method scales at O ( nm + nk ), where m is the number of inducing points and k is the average number of non-zero entries per row in the approximated covariance matrix. For high fidelityapproximations, or in higher dimensions, the size of m and k may need to approach n to reach the desiredaccuracy. This leads to a sampling algorithm with quadratic complexity.As an alternative to this approach, [17] allows for increased computational efficiency and a probabilistic boundon the Frobenius norm between the approximated and true covariance matrix. This approach constructs alinear projection of the data onto a lower dimensional subspace using a stochastic basis construction [18,19]. The m − dimensional subspace in this random projection method and that of the inducing point methodin [16] are equivalent when the projection matrix is chosen to be the first m eivenvectors of the SVD ofthe GP covariance matrix. The construction of these approximate matrices may scale poorly as n increasesand increases the overall computation time even though the sampling algorithm is efficient. Finally, thealgorithm still scales at O ( m ) , where m is the dimension of the subspace approximation, and in higherdimensions m tends to approach n, resulting in no computational savings.A Bayesian approach that does offer significant computational savings is the hierarchical nearest-neighborapproach [20] which has O ( n log n ) complexity. 
This method serves as a sparsity-inducing prior allowing forscalability to previously unimplementable data sizes; however, it sacrifices the ability to retain fidelity to atrue (non-sparse) covariance matrix by forcing correlations between the majority of observations to be exactly0 , guaranteeing the approximating covariance matrix will diverge from the true covariance matrix.As an alternative to the above mentioned matrix approximation methods, H -matrices [21, 22] provide effi-cient matrix compression techniques, that allow for quasi-linear approximate solutions to linear systems ofequations, determinant computation, matrix multiplication, and Cholesky-like square root matrix decompo-sition. The computational complexity of a given operation can vary depending on the type of H -matrix andthe particular algorithm, but those considered here, and the majority of the H− matrix approaches definedin the literature, have cost at most O ( n log n ). H -matrix techniques combine methods similar to [20] and[17] to accurately approximate a covariance matrix to near machine precision by decomposing the matrix asa tree of partial and full rank matrices. H -matrices have previously been used for likelihood approximation and kriging in large scale GPs [23, 24,25], Gaussian probability computations in high dimensions [26, 27], and as a step in approximate matrixsquare root computation using Chebyshev matrix polynomials for conditional realizations [13]. Despitethese advances, from a Bayesian perspective, the benefits of the H− matrix formulation do not directlytranslate into efficient Gibbs sampling algorithms. This is because the posterior conditional distributionhas a covariance matrix that is comprised of multiplicative component pieces and not directly amenable to H− matrix approximation (i.e., we can’t approximate the posterior covariance using one single H− matrix;see equation 3). To overcome this difficulty, we propose a sampling algorithm that does not require directcomputation (approximate or otherwise) of the posterior covariance matrix, but instead uses the propertiesof H− matrices to sample from the posterior conditional distribution. This algorithm is efficient, providingscalable near exact GP approximations for d = 1 . Construction of these matrices bottlenecks when the surface under study has dimension d >
1. In caseswhere d >
1, we approximate the surface using a tensor product of d d H -matrices, requiring d O ( n log n ) matrixfactorizations for the tensor product formulation. We develop a novel sampling step for the bases of thefunction posterior with computational cost O ( d n log n ).When d = 1 , we show the proposed method has a bound on the Kullback–Leibler divergence betweenthe approximated and true conditional posterior (competing methods only provide such a bound on the prior ) that, in many cases, can be increased to machine precision. Further when d >
1, we show howthe method defines a stochastic process with an infinite functional basis based upon the individual GPcovariance kernels. The methodology has the ability to run problems with n on the order of 10 on a localmachine. Additionally, R code that can be used to reproduce all examples in this paper is included in theSupplementary Materials.The manuscript is structured as follows: Section 2 gives an overview of Gaussian process modeling anddescribes existing methods for approximate Bayesian computation. Section 3 describes the matrix decompo-sition method used for the proposed GP approximation. Section 4 offers proofs for the approximation fidelityof the posterior and details the sampling algorithm. We illustrate the relative performance of all methodsalongside a true GP in Section 5 with a simulation study having small n and then compare performance ofjust approximate methods using a large n simulation. Section 6 provides timing and performance resultsusing real data. Finally, section 7 discusses possible applications and extensions of this approximation toother high-dimensional Bayesian settings. We provide an overview of Gaussian process models and detail the computational bottlenecks associatedwith their use in Bayesian samplers. We also review existing approaches for approximating the GP posterior,pointing out drawbacks with each to motivate our approach.
Suppose one observes n noisy realizations of a function f : R d → R at a set of inputs { x i ∈ R d } ni =1 . Denotethese observations { y i ( x i ) } ni =1 , and for simplicity of exposition, assume the inputs are unique. In the classicregression setting, error is assumed independent and normally distributed: y i = f ( x i ) + e i , e i ∼ N(0 , τ − ) , i = 1 , . . . , n. (1)When f is an unknown function it is common to assume f ( · ) ∼ GP ( m ( · ) , k ( · , · )), a Gaussian process withmean function m ( · ) and covariance function k ( · , · ) . This process is completely specified by the mean andthe covariance function.
A priori the mean function is frequently taken to be zero, with behavior of theprocess defined by the covariance function. This function, k ( · , · ) , along with its hyper-parameters, Θ, specifyproperties of f (e.g., for the exponential covariance kernel realizations of f are smooth and infinitely differ-entiable); the hyper-parameters Θ further specifies these properties (e.g. how quickly covariance betweenpoints decays or how far the function tends to stray from its mean).Letting f be a zero centered GP with k ( · , · ) being any symmetric covariance kernel (e.g., squared exponen-tial or Mat´ern) having hyper-parameters Θ, one specifies a nonparametric prior over f for the regressionproblem given in (1). As an example, one may define k ( · , · ) using the squared exponential covariance ker-nel, parameterized as k ( x i , x j ) = σ f exp (cid:2) − ( x i − x j ) (cid:48) Ω( x i − x j ) (cid:3) with Ω = diag(1 /ρ , . . . , /ρ d ), whereΘ = ( σ f , ρ , . . . , ρ d ).In general for n observations, define K nn (Θ) to be the covariance matrix for the n observed inputs using k ( · , · ) . That is, K nn (Θ) is such that [ K nn (Θ)] i,j = k ( x i , x j ) for i, j ∈ { , . . . , n } . Let Y = [ y , . . . , y n ] (cid:48) be the3ector of observed noisy realizations of f. The log-likelihood function is then (cid:96) n (Θ , τ ) = − n π ) − log (cid:2) det( K nn (Θ) + τ − I n ) (cid:3) − Y T ( K nn (Θ) + τ − I n ) − Y. (2)Computing this log-likelihood requires calculating the determinant and inverse of the n × n matrix K + τ − I n , which are both O ( n ) operations. For Bayesian computation, the computational bottlenecks increase.Defining f = { f ( x i ) } ni =1 and K = K nn (Θ), the posterior of f conditional on the hyper-parameters is known,i.e., f | y , X, Θ , τ ∼ N( µ f , Σ f ) , Σ f = K − K ( K + τ − I n ) − K = K ( τ K + I n ) − ,µ f = K ( K + τ − I n ) − y = τ Σ f y . (3)Calculating the covariance in (3) requires the inversion of ( τ K + I n ) followed by the matrix multiplication of K and ( τ K + I n ) − , which are both O ( n ) operations. Calculating just the mean in (3) requires either theinversion of ( K + τ − I n ) followed by two matrix-vector products (an O ( n ) followed by two O ( n ) operations)or first calculating the variance in (3) followed by a matrix-vector product (two O ( n ) operations followedby one O ( n ) operation).For Markov Chain Monte Carlo algorithms, assuming prior distributions are assigned to Θ and τ, inferenceproceeds by sampling from p (Θ , τ, f | y , X ) , which requires evaluating (2) and (3) a large number of times.In addition to these evaluation, the Cholesky decomposition of the matrix product is required for samplingfrom the posterior of f at each iteration. This is also an O ( n ) operation. The computational cost of eachsample limits the use of Bayesian estimation of the full GP to problems having at most 5000 observationson most single processor computers, and 5000 observations is generous. It is the authors’ experience that1500 is a more reasonable upper bound for most problems. Given the difficulties of evaluating (2) and (3) for large n, a number of approximation methods have beendeveloped. The approach of [16] combines the reduced rank process of [31] with the idea of covariance taperingin order to capture both global and local structure. 
Specifically, f ( x ) is decomposed as f ( x ) = f global ( x ) + f local ( x ) , where f global ( x ) is the reduced rank approximation from [31] and f local ( x ) = f ( x ) − f global ( x ) isthe residual of the process after this global structure has been accounted for. The covariance function of theresidual f local ( x ) is approximated using a tapered covariance kernel, i.e. a kernel in which the covariance isexactly 0 for data at any two locations whose distance is larger than the specified taper range. The full-scalemethod has improved performance relative to the reduced rank process or covariance tapering alone and hasthe desired quality of capturing global and local structure. However, the quality of this approximation to theoriginal function is highly dependent on the choice of taper length and number of inducing points and thereis no way to constructively bound the error between realizations of the approximate and true covariancematrices.The compression approach of [17] approximates the covariance matrix K by K lp = (Φ K ) T (Φ K Φ T ) − Φ K ,where Φ is a projection matrix. The “best” rank- m projection matrix in terms of || · || F and || · || is thematrix of the first m eigenvectors of K. Because finding the spectral decomposition of K is itself an O ( n )operation, the two algorithms in [17] focus on finding near-optimal projection approximations. The secondof the proposed algorithms in the paper has the advantage of a probabilistic bound on the Frobenius normbetween the projection approximation and true covariance matrix. That is, one can choose algorithm settingss.t. Pr( || K − Φ T Φ K || F < ε ) = p for some desired probability p. However, it is iterative; its use requiresexpensive pre-computation steps that scale poorly as n increases with no defined order of computationalcomplexity in this pre-computation phase.The hierarchical nearest-neighbor GP [20] is introduced as a sparsity-inducing prior that allows for fullyBayesian sampling with scalability to previously unimplementable data sizes. This prior introduces a finite set4f neighborhood sets having block sparsity structure, and defines a relation between these neighborhood setsvia a directed acyclic graph. The distribution at any new points is then expressed via nearest neighborhoodsets, yielding a valid spatial process over uncountable sets. While this prior is shown to yield similar inferenceto the full GP, the choice of a sparse prior means the ability to retain fidelity to a true (non-sparse) covariancematrix is sacrificed. Thus, although inference in simulation using the true and approximated posterior appearsimilar in numerical simulations, there is no bound on their divergence.The H -matrix approximations used in FIFA-GP for approximating K generalize block-diagonal, sparse,and low-rank approximations to the underlying covariance matrix [24]. The first layer of the FIFA-GPapproximation to K is analogous to the NNGP with neighbor sets defined to be those points located in thesame dense diagonal block. These off diagonal blocks can be estimated using parallel random projectionapproaches to enable fast decomposition into low-rank approximations, which are similar to the projectionalgorithms proposed for the full matrix in the compressed GP model. Additionally, the unique structure ofthe H matrix composition allows for fast computation of the determinant, the Cholesky, and the inverse,which are required for an MCMC algorithm, but cannot be used directly to sample from (3). 
Here, one ofthe key contributions of this manuscript is showing how to combine standard H -matrix operations to drawa random vector with distribution defined in (3).Using FIFA-GP, local structure is captured due to the dense block diagonal elements. At the same time,global structure is preserved via high-fidelity off-diagonal compression. Furthermore, the approximationerror between the true and approximate covariance matrices is bounded when constructing the H− matrix.This allows for a bound on the KL-divergence of the posterior for the GP, and this bound has not beenshown in previous methods. H -matrices In this section we describe how low rank approximations and tree-based hierarchical matrix decompositionsenable fast and accurate computations for Gaussian process.
The rank of a matrix A ∈ R m × n is the dimension of the vector space spanned by its columns. Intuitively,the rank can be thought of as the amount of information contained in a matrix. The matrix A is full rank ifrank( A ) = min( m, n ); such a matrix contains maximal information for its dimension. For any rank p matrix A it is possible to write A = U V T where U ∈ R m × p and V T ∈ R p × n , then it is only necessary to store O (max( m, n ) p ) values rather than O ( mn ) values. For GP covariance matrices having n observations thesavings gained by relaxing the O ( n ) memory requirements can be quite significant (see SI Section 10.1.1for a concrete example).It is also possible to approximate full-rank A with some rank- p matrix A p so that A ≈ A p = U V T . If p (cid:28) rank( A ) significant computational gain can be achieved, but approximation fidelity depends on how muchinformation is lost in the representation of A as some lower-rank matrix. With most low-rank factorizationmethods the approximation error decreases as p approaches rank( A ). A hierarchical matrix ( H -matrix) is a data-sparse approximation of a non-sparse matrix relying on recursivetree-based sub-division. The data-sparsity of this approximation depends on how well sub-blocks of theoriginal matrix can be represented by low-rank matrices, and whether the assumed tree structure aligns withthese potentially low-rank blocks. For dense matrices having suitably data-sparse H -matrix approximations,many dense linear algebra operations (e.g. matrix-vector products, matrix factorizations, solving linearequations) can be performed stably and with near linear complexity since they take place on low-rank or5mall full-rank matrices. Storage gains are made when the decomposed low-rank blocks can be stored ratherthan the original full dense matrix blocks. The construction of and algorithms for H− matrices are vasttopics and outside of the scope of this manuscript; for further information on H -matrices we refer the readerto [21, 32, 22].A key advantage to using H -matrices for matrix approximation is that the error between the true matrixand the approximate matrix can be bounded by construction . Specifically, the max-norm (i.e. the maximumelementwise absolute difference) can be made arbitrarily small for the type of H− matrix used in FIFA-GP.The ability for a given matrix to be well approximated by an H -matrix (and thus the potential for com-putational and storage gains) depends on the structure of that matrix. With a GP covariance matrix, therelationship between rows and columns of the matrix depends on the ordering of the individual data points.In one dimension, a simple direct sorting of the points will lead to the points closest in space being closest inthe matrix. In multiple dimensions, clustering algorithms can be used to group similar data points. Alterna-tively, fast O ( n log n ) sorting via a kd -tree can be used (e.g. [33]), in which the data are sorted recursivelyone dimension at a time.There are many types of H matrices[21, 34, 35, 32, 22]. Each can facilitate posterior computation using themethods outlined below. For our exposition, we focus on Hierarchical Off-Diagonal Low Rank (HODLR)matrices, a type of H -matrix, due their fast construction and ease of use. HODLR matrices are definedrecursively via 2 × H -matrices, HODLR matrices may be faster in practice, but have the sametime complexity for the same algorithm. 
As such, it can have faster decomposition and solve algorithms forour purposes; however, they sacrifice the flexibility of allowing high-rank off-diagonal blocks and adaptivematrix partitionings. HODLR matrices for Gaussian processes defined over dimensions greater than onemay be difficult to construct in practice.Figure 1: Example partition tree and associated partitioning of a 2-level HODLR matrix. The diagonalblocks are full rank, while the off-diagonal blocks can be represented via a reduced rank factorization.HODLR matrices, their factorization, and their use in GP likelihood estimation and kriging have beenthoroughly discussed in [35, 36]. Open source code for constructing, factorizing, and performing select linearalgebra operations using HODLR matrices is available in the HODLRlib library [37]. Of relevance to our workare fast symmetric factorization of symmetric positive-definite HODLR matrices, determinant calculation,matrix-vector multiplication, and solver. Specifically, symmetric positive-definite HODLR matrix A can besymmetrically factored into A = W W T in O ( n log n ) time [36]. An example is included in SI Section 10.1.Further details on these algorithms are available in [35, 36]..Assuming points proximal in space are near each other in the covariance matrix, a major benefit of the6 = x x Full rank; Low rank; Zero matrix; Identity matrix;
Figure 2: Factorization of a 2-level HODLR matrix.HODLR decomposition is that local structure is captured with high fidelity due to the dense diagonalblocks in the approximate covariance matrix. By construction, these diagonal blocks are perfectly preserved.Furthermore, global structure is not ignored, as in covariance tapering methods, but rather approximatedvia low-rank representations of the off-diagonal blocks. The neighborhood sets of the hierarchical nearest-neighbor GP [20] can be thought of similarly as the block diagonal entries (although these neighborhoodsneed not be all of the same size), but with relationships between neighborhood sets providing an update tothis block-sparse structure rather than the subsequent matrix factors in Figure 2.
Consider the case where d = 1 . Bayesian GP regression requires estimating the unknown hyperparametersof the covariance kernel. As these parameters vary based upon the kernel chosen, we only consider thesquared exponential kernel, i.e., k ( x i , x j ) = σ f exp( − ρ || x i − x j || ), with extensions to other covariancekernels being straightforward. Under this assumption, let K σ f ,ρ denote the n × n matrix computed from thesquared-exponential kernel having hyperparameters σ f and ρ , and evaluated at { x i ∈ R } ni =1 .Let y ∼ N ( f, τ − ) , with conditionally conjugate priors specified for the parameters, i.e., take τ ∼ Ga( a / , b / , /σ f ∼ Ga( a / , b / , and ρ ∼ (cid:80) r(cid:96) =1 r − δ s (cid:96) , a discrete uniform distribution over possible values for ρ in { s , . . . , s r } . Such priors are standard in the literature [31, 16, 20, 28]. Then exact GP regression proceedsby sampling from the following conditional distributions: f | − ∼ N( K σ f ,ρ ( K σ f ,ρ + τ − I ) − y , K σ f ,ρ ( τ K σ f ,ρ + I ) − ) , (4) τ | − ∼ Ga (cid:18) a + n , b + ( y − f ) T ( y − f )2 (cid:19) , (5) σ − f | − ∼ Ga (cid:18) a + n , b + σ f f T K − σ f ,ρ f (cid:19) , (6)Pr( ρ = s h | − ) = c { det( K σ f ,s h ) } − / exp (cid:110) − f T K − σ f ,s h f (cid:111) , (7)where c − = (cid:80) r(cid:96) =1 { det( K σ f ,s (cid:96) ) } − / exp (cid:110) − f T K − σ f ,s (cid:96) f (cid:111) .This sampler requires the computation of the inverse, determinant, and Cholesky decomposition, whichare O ( n ) operations. For what follows, we use the properties of H− matrices to replace these operationswith O ( n log n ) counterparts to develop a fast Gibbs sampler. In the case where a given H− matrix hasalready been factorized, e.g. in the case when the covariance kernels can be reused by fixing the length-scalecomponents on a discrete grid and pre-computing the factorization, inverse and determinant computationscan be done in O ( n log n ) time using the HODLRlib library [37]. The discrete prior in step (7) allows for7he relevant matrix inverse and determinant to be computed at every grid point before running the Gibbssampler so as not to have to recompute these values in each step. In practice, this step could be replacedwith a Metropolis step for even faster computation when the number of possible values for ρ is large. TheGaussian error sampler can be extended to non-Gaussian errors by defining an appropriate relation betweena latent GP and the observed data, e.g., via a probit link for binary data [38] or a rounding operator forcount data [39]. f Consider step (4), which requires computation of the posterior mean K σ f ,ρ ( K σ f ,ρ + τ − I ) − y and posterior covariance, K σ f ,ρ ( τ K σ f ,ρ + I ) − . (8)Direct computation of these quantities has cubic time complexity. It is possible to use H− matrices to quicklycompute the mean [23], but it is difficult to construct (8) in the H− matrix format, which makes samplingfrom the standard algorithm difficult.We propose a sampling algorithm for the GP function posterior that leverages the near linear HODLRoperations (specifically matrix-vector products, solutions to linear systems, and applications of symmetricfactor to a vector) rather than direct matrix multiplication or inversion. Let K = K σ f ,ρ for notationalconvenience. We approximate the matrices K and M = τ K + I in (4) by H -matrices ˜ K and ˜ M respectively.In what follows, we use the relation K ( τ K + I ) − = ( τ K + I ) − ( τ K + K )( τ K + I ) − to develop our samplingalgorithm. Algorithm 1:
GP sampler using H -matrix Result:
Produce an approximate draw from p ( f | y , X, Θ) using factorizations of cost O ( n log n ), andmatrix-vector product and solve operations of cost O ( n log n ). Input: y : Noisy observation of f . (cid:15) : specified tolerance. B : Maximum block size. { Θ , τ } : Hyper-parameters. if τ < then (cid:15) ∗ = τ (cid:15) ; else (cid:15) ∗ = (cid:15) ; end Construct: ˜ K ≈ K with tolerance (cid:15) ∗ /τ using { B, Θ , τ } ; // H -matrix construction. Construct: ˜ M ≈ τ K + I with tolerance (cid:15) ∗ using { B, Θ , τ } ; // H -matrix construction. Construct: W , such that ˜ K = W W T ; // H -matrix symmetric factorization. Factorize: ˜ M for later H -matrix operations ; // H -matrix factorization. Sample: a , b ∼ N(0 , I ) ;Let: Z = √ τ ˜ K a + ˜ W b ; // Z ∼ N (0 , τ ˜ K + ˜ K )Solve: ˜ M w = Z ; // w ∼ N (0 , ˜ K [ τ ˜ K + I ] − ])Solve: ˜ M r = τ y ; // r = ( τ ˜ K + I ) − τ y return Z ∗∗ = w + ˜ Kr ; Lemma 4.1.
From Algorithm 1, Z ∗∗ ∼ N(˜ µ f , ˜Σ f ) , where ˜Σ f = ˜ K ˜ M − and ˜ µ f = τ ˜Σ f y are defined to bethe approximations of the posterior function variance Σ f and mean µ f , respectively.8 roof. See Appendix.The most expensive steps in Algorithm 1 are the H -matrix factorizations of the matrices ˜ K and ˜ M . Whenusing a fixed grid of length-scale parameters, the factorization of ˜ K can be pre-computed. If the precision τ is also fixed (i.e., if the factorization of ˜ M can also be pre-computed), the cost of the entire algorithm willbe O ( n log n ) rather than O ( n log n ) . Remark : There are a number of H− matrix constructions that can be used in this algorithm, and they maybe preferable in certain situations. For example, the H -matrix [34] provides a faster construction for 2 and3-dimensional inputs, and may be preferable in these cases.When discussing computation times above we have assumed the H -matrix decomposition method used is the HODLRlib implementation of the HODLR decomposition. However, any hierarchical decomposition allowingfor symmetric matrix factorization and having an (cid:15) bound by construction on the max-norm between theapproximated and true matrices could be used instead with the associated computation cost being that ofthe method used.
Sampling (6) and (7) involves solving the linear system K b = f for b (i.e., finding K − f ) and findingthe determinant of K . If matrix K is approximated by a H -matrix ˜ K, significant computational savingsresult. The HODLR H -matrix decomposition performs these operations at a cost of O ( n log n ) . Here ˜ K isa factorized HODLR matrix, which has been computed prior to this operation (the factorization itself is an O ( n log n ) operation). The following theorem concerns the approximation fidelity of f | y when the component pieces of the GPposterior covariance Σ f = K ( τ K + I ) − , K and M = τ K + I , are calculated using approximations ˜ K and ˜ M , respectively, with maximum absolute difference between the true and approximated matrices beingbounded by ε . The proof is general and gives an upper bound on convergence for large n. The implicationof this result is that there’s an upper limit to the approximation fidelity when n grows past a certain point,even if the true matrices are used, because of the limits of finite computer arithmetic. Theorem 4.2.
Let p ∼ N ( µ f , Σ f ) where µ f = τ Σ f y and Σ f is an n × n positive definite matrix, withΣ f = K ( τ K + I ) − for K the n × n realization of some symmetric covariance kernel, y is a length- n vector,and τ is a constant. Define M = ( τ K + I ) such that Σ f = KM − . Then there exists matrices ˜ K, ˜ M ∈ H with || K − ˜ K || max ≤ ε and || M − ˜ M || max ≤ ε such that for ˜Σ f = ˜ K ˜ M − , ˜ µ f = τ ˜Σ f y , and q ∼ N (˜ µ f , ˜Σ f ) D KL ( P||Q ) = E P (cid:34) log (cid:32) PQ (cid:33)(cid:35) ≤ c n ε + c n / ε + c n ε , (9)with lim ε → D KL ( P||Q ) = 0 , where the density functions of p and q are denoted by P and Q , respectively.The constants c , c , and c are dependent on the conditioning of K and M . Note that using the HODLRlib library, ˜ K and ˜ M can be created and factorized in O ( n log ( n )) time. Proof.
See Appendix.The proof relies on the assumption that || ˜ K − K || F and || ˜ M − M || F can be bounded to be arbitrarily small.The advantage to using a hierarchical matrix decomposition to approximate K and M is that the norm isoften bounded by construction . Specifically, for the HODLR decomposition used in this paper the max-normis bounded on construction [37]. Therefore, the resulting F -norm of the difference between the HODLRapproximation and the true matrix is bounded by n times the max-norm for each matrix (i.e., for n × n matrix A, || ˜ A − A || F ≤ n || ˜ A − A || m ax ), satisfying the assumption of the proof.9 .1.4 Additional considerations Smooth GPs measured at a dense set of locations tend to have severely ill-conditioned covariance matrices(Section 3.2 of [17] provides an excellent discussion of this issue). An H -matrix approximation does notnecessarily improve the conditioning relative to the original dense matrix. Typically, practitioners mitigatethis ill-conditioning by adding a small nugget term to the diagonal of the covariance matrix. We include thisnugget term prior to H -matrix construction.Algorithm 1 does not require a linear solve involving the H -matrix approximation of K , but it does requireone involving that of τ K + I . A practical tweak that improves conditioning and doesn’t alter the fundamentalalgorithm is to scale y so as to make τ smaller, and remove this scaling factor in post processing. A smaller τ makes inverting τ K + I (and its approximation) more stable with little to no impact on the posterior surfaceestimates. It is also possible to combine the ideas in [17] with a given H -matrix algorithm so that the denseblock diagonals in the factorization of both ˜ K and ˜ M could be compressed to the desired level of fidelity inorder to improve conditioning within each block.Details on how the function posterior at new inputs can be sampled, an efficient way to handle non-uniqueinput points, and adjustments for heteroskedastic noise are provided in Section 9.1 of the Appendix. Though Lemma 4.1 and Theorem 4.2 are valid for any input dimension, the HODLR factorization doesn’tgenerally scale well with the number of input dimensions. For an example of this phenomenon, see SI Section10.1. To take advantage of the fast factorization and linear algebra operations afforded by HODLR when d = 1, we propose an extension of these algorithms that scales to higher dimensions at O ( dn log n ) . Wemodel a d -dimensional surface as a scaling factor times a tensor product of 1-dimensional GPs having unitvariance: y i = β f ( x ,i ) ⊗ f ( x ,i ) ⊗ . . . ⊗ f d ( x d,i ) + e i ,e i ∼ N(0 , τ − ) , f h ( · ) ∼ GP ( m h ( · ) , k h ( · , · )) , (10) i = 1 , . . . , n, h = 1 , . . . , d, where β is the scaling factor and k h ( · , · ) is some covariance kernel with unit function variance.Approximating surfaces using a tensor product of bases is common [29, 40, 41, 42]. Often, the tensor producttakes the form of a finite basis expansion using splines. The flexibility of more traditional tensor productspline approaches carries over to this approach, but one does not have to be concerned with the choice of theknot set to achieve this flexibility (e.g. see de Jong and van Zanten [43] for an example of this using tensorproduct B-splines and knot set selection). 
By utilizing lemma 2.1 of de Jong and van Zanten, it is trivialto show that there exists GP tensor product specifications in the sample path of (10) that are arbitrarilyclose to any d − dimensional H¨older space of functions having up to r continuous partial derivatives D α ,α ≤ r . Lemma 4.3.
Let C r be the H¨older space of functions having r continuous partial derivatives and (cid:15) > , then for any h ∈ C r there exists a tensor product GP f = f ⊗ f ⊗ . . . ⊗ f d with sample paths such thatsuch that P r ( || h − f || ∞ ≤ (cid:15) ) > . (cid:15) ball of any continuous surface. The result then followsbecause there there are a large number of covariance kernels k ( · , · ) that are dense in the space of continuousfunctions (see for example [44]). This lemma shows that (10) provides a flexible solution to modeling higherdimensional surfaces. 10n practice, the actual covariance kernel used may not be adequate to model a given data-set (e.g. theremay be a high prior-probability placed on sample paths that are too smooth and/or the prior over thehyperparameters too restrictive). In these cases, one tensor product may over smooth the data and misslocal features of the surface; here, multiple additive tensor products may be considered to alleviate thisproblem. This was the strategy of [28] who used an additive sum of GP tensor products like those definedin (10). It is also the strategy in most standard spline based approaches, and it is often used in hierarchicalgrid refinements for tensor product B-splines (e.g. see [42]); however, the choice of an appropriate knot setfor each additive component is not necessary when using FIFA GPs. H − matrices can be used to approximate the GP in equation (10) and provide the previously discussedbenefits of speed and near-machine-precision fidelity. Sampling from the function posterior for each f h ( · )involves heteroskedastic noise; and one must modify the KL divergence bound as well as the samplingalgorithm. To see why, note that f h ( x h,i ) = y i − e i (cid:81) k (cid:54) = h f k ( x k,i ) . Then y i (cid:81) k (cid:54) = h f k ( x k,i ) is a noisy observation of f h ( x h,i ) with variance τ − / ( (cid:81) k (cid:54) = h f k ( x k,i )) .This approach also provides additional computational advantages if the number of unique input values issmaller than n in a given dimension. For example, consider observations made on a 500 ×
500 grid ofinputs. Then computation proceeds rapidly even though the total number of observations is 250,000. Suchcomputation gains are not unique to the tensor product approach, but are also available using additivekernels [45, 46] or Kronecker based inference [47]; however, these approaches do not have the computationaladvantages of H − matrix arithmetic and thus still scale with cubic complexity.The tensor product of GP realizations is different than a GP with a separable covariance kernel comprisedof the tensor product of univariate kernels, as in [48]. The latter also offers computational advantages, buthas the same issues for large covariance matrices. Finally, the tensor product approach in (10) is one way inwhich univariate GPs can be used to model higher dimensional surfaces. In some cases, an additive modelmay suffice [45, 46]. In this section, we report performance and timing of each method for a small- n simulation using datagenerated from a Gaussian process and a large- n simulation in which the true generating function is knownbut not a GP. All calculations in this and the subsequent section are performed on a 2016 MacBook Prowith a 2.9 GHz Intel Core i7 processor. n simulation We compare the performance of the approximate methods to that of an exact GP sampler. Of interest isboth similarity of point estimates to the exact GP and fidelity of uncertainty about those estimates. Theexperiment with synthetic data proceeds as follows for n ∈ { , , } and n ∗ = 50:1. Points x , . . . , x n are simulated from a N [ − , (0 ,
1) distribution, so observed data are more concentratedabout the origin.2. A “true” function f true is simulated from a Gaussian process having an exponential covariance function k ( x i , x j ) = σ f exp( − ρ || x i − x j || ) + τ − δ ij . Note that the true values of σ f , ρ, and τ are varied in eachconfiguration.3. A sampler based on the exact GP covariance matrix is run to get exact Bayesian estimates of theposterior for the hyperparameters σ f , ρ, τ, the function f at training points { x i } ni =1 , and the function f ∗ at test points { x i ∗ } n ∗ i ∗ =1 . These estimates are defined as the “exact posterior.”4. Each of the approximate methods are used to obtain samples from the posterior for the hyperparameters σ f , ρ, τ, f , and f ∗ , referred to as “approximate posterior” samples.11. For each sampler, the first 2,000 samples are discarded as burn-in and every tenth draw of the next25,000 samples is retained.Table 1 shows simulation results for n = 1000. In the table, MSPE f ∗ = n ∗ (cid:80) n ∗ i =1 ( f ∗ i − ˆ f ∗ i ) summarizes pre-dictive performance for determining the true function mean at test points and 95% CI ( f ∗ ) Area summarizesthe geometric area covered by the 95% credible interval about f ∗ . Hyperparameter summary ˆ θ = T (cid:80) Tt =1 θ ( t ) provides the mean of the T samples, where θ ( t ) is the t -th draw of θ, and [ θ ll , θ ul ] gives the associated 95%pointwise posterior credible interval. The key take away from Table 1 is that inference made using FIFA-GP is extremely similar to that made using the exact GP for GP hyperparameters, function posterior, andthe noise precision. The compressed GP has similar predictive performance and noise precision estimates,but inference on the GP hyperparameters and area of uncertainty about the function posterior differs fromthose quantities as measured by the exact GP. Analogous tables for n = 100 and n = 500 , and replicatesamplers run with the same data for n = 100 , shown in SI Section 10.2, provide further empirical supportthat the exact GP and FIFA-GP behave almost identically in terms of both predictive performance andinference. n simulation We compare the predictive performance and the computing time of each approximation method using a largersimulated data set, which was not based on a true underlying Gaussian process model. In this simulation,the true function is f ( x ) = sin(2 x )+ e x and input data are sampled from a N [ − , (0 ,
1) distribution. Valuesof y are sampled having mean f and precision τ = 1 . Parameter estimates and performance results analogous to those provided in the small- n simulation areprovided in SI Sections 10.2.3 and 10.2.4. As when data were simulated using a true GP for the mean,parameter estimates for FIFA-GP are similar to that of the exact GP sampler. As n increases, both theprecision and accuracy of FIFA-GP increase. Figure 3 shows the time taken (in minutes) to iterate through100 steps of the sampler for the exact GP and that of the approximated methods with a grid of 100 possiblelength scale values comprising the options for the discrete sampling step. The FIFA-GP method remainscomputationally tractable even when n = 200,000. The cost of the pre-computation steps (i.e., projectionconstruction) scales poorly for the compressed GP method. l l l l l l l l l l ll l l l l ll l l l l ll l l l l l l l l l ll l l l l l l ll l l l l l
100 300 500 1 ,
000 2 ,
000 5 ,
000 10 ,
000 25 ,
000 50 ,
000 100 ,
000 200 , n T i m e ( m i nu t e s ) Method lll
ExactCompressedFIFA−GP
Type
TotalSampling
Figure 3: Time taken for 100 samples from full Bayesian posterior. Type “Total” includes the time takento pre-create the matrices for the discrete uniform grid of length-scale values; “Sampling” only includes thetime taken during the sampling phase after these setup computations have occurred.12able 1: Performance results, parameter estimates, and computing time for Bayesian GP regression with n = 1000 data points simulated using asquared exponential covariance kernel. Time shown is total time for setup and 27,000 iterations through Gibbs sampler, with 2,500 samples retained.Truth Exact GP FIFA-GP Compressed GP (cid:15) max = 10 − (cid:15) max = 10 − (cid:15) fro = 0 . (cid:15) fro = 0 . f ∗ - 1e-04 1e-04 1e-04 2e-04 2e-04low noise 95% CI ( f ∗ ) Area - 0.27 0.27 0.27 0.29 0.30ˆ τ [ τ ll , τ ul ] τ = 30 28.3 [25.9, 30.9] 28.3 [25.9, 30.8] 28.3 [25.9, 30.8] 28.4 [26.0, 31] 28.4 [25.9, 30.9]ˆ σ f [ σ f, ll , σ f, ul ] σ f = 1 1.04 [0.52, 2.27] 0.92 [0.51, 1.85] 0.98 [0.52, 2.06] 0.72 [0.44, 1.22] 0.70 [0.43, 1.17]ˆ ρ [ ρ ll , ρ ul ] ρ = 0 .
25 0.33 [0.15, 0.69] 0.35 [0.17, 0.70] 0.33 [0.16, 0.65] 0.68 [0.56, 0.77] 0.76 [0.70, 0.80]Time (min) - 96.7 8.7 8.0 6.2 6.0Smooth and MSPE f ∗ - 0.008 0.008 0.008 0.007 0.007high noise 95% CI ( f ∗ ) Area - 0.83 0.85 0.85 0.98 0.95ˆ τ [ τ ll , τ ul ] τ = 2 2.25 [2.06, 2.45] 2.26 [2.07, 2.46] 2.26 [2.06, 2.46] 2.26 [2.07, 2.47] 2.26 [2.06, 2.46]ˆ σ f [ σ f, ll , σ f, ul ] σ f = 1 1.01 [0.53, 2.01] 0.98 [0.52, 1.88] 0.97 [0.50, 1.98] 0.89 [0.51, 1.73] 0.87 [0.50, 1.49]ˆ ρ [ ρ ll , ρ ul ] ρ = 0 .
25 0.23 [0.13, 0.53] 0.27 [0.13, 0.85] 0.26 [0.13, 0.68] 0.68 [0.55, 0.82] 0.60 [0.55, 0.71]Time (min) - 94.5 8.7 8.0 6.1 6.2Wiggly and MSPE f ∗ - 0.001 0.001 0.001 0.001 0.001low noise 95% CI ( f ∗ ) Area - 0.37 0.37 0.37 0.40 0.39ˆ τ [ τ ll , τ ul ] τ = 30 30.6 [28.0, 33.4] 30.6 [28.0, 33.4] 30.6 [28.0, 33.3] 30.7 [28.0, 33.5] 30.7 [28.1, 33.4]ˆ σ f [ σ f, ll , σ f, ul ] σ f = 1 1.45 [0.90, 2.50] 1.41 [0.87, 2.39] 1.50 [0.90, 2.91] 1.14 [0.80, 1.69] 1.17 [0.79, 1.74]ˆ ρ [ ρ ll , ρ ul ] ρ = 2 1.70 [1.20, 2.50] 1.74 [1.22, 2.60] 1.69 [1.14, 2.55] 2.54 [2.54, 2.54] 2.44 [2.35, 2.53]Time (min) - 96.0 9.6 8.6 7.9 7.3Wiggly and MSPE f ∗ - 0.013 0.013 0.013 0.013 0.013high noise 95% CI ( f ∗ ) Area - 1.32 1.33 1.33 1.30 1.30ˆ τ [ τ ll , τ ul ] τ = 2 2.09 [1.91, 2.28] 2.09 [1.91, 2.28] 2.09 [1.91, 2.29] 2.09 [1.91, 2.28] 2.09 [1.91, 2.28]ˆ σ f [ σ f, ll , σ f, ul ] σ f = 1 0.89 [0.57, 1.48] 0.93 [0.56, 1.64] 0.91 [0.58, 1.59] 0.92 [0.58, 1.51] 0.92 [0.60, 1.52]ˆ ρ [ ρ ll , ρ ul ] ρ = 2 2.61 [1.67, 3.73] 2.48 [1.56, 3.78] 2.56 [1.62, 3.79] 2.43 [2.41, 2.50] 2.42 [2.36, 2.49]Time (min) - 95.3 10.8 9.7 7.6 8.0 .3 Tensor product simulation In order to illustrate the capability of the tensor product approach defined in equation (10), we simulatefrom two functions having d = 2 . The first function, g ( x , x ) = sin( x ) sin( x ) √ x x , is separable into afunction of x multiplied by a function of x and thus should be easily approximated by the tensor productformulation. The second function, g ( x , x ) = x − xy + 3 y + 2 , is not separable and thus may not bewell approximated by a single tensor product term. For each function, draws of x and x are sampled from U (0 ,
4) and noisy observations of the functions are made with precision τ = 0 .
5. The number of observationsis varied, with n ∈ { , , , , , , } . Figure 4 shows the noisy observations of g and model estimated surface for varying n. Figure 5 shows theMSPE for surface estimates at test points. The model is able to learn the surface reasonably well, evenwith small n . Coverage of the 95% credible interval is near nominal (0.962, 0.949, and 0.946 for n = 500, n = 2000, and n = 25000, respectively). Furthermore, the model is salable to large n because it relies onone-dimensional HODLR-approximated GPs.Figure 4: Top: Noisy observations. Bottom: Estimated surface from tensor product model. Left to right: n = 100, n = 1000, and n = 10000.Figure 6 shows the noisy observations of g and model estimated surface for varying n, with either 1 or2 additive basis components used. Figure 7 shows the MSPE for surface estimates at test points. Inthis example, g is poorly approximated by a single tensor product component, i.e. by the model y i = f ( x ,i ) ⊗ f ( x ,i ) + e i , which may be due to choice of co-variance kernel or prior. When a second tensorproduct basis as added, i.e. when we model the data as y i = f ( x ,i ) ⊗ f ( x ,i ) + h ( x ,i ) ⊗ h ( x ,i ) + e i , themodel is flexible enough to capture the true surface. Coverage of the 95% credible interval is near nominal(0.956, 0.949, 0.947 for n = 500, n = 2000, and n = 25000, respectively).14 l l l l l l
100 500 1 ,
000 2 ,
000 5 ,
000 10 ,
000 25 , n M SPE
Figure 5: MSPE of the tensor product surface for predicting g at new inputs.Figure 6: Top: Noisy observations. Middle (bottom): Estimated surface from tensor product model withone (two) additive basis term(s). Left to right: n = 100, n = 1000, and n = 10000. We consider two data examples to show how multidimensional surfaces can be modeled using FIFA-GP. Aswe only include a spacial process and do not include relevant covariates, we do not claim the estimates to be15 l ll ll ll ll ll ll
100 500 1 ,
000 2 ,
000 5 ,
000 10 ,
000 25 , n M SPE
Number of bases ll Figure 7: MSPE of the tensor product surface for predicting g at new inputs with 1 or 2 additive bases.optimal; rather they are illustrative of how the methodology can be used to model flexible functional formsfor extremely large data sets on relatively modest computer hardware. NOAA started recording carbon dioxide (CO2) measurements at Mauna Loa Observatory, Hawaii in May of1974. The CO2 data are measured as the mole fraction in dry air in units of parts per million (ppm). Hourlydata are available for download from 1974 to the present at . To illustratethe utility of the proposed method, we use the full data for which the quality control flag indicated no obviousproblems during collection or analysis (358,253 observations). We model the year-season surface as the sumof a GP for the annual effect, a GP for the seasonal effect, and a tensor product of GPs for year and season.For the tensor product, the one dimensional GP’s had just under 9 ,
000 observations in each dimension. Forthis example, the full MCMC algorithm took under thirty minutes.Figure 8 shows the observed and model-predicted year-season CO2 surfaces. There is an evident increase inCO2 across years along with a seasonal pattern, with a peak in early summer and a trough in the fall. Thetensor product spline basis allows for variation in the seasonal shape that varies smoothly with year.Figure 9 shows slices of the model-predicted year-season CO2 surfaces along each dimension. The tensorproduct of Gaussian processes fits the model well, with residuals having no apparent seasonal or annualpatterns. Earlier years seem to exhibit some heteroskedastic and positive-skewed noise, which could beexplicitly accounted for using an alternative noise model. PM . describes fine particulate matter, inhalable and with diameters that are generally under 2.5 microme-ters. The EPA monitors and reviews national PM . concentrations, and sets national air quality standardsunder the Clean Air Act. Fine particles are the particulate matter having greatest threat to human health.These particles are also the main cause of reduced visibility in many parts of the United States.We use the sample-level PM . data for the contiguous states since 1999 (4,977,391 observations). We modelthe latitude-longitude-time surface as the sum of a GP for the geographic effect and a tensor product ofGPs for latitude, longitude, and time. This three dimensional surface fit again was fast taking about anhour.Figure 10 shows the model predicted mean PM . value for the contiguous United States at three time points.There is an evident decrease in PM . across years along with regional hot spots, with the southeast and16igure 8: Left: Observed surface. Right: Model-predicted surface. Year C O ( pp m ) Slice of annual dimension
Day/Time C O ( pp m ) Slice of seasonal dimension
Figure 9: Top: Slice of CO2 surface viewed along the year dimension (hour 14 of day 185 shown). Evidentis the annual increase in CO2. Bottom: Slice of CO2 surface viewed along the season dimension (year 1975shown). Evident is the seasonal pattern of CO2 levels, with a peak in early summer and a trough in the fall.parts of California being particularly problematic. The total mean PM . decreased over time, which reflectsthe increased air quality standards over this time period.17 .83.03.2 PM 2.5
PM 2.5
PM 2.5
Figure 10: Model predicted mean PM . value for the contiguous United States. From top to bottom:February 19, 1999; July 16, 2009; and December 11, 2019.18 Discussion
We have described an algorithm that allows near-exact Gaussian process sampling to be performed in acomputationally tractable way up to n on the order of 10 for univariate inputs using H matrices. Asobserved by [49, 23], the HODLR matrix format can quickly grow in computation cost in higher dimensionalinput spaces. We have addressed this issue by proposing a tensor product approach that leverages the speedof HODLR operations in univariate input domains and scales to higher dimensions at O ( dn log n ) . It wouldalso be possible to use the univariate GPs as building blocks in other ways to construct multidimensionalsurfaces. The downside to this approach is that it is unsuitable for approximating an elaborate GP with acarefully chosen covariance for d > H -matrices as well. Lemma 4.1 and Theorem 4.2 are valid for anymatrix approximation having an error bound. Other H -matrix approximations could ameliorate some of thecomputational costs associated with HODLR matrices in higher-dimensional input settings, without requiringthe tensor product GP approach, and we have begun to implement these ideas using H − compressionmatrices using the H2LIB library [50]. This library efficiently compresses matrices up to three dimensions,and we have had success modeling 3-dimensional GP problems up to n = 20 , . For the H2LIB library,the inversion uses a single processor, but one sample still takes less than a second for many problemswhere n = 20 , . For higher dimensional inputs, the tensor product approach still is applicable, and weimagine one could approximate a surface using different H / H ∈ matrix compression algorithms with the sametheoretical guarantees on approximation as well as computational efficiency.Although we considered the intensive Bayesian regression problem of the case of large n when f () is modeledas a Gaussian process, the framework introduced here is general and can be applied to any problem inwhich some large structured matrix inversion is required within a Bayesian sampler. Still within the GPrealm, there are many cases when univariate GPs are used as model components rather than for standaloneregression problems. Having the ability to plug-in a fast near-exact sampler could open the use of thesecomponent GPs up to larger problems.Outside of the GP regression, similar set of computational bottlenecks arise in the case of large p when f ()is given a parametric sparsity inducing prior (e.g., the horseshoe prior). Specifically, the matrix inversionrequired in [51] is also of the form (Σ + τ I ) − , where Σ is some p × p positive definite matrix, τ is a constant,and I is a p × p identity matrix. Exploring the use of H -matrix compression and arithmetic in the sparseBayesian regression setting provides a promising avenue of future research. This work was partially supported by the Department of Energy Computational Science Graduate Fellowship(grant DE-FG02-97ER25308). This research was supported [in part] by the Intramural Research Programof the NIH, National Institute of Environmental Health Sciences. The funders had no role in study design,data collection and analysis, decision to publish, or preparation of the manuscript. The authors would liketo thank David Dunson and Amy Herring for helpful comments.
References

[1] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[2] Ralf Herbrich, Neil D. Lawrence, and Matthias Seeger. "Fast sparse Gaussian process methods: The informative vector machine". In: Advances in Neural Information Processing Systems. 2003, pp. 625–632.
[3] Krzysztof Chalupka, Christopher K. I. Williams, and Iain Murray. "A framework for evaluating approximation methods for Gaussian process regression". In: Journal of Machine Learning Research 14 (2013).
[4] Christopher K. I. Williams and Matthias Seeger. "Using the Nyström method to speed up kernel machines". In: Advances in Neural Information Processing Systems. 2001, pp. 682–688.
[5] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. "A unifying view of sparse approximate Gaussian process regression". In: Journal of Machine Learning Research 6 (2005).
[6] Michalis K. Titsias. "Variational learning of inducing variables in sparse Gaussian processes". In: Artificial Intelligence and Statistics. 2009, pp. 567–574.
[7] James Hensman, Nicolò Fusi, and Neil D. Lawrence. "Gaussian processes for big data". In: arXiv preprint arXiv:1309.6835 (2013).
[8] Andrew Wilson and Hannes Nickisch. "Kernel interpolation for scalable structured Gaussian processes (KISS-GP)". In: International Conference on Machine Learning. 2015, pp. 1775–1784.
[9] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. "Thoughts on massively scalable Gaussian processes". In: arXiv preprint arXiv:1511.01870 (2015).
[10] Robert B. Gramacy and Herbert K. H. Lee. "Bayesian treed Gaussian process models with an application to computer modeling". In: Journal of the American Statistical Association (2008).
[11] In: Advances in Neural Information Processing Systems. 2009, pp. 1193–1200.
[12] Abhirup Datta et al. "On nearest-neighbor Gaussian process models for massive spatial data". In: Wiley Interdisciplinary Reviews: Computational Statistics (2016).
[13] In: Water Resources Research.
[14] In: Advances in Water Resources 82 (2015), pp. 124–138.
[15] Haitao Liu et al. "When Gaussian process meets big data: A review of scalable GPs". In: arXiv preprint arXiv:1807.01065 (2018).
[16] Huiyan Sang and Jianhua Z. Huang. "A full scale approximation of covariance functions for large spatial data sets". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2012).
[17] In: Biometrika.
[18] IEEE. 2006, pp. 143–152.
[19] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. "Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions". 2009.
[20] Abhirup Datta et al. "Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets". In: Journal of the American Statistical Association (2016).
[21] Wolfgang Hackbusch. "A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices". In: Computing (1999).
[22] Wolfgang Hackbusch and Boris N. Khoromskij. "A sparse H-matrix arithmetic. Part II: Application to multi-dimensional problems". In: Computing (2000).
[23] Sivaram Ambikasaran et al. "Fast direct methods for Gaussian processes". In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2016).
[24] Alexander Litvinenko et al. "Likelihood approximation with hierarchical matrices for large spatial datasets". In: Computational Statistics & Data Analysis 137 (2019), pp. 115–132.
[25] Christopher J. Geoga, Mihai Anitescu, and Michael L. Stein. "Scalable Gaussian process computations using hierarchical matrices". In: Journal of Computational and Graphical Statistics (2019), pp. 1–11.
[26] Jian Cao et al. "Hierarchical-block conditioning approximations for high-dimensional multivariate normal probabilities". In: Statistics and Computing (2019).
[27] In: Journal of Computational and Graphical Statistics.
[28] In: Biometrics.
[29] Carl de Boor. A Practical Guide to Splines. Springer-Verlag, 1978.
[30] Paul H. C. Eilers and Brian D. Marx. "Splines, knots, and penalties". In: Wiley Interdisciplinary Reviews: Computational Statistics (2010).
[31] In: Journal of the Royal Statistical Society: Series B (Statistical Methodology).
[32] Wolfgang Hackbusch. Hierarchical Matrices: Algorithms and Analysis. Vol. 49. Springer, 2015.
[33] Ingo Wald and Vlastimil Havran. "On building fast kd-trees for ray tracing, and on doing that in O(N log N)". In: 2006 IEEE Symposium on Interactive Ray Tracing. IEEE, 2006, pp. 61–69.
[34] Steffen Börm and Wolfgang Hackbusch. "Data-sparse approximation by adaptive H²-matrices". In: Computing (2002).
[35] Sivaram Ambikasaran and Eric Darve. "An O(N log N) fast direct solver for partial hierarchically semi-separable matrices". In: Journal of Scientific Computing (2013).
[36] Sivaram Ambikasaran and Michael O'Neil. "Fast symmetric factorization of hierarchical matrices with applications". In: arXiv preprint arXiv:1405.0223 (2014).
[37] Sivaram Ambikasaran, Karan Singh, and Shyam Sankaran. "HODLRlib: A library for hierarchical matrices". In: The Journal of Open Source Software (2019).
[38] In: Statistical Methodology.
[39] In: Biometrika.
[40] In: Computing.
[41] In: Proceedings of Chamonix. Vol. 1. 1996.
[42] Bert Jüttler. "Surface fitting using convex tensor-product splines". In: Journal of Computational and Applied Mathematics (1997).
[43] In: Electronic Journal of Statistics.
[44] In: Journal of Statistical Planning and Inference.
[45] In: Advances in Neural Information Processing Systems. 2011, pp. 226–234.
[46] Nicolas Durrande, David Ginsbourger, and Olivier Roustant. "Additive covariance kernels for high-dimensional Gaussian process modeling". In: Annales de la Faculté des sciences de Toulouse: Mathématiques. Vol. 21. 3. 2012, pp. 481–499.
[47] Seth Flaxman et al. "Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods". In: International Conference on Machine Learning. 2015, pp. 607–616.
[48] Andrei I. Karol' and Alexander I. Nazarov. "Small ball probabilities for smooth Gaussian fields and tensor products of compact operators". In: Mathematische Nachrichten.
[49] "H-matrices". In: Computing.
[50] H2Lib: a library for H²-matrices.
[51] In: IEEE Signal Processing Letters.
[52] Rajendra Bhatia. Matrix Analysis. Springer-Verlag, 1997.
[53] Roger A. Horn and Charles R. Johnson. "5.8 Condition numbers: inverses and linear systems". In: Matrix Analysis. Cambridge University Press, 2012, p. 381.
Appendix
Here we provide details on how the function posterior at new inputs can be sampled, how non-unique input points can be handled efficiently, and how heteroskedastic noise can be accommodated.
In the Gibbs sampling framework, i.e., when we can condition on f, the predictive distribution of the realization of the function f* at a new point x* is given by

f* | x*, f, X, Θ, τ ∼ N(µ*, σ^{2*}),
µ* = k_{*n} K^{-1} f,
σ^{2*} = k_{**} − k_{*n} K^{-1} k_{n*},   (11)

where k_{**} = k(x*, x*), and k_{*n} = k(x*, X) and k_{n*} = k(X, x*) = (k_{*n})^T are realizations of the covariance kernel between the new input x* and each of the observed inputs.

Posterior realizations of f* can be sampled in O(n log n) per point, assuming the H-matrix approximation of K is pre-assembled and pre-factorized; in this case the most costly calculation for σ^{2*} in equation (11) is a linear solve. Such pre-computation is possible when the length-scale is sampled on a fixed grid, in which case we can also pre-solve K^{-1}k_{n*} for each possible length-scale value. Unfortunately, we cannot pre-solve K^{-1}f, because f is sampled and changes at each iteration of the Gibbs sampler. Therefore, for N such new points the cost is O(N n log n). Thus, this method is not preferred for applications in which function realizations are desired for a very large number of new points (i.e., when N ≈ n); however, it is quite powerful in applications for which n ≫ N.
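As an illustration of equation (11), a minimal dense-algebra sketch follows; np.linalg.solve stands in for the pre-factorized H-matrix solve, and the kernel, inputs, and current draw of f are placeholder choices rather than the paper's settings.

```python
import numpy as np

# Minimal dense sketch of equation (11). np.linalg.solve stands in for the
# pre-factorized H-matrix solve; the kernel, inputs, and current draw of f
# are placeholder choices, not the paper's settings.
def k(a, b, rho=1.0):
    return np.exp(-((a[:, None] - b[None, :]) / rho) ** 2)

rng = np.random.default_rng(0)
n, N = 500, 5                        # n observed inputs, N new points (n >> N)
X = np.sort(rng.uniform(0, 1, n))
x_star = rng.uniform(0, 1, N)
f = np.sin(2 * np.pi * X)            # stands in for the current Gibbs draw of f

K = k(X, X) + 1e-10 * np.eye(n)      # small jitter for numerical stability
k_sn = k(x_star, X)                  # N x n cross-covariances k_{*n}
k_ss = k(x_star, x_star)

mu_star = k_sn @ np.linalg.solve(K, f)                         # k_{*n} K^{-1} f
var_star = np.diag(k_ss - k_sn @ np.linalg.solve(K, k_sn.T))   # sigma^{2*} per point
f_star = mu_star + np.sqrt(np.maximum(var_star, 0.0)) * rng.standard_normal(N)
```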
Denote the set of all n (possibly non-unique) inputs as X = {x_h ∈ R^d}_{h=1}^n. Let X_u = {x_{u_i} ∈ R^d}_{i=1}^U denote the set of U unique input values; then U ≤ n, and x_{u_i} ≠ x_{u_j} for any i ≠ j. Computational savings can be achieved when U < n. Let Q_i denote the set of original indices whose input values equal x_{u_i}, i.e., Q_i = {h : x_h = x_{u_i}}. Then define y_u = {y_{u_i} = |Q_i|^{-1} Σ_{h∈Q_i} y_h}_{i=1}^U to be the set of U averaged observations at the unique input values. Sampling and inference can proceed using a GP prior with the U × U covariance matrix associated with the observations y_u; the precision term τ added to the diagonal of the original covariance matrix is replaced with |Q_i| · τ for entry (i, i) of the new covariance matrix, for all i ∈ 1, …, U.

It was assumed above that e_i ∼ N(0, τ^{-1}). However, heteroskedastic noise can easily be accommodated in the above samplers. Suppose that e_i ∼ N(0, τ_i^{-1}). Define D^{-1} = diag(τ_1^{-1}, …, τ_n^{-1}) to be the diagonal matrix of variance parameters, so that D = diag(τ_1, …, τ_n) is the diagonal matrix of precision parameters. Then the posterior variance from equation (3) becomes

Cov(f | y, X, Θ) = K − K(K + D^{-1})^{-1}K
 = (D + K^{-1})^{-1}   (Sherman-Morrison-Woodbury formula)
 = K(DK + I)^{-1}   (since (D + K^{-1})^{-1} = K K^{-1}(D + K^{-1})^{-1} = K[(D + K^{-1})K]^{-1})
 = K(K + D^{-1})^{-1}D^{-1}   (right-multiply by D D^{-1})
 = D^{-1}(K + D^{-1})^{-1}K,   (transpose of a symmetric matrix)

and the posterior mean from equation (3) becomes E(f | y, X, Θ) = K(K + D^{-1})^{-1}y = Cov(f | y, X, Θ) D y. The format of the above posterior covariance is chosen so as to enable easier HODLR operations. See Appendix 9.4 for further algebraic exposition and a discussion of how the KL divergence bound and Algorithm 1 are adjusted for heteroskedastic variance.
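The chain of identities above can be verified numerically; the following sketch checks the three equivalent forms of the posterior covariance on a small dense example, where the kernel and precisions are arbitrary placeholders.

```python
import numpy as np

# Numerical check of the equivalent forms of the heteroskedastic posterior
# covariance derived above; the kernel and precisions are placeholders.
rng = np.random.default_rng(0)
n = 30
X = np.sort(rng.uniform(0, 1, n))
K = np.exp(-(X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(n)
tau = rng.uniform(0.5, 2.0, n)
D = np.diag(tau)
D_inv = np.diag(1.0 / tau)
I = np.eye(n)

form1 = K - K @ np.linalg.solve(K + D_inv, K)   # K - K (K + D^{-1})^{-1} K
form2 = K @ np.linalg.inv(D @ K + I)            # K (DK + I)^{-1}
form3 = D_inv @ np.linalg.solve(K + D_inv, K)   # D^{-1} (K + D^{-1})^{-1} K
print(np.allclose(form1, form2), np.allclose(form1, form3))  # True True
```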
The HODLR approximation relies on random compression of the off-diagonal block matrices. This process does not guarantee that the result is always a positive semidefinite matrix, although in practice it most often is (and when it is not, simply decreasing ε will often lead to the conditions being met). Similarly, the product K̃M̃^{-1} in Theorem 4.2 is not guaranteed to be positive semidefinite for every perturbation from K and M. In fact, this issue is practically present for large n even when using the exact GP, due to finite-precision arithmetic.

Proof of Lemma 4.1.
After the step in which Z is formed, Var(Z) = τK̃K̃ + K̃ = K̃(τK̃ + I). Note that τK̃ + I is equivalent to M̃ because of the tolerances specified in Step 1. To see why, note that ||K − K̃||_max < ε*/τ implies ||τK − τK̃||_max < ε*, so the HODLR approximation τK̃ satisfies the ε* bound for approximating τK. Since the off-diagonal block matrices are factorized via partial-pivoted LU decomposition, multiplication by a constant does not impact the solution other than by rescaling; that is, finding an approximation for K with tolerance ε*/τ is equivalent to finding an approximation for τK with tolerance ε* and dividing the result by τ. Next, recall that the HODLR decomposition preserves the diagonal elements of the original matrix, so a decomposition of a matrix A plus the identity matrix is equivalently expressed as the decomposition of (A + I) and as Ã + I. Thus τK̃ + I = M̃, and Z ∼ N(0, K̃M̃) at the start of the solve step. At the end of the solve step,

Var(Z*) = M̃^{-1} K̃M̃ M̃^{-1} = M̃^{-1}K̃.

Because M̃^{-1}, K̃, and their product are symmetric, the terms commute and Var(Z*) = K̃M̃^{-1}. Thus Z* ∼ N(0, Σ̃_f). The mean µ̃_f is created in the following step, so the sum in the final step has the desired distribution.
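The covariance identity at the heart of this proof is easy to check numerically. The sketch below uses dense matrices in place of the HODLR approximations K̃ and M̃ = τK̃ + I; it is a Monte Carlo check, so agreement holds only up to sampling error, and the kernel and τ are illustrative.

```python
import numpy as np

# Monte Carlo check of the covariance identity in the proof, with dense
# matrices standing in for the HODLR approximations K~ and M~ = tau*K~ + I.
rng = np.random.default_rng(1)
n, tau, draws = 40, 3.0, 20000
X = np.sort(rng.uniform(0, 1, n))
K_t = np.exp(-(X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(n)
M_t = tau * K_t + np.eye(n)
W = np.linalg.cholesky(K_t)                    # symmetric factor, K~ = W W^T

a = np.sqrt(tau) * rng.standard_normal((draws, n))   # a ~ N(0, tau*I)
b = rng.standard_normal((draws, n))                  # b ~ N(0, I)
Z = a @ K_t + b @ W.T                 # Var(Z) = tau*K~K~ + K~ = K~(tau*K~ + I)
Z_star = np.linalg.solve(M_t, Z.T).T  # Var(Z*) = M~^{-1} K~ M~ M~^{-1} = K~ M~^{-1}

target = K_t @ np.linalg.inv(M_t)
print(np.abs(np.cov(Z_star.T) - target).max())  # small, up to Monte Carlo error
```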
Proof of Theorem 4.2. From equation (3), let M = τK + I. Define K̃ and M̃ to be the H-matrix approximations of K and M, respectively. Denote p ∼ N(τKM^{-1}y, KM^{-1}) and q ∼ N(τK̃M̃^{-1}y, K̃M̃^{-1}), with density functions P and Q, respectively. In other words, P is the true GP posterior and Q is the H-matrix-approximated GP posterior. The KL divergence is

E_P[log(P/Q)] = (1/2) { log[det(K̃M̃^{-1}) / det(KM^{-1})]   (i.)
 + tr[(K̃M̃^{-1})^{-1}(KM^{-1})] − n   (ii.)
 + [τK̃M̃^{-1}y − τKM^{-1}y]^T (K̃M̃^{-1})^{-1} [τK̃M̃^{-1}y − τKM^{-1}y] }.   (iii.)   (12)

Let Ω_K = K̃ − K and Ω_M = M̃ − M be the matrices of differences between the approximated and true matrices. Assume by construction that max_{1≤i,j≤n} |Ω_{K,ij}| ≤ ε and max_{1≤i,j≤n} |Ω_{M,ij}| ≤ ε; that is, assume the absolute entry-wise difference between each approximated and true matrix is at most some ε. Note that one can construct M̃ using K̃ rather than having to create it separately, if desired. For τ ≤ 1, simply using M̃ = τK̃ + I satisfies the ε-bound on both matrices. For τ > 1, finding K̃ satisfying the bound for ε/τ and then setting M̃ = τK̃ + I satisfies the ε-bound on both matrices; in the latter case K̃ is calculated to a higher level of fidelity in order to guarantee that the fidelity of M̃ meets the requisite bound.
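The tolerance bookkeeping for M̃ can be checked directly; below is a small sketch in which a dense symmetric perturbation stands in for the HODLR approximation error, and the kernel and the value τ = 4 are illustrative.

```python
import numpy as np

# Check of the tolerance bookkeeping: if K~ matches K to eps/tau in max norm
# (the tau > 1 case), then M~ = tau*K~ + I matches M = tau*K + I to eps.
# The kernel and tau = 4 are illustrative; E is a stand-in perturbation.
rng = np.random.default_rng(2)
n, tau, eps = 50, 4.0, 1e-6
X = np.sort(rng.uniform(0, 1, n))
K = np.exp(-(X[:, None] - X[None, :]) ** 2)

E = rng.uniform(-1, 1, (n, n))
E = (E + E.T) / 2                                  # symmetric perturbation
K_t = K + (eps / tau) * E / np.abs(E).max()        # ||K - K~||_max <= eps/tau
M = tau * K + np.eye(n)
M_t = tau * K_t + np.eye(n)
assert np.abs(M - M_t).max() <= eps * (1 + 1e-12)  # ||M - M~||_max <= eps
```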
Part 1. For the part (i.) bound we rely on the Hoffman-Wielandt (HW) inequality [52]. This inequality states that for n × n symmetric matrices Σ_1, Σ_2 with eigenvalues λ_{1,i}, λ_{2,i} respectively, there exists a permutation π such that

Σ_{i=1}^n (λ_{2,π(i)} − λ_{1,i})² ≤ ||Σ_2 − Σ_1||_F².

Suppose ||Σ_2 − Σ_1||_F² ≤ c²; then each individual (λ_{2,π(i)} − λ_{1,i})² must be ≤ c², because the sum over all n terms is ≤ c². In other words, |λ_{2,π(i)} − λ_{1,i}| ≤ |c| for each i ∈ 1, …, n. Suppose c ≥ 0; then λ_{2,π(i)}/λ_{1,i} ∈ [1 − c, 1 + c]. In the case that the max norm of (Σ_2 − Σ_1) is bounded by ε > 0, i.e., max_{1≤i,j≤n} |Σ_{2,ij} − Σ_{1,ij}| < ε, we have

Σ_{i=1}^n (λ_{2,π(i)} − λ_{1,i})² ≤ ||Σ_2 − Σ_1||_F² ≤ n²ε²  ⟹  λ_{2,π(i)}/λ_{1,i} ∈ [1 − nε, 1 + nε].   (13)

Now consider the log of the product of the above terms. Supposing all λ_{1,i}, λ_{2,i} > 0,

λ_{2,π(i)}/λ_{1,i} ∈ [1 − nε, 1 + nε]  ⟹  Π_{i=1}^n λ_{2,π(i)}/λ_{1,i} ∈ [(1 − nε)^n, (1 + nε)^n]
 ⟹  log[Π_{i=1}^n λ_{2,π(i)}/λ_{1,i}] ∈ [n log(1 − nε), n log(1 + nε)].   (14)

Note that for the lower bound in equation (14) to be sensible we need 1 > nε, i.e., ε < 1/n. See Sections 9.3.1 and 9.3.2 for additional algebraic exposition.

Additionally, we rely on the following determinant and log properties for matrices A and B of compatible dimensions, with λ_{A,i} the eigenvalues of an n × n matrix A:

det(AB) = det(A) · det(B),   (15)
det(A^{-1}) = det(A)^{-1},   (16)
det(A) = Π_{i=1}^n λ_{A,i},   (17)
log(ab) = log(a) + log(b) for scalars a, b > 0.   (18)

Now, armed with the HW inequality and the determinant and log properties, we return to the issue of bounding part (i.):

log[det(K̃M̃^{-1}) / det(KM^{-1})] = log[det(K̃) det(M̃^{-1}) / (det(K) det(M^{-1}))]   (equation (15))
 = log[det(K̃) det(M) / (det(K) det(M̃))]   (equation (16))
 = log[det(K̃)/det(K)]   (d1.)
 + log[det(M)/det(M̃)].   (d2.)   (equation (18))

Consider part (d1.). Let λ_{K,i} and λ_{K̃,i} be the eigenvalues of K and K̃, respectively, and π be some permutation of the indices i = 1, …, n. Then we have

log[det(K̃)/det(K)] = log[Π_i λ_{K̃,i} / Π_i λ_{K,i}]   (equation (17))
 = log[Π_i λ_{K̃,π(i)} / Π_i λ_{K,i}] = log[Π_{i=1}^n λ_{K̃,π(i)}/λ_{K,i}]
 ≤ n log(1 + nε)   (equation (14))
 ≤ n²ε.   (for x > −1, log(1 + x) ≤ x)

The bound on part (d2.) follows analogously, so the overall bound on part (i.) is

log[det(K̃M̃^{-1}) / det(KM^{-1})] ≤ 2n²ε.   (19)

Part 2.
For the part (ii.) bound we have

tr[(K̃M̃^{-1})^{-1}(KM^{-1})] − n
 = tr[M̃K̃^{-1}KM^{-1}] − n
 = tr[K̃^{-1}KM^{-1}M̃] − n   (trace circular)
 = tr[K̃^{-1}(K̃ − Ω_K)M^{-1}(M + Ω_M)] − n   (K = K̃ − Ω_K and M̃ = M + Ω_M by definition)
 = tr[(K̃^{-1}K̃ − K̃^{-1}Ω_K)(M^{-1}M + M^{-1}Ω_M)] − n
 = tr[(I_n − K̃^{-1}Ω_K)(I_n + M^{-1}Ω_M)] − n   (I_n is the n × n identity matrix)
 = tr[I_n + M^{-1}Ω_M − K̃^{-1}Ω_K − K̃^{-1}Ω_K M^{-1}Ω_M] − n
 = tr[M^{-1}Ω_M] − tr[K̃^{-1}Ω_K] − tr[K̃^{-1}Ω_K M^{-1}Ω_M].   (trace additive; tr[I_n] = n)

Consider the bound on the first term in the above expression:

tr[M^{-1}Ω_M] = Σ_{i=1}^n Σ_{j=1}^n M^{-1}_{ij} Ω_{M,ji}   (definition of trace)
 ≤ ||M^{-1}||_max Σ_i Σ_j |Ω_{M,ji}|   (||M^{-1}||_max = max_{1≤i,j≤n} |M^{-1}_{ij}|)
 ≤ ||M^{-1}||_max n²ε.   (max_{1≤i,j≤n} |Ω_{M,ij}| ≤ ε)

Because the ε bound on Ω_K applies to −Ω_K as well, the upper bound on the second term is found analogously to that on the first:

−tr[K̃^{-1}Ω_K] = tr[K̃^{-1}(−Ω_K)] ≤ ||K̃^{-1}||_max n²ε.

It would be preferable to have the bound depend only on K and the specific approximation K̃. Weyl's inequality states that for a matrix A perturbed by some matrix Ω, the difference between the eigenvalues of the original matrix A and its perturbation Ã = A + Ω is bounded by the spectral norm of Ω. Specifically,

|λ_{A,i} − λ_{Ã,i}| ≤ ||Ω||_2  ∀ i = 1, …, n.   (20)

Let σ_min(·) and σ_max(·) denote the minimum and maximum eigenvalues, respectively, of a matrix. Then by equation (20) we can say

|σ_min(K) − σ_min(K̃)| ≤ ||K − K̃||_2 ≤ n||K − K̃||_max = nε.   (21)

We consider the bound on ||K̃^{-1}||_max, namely

||K̃^{-1}||_max ≤ √n ||K̃^{-1}||_2 = √n σ_max(K̃^{-1})   (definition of spectral norm)
 = √n / σ_min(K̃)   (σ_max(A^{-1}) = 1/σ_min(A))
 = √n / [σ_min(K) + δ_K].   (letting δ_K = σ_min(K̃) − σ_min(K))

The smaller σ_min(K) + δ_K is, the looser the bound on ||K̃^{-1}||_max becomes. From equation (21) we know that |δ_K| ≤ nε. In the worst case, δ_K is as negative as possible, so that the denominator approaches 0; this "worst" value is −nε. This worst case relies on the assumption that nε < σ_min(K), so that σ_min(K̃) remains nonzero; any ε < σ_min(K)/n satisfies this condition. Then the bound on the second term is

−tr[K̃^{-1}Ω_K] ≤ ||K̃^{-1}||_max n²ε ≤ n^{5/2} ε / [σ_min(K) − nε].   (22)

Before we move on to the third term, we state two trace inequalities.
First, let A, B be two matrices of compatible dimensions with A^T B a square matrix; then by the Cauchy-Schwarz inequality,

tr[A^T B] ≤ √( tr[A^T A] tr[B^T B] ).   (23)

Next, let A be a positive semi-definite matrix with eigenvalues λ_i; then

tr[A²] = Σ_i λ_i² ≤ (Σ_i λ_i)² = tr[A]².   (24)

Now consider the third and final term, which we bound using these two trace inequalities. Treating −K̃^{-1}Ω_K as A^T and M^{-1}Ω_M as B in equation (23),

−tr[K̃^{-1}Ω_K M^{-1}Ω_M] = tr[−K̃^{-1}Ω_K M^{-1}Ω_M]   (trace linear)
 ≤ √( tr[(−K̃^{-1}Ω_K)(−K̃^{-1}Ω_K)^T]   (t1.)
 × tr[(M^{-1}Ω_M)^T(M^{-1}Ω_M)] ).   (t2.)   (equation (23))

Consider bounding only part (t1.) above:

tr[(−K̃^{-1}Ω_K)(−K̃^{-1}Ω_K)^T] = tr[K̃^{-1}Ω_K Ω_K K̃^{-1}]   (K̃^{-1}, Ω_K symmetric)
 = tr[K̃^{-1}K̃^{-1}Ω_K Ω_K]   (trace circular)
 ≤ √( tr[(K̃^{-2})²] tr[(Ω_K²)²] )   (equation (23))
 ≤ √( tr[K̃^{-2}]² tr[Ω_K⁴] )   (equation (24); K̃^{-2} is PSD)
 ≤ √( tr[K̃^{-2}]² (nε)⁴ )   (tr[Ω_K^{2r}] ≤ (nε)^{2r}; see Section 9.3.3)
 = tr[K̃^{-2}] (nε)².

An analogous sequence of arguments bounds part (t2.):

tr[(M^{-1}Ω_M)^T(M^{-1}Ω_M)] ≤ tr[M^{-2}] (nε)².

Thus the bounds on (t1.) and (t2.) can be used to bound the whole expression, and we have

−tr[K̃^{-1}Ω_K M^{-1}Ω_M] ≤ √( tr[K̃^{-2}] (nε)² tr[M^{-2}] (nε)² ) ≤ |tr(K̃^{-1})| |tr(M^{-1})| n²ε²,

where the last step again applies equation (24). As with the max-norm of K̃^{-1}, we want the term |tr(K̃^{-1})| to only involve K, n, and ε.
Assuming as before that ε < σ_min(K)/n, and using the fact that the trace of a matrix is the sum of its eigenvalues, we have

|tr(K̃^{-1})| ≤ n ||K̃^{-1}||_2 = n σ_max(K̃^{-1}) = n / σ_min(K̃) ≤ n / (σ_min(K) − nε).

Therefore, the bound on all of part (ii.) is

tr[(K̃M̃^{-1})^{-1}(KM^{-1})] − n ≤ ||M^{-1}||_max n²ε + n^{5/2} ε / [σ_min(K) − nε] + n³ |tr(M^{-1})| ε² / [σ_min(K) − nε].   (25)

Part 3.
To bound the part (iii.) term, we rely on the following norm inequality. Let A be a symmetric n × n matrix and ||A||_2 denote the spectral norm; then it follows from Rayleigh's inequality and the definitions of the spectral and Frobenius norms that

x^T A x ≤ max_{1≤j≤n} λ_j ||x||² = ||A||_2 ||x||² ≤ ||A||_F ||x||²,   (26)

where the λ_j are the eigenvalues of the matrix A and x is a length-n vector. We also rely on an inequality for the difference of matrix inverses. Let A, B be invertible n × n matrices and let ||·|| denote some submultiplicative matrix norm. Suppose ||A^{-1}(B − A)|| < 1 and ||A − B|| < c for some c > 0. Then

||A^{-1} − B^{-1}|| ≤ ||A^{-1}||² ||A − B|| / (1 − ||A^{-1}|| ||A − B||) ≤ c ||A^{-1}||² / (1 − c ||A^{-1}||).   (27)

See Section 9.3.4 for details of this inequality, which is based on [53]. Turning to part (iii.),

[τK̃M̃^{-1}y − τKM^{-1}y]^T (K̃M̃^{-1})^{-1} [τK̃M̃^{-1}y − τKM^{-1}y]
 = τ² [(K̃M̃^{-1} − KM^{-1})y]^T (K̃M̃^{-1})^{-1} [(K̃M̃^{-1} − KM^{-1})y]
 = τ² y^T (K̃M̃^{-1} − KM^{-1}) (K̃M̃^{-1})^{-1} (K̃M̃^{-1} − KM^{-1}) y   (K̃M̃^{-1} and KM^{-1} symmetric)
 ≤ τ² ||(K̃M̃^{-1} − KM^{-1}) (K̃M̃^{-1})^{-1} (K̃M̃^{-1} − KM^{-1})||_2 ||y||²   (equation (26))
 ≤ τ² ||(K̃M̃^{-1})^{-1}||_2 ||y||²   (n1.)
 × ||K̃M̃^{-1} − KM^{-1}||_2².   (n2.)   (submultiplicativity of the 2-norm)

We want to bound the term in (n1.) involving K̃ and M̃ by some function of K, M, n, and ε. Assuming as before that ε < σ_min(K)/n, we have

||(K̃M̃^{-1})^{-1}||_2 = ||M̃K̃^{-1}||_2 ≤ ||M̃||_2 ||K̃^{-1}||_2 = σ_max(M̃) / σ_min(K̃) ≤ (σ_max(M) + nε) / (σ_min(K) − nε).

The term (n1.) acts as a constant, and (n2.) goes to 0 with ε. Consider the square root of (n2.), and assume ||M^{-1}(M̃ − M)|| < 1, a required condition for the inequality shown in equation (27). (Note that any ε < 1/(n ||M^{-1}||_2) = σ_min(M)/n will satisfy this condition; see Section 9.3.4 for details.)

||K̃M̃^{-1} − KM^{-1}||_2 = ||(K + Ω_K)M̃^{-1} − KM^{-1}||_2   (K̃ = K + Ω_K by definition)
 = ||K(M̃^{-1} − M^{-1}) + Ω_K M̃^{-1}||_2
 ≤ ||K||_2 ||M̃^{-1} − M^{-1}||_2 + ||Ω_K||_2 ||M̃^{-1}||_2   (triangle inequality; submultiplicativity)
 ≤ nε ||K||_2 ||M^{-1}||_2² / (1 − nε ||M^{-1}||_2) + nε ||M̃^{-1}||_2,   (equation (27); ||Ω_K||_2 ≤ ||Ω_K||_F ≤ nε and ||Ω_M||_2 ≤ nε)

As with K̃^{-1}, we can bound ||M̃^{-1}||_2 in terms of M, n, and ε. Assuming that ε < σ_min(M)/n (which, since M is almost always better conditioned than K, will typically already hold because of the earlier choice ε < σ_min(K)/n),

||M̃^{-1}||_2 = 1 / σ_min(M̃) ≤ 1 / (σ_min(M) − nε).

Note that since n, ||K||_2, and ||M^{-1}||_2 are fixed (although each may be quite large), we have

lim_{ε→0} nε ||K||_2 ||M^{-1}||_2² / (1 − nε ||M^{-1}||_2) = 0.

Since we chose ε < σ_min(M)/n = 1/(n ||M^{-1}||_2), we have 1 − nε ||M^{-1}||_2 > 0, so each term above is positive and decreases to 0 with ε. Combining (n1.) with the bound on (n2.), part (iii.) is therefore bounded by

τ² ||y||² × (σ_max(M) + nε)/(σ_min(K) − nε) × [ nε ||K||_2 ||M^{-1}||_2² / (1 − nε ||M^{-1}||_2) + nε / (σ_min(M) − nε) ]²,   (28)

which goes to 0 as ε → 0.
We now give the details behind equation (27) (Section 9.3.4). Let A, B be invertible n × n matrices and let ||·|| denote some submultiplicative matrix norm. Define ∆ = B − A. Assume ||A^{-1}∆|| < 1, which ensures B is nonsingular. For the bound on ||A^{-1} − B^{-1}||, we use the following result from [53]:

||B^{-1}|| ≤ ||A^{-1}|| / (1 − ||A^{-1}∆||).   (29)

This bound comes from the following:

||B^{-1}|| = ||A^{-1} − A^{-1}BB^{-1} + A^{-1}AB^{-1}||   (add the zero matrix and multiply by the identity)
 = ||A^{-1} − A^{-1}∆B^{-1}||   (factor out terms)
 ≤ ||A^{-1}|| + ||A^{-1}∆B^{-1}||   (triangle inequality)
 ≤ ||A^{-1}|| + ||A^{-1}∆|| ||B^{-1}||.   (submultiplicativity of the norm)

Dividing both sides of the inequality ||B^{-1}|| ≤ ||A^{-1}|| + ||A^{-1}∆|| ||B^{-1}|| by ||B^{-1}|| gives 1 ≤ ||A^{-1}||/||B^{-1}|| + ||A^{-1}∆||, giving the result in equation (29). We also use the result, valid when ||A|| ||B|| < 1, that

(1 − ||AB||)^{-1} ≤ (1 − ||A|| ||B||)^{-1},   (30)

derived as follows:

||AB|| ≤ ||A|| ||B||   (submultiplicativity of the norm)
 ⟹ 1 − ||AB|| ≥ 1 − ||A|| ||B||   (multiply by −1 and add 1)
 ⟹ (1 − ||AB||)^{-1} ≤ (1 − ||A|| ||B||)^{-1}.   (a ≥ b > 0 ⟹ 1/a ≤ 1/b)

Moving to the overall bound on ||A^{-1} − B^{-1}||, we have

||A^{-1} − B^{-1}|| = ||A^{-1}BB^{-1} − A^{-1}AB^{-1}|| = ||A^{-1}∆B^{-1}||   (factor out terms)
 ≤ ||A^{-1}∆|| ||B^{-1}||   (submultiplicativity of the norm)
 ≤ ||A^{-1}∆|| ||A^{-1}|| / (1 − ||A^{-1}∆||)   (equation (29))
 ≤ ||A^{-1}||² ||∆|| / (1 − ||A^{-1}∆||)   (submultiplicativity of the norm)
 ≤ ||A^{-1}||² ||∆|| / (1 − ||A^{-1}|| ||∆||).   (equation (30))

Then if ||A − B|| is bounded, i.e., ||A − B|| ≤ c for some c > 0, and ||A^{-1}∆|| < 1, the bound can be expressed as ||A^{-1} − B^{-1}|| ≤ c ||A^{-1}||² / (1 − c ||A^{-1}||). If c < 1/||A^{-1}||, it can be ensured that ||A^{-1}∆|| < 1, because

||A^{-1}∆|| = ||A^{-1}(B − A)|| ≤ ||A^{-1}|| ||B − A|| ≤ c ||A^{-1}|| < 1.   (submultiplicativity; ||B − A|| = ||A − B|| ≤ c)

We can find the GP posterior covariance and expectation under heteroskedastic variance. Recall Cov(f | y, X, Ω) = K − K(K + D^{-1})^{-1}K. By the Sherman-Morrison-Woodbury formula and algebraic manipulation, we have

(K + D^{-1})^{-1} = K^{-1} − K^{-1}(D + K^{-1})^{-1}K^{-1}
 ⟹ K(K + D^{-1})^{-1}K = K − (D + K^{-1})^{-1}   (left- and right-multiply by K)
 ⟹ K − K(K + D^{-1})^{-1}K = (D + K^{-1})^{-1}
  = KK^{-1}(D + K^{-1})^{-1} = K[(D + K^{-1})K]^{-1} = K(DK + I)^{-1}
  = K(DK + I)^{-1}DD^{-1} = K(K + D^{-1})^{-1}D^{-1}
  = D^{-1}(K + D^{-1})^{-1}K.   (transpose of a symmetric matrix)

Thus Cov(f | y, X, Ω) = K − K(K + D^{-1})^{-1}K = K(DK + I)^{-1} = D^{-1}(K + D^{-1})^{-1}K. We can also find the GP posterior mean under heteroskedastic variance. Recall E(f | y, X, Ω) = K(K + D^{-1})^{-1}y. Then, via simple algebraic manipulation, we have

K(K + D^{-1})^{-1}y = K(K + D^{-1})^{-1}D^{-1}Dy = K(D[K + D^{-1}])^{-1}Dy = K(DK + I)^{-1}Dy = Cov(f | y, X, Ω) D y.

The bound on the KL divergence, which can be found in equation (12), then uses M = DK + I for the GP covariance rather than M = τK + I. The bound on part (iii.) of the KL divergence, which can be found in equation (28), then only requires the slight modification of replacing τ²||y||² with ||Dy||², due to this adjustment to the GP mean under heteroskedasticity.

The sampling steps in Algorithm 1 also need adjusting to account for heteroskedastic noise. The modification is described below, and a dense-matrix sketch follows the listed steps. We avoid working with the non-symmetric matrix DK + I by using the form of the posterior covariance D^{-1}(K + D^{-1})^{-1}K; while it may look computationally more intensive, it actually requires fewer HODLR operations than the alternative forms and leads to more stable sampling because all HODLR matrices are symmetric.

Algorithm 1 (heteroskedastic): Given noisy observations y of an underlying function f at inputs X, diagonal precision matrix D, GP hyperparameters Θ, and HODLR specifications tolerance ε and maximum block size B, approximate a sample from p(f | y, X, Θ) using a factorization of cost O(n log² n) and operations of cost at most O(n log n). Sampling proceeds as follows:
Step 1: Create H-matrix decompositions of K and K + D^{-1} with tolerance ε; call these K̃ and P̃, respectively (assembly, cost O(n log n)).

Step 2: Get W, the symmetric factor of K̃ such that K̃ = WW^T (factorization, cost O(n log² n)).

Step 3: Sample a ∼ N(0, D) and b ∼ N(0, I), both independent n-dimensional random vectors.

Step 4: Let Z = K̃a + Wb (matrix-vector products, cost O(n log n)).

Step 5: Let Z* = P̃^{-1}Z (solve linear system, cost O(n log n)).

Step 6: Let Z** = D^{-1}Z* (multiply vector by diagonal matrix, cost O(n)).

Step 7: Let R = K̃P̃^{-1}y (solve linear system then matrix-vector product, cost O(n log n)).

Step 8: Finally, let Z*** = R + Z**, which is the approximated sample from p(f | y, X, Θ) (vector addition, cost O(n)).
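To make the flow of the eight steps concrete, here is a dense-matrix walk-through; ordinary dense solves and a Cholesky factor stand in for the HODLR assembly, symmetric factorization, and fast solves, and the kernel, precisions, and data are illustrative placeholders.

```python
import numpy as np

# Dense-matrix walk-through of the eight steps above; dense solves and a
# Cholesky factor stand in for the HODLR operations.
rng = np.random.default_rng(3)
n = 60
X = np.sort(rng.uniform(0, 1, n))
K_t = np.exp(-(X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(n)  # stands in for K~
tau = rng.uniform(1.0, 3.0, n)                 # heteroskedastic precisions tau_i
y = np.sin(2 * np.pi * X) + rng.standard_normal(n) / np.sqrt(tau)

P_t = K_t + np.diag(1.0 / tau)                 # Step 1: P~ = K~ + D^{-1}
W = np.linalg.cholesky(K_t)                    # Step 2: K~ = W W^T
a = np.sqrt(tau) * rng.standard_normal(n)      # Step 3: a ~ N(0, D)
b = rng.standard_normal(n)                     #         b ~ N(0, I)
Z = K_t @ a + W @ b                            # Step 4
Z_star = np.linalg.solve(P_t, Z)               # Step 5: Z* = P~^{-1} Z
Z_2star = Z_star / tau                         # Step 6: Z** = D^{-1} Z*
R = K_t @ np.linalg.solve(P_t, y)              # Step 7: R = K~ P~^{-1} y
f_draw = R + Z_2star                           # Step 8: approximate draw from p(f | y, X, Theta)
```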
Lemma 9.1. From Algorithm 1 (heteroskedastic), Z*** ∼ N(µ̃_f, Σ̃_f), where Σ̃_f = K̃P̃^{-1}D^{-1} and µ̃_f = K̃P̃^{-1}y are defined to be the approximations of the posterior function variance Σ_f and mean µ_f, respectively.

Proof. Analogous to the proof of Algorithm 1 for homoskedastic variance. Note that Step 5 relies on the property (A^{-1})^T = (A^T)^{-1}. Specifically, Cov(Z*) = Cov(P̃^{-1}Z) = P̃^{-1}Cov(Z)(P̃^{-1})^T, implying

Cov(Z*) = P̃^{-1}(K̃DK̃ + K̃)P̃^{-1} = P̃^{-1}K̃(DK̃ + I)P̃^{-1} = P̃^{-1}K̃D(K̃ + D^{-1})P̃^{-1} = P̃^{-1}K̃D.

Then Cov(Z**) = D^{-1}Cov(Z*)D^{-1} = D^{-1}P̃^{-1}K̃DD^{-1} = D^{-1}P̃^{-1}K̃, which equals Σ̃_f by symmetry.

The symmetric factorization of HODLR matrices was described thoroughly in Ambikasaran (2014) [36]. Rather than restating the algorithm in full here, we include an example to provide intuition into the decomposition and symmetric factorization.
For the example matrix, n = 200 uniform data points were sampled in the range (0, 1) and then sorted, and the exponential covariance kernel k(x, x') = exp(−|x − x'|) was used. Figure 11 shows the matrix and the block structure of the matrix as a 2-level HODLR matrix.

Figure 11: Left: realizations of the covariance kernel for d = 1 with sorted inputs. Right: block partition of a 2-level HODLR tree.

After the matrix has been divided into block-diagonal HODLR structure, the off-diagonal blocks are approximated to the desired tolerance (i.e., the desired maximum absolute difference between the true and approximated matrix entries). In the example, we set a small fixed tolerance. Figure 12 shows the low-rank description of the off-diagonal blocks alongside the elementwise difference between the true and approximated matrices at this tolerance. The rank of the large off-diagonal blocks with this tolerance is 7, and that of the smaller off-diagonal blocks is 5; more explicitly, U^(1)_1 and V^(1)_1 are 100 × 7 matrices, and U^(2)_1, V^(2)_1, U^(2)_2, and V^(2)_2 are 50 × 5 matrices.

Ambikasaran (2014) symmetrically factors a HODLR matrix A as A = WW^T, where W = A_ℓ A_{ℓ−1} ⋯ A_1 A_0 is the product of a block-diagonally dense matrix and ℓ matrices that are low-rank updates to the identity (where ℓ is the level of the HODLR matrix). To quickly summarize, the first step of the algorithm is to symmetrically factor each of the diagonal blocks of the matrix. Then, the inverse of the relevant block-diagonal matrix/matrices is applied to each U^(j)_i, V^(j)_i, leading to the factorization A = A_ℓ Ã_ℓ A_ℓ^T, where A_ℓ is a block-diagonal matrix whose entries are the symmetric factorizations of the diagonal blocks of A, and the diagonal blocks of Ã_ℓ are low-rank updates to the identity matrix. These diagonal blocks can be quickly factorized using the Sherman-Morrison-Woodbury formula and Theorem 3.1 in Ambikasaran (2014). The algorithm proceeds iteratively until reaching the 0th level. Full details are provided in Ambikasaran (2014).

Figure 12: Left: HODLR matrix with the low-rank description overlaid on the off-diagonal blocks. Right: Difference between the true and approximated matrices.
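The compression step just described can be sketched in a few lines of dense linear algebra. Here the SVD of one off-diagonal block of an exponential-kernel matrix is truncated until a placeholder elementwise tolerance is met, and the low-rank versus dense storage is compared (8 bytes per double).

```python
import numpy as np

# Sketch of the off-diagonal compression idea: truncate the SVD of one block of
# a sorted-input covariance matrix until a placeholder elementwise tolerance is
# met, then compare low-rank versus dense storage.
rng = np.random.default_rng(4)
n, tol = 200, 1e-8
X = np.sort(rng.uniform(0, 1, n))
K = np.exp(-np.abs(X[:, None] - X[None, :]))     # exponential kernel as above

B = K[: n // 2, n // 2 :]                        # large off-diagonal block
U, s, Vt = np.linalg.svd(B, full_matrices=False)
for r in range(1, len(s) + 1):
    B_r = (U[:, :r] * s[:r]) @ Vt[:r]            # rank-r reconstruction
    if np.abs(B - B_r).max() < tol:
        break
print(r, "ranks;", 2 * (n // 2) * r * 8, "bytes vs", (n // 2) ** 2 * 8, "dense")
```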
Figure 13 shows realizations of covariance kernels for randomly generated x having d = 1 and d = 2, with the large off-diagonal block emphasized. These n/2 × n/2 off-diagonal blocks can be approximated by rank-r matrices. If one requires the largest absolute difference to be below the chosen tolerance, the d = 1 example shown requires r = 8 while the d = 2 example requires r = 56 (here we use the SVD because this is a small n = 200 example). The higher the dimension of the input space, the more complex the structure in the covariance matrix, even after sorting, and the more costly storage and computation become. In the example, the storage cost of the dense 100 × 100 off-diagonal block is 80,000 bytes and that of the SVD approximation with r = 8 is 12,800 bytes, with a commensurate reduction in the flops required for matrix-vector multiplication. The real savings come as n increases. With the same hyperparameters and range of data, setting n = 2000 leads to a storage cost for the dense off-diagonal block matrix of 8,000,000 bytes but only requires an r = 9 approximation for the same fidelity (i.e., 144,000 bytes of storage).

Figure 13: Left: realizations of the covariance kernel for d = 1 with sorted inputs. Right: realizations of the covariance kernel for d = 2 with inputs sorted via a kd-tree.

As an example of how the HODLR factorization enables fast algebraic operations, consider the calculation of det(A), the determinant of A. We rely on Sylvester's Determinant Theorem and two facts regarding the determinant: first, for square matrices the determinant of a product of matrices is equal to the product of the determinants, and second, the determinant of a block-diagonal matrix is the product of the determinants of the individual blocks. Suppose A is factorized as in Figure 2 into A_2 A_1 A_0. The only full-rank matrices are those on the diagonal of A_2, and the number of levels in the factorization can be chosen such that these are of sufficiently low dimension that computing the determinant of any individual block is of low computational expense. The blocks of A_1 and A_0 are low-rank updates to the identity, and thus their determinants can be computed using Sylvester's Determinant Theorem, which says that det(I_p + TS) = det(I_n + ST) for S ∈ R^{n×p} and T ∈ R^{p×n}.
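Sylvester's theorem is what makes these determinant evaluations cheap, since each low-rank update contributes only a small p × p determinant; the following sketch confirms the identity numerically for a random low-rank update (dimensions are arbitrary).

```python
import numpy as np

# Numerical confirmation of Sylvester's determinant theorem: the determinant of
# a large low-rank update of the identity equals a small p x p determinant.
rng = np.random.default_rng(5)
n, p = 400, 6
T = rng.standard_normal((p, n))
S = rng.standard_normal((n, p))

small = np.linalg.slogdet(np.eye(p) + T @ S)[1]  # log|det(I_p + T S)|
large = np.linalg.slogdet(np.eye(n) + S @ T)[1]  # log|det(I_n + S T)|
print(np.isclose(small, large))                  # True
```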
Figure 14: Data and true parameter values from each simulation setting.

Figure 15: Parameter estimates from each sampler run 100 times under Setting 1. (Panels show the lower, mean, and upper estimates of τ, σ_f, and ρ for each sampler.)

Figure 16: Parameter estimates from each sampler run 100 times under Setting 2. (Panels as in Figure 15.)

Figure 17: Parameter estimates from each sampler run 100 times under Setting 3. (Panels as in Figure 15.)

Figure 18: Parameter estimates from each sampler run 100 times under Setting 4. (Panels as in Figure 15.)

Figure 19: Performance results from each sampler run 100 times under Setting 1. (Panels show MSPE f*, 95% CI area for f*, MSE f, and CI area for f.)

Figure 20: Performance results from each sampler run 100 times under Setting 2. (Panels as in Figure 19.)

Figure 21: Performance results from each sampler run 100 times under Setting 3. (Panels as in Figure 19.)
Figure 22: Performance results from each sampler run 100 times under Setting 4. (Panels as in Figure 19.)

n = 100 and n = 500

Table 2: Performance results, parameter estimates, and computing time for Bayesian GP regression with n = 500 data points simulated using a squared exponential covariance kernel. Time shown is total time for setup and 27,000 iterations through the Gibbs sampler, with 2,500 samples retained. Columns: Exact GP | FIFA-GP (ε_max setting 1) | FIFA-GP (ε_max setting 2) | Compressed GP (ε_fro setting 1) | Compressed GP (ε_fro setting 2).

Smooth and low noise (truth: τ = 30, σ_f = 1, ρ = 0.25)
  MSPE f*:           0.001 | 0.001 | 0.001 | 0.001 | 0.001
  95% CI (f*) area:  0.35 | 0.35 | 0.35 | 0.38 | 0.39
  τ̂ [τ_ll, τ_ul]:    30.4 [26.8, 34.4] | 30.4 [26.6, 34.3] | 30.4 [26.7, 34.3] | 30.4 [26.7, 34.2] | 30.4 [26.7, 34.4]
  σ̂_f [ll, ul]:      1.11 [0.56, 2.39] | 1.25 [0.64, 2.52] | 1.26 [0.61, 2.69] | 0.85 [0.52, 1.49] | 0.79 [0.48, 1.29]
  ρ̂ [ll, ul]:        0.36 [0.17, 0.85] | 0.31 [0.16, 0.58] | 0.32 [0.16, 0.61] | 0.55 [0.51, 0.69] | 0.70 [0.59, 0.81]
  Time (min):        15.3 | 4.2 | 3.9 | 1.2 | 1.2

Smooth and high noise (truth: τ = 2, σ_f = 1, ρ = 0.25)
  MSPE f*:           0.003 | 0.004 | 0.003 | 0.002 | 0.003
  95% CI (f*) area:  1.15 | 1.12 | 1.16 | 1.25 | 1.19
  τ̂ [τ_ll, τ_ul]:    1.82 [1.60, 2.06] | 1.83 [1.60, 2.06] | 1.82 [1.60, 2.05] | 1.82 [1.60, 2.05] | 1.82 [1.60, 2.04]
  σ̂_f [ll, ul]:      0.57 [0.31, 1.12] | 0.54 [0.31, 1.01] | 0.56 [0.31, 1.11] | 0.48 [0.29, 0.84] | 0.52 [0.30, 0.94]
  ρ̂ [ll, ul]:        0.39 [0.13, 1.10] | 0.33 [0.13, 0.89] | 0.41 [0.14, 1.10] | 0.68 [0.58, 0.86] | 0.52 [0.43, 0.65]
  Time (min):        20.0 | 4.2 | 3.9 | 1.2 | 1.1

Wiggly and low noise (truth: τ = 30, σ_f = 1, ρ = 2)
  MSPE f*:           0.001 | 0.001 | 0.001 | 0.001 | 0.001
  95% CI (f*) area:  0.53 | 0.53 | 0.53 | 0.53 | 0.54
  τ̂ [τ_ll, τ_ul]:    28.7 [25.2, 32.3] | 28.7 [25.2, 32.5] | 28.7 [25.2, 32.2] | 28.6 [25.2, 32.3] | 28.7 [25.3, 32.3]
  σ̂_f [ll, ul]:      1.09 [0.68, 1.82] | 1.07 [0.65, 1.82] | 1.03 [0.65, 1.66] | 0.96 [0.63, 1.48] | 0.94 [0.64, 1.43]
  ρ̂ [ll, ul]:        2.05 [1.41, 3.13] | 2.10 [1.43, 3.06] | 2.14 [1.50, 3.27] | 2.34 [2.09, 2.65] | 2.39 [2.17, 2.51]
  Time (min):        20.1 | 4.7 | 4.3 | 1.4 | 1.3

Wiggly and high noise (truth: τ = 2, σ_f = 1, ρ = 2)
  MSPE f*:           0.015 | 0.015 | 0.015 | 0.015 | 0.015
  95% CI (f*) area:  1.83 | 1.81 | 1.82 | 1.76 | 1.75
  τ̂ [τ_ll, τ_ul]:    2.05 [1.80, 2.31] | 2.05 [1.80, 2.32] | 2.05 [1.80, 2.32] | 2.05 [1.81, 2.32] | 2.05 [1.80, 2.31]
  σ̂_f [ll, ul]:      1.03 [0.64, 1.68] | 1.07 [0.66, 1.85] | 1.07 [0.63, 1.90] | 1.10 [0.69, 1.81] | 1.13 [0.70, 1.85]
  ρ̂ [ll, ul]:        2.93 [1.53, 3.92] | 2.69 [1.44, 3.89] | 2.80 [1.47, 3.93] | 2.51 [2.23, 2.74] | 2.31 [2.14, 2.56]
  Time (min):        20.1 | 5.0 | 4.4 | 1.4 | 1.4

Table 3: Performance results, parameter estimates, and computing time for Bayesian GP regression with n = 100 data points simulated using a squared exponential covariance kernel. Time shown is total time for setup and 27,000 iterations through the Gibbs sampler, with 2,500 samples retained. Columns as in Table 2.

Smooth and low noise (truth: τ = 30, σ_f = 1, ρ = 0.25)
  MSPE f*:           0.005 | 0.005 | 0.005 | 0.006 | 0.006
  95% CI (f*) area:  0.94 | 0.95 | 0.94 | 1.01 | 1.01
  τ̂ [τ_ll, τ_ul]:    22.5 [16.5, 29.6] | 22.6 [16.7, 29.5] | 22.5 [16.7, 29.3] | 22.4 [16.6, 29.2] | 22.4 [16.4, 29.3]
  σ̂_f [ll, ul]:      1.38 [0.71, 2.88] | 1.38 [0.73, 2.85] | 1.42 [0.73, 2.95] | 1.11 [0.66, 1.91] | 1.11 [0.66, 1.92]
  ρ̂ [ll, ul]:        0.37 [0.17, 0.74] | 0.37 [0.17, 0.79] | 0.35 [0.17, 0.76] | 0.60 [0.41, 0.81] | 0.59 [0.41, 0.91]
  Time (min):        0.5 | 0.9 | 0.8 | 0.2 | 0.2

Smooth and high noise (truth: τ = 2, σ_f = 1, ρ = 0.25)
  MSPE f*:           0.035 | 0.035 | 0.034 | 0.037 | 0.037
  95% CI (f*) area:  2.19 | 2.23 | 2.20 | 2.36 | 2.28
  τ̂ [τ_ll, τ_ul]:    2.07 [1.54, 2.72] | 2.08 [1.56, 2.71] | 2.08 [1.54, 2.69] | 2.06 [1.51, 2.70] | 2.07 [1.53, 2.69]
  σ̂_f [ll, ul]:      0.64 [0.32, 1.33] | 0.63 [0.32, 1.28] | 0.65 [0.33, 1.43] | 0.57 [0.31, 1.14] | 0.60 [0.31, 1.20]
  ρ̂ [ll, ul]:        0.49 [0.15, 1.07] | 0.52 [0.15, 1.13] | 0.47 [0.15, 1.14] | 0.81 [0.59, 1.13] | 0.62 [0.43, 0.92]
  Time (min):        0.6 | 0.9 | 0.8 | 0.2 | 0.3

Wiggly and low noise (truth: τ = 30, σ_f = 1, ρ = 2)
  MSPE f*:           0.005 | 0.005 | 0.005 | 0.004 | 0.004
  95% CI (f*) area:  1.14 | 1.12 | 1.13 | 1.17 | 1.15
  τ̂ [τ_ll, τ_ul]:    23.0 [16.8, 30.4] | 22.9 [16.9, 29.8] | 23.1 [16.8, 30.2] | 23.2 [16.9, 30.4] | 23.1 [16.8, 30.4]
  σ̂_f [ll, ul]:      0.84 [0.51, 1.45] | 0.87 [0.52, 1.52] | 0.85 [0.52, 1.44] | 0.78 [0.49, 1.29] | 0.79 [0.51, 1.26]
  ρ̂ [ll, ul]:        1.89 [1.02, 3.74] | 1.73 [1.02, 3.47] | 1.83 [1.03, 3.39] | 2.30 [1.95, 2.68] | 2.20 [1.53, 2.71]
  Time (min):        0.6 | 1.0 | 0.9 | 0.3 | 0.3

Wiggly and high noise (truth: τ = 2, σ_f = 1, ρ = 2)
  MSPE f*:           0.054 | 0.052 | 0.053 | 0.054 | 0.053
  95% CI (f*) area:  3.10 | 3.12 | 3.12 | 3.09 | 3.05
  τ̂ [τ_ll, τ_ul]:    2.32 [1.68, 3.04] | 2.31 [1.68, 3.02] | 2.31 [1.69, 3.05] | 2.32 [1.70, 3.05] | 2.31 [1.67, 3.04]
  σ̂_f [ll, ul]:      0.94 [0.55, 1.69] | 0.95 [0.55, 1.71] | 0.94 [0.54, 1.70] | 0.94 [0.54, 1.65] | 0.97 [0.56, 1.73]
  ρ̂ [ll, ul]:        2.96 [1.45, 3.96] | 2.81 [1.29, 3.92] | 2.86 [1.38, 3.93] | 2.81 [2.35, 3.33] | 2.64 [2.08, 3.16]
  Time (min):        0.6 | 1.0 | 1.0 | 0.3 | 0.3

Table 4: Performance results, parameter estimates, and computing time for Bayesian GP regression with n ∈ {100, 500, 1000, 2000} data points with mean f(x) = sin 2x + e^x and precision τ = 2. Time shown is total time for setup and 7,000 iterations through the Gibbs sampler, with the first 2,000 samples discarded and every tenth sample thereafter retained. Columns as in Table 2.

n = 100
  95% CI (f*) area:  2.75 | 2.72 | 2.69 | 2.67 | 2.68
  τ̂ [τ_ll, τ_ul]:    1.85 [1.32, 2.38] | 1.87 [1.37, 2.37] | 1.86 [1.38, 2.40] | 1.87 [1.31, 2.49] | 1.85 [1.37, 2.47]
  σ̂_f [ll, ul]:      0.93 [0.47, 2.16] | 1.10 [0.49, 3.57] | 0.82 [0.47, 1.41] | 0.85 [0.48, 1.53] | 0.82 [0.48, 1.37]
  ρ̂ [ll, ul]:        1.11 [0.28, 1.98] | 1.04 [0.21, 1.97] | 1.13 [0.47, 1.94] | 1.21 [1.06, 1.36] | 1.18 [0.89, 1.29]
  Time (min):        0.1 | 0.1 | 0.1 | 0.0 | 0.0

n = 500
  95% CI (f*) area:  1.49 | 1.56 | 1.54 | 1.48 | 1.57
  τ̂ [τ_ll, τ_ul]:    1.94 [1.73, 2.18] | 1.94 [1.70, 2.20] | 1.93 [1.71, 2.18] | 1.93 [1.70, 2.16] | 1.93 [1.70, 2.19]
  σ̂_f [ll, ul]:      0.97 [0.53, 2.02] | 1.02 [0.51, 2.48] | 0.93 [0.52, 1.88] | 0.82 [0.51, 1.32] | 0.86 [0.48, 1.85]
  ρ̂ [ll, ul]:        1.02 [0.43, 1.81] | 1.05 [0.39, 1.78] | 1.01 [0.39, 1.92] | 1.05 [1.05, 1.05] | 1.28 [1.28, 1.28]
  Time (min):        2.0 | 0.6 | 0.5 | 0.2 | 0.2

n = 1000
  95% CI (f*) area:  1.12 | 1.11 | 1.11 | 1.08 | 1.06
  τ̂ [τ_ll, τ_ul]:    2.02 [1.84, 2.21] | 2.02 [1.86, 2.20] | 2.02 [1.84, 2.20] | 2.01 [1.84, 2.19] | 2.02 [1.86, 2.21]
  σ̂_f [ll, ul]:      0.99 [0.48, 2.16] | 0.92 [0.54, 1.68] | 0.80 [0.47, 1.59] | 0.82 [0.57, 1.16] | 0.87 [0.56, 1.29]
  ρ̂ [ll, ul]:        1.02 [0.35, 1.86] | 0.98 [0.41, 1.86] | 1.19 [0.61, 1.91] | 1.05 [1.05, 1.05] | 1.05 [1.05, 1.05]
  Time (min):        12.7 | 1.2 | 1.1 | 0.8 | 0.8

n = 2000
  95% CI (f*) area:  0.74 | 0.77 | 0.74 | 0.79 | 0.80
  τ̂ [τ_ll, τ_ul]:    2.00 [1.87, 2.13] | 2.00 [1.88, 2.13] | 2.00 [1.88, 2.11] | 2.00 [1.88, 2.14] | 2.00 [1.87, 2.11]
  σ̂_f [ll, ul]:      1.37 [0.65, 2.19] | 0.92 [0.58, 1.66] | 1.27 [0.61, 2.28] | 0.81 [0.51, 1.40] | 0.80 [0.50, 1.52]
  ρ̂ [ll, ul]:        0.50 [0.24, 0.95] | 0.75 [0.33, 1.56] | 0.54 [0.26, 1.16] | 1.09 [1.09, 1.09] | 1.05 [1.05, 1.05]
  Time (min):        62.9 | 2.6 | 2.2 | 4.1 | 4.1

Large n simulation results

Figures 23 through 26 show performance results, the τ estimate and 95% CI limits, and computing time for FIFA-GP regression with n up to 100,000. Observations have mean f(x) = sin 2x + e^x and precision τ = 2.
Time shown includes setup and 7,000 iterations through the Gibbs sampler, with the first 2,000 samples discarded and every tenth sample thereafter retained. For all n, the tolerance ε_max is held fixed. We see that as n increases, both the accuracy and the precision of the function estimates improve (i.e., the MSPE and the area of the function's 95% credible interval approach 0). The precision estimate τ̂ approaches the truth, and the 95% credible interval width narrows. Finally, even with n = 100,000 the sampler takes only about 200 minutes to run (on a several-years-old MacBook laptop).
Figure 23: MSPE of hold-out data for the large n simulation (x-axis: n from 100 to 100,000).

Figure 24: Area of the 95% credible interval around f* for the large n simulation.

Figure 25: Estimate and credible interval of τ for the large n simulation.

Figure 26: Timing (minutes) for the large n simulation.