Slice sampling covariance hyperparameters of latent Gaussian models
Iain Murray
School of Informatics, University of Edinburgh
Ryan Prescott Adams
Dept. of Computer Science, University of Toronto
Abstract
The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes.
Many probabilistic models incorporate multivariate Gaussian distributions to explain dependencies between variables. Gaussian process (GP) models and generalized linear mixed models are common examples. For non-Gaussian observation models, inferring the parameters that specify the covariance structure can be difficult. Existing computational methods can be split into two complementary classes: deterministic approximations and Monte Carlo simulation. This work presents a method to make the sampling approach easier to apply.

In recent work Murray et al. [1] developed a slice sampling [2] variant, elliptical slice sampling, for updating strongly coupled a-priori Gaussian variates given non-Gaussian observations. Previously, Agarwal and Gelfand [3] demonstrated the utility of slice sampling for updating covariance parameters, conventionally called hyperparameters, with a Gaussian observation model, and questioned the possibility of slice sampling in more general settings. In this work we develop a new slice sampler for updating covariance hyperparameters. Our method uses a robust representation that should work well on a wide variety of problems, has very few technical requirements, and needs little tuning, so it should be easy to apply.
We consider generative models of data that depend on a vector of latent variables f that are Gaussian distributed with covariance Σ_θ set by unknown hyperparameters θ. These models are common in the machine learning Gaussian process literature [e.g. 4] and throughout the statistical sciences. We use standard notation for a Gaussian distribution with mean m and covariance Σ,

    N(f; m, Σ) ≡ |2πΣ|^{-1/2} exp( -(1/2)(f - m)^T Σ^{-1} (f - m) ),    (1)

and use f ∼ N(m, Σ) to indicate that f is drawn from a distribution with the density in (1).

Figure 1: (a) Draws from the prior over f using three different lengthscales (ℓ = 0.1, 0.5, 2) in the squared-exponential covariance (2). (b) The posteriors over log-lengthscale for these three draws.
The generic form of the generative models we consider is summarized by

    covariance hyperparameters   θ ∼ p_h,
    latent variables             f ∼ N(0, Σ_θ),
    and a conditional likelihood P(data | f) = L(f).

The methods discussed in this paper apply to covariances Σ_θ that are arbitrary positive definite functions parameterized by θ. However, our experiments focus on the popular case where the covariance is associated with N input vectors {x_n}_{n=1}^N through the squared-exponential kernel,

    (Σ_θ)_{ij} = k(x_i, x_j) = σ_f^2 exp( -(1/2) Σ_{d=1}^D (x_{d,i} - x_{d,j})^2 / ℓ_d^2 ),    (2)

with hyperparameters θ = {σ_f^2, {ℓ_d}}. Here σ_f^2 is the 'signal variance' controlling the overall scale of the latent variables f. The ℓ_d give characteristic lengthscales for converting the distances between inputs into covariances between the corresponding latent values f.

For non-Gaussian likelihoods we wish to sample from the joint posterior over unknowns,

    P(f, θ | data) = (1/Z) L(f) N(f; 0, Σ_θ) p_h(θ).    (3)

We would like to avoid implementing new code or tuning algorithms for different covariances Σ_θ and conditional likelihood functions L(f).

A Markov chain transition operator T(z′ ← z) defines a conditional distribution on a new position z′ given an initial position z. The operator is said to leave a target distribution π invariant if π(z′) = ∫ T(z′ ← z) π(z) dz. A standard way to sample from the joint posterior (3) is to alternately simulate transition operators that leave its conditionals, P(f | data, θ) and P(θ | f), invariant. Under fairly mild conditions the Markov chain will equilibrate towards the target distribution [e.g. 5].

Recent work has focused on transition operators for updating the latent variables f given data and a fixed covariance Σ_θ [6, 1]. Updates to the hyperparameters for fixed latent variables f need to leave the conditional posterior,

    P(θ | f) ∝ N(f; 0, Σ_θ) p_h(θ),    (4)

invariant. The simplest algorithm for this is the Metropolis–Hastings operator, see Algorithm 1. Other possibilities include slice sampling [2] and Hamiltonian Monte Carlo [7, 8].

Alternately fixing the unknowns f and θ is appealing from an implementation standpoint. However, the resulting Markov chain can be very slow in exploring the joint posterior distribution. Figure 1a shows latent vector samples using squared-exponential covariances with different lengthscales. These samples are highly informative about the lengthscale hyperparameter that was used, especially for short lengthscales. The sharpness of P(θ | f), Figure 1b, dramatically limits the amount that any Markov chain can update the hyperparameters θ for fixed latent values f.
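As a concrete illustration (not part of the paper's supplementary code), a minimal NumPy sketch of the squared-exponential covariance construction in (2) might look as follows; the jitter term is an assumption added here for numerical stability of later Cholesky factorizations.

```python
import numpy as np

def se_covariance(X, signal_var, lengthscales, jitter=1e-8):
    """(Sigma_theta)_ij = sigma_f^2 exp(-0.5 sum_d (x_di - x_dj)^2 / l_d^2), as in (2).

    X: (N, D) inputs; lengthscales: (D,) vector of l_d; signal_var: sigma_f^2.
    """
    Z = X / lengthscales                                  # rescale each input dimension by its lengthscale
    sq = np.sum(Z**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T  # squared distances in rescaled space
    K = signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0))
    return K + jitter * np.eye(X.shape[0])                # jitter keeps Cholesky factorizations stable
```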
Algorithm 1  M–H transition for fixed f
Input: Current f and hyperparameters θ; proposal dist. q; covariance function Σ().
Output: Next hyperparameters
  Propose: θ′ ∼ q(θ′; θ)
  Draw u ∼ Uniform(0, 1)
  if u < [N(f; 0, Σ_{θ′}) p_h(θ′) q(θ; θ′)] / [N(f; 0, Σ_θ) p_h(θ) q(θ′; θ)] then
    return θ′        ▷ Accept new state
  else
    return θ         ▷ Keep current state
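A minimal Python sketch of Algorithm 1, assuming hypothetical helpers cov(theta) returning Σ_θ and log_ph(theta) for the log hyperparameter prior, and a symmetric Gaussian random-walk proposal so that the q terms cancel (these names are assumptions of this sketch, not the paper's code):

```python
import numpy as np

def log_gauss_zero_mean(f, Sigma):
    """log N(f; 0, Sigma) via a Cholesky factorization."""
    L = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(L, f)
    return -0.5 * alpha @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(f) * np.log(2 * np.pi)

def mh_fixed_f(theta, f, cov, log_ph, prop_std, rng):
    """One M-H update of theta with the latent variables f held fixed (Algorithm 1)."""
    theta_prop = theta + prop_std * rng.standard_normal(np.shape(theta))   # symmetric proposal
    log_accept = (log_gauss_zero_mean(f, cov(theta_prop)) + log_ph(theta_prop)
                  - log_gauss_zero_mean(f, cov(theta)) - log_ph(theta))
    if np.log(rng.uniform()) < log_accept:
        return theta_prop       # accept new state
    return theta                # keep current state
```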
Algorithm 2  M–H transition for fixed ν
Input: Current state θ, f; proposal dist. q; covariance function Σ(); likelihood L().
Output: Next θ, f
  Solve for the N(0, I) variate: ν = L_{Σ_θ}^{-1} f
  Propose θ′ ∼ q(θ′; θ)
  Compute implied values: f′ = L_{Σ_{θ′}} ν
  Draw u ∼ Uniform(0, 1)
  if u < [L(f′) p_h(θ′) q(θ; θ′)] / [L(f) p_h(θ) q(θ′; θ)] then
    return θ′, f′    ▷ Accept new state
  else
    return θ, f      ▷ Keep current state
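A corresponding sketch of Algorithm 2 under the same hypothetical helpers, plus log_lik for log L(f); again a symmetric random-walk proposal is assumed:

```python
import numpy as np

def mh_fixed_nu(theta, f, cov, log_lik, log_ph, prop_std, rng):
    """One M-H update of theta with the whitened variables nu held fixed (Algorithm 2)."""
    L = np.linalg.cholesky(cov(theta))
    nu = np.linalg.solve(L, f)                               # whiten: nu = L_Sigma^{-1} f
    theta_prop = theta + prop_std * rng.standard_normal(np.shape(theta))
    f_prop = np.linalg.cholesky(cov(theta_prop)) @ nu        # implied latents under the proposed theta
    log_accept = (log_lik(f_prop) + log_ph(theta_prop)
                  - log_lik(f) - log_ph(theta))
    if np.log(rng.uniform()) < log_accept:
        return theta_prop, f_prop                            # accept: both theta and f change
    return theta, f                                          # keep current state
```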
Often the conditional likelihood is quite weak; this is why strong prior smoothing assumptions are often introduced in latent Gaussian models. In the extreme limit in which there is no data, i.e. L is constant, the target distribution is the prior model, P(f, θ) = N(f; 0, Σ_θ) p_h(θ). Sampling from the prior should be easy, but alternately fixing f and θ does not work well because they are strongly coupled. One strategy is to reparameterize the model so that the unknown variables are independent under the prior.

Independent random variables can be identified from a commonly-used generative procedure for the multivariate Gaussian distribution. A vector of independent normals, ν, is drawn independently of the hyperparameters and then deterministically transformed:

    ν ∼ N(0, I),   f = L_{Σ_θ} ν,   where L_{Σ_θ} L_{Σ_θ}^T = Σ_θ.    (5)
Notation: Throughout this paper L_C will be any user-chosen square root of a covariance matrix C. While any matrix square root can be used, the lower-triangular Cholesky decomposition is often the most convenient. We would reserve C^{1/2} for the principal square root, because other square roots do not behave like powers: for example, chol(C)^{-1} ≠ chol(C^{-1}).

We can choose to update the hyperparameters θ for fixed ν instead of fixed f. As the original latent variables f are deterministically linked to the hyperparameters θ in (5), these updates will actually change both θ and f. The samples in Figure 1a resulted from using the same whitened variable ν with different hyperparameters. They follow the same general trend, but vary over the lengthscales used to construct them.

The posterior over hyperparameters for fixed ν is apparent by applying Bayes' rule to the generative procedure in (5), or one can laboriously obtain it by changing variables in (3):

    P(θ | ν, data) ∝ P(θ, ν, data) = P(θ, f = L_{Σ_θ} ν, data) |L_{Σ_θ}| ∝ ··· ∝ L(f(θ, ν)) p_h(θ).    (6)

Algorithm 2 is the Metropolis–Hastings operator for this distribution. The acceptance rule now depends on the latent variables through the conditional likelihood L(f) instead of the prior N(f; 0, Σ_θ), and these variables are automatically updated to respect the prior. In the no-data limit, new hyperparameters proposed from the prior are always accepted.

Neither of the previous two algorithms is ideal for statistical applications, as illustrated in Figure 2. Algorithm 2 is ideal in the "weak data" limit where the latent variables f are distributed according to the prior. In the example, the likelihoods are too restrictive for Algorithm 2's proposal to be acceptable. In the "strong data" limit, where the latent variables f are fixed by the likelihood L, Algorithm 1 would be ideal. However, the likelihood terms in the example are not so strong that the prior can be ignored.

For regression problems with Gaussian noise the latent variables can be marginalised out analytically, allowing hyperparameters to be accepted or rejected according to their marginal posterior P(θ | data). If latent variables are required they can be sampled directly from the conditional posterior P(f | θ, data). To build a method that applies to non-Gaussian likelihoods, we create an auxiliary variable model that introduces surrogate Gaussian observations that will guide joint proposals of the hyperparameters and latent variables.
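For the Gaussian-noise special case just mentioned, the marginal hyperparameter posterior is available in closed form. A minimal sketch, assuming observations y and a known noise variance noise_var (hypothetical inputs of this sketch):

```python
import numpy as np

def log_marginal_posterior(theta, y, cov, log_ph, noise_var):
    """log P(theta | data) up to a constant, with the latent f integrated out analytically."""
    K = cov(theta) + noise_var * np.eye(len(y))    # Sigma_theta + noise covariance
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y)
    log_like = -0.5 * alpha @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)
    return log_like + log_ph(theta)
```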
Figure 2: A regression problem with Gaussian observations, illustrated by 2σ gray bars. The current state of the sampler has a short lengthscale hyperparameter; a longer lengthscale (ℓ = 1.5) is being proposed. The current latent variables do not lie on a straight enough line for the long lengthscale to be plausible. Whitening the prior (Section 2.1) updates the latent variables to a straighter line, but ignores the observations. A proposal using surrogate data (Section 3, with S_θ set to the observation noise) sets the latent variables to a draw that is plausible for the proposed lengthscale while being close to the current state.

We augment the latent Gaussian model with auxiliary variables, g, a noisy version of the true latent variables:

    P(g | f, θ) = N(g; f, S_θ).    (7)

For now S_θ is an arbitrary free parameter that could be set by hand to either a fixed value or a value that depends on the current hyperparameters θ. We will discuss how to automatically set the auxiliary noise covariance S_θ in Section 3.2.

The original model, f ∼ N(0, Σ_θ), and (7) define a joint auxiliary distribution P(f, g | θ) given the hyperparameters. It is possible to sample from this distribution in the opposite order, by first drawing the auxiliary values from their marginal distribution,

    P(g | θ) = N(g; 0, Σ_θ + S_θ),    (8)

and then sampling the model's latent values conditioned on the auxiliary values from P(f | g, θ) = N(f; m_{θ,g}, R_θ), where some standard manipulations give:

    R_θ = (Σ_θ^{-1} + S_θ^{-1})^{-1} = Σ_θ - Σ_θ(Σ_θ + S_θ)^{-1}Σ_θ = S_θ - S_θ(S_θ + Σ_θ)^{-1}S_θ,
    m_{θ,g} = Σ_θ(Σ_θ + S_θ)^{-1} g = R_θ S_θ^{-1} g.    (9)

That is, under the auxiliary model the latent variables of interest are drawn from their posterior given the surrogate data g. Again we can describe the sampling process via a draw from a spherical Gaussian:

    η ∼ N(0, I),   f = L_{R_θ} η + m_{θ,g},   where L_{R_θ} L_{R_θ}^T = R_θ.    (10)

We then condition on the "whitened" variables η and the surrogate data g while updating the hyperparameters θ. The implied latent variables f(θ, η, g) will remain a plausible draw from the surrogate posterior for the current hyperparameters. This is illustrated in Figure 2.

We can leave the joint distribution (3) invariant by updating the following conditional distribution derived from the above generative model:

    P(θ | η, g, data) ∝ P(θ, η, g, data) ∝ L(f(θ, η, g)) N(g; 0, Σ_θ + S_θ) p_h(θ).    (11)

The Metropolis–Hastings Algorithm 3 contains a ratio of these terms in the acceptance rule.

The Metropolis–Hastings algorithms discussed so far have a proposal distribution q(θ′; θ) that must be set and tuned. The efficiency of the algorithms depends crucially on careful choice of the scale σ of the proposal distribution. Slice sampling [2] is a family of adaptive search procedures that are much more robust to the choice of scale parameter.
Algorithm 3  Surrogate data M–H
Input: θ, f; prop. dist. q; model of Sec. 3.
Output: Next θ, f
  Draw surrogate data: g ∼ N(f, S_θ)
  Compute implied latent variates: η = L_{R_θ}^{-1}(f − m_{θ,g})
  Propose θ′ ∼ q(θ′; θ)
  Compute function f′ = L_{R_{θ′}} η + m_{θ′,g}
  Draw u ∼ Uniform(0, 1)
  if u < [L(f′) N(g; 0, Σ_{θ′} + S_{θ′}) p_h(θ′) q(θ; θ′)] / [L(f) N(g; 0, Σ_θ + S_θ) p_h(θ) q(θ′; θ)] then
    return θ′, f′    ▷ Accept new state
  else
    return θ, f      ▷ Keep current state
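A Python sketch of Algorithm 3, with the surrogate-conditional quantities of (8)–(9) computed by a small helper. The names cov (returning Σ_θ), aux_cov (returning the diagonal of S_θ; Section 3.2 discusses how it might be set), log_lik and log_ph are hypothetical helpers, not the paper's code:

```python
import numpy as np

def post_quantities(Sigma, S_diag, g):
    """m_{theta,g}, a Cholesky factor of R_theta, and log N(g; 0, Sigma + S), following (8)-(9)."""
    A = Sigma + np.diag(S_diag)
    L_A = np.linalg.cholesky(A)
    a = np.linalg.solve(L_A, g)
    log_marg = -0.5 * a @ a - np.sum(np.log(np.diag(L_A))) - 0.5 * len(g) * np.log(2 * np.pi)
    m = Sigma @ np.linalg.solve(A, g)                     # m = Sigma (Sigma + S)^{-1} g
    R = Sigma - Sigma @ np.linalg.solve(A, Sigma)         # R = Sigma - Sigma (Sigma + S)^{-1} Sigma
    R = 0.5 * (R + R.T) + 1e-10 * np.eye(len(g))          # symmetrize and add jitter (numerical assumption)
    return m, np.linalg.cholesky(R), log_marg

def surrogate_mh(theta, f, cov, aux_cov, log_lik, log_ph, prop_std, rng):
    """One surrogate data M-H update of (theta, f), as in Algorithm 3, with a symmetric proposal."""
    S = aux_cov(theta)
    g = f + np.sqrt(S) * rng.standard_normal(f.shape)     # draw surrogate data g ~ N(f, S_theta)
    m, L_R, log_marg = post_quantities(cov(theta), S, g)
    eta = np.linalg.solve(L_R, f - m)                     # implied whitened variates

    theta_prop = theta + prop_std * rng.standard_normal(np.shape(theta))
    m_p, L_Rp, log_marg_p = post_quantities(cov(theta_prop), aux_cov(theta_prop), g)
    f_prop = L_Rp @ eta + m_p                             # implied latents under the proposed theta

    log_accept = (log_lik(f_prop) + log_marg_p + log_ph(theta_prop)
                  - log_lik(f) - log_marg - log_ph(theta))
    if np.log(rng.uniform()) < log_accept:
        return theta_prop, f_prop                         # accept new state
    return theta, f                                       # keep current state
```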
Algorithm 4  Surrogate data slice sampling
Input: θ, f; scale σ; model of Sec. 3.
Output: Next f, θ
  Draw surrogate data: g ∼ N(f, S_θ)
  Compute implied latent variates: η = L_{R_θ}^{-1}(f − m_{θ,g})
  Randomly center a bracket: v ∼ Uniform(0, σ), θ_min = θ − v, θ_max = θ_min + σ
  Draw u ∼ Uniform(0, 1)
  Determine threshold: y = u · L(f) N(g; 0, Σ_θ + S_θ) p_h(θ)
  repeat
    Draw proposal: θ′ ∼ Uniform(θ_min, θ_max)
    Compute function f′ = L_{R_{θ′}} η + m_{θ′,g}
    if L(f′) N(g; 0, Σ_{θ′} + S_{θ′}) p_h(θ′) > y then
      return f′, θ′
    else if θ′ < θ then
      Shrink bracket minimum: θ_min = θ′
    else
      Shrink bracket maximum: θ_max = θ′
  until a proposal is accepted

Algorithm 4 applies one possible slice sampling algorithm to a scalar hyperparameter θ in the surrogate data model of this section. It has a free parameter σ, the scale of the initial proposal distribution. However, careful tuning of this parameter is not required. If the initial scale is set to a large value, such as the width of the prior, then the width of the proposals will shrink to an acceptable range exponentially quickly. Stepping-out procedures [2] could be used to adapt initial scales that are too small. We assume that axis-aligned hyperparameter moves will be effective, although reparameterizations could improve performance [e.g. 9].
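A sketch of Algorithm 4 for a single scalar hyperparameter, reusing post_quantities and the hypothetical helpers (cov, aux_cov, log_lik, log_ph) from the Algorithm 3 sketch above. Only the shrinking stage is shown; a stepping-out stage [2] could be added in the same way:

```python
import numpy as np

def surrogate_slice_sample(theta, f, cov, aux_cov, log_lik, log_ph, sigma, rng):
    """One surrogate data slice sampling update of a scalar theta (Algorithm 4)."""
    S = aux_cov(theta)
    g = f + np.sqrt(S) * rng.standard_normal(f.shape)        # surrogate data g ~ N(f, S_theta)
    m, L_R, _ = post_quantities(cov(theta), S, g)             # helper from the Algorithm 3 sketch
    eta = np.linalg.solve(L_R, f - m)                          # implied whitened variates

    def log_target(th):
        m_th, L_th, log_marg = post_quantities(cov(th), aux_cov(th), g)
        f_th = L_th @ eta + m_th
        return log_lik(f_th) + log_marg + log_ph(th), f_th

    log_y = log_target(theta)[0] + np.log(rng.uniform())      # log of the slice threshold
    lo = theta - sigma * rng.uniform()                         # randomly centre a bracket of width sigma
    hi = lo + sigma
    while True:
        theta_prop = rng.uniform(lo, hi)
        log_p, f_prop = log_target(theta_prop)
        if log_p > log_y:
            return theta_prop, f_prop                          # proposal is on the slice: accept
        if theta_prop < theta:
            lo = theta_prop                                    # shrink the bracket towards theta
        else:
            hi = theta_prop
```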
The surrogate data g and noise covariance S_θ define a pseudo-posterior distribution that softly specifies a plausible region within which the latent variables f are updated. The noise covariance determines the size of this region. The first two baseline algorithms of Section 2 result from limiting cases of S_θ = α I: 1) if α = 0 the surrogate data and the current latent variables are equal and the acceptance ratio reduces to that of Algorithm 1; 2) as α → ∞ the observations are uninformative about the current state and the pseudo-posterior tends to the prior. In the limit, the acceptance ratio reduces to that of Algorithm 2. One could choose α based on preliminary runs, but such tuning would be burdensome.

For likelihood terms that factorize, L(f) = Π_i L_i(f_i), we can measure how much the likelihood restricts each variable individually:

    P(f_i | L_i, θ) ∝ L_i(f_i) N(f_i; 0, (Σ_θ)_{ii}).    (12)

A Gaussian can be fitted by moment matching or a Laplace approximation (matching second derivatives at the mode). Such fits, or close approximations, are often possible analytically and can always be performed numerically as the distribution is only one-dimensional. Given a Gaussian fit to the site posterior (12) with variance v_i, we can set the auxiliary noise to a level that would result in the same posterior variance at that site alone: (S_θ)_{ii} = (v_i^{-1} − (Σ_θ)_{ii}^{-1})^{-1}. (Any negative (S_θ)_{ii} must be thresholded.) The moment matching procedure is a grossly simplified first step of "assumed density filtering" or "expectation propagation" [10], which are too expensive for our use in the inner loop of a Markov chain.

We have discussed samplers that jointly update strongly-coupled latent variables and hyperparameters. The hyperparameters can move further in joint moves than their narrow conditional posteriors (e.g., Figure 1b) would allow. A generic way of jointly sampling real-valued variables is Hamiltonian/Hybrid Monte Carlo (HMC) [7, 8]. However, this method is cumbersome to implement and tune, and using HMC to jointly update latent variables and hyperparameters in hierarchical models does not itself seem to improve sampling [11].

Christensen et al. [9] have also proposed a robust representation for sampling in latent Gaussian models. They use an approximation to the target posterior distribution to construct a reparameterization where the unknown variables are close to independent. The approximation replaces the likelihood with a Gaussian form proportional to N(f; f̂, Λ(f̂)^{-1}), where

    f̂ = argmax_f L(f),   Λ_{ij}(f̂) = −∂² log L(f) / ∂f_i ∂f_j |_{f̂},    (13)

and Λ is often diagonal, or it was suggested one would only take the diagonal part. This Taylor approximation looks like a Laplace approximation, except that the likelihood function is not a probability density in f. This likelihood fit results in an approximate Gaussian posterior N(f; m_{θ,g=f̂}, R_θ) as found in (9), with noise S_θ = Λ(f̂)^{-1} and data g = f̂. Thinking of the current latent variables as a draw from this approximate posterior,

    ω ∼ N(0, I),   f = L_{R_θ} ω + m_{θ,f̂},

suggests using the reparameterization ω = L_{R_θ}^{-1}(f − m_{θ,f̂}). We can then fix the new variables and update the hyperparameters under

    P(θ | ω, data) ∝ L(f(ω, θ)) N(f(ω, θ); 0, Σ_θ) p_h(θ) |L_{R_θ}|.    (14)

When the likelihood is Gaussian, the reparameterized variables ω are independent of each other and the hyperparameters. The hope is that approximating non-Gaussian likelihoods will result in nearly-independent parameterizations on which Markov chains will mix rapidly.

Taylor expanding some common log-likelihoods around the maximum is not well defined, for example approximating probit or logistic likelihoods for binary classification, or Poisson observations with zero counts. These Taylor expansions could be seen as giving flat or undefined Gaussian approximations that do not reweight the prior. When all of the likelihood terms are flat the reparameterization approach reduces to that of Section 2.1. The alternative S_θ auxiliary covariances that we have proposed could be used instead.

The surrogate data samplers of Section 3 can also be viewed as using reparameterizations, by treating η = L_{R_θ}^{-1}(f − m_{θ,g}) as an arbitrary random reparameterization for making proposals. A proposal density q(η′, θ′; η, θ) in the reparameterized space must be multiplied by the Jacobian |L_{R_{θ′}}^{-1}| to give a proposal density in the original parameterization. The probability of proposing the reparameterization must also be included in the Metropolis–Hastings acceptance probability:

    min( 1, [P(θ′, f′ | data) · P(g | f′, S_{θ′}) · q(θ; θ′) · |L_{R_θ}^{-1}|] / [P(θ, f | data) · P(g | f, S_θ) · q(θ′; θ) · |L_{R_{θ′}}^{-1}|] ).    (15)

A few lines of linear algebra confirm that, as it must, the same acceptance ratio results as before. Alternatively, substituting (3) into (15) shows that the acceptance probability is very similar to that obtained by applying Metropolis–Hastings to (14) as proposed by Christensen et al. [9]. The differences are that the new latent variables f′ are computed using different pseudo-posterior means, and the surrogate data method has an extra term for the random, rather than fixed, choice of reparameterization.

The surrogate data sampler is easier to implement than the previous reparameterization work because the surrogate posterior is centred around the current latent variables. This means that 1) no point estimate, such as the maximum likelihood f̂, is required;
2) picking the noise covariance S_θ poorly may still produce a workable method, whereas a fixed reparameterization can work badly if the true posterior distribution is in the tails of the Gaussian approximation. Christensen et al. [9] pointed out that centering the approximate Gaussian likelihood in their reparameterization around the current state is tempting, but that computing the Jacobian of the transformation is then intractable. By construction, the surrogate data model centers the reparameterization near to the current state.

We empirically compare the performance of the various approaches to GP hyperparameter sampling on four data sets: one regression, one classification, and two Cox process inference problems. Further details are in the rest of this section, with full code as supplementary material. The results are summarized in Figure 3, followed by a discussion.

In each of the experimental configurations, we ran ten independent chains with different random seeds, burning in for 1000 iterations and sampling for 5000 iterations. We quantify the mixing of the chain by estimating the effective number of samples of the complete data likelihood trace using R-CODA [12], and compare that with three cost metrics: the number of hyperparameter settings considered (each requiring a small number of covariance decompositions with O(N³) time complexity), the number of likelihood evaluations, and the total elapsed time on a single core of an Intel Xeon 3GHz CPU.

The experiments are designed to test the mixing of hyperparameters θ while sampling from the joint posterior (3). All of the discussed approaches except Algorithm 1 update the latent variables f as a side-effect. However, further transition operators for the latent variables for fixed hyperparameters are required. In Algorithm 2 the "whitened" variables ν remain fixed; the latent variables and hyperparameters are constrained to satisfy f = L_{Σ_θ} ν. The surrogate data samplers are ergodic: the full joint posterior distribution will eventually be explored. However, each update changes the hyperparameters and requires expensive computations involving covariances. After computing the covariances for one set of hyperparameters, it makes sense to apply several cheap updates to the latent variables. For every method we applied ten updates of elliptical slice sampling [1] to the latent variables f between each hyperparameter update. One could also consider applying elliptical slice sampling to a reparameterized representation; for simplicity of comparison we do not. Independently of our work, Titsias [13] has used surrogate-data-like reparameterizations to update latent variables for fixed hyperparameters.
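For completeness, a sketch of one elliptical slice sampling update [1] of the latent variables for fixed hyperparameters, as used between hyperparameter moves; L_Sigma here is any Cholesky factor of Σ_θ (zero-mean prior assumed, as in our models):

```python
import numpy as np

def elliptical_slice(f, L_Sigma, log_lik, rng):
    """One elliptical slice sampling update of f under a zero-mean Gaussian prior."""
    nu = L_Sigma @ rng.standard_normal(f.shape)        # auxiliary draw nu ~ N(0, Sigma_theta)
    log_y = log_lik(f) + np.log(rng.uniform())         # slice threshold
    phi = rng.uniform(0.0, 2.0 * np.pi)                # initial angle and bracket
    phi_min, phi_max = phi - 2.0 * np.pi, phi
    while True:
        f_prop = f * np.cos(phi) + nu * np.sin(phi)    # point on the ellipse through f and nu
        if log_lik(f_prop) > log_y:
            return f_prop                              # accept: on the slice
        if phi < 0.0:                                  # otherwise shrink the bracket towards phi = 0
            phi_min = phi
        else:
            phi_max = phi
        phi = rng.uniform(phi_min, phi_max)
```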
Methods   We implemented six methods for updating Gaussian covariance hyperparameters. Each method used the same slice sampler, as in Algorithm 4, applied to the following model representations. fixed: fixing the latent function f [14]. prior-white: whitening with the prior. surr-site: using surrogate data with the noise level set to match the site posterior (12). We used Laplace approximations for the Poisson likelihood; for classification problems we used moment matching, because Laplace approximations do not work well [15]. surr-taylor: using surrogate data with noise variance set via Taylor expansion of the log-likelihood (13). Infinite variances were truncated to a large value. post-taylor and post-site: as for the surr- methods, but with a fixed reparameterization based on a posterior approximation (14).
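As an illustration of how the surr-site noise level might be computed, a sketch that fits the one-dimensional site posterior (12) by moment matching on a grid (simple quadrature is an assumption of this sketch; the analytic or Laplace fits described above could be substituted) and then inverts to the auxiliary noise (S_θ)_ii = (v_i^{-1} − (Σ_θ)_ii^{-1})^{-1}, thresholding values that would be non-positive:

```python
import numpy as np

def site_aux_noise(log_lik_i, prior_var, grid_width=8.0, n_grid=400, ceiling=1e6):
    """Return (S_theta)_ii for one site with prior N(0, prior_var); log_lik_i must accept a vector of f values."""
    fs = np.linspace(-grid_width, grid_width, n_grid) * np.sqrt(prior_var)
    log_post = log_lik_i(fs) - 0.5 * fs**2 / prior_var      # unnormalized site posterior (12)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    mean = np.sum(w * fs)
    var = np.sum(w * (fs - mean) ** 2)                      # moment-matched site-posterior variance v_i
    return 1.0 / max(1.0 / var - 1.0 / prior_var, 1.0 / ceiling)   # (v^-1 - Sigma_ii^-1)^-1, thresholded
```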
Binary Classification (Ionosphere)   We evaluated four different methods for performing binary GP classification: fixed, prior-white, surr-site and post-site. We applied these methods to the Ionosphere dataset [16], using 200 training data and 34 dimensions. We used a logistic likelihood with zero-mean prior, inferring lengthscales as well as signal variance. The -taylor methods reduce to other methods or do not apply because the maximum of the log-likelihood is at plus or minus infinity.
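The logistic log-likelihood used here, written as a function of the latent vector with labels y ∈ {−1, +1}; a numerically stable sketch:

```python
import numpy as np

def logistic_log_lik(f, y):
    """log L(f) = sum_i log sigmoid(y_i f_i); uses log sigmoid(z) = -log(1 + exp(-z))."""
    return -np.sum(np.logaddexp(0.0, -y * f))
```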
Gaussian Regression (Synthetic)   When the observations have Gaussian noise the post-taylor reparameterization of Christensen et al. [9] makes the hyperparameters and latent variables exactly independent. The random centering of the surrogate data model will be less effective. We used a Gaussian regression problem to assess how much worse the surrogate data method is compared to an ideal reparameterization. The synthetic data set had 200 input points in 10-D drawn uniformly within a unit hypercube. The GP had zero mean, unit signal variance and its ten lengthscales in (2) drawn from a uniform distribution. We compared the fixed, prior-white, surr-site/surr-taylor, and post-site/post-taylor methods. For Gaussian likelihoods the -site and -taylor methods coincide: the auxiliary noise S_θ matches the observation noise.
Cox process inference   We tested all six methods on an inhomogeneous Poisson process with a Gaussian process prior for the log-rate. We sampled the hyperparameters in (2) and a mean offset to the log-rate. The model was applied to two point process datasets: 1) a record of mining disasters [17] with 191 events in 112 bins of 365 days; 2) 195 redwood tree locations in a region scaled to the unit square [18], split into 25 × 25 = 625 bins. The results for the mining problem were initially highly variable; as the mining experiments were also the quickest, we re-ran each chain for 20,000 iterations.
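A sketch of the binned Cox-process log-likelihood used above, with counts modelled as Poisson with rate exp(f_i + offset) per bin; any bin-width constant is absorbed into the offset here, an assumption of this sketch:

```python
import numpy as np
from scipy.special import gammaln

def poisson_log_lik(f, counts, mean_offset=0.0):
    """log L(f) = sum_i [ y_i (f_i + m) - exp(f_i + m) - log(y_i!) ] for bin counts y."""
    log_rate = f + mean_offset
    return np.sum(counts * log_rate - np.exp(log_rate) - gammaln(counts + 1.0))
```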
Figure 3:
The results of experimental comparisons of six MCMC methods for GP hyperparameter inference on four data sets. Each figure shows four groups of bars (one for each experiment) and the vertical axis shows the effective number of samples of the complete data likelihood per unit cost. The costs are per likelihood evaluation (left), per covariance construction (center), and per second (right). Means and standard errors for 10 runs are shown. Each group of bars has been rescaled for readability: the number beneath each group gives the effective samples for the surr-site method, which always has bars of height 1. Bars are missing where methods are inapplicable (see text).
On the Ionosphere classification problem both of the -site methods worked much better than the two baselines. We slightly prefer surr-site as it involves less problem-specific derivation than post-site.

On the synthetic test the post- and surr- methods perform very similarly. We had expected the existing post- method to have an advantage of perhaps up to 2–3×, but that was not realized on this particular dataset. The post- methods had a slight time advantage, but this is down to implementation details and is not notable.

On the mining problem the Poisson likelihoods are often close to Gaussian, so the existing post-taylor approximation works well, as do all of our new proposed methods. The Gaussian approximations to the Poisson likelihood fit most poorly at sites with zero counts. The redwood dataset discretizes two-dimensional space, leading to a large number of bins. The majority of these bins have zero counts, many more than in the mining dataset. Taylor expanding the likelihood gives no likelihood contribution for bins with zero counts, so it is unsurprising that post-taylor performs similarly to prior-white. While surr-taylor works better, the best results here come from using approximations to the site posterior (12). For unreasonably fine discretizations the results can be different again: the site- reparameterizations do not always work well.

Our empirical investigation used slice sampling because it is easy to implement and use. However, all of the representations we discuss could be combined with any other MCMC method, such as that of [19], recently used for Cox processes. The new surrogate data and post-site representations offer state-of-the-art performance and are the first such advanced methods to be applicable to Gaussian process classification.

An important message from our results is that fixing the latent variables and updating hyperparameters according to the conditional posterior, as commonly used by GP practitioners, can work exceedingly poorly. Even the simple reparameterization of "whitening the prior" discussed in Section 2.1 works much better on problems where smoothness is important in the posterior. Even if site approximations are difficult and the more advanced methods presented here are inapplicable, the simple whitening reparameterization should be given serious consideration when performing MCMC inference of hyperparameters.

Acknowledgements
We thank an anonymous reviewer for useful comments. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views. RPA is a junior fellow of the Canadian Institute for Advanced Research.

References

[1] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical slice sampling. Journal of Machine Learning Research: W&CP, 9:541–548, 2010. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS).
[2] Radford M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.
[3] Deepak K. Agarwal and Alan E. Gelfand. Slice sampling for simulation based fitting of spatial data models. Statistics and Computing, 15(1):61–69, 2005.
[4] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[5] Luke Tierney. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701–1728, 1994.
[6] Michalis Titsias, Neil D. Lawrence, and Magnus Rattray. Efficient sampling for Gaussian process inference using control variables. In Advances in Neural Information Processing Systems 21, pages 1681–1688. MIT Press, 2009.
[7] Simon Duane, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, September 1987.
[8] Radford M. Neal. MCMC using Hamiltonian dynamics. To appear in the Handbook of Markov Chain Monte Carlo, Chapman & Hall / CRC Press, 2011.
[9] Ole F. Christensen, Gareth O. Roberts, and Martin Sköld. Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models. Journal of Computational and Graphical Statistics, 15(1):1–17, 2006.
[10] Thomas Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 362–369, 2001. Corrected version available from http://research.microsoft.com/~minka/papers/ep/.
[11] Kiam Choo. Learning hyperparameters for neural network models using Hamiltonian dynamics. Master's thesis, Department of Computer Science, University of Toronto, 2000.
[12] Mary Kathryn Cowles, Nicky Best, Karen Vines, and Martyn Plummer. R-CODA 0.10-5, 2006.
[13] Michalis Titsias. Auxiliary sampling using imaginary data, 2010. Unpublished.
[14] Radford M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo et al., editors, Bayesian Statistics 6, pages 475–501. OU Press, 1999.
[15] Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[16] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10:262–266, 1989.
[17] R. G. Jarrett. A note on the intervals between coal-mining disasters. Biometrika, 66(1):191–193, 1979.
[18] Brian D. Ripley. Modelling spatial patterns. Journal of the Royal Statistical Society, Series B, 39:172–212, 1977.
[19] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods.