Pseudo-Marginal Bayesian Inference for Gaussian Processes
Maurizio Filippone and Mark Girolami
Abstract—The main challenges that arise when adopting Gaussian Process priors in probabilistic modeling are how to carry out exact Bayesian inference and how to account for uncertainty on model parameters when making model-based predictions on out-of-sample data. Using probit regression as an illustrative working example, this paper presents a general and effective methodology based on the pseudo-marginal approach to Markov chain Monte Carlo that efficiently addresses both of these issues. The results presented in this paper show improvements over existing sampling methods to simulate from the posterior distribution over the parameters defining the covariance function of the Gaussian Process prior. This is particularly important as it offers a powerful tool to carry out full Bayesian inference of Gaussian Process based hierarchic statistical models in general. The results also demonstrate that Monte Carlo based integration of all model parameters is actually feasible in this class of models, providing a superior quantification of uncertainty in predictions. Extensive comparisons with respect to state-of-the-art probabilistic classifiers confirm this assertion.
Index Terms—Hierarchic Bayesian Models, Gaussian Processes, Markov chain Monte Carlo, Pseudo-Marginal Monte Carlo, Kernel Methods, Approximate Bayesian Inference.

1 INTRODUCTION
Non-parametric or kernel based models represent a successful class of statistical modelling and prediction methods. To focus ideas throughout the paper we employ the working example of predictive classification problems; the methodology presented, however, is applicable to all hierarchic Bayesian models in general and those employing Gaussian Process (GP) priors in particular. Important examples of kernel-based classifiers are the Support Vector Machine (SVM) [1], [2], the Relevance Vector Machine (RVM) [3], [4], and the Gaussian Process classifier [5]. Although these classifiers are based on different modeling assumptions and paradigms of statistical inference, they are characterized by a kernel function or covariance operator that allows one to build nonlinear classifiers able to tackle challenging problems [6], [7], [8], [9], [10], [11].

• M. Filippone is with the School of Computing Science, University of Glasgow, UK. E-mail: [email protected]
• M. Girolami is with the Department of Statistics, University of Warwick, UK. E-mail: [email protected]
• First version dated 24 July 2012 was submitted to arXiv on 2 Oct 2013. Revised on 10 Apr 2013, 5 Mar 2014, and 4 Apr 2014.

In order to allow these classifiers to be flexible, it is necessary to parameterize the kernel (or covariance) function by a set of so called hyper-parameters. After observing a set of training data, the aim is to estimate or infer such hyper-parameters. In the case of SVMs, point estimates of hyper-parameters are obtained by optimizing a cross-validation error. This makes optimization viable only in the case of very few hyper-parameters, as grid search is usually employed, and is limited by the available amount of data. In GP classification, instead, the probabilistic nature of the model provides a means
(usually after approximately integrating out latent variables) to obtain an approximate marginal likelihood that offers the possibility to optimize the hyper-parameters; this is known as type II Maximum Likelihood (ML) [12], [5]. Deterministic approximations for integrating out the latent variables include the Laplace Approximation (LA) [13], Expectation Propagation (EP) [14], Variational Bayes [15], Integrated Nested Laplace Approximations [16], and mean field approximations [17]; see [18], [19] for extensive assessments of the relative merits of different approximation schemes for GP classification.

From a fully Bayesian perspective, in a GP classifier one would like to be able to (i) infer all model parameters and latent variables from data and (ii) integrate out latent variables and hyper-parameters with respect to their posterior distribution when making predictions, accounting for their uncertainty; this, in particular, would effectively make the classifier parameter free. To date, the literature lacks a systematic way to efficiently tackle both of these questions. The main limitations are due to the fact that it is not possible to obtain any of the quantities needed in the inference in closed form because of analytical intractability.
This requires the use of some form of approximation, and deterministic approximations have been proposed to integrate out latent variables only; in order to integrate out the hyper-parameters, most of the approaches propose quadrature methods [20], [16], which can only be employed in the case of a small number of hyper-parameters.

Recently, there have been a few attempts to carry out inference using stochastic approximations based on Markov chain Monte Carlo (MCMC) methods [21], [22], [23], the idea being to leverage asymptotic guarantees of convergence of Monte Carlo estimates to the true values. Unfortunately, employing MCMC methods for inferring latent variables and hyper-parameters is extremely challenging, and state-of-the-art methods for doing so are still inefficient and difficult to use in practice.

This paper aims at providing a straightforward-to-implement methodology that is effective in bridging this gap, by proposing an MCMC method that addresses most of the difficulties that one is faced with when applying stochastic based inference in GP modeling, such as discrete label classification. The main issue in applying MCMC methods to carry out inference in GP classification is in sampling the hyper-parameters from the full posterior. This is due to the structure of the model that makes latent variables and hyper-parameters strongly coupled a posteriori; as a result, chains are characterized by low efficiency and poor mixing. The key idea of the proposed methodology is to break the correlation in the sampling between latent variables and hyper-parameters by approximately integrating out the latent variables while retaining a correct MCMC procedure; namely, maintaining the exact posterior distribution over hyper-parameters as the invariant distribution of the chains and ergodicity properties. This can be achieved by means of the so called
Pseudo-Marginal (PM) approach to Monte Carlo sampling [24], [25]. This work shows that the use of the PM approach leads to remarkable efficiency in the sampling of hyper-parameters, thus making the fully Bayesian treatment viable and simple to implement and employ.

The importance of integrating out the hyper-parameters to achieve a sound quantification of uncertainty in predictions is well known and has been highlighted, for example, in [12], [26], [16], [27]; employing this in practice, however, is notoriously challenging. The main motivation for this work is to demonstrate that this marginalization can be done exactly, in a Monte Carlo sense, by building upon deterministic approximations already proposed in the GP and Machine Learning literature. This work reports a thorough empirical comparison in this direction, showing the ability of the fully Bayesian treatment to achieve a better quantification of uncertainty compared to the standard practice of optimization of the hyper-parameters in GP classification. Furthermore, the results report a comparison with a probabilistic version of the SVM classifier [28]. The results on GP classification support the argument that hyper-parameters should be integrated out to achieve a reliable quantification of uncertainty in applications, and this paper provides a practical means to achieve this.

The paper is organized as follows: section 2 reviews the Gaussian Process approach to classification, and section 3 presents the proposed MCMC approach to obtain samples from the posterior distribution over both latent variables and hyper-parameters. Section 4 reports an assessment of the sampling efficiency achieved by the PM approach compared to other MCMC approaches, and section 5 reports a study on the performance of the fully Bayesian GP classifier compared to other probabilistic classifiers. Finally, section 6 concludes the paper.

2 GAUSSIAN PROCESS CLASSIFICATION
In this section, we briefly review GP classification based on a probit likelihood (see [5] for an extensive presentation of GPs). Let X = {x_1, ..., x_n} be a set of n input vectors described by d covariates and associated with observed univariate responses y = {y_1, ..., y_n} with y_i ∈ {−1, +1}. Let f = {f_1, ..., f_n} be a set of latent variables. From a generative perspective, GP classifiers assume that the class labels have a Bernoulli distribution with success probability given by a transformation of the latent variables:

p(y_i | f_i) = Φ(y_i f_i).   (1)

Here Φ denotes the cumulative distribution function of the standard Gaussian density; based on this modeling assumption, the likelihood function is:

p(y | f) = ∏_{i=1}^{n} p(y_i | f_i).   (2)

The latent variables f are given a zero mean GP prior with covariance K:

f ∼ N(f | 0, K).   (3)

Let k(x_i, x_j | θ) be the function modeling the covariance between latent variables evaluated at the input vectors, parameterized by a vector of hyper-parameters θ. In this paper we will adopt a covariance function defined as follows:

k(x_i, x_j | θ) = σ² exp[ −(1/2) (x_i − x_j)ᵀ A (x_i − x_j) ].   (4)

The parameter σ² is the variance of the marginal prior distribution for each of the latent variables f_i. The matrix A, instead, defines the type of covariance between the values of the function at different input vectors. By defining a matrix A with a global parameter as follows,

A⁻¹ = τ² I,   (5)

an isotropic covariance function is obtained. Alternatively, A can be defined to assign a different parameter to each covariate,

A⁻¹ = diag(τ₁², ..., τ_d²).   (6)

The latter choice yields the so called Automatic Relevance Determination (ARD) prior [29]. In this formulation, the hyper-parameters τ_i can be interpreted as length-scale parameters. Let θ be a vector comprising σ and all the length-scale parameters, and K be the matrix whose entries are k_ij = k(x_i, x_j | θ).

The GP classification model is hierarchical, as y is conditioned on f, and f is conditioned on θ and the inputs X. In order to keep the notation uncluttered, in the remainder of this paper we will not report explicitly the conditioning on the inputs in any of the equations. We now briefly review the types of approximations that have been proposed in the literature to employ GP classifiers.

One of the difficulties encountered in GP classification is that, unlike GP regression, the prior on the latent variables and the likelihood do not form a conjugate pair; therefore, it is not possible to analytically integrate out the latent variables. As a consequence, it is not possible to directly sample from or optimize the distribution of hyper-parameters given the labels, nor directly evaluate predictive probabilities. This has motivated a large body of research that attempts to approximate the posterior distribution over the latent variables p(f | y, θ) with a Gaussian q(f | y, θ) = N(f | μ_q, Σ_q) in order to exploit conjugacy.
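As a concrete illustration of equations (4)-(6), the following NumPy sketch (ours, for illustration; function and variable names are not from the paper) evaluates the covariance matrix for either the isotropic or the ARD parameterization:

```python
import numpy as np

def ard_covariance(X1, X2, sigma2, tau):
    """Squared-exponential covariance of equation (4) with
    A^{-1} = diag(tau_1^2, ..., tau_d^2).

    X1: (n1, d) array; X2: (n2, d) array; sigma2: marginal variance sigma^2;
    tau: length-scales, either a scalar (isotropic, eq. 5)
         or a (d,) array (ARD, eq. 6).
    """
    Z1 = X1 / tau          # broadcasting rescales each covariate by its length-scale
    Z2 = X2 / tau
    # squared Mahalanobis distances via the usual expansion ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return sigma2 * np.exp(-0.5 * np.maximum(sq, 0.0))
```

Passing a scalar `tau` recovers the isotropic kernel of equation (5); a length-d vector gives the ARD kernel of equation (6).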
By doing so, it is possible to analytically integrate out latent variables to obtain an approximate marginal likelihood, and compute the predictive distribution for new data, as discussed in the following. The Gaussian approximation yields an approximate marginal likelihood p̂(y | θ) that can then be optimized with respect to the hyper-parameters, or used to obtain samples from the approximate posterior distribution over the hyper-parameters, say p̂(θ | y), using MCMC techniques. We now briefly discuss how this can be achieved.

To obtain an approximate predictive distribution, conditioned on a value of the hyper-parameters θ, we can compute:

p(y* | y, θ) = ∫ p(y* | f*) p(f* | f, θ) q(f | y, θ) df* df.   (7)

Here θ can be a ML estimate that maximizes the approximate likelihood or one sample from the approximate posterior p̂(θ | y). For simplicity of notation, let K be the covariance matrix evaluated at θ, k* the vector whose ith element is k(x_i, x* | θ), and k** = k(x*, x* | θ). Given the properties of multivariate normal variables, f* is distributed as N(f* | μ*, β*) with μ* = k*ᵀ K⁻¹ f and β* = k** − k*ᵀ K⁻¹ k*. Approximating p(f | y, θ) with a Gaussian q(f | y, θ) = N(f | μ_q, Σ_q) makes it possible to analytically perform the integration with respect to f in equation 7. In particular, the integration with respect to f yields N(f* | m*, s*) with

m* = k*ᵀ K⁻¹ μ_q,   and   s* = k** − k*ᵀ K⁻¹ k* + k*ᵀ K⁻¹ Σ_q K⁻¹ k*.

The univariate integration with respect to f* follows exactly in the case of a probit likelihood, as it is a convolution of a Gaussian and a cumulative Gaussian:

∫ p(y* | f*) N(f* | m*, s*) df* = Φ( y* m* / √(1 + s*) ).   (8)

We now briefly review two popular approximation methods for integrating out latent variables, namely the Laplace Approximation and Expectation Propagation.
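The predictive computation above can be sketched in a few lines of NumPy/SciPy (ours, for illustration; for clarity it solves with K directly rather than reusing a Cholesky factor):

```python
import numpy as np
from scipy.stats import norm

def predictive_probability(K, k_star, k_ss, mu_q, Sigma_q):
    """Predictive probability of class +1 for a single test point,
    following equations (7)-(8), given a Gaussian approximation
    q(f | y, theta) = N(mu_q, Sigma_q).

    K: (n, n) training covariance; k_star: (n,) cross-covariances;
    k_ss: prior variance at the test input.
    """
    Kinv_k = np.linalg.solve(K, k_star)
    m_star = Kinv_k @ mu_q                                   # predictive mean
    s_star = k_ss - k_star @ Kinv_k + Kinv_k @ Sigma_q @ Kinv_k  # predictive variance
    # probit convolution identity: int Phi(f) N(f | m, s) df = Phi(m / sqrt(1 + s))
    return norm.cdf(m_star / np.sqrt(1.0 + s_star))
```

The class −1 probability is one minus this value, by symmetry of Φ.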
The Laplace Approximation (LA) is based on the assumption that the distribution of interest can be approximated by a Gaussian centered at its mode and with the same curvature. By analyzing the Taylor expansion of the logarithm of the target and approximating densities, the latter requirement is satisfied by imposing an inverse covariance for the approximating Gaussian equal to the negative Hessian of the logarithm of the target density [30]. For a given value of the hyper-parameters θ, define

Ψ(f) = log[p(y | f)] + log[p(f | θ)] + const.   (9)

as the logarithm of the target density up to terms independent of f. Performing a Laplace approximation amounts to defining a Gaussian q(f | y, θ) = N(f | f̂, Σ̂), such that

f̂ = arg max_f Ψ(f)   and   Σ̂⁻¹ = −∇_f ∇_f Ψ(f̂).   (10)

As it is not possible to directly solve the maximization problem in equation 10, an iterative procedure based on the following Newton-Raphson formula is usually employed:

f_new = f − (∇_f ∇_f Ψ(f))⁻¹ ∇_f Ψ(f),   (11)

starting from f = 0 until convergence. The gradient and the Hessian of the log of the target density are:

∇_f Ψ(f) = ∇_f log[p(y | f)] − K⁻¹ f,   (12)

∇_f ∇_f Ψ(f) = ∇_f ∇_f log[p(y | f)] − K⁻¹.   (13)

Note that if log[p(y | f)] is concave, such as in probit classification, Ψ(f) has a unique maximum. Practically, the Newton-Raphson update in equation 11 is implemented by employing Woodbury identities to avoid inverting K directly (see section 3.4 of [5] for full details). In such an implementation, one n × n matrix factorization is needed at each iteration and no other O(n³) operations.

The Expectation Propagation (EP) algorithm is based on the assumption that each individual term of the likelihood can be approximated by an unnormalized Gaussian:

p(y_i | f_i) ≃ Z̃_i N(f_i | μ̃_i, σ̃_i²).
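A minimal sketch of the Newton iteration of equations (11)-(13) for the probit likelihood follows (ours, for illustration; a numerically robust implementation would use the Woodbury-based updates of [5] rather than inverting K):

```python
import numpy as np
from scipy.stats import norm

def laplace_approximation(K, y, max_iter=100, tol=1e-10):
    """Laplace approximation N(f_hat, Sigma_hat) to p(f | y, theta)
    for the probit likelihood p(y_i | f_i) = Phi(y_i f_i).
    y: (n,) array of labels in {-1, +1}."""
    n = len(y)
    f = np.zeros(n)                          # start from f = 0, as in the text
    K_inv = np.linalg.inv(K)                 # fine for a small illustration
    for _ in range(max_iter):
        z = y * f
        ratio = norm.pdf(z) / norm.cdf(z)
        grad = y * ratio - K_inv @ f         # eq. (12)
        W = ratio**2 + z * ratio             # -(d^2/df^2) log p(y|f), diagonal
        H = -np.diag(W) - K_inv              # eq. (13)
        f_new = f - np.linalg.solve(H, grad) # Newton step, eq. (11)
        done = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if done:
            break
    Sigma_hat = np.linalg.inv(np.diag(W) + K_inv)
    return f, Sigma_hat
```

Because the probit log-likelihood is concave, the iteration converges to the unique mode without step-size tuning in typical cases.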
(14)

Approximating each term in the likelihood by a Gaussian implies that the approximate likelihood, as a function of f, is multivariate Gaussian:

N(f | μ̃, Σ̃) ∏_{i=1}^{n} Z̃_i,   (15)

with (μ̃)_i = μ̃_i and Σ̃_ii = σ̃_i². Under this approximation, the posterior p(f | y, θ) is approximated by a Gaussian q(f | y, θ) = N(f | f̂, Σ̂) with:

Σ̂⁻¹ = K⁻¹ + Σ̃⁻¹   and   f̂ = Σ̂ Σ̃⁻¹ μ̃.   (16)

The EP algorithm is characterized by the way the parameters Z̃_i, μ̃_i, and σ̃_i² are optimized. The EP algorithm loops through the n factors approximating the likelihood, updating those three parameters for each factor in turn. First, the so called cavity distribution is computed:

q_{∖i}(f_i | θ) ∝ ∫ p(f | θ) ∏_{j≠i} Z̃_j N(f_j | μ̃_j, σ̃_j²) df_{∖i},   (17)

which is obtained by leaving out the ith factor from q(f | y, θ). Second, a revised Gaussian q′(f_i | θ), which closely approximates the product of the cavity distribution and the exact ith likelihood term, is sought. In particular, this is performed by minimizing the following Kullback-Leibler divergence:

KL( q_{∖i}(f_i | θ) p(y_i | f_i) ∥ q′(f_i | θ) ),   (18)

which in practice boils down to matching the moments of the two distributions. Third, once the mean and variance of q′(f_i | θ) are computed, it is possible to derive the updated parameters Z̃_i, μ̃_i, and σ̃_i² for the ith factor. The derivation of those equations is rather involved, and the reader is referred to [5] for full details; EP requires five operations in O(n³) at each iteration. Note that convergence of the EP algorithm is not guaranteed in general; however, for GP classification, no convergence issues have been reported in the literature. Furthermore, EP for GP classification has been reported to offer superior accuracy in approximations compared to other methods [18], [19].
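For the probit likelihood, the moment matching implied by equation (18) has a closed form. The following sketch (ours; standard formulas, see e.g. chapter 3 of [5]) computes the moments of the "tilted" distribution obtained by multiplying a Gaussian cavity by one exact likelihood term:

```python
import numpy as np
from scipy.stats import norm

def probit_tilted_moments(y_i, mu_cav, var_cav):
    """Normalizer, mean, and variance of q_cav(f_i) * Phi(y_i f_i),
    where q_cav = N(mu_cav, var_cav) is the cavity of eq. (17).
    These are the targets of the KL matching in eq. (18)."""
    denom = np.sqrt(1.0 + var_cav)
    z = y_i * mu_cav / denom
    Z = norm.cdf(z)                      # tilted normalizer
    ratio = norm.pdf(z) / Z
    mean = mu_cav + y_i * var_cav * ratio / denom
    var = var_cav - var_cav**2 * ratio * (z + ratio) / (1.0 + var_cav)
    return Z, mean, var
```

From these moments, the updated site parameters Z̃_i, μ̃_i, and σ̃_i² follow by dividing the matched Gaussian by the cavity.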
In a fully Bayesian treatment, the aim is to integrate out latent variables as well as hyper-parameters:

p(y* | y) = ∫ p(y* | f*) p(f* | f, θ) p(f, θ | y) df* df dθ.   (19)

Again, the integration with respect to f* can be done analytically, whereas the integration with respect to latent variables and hyper-parameters requires the posterior distribution p(f, θ | y). One way to tackle the intractability in characterizing p(f, θ | y) is to draw samples from p(f, θ | y) using MCMC methods, so that a Monte Carlo estimate of the predictive distribution can be used:

p(y* | y) ≃ (1/N) ∑_{i=1}^{N} ∫ p(y* | f*) p(f* | f⁽ⁱ⁾, θ⁽ⁱ⁾) df*,   (20)

where f⁽ⁱ⁾, θ⁽ⁱ⁾ denotes the ith sample from p(f, θ | y). This estimate will asymptotically converge to the exact expectation p(y* | y).

3 SAMPLING FROM p(f, θ | y)

Sampling from the posterior over f and θ by joint proposals is not feasible; it is extremely unlikely to propose a set of latent variables and hyper-parameters that are compatible with each other and with the observed data. In order to draw samples from p(f, θ | y), it is therefore necessary to resort to a Gibbs sampler, whereby f and θ are updated in turn. We now briefly review the state of the art in Gibbs sampling techniques for GP models, and propose a new Gibbs sampler based on the PM approach.

3.1 Sampling from p(f | y, θ)

Efficient sampling of the latent variables can be achieved by means of Elliptical Slice Sampling (ELL-SS) [31]. ELL-SS is based on an adaptation of the Slice Sampling algorithm [32] to propose new values of the latent variables. ELL-SS has the very appealing properties of requiring no tuning, so that minimum user intervention is needed, and of costing O(n²) per iteration once K is factorized.
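One ELL-SS transition can be sketched as follows (ours, paraphrasing the algorithm of [31]; the only inputs are the current state, the Cholesky factor of K, and a log-likelihood routine):

```python
import numpy as np

def elliptical_slice_step(f, L, log_lik, rng):
    """One ELL-SS transition targeting p(f | y, theta) ∝ p(y | f) N(f | 0, K),
    with L the Cholesky factor of K. No tuning parameters are needed, and
    each step costs O(n^2) once K is factorized."""
    nu = L @ rng.standard_normal(len(f))          # auxiliary draw from the GP prior
    log_y = log_lik(f) + np.log(rng.uniform())    # slice height under the likelihood
    angle = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = angle - 2.0 * np.pi, angle           # initial bracket around the ellipse
    while True:
        f_prop = f * np.cos(angle) + nu * np.sin(angle)
        if log_lik(f_prop) > log_y:
            return f_prop                         # accepted point on the ellipse
        # shrink the bracket towards the current state (angle = 0) and retry
        if angle < 0.0:
            lo = angle
        else:
            hi = angle
        angle = rng.uniform(lo, hi)
```

The loop always terminates, since the bracket shrinks towards angle 0, where the proposal coincides with the current state.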
Recently, the efficiency of ELL-SS has been extensively demonstrated on several models involving GP priors [21].

Another way to efficiently sample latent variables in GP models is by means of a variant of Hybrid Monte Carlo (HMC) [33], [34] where the inverse mass matrix is set to the GP covariance K, as described in detail in [21]. This variant of HMC can be interpreted as a simplified version of Riemann manifold Hamiltonian Monte Carlo (RMHMC) [35], which makes it possible to obtain samples from the posterior distribution over f in O(n²) once K is factorized. Owing to its simplicity, in the remainder of this paper we will use ELL-SS to sample from the posterior over latent variables p(f | y, θ).

3.2 Sampling θ employing reparameterization techniques

In GP classification, efficiently sampling from the posterior distribution over the latent variables and hyper-parameters is complex because of their strong coupling [21], [22], [26]. The result of this strong coupling is that fixing f induces a sharply peaked posterior over θ that makes the chain converge slowly and mix very poorly. This effect is illustrated in Figure 1. In particular, conditioning the sampling of θ on f corresponds to considering the standard parameterization of GP models, y | f and f | θ, which is also known as Sufficient Augmentation (SA) [36].

A better parameterization can be given by introducing a set of transformed (whitened) latent variables ν [37]. The variables ν are defined through f = Lν, L being the Cholesky factor of K. In this parameterization, which is also known as Ancillary Augmentation (AA) [36], ν are constructed to be a priori independent of the hyper-parameters (using L is convenient as it is needed also to evaluate the GP prior density). In the AA parameterization, θ is sampled from p(θ | y, ν). The effect of conditioning on ν is a broader conditional posterior over θ, as illustrated in Figure 1.
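The AA reparameterization amounts to the following two maps (a minimal sketch; helper names are ours):

```python
import numpy as np

def to_whitened(f, L):
    """nu = L^{-1} f: a priori N(0, I), independent of the hyper-parameters."""
    return np.linalg.solve(L, f)

def from_whitened(nu, L):
    """f = L nu: has the GP prior N(0, K) when nu ~ N(0, I)."""
    return L @ nu
```

When θ (and hence K and L) changes during sampling, ν is held fixed and f is recomputed as Lν, which is what decouples ν from θ a priori.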
In the Surrogate (SURR) data model proposed in [22], a set of auxiliary latent variables g is introduced as a noisy version of f; in particular, p(g | f, θ) = N(g | f, S_θ). This construction yields a conditional for f of the form p(f | g, θ) = N(f | m, R), with R = S_θ − S_θ (S_θ + K)⁻¹ S_θ and m = R S_θ⁻¹ g. After decomposing R = DDᵀ, the sampling of θ is then conditioned on the "whitened" variables η, defined as f = Dη + m. The covariance S_θ is constructed by matching the posterior distribution over each of the latent variables individually (see [22] for further details). Figure 1 shows that the SURR parameterization is characterized by a conditional posterior over θ that is broader than in SA and slightly broader than in the AA parameterization.

3.3 Sampling from p(θ | y): the Pseudo-Marginal approach

The use of reparameterization techniques mitigates the problems due to the coupling of latent variables and hyper-parameters, but sampling efficiency for GP models is still an issue (for example, [7] reports simulations of ten parallel chains comprising five million samples each). Intuitively, the best strategy to break the correlation between latent variables and hyper-parameters in sampling from the posterior over the hyper-parameters would be to integrate out the latent variables altogether. As we discussed, this is not possible, but here we present a strategy that uses an unbiased estimate of the marginal likelihood p(y | θ) to devise an MCMC method that produces samples from the correct posterior distribution p(θ | y). For the sake of clarity, in this work we will focus on the Metropolis-Hastings algorithm with proposal π(θ′ | θ). We are interested in sampling from the posterior distribution

p(θ | y) ∝ p(y | θ) p(θ).
(21)

In order to do that, we would need to integrate out the latent variables:

p(y | θ) = ∫ p(y | f) p(f | θ) df,   (22)

and use this along with the prior p(θ) in the Hastings ratio:

z = [ p(y | θ′) p(θ′) / ( p(y | θ) p(θ) ) ] × [ π(θ | θ′) / π(θ′ | θ) ].   (23)

As already discussed, analytically integrating out f is not possible. The results in [25], [24] show that we can plug into the Hastings ratio an estimate p̃(y | θ) of the marginal p(y | θ), and as long as this is unbiased, the sampler will draw samples from the correct posterior p(θ | y):

z̃ = [ p̃(y | θ′) p(θ′) / ( p̃(y | θ) p(θ) ) ] × [ π(θ | θ′) / π(θ′ | θ) ].   (24)

This result is remarkable, as it gives a simple recipe to be used in hierarchical models to tackle the problem of strong coupling between groups of variables when using MCMC algorithms.
Fig. 1. Comparison of the posterior distribution p(θ | y) with the posterior p(θ | f) in the SA parameterization, the posterior p(θ | y, ν) in the AA parameterization, and the parameterization used in the SURR method (n = 200).

Figure 1 shows the effect of conditioning the sampling of θ on different transformations of the latent variables given by SA (blue line), AA (red line), and SURR (green line). The conditional variance for the three approaches is still far lower than the variance of the marginal posterior p(θ | y) that can be obtained by the PM approach. This motivates the use of the PM approach to effectively break the correlation between latent variables and hyper-parameters in an MCMC scheme.

Note that if the goal is quantifying uncertainty in the parameters only, and no predictions are needed, one could just iterate the sampling of θ | y, as this is done regardless of f. For predictions, instead, samples from the joint posterior p(f, θ | y) are needed in the Monte Carlo integral in equation 20, so both steps are necessary. We consider this a Gibbs sampler despite the fact that in principle interleaving of the two steps is not needed; one could obtain samples from the posterior distribution over f in a second stage, once samples from p(θ | y) are available. This would come at an extra cost, given that sampling f | y, θ requires the factorization of K for each MCMC sample θ. Therefore, when predictions are needed, we prefer to interleave the two steps, and still interpret the proposed sampling strategy as a Gibbs sampler.

3.4 Unbiased estimation of p(y | θ) using importance sampling

In order to obtain an unbiased estimator p̃(y | θ) for the marginal p(y | θ), we propose to employ importance sampling.
We draw N_imp samples f_i from the approximating distribution q(f | y, θ), so that we can approximate the marginal p(y | θ) = ∫ p(y | f) p(f | θ) df by:

p̃(y | θ) ≃ (1/N_imp) ∑_{i=1}^{N_imp} [ p(y | f_i) p(f_i | θ) / q(f_i | y, θ) ].   (25)

It is easy to verify that equation 25 yields an unbiased estimate of p(y | θ), as its expectation is the exact marginal p(y | θ). Therefore, this estimate can be used in the Hastings ratio to construct an MCMC approach that samples from the correct invariant distribution p(θ | y). Algorithm 1 sketches the MH algorithm that we propose to sample the hyper-parameters.

Algorithm 1
Pseudo-marginal MH transition operator to sample θ.

Input: the current pair (θ, p̃(y | θ)), a routine to approximate p(f | y, θ) by q(f | y, θ), and the number of importance samples N_imp.
Output: a new pair (θ, p̃(y | θ)).

Draw θ′ from the proposal distribution π(θ′ | θ).
Approximate p(f | y, θ′) by q(f | y, θ′).
Draw N_imp samples from q(f | y, θ′).
Compute p̃(y | θ′) using eq. 25.
Compute A = min{ 1, [ p̃(y | θ′) p(θ′) / ( p̃(y | θ) p(θ) ) ] [ π(θ | θ′) / π(θ′ | θ) ] }.
Draw u from U[0, 1].
If A > u, return (θ′, p̃(y | θ′)); else return (θ, p̃(y | θ)).

From the theory of importance sampling [38], the variance of the estimator is zero when q(f | y, θ) is proportional to p(y | f) p(f | θ), which is proportional to the posterior distribution over f that we do not know how to sample from in the first place. In our case, the more accurate the Gaussian approximation, the smaller the variance of the estimator. Given that EP has been reported to be more accurate in approximating p(f | y, θ), it is reasonable to expect that EP will lead to a smaller estimator variance compared to LA. This will be assessed in the next section.

The motivation for using an importance sampling estimator rather than other simulation based methods for estimating marginal likelihoods is the following. Even though it is possible to sample f relatively efficiently, the estimation of marginal likelihoods from MCMC simulations is generally challenging [39], [40] and only guarantees of estimator consistency are available. Obtaining estimates based on samples from p(f | y, θ) would require some form of user intervention (assessment of convergence and estimation of efficiency) every time a new value of θ is proposed; this is clearly not practical or useful for the PM scheme.
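The estimator of equation 25 is conveniently computed in log space for numerical stability. A sketch (ours; uses SciPy densities, and takes the Gaussian approximation as given) is:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def log_marginal_is(y_loglik, K, mu_q, Sigma_q, n_imp, rng):
    """Log of the unbiased importance-sampling estimate of p(y | theta)
    in eq. 25, with importance distribution q(f | y, theta) = N(mu_q, Sigma_q).
    y_loglik(f) must return log p(y | f)."""
    F = rng.multivariate_normal(mu_q, Sigma_q, size=n_imp)
    log_w = np.array([
        y_loglik(f)
        + mvn.logpdf(f, np.zeros(len(mu_q)), K)    # log p(f | theta)
        - mvn.logpdf(f, mu_q, Sigma_q)             # log q(f | y, theta)
        for f in F])
    # log of the average importance weight (log-sum-exp for stability)
    return np.logaddexp.reduce(log_w) - np.log(n_imp)
```

The zero-variance property noted above can be checked directly: with a Gaussian likelihood and q equal to the exact posterior, every importance weight equals the exact marginal, so the estimate is exact for any N_imp.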
This work reports an extensive assessment of LA and EP to obtain Gaussian approximations to p(f | y, θ) within the importance sampling estimate of p(y | θ).

We show here why the proposed method yields an MCMC approach that produces samples from the correct invariant distribution p(θ | y). The easiest way to see this is by considering N_imp = 1; showing correctness for larger numbers of importance samples is similar but notationally heavier (see [25] for further details). By substituting the importance sampling estimate p̃(y | θ) with N_imp = 1 into z̃ and rearranging the terms, we obtain

z̃ = [ p(y | f′) p(f′ | θ′) p(θ′) / ( p(y | f) p(f | θ) p(θ) ) ] × [ q(f | y, θ) π(θ | θ′) / ( q(f′ | y, θ′) π(θ′ | θ) ) ].   (26)

Isolating the terms in the square brackets allows us to interpret z̃ as a Hastings ratio with a joint proposal for θ and for the importance sample f given by

π(f′, θ′ | f, θ) = q(f′ | y, θ′) π(θ′ | θ).   (27)

The remaining term in z̃ indicates that the target distribution this approach is sampling from is

p(y | f) p(f | θ) p(θ) = p(y, f, θ).   (28)

If we concentrate on θ, regardless of f, the target distribution is exactly what we are aiming to sample from, as it is proportional to the posterior p(θ | y). The extension to more than one importance sample follows from a similar argument, except that the approximating density q(f | y, θ) appears in the expression of the target distribution; however, this does not cause any problems, as marginalizing the latent variables leads to the same conclusion as in the case N_imp = 1.
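The transition operator of Algorithm 1 can be sketched as follows (ours, simplified to a symmetric proposal π(θ′ | θ), so the proposal ratio cancels from the Hastings ratio; `log_marg_est` encapsulates the approximation and importance-sampling steps):

```python
import numpy as np

def pm_mh_step(theta, log_p_tilde, log_prior, log_marg_est, propose, rng):
    """One pseudo-marginal MH transition. log_p_tilde is the stored noisy
    log-marginal at theta; log_marg_est(theta, rng) recomputes the Gaussian
    approximation and returns the importance-sampling estimate of
    log p(y | theta); propose draws from a symmetric proposal."""
    theta_prop = propose(theta, rng)
    log_p_prop = log_marg_est(theta_prop, rng)
    log_A = (log_p_prop + log_prior(theta_prop)) - (log_p_tilde + log_prior(theta))
    if np.log(rng.uniform()) < log_A:
        return theta_prop, log_p_prop    # accept: store the new noisy estimate
    return theta, log_p_tilde            # reject: crucially, reuse the old estimate
```

Note that on rejection the old estimate p̃(y | θ) is reused rather than recomputed; recomputing it would break the correctness argument above.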
The analysis for N_imp = 1 also reveals an interesting similarity with the approach proposed in [23], where a joint update of θ and f was performed as follows: proposing θ′ | θ, proposing f′ | y, θ′, and accepting/rejecting the joint proposal θ′, f′. However, in this case the PM transition kernel will still target the desired marginal posterior irrespective of the number of importance samples N_imp.

4 ASSESSING IMPORTANCE DISTRIBUTIONS
In this section, we present simulations to assess the ability of the PM approach to characterize the marginal likelihood p(y | θ) in GP classification. First, we aim to assess the quality of the estimate given by the importance sampler based on LA and EP on simulated data with respect to the number of importance samples. Second, we will evaluate the efficiency of the sampler on simulated data with respect to the approximation used to draw importance samples and with respect to their number. Third, we will compare the PM approach with the AA and SURR parameterizations, which are the most efficient sampling schemes proposed in the literature for sampling hyper-parameters in models involving GP priors [21]. In all the experiments, for both LA and EP we imposed a convergence criterion requiring the change in squared norm of f to fall below a small threshold proportional to n.
Fig. 2. Plot of the PM as a function of the length-scale τ, for n = 50 and n = 200 and for the LA and EP approximations; black solid lines represent the average over repetitions and dashed lines represent the 2.5th and 97.5th quantiles for N_imp = 1 and N_imp = 64. The solid red line is the prior density.

In this section, we present an assessment of the variance of the estimator p̃(y | θ) with respect to the global length-scale parameter τ in equation 5. In particular, we are interested in comparing the quality of the approximation given by the importance sampling approach based on the two different approximations employed in this work. Given that the dimensionality of the space where the approximation is performed grows linearly with the number of input vectors n, we are also interested in studying the effect of n on the quality of the approximation. Based on these considerations, we simulated data from the GP classification model with d = 2 and n = 50 and n = 200, with an isotropic covariance function with fixed values of τ and σ. We fixed the value of σ to the value used to generate the data, and we imposed a Gamma prior on the length-scale, p(τ) = G(τ | a_τ, b_τ), with shape a_τ = 1 and rate b_τ = 1/√d. We then computed the posterior over θ based on p̃(y | θ) for different values of τ over repeated runs, with different numbers of importance samples N_imp. The results are reported in figure 2.

As expected, for larger sets of importance samples, the estimates are more accurate. We can also see that EP leads to a smaller variance compared to LA, which is not surprising given that the approximation achieved by EP is more accurate [18], [19]. The experiment suggests that there is little increase in the variance of the estimator for the larger data set.

In this section we report an analysis on simulated data showing how the choice of the approximation and the number of importance samples affect the efficiency in sampling from p(θ | y).
We generated data sets from the GP classification model with different combinations of the number of input vectors and the number of covariates. The covariates were generated in the unit hypercube, and data were selected to have an equal number of input vectors in each class. We chose Gamma priors for the hyper-parameters as follows: p(τ_i) = G(τ_i|a_τ, b_τ) with shape a_τ = 1 and rate b_τ = 1/√d, and p(σ) = G(σ|a_σ, b_σ) with fixed shape a_σ and rate b_σ. In the formulation of the GP classification model all hyper-parameters have to be positive; for convenience, we reparameterized them by introducing the variables ψ_τi = log(τ_i) and ψ_σ = log(σ).

For each method, we ran several parallel chains, each comprising a burn-in phase followed by further sampling iterations; convergence speed of the samplers was monitored using the Potential Scale Reduction Factor (PSRF, the R̂ statistic) as described in [41]. The chains were initialized from the prior, rather than using the procedure suggested in [41], to make the convergence test more challenging. Also, correctness of the code was checked using the idea presented in [42], which indirectly verifies that the Markov chains indeed have p(θ|y) as their invariant distribution. The proposal mechanism π(θ′|θ) was the same for all the PM approaches for a given combination of n and d, so that it is meaningful to analyze the effect of N_imp on sampling efficiency, convergence speed, and acceptance rate. In particular, a large variance of the estimator of the marginal likelihood can lead to the acceptance of a value of θ for which p(y|θ) is largely overestimated, making it difficult for the chain to move away from it. In this case, the chain can get stuck for several iterations before moving again; this effect has been reported in [24], [25].
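The mechanism just described, including the recycling of the current noisy estimate that makes chains stick when p(y|θ) is overestimated, can be sketched as a generic pseudo-marginal MH update. This is an illustrative sketch with placeholder function names; a symmetric random-walk proposal on the log-hyper-parameters ψ is assumed:

```python
import numpy as np

def pm_mh_step(psi, log_phat, log_prior, propose, estimate_log_py, rng):
    """One pseudo-marginal MH update of (log-)hyper-parameters psi.

    log_phat is the noisy estimate of log p(y|psi) computed when psi was
    last accepted; it is reused, NOT recomputed. If it happens to be a
    large overestimate, log_alpha is small for every proposal and the
    chain sticks, which is the effect discussed in the text."""
    psi_new = propose(psi, rng)                   # symmetric proposal, e.g. random walk
    log_phat_new = estimate_log_py(psi_new, rng)  # fresh unbiased estimate at psi_new
    log_alpha = (log_phat_new + log_prior(psi_new)) - (log_phat + log_prior(psi))
    if np.log(rng.uniform()) < log_alpha:
        return psi_new, log_phat_new, True        # move, and store the new estimate
    return psi, log_phat, False
```

With a noiseless estimate this reduces to standard MH; with a noisy but unbiased estimate of p(y|ψ) (unbiased on the raw scale, not the log scale) the invariant distribution is still exactly the posterior over ψ.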
To isolate the effect of N_imp and of the type of approximation on sampling efficiency and acceptance rate, we tuned the chains in preliminary runs for EP with N_imp = 64 so as to achieve a suitable acceptance rate, and used the same proposal for LA and for the other values of N_imp.

The results are reported in table 1 for isotropic and ARD covariances. As a measure of efficiency, we used the minimum Effective Sample Size (ESS) [43] across the hyper-parameters. The table also reports the median R̂ achieved by the chains at four successive checkpoints along the run, which gives an idea of convergence as the iterations progress. Finally, table 1 reports the acceptance rate; a low acceptance rate compared to the one obtained by PM EP (64) indicates that the chains are more likely to get stuck due to a large variance of the estimator of the marginal likelihood.

TABLE 1
Analysis of convergence and efficiency of a MH algorithm sampling the hyper-parameters using the PM approach. The results show the dependency of the effective sample size (ESS, standard deviation in parentheses) and of the speed of convergence (measured through the R̂ statistic at four successive checkpoints, R̂1 to R̂4) on the type of approximation (LA or EP) and on the number of importance samples N_imp used to compute an unbiased estimate of the marginal likelihood p(y|θ). Acceptance rates (Acc) are shown as integer percentages.

                              Isotropic                                 ARD
n    d   Scheme (N_imp)  ESS        R̂1    R̂2    R̂3    R̂4    Acc | ESS        R̂1    R̂2    R̂3    R̂4    Acc
50   2   PM LA (1)       749 (73)   1.00  1.00  1.00  1.00  23  | 131 (41)   1.04  1.05  1.04  1.02  13
50   2   PM LA (16)      778 (61)   1.00  1.00  1.00  1.00  25  | 206 (21)   1.05  1.04  1.03  1.01  16
50   2   PM LA (64)      752 (76)   1.00  1.00  1.00  1.00  24  | 212 (38)   1.03  1.03  1.02  1.01  16
50   2   PM EP (1)       747 (100)  1.00  1.00  1.00  1.00  24  | 208 (22)   1.01  1.00  1.01  1.00  16
50   2   PM EP (16)      736 (163)  1.00  1.00  1.00  1.00  25  | 246 (23)   1.00  1.00  1.00  1.00  19
50   2   PM EP (64)      793 (54)   1.00  1.00  1.00  1.00  24  | 252 (24)   1.00  1.00  1.00  1.00  19
50   2   AA              287 (58)   1.00  1.00  1.00  1.00  22  | 20 (9)     1.12  1.12  1.05  1.03  20
50   2   SURR            154 (11)   1.01  1.00  1.00  1.00  22  | 21 (6)     1.11  1.08  1.07  1.05  21
50   10  PM LA (1)       237 (147)  1.25  1.26  1.19  1.08  18  | 19 (12)    1.08  1.09  1.13  1.17  7
50   10  PM LA (16)      238 (194)  1.01  1.02  1.03  1.02  16  | 62 (30)    1.04  1.04  1.06  1.11  17
50   10  PM LA (64)      348 (126)  1.01  1.01  1.01  1.01  21  | 78 (23)    1.03  1.03  1.01  1.01  20
50   10  PM EP (1)       282 (47)   1.02  1.01  1.01  1.00  19  | 63 (18)    1.03  1.03  1.02  1.01  15
50   10  PM EP (16)      507 (36)   1.00  1.00  1.00  1.00  24  | 107 (12)   1.01  1.01  1.01  1.01  24
50   10  PM EP (64)      583 (51)   1.00  1.00  1.00  1.00  26  | 108 (28)   1.02  1.01  1.01  1.01  26
50   10  AA              71 (23)    1.02  1.01  1.02  1.01  22  | 74 (6)     1.03  1.03  1.02  1.02  22
50   10  SURR            52 (15)    1.04  1.02  1.02  1.02  21  | 45 (5)     1.03  1.03  1.02  1.01  21
200  2   PM LA (1)       717 (31)   1.00  1.00  1.00  1.00  31  | 318 (29)   1.00  1.00  1.00  1.00  38
200  2   PM LA (16)      739 (46)   1.00  1.00  1.00  1.00  32  | 364 (20)   1.00  1.00  1.00  1.00  43
200  2   PM LA (64)      730 (26)   1.00  1.00  1.00  1.00  32  | 355 (42)   1.00  1.00  1.00  1.00  43
200  2   PM EP (1)       736 (47)   1.00  1.00  1.00  1.00  32  | 349 (26)   1.00  1.00  1.00  1.00  42
200  2   PM EP (16)      736 (48)   1.00  1.00  1.00  1.00  32  | 365 (21)   1.00  1.00  1.00  1.00  43
200  2   PM EP (64)      721 (43)   1.00  1.00  1.00  1.00  32  | 354 (37)   1.00  1.00  1.00  1.00  43
200  2   AA              112 (49)   1.01  1.01  1.01  1.01  21  | 41 (8)     1.07  1.07  1.06  1.04  23
200  2   SURR            54 (8)     1.01  1.02  1.01  1.01  21  | 61 (9)     1.02  1.02  1.01  1.01  22
200  10  PM LA (1)       115 (53)   1.03  1.03  1.02  1.04  27  | 27 (10)    1.39  1.33  1.19  1.11  12
200  10  PM LA (16)      117 (42)   1.05  1.04  1.03  1.03  26  | 53 (18)    1.08  1.06  1.03  1.02  18
200  10  PM LA (64)      145 (62)   1.01  1.01  1.01  1.02  28  | 37 (31)    1.14  1.15  1.17  1.17  13
200  10  PM EP (1)       75 (35)    1.03  1.03  1.03  1.02  18  | 26 (14)    1.12  1.09  1.13  1.08  10
200  10  PM EP (16)      137 (38)   1.01  1.01  1.01  1.01  27  | 52 (13)    1.20  1.15  1.08  1.04  17
200  10  PM EP (64)      130 (64)   1.09  1.09  1.09  1.12  25  | 45 (21)    1.03  1.03  1.03  1.09  17
200  10  AA              26 (6)     1.05  1.04  1.03  1.03  23  | 22 (7)     1.10  1.08  1.05  1.03  21
200  10  SURR            18 (7)     1.27  1.28  1.24  1.16  21  | 10 (3)     1.07  1.08  1.06  1.05  20

The results indicate that sampling efficiency when employing EP to approximate the posterior distribution over f is higher than when employing the LA algorithm. It is striking that evaluating the PM with as little as one importance sample already offers acceptable performance in terms of ESS compared to larger values of N_imp. However, a low acceptance rate when N_imp is small suggests that the corresponding chains can take several iterations before accepting any proposals.

Table 1 also reports a comparison of the PM method with the AA and SURR sampling schemes, using a Metropolis-Hastings transition operator in all cases so that the results are meaningfully comparable. The proposal mechanism was tuned during the burn-in phase to achieve a suitable acceptance rate. Table 1 shows that the PM approach achieves faster convergence and higher sampling efficiency compared to the AA scheme. The SURR method has convergence behavior and efficiency comparable to the AA scheme. This is somewhat different from what is presented in [22], where a component-wise slice sampling transition operator was employed: there, the SURR method achieved higher efficiency per covariance construction, likelihood evaluation, and running time compared to the AA method, whereas in the experiment reported here the ESS is comparable.

Table 2 reports the average number of operations in O(n³) needed for one iteration of the PM approach with the LA and EP approximations and of the AA and SURR parameterizations. The table shows that EP is more expensive than the LA algorithm, and that the PM approaches require more O(n³) operations than AA and SURR. Normalizing the ESS by the number of operations suggests that the cost of obtaining independent samples using the PM approach is nevertheless generally lower than with AA and SURR.
In the PM approach we also tried stopping the approximation algorithms using a looser convergence criterion (results not reported); especially for N_imp = 64 this yielded similar efficiency at a much lower computational cost.

We conducted a further test of convergence with the aim of mitigating the effect of the random-walk exploration and highlighting the advantages offered by the different parameterizations. We ran the AA, SURR and PM approaches repeating each step of the Gibbs sampler several times, so that the new sample drawn in each Gibbs sampling update can roughly be considered independent of the previous one. The behavior of the R̂ statistic with respect to the number of iterations, reported in table 3, reveals some interesting features. All methods are characterized by fast convergence. In the case of the SURR and AA methods, however, efficiency is much lower than what can be achieved by the PM method when repeating the Gibbs sampling update of θ|y. This indicates that the parameterizations of the SURR and AA methods are not fully capable of breaking the correlation between hyper-parameters and latent variables. Finally, note that in the case of ARD covariances the low efficiency of all methods is due to the random-walk type of exploration; in those cases the repeated Gibbs sampling steps were not enough to ensure independence between consecutive samples.

TABLE 3
Analysis of convergence and efficiency of the AA and SURR parameterizations compared to the PM approach. Each Gibbs sampling update was repeated several times in order to highlight the effect of the parameterization alone. The PM approach is based on the EP approximation and N_imp = 64. The column next to the ESS reports the ESS normalized by the number of operations in O(n³); R̂ is reported at four successive checkpoints along the chains.

                   Isotropic                                        ARD
n    d   Scheme  ESS        ESS/O(n³)  R̂1    R̂2    R̂3    R̂4   | ESS         ESS/O(n³)  R̂1    R̂2    R̂3    R̂4
50   2   PM      …          …          1.00  1.00  1.00  1.00 | 4301 (169)  243 (10)   1.00  1.00  1.00  1.…
50   2   AA      …          …          1.00  1.00  1.00  1.00 | 57 (11)     57 (11)    1.01  1.01  1.01  1.…
50   2   SURR    702 (98)   234 (33)   1.00  1.00  1.00  1.00 | 64 (9)      21 (3)     1.02  1.02  1.02  1.…
50   10  PM      …          …          1.00  1.00  1.00  1.00 | 2207 (141)  161 (10)   1.00  1.00  1.00  1.…
50   10  AA      211 (22)   211 (22)   1.01  1.00  1.00  1.00 | 464 (40)    464 (40)   1.00  1.00  1.00  1.…
50   10  SURR    133 (23)   44 (8)     1.01  1.01  1.01  1.00 | 207 (18)    69 (6)     1.01  1.01  1.01  1.…
200  2   PM      …          …          1.00  1.00  1.00  1.00 | 5844 (166)  338 (10)   1.00  1.00  1.00  1.…
200  2   AA      327 (142)  327 (142)  1.01  1.00  1.01  1.00 | 143 (22)    143 (22)   1.01  1.01  1.00  1.…
200  2   SURR    234 (24)   78 (8)     1.00  1.00  1.00  1.00 | 339 (47)    113 (16)   1.00  1.00  1.00  1.…
200  10  PM      737 (533)  55 (40)    1.01  1.01  1.01  1.01 | 363 (250)   28 (19)    1.03  1.03  1.01  1.…
200  10  AA      37 (7)     37 (7)     1.05  1.05  1.05  1.03 | 42 (8)      42 (8)     1.03  1.03  1.02  1.…
200  10  SURR    25 (6)     8 (2)      1.05  1.05  1.01  1.00 | 22 (5)      7 (2)      1.09  1.07  1.05  1.…

TABLE 2
Average number of operations in O(n³) required for each iteration of the PM approaches with the LA and EP approximations and for each iteration of the AA and SURR parameterizations.

n    d   Scheme  Isotropic  ARD
50   2   PM LA   …          8.…
50   2   PM EP   …          17.…
50   2   AA      …          1.…
50   2   SURR    …          3.…
50   10  PM LA   …          8.…
50   10  PM EP   …          13.…
50   10  AA      …          1.…
50   10  SURR    …          3.…
200  2   PM LA   …          9.…
200  2   PM EP   …          17.…
200  2   AA      …          1.…
200  2   SURR    …          3.…
200  10  PM LA   …          7.…
200  10  PM EP   …          13.…
200  10  AA      …          1.…
200  10  SURR    …          3.…

This section reports a comparison of classification performance on three UCI data sets [44], so as to verify the capability of the proposed approach to effectively carry out inference for parameters in general GP classification problems. The Breast and Pima data sets are two-class classification problems, described by d = 9 and d = 8 covariates, comprising n = 682 and n = 768 input vectors respectively. The Abalone data set has three classes; in this paper, we considered the task of inferring parameters of a GP probit classifier for the two classes "M" and "F", resulting in a data set of n = 2835 input vectors and d = 8 covariates. All covariates were transformed to have zero mean and unit standard deviation. We selected the isotropic covariance function in equation 5 and chose the following priors for its parameters: p(τ) = G(τ|a_τ, b_τ) with shape a_τ = 1 and rate b_τ = 1/√d, and p(σ) = G(σ|a_σ, b_σ) with fixed shape a_σ and rate b_σ.

We compared the proposed sampling PM approach with the AA and SURR parameterizations.
In the latter two, we alternated the sampling of θ using the MH algorithm with the sampling of f, iterating ELL-SS ten times. In the PM approach we selected an approximation based on the LA algorithm and set the number of importance samples to N_imp = 1.

We ran the chains with an initial adaptive phase used to tune the proposal mechanism to a suitable acceptance rate. In the case of the PM approach, during the adaptive phase we used the approximate marginal likelihood obtained by the LA algorithm; this was to overcome the problems that may arise when chains get trapped by a largely overestimated value of the marginal likelihood. After the adaptive phase, we switched to the estimate of the marginal likelihood obtained by importance sampling. For each sampling approach, we ran several parallel chains initialized from the prior, so that we could compare convergence speed.

The results for the three UCI data sets are reported in figures 3 and 4. The plots show efficiency and convergence speed in the sampling of the logarithm of the length-scale parameter τ. The left and middle panels of these figures show the trace and the auto-correlation plots of one chain after the burn-in period. The auto-correlation plot gives an indication of efficiency: the faster the auto-correlation of a chain decays to values close to zero, the higher the efficiency of the corresponding MCMC approach. To facilitate visualization, the traces were thinned, and the auto-correlation plots were computed on the thinned chains. The right panel shows the evolution of the PSRF (R̂) after the burn-in period, without thinning the chains.

The results indicate that the SURR parameterization yields better performance than the AA parameterization, whose convergence can be extremely slow. The PM approach achieves impressive efficiency and convergence speed compared to both the AA and SURR parameterizations.
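The PSRF used throughout these convergence diagnostics compares between-chain and within-chain variance for each parameter. A sketch of the standard computation follows (one of several common variants; the reference implementation in [41] includes additional corrections):

```python
import numpy as np

def psrf(chains):
    """Potential Scale Reduction Factor (Gelman-Rubin R-hat) for one scalar
    parameter; chains has shape (m_chains, n_iterations). Values near 1
    indicate the chains are sampling from the same distribution."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)               # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_hat / W))
```

Initializing the chains over-dispersed (here, from the prior) makes this diagnostic conservative: R̂ stays well above 1 until all chains have found and mixed over the same posterior mass.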
Remarkably, in these experiments, such performance is achieved using only one importance sample and a relatively cheap approximation based on the LA algorithm. We note that this might not be the case in general. The key to the success of the PM approach is the possibility of obtaining a low-variance estimator of the marginal likelihood p(y|θ). Some approximations might not be good enough to ensure that this is the case; in such situations, more importance samples or better approximations should be employed. This has recently been investigated in more detail in [45], where unbiased estimates of the marginal likelihood based on Annealed Importance Sampling [46] were proposed.

COMPARISON WITH OTHER PROBABILISTIC CLASSIFIERS

This section reports a comparison of classification performance on five UCI data sets. In the Glass data set we considered the two classes "window glass" and "non-window glass". In the USPS data set we considered the task of classifying "3" vs "5", as in [18]. We constructed increasingly larger training sets comprising an equal number of input vectors per class. For each number of input vectors n, we constructed several training sets across which we evaluated the performance of the proposed PM approach.

We compared the PM approach (MCMC PM EP) with (i) a probabilistic version of an SVM classifier [28], (ii) the GP classifier using the EP approximation [18], [5]
Fig. 3. Summary of efficiency and convergence speed on the Breast and Pima data sets for the AA and SURR parameterizations and the proposed PM approach. All plots show the sampling of the logarithm of the length-scale parameter τ. The figure shows trace plots (left panels) and corresponding auto-correlation plots (middle panels) for one thinned chain after burn-in. The right panels report the evolution of the PSRF after burn-in; in these plots the solid line and the red dashed line represent the median and an upper percentile across chains, respectively.

optimizing θ using type II Maximum Likelihood [5], [12] (EP ML), and (iii) the classifier obtained by sampling θ based on the marginal likelihood computed by EP (MCMC EP). Predictions in EP ML and MCMC EP were carried out by approximately integrating out the latent variables according to the Gaussian approximation given by the EP algorithm.

Fig. 4. Summary of efficiency and convergence speed on the Abalone data set.

We used the SVM code in the e1071 R package, which provides a front-end to the LIBSVM library, where non-linear SVMs employ a squared exponential isotropic kernel. In order to meaningfully compare the four classifiers, we used the isotropic kernel/covariance for all of them. In the MCMC PM EP method we set N_imp = 64.

We are interested in comparing the ability of the classifiers to reliably quantify uncertainty in predicting class labels. Predictive probabilities give a means to do so; the more confident the classifier is about a correct class label, the better. Also, predictive probabilities make it possible to avoid making decisions when they fall below a given threshold.

Following [47], we propose to summarize the ability of a classifier to reliably quantify uncertainty in predictions by analyzing classification accuracy and AUC (the area under the Receiver Operating Characteristic (ROC) curve) versus the degree of abstention. In particular, we propose to measure the area under the two curves obtained by plotting accuracy versus degree of abstention and AUC versus degree of abstention. We denote these scores by "capacity accuracy" and "capacity AUC", respectively. A value of capacity close to one indicates that a classifier is capable of correctly classifying test data with a high degree of confidence.

For a probabilistic classifier, we compute accuracy and AUC versus degree of abstention as follows. Denote by ρ a threshold for the predictive probabilities. For a given value of ρ, we compute the degree of abstention as the proportion of test data for which the predictive probability p satisfies 0.5 − ρ < p < 0.5 + ρ. For the remaining test data, namely those for which the classifier is most confident, we compute accuracy and AUC. We repeat this procedure for a grid of values of ρ between 0 and 0.5, so as to obtain the plots of accuracy and AUC with respect to the degree of abstention.
We finally compute the areas under the two curves to obtain "capacity accuracy" and "capacity AUC" for the classifier. Given that the degree of abstention might not reach the value of one, we divide each area by the largest degree of abstention, so that the two capacity scores are normalized to one. Figure 5 shows two exemplar curves of accuracy and AUC with respect to the degree of abstention that are used to compute the capacity scores.
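The procedure just described can be sketched for the accuracy case as follows (an illustrative sketch of the score, not the authors' code; capacity AUC is obtained the same way with AUC in place of accuracy):

```python
import numpy as np

def capacity_accuracy(p, y_true, n_grid=50):
    """'Capacity accuracy': area under the accuracy-vs-abstention curve,
    normalized by the largest degree of abstention reached.
    p: predictive probabilities of class +1; y_true: labels in {-1, +1}."""
    p = np.asarray(p, dtype=float)
    y_true = np.asarray(y_true)
    abst, acc = [], []
    for rho in np.linspace(0.0, 0.49, n_grid):
        keep = (p <= 0.5 - rho) | (p >= 0.5 + rho)  # abstain inside (0.5-rho, 0.5+rho)
        if not keep.any():
            break
        abst.append(1.0 - keep.mean())              # degree of abstention
        pred = np.where(p[keep] >= 0.5, 1, -1)
        acc.append((pred == y_true[keep]).mean())   # accuracy on confident cases
    # trapezoidal area under the curve, normalized by the largest abstention
    area = sum(0.5 * (acc[i] + acc[i - 1]) * (abst[i] - abst[i - 1])
               for i in range(1, len(abst)))
    return area / abst[-1] if abst[-1] > 0 else acc[0]
```

A classifier whose confident predictions are also its correct ones scores close to one; a classifier that assigns high confidence to wrong decisions is penalized across the whole abstention range.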
Fig. 5. Left panel: plot of accuracy vs. degree of abstention for one of the folds of the Glass data set with n = 50. Right panel: plot of AUC vs. degree of abstention for one of the folds of the Pima data set with n = 20.

The results are reported in figures 6 and 7 for the five data sets considered in this work. In general, the probabilistic version of the SVM classifier leads to a worse quantification of uncertainty compared to the GP classifiers. Figure 5 shows how SVMs tend to assign high confidence to wrong decisions more often than GP classifiers. Also, the GP classifier optimizing the hyper-parameters (EP ML) yields worse performance than the GP classifiers where hyper-parameters are integrated out. Finally, the general trend is that the PM approach achieves the best quantification of uncertainty among all the classifiers considered in this work.

A closer look at the posterior distributions obtained by MCMC EP and MCMC PM EP reveals the following insights. In the case of classification, the EP approximation of the marginal likelihood is generally accurate enough that the posterior distribution over the hyper-parameters obtained by MCMC PM EP is similar to the one obtained by MCMC EP. Differences in predictions obtained by the two methods are mostly due to the different way latent
variables are integrated out when making predictions. The situation can be different for other GP models where, for example, EP cannot be derived or exhibits convergence issues, so these considerations are peculiar to GP classification. The aim of this study is to demonstrate that it is important to account for the uncertainty in the hyper-parameters when predicting labels on unseen data in order to achieve a sound quantification of uncertainty. Although this has been pointed out in previous work [12], [26], [16], [27], the proposed approach makes exact Bayesian computations, in a Monte Carlo sense, actually feasible by building upon deterministic approximations.

In terms of complexity of the three GP classifiers, the following considerations can be made. All the methods employ EP, which requires three O(n³) operations at each iteration. In the case of ML, the approximation of the marginal likelihood p(y|θ) given by EP is optimized with respect to θ. When the number of hyper-parameters is small, as in the case of the isotropic RBF covariance function considered here, the optimization can be performed by grid search.

Fig. 6. Plots of performance scores with respect to the size of the training set for the Pima (first row) and Thyroid (second row) data sets. The legend is reported in the first row only and applies to all the plots. In the remaining plots, a closeup is reported to make it easier to compare the results.
In the case of the ARD covariance, optimization can be performed, for instance, by employing the conjugate gradient algorithm, which requires the derivatives of the approximate marginal likelihood with respect to the hyper-parameters. Computing such derivatives involves an extra O(n³) operation (see section 5.5.2 of [5]). In MCMC EP, the approximation of the marginal likelihood p(y|θ) given by EP is used
directly to obtain samples from p(θ|y). In this case, each iteration requires running EP to obtain an approximation to p(y|θ), so the overall complexity is still in O(n³), but the number of operations depends on the number of iterations the MCMC approach is run for. Running MCMC PM EP requires exactly the same number of operations as MCMC EP, except that the Cholesky decomposition of the covariance of the approximating Gaussian is needed to draw the importance samples, adding an extra O(n³) operation.

To compute the mean of the predictive distribution for one test sample, EP ML requires only operations in O(n) and none in O(n²), as the expensive elements needed to compute it are already available from running the EP approximation. In the case of MCMC EP, the mean of the predictive distribution is an average of means computed for several draws from the posterior over θ, so the complexity scales linearly with the number of MCMC samples and quadratically with n.

Fig. 7. Plots of performance scores with respect to the size of the training set for the Ionosphere (first row), Glass (second row) and USPS (third row) data sets. The legend is reported in the first row only and applies to all the plots. In the remaining plots, a closeup is reported to make it easier to compare the results.
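The averaging over draws of θ described above can be sketched as follows, using the standard Gaussian-probit integral; mu_samples and var_samples stand for the latent predictive means and variances produced at each posterior sample of θ (names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def mc_predictive_prob(mu_samples, var_samples):
    """Monte Carlo predictive probability p(y* = 1 | x*, y) for a probit GP:
    each posterior sample of the hyper-parameters yields a Gaussian
    approximation N(mu, var) of the latent function at x*, for which the
    probit integral is Phi(mu / sqrt(1 + var)); the estimate is the mean
    of these per-sample probabilities."""
    mu = np.asarray(mu_samples, dtype=float)
    var = np.asarray(var_samples, dtype=float)
    return float(norm.cdf(mu / np.sqrt(1.0 + var)).mean())
```

Averaging probabilities (rather than latent means) is what propagates the hyper-parameter uncertainty into the predictions, and is where the MCMC classifiers differ from the single-point EP ML predictor.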
MCMC PM EP, instead, requires samples from the posterior distribution over the latent variables in addition to the hyper-parameters. Drawing samples from p(f|y, θ) requires computations in O(n²), as previously discussed. For each sample from the posterior distribution over latent variables and hyper-parameters, all the other computations are at most in O(n²), following similar arguments as in the case of EP ML; therefore, similarly to MCMC EP, the complexity scales linearly with the number of MCMC samples and quadratically with n.

CONCLUSIONS
This paper presented a methodology that enables the fully Bayesian treatment of GP models, using probit regression as a working example, building upon existing approximate methods for integrating out the latent variables. The key element in this paper is the adoption of the pseudo-marginal approach to devise a correct MCMC sampling scheme for the hyper-parameters of the covariance of the Gaussian Process prior based on an approximation of the marginal density of the observations given the hyper-parameters. The resulting sampling scheme is simple, and it builds on approximate methods that are currently very popular.

The results indicate that the proposed methodology leads to an MCMC approach whose chains are characterized by high convergence speed and high efficiency. This is an important feature, and a step forward toward making fully Bayesian inference based on Gaussian Processes a concrete possibility for many applications with little user intervention. The overall efficiency is driven by the MH proposal for the hyper-parameters, which can be inefficient for models with several hyper-parameters; a matter of current investigation is the study of alternative proposal mechanisms that avoid the erratic behavior of random-walk exploration.

In support vector based classifiers, hyper-parameters are optimized by minimizing the cross-validation error across a set of candidate values. It is clear that for small data sets, or for covariance functions with a large number of hyper-parameters, this procedure becomes infeasible. The proposed approach, instead, yields a natural way to infer the hyper-parameters from data and to integrate the uncertainty in them when making predictions.
The comparison with other state-of-the-art probabilistic classifiers that are commonly employed in the Machine Learning community shows that accounting for the posterior over the hyper-parameters is extremely beneficial, especially for small data sets.

In terms of scalability, the main computational bottleneck is the computation of the GP prior density, which requires the factorization of the covariance matrix, an O(n³) operation. The same considerations apply to all GP classifiers that use approximate methods to integrate out the latent variables, so we argue that running an efficient sampling procedure for the hyper-parameters, rather than an optimization strategy, does not dramatically increase the computational overhead, while making the classification performance much more reliable.

We believe that the results presented here can be extended to other latent Gaussian models, such as Log-Gaussian Cox process models [48] and Ordinal Regression with GP priors [49]. Finally, it is possible to extend the proposed PM MCMC approach to deal with GP models characterized by a sparse inverse covariance, which are popular when analyzing spatio-temporal data. In this case, it is possible to exploit the sparsity of the inverse covariance of the GP, yielding a fast-mixing and efficient MCMC approach capable of processing large amounts of data.

ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their critical and constructive comments and suggestions. Mark Girolami is supported by an EPSRC Established Career Research Fellowship EP/J016934/2, a Royal Society Wolfson Research Merit Award, and EPSRC Project grants ENGAGE EP/K015664/2 and EQUIP EP/K034154/1. This work is dedicated to Stefano Filippone.

REFERENCES

[1] C. Cortes and V. Vapnik, "Support Vector Networks," Machine Learning, vol. 20, pp. 273–297, 1995.
[2] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc., 1995.
[3] C. E. Rasmussen and J. Q. Candela, "Healing the relevance vector machine through augmentation," in Proceedings of the 22nd International Conference on Machine Learning, ser. ICML '05. New York, NY, USA: ACM, 2005, pp. 689–696.
[4] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.
[5] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[6] A. Bosch, A. Zisserman, and X. Muñoz, "Scene Classification Using a Hybrid Generative/Discriminative Approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 712–727, 2008.
[7] M. Filippone, A. F. Marquand, C. R. V. Blain, S. C. R. Williams, J. Mourão-Miranda, and M. Girolami, "Probabilistic Prediction of Neurological Disorders with a Statistical Assessment of Neuroimaging Data Modalities," Annals of Applied Statistics, vol. 6, no. 4, pp. 1883–1905, 2012.
[8] T. Jaakkola, M. Diekhans, and D. Haussler, "A Discriminative Framework for Detecting Remote Protein Homologies," Journal of Computational Biology, vol. 7, no. 1-2, pp. 95–114, 2000.
[9] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in ECML, ser. Lecture Notes in Computer Science, C. Nedellec and C. Rouveirol, Eds., vol. 1398. Springer, 1998, pp. 137–142.
[10] G. Rätsch, S. Sonnenburg, and C. Schäfer, "Learning Interpretable SVMs for Biological Sequence Classification," BMC Bioinformatics, vol. 7, no. S-1, 2006.
[11] O. Williams, A. Blake, and R. Cipolla, "Sparse Bayesian learning for efficient visual tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1292–1304, Aug. 2005.
[12] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), 1st ed. Springer, Aug. 2007.
[13] C. K. I. Williams and D. Barber, "Bayesian classification with Gaussian processes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 1342–1351, 1998.
[14] T. P. Minka, "Expectation Propagation for approximate Bayesian inference," in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, ser. UAI '01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001, pp. 362–369.
[15] M. N. Gibbs and D. J. C. MacKay, "Variational Gaussian process classifiers," IEEE Transactions on Neural Networks and Learning Systems, vol. 11, no. 6, pp. 1458–1464, 2000.
[16] H. Rue, S. Martino, and N. Chopin, "Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 71, no. 2, pp. 319–392, 2009.
[17] M. Opper and O. Winther, "Gaussian processes for classification: Mean-field algorithms," Neural Computation, vol. 12, no. 11, pp. 2655–2684, 2000.
[18] M. Kuss and C. E. Rasmussen, "Assessing Approximate Inference for Binary Gaussian Process Classification," Journal of Machine Learning Research, vol. 6, pp. 1679–1704, 2005.
[19] H. Nickisch and C. E. Rasmussen, "Approximations for Binary Gaussian Process Classification," Journal of Machine Learning Research, vol. 9, pp. 2035–2078, Oct. 2008.
[20] B. Cseke and T. Heskes, "Approximate Marginals in Latent Gaussian Models," Journal of Machine Learning Research, vol. 12, pp. 417–454, 2011.
[21] M. Filippone, M. Zhong, and M. Girolami, "A comparative evaluation of stochastic-based inference methods for Gaussian process models," Machine Learning, vol. 93, no. 1, pp. 93–114, 2013.
[22] I. Murray and R. P. Adams, "Slice sampling covariance hyperparameters of latent Gaussian models," in NIPS, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Curran Associates, 2010, pp. 1732–1740.
[23] L. Knorr-Held and H. Rue, "On Block Updating in Markov Random Field Models for Disease Mapping," Scandinavian Journal of Statistics, vol. 29, no. 4, pp. 597–614, Dec. 2002.
[24] C. Andrieu and G. O. Roberts, "The pseudo-marginal approach for efficient Monte Carlo computations," The Annals of Statistics, vol. 37, no. 2, pp. 697–725, Apr. 2009.
[25] M. A. Beaumont, "Estimation of Population Growth or Decline in Genetically Monitored Populations," Genetics, vol. 164, no. 3, pp. 1139–1160, Jul. 2003.
[26] R. M. Neal, "Regression and classification using Gaussian process priors (with discussion),"
Bayesian Statistics , vol. 6, pp. 475–501,1999.[27] M. B. Taylor and J. P. Diggle, “INLA or MCMC? A Tutorial andComparative Evaluation for Spatial Prediction in log-GaussianCox Processes,” 2012, arXiv:1202.1738.[28] C. C. Chang and C. J. Lin, “LIBSVM: A library for support vectormachines,”
ACM Transactions on Intelligent Systems and Technology ,vol. 2, no. 3, 2011.[29] D. J. C. Mackay, “Bayesian methods for backpropagation net-works,” in
Models of Neural Networks III , E. Domany, J. L. vanHemmen, and K. Schulten, Eds. Springer, 1994, ch. 6, pp. 211–254.[30] L. Tierney and J. B. Kadane, “Accurate Approximations for Pos-terior Moments and Marginal Densities,”
Journal of the AmericanStatistical Association , vol. 81, no. 393, pp. 82–86, 1986.[31] I. Murray, R. P. Adams, and D. J. C. MacKay, “Elliptical slicesampling,”
Journal of Machine Learning Research - Proceedings Track ,vol. 9, pp. 541–548, 2010.[32] R. M. Neal, “Slice Sampling,”
Annals of Statistics , vol. 31, pp. 705–767, 2003.[33] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “HybridMonte Carlo,”
Physics Letters B , vol. 195, no. 2, pp. 216–222, 1987.[34] R. M. Neal, “Probabilistic inference using Markov chain MonteCarlo methods,” Dept. of Computer Science, University ofToronto, Tech. Rep. CRG-TR-93-1, Sep. 1993.[35] M. Girolami and B. Calderhead, “Riemann manifold Langevinand Hamiltonian Monte Carlo methods,”
Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , vol. 73, no. 2,pp. 123–214, Mar. 2011.[36] Y. Yu and X.-L. Meng, “To Center or Not to Center: That Is Notthe Question–An Ancillarity-Sufficiency Interweaving Strategy(ASIS) for Boosting MCMC Efficiency,”
Journal of Computationaland Graphical Statistics , vol. 20, no. 3, pp. 531–570, 2011.[37] O. Papaspiliopoulos, G. O. Roberts, and M. Sk¨old, “A generalframework for the parametrization of hierarchical models,”
Sta-tistical Science , vol. 22, no. 1, pp. 59–73, 2007. [38] C. P. Robert and G. Casella,
Monte Carlo Statistical Methods(Springer Texts in Statistics) . Secaucus, NJ, USA: Springer-VerlagNew York, Inc., 2005.[39] N. Friel and A. N. Pettitt, “Marginal likelihood estimation viapower posteriors,”
Journal of the Royal Statistical Society: Series B(Statistical Methodology) , vol. 70, no. 3, pp. 589–607, 2008.[40] J. Skilling, “Nested sampling for general Bayesian computation,”
Bayesian Analysis , vol. 1, no. 4, pp. 833–860, 2006.[41] A. Gelman and D. B. Rubin, “Inference from iterative simulationusing multiple sequences,”
Statistical Science , vol. 7, no. 4, pp.457–472, 1992.[42] J. Geweke, “Getting it right: joint distribution tests of posteriorsimulators,”
Journal of the American Statistical Association , vol. 99,no. 467, pp. 799–804, 2004.[43] W. R. Gilks and D. J. Spiegelhalter,
Markov chain Monte Carlo inpractice . Chapman & Hall/CRC, 1996.[44] A. Asuncion and D. J. Newman, “UCI machine learning reposi-tory,” 2007.[45] M. Filippone, “Bayesian inference for Gaussian process clas-sifiers with annealing and exact-approximate MCMC,” 2013,arXiv:1311.7320.[46] R. M. Neal, “Annealed importance sampling,”
Statistics and Com-puting , vol. 11, no. 2, pp. 125–139, Apr. 2001.[47] C. Ferri and J. Hern´andez-Orallo, “Cautious Classifiers,” in
RO-CAI , J. Hern´andez-Orallo, C. Ferri, N. Lachiche, and P. A. Flach,Eds., 2004, pp. 27–36.[48] J. Møller, A. R. Syversveen, and R. P. Waagepetersen, “LogGaussian Cox Processes,”
Scandinavian Journal of Statistics , vol. 25,no. 3, pp. 451–482, 1998.[49] W. Chu and Z. Ghahramani, “Gaussian Processes for OrdinalRegression,”
Journal of Machine Learning Research , vol. 6, pp. 1019–1041, Dec. 2005.
Maurizio Filippone
Maurizio Filippone received a Master's degree in Physics and a Ph.D. in Computer Science from the University of Genova, Italy, in 2004 and 2008, respectively. In 2007, he was a Research Scholar with George Mason University, Fairfax, VA. From 2008 to 2011, he was a Research Associate with the University of Sheffield, U.K. (2008-2009), with the University of Glasgow, U.K. (2010), and with University College London, U.K. (2011). He is currently a Lecturer with the University of Glasgow, U.K. His current research interests include statistical methods for pattern recognition.

Dr Filippone serves as an Associate Editor for Pattern Recognition and the IEEE Transactions on Neural Networks and Learning Systems.