A New Monte Carlo Based Algorithm for the Gaussian Process Classification Problem
Amir F. Atiya
Department of Computer Engineering, Cairo University, Cairo, Egypt

A. Fayed
Department of Engineering Mathematics and Physics, Cairo University, Cairo, Egypt

H. Abdel-Gawad
Department of Electrical Engineering, Purdue University
Abstract
Gaussian processes are a very promising technology that has been applied to both the regression problem and the classification problem. While for the regression problem they yield simple exact solutions, this is not the case for the classification problem, because we encounter intractable integrals. In this paper we develop a new derivation that transforms the problem into that of evaluating the ratio of multivariate Gaussian orthant integrals. Moreover, we develop a new Monte Carlo procedure that evaluates these integrals. It is based on some aspects of bootstrap sampling and acceptance-rejection. The proposed approach has beneficial properties compared to the existing Markov Chain Monte Carlo approach, such as simplicity, reliability, and speed.

1 Introduction
Gaussian process classification (GPC) is a promising Bayesian approach to the classification problem. It is based on defining a so-called latent variable for every pattern, and setting the prior based on proximity relations among the patterns. Based on the prior and through a series of integrals, the posterior of a test pattern will determine its classification (see the reviews of Rasmussen and Williams [39], Nickisch and Rasmussen [36], and Seeger [43]). Unfortunately, the resulting multi-integral formula is intractable, and can only be solved through some approximations or using lengthy algorithms. In spite of its superior performance (as has been pointed out in several application studies, such as by Altun, Hofmann, and Smola [2], Jenssen et al [24], and Bazi and Melgani [4]), this computational issue hampers its wide applicability to large data sets. In this paper, we propose a new simplified, but exact, formula for the binary GPC problem. The proposed method is based on applying some substitutions and transformations that convert the problem into that of evaluating a ratio of two orthant integrals of a multivariate Gaussian density. Moreover, we develop a new Monte-Carlo-type algorithm to evaluate these orthant integrals. The proposed algorithm is based on some aspects of bootstrap sampling and acceptance-rejection.

The approaches in the literature for the GPC problem can be categorized into solutions based on analytical approximation of the integrals and solutions based on Monte Carlo sampling (see the extensive review of Nickisch and Rasmussen [36]). Among the well-known methods from the first category are the Laplace approximation (Williams and Barber [56]) and expectation propagation (Minka [31]). Other efficient approximation-based methods include the work of Csató et al [6], Opper and Winther [37], Gibbs and MacKay [15], Rifkin and Klautau [42], Jaakkola and Haussler [23], and Kim and Ghahramani [27]. Work on the second category has been more scarce. Almost all of the approaches are based on the Markov Chain Monte Carlo (MCMC) concept. Neal's so-called Annealed Importance Sampling (AIS) [34] uses an approximate posterior, rather than the prior, as a starting point for evaluating the marginal likelihood (to be described shortly). The
Hybrid Monte Carlo (HMC) (Neal [35]) is based on the concept of "importance density". Murray, Adams, and MacKay [33] proposed the Elliptical Slice Sampler (ESS). It is based on sampling over an elliptical slice to efficiently obtain the step size. Titsias, Lawrence and Rattray [48] introduced a novel MCMC approach by making use of a low dimensional vector of control variables (see also Titsias and Lawrence [49]). Vehtari, Särkkä, and Lampinen [54] proposed a novel way to choose effective starting values for the MCMC based on early stopping. Barber and Williams [3] proposed a hybrid method that uses partly an approximation and partly the MCMC procedure.

Nickisch and Rasmussen [36] provided a comprehensive review and comparison between the different GPC approximations and the MCMC approaches. They came to the conclusion that the MCMC-type approaches are superior in performance, but of course computationally more extensive. This is because they can obtain the exact solution of the integral formula, provided the size of the run is large enough. Kuss and Rasmussen [29] compared a number of GPC approximations. Rather than improving the approximation ability or the computation speed, some researchers considered short-cuts to the GPC model itself to achieve better efficiency or performance, such as sparse GPC models (see Csató and Opper [7], Vanhatalo and Vehtari [53], and Titsias and Lawrence [47]), and low-dimensional manifold embedding (see Urtasun and Darrell [51]). The majority of the work in the literature and the above reviewed methods consider the binary classification case. Nevertheless, some researchers extended the GPC problem to the multi-class case (for example Girolami and Rogers [16], Hernández-Lobato et al [20] and Seeger and Jordan [44]).

The method proposed in this paper fits more into the second category described above (i.e. exact Monte Carlo-based), but it is not based on the MCMC concept. It is guaranteed to converge as close as possible to the exact solution of the multi-integral formula, provided we use a large enough sample of generated points. The advantages of the proposed algorithm are that it does not require any parameter tuning (other than specifying the number of Monte Carlo generated points), and that it is consistent and reliable. (In short, it worked every time; we tested hundreds of problems, some as high as 2000-dimensional.) It also compares favorably in terms of speed and accuracy to the other MCMC approaches, especially for the evaluation of the marginal likelihood. The marginal likelihood is an expression for the likelihood of the data given the hyperparameters. The hyperparameters are typically tuned by optimizing the marginal likelihood function. Because of repeated evaluations of the likelihood function, this step is the most time-consuming part, and the speed-up provided by the proposed algorithm leads to a significant computational benefit. A beneficial aspect of the proposed integral formulation is that it gives many insights into the different influencing factors. For example, one can obtain the limiting behavior of the covariance matrix parameters, and therefore understand the classification behavior when moving their values in certain directions. Another advantage of the integral formulation is that it is given in terms of multivariate Gaussian orthant integrals. These are well-researched integrals, and several approximations exist in the literature. So, this could possibly open the way for new competitive approximations to the GPC problem.

The paper is organized as follows. In the next section we present an overview and definition of the GPC problem. In Section 3 we propose the new formulation that is obtained by simplifying the multi-integral formula. Section 4 presents an overview of the multivariate Gaussian integral that has to be evaluated. In Section 5 we propose the new Monte Carlo model for evaluating the integral. Section 6 provides the experimental study to assess the new model, and Section 7 concludes the paper.
2 Overview of the GPC Problem

Gaussian process classifiers (GPC) are based on defining a "latent state" f_i for every training pattern. It is a central variable in the formulation which measures some sort of degree of membership to one of the classes. Let y_i denote the class membership of training pattern i, where y_i = 1 denotes Class 1 and y_i = -1 denotes Class 2. The latent variable f_i, whose range is from -∞ to ∞, is mapped into a class posterior probability through a monotone squashing function σ that has a range of (0, 1):

J = P(y_i = 1 | f_i) = σ(f_i)    (1)

There are two typical forms for σ in the GPC literature: the logit (or logistic function) and the probit (or cumulative Gaussian integral). As argued by Nickisch and Rasmussen [36], both choices are effectively quite similar. In this work we consider only the probit function. In what follows we will closely follow the terminology of Rasmussen and Williams [39]. Let us arrange the latent variables and the class memberships in one vector each: f = (f_1, ..., f_N)^T and y = (y_1, ..., y_N)^T. Note that each index of the aforementioned vectors pertains to a specific training pattern, and N is the size of the training set. Let x_i be the feature vector of training pattern i. Moreover, let us arrange all training vectors x_i as rows in a matrix X. Let x_* denote the feature vector of the test pattern, whose class needs to be evaluated. Let its latent state be f_*.

The latent state vector f obeys an a priori density that is assumed to be a multivariate Gaussian (hence the name Gaussian processes). From an a priori point of view, patterns that are close (in the feature space) are more likely to belong to the same class. So this prior density is selected to reflect that property. Patterns with nearby feature vectors have highly correlated latent variables f_i, and as the patterns become more distant the correlation decays. The a priori density can be written as

p(f | X) = N(f; 0, Σ)    (2)

where N(f; μ, Σ) denotes a Gaussian density of variable f having mean vector μ and covariance matrix Σ. The covariance matrix has elements that are a function of the distance between two feature vectors ||x_i - x_j||, and it is designed so as to achieve the aforementioned correlation behavior (see Rasmussen and Williams [39] for a detailed discussion and examples of covariance functions). A particularly prevalent choice is the so-called "RBF" covariance matrix, given by:

Σ_ij = β exp( -||x_i - x_j||^2 / α )    (3)

The α and β parameters (called respectively the length scale and the latent function scale) are very influential in the performance of the classifier, and tuning them has to be done with care (see Sundararajan and Keerthi [46]). More will be said about them later.

A test pattern's latent variable f_* has a similar correlation structure with the training patterns. The augmented vector of the training latent states and the test point latent state is also Gaussian, as follows:

\begin{pmatrix} f \\ f_* \end{pmatrix} \sim N\left( \begin{pmatrix} f \\ f_* \end{pmatrix}; 0, \begin{pmatrix} \Sigma & \Sigma_{Xx_*} \\ \Sigma_{Xx_*}^T & \Sigma_{x_*x_*} \end{pmatrix} \right)    (4)

where Σ_{Xx_*} is the covariance between the training latent variables and the test latent variable (it is a vector), and Σ_{x_*x_*} is the variance of f_*.
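To make the prior construction concrete, the following short Python sketch builds the covariance matrix of Eq. (3) and the augmented covariance of Eq. (4) for a single test point. It is a minimal illustration assuming the squared-exponential form written above, with hypothetical values for α and β; the function and variable names are ours, not from a published implementation.

    import numpy as np

    def rbf_cov(A, B, alpha, beta):
        """Covariance block between the rows of A and the rows of B, Eq. (3)."""
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return beta * np.exp(-sq_dists / alpha)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))          # five 2-D training patterns
    x_star = rng.normal(size=(1, 2))     # one test pattern

    alpha, beta = 1.0, 1.0               # hypothetical hyperparameter values
    Sigma = rbf_cov(X, X, alpha, beta)                   # N x N training covariance
    Sigma_Xxs = rbf_cov(X, x_star, alpha, beta)          # N x 1 cross covariance
    Sigma_xsxs = rbf_cov(x_star, x_star, alpha, beta)    # variance of f_*

    # augmented covariance of (f, f_*), Eq. (4)
    Sigma_aug = np.block([[Sigma, Sigma_Xxs],
                          [Sigma_Xxs.T, Sigma_xsxs]])
    print(Sigma_aug.shape)    # (6, 6)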
A key to estimating the class membership of the test point is to evaluate the probability density of its latent state f_*, conditional on all the information that is available from the training set:

p(f_* | X, y, x_*) = \int p(f_* | X, x_*, f) \, p(f | X, y) \, df    (5)

where p(f | X, y) = p(y | f) p(f | X) / p(y | X) is the posterior of the latent variables. Then, we compute the probability of Class 1 averaged over the conditional density of f_*:

J^* \equiv p(y_* = +1 | X, y, x_*) = \int \sigma(f_*) \, p(f_* | X, y, x_*) \, df_*    (6)

where we used the fact that σ(f_*) signifies the conditional given in Eq. (1). We get

p(f_* | X, y, x_*) = \frac{ \int p(f_* | X, x_*, f) \, p(y | f) \, p(f | X) \, df }{ p(y | X) }    (7)

where

p(y | f) = \prod_{i=1}^{N} p(y_i | f_i) = \prod_{i=1}^{N} \sigma(y_i f_i) = \prod_{i=1}^{N} \int_{-\infty}^{y_i f_i} \frac{e^{-x^2/2}}{\sqrt{2\pi}} \, dx    (8)

where we used the fact that σ is the probit function (integral of the Gaussian function). Note that P(y_i = -1 | f_i) = 1 - P(y_i = 1 | f_i) = 1 - σ(f_i) = σ(-f_i) = σ(y_i f_i) because of the point symmetry of σ. Also,

p(f_* | X, x_*, f) = N(f_*; a^T f, \sigma_*^2)    (9)

where

a = \Sigma^{-1} \Sigma_{Xx_*}    (10)

\sigma_*^2 = \Sigma_{x_*x_*} - \Sigma_{Xx_*}^T \Sigma^{-1} \Sigma_{Xx_*}    (11)

We utilized formulas expressing the conditional of a multidimensional Gaussian distribution, applied to Eq. (4). All past formulas follow from straightforward probability manipulations, and they are described clearly in Rasmussen and Williams [39], p. 16, Eq. 2.19. Equation (6), representing the posterior probability corresponding to the test pattern x_*, is the main quantity needed to classify the pattern: the pattern is assigned to Class 1 if J^* ≥ 0.5. Another key quantity is the marginal likelihood L, defined as

L = p(y | X)    (12)

It is an important quantity for the purpose of tuning the two hyperparameters (α and β). By maximizing the marginal likelihood, we arrive at hyperparameters that are most consistent with the observed data. As such, any method should also be able to efficiently evaluate the marginal likelihood. See Rasmussen and Williams [39] for more information about the marginal likelihood.

3 The Proposed Formulation

Evaluating Equations (6), (7) analytically is very hard to accomplish. The difficulty arises also when attempting to evaluate them numerically because of the high dimensionality of the integrals (for example, for a problem with a training set of size 1000 we are dealing with more than a thousand-fold integral). Also, attempting a standard Monte Carlo evaluation leads to some practical problems, among them the fact that \prod_{i=1}^{N} \sigma(y_i f_i) usually turns out to be a very small number (with a negative exponent of very large magnitude). So, to summarize, we are dealing with a very hard problem if an exact solution is to be sought.
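To illustrate the difficulty with a direct Monte Carlo attack on Eq. (6), the short Python sketch below draws one latent vector from the prior of Eq. (2) and evaluates log ∏_i σ(y_i f_i). Even for a few hundred training patterns the product is astronomically small, so a plain sample average of such terms is numerically useless. This is only an illustration on assumed synthetic data, with our own naming and a hypothetical α = β = 1.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    N = 300
    X = rng.normal(size=(N, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)       # synthetic labels

    # squared-exponential prior covariance, Eq. (3), with assumed alpha = beta = 1
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Sigma = np.exp(-sq) + 1e-8 * np.eye(N)

    f = rng.multivariate_normal(np.zeros(N), Sigma)      # one draw from the prior, Eq. (2)
    log_prod = norm.logcdf(y * f).sum()                  # log of the product in Eq. (8)
    print(log_prod)   # a large negative number: the product itself underflows to 0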
Here we develop a procedure that transforms the problem into the more approachable form of integrals of multivariate Gaussian functions. Specifically, we perform some substitutions and transformations of variables that convert the problem into that of evaluating orthant integrals of a multivariate Gaussian density. By an orthant integral we mean an integral of a zero-mean multivariate Gaussian function over some quadrant, e.g. over x ≥ 0. The detailed steps are given below.

3.1 The Main Derivation

Substituting (8), (2), (9) and (7) into (6), we obtain:

J^* = \frac{1}{p(y|X)} \int_{f_*=-\infty}^{\infty} \left[ \int_{u=-\infty}^{f_*} \frac{e^{-u^2/2}}{\sqrt{2\pi}} du \right] \left[ \prod_{i=1}^{N} \int_{z_i=-\infty}^{y_i f_i} \frac{e^{-z_i^2/2}}{\sqrt{2\pi}} dz_i \right] N(f; 0, \Sigma) \, N(f_*; a^T f, \sigma_*^2) \, df_1 \cdots df_N \, df_*    (13)

Rearranging, we get

J^* = \frac{ \int_{f_*=-\infty}^{\infty} \int_{u=-\infty}^{f_*} \int_{f_1=-\infty}^{\infty} \cdots \int_{f_N=-\infty}^{\infty} \int_{z_1=-\infty}^{y_1 f_1} \cdots \int_{z_N=-\infty}^{y_N f_N} e^{-W/2} \, dz_1 \cdots dz_N \, df_1 \cdots df_N \, du \, df_* }{ (2\pi)^{N+1} \sqrt{\sigma_*^2 |\Sigma|} \; p(y|X) }    (14)

where

W = u^2 + \sum_{i=1}^{N} z_i^2 + f^T \Sigma^{-1} f + \frac{(f_* - a^T f)^2}{\sigma_*^2}    (15)

Rewriting W in matrix form:

W = [u, z_1, \ldots, z_N] \, [u, z_1, \ldots, z_N]^T + f^T \Sigma^{-1} f + [f_*, f_1, \ldots, f_N] \, B \, [f_*, f_1, \ldots, f_N]^T

where B = \frac{1}{\sigma_*^2} \begin{pmatrix} 1 & -a^T \\ -a & a a^T \end{pmatrix}. The problem with this integral (J^*) is that some of its variables (f_*, f_1, ..., f_N) occur in the limits of the integrals. A transformation can fix this problem by using the substitution (see the preliminary work of [1]):

v \equiv \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_{2N+2} \end{pmatrix} = \begin{pmatrix} -1 & 0 & 1 & 0 \\ 0 & -I & 0 & C' \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & I \end{pmatrix} \begin{pmatrix} u \\ z \\ f_* \\ f \end{pmatrix}    (16)

where z = (z_1, ..., z_N)^T and C' = diag(y_1, ..., y_N). In other words, v_1 = f_* - u, v_{i+1} = y_i f_i - z_i for i = 1, ..., N, v_{N+2} = f_*, and v_{N+2+i} = f_i. We get

J^* = \frac{ \int_{v_1=0}^{\infty} \cdots \int_{v_{N+1}=0}^{\infty} \int_{v_{N+2}=-\infty}^{\infty} \cdots \int_{v_{2N+2}=-\infty}^{\infty} e^{-v^T D v / 2} \, dv_1 \cdots dv_{2N+2} }{ (2\pi)^{N+1} \sqrt{\sigma_*^2 |\Sigma|} \; p(y|X) }    (17)

where

D = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & I & 0 & -C' \\ -1 & 0 & 1 + \frac{1}{\sigma_*^2} & -\frac{a^T}{\sigma_*^2} \\ 0 & -C' & -\frac{a}{\sigma_*^2} & I + \Sigma^{-1} + \frac{a a^T}{\sigma_*^2} \end{pmatrix}    (18)

where C' = diag(y_1, ..., y_N) and I is the identity matrix (in both cases in the formula it is N × N). The integration can then be put in the form:

J^* = \frac{1}{\sqrt{|D| \, |\Sigma| \, \sigma_*^2} \; p(y|X)} \int_{orth} N(v; 0, D^{-1}) \, dv    (19)

This integral is called an orthant normal integral, where the orthant is defined over v_1, ..., v_{N+1} ≥ 0, -∞ < v_{N+2}, ..., v_{2N+2} < ∞. Let us denote this orthant by orth. Consider now the term p(y|X). Integrating (7) w.r.t. f_* from -∞ to ∞, and using the fact that \int p(f_* | X, y, x_*) df_* = 1, we get

p(y|X) = \int_{-\infty}^{\infty} \int_{f} p(f_* | X, x_*, f) \, p(y | f) \, p(f | X) \, df \, df_* = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2\pi}} dx \right] \int_{f} p(f_* | X, x_*, f) \, p(y | f) \, p(f | X) \, df \, df_*    (20)

The integral \int_{-\infty}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2\pi}} dx in the previous equation is inserted on purpose. It equals 1, so it does not alter the formula. The above integral will be the same as (19) except that the limits will be over the orthant defined by v_2, ..., v_{N+1} ≥ 0, -∞ < v_1, v_{N+2}, ..., v_{2N+2} < ∞. The reason is that the analysis that led to (19) applies here, except that one of the integral's limits is different (for the Gaussian integral with variable x inserted in (20), the upper limit is ∞ instead of f_*). Let us denote this orthant by orth+. The expression for the posterior will then be given as

J^* = p(y_* = 1 | X, y, x_*) = \frac{ \int_{orth} N(v; 0, D^{-1}) \, dv }{ \int_{orth+} N(v; 0, D^{-1}) \, dv } = \frac{I_1}{I_2}    (21)
3.2 Further Reduction

The limits for the portion (v_{N+2}, ..., v_{2N+2})^T in Expression (21) are from -∞ to ∞, so these variables can be integrated out. This means that the expression can be further reduced from a 2N+2 dimensional integral to an N+1 dimensional integral. The detailed steps of this reduction are given below.

Let v = (v_1, ..., v_{N+1})^T and v' = (v_{N+2}, ..., v_{2N+2})^T. We can write:

I_1 = k \int_{v \ge 0} \int_{-\infty \le v' \le \infty} \frac{ e^{-\frac{1}{2}\left( v^T A_1 v - 2 v^T A v' + v'^T A_2 v' \right)} }{ (2\pi)^{\frac{N+1}{2}} |A_2|^{-\frac{1}{2}} } \, dv' \, dv    (22)

where the D matrix defined in (18) is written in the 2 × 2 block form D = \begin{pmatrix} A_1 & -A \\ -A & A_2 \end{pmatrix}, with A_1 = I, A = \begin{pmatrix} 1 & 0 \\ 0 & C' \end{pmatrix}, A_2 = \begin{pmatrix} 1 + \frac{1}{\sigma_*^2} & -\frac{a^T}{\sigma_*^2} \\ -\frac{a}{\sigma_*^2} & I + \Sigma^{-1} + \frac{a a^T}{\sigma_*^2} \end{pmatrix}, and k = (2\pi)^{-\frac{N+1}{2}} |A_2|^{-\frac{1}{2}} |D|^{\frac{1}{2}}. Some manipulations lead to:

I_1 = k \int_{v \ge 0} \left[ \int_{-\infty \le v' \le \infty} \frac{ e^{-\frac{1}{2} (v' - A_2^{-1} A v)^T A_2 (v' - A_2^{-1} A v)} }{ (2\pi)^{\frac{N+1}{2}} |A_2|^{-\frac{1}{2}} } \, dv' \right] e^{-\frac{1}{2}\left( v^T A_1 v - v^T A A_2^{-1} A v \right)} \, dv

Now the inside integral w.r.t. v' equals 1. We get

I_1 = k \, (2\pi)^{\frac{N+1}{2}} |A_3|^{-\frac{1}{2}} \int_{v \ge 0} N(v; 0, A_3^{-1}) \, dv    (23)

where

A_3 = I - A A_2^{-1} A    (24)

A similar formula applies for I_2, but with integration limits given by -∞ < v_1 < ∞ and v_2, ..., v_{N+1} ≥ 0. It can further be shown that

A_3^{-1} \equiv R = \begin{pmatrix} 1 + \Sigma_{x_*x_*} & \Sigma_{Xx_*}^T C' \\ C' \Sigma_{Xx_*} & C'(I + \Sigma)C' \end{pmatrix}    (25)
                 = I + A \Sigma' A    (26)

where Σ' is the composite covariance matrix (of training patterns plus testing pattern, see Eq. (4)):

\Sigma' = \begin{pmatrix} \Sigma_{x_*x_*} & \Sigma_{Xx_*}^T \\ \Sigma_{Xx_*} & \Sigma \end{pmatrix}    (27)

The final classification posterior probability is thus given by

J^* = p(y_* = 1 | X, y, x_*) = \frac{ \int_{orth} N(v; 0, I + A \Sigma' A) \, dv }{ \int_{orth+} N(v; 0, I + A \Sigma' A) \, dv } = \frac{I_1}{I_2}    (28)

where orth means integration over v ≥ 0, and orth+ means integration over the region given by -∞ < v_1 < ∞ and v_2, ..., v_{N+1} ≥ 0.
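As a concrete illustration of the final formulation, the following Python sketch assembles the matrix R = I + AΣ'A of Eqs. (25)-(28) from a kernel matrix and the class labels. It is a minimal sketch under the assumptions made earlier (squared-exponential kernel, hypothetical hyperparameter values), and all names are ours.

    import numpy as np

    def orthant_covariance(Sigma_prime, y):
        """R = I + A Sigma' A, Eq. (26), with A = diag(1, y_1, ..., y_N)."""
        signs = np.concatenate(([1.0], y))        # leading 1 for the test pattern
        A = np.diag(signs)
        return np.eye(len(signs)) + A @ Sigma_prime @ A

    # toy problem: N = 4 training patterns, one test pattern, 2-D features
    rng = np.random.default_rng(2)
    X = rng.normal(size=(4, 2))
    y = np.array([1.0, -1.0, 1.0, -1.0])
    x_star = rng.normal(size=(1, 2))

    def k(A, B, alpha=1.0, beta=1.0):             # assumed kernel form, Eq. (3)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return beta * np.exp(-d2 / alpha)

    # composite covariance Sigma' of (f_*, f), Eq. (27)
    Sigma_prime = np.block([[k(x_star, x_star), k(x_star, X)],
                            [k(X, x_star),      k(X, X)]])
    R = orthant_covariance(Sigma_prime, y)
    # J* in Eq. (28) is the ratio of the N(0, R) orthant probability over {v >= 0}
    # (numerator) to that over {v_2, ..., v_{N+1} >= 0, v_1 free} (denominator).
    print(R.shape)    # (5, 5)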
The quantity p(y|X), which represents the marginal likelihood as described in Eq. (12) and beyond, is basically I_2, the denominator of the expression for J^*. Thus,

L = p(y|X) = \int_{orth+} N(v; 0, I + A \Sigma' A) \, dv    (29)
           = \int_{orth} N(v'; 0, C'(I + \Sigma)C') \, dv'    (30)

The last identity is obtained by noting that the limit for v_1 in Eq. (29) stretches from -∞ to ∞, and therefore v_1 can be integrated out. The vector v' ≡ (v_2, ..., v_{N+1})^T corresponds to the variables of the training set, while v_1 corresponds to the test pattern.

The denominator, as mentioned, represents the marginal likelihood p(y|X). Essentially, the marginal likelihood, as also observed from the integral, measures how well the class memberships y_i fit with the covariance structure. If the patterns of 1's and -1's multiplied with the components of Σ' (through A) emphasize the large covariance elements (through agreeing class memberships, i.e. y_i y_j = 1) and deemphasize the small covariance elements (through disagreeing class memberships, i.e. the ones multiplied by the factor y_i y_j = -1), then the orthant probability, and hence the marginal likelihood, will be large. Observing Eq. (28), we see that J^* is the ratio of the integral over one orthant to the integral over orth+, which basically extends over two orthants. This makes the expression appropriately smaller than 1. Observing the numerator, we find that the predominant signs (in y_i and therefore also A) that multiply large covariance elements (in Σ_{Xx_*}) determine whether the numerator has a large value (therefore the pattern should be classified as Class 1), or a small value (therefore the pattern should be classified as Class 2). Note that the identity matrix component of the covariance matrix in Eq. (26) is some kind of regularizing factor that acts to "fatten" the density.

Let us now consider the effects of the hyperparameters. Consider the RBF kernel (Eq. (3)). The analysis here complements and in many ways confirms the insightful analysis of Nickisch and Rasmussen [36]. When either the latent function scale β → 0 or the length scale α → 0, the covariance matrix Σ' becomes essentially diagonal, so the matrix R in Eq. (26) tends to a diagonal matrix, the components of v become independent, and the orthant probabilities factorize. So, we end up with L = 2^{-N} and J^* = 1/2. If β → ∞, then the identity matrix part of the covariance matrix becomes negligible compared to the other part (see Eq. (26)), and we get a formula similar to Eq. (28), but without the identity matrix added in the covariance expression. If α → ∞, then we get the following formula (to avoid distraction with side issues, the proof is not given here):

J^* = \frac{ \int_{-\infty}^{\infty} e^{-u^2/2} \, \sigma(\sqrt{\beta} u)^{N_1+1} \left[ 1 - \sigma(\sqrt{\beta} u) \right]^{N_2} du }{ \int_{-\infty}^{\infty} e^{-u^2/2} \, \sigma(\sqrt{\beta} u)^{N_1} \left[ 1 - \sigma(\sqrt{\beta} u) \right]^{N_2} du }    (31)

where σ is the cumulative Gaussian integral (i.e. the integral of the one-dimensional Gaussian density), and N_1 (N_2) is the number of Class 1 (Class 2) training patterns. Essentially, this gives some kind of a "soft" counting procedure, without regard to the distances involved. One can see that all test patterns will be classified as only one specific class (the one that wins the counting game).
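For intuition, the α → ∞ limit can be evaluated with ordinary one-dimensional quadrature. The Python sketch below does this for hypothetical class counts; it only illustrates the "soft counting" behavior described above and assumes the reconstruction of Eq. (31) given in the text, with our own naming.

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def j_star_alpha_inf(N1, N2, beta):
        """Limiting posterior of Eq. (31) when the length scale alpha -> infinity."""
        def integrand(u, extra):
            s = norm.cdf(np.sqrt(beta) * u)
            return np.exp(-u**2 / 2) * s**(N1 + extra) * (1 - s)**N2
        num, _ = quad(integrand, -8, 8, args=(1,))
        den, _ = quad(integrand, -8, 8, args=(0,))
        return num / den

    # hypothetical counts: 6 Class 1 patterns, 4 Class 2 patterns
    print(j_star_alpha_inf(6, 4, beta=1.0))   # > 0.5: the majority class wins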
4 The Multivariate Gaussian Orthant Integral

As can be seen, the final equations (28), (29), and (30) for the posterior probability and the marginal likelihood are given in terms of multivariate Gaussian orthant integrals. The difficulty is that these are very high dimensional integrals (the dimension equals the size of the training set), making this a formidable problem. Consider, for example, applying the standard approach of generating many points according to the multivariate density and then computing the fraction of points that fall in the considered orthant (the area of integration). Such high order integrals typically evaluate to an astronomically small number. So even if we generate trillions of points, essentially no point will happen to fall in the area of integration.

There is a large literature on the multivariate Gaussian integral. Interest in the problem started around the forties of the last century, and research has continued since then; see the reviews of Gupta [17] and [18], Johnson [25], and Genz [14]. Essentially, the state of the art is that a closed form formula exists only for dimensions up to three for the centered case (i.e. the integrals for the zero mean case where the limits are from 0, as is our case, see Eriksson [11]), and no closed form formulas exist for the non-centered case. There are some special constructions of the covariance matrices for which simplified formulas exist (for any arbitrary dimension). There have also been some series expansions for the centered case in terms of the elements of the covariance matrix (Kendall [26], and Moran [32]), or in terms of the elements of the inverse of the covariance matrix (Ribando [41]). These formulas, while very elegant and insightful, are intractable for dimensions larger than about ten, because of the huge number of combinations of powers of the N variables of the covariance matrix. A parallel track in the attempt to tackle the multivariate Gaussian integral is to apply efficient numerical integration techniques (see for example Schervish [45]). Due to the exponential nature of these methods, they are applicable for dimensions up to around 20. Another track considers improvised Monte Carlo methods (Deák [9], Genz [13], Breslaw [5] and Hajivassiliou et al [19]). Again, these methods have not been demonstrated on high dimensional problems.

5 The Proposed Monte Carlo Algorithm

5.1 The Algorithm

The proposed new Monte Carlo method tackles the high-dimensional multivariate Gaussian integral, and thereby simultaneously evaluates the posterior probability and the marginal likelihood. It combines aspects of rejection sampling and bootstrap sampling. The general idea is to first generate samples for the first variable v_1. Subsequently, we reject the points that fall outside the integral limits (for v_1). Then we replenish in place of the discarded points by sampling with replacement from the existing points (i.e. the points that have been accepted). Next, we move on to the second variable, v_2, and generate points using the conditional distribution p(v_2 | v_1). Again, we reject the points of v_2 that fall outside the integration limit, and replenish by sampling with replacement. We continue in this manner until we reach the final variable v_N. The integral value is then estimated as the product of the acceptance ratios of the N variables. Unlike MCMC-type methods, we do not need to perform additional cycles. We cycle only once through the N variables, each time generating a number M of points.

Here are the detailed steps of the algorithm (a code sketch follows the list):

1. For i = 1 to N perform the following steps.

2. If i = 1 then generate M points v_1(m) from p(v_1). Otherwise, generate M points v_i(m) according to the conditional density function

p \equiv p(v_i | v_{1:i-1}(m))    (32)

where v_{1:i-1}(m) is the m-th string of points (of variables v_1 to v_{i-1}) already generated in the previous steps.

3. Reject the points v_i(m) that are outside the area of integration, i.e. reject the points v_i(m) ≤ 0. Assume that there are M_1(i) accepted points and M_2(i) ≡ M - M_1(i) rejected points.

4. Replenish in place of the rejected points, by sampling a number M_2(i) of points with replacement from among the accepted points.

5. Once we reach the last dimension i = N, we stop, computing the multivariate integral as

I = \prod_{i=1}^{N} \left( \frac{M_1(i)}{M} \right)    (33)
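The following Python sketch implements these steps for a general positive-definite covariance R and the all-positive orthant. It recomputes the conditional moments directly with linear solves rather than with the recursions introduced later, and all names are ours; it is meant as a minimal illustration of the procedure, not as the authors' implementation.

    import numpy as np

    def orthant_log_prob(R, M=100_000, rng=None):
        """Estimate log P(v >= 0) for v ~ N(0, R) with the sequential
        rejection / bootstrap-resampling scheme of Eqs. (32)-(33)."""
        rng = np.random.default_rng(rng)
        N = R.shape[0]
        log_I = 0.0
        V = np.zeros((M, 0))                      # surviving partial strings v_{1:i-1}(m)
        for i in range(N):
            if i == 0:
                mean = np.zeros(M)
                var = R[0, 0]
            else:
                Rpp = R[:i, :i]                   # covariance of v_{1:i-1}
                r = R[:i, i]                      # cross-covariance with v_i
                w = np.linalg.solve(Rpp, r)       # b_i in the text
                mean = V @ w                      # conditional means, one per string
                var = R[i, i] - r @ w             # conditional variance sigma_i^2
            v_i = mean + np.sqrt(var) * rng.standard_normal(M)
            accept = v_i >= 0.0
            M1 = accept.sum()
            if M1 == 0:
                return -np.inf                    # no survivors; increase M
            log_I += np.log(M1 / M)               # acceptance ratio of step i
            # bootstrap: resample M full strings from the accepted ones
            keep = rng.choice(np.flatnonzero(accept), size=M, replace=True)
            V = np.column_stack([V, v_i])[keep]
        return log_I

    # quick check on a 3-D example with independent components: P = 0.5^3
    print(np.exp(orthant_log_prob(np.eye(3), M=200_000, rng=0)))   # about 0.125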
5.2 Proof

The proof that the proposed algorithm leads to an estimate of the multivariate orthant probability is essentially by construction, and is described here. The multivariate integral can be written as:

I = p(v_1 ≥ 0, v_2 ≥ 0, \ldots, v_N ≥ 0)    (34)
  = p(v_N ≥ 0 | v_1 ≥ 0, \ldots, v_{N-1} ≥ 0) \cdots p(v_2 ≥ 0 | v_1 ≥ 0) \, p(v_1 ≥ 0)    (35)

Since we generate points according to the distribution in (32), which conditions only on the surviving points that were not rejected in previous rounds, the generated points obey the distribution p(v_i | v_1 ≥ 0, \ldots, v_{i-1} ≥ 0). Hence the acceptance fraction M_1(i)/M is an estimate of the probability p(v_i ≥ 0 | v_1 ≥ 0, \ldots, v_{i-1} ≥ 0), and the product in Eq. (33) estimates the product in Eq. (35).

5.3 Example

Consider for example that we would like to compute

I = \int_{1.2}^{\infty} \int_{1.4}^{\infty} \int_{c}^{\infty} N(v; \mu, \Sigma) \, dv_3 \, dv_2 \, dv_1    (36)

where the outer integral is over v_1 (lower limit 1.2), the middle over v_2 (lower limit 1.4), and the inner over v_3 (lower limit c, whose particular value is immaterial for the illustration).

1. We generate M points from p(v_1) (let M = 10). Let these points be 2.1, 0.2, 0.6, 1.8, 2.2, 0.8, 1.3, -0.3, 1.4, 1.6. We reject the points 0.2, 0.6, 0.8, -0.3, as they are less than 1.2 and hence are outside the area of integration. We keep the six accepted points 2.1, 1.8, 2.2, 1.3, 1.4, 1.6, and remove the rejected points. In place of the rejected points, we sample a similar number (i.e. four) from among the accepted points, with replacement. Assume we have done that and have obtained: 2.1, 1.3, 1.8, 2.1. So, overall we have the following points (the v_1(m)'s): 2.1, 1.8, 2.2, 1.3, 1.4, 1.6, 2.1, 1.3, 1.8, 2.1.

2. Generate M ≡ 10 points from p(v_2 | v_1(m)). Specifically, we generate one point v_2(1) using the density p(v_2 | v_1 = 2.1), one point v_2(2) using the density p(v_2 | v_1 = 1.8), and so on. Assume the generated points v_2(m) are 2.5, 1.5, 2.0, 1.7, 1.1, 1.5, 1.7, 1.3, 1.9, 2.0. We reject the points 1.1 and 1.3 as they are below the integration limit of 1.4. Then we sample two more points with replacement in place of these rejected points. We continue in this manner for the remaining dimension. Assume that we rejected three points in this case.

3. The integral estimate is the product of the acceptance ratios, i.e. it equals (6/10)(8/10)(7/10) = 0.336.

5.4 Accuracy Analysis

Because the integral is typically a very small number, we consider here the logarithm of the integral as the target value we would like to estimate, i.e., from Eq. (35):

\log(I) = \log\left[ p(v_N ≥ 0 | v_1 ≥ 0, \ldots, v_{N-1} ≥ 0) \right] + \cdots + \log\left[ p(v_2 ≥ 0 | v_1 ≥ 0) \right] + \log\left[ p(v_1 ≥ 0) \right]    (37)

Assume for the time being that the generated values are independent. In every step of the algorithm we generate M Bernoulli trials, where each has probability P_i ≡ p(v_i ≥ 0 | v_1 ≥ 0, \ldots, v_{i-1} ≥ 0) of landing in the integral's sought region and being accepted for the subsequent steps.
Let us analyze the bias and the variance. We have

E\left[ \log\left( \frac{M_1(i)}{M} \right) \right] \approx \log(P_i) + \frac{1}{P_i} E\left[ \frac{M_1(i)}{M} - P_i \right] - \frac{1}{2 P_i^2} E\left[ \left( \frac{M_1(i)}{M} - P_i \right)^2 \right]    (38)

where the expression on the RHS originates from a Taylor series expansion of log(M_1(i)/M) around the value log(P_i), keeping terms up to quadratic order. The expectation in the second term of the RHS equals zero (because it is a binomial process, so the expectation of M_1(i) equals M P_i). The last term in the RHS can also be evaluated, and we obtain the bias as

\text{Bias} \equiv E\left[ \log\left( \frac{M_1(i)}{M} \right) \right] - \log(P_i) = -\frac{1 - P_i}{2 M P_i} + O(M^{-2})    (39)
           = O(M^{-1})    (40)

which means that we can have the bias as close to zero as desired, as the number of generated points becomes very large. Concerning the variance, we get

\text{Var}\left[ \log\left( \frac{M_1(i)}{M} \right) \right] = \frac{1 - P_i}{M P_i} + O(M^{-2}) = O(M^{-1})    (41)

For the overall integral, we get

\log(I) = \sum_{i=1}^{N} \log(P_i)    (42)

with the bias and variance becoming

\text{Bias}(\log \hat{I}) = -\frac{1}{2M} \sum_{i=1}^{N} \frac{1 - P_i}{P_i} + O(M^{-2})    (43)

\text{Var}(\log \hat{I}) = \frac{1}{M} \sum_{i=1}^{N} \frac{1 - P_i}{P_i} + O(M^{-2})    (44)

Thus, the mean square error (MSE) goes to zero as M → ∞. Note that P_i is the outcome of a one-dimensional integral, so it is expected to be in the middle range of (0, 1), and the summands above are therefore moderate. Contrast this with the naive Monte Carlo approach, whose relative MSE is (1 - I)/(M I) + O(M^{-2}), where I is the value of the multivariate integral. One can see that that MSE is very large, because typically I is infinitesimally small.

When we derived the above formulas, we assumed, as an approximation, that the samples are independent. Strictly speaking they are not, for the following reason. Consider two points v_j(1) and v_j(2) generated according to p(v_j | v_1 ≥ 0, \ldots, v_{j-1} ≥ 0). Starting from variable j and going upstream through the conditioned variables, we could find one variable, say v_{j-k}, that is a common conditioned variable for the two generated points v_j(1) and v_j(2) (i.e. a common ancestor). This is because of the bootstrap sampling procedure. However, we argue that the dependence will be fairly small. This is because equally-valued samples get completely dispersed when we generate samples for the next variable, and so the dependence decays fast. So, the net effect of this dependence is a somewhat higher MSE, but it is still of the same order, i.e. O(M^{-1}), with a higher coefficient. (It is akin to estimating the mean of a variable using generated points having a banded covariance matrix; the MSE will still be O(M^{-1}).)

5.5 Generating from the Conditional Density p(v_i | v_{1:i-1}(m))

In Step 2 of the algorithm described in Subsection 5.1, we need to generate from the conditional Gaussian distribution. This can be accomplished using the well known identity (assuming mean(v) = 0):

p(v_i | v_{1:i-1}) = N\left( v_i; \, b_i^T v_{1:i-1}, \, \sigma_i^2 \right)    (45)

where

b_i = R_{1:i-1, 1:i-1}^{-1} R_{1:i-1, i}    (46)

\sigma_i^2 = R_{i,i} - R_{1:i-1, i}^T R_{1:i-1, 1:i-1}^{-1} R_{1:i-1, i}    (47)

and where R ≡ I + A Σ' A is the covariance matrix pertaining to the multivariate Gaussian (see Eqs. (25) and (28)), and the notation R_{i:j, k:l} means the submatrix constructed from R by taking rows i to j and columns k to l.

We have to compute these quantities in Eqs. (46) and (47), including inverting a matrix at every step, i.e. N times. We present here a computationally more efficient algorithm based on a recursive computation of the quantities in Eqs. (46) and (47). Assume that we have performed the computations at step i, i.e. that b_i, σ_i^2 and Q_i ≡ R_{1:i-1, 1:i-1}^{-1} are available. Proceeding to the next step i+1, we first tackle Q_{i+1}.
Using the partitioned matrix inversion formula (Horn and Johnson [21]), we get

Q_{i+1} = \begin{pmatrix} R_{1:i-1,1:i-1} & R_{1:i-1,i} \\ R_{1:i-1,i}^T & R_{i,i} \end{pmatrix}^{-1}    (48)
        = \begin{pmatrix} R_{1:i-1,1:i-1}^{-1} + \frac{1}{k} R_{1:i-1,1:i-1}^{-1} R_{1:i-1,i} R_{1:i-1,i}^T R_{1:i-1,1:i-1}^{-1} & -\frac{1}{k} R_{1:i-1,1:i-1}^{-1} R_{1:i-1,i} \\ -\frac{1}{k} R_{1:i-1,i}^T R_{1:i-1,1:i-1}^{-1} & \frac{1}{k} \end{pmatrix}

where k = R_{i,i} - R_{1:i-1,i}^T R_{1:i-1,1:i-1}^{-1} R_{1:i-1,i}. Notice that k equals σ_i^2, which is available from the previous step. This way of updating the inverse of the covariance matrix has been commonly used in the signal processing community, and it was even introduced into Gaussian process regression by Csató and Opper [7], Van Vaerenbergh et al [52], and Pérez-Cruz et al [38]. Substituting from Eq. (46), we get

Q_{i+1} = \begin{pmatrix} Q_i + \frac{b_i b_i^T}{\sigma_i^2} & -\frac{b_i}{\sigma_i^2} \\ -\frac{b_i^T}{\sigma_i^2} & \frac{1}{\sigma_i^2} \end{pmatrix}    (49)

We also get

b_{i+1} = Q_{i+1} R_{1:i, i+1}    (50)

\sigma_{i+1}^2 = R_{i+1,i+1} - R_{1:i, i+1}^T b_{i+1}    (51)

In summary, using these recursive formulas we can compute the moments of the conditional distribution in O(N^2) operations per step instead of O(N^3), thus providing computational savings.
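The recursion of Eqs. (49)-(51) is easy to verify numerically. The sketch below (Python, names ours) propagates Q_i, b_i and σ_i^2 across one step and checks the result against direct inversion; it is an illustration of the update, not the authors' code.

    import numpy as np

    def step_update(Q_i, b_i, sigma2_i, R, i):
        """One step of Eqs. (49)-(51).  Inputs describe the first i variables
        (Q_i = inv(R[:i, :i]), b_i = Q_i @ R[:i, i], sigma2_i the conditional
        variance of the next variable); outputs describe the first i+1 variables."""
        top = Q_i + np.outer(b_i, b_i) / sigma2_i
        Q_next = np.block([[top,                      -b_i[:, None] / sigma2_i],
                           [-b_i[None, :] / sigma2_i, np.array([[1.0 / sigma2_i]])]])   # Eq. (49)
        r = R[:i + 1, i + 1]
        b_next = Q_next @ r                                    # Eq. (50)
        sigma2_next = R[i + 1, i + 1] - r @ b_next             # Eq. (51)
        return Q_next, b_next, sigma2_next

    # consistency check on a random positive-definite R
    rng = np.random.default_rng(3)
    G = rng.normal(size=(5, 5))
    R = G @ G.T + 5 * np.eye(5)

    i = 2
    Q_i = np.linalg.inv(R[:i, :i])
    b_i = Q_i @ R[:i, i]
    sigma2_i = R[i, i] - R[:i, i] @ b_i

    Q_next, b_next, s2_next = step_update(Q_i, b_i, sigma2_i, R, i)
    print(np.allclose(Q_next, np.linalg.inv(R[:i + 1, :i + 1])))   # True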
5.6 The Overall GPC Algorithm

The algorithm turns out to be very simple, and can be coded easily. It is important to start with the training set, then proceed with the test set. So, basically, we rename the variables such that v_1, ..., v_N represent the training set, and v_{N+1}, ..., v_{N+N_TEST} represent the test set. Also, for convenience, the covariance matrix of (25) is rearranged and made to include all test patterns, becoming

R = \begin{pmatrix} C'(I + \Sigma)C' & C' \Sigma_{XX_*} \\ \Sigma_{XX_*}^T C' & I + \Sigma_{X_*X_*} \end{pmatrix}    (52)

where X_* is the matrix of test patterns. As can be seen in the algorithm, the training set computations have to be done only once, and need not be repeated for every test pattern, making the algorithm incremental in nature. The algorithm is described as follows:

Algorithm GPC-MC

1. For i = 1: set Q_2 = 1/R_{1,1} and σ_1^2 = R_{1,1}. Generate M points v_1(m) according to N(v_1; 0, σ_1^2). Compute

\hat{P}_1 = \frac{ \#\left( v_1(m) \; \text{s.t.} \; v_1(m) \ge 0 \right) }{M}    (53)

where the latter expression means the fraction of points that are ≥ 0. Remove the points v_1(m) < 0. Sample with replacement from among the remaining points to keep the total number of points equal to M. Rename the variables, so that v_1(m) are the new kept points.

2. For i = 2 to N do the following:

(a) Compute:

b_i = Q_i R_{1:i-1, i}    (54)

\sigma_i^2 = R_{i,i} - R_{1:i-1, i}^T b_i    (55)

Q_{i+1} = \begin{pmatrix} Q_i + \frac{b_i b_i^T}{\sigma_i^2} & -\frac{b_i}{\sigma_i^2} \\ -\frac{b_i^T}{\sigma_i^2} & \frac{1}{\sigma_i^2} \end{pmatrix}    (56)

(b) Generate M points v_i(m) according to N(v_i; b_i^T v_{1:i-1}(m), σ_i^2), m = 1, ..., M. Compute

\hat{P}_i = \frac{ \#\left( v_i(m) \; \text{s.t.} \; v_i(m) \ge 0 \right) }{M}    (57)

(c) Remove the points v_i(m) < 0. In their place, sample with replacement from among the remaining points to keep the total number of points equal to M. Rename the variables, so that v_i(m) are the new kept points.

3. The log marginal likelihood function is given by the following sum over the training set probabilities:

\log L = \sum_{i=1}^{N} \log(\hat{P}_i)    (58)

4. For i = N+1 to N+N_TEST (the test patterns) do the following:

(a) Compute:

b_i = Q_{N+1} R_{1:N, i}    (59)

where Q_{N+1} represents the covariance matrix inverse obtained at the last training pattern. It is not updated further during the test phase.

\sigma_i^2 = R_{i,i} - R_{1:N, i}^T b_i    (60)

(b) Generate M points v_i(m) according to N(v_i; b_i^T v_{1:N}(m), σ_i^2), m = 1, ..., M. Compute

\hat{P}_i = \frac{ \#\left( v_i(m) \; \text{s.t.} \; v_i(m) \ge 0 \right) }{M}    (61)

Note that \hat{P}_i is the sought test pattern posterior probability. Note also that we do not need to perform the bootstrap sampling step for the test patterns.

Note that the proposed algorithm, after it is applied to the training set, has its samples obey the posterior distribution. So these samples can be saved for any future evaluation of a test pattern.

5.7 The Soft Count Variant

A possibly more efficient modification is to use some kind of soft count, instead of the hard count used in Eq. (57) (or Eq. (33)). We know that each point, while moving from step i-1 to step i, is generated from a Gaussian density. The probability of landing on the positive side can be computed by simply applying the cumulative Gaussian integral for each point (i.e. σ(u) ≡ \int_{-\infty}^{u} \frac{e^{-x^2/2}}{\sqrt{2\pi}} dx). In that case, instead of Eq. (57) we apply the following:

\hat{P}_i = \frac{1}{M} \sum_{m=1}^{M} \sigma\left( \frac{b_i^T v_{1:i-1}(m)}{\sigma_i} \right)    (62)

This applies similarly to Eq. (61). Of course, this does not relieve us from having to generate the points. We have to do that while moving forward until reaching the last dimension. In essence, all other steps are similar to the original version of the algorithm. Only the count is different.
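A sketch of the soft-count modification, again in Python with our own naming: inside the sequential loop shown earlier, the hard acceptance fraction is replaced by the average of the Gaussian tail probabilities of the individual conditional densities (Eq. (62)); the points are still generated and resampled exactly as before.

    import numpy as np
    from scipy.stats import norm

    def soft_count(cond_means, cond_std):
        """Soft estimate of P(v_i >= 0 | history), Eq. (62): average the probability
        that each conditional Gaussian N(mean_m, cond_std^2) lands on the positive side."""
        return norm.cdf(cond_means / cond_std).mean()

    # example: conditional means b_i^T v_{1:i-1}(m) for M = 5 surviving strings,
    # and conditional standard deviation sigma_i (hypothetical numbers)
    means = np.array([0.3, -0.1, 0.8, 0.2, -0.4])
    print(soft_count(means, cond_std=0.5))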
6 Experimental Results

6.1 Experiment 1: Convergence of the Orthant Integral Estimate

In this experiment we test the convergence properties of the new Monte Carlo multivariate Gaussian integration approach. The goal here is to test the efficacy of this new method irrespective of its use in Gaussian process classifiers. The application of this method to the Gaussian process classifier hinges mainly on its success as a stand-alone integration approach. Once we establish this fact, we have assurance that it will work well in the Gaussian process classifier setting.

To be able to judge the new method's approximation error, we have to use examples where the "ground truth", i.e. the real integral value, is known. We identified a special form where this can be obtained. The experiments are performed on this special form, described below. The covariance matrix equals 1 on the diagonal and equals d_i d_j off the diagonal (at the (i, j)-th position), where d = (d_1, ..., d_N)^T is some vector with |d_i| < 1. In such a situation, the orthant probability can be reduced to a simple one-dimensional integration, as follows:

\int_{0}^{\infty} N(v; 0, \Sigma) \, dv = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-u^2/2} \prod_{i=1}^{N} \sigma\left( \frac{d_i u}{\sqrt{1 - d_i^2}} \right) du    (63)

where σ is the cumulative Gaussian function (i.e. the one-dimensional integral of the Gaussian density function). This formula was proposed by Das [8], Dunnett and Sobel [10], and Ihm [22]. It has also been generalized to different forms by Marsaglia [30] and Webster [55], and used in combination with Monte Carlo sampling by Breslaw [5]. Some special cases of this formula even yield closed-form solutions.

In this experiment we considered different dimensions for our space. Specifically, we considered the dimensions N = 50, N = 200, and N = 500. In addition, for each dimension we considered 50 different problems, where each problem has a different d vector (whose components are generated from a uniform distribution in [-1, 1]). To study the effect of the number of Monte Carlo samples M, we ran each of these problems for various values of M. Because the value of the integral is usually infinitesimal, a sensible approach is to consider the logarithm of the estimated integral, and compare it to the logarithm of the true integral. For example, a typical integral value for a 500-dimensional problem could be 10^{-200}; the logarithm is then about -461. As an error measure, we used the following mean absolute percentage error, defined as:

MAPE = \frac{100}{N_R} \sum_{i=1}^{N_R} \frac{ \left| \log(I) - \log(\hat{I}) \right| }{ \left| \log(I) \right| }    (64)

where I is the true integral value, \hat{I} is the integral value estimated by the algorithm, and N_R represents the number of runs (i.e. the number of different d vectors tested, in our case N_R = 50).

Note that we have to be careful when evaluating the true integral numerically using Eq. (63). If we multiply the terms first and then integrate numerically, we end up with very small numbers, leading to a large error. We overcame this difficulty by successive normalization by the maximum value after each multiplication. Then we evaluate the integral and multiply back the normalization terms that we divided by. (A short code sketch of this evaluation is given after Table 1.)

Table 1 shows the MAPE error measure (averaged over the 50 tested problems) for each of the tested values of N (dimension) and M (number of Monte Carlo samples). Note that these are percent errors, so they are multiplied by 100. Also displayed in the table is the standard error (over the 50 tested problems). One can observe that the developed algorithm evaluates the orthant probabilities with good accuracy. As expected, the accuracy tends to improve for larger M. However, the relation between the accuracy and the dimension is less straightforward to describe. Even though by Eqs. (43) and (44) one might expect that for larger N there are more terms and hence a higher error, in practice the P_i's are more important influencing factors. One can also see that the algorithm succeeded even for the 500-dimensional problems, even though such high dimensions pose quite formidable problems. Most of the algorithms for orthant probability estimation are tested on problems with only tens of dimensions.

Note that, because of memory limitations, in the case of a large number M of Monte Carlo samples it may not be practical to propagate all samples together. A more practical approach is to rerun the problem several times, each with a smaller M. For example, assume that we would like to use 2,000,000 samples. In that case we apply ten runs, each with M = 200,000.

No. MC samples   Problem 1 (N = 50)   Problem 2 (N = 200)   Problem 3 (N = 500)
3,000,000        0.012 (0.0011)       0.015 (0.0065)        0.043 (0.0299)
1,000,000        0.016 (0.0020)       0.019 (0.0068)        0.045 (0.0298)
300,000          0.041 (0.0050)       0.032 (0.0072)        0.050 (0.0297)
100,000          0.072 (0.0094)       0.039 (0.0070)        0.059 (0.0303)
30,000           0.141 (0.0160)       0.078 (0.0065)        0.080 (0.0289)
10,000           0.245 (0.0268)       0.101 (0.0093)        0.107 (0.0322)

Table 1: The Mean Absolute Percentage Error (MAPE, in %) of the Log Multivariate Gaussian Integral Estimate (and its Standard Error in Brackets) against the Dimension of the Problem and the Number of Monte Carlo Samples (the Numbers are in Percent, so they are Multiplied by 100)
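The ground-truth evaluation of Eq. (63), including the normalization described above, can be sketched as follows (Python, our own naming). Keeping the running product of Φ terms in log space is equivalent to the successive renormalization mentioned in the text.

    import numpy as np
    from scipy.stats import norm

    def true_log_orthant(d, n_grid=20_001, u_max=12.0):
        """log of Eq. (63) for Sigma with unit diagonal and off-diagonal d_i*d_j,
        computed on a dense grid with the product of Phi terms kept in log space."""
        u = np.linspace(-u_max, u_max, n_grid)
        log_phi_terms = norm.logcdf(np.outer(d / np.sqrt(1.0 - d**2), u))  # (N, n_grid)
        log_integrand = norm.logpdf(u) + log_phi_terms.sum(axis=0)
        m = log_integrand.max()
        du = u[1] - u[0]
        return m + np.log(np.exp(log_integrand - m).sum() * du)

    rng = np.random.default_rng(4)
    d = rng.uniform(-0.99, 0.99, size=200)
    print(true_log_orthant(d))     # typically a large negative number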
6.2 Experiment 2: GPC with a Linear Kernel

The next group of experiments aims to verify that the proposed Monte Carlo method, in a Gaussian process classification setting, does converge to the true solution. The problem we face is that in general there is no way to know the true solution, so it could be hard to verify this claim. However, we identified a special group of problems where the "ground truth" can be obtained. This is the case if we take the distance kernel function to be of the dot product form. This means that the covariance matrix equals

\Sigma = X X^T    (65)

This is the so-called linear kernel. It is a legitimate kernel function as it represents a similarity between the patterns, and is positive semidefinite. For single-feature classification problems it can be shown, after a few lines of derivation, that the Class 1 posterior probability of a pattern, as given by Eq. (28), becomes

J^* = \frac{ \int_{-\infty}^{\infty} e^{-u^2/2} \, \sigma(x u) \prod_{i=1}^{N} \sigma(y_i x_i u) \, du }{ \int_{-\infty}^{\infty} e^{-u^2/2} \prod_{i=1}^{N} \sigma(y_i x_i u) \, du }    (66)

where x_i and x are the feature values of, respectively, the i-th training pattern and the test pattern, and y_i denotes the class membership of pattern i. The marginal likelihood is simply the denominator in Eq. (66). In this group of problems the covariance function becomes of the form discussed in Experiment 1, where we can make use of the one-dimensional integration method of Eq. (63) to evaluate the integrals. The fact that we are dealing with a one-dimensional feature space does not necessarily make the problem any easier. We are still dealing with the same formulas and with the same very high-dimensional integrals. The dimension of the feature vector impacts only the covariance matrix; it will just have different entries.
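For reference, Eq. (66) is again a one-dimensional integral and can be evaluated directly, e.g. with the following Python sketch (our own naming, log-space products for stability). This is how a ground-truth posterior can be obtained for the linear-kernel case.

    import numpy as np
    from scipy.stats import norm

    def linear_kernel_posterior(x_train, y_train, x_test, n_grid=20_001, u_max=12.0):
        """Ground-truth Class 1 posterior of Eq. (66) for single-feature data."""
        u = np.linspace(-u_max, u_max, n_grid)
        log_base = norm.logpdf(u) + norm.logcdf(np.outer(y_train * x_train, u)).sum(axis=0)
        m = log_base.max()
        den = np.exp(log_base - m).sum()
        num = (np.exp(log_base - m) * norm.cdf(x_test * u)).sum()
        return num / den

    rng = np.random.default_rng(5)
    x1 = rng.normal(0.0, 1.0, size=50)        # Class 1 feature values
    x2 = rng.normal(1.0, 1.0, size=50)        # Class 2 feature values
    x = np.concatenate([x1, x2])
    y = np.concatenate([np.ones(50), -np.ones(50)])
    print(linear_kernel_posterior(x, y, x_test=0.2))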
We generated a number of training and testing patterns from a one-dimensional (i.e. single-feature) two-class Gaussian problem. To ensure that the proposed model can handle different types of problems, we considered a variety of training/testing set sizes and a variety of means/variances for the class-conditional densities (in order to account for a variety of different class overlaps). Specifically, we considered the four problems shown below. Let N and N_TEST be the sizes of, respectively, the training set and test set, and let μ_i and σ_i be, respectively, the mean and standard deviation of the class-conditional density for class i.

• Problem 1: N = 100, N_TEST = 50, μ_1 = 0, μ_2 = 1, σ_1 = 0. , σ_2 = 0.
• Problem 2: N = 200, N_TEST = 100, μ_1 = 0, μ_2 = 1, σ_1 = 2, σ_2 = 1.
• Problem 3: N = 400, N_TEST = 200, μ_1 = 0, μ_2 = 1. , σ_1 = 0. , σ_2 = 0.
• Problem 4: N = 800, N_TEST = 400, μ_1 = 0, μ_2 = 1, σ_1 = 1, σ_2 = 0.

The patterns of the two classes were shuffled together before constructing the covariance matrix; this is important for achieving better accuracy and speed, for a reason that will be mentioned at the end of this subsection. To obtain statistically more reliable numbers, for each of the above problems we applied the proposed algorithm N_R ≡ 20 different times. Since the estimated Class 1 posterior probabilities of the different test patterns are in a well-known range from 0 to 1, it is sufficient to use an absolute error metric, so we used the mean absolute error (MAE), defined as follows:

MAE = \frac{1}{N_{TEST}} \sum_{j=1}^{N_{TEST}} \left| J_j^* - \hat{J}_j^* \right|    (67)

where J_j^* is an evaluation of the Class 1 posterior probability for test pattern j using an exact numerical integration procedure (the true value), obtained by the formula of Eq. (66), and \hat{J}_j^* is the estimate using the proposed Monte Carlo procedure. Table 2 shows the obtained MAE values, averaged over the 20 runs, for a variety of numbers of Monte Carlo samples.

Concerning the log marginal likelihood, we evaluated it using the proposed Monte Carlo algorithm, and compared it with the true value, obtained numerically by evaluating the denominator of Eq. (66). Since the log marginal likelihood can take any level, a normalized error measure is more appropriate, so we used the mean absolute percentage error (MAPE) measure. The formula is similar to Eq. (64), but with the appropriate comparison variables replaced. Table 3 shows the obtained MAPE (%) values for a variety of numbers of Monte Carlo samples for the log marginal likelihood estimation problem.

As seen from both tables, the algorithm is able to achieve a low error for both the probability evaluation and the marginal likelihood. One can also see that increasing the number of Monte Carlo samples M leads to better accuracy. As mentioned in the last experiment, the relation between accuracy and dimension is less straightforward to describe. It is influenced more by the specific covariance matrix and the resulting conditional probabilities P_i. By observing Eqs. (43) and (44), one finds that a small P_i can lead to a large error. It is therefore advantageous to have the P_i's closer to the middle (in most cases around 0.5). To achieve that, it is important to shuffle the data of both classes, rather than list first the data for Class 1, followed by the data of Class 2. The latter causes more extreme P_i's and therefore leads to less accuracy. Other than a random shuffle, one could interleave the data of class 1 and class 2 in a repetitive way (e.g. class 1 pattern, then class 2, then class 1, then class 2, etc.).
No. MC     Prob 1 (N = 100)     Prob 2 (N = 200)     Prob 3 (N = 400)     Prob 4 (N = 800)
Samples
3,000,000  0.00016 (0.00004)    0.00024 (0.00005)    0.00022 (0.00005)    0.00022 (0.00006)
1,000,000  0.00031 (0.00007)    0.00043 (0.00010)    0.00038 (0.00009)    0.00045 (0.00012)
300,000    0.00056 (0.00013)    0.00081 (0.00018)    0.00062 (0.00014)    0.00075 (0.00019)
100,000    0.00095 (0.00021)    0.00139 (0.00031)    0.00112 (0.00025)    0.00128 (0.00033)
30,000     0.00160 (0.00036)    0.00297 (0.00066)    0.00212 (0.00047)    0.00235 (0.00061)
10,000     0.00308 (0.00069)    0.00463 (0.00103)    0.00391 (0.00088)    0.00443 (0.00114)

Table 2: The Mean Absolute Error (MAE) (averaged over the 20 runs) of the Gaussian Process Classification of Experiment 2 for the Four Different Problems against the Number of Monte Carlo Samples (the Standard Error is in Brackets).

No. MC     Prob 1 (N = 100)     Prob 2 (N = 200)     Prob 3 (N = 400)     Prob 4 (N = 800)
Samples
3,000,000  0.0081 (0.0018)      0.0088 (0.0020)      0.0063 (0.0014)      0.0033 (0.0009)
1,000,000  0.0187 (0.0042)      0.0097 (0.0022)      0.0095 (0.0021)      0.0059 (0.0015)
300,000    0.0364 (0.0081)      0.0255 (0.0057)      0.0130 (0.0029)      0.0077 (0.0020)
100,000    0.0671 (0.0150)      0.0340 (0.0076)      0.0238 (0.0053)      0.0130 (0.0034)
30,000     0.1170 (0.0262)      0.0629 (0.0141)      0.0450 (0.0101)      0.0249 (0.0064)
10,000     0.1522 (0.0340)      0.1334 (0.0298)      0.0900 (0.0201)      0.0622 (0.0161)

Table 3: The Mean Absolute Percent Error MAPE (in %, i.e. the Numbers are Multiplied by 100) of the Log Marginal Likelihood of Experiment 2 for the Four Different Problems against the Number of Monte Carlo Samples. All are Averages over the 20 Runs, and the Standard Error is in Brackets.

6.3 Experiment 3: Comparison with MCMC Methods

In this and the next experiment we present a comparison of the proposed Monte Carlo algorithm with the Markov Chain Monte Carlo (MCMC) approach, its only peer. MCMC is the only available approach that can accurately compute the exact classification probabilities. All other methods give only approximations. There are several MCMC-based models. In this experiment we compare the proposed algorithm with the Hybrid Monte Carlo (HMC) [35] and the Elliptical Slice Sampler (ESS) of Murray, Adams, and MacKay [33]. For both methods, we use the implementation written by Rasmussen and Nickisch [40], which includes several enhancements of these two methods. For the proposed algorithm we used the variant with the soft count, described in Subsection 5.7.

We considered Problem 3 (N = 400) of Experiment 2 (with a linear kernel). As mentioned, these are the only type of problems where the ground truth is known. For the purpose of comparison, the two main aspects of speed and accuracy are important. They are contradictory metrics: for example, improving the accuracy (by having a larger Monte Carlo sample) leads to lengthier runs, and vice versa. To visualize both aspects of the performance simultaneously, we have plotted the CPU time against the logarithm of the MAE in Figure 1 (for the case of probability estimation) and against the logarithm of the MAPE in Figure 2 (for the case of marginal likelihood estimation). In each of the two figures, every point corresponds to the average CPU time and average MAE (or MAPE) over ten runs for a particular Monte Carlo parameter. The Monte Carlo parameter for the proposed algorithm, and for the HMC and ESS algorithms, is the number of Monte Carlo samples. For the HMC and ESS algorithms we kept the other parameters at their recommended values, as given in the implementation by Rasmussen and Nickisch [40] (they are anyway much less influential than the number of samples).
The parameters are fixed as follows: the number of skipped samples is 40, the number of burn-in samples is 10, and the number of runs to remove finite temperature bias is 3.

To judge the advantage of one algorithm versus another, one should examine the difference in accuracy for the same run time, or equivalently the difference in run time for the same accuracy. One can see from the graphs that the proposed algorithm generally beats the HMC algorithm. The margin of outperformance is considerable, especially for the marginal likelihood case. The ESS ties with the proposed algorithm for pattern probability estimation for low to moderate accuracy targets, but ESS outperforms it for high-accuracy, computationally expensive runs. On the other hand, the proposed algorithm outperforms ESS considerably for the marginal likelihood case. As pointed out before, the marginal likelihood is by far the more important of the two aspects. The reason is that it is evaluated numerous times in the process of tuning the hyperparameters of the kernel function, while the pattern probability estimation is performed only once. For example, from the figure the proposed algorithm produces a log(MAPE) of the marginal likelihood of about -4.52 with a CPU time of 525 sec. The ESS algorithm produces about the same log(MAPE) (or just a little better, at -4.70) with a CPU time of 5350 sec. With about a hundred applications of an optimization algorithm (such as the minimize function of Rasmussen and Nickisch's GPML toolbox [40]), the new algorithm takes about 15 hours, while ESS takes about 148 hours.
Figure 1: Log of the Mean Absolute Error (MAE) of the Probability Estimates of the New Algorithm, the HMC Algorithm, and the ESS Algorithm against the CPU Time (Seconds) of the Runs
Figure 2: Log of the Mean Absolute Percentage Error (MAPE) of the Log Marginal Likelihood Estimates of the New Algorithm, the HMC Algorithm, and the ESS Algorithm against the CPU Time (Seconds) of the Runs

6.4 Experiment 4: RBF Kernel, Synthetic and Real-World Problems

The previous experiment, while providing an accurate benchmark comparison, considers only the linear kernel, which may not be very prevalent in real world applications. In this experiment we consider the more common RBF kernel, and also some synthetic and real world problems, in order to have a test as close as possible to realistic situations. We also add to the comparison the model developed by Titsias, Lawrence and Rattray [48], in addition to the ESS and HMC algorithms considered in the last experiment. Titsias et al's algorithm is another MCMC-based algorithm for the GPC problem. It relies on using dynamically optimized control variables that provide a low dimensional representation of the function. The code is publicly available at [50]. It applies only to RBF and ARD kernels [50], and that is why we did not include it in the comparison of the last experiment.

For the purpose of this comparison there is, however, the problem of lack of ground truth, because for RBF kernels we do not know the true value of the integrals that would yield the Gaussian process class probabilities. Nevertheless, we run the competing models on a number of problems, and check their convergence properties. If they converge to the same class probabilities, then this is a strong indication that these algorithms do indeed converge. Also, note that the RBF kernel considered here is a more effective and more widely used kernel than the linear kernel. So the experiments presented here are more relevant, as they fit more closely the real experimental situations.

We considered artificially generated data using Gaussian class-conditional densities, and real data sets. For the artificial problems, using the Bayes classifier's formula, one can compute the true posterior probability P(y_i = 1 | x) ≡ P_i^true which the Gaussian process classifier attempts to model. However, we must emphasize that these true posterior probabilities need not be the same as those obtained by GPC, as GPC is based on a different formulation. If either the proposed or the competing algorithms do a good job converging to the true value of the sought integral, but it turns out to be far from the true posterior, then it is not the fault of the algorithm. It should be attributed to the degree of validity of the Gaussian process formulation or to the finite-sampled-ness of the training data. Nevertheless, a comparison with the true posterior provides a useful sanity check. We computed the mean absolute error between each competing method's estimated probabilities and P_i^true (let us denote these by MAE-POST(NEW), MAE-POST(TITSIAS), MAE-POST(ESS), and MAE-POST(HMC)). This measure applies only to the synthetic problems, as for real problems we do not know the true posteriors. We also computed the following measure. For each pattern, we obtain the median P_i^med of all four algorithms' estimated probabilities. This so-called "consensus" value is compared against each algorithm's estimated probabilities. We get Mean_i(|P_i - P_i^med|) for each method (denote these by MAE-MED(NEW), MAE-MED(TITSIAS), MAE-MED(ESS), and MAE-MED(HMC)). This measure will expose an aberrant algorithm that fails to converge, and is therefore a useful sanity check. We have computed a similar measure for the log marginal likelihood.

The problems considered are described as follows. Let d be the dimension of the feature vector, and let e_d denote the d-dimensional vector of all ones. Also, let:
\Sigma_1 = \begin{pmatrix} 1 & 0.25 \\ 0.25 & 1 \end{pmatrix}    (68)

and Σ_2 is a 10 × 10 one-banded matrix with 0.5 on the diagonal and 0.2 on the upper and lower bands. We considered the following synthetic problems, which provide a variety of different levels of training set sizes, class overlaps, and space dimensions, and also the following real world problems.
• Problem 1: N_TRAIN = 50, N_TEST = 50, d = 2, p(x|C_1) = N(x; 0, I), p(x|C_2) = N(x; e_d, Σ_1).
• Problem 2: N_TRAIN = 200, N_TEST = 200, d = 2, p(x|C_1) = N(x; 0, I), p(x|C_2) = N(x; e_d, Σ_1).
• Problem 3: N_TRAIN = 50, N_TEST = 50, d = 2, p(x|C_1) = N(x; 0, I), p(x|C_2) = N(x; . e_d, Σ_1).
• Problem 4: N_TRAIN = 50, N_TEST = 50, d = 2, p(x|C_1) = N(x; 0, I), p(x|C_2) = N(x; e_d, Σ_1).
• Problem 5: N_TRAIN = 50, N_TEST = 50, d = 10, p(x|C_1) = N(x; 0, I), p(x|C_2) = N(x; e_d, Σ_2).
• Problem 6: N_TRAIN = 1000, N_TEST = 1000, d = 2, p(x|C_1) = N(x; 0, I), p(x|C_2) = N(x; e_d, Σ_1).
• Problem 7: Crabs data, N_TRAIN = 100, N_TEST = 100.
• Problem 8: Breast Cancer data, N_TRAIN = 200, N_TEST = 249, d = 9, available at http://mlearn.ics.uci.edu/databases/breast-cancer-wisconsin/.
• Problem 9: USPS 3 vs 5 data, N_TRAIN = 750, N_TEST = 790.
The following six (α, β) hyperparameter sets were considered: 1) α = 5, β = 12; 2) α = 5, β = 53; 3) α = 3, β = 24; 4) α = 0.…, β = 0.…; 5) α = 3, β = 16; 6) α = 0.…, β = 3.

For TITSIAS we considered 50,000 iterations, of which 5,000 were used for burn-in, and a starting number of three control variables. For each problem we first ran TITSIAS on the first (α, β) set. We selected the number of Monte Carlo samples of the proposed method, the ESS method, and the HMC method so that each runs in about the same time as TITSIAS (measured by CPU time). We then fixed this number for the runs on the other (α, β) combinations. Table 4 shows the average MAE-POST and MAE-MED for all nine problems, averaged over the six hyperparameter combinations (a sketch of how these two error measures are computed is given after the observations below). From the runs we note the following observations:

• The runs of all methods lead to close probability estimates for all the test patterns and all the runs (typically the absolute error between a method's probability estimates and the median of the methods' estimates is very small; see the MAE-MED columns of Table 4). The exception is HMC, whose estimates lie somewhat further from those of the other algorithms.

Problem      MAE-POST: NEW / TITSIAS / ESS / HMC      MAE-MED: NEW / TITSIAS / ESS / HMC
Problem 1    0.1177 / 0.1176 / 0.1177 / 0.1153        1.?e-? / ?.?e-? / ?.?e-? / ?.?e-?
Problem 2    0.0668 / 0.0676 / 0.06711 / 0.0673       3.?e-? / ?.?e-? / ?.?e-? / ?.?e-?
Problem 3    0.0822 / 0.0830 / 0.0824 / 0.0837        2.?e-? / ?.?e-? / ?.?e-? / ?.?e-?
Problem 4    0.1637 / 0.1641 / 0.1632 / 0.1685        1.?e-? / ?.?e-? / ?.?e-? / ?.?e-?
Problem 5    0.2884 / 0.2883 / 0.2890 / 0.2872        1.?e-? / ?.?e-? / ?.?e-? / ?.?e-?
Problem 6    0.0382 / 0.0384 / 0.0359 / 0.0367        7.?e-? / ?.?e-? / ?.?e-? / ?.?e-?
Problem 7    – / – / – / –                            1.?e-? / ?.?e-? / ?.?e-? / ?.?e-?
Problem 8    – / – / – / –                            1.?e-? / ?.?e-? / ?.?e-? / ?.?e-?
Problem 9    – / – / – / –                            2.?e-? / – / 4.?e-? / ?.?e-?

Table 4: The mean absolute error between the algorithms' probability estimates and the Bayesian posterior probability (MAE-POST), and between the algorithms' probability estimates and the median of the algorithms' probabilities (MAE-MED), averaged over all hyperparameter sets. Note that for the real-world Problems 7, 8, and 9 MAE-POST is not available.

• All methods lead to very similar mean absolute errors with respect to the Bayesian posterior probability. This error is also fairly low, especially when the training set is large. This indicates that the source of the discrepancy is probably the finite size of the training sample, rather than an inadequacy of the GPC model.

• The TITSIAS method was considerably slower for hyperparameter sets 4) and 6), and a little slower for sets 3) and 5). Even though the number of iterations was fixed at 50,000 for all hyperparameter sets, the runs for sets 4) and 6) took about ten times as long to complete (compared to sets 1) and 2), and compared to the other methods). It seems that lower values of α lead to much slower runs for the TITSIAS method. In contrast, the proposed Monte Carlo method runs at a similar speed for all parameter sets. Also, the TITSIAS method did not converge for Problem 9 (the USPS problem); even when reducing the number of samples to 10,000, it did not converge in more than 24 hours of running.

• For most methods the runs take a few minutes for a small problem (like N_TRAIN = N_TEST = 50), and about ten minutes for a medium problem (like N_TRAIN = N_TEST = 200). Of course, this is with the exception of the hyperparameter combinations that lead to slow runs for the TITSIAS method. For example, for Problem 3 (N_TRAIN = N_TEST = 200) the proposed method with M = 1,000,000 took 12 minutes using Matlab on a computer with an Intel Core i3 dual-core processor. Again, it is hard to perform an accurate speed comparison between the algorithms, because we do not know the ground truth.

• The marginal likelihood estimates of the proposed Monte Carlo algorithm and of the ESS algorithm are very close. However, the HMC algorithm produces a marginal likelihood that is somewhat different (often about 5% different, and for Problem 9 about 100% different from that of the other two algorithms). This indicates that it fell short of fully converging.
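To make the two error measures in Table 4 precise, the following is an illustrative sketch (the function and variable names are hypothetical, not taken from any particular toolbox) of computing MAE-POST and MAE-MED from per-method probability estimates, assuming MAE-MED is measured against the per-pattern median taken across the methods.

```python
import numpy as np

def mae_post_and_med(prob_estimates, bayes_posterior=None):
    """prob_estimates: array of shape (n_methods, n_test), each row holding one
    method's estimated class probabilities for the test patterns.
    bayes_posterior: length n_test array of true posterior probabilities
    (computable only for the synthetic problems)."""
    # MAE-MED: mean absolute deviation of each method from the per-pattern
    # median of all methods' estimates.
    median = np.median(prob_estimates, axis=0)
    mae_med = np.mean(np.abs(prob_estimates - median), axis=1)

    # MAE-POST: mean absolute deviation from the Bayesian posterior, when known.
    mae_post = None
    if bayes_posterior is not None:
        mae_post = np.mean(np.abs(prob_estimates - bayes_posterior), axis=1)
    return mae_post, mae_med
```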
Comments on the Results

Working with general Bayesian methods often leads to high dimensional integrals. Evaluating such integrals can be frustrating because of the high dimensionality, and many Monte Carlo approaches fail. The advantage of the proposed approach is that it is very reliable: it basically works all the time, and we have not encountered a failing run. It is a fairly short algorithm and is simple to code, which cuts down on development time. The fact that it has no tuning parameter (other than the number of Monte Carlo samples) also facilitates applying the method and avoids time-consuming tuning runs.

The proposed algorithm has a main advantage compared to the other MCMC-based algorithms: it is considerably faster for the problem of evaluating the marginal likelihood. This is important because the marginal likelihood is evaluated repeatedly during the hyperparameter tuning step, which makes that step very time-consuming. For this step it is sensible to use a smaller number M of Monte Carlo samples (for example 20,000 to 50,000); an approximate, rather than exact, evaluation of the marginal likelihood will not much affect the optimization outcome. Table 2 and Table 3 show that small Monte Carlo sample sizes achieve very low error for the log marginal likelihood (significantly lower than the error in the class posterior probabilities). Also, running with small samples could yield three or four best candidate hyperparameter sets, which we then rerun with a larger M to differentiate between them (a classic exploration versus exploitation problem). Once the optimal parameters are obtained, we can use a larger M (for example 200,000 or 500,000) in the classification step, because accuracy is then important.

An interesting observation is that the impact of the dimension N of the integral on the accuracy of the proposed method is not that large. Other factors seem to weigh in more, such as the structure of the covariance matrix and the resulting conditional probabilities. It is imperative to shuffle the class 1 and class 2 patterns, or interleave them in a regular way, before constructing the covariance matrix, as sketched below. This leads to well-behaved conditional probabilities, and therefore better accuracy.
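The following is a minimal sketch of one such regular interleaving. It assumes equally sized classes and ±1 labels, both illustrative choices; any ordering that avoids long runs of a single class serves the same purpose.

```python
import numpy as np

def interleave_classes(x1, x2):
    """Alternate class-1 and class-2 patterns (assumed equal in number) so that
    the covariance matrix is constructed from a well-mixed ordering."""
    n, d = x1.shape
    x = np.empty((2 * n, d))
    y = np.empty(2 * n)
    x[0::2], x[1::2] = x1, x2      # even positions: class 1, odd positions: class 2
    y[0::2], y[1::2] = 1.0, -1.0   # +1 / -1 class labels
    return x, y
```

When the two classes have different sizes, a random permutation of the combined training set achieves the same effect.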
Conclusions

In this paper we derived a new formulation that simplifies the multi-integral formula for the Gaussian process classification (GPC) problem. The formulation, given in terms of the ratio of two multivariate Gaussian integrals, gives new insights and potentially opens the door for better approximations. We also developed a Monte Carlo method for the evaluation of multivariate Gaussian integrals. This allows us to obtain a very close to exact evaluation of the GPC probabilities and of the marginal likelihood function. The proposed method is simple, reliable, and fast. As such, it should be considered a promising candidate for researchers to test when attempting to obtain exact GPC probabilities.

Appendix
The matrix $A$ can be written as
$$A = B + \frac{b b^T}{\sigma_*^2} \qquad (69)$$
where
$$B = \begin{pmatrix} 1 & 0^T \\ 0 & I + \Sigma^{-1} \end{pmatrix} \qquad (70)$$
$$b = \begin{pmatrix} 1 \\ -a \end{pmatrix}. \qquad (71)$$
Using the small rank adjustment matrix inversion lemma [21], we get
$$A^{-1} = B^{-1} - \frac{B^{-1} b b^T B^{-1}}{q} \qquad (72)$$
where
$$q = \sigma_*^2 + b^T B^{-1} b. \qquad (73)$$
Substituting (72) into Eq. (24), we get
$$\bar{A} = \begin{pmatrix} 0 & 0^T \\ 0 & I - C'(I + \Sigma^{-1})^{-1} C' \end{pmatrix} + \frac{b' b'^T}{q} \qquad (74)$$
where
$$b' = \begin{pmatrix} 1 \\ -C'(I + \Sigma^{-1})^{-1} a \end{pmatrix} \qquad (75)$$
$$= \begin{pmatrix} 1 \\ -C'(I + \Sigma)^{-1} \Sigma_{X x_*} \end{pmatrix}. \qquad (76)$$
(The last equality follows from substituting for the variable $a$ from Eq. (10), i.e. $a = \Sigma^{-1} \Sigma_{X x_*}$, and absorbing the resulting $\Sigma^{-1}$ into the bracketed expression $(I + \Sigma^{-1})^{-1}$.)
The variable $q$ can be simplified as follows:
$$q = \sigma_*^2 + b^T \begin{pmatrix} 1 & 0^T \\ 0 & (I + \Sigma^{-1})^{-1} \end{pmatrix} b \qquad (77)$$
$$= \sigma_*^2 + 1 + a^T (I + \Sigma^{-1})^{-1} a \qquad (78)$$
$$= 1 + \Sigma_{x_* x_*} - \Sigma_{X x_*}^T (I + \Sigma)^{-1} \Sigma_{X x_*}. \qquad (79)$$
The last equation follows from the definition of $a$ (Eq. 10), the definition of $\sigma_*^2$ (Eq. 11), and several steps of simplification.
The first matrix in the RHS of Eq. (74) can be simplified further, by noting that
$$I - C'(I + \Sigma^{-1})^{-1} C' = C' \left[ I - (I + \Sigma^{-1})^{-1} \right] C' \qquad (80)$$
$$= C' (I + \Sigma^{-1})^{-1} \left[ (I + \Sigma^{-1}) - I \right] C' \qquad (81)$$
$$= C' (I + \Sigma)^{-1} C' \qquad (82)$$
where we used the fact that $C' C' = I$, because $C'$ is a diagonal matrix of 1's and $-1$'s. We get the final formula for $\bar{A}$ as follows:
$$\bar{A} = \begin{pmatrix} \dfrac{1}{q} & -\dfrac{\Sigma_{X x_*}^T (I + \Sigma)^{-1} C'}{q} \\ -\dfrac{C'(I + \Sigma)^{-1} \Sigma_{X x_*}}{q} & C'(I + \Sigma)^{-1} C' + \dfrac{C'(I + \Sigma)^{-1} \Sigma_{X x_*} \Sigma_{X x_*}^T (I + \Sigma)^{-1} C'}{q} \end{pmatrix} \qquad (83)$$
Let us construct the following matrix, which we will subsequently try to invert:
$$R = \begin{pmatrix} 1 + \Sigma_{x_* x_*} & \Sigma_{X x_*}^T C' \\ C' \Sigma_{X x_*} & C'(I + \Sigma) C' \end{pmatrix}. \qquad (84)$$
Using the partitioned matrix inverse theory [21], we get
$$R^{-1} = \begin{pmatrix} \dfrac{1}{q} & -\dfrac{\Sigma_{X x_*}^T (I + \Sigma)^{-1} C'}{q} \\ -\dfrac{C'(I + \Sigma)^{-1} \Sigma_{X x_*}}{q} & C'(I + \Sigma)^{-1} C' + \dfrac{C'(I + \Sigma)^{-1} \Sigma_{X x_*} \Sigma_{X x_*}^T (I + \Sigma)^{-1} C'}{q} \end{pmatrix} \qquad (85)$$
which is the same as $\bar{A}$ in Eq. (83), and that completes the proof.
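As a quick numerical sanity check of the rank-one (small rank adjustment) inversion step in Eqs. (69), (72), and (73), the following illustrative NumPy snippet verifies the identity for randomly generated $B$, $b$, and $\sigma_*^2$; the sizes and random choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
# Random symmetric positive-definite B, random vector b, positive scalar sigma2.
m = rng.standard_normal((n, n))
B = m @ m.T + n * np.eye(n)
b = rng.standard_normal(n)
sigma2 = 0.7

A = B + np.outer(b, b) / sigma2                        # Eq. (69)
B_inv = np.linalg.inv(B)
q = sigma2 + b @ B_inv @ b                             # Eq. (73)
A_inv = B_inv - (B_inv @ np.outer(b, b) @ B_inv) / q   # Eq. (72)

print(np.allclose(A_inv, np.linalg.inv(A)))            # prints True
```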
References

[1] A. H. Abdel-Gawad and A. F. Atiya, "A new accurate approximation for the Gaussian process classification problem", Proceedings of the International Joint Conference on Neural Networks (IJCNN'08), Hong Kong, China, 2008.
[2] Y. Altun, T. Hofmann, and A. Smola, "Gaussian process classification for segmenting and annotating sequences", Proceedings of the International Conference on Machine Learning (ICML'2004), 2004.
[3] D. Barber and C. K. I. Williams, "Gaussian processes for Bayesian classification via hybrid Monte Carlo", in M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, The MIT Press, Cambridge, MA, 1997.
[4] Y. Bazi and F. Melgani, "Gaussian process approach to remote sensing image classification", IEEE Transactions on Geoscience and Remote Sensing, Vol. 48, No. 1, pp. 186-197, 2010.
[5] J. Breslaw, "Evaluation of multivariate normal probability integrals using a low variance simulator", The Review of Economics and Statistics, Vol. 76, No. 4, pp. 673-682, 1994.
[6] L. Csató, E. Fokoué, M. Opper, and B. Schottky, "Efficient approaches to Gaussian process classification", in Neural Information Processing Systems 12, pp. 251-257, MIT Press, 2000.
[7] L. Csató and M. Opper, "Sparse online Gaussian processes", Neural Computation, Vol. 14, No. 2, pp. 641-669, 2002.
[8] S. C. Das, "The numerical evaluation of a class of integrals, II", Proceedings of the Cambridge Philosophical Society, Vol. 52, pp. 442-448, 1956.
[9] I. Deák, "Computing probabilities of rectangles in case of multivariate normal distribution", Journal of Statistical Computation and Simulation, Vol. 26, pp. 101-114, 1986.
[10] C. W. Dunnett and M. Sobel, "Approximations of the probability integral and certain percentage points of a multivariate analogue of Student's t-distribution", Biometrika, Vol. 42, pp. 258-260.
[11] F. Eriksson, "On the measure of solid angles", Mathematics Magazine, Vol. 63, No. 3, pp. 184-187, 1990.
[12] M. Evans and T. Swartz, Approximating Integrals via Monte Carlo and Deterministic Methods, Oxford University Press, New York, USA, 2000.
[13] A. Genz, "Comparison of methods for the computation of multivariate normal probabilities", Computing Science and Statistics, Vol. 25, pp. 400-405, 1993.
[14] A. Genz and F. Bretz, Computation of Multivariate Normal and t Probabilities, Springer-Verlag, Berlin, Germany, 2009.
[15] M. Gibbs and D. Mackay, "Variational Gaussian process classifiers", IEEE Transactions on Neural Networks, Vol. 11, pp. 1458-1464, 2000.
[16] M. Girolami and S. Rogers, "Variational Bayesian multinomial probit regression with Gaussian process priors", Neural Computation, Vol. 18, No. 8, pp. 1790-1817, 2006.
[17] S. Gupta, "Probability integrals of multivariate normal and multivariate t", The Annals of Mathematical Statistics, Vol. 34, No. 3, pp. 792-828, 1963.
[18] S. Gupta, "Bibliography of the multivariate normal integrals and related topics", The Annals of Mathematical Statistics, Vol. 34, No. 3, pp. 829-838, 1963.
[19] V. A. Hajivassiliou, D. L. McFadden, and P. Ruud, "Simulation of multivariate normal orthant probabilities: methods and programs", Working Paper, Yale University, 1991.
[20] D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont, "Robust multi-class Gaussian process classification", in Proc. Neural Information Processing Systems (NIPS 2011), Granada, Spain, December 12-15, 2011.
[21] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, UK, 1985.
[22] P. Ihm, "Numerical evaluation of certain multivariate integrals", Sankhya, Vol. 21, pp. 363-366, 1959.
[23] T. Jaakkola and D. Haussler, "Probabilistic kernel regression models", in D. Heckerman and J. Whittaker, editors, Workshop on Artificial Intelligence and Statistics 7, Morgan Kaufmann, 1999.
[24] R. Jenssen, D. Erdogmus, J. C. Principe, and T. Eltoft, "The Laplacian classifier", IEEE Transactions on Signal Processing, Vol. 55, No. 7, pp. 3262-3271, 2007.
[25] N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, Vol. 1, Wiley, New York, 1994.
[26] M. G. Kendall, "Proof of relations connected with the tetrachoric series and its generalization", Biometrika, Vol. 32, pp. 196-198, 1941.
[27] H. C. Kim and Z. Ghahramani, "Bayesian Gaussian process classification with the EM-EP algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 12, pp. 1948-1959, 2006.
[28] S. Kotz, N. Balakrishnan, and N. L. Johnson, Continuous Multivariate Distributions, Volume 1: Models and Applications, John Wiley & Sons, New York, 2000.
[29] M. Kuss and C. E. Rasmussen, "Assessing approximate inference for binary Gaussian process classification", Journal of Machine Learning Research, Vol. 6, pp. 1679-1704, 2005.
[30] G. Marsaglia, "Expressing the normal distribution with covariance matrix A + B in terms of one with covariance matrix A", Biometrika, Vol. 50, pp. 535-538, 1963.
[31] T. Minka, A Family of Algorithms for Approximate Bayesian Inference, Ph.D. thesis, MIT, 2001.
[32] P. A. P. Moran, "Rank correlation and product-moment correlation", Biometrika, Vol. 35, pp. 203-206, 1948.
[33] I. Murray, R. P. Adams, and D. J. C. MacKay, "Elliptical slice sampling", Journal of Machine Learning Research W&CP, Vol. 9, pp. 541-548, 2010.
[34] R. Neal, "Annealed importance sampling", Technical Report No. 9805, Department of Statistics, University of Toronto, 1998.
[35] R. Neal, "Regression and classification using Gaussian process priors", in J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pp. 475-501, Oxford University Press, 1999.
[36] H. Nickisch and C. E. Rasmussen, "Approximations for binary Gaussian process classification", Journal of Machine Learning Research, Vol. 9, pp. 2035-2078, 2008.
[37] M. Opper and O. Winther, "Gaussian processes for classification: mean field algorithms", Neural Computation, Vol. 12, pp. 2655-2684, 2000.
[38] F. Pérez-Cruz, S. Van Vaerenbergh, J. J. Murillo-Fuentes, M. Lázaro-Gredilla, and I. Santamaria, "Gaussian processes for nonlinear signal processing", IEEE Signal Processing Magazine, Vol. 30, pp. 40-50, 2013.
[39] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, MIT Press, 2005.
[40] C. E. Rasmussen and H. Nickisch, "Gaussian Processes for Machine Learning (GPML) toolbox", Journal of Machine Learning Research, Vol. 11, pp. 3011-3015, 2010.
[41] J. Ribando, "Measuring solid angles beyond dimension three", Discrete & Computational Geometry, Vol. 36, pp. 479-487, 2006.
[42] R. Rifkin and A. Klautau, "In defense of one-vs-all classification", Journal of Machine Learning Research, Vol. 5, pp. 101-141, 2004.
[43] M. Seeger, "Gaussian processes for machine learning", International Journal of Neural Systems, Vol. 14, No. 2, pp. 69-106, 2004.
[44] M. Seeger and M. Jordan, "Sparse Gaussian process classification with multiple classes", Technical Report TR 661, Department of Statistics, University of California, Berkeley, CA, 2004.
[45] M. Schervish, "Multivariate normal probabilities with error bound", Applied Statistics, Vol. 33, pp. 81-87, 1984.
[46] S. Sundararajan and S. S. Keerthi, "Predictive approaches for choosing hyperparameters in Gaussian processes", Neural Computation, Vol. 13, pp. 1103-1118, 2001.
[47] M. K. Titsias, "Variational learning of inducing variables in sparse Gaussian processes", Journal of Machine Learning Research - Proceedings Track, Vol. 5, pp. 567-574, 2009.
[48] M. K. Titsias, N. D. Lawrence, and M. Rattray, "Efficient sampling for Gaussian process inference using control variables", Advances in Neural Information Processing Systems 12. ~mtitsias/software.html
[51] R. Urtasun and T. Darrell, "Discriminative Gaussian process latent variable model for classification", Proceedings of the 24th International Conference on Machine Learning (ICML'2007), 2007.
[52] S. Van Vaerenbergh, M. Lázaro-Gredilla, and I. Santamaria, "Kernel recursive least-squares tracker for time-varying regression", IEEE Transactions on Neural Networks and Learning Systems, Vol. 23, pp. 1313-1326, 2012.
[53] J. Vanhatalo and A. Vehtari, "Speeding up the binary Gaussian process classification", Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI'2010), Catalina Island, CA, 2010.
[54] A. Vehtari, S. Särkkä, and J. Lampinen, "On MCMC sampling in Bayesian MLP neural networks", Proceedings of the 2000 International Joint Conference on Neural Networks, Shun-Ichi Amari, C. Lee Giles, Marco Gori, and Vincenzo Piuri, Eds., Vol. I, pp. 317-322, 2000.
[55] J. T. Webster, "On the application of the method of Das in evaluating a multivariate normal integral", Biometrika, Vol. 57, pp. 657-659, 1970.
[56] C. Williams and D. Barber, "Bayesian classification with Gaussian processes",