Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification
Dimitrios Milios, Raffaello Camoriano, Pietro Michiardi, Lorenzo Rosasco, Maurizio Filippone
Dimitrios Milios
EURECOM, Sophia Antipolis, France [email protected]
Raffaello Camoriano
University of Genoa, Genoa, Italy [email protected]
Pietro Michiardi
EURECOM, Sophia Antipolis, France
Lorenzo Rosasco
University of Genoa, LCSL, IIT & MIT [email protected]
Maurizio Filippone
EURECOM, Sophia Antipolis, France [email protected]
Abstract
In this paper, we study the problem of deriving fast and accurate classification algorithms with uncertainty quantification. Gaussian process classification provides a principled approach, but the corresponding computational burden is hardly sustainable in large-scale problems, and devising efficient alternatives is a challenge. In this work, we investigate if and how Gaussian process regression directly applied to the classification labels can be used to tackle this question. While in this case training time is remarkably faster, predictions need to be calibrated for classification and uncertainty estimation. To this aim, we propose a novel approach based on interpreting the labels as the output of a Dirichlet distribution. Extensive experimental results show that the proposed approach provides essentially the same accuracy and uncertainty quantification as Gaussian process classification while requiring only a fraction of computational resources.
Classification is a classic machine learning task. While the most basic performance measure is classification accuracy, in practice assigning a calibrated confidence to the predictions is often crucial [5]. For example, in image classification providing predictions with a calibrated score is important to avoid making over-confident decisions [6, 11, 14]. Several classification algorithms that output a continuous score are not necessarily calibrated (e.g., support vector machines [22]). Popular ways to calibrate classifiers use a validation set to learn a transformation of their output score that recovers calibration; these include Platt scaling [22] and isotonic regression [36]. Calibration can also be achieved if a sensible loss function is employed [12], for example the logistic/cross-entropy loss, and it is known to be positively impacted if the classifier is well regularized [6].

Bayesian approaches provide a natural framework to tackle these kinds of questions, since quantification of uncertainty is of primary interest. In particular, Gaussian Process Classification (GPC) [23, 34, 8] combines the flexibility of Gaussian Processes (GPs) [23] and the regularization stemming from their probabilistic nature with the use of the correct likelihood for classification, that is, a Bernoulli or multinomial for binary or multi-class classification, respectively. While we are not aware of empirical studies on the calibration properties of GPC, our results confirm the intuition that GPC is actually calibrated. The most severe drawback of GPC, however, is that it requires carrying out several matrix factorizations, making it unattractive for large-scale problems.

In this paper, we study the question of whether Gaussian process approaches can be made efficient to find accurate and well-calibrated classification rules. A simple idea is to use Gaussian process regression directly on the classification labels. This idea is quite common in non-probabilistic approaches [32, 25] and can be grounded from a decision-theoretic point of view. Indeed, the Bayes' rule minimizing the expected least squares is the expected conditional probability, which in classification is directly related to the conditional probabilities of each class; see, e.g., [29, 3]. Directly regressing the labels leads to fast training and excellent classification accuracies [27, 16, 10]. However, the corresponding predictions are not calibrated for uncertainty quantification. The question is then if calibration can be achieved while retaining speed.

The main contribution of our work is the proposal of a transformation of the classification labels, which turns the original problem into a regression problem without compromising calibration. For GPs, this has the enormous advantage of bypassing the need for expensive posterior approximations, leading to a method that is as fast as a simple regression of the original labels. The proposed method is based on the interpretation of the labels as the output of a Dirichlet distribution, so we name it Dirichlet-based GP classification (GPD). Through an extensive experimental validation, including large-scale classification tasks, we demonstrate that GPD is calibrated and competitive in performance with state-of-the-art GPC.

Calibration of classifiers:
Platt scaling [22] and isotonic regression [36] are popular methods to calibrate the output scores of classifiers. More recently, Beta calibration [12] and temperature scaling [6] have been proposed to extend the class of possible transformations and to reduce the parameterization of the transformation, respectively. It is established that binary classifiers are calibrated when they employ the logistic loss; this is a direct consequence of the fact that the appropriate model for Bernoulli-distributed variables is the one associated with this loss [12]. The extension to multi-class problems yields the so-called cross-entropy loss, which corresponds to the multinomial likelihood. The right loss does not necessarily make classifiers well calibrated, however; recent works on calibration of convolutional neural networks for image classification show that depth negatively impacts calibration due to the introduction of a large number of parameters to optimize, and that regularization is important to recover calibration [6].
Kernel-based classification:
Performing regression on classification labels is also known as least-squares classification [32, 25]. We are not aware of works that study GP-based least-squares classification in depth; we could only find a few comments on it in [23] (Sec. 6.5). GPC is usually approached assuming a latent process, which is given a GP prior, that is transformed into a probability of class labels through a suitable squashing function [23]. Due to the non-conjugacy between the GP prior and the non-Gaussian likelihood, applying standard Bayesian inference techniques in GPC leads to analytical intractabilities, and it is necessary to resort to approximations. Standard ways to approximate computations include the Laplace Approximation [34] and Expectation Propagation [18]; see, e.g., [15, 20] for a detailed review of these methods. More recently, there have been advancements in works that extend "sparse" GP approximations [33] to classification [9] in order to deal with the issues of scalability with the number of observations. All these approaches require iterative refinements of the posterior distribution over the latent GP, and this implies expensive algebraic operations at each iteration.

Consider a multi-class classification problem. Given a set of N training inputs X = {x_1, …, x_N} and their corresponding labels Y = {y_1, …, y_N}, with one-hot encoded classes denoted by the vectors y_i, a classifier produces a predicted label f(x_*) as a function of any new input x_*.

In the literature, calibration is assessed through the Expected Calibration Error (ECE) [6], which is the average of the absolute difference between accuracy and confidence:

ECE = \sum_{m=1}^{M} \frac{|X_m|}{|X_*|} \left| \mathrm{acc}(f(X_m), Y_m) - \mathrm{conf}(f, X_m) \right|,   (1)

where the test set X_* is divided into disjoint subsets {X_1, …, X_M}, each corresponding to a given level of confidence conf(f, X_m) predicted by the classifier f, while acc(f(X_m), Y_m) is the classification accuracy of f measured on the m-th subset. Other metrics used in this work to characterize the quality of a classifier are the mean negative log-likelihood (MNLL) of the test set under the model, and the error rate on the test set. All metrics are defined so that lower values are better.
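To make the metric concrete, here is a minimal NumPy sketch of Equation (1); the function name and the equal-width binning scheme are our choices, not prescribed by the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, M=10):
    """ECE of Eq. (1): weighted average of |accuracy - confidence| over
    M equal-width confidence bins; probs: (N, C), labels: (N,) integers."""
    conf = probs.max(axis=1)                   # predicted confidence
    pred = probs.argmax(axis=1)                # predicted class
    bins = np.minimum((conf * M).astype(int), M - 1)
    ece = 0.0
    for m in range(M):
        idx = bins == m
        if idx.any():
            acc_m = (pred[idx] == labels[idx]).mean()
            conf_m = conf[idx].mean()
            ece += idx.mean() * abs(acc_m - conf_m)  # |X_m|/|X_*| weighting
    return ece
```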
GP classification (GPC)

GP-based classification is defined by the following abstract steps:

1. A Gaussian prior is placed over a latent function f(x). The GP prior is characterized by the mean function μ(x) and the covariance function k(x, x'). The observable (non-Gaussian) prior is obtained by transforming f through a sigmoid function, so that the sampled functions produce proper probability values. In the multi-class case, we consider C independent priors over the vector of functions f = [f_1, …, f_C]^T; transformation to proper probabilities is achieved by applying the softmax function σ(f), defined by σ(f)_j = exp(f_j) / Σ_{c=1}^{C} exp(f_c), for j = 1, …, C.
2. The observed labels y are associated with a categorical likelihood with probability components p(y_c | f) = σ(f(x))_c, for any c ∈ {1, …, C}.
3. The latent posterior is obtained by means of Bayes' theorem.
4. The latent posterior is transformed via σ(f) to obtain a distribution over class probabilities.

Throughout this work, we consider μ(x) = 0 and k(x, x') = a exp(−‖x − x'‖² / l²). This choice of covariance function is also known as the RBF kernel; it is characterized by the hyper-parameters a and l, interpreted as the GP marginal variance and length-scale, respectively. The hyper-parameters are commonly selected by maximizing the marginal likelihood of the model.

The major computational challenge of GPC can be identified in Step 3 above. The categorical likelihood implies that the posterior stochastic process is not Gaussian and cannot be calculated analytically. Therefore, different approaches resort to Gaussian approximations of the likelihood, so that the resulting approximate Gaussian posterior has the following form:

p(f | X, Y) ≈ q(f | X, Y) ∝ p(f) \, \mathcal{N}(\tilde{\mu}, \tilde{\Sigma}).   (2)

For example, in Expectation Propagation (EP, [18]) \tilde{\mu} and \tilde{\Sigma} are determined by the site parameters, which are learned via an iterative process. In the variational approach of [9], \tilde{\mu} and \tilde{\Sigma} are the variational parameters of the approximate Gaussian likelihoods. Despite being successful, approaches like these contribute significantly to the computational cost of classification, as they introduce a large number of parameters that need to be optimized. In this work, we explore a more straightforward Gaussian approximation to the likelihood that requires no significant computational overhead.
GP regression (GPR) on classification labels

A simple way to bypass the problem induced by categorical likelihoods is to perform least-squares regression on the labels by ignoring their discrete nature. This implies considering a Gaussian likelihood p(y | f) = \mathcal{N}(f, σ_n²), where σ_n² is the observation noise variance. It is well-known that if the observed labels are 0 and 1, then the function f that minimizes the mean squared error converges to the true class probabilities in the limit of infinite data [24]. Nevertheless, by not squashing f through a softmax function, we can no longer guarantee that the resulting distribution of functions will lie within 0 and 1. For this reason, additional calibration steps are required (i.e. Platt scaling).
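As a reference point for what follows, this is a minimal NumPy sketch of plain GP regression on one-hot labels under the RBF kernel stated above; the helper names and default hyper-parameter values are our assumptions.

```python
import numpy as np

def rbf_kernel(A, B, a=1.0, l=1.0):
    """RBF kernel a * exp(-||x - x'||^2 / l^2) between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return a * np.exp(-d2 / l**2)

def gpr_on_labels(X, Y, Xs, noise_var=0.1, a=1.0, l=1.0):
    """GP regression directly on one-hot labels Y (N x C) with homoskedastic
    noise; note that the returned posterior mean is not constrained to [0, 1]."""
    K = rbf_kernel(X, X, a, l) + noise_var * np.eye(len(X))
    Ks = rbf_kernel(X, Xs, a, l)
    return Ks.T @ np.linalg.solve(K, Y)  # posterior mean at test points Xs
```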
Figure 1: Convergence of classifiers with different loss functions and regularization properties. Left: summary of the mean squared error (MSE) from the true function f_p for 1000 randomly sampled training sets of different sizes; the Bayesian CE-based classifier is characterized by smaller variance even when the number of training inputs is small. Right: demonstration of how the averaged classifiers approximate the true function for different training sizes.

Kernel Ridge Regression (KRR) for classification
The idea of directly regressing labels is quite common when GP estimators are applied within a frequentist context [25]. Here they are typically derived from a non-probabilistic perspective based on empirical risk minimization, and the corresponding approach is dubbed Kernel Ridge Regression [7]. Taking this perspective, two comments can be made. The first is that the noise and covariance parameters are viewed as regularization parameters that need to be tuned, typically by cross-validation. In our experiments, we compare this method with a canonical GPR approach. The second comment is that regressing labels with least squares can be justified from a decision-theoretic point of view. The Bayes' rule minimizing the expected least squares is the regression function (the expected conditional probability), which in binary classification is proportional to the conditional probability of one of the two classes [3] (similar reasoning applies to multi-class classification [19, 2]). From this perspective, one could expect a least-squares estimator to be self-calibrated; however, this is typically not the case in practice, a shortcoming attributed to the limited number of points and the choice of function models. In the following we briefly present Platt scaling, a simple and effective post-hoc calibration method which can be seamlessly applied to both GPR- and KRR-based learning pipelines to obtain a calibrated model.
Platt scaling
Platt scaling [22] is an effective approach to perform post-hoc calibration for different types of classifiers, such as SVMs [21] and neural networks [6]. Given a decision function f, which is the result of a trained binary classifier, the class probabilities are given by the sigmoid transformation π(x) = σ(a f(x) + b), where a and b are optimized over a separate validation set, so that the resulting model best explains the data. Although this parametric form may seem restrictive, Platt scaling has been shown to be effective for a wide range of classifiers [21].
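A minimal sketch of Platt scaling, fitting a and b by minimizing the negative Bernoulli log-likelihood on held-out scores; we omit the smoothed targets of Platt's original procedure, and the function name is our assumption.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Fit Platt scaling parameters (a, b) so that sigmoid(a * f(x) + b)
    best explains held-out binary labels in {0, 1}."""
    def nll(params):
        a, b = params
        z = a * scores + b
        # -sum(y log p + (1-y) log(1-p)) = sum(log(1+exp(z)) - y*z), stably
        return np.sum(np.logaddexp(0.0, z) - labels * z)
    res = minimize(nll, x0=np.array([1.0, 0.0]), method="BFGS")
    return res.x  # (a, b)
```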
We advocate that two components are critical for well-calibrated classifiers: regularization and the cross-entropy loss. Previous work indicates that regularization has a positive effect on calibration [6]. Also, classifiers that rely on the cross-entropy loss are reported to be well-calibrated [21]. This form of loss function is equivalent to the negative Bernoulli log-likelihood (or categorical in the multi-class case), which is the proper interpretation of classification outcomes.

In Figure 1, we demonstrate the effects of regularization and cross-entropy empirically: we summarize classification results on four synthetic datasets of increasing size. We assume that each class label is sampled from a Bernoulli distribution with probability given by the unknown function f_p : R^d → [0, 1], with input space of dimensionality d. For a classifier to be well-calibrated, it should accurately approximate f_p. We fit three kinds of classifiers: a maximum likelihood (ML) classifier that relies on the cross-entropy loss (CE), a Bayesian classifier with MSE loss (i.e. GPR classification), and finally a Bayesian classifier that relies on CE (i.e. GPC). We report the averages over 1000 iterations and the average standard deviation. The Bayesian classifier that relies on the cross-entropy loss converges to the true solution at a faster rate and is characterized by smaller variance.
Although performing GPR on the labels induces regularization through the prior, the likelihood model is not appropriate. One possible solution is to employ meticulous likelihood approximations such as EP or variational GP classification [9], alas at an often prohibitive computational cost, especially for considerably large datasets. In the section that follows, we introduce a methodology that combines the best of both worlds. We propose to perform GP regression on labels transformed in such a way that a less crude approximation of the categorical likelihood is achieved.

GP regression on transformed Dirichlet variables

There is an obvious defect in GP-based least-squares classification: each point is associated with a Gaussian likelihood, which is not the appropriate noise model for Bernoulli-distributed variables. Instead of approximating the true non-Gaussian likelihood, we propose to transform the labels into a latent space where a Gaussian approximation to the likelihood is more sensible.

For a given input, the goal of a Bayesian classifier is to estimate the distribution over its class probability vector; such a distribution is naturally represented by a Dirichlet-distributed random variable. More formally, in a C-class classification problem each observation y is a sample from a categorical distribution Cat(π). The objective is to infer the class probabilities π = [π_1, …, π_C]^T, for which we use a Dirichlet model: π ∼ Dir(α). In order to fully describe the distribution of class probabilities, we have to estimate the concentration parameters α = [α_1, …, α_C]^T. Given an observation y such that y_k = 1, our best guess for the values of α will be: α_k = 1 + α_ε and α_i = α_ε, ∀ i ≠ k. Note that it is necessary to add a small quantity 0 < α_ε ≪ 1, so as to have valid parameters for the Dirichlet distribution. Intuitively, we implicitly induce a Dirichlet prior so that before observing a data point we have the probability mass shared equally across C classes; we know that we should observe exactly one count for a particular class, but we do not know which one. Most of the mass is concentrated on the corresponding class when y is observed. This practice can be thought of as the categorical/Bernoulli analogue of the noisy observations in GP regression. The likelihood model is:

p(y | α) = Cat(π), where π ∼ Dir(α).   (3)

It is well-known that a Dirichlet sample can be generated by sampling from C independent Gamma-distributed random variables with shape parameters α_i and rate λ = 1; realizations of the class probabilities can be generated as follows:

\pi_i = \frac{x_i}{\sum_{c=1}^{C} x_c}, \quad \text{where } x_i \sim \mathrm{Gamma}(\alpha_i, 1).   (4)

Therefore, the noisy Dirichlet likelihood assumed for each observation translates to C independent Gamma likelihoods with shape parameters either α_i = 1 + α_ε, if y_i = 1, or α_i = α_ε otherwise.

In order to construct a Gaussian likelihood in the log-space, we approximate each Gamma-distributed x_i with \tilde{x}_i \sim \mathrm{Lognormal}(\tilde{y}_i, \tilde{\sigma}_i^2), which has the same mean and variance (i.e. moment matching):

\mathbb{E}[x_i] = \mathbb{E}[\tilde{x}_i] \Leftrightarrow \alpha_i = \exp(\tilde{y}_i + \tilde{\sigma}_i^2 / 2),
\mathrm{Var}[x_i] = \mathrm{Var}[\tilde{x}_i] \Leftrightarrow \alpha_i = \left( \exp(\tilde{\sigma}_i^2) - 1 \right) \exp(2 \tilde{y}_i + \tilde{\sigma}_i^2).

Thus, for the parameters of the normally distributed logarithm we have:

\tilde{y}_i = \log \alpha_i - \tilde{\sigma}_i^2 / 2, \quad \tilde{\sigma}_i^2 = \log(1 / \alpha_i + 1).   (5)

Note that this is the first approximation to the likelihood that we have employed so far.
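The whole transformation amounts to a few array operations. A minimal NumPy sketch of Equation (5), mapping one-hot labels to heteroskedastic regression targets; the function name and default α_ε value are our assumptions.

```python
import numpy as np

def dirichlet_transform(Y, a_eps=0.01):
    """Map one-hot labels Y (N x C) to the latent regression targets of
    Eq. (5): targets y_tilde and per-point noise variances s2_tilde."""
    alpha = Y + a_eps                          # 1 + a_eps where y_i = 1, else a_eps
    s2_tilde = np.log(1.0 / alpha + 1.0)       # sigma_tilde^2 = log(1/alpha + 1)
    y_tilde = np.log(alpha) - 0.5 * s2_tilde   # y_tilde = log(alpha) - sigma_tilde^2/2
    return y_tilde, s2_tilde
```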
One could argue that a log-normal approximation to a Gamma-distributed variable is reasonable, although it is not perfect for small values of the shape parameter α_i. However, the most important implication is that we can now consider a Gaussian likelihood in the log-space. Assuming a vector of latent processes f = [f_1, …, f_C]^T, we have:

p(\tilde{y}_i | \mathbf{f}) = \mathcal{N}(f_i, \tilde{\sigma}_i^2),   (6)

where the class label is now denoted by \tilde{y}_i in the transformed logarithmic space. It is important to note that the noise parameter \tilde{\sigma}_i^2 is different for each observation; we have a heteroskedastic regression model. In fact, the \tilde{\sigma}_i^2 values (as well as \tilde{y}_i) solely depend on the Dirichlet pseudo-count assumed in the prior, which only has two possible values. Given this likelihood approximation, it is straightforward to place a GP prior over f and evaluate the posterior over C latent processes exactly.
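Since the noise enters only on the diagonal, exact inference reduces to standard GP regression with a per-point noise vector. A minimal sketch for a single latent process (one column of the transformed targets), reusing the rbf_kernel helper from the earlier sketch and assuming the zero-mean prior stated above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior_hetero(X, y_tilde, s2_tilde, Xs, a=1.0, l=1.0):
    """Exact GP posterior for one latent process under the heteroskedastic
    likelihood of Eq. (6); s2_tilde holds the per-point noise variances."""
    K = rbf_kernel(X, X, a, l) + np.diag(s2_tilde)  # noisy training covariance
    Ks = rbf_kernel(X, Xs, a, l)                    # train/test covariance
    Kss = rbf_kernel(Xs, Xs, a, l)
    cf = cho_factor(K)
    mean = Ks.T @ cho_solve(cf, y_tilde)            # posterior mean
    cov = Kss - Ks.T @ cho_solve(cf, Ks)            # posterior covariance
    return mean, cov
```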
Remark: In the binary classification case, we still have to perform regression on two latent processes. The use of a heteroskedastic noise model implies that one latent process is not a mirrored version of the other (see Figure 2), contrary to GPC.
Figure 2: Example of Dirichlet regression for a one-dimensional binary classification problem. Left: the latent GP posterior for class "0" (top) and class "1" (bottom). Right: the transformed posterior through softmax for class "0" (top) and class "1" (bottom).

GP posterior to Dirichlet variables

The obtained GP posterior emulates the logarithm of a stochastic process with Gamma marginals that gives rise to the Dirichlet posterior distributions. It is straightforward to sample from the posterior log-normal marginals, which should behave as Gamma-distributed samples, to generate posterior Dirichlet samples as in Equation (4). It is easy to see that this corresponds to a simple application of the softmax function on the posterior GP samples. The expectation of class probabilities will be:

\mathbb{E}[\pi_{i,*} | X] = \int \frac{\exp(f_{i,*})}{\sum_j \exp(f_{j,*})} \, p(\mathbf{f}_* | X) \, d\mathbf{f}_*,   (7)

which can be approximated by sampling from the Gaussian posterior p(\mathbf{f}_* | X).

Figure 2 is an example of Dirichlet regression for a one-dimensional binary classification problem. The left-side panels demonstrate how the GP posterior approximates the transformed data; the error bars represent the standard deviation for each data-point. Notice that the posterior for class "0" (top) is not a mirror image of class "1" (bottom), because of the different noise terms assumed for each latent process. The right-side panels show results in the original output space, after applying the softmax transformation; as expected in the binary case, one posterior process is a mirror image of the other.
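Equation (7) is straightforward to approximate with Monte Carlo samples from the independent Gaussian posterior marginals. A minimal NumPy sketch; the function name and the default sample count are our assumptions.

```python
import numpy as np

def predict_class_probs(means, variances, n_samples=1000, rng=None):
    """Monte Carlo estimate of Eq. (7): average the softmax of samples
    drawn from the C independent Gaussian posterior marginals.
    means, variances: arrays of shape (N_test, C)."""
    rng = np.random.default_rng() if rng is None else rng
    f = means[None] + np.sqrt(variances)[None] * rng.standard_normal(
        (n_samples,) + means.shape)                 # latent process samples
    e = np.exp(f - f.max(axis=-1, keepdims=True))   # numerically stable softmax
    return (e / e.sum(axis=-1, keepdims=True)).mean(axis=0)
```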
Choice of α_ε

The performance of Dirichlet-based classification is affected by the choice of α_ε, in addition to the usual GP hyperparameters. As α_ε approaches zero, α_i converges to either 1 or 0. It is easy to see that for the transformed "1" labels we have \tilde{\sigma}_i^2 = \log 2 and \tilde{y}_i = \log(1/\sqrt{2}) in the limit. The transformed "0" labels, however, converge to infinity, and so do their variances. The role of α_ε is to make the transformed labels finite, so that it is possible to perform regression. The smaller α_ε is, the further apart the transformed labels will be, but at the same time, the variance for the "0" label will be larger.

By increasing α_ε, the transformed labels of different classes tend to be closer. The marginal log-likelihood tends to be larger, as it is easier for a zero-mean GP prior to fit the data. However, this behavior is not desirable for classification purposes. For this reason, the Gaussian marginal log-likelihood in the transformed space is not appropriate to determine the optimal value for α_ε.

Figure 3 demonstrates the effect of α_ε on classification accuracy, as reflected by the MNLL metric. Each subfigure corresponds to a different dataset; MNLL is reported for different choices of α_ε between 0.0001 and 0.1. As a general remark, it appears that there is no globally optimal α_ε parameter across datasets. However, the reported training and testing MNLL curves appear to be in agreement regarding the optimal choice for α_ε. We therefore propose to select the α_ε value that minimizes the MNLL on the training data.
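A quick numerical check of this trade-off, evaluating Equation (5) at a few values of α_ε (the grid matches the one explored in Figure 3):

```python
import numpy as np

# Transformed-label targets and variances for the "1" and "0" classes as a
# function of a_eps, illustrating the trade-off discussed above.
for a_eps in [0.1, 0.01, 0.001, 0.0001]:
    a1, a0 = 1.0 + a_eps, a_eps
    s2_1, s2_0 = np.log(1 / a1 + 1), np.log(1 / a0 + 1)
    y1, y0 = np.log(a1) - s2_1 / 2, np.log(a0) - s2_0 / 2
    print(f"a_eps={a_eps}: y1={y1:.2f} (var {s2_1:.2f}), "
          f"y0={y0:.2f} (var {s2_0:.2f})")
```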
We experimentally evaluate the methodologies discussed on the datasets outlined in Table 1. For the implementation of GP-based models, we use and extend the algorithms available in the GPflow library [17]. More specifically, for GPC we make use of variational sparse GP [8], while for GPR we use sparse variational GP regression [33]. The latter is also the basis for our GPD implementation: we apply adjustments so that heteroskedastic noise is admitted, as dictated by the Dirichlet mapping.

Figure 3: Exploration of α_ε for four different datasets with respect to the MNLL metric: (a) HTRU2, (b) MAGIC, (c) LETTER, (d) DRIVE.

Table 1: Datasets used for evaluation, available from the UCI repository [1].

Dataset     Classes   Training instances   Test instances   Dimensionality   Inducing points
EEG
HTRU2
MAGIC
MINIBOO
COVERBIN
SUSY
LETTER      26        15000                5000             16               200
DRIVE       11        48509                10000            48               500
MOCAP
Concerning KRR, in order to scale it up to large-scale problems we use a subsampling-based variant named Nyström KRR (NKRR) [31, 35]. Nyström-based approaches have been shown to achieve state-of-the-art accuracy on large-scale learning problems [13, 30, 26, 4, 28]. The number of inducing (subsampled) points used for each dataset is reported in Table 1.
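A minimal sketch of a Nyström KRR estimator, restricting the kernel expansion to m subsampled (inducing) points and solving the resulting regularized linear system; it reuses rbf_kernel from the earlier sketch, and the regularization value and function names are our assumptions.

```python
import numpy as np

def nkrr_fit(X, Y, inducing_idx, lam=1e-3, a=1.0, l=1.0):
    """Nystrom KRR: solve (K_nm^T K_nm + lam * K_mm) alpha = K_nm^T Y,
    where the m inducing points are a subsample of the n training inputs."""
    Xm = X[inducing_idx]
    Knm = rbf_kernel(X, Xm, a, l)    # n x m cross-kernel
    Kmm = rbf_kernel(Xm, Xm, a, l)   # m x m kernel on inducing points
    A = Knm.T @ Knm + lam * Kmm + 1e-8 * np.eye(len(Xm))  # jitter for stability
    alpha = np.linalg.solve(A, Knm.T @ Y)
    return Xm, alpha

def nkrr_predict(Xs, Xm, alpha, a=1.0, l=1.0):
    return rbf_kernel(Xs, Xm, a, l) @ alpha
```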
The experiments have been repeated for 10 random training/test splits. For each iteration, inducing points are chosen by applying k-means clustering on the training inputs. Exceptions are COVERBIN and SUSY, for which we used 5 splits and inducing points chosen uniformly at random. For GPR we further split each training dataset: 80% is used to train the model and the remaining 20% is used for calibration with Platt scaling. NKRR uses an 80-20% split for k-fold cross-validation and Platt scaling calibration, respectively. For each of the datasets, the α_ε parameter of GPD was selected according to the training MNLL, with one value for COVERBIN, a second shared by LETTER, DRIVE and MOCAP, and a third for the remaining datasets.

In all experiments, we consider an isotropic RBF kernel; the kernel hyperparameters are selected by maximizing the marginal likelihood for the GP-based approaches, and by k-fold cross-validation for NKRR (with k = 10 for all datasets except for SUSY, for which k = 5). In the case of GPR, we also optimize the noise variance jointly with all kernel parameters.
The performance of GPD, GPC, GPR and NKRR is compared in terms of various error metrics, including error rate, MNLL and ECE, for a collection of datasets. The error rate, MNLL and ECE values obtained are summarized in Figure 4. The GPC method tends to outperform GPR in most cases. Regarding the GPD approach, its performance tends to lie between GPC and GPR; in some instances its classification performance is better than GPC and NKRR. Most importantly, this performance is obtained at a fraction of the computational time required by the GPC method. Figure 5 summarizes the speed-up achieved by the use of GPD as opposed to the variational GP classification approach.

This dramatic difference in computational efficiency has some interesting implications regarding the applicability of GP-based classification methods on large datasets. GP-based machine learning approaches are known to be computationally expensive; their practical application on large datasets demands the use of scalable methods to perform approximate inference. The approximation quality of sparse approaches depends on the number (and the selection) of inducing points. In the case of classification, the speed-up obtained by GPD implies that the computational budget saved can be spent on a more fine-grained sparse GP approximation. In Figure 5, we explore the effect of increasing the number of inducing points for three datasets: LETTER, MINIBOO and MOCAP; we see that the error rate drops below the target GPC with a fixed number of inducing points, and still at a fraction of the computational effort.
Figure 4: Error rate, MNLL and ECE for the datasets considered in this work.
Figure 5: Left: Speed-up obtained by using GPD as opposed to GPC. Right: Error vs. training time for GPD as the number of inducing points is increased for three datasets: (a) LETTER (GPC ~14K sec), (b) MINIBOO (GPC ~10K sec), (c) MOCAP (GPC ~16K sec). The dashed line represents the error obtained by GPC using the same number of inducing points as the fastest GPD listed.
Most GP-based approaches to classification in the literature are characterized by a meticulous approximation of the likelihood. In this work, we experimentally show that such GP classifiers tend to be well-calibrated, meaning that they correctly estimate classification uncertainty, as expressed through class probabilities. Despite this desirable property, their applicability is limited to datasets of small/moderate size, due to the high computational complexity of approximating the true posterior distribution.

Least-squares classification, which may be implemented either as GPR or KRR, is an established practice for more scalable classification. However, the crude approximation of a non-Gaussian likelihood with a Gaussian one has a negative impact on classification quality, especially as reflected by the calibration properties of the classifier.

Considering the strengths and practical limitations of GPs, we proposed a classification approach that is essentially a heteroskedastic GP regression on a latent space induced by a transformation of the labels, which are viewed as Dirichlet-distributed random variables. This allowed us to convert C-class classification to a problem of regression for C latent processes with Gamma likelihoods. We then proposed to approximate the Gamma-distributed variables with log-normal ones, and thus we achieved a sensible Gaussian approximation in the logarithmic space. Crucially, this can be seen as a pre-processing step that does not have to be learned, unlike GPC, where an accurate transformation is sought iteratively. Our experimental analysis shows that Dirichlet-based GP classification produces well-calibrated classifiers without the need for post-hoc calibration steps. The performance of our approach in terms of classification accuracy tends to lie between properly-approximated GPC and least-squares classification, but most importantly it is orders of magnitude faster than GPC.

As a final remark, we note that the predictive distribution of the GPD approach is different from that obtained by GPC, as can be seen in the extended results in the appendix. An extended characterization of the predictive distribution for GPD is the subject of future work.
Acknowledgments
DM and PM are partially supported by KPMG. RC would like to thank Luigi Carratino for the useful exchanges concerning large-scale experiments. LR is funded by the Air Force project FA9550-17-1-0390 (European Office of Aerospace Research and Development) and the RISE project NoMADS - DLV-777826. MF gratefully acknowledges support from the AXA Research Fund.
References

[1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[2] L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Multi-output learning via spectral filtering. Machine Learning, 87(3):259–301, 2012.
[3] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006.
[4] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco. NYTRO: When Subsampling Meets Early Stopping. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1403–1411, 2016.
[5] P. A. Flach. Classifier Calibration, pages 1–8. Springer US, Boston, MA, 2016.
[6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017.
[7] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2001.
[8] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable Variational Gaussian Process Classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360. PMLR, 2015.
[9] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems 28, pages 1648–1656. Curran Associates, Inc., 2015.
[10] P. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on TIMIT. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
[11] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, Mar. 2017.
[12] M. Kull, T. S. Filho, and P. Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 623–631. PMLR, 2017.
[13] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström Method. In NIPS, pages 1060–1068. Curran Associates, Inc., 2009.
[14] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world, Feb. 2017. arXiv:1607.02533.
[15] M. Kuss and C. E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[16] Z. Lu, A. May, K. Liu, A. B. Garakani, D. Guo, A. Bellet, L. Fan, M. Collins, B. Kingsbury, M. Picheny, and F. Sha. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. CoRR, abs/1411.4000, 2014.
[17] A. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
[18] T. P. Minka. Expectation Propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
[19] Y. Mroueh, T. Poggio, L. Rosasco, and J.-J. Slotine. Multiclass learning with simplex coding. In Advances in Neural Information Processing Systems, pages 2789–2797, 2012.
[20] H. Nickisch and C. E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
[21] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 625–632. ACM, 2005.
[22] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.
[23] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[25] R. Rifkin, G. Yeo, T. Poggio, and others. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.
[26] A. Rudi, R. Camoriano, and L. Rosasco. Less is More: Nyström Computational Regularization. In Advances in Neural Information Processing Systems 28, pages 1657–1665. Curran Associates, Inc., 2015.
[27] A. Rudi, L. Carratino, and L. Rosasco. FALKON: An Optimal Large Scale Kernel Method. In Advances in Neural Information Processing Systems 30, pages 3888–3898. Curran Associates, Inc., 2017.
[28] A. Rudi, L. Carratino, and L. Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891–3901, 2017.
[29] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
[30] S. Si, C.-J. Hsieh, and I. S. Dhillon. Memory Efficient Kernel Approximation. In ICML, volume 32 of JMLR Proceedings, pages 701–709. JMLR.org, 2014.
[31] A. J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In ICML, pages 911–918. Morgan Kaufmann, 2000.
[32] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[33] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of JMLR Proceedings, pages 567–574. JMLR.org, 2009.
[34] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1342–1351, 1998.
[35] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688. MIT Press, 2000.
[36] B. Zadrozny and C. Elkan. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 694–699. ACM, 2002.
A Extended calibration results
Reliability diagrams offer a visual representation of calibration properties, where accuracy is plotted as a function of confidence for the subsets {X_1, …, X_M}. For a perfectly calibrated classifier, the accuracy function should be equal to the identity line, implying that acc(X_m) = conf(X_m). Large deviations from the identity line mean that the class probabilities are either underestimated or overestimated.

In Figure 6 we summarize the reliability diagrams for a number of binary classification datasets. Each row of diagrams corresponds to a particular dataset, and each column to one of the GP-based classification approaches that we have discussed in this work. In particular, we consider the variational GP classification algorithm of [9] (GPC), our Dirichlet-based classification scheme (GPD), and GP regression on the labels without and with a Platt-scaling post-hoc calibration step (GPR and GPR (PLATT)). Note that each one of these approaches produces a distribution of classifiers. Thus, in the diagrams of Figure 6 we show the reliability curve of the mean classifier (depicted as solid lines with points), along with the classifiers described by the upper and lower 95% quantiles of the predictive distribution (grey area). If a classifier is well-calibrated, then its reliability curve should be close to the identity curve (dashed line); the latter should also lie within the limits of the grey area.

For the results of Figure 6, we have considered M = 10 subsets for different levels of confidence. We note that deviations from the identity curve should not be deemed important if they are not backed by a sufficient number of samples. For some datasets there are certain levels of confidence (middle section of HTRU2 …). GPC and GPD produce well-calibrated models. The GPR approach, on the other hand, tends to produce a sigmoid-shaped reliability curve, which suggests that there is underestimation of the class probabilities. This behavior is cured, however, by performing calibration via Platt scaling, as we see for the GPR (PLATT) method.

These conclusions are further supported by Figure 7, which summarizes the reliability plots for a number of multi-class datasets. In the multi-class case, it is not obvious how to concisely summarize the effect of the predictive distribution, so we resort to simple reliability plots of the average classifier for each method.

As a final remark, we note that the predictive distribution of the GPD approach is different from that obtained by GPC. In fact, judging from the upper and lower quantile classifiers as presented in Figure 6, it appears that GPD results in a narrower predictive distribution, which is nevertheless well-calibrated. An extended characterization of the predictive distribution for GPD is the subject of future work.
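A minimal NumPy sketch of the binning behind such reliability diagrams, returning per-bin (confidence, accuracy) pairs to be plotted against the identity line; the function name is our assumption.

```python
import numpy as np

def reliability_curve(probs, labels, M=10):
    """Bin test points by predicted confidence into M equal-width bins and
    return per-bin (mean confidence, accuracy) pairs; probs: (N, C)."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    bins = np.minimum((conf * M).astype(int), M - 1)
    out = []
    for m in range(M):
        idx = bins == m
        if idx.any():
            out.append((conf[idx].mean(), (pred[idx] == labels[idx]).mean()))
    return np.array(out)
```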
Figure 6: Calibration results for four different GP-based classification approaches on four binary classification datasets. The bounds correspond to the classifiers given by the 95% confidence interval of the posterior GP. (Per-panel ECE values: EEG: GPC 0.0420, GPD 0.0225, GPR 0.0928, GPR (Platt) 0.0244; HTRU2: GPC 0.0418, GPD 0.0398, GPR 0.0461, GPR (Platt) 0.0436; MAGIC: GPC 0.0187, GPD 0.0301, GPR 0.0506, GPR (Platt) 0.0225; CovertypeBinary: GPC 0.0150, GPD 0.0223, GPR 0.0395, GPR (Platt) 0.0117.)

Figure 7: Reliability plots of the average classifier for the multi-class datasets letter, Drive and MoCap. (Per-panel ECE values: letter: GPC 0.0374, GPD 0.0540, GPR 0.3511, GPR (Platt) 0.0731; Drive: GPC 0.0447, GPD 0.0511, GPR 0.1895, GPR (Platt) 0.0469.)