Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification
Dimitrios Milios, Raffaello Camoriano, Pietro Michiardi, Lorenzo Rosasco, Maurizio Filippone
Dimitrios Milios
EURECOM, Sophia Antipolis, France [email protected]
Raffaello Camoriano
University of Genoa, Genoa, Italy [email protected]
Pietro Michiardi
EURECOM, Sophia Antipolis, France
Lorenzo Rosasco
University of Genoa, LCSL, IIT & MIT [email protected]
Maurizio Filippone
EURECOM, Sophia Antipolis, France [email protected]
Abstract
In this paper, we study the problem of deriving fast and accurate classification algorithms with uncertainty quantification. Gaussian process classification provides a principled approach, but the corresponding computational burden is hardly sustainable in large-scale problems, and devising efficient alternatives is a challenge. In this work, we investigate if and how Gaussian process regression directly applied to the classification labels can be used to tackle this question. While in this case training time is remarkably faster, predictions need to be calibrated for classification and uncertainty estimation. To this aim, we propose a novel approach based on interpreting the labels as the output of a Dirichlet distribution. Extensive experimental results show that the proposed approach provides essentially the same accuracy and uncertainty quantification as Gaussian process classification while requiring only a fraction of computational resources.
Classification is a classic machine learning task. While the most basic performance measure is classification accuracy, in practice assigning a calibrated confidence to the predictions is often crucial [5]. For example, in image classification providing predictions with a calibrated score is important to avoid making over-confident decisions [6, 11, 14]. Several classification algorithms that output a continuous score are not necessarily calibrated (e.g., support vector machines [22]). Popular ways to calibrate classifiers use a validation set to learn a transformation of their output score that recovers calibration; these include Platt scaling [22] and isotonic regression [36]. Calibration can also be achieved if a sensible loss function is employed [12], for example the logistic/cross-entropy loss, and it is known to be positively impacted if the classifier is well regularized [6].

Bayesian approaches provide a natural framework to tackle these kinds of questions, since quantification of uncertainty is of primary interest. In particular, Gaussian Process Classification (GPC) [23, 34, 8] combines the flexibility of Gaussian Processes (GPs) [23] and the regularization stemming from their probabilistic nature with the use of the correct likelihood for classification, that is, a Bernoulli or multinomial for binary or multi-class classification, respectively. While we are not aware of empirical studies on the calibration properties of GPC, our results confirm the intuition that GPC is actually calibrated. The most severe drawback of GPC, however, is that it requires carrying out several matrix factorizations, making it unattractive for large-scale problems.

In this paper, we study the question of whether Gaussian process approaches can be made efficient to find accurate and well-calibrated classification rules. A simple idea is to use Gaussian process regression directly on the classification labels. This idea is quite common in non-probabilistic approaches [32, 25] and can be grounded from a decision-theoretic point of view. Indeed, the Bayes' rule minimizing the expected least squares is the expected conditional probability, which in classification is directly related to the conditional probabilities of each class; see, e.g., [29, 3]. Directly regressing the labels leads to fast training and excellent classification accuracies [27, 16, 10]. However, the corresponding predictions are not calibrated for uncertainty quantification. The question is then if calibration can be achieved while retaining speed.

The main contribution of our work is the proposal of a transformation of the classification labels, which turns the original problem into a regression problem without compromising calibration. For GPs, this has the enormous advantage of bypassing the need for expensive posterior approximations, leading to a method that is as fast as a simple regression of the original labels. The proposed method is based on the interpretation of the labels as the output of a Dirichlet distribution, so we name it Dirichlet-based GP classification (GPD). Through an extensive experimental validation, including large-scale classification tasks, we demonstrate that GPD is calibrated and competitive in performance with state-of-the-art GPC.

Calibration of classifiers:
Platt scaling [22] and isotonic regression [36] are popular methods to calibrate the output scores of classifiers. More recently, Beta calibration [12] and temperature scaling [6] have been proposed to extend the class of possible transformations and to reduce the parameterization of the transformation, respectively. It is established that binary classifiers are calibrated when they employ the logistic loss; this is a direct consequence of the fact that the appropriate model for Bernoulli-distributed variables is the one associated with this loss [12]. The extension to multi-class problems yields the so-called cross-entropy loss, which corresponds to the multinomial likelihood. The right loss does not necessarily make classifiers well calibrated, however; recent works on calibration of convolutional neural networks for image classification show that depth negatively impacts calibration due to the introduction of a large number of parameters to optimize, and that regularization is important to recover calibration [6].
Kernel-based classification:
Performing regression on classification labels is also known as least-squares classification [32, 25]. We are not aware of works that study GP-based least-squares classification in depth; we could only find a few comments on it in [23] (Sec. 6.5). GPC is usually approached assuming a latent process, which is given a GP prior, that is transformed into a probability of class labels through a suitable squashing function [23]. Due to the non-conjugacy between the GP prior and the non-Gaussian likelihood, applying standard Bayesian inference techniques in GPC leads to analytical intractabilities, and it is necessary to resort to approximations. Standard ways to approximate computations include the Laplace Approximation [34] and Expectation Propagation [18]; see, e.g., [15, 20] for a detailed review of these methods. More recently, there have been advancements in works that extend "sparse" GP approximations [33] to classification [9] in order to deal with the issues of scalability with the number of observations. All these approaches require iterative refinements of the posterior distribution over the latent GP, and this implies expensive algebraic operations at each iteration.

Consider a multi-class classification problem. Given a set of N training inputs X = {x_1, …, x_N} and their corresponding labels Y = {y_1, …, y_N}, with one-hot encoded classes denoted by the vectors y_i, a classifier produces a predicted label f(x_*) as a function of any new input x_*.

In the literature, calibration is assessed through the Expected Calibration Error (ECE) [6], which is the average of the absolute difference between accuracy and confidence:

ECE = \sum_{m=1}^{M} \frac{|X_m|}{|X_*|} \left| \mathrm{acc}(f(X_m), Y_m) - \mathrm{conf}(f, X_m) \right|,   (1)

where the test set X_* is divided into disjoint subsets {X_1, …, X_M}, each corresponding to a given level of confidence conf(f, X_m) predicted by the classifier f, while acc(f(X_m), Y_m) is the classification accuracy of f measured on the m-th subset. Other metrics used in this work to characterize the quality of a classifier are the mean negative log-likelihood (MNLL) of the test set under the model, and the error rate on the test set. All metrics are defined so that lower values are better.
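To make the metric concrete, here is a minimal NumPy sketch of Equation (1); the function name and the equal-width binning scheme are our choices, not prescribed by the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, M=10):
    """ECE of Eq. (1): weighted average of |accuracy - confidence| over
    M equal-width confidence bins; probs: (N, C), labels: (N,) integers."""
    conf = probs.max(axis=1)                   # predicted confidence
    pred = probs.argmax(axis=1)                # predicted class
    bins = np.minimum((conf * M).astype(int), M - 1)
    ece = 0.0
    for m in range(M):
        idx = bins == m
        if idx.any():
            acc_m = (pred[idx] == labels[idx]).mean()
            conf_m = conf[idx].mean()
            ece += idx.mean() * abs(acc_m - conf_m)  # |X_m|/|X_*| weighting
    return ece
```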
GP classification (GPC)

GP-based classification is defined by the following abstract steps:

1. A Gaussian prior is placed over a latent function f(x). The GP prior is characterized by the mean function μ(x) and the covariance function k(x, x'). The observable (non-Gaussian) prior is obtained by transforming f through a sigmoid function, so that the sampled functions produce proper probability values. In the multi-class case, we consider C independent priors over the vector of functions f = [f_1, …, f_C]^T; transformation to proper probabilities is achieved by applying the softmax function σ(f), defined by σ(f)_j = exp(f_j) / Σ_{c=1}^{C} exp(f_c), for j = 1, …, C.
2. The observed labels y are associated with a categorical likelihood with probability components p(y_c | f) = σ(f(x))_c, for any c ∈ {1, …, C}.
3. The latent posterior is obtained by means of Bayes' theorem.
4. The latent posterior is transformed via σ(f) to obtain a distribution over class probabilities.

Throughout this work, we consider μ(x) = 0 and k(x, x') = a exp(−‖x − x'‖² / l²). This choice of covariance function is also known as the RBF kernel; it is characterized by the hyper-parameters a and l, interpreted as the GP marginal variance and length-scale, respectively. The hyper-parameters are commonly selected by maximizing the marginal likelihood of the model.

The major computational challenge of GPC can be identified in Step 3 above. The categorical likelihood implies that the posterior stochastic process is not Gaussian and cannot be calculated analytically. Therefore, different approaches resort to Gaussian approximations of the likelihood, so that the resulting approximate Gaussian posterior has the following form:

p(f | X, Y) ≈ q(f | X, Y) ∝ p(f) \, \mathcal{N}(\tilde{\mu}, \tilde{\Sigma}).   (2)

For example, in Expectation Propagation (EP, [18]) \tilde{\mu} and \tilde{\Sigma} are determined by the site parameters, which are learned via an iterative process. In the variational approach of [9], \tilde{\mu} and \tilde{\Sigma} are the variational parameters of the approximate Gaussian likelihoods. Despite being successful, approaches like these contribute significantly to the computational cost of classification, as they introduce a large number of parameters that need to be optimized. In this work, we explore a more straightforward Gaussian approximation to the likelihood that requires no significant computational overhead.
GP regression (GPR) on classification labels

A simple way to bypass the problem induced by categorical likelihoods is to perform least-squares regression on the labels by ignoring their discrete nature. This implies considering a Gaussian likelihood p(y | f) = \mathcal{N}(f, σ_n²), where σ_n² is the observation noise variance. It is well-known that if the observed labels are 0 and 1, then the function f that minimizes the mean squared error converges to the true class probabilities in the limit of infinite data [24]. Nevertheless, by not squashing f through a softmax function, we can no longer guarantee that the resulting distribution of functions will lie within 0 and 1. For this reason, additional calibration steps are required (i.e. Platt scaling).
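As a reference point for what follows, this is a minimal NumPy sketch of plain GP regression on one-hot labels under the RBF kernel stated above; the helper names and default hyper-parameter values are our assumptions.

```python
import numpy as np

def rbf_kernel(A, B, a=1.0, l=1.0):
    """RBF kernel a * exp(-||x - x'||^2 / l^2) between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return a * np.exp(-d2 / l**2)

def gpr_on_labels(X, Y, Xs, noise_var=0.1, a=1.0, l=1.0):
    """GP regression directly on one-hot labels Y (N x C) with homoskedastic
    noise; note that the returned posterior mean is not constrained to [0, 1]."""
    K = rbf_kernel(X, X, a, l) + noise_var * np.eye(len(X))
    Ks = rbf_kernel(X, Xs, a, l)
    return Ks.T @ np.linalg.solve(K, Y)  # posterior mean at test points Xs
```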
Figure 1: Convergence of classifiers with different loss functions and regularization properties. Left: summary of the mean squared error (MSE) from the true function f_p for 1000 randomly sampled training sets of different sizes; the Bayesian CE-based classifier is characterized by smaller variance even when the number of training inputs is small. Right: demonstration of how the averaged classifiers approximate the true function for different training sizes.

Kernel Ridge Regression (KRR) for classification
The idea of directly regressing labels is quite common when GP estimators are applied within a frequentist context [25]. Here they are typically derived from a non-probabilistic perspective based on empirical risk minimization, and the corresponding approach is dubbed Kernel Ridge Regression [7]. Taking this perspective, two comments can be made. The first is that the noise and covariance parameters are viewed as regularization parameters that need to be tuned, typically by cross-validation. In our experiments, we compare this method with a canonical GPR approach. The second comment is that regressing labels with least squares can be justified from a decision-theoretic point of view. The Bayes' rule minimizing the expected least squares is the regression function (the expected conditional probability), which in binary classification is proportional to the conditional probability of one of the two classes [3] (similar reasoning applies to multi-class classification [19, 2]). From this perspective, one could expect a least-squares estimator to be self-calibrated; however, this is typically not the case in practice, a shortcoming attributed to the limited number of points and the choice of function models. In the following we briefly present Platt scaling, a simple and effective post-hoc calibration method which can be seamlessly applied to both GPR- and KRR-based learning pipelines to obtain a calibrated model.
Platt scaling
Platt scaling [22] is an effective approach to perform post-hoc calibration for different types of classifiers, such as SVMs [21] and neural networks [6]. Given a decision function f, which is the result of a trained binary classifier, the class probabilities are given by the sigmoid transformation π(x) = σ(a f(x) + b), where a and b are optimized over a separate validation set, so that the resulting model best explains the data. Although this parametric form may seem restrictive, Platt scaling has been shown to be effective for a wide range of classifiers [21].
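A minimal sketch of Platt scaling, fitting a and b by minimizing the negative Bernoulli log-likelihood on held-out scores; we omit the smoothed targets of Platt's original procedure, and the function name is our assumption.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Fit Platt scaling parameters (a, b) so that sigmoid(a * f(x) + b)
    best explains held-out binary labels in {0, 1}."""
    def nll(params):
        a, b = params
        z = a * scores + b
        # -sum(y log p + (1-y) log(1-p)) = sum(log(1+exp(z)) - y*z), stably
        return np.sum(np.logaddexp(0.0, z) - labels * z)
    res = minimize(nll, x0=np.array([1.0, 0.0]), method="BFGS")
    return res.x  # (a, b)
```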
We advocate that two components are critical for well-calibrated classifiers: regularization and the cross-entropy loss. Previous work indicates that regularization has a positive effect on calibration [6]. Also, classifiers that rely on the cross-entropy loss are reported to be well-calibrated [21]. This form of loss function is equivalent to the negative Bernoulli log-likelihood (or categorical in the multi-class case), which is the proper interpretation of classification outcomes.

In Figure 1, we demonstrate the effects of regularization and cross-entropy empirically: we summarize classification results on four synthetic datasets of increasing size. We assume that each class label is sampled from a Bernoulli distribution with probability given by the unknown function f_p : R^d → [0, 1], with input space of dimensionality d. For a classifier to be well-calibrated, it should accurately approximate f_p. We fit three kinds of classifiers: a maximum likelihood (ML) classifier that relies on the cross-entropy loss (CE), a Bayesian classifier with MSE loss (i.e. GPR classification), and finally a Bayesian classifier that relies on CE (i.e. GPC). We report the averages over 1000 iterations and the average standard deviation. The Bayesian classifier that relies on the cross-entropy loss converges to the true solution at a faster rate and is characterized by smaller variance.
Although performing GPR on the labels induces regularization through the prior, the likelihood model is not appropriate. One possible solution is to employ meticulous likelihood approximations such as EP or variational GP classification [9], alas at an often prohibitive computational cost, especially for considerably large datasets. In the section that follows, we introduce a methodology that combines the best of both worlds. We propose to perform GP regression on labels transformed in such a way that a less crude approximation of the categorical likelihood is achieved.

GP regression on transformed Dirichlet variables

There is an obvious defect in GP-based least-squares classification: each point is associated with a Gaussian likelihood, which is not the appropriate noise model for Bernoulli-distributed variables. Instead of approximating the true non-Gaussian likelihood, we propose to transform the labels into a latent space where a Gaussian approximation to the likelihood is more sensible.

For a given input, the goal of a Bayesian classifier is to estimate the distribution over its class probability vector; such a distribution is naturally represented by a Dirichlet-distributed random variable. More formally, in a C-class classification problem each observation y is a sample from a categorical distribution Cat(π). The objective is to infer the class probabilities π = [π_1, …, π_C]^T, for which we use a Dirichlet model: π ∼ Dir(α). In order to fully describe the distribution of class probabilities, we have to estimate the concentration parameters α = [α_1, …, α_C]^T. Given an observation y such that y_k = 1, our best guess for the values of α will be: α_k = 1 + α_ε and α_i = α_ε, ∀ i ≠ k. Note that it is necessary to add a small quantity 0 < α_ε ≪ 1, so as to have valid parameters for the Dirichlet distribution. Intuitively, we implicitly induce a Dirichlet prior so that before observing a data point we have the probability mass shared equally across C classes; we know that we should observe exactly one count for a particular class, but we do not know which one. Most of the mass is concentrated on the corresponding class when y is observed. This practice can be thought of as the categorical/Bernoulli analogue of the noisy observations in GP regression. The likelihood model is:

p(y | α) = Cat(π), where π ∼ Dir(α).   (3)

It is well-known that a Dirichlet sample can be generated by sampling from C independent Gamma-distributed random variables with shape parameters α_i and rate λ = 1; realizations of the class probabilities can be generated as follows:

\pi_i = \frac{x_i}{\sum_{c=1}^{C} x_c}, \quad \text{where } x_i \sim \mathrm{Gamma}(\alpha_i, 1).   (4)

Therefore, the noisy Dirichlet likelihood assumed for each observation translates to C independent Gamma likelihoods with shape parameters either α_i = 1 + α_ε, if y_i = 1, or α_i = α_ε otherwise.

In order to construct a Gaussian likelihood in the log-space, we approximate each Gamma-distributed x_i with \tilde{x}_i \sim \mathrm{Lognormal}(\tilde{y}_i, \tilde{\sigma}_i^2), which has the same mean and variance (i.e. moment matching):

\mathbb{E}[x_i] = \mathbb{E}[\tilde{x}_i] \Leftrightarrow \alpha_i = \exp(\tilde{y}_i + \tilde{\sigma}_i^2 / 2),
\mathrm{Var}[x_i] = \mathrm{Var}[\tilde{x}_i] \Leftrightarrow \alpha_i = \left( \exp(\tilde{\sigma}_i^2) - 1 \right) \exp(2 \tilde{y}_i + \tilde{\sigma}_i^2).

Thus, for the parameters of the normally distributed logarithm we have:

\tilde{y}_i = \log \alpha_i - \tilde{\sigma}_i^2 / 2, \quad \tilde{\sigma}_i^2 = \log(1 / \alpha_i + 1).   (5)

Note that this is the first approximation to the likelihood that we have employed so far.
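The whole transformation amounts to a few array operations. A minimal NumPy sketch of Equation (5), mapping one-hot labels to heteroskedastic regression targets; the function name and default α_ε value are our assumptions.

```python
import numpy as np

def dirichlet_transform(Y, a_eps=0.01):
    """Map one-hot labels Y (N x C) to the latent regression targets of
    Eq. (5): targets y_tilde and per-point noise variances s2_tilde."""
    alpha = Y + a_eps                          # 1 + a_eps where y_i = 1, else a_eps
    s2_tilde = np.log(1.0 / alpha + 1.0)       # sigma_tilde^2 = log(1/alpha + 1)
    y_tilde = np.log(alpha) - 0.5 * s2_tilde   # y_tilde = log(alpha) - sigma_tilde^2/2
    return y_tilde, s2_tilde
```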
One could argue that a log-normal approximation to a Gamma-distributed variable is reasonable, although it is not perfect for small values of the shape parameter α_i. However, the most important implication is that we can now consider a Gaussian likelihood in the log-space. Assuming a vector of latent processes f = [f_1, …, f_C]^T, we have:

p(\tilde{y}_i | \mathbf{f}) = \mathcal{N}(f_i, \tilde{\sigma}_i^2),   (6)

where the class label is now denoted by \tilde{y}_i in the transformed logarithmic space. It is important to note that the noise parameter \tilde{\sigma}_i^2 is different for each observation; we have a heteroskedastic regression model. In fact, the \tilde{\sigma}_i^2 values (as well as \tilde{y}_i) solely depend on the Dirichlet pseudo-count assumed in the prior, which only has two possible values. Given this likelihood approximation, it is straightforward to place a GP prior over f and evaluate the posterior over C latent processes exactly.
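Since the noise enters only on the diagonal, exact inference reduces to standard GP regression with a per-point noise vector. A minimal sketch for a single latent process (one column of the transformed targets), reusing the rbf_kernel helper from the earlier sketch and assuming the zero-mean prior stated above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior_hetero(X, y_tilde, s2_tilde, Xs, a=1.0, l=1.0):
    """Exact GP posterior for one latent process under the heteroskedastic
    likelihood of Eq. (6); s2_tilde holds the per-point noise variances."""
    K = rbf_kernel(X, X, a, l) + np.diag(s2_tilde)  # noisy training covariance
    Ks = rbf_kernel(X, Xs, a, l)                    # train/test covariance
    Kss = rbf_kernel(Xs, Xs, a, l)
    cf = cho_factor(K)
    mean = Ks.T @ cho_solve(cf, y_tilde)            # posterior mean
    cov = Kss - Ks.T @ cho_solve(cf, Ks)            # posterior covariance
    return mean, cov
```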
Remark: In the binary classification case, we still have to perform regression on two latent processes. The use of a heteroskedastic noise model implies that one latent process is not a mirrored version of the other (see Figure 2), contrary to GPC.
Figure 2: Example of Dirichlet regression for a one-dimensional binary classification problem. Left: the latent GP posterior for class "0" (top) and class "1" (bottom). Right: the transformed posterior through softmax for class "0" (top) and class "1" (bottom).

GP posterior to Dirichlet variables

The obtained GP posterior emulates the logarithm of a stochastic process with Gamma marginals that gives rise to the Dirichlet posterior distributions. It is straightforward to sample from the posterior log-normal marginals, which should behave as Gamma-distributed samples, to generate posterior Dirichlet samples as in Equation (4). It is easy to see that this corresponds to a simple application of the softmax function on the posterior GP samples. The expectation of class probabilities will be:

\mathbb{E}[\pi_{i,*} | X] = \int \frac{\exp(f_{i,*})}{\sum_j \exp(f_{j,*})} \, p(\mathbf{f}_* | X) \, d\mathbf{f}_*,   (7)

which can be approximated by sampling from the Gaussian posterior p(\mathbf{f}_* | X).

Figure 2 is an example of Dirichlet regression for a one-dimensional binary classification problem. The left-side panels demonstrate how the GP posterior approximates the transformed data; the error bars represent the standard deviation for each data-point. Notice that the posterior for class "0" (top) is not a mirror image of class "1" (bottom), because of the different noise terms assumed for each latent process. The right-side panels show results in the original output space, after applying the softmax transformation; as expected in the binary case, one posterior process is a mirror image of the other.
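Equation (7) is straightforward to approximate with Monte Carlo samples from the independent Gaussian posterior marginals. A minimal NumPy sketch; the function name and the default sample count are our assumptions.

```python
import numpy as np

def predict_class_probs(means, variances, n_samples=1000, rng=None):
    """Monte Carlo estimate of Eq. (7): average the softmax of samples
    drawn from the C independent Gaussian posterior marginals.
    means, variances: arrays of shape (N_test, C)."""
    rng = np.random.default_rng() if rng is None else rng
    f = means[None] + np.sqrt(variances)[None] * rng.standard_normal(
        (n_samples,) + means.shape)                 # latent process samples
    e = np.exp(f - f.max(axis=-1, keepdims=True))   # numerically stable softmax
    return (e / e.sum(axis=-1, keepdims=True)).mean(axis=0)
```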
Choice of α_ε

The performance of Dirichlet-based classification is affected by the choice of α_ε, in addition to the usual GP hyperparameters. As α_ε approaches zero, α_i converges to either 1 or 0. It is easy to see that for the transformed "1" labels we have \tilde{\sigma}_i^2 = \log 2 and \tilde{y}_i = \log(1/\sqrt{2}) in the limit. The transformed "0" labels, however, converge to infinity, and so do their variances. The role of α_ε is to make the transformed labels finite, so that it is possible to perform regression. The smaller α_ε is, the further apart the transformed labels will be, but at the same time, the variance for the "0" label will be larger.

By increasing α_ε, the transformed labels of different classes tend to be closer. The marginal log-likelihood tends to be larger, as it is easier for a zero-mean GP prior to fit the data. However, this behavior is not desirable for classification purposes. For this reason, the Gaussian marginal log-likelihood in the transformed space is not appropriate to determine the optimal value for α_ε.

Figure 3 demonstrates the effect of α_ε on classification accuracy, as reflected by the MNLL metric. Each subfigure corresponds to a different dataset; MNLL is reported for different choices of α_ε between 0.0001 and 0.1. As a general remark, it appears that there is no globally optimal α_ε parameter across datasets. However, the reported training and testing MNLL curves appear to be in agreement regarding the optimal choice for α_ε. We therefore propose to select the α_ε value that minimizes the MNLL on the training data.
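A quick numerical check of this trade-off, evaluating Equation (5) at a few values of α_ε (the grid matches the one explored in Figure 3):

```python
import numpy as np

# Transformed-label targets and variances for the "1" and "0" classes as a
# function of a_eps, illustrating the trade-off discussed above.
for a_eps in [0.1, 0.01, 0.001, 0.0001]:
    a1, a0 = 1.0 + a_eps, a_eps
    s2_1, s2_0 = np.log(1 / a1 + 1), np.log(1 / a0 + 1)
    y1, y0 = np.log(a1) - s2_1 / 2, np.log(a0) - s2_0 / 2
    print(f"a_eps={a_eps}: y1={y1:.2f} (var {s2_1:.2f}), "
          f"y0={y0:.2f} (var {s2_0:.2f})")
```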
We experimentally evaluate the methodologies discussed on the datasets outlined in Table 1. For the implementation of GP-based models, we use and extend the algorithms available in the GPflow library [17]. More specifically, for GPC we make use of variational sparse GP [8], while for GPR we use sparse variational GP regression [33]. The latter is also the basis for our GPD implementation: we apply adjustments so that heteroskedastic noise is admitted, as dictated by the Dirichlet mapping.

Figure 3: Exploration of α_ε for four different datasets with respect to the MNLL metric: (a) HTRU2, (b) MAGIC, (c) LETTER, (d) DRIVE.

Table 1: Datasets used for evaluation, available from the UCI repository [1].

Dataset     Classes   Training instances   Test instances   Dimensionality   Inducing points
EEG
HTRU2
MAGIC
MINIBOO
COVERBIN
SUSY
LETTER      26        15000                5000             16               200
DRIVE       11        48509                10000            48               500
MOCAP
Concerning KRR, in order to scale it up to large-scale problems we use a subsampling-based variant named Nyström KRR (NKRR) [31, 35]. Nyström-based approaches have been shown to achieve state-of-the-art accuracy on large-scale learning problems [13, 30, 26, 4, 28]. The number of inducing (subsampled) points used for each dataset is reported in Table 1.
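A minimal sketch of a Nyström KRR estimator, restricting the kernel expansion to m subsampled (inducing) points and solving the resulting regularized linear system; it reuses rbf_kernel from the earlier sketch, and the regularization value and function names are our assumptions.

```python
import numpy as np

def nkrr_fit(X, Y, inducing_idx, lam=1e-3, a=1.0, l=1.0):
    """Nystrom KRR: solve (K_nm^T K_nm + lam * K_mm) alpha = K_nm^T Y,
    where the m inducing points are a subsample of the n training inputs."""
    Xm = X[inducing_idx]
    Knm = rbf_kernel(X, Xm, a, l)    # n x m cross-kernel
    Kmm = rbf_kernel(Xm, Xm, a, l)   # m x m kernel on inducing points
    A = Knm.T @ Knm + lam * Kmm + 1e-8 * np.eye(len(Xm))  # jitter for stability
    alpha = np.linalg.solve(A, Knm.T @ Y)
    return Xm, alpha

def nkrr_predict(Xs, Xm, alpha, a=1.0, l=1.0):
    return rbf_kernel(Xs, Xm, a, l) @ alpha
```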
The experiments have been repeated for 10 random training/test splits. For each iteration, inducing points are chosen by applying k-means clustering on the training inputs. Exceptions are COVERBIN and SUSY, for which we used 5 splits and inducing points chosen uniformly at random. For GPR we further split each training dataset: 80% is used to train the model and the remaining 20% is used for calibration with Platt scaling. NKRR uses an 80-20% split for k-fold cross-validation and Platt scaling calibration, respectively. For each of the datasets, the α_ε parameter of GPD was selected according to the training MNLL, with one value for COVERBIN, a second shared by LETTER, DRIVE and MOCAP, and a third for the remaining datasets.

In all experiments, we consider an isotropic RBF kernel; the kernel hyperparameters are selected by maximizing the marginal likelihood for the GP-based approaches, and by k-fold cross-validation for NKRR (with k = 10 for all datasets except for SUSY, for which k = 5). In the case of GPR, we also optimize the noise variance jointly with all kernel parameters.
The performance of GPD, GPC, GPR and NKRR is compared in terms of various error metrics, including error rate, MNLL and ECE, for a collection of datasets. The error rate, MNLL and ECE values obtained are summarized in Figure 4. The GPC method tends to outperform GPR in most cases. Regarding the GPD approach, its performance tends to lie between GPC and GPR; in some instances its classification performance is better than GPC and NKRR. Most importantly, this performance is obtained at a fraction of the computational time required by the GPC method. Figure 5 summarizes the speed-up achieved by the use of GPD as opposed to the variational GP classification approach.

This dramatic difference in computational efficiency has some interesting implications regarding the applicability of GP-based classification methods on large datasets. GP-based machine learning approaches are known to be computationally expensive; their practical application on large datasets demands the use of scalable methods to perform approximate inference. The approximation quality of sparse approaches depends on the number (and the selection) of inducing points. In the case of classification, the speed-up obtained by GPD implies that the computational budget saved can be spent on a more fine-grained sparse GP approximation. In Figure 5, we explore the effect of increasing the number of inducing points for three datasets: LETTER, MINIBOO and MOCAP; we see that the error rate drops below the target GPC with a fixed number of inducing points, and still at a fraction of the computational effort.
Figure 4: Error rate, MNLL and ECE for the datasets considered in this work.
Figure 5: Left: Speed-up obtained by using GPD as opposed to GPC. Right: Error vs. training time for GPD as the number of inducing points is increased for three datasets: (a) LETTER (GPC ~14K sec), (b) MINIBOO (GPC ~10K sec), (c) MOCAP (GPC ~16K sec). The dashed line represents the error obtained by GPC using the same number of inducing points as the fastest GPD listed.
Most GP-based approaches to classification in the literature are characterized by a meticulous approximation of the likelihood. In this work, we experimentally show that such GP classifiers tend to be well-calibrated, meaning that they correctly estimate classification uncertainty, as expressed through class probabilities. Despite this desirable property, their applicability is limited to datasets of small/moderate size, due to the high computational complexity of approximating the true posterior distribution.

Least-squares classification, which may be implemented either as GPR or KRR, is an established practice for more scalable classification. However, the crude approximation of a non-Gaussian likelihood with a Gaussian one has a negative impact on classification quality, especially as reflected by the calibration properties of the classifier.

Considering the strengths and practical limitations of GPs, we proposed a classification approach that is essentially a heteroskedastic GP regression on a latent space induced by a transformation of the labels, which are viewed as Dirichlet-distributed random variables. This allowed us to convert C-class classification to a problem of regression for C latent processes with Gamma likelihoods. We then proposed to approximate the Gamma-distributed variables with log-normal ones, and thus we achieved a sensible Gaussian approximation in the logarithmic space. Crucially, this can be seen as a pre-processing step that does not have to be learned, unlike GPC, where an accurate transformation is sought iteratively. Our experimental analysis shows that Dirichlet-based GP classification produces well-calibrated classifiers without the need for post-hoc calibration steps. The performance of our approach in terms of classification accuracy tends to lie between properly-approximated GPC and least-squares classification, but most importantly it is orders of magnitude faster than GPC.

As a final remark, we note that the predictive distribution of the GPD approach is different from that obtained by GPC, as can be seen in the extended results in the appendix. An extended characterization of the predictive distribution for GPD is the subject of future work.
Acknowledgments
DM and PM are partially supported by KPMG. RC would like to thank Luigi Carratino for the useful exchanges concerning large-scale experiments. LR is funded by the Air Force project FA9550-17-1-0390 (European Office of Aerospace Research and Development) and the RISE project NoMADS - DLV-777826. MF gratefully acknowledges support from the AXA Research Fund.
References

[1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[2] L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Multi-output learning via spectral filtering. Machine Learning, 87(3):259–301, 2012.
[3] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006.
[4] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco. NYTRO: When Subsampling Meets Early Stopping. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1403–1411, 2016.
[5] P. A. Flach. Classifier Calibration, pages 1–8. Springer US, Boston, MA, 2016.
[6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017.
[7] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2001.
[8] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable Variational Gaussian Process Classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360. PMLR, 2015.
[9] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems 28, pages 1648–1656. Curran Associates, Inc., 2015.
[10] P. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on TIMIT. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
[11] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, Mar. 2017.
[12] M. Kull, T. S. Filho, and P. Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 623–631. PMLR, 2017.
[13] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström Method. In NIPS, pages 1060–1068. Curran Associates, Inc., 2009.
[14] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world, Feb. 2017. arXiv:1607.02533.
[15] M. Kuss and C. E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[16] Z. Lu, A. May, K. Liu, A. B. Garakani, D. Guo, A. Bellet, L. Fan, M. Collins, B. Kingsbury, M. Picheny, and F. Sha. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. CoRR, abs/1411.4000, 2014.
[17] A. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
[18] T. P. Minka. Expectation Propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
[19] Y. Mroueh, T. Poggio, L. Rosasco, and J.-J. Slotine. Multiclass learning with simplex coding. In Advances in Neural Information Processing Systems, pages 2789–2797, 2012.
[20] H. Nickisch and C. E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
[21] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 625–632. ACM, 2005.
[22] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 1999.
[23] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[25] R. Rifkin, G. Yeo, T. Poggio, and others. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.
[26] A. Rudi, R. Camoriano, and L. Rosasco. Less is More: Nyström Computational Regularization. In Advances in Neural Information Processing Systems 28, pages 1657–1665. Curran Associates, Inc., 2015.
[27] A. Rudi, L. Carratino, and L. Rosasco. FALKON: An Optimal Large Scale Kernel Method. In Advances in Neural Information Processing Systems 30, pages 3888–3898. Curran Associates, Inc., 2017.
[28] A. Rudi, L. Carratino, and L. Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891–3901, 2017.
[29] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
[30] S. Si, C.-J. Hsieh, and I. S. Dhillon. Memory Efficient Kernel Approximation. In ICML, volume 32 of JMLR Proceedings, pages 701–709. JMLR.org, 2014.
[31] A. J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In ICML, pages 911–918. Morgan Kaufmann, 2000.
[32] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[33] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of JMLR Proceedings, pages 567–574. JMLR.org, 2009.
[34] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1342–1351, 1998.
[35] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688. MIT Press, 2000.
[36] B. Zadrozny and C. Elkan. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 694–699. ACM, 2002.
A Extended calibration results
Reliability diagrams offer a visual representation of calibration properties, where accuracy is plotted as a function of confidence for the subsets {X_1, …, X_M}. For a perfectly calibrated classifier, the accuracy function should be equal to the identity line, implying that acc(X_m) = conf(X_m). Large deviations from the identity line mean that the class probabilities are either underestimated or overestimated.

In Figure 6 we summarize the reliability diagrams for a number of binary classification datasets. Each row of diagrams corresponds to a particular dataset, and each column to one of the GP-based classification approaches that we have discussed in this work. In particular, we consider the variational GP classification algorithm of [9] (GPC), our Dirichlet-based classification scheme (GPD), and GP regression on the labels without and with a Platt-scaling post-hoc calibration step (GPR and GPR (PLATT)). Note that each one of these approaches produces a distribution of classifiers. Thus, in the diagrams of Figure 6 we show the reliability curve of the mean classifier (depicted as solid lines with points), along with the classifiers described by the upper and lower 95% quantiles of the predictive distribution (grey area). If a classifier is well-calibrated, then its reliability curve should be close to the identity curve (dashed line); the latter should also lie within the limits of the grey area.

For the results of Figure 6, we have considered M = 10 subsets for different levels of confidence. We note that deviations from the identity curve should not be deemed important if they are not backed by a sufficient number of samples. For some datasets there are certain levels of confidence (middle section of HTRU2 …). GPC and GPD produce well-calibrated models. The GPR approach, on the other hand, tends to produce a sigmoid-shaped reliability curve, which suggests that there is underestimation of the class probabilities. This behavior is cured, however, by performing calibration via Platt scaling, as we see for the GPR (PLATT) method.

These conclusions are further supported by Figure 7, which summarizes the reliability plots for a number of multi-class datasets. In the multi-class case, it is not obvious how to concisely summarize the effect of the predictive distribution, so we resort to simple reliability plots of the average classifier for each method.

As a final remark, we note that the predictive distribution of the GPD approach is different from that obtained by GPC. In fact, judging from the upper and lower quantile classifiers as presented in Figure 6, it appears that GPD results in a narrower predictive distribution, which is nevertheless well-calibrated. An extended characterization of the predictive distribution for GPD is the subject of future work.
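A minimal NumPy sketch of the binning behind such reliability diagrams, returning per-bin (confidence, accuracy) pairs to be plotted against the identity line; the function name is our assumption.

```python
import numpy as np

def reliability_curve(probs, labels, M=10):
    """Bin test points by predicted confidence into M equal-width bins and
    return per-bin (mean confidence, accuracy) pairs; probs: (N, C)."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    bins = np.minimum((conf * M).astype(int), M - 1)
    out = []
    for m in range(M):
        idx = bins == m
        if idx.any():
            out.append((conf[idx].mean(), (pred[idx] == labels[idx]).mean()))
    return np.array(out)
```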
Figure 6: Calibration results for four different GP-based classification approaches on four binary classification datasets. The bounds correspond to the classifiers given by the 95% confidence interval of the posterior GP. (Per-panel ECE values: EEG: GPC 0.0420, GPD 0.0225, GPR 0.0928, GPR (Platt) 0.0244; HTRU2: GPC 0.0418, GPD 0.0398, GPR 0.0461, GPR (Platt) 0.0436; MAGIC: GPC 0.0187, GPD 0.0301, GPR 0.0506, GPR (Platt) 0.0225; CovertypeBinary: GPC 0.0150, GPD 0.0223, GPR 0.0395, GPR (Platt) 0.0117.)

Figure 7: Reliability plots of the average classifier for the multi-class datasets letter, Drive and MoCap. (Per-panel ECE values: letter: GPC 0.0374, GPD 0.0540, GPR 0.3511, GPR (Platt) 0.0731; Drive: GPC 0.0447, GPD 0.0511, GPR 0.1895, GPR (Platt) 0.0469.)