Being Bayesian about Categorical Probability
Taejong Joo, Uijung Chung, Min-Gwan Seo
ESTsoft, Republic of Korea. Correspondence to: Taejong Joo <[email protected]>.

Abstract
Neural networks utilize the softmax as a building block in classification tasks, which contains an overconfidence problem and lacks an uncertainty representation ability. As a Bayesian alternative to the softmax, we consider a random variable of a categorical probability over class labels. In this framework, the prior distribution explicitly models the presumed noise inherent in the observed label, which provides consistent gains in generalization performance in multiple challenging tasks. The proposed method inherits advantages of Bayesian approaches that achieve better uncertainty estimation and model calibration. Our method can be implemented as a plug-and-play loss function with negligible computational overhead compared to the softmax with the cross-entropy loss function.
1. Introduction
Softmax (Bridle, 1990) is the de facto standard for post-processing of logits of neural networks (NNs) for classification. When combined with the maximum likelihood objective, it enables efficient gradient computation with respect to logits and has achieved state-of-the-art performances on many benchmark datasets. However, softmax lacks the ability to represent the uncertainty of predictions (Blundell et al., 2015; Gal & Ghahramani, 2016) and has poorly calibrated behavior (Guo et al., 2017). For instance, an NN with softmax can easily be fooled into confidently producing wrong outputs; when a digit 3 is rotated, it will predict it as the digit 8 or 4 with high confidence (Louizos & Welling, 2017). Another concern with the softmax is that its confident predictive behavior makes NNs subject to overfitting (Xie et al., 2016; Pereyra et al., 2017). This issue raises the need for effective regularization techniques for improving generalization performance.
Bayesian NNs (BNNs; MacKay, 1992) can address the aforementioned issues of softmax. BNNs provide quantifiable measures of uncertainty such as predictive entropy and mutual information (Gal, 2016) and enable automatic embodiment of Occam's razor (MacKay, 1995). However, some practical obstacles have impeded the wide adoption of BNNs. First, the intractable posterior inference in BNNs demands approximate methods such as variational inference (VI; Graves, 2011; Blundell et al., 2015) and Monte Carlo (MC) dropout (Gal & Ghahramani, 2016). Even with such novel approximation methods, concerns arise regarding both the degree of approximation and the computationally expensive posterior inference (Wu et al., 2019a; Osawa et al., 2019). In addition, under the extreme non-linearity between parameters and outputs in NNs, determining a meaningful weight prior distribution is challenging (Sun et al., 2019). Last but not least, BNNs often require considerable modifications to existing baselines, or they result in performance degradation (Lakshminarayanan et al., 2017).

In this paper, we apply the Bayesian principle to construct the target distribution for learning classifiers. Specifically, we regard a categorical probability as a random variable and construct the target distribution over the categorical probability by means of Bayesian inference, which is approximated by NNs. The resulting target distribution can be thought of as being regularized via the prior belief, whose impact is controlled by the number of observations. By considering only the random variable of categorical probability, the Bayesian principle can be efficiently adopted into existing deep learning building blocks without huge modifications. Our extensive experiments show the effectiveness of being Bayesian about the categorical probability in improving generalization performance, uncertainty estimation, and calibration.

Our contributions can be summarized as follows: 1) we show the importance of considering categorical probability as a random variable instead of being determined by the label; 2) we provide experimental results showing the usefulness of the Bayesian principle in improving generalization performance of large models on standard benchmark datasets, e.g., ResNeXt-101 on ImageNet; 3) we enable NNs to inherit the advantages of Bayesian methods in better uncertainty representation and well-calibrated behavior with a negligible increase in computational complexity.
Figure 1. Illustration of the difference between the softmax cross-entropy loss (a) and the belief matching framework (b) when each image is unique in the training set. In the softmax cross-entropy loss, the label "cat" is directly transformed into the target categorical distribution. In the belief matching framework, the label "cat" is combined with the prior Dirichlet distribution over the categorical probability. Then, Bayes' rule updates the belief about the categorical probability, which produces the target distribution.
2. Preliminary
This paper focuses on classification problems in which, given i.i.d. training samples $\mathcal{D} = \{ x^{(i)}, y^{(i)} \}_{i=1}^N \in (\mathcal{X} \times \mathcal{Y})^N$, we construct a classifier $F: \mathcal{X} \rightarrow \mathcal{Y}$. Here, $\mathcal{X}$ is an input space and $\mathcal{Y} = \{1, \cdots, K\}$ is a set of labels. We denote $x$ and $y$ as random variables whose unknown probability distributions generate inputs and labels, respectively. Also, we let $\tilde{y}$ be a one-hot representation of $y$.

Let $f^W: \mathcal{X} \rightarrow \mathcal{X}'$ be an NN with parameters $W$, where $\mathcal{X}' = \mathbb{R}^K$ is a logit space. In this paper, we take $\arg\max_j f^W_j$ as the classification model $F$, where $f^W_j$ denotes the $j$-th output basis of $f^W$, and we concentrate on the problem of learning $W$. Given $((x, y) \in \mathcal{D}, f^W)$, a standard loss function to minimize is the softmax cross-entropy loss, which applies the softmax to the logit and then computes the cross-entropy between a one-hot encoded label and the softmax output (Figure 1(a)). Specifically, the softmax, denoted by $\phi: \mathcal{X}' \rightarrow \triangle^{K-1}$, transforms a logit $f^W(x)$ into a normalized exponential form:

$$\phi_k(f^W(x)) = \frac{\exp(f^W_k(x))}{\sum_j \exp(f^W_j(x))} \quad (1)$$

and then the cross-entropy loss can be computed by $l^{CE}(\tilde{y}, \phi(f^W(x))) = -\sum_k \tilde{y}_k \log \phi_k(f^W(x))$. Here, note that the softmax output can be viewed as a parameter of the categorical distribution, which can be denoted by $\mathcal{P}^C(\phi(f^W(x)))$.

We can formulate the minimization of the softmax cross-entropy loss over $\mathcal{D}$ as a collection of distribution matching problems. To this end, let $c^{\mathcal{D}}(x)$ be a vector-valued function that counts the label frequency at $x \in \mathcal{X}$ in $\mathcal{D}$, which is defined as:

$$c^{\mathcal{D}}(x) = \sum_{(x', y') \in \mathcal{D}} \tilde{y}' \, \mathbb{1}_{\{x\}}(x') \quad (2)$$

where $\mathbb{1}_A(x)$ is an indicator function that takes 1 when $x \in A$ and 0 otherwise. Then, the empirical risk on $\mathcal{D}$ can be expressed as follows:

$$\hat{L}_{\mathcal{D}}(W) = -\frac{1}{N} \sum_{i=1}^N \log \phi_{y^{(i)}}(f^W(x^{(i)})) = \sum_{x \in G(\mathcal{D})} \frac{\sum_i c^{\mathcal{D}}_i(x)}{N} \, l^{\mathcal{D}}_x(W) + C \quad (3)$$

where $G(\mathcal{D})$ is the set of unique input values in $\mathcal{D}$, e.g., $G(\{1, 2, 2\}) = \{1, 2\}$, and $C$ is a constant with respect to $W$; $l^{\mathcal{D}}_x(W)$ measures the KL divergence between the empirical target distribution and the categorical distribution modeled by the NN at location $x$, which is given by:

$$l^{\mathcal{D}}_x(W) = KL\left( \mathcal{P}^C\left( \frac{c^{\mathcal{D}}(x)}{\sum_i c^{\mathcal{D}}_i(x)} \right) \Big\| \mathcal{P}^C(\phi(f^W(x))) \right) \quad (4)$$

Therefore, the normalized value of $c^{\mathcal{D}}(x)$ becomes the estimator of the categorical probability of the target distribution at location $x$. However, directly approximating this target distribution can be problematic because the estimator uses a single or very few samples, since most of the inputs are unique or very rare in the training set.

One simple heuristic to handle this problem is label smoothing (Szegedy et al., 2016), which constructs a regularized target estimator, in which a one-hot encoded label $\tilde{y}$ is relaxed by $(1 - \lambda)\tilde{y} + \frac{\lambda}{K}\mathbb{1}$ with hyperparameter $\lambda$. Under the smoothing operation, the target estimator is regularized by a mixture of the empirical counts and the parameter of the discrete uniform distribution $\mathcal{P}^U$ such that $(1 - \lambda)\mathcal{P}^C\left( \frac{c^{\mathcal{D}}(x)}{\sum_i c^{\mathcal{D}}_i(x)} \right) + \lambda \mathcal{P}^U$.
One concern is that the mixing coefficient is constant with respect to the number of observations, which can possibly prevent the exploitation of the empirical counting information when it is needed.

Another, more principled approach is BNNs, which prevent full exploitation of the noisy estimation by balancing the distance to the target distribution against model complexity and by maintaining a weight ensemble instead of choosing a single best configuration. Specifically, in BNNs with the Gaussian weight prior $\mathcal{N}(0, \tau^{-1} I)$, the score of a configuration $W$ is measured by the posterior density $p_W(W | \mathcal{D}) \propto p(\mathcal{D} | W) p_W(W)$, where we have $\log p_W(W) \propto -\tau \|W\|_2^2$. Therefore, the complexity penalty term induced by the prior prevents the softmax output from exactly matching a one-hot encoded target. In modern deep NNs, however, $\|W\|_2^2$ may be a poor proxy for the model complexity due to the extremely non-linear relationship between weights and outputs (Hafner et al., 2018; Sun et al., 2019) as well as the weight-scaling invariance of batch normalization (Ioffe & Szegedy, 2015). This issue may result in poorly regularized predictions, i.e., it cannot prevent NNs from fully exploiting the information contained in the noisy target estimator.
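To make the baseline target construction of this section concrete, the following is a minimal PyTorch-style sketch of the softmax cross-entropy loss with the label-smoothed target of equations (1)-(4); the function name and interface are illustrative choices of ours, not taken from any released implementation.

```python
import torch
import torch.nn.functional as F

def softmax_cross_entropy(logits, labels, smoothing=0.0):
    # logits: (batch, K) raw outputs f^W(x); labels: (batch,) integer class indices.
    num_classes = logits.size(-1)
    one_hot = F.one_hot(labels, num_classes=num_classes).float()
    # Label smoothing relaxes the one-hot target: (1 - lambda) * y_tilde + lambda / K.
    target = (1.0 - smoothing) * one_hot + smoothing / num_classes
    log_probs = F.log_softmax(logits, dim=-1)  # log phi(f^W(x)) of equation (1)
    return -(target * log_probs).sum(dim=-1).mean()
```

With smoothing = 0 this is the standard softmax cross-entropy loss; a non-zero smoothing value reproduces the fixed mixture with the uniform distribution discussed above.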
3. Method
We propose a Bayesian approach to construct the target distribution for classification, called the belief matching framework (BM; Figure 1(b)), in which the categorical probability about a label is regarded as a random variable $z$. Specifically, we express the likelihood of $z$ (given $x$) about the label $y$ as a categorical distribution $p_{y|x,z} = \mathcal{P}^C(z)$. (Throughout, $p_x = \mathcal{P}(\theta)$ is read as: a random variable $x$ follows a probability distribution $\mathcal{P}$ with parameter $\theta$.) Then, specification of the prior distribution over $z | x$ automatically determines the target distribution by means of Bayesian inference: $p_{z|x,y}(z) \propto p_{y|z,x}(y) \, p_{z|x}(z)$.

We consider a conjugate prior for simplicity, i.e., the Dirichlet distribution. A random variable $z$ (given $x$) following the Dirichlet distribution with concentration parameter vector $\beta$, denoted by $\mathcal{P}^D(\beta)$, has the following density:

$$p_{z|x}(z) = \frac{\Gamma(\beta_0)}{\prod_j \Gamma(\beta_j)} \prod_{k=1}^K z_k^{\beta_k - 1} \quad (5)$$

where $\Gamma(\cdot)$ is the gamma function, $\sum_i z_i = 1$ meaning that $z$ belongs to the $(K-1)$-simplex $\triangle^{K-1}$, $\beta_i > 0 \ \forall i$, and $\beta_0 = \sum_i \beta_i$. Here, the mean of $z | x$ is $\beta / \beta_0$, and $\beta_0$ controls the sharpness of the density such that more mass is concentrated around the mean as $\beta_0$ becomes larger.

By the characteristics of the conjugate family, we have the following posterior distribution given $\mathcal{D}$:

$$p_{z|x,y} = \mathcal{P}^D(\beta + c^{\mathcal{D}}(x)) \quad (6)$$

where the target posterior mean is explicitly smoothed by the prior belief, and the smoothing operation is performed in the principled way of applying Bayes' rule. Specifically, the posterior mean is given by $(\beta + c^{\mathcal{D}}(x)) / (\beta_0 + \sum_i c^{\mathcal{D}}_i(x))$, in which the prior distribution acts as adding pseudo counts. We note that the relative strength between the prior belief and the empirical count information becomes adaptive with respect to each data point.

Now, we specify the approximate posterior distribution modeled by the NN, which aims to approximate $p_{z|x,y}$. In this paper, we model the approximate posterior as a Dirichlet distribution. To this end, we use an exponential function $g(x) = \exp(x)$ to transform logits to the concentration parameter of $\mathcal{P}^D$, and we let $\alpha^W = \exp \circ f^W$. Then, the NN represents the density over $\triangle^{K-1}$ as follows:

$$q^W_{z|x}(z) = \frac{\Gamma(\alpha^W_0(x))}{\prod_j \Gamma(\alpha^W_j(x))} \prod_{k=1}^K z_k^{\alpha^W_k(x) - 1} \quad (7)$$

where $\alpha^W_0(x) = \sum_i \alpha^W_i(x)$.

From equation 7, we can see that outputs under BM encode much more information compared to those under the softmax. Specifically, it can easily be shown that the approximate posterior mean corresponds to the softmax. In this regard, BM enables neural networks to represent richer information in their outputs, i.e., the density over $\triangle^{K-1}$ itself, not just a single point on it such as the mean. This capability allows capturing more diverse characteristics of predictions at different locations, such as how concentrated the density is around its mean, which can be extremely helpful in many applications. For instance, BM gives a more sophisticated measure of the difference between predictions of two neural networks, which can benefit the consistency-based loss for semi-supervised learning, as we will show in section 5.4. Besides, BM represents a more sophisticated measure of predictive uncertainty based on the density over the simplex, such as mutual information.
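As an illustration of the Bayes update in equation (6) and the network's Dirichlet output in equation (7), here is a small PyTorch-style sketch; the helper names are ours and only meant to mirror the notation above.

```python
import torch

def target_posterior(prior_beta, label_counts):
    # Conjugate update of equation (6): label counts c^D(x) act as pseudo-counts
    # added to the Dirichlet prior concentration beta.
    concentration = prior_beta + label_counts
    posterior_mean = concentration / concentration.sum(-1, keepdim=True)
    return concentration, posterior_mean

def approximate_posterior(logits):
    # Equation (7): alpha^W(x) = exp(f^W(x)). Its mean alpha / alpha_0 equals the
    # softmax output, so BM's point prediction coincides with the standard one.
    alpha = logits.exp()
    alpha0 = alpha.sum(-1, keepdim=True)
    return alpha, alpha / alpha0
```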
From the perspective of learning the target distribution, BM can be considered as a generalization of the softmax in the sense that it changes the moment matching problem into a distribution matching problem in $\mathcal{P}(\triangle^{K-1})$. To understand the distribution matching objective in BM, we reformulate equation 7 as follows:

$$q^W_{z|x}(z) \propto \exp\left( \sum_k \alpha^W_k(x) \log z_k - \sum_k \log z_k \right) \propto \exp\left( -\alpha^W_0(x) \left( l^{CE}(\phi(f^W(x)), z) - \frac{KL(\mathcal{P}^U \| \mathcal{P}^C(z))}{\alpha^W_0(x)/K} \right) \right) \quad (8)$$

In the limit of $q^W_{z|x} \rightarrow p_{z|x,y}$, the mean of the target posterior (equation 6) becomes a virtual label that each individual $z$ ought to match; the penalty for an ambiguous configuration $z$ is determined by the number of observations. Therefore, the distribution matching in BM can be thought of as learning to score a categorical probability based on its closeness to the target posterior mean, in which exploitation of the closeness information is automatically controlled by the data.

We have defined the target distribution $p_{z|x,y}$ and the approximate distribution $q^W_{z|x}$ modeled by the neural network. We now present a solution to the distribution matching problem by maximizing the evidence lower bound (ELBO), defined by $l_{EB}(y, \alpha^W(x)) = \mathbb{E}_{q^W_{z|x}}[\log p(y | x, z)] - KL(q^W_{z|x} \| p_{z|x})$. Using the ELBO can be motivated by the following equality (Jordan et al., 1999):

$$\log p(y | x) = \int q^W_{z|x}(z) \log\left( \frac{p(y, z | x)}{p(z | x, y)} \right) dz = l_{EB}(y, \alpha^W(x)) + KL(q^W_{z|x} \| p_{z|x,y}) \quad (9)$$

where we can see that maximizing $l_{EB}(y, \alpha^W(x))$ corresponds to minimizing $KL(q^W_{z|x} \| p_{z|x,y})$, i.e., matching the approximate distribution to the target distribution, because the KL divergence is non-negative and $\log p(y | x)$ is a constant with respect to $W$. Here, each term in the ELBO can be analytically computed by:

$$\mathbb{E}_{q^W_{z|x}}[\log p(y | x, z)] = \mathbb{E}_{q^W_{z|x}}[\log z_y] = \psi(\alpha^W_y(x)) - \psi(\alpha^W_0(x)) \quad (10)$$

where $\psi(\cdot)$ is the digamma function (the logarithmic derivative of $\Gamma(\cdot)$), and

$$KL(q^W_{z|x} \| p_{z|x}) = \log \frac{\Gamma(\alpha^W_0(x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha^W_k(x)) \, \Gamma(\beta_0)} + \sum_k \left( \alpha^W_k(x) - \beta_k \right) \left( \psi(\alpha^W_k(x)) - \psi(\alpha^W_0(x)) \right) \quad (11)$$

where $p_{z|x}$ is assumed to be an input-independent conjugate prior for simplicity; that is, $p_{z|x} = \mathcal{P}^D(\beta)$. With this analytical solution, we maximize the ELBO with a mini-batch approximation, which gives the following loss function: $L(W) = \mathbb{E}_{x,y}[l_{EB}(y, \alpha^W(x))] \approx \frac{1}{m} \sum_{i=1}^m l_{EB}(y^{(i)}, \alpha^W(x^{(i)}))$. We note that computations of the ELBO and its gradient have a complexity of $O(K)$ per sample, which is equal to that of the softmax. This means that BM can preserve the scalability and efficiency of the existing baseline. We also note that the analytical solution of the ELBO under BM allows the distribution matching loss to be implemented as a plug-and-play loss function applied directly to the logit.

The success of the Bayesian approach largely depends on how we specify the prior distribution, due to its impact on the resulting posterior distribution. For example, the target posterior mean in equation 6 becomes the counting estimator as $\beta \rightarrow 0$.
On the contrary, as $\beta$ becomes larger, the effect of the empirical counting information is weakened, and it eventually disappears in the limit $\beta_0 \rightarrow \infty$. Therefore, considering that most of the inputs are unique in $\mathcal{D}$, choosing a small $\beta$ is appropriate for preventing the resulting posterior distribution from being dominated by the prior.

However, a prior distribution with small $\beta$ implicitly makes $\alpha^W(x)$ small, which poses significant challenges for gradient-based optimization. This is because the gradient of the ELBO is notoriously large in the small-value regime of $\alpha^W(x)$, since $\psi'(x)$ grows without bound as $x \rightarrow 0^+$. In addition, various building blocks including normalization (Ioffe & Szegedy, 2015), initialization (He et al., 2015), and architecture (He et al., 2016a) are implicitly or explicitly designed to make $\mathbb{E}[f^W(x)] \approx 0$; that is, $\mathbb{E}[\alpha^W(x)] \approx 1$. Therefore, making $\alpha^W(x)$ small can be wasteful or can require huge modifications to the existing building blocks. Also, $\mathbb{E}[\alpha^W(x)] \approx 1$ is encouraged in the sense of the natural gradient (Amari, 1998), which improves the conditioning of the Fisher information matrix (Schraudolph, 1998; LeCun et al., 1998; Raiko et al., 2012; Wiesler et al., 2014).

In order to resolve the gradient-based optimization challenge in learning the posterior distribution while preventing dominance of the prior distribution, we set $\beta = \mathbb{1}$ for the prior distribution and then multiply $\lambda$ to the KL divergence term in the ELBO: $l^{\lambda}_{EB}(y, \alpha^W(x)) = \mathbb{E}_{q^W_{z|x}}[\log p(y | x, z)] - \lambda KL(q^W_{z|x} \| \mathcal{P}^D(\mathbb{1}))$. This trick significantly stabilizes the optimization process while leaving the location of a local optimum essentially unchanged. To see this, we can compare the gradients of the ELBO and the $\lambda$-multiplied ELBO:

$$\frac{\partial l_{EB}(y, \alpha^W(x))}{\partial \alpha^W_k(x)} = \left( \tilde{y}_k - (\alpha^W_k(x) - \beta_k) \right) \psi'(\alpha^W_k(x)) - \left( 1 - (\alpha^W_0(x) - \beta_0) \right) \psi'(\alpha^W_0(x)) \quad (12)$$

$$\frac{\partial l^{\lambda}_{EB}(y, \alpha^W(x))}{\partial \alpha^W_k(x)} = \left( \tilde{y}_k - (\tilde{\alpha}^W_k(x) - \lambda) \right) \psi'(\alpha^W_k(x)) - \left( 1 - (\tilde{\alpha}^W_0(x) - \lambda K) \right) \psi'(\alpha^W_0(x)) \quad (13)$$

where $\tilde{\alpha}^W_k(x) = \lambda \alpha^W_k(x)$. Here, we can see that a local optimum of equation 12 is achieved when $\alpha^W(x) = \beta + \tilde{y}$, and a local optimum of equation 13 is achieved when $\alpha^W(x) = \mathbb{1} + \lambda^{-1}\tilde{y}$, i.e., $\tilde{\alpha}^W(x) = \lambda\mathbb{1} + \tilde{y}$. Therefore, the ratio between $\alpha^W_i(x)$ and $\alpha^W_j(x)$ at this local optimum is equal to that at a local optimum of equation 12 with $\beta = \lambda\mathbb{1}$ for every pair of $i$ and $j$. In this regard, searching for $\lambda$ with $l^{\lambda}_{EB}(y, \alpha^W(x))$ and then multiplying $\lambda$ after training corresponds to the process of searching for the prior distribution's parameter $\beta$ with $L(W)$. (In an ideal, fully Bayesian treatment, $\beta$ could be modeled hierarchically; we leave this as future research.)
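The analytic ELBO above can indeed be written as a drop-in replacement for the softmax cross-entropy loss. The following is a minimal PyTorch-style sketch of $l^{\lambda}_{EB}$ with a flat Dirichlet prior; the function name and default values are illustrative choices of ours, and the authors' released implementation may differ in details.

```python
import math
import torch

def belief_matching_loss(logits, labels, coeff=0.01, prior=1.0):
    # Negative lambda-scaled ELBO with a flat Dirichlet prior; coeff (lambda) and
    # prior are illustrative defaults, not values prescribed by the paper.
    alpha = logits.exp()                      # alpha^W(x) = exp(f^W(x))
    alpha0 = alpha.sum(-1)
    num_classes = logits.size(-1)
    # Expected log-likelihood E_q[log z_y] = psi(alpha_y) - psi(alpha_0), equation (10).
    alpha_y = alpha.gather(1, labels.unsqueeze(1)).squeeze(1)
    expected_ll = torch.digamma(alpha_y) - torch.digamma(alpha0)
    # KL( Dir(alpha) || Dir(prior * 1) ), the analytic form of equation (11).
    kl = (torch.lgamma(alpha0) - torch.lgamma(alpha).sum(-1)
          + num_classes * math.lgamma(prior) - math.lgamma(num_classes * prior)
          + ((alpha - prior)
             * (torch.digamma(alpha) - torch.digamma(alpha0).unsqueeze(-1))).sum(-1))
    return -(expected_ll - coeff * kl).mean()
```

The loss is applied to the logits exactly where the softmax cross-entropy would be, which is what makes BM a plug-and-play substitute with $O(K)$ cost per sample.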
4. Related Work
BNNs are the dominant approach for applying Bayesian principles in neural networks. Because BNNs require intractable posterior inference, many posterior approximation schemes have been developed to reduce the approximation gap and improve scalability (e.g., VI (Graves, 2011; Blundell et al., 2015; Wu et al., 2019a) and stochastic gradient Markov chain Monte Carlo (Welling & Teh, 2011; Ma et al., 2015; Gong et al., 2019)). However, even with these novel approximation techniques, BNNs are not scalable to state-of-the-art architectures on large-scale datasets, or they often reduce the generalization performance in practice, which impedes the wide adoption of BNNs despite their numerous potential benefits.

Other approaches avoid explicit modeling of the weight posterior distribution. MC dropout (Gal & Ghahramani, 2016) reinterprets dropout (Srivastava et al., 2014) as an approximate VI, which retains the standard NN training procedure and modifies only the inference procedure for posterior MC approximation. In a similar spirit, some approaches (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019) sequentially estimate the mean and covariance of the weight posterior distribution by using gradients computed at each step. Differently from BNNs, deep kernel learning (Wilson et al., 2016a;b) places Gaussian processes (GPs) on top of "deterministic" NNs, which combines NNs' capability of handling complex high-dimensional data with GPs' capability of principled uncertainty representation and robust extrapolation.

Non-Bayesian approaches also help to resolve the limitations of softmax. Lakshminarayanan et al. (2017) propose an ensemble-based method to achieve better uncertainty representation and improved self-calibration. Guo et al. (2017) and Neumann et al. (2018) propose temperature scaling-based methods for post-hoc modification of softmax for improved calibration. To improve generalization by penalizing over-confidence, Pereyra et al. (2017) propose an auxiliary loss function that penalizes low predictive entropy, and Szegedy et al. (2016) and Xie et al. (2016) consider the types of noise included in ground-truth labels.

We also note that some recent studies use NNs to model the concentration parameter of the Dirichlet distribution, but with a different purpose than BM. Sensoy et al. (2018) use a loss function that explicitly minimizes prediction variances on training samples, which can help to produce high-uncertainty predictions for out-of-distribution (OOD) or adversarial samples. Prior networks (Malinin & Gales, 2018) investigate two types of auxiliary losses computed on in-distribution and OOD samples, respectively. Similar to prior networks, Chen et al. (2018) consider an auxiliary loss computed on adversarially generated samples.
5. Experiment
In this section, we show the versatility of BM through extensive empirical evaluations. We first verify its improvement of the generalization error in image classification tasks (section 5.1). We then verify whether BM inherits the advantages of Bayesian methods even though the prior distribution is placed only on the label categorical probability (section 5.2). We conclude this section by providing further applications that show the versatility of BM. To support reproducibility, we release our code at: https://github.com/tjoo512/belief-matching-framework. We performed all experiments on a single workstation with 8 GPUs (NVIDIA GeForce RTX 2080 Ti).

Throughout all experiments, we employ various large-scale models based on residual connections (He et al., 2016a), which are the standard benchmark models in practice. For a fair comparison and to reduce the burden of hyperparameter search, we fix experimental configurations to the reference implementation of the corresponding architecture. However, we additionally use an initial learning rate warm-up and gradient clipping, which are extremely helpful for stable training of BM. Specifically, we use learning rates of [0.1ε, 0.2ε, 0.4ε, 0.6ε, 0.8ε] for the first five epochs when the reference learning rate is ε, and clip gradients when their norm exceeds 1.0. Without these methods, we had difficulty in training deep models, e.g., ResNet-50, due to gradient explosion at the initial stage of training.

We compare BM to the following baseline methods: softmax, which is our primary object to improve; MC dropout with 100 MC samples, which is a simple and efficient BNN; and deep ensemble with five NNs, which greatly improves the uncertainty representation ability. While there are other methods using NNs to model the Dirichlet distribution (Sensoy et al., 2018; Malinin & Gales, 2018; 2019), we note that these methods are not scalable to ResNet. Similarly, we observe that training a mixture of Dirichlet distributions (Wu et al., 2019b) with ResNet is subject to gradient explosion, even with a 10x lower learning rate. Besides, BNNs with VI (or MCMC) are not directly comparable to our approach due to their huge modifications to existing baselines. For example, Heek & Kalchbrenner (2019) replace batch normalization and ReLU, use additional techniques (2x more filters, cyclic learning rate, snapshot ensemble), and require almost 10x more computations on ImageNet to converge.
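As a concrete illustration of the warm-up schedule and gradient clipping described above, the following is a short Python sketch; the helper name is ours, and the released code may organize this differently.

```python
def warmup_lr(epoch, reference_lr):
    # Learning-rate warm-up for stable BM training: [0.1, 0.2, 0.4, 0.6, 0.8] * reference_lr
    # during the first five epochs, after which the reference schedule takes over.
    factors = [0.1, 0.2, 0.4, 0.6, 0.8]
    return factors[epoch] * reference_lr if epoch < len(factors) else reference_lr

# Before each optimizer step, gradients are clipped at norm 1.0, e.g. in PyTorch:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```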
We evaluate the generalization performance of BM on CIFAR (Krizhevsky, 2009) with the pre-activation ResNet (He et al., 2016b). CIFAR-10 and CIFAR-100 contain 50K training and 10K test images, and each 32x32x3-sized image belongs to one of 10 categories in CIFAR-10 and one of 100 categories in CIFAR-100.

Table 1. Test classification error rates on CIFAR. Here, we split the train set of 50K examples into a train set of 40K examples and a validation set of 10K examples. Numbers indicate µ ± σ computed across five trials, and boldface indicates the minimum mean error rate. Model and hyperparameter are selected based on validation error rates. We searched for the coefficient of BM and the dropout rate of MC dropout over three candidate values each.

  Model       Method              C-10        C-100
  ResNet-18   Softmax             –           –
              MC dropout (last)   6.13 ± –    –
              MC dropout (all)    6.50 ± –    –
              BM                  –           –
  ResNet-50   Softmax             –           –
              MC dropout (last)   5.75 ± –    –
              MC dropout (all)    5.84 ± –    –
              BM                  –           –

Table 2. Classification error rates on ImageNet. Here, we use only a single value of λ for each of ResNeXt-50 and ResNeXt-101, and measure the validation error rates directly. We report the result obtained from a single run due to computational constraints.

  Model         Method    Top-1   Top-5
  ResNeXt-50    Softmax   –       –
                BM        –       –
  ResNeXt-101   Softmax   –       –
                BM        –       –

Table 1 lists the classification error rates of the softmax cross-entropy loss, BM, and MC dropout. In all configurations, BM consistently achieves the best generalization performance on both datasets. On the other hand, last-layer MC dropout sometimes results in higher generalization errors than softmax, and all-layer MC dropout significantly increases error rates even though they consume 100x more computations for inference.

We next perform a large-scale experiment using ResNeXt-50 32x4d and ResNeXt-101 32x8d (Xie et al., 2017) on ImageNet (Russakovsky et al., 2015). ImageNet contains approximately 1.3M training samples and 50K validation samples, and each sample is resized to 224x224x3 and belongs to one of 1K categories; that is, ImageNet has more categories, a larger image size, and more training samples compared to CIFAR, which may enable a more precise evaluation of methods. Consistent with the results on CIFAR, BM improves the test errors of softmax (Table 2). This result is appealing because improving the generalization error of deep NNs on large-scale datasets by adopting a Bayesian principle without computational overhead has rarely been reported in the literature.
Figure 2. Penultimate layer's activations of examples belonging to one of three classes (beaver, dolphin, and otter; indexed by 0, 1, 2 in CIFAR-100). Left: trained without the KL term; right: trained with the KL term.
Regularization Effect of Prior
In theory, BM has two regularization effects, which may explain the generalization performance improvements under BM: the prior distribution, which smooths the target posterior mean by adding pseudo counts, and computing the distribution matching loss by averaging over all possible categorical probabilities. In this regard, an ablation of the KL term in the ELBO helps to examine these two effects separately, since it removes only the effect of the prior distribution.

We first examine its impact on the generalization performance by training a ResNet-50 on CIFAR without the KL term. The resulting test error rates are noticeably higher on both CIFAR-10 and CIFAR-100. These significant reductions in generalization performance indicate the powerful regularization effect of the prior distribution (cf. Table 1). The result that BM without the KL term still achieves lower test error rates compared to softmax demonstrates the regularization effect of considering all possible categorical probabilities via the Dirichlet distribution instead of choosing a single categorical probability.

Considering the role of the prior distribution in smoothing the posterior mean, we conjecture that the impact of the prior distribution is similar to the effect of label smoothing. In Müller et al. (2019), it is shown that label smoothing makes the learned representation reveal tight clusters of data points within the same classes and smaller deviations among the data points. Inspired by this result, we analyze the activations in the penultimate layer with the visualization method proposed in Müller et al. (2019). Figure 2 illustrates that the prior distribution significantly reduces the value ranges of the activations of data points, which implies an implicit function regularization effect, considering that the $L^p$ norm of $f^W \in L^p(\mathcal{X})$ can be approximated by $\|f^W\|_p \approx \left( \frac{1}{N} \sum_i |f^W(x^{(i)})|^p \right)^{1/p}$. Besides, Figure 2 shows that the prior distribution makes activations belonging to the same class form much tighter clusters, which can be thought of as an implicit manifold regularization effect. To see this, assume that two images belonging to the same class have a close distance in the data manifold. Then, the difference between logits of same-class examples becomes a good proxy for the gradient of $f^W$ along the data manifold, since the gradient measures changes in the output space with respect to small changes in the input space.

Figure 3. Impact of β on generalization performance. We exclude the range of extremely small β and the range β > exp(8) because these ranges result in gradient explosion under the strategy of changing only β.

Impact of β
In section 3.4, we claimed that the value of β is implicitly related to the distribution of logit values, and that extreme values can be detrimental to training stability. We verify this claim by training ResNet-18 on CIFAR-10 with different values of β. Specifically, we examine two strategies of changing β: modifying only β, or jointly modifying λ proportionally to 1/β to match the local optima (cf. section 3.4). As a result, we obtain robust generalization performance under both strategies over a wide range of β up to exp(4) (Figure 3). However, when β becomes extremely small, gradient explosion occurs due to the extreme slope of the digamma function near 0 (the threshold at which this happens differs between the two strategies). Conversely, when we increase only β to an extremely large value, the error rate increases by a large margin (7.37) at β = exp(8) and eventually explodes at β = exp(16). This is because a large β increases the values of activations, so the gradient with respect to parameters explodes. Under the joint tuning strategy, such a high-value region makes λ ≈ 0, which removes the impact of the prior distribution.

One of the most attractive benefits of Bayesian methods is their ability to represent uncertainty about their predictions. In a naive sense, uncertainty representation ability is the ability to "know what it doesn't know." For instance, models having a good uncertainty representation ability would assign higher predictive uncertainty to misclassified examples than to correctly classified examples. This ability is extremely useful in both real-world applications and downstream tasks in machine learning. For example, underconfident NNs can produce many false alarms, which makes humans ignore the predictions of NNs; conversely, overconfident NNs can exclude humans from the decision-making loop, which results in catastrophic accidents. Also, better uncertainty representation enables balancing exploitation and exploration in reinforcement learning (Gal & Ghahramani, 2016) and detecting OOD samples (Malinin & Gales, 2018).
Figure 4. Reliability plots of ResNet-50 with BM and softmax. Here, ECE is computed with 15 groups. (a) CIFAR-10: BM (ECE = 1.66) vs. softmax (ECE = 3.82); (b) CIFAR-100: BM (ECE = 4.25) vs. softmax (ECE = 13.48).

We evaluate the uncertainty representation ability on both in-distribution and OOD datasets. Specifically, we measure the calibration performance on in-distribution test samples, which examines a model's ability to match its probabilistic output associated with an event to the actual long-term frequency of the event (Dawid, 1982). The notion of calibration in NNs is associated with how well their confidence matches the actual accuracy; e.g., we expect the average accuracy of a group of predictions having a certain confidence level to be close to that confidence level. We also examine the predictive uncertainty for OOD samples. Since such examples belong to none of the classes seen during training, we expect neural networks to produce outputs of "I don't know."

In-Distribution Uncertainty
We measure the calibration performance by the expected calibration error (ECE; Naeini et al., 2015), in which $\max_i \phi_i(f^W(x))$ is regarded as the prediction confidence for the input $x$. ECE is calculated by grouping predictions based on the confidence score and then computing the absolute difference between the average accuracy and average confidence for each group; that is, the ECE of $f^W$ on $\mathcal{D}$ with $M$ groups is as follows:

$$ECE_M(f^W, \mathcal{D}) = \sum_{i=1}^M \frac{|G_i|}{|\mathcal{D}|} \, |acc(G_i) - conf(G_i)| \quad (14)$$

where $G_i$ is the set of samples in the $i$-th group, defined as $G_i = \{ j : (i-1)/M < \max_k \phi_k(f^W(x^{(j)})) \leq i/M \}$, $acc(G_i)$ is the average accuracy in the $i$-th group, and $conf(G_i)$ is the average confidence in the $i$-th group.

We analyze the calibration property of the ResNet-50 examined in section 5.1. As Figure 4 presents, BM's predictive probability is well matched to its accuracy compared to softmax; that is, BM improves the calibration property of NNs. Specifically, BM improves the ECE of softmax from 3.82 to 1.66 on CIFAR-10 and from 13.48 to 4.25 on CIFAR-100, respectively. These improvements are comparable to the deep ensemble, which achieves 1.04 on CIFAR-10 and 3.54 on CIFAR-100 with 5x more computations for both training and inference. In the case of all-layer MC dropout, ECE decreases to 1.50 on CIFAR-10 and 9.76 on CIFAR-100; however, these improvements come at the cost of generalization performance. On the other hand, the last-layer MC dropout, which often improves the generalization performance, does not show meaningful ECE improvements (3.78 on CIFAR-10 and 13.52 on CIFAR-100). We note that there are also post-hoc solutions for improving calibration performance, e.g., temperature scaling (Guo et al., 2017). However, these methods often require an additional dataset for tuning the behavior of NNs, which may prevent the exploitation of the entire set of samples for training.
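As a concrete reading of equation (14), the following is a small NumPy-style sketch of the ECE computation; the function name and argument layout are our own.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_groups=15):
    # confidences: max softmax probability per sample; correct: 1.0 if the prediction
    # matches the label, else 0.0. Bins follow equation (14).
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    ece, n = 0.0, len(confidences)
    for i in range(1, num_groups + 1):
        lo, hi = (i - 1) / num_groups, i / num_groups
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece
```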
Out-of-Distribution Uncertainty
We quantify uncertainty by the predictive entropy, which measures the uncertainty of $f^W(x)$ as follows: $\mathcal{H}[\mathbb{E}_{q^W_{z|x}}[z]] = \mathcal{H}[\phi(f^W(x))] = -\sum_{k=1}^K \phi_k(f^W(x)) \log \phi_k(f^W(x))$. This uncertainty measure has an intuitive interpretation: an "I don't know" answer is close to the uniform distribution, whereas an "I confidently know" answer has one dominating categorical probability.

Figure 5 presents density plots of the predictive entropy, showing that BM provides notably better uncertainty estimation compared to other methods. Specifically, BM makes clear peaks of predictive entropy in the high-uncertainty region for OOD samples (Figure 5(b)). In contrast, softmax produces relatively flat uncertainty for OOD samples (Figure 5(a)). Even though both MC dropout and deep ensemble successfully increase predictive uncertainty for OOD samples compared to softmax, they fail to make a clear peak in the high-uncertainty region for such samples, unlike BM. We note that this remarkable result is obtained by being Bayesian only about the categorical probability.

Figure 5. Uncertainty representation for in-distribution (CIFAR-100) and OOD (SVHN) samples of ResNet-50 under (a) softmax, (b) BM, (c) deep ensemble, and (d) all-layer MC dropout. We exclude the result of last-layer MC dropout because it shows no meaningful difference compared to the softmax.

Note that some in-distribution samples should also be answered with "I don't know" because the network does not achieve perfect test accuracy. As Figure 5 shows, BM assigns high uncertainty to more in-distribution samples compared to softmax, which is almost certain in its predictions. This result consistently supports the previous finding that BM resolves the overconfidence problem of softmax.
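Since the predictive entropy above is computed from the posterior mean, it needs only the usual softmax output; a minimal PyTorch-style sketch (with a function name of our choosing) is:

```python
import torch

def predictive_entropy(logits):
    # H[E_q[z]] = H[softmax(f^W(x))]: entropy of the approximate posterior mean,
    # used to separate in-distribution from OOD inputs in Figure 5.
    probs = torch.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
```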
BM adopts the Bayesian principle outside the NN, so it can be applied to models already trained on different tasks, unlike BNNs. In this regard, we examine the effectiveness of BM in a transfer learning scenario. Specifically, we download the ImageNet-pretrained ResNet-50 and fine-tune the weights of the last linear layer for 100 epochs with the Adam optimizer (Kingma & Ba, 2015) and a learning rate of 3e-4 on three different datasets: CIFAR-10, Food-101 (Bossard et al., 2014), and Cars (Krause et al., 2013).

Table 3. Transfer learning performance (test error rates) from ResNet-50 pretrained on ImageNet to smaller datasets. µ and σ are obtained from five experiments, and boldface indicates the minimum mean error rate. We examine only a single value of λ for BM.

  Method    C-10   Food-101   Cars
  Softmax   –      –          –
  BM        –      –          –

Table 3 compares the test error rates of softmax and BM, in which BM consistently achieves better performance compared to softmax. Next, we examine the predictive uncertainty for OOD samples (Figure 6). Surprisingly, we observe that BM significantly improves the uncertainty representation ability of pretrained models by fine-tuning only the last-layer weights. These results present the possibility of adopting BM as a post-hoc solution to enhance the uncertainty representation ability of pretrained models without sacrificing their generalization performance. We believe that the interaction between BM and pretrained models is significantly attractive, considering recent efforts of the deep learning community to construct general baseline models trained on extremely large-scale datasets and then transfer the baselines to multiple down-stream tasks (e.g., BERT (Devlin et al., 2018) and MoCo (He et al., 2019)).
Figure 6. Uncertainty representation for in-distribution samples (CIFAR-10) and OOD samples (SVHN, Food-101, and Cars) in transfer learning tasks, under (a) softmax and (b) BM. BM produces clear peaks in the high-uncertainty region on SVHN and Food-101. We note that BM confidently predicts examples in Cars because CIFAR-10 contains the object category "automobile". On the other hand, softmax produces confident predictions on all datasets compared to BM.
BM enables NNs to represent rich information in their predictions (cf. section 3.2). We exploit this characteristic to benefit consistency-based loss functions for semi-supervised learning. The idea of consistency-based losses is to employ information from unlabeled samples to determine where to promote robustness of predictions under stochastic perturbations (Belkin et al., 2006; Oliver et al., 2018). In this section, we investigate two baselines that consider stochastic perturbations on inputs (VAT; Miyato et al., 2018) and on networks (Π-model; Laine & Aila, 2017), respectively. Specifically, VAT generates an adversarial direction $r$ and then measures the KL divergence between predictions at $x$ and $x + r$:

$$L_{VAT}(x) = KL\left( \phi(f^W(x)) \,\|\, \phi(f^W(x + r)) \right) \quad (15)$$

where the adversarial direction is chosen by $r = \arg\max_{\|r\| \leq \epsilon} KL(\phi(f^W(x)) \| \phi(f^W(x + r)))$, and the Π-model measures the $L_2$ distance between predictions with and without enabling the stochastic parts of the NN:

$$L_{\Pi}(x) = \| \phi(\bar{f}^W(x)) - \phi(f^W(x)) \|_2^2 \quad (16)$$

where $\bar{f}$ is a prediction without the stochastic parts.

We can see that both methods achieve perturbation-invariant predictions by minimizing a divergence between two categorical probabilities under some perturbation. In this regard, BM can provide a more delicate measure of prediction consistency, a divergence between Dirichlet distributions, which can capture richer probabilistic structure, e.g., (co)variances of categorical probabilities. This generalization from the moment matching problem to the distribution matching problem can be achieved by replacing the consistency measure in equation 15 with $KL(q^W_{z|x} \| q^W_{z|x+r})$ and that in equation 16 with $KL(\bar{q}^W_{z|x} \| q^W_{z|x})$, as we sketch at the end of this subsection.
Table 4. Classification error rates on CIFAR-10. µ and σ are obtained from five experiments, and boldface indicates the minimum mean error rate. We matched the configurations to those of Oliver et al. (2018) except for a consistency loss coefficient of 0.03 for VAT and 0.5 for the Π-model to match the scale between supervised and unsupervised losses. We use only a single value of λ for BM.

  Method    Π-Model   VAT
  Softmax   –         –
  BM        –         –

We train a wide ResNet 28-2 (Zagoruyko & Komodakis, 2016) via the consistency-based loss functions on CIFAR-10 with 4K labeled training, 41K unlabeled training, and 5K validation samples. Our experimental results show that the distribution matching metric of BM is more effective than the moment matching metric under the softmax in reducing the error rates (Table 4). We think that this improvement in semi-supervised learning with a more sophisticated consistency measure shows the potential usefulness of BM in other applications, because employing the prediction difference of neural networks is a prevalent technique in many domains such as knowledge distillation (Hinton et al., 2015) and model interpretation (Zintgraf et al., 2017).
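For concreteness, the Dirichlet-to-Dirichlet consistency measure used to replace equations (15)-(16) can be computed in closed form; the following PyTorch-style sketch uses our own function name and the standard Dirichlet KL formula.

```python
import torch

def dirichlet_consistency(logits_p, logits_q):
    # KL( Dir(alpha_p) || Dir(alpha_q) ) between two network outputs, a drop-in
    # replacement for the softmax-based consistency measures in equations (15)-(16).
    alpha_p, alpha_q = logits_p.exp(), logits_q.exp()
    alpha_p0 = alpha_p.sum(-1)
    return (torch.lgamma(alpha_p0) - torch.lgamma(alpha_p).sum(-1)
            - torch.lgamma(alpha_q.sum(-1)) + torch.lgamma(alpha_q).sum(-1)
            + ((alpha_p - alpha_q)
               * (torch.digamma(alpha_p) - torch.digamma(alpha_p0).unsqueeze(-1))).sum(-1))
```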
6. Conclusion
We adopted the Bayesian principle for constructing the target distribution by considering the categorical probability as a random variable rather than as being given by the training label. The proposed method can be flexibly applied to standard deep learning models by replacing only the softmax and the cross-entropy loss, which provides consistent improvements in generalization performance, better uncertainty estimation, and well-calibrated behavior. We believe that BM shows promising advantages of being Bayesian about categorical probability.

We think that accommodating more expressive distributions in the belief matching framework is an interesting future direction. For example, parameterizing a logistic normal distribution (or a mixture distribution) could allow neural networks to capture strong semantic similarities among class labels, which would be helpful in large-class classification problems such as machine translation and classification on ImageNet. Besides, considering an input-dependent prior would result in interesting properties. For example, under the teacher-student framework, the teacher could dictate the prior for each input, thereby controlling the desired smoothness of the student's prediction at that location. This property could benefit various domains such as imbalanced datasets and multi-domain learning.
Acknowledgements
We would like to thank Dong-Hyun Lee and anonymous reviewers for the discussions and suggestions.
References
Amari, S.-I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399-2434, 2006.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.
Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 - mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227-236. Springer, 1990.
Chen, W., Shen, Y., Jin, H., and Wang, W. A variational Dirichlet framework for out-of-distribution detection. arXiv preprint arXiv:1811.07308, 2018.
Dawid, A. P. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605-610, 1982.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Gal, Y. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
Gong, W., Li, Y., and Hernández-Lobato, J. M. Meta-learning for stochastic gradient MCMC. In International Conference on Learning Representations, 2019.
Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.
Hafner, D., Tran, D., Lillicrap, T., Irpan, A., and Davidson, J. Noise contrastive priors for functional uncertainty. arXiv preprint arXiv:1807.09289, 2018.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016b.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
Heek, J. and Kalchbrenner, N. Bayesian inference for large scale image classification. arXiv preprint arXiv:1908.03491, 2019.
Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3D object representations for fine-grained categorization. In ICCV Workshop on 3D Representation and Recognition, pp. 554-561, 2013.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9-50. Springer, 1998.
Louizos, C. and Welling, M. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, 2017.
Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, 2015.
MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.
MacKay, D. J. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469-505, 1995.
Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.
Malinin, A. and Gales, M. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, 2018.
Malinin, A. and Gales, M. Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness. In Advances in Neural Information Processing Systems, 2019.
Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873-4907, 2017.
Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979-1993, 2018.
Müller, R., Kornblith, S., and Hinton, G. When does label smoothing help? In Advances in Neural Information Processing Systems, 2019.
Naeini, M. P., Cooper, G., and Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In AAAI Conference on Artificial Intelligence, 2015.
Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.
Oliver, A., Odena, A., Raffel, C. A., Cubuk, E. D., and Goodfellow, I. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, 2018.
Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.
Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
Raiko, T., Valpola, H., and LeCun, Y. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, 2012.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
Schraudolph, N. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
Sensoy, M., Kaplan, L., and Kandemir, M. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems, 2018.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational Bayesian neural networks. In International Conference on Learning Representations, 2019.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, 2011.
Wiesler, S., Richard, A., Schlüter, R., and Ney, H. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
Wilson, A. G., Hu, Z., Salakhutdinov, R. R., and Xing, E. P. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics, 2016a.
Wilson, A. G., Hu, Z., Salakhutdinov, R. R., and Xing, E. P. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, 2016b.
Wu, A., Nowozin, S., Meeds, E., Turner, R. E., Hernández-Lobato, J. M., and Gaunt, A. L. Deterministic variational inference for robust Bayesian neural networks. In International Conference on Learning Representations, 2019a.
Wu, Q., Li, H., Li, L., and Yu, Z. Quantifying intrinsic uncertainty in classification via deep Dirichlet mixture networks. arXiv preprint arXiv:1906.04450, 2019b.
Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. DisturbLabel: Regularizing CNN on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Zagoruyko, S. and Komodakis, N. Wide residual networks. In British Machine Vision Conference, 2016.
Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference on Machine Learning, 2018.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. In International Conference on Learning Representations, 2017.