Multiplicative Normalizing Flows for Variational Bayesian Neural Networks
Christos Louizos
Max Welling
Abstract
We reinterpret multiplicative noise in neural networks as auxiliary random variables that augment the approximate posterior in a variational setting for Bayesian neural networks. We show that through this interpretation it is both efficient and straightforward to improve the approximation by employing normalizing flows (Rezende & Mohamed, 2015) while still allowing for local reparametrizations (Kingma et al., 2015) and a tractable lower bound (Ranganath et al., 2015; Maaløe et al., 2016). In experiments we show that with this new approximation we can significantly improve upon classical mean field for Bayesian neural networks on both predictive accuracy as well as predictive uncertainty.
1. Introduction
Neural networks have been the driving force behind the success of deep learning applications. Given enough training data they are able to robustly model input-output relationships and as a result provide high predictive accuracy. However, they do have some drawbacks. In the absence of enough data they tend to overfit considerably; this restricts them from being applied in scenarios where labeled data are scarce, e.g. in medical applications such as MRI classification. Even more importantly, deep neural networks trained with maximum likelihood or MAP procedures tend to be overconfident and as a result do not provide accurate confidence intervals, particularly for inputs that are far from the training data distribution. A simple example can be seen at Figure 1a; the predictive distribution becomes overly confident, i.e. assigns a high softmax probability, towards the wrong class for things it hasn't seen before (e.g. an MNIST 3 rotated by 90 degrees). This in effect makes them unsuitable for applications where decisions are made, e.g. when a doctor determines the disease of a patient based on the output of such a network.

A principled approach to address both of the aforementioned shortcomings is through a Bayesian inference procedure. Under this framework, instead of doing a point estimate for the network parameters we infer a posterior distribution. These distributions capture the parameter uncertainty of the network, and by subsequently integrating over them we can obtain better uncertainties about the predictions of the model. We can see that this is indeed the case at Figure 1b; the confidence of the network for the unseen digits is drastically reduced when we are using a Bayesian model, thus resulting into more realistic predictive distributions. Obtaining the posterior distributions is however no easy task, as the nonlinear nature of neural networks makes the problem intractable. For this reason approximations have to be made.

Many works have considered the task of approximate Bayesian inference for neural networks using either Markov Chain Monte Carlo (MCMC) with Hamiltonian Dynamics (Neal, 1995), distilling SGD with Langevin Dynamics (Welling & Teh, 2011; Korattikara et al., 2015) or deterministic techniques such as the Laplace Approximation (MacKay, 1992), Expectation Propagation (Hernández-Lobato & Adams, 2015; Hernández-Lobato et al., 2015) and variational inference (Graves, 2011; Blundell et al., 2015; Kingma et al., 2015; Gal & Ghahramani, 2015b; Louizos & Welling, 2016).

In this paper we will also tackle the problem of Bayesian inference in neural networks. We will adopt a stochastic gradient variational inference (Kingma & Welling, 2014; Rezende et al., 2014) procedure in order to estimate the posterior distribution over the weight matrices of the network. Arguably one of the most important ingredients of variational inference is the flexibility of the approximate posterior distribution; it determines how well we are able to capture the true posterior distribution and thus the true uncertainty of our models.
In Section 2 we will show how we can produce very flexible distributions in an efficient way by employing auxiliary random variables (Agakov & Barber, 2004; Salimans et al., 2013; Ranganath et al., 2015; Maaløe et al., 2016) and normalizing flows (Rezende & Mohamed, 2015). In Section 3 we will discuss related work, whereas in Section 4 we will evaluate and discuss the proposed framework. Finally we will conclude with Section 5, where we will provide some final thoughts along with promising directions for future research.

Figure 1. Predictive distribution for a continuously rotated version of a 3 from MNIST. Each colour corresponds to a different class and the height of the bar denotes the probability assigned to that particular class by the network. Visualization inspired by (Gal & Ghahramani, 2015b). (a) LeNet with weight decay. (b) LeNet with multiplicative normalizing flows.
2. Multiplicative normalizing flows
Let D be a dataset consisting of input-output pairs {(x_1, y_1), ..., (x_n, y_n)} and let W_{1:L} denote the weight matrices of L layers. Assuming that p(W_i), q_φ(W_i) are the prior and approximate posterior over the parameters of the i'th layer, we can derive the following lower bound on the marginal log-likelihood of the dataset D using variational Bayes (Peterson, 1987; Hinton & Van Camp, 1993; Graves, 2011; Blundell et al., 2015; Kingma et al., 2015; Gal & Ghahramani, 2015b; Louizos & Welling, 2016):

L(φ) = E_{q_φ(W_{1:L})}[ log p(y|x, W_{1:L}) + log p(W_{1:L}) − log q_φ(W_{1:L}) ],   (1)

where ~p(x, y) denotes the training data distribution and φ the parameters of the variational posterior. For continuous q(·) distributions that allow for the reparametrization trick (Kingma & Welling, 2014) or stochastic backpropagation (Rezende et al., 2014) we can reparametrize the random sampling from q(·) of the lower bound in terms of noise variables ε and deterministic functions f(φ, ε):

L = E_{p(ε)}[ log p(y|x, f(φ, ε)) + log p(f(φ, ε)) − log q_φ(f(φ, ε)) ].   (2)

This reparametrization allows us to treat approximate parameter posterior inference as a straightforward optimization problem that can be optimized with off-the-shelf (stochastic) gradient ascent techniques.

For Bayesian neural networks the most common family for the approximate posterior is that of mean field with independent Gaussian distributions for each weight. Despite the fact that this leads to a straightforward lower bound for optimization, the approximation capability is quite limiting; it corresponds to just a unimodal "bump" on the very high dimensional space of the parameters of the neural network. There have been attempts to improve upon this approximation with works such as (Gal & Ghahramani, 2015b) with mixtures of delta peaks and (Louizos & Welling, 2016) with matrix Gaussians that allow for nontrivial covariances among the weights. Nevertheless, both of the aforementioned methods are still, in a sense, limited; the true parameter posterior is more complex than delta peaks or correlated Gaussians.

There has been a lot of recent work on ways to improve the posterior approximation in latent variable models, with normalizing flows (Rezende & Mohamed, 2015) and auxiliary random variables (Agakov & Barber, 2004; Salimans et al., 2013; Ranganath et al., 2015; Maaløe et al., 2016) being the most prominent. Briefly, a normalizing flow is constructed by introducing parametrized bijective transformations, with easy to compute Jacobians, to random variables with simple initial densities. By subsequently optimizing the parameters of the flow according to the lower bound they can significantly improve the posterior approximation. Auxiliary random variables instead construct more flexible distributions by introducing latent variables in the posterior itself, thus defining the approximate posterior as a mixture of simple distributions.

Nevertheless, applying these ideas to the parameters in a neural network has not yet been explored. While it is straightforward to apply normalizing flows to a sample of the weight matrix from q(W), this quickly becomes very expensive; for example with planar flows (Rezende & Mohamed, 2015) we will need two extra matrices for each step of the flow.
Furthermore, by utilizing this procedure we also lose the benefits of local reparametrizations (Kingma et al., 2015; Louizos & Welling, 2016), which are possible with Gaussian approximate posteriors.

In order to simultaneously maintain the benefits of local reparametrizations and increase the flexibility of the approximate posteriors in a Bayesian neural network we will rely on auxiliary random variables (Agakov & Barber, 2004; Salimans et al., 2013; 2015; Ranganath et al., 2015; Maaløe et al., 2016); more specifically we will exploit the well known "multiplicative noise" concept, e.g. as in (Gaussian) Dropout (Srivastava et al., 2014), in neural networks and we will parametrize the approximate posterior with the following process:

z ∼ q_φ(z);   W ∼ q_φ(W|z),   (3)

where now the approximate posterior becomes a compound distribution, q(W) = ∫ q(W|z) q(z) dz, with z being a vector of random variables distributed according to the mixing density q(z). To allow for local reparametrizations we will parametrize the conditional distribution for the weights to be a fully factorized Gaussian. Therefore we assume the following form for the fully connected layers:

q_φ(W|z) = ∏_{i=1}^{D_in} ∏_{j=1}^{D_out} N(z_i µ_ij, σ²_ij),   (4)

where D_in, D_out are the input and output dimensionality, and the following form for the kernels in convolutional networks:

q_φ(W|z) = ∏_{i=1}^{D_h} ∏_{j=1}^{D_w} ∏_{k=1}^{D_f} N(z_k µ_ijk, σ²_ijk),   (5)

where D_h, D_w, D_f are the height, width and number of filters for each kernel. Note that we did not let z affect the variance of the Gaussian approximation; in a pilot study we found that this parametrization was prone to local optima due to large variance gradients, an effect also observed with the multiplicative parametrization of the Gaussian posterior (Kingma et al., 2015; Molchanov et al., 2017). We have now reduced the problem of increasing the flexibility of the approximate posterior over the weights W to that of increasing the flexibility of the mixing density q(z). Since z is of much lower dimension, compared to W, it is now straightforward to apply normalizing flows to q(z); in this way we can significantly enhance our approximation and allow for e.g. multimodality and nonlinear dependencies between the elements of the weight matrix. This will in turn better capture the properties of the true posterior distribution, thus leading to better performance and predictive uncertainties. We will coin the term multiplicative normalizing flows (MNFs) for this family of approximate posteriors. Algorithms 1 and 2 describe the forward pass using local reparametrizations for fully connected and convolutional layers with this type of approximate posterior.
Algorithm 1 Forward propagation for each fully connected layer h. M_w, Σ_w are the means and variances of each layer, H is a minibatch of activations and NF(·) is the normalizing flow described at eq. 6. For the first layer we have that H = X, where X is the minibatch of inputs.
Require: H, M_w, Σ_w
  Z ∼ q(z_0)
  Z_{T_f} = NF(Z)
  M_h = (H ⊙ Z_{T_f}) M_w
  V_h = H² Σ_w
  E ∼ N(0, 1)
  return M_h + √V_h ⊙ E
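For concreteness, the following is a minimal NumPy sketch of Algorithm 1; it is not the authors' TensorFlow implementation, and the callables sample_z and nf are placeholders for the factorized Gaussian q(z_0) and the flow of eq. 6.

```python
import numpy as np

def mnf_dense_forward(H, M_w, Sigma_w, sample_z, nf):
    """Minimal sketch of Algorithm 1: the forward pass of a fully connected
    layer with a multiplicative-flow posterior and local reparametrization.

    H        : (batch, D_in) minibatch of activations
    M_w      : (D_in, D_out) means of the weight posterior
    Sigma_w  : (D_in, D_out) variances of the weight posterior
    sample_z : callable returning a (batch, D_in) sample from q(z_0)
    nf       : callable applying the normalizing flow, z_0 -> z_{T_f}
    """
    Z0 = sample_z(H.shape[0])            # Z ~ q(z_0)
    Z_tf = nf(Z0)                        # Z_{T_f} = NF(Z)
    M_h = (H * Z_tf) @ M_w               # mean of the pre-activations
    V_h = (H ** 2) @ Sigma_w             # variance of the pre-activations
    E = np.random.randn(*M_h.shape)      # E ~ N(0, I)
    return M_h + np.sqrt(V_h) * E        # local reparametrization
```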
Algorithm 2 Forward propagation for each convolutional layer h. N_f is the number of convolutional filters, ∗ is the convolution operator and we assume the [batch, height, width, feature maps] convention.
Require: H, M_w, Σ_w
  z ∼ q(z_0)
  z_{T_f} = NF(z)
  M_h = H ∗ (M_w ⊙ reshape(z_{T_f}, [1, 1, D_f]))
  V_h = H² ∗ Σ_w
  E ∼ N(0, 1)
  return M_h + √V_h ⊙ E

For the normalizing flow of q(z) we will use the masked RealNVP (Dinh et al., 2016) using the numerically stable updates introduced in Inverse Autoregressive Flow (IAF) (Kingma et al., 2016):

m ∼ Bern(0.5)
h = tanh(f(m ⊙ z_t))
µ = g(h);   σ = σ(k(h))
z_{t+1} = m ⊙ z_t + (1 − m) ⊙ (z_t ⊙ σ + (1 − σ) ⊙ µ)   (6)
log |∂z_{t+1}/∂z_t| = (1 − m)ᵀ log σ,

where ⊙ corresponds to element-wise multiplication, σ(·) is the sigmoid function and f(·), g(·), k(·) are linear mappings. We resampled the mask m every time in order to avoid a specific splitting over the dimensions of z. For the starting point of the flow q(z_0) we used a simple fully factorized Gaussian and we will refer to the final iterate as z_{T_f}.

Unfortunately, parametrizing the posterior distribution as eq. 3 makes the lower bound intractable, as generally we do not have a closed form density function for q(W). This makes the calculation of the entropy −E_{q(W)}[log q(W)] challenging. Fortunately we can make the lower bound tractable again by further lower bounding the entropy in terms of an auxiliary distribution r(z|W) (Agakov & Barber, 2004; Salimans et al., 2013; 2015; Ranganath et al., 2015; Maaløe et al., 2016). This can be seen as if we are performing variational inference on the augmented probability space p(D, W_{1:L}, z_{1:L}), which maintains the same true posterior distribution p(W|D) (as we can always marginalize out r(z|W) to obtain the original model). The lower bound in this case becomes:

L(φ, θ) = E_{q_φ(z_{1:L}, W_{1:L})}[ log p(y|x, W_{1:L}, z_{1:L}) + log p(W_{1:L}) + log r_θ(z_{1:L}|W_{1:L}) − log q_φ(W_{1:L}|z_{1:L}) − log q_φ(z_{1:L}) ],   (7)

where θ are the parameters of the auxiliary distribution r(·). This bound is looser than the previous bound, however the extra flexibility of q(W) can compensate and allow for a tighter bound. Furthermore, the tightness of the bound also depends on the ability of r(z|W) to approximate the "auxiliary" posterior distribution q(z|W) = q(W|z) q(z) / q(W). Therefore, to allow for a flexible r(z|W) we will follow (Ranganath et al., 2015) and we will parametrize it with inverse normalizing flows as follows:

r(z_{T_b}|W) = ∏_{i=1}^{D_z} N(µ̃_i, σ̃²_i),   (8)

where for fully connected layers we have that:

µ̃ = (b_1 ⊗ tanh(cᵀ W)) 1 / D_out   (9)
σ̃ = σ((b_2 ⊗ tanh(cᵀ W)) 1 / D_out),   (10)

and for convolutional:

µ̃ = (tanh(mat(W) c) ⊗ b_1) 1 / (D_h D_w)   (11)
σ̃ = σ((tanh(mat(W) c) ⊗ b_2) 1 / (D_h D_w)),   (12)

where b_1, b_2, c are trainable vectors that have the same dimensionality as z (i.e. D_z), 1 corresponds to a vector of 1s, ⊗ corresponds to the outer product and mat(·) corresponds to the matricization operator (converting the multidimensional tensor to a matrix).
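As an illustration of a single step of this flow (eq. 6), here is a small NumPy sketch; the linear mappings f, g, k are represented by explicit weight matrices and all names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_realnvp_step(z, Wf, Wg, Wk):
    """One masked RealNVP update as in eq. 6, with a freshly resampled mask.

    z      : (D_z,) current iterate z_t
    Wf     : (n_hidden, D_z) weights of the linear mapping f
    Wg, Wk : (D_z, n_hidden) weights of the linear mappings g and k
    Returns the new iterate z_{t+1} and the log-determinant of the Jacobian.
    """
    m = np.random.binomial(1, 0.5, size=z.shape)   # m ~ Bern(0.5)
    h = np.tanh(Wf @ (m * z))                      # h = tanh(f(m * z_t))
    mu = Wg @ h                                    # mu = g(h)
    sig = sigmoid(Wk @ h)                          # sigma = sigmoid(k(h))
    z_new = m * z + (1 - m) * (z * sig + (1 - sig) * mu)
    log_det = np.sum((1 - m) * np.log(sig))        # (1 - m)^T log sigma
    return z_new, log_det
```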
The z_{T_b} variable corresponds to the fully factorized variable that is transformed by a normalizing flow to z_{T_f}, or in other words the variable obtained by the inverse normalizing flow, z_{T_b} = NF⁻¹(z_{T_f}). We will parametrize this inverse directly with the procedure described at eq. 6. Notice that we can also employ local reparametrizations in eqs. 9-12, so as to avoid sampling the, potentially big, matrix W. With the standard normal prior and the fully factorized Gaussian posterior of eq. 4 the KL-divergence between the prior and the posterior can be computed as follows:

−KL(q(W) || p(W)) = E_{q(W, z_{T_f})}[ −KL(q(W|z_{T_f}) || p(W)) + log r(z_{T_f}|W) − log q(z_{T_f}) ],   (13)

where each of the terms corresponds to:

−KL(q(W|z_{T_f}) || p(W)) = 1/2 ∑_{i,j} ( log σ²_ij − σ²_ij − z²_{T_f, i} µ²_ij + 1 )   (14)

log r(z_{T_f}|W) = log r(z_{T_b}|W) + ∑_{t=T_f}^{T_f + T_b} log |∂z_{t+1} / ∂z_t|   (15)

log q(z_{T_f}) = log q(z_0) − ∑_{t=1}^{T_f} log |∂z_t / ∂z_{t−1}|.   (16)

It should be noted that this bound is a generalization of the bound proposed by (Gal & Ghahramani, 2015b). We can arrive at the bound of (Gal & Ghahramani, 2015b) if we trivially parametrize the auxiliary model as r(z|W) = q(z) (which provides a less tight bound (Ranganath et al., 2015)), use a standard normal prior for W, a Bernoulli q(z) with probability of success π, and then let the variance of our conditional Gaussian q(W|z) go to zero. This will result into the lower bound being infinite due to the log of the variances; nevertheless since we are not optimizing over σ² we can simply disregard those terms. After a little bit of algebra we can show that the only term that will remain in the KL-divergence between q(W) and p(W) will be the expectation of the trace of the square of the mean matrix M (the matrix with M[i, j] = µ_ij), i.e. E_{q(z)}[tr((diag(z) M)ᵀ (diag(z) M))] = π ‖M‖², with 1 − π being the dropout rate.

We also found that in general it is beneficial to "constrain" the standard deviations σ_ij of the conditional Gaussian posterior q(W|z) during the forward pass for the computation of the likelihood to a lower than the true range, e.g. [0, α] instead of the [0, 1] we have with a standard normal prior. This results into a small bias and a looser lower bound, however it helps in avoiding bad local minima in the variational objective. This is akin to the free bits objective described at (Kingma et al., 2016).
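To make the structure of eqs. 13-16 concrete, the sketch below computes a single-sample estimate of the negative KL term for one fully connected layer; the flow, inverse flow and r(z_{T_b}|W) callables are placeholders and the function names are illustrative, not the paper's implementation.

```python
import numpy as np

def neg_kl_single_sample(mu_w, var_w, z0, log_q_z0, flow, inv_flow, log_r_tb):
    """Single-sample estimate of -KL(q(W)||p(W)) for one layer, following
    eqs. 13-16 with a standard normal prior on the weights.

    mu_w, var_w  : (D_in, D_out) means and variances of q(W|z)
    z0, log_q_z0 : sample from q(z_0) and its log-density
    flow         : callable z_0 -> (z_{T_f}, sum of log-det Jacobians)
    inv_flow     : callable z_{T_f} -> (z_{T_b}, sum of log-det Jacobians)
    log_r_tb     : callable (z_{T_b}, mu_w) -> log r(z_{T_b}|W), as in eq. 8
    """
    z_tf, logdet_q = flow(z0)
    # eq. 14: -KL(q(W|z_{T_f}) || N(0, 1)) for the factorized Gaussian
    neg_kl_cond = 0.5 * np.sum(
        np.log(var_w) - var_w - (z_tf[:, None] ** 2) * mu_w ** 2 + 1.0)
    # eq. 15: log r(z_{T_f}|W) through the inverse flow
    z_tb, logdet_r = inv_flow(z_tf)
    log_r = log_r_tb(z_tb, mu_w) + logdet_r
    # eq. 16: log q(z_{T_f}) through the change of variables
    log_q = log_q_z0 - logdet_q
    return neg_kl_cond + log_r - log_q          # eq. 13
```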
3. Related work
Approximate inference for Bayesian neural networks has been pioneered by (MacKay, 1992) and (Neal, 1995). The Laplace approximation (MacKay, 1992) provides a deterministic approximation to the posterior that is easy to obtain; it is a Gaussian centered at the MAP estimate of the parameters with a covariance determined by the inverse of the Hessian of the log-likelihood. Despite the fact that it is straightforward to implement, its scalability is limited unless approximations are made, which generally reduces performance. Hamiltonian Monte Carlo (Neal, 1995) is so far the gold standard for approximate Bayesian inference; nevertheless it is also not scalable to large networks and datasets due to the fact that we have to explicitly store the samples from the posterior. Furthermore, as it is an MCMC method, assessing convergence is non-trivial. Nevertheless there is interesting work that tries to improve upon those issues with stochastic gradient MCMC (Chen et al.) and distillation methods (Korattikara et al., 2015).

Deterministic methods for approximate inference in Bayesian neural networks have recently attracted much attention. One of the first applications of variational inference in neural networks was in (Peterson, 1987) and (Hinton & Van Camp, 1993). More recently (Graves, 2011) proposed a practical method for variational inference in this setting with a simple (but biased) estimator for a fully factorized posterior distribution. (Blundell et al., 2015) improved upon this work with the unbiased estimator from (Kingma & Welling, 2014) and a scale mixture prior. (Hernández-Lobato & Adams, 2015) proposed to use Expectation Propagation (Minka, 2001) with fully factorized posteriors and showed good results on regression tasks. (Kingma et al., 2015) showed how Gaussian dropout can be interpreted as performing approximate inference with log-uniform priors, multiplicative Gaussian posteriors and local reparametrizations, thus allowing straightforward learning of the dropout rates. Similarly (Gal & Ghahramani, 2015b) showed interesting connections between Bernoulli Dropout (Srivastava et al., 2014) networks and approximate Bayesian inference in deep Gaussian Processes (Damianou & Lawrence, 2013), thus allowing the extraction of uncertainties in a principled way. Similarly (Louizos & Welling, 2016) arrived at the same result through structured posterior approximations via matrix Gaussians and local reparametrizations (Kingma et al., 2015).

It should also be mentioned that uncertainty estimation in neural networks can also be performed without the Bayesian paradigm; frequentist methods such as the Bootstrap (Osband et al., 2016) and ensembles (Lakshminarayanan et al., 2016) have shown that in certain scenarios they can provide reasonable confidence intervals.
4. Experiments
All of the experiments were coded in Tensorflow (Abadi et al., 2016) and optimization was done with Adam (Kingma & Ba, 2015) using the default hyperparameters. We used the LeNet 5 (LeCun et al., 1998) convolutional architecture (the version from Caffe) with ReLU (Nair & Hinton, 2010) nonlinearities. The means M_w of the conditional Gaussian q(W|z) were initialized with the scheme proposed in (He et al., 2015), whereas the logarithms of the variances were initialized by sampling from a Gaussian with a negative mean. Unless explicitly mentioned otherwise we use flows of length two for q(z) and r(z|W), with 50 hidden units for each step of the flow of q(z) and 100 hidden units for each step of the flow of r(z|W). We used 100 posterior samples to estimate the predictive distribution for all of the models during testing and 1 posterior sample during training.
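The predictive distribution is estimated by simple Monte Carlo over posterior samples; a minimal sketch is given below, assuming a stochastic_forward callable that redraws the weight and auxiliary noise on every call (the name is illustrative, not from the paper's code).

```python
import numpy as np

def predictive_distribution(stochastic_forward, x, n_samples=100):
    """Monte Carlo estimate of the predictive distribution: average the
    softmax outputs of n_samples stochastic forward passes, each of which
    draws a fresh sample from the approximate posterior."""
    probs = np.stack([stochastic_forward(x) for _ in range(n_samples)])
    return probs.mean(axis=0)           # shape (batch, n_classes)
```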
Table 1. Models considered in this paper. Dropout corresponds to the model used in (Gal & Ghahramani, 2015a), Deep Ensemble to the model used in (Lakshminarayanan et al., 2016), FFG to the Bayesian neural network employed in (Blundell et al., 2015), FFLU to the Bayesian neural network used in (Kingma et al., 2015; Molchanov et al., 2017) with the additive parametrization of (Molchanov et al., 2017) and MNFG corresponds to the proposed variational approximation. It should be noted that Deep Ensembles use adversarial training (Goodfellow et al., 2014).
Name       Prior            Posterior
L2         N(0, I)          delta peak
Dropout    N(0, I)          mixture of zero and delta peaks
D. Ensem.  -                mixture of peaks
FFG        N(0, I)          fully factorized additive Gaussian
FFLU       log(|W|) = c     fully factorized additive Gaussian
MNFG       N(0, I)          multiplicative normalizing flows

MNIST  We trained LeNet architectures on MNIST using the priors and posteriors described at Table 1. We trained Dropout in the way described at (Gal & Ghahramani, 2015a) using 0.5 for the dropout rate, and for Deep Ensembles (Lakshminarayanan et al., 2016) we used 10 members with a fixed ε for the adversarial example generation. For the models with the Gaussian prior we constrained the standard deviation of the conditional posterior during the forward pass. The classification performance of each model can be seen at Table 2; while our overall focus is not classification accuracy per se, we see that with the MNF posteriors we improve upon mean field, reaching similar accuracies with Deep Ensembles.

notMNIST
To evaluate the predictive uncertainties of each model we performed the task described at (Lakshminarayanan et al., 2016); we estimated the entropy of the predictive distributions on notMNIST (available at http://yaroslavvb.blogspot.co.uk/2011/09/notmnist-dataset.html) from the LeNet architectures trained on MNIST. Since we a-priori know that none of the notMNIST classes correspond to a trained class (since they are letters and not digits), the ideal predictive distribution is uniform over the MNIST digits, i.e. a maximum entropy distribution. Contrary to (Lakshminarayanan et al., 2016) we do not plot the histogram of the entropies across the images but we instead use the empirical CDF, which we think is more informative. Curves that are closer to the bottom right part of the plot are preferable, as it denotes that the probability of observing a high confidence prediction is low. At Figure 2 we show the empirical CDF over the range of possible entropies, [0, 2.3], for all of the models.
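A minimal sketch of this evaluation is shown below, assuming the averaged predictive probabilities have already been computed; the empirical CDF is taken over the per-image predictive entropies.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy of each predictive distribution; probs has shape (N, n_classes)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def empirical_cdf(entropies, grid):
    """Fraction of images whose predictive entropy is <= each grid value."""
    return np.array([(entropies <= t).mean() for t in grid])

# e.g. grid = np.linspace(0.0, np.log(10), 100) for the 10 MNIST classes
```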
Figure 2. Empirical CDF for the entropy of the predictive distributions on notMNIST.
It is clear from the plot that the uncertainty estimates from MNFs are better than the other approaches, since the probability of a low entropy prediction is overall lower. The network trained with just weight decay was, as expected, the most overconfident, with an almost zero median entropy, while Dropout seems to be in the middle ground. The Bayesian neural net with the log-uniform prior also showed overconfidence in this task; we hypothesize that this is due to the induced sparsity (Molchanov et al., 2017), which results into the pruning of almost all irrelevant sources of variation in the parameters, thus not providing enough variability to allow for uncertainty in the predictions. The sparsity levels (computed by pruning weights according to the threshold on log σ² − log µ² used in (Molchanov et al., 2017)) are 62%, 95.2% for the two convolutional layers and 99.5%, 93.3% for the two fully connected. Similar effects would probably also be observed if we optimized the dropout rates for Dropout. The only source of randomness in the neural network is from the Bernoulli random variables (r.v.) z. By employing the Central Limit Theorem (assuming that the network is wide enough) we can express the distribution of the activations as a Gaussian (Wang & Manning, 2013) with variance affected by the variance of the Bernoulli r.v., V(z) = π(1 − π). The maximum variance of the Bernoulli r.v. is attained when π = 0.5, therefore any tuning of the Dropout rate will result into a decrease in the variance of the r.v. and therefore a decrease in the variance of the Gaussian at the hidden units. This will subsequently lead into less predictive variance and more confidence.

Finally, whereas it was shown at (Lakshminarayanan et al., 2016) that Deep Ensembles provide good uncertainty estimates (better than Dropout) on this task using fully connected networks, this result did not seem to apply for the LeNet architecture we considered. We hypothesize that they are sensitive to the hyperparameters (e.g. adversarial noise, number of members in the ensemble) and that more tuning is required in order to improve upon Dropout on this architecture.

CIFAR 10
We performed a similar experiment on CIFAR 10. To artificially create the "unobserved class" scenario, we hid 5 of the labels (dog, frog, horse, ship, truck) and trained on the rest (airplane, automobile, bird, cat, deer). For this task we used the larger LeNet architecture described at (Gal & Ghahramani, 2015a), with 192 filters at each convolutional layer and 1000 hidden units for the fully connected layer. For the models with the Gaussian prior we similarly constrained the standard deviation during the forward pass. For Deep Ensembles we used five members, again with a fixed ε for the adversarial example generation. The predictive performance on these five classes can be seen in Table 2, with Dropout and MNFs achieving the overall better accuracies. We subsequently measured the entropy of the predictive distribution on the classes that were hidden, with the resulting empirical CDFs visualized in Figure 3.

We similarly observe that the network with just weight decay was the most overconfident. Furthermore, Deep Ensembles and Dropout had similar uncertainties, with Deep Ensembles having lower accuracy on the observed classes. The networks with the Gaussian priors also had similar uncertainty to the network with the log-uniform prior, nevertheless the MNF posterior had much better accuracy on the observed classes.
Figure 3. Empirical CDF for the entropy of the predictive distributions on the 5 hidden classes from CIFAR 10.

The sparsity levels for the network with the log-uniform prior were now 94.9%, 99.8% for the convolutional layers and 99.9%, 92.7% for the fully connected. Overall, the network with the MNF posteriors seems to provide the better trade-off between uncertainty and accuracy on the observed classes.
Table 2. Test errors (%) with the LeNet architecture on MNIST and the first five classes of CIFAR 10.

Dataset    L2    Dropout    D. Ensem.    FFG    FFLU    MNFG
MNIST
CIFAR 5    24    16         21           22     23      16
We also measured how robust our models and uncertainties are against adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2014) by generating examples using the fast sign method (Goodfellow et al., 2014) for each of the previously trained architectures using Cleverhans (Papernot et al., 2016). For this task we do not include Deep Ensembles as they are trained on adversarial examples.
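The fast sign method itself is simple enough to sketch directly; the snippet below is a generic illustration rather than the Cleverhans call used in the paper, and assumes a callable that returns the gradient of the loss with respect to the inputs.

```python
import numpy as np

def fast_gradient_sign(x, y, grad_loss_wrt_x, eps):
    """Generate adversarial examples with the fast sign method
    (Goodfellow et al., 2014).

    x               : clean inputs, e.g. (batch, height, width, channels)
    y               : true labels
    grad_loss_wrt_x : callable returning d loss(x, y) / d x
    eps             : perturbation magnitude
    """
    x_adv = x + eps * np.sign(grad_loss_wrt_x(x, y))
    return np.clip(x_adv, 0.0, 1.0)     # keep pixels in the valid input range
```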
MNIST
In this scenario we observe interesting results if we plot the change in accuracy and entropy when varying the magnitude of the adversarial perturbation. The resulting plot can be seen in Figure 4. Overall, Dropout seems to have better accuracies on adversarial examples; nevertheless, those come at an "overconfident" price, since the entropy of the predictive distributions is quite low, thus resulting into predictions that have, on average, above 0.7 probability for the dominant class. This is in contrast with MNFs; while the accuracy almost immediately drops close to random, the uncertainty simultaneously increases to almost maximum entropy. This implies that the predictive distribution is more or less uniform over those examples. So despite the fact that our model cannot overcome adversarial examples, at least it "knows that it doesn't know".
Figure 4. Accuracy (solid) vs entropy (dashed) as a function of the adversarial perturbation ε on MNIST.
CIFAR
We performed the same experiment also on the five class subset of CIFAR 10. The results can be seen in Figure 5. Here we however observe a different picture compared to MNIST, since all of the methods experienced overconfidence. We hypothesize that adversarial examples are harder to escape and be uncertain about in this dataset, due to the higher dimensionality, and therefore further investigation is needed.
Figure 5. Accuracy (solid) vs entropy (dashed) as a function of the adversarial perturbation ε on CIFAR 10 (on the first 5 classes).
For the final experiment we visualize the predictive distributions obtained with the different models on the toy regression task introduced at (Hernández-Lobato & Adams, 2015). We generated 20 training inputs from U[−4, 4] and then obtained the corresponding targets via y = x³ + ε, where ε ∼ N(0, 9). We fixed the likelihood noise to its true value and then fitted a Dropout network with π = 0.5 for the hidden layer (no Dropout was used for the input layer since it is 1-dimensional), an FFLU network and an MNFG.
Figure 6. Predictive distributions for the toy dataset. Blue areas correspond to ± standard deviations of the predictive distribution. (a) Dropout π = 0.5. (b) Dropout learned π. (c) FFLU. (d) MNFG.

We also fitted a Dropout network where we also learned the dropout probability π of the hidden layer according to the bound described at Section 2 (which is equivalent to the one described at (Gal & Ghahramani, 2015b)) using REINFORCE (Williams, 1992) and a global baseline (Mnih & Gregor, 2014). The resulting predictive distributions can be seen at Figure 6.

As we can observe, MNF posteriors provide more realistic predictive distributions, closer to the true posterior (which can be seen at (Hernández-Lobato & Adams, 2015)), with the network being more uncertain on areas where we do not observe any data. The uncertainties obtained by Dropout with fixed π = 0.5 did not diverge as much in those areas, but overall they were better compared to the uncertainties obtained with FFLU. We could probably attribute the latter to the sparsification of the network, since 95% and 44% of the parameters were pruned for each layer respectively.

Interestingly, the uncertainties obtained with the network with the learned Dropout probability were the most "overfitted". This might suggest that Dropout uncertainty is probably not a good posterior approximation, since by optimizing the dropout rates we do not seem to move closer to the true posterior predictive distribution. This is in contrast with MNFs; they are flexible enough to allow for optimizing all of their parameters in a way that better approximates the true posterior distribution. This result also empirically verifies the claim we previously made; by learning the dropout rates the entropy of the posterior predictive will decrease, thus resulting into more overconfident predictions.
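A minimal sketch of the toy data generation described above; the sampling range, cubic target and noise level follow the description given here and are otherwise illustrative.

```python
import numpy as np

def toy_regression_data(n=20, seed=0):
    """x ~ U[-4, 4], y = x^3 + eps with eps ~ N(0, 9), as in the toy task
    of Hernandez-Lobato & Adams (2015)."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(-4.0, 4.0, size=(n, 1))
    y = x ** 3 + rng.normal(0.0, 3.0, size=(n, 1))   # std 3, i.e. variance 9
    return x, y
```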
5. Conclusion
We introduced multiplicative normalizing flows (MNFs); a family of approximate posteriors for the parameters of a variational Bayesian neural network. We have shown that through this approximation we can significantly improve upon mean field on both predictive performance as well as predictive uncertainty. We compared our uncertainty on notMNIST and CIFAR with Dropout (Srivastava et al., 2014; Gal & Ghahramani, 2015b) and Deep Ensembles (Lakshminarayanan et al., 2016) using convolutional architectures and found that MNFs achieve more realistic uncertainties while providing predictive capabilities on par with Dropout. We suspect that the predictive capabilities of MNFs can be further improved through more appropriate optimizers that avoid the bad local minima in the variational objective. Finally, we also highlighted limitations of Dropout approximations and empirically showed that MNFs can overcome them.

There are a couple of promising directions for future research. One avenue would be to explore how much MNFs can sparsify and compress neural networks under either sparsity inducing priors, such as the log-uniform prior (Kingma et al., 2015; Molchanov et al., 2017), or empirical priors (Ullrich et al., 2017). Another promising direction is that of designing better priors for Bayesian neural networks. For example (Neal, 1995) has identified limitations of Gaussian priors and proposes alternative priors such as the Cauchy. Furthermore, the prior over the parameters also affects the type of uncertainty we get in our predictions; for instance we observed in our experiments a significant difference in uncertainty between Gaussian and log-uniform priors. Since different problems require different types of uncertainty it makes sense to choose the prior accordingly, e.g. use an informative prior so as to alleviate adversarial examples.
Acknowledgements
We would like to thank Klamer Schutte, Matthias Reisser and Karen Ullrich for valuable feedback. This research is supported by TNO, NWO and Google.
References
Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Agakov, Felix V. and Barber, David. An auxiliary variational method. In International Conference on Neural Information Processing, pp. 561–566. Springer, 2004.

Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural networks. Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015.

Chen, Tianqi, Fox, Emily B., and Guestrin, Carlos. Stochastic gradient hamiltonian monte carlo.

Damianou, Andreas C. and Lawrence, Neil D. Deep gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29 - May 1, 2013, pp. 207–215, 2013.

Dinh, Laurent, Sohl-Dickstein, Jascha, and Bengio, Samy. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

Gal, Yarin and Ghahramani, Zoubin. Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015a.

Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015b.

Goodfellow, Ian J., Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Graves, Alex. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Hernández-Lobato, José Miguel and Adams, Ryan. Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1861–1869, 2015.

Hernández-Lobato, José Miguel, Li, Yingzhen, Hernández-Lobato, Daniel, Bui, Thang, and Turner, Richard E. Black-box α-divergence minimization. arXiv preprint arXiv:1511.03243, 2015.

Hinton, Geoffrey E. and Van Camp, Drew. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13. ACM, 1993.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), San Diego, 2015.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational bayes. International Conference on Learning Representations (ICLR), 2014.

Kingma, Diederik P., Salimans, Tim, and Welling, Max. Variational dropout and the local reparametrization trick. Advances in Neural Information Processing Systems, 2015.

Kingma, Diederik P., Salimans, Tim, and Welling, Max. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

Korattikara, Anoop, Rathod, Vivek, Murphy, Kevin, and Welling, Max. Bayesian dark knowledge. arXiv preprint arXiv:1506.04416, 2015.

Lakshminarayanan, Balaji, Pritzel, Alexander, and Blundell, Charles. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Louizos, Christos and Welling, Max. Structured and efficient variational deep learning with matrix gaussian posteriors. arXiv preprint arXiv:1603.04733, 2016.

Maaløe, Lars, Sønderby, Casper Kaae, Sønderby, Søren Kaae, and Winther, Ole. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

MacKay, David J. C. A practical bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

Minka, Thomas P. Expectation propagation for approximate bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc., 2001.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

Molchanov, D., Ashukha, A., and Vetrov, D. Variational Dropout Sparsifies Deep Neural Networks. ArXiv e-prints, January 2017.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.

Neal, Radford M. Bayesian learning for neural networks. PhD thesis, Citeseer, 1995.

Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped dqn. arXiv preprint arXiv:1602.04621, 2016.

Papernot, Nicolas, Goodfellow, Ian, Sheatsley, Ryan, Feinman, Reuben, and McDaniel, Patrick. cleverhans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016.

Peterson, Carsten. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019, 1987.

Ranganath, Rajesh, Tran, Dustin, and Blei, David M. Hierarchical variational models. arXiv preprint arXiv:1511.02386, 2015.

Rezende, Danilo Jimenez and Mohamed, Shakir. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1278–1286, 2014.

Salimans, Tim, Knowles, David A., et al. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.

Salimans, Tim, Kingma, Diederik P., Welling, Max, et al. Markov chain monte carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pp. 1218–1226, 2015.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, and Fergus, Rob. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Ullrich, Karen, Meeds, Edward, and Welling, Max. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.

Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of The 30th International Conference on Machine Learning, pp. 118–126, 2013.

Welling, Max and Teh, Yee W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
A. Memorization capabilities
As it was shown in (Zhang et al., 2016), deep neural networks can exhibit memorization, even with random labels. Therefore deep neural networks can perfectly fit the training data while having random chance accuracy on the test data, even with Dropout or weight decay regularization. (Molchanov et al., 2017) instead showed that by employing Sparse Variational Dropout this phenomenon did not appear, with the network pruning everything and thus having random chance accuracy on both training and test sets. We similarly show here that with Gaussian priors and MNF posteriors we also have random chance accuracy on both train and test sets. This suggests that it is proper Bayesian inference that penalizes memorization.
Table 3. Accuracy (%) with the LeNet architecture on MNIST and the first five classes of CIFAR 10 using random labels. Random chance is 10% on MNIST and 20% on CIFAR 5.

Dataset    Dropout train    Dropout test    MNFG train    MNFG test
MNIST      30               11              11            11