A Comprehensive guide to Bayesian Convolutional Neural Network with Variational Inference

Abstract

Artificial Neural Networks are connectionist systems that perform a given task by learning from examples without prior knowledge about the task. This is done by finding an optimal point estimate for the weights in every node. Generally, networks that use point estimates as weights perform well with large datasets, but they fail to express uncertainty in regions with little or no data, leading to overconfident decisions. In this paper, a Bayesian Convolutional Neural Network (BayesCNN) using Variational Inference is proposed, which introduces a probability distribution over the weights. Furthermore, the proposed BayesCNN architecture is applied to tasks like Image Classification, Image Super-Resolution and Generative Adversarial Networks. The results are compared to point-estimate based architectures on the MNIST, CIFAR-10 and CIFAR-100 datasets for the Image Classification task, on the BSD300 dataset for the Image Super-Resolution task and on the CIFAR-10 dataset again for the Generative Adversarial Network task. BayesCNN is based on Bayes by Backprop, which derives a variational approximation to the true posterior. We, therefore, introduce the idea of applying two convolutional operations, one for the mean and one for the variance. Our proposed method not only achieves performance equivalent to frequentist inference in identical architectures but also incorporates a measurement of uncertainty and regularisation. It further eliminates the use of dropout in the model. Moreover, we predict how certain the model prediction is based on the epistemic and aleatoric uncertainties and empirically show how the uncertainty can decrease, allowing the decisions made by the network to become more deterministic as the training accuracy increases. Finally, we propose ways to prune the Bayesian architecture and to make it more efficient in computation and time.


Kumar Shridhar , Felix Laumann , Marcus Liwicki

MindGarage, Technical University Kaiserslautern; NeuralSpace, Imperial College London; Luleå University of Technology

December 2018


Deep Neural Networks (DNNs) are connectionist systems that learn to perform tasks by learning from examples without having prior knowledge about the tasks. They easily scale to millions of data points and yet remain tractable to optimize with stochastic gradient descent.

Convolutional neural networks (CNNs), a variant of DNNs, have already surpassed human accuracy in the realm of image classification (e.g. [24], [56], [34]). Because CNNs have the capacity to fit a wide diversity of non-linear data points, they require a large amount of training data. This often makes CNNs, and Neural Networks in general, prone to overfitting on small datasets: the model fits the training data well but is not predictive for new data. It also leaves Neural Networks incapable of correctly assessing the uncertainty in the training data and hence leads to overly confident decisions about the correct class, prediction or action.

Various regularization techniques for controlling overfitting are used in practice, namely early stopping, weight decay, L1 and L2 regularization, and, currently the most popular and empirically effective technique, dropout [25].

Despite Neural Network architectures achieving state-of-the-art results in almost all classification tasks, Neural Networks still make over-confident decisions. A measure of uncertainty in the prediction is missing from current Neural Network architectures. Very careful training, weight control measures like regularization of weights, and similar techniques are needed to make the models less susceptible to overfitting.

We address both of these concerns by introducing Bayesian learning to Convolutional Neural Networks, which adds a measure of uncertainty and regularization to their predictions.

Deep Neural Networks have been successfully applied to many domains, including very sensitive domains like health-care, security, fraudulent transactions and many more. However, from a probability theory perspective, it is unjustifiable to use single point-estimates as weights to base any classification on. On the other hand, Bayesian neural networks are more robust to overfitting, and can easily learn from small datasets. The Bayesian approach further offers uncertainty estimates via its parameters in the form of probability distributions (see Figure 1). At the same time, by using a prior probability distribution to integrate out the parameters, the average is computed across many models during training, which gives a regularization effect to the network, thus preventing overfitting.

Bayesian posterior inference over the neural network parameters is a theoretically attractive method for controlling overfitting; however, modelling a distribution over the kernels (also known as filters) of a CNN has never been attempted successfully before, perhaps because of the vast number of parameters and extremely large models commonly used in practical applications.

Even with a small number of parameters, inferring the model posterior in a Bayesian NN is a difficult task. Approximations to the model posterior are often used instead, with variational inference being a popular approach. In this approach one would model the posterior using a simple variational distribution such as a Gaussian, and try to fit the distribution's parameters to be as close as possible to the true posterior. This is done by minimising the Kullback-Leibler divergence from the true posterior. Many have followed this approach in the past for standard NN models [26], [3], [19], [4]. But the variational approach used to approximate the posterior in Bayesian NNs can be fairly computationally expensive: the use of Gaussian approximating distributions increases the number of model parameters considerably, without increasing model capacity by much. Blundell et al. [4], for example, used Gaussian distributions for Bayesian NN posterior approximation and doubled the number of model parameters, yet report the same predictive performance as traditional approaches using dropout. This makes the approach unsuitable in practice for CNNs, as the increase in the number of parameters is too costly.

We build our Bayesian CNN upon Bayes by Backprop [19], [4]. Exact Bayesian inference on the weights of a neural network is intractable, as the number of parameters is very large and the functional form of a neural network does not lend itself to exact integration. So, we approximate the intractable true posterior probability distributions $p(w|\mathcal{D})$ with variational probability distributions $q_\theta(w|\mathcal{D})$, which comprise the properties of Gaussian distributions $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^d$, denoted by $\mathcal{N}(\theta|\mu, \sigma^2)$, where $d$ is the total number of parameters defining a probability distribution. The shape of these Gaussian variational posterior probability distributions, determined by their variance $\sigma^2$, expresses an uncertainty estimation of every model parameter.

Figure 1: Top: Each filter weight has a fixed value, as in the case of frequentist Convolutional Networks. Bottom: Each filter weight has a distribution, as in the case of Bayesian Convolutional Networks. [16]

The main contributions of our work are as follows:

1. We present how Bayes by Backprop can be efficiently applied to CNNs. We, therefore, introduce the idea of applying two convolutional operations, one for the mean and one for the variance.
2. We show how the model learns richer representations and predictions from cheap model averaging.
3. We empirically show that our proposed generic and reliable variational inference method for Bayesian CNNs can be applied to various CNN architectures without any limitations on their performance.
4. We examine how to estimate the aleatoric and epistemic uncertainties and empirically show how the uncertainty can decrease, allowing the decisions made by the network to become more deterministic as the training accuracy increases.
5. We also empirically show how our method typically only doubles the number of parameters yet trains an infinite ensemble using unbiased Monte Carlo estimates of the gradients.
6. We also apply the L1 norm to the trained model parameters, prune the number of non-zero values, and then fine-tune the model to reduce the number of model parameters without a reduction in the model prediction accuracy.
7. Finally, we apply the concept of Bayesian CNN to tasks like Image Super-Resolution and Generative Adversarial Networks, and we compare the results with other prominent architectures in the respective domain.

This work builds on the foundations laid out by Blundell et al. [4], who introduced Bayes by Backprop for feedforward neural networks. Together with the extension to recurrent neural networks, introduced by Fortunato et al. [11], Bayes by Backprop is now applicable to the three most frequently used types of neural networks, i.e., feedforward, recurrent, and convolutional neural networks.
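The idea of two convolutional operations, one for the mean and one for the variance, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`conv2d_valid`, `bayes_conv2d`), the softplus parameterisation of $\sigma$ through `rho`, and the single-channel setting are assumptions made for illustration. It relies on the fact that, for independent Gaussian weights, the variance of a convolution output is the convolution of the squared input with the squared standard deviations.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation of image x with kernel k."""
    windows = sliding_window_view(x, k.shape)      # (H-kh+1, W-kw+1, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, k)


def bayes_conv2d(x, mu, rho, rng):
    """One stochastic forward pass through a single Bayesian filter.

    mu  : mean of every filter weight
    rho : unconstrained parameter, sigma = log(1 + exp(rho)) > 0 (assumption)
    """
    sigma = np.log1p(np.exp(rho))
    act_mean = conv2d_valid(x, mu)                 # first convolution: mean
    act_var = conv2d_valid(x ** 2, sigma ** 2)     # second convolution: variance
    eps = rng.standard_normal(act_mean.shape)      # independent noise per activation
    return act_mean + np.sqrt(act_var) * eps


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))                    # toy single-channel input
mu = rng.standard_normal((3, 3))                   # hypothetical filter means
rho = np.full((3, 3), -3.0)                        # hypothetical spread parameters
samples = np.stack([bayes_conv2d(x, mu, rho, rng) for _ in range(2000)])
print(samples.mean(axis=0).shape, samples.var(axis=0).max())
```

Averaging many such passes recovers the mean convolution, while the spread of the passes reflects the weight variances; each filter stores two numbers per weight, which matches the doubling of parameters mentioned in contribution 5.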

The perceptron was conceived by the psychologist Rosenblatt as a mathematical model of how the neurons in our brain function. According to Rosenblatt, a neuron takes a set of binary inputs (nearby neurons), multiplies each input by a continuous-valued weight (the synapse strength to each nearby neuron), and thresholds the sum of these weighted inputs to output a 1 if the sum is big enough and otherwise a 0 (the same way neurons either fire or do not fire).

Figure 2: Biologically inspired Neural Network [29]
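The thresholding rule described above can be written in a few lines. This is only a sketch; the weights and the threshold are hypothetical values chosen so that the unit behaves like a logical AND.

```python
import numpy as np


def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the binary inputs reaches the threshold, else 0."""
    return int(np.dot(inputs, weights) >= threshold)


# Hypothetical synapse strengths and firing threshold (logical AND).
weights = np.array([0.6, 0.6])
threshold = 1.0
for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron(np.array([a, b]), weights, threshold))
```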

Inspired by the biological nervous system, the structure of an Artificial Neural Network (ANN) was developed to process information similarly to how the brain processes information. A large number of highly interconnected processing elements (neurons) working together allows a Neural Network to solve complex problems. Just as humans learn by example, so does a Neural Network. Learning in biological systems involves adjustments to the synaptic connections, which is similar to weight updates in a Neural Network.

A Neural Network consists of three types of layers: an input layer that feeds the data to the model, hidden layers that learn the representation, and an output layer that outputs the results or predictions. Neural Networks can be thought of as an end-to-end system that finds patterns in data which are too complex to be recognized by a human to teach to a machine.

Figure 3: Neural Network with two hidden layers (input layer, hidden layer 1, hidden layer 2, output layer)

Hubel and Wiesel, in their hierarchy model, described a neural network with a hierarchical structure in the visual cortex. The hierarchy runs from the LGB (lateral geniculate body) to simple cells, to complex cells, to lower order hypercomplex cells and finally to higher order hypercomplex cells. Also, the network between the lower order hypercomplex cells and the higher order hypercomplex cells is structurally similar to the network between simple cells and complex cells. In this hierarchy, a cell in a higher stage generally tends to respond selectively to a more complicated feature of the stimulus pattern, and a cell at a lower stage responds to simpler features. Also, higher stage cells possess a larger receptive field and are more insensitive to shifts in the position of the stimulus pattern.

Similar to the hierarchy model, the starting layers of a neural network learn simpler features like edges and corners, and subsequent layers learn complex features like colours, textures and so on. Also, higher neural units possess a larger receptive field which builds over the initial layers. However, unlike in a multilayer perceptron, where all neurons from one layer are connected with all the neurons in the next layer, weight sharing is the main idea behind a convolutional neural network. For example, instead of each neuron having a different weight for each pixel of the input image (28*28 weights), the neurons only have a small set of weights (5*5) that is applied to many small subsets of the image of the same size. Layers past the first layer work in a similar way by taking in the 'local' features found in the previous hidden layer rather than pixel images, and successively see larger portions of the image since they are combining information about increasingly larger subsets of the image. Finally, the final layer makes the prediction for the output class.

The reason why this is helpful is intuitive if not mathematically clear: without such constraints, the network would have to learn the same simple things (such as detecting edges, corners, etc.) many times over, once for each portion of the image. But with the constraint in place, only one neuron needs to learn each simple feature, and with far fewer weights overall, it can do so much faster. Moreover, since the pixel-exact locations of such features do not matter, the neuron can skip neighbouring subsets of the image when applying the weights (subsampling, now known as a type of pooling), further reducing the training time. The addition of these two types of layers, convolutional and pooling layers, is the primary distinction of Convolutional Neural Nets (CNNs/ConvNets) from plain old neural nets.
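The 28*28 versus 5*5 example can be made concrete with a short NumPy sketch. The averaging kernel and the 2*2 max pooling are illustrative assumptions; the text does not prescribe a particular filter or pooling size.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.arange(28 * 28, dtype=float).reshape(28, 28)
kernel = np.ones((5, 5)) / 25.0             # one shared 5*5 set of weights

# Fully connected: a single neuron needs one weight per pixel.
dense_weights_per_neuron = image.size        # 28 * 28

# Convolution: the same 25 weights are reused on every 5*5 patch.
patches = sliding_window_view(image, kernel.shape)    # (24, 24, 5, 5)
feature_map = np.einsum("ijkl,kl->ij", patches, kernel)

# Subsampling (here 2*2 max pooling) keeps one value per 2*2 block.
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))

print(dense_weights_per_neuron, kernel.size, feature_map.shape, pooled.shape)
```

The shared filter has 25 weights instead of 784, yet it is evaluated at 24*24 positions, and pooling then reduces the feature map to 12*12.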

We define a function $y = f(x)$ that is estimated from the given inputs $\{x_1, \ldots, x_N\}$ and their corresponding outputs $\{y_1, \ldots, y_N\}$ and produces a predictive output. Using Bayesian inference, a prior distribution $p(f)$ is placed over the space of functions. This distribution represents our prior belief as to which functions are likely to have generated our data.

A likelihood $p(Y|f, X)$ is defined to capture the process by which observations are generated given a function. We use Bayes' rule to find the posterior distribution given our dataset: $p(f|X, Y)$.

The output for a new input point $x^*$ can be predicted by integrating over all possible functions $f$,

$$p(y^*|x^*, X, Y) = \int p(y^*|f^*)\, p(f^*|x^*, X, Y)\, df^* \tag{1}$$

Equation (1) is intractable because of the integral over $f^*$. We can approximate it by taking a finite set of random variables $w$ and conditioning the model on it. However, this rests on the modelling assumption that the model depends on these variables alone, which makes them sufficient statistics in our approximate model.

The predictive distribution for a new input point $x^*$ is then given by

$$p(y^*|x^*, X, Y) = \int p(y^*|f^*)\, p(f^*|x^*, w)\, p(w|X, Y)\, df^*\, dw.$$

However, the distribution $p(w|X, Y)$ still remains intractable and we need to approximate it with a variational distribution $q(w)$ that can be computed. The approximate distribution needs to be as close as possible to the posterior distribution obtained from the original model. We thus minimise the Kullback-Leibler (KL) divergence, intuitively a measure of similarity between two distributions, $\mathrm{KL}(q(w)\,\|\,p(w|X, Y))$, resulting in the approximate predictive distribution

$$q(y^*|x^*) = \int p(y^*|f^*)\, p(f^*|x^*, w)\, q(w)\, df^*\, dw. \tag{2}$$

Minimising the Kullback-Leibler divergence is equivalent to maximising the log evidence lower bound,

$$\mathrm{KL}_{\mathrm{VI}} := \int q(w)\, p(F|X, w) \log p(Y|F)\, dF\, dw - \mathrm{KL}(q(w)\,\|\,p(w)) \tag{3}$$

with respect to the variational parameters defining $q(w)$. This is known as variational inference, a standard technique in Bayesian modelling.

Maximising this bound rewards a high expected log likelihood while penalising the KL divergence between the variational distribution and the prior over $w$. The result is a variational distribution that learns a good representation from the data (through the log likelihood term) and stays close to the prior distribution. In other words, it can prevent overfitting.

The ability to rewrite statistical problems in an equivalent but different form, to reparameterise them, is one of the most general-purpose tools used in mathematical statistics. The type of reparameterization in which the global uncertainty in the weights is translated into a form of local uncertainty that is independent across examples is known as the local reparameterization trick. An alternative estimator is deployed for which $\mathrm{Cov}[L_i, L_j] = 0$ (with $L_i$ the contribution of the $i$-th example in a minibatch of size $M$), so that the variance of the stochastic gradients scales as $1/M$. The new estimator is made computationally efficient by not sampling $\epsilon$ directly, but only sampling the intermediate variables $f(\epsilon)$ through which $\epsilon$ influences $L^{\mathrm{SGVB}}_{\mathcal{D}}(\phi)$. Hence, the source of global noise can be translated to local noise ($\epsilon \rightarrow f(\epsilon)$), and a local reparameterization can be applied so as to obtain a statistically efficient gradient estimator.

The technique can be explained through a simple example. Consider an input $X$ drawn uniformly at random from the range $-1$ to $+1$ and an output $Y$ drawn from a normal distribution with mean $X$ and standard deviation $\delta$. The squared loss is defined as $(Y - X)^2$. The problem arises during backpropagation through the random normal sampling step: gradients cannot be propagated through a stochastic node. We therefore reparameterize by drawing $\epsilon$ from a standard normal distribution, multiplying it by $\delta$ and adding $X$, i.e. $Y = X + \delta\,\epsilon$. Moving the parameters out of the normal distribution does not change the behaviour of the model.

Uncertainty in a network is a measure of how certain the model is about its prediction. In Bayesian modelling, there are two main types of uncertainty one can model [9]: aleatoric uncertainty and epistemic uncertainty.

Aleatoric uncertainty measures the noise inherent in the observations. This type of uncertainty is present in the data collection method, like sensor noise or motion noise which is uniform along the dataset. It cannot be reduced if more data is collected. Epistemic uncertainty, on the other hand, represents the uncertainty caused by the model. This uncertainty can be reduced given more data and is often referred to as model uncertainty.

Aleatoric uncertainty can further be categorized into homoscedastic uncertainty, which stays constant for different inputs, and heteroscedastic uncertainty. Heteroscedastic uncertainty depends on the inputs to the model, with some inputs potentially having more noisy outputs than others. Heteroscedastic uncertainty is particularly important because it prevents the model from outputting overly confident decisions.

Current work measures uncertainties by placing a probability distribution over either the model parameters or the model outputs. Epistemic uncertainty is modelled by placing a prior distribution over a model's weights and then trying to capture how much these weights vary given some data. Aleatoric uncertainty, on the other hand, is modelled by placing a distribution over the output of the model.
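One way to see the two kinds of uncertainty side by side is to take class probabilities from several stochastic forward passes and split the predictive variance into two parts. The sketch below is an assumption made for illustration, not a decomposition stated in this section: the per-class aleatoric term is the average of $p(1-p)$ over passes, the epistemic term is the spread of the passes around their mean, and the simulated logits stand in for the outputs of a real Bayesian network.

```python
import numpy as np


def decompose_uncertainty(probs):
    """probs: array (T, C) of class probabilities from T stochastic passes."""
    p_bar = probs.mean(axis=0)
    # Aleatoric part: average per-pass categorical variance p * (1 - p).
    aleatoric = (probs * (1.0 - probs)).mean(axis=0)
    # Epistemic part: spread of the passes around their mean.
    epistemic = ((probs - p_bar) ** 2).mean(axis=0)
    return p_bar, aleatoric, epistemic


# Simulated passes: noisy logits stand in for sampling the network weights.
rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0]) + 0.5 * rng.standard_normal((100, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

p_bar, aleatoric, epistemic = decompose_uncertainty(probs)
print(p_bar, aleatoric, epistemic)
```

With this split the two parts add up to $\bar{p}(1-\bar{p})$ for every class, and the epistemic part vanishes when all passes agree, which mirrors the statement that model uncertainty can be reduced while observation noise cannot.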

The following can be sources of uncertainty, as mentioned by Kiureghian [9]:

1. Uncertainty inherent in the basic random variables $X$, such as the uncertainty inherent in material property constants and load values, which can be directly measured.
2. Uncertain model error resulting from the selection of the form of the probabilistic sub-model $f_X(x, H_f)$ used to describe the distribution of basic variables.
3. Uncertain modeling errors resulting from selection of the physical sub-models $g_i(x, H_g)$, $i = 1, 2, \ldots, m$, used to describe the derived variables.
4. Statistical uncertainty in the estimation of the parameters $H_f$ of the probabilistic sub-model.
5. Statistical uncertainty in the estimation of the parameters $H_g$ of the physical sub-models.
6. Uncertain errors involved in the measurement of observations, based on which the parameters $H_f$ and $H_g$ are estimated. These include errors involved in indirect measurement, e.g., the measurement of a quantity through a proxy, as in non-destructive testing of material strength.
7. Uncertainty modelled by the random variables $Y$ corresponding to the derived variables $y$, which may include, in addition to all the above uncertainties, uncertain errors resulting from computational errors, numerical approximations or truncations. For example, the computation of load effects in a nonlinear structure by a finite element procedure employs iterative calculations, which invariably involve convergence tolerances and truncation errors.

Backpropagation in Neural Networks was proposed by Rumelhart [10] in 1986 and is the most commonly used method for training neural networks. Backpropagation is a technique to compute the gradient of the loss with respect to the network weights. It operates in two phases: firstly, the input features are propagated through the network in the forward direction to compute the function output and thereby the loss associated with the parameters. Secondly, the derivatives of the training loss with respect to the weights are propagated back from the output layer towards the input layer. These computed derivatives are then used to update the weights of the network. This process is repeated, and the weights are updated at every iteration.

Despite the popularity of backpropagation, there are many hyperparameters in backpropagation-based stochastic optimization that require specific tuning, e.g., learning rate, momentum, weight decay, etc. The time required for finding the optimal values is proportional to the data size. For a network trained with backpropagation, only point estimates of the weights are obtained. As a result, these networks make overconfident predictions and do not account for uncertainty in the parameters. The lack of an uncertainty measure makes the network prone to overfitting and creates a need for regularization.

A Bayesian approach to Neural Networks addresses these shortcomings of the backpropagation approach [45], as Bayesian methods naturally account for uncertainty in parameter estimates and can propagate this uncertainty into predictions. Also, averaging over parameter values instead of just choosing single point estimates makes the model robust to overfitting.

Several approaches have been proposed in the past for learning in Bayesian Networks: Laplace approximation [43], MC Dropout [13], and Variational Inference [26] [19] [4]. We used Bayes by Backprop [4] for our work, which is explained next.
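The two phases of backpropagation can be traced in a small hand-written example. This is a sketch under stated assumptions: a one-hidden-layer tanh network, a toy regression target, full-batch gradient descent and a hypothetical learning rate of 0.1. None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(64, 2))
y = X[:, :1] * X[:, 1:]                    # toy target: product of the two inputs

W1 = 0.5 * rng.standard_normal((2, 8))
b1 = np.zeros(8)
W2 = 0.5 * rng.standard_normal((8, 1))
b2 = np.zeros(1)
lr = 0.1                                   # learning rate (a hyperparameter to tune)

losses = []
for step in range(500):
    # Phase 1: forward pass computes the output and the loss.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    losses.append(float((err ** 2).mean()))

    # Phase 2: backward pass propagates derivatives from output to input.
    d_pred = 2.0 * err / len(X)
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_z = d_h * (1.0 - h ** 2)             # derivative of tanh
    dW1 = X.T @ d_z
    db1 = d_z.sum(axis=0)

    # Update: the point estimates of the weights move against the gradient.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])
```

After training, `W1` and `W2` hold a single value per weight, which is exactly the point-estimate situation the Bayesian treatment aims to replace with distributions.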

Bayes by Backprop [19, 4] is a variational inference method to learn the posterior distribution on the weights $w \sim q_\theta(w|\mathcal{D})$ of a neural network, from which weights $w$ can be sampled in backpropagation. It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood.

Since the true posterior is typically intractable, an approximate distribution $q_\theta(w|\mathcal{D})$ is defined that is aimed to be as similar as possible to the true posterior $p(w|\mathcal{D})$, measured by the Kullback-Leibler (KL) divergence [35]. Hence, we define the optimal parameters $\theta^{opt}$ as

$$\begin{aligned}
\theta^{opt} &= \arg\min_\theta \mathrm{KL}\left[q_\theta(w|\mathcal{D}) \,\|\, p(w|\mathcal{D})\right] \\
&= \arg\min_\theta \mathrm{KL}\left[q_\theta(w|\mathcal{D}) \,\|\, p(w)\right] - \mathbb{E}_{q(w|\theta)}\left[\log p(\mathcal{D}|w)\right] + \log p(\mathcal{D})
\end{aligned} \tag{4}$$

where

$$\mathrm{KL}\left[q_\theta(w|\mathcal{D}) \,\|\, p(w)\right] = \int q_\theta(w|\mathcal{D}) \log \frac{q_\theta(w|\mathcal{D})}{p(w)}\, dw. \tag{5}$$

This derivation forms an optimisation problem with a resulting cost function widely known as variational free energy [50, 63, 12], which is built upon two terms: the former, $\mathrm{KL}\left[q_\theta(w|\mathcal{D}) \,\|\, p(w)\right]$, is dependent on the definition of the prior $p(w)$ and is thus called the complexity cost, whereas the latter, $\mathbb{E}_{q(w|\theta)}\left[\log p(\mathcal{D}|w)\right]$, is dependent on the data through $p(\mathcal{D}|w)$ and is thus called the likelihood cost. The term $\log p(\mathcal{D})$ can be omitted in the optimisation because it is constant.

Since the KL divergence is also intractable to compute exactly, we follow a stochastic variational method [19, 4]. We sample the weights $w$ from the variational distribution $q_\theta(w|\mathcal{D})$, since it is much more probable to draw samples which are appropriate for numerical methods from the variational posterior $q_\theta(w|\mathcal{D})$ than from the true posterior $p(w|\mathcal{D})$. Consequently, we arrive at the tractable cost function (6), which is optimised, i.e. minimised w.r.t. $\theta$, during training:

$$\mathcal{F}(\mathcal{D}, \theta) \approx \sum_{i=1}^{n} \log q_\theta(w^{(i)}|\mathcal{D}) - \log p(w^{(i)}) - \log p(\mathcal{D}|w^{(i)}) \tag{6}$$

where $n$ is the number of draws and each $w^{(i)}$ is sampled from $q_\theta(w|\mathcal{D})$.

The uncertainty afforded by Bayes by Backprop trained neural networks has been used successfully for training feedforward neural networks in both supervised and reinforcement learning environments [4, 42, 28] and for training recurrent neural networks [11], but has not been applied to convolutional neural networks to date.
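The Monte Carlo cost in (6) can be evaluated directly for a toy model. The sketch below makes several assumptions that are not in the text: a linear model $y = Xw$ with Gaussian observation noise, a zero-mean Gaussian prior, a diagonal Gaussian $q_\theta$ with $\sigma = \log(1 + e^{\rho})$, and an average over the draws instead of the plain sum (a rescaling by $1/n$). The function names are hypothetical.

```python
import numpy as np


def log_normal(x, mean, std):
    """Sum of elementwise log N(x | mean, std^2)."""
    return float(np.sum(-0.5 * np.log(2 * np.pi) - np.log(std)
                        - 0.5 * ((x - mean) / std) ** 2))


def free_energy(X, y, mu, rho, n_draws, rng, prior_std=1.0, noise_std=0.1):
    """Monte Carlo estimate of F(D, theta) for a linear model y = X @ w."""
    sigma = np.log1p(np.exp(rho))                  # softplus keeps sigma > 0
    total = 0.0
    for _ in range(n_draws):
        eps = rng.standard_normal(mu.shape)
        w = mu + sigma * eps                       # draw w^(i) from q_theta(w|D)
        log_q = log_normal(w, mu, sigma)           # log q_theta(w^(i)|D)
        log_prior = log_normal(w, 0.0, prior_std)  # log p(w^(i))
        log_lik = log_normal(y, X @ w, noise_std)  # log p(D|w^(i))
        total += log_q - log_prior - log_lik
    return total / n_draws                         # averaged over the draws


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)

rho = np.full(3, -3.0)
good = free_energy(X, y, w_true, rho, n_draws=200, rng=rng)
bad = free_energy(X, y, np.zeros(3), rho, n_draws=200, rng=rng)
print(good, bad)
```

A variational mean close to the weights that generated the data yields a much lower cost than a mean at zero, because the likelihood cost dominates here. In actual training, $\mu$ and $\rho$ would be updated with gradients of this cost, which the sampling step $w = \mu + \sigma\epsilon$ makes possible.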

Model pruning increases the sparsity of a deep neural network's various connection matrices, thereby reducing the number of non-zero valued parameters in the model. The whole idea of model pruning is to reduce the number of parameters without much loss in the accuracy of the model. This reduces the use of a large parameterized model with regularization and promotes the use of densely connected smaller models. Some recent work [23] [48] suggests that a network can achieve a sizable reduction in model size while still achieving comparable accuracy. Model pruning possesses several advantages in terms of reduction in computational cost, inference time and energy consumption. The resulting pruned model typically has sparse connection matrices, so efficient inference using these sparse models requires purpose-built hardware capable of loading sparse matrices and/or performing sparse matrix-vector operations. With such support, the overall memory usage is reduced with the new pruned model.

There are several ways of achieving a pruned model; the most popular one is to map the low contributing weights to zero, reducing the number of overall non-zero valued weights. This can be achieved by training a large sparse model and pruning it further, which makes it comparable to training a small dense model.

Assigning zero weights to most features and non-zero weights to only the important features can be formalized by applying the $L_0$ norm, where $L_0 = \|\theta\|_0 = \sum_j \delta(\theta_j \neq 0)$, which applies a constant penalty to all non-zero weights. The $L_0$ norm can be thought of as a feature selector norm that only assigns non-zero values to features that are important. However, the $L_0$ norm is non-convex and non-differentiable, which makes the resulting optimisation problem NP-hard, so it can only be solved efficiently if $P = NP$. The alternative that we use in our work is the $L_1$ norm, which is equal to the sum of the absolute weight values, $\|\theta\|_1 = \sum_j |\theta_j|$. The $L_1$ norm is convex and differentiable everywhere except at zero, and can be used as an approximation to the $L_0$ norm [59]. The $L_1$ norm works as a sparsity inducing regularizer by making a large number of coefficients equal to zero, which makes it a good feature selector in our case. The one caveat is that the $L_1$ norm does not have a gradient at $\theta_j = 0$.

Pruning away the less salient features to zero has been used in this work and is explained in detail in the Our Contribution section.

Applying Bayesian methods to neural networks has been studied in the past with various approximation methods for the intractable true posterior probability distribution $p(w|\mathcal{D})$. Buntine and Weigend [5] started to propose various maximum-a-posteriori (MAP) schemes for neural networks. They were also the first to suggest second order derivatives in the prior probability distribution $p(w)$ to encourage smoothness of the resulting approximate posterior probability distribution. In subsequent work by Hinton and Van Camp [26], the first variational methods were proposed, which naturally served as a regularizer in neural networks. They also noted that the amount of information in a weight can be controlled by adding Gaussian noise. When optimizing the trade-off between the expected error and the information in the weights, the noise level can be adapted during learning.

Hochreiter and Schmidhuber [27] suggested taking an information theory perspective into account and utilising a minimum description length (MDL) loss. This penalises non-robust weights by means of an approximate penalty based upon perturbations of the weights on the outputs. Denker and LeCun [7], and MacKay [44] investigated the posterior probability distributions of neural networks, treated the search in the model space (the space of architectures, weight decay, regularizers, etc.) as an inference problem, and tried to solve it using Laplace approximations. As a response to the limitations of Laplace approximations, Neal [49] investigated the use of hybrid Monte Carlo for training neural networks, although it has so far been difficult to apply these to the large sizes of neural networks built in modern applications. Also, these approaches lacked scalability in terms of both the network architecture and the data size.

More recently, Graves [19] derived a variational inference scheme for neural networks and Blundell et al. [4] extended this with an update for the variance that is unbiased and simpler to compute. Graves [20] derives a similar algorithm in the case of a mixture posterior probability distribution. A more scalable solution based on expectation propagation was proposed by Soudry [57] in 2014. While this method works for networks with binary weights, its extension to continuous weights is unsatisfying as it does not produce estimates of posterior variance.

Several authors have claimed that Dropout [58] and Gaussian Dropout [61] can be viewed as approximate variational inference schemes [13, 32]. We compare our results to those of Gal and Ghahramani [13] and discuss the methodological differences in detail.

Neural Networks can predict uncertainty when Bayesian methods are introduced into them. Attempts to model uncertainty have been studied since the 1990s [49] but were not applied successfully until 2015. Gal and Ghahramani [14] in 2015 provided a theoretical framework for modelling Bayesian uncertainty. Gal and Ghahramani [13] obtained the uncertainty estimates by casting dropout training in conventional deep networks as a Bayesian approximation of a Gaussian Process. They showed that any network trained with dropout is an approximate Bayesian model, and uncertainty estimates can be obtained by computing the variance on multiple predictions with different dropout masks.
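The dropout-based recipe attributed to Gal and Ghahramani [13], computing the variance over multiple predictions with different dropout masks, can be sketched as follows. The two-layer ReLU network, the random (untrained) weights and the function name `mc_dropout_predict` are assumptions made for illustration only.

```python
import numpy as np


def mc_dropout_predict(x, W1, W2, n_passes, rng, p_drop=0.5):
    """Mean and variance of predictions over n_passes random dropout masks."""
    preds = []
    for _ in range(n_passes):
        h = np.maximum(0.0, x @ W1)                # hidden ReLU layer
        mask = rng.random(h.shape) >= p_drop       # keep a unit with prob. 1 - p_drop
        h = h * mask / (1.0 - p_drop)              # inverted dropout scaling
        preds.append(h @ W2)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.var(axis=0)


rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 16))                  # hypothetical, untrained weights
W2 = rng.standard_normal((16, 1))
x = rng.standard_normal((5, 4))

mean, var = mc_dropout_predict(x, W1, W2, n_passes=100, rng=rng)
print(mean.ravel(), var.ravel())
```

Keeping dropout active at prediction time is what turns a single deterministic output into a set of samples; with `p_drop=0.0` all passes coincide and the variance collapses to zero.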

Some early work in the model pruning domain used a second-order Taylor approximation of the increase in the loss function of the network when a weight is set to zero [41]. A diagonal Hessian approximation was used to calculate the saliency of each parameter [41]; the low-saliency parameters were pruned from the network and the network was retrained.

Narang [48] showed that pruned RNN and GRU models performed better on the task of speech recognition than a dense network of the original size. This result is very similar to the results obtained in our case, where a pruned model achieved better results than a normal network. However, no direct comparison can be drawn, as the model architectures (CNN vs RNN) and the tasks (Computer Vision vs Speech Recognition) are completely different. Narang [48] introduced a gradual pruning scheme that prunes all the weights in a layer below some manually chosen threshold; the threshold increases linearly with one slope in phase 1 and linearly with another slope in phase 2, followed by normal training. In contrast, we reduced the number of filters to half in one case, and in the other case we induced sparsity based on L1 regularization to remove the less contributing weights and reduce the number of parameters.

Other work similar to ours [2, 38, 6] that reduces or removes redundant connections or induces sparsity is motivated by the desire to speed up computation. The techniques used are highly dependent on convolutional layers and are not applicable to other architectures like Recurrent Neural Networks. Another interesting method of pruning is to represent each parameter with a smaller floating point number, for example 16 bits instead of 64 bits. This speeds up training and inference and makes the model less computationally expensive.

Another viewpoint on model compression was presented by Gong [17], who proposed vector quantization to achieve different compression ratios and different accuracies, so that the compression and accuracy can be chosen depending on the use case. However, it requires a different hardware architecture to support the inference at runtime. Besides quantization, other potentially complementary approaches to reducing model size include low-rank matrix factorization [8, 38] and group sparsity regularization to arrive at an optimal layer size [1].

In this section, we explain our algorithm of building a CNN with probability distributions over its weights in each filter, as seen in Figure 6, and apply variational inference, i.e. Bayes by Backprop, to approximate the intractable true posterior probability distribution, as described in the last chapter. Notably, for most CNN architectures a fully Bayesian perspective is not accomplished by merely placing probability distributions over weights in convolutional layers; it also requires probability distributions over weights in fully-connected layers (see Figure 5).

Figure 4: Input image with exemplary pixel values, filters, and corresponding output with point-estimates (top) and probability distributions (bottom) over weights. [55]

We utilise the local reparameterization trick [32] and apply it to CNNs. Following [32, 51], we do not sample the weights $w$; instead we sample the layer activations $b$, because this accelerates computation. The variational posterior probability distribution $q_\theta(w_{ijhw}|\mathcal{D}) = \mathcal{N}(\mu_{ijhw}, \alpha_{ijhw}\mu_{ijhw}^2)$ (where $i$ and $j$ index the input and output layers, and $h$ and $w$ the height and width of any given filter) allows us to implement the local reparameterization trick in convolutional layers. This results in the following equation for the convolutional layer activations $b$:

$$b_j = A_i * \mu_i + \epsilon_j \odot \sqrt{A_i^2 * (\alpha_i \odot \mu_i^2)} \tag{7}$$

where $\epsilon_j \sim \mathcal{N}(0, 1)$, $A_i$ is the receptive field, $*$ denotes the convolutional operation, and $\odot$ the component-wise multiplication.

Figure 5: Fully Bayesian perspective of an exemplary CNN. Weights in filters of convolutional layers, and weights in fully-connected layers have the form of a probability distribution. [55]

The crux of equipping a CNN with probability distributions over weights instead of single point-estimates, and of being able to update the variational posterior probability distribution $q_\theta(w|\mathcal{D})$ by backpropagation, lies in applying two convolutional operations where filters with single point-estimates apply one. As explained in the last chapter, we deploy the local reparameterization trick and sample from the output $b$. Since the output $b$ is a function of the mean $\mu_{ijhw}$ and the variance $\alpha_{ijhw}\mu_{ijhw}^2$, among others, we are then able to compute the two variables determining a Gaussian probability distribution, namely the mean $\mu_{ijhw}$ and the variance $\alpha_{ijhw}\mu_{ijhw}^2$, separately.

We do this in two convolutional operations. In the first, we treat the output $b$ as the output of a CNN updated by frequentist inference. We optimize with Adam [31] towards a single point-estimate, which increases the validation accuracy of classifications. We interpret this single point-estimate as the mean $\mu_{ijhw}$ of the variational posterior probability distributions $q_\theta(w|\mathcal{D})$. In the second convolutional operation, we learn the variance $\alpha_{ijhw}\mu_{ijhw}^2$. As this formulation of the variance includes the mean $\mu_{ijhw}$, only $\alpha_{ijhw}$ needs to be learned in the second convolutional operation [47]. In this way, we ensure that only one parameter is updated per convolutional operation, exactly as it would be in a CNN updated by frequentist inference.

In other words, while we learn the MAP of the variational posterior probability distribution $q_\theta(w|\mathcal{D})$ in the first convolutional operation, we observe in the second convolutional operation how much the values of the weights $w$ deviate from this MAP. This procedure is repeated in the fully-connected layers. In addition, to accelerate computation, to ensure a positive non-zero variance $\alpha_{ijhw}\mu_{ijhw}^2$, and to enhance accuracy, we learn $\log \alpha_{ijhw}$ and use the Softplus activation function, as further described in the Empirical Analysis section.
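The two convolutional operations in (7) can be illustrated with a minimal one-dimensional sketch in plain Python. It is not taken from the authors' repository: the function names `conv1d` and `bayes_conv1d` are hypothetical, the convolution is an unpadded stride-1 cross-correlation, and $\alpha$ is recovered as $\exp(\log\alpha)$, which is an assumption made here only to keep it positive.

```python
import math
import random

def conv1d(signal, kernel):
    """Unpadded, stride-1 cross-correlation of two lists."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def bayes_conv1d(a, mu, log_alpha, rng):
    """Sample activations b = A*mu + eps * sqrt(A^2 * (alpha * mu^2))."""
    # Assumption: alpha is stored as log(alpha) and exponentiated here.
    alpha = [math.exp(v) for v in log_alpha]
    # First convolution: mean of the activations.
    mean = conv1d(a, mu)
    # Second convolution: variance of the activations, using squared inputs.
    var = conv1d([x * x for x in a],
                 [al * m * m for al, m in zip(alpha, mu)])
    # One standard normal draw per output activation, not per weight.
    return [m + rng.gauss(0.0, 1.0) * math.sqrt(v) for m, v in zip(mean, var)]
```

Noise is drawn once per output activation rather than once per weight, which is the source of the computational saving mentioned above. As $\alpha \to 0$ the sampled output collapses to the ordinary point-estimate convolution.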

In classification tasks, we are interested in the predictive distribution $p_{\mathcal{D}}(y^*|x^*)$, where $x^*$ is an unseen data example and $y^*$ its predicted class. For a Bayesian neural network, this quantity is given by:

$$p_{\mathcal{D}}(y^*|x^*) = \int p_w(y^*|x^*) \, p_{\mathcal{D}}(w) \, dw \tag{8}$$

In Bayes by Backprop, Gaussian distributions $q_\theta(w|\mathcal{D}) \sim \mathcal{N}(w|\mu, \sigma^2)$, where $\theta = \{\mu, \sigma\}$, are learned with some dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{n}$ as we explained previously. Due to the discrete and finite nature of most classification tasks, the predictive distribution is commonly assumed to be categorical. Incorporating this aspect into the predictive distribution gives us

$$p_{\mathcal{D}}(y^*|x^*) = \int \mathrm{Cat}(y^*|f_w(x^*)) \, \mathcal{N}(w|\mu, \sigma^2) \, dw \tag{9}$$

$$= \int \prod_{c=1}^{C} f(x^*_c|w)^{y^*_c} \, \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(w-\mu)^2}{2\sigma^2}} \, dw \tag{10}$$

where $C$ is the total number of classes and $\sum_c f(x^*_c|w) = 1$.

As there is no closed-form solution due to the lack of conjugacy between categorical and Gaussian distributions, we cannot recover this distribution analytically. However, we can construct an unbiased estimator of the expectation by sampling from $q_\theta(w|\mathcal{D})$:

$$\mathbb{E}_q[p_{\mathcal{D}}(y^*|x^*)] = \int q_\theta(w|\mathcal{D}) \, p_w(y^*|x^*) \, dw \tag{11}$$

$$\approx \frac{1}{T} \sum_{t=1}^{T} p_{w_t}(y^*|x^*) \tag{12}$$

where $T$ is the pre-defined number of samples. This estimator allows us to evaluate the uncertainty of our predictions by the definition of variance, hence called predictive variance and denoted as $\mathrm{Var}_q$:

$$\mathrm{Var}_q\big(p(y^*|x^*)\big) = \mathbb{E}_q[y y^T] - \mathbb{E}_q[y] \, \mathbb{E}_q[y]^T \tag{13}$$

This quantity can be decomposed into the aleatoric and epistemic uncertainty [30, 36]:

$$\mathrm{Var}_q\big(p(y^*|x^*)\big) = \underbrace{\frac{1}{T} \sum_{t=1}^{T} \mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^T}_{\text{aleatoric}} + \underbrace{\frac{1}{T} \sum_{t=1}^{T} (\hat{p}_t - \bar{p})(\hat{p}_t - \bar{p})^T}_{\text{epistemic}} \tag{14}$$

where $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} \hat{p}_t$ and $\hat{p}_t = \mathrm{Softmax}\big(f_{w_t}(x^*)\big)$.
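The decomposition in (14) can be computed directly from $T$ sampled softmax outputs. The following plain-Python sketch is illustrative only; the names `softmax` and `decompose_uncertainty` are hypothetical and the logits are assumed to come from $T$ stochastic forward passes of the network.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def decompose_uncertainty(prob_samples):
    """Return (aleatoric, epistemic) C x C matrices from T probability vectors."""
    t = len(prob_samples)
    c = len(prob_samples[0])
    # p_bar is the Monte Carlo predictive mean, as in (12).
    p_bar = [sum(p[i] for p in prob_samples) / t for i in range(c)]
    aleatoric = [[0.0] * c for _ in range(c)]
    epistemic = [[0.0] * c for _ in range(c)]
    for p in prob_samples:
        for i in range(c):
            for j in range(c):
                diag = p[i] if i == j else 0.0
                # diag(p_t) - p_t p_t^T, averaged over samples
                aleatoric[i][j] += (diag - p[i] * p[j]) / t
                # (p_t - p_bar)(p_t - p_bar)^T, averaged over samples
                epistemic[i][j] += (p[i] - p_bar[i]) * (p[j] - p_bar[j]) / t
    return aleatoric, epistemic
```

Two properties follow from the formula and serve as a sanity check: the two matrices sum to $\mathrm{diag}(\bar{p}) - \bar{p}\bar{p}^T$, and the epistemic part vanishes when all $T$ samples agree.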
Figure 6: Predictive distributions estimated for a low-dimensional active learning task. The predictive distributions are visualized as the mean with two standard deviations shaded. Colour-coded regions show the epistemic uncertainty and the aleatoric noise, and the data points are marked. (Left) A deterministic network conflates uncertainty as part of the noise and is overconfident outside of the data distribution. (Right) A variational Bayesian neural network with standard normal prior represents uncertainty and noise separately but is overconfident outside of the training distribution, as defined by [22].

It is of paramount importance that uncertainty is split into aleatoric and epistemic quantities, since this allows the modeller to evaluate the room for improvement: while aleatoric uncertainty (also known as statistical uncertainty) is merely a measure of the variation of ("noisy") data, epistemic uncertainty is caused by the model. Hence, a modeller can see whether the quality of the data is low (i.e. high aleatoric uncertainty) or the model itself is the cause of poor performance (i.e. high epistemic uncertainty). The former cannot be improved by gathering more data, whereas the latter can [9, 30].

Model pruning means reducing the number of weight parameters of a model in order to lower its count of non-zero weights, its inference time and its computation cost. In our work, a Bayesian Convolutional Network learns two parameters per weight, i.e. the mean and the variance, whereas a point-estimate network learns one single weight. This makes the overall number of parameters of a Bayesian network twice that of a point-estimate network with a similar architecture.

To make the number of parameters of the Bayesian CNN equivalent to that of the point-estimate architecture, the number of filters in the Bayesian architectures is reduced to half. This compensates for the two learned parameters (mean and variance) against one in point estimates and makes the overall number of parameters equal for both networks.

Another technique used is the application of the L1 norm to the learned weights of every layer. With the L1 norm, we make the vector of learned weights in the various model layers very sparse, as most of its components become close to zero, while the remaining non-zero components capture the most important features of the data. We set a threshold and set a weight to zero if its value falls below the threshold. We keep only the non-zero weights, and in this way the number of model parameters is reduced without affecting the overall performance of the model.
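The thresholding step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the threshold value is chosen by the user, and comparing the magnitude of a weight against the threshold is an assumption about how "below the threshold" is meant.

```python
def l1_penalty(weights, strength):
    """L1 regularization term that is added to the loss during training."""
    return strength * sum(abs(w) for w in weights)

def prune_by_threshold(weights, threshold):
    """Set every weight whose magnitude is below the threshold to zero."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

# Hypothetical learned weights of one layer.
layer = [0.5, -0.01, 0.0, -0.3, 0.02]
pruned = prune_by_threshold(layer, threshold=0.05)
```

Here `pruned` keeps only the two large-magnitude weights, so three of the five entries are zero and can be dropped from storage.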

The originally chosen activation functions in all architectures are ReLU, but we must introduce another, called Softplus, see (15), because of our method of applying two convolutional or fully-connected operations. As mentioned before, one of these determines the mean $\mu$ and the other the variance $\alpha\mu^2$. Specifically, we apply the Softplus function because we want to ensure that the variance $\alpha\mu^2$ never becomes zero. A zero variance would be equivalent to merely calculating the MAP, which can be interpreted as equivalent to a maximum likelihood estimation (MLE), which is further equivalent to utilising single point-estimates, hence frequentist inference. The Softplus activation function is a smooth approximation of ReLU. Although the difference is practically not influential, Softplus has the subtle and analytically important advantage that it only approaches zero as $x \to -\infty$ and never reaches it, whereas ReLU is exactly zero for all $x \le 0$.

$$\mathrm{Softplus}(x) = \frac{1}{\beta} \cdot \log\big(1 + \exp(\beta \cdot x)\big) \tag{15}$$

where $\beta$ is by default set to 1.

All experiments are performed with the same hyper-parameter settings, as stated in the Appendix.

For all conducted experiments, we implement the foregoing description of Bayesian CNNs with variational inference in LeNet-5 [39] and AlexNet [34]. The exact architecture specifications can be found in the Appendix and in our GitHub repository (https://github.com/kumar-shridhar/PyTorch-BayesianCNN).

To learn the objective function, we use Bayes by Backprop [19, 4], which is a variational inference method to learn the posterior distribution on the weights $w \sim q_\theta(w|\mathcal{D})$ of a neural network from which weights $w$ can be sampled in backpropagation. It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. The following approximation of this cost is optimised with respect to $\theta$ during training:

$$\mathcal{F}(\mathcal{D}, \theta) \approx \sum_{i=1}^{n} \log q_\theta(w^{(i)}|\mathcal{D}) - \log p(w^{(i)}) - \log p(\mathcal{D}|w^{(i)}) \tag{16}$$

where $n$ is the number of draws.

We now break down the objective function (16) and discuss each term in more detail.

The first term in equation (16) is the variational posterior. The variational posterior is taken to be a Gaussian distribution with mean $\mu$ and variance $\sigma^2$:

$$q_\theta(w^{(i)}|\mathcal{D}) = \prod_i \mathcal{N}(w_i|\mu, \sigma^2) \tag{17}$$

Taking the log, the log posterior is defined as:

$$\log\big(q_\theta(w^{(i)}|\mathcal{D})\big) = \sum_i \log \mathcal{N}(w_i|\mu, \sigma^2) \tag{18}$$

The second term in equation (16) is the prior over the weights, which we define as a product of individual Gaussians:

$$p(w^{(i)}) = \prod_i \mathcal{N}(w_i|0, \sigma_p^2) \tag{19}$$

Taking the log, the log prior is defined as:

$$\log\big(p(w^{(i)})\big) = \sum_i \log \mathcal{N}(w_i|0, \sigma_p^2) \tag{20}$$

The final term of equation (16), $\log p(\mathcal{D}|w^{(i)})$, is the likelihood term and is computed using the softmax function.

We use a Gaussian distribution and store mean and variance values instead of just one weight. The way the mean $\mu$ and the variance are computed is defined in the previous chapter. The variance cannot be negative, which is ensured by using Softplus as the activation function: we express $\sigma_i = \mathrm{softplus}(\rho_i)$, where $\rho$ is an unconstrained parameter.

We take the Gaussian distribution and initialize the mean $\mu$ as 0 and the variance (and hence $\rho$) randomly. We observed a mean centred around 0 and a variance starting with a large value and gradually decreasing over time.
A good initialization can also be to put a restriction on the variance and initialize it small. However, this might be data dependent, and a good method for variance initialization is still to be discovered. We perform gradient descent over $\theta = (\mu, \rho)$, and each individual weight is distributed as $w_i \sim \mathcal{N}(w_i|\mu_i, \sigma_i^2)$.

For all our tasks, we use the Adam optimizer [31] to optimize the parameters. We also apply the local reparameterization trick as mentioned in the previous section and take the gradient of the combined loss function with respect to the variational parameters $(\mu, \rho)$.

We take the weights of all the layers of the network and apply an L1 norm over them; all weights whose value is zero or below a defined threshold are removed, and the model is thereby pruned. Also, since the Bayesian CNN has twice the number of parameters $(\mu, \sigma)$ compared to a frequentist network (only one weight), we reduce the size of our networks (AlexNet and LeNet-5) to half by halving the number of filters. The architecture used is given in the Appendix.

Whether halving the number of filters counts as a pruning method is open to argument. It reduces the overall number of parameters and can therefore be regarded as a pruning method in some sense.
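To make the objective (16) and the parameterisation $\sigma_i = \mathrm{softplus}(\rho_i)$ concrete, the sketch below evaluates a single Monte Carlo draw of $\log q_\theta(w|\mathcal{D}) - \log p(w) - \log p(\mathcal{D}|w)$ in plain Python. It is an illustration under stated assumptions rather than the authors' code: the names are hypothetical, $\sigma$ is treated as a standard deviation, weights are sampled directly rather than through the local reparameterization trick, and `log_likelihood` is a user-supplied callable standing in for the softmax likelihood of the network.

```python
import math
import random

def softplus(x, beta=1.0):
    """(1/beta) * log(1 + exp(beta * x)), evaluated in a numerically stable way."""
    z = beta * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta

def log_gaussian(w, mean, sigma):
    """Log density of N(w | mean, sigma^2)."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (w - mean) ** 2 / (2.0 * sigma ** 2))

def free_energy_sample(mu, rho, prior_sigma, log_likelihood, rng):
    """One draw of log q(w|D) - log p(w) - log p(D|w) from equation (16)."""
    sigma = [softplus(r) for r in rho]            # positive by construction
    w = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    log_q = sum(log_gaussian(wi, m, s) for wi, m, s in zip(w, mu, sigma))
    log_p = sum(log_gaussian(wi, 0.0, prior_sigma) for wi in w)
    return log_q - log_p - log_likelihood(w)
```

Summing `free_energy_sample` over $n$ draws gives the approximation in (16). When the variational posterior equals the prior, the first two terms cancel and only the negative log-likelihood remains.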

We train the networks with the MNIST dataset of handwritten digits [39] and the CIFAR-10 dataset [33], since these datasets serve widely as benchmarks for the performance of CNNs.

The MNIST database [40] of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centred in a fixed-size image of 28 by 28 pixels. Each image is grayscale and is labelled with its corresponding class, which ranges from zero to nine.

The CIFAR-10 dataset is a labelled subset of the 80 million tiny images dataset [60]. It has a training set of 50,000 colour images in 10 classes, with 5,000 training images per class, each image 32 by 32 pixels in size. There are 10,000 images for testing.

First, we evaluate the performance of our proposed method, Bayesian CNNs with variational inference. Table 1 shows a comparison of validation accuracies (in percentage) for architectures trained by two disparate Bayesian approaches, namely variational inference, i.e. Bayes by Backprop, and Dropout as proposed by Gal and Ghahramani [13].

We compare the results of these two approaches to the frequentist inference approach for both datasets. Bayesian CNNs trained by variational inference achieve validation accuracies comparable to their counterpart architectures trained by frequentist inference. On MNIST, the validation accuracies of the two disparate Bayesian approaches are comparable, but a Bayesian LeNet-5 with Dropout achieves a considerably higher validation accuracy on CIFAR-10, although we were not able to reproduce these reported results.

Figure 7: Comparison of validation accuracies of Bayesian AlexNet and LeNet-5 with the frequentist approach on the MNIST and CIFAR-10 datasets.

                                   MNIST   CIFAR-10
Bayesian AlexNet (with VI)         99      73
Frequentist AlexNet                99      73
Bayesian LeNet-5 (with VI)         98      69
Frequentist LeNet-5                98      68
Bayesian LeNet-5 (with Dropout)    99.5    83

Table 1: Comparison of validation accuracies (in percentage) for different architectures with variational inference (VI), frequentist inference and Dropout as a Bayesian approximation as proposed by Gal and Ghahramani [13] for MNIST and CIFAR-10.

Figure 7 shows the validation accuracies of Bayesian vs non-Bayesian CNNs. One thing to observe is that in the initial epochs, Bayesian CNNs trained by variational inference start with a low validation accuracy compared to architectures trained by frequentist inference. This must stem from the initialization of the variational posterior probability distributions $q_\theta(w|\mathcal{D})$ as uniform distributions, while the initial point-estimates in architectures trained by frequentist inference are randomly drawn from a standard Gaussian distribution. (For uniformity, we changed the initialization of frequentist architectures from Xavier initialization to standard Gaussian.) The latter initialization method ensures the initialized weights are neither too small nor too large. In other words, the motivation of the latter initialization is to start with weights such that the activation functions do not begin in saturated or dead regions. This is not true in the case of uniform distributions, and hence the starting validation accuracies of Bayesian CNNs can be comparably low.

Figure 8 displays the convergence of the standard deviation $\sigma$ of the variational posterior probability distribution $q_\theta(w|\mathcal{D})$ of a random model parameter over epochs. As mentioned before, all prior probability distributions $p(w)$ are initialized as uniform distributions. The variational posterior probability distributions $q_\theta(w|\mathcal{D})$ are approximated as Gaussian distributions which become more confident as more data is processed, observable by the decreasing standard deviation over epochs in Figure 8. Although the validation accuracy for MNIST on Bayesian LeNet-5 has already reached 99%, we can still see a fairly steep decrease in the parameter's standard deviation. In Figure 9, we plot the actual Gaussian variational posterior probability distributions $q_\theta(w|\mathcal{D})$ of a random parameter of LeNet-5 trained on CIFAR-10 at some epochs.

Figure 8: Convergence of the standard deviation of the Gaussian variational posterior probability distribution $q_\theta(w|\mathcal{D})$ of a random model parameter at epochs 1, 5, 20, 50, and 100. MNIST is trained on Bayesian LeNet-5.

Figure 9: Convergence of the Gaussian variational posterior probability distribution $q_\theta(w|\mathcal{D})$ of a random model parameter at epochs 1, 5, 20, 50, and 100. CIFAR-10 is trained on Bayesian LeNet-5.

Figure 9 displays the convergence of the Gaussian variational probability distribution of a weight taken randomly from the first layer of the LeNet-5 architecture. The architecture is trained on the CIFAR-10 dataset with uniform initialization.

The CIFAR-100 dataset is similar to CIFAR-10 and is likewise a labelled subset of the 80 million tiny images dataset [60]. The dataset has 100 classes containing 600 images each. There are 500 training images and 100 validation images per class. The images are coloured, with a resolution of 32 by 32 pixels.

In Figure 10, we show how Bayesian networks naturally incorporate the effects of regularization, exemplified on AlexNet. While an AlexNet trained by frequentist inference without any regularization overfits greatly on CIFAR-100, an AlexNet trained by Bayesian inference on CIFAR-100 does not. This is evident from the high training accuracy of the frequentist approach with no dropout or one dropout layer. The Bayesian CNN performs equivalently to an AlexNet trained by frequentist inference with three layers of Dropout after the first, fourth, and sixth layers in the architecture. Another thing to note here is that the Bayesian CNN with 100 samples overfits slightly less than the Bayesian CNN with 25 samples. However, a higher number of samples did not prove useful on a smaller dataset, and we kept 25 as the number of samples for all other experiments.

                              CIFAR-100
Bayesian AlexNet (with VI)    36
Frequentist AlexNet           38
Bayesian LeNet-5 (with VI)    31
Frequentist LeNet-5           33

Table 2: Comparison of validation accuracies (in percentage) for different architectures with variational inference (VI), frequentist inference and Dropout as a Bayesian approximation as proposed by Gal and Ghahramani [13] for MNIST, CIFAR-10, and CIFAR-100.

Figure 10: Comparison of training and validation accuracies of Bayesian AlexNet and LeNet-5 with the frequentist approach, with and without dropout, on the CIFAR-100 dataset.

Table 3 shows a comparison of the training and validation accuracies for AlexNet with the Bayesian approach and the frequentist approach. The small gap between the training and validation accuracies shows the robustness of the Bayesian approach against overfitting, and shows that the Bayesian approach, without being explicitly regularized, overfits less than the frequentist architecture with no or one dropout layer. The results are comparable with the AlexNet architecture with 3 dropout layers.

                                          Training Accuracy   Validation Accuracy
Frequentist AlexNet (No dropout)          83                  38
Frequentist AlexNet (1 dropout layer)     72                  40
Frequentist AlexNet (3 dropout layers)    39                  38
Bayesian AlexNet (25 samples)             54                  37
Bayesian AlexNet (100 samples)            48                  37

Table 3: Comparison of training and validation accuracies (in percentage) for the AlexNet architecture with variational inference (VI) and frequentist inference for CIFAR-100.

Finally, Table 4 compares the means of aleatoric and epistemic uncertainties for a Bayesian LeNet-5 with variational inference on MNIST and CIFAR-10. The aleatoric uncertainty of CIFAR-10 is about twenty times as large as that of MNIST. Considering that the aleatoric uncertainty measures the irreducible variability and depends on the predicted values, a larger aleatoric uncertainty for CIFAR-10 can be directly deduced from its lower validation accuracy and may further be due to the smaller number of training examples. The epistemic uncertainty of CIFAR-10 is about fifteen times as large as that of MNIST, which we anticipated, since epistemic uncertainty decreases as validation accuracy increases.

                               Aleatoric uncertainty   Epistemic uncertainty
Bayesian LeNet-5 (MNIST)       0.0096                  0.0026
Bayesian LeNet-5 (CIFAR-10)    0.1920                  0.0404

Table 4: Aleatoric and epistemic uncertainty for Bayesian LeNet-5 calculated for MNIST and CIFAR-10, computed as proposed by Kwon et al. [36].

For every parameter in a frequentist inference network, a Bayesian CNN has two parameters $(\mu, \sigma)$. Halving the number of parameters of Bayesian AlexNet makes its parameter count comparable with that of a frequentist inference network. The number of filters of AlexNet is halved and a new architecture called AlexNetHalf is defined in Figure 5.4.

The AlexNetHalf architecture was trained and validated on the MNIST, CIFAR-10 and CIFAR-100 datasets and the results are shown in Table 6. The pruned AlexNet, with only half the number of filters of the normal architecture, shows an accuracy gain of 6 percentage points on CIFAR-10 and equivalent performance on MNIST and CIFAR-100. One possible explanation for the rise in accuracy is that a smaller number of filters learns the most important features, which proved better at inter-class classification. However, visualization of the filters gave no distinct evidence to confirm this explanation.

Another possible explanation is that the model generalizes better after the reduction in the number of filters, so that the model does not overfit and the validation accuracy is comparatively higher. The higher validation accuracy on CIFAR-100 for AlexNetHalf, together with a lower training accuracy than Bayesian AlexNet, supports this explanation. Using a smaller number of filters further enhances the regularization effect and makes the model more robust against overfitting. Similar results have been achieved by Narang [48], where a pruned model achieved better accuracy than the original architecture in a speech recognition task. Suppressing or removing the weights that contribute little or nothing to the prediction makes the model base its prediction on the most prominent and unique features and hence improves the prediction accuracy.

layer type             width   stride   padding   input shape       nonlinearity
convolution (11×11)    32      4        5         M × … × … × 32    Softplus
max-pooling (2×2)              2        0         M × … × … × …
convolution (5×5)      96      1        2         M × … × … × 15    Softplus
max-pooling (2×2)              2        0         M × … × … × …
convolution (3×3)      192     1        1         M × … × … × …
convolution (3×3)      128     1        1         M × … × … × …
convolution (3×3)      64      1        1         M × … × … × …
max-pooling (2×2)              2        0         M × … × … × …
…                                                 M × …

                                  MNIST   CIFAR-10   CIFAR-100
Bayesian AlexNet (with VI)        99      73         36
Frequentist AlexNet               99      73         38
Bayesian AlexNetHalf (with VI)    99      79         38

Table 6: Comparison of validation accuracies (in percentage) for AlexNet with variational inference (VI), AlexNet with frequentist inference and AlexNet with the number of filters halved, for the MNIST, CIFAR-10 and CIFAR-100 datasets.

The L1 norm induces sparsity in the trained model parameters and sets some values to zero. We trained a model for some epochs (the number of epochs differs across datasets, as we applied early stopping when the validation accuracy remained unchanged for 5 epochs). We removed the zero-valued parameters of the learned weights and kept the non-zero parameters for a trained Bayesian AlexNet on the MNIST and CIFAR-10 datasets. We pruned the model to make the number of parameters in the Bayesian network comparable to the number of parameters in the point-estimate architecture.

Table 7 compares the validation accuracies of the Bayesian AlexNet architecture with the L1 norm applied against the Bayesian AlexNet architecture and the frequentist AlexNet architecture. We obtained comparable results on MNIST and CIFAR-10 in these experiments; the results are shown in Table 7.

                                            MNIST   CIFAR-10
Bayesian AlexNet (with VI)                  99      73
Frequentist AlexNet                         99      73
Bayesian AlexNet with L1 Norm (with VI)     99      71

Table 7: Comparison of validation accuracies (in percentage) for AlexNet with variational inference (VI), AlexNet with frequentist inference and Bayesian AlexNet with the L1 norm applied, for the MNIST and CIFAR-10 datasets.

One thing to note here is that the number of parameters of the Bayesian network after applying the L1 norm is not necessarily equal to the number of parameters in the frequentist AlexNet architecture. It depends on the data size and the number of classes. However, the numbers of parameters in the case of MNIST and CIFAR-10 are closely comparable, and there is not much reduction in accuracy either. Also, early stopping was applied when there was no change in the validation accuracy for 5 epochs; the model was then saved and later pruned by applying the L1 norm.

The training time of a Bayesian CNN is twice that of a frequentist network with a similar architecture when the number of samples is equal to one. In general, the training time $T$ of a Bayesian CNN is defined as:

$$T = 2 \cdot \text{number of samples} \cdot t \tag{21}$$

where $t$ is the training time of a frequentist network. The factor of 2 is present due to the doubled learnable parameters in a Bayesian CNN, i.e. a mean and a variance for every single point-estimate weight in the frequentist network.

However, there is no difference in the inference time between the two networks.

The task referred to as Super Resolution (SR) is the recovery of a High-Resolution (HR) image from a given Low-Resolution (LR) image. It is applicable to many areas such as medical imaging [54] and face recognition [21].

There are many ways to perform single image super-resolution, and detailed benchmarks of the methods are provided by Yang [62]. The following are the major approaches to single image super-resolution:

Prediction Models: These models generate High-Resolution images from Low-Resolution inputs through a predefined mathematical formula. No training data is needed for such models. Interpolation-based methods (bilinear, bicubic, and Lanczos), which generate HR pixel intensities by weighted averaging of neighbouring LR pixel values, are good examples of this approach.

Edge Based Methods: Edges are one of the most important features for any computer vision task. High-Resolution images learned from edge features have high-quality edges and good sharpness. However, these models lack good colour and texture information.

Patch Based Methods: Cropped patches from Low-Resolution images and High-Resolution images are taken from the training dataset to learn a mapping function. The overlapping patches are averaged, or other techniques such as Conditional Random Fields [37] can be used for better mapping of the patches.

We build our work upon [53], which shows that performing Super Resolution in High-Resolution space is not the optimal solution and adds computational complexity. We used a Bayesian Convolutional Neural Network to extract features in the Low-Resolution space. We use an efficient sub-pixel convolution layer, as proposed by [53], which learns an array of upscaling filters to upscale the final Low-Resolution feature maps into the High-Resolution output. This replaces the handcrafted bicubic filter in the Super Resolution pipeline with more complex upscaling filters specifically trained for each feature map, and also reduces the computational complexity of the overall Super Resolution operation.

The hyperparameters used in the experiments are given in detail in Appendix A.

Figure 11: The proposed efficient sub-pixel convolutional neural network (ESPCN) [53], with two convolution layers for feature map extraction, and a sub-pixel convolution layer that aggregates the feature maps from Low-Resolution space and builds the Super Resolution image in a single step.

We used a four-layer convolutional model as described in the paper [53]. We replaced the convolution layers by Bayesian convolution layers and changed the forward pass, which now computes the mean, variance and KL divergence. The PixelShuffle layer is kept the same as provided by PyTorch, and no changes have been made there.

layer type           width                  stride   padding
convolution (5×5)    64                     1        2
convolution (3×3)    64                     1        1
convolution (3×3)    32                     1        1
convolution (3×3)    upscale factor ** 2    1        1

Table 8: Network architecture for Bayesian Super Resolution, where upscale factor is defined as a parameter. For our experiments, we take upscale factor = 3.
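The rearrangement performed by the sub-pixel convolution layer can be sketched in plain Python. The paper uses PyTorch's PixelShuffle layer unchanged; the function below is only an illustrative re-implementation of the same index mapping on nested lists, and the name `pixel_shuffle` and the list-based data layout are assumptions made for this example.

```python
def pixel_shuffle(channels, r):
    """Rearrange C*r*r feature maps of size H x W into C maps of size rH x rW.

    `channels` is a list of 2-D lists (one per feature map) and `r` is the
    upscale factor.
    """
    c_in = len(channels)
    if c_in % (r * r) != 0:
        raise ValueError("number of channels must be divisible by r * r")
    c_out = c_in // (r * r)
    h, w = len(channels[0]), len(channels[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # Each output pixel comes from one of the r*r input maps,
                # selected by its position inside the r x r sub-pixel block.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = channels[src][y // r][x // r]
    return out
```

With upscale factor 3, the last convolution in Table 8 produces 9 feature maps, which this step turns into a single map that is three times larger in each spatial dimension.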

The network architecture was trained on the BSD300 dataset [46] provided by the Berkeley Computer Vision Department. The dataset is very popular for the Image Super-Resolution task and is therefore used to compare the results with other work in the domain.

Figure 12: Sample image in Low-Resolution image space taken randomly from the BSD300 [46] dataset.

Figure 13: Generated Super Resolution image, scaled to 40 percent to fit.

The results generated with the Bayesian network are compared with those of the original paper, and they are comparable in terms of the number and the quality of the images generated. This application serves as a proof of concept that Bayesian networks can be used for the task of Image Super Resolution, and the results are promising.

More research is needed in the future to achieve state-of-the-art results in this domain, which is out of the scope of this thesis work.

Generative Adversarial Networks (GANs) [18] can be used for two major tasks: to learn good feature representations by using the generator and discriminator networks as feature extractors, and to generate natural images. The learned feature representations or generated images can substantially reduce the number of images needed for a supervised computer vision task. However, GANs were quite unstable to train in the past, which is why we base our work on a stable GAN architecture, namely Deep Convolutional GANs (DCGAN) [52]. We use the trained Bayesian discriminators for image classification tasks, showing competitive performance with the normal DCGAN architecture.

We based our work on the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [52]. We used the architecture of a deep convolutional generative adversarial network (DCGAN) that learns a hierarchy of representations from object parts to scenes in both the generator and the discriminator. The generator used in the network is shown in Table 9. The architecture is kept similar to the architecture used in the DCGAN paper [52]. Table 10 shows the discriminator network with Bayesian Convolutional Layers.

layer type                     width    stride  padding  nonlinearity
ConvolutionTranspose (4 × 4)   ngf * 8  1       0        ReLU
ConvolutionTranspose (4 × 4)   ngf * 4  2       1        ReLU
ConvolutionTranspose (4 × 4)   ngf * 2  2       1        ReLU
ConvolutionTranspose (4 × 4)   ngf      2       1        ReLU
ConvolutionTranspose (4 × 4)   nc       2       1        TanH

Table 9: Generator architecture as defined in the paper [52]

where ngf is the number of generator filters, which is chosen to be 64 in our work, and nc is the number of output channels, which is set to 3.

layer type                     width    stride  padding  nonlinearity
Convolution (4 × 4)            ndf      2       1        LeakyReLU
Convolution (4 × 4)            ndf * 2  2       1        LeakyReLU
Convolution (4 × 4)            ndf * 4  2       1        LeakyReLU
Convolution (4 × 4)            ndf * 8  2       1        LeakyReLU
ConvolutionTranspose (4 × 4)   1        1       0        Sigmoid

Table 10: Discriminator architecture with Bayesian Convolutional layers

where ndf is the number of discriminator filters and is set to 64 as default for all our experiments.
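As a sanity check on Tables 9 and 10, the sketch below traces feature-map sizes through both networks using the standard output-size formulas for convolution and transposed convolution. Several things are assumptions rather than statements from the text: the latent vector is treated as a 1 × 1 feature map, the discriminator input is taken to be the 64 × 64 image size listed in the appendix, and the final discriminator layer (listed as ConvolutionTranspose in Table 10) is treated as an ordinary convolution, as in the DCGAN reference design. The helper names are hypothetical.

```python
KERNEL = 4  # every layer in Tables 9 and 10 uses a 4 x 4 kernel


def conv_out(size, stride, padding, kernel=KERNEL):
    """Output side length of an ordinary strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, stride, padding, kernel=KERNEL):
    """Output side length of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def generator_shapes(ngf=64, nc=3):
    """(channels, height, width) after each generator row of Table 9."""
    rows = [(ngf * 8, 1, 0), (ngf * 4, 2, 1), (ngf * 2, 2, 1), (ngf, 2, 1), (nc, 2, 1)]
    size, shapes = 1, []  # assumption: the latent vector enters as a 1 x 1 map
    for width, stride, padding in rows:
        size = conv_transpose_out(size, stride, padding)
        shapes.append((width, size, size))
    return shapes


def discriminator_shapes(ndf=64, image_size=64):
    """(channels, height, width) after each discriminator row of Table 10."""
    # Assumption: the last row is evaluated as an ordinary convolution.
    rows = [(ndf, 2, 1), (ndf * 2, 2, 1), (ndf * 4, 2, 1), (ndf * 8, 2, 1), (1, 1, 0)]
    size, shapes = image_size, []
    for width, stride, padding in rows:
        size = conv_out(size, stride, padding)
        shapes.append((width, size, size))
    return shapes


for name, shapes in (("generator", generator_shapes()), ("discriminator", discriminator_shapes())):
    print(name, shapes)
```

Under these assumptions the generator doubles the spatial size at every stride-2 layer until it reaches a 64 × 64 image with nc channels, and the discriminator halves it back down to a single value per image before the Sigmoid.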

The images were used directly, with no pre-processing other than normalization with value 0.5 to make the data mean-centred. A batch size of 64 was used along with Adam [31] as an optimizer to speed up the training. All weights were initialized from a zero-centred Normal distribution with standard deviation equal to 1. We also used LeakyReLU as mentioned in the original DCGAN paper [52]. The slope of the leak in LeakyReLU was set to 0.2 in all models. We used a learning rate of 0.0001, whereas the paper used 0.0002. Additionally, we found that leaving the momentum term β at the suggested value of 0.9 resulted in training oscillation and instability, while reducing it to 0.5 helped stabilize training (also taken from the original paper [52]).

The hyperparameters used in the experiments are detailed in Appendix A. The fake samples produced by the generator after 100 epochs of training are shown in Figure 14. For comparison, real samples are shown in Figure 15. The loss of the Bayesian network is higher than that of the DCGAN architecture originally described by the authors. However, the loss alone does not allow a meaningful comparison between the two networks, since the quality of a GAN is difficult to judge from the loss value. Visually, the samples from the Bayesian model are comparable to those of the original DCGAN architecture.

Figure 14: Fake samples generated from the Bayesian DCGAN model trained on the CIFAR10 dataset.

Figure 15: Real samples taken from the CIFAR10 dataset.

We propose Bayesian CNNs utilizing Bayes by Backprop as a reliable, variational inference method for CNNs which has not been studied to date, and estimate the models' aleatoric and epistemic uncertainties for prediction. Furthermore, we apply different ways of pruning the Bayesian CNN and compare its results with frequentist architectures.

There has been previous work by Gal and Ghahramani [13], who utilized the various outputs of a Dropout function to define a distribution, and concluded that one can then speak of a Bayesian CNN. This approach finds, perhaps also due to its ease, a large confirming audience. However, we argue against this approach and claim deficiencies. Specifically, in Gal and Ghahramani's [13] approach, no prior probability distributions p(w) are placed on the CNN's parameters. But these are a substantial part of a Bayesian interpretation, for the simple reason that Bayes' theorem includes them. Thus we argue that starting with prior probability distributions p(w) is essential in Bayesian methods. In comparison, we place prior probability distributions over all model parameters and update them according to Bayes' theorem with variational inference, precisely Bayes by Backprop. We show that these neural networks achieve state-of-the-art results on par with those achieved by the same network architectures trained by frequentist inference.

Furthermore, we examine how uncertainties (both aleatoric and epistemic) can be computed for our proposed method, and we show how epistemic uncertainties can be reduced with more training data. We also compare the effect of dropout in a frequentist network to the proposed Bayesian CNN and show the natural regularization effect of Bayesian methods. To counter the doubled number of parameters (mean and variance) in a Bayesian CNN compared to a single point-estimate weight in a frequentist method, we apply methods of network pruning and show that the Bayesian CNN performs equally well or better even when the network is pruned and the number of parameters is made comparable to a frequentist method.

Finally, we show the applications of Bayesian CNNs in various domains like Image Recognition, Image Super-Resolution and Generative Adversarial Networks (GANs). The results are compared with other popular approaches in the field. Bayesian CNNs, in general, proved to be a good idea to apply to GANs, as prior knowledge for the discriminator network helps in better identification of real vs fake images.

As an add-on method to further enhance the stability of the optimization, posterior sharpening [11] could be applied to Bayesian CNNs in future work. There, the variational posterior distribution q_θ(w|D) is conditioned on the training data of a batch D^(i). We can see q_θ(w|D^(i)) as a proposal distribution, or hyper-prior when we rethink it as a hierarchical model, to improve the gradient estimates of the intractable likelihood function p(D|w). For the initialization of the mean and variance, a zero mean and a standard deviation of one were used, as the standard normal distribution seems to be the most intuitive distribution to start with. However, across the experiments in this thesis, a zero-centred mean with a very small standard deviation performed equally well while training faster. Xavier initialization [15] converges faster in a frequentist network compared to a normal initialization, and a similar distribution space needs to be explored with Bayesian networks for initializing the distribution. Other properties like periodicity or spatial invariance are also captured by the priors in data space, and based on these properties an alternative to Gaussian process priors can be found.

Using a normal distribution as prior for uncertainty estimation was also explored by Hafner et al. [22], who observed that a standard normal prior causes the function posterior to generalize in unforeseen ways on inputs outside of the training distribution. Adding some noise to the normal prior can help the model estimate uncertainty better. No such cases were found in our experiments, but this can be an interesting area to explore in future.

The network is pruned with simple methods like the L1 norm, and more compression tricks like vector quantization [17] and group sparsity regularization [1] can be applied. In our work, we show that reducing the number of model parameters results in better generalization of the Bayesian architecture and even leads to an improvement in the overall model accuracy on the test dataset. Further analysis of the model yielded no concrete explanation for this change in behaviour. A more detailed analysis, visualizing the pattern learned by each neuron, grouping similar neurons together and removing the redundant neurons that learn similar behaviour, is a good way to prune the model.

The concept of a Bayesian CNN is applied to the discriminative network of a GAN in our work and it has shown good initial results. However, the area of Bayesian generative networks in a GAN is still to be investigated.
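The L1-norm pruning mentioned above can be illustrated with a small sketch. This section does not specify the exact criterion used (per filter or per weight, or how the mean and variance parameters are treated), so the version below is an assumption: it ranks whole filters by the L1 norm of their weights and keeps a fixed fraction, with 0.5 chosen to mirror the goal of halving the parameter count. The function names and the toy filters are hypothetical.

```python
def l1_norm(filter_weights):
    """Sum of absolute values of one filter's (flattened) weights."""
    return sum(abs(w) for w in filter_weights)


def prune_filters_l1(filters, keep_ratio=0.5):
    """Keep the filters with the largest L1 norm.

    filters: list of flattened filters, each a list of floats.
    Returns (kept_indices, kept_filters), preserving the original order.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(filters) * keep_ratio))
    # Rank filter indices from largest to smallest L1 norm
    ranked = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]), reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [filters[i] for i in kept]


# Toy example with four two-weight filters
toy_filters = [[0.1, -0.2], [1.0, -1.0], [0.0, 0.05], [-0.7, 0.4]]
kept_indices, kept_filters = prune_filters_l1(toy_filters, keep_ratio=0.5)
print(kept_indices)
```

Filters with a small L1 norm contribute little to the layer output, which is the usual motivation for removing them first; whether this matches the grouping of redundant neurons proposed above would need the visual analysis the text calls for.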

References

[1] Jose M. Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In NIPS, pages 2262–2270, 2016.
[2] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. CoRR, abs/1512.08571, 2015.
[3] David Barber and Christopher M Bishop. Ensemble learning in bayesian neural networks. NATO ASI Series F Computer and Systems Sciences, 168:215–238, 1998.
[4] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[5] Wray L Buntine and Andreas S Weigend. Bayesian back-propagation. Complex Systems, 5(6):603–643, 1991.
[6] Soravit Changpinyo, Mark Sandler, and Andrey Zhmoginov. The power of sparsity in convolutional neural networks. CoRR, abs/1702.06257, 2017.
[7] John S Denker and Yann LeCun. Transforming neural-net output levels to probability distributions. In Advances in neural information processing systems, pages 853–859, 1991.
[8] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
[9] Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? does it matter? Structural Safety, 31:105–112, 03 2009.
[10] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back propagating errors. Nature, 323:533–536, 10 1986.
[11] Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
[12] Karl Friston, Jérémie Mattout, Nelson Trujillo-Barreto, John Ashburner, and Will Penny. Variational free energy and the laplace approximation. Neuroimage, 34(1):220–234, 2007.
[13] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
[14] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Insights and applications. In Deep Learning Workshop, ICML, 2015.
[15] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
[16] Gluon MXnet. chapter18 variational-methods-and-uncertainty. https://gluon.mxnet.io/chapter18_variational-methods-and-uncertainty/bayes-by-backprop.html, 2017. Online.
[17] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional networks using vector quantization. CoRR, abs/1412.6115, 2014.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[19] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
[20] Alex Graves. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
[21] B. K. Gunturk, A. U. Batur, Y. Altunbasak, M. H. Hayes, and R. M. Mersereau. Eigenface-domain super-resolution for face recognition. IEEE Transactions on Image Processing, 12(5):597–606, May 2003.
[22] Danijar Hafner, Dustin Tran, Alex Irpan, Timothy Lillicrap, and James Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv preprint arXiv:1807.09289, 2018.
[23] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[25] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[26] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM, 1993.
[27] Sepp Hochreiter and Jürgen Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in neural information processing systems, pages 529–536, 1995.
[28] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Curiosity-driven exploration in deep reinforcement learning via bayesian neural networks. arXiv preprint arXiv:1605.09674, 2016.
[29] Andrej Karpathy. Neural Networks 1. http://cs231n.github.io/neural-networks-1/, 2016. Online.
[30] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
[31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[32] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
[33] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[35] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
[36] Yongchan Kwon, Joong-Ho Won, Beom Joon Kim, and Myunghee Cho Paik. Uncertainty quantification using bayesian neural networks in classification: Application to ischemic stroke lesion segmentation. 2018.
[37] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[38] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. CoRR, abs/1412.6553, 2014.
[39] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[40] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
[41] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
[42] Zachary C Lipton, Jianfeng Gao, Lihong Li, Xiujun Li, Faisal Ahmed, and Li Deng. Efficient exploration for dialogue policy learning with bbq networks & replay buffer spiking. arXiv preprint arXiv:1608.05081, 2016.
[43] David J C Mackay. A practical bayesian framework for backprop networks. 1991.
[44] David JC MacKay. Probable networks and plausible predictions - a review of practical bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
[45] David JC MacKay. Hyperparameters: optimize, or integrate out? In Maximum entropy and bayesian methods, pages 43–59. Springer, 1996.
[46] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
[47] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
[48] Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119, 2017.
[49] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
[50] Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
[51] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variance networks: When expectation does not meet your expectations. arXiv preprint arXiv:1803.03764, 2018.
[52] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
[53] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. CoRR, abs/1609.05158, 2016.
[54] Wenzhe Shi, Jose Caballero, Christian Ledig, Xiahai Zhuang, Wenjia Bai, Kanwal Bhatia, Antonio M. Simoes Monteiro de Marvao, Tim Dawes, Declan O'Regan, and Daniel Rueckert. Cardiac image super-resolution with global correspondence using multi-atlas patchmatch. In Kensaku Mori, Ichiro Sakuma, Yoshinobu Sato, Christian Barillot, and Nassir Navab, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. Springer Berlin Heidelberg, 2013.
[55] Kumar Shridhar, Felix Laumann, Adrian Llopart Maurin, Martin Olsen, and Marcus Liwicki. Bayesian convolutional neural networks with variational inference. arXiv preprint arXiv:1806.05978, 2018.
[56] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[57] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 963–971. Curran Associates, Inc., 2014.
[58] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[59] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[60] Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11), November 2008.
[61] Sida Wang and Christopher Manning. Fast dropout training. In International conference on machine learning, pages 118–126, 2013.
[62] Chih-Yuan Yang, Chao Ma, and Ming-Hsuan Yang. Single-image super-resolution: A benchmark. In ECCV, 2014.
[63] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Constructing free-energy approximations and generalized belief propagation algorithms.
IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.

variable                                             value
learning rate                                        0.001
epochs                                               100
batch size                                           256
sample size                                          10-25
loss                                                 cross-entropy
(αµ) init of approximate posterior q_θ(w|D)          -10
optimizer                                            Adam [31]
λ in ℓ-2 normalisation                               0.0005
β_i                                                  2^(M−i) / (2^M − 1) [4]

Sample size can vary from 10 to 25, as this range provided the best results, although other values can be tried. For most of our experiments, it is either 10 or 25 unless specified otherwise.

variable                                             value
learning rate                                        0.01
epochs                                               200
batch size                                           64
upscale factor                                       3
loss                                                 Mean Squared Error
seed                                                 123
(αµ) init of approximate posterior q_θ(w|D)          -10
optimizer                                            Adam [31]
λ in ℓ-2 normalisation                               0.0005
β_i                                                  2^(M−i) / (2^M − 1) [4]

variable                                             value
learning rate                                        0.001
epochs                                               100
batch size                                           64
image size                                           64
latent vector (nz)                                   100
number of generator factor (ngf)                     64
number of discriminator factor (ndf)                 64
upscale factor                                       3
loss                                                 Mean Squared Error
number of channels (nc)                              3
(αµ) init of approximate posterior q_θ(w|D)          -10
optimizer                                            Adam [31]
λ in ℓ-2 normalisation                               0.0005
β_i                                                  2^(M−i) / (2^M − 1) [4]

Non Bayesian Settings

variable                                             value
learning rate                                        0.001
epochs                                               100
batch size                                           256
loss                                                 cross-entropy
initializer                                          Xavier [15] or Normal
optimizer                                            Adam [31]

The weights were initialized with Xavier initialization [15] at first, but to make this consistent with the Bayesian networks, where Normal initialization (mean = 0 and variance = 1) was used, the initializer was changed to Normal initialization.
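The β entry in the Bayesian settings above cites [4], where the KL term of minibatch i out of M is weighted by β_i = 2^(M−i) / (2^M − 1). Assuming that is the scheme intended here, the sketch below computes the weights and shows that they form a decreasing sequence summing to one, so early minibatches are dominated by the KL (prior) term and later ones by the data. The function name is hypothetical.

```python
def kl_weights(num_batches):
    """Per-minibatch KL weights beta_i = 2^(M - i) / (2^M - 1), i = 1..M."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    m = num_batches
    denominator = 2 ** m - 1
    return [2 ** (m - i) / denominator for i in range(1, m + 1)]


weights = kl_weights(4)
print(weights)       # weights for M = 4 minibatches, largest first
print(sum(weights))  # the weights sum to 1 up to floating-point rounding
```

In a training loop under this assumption, the loss for minibatch i would be β_i times the KL divergence plus the negative log-likelihood of that minibatch.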

Architectures

layer type             width  stride  padding  input shape        nonlinearity
convolution (5 × 5)    6      1       0        M × … × 32 × 32    Softplus
max-pooling (2 × 2)           2       0        M × 6 × 28 × 28
convolution (5 × 5)    16     1       0        M × 6 × 14 × 14    Softplus
max-pooling (2 × 2)           2       0        M × 16 × 10 × 10
fully-connected        120                     M × 400            Softplus
fully-connected        84                      M × 120            Softplus
fully-connected        10                      M × 84

layer type             width  stride  padding  input shape        nonlinearity
convolution (11 × 11)  64     4       5        M × … × … × 32     Softplus
max-pooling (2 × 2)           2       0        M × … × … × …
convolution (5 × 5)    192    1       2        M × … × … × 15     Softplus
max-pooling (2 × 2)           2       0        M × … × … × …
convolution (3 × 3)    384    1       1        M × … × … × …      …
convolution (3 × 3)    256    1       1        M × … × … × …      …
convolution (3 × 3)    128    1       1        M × … × … × …      …
max-pooling (2 × 2)           2       0        M × … × … × …
fully-connected        …                       M × …

layer type             width  stride  padding  input shape        nonlinearity
convolution (11 × 11)  32     4       5        M × … × … × 32     Softplus
max-pooling (2 × 2)           2       0        M × … × … × …
convolution (5 × 5)    96     1       2        M × … × … × 15     Softplus
max-pooling (2 × 2)           2       0        M × … × … × …
convolution (3 × 3)    192    1       1        M × … × … × …      …
convolution (3 × 3)    128    1       1        M × … × … × …      …
convolution (3 × 3)    64     1       1        M × … × … × …      …
max-pooling (2 × 2)           2       0        M × … × … × …
fully-connected        …                       M × …

10 How to replicate results

Install PyTorch from the official website (https://pytorch.org/)
git clone https://github.com/kumar-shridhar/PyTorch-BayesianCNN
pip install -r requirements.txt
cd into the respective folder/task to replicate (Image Recognition, Super Resolution or GAN)
python main_Bayesian.py to replicate the Bayesian CNN results.
python main_nonBayesian.py to replicate the frequentist CNN results.

For more details, read the README sections of the repo :