On the Effects of Quantisation on Model Uncertainty in Bayesian Neural Networks
Martin Ferianc, Partha Maji, Matthew Mattina, Miguel Rodrigues
University College London, London, UK; ARM ML Research Lab, Cambridge, UK; ARM ML Research Lab, Boston, USA
Abstract
Bayesian neural networks (BNNs) are making significant progress in many research areas where decision making needs to be accompanied by uncertainty estimation. Being able to quantify uncertainty while making decisions is essential for understanding when the model is over-/under-confident, and hence BNNs are attracting interest in safety-critical applications, such as autonomous driving, healthcare and robotics. Nevertheless, BNNs have not been as widely used in industrial practice, mainly because of their increased memory and compute costs. In this work, we investigate quantisation of BNNs by compressing 32-bit floating-point weights and activations to their integer counterparts, a technique that has already been successful in reducing the compute demand in standard pointwise neural networks. We study three types of quantised BNNs, we evaluate them under a wide range of different settings, and we empirically demonstrate that a uniform quantisation scheme applied to BNNs does not substantially decrease their quality of uncertainty estimation.
1 INTRODUCTION

Bayesian neural networks (BNNs) can describe complex stochastic patterns by treating their weights as learnable random variables that provide well-calibrated uncertainty estimates (Neal, 1993; Ghahramani, 2015; Blundell et al., 2015; Gal and Ghahramani, 2015; Chen et al., 2014). In addition to modelling uncertainty, treating a neural network (NN) through Bayesian inference makes it more robust to over-fitting, thereby offering the means to leverage small data pools (Ghahramani, 2015).

BNNs have become relevant in practical applications where the quantification of uncertainty is essential, such as in medicine (Liang et al., 2018), autonomous driving (McAllister et al., 2017) or risk assessment (MacKay, 1995). Nevertheless, Bayesian models come with a prohibitive computational cost during evaluation (Gal and Ghahramani, 2015; Blundell et al., 2015). At evaluation time it is analytically intractable to compute the posterior prediction. Hence, most methods approximate the posterior through Monte Carlo (MC) sampling (Gal and Ghahramani, 2015; Blundell et al., 2015; Chen et al., 2014), which depends on multiple feed-forward runs through the BNN and, optionally, random number generation.

In contrast to pointwise NNs, which are increasingly used for applications on the edge, the computational cost associated with BNNs currently prevents their use on resource-constrained platforms. These platforms exhibit smaller memory and lower compute capabilities involving 8-bit integer arithmetic. Quantisation has been widely used in pointwise NNs (Jacob et al., 2018; Choukroun et al., 2019; Krishnamoorthi, 2018) to lower their compute demand and make them more compatible with edge devices. In quantisation, a floating-point representation is reduced to an integer representation, which enables substantial resource savings in practical applications. By quantising weights and activations of pointwise NNs to 8-bit integers, it is possible to achieve up to 4× improvements in latency with a quarter of the original memory footprint of the baseline 32-bit floating-point implementation (Jacob et al., 2018). Nevertheless, there has not been a comprehensive study into whether BNNs could attain the same hardware benefits under quantisation and whether it impacts their predictive accuracy or uncertainty.

In this work, we study quantisation of BNNs based on three widely adopted Bayesian inference schemes: Monte Carlo Dropout (Gal and Ghahramani, 2015), Bayes-By-Backprop (Blundell et al., 2015) and Stochastic Gradient Langevin Dynamics with Hamiltonian Monte Carlo (Chen et al., 2014). Furthermore, we investigate the effect of quantisation of both weights and activations of BNNs using different integer representations through quantisation-aware training. Our main contributions are two-fold: (1) a methodology for uniform quantisation of three different types of Bayesian inference; (2) an empirical demonstration that lowering the arithmetic precision of weights and activations from 32-bit floating-point to a low-bit integer representation does not substantially decrease the quality of uncertainty estimation. The code is available at https://git.io/JtSJG.

In the sequel, we describe the methodology in detail. In Section 2 we review preliminaries and the related work. In Section 3 we introduce the methodology for quantised Bayesian neural networks. Then in Section 4 we demonstrate the performance of quantised BNNs on experiments and, lastly, in Section 5 we summarise the key takeaways of this work.

2 PRELIMINARIES

In this Section we review Bayesian learning, quantisation of neural networks and related work.
2.1 BAYESIAN NEURAL NETWORKS

The aim of Bayesian inference is to learn the distribution over the weights w of the BNN with respect to some training dataset of tuples D = {(x_n, y_n)}_{n=1}^{N}, where x_n are the inputs and y_n are the associated targets. Given the belief about the noise in the data in the shape of the likelihood p(y|x, w) and the prior distribution over weights p(w), they come together under the Bayes rule as p(w|x, y) = p(y|x, w) p(w) / p(y|x). Nevertheless, due to the high dimensionality of a BNN it is intractable to compute the posterior p(w|x, y) and it needs to be approximated by q(w|θ, x, y) with some learnable parameters θ. The resultant distribution q(·) can then be used to make predictions for previously unseen data x*, y* through the integral p(y*|x*) = ∫ p(y*|x*, w) q(w|θ, x, y) dw. This integral is again intractable due to the posterior and it needs to be approximated through MC sampling with L samples as p(y*|x*) ≈ (1/L) Σ_{l=1}^{L} p(y*|x*, w_l), with w_l ∼ q(w|θ, x, y).

The sampling procedure requires efficient processing to reduce the compute cost of the L forward passes through the BNN. In this work, we approach this challenge by investigating quantisation applied to BNNs' weights and activations in order to enable their efficient processing.

2.2 QUANTISATION OF NEURAL NETWORKS

Reduction in bit-width precision (Jacob et al., 2018; Krishnamoorthi, 2018; Choukroun et al., 2019) has demonstrated significant benefits in lowering the resource consumption of pointwise NNs in hardware. In quantisation, 32-bit floating-point representations of weights and, optionally, activations are reduced to an integer, usually 8-bit, representation, which enables substantial savings in memory and compute resources in real-world applications. This helps to reduce energy consumption and improve inference speed. If the quantisation happens after training, it is called post-training quantisation. If it happens with an additional training phase with fewer iterations and a much smaller learning rate after the main portion of the inference, it is called quantisation-aware training (QAT) (Jacob et al., 2018). By using QAT, practitioners have observed a smaller accuracy drop in the quantised model (Jacob et al., 2018), compared to post-training quantisation. The support of only integer arithmetic in hardware has two main outcomes: (1) a decrease in the size of the required memory and in the complexity of the hardware needed to perform the computation; (2) a decrease in latency due to the simplicity of integer computation in comparison to floating-point (Cai et al., 2018). These benefits present a strong case for investigating quantisation of BNNs.
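To make the sampling cost that quantisation aims to reduce concrete, the following minimal NumPy sketch implements the L-sample MC approximation of p(y*|x*) from Section 2.1. The function `stochastic_forward` is a hypothetical stand-in for a single sampled forward pass of any of the considered BNNs; it is an assumption of this sketch and not part of the released code.

```python
import numpy as np

def mc_predict(stochastic_forward, x, num_samples=20):
    """Approximate p(y*|x*) by averaging L stochastic forward passes.

    `stochastic_forward(x)` is assumed to draw one weight sample
    w_l ~ q(w | theta, x, y) internally and return per-class
    probabilities of shape (batch, classes).
    """
    draws = np.stack([stochastic_forward(x) for _ in range(num_samples)])
    mean = draws.mean(axis=0)                 # predictive mean over L samples
    var = ((draws - mean) ** 2).mean(axis=0)  # predictive variance over L samples
    return mean, var
```

Every one of the L calls is a full forward pass, which is exactly the compute budget that integer arithmetic is meant to shrink.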
2.3 RELATED WORK

Only recently have works outside the realm of pointwise NNs interconnected Bayesian thinking with quantisation (Su et al., 2019; Achterhold et al., 2018; Cai et al., 2018; van Baalen et al., 2020).

Achterhold et al. (2018) developed a sophisticated method for quantisation and pruning of pointwise NNs, albeit by using Bayesian inference. They initially train a BNN with improper priors, constructed to be quantisation- and pruning-friendly, and after training convert it to a quantised pointwise NN. Although the pointwise NNs can achieve a significant reduction in memory consumption, the resultant non-quantised BNNs are actually unable to estimate uncertainty, due to the improper priors (Hron et al., 2017). Similarly, van Baalen et al. (2020) used Bayesian inference to obtain sparse quantised pointwise NNs. In VIBNN, Cai et al. (2018) developed an efficient hardware accelerator for feed-forward BNNs trained through the Bayes-by-Backprop (Blundell et al., 2015) algorithm. The authors demonstrated impressive compute resource savings, but they did not detail their quantisation scheme or its impact on the uncertainty estimation capabilities of the BNN. Su et al. (2019) proposed a method for learning quantised BNNs directly, where the range of the found activations and weights is limited to two integer values. Their model demonstrated that uncertainty estimation can be preserved; however, their scheme would in practice require the development of a custom hardware accelerator. Nonetheless, custom hardware accelerators are rarely used in real-world settings, and low-resource general edge devices, such as smartphones, have been the most common hardware platform for running NNs.

In this paper we propose to learn quantised BNNs directly, as in (Su et al., 2019). However, in contrast to Su et al. (2019), we consider a range of widely used Bayesian inference methods, without the need for changes in the method. In detail, we focus on uniform quantisation, which is commonly supported in hardware (Krishnamoorthi, 2018).
3 QUANTISED BAYESIAN NEURAL NETWORKS

In this Section we describe quantised BNNs, by first discussing the theory behind quantisation, followed by its applicability to the respective Bayesian inference methods.
3.1 QUANTISATION

The most lightweight quantisation method is a uniform affine mapping of 32-bit floating-point values f to integers q (Jacob et al., 2018), as shown in (1):

f = S (q − Z)    (1)

where S and Z are the scale and the zero-point respectively, which are learnable parameters. S remains in floating-point representation and effectively represents a quantisation bin-width, whereas Z is an integer of the same bit-width n as q and represents the mapping of the value 0. The values of S and Z are affected by the target n, which restricts their range.

Assume initially a standard pointwise linear layer with floating-point weights f_w ∈ R^{M×F}, input f_i ∈ R^{I×M} and output f_o ∈ R^{I×F}, where M and F correspond to the input and output feature sizes for a batch consisting of I samples. The computation for their quantised counterparts q_w, q_i, q_o is obtained with respect to (1) as follows. The linear output without quantisation is computed as f_o = f_i f_w. Substituting each term with (1), we have S_o (q_o − Z_o) = S_w (q_w − Z_w) S_i (q_i − Z_i), which can be rewritten as in (2):

q_o = Z_o + (S_w S_i / S_o) (M Z_w Z_i − Z_i Σ q_w − Z_w Σ q_i + Σ q_w q_i)    (2)

The respective sums are performed first, for each column of q_w and each row of q_i, and broadcast to the resultant matrix dimension, similarly to the scalars S and Z. Note that the terms not involving q_i are independent of the input, which means they can be computed offline. Similarly, if the layer has a bias term, or it is followed by a batch normalisation (BN) (Ioffe and Szegedy, 2015), the BN affine parameters or the bias can be fused into the weights after the individual S and Z have been inferred (Krishnamoorthi, 2018). The same pattern can then be used to compute the output of more complicated operations, such as convolutions (Jacob et al., 2018). Note that the bit-width n does not need to be the same for weights and activations.

Figure 1: Fine-tuning of a standard pointwise Convolution/Linear layer with simulated quantisation (SQ). All computation is carried out using 32-bit floating-point arithmetic. SQ nodes are injected into the computation to simulate the effects of quantisation. After fine-tuning, the SQ modules are removed and the weights, with folded-in bias, and computation are quantised. In the quantised regime, floating-point data f are replaced by q.

The scale (S) and the zero-point (Z) parameters are learned by simulating quantisation through fine-tuning, resulting in quantisation-aware training (QAT). In this work we focus on QAT, which has been preferred to post-training quantisation since it has been shown to achieve higher accuracy, especially in smaller models (Jacob et al., 2018). In the next Section we introduce QAT-based methods applied to NNs with respect to Bayesian inference.
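As an illustration of equations (1) and (2), the following NumPy sketch quantises a toy linear layer and evaluates it with integer accumulations. The function names and the unsigned n-bit convention are our own assumptions, and the floating-point rescale by S_i S_w / S_o stands in for the fixed-point multiplier that a real integer-only kernel would use.

```python
import numpy as np

def quantise(f, n_bits, a=None, b=None):
    """Uniform affine quantisation f = S * (q - Z), eq. (1).

    a, b: clamping range (defaults to the observed min/max of f).
    Returns the integer codes q together with the (S, Z) pair.
    """
    a = np.min(f) if a is None else a
    b = np.max(f) if b is None else b
    levels = 2 ** n_bits - 1
    S = (b - a) / levels
    Z = int(round(-a / S))                    # f = a maps to q = 0
    q = np.clip(np.round(f / S) + Z, 0, levels).astype(np.int64)
    return q, S, Z

def int_linear(q_i, S_i, Z_i, q_w, S_w, Z_w, S_o, Z_o, n_bits):
    """Linear layer evaluated with integer accumulations, following eq. (2)."""
    M = q_w.shape[0]
    acc = (q_i.astype(np.int64) @ q_w.astype(np.int64)
           - Z_w * q_i.sum(axis=1, keepdims=True)   # row sums of q_i
           - Z_i * q_w.sum(axis=0, keepdims=True)   # column sums of q_w (offline)
           + M * Z_i * Z_w)                          # input-independent constant
    q_o = Z_o + (S_i * S_w / S_o) * acc              # rescale into the output grid
    return np.clip(np.round(q_o), 0, 2 ** n_bits - 1).astype(np.int64)

# toy check against the floating-point result
rng = np.random.default_rng(0)
f_i, f_w = rng.normal(size=(4, 16)), rng.normal(size=(16, 8))
f_o = f_i @ f_w
q_i, S_i, Z_i = quantise(f_i, 8)
q_w, S_w, Z_w = quantise(f_w, 8)
_, S_o, Z_o = quantise(f_o, 8)                       # output range observed offline
q_o = int_linear(q_i, S_i, Z_i, q_w, S_w, Z_w, S_o, Z_o, 8)
print(np.max(np.abs(S_o * (q_o - Z_o) - f_o)))       # error on the order of the quantisation step
```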
3.1.1 QUANTISATION-AWARE TRAINING

QAT is achieved by simulating quantisation effects in the forward pass of training, while backpropagation and all weights are represented in floating-point (Jacob et al., 2018). The simulation is achieved by implementing the rounding behaviour, which can be hardware-platform specific, while performing floating-point arithmetic, and then using a straight-through gradient estimator (Bengio et al., 2013) in the backward pass.

• Weights' quantisation is simulated prior to the weights being combined with the input, to avoid dynamic quantisation during runtime.
• Activation or operation output quantisation is simulated at the points where it would occur during inference: after the activation function is applied, or after the addition or concatenation of the outputs of several layers, as in ResNets (Jacob et al., 2018; He et al., 2016).

Concretely, in this work we adopt the element-wise quantisation function of Jacob et al. (2018) for all tensors individually, and we assume hardware fusion of the common ReLU activation, BN and bias into the operation, as done in practice (Krishnamoorthi, 2018). The quantisation and its simulation are parametrised by n, which is user specified, and a clamping range consisting of a minimum a = min f and a maximum b = max f for the given tensor. The individual a and b are observed on the training and validation datasets, for each activation output and weight. To observe the most efficient clamping range bounds a, b, it is necessary to record the minimum and maximum values of the respective tensors during training and then individually aggregate them via an exponential moving average, because of perturbations in outputs and weights due to QAT fine-tuning. The a, b, n continually map to the scale S = (b − a)/(2^n − 1) and the zero-point Z = round(−a/S), which are used for the simulation, and the end values are then used for the actual quantisation, following equation (1). The computational graph with respect to QAT is visually represented in Figure 1 and in pseudo-code in Algorithm 1. In the next Section we describe how this scheme can be used to obtain quantised BNNs.

Figure 2: Quantisation for Monte Carlo Dropout.
Algorithm 1: Quantisation-Aware Training
1. Inference of a floating-point model until convergence.
2. Insertion of simulated quantisation (SQ) modules after weights and operations' outputs.
3. Fine-tuning, simulating quantisation and recording the individual a, b per tensor in the computational graph.
4. Computation of the individual S and Z, quantisation of the weights and computation of the offline constants to prepare the model for integer-arithmetic evaluation.

In this work we develop schemes for performing quantisation-aware training (QAT) for Bayesian inference methods, for both their weights and activation outputs. Note that we propose to use QAT exclusively after Bayesian inference, and with minimal fine-tuning, such that the parameters learned through the Bayesian inference are not compromised. We illustrate the quantisation process with the help of a linear layer and the notation from Section 3.1. In general, it is only necessary to discuss the placement of the SQ nodes in the compute graphs and step 2 of Algorithm 1 for the respective Bayesian inference methods, following the rules introduced in the bullet points in the previous Section. The other steps are exactly the same as for the pointwise counterpart.

Figure 3: Quantisation for Bayes-by-Backprop.
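Before turning to the individual inference methods, a minimal PyTorch-style sketch of a single SQ node is given below, combining the straight-through estimator with the exponential-moving-average range tracking of Algorithm 1. It assumes an unsigned n-bit grid and is a simplification for illustration, not the authors' implementation.

```python
import torch

class SimulatedQuant(torch.nn.Module):
    """One SQ node: fake-quantise a tensor in the forward pass (Fig. 1, Alg. 1).

    The clamping range [a, b] is tracked with an exponential moving average
    during fine-tuning (step 3 of Algorithm 1); the straight-through estimator
    passes gradients through the rounding step unchanged.
    """
    def __init__(self, n_bits=8, momentum=0.9):
        super().__init__()
        self.levels = 2 ** n_bits - 1
        self.momentum = momentum
        self.register_buffer("a", torch.tensor(0.0))
        self.register_buffer("b", torch.tensor(1.0))

    def forward(self, f):
        if self.training:  # record the per-tensor range during fine-tuning
            self.a = self.momentum * self.a + (1 - self.momentum) * f.detach().min()
            self.b = self.momentum * self.b + (1 - self.momentum) * f.detach().max()
        S = (self.b - self.a) / self.levels
        Z = torch.round(-self.a / S)
        q = torch.clamp(torch.round(f / S) + Z, 0, self.levels)
        f_hat = S * (q - Z)                      # dequantised value, eq. (1)
        # straight-through estimator: forward uses f_hat, backward sees identity
        return f + (f_hat - f).detach()
```

After fine-tuning, the recorded (a, b) of every node yield the final S and Z, the node is removed and the corresponding tensor is quantised for integer-arithmetic evaluation.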
The quantisation of MCD is shown in Figure 2. We propose a methodology for quantisation of the standard MCD implementation (Gal and Ghahramani, 2015), which corresponds to applying a Bernoulli mask of zeros and ones K ∈ R^{I×M} to an input with M features and I samples for each weight-bearing layer, except the input layer. Additionally, the masked input is rescaled with respect to the proportion of dropped inputs, such that f_x = (K ⊙ f_i)/(1 − p) and q_x = (K ⊙ q_i)/(1 − p), where p is the probability of sampling a zero and ⊙ is an element-wise multiplication. The values f_x, q_x thus replace f_i, q_i in equation (2). We add a separate SQ node to the multiplication of K with the input, since, due to the factor 1/(1 − p) and the zeroing-out of some inputs, the respective S and Z will change. When generating K, we absorb the 1/(1 − p) factor into the mask for efficient computation. Note that while performing QAT it is necessary to generate the mask in floating-point, whereas in the quantised mode K needs to take into account the Z of q_i. Weights are simply quantised according to equation (1) and by adding an SQ node, as discussed in Section 3.1.1.

We propose a QAT methodology for BBB as shown in Figure 3. In BBB (Blundell et al., 2015), the distribution over the weights is modelled explicitly such that f_w, q_w ∼ N(µ, σ), with a mean µ ∈ R^{M×F} and a variance σ ∈ R^{M×F} for each weight with respect to M input and F output features, where N represents a Gaussian. Nevertheless, to enable backpropagation and efficient computation of the weights, the weights are sampled with respect to a Gaussian ε ∼ N(0, 1), such that f_w = µ + φ(σ) ⊙ ε (Kingma and Welling, 2013). φ(·) constrains its output to be positive, e.g. a softplus. It is necessary to add SQ nodes and observe the statistics after each operation: the application of the positive element-wise φ(·), the addition and the multiplication, to obtain f_w and subsequently q_w. We simulate quantisation to compute the quantisation statistics for the means µ as well as the positive standard deviation φ(σ). We do this to avoid dynamic quantisation during run-time. The quantisation of the standard deviation is performed after φ(·), which, once quantised, eventually bypasses the non-linearity and reduces the numerical errors induced by the reduced representation.
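A sketch of how the SQ placement of Figure 3 could look for one BBB linear layer is given below. It reuses the hypothetical SimulatedQuant module sketched earlier, samples ε in floating-point as is done during QAT, and the initialisation constants are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn.functional as F

class QuantBBBLinear(torch.nn.Module):
    """Bayes-by-Backprop linear layer with SQ nodes placed as in Figure 3.

    Every intermediate tensor that exists at inference time (mu, phi(sigma),
    the scaled noise and the sampled weight) gets its own SimulatedQuant node,
    so no tensor has to be quantised dynamically at run time.
    """
    def __init__(self, in_features, out_features, n_bits=8):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.randn(in_features, out_features) * 0.05)
        self.rho = torch.nn.Parameter(torch.full((in_features, out_features), -5.0))
        self.sq_mu = SimulatedQuant(n_bits)     # SQ on the means
        self.sq_sigma = SimulatedQuant(n_bits)  # SQ after the positive softplus
        self.sq_noise = SimulatedQuant(n_bits)  # SQ after the element-wise product
        self.sq_w = SimulatedQuant(n_bits)      # SQ on the sampled weight
        self.sq_out = SimulatedQuant(n_bits)    # SQ on the layer output

    def forward(self, x):
        eps = torch.randn_like(self.mu)         # eps stays floating-point during QAT
        sigma = self.sq_sigma(F.softplus(self.rho))
        w = self.sq_w(self.sq_mu(self.mu) + self.sq_noise(sigma * eps))
        return self.sq_out(x @ w)
```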
Figure 4: Changing activation precision, fixing weight precision. Regression results with respect to root-mean-squared error (RMSE) (a) and negative log-likelihood (NLL) (b) on UCI datasets. Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width. Pointwise and SGHMC collapse at the lowest activation bit-widths.

Figure 5: Fixing activation precision, changing weight precision. Regression results with respect to root-mean-squared error (RMSE) (a) and negative log-likelihood (NLL) (b) on UCI datasets. Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width. SGHMC collapses at the lowest weight bit-widths.
Note that, depending on the regime, it is necessary to generate ε in floating-point or in quantised form. We found a quantised ε with a fixed scale S_ε and zero-point Z_ε = 0 to perform well across different n and experiments. Due to the proposed scheme, no changes are necessary with respect to equation (2). We avoid the computation of the gradients with respect to the ELBO's regulariser (Blundell et al., 2015) during QAT.

In comparison to the previous two methods, in SGHMC (Chen et al., 2014) there is no sampling of random variables during evaluation. Chen et al. (2014), following (Welling and Teh, 2011), demonstrated that by adding the right amount of Gaussian noise to a standard stochastic gradient optimisation algorithm, it is possible to collect weights w_l, l = 1, ..., L, from several distinct optimisation steps, which can then be used to approximate samples from the true posterior distribution over the BNN weights. Therefore, we propose to quantise each of the weight samples separately through QAT, similarly to the pointwise approach, as shown in Figure 1. The SQ nodes are applied to each set of weight samples l as well as the corresponding outputs. Thus, we propose to fine-tune each pre-trained network sample l separately.

4 EXPERIMENTS

In this Section we present the tasks, datasets and their corresponding NN architectures, metrics and the implementation, followed by the observations.

We consider two classes of problems: regression and classification. We evaluate the networks on sample datasets of tuples D. For regression the target y_n is assumed to be real-valued, y_n ∈ R, while for classification the target y_n is a one-hot encoding of k = 1, ..., K classes such that y_n ∈ R^K. Given the input features x_n, we use a BNN to model the probabilistic predictive distribution p_w(y_n | x_n) over the targets with respect to some model defined by weights w, where the mean and the variance are approximated with respect to L samples as µ_w(x_n) = (1/L) Σ_{l=1}^{L} p_{w_l}(y_n | x_n) and σ²_w(x_n) = (1/L) Σ_{l=1}^{L} (p_{w_l}(y_n | x_n) − µ_w(x_n))².

For regression we consider UCI datasets (housing, concrete, energy, power, wine, yacht), whereas for classification we consider the MNIST digit and CIFAR-10 image datasets. We used a mixture of real data to control the complexity of the experiments and observe whether it affects the uncertainty estimation quality in a quantised regime.
Figure 6: Changing activation precision, fixing weight precision. MNIST results with respect to classification error on test data (a) and average predictive entropy (aPE) on FashionMNIST (b). Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width. MCD collapses at the lowest activation bit-widths.

Figure 7: Fixing activation precision, changing weight precision. MNIST results with respect to classification error on test data (a) and average predictive entropy (aPE) on FashionMNIST (b). Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width.
For the regression problem we consider an architecture with an input layer followed by 3 hidden layers with 100 nodes, each followed by a ReLU activation. For MNIST we implement the common LeNet-5 (LeCun et al., 1998), while for CIFAR-10 we implement ResNet-18 (He et al., 2016) with BN and skip-connections enabled. Similarly to the datasets, we chose NN architectures of increasing complexity to explore how the uncertainty estimation is impacted by trailing quantisation errors coming from a reduced precision and deeper architectures. We considered image augmentations (rotation, brightness and horizontal shift) and confusion datasets (FashionMNIST for MNIST and SVHN for CIFAR-10) to measure the level of uncertainty on distant or shifted datasets. The hyperparameters for all experiments were hand-tuned with reference to the validation error.

From the quantisation point of view, we focus on quantisation of both weights and activations to improve on-device storage as well as computational efficiency. We considered a range of bit-widths n for the weights (W) and for the activations (A) for all the proposed methods (MCD, BBB, SGHMC) and a standard pointwise implementation. We considered 1 bit lower precision for activations than for weights to avoid instruction overflow on our system. All experiments were repeated 3 times and we set L = 20 for all methods. The code is available at https://git.io/JtSJG.

The results for regression for the respective methods under quantisation are presented in Figures 4 (a,b) and 5 (a,b). We measured the root-mean-squared error (RMSE) and the negative log-likelihood (NLL). Every box-plot is with respect to the UCI datasets and the means of 10-fold cross-validation performed with independent models. First, examining the results for changing activation precision in Figure 4 (a), it can be seen from the RMSE that the Bayesian methods are more robust towards quantisation and are able to maintain their accuracy, while the pointwise NN tends to lose its generalisability the quickest, even though it was initially marginally the most accurate. At the same time, the Bayesian inference methods are able to maintain their uncertainty estimation capabilities, which can be seen in the NLL plots in Figures 4, 5 (b). Second, the results plotted in Figure 5 for changing weight precision while keeping the activation precision fixed further solidify the previous observations. Nevertheless, the rate of change of the error with respect to quantisation of weights is slower in comparison to changing the activation precision. However, in both plots we notice that SGHMC is more affected by quantisation, especially weight quantisation. The weights' distributions for SGHMC across the different layers are more spread than those of the other two methods, and uniform quantisation with such a low precision for either weights or activations is unable to capture them.
Figure 8: Changing activation precision, fixing weight precision. CIFAR-10 results with respect to classification error on test data (a) and average predictive entropy (aPE) on SVHN (b). Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width. All methods collapse at the lowest activation bit-widths.

Figure 9: Fixing activation precision, changing weight precision. CIFAR-10 results with respect to classification error on test data (a) and average predictive entropy (aPE) on SVHN (b). Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width. All methods collapse at the lowest weight bit-widths.
In this Section we present the main results with respect to the evaluation on the MNIST and CIFAR-10 datasets. We focused on measuring the classification error, the expected calibration error (ECE) (Guo et al., 2017) with respect to 10 bins, and the average predictive entropy (aPE).
Further results with respect to other metrics can be seen in the appendix.
The results for the MNIST evaluation with respect to quantised BNNs are presented in Figures 6 (a,b) and 7 (a,b). In general, the results follow the same trends as demonstrated in the regression results. Nevertheless, as seen in the classification error for changing activation precision in Figure 6 (a), the respective methods are more sensitive towards changing activation precision than weight precision, in comparison to the results in Figure 7 (a), in particular for MCD. The scaling factor that is applied during MCD (1/(1 − p)) distorts the activation distribution and results in a collapse of MCD if the bit-width for the activations is too small. However, the error of the BNNs increases marginally slower than for the pointwise NN. Nevertheless, as the error increases, the predictive entropy increases as well, which can be seen in both Figures 6 (b) and 7 (b), and as a result their ECE also decreases. This means that quantisation actually has a regularising effect: with reduced precision for weights or activations, the representational capability of the NNs is limited and their confidence decreases. Interestingly, for the collapsed MCD this results in complete, and rightful, uncertainty on the confusion dataset as well as the test set, as seen in Figure 6 (b). These results translate also to measuring aPE and ECE on the test data, except for the pointwise control.

In Figures 10 (a,b) we detail results with respect to the augmentations and 7-bit quantisation of the activations and 8-bit quantisation of the weights. It can be seen that the Bayesian inference methods remain robust towards domain shift even under quantisation, and they record marginally smaller ECE and error than the pointwise control.

The results for CIFAR-10 with respect to quantised BNNs are presented in Figures 8 (a,b) and 9 (a,b). In this experiment the differences between the Bayesian methods and the pointwise control are widened. Similarly to the previous experiments, the quantised BNNs are more susceptible to activation quantisation than to weight quantisation, when comparing the results in Figures 8 (a) and 9 (a). Moreover, the quantised nets collapse earlier for activations, given the more complex ResNet architecture. Nevertheless, as seen from Figures 8 (b) and 9 (b), in no instance is the uncertainty-related capability of any BNN method damaged by quantisation, as the trends are clearly upwards in terms of the predictive entropy on the confusion dataset. However, as seen in Figure 9 (b), it is the pointwise model in particular which completely overfits the training dataset, and quantisation has a negative effect on its predictive entropy. Similarly to the previous experiments, as the error increases, the predictive entropy increases for the BNNs, which can be seen in both Figures 8 (b) and 9 (b), and as a result their ECE also decreases.

Next, considering the domain shift as demonstrated in Figures 10 (c,d), it can be observed that while the error in Figure 10 (d) increases, the error of the Bayesian methods increases at the same rate as in the pointwise approach. However, when further examining Figure 10 (c), it can be seen that the ECE increases by far less in comparison to the pointwise approach, which makes BNNs, even under quantisation, more robust towards domain shift.
Figure 10: Expected calibration error (ECE) and classification error with respect to 7-bit activations and 8-bit weights and three augmentations applied to LeNet-5 on the MNIST test set (a,b) and ResNet-18 on the CIFAR-10 test set (c,d). Augmentations were: Brightness [1.5-3.5], Rotation [15°-75°] and Horizontal shift [0.1-0.5 of image size].
5 CONCLUSION

In this work we proposed and evaluated a practical quantisation methodology for a variety of Bayesian inference methods applied to neural networks. In this Section we discuss the key takeaways of our empirical observations.

• A uniform quantisation scheme is viable for the quantisation of Bayesian neural networks unless pushed to the extremes of very low bit-widths for activations or weights. For the most commonly utilised 8-bit weight and activation quantisation scheme used in hardware, we did not observe any significant degradation in accuracy or in the quality of uncertainty estimation in Bayesian nets in comparison to their floating-point representation.
• The quality of predictive uncertainty of Bayesian networks stays unaffected or increases as a result of quantisation. The networks stay certain on the in-domain test data and become more uncertain on confusion or domain-shifted data.
• The prediction error increases at a slower rate in Bayesian neural networks, as their representation is reduced in the number of bits through quantisation, than in pointwise networks, unless considering the extremes.
• Activation quantisation seemed to affect all the net types more than weight quantisation in terms of accuracy, predictive entropy and calibration. SGHMC was the most sensitive to weight quantisation; MCD was the most sensitive to activation quantisation.
• In MCD the random binary masks (K) could be quantised to 1 bit, whereas in BBB all parameters (µ, σ and ε) need to be quantised with the same number of bits as the weights to maintain model accuracy.
• Experiments on different datasets and tasks suggest that Bayesian nets are relatively immune to quantisation. However, complex architectures (ResNet) seem to be more affected by quantisation than simpler architectures (LeNet-5) regarding their performance.

In future work we are going to investigate more complex non-mean-field approximations for the respective Bayesian inference methods and more expressive quantisation schemes for the lowest-precision regimes.

REFERENCES
Achterhold, J., Koehler, J. M., Schmeink, A., and Genewein, T. (2018). Variational network quantization. In International Conference on Learning Representations.

Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.

Cai, R., Ren, A., Liu, N., Ding, C., Wang, L., Qian, X., Pedram, M., and Wang, Y. (2018). VIBNN: Hardware acceleration of Bayesian neural networks. ACM SIGPLAN Notices, 53(2):476–488.

Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691.

Choukroun, Y., Kravchik, E., Yang, F., and Kisilev, P. (2019). Low-bit quantization of neural networks for efficient inference. In IEEE/CVF International Conference on Computer Vision Workshops, pages 3009–3018. IEEE.

Gal, Y. and Ghahramani, Z. (2015). Dropout as a Bayesian approximation. arXiv preprint arXiv:1506.02157.

Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. arXiv preprint arXiv:1706.04599.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE.

Hron, J., Matthews, A. G. d. G., and Ghahramani, Z. (2017). Variational Gaussian dropout is not Bayesian. arXiv preprint arXiv:1711.02989.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Krishnamoorthi, R. (2018). Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Liang, F., Li, Q., and Zhou, L. (2018). Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972.

MacKay, D. J. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73–80.

McAllister, R., Gal, Y., Kendall, A., Van Der Wilk, M., Shah, A., Cipolla, R., and Weller, A. (2017). Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pages 4745–4753. AAAI Press.

Neal, R. M. (1993). Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems, pages 475–482.

Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Su, J., Cvitkovic, M., and Huang, F. (2019). Sampling-free learning of Bayesian quantized neural networks. arXiv preprint arXiv:1912.02992.

van Baalen, M., Louizos, C., Nagel, M., Amjad, R. A., Wang, Y., Blankevoort, T., and Welling, M. (2020). Bayesian bits: Unifying quantization and pruning. arXiv preprint arXiv:2005.07093.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.
A DETAIL OF BAYESIAN NEURAL NETWORK METHODS
We will illustrate the complications of the different approaches for Bayesian inference with the help of a toy example: a common linear layer with an input matrix f_i ∈ R^{I×M} with I samples and M features, and some weights f_w ∈ R^{M×F} with the output f_o ∈ R^{I×F} with F output features, such that f_o = f_i f_w.

Monte Carlo Dropout
The concept of MCD (Gal and Ghahramani, 2015) lies in casting dropout (Srivastava et al., 2014) training in NNs as approximate Bayesian inference. Dropout can be described by applying a random element-wise mask K ∈ R^{I×M}, K ∼ Bernoulli(p), of zeros and ones with probability 0 ≤ p ≤ 1 to the input f_i and rescaling the non-zero elements by 1/(1 − p), as f_o = ((K ⊙ f_i)/(1 − p)) f_w. The authors of MCD show that the use of dropout in NNs before every weight-bearing layer can be interpreted as a Bayesian approximation, and that by applying dropout it can approximate the integral over the models' weights (Gal and Ghahramani, 2015). Therefore, to estimate the predictive distribution p(y*|x*) it is necessary to collect the results of L forward passes, while sampling and applying the element-wise masks. Training is usually done through a single sample. Therefore, the only implementation-wise complications of this method are the need for the random generation of zeros and ones and their subsequent element-wise application. The number of parameters, and thus the memory footprint, stays constant.
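A minimal NumPy sketch of one stochastic MCD forward pass and its MC-averaged prediction follows. The function name and the inverted-dropout scaling by 1/(1 − p), with p the drop probability, are assumptions consistent with the description above.

```python
import numpy as np

def mcd_linear(f_i, f_w, p, rng):
    """One stochastic forward pass of a dropout (MCD) linear layer.

    A Bernoulli mask K of zeros and ones is drawn per input element and the
    surviving activations are rescaled by 1 / (1 - p), where p is the
    probability of dropping (zeroing) an input feature.
    """
    K = (rng.random(f_i.shape) >= p).astype(f_i.dtype)  # 1 keeps, 0 drops
    f_x = (K * f_i) / (1.0 - p)                          # scaled, masked input
    return f_x @ f_w

# predictive distribution: keep the mask active at test time and average L passes
rng = np.random.default_rng(0)
f_i = rng.normal(size=(2, 16))
f_w = rng.normal(size=(16, 4))
samples = np.stack([mcd_linear(f_i, f_w, p=0.2, rng=rng) for _ in range(20)])
print(samples.mean(axis=0), samples.var(axis=0))
```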
Bayes-By-Backprop

In BBB (Blundell et al., 2015) the weight uncertainty is modelled explicitly, by assuming an approximation q(w|x, y, θ) for the posterior p(w|x, y) with respect to learnable parameters θ. The learning is performed through minimising a distance bound between q(w|x, y, θ) and p(w|x, y). The most common approximation q for the weights w is a mean-field approximation, such that w ∼ N(µ, σ), with an individual mean µ ∈ R^{M×F} and variance σ ∈ R^{M×F} for each weight, where θ = {µ, σ} and N represents a Gaussian distribution (Kingma and Welling, 2013; Ranganath et al., 2014). Kingma and Welling (2013) introduced the reparametrisation trick, which allows sampling of the weights with respect to q, such that f_w = µ + ε ⊙ φ(σ), where ε ∼ N(0, I), I is an identity matrix and φ(·) is a positive-forcing function, e.g. a softplus. The sampled w can then be used such that f_o = f_i f_w. Similarly to MCD, to estimate the predictive distribution p(y*|x*) it is necessary to collect the results of L forward passes with respect to L weight samples. Training is usually done through a single sample. This method requires the ability of the hardware to efficiently sample a more complex, Gaussian distribution. Moreover, this model uses double the number of parameters for the same network size, due to the means paired with variances.

Stochastic Gradient Langevin Dynamics with Hamiltonian Monte Carlo
In comparison to the previous two approaches, in SGHMC (Chen et al., 2014) it is not necessary to perform sampling and random number generation during evaluation. The w_l corresponding to a single set of weights from the ensemble can be used directly during evaluation, instead of sampling them via w_l ∼ p(w|x, y) as in the case of the two previous methods. To obtain p(y*|x*) it is necessary to collect the results of L forward passes with respect to the ensemble with L members corresponding to the L weights w. Training is performed similarly to standard pointwise NNs. In comparison to the previous two methods, this method does not require sampling from a distribution during evaluation. However, it requires L× more memory resources in comparison to pointwise NNs, to store the entire ensemble. At the same time, it is necessary to consider the extra time needed to load the weights w into memory.

B METRICS
In addition to measuring the root-mean-squared error (RMSE) and the classification error, we establish metrics for the evaluation of the quantified uncertainty.
B.1 NEGATIVE LOG-LIKELIHOOD
Based on (Lakshminarayanan et al., 2017), our base metric for evaluating the quality of the predictive uncertainty is the negative log-likelihood (NLL). We chose averages in our evaluations, due to easier interpretability and consistency. In the regression case, the NLL can be formulated with respect to a single-valued Gaussian as in equation (3):

NLL = (1/N) Σ_{n=1}^{N} [ log σ²_w(x_n)/2 + (y_n − µ_w(x_n))²/(2 σ²_w(x_n)) + log √(2π) ]    (3)

In the classification case, the NLL can be formulated with respect to the cross-entropy as in equation (4):

NLL = −(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} y^k_n log µ^k_w(x_n)    (4)
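A NumPy sketch of both NLL variants, assuming the predictive mean and variance (regression) or class probabilities (classification) have already been obtained by MC sampling:

```python
import numpy as np

def regression_nll(y, mu, var):
    """Average Gaussian negative log-likelihood, eq. (3)."""
    return np.mean(0.5 * np.log(var)
                   + (y - mu) ** 2 / (2.0 * var)
                   + np.log(np.sqrt(2.0 * np.pi)))

def classification_nll(y_onehot, probs):
    """Average cross-entropy negative log-likelihood, eq. (4)."""
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))
```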
B.2 PREDICTIVE ENTROPY

In the case of classification for which the labels are not available, which is the case for most out-of-distribution datasets, we measure the quality of the uncertainty prediction with respect to the average predictive entropy (aPE) as in equation (5):

aPE = −(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} µ^k_w(x_n) log µ^k_w(x_n)    (5)

B.3 EXPECTED CALIBRATION ERROR

Additionally, we measure the calibration of the BNNs and their sensitivity through the expected calibration error (ECE) (Guo et al., 2017). ECE relates the confidence with which a network makes predictions to its accuracy; thus it measures whether a network is over-confident or under-confident in its predictions, with respect to the softmax output. To compute the ECE, the authors propose to discretise the prediction probability interval into a fixed number of bins, and to assign each probability to the bin that encompasses it. The ECE is the difference between the fraction of predictions in a bin that are correct (accuracy) and the mean of the probabilities in the bin (confidence). ECE computes a weighted average of this error across bins as shown in equation (6), where n_b is the number of predictions in bin b, and accuracy(b) and confidence(b) are the accuracy and confidence of bin b, respectively. We set B = 10.

ECE = Σ_{b=1}^{B} (n_b / N) | accuracy(b) − confidence(b) |    (6)

To summarise, for regression we measure RMSE and NLL, and for the image classification problems we observe the classification error, NLL, aPE and ECE.
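The two classification metrics can be sketched as follows, with equally spaced confidence bins for the ECE; binning by the maximum softmax probability is an assumption consistent with (Guo et al., 2017).

```python
import numpy as np

def average_predictive_entropy(probs):
    """Average predictive entropy over N examples, eq. (5)."""
    return -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))

def expected_calibration_error(probs, labels, num_bins=10):
    """Expected calibration error, eq. (6), with B equally spaced bins."""
    confidence = probs.max(axis=1)
    prediction = probs.argmax(axis=1)
    correct = (prediction == labels).astype(float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece, N = 0.0, len(labels)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        n_b = in_bin.sum()
        if n_b > 0:  # weight each bin by the fraction of predictions it holds
            ece += (n_b / N) * abs(correct[in_bin].mean() - confidence[in_bin].mean())
    return ece
```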
C ADDITIONAL RESULTS

In this Section we report additional measurements with respect to the test data, the confusion data and the domain shifts for the image classification problems.
C.1 MNIST
The additional results for the MNIST-based experiments are presented in Figures 11 (a-e), 12 (a-e) and 13 (a,b). Starting with the results for changing the activation precision, it can be seen that as the predictive entropy for the confusion data increases, the entropy for the test data stays constant and increases only near the smallest bit-width, as seen in Figure 11 (a). As a result, the ECE for the test data, in Figure 11 (b), also increases, and that is due to two reasons: (1) the quantisation collapses when n = 3; (2) the predictive uncertainty increases disproportionately to the error, which can be observed for SGHMC and BBB. Subsequently, the NLL on the test data stays constant and increases in the collapsed regime, as seen in Figure 11 (c). Nevertheless, as supported by the ECE plot for the confusion data in Figure 11 (d), by becoming more uncertain, the activation quantisation reduced the confidence of the BNNs on the confusion dataset, which is desired. This can also be observed in the NLL for the confusion data in Figure 11 (e). Comparing these results to Figures 12 (a-e), it can be seen that weight quantisation follows similar trends, and the ECE and the predictive entropy collapse when n reaches the extrema. Looking in more detail at the aPE (a) and NLL (b) on the test data with augmentations in Figure 13, when n = 7 for activations and n = 8 for weights, it can be seen that the methods remain robust under augmentations, but at the same time they are more uncertain. This further supports the claim that BNNs can indeed be quantised to approximately an 8-bit integer representation. In addition, we present 32-bit floating-point results in Figures 14 (a-d), and there are barely any visible deviations with respect to the quantised counterparts.

C.2 CIFAR-10
The additional results for the CIFAR-10-based experiments are presented in Figures 15 (a-e), 16 (a-e) and 17 (a,b). Similarly to the MNIST experiments, as seen in Figures 15 (a-e), it can be observed that as the accuracy degrades, so rightfully does the predictive confidence, best seen in Figure 15 (a). At the same time the ECE on the test data also increases, and on the confusion data it tends to decrease, as seen in Figures 15 (b,d). However, this time it is due to the collapse in the activations, where the bit-width is not satisfactory and a more advanced quantisation scheme might be needed for the smallest n. The results for weight quantisation support these observations, as seen in Figures 16 (a-e), nevertheless with a smaller impact on the ranges of the deviations. This suggests a potential investigation into mixed-precision representations with a low bit-width for the weights while preserving the activation precision. As seen in Figures 17 (a,b), presenting results on the test data with augmentations when n = 7 for activations and n = 8 for weights, quantisation does not prevent the Bayesian inference methods from quantifying uncertainty in their predictions. In addition, we present 32-bit floating-point results in Figures 18 (a-d), and there are barely any visible deviations with respect to the quantised counterparts presented in the supplementary material or the main body of the paper.
Figure 11: Changing activation precision, fixing weight precision. MNIST results with respect to average predictive entropy (aPE) (a), expected calibration error (ECE) (b) and negative log-likelihood (NLL) (c) on test data, and ECE (d) and NLL (e) on FashionMNIST. Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width. MCD collapses at the lowest activation bit-widths.
Figure 12: Fixing activation precision, changing weight precision. MNIST results with respect to average predictive entropy (aPE) (a), expected calibration error (ECE) (b) and negative log-likelihood (NLL) (c) on test data, and ECE (d) and NLL (e) on FashionMNIST data. Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width.
Figure 13: Average predictive entropy (aPE) (a) and negative log-likelihood (NLL) (b) with respect to 7-bit activationsand 8-bit weights and three augmentations applied to the MNIST test set: Brightness [1.5-3.5], Rotation [15°-75°] andHorizontal shift [0.1-0.5 of image size].
Figure 14: Classification error (a), average predictive entropy (aPE) (b), expected calibration error (ECE) (c) and negative log-likelihood (NLL) (d) with respect to 32-bit floating-point activations and weights and three augmentations applied to the MNIST test set: Brightness [1.5-3.5], Rotation [15°-75°] and Horizontal shift [0.1-0.5 of image size].
Figure 15: Changing activation precision, fixing weight precision. CIFAR-10 results with respect to average predictive entropy (aPE) (a), expected calibration error (ECE) (b) and negative log-likelihood (NLL) (c) on test data, and ECE (d) and NLL (e) on SVHN. Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width. All methods collapse at the lowest activation bit-widths.
Figure 16: Fixing activation precision, changing weight precision. CIFAR-10 results with respect to average predictive entropy (aPE) (a), expected calibration error (ECE) (b) and negative log-likelihood (NLL) (c) on test data, and ECE (d) and NLL (e) on SVHN. Q stands for quantised activations (A) and weights (W); the subscript denotes the bit-width. All methods except MCD collapse at the lowest weight bit-widths.
Figure 17: Average predictive entropy (aPE) (a) and negative log-likelihood (NLL) (b) with respect to 7-bit activationsand 8-bit weights and three augmentations applied to the CIFAR-10 test set: Brightness [1.5-3.5], Rotation [15°-75°] andHorizontal shift [0.1-0.5 of image size].
Figure 18: Classification error (a), average predictive entropy (aPE) (b), expected calibration error (ECE) (c) and negative log-likelihood (NLL) (d) with respect to 32-bit floating-point activations and weights and three augmentations applied to the CIFAR-10 test set: Brightness [1.5-3.5], Rotation [15°-75°] and Horizontal shift [0.1-0.5 of image size].