Single Shot MC Dropout Approximation
Kai Brach, Beate Sick, Oliver Dürr

Abstract
Deep neural networks (DNNs) are known for their high prediction performance, especially in perceptual tasks such as object recognition or autonomous driving. Still, DNNs are prone to yield unreliable predictions when encountering completely new situations, without indicating their uncertainty. Bayesian variants of DNNs (BDNNs), such as MC dropout BDNNs, do provide uncertainty measures. However, BDNNs are slow during test time because they rely on a sampling approach. Here we present a single shot MC dropout approximation that preserves the advantages of BDNNs without being slower than a DNN. Our approach is to analytically approximate, for each layer in a fully connected network, the expected value and the variance of the MC dropout signal. We evaluate our approach on different benchmark datasets and a simulated toy example. We demonstrate that our single shot MC dropout approximation resembles the point estimate and the uncertainty estimate of the predictive distribution that is achieved with an MC approach, while being fast enough for real-time deployments of BDNNs.
1. Introduction
Over the last decade, deep neural networks (DNNs) have arisen as the dominant technique for the analysis of perceptual data. Also in safety-critical applications like autonomous driving, where the vehicle must be able to understand its environment, DNNs have seen rapid progress in several tasks (Grigorescu et al., 2019). However, classical DNNs have deficits in capturing the model uncertainty (Kendall & Gal, 2017; Gal & Ghahramani, 2016). But when using DNN models in safety-critical applications, it is mandatory to provide an uncertainty measure that can be used to identify unreliable predictions (Michelmore et al., 2018; Feng et al., 2018; Harakeh et al., 2019; Miller et al., 2018; McAllister et al., 2017). For example, in the field of robotics (Sünderhauf et al., 2018), medical applications, or autonomous driving (Bojarski et al., 2016), where machines interact with humans, it is important to identify situations where a model prediction is unreliable and a human intervention is necessary. These can, for example, be situations which are completely different from all that occurred during training. Employing Bayesian DNNs (BDNNs) (MacKay, 1992) tackles this problem and allows one to compute an uncertainty measure. However, state-of-the-art BDNNs require sampling during deployment, leading to computation times that are larger than those of a classical DNN by a factor equal to the number of MC runs.

(* Equal contribution. Elektrobit Automotive GmbH, Germany; IDP, Zurich University of Applied Sciences, Switzerland, and EBPI, University of Zurich, Switzerland; IOS, Konstanz University of Applied Sciences, Germany. Correspondence to: Kai Brach <[email protected]>, Oliver Dürr <[email protected]>, Beate Sick <[email protected]>. Presented at the ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning. Copyright 2020 by the author(s).)
This work overcomes this drawback by providing a method that approximates the expected value and variance of a BDNN's predictive distribution in a single run. It therefore has the same computation time as a classical DNN. We focus here on a special variant of BDNNs known as MC dropout (Gal & Ghahramani, 2016). While our approximation method is also applicable to convolutional neural networks and classification settings, we focus in this work on regression with fully connected networks. Ensembling-based models take an alternative approach to estimating uncertainties and have been successfully applied to DNNs (Lakshminarayanan et al., 2017; Pearce et al., 2020). But ensemble methods also do not allow one to quantify the uncertainty in a single shot manner.
2. Related Work
BDNNs are probabilistic models that capture uncertainty by means of probability distributions. Probabilistic DNNs, which are non-Bayesian, only define a distribution for the conditional outcome. In common probabilistic DNNs the output nodes control the parameters of a conditional probability distribution (CPD) of the outcome. For regression-type problems a common choice for the CPD is the normal distribution N(µ, σ²), where the variance σ² quantifies the data uncertainty, known as aleatoric uncertainty. BDNNs define in addition distributions for the weights, which translate into a distribution of the modeled parameters. In this manner the model uncertainty is captured, which is known as epistemic uncertainty (Der Kiureghian & Ditlevsen, 2009). In the case of MC dropout BDNNs each weight distribution is a Bernoulli distribution: the weight takes the value zero with the dropout probability p∗ and the value w with probability 1 − p∗. All weights starting from the same neuron are set to zero simultaneously. The dropout probability p∗ is usually treated as a fixed hyperparameter, and the weight value w is tuned during training. In contrast to standard dropout (Srivastava et al., 2014), the weights in MC dropout are not frozen and rescaled after training; instead, the dropout procedure is also applied during test time. It can be shown that MC dropout is an approximation to a BDNN (Gal & Ghahramani, 2016). MC dropout BDNNs were successfully used in many applications and have proven to yield improved prediction performance and to allow the definition of uncertainty measures that identify individual unreliable predictions (Gal & Ghahramani, 2016; Ryu et al., 2019; Dürr et al., 2018; Kwon et al., 2020). To employ a trained Bayesian DNN in practice, one performs several runs of predictions.
In each run, weights are sampled from the weight distributions, leading to a certain constellation of weight values that are used to compute the parameters of a CPD. To determine the outcome distribution of a BDNN, we draw samples from the CPDs that resulted from different MC runs. In this way, the outcome distribution incorporates both the epistemic and the aleatoric uncertainty. A drawback of an MC dropout BDNN compared to its classical DNN variant is the increased computing time. The sampling procedure leads to a computing time that is prohibitive for many real-time applications like autonomous driving.

Our method relies on statistical moment propagation (MP). More specifically, we propagate the expectation and the variance of our signal distribution through the different layers of a neural network. The variance of the signal arises due to the dropout process. Quantifying the variance after a transformation is also done in error propagation (EP). EP quantifies how an uncertainty of an input which is transformed by a function (i.e., a measurement error) transfers to an uncertainty of the output of this function. In the case of a continuous output it is common to characterize the uncertainty by the variance. This approach is also used in statistics as the delta method (Dorfman, 1938). In MP we approximate the layer-wise transformations of the variance and the expected value. A similar approach has been used for neural networks before (Frey & Hinton, 1999; Adachi, 2019), and was used to detect adversarial examples in (Jin, 2015) and (Gast & Roth, 2018). But, to the best of our knowledge, our approach is the first method that provides a single shot approximation to the expected value and the variance of the predictive distribution resulting from an MC dropout NN.
3. Methods
The goal of our method is to approximate the expected value E and the variance V of the predicted output which is obtained by the MC dropout method described above. When propagating an observation through an MC dropout network, we get for each layer with p nodes an activation signal with an expected value E (of dimension p) and a variance given by a variance-covariance matrix V (of dimension p × p). We neglect the effect of correlations between different activations, which are small anyway in deeper layers due to the decorrelating effect of the dropout. Hence, we only consider the diagonal terms of the covariance matrix. In the following, we describe for each layer type in a fully connected network how the expected value E and the variance V are propagated. As layer types we consider dropout, dense, and ReLU activation layers. Figure 1 provides an overview of the layer-wise abstraction.

Figure 1.
Overview of the proposed method. The expectation E and the variance V flow through the different layers of the network in a single forward pass. Shown is an example configuration in which dropout (DO) is followed by a dense layer (FC) and a ReLU activation. More complex networks can be built by different arrangements of the individual blocks.
We start our discussion with the effect of MC dropout. (Code is available at https://github.com/kaibrach/Moment-Propagation.)

Let E_i be the expectation at the i-th node of the input layer and V_i the variance at the i-th node. In a dropout layer the random value of node i is multiplied independently with a Bernoulli variable Y ∼ Bern(1 − p∗) that is either zero or one. The expectation E_i^D of the i-th node after dropout is then given by:

E_i^D = E_i (1 − p∗)    (1)

For computing the variance V_i^D of the i-th node after dropout, we use the fact that the variance V(X · Y) of the product of two independent random variables X and Y is given by (Goodman, 1960):

V(X · Y) = V(X) V(Y) + V(X) E(Y)² + E(X)² V(Y)    (2)

With E(Y) = 1 − p∗ and V(Y) = p∗ (1 − p∗), we get:

V_i^D = V_i · p∗ (1 − p∗) + V_i (1 − p∗)² + E_i² · p∗ (1 − p∗)    (3)

Dropout is the only layer in our approach where uncertainty is created, i.e., even if the input has V_i = 0 the output of the dropout layer has V_i^D > 0 for p∗ ≠ 0.

For the dense layer with p input and q output nodes, we compute the value of the i-th output node as Σ_{j=1}^p w_ji x_j + b_i, where x_j, j = 1 … p, are the values of the input nodes. Using the linearity of the expectation, we get the expectation E_i^F of the i-th output node from the expectations E_j, j = 1 … p, of the input nodes:

E_i^F = Σ_{j=1}^p w_ji E_j + b_i    (4)

To calculate the change of the variance, we use the fact that the variance under a linear transformation behaves like V(w_ji · x_j + b_i) = w_ji² V(x_j). Further, we assume independence of the p different summands, yielding:

V_i^F = Σ_{j=1}^p w_ji² V_j    (5)

To calculate the expectation E_i^R and variance V_i^R of the i-th node after a ReLU, as a function of the E_i and V_i of this node before the ReLU, we need to make a distributional assumption.
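As an illustration, the dropout and dense moment updates of Eqs. (1)–(5) can be sketched in a few lines of NumPy. This is our own minimal sketch; the function names and array shapes are our choices, not taken from the paper's code.

```python
import numpy as np

def dropout_moments(E, V, p_star):
    """Eqs. (1) and (3): propagate mean E and variance V through MC dropout.

    Each node is multiplied by an independent Bernoulli variable that is
    zero with probability p_star, so E(Y) = 1 - p_star and
    V(Y) = p_star * (1 - p_star)."""
    E_D = E * (1.0 - p_star)
    V_D = (V * p_star * (1.0 - p_star)
           + V * (1.0 - p_star) ** 2
           + E ** 2 * p_star * (1.0 - p_star))
    return E_D, V_D

def dense_moments(E, V, W, b):
    """Eqs. (4) and (5): propagate moments through a dense layer.

    W has shape (p, q), b has shape (q,); off-diagonal covariance
    terms are neglected, as in the text."""
    return E @ W + b, V @ W ** 2
```

Note that for a deterministic input (V = 0), Eq. (3) reduces to V_i^D = E_i² · p∗ (1 − p∗), which is how uncertainty first enters the network.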
We assume that the input is Gaussian distributed with mean E_i and variance V_i. With φ(x) and Φ(x) denoting the PDF and CDF of the standard normal distribution, we get (see (Frey & Hinton, 1999) for a derivation) for the expectation and variance of the output:

E_i^R = E_i · Φ(E_i / √V_i) + √V_i · φ(E_i / √V_i)    (6)

V_i^R = (E_i² + V_i) · Φ(E_i / √V_i) + E_i √V_i · φ(E_i / √V_i) − (E_i^R)²    (7)
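Eqs. (6) and (7) can likewise be written as a small helper (again our own sketch, here using SciPy's standard normal PDF/CDF):

```python
import numpy as np
from scipy.stats import norm

def relu_moments(E, V):
    """Eqs. (6) and (7): moments after a ReLU, assuming the
    pre-activation at each node is Gaussian with mean E and variance V."""
    s = np.sqrt(V)
    z = E / s
    E_R = E * norm.cdf(z) + s * norm.pdf(z)
    V_R = (E ** 2 + V) * norm.cdf(z) + E * s * norm.pdf(z) - E_R ** 2
    return E_R, V_R
```

As a sanity check, for a standard normal pre-activation (E = 0, V = 1) this gives E^R = φ(0) = 1/√(2π) ≈ 0.399, the well-known mean of a rectified standard Gaussian.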
4. Results
We first apply our approach to a one-dimensional regression toy dataset with only one input feature. We use a fully connected NN with three layers, each with 256 nodes, ReLU activations, and dropout after the dense layers. We have a single node in the output layer, which is interpreted as the expected value µ of the conditional outcome distribution p(y|x). We train the network using the MSE loss and apply dropout with a fixed dropout probability p∗. From the MC dropout BDNN, we get at each x-position T = 30 MC samples µ_t(x), from which we can estimate the expectation E_µ by the average value and V_µ by the variance of µ_t(x). For comparison, we use our MP approach to also approximate the expected value E_µ and the variance V_µ of µ at each x-position (see upper panel of figure 2). We also include the deterministic output µ(x) of the DNN in which dropout has only been used during training. All three approaches yield nearly identical results within the range of the training data. We attribute this to the fact that we have plenty of training data, so the epistemic uncertainty is negligible. In the lower panel of figure 2 a comparison of the uncertainty of µ(x) is shown by displaying an interval given by the expected value of µ(x) plus/minus two times the standard deviation of µ(x). Here the widths of the resulting intervals of a BDNN via the MP approach and via MC dropout are comparable (the DNN has no spread). This indicates the usefulness of this approach for epistemic uncertainty estimation.
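The agreement between sampling and moment propagation can be checked on a single dropout–dense–ReLU block. The following is a self-contained sketch with made-up random weights (not the paper's experiment): it compares the mean over many MC dropout forward passes with one analytic MP pass through Eqs. (1)–(6).

```python
import math
import numpy as np

rng = np.random.default_rng(0)
p_star, T = 0.2, 20000
x = rng.normal(size=20)                      # fixed input activations
W, b = rng.normal(size=(20, 4)), rng.normal(size=4)

# MC dropout: T stochastic forward passes through dropout -> dense -> ReLU
out = np.array([np.maximum((x * (rng.random(20) > p_star)) @ W + b, 0.0)
                for _ in range(T)])
E_mc = out.mean(axis=0)

# MP: one analytic pass
E_D = x * (1 - p_star)                       # Eq. (1)
V_D = x ** 2 * p_star * (1 - p_star)         # Eq. (3) with V_i = 0
E_F, V_F = E_D @ W + b, V_D @ W ** 2         # Eqs. (4), (5)
z = E_F / np.sqrt(V_F)
Phi = 0.5 * (1 + np.array([math.erf(v / math.sqrt(2)) for v in z]))
phi = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
E_mp = E_F * Phi + np.sqrt(V_F) * phi        # Eq. (6)

print(np.abs(E_mc - E_mp))                   # agreement up to MC noise
```

The remaining discrepancy stems from the MC sampling noise and from the Gaussian assumption on the pre-activations; with 20 input nodes the latter is already quite accurate.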
Figure 2.
Comparison of the MP and MC dropout results of a BDNN and the results of a DNN. The NNs were fitted on training data that were available in the range of −3 to 19. In the upper panel the estimated expectations of the MC BDNN, the MP BDNN, and the DNN are compared. In the lower panel the predicted spread of µ(x) is shown for the MC and MP methods.

To benchmark our method, we redo the analysis of (Gal & Ghahramani, 2016) for the UCI regression benchmark
Table 1. Comparison of the average prediction performance in test RMSE (root-mean-square error), test NLL (negative log-likelihood), and test RT (runtime) on UCI regression benchmark datasets between MC and MP. N and Q correspond to the dataset size and the input dimension. For all test measures, smaller means better. (Entries marked — were lost in extraction.)

Dataset    N     Q    Test RMSE      Test NLL         Test RT [s]
                      MC     MP      MC      MP       MC     MP
Boston     506   13   3.14   —       —       —        —      —
Concrete   —     —    —      —       —       —        —      —
Energy     768   8    1.65   —       —       —        —      —
Kin8nm     —     —    —      —       −1.10   −1.11    —      —
Naval      —     —    —      —       −4.36   −3.60    —      —
Power      —     —    —      —       —       —        —      —
Protein    —     —    —      —       —       —        —      —
Wine       —     —    —      —       —       —        —      —
Yacht      308   6    2.93   —       —       —        —      —

dataset. We use the same NN model as Gal and Ghahramani, which is a fully connected neural network including one hidden layer with ReLU activation, in which the CPD p(y|x) over T = 10,000 MC runs is given by sampling from the normal PDF:

p(y|x) = (1/T) Σ_t N(y; µ_t(x), τ⁻¹)    (8)

Again, µ_t(x) is the single output of the BDNN for the t-th MC run. To derive a predictive distribution, Gal assumes in each run a Gaussian distribution, centered at µ and with a precision τ, corresponding to the reciprocal of the variance. The parameter µ is received from the NN and τ is treated as a hyperparameter. For the MP model, the MC sampling (Eq. 8) is replaced by integration:

p(y|x) = ∫ N(y; µ′, τ⁻¹) N(µ′; E_MP, V_MP) dµ′ = N(y; E_MP, V_MP + τ⁻¹)    (9)

We used the same protocol as (Gal & Ghahramani, 2016), which can be found at https://github.com/yaringal/DropoutUncertaintyExps. Accordingly, we train the network for 10× the epochs provided in the individual dataset configuration. As described in (Gal & Ghahramani, 2016), an extensive grid search over the dropout rate p∗ and different values of the precision τ is done. The hyperparameters minimizing the validation NLL are chosen and applied on the test set. We report in table 1 the test performance (RMSE and NLL) achieved via the MC BDNN using the optimal hyperparameters for the different UCI datasets. We also report the test RMSE and NLL achieved with our MP method. Overall, the MC and MP approaches produce similar results. However, as shown in the last column of the table, the MP method is much faster, having to perform only one forward pass instead of T = 10,000 forward passes.
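A practical consequence of Eq. (9) is that the test NLL reported in Table 1 is available in closed form for the MP model, with no averaging over samples. A minimal sketch (the function and variable names are ours):

```python
import numpy as np

def mp_nll(y, E_mp, V_mp, tau):
    """Per-observation negative log-likelihood under Eq. (9):
    the MP predictive distribution is N(y; E_mp, V_mp + 1/tau),
    so the Gaussian NLL can be evaluated directly."""
    var = V_mp + 1.0 / tau
    return 0.5 * (np.log(2 * np.pi * var) + (y - E_mp) ** 2 / var)
```

By contrast, the MC estimate of Eq. (8) requires evaluating T Gaussian densities per test point and log-averaging them.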
5. Discussion
With our MP approach we have introduced an approximation to MC dropout which requires no sampling but instead propagates the expectation and the variance of the signal through the network. This results in a time saving by a factor that approximately corresponds to the number of MC runs (in our benchmark experiment 10,000). We have shown that our fast MP approach precisely approximates the expectation and variance of the predictive distribution achieved by MC dropout. Also, the achieved prediction performance in terms of RMSE and NLL does not show significant differences between MC dropout and our MP approach. Hence, our presented MP approach opens the door to including uncertainty information in real-time applications. We are currently working on extending the approach to different architectures such as convolutional neural networks. We are also investigating how to make use of the uncertainty information to detect novel classes in classification settings.
6. Acknowledgements
We are very grateful to Elektrobit Automotive GmbH for supporting this research work. Further, part of the work has been funded by the Federal Ministry of Education and Research of Germany (BMBF) in the project DeepDoubt (grant no. 01IS19083A).
References
Adachi, J. Estimating and factoring the dropout induced distribution with gaussian mixture model. In International Conference on Artificial Neural Networks, pp. 775–792. Springer, 2019.

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., and Zieba, K. End to end learning for self-driving cars. 2016. URL http://arxiv.org/abs/1604.07316.

Der Kiureghian, A. and Ditlevsen, O. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.

Dorfman, R. A. A note on the δ-method for finding variance formulae. Biometric Bulletin, 1938.

Dürr, O., Murina, E., Siegismund, D., Tolkachev, V., Steigele, S., and Sick, B. Know when you don't know: A robust deep learning approach in the presence of unknown phenotypes. Assay and Drug Development Technologies, 16(6):343–349, 2018.

Feng, D., Rosenbaum, L., and Dietmayer, K. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3D vehicle detection. In IEEE Conference on Intelligent Transportation Systems (ITSC), pp. 3266–3273, 2018. doi: 10.1109/ITSC.2018.8569814.

Frey, B. J. and Hinton, G. E. Variational learning in nonlinear gaussian belief networks. Neural Computation, 11(1):193–213, 1999. doi: 10.1162/089976699300016872. URL https://ieeexplore.ieee.org/abstract/document/6790581/.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, volume 3, pp. 1651–1660. International Machine Learning Society (IMLS), 2016. ISBN 9781510829008.

Gast, J. and Roth, S. Lightweight probabilistic deep networks. Technical report, 2018.

Goodman, L. A. On the exact variance of products. Journal of the American Statistical Association, 55(292):708–713, 1960. doi: 10.2307/2281592.

Grigorescu, S., Trasnea, B., Cocias, T., and Macesanu, G. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2019. doi: 10.1002/rob.21918. URL http://arxiv.org/abs/1910.07738.

Harakeh, A., Smart, M., and Waslander, S. L. BayesOD: A Bayesian approach for uncertainty estimation in deep object detectors. 2019. URL http://arxiv.org/abs/1903.03838.

Jin, J. Robust convolutional neural networks under adversarial noise. CoRR, abs/1511.06306, 2015. URL http://arxiv.org/abs/1511.06306.

Kendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? Technical report, 2017. URL http://papers.nips.cc/paper/7141-what-uncertainties-do-we-need.

Kwon, Y., Won, J.-H., Kim, B. J., and Paik, M. C. Uncertainty quantification using bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics & Data Analysis, 142:106816, 2020.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.

MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992. doi: 10.1162/neco.1992.4.3.448.

McAllister, R., Gal, Y., Kendall, A., Van Der Wilk, M., Shah, A., Cipolla, R., and Weller, A. Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. In IJCAI International Joint Conference on Artificial Intelligence, 2017. doi: 10.24963/ijcai.2017/661.

Michelmore, R., Kwiatkowska, M., and Gal, Y. Evaluating uncertainty quantification in end-to-end autonomous driving control. 2018. URL http://arxiv.org/abs/1811.06817.

Miller, D., Nicholson, L., Dayoub, F., and Sunderhauf, N. Dropout sampling for robust object detection in open-set conditions. In Proceedings - IEEE International Conference on Robotics and Automation, 2018. doi: 10.1109/ICRA.2018.8460700.

Pearce, T., Leibfried, F., Brintrup, A., Zaki, M., and Neely, A. Uncertainty in neural networks: Approximately bayesian ensembling. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Ryu, S., Kwon, Y., and Kim, W. Y. A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification. Chemical Science, 10(36):8438–8446, 2019.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

Sünderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., Upcroft, B., Abbeel, P., Burgard, W., Milford, M., and Corke, P. The limits and potentials of deep learning for robotics.