Mean-Field Approximation to Gaussian-Softmax Integral with Application to Uncertainty Estimation
Uncertainty Estimation with Infinitesimal Jackknife, Its Distribution and Mean-Field Approximation
Zhiyun Lu
University of Southern California [email protected]
Eugene Ie
Google Research [email protected]
Fei Sha∗ Google Research [email protected]
∗On leave from the University of Southern California.
Abstract
Uncertainty quantification is an important research area in machine learning. Many approaches have been developed to improve the representation of uncertainty in deep models to avoid overconfident predictions. Existing ones, such as Bayesian neural networks and ensemble methods, require modifications to the training procedures and are computationally costly for both training and inference. Motivated by this, we propose mean-field infinitesimal jackknife (mfIJ) – a simple, efficient, and general-purpose plug-in estimator for uncertainty estimation. The main idea is to use infinitesimal jackknife, a classical tool from statistics for uncertainty estimation, to construct a pseudo-ensemble that can be described with a closed-form Gaussian distribution, without retraining. We then use this Gaussian distribution for uncertainty estimation. While the standard way is to sample models from this distribution and combine each sample's prediction, we develop a mean-field approximation to the inference, where Gaussian random variables need to be integrated with the softmax nonlinear functions to generate probabilities for multinomial variables. The approach has many appealing properties: it functions as an ensemble without requiring multiple models, and it enables closed-form approximate inference using only the first and second moments of Gaussians. Empirically, mfIJ performs competitively when compared to state-of-the-art methods, including deep ensembles, temperature scaling, dropout, and Bayesian NNs, on important uncertainty tasks. It especially outperforms many methods on out-of-distribution detection.
Recent advances in deep neural nets have dramatically improved predictive accuracy in supervised tasks. For many applications such as autonomous vehicle control and medical diagnosis, decision-making also needs accurate estimation of the uncertainty pertinent to the prediction. Unfortunately, deep neural nets are known to output overconfident, mis-calibrated predictions [19].

It is crucial to improve deep models' ability to represent uncertainty. There has been a steady development of new methods for uncertainty quantification in deep neural nets. One popular idea is to introduce additional stochasticity (such as temperature annealing or dropout in the network architecture) to existing trained models to represent uncertainty [16, 19]. Another line of work is to use an ensemble of models, collectively representing the uncertainty about the predictions. This ensemble of models can be obtained by varying training with respect to initialization [28], hyper-parameters [3], or data partitions, i.e., bootstraps [14, 42]. Yet another line of work is to use Bayesian neural networks (
BNN), which can be seen as an ensemble of an infinite number of models, characterized by the posterior distribution [7, 32]. In practice, one samples models from the posterior or uses variational inference. Each of those methods offers different trade-offs among computational costs, memory consumption, parallelization, and modeling flexibility. For example, while ensemble methods are often state-of-the-art, they are both computationally and memory intensive, due to repeating training procedures or storing the resulting models.

Those stand in stark contrast to many practitioners' desiderata. Ideally, neither training models nor inference with models should incur additional memory and computational costs to estimate uncertainty beyond what is needed for making predictions. Additionally, it is also desirable to be able to quantify uncertainty on an existing model where re-training is not possible.

In this work, we propose a new method to bridge the gap. The main idea of our approach is to use infinitesimal jackknife, a classical tool from statistics for uncertainty estimation [24], to construct a pseudo-ensemble that can be described with a closed-form Gaussian distribution. We then use this Gaussian distribution for uncertainty estimation. While the standard way is to sample models from this distribution and combine each sample's prediction, we develop a mean-field approximation to the inference, where Gaussian random variables need to be integrated with the softmax nonlinear functions to generate probabilities for multinomial variables.

We show the proposed approach, which we refer to as mean-field infinitesimal jackknife (mfIJ), often surpasses or is as competitive as existing approaches in the evaluation metrics of NLL, ECE, and out-of-distribution detection accuracy on several benchmark datasets. mfIJ shares appealing properties with several recent approaches for uncertainty estimation: constructing the pseudo-ensemble with infinitesimal jackknife does not require changing existing training procedures [42, 5, 17, 18, 25]; approximating the ensemble with a distribution removes the need to store many models, an impractical task for modern learning models [10, 33]; the pseudo-ensemble distribution is in a similar form as the Laplace approximation for Bayesian inference [4, 32], thus existing approaches in computational statistics such as Kronecker product factorization can be directly applied.

The mean-field approximation brings an additional appeal. It is in closed form and needs only the first and the second moments of the Gaussian random variables. In our case, the first moments are simply the predictions of the networks, while the second moments involve the product between the inverse Hessian and a vector, which can be computed efficiently [1, 36, 40]. Additionally, the mean-field approximation can be applied whenever integrals of a similar form need to be approximated. In Appendix B.3, we demonstrate its utility by applying it to the recently proposed
SWAG algorithm for uncertainty estimation, where the Gaussian distribution is derived differently [10, 23, 33, 35].

We describe our approach in §2, followed by a discussion on the related work in §3. Empirical studies are reported in §4, and we conclude in §5.
In this section, we start by introducing necessary notations and defining the task of uncertainty estimation. We then describe the technique of infinitesimal jackknife in §2.1. We derive a closed-form Gaussian distribution of an infinite number of models estimated with infinitesimal jackknife; we call them the pseudo-ensemble. We describe how to use this distribution for uncertainty estimation in §2.2. We present our efficient mean-field approximation to the Gaussian-softmax integral in §2.3. Lastly, we discuss hyper-parameters of our method and present the algorithm in §2.4.
Notation
We are given a training set of N i.i.d. samples D = {z_i}_{i=1}^N, where z_i = (x_i, y_i) with the input x_i ∈ X and the target y_i ∈ Y. We fit the data to a parametric predictive model y = f(x; θ). We define the loss on a sample z as ℓ(z; θ) and optimize the model's parameter θ via empirical risk minimization on D. The minimizer is given by θ* = arg min_θ L(D; θ), where

L(D; θ) ≜ (1/N) Σ_{i=1}^N ℓ(z_i; θ)   (1)

In practice, we are interested in not only the prediction f(x; θ*) but also quantifying the uncertainty of making such a prediction. In this paper, we consider (deep) neural networks as the predictive model.

Jackknife is a well-known resampling method to estimate the confidence interval of an estimator [44, 15]. It is a straightforward procedure. Each element z_i is left out from the dataset D to form a unique "leave-one-out" jackknife sample D_i = D − {z_i}. A jackknife sample's estimate of θ is given by

θ̂_i = arg min_θ L(D_i; θ)   (2)

We obtain N such samples {θ̂_i}_{i=1}^N and use them to estimate the variances of θ* and of the predictions made with θ*. In this vein, this is a form of ensemble method. However, it is not feasible to retrain modern neural networks N times, when N is often in the order of millions. Infinitesimal jackknife is a classical tool to approximate θ̂_i without re-training on D_i. It is often used as a theoretical tool for asymptotic analysis [24], and is closely related to influence functions in robust statistics [11]. Recent studies have brought (renewed) interest in applying this methodology to machine learning problems [17, 25]. Here, we briefly summarize the method.

Linear approximation.
The basic idea behind infinitesimal jackknife is to treat θ* and θ̂_i as special cases of an estimator on weighted samples,

θ̂(w) = arg min_θ Σ_i w_i ℓ(z_i; θ)   (3)

where the weights w_i form an (N−1)-simplex: Σ_i w_i = 1. Thus the maximum likelihood estimate θ* is θ̂(w) when w_i = 1/N for all i. A jackknife sample's estimate θ̂_i, on the other end, is θ̂(w) when w = (1/N)(1 − e_i), where e_i is the all-zero vector except taking a value of 1 at the i-th coordinate, i.e., the i-th weight is set to zero. Using the first-order Taylor expansion around w_i = 1/N, we obtain (under the condition of twice-differentiability and the invertibility of the Hessian),

θ̂_i ≈ θ* + (1/N) H^{-1}(θ*) ∇ℓ(z_i; θ*) ≜ θ* + (1/N) H^{-1} ∇_i   (4)

where H(θ*) is the Hessian matrix of L evaluated at θ*, and ∇ℓ(z_i; θ*) is the gradient of ℓ(z_i; θ) evaluated at θ*. We use H and ∇_i as shorthands when there is enough context to avoid confusion.
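To make eq. (4) concrete, here is a minimal NumPy sketch (illustrative only, not the authors' code; the function and argument names are assumptions) that forms all N leave-one-out estimates from per-example gradients and the Hessian of the averaged loss, without any re-training:

```python
import numpy as np

def infinitesimal_jackknife(theta_star, per_example_grads, hessian):
    """Approximate leave-one-out estimates via eq. (4):
    theta_i ~= theta* + (1/N) H^{-1} grad_i, with no re-training.

    theta_star:        (D,) minimizer of the averaged loss.
    per_example_grads: (N, D) per-sample loss gradients at theta*.
    hessian:           (D, D) Hessian of L(D; theta) at theta*.
    """
    n = per_example_grads.shape[0]
    # Dense inverse is fine for small D; see §2.4 for large-model alternatives.
    h_inv = np.linalg.inv(hessian)
    return theta_star[None, :] + (h_inv @ per_example_grads.T).T / n
```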
An infinite number of infinitesimal jackknife samples.
If the number of samples N → +∞, we can characterize the "infinite" number of θ̂_i with a closed-form Gaussian distribution, with the sample mean and covariance as the distribution's mean and covariance,

θ̂_i ∼ N(m, Σ_I), with m = (1/N) Σ_i θ̂_i = θ*, and   (5)

Σ_I = (1/N) Σ_i (θ̂_i − θ*)(θ̂_i − θ*)^⊤ = (1/N) H^{-1} [ (1/N) Σ_i ∇_i ∇_i^⊤ ] H^{-1} = (1/N) H^{-1} J H^{-1},

where J denotes the observed Fisher information matrix.
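A corresponding sketch of the pseudo-ensemble distribution of eq. (5), again with assumed names and a dense Hessian purely for illustration; members of the pseudo-ensemble can then be drawn without any training:

```python
import numpy as np

def pseudo_ensemble(theta_star, per_example_grads, hessian, n_draws=0, seed=0):
    """Mean and covariance of the IJ pseudo-ensemble, Sigma_I = (1/N) H^{-1} J H^{-1}."""
    n = per_example_grads.shape[0]
    fisher = per_example_grads.T @ per_example_grads / n      # observed Fisher J
    h_inv = np.linalg.inv(hessian)
    cov = h_inv @ fisher @ h_inv / n
    draws = None
    if n_draws:
        # Optional: sample pseudo-ensemble members for the baseline of eq. (7).
        rng = np.random.default_rng(seed)
        draws = rng.multivariate_normal(theta_star, cov, size=n_draws)
    return theta_star, cov, draws
```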
Infinite infinitesimal bootstraps.
The above procedure and analysis can be extended to bootstrapping (i.e., sampling with replacement). Similarly, to characterize the estimates from the bootstraps, we can also use a Gaussian distribution (details omitted here for brevity),

θ̂_i ∼ N(θ*, Σ_B), with Σ_B = (1/N) H^{-1} J H^{-1}.   (6)

We refer to the distributions N(θ*, Σ_I) and N(θ*, Σ_B) as the pseudo-ensemble distributions. We can approximate further by H = J to obtain Σ_I ≈ H^{-1}/N and Σ_B ≈ H^{-1}/N.

Lakshminarayanan et al. [28] discussed that using models trained on bootstrapped samples does not work empirically as well as other approaches, as the learner only sees about 63% of the dataset in each bootstrap sample. We note that this is an empirical limitation rather than a theoretical one. In practice, we can only train a very limited number of models. However, we hypothesize that we can get the benefits of combining an infinite number of models without training. Empirical results validate this hypothesis.

Given the general form of the pseudo-ensemble distributions θ ∼ N(θ*, Σ), it is straightforward to see that, if we approximate the predictive function f(·, ·) with a linear function, f(x; θ) ≈ f(x; θ*) + ∇f^⊤(θ − θ*),
we can then regard the predictions made by the models as a Gaussian-distributed random variable, f(x; θ) ∼ N(f(x; θ*), ∇f^⊤ Σ ∇f). For predictive functions whose outputs are approximately Gaussian, it might be adequate to characterize the uncertainty with the approximated mean and variance. However, for predictive functions that are categorical, this approach is not applicable.

A standard way is to use sampling for combining the discrete predictions from the models in the ensemble. For example, for classification y = arg max_c f_c(x; θ), where f_c is the probability of labeling x with the c-th category, the averaged prediction from the ensemble is then

e_k = E[P(y = k)] ≈ (1/M) Σ_{i=1}^M 1[arg max_c f_c(x; θ_i) = k]   (7)

where θ_i ∼ N(θ*, Σ), and 1[·] is the indicator function. In the next section, we propose a new approach that avoids sampling and directly approximates the ensemble prediction of discrete labels.

In deep neural networks for classification, the predictions f_k(·) are the outputs of the softmax layer,

f_k = SOFTMAX(a)_k ∝ exp{a_k}, with a = g(x; φ)^⊤ W,   (8)

where g(x; φ) is the transformation of the input through the layers before the fully-connected layer, and W is the connection weights in the softmax layer. We focus on the case where g(x; φ) is deterministic; extending to random variables is straightforward. As discussed, we assume the pseudo-ensemble on W forms a Gaussian distribution W ∼ N(W*, Σ). Then we have a ∼ N(μ, S) such that

μ = g(x; φ)^⊤ W*, S = g(x; φ)^⊤ Σ g(x; φ).   (9)

(With slight abuse of notation, the form of S is after proper vectorization and padding of g and Σ.) We give a detailed derivation in Appendix A to compute the expectation of f_k,

e_k = E[f_k] = ∫ SOFTMAX(a)_k N(a; μ, S) da   (10)

The key idea is to apply the mean-field approximation E[f(x)] ≈ f(E[x]) and use the following well-known formula to compute the Gaussian integral of a sigmoid function σ(·),

∫ σ(a) N(a; μ, s) da ≈ σ( μ / √(1 + λ s) )

where λ is a constant and is usually chosen to be π/8 or 3/π². In the softmax case, we arrive at:

Mean-Field 0 (mf0):  e_k ≈ [ Σ_i exp( −(μ_k − μ_i) / √(1 + λ s_k) ) ]^{-1}   (11)

Mean-Field 1 (mf1):  e_k ≈ [ Σ_i exp( −(μ_k − μ_i) / √(1 + λ (s_k + s_i)) ) ]^{-1}   (12)

Mean-Field 2 (mf2):  e_k ≈ [ Σ_i exp( −(μ_k − μ_i) / √(1 + λ (s_k + s_i − 2 s_ik)) ) ]^{-1}   (13)

The three approximations differ in how much information from i ≠ k is considered: not considering s_i, considering its variance s_i, and considering its covariance s_ik. Note that mf0 and mf1 are computationally preferred over mf2, which uses the full K × K covariance, where K is the number of classes. ([12] derived an approximation in the form of mf2 but did not apply it to uncertainty estimation.)

Intuition. The simple form of the mean-field approximations makes it possible to understand them intuitively. We focus on mf0. We first rewrite it in the familiar "softmax" form:

(mf0)  e_k ≈ exp( μ_k / √(1 + λ s_k) ) / Σ_i exp( μ_i / √(1 + λ s_k) ) ≜ ê_k.

Note that this form is similar to a "softmax" with a temperature scaling: P(y = k) ∝ exp{μ_k / T}. However, there are several important differences. In mf0, the temperature scaling factor is category-specific:

ê_k ∝ exp{ μ_k / T_k } with T_k = √(1 + λ s_k).   (14)

Importantly, the factor T_k depends on the variance of the category. For a prediction with high variance, the temperature for that category is high, reducing the corresponding "probability" ê_k. Specifically, ê_k → 1/K as s_k → +∞.
In other words, the scaling factor is both category-specific and data-dependent, providing additional flexibility over a global temperature scaling factor.
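The three approximations need only the logit mean μ and covariance S from eq. (9). A minimal sketch, assuming λ = π/8 (one of the common choices above) and hypothetical function names; it returns the unnormalized e_k of eqs. (11)–(13), to be normalized as described next:

```python
import numpy as np

LAMBDA = np.pi / 8.0  # one common choice for the sigmoid-Gaussian constant

def mean_field_softmax(mu, S, order=0):
    """Mean-field approximations (11)-(13) to E[softmax(a)] for a ~ N(mu, S).

    mu: (K,) logit means; S: (K, K) logit covariance; order: 0, 1 or 2.
    Returns the (unnormalized) e_k; normalize afterwards as in §2.4.
    """
    diff = mu[:, None] - mu[None, :]               # diff[k, i] = mu_k - mu_i
    s = np.diag(S)
    if order == 0:                                  # mf0: only the variance of a_k
        denom = 1.0 + LAMBDA * s[:, None]
    elif order == 1:                                # mf1: independent marginals of a_k, a_i
        denom = 1.0 + LAMBDA * (s[:, None] + s[None, :])
    else:                                           # mf2: full covariance of (a_k, a_i)
        denom = 1.0 + LAMBDA * (s[:, None] + s[None, :] - 2.0 * S)
    return 1.0 / np.exp(-diff / np.sqrt(denom)).sum(axis=1)
```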
Implementation nuances
Because of this category-specific temperature scaling, ê_k (and the approximations from mf1 and mf2) is no longer a multinomial probability. Proper normalization should be performed,

e_k ≈ ê_k / Σ_i ê_i ≜ p_k.

Temperature scaling was shown to be useful for obtaining calibrated probabilities [19]. This can be easily included as well with

f_k ∝ exp( a_k / T_act ).   (15)

We can also combine with another "temperature" scaling factor, representing how well the models in the pseudo-ensemble are concentrated,

θ ∼ N(θ*, T_ens Σ).   (16)

Here T_ens is the temperature for the pseudo-ensemble or the posterior [46]. Note that these two temperatures control variability differently. When T_ens → 0, the ensemble focuses on one model. When T_act → 0, each model in the ensemble moves to "hard" decisions, as in eq. (7). Using mf0 as an example,

e_k = [ Σ_i exp( −(μ_k − μ_i) / √(T_act² + λ T_ens s_k) ) ]^{-1}   (17)

where s_k is computed at T_ens = 1. Empirically, we can tune the temperatures T_ens and T_act as hyper-parameters on a held-out set, to optimize the predictive performance.
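A sketch of the temperature-augmented mf0 of eq. (17), including the normalization; it assumes the ensemble temperature scales the pseudo-ensemble covariance as T_ens · Σ (eq. (16)) and is illustrative rather than the authors' implementation:

```python
import numpy as np

def mf0_with_temperatures(mu, s_diag, t_act=1.0, t_ens=1.0, lam=np.pi / 8.0):
    """mf0 with activation and ensemble temperatures (eq. (17)), then normalization to p_k."""
    diff = mu[:, None] - mu[None, :]                          # (K, K): mu_k - mu_i
    scale = np.sqrt(t_act ** 2 + lam * t_ens * s_diag[:, None])
    e = 1.0 / np.exp(-diff / scale).sum(axis=1)               # unnormalized e_k
    return e / e.sum()                                        # normalized p_k
```

Setting t_act = t_ens = 1 recovers eq. (11) up to normalization, and t_ens = 0 with t_act = 1 collapses to the usual softmax of the MLE.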
Computation complexity and scalability.
The bulk of the computation, as in Bayesian approximate inference, lies in the computation of H^{-1} in Σ_I and Σ_B, as in eq. (5) and eq. (6), or more precisely the product between the inverse Hessian and vectors, cf. eq. (9). For smaller models, computing and storing H^{-1} exactly is attractive. For large models, one can compute the inverse Hessian-vector product approximately using multiple Hessian-vector products [40, 1, 25]. Alternatively, we can approximate the inverse Hessian using Kronecker factorization [41]. In short, any advances in computing the inverse Hessian and related quantities can be used to accelerate the computation needed in this paper.
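As one concrete instance of such an approximation, the sketch below solves (H + εI) x = v with conjugate gradients using only Hessian-vector products (e.g., via Pearlmutter's trick [40]); it is an illustrative, matrix-free alternative to forming H^{-1} explicitly and is not taken from [1, 25]:

```python
import numpy as np

def inverse_hvp(hvp_fn, v, damping=1e-3, iters=100, tol=1e-8):
    """Solve (H + damping * I) x = v with conjugate gradients.

    hvp_fn(u) must return H @ u; only Hessian-vector products are needed,
    never the Hessian itself.
    """
    x = np.zeros_like(v)
    r = v - (hvp_fn(x) + damping * x)   # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        hp = hvp_fn(p) + damping * p
        alpha = rs / (p @ hp)
        x += alpha * p
        r -= alpha * hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```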
An application example.
In Algorithm 1, we exemplify the application of the mean-field approximation to Gaussian-softmax integration in uncertainty estimation. For brevity, we consider the last-layer/fully-connected layer's parameters W as a part of the deep neural network's parameters θ = (φ, W). We assume the mapping g(x; φ) prior to the fully-connected layer is deterministic.
Algorithm 1: Uncertainty estimate with Mean-Field Infinitesimal Jackknife (mfIJ)
Input: a deep neural net model θ* = (φ*, W*), a training set D, temperatures T_ens and T_act, and a test point x.
1. Compute the pre-fully-connected-layer features g(x; φ*).
2. Compute the moments μ, S of the logits via eq. (9), using either of the following two approaches:
   (a) compute H^{-1}(W*) on D and the inverse Hessian-vector products H^{-1}(W*) v exactly, or
   (b) compute the inverse Hessian-vector products approximately [1].
3. Compute e_k via the mean-field approximation eq. (11), (12) or (13), and normalize e_k to obtain p_k.
Output: the predictive uncertainty p_k.
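A schematic last-layer implementation of Algorithm 1, assuming the pseudo-ensemble covariance of the vectorized softmax weights is already available (all names are hypothetical; the Kronecker construction of the logit Jacobian follows the vectorization remark of §2.3):

```python
import numpy as np

def mfij_predict(features, W_star, Sigma_W, t_act=1.0, t_ens=1.0, lam=np.pi / 8.0):
    """Algorithm 1 for the last layer: logit moments via eq. (9), then mf0 + normalization.

    features: (D,) pre-softmax features g(x; phi*);
    W_star:   (D, K) softmax weights;
    Sigma_W:  (D*K, D*K) pseudo-ensemble covariance of vec(W), columns stacked.
    """
    K = W_star.shape[1]
    mu = features @ W_star                                    # (K,) logit means
    # Jacobian of the logits w.r.t. vec(W): logit a_k depends only on column k of W.
    G = np.kron(np.eye(K), features[None, :])                 # (K, D*K)
    S = G @ Sigma_W @ G.T                                     # (K, K) logit covariance
    s = np.diag(S)
    diff = mu[:, None] - mu[None, :]
    scale = np.sqrt(t_act ** 2 + lam * t_ens * s[:, None])
    e = 1.0 / np.exp(-diff / scale).sum(axis=1)
    return e / e.sum()                                        # predictive uncertainty p_k
```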
Resampling methods, such as jackknife and bootstrap, are classical statistical tools for assessing confidence intervals [44, 37, 15, 14]. Recent results have shown that carefully designed jackknife estimators [5] can achieve worst-case coverage guarantees on regression problems. However, the exhaustive re-training in jackknife or bootstrap can be cumbersome in practice. Recent works have leveraged the idea of influence functions [25, 11] to alleviate the computational challenge. [42] combines the influence function with random data weight samples to approximate the variance of predictions in bootstrapping; [2] derives a higher-order influence function approximation for jackknife estimators. Theoretical properties of the approximation are studied in [17, 18]. [34] applies this approximation to identify underdetermined test points. Infinitesimal jackknife follows the same idea as those works [24, 17, 18]. To avoid explicitly storing those models, we seek a Gaussian distribution to approximately characterize the model. This connects to several existing lines of research.

The posterior of a Bayesian model can be approximated with a Gaussian distribution [32, 41]: θ ∼ N(θ*, H^{-1}), where θ* is interpreted as the maximum a posteriori estimate (thus incorporating any prior the model might want to include). If we approximate the observed Fisher information matrix J in eq. (5) with the Hessian H, the two pseudo-ensemble distributions and the Laplace approximation to the posterior have identical forms except that the Hessians are scaled differently, which can be captured by the ensemble temperature of eq. (16). Note that despite the similarity, infinitesimal jackknife is a "frequentist" method and does not assume well-formed Bayesian modeling.

The trajectory of stochastic gradient descent gives rise to a sequence of models where the covariance matrix among batch means converges to H^{-1} J H^{-1} [9, 10, 33], similar to the pseudo-ensemble distributions in form. But note that those approaches do not collect information around the maximum likelihood estimate, while we do. It is also a classical result that the maximum likelihood estimator converges in distribution to a normal distribution; θ* and (1/N) H^{-1} are simply the plug-in estimators of the true mean and covariance of this asymptotic distribution.

We first describe the setup for our empirical studies. We then demonstrate the effectiveness of the proposed approach on the MNIST dataset, mainly contrasting the results from sampling to those from the mean-field approximation and other implementation choices. We then provide a detailed comparison to popular approaches for uncertainty estimation. In the main text, we focus on classification problems. We evaluate on commonly used benchmark datasets, summarized in Table 1. In Appendix B.5, we report results on regression tasks.
For MNIST, we train a two-layer MLP with 256 ReLU units per layer, using the Adam optimizer for 100 epochs. For CIFAR-10, we train a ResNet-20 with Adam for 200 epochs. On CIFAR-100, we train a DenseNet-BC-121 with the SGD optimizer for 300 epochs. For ILSVRC-2012, we train a ResNet-50 with the SGD optimizer for 90 epochs.
Evaluation tasks and metrics.
We evaluate on two tasks: predictive uncertainty on in-domain samples, and detection of out-of-distribution samples. For in-domain predictive uncertainty, we report the classification error rate (ε), negative log-likelihood (NLL), and expected calibration error in ℓ1 distance (ECE) [19] on the test set. NLL is a proper scoring rule [28], and measures the KL-divergence between the ground-truth data distribution and the predictive distribution of the classifiers. ECE measures the discrepancy between the histogram of the probabilities predicted by the classifiers and the observed ones in the data; properly calibrated classifiers will yield matching histograms. Both metrics are commonly used in the literature, and the lower the better. In Appendix B.1, we give precise definitions.

Table 1: Datasets for Classification Tasks and Out-of-Distribution (OOD) Detection. ††: the number of samples is limited; best results on held-out are reported.

On the task of out-of-distribution (OOD) detection, we assess how well p(x), the classifier's output being interpreted as probability, can be used to distinguish invalid samples from normal in-domain images. Following common practice [20, 31, 30], we report two threshold-independent metrics: area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPR). Since the precision-recall curve is sensitive to the choice of positive class, we report both "AUPR in" and "AUPR out", where in-distribution and out-of-distribution images are specified as positives, respectively. We also report detection accuracy, the optimal accuracy achieved among all thresholds in classifying in-/out-of-domain samples. All three metrics are the higher the better.
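For reference, all three OOD metrics can be computed from the maximum predicted probability with standard tooling; a sketch assuming scikit-learn, with in-domain examples as the positive class for the "in" variant:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_metrics(p_in, p_out):
    """p_in / p_out: (N, K) predictive probabilities on in- and out-of-domain test sets."""
    scores = np.concatenate([p_in.max(axis=1), p_out.max(axis=1)])
    labels = np.concatenate([np.ones(len(p_in)), np.zeros(len(p_out))])
    auroc = roc_auc_score(labels, scores)
    aupr_in = average_precision_score(labels, scores)          # "in" as positives
    aupr_out = average_precision_score(1 - labels, -scores)    # "out" as positives
    # Detection accuracy: best threshold over the pooled scores.
    acc = max(((scores >= t) == labels).mean() for t in np.unique(scores))
    return acc, auroc, aupr_in, aupr_out
```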
Competing approaches for uncertainty estimation.
We compare to popular approaches: (i) frequentist approaches: the point estimate of the maximum likelihood estimator (MLE) as a baseline, the temperature scaling calibration (T.SCALE) [19], the deep ensemble method (ENSEMBLE) [28], and the resampling bootstrap RUE [42]; (ii) variants of Bayesian neural networks (BNN): DROPOUT [16] approximates the Bayesian posterior using stochastic forward passes of the network with dropout; BNN (VI) trains the network via stochastic variational inference to maximize the evidence lower bound; BNN (KFAC) applies the Laplace approximation to construct a Gaussian posterior, via layer-wise Kronecker-product factorized covariance matrices [41].
Hyper-parameter tuning.
For in-domain uncertainty estimation, we use the NLL on the held-out sets to tune hyper-parameters. For the OOD detection task, we use AUROC on the held-out set to select hyper-parameters. We report the results of the best hyper-parameters on the test sets. The key hyper-parameters are the temperatures, the regularization or prior in BNN methods, and dropout rates.
Other implementation details.
When the Hessian needs to be inverted, we add a dampening term, H̃ = (H + εI), following [41, 42], to ensure positive semi-definiteness and that the smallest eigenvalue of H̃ is 1. For BNN (VI), we use Flipout [45] to reduce gradient variances and follow [43] for variational inference on deep ResNets. On ImageNet, we compute the Kronecker-product factorized Hessian matrix, rather than the full one, due to the high dimensionality. For BNN (KFAC) and RUE, we use mini-batch approximations on subsets of the training set to scale up on ImageNet, as suggested in [41, 42].
Most uncertainty quantification methods, including the ones proposed in this paper, have "knobs" to tune. We first concentrate on MNIST and perform extensive studies of the proposed approaches to understand several design choices. Table 2 contrasts them.
Use all layers or just the last layer.
Uncertainty quantification on deep neural nets is computationally costly, given their large number of parameters, especially when the methods need information on the curvatures of the loss functions. To this end, many approaches assume layer-wise independence [41] and low-rank components [33, 38] and, in some cases, restrict uncertainty quantification to only a few layers [48], in particular, the last layer [26]. The top portion of Table 2 shows that restricting to the last layer harms the in-domain ECE slightly but improves OOD significantly.

Table 2: Performance of the infinitesimal jackknife pseudo-ensemble distribution on MNIST. Rows: all layers vs. the last layer, each with sampling (§2.2, M = 500) and the mean-field approximations mf0, mf1, mf2. Columns: MNIST in-domain ε (%), NLL, ECE (%), and NotMNIST OOD detection Acc. (%), AUROC, AUPR (in : out).

Figure 1: Best viewed in colors. (a) NLL (↓) on held-out and (b) AUROC (↑) on in-/OOD held-out, as functions of the activation and ensemble temperatures; the yellow star marks the best pair of temperatures. (c) Calibration (ECE) under distribution shift on rotated MNIST for MLE, temp. scale, ensemble, RUE, dropout, BNN-VI, BNN-KFAC, and mf-IJ. See text for details.
Effectiveness of mean-field approximation.
Table 2 also shows that the mean-field approximation has similar performance as sampling (from the distribution) on the in-domain tasks but noticeably improves on OOD detection. mf0 performs the best among the three variants.
Effect of ensemble and activation temperatures.
We study the roles of the ensemble and activation temperatures in mfIJ (§2.4). We grid-search the two and generate the heatmaps of NLL and AUROC on the held-out sets, shown in Figure 1. Note that (T_ens = 0, T_act = 1) corresponds to MLE. What is particularly interesting is that for NLL, a higher activation temperature and a lower ensemble temperature work the best. For AUROC, however, lower temperatures on both work best. That lower T_ens is preferred was also observed in [46], and using T_act > 1 for better calibration is noted in [19]. On the other end, for OOD detection, [31] suggests a very high activation temperature (T_act = 1000 in their work, likely due to using a single model instead of an ensemble).

Given our results in the previous section, we report mf0 of infinitesimal jackknife in the rest of the paper. Table 3 contrasts various methods on in-domain tasks of MNIST, CIFAR-100, and ImageNet. Table 4 contrasts performances on the out-of-distribution detection task (OOD). Results on CIFAR-10 (both in-domain and OOD), as well as CIFAR-100 OOD on SVHN, are in Appendix B.4.

While deep ensemble [28] achieves the best performance on in-domain tasks most of the time, the proposed approach mfIJ typically outperforms other approaches, especially on the calibration metric ECE. On the OOD detection task, mfIJ significantly outperforms all other approaches in all metrics.

ImageNet-O is a particularly hard dataset for OOD [21]. The images are from ImageNet-22K samples and thus share similar low-level statistics as the in-domain data. Moreover, the images are chosen such that they are misclassified by existing networks (ResNet-50) with high confidence, the so-called "natural adversarial examples". We follow [21] to use the 200-class subset of the test set, which are the confusing classes to the OOD images, as the in-distribution examples. [21] further demonstrates that many popular approaches to improve neural network robustness, like adversarial training, hardly help on ImageNet-O. mfIJ improves over other baselines by a margin.

Table 3: Comparing different uncertainty estimate methods on in-domain tasks (lower is better).
Rows: MLE, T.SCALE, ENSEMBLE, RUE, DROPOUT, BNN (VI), BNN (KFAC), mfIJ. Columns: ε (%), NLL, and ECE (%) on each of MNIST, CIFAR-100, and ImageNet.
For CIFAR-100 and ImageNet, mfIJ uses only the last layer due to the high computational cost.
Table 4: Comparing different methods on out-of-distribution detection (higher is better)
Rows: MLE, T.SCALE, ENSEMBLE, RUE, DROPOUT, BNN (VI), BNN (KFAC), mfIJ. Columns: detection accuracy (Acc., %), AUROC, and AUPR (in : out) for MNIST vs. NotMNIST, CIFAR-100 vs. LSUN, and ImageNet vs. ImageNet-O.
†: Accuracy (%). ‡: Area under ROC. $: Area under Precision-Recall with "in" vs. "out" domains flipped. For CIFAR-100 and ImageNet, mfIJ uses only the last layer due to the high computational cost.

Robustness to distributional shift. [43] points out that many uncertainty estimation methods are sensitive to distributional shift. We therefore evaluate the robustness of mfIJ on rotated MNIST images over a range of rotation angles. The ECE curves in Figure 1(c) show that mfIJ is better than or as robust as other approaches.

We propose a simple, efficient, and general-purpose confidence estimator, mfIJ, for deep neural networks. The main idea is to approximate the ensemble of an infinite number of infinitesimal jackknife samples with a closed-form Gaussian distribution, and to derive an efficient mean-field approximation to classification predictions, where the softmax layer is applied to Gaussian random variables. Empirically, mfIJ surpasses or is competitive with the state-of-the-art methods for uncertainty estimation while incurring lower computational cost and memory footprint.
References
[1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization in linear time. stat, 1050:15, 2016.
[2] Ahmed M. Alaa and Mihaela van der Schaar. The discriminative jackknife: Quantifying predictive uncertainty via higher-order influence functions, 2019. URL https://openreview.net/forum?id=H1xauR4Kvr.
[3] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470, 2020.
[4] David Barber and Christopher M Bishop. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems, pages 395–401, 1998.
[5] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928, 2019.
[6] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[7] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[8] Yaroslav Bulatov. NotMNIST dataset. Google (Books/OCR), Tech. Rep. [Online]. Available: http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html, 2011.
[9] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
[10] Xi Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. arXiv preprint arXiv:1610.08637, 2016.
[11] R Dennis Cook and Sanford Weisberg. Residuals and Influence in Regression. New York: Chapman and Hall, 1982.
[12] Jean Daunizeau. Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables. arXiv preprint arXiv:1703.00091, 2017.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[14] Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, pages 569–593. Springer, 1992.
[15] Bradley Efron and Charles Stein. The jackknife estimate of variance. The Annals of Statistics, pages 586–596, 1981.
[16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[17] Ryan Giordano, Will Stephenson, Runjing Liu, Michael I Jordan, and Tamara Broderick. A Swiss army infinitesimal jackknife. arXiv preprint arXiv:1806.00550, 2018.
[18] Ryan Giordano, Michael I Jordan, and Tamara Broderick. A higher-order Swiss army infinitesimal jackknife. arXiv preprint arXiv:1907.12116, 2019.
[19] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.
[20] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[21] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. arXiv preprint arXiv:1907.07174, 2019.
[22] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
[23] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
[24] Louis A Jaeckel. The Infinitesimal Jackknife. Bell Telephone Laboratories, 1972.
[25] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pages 1885–1894. JMLR.org, 2017.
[26] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being Bayesian, even just a bit, fixes overconfidence in ReLU networks. arXiv preprint arXiv:2002.10118, 2020.
[27] A Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
[28] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6405–6416, 2017.
[29] Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist.
[30] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7165–7175, 2018.
[31] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
[32] David JC MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1992.
[33] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143, 2019.
[34] David Madras, James Atwood, and Alex D'Amour. Detecting extrapolation with local ensembles. arXiv preprint arXiv:1910.09573, 2019.
[35] Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.
[36] James Martens. Deep learning via Hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
[37] Rupert G Miller. The jackknife: a review. Biometrika, 61(1):1–15, 1974.
[38] Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pages 6245–6255, 2018.
[39] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Ng. Reading digits in natural images with unsupervised feature learning. NIPS, 2011.
[40] Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
[41] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018.
[42] Peter Schulam and Suchi Saria. Can you trust this prediction? Auditing pointwise reliability after learning. arXiv preprint arXiv:1901.00403, 2019.
[43] Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980, 2019.
[44] John Tukey. Bias and confidence in not quite large samples. Ann. Math. Statist., 29:614, 1958.
[45] Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, and Roger Grosse. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386, 2018.
[46] Florian Wenzel, Kevin Roth, Bastiaan S Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? arXiv preprint arXiv:2002.02405, 2020.
[47] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[48] Jiaming Zeng, Adam Lesnikowski, and Jose M Alvarez. The relevance of Bayesian layer positioning to model uncertainty in deep Bayesian active learning. arXiv preprint arXiv:1811.12535, 2018.
A Mean-Field Approximation for Gaussian-softmax Integration
In this section, we derive the mean-field approximation for Gaussian-softmax integration, eq. (10) in the main text. Assume the same notation as in §2.3, where the activation to the softmax follows a Gaussian a ∼ N(μ, S).

e_k = E[f_k] = ∫ SOFTMAX(a)_k N(a; μ, S) da
    = ∫ [ 1 + Σ_{i≠k} e^{−(a_k − a_i)} ]^{-1} N(a; μ, S) da
    = ∫ [ 2 − K + Σ_{i≠k} σ(a_k − a_i)^{-1} ]^{-1} N(a; μ, S) da
    ≈ [ 2 − K + Σ_{i≠k} ( E_{p(a_i, a_k)}[σ(a_k − a_i)] )^{-1} ]^{-1}   (A.1)

where the last step integrates each term in the summand independently, resulting in expectations with respect to the marginal distribution over the pair (a_i, a_k). This approximation is prompted by the mean-field approximation E[f(x)] ≈ f(E[x]) for a nonlinear function f(·). Similar to the classical use of the mean-field approximation on Ising models, we use the term mean-field approximation to capture the notion that the expectation is computed by considering the weak, pairwise coupling effect from points on the lattice, i.e., a_i with i ≠ k.

Next we plug in the approximation to E[σ(·)], where σ(·) is the sigmoid function, which states that

∫ σ(x) N(x; μ, s) dx ≈ σ( μ / √(1 + λ s) ).   (A.2)

λ is a constant and is usually chosen to be π/8 or 3/π². This is a well-known result, see [6]. We further approximate by considering different ways to compute the bivariate expectations in the denominator.

Mean-Field 0 (mf0). In the denominator, we ignore the variance of a_i for i ≠ k and replace a_i with its mean μ_i, and compute the expectation only with respect to a_k. We arrive at

e_k ≈ [ 2 − K + Σ_{i≠k} ( E_{p(a_k)}[σ(a_k − μ_i)] )^{-1} ]^{-1}.

Applying eq. (A.2), we have

e_k ≈ [ 2 − K + Σ_{i≠k} σ( (μ_k − μ_i) / √(1 + λ s_k) )^{-1} ]^{-1} = [ Σ_i exp( −(μ_k − μ_i) / √(1 + λ s_k) ) ]^{-1}.   (A.3)

Mean-Field 1 (mf1). If we replace p(a_i, a_k) with the two independent marginals p(a_i) p(a_k) in the denominator, recognizing (a_k − a_i) ∼ N(μ_k − μ_i, s_i + s_k), we get

e_k ≈ [ 2 − K + Σ_{i≠k} σ( (μ_k − μ_i) / √(1 + λ (s_i + s_k)) )^{-1} ]^{-1} = [ Σ_i exp( −(μ_k − μ_i) / √(1 + λ (s_k + s_i)) ) ]^{-1}.   (A.4)

Mean-Field 2 (mf2). Lastly, if we compute eq. (A.1) with the full covariance between a_i and a_k, recognizing (a_k − a_i) ∼ N(μ_k − μ_i, s_i + s_k − 2 s_ik), we get

e_k ≈ [ 2 − K + Σ_{i≠k} σ( (μ_k − μ_i) / √(1 + λ (s_i + s_k − 2 s_ik)) )^{-1} ]^{-1} = [ Σ_i exp( −(μ_k − μ_i) / √(1 + λ (s_k + s_i − 2 s_ik)) ) ]^{-1}.   (A.5)

We note that [12] has developed the approximation form of eq. (A.5) for computing e_k, though the author did not use it for uncertainty estimation.
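A quick numerical sanity check of (A.2)–(A.3) against Monte-Carlo integration (illustrative only; λ = π/8 assumed, random moments):

```python
import numpy as np

rng = np.random.default_rng(0)
K, lam = 5, np.pi / 8.0
mu = rng.normal(size=K)
A = rng.normal(size=(K, K))
S = A @ A.T / K                                    # a random PSD logit covariance

# Monte-Carlo estimate of E[softmax(a)], a ~ N(mu, S)
a = rng.multivariate_normal(mu, S, size=200_000)
mc = np.exp(a - a.max(axis=1, keepdims=True))
mc = (mc / mc.sum(axis=1, keepdims=True)).mean(axis=0)

# mf0 of eq. (A.3), then normalized as in §2.4
s = np.diag(S)
diff = mu[:, None] - mu[None, :]
mf0 = 1.0 / np.exp(-diff / np.sqrt(1.0 + lam * s[:, None])).sum(axis=1)
mf0 /= mf0.sum()

print(np.round(mc, 3), np.round(mf0, 3))
```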
B Experiments

B.1 Definitions of evaluation metrics

NLL is defined as the KL-divergence between the data distribution and the model's predictive distribution,

NLL = −log p(y|x) = −Σ_{c=1}^K y_c log p(y = c | x)   (B.1)

where y_c is the one-hot embedding of the label.

ECE measures the discrepancy between the predicted probabilities and the empirical accuracy of a classifier in terms of ℓ1 distance. It is computed as the expected difference between per-bucket confidence and per-bucket accuracy, where all predictions {p(x_i)}_{i=1}^N are binned into S buckets such that B_s = {i ∈ [N] | p(x_i) ∈ I_s} are the predictions falling within the interval I_s. ECE is defined as

ECE = Σ_{s=1}^S (|B_s| / N) |conf(B_s) − acc(B_s)|,   (B.2)

where conf(B_s) = Σ_{i∈B_s} p(x_i) / |B_s| and acc(B_s) = Σ_{i∈B_s} 1[ŷ_i = y_i] / |B_s|.
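A sketch of (B.2) with equal-width confidence buckets (the bucketing scheme is an assumption; the text only specifies the number of buckets B):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_buckets=10):
    """probs: (N, K) predictive probabilities; labels: (N,) integer targets."""
    conf = probs.max(axis=1)                       # per-example confidence p(x_i)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # |B_s|/N * |conf(B_s) - acc(B_s)|
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```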
B.2 Details of experiments in the main text

Table B.5 provides key hyper-parameters used in training deep neural networks on different datasets.

Table B.5: Hyper-parameters of neural network training
Dataset:             MNIST | CIFAR-10 | CIFAR-100 | ImageNet
Architecture:        MLP | ResNet-20 | DenseNet-BC-121 | ResNet-50
Optimizer:           Adam | Adam | SGD | SGD
Learning rate:       0.001 | 0.001 | 0.1 | 0.1
Learning rate decay: exponential | staircase | staircase | staircase
Batch size:          100 | 8 | 128 | 256
Epochs:              100 | 200 | 300 | 90

For the mfIJ method, we use λ = π/8 in our implementation. For the ENSEMBLE approach, we use M = 5 models on all datasets as in [43]. For RUE, DROPOUT, BNN (VI) and BNN (KFAC), where sampling is applied at inference time, we use M = 500 Monte-Carlo samples on MNIST, and M = 50 on CIFAR-10, CIFAR-100 and ImageNet. We use B = 10 buckets when computing ECE on MNIST, CIFAR-10 and CIFAR-100, and B = 15 on ImageNet.

B.3 Apply the mean-field approximation to other Gaussian posterior inference tasks
Table B.6: Uncertainty estimation with SWAG on MNIST. Rows: sampling (M samples) and the mean-field approximations mf0, mf1, mf2. Columns: MNIST in-domain ε (%), NLL, ECE, and NotMNIST OOD detection Acc. (%), AUROC, AUPR (in : out).

We combine the mean-field approximation with the SWA-Gaussian posterior (SWAG) [33], whose covariance is a low-rank matrix plus a diagonal matrix derived from the SGD iterates. We tune the ensemble and activation temperatures with SWAG to perform the uncertainty tasks on MNIST, and the results are reported in Table B.6. We ran SWAG on the softmax layer for 50 epochs, and collect models along the trajectory to form the posterior. We use the default values for other hyper-parameters, like the SWAG learning rate and the rank of the low-rank component, as the main objective here is to combine the mean-field approximation with different Gaussian posteriors.

As we can see from Table B.6, SWAG, when using sampling for approximate inference, has a variance larger than expected (in particular, higher than the variance in the sampling results of the infinitesimal jackknife, cf. Table 2 in the main text). Nonetheless, within the variance, using the mean-field approximation instead of sampling performs similarly. The notable exception is the in-domain tasks, where the mean-field approximation consistently outperforms sampling. This suggests that the mean-field approximations can work with other Gaussian distributions as a replacement for sampling to reduce the computation cost.
Table B.7: Uncertainty estimation on CIFAR-10. Rows: MLE, T.SCALE, ENSEMBLE, RUE, DROPOUT, BNN (VI), BNN (LL-VI), BNN (KFAC), mfIJ. Columns: CIFAR-10 in-domain ε (%), NLL, ECE (%); LSUN OOD detection Acc., AUROC, AUPR (in : out); SVHN OOD detection Acc., AUROC, AUPR (in : out).

Table B.8: Out-of-distribution detection on CIFAR-100. Rows: MLE, RUE, BNN (LL-VI), BNN (KFAC), mfIJ. Columns: SVHN OOD detection Acc., AUROC, AUPR (in : out).
Table B.9: Regression results on UCI datasets (NLL and RMSE, lower is better; mean ± std over 20 folds), comparing ENSEMBLE [28]†, RUE [42], DROPOUT [16]†, BNN (PBP) [22]†, BNN (KFAC) [41], and mfIJ on datasets including Housing, Concrete, Kin8nm, Naval, Power, and Yacht. †: numbers are cited from the original paper.

B.4 More results on CIFAR-10 and CIFAR-100
Tables B.7 and B.8 supplement the main text with additional experimental results on the CIFAR-10 dataset, with both in-domain and out-of-distribution detection tasks, and on CIFAR-100 with out-of-distribution detection using the SVHN dataset. BNN (LL-VI) refers to stochastic variational inference on the last layer only. The results support the same observations as in the main text: mfIJ performs similarly to other approaches on in-domain tasks, but noticeably outperforms them on out-of-distribution detection.

B.5 Regression experiments
We conduct real-world regression experiments on UCI datasets. We follow the experimental setup in [16, 22, 28], where each dataset is split into 20 train-test folds randomly, and the average results with the standard deviation are reported. We use the same architecture as previous works, with 1 hidden layer of 50 ReLU units. Since in regression tasks the output distribution can be computed analytically, we do not use the mean-field approximation in these experiments. Nevertheless, we still refer to our method as mfIJ to be consistent with the other datasets.

For the mfIJ approach, we use the pseudo-ensemble Gaussian distribution of the last-layer parameters to compute a Gaussian distribution of the network output, i.e., a ∼ N(μ(x), s(x)) as in eq. (9) of the main text. We also estimate the variance ε² of the observation noise from the residuals on the training set [42]. Therefore, the predictive distribution of f(x) is given by N(μ(x), s(x) + ε²). We can compute the negative log-likelihood (NLL) as,

NLL = −log p(y|x) = (1/2) [ log(s(x) + ε²) + (y − μ(x))² / (s(x) + ε²) + log(2π) ].

We tune the ensemble temperature on the held-out sets, and fix the activation temperature to 1. For sampling-based methods, RUE and
KFAC, the NLL from M prediction samples f_1(x), ..., f_M(x) can be computed as,

NLL = −log p(y|x) = (1/2) [ log(s̄_m + ε²) + (y − μ̄_m)² / (s̄_m + ε²) + log(2π) ]

where μ̄_m = (1/M) Σ_m f_m(x) is the mean of the prediction samples, and s̄_m = (1/M) Σ_m (f_m(x) − μ̄_m)² is the variance of the prediction samples. The sampling is conducted on all network layers in both RUE and
KFAC.
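For concreteness, a sketch of the two NLL computations above, with hypothetical variable names (y is a scalar target and `samples` stacks the M predictions):

```python
import numpy as np

def gaussian_nll(y, mean, var):
    """NLL of y under N(mean, var); used for the mfIJ predictive N(mu(x), s(x) + eps^2)."""
    return 0.5 * (np.log(var) + (y - mean) ** 2 / var + np.log(2.0 * np.pi))

def sampling_nll(y, samples, noise_var):
    """NLL from M prediction samples f_1(x), ..., f_M(x), as for RUE and KFAC."""
    mu_m = samples.mean(axis=0)          # mean of the prediction samples
    s_m = samples.var(axis=0)            # variance of the prediction samples
    return gaussian_nll(y, mu_m, s_m + noise_var)
```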
Results. We report the NLL and RMSE on the test set in Table B.9 and compare with other approaches. RUE and BNN (KFAC) use M = 50 samples and the ENSEMBLE method has M = 5 models in the ensemble. We achieve similar or better performances than RUE and KFAC without the computation cost of explicit sampling. Compared to DROPOUT and ENSEMBLE, which are the state-of-the-art methods on these datasets, we also achieve competitive results. We highlight the best performing method on each dataset, and mfIJ when its result is within one standard deviation of the best method.
B.6 More results on comparing all layers versus just the last layer.
We conduct an experiment similar to the top portion of Table 2 in the main text, to study the effect of restricting the parameter uncertainty to the last layer only. We use NotMNIST as the in-domain dataset and treat MNIST as the out-of-distribution dataset. We use a two-layer MLP with 256 ReLU hidden units. Table B.10 supports the same observation as in the main text: restricting to the last layer improves on the OOD task while it does not have a significant negative impact on the in-domain tasks.

Table B.10: Performance of the infinitesimal jackknife pseudo-ensemble distribution on NotMNIST, comparing all layers versus just the last layer. Rows: all layers vs. last layer, each with sampling (M = 500), mf0, mf1, mf2. Columns: NotMNIST in-domain ε (%), NLL, ECE, and MNIST OOD detection Acc. (%), AUROC, AUPR (in : out).