Revisiting Explicit Regularization in Neural Networks for Well-Calibrated Predictive Uncertainty
Taejong Joo
ESTsoft, Republic of Korea — [email protected]
Uijung Chung
ESTsoft, Republic of Korea — [email protected]
Abstract
From the statistical learning perspective, complexity control via explicit regularization is a necessity for improving the generalization of over-parameterized models, as it deters the memorization of intricate patterns that exist only in the training data. However, the impressive generalization performance of over-parameterized neural networks with only implicit regularization challenges this traditional role of explicit regularization. Furthermore, explicit regularization does not prevent neural networks from memorizing unnatural patterns, such as random labels, that cannot be generalized. In this work, we revisit the role and importance of explicit regularization methods for generalizing the predictive probability, not just the 0-1 loss. Specifically, we present extensive empirical evidence showing the versatility of explicit regularization techniques for improving the reliability of the predictive probability, which enables better uncertainty representation and prevents the overconfidence problem. Our findings present a new direction for improving the predictive probability quality of deterministic neural networks, unlike the mainstream approaches that concentrate on building stochastic representations with Bayesian neural networks, ensemble methods, and hybrid models.
Introduction

As deep learning models have become pervasive in real-world systems, the importance of producing a reliable predictive probability is increasing, as it results in well-calibrated behavior and a better uncertainty representation ability. Calibrated behavior refers to the ability to match the predictive probability of an event to the long-term frequency of the event's occurrence [1], and the uncertainty representation ability refers to the ability to represent uncertainty about predictions. From the deep learning perspective, a reliable predictive probability benefits many downstream tasks such as anomaly detection [2], classification with rejection [3], and exploration in reinforcement learning [4]. More importantly, considering the deep learning system as cognitive automation, the reliable predictive probability plays a significant role in building users' trust in the automation [5], preventing misuse and disuse of the automation [6], and eventually preventing catastrophic automation failure. Unfortunately, neural networks are prone to being overconfident and lack the uncertainty representation ability, and this problem has become a fundamental concern in the deep learning community.

Bayesian methods have an inborn ability to produce reliable predictive probability. Specifically, Bayesian methods express a probability distribution over parameters, in which uncertainty in the parameter space is automatically determined by data. As a result, they can represent uncertainty in prediction by providing rich information, such as the variance and entropy of aggregated predictions from different parameter configurations [7, 8]. From this perspective, deterministic neural networks, which select a single parameter configuration and so cannot provide this rich information, naturally lack the uncertainty representation ability. However, the automatic determination of
parameter uncertainty in the light of data, i.e., the posterior inference, puts significantly more computational overhead compared to deterministic neural networks. Therefore, the mainstream approach to improving predictive probability quality has been the efficient adoption of the Bayesian principle into neural networks, so-called Bayesian neural networks, via novel approximations [9–15] and by implicitly building the posterior from inherent stochasticity [4, 16, 17].

Recent works [3, 18, 19] discover the hidden gems of label smoothing [20], mixup [21], and adversarial training [22] for improving the calibration performance and the uncertainty representation ability. These unexpected findings present a possibility of improving the reliability of the predictive probability without changing the deterministic nature of neural networks. This direction is appealing because it can be applied in a plug-and-play fashion to existing building blocks. This means that such methods can inherit the scalability, computational efficiency, and surprising generalization performance of deterministic neural networks, with which Bayesian neural networks often struggle [23–25].

Motivated by these observations, we present a general direction from the regularization perspective to mitigate the unreliable predictive probability problem, rather than devising constructive heuristics or discovering hidden properties of existing methods. Our contributions can be summarized as follows:

• We identify that the unreliable predictive probability is caused not by the deterministic nature of neural networks, but rather by overconfident predictions on training samples.
• We present a new direction for mitigating the unreliable predictive behavior, which is readily applicable, computationally efficient, and scalable to large-scale models compared to Bayesian neural networks or ensemble methods.
• Our findings give a novel view on the role of regularization for reliable predictive probability, different from the dominant view of its role in improving generalization performance.
Related Work

Recent works show that joint modeling of a generative model p(x) along with a classifier p(y|x), or of p(x, y) itself, produces reliable predictive probability [26–28]. Specifically, Alemi et al. [26] argue that modeling a stochastic hidden representation through the variational information bottleneck principle [29] allows better representation of predictive uncertainty. This can be related to the effectiveness of ensemble methods, which aggregate representations of several models, in enhancing predictive uncertainty representation and confidence calibration [3, 30, 31]. In this regard, hybrid modeling and ensemble methods share a similar principle with Bayesian methods, concerning the stochasticity of the function. However, this paper concentrates on explicit regularization for controlling the predictive confidence, which is fundamentally different from previous works that focus on building a stochastic representation or a stochastic mapping.

Other works concentrate on structural characteristics of neural networks. Specifically, Hein et al. [32] identify the cause of the overconfidence problem based on an analysis of the affine compositional function, e.g., ReLU networks. The basic intuition behind this analysis is that one can always find a multiplier λ for an input x such that a neural network produces one dominant entry on λx. Verma and Swami [33] point out that the region of the highest predictive uncertainty under the softmax forms a subspace in the logit space, so the volume of this area would be negligible. In contrast, our approach suggests that the inherent flaws of these structural characteristics can be easily cured by adding regularization, without changing the existing components of neural networks.

From the perspective of statistical learning theory [34], a regularization method minimizing some form of complexity measure, e.g., Rademacher complexity [35] or VC-dimension [36], is a "must" for achieving better generalization of over-parameterized models, as it prevents the memorization of intricate patterns that exist only in training samples. However, the role of capacity control with explicit regularization is challenged by much empirical evidence in deep learning. Specifically, over-parameterized neural networks achieve impressive generalization performance with only the implicit regularization contained in the optimization procedure [37, 38] or in the structural characteristics [39–41]. Even more, Zhang et al. [42] show that explicit regularization cannot prevent neural networks from easily fitting random labels that cannot be generalized. Therefore, the importance of capacity control with explicit regularization seems questionable in deep learning. In this work, we re-emphasize its importance, presenting a different view on the role of regularization in terms of generalization of the predictive probability, not solely better accuracy.

Figure 1: The unreliable predictive probability of ResNet trained on CIFAR-100. The reliability curve (a) splits predictions into 15 groups based on the predictive confidence, and averages the accuracy and confidence of predictions within each group. The uncertainty plot (b) shows the predictive uncertainty on out-of-distribution samples (SVHN). Dividing each logit by a constant τ controls the smoothness of the softmax output (c). After applying temperature scaling with τ = 2, the predictive confidence becomes closer to its accuracy (a) and the predictive entropy on SVHN samples becomes higher (b).
We consider a classification problem with i.i.d. training samples $D = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$ drawn from an unknown distribution $P_{x,y}$ whose corresponding random variables are $(x, y)$. We denote $X$ as an input space and $Y$ as a set of categories $\{1, 2, \cdots, K\}$. Let $f^{W}: X \to Z$ be a neural network with parameters $W$, where $Z = \mathbb{R}^K$ is a logit space. On top of the logit space, the softmax $\sigma: \mathbb{R}^K \to \triangle^{K-1}$ normalizes the exponential of the logits, giving the interpretation of $\sigma_k(f^W(x))$ as the predictive probability that the label of $x$ belongs to class $k$ [43]:

$$\phi^W_k(x) = \frac{\exp(z_k)}{\sum_i \exp(z_i)}, \quad z = f^W(x) \quad (1)$$

where we let $\phi^W_k(x) = \sigma_k(f^W(x))$ for brevity.

The de-facto standard for training the neural network is minimizing the cross-entropy loss with stochastic gradient descent (SGD) [44]. For a given sample $(x, y)$ and prediction $\phi^W(x)$, the cross-entropy is defined as $l_{CE}(y, \phi^W(x)) = -\sum_k \mathbb{1}_{\{y\}}(k) \log \phi^W_k(x)$, where $\mathbb{1}_A(\omega)$ is an indicator function taking one if $\omega \in A$ and zero otherwise. Then, a loss function of $W$ for the mini-batch samples $D' \subset D$ is computed by $L(W; D') = \hat{E}_{(x,y) \sim D'}[-\log \phi^W_y(x)]$, where $\hat{E}_{D'}$ denotes an empirical mean evaluated on $D'$. Finally, SGD minimizes the loss by updating parameters via $W \leftarrow (1 - \lambda)W - \epsilon \nabla_W L(W; D')$, where $\lambda$ accounts for a weight decay ratio [45] and $\epsilon$ is a learning rate.

While this standard training procedure results in surprising generalization performance, the resulting neural network is often overconfident and lacks the uncertainty representation ability, which deters interpreting the softmax output as the "predictive probability" [4]. Figure 1 illustrates the unreliable predictive behavior of ResNet [46]: the network produces high confidence on misclassified examples (Figure 1 (a)) and provides low predictive entropy on out-of-distribution samples, albeit the samples belong to none of the classes seen during training (Figure 1 (b)).

Notably, recalibrating the log-likelihood on unseen samples after training mitigates this problem dramatically [30]. For instance, given a trained network $f^W$ and an extra dataset $D'$, temperature scaling [47] adjusts a temperature $\tau$ by maximizing the log-likelihood on $D'$:

$$\max_\tau \sum_{(x,y) \in D'} \log \frac{\exp(f^W_y(x)/\tau)}{\sum_j \exp(f^W_j(x)/\tau)} \quad (2)$$

where $\tau$ controls the smoothness of the softmax output and thereby adjusts the predictive confidence (Figure 1 (c)). This simple procedure makes the softmax output more closely resemble the predictive probability.
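For concreteness, the following is a minimal PyTorch sketch of fitting the temperature in Equation (2) on held-out logits; the function name, the choice of L-BFGS, and the variable names are illustrative assumptions rather than the exact procedure used in the paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit a single temperature tau on held-out (logit, label) pairs
    by maximizing the log-likelihood, i.e., Equation (2)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(tau) so that tau > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        tau = log_t.exp()
        loss = F.cross_entropy(logits / tau, labels)  # NLL of the tempered softmax
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: collect logits of a trained network on a held-out set, then rescale:
# tau = fit_temperature(val_logits, val_labels)
# calibrated_probs = F.softmax(test_logits / tau, dim=1)
```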
Figure 2: Monitoring changes in the behavior of ResNet during training on CIFAR-100. In (a), the $L^2$ norm is approximated with respect to the empirical distributions of training samples and validation samples, respectively. In (c), the misclassified penalty corresponds to the mean confidence penalty for misclassified examples.

For instance, after temperature scaling, the predictive confidence well matches its actual accuracy, and the predictive entropy on out-of-distribution samples significantly increases (Figure 1).

Motivated by this observation, we carefully analyze the unreliable predictive behavior of neural networks by anticipating the log-likelihood score on unseen samples. Specifically, we decompose the log-likelihood on the random variables $(x, y)$ into two cases depending on whether the predictive class matches the label or not:

$$E_{x,y}[\log \phi^W_y(x)] = E_{x,y}\left[ \mathbb{1}_{\{y\}}(m) \log \phi^W_m(x) + \sum_{k \neq m} \mathbb{1}_{\{y\}}(k) \log \phi^W_k(x) \right] \leq E_x\left[ E_{y|x}[\mathbb{1}_{\{y\}}(m)] \log \phi^W_m(x) + \left(1 - E_{y|x}[\mathbb{1}_{\{y\}}(m)]\right) \log\left(1 - \phi^W_m(x)\right) \right] \quad (3)$$

where $E_{y|x}[\mathbb{1}_{\{y\}}(m)] = p_{y|x}(y = m)$ and $m$ is the predictive class such that $m = \arg\max_k f^W_k(x)$. Note that there exists a multiplier $\alpha \in (0, 1]$ to $(1 - \phi^W_m(x))$ in the second term that makes the upper bound an equality, accounting for the dispersion of the non-maximum probability into the $K - 1$ categories. We also note that the logarithm is a monotonically increasing function, so we concentrate on the upper bound for the sake of simplicity. Here, given a sample $x \sim P_x$, the log-likelihood is bounded by the realization of a "stochastic switch" $p_{y|x}(y = m)$ that selects between two "deterministic scores" $\log \phi^W_m(x)$ and $\log(1 - \phi^W_m(x))$. (We use the term "switch" by assuming a noiseless environment, i.e., there is only one true label for each input.)

To scrutinize the score determination mechanism, we note an inherent difference between the deterministic scores and the stochastic switch. The expected deterministic scores, i.e., $E[\log \phi^W_m(x)]$ and $E[\log(1 - \phi^W_m(x))]$, are properties of $f^W$ itself. Therefore, the scores can be anticipated from the estimation $\hat{E}_{D_x}[\log \phi^W_m(x)]$ as long as $D_x$ is drawn from $P_x$, in which case the difference would mostly be caused by the variance of the Monte-Carlo estimation with a finite sample size. On the other hand, the stochastic switch, i.e., whether the model predicts the label of an unseen sample correctly, depends on the external randomness $y|x$, which makes predicting its behavior on unseen samples significantly challenging. This is because the difference between $\hat{E}_D[\mathbb{1}_{\{y\}}(m)]$ and $E[\mathbb{1}_{\{y\}}(m)]$, a.k.a. the generalization gap, is subject to many complex factors (and their interactions), such as input dimensionality, model complexity, and inherent noise (e.g., [34, 48]).

We can empirically identify this difference by monitoring these values during training. The empirical means of the $L^2$ norm of $f^W$ (Figure 2 (a)) and the maximum log-probability $E[\log \phi^W_m(x)]$ (Figure 2 (b)), which are properties of $f^W$, are preserved significantly better on unseen samples than the log-likelihood (Figure 2 (b)), which depends on the external randomness.
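The quantities monitored in Figure 2 can be estimated along the lines of the following sketch, which computes the empirical means of the two deterministic scores and the switch frequency (i.e., accuracy) over a dataset; names and the clamping constant are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_components(model, loader, device="cpu"):
    """Estimate E[log phi_m], E[log(1 - phi_m)], and the switch frequency
    E[1_{y}(m)] (i.e., accuracy) over a data loader."""
    sum_log_pm, sum_log_1m_pm, n_correct, n = 0.0, 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        probs = F.softmax(model(x), dim=1)
        pm, m = probs.max(dim=1)               # max probability and predicted class
        pm = pm.clamp(1e-7, 1 - 1e-7)          # guard the logarithms
        sum_log_pm += pm.log().sum().item()
        sum_log_1m_pm += torch.log1p(-pm).sum().item()
        n_correct += (m == y).sum().item()
        n += y.numel()
    return sum_log_pm / n, sum_log_1m_pm / n, n_correct / n
```

Comparing these estimates on the training set and a validation set, as in Figure 2, separates the stable properties of the function from the switch behavior that depends on the external randomness.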
From this perspective, the unreliable predictive probability can be explained by the implicit bias of cross-entropy minimization. Specifically, the minimum of the cross-entropy is achieved when $f^W_y(x) \to \infty$ and $f^W_k(x) < \infty, \forall k \neq y$. This means that SGD updates $W$ in the direction that increases $\phi^W_m(x)$ and decreases $\phi^W_k(x), \forall k \neq m$, every time it sees the example $(x, y)$, which in turn makes the score $\log \phi^W_m(x)$ near zero and the score $\log(1 - \phi^W_m(x))$ significantly small. (This holds when the prediction corrects the answer, which modern neural networks easily achieve for most samples in the early stage of training.) For example, Figure 2 (c) illustrates the steadily rising trend of the misclassified penalty on unseen data, caused by increasing confidence on misclassified examples. Therefore, the log-likelihood becomes vulnerable to the notoriously high confidence penalty in the case of misclassified examples.
Figure 3: Accuracy-ECE comparison of different models (ResNet, ResNeXt, VGG, WideResNet, DenseNet) on ImageNet. Points connected by the same line represent the same model family with different capacities.

Therefore, improving the test log-likelihood requires reducing the impact of the confidence penalty, which can be achieved by reducing confidence on misclassified examples or by decreasing the misclassification rate. In this work, we focus on reducing the predictive confidence on training samples, and thereby on unseen samples, through explicit regularization techniques, and we show its effectiveness in the rest of the paper. This is because empirical evidence (cf. [47]) shows that improved generalization performance frequently worsens the predictive probability quality, which may be caused by an increased capacity enabling the network to fit training samples more confidently. For example, Figure 3 shows that increasing model capacity reduces the misclassification rate but worsens the calibration performance, measured by the expected calibration error (ECE; cf. metrics in Section 4).
Experiments

Setup.
Our main experimental model is the (pre-activation) ResNet [46] trained on CIFAR [49], which is one of the most prevalent basis models in many state-of-the-art architectures [50, 51]. We also present VGG [52] as a representative of models without residual connections in Appendix B, in which we observe results similar to ResNet. We performed all experiments with a single GPU and trained our models with the standard training procedure based on [46], except for learning rate warm-up during the first five epochs, gradient clipping when the gradient norm exceeds one, and an extra validation set of 10,000 samples split from the training set. We describe the detailed setup in Appendix A.
Metrics.
To precisely evaluate the reliability of the predictive probability, we employ various metrics commonly used in the literature [3, 30, 47]:

• Negative log-likelihood (NLL) evaluates how well the predictive probability explains the test data $D_T$ and is computed by $-\hat{E}_{D_T}[\log \phi^W_y(x)]$. NLL has the desirable property that its optimal score is achieved if and only if $\phi^W(x)$ perfectly matches $p(y|x)$.

• Expected calibration error (ECE) [53] evaluates how well the predictive confidence matches its actual accuracy. Specifically, ECE on $D_T$ is computed by binning predictions into $M$ groups based on their confidences, such that $G_i = \{x : i/M < \max_k \phi^W_k(x) \leq (i+1)/M, \, x \in D_T\}$, and then averaging their calibration errors by $\sum_{i=1}^{M} \frac{|G_i|}{|D_T|} \left| \mathrm{acc}(G_i) - \mathrm{conf}(G_i) \right|$, where $\mathrm{acc}(G_i)$ and $\mathrm{conf}(G_i)$ are the average accuracy and confidence of predictions in group $G_i$, respectively (a minimal sketch of this computation follows the list).

• Predictive entropy on out-of-distribution samples evaluates how well the predictive uncertainty represents ignorance of such samples and is computed by $H[\phi^W(x)]$ on the out-of-distribution samples. Here, reliable predictions ought to produce the highest entropy, as the samples belong to none of the classes seen during training.
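For reference, a minimal NumPy sketch of the ECE computation defined above, assuming M = 15 equal-width bins as in Figure 1; names are illustrative.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray,
                               n_bins: int = 15) -> float:
    """ECE with equal-width confidence bins.
    probs: (N, K) softmax outputs; labels: (N,) integer labels."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    ece, n = 0.0, len(labels)
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += (in_bin.sum() / n) * gap  # weight by the bin's share of samples
    return ece
```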
The simplest way to constrain the confidence may conceivably be controlling the strength of weight decay [45], which can encourage less extreme outputs by shrinking the weights. Therefore, we first explore confidence control by varying the weight decay ratio $\lambda$, conjecturing that the standard ratio, e.g., $\lambda = 0.0001$ in ResNet, is too small to prevent overconfident predictions. Figure 4 (upper) illustrates the impact of $\lambda$ on generalization performance and calibration: the calibration improvement becomes inversely proportional to the generalization performance improvement when the decay ratio is larger than 0.001. This means that improving the reliability with strong weight decay works against the primary goal of supervised learning.

Figure 4: Impact of the weight decay ratio on ECE and accuracy (upper) and on the $L^2$ norm (lower).

We further investigate this undesirability by monitoring the training behavior under different weight decay ratios and comparing their impacts with temperature scaling. In Section 3.2, we have shown that temperature scaling mitigates the impact of the high confidence penalty on the log-likelihood, thereby improving the reliability of the predictive probability. Temperature scaling achieves this by dividing the logits by a scalar, which means that it controls the $L^2$ norm of the function; that is, $\|f^W/\tau\|_2 = \|f^W\|_2/\tau$. For this reason, we monitor the evolution of the $L^2$ norm and compare the value with $\|f^{\bar{W}}/\tau^*\|_2$, where $\bar{W}$ is the weight of the neural network trained with $\lambda = 0.0001$ and $\tau^* = \arg\max_\tau \hat{E}_{D_T}[\log \sigma_y(f^{\bar{W}}(x)/\tau)]$ is obtained by "leaking" the test set $D_T$.

Figure 4 shows that SGD with various decay ratios $\lambda$ finds only a trivial solution or an infeasible solution from the perspective of the following optimization problem:

$$\min_W \hat{E}_D[l_{CE}(y, \phi^W(x))] \quad \text{s.t.} \quad \|f^W\|_2 \leq \|f^{\bar{W}}/\tau^*\|_2 \quad (4)$$

which indicates that confidence control by adjusting the weight decay ratio is a significantly challenging optimization problem. Specifically, Figure 4 (lower) shows that the $L^2$ norm becomes zero when the decay ratio is $\lambda \geq 0.01$, which means that all weights collapse to zero, i.e., the trivial solution. This happens when the decay term overwhelms the gradient of the cross-entropy, e.g., around epoch 50 under $\lambda = 0.01$ (Figure 4 (lower)). On the other hand, ratios of 0.001 and 0.0001 do not suffer from the weight collapse, but the scale of the $L^2$ norm under such ratios is higher than $\|f^{\bar{W}}/\tau^*\|_2$, which corresponds to infeasible solutions (Figure 4 (lower)). These results may seem natural because weight decay does not consider constraints on the predictive confidence. Therefore, we explore ways to add a regularization loss that explicitly concerns the predictive confidence, e.g., the constraint in Equation 4.

In this subsection, we examine two types of regularizers that directly constrain the predictive confidence on the input probability distribution space $P(X)$, whose effectiveness for improving the reliability of the predictive distribution has not been explored yet.

Regularization in the function space.
The first approach regards $f^W$ as an element of the $L^p(X)$ space, the space of measurable functions with the norm:

$$\|f^W\|_p = \left( \int_X |f^W(x)|^p \, dP_x(x) \right)^{1/p} < \infty \quad (5)$$

Here, we note that the norm is computed with respect to the input generating distribution $P_x$, which allows us to consider how the function $f^W$ actually behaves on inputs. Since $P_x$ is unknown, it is approximated by Monte-Carlo approximation with mini-batch samples. Then, the approximate function norm can be computed by $\|f^W\|_p^p \approx \frac{1}{m} \sum_{i,j} |f^W_j(x^{(i)})|^p$. By penalizing the complexity in terms of the $L^p$ norm, a continuous increase of the leading logit entry towards infinity, or a continuous decrease of the non-leading entries towards negative infinity, can be prevented (cf. Section 3.3). In this paper, we examine $\|f^W\|_1$ and $\|f^W\|_2$ regularization losses.
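A minimal sketch of the Monte-Carlo function-norm penalty of Equation (5) added to the cross-entropy loss; the coefficient name `beta` is an illustrative assumption.

```python
import torch

def function_norm_penalty(logits: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Monte-Carlo estimate of ||f||_p^p over a mini-batch:
    (1/m) * sum_{i,j} |f_j(x_i)|^p, as in Equation (5)."""
    return logits.abs().pow(p).sum(dim=1).mean()

# Usage inside a training step (beta is the regularization coefficient):
# logits = model(x)
# loss = F.cross_entropy(logits, y) + beta * function_norm_penalty(logits, p=2)
```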
Regularization in the probability distribution space. The second approach regards $f^W(x)$ as a random variable and then minimizes its distance to a simple distribution, namely the standard normal distribution. We note the possibility of a more theoretically grounded confidence control by encoding more meaningful information into the target distribution, e.g., a determination of the precision parameter, and we leave this as future work.

Table 1: Experimental results under various regularization methods (ResNet-50 on CIFAR-10 and CIFAR-100; columns Acc ↑, NLL ↓, ECE ↓, and $\|f^W\|_2$; rows cover the vanilla method, $\|f^W\|_1$, $\|f^W\|_2$, $SW(\mu^W_{D'}, \nu)$, and PER). Values represent $\mu \pm \sigma$ obtained from five repetitions, and all values are rounded to two decimal places. Wall-clock running times of all methods show no meaningful difference.

In this work, we use the sliced Wasserstein distance of order one because of its computational efficiency and its ability to measure the distance between probability distributions with different supports, which is useful for empirical distributions. We refer to Peyré et al. [54] for a more detailed explanation of this metric. Specifically, given mini-batch samples $D' = \{x^{(i)}\}_{i=1}^m$, let the empirical measure of the logits be $\mu^W_{D'}(A) = \frac{1}{m}\sum_i \mathbb{1}_A(f^W(x^{(i)}))$ and the standard Gaussian measure on $Z$ be $\nu(A) = (2\pi)^{-K/2} \int_A \exp(-\|z\|^2/2) \, dz$. Then, the sliced Wasserstein distance can be computed by:

$$SW(\mu^W_{D'}, \nu) = \int_{S^{K-1}} \int_{-\infty}^{\infty} \left| F_{\nu_\theta}(x) - \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}_{(-\infty, x)}\big(\langle z^{(i)}, \theta \rangle\big) \right| dx \, d\lambda(\theta) \quad (6)$$

where $z^{(i)} = f^W(x^{(i)})$, $\lambda$ is the uniform measure on the unit sphere $S^{K-1}$, $F_{\nu_\theta}$ is the CDF of the measure obtained by projecting $\nu$ at angle $\theta$, and the summation is the empirical CDF of $\mu^W_{D'}$ projected at angle $\theta$. Therefore, confident predictions, each of which involves one dominant entry, result in a significant penalty because an empirical distribution consisting of such predictions is far from the standard normal distribution. Projected error function regularization (PER) [55] further simplifies $SW(\mu^W_{D'}, \nu)$ by applying the Minkowski inequality to the above equation. As a result, the gradient of PER resembles the gradient of the Huber loss [56] in the projected space, which allows a robust norm measurement combining the advantages of both the $L^1$ norm and the $L^2$ norm, as well as capturing the dependency between logits at each location through the projection operation [55].
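A Monte-Carlo sketch of Equation (6): since the one-dimensional projection of the standard normal target is again standard normal, each projected empirical distribution can be compared against Gaussian quantiles, which is one common way to estimate the one-dimensional Wasserstein-1 distance. The implementation below is illustrative and not necessarily the procedure used for the experiments (PER, in particular, replaces this computation with its simplified form).

```python
import torch

def sliced_wasserstein_to_gaussian(logits: torch.Tensor,
                                   n_proj: int = 256) -> torch.Tensor:
    """Monte-Carlo estimate of SW_1 between the mini-batch logit distribution
    and a standard normal target, cf. Equation (6)."""
    m, k = logits.shape
    theta = torch.randn(k, n_proj, device=logits.device)
    theta = theta / theta.norm(dim=0, keepdim=True)   # uniform directions on S^{K-1}
    proj = logits @ theta                             # (m, n_proj) projected logits
    proj_sorted, _ = proj.sort(dim=0)                 # empirical quantiles per direction
    # Quantiles of the 1-D projection of a standard normal (again N(0, 1))
    q = (torch.arange(m, device=logits.device, dtype=logits.dtype) + 0.5) / m
    target = torch.distributions.Normal(0.0, 1.0).icdf(q).unsqueeze(1)
    return (proj_sorted - target).abs().mean()

# loss = F.cross_entropy(logits, y) + beta * sliced_wasserstein_to_gaussian(logits)
```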
Results. Table 1 lists the experimental results, in which both regularization in the function space and regularization in the Wasserstein probability space successfully control the confidence, e.g., reducing the $L^2$ norm of ResNet by at least 34% on CIFAR-10 and 68% on CIFAR-100. We note that the regularization methods can constrain the confidence without compromising the generalization performance; in fact, all regularization methods give small but consistent improvements to test error rates. We also remark that the sum of the Frobenius norms of the weights often increases compared to the vanilla method, and changes by at most 2% when it decreases, which again shows the undesirability of adjusting the weight decay ratio for confidence control.

More importantly, the reliability of the predictive probability significantly improves under all considered measures compared to the vanilla method. For instance, the regularization methods reduce the NLL of ResNet by at least 13% on CIFAR-10 and 6% on CIFAR-100, and reduce the ECE of ResNet by at least 19% on CIFAR-10 and 41% on CIFAR-100. These improvements are comparable to or better than those of temperature scaling. For instance, ResNet with temperature scaling gives an NLL of 1.15 and an ECE of 8.41 on CIFAR-100. (We split the test set into two equal-size sets, a performance measurement set and a temperature calibration set, measure the performance after temperature scaling with the calibration set, and repeat the same procedure with their roles reversed. We remark that a more realistic evaluation would require drawing the temperature calibration set from the training set; in that case, performance would decrease, as the model cannot fully exploit the entire dataset during training.)

Figure 5: Density of predictive uncertainty on CIFAR-100 (in-distribution) and SVHN (out-of-distribution). The upper figures illustrate explicit regularization methods ($L^1$ logit regularization, $L^2$ logit regularization, sliced Wasserstein, PER), and the lower figures illustrate the vanilla method, the deep ensemble, and MC-dropout (p = 0.2 and p = 0.3).

We also investigate the uncertainty representation ability on out-of-distribution samples. Since out-of-distribution samples do not belong to any category, the neural network should produce the answer "I don't know." Figure 5 illustrates the predictive uncertainty of ResNet-50 with respect to CIFAR-100 (in-distribution) and SVHN [57] (out-of-distribution). The vanilla method's predictive uncertainty on SVHN remains in a somewhat confident region, albeit less confident than on CIFAR-100. On the contrary, ResNet under explicit regularization successfully gathers the mass of predictive uncertainty for SVHN samples in the region around the maximum entropy ($\log 100 \approx 4.6$).

We compare the uncertainty representation abilities of the regularization methods to Bayesian neural networks and ensemble methods. Specifically, we use a scalable Bayesian neural network, MC-dropout [4], because other methods based on variational inference [10–12] or MCMC [14, 15] require modifications to the baseline, including the optimization procedure and the architecture, which deters a fair comparison. We searched the dropout rate over {0.1, 0.2, 0.3, 0.4, 0.5} and use 100 Monte-Carlo samples at test time, i.e., 100x more inference time. We also use the deep ensemble [3] with 5 ensemble members, i.e., 5x more training and inference time.
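A sketch of how the predictive entropy in Figure 5 and the MC-dropout predictive distribution can be computed; keeping only the dropout layers stochastic at test time is an implementation assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predictive_entropy(probs: torch.Tensor) -> torch.Tensor:
    """Entropy H[phi(x)] of each predictive distribution; probs has shape (N, K)."""
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

@torch.no_grad()
def mc_dropout_probs(model: torch.nn.Module, x: torch.Tensor,
                     n_samples: int = 100) -> torch.Tensor:
    """Average softmax over stochastic forward passes with dropout kept active."""
    model.eval()
    for m in model.modules():                 # re-enable only dropout at test time
        if isinstance(m, torch.nn.Dropout):
            m.train()
    samples = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_samples)])
    return samples.mean(dim=0)
```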
Figure 5 shows that the regularization-based methods produce significantly better uncertainty representation than MC-dropout and the deep ensemble; even though both the deep ensemble and MC-dropout have the ability to move mass to less certain regions, the positions are still far from the highest-uncertainty region, unlike the regularization-based methods.

Conclusion

In this work, we show that "deep learning requires explicit regularization for reliable predictive probability."
Specifically, we systematically analyze the contributing factors of the unreliable predictive probability by decomposing the log-likelihood into a stochastic switch, i.e., whether the predictive class matches the label, that chooses between two deterministic scores: the log of the maximum predictive confidence and the log of a part of the remaining predictive confidence. We then show the effectiveness of explicit regularization for improving the reliability of the predictive probability, which in turn improves calibration, uncertainty representation, and even the test accuracy.

Our findings present a novel view on the role and importance of explicit regularization for improving the reliability of the predictive probability of neural networks. This direction is appealing in terms of computational efficiency and scalability compared to Bayesian and ensemble methods. Despite these advantages, the regularization methods are limited in that they cannot utilize more sophisticated metrics based on a stochastic representation on the predictive probability space, such as the mutual information measuring epistemic uncertainty [58], due to their deterministic nature. We leave this limitation as an important future direction of research, which may be addressed by more expressive parameterizations, e.g., [25, 59, 60].

Broader Impact
This work shows the effectiveness of explicit regularization methods for improving the reliability of the predictive probability of deep neural networks, which helps the neural networks produce more calibrated predictions and represent predictive uncertainty better. As the regularization-based methods are more computationally efficient and readily applicable than Bayesian or ensemble methods, our findings would encourage many practitioners to adopt reliable deep learning models. Once we view deep learning systems as cognitive automation, meaning that they aid human decision-making processes or replace a part of the cognitive tasks previously done by humans, better predictive probability means better feedback: explanations of what is going on, of the situations in which the automation becomes uncertain, and of unexpected anomalies. This form of appropriate feedback can prevent the misuse and disuse of automation, including automation failures or even catastrophic accidents in safety-critical domains [5, 6, 61].

Here, we assume the involvement of deep learning in real-world situations as a hybrid system, wherein humans and automation cooperate to solve some forms of cognitive tasks. This assumption is valid in the sense that a fully autonomous system, which requires an almost perfect level of success rates and higher-level cognitive tasks such as reasoning, adapting to continuously changing environments, and dealing with complete anomalies, would not be possible at least with the current level of deep learning. Indeed, cooperative cognitive tasks are already prevalent around the world. For example, autonomous driving software handles normal driving situations while a human driver supervises its behavior and takes over the driving authority under uncertainty or exceptional circumstances. Even in simple image labeling tasks, humans make the final decision to complement the imperfect test accuracy of neural networks.

Conversely, the wide adoption of reliable predictive probability models could put extra training burdens on humans, because interpreting information from predictive uncertainty or calibrated predictions requires human operators to be well-trained to leverage the benefits of such information [62]. Besides, providing uncertainty information or confidence levels may increase humans' cognitive workload, which can result in attention distraction and task performance degradation [63]. Finally, our findings may inherit biases contained in the standard classification benchmark environment, which we follow for precise evaluation.

References
[1] A Philip Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
[2] Andrey Malinin and Mark Gales. Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness. In Advances in Neural Information Processing Systems, 2019.
[3] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
[4] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
[5] John Lee and Neville Moray. Trust, control strategies and allocation of function in human-machine systems. Ergonomics, 35(10):1243–1270, 1992.
[6] Raja Parasuraman and Victor Riley. Humans and automation: Use, misuse, disuse, abuse. Human Factors, 39(2):230–253, 1997.
[7] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
[8] Radford M Neal. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems, 1993.
[9] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Annual Conference on Computational Learning Theory, 1993.
[10] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011.
[11] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.
[12] Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, José Miguel Hernández-Lobato, and Alexander L Gaunt. Deterministic variational inference for robust Bayesian neural networks. In International Conference on Learning Representations, 2019.
[13] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, 2015.
[14] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, 2011.
[15] Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2020.
[16] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018.
[17] Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. In International Conference on Machine Learning, 2018.
[18] Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, 2019.
[19] Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, 2019.
[20] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[21] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
[22] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
[23] Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, José Miguel Hernández-Lobato, and Alexander L Gaunt. Deterministic variational inference for robust Bayesian neural networks. In International Conference on Learning Representations, 2019.
[24] Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz E Khan, Anirudh Jain, Runa Eschenhagen, Richard E Turner, and Rio Yokota. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.
[25] Taejong Joo, Uijung Chung, and Min-Gwan Seo. Being Bayesian about categorical probability. arXiv preprint arXiv:2002.07965, 2020.
[26] Alexander A Alemi, Ian Fischer, and Joshua V Dillon. Uncertainty in the variational information bottleneck. arXiv preprint arXiv:1807.00906, 2018.
[27] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Hybrid models with deep and invertible features. In International Conference on Machine Learning, 2019.
[28] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2020.
[29] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.
[30] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, 2019.
[31] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020.
[32] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[33] Gunjan Verma and Ananthram Swami. Error correcting output codes improve probability estimation and adversarial robustness of deep neural networks. In Advances in Neural Information Processing Systems, 2019.
[34] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[35] Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
[36] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.
[37] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.
[38] Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, 2019.
[39] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, 2018.
[40] Boris Hanin and David Rolnick. Deep ReLU networks have surprisingly few activation patterns. In Advances in Neural Information Processing Systems, 2019.
[41] Ping Luo, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. Towards understanding regularization in batch normalization. In International Conference on Learning Representations, 2019.
[42] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
[43] John S Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990.
[44] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[45] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, 1992.
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645, 2016.
[47] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.
[48] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[49] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[50] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[51] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[52] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[53] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI Conference on Artificial Intelligence, 2015.
[54] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[55] Taejong Joo, Donggu Kang, and Byunghoon Kim. Regularizing activations in neural networks via distribution matching with the Wasserstein metric. In International Conference on Learning Representations, 2020.
[56] Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[57] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[58] Lewis Smith and Yarin Gal. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533, 2018.
[59] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics, 2016.
[60] Nicki Skafte, Martin Jørgensen, and Søren Hauberg. Reliable training and estimation of variance networks. In Advances in Neural Information Processing Systems, 2019.
[61] Donald A Norman. The ‘problem’ with automation: Inappropriate feedback and interaction, not ‘over-automation’. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 327(1241):585–593, 1990.
[62] Susan S Kirschenbaum, J Gregory Trafton, Christian D Schunn, and Susan B Trickett. Visualizing uncertainty: The impact on performance. Human Factors, 56(3):509–520, 2014.
[63] Alexander Kunze, Stephen J Summerskill, Russell Marshall, and Ashleigh J Filtness. Automation transparency: Implications of uncertainty communication for human-automation interaction and interfaces. Ergonomics, 62(3):345–360, 2019.
[64] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision, 2015.
A Detailed experimental setup
ResNet base setup.
We trained ResNet for 200 epochs by SGD with a momentum coefficient of 0.9, a mini-batch size of 128, and a weight decay ratio of 0.0001; weights were initialized by He initialization [64]; the initial learning rate was 0.1 and was decreased by a factor of 10 at 100 and 150 epochs; image pixel values were subtracted by the mean and divided by the standard deviation, zero-padded with 4 pixels, randomly cropped to 32x32, and horizontally flipped with a probability of 0.5.
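For reference, the setup above corresponds roughly to the following PyTorch/torchvision configuration; `model`, `mean`, and `std` are assumed to be defined elsewhere, and the learning-rate warm-up and gradient clipping mentioned in the main text are not shown.

```python
import torch
from torchvision import transforms

# CIFAR augmentation described above; mean and std are the per-channel
# training-set statistics computed elsewhere.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])

# Optimizer and step schedule described above; model is the ResNet.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)
```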
VGG setup.
We trained VGG by re-using the ResNet setup for convenience, except for increasing the weight decay ratio to 0.0005 as in [52].
Hyperparameters.
We searched four regularization loss coefficients for each method and chose the best one based on validation set accuracy (Table A1). The search spaces consisted of four coefficients each for the $L^1$ norm, the $L^2$ norm, sliced Wasserstein regularization, and PER (with a 10x lower coefficient range for CIFAR-10).

Sliced Wasserstein regularization and PER involve an integral over the unit sphere, which is evaluated by Monte-Carlo approximation. In this paper, we used 256 evaluations, following [55].

Table A1: Best hyperparameters for each configuration (VGG-16 and ResNet-50, each on CIFAR-10 and CIFAR-100, for $\|f^W\|_1$, $\|f^W\|_2$, $SW(\mu^W_{D'}, \nu)$, and PER).

B VGG results
Consistent with the results of ResNet, all regularization losses improve NLL, ECE, and accuracy (Table A2), except for one of the $L^p$ regularizations on CIFAR-100. However, the improvements are less significant compared to ResNet, because the small capacity of VGG makes the vanilla method produce less confident answers, and the model is therefore less vulnerable to the confidence penalty. This can be inferred from the fact that the values of $\|f^W\|_2$ for VGG are reduced by almost 50% compared to those of ResNet.

Table A2: Experimental results under various regularization methods (VGG-16 on CIFAR-10 and CIFAR-100; columns Acc ↑, NLL ↓, ECE ↓, and $\|f^W\|_2$; rows cover the vanilla method, $\|f^W\|_1$, $\|f^W\|_2$, $SW(\mu^W_{D'}, \nu)$, and PER). Arrows on the metrics represent the desirable direction. Values represent $\mu \pm \sigma$ obtained from five repetitions, and all values are rounded to two decimal places.