Soft Labeling Affects Out-of-Distribution Detection of Deep Neural Networks
Doyup Lee, Yeongjae Cheon

Abstract
Soft labeling has become a common output regularization for generalization and model compression of deep neural networks. However, the effect of soft labeling on out-of-distribution (OOD) detection, an important topic in machine learning safety, has not been explored. In this study, we show that soft labeling can determine OOD detection performance. Specifically, how soft labeling regularizes the outputs of incorrect classes can deteriorate or improve OOD detection. Based on the empirical results, we postulate a direction for future work on OOD-robust DNNs: a proper output regularization by soft labeling can construct OOD-robust DNNs without additional training on OOD samples or modification of the models, while improving classification accuracy.
1. Introduction
Out-of-distribution (OOD) detection has been an important topic for deep learning applications since deep neural networks (DNNs) were shown to be over-confident on abnormal samples that are unrecognizable (Nguyen et al., 2015) or from out-of-distribution (Hendrycks & Gimpel, 2017). OOD detection is highly related to safety, because there is no control over test samples once DNNs are deployed to the real world.

To prevent DNNs from making over-confident predictions on OOD samples, post-training with outlier samples is commonly used. Fine-tuning with a few selected OOD samples (Hendrycks et al., 2019) or adversarial noises (Hein et al., 2019) can improve the detection of unseen OOD samples.

Meanwhile, soft labeling has become a common output regularization trick for training DNNs for various purposes. For example, label smoothing (Szegedy et al., 2016) improves the test accuracy of DNNs and prevents overfitting (He et al., 2019; Müller et al., 2019). Knowledge distillation (Hinton et al., 2015), a kind of soft labeling (Yuan et al., 2019), can compress the size of a teacher model or improve the accuracy of its student networks (Xie et al., 2019).

Despite the popularity of soft labeling, how soft labeling affects OOD detection of DNNs has not been explored. In this study, we assume that regularizing predictions on incorrect classes by soft labeling determines the OOD detection performance of DNNs. We analyze and empirically verify this assumption based on two major results: a) label smoothing deteriorates OOD detection of DNNs, and b) soft labels generated by a teacher model distill OOD detection performance into its student models.

Pohang University of Science and Technology, South Korea; Kakao Brain, South Korea. Correspondence to: Doyup Lee <[email protected]>. Presented at the ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning. Copyright 2020 by the author(s).
In particular, the degraded test accuracy of a teacher model with outlier exposure is recovered or even improved in its student models, while the high OOD detection performance is preserved. Based on the empirical results, we claim that a lottery ticket of soft labeling for OOD-robust DNNs exists, and that how to regularize the predictions of DNNs on incorrect classes is a compelling direction of future work for the generalization of DNNs, not only on unseen in-distribution (ID) samples but also on OOD samples.
2. Preliminaries
Outlier exposure (Hendrycks et al., 2019) fine-tunes a model with some OOD samples so that it predicts a uniform distribution for the OOD training samples:

H(q, p_i) + λ H(U(K), p_o),   (1)

where p_i is the prediction on an ID sample, p_o is the prediction on an OOD sample, q is the one-hot ground truth, λ is a hyper-parameter, H is the cross-entropy, and U(K) is the uniform distribution over all K classes.

Despite a significant improvement in OOD detection, training on additional OOD samples has two drawbacks. First, the original test accuracy is often degraded after outlier exposure, as a trade-off between OOD detection and the original task. Second, we cannot consider all possible OOD samples in training, because there are infinitely many OOD samples.

In this study, soft labeling prevents the degradation of classification accuracy and often improves the test accuracy (Table 1). In addition, we show that soft labels can make DNNs robust to OOD samples without any OOD training sample or model modification (Figure 2).
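As a concrete illustration, Eq. (1) can be sketched in a few lines of NumPy, assuming the network outputs softmax probabilities. This is a minimal sketch, not the official outlier-exposure implementation; the function names are ours.

```python
import numpy as np

def cross_entropy(target, probs, eps=1e-12):
    """Batch cross-entropy H(target, probs): mean of -sum(target * log(probs))."""
    return float(-np.sum(target * np.log(probs + eps), axis=1).mean())

def outlier_exposure_loss(p_id, q_onehot, p_ood, lam=0.5):
    """Eq. (1): H(q, p_i) + lambda * H(U(K), p_o).

    p_id:     softmax predictions on ID samples, shape (N, K)
    q_onehot: one-hot ground truth for the ID samples, shape (N, K)
    p_ood:    softmax predictions on outlier samples, shape (M, K)
    """
    K = p_id.shape[1]
    uniform = np.full_like(p_ood, 1.0 / K)  # U(K) target for the outliers
    return cross_entropy(q_onehot, p_id) + lam * cross_entropy(uniform, p_ood)
```

The second term is minimized when the outlier predictions are uniform, so a model that stays confident on outliers pays a higher loss.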
Given the one-hot ground truth q of a training sample x, soft labeling is defined as

q̃ = (1 − α) q + α q′,   (2)

where α is a hyper-parameter for soft labeling, q′ ∈ [0, 1]^K is a soft target that satisfies argmax(q̃) = argmax(q) and Σ_{i=1}^{K} q̃_i = 1, and K is the number of classes. The training loss with soft labeling is then

H(q̃, p) = (1 − α) H(q, p) + α H(q′, p).   (3)

Note that soft labeling is a regularization of the predictions, including those on incorrect classes.

The training objectives of both label smoothing (Szegedy et al., 2016) and knowledge distillation (Hinton et al., 2015) can be represented by Eq. (3) (Yuan et al., 2019). Label smoothing (Szegedy et al., 2016) is a soft labeling that regularizes DNNs to predict the uniform distribution U(K) over all K classes:

(1 − α) H(q, p) + α H(U(K), p).   (4)

In knowledge distillation, the soft target of a student model is the prediction of its teacher model:

(1 − α) H(q, p) + α H(p_t, p),   (5)

where p_t is the prediction of the teacher model. Knowledge distillation is a kind of output regularization of student models by the teacher's predictions (Yuan et al., 2019), used for model compression (Hinton et al., 2015) or generalization (Xie et al., 2019).

In this paper, we train WRN-40-2 (Zagoruyko & Komodakis, 2016) on the SVHN, CIFAR-10, and CIFAR-100 datasets (ID). We follow the experimental settings in the official code of outlier exposure, except that we train for 150 epochs. In addition, we follow the hyper-parameter settings of knowledge distillation in (Müller et al., 2019). For evaluation of OOD detection, we use the MNIST, Fashion-MNIST, SVHN (or CIFAR-10), LSUN, and TinyImageNet datasets as OOD samples, and AUROC as the evaluation measure.
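The two instances of Eq. (2) used above reduce to two ways of building the soft target: a uniform mixture (label smoothing, Eq. 4) or a mixture with the teacher's prediction (distillation, Eq. 5). A minimal NumPy sketch, with function names of our own choosing:

```python
import numpy as np

def smooth_label(q_onehot, alpha):
    """Eq. (2)/(4): blend the one-hot target with the uniform distribution U(K)."""
    K = q_onehot.shape[-1]
    return (1 - alpha) * q_onehot + alpha * np.full_like(q_onehot, 1.0 / K)

def distill_label(q_onehot, teacher_probs, alpha):
    """Eq. (2)/(5): blend the one-hot target with the teacher's prediction p_t."""
    return (1 - alpha) * q_onehot + alpha * teacher_probs
```

Both targets still sum to one and keep the ground-truth class as the argmax (for reasonable α), so only the mass assigned to the incorrect classes changes, which is exactly the quantity this paper studies.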
Table 1. Test accuracy and expected calibration error (ECE) of WideResNet (Baseline) trained on SVHN, CIFAR-10, and CIFAR-100. TinyImageNet is used as the OOD training set for OE (outlier exposure). OD (outlier distillation) denotes the student model of the OE model.
ID Dataset          Baseline   +OE     +OD
SVHN        Acc     97.02      96.82
            ECE      2.38       2.65
CIFAR-10    Acc
CIFAR-100   Acc     76.63      75.58
            ECE     12.06      14.79
Table 2. OOD detection performance of outlier exposure and outlier distillation. WRN-OE, which is fine-tuned with TinyImageNet as OOD, is used as the teacher model for the two student models (WRN and DenseNet).
AUROC            WRN-OE   → WRN    → DenseNet
MNIST            84.28      90.95    93.30
Fashion-MNIST    95.16      96.03    96.61
SVHN             94.19      94.50    94.19
LSUN             99.99      99.94    99.94
TinyImageNet     99.99      99.78    99.83
3. Soft Labeling Affects OOD Detection
Figure 1 shows the effects of label smoothing with different α on test accuracy, expected calibration error (ECE) (Guo et al., 2017), and detection of the OOD datasets. As shown in (Lukasik et al., 2020), ECE starts to increase when the label smoothing α is larger than its optimal value (red dotted lines). Although the test accuracy on CIFAR-10 deteriorates for some large values of α, the test accuracy is always improved when ECE is minimized (Müller et al., 2019).

Even though label smoothing improves test accuracy and ECE (Müller et al., 2019), it makes DNNs vulnerable to out-of-distribution samples and unable to distinguish the ID and OOD datasets. Label smoothing always deteriorates OOD detection regardless of the magnitude of α, and larger α results in more degradation of OOD detection. In particular, the WRN models trained on SVHN and CIFAR-10 show significant AUROC drops when ECE starts to increase (after the red dotted line).

We can infer the reason why label smoothing hurts OOD detection of DNNs from two perspectives. First, combining Eqs. (1) and (4), we can interpret the output regularization of label smoothing as an outlier exposure of the ID samples. Then,

https://github.com/hendrycks/outlier-exposure
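ECE, the calibration measure tracked in Figure 1, bins predictions by their confidence (maximum softmax probability) and averages the gap between accuracy and confidence per bin. A small NumPy sketch following Guo et al. (2017); the function name and the 15-bin default are our choices:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence (max softmax probability) and
    average |accuracy - confidence| per bin, weighted by bin size."""
    conf = probs.max(axis=1)                      # confidence of each prediction
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (conf > lo) & (conf <= hi)         # samples falling in this bin
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```

A model that is 90% confident but only 50% accurate in a bin contributes a 0.4 gap weighted by that bin's share of the data, which is how over-confidence shows up as a high ECE.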
Figure 1. Test accuracy and expected calibration error (top) and OOD detection AUROC (bottom) of WRN trained on SVHN (left), CIFAR-10 (middle), and CIFAR-100 (right). The red dotted line marks the label smoothing α that minimizes ECE. OOD detection continuously deteriorates as the label smoothing α increases. When ECE starts to increase (after the red dotted line), dramatic drops in AUROC are observed for the SVHN (ID) and CIFAR-10 (ID) training datasets.
Figure 2. OOD detection AUROCs of teacher models and their student models. (Top) Teacher models trained on the SVHN, CIFAR-10, and CIFAR-100 datasets and their student models. (Bottom) Teacher models of CIFAR-10 fine-tuned with MNIST, TinyImageNet, and MNIST+TinyImageNet by outlier exposure. WRN-40-2 is used as the model architecture of both teacher and student models.

label smoothing can deteriorate OOD detection of DNNs, making them unable to discriminate OOD samples from ID samples. As the magnitude of α increases, the effect of the output regularization in Eq. (4) increases and deteriorates OOD detection performance, acting as an outlier exposure of the ID datasets.

Meanwhile, knowledge distillation offers another view for interpreting the negative effect of label smoothing on OOD detection. Note that label smoothing is a knowledge distillation with a teacher model that perfectly learns the ID samples as OOD and predicts a uniform distribution for all ID samples. Thus, we assume that soft labels of incorrect classes, generated by a teacher model, determine the OOD detection performance of its student model, and we empirically verify this assumption in Section 3.2.
In this section, we show that OOD detection performance is determined by the soft labels. Specifically, soft labels generated by a teacher model determine the performance of its student model. Figure 2 shows the OOD detection performance of teacher models and their student models in various settings. For a student model, we use the same architecture (WRN-40-2) as its teacher, because our concern is to analyze the effects of soft labeling, not model compression.

Figure 2 (top) shows OOD detection AUROCs of the WRN-40-2 models (SVHN, CIFAR-10, and CIFAR-100) and their student models. A teacher and its student model have similar AUROCs regardless of the test datasets (OOD).
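The AUROCs reported above rate a detector that scores each input and ranks ID samples above OOD samples; a standard choice of score is the maximum softmax probability (Hendrycks & Gimpel, 2017). A minimal NumPy sketch, using the rank-sum (Mann-Whitney U) formulation of AUROC; the helper names are ours, and ties are not rank-averaged in this sketch:

```python
import numpy as np

def msp_score(probs):
    """Maximum softmax probability: higher scores indicate more ID-like inputs."""
    return probs.max(axis=1)

def auroc(id_scores, ood_scores):
    """AUROC as the probability that a random ID sample scores higher than a
    random OOD sample (rank-sum formulation; ties are not averaged here)."""
    scores = np.concatenate([id_scores, ood_scores])
    ranks = scores.argsort().argsort() + 1    # ascending ranks starting at 1
    n_id, n_ood = len(id_scores), len(ood_scores)
    u = ranks[:n_id].sum() - n_id * (n_id + 1) / 2
    return float(u / (n_id * n_ood))
```

An AUROC of 1.0 means every ID sample is more confident than every OOD sample; 0.5 means the scores are indistinguishable, which is the failure mode that heavy label smoothing induces in Figure 1.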
In Figure 2 (bottom), we fine-tune the teacher models with various OOD samples (MNIST, TinyImageNet, and MNIST+TinyImageNet) to improve OOD detection by outlier exposure. The OOD detection of the teacher models is improved on different OOD datasets, depending on the exposed OOD samples. We find that the OOD detection performance of the student models is always consistent with that of their teacher models (OE), regardless of the choice of OOD training samples for the teacher.

In particular, when we use MNIST+TinyImageNet for outlier exposure of the teacher model, both the teacher and its student almost perfectly detect the test OOD samples. Exposing various OOD samples at training time is an unrealistic setting, because there are infinitely many kinds of OOD samples. However, the experimental results are worth noting, because the student model is trained only on ID samples with soft labels, and no OOD sample is directly used to train the student model. Although how to generate such soft labels without a perfectly OOD-robust teacher remains an open question, the results show the existence of a soft labeling that makes DNNs robust to various OOD datasets without OOD training.

OOD detection performance is also distilled into a student model that has a different architecture from its teacher model. In Table 2, we use DenseNet (Huang et al., 2017) with 40 hidden layers and growth rate 12 as the student model of WRN-40-2. Note that the number of trainable parameters of DenseNet (1.1M) is half that of WRN-40-2 (2.2M). Even though the size and architecture of the student differ from those of its teacher, the OOD detection AUROCs of the teacher and student are consistent. These results imply that the effect of soft labeling on OOD detection is model-agnostic.
Then, if we find a soft labeling method for OOD-robust DNNs, the soft labeling can be used generally across various DNN architectures.

Orthogonal to OOD detection, one disadvantage of post-training with OOD samples is the degradation of the original classification accuracy (Hendrycks et al., 2019; Hein et al., 2019). However, we find that both the test accuracy and ECE of the student models (+OD) are similar to or better than those of the original model before outlier exposure (Baseline) in Table 1. The improvement in test accuracy results from soft labeling, because soft labels help the model avoid overfitting regardless of the type of soft labeling (Yuan et al., 2019).
4. Discussion
In this study, we show that soft labeling of incorrect classes is closely linked with OOD detection of DNNs. Note that the student models in Figure 2 do not use any OOD sample, yet achieve almost perfect OOD detection AUROCs. The results verify that constructing OOD-robust DNNs is possible without modifying the model or post-training on OOD samples.

The limitation of our study is that a concrete soft labeling method for OOD-robust DNNs remains an open question. However, our focus is on showing the existence of a soft labeling for OOD-robust DNNs.

We postulate that finding an output regularization of incorrect classes that makes DNNs robust to unseen OOD samples is possible and worth exploring in future work. Note that proper soft labeling can improve not only OOD detection, but also the classification accuracy on unseen ID samples and confidence calibration (Table 1). In addition, OOD-robust soft labeling is model-agnostic and can be applied generally to various model architectures.
References
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321–1330, 2017.

He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567, 2019.

Hein, M., Andriushchenko, M., and Bitterwolf, J. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–50, 2019.

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.

Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.

Lukasik, M., Bhojanapalli, S., Menon, A. K., and Kumar, S. Does label smoothing mitigate label noise? arXiv preprint arXiv:2003.02819, 2020.

Müller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems, pp. 4696–4705, 2019.

Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Xie, Q., Hovy, E., Luong, M.-T., and Le, Q. V. Self-training with noisy student improves ImageNet classification. arXiv preprint arXiv:1911.04252, 2019.

Yuan, L., Tay, F. E., Li, G., Wang, T., and Feng, J. Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723, 2019.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.