Self-Distillation as Instance-Specific Label Smoothing
Zhilu Zhang
Cornell University [email protected]
Mert R. Sabuncu
Cornell University [email protected]
Abstract
It has been recently demonstrated that multi-generational self-distillation can improve generalization [7]. Despite this intriguing observation, reasons for the enhancement remain poorly understood. In this paper, we first demonstrate experimentally that the improved performance of multi-generational self-distillation is in part associated with the increasing diversity in teacher predictions. With this in mind, we offer a new interpretation for teacher-student training as amortized MAP estimation, such that teacher predictions enable instance-specific regularization. Our framework allows us to theoretically relate self-distillation to label smoothing, a commonly used technique that regularizes predictive uncertainty, and suggests the importance of predictive diversity in addition to predictive uncertainty. We present experimental results using multiple datasets and neural network architectures that, overall, demonstrate the utility of predictive diversity. Finally, we propose a novel instance-specific label smoothing technique that promotes predictive diversity without the need for a separately trained teacher model. We provide an empirical evaluation of the proposed method, which, we find, often outperforms classical label smoothing.
1 Introduction

First introduced as a simple method to compress high-capacity neural networks into a low-capacity counterpart for computational efficiency, knowledge distillation [10] has since gained much popularity across various application domains ranging from computer vision to natural language processing [12, 15, 20, 28, 29] as an effective method to transfer knowledge or features learned from a teacher network to a student network. This empirical success is often justified with the intuition that deeper teacher networks learn better representations with greater model complexity, and that the "dark knowledge" teacher networks provide helps student networks learn better representations and hence achieve enhanced generalization performance. Nevertheless, it remains an open question how exactly student networks benefit from this dark knowledge. The problem is made further puzzling by the recent observation that even self-distillation, a special case of the teacher-student training framework in which the teacher and student networks have identical architectures, can lead to better generalization performance [7]. It was also demonstrated that repeating the self-distillation process over multiple generations can further improve classification accuracy.

In this work, we aim to shed some light on self-distillation. We start off by revisiting the multi-generational self-distillation strategy, and experimentally demonstrate that the performance improvement observed in multi-generational self-distillation is correlated with increasing diversity in teacher predictions. Inspired by this, we view self-distillation as instance-specific regularization on the neural network softmax outputs, and cast the teacher-student training procedure as performing amortized maximum a posteriori (MAP) estimation of the softmax probability outputs. The proposed framework provides us with a new interpretation of the teacher predictions as instance-specific priors conditioned on the inputs. This interpretation allows us to theoretically relate distillation to label smoothing, a commonly used technique to regularize the predictive uncertainty of NNs, and suggests
that regularization on the softmax probability simplex space, in addition to the regularization on predictive uncertainty, can be the key to better generalization. To verify the claim, we systematically design experiments to compare teacher-student training against label smoothing. Lastly, to further demonstrate the potential gain from regularization on the probability simplex space, we also design a new regularization procedure based on label smoothing that we term "Beta smoothing."

Our contributions can be summarized as follows:
1. We provide a plausible explanation for recent findings on multi-generational self-distillation.
2. We offer an amortized MAP interpretation of the teacher-student training strategy.
3. We attribute the success of distillation to regularization on both the label space and the softmax probability simplex space, and verify the importance of the latter with systematically designed experiments on several benchmark datasets.
4. We propose a new regularization technique termed "Beta smoothing" that improves upon classical label smoothing at little extra cost.
5. We demonstrate that self-distillation can improve calibration.

2 Related Work

The original knowledge distillation technique for neural networks [10] has stimulated a flurry of interest in the topic, with a large number of published improvements and applications. For instance, prior works [1, 17] have proposed Bayesian techniques in which distributions are distilled with Monte Carlo samples into more compact models like a neural network. Lopez-Paz et al. [16] combined distillation with the theory of privileged information, and offered a generalized framework for distillation. To simplify distillation, Zhu et al. [33] proposed a method for one-stage online distillation. There have also been successful applications of distillation for adversarial robustness [20].

Several papers have attempted to study the effect of distillation training on student models. Furlanello et al. [7] examined the effect of distillation by comparing the gradients of the distillation loss against those of the standard cross-entropy loss with ground-truth labels. Phuong et al. [23] considered a special case of distillation using linear and deep linear classifiers. Cho and Hariharan [4] conducted a thorough experimental analysis, and made several interesting observations. Another experimentally driven work to understand the effect of distillation was done in the context of natural language processing [32]. Most similar to our work is [30], in which the authors also established a connection between label smoothing and distillation. However, our argument comes from a different theoretical perspective and offers complementary insights.
3 Preliminaries

We consider the problem of k-class classification. Let X ⊆ R^d be the feature space and Y = {1, ..., k} be the label space. Given a dataset D = {x_i, y_i}_{i=1}^{n}, where each feature-label pair (x_i, y_i) ∈ X × Y, we are interested in finding a function f : X → R^k that maps input features to corresponding labels. In this work, we restrict the function class to the set of neural networks f_w(x), where w = {W_i}_{i=1}^{L} are the parameters of a neural network with L layers. We define a likelihood model p(y | x; w) = Cat(softmax(f_w(x))), a categorical distribution with parameters softmax(f_w(x)) ∈ Δ(k), where Δ(k) denotes the k-dimensional probability simplex. Typically, maximum likelihood estimation (MLE) is performed. This leads to the cross-entropy loss

L_{cce}(w) = -\sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij} \log p(y = j \mid x_i; w),   (1)

where y_{ij} corresponds to the j-th element of the one-hot encoded label y_i.

3.1 Teacher-Student Training Objective

Given a pre-trained model (teacher) f_{w_t}, the distillation loss can be defined as

L_{dist}(w) = -\sum_{i=1}^{n}\sum_{j=1}^{k} [\mathrm{softmax}(f_{w_t}(x_i)/T)]_j \log p(y = j \mid x_i; w),   (2)

where [·]_j denotes the j-th element of a vector. A second network (student) f_w can then be trained with the following total loss:

L(w) = \alpha L_{cce}(w) + (1 - \alpha) L_{dist}(w),   (3)

where α ∈ [0, 1] is a hyper-parameter, and T corresponds to the temperature scaling hyper-parameter that flattens teacher predictions. In self-distillation, both teacher and student models have the same network architecture. In the original self-distillation experiments conducted by Furlanello et al. [7], α and T are set to fixed values throughout the entire training process.

Note that temperature scaling is applied differently here compared to previous literature on distillation [10]. As addressed in Section 5, we only apply temperature scaling to teacher predictions in computing the distillation loss. We empirically observe that this yields results consistent with previous reports. Moreover, as we show in the Appendix, performing temperature scaling only on the teacher but not the student models yields more calibrated results.
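For concreteness, the following is a minimal PyTorch-style sketch of the objective in Eq. 3, with temperature scaling applied to the teacher predictions only, as described above. The function name and the default values of α and T are illustrative, not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Sketch of the combined objective of Eq. 3 (teacher-only temperature scaling)."""
    # Standard cross-entropy with the ground-truth labels (Eq. 1).
    ce = F.cross_entropy(student_logits, labels)
    # Distillation term (Eq. 2): cross-entropy between the flattened teacher
    # distribution and the (unscaled) student log-probabilities.
    teacher_probs = F.softmax(teacher_logits.detach() / T, dim=1)
    student_log_probs = F.log_softmax(student_logits, dim=1)
    dist = -(teacher_probs * student_log_probs).sum(dim=1).mean()
    return alpha * ce + (1 - alpha) * dist
```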
Self-distillation can be repeated iteratively such that during training of the i-th generation, the model obtained at the (i-1)-th generation is used as the teacher model. This approach is referred to as multi-generational self-distillation, or "Born-Again Networks" (BAN). Empirically, it has been observed that student predictions can consistently improve with each generation. However, the mechanism behind this improvement has remained elusive. In this work, we argue that the main attribute that leads to better performance is the increasing uncertainty and diversity in teacher predictions. Similar observations that more "tolerant" teacher predictions lead to better students were also made by Yang et al. [27]. Indeed, due to the monotonicity and convexity of the negative log-likelihood function, and since the softmax output for the true label class, p(y = y_i | x_i; w), is often much greater than that of the other classes, together with early stopping, each subsequent model will likely have increasingly unconfident softmax outputs for the true label class.

We use Shannon entropy to quantify the uncertainty in instance-specific teacher predictions p(y | x; w_i), averaged over the training set, which we call "Average Predictive Uncertainty" and define as

\mathbb{E}_x[H(p(\cdot \mid x; w_i))] \approx \frac{1}{n}\sum_{j=1}^{n} H(p(\cdot \mid x_j; w_i)) = \frac{1}{n}\sum_{j=1}^{n}\sum_{c=1}^{k} -p(y_c \mid x_j; w_i)\log p(y_c \mid x_j; w_i).   (4)

Note that previous literature [6, 21] has also proposed to use the above measure as a regularizer to prevent over-confident predictions. Label smoothing [21, 24] is a closely related technique that also penalizes over-confident predictions by explicitly smoothing out ground-truth labels.

Average Predictive Uncertainty is insufficient to fully capture the variability associated with teacher predictions. In this paper, we argue it is also important to consider the amount of spreading of teacher predictions over the probability simplex among different (training) samples. We coin this population spread in predictive probabilities "Confidence Diversity." As we show below, characterizing Confidence Diversity can be important for understanding teacher-student training.

There is no straightforward metric for Confidence Diversity, and its computation is hampered by the curse of dimensionality, particularly in applications with a large number of classes. In this paper, we propose to measure only the entropy of the softmax element corresponding to the true label class. Mathematically, if we denote c = φ(x, y) := [softmax(f_w(x))]_y and let p_C be the probability density function of the random variable C := φ(X, Y) where (X, Y) ∼ p(x, y), then we quantify Confidence Diversity via the differential entropy of C:

h(C) = -\int p_C(c) \log p_C(c)\, dc.   (5)

We use a KNN-based entropy estimator to compute h(C) over the training set [3].
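Both quantities can be computed directly from teacher softmax outputs. Below is a minimal NumPy sketch, assuming a matrix of teacher probabilities and integer labels; the Kozachenko-Leonenko style k-nearest-neighbor estimator used for the differential entropy of Eq. 5 is one common variant and may differ in detail from the estimator of [3].

```python
import numpy as np
from scipy.special import digamma

def average_predictive_uncertainty(probs):
    """Eq. 4: mean Shannon entropy of the teacher's softmax outputs.

    probs: array of shape (n, k) with rows on the probability simplex.
    """
    eps = 1e-12
    return float(np.mean(-np.sum(probs * np.log(probs + eps), axis=1)))

def confidence_diversity(probs, labels, k=5):
    """Eq. 5: differential entropy of the ground-truth-class confidence,
    estimated with a Kozachenko-Leonenko style k-NN estimator (1-D case)."""
    c = probs[np.arange(len(labels)), labels]   # confidence on the true class
    n = len(c)
    # Distance to the k-th nearest neighbor for each sample (1-D, brute force).
    dists = np.sort(np.abs(c[:, None] - c[None, :]), axis=1)[:, k]
    dists = np.maximum(dists, 1e-12)
    # h(C) ~= psi(n) - psi(k) + log(volume of 1-D unit ball) + mean(log eps_i)
    return float(digamma(n) - digamma(k) + np.log(2.0) + np.mean(np.log(dists)))
```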
Figure 1: Results for sequential self-distillation over 10 generations. The model obtained at the (i-1)-th generation is used as the teacher model for training at the i-th generation. Accuracy and NLL are obtained on the test set using the student model, whereas the predictive uncertainty and confidence diversity are evaluated on the training set with teacher predictions.

We perform sequential self-distillation with ResNet-34 on the CIFAR-100 dataset for 10 generations. At each generation, we train the neural networks for 150 epochs using the identical optimization procedure as in the original ResNet paper [9]. Following Furlanello et al. [7], α is kept fixed and T is set to 1 throughout the entire training process. Fig. 1 summarizes the results. As indicated by the general increasing trend in test accuracy, sequential distillation indeed leads to improvements. The entropy plots also support the hypothesis that subsequent generations exhibit increasing diversity and uncertainty in predictions. Despite the shared increasing trend, the two entropy metrics quantify different things. The increase in average predictive uncertainty suggests an overall drop in the confidence of the categorical distribution, while the growth in confidence diversity suggests increasing variability in teacher predictions. Interestingly, we also see clear improvements in terms of NLL, suggesting in addition that BAN can improve the calibration of predictions [8].

To further study the apparent correlation between student performance and the entropy of teacher predictions over generations, we conduct a new experiment in which we instead train a single teacher. This teacher is then used to train a single generation of students while varying the temperature hyper-parameter T in Eq. 3, which explicitly adjusts the uncertainty and diversity of teacher predictions. For consistency, we keep α fixed at the same value. Results are illustrated in Fig. 2. As expected, increasing T leads to greater predictive uncertainty and diversity in teacher predictions. Importantly, we see that this increase leads to drastic improvements in the test accuracy of students. In fact, the gain is much greater than the best achieved with 10 generations of BAN with T = 1 (indicated with the flat line in the plot). The identified correlation is consistent with the recent finding that early-stopped models, which typically have much larger entropy than fully trained ones, serve as better teachers [4]. Lastly, we also see improvements in NLL with increasing entropy of teacher predictions. However, too large a T leads to a subsequent increase in NLL, likely due to teacher predictions that lack confidence.

A closer look at the entropy metrics of the above experiment reveals an important insight. While the average predictive uncertainty is strictly increasing with T, the confidence diversity plateaus after roughly T = 2. This coincides closely with the stagnation of student test accuracy, hinting at the importance of confidence diversity in teacher predictions. This makes intuitive sense.
Given a training set, we would expect some of the samples to be much more typical representatives of their label class than others. Ideally, we would hope to classify the typical examples with much greater confidence than an ambiguous example of the same class. Previous results show that training with such instance-specific uncertainty can indeed lead to better performance [22]. Our view is that in self-distillation, the teacher provides the means for instance-specific regularization.
Figure 2: Results with teacher predictions scaled by varying temperature T. The flat lines in the plots correspond to the largest/smallest values achieved over 10 generations of sequential distillation with T = 1 in the previous experiments for accuracy, predictive uncertainty and confidence diversity/NLL.

The instance-specific regularization perspective on self-distillation allows us to recast the training procedure as performing maximum a posteriori (MAP) estimation of the softmax probability vector. Suppose that the likelihood p(y | x, z) = Cat(z) is a categorical distribution with parameter z ∈ Δ(k), and the conditional prior p(z | x) = Dir(α_x) is a Dirichlet distribution with instance-specific parameter α_x. Due to the conjugacy of the Dirichlet prior, a closed-form solution

\hat{z}_i = \frac{c_i + [\alpha_x]_i - 1}{\sum_j (c_j + [\alpha_x]_j - 1)},

where c_i corresponds to the number of occurrences of the i-th category, can be easily obtained.

The above framework is not useful for classification when given a new sample x without any observations of y. Moreover, in the common supervised learning setup, only one observation of the label y is available for each sample x. The MAP solution shown above merely relies on the provided label y for each sample x, without exploiting the potential similarities among different samples x_i in the entire dataset for more accurate estimation. For example, we could have different samples that are almost duplicates (cf. [2]) but have different y_i's, which could inform us about other labels that could be drawn from z_i. Thus, instead of relying on the instance-level closed-form solution, we can train a (student) network to amortize the MAP estimation, \hat{z}_i \approx \mathrm{softmax}(f_w(x_i)), with a given training set, resulting in the optimization problem

\max_w \sum_{i=1}^{n} \log p(z \mid x_i, y_i; w, \alpha_x) = \max_w \sum_{i=1}^{n} \big[\log p(y = y_i \mid z, x_i; w) + \log p(z \mid x_i; w, \alpha_x)\big]
 = \max_w \underbrace{\sum_{i=1}^{n} \log[\mathrm{softmax}(f_w(x_i))]_{y_i}}_{\text{Cross entropy}} + \underbrace{\sum_{i=1}^{n}\sum_{c=1}^{k} ([\alpha_{x_i}]_c - 1)\log[z_i]_c}_{\text{Instance-specific regularization}}.   (6)
Eq. 6 is an objective that provides us with a function to obtain a MAP solution of z given an input sample x. Note that we do not make any assumptions about the availability or number of label observations y for each sample x. This enables us to find an approximate MAP solution for x at test time, when α_x and y are unavailable. The resulting framework is generally applicable to various scenarios like semi-supervised learning or learning from multiple labels per sample. Nevertheless, in the following, we restrict our attention to supervised learning with a single label per training sample.

The difficulty now lies in obtaining the instance-specific prior Dir(α_x). A naive independence assumption that p(z | x) = p(z) can be made. Under such an assumption, a sensible choice of prior would be a uniform distribution across all possible labels. Choosing [α_x]_c = [α]_c = β/k + 1 for all c ∈ {1, ..., k} for some hyper-parameter β, the MAP objective becomes

L_{LS} = \sum_{i=1}^{n} -\log[z_i]_{y_i} + \beta \sum_{i=1}^{n}\sum_{c=1}^{k} -\frac{1}{k}\log[z_i]_c.   (7)

As noted in prior work, this loss function is equivalent to the commonly used label smoothing (LS) regularization [21, 24] (the derivation can be found in the Appendix). Observe also that the training objective in essence promotes predictions with larger predictive uncertainty, but not confidence diversity.

5.2 Self-Distillation as MAP

A better instance-specific prior distribution can be obtained using a pre-trained (teacher) neural network. Let us consider a network f_{w_t} trained with the regular MLE objective, by maximizing p(y | x; w_t) = Cat(softmax(f_{w_t}(x))), where [softmax(f_{w_t}(x))]_i = [exp(f_{w_t}(x))]_i / Σ_j [exp(f_{w_t}(x))]_j. Now, due to the conjugacy of the Dirichlet prior, the marginal likelihood p(y | x; α_x) is a Dirichlet-multinomial distribution [18]. In the case of the single label observation considered here, the marginal likelihood reduces to a categorical distribution. As such, we have p(y | x; α_x) = Cat(ᾱ_x), where ᾱ_x is the normalized parameter vector with [ᾱ_x]_i = [α_x]_i / Σ_j [α_x]_j. We can thus interpret exp(f_{w_t}(x)) as the parameters of a Dirichlet distribution to obtain a useful instance-specific prior on z. However, we observe that there is a scale ambiguity that needs resolving, since any choice of the form

\alpha_x = \beta \exp(f_{w_t}(x)/T) + \gamma,   (8)

with T = 1 and γ = 0 yields the same ᾱ_x, where β corresponds to some hyper-parameter. Using T > 1 and γ > 0 corresponds to flattening the prior distribution, which we found to be useful in practice, an observation consistent with prior work. Note that in the limit of T → ∞, the instance-specific prior reduces to a uniform prior corresponding to classical label smoothing. Setting γ = 1 (we also experimentally explore the effect of varying γ; see the Appendix for details), we obtain

\alpha_x = \beta \exp(f_{w_t}(x)/T) + 1 = \beta \Big(\sum_j [\exp(f_{w_t}(x)/T)]_j\Big)\, \mathrm{softmax}(f_{w_t}(x)/T) + 1.   (9)

Plugging this into Eq. 6 yields the distillation loss of Eq. 3, with an additional sample weighting term \sum_j [\exp(f_{w_t}(x)/T)]_j!

Despite this interesting result, we empirically observe that with temperature values T found to be useful in practice, the relative weightings of samples were too close to yield a significant difference from the regular distillation loss.
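As a concrete illustration of Eqs. 8 and 9, the following hedged sketch constructs the teacher-derived prior and the per-sample weight that appears when it is plugged into Eq. 6; the function name and the default choices of T and β are illustrative rather than the authors' settings.

```python
import torch
import torch.nn.functional as F

def weighted_distillation_term(student_logits, teacher_logits, T=2.0, beta=1.0):
    """Sketch of the distillation term implied by the prior of Eq. 9.

    Plugging alpha_x = beta * exp(f_t(x)/T) + 1 into Eq. 6 leaves the usual
    teacher/student cross-entropy, multiplied per sample by sum_j exp(f_t(x)/T)_j.
    """
    scaled = teacher_logits.detach() / T
    sample_weight = torch.exp(scaled).sum(dim=1)      # sum_j [exp(f_t(x)/T)]_j
    teacher_probs = F.softmax(scaled, dim=1)
    per_sample_ce = -(teacher_probs * F.log_softmax(student_logits, dim=1)).sum(dim=1)
    # beta rescales the whole regularizer; in practice it can be folded into (1 - alpha).
    return beta * (sample_weight * per_sample_ce).mean()
```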
Hence, for all of our experiments, we still adopt the distillation loss of Eq. 3. However, we believe that with teacher models trained with an objective more appropriate than MLE, the difference might be bigger. We hope to explore alternative ways of obtaining teacher models to effectively utilize the sample re-weighted distillation objective as future work.

The MAP interpretation, together with the empirical experiments conducted in Section 4, suggests that multi-generational self-distillation can in fact be seen as an inefficient approach to implicitly flatten and diversify the instance-specific prior distribution. Our experiments suggest that, instead, we can more effectively tune the hyper-parameters T and γ to achieve similar, if not better, results.

The MAP perspective reveals an intimate relationship between self-distillation and label smoothing. Label smoothing increases the uncertainty of predictive probabilities. However, as discussed in Section 4, this might not be enough to prevent overfitting, as evidenced by the stagnant test accuracy despite increasing uncertainty in Fig. 2. Indeed, the MAP perspective suggests that, ideally, each sample should have a distinct probabilistic label. Instance-specific regularization can encourage confidence diversity, in addition to predictive uncertainty.

Self-distillation requires training a separate teacher model. In this paper, we propose an efficient enhancement to the label smoothing strategy in which the amount of smoothing is proportional to the uncertainty of the model's predictions. Specifically, we make use of the exponential moving average (EMA) predictions, as implemented by Tarvainen and Valpola [25], of the model being trained, and obtain a ranking based on the confidence (the magnitude of the largest element of the softmax) of predictions in each mini-batch, on the fly. Then, instead of assigning the uniform prior [α_x]_c = β/k + 1 for all c ∈ {1, ..., k} to all samples, during each iteration we sample and sort a set of i.i.d. random variables {b_1, ..., b_m} from Beta(a, 1), where m corresponds to the mini-batch size and a corresponds to the hyper-parameter associated with the Beta distribution. Then, based on the ranking obtained, we assign [α_{x_i}]_{y_i} = β b_i + 1 and [α_{x_i}]_c = β(1 - b_i)/(k - 1) + 1 for all c ≠ y_i as the prior for each sample x_i.

In practice, for consistency with distillation, Eq. 3 is used for training. Beta-smoothed labels with b_i on the ground-truth class and (1 - b_i)/(k - 1) on all other classes are used in lieu of teacher predictions for each x_i. Thus, the amount of label smoothing applied to a sample is proportional to the amount of confidence the model has about that sample's prediction. Those instances that are more challenging to classify will, therefore, have more smoothing applied to their labels. EMA predictions are used in order to stabilize the ranking obtained at each iteration of training. We empirically observe a significant performance boost with the EMA predictions. We term this method Beta smoothing. Lastly, to test the importance of ranking, we include in the Appendix an ablation study in which random Beta smoothing is applied to each sample.

Beta smoothing regularization implements an instance-specific prior that encourages confidence diversity, and yet does not require the expensive step of training a separate teacher model. We note that, due to the constantly changing prior used at every iteration of training, Beta smoothing does not, strictly speaking, correspond to the MAP estimation of Eq. 6.
Nevertheless, it is a simple and effective way to implement the instance-specific prior strategy. As we demonstrate in the following section, it can lead to much better performance than label smoothing. Moreover, unlike teacher predictions, which have distinct softmax values for all classes, the difference between Beta and label smoothing comes only from the ground-truth softmax element. This enables us to conduct more systematic experiments to illustrate the additional gain from promoting confidence diversity.
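A minimal sketch of the Beta smoothing target construction is given below; it follows the description above (ranking by EMA confidence within a mini-batch and sorted draws from a Beta distribution), but the function name, the Beta(a, 1) default, and the integration details are illustrative rather than the authors' implementation.

```python
import torch

def beta_smoothed_targets(ema_probs, labels, num_classes, a=9.0):
    """Construct Beta-smoothed soft labels for one mini-batch.

    ema_probs: (m, k) EMA-model softmax outputs, used only for ranking.
    labels:    (m,) ground-truth class indices.
    Returns an (m, k) matrix of soft targets usable in place of the teacher
    predictions in Eq. 3.
    """
    m = labels.shape[0]
    # Rank samples by EMA confidence (largest softmax element), descending.
    confidence = ema_probs.max(dim=1).values
    rank = confidence.argsort(descending=True)
    # Draw and sort i.i.d. Beta(a, 1) samples; the largest b_i goes to the most confident sample.
    b = torch.distributions.Beta(a, 1.0).sample((m,)).sort(descending=True).values
    b_per_sample = torch.empty(m)
    b_per_sample[rank] = b
    # Soft target: b_i on the true class, (1 - b_i)/(k - 1) spread over the rest.
    targets = ((1.0 - b_per_sample).unsqueeze(1) / (num_classes - 1)).repeat(1, num_classes)
    targets[torch.arange(m), labels] = b_per_sample
    return targets
```

These targets then replace the teacher term when evaluating the loss of Eq. 3.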
To further demonstrate the benefits of the additional regularization on the softmax probability vector space, we design a systematic experiment to compare self-distillation against label smoothing. In addition, experiments on Beta smoothing are also conducted to further verify the importance of confidence diversity, and to promote Beta smoothing as a simple alternative that can lead to better performance than label smoothing at little extra cost. We note that, while previous works have highlighted the similarity between distillation and label smoothing from another perspective [30], we provide a detailed empirical analysis that uncovers additional benefits of instance-specific regularization.
We conduct experiments on CIFAR-100 [13], CUB-200 [26] and Tiny-ImageNet [5] using ResNet [9] and DenseNet [11]. We follow the original optimization configurations for training the ResNet and DenseNet models, and a portion of the training data is split off as the validation set. All experiments are repeated 5 times with random initialization. For simplicity, label smoothing is implemented with explicit soft labels instead of the objective in Eq. 7, with the smoothing parameter ε fixed for all our experiments. The hyper-parameter α of Eq. 3 is likewise fixed for self-distillation. To systematically decompose the effect of the two regularizations in self-distillation, given a pre-trained teacher and α, we manually search for a temperature T such that the average effective label of the ground-truth class, α + (1 - α)[softmax(f_{w_t}(x_i)/T)]_{y_i}, approximately matches the effective ground-truth label under the chosen ε. The parameter a of the Beta distribution is set such that E[α + (1 - α) b_i] matches the same value, to make the average probability of the ground-truth class the same as that of ε-label smoothing.

Lastly, we emphasize that the goal of the experiment is to methodically decompose the gain from the two aforementioned regularizations of distillation. Due to limited computational resources, hyper-parameter tuning is not performed, and the results for all methods could potentially be enhanced. Nevertheless, we believe that the conclusions drawn hold in general.
Figure 3: Experimental results on the CIFAR-100, CUB-200 and Tiny-ImageNet datasets. "CE", "LS", "B" and "SD" refer to "Cross Entropy", "Label Smoothing", "Beta Smoothing" and "Self-Distillation" respectively. The top row of each experiment shows bar charts of accuracy on the test set, while the bottom row shows bar charts of expected calibration error. (Panels: CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34, DenseNet-100-12 and DenseNet-121-12 models.)

Test accuracies are summarized in the top row for each experiment in Fig. 3. Firstly, all regularization techniques lead to improved accuracy compared to the baseline model trained with cross-entropy loss. In agreement with previous results, self-distillation performs better than label smoothing in all of the experiments with our setup, in which the effective degree of label smoothing in distillation is, on average, the same as that of regular label smoothing. The results suggest the importance of confidence diversity in addition to predictive uncertainty. It is worth noting that we obtain encouraging results with Beta smoothing. Outperforming label smoothing in all but the CIFAR-100 ResNet experiment, it can even achieve performance comparable to that of self-distillation on the CUB-200 dataset, with no separate teacher model required. The improvements of Beta smoothing over label smoothing also serve as direct evidence of the importance of confidence diversity, as the only difference between the two comes from the additional spreading on the ground-truth classes. We hypothesize that the gap in accuracy between Beta smoothing and self-distillation mainly comes from the better instance-specific priors set by a pre-trained teacher network. The differences in the non-ground-truth classes between the two methods could also account for the small gap in accuracy.

Results on calibration are shown in the bottom rows of Fig. 3, where we report the expected calibration error (ECE) [8]. As anticipated, all regularization techniques lead to enhanced calibration. Nevertheless, we see that the errors obtained with self-distillation are in general much smaller than those obtained with label smoothing. As such, instance-specific priors can also lead to more calibrated models. Beta smoothing again not only produces models with much more calibrated predictions compared to label smoothing, but also compares favorably to self-distillation in a majority of the experiments.
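Since ECE [8] is the calibration metric used throughout, the following short sketch shows the standard equal-width binning estimator; the bin count is an illustrative choice, and details of the binning may differ from the evaluation code used for the figures.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted average of |accuracy - confidence| per confidence bin.

    probs:  (n, k) predicted class probabilities.
    labels: (n,) ground-truth class indices.
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece
```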
In this paper, we provide empirical evidence that diversity in teacher predictions is correlated with student performance in self-distillation. Inspired by this observation, we offer an amortized MAP interpretation of the popular teacher-student training strategy. The novel viewpoint provides us with insights on self-distillation and suggests ways to improve it. For example, encouraged by the results obtained with Beta smoothing, there are possibly better and/or more efficient ways to obtain priors for instance-specific regularization.

Recent literature shows that label smoothing leads to better calibration performance [19]. In this paper, we demonstrate that distillation can also yield more calibrated models. We believe this is a direct consequence of not performing temperature scaling on student models during training. Indeed, with temperature scaling also applied to the student models, the student logits are likely pushed larger during training, leading to over-confident predictions.

More generally, we have only discussed the teacher-student training strategy as MAP estimation. There have been other recently proposed techniques involving training with soft labels, which we can interpret as encouraging confidence diversity or implementing instance-specific regularization. For instance, the mixup regularization technique [31] creates label diversity by taking random convex combinations of the training data, including the labels. Recently proposed consistency-based semi-supervised learning methods such as [14, 25], on the other hand, utilize predictions on unlabeled training samples as an instance-specific prior. We believe this unifying view of regularization with soft labels can stimulate further ideas on instance-specific regularization.

References

[1] Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems, pages 3438-3446, 2015.
[2] Björn Barz and Joachim Denzler. Do we train on test data? Purging CIFAR of near-duplicates. arXiv preprint arXiv:1902.00423, 2019.
[3] Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17-39, 1997.
[4] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4794-4802, 2019.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[6] Abhimanyu Dubey, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Maximum-entropy fine-grained classification. In Advances in Neural Information Processing Systems, pages 637-647, 2018.
[7] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607-1616, 2018.
[8] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321-1330. JMLR.org, 2017.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[11] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.
[12] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317-1327, 2016.
[13] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[14] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[15] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935-2947, 2017.
[16] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
[17] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. arXiv preprint arXiv:1905.00076, 2019.
[18] Thomas Minka. Estimating a Dirichlet distribution, 2000.
[19] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, pages 4696-4705, 2019.
[20] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, pages 582-597. IEEE, 2016.
[21] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
[22] Joshua C Peterson, Ruairidh M Battleday, Thomas L Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE International Conference on Computer Vision, pages 9617-9626, 2019.
[23] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In International Conference on Machine Learning, pages 5142-5151, 2019.
[24] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[25] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195-1204, 2017.
[26] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[27] Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L Yuille. Training deep neural networks in generations: A more tolerant teacher educates better students. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5628-5635, 2019.
[28] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133-4141, 2017.
[29] Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1974-1982, 2017.
[30] Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723, 2019.
[31] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[32] Chunting Zhou, Graham Neubig, and Jiatao Gu. Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727, 2019.
[33] Xiatian Zhu, Shaogang Gong, et al. Knowledge distillation by on-the-fly native ensemble. In Advances in Neural Information Processing Systems, pages 7517-7527, 2018.
Appendix
A.1 Derivations

We first give a derivation of the equivalence between label smoothing regularization and Eq. 7. With some simple rearrangement of the terms,

L_{LS} = \sum_{i=1}^{n} -\log[z_i]_{y_i} + \beta \sum_{i=1}^{n}\sum_{c=1}^{k} -\frac{1}{k}\log[z_i]_c = -(1+\beta)\sum_{i=1}^{n}\Big[\frac{k+\beta}{k(1+\beta)}\log[z_i]_{y_i} + \sum_{c \neq y_i}\frac{\beta}{k(1+\beta)}\log[z_i]_c\Big].

The above objective is clearly equivalent to label smoothing regularization with 1 - ε = (k+β)/(k(1+β)), up to a constant factor of (1+β).

Label smoothing regularizes predictive uncertainty. The amount of regularization is controlled by the amount of smoothing applied. Evidently, the objective does not regularize confidence diversity. Indeed, assuming a NN with capacity capable of fitting the entire training data, predictions on training data will be pushed arbitrarily close to the smoothed soft label. Empirical evidence for this form of overfitting can be seen from experiments done by Müller et al. [19], in which the authors demonstrated that applying label smoothing leads to hampered distillation performance. The authors hypothesize that this is likely due to erasure of "relative information between logits" when label smoothing is applied, hinting at the overfitting of predictions to the smoothed labels.

A closely related regularization technique is to explicitly regularize predictive uncertainty:

L_{PU} = \sum_{i=1}^{n} -\log[z_i]_{y_i} + \beta \sum_{j=1}^{n}\sum_{c=1}^{k} [z_j]_c \log[z_j]_c.

Prior papers [6, 21] have demonstrated that directly regularizing predictive uncertainty can lead to better performance than label smoothing. However, we note that the above objective does not regularize confidence diversity either. In fact, it can easily be shown with the method of Lagrange multipliers that the optimum of the objective above is achieved when

[z_i]_{y_i} = \frac{1}{\beta W\!\big(\exp(-1/\beta)(k-1)/\beta\big) + 1},

where W denotes the Lambert W function, and [z_i]_c = (1 - [z_i]_{y_i})/(k-1) for all c ≠ y_i, for every sample pair (x_i, y_i). As such, the global optimum obtained by directly regularizing predictive uncertainty is identical to that of label smoothing. In practice, differences between the two can arise due to details of the optimization procedure (like early stopping), and/or due to model capacity.
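As a quick sanity check on the closed form above, the per-sample objective can also be minimized numerically and compared against the Lambert-W expression; this is a verification sketch with arbitrarily chosen k and β, not part of the paper's experiments.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import lambertw

k, beta = 10, 0.5   # illustrative values

def per_sample_objective(t):
    """-log z_y + beta * sum_c z_c log z_c, with mass (1 - t) spread evenly off the true class."""
    rest = (1.0 - t) / (k - 1)
    neg_entropy = t * np.log(t) + (k - 1) * rest * np.log(rest)
    return -np.log(t) + beta * neg_entropy

closed_form = 1.0 / (beta * np.real(lambertw((k - 1) * np.exp(-1.0 / beta) / beta)) + 1.0)
numerical = minimize_scalar(per_sample_objective, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(closed_form, numerical)   # the two values should agree closely
```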
A.2 Additional Experiments with Temperature Scaling on Student Models
Figure 4: Left: Test accuracies of ResNet-34 models on the CIFAR-100 dataset when varying temperature. Right: ECE of ResNet-34 models on the CIFAR-100 dataset when varying temperature. "Scale both" corresponds to the originally proposed distillation objective in which both teacher and student models are temperature-scaled during training. "Scale teacher only" corresponds to temperature scaling only the teacher models during distillation. The green flat line represents the performance achieved by the teacher model trained with cross-entropy loss. (Panels are shown for α = 0.1 and α = 0.4.)

To examine the effect of not applying temperature scaling to student models, we conduct an experiment comparing models trained with and without temperature scaling on the student models in the distillation loss, using ResNet-34 on the CIFAR-100 dataset and the training objective of Eq. 3. In addition to the value of α used for the experiments in Section 7, we also include results with a second value that is widely used for knowledge distillation in prior work [4]; the two settings shown are α = 0.1 and α = 0.4. We vary the amount of temperature scaling applied to illustrate the effect that different temperatures have on student models. Plots of test accuracy and ECE against the amount of temperature scaling applied are shown in Fig. 4.

Firstly, we observe that models trained with student scaling have ECE almost identical to that of the teacher models. As a direct contrast, we see that the student models trained without student scaling perform much better in general in terms of calibration error relative to their teacher. Note that the relatively large ECE observed for the smaller α at high temperatures is likely due to overly unconfident teacher predictions. In addition, we highlight that, with the optimal hyper-parameters α and T, student models trained without student scaling can also achieve significantly better test accuracy. We acknowledge that there can be conflicts between ECE and accuracy, as seen from the superior test accuracy but poor ECE achieved at the largest temperature considered. In practice, we can use the negative log-likelihood, a metric influenced by both ECE and accuracy, to find the optimal α and T. Lastly, we note that both α and T alter the amount of predictive uncertainty and confidence diversity in teacher predictions at the same time. This coupled effect could be the reason for the observed conflict between ECE and accuracy. We leave it as future work to explore alternative ways to decouple the two measures for more efficient and effective parameter search. We believe a decoupled set of parameters can lead to models with better calibration and accuracy at the same time.

A.3 Additional Experiments with CIFAR-10 When Varying Training Set Size
Figure 5: Left: Test accuracies of ResNet-34 teacher and student models on the CIFAR-10 dataset when the training set size is varied. Right: The relative improvement in accuracy when the training set size is varied.

Recent results show a relatively small gain when performing knowledge distillation on the CIFAR-10 dataset [4, 7]. Our perspective of distillation as regularization provides a plausible explanation for this observation. Like all other forms of regularization, its effect diminishes as the size of the training data increases. We experimentally verify this claim by training ResNet-34 models with a varying number of training samples. The experiments are repeated 3 times. Fig. 5 summarizes the results. As expected, increasing the sample size leads to an increase in test accuracy for both models. Nevertheless, the relative improvement in the accuracy of the student model over the teacher decreases as the size of the training set increases, indicating that distillation is a form of regularization.

A.4 Additional Experiments on Varying Weight Decay
Figure 6: Left: Test accuracies of ResNet-34 teacher and student models on the CIFAR-100 dataset when the weight decay hyper-parameter is varied. Right: The relative improvement in accuracy when the weight decay hyper-parameter is varied.

To further demonstrate that distillation is a regularization process, we also conduct an additional experiment on the CIFAR-100 dataset using ResNet-34, varying only the weight decay hyper-parameter. Intuitively, larger weight decay regularization makes NNs less prone to overfitting, which should, in turn, reduce the additional benefit obtainable from self-distillation, if it is indeed a form of regularization. To keep the quality of the priors identical across all student models, we use the same teacher model (trained with a fixed weight decay) for all distillation runs. Our results are summarized in Fig. 6. It is evident that increasing the weight decay hyper-parameter leads to a much smaller improvement in test accuracy. Interestingly, we also see a noticeable gain in accuracy for the baseline models trained with cross-entropy when adjusting the weight decay term, contradicting some recent findings that weight decay is ineffective for neural networks.

A.5 Additional Experiments on Beta Smoothing
Figure 7: Ablation study on Beta smoothing. "LS", "RB" and "B" refer to "Label Smoothing", "Random Beta Smoothing" and "Beta Smoothing" respectively. The top row of each experiment shows bar charts of accuracy on the test set, while the bottom row shows bar charts of expected calibration error. (Panels: CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34 and DenseNet models.)

We conduct an ablation study on the proposed Beta smoothing regularization in order to demonstrate the importance of the relative ranking. To do so, we run experiments with the identical setup as described in Section 7, but with Beta-distributed soft label noise assigned completely at random to each sample instead of by confidence ranking. We term this "random Beta smoothing". Results are shown in Fig. 7. For convenience, we also include results obtained with regular label smoothing as a benchmark comparison. As seen clearly, the proposed Beta smoothing with the ranking obtained from EMA predictions leads to much better results in general, in terms of both accuracy and ECE. This suggests that naively encouraging confidence diversity does not lead to significant improvements, and that the relative confidence among different samples is also an important ingredient for obtaining better student models. This ablation study also serves as indirect evidence for why self-distillation still outperforms Beta smoothing: with a pre-trained model, much more reliable relative confidences among training samples can be obtained.

A.6 Additional Experiments on the Effect of the Quality of Teachers
Figure 8: Additional results on cross-distillation. "SD" and "CD" refer to "self-distillation" and "cross-distillation" respectively. The top row of each experiment shows bar charts of accuracy on the test set, while the bottom row shows bar charts of expected calibration error. (Panels: CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34 and DenseNet models.)

We also perform an additional experiment with the identical setup as described in Section 7 on cross-distillation between the ResNet and DenseNet models, in which a ResNet-34 teacher is used to train the DenseNet-100 student and vice versa, in an attempt to examine the effect of better/worse priors in self-distillation. Intuitively, with greater capacity, deeper networks can learn representations that may better capture the relative label uncertainty between samples, thus generating better priors and leading to better student performance. Hyper-parameters are fixed in this case such that the predictive uncertainty and diversity associated with the label predictions remain the same as for self-distillation. Results are summarized in Fig. 8. As seen clearly from the consistently better/worse performance of cross-distillation for ResNet/DenseNet, better teachers lead to better performance. Thus, in addition to diversity among teacher predictions, the quality of the instance-specific prior used is also important for better generalization performance. Lastly, we also see an apparent benefit in terms of model calibration when a better teacher model is used.

A.7 Additional Experiments on Varying γ
Figure 9: Additional results on pruned distillation. "SD" and "PD" refer to "self-distillation" and "pruned-distillation" respectively. The top row of each experiment shows bar charts of accuracy on the test set, while the bottom row shows bar charts of expected calibration error. (Panels: CIFAR-100, CUB-200 and Tiny-ImageNet with ResNet-34 and DenseNet models.)

In addition, we consider a simple variation of the distillation loss obtained by varying γ. However, directly adjusting γ can be problematic in practice. To understand the effect of changing γ, suppose we have some γ such that [α_x]_c - 1 < 0 for some c ∈ {1, ..., k}. Since the minimization objective with respect to this class is -([α_x]_c - 1) log([z]_c), the closer [z]_c is to 0, the smaller the loss. This leads to numerical issues, as the overall loss function can be pushed to negative infinity by forcing [z]_c arbitrarily close to zero.

To circumvent this numerical problem during optimization, we make the observation that the above objective is essentially equivalent to setting the particular elements with [α_x]_c - 1 < 0 to zero. As such, adjusting the threshold γ enables us to prune out the smallest elements of the teacher predictions. To further force the pruned elements to zero, a new softmax probability vector is computed over the remaining elements. In practice, setting the optimal γ can be challenging. We instead choose to prune out a fixed percentage of classes for all samples; pruning a fraction of the classes amounts to using only the most confident classes of each sample to compute the softmax and setting the remaining entries to zero. We term this method pruned-distillation.

We show some preliminary results with pruned-distillation with 50%
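For clarity, the pruning step described above can be sketched as follows; the function name, the temperature, and the kept fraction are illustrative and not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def pruned_teacher_targets(teacher_logits, keep_frac=0.5, T=2.0):
    """Keep only the top fraction of most confident classes for each sample,
    set the remaining classes to zero, and renormalize by recomputing the
    softmax over the kept logits."""
    k = teacher_logits.shape[1]
    n_keep = max(1, int(k * keep_frac))
    scaled = teacher_logits.detach() / T
    topk = scaled.topk(n_keep, dim=1)
    # Softmax over the kept logits only, then scatter back; pruned classes stay at zero.
    kept_probs = F.softmax(topk.values, dim=1)
    targets = torch.zeros_like(scaled)
    targets.scatter_(1, topk.indices, kept_probs)
    return targets
```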