Locally Adaptive Label Smoothing for Predictive Churn
Dara Bahri    Heinrich Jiang

Google Research, Mountain View, California, USA. Correspondence to: Dara Bahri <[email protected]>.

Abstract
Training modern neural networks is an inherently noisy process that can lead to high prediction churn – disagreements between re-trainings of the same model due to factors such as randomization in the parameter initialization and mini-batches – even when the trained models all attain similar accuracies. Such prediction churn can be very undesirable in practice. In this paper, we present several baselines for reducing churn and show that training on soft labels obtained by adaptively smoothing each example's label based on the example's neighboring labels often outperforms the baselines on churn while improving accuracy on a variety of benchmark classification tasks and model architectures.
1. Introduction
Deep neural networks (DNNs) have proved to be immensely successful at solving complex classification tasks across a range of problems. Much of the effort has been spent towards improving their predictive performance (i.e. accuracy), while comparatively little has been done towards improving the stability of training these models (Zheng et al., 2016). Modern DNN training is inherently noisy due to factors such as the random initialization of network parameters (Glorot & Bengio, 2010), the minibatch ordering (Loshchilov & Hutter, 2015), the effects of various data augmentation (Shorten & Khoshgoftaar, 2019) or pre-processing tricks (Santurkar et al., 2018), and the non-determinism arising from the hardware (Turner & Nowotny, 2015), all of which are exacerbated by the non-convexity of the loss surface (Scardapane & Wang, 2017). This results in local optima corresponding to models that have very different predictions on the same data points. This may seem counter-intuitive, but even when the different runs all produce very high accuracies for the classification task, their predictions can still differ quite drastically, as we will show later in the experiments. Thus, even an optimized training procedure can lead to high prediction churn, which refers to the proportion of sample-level disagreements between classifiers caused by different runs of the same training procedure. (Concretely, given two classifiers applied to the same test samples, the prediction churn between them is the fraction of test samples with different predicted labels.)

In practice, reducing such predictive churn can be critical. For example, in a production system, models are often continuously improved on by being trained or retrained with new data or better model architectures and training procedures. In such scenarios, a candidate model for release must be compared to the current model serving in production. Oftentimes, this decision is conditioned on more than just overall offline test accuracy – in fact, the offline metrics are often not completely aligned with the actual goal, especially if these models are used as part of a larger system (e.g. maximizing offline click-through rate vs. maximizing revenue or user satisfaction) (Deng et al., 2013; Beel et al., 2013; Dmitriev & Wu, 2016). As a result, these comparisons require extensive and costly live experiments, requiring human evaluation in situations where the candidate and the production model disagree (i.e. in many situations, the true labels are not available without a manual labeler) (Theocharous et al., 2015; Deng, 2015; Deng & Shi, 2016). In these cases, it can be highly desirable to lower predictive churn.

Despite the practical relevance of lowering churn, there has been surprisingly little work done in this area, which we highlight in the related work section. In this work, we focus on predictive churn reduction under retraining the same model architecture on an identical train and test set. Our main contributions are as follows:

• We provide one of the first comprehensive analyses of baselines to lower prediction churn, showing that popular approaches designed for other goals are effective baselines for churn reduction, even compared to methods designed for this goal.

• We improve label smoothing, a global smoothing method popular for calibrating model confidence, by utilizing the local information leveraged by the k-NN labels, thus introducing a locally adaptive label smoothing which we show to often outperform the baselines on a wide range of benchmark datasets and model architectures.

• We show new theoretical results suggesting the usefulness of the k-NN label. We show under mild nonparametric assumptions that for a wide range of k, the k-NN labels uniformly approximate the optimal soft label, and when k is tuned optimally, achieve the minimax optimal rate. We also show that when k is linear in n, the distribution implied by the k-NN label approximates the original distribution smoothed with an adaptive kernel.
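To make the churn metric concrete before moving on, the following minimal sketch (ours, not from the paper; it assumes predicted class labels from two training runs stored as numpy arrays) computes pairwise prediction churn:

```python
import numpy as np

def prediction_churn(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Fraction of test samples on which two classifiers predict different labels."""
    assert preds_a.shape == preds_b.shape
    return float(np.mean(preds_a != preds_b))

# Two runs can attain similar accuracy yet still disagree on many samples:
run_1 = np.array([0, 1, 1, 2, 0, 1])
run_2 = np.array([0, 1, 2, 2, 0, 0])
print(prediction_churn(run_1, run_2))  # 0.3333...
```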
2. Related Works
Our work spans multiple sub-areas of machine learning. The main problem this paper tackles is reducing prediction churn. In the process, we show that label smoothing is an effective baseline, and we improve upon it in a principled manner using deep k-NN label smoothing to obtain a locally adaptive version of it.

Prediction Churn. There are only a few works which explicitly address prediction churn. Fard et al. (2016) proposed training a model so that it has small prediction instability with future versions of the model by modifying the data that the future versions are trained on. They furthermore propose turning the classification problem into a regression towards corrected predictions of an older model, as well as regularizing the new model towards the older model using example weights. Cotter et al. (2019) and Goh et al. (2016) use constrained optimization to directly lower prediction churn across model versions. Simultaneously training multiple identical models (apart from initialization) while tethering their predictions together via regularization has been proposed in the context of distillation (Anil et al., 2018; Zhang et al., 2018; Zhu et al., 2018; Song & Chai, 2018) and robustness to label noise (Malach & Shalev-Shwartz, 2017; Han et al., 2018). This family of methods was termed "co-distillation" by Anil et al. (2018), who also noted that it can be used to reduce churn in addition to improving accuracy. In this paper, we show much more extensively that co-distillation is indeed a reasonable baseline for churn reduction.
Label Smoothing.
Label smoothing (Szegedy et al., 2016) is a simple technique wherein the model is trained on the soft labels obtained by a convex combination of the hard true label and the soft uniform distribution across all the labels. It has been shown that it leads to better confidence calibration and generalization (Müller et al., 2019). Here we show that label smoothing is a reasonable baseline for reducing prediction churn, and we moreover enhance it for this task by smoothing the labels locally via k-NN rather than a pure global approach mixing with the uniform distribution.

k-NN Theory. The theory of k-NN classification has a long history (e.g. Fix & Hodges Jr (1951); Cover (1968); Stone (1977); Devroye et al. (1994); Chaudhuri & Dasgupta (2014)). To our knowledge, the most relevant k-NN classification result is by Chaudhuri & Dasgupta (2014), who show statistical risk bounds under similar assumptions as used in our work. Our analysis shows finite-sample L∞ bounds on the k-NN labels, which is a stronger notion of consistency as it provides a uniform guarantee, rather than an average guarantee as is shown in previous works under standard risk measures such as L2 error. We do this by leveraging recent techniques developed in Jiang (2019) for k-NN regression, which assumes an additive noise model instead of classification. Moreover, we provide to our knowledge the first consistency guarantee for the case where k grows linearly with n.

Deep k-NN. k-NN is a classical method in machine learning which has recently been shown to be useful when applied to the intermediate embeddings of a deep neural network (Papernot & McDaniel, 2018) to obtain more calibrated and adversarially robust networks. This is because standard distance measures are often better behaved in these representations, leading to better performance of k-NN on these embeddings than on the raw inputs. Jiang et al. (2018) use nearest neighbors on the intermediate representations to obtain better uncertainty scores than softmax probabilities, and Bahri et al. (2020) use the k-NN label disagreement to filter noisy labels for better training. Like these works, we also leverage k-NN on the intermediate representations, but we show that utilizing the k-NN labels leads to lower prediction churn.
3. Algorithm
Suppose that the task is multi-class classification with L classes and the training datapoints are (x_1, y_1), ..., (x_n, y_n), where x_i ∈ X, X is a compact subset of R^D, and y_i ∈ R^L represents the one-hot vector encoding of the label – that is, if the i-th example has label j, then y_i has 1 in the j-th entry and 0 everywhere else. We give the formal definition of the smoothed labels:

Definition 1 (Label Smoothing). Given label smoothing parameter 0 ≤ a ≤ 1, the smoothed label y^LS_a is (where 1_L denotes the vector of all 1's in R^L):

    y^LS_a := (1 − a) · y + (a/L) · 1_L.
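For instance, with L = 3 and a = 0.1, the one-hot label (0, 1, 0) becomes roughly (0.033, 0.933, 0.033). A minimal numpy sketch of Definition 1 (our own illustration, not code from the paper):

```python
import numpy as np

def smooth_labels(y: np.ndarray, a: float) -> np.ndarray:
    """Definition 1: y_a^LS = (1 - a) * y + (a / L) * 1_L for one-hot labels y of width L."""
    L = y.shape[-1]
    return (1.0 - a) * y + (a / L) * np.ones(L)

y = np.array([0.0, 1.0, 0.0])   # one-hot label for class 2 of L = 3
print(smooth_labels(y, a=0.1))  # [0.0333... 0.9333... 0.0333...]
```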
Figure 1. Visualization of the effects of global vs. locally adaptive label smoothing. This visualization provides intuition for why our locally adaptive label smoothing method can improve neural network training stability. Left: A binary classification dataset in 2 dimensions, with magenta points for the positive class and cyan points for the negative class. The data is generated from a mixture of two Gaussians, where the bottom-left Gaussian corresponds to the positive examples and the top-right corresponds to the negative examples; to add label noise, we swap the labels of a fraction of points chosen uniformly at random. Middle: We see that label smoothing simply pushes the labels uniformly closer to the average label (0.5). In particular, we see that the noisy labels still remain and thus may still cause conflicting information during training, possibly leading to predictive churn. Right: We now show our locally adaptive label smoothing approach, which also smooths the labels based on local information. This alleviates the examples with noisy labels by bringing them more in line with the average label amongst their neighbors, and provides a more locally smooth label profile with respect to the input space. Such smoothness can help model training converge in a more stable manner.
We next formally define the k-NN label, which is the average label of the example's k nearest neighbors in the training set. Let us use the shorthand X := {x_1, ..., x_n}.

Definition 2 (k-NN label). Let the k-NN radius of x ∈ X be r_k(x) := inf{r : |B(x, r) ∩ X| ≥ k}, where B(x, r) := {x' ∈ X : |x − x'| ≤ r}, and let the k-NN set of x ∈ X be N_k(x) := B(x, r_k(x)) ∩ X. Then for all x ∈ X, the k-NN label is defined as

    η_k(x) := (1 / |N_k(x)|) · Σ_{i=1}^n y_i · 1[x_i ∈ N_k(x)].

The label smoothing method can be seen as performing a global smoothing: every label is equally transformed towards the uniform distribution over all labels. While it seems almost deceptively simple, it has only recently been shown to be effective in practice, specifically for better calibrated networks (Müller et al., 2019). However, since this smoothing technique is applied equally to all datapoints, it fails to incorporate local information about the datapoint. To this end, we propose using the k-NN label, which smooths the label across its nearest neighbors. We show theoretically that the k-NN label can be a strong proxy for the optimal soft label, that is, the expected label given the features, and thus the best prediction one can make given the uncertainty under an L2 risk measure. In other words, compared to the true label (or even the label smoothing), the k-NN label is robust to variability in the data distribution and provides a more stable estimate of the label than the original hard label, which may be noisy. Training on such noisy labels has been shown to hurt model performance (Bahri et al., 2020), and using the smoothed labels can help mitigate these effects. To this end, we define k-NN label smoothing as follows:

Definition 3 (k-NN label smoothing). Let 0 ≤ a, b ≤ 1 be k-NN label smoothing parameters. Then the k-NN smoothed label of datapoint (x, y) is defined as:

    y^kNN_{a,b} := (1 − a) · y + a · (b · (1/L) · 1_L + (1 − b) · η_k(x)).

We see that a is used to weight between using the true labels vs. using smoothing, and b is used to weight between the global vs. local smoothing. We provide an illustrative simulation in Figure 1. Algorithm 1 shows how k-NN label smoothing is applied to deep learning models. Like Bahri et al. (2020), we perform k-NN on the network's logits layer.
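The following self-contained numpy sketch computes Definitions 2 and 3 by brute force (our own illustration, not the paper's implementation; it assumes one-hot label rows and includes distance ties at the k-NN radius, as in Definition 2):

```python
import numpy as np

def knn_labels(X: np.ndarray, Y: np.ndarray, k: int) -> np.ndarray:
    """Definition 2: eta_k(x) at each training point, averaged over its k-NN set.
    X: (n, d) features, Y: (n, L) one-hot labels. O(n^2) memory -- illustration only."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    r_k = np.sort(dists, axis=1)[:, k - 1]                          # k-NN radius per point
    members = dists <= r_k[:, None]                                 # N_k(x) indicator (ties included)
    return (members @ Y) / members.sum(axis=1, keepdims=True)

def knn_smooth_labels(X, Y, k, a, b):
    """Definition 3: (1 - a) * y + a * (b * uniform + (1 - b) * eta_k(x))."""
    L = Y.shape[1]
    return (1 - a) * Y + a * (b / L + (1 - b) * knn_labels(X, Y, k))
```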
Algorithm 1 Deep k-NN locally adaptive label smoothing

Inputs: 0 ≤ a, b ≤ 1, k, training data (x_1, y_1), ..., (x_n, y_n), model training procedure 𝓜.
Train model M_1 on (x_1, y_1), ..., (x_n, y_n) with 𝓜.
Let z_1, ..., z_n ∈ R^L be the logits of x_1, ..., x_n, respectively, w.r.t. M_1.
Let y^kNN_i be the k-NN smoothed label (see Definition 3) of (z_i, y_i) computed w.r.t. dataset (z_1, y_1), ..., (z_n, y_n).
Train model M_2 on (x_1, y^kNN_1), ..., (x_n, y^kNN_n) with 𝓜.
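A compact sketch of Algorithm 1's two-pass pipeline (our own; `train_fn` and `logits_fn` are hypothetical stand-ins for the user's training procedure and logit extractor, and `knn_smooth_labels` is the helper sketched above):

```python
def deep_knn_label_smoothing(train_fn, logits_fn, X, Y, k, a, b):
    """Two-pass training per Algorithm 1 (sketch, under the assumptions above).

    1. Train a first model on the original hard labels.
    2. Embed the training set via that model's logits layer.
    3. Compute k-NN smoothed labels (Definition 3) in logit space.
    4. Retrain the same architecture on the smoothed labels.
    """
    model_1 = train_fn(X, Y)                      # first pass: hard labels
    Z = logits_fn(model_1, X)                     # (n, L) logits used as the k-NN embedding
    Y_smooth = knn_smooth_labels(Z, Y, k, a, b)   # from the sketch above
    model_2 = train_fn(X, Y_smooth)               # second pass: smoothed labels
    return model_2
```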
4. Theoretical Analysis
In this section, we provide theoretical justification for why the k-NN labels may be useful. In particular, we show results for two settings, where n is the number of datapoints.

• When k ≪ n, we show that with appropriate setting of k, the k-NN smoothed labels approximate the predictions of the optimal soft classifier at a minimax-optimal rate.

• When k = O(n), we show that the distribution implied by the k-NN smoothed labels is equivalent to the original distribution convolved with an adaptive smoothing kernel.

Our results may also reveal insights into why distillation methods (the procedure of training a model on another model's predictions instead of the true labels) can work. Another way of considering the result is that the k-NN smoothed label is equivalent to the soft prediction of the k-NN classifier. Thus, if one were to train on the k-NN labels, it would be essentially distillation on the k-NN classifier, and our theoretical results show that the labels implied by k-NN approximate the predictions of the optimal classifier (in the k ≪ n setting). Learning the optimal classifier may indeed be a better goal than learning from the true labels, because the latter may lead to overfitting to the sampling noise rather than just the true signal implied by the optimal classifier. While distillation is not the topic of this work, our results in this section may be of independent interest to that area.

For the analysis, we assume the binary classification setting, but it is understood that our results can be straightforwardly generalized to the multi-class setting. The feature vectors are defined on compact support X ⊆ R^D, and datapoints are drawn as follows: the feature vector is drawn from density p_X on X, and the labels are drawn according to the label function η : X → [0, 1], i.e. η(x) = P(Y = 1 | X = x).

k ≪ n

We make a few mild regularity assumptions for our analysis to hold, which are standard in works analyzing nonparametric methods, e.g. (Singh et al., 2009; Chaudhuri & Dasgupta, 2014; Reeve & Kaban, 2019; Jiang, 2019; Bahri et al., 2020). The first part ensures that the support X does not become arbitrarily thin anywhere, the second ensures that the density does not vanish anywhere in the support, and the third ensures that the label function η is smooth w.r.t. its input.

Assumption 1.
The following three conditions hold:

• Support Regularity: There exist ω > 0 and r_0 > 0 such that Vol(X ∩ B(x, r)) ≥ ω · Vol(B(x, r)) for all x ∈ X and 0 < r < r_0, where B(x, r) := {x' ∈ R^D : |x − x'| ≤ r}.

• Non-vanishing density: p_{X,0} := inf_{x∈X} p_X(x) > 0.

• Smoothness of η: There exist 0 < α ≤ 1 and C_α > 0 such that |η(x) − η(x')| ≤ C_α · |x − x'|^α for all x, x' ∈ X.

We have the following result, which provides a uniform bound between the smoothed k-NN label η_k and the optimal soft label η.

Theorem 1.
Let 0 < δ < 1, suppose that Assumption 1 holds, and suppose that k satisfies the following:

    C · 2^D · log(4/δ) · log n ≤ k ≤ (1/2) · ω · p_{X,0} · v_D · r_0^D · n,

where C is a universal constant and v_D := π^{D/2} / Γ(D/2 + 1) is the volume of a D-dimensional unit ball. Then with probability at least 1 − δ, we have

    sup_{x∈X} |η_k(x) − η(x)| ≤ C_α · (2k / (ω · v_D · n · p_{X,0}))^{α/D} + √((2·log(4·2^D/δ) + 2D·log n) / k).

In other words, there exist constants C_0, C_1, C_2 depending on η and δ such that if k satisfies

    C_1 · log n ≤ k ≤ C_2 · n,

then with probability at least 1 − δ, ignoring logarithmic factors in n and 1/δ:

    sup_{x∈X} |η_k(x) − η(x)| ≤ C_0 · ((k/n)^{α/D} + 1/√k).

Choosing k ≈ n^{2α/(2α+D)} gives us a bound of sup_{x∈X} |η_k(x) − η(x)| ≤ Õ(n^{−α/(2α+D)}), which is the minimax optimal rate as established by Tsybakov et al. (1997).

Therefore, the advantage of using the smoothed labels η_k(x_1), ..., η_k(x_n) instead of the true labels y_1, ..., y_n is that the smoothed labels approximate the optimal soft classifier. Moreover, as shown above, with appropriate setting of k, the smoothed labels are a minimax-optimal estimator of the true label function η. Thus, the smoothed labels provide as good of a proxy for η as any estimator possibly can.

As suggested earlier, another way of considering this result is that the original labels may contain considerable noise, and thus no single label can be guaranteed reliable. Using the smoothed label instead mitigates this effect and allows us to train the model to match the label function η.

k linear in n

In the previous subsection, we showed the utility of k-NN label smoothing as a theoretically sound proxy for the optimal soft labels, which attains statistical consistency guarantees as long as k grows faster than log n and k/n → 0. Now, we analyze the case where k grows linearly with n. In this case, the k-NN smoothed labels no longer recover the optimal soft label function η, but instead an adaptive kernel smoothed version of η. We make this relationship precise here.

Suppose that k = ⌊β · n⌋ for some 0 < β < 1. We define the β-smoothed label function:

Definition 4 (β-smoothed label function). Let r_β(x) := inf{r > 0 : P(B(x, r)) ≥ β}, that is, the radius of the smallest ball centered at x with probability mass β w.r.t. P_X. Then, let η̃_β(x) be the expectation of η on B(x, r_β(x)) w.r.t. P_X:

    η̃_β(x) := (1/β) · ∫_{B(x, r_β(x))} η(x') · p_X(x') dx'.

We can view η̃_β as an adaptively kernel smoothed version of η, where the adaptivity arises from the density at the point (the more dense, the smaller the bandwidth we smooth it across) and the kernel is based on the density.

We now prove the following result, which shows that in this setting η_k estimates η̃_β(x). It is worth noting that we need very little assumption on η as compared to the previous result, because the β-smoothing of η provides a more regular label function; moreover, the rates are fast, i.e. Õ(√(2^D / n)).
Let 0 < δ < 1 and k = ⌊β · n⌋. Then with probability at least 1 − δ, we have, for n sufficiently large depending on β and δ:

    sup_{x∈X} |η_k(x) − η̃_β(x)| ≤ √((2·log(4·2^D/δ) + 2D·log n) / (β · n)).
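As a rough empirical illustration of the k ≪ n regime (a toy simulation of our own, not an experiment from the paper), one can sample a one-dimensional problem with a smooth η, set k ≈ n^{2α/(2α+D)} (here α = D = 1, so k ≈ n^{2/3}), and watch the uniform error of the k-NN label shrink as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = lambda x: 0.25 + 0.5 * x  # smooth label function on [0, 1] (alpha = 1, D = 1)

for n in [1_000, 10_000, 100_000]:
    k = int(n ** (2 / 3))       # k ~ n^(2*alpha / (2*alpha + D))
    x = np.sort(rng.uniform(size=n))
    y = (rng.uniform(size=n) < eta(x)).astype(float)  # labels drawn from eta
    grid = np.linspace(0.05, 0.95, 200)
    idx = np.searchsorted(x, grid)
    sup_err = 0.0
    for g, i in zip(grid, idx):
        lo, hi = max(0, i - k), min(n, i + k)         # window containing the k-NN
        nearest = np.argsort(np.abs(x[lo:hi] - g))[:k]
        sup_err = max(sup_err, abs(y[lo:hi][nearest].mean() - eta(g)))
    print(f"n={n:>6}  k={k:>4}  sup error ~ {sup_err:.3f}")  # shrinks as n grows
```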
5. Experiments
We now describe the experimental methodology and results for validating our proposed method.
Baselines. We start by detailing the suite of baselines we compare against. We tune baseline hyper-parameters extensively, with the precise sweeps and setups available in the Appendix. A short code sketch of two of the baseline losses follows the list.
• Control: Baseline where we train for accuracy without regard to lowering churn.

• ℓ_p Regularization: We control the stability of a model's predictions by simply regularizing them (independently of the ground-truth label) using classical ℓ_p regularization. The loss function is given by:

    L_{ℓ_p}(x_i, y_i) = L(x_i, y_i) + a · ||f(x_i)||_p^p.

We experiment with both ℓ_1 and ℓ_2 regularization.

• Bi-tempered: This is a baseline by Amid et al. (2019), originally designed for robustness to label noise. It modifies the standard logistic loss function by introducing two temperature scaling parameters t_1 and t_2. We apply their "bi-tempered" loss here, suspecting that methods which make model training more robust to noisy labels may also be effective at reducing prediction churn.

• Anchor: This is based on a method proposed by Fard et al. (2016) specifically for churn reduction. It uses the predicted probabilities from a preliminary model to smooth the training labels of the second model. We first train a preliminary model f_prelim using regular cross-entropy loss. We then retrain the model using smoothed labels (1 − a) · y_i + a · f_prelim(x_i), thus "anchoring" on a preliminary model's predictions. In our experiments, we train one preliminary model and fix it across the runs for this baseline to reduce prediction churn.

• Co-distillation: We use the co-distillation approach presented by Anil et al. (2018), who touched upon its utility for churn reduction. We train two identical models M_1 and M_2 (but subject to different random initialization) in tandem while penalizing divergence between their predictions. The overall loss is

    L_codistill(x_i, y_i) = L(f_1(x_i), y_i) + L(f_2(x_i), y_i) + a · Ψ(f_1(x_i), f_2(x_i)).

In their paper the authors set Ψ to be cross-entropy:

    Ψ(p^(1), p^(2)) = −Σ_{i∈[L]} p^(1)_i · log(p^(2)_i),

but they note KL divergence can be used. We experiment with both cross-entropy and KL divergence. We also tune n_warm, the number of burn-in steps of training before turning on the regularizer.

• Label Smoothing: This is the method of Szegedy et al. (2016) defined earlier in the paper. Our proposed method augments global label smoothing by leveraging the local k-NN estimates. Naturally, we compare against doing global smoothing only, and this serves as a key ablation model to see the added benefits of leveraging the k-NN labels.

• Mixup: This method, proposed by Zhang et al. (2017), generates synthetic training examples on the fly by convexly combining random training inputs and their associated labels, where the combination weights are random draws from a Beta(a, a) distribution. Mixup improves generalization, increases robustness to adversarial examples as well as label noise, and also improves model calibration (Thulasidasan et al., 2019).

• Ensemble: Ensembling deep neural networks can improve the quality of their uncertainty estimation (Lakshminarayanan et al., 2017; Fort et al., 2019). We consider the simple case where m identical deep neural networks are trained independently on the same training data, and at inference time, their predictions are uniformly averaged together.
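To make two of these concrete, here is a minimal numpy sketch of the ℓ_p prediction regularizer and the co-distillation loss (our own rendering of the formulas above, not reference code; we assume f(x) denotes the model's softmax outputs):

```python
import numpy as np

def cross_entropy(probs, y_onehot, eps=1e-12):
    """Per-example cross-entropy between one-hot targets and predicted probabilities."""
    return -np.sum(y_onehot * np.log(probs + eps), axis=-1)

def lp_regularized_loss(probs, y_onehot, a, p):
    """l_p baseline: standard loss plus a * ||f(x)||_p^p on the predictions."""
    return cross_entropy(probs, y_onehot) + a * np.sum(np.abs(probs) ** p, axis=-1)

def codistill_loss(probs_1, probs_2, y_onehot, a):
    """Co-distillation: both models fit the labels while a cross-entropy penalty
    Psi(p1, p2) = -sum_i p1_i * log(p2_i) ties their predictions together."""
    psi = -np.sum(probs_1 * np.log(probs_2 + 1e-12), axis=-1)
    return (cross_entropy(probs_1, y_onehot) + cross_entropy(probs_2, y_onehot)
            + a * psi)
```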
Datasets. For all datasets, we do not use any data augmentation in order to guarantee that the training data used across different trainings is held fixed. For all datasets we use the Adam optimizer with its default learning rate of 0.001, and we use the same minibatch size throughout.

• MNIST: We train a three-layer MLP with 256 hidden units and ReLU activations for 20 epochs.
• Fashion MNIST: We use the same architecture as the one used for MNIST.
• SVHN: We train the LeNet5 CNN (LeCun et al., 1998) on the Google Street View House Numbers (SVHN) dataset, where each image is cropped to 32 × 32 pixels.

• CelebA: CelebA (Liu et al., 2018) is a large-scale face attributes dataset with more than 200k celebrity images, each with 40 attribute annotations. We use the standard train and test splits, and images were resized to a fixed resolution. We select the "smiling" and "high cheekbone" attributes and perform binary classification, training LeNet5 on each.

• Phishing: To validate our method beyond the image classification setting, we train a three-layer MLP on the UCI Phishing dataset (Dua & Graff, 2017), a binary classification task over tabular input features.

Evaluation. For each dataset, baseline, and hyper-parameter setting, we run each method on the same train and test split exactly 5 times. We then report the average test accuracy as well as the test-set churn averaged across every possible pair (i, j) of runs (10 total pairs); a sketch of this protocol appears below. To give a more complete picture of the sources of churn, we also slice the churn by whether or not the test predictions of the first run in the pair were correct. Lowering the churn on the correct predictions is desirable (i.e. if the base model is correct, we clearly don't want the predictions to be changing), while churn reduction on incorrect predictions is less relevant (i.e. if the base model was incorrect, then it may be better for there to be higher churn; however, at the same time, some examples may be inherently difficult to classify, or the label may be such an outlier that we don't expect an optimal model to classify it correctly, in which case lower churn may be desirable). This is why in the results of Table 1 we bold the best performing baseline for churn on correct examples, but not for churn on incorrect examples.
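A minimal sketch of this evaluation protocol (ours; `run_preds` is assumed to be a list of predicted-label arrays, one per training run):

```python
import numpy as np
from itertools import combinations

def churn_report(run_preds, labels):
    """Average accuracy plus pairwise churn, sliced by whether the first run of
    each pair was correct on the example (5 runs -> 10 pairs). Assumes each run
    has at least one incorrect prediction."""
    accs = [np.mean(p == labels) for p in run_preds]
    churn, churn_cor, churn_inc = [], [], []
    for p_i, p_j in combinations(run_preds, 2):
        disagree = p_i != p_j
        correct = p_i == labels
        churn.append(disagree.mean())
        churn_cor.append(disagree[correct].mean())
        churn_inc.append(disagree[~correct].mean())
    return {"accuracy": np.mean(accs), "churn": np.mean(churn),
            "churn_correct": np.mean(churn_cor),
            "churn_incorrect": np.mean(churn_inc)}
```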
Figure 2. Performance across hyperparameters for SVHN. For each of the three hyperparameters of our method (a, b, and k), we show the performance across a range of that hyperparameter, keeping the other two hyperparameters fixed. Standard error bands are shaded. Top: We tune a while fixing k = 10 and b = 0.9. We see better performance under both accuracy and churn as a increases, which suggests that the less weight we put on the original label, the better. Middle: We tune b while fixing k = 10 and a = 0.5. We don't see any clear pattern, which suggests that b is an essential hyperparameter trading off the locally adaptive vs. global smoothing – this suggests that adding the locally adaptive component to the label smoothing is indeed having an effect on performance. Bottom: We tune k while fixing a = 1 and b = 0.9. We see that the accuracy and churn have little difference across a wide range of k from k = 10 to k = 500, which suggests that k is not an essential hyperparameter and that we are stable in it.

Dataset | Method | Accuracy % | Churn % | Churn Correct | Churn Incorrect
--- | --- | --- | --- | --- | ---
SVHN | k-NN LS (k=10, a=1, b=0.9) | 88.98 (0.33) | 10.98 (0.28) | 4.64 (0.29) | 62.23 (1.22)
SVHN | Label Smoothing (a=0.9) | 87.26 (0.73) | 13.46 (0.62) | 5.31 (0.57) | 67.2 (1.44)
SVHN | Anchor (a=1.0) | 87.17 (0.16) | 12.48 (0.39) | 5.19 (0.2) | 61.66 (1.85)
SVHN | ℓ1 Reg (a=0.5) | 88.16 (0.35) | 11.85 (0.35) | 5.07 (0.16) | 62.73 (2.1)
SVHN | ℓ2 Reg (a=0.2) | 74.18 (3.41) | 22.89 (3.74) | 9.58 (4.04) | 59.36 (5.7)
SVHN | Co-distill (CE, a=0.5) | 87.64 (0.64) | 12.46 (0.48) | 5.16 (0.51) | 63.82 (1.67)
SVHN | Co-distill (KL, a=0.5) | 87.52 (0.45) | 13.01 (0.3) | 5.54 (0.33) | 65.44 (1.46)
SVHN | Bi-tempered (t1=0.5, t2=1) | 88.04 (0.5) | 12.03 (0.3) | 5.26 (0.3) | 62.48 (1.83)
SVHN | Mixup (a=0.5) | — | — | — | —
MNIST | k-NN LS (k=5, a=0.9, b=0.9) | — | — | — | —
MNIST | ℓ1 Reg (a=0.5) | 98.08 (0.1) | 1.67 (0.12) | 0.8 (0.08) | 46.65 (3.2)
MNIST | ℓ2 Reg (a=0.01) | 97.67 (0.29) | 2.51 (0.31) | 1.3 (0.27) | 56.8 (2.84)
MNIST | Co-distill (CE, a=0.2, n_warm=2k) | 98.08 (0.06) | 2.08 (0.11) | 0.98 (0.07) | 58.6 (3.91)
MNIST | Co-distill (KL, a=0.05, n_warm=1k) | 97.98 (0.14) | 2.16 (0.16) | 0.97 (0.13) | 59.56 (3.64)
MNIST | Bi-tempered (t1=0.9, t2=1.0) | 98.09 (0.2) | 2.04 (0.15) | 1.07 (0.14) | 55.82 (4.32)
MNIST | Mixup (a=0.2) | 98.17 (0.04) | 1.59 (0.07) | 0.74 (0.04) | 47.8 (2.53)
MNIST | Control | 97.98 (0.13) | 2.28 (0.13) | 0.96 (0.07) | 63.36 (2.55)
Fashion MNIST | k-NN LS (k=10, a=1, b=0.5) | 88.89 (0.14) | 6.94 (0.18) | 3.27 (0.15) | 36.26 (1.09)
Fashion MNIST | Label Smoothing (a=0.8) | 88.46 (0.17) | 7.2 (0.46) | 3.32 (0.28) | 36.63 (2.02)
Fashion MNIST | Anchor (a=0.9) | 88.55 (0.14) | 7.53 (0.45) | 3.6 (0.23) | 37.78 (2.29)
Fashion MNIST | ℓ1 Reg (a=0.5) | 88.52 (0.19) | 7.86 (0.36) | 3.59 (0.18) | 40.38 (1.81)
Fashion MNIST | ℓ2 Reg (a=0.1) | 86.88 (0.35) | 8.24 (0.55) | 3.88 (0.41) | 36.81 (2.63)
Fashion MNIST | Co-distill (CE, a=0.5, n_warm=2k) | 88.76 (0.21) | 7.51 (0.39) | 3.67 (0.3) | 37.98 (1.71)
Fashion MNIST | Co-distill (KL, a=0.5, n_warm=2k) | 88.85 (0.35) | 7.83 (0.43) | 3.68 (0.29) | 40.59 (2.4)
Fashion MNIST | Bi-tempered (t1=0.7, t2=2) | 88.7 (0.29) | 7.36 (0.47) | 3.5 (0.19) | 37.24 (3.04)
Fashion MNIST | Mixup (a=0.4) | — | — | — | —
CelebA Smiling | k-NN LS (k=100, b=0.1, a=0.9) | — | — | — | —
CelebA Smiling | ℓ1 Reg (a=0.01) | 89.35 (0.16) | 6.85 (0.34) | 3.92 (0.27) | 31.62 (1.21)
CelebA Smiling | ℓ2 Reg (a=0.5) | 89.39 (0.26) | 6.71 (0.26) | 3.61 (0.22) | 32.48 (1.35)
CelebA Smiling | Co-distill (CE, a=0.5, n_warm=1k) | 89.59 (0.29) | 6.31 (0.23) | 3.66 (0.3) | 29.47 (1.47)
CelebA Smiling | Co-distill (KL, a=0.5, n_warm=2k) | 89.57 (0.22) | 6.1 (0.23) | 3.34 (0.26) | 29.66 (1.47)
CelebA Smiling | Bi-tempered (t1=0.9, t2=2) | 89.88 (0.18) | 6.44 (0.31) | 3.56 (0.19) | 31.96 (1.96)
CelebA Smiling | Mixup (a=0.2) | 89.71 (0.14) | 6.15 (0.12) | 3.51 (0.12) | 29.37 (0.66)
CelebA Smiling | Control | 89.67 (0.19) | 7.3 (0.45) | 4.06 (0.27) | 35.34 (2.35)
CelebA High Cheekbone | k-NN LS (k=100, b=0.1, a=0.9) | — | — | — | —
CelebA High Cheekbone | ℓ1 Reg (a=0.001) | 83.6 (0.14) | 9.06 (0.32) | 5.41 (0.24) | 27.66 (1.03)
CelebA High Cheekbone | ℓ2 Reg (a=0.01) | 83.59 (0.26) | 8.43 (0.23) | 4.93 (0.23) | 26.14 (1.16)
CelebA High Cheekbone | Co-distill (CE, a=0.5, n_warm=1k) | 84.08 (0.21) | 8.96 (0.37) | 5.33 (0.36) | 28.11 (0.88)
CelebA High Cheekbone | Co-distill (KL, a=0.5, n_warm=1k) | 84.31 (0.08) | 8.57 (0.16) | 5.06 (0.13) | 27.39 (0.47)
CelebA High Cheekbone | Bi-tempered (t1=0.5, t2=4) | 83.92 (0.13) | 7.84 (0.32) | 4.75 (0.21) | 24.01 (1)
CelebA High Cheekbone | Mixup (a=0.4) | 84.53 (0.14) | 7.92 (0.47) | 4.69 (0.31) | 25.53 (1.54)
CelebA High Cheekbone | Control | 83.93 (0.56) | 10.18 (0.93) | 6.2 (0.89) | 31.1 (2.22)
Phishing | k-NN LS (k=500, a=0.8, b=0.9) | — | — | — | —
Phishing | ℓ1 Reg (a=0.5) | 96.51 (0.12) | 1.35 (0.3) | 0.7 (0.21) | 19.37 (4)
Phishing | ℓ2 Reg (a=0.5) | 95.38 (0.18) | 1.48 (0.34) | 0.83 (0.24) | 14.95 (4.08)
Phishing | Co-distill (CE, a=0.2, n_warm=2k) | 96.02 (0.19) | 1.45 (0.26) | 0.83 (0.21) | 16.72 (4.13)
Phishing | Co-distill (KL, a=0.001, n_warm=1k) | 95.94 (0.33) | 1.51 (0.2) | 0.65 (0.18) | 20.95 (6.14)
Phishing | Bi-tempered (t1=0.9, t2=1.0) | 96.26 (0.37) | 2.32 (0.69) | 1.23 (0.53) | 30.19 (8.51)
Phishing | Mixup (a=0.1) | 96.22 (0.23) | 1.80 (0.33) | 1.05 (0.28) | 21.53 (4.25)
Phishing | Control | 96.3 (0.32) | 2.25 (0.59) | 1.21 (0.38) | 29.05 (7.93)
Table 1. Results across all datasets and baselines under optimal hyperparameter tuning (settings shown). Note that we report the standard deviation of the runs instead of the standard deviation of the mean (i.e. standard error), which is often reported instead. The former is higher than the latter by a factor of the square root of the number of trials (10).

In the results (Table 1), for each dataset and baseline, we chose the optimal hyperparameter setting by first sorting by accuracy and choosing the setting with the highest accuracy; if there were multiple settings very close to the top accuracy (defined as within a small threshold of the top test accuracy), then we chose the setting with the lowest churn among them (a short sketch of this selection rule appears at the end of this section). There is often no principled way to trade off the two sometimes-competing objectives of accuracy and churn (e.g. Cotter et al. (2019) offer a heuristic to trade off the two objectives in a more balanced manner on the Pareto frontier). However, in this case, biasing towards higher accuracy is most realistic because in practice, when given a choice between two models, it's usually best to go with the more accurate model. Fortunately, we will see that accuracy and churn are not necessarily competing objectives, and our proposed method usually gives the best result for both simultaneously.

In Figure 2, we show the performance on SVHN w.r.t. the hyperparameters for both accuracy and churn. We fix two of the hyperparameters and show the results across tunings of the remaining hyperparameter. We do this for each of the three hyperparameters of our approach (a, b, and k). We see that larger a corresponds to better performance, implying that less weight on the original labels leads to better results. We also see that across a wide range of k, the performance did not change much, which suggests that in practice, k can be set to some default and not require tuning. Such stability in k is desirable. Hence, the remaining hyperparameter b, which decides the trade-off between the locally adaptive vs. global smoothing, appears most essential. This further shows that our proposal of using locally adaptive label smoothing has a real effect on the results for both churn and accuracy.

We see from Table 1 that mixup and our method, k-NN label smoothing, are consistently the most competitive; mixup outperforms on SVHN and Fashion MNIST, while k-NN label smoothing outperforms on all the remaining datasets. Notably, both methods do well on accuracy and churn metrics simultaneously, suggesting that there is no inherent trade-off between predictive performance and churn reduction. Due to space constraints, results for the ensemble baseline can be found in the Appendix. While we found ensembling to be remarkably effective, it does come with higher cost (more trainable parameters and higher inference cost), and so we discourage a direct comparison with other methods, since an ensemble uses a different model class than a single model.
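The selection rule referenced above, as a small sketch (ours; the closeness threshold `tol` stands in for the paper's unspecified accuracy margin):

```python
def select_setting(results, tol=0.1):
    """Pick a hyperparameter setting: highest test accuracy, breaking near-ties
    (within `tol` accuracy points) by lowest churn. `results` holds
    (accuracy, churn, setting) tuples."""
    best_acc = max(acc for acc, _, _ in results)
    near_best = [r for r in results if best_acc - r[0] <= tol]
    return min(near_best, key=lambda r: r[1])
```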
6. Conclusion
Modern DNN training is a noisy process: randomization arising from stochastic minibatches, weight initialization, data preprocessing techniques, and hardware can all lead to models with drastically different predictions on the same datapoints when using the same training procedure.

Reducing such prediction churn is important in practical problems, as production ML models are constantly updated and improved on. Since offline metrics can usually only serve as proxies for the live metrics, comparing the models in A/B tests and live experiments oftentimes must involve manual labeling of the disagreements between the models, making it a costly procedure. Thus, controlling the amount of predictive churn can be crucial for more efficiently iterating on and improving models in a production setting.

Despite the practical importance of this problem, there has been little work done in the literature on this topic. We provide one of the first comprehensive analyses of reducing predictive churn arising from retraining the model on the same dataset and model architecture. We show that numerous methods used for other goals, such as learning with noisy labels and improving model calibration, serve as reasonable baselines for lowering prediction churn. We propose a new technique, locally adaptive label smoothing, that often outperforms the baselines across a range of datasets and model architectures.

Further study in this area is critical: the problem of predictive churn has received far too little treatment in the academic literature given its practical significance. Our technique may also help in the subfields that we drew many of our baselines from, including better calibrated DNNs and robustness to label noise, suggesting a bi-directional flow of ideas between the goal of reducing predictive churn and these subfields. This is a direction for future work.
References
Amid, E., Warmuth, M. K., Anil, R., and Koren, T. Robust bi-tempered logistic loss based on Bregman divergences. In Advances in Neural Information Processing Systems, pp. 14987–14996, 2019.
Anil, R., Pereyra, G., Passos, A., Ormandi, R., Dahl, G. E., and Hinton, G. E. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
Bahri, D., Jiang, H., and Gupta, M. Deep k-NN for noisy labels. ICML, 2020.
Beel, J., Genzmehr, M., Langer, S., Nürnberger, A., and Gipp, B. A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation. In Proceedings of the international workshop on reproducibility and replication in recommender systems evaluation, pp. 7–14, 2013.
Chaudhuri, K. and Dasgupta, S. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pp. 343–351, 2010.
Chaudhuri, K. and Dasgupta, S. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 3437–3445, 2014.
Cotter, A., Jiang, H., Gupta, M. R., Wang, S., Narayan, T., You, S., and Sridharan, K. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research, 20(172):1–59, 2019.
Cover, T. M. Rates of convergence for nearest neighbor procedures. In Proceedings of the Hawaii International Conference on Systems Sciences, pp. 413–415, 1968.
Deng, A. Objective Bayesian two sample hypothesis testing for online controlled experiments. In Proceedings of the 24th International Conference on World Wide Web, pp. 923–928, 2015.
Deng, A. and Shi, X. Data-driven metric development for online controlled experiments: Seven lessons learned. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 77–86, 2016.
Deng, A., Xu, Y., Kohavi, R., and Walker, T. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 123–132, 2013.
Devroye, L., Gyorfi, L., Krzyzak, A., Lugosi, G., et al. On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics, 22(3):1371–1385, 1994.
Dmitriev, P. and Wu, X. Measuring metrics. In Proceedings of the 25th ACM international on conference on information and knowledge management, pp. 429–437, 2016.
Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Fard, M. M., Cormier, Q., Canini, K., and Gupta, M. Launch and iterate: Reducing prediction churn. In Advances in Neural Information Processing Systems, pp. 3179–3187, 2016.
Fix, E. and Hodges Jr, J. L. Discriminatory analysis – nonparametric discrimination: consistency properties. Technical report, California Univ Berkeley, 1951.
Fort, S., Hu, H., and Lakshminarayanan, B. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. JMLR Workshop and Conference Proceedings, 2010.
Goh, G., Cotter, A., Gupta, M., and Friedlander, M. P. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pp. 2415–2423, 2016.
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pp. 8527–8537, 2018.
Jiang, H. Non-asymptotic uniform rates of consistency for k-NN regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3999–4006, 2019.
Jiang, H., Kim, B., Guan, M. Y., and Gupta, M. R. To trust or not to trust a classifier. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402–6413, 2017.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Liu, Z., Luo, P., Wang, X., and Tang, X. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 15:2018, 2018.
Loshchilov, I. and Hutter, F. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
Malach, E. and Shalev-Shwartz, S. Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, pp. 960–970, 2017.
Müller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems, pp. 4694–4703, 2019.
Papernot, N. and McDaniel, P. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.
Reeve, H. W. and Kaban, A. Fast rates for a kNN classifier robust to unknown asymmetric label noise. arXiv preprint arXiv:1906.04542, 2019.
Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? arXiv preprint arXiv:1805.11604, 2018.
Scardapane, S. and Wang, D. Randomness in neural networks: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(2):e1200, 2017.
Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48, 2019.
Singh, A., Scott, C., Nowak, R., et al. Adaptive Hausdorff estimation of density level sets. The Annals of Statistics, 37(5B):2760–2782, 2009.
Song, G. and Chai, W. Collaborative learning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 1832–1841, 2018.
Stone, C. J. Consistent nonparametric regression. The Annals of Statistics, pp. 595–620, 1977.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. Ad recommendation systems for life-time value optimization. In Proceedings of the 24th International Conference on World Wide Web, pp. 1305–1310, 2015.
Thulasidasan, S., Bhattacharya, T., Bilmes, J., Chennupati, G., and Mohd-Yusof, J. Combating label noise in deep learning using abstention. arXiv preprint arXiv:1905.10964, 2019.
Tsybakov, A. B. et al. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948–969, 1997.
Turner, J. P. and Nowotny, T. Estimating numerical error in neural network simulations on graphics processing units. BMC Neuroscience, 16(198), 2015.
Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Zhang, Y., Xiang, T., Hospedales, T. M., and Lu, H. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328, 2018.
Zheng, S., Song, Y., Leung, T., and Goodfellow, I. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4480–4488, 2016.
Zhu, X., Gong, S., et al. Knowledge distillation by on-the-fly native ensemble. In Advances in neural information processing systems, pp. 7517–7527, 2018.
A. Proofs
For the proofs, we make use of the following result from Jiang (2019), which bounds the number of distinct k-NN sets on the sample across all k:

Lemma 1 (Lemma 3 of Jiang (2019)). Let M be the number of distinct k-NN sets over X, that is, M := |{N_k(x) : x ∈ X}|. Then M ≤ 2^D · n^D.

Proof of Theorem 1. We have by the triangle inequality and the smoothness condition in Assumption 1 that:

    |η_k(x) − η(x)| ≤ |Σ_{i=1}^n (η(x_i) − η(x)) · 1[x_i ∈ N_k(x)] / |N_k(x)|| + |Σ_{i=1}^n (y_i − η(x_i)) · 1[x_i ∈ N_k(x)] / |N_k(x)||
                   ≤ C_α · r_k(x)^α + |Σ_{i=1}^n (y_i − η(x_i)) · 1[x_i ∈ N_k(x)] / |N_k(x)||.

We now bound each of the two terms separately.

To bound r_k(x), let r = (2k / (ω · v_D · n · p_{X,0}))^{1/D}. We have P(B(x, r)) ≥ ω · inf_{x'∈B(x,r)∩X} p_X(x') · v_D · r^D ≥ ω · p_{X,0} · v_D · r^D = 2k/n, where P is the distribution function w.r.t. p_X. By Lemma 7 of Chaudhuri & Dasgupta (2010) and the condition on k, it follows that with probability 1 − δ/2, uniformly in x ∈ X, |B(x, r) ∩ X| ≥ k, where X is the sample of feature vectors. Hence, r_k(x) < r for all x ∈ X uniformly with probability at least 1 − δ/2.

Define ξ_i := y_i − η(x_i). Then we have −1 ≤ ξ_i ≤ 1, and thus by Hoeffding's inequality, A_x := Σ_{i=1}^n ξ_i · 1[x_i ∈ N_k(x)] / |N_k(x)| satisfies P(|A_x| > t/k) ≤ 2·exp(−t²/(2k)). Setting t = √(2k · (log(4·2^D/δ) + D·log(n))) gives

    P(|A_x| ≥ √((2·log(4·2^D/δ) + 2D·log(n)) / k)) ≤ δ / (2 · 2^D · n^D).

By Lemma 3 of Jiang (2019), the number of unique random variables A_x across all x ∈ X is bounded by 2^D · n^D. Thus, by a union bound,

    P(sup_{x∈X} |A_x| ≥ √((2·log(4·2^D/δ) + 2D·log(n)) / k)) ≤ δ/2.

The result follows. ∎
Proof of Theorem 2.
Let X be the n sampled feature vectors and let x ∈ X. Define k'(x) := |X ∩ B(x, r_β(x))|. We have:

    |η_k(x) − η̃_β(x)| ≤ |η_{k'(x)}(x) − η_k(x)| + |η_{k'(x)}(x) − η̃_β(x)|.

We bound each of the two terms separately. We have

    |k'(x) − k| = |Σ_{x_i∈X} 1[x_i ∈ B(x, r_β(x))] − β · n|.

By Hoeffding's inequality we have P(|k'(x) − k| ≥ t · n) ≤ 2·exp(−2·t²·n). Choosing t = √((log(4·2^D/δ) + D·log(n)) / (2n)) gives us

    P(|k'(x) − k| ≥ √(n · (log(4·2^D/δ) + D·log(n)) / 2)) ≤ δ / (2 · 2^D · n^D).

By Lemma 3 of Jiang (2019), the number of unique sets of points consisting of balls intersected with the sample is bounded by 2^D · n^D, and thus by a union bound, we have with probability at least 1 − δ/2:

    sup_{x∈X} |k'(x) − k| ≤ √(n · (log(4·2^D/δ) + D·log(n)) / 2).

We now have

    |η_{k'(x)}(x) − η_k(x)| ≤ |1/k − 1/k'(x)| · min{k, k'(x)} + min{1/k, 1/k'(x)} · |k − k'(x)|
                            ≤ (2/k) · |k − k'(x)|
                            ≤ (1/β) · √((2·log(4·2^D/δ) + 2D·log(n)) / n),

where the first inequality follows by comparing the difference contributed by the shared neighbors among the k-NN and k'(x)-NN (first term on the RHS) and that contributed by the neighbors that are not shared (second term on the RHS).

For the second term, define A_x := X ∩ B(x, r_β(x)). For any x' sampled from B(x, r_β(x)), the expected label is η̃_β(x). Since η_{k'(x)}(x) is the mean label among the datapoints in A_x, we have by Hoeffding's inequality that P(|η_{k'(x)}(x) − η̃_β(x)| ≥ t/k'(x)) ≤ 2·exp(−t²/(2k'(x))). Setting t = √(2k'(x) · (log(4·2^D/δ) + D·log(n))) gives

    P(|η_{k'(x)}(x) − η̃_β(x)| ≥ √((2·log(4·2^D/δ) + 2D·log(n)) / k'(x))) ≤ δ / (2 · 2^D · n^D).

By Lemma 3 of Jiang (2019), the number of unique sets A_x across all x ∈ X is bounded by 2^D · n^D. Thus, by a union bound, with probability at least 1 − δ/2, uniformly in x ∈ X:

    |η_{k'(x)}(x) − η̃_β(x)| ≤ √((2·log(4·2^D/δ) + 2D·log(n)) / k'(x)).

The result follows immediately for n sufficiently large, since k'(x) concentrates around k = ⌊β · n⌋. ∎
Dataset (m=5) | Accuracy (%) | Churn (%) | Churn Correct | Churn Incorrect
--- | --- | --- | --- | ---
SVHN | 90.34 (0.31) | 6.61 (0.19) | 2.75 (0.28) | 43.12 (1.49)
MNIST | 98.5 (0.07) | 0.94 (0.14) | 0.44 (0.09) | 33.74 (4.39)
Fashion MNIST | 89.71 (0.12) | 4.05 (0.14) | 1.85 (0.05) | 23.16 (1.29)
CelebA Smiling | 90.56 (0.09) | 3.35 (0.16) | 1.82 (0.11) | 17.95 (0.99)
CelebA High Cheekbone | 85.12 (0.16) | 4.95 (0.2) | 2.87 (0.1) | 16.81 (1.24)
Phishing | 96.11 (0.06) | 0.54 (0.08) | 0.29 (0.08) | 6.77 (1.31)
Table 2. Ensemble results for all datasets. In all settings, the optimal m (number of subnetworks) is 5. We see that, compared to the other methods presented, ensembling does well in both predictive performance and in reducing churn. It does come at a cost, however: the model is effectively 5 times larger, making both training and inference more expensive.

B. Ensemble Results
In Table 2 we present the experimental results for the ensemble baseline. The method performs remarkably well, beating the proposed method and the other baselines on both accuracy and churn reduction across datasets. We do note, however, that ensembling does come at a cost which may prove prohibitive in many practical applications: having m times the number of trainable parameters, training time (if done sequentially) takes m times as long, as does inference, since each subnetwork must be evaluated before aggregation.

Fixed | Ablated | Accuracy (%) | Churn (%) | Churn Correct
--- | --- | --- | --- | ---
k = 10, a = 1 | b = 0 | 86.54 (0.67) | 13.43 (0.58) | 5.86 (0.57)
k = 10, a = 1 | b = 0.05 | 87.37 (0.38) | 12.22 (0.31) | 5.34 (0.31)
k = 10, a = 1 | b = 0.1 | 86.94 (0.65) | 13.41 (0.39) | 5.69 (0.57)
k = 10, a = 1 | b = 0.5 | 88.48 (0.52) | 11.12 (0.5) | 4.37 (0.35)
k = 10, a = 1 | b = 0.9 | 88.98 (0.33) | 10.98 (0.28) | 4.64 (0.29)
k = 10, a = 0.5 | b = 0 | 84.44 (2.43) | 15.85 (2.39) | 6.73 (2.47)
k = 10, a = 0.5 | b = 0.05 | 79.64 (3.1) | 22.02 (5.15) | 10.28 (4.06)
k = 10, a = 0.5 | b = 0.1 | 79.88 (2.63) | 21.09 (3.59) | 10.25 (1.85)
k = 10, a = 0.5 | b = 0.5 | 84.44 (2.54) | 14.33 (1.78) | 6.52 (2.83)
k = 10, a = 0.5 | b = 0.9 | 81.06 (2.35) | 20.53 (4.52) | 8.68 (3.36)
k = 10, b = 0.9 | a = 0.005 | 73.91 (3.01) | 28.02 (5.66) | 13.85 (4.82)
k = 10, b = 0.9 | a = 0.01 | 72.41 (4.86) | 25.57 (5.78) | 13.66 (7.01)
k = 10, b = 0.9 | a = 0.02 | 72.03 (1.79) | 31.25 (7.25) | 17.26 (6.56)
k = 10, b = 0.9 | a = 0.05 | 73.2 (3.33) | 30.41 (6.2) | 17.96 (6.04)
k = 10, b = 0.9 | a = 0.1 | 75.28 (1.98) | 23.96 (4.76) | 10.13 (4.25)
k = 10, b = 0.9 | a = 0.5 | 81.06 (2.35) | 20.53 (4.52) | 8.68 (3.36)
k = 10, b = 0.9 | a = 0.8 | 85.99 (0.73) | 13.76 (0.75) | 6 (0.83)
k = 10, b = 0.9 | a = 0.9 | 87.27 (0.41) | 13.72 (0.41) | 5.68 (0.32)
k = 10, b = 0.9 | a = 1.0 | 88.98 (0.33) | 10.98 (0.28) | 4.64 (0.29)
k = 10, b = 0.5 | a = 0.005 | 71.45 (3.81) | 21.14 (4.37) | 11.5 (5.46)
k = 10, b = 0.5 | a = 0.01 | 74.73 (6.24) | 25.24 (3.84) | 8.28 (4.35)
k = 10, b = 0.5 | a = 0.02 | 73.59 (3.72) | 29.47 (6.89) | 17.52 (6.13)
k = 10, b = 0.5 | a = 0.05 | 74.17 (3.88) | 20.26 (4.15) | 5.79 (3.7)
k = 10, b = 0.5 | a = 0.1 | 72.43 (2.75) | 25.77 (5.41) | 13.42 (4.89)
k = 10, b = 0.5 | a = 0.5 | 84.44 (2.54) | 14.33 (1.78) | 6.52 (2.83)
k = 10, b = 0.5 | a = 0.8 | 87.26 (0.41) | 11.76 (0.24) | 4.62 (0.21)
k = 10, b = 0.5 | a = 0.9 | 86.85 (0.54) | 12.54 (0.44) | 5.25 (0.48)
k = 10, b = 0.5 | a = 1.0 | 88.48 (0.52) | 11.12 (0.5) | 4.37 (0.35)
a = 1, b = 0.9 | k = 10 | 88.98 (0.33) | 10.98 (0.28) | 4.64 (0.29)
a = 1, b = 0.9 | k = 100 | 88.19 (0.19) | 11.15 (0.23) | 4.67 (0.17)
a = 1, b = 0.9 | k = 500 | 87.98 (0.62) | 11.33 (0.35) | 4.72 (0.55)
Table 3.
Ablation on k-NN label smoothing's hyperparameters a, b, and k for the SVHN dataset.

C. Ablation Study
In Table 3, we report SVHN results ablating k-NN label smoothing's hyperparameters: k, a, and b. We observe the following trends: with a fixed to 1, both accuracy and churn improve with increasing b, and a similar relationship holds as a increases with b held fixed. Lastly, both key metrics are stable with respect to k.
Our experiments involved performing a grid search over hyperparameters. We detail the search ranges per method below.

k-NN label smoothing.
• k ∈ [5, 10, 100, 500]
• a ∈ [0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 0.8, 0.9, 1.0]
• b ∈ [0, 0.05, 0.1, 0.5, 0.9]

Anchor.
• a ∈ [0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 0.8, 0.9, 1.0]

ℓ1, ℓ2 Regularization.
• a ∈ [0.001, 0.01, 0.1, 0.2, 0.5]

Co-distill.
• a ∈ [0.001, 0.05, 0.2, 0.5]
• n_warm ∈ [1000, 2000]

Bi-tempered.
• t1 ∈ [0.5, 0.7, 0.9]
• t2 ∈ [1.0, 2.0, 4.0]
• n_iters always set to a fixed value.

Mixup.
• a ∈ [0.1, 0.2, 0.4, 0.5]

Ensemble.
• m ∈ [3, 5]