Label Smoothed Embedding Hypothesis for Out-of-Distribution Detection
Dara Bahri * 1
Heinrich Jiang * 1
Yi Tay 1
Donald Metzler 1

* Equal contribution. 1 Google Research, Mountain View, California, USA. Correspondence to: Dara Bahri <[email protected]>.

Abstract
Detecting out-of-distribution (OOD) examples is critical in many applications. We propose an unsupervised method to detect OOD samples using a k-NN density estimate with respect to a classification model's intermediate activations on in-distribution samples. We leverage a recent insight about label smoothing, which we call the Label Smoothed Embedding Hypothesis, and show that one of its implications is that the k-NN density estimator performs better as an OOD detection method, both theoretically and empirically, when the model is trained with label smoothing. Finally, we show that our proposal outperforms many OOD baselines and also provide new finite-sample, high-probability statistical results for the k-NN density estimator's ability to detect OOD examples.
1. Introduction
Identifying out-of-distribution examples has a wide range of applications in machine learning, including fraud detection in credit cards (Awoyemi et al., 2017) and insurance claims (Bhowmik, 2011), fault detection and diagnosis in critical systems (Zhao et al., 2013), segmentation in medical imaging to find abnormalities (Prastawa et al., 2004), network intrusion detection (Zhang & Zulkernine, 2006), patient monitoring and alerting (Hauskrecht et al., 2013), counter-terrorism (Skillicorn, 2008), and anti-money laundering (Labib et al., 2020).

Out-of-distribution detection is highly related to the classical line of work in anomaly and outlier detection. Such methods include density-based clustering (Ester et al., 1996), the one-class SVM (Schölkopf et al., 2001), and the isolation forest (Liu et al., 2008). However, these classical methods often aren't immediately practical on large and possibly high-dimensional modern datasets.

More recently, Hendrycks & Gimpel (2016) proposed a simple baseline for detecting out-of-distribution examples using a neural network's softmax predictions, which has motivated many works since then that leverage deep learning (Lakshminarayanan et al., 2016; Liang et al., 2017; Lee et al., 2017). However, the majority of these works still ultimately rely on the softmax predictions, which suffer from the following weakness: the uncertainty in the softmax output cannot distinguish between (1) the case where the example is actually in-distribution but there is high uncertainty in its predictions, and (2) the case where the example is actually out-of-distribution. This is largely because the softmax probabilities sum to 1 and thus must assign the probability weights accordingly. This has motivated recent explorations in estimating conformal sets for neural networks (Park et al., 2019; Angelopoulos et al., 2020), which can distinguish between the two cases.

In this paper, we circumvent the above-mentioned weakness by avoiding the softmax probabilities altogether. To this end, we approach OOD detection with an alternative paradigm: we leverage the intermediate embeddings of the neural network together with nearest neighbors. Our intuition is backed by recent work demonstrating the effectiveness of nearest-neighbor based methods on these embeddings for a range of problems, such as uncertainty estimation (Jiang et al., 2018), adversarial robustness (Papernot & McDaniel, 2018), and noisy labels (Bahri et al., 2020).

In this work, we explore using k-NN density estimation to detect OOD examples by computing this density on the embedding layers. It is worth noting that k-NN density estimation is an unsupervised technique, which makes it very different from the aforementioned deep k-NN work (Bahri et al., 2020), which leverages the label information of the nearest neighbors. One key intuition here is that low k-NN density examples are OOD candidates, since low density implies that these examples are far from the training examples in the embedding space.

In order for density estimation to be effective on the intermediate embeddings, the data must have good clusterability (Ackerman & Ben-David, 2009), meaning that examples in the same class should be close together in distance in the embeddings, while examples not in the same class should be far apart.
While much work has been done on the specific problem of clustering deep learning embeddings (Xie et al., 2016a; Hershey et al., 2016), many of these ideas are not applicable to density estimation.

In this paper, we use a much simpler but effective approach: label smoothing, which involves training the neural network on a soft label obtained by taking a weighted average between the original one-hot encoded label and the uniform distribution over labels; with smoothing parameter $\alpha$ and $K$ classes, the soft target is $(1 - \alpha)\, y_{\text{one-hot}} + \alpha/K$. We leverage a key insight about the effect of label smoothing on the embeddings (Müller et al., 2019): training with label smoothing has the effect of contracting the intermediate activations of the examples within the same class to be closer together at a faster rate relative to examples in different classes. This results in embeddings that have better clusterability when treating each class as a cluster. We call this the Label Smoothed Embedding Hypothesis, which we define below.
Hypothesis 1 (Label Smoothed Embedding Hypothesis (Müller et al., 2019)). Training with label smoothing contracts the intermediate embeddings of the examples in a neural network, where examples within the same class move closer towards each other in distance at a faster rate than examples in different classes.
We refer interested readers to (Müller et al., 2019) for 2D visualizations of this effect on the model's penultimate layer. We will later portray the same phenomenon using k-NN density estimation.

We summarize our contributions as follows:

• We propose a new procedure that uses label smoothing along with aggregating the k-NN density estimator across various intermediate representations to obtain an OOD score.

• We show a number of new theoretical results for the k-NN density estimator in the context of OOD detection, including guarantees on the recall and precision of identifying OOD examples, the preservation of the ranking w.r.t. the true density, and a result that provides intuition for why the Label Smoothed Embedding Hypothesis improves the k-NN based OOD score.

• We experimentally validate the effectiveness of our method and the benefits of label smoothing on benchmark image classification datasets, comparing against recent baselines, including one that uses k-NN in a different way, as well as classical alternatives to the k-NN applied in the same way. The comparison against these ablative models highlights the discriminative power of the k-NN density estimator for OOD detection.

• We conduct ablations to study the performance impact of the three hyper-parameters of our method: (1) the amount of label smoothing, (2) which intermediate layers to use, and (3) the number of neighbors k.
2. Algorithm
We start by defining the foundational quantity in our method.
Definition 1.
Define the k-NN radius of $x \in \mathbb{R}^D$ as
$$r_k(x; X) := \inf\{r > 0 : |X \cap B(x, r)| \geq k\}.$$
When $X$ is implicit, we drop it from the notation for brevity.

Our method goes as follows: upon training a classification neural network on a sample $X_{in}$ from some distribution $f_{in}$, the intermediate representations of $X_{in}$ should be close together (in the Euclidean sense), possibly clustered by class label. Meanwhile, out-of-distribution points should be further away from the training manifold; that is, $r_k(g_i(x_{out}); g_i(X_{in})) > r_k(g_i(x_{in}); g_i(X_{in}))$ for $x_{in} \sim f_{in}$ and $x_{out} \sim f_{out}$, where $g_i$ maps the input space to the output of the $i$-th layer of the trained model. Thus, for a fixed layer $i$, we propose the following statistic:
$$T_i(x) := \frac{r_k(g_i(x); g_i(X_{in}))}{Q(X_{in}, g_i)}, \qquad Q(X_{in}, g_i) := \mathbb{E}_{z \sim f_{in}}\, r_k(g_i(z); g_i(X_{in})).$$
Since $Q$ depends on the unknown $f_{in}$, we estimate it using cross-validation:
$$\hat{Q}(X_{in}, g_i) = \frac{1}{|X_{in}|} \sum_{x \in X_{in}} r_k(g_i(x); g_i(X_{in} \setminus \{x\})) = \frac{1}{|X_{in}|} \sum_{x \in X_{in}} r_{k+1}(g_i(x); g_i(X_{in})).$$
Letting $\hat{T}_i$ be our statistic using $\hat{Q}$, we aggregate across $M$ layers to form our final statistic:
$$\hat{T}(x) = \frac{1}{M} \sum_{i=1}^{M} \hat{T}_i(x).$$
We use a one-sided threshold rule on $\hat{T}$: if $\hat{T}(x) > t$ we predict out-of-distribution, otherwise we do not.

With the key quantities now defined, we use the k-NN radius to substantiate (1) the claim that in- and out-of-distribution points are at different distances from the training points and (2) Hypothesis 1, that label smoothing causes in-distribution points to contract toward the training points faster than OOD ones. This provides the grounding for why a statistic based on the k-NN radius of a label smoothed model is a powerful discriminator. Figure 1 shows the distribution of 1-NN distances for three layers, as well as our proposed aggregate statistic, on two dataset pairs.
Figure 1. Distributions of the 1-NN radius distance for each of three layers as well as our test statistic, which aggregates over all layers. "Depth" refers to the layer index with respect to the logits layer; thus, 0 means logits, -1 means the layer right before logits, and so forth. The first two (last two) columns correspond to the dataset pairing Fashion MNIST → KMNIST (SVHN → CelebA) with and without label smoothing. The Fashion MNIST pairing uses a 3-layer feedforward neural network while SVHN uses the convolutional LeNet5. Blue represents samples from the test split of the dataset used to train the model, which are therefore inliers. Red represents the out-of-distribution samples. Consider the cases of no label smoothing: we see that there is separability between in and out points at each layer, generally more so at deeper (earlier) layers of the network. This motivates our use of k-NN distance for OOD detection. There is, however, non-trivial overlap. Now observe the cases with α = 0.1 smoothing: the 1-NN radii shrink, indicating a "contraction" towards the training manifold for both in- and out-of-distribution points. The contraction is, however, larger for ID points than for OOD points. This motivates the use of label smoothing in our method.

Across layers and datasets, we see some separability between in- and out-of-distribution points. Label smoothing has the effect of shrinking these distances for both in and out classes, but the effect is larger for in points, making the distributions even more separable and thereby improving the performance of our method.
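To make the procedure concrete, the following is a minimal sketch of the aggregated score $\hat{T}$, assuming the per-layer embeddings $g_i(X_{in})$ and $g_i(x)$ have already been extracted from the trained, label smoothed model. The helper names are illustrative; this is a sketch rather than the exact implementation used in our experiments.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_radius(index, queries, k):
    """Distance to the k-th nearest stored embedding (the k-NN radius)."""
    dist, _ = index.kneighbors(queries, n_neighbors=k)
    return dist[:, -1]

def ood_score(train_layers, query_layers, k=1):
    """Average of normalized k-NN radii across layers (the statistic T-hat).

    train_layers / query_layers: lists of (n_i, d_i) arrays, one per layer,
    holding that layer's embeddings of the training and query points.
    """
    scores = np.zeros(len(query_layers[0]))
    for train_emb, query_emb in zip(train_layers, query_layers):
        index = NearestNeighbors().fit(train_emb)
        # Normalizer Q-hat: leave-one-out radius on the training set, i.e. the
        # (k+1)-NN radius of each training point w.r.t. the full training set.
        q_hat = knn_radius(index, train_emb, k + 1).mean()
        scores += knn_radius(index, query_emb, k) / q_hat
    return scores / len(train_layers)  # larger score => more likely OOD
```

A query is then flagged as out-of-distribution when its returned score exceeds the chosen threshold t.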
3. Theoretical Results
In this section, we provide statistical guarantees for using the k-NN radius as a method for out-of-distribution detection. To do this, we assume that the features of the data lie on a compact support $\mathcal{X} \subset \mathbb{R}^d$ and that examples are drawn i.i.d. from this distribution. We assume that there exists a density function $f: \mathbb{R}^d \to \mathbb{R}$ corresponding to the distribution on the feature space. This density function can serve as a proxy for how much an example is out of distribution. The difficulty is that this underlying density function is unknown in practice. Fortunately, we can show that the k-NN radius approximates the information conveyed by $f$ based on a finite sample drawn from $f$. For the theory, we define an out-of-distribution example as an example with $x \notin \mathcal{X}$; thus, $f(x) = 0$ for such examples.

We first give a result about identifying out-of-distribution examples based on the k-NN radius with perfect recall if we were to use a particular threshold. That is, any example that is indeed out of distribution (i.e., has density 0) will have k-NN radius above that threshold. We also give a guarantee that the false positives (i.e., those examples with k-NN radius above that threshold which were not out-of-distribution examples) were of low density to begin with. Our results hold with high probability uniformly across all of $\mathbb{R}^d$. As we will see, as $n$ grows and $k/n \to 0$, the k-NN radius method using the specified threshold is able to identify which examples are in-distribution vs. out-of-distribution.

Our result requires a smoothness assumption on the density function, shown below. This smoothness assumption ensures a relationship between the density at a point and the probability mass of balls around that point, which is used in the proofs.

Assumption 1 (Smoothness). $f$ is $\beta$-Hölder continuous for some $0 < \beta \leq 1$, i.e., $|f(x) - f(x')| \leq C_\beta |x - x'|^\beta$.

We now give our result below.
Theorem 1.
Suppose that Assumption 1 holds, that $0 < \delta < 1$, and that $k \gtrsim \log(2/\delta) \cdot d \log n$. If we choose
$$r_0 := \left(\frac{k}{2\, C_\beta \cdot n \cdot v_d}\right)^{1/(\beta+d)}, \qquad \lambda := 5 \cdot C_\beta^{d/(\beta+d)} \cdot \left(\frac{k}{n \cdot v_d}\right)^{\beta/(\beta+d)}$$
(where $v_d$ denotes the volume of the unit ball in $\mathbb{R}^d$), then the following holds uniformly for all $x \in \mathbb{R}^d$ with probability at least $1 - \delta$:

• If $f(x) = 0$, then $r_k(x) \geq r_0$.

• If $r_k(x) \geq r_0$, then $f(x) \leq \lambda$.

In words, the set of points $x \in \mathbb{R}^d$ satisfying $r_k(x) \gtrsim (k/n)^{1/(\beta+d)}$ is guaranteed to contain all of the outliers and does not contain any points whose density exceeds a cutoff (i.e., $f(x) \gtrsim (k/n)^{\beta/(\beta+d)}$). These quantities all go to $0$ as $k/n \to 0$, and thus with enough samples the method is asymptotically able to distinguish between out-of-distribution and in-distribution examples.

We can assume the following condition on the boundary smoothness of the density, as is done in a recent analysis of k-NN density estimation (Zhao & Lai, 2020).

Assumption 2 (Boundary smoothness). There exists $0 < \eta \leq 1$ such that for any $t > 0$, $f$ satisfies $\mathbb{P}(f(x) \leq t) \leq C_\eta\, t^\eta$, where $\mathbb{P}$ represents the distribution of in-distribution examples during evaluation.

Then, Theorem 1 has the following consequence on the precision and recall of the k-NN density based out-of-distribution detection method.
Corollary 1. Suppose that Assumptions 1 and 2 hold, that $0 < \delta < 1$, and that $k \gtrsim \log(2/\delta) \cdot d \log n$. Choose
$$r_0 := \left(\frac{k}{2\, C_\beta \cdot n \cdot v_d}\right)^{1/(\beta+d)},$$
and classify an example $x \in \mathbb{R}^d$ as out-of-distribution if $r_k(x) \geq r_0$ and as in-distribution otherwise. Then, with probability at least $1 - \delta$, this classifier identifies all of the out-of-distribution examples (perfect recall) and falsely identifies an in-distribution example as out-of-distribution with probability (error in precision) at most
$$C_\eta \cdot \left(5 \cdot C_\beta^{d/(\beta+d)}\right)^{\eta} \cdot \left(\frac{k}{n \cdot v_d}\right)^{\eta\beta/(\beta+d)}.$$

We next give a result saying that if the gap in density between two points is large enough, then their ranking is preserved by the k-NN radius.
Theorem 2. Suppose that Assumption 1 holds, that $0 < \delta < 1$, and that $k \gtrsim \log(2/\delta) \cdot d \log n$. Define $C_{\delta,n} := 16 \log(2/\delta) \sqrt{d \log n}$. Then there exists a constant $C$ depending on $f$ such that the following holds with probability at least $1 - \delta$, uniformly for all pairs of points $x_1, x_2 \in \mathbb{R}^d$: if $f(x_1) > f(x_2) + \epsilon_{k,n}$, where
$$\epsilon_{k,n} := C \left(\frac{C_{\delta,n}}{\sqrt{k}} + (k/n)^{1/d}\right),$$
then $r_k(x_1) < r_k(x_2)$.

We note that as $n, k \to \infty$ with $k/n \to 0$ and $\log n / \sqrt{k} \to 0$, we have $\epsilon_{k,n} \to 0$, and thus asymptotically the k-NN radius preserves the ranking by density in the case of non-ties.

We now provide some theoretical intuition for why the observed Label Smoothed Embedding Hypothesis can lead to better performance for the k-NN density-based approach on embeddings learned with label smoothing. We assume that the in-distribution has a convex set $\mathcal{X} \subseteq \mathbb{R}^d$ as its support with uniformly lower bounded density, and that applying label smoothing has the effect of contracting the space $\mathbb{R}^d$ in the following way: points in $\mathcal{X}$ contract toward a point of origin $x_0$ in the interior of $\mathcal{X}$, moving closer to the origin, while outlier points move closer to the boundary of $\mathcal{X}$. We ensure that the former happens at a faster rate than the latter and show the following guarantee, which says that, under certain regularity conditions on the density and $\mathcal{X}$, the ratio between the k-NN distance of an out-of-distribution point and that of an in-distribution point increases after this mapping. This suggests that under transformations such as those implied by the Label Smoothed Embedding Hypothesis, the k-NN distance becomes a better score for separating the in-distribution examples from the out-of-distribution examples.

Proposition 1 (Improvement of k-NN OOD with Label Smoothed Embedding Hypothesis). Let $f$ have convex and bounded support $\mathcal{X} \subseteq \mathbb{R}^d$, let $x_0$ be an interior point of $\mathcal{X}$, and additionally assume that there exist $r_0, c_0 > 0$ such that $\mathrm{Vol}(B(x, r) \cap \mathcal{X}) \geq c_0 \cdot \mathrm{Vol}(B(x, r))$ for all $0 < r < r_0$ and $x \in \mathcal{X}$ (to ensure that $\mathcal{X}$'s boundary has regularity and is full dimensional), and that $f(x) \geq \lambda_0$ for all $x \in \mathcal{X}$ for some $\lambda_0 > 0$. Define the mapping $\phi: \mathbb{R}^d \to \mathbb{R}^d$ by $\phi(x) = \gamma_{in} \cdot (x - x_0) + x_0$ if $x \in \mathcal{X}$, and $\phi(x) = \gamma_{out} \cdot (x - \mathrm{Proj}_{\mathcal{X}}(x)) + \mathrm{Proj}_{\mathcal{X}}(x)$ otherwise, for some $0 < \gamma_{in} < \gamma_{out} < 1$, where $\mathrm{Proj}_{\mathcal{X}}(x)$ denotes the projection of $x$ onto the boundary of the convex set $\mathcal{X}$. Thus $\phi$ contracts the points, with points in $\mathcal{X}$ contracting at a faster rate than those outside of $\mathcal{X}$. Suppose our training set consists of $n$ examples $X_{[n]}$ drawn from $f$, and denote by $\phi(X_{[n]})$ the image of those examples under $\phi$. Let $0 < \delta < 1$, let $r_{min} > 0$, and let $k$ satisfy
$$k \gtrsim \log(2/\delta) \cdot d \log n, \qquad k \leq c_0 \cdot v_d \cdot \left(\frac{\gamma_{out} - \gamma_{in}}{\gamma_{in}} \cdot r_{min}\right)^d \cdot n,$$
and let $n$ be sufficiently large depending on $f$. Then with probability at least $1 - \delta$, the following holds uniformly among all $r_{min} > 0$ and all choices of $x_{in} \in \mathcal{X}$ (in-distribution example) and $x_{out}$ such that $d(x_{out}, \mathcal{X}) \geq r_{min}$ (out-of-distribution example with margin):
$$\frac{r_k(\phi(x_{out}); \phi(X_{[n]}))}{r_k(\phi(x_{in}); \phi(X_{[n]}))} > \frac{r_k(x_{out}; X_{[n]})}{r_k(x_{in}; X_{[n]})},$$
where $r_k(x; A)$ denotes the k-NN distance of $x$ w.r.t. the dataset $A$.
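As a toy numerical illustration of Proposition 1 (a synthetic sketch under the stated assumptions, not an experiment from the paper), one can contract uniform in-distribution points toward an interior origin faster than an OOD point is contracted toward its boundary projection, and observe that the ratio of their k-NN radii grows:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
d, n, k = 2, 5000, 1
X = rng.uniform(-1.0, 1.0, size=(n, d))   # in-distribution support: [-1, 1]^2
x_in = np.zeros((1, d))                    # an interior in-distribution point
x_out = np.array([[3.0, 0.0]])             # an OOD point with margin from the support

def rk(train, query, k):
    """k-NN radius of a single query point w.r.t. the training set."""
    dist, _ = NearestNeighbors().fit(train).kneighbors(query, n_neighbors=k)
    return dist[0, -1]

def ratio(gamma_in, gamma_out):
    # Contract training/in points toward the origin, and the OOD point toward
    # its projection onto the support boundary (here (1, 0)), as in Proposition 1.
    proj = np.array([[1.0, 0.0]])
    Xc = gamma_in * X
    return rk(Xc, gamma_out * (x_out - proj) + proj, k) / rk(Xc, gamma_in * x_in, k)

print("before:", ratio(1.0, 1.0))   # identity map
print("after: ", ratio(0.5, 0.9))   # gamma_in < gamma_out < 1: ratio increases
```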
4. Experiments
We now describe our comprehensive experimental setup and results.
We validate our method on MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), SVHN (cropped to 32x32x3) (Netzer et al., 2011), CIFAR10 (32x32x3) (Krizhevsky et al., 2009), and CelebA (32x32x3) (Liu et al., 2015). For CelebA, we train against the binary label "smiling". We train models on the train split of each of these datasets and then test OOD binary classification performance for a variety of OOD datasets, while always keeping the in-distribution set as the test split of the dataset used for training. Thus, a dataset pairing denoted "A → B" means that the classification model is trained on A's train split and is evaluated for OOD detection using A's test split as in-distribution points and B's test split as out-of-distribution points. In addition to the aforementioned, we form OOD datasets by corrupting the in-distribution test sets, by flipping images left and right (HFlip) as well as up and down (VFlip), and we also use the validation split of ImageNet (32x32x3) and the test splits of KMNIST (28x28x1), EMNIST digits (28x28x1), and Omniglot (32x32x3). All datasets are available as Tensorflow Datasets.

We measure the OOD detectors' ROC-AUC, sample-weighting to ensure balance between in and out-of-distribution samples (since they can have different sizes). For MNIST and Fashion MNIST, we train a 3-layer ReLU-activated DNN, with 256 units per layer, for 20 epochs. For SVHN, CIFAR10, and CelebA, we train the convolutional LeNet5 (LeCun et al., 2015) for 10 epochs. We use a batch size of 128 and the Adam optimizer with the default learning rate of 0.001 throughout. For embedding-based methods, we aggregate over 3 layers for the DNN and 4 dense layers for LeNet5, including the logits. For our method, we always use the Euclidean distance between embeddings, k = 1, and label smoothing α = 0.1. These could likely be tuned for better performance in the presence of a validation OOD dataset sufficiently similar to the unknown test set; we do not do this since we assume the absence of such a dataset.
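The balanced ROC-AUC described above can be computed as in the following minimal sketch, which weights each sample inversely to its group size (the helper name is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_auc(scores_in, scores_out):
    """ROC-AUC with sample weights that balance the in/out class sizes."""
    scores = np.concatenate([scores_in, scores_out])
    labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    weights = np.where(labels == 1, 1.0 / len(scores_out), 1.0 / len(scores_in))
    return roc_auc_score(labels, scores, sample_weight=weights)
```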
We validate our method against the following recent baselines.

• Control. We use the model's maximum softmax confidence, as suggested by (Hendrycks & Gimpel, 2016). The lower the confidence, the more likely the example is to be OOD.
• Robust Deep k-NN. This method, proposed in (Papernot & McDaniel, 2018), leverages k-NN for a query input as follows: it computes the label distribution of the query point's nearest training points for each layer and then computes a layer-aggregated p-value-based non-conformity score against a held-out calibration set. Queries that have high disagreement, or impurity, in their nearest neighbor label set are suspected to be OOD. We use 10% of the training set for calibration, k = 50, and cosine similarity, as described in the paper.

• DeConf. (Hsu et al., 2020) improves over the popular method ODIN (Liang et al., 2017) by freeing it from the need to tune on OOD data. It consists of two components: a learned "confidence decomposition" derived from the model's penultimate layer, and a modified method for perturbing inputs optimally for OOD detection using a Fast Gradient Sign-style strategy. We use the "h" branch of the cosine similarity variant described in the paper. We searched the perturbation hyperparameter ε over the range listed in the paper, but found that it never helped OOD detection in our setting. We thus report numbers for ε = 0.
• SVM. We learn a one-class SVM (Schölkopf et al., 1999) on the intermediate embedding layers and then aggregate the outlier scores across layers in the same way we propose in our method. We use an RBF kernel.
• Isolation Forest. This is similar to SVM, but uses an isolation forest (Liu et al., 2008) with 100 estimators at each layer; a sketch of this per-layer scoring is given below.
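For both of these ablative baselines, the following is a minimal sketch of the per-layer scoring and aggregation, assuming precomputed per-layer embeddings (the exact score normalization in our pipeline may differ):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

def layerwise_outlier_score(train_layers, query_layers, detector="svm"):
    """Fit one outlier detector per embedding layer and average the scores."""
    total = np.zeros(len(query_layers[0]))
    for train_emb, query_emb in zip(train_layers, query_layers):
        model = (OneClassSVM(kernel="rbf") if detector == "svm"
                 else IsolationForest(n_estimators=100, random_state=0))
        model.fit(train_emb)
        # score_samples is higher for inliers, so negate to get an OOD score.
        total += -model.score_samples(query_emb)
    return total / len(train_layers)
```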
Figure 2. Impact of k on ROC-AUC. We observe that performance is mostly stable across a range of k. We do see slight degradation with larger k, and so we recommend users a default of k = 1.

Table 1. ROC-AUC for different methods (Control, k-NN with 0.1 label smoothing, k-NN without label smoothing, DeConf, Robust k-NN, SVM, and Isolation Forest) and dataset pairings. The datasets enclosed by double lines represent the training and in-distribution test set, while the datasets listed beneath them are used as OOD. Each entry was run 5 times. The standard errors are quite small, with a median of 0.0071. Entries within two standard errors of the max are bolded. We see that label smoothing almost always improves the performance of our method and that the method is competitive across a variety of datasets.

Our main results are shown in Table 1. We observe that label smoothing nearly always improved our method, denoted k-NN, and that the method is competitive, outperforming the rest on the largest number of dataset pairs. SVM and Isolation Forest serve as key ablative models, since they leverage the same intermediate layer representations as our method and their layer-level scores are combined in the same way. Interestingly, we see that the k-NN consistently outperforms them, revealing the discriminative power of the k-NN radius distance. Robust k-NN also uses the same layer embeddings and k-NN, but in a different manner: crucially, it performs OOD detection by means of the nearest training neighbors' class-label distribution. Given that we outperform Robust k-NN more often than not, we might conjecture that the distance has more discriminative power for OOD detection than the class-label distribution. We were surprised that DeConf routinely did worse than the simple control, despite our implementing the method closely following the paper.

In this section, we study the impact of three factors on our method's performance: (1) the number of neighbors k, (2) the amount of label smoothing α, and (3) the intermediate layers used.

Impact of k. In Figure 2 we plot the impact of k on OOD detection for two dataset pairings: MNIST → Fashion MNIST and SVHN → CIFAR10, with and without label smoothing. We see that larger k degrades ROC-AUC monotonically, but the effect is rather small. We thus recommend a default of k = 1, which has the added benefit of being more efficient in most implementations of index-based large-scale nearest-neighbor lookup systems.
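The k ablation can be mimicked on synthetic embeddings by sweeping k and reporting ROC-AUC, as in the following self-contained toy sketch (not the paper's experiment):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

def knn_radius_score(train_emb, query_emb, k):
    dist, _ = NearestNeighbors().fit(train_emb).kneighbors(query_emb, n_neighbors=k)
    return dist[:, -1]

# Toy stand-ins for one layer's embeddings of train / in-test / out-test points.
rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 8))
test_in = rng.normal(size=(500, 8))
test_out = rng.normal(2.0, 1.0, size=(500, 8))

for k in (1, 2, 5, 10, 50):
    scores = np.concatenate([knn_radius_score(train, test_in, k),
                             knn_radius_score(train, test_out, k)])
    labels = np.concatenate([np.zeros(500), np.ones(500)])
    print(k, round(roc_auc_score(labels, scores), 3))
```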
Figure 3. Impact of α on ROC-AUC for four dataset pairings. Note that the x-axis is log-scale and the y-axis is zoomed in. We generally see that performance improves with larger α until it reaches a critical point, after which it declines. While this critical point is model and data dependent, we see that blithely selecting a fixed value like 0.1 results in reasonable performance.

Impact of Label Smoothing α. We now consider the effect of the label smoothing amount α on ROC-AUC in Figure 3. We see, interestingly, that performance mostly increases monotonically with larger α until it reaches a critical point, after which it declines monotonically. While this optimal point may be data and model dependent and thus hard to estimate, we have found that selecting a fixed value like 0.1 works well in most cases.
Impact of Intermediate Layer. Our method aggregates k-NN distance scores across intermediate layers. We depict the effect of different choices of a single layer on Fashion MNIST and CelebA in Table 2. We find that label smoothing generally boosts performance for each layer individually and that, while no single layer is always optimal, the penultimate layer performs fairly well across the datasets.
5. Related Work
Out-of-Distribution Detection.
OOD detection has classically been studied under names such as outlier, anomaly, or novelty detection. One line of work is density-based methods: Ester et al. (1996) present a density-based clustering algorithm that also serves as an outlier detector by identifying noise points, i.e., points whose ε-neighborhood contains fewer than a certain number of points. Breunig et al. (2000) and Kriegel et al. (2009) propose local outlier scores based on the degree to which a datapoint is isolated with respect to its neighborhood, via density estimation. Another line of work uses k-NN density estimates (Ramaswamy et al., 2000; Angiulli & Pizzuti, 2002; Hautamaki et al., 2004; Dang et al., 2015). We also use the k-NN density estimator, but in conjunction with the embeddings of a neural network trained with label smoothing. Other classical approaches include the one-class SVM (Schölkopf et al., 2001; Chen et al., 2001) and the isolation forest (Liu et al., 2008). A slew of recent methods have been proposed for OOD detection; we refer interested readers to a survey.
Table 2. We observe the ROC-AUC of our method using only a single layer at a time (depths 0, -1, -2, -3 from the logits), for Fashion MNIST and CelebA, with and without label smoothing. We find that label smoothing usually helps every layer on its own, and that the penultimate layer (depth = -1) often outperforms the rest on these datasets.
Label Smoothing. Label smoothing has received much attention lately; we give a brief review here. It has been shown to improve model calibration (and therefore the generation quality of auto-regressive sequence models like machine translation) but has been seen to hurt teacher-to-student knowledge distillation (Pereyra et al., 2017; Xie et al., 2016b; Chorowski & Jaitly, 2016; Gao et al., 2020; Lukasik et al., 2020b; Müller et al., 2019). Müller et al. (2019) show visually that label smoothing encourages the penultimate layer representations of training examples from the same class to group in tight clusters. Lukasik et al. (2020a) show that label smoothing makes models more robust to label noise in the training data (to a level competitive with noisy-label correction methods) and, furthermore, that smoothing the teacher is beneficial when distilling from noisy data. Chen et al. (2020) corroborate the benefits of smoothing for noisy labels and provide a theoretical framework wherein the optimal smoothing parameter α can be identified. LS has been seen to hurt performance on sparse distributions (Meister et al., 2020) and to decrease robustness to adversarial attacks (Zantedeschi et al., 2017). Yuan et al. (2020) cast knowledge distillation (KD) as a type of learned label smoothing regularization, showing that part of KD's success stems from its ability to regularize soft labels in the same way as LS. They then propose teacher-free KD, which achieves performance comparable to normal KD with a superior teacher.

k-NN Density Estimation Theory. Statistical guarantees for k-NN density estimation have a long history, e.g., Fukunaga & Hostetler (1973); Devroye & Wagner (1977); Mack (1983); Buturović (1993); Biau et al. (2011); Kung et al. (2012). Most works focus on showing convergence guarantees under $L_p$-type risks or are asymptotic. Dasgupta & Kpotufe (2014) provided the first finite-sample uniform rates, which to our knowledge is the strongest result so far. Our analysis uses similar techniques, which they in turn borrow from (Chaudhuri & Dasgupta, 2010); however, our results are for the application of OOD detection, whereas Dasgupta & Kpotufe (2014)'s goal was mode estimation. As a result, our results hold with high probability uniformly over the input space, while having finite-sample guarantees, and provide new theoretical insights into the use of k-NN for OOD detection.
6. Discussion and Conclusion
In light of the connection between label smoothing and distillation touched upon in the related work, it is natural to ask whether distillation would improve our k-NN OOD detector in a similar manner. A thorough study of this effect is deferred to future work, but we have early evidence suggesting that iterative self-distillation, that is, repeatedly retraining a model on its own predictions, has a similar mechanism to that described in the Label Smoothed Embedding Hypothesis.

In this work we put forward the Label Smoothed Embedding Hypothesis and proposed a deep k-NN density-based method for out-of-distribution detection that leverages the separability of intermediate layer embeddings, and we showed how label smoothing the model improves our method.

References
Ackerman, M. and Ben-David, S. Clusterability: A theoretical study. In Artificial Intelligence and Statistics, pp. 1–8. PMLR, 2009.

Angelopoulos, A., Bates, S., Malik, J., and Jordan, M. I. Uncertainty sets for image classifiers using conformal prediction. arXiv preprint arXiv:2009.14193, 2020.

Angiulli, F. and Pizzuti, C. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 15–27. Springer, 2002.

Awoyemi, J. O., Adetunmbi, A. O., and Oluwadare, S. A. Credit card fraud detection using machine learning techniques: A comparative analysis. pp. 1–9. IEEE, 2017.

Bahri, D., Jiang, H., and Gupta, M. Deep k-NN for noisy labels. In International Conference on Machine Learning, pp. 540–550. PMLR, 2020.

Bhowmik, R. Detecting auto insurance fraud by data mining techniques. Journal of Emerging Trends in Computing and Information Sciences, 2(4):156–162, 2011.

Biau, G., Chazal, F., Cohen-Steiner, D., Devroye, L., Rodriguez, C., et al. A weighted k-nearest neighbor density estimate for geometric inference. Electronic Journal of Statistics, 5:204–237, 2011.

Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104, 2000.

Buturović, L. Improving k-nearest neighbor density and error estimates. Pattern Recognition, 26(4):611–616, 1993.

Chaudhuri, K. and Dasgupta, S. Rates of convergence for the cluster tree. In NIPS, pp. 343–351. Citeseer, 2010.

Chen, B., Ziyin, L., Wang, Z., and Liang, P. P. An investigation of how label smoothing affects generalization. arXiv preprint arXiv:2010.12648, 2020.

Chen, Y., Zhou, X. S., and Huang, T. S. One-class SVM for learning in image retrieval. In Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), volume 1, pp. 34–37. IEEE, 2001.

Chorowski, J. and Jaitly, N. Towards better decoding and language model integration in sequence to sequence models. arXiv preprint arXiv:1612.02695, 2016.

Dang, T. T., Ngan, H. Y., and Liu, W. Distance-based k-nearest neighbors outlier detection method in large-scale traffic data. pp. 507–510. IEEE, 2015.

Dasgupta, S. and Kpotufe, S. Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems, pp. 2555–2563, 2014.

Devroye, L. P. and Wagner, T. J. The strong uniform consistency of nearest neighbor density estimates. The Annals of Statistics, pp. 536–540, 1977.

Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pp. 226–231, 1996.

Fukunaga, K. and Hostetler, L. Optimization of k nearest neighbor density estimates. IEEE Transactions on Information Theory, 19(3):320–326, 1973.

Gao, Y., Wang, W., Herold, C., Yang, Z., and Ney, H. Towards a better understanding of label smoothing in neural machine translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 212–223, 2020.

Hauskrecht, M., Batal, I., Valko, M., Visweswaran, S., Cooper, G. F., and Clermont, G. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics, 46(1):47–55, 2013.

Hautamaki, V., Karkkainen, I., and Franti, P. Outlier detection using k-nearest neighbour graph. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 3, pp. 430–433. IEEE, 2004.

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.

Hershey, J. R., Chen, Z., Le Roux, J., and Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. pp. 31–35. IEEE, 2016.

Hsu, Y.-C., Shen, Y., Jin, H., and Kira, Z. Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10951–10960, 2020.

Jiang, H., Kim, B., Guan, M. Y., and Gupta, M. R. To trust or not to trust a classifier. In NeurIPS, pp. 5546–5557, 2018.

Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1649–1652, 2009.

Krizhevsky, A. et al. Learning multiple layers of features from tiny images. 2009.

Kung, Y.-H., Lin, P.-S., and Kao, C.-H. An optimal k-nearest neighbor for density estimation. Statistics & Probability Letters, 82(10):1786–1791, 2012.

Labib, N. M., Rizka, M. A., and Shokry, A. E. M. Survey of machine learning approaches of anti-money laundering techniques to counter terrorism finance. In Internet of Things—Applications and Future, pp. 73–87. Springer, 2020.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

LeCun, Y. et al. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 20(5):14, 2015.

Lee, K., Lee, H., Lee, K., and Shin, J. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.

Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.

Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. pp. 413–422. IEEE, 2008.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.

Lukasik, M., Bhojanapalli, S., Menon, A., and Kumar, S. Does label smoothing mitigate label noise? In International Conference on Machine Learning, pp. 6448–6458. PMLR, 2020a.

Lukasik, M., Jain, H., Menon, A. K., Kim, S., Bhojanapalli, S., Yu, F., and Kumar, S. Semantic label smoothing for sequence to sequence problems. arXiv preprint arXiv:2010.07447, 2020b.

Mack, Y. Rate of strong uniform convergence of k-NN density estimates. Journal of Statistical Planning and Inference, 8(2):185–192, 1983.

Meister, C., Salesky, E., and Cotterell, R. Generalized entropy regularization or: There's nothing special about label smoothing. arXiv preprint arXiv:2005.00820, 2020.

Müller, R., Kornblith, S., and Hinton, G. When does label smoothing help? arXiv preprint arXiv:1906.02629, 2019.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.

Papernot, N. and McDaniel, P. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.

Park, S., Bastani, O., Matni, N., and Lee, I. PAC confidence sets for deep neural networks via calibrated prediction. arXiv preprint arXiv:2001.00106, 2019.

Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

Prastawa, M., Bullitt, E., Ho, S., and Gerig, G. A brain tumor segmentation framework based on outlier detection. Medical Image Analysis, 8(3):275–283, 2004.

Ramaswamy, S., Rastogi, R., and Shim, K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 427–438, 2000.

Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., Platt, J. C., et al. Support vector method for novelty detection. In NIPS, volume 12, pp. 582–588. Citeseer, 1999.

Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

Skillicorn, D. Knowledge Discovery for Counterterrorism and Law Enforcement. CRC Press, 2008.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487. PMLR, 2016a.

Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. DisturbLabel: Regularizing CNN on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4753–4762, 2016b.

Yuan, L., Tay, F. E., Li, G., Wang, T., and Feng, J. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3903–3911, 2020.

Zantedeschi, V., Nicolae, M.-I., and Rawat, A. Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 39–49, 2017.

Zhang, J. and Zulkernine, M. Anomaly based network intrusion detection with unsupervised outlier detection. Volume 5, pp. 2388–2393. IEEE, 2006.

Zhao, P. and Lai, L. Analysis of kNN density estimation. arXiv preprint arXiv:2010.00438, 2020.

Zhao, Y., Lehman, B., Ball, R., Mosesian, J., and de Palma, J.-F. Outlier detection rules for fault detection in solar photovoltaic arrays. pp. 2913–2920. IEEE, 2013.
Appendix
A. Proofs
We need the following result, which relates the true probability measure of balls to the empirical one.
Lemma 1 (Uniform convergence of balls (Chaudhuri & Dasgupta, 2010)). Let $\mathcal{F}$ be the distribution corresponding to $f$ and $\mathcal{F}_n$ be the empirical distribution corresponding to the sample $X$. Pick $0 < \delta < 1$. Assume that $k \geq d \log n$. Then with probability at least $1 - \delta$, for every ball $B \subset \mathbb{R}^d$ we have
$$\mathcal{F}(B) \geq \frac{C_{\delta,n}\sqrt{d \log n}}{n} \;\Rightarrow\; \mathcal{F}_n(B) > 0,$$
$$\mathcal{F}(B) \geq \frac{k}{n} + \frac{C_{\delta,n}\sqrt{k}}{n} \;\Rightarrow\; \mathcal{F}_n(B) \geq \frac{k}{n},$$
$$\mathcal{F}(B) \leq \frac{k}{n} - \frac{C_{\delta,n}\sqrt{k}}{n} \;\Rightarrow\; \mathcal{F}_n(B) < \frac{k}{n},$$
where $C_{\delta,n} = 16 \log(2/\delta)\sqrt{d \log n}$.
Remark. For the rest of the paper, many results are qualified to hold with probability at least $1 - \delta$. This is precisely the event in which Lemma 1 holds.

Remark. If $\delta = 1/n$, then $C_{\delta,n} = O((\log n)^{3/2})$.
Proof of Theorem 1. Suppose that $x$ satisfies $f(x) = 0$. Then we have
$$\mathcal{F}(B(x, r_0)) = \int f(x') \cdot \mathbb{1}[x' \in B(x, r_0)]\, dx' = \int |f(x') - f(x)| \cdot \mathbb{1}[x' \in B(x, r_0)]\, dx' \leq C_\beta \int |x' - x|^\beta \cdot \mathbb{1}[x' \in B(x, r_0)]\, dx' \leq C_\beta\, r_0^{\beta+d}\, v_d = \frac{k}{2n} \leq \frac{k}{n} - \frac{C_{\delta,n}\sqrt{k}}{n}.$$
Therefore, by Lemma 1, we have that $r_k(x) \geq r_0$.

Now for the second part, we prove the contrapositive. Suppose that $f(x) > \lambda$. Then we have
$$\mathcal{F}(B(x, r_0)) \geq (\lambda - C_\beta r_0^\beta) \cdot v_d \cdot r_0^d \geq \frac{2k}{n} \geq \frac{k}{n} + \frac{C_{\delta,n}\sqrt{k}}{n}.$$
Therefore, by Lemma 1, we have that $r_k(x) \leq r_0$, as desired.
Proof of Theorem 2. We borrow proof techniques used in (Dasgupta & Kpotufe, 2014) to give uniform bounds on $|f(x) - f_k(x)|$, where $f_k$ is the k-NN density estimator defined as
$$f_k(x) := \frac{k}{n \cdot v_d \cdot r_k(x)^d}.$$
Choose $r_0$ such that
$$\mathcal{F}(B(x, r_0)) \leq v_d\, r_0^d\, (f(x) + C_\beta r_0^\beta) = \frac{k}{n} - \frac{C_{\delta,n}\sqrt{k}}{n}.$$
It is clear that $r_0 \leq C' \cdot (k/n)^{1/d}$ for some $C'$ depending on $f$. Then, by Lemma 1, we have that $r_k(x) > r_0$. Thus,
$$f_k(x) < \frac{k}{n \cdot v_d \cdot r_0^d} = \frac{f(x) + C_\beta r_0^\beta}{1 - C_{\delta,n}/\sqrt{k}} \leq f(x) + C_1 \left(\frac{C_{\delta,n}}{\sqrt{k}} + (k/n)^{1/d}\right)$$
for some $C_1 > 0$ depending on $f$. The argument for the other direction is similar: we instead choose $r_0$ such that
$$\mathcal{F}(B(x, r_0)) \geq v_d\, r_0^d\, (f(x) - C_\beta r_0^\beta) = \frac{k}{n} + \frac{C_{\delta,n}\sqrt{k}}{n}.$$
Again it is clear that $r_0 \leq C'' \cdot (k/n)^{1/d}$ for some $C''$ depending on $f$. By Lemma 1, we have that $r_k(x) \leq r_0$, and thus
$$f_k(x) \geq \frac{k}{n \cdot v_d \cdot r_0^d} = \frac{f(x) - C_\beta r_0^\beta}{1 + C_{\delta,n}/\sqrt{k}} \geq f(x) - C_2 \left(\frac{C_{\delta,n}}{\sqrt{k}} + (k/n)^{1/d}\right)$$
for some $C_2$ depending on $f$. Therefore, there exists $C_0$ depending on $f$ such that
$$\sup_{x \in \mathbb{R}^d} |f(x) - f_k(x)| \leq C_0 \left(\frac{C_{\delta,n}}{\sqrt{k}} + (k/n)^{1/d}\right).$$
Finally, setting $C = 2 \cdot C_0$, if $f(x_1) > f(x_2) + \epsilon_{k,n}$, then
$$f_k(x_1) \geq f(x_1) - C_0 \left(\frac{C_{\delta,n}}{\sqrt{k}} + (k/n)^{1/d}\right) > f(x_2) + C_0 \left(\frac{C_{\delta,n}}{\sqrt{k}} + (k/n)^{1/d}\right) \geq f_k(x_2),$$
and it immediately follows that $r_k(x_1) < r_k(x_2)$, as desired.
Proof of Proposition 1. Define $r_u := \max_{x \in \mathcal{X}} r_k(x)$. We have
$$\frac{r_k(\phi(x_{out}); \phi(X_{[n]}))}{r_k(x_{out}; X_{[n]})} \geq \frac{d(\phi(x_{out}), \phi(\mathcal{X}))}{d(x_{out}, \mathcal{X}) + r_u} = \frac{\gamma_{out} \cdot d(x_{out}, \mathcal{X})}{d(x_{out}, \mathcal{X}) + r_u}.$$
Next, we have
$$\frac{r_k(\phi(x_{in}); \phi(X_{[n]}))}{r_k(x_{in}; X_{[n]})} = \gamma_{in},$$
since all pairwise distances within $\mathcal{X}$ are scaled by $\gamma_{in}$ under the mapping $\phi$. Thus, it suffices to have
$$\gamma_{in} \leq \frac{\gamma_{out} \cdot d(x_{out}, \mathcal{X})}{d(x_{out}, \mathcal{X}) + r_u},$$
which is equivalent to
$$r_u \leq \frac{\gamma_{out} - \gamma_{in}}{\gamma_{in}} \cdot d(x_{out}, \mathcal{X}),$$
which in turn holds when
$$r_u \leq \frac{\gamma_{out} - \gamma_{in}}{\gamma_{in}} \cdot r_{min}.$$
This holds because we have $r_u \leq \left(\frac{k}{c_0\, n\, v_d}\right)^{1/d}$ by Lemma 1, and the result follows by the condition on $k$.