To Trust Or Not To Trust A Classifier
Heinrich Jiang∗, Google Research, [email protected]
Been Kim, Google Brain, [email protected]
Melody Y. Guan†, Stanford University, [email protected]
Maya Gupta, Google Research, [email protected]
Abstract
Knowing when a classifier's prediction can be trusted is useful in many applications and critical for safely using AI. While the bulk of the effort in machine learning research has been towards improving classifier performance, understanding when a classifier's predictions should and should not be trusted has received far less attention. The standard approach is to use the classifier's discriminant or confidence score; however, we show there exists an alternative that is more effective in many situations. We propose a new score, called the trust score, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier's confidence score as well as many other baselines. Further, under some mild distributional assumptions, we show that if the trust score for an example is high (low), the classifier will likely agree (disagree) with the Bayes-optimal classifier. Our guarantees consist of non-asymptotic rates of statistical consistency under various nonparametric settings and build on recent developments in topological data analysis.
Machine learning (ML) is a powerful and widely-used tool for making potentially important decisions, from product recommendations to medical diagnosis. However, despite ML's impressive performance, it makes mistakes, with some more costly than others. As such, ML trust and safety is an important theme [54, 36, 1]. While improving overall accuracy is an important goal that the bulk of the effort in the ML community has been focused on, it may not be enough: we need to also better understand the strengths and limitations of ML techniques.

This work focuses on one such challenge: knowing whether a classifier's prediction for a test example can be trusted or not. Such trust scores have practical applications. They can be directly shown to users to help them gauge whether they should trust the AI system. This is crucial when a model's prediction influences important decisions such as a medical diagnosis, but can also be helpful even in low-stakes scenarios such as movie recommendations. Trust scores can be used to override the classifier and send the decision to a human operator, or to prioritize decisions that human operators should be making. Trust scores are also useful for monitoring classifiers to detect distribution shifts that may mean the classifier is no longer as useful as it was when deployed.

∗ All authors contributed equally. † Work done while an intern at Google Research. An open-source implementation of Trust Scores can be found here: https://github.com/google/TrustScore
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

A standard approach to deciding whether to trust a classifier's decision is to use the classifier's own reported confidence or score, e.g. probabilities from the softmax layer of a neural network, distance to the separating hyperplane in support vector classification, or mean class probabilities for the trees in a random forest. While using a model's own implied confidences appears reasonable, it has been shown that the raw confidence values from a classifier are poorly calibrated [24, 32]. Worse yet, even if the scores are calibrated, the ranking of the scores itself may not be reliable. In other words, a higher confidence score from the model does not necessarily imply higher probability that the classifier is correct, as shown in [45, 22, 39]. A classifier may simply not be the best judge of its own trustworthiness.

In this paper, we use a set of labeled examples (e.g. training data or validation data) to help determine a classifier's trustworthiness for a particular testing example. First, we propose a simple procedure that reduces the training data to a high density set for each class. Then we define the trust score, the ratio between the distance from the testing sample to the nearest class different from the predicted class and the distance to the predicted class, to determine whether to trust that classifier prediction.

Theoretically, we show that high/low trust scores correspond to high probability of agreement/disagreement with the Bayes-optimal classifier. We show finite-sample estimation rates when the data is full-dimension and when it is supported on or near a low-dimensional manifold. Interestingly, we attain bounds that depend only on the lower manifold dimension and are independent of the ambient dimension, without any changes to the procedure or knowledge of the manifold.
To our knowledge, these results are new and may be of independent interest.

Experimentally, we found that the trust score better identifies correctly-classified points for low and medium-dimension feature spaces than the model itself. However, high-dimensional feature spaces were more challenging, and we demonstrate that the trust score's utility depends on the vector space used to compute the trust score distances.

One related line of work is that of confidence calibration, which transforms classifier outputs into values that can be interpreted as probabilities, e.g. [44, 58, 40, 24]. In recent work, [32] explore the structured prediction setting, and [33] obtain confidence estimates by using ensembles of networks. These calibration techniques typically only use the model's reported score (and the softmax layer in the case of a neural network) for calibration, which notably preserves the rankings of the classifier scores. Similarly, [26] considered using the softmax probabilities for the related problem of identifying misclassifications and mislabeled points.

Recent work explored estimating uncertainty for Bayesian neural networks and returning a distribution over the outputs [20, 30]. The proposed trust score does not change the network structure (nor does it assume any structure) and gives a single score, rather than a distribution over outputs, as the representation of uncertainty.

The problem of classification with a reject option or learning with abstention [3, 57, 9, 23, 8, 27, 10] is a highly related framework where the classifier is allowed to abstain from making a prediction at a certain cost. Typically such methods jointly learn the classifier and the rejection function. Note that the interplay between classification rate and reject rate has been studied in many forms, e.g. [7, 13, 19, 48, 52, 18, 34, 14, 56, 51]. Our paper assumes an already trained and possibly black-box classifier and learns the confidence scores separately, but we do not explicitly learn the appropriate rejection thresholds.

Whether to trust a classifier also arises in the setting where one has access to a sequence of classifiers, but there is some cost to evaluating each classifier, and the goal is to decide after evaluating each classifier in the sequence whether one should trust the current classifier decision enough to stop, rather than evaluating more classifiers in the sequence (e.g. [55, 43, 16]). Those confidence decisions are usually based on whether the current classifier score will match the classification of the full sequence.

Experimentally we find that the vector space used to compute the distances in the trust score matters, and that computing trust scores on more-processed layers of a deep model generally works better. This observation is similar to the work of Papernot and McDaniel [42], who use k-NN regression on the intermediate representations of the network, which they showed enhances robustness to adversarial attacks and leads to better calibrated uncertainty estimations.

Our work builds on recent results in topological data analysis. Our method to filter low-density points estimates a particular density level-set given a parameter α, which aims at finding the level-set that contains a 1 − α fraction of the probability mass. Level-set estimation has a long history [25, 15, 53, 50, 46, 28]. However, such works assume knowledge of the density level, which is difficult to determine in practice. We provide rates for Algorithm 1 in estimating the appropriate level-set corresponding to α without knowledge of the level.
The proxy α offers a more intuitive parameter to choose than the density level that is used in level-set estimation. Our analysis is also done under various settings, including when the data lies near a lower-dimensional manifold, and we provide rates that depend only on the lower dimension.

Our approach proceeds in two steps, outlined in Algorithms 1 and 2. We first pre-process the training data, as described in Algorithm 1, to find the α-high-density-set of each class, which is defined as the training samples within that class after filtering out the α-fraction of the samples with lowest density (which may be outliers):

Definition 1 (α-high-density-set). Let 0 ≤ α < 1 and let f be a continuous density function with compact support X ⊆ R^D. Then define H_α(f), the α-high-density-set of f, to be the λ_α-level set of f, defined as {x ∈ X : f(x) ≥ λ_α}, where λ_α := inf{λ ≥ 0 : ∫_X 1[f(x) ≤ λ] f(x) dx ≥ α}.

In order to approximate the α-high-density-set, Algorithm 1 filters the α-fraction of the sample points with lowest empirical density, based on k-nearest neighbors. This data filtering step is independent of the given classifier h.

Then, the second step: given a testing sample, we define its trust score to be the ratio between the distance from the testing sample to the α-high-density-set of the nearest class different from the predicted class, and the distance from the testing sample to the α-high-density-set of the class predicted by h, as detailed in Algorithm 2. The intuition is that if the classifier h predicts a label whose high-density set is considerably farther from the testing sample than that of the closest label, then this is a warning that the classifier may be making a mistake.

Our procedure can thus be viewed as a comparison to a modified nearest-neighbor classifier, where the modification lies in the initial filtering of points not in the α-high-density-set for each class.
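As a quick worked illustration of Definition 1 (our toy example, not from the paper): take the triangular density f(x) = 2x on X = [0, 1]. Then ∫_X 1[f(x) ≤ λ] f(x) dx = ∫_0^{λ/2} 2x dx = λ²/4, so λ_α = 2√α and H_α(f) = [√α, 1], which indeed carries exactly a 1 − α fraction of the probability mass.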
Remark 1. The distances can be computed with respect to any representation of the data: for example, the raw inputs, an unsupervised embedding of the space, or the activations of the intermediate representations of the classifier. Moreover, the nearest-neighbor distance can be replaced by other distance measures, such as k-nearest neighbors or distance to a centroid.
Algorithm 1 Estimating the α-high-density-set
Parameters: α (density threshold), k.
Inputs: Sample points X := {x_1, ..., x_n} drawn from f.
Define the k-NN radius r_k(x) := inf{r > 0 : |B(x, r) ∩ X| ≥ k} and let ε := inf{r > 0 : |{x ∈ X : r_k(x) > r}| ≤ α · n}.
return Ĥ_α(f) := {x ∈ X : r_k(x) ≤ ε}.
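The following NumPy/scikit-learn sketch illustrates Algorithm 1; the function name and defaults are ours, and the open-source release linked above may differ in its details.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_high_density_set(X, k=10, alpha=0.1):
    """Sketch of Algorithm 1: estimate the alpha-high-density-set by dropping the
    alpha-fraction of points with the largest k-NN radius (lowest empirical density)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Distance to the k-th nearest neighbor (the query point itself counts, since it
    # belongs to the fitted set) is the k-NN radius r_k(x).
    radii = nn.kneighbors(X)[0][:, -1]
    # eps approximates inf{r : at most an alpha-fraction of points have r_k(x) > r}.
    eps = np.quantile(radii, 1.0 - alpha)
    return X[radii <= eps]
```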
Algorithm 2 Trust Score
Parameters: α (density threshold), k.
Input: Classifier h : X → Y. Training data (x_1, y_1), ..., (x_n, y_n). Test example x.
For each ℓ ∈ Y, let Ĥ_α(f_ℓ) be the output of Algorithm 1 with parameters α, k and sample points {x_j : 1 ≤ j ≤ n, y_j = ℓ}.
Then, return the trust score, defined as ξ(h, x) := d(x, Ĥ_α(f_h̃(x))) / d(x, Ĥ_α(f_h(x))), where h̃(x) := argmin_{ℓ ∈ Y, ℓ ≠ h(x)} d(x, Ĥ_α(f_ℓ)).

The method has two hyperparameters used to compute the empirical densities: k (the number of neighbors, as in k-NN) and α (the fraction of data to filter). We show in theory that k can lie in a wide range and still give us the desired consistency guarantees. Throughout our experiments, we fix k = 10, and use cross-validation to select α as it is data-dependent.
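Continuing the sketch above (our code, not the reference implementation), Algorithm 2 then reduces to a ratio of two nearest-neighbor distances to the filtered sets:

```python
import numpy as np

def trust_score(x, train_X, train_y, predicted_label, k=10, alpha=0.1):
    """Sketch of Algorithm 2: distance to the filtered high-density set of the nearest
    non-predicted class divided by the distance to that of the predicted class
    (larger means more trustworthy)."""
    dist_to_class = {}
    for label in np.unique(train_y):
        kept = estimate_high_density_set(train_X[train_y == label], k=k, alpha=alpha)
        dist_to_class[label] = np.min(np.linalg.norm(kept - x, axis=1))
    d_pred = dist_to_class[predicted_label]
    d_closest_other = min(d for lbl, d in dist_to_class.items() if lbl != predicted_label)
    return d_closest_other / (d_pred + 1e-12)  # small constant guards against d_pred == 0
```

In practice one would precompute the filtered set for each class once and query a nearest-neighbor index, rather than recomputing brute-force distances for every test point.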
Remark 2. We observed that the procedure was not very sensitive to the choice of k and α. As will be shown in the experimental section, for efficiency on larger datasets, we skipped the initial filtering step of Algorithm 1 (leading to a hyperparameter-free procedure) and obtained reasonable results. This initial filtering step can also be replaced by other strategies. One such example is filtering examples whose labels have high disagreement amongst their neighbors, which is implemented in the open-source code release but not experimented with here.

In this section, we provide theoretical guarantees for Algorithms 1 and 2. Due to space constraints, all the proofs are deferred to the Appendix. To simplify the main text, we state our results treating δ, the confidence level, as a constant. The dependence on δ in the rates is made explicit in the Appendix. We show that Algorithm 1 is a statistically consistent estimator of the α-high-density level set with finite-sample estimation rates. We analyze Algorithm 1 in three different settings: when the data lies on (i) full-dimensional R^D; (ii) an unknown lower-dimensional submanifold embedded in R^D; and (iii) an unknown lower-dimensional submanifold with full-dimensional noise.

For setting (i), where the data lies in R^D, the estimation rate has a dependence on the dimension D, which may be unattractive in high-dimensional situations: this is known as the curse of dimensionality, suffered by density-based procedures in general. However, when the data has low intrinsic dimension as in (ii), it turns out that, remarkably, without any changes to the procedure, the estimation rate depends on the lower dimension d and is independent of the ambient dimension D. However, in realistic situations, the data may not lie exactly on a lower-dimensional manifold, but near one. This reflects the setting of (iii), where the data essentially lies on a manifold but has general full-dimensional noise, so the data is overall full-dimensional. Interestingly, we show that we still obtain estimation rates depending only on the manifold dimension and independent of the ambient dimension; moreover, we do not require knowledge of the manifold nor its dimension to attain these rates.

We then analyze Algorithm 2, and establish the culminating result of Theorem 4: for labeled data distributions with well-behaved class margins, when the trust score is large, the classifier likely agrees with the Bayes-optimal classifier, and when the trust score is small, the classifier likely disagrees with the Bayes-optimal classifier. If it turns out that even the Bayes-optimal classifier has high error in a certain region, then any classifier will have difficulties in that region. Thus, Theorem 4 does not guarantee that the trust score can predict misclassification, but rather that it can predict when the classifier is making an unreasonable decision.

We require the following regularity assumptions on the boundaries of H_α(f), which are standard in analyses of level-set estimation [50]. Assumption 1.1 ensures that the density around H_α(f) has both smoothness and curvature. The upper bound gives smoothness, which is important to ensure that our density estimators are accurate for our analysis (we only require this smoothness near the boundaries and not globally). The lower bound ensures curvature: this ensures that H_α(f) is salient enough to be estimated. Assumption 1.2 ensures that H_α(f) does not get arbitrarily thin anywhere.
Assumption 1 (α-high-density-set regularity). Let β > 0. There exist Č_β, Ĉ_β, r_c, r_0, ρ > 0 such that:
1. Č_β · d(x, H_α(f))^β ≤ |λ_α − f(x)| ≤ Ĉ_β · d(x, H_α(f))^β for all x ∈ ∂H_α(f) + B(0, r_c).
2. For all 0 < r < r_0 and x ∈ H_α(f), we have Vol(B(x, r) ∩ H_α(f)) ≥ ρ · r^D,
where ∂A denotes the boundary of a set A, d(x, A) := inf_{x′ ∈ A} ||x − x′||, B(x, r) := {x′ : ||x − x′|| ≤ r}, and A + B(0, r) := {x : d(x, A) ≤ r}.

Our statistical guarantees are under the Hausdorff metric, which gives a uniform guarantee over our estimator: it is a stronger notion of consistency than other common metrics [46, 47].
Definition 2 (Hausdorff distance). d_H(A, B) := max{sup_{x ∈ A} d(x, B), sup_{x ∈ B} d(x, A)}.
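For finite point sets, Definition 2 can be evaluated directly from the pairwise distance matrix; a small sketch (ours) for intuition:

```python
import numpy as np

def hausdorff_distance(A, B):
    """Hausdorff distance between finite point sets A and B (rows are points)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```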
We now give the following result for Algorithm 1. It says that as long as our density function satisfies the regularity assumptions stated earlier, and the parameter k lies within a certain range, then we can bound the Hausdorff distance between what Algorithm 1 recovers and H_α(f), the true α-high-density set, from an i.i.d. sample drawn from f of size n. Then, as n goes to ∞, and k grows as a function of n, the bound goes to 0.

Theorem 1 (Algorithm 1 guarantees). Let 0 < δ < 1 and suppose that f is continuous, has compact support X ⊆ R^D, and satisfies Assumption 1. There exist constants C_l, C_u, C > 0 depending on f and δ such that the following holds with probability at least 1 − δ. Suppose that k satisfies
C_l · log n ≤ k ≤ C_u · (log n)^{D/(2β+D)} · n^{2β/(2β+D)}.
Then we have
d_H(H_α(f), Ĥ_α(f)) ≤ C · (n^{−1/(2D)} + log(n)^{1/(2β)} · k^{−1/(2β)}).
Remark 3. The condition on k can be simplified by ignoring log factors: log n ≲ k ≲ n^{2β/(2β+D)}, which is a wide range. Setting k to its allowed upper bound, we obtain our consistency guarantee of
d_H(H_α(f), Ĥ_α(f)) ≲ max{n^{−1/(2D)}, n^{−1/(2β+D)}}.
The first term is due to the error from estimating the appropriate level given α (i.e. identifying the level λ_α) and the second term corresponds to the error for recovering the level set given knowledge of the level. The latter term matches the lower bound for level-set estimation up to log factors [53].

One of the disadvantages of Theorem 1 is that the estimation errors have a dependence on D, the dimension of the data, which may be highly undesirable in high-dimensional settings. We next improve these rates when the data has a lower intrinsic dimension. Interestingly, we are able to show rates that depend only on the intrinsic dimension of the data, without explicit knowledge of that dimension nor any changes to the procedure. As is common in related work in the manifold setting, we make the following regularity assumptions, which are standard among works in manifold learning (e.g. [41, 21, 2]).

Assumption 2 (Manifold Regularity). M is a d-dimensional smooth compact Riemannian manifold without boundary, embedded in a compact subset X ⊆ R^D, with bounded volume. M has finite condition number 1/τ, which controls the curvature and prevents self-intersection.

Theorem 2 (Manifold analogue of Theorem 1). Let 0 < δ < 1. Suppose that the density function f is continuous and supported on M and that Assumptions 1 and 2 hold. Suppose also that there exists λ_0 > 0 such that f(x) ≥ λ_0 for all x ∈ M. Then, there exist constants C_l, C_u, C > 0 depending on f and δ such that the following holds with probability at least 1 − δ. Suppose that k satisfies
C_l · log n ≤ k ≤ C_u · (log n)^{d/(2β′+d)} · n^{2β′/(2β′+d)},
where β′ := max{1, β}. Then we have
d_H(H_α(f), Ĥ_α(f)) ≤ C · (n^{−1/(2d)} + log(n)^{1/(2β)} · k^{−1/(2β)}).
Remark 4. Setting k to its allowed upper bound, we obtain (ignoring log factors)
d_H(H_α(f), Ĥ_α(f)) ≲ max{n^{−1/(2d)}, n^{−1/(2 max{1, β} + d)}}.
The first term can be compared to that of the previous result, where D is replaced with d. The second term is the error for recovering the level set on manifolds, which matches recent rates [28].

In realistic settings, the data may not lie exactly on a low-dimensional manifold, but near one. We next present a result where the data is distributed along a manifold with additional full-dimensional noise. We make mild assumptions on the noise distribution. Thus, in this situation, the data has intrinsic dimension equal to the ambient dimension. Interestingly, we are still able to show that the rates only depend on the dimension of the manifold and not the dimension of the entire data.
Theorem 3.
Let 0 < η < α < 1 and 0 < δ < 1. Suppose that the distribution F is a weighted mixture (1 − η) · F_M + η · F_E, where F_M is a distribution with continuous density f_M supported on a d-dimensional manifold M satisfying Assumption 2, and F_E is a (noise) distribution with continuous density f_E with compact support over R^D with d < D. Suppose also that there exists λ_0 > 0 such that f_M(x) ≥ λ_0 for all x ∈ M, and that H_α̃(f_M) (where α̃ := (α − η)/(1 − η)) satisfies Assumption 1 for the density f_M. Let Ĥ_α be the output of Algorithm 1 on a sample X of size n drawn i.i.d. from F. Then, there exist constants C_l, C_u, C > 0 depending on f_M, f_E, η, M and δ such that the following holds with probability at least 1 − δ. Suppose that k satisfies
C_l · log n ≤ k ≤ C_u · (log n)^{d/(2β′+d)} · n^{2β′/(2β′+d)},
where β′ := max{1, β}. Then we have
d_H(H_α̃(f_M), Ĥ_α) ≤ C · (n^{−1/(2d)} + log(n)^{1/(2β)} · k^{−1/(2β)}).

The above result is compelling because it shows why our methods can work, even in high dimensions, despite the curse of dimensionality of nonparametric methods. In typical real-world data, even if the data lies in a high-dimensional space, there may be far fewer degrees of freedom. Thus, our theoretical results suggest that when this is true, our methods will enjoy far better convergence rates, even when the data overall has full intrinsic dimension due to factors such as noise.
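A small synthetic sanity check of this point (entirely our own construction, reusing the estimate_high_density_set sketch from above): sample points near a circle (d = 1) embedded in R^10 with full-dimensional noise and a few outliers, and observe that the filtering step retains points close to the manifold.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 5000, 10
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.zeros((n, D))
X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)            # 1-dimensional manifold: a circle
X += 0.05 * rng.normal(size=(n, D))                        # small full-dimensional noise
X[: n // 20] = rng.uniform(-2.0, 2.0, size=(n // 20, D))   # a few far-off outliers

kept = estimate_high_density_set(X, k=10, alpha=0.1)
dist_to_circle = np.abs(np.linalg.norm(kept[:, :2], axis=1) - 1.0)
print(kept.shape[0], dist_to_circle.max())  # retained points concentrate near the circle
```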
We now provide a guarantee about the trust score, making the same assumptions as in Theorem 3 for each of the label distributions. We additionally assume that the class distributions are well-behaved in the following sense: the high-density regions for each of the classes satisfy the property that, for any point x ∈ X, if the ratio of the distance to one class's high-density region to that of another is smaller than 1 by some margin γ, then it is more likely that x's label corresponds to the former class.
Theorem 4. Let 0 < η < α < 1. Let us have labeled data (x_1, y_1), ..., (x_n, y_n) drawn from distribution D, which is a joint distribution over X × Y where Y are the labels, |Y| < ∞, and X ⊆ R^D is compact. Suppose for each ℓ ∈ Y, the conditional distribution for label ℓ satisfies the conditions of Theorem 3 for some manifold and noise level η. Let f_{M,ℓ} be the density of the portion of the conditional distribution for label ℓ that is supported on the manifold. Define M_ℓ := H_α̃(f_ℓ), where α̃ := (α − η)/(1 − η), and let ε_n be the maximum Hausdorff error from estimating M_ℓ over each ℓ ∈ Y in Theorem 3. Assume that min_{ℓ ∈ Y} P_D(y = ℓ) > 0 to ensure we have samples from each label.

Suppose also that for each x ∈ X, if d(x, M_i)/d(x, M_j) < 1 − γ then P(y = i | x) > P(y = j | x) for i, j ∈ Y. That is, if we are closer to M_i than to M_j by a ratio of less than 1 − γ, then the point is more likely to be from class i. Let h∗ be the Bayes-optimal classifier, defined by h∗(x) := argmax_{ℓ ∈ Y} P(y = ℓ | x). Then the trust score ξ of Algorithm 2 satisfies the following with high probability, uniformly over all x ∈ X and all classifiers h : X → Y simultaneously, for n sufficiently large depending on D:
ξ(h, x) < 1 − γ − ε_n / (d(x, M_{h(x)}) + ε_n) · (d(x, M_{h̃(x)}) / d(x, M_{h(x)}) + 1) ⇒ h(x) ≠ h∗(x),
ξ(h, x)^{−1} < 1 − γ − ε_n / (d(x, M_{h̃(x)}) + ε_n) · (d(x, M_{h(x)}) / d(x, M_{h̃(x)}) + 1) ⇒ h(x) = h∗(x).

In this section, we empirically test whether trust scores can both detect examples that are incorrectly classified with high precision and be used as a signal to determine which examples are likely correctly classified. We perform this evaluation across (i) different datasets (Sections 5.1 and 5.3), (ii) different families of classifiers (neural network, random forest and logistic regression) (Section 5.1), (iii) classifiers with varying accuracy on the same task (Section 5.2), and (iv) different representations of the data, e.g. input data or activations of various intermediate layers in a neural network (Section 5.3).

First, we test whether testing examples with high trust score correspond to examples on which the model is correct ("identifying trustworthy examples"). Each method produces a numeric score for each testing example. For each method, we bin the data points by percentile value of the score (i.e. 100 bins). Given a recall percentile level (i.e. the x-axis on our plots), we take the performance of the classifier on the bins above the percentile level as the precision (i.e. the y-axis). Then, we take the negative of each signal and test whether low trust score corresponds to the model being wrong ("identifying suspicious examples"). Here the y-axis is the misclassification rate and the x-axis corresponds to decreasing trust score or model confidence.
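A compact sketch of this evaluation protocol (our code; names are ours):

```python
import numpy as np

def precision_vs_percentile(scores, correct, percentiles=np.arange(0, 100)):
    """For each percentile level, the classifier's precision on the test points whose
    score is above that level ('identifying trustworthy examples'). Passing -scores
    and ~correct instead gives the 'identifying suspicious examples' curve."""
    thresholds = np.percentile(scores, percentiles)
    return np.array([correct[scores >= t].mean() for t in thresholds])
```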
Figure 1: Two example datasets and models. For predicting correctness (top row) the vertical dotted black line indicates the error level of the trained classifier. For predicting incorrectness (bottom) the vertical black dotted line is the accuracy rate of the classifier. For detecting trustworthy examples, for each percentile level we take the test examples whose trust score was above that percentile level and plot the percentage of those test points that were correctly classified by the classifier, and do the same for model confidence and the 1-nn ratio. For detecting suspicious examples, we take the negative of each signal and plot the precision of identifying incorrectly classified examples. Shown are averages over runs with shaded standard error bands. The trust score consistently attains a higher precision for each given percentile of classifier decision-rejection. Furthermore, the trust score generally shows increasing precision as the percentile level increases, but surprisingly, many of the comparison baselines do not. See the Appendix for the full results.

In both cases, the higher the precision vs percentile curve, the better the method. The vertical black dotted lines in the plots represent the omniscient ideal: for identifying trustworthy examples it is the error rate of the classifier, and for identifying suspicious examples it is the accuracy rate.

The baseline we use in this section is the model's own confidence score, which is similar to the approach of [26]. While calibrating the classifiers' confidence scores (i.e. transforming them into probability estimates of correctness) is important related work [24, 44], such techniques typically do not change the rankings of the score, at least in the binary case. Since we evaluate the trust score on its precision at a given recall percentile level, we are interested in the relative ranking of the scores rather than their absolute values. Thus, we do not compare against calibration techniques. There are surprisingly few methods aimed at identifying correctly or incorrectly classified examples with precision at a recall percentile level, as noted in [26].

Choosing Hyperparameters: The two hyperparameters for the trust score are α and k. Throughout the experiments, we fix k = 10 and choose α using cross-validation over (negative) powers of 10 on the training set. The metric for cross-validation was optimal performance on detecting suspicious examples at the percentile corresponding to the classifier's accuracy. The bulk of the computational cost for the trust score is in k-nearest neighbor computations for training and 1-nearest neighbor searches for evaluation. To speed things up for the larger datasets MNIST, SVHN, CIFAR-10 and CIFAR-100, we skipped the initial filtering step of Algorithm 1 altogether and reduced the intermediate layers down to 20 dimensions using PCA before they were processed by the trust score, which showed similar performance. We note that any approximation method (such as approximate instead of exact nearest neighbors) could have been used instead.

In this section, we show performance on five benchmark UCI datasets [17], each for three kinds of classifiers (neural network, random forest and logistic regression). Due to space, we only show two datasets and two models in Figure 1; the rest can be found in the Appendix. For each method and dataset, we evaluated with multiple runs. For each run we took a random stratified split of the dataset into two halves. One portion was used for training the trust score and the other was used for evaluation, and the standard error is shown in addition to the average precision across the runs at each percentile level.

Figure 2: We show the performance of the trust score on the Digits dataset for a neural network as we increase the accuracy. As we go from left to right, we train the network with more iterations (each with the same batch size), thus increasing the accuracy indicated by the dotted vertical lines. While the trust score still performs better than model confidence, the amount of improvement diminishes.
The results show that our method consistently has a higher precision vs percentile curve than the rest of the methods across the datasets and models. This suggests the trust score considerably improves upon known methods as a signal for identifying trustworthy and suspicious testing examples for low-dimensional data.

In addition to the model's own confidence score, we try one additional baseline, which we call the nearest neighbor ratio (1-nn ratio). It is the ratio between the 1-nearest neighbor distance to the closest and second closest class, which can be viewed as an analogue of the trust score without knowledge of the classifier's hard prediction.

In Figure 2, we show how the performance of the trust score changes as the accuracy of the classifier changes (averaged over 20 runs for each condition). We observe that as the accuracy of the model increases, while the trust score still performs better than model confidence, the amount of improvement diminishes. This suggests that as the model improves, the information the trust score can provide in addition to the model confidence decreases. However, as we show in Section 5.3, the trust score can still have added value even when the classifier is known to be of high performance on some benchmark larger-scale datasets.
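A sketch of this baseline (ours; we orient the ratio so that, like the trust score, larger values indicate more confidence):

```python
import numpy as np

def one_nn_ratio(x, train_X, train_y):
    """1-nn ratio baseline: second-closest class distance over closest class distance,
    computed without using the classifier's prediction."""
    d = sorted(np.min(np.linalg.norm(train_X[train_y == c] - x, axis=1))
               for c in np.unique(train_y))
    return d[1] / (d[0] + 1e-12)
```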
The MNIST handwritten digit dataset [35] consists of 60,000 28 × 28 grayscale training images; we also evaluate on the SVHN [38], CIFAR-10, and CIFAR-100 [31] datasets.

(a) MNIST (b) SVHN (c) CIFAR-10 (d) MNIST (e) SVHN (f) CIFAR-10
Figure 3: Trust score results using convolutional neural networks on the MNIST, SVHN, and CIFAR-10 datasets. Top row is detecting trustworthy; bottom row is detecting suspicious. The full chart with CIFAR-100 (which was essentially a negative result) is shown in the Appendix.

We used a pretrained VGG-16 [49] architecture with adaptation to the CIFAR datasets based on [37]. The CIFAR-10 VGG-16 network achieves a test accuracy of 93.56% while the CIFAR-100 network achieves a test accuracy of 70.48%. We used pretrained, smaller CNNs for MNIST and SVHN. The MNIST network achieves a test accuracy of 99.07% and the SVHN network achieves a test accuracy of 95.45%. All architectures were implemented in Keras [6].

One simple generalization of our method is to use intermediate layers of a neural network as input instead of the raw x. Much prior work suggests that a neural network may learn different representations of x at each layer. As input to the trust score, we tried using 1) the logit layer, 2) the preceding fully connected layer with ReLU activation, and 3) this fully connected layer, which has 128 dimensions in the MNIST network and 512 dimensions in the other networks, reduced down to 20 dimensions by applying PCA.

The trust score results on various layers are shown in Figure 3. They suggest that for high-dimensional datasets, the trust score may provide only little or no improvement over the model confidence at detecting trustworthy and suspicious examples. All plots were made using α = 0; using cross-validation to select a different α did not improve trust score performance. We also did not see much difference from using different layers.

Conclusion: In this paper, we provide the trust score: a new, simple, and effective way to judge whether one should trust the prediction from a classifier. The trust score provides information about the relative positions of the datapoints, which may be lost in common approaches such as the model confidence when the model is trained using SGD. We show high-probability non-asymptotic statistical guarantees that high (low) trust scores correspond to agreement (disagreement) with the Bayes-optimal classifier under various nonparametric settings, which build on recent results in topological data analysis. Our empirical results across many datasets, classifiers, and representations of the data show that our method consistently outperforms the classifier's own reported confidence in identifying trustworthy and suspicious examples in low to mid dimensional datasets. The theoretical and empirical results suggest that this approach may have important practical implications in low to mid dimension settings.

Code for the pretrained networks: https://github.com/geifmany/cifar-vgg, https://github.com/EN10/KerasMNIST, https://github.com/tohinz/SVHN-Classifier

References
[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016. URL http://arxiv.org/abs/1606.06565.
[2] Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, pages 2679–2687, 2013.
[3] Peter L Bartlett and Marten H Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008.
[4] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.
[5] Frédéric Chazal. An upper bound for the volume of geodesic balls in submanifolds of Euclidean spaces. https://geometrica.saclay.inria.fr/team/Fred.Chazal/BallVolumeJan2013.pdf, 2013.
[6] François Chollet et al. Keras. 2015.
[7] C Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.
[8] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Boosting with abstention. In Advances in Neural Information Processing Systems, pages 1660–1668, 2016.
[9] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In International Conference on Algorithmic Learning Theory, pages 67–82. Springer, 2016.
[10] Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with abstention. arXiv preprint arXiv:1703.03478, 2017.
[11] Sanjoy Dasgupta and Samory Kpotufe. Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems, pages 2555–2563, 2014.
[12] Luc Devroye, Laszlo Gyorfi, Adam Krzyzak, and Gábor Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics, pages 1371–1385, 1994.
[13] Bernard Dubuisson and Mylene Masson. A statistical decision rule with incomplete knowledge about classes. Pattern Recognition, 26(1):155–165, 1993.
[14] Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(May):1605–1641, 2010.
[15] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.
[16] Wei Fan, Fang Chu, Haixun Wang, and Philip S. Yu. Pruning and dynamic scheduling of cost-sensitive ensembles. AAAI, 2002.
[17] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer, 2001.
[18] Giorgio Fumera and Fabio Roli. Support vector machines with embedded reject option. In Pattern Recognition with Support Vector Machines, pages 68–82. Springer, 2002.
[19] Giorgio Fumera, Fabio Roli, and Giorgio Giacinto. Multiple reject thresholds for improving classification reliability. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 863–871. Springer, 2000.
[20] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[21] Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13(May):1263–1291, 2012.
[22] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[23] Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, and Stéphane Canu. Support vector machines with a reject option. In Advances in Neural Information Processing Systems, pages 537–544, 2009.
[24] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.
[25] John A Hartigan. Clustering algorithms. 1975.
[26] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[27] Radu Herbei and Marten H Wegkamp. Classification with reject option. Canadian Journal of Statistics, 34(4):709–721, 2006.
[28] Heinrich Jiang. Density level set estimation on manifolds with DBSCAN. In International Conference on Machine Learning, pages 1684–1693, 2017.
[29] Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning, pages 1694–1703, 2017.
[30] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5580–5590, 2017.
[31] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[32] Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pages 3474–3482, 2015.
[33] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6405–6416, 2017.
[34] Thomas CW Landgrebe, David MJ Tax, Pavel Paclík, and Robert PW Duin. The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters, 27(8):908–917, 2006.
[35] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[36] John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance. Human Factors, 46(1):50–80, 2004.
[37] Shuying Liu and Weihong Deng. Very deep convolutional neural network based image classification using small training sample size, pages 730–734, 2015.
[38] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[39] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
[40] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632. ACM, 2005.
[41] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3):419–441, 2008.
[42] Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.
[43] Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, and Dun Yu Hsaio. Classifying with confidence from incomplete information. Journal of Machine Learning Research, 14(December):3561–3589, 2013.
[44] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[45] Foster J Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In ICML, volume 98, pages 445–453, 1998.
[46] Philippe Rigollet, Régis Vert, et al. Optimal rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154–1178, 2009.
[47] Alessandro Rinaldo and Larry Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010.
[48] Carla M Santos-Pereira and Ana M Pires. On optimal reject rules and ROC curves. Pattern Recognition Letters, 26(7):943–952, 2005.
[49] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[50] Aarti Singh, Clayton Scott, Robert Nowak, et al. Adaptive Hausdorff estimation of density level sets. The Annals of Statistics, 37(5B):2760–2782, 2009.
[51] David MJ Tax and Robert PW Duin. Growing a multi-class classifier with a reject option. Pattern Recognition Letters, 29(10):1565–1570, 2008.
[52] Francesco Tortorella. An optimal reject rule for binary classifiers. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 611–620. Springer, 2000.
[53] Alexandre B Tsybakov et al. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948–969, 1997.
[54] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. Big Data, 5(3):246–255, 2017.
[55] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. Efficient learning by directed acyclic graph for resource constrained prediction. Advances in Neural Information Processing Systems (NIPS), 2015.
[56] Yair Wiener and Ran El-Yaniv. Agnostic selective classification. In Advances in Neural Information Processing Systems, pages 1665–1673, 2011.
[57] Ming Yuan and Marten Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(Jan):111–130, 2010.
[58] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699. ACM, 2002.

Appendix
A Supporting results for Theorem 1 Proof
We need the following result giving guarantees on the empirical balls.
Lemma 1 (Uniform convergence of balls [4]). Let F be the distribution corresponding to f and F_n be the empirical distribution corresponding to the sample X. Pick 0 < δ < 1. Assume that k ≥ D log n. Then with probability at least 1 − δ, for every ball B ⊂ R^D we have
F(B) ≥ C_{δ,n} · √(D log n)/n ⇒ F_n(B) > 0,
F(B) ≥ k/n + C_{δ,n} · √k/n ⇒ F_n(B) ≥ k/n,
F(B) ≤ k/n − C_{δ,n} · √k/n ⇒ F_n(B) < k/n,
where C_{δ,n} = 16 log(2/δ) √(D log n).

Remark 5. For the rest of the paper, many results are qualified to hold with probability at least 1 − δ. This is precisely the event in which Lemma 1 holds.

Remark 6. If δ = 1/n, then C_{δ,n} = O((log n)^{3/2}).

To analyze Algorithm 1, we use the k-NN density estimator [12], defined below.

Definition 3. Define the k-NN radius of x ∈ R^D as r_k(x) := inf{r > 0 : |X ∩ B(x, r)| ≥ k}.

Definition 4 (k-NN Density Estimator). f_k(x) := k / (n · v_D · r_k(x)^D), where v_D is the volume of a unit ball in R^D.

We will use bounds on the k-NN density estimator from [11], which are repeated here. Define the following one-sided modulus of continuity, which characterizes how much the density increases locally:
r̂(ε, x) := sup{r : sup_{x′ ∈ B(x,r)} f(x′) − f(x) ≤ ε}.

Lemma 2 (Lemma 3 of [11]). Suppose that k ≥ C_{δ,n}. Then with probability at least 1 − δ, the following holds for all x ∈ R^D and ε > 0:
f_k(x) < (1 + C_{δ,n}/√k) · (f(x) + ε),
provided k satisfies v_D · r̂(ε, x)^D · (f(x) + ε) ≥ k/n + C_{δ,n} · √k/n.

Analogously, define the following, which characterizes how much the density decreases locally:
ř(ε, x) := sup{r : sup_{x′ ∈ B(x,r)} f(x) − f(x′) ≤ ε}.

Lemma 3 (Lemma 4 of [11]). Suppose that k ≥ C_{δ,n}. Then with probability at least 1 − δ, the following holds for all x ∈ R^D and ε > 0:
f_k(x) ≥ (1 − C_{δ,n}/√k) · (f(x) − ε),
provided k satisfies v_D · ř(ε, x)^D · (f(x) − ε) ≥ k/n − C_{δ,n} · √k/n.
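For intuition, a direct transcription of Definition 4 (our sketch):

```python
import numpy as np
from scipy.special import gamma
from sklearn.neighbors import NearestNeighbors

def knn_density(X, k=10):
    """k-NN density estimate f_k(x_i) = k / (n * v_D * r_k(x_i)^D) at each sample point."""
    n, D = X.shape
    v_D = np.pi ** (D / 2) / gamma(D / 2 + 1)  # volume of the unit ball in R^D
    r_k = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)[0][:, -1]
    return k / (n * v_D * r_k ** D)
```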
B Proof of Theorem 1

In this section, we assume the conditions of Theorem 1. We first show that λ_α, the density level corresponding to the α-high-density-set, is smooth in α.

Lemma 4. There exist constants C_0, r_0 > 0 depending on f such that the following holds for all 0 < ε < r_0: 0 < λ_α − λ_{α−ε} ≤ C_0 ε^{β/D} and 0 < λ_{α+ε} − λ_α ≤ C_0 ε^{β/D}.

Proof. We have
ε = ∫_X 1[λ_{α−ε} < f(x) ≤ λ_α] · f(x) dx ≥ λ_{α−ε} ∫_X 1[λ_{α−ε} < f(x) ≤ λ_α] dx,
where the first equality holds by definition. Choosing ε sufficiently small such that Assumption 1 holds, we have
λ_{α−ε} ∫_X 1[λ_{α−ε} < f(x) ≤ λ_α] dx ≥ λ_{α−ε} · Vol((H_α(f) + B(0, ((λ_α − λ_{α−ε})/Ĉ_β)^{1/β})) \ H_α(f)) ≥ λ_{α−ε} · C′ ((λ_α − λ_{α−ε})/Ĉ_β)^{D/β},
where the last inequality holds for some constant C′ depending on f, and Vol is the volume w.r.t. the Lebesgue measure in R^D. It then follows that
λ_α − λ_{α−ε} ≤ Ĉ_β (ε / (λ_{α−ε} · C′))^{β/D},
and the result for the first part follows by taking C_0 = Ĉ_β · (λ_{α−r_0} · C′)^{−β/D} and r_0 < α. Showing that 0 < λ_{α+ε} − λ_α ≤ C_0 ε^{β/D} can be done analogously and is omitted here.

The next result gets a handle on the density level corresponding to α returned by Algorithm 1.
Lemma 5. Let 0 < δ < 1. Let ε̂ be the ε setting chosen by Algorithm 1. Define λ̂_α := k / (v_D · n · ε̂^D). Then, with probability at least 1 − δ, there exists a constant C > 0 depending on f such that for n sufficiently large depending on f, we have
|λ̂_α − λ_α| ≤ C ((√(log(1/δ)/n))^{β/D} + log(1/δ) √(log n) / √k).

Proof. Let α̃ > 0. Then, if x ∼ f, we have P(x ∈ H_α̃(f)) = 1 − α̃. Thus, the probability that a sample point falls in H_α̃(f) is a Bernoulli random variable with probability 1 − α̃. Hence, by Hoeffding's inequality, there exists a constant C′ > 0 such that
P(1 − α̃ − C′ √(log(1/δ)/n) ≤ |H_α̃(f) ∩ X| / n ≤ 1 − α̃ + C′ √(log(1/δ)/n)) ≥ 1 − δ/2.
Then it follows that choosing α_U := α + C′ √(log(1/δ)/n) we get P(|H_{α_U}(f) ∩ X| / n ≤ 1 − α) ≥ 1 − δ/2. Similarly, choosing α_L := α − C′ √(log(1/δ)/n) gives us P(|H_{α_L}(f) ∩ X| / n ≥ 1 − α) ≥ 1 − δ/2.

Define H_α^{upper}(f) := {x ∈ X : f_k(x) ≥ λ_α − ε_1}, where ε_1 > 0 will be chosen later in order for Ĥ_α(f) ⊆ H_α^{upper}(f). By Lemma 4, there exist C_0, r_0 > 0 depending on f such that for ε̂ < r_0 (which holds for n sufficiently large depending on f by Lemma 1), we have λ_α − C_0 (√(log(1/δ)/n))^{β/D} ≤ λ_{α_L}. As such, it suffices to choose ε_1 such that for all x ∈ X, if f(x) ≥ λ_α − C_0 (√(log(1/δ)/n))^{β/D} then f_k(x) ≥ λ_α − ε_1. This is because {x ∈ X : f_k(x) ≥ λ_α − ε_1} would then contain H_{α_L}(f) ∩ X, which we showed earlier contains at least a 1 − α fraction of the samples. Define ε_0 such that ε_1 = C_0 (√(log(1/δ)/n))^{β/D} + ε_0. By Assumption 1 we have
ř(ε_0, x) ≥ ((ε_0 + C_0 (√(log(1/δ)/n))^{β/D}) / Č_β)^{1/β} − ((C_0 (√(log(1/δ)/n))^{β/D}) / Č_β)^{1/β}.
Then, there exists a constant C″ > 0 sufficiently large depending on f such that if
ε_0 ≥ C″ ((√(log(1/δ)/n))^{β/D} + log(1/δ) √(log n) / √k),
then the conditions in Lemma 3 are satisfied for n sufficiently large. Thus, for all x ∈ X with f(x) ≥ λ_α − C_0 (√(log(1/δ)/n))^{β/D}, we have f_k(x) ≥ λ_α − ε_1. Hence, Ĥ_α(f) ⊆ H_α^{upper}(f).

We now do the same in the other direction. Define H_α^{lower}(f) := {x ∈ X : f_k(x) ≥ λ_α + ε_2}, where ε_2 will be chosen such that H_α^{lower}(f) ⊆ Ĥ_α(f). By Lemma 4, it suffices to show that if f_k(x) ≥ λ_α + ε_2 then f(x) ≥ λ_α + C_0 (√(log(1/δ)/n))^{β/D}. This direction follows a similar argument as the previous one.

Thus, there exists a constant C > 0 depending on f such that for n sufficiently large depending on f, we have
|λ̂_α − λ_α| ≤ C ((√(log(1/δ)/n))^{β/D} + log(1/δ) √(log n) / √k),
as desired.

The next result bounds Ĥ_α(f) between two level sets of f.
Lemma 6. Let 0 < δ < 1. There exists a constant C_1 > 0 depending on f such that the following holds with probability at least 1 − δ for n sufficiently large depending on f. Define
H_α^U(f) := {x ∈ X : f(x) ≥ λ_α − C_1 ((√(log(1/δ)/n))^{β/D} + log(1/δ) √(log n) / √k)},
H_α^L(f) := {x ∈ X : f(x) ≥ λ_α + C_1 ((√(log(1/δ)/n))^{β/D} + log(1/δ) √(log n) / √k)}.
Then, H_α^L(f) ∩ X ⊆ Ĥ_α(f) ⊆ H_α^U(f) ∩ X.

Proof. To simplify notation, define
K(n, k, δ) := (√(log(1/δ)/n))^{β/D} + log(1/δ) √(log n) / √k.
By Lemma 5, there exists C_2 > 0 such that defining
Ĥ_α^U(f) := {x ∈ X : f_k(x) ≥ λ_α − C_2 · K(n, k, δ)} and Ĥ_α^L(f) := {x ∈ X : f_k(x) ≥ λ_α + C_2 · K(n, k, δ)},
we have Ĥ_α^L(f) ⊆ Ĥ_α(f) ⊆ Ĥ_α^U(f). It thus suffices to show that there exists a constant C_1 > 0 such that H_α^L(f) ∩ X ⊆ Ĥ_α^L(f) and Ĥ_α^U(f) ⊆ H_α^U(f) ∩ X. We start by showing H_α^L(f) ∩ X ⊆ Ĥ_α^L(f). To do this, we show that any x ∈ X satisfying f(x) ≥ λ_α + C_2 · K(n, k, δ) + ε also satisfies f_k(x) ≥ λ_α + C_2 · K(n, k, δ), where ε > 0 will be chosen later. By a similar argument as in the proof of Lemma 5, we can choose ε ≥ C′ · K(n, k, δ) for some constant C′ > 0, and the desired result holds for n sufficiently large. Similarly, there exists C″ > 0 such that f_k(x) ≤ λ_α − (C_2 + C″) · K(n, k, δ) implies that f(x) ≤ λ_α − C_2 · K(n, k, δ). The result follows by taking C_1 = C_2 + max{C′, C″}.

We are now ready to prove Theorem 5, a more general version of Theorem 1 which makes the dependence on δ explicit. Note that if δ = 1/n, then log(1/δ) = log(n).

Theorem 5 (Extends Theorem 1). Let 0 < δ < 1 and suppose that f is continuous, has compact support X ⊆ R^D, and satisfies Assumption 1. There exist constants C_l, C_u, C > 0 depending on f such that the following holds with probability at least 1 − δ. Suppose that k satisfies
C_l · log(1/δ) · log n ≤ k ≤ C_u · log(1/δ)^{D/(2β+D)} · (log n)^{D/(2β+D)} · n^{2β/(2β+D)}.
Then we have
d_H(H_α(f), Ĥ_α(f)) ≤ C · (log(1/δ)^{1/(2D)} · n^{−1/(2D)} + log(1/δ)^{1/β} · log(n)^{1/(2β)} · k^{−1/(2β)}).

Proof of Theorem 5. Again, to simplify notation, define
K(n, k, δ) := (√(log(1/δ)/n))^{β/D} + log(1/δ) √(log n) / √k.
There are two directions to show for the Hausdorff distance result: (i) that max_{x ∈ Ĥ_α(f)} d(x, H_α(f)) is bounded, i.e. none of the high-density points recovered by Algorithm 1 are far from the true high-density region; and (ii) that sup_{x ∈ H_α(f)} d(x, Ĥ_α(f)) is bounded, i.e. Algorithm 1 recovers a good covering of the entire high-density region.

We first show (i). By Lemma 6, there exists C_1 > 0 such that H_α^U(f) := {x ∈ X : f(x) ≥ λ_α − C_1 K(n, k, δ)} contains Ĥ_α(f). Thus,
max_{x ∈ Ĥ_α(f)} d(x, H_α(f)) ≤ sup_{x ∈ H_α^U(f)} d(x, H_α(f)) ≤ (C_1 · K(n, k, δ) / Č_β)^{1/β},
where the second inequality holds by Assumption 1. Now for the other direction, we have by the triangle inequality that
sup_{x ∈ H_α(f)} d(x, Ĥ_α(f)) ≤ sup_{x ∈ H_α(f)} d(x, H_α^L(f)) + sup_{x ∈ H_α^L(f)} d(x, Ĥ_α(f)).
For the first term, again by Assumption 1,
sup_{x ∈ H_α(f)} d(x, H_α^L(f)) ≤ (C_1 · K(n, k, δ) / Č_β)^{1/β}.
Now for the second term, we see that by Lemma 6, Ĥ_α(f) contains all of the sample points of H_α^L(f). Thus, we have
sup_{x ∈ H_α^L(f)} d(x, Ĥ_α(f)) ≤ sup_{x ∈ H_α^L(f)} d(x, H_α^L(f) ∩ X).
By Assumption 1, for r < r_0 and x ∈ H_α^L(f) we have F(B(x, r)) ≥ ρ r^D, where F is the distribution corresponding to f. Choosing r ≥ (C_{δ,n} √(D log n) / (ρ n))^{1/D} gives us by Lemma 1 that F_n(B(x, r)) > 0, where F_n is the empirical distribution of X, and thus we have
sup_{x ∈ H_α^L(f)} d(x, H_α^L(f) ∩ X) ≤ (C_{δ,n} √(D log n) / (ρ n))^{1/D},
which is dominated by the other error term, and the result follows.

C Supporting results for Theorem 2 Proof
In this section, we note that we will reuse some notation from the last section for the manifold case.
Lemma 7 (Manifold version of uniform convergence of empirical Euclidean balls (Lemma 7 of [2])). Let F be the true distribution and F_n be the empirical distribution w.r.t. the sample X. Let N be a minimal fixed set such that each point in M is at most distance 1/n from some point in N. There exists a universal constant C such that the following holds with probability at least 1 − δ. For all x ∈ X ∪ N and balls B centered at x,
F(B) ≥ C_{δ,n} · √(d log n)/n ⇒ F_n(B) > 0,
F(B) ≥ k/n + C_{δ,n} · √k/n ⇒ F_n(B) ≥ k/n,
F(B) ≤ k/n − C_{δ,n} · √k/n ⇒ F_n(B) < k/n,
where C_{δ,n} = C log(2/δ) √(d log n), F_n is the empirical distribution, and k ≥ C_{δ,n}.

Definition 5 (k-NN Density Estimator on Manifold). f_k(x) := k / (n · v_d · r_k(x)^d).

Lemma 8 (Manifold version of the f_k upper bound [28]). Define the following, which characterizes how much the density increases locally in M:
r̂(ε, x) := sup{r : sup_{x′ ∈ B(x,r) ∩ M} f(x′) − f(x) ≤ ε}.
Fix λ_0 > 0 and δ > 0, and suppose that k ≥ C_{δ,n}. Then there exists a constant C ≡ C(λ_0, d, τ) such that if k ≤ C · C_{δ,n}^{d/(2+d)} · n^{2/(2+d)}, then the following holds with probability at least 1 − δ uniformly in ε > 0 and x ∈ X with f(x) + ε ≥ λ_0:
f_k(x) < (1 + 2 C_{δ,n}/√k) · (f(x) + ε),
provided k satisfies v_d · r̂(ε, x)^d · (f(x) + ε) ≥ k/n − C_{δ,n} · √k/n.

Lemma 9 (Manifold version of the f_k lower bound [28]). Define the following, which characterizes how much the density decreases locally in M:
ř(ε, x) := sup{r : sup_{x′ ∈ B(x,r) ∩ M} f(x) − f(x′) ≤ ε}.
Fix λ_0 > 0 and 0 < δ < 1, and suppose that k ≥ C_{δ,n}. Then there exists a constant C ≡ C(λ_0, d, τ) such that if k ≤ C · C_{δ,n}^{d/(4+d)} · n^{4/(4+d)}, then with probability at least 1 − δ, the following holds uniformly for all ε > 0 and x ∈ X with f(x) − ε ≥ λ_0:
f_k(x) ≥ (1 − 2 C_{δ,n}/√k) · (f(x) − ε),
provided k satisfies v_d · ř(ε, x)^d · (f(x) − ε) ≥ k/n + C_{δ,n} · √k/n.

D Proof of Theorem 2
The proof essentially follows the same structure as the full-dimensional case, with the primary difference in the density estimation bounds.
Lemma 10 (Manifold Version of Lemma 4). There exist constants C_0, r_0 > 0 depending on f such that the following holds for all 0 < ε < r_0: 0 < λ_α − λ_{α−ε} ≤ C_0 ε^{β/d} and 0 < λ_{α+ε} − λ_α ≤ C_0 ε^{β/d}.

Proof.
The proof follows the same structure as the proof of Lemma 4, with the difference being the change in dimension, and is omitted here.
Lemma 11 (Manifold Version of Lemma 5) . Let < δ < . Let (cid:98) ε be the ε setting chosen byAlgorithm 1 after the binary search procedure. Define (cid:99) λ α := kv D · n · (cid:98) ε d . Then, with probability at least − δ , we have there exist constant C > depending on f and M such that for n sufficiently large depending on f and M , we have | (cid:98) λ α − λ α | ≤ C (cid:32)(cid:114) log(1 /δ ) n (cid:33) β/d + log(1 /δ ) √ log n √ k . Proof.
The proof is essentially the same as that of Lemma 5. The only difference is that instead of applying the full-dimensional versions of the uniform $k$-NN density estimation bounds (Lemmas 2 and 3), we apply the manifold analogues (Lemmas 8 and 9). Aside from constant factors, the major difference is in the allowable range for $k$. In the full-dimensional case, we only need $k \lesssim n^{2\beta/(2\beta+D)}$ for the density estimation bounds to hold. However, here we require
$$k \lesssim \min\{n^{2/(2+d)},\; n^{2\beta/(2\beta+d)}\} = n^{2\min\{1,\beta\}/(2\min\{1,\beta\}+d)}.$$

Lemma 12 (Manifold Version of Lemma 6). Let $0 < \delta < 1$. There exists a constant $C > 0$ depending on $f$ and $M$ such that the following holds with probability at least $1 - \delta$ for $n$ sufficiently large depending on $f$ and $M$. Define
$$H^U_\alpha(f) := \left\{x \in M : f(x) \ge \lambda_\alpha - C\left(\left(\sqrt{\tfrac{\log(1/\delta)}{n}}\right)^{\beta/d} + \tfrac{\log(1/\delta)\sqrt{\log n}}{\sqrt{k}}\right)\right\},$$
$$H^L_\alpha(f) := \left\{x \in M : f(x) \ge \lambda_\alpha + C\left(\left(\sqrt{\tfrac{\log(1/\delta)}{n}}\right)^{\beta/d} + \tfrac{\log(1/\delta)\sqrt{\log n}}{\sqrt{k}}\right)\right\}.$$
Then $H^L_\alpha(f) \cap X \subseteq \widehat{H}_\alpha(f) \subseteq H^U_\alpha(f) \cap X$.

Proof. Same comment as in the proof of Lemma 11.
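As a quick illustration of the plug-in level estimate $\widehat{\lambda}_\alpha$ of Lemma 11, the sketch below computes it from a chosen radius. The numbers only show the order of magnitude on the circle example used earlier, and the function name is ours.

    import numpy as np
    from scipy.special import gamma

    def level_estimate(eps_hat, n, k, d):
        """Plug-in estimate of the density level implied by the radius eps_hat
        selected by the binary search: the k-NN density of a point whose k-th
        neighbor sits exactly at distance eps_hat."""
        v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)  # unit-ball volume in R^d
        return k / (v_d * n * eps_hat ** d)

    # With n = 2000 points on the unit circle (d = 1) and k = 50, the selected
    # radius is around 0.08, giving a level close to the true 1/(2*pi) ~ 0.16.
    print(level_estimate(0.08, n=2000, k=50, d=1))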
Theorem 6 (Extends Theorem 2). Let $0 < \delta < 1$. Suppose that the density function $f$ is continuous and supported on $M$, and that Assumptions 1 and 2 hold. Suppose also that there exists $\lambda_0 > 0$ such that $f(x) \ge \lambda_0$ for all $x \in M$. Then there exist constants $C_l, C_u, C > 0$ depending on $f$ such that the following holds with probability at least $1 - \delta$. Suppose that $k$ satisfies
$$C_l \cdot \log(1/\delta) \cdot \log n \;\le\; k \;\le\; C_u \cdot \log(1/\delta)^{d/(2\beta'+d)} \cdot (\log n)^{d/(2\beta'+d)} \cdot n^{2\beta'/(2\beta'+d)},$$
where $\beta' := \max\{1, \beta\}$. Then we have
$$d_H(H_\alpha(f), \widehat{H}_\alpha(f)) \le C \cdot \left( \log(1/\delta)^{1/(2d)} \cdot n^{-1/(2d)} + \log(1/\delta)^{1/\beta} \cdot (\log n)^{1/(2\beta)} \cdot k^{-1/(2\beta)} \right).$$

Proof of Theorem 6.
The proof is the same as in the full-dimensional case, given the lemmas contributed in this section, and is omitted here.
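The quantity controlled in Theorem 6 is a Hausdorff distance between sets. For finite point sets it can be computed directly, as in the small generic utility below (not code from the paper's released implementation).

    import numpy as np
    from scipy.spatial import cKDTree

    def hausdorff(A, B):
        """Hausdorff distance between two finite point sets A and B:
        the larger of the two directed nearest-neighbor distances."""
        d_AB = cKDTree(B).query(A)[0].max()  # sup over a in A of d(a, B)
        d_BA = cKDTree(A).query(B)[0].max()  # sup over b in B of d(b, A)
        return max(d_AB, d_BA)

    A = np.array([[0.0, 0.0], [1.0, 0.0]])
    B = np.array([[0.0, 0.1], [1.0, 0.0], [3.0, 0.0]])
    print(hausdorff(A, B))  # 2.0, driven by the point (3, 0) far from A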
E Supporting Results for Theorem 3
Next, we need the following result on the volume of the intersection of a Euclidean ball with $M$; this is required to get a handle on the true mass of such balls under $F_M$ in later arguments. The upper and lower bounds follow from [5] and Lemma 5.3 of [41]; the proof can be found e.g. in [28].

Lemma 13 (Ball Volume). If $0 < r < \min\{\tau/d,\, 1/\tau\}$ and $x \in M$, then
$$v_d\, r^d\, (1 - \tau r) \le \mathrm{vol}_d(B(x,r) \cap M) \le v_d\, r^d\, (1 + 4dr/\tau),$$
where $v_d$ is the volume of a unit ball in $\mathbb{R}^d$ and $\mathrm{vol}_d$ is the volume w.r.t. the uniform measure on $M$.

The next result is a uniform convergence bound for the empirical mass of balls:
Lemma 14 (Lemma 3 of [29]). Let $\mathcal{B}$ be the set of all balls in $\mathbb{R}^D$, let $F$ be a distribution, and let $F_n$ be the corresponding empirical distribution. With probability at least $1 - \delta$, the following holds uniformly for every $B \in \mathcal{B}$ and $\gamma \ge 0$:
$$F(B) \ge \gamma \;\Rightarrow\; F_n(B) \ge \gamma - \beta_n\sqrt{\gamma} - \beta_n^2,$$
$$F(B) \le \gamma \;\Rightarrow\; F_n(B) \le \gamma + \beta_n\sqrt{\gamma} + \beta_n^2,$$
where $\beta_n = 8d\,\log(1/\delta)\,\sqrt{\log n / n}$.
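A quick Monte Carlo sanity check of this kind of uniform ball bound is easy to run. The constants below are illustrative rather than the lemma's, and the check only looks at a finite family of balls rather than all of them.

    import numpy as np

    rng = np.random.default_rng(2)
    n, D = 5000, 2
    X = rng.uniform(0, 1, (n, D))             # F = uniform on the unit square

    # Compare true vs empirical mass for many balls lying inside the square.
    centers = rng.uniform(0.3, 0.7, (200, D))
    radii = rng.uniform(0.02, 0.2, 200)
    true_mass = np.pi * radii ** 2            # F(B) for balls inside [0, 1]^2
    emp_mass = np.array([(np.linalg.norm(X - c, axis=1) <= r).mean()
                         for c, r in zip(centers, radii)])

    # The bound says deviations scale like sqrt(F(B)) times a polylog/sqrt(n)
    # factor; the normalized deviations below come out around a few / sqrt(n).
    dev = np.abs(emp_mass - true_mass) / np.sqrt(true_mass)
    print("max normalized deviation:", dev.max())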
F Proof of Theorem 3

The first result says that, for small balls centered on the manifold, the vast majority of the probability mass comes from the manifold component of the distribution.
Lemma 15.
There exist constants $C_1, r_0 > 0$ depending on $F_M$, $F_E$, and $M$ such that the following holds uniformly over $x \in M$ and $0 < r < r_0$:
$$\frac{F_E(B(x,r))}{F_M(B(x,r))} \le C_1 \cdot r^{D-d}.$$

Proof.
Let $x \in M$ and $r > 0$. We have
$$F_M(B(x,r)) \ge \lambda_0 \cdot \mathrm{vol}_d(B(x,r) \cap M) \ge v_d\, r^d\, (1 - \tau r) \cdot \lambda_0,$$
where the second inequality holds by Lemma 13 for $r$ sufficiently small. On the other hand, we have
$$F_E(B(x,r)) \le \|f_E\|_\infty\, v_D\, r^D.$$
Thus, there exists $C_1 > 0$ depending on $f_M$, $M$, and $f_E$ such that
$$\frac{F_E(B(x,r))}{F_M(B(x,r))} \le C_1 \cdot r^{D-d},$$
as desired.

We next show that points far away from $H_{\widetilde{\alpha}}(f_M)$ do not get selected as high-density points by Algorithm 1.

Lemma 16.
There exists $\omega_0 > 0$ such that for any $0 < \omega < \omega_0$ and $n$ sufficiently large depending on $F_M$, $F_E$, $M$, and $\omega$, with probability at least $1 - \delta$, Algorithm 1 will not select any points outside of $H_{\widetilde{\alpha} - \omega}(f_M)$.

Proof. By Assumption 1, we can choose $\omega_0$ sufficiently small so that, for the density $f_M$, $|\lambda_{\widetilde{\alpha}-\omega} - \lambda_{\widetilde{\alpha}}| \le \check{C}_\beta \cdot (r_c/2)^\beta$. Then, at the $(\widetilde{\alpha} - \omega)$-density level, we are within the region where the regularity assumptions hold.

Next, by Hoeffding's inequality, there exists a constant $C' > 0$ such that for $\bar{\alpha} > 0$:
$$P\left( 1 - \bar{\alpha} - C'\sqrt{\frac{\log(1/\delta)}{n}} \;\le\; \frac{|H_{(\bar{\alpha}-\eta)/(1-\eta)}(f_M) \cap X|}{n} \;\le\; 1 - \bar{\alpha} + C'\sqrt{\frac{\log(1/\delta)}{n}} \right) \ge 1 - \delta/2.$$
Choosing $\bar{\alpha} = \alpha - C'\sqrt{\log(1/\delta)/n}$, it follows that with probability at least $1 - \delta/2$,
$$H := H_{\widetilde{\alpha} - C'\sqrt{\log(1/\delta)}/(\sqrt{n}\,(1-\eta))}(f_M)$$
satisfies $|H \cap X| > (1 - \alpha) \cdot n$.

Next, let $H_\omega := H_{\widetilde{\alpha}-\omega}(f_M)$, and let $r$ be the value of $\varepsilon$ used by Algorithm 1. It now suffices to show that, for $n$ sufficiently large depending on $f_M$,
$$\max_{x \in X \setminus H_\omega} F_n(B(x,r)) < \min_{x \in H} F_n(B(x,r)),$$
where $F_n$ is the empirical distribution. This is because Algorithm 1 filters out the sample points whose $\varepsilon$-ball contains fewer than $k$ sample points at its final $\varepsilon$ value, which is the value that allows it to filter out an $\alpha$-fraction of the points.

By Lemma 15, it suffices to show that
$$\max_{x \in X \setminus H_\omega} F_{M,n}(B(x,r))\,(1 + C_1 r^{D-d}) < \min_{x \in H} F_{M,n}(B(x,r))\,(1 - C_1 r^{D-d}),$$
where $F_{M,n}(A)$ denotes the fraction of samples drawn from $F_M$ that lie in $A$, taken w.r.t. our entire sample $X$. Then, by Lemma 14, it is enough to show that
$$\max_{x \in X \setminus H_\omega} \left(F_M(B(x,r)) + \beta_n\sqrt{F_M(B(x,r))} + \beta_n^2\right)(1 + C_1 r^{D-d}) < \min_{x \in H} \left(F_M(B(x,r)) - \beta_n\sqrt{F_M(B(x,r))} - \beta_n^2\right)(1 - C_1 r^{D-d}),$$
where $\beta_n = 8d\,\log(1/\delta)\,\sqrt{\log n / n}$.

To bound the LHS, we have by Lemma 13 and the smoothness of $f_M$ that, for some constants $C_2, C_3 > 0$,
$$\max_{x \in X \setminus H_\omega} \left(F_M(B(x,r)) + \beta_n\sqrt{F_M(B(x,r))} + \beta_n^2\right)(1 + C_1 r^{D-d}) \le \left(\lambda_{\widetilde{\alpha}-\omega} + \iota(f_M, r)\right) v_d\, r^d\,(1 + C_2\beta_n + C_3 r),$$
where $\iota$ is the modulus of continuity, that is, $\iota(f_M, r) := \sup_{x,x' \in M:\, |x - x'| \le r} |f_M(x) - f_M(x')|$ (note that $f_M$ is uniformly continuous since it is continuous over a compact support, so $\iota(f_M, r) \to 0$ as $r \to 0$). Similarly, for the RHS we can show, for some constants $C_4, C_5 > 0$, that
$$\min_{x \in H} \left(F_M(B(x,r)) - \beta_n\sqrt{F_M(B(x,r))} - \beta_n^2\right)(1 - C_1 r^{D-d}) \ge \left(\lambda_{\widetilde{\alpha} - C'\sqrt{\log(1/\delta)}/(\sqrt{n}(1-\eta))} - \iota(f_M, r)\right) v_d\, r^d\,(1 - C_4\beta_n - C_5 r).$$
The result follows since $r \to 0$ as $n \to \infty$ (by Lemma 1, $r$ is a $k$-NN radius, so $r \lesssim (k/n)^{1/D} \to 0$ given the conditions on $k$ in Theorem 3) and since $\lambda_{\widetilde{\alpha}-\omega} < \lambda_{\widetilde{\alpha} - C'\sqrt{\log(1/\delta)}/\sqrt{n}}$ for $n$ sufficiently large, as desired.
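To see Lemma 16 in action numerically, one can mix samples from a low-dimensional manifold with full-dimensional background noise and check that the $\varepsilon$-ball filter discards mostly the noise. The sketch below does this for a circle in $\mathbb{R}^2$; all names and constants are chosen for illustration only and mirror the filtering sketch given earlier.

    import numpy as np
    from scipy.spatial import cKDTree

    # Mixture for Lemma 16: points on a circle (the manifold component F_M)
    # plus uniform background noise (F_E). Off-manifold points tend to have
    # large k-NN radii, so the epsilon-ball filter drops them first.
    rng = np.random.default_rng(3)
    theta = rng.uniform(0, 2 * np.pi, 900)
    on_manifold = np.column_stack([np.cos(theta), np.sin(theta)])
    noise = rng.uniform(-1.5, 1.5, (100, 2))          # roughly eta = 0.1
    X = np.vstack([on_manifold, noise])
    from_manifold = np.array([True] * 900 + [False] * 100)

    k, alpha = 20, 0.2
    radii = cKDTree(X).query(X, k=k)[0][:, -1]
    eps_hat = np.quantile(radii, 1 - alpha)           # keep a (1 - alpha) fraction
    kept = radii <= eps_hat
    print("noise points kept:", int((kept & ~from_manifold).sum()),
          "of", int((~from_manifold).sum()))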
Lemma 17 (Bounding the density estimators w.r.t. the entire sample vs. the samples on the manifold). For $x \in \mathbb{R}^D$, define
$$r_k(x) := \inf\{\epsilon > 0 : |B(x,\epsilon) \cap X| \ge k\}, \qquad \widetilde{r}_k(x) := \inf\{\epsilon > 0 : |B(x,\epsilon) \cap X \cap M| \ge k\},$$
where the former is simply the $k$-NN radius we have been using thus far and the latter is the $k$-NN radius when the sample is restricted to only those points that came from $M$. Likewise, define the analogous density estimators
$$f_k(x) := \frac{k}{n \cdot v_d \cdot r_k(x)^d} \quad\text{and}\quad \widetilde{f}_k(x) := \frac{k}{n \cdot v_d \cdot \widetilde{r}_k(x)^d},$$
where the former is the usual $k$-NN density estimator on manifolds. Then there exists $C > 0$ such that the following holds with high probability:
$$\sup_{x \in M} |f_k(x) - \widetilde{f}_k(x)| \le C \cdot (k/n)^{D/d - 1}.$$
By Lemma 14, there exists $C > 0$ depending on $F$ and $F_M$ such that
$$F_n(B(x, r_k(x))) = \tfrac{k}{n} \;\Rightarrow\; \left|F(B(x, r_k(x))) - \tfrac{k}{n}\right| \le C\beta_n,$$
and
$$F_{M,n}(B(x, \widetilde{r}_k(x))) = \tfrac{k}{n} \;\Rightarrow\; \left|F_M(B(x, \widetilde{r}_k(x))) - \tfrac{k}{n}\right| \le C\beta_n,$$
where $F_{M,n}$ is the empirical distribution w.r.t. $X \cap M$. Next, by Lemma 15, we have for some constant $C_1 > 0$:
$$F(B(x, \widetilde{r}_k(x))) \le F_M(B(x, \widetilde{r}_k(x)))\,(1 + C_1\,\widetilde{r}_k(x)^{D-d}) \le \left(\tfrac{k}{n} + C\beta_n\right)(1 + C_1\,\widetilde{r}_k(x)^{D-d}) \le \left(\tfrac{k}{n} + C\beta_n\right)(1 + C_2\,(k/n)^{D/d-1}) \le \tfrac{k}{n}\,(1 + C_3\,(k/n)^{D/d-1}),$$
where the second-to-last inequality holds for some constant $C_2 > 0$ by Lemma 7, and $C_3 > 0$ is a constant depending on $F$, $F_M$, and $M$. It then follows that, for some constant $C_4 > 0$, we have
$$\frac{F(B(x, \widetilde{r}_k(x)))}{F(B(x, r_k(x)))} \le 1 + C_4\,(k/n)^{D/d-1}.$$
In the other direction, we trivially have $\widetilde{r}_k(x) \ge r_k(x)$, so
$$1 \le \frac{F(B(x, \widetilde{r}_k(x)))}{F(B(x, r_k(x)))} \le 1 + C_4\,(k/n)^{D/d-1}.$$
The result follows.
Theorem 7 (Extends Theorem 3). Let $0 < \eta < \alpha < 1$ and $0 < \delta < 1$. Suppose that the distribution $F$ is a weighted mixture $(1-\eta) \cdot F_M + \eta \cdot F_E$, where $F_M$ is a distribution with continuous density $f_M$ supported on a $d$-dimensional manifold $M$ satisfying Assumption 2, and $F_E$ is a (noise) distribution with continuous density $f_E$ with compact support over $\mathbb{R}^D$, where $d < D$. Suppose also that there exists $\lambda_0 > 0$ such that $f_M(x) \ge \lambda_0$ for all $x \in M$, and that $H_{\widetilde{\alpha}}(f_M)$ (where $\widetilde{\alpha} := \frac{\alpha - \eta}{1 - \eta}$) satisfies Assumption 1 for the density $f_M$. Let $\widehat{H}_\alpha$ be the output of Algorithm 1 on a sample $X$ of size $n$ drawn i.i.d. from $F$. Then there exist constants $C_l, C_u, C > 0$ depending on $f_M$, $f_E$, $\eta$, and $M$ such that the following holds with probability at least $1 - \delta$. Suppose that $k$ satisfies
$$C_l \cdot \log(1/\delta) \cdot \log n \;\le\; k \;\le\; C_u \cdot \log(1/\delta)^{d/(2\beta'+d)} \cdot (\log n)^{d/(2\beta'+d)} \cdot n^{2\beta'/(2\beta'+d)},$$
where $\beta' := \max\{1, \beta\}$. Then we have
$$d_H(H_{\widetilde{\alpha}}(f_M), \widehat{H}_\alpha) \le C \cdot \left( \log(1/\delta)^{1/(2d)} \cdot n^{-1/(2d)} + \log(1/\delta)^{1/\beta} \cdot (\log n)^{1/(2\beta)} \cdot k^{-1/(2\beta)} \right).$$
The proof follows in a similar way to that of Theorem 6, except with the added complexity of full-dimensional noise. We only highlight the differences and provide a sketch of the proof here.

Lemmas 16 and 17 give us a handle on the additional complexity introduced by the separate noise distribution, compared to the earlier manifold setting of Theorem 2. Lemma 16 guarantees that the points in $\widehat{H}_\alpha$ lie in $H_{\widetilde{\alpha}-\omega}(f_M) \subseteq M$, i.e. inside the high-density region of $f_M$ with a margin. In particular, the noise points are filtered out by the algorithm, and thus we are reduced to reasoning about the $\widetilde{\alpha}$-high-density set of $f_M$. Then, Lemma 17 ensures that the $k$-NN density estimator used in our analysis, computed from the entire sample $X$, is close on $M$ to the $k$-NN density estimator computed with respect to $M \cap X$. In other words, we can use the $k$-NN density estimator to estimate $f_M$ without knowing which samples of $X$ lie in $M$. Lemma 17 shows that the additional density estimation error is $\approx (k/n)^{D/d-1} \lesssim (k/n)^{1/d} \lesssim (\log n)/\sqrt{k}$, where the first inequality holds since $D > d$ and the latter holds by the conditions on $k$. It turns out that this error term can be absorbed into a constant in the previous result, Theorem 6.

G Proof of Theorem 4
Proof of Theorem 4.
For the first inequality, we have
$$\xi(h, x) \ge \frac{d(x, M_{\widetilde{h}(x)}) - \epsilon_n}{d(x, M_{h(x)}) + \epsilon_n} = \frac{d(x, M_{\widetilde{h}(x)})}{d(x, M_{h(x)})} - \frac{\epsilon_n}{d(x, M_{h(x)}) + \epsilon_n} \cdot \left( \frac{d(x, M_{\widetilde{h}(x)})}{d(x, M_{h(x)})} + 1 \right),$$
where the first inequality holds by Theorem 3. This, along with the conditions on $\gamma$ and $\xi(h, x)$ in the theorem statement, implies that
$$\frac{d(x, M_{\widetilde{h}(x)})}{d(x, M_{h(x)})} < 1 - \gamma,$$
which implies that $h(x) \ne h^*(x)$.

For the second inequality, we have
$$\xi(h, x)^{-1} \ge \frac{d(x, M_{h(x)}) - \epsilon_n}{d(x, M_{\widetilde{h}(x)}) + \epsilon_n} = \frac{d(x, M_{h(x)})}{d(x, M_{\widetilde{h}(x)})} - \frac{\epsilon_n}{d(x, M_{\widetilde{h}(x)}) + \epsilon_n} \cdot \left( \frac{d(x, M_{h(x)})}{d(x, M_{\widetilde{h}(x)})} + 1 \right),$$
where the first inequality holds by Theorem 3. Thus, if the condition of the theorem statement holds, then
$$\frac{d(x, M_{h(x)})}{d(x, M_{\widetilde{h}(x)})} < 1 - \gamma \;\Rightarrow\; \frac{d(x, M_{h(x)})}{d(x, M_c)} < 1 - \gamma \ \text{ for all } c \ne h(x),$$
which implies that $h(x) = h^*(x)$.
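The quantity $\xi(h, x)$ analyzed above compares the distance from $x$ to the region of the closest class other than the predicted one with the distance to the predicted class's region. Below is a minimal sketch of that computation, assuming the per-class high-density sets have already been produced (for example with the filtering sketch given earlier). Names and data are illustrative, and the paper's released implementation differs in details.

    import numpy as np
    from scipy.spatial import cKDTree

    def trust_score(x, predicted_class, class_sets):
        """Schematic score for one test point x: distance from x to the nearest
        high-density set of any class other than the predicted one, divided by
        the distance to the predicted class's high-density set. `class_sets`
        maps class label -> filtered training points of that class."""
        d_pred = cKDTree(class_sets[predicted_class]).query(x)[0]
        d_other = min(cKDTree(pts).query(x)[0]
                      for c, pts in class_sets.items() if c != predicted_class)
        return d_other / d_pred

    # Toy usage with two well-separated classes (filtering already applied).
    rng = np.random.default_rng(4)
    class_sets = {0: rng.normal(0, 0.5, (100, 2)), 1: rng.normal(4, 0.5, (100, 2))}
    x = np.array([0.2, -0.1])
    print(trust_score(x, predicted_class=0, class_sets=class_sets))  # large ratio
    print(trust_score(x, predicted_class=1, class_sets=class_sets))  # small ratio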
H Additional UCI Experiments

H.1 When to trust: Precision for correct predictions by percentile
Figure 4: UCI data sets and precision on correctness.

H.2 When to not trust: Precision for misclassification predictions by percentile
Figure 5: UCI data sets and precision on incorrectness
H.3 High-dimensional Datasets

[Figures: panels (a)-(h) showing results for MNIST, SVHN, CIFAR-10, and CIFAR-100.]