Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications
Nicholas Carlini, Úlfar Erlingsson, Nicolas Papernot
Google Research
Abstract
We develop techniques to quantify the degree to which a given (training or testing) example is an outlier in the underlying distribution. We evaluate five methods to score examples in a dataset by how well-represented the examples are, for different plausible definitions of "well-represented", and apply these to four common datasets: MNIST, Fashion-MNIST, CIFAR-10, and ImageNet. Despite being independent approaches, we find all five are highly correlated, suggesting that the notion of being well-represented can be quantified. Among other uses, we find these methods can be combined to identify (a) prototypical examples (that match human expectations); (b) memorized training examples; and (c) uncommon submodes of the dataset. Further, we show how we can utilize our metrics to determine an improved ordering for curriculum learning, and how they impact adversarial robustness. We release all metric values on the training and test sets we studied.
Figure 1: Sorting images sampled from the MNIST "3" class, Fashion-MNIST "shirt" class, and CIFAR-10 "dog" class using our five metrics. Outliers are shown on the left, and well-represented examples on the right. Notice, for example, the mislabeled "9" that appears as an MNIST outlier, or how many of the poorly represented Fashion-MNIST "shirts" in fact belong in the different "t-shirt" or "dress" classes.
Machine learning (ML) is now applied to problems with sufficiently large datasets that it is difficult to manually inspect each training and test point. This drives interest in research that seeks to understand the dataset and the underlying data distribution. Potential uses of these techniques are numerous. On the one hand, they contribute to improving how ML is perceived by end users (e.g., one of the motivations behind interpretability efforts). On the other hand, they also help ML practitioners glean insights into the learning procedure. This surfaces the need for tools that enable one to (1) measure and characterize the contribution of each training point to the learning procedure, and (2) explain the different failure modes observed on individual test points when the model infers.

Towards this goal, prior work has investigated model interpretability, identifying training and test points that are prototypical (Kim et al., 2014), or applying influence functions to measure the contribution of individual training points to the final model (Koh & Liang, 2017). While defining a prototype precisely remains an open problem, a common intuitive definition of the notion is that prototypes should be "a relatively small number of samples from a data set which, if well chosen, can serve as a summary of the original data set" (Bien & Tibshirani, 2011). In addition to these two examples, there is a wealth of related efforts discussed below.

In this work, we take an orthogonal direction and show that rather than trying to identify a single metric or technique to identify "prototypes", simultaneously considering a variety of metrics can be more effective at discovering properties of the training data. In particular, we introduce five metrics for measuring to what extent a specific point is well represented in, or an outlier of, a dataset. We explicitly do not define what we mean by well-represented or outlier, specifically because we are interested in the interplay between different metrics that may fall under either definition. Indeed, we find that while the different metrics are highly correlated for most training and test inputs, their disagreements are highly informative.

In more detail, our metrics are based on adversarial robustness, retraining stability, ensemble agreement, and differentially-private learning. We demonstrate that in addition to supporting use cases previously studied in the literature (e.g., identifying prototypes), studying the interplay between these five metrics allows us to identify other types of examples that help form an understanding of the training and inference procedures. They provide a more complete picture of a model's training and test performance than can be captured by accuracy alone. For instance, disagreements between our metrics distinguish memorized training examples, which models overfit to in order to learn them, and uncommon submodes, which are not sufficiently well-represented in the training data for a privacy-preserving model to recognize them at test time. These results hold for all the datasets we consider: MNIST, Fashion-MNIST, CIFAR-10, and ImageNet. We release the results of running our metrics on these datasets to help other researchers interested in building on our results.

Usefully, there are advantages to training models using only the well-represented examples: the models learn much faster, their accuracy loss is small and occurs almost entirely on outlier test examples, and the models are both easier to interpret and more adversarially robust.
Conversely, at the same sample complexity, significantly higher overall accuracy can be achieved by training models exclusively on outliers—once erroneous and misleading examples have been eliminated from the dataset automatically through an analysis of the disagreement between our metrics.

As an independent result, we show that predictive stability under retraining strongly correlates with adversarial distance, and may be used as an approximation. This is particularly interesting for tasks where defining the adversary's goal when creating an adversarial example (Biggio et al., 2013; Szegedy et al., 2013) can be difficult (e.g., in sequence-to-sequence language modeling).

It is important to understand the underlying datasets (both training and testing) used for machine learning models. In the following, we introduce the five metrics that underlie our approach for interpreting datasets. Each metric we develop scores examples on a continuum where, in one direction, the examples are somehow more well-represented in the dataset, and in the other direction they are less represented—more of an outlier—in the dataset. We do not define a priori what we mean by well-represented: rather, we define the term with respect to our different algorithms for computing this. As we will demonstrate, our rankings agree with the definition of prototypes in many ways. However, their disagreements are useful to identify training and test points that are important for forming an understanding of the training and inference procedures. Indeed, in Section 3.3 we demonstrate how the metrics allow us to identify memorized exceptions or uncommon submodes at scale in the data.
Each of the metrics below corresponds to a definition of what one might mean by saying an example is representative or an outlier. For each, we provide a concrete method for measuring this informally-specified quantity. We study five metrics that we found generalizable and useful; clearly these are not the only possible metrics, and we encourage future work to study other metrics. However, we believe these metrics cover a wide range of what one might mean by representative. Other definitions which we considered were either unstable or model-specific. (For example, in one attempt at a metric, we defined how representative an example is with respect to the magnitude of the gradient of the loss function on a pre-trained model, and found it varied significantly across different pre-trained models. In another rejected metric, we found that sorting examples by when they were learned during the training process gave different orderings when applied to different model architectures.) All of the algorithms we give below are both stable and appear to be consistent properties of the training data, and not of the model (e.g., its architecture).

Adversarial Robustness (adv): Examples that well represent the dataset should be more adversarially robust, i.e., it should be more difficult to find an input perturbation which changes their classification. Indeed, as a measure of prototypicality, this exact measure (the distance to the decision boundary as measured by an adversarial-example attack) was recently proposed and utilized by Stock & Cisse (2017). Specifically, for an example $x$, the measure finds the perturbation $\delta$ with minimal $\|\delta\|$ such that the original $x$ and the adversarial example $x + \delta$ are classified differently (Biggio et al., 2013; Szegedy et al., 2013).

To compare prototypicality, the work of Stock & Cisse (2017) that inspired our current work used a simple and efficient $\ell_\infty$-based adversarial-example attack based on the iterative gradient descent introduced by Kurakin et al. (2016). That attack procedure computes gradients to find directions that will increase the model's loss on the input within an $\ell_\infty$-norm ball. They define prototypicality as the number of gradient descent iterations necessary to change the class of the perturbed input. Instead, our metric (for short, adv) ranks by the $\ell_2$ norm (or the faster, less accurate $\ell_\infty$ norm) of the minimal-found adversarial perturbation (Carlini & Wagner, 2017). This is generally more accurate at measuring the distance to the decision boundary, but comes at a performance cost (it is on average 10–100x slower). A minimal sketch of the $\ell_\infty$ variant is given below.
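As an illustration, the sketch below estimates the faster $\ell_\infty$ variant by searching a grid of perturbation budgets with a standard PGD attack; the paper's more accurate ranking instead uses the minimal $\ell_2$ perturbation found by the Carlini & Wagner (2017) attack. The choice of PyTorch, the function name, and the step-size heuristic are our assumptions, not specified by the paper.

import torch
import torch.nn.functional as F

def linf_adversarial_distance(model, x, y, eps_grid, steps=20):
    """Return the smallest epsilon in eps_grid at which a PGD attack
    bounded by ||delta||_inf <= epsilon flips the prediction on the
    single example (x, y); larger values = more well-represented."""
    model.eval()
    for eps in sorted(eps_grid):
        x_adv = x.clone().detach()
        alpha = 2.5 * eps / steps  # common PGD step-size heuristic (assumed)
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            (grad,) = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()       # ascend the loss
                x_adv = x + (x_adv - x).clamp(-eps, eps)  # project onto the l-inf ball
                x_adv = x_adv.clamp(0.0, 1.0)             # stay in valid pixel range
        with torch.no_grad():
            if model(x_adv).argmax(dim=1).item() != y.item():
                return eps  # first budget at which the label flips
    return float("inf")     # robust to every budget we tried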
Holdout Retraining (ret): A model should treat a well-represented example the same regardless of whether or not it is used in the training process: if the example is not used, a well-represented example should have sufficient support in the training data for its omission not to matter.

Assume we are given a training dataset $X$, a disjoint holdout dataset $\bar{X}$, and an example $x \in X$ whose representedness we want to assess. To begin, we train a model $f(\cdot)$ on the data $X$ to obtain model weights $\theta$. We train this model just as we typically would, i.e., with the same learning rate schedule, hyper-parameter settings, etc. Then, we fine-tune the weights of this first model $f_\theta(\cdot)$ on the held-out training data $\bar{X}$ to obtain new weights $\bar{\theta}$. To perform this fine-tuning, we use a smaller learning rate and train until the training loss stops decreasing. (We have found it is important to obtain $\bar{\theta}$ by fine-tuning $\theta$ as opposed to training from scratch; otherwise, the randomness of training leads to unstable rankings that yield specious results.) Finally, given these two models, we measure how well-represented the example $x$ is as the difference $\|f_\theta(x) - f_{\bar{\theta}}(x)\|$. The exact choice of metric $\|\cdot\|$ is not important; the results in this paper use the symmetric KL-divergence. (To compute the metric on test examples, we invert the roles of the two datasets: we train on the test data and perform holdout retraining on the original training data.)
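A minimal sketch of this metric, under the same (assumed) PyTorch setup as above; the stopping criterion for fine-tuning is simplified here to a fixed number of steps, whereas the text trains until the loss stops decreasing:

import copy
import torch
import torch.nn.functional as F

def holdout_retraining_scores(model, holdout_loader, example_loader,
                              fine_tune_steps=1000, lr=1e-4):
    """Score examples by the symmetric KL divergence between a trained
    model and the same model fine-tuned on held-out data; smaller
    scores = better represented."""
    tuned = copy.deepcopy(model)
    opt = torch.optim.Adam(tuned.parameters(), lr=lr)  # small learning rate, per the text
    tuned.train()
    for step, (xb, yb) in enumerate(holdout_loader):
        opt.zero_grad()
        F.cross_entropy(tuned(xb), yb).backward()
        opt.step()
        if step + 1 >= fine_tune_steps:
            break
    model.eval(); tuned.eval()
    scores = []
    with torch.no_grad():
        for xb, _ in example_loader:
            p = model(xb).softmax(dim=1)
            q = tuned(xb).softmax(dim=1)
            sym_kl = (p * (p / q).log()).sum(dim=1) + (q * (q / p).log()).sum(dim=1)
            scores.extend(sym_kl.tolist())
    return scores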
Ensemble Agreement (agr): Well-represented examples should be easy for many types of models to learn, not only for models which are nearly perfect. We train multiple models of varying capacity (i.e., number of parameters) on different subsets of the training data (see Appendix 7). The agr metric ranks examples based on the agreement within this ensemble, as measured by the symmetric JS-divergence between the models' outputs. Concretely, we train many models $f_{\theta_i}(\cdot)$ and, for each example $x$, evaluate the model predictions, and then compute the following value to order the examples:

$$\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \text{JS-Divergence}\left(f_{\theta_i}(x), f_{\theta_j}(x)\right) \qquad (1)$$

Model Confidence (conf): Models should be confident on examples that are well-represented. Based on an ensemble of models like that used by the agr metric, the conf metric ranks examples by the mean confidence in the models' predictions, i.e., ranking each example $x$ by:

$$\frac{1}{N} \sum_{i=1}^{N} \max f_{\theta_i}(x) \qquad (2)$$

Privacy-preserving Training (priv): We can expect well-represented examples to be classified properly by models even when trained with guarantees of differential privacy (Abadi et al., 2016; Papernot et al., 2016). (Informally, differential privacy states that whether or not any given training example is in the training data, the learned models will be statistically indistinguishable.) However, such privacy-preserving models should exhibit significantly reduced accuracy on any rare or exceptional examples, because differentially-private learning attenuates gradients and introduces noise to prevent details about any specific training example from being memorized. Outliers are disproportionately likely to be impacted by this attenuation and added noise, whereas the common signal found across well-represented examples must have been preserved in models trained to reasonable accuracy.

Our priv metric is based on training an ensemble of models with increasingly greater $\varepsilon$ privacy (i.e., more attenuation and noise) using $\varepsilon$-differentially-private stochastic gradient descent (Abadi et al., 2016). Our metric then ranks how well-represented an example is based on the minimum $\varepsilon$ (i.e., maximum privacy protection) at which the example is correctly classified in a reliable manner (which we take as being also classified correctly in 90% of less-private models). This ranking embodies the intuition that the more tolerant an example is to noise and attenuation during learning, the more well-represented it must be.
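A minimal sketch of equations (1) and (2) above, assuming the ensemble's softmax outputs have already been collected into a single array (the use of NumPy and the array layout are our assumptions):

import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def agr_and_conf_scores(ensemble_probs):
    """ensemble_probs: shape (n_models, n_examples, n_classes), holding
    each model's softmax output on every example. Returns per-example
    agr scores (equation 1; lower = better represented) and conf scores
    (equation 2; higher = better represented)."""
    n_models, n_examples, _ = ensemble_probs.shape
    agr = np.zeros(n_examples)
    for i in range(n_models):
        for j in range(n_models):
            for k in range(n_examples):
                agr[k] += js_divergence(ensemble_probs[i, k], ensemble_probs[j, k])
    agr /= n_models ** 2
    conf = ensemble_probs.max(axis=2).mean(axis=0)  # mean max-confidence
    return agr, conf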
As the first step in an evaluation, it is natural to consider to what extent these metrics are different methods of evaluating the same underlying property. We find that they are highly correlated across the four datasets we study: MNIST (LeCun et al., 2010), Fashion-MNIST (Xiao et al., 2017), CIFAR-10 (Krizhevsky & Hinton, 2009), and ImageNet (Russakovsky et al., 2015).

(a) MNIST
        adv     ret     agr     conf    priv
adv     1.000   0.747   0.536   0.381   0.347
ret     0.747   1.000   0.557   0.418   0.338
agr     0.536   0.557   1.000   0.888   0.584
conf    0.381   0.418   0.888   1.000   0.536
priv    0.347   0.338   0.584   0.536   1.000

(b) Fashion-MNIST
        adv     ret     agr     conf    priv
adv     1.000   0.868   0.726   0.716   0.572
ret     0.868   1.000   0.782   0.737   0.607
agr     0.726   0.782   1.000   0.940   0.684
conf    0.716   0.737   0.940   1.000   0.670
priv    0.572   0.607   0.684   0.670   1.000

(c) CIFAR-10
        adv     ret     agr     conf    priv
adv     1.000   0.789   0.602   0.531   0.497
ret     0.789   1.000   0.617   0.532   0.496
agr     0.602   0.617   1.000   0.890   0.677
conf    0.531   0.532   0.890   1.000   0.596
priv    0.497   0.496   0.677   0.596   1.000

(d) ImageNet
        adv     ret     agr     conf    priv
adv     1.000   0.656   0.903   0.886   0.751
ret     0.656   1.000   0.674   0.679   0.666
agr     0.903   0.674   1.000   0.954   0.746
conf    0.886   0.679   0.954   1.000   0.740
priv    0.751   0.666   0.746   0.740   1.000

Figure 2: Correlation coefficients for our five prototypicality metrics on four common datasets.
Pick Worst

MNIST:
        0%    10%   20%   30%   40%   50%   60%   70%   80%   90%
adv     25%   10%   10%    9%    9%    7%    7%    8%    5%    5%
ret     27%   18%   10%    8%    7%    7%    5%    5%    4%    3%
agr     26%   15%   13%   11%    7%    9%    4%    4%    4%    2%
conf    25%   16%   12%    9%   11%    8%    4%    6%    1%    3%
priv    22%   13%    9%   10%    7%    6%    8%   10%    6%    5%

Fashion-MNIST:
        0%    10%   20%   30%   40%   50%   60%   70%   80%   90%
adv     21%   17%   15%   10%    9%    8%    8%    4%    2%    1%
ret     25%   18%   14%   10%    9%    5%    6%    3%    3%    2%
agr     23%   15%   14%   12%   10%    6%    5%    5%    3%    2%
conf    18%   14%   15%   15%   12%    7%    6%    4%    4%    2%
priv    23%   11%   13%   14%    9%    8%    6%    5%    2%    3%

CIFAR-10:
        0%    10%   20%   30%   40%   50%   60%   70%   80%   90%
adv     15%   16%   12%   13%   10%    8%    6%    6%    6%    3%
ret     18%   14%   13%   11%   13%    9%    6%    4%    4%    3%
agr     13%   14%   13%   11%   12%   10%    8%    6%    5%    4%
conf    12%   15%   14%   10%    9%    8%    7%    8%    7%    4%
priv     8%   21%   14%    9%   11%    8%    6%    6%    6%    6%

Pick Best

MNIST:
        0%    10%   20%   30%   40%   50%   60%   70%   80%   90%
adv      1%    4%    6%    8%    8%   11%   13%   14%   15%   15%
ret      3%    3%    5%    6%    8%   11%   15%   20%   14%   10%
agr      1%    5%    6%    8%   13%   15%   10%    8%   11%   18%
conf     2%    5%    8%   13%   10%   11%   11%   11%   11%   13%
priv     3%    5%    9%    6%    8%    7%   23%   12%   12%   11%

Fashion-MNIST:
        0%    10%   20%   30%   40%   50%   60%   70%   80%   90%
adv      4%    6%    7%    6%   10%   16%   16%   13%    7%    8%
ret      3%    6%    8%    7%   13%   11%   18%   10%   12%    8%
agr      3%    6%    7%    9%   12%    9%   13%   17%    9%    9%
conf     6%    5%    8%    5%   13%   10%   16%   14%    7%   11%
priv     5%    7%    8%   10%    8%   12%    7%   14%   11%   11%

CIFAR-10:
        0%    10%   20%   30%   40%   50%   60%   70%   80%   90%
adv      1%    4%    6%    5%    7%   14%    8%   17%   21%   13%
ret      4%    3%    5%    7%    7%    9%   12%   14%   20%   15%
agr      5%    5%    7%    6%    6%    8%   19%   13%   15%   12%
conf     2%    8%    6%    3%   10%   13%   13%   15%   15%    9%
priv     3%    6%    6%    7%    7%   11%   15%   14%   14%   13%
Table 1: Results of a human study of Mechanical Turk workers selecting the best or worst example among a random collection of 9 training-data images. For each metric, the tables show what percent of workers selected examples in each 10% split of the metric's sorted ranking (e.g., when shown MNIST digits, 25% of human workers selected as worst an example that fell in the bottom 10% of examples as ranked by the adv metric).

In particular, we observe a strong correlation between the adversarial distance and holdout retraining metrics, which is of independent interest: the holdout retraining metric could serve as a substitute for adversarial distance in tasks where adversarial examples are ill-defined. Our metrics are widely applicable, as they are not specific to any learning task or model (some, like ret and priv, might be applicable even to unsupervised learning), and experimentally we have confirmed that the metrics are model-agnostic in the sense that they give overall the same results despite large changes in hyperparameters or even the model architecture. We also show that our metrics are consistent with human perception of representativeness.

An important application of our metrics is that studying their (relatively rare) disagreements allows us to inspect datasets at scale. In particular, we show how to identify two types of examples: memorized exceptions and uncommon submodes. They can be used to form an understanding of performance at training and test time that is more precise than what an accuracy measurement can offer. (Due to space constraints, experimental results supporting some of the above observations are given in the Appendix.)

We release the full results of running our metrics on each of the four datasets in the Appendix; we encourage the interested reader to examine the results, which we believe speak for themselves.
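A correlation matrix like Figure 2's can be computed directly from per-example metric scores. The paper does not state which correlation coefficient it uses; since the metrics are treated as rankings, Spearman's rank correlation is one natural choice in this sketch:

import numpy as np
from scipy.stats import spearmanr

def metric_correlations(metric_scores):
    """metric_scores: dict mapping a metric name ('adv', 'ret', ...) to
    an array of per-example scores. Returns the metric names and their
    pairwise rank-correlation matrix."""
    names = list(metric_scores)
    corr = np.eye(len(names))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            rho, _ = spearmanr(metric_scores[names[i]], metric_scores[names[j]])
            corr[i, j] = corr[j, i] = rho
    return names, corr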
Figure 2 shows the correlation coefficients computed pairwise between each of our metrics for all our datasets (the tables are symmetric across the diagonal). The metrics are overall strongly correlated, and the differences in correlation are informative. Unsurprisingly, since they measure very similar properties, agr (ensemble agreement) and conf (model confidence) show the highest correlation.

However, somewhat unexpectedly, we find that the adv metric (adversarial robustness) correlates very strongly with the ret metric (retraining distance) on the smaller three datasets. This is presumably because these two metrics both measure the distance to a model's decision boundary—even though adv measures this distance by perturbing each example, while ret measures how the evaluation of each example is affected when models' decision boundaries themselves are perturbed. On ImageNet, adv is most strongly correlated with ensemble agreement and model confidence; we hypothesize this is because ImageNet is a much more challenging task, and therefore the distance to the decision boundary is best approximated by the initial model confidence, unlike on MNIST, where most datapoints (even incorrectly labeled ones) are assigned very high probability.

This strong correlation between adv and ret is a new result that may be of independent interest and some significance. Measurement of adversarial distance is a useful and highly-utilized technique, but it is undefined or ill-defined on many learning tasks, and its computation is difficult, expensive, and hard to calibrate. On the other hand, given any holdout dataset and any measure of divergence, the ret metric we define in Section 2.1 should be easily computable for any ML model or task.

Clearly these metrics consistently measure some quantity that is inspired by how well represented individual examples are, but we have not yet provided evidence that this quantity matches intuition. To do so, we will show that our methods for identifying examples that are well represented in the dataset match human expectations for examples that are of high quality.

To begin, we perform a subjective visual inspection of how the different metrics rank the example training and test data on different datasets. As a representative example, Figure 1 (Page 1) and the figures in Appendix 7 confirm that there is an obviously apparent difference between the two extremes on the MNIST, Fashion-MNIST, and CIFAR-10 training examples (ImageNet examples are given in the Appendix).

To more rigorously validate and quantify how our metrics correlate with human perception, we performed an online human study using Amazon's Mechanical Turk service. We presented human evaluators a collection of images (all of one class) from the MNIST, Fashion-MNIST, or CIFAR-10 datasets. We asked each evaluator to select either the image that was most or least representative of the class. In the study, 400 different human evaluators assessed over 100,000 images. At a high level, the human evaluators largely agreed that the images selected by our algorithms as most representative do in fact match human intuition.

Concretely, in this study (an image of the study form is given in the Appendix, Section 9), each human evaluator saw a 3x3 grid of 9 random images and was asked to pick the worst image—or the best image—and this was repeated multiple times.
Evaluators exclusively picked either best or worst images, and were only shown random images from one output class under a heading with the label name of that class; thus, one person would pick only the best MNIST digits "7" while another picked only the worst CIFAR-10 "cars." (As dictated by good study design, we inserted "Gold Standard" questions with known answers to catch workers answering randomly or incorrectly, eliminating the few such workers from our data.) For all datasets, picking non-representative images proved to be the easier task: in a side study where 50 evaluators were shown the same identical 3x3 grids, agreement was higher on the worst image than on the best image (random choice would give 11% agreement).

The results of our human study are presented in Table 1. The key takeaway is that the evaluators' assessments are correlated with each one of our metrics: evaluators mostly picked poorly-represented images as the worst examples and better-represented images as the best.

Because our five metrics are not perfectly correlated, there are likely to be many examples that are determined to be well-represented under one metric but not under another, as a consequence of the fact that each metric defines "well-represented" differently. To quantify the number and types of those differences, we can try looking at their visual correlation in a scatter plot; doing so can be informative, as can be seen in Figure 3(a), where the easily-learned, yet fragile, examples of class "1" in MNIST models have high confidence but low adversarial robustness. The results show substantial disagreement between metrics.
Figure 3: Scatter plot comparing the adv vs. conf ranks on the MNIST test set. Notice the clusters of digits, for example the "1" class, which has extremely high average confidence (i.e., these digits are easy to classify) but low adversarial distance (i.e., they are easy to perturb into other classes).

To understand disagreements, we can consider examples that are well represented in one metric but not in others, first combining the union of adv and ret into a single boundary metric, and the union of agr and conf into a single ensemble metric, because of their high pairwise correlations.
Memorized exceptions:
Recalling the unusual dress-looking "shirt" of Figure 1, and how it seemed to have been memorized with high confidence, we can intersect the top 25% well-represented ensemble images with the bottom-half outliers in both the boundary and priv metrics.

For the Fashion-MNIST "shirt" class, this set—shown visually in Figure 4—includes not only the dress-looking example but a number of other atypical "shirt" images, including some looking like shorts. Also apparent in the set are a number of T-shirt-like and pullover-like images, which are misleading, given the other output classes of Fashion-MNIST. For these sets, which are likely to include spurious, erroneously-labeled, and inherently ambiguous examples, we use the name memorized exceptions because they must be memorized as exceptions for models to have been able to reach very high confidence during training. Similarly, Figure 5a shows a large (green) cluster of highly ambiguous boot-like sneakers, which appear indistinguishable from a cluster of memorized exceptions in the Fashion-MNIST "ankle boot" class (see Appendix 7).

Figure 4: Exceptional "shirts."
Uncommon submodes:
On the other hand, the priv metric is based on differentially-private learning, which ensures that no small group of examples can possibly be memorized: the privacy stems from adding noise and attenuating gradients in a manner that masks the signal from rare examples during training.
This suggests that we can find uncommon submodes of the examples in learning tasks by intersecting the bottom-most outlier examples on the priv metric with the union of the most well-represented in the boundary and ensemble metrics. Figure 5b shows uncommon submodes discovered in MNIST using the 25% lowest outliers on priv and the top 50% well-represented on the other metrics. Notably, all of the "serif 1s" in the entire MNIST training set are found as a submode.

Figure 5: Our metrics' sets reveal interesting examples, which can be clustered. (a) Memorized exceptions in the Fashion-MNIST "sneaker" class. (b) Uncommon submodes found within the MNIST "1" class. (c) Canonical prototypes in the CIFAR-10 "airplane" class.
Canonical prototypes:
Finally, we can simply consider the intersection of the sets of the most represented examples under all of our metrics. The differences between our metrics should ensure that this intersection is free of spurious or misleading examples; yet, our experiments and human study suggest the set will still provide good coverage. Hence, we call this set canonical prototypes. Figure 5c shows the airplanes that are canonical prototypes in CIFAR-10. A sketch of these set combinations is given below.

To further aid interpretability in Figures 4 and 5, we perform a combination of dimensionality reduction and clustering. We apply t-SNE (Maaten & Hinton, 2008) on the pixel space (for MNIST and Fashion-MNIST) or the ResNetv2 feature space (for CIFAR-10) to project the example sets into two dimensions, and cluster with HDBSCAN (Campello et al., 2013), a hierarchical and density-based clustering algorithm which does not try to assign all points to clusters—which not only can improve the clusters but also identifies spurious data. We believe that other types of data projection and clustering could also be usefully applied to our metrics, and offer significant insight into ML datasets. (See Appendix 11 for this section's figures shown larger.)
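The set combinations above reduce to simple operations on per-example percentile ranks. In the sketch below, we interpret the "union" of two metrics as taking the more favorable (maximum) rank, which is one plausible reading; the thresholds follow the text where stated (the top 25% cutoff for canonical prototypes is our assumption):

import numpy as np

def percentile_ranks(scores, higher_is_better=True):
    """Convert raw metric scores to percentile ranks in [0, 100],
    with 100 = most well-represented."""
    s = np.asarray(scores, dtype=float)
    order = np.argsort(s if higher_is_better else -s)
    ranks = np.empty(len(s))
    ranks[order] = np.linspace(0.0, 100.0, len(s))
    return ranks

def special_sets(adv, ret, agr, conf, priv):
    """Given per-example percentile ranks for the five metrics, return
    boolean masks for the three example types described above."""
    boundary = np.maximum(adv, ret)    # union of the two boundary metrics
    ensemble = np.maximum(agr, conf)   # union of the two ensemble metrics
    memorized_exceptions = (ensemble >= 75) & (boundary < 50) & (priv < 50)
    uncommon_submodes = (priv < 25) & (np.maximum(boundary, ensemble) >= 50)
    canonical_prototypes = np.minimum.reduce([adv, ret, agr, conf, priv]) >= 75
    return memorized_exceptions, uncommon_submodes, canonical_prototypes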
Our metrics enable us to inspect datasets at scale and identify examples in the training and test sets that are of particular importance to evaluating a model's performance. For instance, imagine we observe that a non-private model performs better than a private model because the private model is unable to classify uncommon submodes correctly. This behavior is in fact desirable, because it is harder to protect the privacy of examples that are not well-represented in the training data. However, if one were to limit their evaluation to simply reporting accuracy, one would conclude that the privacy-preserving model performs "worse," while this is not necessarily the case.

Beyond such applications that provide insights into learning and inference, we now show that our metrics for sorting examples according to how well-represented they are can also be integrated directly into learning procedures to improve them. Namely, we look at three model properties: sample complexity, accuracy, and robustness.
[Figure 6 plots: model test accuracy vs. prototypicality percentile for (a) MNIST (with and without strong data augmentation), (b) Fashion-MNIST, and (c) CIFAR-10.]
Figure 6: Final test accuracy of a model after being trained on 5,000 training examples consecutively ranked by the adv metric (so that training on the least representative examples is at the 0th percentile, and on the most representative at the 100th percentile). See text for full details. Given only 5,000 training examples, on MNIST (subplot (a)) training on the outliers is always better; however, for Fashion-MNIST (b) and CIFAR-10 (c) it is preferable to train using neither the most nor least well-represented examples, but those in the middle.

[Figure 7 plots: model accuracy vs. fraction of training data used, for (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10; curves labeled Outlier/All, Repr/All, and Repr/Repr.]
Figure 7: Final test accuracy of a model after being trained on a varying percentage of the training data as sorted by the adv metric. The blue solid lines correspond to the final test accuracy when the training data consists of only the x% least well-represented examples, growing toward the most represented. The orange solid line corresponds to training on the x% most well-represented examples; the orange dashed line corresponds to testing only on the most well-represented test points. See text for full details.

4.1 Curriculum Learning

We perform two experiments on the three datasets to investigate whether it is better to train on the well-represented examples or the outliers—exploring the "train on hard data" vs. "train on easy data" question of curriculum learning (Ren et al., 2018). To begin, we order all training data according to our adv metric. (We use the adv metric since it is well-correlated with human perception and its definition does not involve model performance.)

Experiment 1.
First, we experiment with training on splits of 5,000 training examples (approximately 10% of the training data), chosen by taking the k-th most well-represented example through the (k+5000)-th most well-represented, for different values of k. As shown in Figure 6, we find that the index k that yields the most accurate model varies substantially across the datasets and tasks. (In the plot, the x-axis is given as a percentile from 0 to 100, obtained by dividing k by the size of the dataset.) For example, initially we train a model on only the 5,000 least represented examples and record the model's final test accuracy; for MNIST this accuracy is already high. However, for CIFAR-10 it is preferable to take 5,000 examples starting at the 60th percentile—that is, the examples ranked from 30,000 to 35,000 by the adv metric yield the highest test accuracy.

To summarize the results: on MNIST, training on the outlier examples gives the highest accuracy; conversely, on Fashion-MNIST and CIFAR-10, training on examples that are better represented gives the highest accuracy. We conjecture this is due to dataset complexity: because nearly all of MNIST is very easy, it makes sense to train on the hardest, most outlier examples. However, because Fashion-MNIST and CIFAR-10 are comparably difficult, training on the most well-represented examples is better given a limited amount of training data: it is simply too hard to learn from only the least representative training examples.

Notably, many of the CIFAR-10 and Fashion-MNIST outliers appear to be inherently misleading or ambiguous examples, and several are simply erroneously labeled. We find that a substantial fraction of the first 5,000 outliers meet our definition of memorized exceptions. Also, we find that inserting 10% label noise causes model accuracy to decrease by about 10%, regardless of the split trained on—i.e., to achieve high accuracy on small training data, erroneous and misleading outliers must be removed—which explains the low accuracy shown on the left in the graphs of Figures 6b and 6c.
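Experiment 1's data selection is a simple windowing over the adv ordering; a minimal sketch (assuming adv_scores assigns smaller values to outliers, and NumPy-indexable arrays):

import numpy as np

def percentile_window(train_x, train_y, adv_scores, start_percentile, window=5000):
    """Select `window` consecutive training examples starting at the
    given percentile of the adv ranking (0 = most outlier)."""
    order = np.argsort(adv_scores)                  # outliers first
    k = int(start_percentile / 100.0 * len(order))
    idx = order[k : k + window]
    return train_x[idx], train_y[idx]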
Experiment 2.

For our second experiment, we ask: is it better to train on the k most or k least well-represented examples? That is, the prior experiment assumed the amount of data is fixed, and we must choose which percentile of data to use. Now, we examine the best strategy if we must choose either a prefix or a suffix of the training data as ordered by our adv metric.

The results are given in Figure 7. Again, we find the answer depends on the dataset. On MNIST, training on the k least represented examples is always better, for any k, than training on the k most represented examples. However, on Fashion-MNIST and CIFAR-10, training on the well-represented examples is better when k is small; but once we begin to collect more than a moderate number of examples, training on the outliers begins to give more accurate models. However, we find that training only on the most well-represented examples found in the training data gives extremely high test accuracy on the well-represented examples found in the test data.

This evidence supports our hypothesis that training on difficult, but not impossibly difficult, training data is of most value. The harder the task, the more useful well-represented training examples are. Also shown in Figure 7 is the final test accuracy of a model when evaluated only on the well-represented test examples. Here, we find that the test accuracy is substantially higher.

While training exclusively on the well-represented examples often gives inferior accuracy compared to training on the outliers, the former has the benefit of obtaining models with simpler decision boundaries. Thus, it is natural to ask whether such models are also more adversarially robust. To test this, we train models on slices of 5,000 training points, drawn either from the well-represented training examples or from those that are not. For each model, we then compute the mean ℓ∞ adversarial distance needed to find adversarial examples. As shown in Figure 11 in the Appendix, the Fashion-MNIST and CIFAR-10 models that are trained on well-represented examples are more robust to adversarial examples than those trained on a slice of training data that is mostly made up of outliers. Moreover, these models, trained on a slice of just 5,000 well-represented examples, remain comparably robust to a baseline model trained on the entire dataset.
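This robustness comparison reduces to averaging per-example adversarial distances; a minimal sketch, reusing the (assumed) linf_adversarial_distance helper from the earlier adv-metric sketch:

import numpy as np

def mean_linf_robustness(model, test_loader, eps_grid):
    """Average the minimal l-inf adversarial distance over a test set
    (batches of size 1 for simplicity); examples that resist every
    budget in eps_grid are skipped."""
    distances = [linf_adversarial_distance(model, x, y, eps_grid)
                 for x, y in test_loader]
    finite = [d for d in distances if np.isfinite(d)]
    return float(np.mean(finite))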
At least since the work of Zhang (1992), which was based on intra- and inter-concept similarity, prototypes have been examined using several metrics derived from the intuitive notion that one could find "quintessential observations that best represent clusters in a dataset" (Kim et al., 2014). Several more formal variants of this definition were proposed in the literature—along with corresponding techniques for finding prototypes. Kim et al. (2016) select prototypes according to their maximum mean discrepancy with the data, which assumes the existence of an appropriate kernel for the data of interest. Li et al. (2017) circumvent this limitation by prepending classifiers with an autoencoder projecting the input data onto a manifold of reduced dimensionality. A prototype layer, which serves as the classifier's input, is then trained to minimize the distance between inputs and a set of prototypes on this manifold. While this method improves interpretability by ensuring that prototypes are central to the classifier's logic, it does require that one modify the model's architecture. Instead, the metrics considered in our manuscript all operate on existing architectures. Stock & Cisse (2017) proposed to use distance to the boundary—approximately measured with an adversarial-example algorithm—as a proxy for prototypicality.
Other interpretability approaches.
Prototypes enable interpretability because they provide a subset of examples that summarize the original dataset and best explain a particular decision made at test time (Bien & Tibshirani, 2011). Other approaches, like saliency maps, instead synthesize new inputs to visualize what a neural network has learned. This is typically done by gradient descent with respect to the input space (Zeiler & Fergus, 2014; Simonyan et al., 2013). Because they rely on model gradients, saliency maps can be fragile and only locally applicable (Fong & Vedaldi, 2017).

Beyond interpretability, prototypes are also motivated by additional use cases, some of which we discussed in Section 4. Next, we review related work in two of these applications: namely, curriculum learning and reducing sample complexity.
Curriculum learning.
Based on the observation that the order in which training data is presented to the model can improve the performance (e.g., convergence) of optimization during learning and circumvent limitations of the dataset (e.g., data imbalance or noisy labels), curriculum learning seeks to find the best order in which to analyze training data (Bengio et al., 2009). This first effort further hypothesizes that easy-to-classify samples should be presented early in training, while complex samples are gradually inserted as learning progresses. While Bengio et al. (2009) assumed the existence of hard-coded curriculum labels in the dataset, Chin & Liang (2017) sample an order for the training set by assigning each point a sampling probability proportional to its leverage score—the distance between the point and a linear model fitted to the whole data. Instead, we use metrics that also apply to data that cannot be modeled linearly.

The curriculum may also be generated online during training, so as to take into account progress made by the learner (Kumar et al., 2010). For instance, Katharopoulos & Fleuret (2017) train an auxiliary LSTM model to predict the loss of training samples, which they use to sample a subset of training points analyzed by the learner at each training iteration. Similarly, Jiang et al. (2017) have an auxiliary model predict the curriculum. This auxiliary model is trained using the learner's current feature representation of a smaller holdout set of data for which a ground-truth curriculum is known.

However, as reported in our experiments, training on easy samples is beneficial when the dataset is noisy, whereas training on hard examples is, on the contrary, more effective when data is clean. These observations oppose self-paced learning (Kumar et al., 2010) to hard example mining (Shrivastava et al., 2016). Several strategies have been proposed to perform better in both settings. Assuming the existence of a holdout set as well, Ren et al. (2018) assign a weight to each training example that characterizes the alignment of both the logits and gradients of the learner on training and heldout data. Chang et al. (2017) propose to train on points with high prediction variance or whose average prediction is close to the decision threshold. Both the variance and average are estimated by analyzing a sliding window of the history of prediction probabilities throughout training epochs.
Sample complexity.
Prototypes of a given task share some intuition with the notion of coresets (Agarwal et al., 2005; Huggins et al., 2016; Bachem et al., 2017; Tolochinsky & Feldman, 2018) because both prototypes and coresets describe the dataset in a more compact way—by returning a (potentially weighted) subset of the original dataset. For instance, clustering algorithms may rely on either prototypes or coresets (Biehl et al., 2016) to cope with the high dimensionality of a task. However, prototypes and coresets differ in essential ways. In particular, coresets are defined according to a metric of interest (e.g., the loss that one would like to minimize during training), whereas prototypes are independent of any machine-learning aspects, as indicated in our list of desirable properties for prototypicality metrics from Section 2.

Taking a different approach, Wang et al. (2018) apply influence functions (Koh & Liang, 2017) to discard training inputs that do not affect learning. Conversely, for MNIST, we found in our experiments that removing individual training examples did not have a measurable impact on the predictions of individual test examples. Specifically, we trained many models to 100% training accuracy where we left one training example out for each model. There was no statistically significant difference between the models' predictions on each individual test example.
This paper explores metrics for gaining insight into the properties of datasets commonly used for training deep learning models. We develop five metrics and find that humans agree that the computed rankings capture the intuition behind what is meant by a good representative example of a class.

When the metrics disagree on how well-represented an example is, we can often learn something interesting about that example. This helps form an understanding of the performance of ML models that goes beyond measuring test accuracy. For instance, by identifying memorized exceptions in the test data, we may not weight mistakes that models make on these points as heavily as mistakes on canonical prototypes. Further, by identifying uncommon submodes, we can learn where collecting additional training points will be useful. We find that models trained on well-represented examples often have simpler decision boundaries and are thus slightly more adversarially robust; however, training only on the most represented examples often yields inferior accuracy compared to training on outliers.

We believe that exploring other metrics for assessing properties of datasets, and developing methods for using them during training, is an important area of future work, and we hope that our analysis will be useful towards that end goal.
References
Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. ACM, 2016.

Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.

Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. ACM, 2009.

Michael Biehl, Barbara Hammer, and Thomas Villmann. Prototype-based models in machine learning. Wiley Interdisciplinary Reviews: Cognitive Science, 7(2):92–111, 2016.

Jacob Bien and Robert Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, pp. 2403–2424, 2011.

Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer, 2013.

Ricardo J. G. B. Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 160–172. Springer, 2013.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, pp. 39–57. IEEE, 2017.

Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pp. 1002–1012, 2017.

Hui Han Chin and Paul Pu Liang. Leverage score ordering. In Advances in Neural Information Processing Systems, 2017.

Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. arXiv preprint arXiv:1704.03296, 2017.

Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pp. 4080–4088, 2016.

Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.

Angelos Katharopoulos and François Fleuret. Biased importance sampling for deep neural network training. arXiv preprint arXiv:1706.00043, 2017.

Been Kim, Cynthia Rudin, and Julie A. Shah. The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In Advances in Neural Information Processing Systems, pp. 1952–1960, 2014.

Been Kim, Rajiv Khanna, and Oluwasanmi O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pp. 2280–2288, 2016.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.

M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197, 2010.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Yann LeCun, Corinna Cortes, and C. J. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.

Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. arXiv preprint arXiv:1710.04806, 2017.

Yongshuai Liu, Jiyu Chen, and Hao Chen. Less is more: Culling the training set to improve robustness of deep neural networks. arXiv preprint arXiv:1801.02850, 2018.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.

Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769, 2016.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Pierre Stock and Moustapha Cisse. ConvNets and ImageNet beyond accuracy: Explanations, bias detection, adversarial examples and model criticism. arXiv preprint arXiv:1711.11443, 2017.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Elad Tolochinsky and Dan Feldman. Coresets for monotonic functions with applications to deep learning. arXiv preprint arXiv:1802.07382, 2018.

Tianyang Wang, Jun Huan, and Bo Li. Data dropout: Optimizing training data for convolutional neural networks. arXiv preprint arXiv:1809.00193, 2018.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.

Jianping Zhang. Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Workshop on Machine Learning, ML92, pp. 470–479, 1992. URL http://dl.acm.org/citation.cfm?id=141975.142091.

Appendix

7 Figures of outliers
The following are training examples from MNIST, Fashion-MNIST, and CIFAR-10 that are identified as the most outlier (left of the red bar) or most prototypical (right of the green bar). Images are presented in groups by class. Each row in these groups corresponds to one of the five metrics in Section 2.1.
All MNIST results were obtained with a CNN made up of two convolutional layers (each with a kernel size of 5x5 and followed by a 2x2 max-pooling layer) and a fully-connected layer of 256 units. It was trained with Adam with a decaying learning rate. When an ensemble of models was needed (e.g., for the agr metric), these were obtained by using different random initializations.

The Fashion-MNIST model architecture is identical to the one used for MNIST. It was also trained with the same optimizer and hyper-parameters.

All CIFAR results were obtained with a ResNetv2 trained with the Adam optimizer, at an initial learning rate that was decayed after 80 epochs. We adapted the following data augmentation and training script: https://raw.githubusercontent.com/keras-team/keras/master/examples/cifar10_resnet.py When an ensemble of models was needed (e.g., for the agr metric), these were obtained by using different random initializations.

7.4 ImageNet extreme outliers

The following pre-trained ImageNet models were used: DenseNet121, DenseNet169, DenseNet201, InceptionV3, InceptionResNetV2, Large NASNet, Mobile NASNet, ResNet50, VGG16, VGG19, and Xception. They are all found in the Keras library: https://keras.io/applications.
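Returning to the MNIST model described at the top of this appendix section, the following is a sketch in Keras (which this appendix uses elsewhere); the convolutional filter counts (32, 64) and the learning rate are our assumptions, since the text does not preserve them:

import tensorflow as tf

def make_mnist_cnn():
    # Two 5x5 conv layers, each followed by 2x2 max-pooling, then a
    # 256-unit fully-connected layer, as described in the text.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

model = make_mnist_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # assumed learning rate
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])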
8 Accuracy on well-represented data when training only on well-represented data

The three matrices that follow respectively report the accuracy of MNIST, Fashion-MNIST, and CIFAR-10 models learned on training examples with varying degrees of prototypicality and evaluated on test examples also with varying degrees of prototypicality. Specifically, the model used to compute cell (i, j) of a matrix is learned on training data that is ranked in the i-th percentile of adv prototypicality. The model is then evaluated on the test examples whose adv prototypicality falls under the j-th prototypicality percentile. For all datasets, these matrices show that performing well on non-outliers is possible even when the model is trained on outliers. For MNIST, this shows again that training on outliers provides better performance across the range of test data (from outliers to well-represented examples). For Fashion-MNIST and CIFAR-10, the best performance is achieved by training on examples that are neither prototypical nor outliers.
[Heatmaps of accuracy, indexed by prototypicality percentile of train data (rows) vs. prototypicality percentile of test data (columns).]

Figure 8: MNIST

Figure 9: Fashion-MNIST

Figure 10: CIFAR-10
9 Human Study Example
We presented Mechanical Turk taskers with the following webpage, asking them to select the worst image of the nine in the grid.
10 Adversarial robustness of models trained on only well-represented examples
[Plots of ℓ∞ robustness vs. prototypicality percentile for (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10.]

Figure 11: The blue curves indicate the ℓ∞ adversarial robustness of models trained on slices of 5,000 examples selected according to their prototypicality—as reported on the x-axis. A baseline, obtained by training the model on the entire dataset, is indicated by the dotted orange line. On Fashion-MNIST and CIFAR-10, models are substantially more robust to adversarial examples when trained on slices of 5,000 prototypical examples as opposed to slices of 5,000 outlier examples. On MNIST there is no significant difference; almost all examples are good.
11 Revealing and clustering interesting examples and submodes
[Plots: for each metric (adv, conf, agr, priv, ret), the distribution of class labels among the examples in the bottom 0–40% of the prototypicality ranking, shown for MNIST (classes 0–9), Fashion-MNIST (classes T-shirt/top through Ankle boot), and CIFAR-10 (classes Airplane through Truck).]
Below are all the memorized exceptions, as defined in the body of the paper, for all Fashion-MNIST output classes:

• T-shirt/top
• Trouser
• Pullover
• Dress
• Coat
• Sandal
• Shirt
• Sneaker
• Bag
• Ankle boot

The ImageNet classes referenced, listed by class index:

Index 6 stingray; Index 7 cock; Index 8 hen; Index 32 tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui; Index 36 terrapin; Index 40 American chameleon, anole, Anolis carolinensis; Index 46 green lizard, Lacerta viridis; Index 66 horned viper, cerastes, sand viper, horned asp, Cerastes cornutus; Index 68 sidewinder, horned rattlesnake, Crotalus cerastes; Index 72 black and gold garden spider, Argiope aurantia; Index 101 tusker; Index 103 platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus; Index 122 American lobster, Northern lobster, Maine lobster, Homarus americanus; Index 124 crayfish, crawfish, crawdad, crawdaddy; Index 126 isopod; Index 161 basset, basset hound; Index 166 Walker hound, Walker foxhound; Index 167 English foxhound; Index 170 Irish wolfhound; Index 179 Staffordshire bullterrier, Staffordshire bull terrier; Index 180 American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier; Index 196 miniature schnauzer; Index 197 giant schnauzer; Index 198 standard schnauzer; Index 206 curly-coated retriever; Index 214 Gordon setter; Index 223 schipperke; Index 238 Greater Swiss Mountain dog; Index 248 Eskimo dog, husky; Index 250 Siberian husky; Index 264 Cardigan, Cardigan Welsh corgi; Index 265 toy poodle; Index 266 miniature poodle; Index 270 white wolf, Arctic wolf, Canis lupus tundrarum; Index 278 kit fox, Vulpes macrotis; Index 290 jaguar, panther, Panthera onca, Felis onca; Index 304 leaf beetle, chrysomelid; Index 311 grasshopper, hopper; Index 312 cricket; Index 319 dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk; Index 320 damselfly; Index 334 porcupine, hedgehog; Index 341 hog, pig, grunter, squealer, Sus scrofa; Index 342 wild boar, boar, Sus scrofa; Index 345 ox; Index 348 ram, tup; Index 349 bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis; Index 359 black-footed ferret, ferret, Mustela nigripes; Index 380 titi, titi monkey; Index 383 Madagascar cat, ring-tailed lemur, Lemur catta; Index 386 African elephant, Loxodonta africana; Index 390 eel; Index 399 abaya; Index 400 academic gown, academic robe, judge's robe; Index 409 analog clock; Index 413 assault rifle, assault gun; Index 417 balloon; Index 419 Band Aid; Index 423 barber chair; Index 424 barbershop; Index 429 baseball; Index 434 bath towel; Index 435 bathtub, bathing tub, bath, tub; Index 440 beer bottle; Index 461 breastplate, aegis, egis; Index 465 bulletproof vest; Index 479 car wheel; Index 484 catamaran; Index 505 coffeepot; Index 516 cradle; Index 524 cuirass; Index 538 dome; Index 541 drum, membranophone, tympan; Index 550 espresso maker; Index 579 grand piano, grand; Index 583 guillotine; Index 591 handkerchief, hankie, hanky, hankey; Index 595 harvester, reaper; Index 604 hourglass; Index 619 lampshade, lamp shade; Index 620 laptop, laptop computer; Index 636 mailbag, postbag; Index 638 maillot; Index 639 maillot, tank suit; Index 643 mask; Index 647 measuring cup; Index 657 missile; Index 665 moped; Index 667 mortarboard;
Index 668 mosque; Index 670 motor scooter, scooter; Index 678 neck brace; Index 681 notebook, notebook computer; Index 700 paper towel; Index 739 potter's wheel; Index 744 projectile, missile; Index 748 purse; Index 764 rifle; Index 804 soap dispenser; Index 808 sombrero; Index 810 space bar; Index 817 sports car, sport car; Index 827 stove; Index 830 stretcher; Index 836 sunglass; Index 837 sunglasses, dark glasses, shades; Index 841 sweatshirt; Index 842 swimming trunks, bathing trunks; Index 846 table lamp; Index 847 tank, army tank, armored combat vehicle, armoured combat vehicle; Index 876 tub, vat; Index 878 typewriter keyboard; Index 892 wall clock; Index 903 wig; Index 907 wine bottle; Index 914 yawl; Index 925 consomme; Index 928 ice cream, icecream; Index 939 zucchini, courgette; Index 954 banana; Index 960 chocolate sauce, chocolate syrup; Index 961 dough; Index 962 meat loaf, meatloaf; Index 966 red wine; Index 981 ballplayer, baseball player; Index 987 corn