Model Rectification via Unknown Unknowns Extraction from Deployment Samples
Bruno Abrahao∗, Zheng Wang∗, Haider Ahmed, Yuchen Zhu
New York University Shanghai and New York University

∗ Equal contribution and listed in alphabetical order. Correspondence to:

Abstract
Model deficiency that results from incomplete training data is a form of structural blindness that leads to costly errors, oftentimes with high confidence. During the training of classification tasks, underrepresentation of class-conditional distributions that a given hypothesis space can recognize results in a mismatch between the model and the target space. To mitigate the consequences of this discrepancy, we propose Random Test Sampling and Cross-Validation (RTSCV), a general algorithmic framework that performs post-training model rectification at deployment time in a supervised way. RTSCV extracts unknown unknowns (u.u.s), i.e., examples from the class-conditional distributions that a classifier is oblivious to, and works in combination with a diverse family of modern prediction models. RTSCV augments the training set with a sample of the test set (or deployment data) and uses this redefined class layout to discover u.u.s via cross-validation, without relying on active learning or budgeted queries to an oracle. We contribute a theoretical analysis that establishes performance guarantees based on the design bases of modern classifiers. Our experimental evaluation demonstrates RTSCV's effectiveness on 7 benchmark tabular and computer vision datasets, where it reduces performance gaps as large as 41% relative to the respective pre-rectification models. Last, we show that RTSCV consistently outperforms state-of-the-art approaches.
1 Introduction

Data quality constitutes a critical factor affecting the performance of prediction models. In particular, incomplete training data frequently results in structural mismatches between data-driven trained models and the respective target space in which they are supposed to be deployed, which makes most classifiers susceptible to systematic errors, given their limited ability to rectify a model post-training.

In scenarios of increasing dependence on algorithmic decisions in high-stakes situations, deficient models result in costly (sometimes fatal) errors, unfairness, and other problems. For example, an autopilot system may fail to recognize peculiar traffic signs it has never encountered during training, leading to accidents. In the case of automated recruiting, data from industries dominated by a given gender may result in biased classifiers, likely to reject examples of the opposite gender due to the lack of enough successful observations that belong to that class. In addition, unseen joint distributions of features and "data drift" may contribute to high-confidence errors. For instance, when training a classifier to distinguish between white dogs and black cats, when presented with a black dog at deployment time, the model may predict "cat" with high confidence [Lakkaraju et al., 2017]. Under data drift, the structure of the target space may change over time and deviate from the trained model. Take, for example, the anecdotal account from the beginning of the COVID-19 pandemic, where physicians attempted to identify what type of "bacteria" had been causing an unusually high number of "pneumonia" cases, overlooking the fact that there was a new type of agent, i.e., a novel virus affecting the respiratory system.

We focus on classification tasks where a trained model may be oblivious to some of the domain-specific class-conditional distributions that a set of hypotheses can recognize. Data examples from these "invisible" joint distributions of features, i.e., "hidden classes," form the unknown unknowns (u.u.s), which cause a prediction model to commit errors, often with high confidence.
To mitigate the consequences of this mismatch, we propose Random Test Sampling and Cross-Validation (RTSCV), a general algorithmic framework that aims to perform a post-training model rectification of a base classifier at deployment time in a supervised way. RTSCV aims to reduce the structural mismatch between a trained model and the target space by extracting u.u.s from samples of the target space. RTSCV augments the training set with a sample of the test set (or deployment data) and uses this redefined class layout to discover unknown unknowns via cross-validation. Our key insight is that by augmenting a training set that possesses m classes with a dummy class, labeled m + 1, whose examples come from a test set sample, cross-validation is likely to decouple examples that belong to known classes from m + 1, due to the high variance and broad boundary of this dummy class. Conversely, u.u.s coming from separable classes may share greater affinity with the dummy class, as they are expected to be poor fits to the known classes, and because the decision boundary around the dummy class may have been established with the contribution of examples from the u.u.s in the test sample.

RTSCV bears two advantages compared to previous methods. First, RTSCV can work in combination with a diverse family of modern classifiers. The bulk of existing methods for identifying u.u.s focus on modifying specific classification methods, such as SVM, KNN, and DNNs, in such a way as to include a free parameter that can be learned at the deployment phase. This allows the method to predict u.u.s as possible outputs [Scheirer et al., 2013, 2014, Júnior et al., 2016, Bendale and Boult]. However, unlike RTSCV, these methods do not generalize, as they are classifier-specific. We note that RTSCV can be easily paired with any trained classifier. Second, RTSCV relies solely on the use of a sample of the test data, which removes assumptions made by approaches like active learning that are often challenging to operationalize in practice, such as the existence of budgeted queries to an oracle [Vandenhof and Law, 2019, Lakkaraju et al., 2017, Simard et al., 2017].

We contribute a theoretical analysis with performance guarantees based on the design bases of modern classifiers, including Maximum Likelihood Estimation, the Bayes classifier, and Minimum Mahalanobis Distance. Through an extensive experimental evaluation, we use 7 benchmark tabular and more challenging computer vision datasets, such as CIFAR-10, CIFAR-100, and SVHN, with ResNet and DenseNet as base models. Our results suggest that RTSCV is a promising direction for post-training rectification of a base classifier, reducing a performance gap (Accuracy, F-measure, AUROC) as large as 41%. Moreover, our results indicate that RTSCV consistently outperforms state-of-the-art approaches and baselines by a significant margin.

The conceptual idea of a "class" is often subjective, and different hypothesis spaces will separate the feature space into different class-conditional distributions. Here we employ a working definition of u.u.s classes via a geometric argument. That is, given a fixed hypothesis space, the u.u.s form separable clusters in the feature space that are distinguishable from the known structures in the target space. This definition is without loss of generality, as it allows for any abstraction of conceptual blindness to examples. We note that u.u.s are not the only sources of prediction errors. To delineate the aims of RTSCV, we now discuss different types of errors and the scope in which RTSCV operates.
Let $f$ be a classifier and consider the hypothesis space produced by this model, i.e., the set of all functions that can be returned by it. We assume that the model is consistent; that is, if there is a function in the hypothesis space, the machine is going to produce that function from training. Further, let $E$ be the Bayes error, or the irreducible error. If $E(\mathcal{H})$ is the lowest error we could produce with hypothesis space $\mathcal{H}$, and $E(\mathcal{H}, \mathcal{D})$ is the minimum error we produce with $\mathcal{H}$ and available training data $\mathcal{D}$, then $E(\mathcal{H}, \mathcal{D}) - E$ represents the overall generalization error given $\mathcal{H}$ and $\mathcal{D}$, which can be decomposed as

$$E(\mathcal{H}, \mathcal{D}) - E = \left[E(\mathcal{H}, \mathcal{D}) - E(\mathcal{H})\right] + \left[E(\mathcal{H}) - E\right].$$

We call the first difference the estimation error, and the second difference the approximation error. That is, the model may produce errors due either to deficiencies of the model (approximation error) or to the training data (estimation error), such as u.u.s. In this paper, we focus on the latter, i.e., reducing the estimation error under the assumption of a fixed hypothesis space. We also assume that the data are free of mislabeling errors.

We emphasize the distinction between u.u.s detection and outlier detection. Outliers are rare extreme values, produced by the realization of (possibly known) class-conditional distributions. As outliers tend to be isolated from any cluster in the feature space, they do not form the separable class structures that characterize u.u.s.¹

¹For reproducibility, our code will be made publicly available and we will replace this footnote with the GitHub link after the anonymous reviewing phase.
2 Related Work

Early work focused on extending classical machine learning algorithms to enable u.u.s prediction. Prominent examples are SVM-based methods, such as [Scheirer et al., 2013], which proposed the 1-vs-Set machine that separates the feature space with an additional hyperplane parallel to the hyperplane obtained from the SVM. It then optimizes the open space risk for this linear kernel slab model. To further reduce the open set risk, [Scheirer et al., 2014] proposed the W-SVM to incorporate non-linear kernels under a compact abating probability (CAP) model. Another similar approach is the $P_I$-SVM by [Jain et al.]. Besides modifications to SVM, [Júnior et al., 2016] introduced the OSR version of the Nearest Neighbor classifier (OSNN), based on a threshold method that relies on measurements of the distance of a u.u.s sample from the known space.

To address large and high-dimensional datasets, recent approaches proposed to modify Deep Neural Networks (DNNs). A baseline was proposed by [Hendrycks and Gimpel, 2017], formalizing the observation that softmax predictions may assign high confidence to erroneously classified out-of-distribution samples (u.u.s). [Liang et al., 2018] designed the ODIN detector, which better differentiates the confidence scores between in-distribution and out-of-distribution samples in the target space by combining temperature scaling and input perturbation. Using probabilistic modeling, specifically Gaussian discriminant analysis (GDA), [Lee et al., 2018] modeled the softmax outputs of known classes as class-conditional Gaussian distributions, and then used the closest Mahalanobis distance (MD) of each test sample to these Gaussian distributions as the confidence score. As a modification to this approach, [Lee et al., 2020] replaced the MD confidence score with the class-conditional log-likelihood. For DNNs, researchers have also focused on changing the network architecture to adapt to u.u.s detection, as in OpenMax [Bendale and Boult], CROSR [Yoshihashi et al.], and C2AE [Oza and Patel]. The surveys by [Geng et al., 2020] and [Boult et al., 2019] provide a comprehensive review of these methods. All of the preceding approaches modify an existing classifier in a model-specific way. In contrast, RTSCV is a general algorithmic framework capable of working with any classifier.

A different line of research addressed u.u.s in an incremental or active learning manner. [Rudd et al., 2018] formulated a theoretically sound classifier, the Extreme Value Machine (EVM), grounded on Extreme Value Theory, which is able to perform nonlinear, kernel-free, variable-bandwidth incremental learning. [Vandenhof and Law, 2019] and [Lakkaraju et al., 2017] both proposed hybrid frameworks combining human crowdsourcing and algorithmic methods, in which some priors of the u.u.s are extracted by experts whose feedback guides adjustments of the trained model. By adopting an active learning environment, these approaches can potentially cope with a dynamic feature space. Nevertheless, requiring the presence of an oracle is oftentimes unrealistic, as it may be expensive in time and human labor, and therefore non-scalable for many real-world applications. Our proposal offsets these shortcomings by relying only on the analysis of a data sample at deployment time.

The idea of using cross-validation to rectify incorrect data was previously explored to identify mislabeled training data [Brodley and Friedl, 1999].
In their noise-reduction approach, cross-validation is performed over the training set, and mislabeled data are those given different "pseudo-labels" from their original labels after the cross-validating phase. While their work focuses on identifying mislabeled data, RTSCV aims to augment a model with new labels for the data examples generated by class-conditional distributions that were not contemplated at training time, which allows for the correct classification of these examples.

3 The RTSCV Framework

We set our scope on multi-class classification tasks. Let $f$ be the input base classifier, which we will treat as a black box. Let $X = \{X_1, X_2, \cdots, X_m\}$ be the training set with labels $\{1, 2, \ldots, m\}$. Let $Y$ be the test set (as a representative of the target space). To detect the u.u.s and rectify a trained model, we present RTSCV in Algorithm 1. In summary, we first randomly sample the test set $Y$ to obtain a sample $X_s$ of cardinality $c \cdot |Y|$, for a given fraction $c$, from which we create a new dummy class and assign the label $m + 1$. Note that $X_s$ may contain examples from both the known and (potentially multiple) u.u.s classes. We then augment the original training set with the examples from $X_s$, resulting in a new intermediate training set $\widetilde{X}$, to which we apply cross-validation in combination with a base classifier $f$.

Algorithm 1
Random Test Sampling and Cross-Validation
Input: training set $X$ with labels $\{1, 2, \ldots, m\}$, test set $Y$, sample rate $c$, base classifier $f$, number of cross-validation folds $k$

1. Randomly sample the test set $Y$ to obtain a subset $X_s$ such that $|X_s| = c \cdot |Y|$
2. Assign label $m + 1$ to $X_s$
3. Obtain an augmented training set $\widetilde{X} \leftarrow X \cup X_s$ with labels $\{1, 2, \ldots, m + 1\}$
4. Run a $k$-fold cross-validation on $\widetilde{X}$
5. Let $X_u$ be the set of samples with predicted label $m + 1$ during cross-validation
6. Label the samples in $X_u$ with label $m + 1$
7. Obtain the rectified training set $X \leftarrow X \cup X_u$ with labels $\{1, 2, \ldots, m + 1\}$
8. Train $\hat{f}$ on $X$ and return $\hat{f}$
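To make Algorithm 1 concrete, here is a minimal sketch in Python, assuming scikit-learn-style estimators and numpy arrays; `rtscv` and its argument names are ours, not from the paper's released code.

```python
# Minimal sketch of Algorithm 1 (RTSCV), assuming scikit-learn-style estimators
# and numpy arrays, with known labels {0, ..., m-1} and dummy label m.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict

def rtscv(base_clf, X_train, y_train, X_test, c=0.1, k=3, seed=0):
    rng = np.random.default_rng(seed)
    m = len(np.unique(y_train))                       # number of known classes

    # Steps 1-2: sample a fraction c of the test set and give it the dummy label m.
    idx = rng.choice(len(X_test), size=int(c * len(X_test)), replace=False)
    X_s = X_test[idx]

    # Step 3: augmented training set containing the dummy class.
    X_aug = np.vstack([X_train, X_s])
    y_aug = np.concatenate([y_train, np.full(len(X_s), m)])

    # Steps 4-5: k-fold cross-validation; samples predicted as the dummy class
    # form X_u (in practice these are almost exclusively points from X_s).
    y_cv = cross_val_predict(clone(base_clf), X_aug, y_aug, cv=k)
    X_u = X_aug[y_cv == m]

    # Steps 6-8: rectified training set and final classifier.
    X_rect = np.vstack([X_train, X_u])
    y_rect = np.concatenate([y_train, np.full(len(X_u), m)])
    return clone(base_clf).fit(X_rect, y_rect), X_u
```

Swapping the base model only changes the `base_clf` argument, which reflects the classifier-agnostic design discussed above.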
Figure 1: The influence of the sample-training ratio on RTSCV's performance: (a) F-measure on COIL-20; (b) accuracy on CIFAR10-ImageNet. Multiple results were recorded to calculate the 95% confidence interval. Plot (a) corresponds to openness 9.3%. For (b), the accuracy on the y-axis stands for the combined, overall classification accuracy.

Intuitively, RTSCV relies on the correct re-classification of the samples in $X_s$ during cross-validation, regarding whether they belong to a known or u.u.s class. Samples from the test set make up a high-variance class $X_s$, whose boundary encompasses all other classes (i.e., it may contain examples from any class). In light of this, examples that belong to known classes are likely placed in the correct classes due to the conciseness and specificity of their representation. On the contrary, u.u.s are classified as members of $m + 1$ due to their dissimilarity with all other classes and their affinity with some of the examples that contributed to the position of the decision boundary around class $m + 1$. The examples classified during cross-validation with label $m + 1$ make up a new set $X_u$ (the u.u.s class), which we adjoin to the original training set $X$ to form the rectified model.

The sample rate $c$ is a critical hyperparameter. A very small sample rate may not yield enough representatives of the u.u.s due to a small test set sample, whereas a large $c$ may lead to a sample class $X_s$ that over-represents the structure of the known classes, thereby causing the cross-validation to assign examples of known classes to $X_u$. In Figure 1 we illustrate the classifier's performance versus the sample-training ratio, i.e., $c$ as a function of the training set size, for two of the benchmark datasets we used in our experimental evaluation. We discuss the search for the optimal $c$ in Section 4.

The number of cross-validation folds $k$ determines the running time of RTSCV, whose time complexity is roughly $O(k \cdot T_f)$, where $T_f$ is the running time of the input base classifier $f$ without RTSCV. Figure 2 (a) displays the model performance against $k$ for several datasets we used in the evaluation of RTSCV. Note that RTSCV effectively rectifies a trained model even if we use the more computationally economical "holdout validation" approach.
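As a concrete illustration of this tuning step, the following sketch (ours) scores candidate values of $c$ by how many held-out known-class examples the rectified model sends to the dummy class, the selection criterion described in Section 4; it reuses the `rtscv` helper sketched after Algorithm 1.

```python
# Sketch: select the sample rate c by the misclassification of known data,
# i.e., leakage of held-out known-class examples into the dummy class.
import numpy as np
from sklearn.model_selection import train_test_split

def select_sample_rate(base_clf, X_train, y_train, X_test,
                       candidates=(0.02, 0.06, 0.10, 0.14), k=3):
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)
    m = len(np.unique(y_train))
    best_c, best_leak = None, float("inf")
    for c in candidates:
        rectified, _ = rtscv(base_clf, X_fit, y_fit, X_test, c=c, k=k)
        leak = np.mean(rectified.predict(X_val) == m)  # known data trapped as u.u.s
        if leak < best_leak:
            best_c, best_leak = c, leak
    return best_c
```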
We contribute a theoretical analysis that justifies the correctness of RTSCV and establishes performance guarantees as a function of class separability and the test sample size. As RTSCV is a general framework that may be combined with any base classifier, each employing disparate approaches to establish decision boundaries, there are major challenges in establishing a concise set of mathematical tools that would cover the basis of each specific approach. In light of this, we analyze its behavior through the lens of objectives that modern classifiers aim to optimize to find the sufficient conditions for the correct relabeling of the test set sample $X_s$, namely Maximum Likelihood Estimation (MLE), the Bayes classifier (BC), and Minimum Mahalanobis Distance (MMD). These objectives are shared by many classification models, such as rule-based, margin-based, etc.

Figure 2: (a) The influence of the number of cross-validation folds on the RTSCV framework, evaluated on tabular datasets (Letter, Pendigits, COIL-20, MNIST). (b) The influence of known-class separability on the RTSCV framework: we vary the between-class distances and the covariances of the known classes to generate different J scores. (c) The influence of u.u.s class separability on the RTSCV framework: we vary the covariance and the distance of the u.u.s class to the known classes. In (b) and (c), curves correspond to $\operatorname{tr}(S_w) \in \{4, 8, 16, 32\}$.

We structure the following theorems based on two classification cases of a data point $x \in X_s$ under RTSCV: (1) the true label of $x$ is one of the known classes, where the correct decision is to assign the true known label during cross-validation, and (2) $x$ belongs to the u.u.s class, and the process should keep it in $X_s$. We aim to establish the correctness of RTSCV by showing that this correct decision is the one that optimizes the MLE, the BC, and the MMD.

We model all the known and u.u.s classes as non-identical multivariate Gaussian distributions. Specifically, let $X_1, X_2, \ldots, X_m$ be the known classes with distinct means $\mu_i \in \mathbb{R}^d$ and diagonal covariance matrices $\Sigma_i \in \mathbb{R}^{d \times d}$, and let $X_u$ be the u.u.s class with mean $\mu_u \in \mathbb{R}^d$ and diagonal covariance matrix $\Sigma_u \in \mathbb{R}^{d \times d}$. Furthermore, we assume all the distributions are homoscedastic, so $\Sigma_i = \sigma_i^2 I$ for $i \in \{1, 2, \ldots, m\}$ and $\Sigma_u = \sigma_u^2 I$, for some $\sigma_i^2 \in \mathbb{R}^+$, $\sigma_u^2 \in \mathbb{R}^+$. In this way, since the sample class $X_s$ is obtained by randomly sampling the test set, we can model $X_s$ as a Gaussian mixture of $X_1, X_2, \ldots, X_m, X_u$ weighted by their respective percentages in the test set, denoted $P(X_i)$, $i \in \{1, 2, \ldots, m, u\}$.

Under the preceding assumptions, via Gaussian discriminant analysis, we first focus on MLE to explore how the total likelihood of the dataset changes under different labeling schemes of the sample class $X_s$. For a test set sample $x \in X_s$, its likelihood of being in one of the $X_i$ ($i \in \{1, 2, \ldots, m, u\}$) is:

$$L_i(x) = \frac{1}{(2\pi)^{d/2} \sqrt{|\Sigma_i|}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right) \tag{1}$$

where $|\Sigma_i|$ denotes the determinant of $\Sigma_i$. Its likelihood of being in the sample class $X_s$ is

$$L_s(x) = P(X_u) L_u(x) + \sum_{k=1}^{m} P(X_k) L_k(x) \tag{2}$$

as $X_s$ is a Gaussian mixture. We can now find sufficient conditions under which the correct classification of $x$ increases the total likelihood.

Theorem 3.1 (Maximum Likelihood Estimation). For $x \in X_s$, if $x$ is a sample of some known class, i.e., $x \sim X_k$ for some $k \in \{1, 2, \ldots, m\}$, for $x$ to be correctly classified as belonging to $X_k$ based on MLE, we require it to have a higher class-conditional likelihood for class $X_k$ than for $X_s$. We have $\mathbb{E}_{x \sim X_k}(L_k(x)) \ge \mathbb{E}_{x \sim X_k}(L_s(x))$ given that, for all $i \in \{1, \cdots, m, u\}$:

$$\|\mu_k - \mu_i\|^2 \ge d \cdot (\sigma_k^2 + \sigma_i^2) \cdot \ln\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)$$

Similarly, if $x \sim X_u$, then we have $\mathbb{E}_{x \sim X_u}(L_s(x)) \ge \mathbb{E}_{x \sim X_u}(L_k(x))$ given that

$$\frac{\|\mu_k - \mu_u\|^2}{\sigma_k^2 + \sigma_u^2} \ge 2\ln\left(\frac{1 - P(X_k)}{P(X_u)}\right) + d \cdot \ln\left(\frac{2\sigma_u^2}{\sigma_k^2 + \sigma_u^2}\right)$$
Proof. All proofs are in the supplementary materials.

Theorem 3.1 characterizes sufficient conditions for the re-classification of $X_s$ to maximize the total likelihood. In summary, it says that $X_s$ will be correctly classified based on MLE once the class separability is above a threshold characterized by the squared difference of the means and the squared within-class variances.

We now turn our attention to the Bayes classifier, which determines the membership of $x \in X_s$ by considering its posterior probability [Murty and Devi, 2011]. By Bayes' theorem, the posterior for $x$ to be in class $X_i$ is given by $p(X_i \mid x) = L_i(x) P'(X_i) / p(x)$. In the scenario we consider, $P'(X_i)$ stands for the prior of $X_i$ in the training set: $P'(X_i) = |X_i| / |X_s \cup \bigcup_{k=1}^{m} X_k|$, $i \in \{1, 2, \ldots, m, s\}$. In this way, one can calculate the Bayesian decision boundary of $x$ between $X_s$ and $X_k$ for some $k \in \{1, 2, \ldots, m\}$.

Theorem 3.2 (Bayesian Classification). For $x \in X_s$, if $x \sim X_k$, we have $\mathbb{E}_{x \sim X_k}(p(X_k \mid x)) \ge \mathbb{E}_{x \sim X_k}(p(X_s \mid x))$ given that, for all $i \in \{1, \cdots, m, u\}$:

$$\frac{\|\mu_k - \mu_i\|^2}{\sigma_k^2 + \sigma_i^2} \ge 2\ln\left(\frac{P'(X_s)}{P'(X_k)}\right) + d \cdot \ln\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)$$

If $x \sim X_u$, then we have $\mathbb{E}_{x \sim X_u}(p(X_s \mid x)) \ge \mathbb{E}_{x \sim X_u}(p(X_k \mid x))$ given that

$$\frac{\|\mu_k - \mu_u\|^2}{\sigma_k^2 + \sigma_u^2} \ge 2\ln\left(\frac{P'(X_k) - P'(X_s) P(X_k)}{P'(X_s) P(X_u)}\right) + d \cdot \ln\left(\frac{2\sigma_u^2}{\sigma_k^2 + \sigma_u^2}\right)$$

Theorem 3.2 (similarly to Theorem 3.1) establishes a performance guarantee on the correct behavior of the framework. That is, when the class separability between the u.u.s class and the known classes is greater than a given threshold, controlled by the sample size, the prior $P'(X_s)$, and the dimensionality of the data, the re-classification of $X_s$ will be as we expect with high probability.

Last, the Minimum Mahalanobis Distance (MMD) mimics the goal shared by classifiers of placing a data point in the class whose joint probability distribution of features is the closest to that of the data point. We show that the correct decision by RTSCV is the one that minimizes the MD. For a test set sample $x \in X_s$, its squared MD to some class $X_i$ is given by

$$D_M(x, X_i) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \tag{3}$$

The analogous behavior of a classifier that employs MMD as a metric would assign $x$ to the class that has the closest MD to $x$.

Theorem 3.3 (Minimum Mahalanobis Distance). Assume $\Sigma_s = \sigma_s^2 I$.² For $x \in X_s$, if $x \sim X_k$ for some $k \in \{1, 2, \ldots, m\}$, we have $\mathbb{E}_{x \sim X_k}(D_M(x, X_k)) \le \mathbb{E}_{x \sim X_k}(D_M(x, X_s))$ given that

$$\|\mu_k - \mu_s\|^2 \ge d \cdot \sigma_s^2 \cdot \left(1 - \frac{\sigma_k^2}{\sigma_s^2}\right)$$

If $x \sim X_u$, then we have $\mathbb{E}_{x \sim X_u}(D_M(x, X_s)) \le \mathbb{E}_{x \sim X_u}(D_M(x, X_k))$ given that

$$\frac{\|\mu_u - \mu_k\|^2}{\sigma_k^2} - \frac{\|\mu_u - \mu_s\|^2}{\sigma_s^2} \ge d \cdot \left(\frac{\sigma_u^2}{\sigma_s^2} - \frac{\sigma_u^2}{\sigma_k^2}\right)$$

In alignment with Theorems 3.1 and 3.2, here we establish a requirement of class separability as the sufficient condition for the expected behavior of RTSCV.

²Note that we assume $\Sigma_s$ to be diagonal to avoid long derivations resulting from the Gaussian mixture model.
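As a quick worked instance of Theorem 3.1's first condition (our illustration, not from the paper), take equal variances $\sigma_k^2 = \sigma_i^2 = \sigma^2$:

```latex
% Theorem 3.1, first condition, specialized to equal variances:
\|\mu_k - \mu_i\|^2
  \;\ge\; d\,(\sigma_k^2 + \sigma_i^2)\,
          \ln\!\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)
  \;=\; 2d\,\sigma^2 \ln(1) \;=\; 0
```

so the condition holds vacuously for equal-variance classes, and only becomes binding when $\sigma_k^2 > \sigma_i^2$, i.e., when the known class is broader than the class it must be separated from.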
To illustrate the preceding theoretical analysis, here we empirically study the effect of class separability using a synthetic dataset that consists of 10 distinct known classes and one u.u.s class, all sampled from pre-fixed 2-dimensional Gaussian distributions. To measure class separability, we adapted the notion of scatter matrices [Theodoridis and Koutroumbas, 2008] to address the presence of the u.u.s class. Specifically, assume $\{X_1, X_2, \ldots, X_m\}$ are the $m$ known classes, with $\mu_i$ the mean of $X_i$ and $\Sigma_i$ the covariance matrix of $X_i$. For a u.u.s class $X_u$, its mean and covariance matrix are $\mu_u$ and $\Sigma_u$. The between-class scatter matrix $S_b$ is defined to measure the separability between different classes: $S_b = \sum_{i=1}^{m} P(X_i)(\mu_i - \mu_0)(\mu_i - \mu_0)^T$, where $P(X_i)$ is the percentage of examples in $X_i$ relative to the total number of examples in the dataset. When measuring the distance between the $m$ known classes and the u.u.s class, we set $\mu_0 = \mu_u$. Otherwise, if we want to measure the between-class distance within the known classes, we set $\mu_0 = \frac{1}{m}\sum_{i=1}^{m} \mu_i$. For the within-class scatter matrix $S_w$, when measuring the u.u.s class, we simply define $S_w = \Sigma_u$, the covariance matrix of $X_u$. For the known classes, $S_w$ is the weighted sum of all the covariance matrices of the known classes: $S_w = \sum_{i=1}^{m} P(X_i) \Sigma_i$. To combine both scatter matrices, we use the $J$ criterion [Theodoridis and Koutroumbas, 2008]:

$$J = \frac{\operatorname{trace}\{S_w + S_b\}}{\operatorname{trace}\{S_w\}} = 1 + \frac{\operatorname{trace}\{S_b\}}{\operatorname{trace}\{S_w\}} \tag{4}$$

which increases as the means of different classes spread out and the intra-class variances shrink. We evaluate RTSCV on different dataset configurations of varying class means and covariances, each corresponding to a unique $J$ score. The results are plotted in Figure 2 (b) for the known classes and Figure 2 (c) for the unknown classes, which show a positive correlation between model performance and class separability in both cases.
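A small numpy sketch of this separability measure, as we read it from Eq. (4) for the known classes; function and variable names are ours.

```python
# Sketch: J-score separability measure of Eq. (4) for the known classes.
import numpy as np

def j_score_known(class_samples):
    """class_samples: list of (n_i, d) arrays, one per known class."""
    n_total = sum(len(X_i) for X_i in class_samples)
    weights = [len(X_i) / n_total for X_i in class_samples]
    means = [X_i.mean(axis=0) for X_i in class_samples]
    mu0 = np.mean(means, axis=0)        # unweighted mean of the class means

    S_b = sum(w * np.outer(mu - mu0, mu - mu0) for w, mu in zip(weights, means))
    S_w = sum(w * np.cov(X_i, rowvar=False) for w, X_i in zip(weights, class_samples))
    return 1.0 + np.trace(S_b) / np.trace(S_w)
```

For the u.u.s variant, $\mu_0$ is replaced by $\mu_u$ and $S_w$ by $\Sigma_u$, per the definitions above.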
4 Experimental Evaluation

We evaluate RTSCV on both tabular datasets, where we pair RTSCV with classical classifiers like SVM and simple fully connected neural networks, and computer vision datasets, where we use deep convolutional neural networks, such as ResNet [He et al., 2016] and DenseNet [Huang et al., 2017].

As established by Theorem 3.2, there is a sweet spot for the sample rate that best balances the classification of $x \in X_s$ by governing the prior $P'(X_s)$. Let the sample-training ratio be the ratio of the size of the test set sample to the size of the training set. Figure 1 plots RTSCV's performance on COIL-20 and CIFAR10-ImageNet, used in Sections 4.1 and 4.2, respectively, under varying sample-training ratios. The model performance climbs steeply with the sample size before the sample starts to over-represent the known data, and then decreases slowly. We search for an optimal $c$ by assessing the misclassification of known data. The optimal sample rate is dataset-dependent and varies from 0.06 to 0.1. Note that the small cardinalities make RTSCV practical for efficient sampling during model deployment. We also searched for the optimal number of cross-validation folds $k$, from 2 to 6, for the Letter, Pendigits, COIL-20, and MNIST datasets, and selected $k = 3$.

4.1 Tabular Datasets

We selected the Letter Recognition and Pendigits datasets,³ which contain the handwriting of 26 English letters and 10 digits, respectively. Further, we also down-sample the Columbia University Image Library (COIL-20), which contains grey-scale images of 20 objects [Nene et al., 1996], following the PCA-based technique by [Geng and Chen, 2020]. We also select MNIST, which contains 10 digit classes of dimension $28 \times 28$ [LeCun et al., 2010].

On the tabular datasets except for MNIST, we use a standard SVM implementation as the base model. We first report the results of the pre-rectified model, i.e., the performance of the base classifier under the presence of u.u.s without applying RTSCV. To compare with RTSCV, we select three previously proposed methods from the literature on u.u.s discovery: (1) EVM [Rudd et al., 2018], (2) 1-vs-Set [Scheirer et al., 2013], and (3) WSVM [Scheirer et al., 2014]. In particular, 1-vs-Set and WSVM are both SVM-based algorithms. For MNIST, we use a simple MLP, a four-layer fully connected network, as the base model. Finally, to create u.u.s in the test set of the chosen datasets, we remove certain classes of data from the training set while keeping the test set unchanged, as sketched below. To be consistent with previous methods, we evaluate the F-measure of our method and other baselines against the openness [Scheirer et al., 2013].

³UCI machine learning repository: http://archive.ics.uci.edu/ml
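A minimal sketch of this protocol (ours), assuming integer labels; the choice of which classes to hold out is illustrative:

```python
# Sketch: create u.u.s by removing selected classes from the training set,
# while leaving the test set untouched.
import numpy as np

def make_open_set_split(X_train, y_train, unknown_classes):
    keep = ~np.isin(y_train, list(unknown_classes))
    return X_train[keep], y_train[keep]

# e.g., hold out 7 of the 26 Letter classes as u.u.s (illustrative choice):
# X_tr, y_tr = make_open_set_split(X_train, y_train, unknown_classes=range(19, 26))
```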
Table 1: Experimental results on tabular datasets. F-measure (%) is reported as Pre-rectified / EVM / 1-vs-Set / WSVM / RTSCV; "-" marks methods that do not apply to the base model.

Dataset (base model) | Openness (%) | Pre-rectified / EVM / 1-vs-Set / WSVM / RTSCV
Letter (SVM)    | 14.5 | 69.8 / 89.8 / 72.8 / 91.2 /
Pendigits (SVM) | 9.3  | 78.3 / 97.0 / 75.4 / 93.1 /
COIL-20 (SVM)   | 9.3  | 88.4 /      / 70.2 / 85.6 / 95.1
COIL-20 (SVM)   | 18.4 | 78.7 / 93.2 / 55.7 / 84.5 /
MNIST (MLP)     | 13.4 | 59.6 / -    / -    / -    /
Openness. The u.u.s may be divided into different classes that span different geometric regions of the feature space. The openness metric proposed by [Scheirer et al., 2013] increases with the number of u.u.s classes. A larger openness indicates a larger number of u.u.s classes relative to that of known classes in the test data:

$$\text{openness} = 1 - \sqrt{\frac{2 \times |\text{training classes}|}{|\text{test classes}| + |\text{target classes}|}} \tag{5}$$

F-measure is the harmonic mean of precision and recall. In our multi-class scenario, it is obtained by averaging the classwise F-measures, combining the classification accuracy of both the known and u.u.s classes.
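For concreteness, a one-line helper for Eq. (5); under one consistent reading (ours, not stated in the paper), keeping 19 of Letter's 26 classes while testing against all 26 gives an openness of about 14.5%, matching the Letter row of Table 1.

```python
# Sketch: openness per Eq. (5).
import math

def openness(n_train_classes, n_test_classes, n_target_classes):
    return 1 - math.sqrt(2 * n_train_classes / (n_test_classes + n_target_classes))

print(round(openness(19, 26, 26), 3))  # 0.145 -> about 14.5%
```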
We present the experimental results on all four datasets in Table 1. The F-measure of the classifier is reported as a function of openness, which we vary by controlling the number of known classes removed from the training set. RTSCV achieves a consistent performance improvement over the pre-rectified model, closing a performance gap as large as 41% in some cases. Coupled with the SVM, RTSCV beats previous u.u.s detection methods on 5 out of 6 settings, while maintaining a small disadvantage to the EVM on COIL-20 with openness 9.3%. RTSCV with an MLP on MNIST performs best at the largest openness, i.e., 42.3%, where 9 out of 10 classes are selected as u.u.s during testing. This further attests to the robustness of RTSCV under a disproportionate amount of u.u.s in the test data.
4.2 Computer Vision Datasets

We also evaluate RTSCV on more challenging pattern recognition tasks on computer vision datasets. Specifically, we select CIFAR-10 and CIFAR-100, both containing colored object images [Krizhevsky, 2009], as well as the Street View House Numbers (SVHN) dataset from the Google Street View project [Netzer et al., 2011]. All image dimensions in these datasets are $32 \times 32$.

For each of the computer vision datasets, we separately test our RTSCV framework with ResNet [He et al., 2016] and DenseNet [Huang et al., 2017], two network architectures that have achieved good performance on large benchmark datasets (see the supplementary material for details on the training configuration). For comparison, we include the results of the baseline method by [Hendrycks and Gimpel, 2017], the ODIN detector [Liang et al., 2018], and the Mahalanobis method (MD) [Lee et al., 2018], three previous approaches for detecting OOD samples (u.u.s) with neural networks. Different from Section 4.1, here we create u.u.s in the test set by introducing additional computer vision datasets to the target space. Specifically, we use the resized versions of Tiny-ImageNet [Deng et al., 2009] and LSUN [Yu et al., 2015], following the techniques in [Liang et al., 2018, Lee et al., 2018]. We present a summary of the known and u.u.s datasets we used in Table 2. We adopt the following metrics to evaluate performance on both known-class classification and u.u.s detection.
Classification accuracy is the accuracy of the known-class classification, i.e., the total number of correct predictions on the known class labels divided by the total number of known-class test samples.

Detection accuracy is the number of u.u.s in the test data that are correctly detected by the model divided by the total number of u.u.s.

AUROC depicts the relationship between the true positive rate (TPR) and the false positive rate (FPR) [Davis and Goadrich, 2006]. A higher AUROC indicates a higher probability for a positive instance to rank higher than a negative one.

Table 2: Experimental results of the RTSCV framework on computer vision datasets. Entries follow the order Baseline / ODIN / MD / RTSCV; "-" marks values not reported by a method. Bold and underlined numbers represent the best and second best results, respectively.

Dataset (model)      | u.u.s    | Classification Acc. (%) | Detection Acc. (%)   | AUROC (%)
CIFAR-10 (ResNet)    | SVHN     | 93.9 / - / 93.9 /       |                      |
CIFAR-10 (ResNet)    | ImageNet | 93.9 / - / 93.9 /       |                      |
CIFAR-10 (ResNet)    | LSUN     | 93.9 / - / 93.9 /       |                      |
CIFAR-100 (ResNet)   | SVHN     | 75.6 / - / 74.8 /       |                      |
CIFAR-100 (ResNet)   | ImageNet | 75.6 / - / 74.8 /       |                      |
CIFAR-100 (ResNet)   | LSUN     | 75.6 / - / 74.8 /       |                      |
SVHN (ResNet)        | CIFAR-10 |  / - / 95.7 / 94.5      | 90.0 / 89.4 / 96.9 / |
SVHN (ResNet)        | ImageNet |  / - / 95.7 / 94.7      | 90.4 / 89.4 /  / 98.4 | 93.5 / 92.0 /  /
SVHN (ResNet)        | LSUN     |  / - / 95.7 / 95.0      | 89.0 / 87.2 / 99.5 / |
CIFAR-10 (DenseNet)  | SVHN     | 92.9 / - / 91.7 /       |                      |
CIFAR-10 (DenseNet)  | ImageNet | 92.9 / - / 91.7 /       |                      |
CIFAR-10 (DenseNet)  | LSUN     | 92.9 / - / 91.7 /       |                      |
CIFAR-100 (DenseNet) | SVHN     | 72.3 / - / 68.2 /       |                      |
CIFAR-100 (DenseNet) | ImageNet | 72.3 / - / 68.2 /       |                      |
CIFAR-100 (DenseNet) | LSUN     | 72.3 / - / 68.2 /       |                      |
SVHN (DenseNet)      | CIFAR-10 | 95.3 / - / 95.2 /       |                      |
SVHN (DenseNet)      | ImageNet | 95.3 / - / 95.2 /       |  /  /  / 97.4        | 94.8 / 95.1 /  /
SVHN (DenseNet)      | LSUN     | 95.3 / - / 95.2 /       |                      |
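A short sketch (ours, not the paper's evaluation code) of how these three metrics can be computed from predictions, assuming the u.u.s class carries label `m` and scikit-learn for the AUROC:

```python
# Sketch: known-class classification accuracy, u.u.s detection accuracy, AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, uus_score, m):
    """y_true, y_pred: labels in {0, ..., m}, with m the u.u.s class.
    uus_score: higher means "more likely u.u." (e.g., dummy-class probability)."""
    known = y_true != m
    cls_acc = np.mean(y_pred[known] == y_true[known])       # classification accuracy
    det_acc = np.mean(y_pred[~known] == m)                  # detection accuracy
    auroc = roc_auc_score((~known).astype(int), uus_score)  # u.u. as positive class
    return cls_acc, det_acc, auroc
```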
We display the experimental results in Table 2. RTSCV coupled with ResNet or DenseNet achieves the overall best performance with respect to all of the evaluation metrics. In particular, it yields a significant improvement in u.u.s detection accuracy while maintaining a high classification accuracy for the known classes. This suggests that RTSCV is the most effective at identifying u.u.s, without degrading the original model, even on more complex datasets with more complex classification models.
5 Conclusion

With the goal of reducing deployment errors and model bias due to deficient training data, RTSCV is an algorithmic framework that adds the flexibility for a base classification model to be rectified at deployment time. We provide a rigorous theoretical analysis of the correctness and performance guarantees of the process of minimizing a model's structural mismatch with a target space, based on objectives that most modern classifiers aim to optimize. RTSCV exhibits consistent performance improvements over both the pre-rectified model and previously proposed approaches that share the same goals on 7 benchmark datasets. Moreover, it does not assume the presence of an oracle, as in the case of active learning.

Our ongoing work focuses on improvements to RTSCV, especially due to concerns with the computational cost of cross-validation. We are developing alternatives to cross-validation that could work equally well while reducing the computational cost dramatically. For instance, we are investigating semi-supervised clustering [Zhu, 2008, Basu et al., 2002]. In the supplementary materials, we discuss our initial efforts in this direction. Our preliminary results suggest that such an approach works equally well as RTSCV when the u.u.s consist of only one cluster and have a relatively small covariance. Nevertheless, when the u.u.s form multiple clusters or the clusters have high variance, the performance of this alternative approach drops significantly compared to that of RTSCV.
Acknowledgments

This research was partially supported by a National Natural Science Foundation of China (NSFC) grant.
References

Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Eric Horvitz. Identifying unknown unknowns in the open world: Representations and policies for guided exploration. In Proc. of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR, 2017.

Shiyu Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. ICLR, 2018.

Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, volume 31, pages 7167–7177, 2018.

Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems (NeurIPS), 2020.

W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

W. J. Scheirer, L. P. Jain, and T. E. Boult. Probability models for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

Pedro Ribeiro Mendes Júnior, Roberto Medeiros de Souza, Rafael de Oliveira Werneck, Bernardo V. Stein, Daniel V. Pazinato, Waldir R. de Almeida, Otávio Augusto Bizetto Penatti, Ricardo da Silva Torres, and Anderson Rocha. Nearest neighbors distance ratio open-set classifier. Machine Learning, 2016.

Abhijit Bendale and Terrance E. Boult. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Colin Vandenhof and Edith Law. Contradict the machine: A hybrid approach to identifying unknown unknowns. In AAMAS, 2019.

Patrice Y. Simard, Saleema Amershi, David Maxwell Chickering, Alicia Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh, Johan Verwey, Mo Wang, and John Robert Wernsing. Machine teaching: A new paradigm for building machine learning systems. ArXiv, abs/1707.06742, 2017.

Lalit P. Jain, Walter J. Scheirer, and Terrance E. Boult. Multi-class open set recognition using probability of inclusion. In ECCV, 2014.

Dongha Lee, Sehun Yu, and Hwanjo Yu. Multi-class data description for out-of-distribution detection. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, New York, NY, USA, 2020.

R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, and T. Naemura. Classification-reconstruction learning for open-set recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Poojan Oza and Vishal M. Patel. C2AE: Class conditioned auto-encoder for open-set recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019. doi:10.1109/CVPR.2019.00241.

C. Geng, S. Huang, and S. Chen. Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Terrance Boult, S. Cruz, Akshay Dhamija, Manuel Günther, James Henrydoss, and W. J. Scheirer. Learning and the unknown: Surveying steps toward open world recognition. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019. doi:10.1609/aaai.v33i01.33019801.

E. M. Rudd, L. P. Jain, W. J. Scheirer, and T. E. Boult. The extreme value machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11(1):131–167, July 1999. ISSN 1076-9757.

M. Murty and V. Devi. Pattern Recognition: An Algorithmic Approach. 2011. doi:10.1007/978-0-85729-495-1.

Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition, Fourth Edition. Academic Press, Inc., USA, 4th edition, 2008. ISBN 1597492728.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi:10.1109/CVPR.2016.90.

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. doi:10.1109/CVPR.2017.243.

Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia object image library (COIL-20). Technical report, 1996.

Chuanxing Geng and Songcan Chen. Collective decision for open set recognition. IEEE Transactions on Knowledge and Data Engineering, 2020. doi:10.1109/tkde.2020.2978199.

Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Ng. Reading digits in natural images with unsupervised feature learning. NIPS, 2011.

J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. doi:10.1109/CVPR.2009.5206848.

Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. 2015.

Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006.

Xiaojin Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 07 2008.

Sugato Basu, Arindam Banerjee, and R. Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002.
Supplementary Materials

Proof of Theorem 3.1.
Let us first prove the case of $x \sim X_k$. Writing $\mathcal{N}(a; \mu, \Sigma)$ for the Gaussian density with mean $\mu$ and covariance $\Sigma$ evaluated at $a$, we know that

$$\mathbb{E}_{x \sim X_k}(L_k(x)) = \int_{\mathbb{R}^d} L_k(x)^2 \, dx = \mathcal{N}(\mu_k; \mu_k, 2\Sigma_k)$$
$$\mathbb{E}_{x \sim X_k}(L_i(x)) = \int_{\mathbb{R}^d} L_i(x) L_k(x) \, dx = \mathcal{N}(\mu_i; \mu_k, \Sigma_k + \Sigma_i)$$

Expanding the two terms, we have

$$\mathbb{E}_{x \sim X_k}(L_k(x)) = \frac{1}{(2\pi)^{d/2} \sqrt{|2\Sigma_k|}} \exp\left(-\frac{1}{2}(\mu_k - \mu_k)^T (2\Sigma_k)^{-1} (\mu_k - \mu_k)\right) = \frac{1}{(2\pi)^{d/2} \sqrt{|2\Sigma_k|}}$$

$$\mathbb{E}_{x \sim X_k}(L_i(x)) = \frac{1}{(2\pi)^{d/2} \sqrt{|\Sigma_k + \Sigma_i|}} \exp\left(-\frac{1}{2}(\mu_i - \mu_k)^T (\Sigma_k + \Sigma_i)^{-1} (\mu_i - \mu_k)\right)$$

Recall that $\Sigma_i = \sigma_i^2 I$ and $\Sigma_u = \sigma_u^2 I$. Setting $\mathcal{N}(\mu_k; \mu_k, 2\Sigma_k) > \mathcal{N}(\mu_i; \mu_k, \Sigma_k + \Sigma_i)$, we have

$$\|\mu_k - \mu_i\|^2 \ge d \cdot (\sigma_i^2 + \sigma_k^2) \cdot \ln\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)$$

If the above conditions hold for all classes other than class $k$, then

$$\mathbb{E}_{x \sim X_k}(L_k(x)) = P(X_u) \mathbb{E}_{x \sim X_k}(L_k(x)) + \sum_{i=1}^{m} P(X_i) \mathbb{E}_{x \sim X_k}(L_k(x)) > \mathbb{E}_{x \sim X_k}(L_s(x))$$

since $P(X_u) + \sum_{i=1}^{m} P(X_i) = 1$. Similarly, for the second case of $x \sim X_u$, we can compute that $\mathbb{E}_{x \sim X_u}(L_k(x)) = \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$. Notice that

$$\mathbb{E}_{x \sim X_u}(L_s(x)) > P(X_u) \mathcal{N}(\mu_u; \mu_u, 2\Sigma_u) + P(X_k) \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$$

Hence, to obtain a sufficient condition for $\mathbb{E}_{x \sim X_u}(L_s(x)) > \mathbb{E}_{x \sim X_u}(L_k(x))$, we simply need to require $P(X_u) \mathcal{N}(\mu_u; \mu_u, 2\Sigma_u) > (1 - P(X_k)) \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$, which yields

$$\frac{\|\mu_k - \mu_u\|^2}{\sigma_k^2 + \sigma_u^2} \ge 2\ln\left(\frac{1 - P(X_k)}{P(X_u)}\right) + d \cdot \ln\left(\frac{2\sigma_u^2}{\sigma_k^2 + \sigma_u^2}\right)$$
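The proof repeatedly uses the standard Gaussian product-integral identity, which we record here for completeness (a known fact, not part of the paper's statement):

```latex
% Product of two Gaussian densities integrates to a Gaussian density in the means:
\int_{\mathbb{R}^d} \mathcal{N}(x;\mu_1,\Sigma_1)\,\mathcal{N}(x;\mu_2,\Sigma_2)\,dx
  = \mathcal{N}(\mu_1;\,\mu_2,\,\Sigma_1 + \Sigma_2)
% With \mu_1 = \mu_2 = \mu_k and \Sigma_1 = \Sigma_2 = \Sigma_k, this gives
% \mathbb{E}_{x\sim X_k}(L_k(x)) = \mathcal{N}(\mu_k;\mu_k,2\Sigma_k).
```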
Proof of Theorem 3.2. For $x \sim X_k$, to show that $\mathbb{E}_{x \sim X_k}(p(X_k \mid x)) > \mathbb{E}_{x \sim X_k}(p(X_s \mid x))$, by Bayes' theorem we just need to show

$$\mathbb{E}_{x \sim X_k}(L_k(x) P'(X_k)) > \mathbb{E}_{x \sim X_k}(L_s(x) P'(X_s))$$

From the proof of Theorem 3.1, we have:

$$\mathbb{E}_{x \sim X_k}(L_k(x) P'(X_k)) = P'(X_k) \, \mathcal{N}(\mu_k; \mu_k, 2\Sigma_k)$$
$$\mathbb{E}_{x \sim X_k}(L_i(x) P'(X_s)) = P'(X_s) \, \mathcal{N}(\mu_i; \mu_k, \Sigma_k + \Sigma_i)$$

Setting $P'(X_k) \mathcal{N}(\mu_k; \mu_k, 2\Sigma_k) > P'(X_s) \mathcal{N}(\mu_i; \mu_k, \Sigma_k + \Sigma_i)$, we obtain:

$$\frac{\|\mu_k - \mu_i\|^2}{\sigma_k^2 + \sigma_i^2} \ge 2\ln\left(\frac{P'(X_s)}{P'(X_k)}\right) + d \cdot \ln\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)$$

For $x \sim X_u$, we just need to show that $\mathbb{E}_{x \sim X_u}(L_s(x) P'(X_s)) > \mathbb{E}_{x \sim X_u}(L_k(x) P'(X_k))$. From the proof of Theorem 3.1, we know that:

$$\mathbb{E}_{x \sim X_u}(L_s(x) P'(X_s)) > P'(X_s) P(X_u) \, \mathcal{N}(\mu_u; \mu_u, 2\Sigma_u) + P'(X_s) P(X_k) \, \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$$
$$\mathbb{E}_{x \sim X_u}(L_k(x) P'(X_k)) = P'(X_k) \, \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$$

Letting $P'(X_s) P(X_u) \mathcal{N}(\mu_u; \mu_u, 2\Sigma_u) > \left[P'(X_k) - P'(X_s) P(X_k)\right] \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$ gives the desired sufficient condition:

$$\frac{\|\mu_k - \mu_u\|^2}{\sigma_k^2 + \sigma_u^2} \ge 2\ln\left(\frac{P'(X_k) - P'(X_s) P(X_k)}{P'(X_s) P(X_u)}\right) + d \cdot \ln\left(\frac{2\sigma_u^2}{\sigma_k^2 + \sigma_u^2}\right)$$
Proof of Theorem 3.3. For $x \sim X_k$, we know from statistical theory that $D_M(x, X_k)$ has a $\chi^2_d$ distribution with $d$ degrees of freedom. So we have

$$\mathbb{E}_{x \sim X_k}(D_M(x, X_k)) = \mathbb{E}_{x \sim \chi^2_d}(x) = d$$

Let us now investigate the distribution of $D_M(x, X_s)$. As $\Sigma_s$ is real, symmetric, and diagonalizable, it has an orthogonal decomposition:

$$\Sigma_s = U \Lambda U^{-1} = U \Lambda U^T = \sum_{j=1}^{d} \lambda_j u_j u_j^T, \qquad \Sigma_s^{-1} = U \Lambda^{-1} U^{-1} = U \Lambda^{-1} U^T = \sum_{j=1}^{d} \lambda_j^{-1} u_j u_j^T$$

where $\{\lambda_j\}_{j=1}^{d}$ are the eigenvalues in $\Lambda$ and $\{u_j\}_{j=1}^{d}$ are the corresponding eigenvectors. Plugging the decomposition into the formula for $D_M(x, X_s)$, we get

$$D_M(x, X_s) = (x - \mu_s)^T \Sigma_s^{-1} (x - \mu_s) = \sum_{j=1}^{d} \left[\lambda_j^{-1/2} u_j^T (x - \mu_s)\right]^2 = \sum_{j=1}^{d} Y_j^2$$

Since $x \sim \mathcal{N}(\mu_k, \Sigma_k)$, $Y_j = \lambda_j^{-1/2} u_j^T (x - \mu_s)$ is an affine transformation of a multivariate Gaussian distribution, so $Y_j$ has a univariate normal distribution with

$$\mathbb{E}(Y_j) = \lambda_j^{-1/2} u_j^T (\mathbb{E}(x) - \mu_s) = \lambda_j^{-1/2} u_j^T (\mu_k - \mu_s), \qquad \operatorname{Var}(Y_j) = \lambda_j^{-1/2} u_j^T \Sigma_k \, \lambda_j^{-1/2} u_j = \frac{\sigma_k^2}{\lambda_j}$$

Therefore, we can infer that

$$\mathbb{E}(Y_j^2) = \frac{\sigma_k^2}{\lambda_j} + \left[\lambda_j^{-1/2} u_j^T (\mu_k - \mu_s)\right]^2$$

$$\mathbb{E}_{x \sim X_k}(D_M(x, X_s)) = \sum_{j=1}^{d} \mathbb{E}(Y_j^2) = \sigma_k^2 \sum_{j=1}^{d} \lambda_j^{-1} + \sum_{j=1}^{d} \frac{\left[u_j^T (\mu_k - \mu_s)\right]^2}{\lambda_j}$$

Here, since we assume $\Sigma_s = \sigma_s^2 I$, $\{u_j\}_{j=1}^{d}$ is the canonical basis of $\mathbb{R}^d$. So the formula can be further simplified as:

$$\mathbb{E}_{x \sim X_k}(D_M(x, X_s)) = d \cdot \frac{\sigma_k^2}{\sigma_s^2} + \frac{\|\mu_k - \mu_s\|^2}{\sigma_s^2}$$

Setting $\mathbb{E}_{x \sim X_k}(D_M(x, X_k)) < \mathbb{E}_{x \sim X_k}(D_M(x, X_s))$, we obtain the sufficient condition:

$$\|\mu_k - \mu_s\|^2 > d \cdot \sigma_s^2 \cdot \left(1 - \frac{\sigma_k^2}{\sigma_s^2}\right)$$

Similarly, for the case of $x \sim X_u$, following the above results we have:

$$\mathbb{E}_{x \sim X_u}(D_M(x, X_s)) = d \cdot \frac{\sigma_u^2}{\sigma_s^2} + \frac{\|\mu_u - \mu_s\|^2}{\sigma_s^2}, \qquad \mathbb{E}_{x \sim X_u}(D_M(x, X_k)) = d \cdot \frac{\sigma_u^2}{\sigma_k^2} + \frac{\|\mu_u - \mu_k\|^2}{\sigma_k^2}$$

Setting $\mathbb{E}_{x \sim X_u}(D_M(x, X_s)) < \mathbb{E}_{x \sim X_u}(D_M(x, X_k))$, we have:

$$\frac{\|\mu_u - \mu_k\|^2}{\sigma_k^2} - \frac{\|\mu_u - \mu_s\|^2}{\sigma_s^2} > d \cdot \left(\frac{\sigma_u^2}{\sigma_s^2} - \frac{\sigma_u^2}{\sigma_k^2}\right)$$
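A quick numerical check (ours) of the closed form $\mathbb{E}_{x \sim X_k}(D_M(x, X_s)) = d\,\sigma_k^2/\sigma_s^2 + \|\mu_k - \mu_s\|^2/\sigma_s^2$ derived above:

```python
# Monte Carlo check of E[D_M(x, X_s)] for x ~ N(mu_k, sigma_k^2 I), Sigma_s = sigma_s^2 I.
import numpy as np

rng = np.random.default_rng(0)
d, sigma_k, sigma_s = 5, 1.5, 2.0
mu_k, mu_s = np.zeros(d), np.full(d, 3.0)

x = rng.normal(mu_k, sigma_k, size=(200_000, d))
d_m = ((x - mu_s) ** 2).sum(axis=1) / sigma_s**2            # squared Mahalanobis distance
closed_form = d * sigma_k**2 / sigma_s**2 + ((mu_k - mu_s) ** 2).sum() / sigma_s**2
print(d_m.mean(), closed_form)                              # ~14.06 for both
```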
Table 3: ResNet training configuration.

Parameter | Value
optimizer | SGD with Nesterov momentum
momentum | 0.9
learning rate | 5e-4
epochs | 100
learning rate scheduler | learning rate decreases by 50% after every 20 epochs
cross-validation folds | 3
number of layers | 34

Table 4: DenseNet training configuration.

Parameter | Value
optimizer | SGD with Nesterov momentum
momentum | 0.9
learning rate | 5e-4
epochs | 100
learning rate scheduler | learning rate decreases by 50% after every 20 epochs
cross-validation folds | 3
number of layers | 100
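As a sketch, assuming PyTorch, the configuration in Tables 3 and 4 corresponds roughly to the following optimizer and scheduler setup (model construction elided):

```python
# Sketch of the training configuration in Tables 3-4, assuming PyTorch.
import torch

def make_optimizer_and_scheduler(model):
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-4,
                                momentum=0.9, nesterov=True)
    # Learning rate decreases by 50% after every 20 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    return optimizer, scheduler

# Training then runs for 100 epochs, calling scheduler.step() once per epoch.
```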
Figure 3: Comparison between Clustering with Side Information (CSI) and our RTSCV methods under differentsynthetic dataset settings. Top: There is one u.u. cluster for the left plot and two u.u. clusters for the right plot. For bothplots the u.u. class is located far away from the 10 known classes and only the covariance of the u.u. class is alteredacross different trials.We also believe that it is worth discussing the possibility of resorting to semi-supervised clustering as an alternativeto cross-validation during the process of re-classifying sample class X s , given its increasing popularity and the greatpotential of being more computationally economical. Given a small amount of labeled data, semi-supervised clusteringperforms ordinary clustering tasks under the constraints of must-links (two points must be in the same cluster) and cannot-links (two points cannot be in the same cluster), provided by the labeled data [Zhu, 2008]. In our scenario, theobjective of re-classifying X s can be viewed equivalent to dividing X s into several clusters, one of which correspondsto either a known or u.u. class, with the assistance of the labeled data from the entire training set. This is also called clustering with side information (CSI) in the literature [Zhu, 2008].To test this alternative, we adopt a novel but simple method called Seeded-KMeans [Basu et al., 2002]. Specifically,given M known classes X , X , . . . , X M in the training set, we run an ( M + 1) -Means clustering algorithm on sampleset X s , with the initial centers of each cluster set to the mean feature vectors of X , X , . . . , X M and X s , respectively.After the clustering converges, we assign the label of each cluster of X s according to the class membership of the initialseeding of the corresponding center. In other words, a cluster initially seeded by the mean of some known class X k willbe labeled as X k , and a cluster initially seeded by the mean of X s will be labeled as the u.u. class.Our primary experiment suggests that such an approach works equally well as the RTSCV method when the u.u.sconsist of only one cluster (sub-class) and are far away from the known base classes. As illustrated in the top-left plotof Figure 3, in such a setting CSI has a very similar OSR performance as our RTSCV, under different covariance levelsof the u.u. class. Nevertheless, when the u.u.s form multiple clusters or are close to the base classes, the performance ofCSI plunges significantly, as illustrated in the top-right and bottom plots of Figure 3. This is possibly because of thelarge inconsistency between the mean of X s as the initial seed of the u.u. class and the true u.u.s distribution. Bottom:There is one u.u. cluster, whose distance to the known base classes is altered across trials.In response to that, one potential improvement of the CSI method might be to incorporate some priors on the distributionof the u.u. class, i.e., the number of sub-classes or the means of them, with light involvement of human experts. Webelieve that this is a very promising direction for future works.15odel Rectification via Unknown Unknowns Extraction from Deployment Samples Figure 4: RTSCV decision boundaries after cross-validation using SVM, fitted on the entire augmented training setconsisting of 10 known classes and the sample class X s . The black region represents the dummy (u.u.s) class where weintend to trap the u.u.s. 
Figure 4: RTSCV decision boundaries after cross-validation using SVM, fitted on the entire augmented training set consisting of 10 known classes and the sample class $X_s$. The black region represents the dummy (u.u.s) class where we intend to trap the u.u.s. Points from the sample class are represented by triangles with pseudo-label 10, while points from one of the known classes are represented by squares with the respective class label.

Figure 5: RTSCV decision boundaries after cross-validation using KNN, fitted on the entire augmented training set consisting of 10 known classes and the sample class $X_s$. The black region represents the dummy (u.u.s) class where we intend to trap the u.u.s. Points from the sample class are represented by triangles with pseudo-label 10, while points from one of the known classes are represented by squares with the respective class label.

Figure 6: RTSCV decision boundaries after cross-validation using a Decision Tree, fitted on the entire augmented training set consisting of 10 known classes and the sample class $X_s$.