Model Rectification via Unknown Unknowns Extraction from Deployment Samples
Bruno Abrahao∗, Zheng Wang∗, Haider Ahmed, Yuchen Zhu
New York University Shanghai and New York University

∗ Equal contribution and listed in alphabetical order. Correspondence to:

Abstract
Model deficiency that results from incomplete training data is a form of structural blindness that leads to costly errors, oftentimes with high confidence. During the training of classification tasks, underrepresentation of class-conditional distributions that a given hypothesis space can recognize results in a mismatch between the model and the target space. To mitigate the consequences of this discrepancy, we propose Random Test Sampling and Cross-Validation (RTSCV), a general algorithmic framework that performs post-training model rectification at deployment time in a supervised way. RTSCV extracts unknown unknowns (u.u.s), i.e., examples from the class-conditional distributions that a classifier is oblivious to, and works in combination with a diverse family of modern prediction models. RTSCV augments the training set with a sample of the test set (or deployment data) and uses this redefined class layout to discover u.u.s via cross-validation, without relying on active learning or budgeted queries to an oracle. We contribute a theoretical analysis that establishes performance guarantees based on the design bases of modern classifiers. Our experimental evaluation demonstrates RTSCV's effectiveness on 7 benchmark tabular and computer vision datasets, where it reduces performance gaps as large as 41% relative to the respective pre-rectification models. Last, we show that RTSCV consistently outperforms state-of-the-art approaches.
1 Introduction

Data quality constitutes a critical factor affecting the performance of prediction models. In particular, incomplete training data frequently results in structural mismatches between data-driven trained models and the respective target space in which they are supposed to be deployed, which makes most classifiers susceptible to systematic errors, given their limited ability to rectify a model post-training.

In scenarios of increasing dependence on algorithmic decisions in high-stakes situations, deficient models result in costly (sometimes fatal) errors, unfairness, and other problems. For example, an autopilot system may fail to recognize peculiar traffic signs it has never encountered during training, leading to accidents. In the case of automated recruiting, data from industries dominated by a given gender may result in biased classifiers, likely to reject examples of the opposite gender due to the lack of enough successful observations that belong to that class. In addition, unseen joint distributions of features and "data drift" may contribute to high-confidence errors. For instance, when training a classifier to distinguish between white dogs and black cats, when presented with a black dog at deployment time, the model may predict "cat" with high confidence [Lakkaraju et al., 2017]. Under data drift, the structure of the target space may change over time and deviate from the trained model. Take, for example, the anecdotal account from the beginning of the COVID-19 pandemic, where physicians attempted to identify what type of "bacteria" had been causing an unusually high number of "pneumonia" cases, overlooking the fact that there was a new type of agent, i.e., a novel virus affecting the respiratory system.

We focus on classification tasks where a trained model may be oblivious to some of the domain-specific class-conditional distributions that a set of hypotheses can recognize. Data examples from these "invisible" joint distributions of features, i.e., "hidden classes," form the unknown unknowns (u.u.s), which cause a prediction model to commit errors, often with high confidence.
To mitigate the consequences of this mismatch, we propose Random Test Sampling and Cross-Validation (RTSCV), a general algorithmic framework that aims to perform a post-training model rectification of a base classifier at deployment time in a supervised way. RTSCV aims to reduce the structural mismatch between a trained model and the target space by extracting u.u.s from samples of the target space. RTSCV augments the training set with a sample of the test set (or deployment data) and uses this redefined class layout to discover unknown unknowns via cross-validation. Our key insight is that by augmenting a training set that possesses m classes with a dummy class, labeled m + 1, whose examples come from a test set sample, cross-validation is likely to decouple examples that belong to known classes from m + 1, due to the high variance and broad boundary of this dummy class. Conversely, u.u.s coming from separable classes may share greater affinity with the dummy class, as they are expected to be poor fits to the known classes, and because the decision boundary around the dummy class may have been established with the contribution of examples from the u.u.s in the test sample.

RTSCV bears two advantages compared to previous methods. First, RTSCV can work in combination with a diverse family of modern classifiers. The bulk of existing methods for identifying u.u.s focus on modifying specific classification methods, such as SVM, KNN, and DNNs, in such a way as to include a free parameter that can be learned at the deployment phase. This allows the method to predict u.u.s as possible outputs [Scheirer et al., 2013, 2014, Júnior et al., 2016, Bendale and Boult]. However, unlike RTSCV, these methods do not generalize, as they are classifier-specific. We note that RTSCV can be easily paired with any trained classifier. Second, RTSCV relies solely on the use of a sample of the test data, which removes assumptions made by approaches like active learning that are often challenging to operationalize in practice, such as the existence of budgeted queries to an oracle [Vandenhof and Law, 2019, Lakkaraju et al., 2017, Simard et al., 2017].

We contribute a theoretical analysis with performance guarantees based on the design bases of modern classifiers, including Maximum Likelihood Estimation, the Bayes classifier, and Minimum Mahalanobis Distance. Through an extensive experimental evaluation, we use 7 benchmark tabular and more challenging computer vision datasets, such as CIFAR-10, CIFAR-100, and SVHN, with ResNet and DenseNet as base models. Our results suggest that RTSCV is a promising direction for post-training rectification of a base classifier, reducing a performance gap (Accuracy, F-measure, AUROC) as large as 41%. Moreover, our results indicate that RTSCV consistently outperforms state-of-the-art approaches and baselines by a significant margin.

The conceptual idea of a "class" is often subjective, and different hypothesis spaces will separate the feature space into different class-conditional distributions. Here we employ a working definition of u.u.s classes via a geometric argument. That is, given a fixed hypothesis space, the u.u.s form separable clusters in the feature space that are distinguishable from the known structures in the target space. This definition is without loss of generality, as it allows for any abstraction of conceptual blindness to examples. We note that u.u.s are not the only sources of prediction errors. To delineate the aims of RTSCV, we now discuss different types of errors and the scope in which RTSCV operates.
Let $f$ be a classifier and consider the hypothesis space produced by this model, i.e., the set of all functions that can be returned by it. We assume that the model is consistent; that is, if there is a function in the hypothesis space, the machine is going to produce that function from training. Further, let $E$ be the Bayes error, or the irreducible error. If $E(\mathcal{H})$ is the lowest error we could produce with hypothesis space $\mathcal{H}$, and $E(\mathcal{H}, \mathcal{D})$ is the minimum error we produce with $\mathcal{H}$ and available training data $\mathcal{D}$, then $E(\mathcal{H}, \mathcal{D}) - E$ represents the overall generalization error given $\mathcal{H}$ and $\mathcal{D}$, which can be decomposed as

$$E(\mathcal{H}, \mathcal{D}) - E = \left[E(\mathcal{H}, \mathcal{D}) - E(\mathcal{H})\right] + \left[E(\mathcal{H}) - E\right].$$

We call the first difference the estimation error, and the second difference the approximation error. That is, the model may produce errors due either to deficiencies of the model (approximation error) or to the training data (estimation error), such as u.u.s. In this paper, we focus on the latter, i.e., reducing the estimation error under the assumption of a fixed hypothesis space. We also assume that the data are free of mislabeling errors.

We emphasize the distinction between u.u.s detection and outlier detection. Outliers are rare extreme values, produced by the realization of (possibly known) class-conditional distributions. As outliers tend to be isolated from any cluster in the feature space, they do not form the separable class structures that characterize u.u.s.¹

¹For reproducibility, our code will be made publicly available and we will replace this footnote with the GitHub link after the anonymous reviewing phase.
2 Related Work

Early work focused on extending classical machine learning algorithms to enable u.u.s prediction. Prominent examples are SVM-based methods, such as [Scheirer et al., 2013], which proposed the 1-vs-Set machine that separates the feature space with an additional hyperplane parallel to the hyperplane obtained from the SVM. It then optimizes the open space risk for this linear kernel slab model. To further reduce the open set risk, [Scheirer et al., 2014] proposed the W-SVM to incorporate non-linear kernels under a compact abating probability (CAP) model. Another similar approach is the $P_I$-SVM by [Jain et al.]. Besides modifications to SVM, [Júnior et al., 2016] introduced the OSR version of the Nearest Neighbor classifier (OSNN), based on a threshold method that relies on measurements of the distance of a u.u.s sample from the known space.

To address large and high-dimensional datasets, recent approaches proposed to modify Deep Neural Networks (DNNs). A baseline was proposed by [Hendrycks and Gimpel, 2017], formalizing the observation that softmax predictions may assign high confidence to erroneously classified out-of-distribution samples (u.u.s). [Liang et al., 2018] designed the ODIN detector, which better differentiates the confidence scores between in-distribution and out-of-distribution samples in the target space by combining temperature scaling and input perturbation. Using probabilistic modeling, specifically Gaussian discriminant analysis (GDA), [Lee et al., 2018] modeled the softmax outputs of known classes as class-conditional Gaussian distributions, and then used the closest Mahalanobis distance (MD) of each test sample to these Gaussian distributions as the confidence score. As a modification to this approach, [Lee et al., 2020] replaced the MD confidence score with the class-conditional log-likelihood. For DNNs, researchers have also focused on changing the network architecture to adapt to u.u.s detection, as in OpenMax [Bendale and Boult], CROSR [Yoshihashi et al.], and C2AE [Oza and Patel]. The surveys by [Geng et al., 2020] and [Boult et al., 2019] provide a comprehensive review of these methods. All of the preceding approaches modify an existing classifier in a model-specific way. In contrast, RTSCV is a general algorithmic framework capable of working with any classifier.

A different line of research addressed u.u.s in an incremental or active learning manner. [Rudd et al., 2018] formulated a theoretically sound classifier, the Extreme Value Machine (EVM), grounded on Extreme Value Theory, which is able to perform nonlinear, kernel-free, variable-bandwidth incremental learning. [Vandenhof and Law, 2019] and [Lakkaraju et al., 2017] both proposed hybrid frameworks combining human crowdsourcing and algorithmic methods, in which some priors of the u.u.s are extracted by experts whose feedback guides adjustments of the trained model. By adopting an active learning environment, these approaches can potentially cope with a dynamic feature space. Nevertheless, requiring the presence of an oracle is oftentimes unrealistic, as it may be expensive in time and human labor, and therefore non-scalable for many real-world applications. Our proposal offsets these shortcomings by relying only on the analysis of a data sample at deployment time.

The idea of using cross-validation to rectify incorrect data was previously explored to identify mislabeled training data [Brodley and Friedl, 1999].
In their noise-reduction approach, cross-validation is performed over the training set, and mislabeled data are those given different "pseudo-labels" from their original labels after the cross-validating phase. While their work focuses on identifying mislabeled data, RTSCV aims to augment a model with new labels for the data examples generated by class-conditional distributions that were not contemplated at training time, which allows for the correct classification of these examples.

3 The RTSCV Framework

We set our scope on multi-class classification tasks. Let $f$ be the input base classifier, which we will treat as a black box. Let $X = \{X_1, X_2, \cdots, X_m\}$ be the training set with labels $\{1, 2, \ldots, m\}$. Let $Y$ be the test set (as a representative of the target space). To detect the u.u.s and rectify a trained model, we present RTSCV in Algorithm 1. In summary, we first randomly sample the test set $Y$ to obtain a sample $X_s$ of cardinality $c \cdot |Y|$, for a given fraction $c$, from which we create a new dummy class and assign the label $m + 1$. Note that $X_s$ may contain examples from both the known and (potentially multiple) u.u.s classes. We then augment the original training set with the examples from $X_s$, resulting in a new intermediate training set $\widetilde{X}$, to which we apply cross-validation in combination with a base classifier $f$.

Algorithm 1
Random Test Sampling and Cross-Validation
Input: training set $X$ with labels $\{1, 2, \ldots, m\}$, test set $Y$, sample rate $c$, base classifier $f$, number of cross-validation folds $k$

1. Randomly sample the test set $Y$ to obtain a subset $X_s$ such that $|X_s| = c \cdot |Y|$
2. Assign label $m + 1$ to $X_s$
3. Obtain an augmented training set $\widetilde{X} \leftarrow X \cup X_s$ with labels $\{1, 2, \ldots, m + 1\}$
4. Run a $k$-fold cross-validation on $\widetilde{X}$
5. Let $X_u$ be the set of samples with predicted label $m + 1$ during cross-validation
6. Label the samples in $X_u$ with label $m + 1$
7. Obtain the rectified training set $X \leftarrow X \cup X_u$ with labels $\{1, 2, \ldots, m + 1\}$
8. Train $\hat{f}$ on $X$ and return $\hat{f}$
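To make Algorithm 1 concrete, here is a minimal sketch in Python, assuming scikit-learn-style estimators and numpy arrays; `rtscv` and its argument names are ours, not from the paper's released code.

```python
# Minimal sketch of Algorithm 1 (RTSCV), assuming scikit-learn-style estimators
# and numpy arrays, with known labels {0, ..., m-1} and dummy label m.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict

def rtscv(base_clf, X_train, y_train, X_test, c=0.1, k=3, seed=0):
    rng = np.random.default_rng(seed)
    m = len(np.unique(y_train))                       # number of known classes

    # Steps 1-2: sample a fraction c of the test set and give it the dummy label m.
    idx = rng.choice(len(X_test), size=int(c * len(X_test)), replace=False)
    X_s = X_test[idx]

    # Step 3: augmented training set containing the dummy class.
    X_aug = np.vstack([X_train, X_s])
    y_aug = np.concatenate([y_train, np.full(len(X_s), m)])

    # Steps 4-5: k-fold cross-validation; samples predicted as the dummy class
    # form X_u (in practice these are almost exclusively points from X_s).
    y_cv = cross_val_predict(clone(base_clf), X_aug, y_aug, cv=k)
    X_u = X_aug[y_cv == m]

    # Steps 6-8: rectified training set and final classifier.
    X_rect = np.vstack([X_train, X_u])
    y_rect = np.concatenate([y_train, np.full(len(X_u), m)])
    return clone(base_clf).fit(X_rect, y_rect), X_u
```

Swapping the base model only changes the `base_clf` argument, which reflects the classifier-agnostic design discussed above.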
Figure 1: The influence of the sample-training ratio on RTSCV's performance: (a) F-measure on COIL-20; (b) accuracy on CIFAR10-ImageNet. Multiple results were recorded to calculate the 95% confidence interval. Plot (a) corresponds to openness 9.3%. For (b), the accuracy on the y-axis stands for the combined, overall classification accuracy.

Intuitively, RTSCV relies on the correct re-classification of the samples in $X_s$ during cross-validation, regarding whether they belong to a known or u.u.s class. Samples from the test set make up a high-variance class $X_s$, whose boundary encompasses all other classes (i.e., it may contain examples from any class). In light of this, examples that belong to known classes are likely placed in the correct classes due to the conciseness and specificity of their representation. On the contrary, u.u.s are classified as members of $m + 1$ due to their dissimilarity with all other classes and their affinity with some of the examples that contributed to the position of the decision boundary around class $m + 1$. The examples classified during cross-validation with label $m + 1$ make up a new set $X_u$ (the u.u.s class), which we adjoin to the original training set $X$ to form the rectified model.

The sample rate $c$ is a critical hyperparameter. A very small sample rate may not yield enough representatives of the u.u.s due to a small test set sample, whereas a large $c$ may lead to a sample class $X_s$ that over-represents the structure of the known classes, thereby causing the cross-validation to assign examples of known classes to $X_u$. In Figure 1 we illustrate the classifier's performance versus the sample-training ratio, i.e., $c$ as a function of the training set size, for two of the benchmark datasets we used in our experimental evaluation. We discuss the search for the optimal $c$ in Section 4.

The number of cross-validation folds $k$ determines the running time of RTSCV, whose time complexity is roughly $O(k \cdot T_f)$, where $T_f$ is the running time of the input base classifier $f$ without RTSCV. Figure 2 (a) displays the model performance against $k$ for several datasets we used in the evaluation of RTSCV. Note that RTSCV effectively rectifies a trained model even if we use the more computationally economical "holdout validation" approach.
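As a concrete illustration of this tuning step, the following sketch (ours) scores candidate values of $c$ by how many held-out known-class examples the rectified model sends to the dummy class, the selection criterion described in Section 4; it reuses the `rtscv` helper sketched after Algorithm 1.

```python
# Sketch: select the sample rate c by the misclassification of known data,
# i.e., leakage of held-out known-class examples into the dummy class.
import numpy as np
from sklearn.model_selection import train_test_split

def select_sample_rate(base_clf, X_train, y_train, X_test,
                       candidates=(0.02, 0.06, 0.10, 0.14), k=3):
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)
    m = len(np.unique(y_train))
    best_c, best_leak = None, float("inf")
    for c in candidates:
        rectified, _ = rtscv(base_clf, X_fit, y_fit, X_test, c=c, k=k)
        leak = np.mean(rectified.predict(X_val) == m)  # known data trapped as u.u.s
        if leak < best_leak:
            best_c, best_leak = c, leak
    return best_c
```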
We contribute a theoretical analysis that justifies the correctness of RTSCV and establishes performance guarantees as a function of class separability and the test sample size. As RTSCV is a general framework that may be combined with any base classifier, each employing disparate approaches to establish decision boundaries, there are major challenges in establishing a concise set of mathematical tools that would cover the basis of each specific approach. In light of this, we analyze its behavior through the lens of objectives that modern classifiers aim to optimize to find the sufficient conditions for the correct relabeling of the test set sample $X_s$, namely Maximum Likelihood Estimation (MLE), the Bayes classifier (BC), and Minimum Mahalanobis Distance (MMD). These objectives are shared by many classification models, such as rule-based, margin-based, etc.

Figure 2: (a) The influence of the number of cross-validation folds on the RTSCV framework, evaluated on tabular datasets (Letter, Pendigits, COIL-20, MNIST). (b) The influence of known-class separability on the RTSCV framework: we vary the between-class distances and the covariances of the known classes to generate different J scores. (c) The influence of u.u.s class separability on the RTSCV framework: we vary the covariance and the distance of the u.u.s class to the known classes. In (b) and (c), curves correspond to $\operatorname{tr}(S_w) \in \{4, 8, 16, 32\}$.

We structure the following theorems based on two classification cases of a data point $x \in X_s$ under RTSCV: (1) the true label of $x$ is one of the known classes, where the correct decision is to assign the true known label during cross-validation, and (2) $x$ belongs to the u.u.s class, and the process should keep it in $X_s$. We aim to establish the correctness of RTSCV by showing that this correct decision is the one that optimizes the MLE, the BC, and the MMD.

We model all the known and u.u.s classes as non-identical multivariate Gaussian distributions. Specifically, let $X_1, X_2, \ldots, X_m$ be the known classes with distinct means $\mu_i \in \mathbb{R}^d$ and diagonal covariance matrices $\Sigma_i \in \mathbb{R}^{d \times d}$, and let $X_u$ be the u.u.s class with mean $\mu_u \in \mathbb{R}^d$ and diagonal covariance matrix $\Sigma_u \in \mathbb{R}^{d \times d}$. Furthermore, we assume all the distributions are homoscedastic, so $\Sigma_i = \sigma_i^2 I$ for $i \in \{1, 2, \ldots, m\}$ and $\Sigma_u = \sigma_u^2 I$, for some $\sigma_i^2 \in \mathbb{R}^+$, $\sigma_u^2 \in \mathbb{R}^+$. In this way, since the sample class $X_s$ is obtained by randomly sampling the test set, we can model $X_s$ as a Gaussian mixture of $X_1, X_2, \ldots, X_m, X_u$ weighted by their respective percentages in the test set, denoted $P(X_i)$, $i \in \{1, 2, \ldots, m, u\}$.

Under the preceding assumptions, via Gaussian discriminant analysis, we first focus on MLE to explore how the total likelihood of the dataset changes under different labeling schemes of the sample class $X_s$. For a test set sample $x \in X_s$, its likelihood of being in one of the $X_i$ ($i \in \{1, 2, \ldots, m, u\}$) is:

$$L_i(x) = \frac{1}{(2\pi)^{d/2} \sqrt{|\Sigma_i|}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right) \tag{1}$$

where $|\Sigma_i|$ denotes the determinant of $\Sigma_i$. Its likelihood of being in the sample class $X_s$ is

$$L_s(x) = P(X_u) L_u(x) + \sum_{k=1}^{m} P(X_k) L_k(x) \tag{2}$$

as $X_s$ is a Gaussian mixture. We can now find sufficient conditions under which the correct classification of $x$ increases the total likelihood.

Theorem 3.1 (Maximum Likelihood Estimation). For $x \in X_s$, if $x$ is a sample of some known class, i.e., $x \sim X_k$ for some $k \in \{1, 2, \ldots, m\}$, for $x$ to be correctly classified as belonging to $X_k$ based on MLE, we require it to have a higher class-conditional likelihood for class $X_k$ than for $X_s$. We have $\mathbb{E}_{x \sim X_k}(L_k(x)) \ge \mathbb{E}_{x \sim X_k}(L_s(x))$ given that, for all $i \in \{1, \cdots, m, u\}$:

$$\|\mu_k - \mu_i\|^2 \ge d \cdot (\sigma_k^2 + \sigma_i^2) \cdot \ln\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)$$

Similarly, if $x \sim X_u$, then we have $\mathbb{E}_{x \sim X_u}(L_s(x)) \ge \mathbb{E}_{x \sim X_u}(L_k(x))$ given that

$$\frac{\|\mu_k - \mu_u\|^2}{\sigma_k^2 + \sigma_u^2} \ge 2\ln\left(\frac{1 - P(X_k)}{P(X_u)}\right) + d \cdot \ln\left(\frac{2\sigma_u^2}{\sigma_k^2 + \sigma_u^2}\right)$$
Proof. All proofs are in the supplementary materials.

Theorem 3.1 characterizes sufficient conditions for the re-classification of $X_s$ to maximize the total likelihood. In summary, it says that $X_s$ will be correctly classified based on MLE once the class separability is above a threshold characterized by the squared difference of the means and the squared within-class variances.

We now turn our attention to the Bayes classifier, which determines the membership of $x \in X_s$ by considering its posterior probability [Murty and Devi, 2011]. By Bayes' theorem, the posterior for $x$ to be in class $X_i$ is given by $p(X_i \mid x) = L_i(x) P'(X_i) / p(x)$. In the scenario we consider, $P'(X_i)$ stands for the prior of $X_i$ in the training set: $P'(X_i) = |X_i| / |X_s \cup \bigcup_{k=1}^{m} X_k|$, $i \in \{1, 2, \ldots, m, s\}$. In this way, one can calculate the Bayesian decision boundary of $x$ between $X_s$ and $X_k$ for some $k \in \{1, 2, \ldots, m\}$.

Theorem 3.2 (Bayesian Classification). For $x \in X_s$, if $x \sim X_k$, we have $\mathbb{E}_{x \sim X_k}(p(X_k \mid x)) \ge \mathbb{E}_{x \sim X_k}(p(X_s \mid x))$ given that, for all $i \in \{1, \cdots, m, u\}$:

$$\frac{\|\mu_k - \mu_i\|^2}{\sigma_k^2 + \sigma_i^2} \ge 2\ln\left(\frac{P'(X_s)}{P'(X_k)}\right) + d \cdot \ln\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)$$

If $x \sim X_u$, then we have $\mathbb{E}_{x \sim X_u}(p(X_s \mid x)) \ge \mathbb{E}_{x \sim X_u}(p(X_k \mid x))$ given that

$$\frac{\|\mu_k - \mu_u\|^2}{\sigma_k^2 + \sigma_u^2} \ge 2\ln\left(\frac{P'(X_k) - P'(X_s) P(X_k)}{P'(X_s) P(X_u)}\right) + d \cdot \ln\left(\frac{2\sigma_u^2}{\sigma_k^2 + \sigma_u^2}\right)$$

Theorem 3.2 (similarly to Theorem 3.1) establishes a performance guarantee on the correct behavior of the framework. That is, when the class separability between the u.u.s class and the known classes is greater than a given threshold, controlled by the sample size, the prior $P'(X_s)$, and the dimensionality of the data, the re-classification of $X_s$ will be as we expect with high probability.

Last, the Minimum Mahalanobis Distance (MMD) mimics the goal shared by classifiers of placing a data point in the class whose joint probability distribution of features is the closest to that of the data point. We show that the correct decision by RTSCV is the one that minimizes the MD. For a test set sample $x \in X_s$, its squared MD to some class $X_i$ is given by

$$D_M(x, X_i) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \tag{3}$$

The analogous behavior of a classifier that employs MMD as a metric would assign $x$ to the class that has the closest MD to $x$.

Theorem 3.3 (Minimum Mahalanobis Distance). Assume $\Sigma_s = \sigma_s^2 I$.² For $x \in X_s$, if $x \sim X_k$ for some $k \in \{1, 2, \ldots, m\}$, we have $\mathbb{E}_{x \sim X_k}(D_M(x, X_k)) \le \mathbb{E}_{x \sim X_k}(D_M(x, X_s))$ given that

$$\|\mu_k - \mu_s\|^2 \ge d \cdot \sigma_s^2 \cdot \left(1 - \frac{\sigma_k^2}{\sigma_s^2}\right)$$

If $x \sim X_u$, then we have $\mathbb{E}_{x \sim X_u}(D_M(x, X_s)) \le \mathbb{E}_{x \sim X_u}(D_M(x, X_k))$ given that

$$\frac{\|\mu_u - \mu_k\|^2}{\sigma_k^2} - \frac{\|\mu_u - \mu_s\|^2}{\sigma_s^2} \ge d \cdot \left(\frac{\sigma_u^2}{\sigma_s^2} - \frac{\sigma_u^2}{\sigma_k^2}\right)$$

In alignment with Theorems 3.1 and 3.2, here we establish a requirement of class separability as the sufficient condition for the expected behavior of RTSCV.

²Note that we assume $\Sigma_s$ to be diagonal to avoid long derivations resulting from the Gaussian mixture model.
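As a quick worked instance of Theorem 3.1's first condition (our illustration, not from the paper), take equal variances $\sigma_k^2 = \sigma_i^2 = \sigma^2$:

```latex
% Theorem 3.1, first condition, specialized to equal variances:
\|\mu_k - \mu_i\|^2
  \;\ge\; d\,(\sigma_k^2 + \sigma_i^2)\,
          \ln\!\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)
  \;=\; 2d\,\sigma^2 \ln(1) \;=\; 0
```

so the condition holds vacuously for equal-variance classes, and only becomes binding when $\sigma_k^2 > \sigma_i^2$, i.e., when the known class is broader than the class it must be separated from.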
To illustrate the preceding theoretical analysis, here we empirically study the effect of class separability using a synthetic dataset that consists of 10 distinct known classes and one u.u.s class, all sampled from pre-fixed 2-dimensional Gaussian distributions. To measure class separability, we adapted the notion of scatter matrices [Theodoridis and Koutroumbas, 2008] to address the presence of the u.u.s class. Specifically, assume $\{X_1, X_2, \ldots, X_m\}$ are the $m$ known classes, with $\mu_i$ the mean of $X_i$ and $\Sigma_i$ the covariance matrix of $X_i$. For a u.u.s class $X_u$, its mean and covariance matrix are $\mu_u$ and $\Sigma_u$. The between-class scatter matrix $S_b$ is defined to measure the separability between different classes: $S_b = \sum_{i=1}^{m} P(X_i)(\mu_i - \mu_0)(\mu_i - \mu_0)^T$, where $P(X_i)$ is the percentage of examples in $X_i$ relative to the total number of examples in the dataset. When measuring the distance between the $m$ known classes and the u.u.s class, we set $\mu_0 = \mu_u$. Otherwise, if we want to measure the between-class distance within the known classes, we set $\mu_0 = \frac{1}{m}\sum_{i=1}^{m} \mu_i$. For the within-class scatter matrix $S_w$, when measuring the u.u.s class, we simply define $S_w = \Sigma_u$, the covariance matrix of $X_u$. For the known classes, $S_w$ is the weighted sum of all the covariance matrices of the known classes: $S_w = \sum_{i=1}^{m} P(X_i) \Sigma_i$. To combine both scatter matrices, we use the $J$ criterion [Theodoridis and Koutroumbas, 2008]:

$$J = \frac{\operatorname{trace}\{S_w + S_b\}}{\operatorname{trace}\{S_w\}} = 1 + \frac{\operatorname{trace}\{S_b\}}{\operatorname{trace}\{S_w\}} \tag{4}$$

which increases as the means of different classes spread out and the intra-class variances shrink. We evaluate RTSCV on different dataset configurations of varying class means and covariances, each corresponding to a unique $J$ score. The results are plotted in Figure 2 (b) for the known classes and Figure 2 (c) for the unknown classes, which show a positive correlation between model performance and class separability in both cases.
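A small numpy sketch of this separability measure, as we read it from Eq. (4) for the known classes; function and variable names are ours.

```python
# Sketch: J-score separability measure of Eq. (4) for the known classes.
import numpy as np

def j_score_known(class_samples):
    """class_samples: list of (n_i, d) arrays, one per known class."""
    n_total = sum(len(X_i) for X_i in class_samples)
    weights = [len(X_i) / n_total for X_i in class_samples]
    means = [X_i.mean(axis=0) for X_i in class_samples]
    mu0 = np.mean(means, axis=0)        # unweighted mean of the class means

    S_b = sum(w * np.outer(mu - mu0, mu - mu0) for w, mu in zip(weights, means))
    S_w = sum(w * np.cov(X_i, rowvar=False) for w, X_i in zip(weights, class_samples))
    return 1.0 + np.trace(S_b) / np.trace(S_w)
```

For the u.u.s variant, $\mu_0$ is replaced by $\mu_u$ and $S_w$ by $\Sigma_u$, per the definitions above.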
4 Experimental Evaluation

We evaluate RTSCV on both tabular datasets, where we pair RTSCV with classical classifiers like SVM and simple fully connected neural networks, and computer vision datasets, where we use deep convolutional neural networks, such as ResNet [He et al., 2016] and DenseNet [Huang et al., 2017].

As established by Theorem 3.2, there is a sweet spot for the sample rate that best balances the classification of $x \in X_s$ by governing the prior $P'(X_s)$. Let the sample-training ratio be the ratio of the size of the test set sample to the size of the training set. Figure 1 plots RTSCV's performance on COIL-20 and CIFAR10-ImageNet, used in Sections 4.1 and 4.2, respectively, under varying sample-training ratios. The model performance climbs steeply with the sample size before the sample starts to over-represent the known data, and then decreases slowly. We search for an optimal $c$ by assessing the misclassification of known data. The optimal sample rate is dataset-dependent and varies from 0.06 to 0.1. Note that the small cardinalities make RTSCV practical for efficient sampling during model deployment. We also searched for the optimal number of cross-validation folds $k$, from 2 to 6, for the Letter, Pendigits, COIL-20, and MNIST datasets, and selected $k = 3$.

4.1 Tabular Datasets

We selected the Letter Recognition and Pendigits datasets,³ which contain the handwriting of 26 English letters and 10 digits, respectively. Further, we also down-sample the Columbia University Image Library (COIL-20), which contains grey-scale images of 20 objects [Nene et al., 1996], following the PCA-based technique by [Geng and Chen, 2020]. We also select MNIST, which contains 10 digit classes of dimension $28 \times 28$ [LeCun et al., 2010].

On the tabular datasets except for MNIST, we use a standard SVM implementation as the base model. We first report the results of the pre-rectified model, i.e., the performance of the base classifier under the presence of u.u.s without applying RTSCV. To compare with RTSCV, we select three previously proposed methods from the literature on u.u.s discovery: (1) EVM [Rudd et al., 2018], (2) 1-vs-Set [Scheirer et al., 2013], and (3) WSVM [Scheirer et al., 2014]. In particular, 1-vs-Set and WSVM are both SVM-based algorithms. For MNIST, we use a simple MLP, a four-layer fully connected network, as the base model. Finally, to create u.u.s in the test set of the chosen datasets, we remove certain classes of data from the training set while keeping the test set unchanged, as sketched below. To be consistent with previous methods, we evaluate the F-measure of our method and other baselines against the openness [Scheirer et al., 2013].

³UCI machine learning repository: http://archive.ics.uci.edu/ml
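A minimal sketch of this protocol (ours), assuming integer labels; the choice of which classes to hold out is illustrative:

```python
# Sketch: create u.u.s by removing selected classes from the training set,
# while leaving the test set untouched.
import numpy as np

def make_open_set_split(X_train, y_train, unknown_classes):
    keep = ~np.isin(y_train, list(unknown_classes))
    return X_train[keep], y_train[keep]

# e.g., hold out 7 of the 26 Letter classes as u.u.s (illustrative choice):
# X_tr, y_tr = make_open_set_split(X_train, y_train, unknown_classes=range(19, 26))
```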
Table 1: Experimental results on tabular datasets. F-measure (%) is reported as Pre-rectified / EVM / 1-vs-Set / WSVM / RTSCV; "-" marks methods that do not apply to the base model.

Dataset (base model) | Openness (%) | Pre-rectified / EVM / 1-vs-Set / WSVM / RTSCV
Letter (SVM)    | 14.5 | 69.8 / 89.8 / 72.8 / 91.2 /
Pendigits (SVM) | 9.3  | 78.3 / 97.0 / 75.4 / 93.1 /
COIL-20 (SVM)   | 9.3  | 88.4 /      / 70.2 / 85.6 / 95.1
COIL-20 (SVM)   | 18.4 | 78.7 / 93.2 / 55.7 / 84.5 /
MNIST (MLP)     | 13.4 | 59.6 / -    / -    / -    /
Openness. The u.u.s may be divided into different classes that span different geometric regions of the feature space. The openness metric proposed by [Scheirer et al., 2013] increases with the number of u.u.s classes. A larger openness indicates a larger number of u.u.s classes relative to that of known classes in the test data:

$$\text{openness} = 1 - \sqrt{\frac{2 \times |\text{training classes}|}{|\text{test classes}| + |\text{target classes}|}} \tag{5}$$

F-measure is the harmonic mean of precision and recall. In our multi-class scenario, it is obtained by averaging the classwise F-measures, combining the classification accuracy of both the known and u.u.s classes.
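For concreteness, a one-line helper for Eq. (5); under one consistent reading (ours, not stated in the paper), keeping 19 of Letter's 26 classes while testing against all 26 gives an openness of about 14.5%, matching the Letter row of Table 1.

```python
# Sketch: openness per Eq. (5).
import math

def openness(n_train_classes, n_test_classes, n_target_classes):
    return 1 - math.sqrt(2 * n_train_classes / (n_test_classes + n_target_classes))

print(round(openness(19, 26, 26), 3))  # 0.145 -> about 14.5%
```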
We present the experimental results on all four datasets in Table 1. The F-measure of the classifier is reported as a function of openness, which we vary by controlling the number of known classes removed from the training set. RTSCV achieves a consistent performance improvement over the pre-rectified model, closing a performance gap as large as 41% in some cases. Coupled with the SVM, RTSCV beats previous u.u.s detection methods on 5 out of 6 settings, while maintaining a small disadvantage to the EVM on COIL-20 with openness 9.3%. RTSCV with an MLP on MNIST performs best at the largest openness, i.e., 42.3%, where 9 out of 10 classes are selected as u.u.s during testing. This further attests to the robustness of RTSCV under a disproportionate amount of u.u.s in the test data.
4.2 Computer Vision Datasets

We also evaluate RTSCV on more challenging pattern recognition tasks on computer vision datasets. Specifically, we select CIFAR-10 and CIFAR-100, both containing colored object images [Krizhevsky, 2009], as well as the Street View House Numbers (SVHN) dataset from the Google Street View project [Netzer et al., 2011]. All image dimensions in these datasets are $32 \times 32$.

For each of the computer vision datasets, we separately test our RTSCV framework with ResNet [He et al., 2016] and DenseNet [Huang et al., 2017], two network architectures that have achieved good performance on large benchmark datasets (see the supplementary material for details on the training configuration). For comparison, we include the results of the baseline method by [Hendrycks and Gimpel, 2017], the ODIN detector [Liang et al., 2018], and the Mahalanobis method (MD) [Lee et al., 2018], three previous approaches for detecting OOD samples (u.u.s) with neural networks. Different from Section 4.1, here we create u.u.s in the test set by introducing additional computer vision datasets to the target space. Specifically, we use the resized versions of Tiny-ImageNet [Deng et al., 2009] and LSUN [Yu et al., 2015], following the techniques in [Liang et al., 2018, Lee et al., 2018]. We present a summary of the known and u.u.s datasets we used in Table 2. We adopt the following metrics to evaluate performance on both known-class classification and u.u.s detection.
Classification accuracy is the accuracy of the known-class classification, i.e., the total number of correct predictions on the known class labels divided by the total number of known-class test samples.

Detection accuracy is the number of u.u.s in the test data that are correctly detected by the model divided by the total number of u.u.s.

AUROC depicts the relationship between the true positive rate (TPR) and the false positive rate (FPR) [Davis and Goadrich, 2006]. A higher AUROC indicates a higher probability for a positive instance to rank higher than a negative one.

Table 2: Experimental results of the RTSCV framework on computer vision datasets. Entries follow the order Baseline / ODIN / MD / RTSCV; "-" marks values not reported by a method. Bold and underlined numbers represent the best and second best results, respectively.

Dataset (model)      | u.u.s    | Classification Acc. (%) | Detection Acc. (%)   | AUROC (%)
CIFAR-10 (ResNet)    | SVHN     | 93.9 / - / 93.9 /       |                      |
CIFAR-10 (ResNet)    | ImageNet | 93.9 / - / 93.9 /       |                      |
CIFAR-10 (ResNet)    | LSUN     | 93.9 / - / 93.9 /       |                      |
CIFAR-100 (ResNet)   | SVHN     | 75.6 / - / 74.8 /       |                      |
CIFAR-100 (ResNet)   | ImageNet | 75.6 / - / 74.8 /       |                      |
CIFAR-100 (ResNet)   | LSUN     | 75.6 / - / 74.8 /       |                      |
SVHN (ResNet)        | CIFAR-10 |  / - / 95.7 / 94.5      | 90.0 / 89.4 / 96.9 / |
SVHN (ResNet)        | ImageNet |  / - / 95.7 / 94.7      | 90.4 / 89.4 /  / 98.4 | 93.5 / 92.0 /  /
SVHN (ResNet)        | LSUN     |  / - / 95.7 / 95.0      | 89.0 / 87.2 / 99.5 / |
CIFAR-10 (DenseNet)  | SVHN     | 92.9 / - / 91.7 /       |                      |
CIFAR-10 (DenseNet)  | ImageNet | 92.9 / - / 91.7 /       |                      |
CIFAR-10 (DenseNet)  | LSUN     | 92.9 / - / 91.7 /       |                      |
CIFAR-100 (DenseNet) | SVHN     | 72.3 / - / 68.2 /       |                      |
CIFAR-100 (DenseNet) | ImageNet | 72.3 / - / 68.2 /       |                      |
CIFAR-100 (DenseNet) | LSUN     | 72.3 / - / 68.2 /       |                      |
SVHN (DenseNet)      | CIFAR-10 | 95.3 / - / 95.2 /       |                      |
SVHN (DenseNet)      | ImageNet | 95.3 / - / 95.2 /       |  /  /  / 97.4        | 94.8 / 95.1 /  /
SVHN (DenseNet)      | LSUN     | 95.3 / - / 95.2 /       |                      |
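A short sketch (ours, not the paper's evaluation code) of how these three metrics can be computed from predictions, assuming the u.u.s class carries label `m` and scikit-learn for the AUROC:

```python
# Sketch: known-class classification accuracy, u.u.s detection accuracy, AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, uus_score, m):
    """y_true, y_pred: labels in {0, ..., m}, with m the u.u.s class.
    uus_score: higher means "more likely u.u." (e.g., dummy-class probability)."""
    known = y_true != m
    cls_acc = np.mean(y_pred[known] == y_true[known])       # classification accuracy
    det_acc = np.mean(y_pred[~known] == m)                  # detection accuracy
    auroc = roc_auc_score((~known).astype(int), uus_score)  # u.u. as positive class
    return cls_acc, det_acc, auroc
```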
We display the experimental results in Table 2. RTSCV coupled with ResNet or DenseNet achieves the overall best performance with respect to all of the evaluation metrics. In particular, it yields a significant improvement in u.u.s detection accuracy while maintaining a high classification accuracy for the known classes. This suggests that RTSCV is the most effective at identifying u.u.s, without degrading the original model, even on more complex datasets with more complex classification models.
5 Conclusion

With the goal of reducing deployment errors and model bias due to deficient training data, RTSCV is an algorithmic framework that adds the flexibility for a base classification model to be rectified at deployment time. We provide a rigorous theoretical analysis of the correctness and performance guarantees of the process of minimizing a model's structural mismatch with a target space, based on objectives that most modern classifiers aim to optimize. RTSCV exhibits consistent performance improvements over both the pre-rectified model and previously proposed approaches that share the same goals on 7 benchmark datasets. Moreover, it does not assume the presence of an oracle, as in the case of active learning.

Our ongoing work focuses on improvements to RTSCV, especially due to concerns with the computational cost of cross-validation. We are developing alternatives to cross-validation that could work equally well while reducing the computational cost dramatically. For instance, we are investigating semi-supervised clustering [Zhu, 2008, Basu et al., 2002]. In the supplementary materials, we discuss our initial efforts in this direction. Our preliminary results suggest that such an approach works equally well as RTSCV when the u.u.s consist of only one cluster and have a relatively small covariance. Nevertheless, when the u.u.s form multiple clusters or the clusters have high variance, the performance of this alternative approach drops significantly compared to that of RTSCV.
Acknowledgments

This research was partially supported by a National Natural Science Foundation of China (NSFC) grant.
References

Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Eric Horvitz. Identifying unknown unknowns in the open world: Representations and policies for guided exploration. In Proc. of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR, 2017.

Shiyu Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. ICLR, 2018.

Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, volume 31, pages 7167–7177, 2018.

Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems (NeurIPS), 2020.

W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

W. J. Scheirer, L. P. Jain, and T. E. Boult. Probability models for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

Pedro Ribeiro Mendes Júnior, Roberto Medeiros de Souza, Rafael de Oliveira Werneck, Bernardo V. Stein, Daniel V. Pazinato, Waldir R. de Almeida, Otávio Augusto Bizetto Penatti, Ricardo da Silva Torres, and Anderson Rocha. Nearest neighbors distance ratio open-set classifier. Machine Learning, 2016.

Abhijit Bendale and Terrance E. Boult. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Colin Vandenhof and Edith Law. Contradict the machine: A hybrid approach to identifying unknown unknowns. In AAMAS, 2019.

Patrice Y. Simard, Saleema Amershi, David Maxwell Chickering, Alicia Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh, Johan Verwey, Mo Wang, and John Robert Wernsing. Machine teaching: A new paradigm for building machine learning systems. ArXiv, abs/1707.06742, 2017.

Lalit P. Jain, Walter J. Scheirer, and Terrance E. Boult. Multi-class open set recognition using probability of inclusion. In ECCV, 2014.

Dongha Lee, Sehun Yu, and Hwanjo Yu. Multi-class data description for out-of-distribution detection. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, New York, NY, USA, 2020.

R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, and T. Naemura. Classification-reconstruction learning for open-set recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Poojan Oza and Vishal M. Patel. C2AE: Class conditioned auto-encoder for open-set recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019. doi:10.1109/CVPR.2019.00241.

C. Geng, S. Huang, and S. Chen. Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Terrance Boult, S. Cruz, Akshay Dhamija, Manuel Günther, James Henrydoss, and W. J. Scheirer. Learning and the unknown: Surveying steps toward open world recognition. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019. doi:10.1609/aaai.v33i01.33019801.

E. M. Rudd, L. P. Jain, W. J. Scheirer, and T. E. Boult. The extreme value machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11(1):131–167, July 1999. ISSN 1076-9757.

M. Murty and V. Devi. Pattern Recognition: An Algorithmic Approach. 2011. doi:10.1007/978-0-85729-495-1.

Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition, Fourth Edition. Academic Press, Inc., USA, 4th edition, 2008. ISBN 1597492728.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi:10.1109/CVPR.2016.90.

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. doi:10.1109/CVPR.2017.243.

Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia object image library (COIL-20). Technical report, 1996.

Chuanxing Geng and Songcan Chen. Collective decision for open set recognition. IEEE Transactions on Knowledge and Data Engineering, 2020. doi:10.1109/tkde.2020.2978199.

Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Ng. Reading digits in natural images with unsupervised feature learning. NIPS, 2011.

J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. doi:10.1109/CVPR.2009.5206848.

Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. 2015.

Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006.

Xiaojin Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 07 2008.

Sugato Basu, Arindam Banerjee, and R. Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002.
Supplementary Materials

Proof of Theorem 3.1.
Let us first prove the case of $x \sim X_k$. Writing $\mathcal{N}(a; \mu, \Sigma)$ for the Gaussian density with mean $\mu$ and covariance $\Sigma$ evaluated at $a$, we know that

$$\mathbb{E}_{x \sim X_k}(L_k(x)) = \int_{\mathbb{R}^d} L_k(x)^2 \, dx = \mathcal{N}(\mu_k; \mu_k, 2\Sigma_k)$$
$$\mathbb{E}_{x \sim X_k}(L_i(x)) = \int_{\mathbb{R}^d} L_i(x) L_k(x) \, dx = \mathcal{N}(\mu_i; \mu_k, \Sigma_k + \Sigma_i)$$

Expanding the two terms, we have

$$\mathbb{E}_{x \sim X_k}(L_k(x)) = \frac{1}{(2\pi)^{d/2} \sqrt{|2\Sigma_k|}} \exp\left(-\frac{1}{2}(\mu_k - \mu_k)^T (2\Sigma_k)^{-1} (\mu_k - \mu_k)\right) = \frac{1}{(2\pi)^{d/2} \sqrt{|2\Sigma_k|}}$$

$$\mathbb{E}_{x \sim X_k}(L_i(x)) = \frac{1}{(2\pi)^{d/2} \sqrt{|\Sigma_k + \Sigma_i|}} \exp\left(-\frac{1}{2}(\mu_i - \mu_k)^T (\Sigma_k + \Sigma_i)^{-1} (\mu_i - \mu_k)\right)$$

Recall that $\Sigma_i = \sigma_i^2 I$ and $\Sigma_u = \sigma_u^2 I$. Setting $\mathcal{N}(\mu_k; \mu_k, 2\Sigma_k) > \mathcal{N}(\mu_i; \mu_k, \Sigma_k + \Sigma_i)$, we have

$$\|\mu_k - \mu_i\|^2 \ge d \cdot (\sigma_i^2 + \sigma_k^2) \cdot \ln\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)$$

If the above conditions hold for all classes other than class $k$, then

$$\mathbb{E}_{x \sim X_k}(L_k(x)) = P(X_u) \mathbb{E}_{x \sim X_k}(L_k(x)) + \sum_{i=1}^{m} P(X_i) \mathbb{E}_{x \sim X_k}(L_k(x)) > \mathbb{E}_{x \sim X_k}(L_s(x))$$

since $P(X_u) + \sum_{i=1}^{m} P(X_i) = 1$. Similarly, for the second case of $x \sim X_u$, we can compute that $\mathbb{E}_{x \sim X_u}(L_k(x)) = \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$. Notice that

$$\mathbb{E}_{x \sim X_u}(L_s(x)) > P(X_u) \mathcal{N}(\mu_u; \mu_u, 2\Sigma_u) + P(X_k) \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$$

Hence, to obtain a sufficient condition for $\mathbb{E}_{x \sim X_u}(L_s(x)) > \mathbb{E}_{x \sim X_u}(L_k(x))$, we simply need to require $P(X_u) \mathcal{N}(\mu_u; \mu_u, 2\Sigma_u) > (1 - P(X_k)) \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$, which yields

$$\frac{\|\mu_k - \mu_u\|^2}{\sigma_k^2 + \sigma_u^2} \ge 2\ln\left(\frac{1 - P(X_k)}{P(X_u)}\right) + d \cdot \ln\left(\frac{2\sigma_u^2}{\sigma_k^2 + \sigma_u^2}\right)$$
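The proof repeatedly uses the standard Gaussian product-integral identity, which we record here for completeness (a known fact, not part of the paper's statement):

```latex
% Product of two Gaussian densities integrates to a Gaussian density in the means:
\int_{\mathbb{R}^d} \mathcal{N}(x;\mu_1,\Sigma_1)\,\mathcal{N}(x;\mu_2,\Sigma_2)\,dx
  = \mathcal{N}(\mu_1;\,\mu_2,\,\Sigma_1 + \Sigma_2)
% With \mu_1 = \mu_2 = \mu_k and \Sigma_1 = \Sigma_2 = \Sigma_k, this gives
% \mathbb{E}_{x\sim X_k}(L_k(x)) = \mathcal{N}(\mu_k;\mu_k,2\Sigma_k).
```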
Proof of Theorem 3.2. For $x \sim X_k$, to show that $\mathbb{E}_{x \sim X_k}(p(X_k \mid x)) > \mathbb{E}_{x \sim X_k}(p(X_s \mid x))$, by Bayes' theorem we just need to show

$$\mathbb{E}_{x \sim X_k}(L_k(x) P'(X_k)) > \mathbb{E}_{x \sim X_k}(L_s(x) P'(X_s))$$

From the proof of Theorem 3.1, we have:

$$\mathbb{E}_{x \sim X_k}(L_k(x) P'(X_k)) = P'(X_k) \, \mathcal{N}(\mu_k; \mu_k, 2\Sigma_k)$$
$$\mathbb{E}_{x \sim X_k}(L_i(x) P'(X_s)) = P'(X_s) \, \mathcal{N}(\mu_i; \mu_k, \Sigma_k + \Sigma_i)$$

Setting $P'(X_k) \mathcal{N}(\mu_k; \mu_k, 2\Sigma_k) > P'(X_s) \mathcal{N}(\mu_i; \mu_k, \Sigma_k + \Sigma_i)$, we obtain:

$$\frac{\|\mu_k - \mu_i\|^2}{\sigma_k^2 + \sigma_i^2} \ge 2\ln\left(\frac{P'(X_s)}{P'(X_k)}\right) + d \cdot \ln\left(\frac{2\sigma_k^2}{\sigma_k^2 + \sigma_i^2}\right)$$

For $x \sim X_u$, we just need to show that $\mathbb{E}_{x \sim X_u}(L_s(x) P'(X_s)) > \mathbb{E}_{x \sim X_u}(L_k(x) P'(X_k))$. From the proof of Theorem 3.1, we know that:

$$\mathbb{E}_{x \sim X_u}(L_s(x) P'(X_s)) > P'(X_s) P(X_u) \, \mathcal{N}(\mu_u; \mu_u, 2\Sigma_u) + P'(X_s) P(X_k) \, \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$$
$$\mathbb{E}_{x \sim X_u}(L_k(x) P'(X_k)) = P'(X_k) \, \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$$

Letting $P'(X_s) P(X_u) \mathcal{N}(\mu_u; \mu_u, 2\Sigma_u) > \left[P'(X_k) - P'(X_s) P(X_k)\right] \mathcal{N}(\mu_k; \mu_u, \Sigma_u + \Sigma_k)$ gives the desired sufficient condition:

$$\frac{\|\mu_k - \mu_u\|^2}{\sigma_k^2 + \sigma_u^2} \ge 2\ln\left(\frac{P'(X_k) - P'(X_s) P(X_k)}{P'(X_s) P(X_u)}\right) + d \cdot \ln\left(\frac{2\sigma_u^2}{\sigma_k^2 + \sigma_u^2}\right)$$
Proof of Theorem 3.3. For $x \sim X_k$, we know from statistical theory that $D_M(x, X_k)$ has a $\chi^2_d$ distribution with $d$ degrees of freedom. So we have

$$\mathbb{E}_{x \sim X_k}(D_M(x, X_k)) = \mathbb{E}_{x \sim \chi^2_d}(x) = d$$

Let us now investigate the distribution of $D_M(x, X_s)$. As $\Sigma_s$ is real, symmetric, and diagonalizable, it has an orthogonal decomposition:

$$\Sigma_s = U \Lambda U^{-1} = U \Lambda U^T = \sum_{j=1}^{d} \lambda_j u_j u_j^T, \qquad \Sigma_s^{-1} = U \Lambda^{-1} U^{-1} = U \Lambda^{-1} U^T = \sum_{j=1}^{d} \lambda_j^{-1} u_j u_j^T$$

where $\{\lambda_j\}_{j=1}^{d}$ are the eigenvalues in $\Lambda$ and $\{u_j\}_{j=1}^{d}$ are the corresponding eigenvectors. Plugging the decomposition into the formula for $D_M(x, X_s)$, we get

$$D_M(x, X_s) = (x - \mu_s)^T \Sigma_s^{-1} (x - \mu_s) = \sum_{j=1}^{d} \left[\lambda_j^{-1/2} u_j^T (x - \mu_s)\right]^2 = \sum_{j=1}^{d} Y_j^2$$

Since $x \sim \mathcal{N}(\mu_k, \Sigma_k)$, $Y_j = \lambda_j^{-1/2} u_j^T (x - \mu_s)$ is an affine transformation of a multivariate Gaussian distribution, so $Y_j$ has a univariate normal distribution with

$$\mathbb{E}(Y_j) = \lambda_j^{-1/2} u_j^T (\mathbb{E}(x) - \mu_s) = \lambda_j^{-1/2} u_j^T (\mu_k - \mu_s), \qquad \operatorname{Var}(Y_j) = \lambda_j^{-1/2} u_j^T \Sigma_k \, \lambda_j^{-1/2} u_j = \frac{\sigma_k^2}{\lambda_j}$$

Therefore, we can infer that

$$\mathbb{E}(Y_j^2) = \frac{\sigma_k^2}{\lambda_j} + \left[\lambda_j^{-1/2} u_j^T (\mu_k - \mu_s)\right]^2$$

$$\mathbb{E}_{x \sim X_k}(D_M(x, X_s)) = \sum_{j=1}^{d} \mathbb{E}(Y_j^2) = \sigma_k^2 \sum_{j=1}^{d} \lambda_j^{-1} + \sum_{j=1}^{d} \frac{\left[u_j^T (\mu_k - \mu_s)\right]^2}{\lambda_j}$$

Here, since we assume $\Sigma_s = \sigma_s^2 I$, $\{u_j\}_{j=1}^{d}$ is the canonical basis of $\mathbb{R}^d$. So the formula can be further simplified as:

$$\mathbb{E}_{x \sim X_k}(D_M(x, X_s)) = d \cdot \frac{\sigma_k^2}{\sigma_s^2} + \frac{\|\mu_k - \mu_s\|^2}{\sigma_s^2}$$

Setting $\mathbb{E}_{x \sim X_k}(D_M(x, X_k)) < \mathbb{E}_{x \sim X_k}(D_M(x, X_s))$, we obtain the sufficient condition:

$$\|\mu_k - \mu_s\|^2 > d \cdot \sigma_s^2 \cdot \left(1 - \frac{\sigma_k^2}{\sigma_s^2}\right)$$

Similarly, for the case of $x \sim X_u$, following the above results we have:

$$\mathbb{E}_{x \sim X_u}(D_M(x, X_s)) = d \cdot \frac{\sigma_u^2}{\sigma_s^2} + \frac{\|\mu_u - \mu_s\|^2}{\sigma_s^2}, \qquad \mathbb{E}_{x \sim X_u}(D_M(x, X_k)) = d \cdot \frac{\sigma_u^2}{\sigma_k^2} + \frac{\|\mu_u - \mu_k\|^2}{\sigma_k^2}$$

Setting $\mathbb{E}_{x \sim X_u}(D_M(x, X_s)) < \mathbb{E}_{x \sim X_u}(D_M(x, X_k))$, we have:

$$\frac{\|\mu_u - \mu_k\|^2}{\sigma_k^2} - \frac{\|\mu_u - \mu_s\|^2}{\sigma_s^2} > d \cdot \left(\frac{\sigma_u^2}{\sigma_s^2} - \frac{\sigma_u^2}{\sigma_k^2}\right)$$
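A quick numerical check (ours) of the closed form $\mathbb{E}_{x \sim X_k}(D_M(x, X_s)) = d\,\sigma_k^2/\sigma_s^2 + \|\mu_k - \mu_s\|^2/\sigma_s^2$ derived above:

```python
# Monte Carlo check of E[D_M(x, X_s)] for x ~ N(mu_k, sigma_k^2 I), Sigma_s = sigma_s^2 I.
import numpy as np

rng = np.random.default_rng(0)
d, sigma_k, sigma_s = 5, 1.5, 2.0
mu_k, mu_s = np.zeros(d), np.full(d, 3.0)

x = rng.normal(mu_k, sigma_k, size=(200_000, d))
d_m = ((x - mu_s) ** 2).sum(axis=1) / sigma_s**2            # squared Mahalanobis distance
closed_form = d * sigma_k**2 / sigma_s**2 + ((mu_k - mu_s) ** 2).sum() / sigma_s**2
print(d_m.mean(), closed_form)                              # ~14.06 for both
```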
Table 3: ResNet training configuration.

Parameter | Value
optimizer | SGD with Nesterov momentum
momentum | 0.9
learning rate | 5e-4
epochs | 100
learning rate scheduler | learning rate decreases by 50% after every 20 epochs
cross-validation folds | 3
number of layers | 34

Table 4: DenseNet training configuration.

Parameter | Value
optimizer | SGD with Nesterov momentum
momentum | 0.9
learning rate | 5e-4
epochs | 100
learning rate scheduler | learning rate decreases by 50% after every 20 epochs
cross-validation folds | 3
number of layers | 100
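As a sketch, assuming PyTorch, the configuration in Tables 3 and 4 corresponds roughly to the following optimizer and scheduler setup (model construction elided):

```python
# Sketch of the training configuration in Tables 3-4, assuming PyTorch.
import torch

def make_optimizer_and_scheduler(model):
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-4,
                                momentum=0.9, nesterov=True)
    # Learning rate decreases by 50% after every 20 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    return optimizer, scheduler

# Training then runs for 100 epochs, calling scheduler.step() once per epoch.
```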
Figure 3: Comparison between Clustering with Side Information (CSI) and our RTSCV methods under differentsynthetic dataset settings. Top: There is one u.u. cluster for the left plot and two u.u. clusters for the right plot. For bothplots the u.u. class is located far away from the 10 known classes and only the covariance of the u.u. class is alteredacross different trials.We also believe that it is worth discussing the possibility of resorting to semi-supervised clustering as an alternativeto cross-validation during the process of re-classifying sample class X s , given its increasing popularity and the greatpotential of being more computationally economical. Given a small amount of labeled data, semi-supervised clusteringperforms ordinary clustering tasks under the constraints of must-links (two points must be in the same cluster) and cannot-links (two points cannot be in the same cluster), provided by the labeled data [Zhu, 2008]. In our scenario, theobjective of re-classifying X s can be viewed equivalent to dividing X s into several clusters, one of which correspondsto either a known or u.u. class, with the assistance of the labeled data from the entire training set. This is also called clustering with side information (CSI) in the literature [Zhu, 2008].To test this alternative, we adopt a novel but simple method called Seeded-KMeans [Basu et al., 2002]. Specifically,given M known classes X , X , . . . , X M in the training set, we run an ( M + 1) -Means clustering algorithm on sampleset X s , with the initial centers of each cluster set to the mean feature vectors of X , X , . . . , X M and X s , respectively.After the clustering converges, we assign the label of each cluster of X s according to the class membership of the initialseeding of the corresponding center. In other words, a cluster initially seeded by the mean of some known class X k willbe labeled as X k , and a cluster initially seeded by the mean of X s will be labeled as the u.u. class.Our primary experiment suggests that such an approach works equally well as the RTSCV method when the u.u.sconsist of only one cluster (sub-class) and are far away from the known base classes. As illustrated in the top-left plotof Figure 3, in such a setting CSI has a very similar OSR performance as our RTSCV, under different covariance levelsof the u.u. class. Nevertheless, when the u.u.s form multiple clusters or are close to the base classes, the performance ofCSI plunges significantly, as illustrated in the top-right and bottom plots of Figure 3. This is possibly because of thelarge inconsistency between the mean of X s as the initial seed of the u.u. class and the true u.u.s distribution. Bottom:There is one u.u. cluster, whose distance to the known base classes is altered across trials.In response to that, one potential improvement of the CSI method might be to incorporate some priors on the distributionof the u.u. class, i.e., the number of sub-classes or the means of them, with light involvement of human experts. Webelieve that this is a very promising direction for future works.15odel Rectification via Unknown Unknowns Extraction from Deployment Samples Figure 4: RTSCV decision boundaries after cross-validation using SVM, fitted on the entire augmented training setconsisting of 10 known classes and the sample class X s . The black region represents the dummy (u.u.s) class where weintend to trap the u.u.s. 
Figure 4: RTSCV decision boundaries after cross-validation using SVM, fitted on the entire augmented training set consisting of 10 known classes and the sample class $X_s$. The black region represents the dummy (u.u.s) class where we intend to trap the u.u.s. Points from the sample class are represented by triangles with pseudo-label 10, while points from one of the known classes are represented by squares with the respective class label.

Figure 5: RTSCV decision boundaries after cross-validation using KNN, fitted on the entire augmented training set consisting of 10 known classes and the sample class $X_s$. The black region represents the dummy (u.u.s) class where we intend to trap the u.u.s. Points from the sample class are represented by triangles with pseudo-label 10, while points from one of the known classes are represented by squares with the respective class label.

Figure 6: RTSCV decision boundaries after cross-validation using a Decision Tree, fitted on the entire augmented training set consisting of 10 known classes and the sample class $X_s$.