Ground-truth or DAER: Selective Re-query of Secondary Information
Stephan J. Lemmer, Robotics Institute, University of Michigan, Ann Arbor, USA. [email protected]
Jason J. Corso, Electrical Engineering and Computer Science, Robotics Institute, University of Michigan, Ann Arbor, USA. [email protected]
Abstract
Many vision tasks require side information at inference time (a seed) to fully specify the problem. For example, an initial object segmentation is needed for video object segmentation. To date, all such work makes the tacit assumption that the seed is a good one. However, in practice, from crowdsourcing to noisy automated seeds, this is not the case. We hence propose the novel problem of seed rejection: determining whether to reject a seed based on the expected degradation relative to the gold standard. We provide a formal definition of this problem, and focus on two challenges: distinguishing poor primary inputs from poor seeds, and understanding the model's response to noisy seeds conditioned on the primary input. With these challenges in mind, we propose a novel training method and evaluation metrics for the seed rejection problem. We then validate these metrics and methods on two problems which use seeds as a source of additional information: keypoint-conditioned viewpoint estimation with crowdsourced seeds and hierarchical scene classification with automated seeds. In these experiments, we show our method reduces the number of seeds that need to be reviewed for a target performance by up to 23% over strong baselines.
1. Introduction
Many tasks in computer vision require not only a primary input, such as an image or video, but also a secondary input: a seed. This seed may be used to define the problem, as in visual object tracking [25], video object segmentation [27], and visual question answering [1], or to provide additional information for tasks such as hierarchical scene classification [24], multi-label image classification [39], or keypoint-conditioned viewpoint estimation [36].
Figure 1: When determining whether or not to accept a seed, the decision must take into account not only how close the seed is to the correct seed (green) but also the characteristics of the model and image. A seed that is closer to correct in the input space (red) might significantly increase the error, while a seed significantly further away (yellow) may not.

The performance of computer vision models with poor primary inputs has been explored in the context of naturally difficult [38, 44, 6, 12] and intentionally adversarial [40, 41, 7, 35] primary inputs, leading to a variety of methods designed to make models more robust [38, 44] or to detect and reject difficult inputs [12]. While some work has focused on obtaining better seeds through choosing what information to request from an annotator [15, 2], no work to our knowledge has been performed on the identification and rejection of seeds which cause a significant increase in error on the task. This is a critical oversight, as seeds are likely to be unreliable in unpredictable ways. The reliability (or lack thereof) of crowdsourced seeds is well studied [22, 28, 32, 31], and while modern deep learning systems are continuously pushing boundaries, they are still subject to counterintuitive failure modes [30].

In this work, we begin to resolve this critical oversight by directly studying the problem of seed rejection. In seed rejection, we seek to reject seeds that significantly degrade the performance of the task model, which estimates the target value based on the primary input and seed. Seed rejection introduces two unique challenges:

Understanding the Cause of Error: The first challenge is distinguishing between poor outputs due to poor seeds and poor outputs due to poor primary inputs. If the source of the error is the primary input (Figure 2b), little benefit would be obtained from requesting another seed. While the task of selective prediction has been proposed for handling bad primary inputs, no work to our knowledge has been performed on the task of rejecting bad seeds independent of the quality of the primary input.
Understanding the Task Model Response: Next, we must gain an understanding of the model's response, and how a human's intuition of a seed's quality differs from its effect on the accuracy of the model's output. For example, in Figure 1, we see that a very small Euclidean error in the input space can induce a significant increase in error in the output space, while a much larger Euclidean error can have little effect. Similarly, the model may have very strong priors based solely on the primary input, which means it will ignore the (potentially noisy) additional information that is provided by the seed (Figure 2a).

We address these challenges via Dual-loss Additional Error Regression (DAER), a novel training method developed for the seed rejection problem. DAER considers the two challenges discussed above separately, and combines them to produce an estimate of the effect of the seed on the downstream task.

To compare DAER to baselines, we introduce three novel metrics: Additional Error (AE), Mean Additional Error (MAE), and Area under the Mean Additional Error curve (AMAE). Instead of simply measuring the error of accepted samples like previous metrics [13], these three metrics measure how much error in the accepted samples could be corrected by obtaining the correct seed.

We evaluate the performance of the proposed method and metrics on two separate tasks, both of which use the seed to provide additional evidence: keypoint-conditioned viewpoint estimation [36] and Plugin Networks for hierarchical scene classification [24]. In addition, we develop further understanding of the task models' response to incorrect seeds on both of these tasks through an exhaustive sampling. As models which use this strategy universally assume that this information is correct, this is, to our knowledge, the first instance of such an analysis.

The contributions of this paper are as follows:

1. Introduction and definition of the seed rejection problem.
2. The metrics of Additional Error (AE), Mean Additional Error (MAE), and Area under the Mean Additional Error curve (AMAE) for evaluating the task of seed rejection.
3. Dual-loss Additional Error Regression (DAER): a novel training and inference method for seed rejection.
4. An exhaustive sampling of the response of both task models to incorrect seeds, justifying the use of DAER over direct regression.
2. Related Work
Seeded inference describes a number of problems in which a primary input and a seed are provided at inference time to estimate a target. Broadly speaking, the seed can be categorized across two axes. The first axis is whether the seed is provided by a human [3, 33] or a separate automated system [14, 29]. The second is whether the seed is used to fully define the task, such as initial bounding boxes for single-target object tracking [25], or used to further inform the inference (sometimes referred to as "inference under partial evidence" [39]). In this work, we consider both crowdsourced and automated sources of this additional information through two tasks: keypoint-conditioned viewpoint estimation [36] and hierarchical scene categorization [24, 39, 19].

In general, whether the seed is used to define the task or to provide extra information is easy to determine. For example, the tasks of video object segmentation [27], single-target tracking [25], and others [1, 21, 20] are clear cases in which the problem is not fully defined until inference time, while in cases such as keypoint-conditioned viewpoint estimation [36], hierarchical scene classification [24], multi-label image annotation [23], and visual concept prediction [39], a model could achieve better than random performance without the benefit of the seed. However, determining the mechanism through which the model obtains the seed is often less clear. In some cases the use-case is specified by the problem, such as visual question answering [1], which processes questions posed by humans, or visual servoing based on object segmentations [14], which uses an automatic segmentation method.

In other cases, the algorithm simply asserts that the seed exists, making it difficult to know the intended use. Challenges in the video space [25, 27] typically just say that the first frame is given, while many methods which use seeds to provide additional information [24, 39, 19] simply use the gold-standard for both training and inference. We note that regardless of the method used to provide the seed, it is assumed that the seed is true.
Figure 2: In many cases, the task network conditions strongly enough on the primary input that it will either perform well (2a) or poorly (2b) regardless of the quality of the seed.
A problem closely related to seed rejection is the problem of selective prediction [4, 12], which does not consider the effects of seeds, but guesses whether a model is likely to provide the correct answer based on the primary input. Selective prediction has been applied to many approaches over time, from nearest neighbors in the 1970s [18], to support vector machines in the early 2000s [8], to deep artificial neural networks today [43, 12].

Recent literature focuses on extending selective prediction to deep networks, though approaches vary. Yildirim et al. [43] perform mixed integer programming on dropout neural networks to build a binary classifier that takes into account not only accuracy, but also the cost of a misclassification. Geifman & El-Yaniv [11] show softmax response is a superior rejection mechanism to Monte Carlo dropout [9] in multi-class classification, and develop an algorithm which guarantees risk will fall below a certain threshold. In later work, they introduce SelectiveNet, which, much like work done in support vector machines [8], notes that performance is better if rejection is defined during training, and includes the reject option as part of its architecture [12].
3. Seed Rejection
3.1. Problem Definition

In seed rejection, we attempt to reject seeds which cause a significant increase in error when compared with a gold-standard seed. Formally stated, we begin with a task model, $f(x, s)$, which accepts a primary input, $x$, and a candidate or gold-standard seed, $s \in \{s_c, s_{gs}\}$. We seek to develop a rejection model, $g_\theta(x, s_c) \in \{0, 1\}$, which accepts a certain proportion (coverage, $c_d$) of candidate seeds used for inference such that the performance of the accepted set most closely matches the performance of the corresponding (unknown) gold-standard seeds.

We refer to the per-sample performance degradation as additional error, which is calculated:

$$\mathrm{AE}(x, s_c, s_{gs}, y \mid f, \ell) = \max\big(\ell(f(x, s_c), y) - \ell(f(x, s_{gs}), y),\ 0\big), \quad (1)$$

where $y$ represents the target value and $\ell$ represents a problem-specific performance measure. We note that the max operator ensures that the model is never penalized for accepting the gold-standard seed over a candidate seed that is incorrect in input space. With this definition of performance degradation, the problem of seed rejection for a desired coverage can be treated as finding an approximate solution to the optimization problem:

$$\arg\min_\theta \sum_{(x, s_c, s_{gs}, y) \in \mathcal{D}} g_\theta(x, s_c)\, \mathrm{AE}(x, s_c, s_{gs}, y \mid f, \ell) \quad \text{such that} \quad \frac{1}{|\mathcal{D}|} \sum_{(x, s_c) \in \mathcal{D}} g_\theta(x, s_c) = c_d, \quad (2)$$

where $\mathcal{D}$ represents the set of all samples.
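To make Eq. (1) concrete, the following is a minimal Python sketch; `task_model` and `loss_fn` are hypothetical stand-ins for the task model $f$ and the problem-specific performance measure $\ell$.

```python
def additional_error(task_model, loss_fn, x, s_candidate, s_gold, y):
    """Per-sample additional error (Eq. 1): the part of the error that is
    attributable to the candidate seed rather than the primary input."""
    err_candidate = loss_fn(task_model(x, s_candidate), y)
    err_gold = loss_fn(task_model(x, s_gold), y)
    # Clamp at zero so a candidate that happens to beat the gold-standard
    # seed is not rewarded.
    return max(err_candidate - err_gold, 0.0)
```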
3.2. Metric

While the additional error metric proposed in the previous section provides a measure of the quality of a single seed, it does not provide a meaningful way to compare different rejection methods across a full dataset. In this section, we introduce the Mean Additional Error (MAE) and Area under the Mean Additional Error curve (AMAE) metrics, which can be used to compare methods side-by-side across a dataset.

As the name implies, the mean additional error corresponds to the mean additional error of all accepted samples:

$$\mathrm{MAE}(f, g_\theta \mid \mathcal{D}, \ell) = \frac{\sum_{(x, s_c, s_{gs}, y) \in \mathcal{D}} g_\theta(x, s_c)\, \mathrm{AE}(x, s_c, s_{gs}, y \mid f, \ell)}{\sum_{(x, s_c) \in \mathcal{D}} g_\theta(x, s_c)}. \quad (3)$$

While the mean additional error metric indicates which rejection model performs better at a given coverage, performance at a single coverage does not evaluate the true performance of a rejection model. In order to produce a single value with which to compare different rejection models, we produce a plot of mean additional error vs. coverage, and calculate the area underneath. We refer to this as the Area under the Mean Additional Error curve (AMAE), which is found empirically using the equation:

$$\mathrm{AMAE} = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \frac{\sum_{j=1}^{i} \mathrm{AE}(x^j, s_c^j, s_{gs}^j, y^j \mid f, \ell)}{i}, \quad (4)$$

where samples are indexed in the order in which the rejection model accepts them. As with the MAE, a lower AMAE indicates better performance.

3.3. Dual-loss Additional Error Regression

We approach the task of seed rejection by attempting to regress the additional error. This approach is conceptually similar to Gurari et al. [16], who applied a series of hand-crafted features to predict the intersection-over-union of segmentations generated by various methods. While using qualities of the output, such as the value of a softmax output, to estimate how likely it is that a classification is correct is an approach that is also used in selective prediction [11], this method is generally outperformed by methods specifically designed to predict how correct an answer is [8, 12].

For this reason, we perform seed rejection by attempting to regress the additional error directly through the training procedure shown in Figure 3. Critical to this training procedure is the separation of the additional error regression into two components corresponding to the challenges described in the introduction. The correctness loss, which addresses the challenge of understanding the cause of error, is a classifier which determines the likelihood that the seed is correct. The regression loss, which addresses the challenge of understanding the task model response, regresses the additional error given that the seed is incorrect. That is, the regression loss is only updated when the given seed is incorrect.

Mathematically, the correctness and regression outputs can be used to calculate the expected additional error using the formula:

$$\mathbb{E}[\mathrm{AE}(x^i, s_c^i, s_{gs}^i, y^i \mid f, \ell)] = p(\text{seed correct})\,\mathbb{E}[\mathrm{AE} \mid \text{seed correct}] + p(\neg\,\text{seed correct})\,\mathbb{E}[\mathrm{AE} \mid \neg\,\text{seed correct}]. \quad (5)$$

Since the additional error for the correct seed is always zero, the formula reduces to:

$$\mathbb{E}[\mathrm{AE}(x^i, s_c^i, s_{gs}^i, y^i \mid f, \ell)] = p(\neg\,\text{seed correct})\,\mathbb{E}[\mathrm{AE} \mid \neg\,\text{seed correct}]. \quad (6)$$

We use this calculation when predicting the additional error at inference time, but this formulation is not used during training. While the chosen method of separately learning the correctness and regression outputs is mathematically equivalent to regressing the additional error directly, we show in Section 5.2 that separating the two components significantly improves performance, and in Section 6 propose a reason why this is the case.
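As a sketch of how Eqs. (3) and (4) can be evaluated empirically, assume an array of per-sample additional errors and a rejection score where higher values are rejected first; both names below are illustrative.

```python
import numpy as np

def mae_at_coverage(scores, ae, coverage):
    """Mean additional error (Eq. 3) over the fraction `coverage` of
    samples the rejection model is most willing to accept."""
    order = np.argsort(scores)              # most-acceptable samples first
    n_keep = int(round(coverage * len(ae)))
    return ae[order[:n_keep]].mean() if n_keep > 0 else 0.0

def amae(scores, ae):
    """Area under the MAE-vs-coverage curve (Eq. 4): the running mean of
    additional error as samples are accepted in score order."""
    order = np.argsort(scores)
    running_mean = np.cumsum(ae[order]) / np.arange(1, len(ae) + 1)
    return running_mean.mean()
```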
4. Experimental Setup
Application of the generic method to a specific problem domain requires only problem-specific definitions of a correct seed, an error performance measure, and an architecture. In this section, we demonstrate the versatility of our method by showing state-of-the-art performance on two disparate tasks: keypoint-conditioned viewpoint estimation and hierarchical scene classification.
In the task of keypoint-conditioned viewpoint estimation [36], a human annotator is given an image of a vehicle, and asked to click a keypoint such as "front right tire". This human-produced information is then combined with features from a convolutional neural network to produce a more accurate estimate of camera viewpoint than would be possible without the keypoint [34, 37].

We use the "click-here CNN" model from the work by Szeto & Corso as our task model. This model accepts a scaled image crop, keypoint class, and a downsampled map giving the Chebyshev distance from every pixel to the seed keypoint, and produces nine 359-bin softmax outputs. The nine outputs are conceptually grouped into three sets of three, with a 359-bin classification performed for azimuth, elevation, and tilt for each of the three vehicle classes in the PASCAL3D+ [42] dataset.

Figure 3: DAER frames seed rejection as the task of regressing the additional error, and separates the regression into two components: predicting whether the candidate seed is correct, and predicting the difference in performance between the candidate seed and the true seed given that the candidate seed is incorrect.

For our rejection model, we use a modified version of the click-here CNN architecture, which has been proven capable of integrating keypoint and image information. We add two additional linear layers to produce 34 sets of 201 outputs. Every set of outputs represents one of the keypoint classes. Of the 201 outputs, one is used as a binary classifier to determine whether or not the given keypoint is correct and is trained with a binary cross-entropy loss, and the other 200 are used as a classifier for determining the magnitude of error and are trained with a cross-entropy loss.
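The head described above can be sketched in PyTorch as follows; the 34 keypoint classes and 201 outputs per class follow the text, while the input feature and hidden widths are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RejectionHead(nn.Module):
    """Maps fused image+keypoint features to 34 sets of 201 outputs:
    one correctness logit and 200 error-magnitude bins per keypoint class."""
    def __init__(self, feature_dim=512, n_keypoint_classes=34, n_bins=200):
        super().__init__()
        self.n_classes, self.n_bins = n_keypoint_classes, n_bins
        self.fc = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(),
            nn.Linear(512, n_keypoint_classes * (n_bins + 1)),
        )

    def forward(self, features, keypoint_class):
        out = self.fc(features).view(-1, self.n_classes, self.n_bins + 1)
        # Select the output set matching each sample's keypoint class.
        out = out[torch.arange(out.size(0)), keypoint_class]
        correct_logit = out[:, 0]    # binary correctness (BCE loss)
        bin_logits = out[:, 1:]      # additional-error bins (CE loss)
        return correct_logit, bin_logits
```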
For training, the performance measure, $\ell$, is the rotation displacement formula proposed by Larochelle et al. [26]:

$$d = \| I - A_1 A_2^T \|_F. \quad (7)$$

This metric is upper bounded at $2\sqrt{2}$, meaning that we divide our 200 regression output bins between $-\sqrt{2}$ and $2\sqrt{2}$.
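A minimal NumPy sketch of Eq. (7) and of one way to discretize additional error onto the 200 training bins follows; the exact binning used in the original implementation is not specified, so the edges below are an assumption.

```python
import numpy as np

def displacement_distance(A1, A2):
    """Rotation displacement ||I - A1 A2^T||_F between two 3x3 rotation
    matrices (Eq. 7), upper bounded by 2*sqrt(2)."""
    return np.linalg.norm(np.eye(3) - A1 @ A2.T, ord="fro")

# 200 uniform bins spanning the regression range; np.digitize maps a
# scalar additional error to a class target for the cross-entropy loss.
BIN_EDGES = np.linspace(-np.sqrt(2), 2 * np.sqrt(2), 201)

def ae_to_bin(ae):
    return int(np.clip(np.digitize(ae, BIN_EDGES) - 1, 0, 199))
```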
Since the correct answer in the input space is a single pixel, producing a classifier capable of accurately determining whether a given keypoint matches the true keypoint is a difficult task all by itself. Even given a perfect classifier, the low probability of selecting the correct gold-standard seed means the problem reduces to regressing the additional error directly. Instead, we define a correct seed as a seed for which the additional error is zero:

$$p(\text{seed correct}) = \begin{cases} 1 & \mathrm{AE} = 0 \\ 0 & \mathrm{AE} \neq 0. \end{cases} \quad (8)$$

Aside from allowing an easier regression problem while maintaining the mathematical foundation of our method, we believe this will help the correctness classifier view the image more holistically, classifying images such as Figures 2a and 2b as having no additional error regardless of the location of the given seed.

We train in two stages. In the first stage, we use the synthetic [34, 36] and real data together, and perform early stopping on the validation loss with patience 5. In the second stage, we train exclusively on the PASCAL3D+ dataset [42] and perform early stopping on the validation loss with patience 100. The correctness loss is calculated using binary cross-entropy, while the regression loss is calculated using cross-entropy. Candidate seeds are generated by randomly sampling a pixel within the image.

For our evaluation, we focus on the case in which a keypoint is provided through crowdsourcing, by collecting annotations on Amazon Mechanical Turk through the provided keypoint annotation interface. A total of 6,042 gold-standard keypoints were collected on the PASCAL3D+ validation set [42]. Of these keypoints, 6.3% (381) cause additional error, while 1.3% cause more than 5° additional error, and 0.5% (30) cause more than 150° additional error. We place this holdout set into 5 splits, using four for validation and one for testing for each trained network, and report the mean result.

We also change our performance measure between the training and evaluation steps. While we use a metric based on the distance from the identity matrix during training for computational reasons, we use the geodesic distance on the unit sphere for our evaluation. This follows the convention of previous viewpoint estimation work [36, 37, 34], and can be calculated:

$$d = \frac{\| \log(A_1 A_2^T) \|_F}{\sqrt{2}}. \quad (9)$$

To calculate the expected additional error, we calculate the mean of the predicted additional error, and multiply it by the probability that the seed is incorrect:

$$\mathbb{E}[\mathrm{AE}] = p(\neg\,\text{seed correct}) \sum_{x} x\, p(\mathrm{AE} = x \mid \neg\,\text{seed correct}), \quad (10)$$

where the sum runs over the discretized additional-error bin values.

Our baseline scoring functions for seed rejection on the keypoint-conditioned viewpoint estimation task are:

• Known Distance: We score the seed based on oracle knowledge of its distance from the gold-standard keypoint.
• Task Network Entropy: The distributional entropy of the output of Click-Here CNN. This was proposed by Gal [9] as a method of analyzing uncertainty when Monte Carlo dropout was applied to a classification task.
• Task Network Percentile: 10,000 weighted samples are taken from the output of Click-Here CNN. We find the absolute difference between every sample and the mean of all samples, and take the (best-performing) 80th percentile for our evaluation.
• Softmax Response (S.R.): The largest value of the softmax output. This has been shown by Geifman & El-Yaniv [11] to outperform Monte Carlo dropout [10] on selective prediction for classification tasks.
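For concreteness, the two output-based baselines can be sketched as below (the percentile sampler is omitted); treating higher scores as "reject first" is a convention of this sketch.

```python
import torch

def entropy_score(logits):
    """Distributional entropy of a softmax output: higher means more
    uncertain, so the seed is rejected earlier."""
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)

def softmax_response_score(logits):
    """Softmax response accepts confident outputs, so the maximum
    probability is negated to obtain a rejection score."""
    return -torch.softmax(logits, dim=-1).max(dim=-1).values
```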
In hierarchical scene classification, a coarse scene categorization is used to help a classifier determine the fine-grained scene classification of an image. This task was used for evaluating the role of partial evidence by Hu et al. [19], Wang et al. [39], and Koperski et al. [24]. In this work, we use the recent Plugin Network architecture developed by Koperski et al. [24] as our task model. This work applies adjustments to intermediate layers of a frozen base model based on the provided coarse categories. Through this process, they were able to use the partial evidence to increase the performance of the base model by over 4%.
For the task of hierarchical scene classification, the performance measure is whether or not the predicted fine-grained classification matches the target fine-grained classification:

$$\ell(f(x, s), y) = \begin{cases} 0 & f(x, s) = y \\ 1 & f(x, s) \neq y. \end{cases} \quad (11)$$

Since the number of coarse categories is low, we use the output of a coarse scene classifier as our correctness probability. As a backbone for our rejection model, we use a ResNet-18 [17] model pretrained on ImageNet [5], of which we use 14 outputs. Seven of those outputs represent a categorical classifier which predicts the correctness for each potential combination, while the other seven represent the conditional additional error for each class. The two outputs are trained using cross-entropy and binary cross-entropy, respectively. The rejection model is trained for 50 epochs at a learning rate of e−, and the instance with the lowest validation AMAE is used for testing.

For the hierarchical scene classification task, we use a separate coarse scene classification model to produce the seed. This allows us to test the much larger SUN-397 dataset across multiple instances of both the seeding and rejection models to calculate the standard error. Like the rejection model, the coarse scene classification model begins with an ImageNet-pretrained ResNet-18 architecture. This classifier is trained for 20 epochs to predict one of the 7 coarse category combinations. The instance with the highest validation accuracy is used to seed the task model. We train five rejection models and five seeding models, for a total of 5 runs on the baselines and 25 runs on learned rejection, to calculate the standard error of the mean across our various rejection models.
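Inference then follows the reduction in Eq. (6). A minimal sketch, assuming the 7 correctness logits and 7 conditional additional-error logits are wired as described above and `seed` indexes the candidate coarse-category combination:

```python
import torch

def expected_additional_error(correct_logits, ae_logits, seed):
    p_correct = torch.softmax(correct_logits, dim=-1)[seed]
    ae_if_incorrect = torch.sigmoid(ae_logits[seed])
    # The correct seed contributes zero additional error, so only the
    # incorrect branch of Eq. (5) survives.
    return (1.0 - p_correct) * ae_if_incorrect
```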
As our baselines, we use the task network entropy and softmax response as described for the keypoint-conditioned viewpoint estimation task. We compute these baselines on both the output of the task model and the output of the model which estimates the coarse classification. In our results, we refer to these as "fine" and "coarse", respectively.
5. Results
In Table 1, we see the overall performance of our learned method against baseline methods on both the Keypoint-Conditioned Viewpoint Estimation (KCVE) and Hierarchical Scene Classification (HSC) tasks. We see that in both cases, the trained seed rejection method outperforms the baselines in the AMAE metric. In the case of hierarchical scene classification, we were able to train multiple coarse classifiers and DAER models to establish significance between the results.

  Task   Method             AMAE
  KCVE   Distance           0.3964
         Softmax Response   0.9292
         Entropy            0.3533
         Sampler            0.3091
         DAER (Ours)        0.2864
  HSC    Entropy (Fine)     0.033 ± e−
         Entropy (Coarse)   0.017 ± e−
         S.R. (Fine)        0.033 ± e−
         S.R. (Coarse)      0.018 ± e−
         DAER (Ours)        0.016 ± e−

Table 1: Based on our AMAE metric, DAER outperforms baselines on both example tasks.

While the mean additional error results in a meaningful summary of performance across all potential coverages, we can build better intuition by comparing the mean additional error at specific coverages. We examine the results from the hierarchical scene classification task more closely in Figure 4 and Table 2.

We see in the MAE-coverage curve shown in Figure 4 that DAER outperforms all baselines after a crossover point at a coverage of 0.197. At this crossover point, the MAE is approximately 0.0045, meaning in 1 out of every 222 samples an incorrect answer will be caused by an incorrect seed. We also show the percentage of samples which are accepted at several mean additional errors in Table 2.

Figure 4: The additional error risk compared to the coverage for the hierarchical scene classification task. The dark line represents the mean of all runs. The shaded area represents one standard error.

         % Seeds Accepted (Coverage)
  MAE    Fine Ent.   Coarse Ent.   Fine S.R.   Coarse S.R.   DAER
  0.01   15.6%       48.3%         15.0%       48.2%

Table 2: For a target MAE in the hierarchical scene classification task, DAER allows more seeds to be accepted without review.

In the case of hierarchical scene classification, a target MAE of 0.01 means that one out of every hundred fine-grained scene classifications is incorrect due to an incorrect coarse classification. At a desired MAE of 0.01, the 8.2% improvement in acceptance rate results in a 15.8% reduction in the number of samples which will need to be annotated, while at a desired MAE of 0.05, the 1.6% improvement in acceptance rate results in a 23.9% reduction in the number of samples which will need to be annotated.
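Operationally, Table 2 reads the largest coverage off the MAE-coverage curve whose running mean stays at or below a target. A minimal sketch, assuming per-sample additional errors and rejection scores as in the earlier metric sketch:

```python
import numpy as np

def coverage_at_target_mae(scores, ae, target_mae):
    """Largest fraction of seeds that can be auto-accepted while the mean
    additional error of the accepted set stays at or below `target_mae`."""
    order = np.argsort(scores)                  # most-acceptable first
    running_mae = np.cumsum(ae[order]) / np.arange(1, len(ae) + 1)
    ok = np.nonzero(running_mae <= target_mae)[0]
    return (ok[-1] + 1) / len(ae) if len(ok) else 0.0
```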
5.2. Ablation

In this section, we justify the decision to separate our additional error regression into two components through the use of an ablation. Here we consider separately the correctness score and the additional error regression. The additional error regression was trained on all samples regardless of whether or not the seed was correct, making it equivalent to regressing the additional error directly. No changes were made to the training of the correctness score. The results of this ablation are shown in Table 3.

We see in these results that while the model trained to directly regress the additional error and DAER both attempt to regress the additional error, there is a significant increase in performance when the additional error is regressed conditioned on the correctness of the seed.

Further, we see that the correctness score is an excellent seed rejection method on its own. This indicates that the network has an easier time separating seeds that are correct from seeds that are incorrect than learning the model's response to every particular seed, particularly when the additional error is strongly biased toward zero.
  Task   Regression   Correctness   DAER
  KCVE   1.2544       0.2937        0.2864
  HSC    0.021 ± e−                 0.016 ± e−

Table 3: Both regression and correctness contribute to overall performance, but correctness is the more significant contributor (AMAE; lower is better).
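The difference probed by this ablation amounts to whether the regression loss is masked to incorrect seeds. A minimal sketch of the two objectives, assuming head outputs shaped as in the earlier rejection-head sketch:

```python
import torch
import torch.nn.functional as F

def rejection_loss(correct_logit, bin_logits, is_correct, ae_bin,
                   ablate=False):
    """`is_correct` marks zero-additional-error seeds; `ae_bin` is the
    discretized additional-error target."""
    loss_correct = F.binary_cross_entropy_with_logits(
        correct_logit, is_correct.float())
    if ablate:
        # Ablation: regress additional error on every sample, which is
        # equivalent to direct regression and dominated by zero-AE seeds.
        loss_regress = F.cross_entropy(bin_logits, ae_bin)
    else:
        # DAER: update the regression head only on incorrect seeds.
        mask = ~is_correct
        loss_regress = (F.cross_entropy(bin_logits[mask], ae_bin[mask])
                        if mask.any() else bin_logits.new_zeros(()))
    return loss_correct + loss_regress
```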
6. Sensitivity Analysis
As shown in the ablation, solving seed rejection as a two-part problem performs better on the final task despite being mathematically equivalent to directly regressing the additional error. In fact, simply building a model to ask whether or not the given evidence is correct outperforms directly regressing the additional error by a large margin. To develop an understanding of this phenomenon, we ask the previously unasked question: what happens when the evidence we give our model is incorrect?

Figure 5: The sensitivity of images in the PASCAL3D+ val set to seed location. The majority of images have few potential seeds capable of causing significant error.

Figure 6: For 71% of images, an incorrect seed does not result in a correct fine-grained classification becoming incorrect.
We show one case where the seed significantly affects the task model's output in Figure 1, and two cases where the task model's output is largely unaffected by the seed in Figure 2, but to provide an overall understanding, we exhaustively sample the validation set for both of our problems: that is, we provide all potential keypoint clicks in the keypoint-conditioned viewpoint estimation problem and all coarse-grained classifications for every image in the hierarchical scene classification problem. The results are seen in Figure 5 for keypoint-conditioned viewpoint estimation and Figure 6 for hierarchical scene classification.

We see that in both of these tasks the seed is often ignored, with the model instead opting to use the image as the sole source of evidence. In other words, the model has a very strong prior for some images. Specifically, we see that in the keypoint-conditioned viewpoint estimation task, 78.8% of the validation dataset has no potential clicks that will increase error by more than the ◦ threshold used to designate a correct regression in previous work. Further, on 83.4% of images, a random click has less than a 5% chance of increasing geodesic error by more than ◦, and 36.3% of crops do not respond to the keypoint click at all. In the case of hierarchical scene classification, 71% of images do not have any coarse classifications that cause the fine-grained classification to become incorrect.

This, alongside the results of our ablation, suggests that while it would be possible to solve the problem by regressing additional error directly, it is a task that is subject to a very strong local minimum at zero additional error. Separating the two elements allows the rejection model to first solve an easy task, such as determining whether an image is closer to Figure 1 or the examples shown in Figure 2, to accept all the seeds that are correct first, then solve the harder problem of determining the additional error of a seed on a primary input such as Figure 1 using a more balanced training signal.
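The exhaustive sweep described above can be sketched as follows, assuming a hypothetical `candidate_seeds(x)` that enumerates every possible seed (every pixel click, or every coarse category) for a primary input.

```python
import numpy as np

def seed_sensitivity(task_model, loss_fn, x, s_gold, y, candidate_seeds):
    """Return each candidate's additional error and the fraction of
    candidates that cause any additional error at all."""
    base_err = loss_fn(task_model(x, s_gold), y)
    aes = np.array([max(loss_fn(task_model(x, s), y) - base_err, 0.0)
                    for s in candidate_seeds(x)])
    return aes, float((aes > 0).mean())
```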
7. Conclusion
In this work we have introduced the problem of seed rejection, asking what happens when the seed given to a task model is incorrect. We discuss the challenges of understanding the cause of error and understanding the task model response, then propose novel metrics for evaluating the seed rejection problem.

With these challenges and metrics in mind we propose the DAER model, which separates the task of regressing additional error into two components: predicting the correctness of the seed, and predicting the additional error given that the seed is incorrect. We adapt the generic DAER model to the problems of keypoint-conditioned viewpoint estimation and hierarchical scene classification, and show it outperforms strong baselines on both tasks.

Last, we show that DAER outperforms both of its components individually, including the mathematically equivalent task of regressing the additional error. Through an analysis of the sensitivity of the task model to incorrect seeds, we find evidence suggesting that the significant number of seeds that do not cause any additional error leads to a strongly zero-biased training signal. Asking the model to first answer the easier question of whether or not the seed is correct ideally allows all of the correct seeds to be accepted first, while the additional error regression receives a more balanced training signal for rejecting seeds which are incorrect.

Seed rejection addresses a large number of problems which, up to this point, have assumed that seeds are correct. We offer a generic formulation of both the seed rejection problem and a solution, which we hope will be extended to more problems in future work.
Acknowledgements
Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, pages 2425–2433, Santiago, Chile, Dec. 2015. IEEE.
[2] Mohamed El Banani and Jason J. Corso. Adviser Networks: Learning What Question to Ask for Human-In-The-Loop Viewpoint Estimation. arXiv:1802.01666 [cs], Oct. 2018.
[3] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual Recognition with Humans in the Loop. In European Conference on Computer Vision, volume 6314, pages 438–451, Berlin, Heidelberg, 2010. Springer.
[4] C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, Jan. 1970.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[6] Samuel Dodge and Lina Karam. Understanding How Image Quality Affects Deep Neural Networks. arXiv:1604.04004 [cs], Apr. 2016.
[7] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust Physical-World Attacks on Deep Learning Visual Classification. In CVPR, pages 1625–1634, Salt Lake City, UT, USA, June 2018. IEEE.
[8] Giorgio Fumera and Fabio Roli. Support Vector Machines with Embedded Reject Option. In Seong-Whan Lee and Alessandro Verri, editors, Pattern Recognition with Support Vector Machines, volume 2388, pages 68–82. Springer, Berlin, Heidelberg, 2002.
[9] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
[10] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In ICML, 2016.
[11] Yonatan Geifman and Ran El-Yaniv. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems, pages 4878–4887, 2017.
[12] Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A Deep Neural Network with an Integrated Reject Option. arXiv:1901.09192 [cs, stat], Jan. 2019.
[13] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers. arXiv:1805.08206 [cs, stat], May 2018.
[14] Brent Griffin, Victoria Florence, and Jason J. Corso. Video Object Segmentation-based Visual Servo Control and Object Depth Estimation on a Mobile Robot Platform. arXiv:1903.08336 [cs], Mar. 2019.
[15] Brent A. Griffin and Jason J. Corso. BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames. In CVPR, 2019.
[16] Danna Gurari, Suyog Dutt Jain, Margrit Betke, and Kristen Grauman. Pull the Plug? Predicting If Computers or Humans Should Segment Images. In CVPR, pages 382–391, Las Vegas, NV, USA, June 2016. IEEE.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], Dec. 2015.
[18] M. E. Hellman. The Nearest Neighbor Classification Rule with a Reject Option. IEEE Transactions on Systems Science and Cybernetics, 6(3):179–185, July 1970.
[19] Hexiang Hu, Guang-Tong Zhou, Zhiwei Deng, Zicheng Liao, and Greg Mori. Learning Structured Inference Neural Networks with Label Relations. In CVPR, pages 2960–2968, Las Vegas, NV, USA, June 2016. IEEE.
[20] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from Natural Language Expressions. arXiv:1603.06180 [cs], Mar. 2016.
[21] Jia-Bin Huang, Sing Bing Kang, Narendra Ahuja, and Johannes Kopf. Temporally coherent completion of dynamic video. ACM Transactions on Graphics, 35(6):1–11, Nov. 2016.
[22] Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP '10), page 64, Washington DC, 2010. ACM Press.
[23] Justin Johnson, Lamberto Ballan, and Li Fei-Fei. Love Thy Neighbors: Image Annotation by Exploiting Image Metadata. In ICCV, pages 4624–4632, Santiago, Chile, Dec. 2015. IEEE.
[24] Michal Koperski, Tomasz Konopczynski, Rafal Nowak, Piotr Semberecki, and Tomasz Trzcinski. Plugin Networks for Inference under Partial Evidence. In The IEEE Winter Conference on Applications of Computer Vision, pages 2883–2891, 2020.
[25] Matej Kristan, Jiri Matas, Aleš Leonardis, Tomáš Vojíř, Roman Pflugfelder, Gustavo Fernández, Georg Nebehay, Fatih Porikli, and Luka Čehovin. A Novel Performance Evaluation Methodology for Single-Target Trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2137–2155, Nov. 2016.
[26] Pierre M. Larochelle, Andrew P. Murray, and Jorge Angeles. A Distance Metric for Finite Sets of Rigid-Body Displacements via the Polar Decomposition. Journal of Mechanical Design, 129(8):883–886, Aug. 2007.
[27] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv:1704.00675 [cs], Mar. 2018.
[28] Vikas C. Raykar and Shipeng Yu. Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks. Journal of Machine Learning Research, 13:28, Feb. 2012.
[29] N. Dinesh Reddy, Minh Vo, and Srinivasa G. Narasimhan. CarFusion: Combining Point Tracking and Part Detection for Dynamic 3D Reconstruction of Vehicles. In CVPR, pages 1906–1915, Salt Lake City, UT, USA, June 2018. IEEE.
[30] Amir Rosenfeld, Richard Zemel, and John K. Tsotsos. The Elephant in the Room. arXiv:1808.03305 [cs], Aug. 2018.
[31] Jeffrey M. Rzeszotarski and Aniket Kittur. Instrumenting the crowd: using implicit behavioral measures to predict task performance. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), page 13, Santa Barbara, California, USA, 2011. ACM Press.
[32] Jean Y. Song, Raymond Fok, Alan Lundgard, Fan Yang, Juho Kim, and Walter S. Lasecki. Two Tools are Better Than One: Tool Diversity as a Means of Improving Aggregate Crowd Performance. In Proceedings of the 23rd International Conference on Intelligent User Interfaces (IUI '18), pages 559–570, Tokyo, Japan, 2018. ACM Press.
[33] Jean Y. Song, Stephan J. Lemmer, Michael Xieyang Liu, Shiyan Yan, Juho Kim, Jason J. Corso, and Walter S. Lasecki. Popup: reconstructing 3D video using particle filtering to aggregate crowd responses. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI '19), pages 558–569, Marina del Rey, California, 2019. ACM Press.
[34] Hao Su, Charles R. Qi, Yangyan Li, and Leonidas J. Guibas. Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views. In ICCV, pages 2686–2694, Santiago, Chile, Dec. 2015. IEEE.
[35] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv:1312.6199 [cs], Feb. 2014.
[36] Ryan Szeto and Jason J. Corso. Click Here: Human-Localized Keypoints as Guidance for Viewpoint Estimation. In ICCV, pages 1604–1613, Venice, Oct. 2017. IEEE.
[37] Shubham Tulsiani and Jitendra Malik. Viewpoints and keypoints. In CVPR, pages 1510–1519, Boston, MA, USA, June 2015. IEEE.
[38] Igor Vasiljevic, Ayan Chakrabarti, and Gregory Shakhnarovich. Examining the Impact of Blur on Recognition by Convolutional Networks. arXiv:1611.05760 [cs], May 2017.
[39] Tianlu Wang, Kota Yamaguchi, and Vicente Ordonez. Feedback-Prop: Convolutional Neural Network Inference Under Partial Evidence. In CVPR, pages 898–907, Salt Lake City, UT, June 2018. IEEE.
[40] Rey Reza Wiyatno and Anqi Xu. Physical Adversarial Textures that Fool Visual Object Tracking. arXiv:1904.11042 [cs], Sept. 2019.
[41] Zuxuan Wu, Ser-Nam Lim, Larry Davis, and Tom Goldstein. Making an Invisibility Cloak: Real World Adversarial Attacks on Object Detectors. arXiv:1910.14667 [cs, math], Oct. 2019.
[42] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pages 75–82, Steamboat Springs, CO, USA, Mar. 2014. IEEE.
[43] Mehmet Yigit Yildirim, Mert Ozer, and Hasan Davulcu. Leveraging Uncertainty in Deep Learning for Selective Classification. arXiv:1905.09509 [cs, math, stat], May 2019.
[44] Yiren Zhou, Sibo Song, and Ngai-Man Cheung. On classification of distorted images with deep convolutional neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).