Localization-Aware Active Learning for Object Detection
Chieh-Chi Kao, University of California, Santa Barbara, [email protected]
Teng-Yok Lee, Mitsubishi Electric Research Laboratories, [email protected]
Pradeep Sen, University of California, Santa Barbara, [email protected]
Ming-Yu Liu, Mitsubishi Electric Research Laboratories, [email protected]
Abstract
Active learning, a class of algorithms that iteratively searches for the most informative samples to include in a training dataset, has been shown to be effective at annotating data for image classification. However, the use of active learning for object detection is still largely unexplored, as determining the informativeness of an object-location hypothesis is more difficult. In this paper, we address this issue and present two metrics for measuring the informativeness of an object hypothesis, which allow us to leverage active learning to reduce the amount of annotated data needed to achieve a target object-detection performance. Our first metric measures the "localization tightness" of an object hypothesis, which is based on the overlapping ratio between the region proposal and the final prediction. Our second metric measures the "localization stability" of an object hypothesis, which is based on the variation of predicted object locations when input images are corrupted by noise. Our experimental results show that by augmenting a conventional active-learning algorithm designed for classification with the proposed metrics, the amount of labeled training data required can be reduced by up to 25%. Moreover, on the PASCAL 2007 and 2012 datasets our localization-stability method has an average relative improvement of 96.5% and 81.9% over the baseline method using classification only.
1. Introduction
Prior works have shown that with a large amount of annotated data, convolutional neural networks (CNNs) can be trained to achieve super-human performance on various visual recognition tasks. As tremendous effort is dedicated to discovering effective network architectures and training methods to further advance performance, we argue it is also important to investigate effective approaches for data annotation, as data annotation is essential but expensive.

Data annotation is especially expensive for the object-detection task. Compared to annotating an image class, which can be done via a multiple-choice question, annotating an object location requires a human annotator to specify a bounding box for the object. Simply dragging a tight bounding box to enclose an object can take 10 times longer than answering a multiple-choice question [29, 20]. Consequently, a higher pay rate has to be paid to a human labeler for annotating images for an object detection task. In addition to the cost, it is more difficult to monitor and control the annotation quality.
Active learning [25] is a machine learning procedure that is useful for reducing the amount of annotated data required to achieve a target performance. It has been applied to various computer-vision problems including object classification [12, 5], image segmentation [15, 3], and activity recognition [8, 9]. Active learning starts by training a baseline model with a small, labeled dataset, and then applies the baseline model to the unlabeled data. For each unlabeled sample, it estimates whether this sample contains critical information that has not been learned by the baseline model. Once the samples that bring the most critical information are identified and labeled by human annotators, they can be added to the initial training dataset to train a new model, which is expected to perform better. Compared to passive learning, which randomly selects samples from the unlabeled dataset to be labeled, active learning can achieve the same accuracy with fewer but more informative labeled samples.

Multiple metrics for measuring how informative a sample is have been proposed for the classification task, including maximum uncertainty, expected model change, density weighted, and so on [25]. The concept behind several of them is to evaluate how uncertain the current model is about an unlabeled sample. If the model cannot assign a high probability to a class for a sample, then the model is uncertain about the class of the sample. In other words, the class of the sample would be very informative to the model, and this sample would require a human to clarify it.

Since an object-detection problem can be considered as an object-classification problem once the object is located, existing active learning approaches for object detection [1, 27] mainly measure the information in the classification part. Nevertheless, in addition to classification, the accuracy of an object detector also relies on its localization ability. Because of the importance of localization, in this paper we present an active learning algorithm tailored for object detection, which considers the localization of detected objects. Given a baseline object detector which detects bounding boxes of objects, our algorithm evaluates the uncertainty of both the classification and the localization. Our algorithm is based on two quantitative metrics of the localization uncertainty.

1. Localization Tightness (LT): The first metric is based on how tightly the detected bounding boxes can enclose true objects. The tighter the bounding box, the more certain the localization. While it sounds impossible to compute the localization tightness for non-annotated images because the true object locations are unknown, for object detectors that follow the propose-then-classify pipeline [6, 23], we estimate the localization tightness of a bounding box based on its change from the intermediate proposal (a box that contains any kind of foreground object) to the final class-specific bounding box.
2. Localization Stability (LS): The second metric is based on whether the detected bounding boxes are sensitive to changes in the input image. To evaluate the localization stability, our algorithm adds different amounts of Gaussian noise to the pixel values of the image and measures how the detected regions vary with respect to the noise. This metric can be applied to all kinds of object detectors, especially those that do not have an explicit proposal stage [22, 18].

The contributions of this paper are two-fold:

1. We present different metrics to quantitatively evaluate the localization uncertainty of an object detector. Our metrics consider different aspects of object detection even though the ground truth of object locations is unknown, making our metrics suited for active learning.

2. We demonstrate that to apply active learning to object detection, both the localization and the classification of a detector should be considered when sampling informative images. Our experiments on benchmark datasets show that considering both the localization and classification uncertainty outperforms both the existing active-learning algorithm that works on the classification only and passive learning.
2. Related Works
We now review active learning approaches used for image classification. For more details on active learning, Settles's survey [25] provides a comprehensive review. In this paper, we use the maximum uncertainty method on the classification as the baseline method for comparison. The uncertainty-based method has been used for CAPTCHA recognition [28], image classification [11], and automated and manual video annotation [14]. It has also been applied to different learning models including decision trees [16], SVMs [30], and Gaussian processes [13]. We choose the uncertainty-based method since it is efficient to compute.

Active learning has also been applied to object detection tasks in various specific applications, such as satellite images [1] and vehicle images [27]. Vijayanarasimhan et al. [32] propose an approach to actively crawl images from the web to train a part-based linear SVM detector. Note that these methods only consider information from the classifier, while our methods consider the localization part as well.

Current state-of-the-art object detectors are based on deep learning. They can be classified into two categories. Given an input image, the first category explicitly generates region proposals, followed by feature extraction, category classification, and fine-tuning of the proposal geometry [6, 23]. The other category directly outputs the object location and class without the intermediate proposal stage, such as YOLO [22] and SSD [18]. This inspires us to consider localization stability, which can be applied to both categories.

Besides active learning, there are other research directions for reducing the cost of annotation. Temporal coherence of video frames is used to reduce the annotation effort for training detectors [21]. Domain adaptation [10] is used to transfer the knowledge from an image classifier to an object detector without the annotation of bounding boxes. Papadopoulos et al. [20] suggest simplifying the annotation process from drawing a bounding box to simply answering a Yes/No question of whether a bounding box tightly encloses an object. Russakovsky et al. [24] integrate multiple inputs from both computer vision and humans to label objects.
3. Active Learning for Object Detection
The goal of our algorithm is to train an object detector that takes an image as input and outputs a set of rectangular bounding boxes. Each bounding box has the location and the scale of its shape, and a probability mass function over all classes. To train such an object detector, the training and validation images of the detector are annotated with a bounding box per object and its category. Such an annotation is commonly seen in public datasets including PASCAL VOC [4] and MS COCO [17].

Figure 1: A round of active learning for object detection. (Diagram: the detector trained on the labeled set scores images in the unlabeled pool based on classification and localization, a subset is selected and sent to human annotators, and the newly labeled images are added to the training set to learn a new model.)

We first review the basic active learning framework for object detection in Sec. 3.1. It also reviews the measurement of classification uncertainty, which is the major measurement for object detection in previous active learning algorithms [25, 1, 27]. Based on this framework, we extend the uncertainty measurement to also consider the localization result of a detector, as described in Sec. 3.2 and 3.3.
Fig. 1 overviews our active learning algorithm. Our algorithm starts with a small training set of annotated images to train a baseline object detector. In order to improve the detector by training with more images, we continue to collect images to annotate. Rather than annotating all newly collected images, we select a subset of them for human annotators to label based on different characteristics of the current detector. Once annotated, these selected images are added to the training set to train a new detector. The entire process continues to collect more images, select a subset with respect to the new detector, have humans annotate the selected ones, re-train the detector, and so on. Hereafter we call such a cycle of data collection, selection, annotation, and training a round.

A key component of active learning is the selection of images. Our selection is based on the uncertainty of both the classification and the localization. The classification uncertainty of a bounding box is the same as in existing active learning approaches [25, 1, 27]. Given a bounding box B, its classification uncertainty U_B(B) is defined as U_B(B) = 1 - P_max(B), where P_max(B) is the highest probability out of all classes for this box. If the probability on a single class is close to 1.0, meaning that the probabilities for other classes are low, the detector is highly certain about its class. In contrast, when multiple classes have similar probabilities, each probability will be low because the sum of the probabilities of all classes must be one.

Figure 2: The process of calculating the tightness of each predicted box. Given an intermediate region proposal (from selective search or a region proposal network), the final classifier in the detector refines it to a final predicted box. The IoU between the final predicted box and its corresponding region proposal is defined as the localization tightness of that box.

Based on the classification uncertainty per box, the classification uncertainty of the i-th image I_i is denoted as U_C(I_i), which is calculated as the maximum uncertainty over all detected boxes within the image.
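To make the selection criterion concrete, the following minimal sketch (not the authors' released code) shows how the per-box and per-image classification uncertainty can be computed from detector outputs; the `detector.detect` interface in the usage comment is a hypothetical stand-in for whatever detector is used.

```python
import numpy as np

def classification_uncertainty(class_probs):
    # U_B(B) = 1 - P_max(B): uncertainty of a single box from its
    # class-probability vector.
    return 1.0 - np.max(class_probs)

def image_classification_uncertainty(per_box_probs):
    # U_C(I) = max over detected boxes of U_B(B); an image with at least
    # one highly ambiguous box is considered informative.
    if len(per_box_probs) == 0:
        return 0.0
    return max(classification_uncertainty(p) for p in per_box_probs)

# Usage with a hypothetical detector that returns a list of
# (box, class_probability_vector) pairs for an image:
# detections = detector.detect(image)
# u_c = image_classification_uncertainty([probs for _, probs in detections])
```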
Our first metric of the localization uncertainty is based on the Localization Tightness (LT) of a bounding box. The localization tightness measures how tightly a predicted bounding box encloses true foreground objects. Ideally, if the ground-truth locations of the foreground objects were known, the tightness could simply be computed as the IoU (Intersection over Union) between the predicted bounding box and the ground truth. Given two boxes B^1 and B^2, their IoU is defined as IoU(B^1, B^2) = |B^1 ∩ B^2| / |B^1 ∪ B^2|.

Because the ground truth is unknown for an image without annotation, an estimate of the localization tightness is needed. Here we design an estimate for object detectors that involve an adjustment from intermediate region proposals to the final bounding boxes. Region proposals are bounding boxes that might contain any foreground objects, and they can be obtained via selective search [31] or a region proposal network [23]. Besides classifying the region proposals into specific classes, the final stage of these object detectors can also adjust the location and scale of the region proposals based on the classified object classes. Fig. 2 illustrates the typical pipeline of these detectors, where the region proposal (green) in the middle is adjusted to the red box on the right.

As the region proposal is trained to predict the location of foreground objects, the refinement in the final stage reflects how well the region proposal predicts. If the region proposal locates the foreground object perfectly, there is no need to refine it. Based on this observation, we use the IoU value between the region proposal and the refined bounding box to estimate the localization tightness between an adjusted bounding box and the unknown ground truth. The estimated tightness T of the j-th predicted box B_j is formulated as T(B_j) = IoU(B_j, R_j), where R_j is the corresponding region proposal fed into the final classifier that generates B_j.

Figure 3: Images preferred by LT/C. The two cases that will be selected by LT/C are images with a certain category but a loose bounding box (a), or images with a tight bounding box but an uncertain category (b).

Once the tightness of all predicted boxes is estimated, we can extend the selection process to consider not only the classification uncertainty but also the tightness. Namely, we want to select images with inconsistency between the classification and the localization, as follows:

• A predicted box is absolutely certain about its classification result (P_max = 1), but it cannot tightly enclose a true object (T = 0). An example is shown in Fig. 3 (a).

• Reversely, the predicted box can tightly enclose a true object (T = 1), but the classification result is uncertain (low P_max). An example is shown in Fig. 3 (b).

The score of a box is denoted as J and is computed per Equ. 1; both conditions above yield a value close to zero:

J(B_j) = |T(B_j) + P_max(B_j) - 1|.   (1)

As each image can have multiple predicted boxes, we calculate the score per image as T_I(I_i) = min_j J(B_j). Unlabeled images with low scores will be selected for annotation in active learning. Since both the localization tightness and the classification outputs are used in this metric, we later use LT/C to denote methods with this score.
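A short sketch of the LT/C score follows. It assumes that, for each final box, the detector also exposes its source region proposal and highest class probability; this is an illustrative implementation of Equ. 1 rather than the exact code used in our experiments.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-10)

def lt_c_image_score(detections):
    # detections: list of (final_box, proposal_box, p_max) triples.
    # Per Equ. 1, J(B_j) = |T(B_j) + P_max(B_j) - 1| with T(B_j) = IoU(B_j, R_j);
    # the image score T_I is the minimum J over its boxes (low score = selected).
    scores = [abs(iou(final, proposal) + p_max - 1.0)
              for final, proposal, p_max in detections]
    return min(scores) if scores else float('inf')
```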
The concept behind the localization stability is that if the current model is stable to noise, meaning that the detection result does not dramatically change even if the input unlabeled image is corrupted by noise, then the current model already understands this unlabeled image well and there is no need to annotate it. In other words, we would like to select images that have a large variation in the predicted bounding-box locations when noise is added to the image.

Figure 4: The process of calculating the localization stability of each predicted box. Given one input image, a reference box (red) is predicted by the detector. The change in predicted boxes (green) from noisy images with increasing noise is measured by the IoU between the predicted boxes (green) and the corresponding reference box (dashed red).

Fig. 4 overviews the calculation of the localization stability of an unlabeled image. We first detect bounding boxes in the original image with the current model. These bounding boxes, detected when noise is absent, are called reference boxes. The j-th reference box is denoted as B_j. For each noise level n, noise is added to each pixel of the image. We use Gaussian noise whose standard deviation is proportional to the level n; namely, the pixel values can change more at higher levels. After detecting boxes in the image with noise level n, for each reference box (the red box in Fig. 4), we find a corresponding box (green) in the noisy image to measure how the reference box varies. The corresponding box is denoted as C_n(B_j), which is the box with the highest IoU value among all bounding boxes that overlap B_j.

Once the corresponding boxes at all noise levels are detected, the model can be considered stable to noise on this reference box if the box does not change significantly across the noise levels. Therefore, the localization stability of each reference box B_j is defined as the average IoU between the reference box and its corresponding boxes across all noise levels. Given N noise levels, it is calculated per Equ. 2:

S_B(B_j) = (1/N) Σ_{n=1}^{N} IoU(B_j, C_n(B_j)).   (2)

With the localization stability of all reference boxes, the localization stability of the unlabeled image I_i is defined as their weighted sum per Equ. 3, where M is the number of reference boxes. The weight of each reference box is its highest class probability, in order to prefer boxes that are likely foreground objects but whose locations are uncertain:

S_I(I_i) = Σ_{j=1}^{M} P_max(B_j) S_B(B_j) / Σ_{j=1}^{M} P_max(B_j).   (3)
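The localization-stability computation can be sketched as follows, reusing the iou helper from the previous sketch. The `detect` callable and the noise standard deviations passed in are hypothetical placeholders; the matching of noisy-image boxes to reference boxes follows the highest-IoU rule described above.

```python
import numpy as np

def add_gaussian_noise(image, sigma):
    # image: HxWxC uint8 array; add pixel-wise Gaussian noise and clip to [0, 255].
    noisy = image.astype(np.float32) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def localization_stability(image, detect, sigmas):
    # detect(image) -> list of (box, p_max) pairs; iou(a, b) as defined earlier.
    reference = detect(image)                      # boxes on the clean image
    if not reference:
        return 1.0
    per_box_ious = np.zeros(len(reference))
    for sigma in sigmas:                           # one noise level per sigma
        noisy_boxes = [b for b, _ in detect(add_gaussian_noise(image, sigma))]
        for j, (ref_box, _) in enumerate(reference):
            overlaps = [iou(ref_box, nb) for nb in noisy_boxes]
            per_box_ious[j] += max(overlaps) if overlaps else 0.0
    s_b = per_box_ious / len(sigmas)               # Equ. 2, per reference box
    weights = np.array([p for _, p in reference])  # P_max as weights (Equ. 3)
    return float(np.sum(weights * s_b) / np.sum(weights))
```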
4. Experimental Results
Reference Methods:
Since no prior work does active learning for deep-learning-based object detectors, we designate two informative baselines that show the impact of the proposed methods.

• Random (R): Randomly choose samples from the unlabeled set, label them, and put them into the labeled training set.

• Classification only (C): Select images only based on the classification uncertainty U_C in Sec. 3.1.

Our algorithm is tested with two different metrics for the localization uncertainty. First, the localization stability (Section 3.3) is combined with the classification information (LS+C). As images with high classification uncertainty and low localization stability should be selected for annotation, the score of the i-th image I_i is defined as U_C(I_i) - λ S_I(I_i), where λ is the weight combining the two terms and is set to 1 across all the experiments in this paper. Second, the localization tightness of predicted boxes is combined with the classification information (LT/C) as defined in Section 3.2.

We also test three variants of our algorithm. One uses the localization stability only (LS). Another is the localization tightness of predicted boxes combined with the classification information, but using the localization tightness calculated from ground-truth boxes (LT/C(GT)) instead of the estimate used in LT/C. The other combines all three cues together (3in1).

For ease of reading, data for LS and 3in1 are shown in the supplementary result. Our supplementary result also includes the mAP curves with error bars that indicate the minimum and maximum average precision (AP) out of multiple trials of all methods. Furthermore, experiments with different designs of LT/C are included in the supplementary result. A sketch of how these image scores can drive the selection in each round is given below.
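The sketch below illustrates how one round of selection could rank unlabeled images once the per-image scores U_C, S_I, and T_I have been computed (e.g., with the sketches in Section 3); the function and dictionary names are hypothetical.

```python
def select_for_annotation(unlabeled_images, u_c, s_i, t_i, k, method="LS+C", lam=1.0):
    # u_c, s_i, t_i: dicts mapping image id -> U_C, S_I, T_I scores.
    # Higher priority = more informative; the top-k images go to annotators.
    def priority(img):
        if method == "LS+C":
            return u_c[img] - lam * s_i[img]   # uncertain and unstable images first
        if method == "LT/C":
            return -t_i[img]                   # low |T + P_max - 1| selected first
        return u_c[img]                        # "C": classification uncertainty only
    ranked = sorted(unlabeled_images, key=priority, reverse=True)
    return ranked[:k]                          # send these k images to annotators
```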
Datasets:
We validated our algorithm on three datasets (PASCAL 2012, PASCAL 2007, MS COCO [4, 17]). For each dataset, we started from a small subset of the training set to train the baseline model, and selected from the remaining training images for active learning. Since objects in training images from these datasets have been annotated with bounding boxes, our experiments used these bounding boxes as annotation without asking human annotators.
Detectors:
The object detector for all datasets is the Faster-RCNN (FRCNN) [23], which contains the intermediate stage to generate region proposals. We also tested our algorithm with the Single Shot multibox Detector (SSD) [18] on the PASCAL 2007 dataset. Because the SSD does not contain a region proposal stage, the tests for localization tightness were skipped. Both FRCNN and SSD used VGG16 [26] as the pre-trained network in the experiments shown in this paper.
Experimental Setup:
We evaluate all the methods with the FRCNN model [23] using the RoI warping layer [2] on the PASCAL 2012 object-detection dataset [4], which consists of 20 classes. Its training set (5,717 images) is used to mimic a pool of unlabeled images, and the validation set (5,823 images) is used for testing. Input images are resized to have 600 pixels on the shortest side for all FRCNN models in this paper.

The numbers shown in the following sections on the PASCAL datasets are averages over 5 trials for each method. All trials start from the same baseline object detectors, which are trained with 500 images selected from the unlabeled image pool. After that, each active learning algorithm is executed for 15 rounds. In each round, we select 200 images, add these images to the existing training set, and train a new model. Each model is trained for 20 epochs.

Our experiments used Gaussian noise as the noise source for the localization stability. We set the number of noise levels N to 6. The standard deviations of these levels are { }, where the pixel values range over [0, 255].

Results:
Fig. 5a and Fig. 5b show the mAP curve and the relative saving of labeled images, respectively, for different active learning methods. We have three major observations from the results on the PASCAL 2012 dataset.

First, LT/C(GT) outperforms all other methods in most of the cases, as shown in Fig. 5b. This is not surprising since LT/C(GT) is based on the ground-truth annotations. In the region that achieves the same performance as passive learning with a dataset of 500 to 1,100 labeled images, the performance of the proposed LT/C is similar to LT/C(GT), which represents the full potential of LT/C. This implies that LT/C using the estimated tightness of predicted boxes (Section 3.2) can achieve results close to its upper bound.

Figure 5: (a) Mean average precision curve of different active learning methods on the PASCAL 2012 detection dataset. Each point in the plot is an average of 5 trials. (b) Relative saving of labeled images for different methods.

Figure 6: The difference in difficult classes (blue bars) between the proposed method (LS+C) and the baseline method (C) in average precision on (a) the PASCAL 2012 dataset and (b) the PASCAL 2007 dataset. Black and green bars are the average improvements of LS+C over C for all classes and for non-difficult classes.

Second, in most of the cases, the active learning approaches work better than random sampling. The localization stability with the classification uncertainty (LS+C) has the best performance among all methods other than LT/C(GT). In terms of average saving, LS+C and LT/C have 96.5% and 36.3% relative improvement over the baseline method C.

Last, we also note that the proposed LS+C method has more improvement in the difficult categories. We further analyze the performance of each method by inspecting the AP per category. Table 1 shows the average precision for each method on the PASCAL 2012 validation set after 3 rounds of active learning, meaning that every model is trained on a dataset with 1,100 labeled images. Categories with AP lower than 40% in passive learning (R) are treated as difficult categories and have an asterisk next to their name. For these difficult categories (blue bars) in Fig. 6a, we notice that the improvement of LS+C over C is large. For those 5 difficult categories the average improvement of LS+C over C is 3.95%, while the average improvement is only 0.38% (the green bar in Fig. 6a) for the rest of the 15 non-difficult categories. This 10× difference shows that adding the localization information into active learning for object detection can greatly help the learning of difficult categories. It is also noteworthy that for those 5 difficult categories, the baseline method C performs slightly worse than random sampling by 0.50% on average. This indicates that C focuses on non-difficult categories to get an overall improvement in mAP.

Figure 7: (a) Mean average precision curve of different active learning methods on the PASCAL 2007 detection dataset. Each point in the plot is an average of 5 trials. (b) Relative saving of labeled images for different methods.

Experimental Setup:
We evaluate all the methods with the FRCNN model [23] using the RoI warping layer [2] on the PASCAL VOC 2007 object-detection dataset [4], which consists of 20 classes. Both training and validation sets (total 5,011 images) are used as the unlabeled image pool, and the test set (4,952 images) is used for testing. All the experimental settings are the same as the experiments on the PASCAL 2012 dataset as mentioned in Section 4.1.
Results:
Fig. 7a and Fig. 7b show the mAP curve and the relative saving of labeled images for different active learning methods. In terms of average saving, LS+C and LT/C have 81.9% and 45.2% relative improvement over the baseline method C. Table 2 shows the AP for each method on the PASCAL 2007 test set after 3 rounds of active learning. The proposed LS+C and LT/C are better than the baseline classification-only method (C) in terms of mAP.

It is interesting to see that the LS+C method shows the same behavior as in the experiments on the PASCAL 2012 dataset. Namely, LS+C also outperforms the baseline method C on difficult categories. As in the experiments on the PASCAL 2012 dataset, categories with AP lower than 40% in passive learning (R) are considered difficult categories. For those 4 difficult categories, the average improvement in AP of LS+C over C is 3.94%, while the average improvement is only 0.95% (the green bar in Fig. 6b) for the other 16 categories.

Table 1: Average precision for each method on the PASCAL 2012 validation set after 3 rounds of active learning (the number of labeled images in the training set is 1,100). Each number shown in the table is an average of 5 trials and displayed in percentage. Numbers in bold are the best results per column, and underlined numbers are the second best results. Categories with AP lower than 40% in passive learning (R) are defined as difficult categories and marked by an asterisk.

Table 2: Average precision for each method on the PASCAL 2007 test set after 3 rounds of active learning (the number of labeled images in the training set is 1,100). The other experimental settings are the same as shown in Table 1.
Figure 8: (a) Mean average precision curve (@IoU=0.5) of different active learning methods on the MS COCO detection dataset. (b) Relative saving of labeled images for different methods. Each point in the plots is an average of 3 trials.
Experimental Setup:
For the MS COCO object-detection dataset [17], we evaluate three methods: passive learning (R), the baseline method using classification only (C), and the proposed LS+C. Our experiments still use the FRCNN model [23] with the RoI warping layer [2]. Compared to the PASCAL datasets, MS COCO has more categories (80) and more images (80k for training and 40k for validation). Our experiments use the training set as the unlabeled image pool, and the validation set for testing.

The numbers shown in this section are averages over 3 trials for each method. All trials start from the same baseline object detectors, which are trained with 5,000 images selected from the unlabeled image pool. After that, each active learning algorithm is executed for 4 rounds. In each round, we select 1,000 images, add these images to the existing training set, and train a new model. Each model is trained for 12 epochs.
Results:
Fig. 8a and Fig. 8b show the mAP curve and the relative saving of labeled images for the tested methods. Fig. 8a shows that the classification-only method (C) does not improve over passive learning (R), unlike the observations for PASCAL 2012 in Section 4.1 and PASCAL 2007 in Section 4.2. By incorporating the localization information, the LS+C method can achieve a 5% relative saving in the amount of annotation compared with passive learning, as shown in Fig. 8b.

Figure 9: (a) Mean average precision curve of SSD with different active learning methods on the PASCAL 2007 detection dataset. (b) Relative saving of labeled images for different methods. Each point in the plots is an average of 5 trials.
Experimental Setup:
Here we test our algorithm on a different object detector: the single shot multibox detector (SSD) [18]. The SSD is a model without an intermediate region-proposal stage, so it is not suitable for the localization-tightness-based methods. We test the SSD on the PASCAL 2007 dataset, where the training and validation sets (total 5,011 images) are used as the unlabeled image pool, and the test set (4,952 images) is used for testing. Input images are resized to 300 × 300.

Results:
Fig. 9a and Fig. 9b show the mAP curve and the relative saving of labeled images for the tested methods. Fig. 9a shows that both active learning methods (C and LS+C) improve over passive learning (R). Fig. 9b shows that in order to achieve the same performance as passive learning with a training set consisting of 2,300 to 3,500 labeled images, the proposed method (LS+C) can reduce the amount of images to annotate more (12 - 22%) than the baseline active learning method (C) (6 - 15%). In terms of average saving, LS+C is 29.0% better than the baseline method C.
5. Discussion
Extreme Cases:
There could be extreme cases in which the proposed methods may not be helpful, for instance, if perfect candidate windows are available (LT/C), or if feature extractors are resilient to Gaussian noise (LS+C).

If we have very precise candidate windows, then we need only the classification part and it is not a detection problem anymore. While this might be possible for a few special object classes (e.g., human faces), to our knowledge there is no perfect region proposal algorithm that works for all types of objects. As shown in our experiments, even state-of-the-art object detectors can still incorrectly localize objects. Furthermore, when perfect candidates are available, the localization tightness will always be 1, and our LT/C degenerates to the classification uncertainty method (C), which can still work for active learning.

Also, we have tested the resiliency to Gaussian noise of state-of-the-art feature extractors (AlexNet, VGG16, ResNet101). The classification task on the validation set of ImageNet (ILSVRC2012) is used as the testbed. The results demonstrate that none of these state-of-the-art feature extractors is resilient to noise. Moreover, if the feature extractor were robust to noise, the localization stability would always be 1, and our LS+C would degenerate to the classification uncertainty method (C), which can still work for active learning. Please refer to the supplemental material for more details.
Estimate of Localization Tightness:
Our experiment shows that if the ground truth of bounding boxes is known, localization tightness can achieve the best accuracies, but the benefit degrades when using the estimated tightness instead. To analyze the impact of the estimate, after we trained the FRCNN-based object detector with 500 images of the PASCAL 2012 training set, we collected the ground-truth-based tightness and the estimated values of all detected boxes in the 5,215 test images. A scatter plot (shown inline) places each detected box at the coordinates given by these two scores. As this scatter plot shows an upper-triangular distribution, it implies that our estimate is most accurate when the proposals tightly match the final detection boxes; otherwise, it can be very different from the ground-truth value. This could partially explain why using the estimated tightness cannot achieve the same performance as the ground-truth-based tightness.
Computation Speed:
Regarding the speed of our approach, as all tested object detectors are CNN-based, the main speed bottleneck lies in the forward propagation. In our experiment with FRCNN-based detectors, for instance, forward propagation took 137 milliseconds per image, which is 82.5% of the total time when considering only the classification uncertainty. The calculation of T_I has a similar speed to U_C. The calculation of the localization stability S_I needs to run the detector multiple times, and thus is slower than calculating the other metrics.

Nevertheless, as these metrics are fully automatic to calculate, using our approach to reduce the number of images to annotate is still cost efficient. Considering that drawing a box from scratch can take 20 seconds on average [29], and checking whether a box tightly encloses an object can take 2 seconds [20], the extra overhead of checking images with our metrics is small, especially since we can reduce the number of images to annotate by 20 - 25%.
6. Conclusion
In this paper, we present an active learning algorithm for object detection. When selecting unlabeled images for annotation to train a new object detector, our algorithm considers both the classification and localization results of the unlabeled images, while existing works mainly consider the classification part alone. We present two metrics to quantitatively evaluate the localization uncertainty: how tightly the detected bounding boxes can enclose true objects, and how stable the bounding boxes are when noise is added to the image. Our experiments show that by considering the localization uncertainty, our active learning algorithm improves over the active learning algorithm that uses the classification outputs only. As a result, we can train object detectors that achieve the same accuracy with fewer annotated images.

Acknowledgments
This work was conducted during the first author's internship at Mitsubishi Electric Research Laboratories. This work was sponsored in part by a National Science Foundation grant.
References

[1] A. Bietti. Active learning for object detection on satellite images. Technical report, California Institute of Technology, Jan 2012.
[2] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[3] S. Dutt Jain and K. Grauman. Active image segmentation propagation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
[5] A. Freytag, E. Rodner, and J. Denzler. Selecting influential examples: Active learning with expected model output changes. In European Conference on Computer Vision (ECCV). Springer, 2014.
[6] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
[7] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
[8] M. Hasan and A. K. Roy-Chowdhury. Continuous learning of human activity models using deep nets. In European Conference on Computer Vision (ECCV). Springer, 2014.
[9] M. Hasan and A. K. Roy-Chowdhury. Context aware active learning of activity recognition models. In International Conference on Computer Vision (ICCV), 2015.
[10] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems (NIPS), 2014.
[11] R. Islam. Active learning for high dimensional inputs using bayesian convolutional neural networks. Master's thesis, Department of Engineering, University of Cambridge, Aug. 2016.
[12] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active learning with gaussian processes for object categorization. In International Conference on Computer Vision (ICCV). IEEE, 2007.
[13] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Gaussian processes for object categorization. International Journal of Computer Vision (IJCV), 88(2):169–188, 2010.
[14] V. Karasev, A. Ravichandran, and S. Soatto. Active frame, location, and detector selection for automated and manual video annotation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[15] K. Konyushkova, R. Sznitman, and P. Fua. Introducing geometry in active learning for image segmentation. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[16] D. D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In International Conference on Machine Learning (ICML). Morgan Kaufmann, 1994.
[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), 2016.
[19] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[20] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari. We don't need no bounding-boxes: Training object class detectors using only human verification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[21] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[24] O. Russakovsky, L. J. Li, and L. Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[25] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] S. Sivaraman and M. M. Trivedi. Active learning for on-road vehicle detection: A comparative study. Mach. Vision Appl., 25(3):599–611, Apr. 2014.
[28] F. Stark, C. Hazirbas, R. Triebel, and D. Cremers. Captcha recognition with active deep learning. In GCPR Workshop on New Challenges in Neural Computation, Aachen, Germany, 2015.
[29] H. Su, J. Deng, and L. Fei-Fei. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[30] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2:45–66, Mar. 2002.
[31] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, 2013.
[32] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. International Journal of Computer Vision (IJCV), 108(1-2):97–114, 2014.

Supplementary Materials
This document includes the data and analysis of the proposed methods that are not covered in the main paper due to the space limitation. We first define the abbreviations for all methods as follows:

R: Random
C: Classification
LS: Localization Stability
LS+C: Localization Stability and Classification
LT/C: Localization Tightness and Classification
LT/C(GT): Localization Tightness and Classification with Ground Truth
3in1: Localization Stability, Localization Tightness, and Classification

These abbreviations are used in all the text, figures, and tables in this document.
A. Design of Localization-Tightness Metric
Given the measurement of localization tightness, we need to design a metric to utilize it for active learning. The most intuitive way is to use the localization tightness alone to decide the score of each box. However, in our experiments it does not help in selecting samples to annotate. We further analyze this by showing the images selected by different methods in Fig. 10. When using only localization tightness as the cue to calculate the score of each detected box, the method tends to find images (Fig. 10, first row) that have tiny objects (e.g., airplane, bird), which are not chosen as often by other methods (Fig. 10, second row). However, these classes are easier ones on which the detector already does well, so the overall performance of using localization tightness alone is worse than the other metrics.

Figure 10: First row: example images selected for annotation by the method that uses information from localization only to evaluate the score of each box. Second row: example images selected for annotation by the method that uses classification uncertainty only (C).

Based on the observations in Sec. 3.2 of the main paper, we would like to find images containing boxes that show disagreement between the classification and localization results. When designing a metric using localization tightness, there are two important questions: "How to define the score of an image from its detected boxes?" and "How to define the score of a detected box?" For the first question, two methods have been tested: using the lowest score of all boxes (min(.)), and using a weighted sum over all boxes (wsum(.)), where the weight is P_max of each box. For the second question, different metrics have been tested as follows, where P (P_max(B) in the main paper) is the highest probability out of K categories of box B, and T (T(B) in the main paper) is the localization tightness of box B. For a set of unlabeled images, the following methods choose the images with lower scores to annotate in active learning; a sketch of these scoring variants is given after the list.

min(|T+P-1|): This metric is the one (LT/C) we used in the main paper. It selects images with boxes that have disagreement between the classification and localization results. It also picks images containing boxes that are uncertain in both the classification and localization results.

min(-|P-T|): Different from LT/C, this metric only selects images with boxes that have disagreement between the classification and localization results. It does not select boxes that are uncertain in both outputs.

wsum(|T+P-1|): This method uses the same metric as LT/C to evaluate the score of each box. However, instead of using the highest-priority (lowest) score out of all boxes as the score of an image, it uses a weighted sum across all boxes.

wsum(T): This method uses only the information from the localization outputs when deciding the score of each box. Images with boxes that have low localization tightness will be chosen by this method.
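The following sketch summarizes these scoring variants, assuming per-box pairs (T, P) have already been computed; images with lower scores are selected first. It is illustrative only, not the exact code used in our experiments.

```python
def box_scores(boxes, variant):
    # boxes: list of (T, P) pairs, T = estimated tightness, P = P_max.
    if variant in ("min_abs", "wsum_abs"):       # |T + P - 1| per box (Equ. 1)
        return [abs(t + p - 1.0) for t, p in boxes]
    if variant == "min_neg_diff":                # -|P - T| per box
        return [-abs(p - t) for t, p in boxes]
    if variant == "wsum_t":                      # T per box (tightness only)
        return [t for t, _ in boxes]
    raise ValueError(variant)

def image_score(boxes, variant):
    scores = box_scores(boxes, variant)
    if variant.startswith("min"):                # min(.) over boxes
        return min(scores)
    weights = [p for _, p in boxes]              # wsum(.): weight each box by P_max
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```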
Fig. 11 shows the mean average precision (mAP) curves of the different metrics using localization tightness; the experimental setup is the same as in Sec. 4.2 of the main paper. The proposed LT/C clearly outperforms the other metrics in the first half of the experiment. In the second half, LT/C is still the best among all metrics, but the gap between LT/C and the others becomes smaller. The difference between LT/C and min(-|P-T|) is whether images with boxes that are uncertain in both the classification and localization outputs are selected. We hypothesize that images with uncertainty in both outputs are more informative, which makes LT/C better than min(-|P-T|). Also, given the same metric for calculating the score of a detected box, LT/C and wsum(|T+P-1|) use different strategies to define the score of an image. The overlapping ratio of images sampled

Figure 11: Mean average precision curve of different metrics of localization tightness on the PASCAL 2007 detection dataset. Each point in the plot is an average of 5 trials.

Figure 12: Top-1 classification accuracy of different neural network models when input images are corrupted by Gaussian noise on the PASCAL 2012 validation dataset.
B. Discussion of Extreme Cases
As mentioned in Sec. 5 of the main paper, there could be extreme cases in which the proposed methods may not be helpful, for instance, if perfect candidate windows are available (LT/C), or if feature extractors are resilient to Gaussian noise (LS+C).

We discussed the case of perfect candidate windows in the main paper. In the following, we discuss the case of feature extractors that are resilient to noise. We have tested the resiliency to Gaussian noise of state-of-the-art feature extractors (AlexNet, VGG16, ResNet101). The classification task on the validation set of ImageNet (ILSVRC2012) is used as the testbed. Pre-trained models are used as the classifier, and input images are corrupted by Gaussian noise of different levels. Fig. 12 shows the top-1 classification accuracy under different standard deviations of Gaussian noise. With the largest standard deviation, the accuracy can drop 23-37%. This demonstrates that none of these state-of-the-art feature extractors is resilient to noise. Goodfellow et al. [7] also hypothesized that NNs with non-linear modules (e.g., sigmoid) mainly work in the linear region and could be vulnerable to local perturbations such as Gaussian noise.
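A minimal sketch of this robustness check is shown below, using a torchvision pre-trained VGG16; the dataset path and the noise standard deviations are illustrative placeholders (the exact levels used in our experiments differ), and the validation images are assumed to be arranged in class subfolders.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Hypothetical path to an ImageNet (ILSVRC2012) validation folder.
val_set = ImageFolder("/data/ilsvrc2012/val",
                      transform=T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()]))
loader = DataLoader(val_set, batch_size=64, num_workers=4)
model = models.vgg16(pretrained=True).eval()
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

for sigma in [0.0, 0.05, 0.1, 0.2]:   # illustrative noise levels on the [0, 1] scale
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            # Add pixel-wise Gaussian noise before ImageNet normalization.
            noisy = (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)
            logits = model((noisy - mean) / std)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    print(f"sigma={sigma}: top-1 accuracy {correct / total:.3f}")
```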
C. Full Experimental Results
In this section, the full results from the experiments of active learning methods on the PASCAL and MS COCO datasets are presented. These results are not included in the main paper for ease of reading and due to space constraints.
Results of Using Localization Stability Only:
As an ablation experiment, results for the method using localization stability only (LS) are added to the plot of mAP curves and the table of class-wise APs. Table 3 and Table 4 show the average precision for each method after 3 rounds of active learning on the PASCAL 2012 validation and PASCAL 2007 test sets. Fig. 13 and Fig. 14 show the mAP curves of each active learning method on the PASCAL 2012 and 2007 datasets. Each point in the plot is an average of 5 trials. Also, error bars that represent the minimum and maximum values out of the 5 trials are added at each point to show the distribution of the 5 trials. Fig. 15a and Fig. 15b show the relative saving in labeled images of each active learning method on the PASCAL 2012 and 2007 datasets. As shown in Fig. 15a and Fig. 15b, LS outperforms random sampling in most cases. Also, combining the localization stability with the classification uncertainty (LS+C) works better than using either the localization stability (LS) or the classification uncertainty (C) alone.
Results of Using 3 Cues:
In order to see whether the localization-uncertainty measurements have complementary information, we further combine all cues for selecting informative images. As images with high classification uncertainty, low localization stability, and low localization tightness should be selected for annotation, the score of the i-th image I_i is defined as U_C(I_i) - λ_ls S_I(I_i) - λ_lt T_I(I_i), where λ_ls and λ_lt are set to 1 across all the experiments in this paper.

On PASCAL 2012, combining all cues together does not work better than either LS+C or LT/C (Fig. 15a). On PASCAL 2007, 3in1 is comparable with LS+C, and better than LT/C (Fig. 15b). It seems that the localization-uncertainty measurements do not have complementary information.
We further analyze the overlapping ratio between the images chosen by different active learning methods in Table 5 and Table 6. When we compare the overlapping ratio between 3in1 and the three other metrics (C, LS, LT/C), both C and LS have an overlapping ratio of around 30%, but LT/C has only about 10%. This implies that among the three cues, LT/C provides the least information in the 3in1 method. We notice that the images chosen by the 3in1 method are highly overlapped with LS+C (over 60%), but 3in1 does not outperform LS+C. Our hypothesis is that the images (about one third of the total) chosen differently by 3in1 and LS+C make this difference in performance.

Figure 13: Mean average precision curve of different active learning methods on the PASCAL 2012 detection dataset. Each point in the plot is an average of 5 trials. The error bars represent the minimum and maximum values out of 5 trials at each point. This is a full version (LS and 3in1 added) of Fig. 5a in the main paper.

Table 3: Average precision for each method on the PASCAL 2012 validation set after 3 rounds of active learning (the number of labeled images in the training set is 1,100). This is a full version (LS and 3in1 added) of Table 1 in the main paper. All the experimental settings are the same as Table 1 in the main paper.
Figure 14: Mean average precision curve of different active learning methods on the PASCAL 2007 detection dataset. Each point in the plot is an average of 5 trials. The error bars represent the minimum and maximum values out of 5 trials at each point. This is a full version (LS and 3in1 added) of Fig. 7a in the main paper.
Table 4: Average precision for each method on the PASCAL 2007 testing set after 3 rounds of active learning (the number of labeled images in the training set is 1,100). This is a full version (LS and 3in1 added) of Table 2 in the main paper. All the experimental settings are the same as Table 2 in the main paper.

mAP Plots with Error Bars:

In the original mAP plots of the FRCNN on the MS COCO dataset (Fig. 8a in the main paper) and the SSD on the PASCAL 2007 dataset (Fig. 9a in the main paper), only the average of multiple trials is plotted. Here we add error bars that represent the minimum and maximum values of the multiple trials, which shows the distribution of the results from different trials. Fig. 16 and Fig. 17 show the mAP curves of the FRCNN on the MS COCO dataset and the SSD on the PASCAL 2007 dataset. Three methods (R, C, and LS+C) are tested in these two experiments.
Figure 15: Relative saving of labeled images for different active learning methods on the (a) PASCAL 2012 validation dataset and (b) PASCAL 2007 testing set. (a) and (b) are full versions (LS and 3in1 added) of Fig. 5b and Fig. 7b in the main paper.
Method   R      C      LS     LS+C   LT/C
C        3.5%
LS       4.0%   2.7%
LS+C     4.4%   34.7%  34.6%
LT/C     5.0%   5.9%   2.4%   5.2%
3in1     4.6%   30.4%  25.7%  62.4%  8.8%

Table 5: Overlapping ratio between the 200 images chosen by different active learning methods on the PASCAL 2012 dataset after the first round of active learning. Each number shown in the table is an average over 5 trials.
Method   R      C      LS     LS+C   LT/C
C        4.1%
LS       4.2%   3.5%
LS+C     4.3%   34.0%  39.7%
LT/C     5.6%   5.9%   4.5%   5.7%
3in1     3.9%   30.5%  32.0%  65.3%  12.0%

Table 6: Overlapping ratio between the 200 images chosen by different active learning methods on the PASCAL 2007 dataset after the first round of active learning. Each number shown in the table is an average over 5 trials.
D. Visualization of the Selection Process
The most popular metric used for measuring the performance of an object detector is mAP, and we also use this metric to evaluate the performance of different active learning methods. If one active learning method selects more informative images to label and adds them into the training set, the detector trained on this set will have a higher mAP. Besides this final numerical result, we are curious about which images are chosen in the selection process by different active learning methods, and how these chosen images are related to the average precision.

In order to visualize the selection process, we first visualize the PASCAL 2012 training set [4] by using t-Distributed Stochastic Neighbor Embedding (t-SNE) [19]. After knowing the distribution of the PASCAL 2012 training set, we further visualize the images chosen in the selection process by different active learning methods.

Figure 16: Mean average precision curve of different active learning methods on the MS COCO validation set. Each point in the plot is an average of 3 trials. The error bars represent the minimum and maximum values out of 3 trials at each point. This is a full version of Fig. 8a in the main paper.
Figure 17: Mean average precision curve of different active learning methods with SSD on the PASCAL 2007 testing set. Each point in the plot is an average of 5 trials. The error bars represent the minimum and maximum values out of 5 trials at each point. This is a full version of Fig. 9a in the main paper.
Visualization of the PASCAL 2012 Dataset:
[Figure 18: bar chart over the difficult classes (boat*, bottle*, chair*, table*, plant*); y-axis: number of selected images (0-300); bars: R, C, LS+C, LT/C]
Figure 18: The number of selected images that contain objects belonging to difficult classes, for different active learning methods.

We first visualize the PASCAL 2012 training set (5,717 images) using t-SNE with the VGG16 model [26]. t-SNE is a dimensionality-reduction technique tailored for visualizing high-dimensional datasets. Features extracted from the conv5_3 layer are used as the high-dimensional vector for each image in the PASCAL 2012 training set. The visualization of the PASCAL 2012 training set, obtained by embedding each image as a point on the 2D plane, is shown in Fig. 20. Each data point in Fig. 20 represents one image in the dataset. Images with objects from only one class are represented by markers other than dots; note that objects belonging to different classes may appear in the same image. Red dots (">1cls") represent images that contain objects from more than one class, and there are many such images in the dataset. From Fig. 22, we know that these images may contain people, chairs, tables, sofas, bottles, plants, and TVs. In fact, these images are typical living-room scenes, just like the 4 images shown in Fig. 20. With this information, we can further analyze the selection process of the different active learning methods.
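As a rough illustration of the embedding pipeline described above (not the authors' exact code), the following sketch extracts a VGG16 conv5_3-based feature per image and runs t-SNE. The image_paths list, the use of global average pooling to obtain a fixed-length vector, and the 224x224 resize are assumptions made here for simplicity.

    # Sketch: t-SNE embedding of images from VGG16 conv5_3-based features.
    import numpy as np
    import torch
    from PIL import Image
    from sklearn.manifold import TSNE
    from torchvision import models, transforms

    vgg = models.vgg16(weights="IMAGENET1K_V1").eval()   # or pretrained=True on older torchvision
    conv_trunk = vgg.features                             # ends after the conv5_3 block and pooling

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    features = []
    with torch.no_grad():
        for path in image_paths:                          # hypothetical list of image file paths
            x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            fmap = conv_trunk(x)                          # (1, 512, 7, 7) activation map
            features.append(fmap.mean(dim=(2, 3)).squeeze(0).numpy())   # 512-d vector per image

    embedding = TSNE(n_components=2, perplexity=30).fit_transform(np.stack(features))
    # embedding is an (N, 2) array; each row is one image's position on the 2D plane.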
Visualization of Different Active Learning Methods:
We would like to visualize the selection process of the different active learning methods. The experimental settings are the same as in Sec. 4.1 of the main paper. For the analysis and visualization in this section, we use only one trial instead of the average of 5 trials, for ease of reading. The baseline FRCNN detector [23] is trained on a training set of 500 labeled images, and then each active learning algorithm is executed for 3 rounds. In each round, we select 200 images and add them to the existing training set. After 3 rounds, each method has selected 600 images for annotation, and a set of 1,100 labeled images is used to train the detector (a minimal sketch of this round-based selection loop is given at the end of this section).

Table 7 shows the average precision for each method on the PASCAL 2012 validation set after 3 rounds of active learning. As defined in the main paper, categories with AP lower than 40% in passive learning (R) are defined as difficult categories; these difficult classes are marked by an asterisk in Table 7. We further analyze the selection results of the different methods with the visualization shown in Fig. 21. There are 5,217 images in total in each graph (the 500 images in the initial training set of this trial are not included). The 600 images selected for annotation by each active learning method are represented by green asterisks, and the remaining 4,617 images that have not been chosen are represented by black dots.

We have two major observations from the visualization results on the PASCAL 2012 dataset. First, the random sampling (R) method selects images for annotation across all categories, regardless of whether a class is difficult or easy. Compared to the other methods, lots of images of [...] especially in the difficult categories. There is a 10× difference between difficult and non-difficult categories in the improvement of LS+C over C, as shown in Fig. 6a in the main paper. These 5 difficult categories are: boat, bottle, chair, table, and plant. Fig. 22 shows that all difficult categories except boat are located in the left part of the 2D plane. These categories are also the ones that appear in living-room scenes (Fig. 20), as mentioned in the previous section. By visual inspection, the red rectangles in Fig. 21c and Fig. 21b show that the proposed LS+C tends to select more images for annotation in these difficult classes than the baseline method C; quantitative results are shown in Fig. 18. The proposed LS+C selects considerably more images containing objects of difficult classes than the baseline method C, and by selecting more images for annotation it obtains a larger improvement in these difficult classes. In contrast, for easy classes (categories with AP higher than 70% in passive learning) such as cat and dog, the baseline method C selects more images than the proposed LS+C, as shown in Fig. 19. These observations indicate that C focuses on non-difficult categories to obtain an overall improvement in mAP, but does not perform well on difficult categories.

[Figure 19: bar chart over the non-difficult classes (aero, bike, bird, bus, car, cat, cow, dog, horse, mbike, persn, sheep, sofa, train, tv); y-axis: number of selected images (0-300); bars: R, C, LS+C, LT/C]
Figure 19: The number of selected images that contain objects belonging to non-difficult classes, for different active learning methods.

[Figure 20: scatter plot of the t-SNE embedding; legend: ">1cls" (red dots) plus one marker per class: aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, persn, plant, sheep, sofa, train, tv]
Figure 20: t-SNE embeddings of images in the PASCAL 2012 training set. VGG16 is used for generating the high-dimensional vectors of the images used for the embedding. Each data point in the scatter plot is an image; red dots (">1cls") are images that contain objects from more than one class.

[Table 7: columns: Method, aero, bike, bird, boat*, bottle*, bus, car, cat, chair*, cow, table*, dog, horse, mbike, persn, plant*, sheep, sofa, train, tv, mAP. Only part of the first row (R) survives the extraction: 68.3, 61.5, 54.2, 27.8, 30.4, 68.2, 58.2, 76.3, 28.4, 44.8, 31.1, 73.7, 64.1, 67.9, 66.7, 21.9, 52.4, 41.7; the remaining rows are not recoverable.]
Table 7: Average precision for each method on the PASCAL 2012 validation set after 3 rounds of active learning (the number of labeled images in the training set is 1,100). Each number shown in the table is the result of one trial (different from Table 1 in the main paper, which shows the average over 5 trials) and is displayed in percentage. Numbers in bold are the best results per column, and underlined numbers are the second best results. Categories with AP lower than 40% in passive learning (R) are defined as difficult categories and are marked by an asterisk.

[Figure 21: four t-SNE scatter plots: (a) Random (R), (b) Classification (C), (c) Localization stability + classification (LS+C), (d) Localization tightness + classification (LT/C); green asterisks (sel) vs. black dots (unsel)]
Figure 21: Visualization of the selection results of different active learning methods. Green asterisks (sel) are the images selected for annotation by each active learning method, and black dots (unsel) are the images that have not been selected. A detailed version of this graph with class-wise information is shown in Fig. 23.

[Figure 22: twenty per-class t-SNE panels: (a) Aeroplane, (b) Bicycle, (c) Bird, (d) Boat, (e) Bottle, (f) Bus, (g) Car, (h) Cat, (i) Chair, (j) Cow, (k) Diningtable, (l) Dog, (m) Horse, (n) Motorbike, (o) Person, (p) Pottedplant, (q) Sheep, (r) Sofa, (s) Train, (t) TV monitor; colored points vs. "others"]
Figure 22: t-SNE embeddings of images for each category in the PASCAL 2012 training set. Different from Fig. 20, each colored point in the graphs represents an image that includes at least one object belonging to the target class. For example, each orange plus sign in (a) represents an image which has at least one aeroplane in it.

[Figure 23: four t-SNE scatter plots with class-wise markers (">1cls", one marker per class, and unselected images): (a) Random (R), (b) Classification (C), (c) Localization stability + classification (LS+C), (d) Localization tightness + classification (LT/C)]
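Finally, to make the round-based selection protocol used in this appendix concrete (500 seed images, then 3 rounds of 200 selections each), here is a minimal sketch of the loop. train_detector and score_informativeness are hypothetical placeholders, not the paper's FRCNN training or scoring code.

    # Sketch of the round-based active learning protocol used in these experiments.
    def active_learning_rounds(labeled, unlabeled, rounds=3, per_round=200):
        detector = train_detector(labeled)                   # baseline trained on the 500 seed images
        for _ in range(rounds):
            # Higher score = more informative under the chosen metric
            # (classification uncertainty, LS, LT/C, or a combination).
            scored = sorted(unlabeled,
                            key=lambda img: score_informativeness(detector, img),
                            reverse=True)
            picked, unlabeled = scored[:per_round], scored[per_round:]
            labeled = labeled + picked                       # annotate the picked images and add them
            detector = train_detector(labeled)               # retrain on the enlarged training set
        return detector, labeled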