DeepBox: Learning Objectness with Convolutional Networks
Weicheng Kuo    Bharath Hariharan    Jitendra Malik
University of California, Berkeley
{wckuo, bharath2, malik}@eecs.berkeley.edu

Abstract
Existing object proposal approaches use primarily bottom-up cues to rank proposals, while we believe that "objectness" is in fact a high-level construct. We argue for a data-driven, semantic approach for ranking object proposals. Our framework, which we call DeepBox, uses convolutional neural networks (CNNs) to rerank proposals from a bottom-up method. We use a novel four-layer CNN architecture that is as good as much larger networks on the task of evaluating objectness while being much faster. We show that DeepBox significantly improves over the bottom-up ranking, achieving the same recall with 500 proposals as achieved by bottom-up methods with 2000. This improvement generalizes to categories the CNN has never seen before and leads to a 4.5-point gain in detection mAP. Our implementation achieves this performance while running at 260 ms per image.
1. Introduction
Object detection methods have moved from scanning-window approaches [9] to ones based on bottom-up object proposals [11]. Bottom-up proposals [1] have two major advantages: 1) by reducing the search space, they allow the use of more sophisticated recognition machinery, and 2) by pruning away false positives, they make detection easier [14, 10].

Most object proposal methods rely on simple bottom-up grouping and saliency cues. The rationale for this is that this step should be reasonably fast and generally applicable to all object categories. However, we believe that there is more to objectness than bottom-up grouping or saliency. For instance, many disparate object categories might share high-level structures (such as the limbs of animals and robots), and detecting such structures might hint towards the presence of objects. A proposal method that incorporates these and other such cues is likely to perform much better.

In this paper, we argue for a semantic, data-driven notion of objectness. Our approach is to present a large database of images with annotated objects to a learning algorithm, and let the algorithm figure out what low-, mid- and high-level cues are most discriminative of objects. Following recent work on a range of recognition tasks [11, 12, 18], we use convolutional networks (CNNs) [19] for this task. Concretely, we train a CNN to rerank a large pool of object proposals produced by a bottom-up proposal method (we use Edge boxes [30] for most experiments in this paper). For ease of reference, we call our approach DeepBox. Figure 1 shows our framework.

We propose a lightweight four-layer network architecture that significantly improves over bottom-up proposal methods in terms of ranking (26% relative improvement in AUC over Edge boxes on VOC 2007 [8]). Our network architecture is as effective as state-of-the-art classification networks on this task while being much smaller and thus much faster. In addition, using ideas from SPP [13] and Fast R-CNN [10], our implementation runs in 260 ms per image, comparable to some of the fastest bottom-up proposal approaches like Edge boxes (250 ms). We also provide evidence that what our network learns is category-agnostic: our improvements in performance generalize to categories that the CNN has not seen before (16% improvement over Edge boxes on COCO [20]). Our results suggest that a) there is indeed a generic, semantic notion of objectness beyond bottom-up saliency, and that b) this semantic notion of objectness can be learnt effectively by a lightweight CNN.

Object proposals are just the first step in an object detection system, and the final evaluation of a proposal system is the impact it has on detection performance. We show that the Fast R-CNN detection system [10], using 500 DeepBox proposals per image, is 4.5 points better than the same object detector using 500 Edge box proposals. Thus our high-quality proposals directly lead to better object detection.

The rest of the paper is laid out as follows. In Section 2 we discuss related work. We describe our network architecture and training and testing procedures in Section 3. Section 4 describes experiments and we end with a discussion.
2. Related work
Russell et al. [26] were among the first to suggest a category-independent method to propose putative objects.

Figure 1. The DeepBox framework. Given any RGB image, we first generate bottom-up proposals and then rerank them using a CNN. High-ranked boxes are shown in green and low-ranked ones in blue. The number in each box is its ranking in the proposal pool. DeepBox corrects the ranking of Edge boxes, ranking objects higher than background.

Their method involved sampling regions from multiple segmentations of the image. More recently, Alexe et al. [1] and Endres et al. [6] proposed using bottom-up object proposals as a first step in recognition. Expanding on the multiple segmentations idea, Selective search [29] uses regions from hierarchical segmentations in multiple color spaces as object proposals. CPMC [4] uses multiple graph-cut based segmentations with multiple foreground seeds and multiple foreground biases to propose objects. GOP [17] replaces graph cuts with a much faster geodesic-based segmentation. MCG [2] also uses multiple hierarchical segmentations from different scales of the image, but produces proposals by combinatorially grouping regions. Edge boxes [30] uses contour information instead of segments: bounding boxes which have fewer contours straddling the boundary of the box are considered more likely to be objects.

Many object proposal methods also include a ranking of the regions. This ranking is typically based on low-level region features such as saliency [1], and is sometimes learnt [2, 4]. Relatively simple ranking suffices when the goal is a few thousand proposals as in MCG [2], but narrowing the list down to a few hundred as in CPMC [4] requires more involved reasoning. DeepBox aims at such a ranking.

Multibox [7, 28] directly produces object proposals from images using a sophisticated neural network. In contemporary work, Faster R-CNN [25] uses the same large network to propose objects and classify them. DeepMask [22] also uses a very deep network to directly produce segment proposals. In comparison, our architecture is quite lightweight and can be used out of the box to rerank any bottom-up proposals.

Finally, we direct the reader to [15, 14] for a more thorough evaluation of bottom-up proposal methods.
3. Method
The pipeline consists of two steps: 1) Generate an initial pool of N bottom-up proposals. Our method is agnostic to the precise bottom-up proposal method. The main point of this step is to prune out the obviously unlikely windows so that DeepBox can focus on the hard negatives. 2) Rerank the proposals using scores obtained by the DeepBox network. We rerank each proposal by cropping out the proposal box and feeding it into a CNN, as described by Girshick et al. [11]. Because highly overlapping proposals are handled independently, this strategy is computationally wasteful and thus slow. A more sophisticated and much faster approach, using ideas from [13, 10], is described in Section 3.2.

For datasets without many small objects (e.g. PASCAL), we often only need to rerank the top proposals to obtain good enough recall. As shown by [30], increasing the number of Edge box proposals beyond … leads to only a marginal increase in recall. For more challenging datasets with small objects (e.g. COCO [20]), reranking more proposals continues to provide gains in recall beyond 2000 proposals.

3.1. Network architecture

When it comes to the network architecture, we would expect that predicting the precise category of an object is harder than predicting objectness, and so we would want a simpler network for objectness. This also makes sense from a computational standpoint, since we do not want the object proposal scoring stage to be as expensive as the detector itself.

We organized our search for a suitable network architecture by starting from the architecture of [18] and gradually ablating it while trying to preserve performance. The ablation here can be performed by reducing the number of channels in the different layers (thus reducing the number of parameters in the layer), by removing some layers, or by decreasing the input resolution so that the features computed become coarser.

The original architecture gave an AUC on PASCAL VOC of … (…) for IoU=0.5 (0.7). First, we changed the number of outputs of fc7 to 1024 with other things fixed and found that performance remained unchanged. Then we adjusted the input image crop of the network from …×… to …×… and observed that the AUC dropped … points for IoU=0.5 and … points for IoU=0.7 on PASCAL. With this input size, we tried removing fc6 (drop: 2.3 points), conv5 (drop: 2.9 points), conv4-conv5 (drop: 10.6 points) and conv3-conv4-conv5 (drop: 6.7 points). This last experiment meant that dropping all of conv3, conv4 and conv5 was better than just dropping conv4 and conv5. This might be because conv3, conv4 and conv5, while adding to the capacity of the network, are also likely to overfit to the task of image classification (as described below, the convolutional layers are initialized from a model trained on ImageNet). We stuck to this architecture (i.e., without conv3, conv4 and conv5) and explored different input sizes for the net. For an input size of …×…, we obtained a competitive AUC of … (for IoU=0.5) and … (for IoU=0.7) on PASCAL, or equivalently a …-point drop against the baseline.

Our final architecture can be written down as follows. Denote by conv(k, c, s) a convolutional layer with kernel size k, stride s and number of output channels c. Similarly, pool(k, s) denotes a pooling layer with kernel size k and stride s, and fc(c) a fully connected layer with c outputs. Then, our network architecture is:

conv(11, 96, 4) − pool(3, 2) − conv(5, 256, 1) − fc(1024) − fc(2).

Each layer except the last is followed by a ReLU non-linearity.
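To make the architecture concrete, here is a minimal sketch in PyTorch (our choice; the original implementation was Caffe-era). The kernel sizes, strides and channel counts of the two convolutional layers follow the AlexNet layers they are initialized from; the input crop size in the usage example is a placeholder, since the exact value is garbled in this copy.

```python
import torch
import torch.nn as nn

class DeepBoxNet(nn.Module):
    """Sketch of the four-layer objectness network:
    conv(11, 96, 4) - pool(3, 2) - conv(5, 256, 1) - fc(1024) - fc(2)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),   # conv1, initialized from AlexNet
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=1),  # conv2, initialized from AlexNet
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024),  # fc input dimension inferred from the crop size
            nn.ReLU(inplace=True),
            nn.Linear(1024, 2),   # two outputs: object vs. not, fed to a softmax
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Score a batch of cropped-and-resized proposal windows (crop size is a placeholder).
net = DeepBoxNet()
crops = torch.randn(8, 3, 128, 128)
objectness = net(crops).softmax(dim=1)[:, 1]  # probability that each crop is an object
```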
Our problem is a binary classification problem (object or not), so we only have two outputs, which are passed through a softmax. Our input size is …×…. While we finalized this architecture on PASCAL for Edge boxes, we show in Section 4 that the same architecture works just as well for other datasets such as COCO [20] and for other proposal methods such as Selective Search [29] or MCG [2].

3.2. Fast DeepBox

Running the CNN separately on highly overlapping boxes wastes a lot of computation. He et al. [13] pointed out that the convolutional part of the network can be shared among all the proposals. Concretely, instead of cropping out individual boxes, we pass the entire image into the network (at a high resolution). After passing through all the convolutional and pooling layers, the result is a feature map which is some fraction of the image size. Given this feature map and a set of bounding boxes, we want to compute a fixed-length vector for each box that we can then feed into the fully connected layers. To do this, note that each bounding box B in the image space corresponds to a box b in the feature space of the final convolutional layer, with the scale and aspect ratio of b being dependent on B and thus different for each box. He et al. propose to use a fixed spatial pyramid grid to max-pool features for each box. While the size of the grid cell varies with the box, the number of bins does not, thus resulting in a fixed feature vector for each box. From here on, each box is handled separately. However, because the convolutional feature maps are shared, we have saved a lot of computation.

One issue with this approach is that all the convolutional feature maps are computed at just one image scale, which may not be appropriate for all objects. He et al. [13] suggest a multiscale version where the feature maps are computed at a few fixed scales, and for each box we pick the best scale, which they define as the one where the area of the scaled box is closest to a predefined value. We experiment with both the single-scale version and a multiscale version using three scales.

We use the implementation proposed by Girshick in Fast R-CNN [10]. Fast R-CNN implements this pooling as a layer in the CNN (the RoI pooling layer), allowing us to train the network end-to-end. To differentiate this version of DeepBox from the slower alternative based on cropping and warping, we call this version Fast DeepBox in the rest of the paper.
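The box-to-feature-map pooling step can be sketched as follows: a simplified, single-scale RoI max pooling in PyTorch (our assumption, not the paper's Caffe implementation; the `spatial_scale` value and the 2×2 grid are illustrative placeholders). In practice one would likely use the equivalent `torchvision.ops.roi_pool` operation.

```python
import torch

def roi_max_pool(feature_map, boxes, out_size=2, spatial_scale=1.0 / 16):
    """Pool a fixed out_size x out_size grid of max-pooled features per box.

    feature_map: (C, H, W) conv features for the whole image.
    boxes: (N, 4) image-space boxes as (x_min, y_min, x_max, y_max).
    spatial_scale: how much the conv layers shrink the image (placeholder).
    """
    C, H, W = feature_map.shape
    pooled = []
    for box in boxes:
        # Project the image-space box onto the feature map.
        x0, y0, x1, y1 = (box * spatial_scale).tolist()
        x0, y0 = int(x0), int(y0)
        x1 = min(max(int(x1), x0 + 1), W)
        y1 = min(max(int(y1), y0 + 1), H)
        roi = feature_map[:, y0:y1, x0:x1]
        # Max-pool onto a fixed grid: bin sizes vary per box, bin count does not.
        grid = torch.nn.functional.adaptive_max_pool2d(roi, out_size)
        pooled.append(grid.flatten())
    return torch.stack(pooled)  # (N, C * out_size * out_size), fixed length per box

feats = torch.randn(256, 38, 50)  # conv feature map for one image
boxes = torch.tensor([[10., 20., 200., 180.], [0., 0., 64., 64.]])
vecs = roi_max_pool(feats, boxes)  # ready for the fully connected layers
```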
3.3. Training

The first two convolutional layers were initialized using the publicly available Imagenet model [18]. This model was pretrained on 1000 Imagenet categories for the classification task. The fc layers are initialized randomly from a Gaussian distribution with σ = 0.…. Our DeepBox training procedure consists of two stages. Similar to the classical notion of bootstrapping in object detection, we first train an initial model to distinguish between object boxes and randomly sampled sliding windows from the background. This teaches it the difference between objects and background. To enable it to do better at correcting the errors made by bottom-up proposal methods, we run a second training round where we train the model on bottom-up proposals from a method such as Edge boxes.

First we generate negative windows by simple raster scanning. The sliding-window step size is selected based on the box-searching strategy of Edge boxes [30]. Following Zitnick et al. [30], we use α to denote the IoU threshold for neighboring sliding windows, and set it to 0.65. We generated windows in five aspect ratios: (w : h) = (1 : 1), (2 : 3), (1 : 3), (3 : 2), and (3 : 1). Negative windows which overlap with a ground-truth object by more than β− = 0.… are discarded.

To obtain positives, we randomly perturb the corners of ground-truth bounding boxes. Suppose a ground-truth bounding box has coordinates (x_min, y_min, x_max, y_max), with the width denoted by w and the height denoted by h. Then the perturbed coordinates are distributed as:

x'_min ∼ unif(x_min − γw, x_min + γw)    (1)
y'_min ∼ unif(y_min − γh, y_min + γh)    (2)
x'_max ∼ unif(x_max − γw, x_max + γw)    (3)
y'_max ∼ unif(y_max − γh, y_max + γh)    (4)

where γ defines the level of noise. Larger γ introduces more robustness into positive training samples, but might hurt localization. In practice we found that γ = 0.… works well. In case some perturbed points go out of the image, we set them to stay on the image border. Perturbed windows that overlap with ground-truth boxes by less than β+ = 0.… are discarded.

Next, we trained the net on bottom-up proposals from a method such as Edge boxes [30]. Compared to sliding windows, these proposals contain more edges, texture and parts of objects, and thus form hard negatives. While sliding windows trained the net to distinguish primarily between background and objects, training on these proposals allows the net to learn the notion of complete objects and better adapt itself to the errors made by the bottom-up proposer. This is critical: without this stage, performance on PASCAL drops from … AUC at IoU=0.7 to …. The window preparation and training procedure is as described above, except that sliding windows are replaced with Edge boxes proposals. We set the overlap threshold β+ = 0.… slightly higher, so that the net can learn to distinguish between tight and loose bounding boxes. We set β− = 0.…, below which the windows are labeled negative. Windows with IoU ∈ [0.…, 0.…] are discarded lest they confuse the net.

We balanced the ratio of positive and negative windows at training time for both stages. Momentum is set to 0.9 and weight decay to 0.0005. The initial learning rate is 0.001, which we decrease by 0.1 every 20,000 iterations. We train DeepBox for 60,000 iterations with a mini-batch size of 128. The Fast DeepBox network was trained for …k iterations in both sliding-window and hard-negative training. Three scales [400, …, …] were used for both training and testing.
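A sketch of the positive-sampling step, i.e. the corner jitter of Eqs. (1)-(4) followed by the overlap filter. The γ and β+ values are placeholders, since the exact numbers are garbled in this copy:

```python
import random

def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def sample_positive(gt, img_w, img_h, gamma=0.3, beta_pos=0.5):
    """Jitter a ground-truth box per Eqs. (1)-(4); gamma and beta_pos are placeholders."""
    x0, y0, x1, y1 = gt
    w, h = x1 - x0, y1 - y0
    box = [random.uniform(x0 - gamma * w, x0 + gamma * w),
           random.uniform(y0 - gamma * h, y0 + gamma * h),
           random.uniform(x1 - gamma * w, x1 + gamma * w),
           random.uniform(y1 - gamma * h, y1 + gamma * h)]
    # Clip perturbed corners to the image border, as in the paper.
    box[0], box[2] = max(box[0], 0.0), min(box[2], float(img_w))
    box[1], box[3] = max(box[1], 0.0), min(box[3], float(img_h))
    if box[2] <= box[0] or box[3] <= box[1]:
        return None  # degenerate window
    # Discard windows that no longer overlap the ground truth tightly enough.
    return tuple(box) if iou(box, gt) >= beta_pos else None
```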
4. Experiments
All experiments except those in Sections 4.7 and 4.8 were done using DeepBox. We evaluate Fast DeepBox in Section 4.7, and as a final experiment plug it into an object detection system in Section 4.8.
4.1. Experimental setup

In our first set of experiments, we evaluated our approach on PASCAL VOC 2007 [8] and on the newly released COCO [20] dataset. For these experiments, we used Edge boxes [30] for the initial pool of proposals. On PASCAL we reranked the top 2048 Edge box proposals, whereas on COCO we reranked all of them. We used the network architecture described in Section 3.1. For results on PASCAL VOC 2007, we trained our network on the trainval set and tested on the test set. For results on COCO, we trained our network on the train set and evaluated on the val set.
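Operationally, the evaluation pipeline reduces to scoring each bottom-up proposal and sorting. A minimal sketch, where `net` and `crop_and_resize` are hypothetical stand-ins for the network of Section 3.1 and standard preprocessing:

```python
import torch

def rerank_proposals(image, boxes, net, crop_and_resize, top_k=2048):
    """Rerank bottom-up proposals by DeepBox objectness score.

    boxes: (N, 4) proposals in their original bottom-up ranking. Only the
    top_k proposals are rescored (2048 on PASCAL); the rest keep their
    bottom-up order behind them.
    """
    head, tail = boxes[:top_k], boxes[top_k:]
    crops = torch.stack([crop_and_resize(image, b) for b in head])
    with torch.no_grad():
        scores = net(crops).softmax(dim=1)[:, 1]  # objectness probability
    order = scores.argsort(descending=True)
    return torch.cat([head[order], tail])         # reranked head, untouched tail
```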
4.2. Ranking performance

We first compare our ranking to the ranking output by Edge boxes. Figure 2 plots recall vs. number of proposals on PASCAL VOC 2007 for IoU=0.7. We observe that DeepBox outperforms Edge boxes in all regimes, especially with a low number of proposals. The same is true for IoU=0.5 (not shown). The AUCs (areas under the curve) for DeepBox are … (…) vs. Edge boxes' … (…) for IoU=0.5 (0.7), suggesting that DeepBox proposals are … better at IoU=0.5 and … better at IoU=0.7 compared to Edge boxes. Figure 3 plots the same for COCO. The AUCs for DeepBox are … (…) vs. Edge boxes' … (…) for IoU=0.5 (0.7), suggesting DeepBox proposals are … better than Edge boxes. If we reranked only the top 2048 proposals instead, the AUCs are … (…).

On PASCAL, we also plot the recall vs. IoU threshold curves at 1000 proposals in Figure 4. At 1000 proposals, the gain of DeepBox is not as big as at 100-500 proposals, but we still see that it is superior in all regimes of IoU.

Part of this performance gain is due to small objects, defined as objects with area less than …. On COCO, the AUCs for DeepBox (trained on all categories) are (…, …), while the numbers for Edge boxes are (…, …) for IoU = (0.5, 0.7). DeepBox outperforms Edge boxes by more than … on small objects.

Comparison to other proposal methods.
We can also compare our ranked set of proposals to all the other proposals in the literature. We show this comparison for IoU=0.7 in Table 1. The numbers are obtained from [30] except for MCG and DeepBox, which we computed ourselves. In Table 1, the metrics are the number of proposals needed to achieve 25%, 50% and 75% recall, and the maximum recall using 5000 boxes.
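Given per-ground-truth best-overlap bookkeeping, these metrics reduce to a few lines. A sketch; the cumulative-best-IoU input format and the linear-axis AUC normalization are our assumptions, and the paper's exact protocol may differ:

```python
import numpy as np

def recall_at_k(best_iou_by_rank, k, iou_thresh=0.7):
    """Fraction of ground-truth boxes covered by some top-k proposal.

    best_iou_by_rank: (num_gt, max_proposals) array where entry (g, r) is the
    IoU between ground-truth box g and its best-matching proposal of rank <= r
    (a precomputed cumulative max, an assumption of this sketch).
    """
    return float((best_iou_by_rank[:, k - 1] >= iou_thresh).mean())

def proposals_needed(best_iou_by_rank, target_recall, iou_thresh=0.7):
    """Smallest k whose recall@k reaches target_recall (Table 1's 25/50/75% columns)."""
    for k in range(1, best_iou_by_rank.shape[1] + 1):
        if recall_at_k(best_iou_by_rank, k, iou_thresh) >= target_recall:
            return k
    return None  # never reached within the proposal budget

def auc_of_recall_curve(best_iou_by_rank, iou_thresh=0.7):
    """Area under the recall-vs-number-of-proposals curve, normalized to [0, 1]."""
    ks = np.arange(1, best_iou_by_rank.shape[1] + 1)
    recalls = [recall_at_k(best_iou_by_rank, k, iou_thresh) for k in ks]
    return float(np.trapz(recalls, ks) / ks[-1])
```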
4.3. Qualitative results

Figures 5 and 6 visualize DeepBox and Edge box performance on PASCAL and COCO images. Figure 5 shows the ground-truth boxes that are detected ("hits", shown in green, with the corresponding best-overlapping proposal shown in blue) and those that are missed (shown in red) for both proposal rankings. We observe that in complicated scenes with multiple small objects or cluttered background, DeepBox significantly outperforms Edge boxes. The tiny boats, the cars parked by the road, the donuts and the people in the shade are all correctly captured. (We computed recall using the code provided by [30].)

Figure 5. Visualization of hits and misses. In each image, the green boxes are ground-truth boxes for which a highly overlapping proposal exists (with the corresponding proposals shown as blue boxes) and red boxes are ground-truth boxes that are missed. The IoU threshold is …. We evaluated 1000 proposals per image for COCO and 500 proposals per image for PASCAL. The left image in every pair shows the result of DeepBox ranking, and the right image shows the ranking from Edge boxes. In cluttered scenes, DeepBox has a higher recall. See Section 4.3 for a detailed discussion.

Figure 6. In each image we show the distribution of the proposals produced by pasting a red box for each proposal. Only the top 100 proposals are shown. For each pair of images, DeepBox is on the left and Edge boxes is on the right. DeepBox proposals are more tightly concentrated on the objects. See Section 4.3 for a detailed discussion.

This is also validated by looking at the distribution of the top 100 proposals (Figure 6), which is shown in red. In general, DeepBox's bounding boxes are very densely focused on the objects of interest while Edge boxes primarily recognizes contours and often spreads proposals evenly across the image in a complicated scene.
Figure 2. PASCAL evaluation at IoU=0.7. DeepBox starts off much higher than Edge boxes. The wide margin continues all the way until … proposals and gradually decays. The two curves join at 2048 proposals because we chose to rerank this number of proposals.

Figure 3. MS COCO evaluation at IoU=0.7. The strong gain demonstrated by DeepBox on COCO suggests that our learnt objectness is particularly helpful in a complicated dataset.

Figure 4. Variation of recall with IoU threshold at 1000 proposals. DeepBox (average recall: 57.0%) is better than Edge boxes (average recall: 54.4%) in all regimes. Comparisons to other proposal methods are shown in the supplementary.

Table 1. Comparison to other proposal methods at IoU=0.7: number of proposals needed to reach 25%, 50% and 75% recall, maximum recall using 5000 boxes, and runtime.

Method            AUC   @25%  @50%  @75%  Recall (%)  Time (s)
BING [5]          0.20  292   -     -     29          …
Ranta. [24]       0.23  184   584   -     68          10
Objectness [1]    0.27  27    -     -     39          3
Rand. P. [21]     0.35  42    349   3023  80          1
Rahtu [23]        0.37  29    307   -     70          3
Sel. Search [29]  0.40  28    199   1434  …           …

4.4. Generalization to unseen categories

The high recall our method achieves on the PASCAL 2007 test set does not guarantee that our net is truly learning general objectness. It is possible that the net is learning just the union of the 20 object categories and using that knowledge to rank proposals. To evaluate whether the net is indeed learning a more general notion of objectness that extends beyond the categories it has seen during training, we did the following experiment (a sketch of the bookkeeping follows this list):

• We identified the 36 overlapping categories between Imagenet and COCO.

• We trained the net just on these overlapping categories on COCO, with initialization from Imagenet. This means that during training, only boxes that overlapped highly with ground truth from the 36 overlapping categories were labeled positives, and others were labeled negatives. Also, when sampling positives, only the ground-truth boxes corresponding to these overlapping categories were used to produce perturbed positives. This is equivalent to training on a dataset where all the other categories have not been labeled at all.

• We then evaluated our performance on the rest of the categories in COCO (44 in number). This means that at test time, only proposed boxes that overlapped with ground truth from the other 44 categories were considered true positives. Again, this corresponds to evaluating on a dataset where the 36 training categories have not been labeled at all.
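The category-split bookkeeping above might look as follows; the set contents, helper names and β thresholds are illustrative assumptions, not the paper's code:

```python
# Hypothetical bookkeeping for the category-split experiment: training labels
# come only from the 36 Imagenet-overlapping categories, while evaluation only
# credits recall on the remaining 44 categories.
TRAIN_CATS = {"dog", "cat", "car"}  # stand-in for the 36 overlapping categories
EVAL_CATS = {"frisbee", "couch"}    # stand-in for the other 44 categories

def label_training_window(window, gt_boxes, iou, beta_pos=0.5, beta_neg=0.3):
    """Label 1 only for windows matching ground truth from TRAIN_CATS;
    ground truth from other categories is treated as if unlabeled."""
    best = max((iou(window, g["box"]) for g in gt_boxes if g["cat"] in TRAIN_CATS),
               default=0.0)
    if best >= beta_pos:
        return 1
    if best <= beta_neg:
        return 0
    return None  # intermediate overlap: discard the window

def is_hit(proposal, gt_boxes, iou, thresh=0.5):
    """At test time, a proposal only counts if it covers an unseen-category box."""
    return any(g["cat"] in EVAL_CATS and iou(proposal, g["box"]) >= thresh
               for g in gt_boxes)
```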
All other experimental settings remain the same. As before, we use Edge boxes for the initial pool of proposals, which we rerank, and compare the ranking we obtain to the ranking output by Edge boxes.

When reranking the top 2048 proposals, DeepBox achieved a … (…) AUC improvement over Edge boxes for IoU=0.5 (0.7). When reranking all proposals output by Edge boxes, DeepBox achieved a … (…) improvement over Edge boxes for IoU=0.5 (0.7) (Figure 7). In this setting, DeepBox outperforms Edge boxes in all regimes on unseen categories for both IoUs.

Figure 7. Evaluation on unseen categories, when ranking all proposals, at IoU=0.5.

In Figure 8, we plot the ratio of the AUC obtained by DeepBox to that obtained by Edge boxes for all the 44 testing categories. In more than half of the testing categories, we obtain an improvement greater than …, suggesting that the gains provided by DeepBox are spread over multiple categories. This suggests that DeepBox is actually learning a class-agnostic notion of objectness. Note that DeepBox performs especially well for the animal supercategory, because all animal categories have similar geometric structure and training the net on some animals helps it recognize other animals. This validates our intuition that there are high-level semantic cues that can help with objectness. In the sports supercategory, DeepBox performs worse than Edge boxes, perhaps because most objects of this category have salient contours that favor the Edge boxes algorithm. These results on COCO demonstrate that our network has learnt a notion of objectness that generalizes beyond training categories.

Table 2. DeepBox on top of other proposers. For each method we show the AUC at IoU=0.5/0.7 of (left to right) the original ranking, the reranking produced by DeepBox trained on Edge boxes, and that produced after finetuning on the corresponding proposals.

Method       Vanilla    DeepBox (trained on [30])  DeepBox (finetuned)  Total time (s)
Sel. Search  0.27/0.17  0.27/0.19                  0.32/0.22            12.5
MCG          …/…        …/…                        …/…                  …
4.5. Comparison to larger networks

We also experimented with larger networks such as Alexnet [18] and VGG [27]. Unsurprisingly, large networks capture objectness better than our small network. However, the difference is quite small: with VGG, the AUCs on PASCAL are … (…) for IoU=0.5 (0.7). The numbers for Alexnet are … (…). In comparison, our network achieves … (…).

For evaluation on COCO, we randomly selected 5000 images and computed AUCs using the VGG and Alexnet architectures. All networks rerank just the top 2048 Edge box proposals. At IoU=0.5, VGG gets … and Alexnet gets …, compared to the … obtained by our network. At IoU=0.7, VGG gets … and Alexnet gets …, compared to the … obtained by our network. When reranking all proposals, our small net gets … (…) for IoU=0.5 (0.7). These experiments suggest that our network architecture is sufficient to capture the semantics of objectness, while also being much more efficient to evaluate compared to the more massive VGG and Alexnet architectures.

4.6. DeepBox on top of other proposers

It is a natural question to ask whether the DeepBox framework applies to other bottom-up proposers as well. We experimented with MCG and Selective Search by reranking just the top 2048 proposals. (For computational reasons, we did this experiment on a smaller set of 5000 COCO images.) We experimented with two kinds of DeepBox models: a single model trained on Edge boxes, and separate models trained on each proposal method (Table 2).

A model trained on Edge boxes does not provide much gain, and indeed sometimes hurts, when run on top of Selective Search or MCG proposals. However, if we retrain DeepBox separately on each proposal method, this effect goes away. In particular, we get large gains on Selective Search for both IoU thresholds (as with Edge boxes). On MCG, DeepBox does not hurt, but does not help much either. Nevertheless, the gains we get on Edge boxes and Selective Search suggest that our approach is general and can work with any proposal method, or even ensembles of proposal methods (a possibility we leave for future work).

Figure 8. Evaluation on unseen categories: category-wise breakdown. This demonstrates that the DeepBox network has learnt a notion of objectness that generalizes beyond training categories.
4.7. Fast DeepBox

We experimented with Fast DeepBox on COCO. The AUCs for multiscale Fast DeepBox are … (…) vs. Edge boxes' … (…) for IoU=0.5 (0.7), a gain of … over Edge boxes. When we reranked only the top 2048 proposals instead, the AUCs are … (…). Compared with DeepBox's multi-threaded runtime of … s, Fast DeepBox (multiscale) is an order of magnitude faster: it takes … s to rerank all proposals or … s to rerank the top 2048. This compares favorably to other bottom-up proposals. In terms of average recall with 1000 proposals, our performance (0.39) is better than GOP (0.36) [17], Selective Search (0.36) [29] and Edge boxes (0.34) [30], and is about the same as MCG (0.40) [2] while being almost 70 times faster. In contrast, DeepMask achieves 0.45 with a much deeper network at the expense of being 3 times slower (1.6 s) [22].

With a small decrease in performance, Fast DeepBox can be made much faster. With a single scale, AUC drops by about … when reranking the top 2048 proposals and … when reranking all proposals. However, it only takes … s to rerank all proposals or … s for the top 2048. One can also make training faster by removing the scanning-window stage and using a single scale. This speedup comes with a drop in performance of … when reranking all proposals compared to the multiscale two-stage version.

4.8. Impact on object detection

The final metric for any proposal method is its impact on object detection. Good proposals not only reduce the computational complexity but can also make object detection easier by reducing the number of candidates that the detector has to choose from [14, 10]. We found that this is indeed true: when using 500 DeepBox proposals, Fast R-CNN (with the VGG-16 network) gives a mAP of …% on COCO test at IoU=0.5, compared to only …% when using 500 Edge box proposals. Even when using 2000 Edge box proposals, the mAP is still lower (35.9%). For comparison, Fast R-CNN using 2000 Selective Search proposals gets a mean AP of 35.8%, indicating that with just 500 DeepBox proposals we get a 2-point jump in performance.
5. Discussion and Conclusion
We have presented an efficient CNN architecture that learns a semantic notion of objectness that generalizes to unseen categories. We conclude by discussing other applications of our objectness model.

First, as the number of object categories increases, the computational complexity of the detector increases and it becomes more and more useful to have a generic objectness system to reduce the number of locations the detector looks at. Objectness can also help take the burden of localization off the detector, which then has an easier task.

Second, AI agents navigating the world cannot expect to be trained on labeled data like COCO for every object category they see. For some categories the agent will have to collect data and build detectors on the fly. In this case, objectness allows the agent to pick a few candidate locations in a scene that look like objects and track them over time, thus collecting data for training a detector. Objectness can thus be useful for object discovery [16], especially when it captures semantic properties as in our approach.
6. Acknowledgement
This work is supported by a Berkeley Graduate Fellowship and a Microsoft Research Fellowship. We also thank NVIDIA for providing GPUs through their academic program.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. TPAMI, 34(11):2189–2202, 2012.
[2] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[3] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[4] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. TPAMI, 34(7):1312–1328, 2012.
[5] M. Cheng, Z. Zhang, W. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
[6] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.
[7] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
[8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9), 2010.
[10] R. Girshick. Fast R-CNN. In ICCV, 2015.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[12] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[14] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? arXiv preprint arXiv:1502.05082, 2015.
[15] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? In BMVC, 2014.
[16] H. Kang, M. Hebert, A. A. Efros, and T. Kanade. Connecting missing links: Object discovery from sparse observations using 5 million product images. In ECCV, 2012.
[17] P. Krähenbühl and V. Koltun. Geodesic object proposals. In ECCV, 2014.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[21] S. Manen, M. Guillaumin, and L. Van Gool. Prime object proposals with randomized Prim's algorithm. In ICCV, 2013.
[22] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015 (to appear).
[23] E. Rahtu, J. Kannala, and M. Blaschko. Learning a category independent object detection cascade. In ICCV, 2011.
[24] P. Rantalankila, J. Kannala, and E. Rahtu. Generating object segmentation proposals using global and local search. In CVPR, 2014.
[25] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015 (to appear).
[26] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[28] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441, 2014.
[29] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 104(2), 2013.
[30] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.