FoveaBox: Beyond Anchor-based Object Detector
Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Lei Li, Jianbo Shi
Department of Computer Science and Technology, Tsinghua University; Beijing National Research Center for Information Science and Technology (BNRist); ByteDance AI Lab; University of Pennsylvania
[email protected], {fcsun,hpliu}@tsinghua.edu.cn, [email protected], [email protected]

Abstract
We present FoveaBox, an accurate, flexible and completely anchor-free framework for object detection. While almost all state-of-the-art object detectors utilize predefined anchors to enumerate possible locations, scales and aspect ratios for the search of the objects, their performance and generalization ability are also limited by the design of the anchors. Instead, FoveaBox directly learns the object existence possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existence possibility, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image. Without bells and whistles, FoveaBox achieves state-of-the-art single-model performance of 42.1 AP on the standard COCO detection benchmark. Especially for objects with arbitrary aspect ratios, FoveaBox brings in significant improvement compared to the anchor-based detectors. More surprisingly, when it is challenged by stretched testing images, FoveaBox shows great robustness and generalization ability to the changed distribution of bounding box shapes. The code will be made publicly available.
1. Introduction
Object detection requires the solution of two main tasks: recognition and localization. Given an arbitrary image, an object detection system needs to determine whether there are any instances of semantic objects from predefined categories and, if present, to return the spatial location and extent. To add the localization functionality to generic object detection systems, sliding window approaches have been the method of choice for many years [25][7]. Recently, deep learning techniques have emerged as powerful methods for learning feature representations automatically from data [43][16].

Figure 1. RetinaNet object detection performance with different anchor numbers vs. the proposed FoveaBox (COCO AP vs. anchor density). As anchor density increases, the performance of RetinaNet saturates, while FoveaBox shows better performance without the utilization of anchors. An improved variant of FoveaBox achieves 42.1 AP with a ResNeXt-101 backbone, which is not shown in this figure. More details are given in § 4.

R-CNN [12] and Fast R-CNN [11] use a few thousands of category-independent region proposals to reduce the searching space for an image. The region proposal generation stage is subsequently replaced by anchor-based Region Proposal Networks (RPN) [40]. Since then, anchor boxes have been widely used as a common component for searching possible regions of interest in modern object detection frameworks. In short, the anchor method suggests dividing the box space (including position, scale and aspect ratio) into discrete bins and refining the object box in the corresponding bin. Most state-of-the-art detectors rely on anchor boxes to enumerate the possible locations, scales, and aspect ratios for target objects [30]. Anchors are regression references and classification candidates to predict proposals for two-stage detectors (Faster R-CNN [40], FPN [27]) or final bounding boxes for single-stage detectors (SSD [31], RetinaNet [28]). Essentially, anchor boxes can be regarded as a feature-sharing sliding-window scheme to cover the possible locations of objects.

Figure 2. AP difference of FoveaBox and RetinaNet on the COCO dataset. Both models use a ResNet-50-FPN backbone and an 800 input scale. FoveaBox shows improvement for most of the classes, especially for classes whose bounding boxes are likely to be more arbitrary, such as toothbrush, fork, snowboard, tie and train.

However, the utilization of anchor boxes (or candidate boxes) has some drawbacks. First, anchor boxes introduce additional hyper-parameters as design choices. One of the most important factors in designing anchor boxes is how densely they cover the location space of the targets. To achieve a good recall rate, the anchors are carefully designed based on the statistics computed from the training/validation set. Second, a design choice based on a particular dataset is not always applicable to other applications, which harms the generality [46].
For example, anchors are usually square for face detection, while pedestrian detection needs taller anchors. Third, since there is a large set of candidate object locations regularly sampled across an image, dense object detectors usually rely on effective techniques to deal with the foreground-background class imbalance challenge [28][23][41].

One choice to improve the anchor generation process is to make it more flexible. Most recently, there have been some successful works trying to improve the capacity of the anchor boxes [46][45][49]. In MetaAnchor [46], anchor functions are dynamically generated from arbitrary customized prior boxes. The Guided-Anchoring method [45] jointly predicts the locations where the centers of objects are likely to exist as well as the scales and aspect ratios at different positions. In [49], the authors also suggest dynamically learning the anchor shapes. Nevertheless, these works still rely on the enumeration of possible scales and aspect ratios for the optimization of the model. In MetaAnchor, the input of the anchor functions is the regularly sampled anchors with different aspect ratios and scales. In Guided-Anchoring, the authors assume that the center of each anchor is fixed and sample multiple pairs of (w, h) to approximate the best shape centered at the corresponding position.

In contrast, the human vision system can recognize the locations of instances in space and predict the boundary given the visual cortex map, without any pre-defined shape template [1]. In other words, we humans naturally recognize objects in the visual scene without enumerating candidate boxes. Inspired by this, an intuitive question to ask is: is the anchor box scheme the optimal way to guide the search of the objects? If the answer is no, could we design an accurate object detection framework without the dependence on anchors or candidate boxes? Without candidate anchor boxes, one might expect that a complex method is required to achieve comparable results. However, we show that a surprisingly simple and flexible system can match the prior state-of-the-art object detection performance without the requirement of candidate boxes.

To this end, we present FoveaBox, a completely anchor-free framework for object detection. FoveaBox is motivated by the fovea of human eyes: the center of the vision field (object) has the highest visual acuity. FoveaBox jointly predicts the locations where the object's center area is likely to exist as well as the bounding box at each valid location. Thanks to the feature pyramidal representations [27], different scales of objects are naturally detected from multiple levels of features. To demonstrate the effectiveness of the proposed detection scheme, we combine the recent progress of feature pyramid networks and our detection head to form the framework of FoveaBox. Without bells and whistles, FoveaBox gets state-of-the-art single-model results on the COCO object detection task [29]. Our best single model, based on a ResNeXt-101-FPN backbone, achieves a COCO test-dev AP of 42.1, surpassing most previously published anchor-based single-model results.

Since FoveaBox does not rely on default anchors at either the training phase or inference, it is more robust to bounding box distributions. To verify this, we manually elongate the images as well as the annotations of the validation set, and compare the robustness of FoveaBox with previous anchor-based models [28]. Under this setting, FoveaBox outperforms the anchor-based methods by a large margin.
We believe the simple training/inference manner of FoveaBox, together with its flexibility and accuracy, will benefit future research on object detection and relevant topics.
2. Related Work
Classic Object Detectors: Prior to the success of deep CNNs, the widely used detection systems were based on the combination of independent components (HOG [5], SIFT [33], etc.). DPM [8] and its variants helped extend object detectors to more general object categories and had leading results for many years [6]. The sliding-window approach was the leading detection paradigm for searching for the object of interest in classic object detection frameworks.
Modern Object Detectors: Modern object detectors are generally grouped into two factions: two-stage, proposal-driven detectors and one-stage, proposal-free methods. For two-stage detectors, the first stage generates a sparse set of object proposals, and the second stage classifies the proposals as well as refines the coordinates in a sliding-window manner. The effectiveness of such a pipeline was first demonstrated by R-CNN [12], and it is widely used in later two-stage methods [15][11]. In Faster R-CNN [40], the first stage (RPN) simultaneously predicts object bounds and objectness scores at each pre-defined sliding-window anchor with a light-weight network. Several attempts have been made to boost the performance of detectors, including feature pyramids [27][24][21], multi-scale approaches [44][34], and object relations [17], etc.

Compared to two-stage approaches, the one-stage pipeline skips object proposal generation and predicts bounding boxes and class scores in one evaluation. Most top one-stage detectors rely on anchor boxes to enumerate the possible locations of target objects (e.g., SSD [31], DSSD [9], YOLOv2/v3 [38][39], and RetinaNet [28]). In CornerNet [26], the authors propose to detect an object bounding box as a pair of keypoints. CornerNet adopts the Associative Embedding [35] technique to separate different instances. Some prior works share similarities with our work, and we will discuss them in more detail in § 5.
3. FoveaBox
Figure 3. FoveaBox object detector. For each output spatial position that potentially presents an object, FoveaBox directly predicts the confidences for all target categories and the bounding box.

FoveaBox is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs per-pixel classification on the backbone's output; the second subnet performs bounding box prediction for the corresponding position. While there are many possible choices for the details of these components, we take RetinaNet's design [28] for simplicity and fair comparison.
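For concreteness, a minimal PyTorch-style sketch of such a pair of subnetworks is given below. This is our illustration rather than the released implementation: the class and argument names are ours, and the four 3x3, 256-channel convolutions per subnet mirror the RetinaNet-style head that the text says FoveaBox follows.

```python
# A minimal sketch of the FoveaBox-style detection head described above
# (our illustration, not the authors' code): two small conv subnets shared
# across FPN levels, one emitting K class scores and one emitting 4 box
# offsets per spatial position.
import torch
import torch.nn as nn

class FoveaHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=80, num_convs=4):
        super().__init__()
        def subnet(out_channels):
            layers = []
            for _ in range(num_convs):
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.ReLU(inplace=True)]
            layers += [nn.Conv2d(in_channels, out_channels, 3, padding=1)]
            return nn.Sequential(*layers)
        self.cls_subnet = subnet(num_classes)  # one score map per category
        self.box_subnet = subnet(4)            # (t_x1, t_y1, t_x2, t_y2) per position

    def forward(self, fpn_features):
        # fpn_features: list of [N, C, H_l, W_l] maps, e.g. from P3..P7
        cls_outs = [self.cls_subnet(f) for f in fpn_features]
        box_outs = [self.box_subnet(f) for f in fpn_features]
        return cls_outs, box_outs
```

Because the same two subnets are applied to every pyramid level, the number of head parameters is independent of the number of levels, as in RetinaNet.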
We adopt the Feature Pyramid Network (FPN) [27] as the backbone network for the subsequent detection. In general, FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Each level of the pyramid can be used for detecting objects at a different scale. We construct a pyramid with levels {P_l}, l = 3, 4, ..., 7, where l indicates the pyramid level. P_l has 1/2^l resolution of the input. All pyramid levels have C = 256 channels. For further details about FPN, we refer readers to [27][28].

Figure 4. FoveaBox network architecture. The design of the architecture follows RetinaNet [28] to make a fair comparison. FoveaBox uses a Feature Pyramid Network [27] backbone on top of a feedforward ResNet architecture. To this backbone, FoveaBox attaches two subnetworks, one for classifying the corresponding cells and one for predicting the (x_1, y_1, x_2, y_2) of the ground-truth object boxes. For each spatial output location, FoveaBox predicts one score output for each class and the corresponding 4-dimensional box, which is different from previous works attaching A anchors at each position (usually A = 9).

3.2. Scale Assignment

While our goal is to predict the boundary of the target objects, directly predicting these numbers is not stable, due to the large scale variations of the objects. Instead, we divide the scales of objects into several bins, according to the number of feature pyramidal levels. Each pyramid has a basic area ranging from 32^2 to 512^2 on pyramid levels P_3 to P_7, respectively. So for level P_l, the basic area S_l is computed by

S_l = 4^l · S_0.   (1)

Analogous to the ResNet-based Faster R-CNN system that uses C4 as the single-scale feature map, we set S_0 to 16 [40]. Within FoveaBox, each feature pyramid learns to be responsive to objects of particular scales. The valid scale range of the target boxes for pyramid level l is computed as

[S_l / η^2, S_l · η^2],   (2)

where η is set empirically to control the scale range for each pyramid. Target objects not in the corresponding scale range are ignored during training. Note that an object may be detected by multiple pyramids of the network, which is different from previous practice that maps objects to only one feature pyramid [27][14].

Each output set of pyramidal heatmaps has K channels, where K is the number of categories, and is of size H × W (Fig. 4). Each channel is a binary mask indicating the possibility for a class. Given a valid ground-truth box denoted as (x_1, y_1, x_2, y_2), we first map the box into the target feature pyramid P_l with stride 2^l:

x'_1 = x_1 / 2^l,   y'_1 = y_1 / 2^l,
x'_2 = x_2 / 2^l,   y'_2 = y_2 / 2^l,
c'_x = x'_1 + 0.5 (x'_2 − x'_1),
c'_y = y'_1 + 0.5 (y'_2 − y'_1).   (3)

The positive area (fovea) R^pos = (x''_1, y''_1, x''_2, y''_2) of the quadrangle on the score map is designed to be roughly a shrunk version of the original one (see Fig. 3):

x''_1 = c'_x − 0.5 (x'_2 − x'_1) σ_1,
y''_1 = c'_y − 0.5 (y'_2 − y'_1) σ_1,
x''_2 = c'_x + 0.5 (x'_2 − x'_1) σ_1,
y''_2 = c'_y + 0.5 (y'_2 − y'_1) σ_1,   (4)

where σ_1 is the shrink factor. Each cell inside the positive area is annotated with the corresponding target class label for training. For the definition of negative samples, we introduce another shrink factor σ_2 to generate R^neg using Eq. (4). The negative area is the whole feature map excluding the area in R^neg. If a cell is unassigned, it is ignored during training. The positive area usually accounts for a small portion of the whole feature map, so we use Focal Loss [28] to train the classification target L_cls of this branch.
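The target assignment of Eqs. (1)-(4) can be summarized with a short NumPy sketch. This is our illustration under stated assumptions: function and variable names are ours, rounding to integer cell indices is our choice, and the σ_2 ignore region between R^pos and R^neg is omitted for brevity.

```python
# Sketch of the training-target assignment described by Eqs. (1)-(4);
# a NumPy illustration, not the authors' implementation.
import numpy as np

def valid_scale_range(level, s0=16.0, eta=2.0):
    """Basic area S_l = 4**l * S_0 (Eq. 1) and valid range [S_l/eta^2, S_l*eta^2] (Eq. 2)."""
    s_l = 4.0 ** level * s0
    return s_l / eta ** 2, s_l * eta ** 2

def fovea_targets(gt_box, level, map_h, map_w, num_classes, label, sigma1=0.3):
    """Per-level classification targets: 1 inside the shrunk fovea, 0 elsewhere.
    The '-1 = ignore' ring defined via sigma2 is omitted here for brevity."""
    stride = 2.0 ** level
    x1, y1, x2, y2 = np.asarray(gt_box, dtype=np.float64) / stride  # Eq. (3): project to P_l
    cx, cy = x1 + 0.5 * (x2 - x1), y1 + 0.5 * (y2 - y1)
    # Eq. (4): shrink the projected box around its center by sigma1
    px1 = int(np.floor(cx - 0.5 * (x2 - x1) * sigma1))
    py1 = int(np.floor(cy - 0.5 * (y2 - y1) * sigma1))
    px2 = int(np.ceil(cx + 0.5 * (x2 - x1) * sigma1))
    py2 = int(np.ceil(cy + 0.5 * (y2 - y1) * sigma1))
    target = np.zeros((num_classes, map_h, map_w), dtype=np.float32)
    target[label,
           max(py1, 0):min(py2, map_h - 1) + 1,
           max(px1, 0):min(px2, map_w - 1) + 1] = 1.0
    return target
```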
The object fovea only encodes the existence possibility of the target objects. To determine the location, the model must predict the bounding box for each potential instance. Each ground-truth bounding box is specified as G = (x_1, y_1, x_2, y_2). Our goal is to learn a transformation that maps the network's localization outputs (t_x1, t_y1, t_x2, t_y2) at cell (x, y) in the feature maps to the ground-truth box G:

t_x1 = log( (2^l (x + 0.5) − x_1) / z ),
t_y1 = log( (2^l (y + 0.5) − y_1) / z ),
t_x2 = log( (x_2 − 2^l (x + 0.5)) / z ),
t_y2 = log( (y_2 − 2^l (y + 0.5)) / z ),   (5)

where z = √S_l is the normalization factor that projects the output space to a space centered around 1, leading to easier and more stable learning of the target. This function first maps the coordinate (x, y) to the input image, then computes the normalized offset between the projected coordinate and G. Finally, the targets are regularized with the log-space function.

For simplicity, we adopt the widely used Smooth L1 loss [40] to train the box prediction L_box. After the targets are optimized, we can generate the box boundary for each positive cell (x, y) on the output feature maps. We note that Eq. (5) and the inverse transformation can be easily implemented by an element-wise layer in modern deep learning frameworks [36][4].
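As a concrete reading of Eq. (5), the following NumPy sketch (ours, not the authors' element-wise layer implementation) encodes a ground-truth box into regression targets for a cell and decodes them back; the small ε clamp is our addition to keep the logarithm defined for cells lying outside the box.

```python
# Sketch of the box target transform of Eq. (5) and its inverse.
import numpy as np

def encode_box(gt_box, cell_xy, level, s0=16.0, eps=1e-6):
    x1, y1, x2, y2 = gt_box             # ground-truth box in image coordinates
    x, y = cell_xy                      # integer cell index on pyramid level l
    z = np.sqrt(4.0 ** level * s0)      # z = sqrt(S_l), the normalization factor
    cx = 2.0 ** level * (x + 0.5)       # project the cell center to the image
    cy = 2.0 ** level * (y + 0.5)
    offsets = np.array([cx - x1, cy - y1, x2 - cx, y2 - cy]) / z
    return np.log(np.maximum(offsets, eps))

def decode_box(t, cell_xy, level, s0=16.0):
    tx1, ty1, tx2, ty2 = t
    x, y = cell_xy
    z = np.sqrt(4.0 ** level * s0)
    cx, cy = 2.0 ** level * (x + 0.5), 2.0 ** level * (y + 0.5)
    return np.array([cx - z * np.exp(tx1), cy - z * np.exp(ty1),
                     cx + z * np.exp(tx2), cy + z * np.exp(ty2)])
```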
FoveaBox is trained with stochastic gradient descent (SGD). We use synchronized SGD over 4 GPUs with a total of 8 images per minibatch (2 images per GPU). Unless otherwise specified, all models are trained for 270k iterations with an initial learning rate of 0.005, which is then divided by 10 at 180k and again at 240k iterations. Weight decay of 0.0001 and momentum of 0.9 are used. In addition to the standard horizontal image flipping, we also utilize random aspect ratio jittering to reduce over-fitting. We set σ_1 = 0.3 and σ_2 = 0.4 when defining R^pos and R^neg. Each cell inside R^pos is annotated with the corresponding location target for bounding box training.

method | #sc | #ar | AP | AP_50 | AP_75
RetinaNet | 1 | 1 | 30.2 | 49.0 | 31.7
RetinaNet | 2 | 1 | 31.9 | 50.0 | 34.1
RetinaNet | 3 | 1 | 31.9 | 49.4 | 33.8
RetinaNet | 1 | 3 | 32.4 | 52.4 | 33.9
RetinaNet | 2 | 3 | 34.2 | 53.1 | 36.5
RetinaNet | 3 | 3 | 34.2 | 53.2 | 36.9
RetinaNet | 4 | 3 | 33.9 | 52.1 | 36.2
FoveaBox | n/a | n/a | – | – | –
(a) Varying anchor density of RetinaNet [28] and FoveaBox

method | AP | AP_{u<·} | AP_{·≤u<·} | AP_{u≥·}
RetinaNet | 34.2 | 36.5 | 24.5 | 10.2
RetinaNet* | – | – | – | –
(b) Detection performance at different aspect ratios

method | backbone | AR@100 | AR@· | AR@·
RPN | ResNet-50 | 44.5 | 51.1 | 56.6
FoveaBox | ResNet-50 | 53.0 | – | –
(c) Region proposal results

Table 1. Ablation experiments for FoveaBox. All models are trained on MS COCO trainval35k and tested on minival. If not specified, a ResNet-50-FPN backbone and a 600-pixel train and test image scale are used for the ablation study. (a) FoveaBox and varying anchor density of RetinaNet. In RetinaNet, increasing beyond 9 anchors does not show further gains, while FoveaBox achieves a much better result without anchor enumeration. (b) With the exact same network, FoveaBox is more robust at higher object aspect ratio (u = max(h/w, w/h)) thresholds. (c) FoveaBox can also generate high-quality region proposals when changing the optimization target to a class-agnostic head.

During inference, we first use a confidence threshold of 0.05 to filter out predictions with low confidence. Then, we select the top 1000 scoring boxes from each prediction layer. Next, non-maximum suppression (NMS) with a threshold of 0.5 is applied for each class separately. Finally, the top-100 scoring predictions are selected for each image. This inference setting is exactly the same as that of the Detectron baseline [13]. Although there are more intelligent ways to perform post-processing, such as bbox voting [10], Soft-NMS [2] or test-time image augmentations, in order to keep simplicity and to fairly compare against the baseline models, we do not use those tricks here.
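The post-processing pipeline above can be written down in a few lines. The sketch below is our NumPy illustration of the described steps (0.05 confidence threshold, top-1000 boxes per prediction layer, class-wise NMS at 0.5, top-100 detections per image), not the Detectron implementation; helper names and the flattened input layout are assumptions.

```python
# Sketch of the inference post-processing described above.
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou <= iou_thr]
    return keep

def postprocess(per_level_boxes, per_level_scores, per_level_labels,
                score_thr=0.05, pre_nms_top=1000, max_dets=100):
    boxes, scores, labels = [], [], []
    for b, s, l in zip(per_level_boxes, per_level_scores, per_level_labels):
        keep = s > score_thr                   # confidence threshold
        b, s, l = b[keep], s[keep], l[keep]
        top = s.argsort()[::-1][:pre_nms_top]  # top-1000 per prediction layer
        boxes.append(b[top]); scores.append(s[top]); labels.append(l[top])
    boxes, scores, labels = map(np.concatenate, (boxes, scores, labels))
    final = []
    for c in np.unique(labels):                # class-wise NMS at 0.5
        idx = np.where(labels == c)[0]
        final += [idx[k] for k in nms(boxes[idx], scores[idx], 0.5)]
    final = np.array(final, dtype=int)
    final = final[scores[final].argsort()[::-1][:max_dets]]  # keep top-100
    return boxes[final], scores[final], labels[final]
```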
4. Experiments
We present experimental results on the bounding box detection track of the challenging COCO benchmark [29]. For training, we follow common practice [14][28] and use the COCO trainval35k split (union of 80k images from train and a random 35k subset of images from the 40k-image val split). We report lesion and sensitivity studies by evaluating on the minival split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split, which has no public labels and requires use of the evaluation server.
Various anchor densities and FoveaBox:
One of the most important design factors in an anchor-based detection system is how densely it covers the space of possible image boxes. As anchor-based detectors use a fixed sampling grid, a popular approach for achieving high coverage of boxes is to use multiple anchors [27][28] at each spatial position to cover boxes of various scales and aspect ratios. One may expect that we can always get better performance by attaching denser anchors at each position. To verify this assumption, we sweep over the number of scale and aspect ratio anchors used at each spatial position and each pyramid level in RetinaNet, ranging from a single square anchor at each location to 12 anchors per location (Table 1(a)). Increasing beyond 6-9 anchors does not show further gains. The saturation of performance w.r.t. density implies that handcrafted, overly dense anchors do not offer an advantage.

Overly dense anchors not only increase the foreground-background optimization difficulty, but are also likely to cause an ambiguous position definition problem. For each output spatial location, there are A anchors whose labels are defined by the IoU with the ground truth. Among them, some of the anchors are defined as positive samples, while others are negatives. However, they share the same input features. The classifier needs to distinguish not only the samples from different positions, but also different anchors at the same position.

In contrast, FoveaBox explicitly predicts one target at each position and gets no worse performance than the best anchor-based model. The label of the target is defined by whether it is inside an object's bounding box. Compared with the anchor-based scheme, FoveaBox enjoys several advantages. (a) Since we only predict one target at each position, the output space is reduced to 1/A of the anchor-based method, where A is the anchor number at each position. Since the foreground-background classification challenge is mitigated, it is easier for the solver to optimize the model. (b) There is no ambiguity problem and the optimization target is more straightforward. (c) FoveaBox is more flexible, since we do not need to extensively design the anchors to find a relatively better choice.
Analysis of Scale Assignment: In Eq. (2), η controls the scale assignment extent for each pyramid. When η = √2, the object scales are divided into non-overlapping bins, and each bin is predicted by the corresponding feature pyramid. As η increases, each pyramid responds to more scales of objects. Table 2 shows the impact of η on the final detection performance. We set η = 2 for all other experiments.

Table 2. Varying η for FoveaBox (AP, AP_50, AP_75, AP_S, AP_M, AP_L).
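To make the role of η concrete, the small script below (our illustration, using S_0 = 16 from Eq. (1)) prints the valid ranges of Eq. (2): with η = √2 the per-level ranges tile the scale axis without overlap, whereas with η = 2 adjacent pyramids share objects.

```python
# Tiny numeric check of Eq. (2) for eta = sqrt(2) and eta = 2, with S_0 = 16.
import math

def scale_range(level, eta, s0=16.0):
    s_l = 4.0 ** level * s0                 # basic area S_l = 4**l * S_0
    return s_l / eta ** 2, s_l * eta ** 2   # valid range of Eq. (2)

for eta in (math.sqrt(2.0), 2.0):
    print(f"eta = {eta:.3f}")
    for level in range(3, 8):               # pyramid levels P3..P7
        lo, hi = scale_range(level, eta)
        print(f"  P{level}: [{lo:8.0f}, {hi:8.0f}]")
```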
FoveaBox is more robust to box distributions: Unlike the traditional predefined anchor strategy, one of the major benefits of FoveaBox is the robust prediction of bounding boxes. To verify this, we conduct two experiments to compare the localization performance of different methods.

In the first experiment, we divide the boxes in the validation set into three groups according to the ground-truth aspect ratios U = {u_i = max(h_i/w_i, w_i/h_i)}, i = 1, ..., N, where N is the instance number in the dataset. We compare FoveaBox and RetinaNet at different aspect ratio thresholds, as shown in Table 1(b). Here * means training the model with aspect ratio jittering. We see that both methods get their best performance when u is low. Although FoveaBox also suffers a performance decrease when u increases, it is much better than the anchor-based RetinaNet.

To further validate the robustness of different methods to bounding boxes, we manually stretch the images as well as the annotations in the validation set, and examine the behaviors of different detectors. Fig. 5 shows the localization performance at different h/w stretching thresholds. Under the evaluation criterion that h/w = 1, the performance gap between the two detectors is relatively small. The gap starts to increase as we increase the stretching threshold. Specifically, FoveaBox gets 21.3 AP when stretching h/w by 3 times, which is 3.7 points better than the RetinaNet* counterpart.

The anchor-based methods rely on box regression with anchor references to generate the final bounding box. In practice, the regressor is trained for the positive anchors, which harms the generality when predicting more arbitrary shapes of targets. In FoveaBox, each prediction position is not associated with a particular reference shape, and it directly predicts the target ground-truth boxes. Since FoveaBox allows arbitrary aspect ratios, it is capable of capturing those extremely tall or wide objects better. See some qualitative examples in Fig. 6.
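The stretching protocol used here can be sketched as follows (our illustration with PIL; the function name and rounding are ours): the image height and the y-coordinates of the box annotations are scaled by the same factor s, which shifts the ground-truth h/w distribution by s.

```python
# Sketch of stretching an image and its box annotations along the vertical axis.
import numpy as np
from PIL import Image

def stretch_height(image: Image.Image, boxes: np.ndarray, s: float):
    """boxes are (x1, y1, x2, y2) in pixels; only the vertical axis is stretched."""
    w, h = image.size
    stretched = image.resize((w, int(round(h * s))))
    new_boxes = boxes.astype(np.float64).copy()
    new_boxes[:, [1, 3]] *= s   # y-coordinates scale with the height
    return stretched, new_boxes
```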
Figure 5. Evaluating boxes at different h/w stretching thresholds. * means training the model with aspect ratio jittering. Since we see similar trends when stretching w/h, we only show h/w stretching results here.

Figure 6. Qualitative comparison of RetinaNet (a) and FoveaBox (b). Our model shows improvements in classes with large aspect ratios. Better viewed in color.

Per-class difference: Fig. 2 shows the per-class AP difference of FoveaBox and RetinaNet. Both use a ResNet-50-FPN backbone and an 800 input scale. The vertical axis shows AP_FoveaBox − AP_RetinaNet. FoveaBox shows improvement in most of the classes, especially for classes whose bounding boxes are likely to be more arbitrary. For the classes toothbrush, fork, sports ball, snowboard, tie and train, the AP improvements are larger than 5 points.
Generating high-quality region proposals:
Changing the classification target to a class-agnostic head is straightforward and can generate region proposals. We compare the proposal performance against the region proposal network (RPN) [40] and evaluate average recall (AR) with different numbers of proposals on the COCO minival set, as shown in Table 1(c).

Figure 7. FoveaBox results on the COCO minival set. These results are based on ResNet-101, achieving a single-model box AP of 38.9. For each pair, left is the detection result with bounding box, category, and confidence. Right is the score output map with the corresponding bounding boxes before feeding into non-maximum suppression (NMS). The score probability at each position is denoted by the color density. These figures demonstrate that FoveaBox can directly generate accurate, robust box predictions, without the requirement of candidate anchors.

Surprisingly, our method outperforms the RPN baseline by a large margin under all criteria. Specifically, with the top 100 region proposals, FoveaBox gets 53.0 AR, outperforming RPN by 8.5 points. This validates our model's capacity for generating high-quality region proposals.
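Switching to a class-agnostic head only changes how the score maps are consumed; a minimal sketch of turning one level's objectness map and decoded boxes into proposals is shown below (our illustration; the names are ours, and the proposals would then go through NMS and AR evaluation as usual).

```python
# Sketch of class-agnostic proposal generation from one pyramid level.
import numpy as np

def boxes_to_proposals(objectness_map, decoded_boxes, top_n=1000):
    """objectness_map: [H, W] class-agnostic scores for one pyramid level;
    decoded_boxes: [H, W, 4] boxes already decoded with the inverse of Eq. (5)."""
    scores = objectness_map.reshape(-1)
    boxes = decoded_boxes.reshape(-1, 4)
    order = scores.argsort()[::-1][:top_n]   # keep the top-N scoring cells
    return boxes[order], scores[order]       # feed these to NMS / AR evaluation
```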
Across model depth and scale:
Table 3 shows FoveaBox utilizing different backbone networks and input resolutions. The inference settings are exactly the same as RetinaNet, and the speed is also on par with the corresponding baseline.

method | depth | scale | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L
RetinaNet | 50 | 400 | 30.5 | 47.8 | 32.7 | 11.2 | 33.8 | 46.1
FoveaBox | 50 | 400 | 31.9 | 49.3 | 33.8 | 12.7 | 36.1 | 48.7
RetinaNet | 50 | 600 | 34.2 | 53.2 | 36.9 | 16.2 | 37.4 | 47.4
FoveaBox | 50 | 600 | 36.0 | 55.2 | 37.9 | 18.6 | 39.4 | 50.5
RetinaNet | 50 | 800 | 35.7 | 55.0 | 38.5 | 18.9 | 38.9 | 46.3
FoveaBox | 50 | 800 | 37.1 | 57.2 | 39.5 | 21.6 | 41.4 | 49.1
RetinaNet | 101 | 400 | 31.9 | 49.5 | 34.1 | 11.6 | 35.8 | 48.5
FoveaBox | 101 | 400 | 33.3 | 51.0 | 35.0 | 12.9 | 38.0 | 51.3
RetinaNet | 101 | 600 | 36.0 | 55.2 | 38.7 | 17.4 | 39.6 | 49.7
FoveaBox | 101 | 600 | 38.0 | 57.8 | 40.2 | 19.5 | 42.2 | 52.7
RetinaNet | 101 | 800 | 37.8 | 57.5 | 40.8 | 20.2 | 41.1 | 49.2
FoveaBox | 101 | 800 | 38.9 | 58.4 | 41.5 | 22.3 | 43.5 | 51.7

Table 3. FoveaBox across different input resolutions and model depths.
As shown in Table 3, FoveaBox improves on the RetinaNet baselines consistently by 1 to 2 points. When analysing the performance on small, medium and large object scales, we observe that the improvements come from all scales of objects.

We compare FoveaBox to the state-of-the-art methods in object detection in Table 4. All instantiations of our model outperform baseline variants of previous state-of-the-art models. The first group of detectors in Table 4 are two-stage detectors, the second group one-stage detectors, and the last group the FoveaBox detector. FoveaBox outperforms all single-stage detectors with a ResNet-101 backbone, under all evaluation metrics. This includes the very recent one-stage CornerNet [26]. FoveaBox also outperforms most two-stage detectors, including FPN [27] and Mask R-CNN [14].

Two-stage detectors rely on region-wise sub-networks to further classify the sparse region proposals. Cascade R-CNN extends the two-stage scheme to multiple stages to further refine the regions. Since FoveaBox can also generate region proposals by changing the model head to a class-agnostic scheme (Table 1(c)), we believe it could further improve the performance of two-stage detectors, which is beyond the focus of this paper.

method | backbone | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L
two-stage:
Faster R-CNN+++ [16]* | ResNet-101 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9
Faster R-CNN by G-RMI [19] | Inception-ResNet-v2 | 34.7 | 55.5 | 36.7 | 13.5 | 38.1 | 52.0
Faster R-CNN w FPN [27] | ResNet-101 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2
Faster R-CNN w TDM [42] | Inception-ResNet-v2 | 36.8 | 57.7 | 39.2 | 16.2 | 39.8 | 52.1
Mask R-CNN [14] | ResNet-101 | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2
Relation Network [17] | DCN-101 | 39.0 | 58.6 | 42.9 | - | - | -
IoU-Net [20] | ResNet-101 | 40.6 | 59.0 | - | - | - | -
Cascade R-CNN [3] | ResNet-101 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2
one-stage:
YOLOv2 [37] | DarkNet-19 | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5
SSD513 [9] | ResNet-101 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8
YOLOv3 [39] | Darknet-53 | 33.0 | 57.9 | 34.4 | 18.3 | 35.4 | 41.9
DSSD513 [9] | ResNet-101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1
RetinaNet [28] | ResNet-101 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2
ConRetinaNet [22] | ResNet-101 | 40.1 | 59.6 | 43.5 | 23.4 | 44.2 | 53.3
CornerNet [26]* | Hourglass-104 | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9
RetinaNet [28] | ResNeXt-101 | 40.8 | 61.1 | 44.1 | 24.1 | 44.2 | 51.2
ours:
FoveaBox | ResNet-101 | 40.6 | 60.1 | 43.5 | 23.3 | 45.2 | 54.5
FoveaBox | ResNeXt-101 | 42.1 | 61.9 | 45.2 | 24.9 | 46.8 | 55.6

Table 4. Object detection single-model results vs. state-of-the-art on COCO test-dev. We show results for our FoveaBox models with an 800 input scale. Both RetinaNet and FoveaBox are trained with scale jitter and for 1.5× longer than the same model from Table 3. Our model achieves top results, outperforming all one-stage and most two-stage models. The entries denoted by "*" use bells and whistles at inference.
5. More Discussions to Prior Works
Before closing, we investigate some relations and differences between FoveaBox and prior works.
Score Mask for Text Detection: The score mask technique has been widely used in the area of text detection [48][18][50]. Such works usually utilize fully convolutional networks [32] to predict the existence of target scene text and the quadrilateral shapes. Compared with scene text detection, generic object detection is more challenging since it faces more occlusion, multi-class classification and scale problems. Naively adopting the text detection methods for generic object detection usually yields poor performance.
Guided-Anchoring [45]: It jointly predicts the locations where the centers of objects of interest are likely to exist as well as the scales and aspect ratios centered at the corresponding locations. If (x, y) is not at the target center, the detected box will not be the optimal one. Guided-Anchoring relies on the center points to give the best predictions. In contrast, FoveaBox predicts the (left, top, right, bottom) boundaries of the object for each foreground position, which is more robust.

FSAF [51]: This is a contemporary work with FoveaBox. It also tries to directly predict the bounding boxes of the target objects. The differences between FoveaBox and FSAF are: (a) FSAF relies on an online feature selection module to select suitable features for each instance and anchor, while in FoveaBox an instance of a particular scale is simultaneously optimized by adjacent pyramids, determined by Eq. (2), which is simpler and more robust. (b) For the optimization of the box boundary, FSAF utilizes the IoU loss [47] to maximize the IoU between the predicted box and the ground truth, while FoveaBox uses the Smooth L1 loss to directly predict the four boundaries, which is simpler and more straightforward. (c) FoveaBox shows much better performance compared with FSAF, as shown in Table 5.

method | backbone | AP | AP_50 | AP_75
FSAF | ResNet-50 | 35.9 | 55.0 | 37.9
FSAF+Retina | ResNet-50 | 37.2 | 57.2 | 39.4
FoveaBox | ResNet-50 | 37.1 | 57.2 | 39.5
FoveaBox+Retina | ResNet-50 | 38.1 | 57.8 | 40.5

Table 5. Performance comparison of the contemporary work FSAF [51] and FoveaBox. '+Retina' means combining the results of the relevant method and the outputs of the anchor-based RetinaNet.
CornerNet [26]: CornerNet proposes detecting objects by top-left and bottom-right keypoint pairs. The key step of CornerNet is to recognize which keypoints belong to the same instance and to group them correctly. In contrast, the instance class and the bounding box are associated together in FoveaBox. We directly predict the boxes and classes, without any grouping scheme to separate different instances.
6. Conclusion
We have presented FoveaBox for generic object detection. By simultaneously predicting the object position and the corresponding boundary, FoveaBox gives a clean solution for detecting objects without prior candidate boxes. We demonstrate its effectiveness on standard benchmarks and report extensive experimental analysis.
References

[1] M. F. Bear, B. W. Connors, and M. A. Paradiso. Neuroscience, volume 2. Lippincott Williams & Wilkins, 2007.
[2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.
[3] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. arXiv preprint arXiv:1712.00726, 2017.
[4] H.-Y. Chen et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[7] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In Computer Vision and Pattern Recognition (CVPR), pages 2241–2248. IEEE, 2010.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[9] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[10] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In Proceedings of the IEEE International Conference on Computer Vision, pages 1134–1142, 2015.
[11] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[13] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[17] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018.
[18] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding. WordSup: Exploiting word annotations for character based text detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 4940–4949, 2017.
[19] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE CVPR, 2017.
[20] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision, Munich, Germany, pages 8–14, 2018.
[21] T. Kong, F. Sun, W. Huang, and H. Liu. Deep feature pyramid reconfiguration for object detection. In European Conference on Computer Vision, pages 172–188, 2018.
[22] T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi. Consistent optimization for single-shot object detection. arXiv preprint arXiv:1901.06563, 2019.
[23] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen. RON: Reverse connection with objectness prior networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[24] T. Kong, A. Yao, Y. Chen, and F. Sun. HyperNet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 845–853, 2016.
[25] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, pages 1–8. IEEE, 2008.
[26] H. Law and J. Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[27] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[28] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[30] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen. Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165, 2018.
[31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[33] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[34] M. Najibi, B. Singh, and L. S. Davis. AutoFocus: Efficient multi-scale inference. arXiv preprint arXiv:1812.01600, 2018.
[35] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2277–2287, 2017.
[36] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[38] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint, 2017.
[39] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[40] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[41] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
[42] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[44] B. Singh and L. S. Davis. An analysis of scale invariance in object detection - SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018.
[45] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin. Region proposal by guided anchoring. arXiv preprint arXiv:1901.03278, 2019.
[46] T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun. MetaAnchor: Learning to detect objects with customized anchors. In Advances in Neural Information Processing Systems, pages 318–328, 2018.
[47] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, pages 516–520. ACM, 2016.
[48] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4159–4167, 2016.
[49] Y. Zhong, J. Wang, J. Peng, and L. Zhang. Anchor box optimization for object detection. arXiv preprint, 2018.
[50] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5551–5560, 2017.
[51] C. Zhu, Y. He, and M. Savvides. Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621, 2019.