R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object
Xue Yang
Qingqing Liu, Junchi Yan, Ang Li, Zhiqiang Zhang, Gang Yu
Shanghai Jiao Tong University, Central South University, Nanjing University of Science and Technology, Megvii Inc. (Face++)
[email protected]
Abstract
Rotation detection is a challenging task due to the difficulty of locating multi-angle objects and separating them accurately and quickly from the background. Though considerable progress has been made, challenges remain in practical settings for rotating objects with large aspect ratios, dense distributions, and extreme category imbalance. In this paper, we propose an end-to-end refined single-stage rotation detector for fast and accurate object positioning. Considering the shortcoming of feature misalignment in existing refined single-stage detectors, we design a feature refinement module to improve detection performance by obtaining more accurate features. The key idea of the feature refinement module is to re-encode the position information of the current refined bounding box to the corresponding feature points through feature interpolation, thereby realizing feature reconstruction and alignment. Extensive experiments on two public remote sensing datasets, DOTA and HRSC2016, as well as the scene text dataset ICDAR2015, show the state-of-the-art accuracy and speed of our detector. Code is available at https://github.com/Thinklab-SJTU/R3Det_Tensorflow.
1. Introduction
Object detection is one of the fundamental tasks in computer vision, and many high-performance general-purpose object detectors have been proposed. Current popular detection methods can in general be divided into two types: two-stage object detectors [12, 11, 33, 8, 24] and single-stage object detectors [27, 31, 25]. Two-stage methods have achieved promising results on various benchmarks, while the single-stage approach maintains faster detection speeds.

Figure 1: Performance (mAP) versus speed on the HRSC2016 [29] dataset. Our detectors (in green and blue) notably surpass competitors in accuracy, whilst running very fast. Detailed results are listed in Table 5 (best viewed in color).

However, current general horizontal detectors have fundamental limitations for many practical applications, for instance, scene text detection and remote sensing object detection, where the objects can be in any direction and position. Therefore, many rotation detectors based on general detection frameworks have been proposed in the fields of scene text and remote sensing. In particular, three challenges are pronounced for images in the above two fields, analyzed as follows:
1) Large aspect ratio. The Skew Intersection over Union (SkewIoU) score between large aspect ratio objects is sensitive to changes in angle, as sketched in Figure 3.

2) Densely arranged. As illustrated in Figure 6, many objects usually appear in densely arranged forms.

3) Category unbalance. Many multi-category rotated datasets are long-tailed datasets whose categories are extremely unbalanced, as shown in Figure 7.

In this paper, we mainly discuss how to design an accurate and fast rotation detector. To maintain high positioning accuracy and detection speed for large aspect ratio objects, we adopt a refined single-stage rotation detector. First, we find that rotating anchors perform better in dense scenes, while horizontal anchors achieve higher recall with fewer anchors. Therefore, a combination strategy of the two forms of anchors is adopted in the refined single-stage detector: horizontal anchors are used in the first stage for faster speed and more proposals, and then refined rotating anchors are used in the refinement stages to adapt to intensive scenarios. Second, we also notice that existing refined single-stage detectors suffer from feature misalignment problems¹ [42, 7], which greatly limits the reliability of classification and regression during the refined stages. We design a feature refinement module (FRM) that uses feature interpolation to obtain the position information of the refined anchors and reconstruct the feature map to achieve feature alignment. FRM can also reduce the number of refined bounding boxes after the first stage, thus speeding up the model. Experimental results show that feature refinement is sensitive to location and its improvement in detection results is very noticeable, especially for small sample categories. Combining these two techniques as a whole, our approach achieves state-of-the-art performance with high speed on three public rotation-sensitive datasets including DOTA, HRSC2016 and ICDAR2015.

Figure 2: Architecture of the proposed Refined Rotation Single-Stage Detector (RetinaNet [25] as an embodiment). The refinement stage can be repeated multiple times. Only the bounding box with the highest score of each feature point is preserved in the refinement stage to speed up the model. 'A' indicates the number of anchors on each feature point, and 'C' indicates the number of categories.

This work makes the following contributions:

1) For large aspect ratio objects, an accurate and fast single-stage rotation detector is devised in a refined manner, which enables high-precision detection.

2) For densely arranged scenes, we consider the advantages of each of the two forms of anchors, and adopt an anchor combination strategy to enable the detector to cope with intensive scenarios with high efficiency.

3) For category unbalance, we propose an FRM that aims to make the detector features more accurate and reliable during the refinement stages. Experiments show that FRM greatly improves the categories that underfit due to small sample sizes and inaccurate features, such as BD, GTF, BC, SBF, RA, HC (see details in Table 1), which increase by 4.09%, 2.83%, 3.4%, 4.82%, 1.22%, and 19.26%, respectively.

¹Mainly refers to the misalignment between the region of interest (RoI) and the feature.

Figure 3: The SkewIoU scores vary with the angle deviation. The red and green rectangles represent the ground truth and the prediction bounding box, respectively.
2. Related Work
Two-Stage Object Detectors.
Most of the existing two-stage methods are region-based. In a region-based framework, category-independent region proposals are generated from an image in the first stage, features are extracted from these regions subsequently, and then category-specific classifiers and regressors are used for classification and regression in the second stage. Finally, the detection results are obtained by using post-processing methods such as non-maximum suppression (NMS). Faster-RCNN [33] is a classic two-stage structure that can detect objects quickly and accurately in an end-to-end manner. Many high-performance detection methods have since been proposed, such as R-FCN [8], FPN [24], etc.
Single-Stage Object Detectors.
For their efficiency, single-stage detection methods are receiving more and more attention. OverFeat [35] is one of the first single-stage detectors based on convolutional neural networks. It performs object detection in a multi-scale sliding window fashion via a single forward pass through the CNN. Compared with region-based methods, Redmon et al. [31] propose YOLO, a unified detector casting object detection as a regression problem from image pixels to spatially separated bounding boxes and associated class probabilities. To preserve real-time speed without sacrificing too much detection accuracy, Liu et al. [27] propose SSD. The work [25] addresses the class imbalance problem by proposing RetinaNet with focal loss, further improving the accuracy of single-stage detectors.
Figure 4: Root cause analysis of feature misalignment and the core idea of the feature refinement module. (a) Original image. (b) Refined box without considering the feature misalignment caused by the location changes of the bounding box. (c) Refined box with aligned features obtained by reconstructing the feature map. (d) Feature interpolation.
Rotation Object Detectors.
Remote sensing and scene text are the main application scenarios of rotation detectors. Due to the complexity of remote sensing image scenes and the large number of small, cluttered and rotated objects, two-stage rotation detectors are still dominant for their robustness. Among them, ICN [2], RoI-Transformer [10] and SCRDet [40] are state-of-the-art detectors. However, they use more complicated structures, causing a speed bottleneck. For scene text detection, there are many efficient rotation detection methods, including both two-stage methods (R2CNN [17], RRPN [30], FOTS [28]) and single-stage methods (EAST [44], TextBoxes [22]).
Refined Object Detectors.
To achieve better positioning accuracy, many cascaded or refined detectors have been proposed. Cascade RCNN [4], HTC [5], and FSCascade [20] perform multiple classifications and regressions in the second stage, which greatly improves classification and positioning accuracy. The same idea is also used in single-stage detectors, such as RefineDet [42]. Unlike two-stage detectors, which use RoI Pooling [11] or RoI Align [13] for feature alignment, current refined single-stage detectors do not resolve this problem well. An important requirement of a refined single-stage detector is to maintain a fully convolutional structure, which retains the advantage of speed, but methods such as RoI Align cannot satisfy this requirement, as fully-connected layers have to be introduced. Although some works [6, 16, 41] use deformable convolution [9] for feature alignment, their offset parameters are often obtained by learning the offset between the pre-defined anchor box and the refined anchor. The essence of these deformable-based feature alignment methods is to expand the receptive field, which is too implicit and cannot ensure that features are truly aligned. Feature misalignment still limits the performance of refined single-stage detectors. Compared to these methods, our method can exactly locate the corresponding feature area by calculation and achieve feature alignment through feature map reconstruction.

Figure 5: Feature Refinement Module (FRM). It mainly includes three parts: refined bounding box filtering, feature interpolation and feature map reconstruction.
3. The Proposed Method
We give an overview of our method as sketched in Figure 2. The embodiment is a single-stage rotation detector based on RetinaNet [25], namely Refined Rotation RetinaNet (R3Det). The refinement stage (which can be added and repeated multiple times) is added to the network to refine the bounding boxes, and the feature refinement module (FRM) is added during the refinement stage to reconstruct the feature map. In a single-stage rotating object detection task, continuous refinement of the predicted bounding box can improve regression accuracy, and feature refinement is a necessary process for this purpose. It should be noted that FRM can also be used on other single-stage detectors (such as SSD); refer to the discussion section.
RetinaNet is one of the most advanced single-stage detectors available today. It consists of two parts: a backbone network, and classification and regression subnetworks. RetinaNet adopts the Feature Pyramid Network (FPN) [24] as the backbone network. In brief, FPN augments a convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single-resolution input image. Each level of the pyramid can be used for detecting objects at a different scale. Besides, each layer of the FPN is connected to a classification subnet and a regression subnet for predicting categories and locations. Note that the object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters. RetinaNet proposed focal loss [25] to address the problem caused by category imbalance, which has greatly improved the accuracy of single-stage detectors.

To achieve RetinaNet-based rotation detection, we use five parameters (x, y, w, h, θ) to represent an arbitrary-oriented rectangle. Ranging in [−π/2, 0), θ denotes the acute angle to the x-axis, and for the other side we refer to it as w. Therefore, it calls for predicting an additional angular offset in the regression subnet, whose rotation bounding box regression is:

t_x = (x − x_a)/w_a,  t_y = (y − y_a)/h_a
t_w = log(w/w_a),  t_h = log(h/h_a),  t_θ = θ − θ_a    (1)

t'_x = (x' − x_a)/w_a,  t'_y = (y' − y_a)/h_a
t'_w = log(w'/w_a),  t'_h = log(h'/h_a),  t'_θ = θ' − θ_a    (2)

where x, y, w, h, θ denote the box's center coordinates, width, height and angle, respectively. Variables x, x_a, x' are for the ground-truth box, anchor box, and predicted box, respectively (likewise for y, w, h, θ).

The multi-task loss is used, which is defined as follows:

L = (λ_1/N) Σ_{n=1}^{N} t'_n Σ_{j∈{x,y,w,h,θ}} L_reg(v'_{nj}, v_{nj}) + (λ_2/N) Σ_{n=1}^{N} L_cls(p_n, t_n)    (3)

where N indicates the number of anchors and t'_n is a binary value (t'_n = 1 for foreground and t'_n = 0 for background, with no regression for background). v'_{nj} represents the predicted offset vectors and v_{nj} represents the target vector of the ground truth. t_n represents the label of the object, and p_n is the probability distribution over classes calculated by the sigmoid function. The hyper-parameters λ_1, λ_2 control the trade-off and are set to 1 by default. The classification loss L_cls and regression loss L_reg are implemented by focal loss [25] and smooth L1 loss as defined in [11], respectively.
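To make Eqs. (1)–(2) concrete, the following is a minimal NumPy sketch of the five-parameter target encoding and its inverse decoding. It illustrates the formulas above rather than reproducing the authors' released TensorFlow code; the function names and the sample anchor/ground-truth values are our own.

```python
import numpy as np

def encode_rbox(gt, anchor):
    """Regression targets of Eq. (1): gt and anchor are (x, y, w, h, theta),
    with theta in radians in [-pi/2, 0)."""
    x, y, w, h, t = gt
    xa, ya, wa, ha, ta = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha), t - ta])

def decode_rbox(deltas, anchor):
    """Invert the encoding to recover a predicted box from network offsets."""
    xa, ya, wa, ha, ta = anchor
    tx, ty, tw, th, tt = deltas
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th), tt + ta])

# Hypothetical anchor and ground-truth box for a round-trip check.
anchor = np.array([50.0, 50.0, 40.0, 20.0, -np.pi / 2])
gt = np.array([53.0, 48.0, 60.0, 15.0, -np.pi / 3])
assert np.allclose(decode_rbox(encode_rbox(gt, anchor), anchor), gt)
```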
Refined Detection. The Skew Intersection over Union (SkewIoU) score is sensitive to changes in angle: a slight angle shift causes a rapid decrease in the IoU score, as shown in Figure 3. Therefore, refining the prediction boxes helps to improve the recall rate of rotation detection. We join multiple refinement stages with different IoU thresholds. In addition to using a foreground IoU threshold of 0.5 and a background IoU threshold of 0.4 in the first stage, the thresholds of the first refinement stage are set to 0.6 and 0.5, respectively. If there are multiple refinement stages, the remaining thresholds are 0.7 and 0.6. The overall loss for the refined detector is defined as follows:

L_total = Σ_{i=1}^{N} α_i L_i    (4)

where L_i is the loss value of the i-th refinement stage and the trade-off coefficients α_i are set to 1 by default.
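The angle sensitivity that motivates these refinement stages (Figure 3) is easy to reproduce numerically. The sketch below, which assumes the third-party shapely geometry library and hypothetical box sizes, compares SkewIoU for a square object and a 1:7 aspect ratio object under the same small angle deviations.

```python
import numpy as np
from shapely.geometry import Polygon  # assumed third-party dependency

def rbox_polygon(cx, cy, w, h, angle_deg):
    """Four corners of a rotated rectangle as a shapely Polygon."""
    a = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    corners = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return Polygon(corners @ rot.T + [cx, cy])

def skew_iou(b1, b2):
    p1, p2 = rbox_polygon(*b1), rbox_polygon(*b2)
    inter = p1.intersection(p2).area
    return inter / (p1.area + p2.area - inter)

for ratio in (1, 7):                       # square vs. large aspect ratio
    gt = (0.0, 0.0, 100.0 * ratio, 100.0, 0.0)
    for dev in (5, 10, 15):                # angle deviation in degrees
        pred = (0.0, 0.0, 100.0 * ratio, 100.0, float(dev))
        print(f"ratio 1:{ratio}, deviation {dev} deg -> "
              f"SkewIoU {skew_iou(gt, pred):.3f}")
```

The same angular deviation costs the elongated box far more IoU than the square one, which is exactly the behavior sketched in Figure 3.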
Algorithm 1: Feature Refinement Module
Input: original feature map F, the bounding boxes (B) and confidences (S) of the previous stage
Output: reconstructed feature map F'
    B' ← Filter(B, S)
    h, w ← Shape(F); F' ← ZerosLike(F)
    F ← Conv(F) + Conv(Conv(F))
    for i ← 0 to h − 1 do
        for j ← 0 to w − 1 do
            P ← GetFivePoints(B'(i, j))
            for p ∈ P do
                p_x ← Min(p_x, w − 1); p_x ← Max(p_x, 0)
                p_y ← Min(p_y, h − 1); p_y ← Max(p_y, 0)
                F'(i, j) ← F'(i, j) + BilinearInte(F, p)
            end for
        end for
    end for
    F' ← F' + F
    return F'

Feature Refinement Module. Many refined detectors still use the same feature map to perform multiple classifications and regressions, without considering the feature misalignment caused by the location changes of the bounding box. Figure 4b depicts the box refining process without feature refinement, resulting in inaccurate features, which is disadvantageous for categories with a large aspect ratio or a small sample size. Here we propose to re-encode the position information of the current refined bounding box (orange rectangle) to the corresponding feature point (red point²), thereby reconstructing the entire feature map to achieve feature alignment. The whole process is shown in Figure 4c. To accurately obtain the location feature information of the refined bounding box, we adopt bilinear feature interpolation, as shown in Figure 4d. Specifically, feature interpolation can be formulated as follows:

val = val_lt × area_rb + val_rt × area_lb + val_rb × area_lt + val_lb × area_rt    (5)

²The red and green points should totally overlap each other; here the red point is intentionally offset in order to distinguishably visualize the entire process.

Figure 6: Visualization of three detectors on DOTA. (a)(d) RetinaNet-H. (b)(e) RetinaNet-R. (c)(f) R3Det without FRM.

Based on the above result, a feature refinement module is devised, whose structure and pseudo code are shown in Figure 5 and Algorithm 1, respectively. Specifically, the feature map is first processed by two-way convolutions to obtain a new feature. Only the bounding box with the highest score at each feature point is preserved in the refinement stage to increase speed, meanwhile ensuring that each feature point corresponds to only one refined bounding box. For each point of the feature map, we obtain the corresponding feature vectors according to the five coordinates of the refined bounding box (one center point and four corner points). More accurate feature vectors are obtained by bilinear interpolation. We add the five feature vectors and replace the current feature vector. After traversing all feature points, we reconstruct the whole feature map. Finally, the reconstructed feature map is added to the original feature map to complete the whole process.
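A minimal NumPy rendering of this reconstruction step (Eq. (5) plus the sampling loop of Algorithm 1, omitting the convolution branches and box filtering) might look as follows. The names and array layouts are our own assumptions, not the released implementation; boxes are assumed to be given per feature point in feature-map coordinates.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Eq. (5): weight the four neighbors of (x, y) by the opposite areas."""
    h, w = feat.shape[:2]
    x, y = np.clip(x, 0, w - 1), np.clip(y, 0, h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    lx, ly = x - x0, y - y0
    return (feat[y0, x0] * (1 - lx) * (1 - ly) + feat[y0, x1] * lx * (1 - ly) +
            feat[y1, x0] * (1 - lx) * ly + feat[y1, x1] * lx * ly)

def five_points(box):
    """Center plus four corners of (cx, cy, bw, bh, theta), feature-map units."""
    cx, cy, bw, bh, t = box
    c, s = np.cos(t), np.sin(t)
    offsets = [(0, 0), (-bw / 2, -bh / 2), (bw / 2, -bh / 2),
               (bw / 2, bh / 2), (-bw / 2, bh / 2)]
    return [(cx + u * c - v * s, cy + u * s + v * c) for u, v in offsets]

def frm_reconstruct(feat, boxes):
    """For each feature point, sum the features sampled at the five key points
    of its refined box (boxes has shape (H, W, 5)), then add the original map
    back, mirroring the residual step at the end of Algorithm 1."""
    h, w = feat.shape[:2]
    out = np.zeros_like(feat)
    for i in range(h):
        for j in range(w):
            for px, py in five_points(boxes[i, j]):
                out[i, j] += bilinear_sample(feat, px, py)
    return out + feat
```

In practice the released code implements this as vectorized tensor operations rather than explicit Python loops; the loop form is kept here only to match the pseudo code.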
Discussion for Comparison with RoIAlign. The core of FRM's solution to feature misalignment is feature reconstruction. Compared with RoIAlign [13], which has been adopted in many two-stage rotation detectors including R2CNN [17] and RRPN [30], FRM has the following differences that contribute to R3Det's higher efficiency compared with R2CNN and RRPN, as shown in Table 5:

1) RoIAlign has more sampling points (the default number is 7 × 7 × 4 = 196), and reducing the sampling points greatly affects the performance of the detector. FRM only samples five feature points, about one-fortieth of RoIAlign (196/5 ≈ 39), which gives FRM a huge speed advantage.

2) Before classification and regression, RoIAlign only needs to obtain the features corresponding to the RoIs (instance level). In contrast, FRM first obtains the features corresponding to the feature points, and then reconstructs the entire feature map (image level). As a result, the FRM-based method can maintain a fully convolutional structure that leads to higher efficiency and fewer parameters, compared with the RoIAlign-based method, which involves a fully-connected structure.

Figure 7: The quantity of each category in DOTA.
4. Experiments
Tests are implemented with TensorFlow [1] on a server with a GeForce RTX 2080 Ti and 11G of memory. We perform experiments on both aerial benchmarks and scene text benchmarks to verify the generality of our techniques.
The benchmark DOTA [38] is for object detection in aerial images. It contains 2,806 aerial images from different sensors and platforms. The image size ranges from around 800 × 800 to 4,000 × 4,000 pixels, and the images contain objects exhibiting a wide variety of scales, orientations, and shapes. The images are annotated by experts using 15 common object categories. The fully annotated DOTA benchmark contains 188,282 instances, each of which is labeled by an arbitrary quadrilateral. There are two detection tasks for DOTA: horizontal bounding boxes (HBB) and oriented bounding boxes (OBB). Half of the original images are randomly selected as the training set, 1/6 as the validation set, and 1/3 as the testing set. We divide the images into 600 × 600 subimages with an overlap of 150 pixels and scale them to 800 × 800. With all these processes, we obtain about 27,000 patches. The model is trained for 135k iterations in total, and the learning rate changes at the 81k and 108k iterations from 5e-4 to 5e-6.

The HRSC2016 dataset [29] contains images from two scenarios, including ships at sea and ships close inshore. All the images are collected from six famous harbors. The image sizes range from 300 × 300 to 1,500 × 900. The training, validation and test sets include 436, 181 and 444 images, respectively. For all experiments we use an image scale of 800 × 800 for training and testing. We train the model with a 5e-4 learning rate for the first 30k iterations, then 5e-5 and 5e-6 for the following two 10k-iteration segments.

ICDAR2015 is used in Challenge 4 of the ICDAR 2015 Robust Reading Competition [18]. It includes a total of 1,500 pictures, 1,000 of which are used for training and the remainder for testing. The text regions are annotated by the 4 vertices of a quadrangle. We use the original image size for training and testing. The ICDAR2015 experiments use the same learning strategy, changing the learning rate at 15k, 20k, and 25k iterations, respectively.

We experiment with ResNet-FPN and MobileNetv2-FPN [34] backbones. All backbones are pre-trained on ImageNet [19]. Weight decay and momentum are 0.0001 and 0.9, respectively. We employ MomentumOptimizer over 8 GPUs with a total of 8 images per minibatch (1 image per GPU). The anchors have areas of 32² to 512² on pyramid levels P3 to P7, respectively. At each pyramid level we use anchors at seven aspect ratios {1, 1/2, 2, 1/3, 3, 5, 1/5} and three scales {2^0, 2^{1/3}, 2^{2/3}}. We also add six angles {−90°, −75°, −60°, −45°, −30°, −15°} for the rotating anchor-based method.

Table 1: Ablative study (AP for each category and overall mAP) of each component in our proposed method on the DOTA dataset. The short names for categories are defined as (abbreviation-full name): PL-Plane, BD-Baseball diamond, BR-Bridge, GTF-Ground field track, SV-Small vehicle, LV-Large vehicle, SH-Ship, TC-Tennis court, BC-Basketball court, ST-Storage tank, SBF-Soccer-ball field, RA-Roundabout, HA-Harbor, SP-Swimming pool, and HC-Helicopter. For RetinaNet, 'H' and 'R' represent the horizontal and rotating anchors, respectively. R3Det† indicates that two refinement stages have been added.

mAP | Feature Refinement Interpolation Formula | Feature Extraction
65.73 | val_lt × area_rb + val_rt × area_lb + val_rb × area_lt + val_lb × area_rt | Bilinear Interpolation
64.28 | val_lt × area_lt + val_rt × area_rt + val_rb × area_rb + val_lb × area_lb | Random Interpolation
64.37 | val_lt × area_lb + val_rt × area_rb + val_rb × area_rt + val_lb × area_lt | Random Interpolation
64.02 | val_lt × w1 + val_rt × w2 + val_rb × w3 + val_lb × w4 (quantized weights) | Quantification
64.19 | val_lt × w1 + val_rt × w2 + val_rb × w3 + val_lb × w4 (quantized weights) | Quantification

Table 2: Experiments with different interpolation formulas. Feature interpolation has position-sensitive properties.

Table 3: Ablation study for the number of stages on the DOTA dataset. '−' indicates the ensemble result, which is the collection of all outputs from the refinement stages.

To our knowledge there is no prior work exactly matching the idea presented in this paper. Still, we believe there are alternatives, and we therefore devise a competitive detector to further verify the advantage of our proposed method. From the perspective of anchors, we analyze the effect of the two forms of anchor on the speed and accuracy of the detection method, and finally construct a compromise yet robust baseline method.

The anchor setting is critical for region-based detection models. Both the horizontal anchor and the rotating anchor can achieve the purpose of rotation detection, but each has its own advantages and disadvantages. The advantage of a horizontal anchor is that it can use fewer anchors yet match more positive samples by calculating the IoU with the horizontal circumscribed rectangle of the ground truth, but it introduces a large number of non-object regions or regions of other objects. For an object with a large aspect ratio, its predicted rotating bounding box tends to be inaccurate, as shown in Figure 6a. In contrast, in Figure 6b, the rotating anchor avoids the introduction of noise regions by adding an angle parameter and has better detection performance in dense scenes. However, the number of anchors multiplies, making the model less efficient.

The performance of the single-stage detection methods based on the two forms of anchor (RetinaNet-H and RetinaNet-R) on the DOTA OBB task is shown in Table 1. In general, they have similar overall mAP (62.22% versus 62.02%), each with its own characteristics. The horizontal anchor-based approach clearly has an advantage in speed, while the rotating anchor-based method has better regression capability in dense object scenarios, such as small vehicle, large vehicle, and ship. To more effectively verify the validity of the feature refinement module, we also build a refined rotation detector that does not refine the features. Since the number of anchors does not decrease before and after the refinement stage, the number of original anchors determines the speed of the model. Taking into account both speed and accuracy, we adopt an anchor combination strategy; the anchor-count trade-off is quantified in the sketch below.
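As a back-of-the-envelope illustration of that trade-off, the snippet below counts anchors per feature point under the settings listed earlier (3 scales × 7 aspect ratios, plus 6 angles for the rotating form). It is a simple sketch for intuition, not the detector's actual anchor generator.

```python
# Anchors per feature point under the paper's settings.
scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
ratios = [1, 1 / 2, 2, 1 / 3, 3, 5, 1 / 5]
angles = [-90, -75, -60, -45, -30, -15]   # degrees, rotating form only

horizontal = len(scales) * len(ratios)    # 21 anchors per feature point
rotating = horizontal * len(angles)       # 126 anchors per feature point
print(horizontal, rotating)               # 21 126
```

The six-fold anchor inflation of the rotating form is what the combination strategy avoids in the first stage.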
Specifically, we first use horizontal anchors to reduce the number of anchors and increase the object recall rate, and then use the rotating refined anchors to overcome the problems caused by dense scenes, as shown in Figure 6c. The refined rotation detector achieves 63.14% performance, better than RetinaNet-H and RetinaNet-R.

Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP

Two-stage methods
FR-O [38] | 79.09 | 69.12 | 17.17 | 63.49 | 34.20 | 37.16 | 36.20 | 89.19 | 69.60 | 58.96 | 49.40 | 52.52 | 46.69 | 44.80 | 46.30 | 52.93
R-DFPN [39] | 80.92 | 65.82 | 33.77 | 58.94 | 55.77 | 50.94 | 54.78 | 90.33 | 66.34 | 68.66 | 48.73 | 51.76 | 55.10 | 51.32 | 35.88 | 57.94
R2CNN [17] | 80.94 | 65.67 | 35.34 | 67.44 | 59.92 | 50.91 | 55.81 | 90.67 | 66.92 | 72.39 | 55.06 | 52.23 | 55.14 | 53.35 | 48.22 | 60.67
RRPN [30] | 88.52 | 71.20 | 31.66 | 59.30 | 51.85 | 56.19 | 57.25 | 90.81 | 72.84 | 67.38 | 56.69 | 52.84 | 53.08 | 51.94 | 53.58 | 61.01
ICN [2] | 81.40 | 74.30 | 47.70 | 70.30 | 64.90 | 67.80 | 70.00 | 90.80 | 79.10 | 78.20 | 53.60 | 62.90 | 67.00 | 64.20 | 50.20 | 68.20
RoI-Transformer [10] | 88.64 | 78.52 | 43.44 | …

Single-stage methods
RetinaNet-H+ResNet50 [25] | 88.87 | 74.46 | 40.11 | 58.03 | 63.10 | 50.61 | 63.63 | 90.89 | 77.91 | 76.38 | 48.26 | 55.85 | 50.67 | 60.23 | 34.23 | 62.22
R3Det+ResNet101 | 89.54 | …
R3Det+ResNet152 | 89.24 | 80.81 | 51.11 | 65.62 | 70.67 | 76.03 | 78.32 | 90.83 | 84.89 | 84.42 | …
R3Det†+ResNet152 | 89.49 | 81.17 | 50.53 | 66.10 | …

Table 4: Detection accuracy on different objects (AP) and overall performance (mAP) evaluation on DOTA. R3Det† indicates that two refinement stages have been added.

Method | FRM | Backbone | Image Size | Data Aug. | mAP | Speed
R2CNN [17] | – | ResNet101 | 800*800 | × | … | …
R2PN [43] | – | VGG16 | – | √ | … | …
R3Det (proposed) | × | ResNet101 | 800*800 | √ | … | …
R3Det (proposed) | √ | ResNet152 | 800*800 | √ | … | …
R3Det (proposed) | √ | ResNet101 | 300*300 | √ | … | …
R3Det (proposed) | √ | ResNet101 | 600*600 | √ | … | …
R3Det (proposed) | √ | ResNet101 | 800*800 | √ | … | …
R3Det (proposed) | √ | MobileNetV2 | 300*300 | √ | … | …
R3Det (proposed) | √ | MobileNetV2 | 600*600 | √ | … | …
R3Det (proposed) | √ | MobileNetV2 | 800*800 | √ | … | …

Table 5: Accuracy and speed comparison on HRSC2016.

Method | FRM | Recall | Precision | F-measure | Res. | Device | FPS
CTPN [37] | – | 51.56 | 74.22 | 60.85 | – | – | –
SegLink [36] | – | 76.80 | 73.10 | 75.00 | – | – | –
RRPN [30] | – | 82.17 | 73.23 | 77.44 | – | – | …
R2CNN [17] | – | 79.68 | 85.62 | 82.54 | 720p | K80 | 0.44
FOTS RT [28] | – | 85.95 | 79.83 | 82.78 | 720p | Titan X | …
FOTS [28] | – | … | … | … | … | … | …
R3Det (proposed) | × | … | … | … | … | … | …
R3Det (proposed) | √ | … | … | 84.96 | … | … | 13.5

Table 6: Accuracy and speed comparison on ICDAR2015.
Feature Refinement Module.
Table 1 shows that removing FRM from R3Det, i.e., refinement without feature refinement, improves performance over the baselines by only about 1%, which is not significant. We believe the main reason is that the anchors become inconsistent with the feature map after box refinement. FRM reconstructs the feature map based on the refined anchors, which increases the overall performance by 2.59% to 65.73% according to Table 1. We count the number of objects in each category, as shown in Figure 7. Coincidentally, FRM greatly improves the categories that are underfitting due to small sample sizes and inaccurate features, such as BD, GTF, BC, SBF, RA, HC, which increase by 4.09%, 2.83%, 3.4%, 4.82%, 1.22%, and 19.26%, respectively. In refined detectors, there are two reasons for the poor performance of small sample categories: lack of adequate training and inaccuracies in the refinement stage. The former can be mitigated by focal loss, while the latter can be solved by FRM.
Feature Refinement Interpolation Formula.
When we randomly disturb the order of the four weights in the interpolation formula, the final performance of the model is greatly reduced; see rows 3–4 of Table 2. The same conclusion also appears in the experiments with quantification operations; see rows 5–6 of Table 2. This phenomenon reflects the location sensitivity of the feature points and explains why the performance of the model can be greatly improved after the features are correctly refined.
Number of Refinement Stages.
We have seen that adding a refinement stage brings a significant improvement in rotation detection, especially with the introduction of feature refinement. What about joining multiple refinement stages? R3Det† in Table 1 joins two refinement stages and brings more gain. To further explore the impact of the number of stages, several experimental results are summarized in Table 8. Experiments show that three or more refinements do not bring additional improvement to overall performance. Despite this, there are still significant improvements in the three large aspect ratio categories (SV, LV and SH). We also find that ensembling multi-stage results can further improve detection performance.

Data Augmentation and Backbone.
By data augmentation (random horizontal and vertical flipping, random graying, and random rotation), we improve the performance from 65.73% to 70.16%. In addition, we also explore the gain from the backbone. With ResNet101 and ResNet152 as the backbone, we observe further reasonable improvements in Table 1.

Ablative study for FRM using SSD.
We also verify the portability of FRM on different datasets based on SSD; see Table 7 for detailed results. FRM brings 3% and 0.84% gains on DOTA and HRSC2016, respectively. This shows the excellent generalization capability of FRM.

Table 7: Ablative study for FRM in SSD.
The proposed R3Det with FRM is compared to state-of-the-art object detectors on three datasets: DOTA [38], HRSC2016 [29] and ICDAR2015 [18]. Our model outperforms all other models.
Results on DOTA.
We compare our results with the state-of-the-art on DOTA, as depicted in Table 4. The results of DOTA reported here are obtained by submitting our predictions to the official DOTA evaluation server³. Existing two-stage detectors are still dominant in DOTA research, and the latest two-stage detection methods, such as ICN, RoI-Transformer, and SCRDet, perform well. However, they all use complex model structures in exchange for performance improvements, making them extremely slow in terms of detection efficiency. The single-stage detection method proposed in this paper achieves performance comparable to the most advanced two-stage methods while maintaining a fast detection speed. The speed analysis is detailed in Section 4.6.

Results on HRSC2016.
HRSC2016 contains many large aspect ratio ship instances with arbitrary orientations, which poses a huge challenge to the positioning accuracy of a detector. We use RRPN [30] and R2CNN [17] for comparative experiments, both of which were originally designed for scene text detection. Experimental results show that these two methods under-perform on this remote sensing dataset, reaching only 73.07% and 79.08%, respectively. Although RoI-Transformer [10] achieves 86.20% without data augmentation, its detection speed is still not ideal, only about 6 fps without accounting for post-processing operations. RetinaNet-H, RetinaNet-R and R3Det without FRM are the three baseline models used in this paper. Among them, RetinaNet-R achieves the best detection result, around 89.14%, which is consistent with the performance of the ship category in the DOTA dataset. This further illustrates that the rotation-based approach has advantages in large aspect ratio object detection. With a ResNet101 backbone, our model achieves state-of-the-art performance.
Results on ICDAR2015.
Scene text detection is also one of the main application scenarios for rotation detection. As shown in Table 6, our method achieves 84.96% while maintaining 13.5 fps on the ICDAR2015 dataset, better than most mainstream algorithms except FOTS, which adds a lot of extra training data and uses large test images. Although the method in this paper is not targeted at single-class object detection, the experimental results still show that the proposed techniques are general and useful for both aerial images and scene text images.

³https://captain-whu.github.io/DOTA/

Figure 8: Detection performance (mAP) of the methods for different IoU thresholds on HRSC2016.
Note that we only add a small number of parameters in the refinement stage to make the comparison as fair as possible. When we use FRM, only the bounding box with the highest score at each feature point is preserved in the refinement stage to increase the speed of the model. We compare speed and accuracy with six other methods on the HRSC2016 dataset. The time of post-processing (i.e., R-NMS) is included. At the same time, we also explore the impact of different backbones and image sizes on the performance of the proposed model. The detailed experimental results are shown in Table 5 and Figure 1. Our method can achieve 86.67% accuracy at 20 fps, given MobileNetv2 as the backbone.

High Precision Detection.
Figure 8 shows the detection performance of the five methods under different IoU thresholds on HRSC2016. The rotating anchor-based method obtains high-quality candidate bounding boxes, so it has excellent performance in high-precision detection. In contrast, the horizontal anchor-based method does not work well. We also find that the refinement approach allows the detector to reach the high-precision performance of the rotating anchor-based method.
FRM is More Suitable for Rotation Detection.
When applying FRM to horizontal detection, the feature vectors of the four corner points obtained in FRM are likely to be far from the object, resulting in very inaccurate features being sampled. However, in the rotation detection task, the four corner points of the rotating bounding box are very close to the object. We have experimented on the COCO dataset and the results are not satisfactory, but we have achieved considerable gains on many rotating datasets. See the supplementary material for details.

5. Conclusion
We have presented an end-to-end refined single-stage detector designed for rotating objects with large aspect ratios, dense distributions and extreme category imbalance, which are common in practice, especially in aerial and scene text images. Considering the shortcoming of feature misalignment in current refined single-stage detectors, we design a feature refinement module to improve detection performance, which is especially effective on long-tailed datasets. The key idea of FRM is to re-encode the position information of the current refined bounding box to the corresponding feature points through feature interpolation to achieve feature reconstruction and alignment. We perform careful ablation studies and comparative experiments on multiple rotation detection datasets including DOTA, HRSC2016, and ICDAR2015, and demonstrate that our method achieves state-of-the-art detection accuracy with high efficiency.
6. Appendix
Figure 9a shows the rectangle definition with a 90-degree angle representation range: θ denotes the acute angle to the x-axis, and for the other side we refer to it as w. It should be distinguished from the other definition in Figure 9b, with a 180-degree angular range, whose θ is determined by the long side of the rectangle and the x-axis. This article uses the first definition, which is also officially used by OpenCV.

When applying FRM to horizontal detection, the feature vectors of the four corner points (green points in Figure 10a) obtained in FRM are likely to be far from the object, resulting in very inaccurate features being sampled. However, in the rotation detection task, the four corner points (red points in Figure 10b) of the rotating bounding box are very close to the object. We have experimented on the COCO dataset and the results are not satisfactory, but we have achieved considerable gains on many rotating datasets.
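For readers converting annotations between the two conventions in Figure 9, a small helper of our own (an assumption-laden sketch, taking degrees and an OpenCV-style θ ∈ [−90°, 0) as input) is given below.

```python
def opencv_to_longside(w, h, theta):
    """Map the 90-degree OpenCV-style definition used in this paper
    (theta in [-90, 0), measured to the side referred to as w) to the
    180-degree long-side definition (theta between the long side and
    the x-axis, in [-90, 90))."""
    if w < h:                     # the long side is h: swap and rotate 90 deg
        w, h, theta = h, w, theta + 90.0
    if theta >= 90.0:             # keep the angle inside [-90, 90)
        theta -= 180.0
    return w, h, theta

print(opencv_to_longside(20.0, 60.0, -30.0))  # -> (60.0, 20.0, 60.0)
```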
UCAS-AOD [45] contains 1,510 aerial images of approximately 659 × 1,280 pixels, with two categories totaling 14,596 instances. As [38, 2] did before, we randomly select 1,110 images for training and 400 for testing. The model is trained for 20 epochs in total with a warm-up strategy, and the number of iterations per epoch depends on the total amount of data. The initial learning rate is 5e-4, and the learning rate changes to 5e-5 and 5e-6 at epochs 12 and 16, respectively. Table 9 illustrates the comparison of performance on the UCAS-AOD dataset; we get 96.95% for the OBB task. Our results are the best among all existing published methods.

Figure 9: Two different rotated box definitions.
Figure 10: Schematic diagram of sampling points for FRM in (a) horizontal detection and (b) rotation detection. The sampling points for horizontal detection are significantly further away from the object, while the sampling points for rotation detection are tighter.

Figure 11: Text detection results on the ICDAR2015 benchmark.
Stages | Test stage | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP
1 | 1 | 88.87 | 74.46 | 40.11 | 58.03 | 63.10 | 50.61 | 63.63 | 90.89 | 77.91 | 76.38 | 48.26 | 55.85 | 50.67 | 60.23 | 34.23 | 62.22
2 | 2 | 88.78 | 74.69 | 41.94 | 59.88 | 68.90 | 69.77 | 69.82 | 90.81 | 77.71 | 80.40 | 50.98 | 58.34 | 52.10 | 58.30 | 43.52 | 65.73
3 | 3 | …
3 | − | …

Table 8: Ablation study for the number of stages on the DOTA dataset. '−' indicates the ensemble result, which is the collection of all outputs from the refinement stages.

Method | mAP | Plane | Car
YOLOv2 [32] | 87.90 | 96.60 | 79.20
R-DFPN [39] | 89.20 | 95.90 | 82.50
DRBox [26] | 89.95 | 94.90 | 85.00
S²ARN [3] | 94.90 | 97.60 | 92.20
RetinaNet-H | 95.47 | 97.34 | 93.60
ICN [2] | 95.67 | - | -
FADet [21] | 95.71 | 98.69 | 92.72
R3Det | 96.17 | 98.20 | 94.14
Table 9: Performance evaluation on the UCAS-AOD dataset.

Figure 12: Ship detection results on the HRSC2016 benchmark. The red and green bounding boxes indicate the ground truth and prediction boxes, respectively.
We also briefly explore the impact of the number of anchors on detection accuracy. Table 10 shows that although more anchors bring more gain on a small backbone, the increase is not obvious when using a large backbone and data augmentation. Therefore, considering the trade-off between speed and precision, the anchor scales and ratios in this paper are set to {2^0, 2^{1/3}, 2^{2/3}} and {1, 1/2, 2, 1/3, 3, 5, 1/5}, respectively.

Figure 13: Face detection results on the FDDB [15] benchmark.

Table 10: Ablation study for anchor scales and ratios on the DOTA dataset.
Table 8 shows in detail that three or more refinements do not bring additional improvement to overall performance. Despite this, there are still significant improvements in the three large aspect ratio categories (SV, LV and SH). We also find that ensembling multi-stage results can further improve detection performance.
We visualize the detection results of R3Det on different types of datasets, including remote sensing datasets (Figure 14 and Figure 12), scene text datasets (Figure 11), and face datasets (Figure 13).
Figure 14: Detection results on the OBB task on DOTA: (a) BC and TC; (b) SBF, GTF, TC and SP; (c) HA; (d) HA and SH; (e) SP; (f) RA and SV; (g) ST; (h) BD and RA; (i) SV and LV; (j) PL and HC; (k) BR. Our method performs better on objects with large aspect ratios, in arbitrary directions, and at high density.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] Seyed Majid Azimi, Eleonora Vig, Reza Bahmanyar, Marco Körner, and Peter Reinartz. Towards multi-class object detection in unconstrained remote sensing imagery. In Asian Conference on Computer Vision, pages 150–165. Springer, 2018.
[3] Songze Bao, Xing Zhong, Ruifei Zhu, Xiaonan Zhang, Zhuqiang Li, and Mengyang Li. Single shot anchor refinement network for oriented object detection in optical remote sensing imagery. IEEE Access, 7:87150–87161, 2019.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
[5] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
[6] Xingyu Chen, Junzhi Yu, Shihan Kong, Zhengxing Wu, and Li Wen. Dual refinement networks for accurate and fast object detection in real-world scenes. arXiv preprint arXiv:1807.08638, 2018.
[7] Cheng Chi, Shifeng Zhang, Junliang Xing, Zhen Lei, Stan Z. Li, and Xudong Zou. Selective refinement network for high performance face detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8231–8238, 2019.
[8] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
[10] Jian Ding, Nan Xue, Yang Long, Gui-Song Xia, and Qikai Lu. Learning roi transformer for oriented object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[11] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[14] Wenhao He, Xu Yao Zhang, Fei Yin, and Cheng Lin Liu. Deep direct regression for multi-oriented scene text detection. 2017.
[15] Vidit Jain and Erik Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. 2010.
[16] Ho-Deok Jang, Sanghyun Woo, Philipp Benz, Jinsun Park, and In So Kweon. Propose-and-attend single shot detector. arXiv preprint arXiv:1907.12736, 2019.
[17] Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. R2cnn: Rotational region cnn for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017.
[18] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[20] Ang Li, Xue Yang, and Chongyang Zhang. Rethinking classification and localization for cascade r-cnn. arXiv preprint arXiv:1907.11914, 2019.
[21] Chengzheng Li, Chunyan Xu, Zhen Cui, Dan Wang, Tong Zhang, and Jian Yang. Feature-attentioned object detection in remote sensing imagery. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3886–3890. IEEE, 2019.
[22] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. Textboxes: A fast text detector with a single deep neural network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[23] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5909–5918, 2018.
[24] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[26] Lei Liu, Zongxu Pan, and Bin Lei. Learning a rotation invariant detector with rotatable bounding box. arXiv preprint arXiv:1711.09405, 2017.
[27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[28] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. Fots: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5676–5685, 2018.
[29] Zikun Liu, Liu Yuan, Lubin Weng, and Yiping Yang. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proc. ICPRAM, volume 2, pages 324–331, 2017.
[30] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 2018.
[31] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[32] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):1137–1149, 2017.
[34] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. 2018.
[35] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[36] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. 2017.
[37] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pages 56–72. Springer, 2016.
[38] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proc. CVPR, 2018.
[39] Xue Yang, Hao Sun, Kun Fu, Jirui Yang, Xian Sun, Menglong Yan, and Zhi Guo. Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sensing, 10(1):132, 2018.
[40] Xue Yang, Jirui Yang, Junchi Yan, Yue Zhang, Tengfei Zhang, Zhi Guo, Xian Sun, and Kun Fu. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[41] Hongkai Zhang, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Cascade retinanet: Maintaining consistency for single-stage object detection. arXiv preprint arXiv:1907.06881, 2019.
[42] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Single-shot refinement neural network for object detection. 2018.
[43] Zenghui Zhang, Weiwei Guo, Shengnan Zhu, and Wenxian Yu. Toward arbitrary-oriented ship detection with rotated region proposal and discrimination networks. IEEE Geoscience and Remote Sensing Letters, 15(11):1745–1749, 2018.
[44] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. East: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[45] Haigang Zhu, Xiaogang Chen, Weiqun Dai, Kun Fu, Qixiang Ye, and Jianbin Jiao. Orientation robust object detection in aerial images using deep convolutional neural network. In 2015 IEEE International Conference on Image Processing (ICIP), pages 3735–3739. IEEE, 2015.