R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object
Xue Yang
Qingqing Liu, Junchi Yan, Ang Li, Zhiqiang Zhang, Gang Yu
Shanghai Jiao Tong University, Central South University, Nanjing University of Science and Technology, Megvii Inc. (Face++)
[email protected]
Abstract
Rotation detection is a challenging task due to the difficulty of locating multi-angle objects and separating them accurately and quickly from the background. Though considerable progress has been made, challenges remain in practical settings for rotating objects with large aspect ratios, dense distributions, and extreme category imbalance. In this paper, we propose an end-to-end refined single-stage rotation detector for fast and accurate object positioning. Considering the shortcoming of feature misalignment in existing refined single-stage detectors, we design a feature refinement module to improve detection performance by obtaining more accurate features. The key idea of the feature refinement module is to re-encode the position information of the current refined bounding box to the corresponding feature points through feature interpolation, thereby realizing feature reconstruction and alignment. Extensive experiments on two public remote sensing datasets, DOTA and HRSC2016, as well as the scene text dataset ICDAR2015, show the state-of-the-art accuracy and speed of our detector. Code is available at https://github.com/Thinklab-SJTU/R3Det_Tensorflow.
1. Introduction
Object detection is one of the fundamental tasks in computer vision, and many high-performance general-purpose object detectors have been proposed. Current popular detection methods can in general be divided into two types: two-stage object detectors [12, 11, 33, 8, 24] and single-stage object detectors [27, 31, 25]. Two-stage methods have achieved promising results on various benchmarks, while the single-stage approach maintains faster detection speeds.

Figure 1: Performance (mAP) versus speed on the HRSC2016 [29] dataset. Our detectors (in green and blue) notably surpass competitors in accuracy, whilst running very fast. Detailed results are listed in Table 5 (best viewed in color).

However, current general horizontal detectors have fundamental limitations for many practical applications, for instance, scene text detection and remote sensing object detection, where the objects can be in any direction and position. Therefore, many rotation detectors based on general detection frameworks have been proposed in the fields of scene text and remote sensing. In particular, three challenges are pronounced for images in the above two fields, analyzed as follows:
1) Large aspect ratio. The Skew Intersection over Union (SkewIoU) score between large aspect ratio objects is sensitive to changes in angle, as sketched in Figure 3.

2) Densely arranged. As illustrated in Figure 6, many objects usually appear in densely arranged forms.

3) Category unbalance. Many multi-category rotated datasets are long-tailed datasets whose categories are extremely unbalanced, as shown in Figure 7.

In this paper, we mainly discuss how to design an accurate and fast rotation detector. To maintain high positioning accuracy and detection speed for large aspect ratio objects, we adopt a refined single-stage rotation detector. First, we find that rotating anchors perform better in dense scenes, while horizontal anchors achieve higher recall with fewer anchors. Therefore, a combination strategy of the two forms of anchors is adopted in the refined single-stage detector: horizontal anchors are used in the first stage for faster speed and more proposals, and then refined rotating anchors are used in the refinement stages to adapt to intensive scenarios. Second, we also notice that existing refined single-stage detectors suffer from feature misalignment problems¹ [42, 7], which greatly limits the reliability of classification and regression during the refined stages. We design a feature refinement module (FRM) that uses feature interpolation to obtain the position information of the refined anchors and reconstruct the feature map to achieve feature alignment. FRM can also reduce the number of refined bounding boxes after the first stage, thus speeding up the model. Experimental results show that feature refinement is sensitive to location and its improvement in detection results is very noticeable, especially for small sample categories. Combining these two techniques as a whole, our approach achieves state-of-the-art performance with high speed on three public rotation-sensitive datasets including DOTA, HRSC2016 and ICDAR2015.

Figure 2: Architecture of the proposed Refined Rotation Single-Stage Detector (RetinaNet [25] as an embodiment). The refinement stage can be repeated multiple times. Only the bounding box with the highest score of each feature point is preserved in the refinement stage to speed up the model. 'A' indicates the number of anchors on each feature point, and 'C' indicates the number of categories.

This work makes the following contributions:

1) For large aspect ratio objects, an accurate and fast single-stage rotation detector is devised in a refined manner, which enables high-precision detection.

2) For densely arranged scenes, we consider the advantages of each of the two forms of anchors, and adopt an anchor combination strategy to enable the detector to cope with intensive scenarios with high efficiency.

3) For category unbalance, we propose an FRM that aims to make the detector features more accurate and reliable during the refinement stages. Experiments show that FRM greatly improves the categories that underfit due to small sample sizes and inaccurate features, such as BD, GTF, BC, SBF, RA, HC (see details in Table 1), which increase by 4.09%, 2.83%, 3.4%, 4.82%, 1.22%, and 19.26%, respectively.

¹Mainly refers to the misalignment between the region of interest (RoI) and the feature.

Figure 3: The SkewIoU scores vary with the angle deviation. The red and green rectangles represent the ground truth and the prediction bounding box, respectively.
2. Related Work
Two-Stage Object Detectors.
Most of the existing two-stage methods are region-based. In a region-based framework, category-independent region proposals are generated from an image in the first stage, features are extracted from these regions subsequently, and then category-specific classifiers and regressors are used for classification and regression in the second stage. Finally, the detection results are obtained by using post-processing methods such as non-maximum suppression (NMS). Faster-RCNN [33] is a classic two-stage structure that can detect objects quickly and accurately in an end-to-end manner. Many high-performance detection methods have since been proposed, such as R-FCN [8], FPN [24], etc.
Single-Stage Object Detectors.
For their efficiency, single-stage detection methods are receiving more and more attention. OverFeat [35] is one of the first single-stage detectors based on convolutional neural networks. It performs object detection in a multi-scale sliding window fashion via a single forward pass through the CNN. Compared with region-based methods, Redmon et al. [31] propose YOLO, a unified detector casting object detection as a regression problem from image pixels to spatially separated bounding boxes and associated class probabilities. To preserve real-time speed without sacrificing too much detection accuracy, Liu et al. [27] propose SSD. The work [25] addresses the class imbalance problem by proposing RetinaNet with focal loss, further improving the accuracy of single-stage detectors.
Figure 4: Root cause analysis of feature misalignment and the core idea of the feature refinement module. (a) Original image. (b) Refined box without considering the feature misalignment caused by the location changes of the bounding box. (c) Refined box with aligned features obtained by reconstructing the feature map. (d) Feature interpolation.
Rotation Object Detectors.
Remote sensing and scene text are the main application scenarios of rotation detectors. Due to the complexity of remote sensing image scenes and the large number of small, cluttered and rotated objects, two-stage rotation detectors are still dominant for their robustness. Among them, ICN [2], RoI-Transformer [10] and SCRDet [40] are state-of-the-art detectors. However, they use more complicated structures, causing a speed bottleneck. For scene text detection, there are many efficient rotation detection methods, including both two-stage methods (R2CNN [17], RRPN [30], FOTS [28]) and single-stage methods (EAST [44], TextBoxes [22]).
Refined Object Detectors.
To achieve better positioning accuracy, many cascaded or refined detectors have been proposed. Cascade RCNN [4], HTC [5], and FSCascade [20] perform multiple classifications and regressions in the second stage, which greatly improves classification and positioning accuracy. The same idea is also used in single-stage detectors, such as RefineDet [42]. Unlike two-stage detectors, which use RoI Pooling [11] or RoI Align [13] for feature alignment, current refined single-stage detectors do not resolve this problem well. An important requirement of a refined single-stage detector is to maintain a fully convolutional structure, which retains the advantage of speed, but methods such as RoI Align cannot satisfy this requirement, as fully-connected layers have to be introduced. Although some works [6, 16, 41] use deformable convolution [9] for feature alignment, their offset parameters are often obtained by learning the offset between the pre-defined anchor box and the refined anchor. The essence of these deformable-based feature alignment methods is to expand the receptive field, which is too implicit and cannot ensure that features are truly aligned. Feature misalignment still limits the performance of refined single-stage detectors. Compared to these methods, our method can exactly locate the corresponding feature area by calculation and achieve feature alignment through feature map reconstruction.

Figure 5: Feature Refinement Module (FRM). It mainly includes three parts: refined bounding box filtering, feature interpolation and feature map reconstruction.
3. The Proposed Method
We give an overview of our method as sketched in Figure 2. The embodiment is a single-stage rotation detector based on RetinaNet [25], namely Refined Rotation RetinaNet (R3Det). The refinement stage (which can be added and repeated multiple times) is added to the network to refine the bounding boxes, and the feature refinement module (FRM) is added during the refinement stage to reconstruct the feature map. In a single-stage rotating object detection task, continuous refinement of the predicted bounding box can improve regression accuracy, and feature refinement is a necessary process for this purpose. It should be noted that FRM can also be used on other single-stage detectors (such as SSD); refer to the discussion section.
RetinaNet is one of the most advanced single-stage detectors available today. It consists of two parts: a backbone network, and classification and regression subnetworks. RetinaNet adopts the Feature Pyramid Network (FPN) [24] as the backbone network. In brief, FPN augments a convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single-resolution input image. Each level of the pyramid can be used for detecting objects at a different scale. Besides, each layer of the FPN is connected to a classification subnet and a regression subnet for predicting categories and locations. Note that the object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters. RetinaNet proposed focal loss [25] to address the problem caused by category imbalance, which has greatly improved the accuracy of single-stage detectors.

To achieve RetinaNet-based rotation detection, we use five parameters (x, y, w, h, θ) to represent an arbitrary-oriented rectangle. Ranging in [−π/2, 0), θ denotes the acute angle to the x-axis, and for the other side we refer to it as w. Therefore, it calls for predicting an additional angular offset in the regression subnet, whose rotation bounding box regression is:

t_x = (x − x_a)/w_a,  t_y = (y − y_a)/h_a
t_w = log(w/w_a),  t_h = log(h/h_a),  t_θ = θ − θ_a    (1)

t'_x = (x' − x_a)/w_a,  t'_y = (y' − y_a)/h_a
t'_w = log(w'/w_a),  t'_h = log(h'/h_a),  t'_θ = θ' − θ_a    (2)

where x, y, w, h, θ denote the box's center coordinates, width, height and angle, respectively. Variables x, x_a, x' are for the ground-truth box, anchor box, and predicted box, respectively (likewise for y, w, h, θ).

The multi-task loss is used, which is defined as follows:

L = (λ_1/N) Σ_{n=1}^{N} t'_n Σ_{j∈{x,y,w,h,θ}} L_reg(v'_{nj}, v_{nj}) + (λ_2/N) Σ_{n=1}^{N} L_cls(p_n, t_n)    (3)

where N indicates the number of anchors and t'_n is a binary value (t'_n = 1 for foreground and t'_n = 0 for background, with no regression for background). v'_{nj} represents the predicted offset vectors and v_{nj} represents the target vector of the ground truth. t_n represents the label of the object, and p_n is the probability distribution over classes calculated by the sigmoid function. The hyper-parameters λ_1, λ_2 control the trade-off and are set to 1 by default. The classification loss L_cls and regression loss L_reg are implemented by focal loss [25] and smooth L1 loss as defined in [11], respectively.
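To make Eqs. (1)–(2) concrete, the following is a minimal NumPy sketch of the five-parameter target encoding and its inverse decoding. It illustrates the formulas above rather than reproducing the authors' released TensorFlow code; the function names and the sample anchor/ground-truth values are our own.

```python
import numpy as np

def encode_rbox(gt, anchor):
    """Regression targets of Eq. (1): gt and anchor are (x, y, w, h, theta),
    with theta in radians in [-pi/2, 0)."""
    x, y, w, h, t = gt
    xa, ya, wa, ha, ta = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha), t - ta])

def decode_rbox(deltas, anchor):
    """Invert the encoding to recover a predicted box from network offsets."""
    xa, ya, wa, ha, ta = anchor
    tx, ty, tw, th, tt = deltas
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th), tt + ta])

# Hypothetical anchor and ground-truth box for a round-trip check.
anchor = np.array([50.0, 50.0, 40.0, 20.0, -np.pi / 2])
gt = np.array([53.0, 48.0, 60.0, 15.0, -np.pi / 3])
assert np.allclose(decode_rbox(encode_rbox(gt, anchor), anchor), gt)
```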
Refined Detection. The Skew Intersection over Union (SkewIoU) score is sensitive to changes in angle: a slight angle shift causes a rapid decrease in the IoU score, as shown in Figure 3. Therefore, refining the prediction boxes helps to improve the recall rate of rotation detection. We join multiple refinement stages with different IoU thresholds. In addition to using a foreground IoU threshold of 0.5 and a background IoU threshold of 0.4 in the first stage, the thresholds of the first refinement stage are set to 0.6 and 0.5, respectively. If there are multiple refinement stages, the remaining thresholds are 0.7 and 0.6. The overall loss for the refined detector is defined as follows:

L_total = Σ_{i=1}^{N} α_i L_i    (4)

where L_i is the loss value of the i-th refinement stage and the trade-off coefficients α_i are set to 1 by default.
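The angle sensitivity that motivates these refinement stages (Figure 3) is easy to reproduce numerically. The sketch below, which assumes the third-party shapely geometry library and hypothetical box sizes, compares SkewIoU for a square object and a 1:7 aspect ratio object under the same small angle deviations.

```python
import numpy as np
from shapely.geometry import Polygon  # assumed third-party dependency

def rbox_polygon(cx, cy, w, h, angle_deg):
    """Four corners of a rotated rectangle as a shapely Polygon."""
    a = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    corners = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return Polygon(corners @ rot.T + [cx, cy])

def skew_iou(b1, b2):
    p1, p2 = rbox_polygon(*b1), rbox_polygon(*b2)
    inter = p1.intersection(p2).area
    return inter / (p1.area + p2.area - inter)

for ratio in (1, 7):                       # square vs. large aspect ratio
    gt = (0.0, 0.0, 100.0 * ratio, 100.0, 0.0)
    for dev in (5, 10, 15):                # angle deviation in degrees
        pred = (0.0, 0.0, 100.0 * ratio, 100.0, float(dev))
        print(f"ratio 1:{ratio}, deviation {dev} deg -> "
              f"SkewIoU {skew_iou(gt, pred):.3f}")
```

The same angular deviation costs the elongated box far more IoU than the square one, which is exactly the behavior sketched in Figure 3.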
Algorithm 1: Feature Refinement Module
Input: original feature map F, the bounding boxes (B) and confidences (S) of the previous stage
Output: reconstructed feature map F'
    B' ← Filter(B, S)
    h, w ← Shape(F); F' ← ZerosLike(F)
    F ← Conv(F) + Conv(Conv(F))
    for i ← 0 to h − 1 do
        for j ← 0 to w − 1 do
            P ← GetFivePoints(B'(i, j))
            for p ∈ P do
                p_x ← Min(p_x, w − 1); p_x ← Max(p_x, 0)
                p_y ← Min(p_y, h − 1); p_y ← Max(p_y, 0)
                F'(i, j) ← F'(i, j) + BilinearInte(F, p)
            end for
        end for
    end for
    F' ← F' + F
    return F'

Feature Refinement Module. Many refined detectors still use the same feature map to perform multiple classifications and regressions, without considering the feature misalignment caused by the location changes of the bounding box. Figure 4b depicts the box refining process without feature refinement, resulting in inaccurate features, which is disadvantageous for categories with a large aspect ratio or a small sample size. Here we propose to re-encode the position information of the current refined bounding box (orange rectangle) to the corresponding feature point (red point²), thereby reconstructing the entire feature map to achieve feature alignment. The whole process is shown in Figure 4c. To accurately obtain the location feature information of the refined bounding box, we adopt bilinear feature interpolation, as shown in Figure 4d. Specifically, feature interpolation can be formulated as follows:

val = val_lt × area_rb + val_rt × area_lb + val_rb × area_lt + val_lb × area_rt    (5)

²The red and green points should totally overlap each other; here the red point is intentionally offset in order to distinguishably visualize the entire process.

Figure 6: Visualization of three detectors on DOTA. (a)(d) RetinaNet-H. (b)(e) RetinaNet-R. (c)(f) R3Det without FRM.

Based on the above result, a feature refinement module is devised, whose structure and pseudo code are shown in Figure 5 and Algorithm 1, respectively. Specifically, the feature map is first processed by two-way convolutions to obtain a new feature. Only the bounding box with the highest score at each feature point is preserved in the refinement stage to increase speed, meanwhile ensuring that each feature point corresponds to only one refined bounding box. For each point of the feature map, we obtain the corresponding feature vectors according to the five coordinates of the refined bounding box (one center point and four corner points). More accurate feature vectors are obtained by bilinear interpolation. We add the five feature vectors and replace the current feature vector. After traversing all feature points, we reconstruct the whole feature map. Finally, the reconstructed feature map is added to the original feature map to complete the whole process.
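A minimal NumPy rendering of this reconstruction step (Eq. (5) plus the sampling loop of Algorithm 1, omitting the convolution branches and box filtering) might look as follows. The names and array layouts are our own assumptions, not the released implementation; boxes are assumed to be given per feature point in feature-map coordinates.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Eq. (5): weight the four neighbors of (x, y) by the opposite areas."""
    h, w = feat.shape[:2]
    x, y = np.clip(x, 0, w - 1), np.clip(y, 0, h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    lx, ly = x - x0, y - y0
    return (feat[y0, x0] * (1 - lx) * (1 - ly) + feat[y0, x1] * lx * (1 - ly) +
            feat[y1, x0] * (1 - lx) * ly + feat[y1, x1] * lx * ly)

def five_points(box):
    """Center plus four corners of (cx, cy, bw, bh, theta), feature-map units."""
    cx, cy, bw, bh, t = box
    c, s = np.cos(t), np.sin(t)
    offsets = [(0, 0), (-bw / 2, -bh / 2), (bw / 2, -bh / 2),
               (bw / 2, bh / 2), (-bw / 2, bh / 2)]
    return [(cx + u * c - v * s, cy + u * s + v * c) for u, v in offsets]

def frm_reconstruct(feat, boxes):
    """For each feature point, sum the features sampled at the five key points
    of its refined box (boxes has shape (H, W, 5)), then add the original map
    back, mirroring the residual step at the end of Algorithm 1."""
    h, w = feat.shape[:2]
    out = np.zeros_like(feat)
    for i in range(h):
        for j in range(w):
            for px, py in five_points(boxes[i, j]):
                out[i, j] += bilinear_sample(feat, px, py)
    return out + feat
```

In practice the released code implements this as vectorized tensor operations rather than explicit Python loops; the loop form is kept here only to match the pseudo code.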
Discussion for Comparison with RoIAlign. The core of FRM's solution to feature misalignment is feature reconstruction. Compared with RoIAlign [13], which has been adopted in many two-stage rotation detectors including R2CNN [17] and RRPN [30], FRM has the following differences that contribute to R3Det's higher efficiency compared with R2CNN and RRPN, as shown in Table 5:

1) RoIAlign has more sampling points (the default number is 7 × 7 × 4 = 196), and reducing the sampling points greatly affects the performance of the detector. FRM only samples five feature points, about one-fortieth of RoIAlign (196/5 ≈ 39), which gives FRM a huge speed advantage.

2) Before classification and regression, RoIAlign only needs to obtain the features corresponding to the RoIs (instance level). In contrast, FRM first obtains the features corresponding to the feature points, and then reconstructs the entire feature map (image level). As a result, the FRM-based method can maintain a fully convolutional structure that leads to higher efficiency and fewer parameters, compared with the RoIAlign-based method, which involves a fully-connected structure.

Figure 7: The quantity of each category in DOTA.
4. Experiments
Tests are implemented with TensorFlow [1] on a server with a GeForce RTX 2080 Ti and 11G of memory. We perform experiments on both aerial benchmarks and scene text benchmarks to verify the generality of our techniques.
The benchmark DOTA [38] is for object detection in aerial images. It contains 2,806 aerial images from different sensors and platforms. The image size ranges from around 800 × 800 to 4,000 × 4,000 pixels, and the images contain objects exhibiting a wide variety of scales, orientations, and shapes. The images are annotated by experts using 15 common object categories. The fully annotated DOTA benchmark contains 188,282 instances, each of which is labeled by an arbitrary quadrilateral. There are two detection tasks for DOTA: horizontal bounding boxes (HBB) and oriented bounding boxes (OBB). Half of the original images are randomly selected as the training set, 1/6 as the validation set, and 1/3 as the testing set. We divide the images into 600 × 600 subimages with an overlap of 150 pixels and scale them to 800 × 800. With all these processes, we obtain about 27,000 patches. The model is trained for 135k iterations in total, and the learning rate changes at the 81k and 108k iterations from 5e-4 to 5e-6.

The HRSC2016 dataset [29] contains images from two scenarios, including ships at sea and ships close inshore. All the images are collected from six famous harbors. The image sizes range from 300 × 300 to 1,500 × 900. The training, validation and test sets include 436, 181 and 444 images, respectively. For all experiments we use an image scale of 800 × 800 for training and testing. We train the model with a 5e-4 learning rate for the first 30k iterations, then 5e-5 and 5e-6 for the following two 10k-iteration segments.

ICDAR2015 is used in Challenge 4 of the ICDAR 2015 Robust Reading Competition [18]. It includes a total of 1,500 pictures, 1,000 of which are used for training and the remainder for testing. The text regions are annotated by the 4 vertices of a quadrangle. We use the original image size for training and testing. The ICDAR2015 experiments use the same learning strategy, changing the learning rate at 15k, 20k, and 25k iterations, respectively.

We experiment with ResNet-FPN and MobileNetv2-FPN [34] backbones. All backbones are pre-trained on ImageNet [19]. Weight decay and momentum are 0.0001 and 0.9, respectively. We employ MomentumOptimizer over 8 GPUs with a total of 8 images per minibatch (1 image per GPU). The anchors have areas of 32² to 512² on pyramid levels P3 to P7, respectively. At each pyramid level we use anchors at seven aspect ratios {1, 1/2, 2, 1/3, 3, 5, 1/5} and three scales {2^0, 2^{1/3}, 2^{2/3}}. We also add six angles {−90°, −75°, −60°, −45°, −30°, −15°} for the rotating anchor-based method.

Table 1: Ablative study (AP for each category and overall mAP) of each component in our proposed method on the DOTA dataset. The short names for categories are defined as (abbreviation-full name): PL-Plane, BD-Baseball diamond, BR-Bridge, GTF-Ground field track, SV-Small vehicle, LV-Large vehicle, SH-Ship, TC-Tennis court, BC-Basketball court, ST-Storage tank, SBF-Soccer-ball field, RA-Roundabout, HA-Harbor, SP-Swimming pool, and HC-Helicopter. For RetinaNet, 'H' and 'R' represent the horizontal and rotating anchors, respectively. R3Det† indicates that two refinement stages have been added.

mAP | Feature Refinement Interpolation Formula | Feature Extraction
65.73 | val_lt × area_rb + val_rt × area_lb + val_rb × area_lt + val_lb × area_rt | Bilinear Interpolation
64.28 | val_lt × area_lt + val_rt × area_rt + val_rb × area_rb + val_lb × area_lb | Random Interpolation
64.37 | val_lt × area_lb + val_rt × area_rb + val_rb × area_rt + val_lb × area_lt | Random Interpolation
64.02 | val_lt × w1 + val_rt × w2 + val_rb × w3 + val_lb × w4 (quantized weights) | Quantification
64.19 | val_lt × w1 + val_rt × w2 + val_rb × w3 + val_lb × w4 (quantized weights) | Quantification

Table 2: Experiments with different interpolation formulas. Feature interpolation has position-sensitive properties.

Table 3: Ablation study for the number of stages on the DOTA dataset. '−' indicates the ensemble result, which is the collection of all outputs from the refinement stages.

To our knowledge there is no prior work exactly matching the idea presented in this paper. Still, we believe there are alternatives, and we therefore devise a competitive detector to further verify the advantage of our proposed method. From the perspective of anchors, we analyze the effect of the two forms of anchor on the speed and accuracy of the detection method, and finally construct a compromise yet robust baseline method.

The anchor setting is critical for region-based detection models. Both the horizontal anchor and the rotating anchor can achieve the purpose of rotation detection, but each has its own advantages and disadvantages. The advantage of a horizontal anchor is that it can use fewer anchors yet match more positive samples by calculating the IoU with the horizontal circumscribed rectangle of the ground truth, but it introduces a large number of non-object regions or regions of other objects. For an object with a large aspect ratio, its predicted rotating bounding box tends to be inaccurate, as shown in Figure 6a. In contrast, in Figure 6b, the rotating anchor avoids the introduction of noise regions by adding an angle parameter and has better detection performance in dense scenes. However, the number of anchors multiplies, making the model less efficient.

The performance of the single-stage detection methods based on the two forms of anchor (RetinaNet-H and RetinaNet-R) on the DOTA OBB task is shown in Table 1. In general, they have similar overall mAP (62.22% versus 62.02%), each with its own characteristics. The horizontal anchor-based approach clearly has an advantage in speed, while the rotating anchor-based method has better regression capability in dense object scenarios, such as small vehicle, large vehicle, and ship. To more effectively verify the validity of the feature refinement module, we also build a refined rotation detector that does not refine the features. Since the number of anchors does not decrease before and after the refinement stage, the number of original anchors determines the speed of the model. Taking into account both speed and accuracy, we adopt an anchor combination strategy; the anchor-count trade-off is quantified in the sketch below.
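As a back-of-the-envelope illustration of that trade-off, the snippet below counts anchors per feature point under the settings listed earlier (3 scales × 7 aspect ratios, plus 6 angles for the rotating form). It is a simple sketch for intuition, not the detector's actual anchor generator.

```python
# Anchors per feature point under the paper's settings.
scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
ratios = [1, 1 / 2, 2, 1 / 3, 3, 5, 1 / 5]
angles = [-90, -75, -60, -45, -30, -15]   # degrees, rotating form only

horizontal = len(scales) * len(ratios)    # 21 anchors per feature point
rotating = horizontal * len(angles)       # 126 anchors per feature point
print(horizontal, rotating)               # 21 126
```

The six-fold anchor inflation of the rotating form is what the combination strategy avoids in the first stage.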
Specifically, we first use horizontal anchors to reduce the number of anchors and increase the object recall rate, and then use the rotating refined anchors to overcome the problems caused by dense scenes, as shown in Figure 6c. The refined rotation detector achieves 63.14% performance, better than RetinaNet-H and RetinaNet-R.

Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP

Two-stage methods
FR-O [38] | 79.09 | 69.12 | 17.17 | 63.49 | 34.20 | 37.16 | 36.20 | 89.19 | 69.60 | 58.96 | 49.40 | 52.52 | 46.69 | 44.80 | 46.30 | 52.93
R-DFPN [39] | 80.92 | 65.82 | 33.77 | 58.94 | 55.77 | 50.94 | 54.78 | 90.33 | 66.34 | 68.66 | 48.73 | 51.76 | 55.10 | 51.32 | 35.88 | 57.94
R2CNN [17] | 80.94 | 65.67 | 35.34 | 67.44 | 59.92 | 50.91 | 55.81 | 90.67 | 66.92 | 72.39 | 55.06 | 52.23 | 55.14 | 53.35 | 48.22 | 60.67
RRPN [30] | 88.52 | 71.20 | 31.66 | 59.30 | 51.85 | 56.19 | 57.25 | 90.81 | 72.84 | 67.38 | 56.69 | 52.84 | 53.08 | 51.94 | 53.58 | 61.01
ICN [2] | 81.40 | 74.30 | 47.70 | 70.30 | 64.90 | 67.80 | 70.00 | 90.80 | 79.10 | 78.20 | 53.60 | 62.90 | 67.00 | 64.20 | 50.20 | 68.20
RoI-Transformer [10] | 88.64 | 78.52 | 43.44 | …

Single-stage methods
RetinaNet-H+ResNet50 [25] | 88.87 | 74.46 | 40.11 | 58.03 | 63.10 | 50.61 | 63.63 | 90.89 | 77.91 | 76.38 | 48.26 | 55.85 | 50.67 | 60.23 | 34.23 | 62.22
R3Det+ResNet101 | 89.54 | …
R3Det+ResNet152 | 89.24 | 80.81 | 51.11 | 65.62 | 70.67 | 76.03 | 78.32 | 90.83 | 84.89 | 84.42 | …
R3Det†+ResNet152 | 89.49 | 81.17 | 50.53 | 66.10 | …

Table 4: Detection accuracy on different objects (AP) and overall performance (mAP) evaluation on DOTA. R3Det† indicates that two refinement stages have been added.

Method | FRM | Backbone | Image Size | Data Aug. | mAP | Speed
R2CNN [17] | – | ResNet101 | 800*800 | × | … | …
R2PN [43] | – | VGG16 | – | √ | … | …
R3Det (proposed) | × | ResNet101 | 800*800 | √ | … | …
R3Det (proposed) | √ | ResNet152 | 800*800 | √ | … | …
R3Det (proposed) | √ | ResNet101 | 300*300 | √ | … | …
R3Det (proposed) | √ | ResNet101 | 600*600 | √ | … | …
R3Det (proposed) | √ | ResNet101 | 800*800 | √ | … | …
R3Det (proposed) | √ | MobileNetV2 | 300*300 | √ | … | …
R3Det (proposed) | √ | MobileNetV2 | 600*600 | √ | … | …
R3Det (proposed) | √ | MobileNetV2 | 800*800 | √ | … | …

Table 5: Accuracy and speed comparison on HRSC2016.

Method | FRM | Recall | Precision | F-measure | Res. | Device | FPS
CTPN [37] | – | 51.56 | 74.22 | 60.85 | – | – | –
SegLink [36] | – | 76.80 | 73.10 | 75.00 | – | – | –
RRPN [30] | – | 82.17 | 73.23 | 77.44 | – | – | …
R2CNN [17] | – | 79.68 | 85.62 | 82.54 | 720p | K80 | 0.44
FOTS RT [28] | – | 85.95 | 79.83 | 82.78 | 720p | Titan X | …
FOTS [28] | – | … | … | … | … | … | …
R3Det (proposed) | × | … | … | … | … | … | …
R3Det (proposed) | √ | … | … | 84.96 | … | … | 13.5

Table 6: Accuracy and speed comparison on ICDAR2015.
Feature Refinement Module.
Table 1 shows that removing FRM from R3Det, i.e., refinement without feature refinement, improves performance over the baselines by only about 1%, which is not significant. We believe the main reason is that the anchors become inconsistent with the feature map after box refinement. FRM reconstructs the feature map based on the refined anchors, which increases the overall performance by 2.59% to 65.73% according to Table 1. We count the number of objects in each category, as shown in Figure 7. Coincidentally, FRM greatly improves the categories that are underfitting due to small sample sizes and inaccurate features, such as BD, GTF, BC, SBF, RA, HC, which increase by 4.09%, 2.83%, 3.4%, 4.82%, 1.22%, and 19.26%, respectively. In refined detectors, there are two reasons for the poor performance of small sample categories: lack of adequate training and inaccuracies in the refinement stage. The former can be mitigated by focal loss, while the latter can be solved by FRM.
Feature Refinement Interpolation Formula.
When we randomly disturb the order of the four weights in the interpolation formula, the final performance of the model is greatly reduced; see rows 3–4 of Table 2. The same conclusion also appears in the experiments with quantification operations; see rows 5–6 of Table 2. This phenomenon reflects the location sensitivity of the feature points and explains why the performance of the model can be greatly improved after the features are correctly refined.
Number of Refinement Stages.
We have seen that adding a refinement stage brings a significant improvement in rotation detection, especially with the introduction of feature refinement. What about joining multiple refinement stages? R3Det† in Table 1 joins two refinement stages and brings more gain. To further explore the impact of the number of stages, several experimental results are summarized in Table 8. Experiments show that three or more refinements do not bring additional improvement to overall performance. Despite this, there are still significant improvements in the three large aspect ratio categories (SV, LV and SH). We also find that ensembling multi-stage results can further improve detection performance.

Data Augmentation and Backbone.
By data augmentation (random horizontal and vertical flipping, random graying, and random rotation), we improve the performance from 65.73% to 70.16%. In addition, we also explore the gain from the backbone. With ResNet101 and ResNet152 as the backbone, we observe further reasonable improvements in Table 1.

Ablative study for FRM using SSD.
We also verify the portability of FRM on different datasets based on SSD; see Table 7 for detailed results. FRM brings 3% and 0.84% gains on DOTA and HRSC2016, respectively. This shows the excellent generalization capability of FRM.

Table 7: Ablative study for FRM in SSD.
The proposed R3Det with FRM is compared to state-of-the-art object detectors on three datasets: DOTA [38], HRSC2016 [29] and ICDAR2015 [18]. Our model outperforms all other models.
Results on DOTA.
We compare our results with the state-of-the-art on DOTA, as depicted in Table 4. The results of DOTA reported here are obtained by submitting our predictions to the official DOTA evaluation server³. Existing two-stage detectors are still dominant in DOTA research, and the latest two-stage detection methods, such as ICN, RoI-Transformer, and SCRDet, perform well. However, they all use complex model structures in exchange for performance improvements, making them extremely slow in terms of detection efficiency. The single-stage detection method proposed in this paper achieves performance comparable to the most advanced two-stage methods while maintaining a fast detection speed. The speed analysis is detailed in Section 4.6.

Results on HRSC2016.
HRSC2016 contains many large aspect ratio ship instances with arbitrary orientations, which poses a huge challenge to the positioning accuracy of a detector. We use RRPN [30] and R2CNN [17] for comparative experiments, both of which were originally designed for scene text detection. Experimental results show that these two methods under-perform on this remote sensing dataset, reaching only 73.07% and 79.08%, respectively. Although RoI-Transformer [10] achieves 86.20% without data augmentation, its detection speed is still not ideal, only about 6 fps without accounting for post-processing operations. RetinaNet-H, RetinaNet-R and R3Det without FRM are the three baseline models used in this paper. Among them, RetinaNet-R achieves the best detection result, around 89.14%, which is consistent with the performance of the ship category in the DOTA dataset. This further illustrates that the rotation-based approach has advantages in large aspect ratio object detection. With a ResNet101 backbone, our model achieves state-of-the-art performance.
Results on ICDAR2015.
Scene text detection is also one of the main application scenarios for rotation detection. As shown in Table 6, our method achieves 84.96% while maintaining 13.5 fps on the ICDAR2015 dataset, better than most mainstream algorithms except FOTS, which adds a lot of extra training data and uses large test images. Although the method in this paper is not targeted at single-class object detection, the experimental results still show that the proposed techniques are general and useful for both aerial images and scene text images.

³https://captain-whu.github.io/DOTA/

Figure 8: Detection performance (mAP) of the methods for different IoU thresholds on HRSC2016.
Note that we only add a small number of parameters in the refinement stage to make the comparison as fair as possible. When we use FRM, only the bounding box with the highest score at each feature point is preserved in the refinement stage to increase the speed of the model. We compare speed and accuracy with six other methods on the HRSC2016 dataset. The time of post-processing (i.e., R-NMS) is included. At the same time, we also explore the impact of different backbones and image sizes on the performance of the proposed model. The detailed experimental results are shown in Table 5 and Figure 1. Our method can achieve 86.67% accuracy at 20 fps, given MobileNetv2 as the backbone.

High Precision Detection.
Figure 8 shows the detection performance of the five methods under different IoU thresholds on HRSC2016. The rotating anchor-based method obtains high-quality candidate bounding boxes, so it has excellent performance in high-precision detection. In contrast, the horizontal anchor-based method does not work well. We also find that the refinement approach allows the detector to reach the high-precision performance of the rotating anchor-based method.
FRM is More Suitable for Rotation Detection.
When applying FRM to horizontal detection, the feature vectors of the four corner points obtained in FRM are likely to be far from the object, resulting in very inaccurate features being sampled. However, in the rotation detection task, the four corner points of the rotating bounding box are very close to the object. We have experimented on the COCO dataset and the results are not satisfactory, but we have achieved considerable gains on many rotating datasets. See the supplementary material for details.

5. Conclusion
We have presented an end-to-end refined single-stage detector designed for rotating objects with large aspect ratios, dense distributions and extreme category imbalance, which are common in practice, especially in aerial and scene text images. Considering the shortcoming of feature misalignment in current refined single-stage detectors, we design a feature refinement module to improve detection performance, which is especially effective on long-tailed datasets. The key idea of FRM is to re-encode the position information of the current refined bounding box to the corresponding feature points through feature interpolation to achieve feature reconstruction and alignment. We perform careful ablation studies and comparative experiments on multiple rotation detection datasets including DOTA, HRSC2016, and ICDAR2015, and demonstrate that our method achieves state-of-the-art detection accuracy with high efficiency.
6. Appendix
Figure 9a shows the rectangle definition with a 90-degree angle representation range: θ denotes the acute angle to the x-axis, and for the other side we refer to it as w. It should be distinguished from the other definition in Figure 9b, with a 180-degree angular range, whose θ is determined by the long side of the rectangle and the x-axis. This article uses the first definition, which is also officially used by OpenCV.

When applying FRM to horizontal detection, the feature vectors of the four corner points (green points in Figure 10a) obtained in FRM are likely to be far from the object, resulting in very inaccurate features being sampled. However, in the rotation detection task, the four corner points (red points in Figure 10b) of the rotating bounding box are very close to the object. We have experimented on the COCO dataset and the results are not satisfactory, but we have achieved considerable gains on many rotating datasets.
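For readers converting annotations between the two conventions in Figure 9, a small helper of our own (an assumption-laden sketch, taking degrees and an OpenCV-style θ ∈ [−90°, 0) as input) is given below.

```python
def opencv_to_longside(w, h, theta):
    """Map the 90-degree OpenCV-style definition used in this paper
    (theta in [-90, 0), measured to the side referred to as w) to the
    180-degree long-side definition (theta between the long side and
    the x-axis, in [-90, 90))."""
    if w < h:                     # the long side is h: swap and rotate 90 deg
        w, h, theta = h, w, theta + 90.0
    if theta >= 90.0:             # keep the angle inside [-90, 90)
        theta -= 180.0
    return w, h, theta

print(opencv_to_longside(20.0, 60.0, -30.0))  # -> (60.0, 20.0, 60.0)
```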
UCAS-AOD [45] contains 1,510 aerial images of approximately 659 × 1,280 pixels, with two categories totaling 14,596 instances. As [38, 2] did before, we randomly select 1,110 images for training and 400 for testing. The model is trained for 20 epochs in total with a warm-up strategy, and the number of iterations per epoch depends on the total amount of data. The initial learning rate is 5e-4, and the learning rate changes to 5e-5 and 5e-6 at epochs 12 and 16, respectively. Table 9 illustrates the comparison of performance on the UCAS-AOD dataset; we get 96.95% for the OBB task. Our results are the best among all existing published methods.

Figure 9: Two different rotated box definitions.
Figure 10: Schematic diagram of sampling points for FRM in (a) horizontal detection and (b) rotation detection. The sampling points for horizontal detection are significantly further away from the object, while the sampling points for rotation detection are tighter.

Figure 11: Text detection results on the ICDAR2015 benchmark.
Stages | Test stage | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP
1 | 1 | 88.87 | 74.46 | 40.11 | 58.03 | 63.10 | 50.61 | 63.63 | 90.89 | 77.91 | 76.38 | 48.26 | 55.85 | 50.67 | 60.23 | 34.23 | 62.22
2 | 2 | 88.78 | 74.69 | 41.94 | 59.88 | 68.90 | 69.77 | 69.82 | 90.81 | 77.71 | 80.40 | 50.98 | 58.34 | 52.10 | 58.30 | 43.52 | 65.73
3 | 3 | …
3 | − | …

Table 8: Ablation study for the number of stages on the DOTA dataset. '−' indicates the ensemble result, which is the collection of all outputs from the refinement stages.

Method | mAP | Plane | Car
YOLOv2 [32] | 87.90 | 96.60 | 79.20
R-DFPN [39] | 89.20 | 95.90 | 82.50
DRBox [26] | 89.95 | 94.90 | 85.00
S²ARN [3] | 94.90 | 97.60 | 92.20
RetinaNet-H | 95.47 | 97.34 | 93.60
ICN [2] | 95.67 | - | -
FADet [21] | 95.71 | 98.69 | 92.72
R3Det | 96.17 | 98.20 | 94.14
Table 9: Performance evaluation on the UCAS-AOD dataset.

Figure 12: Ship detection results on the HRSC2016 benchmark. The red and green bounding boxes indicate the ground truth and prediction boxes, respectively.
We also briefly explore the impact of the number of anchors on detection accuracy. Table 10 shows that although more anchors bring more gain on a small backbone, the increase is not obvious when using a large backbone and data augmentation. Therefore, considering the trade-off between speed and precision, the anchor scales and ratios in this paper are set to {2^0, 2^{1/3}, 2^{2/3}} and {1, 1/2, 2, 1/3, 3, 5, 1/5}, respectively.

Figure 13: Face detection results on the FDDB [15] benchmark.

Table 10: Ablation study for anchor scales and ratios on the DOTA dataset.
Table 8 shows in detail that three or more refinements do not bring additional improvement to overall performance. Despite this, there are still significant improvements in the three large aspect ratio categories (SV, LV and SH). We also find that ensembling multi-stage results can further improve detection performance.
We visualize the detection results of R3Det on different types of datasets, including remote sensing datasets (Figure 14 and Figure 12), scene text datasets (Figure 11), and face datasets (Figure 13).
Figure 14: Detection results on the OBB task on DOTA: (a) BC and TC; (b) SBF, GTF, TC and SP; (c) HA; (d) HA and SH; (e) SP; (f) RA and SV; (g) ST; (h) BD and RA; (i) SV and LV; (j) PL and HC; (k) BR. Our method performs better on objects with large aspect ratios, in arbitrary directions, and at high density.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] Seyed Majid Azimi, Eleonora Vig, Reza Bahmanyar, Marco Körner, and Peter Reinartz. Towards multi-class object detection in unconstrained remote sensing imagery. In Asian Conference on Computer Vision, pages 150–165. Springer, 2018.
[3] Songze Bao, Xing Zhong, Ruifei Zhu, Xiaonan Zhang, Zhuqiang Li, and Mengyang Li. Single shot anchor refinement network for oriented object detection in optical remote sensing imagery. IEEE Access, 7:87150–87161, 2019.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
[5] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
[6] Xingyu Chen, Junzhi Yu, Shihan Kong, Zhengxing Wu, and Li Wen. Dual refinement networks for accurate and fast object detection in real-world scenes. arXiv preprint arXiv:1807.08638, 2018.
[7] Cheng Chi, Shifeng Zhang, Junliang Xing, Zhen Lei, Stan Z. Li, and Xudong Zou. Selective refinement network for high performance face detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8231–8238, 2019.
[8] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
[10] Jian Ding, Nan Xue, Yang Long, Gui-Song Xia, and Qikai Lu. Learning roi transformer for oriented object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[11] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[14] Wenhao He, Xu Yao Zhang, Fei Yin, and Cheng Lin Liu. Deep direct regression for multi-oriented scene text detection. 2017.
[15] Vidit Jain and Erik Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. 2010.
[16] Ho-Deok Jang, Sanghyun Woo, Philipp Benz, Jinsun Park, and In So Kweon. Propose-and-attend single shot detector. arXiv preprint arXiv:1907.12736, 2019.
[17] Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. R2cnn: Rotational region cnn for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017.
[18] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[20] Ang Li, Xue Yang, and Chongyang Zhang. Rethinking classification and localization for cascade r-cnn. arXiv preprint arXiv:1907.11914, 2019.
[21] Chengzheng Li, Chunyan Xu, Zhen Cui, Dan Wang, Tong Zhang, and Jian Yang. Feature-attentioned object detection in remote sensing imagery. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3886–3890. IEEE, 2019.
[22] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. Textboxes: A fast text detector with a single deep neural network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[23] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5909–5918, 2018.
[24] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[26] Lei Liu, Zongxu Pan, and Bin Lei. Learning a rotation invariant detector with rotatable bounding box. arXiv preprint arXiv:1711.09405, 2017.
[27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[28] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. Fots: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5676–5685, 2018.
[29] Zikun Liu, Liu Yuan, Lubin Weng, and Yiping Yang. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proc. ICPRAM, volume 2, pages 324–331, 2017.
[30] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 2018.
[31] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[32] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):1137–1149, 2017.
[34] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. 2018.
[35] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[36] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. 2017.
[37] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pages 56–72. Springer, 2016.
[38] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proc. CVPR, 2018.
[39] Xue Yang, Hao Sun, Kun Fu, Jirui Yang, Xian Sun, Menglong Yan, and Zhi Guo. Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sensing, 10(1):132, 2018.
[40] Xue Yang, Jirui Yang, Junchi Yan, Yue Zhang, Tengfei Zhang, Zhi Guo, Xian Sun, and Kun Fu. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[41] Hongkai Zhang, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Cascade retinanet: Maintaining consistency for single-stage object detection. arXiv preprint arXiv:1907.06881, 2019.
[42] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Single-shot refinement neural network for object detection. 2018.
[43] Zenghui Zhang, Weiwei Guo, Shengnan Zhu, and Wenxian Yu. Toward arbitrary-oriented ship detection with rotated region proposal and discrimination networks. IEEE Geoscience and Remote Sensing Letters, 15(11):1745–1749, 2018.
[44] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. East: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[45] Haigang Zhu, Xiaogang Chen, Weiqun Dai, Kun Fu, Qixiang Ye, and Jianbin Jiao. Orientation robust object detection in aerial images using deep convolutional neural network. In 2015 IEEE International Conference on Image Processing (ICIP), pages 3735–3739. IEEE, 2015.