HRDNet: High-resolution Detection Network for Small Objects
Ziming Liu, Guangyu Gao, Lin Sun and Zhiyuan Fang
Beijing Institute of Technology; SAMSUNG SEMICONDUCTOR INC.
Abstract
Small object detection is challenging because small objects do not contain detailed information and may even disappear in the deep network. Usually, feeding high-resolution images into a network can alleviate this issue. However, simply enlarging the resolution causes new problems: it aggravates the large variance of object scales and introduces an unbearable computation cost. To keep the benefits of high-resolution images without introducing these problems, we propose the High-Resolution Detection Network (HRDNet). HRDNet takes multiple resolution inputs using multi-depth backbones. To fully exploit these multiple features, we propose the Multi-Depth Image Pyramid Network (MD-IPN) and the Multi-Scale Feature Pyramid Network (MS-FPN) in HRDNet. MD-IPN maintains multiple position information using multiple depth backbones. Specifically, the high-resolution input is fed into a shallow network to preserve more positional information and reduce the computational cost, while the low-resolution input is fed into a deep network to extract more semantics. By extracting various features from high to low resolutions, MD-IPN is able to improve the performance of small object detection while maintaining the performance of middle and large objects. MS-FPN is proposed to align and fuse the multi-scale feature groups generated by MD-IPN, reducing the information imbalance between these multi-scale, multi-level features. Extensive experiments and ablation studies are conducted on the standard benchmarks MS COCO2017 and Pascal VOC2007/2012, as well as on a typical small object dataset, VisDrone 2019. Notably, our proposed HRDNet achieves the state of the art on these datasets, and it performs particularly well on small objects.
1 Introduction
Object detection is challenging while it has widespread applications. With the advances of deep learning, object detection has achieved remarkable progress. According to whether proposals are generated by an independent learning stage or possible locations are sampled directly and densely, object detectors can be classified into two-stage or one-stage models. Compared to two-stage detectors [Cai and Vasconcelos, 2018; Ren et al., 2017], one-stage methods [Lin et al., 2017b; Liu et al., 2016] are less complex and therefore run faster, at some loss of precision. While most existing successful methods are based on the anchor mechanism, recent state-of-the-art methods mostly focus on anchor-free detection, e.g., CornerNet [Law and Deng, 2018], FCOS [Tian et al., 2019], FSAF [Zhu et al., 2019a]. These CNN-based detection methods are very powerful because they can create low-level abstractions of the images, such as lines and circles, and then 'iteratively combine' them into objects, but this is also the reason why they struggle with detecting small objects.
Generally, the object detection algorithms mentioned above can achieve good performance as long as the features extracted from the backbone network are strong enough. Usually, a huge and deep CNN backbone extracts multi-level features, which are then refined with a feature pyramid network (FPN). Most of the time, these detection models benefit from a deeper backbone, while the deeper backbone also introduces more computation cost and memory usage.
Commonly, detection performance is extremely sensitive to the resolution of the input. High-resolution images are more suitable for small object detection because they preserve more details and position information. But high resolution also introduces new problems: (i) it can easily damage the detection of large objects, as shown in Table 1; (ii) detection always needs a deeper network for more powerful semantics, resulting in an unaffordable computing cost. Actually, it is essential to use high-resolution images for small object detection, and also a deeper backbone for small-scale images. But we should deal with the trade-offs between large and small object detection, as well as between high performance and low computational complexity.
To solve these problems, we propose a new architecture, the High-resolution Detection Network (HRDNet). As shown in Figure 1, it includes two parts: the Multi-Depth Image Pyramid Network (MD-IPN) and the Multi-Scale Feature Pyramid Network (MS-FPN). The main idea of HRDNet is to use a deep backbone to process low-resolution images while using a shallow backbone to process high-resolution images. The advantage of extracting features from high-resolution images with a shallow and tiny network has been demonstrated in [Pang et al., 2019a]. With HRDNet, we can not only obtain more details for small objects in high resolution, but also guarantee efficiency and effectiveness by integrating multi-depth and multi-scale deep networks.
Figure 1: The overall structure of HRDNet, composed of MD-IPN and MS-FPN. The input of HRDNet is an image pyramid with N images (N = 3 in this figure) and a decreasing ratio α. The outputs of MD-IPN are N groups of feature pyramids, each with a fixed decreasing ratio. MS-FPN fuses these features into a single feature pyramid, which is used for object detection by the classification and regression heads. The exemplar comes from the VisDrone2019 validation set.
MD-IPN can be regarded as a variant of the image pyramid network with multiple streams, as shown in Figure 1. MD-IPN deals with the trade-offs between large and small object detection, as well as between high performance and low computational complexity. We extract features from the high-resolution image using a shallow backbone network. Because of the weak semantic representation power of the shallow backbone, we also need deep backbones to obtain semantically strong features by feeding in low-resolution images. Thus, the inputs of MD-IPN form an image pyramid with a fixed decreasing ratio of α ∈ [0, 1]. The output of MD-IPN is a series of multi-scale feature groups, and each group contains multi-level feature maps.
The multi-scale feature groups extend the standard feature pyramid by adding multi-scale streams. Therefore, the traditional FPN cannot be directly applied here. To fuse these multi-scale feature groups properly, we propose the Multi-Scale Feature Pyramid Network (MS-FPN). As shown in Figure 2, the information of images propagates not only from high-level features to low-level features inside the multi-level feature pyramid, but also between streams of different scales in MD-IPN.
Before going through the details, we summarize our contributions as follows:
• We comprehensively analyzed the factors that small object detection depends on and the trade-off between performance and efficiency, and proposed a novel high-resolution detection network, HRDNet, considering both the image pyramid and the feature pyramid.
• In HRDNet, we designed the multi-depth and multi-stream module, MD-IPN, to balance the performance between small, middle and large objects. We proposed another new module, MS-FPN, to combine different semantic representations from these multi-scale feature groups.
• Extensive ablation studies validate the effectiveness and efficiency of the proposed approach. Benchmarking on several datasets shows that our approach achieves state-of-the-art performance on object detection, particularly on small object detection. Meanwhile, we hope such practice of small object detection could shed light on other research.
2 Related Work
Object detection is a basic task for many downstream tasks in computer vision. The state-of-the-art detection networks include one-stage models, e.g., RetinaNet [Lin et al., 2017b], YOLOv3 [Farhadi and Redmon, 2018], CenterNet [Duan et al., 2019], FSAF [Zhu et al., 2019a], CornerNet [Law and Deng, 2018], and two-stage models, e.g., Faster R-CNN [Ren et al., 2017], Cascade R-CNN [Cai and Vasconcelos, 2018], etc. Nevertheless, the proposed HRDNet is a more fundamental framework that could serve as the backbone for most of the detection models mentioned above, such as RetinaNet and Cascade R-CNN.
Small object detection
The detection performance on most datasets is largely restricted by small object detection. Therefore, many works specialize in small object detection. For example, [Kisantal et al., 2019] proposed oversampling and copy-pasting small objects to address this problem. Perceptual GAN [Li et al., 2017] generated super-resolved features and stacked them into the feature maps of small objects to enhance the representations. DetNet [Li et al., 2018] maintained the spatial resolution with a large receptive field to improve small object detection. SNIP [Singh and Davis, 2018] resized images to different resolutions and only trained on samples whose scales are close to the ground truth. SNIPER [Singh et al., 2018] used regions around the bounding boxes to remove the influence of background. Unlike these methods, we combine the image pyramid and the feature pyramid together, which not only effectively improves the detection performance on small targets, but also ensures the detection performance on other objects.
High-resolution detection
Some studies have already explored object detection on high-resolution images. [Pang et al., 2019a] proposed a fast tiny-object detection network for high-resolution remote sensing images. [Růžička and Franchetti, 2018] proposed an attention pipeline to achieve fast detection on 4K or 8K videos using YOLO v2 [Redmon and Farhadi, 2017]. However, these works did not fully explore the effect of high-resolution images on small object detection, which is what we concentrate on.
Feature-level imbalance
To capture the semantic information of objects at different scales, multi-level features are commonly used for object detection. However, they suffer from serious feature-level imbalance because they convey different semantic information. Feature Pyramid Network (FPN) [Lin et al., 2017a] introduced a top-down pathway to transmit semantic information, alleviating the feature imbalance problem to some degree. Based on FPN, PANet [Liu et al., 2018a] involved a bottom-up path to enrich the location information of deep layers. The authors of Libra R-CNN [Pang and Chen, 2019] revealed and tried to deal with the sample-level, feature-level, and objective-level imbalance issues. Pang et al. [Pang et al., 2019b] proposed a lightweight module to produce featurized image pyramid features to augment the output feature pyramid. However, these methods only focus on multi-level features. Here, we solve the feature-level imbalance from a new angle: we propose a new module, Multi-Scale FPN, to address the imbalance not only across multi-level features but also across multi-scale feature groups.
3 HRDNet
Obviously, high-resolution images are important for small object detection. Unfortunately, high-resolution images introduce unaffordable computation costs to deep networks. At the same time, high-resolution images aggravate the variance of object scales, worsening the performance on large objects, as shown in Table 1. To balance computation costs and the variance of object scales while keeping the performance on all classes, we propose the High-Resolution Detection Network (HRDNet). HRDNet is a general concept that is compatible with any alternative detection method.
More specifically, HRDNet is designed with two novel modules, the Multi-Depth Image Pyramid Network (MD-IPN) and the Multi-Scale Feature Pyramid Network (MS-FPN). In MD-IPN, an image pyramid is processed by backbones with different depths, i.e., deep CNNs for the low-resolution images and shallow CNNs for the high-resolution images, as shown in Figure 1. After that, to fuse the multi-scale groups of multi-level features from MD-IPN, MS-FPN is proposed as a more reasonable feature pyramid architecture (Figure 2).
3.1 Multi-Depth Image Pyramid Network (MD-IPN)
The MD-IPN is composed of N independent backbones of various depths that process the image pyramid. We term each backbone a stream. HRDNet can be generalized to more streams, but to better illustrate the main idea, we mainly discuss the two-stream and three-stream HRDNet. Figure 1 presents an example of a three-stream HRDNet. Given an image I with resolution R, the high-resolution image (I_0 with resolution R) is processed by a stream of shallow CNN (S_0), while the lower-resolution images (I_1 and I_2 with resolutions αR and α²R, α = 0.5) are processed by streams of deeper CNNs (S_1 and S_2). Generally, we can build an image pyramid network with N independent parallel streams, S_i, i ∈ {0, 1, 2, ..., N−1}. We use {I_i}_{i=0}^{N−1} to represent the input images with different resolutions given the original image I with the highest resolution. The outputs of the multi-scale image pyramid are N feature groups {G_i}_{i=0}^{N−1}. Each group G_i contains a set of multi-level features {F_{i,j}}, where i ∈ {0, 1, ..., N−1} is the multi-scale index and j ∈ {0, 1, ..., M−1} is the multi-level index. For example, in Figure 1 the values of N and M are 3 and 4, respectively, and the relation can be formulated as
$$G_i = S_i(I_i) = \{F_{i,0}, F_{i,1}, F_{i,2}, F_{i,3}\}, \quad i \in \{0, 1, \dots, N-1\}. \tag{1}$$
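To make the multi-stream idea concrete, a minimal PyTorch-style sketch of MD-IPN is given below. It is our own illustration rather than the authors' released code: the torchvision ResNet variants, the `ResNetStream`/`MDIPN` names, and the use of bilinear interpolation for the pyramid are assumptions made for the example (the paper's experiments use combinations such as ResNet10+18 or ResNet18+101).

```python
# Illustrative sketch of MD-IPN (not the authors' code): stream 0 is the
# shallowest backbone and sees the full-resolution image; deeper streams see
# images down-scaled by alpha, alpha**2, ...
import torch.nn.functional as F
from torch import nn
import torchvision


class ResNetStream(nn.Module):
    """Wraps a torchvision ResNet so it returns its four stage outputs (M = 4)."""

    def __init__(self, name: str = "resnet18"):
        super().__init__()
        net = getattr(torchvision.models, name)()  # random init, any torchvision version
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [F_{i,0}, F_{i,1}, F_{i,2}, F_{i,3}]


class MDIPN(nn.Module):
    """Multi-Depth Image Pyramid Network: N streams, ordered shallow -> deep."""

    def __init__(self, names=("resnet18", "resnet50", "resnet101"), alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.streams = nn.ModuleList(ResNetStream(n) for n in names)

    def forward(self, image):
        groups = []
        for i, stream in enumerate(self.streams):
            scale = self.alpha ** i  # stream 0 keeps the full resolution
            x = image if scale == 1.0 else F.interpolate(
                image, scale_factor=scale, mode="bilinear", align_corners=False)
            groups.append(stream(x))  # G_i = S_i(I_i), see Eq. (1)
        return groups  # N groups of M multi-level feature maps
```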
3.2 Multi-Scale Feature Pyramid Network (MS-FPN)
The feature pyramid network (FPN) is one of the key components of most object detection algorithms. It combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections. In our proposed HRDNet, the MD-IPN generates multi-scale (different resolution) and multi-level (different hierarchy) features. To deal with this multi-scale hierarchy of features, we propose the Multi-Scale FPN (MS-FPN). Different from FPN, semantic information propagates not only from high-level features to low-level features but also from the deep stream (low resolution) to the shallow stream (high resolution). Therefore, there are two directions in the computation of the multi-scale FPN. The basic operation in the multi-scale FPN is the same as in the traditional FPN, i.e., 1×1 convolution, 2× up-sampling and summation.
Figure 2: The details of MS-FPN, showing a feature pyramid with three streams and four levels. The horizontal orange bar indicates the depth of each stream S_i, and the vertical orange bar indicates the depth of a single backbone. Better viewed in color and zoomed in.
In this way, the highest-resolution feature not only maintains the high resolution needed for small object detection but also incorporates semantically strong features from the multi-scale streams. Our MS-FPN can be formulated as
$$F_{i,j} = \begin{cases} \mathrm{Conv}(F_{i,j}) + \mathrm{Up}(F_{i,j-1}), & i = N-1 \\ \mathrm{Conv}(F_{i,j}) + \mathrm{Up}(F_{i,j-1}) + \mathrm{Up}(F_{i+1,j}), & i \neq N-1 \end{cases} \tag{2}$$
where F_{i,j} is the feature at level j of stream i in Figure 2, Up(·) is 2× up-sampling, and Conv(·) is a 1×1 convolution. Finally, MS-FPN outputs the final feature group G' = {F'_0, F'_1, ..., F'_i, ...}, where F'_i is calculated by
$$F'_i = \mathrm{Conv}(F_{0,i}), \tag{3}$$
where F_{0,i} are the features in group G_0, i.e., the outputs of the highest-resolution stream.
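The sketch below is one plausible reading of Eqs. (2)-(3), written in the same PyTorch style as the MD-IPN sketch above; the class name `MSFPN`, the lazy 1×1 lateral convolutions, the nearest-neighbour up-sampling and the fine-to-coarse level ordering are our assumptions, and the paper's own index convention may differ.

```python
# One plausible reading of Eqs. (2)-(3) (our own sketch, not the released code).
# Streams are fused from the deepest (i = N-1) to the shallowest (i = 0) and,
# inside a stream, from the coarsest level to the finest one.
import torch.nn.functional as F
from torch import nn


class MSFPN(nn.Module):
    def __init__(self, num_streams=3, num_levels=4, out_channels=256):
        super().__init__()
        # One 1x1 lateral convolution per (stream, level); LazyConv2d (PyTorch >= 1.8)
        # infers the input channel count on the first forward pass.
        self.lateral = nn.ModuleList(
            nn.ModuleList(nn.LazyConv2d(out_channels, kernel_size=1)
                          for _ in range(num_levels))
            for _ in range(num_streams))
        # Mirrors Eq. (3): a final 1x1 convolution on the highest-resolution stream.
        self.out_conv = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=1)
            for _ in range(num_levels))
        self.num_streams, self.num_levels = num_streams, num_levels

    @staticmethod
    def up(x, like):
        # 2x up-sampling in practice; implemented as size matching for safety.
        return F.interpolate(x, size=like.shape[-2:], mode="nearest")

    def forward(self, groups):
        # groups[i][j] is F_{i,j}: stream i, level j (fine -> coarse in j).
        fused = [[None] * self.num_levels for _ in range(self.num_streams)]
        for i in range(self.num_streams - 1, -1, -1):        # deep -> shallow stream
            for j in range(self.num_levels - 1, -1, -1):     # coarse -> fine level
                f = self.lateral[i][j](groups[i][j])          # Conv(F_{i,j})
                if j + 1 < self.num_levels:                   # in-stream top-down term
                    f = f + self.up(fused[i][j + 1], f)
                if i + 1 < self.num_streams:                  # deeper-stream term
                    f = f + self.up(fused[i + 1][j], f)
                fused[i][j] = f
        # The final pyramid G' is taken from stream 0 (highest resolution), Eq. (3).
        return [conv(feat) for conv, feat in zip(self.out_conv, fused[0])]
```

In a full detector, the returned pyramid would feed the classification and regression heads of an existing framework (e.g., RetinaNet or Cascade R-CNN), as sketched in Figure 1.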
4 Experiments
4.1 Experiment Details
Datasets
We conduct experiments on a typical small object detection dataset, VisDrone2019 [Zhu et al., 2019b], as well as on the traditional datasets MS COCO2017 [Lin et al., 2014] and Pascal VOC2007/2012 [Everingham et al., 2010]. The VisDrone2019 dataset is collected by the AISKYEYE team and consists of 288 video clips formed by 261,908 frames and 10,209 static images, covering a wide range of locations, environments, objects, and densities. The resolution of VisDrone2019 is higher than that of COCO, as mentioned in Section 1, ranging from 960 to 1360 pixels. MS COCO and Pascal VOC are the most common benchmarks for object detection. Following common practice, we trained our model on the COCO training set and tested it on the COCO validation set. For Pascal VOC, we trained our model with all the training and validation data from both Pascal VOC 2007 and 2012, and tested on the Pascal VOC 2007 test set. In the COCO and Pascal VOC datasets, most images are a few hundred pixels in resolution and are resized to a fixed size in the training stage, whereas images in the VisDrone2019 [Du et al., 2019] dataset are considerably larger. As shown in Figure 4, compared to MS COCO there are more objects in VisDrone2019 and nearly all of them are very small, which makes it more challenging.
Figure 4: Exemplar images from the COCO2017 and VisDrone2019 validation sets.
Training
We followed the common practice in mmdetection [Chen et al., 2019]. We trained the models on VisDrone2019 with four Nvidia GPUs and on COCO with eight Nvidia GPUs. We use the SGD optimizer with a mini-batch on each GPU. The learning rate starts from 0.02 and is decayed twice during training; weight decay and a linear warm-up strategy follow the common mmdetection practice. The image pyramid is obtained by linear interpolation, and the resolution decreasing ratio α is 0.5.
In order to fit the high-resolution images from VisDrone2019 into GPU memory, we equally cropped each original image in the VisDrone2019 training set into four non-overlapping patch images. In this way, we obtained a new training set with such cropped images.
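A minimal sketch of the two pre-processing steps just described, the four-way non-overlapping crop of a VisDrone training image and the α = 0.5 image pyramid built by interpolation, might look as follows; the function names and the use of Pillow are our own choices for illustration.

```python
# Illustrative pre-processing sketch (our own naming, not the released code).
from PIL import Image


def crop_into_four_patches(img: Image.Image):
    """Split one training image into four equal, non-overlapping patches."""
    w, h = img.size
    boxes = [(0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
             (0, h // 2, w // 2, h), (w // 2, h // 2, w, h)]
    return [img.crop(b) for b in boxes]


def build_image_pyramid(img: Image.Image, num_streams: int = 3, alpha: float = 0.5):
    """I_0 keeps the original resolution; I_i is down-scaled by alpha**i."""
    pyramid = []
    for i in range(num_streams):
        s = alpha ** i
        size = (max(1, round(img.width * s)), max(1, round(img.height * s)))
        pyramid.append(img if i == 0 else img.resize(size, Image.BILINEAR))
    return pyramid
```

Ground-truth boxes would of course have to be clipped and re-assigned to the patch they fall into; that bookkeeping is omitted here.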
Inference
The same resolution as in training is used for inference. Standard NMS with a fixed IoU threshold and a confidence score threshold is applied to the predictions. Unless otherwise emphasized, the multi-scale test in our experiments uses three scales.
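A hedged sketch of such a three-scale test is shown below; the particular scale factors and thresholds are placeholders rather than the values used in the paper, and the `detector` interface is assumed for illustration.

```python
# Our own sketch of a three-scale test: run the detector on resized copies of
# the image, map boxes back to the original resolution, and merge with one NMS.
import torch
from torchvision.ops import nms


def multi_scale_test(detector, image, scales=(0.75, 1.0, 1.25),
                     iou_thr=0.5, score_thr=0.05):
    # `detector(image, scale)` is assumed to return (boxes[N, 4], scores[N])
    # expressed in the coordinates of the resized image.
    all_boxes, all_scores = [], []
    for s in scales:
        boxes, scores = detector(image, s)
        all_boxes.append(boxes / s)        # back to original image coordinates
        all_scores.append(scores)
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = scores > score_thr              # drop low-confidence predictions
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores, iou_thr)     # joint NMS over all scales
    return boxes[keep], scores[keep]
```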
style | AP | AP50 | AP75
simple FPN | 28.8 | 49.5 | 28.8
aligned by resolution | 28.7 | 49.6 | 28.7
aligned by depth | - | - | -

Table 2: The comparison of three different styles of MS-FPN (backbone ResNet10+18).

model | backbone | resolution | params | speed | AP50
Cascade | ResNet18 | 1333 | 56.11M | 9.9 | 36.1
Cascade | ResNet18 | 2666 | 56.11M | 5.4 | 40.3
Cascade | ResNet18 | 3800 | 56.11M | 2.9 | 42.6
HRDNet | ResNet10+18 | 3800 | 62.44M | 3.7 | 49.2
HRDNet | ResNet18+101 | 3800 | 100.78M | 2.8 | 53.3
HRDNet | ResNeXt50+101 | 3800 | 152.22M | 1.5 | 55.2

Table 3: The speed (items/second) and the number of parameters (M), obtained on the same machine with one Nvidia GTX 2080Ti GPU and an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz. The HRDNet here is a two-stream version, using MS-FPN aligned with depth.
The effect of image resolution for small object detection
Extensive ablation studies on the VisDrone2019 dataset are conducted to illustrate the effect of the input image resolution on detection performance. Table 1 shows that detection performance improves significantly as the image resolution increases: higher resolution leads to better performance under the same experimental settings. The performance on small objects shows the most significant improvement from HRDNet. What is more, HRDNet performs much better than the state-of-the-art Cascade R-CNN with the same input resolution.
Table 1: Performance comparison of Cascade R-CNN and HRDNet with different input resolutions on VisDrone2019 (per-category AP for pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus and motor, plus mAP). The HRDNet here is a two-stream version, and † means that it is trained on patch images, as mentioned in the Subsection 4.1 experiment details.
Interestingly, when the input resolution increases, the single-backbone model, i.e., Cascade R-CNN, suffers a dramatic decrease (1.1-7.6%) for categories with relatively large size, i.e., truck, awning-tricycle and bus. On the contrary, a significant performance increase (1-5.2%) can be observed for HRDNet. Simply increasing the image resolution without considering the severe variance of object scales is not the ideal solution for object detection, let alone small object detection.

model | backbone | resolution | AP | AP50 | AP75
Single Backbone | ResNeXt50 | 3800 | 32.7 | 54.6 | 33.6
Single Backbone | ResNeXt101 | 1900 | 30.4 | 51.0 | 31.1
Model Ensemble | ResNeXt50+101 | 3800+1900 | 32.9 | 55.1 | 33.5
HRDNet | ResNeXt50+101 | 3800+1900 | - | - | -

Table 4: The comparison of HRDNet and the model ensemble. The models here follow the design of Cascade R-CNN.

Explore the optimal image resolution
We have stated and shown in several experiments that the image resolution is important for small object detection; however, is it true that higher resolution always leads to better performance? Is there an optimal resolution for detection? In this part, we present the effect of the image resolution on object detection. Figure 3 shows the change of the Average Precision (AP, AP50, AP75) with different resolutions. The resolution (long edge) is increased in strides of 400. Finally, HRDNet achieves its best performance at the optimal resolution identified in Figure 3.
Figure 3: The change of AP, AP50 and AP75 over different input resolutions. The HRDNet used here is a two-stream version with a ResNet18+101 backbone. The training details follow Subsection 4.1.
How to design the multi-scale FPN
As mentioned above, MS-FPN is designed to fuse multi-scale feature groups. Here, we compare three different styles, namely the simple FPN, the multi-scale FPN aligned by depth, and the multi-scale FPN aligned by resolution, as shown in Figure 5, to demonstrate MS-FPN's advantage. The simple FPN applies a standard FPN to each multi-scale group of HRDNet and finally fuses the results of each FPN. For the multi-scale FPN, there are new connections between the streams, as shown in Figure 2. We conducted two groups of experiments, with a ResNet10+18 backbone and a ResNet18+101 backbone. The first experiment in Table 2 shows that the multi-scale FPN is better than the simple FPN. Both experiments demonstrate that the MS-FPN aligned with depth performs better than the one aligned with resolution. Therefore, we adopt the MS-FPN aligned with depth in our architecture.
Figure 5: Comparison of the simple FPN and the multi-scale FPN aligned with depth or with resolution. Each column is one stream in MD-IPN, and each row corresponds to the depth of the backbone. The blue blocks represent features that have already been fused, while gray blocks are those waiting to be fused. The red arrows represent the basic fusing operation described in Subsection 3.2.
Efficient and Effective HRDNet
HRDNet is a multi-stream network, so there may be some concerns about its model size and running speed. Here, we report the number of parameters and the running speed of HRDNet, compared with the state-of-the-art single-backbone baseline. The results shown in Table 3 demonstrate that HRDNet achieves much better performance with a similar number of parameters and an even faster running speed.
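For reference, parameter counts and throughput (items/second) of the kind reported in Table 3 can be measured with a simple loop such as the one below; this is our own measurement sketch, not the benchmarking script used for the paper, and it assumes a CUDA device.

```python
# Our own sketch of how parameter count and throughput might be measured.
import time
import torch


def count_parameters(model):
    # Number of parameters in millions (the unit used in Table 3).
    return sum(p.numel() for p in model.parameters()) / 1e6


@torch.no_grad()
def items_per_second(model, inputs, warmup=5, iters=20):
    model.eval()
    for _ in range(warmup):            # warm up CUDA kernels and caches
        model(inputs)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(inputs)
    torch.cuda.synchronize()
    return iters * inputs.shape[0] / (time.time() - start)
```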
Comparison with a single-backbone model ensemble
To further demonstrate that the performance improvement of HRDNet does not come merely from more parameters, we compared the two-stream HRDNet with an ensemble of two single-backbone models under the same experimental setting (Table 4). The ensemble models fuse the predicted bounding boxes and scores before NMS (Non-Maximum Suppression) and then perform NMS together. We found that the single-backbone models with high-resolution input always perform better than those with low-resolution input, even when the latter are processed by a stronger backbone. HRDNet performs better than the ensemble model, thanks to the novel multi-scale and multi-level fusion method. These results further prove that our designed MS-FPN is essential for HRDNet.

model | backbone | AP | AP50 | AP75 | AP_S | AP_M | AP_L
R-FCN [Dai et al., 2016] | ResNet-101 | 29.9 | 51.9 | - | 10.8 | 32.8 | 45.0
Faster R-CNN w FPN [Lin et al., 2017a] | ResNet-101 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2
DeNet-101 (wide) [Ghodrati et al., 2015] | ResNet-101 | 33.8 | 53.4 | 36.1 | 12.3 | 36.1 | 50.8
CoupleNet [Zhu et al., 2017] | ResNet-101 | 34.4 | 54.8 | 37.2 | 13.4 | 38.1 | 50.8
Mask R-CNN [He et al., 2017] | ResNeXt-101 | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2
Cascade R-CNN [Cai and Vasconcelos, 2018] | ResNet-101 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2
SNIP++ [Singh and Davis, 2018] | ResNet-101 | 43.4 | 65.5 | 48.4 | 27.2 | 46.5 | 54.9
SNIPER (2 scale) [Singh et al., 2018] | ResNet-101 | 43.3 | 63.7 | 48.6 | 27.1 | 44.7 | 56.1
Grid R-CNN [Lu et al., 2019] | ResNeXt-101 | 43.2 | 63.0 | 46.6 | 25.1 | 46.5 | 55.2
SSD512 [Liu et al., 2016] | VGG-16 | 28.8 | 48.5 | 30.3 | 10.9 | 31.8 | 43.5
RetinaNet800++ [Lin et al., 2017b] | ResNet-101 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2
RefineDet512 [Zhang et al., 2018] | ResNet-101 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4
M2Det800 | VGG-16 | 41.0 | 59.7 | 45.0 | 22.1 | 46.5 | 53.8
CornerNet511 [Law and Deng, 2018] | Hourglass-104 | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9
FCOS [Tian et al., 2019] | ResNeXt-101 | 42.1 | 62.1 | 45.2 | 25.6 | 44.9 | 52.0
FSAF [Zhu et al., 2019a] | ResNeXt-101 | 42.9 | 63.8 | 46.3 | 26.6 | 46.2 | 52.7
CenterNet511 [Duan et al., 2019] | Hourglass-104 | 44.9 | 62.4 | 48.1 | 25.6 | 47.4 | 57.4
HRDNet++ | ResNet101+152 | - | - | - | - | - | -

Table 5: The state-of-the-art performance on MS COCO test-dev. The input resolution of the HRDNet ResNet101 stream is the same as for the other models above, while the input of the ResNet152 stream is a 2× smaller image. '++' denotes that the inference is performed with multi-scales etc.

model | backbone | resolution | AP | AP50 | AP75 | AR1 | AR10 | AR100 | AR500
†Cascade R-CNN [Cai and Vasconcelos, 2018] | ResNet50 | 2666 | 24.10 | 42.90 | 23.60 | 0.40 | 2.30 | 21.00 | 35.20
†Faster R-CNN [Ren et al., 2017] | ResNet50 | 2666 | 23.50 | 43.70 | 22.20 | 0.34 | 2.20 | 18.30 | 35.70
†RetinaNet [Lin et al., 2017b] | ResNet50 | 2666 | 15.10 | 27.70 | 14.30 | 0.17 | 1.30 | 24.60 | 25.80
†FCOS [Tian et al., 2019] | ResNet50 | 2666 | 16.60 | 28.80 | 16.70 | 0.38 | 2.20 | 24.40 | 24.40
HFEA [Zhang et al., 2019a] | ResNeXt101 | - | 27.10 | - | - | - | - | - | -
HFEA [Zhang et al., 2019a] | ResNeXt152 | - | 30.30 | - | - | - | - | - | -
DSOD [Zhang et al., 2019b] | ResNet50 | - | 28.80 | 47.10 | 29.30 | - | - | - | -
†HRDNet | ResNet10+18 | 3800 | 28.68 | 49.15 | 28.90 | 0.48 | 3.42 | 37.47 | 37.47
†HRDNet | ResNet18+101 | 2666 | 28.33 | 49.25 | 28.16 | 0.47 | 3.34 | 36.91 | 36.91
†HRDNet | ResNet18+101 | 3800 | 31.39 | 53.29 | 31.63 | 0.49 | 3.55 | 40.45 | 40.45
†HRDNet++ | ResNet50+101+152 | 3800 | 34.35 | 56.65 | 35.51 | 0.53 | 4.00 | 43.24 | 43.25
†HRDNet++ | ResNeXt50+101 | 3800 | - | - | - | - | - | - | -

Table 6: The comparison with state-of-the-art object detection models on the VisDrone2019 DET validation set. For DSOD, we only show their results without model ensemble. We only list results trained on the VisDrone2019 train set. Results with † are trained and tested with the same environment and base code. '++' denotes that the inference is performed with multi-scales.
VisDrone2019
To demonstrate the advantage of our model and technical criterion, we also compare HRDNet with the most popular and state-of-the-art methods. Table 6 shows that our proposed HRDNet achieves the best performance on the VisDrone2019 DET validation set. Notably, our model obtains more than 3.0% AP improvement with ResNeXt50+101 compared to HFEA using ResNet152 as its backbone.
COCO2017
Besides the experiments on VisDrone2019, we also conduct experiments on the COCO2017 test set to prove that our method works well on a larger-scale, complicated and standard detection dataset. Table 5 shows that HRDNet achieves state-of-the-art results, with a clear AP_S improvement compared with most recent models.
Pascal VOC2007/2012
There are not many small objects in Pascal VOC. We conducted experiments on this dataset to demonstrate that HRDNet not only improves small object detection but also keeps the performance for large objects (Table 7).
Table 7: The state-of-the-art performance on the Pascal VOC 2007 test set (mAP with various backbones and input sizes).
5 Conclusion
Merely increasing the image resolution without modifications will relatively damage the performance on large objects. Moreover, the severe variance of object scales further limits the benefit from high-resolution images. Motivated by this, we propose a new detection network for small objects, HRDNet. To handle these issues well, we further design MD-IPN and MS-FPN. HRDNet achieves the state of the art on the small object detection dataset VisDrone2019, and at the same time it also performs strongly on other benchmarks, i.e., MS COCO and Pascal VOC.
References
[Cai and Vasconcelos, 2018] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
[Chen et al., 2019] Kai Chen, Jiaqi Wang, Jiangmiao Pang, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv:1906.07155, 2019.
[Dai et al., 2016] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, pages 379–387, 2016.
[Dai et al., 2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
[Du et al., 2019] Dawei Du, Pengfei Zhu, et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In ICCV, 2019.
[Duan et al., 2019] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In ICCV, pages 6569–6578, 2019.
[Everingham et al., 2010] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[Farhadi and Redmon, 2018] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[Ghodrati et al., 2015] Amir Ghodrati, Ali Diba, Marco Pedersoli, Tinne Tuytelaars, and Luc Van Gool. DeepProposal: Hunting objects by cascading deep convolutional layers. In ICCV, 2015.
[He et al., 2016] Kaiming He, Xiangyu Zhang, et al. Deep residual learning for image recognition. In CVPR, 2016.
[He et al., 2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[Kisantal et al., 2019] Mate Kisantal, Zbigniew Wojna, Jakub Murawski, Jacek Naruniec, and Kyunghyun Cho. Augmentation for small object detection. arXiv preprint arXiv:1902.07296, 2019.
[Kong et al., 2016] Tao Kong, Anbang Yao, Yurong Chen, and Fuchun Sun. HyperNet: Towards accurate region proposal generation and joint object detection. In CVPR, 2016.
[Kong et al., 2018] Tao Kong, Fuchun Sun, Wenbing Huang, and Huaping Liu. Deep feature pyramid reconfiguration for object detection. Lecture Notes in Computer Science, 2018.
[Law and Deng, 2018] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, pages 765–781, 2018.
[Li et al., 2017] Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, and Shuicheng Yan. Perceptual generative adversarial networks for small object detection. In CVPR, 2017.
[Li et al., 2018] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: A backbone network for object detection. arXiv preprint arXiv:1804.06215, 2018.
[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, et al. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[Lin et al., 2017a] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[Lin et al., 2017b] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[Liu et al., 2016] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[Liu et al., 2018a] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, pages 8759–8768, 2018.
[Liu et al., 2018b] Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. Lecture Notes in Computer Science, pages 404–419, 2018.
[Lu et al., 2019] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and Junjie Yan. Grid R-CNN. In CVPR, pages 7363–7372, 2019.
[Pang and Chen, 2019] Jiangmiao Pang and Kai Chen. Libra R-CNN: Towards balanced learning for object detection. In CVPR, pages 821–830, 2019.
[Pang et al., 2019a] J. Pang, C. Li, J. Shi, Z. Xu, and H. Feng. R2-CNN: Fast tiny object detection in large-scale remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 57(8):5512–5524, Aug 2019.
[Pang et al., 2019b] Yanwei Pang, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Efficient featurized image pyramid network for single shot detector. In CVPR, 2019.
[Redmon and Farhadi, 2017] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
[Ren et al., 2017] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE TPAMI, 39(6):1137–1149, 2017.
[Růžička and Franchetti, 2018] V. Růžička and F. Franchetti. Fast and accurate object detection in high resolution 4K and 8K video using GPUs. Sep. 2018.
[Shrivastava et al., 2016] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
[Singh and Davis, 2018] Bharat Singh and Larry S. Davis. An analysis of scale invariance in object detection - SNIP. In CVPR, 2018.
[Singh et al., 2018] Bharat Singh, Mahyar Najibi, et al. SNIPER: Efficient multi-scale training. In NeurIPS, 2018.
[Tian et al., 2019] Zhi Tian, Chunhua Shen, et al. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
[Zhang et al., 2018] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
[Zhang et al., 2019a] Junyi Zhang, Junying Huang, Xuankun Chen, and Dongyu Zhang. How to fully exploit the abilities of aerial image detectors. In ICCV, 2019.
[Zhang et al., 2019b] Xindi Zhang, Ebroul Izquierdo, and Krishna Chandramouli. Dense and small object detection in UAV vision based on cascade network. In ICCV, 2019.
[Zhou et al., 2019] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[Zhu et al., 2017] Yousong Zhu, Chaoyang Zhao, Jinqiao Wang, Xu Zhao, Yi Wu, and Hanqing Lu. CoupleNet: Coupling global structure with local parts for object detection. In ICCV, 2017.
[Zhu et al., 2019a] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In CVPR, pages 840–849, 2019.
[Zhu et al., 2019b] Pengfei Zhu, Dawei Du, et al. VisDrone-VID2019: The vision meets drone object detection in video challenge results.