SM+: Refined Scale Match for Tiny Person Detection
Nan Jiang, Xuehui Yu, Xiaoke Peng, Yuqi Gong and Zhenjun Han
University of Chinese Academy of Sciences, Beijing, China
ABSTRACT
Detecting tiny objects (e.g., less than 20 × 20 pixels) in large-scale images is an important yet open problem. Modern CNN-based detectors are challenged by the scale mismatch between the dataset for network pre-training and the target dataset for detector training. In this paper, we investigate the scale alignment between pre-training and target datasets, and propose a new refined Scale Match method (termed SM+) for tiny person detection. SM+ improves the scale match from image level to instance level, and effectively promotes the similarity between the pre-training and target datasets. Moreover, considering that SM+ possibly destroys the image structure, a new probabilistic structure inpainting (PSI) method is proposed for background processing. Experiments conducted across various detectors show that SM+ noticeably improves the performance on TinyPerson, and outperforms the state-of-the-art detectors by a significant margin.

Index Terms — tiny object detection, pre-training strategy
1. INTRODUCTION
Person detection is an important topic in the computer vision area. It has wide applications including surveillance [1][2], driving assistance [3] and maritime quick rescue [4], etc.
The research on detectors [5][6][7][8][9] has achieved significant progress with the rapid development of data-driven deep convolutional neural networks (CNNs). However, the detectors perform poorly when detecting tiny objects with few pixels (e.g., less than 20 × 20 pixels), such as traffic signs [10] and persons in aerial images [4].

To better exploit CNN-based detectors, a large number of manually annotated person datasets [11][12][13] for detection have been proposed and made publicly available. However, datasets for specific object detection tasks, such as tiny person detection [4], are not as large as other counterparts [14][12], due to the cost of collecting and annotating the data. With insufficient data for a specific application, an alternative way is to pre-train a model on extra-large datasets (e.g., ImageNet [14], COCO [12]), and then fine-tune the model on the task-specific dataset.
The corresponding author is Zhenjun Han.
Fig. 1. The illustration of the difference between image-level SM and instance-level SM+. While SM only considers the whole image, SM+ focuses on every instance. The instance-level approach achieves scale match at a finer level, and mainly consists of four steps: (1) Separation, (2) Instance processing, (3) Background processing, and (4) Combination.

However, a new question arises: could we take better advantage of existing large datasets for a task-specific application, particularly when object sizes significantly differ between the datasets? The SM algorithm [4], with its variants Random Scale Match (RSM) and Monotone Scale Match (MSM), gave simple yet effective answers. With a sampled scale factor, the SM algorithm directly resizes the images and aligns the scale distribution of the pre-training dataset to that of the target dataset. The SM algorithm, with its image-level scaling, is merely a simple approximation for scale match: it regards the average size of all objects in an image as the size of that image, although the image may contain many labeled objects at multiple scales.

In this paper, we propose a newly refined SM method (SM+), in which we transform the scale distribution of the pre-training dataset by instance-level scaling instead of resizing the whole image. Intuitively, compared with the vanilla SM algorithm, SM+ performs finer-scale scaling and alleviates the uncertainty and inaccuracy caused by the approximation introduced in the SM algorithm. The differences between SM and SM+ are illustrated in Fig. 1. The SM+ algorithm separates each image into two parts: the annotated instances and their background. The instances are utilized for the instance-level scale match, while the background is left with instance-shaped holes. However, the traditional method [15] blurs the images by directly inpainting the holes and generates unrealistic images, leading to a performance drop of the pre-trained model. To solve this problem, a probabilistic structure inpainting (PSI) method is further proposed to dynamically inpaint the images, suppressing image blur and preserving context consistency around the holes. Compared with state-of-the-art detectors on TinyPerson, the SM+ algorithm leads to significant performance improvement in AP.

The main contributions of our work include:
1. We comprehensively analyze the scale information of TinyPerson, and propose a new refined scale match method, dubbed SM+, which achieves better scale distribution alignment by finer-scale scaling.
2. We propose probabilistic structure inpainting (PSI) for the SM+ algorithm; PSI effectively inpaints the background images.
3. The proposed SM+ algorithm improves the detection performance over the state-of-the-art detectors by a large margin. Codes will be available upon acceptance.
2. METHODOLOGY

2.1. Scale Match
We define the object size as the square root of its area: $s(G_{ij}) = \sqrt{w_{ij} h_{ij}}$, where $G_{ij}$ denotes the $j$-th bounding box of the $i$-th image $I_i$, and $w_{ij}$, $h_{ij}$ are the width and height of the bounding box, respectively.

Given an extra dataset $E$ whose probability density function of object size $s$ is $P_{size}(s; E)$, and a target dataset $D$ whose probability density function is $P_{size}(s; D)$, our goal is to apply a scale transformation $T$ on $E$ such that their probability distributions of object size are well matched. This corresponds to

$$P_{size}(s; T(E)) \approx P_{size}(s; D). \quad (1)$$

The image-level method leaves lots of room for improvement. To this end, we propose the refined scale match (SM+), which focuses on instance-level scale match and achieves more desirable results than the image-level match [4].
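For concreteness, a minimal Python sketch of the size definition and the empirical size distribution it induces is given below; the bin edges and box lists are toy values for illustration, not the ones used in our experiments.

```python
import numpy as np

def object_sizes(boxes):
    """Object sizes s(G_ij) = sqrt(w_ij * h_ij) for an array of (w, h) pairs."""
    wh = np.asarray(boxes, dtype=float)
    return np.sqrt(wh[:, 0] * wh[:, 1])

def size_distribution(boxes, bin_edges):
    """Normalized histogram approximating P_size(s; .) over the given bins."""
    counts, _ = np.histogram(object_sizes(boxes), bins=bin_edges)
    return counts / counts.sum()

# Toy example: scale match seeks a transform T on the extra dataset E such
# that P_size(s; T(E)) approximates P_size(s; D) of the target dataset D.
edges = np.array([2.0, 8.0, 12.0, 20.0, 32.0])
p_E = size_distribution([(64, 128), (30, 60), (15, 25)], edges)
p_D = size_distribution([(6, 10), (4, 7), (12, 18)], edges)
```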
2.2. Refined Scale Match (SM+)

The whole procedure of SM+ is shown in Fig. 1. In the following, we present the details of each part.
Part I. Extraction and Separation:
The pre-training dataset requires ground-truth annotations for instance segmentation. According to the mask annotation, each picture participating in training is separated into background and foreground. To obtain a finer foreground, we adopt the matting method [16] to make the outlines of instances smoother; since mask annotations are stored as boundary points and edges, using them directly makes the outline of the foreground jagged. After separation, we get a proper instance mask and an incomplete background, and the two parts are then processed separately.
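A minimal sketch of this separation step is given below, assuming binary instance masks rasterized from the polygon annotations are already available; the Gaussian feathering only stands in for the global-sampling matting of [16].

```python
import numpy as np
import cv2  # OpenCV, used for mask feathering

def separate_instance(image, mask, blur_ksize=5):
    """Split an image into an instance layer and a holed background.

    image: H x W x 3 uint8 array.
    mask:  H x W binary array rasterized from the polygon annotation.
    Returns (instance_rgba, background): the instance carries a soft alpha
    channel, and the background has an instance-shaped hole zeroed out.
    """
    # Feather the jagged polygon boundary; this Gaussian softening only
    # approximates the matting method of [16].
    alpha = cv2.GaussianBlur(mask.astype(np.float32), (blur_ksize, blur_ksize), 0)
    instance_rgba = np.dstack([image.astype(np.float32), alpha * 255.0])
    background = image.copy()
    background[mask > 0] = 0  # hole to be filled by PSI in Part III
    return instance_rgba, background
```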
Fig. 2. Background based on inpainting (top) vs. background based on new sampling (bottom). The inpainting method might not repair some artifacts, but changing the background does not cause this problem. (Best viewed in color.)
Part II. Instance Scale Histogram Match:
On the ground of the target dataset annotations, a discrete scale histogram $H$ is established to approximate the scale probability density function of the target dataset, $P_{size}(s; D_{train})$; the histogram is rectified to pay less attention to the long-tail part of the scale distribution. In $H$, $K$ represents the number of bins, and $R[k]^-$ and $R[k]^+$ are the lower and upper size boundaries of the $k$-th bin. For every separated instance, we use the size of the corresponding bounding box as its scale representation $s$. First, we sample a bin index $k$ with respect to the probabilities of $H$. Then we sample a target scale $\hat{s}$ from a uniform distribution whose min and max equal $R[k]^-$ and $R[k]^+$, respectively. Finally, we transform the instance according to the ratio of $\hat{s}$ to $s$. The transformation can be expressed by the affine matrix

$$A = \begin{bmatrix} r & 0 & t_x \\ 0 & r & t_y \end{bmatrix}, \quad (2)$$

where $r$ denotes the scale ratio, and $t_x$ and $t_y$ denote the coordinate shifts along the $x$-axis and $y$-axis, respectively.
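The sampling and transformation of Part II can be sketched as follows; representing the histogram as bin probabilities plus edges, and re-anchoring the scaled instance at its original box corner, are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_target_scale(bin_probs, bin_edges):
    """Sample a target size s_hat from the rectified scale histogram H.

    bin_probs: probability mass of each of the K bins (sums to 1).
    bin_edges: K + 1 boundaries, so bin k spans
               [R[k]-, R[k]+] = [bin_edges[k], bin_edges[k + 1]].
    """
    k = rng.choice(len(bin_probs), p=bin_probs)         # pick a bin by its mass
    return rng.uniform(bin_edges[k], bin_edges[k + 1])  # uniform within the bin

def instance_affine(w, h, x, y, s_hat):
    """Affine matrix of Eq. (2) for one instance.

    (w, h) is the bounding-box size, (x, y) its top-left corner, and s_hat
    the sampled target size, so r = s_hat / s with s = sqrt(w * h).
    """
    r = s_hat / np.sqrt(w * h)
    return np.array([[r, 0.0, x - r * x],
                     [0.0, r, y - r * y]])
```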
Part III. Probabilistic Structure Inpainting:

For a background with an instance-shaped hole in it, we first adopt the inpainting method [15] to fill the blank area, inspired by InstaBoost [17]. In practice, however, the effect of the traditional inpainting method can be very poor, because the object is reduced to a very small size. In order to alleviate the image structure loss caused by instance-level scale match, we introduce an extra background to make up for the distortion of the image. This, however, raises the issue that the context information of the object becomes completely different from before, which to some extent confuses network learning. Therefore, a hyper-parameter $p$ is predefined to determine whether a new background is needed, trading off between the two kinds of background. If a sampled random number is greater than $p$, we sample a new image from the pre-training dataset as the background; otherwise, we still use the inpainted background. Note that the labels of the new image do not participate in training.
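A sketch of the PSI decision rule follows; `pretraining_sampler` is an illustrative placeholder for whatever routine draws a fresh image from the pre-training set, and OpenCV's Navier-Stokes inpainting is used for the in-place repair of [15].

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def psi_background(holed_bg, hole_mask, pretraining_sampler, p=0.4):
    """Choose between the inpainted background and a freshly sampled one.

    With probability p the instance-shaped hole is repaired in place via
    Navier-Stokes inpainting [15]; otherwise a new image is drawn from the
    pre-training set, whose own labels are ignored during pre-training.
    """
    if rng.random() <= p:
        # keep the original context and repair the hole
        return cv2.inpaint(holed_bg, hole_mask.astype(np.uint8), 3, cv2.INPAINT_NS)
    # trade context consistency for an artifact-free background
    new_bg = pretraining_sampler()  # placeholder: returns a random dataset image
    return cv2.resize(new_bg, (holed_bg.shape[1], holed_bg.shape[0]))
```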
Part IV. Combination:

After obtaining the final background and the transformed instances, we paste the instances onto the corresponding positions in the background according to the annotations. Adjusted images are visualized in Fig. 2.
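Putting the parts together, a minimal combination sketch, assuming the instance layer and affine matrix produced by the earlier sketches:

```python
import numpy as np
import cv2

def combine(background, instance_rgba, affine):
    """Warp the instance by its affine matrix and alpha-paste it (Part IV).

    background:    H x W x 3 uint8 image produced by PSI.
    instance_rgba: H x W x 4 float layer from Part I (soft alpha in channel 3).
    affine:        2 x 3 matrix from Eq. (2).
    """
    h, w = background.shape[:2]
    warped = cv2.warpAffine(instance_rgba, affine, (w, h))
    alpha = warped[..., 3:4] / 255.0  # soft matte in [0, 1]
    out = (1.0 - alpha) * background + alpha * warped[..., :3]
    return out.astype(np.uint8)
```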
In order to prove the effectiveness of SM+, we use the Jensen-Shannon divergence [18] to quantitatively measure the similarity between distributions. Here, $p(x)$ and $q(x)$ denote probability distributions of a discrete random variable $x$; both sum up to 1, and $p(x) > 0$ and $q(x) > 0$ for any $x$ in $X$. The Kullback-Leibler divergence [19] is defined in Eq. (3):

$$D_{KL}(p(x) \| q(x)) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}. \quad (3)$$

The Jensen-Shannon divergence then follows as Eq. (4):

$$D_{JS}(p(x) \| q(x)) = \frac{1}{2} D_{KL}\Big(p(x) \,\Big\|\, \frac{p(x)+q(x)}{2}\Big) + \frac{1}{2} D_{KL}\Big(q(x) \,\Big\|\, \frac{p(x)+q(x)}{2}\Big). \quad (4)$$

According to Tab. 1, the JS divergence between the scale distributions after transformation by the SM+ algorithm is smaller than that after the SM algorithm. The proposed SM+ thus more effectively bridges the gap between the scale distributions of the pre-training dataset and the target dataset.

Method | $D_{JS}(P_{size}(s; T(E)) \| P_{size}(s; D))$
RSM | 0.0091
RSM+ | 0.0020
MSM | 0.0133
MSM+ | 0.0013

Table 1. The similarity between the scale distributions aligned by different methods; a smaller divergence denotes that the two distributions are closer. $D$ represents TinyPerson, $E$ represents COCO, and $T$ denotes the transformation conducted by the scale match method.
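Eqs. (3) and (4) translate directly into code; a small self-contained example over toy size histograms:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence of Eq. (3) for strictly positive histograms."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence of Eq. (4)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy size histograms (each sums to 1, all entries positive):
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])
print(js(p, q))  # small value: the two distributions are close
```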
3. EXPERIMENT

3.1. Dataset
The experiments are conducted on two datasets: COCO and TinyPerson.
COCO involves 80 categories of objects.
TinyPerson is a tiny object detection dataset collected from high-quality videos and web pictures. It contains 72,651 annotated human objects of low visual resolution in a total of 1,610 images; the size of most annotated objects in TinyPerson is less than 20 × 20 pixels.

In Tab. 2, Faster RCNN-FPN-MSM+, i.e., Faster RCNN-FPN pre-trained with our proposed MSM+, produces state-of-the-art results in all AP evaluations. The comparison well demonstrates that our method is effective for tiny object detection.
As shown in Tab. 3, we compare SM+ with various pre-training strategies, including ImageNet, COCO800, RSM [4] and MSM [4]. COCO800 means that we control the size of input images within (800, 1333) and use different anchor settings for each of the two training stages; for COCO we use the original image size as input. The scale match based methods are applied to the COCO dataset, and Faster RCNN-FPN is used as the detector. First, compared with ImageNet, using COCO800 for pre-training improves performance under a proper anchor setting, since TinyPerson contains much smaller objects than COCO. Considering the person scale distribution of TinyPerson, RSM and MSM achieve higher accuracy. Furthermore, SM+ effectively matches the scale of COCO to that of TinyPerson and further improves detection accuracy: RSM+ outperforms RSM by 0.13 points in $AP^{tiny}_{50}$, and MSM+, obtained with the monotone function proposed in [4], gains an improvement of about 1.72 points over MSM.
Detector-agnostic: In order to further validate the effectiveness of the proposed approach, the one-stage detector Adaptive RetinaNet is also chosen as a baseline. As shown in Tab. 4, RSM+ improves $AP^{tiny}_{50}$ by 2.11 points, and MSM+ improves $AP^{tiny}_{50}$ by 1.66 points and $MR^{tiny}_{50}$ by 1.30 points; the performance improvement on the one-stage detector is significantly greater than that on the two-stage detector. The consistent improvement in Tab. 3 and Tab. 4 on both kinds of detectors demonstrates that the proposed refined scale match (SM+) is detector-agnostic and can be effectively applied to different kinds of detectors.
Method | $AP^{tiny1}_{50}$ | $AP^{tiny2}_{50}$ | $AP^{tiny3}_{50}$ | $AP^{tiny}_{50}$ | $AP^{small}_{50}$ | $AP^{tiny}_{25}$ | $AP^{tiny}_{75}$
FCOS [9] | 0.99 | 2.82 | 6.20 | 3.26 | 20.19 | 13.28 | 0.14
Adaptive RetinaNet [8] | 27.08 | 52.63 | 57.88 | 46.56 | 59.97 | 69.60 | 4.49
Faster RCNN-FPN [5] | 30.25 | 51.58 | 58.95 | 47.35 | 63.18 | 68.43 | 5.83
Faster RCNN-FPN-RSM [4] | 33.91 | 55.16 | 62.58 | 51.33 | 66.96 | 71.55 | 6.46
Faster RCNN-FPN-RSM+ (ours) | 33.74 | 55.32 | 62.95 | 51.46 | 66.68 | 72.38 | 6.62
Faster RCNN-FPN-MSM [4] | 33.79 | 55.55 | 61.29 | 50.89 | 65.76 | 71.28 | 6.66
Faster RCNN-FPN-MSM+ (ours) | – | – | – | 52.61 | – | – | –

Table 2. Comparisons in terms of AP (%); larger AP means better performance. $AP^{tiny}_{50}$, $AP^{tiny1}_{50}$, $AP^{tiny2}_{50}$, $AP^{tiny3}_{50}$ and $AP^{small}_{50}$ reflect the performance for object sizes in the ranges [2, 20], [2, 8], [8, 12], [12, 20] and [20, 32], respectively; the subscripts denote IoU thresholds. Bold indicates the best performance.

Pre-training Dataset | $AP^{tiny}_{50}$ (↑)
ImageNet | 47.35
COCO800 | 49.76
RSM (COCO) | 51.33
RSM+ (COCO) | 51.46
MSM (COCO) | 50.89
MSM+ (COCO) | 52.61

Table 3. Comparisons of $AP^{tiny}_{50}$ with Faster RCNN-FPN. Compared with the SM algorithm, the SM+ algorithm performs better pre-training at a deeper level.

Pre-training Dataset | $AP^{tiny}_{50}$ (↑)
ImageNet | 46.56
COCO800 | 45.03
RSM (COCO) | 48.48
RSM+ (COCO) | 50.59
MSM (COCO) | 49.59
MSM+ (COCO) | 51.25

Table 4. Comparisons of $AP^{tiny}_{50}$ with Adaptive RetinaNet. The SM+ algorithm achieves consistent performance improvements with the one-stage detector.

Probabilistic Structure Inpainting (PSI): We note that simply aligning the scale distributions of the pre-training dataset and the target dataset at instance level does not improve performance, since the image structure is destroyed. SM+ involves significantly zooming out objects, and the inpainting method might not be effective in repairing the image; this causes artifacts and damages the image structure, as shown at the top of Fig. 2. In contrast, PSI allows instances to be pasted on another background image, in which case the resulting image has no artifacts, as shown at the bottom of Fig. 2. To validate the effect of PSI, we include a baseline without the background change (w/o PSI) in Tab. 5, and show that this baseline drops performance dramatically. We believe the unrealistic image structure and artifact patterns make the network over-fit, leading to undesirable results.

Moreover, replacing the background in PSI might be regarded as data augmentation. Thus, we further conduct experiments to validate whether the performance gain comes from data augmentation. To study this, we include an experiment that directly copies and pastes objects onto a new background image without scaling their sizes, with two baselines, CP and CP+. CP means we crop all the instances and paste them onto a new image background, but the original annotations of the new image are not used during pre-training; CP+ means both the newly pasted objects and the originally annotated objects are used for training. In Tab. 6, the two baselines achieve similar results and slightly surpass COCO, yet remain lower than MSM+ (COCO). This indicates that replacing the background alone brings only limited improvement and is not the mechanism of SM+. The effectiveness of SM+ comes from achieving a finer distribution alignment at instance level while better preserving the structure of the images.
Pre-training Strategy | $AP^{tiny}_{50}$ (↑)
RSM+ (w/o PSI) | 50.12
RSM+ | 51.46
MSM+ (w/o PSI) | 50.69
MSM+ | 52.61

Table 5. Ablation study on PSI. It is not enough to align the distribution at instance level without considering the background; SM+ achieves the desired effect with PSI.
Pre-training Dataset | $AP^{tiny}_{50}$ (↑)
COCO | 49.96
CP (COCO) | 50.66
CP+ (COCO) | 50.46
MSM (COCO) | 50.89
MSM+ (COCO) | 52.61

Table 6. Effect of different methods. CP (COCO) and MSM (COCO) both achieve limited performance improvements.

In addition, we also validate the effect of $p$ in PSI. We observe that a moderate probability ($p$ = 0.4) achieves a trade-off between image structure loss and semantic loss.
4. CONCLUSION
The scale information for better pre-training is further investigated in this paper. Scale Match only performs an image-level match and thus limits the feature representation learning of detectors. In this paper, we propose a novel method named Refined Scale Match (SM+). SM+, a much finer scale match strategy, aligns the scale distributions of the pre-training dataset and the target dataset at instance level, yielding a more effective and suitable matched dataset. Moreover, in order to alleviate the loss caused by aligning the distributions at instance level, an effective method, referred to as probabilistic structure inpainting (PSI), is further proposed. PSI effectively balances the information loss between image structure and semantics. Thorough experimental results verify the superiority of the proposed method over other state-of-the-art methods. The relative size between two datasets is also very important for tiny object detection, which will be further investigated in the future.

5. REFERENCES

[1] Robert T. Collins, Alan J. Lipton, Takeo Kanade, Hironobu Fujiyoshi, David Duggins, Yanghai Tsin, David Tolliver, Nobuyoshi Enomoto, Osamu Hasegawa, Peter Burt, et al., "A system for video surveillance and monitoring," VSAM Final Report, vol. 2000, pp. 1-68, 2000.

[2] Ismail Haritaoglu, David Harwood, and Larry S. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, 2000.

[3] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona, "Pedestrian detection: A benchmark," in CVPR, 2009.

[4] Xuehui Yu, Yuqi Gong, Nan Jiang, Qixiang Ye, and Zhenjun Han, "Scale match for tiny person detection," in WACV, 2020.

[5] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in CVPR, 2017.

[6] Zhaowei Cai and Nuno Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in CVPR, 2018.

[7] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin, "Libra R-CNN: Towards balanced learning for object detection," in CVPR, 2019.

[8] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in ICCV, 2017.

[9] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He, "FCOS: Fully convolutional one-stage object detection," in ICCV, 2019.

[10] Yi-Fan Lu, Jiaming Lu, Song-Hai Zhang, and Peter Hall, "Traffic signal detection and classification in street views using an attention model," Computational Visual Media, vol. 4, 2018.

[11] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, 2009.

[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.

[13] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al., "The Open Images dataset v4," International Journal of Computer Vision, 2020.

[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.

[15] Marcelo Bertalmío, Andrea L. Bertozzi, and Guillermo Sapiro, "Navier-Stokes, fluid dynamics, and image and video inpainting," in CVPR, 2001.

[16] Kaiming He, Christoph Rhemann, Carsten Rother, Xiaoou Tang, and Jian Sun, "A global sampling method for alpha matting," in CVPR, 2011.

[17] Hao-Shu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yonglu Li, and Cewu Lu, "InstaBoost: Boosting instance segmentation via probability map guided copy-pasting," in ICCV, 2019.

[18] Jianhua Lin, "Divergence measures based on the Shannon entropy," IEEE Trans. Inf. Theory, vol. 37, 1991.

[19] Solomon Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, 1951.