SM+: Refined Scale Match for Tiny Person Detection
Nan Jiang, Xuehui Yu, Xiaoke Peng, Yuqi Gong and Zhenjun Han
University of Chinese Academy of Sciences, Beijing, China
ABSTRACT
Detecting tiny objects (e.g., less than 20 × 20 pixels) in large-scale images is an important yet open problem. Modern CNN-based detectors are challenged by the scale mismatch between the dataset for network pre-training and the target dataset for detector training. In this paper, we investigate the scale alignment between pre-training and target datasets, and propose a new refined Scale Match method (termed SM+) for tiny person detection. SM+ improves the scale match from image level to instance level, and effectively promotes the similarity between the pre-training and target datasets. Moreover, considering that SM+ possibly destroys the image structure, a new probabilistic structure inpainting (PSI) method is proposed for background processing. Experiments conducted across various detectors show that SM+ noticeably improves the performance on TinyPerson, and outperforms the state-of-the-art detectors by a significant margin.

Index Terms — tiny object detection, pre-training strategy
1. INTRODUCTION
Person detection is an important topic in the computer vision area. It has wide applications including surveillance [1][2], driving assistance [3] and maritime quick rescue [4], etc.
The research on detectors [5][6][7][8][9] has achieved significant progress with the rapid development of data-driven deep convolutional neural networks (CNNs). However, the detectors perform poorly when detecting tiny objects with few pixels (e.g., less than 20 × 20 pixels), such as traffic signs [10] and persons in aerial images [4].

To better exploit CNN-based detectors, a large number of manually annotated person datasets [11][12][13] for detection have been proposed and made publicly available. However, datasets for specific object detection tasks, such as tiny person detection [4], are not as large as other counterparts [14][12], due to the cost of collecting and annotating the data. With insufficient data for a specific application, an alternative way is to pre-train a model on extra-large datasets (e.g., ImageNet [14], COCO [12]), and then fine-tune the model on the task-specific dataset.
The corresponding author is Zhenjun Han.
Fig. 1. The illustration of the difference between image-level SM and instance-level SM+. While SM only considers the whole image, SM+ focuses on every instance. The instance-level approach achieves scale match at a finer level, and mainly consists of four steps: (1) Separation, (2) Instance processing, (3) Background processing, and (4) Combination.

However, a new question arises: could we take better advantage of existing large datasets for a task-specific application, particularly when object sizes significantly differ between the datasets? The SM algorithm [4], with its variants Random Scale Match (RSM) and Monotone Scale Match (MSM), gave simple yet effective answers. With a sampled scale factor, the SM algorithm directly resizes the images and aligns the scale distribution of the pre-training dataset to that of the target dataset. The SM algorithm, with its image-level scaling, is merely a simple approximation for scale match: it regards the average size of all objects in an image as the size of that image, although the image may contain many labeled objects at multiple scales.

In this paper, we propose a newly refined SM method (SM+), in which we transform the scale distribution of the pre-training dataset by instance-level scaling instead of resizing the whole image. Intuitively, compared with the vanilla SM algorithm, SM+ performs finer-scale scaling and alleviates the uncertainty and inaccuracy caused by the approximation introduced in the SM algorithm. The differences between SM and SM+ are illustrated in Fig. 1. The SM+ algorithm separates each image into two parts: the annotated instances and their background. The instances are utilized for the instance-level scale match, while the background is left with instance-shaped holes. However, the traditional method [15] blurs the images by directly inpainting the holes and generates unrealistic images, leading to a performance drop of the pre-trained model. To solve this problem, a probabilistic structure inpainting (PSI) method is further proposed to dynamically inpaint the images, suppressing image blur and preserving context consistency around the holes. Compared with state-of-the-art detectors on TinyPerson, the SM+ algorithm leads to significant performance improvement in AP.

The main contributions of our work include:
1. We comprehensively analyze the scale information of TinyPerson, and propose a new refined scale match method, dubbed SM+, which achieves better scale distribution alignment by finer-scale scaling.
2. We propose probabilistic structure inpainting (PSI) for the SM+ algorithm; PSI effectively inpaints the background images.
3. The proposed SM+ algorithm improves the detection performance over the state-of-the-art detectors by a large margin. Codes will be available upon acceptance.
2. METHODOLOGY

2.1. Scale Match
We define the object size as the square root of its area: $s(G_{ij}) = \sqrt{w_{ij} h_{ij}}$, where $G_{ij}$ denotes the $j$-th bounding box of the $i$-th image $I_i$, and $w_{ij}$, $h_{ij}$ are the width and height of the bounding box, respectively.

Given an extra dataset $E$ whose probability density function of object size $s$ is $P_{size}(s; E)$, and a target dataset $D$ whose probability density function is $P_{size}(s; D)$, our goal is to apply a scale transformation $T$ on $E$ such that their probability distributions of object size are well matched. This corresponds to

$$P_{size}(s; T(E)) \approx P_{size}(s; D). \quad (1)$$

The image-level method leaves lots of room for improvement. To this end, we propose the refined scale match (SM+), which focuses on instance-level scale match and achieves more desirable results than the image-level match [4].
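For concreteness, a minimal Python sketch of the size definition and the empirical size distribution it induces is given below; the bin edges and box lists are toy values for illustration, not the ones used in our experiments.

```python
import numpy as np

def object_sizes(boxes):
    """Object sizes s(G_ij) = sqrt(w_ij * h_ij) for an array of (w, h) pairs."""
    wh = np.asarray(boxes, dtype=float)
    return np.sqrt(wh[:, 0] * wh[:, 1])

def size_distribution(boxes, bin_edges):
    """Normalized histogram approximating P_size(s; .) over the given bins."""
    counts, _ = np.histogram(object_sizes(boxes), bins=bin_edges)
    return counts / counts.sum()

# Toy example: scale match seeks a transform T on the extra dataset E such
# that P_size(s; T(E)) approximates P_size(s; D) of the target dataset D.
edges = np.array([2.0, 8.0, 12.0, 20.0, 32.0])
p_E = size_distribution([(64, 128), (30, 60), (15, 25)], edges)
p_D = size_distribution([(6, 10), (4, 7), (12, 18)], edges)
```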
2.2. Refined Scale Match (SM+)

The whole procedure of SM+ is shown in Fig. 1. In the following, we present the details of each part.
Part I. Extraction and Separation:
The pre-training dataset requires ground-truth annotations for instance segmentation. According to the mask annotation, each picture participating in training is separated into background and foreground. To obtain a finer foreground, we adopt the matting method [16] to make the outlines of instances smoother; since mask annotations are stored as boundary points and edges, using them directly makes the outline of the foreground jagged. After separation, we get a proper instance mask and an incomplete background, and the two parts are then processed separately.
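A minimal sketch of this separation step is given below, assuming binary instance masks rasterized from the polygon annotations are already available; the Gaussian feathering only stands in for the global-sampling matting of [16].

```python
import numpy as np
import cv2  # OpenCV, used for mask feathering

def separate_instance(image, mask, blur_ksize=5):
    """Split an image into an instance layer and a holed background.

    image: H x W x 3 uint8 array.
    mask:  H x W binary array rasterized from the polygon annotation.
    Returns (instance_rgba, background): the instance carries a soft alpha
    channel, and the background has an instance-shaped hole zeroed out.
    """
    # Feather the jagged polygon boundary; this Gaussian softening only
    # approximates the matting method of [16].
    alpha = cv2.GaussianBlur(mask.astype(np.float32), (blur_ksize, blur_ksize), 0)
    instance_rgba = np.dstack([image.astype(np.float32), alpha * 255.0])
    background = image.copy()
    background[mask > 0] = 0  # hole to be filled by PSI in Part III
    return instance_rgba, background
```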
Fig. 2. Background based on inpainting (top) vs. background based on new sampling (bottom). The inpainting method might not repair some artifacts, but changing the background does not cause this problem. (Best viewed in color.)
Part II. Instance Scale Histogram Match:
On the ground of the target dataset annotations, a discrete scale histogram $H$ is established to approximate the scale probability density function of the target dataset, $P_{size}(s; D_{train})$; the histogram is rectified to pay less attention to the long-tail part of the scale distribution. In $H$, $K$ represents the number of bins, and $R[k]^-$ and $R[k]^+$ are the lower and upper size boundaries of the $k$-th bin. For every separated instance, we use the size of the corresponding bounding box as its scale representation $s$. First, we sample a bin index $k$ with respect to the probabilities of $H$. Then we sample a target scale $\hat{s}$ from a uniform distribution whose min and max equal $R[k]^-$ and $R[k]^+$, respectively. Finally, we transform the instance according to the ratio of $\hat{s}$ to $s$. The transformation can be expressed by the affine matrix

$$A = \begin{bmatrix} r & 0 & t_x \\ 0 & r & t_y \end{bmatrix}, \quad (2)$$

where $r$ denotes the scale ratio, and $t_x$ and $t_y$ denote the coordinate shifts along the $x$-axis and $y$-axis, respectively.
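The sampling and transformation of Part II can be sketched as follows; representing the histogram as bin probabilities plus edges, and re-anchoring the scaled instance at its original box corner, are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_target_scale(bin_probs, bin_edges):
    """Sample a target size s_hat from the rectified scale histogram H.

    bin_probs: probability mass of each of the K bins (sums to 1).
    bin_edges: K + 1 boundaries, so bin k spans
               [R[k]-, R[k]+] = [bin_edges[k], bin_edges[k + 1]].
    """
    k = rng.choice(len(bin_probs), p=bin_probs)         # pick a bin by its mass
    return rng.uniform(bin_edges[k], bin_edges[k + 1])  # uniform within the bin

def instance_affine(w, h, x, y, s_hat):
    """Affine matrix of Eq. (2) for one instance.

    (w, h) is the bounding-box size, (x, y) its top-left corner, and s_hat
    the sampled target size, so r = s_hat / s with s = sqrt(w * h).
    """
    r = s_hat / np.sqrt(w * h)
    return np.array([[r, 0.0, x - r * x],
                     [0.0, r, y - r * y]])
```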
Part III. Probabilistic Structure Inpainting:

For a background with an instance-shaped hole in it, we first adopt the inpainting method [15] to fill the blank area, inspired by InstaBoost [17]. In practice, however, the effect of the traditional inpainting method can be very poor, because the object is reduced to a very small size. In order to alleviate the image structure loss caused by instance-level scale match, we introduce an extra background to make up for the distortion of the image. This, however, raises the issue that the context information of the object becomes completely different from before, which to some extent confuses network learning. Therefore, a hyper-parameter $p$ is predefined to determine whether a new background is needed, trading off between the two kinds of background. If a sampled random number is greater than $p$, we sample a new image from the pre-training dataset as the background; otherwise, we still use the inpainted background. Note that the labels of the new image do not participate in training.
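A sketch of the PSI decision rule follows; `pretraining_sampler` is an illustrative placeholder for whatever routine draws a fresh image from the pre-training set, and OpenCV's Navier-Stokes inpainting is used for the in-place repair of [15].

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def psi_background(holed_bg, hole_mask, pretraining_sampler, p=0.4):
    """Choose between the inpainted background and a freshly sampled one.

    With probability p the instance-shaped hole is repaired in place via
    Navier-Stokes inpainting [15]; otherwise a new image is drawn from the
    pre-training set, whose own labels are ignored during pre-training.
    """
    if rng.random() <= p:
        # keep the original context and repair the hole
        return cv2.inpaint(holed_bg, hole_mask.astype(np.uint8), 3, cv2.INPAINT_NS)
    # trade context consistency for an artifact-free background
    new_bg = pretraining_sampler()  # placeholder: returns a random dataset image
    return cv2.resize(new_bg, (holed_bg.shape[1], holed_bg.shape[0]))
```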
Part IV. Combination:

After obtaining the final background and the transformed instances, we paste the instances onto the corresponding positions in the background according to the annotations. Adjusted images are visualized in Fig. 2.
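Putting the parts together, a minimal combination sketch, assuming the instance layer and affine matrix produced by the earlier sketches:

```python
import numpy as np
import cv2

def combine(background, instance_rgba, affine):
    """Warp the instance by its affine matrix and alpha-paste it (Part IV).

    background:    H x W x 3 uint8 image produced by PSI.
    instance_rgba: H x W x 4 float layer from Part I (soft alpha in channel 3).
    affine:        2 x 3 matrix from Eq. (2).
    """
    h, w = background.shape[:2]
    warped = cv2.warpAffine(instance_rgba, affine, (w, h))
    alpha = warped[..., 3:4] / 255.0  # soft matte in [0, 1]
    out = (1.0 - alpha) * background + alpha * warped[..., :3]
    return out.astype(np.uint8)
```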
In order to prove the effectiveness of SM+, we use the Jensen-Shannon divergence [18] to quantitatively measure the similarity between distributions. Here, $p(x)$ and $q(x)$ denote probability distributions of a discrete random variable $x$; both sum up to 1, and $p(x) > 0$ and $q(x) > 0$ for any $x$ in $X$. The Kullback-Leibler divergence [19] is defined in Eq. (3):

$$D_{KL}(p(x) \| q(x)) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}. \quad (3)$$

The Jensen-Shannon divergence then follows as Eq. (4):

$$D_{JS}(p(x) \| q(x)) = \frac{1}{2} D_{KL}\Big(p(x) \,\Big\|\, \frac{p(x)+q(x)}{2}\Big) + \frac{1}{2} D_{KL}\Big(q(x) \,\Big\|\, \frac{p(x)+q(x)}{2}\Big). \quad (4)$$

According to Tab. 1, the JS divergence between the scale distributions after transformation by the SM+ algorithm is smaller than that after the SM algorithm. The proposed SM+ thus more effectively bridges the gap between the scale distributions of the pre-training dataset and the target dataset.

Method | $D_{JS}(P_{size}(s; T(E)) \| P_{size}(s; D))$
RSM | 0.0091
RSM+ | 0.0020
MSM | 0.0133
MSM+ | 0.0013

Table 1. The similarity between the scale distributions aligned by different methods; a smaller divergence denotes that the two distributions are closer. $D$ represents TinyPerson, $E$ represents COCO, and $T$ denotes the transformation conducted by the scale match method.
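Eqs. (3) and (4) translate directly into code; a small self-contained example over toy size histograms:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence of Eq. (3) for strictly positive histograms."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence of Eq. (4)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy size histograms (each sums to 1, all entries positive):
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])
print(js(p, q))  # small value: the two distributions are close
```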
3. EXPERIMENT

3.1. Dataset
The experiments are conducted on two datasets: COCO and TinyPerson.
COCO involves 80 categories of objects.
TinyPerson is a tiny object detection dataset collected from high-quality videos and web pictures. It contains 72,651 annotated human objects of low visual resolution in a total of 1,610 images; the size of most annotated objects in TinyPerson is less than 20 × 20 pixels.

In Tab. 2, Faster RCNN-FPN-MSM+, i.e., Faster RCNN-FPN pre-trained with our proposed MSM+, produces state-of-the-art results in all AP evaluations. The comparison well demonstrates that our method is effective for tiny object detection.
As shown in Tab. 3, we compare SM+ with various pre-training strategies, including ImageNet, COCO800, RSM [4] and MSM [4]. COCO800 means that we control the size of input images within (800, 1333) and use different anchor settings for each of the two training stages; for COCO we use the original image size as input. The scale match based methods are applied to the COCO dataset, and Faster RCNN-FPN is used as the detector. First, compared with ImageNet, using COCO800 for pre-training improves performance under a proper anchor setting, since TinyPerson contains much smaller objects than COCO. Considering the person scale distribution of TinyPerson, RSM and MSM achieve higher accuracy. Furthermore, SM+ effectively matches the scale of COCO to that of TinyPerson and further improves detection accuracy: RSM+ outperforms RSM by 0.13 points in $AP^{tiny}_{50}$, and MSM+, obtained with the monotone function proposed in [4], gains an improvement of about 1.72 points over MSM.
Detector-agnostic: In order to further validate the effectiveness of the proposed approach, the one-stage detector Adaptive RetinaNet is also chosen as a baseline. As shown in Tab. 4, RSM+ improves $AP^{tiny}_{50}$ by 2.11 points, and MSM+ improves $AP^{tiny}_{50}$ by 1.66 points and $MR^{tiny}_{50}$ by 1.30 points; the performance improvement on the one-stage detector is significantly greater than that on the two-stage detector. The consistent improvement in Tab. 3 and Tab. 4 on both kinds of detectors demonstrates that the proposed refined scale match (SM+) is detector-agnostic and can be effectively applied to different kinds of detectors.
Method | $AP^{tiny1}_{50}$ | $AP^{tiny2}_{50}$ | $AP^{tiny3}_{50}$ | $AP^{tiny}_{50}$ | $AP^{small}_{50}$ | $AP^{tiny}_{25}$ | $AP^{tiny}_{75}$
FCOS [9] | 0.99 | 2.82 | 6.20 | 3.26 | 20.19 | 13.28 | 0.14
Adaptive RetinaNet [8] | 27.08 | 52.63 | 57.88 | 46.56 | 59.97 | 69.60 | 4.49
Faster RCNN-FPN [5] | 30.25 | 51.58 | 58.95 | 47.35 | 63.18 | 68.43 | 5.83
Faster RCNN-FPN-RSM [4] | 33.91 | 55.16 | 62.58 | 51.33 | 66.96 | 71.55 | 6.46
Faster RCNN-FPN-RSM+ (ours) | 33.74 | 55.32 | 62.95 | 51.46 | 66.68 | 72.38 | 6.62
Faster RCNN-FPN-MSM [4] | 33.79 | 55.55 | 61.29 | 50.89 | 65.76 | 71.28 | 6.66
Faster RCNN-FPN-MSM+ (ours) | – | – | – | 52.61 | – | – | –

Table 2. Comparisons in terms of AP (%); larger AP means better performance. $AP^{tiny}_{50}$, $AP^{tiny1}_{50}$, $AP^{tiny2}_{50}$, $AP^{tiny3}_{50}$ and $AP^{small}_{50}$ reflect the performance for object sizes in the ranges [2, 20], [2, 8], [8, 12], [12, 20] and [20, 32], respectively; the subscripts denote IoU thresholds. Bold indicates the best performance.

Pre-training Dataset | $AP^{tiny}_{50}$ (↑)
ImageNet | 47.35
COCO800 | 49.76
RSM (COCO) | 51.33
RSM+ (COCO) | 51.46
MSM (COCO) | 50.89
MSM+ (COCO) | 52.61

Table 3. Comparisons of $AP^{tiny}_{50}$ with Faster RCNN-FPN. Compared with the SM algorithm, the SM+ algorithm performs better pre-training at a deeper level.

Pre-training Dataset | $AP^{tiny}_{50}$ (↑)
ImageNet | 46.56
COCO800 | 45.03
RSM (COCO) | 48.48
RSM+ (COCO) | 50.59
MSM (COCO) | 49.59
MSM+ (COCO) | 51.25

Table 4. Comparisons of $AP^{tiny}_{50}$ with Adaptive RetinaNet. The SM+ algorithm achieves consistent performance improvements with the one-stage detector.

Probabilistic Structure Inpainting (PSI): We note that simply aligning the scale distributions of the pre-training dataset and the target dataset at instance level does not improve performance, since the image structure is destroyed. SM+ involves significantly zooming out objects, and the inpainting method might not be effective in repairing the image; this causes artifacts and damages the image structure, as shown at the top of Fig. 2. In contrast, PSI allows instances to be pasted on another background image, in which case the resulting image has no artifacts, as shown at the bottom of Fig. 2. To validate the effect of PSI, we include a baseline without the background change (w/o PSI) in Tab. 5, and show that this baseline drops performance dramatically. We believe the unrealistic image structure and artifact patterns make the network over-fit, leading to undesirable results.

Moreover, replacing the background in PSI might be regarded as data augmentation. Thus, we further conduct experiments to validate whether the performance gain comes from data augmentation. To study this, we include an experiment that directly copies and pastes objects onto a new background image without scaling their sizes, with two baselines, CP and CP+. CP means we crop all the instances and paste them onto a new image background, but the original annotations of the new image are not used during pre-training; CP+ means both the newly pasted objects and the originally annotated objects are used for training. In Tab. 6, the two baselines achieve similar results and slightly surpass COCO, yet remain lower than MSM+ (COCO). This indicates that replacing the background alone brings only limited improvement and is not the mechanism of SM+. The effectiveness of SM+ comes from achieving a finer distribution alignment at instance level while better preserving the structure of the images.
Pre-training Strategy | $AP^{tiny}_{50}$ (↑)
RSM+ (w/o PSI) | 50.12
RSM+ | 51.46
MSM+ (w/o PSI) | 50.69
MSM+ | 52.61

Table 5. Ablation study on PSI. It is not enough to align the distribution at instance level without considering the background; SM+ achieves the desired effect with PSI.
Pre-training Dataset | $AP^{tiny}_{50}$ (↑)
COCO | 49.96
CP (COCO) | 50.66
CP+ (COCO) | 50.46
MSM (COCO) | 50.89
MSM+ (COCO) | 52.61

Table 6. Effect of different methods. CP (COCO) and MSM (COCO) both achieve limited performance improvements.

In addition, we also validate the effect of $p$ in PSI. We observe that a moderate probability ($p$ = 0.4) achieves a trade-off between image structure loss and semantic loss.
4. CONCLUSION
The scale information for better pre-training is further investigated in this paper. Scale Match only performs an image-level match and thus limits the feature representation learning of detectors. In this paper, we propose a novel method named Refined Scale Match (SM+). SM+, a much finer scale match strategy, aligns the scale distributions of the pre-training dataset and the target dataset at instance level, yielding a more effective and suitable matched dataset. Moreover, in order to alleviate the loss caused by aligning the distributions at instance level, an effective method, referred to as probabilistic structure inpainting (PSI), is further proposed. PSI effectively balances the information loss between image structure and semantics. Thorough experimental results verify the superiority of the proposed method over other state-of-the-art methods. The relative size between two datasets is also very important for tiny object detection, which will be further investigated in the future.

5. REFERENCES

[1] Robert T. Collins, Alan J. Lipton, Takeo Kanade, Hironobu Fujiyoshi, David Duggins, Yanghai Tsin, David Tolliver, Nobuyoshi Enomoto, Osamu Hasegawa, Peter Burt, et al., "A system for video surveillance and monitoring," VSAM Final Report, vol. 2000, pp. 1-68, 2000.

[2] Ismail Haritaoglu, David Harwood, and Larry S. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, 2000.

[3] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona, "Pedestrian detection: A benchmark," in CVPR, 2009.

[4] Xuehui Yu, Yuqi Gong, Nan Jiang, Qixiang Ye, and Zhenjun Han, "Scale match for tiny person detection," in WACV, 2020.

[5] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in CVPR, 2017.

[6] Zhaowei Cai and Nuno Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in CVPR, 2018.

[7] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin, "Libra R-CNN: Towards balanced learning for object detection," in CVPR, 2019.

[8] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in ICCV, 2017.

[9] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He, "FCOS: Fully convolutional one-stage object detection," in ICCV, 2019.

[10] Yi-Fan Lu, Jiaming Lu, Song-Hai Zhang, and Peter Hall, "Traffic signal detection and classification in street views using an attention model," Computational Visual Media, vol. 4, 2018.

[11] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, 2009.

[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.

[13] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al., "The Open Images dataset v4," International Journal of Computer Vision, 2020.

[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.

[15] Marcelo Bertalmío, Andrea L. Bertozzi, and Guillermo Sapiro, "Navier-Stokes, fluid dynamics, and image and video inpainting," in CVPR, 2001.

[16] Kaiming He, Christoph Rhemann, Carsten Rother, Xiaoou Tang, and Jian Sun, "A global sampling method for alpha matting," in CVPR, 2011.

[17] Hao-Shu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yonglu Li, and Cewu Lu, "InstaBoost: Boosting instance segmentation via probability map guided copy-pasting," in ICCV, 2019.

[18] Jianhua Lin, "Divergence measures based on the Shannon entropy," IEEE Trans. Inf. Theory, vol. 37, 1991.

[19] Solomon Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, 1951.