Adapted Center and Scale Prediction: More Stable and More Accurate
Wenhao Wang∗, School of Mathematical Sciences (SMS), Beihang University, Beijing, China
Abstract
Pedestrian detection benefits from deep learning technology and has developed rapidly in recent years. Most detectors follow the general object detection framework, i.e. default boxes and a two-stage process. Recently, anchor-free and one-stage detectors have been introduced into this area, but their accuracy is unsatisfactory. Therefore, in order to enjoy the simplicity of anchor-free detectors and the accuracy of two-stage ones simultaneously, we propose several adaptations of an existing detector, Center and Scale Prediction (CSP). The main contributions of our paper are: (1) We improve the robustness of CSP and make it easier to train. (2) We propose a novel method to predict width, namely compressing width. (3) We achieve the second best performance on the CityPersons benchmark, i.e. 9.3% log-average miss rate (MR) on the reasonable set, 8.7% MR on the partial set and 5.6% MR on the bare set, which shows that an anchor-free and one-stage detector can still achieve high accuracy. (4) We explore some capabilities of Switchable Normalization which are not mentioned in its original paper.
Keywords: Pedestrian Detection, Anchor-free, Switchable Normalization, Convolutional Neural Networks

∗Corresponding author: [email protected]
1 Introduction

With the prevalence of artificial intelligence techniques, autonomous vehicles have gained more and more attention. Although automatic driving requires the integration of many technologies, pedestrian detection is one of the most important, because a missed pedestrian detection could threaten pedestrians' lives. As a result, the performance of pedestrian detection algorithms is of great importance.

With the development of generic object detection [8, 9, 22, 30–32], detection performance on benchmark datasets [2, 6, 7, 35, 49] has improved significantly. Also, some detectors, such as [23, 24, 28, 47], are specially designed for pedestrian detection. However, although detection performance on benchmark datasets improves all the time, there is still a huge gap between current pedestrian detectors and a careful human observer [48]. Therefore, further performance improvement is necessary. Take the pedestrian detection dataset CityPersons [49] for instance. For a fair comparison, the following log-average miss rates (denoted as MR; lower is better) are reported on the reasonable validation set with the same input scale (1x). From all of the state-of-the-art literature available (including preprints), we summarize the following methods: Repulsion Loss [43], OR-CNN [52], HBAN [25], Adaptive NMS [21], MGAN [28], PSC-Net [45], and APD [47]. In addition, APD [47] uses a more powerful backbone, DLA-34 [46], to further improve its MR.

However, the original CSP [24] has several limitations. First, its training is unstable under some batch-size settings, i.e. MR will approach 1 after several iterations. Second, when training CSP [24], different input scales bring significantly different performance. Finally, when compared to algorithms with occlusion/crowd handling processes, there is still much room for improvement. To address the above limitations, we propose Adapted Center and Scale Prediction (ACSP), which differs only slightly from the original CSP [24] but brings significant performance improvement.
Detection examples using ACSP are shown in Fig. 1. In summary, the main contributions of this paper are as follows: (1) We make the original CSP [24] more robust, i.e. less sensitive to batch size and input scale. (2) We propose compressing width, a novel method to determine the width of a bounding box. (3) We improve the performance of CSP [24]. (4) We explore the power of Switchable Normalization when the batch size is big. Experiments are conducted on the CityPersons [49] dataset. We achieve the second best performance, i.e. 9.3% MR on the reasonable set, 8.7% MR on the partial set, and 5.6% MR on the bare set.
2 Related Work

2.1 Generic Object Detection

Early object detection approaches, such as [5, 19, 42], mainly utilize region proposal classification and the sliding window paradigm. Since August 2018, more and more works have focused on anchor-free approaches. As a result, modern object detectors can be divided into two classes: anchor-based and anchor-free.
Anchor-based methods include two-stage detectors and one-stage detectors. The most famous series of two-stage detectors is RCNN [9] and its descendants, i.e. Fast-RCNN [8] and Faster-RCNN [32]. They build a two-stage framework, which contains object proposals and classification. Among one-stage detectors, YOLOv2 [31] and SSD [22] successfully accomplish detection and classification tasks on feature maps at the same time.
The earliest exploitation of the anchor-free mode comes from DenseBox [12] and YOLOv1 [30]. The main difference between them is that DenseBox is designed for face detection while YOLOv1 is a generic object detector. The introduction of CornerNet [16] brings anchor-free detection into the key-point era. Its followers include ExtremeNet [56], CenterNet [55], etc. In addition, FoveaBox [15] and FSAF [57] use a dense object detection strategy.
2.2 Pedestrian Detection

Before the dominance of deep learning techniques, traditional pedestrian detectors, such as [5, 27, 50], focused on integral channel features with a sliding window strategy. Recently, with the introduction of Faster RCNN [32], some two-stage pedestrian detection approaches [21, 28, 43, 49, 51–54] achieve state-of-the-art results on standard benchmarks. Also, some pedestrian detectors [17, 21, 23], which are based on single-stage backbones, strike a balance between speed and accuracy. Zhou et al. [53] propose a discriminative feature transformation which enforces feature separability of pedestrian and non-pedestrian examples to handle occlusions. In [52], a new occlusion-aware R-CNN is proposed to improve detection accuracy in crowds. Wang et al. [43] develop a novel loss, repulsion loss, to address the crowd occlusion problem. The work of [21] focuses on Non-Maximum Suppression and proposes a dynamic suppression threshold to refine the bounding boxes given by detectors. HBAN [25] is introduced to enhance pedestrian detection by fully utilizing the human head prior. ALFNet is proposed in [23] to use an asymptotic localization fitting strategy to evolve the default anchor boxes step by step into precise detection results. MGAN [28] emphasizes visible pedestrian regions while suppressing the occluded ones by modulating full body features. PSC-Net [45] is designed for occluded pedestrian detection. CSP [24] utilizes an anchor-free method, i.e. directly predicting pedestrian centers and scales through convolutions. Based on CSP [24], Zhang et al. propose APD [47] to differentiate individuals in crowds. All of the aforementioned methods achieve state-of-the-art results on the CityPersons benchmark [49].

Figure 1: We use the CityPersons test set to illustrate the detection ability of our ACSP. It is worth mentioning that, without any occlusion handling method, small and occluded pedestrians can still be detected. The detections are shown as green rectangle boxes.
2.3 Normalization

Batch Normalization (BN) [13] is proposed to accelerate the training process and improve the performance of CNNs. [34] points out that batch normalization makes the loss surface smoother, while the original paper [13] attributes the improvement to reducing "internal covariate shift". Although, even today, it is still not fully understood why batch normalization works so well, its utilization improves the performance of object detection, image classification, etc. After batch normalization, weight normalization (WN) [33] is introduced to normalize the weights of layers. Layer normalization (LN) [1] normalizes the inputs across the features instead of the batch dimension; in this way, the performance is not influenced by batch size, and layer normalization was first used in RNNs. Originally designed for style transfer, instance normalization (IN) [41] normalizes across each channel in each training example. Group normalization (GN) [44] divides the channels into groups and computes the mean and variance for normalization within each group. As a result, it addresses the problem that the performance of batch normalization degrades when the batch size becomes small; to some degree, it is a combination of layer normalization and instance normalization. Recently, Luo et al. propose switchable normalization (SN) [26], which uses a weighted average of the mean and variance statistics from batch normalization, instance normalization, and layer normalization.

Figure 2: The architecture of the original CSP [24]. The framework includes two parts: feature extraction and detection head. A 3x3x256 convolution is applied to the fused H/4 x W/4 feature map, and three convolutions (1x1 for the center heatmap, 1x1 for the scale map, 2x2 for the offset map) produce the detections.
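The weighted-average idea behind SN [26] can be sketched in a few lines of numpy. This is an illustrative simplification, not the official implementation: the affine parameters and the efficient statistic computation of [26] are omitted, and the function name and weight layout are our own.

```python
import numpy as np

def switchable_norm(x, w_mean, w_var, eps=1e-5):
    """Illustrative Switchable Normalization on an (N, C, H, W) array.

    w_mean, w_var: weights (summing to 1) over the three statistic
    sources, ordered (IN, LN, BN). Affine scale/shift are omitted.
    """
    stats = {
        "in": (x.mean(axis=(2, 3), keepdims=True),      # per sample, per channel
               x.var(axis=(2, 3), keepdims=True)),
        "ln": (x.mean(axis=(1, 2, 3), keepdims=True),   # per sample
               x.var(axis=(1, 2, 3), keepdims=True)),
        "bn": (x.mean(axis=(0, 2, 3), keepdims=True),   # per channel over batch
               x.var(axis=(0, 2, 3), keepdims=True)),
    }
    mean = sum(w * m for w, (m, _) in zip(w_mean, stats.values()))
    var = sum(w * v for w, (_, v) in zip(w_var, stats.values()))
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(2, 4, 8, 8)
y = switchable_norm(x, w_mean=[0.2, 0.3, 0.5], w_var=[0.2, 0.3, 0.5])
```

With the weights collapsed to (0, 0, 1) the blend degenerates to plain batch normalization, which is a quick sanity check on the statistics.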
3 Adapted Center and Scale Prediction

3.1 Revisiting CSP

CSP [24] was proposed by Wei Liu and Shengcai Liao in 2019. They first introduced the anchor-free method into the pedestrian detection area. More specifically, CSP [24] includes two parts: feature extraction and detection head. In the feature extraction part, a backbone, such as ResNet-50 [10] or MobileNet [11], is used to extract different levels of features. Shallower feature maps provide more precise localization information, while deeper feature maps provide high-level semantic information. In the detection head part, convolutions are used to predict center, scale, and offset respectively. In Fig. 2, we summarize the architecture of CSP [24].

We revisit the architecture of CSP [24] in more detail in this paragraph, although our description differs slightly from the original paper [24]. We take ResNet-50 [10] and a picture with its original shape, i.e. 1024 x 2048, for instance. The input first passes through a Conv layer, followed by a BN layer, a ReLU layer and a Maxpool layer. In this way, a (3, 1024, 2048) input (the bracket (., ., .) denotes (channels, height, width)) is downsampled before entering the four residual layers, whose feature maps are fused at a quarter of the input resolution. A 3x3 Conv layer is then used on the final feature map to reduce its channel dimensions to 256. Finally, three convolutions, 1x1, 1x1 and 2x2, are appended for the prediction of center, scale and offset, respectively.
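The three head outputs can be decoded into boxes along the following lines. This is a hypothetical sketch: the function name, score threshold and map layouts are our assumptions, while the stride of 4, the exponentiated scale prediction and the fixed 0.41 aspect ratio follow the CSP formulation.

```python
import numpy as np

def decode_csp(center, scale, offset, stride=4, ratio=0.41, thresh=0.5):
    """Decode CSP-style head outputs into boxes (x1, y1, x2, y2, score).

    center: (H/4, W/4) heatmap of center probabilities,
    scale:  (H/4, W/4) map of log-heights,
    offset: (2, H/4, W/4) sub-stride center offsets, ordered (dy, dx).
    """
    ys, xs = np.where(center >= thresh)
    boxes = []
    for y, x in zip(ys, xs):
        h = np.exp(scale[y, x])      # predicted pedestrian height in pixels
        w = ratio * h                # width from the fixed aspect ratio
        cy = (y + offset[0, y, x] + 0.5) * stride
        cx = (x + offset[1, y, x] + 0.5) * stride
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, center[y, x]))
    return boxes
```

In practice the resulting boxes would then be filtered by NMS, which is exactly where the width choice discussed later in the paper starts to matter.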
3.2 Switchable Normalization

According to the aforementioned revisit, we conclude that there are in total 50 BN layers in CSP [24]. Although the BN layers bring a performance improvement to CSP [24], as they do to other tasks, CSP [24] also suffers from the drawbacks of BN. On the one hand, a BN layer is unsuitable when the batch size is small, because a small batch size makes the training process noisy, i.e. the amplitude of the training loss is relatively large. On the other hand, the ablation study will show that even a bigger batch size, such as (1, 8), cannot guarantee convergence. Therefore, we replace all BN layers with SN layers. The effectiveness of this change will be shown in the ablation part; here we try to explain the reason for it.

To illustrate more specifically, we take the (1, 8) setting for instance, and the backbone is ResNet-50. The architecture of the network can be divided into 6 parts: the first 5 come from the backbone and the last one (denoted as After) is the detection head. The first part (denoted as Before) consists of the operations before the 4 layers of ResNet-50, and the next four parts are those four layers. There is only 1 BN layer in the first part, while there are 9, 12, 18, and 9 BN layers in the next four parts, respectively. Finally, the detection head has only 1 BN layer. As suggested in 2.3, switchable normalization is the combination of batch normalization, instance normalization, and layer normalization with different weights. Therefore, exploring the proportions of the different weights in each part of the network shows what actually makes the difference. For each part, we calculate the weights of each normalization method in the SN layers; the averages of these weights are shown in Fig. 3.

Figure 3: The proportion of the weight of each normalization method (IN, LN, BN) in the different parts (Before, Layer1–Layer4, After) is shown in the histogram. The weights of the means and the variances are displayed separately.

Although the different normalization methods have different weights in each part, we identify two main differences from the original BN layers. On the one hand, in Layer4, the weight of the BN variance is very small while the weight of the LN variance is very large. On the other hand, in the 'After' part, IN, LN and BN share similar weights. The conclusions are as follows: (i) The low BN variance weight in Layer4 decreases the influence of noise when estimating variance; in this way, high-level semantic information can be fully utilized during the inference process. (ii) The similar weights in the 'After' part enable the three normalization methods to play equally important roles. (iii) The different normalization methods in all parts complement one another.

3.3 Backbone

The feature extraction ability of the backbone is of great importance in object detection.
Some networks, such as ResNet-50 [10], ResNet-101 [10], VGG [36] and MobileNet [11], which were originally designed for image classification, are widely used in pedestrian detectors. In addition, some other networks, such as DetNet [18], are specially designed for object detection. In the original paper [24], ResNet-50 and MobileNet [11] are used as backbones. However, because of the nature of CSP [24], i.e. it fuses different levels of feature maps, it is suitable to use a deeper backbone network. In this way, the location information is still stored in the shallow feature maps, while higher-level semantic information is extracted at the same time.

Inspired by the aforementioned idea, we select two new backbones, expecting to obtain better performance. First, we use ResNet-101 as the backbone of our ACSP. Compared to ResNet-50, the only difference of ResNet-101 is its third layer: there are 23 Bottleneck blocks rather than 6. As a result, in our ACSP, the last two feature maps present higher-level semantic information than in CSP [24], while the localization information is unchanged. In theory, the fusion in our ACSP is therefore more efficient than in the original CSP [24]; we conduct an ablation study to verify this. Second, in [18], the authors point out that using DetNet [18] as the backbone, they achieve state-of-the-art results on the MS COCO benchmark [20]. Therefore, it seemed likely that DetNet [18] would improve the performance of the original CSP [24]. However, after fine-tuning the learning rate and other hyperparameters, we find it unpromising. We conclude that the reason is the following: one of the design concepts of DetNet [18] is to address the poor localization problem; in CSP [24], however, this problem is already solved by the fusion of different levels of layers and the efficient center prediction.
3.4 Input Resolution

In the original paper [24], Liu et al. do not justify the resizing process, i.e. why, in the training part, the authors resize the original picture shape of 1024 x 2048 to a smaller scale. In fact, as the ablation study will show, the original CSP [24] does not converge when trained on full-resolution inputs, while our ACSP does, and training at the original resolution yields better performance.

3.5 Compressing Width

From the original paper [24], we can see that the width of a box is obtained by multiplying the height by 0.41, which concurs with the pedestrian aspect ratio in the CityPersons dataset [49]. However, this is not suitable in the inference process: in crowded scenes, relatively wide boxes increase the chance of overlapping, and the NMS process then eliminates some of the boxes, so we lose some detections. As a result, we design a novel method to determine the width. On one hand, as mentioned before, a wide box is not appropriate. On the other hand, a box that is too narrow is also not suitable, because the IoU between detections and ground truths becomes small and the detections are no longer regarded as correct. Inspired by this analysis, we give our formula for calculating the width:

w = r · h,

where r is the aspect ratio (r < 0.41) and h is the predicted height of a bounding box. It should be mentioned that the exact form of our compressing width is not crucial, and we choose the most basic one; what matters most is the design concept.

3.6 Loss Function

As pointed out in [24], the total loss consists of a classification loss, a scale loss and an offset loss, with weights 0.01, 1 and 0.1, respectively. For the scale regression loss, [24] utilizes Smooth L1 to accelerate convergence. However, [38, 39, 55] show that vanilla L1 can work better than Smooth L1. Therefore, we replace Smooth L1 with vanilla L1 and experimentally set the weights to 0.01, 0.05 and 0.1, respectively. The effectiveness of this improvement will be shown in the ablation study.
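The difference between the two regression losses matters mainly for small residuals: Smooth L1 damps both the loss and its gradient near zero, while vanilla L1 keeps penalizing small errors at full strength. A small numpy sketch (function names are ours):

```python
import numpy as np

def smooth_l1(e, beta=1.0):
    """Smooth L1 (Huber-style) loss on residuals e: quadratic below beta."""
    e = np.abs(e)
    return np.where(e < beta, 0.5 * e ** 2 / beta, e - 0.5 * beta)

def vanilla_l1(e):
    """Plain L1 loss on residuals e."""
    return np.abs(e)

# For small residuals Smooth L1 shrinks the penalty sharply,
# while vanilla L1 keeps it proportional to the error.
e = np.array([0.1, 0.5, 2.0])
print(smooth_l1(e))   # [0.005 0.125 1.5  ]
print(vanilla_l1(e))  # [0.1 0.5 2. ]
```

This is why the scale-loss weight is lowered from 1 to 0.05 alongside the switch: vanilla L1 produces larger loss values and gradients for the many small residuals, so its weight must be rebalanced against the classification and offset terms.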
4 Experiments

4.1 Dataset

To prove the efficacy of our adaptations, we conduct experiments on the CityPersons dataset [49]. CityPersons was introduced recently, has high resolution, and is based on the CityScapes benchmark [3]. It includes 5,000 images with various occlusion levels. We train our model on the official training set with 2,975 images and test on the validation set with 500 images. In our tests, the input scale is 1x.
4.2 Implementation Details

ACSP is implemented in PyTorch [29]. The Adam [14] optimizer is utilized to optimize our network. As in CSP [24] and APD [47], moving average weights [40] are adopted; experiments show this helps achieve better performance. Unless otherwise stated, the backbone is ResNet-101 with all BN layers replaced by SN layers, pretrained on ImageNet [4]. We optimize the network on 2 GPUs (Tesla V100) with 2 images per GPU. The learning rate is 2 x 10^-4, and the training process is stopped after 150 epochs with 744 iterations per epoch. In the training process, we keep the original shape of the pictures, i.e. 1024 x 2048.

4.3 Ablation Study

In this section, we conduct an ablative analysis on the CityPersons dataset [49]. We use the most common and standard pedestrian detection criterion, log-average miss rate (denoted as MR), as the evaluation metric. The following MRs are all reported on the reasonable set.
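The log-average miss rate is commonly computed, following the Caltech evaluation protocol, by averaging the log miss rate at 9 false-positives-per-image (FPPI) reference points spaced evenly in log space over [10^-2, 10^0]. A sketch under that assumption (function name and curve handling are ours):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate over 9 FPPI reference points in [1e-2, 1e0].

    fppi: FPPI values of the detector's operating points, sorted ascending.
    miss_rate: the miss rate at each of those operating points.
    """
    refs = np.logspace(-2.0, 0.0, num=9)
    samples = []
    for r in refs:
        below = np.where(fppi <= r)[0]
        # if the curve never reaches this FPPI, take its lowest miss rate
        m = miss_rate[below[-1]] if below.size else miss_rate.min()
        samples.append(max(m, 1e-10))  # guard the log against zero
    return np.exp(np.log(samples).mean())
```

A flat curve is a quick sanity check: a detector whose miss rate is 0.25 at every FPPI has a log-average miss rate of exactly 0.25.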
What is the influence of the SN layer on training stability?

The stability of the training process is of great importance. It has two aspects: whether the network is sensitive to the batch size, and whether the performance becomes poor after many iterations. To answer these two questions, we compare our ACSP with the original CSP [24]. It should be mentioned that the learning rate is appropriate in the following experiments, i.e. the training loss decreases and converges.

For the first aspect, comparisons are shown in Table 1. To conduct a fair comparison, the only difference is that we replace all BN layers with SN layers, i.e. the backbone is still ResNet-50 and the training input scale is still 640 x 1280. The bracket (., .) denotes (number of GPUs, number of images per GPU); for instance, (4, 4) means 4 GPUs with 4 images per GPU. 'Con' means the training is convergent, i.e. the MR stays low no matter how many iterations are used. 'Exp' means the training is not convergent, i.e. the MR increases to 1 after several iterations. The improvement line shows the percentage decrease in MR from CSP [24] to ACSP. The table shows that when we choose the GPU number and the number of images per GPU carefully, such as (4, 2) and (2, 2), CSP [24] converges, whereas the other settings make it diverge. Note, in particular, the significant difference between (8, 1) and (1, 8) for CSP [24]: the divergence does not simply come from the BN layer itself, because BN statistics only degenerate in the (8, 1) setting (one image per GPU) rather than in (1, 8). In short, for CSP [24] only (4, 2) and (2, 2) bring convergent results; for ACSP, all of the results are convergent.
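The batch-size sensitivity of BN seen here can be illustrated with a toy numpy experiment (ours, not from the paper): the spread of the per-batch mean estimate that BN relies on grows as the number of images per GPU shrinks, roughly as one over the square root of the number of values reduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mean_spread(batch_size, trials=2000, hw=4):
    """Std of BN's per-batch mean estimate for one channel of (N, H, W) data."""
    samples = rng.standard_normal((trials, batch_size, hw, hw))
    return samples.mean(axis=(1, 2, 3)).std()

small = batch_mean_spread(batch_size=1)   # like (8, 1): 1 image per GPU
large = batch_mean_spread(batch_size=8)   # like (1, 8): 8 images per GPU
assert small > large  # noisier statistics with fewer images per GPU
```

With unit-variance inputs the two spreads land near 1/sqrt(16) = 0.25 and 1/sqrt(128) ≈ 0.09, which is why per-GPU batch size, not total batch size, is what BN cares about.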
Table 1: Comparisons of different batch sizes and different methods. The bracket (., .) denotes (number of GPUs, number of images per GPU).

Setting     | (4, 2)     | (4, 4)     | (2, 2)     | (1, 1)     | (1, 8)     | (8, 1)
CSP [24]    | 11.56% Con | 27.75% Exp | 11.34% Con | 16.35% Exp | 16.10% Exp | 14.51% Exp
ACSP        | 11.16% Con | 11.89% Con | 10.80% Con | 13.42% Con | 11.66% Con | 12.88% Con
Improvement | +3.46%     | +57.15%    | +4.76%     | +17.92%    | +27.58%    | +11.23%

Figure 4: Comparisons of different batch sizes: MR over training epochs for CSP and ACSP under the settings (1,1), (2,2), (4,2), (4,4), (8,1) and (1,8). For CSP [24], 4 experiment settings are not convergent; for ACSP, all experiments are convergent.

How important is the backbone?

In this part, we compare three different backbones, i.e. ResNet-50 [10], ResNet-101 [10] and DetNet [18]. The experiments are conducted with BN layers and with SN layers respectively. The experiment setting is (4, 2) and the input scale is 640 x 1280. From Table 2 we conclude: (i) As suggested in the theory part, ResNet-101 outperforms ResNet-50 no matter which normalization method is chosen. (ii) DetNet [18] slightly underperforms ResNet-50 and ResNet-101. (iii) As discussed before, replacing BN layers with SN layers brings a performance improvement on ResNet-50 and ResNet-101; on DetNet [18], however, the MR increases. That is partly because we could not find pretrained parameters for the SN layers in DetNet [18].

Table 2: Comparison between different backbones. The setting is (4, 2) and the input scale is 640 x 1280.

Backbone   | BN     | SN
ResNet-50  | 11.56% | 11.16%
ResNet-101 | –      | –
DetNet     | 12.66% | –

How important is the input resolution?
To prove the discussion in 3.4, we conduct experiments with different resolutions under different circumstances. From Table 3 we find that the original CSP [24] with BN layers does not converge unless the pictures are resized to 640 x 1280, while with SN layers training converges at both resolutions and the full resolution of 1024 x 2048 performs better.

Table 3: Comparison between different resolutions. The setting is (4, 2) and the backbone is ResNet-50.

Method\Resolution | 1024 x 2048 | 640 x 1280
BN                | Exp         | 11.56%
SN                | 11.08%      | 11.16%

Table 4: Comparison between different resolutions under different batch sizes. Resolution means the input picture scale. The normalization method is SN and the backbone is ResNet-101.

Batch\Resolution | 1024 x 2048 | 640 x 1280
(2, 2)           | 10.30%      | –
(4, 2)           | 10.69%      | –

What is the contribution of the SN layer to the MR?

As stated before, the SN layer brings a significant improvement when the batch size is not carefully selected. In addition, from Table 2 we conclude that the SN layer brings approximately a 0.4% improvement in MR. Table 3 shows that no matter which resolution we select, the SN layer always contributes to a performance improvement. Finally, as displayed in Table 4, we obtain our best performance with the help of the SN layer. In conclusion, the SN layer can totally substitute the BN layer in our ACSP.

How important are the compressing width and the vanilla L1 loss?

We discuss the contributions of the compressing width and the vanilla L1 loss together in this part. Experiments show that, for Smooth L1, setting r in the compressing width formula to 0.40 yields relatively good performance, and the same holds for the vanilla L1 loss. We therefore compare r = 0.41 with r = 0.40; the results are shown in Table 5. It can be seen that MR decreases by about 0.3% with the compressed r. As displayed in Table 6, after replacing Smooth L1 with vanilla L1, MR decreases to varying degrees on the reasonable set, partial set, and bare set; however, MR increases by about 0.2% on the heavy set.

Table 5: Comparison between r = 0.41 and r = 0.40 (Smooth L1).

r    | Reasonable | Heavy  | Partial | Bare
0.41 | 10.30%     | 46.12% | 9.15%   | –
0.40 | 10.00%     | 46.11% | 8.80%   | –

Table 6: Comparison between Smooth L1 and vanilla L1 (r = 0.40).

Loss       | Reasonable | Heavy  | Partial | Bare
Smooth L1  | 10.00%     | 46.11% | 8.80%   | –
Vanilla L1 | 9.27%      | 46.34% | 8.66%   | 5.62%
4.4 Comparison with the State of the Art

We compare our ACSP with all existing state-of-the-art detectors (including preprints) on the validation set of CityPersons. The results are shown in Table 7. The evaluation metric is MR. To conduct a fair comparison, all methods are trained on the training set without any extra data (except ImageNet), and the test input scale is 1x. Because of differences in training and test environments, i.e. most other methods use an Nvidia GTX 1080Ti GPU while we use an Nvidia Tesla V100 GPU, a runtime comparison would be meaningless and is not reported in our table.

Table 7: Comparisons with the state of the art on the validation set. The evaluation metric is MR and the input scale is 1x.

Method              | Backbone   | Reasonable | Heavy  | Partial | Bare
FRCNN [49]          | VGG-16     | 15.4%      | -      | -       | -
FRCNN+Seg [49]      | VGG-16     | 14.8%      | -      | -       | -
TLL [37]            | ResNet-50  | 15.5%      | 53.6%  | 17.2%   | 10.0%
TLL+MRF [37]        | ResNet-50  | 14.4%      | 52.0%  | 15.9%   | 9.2%
Repulsion Loss [43] | ResNet-50  | 13.2%      | 56.9%  | 16.8%   | 7.6%
OR-CNN [52]         | VGG-16     | 12.8%      | 55.7%  | 15.3%   | 6.7%
HBAN [25]           | VGG-16     | 12.5%      | 48.1%  | -       | -
ALF [23]            | ResNet-50  | 12.0%      | 51.9%  | 11.4%   | 8.4%
Adaptive NMS [21]   | -          | 11.9%      | 54.0%  | 11.4%   | 6.2%
CSP [24]            | ResNet-50  | 11.0%      | 49.3%  | 10.4%   | 7.3%
MGAN [28]           | VGG-16     | 10.5%      | 47.2%  | -       | -
PSC-Net [45]        | VGG-16     | 10.4%      | 39.7%  | -       | -
APD [47]            | ResNet-50  | 10.6%      | 49.8%  | 9.5%    | 7.1%
APD [47]            | DLA-34     | 8.8%       | 46.6%  | 8.3%    | 5.8%
ACSP (Smooth L1)    | ResNet-101 | 10.0%      | 46.1%  | 8.8%    | 6.0%
ACSP (Vanilla L1)   | ResNet-101 | 9.3%       | 46.3%  | 8.7%    | 5.6%

From the table, we can see that our ACSP achieves state-of-the-art performance on the bare set and the second best performance on the reasonable set, heavy set and partial set, where APD [47] with DLA-34 and PSC-Net [45] lead instead. On the heavy set, without any special occlusion handling process, we outperform the specially designed methods except for PSC-Net [45]. Also, we lag behind only APD [47] on the partial set.
5 Conclusion

In this paper, we propose several improvements to the original pedestrian detector CSP [24], which make the training process of our ACSP more robust, and we try to explain why we make these adaptations and why they make a difference. Furthermore, we propose a novel method to estimate the width of a bounding box. In addition, we explore some functions of Switchable Normalization which are not mentioned in its original paper [26]. Experiments are conducted on CityPersons [49], and we achieve state-of-the-art performance on the bare set and the second best performance on the reasonable set, heavy set and partial set. In the future, it would be interesting to explore the "representative point" rather than the "center point" of a pedestrian.
Acknowledgments

We thank the Informatization Office of Beihang University for providing the High Performance Computing Platform, which has 32 Nvidia Tesla V100 GPUs. This work is also supported by the School of Mathematical Sciences, Beihang University.
References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Markus Braun, Sebastian Krebs, Fabian Flohr, and Dariu M Gavrila. Eurocity persons: A novel benchmark for person detection in traffic scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1844–1861, 2019.
[3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
[6] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 304–311. IEEE, 2009.
[7] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
[8] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[12] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. FoveaBox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019.
[16] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
[17] Yuhong Li, Xiaofan Zhang, and Deming Chen. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1091–1100, 2018.
[18] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 334–350, 2018.
[19] Rainer Lienhart and Jochen Maydt. An extended set of Haar-like features for rapid object detection. In Proceedings of the International Conference on Image Processing, volume 1, pages I–I. IEEE, 2002.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[21] Songtao Liu, Di Huang, and Yunhong Wang. Adaptive NMS: Refining pedestrian detection in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6459–6468, 2019.
[22] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[23] Wei Liu, Shengcai Liao, Weidong Hu, Xuezhi Liang, and Xiao Chen. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 618–634, 2018.
[24] Wei Liu, Shengcai Liao, Weiqiang Ren, Weidong Hu, and Yinan Yu. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5187–5196, 2019.
[25] Ruiqi Lu and Huimin Ma. Semantic head enhanced pedestrian detection in a crowd. arXiv preprint arXiv:1911.11985, 2019.
[26] Ping Luo, Jiamin Ren, Zhanglin Peng, Ruimao Zhang, and Jingyu Li. Differentiable learning-to-normalize via switchable normalization. arXiv preprint arXiv:1806.10779, 2018.
[27] Woonhyun Nam, Piotr Dollár, and Joon Hee Han. Local decorrelation for improved detection. arXiv preprint arXiv:1406.1134, 2014.
[28] Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Mask-guided attention network for occluded pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 4967–4975, 2019.
[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[31] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[33] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
[34] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pages 2483–2493, 2018.
[35] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. CrowdHuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] Tao Song, Leiyu Sun, Di Xie, Haiming Sun, and Shiliang Pu. Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 536–551, 2018.
[38] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 2602–2611, 2017.
[39] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545, 2018.
[40] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195–1204, 2017.
[41] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[42] Paul Viola, Michael Jones, et al. Robust real-time object detection. International Journal of Computer Vision, 4(34-47):4, 2001.
[43] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7774–7783, 2018.
[44] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[45] Jin Xie, Yanwei Pang, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. PSC-Net: Learning part spatial co-occurrence for occluded pedestrian detection. arXiv preprint arXiv:2001.09252, 2020.
[46] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2018.
[47] Jialiang Zhang, Lixiang Lin, Yun-chen Chen, Yao Hu, Steven C. H. Hoi, and Jianke Zhu. CSID: Center, scale, identity and density-aware pedestrian detection in a crowd. CoRR, abs/1910.09188, 2019.
[48] Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. Towards reaching human performance in pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):973–986, 2017.
[49] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2017.
[50] Shanshan Zhang, Rodrigo Benenson, Bernt Schiele, et al. Filtered channel features for pedestrian detection. In CVPR, volume 1, page 4, 2015.
[51] Shanshan Zhang, Jian Yang, and Bernt Schiele. Occluded pedestrian detection through guided attention in CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6995–7003, 2018.
[52] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pages 637–653, 2018.
[53] Chunluan Zhou, Ming Yang, and Junsong Yuan. Discriminative feature transformation for occluded pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9557–9566, 2019.
[54] Chunluan Zhou and Junsong Yuan. Bi-box regression for pedestrian detection and occlusion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–151, 2018.
[55] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[56] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krähenbühl. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 850–859, 2019.
[57] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.