Crowd Counting via Hierarchical Scale Recalibration Network
Zhikang Zou, Yifan Liu, Shuangjie Xu, Wei Wei, Shiping Wen, Pan Zhou
Abstract.
The task of crowd counting is extremely challenging due to complicated difficulties, especially the huge variation in vision scale. Previous works tend to adopt a naive concatenation of multi-scale information to tackle it, while the scale shifts between the feature maps are ignored. In this paper, we propose a novel Hierarchical Scale Recalibration Network (HSRNet), which addresses the above issues by modeling rich contextual dependencies and recalibrating multiple scale-associated information. Specifically, a Scale Focus Module (SFM) first integrates global context into local features by modeling the semantic inter-dependencies along the channel and spatial dimensions sequentially. In order to reallocate channel-wise feature responses, a Scale Recalibration Module (SRM) adopts a step-by-step fusion to generate final density maps. Furthermore, we propose a novel Scale Consistency loss to constrain the scale-associated outputs to be coherent with the groundtruth at different scales. With the proposed modules, our approach can ignore various noises selectively and focus on appropriate crowd scales automatically. Extensive experiments on crowd counting datasets (ShanghaiTech, MALL, WorldExpo'10, and UCSD) show that our HSRNet can deliver superior results over all state-of-the-art approaches. More remarkably, we extend experiments to an extra vehicle dataset, whose results indicate that the proposed model generalizes to other applications.
Introduction.
The task of crowd counting aims to figure out the quantity of pedestrians in images or videos. It has drawn much attention recently due to its broad range of applications in video surveillance, traffic control, and metropolis safety. Moreover, the methods proposed for crowd counting can be generalized to similar tasks in other domains, including estimating the number of cells in a microscopic image [16], vehicle estimation in traffic congestion situations [10], and extensive environmental investigation [9].

With the rapid growth of convolutional neural networks, many CNN-based methods [18, 1, 25] have sprung up in the field of crowd counting and have made promising progress. However, dealing with large density variations is still a difficult but attractive issue. As illustrated in Figure 1, the crowd density within regions of a certain size varies across different locations of an image. Such a density shift also exists in patches of the same size across different images.

Huazhong University of Science and Technology, China. Email: [email protected], {u201712105,weiw,panzhou}@hust.edu.cn. Deeproute.ai, China. Email: [email protected]. University of Electronic Science and Technology, China. Email: [email protected]. ∗ Equal contributions. † Corresponding author.
Figure 1.
Regions of certain sizes exhibit diverse pedestrian distributions among different locations of images. This shift also exists across different images, which demonstrates the huge scale variations in crowd counting.

To address the scale variation, substantial progress has been achieved by designing multi-column architectures [23, 22], adaptively fusing feature pyramids [13], and modifying the receptive fields of CNNs [17]. Although these methods alleviate the scale problem to some extent, they suffer from two inherent algorithmic drawbacks. On the one hand, each sub-network or each layer in these models treats every pixel of the input equally while ignoring its particular suitability for the corresponding crowd scales, so noise is propagated through the whole pipeline. On the other hand, directly adding or concatenating multi-scale features causes scale chaos, since each feature map contains abundant scale shifts of different degrees.

To settle the above issues, we propose a novel Hierarchical Scale Recalibration Network (HSRNet) to leverage rich contextual dependencies and aggregate multiple scale-associated information. Our training phase contains two stages: the Scale Focus Module (SFM) and the Scale Recalibration Module (SRM). Since the receptive field sizes of the sequential convolutional layers in a deep network increase from shallow to deep, the scales of pedestrians they can capture differ from each other. This leads to two inferences: 1) the deeper the network flows, the wider the scale range that can be captured by the corresponding convolutional layers; 2) sensitivity to different scales varies across different layers of the network. Thus, we connect a Scale Focus Module (SFM) to each convolutional layer in the backbone network, which integrates global context into local features to boost the capability of intermediate features on the corresponding scales. More specifically, the SFM first compresses the input features in the spatial dimension and generates a set of channel-wise focus weights, which are utilized to update each channel map. Thus, each layer can emphasize the matching scale degree by adjusting channel-wise feature responses adaptively. Similarly, the context along the channel axis of the feature map is squeezed to generate a spatial-wise focus mask, which is applied to update the features at all positions using element-wise multiplication. Note that this strategy ensures that the output features focus more on the patches of images with appropriate scales instead of treating every pixel equally. By incorporating this module in the network, intermediate layers can focus on 'which' scale degree and 'where' scales distribute simultaneously, and hence enhance the discriminative power of the feature representations.

In a hierarchical architecture, the scale space increases from shallow to deep, which means feature maps from different layers contain scale asymmetry. Due to this, naive averaging or concatenation of multiple features is not an optimal solution. We propose a novel Scale Recalibration Module (SRM) to further achieve adaptive scale recalibration and generate multi-scale predictions at different stages. Specifically, this module takes the feature maps processed by the SFM as input and then slices these features along the channel dimension. Since each channel is associated with a certain scale, the pieces corresponding to the same scale can be recombined through stacking to obtain scale-associated outputs.
In this case, each output can capture a certain scale of crowds and give an accurate prediction on the patches of that scale. We fuse these outputs to generate the final density map, which can respond accurately to crowd images of diverse scales. To enforce the network to produce consistent multi-scale density maps, we propose a Scale Consistency loss to pose supervision on the scale-associated outputs. It is computed by generating multi-scale groundtruth density maps and optimizing each side output towards the corresponding scale maps.

In general, the contributions of our work are three-fold:
• We propose a Scale Focus Module (SFM) to enhance the representation power of local features. By modeling rich contextual dependencies along the channel and spatial dimensions, different layers in the network can focus on the appropriate scales of pedestrians.
• We propose a Scale Recalibration Module (SRM) to recalibrate and aggregate multi-scale features from sequential layers at different stages. It significantly enhances the adaptability of the structure to complicated scenes with diverse scale variations.
• We propose a Scale Consistency loss to supervise the scale-associated outputs at different scale levels, which enforces the network to produce consistent density maps at multiple scales.
Related Work.
Previous frameworks are mainly composed of two paradigms: 1) people detection or tracking [27]; 2) feature-based regression [6]. However, these methods are generally impractical due to their poor performance and high computational cost. As the utilization of Convolutional Neural Networks (CNNs) has boosted improvements in various computer vision tasks [29, 36, 3, 18], most recent works are inclined to use CNN-based methods. They tend to generate accurate density maps whose integral indicates the total crowd count. However, it is still challenging to achieve precise pedestrian counting in extremely complicated scenes due to the presence of various complexities, especially scale variations.

To tackle the above issues, many existing approaches focus on improving the scale invariance of features using multi-column structures for crowd counting [34, 23, 22, 7]. Specifically, they utilize multiple branches, each of which has its own filter size, to strengthen the ability to learn density variations across diverse feature resolutions. Despite the promising results, these methods are limited by two drawbacks: 1) a large number of parameters usually results in difficulties for training; 2) the existence of ineffective branches leads to structural redundancy.

In addition to multi-column networks, some methods adopt multi-scale but single-column architectures [35, 37]. For instance, Zhang et al. [32] propose an architecture which extracts and fuses feature maps from different layers to generate high-quality density maps. In SANet [4], Cao et al. deploy an encoder-decoder architecture, in which the encoder utilizes scale aggregation modules to extract multi-scale features and the decoder generates high-resolution density maps via transposed convolutions. Li et al. [17] replace pooling layers with dilated kernels to deliver larger receptive fields, which effectively capture more spatial information. After this, Kang et al. [14] design two different networks and evaluate the quality of the generated density maps on three crowd analysis tasks.

However, all these methods directly fuse multi-layer or multi-column features to generate the final density maps. This ignores the unique perception of each part of the scale diversity and thus causes scale chaos in the output result. Besides, attention-based methods [11, 26] have proved their effectiveness in several deep learning tasks. These approaches work by allocating computational resources towards the most relevant part of the information. In this paper, we propose a novel Hierarchical Scale Recalibration Network (HSRNet) to resolve the severe difficulties of scale variations. Our method differs in two aspects: firstly, instead of treating whole images equally, our method is able to focus on the appropriate scales of the crowds; secondly, scale recalibration takes place to effectively exploit the specialization of the components in the whole architecture.
Our Approach.
The primary objective of our model is to learn a mapping F : X → Y, in which X denotes the input image data, and the learning target Y has two choices: a density map or the total crowd count. Motivated by the aforementioned observations, we choose the density map as the main task of our model in the training phase to involve the spatial distribution information for a better representation of crowds, which is realizable with a fully convolutional network. In this paper, we propose a novel Hierarchical Scale Recalibration Network (HSRNet) to address the scale variations in crowd counting, the overall architecture of which is shown in Figure 2. For fair comparison with previous works [20, 17, 2], we choose the VGG-16 [21] network as the backbone by reason of its strong representative ability and adjustable structure for subsequent feature fusion. The last pooling layer and the classification part composed of fully-connected layers are removed, because the task of counting requires pixel-level predictions and spatial information loss must be prevented. Thereby our backbone consists of five stages (Conv1 ∼ Conv5, respectively). To focus on the appropriate scales of pedestrians, we connect the proposed Scale Focus Module (SFM) to the last convolutional layer in each stage except the first one, obtaining fine-grained features from multiple layers. The reason why we exclude the first stage is that the receptive field sizes of the first convolutional layers are too small to capture any information about crowds. Since the scales of pedestrians that the convolutional layers can capture vary across different stages, we send these features, after the processing of the SFM, to the Scale Recalibration Module (SRM) to reallocate scale-aware responses by a slice/stack strategy. Thus each side output corresponds to a certain scale and provides an accurate crowd prediction for that scale. With the utilization of a deconvolutional layer, each prediction keeps the same resolution as the input image. The final density map is generated by fusing these scale-associated outputs. To guarantee that each scale-associated output is optimized towards a specific direction, we propose a Scale Consistency loss to supervise the target learning.
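To make the overall data flow concrete, here is a minimal PyTorch sketch of the pipeline just described. It is our illustration, not released code: the SFM and SRM are replaced by placeholders (identity modules and per-stage 1×1 heads) that are fleshed out in the following subsections.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class HSRNetSkeleton(nn.Module):
    """High-level data flow only; SFM/SRM internals are sketched later."""
    def __init__(self):
        super().__init__()
        f = vgg16(weights=None).features
        # Five VGG-16 stages; the last pooling layer is dropped (f[:30]).
        self.stage1, self.stage2 = f[:5], f[5:10]
        self.stage3, self.stage4, self.stage5 = f[10:17], f[17:24], f[24:30]
        # Identity placeholders stand in for the four SFMs ...
        self.sfms = nn.ModuleList(nn.Identity() for _ in range(4))
        # ... and 1x1 heads stand in for the SRM (which really produces
        # five scale-associated maps; see the SRM sketch below).
        self.heads = nn.ModuleList(
            nn.Conv2d(c, 1, 1) for c in (128, 256, 512, 512))
        self.fuse = nn.Conv2d(4, 1, 1)  # final 1x1 fusion

    def forward(self, x):
        h, w = x.shape[2:]
        x = self.stage1(x)
        outs = []
        for stage, sfm, head in zip(
                (self.stage2, self.stage3, self.stage4, self.stage5),
                self.sfms, self.heads):
            x = stage(x)
            d = head(sfm(x))            # one side prediction per stage
            outs.append(nn.functional.interpolate(
                d, size=(h, w), mode='bilinear', align_corners=False))
        return self.fuse(torch.cat(outs, 1)), outs

x = torch.randn(1, 3, 224, 224)
density, side_outputs = HSRNetSkeleton()(x)
print(density.shape)  # torch.Size([1, 1, 224, 224])
```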
Figure 2.
The detailed structure of the proposed Hierarchical Scale Recalibration Network (HSRNet). The single-column baseline network at the top is VGG-16; at the end of each convolutional stage, feature maps are sent to a Scale Focus Module (SFM) to obtain refined feature maps. Then, the Scale Recalibration Module (SRM) processes these features to obtain multi-scale predictions, which are finally fused by a 1×1 convolution to generate the final density map.
Scale Focus Module.
The Scale Focus Module is designed to enforce the sequential layers at different stages to focus on the appropriate scales of pedestrians by encoding global contextual information into local features. Since the receptive fields of the convolutional layers accumulate from shallow to deep, the scale space they are able to cope with increases accordingly. Besides, there are specific shifts between their representation abilities on crowd scales, which indicates that different stages should be responsible for the corresponding scales. With these observations, we generate adaptive weights to emphasize feature responses in the channel and spatial dimensions respectively.

In the channel dimension, each channel map of the intermediate features can be regarded as a scale-aware response. For a high-level feature, different channels are associated with each other in terms of semantic information. By exploring the inter-channel interdependencies, we can generate channel-focus weights to modify the ratio of different channels with corresponding scales. A similar strategy applies to the spatial dimension: treating the whole image equally is improper, since the varying crowd distribution leads to different scale spaces in local patches. However, local features generated merely by standard convolutions are not able to express the whole semantic information. Thus, we generate spatial-focus weights to select attentive regions in the feature maps, which enhances the representative capability of the features. Since the channel-focus and spatial-focus weights attend to 'which' scale degree and 'where' scales distribute, they are complementary to each other, and their combination boosts the discriminative power of the feature representations.

Formally, given an image X of size 3 × H × W, the output features from the Conv(i+1) stages of the backbone are defined as F = {F_i, i = 1, 2, 3, 4}. Due to the existence of pooling layers, F_i has a resolution of H_i × W_i. For the output feature F_i ∈ R^{C_i × H_i × W_i}, we first squeeze the global spatial information to generate channel-wise statistics Z_i ∈ R^{C_i} by utilizing the global average pooling function H_{avg}. Thus the j-th element of Z_i is defined as

Z_{ij} = H_{avg}(F_{ij}) = \frac{1}{H_i \times W_i} \sum_{m=1}^{H_i} \sum_{n=1}^{W_i} F_{ij}(m, n)    (1)

where F_{ij}(m, n) represents the pixel value at position (m, n) of the j-th channel of F_i. Such a channel statistic merely collects local spatial information and views each channel independently, which fails to express the global context. Therefore, we add fully-connected layers and introduce a gating mechanism to further capture channel-wise dependencies. The gating mechanism is supposed to meet two criteria: first, it should be capable of exploiting a nonlinear interaction among channels; second, to emphasize multiple channels, it should capture a non-mutually-exclusive relationship. We use the sigmoid activation to realize the gating mechanism:

S_i = Sigmoid(W_2 \cdot ReLU(W_1 Z_i))    (2)

where W_1 ∈ R^{C_i/r × C_i} and W_2 ∈ R^{C_i × C_i/r}. This operation can be parameterized as two fully-connected (FC) layers, where one defines the channel-reduction layer (reduction ratio r = 64) and the other represents the channel-increasing layer. After this non-linear activation, we combine the channel-focus weights S_i and the input feature F_i using an element-wise multiplication to generate the intermediate feature \hat{S}_i:

\hat{S}_i = S_i \cdot F_i,  i ∈ {1, 2, 3, 4}    (3)

Thus, the global channel information is encoded into local features.
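As a concrete reference for Eqs. (1)-(3), below is a minimal PyTorch sketch of the channel-focus step: global average pooling, a two-layer fully-connected gate with reduction ratio r, a sigmoid, and channel-wise rescaling. The pattern mirrors squeeze-and-excitation [11]; the class name and exact layer choices are our assumptions.

```python
import torch
import torch.nn as nn

class ChannelFocus(nn.Module):
    """Eqs. (1)-(3): squeeze spatial dims, gate channels, rescale the input."""
    def __init__(self, channels, reduction=64):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc1 = nn.Linear(channels, hidden)   # channel-reduction layer W_1
        self.fc2 = nn.Linear(hidden, channels)   # channel-increasing layer W_2

    def forward(self, f):                        # f: (B, C, H, W)
        z = f.mean(dim=(2, 3))                   # Eq. (1): global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # Eq. (2)
        return f * s[:, :, None, None]           # Eq. (3): channel-wise rescaling

f = torch.randn(2, 512, 28, 28)
print(ChannelFocus(512)(f).shape)  # torch.Size([2, 512, 28, 28])
```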
Then we take the mean value of \hat{S}_i over its channels to generate the spatial statistic M_i ∈ R^{1 × H_i × W_i}. After squeezing the information among channels, we feed the spatial statistic M_i into a convolutional layer to generate a spatial-focus weight \hat{M}_i:

M_i = \frac{1}{C_i} \sum_{j=1}^{C_i} \hat{S}_{ij},  \hat{M}_i = Sigmoid(H_c(M_i))    (4)

where H_c indicates the convolution process. Here, the kernel size of the convolutional layer is set to 7, which is capable of providing a broader view. Then we perform an element-wise multiplication between \hat{M}_i and \hat{S}_i to obtain the final output \hat{F}_i:

\hat{F}_i = \hat{M}_i \cdot \hat{S}_i,  i ∈ {1, 2, 3, 4}    (5)

Note that the spatial-focus weight is copied and applied to each channel of the input in the same way.

Scale Recalibration Module.
As the network gets deeper, deep layers can capture more complex and high-level features, while shallow layers reserve more spatial information. Therefore, by fusing features from low-level layers with those from high-level layers, our network can extract stable features no matter how complicated the crowd scenes are. Unlike previous works [23, 13], we design a Scale Recalibration Module (SRM) to recalibrate and aggregate multi-scale features rather than using direct averaging or concatenation.

Based on the above analysis, a deep layer has a wider range of scale space and meanwhile a stronger response to larger scales of crowds. Formally, assuming that the outputs of the Scale Focus Module are \hat{F} = {\hat{F}_i, i = 1, 2, 3, 4}, we first send these features into a 1 × 1 convolutional layer and then a deconvolutional layer respectively to obtain multi-scale score maps:

E_{i+1} = H_{dc}(H_c(\hat{F}_i))    (6)

where H_{dc} is the deconvolution operation. Here, the channel numbers of the sequential score maps are 2, 3, 4, and 5 from E_2 to E_5 respectively, which corresponds to the scale space contained in each stage. The obtained multi-scale score maps E_i contain multi-scale information from layers of different depths. However, the information is chaotic. For instance, E_5 captures multi-scale information delivered from Conv{2, 3, 4, 5}, since low-level features are transmitted to the latter stages of the backbone. To recalibrate the channel-level statistics, we adopt a slice/stack strategy. Specifically, we slice each score map into pieces along its channel dimension to obtain feature maps E_{ij}, where j indexes the channel, and then group them into five multi-scale feature sets from E_2 to E_5: {E_{21}, E_{31}, E_{41}, E_{51}}, {E_{22}, E_{32}, E_{42}, E_{52}}, {E_{33}, E_{43}, E_{53}}, {E_{44}, E_{54}}, and {E_{55}}. Each set is associated with a certain scale, and we stack the features in each set respectively to generate the corresponding multi-scale predictions D = {D_i, i = 1, 2, 3, 4, 5}. By utilizing this strategy, each prediction is able to provide an accurate number of pedestrians at a certain scale. These predictions are complementary to each other, and their combination covers crowd distributions with various scale variations. Thus we send them into a 1 × 1 convolutional layer to generate the final density map D. Overall, the scale-specific predictions are obtained only with convolutional layers and the slice/stack strategy, which is parameter-saving and time-efficient. With the Scale Recalibration Module, the final output is robust to diverse crowd scales in highly complicated scenes.
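The spatial-focus step of Eqs. (4)-(5) and the SRM regrouping of Eq. (6) can be sketched in the same style, complementing the ChannelFocus sketch above. Two stated assumptions: bilinear upsampling stands in for the deconvolution H_{dc}, and each stacked scale set is reduced by averaging, since the paper does not spell out the reduction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialFocus(nn.Module):
    """Eqs. (4)-(5): squeeze channels by averaging, 7x7 conv + sigmoid,
    then rescale every channel by the resulting spatial mask."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, s_hat):                       # s_hat: (B, C, H, W)
        m = s_hat.mean(dim=1, keepdim=True)         # M_i, Eq. (4)
        return s_hat * torch.sigmoid(self.conv(m))  # Eq. (5), broadcast over C

class ScaleRecalibration(nn.Module):
    """Eq. (6) plus the slice/stack regrouping. Stage i yields a score map
    E_{i+1} with i+1 channels; channel j of every map is tied to scale j."""
    def __init__(self, in_channels=(128, 256, 512, 512)):
        super().__init__()
        # H_c: 1x1 convs producing 2, 3, 4, 5 scale channels for E_2..E_5.
        self.score = nn.ModuleList(
            nn.Conv2d(c, i + 2, 1) for i, c in enumerate(in_channels))

    def forward(self, feats, out_size):
        # Bilinear upsampling stands in for the deconvolution H_dc.
        maps = [F.interpolate(conv(f), size=out_size, mode='bilinear',
                              align_corners=False)
                for conv, f in zip(self.score, feats)]
        preds = []
        for j in range(5):                           # scale j + 1
            # Stack channel j of every score map that contains it; reducing
            # the stack by averaging is our assumption (the paper only says
            # the slices are stacked into scale-associated outputs).
            stack = torch.cat([m[:, j:j + 1] for m in maps if m.size(1) > j], 1)
            preds.append(stack.mean(dim=1, keepdim=True))    # D_1..D_5
        return preds

feats = [torch.randn(1, c, s, s)
         for c, s in [(128, 56), (256, 28), (512, 14), (512, 14)]]
preds = ScaleRecalibration()(feats, out_size=(224, 224))
print([tuple(p.shape) for p in preds])   # five (1, 1, 224, 224) maps
```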
Scale Consistency Loss.
The groundtruth density map D^{GT} is converted from the dot maps which contain a labeled location at the center of each pedestrian's head. Supposing a pedestrian head at pixel x_i, we represent each head annotation of the image as a delta function δ(x − x_i) and blur it with a Gaussian kernel G_σ (σ refers to the standard deviation), so that the density map D^{GT} is obtained via the formula below:

D^{GT}(x) = \sum_{i \in S} \delta(x - x_i) * G_{\sigma_i},  with  \sigma_i = \beta \bar{d}_i    (7)

where S is the set of head annotations, \bar{d}_i refers to the average distance between x_i and its k nearest annotations, and β is a parameter. We use these geometry-adaptive kernels following MCNN [34] to tackle the perspective distortion in highly complicated scenes. The Euclidean distance is utilized to define the density map loss, which can be formulated as follows:

L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \| D(X_i; \theta) - D_i^{GT} \|_2^2    (8)

where θ denotes the parameters of the network and N is the number of training images. Usually, this loss is merely calculated between the final density map and the groundtruth map. In this paper, we propose a novel Scale Consistency loss to guide the multi-scale predictions to be optimized towards their corresponding scale maps. Specifically, we use average pooling to obtain the groundtruth pyramid D_i^{GT}, i = 1, ..., 5; the receptive fields of the pooling filters are 1, 2, 4, 8, and 16, respectively. Then these maps are upsampled to the same size as the original image through bilinear interpolation. We compute the loss for each pair {D_i, D_i^{GT}} and obtain the loss pyramid {L_0, L_1, L_2, L_3, L_4, L_5}, where L_0 denotes the loss on the final density map. The total loss of our model can be defined as:

L = L_0 + \sum_{i=1}^{5} \lambda_i \cdot L_i    (9)

where λ_i is a scale-specific weight. It can be gradually optimized to adaptively adjust the ratio between the losses.
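A minimal sketch of the Scale Consistency supervision, assuming equal λ weights and PyTorch's mean-reduced MSE as the Euclidean term (which matches Eq. (8) up to the normalization constant):

```python
import torch
import torch.nn.functional as F

def scale_consistency_loss(preds, final_pred, gt, lambdas=(1., 1., 1., 1., 1.)):
    """Eq. (9) sketch: Euclidean loss on the final map plus one loss per
    scale-associated output against an average-pooled groundtruth pyramid.
    Equal lambda weights are an assumption; the paper adjusts them."""
    loss = 0.5 * F.mse_loss(final_pred, gt)           # L_0, Eq. (8) form
    for k, (pred, lam) in enumerate(zip(preds, lambdas)):
        ksize = 2 ** k                                 # pooling kernels 1,2,4,8,16
        gt_k = F.avg_pool2d(gt, ksize) if ksize > 1 else gt
        gt_k = F.interpolate(gt_k, size=gt.shape[2:], mode='bilinear',
                             align_corners=False)      # back to input resolution
        loss = loss + lam * 0.5 * F.mse_loss(pred, gt_k)
    return loss

gt = torch.rand(1, 1, 224, 224)
preds = [torch.rand(1, 1, 224, 224) for _ in range(5)]
print(scale_consistency_loss(preds, preds[0], gt))
```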
Experiments.
In this section, we evaluate our method on four publicly available crowd counting datasets: ShanghaiTech, WorldExpo'10, UCSD, and MALL. Compared with previous approaches, the proposed HSRNet achieves state-of-the-art performance. Besides, experiments on the vehicle dataset TRANCOS are performed to testify to the generalization capability of our model. Furthermore, we conduct thorough ablation studies to verify the effectiveness of each component of our model. Experimental settings and results are detailed below.

Data Augmentation.
We first crop four patches at the four quarters of the image without overlapping, each 1/4 the size of the original image resolution. By this operation, our training data cover the whole image. Then, we crop 10 patches at random locations of each image with the same size. Also, random scaling is utilized to construct a multi-scale image pyramid with scales from 0.8 to 1.2 of the original image, incremented in intervals of 0.2. During testing, whole images are fed into the network rather than cropped patches.
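A sketch of this augmentation pipeline, assuming PIL images; in practice the same crops and scalings must be applied to the corresponding density maps as well:

```python
import random
from PIL import Image

def augment(img: Image.Image, n_random=10):
    """Four quarter crops covering the whole image, plus n_random crops of
    the same size at random positions, each rescaled by 0.8, 1.0, or 1.2."""
    w, h = img.size
    cw, ch = w // 2, h // 2                  # each patch is 1/4 of the image
    patches = [img.crop((x, y, x + cw, y + ch))
               for x in (0, cw) for y in (0, ch)]   # the four quarters
    for _ in range(n_random):
        x, y = random.randint(0, w - cw), random.randint(0, h - ch)
        patches.append(img.crop((x, y, x + cw, y + ch)))
    scaled = []
    for p in patches:
        s = random.choice((0.8, 1.0, 1.2))   # scales 0.8-1.2 in steps of 0.2
        scaled.append(p.resize((int(cw * s), int(ch * s)), Image.BILINEAR))
    return scaled
```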
Training Phase.
We train the proposed HSRNet in an end-to-end manner. The first ten convolutional layers of our model are initialized from the pre-trained VGG-16 [21], while the rest of the convolutional layers are initialized by a Gaussian distribution with zero mean and a standard deviation of 0.01. We use the Adam optimizer [15] with an initial learning rate of 1e-5. In Eqn. 7, k is set to 3 and β is set to 0.3 following MCNN [34].
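A sketch of the stated initialization and optimizer setup; the copy loop assumes the model's first ten conv layers match VGG-16's shapes:

```python
import torch.nn as nn
from torchvision.models import vgg16

def initialize(model: nn.Module, n_pretrained: int = 10) -> None:
    """Copy the first ten conv layers from pre-trained VGG-16 and draw the
    remaining conv weights from N(0, 0.01^2), as described in the text."""
    vgg_convs = [m for m in vgg16(weights='IMAGENET1K_V1').features
                 if isinstance(m, nn.Conv2d)]
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    for i, conv in enumerate(convs):
        if i < n_pretrained:
            conv.weight.data.copy_(vgg_convs[i].weight.data)  # assumes matching shapes
            if conv.bias is not None:
                conv.bias.data.copy_(vgg_convs[i].bias.data)
        else:
            nn.init.normal_(conv.weight, mean=0.0, std=0.01)
            if conv.bias is not None:
                nn.init.zeros_(conv.bias)

# model = ...; initialize(model)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```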
Evaluation Metrics.
Following existing state-of-the-art methods [17, 22], the mean absolute error (MAE) and the mean squared error (MSE) are used to evaluate the performance on the test set, which can be described as follows:
MAE = \frac{1}{N} \sum_{i=1}^{N} | z_i - \hat{z}_i |    (10)

MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( z_i - \hat{z}_i )^2 }    (11)

where N is the number of test images, z_i is the groundtruth count of the i-th image, and \hat{z}_i refers to the total count integrated from the corresponding estimated density map.
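The two metrics in plain Python, where each estimated count ẑ_i is obtained by integrating (summing) the predicted density map:

```python
import math

def mae_mse(gt_counts, est_counts):
    """Eqs. (10)-(11): MAE and (root) MSE over per-image crowd counts."""
    n = len(gt_counts)
    mae = sum(abs(z - z_hat) for z, z_hat in zip(gt_counts, est_counts)) / n
    mse = math.sqrt(
        sum((z - z_hat) ** 2 for z, z_hat in zip(gt_counts, est_counts)) / n)
    return mae, mse

# z_hat would be density_map.sum() for each test image
print(mae_mse([23, 36, 1581], [23.2, 35.3, 1577.6]))
```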
Table 1. Experimental results on the ShanghaiTech dataset.
Method | Part A MAE | Part A MSE | Part B MAE | Part B MSE
MCNN [34] | 110.2 | 173.2 | 26.4 | 41.3
Switching-CNN [20] | 90.4 | 135.0 | 21.6 | 33.4
IG-CNN [1] | 72.5 | 118.2 | 13.6 | 21.1
CSRNet [17] | 68.2 | 115.0 | 10.6 | 16.0
SANet [4] | 67.0 | 104.5 | 8.4 | 13.6
TEDnet [12] | 64.2 | 109.1 | 8.2 | 12.8
SFCN [25] | 64.8 | 107.5 | 7.6 | 13.0
HSRNet (ours) | | | |

Table 2. Experimental results on the UCSD and MALL datasets.
Method | UCSD MAE | UCSD MSE | MALL MAE | MALL MSE
Ridge Regression [6] | 2.25 | 7.82 | 3.59 | 19.0
CNN-Boosting [24] | 1.10 | - | 2.01 | -
MCNN [34] | 1.07 | 1.35 | 2.24 | 8.5
ConvLSTM-nt [28] | 1.73 | 3.52 | 2.53 | 11.12
Bidirectional ConvLSTM [28] | 1.13 | 1.43 | 2.10 | 7.6
CSRNet [17] | 1.16 | 1.47 | - | -
HSRNet (ours) | | | |
We evaluate the performance of our model on four benchmark datasets and a vehicle dataset. Overall, the proposed HSRNet achieves superior results over existing state-of-the-art methods.
ShanghaiTech.
The ShanghaiTech dataset [34] consists of 1198 images containing a total of 330,165 annotated persons. It is separated into two parts: Part A with 482 pictures and Part B with 716 pictures. Part A, composed of rather congested images, is randomly captured from the web, while Part B comprises images with relatively low density captured from street views. Following the setup in [34], we use 300 images to form the training set and the remaining 182 images for testing in Part A, while in Part B, 400 images compose the training set and the remaining 316 are used as the testing set. The comparison between the proposed HSRNet and previous methods is reported in Table 1. It shows that our model achieves superior performance on both parts of the ShanghaiTech dataset.
UCSD.
The UCSD dataset [5] contains 2000 frames of size 158×238 captured by surveillance cameras. This dataset includes relatively sparse crowds, with counts varying from 11 to 46 per image. The authors also provide an ROI to discard irrelevant areas of the image. Utilizing the vertices of the ROI given for each scene, we mask the final multi-scale prediction from the fusing layer with the given ROI, so that activations outside the ROI regions are zeroed out. We use frames 601 to 1400 as the training set and the rest as the testing set. Table 2 illustrates that our model achieves the lowest MAE and MSE compared with previous works, which indicates that HSRNet performs well in both dense and sparse scenarios.
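ROI handling reduces to masking the prediction before integration; a tiny sketch with a stand-in random mask:

```python
import torch

def apply_roi(density: torch.Tensor, roi: torch.Tensor) -> torch.Tensor:
    """Zero out density predictions outside the ROI before counting."""
    return density * roi.to(density.dtype)

density = torch.rand(1, 1, 158, 238)        # UCSD frames are 158x238
roi = (torch.rand(158, 238) > 0.2)          # stand-in binary ROI mask
print(apply_roi(density, roi).sum())        # ROI-restricted count
```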
MALL.
The MALL dataset [6] is collected from a publicly accessible surveillance webcam. The dataset contains 2,000 annotated frames of static and moving pedestrians, with more challenging illumination conditions and severer perspective distortion compared to the UCSD dataset. The ROI and perspective map are also provided with the dataset. Following Chen et al. [6], the first 800 frames are used to compose the training set and the remaining 1,200 are used as the testing set. As shown in Table 2, we beat the second best approach by a 14.3% improvement in MAE and a 70% improvement in MSE.

Table 3.
Estimation results on the WorldExpo'10 dataset (MAE).
Method | S1 | S2 | S3 | S4 | S5 | Avg
MCNN [34] | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6
Switching-CNN [20] | 4.4 | 15.7 | 10.0 | 11.0 | 5.9 | 9.4
IG-CNN [1] | 2.6 | 16.1 | 10.15 | 20.2 | 7.6 | 11.3
CSRNet [17] | 2.9 | 11.5 | | | |
HSRNet (ours) | | | | | |
WorldExpo’10.
The WorldExpo'10 dataset is introduced by Zhang et al. [31], consisting of 3980 frames in total, extracted from 1132 video sequences captured by 108 surveillance cameras during the Shanghai 2010 WorldExpo. Each sequence has a frame rate of 50 fps and a resolution of 576×720 pixels. For the test set, the frames are split into five scenes, named S1 ∼ S5 respectively. Besides, the ROI (region of interest) and perspective maps are provided for this dataset. Due to the abundant surveillance video data, this dataset is suitable for verifying our model in visual surveillance. The comparison between our HSRNet and previous methods is reported in Table 3. Overall, our model achieves the best average MAE performance compared with existing approaches.

TRANCOS.
The TRANCOS dataset [10] is a vehicle counting dataset consisting of 1244 images of various traffic scenes captured by real video surveillance cameras, with 46,796 annotated vehicles in total. An ROI is provided per image. Images in the dataset cover very different traffic scenarios, without perspective maps. During training, 800 patches of size 115×115 pixels are randomly cropped from each image, and the groundtruth density maps are generated by blurring the annotations with a 2D Gaussian kernel. Different from the counting datasets above, we use the GAME metric for evaluation during testing, which can be formulated as follows:
GAME(L) = \frac{1}{N} \sum_{n=1}^{N} \sum_{l=1}^{4^L} | D_{I_n}^{l} - D_{I_n^{gt}}^{l} |    (12)

where N denotes the number of images. Given a specific level L, GAME(L) divides each image into 4^L non-overlapping regions of equal area; D_{I_n}^{l} is the estimated count for image n within region l, and D_{I_n^{gt}}^{l} is the corresponding groundtruth count. Note that GAME(0) equals the aforementioned MAE metric. The results shown in Table 4 indicate that HSRNet achieves state-of-the-art performance on all four GAME metrics, which demonstrates the generalization ability of our model.
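A per-image implementation of GAME(L) under its standard definition (a 2^L × 2^L grid, i.e. 4^L regions):

```python
import numpy as np

def game(est_density, gt_density, L):
    """GAME(L) for one image, Eq. (12): split the map into 4**L equal regions
    (a 2**L x 2**L grid) and accumulate per-region count errors.
    GAME(0) reduces to the absolute counting error averaged in MAE."""
    h, w = gt_density.shape
    k = 2 ** L
    err = 0.0
    for i in range(k):
        for j in range(k):
            sl = (slice(i * h // k, (i + 1) * h // k),
                  slice(j * w // k, (j + 1) * w // k))
            err += abs(est_density[sl].sum() - gt_density[sl].sum())
    return err

est, gt = np.random.rand(96, 96), np.random.rand(96, 96)
print([game(est, gt, L) for L in range(4)])
```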
Qualitatively, we visualize the density maps generated by the proposed HSRNet on these five datasets in Figure 3. It is worth noting that our model generates high-quality density maps and produces accurate crowd counts.

Ablation Study.
In this section, we conduct further experiments to explore the details of the model design and the network parameters. All experiments in this section are performed on the ShanghaiTech dataset because of its large scale variations.

Architecture learning.
We first evaluate the impact of each component in our architecture by separating all the modules and reorganizing them step by step. We perform this experiment on the ShanghaiTech Part A dataset, and the results are listed in Table 5.
Figure 3.
From left to right, the images are taken from the ShanghaiTech Part A, ShanghaiTech Part B, UCSD, MALL, WorldExpo'10, and TRANCOS datasets. The second row shows the groundtruth density maps and the third row displays the output density maps generated by our HSRNet. The groundtruth/estimated counts of the displayed examples are 1581/1577.6, 23/23.2, 28/29, 26/24.2, 23/23.5, and 36/35.3.
Table 4.
Experimental results on the TRANCOS dataset.
Method | GAME 0 | GAME 1 | GAME 2 | GAME 3
Fiaschi et al. [8] | 17.77 | 20.14 | 23.65 | 25.99
Lempitsky et al. [16] | 13.76 | 16.72 | 20.72 | 24.36
Hydra-3s [19] | 10.99 | 13.75 | 16.69 | 19.32
FCN-HA [33] | 4.21 | - | - | -
CSRNet [17] | 3.56 | 5.49 | 8.57 | 15.04
HSRNet (ours) | | | |
Figure 4.
The line chart of performance under different ratios inside the Channel Focus on the ShanghaiTech Part A and Part B datasets.
The backbone refers to the VGG-16 model. We add a 1 × 1 convolutional layer at the end to generate the density map, which is defined as our baseline. It is obvious that combining the backbone with the Scale Recalibration Module (SRM) boosts the performance (MAE 75.6 vs 73.7), which verifies the effectiveness of the SRM. We divide the Scale Focus Module (SFM) into two parts: Channel Focus (CF) and Spatial Focus (SF). The third and fourth rows verify their respective significance (MAE 69.9 and 68.8 vs 73.7). Besides, the combination of the two parts is more effective than using either of them alone. Finally, we add the Scale Consistency loss (SC) to supervise the model learning. This strategy also brings a significant improvement to the performance (MAE 65.1 vs 62.3). Overall, each part of the model is effective and complementary to the others, and together they significantly boost the final results.

Ratio of Channel Focus.
We measure the performance of HSRNet with different reduction ratios of the Channel Focus introduced in Eqn. 2. On the one hand, the ratio needs to be small enough to ensure the representative capability of the fully-connected layers; on the other hand, if the ratio is too small, the parameters become more numerous and may introduce computational redundancy. To find a balance between capability and computational cost, experiments are performed over a series of ratio values. Specifically, we gradually double the ratio, and the results are shown in Figure 4. As the ratio increases, the estimation error first decreases and then increases. The proposed model delivers the best accuracy on both Part A and Part B of the ShanghaiTech dataset when the ratio equals 64. Therefore, this value is used for all experiments in this paper.
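The trade-off is easy to quantify: the gate of one stage costs roughly 2·C·(C/r) weights. A quick calculation for an assumed stage width of C = 512:

```python
# Weight count of the two FC layers in one channel-focus gate,
# 2 * C * (C / r), for an illustrative stage width of C = 512 channels.
C = 512
for r in (2, 4, 8, 16, 32, 64, 128):
    print(f"r={r:4d}  params={2 * C * (C // r):7d}")
# r = 2 -> 262144 weights per stage; r = 64 -> 8192; r = 128 -> 4096.
```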
Table 5.
Experimental results of architecture learning on the ShanghaiTech Part A dataset.
Backbone | SRM | CF | SF | SC | MAE | MSE
✓ | × | × | × | × | 75.6 |
✓ | ✓ | × | × | × | 73.7 |
✓ | ✓ | ✓ | × | × | 69.9 |
✓ | ✓ | × | ✓ | × | 68.8 |
✓ | ✓ | ✓ | ✓ | × | 65.1 |
✓ | ✓ | ✓ | ✓ | ✓ | 62.3 |

Sequence of SFM.
We evaluate the effect of the order of Channel Focus and Spatial Focus in the proposed Scale Focus Module (SFM). To this end, we design four networks which differ from each other in the design of the SFM module; the experimental results are shown in Table 6. Channel+Spatial refers to the network with the Channel Focus module ahead of Spatial Focus, while Spatial+Channel refers to the opposite one. Apart from the serial settings, we also design parallel ones which feed the input separately to the Channel Focus and Spatial Focus parts inside the SFM. Then we have two choices: one is to average their outputs as the final output; the other is to utilize a convolutional layer to process the stack of the two features. We name these two networks (Channel ⊕ Spatial) + average and (Channel ⊕ Spatial) + Conv respectively. We conduct experiments on the ShanghaiTech Part A and Part B datasets. The results mainly illustrate two aspects: 1) the serial design of the two modules is more effective than the parallel design; 2) Channel+Spatial is the optimal choice to achieve the best accuracy, which demonstrates the validity of the model design.
Scale Consistency.
To understand the effect of the Scale Consistency loss more deeply, we visualize the intermediate results of the proposed HSRNet and compare them with the groundtruth density map pyramid. As shown in Figure 5, Scale_i corresponds to the filter size of the average pooling operation applied to the groundtruth maps. Note that the scale-associated outputs are close to their corresponding groundtruth density maps. With the supervision of the extra Scale Consistency loss, the responses of the intermediate stages of the network are indeed associated with the scales of pedestrians rather than staying the same as the groundtruth map. For instance, the shallow layers (such as Scale_1 and Scale_2) are more sensitive to small scales of pedestrians, while the deep layers (such as Scale_4 and Scale_5) perform well on large scales of crowds. By fusing these outputs, the final result can cover the multi-scale crowd distributions in complicated scenes.

Figure 5.
From left to right, the image pyramid contains the 1, 1, 1/2, 1/4, 1/8, and 1/16 scales of the original image resolution. The first row shows the groundtruth density maps while the second row shows the density maps generated by the proposed HSRNet.

Table 6.
Validation of sequences inside the proposed Scale Focus Module.
Sequence inside the SFM | Part A MAE | Part A MSE | Part B MAE | Part B MSE
Spatial+Channel | 64.9 | 108.3 | 8.1 | 12.1
Channel+Spatial | | | |
(Channel ⊕ Spatial) + average | 68.8 | 107.2 | 11.0 | 19.8
(Channel ⊕ Spatial) + Conv | 66.2 | 104.3 | 10.2 | 18.9

Scale Invariance.
We turn to evaluating the scale invariance of the feature representations from different stages of the proposed HSRNet in diverse scenes with various crowd counts. To achieve this, we divide the ShanghaiTech Part A test set into five groups according to the number of people in each scene; each group represents a specific density level. The histogram of the results can be observed in Figure 6, where an increase in density level represents an increase in the average number of people. We compare our method with two classic representative counting networks, MCNN [34] and CSRNet [17]. It is obvious that MCNN performs well on relatively sparse scenes but loses its superiority on dense crowds, while the performance of CSRNet tends to be the opposite. Note that the proposed HSRNet outperforms both models over all groups, which further demonstrates the scale generalization of our model in highly complicated scenes.
Figure 6.
The histogram of average counts estimated by different approaches on five density levels from the ShanghaiTech Part A dataset.

Conclusion.
In this paper, we propose a single-column but multi-scale network named the Hierarchical Scale Recalibration Network (HSRNet), which can exploit global contextual information and aggregate multi-scale information simultaneously. The proposed HSRNet consists of two main parts: the Scale Focus Module (SFM) and the Scale Recalibration Module (SRM). Specifically, the SFM models global contextual dependencies along the channel and spatial dimensions, which contributes to generating more informative feature representations. The SRM recalibrates the feature responses generated by the SFM to produce multi-scale predictions, and then utilizes a scale-specific fusion strategy to aggregate the scale-associated outputs into the final density maps. Besides, we design a Scale Consistency loss to enhance the learning of the scale-associated outputs towards their corresponding multi-scale groundtruth density maps. With the proposed modules combined, the network can tackle the difficulties of scale variation and generate more precise density maps in highly congested crowd scenes. Extensive experiments on four counting benchmark datasets and one vehicle dataset show that our method delivers state-of-the-art performance over existing approaches and can be extended to other tasks. Thorough ablation studies conducted on the ShanghaiTech dataset validate the effectiveness of each part of the proposed HSRNet.
REFERENCES
[1] Deepak Babu Sam, Neeraj N Sajjan, R Venkatesh Babu, and Mukundhan Srinivasan, 'Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3618–3626, (2018).
[2] Haoyue Bai, Song Wen, and S-H Gary Chan, 'Crowd counting on images with scale variation and isolated clusters', in Proceedings of the IEEE International Conference on Computer Vision Workshops, (2019).
[3] Karel Bartos, Michal Sofka, and Vojtech Franc, 'Learning invariant representation for malicious network traffic detection', in Proceedings of the Twenty-second European Conference on Artificial Intelligence, pp. 1132–1139. IOS Press, (2016).
[4] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su, 'Scale aggregation network for accurate and efficient crowd counting', in Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750, (2018).
[5] Antoni B Chan, Zhang-Sheng John Liang, and Nuno Vasconcelos, 'Privacy preserving crowd monitoring: Counting people without people models or tracking', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7. IEEE, (2008).
[6] Ke Chen, Chen Change Loy, Shaogang Gong, and Tony Xiang, 'Feature mining for localised crowd counting', in BMVC, volume 1, p. 3, (2012).
[7] Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Jun-Yan He, and Alexander G Hauptmann, 'Improving the learning of multi-column convolutional neural network for crowd counting', in Proceedings of the 27th ACM International Conference on Multimedia, pp. 1897–1906. ACM, (2019).
[8] Luca Fiaschi, Ullrich Köthe, Rahul Nair, and Fred A Hamprecht, 'Learning to count with regression forest and structured labels', in Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), pp. 2685–2688. IEEE, (2012).
[9] Geoffrey French, MH Fisher, Michal Mackiewicz, and Coby Needle, 'Convolutional neural networks for counting fish in fisheries surveillance video', Proceedings of the Machine Vision of Animals and their Behaviour (MVAB), 7–1, (2015).
[10] Ricardo Guerrero-Gómez-Olmedo, Beatriz Torre-Jiménez, Roberto López-Sastre, Saturnino Maldonado-Bascón, and Daniel Oñoro-Rubio, 'Extremely overlapping vehicle counting', in Iberian Conference on Pattern Recognition and Image Analysis, pp. 423–431. Springer, (2015).
[11] J. Hu, L. Shen, and G. Sun, 'Squeeze-and-excitation networks', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, (June 2018).
[12] Xiaolong Jiang, Zehao Xiao, Baochang Zhang, Xiantong Zhen, Xianbin Cao, David Doermann, and Ling Shao, 'Crowd counting and density estimation by trellis encoder-decoder networks', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6133–6142, (2019).
[13] Di Kang and Antoni Chan, 'Crowd counting by adaptively fusing predictions from an image pyramid', arXiv preprint arXiv:1805.06115, (2018).
[14] Di Kang, Zheng Ma, and Antoni B Chan, 'Beyond counting: Comparisons of density maps for crowd analysis tasks — counting, detection, and tracking', IEEE Transactions on Circuits and Systems for Video Technology, (5), 1408–1422, (2018).
[15] Diederik P Kingma and Jimmy Ba, 'Adam: A method for stochastic optimization', arXiv preprint arXiv:1412.6980, (2014).
[16] Victor Lempitsky and Andrew Zisserman, 'Learning to count objects in images', in Advances in Neural Information Processing Systems, pp. 1324–1332, (2010).
[17] Yuhong Li, Xiaofan Zhang, and Deming Chen, 'CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1091–1100, (2018).
[18] Xialei Liu, Joost van de Weijer, and Andrew D Bagdanov, 'Leveraging unlabeled data for crowd counting by learning to rank', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7661–7669, (2018).
[19] Daniel Oñoro-Rubio and Roberto J López-Sastre, 'Towards perspective-free object counting with deep learning', in European Conference on Computer Vision, pp. 615–629. Springer, (2016).
[20] Deepak Babu Sam, Shiv Surya, and R Venkatesh Babu, 'Switching convolutional neural network for crowd counting', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4031–4039. IEEE, (2017).
[21] Karen Simonyan and Andrew Zisserman, 'Very deep convolutional networks for large-scale image recognition', arXiv preprint arXiv:1409.1556, (2014).
[22] Vishwanath A Sindagi and Vishal M Patel, 'CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting', in IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. IEEE, (2017).
[23] Vishwanath A Sindagi and Vishal M Patel, 'Generating high-quality crowd density maps using contextual pyramid CNNs', in Proceedings of the IEEE International Conference on Computer Vision, pp. 1861–1870, (2017).
[24] Elad Walach and Lior Wolf, 'Learning to count with CNN boosting', in European Conference on Computer Vision, pp. 660–676. Springer, (2016).
[25] Qi Wang, Junyu Gao, Wei Lin, and Yuan Yuan, 'Learning from synthetic data for crowd counting in the wild', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8198–8207, (2019).
[26] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, 'CBAM: Convolutional block attention module', in Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, (2018).
[27] Bo Wu and Ramakant Nevatia, 'Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors', in Tenth IEEE International Conference on Computer Vision (ICCV'05), volume 1, pp. 90–97. IEEE, (2005).
[28] Feng Xiong, Xingjian Shi, and Dit-Yan Yeung, 'Spatiotemporal modeling for crowd counting in videos', in Proceedings of the IEEE International Conference on Computer Vision, pp. 5151–5159, (2017).
[29] Mingfu Xiong, Jun Chen, Zheng Wang, Zhongyuan Wang, Ruimin Hu, Chao Liang, and Daming Shi, 'Person re-identification via multiple coarse-to-fine deep metrics', in Proceedings of the Twenty-second European Conference on Artificial Intelligence, pp. 355–362. IOS Press, (2016).
[30] Anran Zhang, Lei Yue, Jiayi Shen, Fan Zhu, Xiantong Zhen, Xianbin Cao, and Ling Shao, 'Attentional neural fields for crowd counting', in Proceedings of the IEEE International Conference on Computer Vision, pp. 5714–5723, (2019).
[31] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang, 'Cross-scene crowd counting via deep convolutional neural networks', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 833–841, (2015).
[32] Lu Zhang, Miaojing Shi, and Qiaobo Chen, 'Crowd counting via scale-adaptive convolutional neural network', in IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1113–1121. IEEE, (2018).
[33] Shanghang Zhang, Guanhang Wu, Joao P Costeira, and José MF Moura, 'FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras', in Proceedings of the IEEE International Conference on Computer Vision, pp. 3667–3676, (2017).
[34] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, 'Single-image crowd counting via multi-column convolutional neural network', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589–597, (June 2016).
[35] Z. Zou, X. Su, X. Qu, and P. Zhou, 'DA-Net: Learning the fine-grained density distribution with deformation aggregation network', IEEE Access, 6, 60745–60756, (2018).
[36] Zhikang Zou, Yu Cheng, Xiaoye Qu, Shouling Ji, Xiaoxiao Guo, and Pan Zhou, 'Attend to count: Crowd counting with adaptive capacity multi-scale CNNs', Neurocomputing, 75–83, (2019).
[37] Zhikang Zou, Huiliang Shao, Xiaoye Qu, Wei Wei, and Pan Zhou, 'Enhanced 3D convolutional networks for crowd counting', arXiv preprint arXiv:1908.04121, (2019).