Fast Video Crowd Counting with a Temporal Aware Network
Xingjiao Wu, Baohan Xu, Yingbin Zheng, Hao Ye, Jing Yang, Liang He
Abstract—Crowd counting aims to count the number of people instantaneously present in a crowded space, and many promising solutions have been proposed for single-image crowd counting. With the ubiquity of video capture devices in the public safety field, how to effectively apply the crowd counting technique to video content has become an urgent problem. In this paper, we introduce a novel framework based on temporal aware modeling of the relationship between video frames. The proposed network contains a few dilated residual blocks, and each of them consists of layers that compute temporal convolutions of features from the adjacent frames to improve the prediction. To alleviate the expensive computation and satisfy the demand for fast video crowd counting, we also introduce a lightweight network to balance the computational cost with representation ability. We conduct experiments on crowd counting benchmarks and demonstrate the superiority of our approach in terms of effectiveness and efficiency over previous video-based approaches.
Index Terms—Crowd counting, video analysis, dynamic temporal modeling, spatiotemporal information.
I. INTRODUCTION
The rapid development of surveillance devices has led to an explosive growth of images and videos, which creates a great demand for analyzing visual content. In addition to object recognition, crowd counting, which focuses on estimating the number of people from the visual contents, has received increasing interest in recent years. Many researchers have explored the crowd counting task on still images, while limited effort has been devoted to videos. Nevertheless, crowd counting in videos has many real-world applications, such as video surveillance, traffic monitoring, and emergency management.

Counting crowds robustly and efficiently under different pedestrian distributions, illumination, occlusion, and camera distortion is nevertheless challenging. Although recent progress such as multi-branch networks has been introduced to learn more contextual information and achieves excellent performance, most existing methods still ignore the temporal relations between nearby frames, even though crowd counting data are often collected from surveillance videos. Temporal relation is an important factor in video tasks compared with still images. There is a certain overlap between consecutive video frames, so continuous frames follow certain regularities. Exploring and using temporal relations can correct some errors caused by noise. Our experiments show that considering temporal relations performs significantly better than most current methods which ignore these relationships. Furthermore, the predicted density map, which may be the most common intermediate element in a modern crowd counting system, also has a strong correlation with the neighboring maps in the video sequence.

To cope with these difficulties, we employ a novel framework to take advantage of the temporal information extracted from continuous video frames. The Temporal Aware Network (TAN) is proposed to dynamically model the temporal characteristics of continuous frames for crowd counting, and its architecture is shown in Fig. 1. The TAN consists of two main parts: a lightweight convolutional neural network (LCN) unit capable of processing counting tasks quickly, and a multiple-block architecture for temporal modeling. The LCN unit guarantees the network response speed while keeping a certain accuracy. We then focus on modeling the relationship in the time dimension and construct a group of dilated residual blocks between the adjacent features. Within each dilated residual block, we employ an expanded set of temporal convolutions to update the frame-level prediction with the help of its neighboring frames. Different from existing works, we also introduce the density map as another branch of our architecture. Our observation is that, while adjacent video frames may have different visual content due to the background and occlusion, the neighboring density maps still demonstrate similar content with each other. The density map reports the distribution of people, which can be regarded as an attention map. The contextual information between consecutive frames and density maps benefits the current counting state. Comprehensive experiments on the public datasets show the improvement achieved with the help of temporal and contextual information.

The main contributions of this work are summarized as follows.
• The proposed Temporal Aware Network dynamically models the temporal features from continuous video frames for crowd counting. Utilizing information from density maps helps to overcome the changing backgrounds and occlusion and boosts the performance.
• We also design a lightweight convolutional network to achieve a comparable result while keeping the compactness of the model.
• Extensive evaluations on the benchmarks demonstrate the superior performance of our proposed method. Notably, we achieve state-of-the-art results on the video datasets compared with the existing video-based methods. Furthermore, our network achieves a 25 FPS crowd counting speed on a moderate commercial CPU.

Xingjiao Wu and Baohan Xu contributed equally to this work. Corresponding author: Jing Yang (e-mail: [email protected]). X. Wu, J. Yang, and L. He are with East China Normal University, Shanghai 200062, China. B. Xu is with Jilian Technology Group (Video++), Shanghai 200023, China. Y. Zheng and H. Ye are with Videt Tech Ltd., Shanghai 201203, China.
Fig. 1. The architecture of the proposed TAN. A lightweight convolutional network is used to produce a density map for each frame efficiently. Then the dilated residual blocks explore both short-term and long-term context information from adjacent density maps to produce weight vectors. The final output is computed by multiplying the frame features with the weight vectors.

The rest of this paper is organized as follows. Section II introduces the background of crowd counting in images and videos. Section III discusses the model design, network architecture, and training process in detail. In Section IV, we demonstrate the qualitative and quantitative study of the proposed framework. We conclude our work in Section V.

II. RELATED WORK
A. Crowd Counting in Still Images
Over the past few years, researchers have attempted to solve crowd counting in images using a variety of approaches. Early works focused on detection methods to recognize specific body parts or the full body using hand-crafted features [1], [2]. While detection-based methods have difficulty dealing with dense crowds because of occlusion, some studies investigated learning a mapping function from features to the number of people [3]. Furthermore, Lempitsky et al. [4] proposed local features for the density map to exploit spatial information. However, hand-crafted features are not good enough when facing cluttered and low-resolution images.

Recently, convolutional neural networks have shown great success in computer vision. Inspired by the promising performance of neural networks, many researchers have explored CNN-based methods for crowd counting (e.g., [5]–[27]). Zhang et al. [6] proposed a multi-column CNN with different sizes of filters to deal with density variations. Sam et al. [8] and Sindagi et al. [9] achieved remarkable results with multi-subnet structures. To address the problem of limited training data, Liu et al. [12] investigated data enhancement, collecting scene datasets from Google using keyword searches and query-by-example image retrieval and then applying a learning-to-rank method. Shi et al. [11] considered that the adaptation of previous methods to crowds relying on a single image is still in its infancy. Sam et al. [13] proposed the TDF-CNN with top-down feedback to correct the initial prediction of the CNN, which is very limited for detecting the spatial background of people. These methods are all designed for image crowd counting; treating videos as image sequences would therefore ignore the important temporal information in videos. Wang et al. [25] proposed a crowd counting method via domain adaptation, in which the data collector and labeler can generate synthetic crowd scenes and simultaneously annotate them without any manpower. Gao et al. [26] proposed the perspective crowd counting network to overcome the deficiency of traditional methods that only focus on local appearance features. Recently, a novel structural context descriptor was designed to characterize the structural characteristics of individuals in crowd scenes and make better use of context information [27].
B. Crowd Counting in Videos
Fewer researchers have studied video crowd counting compared with still images. Brostow et al. [28] and Chan et al. [29] proposed to use Bayesian functions to detect individuals using motion information. Rodriguez et al. [30] further proposed the optimization of an energy function combining crowd density estimation and individual tracking. Chen et al. [31] proposed an error-driven iteration framework aiming to cope with noisy input videos. Although these methods based on motion or hand-crafted features showed satisfactory performance on pedestrian or football datasets, they still lack generalization ability when applied to extremely dense crowds. More recently, Xiong et al. [32] proposed the ConvLSTM framework to capture both spatial and temporal dependencies. This CNN-based method demonstrated its effectiveness on benchmark crowd counting datasets, such as UCF_CC_50 [5] and UCSD [3]. However, due to the limited training data of videos and various scenes, it is usually difficult to train complex and deeper networks for effective crowd counting. In this paper, we propose a novel framework considering temporal information and density maps as well. Even though we use a lightweight network architecture, our method can achieve promising results on multiple datasets, with the help of the auxiliary information extracted from temporal dependencies and density maps.
Fig. 2. The architecture of dilated residual layer.
III. FRAMEWORK
In this section, we introduce the temporal aware modeling with the convolutional network for the crowd counting task in videos. We describe the architecture of the temporal modeling in Section III-A and the basic unit of the temporal aware network in Section III-B. The implementation details are described in Section III-C.
A. Temporal Aware Network
The selection of a temporal modeling approach is important to the success of the video crowd counting system. Ideally, we want a comprehensive collection of both long-term and short-term frame correlations so that we can count accurately under any scene settings. However, video processing is time-consuming and the training video data for crowd counting are also limited. With these in mind, we design the Temporal Aware Network (TAN) with dilated convolutions to fully utilize the context and content information of the video. The architecture of TAN is shown in Fig. 1. Different from existing works, we mainly focus on investigating the relations between density maps rather than only considering the temporal information of the original frames.

Vectors from several neighboring video frames are concatenated as the inputs of the first dilated block. Particularly, for the $t$-th video frame, we suppose $k$ frames before and after the frame are considered, and the input feature for the $t$-th frame is $\mathbf{v}(t) = [\mathbf{v}_{t-k}^T, \cdots, \mathbf{v}_t^T, \cdots, \mathbf{v}_{t+k}^T]^T$.

The first part of TAN is a set of LCN units for extracting the density map of each frame, which will be described in Section III-B. Formally, let $X = (x_1, ..., x_T)$ be a video with $T$ frames. Each frame $x_i$ goes through the LCN unit to produce the corresponding density map $f(x_i)$. In order to match the data dimension from the density map to the temporal block group, we add a Reshape & Concatenation unit. This unit transforms the density map $f(x_i)$ of size $(M, N)$ into a one-dimensional vector $\mathbf{v}_i$ of size $(1, M \times N)$.

The feature vectors are sent to a series of dilated residual blocks. The group of dilated residual blocks uses the previous stage to initialize the next stage and uses the next stage to refine the previous stage. We define the features of the input video at the first stage as follows:

$Y_0 = \mathbf{v}(t), \quad Y_s = \mathcal{F}(Y_{s-1}),$    (1)

where $Y_s$ is the output at stage $s$ and $\mathcal{F}$ is a dilated residual block. Each dilated residual block contains multiple dilated residual layers, and the architecture of a dilated residual layer is shown in Fig. 2. The first operation is a dilated convolution with a large receptive field, which helps prevent the model from overfitting. Let $w_{1,i}$ and $b_{1,i}$ be the filter weights and bias associated with the $i$-th dilated residual layer and $v_i$ be its input; the output for location $l$ after the 1D dilation is defined as

$\hat{v}_i[l] = \sum_{\Delta l \in \mathcal{R}_d} w_{1,i}[\Delta l] \cdot v_i[l + \Delta l] + b_{1,i},$    (2)

where $\mathcal{R}_d = \{-d, 0, d\}$ constructs the 1D filters with a kernel size of 3 and $d = 2^{i-1}$. Then ReLU and a $1 \times 1$ convolution are used to superimpose the weights and offset the output. The output of the whole dilated residual layer is

$v_{i+1} = v_i + w_{2,i} \cdot \mathrm{ReLU}(\hat{v}_i) + b_{2,i},$    (3)

where $v_{i+1}$ is the output of layer $i$, and $w_{2,i}$ and $b_{2,i}$ are the weights and bias of the $1 \times 1$ convolution filters. The receptive field at the $i$-th dilated residual layer is $2^{i+1} - 1$. A dilated residual block consists of three dilated residual layers, and we use this architecture to provide more context for predicting the result at each frame. There are a few alternative choices to model the context with dilated convolution, such as dilated temporal convolution [33], dilation with dense connections [34], and the dilated residual unit [35]. In this paper, our design is based on the dilated residual unit for its computational efficiency.
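To make Eqs. (2)–(3) concrete, below is a minimal PyTorch sketch of one dilated residual layer and a block of three such layers. The module names, the channel layout (the 2k+1 frame vectors treated as channels of a 1D signal of length M·N), and the toy sizes are our own illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    # One layer of Eqs. (2)-(3): 1D dilated conv (kernel 3, dilation d = 2^(i-1)),
    # ReLU, a 1x1 conv, and a residual connection back to the input.
    def __init__(self, channels, dilation):
        super().__init__()
        self.dilated_conv = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, v):
        v_hat = self.dilated_conv(v)            # Eq. (2)
        out = self.conv_1x1(torch.relu(v_hat))
        return v + out                          # Eq. (3), residual addition

class DilatedResidualBlock(nn.Module):
    # A block stacks three layers with dilation rates 1, 2, 4 (d = 2^(i-1)).
    def __init__(self, channels, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(channels, 2 ** i) for i in range(num_layers)])

    def forward(self, v):
        for layer in self.layers:
            v = layer(v)
        return v

# Example: features of 2k+1 = 5 neighboring frames, each a flattened M*N density map.
block = DilatedResidualBlock(channels=5)
v = torch.randn(1, 5, 1024)   # (batch, frames as channels, M*N locations)
y = block(v)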
Our model aims to capture dependencies between the current frame and the other frames in the video sequence, which helps smooth the prediction errors within the same video sequence. To utilize the context information more effectively, we normalize the output of the last block and obtain a set of weight vectors. For the $t$-th video frame, the output of the last block is $\mathbf{v}'(t) = [\mathbf{v}_{t-k}'^T, \cdots, \mathbf{v}_t'^T, \cdots, \mathbf{v}_{t+k}'^T]^T$, and the vector $\mathbf{v}_i'$ represents the feature of the $i$-th frame. We extract the weights from the normalization of the continuous frame features, i.e.,

$w_{\Delta t} = \dfrac{\|\mathbf{v}_{t+\Delta t}'\|}{\sum_{\Delta t = -k}^{k} \|\mathbf{v}_{t+\Delta t}'\|}.$    (4)

We consider the original density map $f(x_t)$ again, and the final density map $f'(x_t)$ is obtained by

$f'(x_t) = \sum_{\Delta t = -k}^{k} w_{\Delta t} f(x_{t+\Delta t}).$    (5)

The final counting result for frame $t$ is computed by simply accumulating the density map $f'(x_t)$.
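As an illustration of the fusion in Eqs. (4)–(5), the short sketch below normalizes the norms of the refined frame features into weights and blends the neighboring density maps; the tensor shapes and function name are assumptions made only for illustration.

import torch

def fuse_density_maps(refined_feats, density_maps):
    # refined_feats: (2k+1, M*N) outputs v'_{t-k..t+k} of the last dilated block
    # density_maps:  (2k+1, M, N) original LCN density maps f(x_{t-k..t+k})
    norms = refined_feats.norm(dim=1)                              # ||v'_{t+dt}||
    weights = norms / norms.sum()                                  # Eq. (4)
    fused = (weights[:, None, None] * density_maps).sum(dim=0)     # Eq. (5)
    return fused, fused.sum()                                      # final map and count

maps = torch.rand(5, 32, 32)        # 5 neighboring density maps (toy sizes)
feats = torch.rand(5, 32 * 32)
final_map, count = fuse_density_maps(feats, maps)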
Loss function. Learning of the parameters in each block is performed with two terms in the loss function, i.e.,

$L_{block} = L_{mse} + \lambda L_{SL},$    (6)

where $\lambda$ is a model hyper-parameter that determines the contribution of the different terms. The MSE loss is defined as

$L_{mse} = \frac{1}{N} \sum_{i=1}^{N} (C_p - C_{gt})^2,$    (7)

where $N$ is the total number of video frames, $C_p$ is the predicted counting value, and $C_{gt}$ is the ground truth. While the MSE loss already performs well, we observe that the predictions for some of the videos contain a few over-segmentation errors. To further improve the quality of the predictions, we use an additional smoothing loss to reduce such over-segmentation issues. Here a Smooth-$L_1$ loss is employed:

$L_{SL}(x, y) = \frac{1}{N} \sum_{i} \begin{cases} 0.5\,(x_i - y_i)^2 & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5 & \text{otherwise} \end{cases}$    (8)

Several blocks are applied in the TAN framework, and the overall loss function is the sum of $L_{block}$ over all blocks.
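A compact PyTorch sketch of the per-block loss in Eqs. (6)–(8) follows; it applies the MSE and Smooth-L1 terms to per-frame counts, which is one plausible reading of the extracted text, and the value used for λ is only an illustrative assumption.

import torch
import torch.nn.functional as F

def block_loss(pred_counts, gt_counts, lam=0.1):
    # pred_counts, gt_counts: (N,) per-frame counts; lam plays the role of lambda.
    l_mse = F.mse_loss(pred_counts, gt_counts)          # Eq. (7)
    l_sl = F.smooth_l1_loss(pred_counts, gt_counts)     # Eq. (8)
    return l_mse + lam * l_sl                           # Eq. (6)

pred = torch.tensor([30.5, 28.0, 31.2])
gt = torch.tensor([32.0, 29.0, 30.0])
loss = block_loss(pred, gt)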
B. The LCN Unit

The basic network in our framework is a convolutional neural network for crowd counting on a still image or a single video frame. In previous works, networks with multiple subnets or a single branch are usually employed. Since we focus on the video crowd counting problem in this paper, the inference speed is an important issue in real-world applications, and our goal is to use a small enough architecture to achieve a comparable result. Therefore, a single-branch network with few parameters is preferred. We design a lightweight convolutional neural network (LCN) with 9 convolutional layers and 3 max pooling operations. The overall structure of LCN is illustrated in Table I. In our preliminary experiments, we find that using more convolutional layers with small kernels is more efficient than using fewer layers with larger kernels for crowd counting, which is consistent with the observations from recent research on image recognition [36]. Max pooling is applied over each $2 \times 2$ region, and the rectified linear unit (ReLU) is adopted as the activation function for its good performance. The network only consists of convolutional blocks with $3 \times 3$ kernels and max-pooling layers instead of a more sophisticated architecture, which aims to accelerate the computational speed. We also limit the number of filters in each layer to reduce the computational complexity. Finally, we adopt filters with a size of $1 \times 1$ to output the density map, and the LCN is trained with the pixel-wise Euclidean loss between the estimated and ground-truth density maps:

$L_{LCN} = \frac{1}{2N} \sum_{i=1}^{N} \|f(x_i) - F(x_i)\|^2,$    (9)

where $N$ is the number of training images, $F(x_i)$ is the ground truth density map of image $x_i$, and $f(x_i)$ is the estimated density map for $x_i$. The overall network parameter size of LCN is about $3.2 \times 10^4$ (0.032M), and the network can obtain real-time speed in a CPU environment. As will be shown in the experiments, within a small parameter budget, our model can still achieve a competitive result compared with previous approaches.
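For illustration, here is a minimal PyTorch sketch of a single-branch network in the spirit of the LCN: 3×3 convolutions with ReLU, three 2×2 max-pooling stages, and a 1×1 output layer predicting a one-channel density map. The channel widths are assumptions (only part of them is recoverable from Table I), so this is a sketch of the design pattern rather than the exact configuration.

import torch
import torch.nn as nn

def conv_relu(in_ch, out_ch):
    # 3x3 convolution with padding 1 keeps the spatial size, followed by ReLU.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class LCNSketch(nn.Module):
    # 9 conv layers + 3 max-pool stages; the output resolution is 1/8 of the input.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_relu(3, 16), nn.MaxPool2d(2),
            conv_relu(16, 16), conv_relu(16, 16), nn.MaxPool2d(2),
            conv_relu(16, 32), conv_relu(32, 32), conv_relu(32, 32), nn.MaxPool2d(2),
            conv_relu(32, 16), conv_relu(16, 16),          # assumed widths
            nn.Conv2d(16, 1, kernel_size=1))               # 1x1 density output

    def forward(self, x):
        return self.features(x)

lcn = LCNSketch()
density = lcn(torch.randn(1, 3, 256, 256))   # -> (1, 1, 32, 32)
count = density.sum()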
TABLE I
CONFIGURATION OF LCN.

Layer      Kernel size  Channel  Dilation rate  Output
conv1      3 × 3        16       1              1/1
max-pool1  2 × 2        -        -              1/2
conv2      3 × 3        16       1              1/2
conv3      3 × 3        16       1              1/2
max-pool2  2 × 2        -        -              1/4
conv4      3 × 3        32       1              1/4
conv5      3 × 3        32       1              1/4
conv6      3 × 3        32       1              1/4
max-pool3  2 × 2        -        -              1/8
conv7      3 × 3        16       1              1/8
conv8      3 × 3        16       1              1/8
conv9      1 × 1        1        1              1/8

C. Implementation Details

Ground truth generation.
There is significant diversity among different crowd counting datasets, and thus we use geometry-adaptive kernels to generate density maps from the ground truth. The geometry-adaptive kernels are defined as

$F(x) = \sum_{i=1}^{N_t} \delta(x - o_i) \times G_{\sigma_i}(x).$    (10)

Given an object $o_i$ in the target set $\{o_1, o_2, ..., o_{N_t}\}$, we calculate its $k$ nearest neighbors to determine the average distance $\bar{d}_i$. For the pixel position $i$ in the image, we use a Gaussian kernel with parameter $\sigma_i = \beta \bar{d}_i$ to generate the density map $F(x)$. In our experiments, we create density maps with a fixed kernel of 17 for the UCSD dataset and 15 for the others. We also follow the previous work [37] to create density maps using the Region of Interest (ROI) and the perspective map for the WorldExpo'10 dataset.
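The geometry-adaptive generation of Eq. (10) can be sketched in a few lines of NumPy/SciPy; the function name, the fallback bandwidth, and β = 0.3 (a common choice in the density map literature, not stated in this paper) are our own assumptions. Note that, as stated above, the experiments ultimately use fixed kernel widths (17 for UCSD, 15 elsewhere), which corresponds to replacing the adaptive bandwidth with a constant.

import numpy as np
from scipy.ndimage import gaussian_filter

def geometry_adaptive_density(points, height, width, k=3, beta=0.3):
    # points: list of (row, col) head annotations o_i; returns the density map F(x).
    density = np.zeros((height, width), dtype=np.float32)
    pts = np.asarray(points, dtype=np.float32)
    for i, (r, c) in enumerate(pts):
        # average distance to the k nearest neighbors -> adaptive bandwidth sigma_i
        dists = np.sqrt(((pts - pts[i]) ** 2).sum(axis=1))
        d_bar = np.sort(dists)[1:k + 1].mean() if len(pts) > k else 1.0
        sigma = beta * d_bar
        impulse = np.zeros_like(density)
        impulse[int(r), int(c)] = 1.0                 # delta(x - o_i)
        density += gaussian_filter(impulse, sigma)    # convolve with G_sigma
    return density

dm = geometry_adaptive_density([(10, 12), (14, 15), (40, 40)], 64, 64)
print(dm.sum())   # approximately the number of annotated people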
We consider data augmentation based on the original information of the data. For the training of LCN, the insufficient number of training samples is one important issue. Thus, we follow the data enhancement method in [10] to deal with the image data. Nine patches are cropped from each image at different positions, each with a quarter of the size of the original image. The first four patches contain the four quarters of the image without overlapping, while the other five patches are randomly cropped from the input image. After that, we mirror the patches to double the training set. We do not apply any data enhancement to the video datasets, as we would like to consider more context information of the video frames by using our model.
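A small sketch of this cropping-and-mirroring augmentation is given below, assuming quarter-size patches as in [10]; the function name and random-crop handling are our own choices, and the corresponding density-map patches would be cropped identically.

import random
from PIL import Image, ImageOps

def nine_patch_augment(img):
    # 4 fixed quarter crops + 5 random quarter crops, then horizontal mirroring,
    # giving 18 training samples per image.
    w, h = img.size
    pw, ph = w // 2, h // 2
    boxes = [(x, y, x + pw, y + ph) for x, y in [(0, 0), (pw, 0), (0, ph), (pw, ph)]]
    for _ in range(5):
        x, y = random.randint(0, w - pw), random.randint(0, h - ph)
        boxes.append((x, y, x + pw, y + ph))
    patches = [img.crop(b) for b in boxes]
    return patches + [ImageOps.mirror(p) for p in patches]

samples = nine_patch_augment(Image.new("RGB", (640, 480)))   # 18 patches of 320x240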
TABLE II
STATISTICS OF THE DATASETS. NUM: THE NUMBER OF VIDEO FRAMES OR IMAGES; AVG: THE AVERAGE LABELED PEDESTRIAN NUMBER IN THE DATASET; TOTAL: THE TOTAL LABELED PEDESTRIAN NUMBER.

Dataset          Type   Resolution  Num    Avg    Total
UCSD             Video  158 × 238   2,000  24.9   49,885
Mall             Video  640 × 480   2,000  31.2   62,315
WorldExpo        Video  576 × 720   3,980  50.2   199,923
ShanghaiTech A   Image  Varied      482    501    241,677
ShanghaiTech B   Image  768 × 1024  716    123.6  88,488

TABLE III
CROWD COUNTING RESULTS ON MALL AND UCSD. THE PERFORMANCE OF [3], [39], [40] ARE FROM [32].

Method                                MALL           UCSD
                                      MAE    MSE     MAE    MSE
Gaussian process regression [3]       3.72   20.1    2.24   7.97
Ridge regression [39]                 3.59   19.0    2.25   7.82
Cumulative attribute regression [40]  3.43   17.7    2.07   6.90
ConvLSTM [32]                         2.24   8.5     1.30   1.79
Bidirectional ConvLSTM [32]           2.10   7.6     1.13   1.43
ST-CNN [41]                           4.03   5.87    -      -
TAN                                   2.03   2.6     1.08   1.41

Training details. Our temporal model is implemented using PyTorch. To train the LCN, we first initialize the layers of the network using a Gaussian distribution with a standard deviation of 0.01. We then set the same initial learning rate for all the datasets and use Adam [38] for training. For the training of TAN, we use the Adam optimizer with a learning rate of 0.0005.
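The training configuration described above can be set up as in the following short sketch; the stand-in model is hypothetical, and only the Gaussian initialization (std 0.01) and the Adam learning rate of 0.0005 for TAN come from the text.

import torch
import torch.nn as nn

def init_gaussian(module):
    # Initialize convolutional layers from a Gaussian with standard deviation 0.01.
    if isinstance(module, (nn.Conv1d, nn.Conv2d)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))
model.apply(init_gaussian)                                    # Gaussian init, std 0.01
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)     # Adam; TAN uses lr 0.0005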
IV. EXPERIMENTS

We evaluate the Temporal Aware Network on three video crowd counting benchmarks, i.e., Mall [39], UCSD [3], and WorldExpo'10 [37]. Fig. 3 illustrates their typical scenes. To examine the efficiency of the basic network LCN, we also conduct an image-level analysis on the ShanghaiTech [6] and UCF_CC_50 [5] datasets, since they contain no time-related information. Basic statistics of the datasets are summarized in Table II. Following existing state-of-the-art methods, we use the mean absolute error (MAE) and mean squared error (MSE) to evaluate the performance, which are defined as
$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|,$    (11)

$MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|^2}.$    (12)

Here $N$ is the number of testing images, and $C_i$ and $C_i^{GT}$ are the estimated people count and the ground truth people count in the $i$-th image, respectively.

There are a few hyper-parameters in TAN, such as the number of video frames for temporal modeling and the number of dilated residual blocks. In this section, we use 5 video frames for the temporal modeling and 3 blocks as the default setting. The effect of these settings will be evaluated thoroughly in the ablation study. We also report the number of neural network parameters (Params) for comparison.

TABLE IV
THE MAE OF DIFFERENT SCENES ON THE WORLDEXPO'10 DATASET.

Method            S1    S2    S3    S4    S5    Avg.  Params
ic-CNN [42]       17.0  12.3  9.2   8.1   4.7   10.3
D-ConvNet [11]    1.9   12.1  20.7  8.3   2.6   9.1
CSRNet [10]       2.9   11.5  8.6   16.6  3.4   8.6
ACSCP [43]        2.8   14.1  9.6   8.1   2.9   7.5
ConvLSTM [32]     7.1   15.2  15.2  13.9  3.5   10.9  -
Bi-ConvLSTM [32]  6.8   14.5  14.9  13.5  3.1   10.6  -
ST-CNN [41]       5.2   16.5  9.9   8.4   6.2   9.3   -
TAN               2.8   18.1  9.6   7.5   3.6   8.3   0.047M
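The evaluation metrics of Eqs. (11)–(12) amount to a few lines of NumPy (the function name is ours):

import numpy as np

def count_metrics(pred, gt):
    # pred, gt: arrays of per-image predicted and ground-truth people counts.
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mae = np.abs(pred - gt).mean()                 # Eq. (11)
    mse = np.sqrt(((pred - gt) ** 2).mean())       # Eq. (12), a root-mean-square error
    return mae, mse

mae, mse = count_metrics([30.5, 28.0, 31.2], [32.0, 29.0, 30.0])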
A. Mall Dataset

We first report the results on the Mall dataset, summarized in Table III (left). The experiments follow the same setting as [39], in which the first 800 frames are used for training and the remaining 1,200 frames are used for testing. We compare TAN with methods that also use spatiotemporal information, including the regression-based methods [3], [39], [40] and the temporal-based methods [32], [41]. As shown in the table, using the proposed TAN leads to an MAE of 2.03 and an MSE of 2.6, which are significantly better than the baseline approaches. We demonstrate some predicted density maps as well as their corresponding counting results with TAN in Fig. 4(a).
B. UCSD Dataset
Following the convention of the existing works [3], we use frames 601–1400 of the UCSD dataset as the training data and the remaining 1,200 frames as the test data. As the region of interest (ROI) and perspective map are provided, the intensities of pixels outside the ROI are set to zero, and we also use the ROI to supervise the last convolution layer. Results on the UCSD dataset are presented in Table III (right). As in the experiments on the Mall dataset, TAN shows better results than the LSTM-based crowd counting approaches. Some counting results with TAN on sample snippets are shown in Fig. 4(b).
C. WorldExpo’10 Dataset
The WorldExpo'10 dataset consists of 3,980 annotated frames from 1,132 video sequences captured by 108 different surveillance cameras during the 2010 Shanghai World Expo. The training set includes 3,380 annotated frames from 103 scenes, while the testing images are extracted from five other scenes with 120 frames per scene. Following the settings of [37], MAE is used as the evaluation metric. Table IV lists the per-scene performance of TAN and previous approaches. Among these approaches, the first group comprises the state-of-the-art methods with pre-trained models [10], [11], [42] or more complex network designs [43]. Our results are comparable with these approaches for four scenes (except Scene 2), while the parameter size of TAN is an order of magnitude smaller than all of these methods. The results of TAN are also better than those of the temporal-based methods [32], [41]. The qualitative results on one of the testing scenes are illustrated in Fig. 4(c).
Fig. 4. Qualitative results on sample snippets of the video datasets: (a) Mall, (b) UCSD, (c) WorldExpo'10 (Scene 3, Scene 4).

TABLE V
THE METRICS OF THE LCN COMPARING WITH THE PREVIOUS APPROACHES ON SHANGHAITECH PART A & PART B (SHA & SHB) AND UCF_CC_50 (UCF).

Method                 SHA    SHB   UCF    Params
ic-CNN [42]      MAE   68.5   10.7  260.9
                 MSE   116.2  16.0  365.5
D-ConvNet [11]   MAE   73.5   18.7  288.4
                 MSE   112.3  26.0  404.7
CSRNet [10]      MAE   68.2   10.6  266.1
                 MSE   115.0  16.0  397.5
ACSCP [43]       MAE   75.7   17.2  291.0
                 MSE   102.7  27.4  404.6
BSAD [44]        MAE   -      20.2  409.5
                 MSE   -      35.6  563.7
TDF-CNN [13]     MAE   97.5   20.7  354.7
                 MSE   145.1  32.8  491.4
MCNN [6]         MAE   110.2  26.4  377.6
                 MSE   173.2  41.3  509.1
LCN              MAE   93.3   15.1  262.0  0.032M
                 MSE   157.0  23.3  358.6
D. Ablation Study
In this section, we evaluate several parameters and alternative implementations of the proposed framework.
LCN.
We first evaluate the performance of LCN and compare it with several approaches. As most of the previous approaches report results on ShanghaiTech and UCF_CC_50, we also conduct the comparison on these datasets, and Table V reports the metrics. Among these approaches, the first group are the state-of-the-art methods with more complex networks [10], [11], [42], [43]. Our results are comparable with these approaches, while our model size is much more compact.
Fig. 5. Density maps and predicted counting by the basic network on (a) ShanghaiTech Part A, (b) ShanghaiTech Part B, and (c) UCF_CC_50.

The second group contains several networks with a compact structure, including MCNN [6], BSAD [44], and TDF-CNN [13]. From the table, it is clear that LCN outperforms all of these compact approaches. Fig. 5 illustrates some examples using LCN on these datasets, including crowd images, predicted density maps, and the counting results.
Number of video frames for temporal modeling.
As shown in Table VI, we compare the performance of our framework with a varying number of video frames for the temporal modeling. We observe performance gains when the number of considered video frames increases from three to five. Using more frames does not further improve performance, since the number of people in the crowd varies as time passes. Another intuitive way to add temporal information is to smooth the density maps or counting numbers of neighboring frames. However, as shown in the table, this strategy improves the performance of the single-frame model but is not as good as the proposed TAN approach.
TABLE VI
EVALUATION OF COUNTING RESULTS W.R.T. THE NUMBER OF VIDEO FRAMES.

Method                  Dataset  MAE   MSE
LCN                     UCSD     1.13  1.47
                        MALL     2.23  2.85
TAN (3 frames)          UCSD     1.10  1.43
                        MALL     2.06  2.62
TAN (5 frames)          UCSD     1.08  1.41
                        MALL     2.03  2.57
TAN (7 frames)          UCSD     1.08  1.41
                        MALL     2.03  2.59
LCN (5 frames average)  UCSD     1.12  1.45
                        MALL     2.20  2.60

TABLE VII
EVALUATION OF COUNTING RESULTS W.R.T. THE BLOCK NUMBER.

Method          Dataset  MAE   MSE
TAN (1 block)   UCSD     1.09  1.43
                MALL     2.03  2.58
TAN (3 blocks)  UCSD     1.08  1.41
                MALL     2.03  2.57
TAN (5 blocks)  UCSD     1.08  1.41
                MALL     2.08  2.65
Number of dilated residual blocks.
We also evaluate the effect of the number of dilated residual blocks in the TAN model. As shown in Table VII, the best trade-off is obtained by using three dilated residual blocks. Compared to using a single block, more blocks can boost performance. However, when the number gets larger, in some cases the performance decreases. This is probably because more complex networks tend to overfit when the scale of the training data is limited.
Temporal modeling.
We compare our temporal aware network with previous LSTM-based approaches by combining LCN with them. As shown in Table VIII, the results of TAN are better than those of LCN with LSTM or bidirectional LSTM. This also indicates that our temporal modeling can capture temporal relations better than LSTM.
Timing.
Recall that our goal is to build a compact model for fast crowd counting in videos based on the proposed lightweight network. The parameter sizes of LCN and TAN are 0.032M and 0.047M, respectively. For the benchmark videos, the TAN model achieves a 120 FPS detection speed on an Nvidia TITAN X GPU, and during inference it consumes less than 500MB of GPU memory. Our approach provides real-time (25 FPS) crowd counting with a moderate Intel Core i5-8400 desktop CPU.

TABLE VIII
COMPARISON OF DIFFERENT TEMPORAL MODELING APPROACHES.

Method          Dataset  MAE   MSE
LCN + LSTM      UCSD     1.21  1.69
                MALL     2.23  3.80
LCN + Bi-LSTM   UCSD     1.11  1.48
                MALL     2.09  3.07
TAN             UCSD     1.08  1.41
                MALL     2.03  2.60
V. CONCLUSIONS
We proposed the Temporal Aware Network with the LCN unit for fast video crowd counting. The novel lightweight architecture is able to produce good performance with a compact network. We showed that by leveraging the contextual information of the video contents, promising results are achieved on video crowd counting benchmarks. We also achieved real-time inference at 25 FPS on a moderate commercial CPU.

REFERENCES
[1] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.
[2] M. Li, Z. Zhang, K. Huang, and T. Tan, "Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection," in International Conference on Pattern Recognition (ICPR), 2008, pp. 1–4.
[3] A. B. Chan, Z. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–7.
[4] V. Lempitsky and A. Zisserman, "Learning to count objects in images," in Advances in Neural Information Processing Systems (NeurIPS), 2010, pp. 1324–1332.
[5] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source multi-scale counting in extremely dense crowd images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2547–2554.
[6] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 589–597.
[7] D. Oñoro-Rubio and R. J. López-Sastre, "Towards perspective-free object counting with deep learning," in European Conference on Computer Vision (ECCV), 2016, pp. 615–629.
[8] D. B. Sam, S. Surya, and R. V. Babu, "Switching convolutional neural network for crowd counting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5744–5752.
[9] V. A. Sindagi and V. M. Patel, "Generating high-quality crowd density maps using contextual pyramid CNNs," in International Conference on Computer Vision (ICCV), 2017, pp. 1879–1888.
[10] Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1091–1100.
[11] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, and G. Zheng, "Crowd counting with deep negative correlation learning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5382–5390.
[12] X. Liu, J. van de Weijer, and A. D. Bagdanov, "Leveraging unlabeled data for crowd counting by learning to rank," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7661–7669.
[13] D. Sam and R. V. Babu, "Top-down feedback for crowd counting convolutional neural network," in AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 7323–7330.
[14] V. Ranjan, H. Le, and M. Hoai, "Iterative crowd counting," in European Conference on Computer Vision (ECCV), 2018, pp. 270–285.
[15] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah, "Composition loss for counting, density map estimation and localization in dense crowds," in European Conference on Computer Vision (ECCV), 2018, pp. 532–546.
[16] Z. Zou, Y. Cheng, X. Qu, S. Ji, X. Guo, and P. Zhou, "Attend to count: Crowd counting with adaptive capacity multi-scale CNNs," Neurocomputing, vol. 367, pp. 75–83, 2019.
[17] J. Gao, Q. Wang, and Y. Yuan, "SCAR: Spatial-/channel-wise attention regression networks for crowd counting," Neurocomputing, vol. 363, pp. 1–8, 2019.
[18] Z.-Q. Cheng, J.-X. Li, Q. Dai, X. Wu, and A. G. Hauptmann, "Learning spatial awareness to improve crowd counting," in International Conference on Computer Vision (ICCV), 2019, pp. 6152–6161.
[19] C. Xu, K. Qiu, J. Fu, S. Bai, Y. Xu, and X. Bai, "Learn to scale: Generating multipolar normalized density maps for crowd counting," in International Conference on Computer Vision (ICCV), 2019, pp. 8382–8390.
[20] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin, "Crowd counting with deep structured scale integration network," in International Conference on Computer Vision (ICCV), 2019, pp. 1774–1783.
[21] X. Wu, Y. Zheng, H. Ye, W. Hu, J. Yang, and L. He, "Adaptive scenario discovery for crowd counting," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 2382–2386.
[22] J. Ma, Y. Dai, and Y.-P. Tan, "Atrous convolutions spatial pyramid network for crowd counting and density estimation," Neurocomputing, vol. 350, pp. 91–101, 2019.
[23] L. Wang, B. Yin, X. Tang, and Y. Li, "Removing background interference for crowd counting via de-background detail convolutional network," Neurocomputing, vol. 332, pp. 360–371, 2019.
[24] Y. Zhang, C. Zhou, F. Chang, and A. C. Kot, "Multi-resolution attention convolutional neural network for crowd counting," Neurocomputing, vol. 329, pp. 144–152, 2019.
[25] Q. Wang, J. Gao, W. Lin, and Y. Yuan, "Learning from synthetic data for crowd counting in the wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8198–8207.
[26] J. Gao, Q. Wang, and X. Li, "PCC Net: Perspective crowd counting via spatial convolutional network," IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[27] Q. Wang, M. Chen, F. Nie, and X. Li, "Detecting coherent groups in crowd scenes by multiview clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 1, pp. 46–58, 2020.
[28] G. J. Brostow and R. Cipolla, "Unsupervised Bayesian detection of independent motion in crowds," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 594–601.
[29] A. B. Chan and N. Vasconcelos, "Counting people with low-level features and Bayesian regression," IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 2160–2177, 2012.
[30] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert, "Density-aware person detection and tracking in crowds," in International Conference on Computer Vision (ICCV), 2011, pp. 2423–2430.
[31] S. Chen, A. Fern, and S. Todorovic, "Person count localization in videos from noisy foreground and detections," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1364–1372.
[32] F. Xiong, X. Shi, and D.-Y. Yeung, "Spatiotemporal modeling for crowd counting in videos," in International Conference on Computer Vision (ICCV), 2017, pp. 5151–5159.
[33] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks for action segmentation and detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 156–165.
[34] B. Xu, H. Ye, Y. Zheng, H. Wang, T. Luwang, and Y.-G. Jiang, "Dense dilated network for video action recognition," IEEE Transactions on Image Processing, vol. 28, no. 10, pp. 4941–4953, 2019.
[35] Y. A. Farha and J. Gall, "MS-TCN: Multi-stage temporal convolutional network for action segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3575–3584.
[36] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), 2015.
[37] C. Zhang, H. Li, X. Wang, and X. Yang, "Cross-scene crowd counting via deep convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 833–841.
[38] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[39] K. Chen, C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in British Machine Vision Conference (BMVC), 2012, pp. 21.1–21.11.
[40] K. Chen, S. Gong, T. Xiang, and C. Change Loy, "Cumulative attribute space for age and crowd density estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2467–2474.
[41] Y. Miao, J. Han, Y. Gao, and B. Zhang, "ST-CNN: Spatial-temporal convolutional neural network for crowd counting in videos," Pattern Recognition Letters, vol. 125, pp. 113–118, 2019.
[42] D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan, "Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3618–3626.
[43] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, "Crowd counting via adversarial cross-scale consistency pursuit," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5245–5254.
[44] S. Huang, X. Li, Z. Zhang, F. Wu, S. Gao, R. Ji, and J. Han, "Body structure aware deep crowd counting," IEEE Transactions on Image Processing, 2018.