Understanding Traffic Density from Large-Scale Web Camera Data
Shanghang Zhang†,‡, Guanhang Wu†, João P. Costeira‡, José M. F. Moura†
† Carnegie Mellon University, Pittsburgh, PA, USA
‡ ISR - IST, Universidade de Lisboa, Lisboa, Portugal
{shanghaz, guanhanw}@andrew.cmu.edu, [email protected], [email protected]

Abstract
Understanding traffic density from large-scale web camera (webcam) videos is a challenging problem because such videos have low spatial and temporal resolution, high occlusion, and large perspective. To deeply understand traffic density, we explore both optimization based and deep learning based methods. To avoid individual vehicle detection or tracking, both methods map dense image features into vehicle density, one based on rank constrained regression and the other based on fully convolutional networks (FCN). The regression based method learns different weights for different blocks of the image to embed road geometry and significantly reduce the error induced by camera perspective. The FCN based method jointly estimates vehicle density and vehicle count with a residual learning framework to perform end-to-end dense prediction, allowing arbitrary image resolution and adapting to different vehicle scales and perspectives. We analyze and compare both methods, and draw insights from the optimization based method to improve the deep model. Since existing datasets do not cover all the challenges in our work, we collected and labelled a large-scale traffic video dataset containing 60 million frames from 212 webcams. Both methods are extensively evaluated and compared on different counting tasks and datasets. The FCN based method significantly reduces the mean absolute error (MAE) from 10.99 to 5.31 on the public dataset TRANCOS compared with the state-of-the-art baseline.
1. Introduction
Traffic congestion leads to the need for a deep understanding of traffic density, which, together with average vehicle speed, forms the major building blocks of traffic flow analysis [27]. Traffic density is the number of vehicles per unit length of a road (e.g., vehicles per km) [19]. This paper focuses on traffic density estimation from webcam videos, which have low resolution, low frame rate, high occlusion, and large perspective. As illustrated in Figure 1, we select a region of interest (yellow dotted rectangle) in a video stream and count the number of vehicles in the region for each frame. The traffic density is then calculated by dividing that count by the region length.
Figure 1. Problem statement.

Nowadays, many cities are being instrumented with surveillance cameras. However, due to network bandwidth limitations, lack of persistent storage, and privacy concerns [11], these videos present several challenges for analysis (illustrated in Figure 1):
(i) Low frame rate. The time interval between two successive frames of a webcam video is typically a second or more, resulting in large vehicle displacement. (ii) Low resolution. Webcam videos have low resolution, and a vehicle at the top of a frame can be as small as a few pixels on each side. Image compression also induces artifacts. (iii) High occlusion. Cameras installed at urban intersections often capture videos with high traffic congestion, especially during rush hours.
(iv) Large perspective. Cameras are installed at high points to capture more video content, resulting in videos with large perspective. Vehicle scales vary dramatically with their distance to the camera. These challenges leave existing work on traffic density estimation with many limitations.
Several traffic density estimation techniques have been developed in the literature, but they perform less accurately on webcam data due to the above challenges:
Detection based methods.
These methods [38, 29] try to identify and localize vehicles in each frame. They perform poorly on low resolution and high occlusion videos. Figure 1 shows detection results by Faster R-CNN [26]. Even though it is trained on our collected and annotated webcam dataset, it still exhibits a very high miss rate.

Motion based methods.
Several methods [8, 9, 23] utilize vehicle tracking to estimate traffic flow. These methods tend to fail due to the low frame rate and lack of motion information. Figure 1 shows the large displacement of a vehicle (black car) in successive frames due to the low frame rate. Some vehicles appear only once in the video, so their trajectories cannot be well estimated.
Holistic approaches.
These techniques [33] perform analysis on the whole image, thereby avoiding segmentation of each object. [15] uses a dynamic texture model based on spatiotemporal Gabor filters to classify traffic videos into different congestion types, but it does not provide accurate quantitative vehicle densities. [30] formulates the object density as a linear transformation of each pixel feature, with a uniform weight over the whole image. It suffers from low accuracy when the camera has large perspective.
Deep learning based methods.
Recently, several deep learning based methods have been developed for object counting [35, 36, 25, 37, 2]. The network in [35] outputs a 1D feature vector and fits a ridge regressor to perform the final density estimation; it cannot perform pixel-wise prediction and loses spatial information, so the estimated density map cannot have the same size as the input image. [25] is based on fully convolutional networks, but the output density map is still much smaller than the input image because it has no deconvolutional or upsampling layers. [2] jointly learns a density map and a foreground mask for object counting, but it does not solve the large perspective and object scale variation problems.

To summarize, detection and motion based methods tend to fail on high congestion, low resolution, and low frame rate videos, because they are sensitive to video quality and environmental conditions. Holistic approaches perform poorly on videos with large perspective and variable vehicle scales. Besides, most existing methods are incapable of estimating the exact number of vehicles [13, 18, 10, 34].
To deeply understand traffic density and overcome the challenges of real-world webcam data, we explore both deep learning based and optimization based methods. The optimization based model (OPT-RC) embeds scene geometry through a rank constraint on multiple block-regressors and motivates the deep learning model FCN-MT. FCN-MT shares with OPT-RC the idea of mapping local features into vehicle density, but replaces the background subtraction, feature extractor, and block-regressors with fully convolutional networks. With extensive experiments, we analyze and compare both methods and draw insights from the optimization based method to improve the deep model.
Figure 2. Intuition for OPT-RC.
Optimization Based Vehicle Density Estimation with Rank Constraint (OPT-RC). Inspired by [30], which maps each pixel feature into vehicle density with a uniform weight, we propose a regression model that learns different weights for different blocks, increasing the degrees of freedom on the weights and embedding geometry information. It outperforms [30] and obtains high accuracy on low quality webcam videos, in particular overcoming the large perspective challenge. We first divide the target region into blocks, extract features for each block, and subtract the background. As illustrated in Figure 2, we linearly map each block feature $x_b$ into a vehicle density $\mathrm{Den}_b = w_b^\top x_b$. To avoid large errors induced by large perspective, we build one regressor per block with different weights $w_b$ and learn the optimal weights. All the weight vectors are stacked into a weight matrix $W = [w_1^\top; w_2^\top; \dots; w_B^\top]$. To handle high dimensionality and capture the correlations among weight vectors of different blocks, a rank constraint is imposed on the weight matrix $W$. The motivation behind this treatment is illustrated in Figure 2. Due to large perspective, vehicles have different scales in blocks A and C, and their corresponding vehicle densities also differ; yet for blocks A and B, the vehicle densities are similar. We therefore build the weight matrix $W$ to reflect both the diversity and the correlation among weight vectors. After the vehicle density map is estimated, the vehicle count is obtained by integrating the vehicle density map. Finally, the traffic density is obtained by dividing the vehicle count by the length of the target region.
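To make the block-wise mapping concrete, here is a minimal test-time sketch; the function name and array shapes are ours, and feature extraction plus background subtraction are assumed to have already produced the per-block features:

```python
import numpy as np

def estimate_traffic_density(block_features, W, region_length_km):
    """Apply per-block linear regressors and integrate to a traffic density.

    block_features: (B, K) array, one K-dim feature vector per block
    W:              (B, K) array, one learned weight vector per block
                    (rank-constrained during training)
    """
    # Den_b = w_b^T x_b for every block b
    block_densities = np.einsum('bk,bk->b', W, block_features)
    # Vehicle count = integral (sum) of the density over the region
    vehicle_count = block_densities.sum()
    # Traffic density = vehicles per unit road length
    return vehicle_count / region_length_km
```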
Figure 3. Framework of FCN-MT.

FCN Based Multi-Task Learning for Vehicle Counting (FCN-MT). To avoid individual vehicle detection or tracking, besides the proposed optimization based model, we further propose an FCN based model to jointly learn vehicle density and vehicle count. The framework is illustrated in Figure 3. To produce a density map with the same size as the input image, we design a fully convolutional network [21] that performs pixel-wise prediction whole-image-at-a-time by dense feedforward computation and backpropagation. Instead of applying simple bilinear interpolation for upsampling, we add deconvolution layers on top of the convolutional layers, whose parameters can be learned during training.

There are two challenges in FCN based object counting: (1) object scale variation, and (2) reduced feature resolution [7]. To avoid large errors induced by scale variation, we jointly perform global count regression and density estimation. A single-task (density estimation) method only encourages the network to approximate the ground truth density and directly sums the densities to get the count, which suffers from large errors under extreme occlusion or oversized vehicles. The multi-task framework is fundamental to account for such deviations, enabling the related objectives to reach better local optima, improving robustness, and providing more supervision. Furthermore, instead of directly regressing the global vehicle count from the last feature map, we develop a residual learning framework that reformulates global count regression as learning residual functions with reference to the sum of densities in each frame. This design avoids learning unreferenced functions and eases the training of the network. The second challenge is caused by the repeated combination of max-pooling and striding. To solve it, we produce denser feature maps by combining appearance features from shallow layers with semantic features from deep layers. We then add a convolution layer with 1x1 kernels after the combined feature volume to perform feature re-weighting. The re-weighted features better distinguish foreground and background, so the whole network can accurately estimate vehicle density without foreground segmentation.
Webcam Traffic Video Dataset (WebCamT)
We collected and labelled a large-scale webcam traffic dataset, which contains 60 million frames from 212 webcams installed at key intersections of the city. The dataset is annotated with vehicle bounding box, orientation, re-identification, speed, category, traffic flow direction, weather, and time. Unlike the existing car datasets KITTI [14] and DETRAC [32], which focus on vehicle models, our dataset emphasizes real-world traffic network analysis in a large metropolis. The dataset has three benefits: (i) It motivates research on vision based traffic flow analysis, posing new challenges for state-of-the-art algorithms. (ii) With various street scenes, it can serve as a benchmark for transfer learning and domain adaptation. (iii) With a large amount of labeled data, it provides a training set for various learning based models, especially deep learning based techniques.

The contributions of this paper are summarized as follows:
1. We propose an optimization based density estimation method (OPT-RC) that embeds road geometry in the weight matrix and significantly reduces the error induced by perspective. It avoids detecting or tracking individual vehicles.
2. We propose FCN based multi-task learning to jointly estimate vehicle density and count with end-to-end dense prediction. It allows arbitrary input image resolution and adapts to different vehicle scales and perspectives.
3. We collect and annotate a large-scale webcam traffic dataset, which poses new challenges to state-of-the-art traffic density estimation algorithms. To the best of our knowledge, it is the first and largest webcam traffic dataset with elaborate annotations.
4. With extensive experiments on different counting tasks, we verify and compare the proposed FCN-MT with OPT-RC, and obtain insights for future study.

The rest of the paper is outlined as follows. Section 2 introduces the proposed OPT-RC. Section 3 introduces the proposed FCN-MT. Section 4 presents experimental results, and Section 5 compares OPT-RC with FCN-MT.
2. Optimization Based Vehicle Density Estimation with Rank Constraint
To overcome the limitations of existing work, we propose a block-level regression model with rank constraint. The overall framework is described in Section 1.2. We first perform foreground segmentation based on GrabCut [3]. To automate the segmentation process, we initialize the background and foreground based on the difference between the input frame and a pure background image, which is generated by taking a reference image with no vehicles during a light traffic period and transferring it to other time periods by brightness adjustment.

We assume that a stream of $N$ images $I_1, \cdots, I_N$ is given, for which we select a region of interest and divide it into $J$ blocks. The block size can vary, depending on the width of the lane and the length of the smallest vehicle. A block $B_j^{(i)}$ in each image $I_i$ is represented by a feature vector $x_j^{(i)} \in \mathbb{R}^K$. Examples of particular choices of features are given in the experimental section. Each training image $I_i$ is assumed to be annotated with a set of 2D bounding boxes centered at pixels $P_i = \{p_1, \cdots, p_{c(i)}\}$, where $c(i)$ is the total number of annotated vehicles in the $i$-th image. The density functions in our approach are real-valued functions over the pixel grid, whose integrals over image regions equal the vehicle counts. For a training image $I_i$, we calculate the ground truth density from the labeled bounding boxes (shown in Figure 4). A pixel $p$ covered by a set of bounding boxes $O(p)$ has density $D(p)$ defined as

$$D(p) = \sum_{o \in O(p)} \frac{1}{A(o)}, \qquad (1)$$

where $A(o)$ denotes the area of bounding box $o$. We then define the density $D(B_j)$ of a block as

$$D(B_j) = \sum_{p \in B_j} D(p). \qquad (2)$$

Given a set of training images together with their ground truth densities, for each block $B_j$ we learn a block-specific linear regression model that predicts the block-level density $\hat{D}(B_j)$ from its feature representation $x_j$ by

$$\hat{D}(B_j) = w_j^\top x_j, \qquad (3)$$

where $w_j \in \mathbb{R}^K$ is the coefficient vector of the linear regression model to be learned for block $j$. We assign different weights to different blocks. To capture the correlations and commonalities among the regression weight vectors at different blocks, we encourage these vectors to share a low-rank structure. To avoid overfitting, we add $\ell_2$-regularization to these weight vectors; to encourage sparsity of the weights, $\ell_1$-regularization is imposed. Let $W \in \mathbb{R}^{K \times J}$, where the $j$-th column vector $w_j$ denotes the regression coefficient vector of block $j$. To this end, we define the following regularized linear regression model with low-rank constraint:

$$\min_W \; \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{J} \left( w_j^\top x_j^{(i)} - D(B_j^{(i)}) \right)^2 + \alpha \|W\|_F^2 + \beta |W|_1 \quad \text{s.t. } \mathrm{rank}(W) \le r \qquad (4)$$

To solve this rank-constrained problem with a non-smooth objective function, we develop an accelerated projected subgradient descent (APSD) [20] algorithm, outlined in Algorithm 1, which iteratively performs subgradient descent, rank projection, and acceleration. Two sequences of variables $\{A_k\}$ and $\{W_k\}$ are maintained for the purpose of performing acceleration.
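As an illustration of Eqs. (1)-(2), the ground truth block densities could be computed as in the following sketch; the helper name and the slice-based block representation are our assumptions, not the authors' code:

```python
import numpy as np

def ground_truth_block_densities(boxes, image_shape, block_slices):
    """Rasterize Eqs. (1)-(2): each box spreads a unit mass uniformly
    over its area, and pixel densities are summed within each block.

    boxes:        list of (x1, y1, x2, y2) annotated bounding boxes
    image_shape:  (H, W) of the frame
    block_slices: list of (row_slice, col_slice) defining the blocks
    """
    D = np.zeros(image_shape, dtype=np.float64)
    for (x1, y1, x2, y2) in boxes:
        area = max((x2 - x1) * (y2 - y1), 1)
        D[y1:y2, x1:x2] += 1.0 / area      # Eq. (1): D(p) += 1/A(o)
    # Eq. (2): block density is the sum of pixel densities in the block
    return np.array([D[rs, cs].sum() for rs, cs in block_slices])
```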
Subgradient Descent. Subgradient descent is performed over the variable $A_k$. We first compute the subgradient $\nabla A_k$ of the non-smooth objective function
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{J} \left( a_j^\top x_j^{(i)} - D(B_j^{(i)}) \right)^2 + \alpha \|A\|_F^2 + \beta |A|_1,$$
where the first and second terms are smooth, so their subgradients are simply gradients. For the third term $|A|_1$, which is non-smooth, a subgradient $\partial A$ can be computed as
$$\partial A_{ij} = \begin{cases} +1 & \text{if } A_{ij} \ge 0, \\ -1 & \text{if } A_{ij} < 0. \end{cases} \qquad (5)$$
Adding the subgradients of the three terms, we obtain the subgradient $\nabla A_k$ of the overall objective function. Then $A_k$ is updated by
$$A_k \leftarrow A_k - t_k \nabla A_k. \qquad (6)$$

Algorithm 1
Accelerated Projected Subgradient Descent
Input:
Data $D$, rank $r$, regularization parameters $\alpha$, $\beta$
1:  while not converged do
2:      Compute the subgradient $\nabla A_k$
3:      $A_k \leftarrow A_k - t_k \nabla A_k$
4:      Compute the top $r$ singular values and vectors of $A_k$: $U_r, \Sigma_r, V_r$
5:      $W_{k+1} \leftarrow U_r \Sigma_r V_r^\top$
6:      $t_{k+1} \leftarrow (1 + \sqrt{1 + 4 t_k^2}) / 2$
7:      $A_{k+1} \leftarrow W_{k+1} + \frac{t_k - 1}{t_{k+1}} (W_{k+1} - W_k)$
8:  end while
Output: $W \leftarrow W_{k+1}$

Rank Projection
In lines 4-5, we project the newly obtained $A_k$ onto the feasible set $\{W \mid \mathrm{rank}(W) \le r\}$, which amounts to solving the following problem:
$$\min_W \|A_k - W\|_F^2 \quad \text{s.t. } \mathrm{rank}(W) \le r \qquad (7)$$
According to [12], the optimal solution $W^*$ is obtained by first computing the largest $r$ singular values and singular vectors of $A_k$: $U_r, \Sigma_r, V_r$, and then setting $W^*$ to $U_r \Sigma_r V_r^\top$.

Acceleration
In lines 6-7, acceleration is performed by updating the step size according to the rule $t_{k+1} \leftarrow (1 + \sqrt{1 + 4 t_k^2})/2$ and adding a scaled difference between consecutive $W$'s to $A$.

It is worth noting that this optimization problem is not convex, so APSD may converge to a local optimum. It helps to run the algorithm multiple times with different random initializations.
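A compact NumPy sketch of Algorithm 1 is given below. It follows the subgradient, rank-projection, and acceleration steps above; the fixed learning rate and the fixed iteration count are simplifying assumptions on our part:

```python
import numpy as np

def apsd(X, D, r, alpha, beta, lr=1e-3, iters=500):
    """Accelerated projected subgradient descent for Eq. (4), as a sketch.

    X: (N, J, K) block features; D: (N, J) ground-truth block densities
    Returns W of shape (K, J) with rank(W) <= r.
    """
    N, J, K = X.shape
    W = np.zeros((K, J))
    A = W.copy()
    t = 1.0
    for _ in range(iters):
        W_prev = W
        # Gradients of the smooth squared-loss and Frobenius terms
        residual = np.einsum('njk,kj->nj', X, A) - D        # (N, J)
        grad = (2.0 / N) * np.einsum('njk,nj->kj', X, residual)
        grad += 2.0 * alpha * A
        # Subgradient of the non-smooth l1 term, Eq. (5)
        grad += beta * np.where(A >= 0, 1.0, -1.0)
        A = A - lr * grad                                   # Eq. (6)
        # Rank projection, Eq. (7): keep the top-r singular triplets
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        W = (U[:, :r] * s[:r]) @ Vt[:r]
        # Acceleration: scaled difference of consecutive W's added to A
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        A = W + ((t - 1.0) / t_next) * (W - W_prev)
        t = t_next
    return W
```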
3. FCN Based Multi-Task Learning
We also propose an FCN based model to jointly learn vehicle density and global count. Vehicle density estimation is formulated as $D^{(i)} = F(X_i; \Theta)$, where $X_i$ is the input image, $\Theta$ is the set of parameters of the FCN-MT model, and $D^{(i)}$ is the estimated vehicle density map for image $i$. The ground truth density map is generated in the same way as in Section 2.

Inspired by the FCN used in semantic segmentation [21], we develop an FCN to estimate the vehicle density. After the vehicle density map is estimated, the vehicle count can be obtained by integrating over the density map. However, we observed that variation in vehicle scales induces errors in the direct integration. In particular, large buses/trucks (oversized vehicles) in the close view induce sporadic large errors in the counting results. To solve this problem, we propose a deep multi-task learning framework based on FCN to jointly learn the vehicle density map and the vehicle count. Instead of directly regressing the count from the last feature map or the learned density map, we develop a residual learning framework that reformulates global count regression as learning residual functions with reference to the sum of densities in each frame. The overall structure of the proposed FCN-MT is illustrated in Figure 3; it contains a convolution network, a deconvolution network, feature combination and selection, and multi-task residual learning.

The convolution network is based on pre-trained ResNets [17]. Pixel-wise density estimation requires high feature resolution, yet pooling and striding reduce feature resolution significantly. To solve this problem, we rescale and combine the features from three intermediate layers of the ResNet. We then add a convolution layer with 1x1 kernels after the combined feature volume to perform feature re-weighting. By learning the parameters of this layer, the re-weighted features can better distinguish foreground and background pixels. We feed the combined feature volume into the deconvolution network, which contains five deconvolution layers. Inspired by the deep VGG-net [28], we apply small 3x3 kernels in the deconvolution layers. The features are mapped back to the image size by the deconvolution layers, whose parameters are learned during training [24]. A drawback of deconvolution layers is that they can have uneven overlap when the kernel size is not divisible by the stride. We add one convolution layer with 3x3 kernels to smooth the checkerboard artifacts and alleviate this problem. Then one more convolution layer with 1x1 kernels is added to map the feature maps into the density map.
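As a rough PyTorch sketch of the decoder just described (an illustration under assumptions: channel counts, the number of deconvolution layers, and the class name are ours, not the released architecture):

```python
import torch
import torch.nn as nn

class DensityHead(nn.Module):
    """Sketch of FCN-MT's decoder: re-weight combined multi-layer
    features with a 1x1 convolution, upsample with learnable
    deconvolutions (fewer than the paper's five, for brevity), smooth
    checkerboard artifacts with a 3x3 convolution, and emit a
    one-channel density map. Channel sizes are illustrative."""

    def __init__(self, in_channels=896):
        super().__init__()
        self.reweight = nn.Conv2d(in_channels, 256, kernel_size=1)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(inplace=True),
        )
        # 3x3 convolution to smooth deconvolution checkerboard artifacts
        self.smooth = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        # 1x1 convolution mapping features to the density map
        self.to_density = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, combined_features):
        x = self.reweight(combined_features)   # feature re-weighting
        x = self.deconv(x)                     # learnable upsampling
        x = torch.relu(self.smooth(x))
        return self.to_density(x)              # (B, 1, H, W) density
```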
At the last stage of the network, we jointly learn vehicle density and count. The vehicle density is predicted from the feature map by the last 1x1 convolution layer. The Euclidean distance is adopted to measure the difference between the estimated density and the ground truth. The loss function for density map estimation is defined as

$$L_D(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \sum_{p=1}^{P} \left\| F(X_i(p); \Theta) - F_i(p) \right\|^2, \qquad (8)$$

where $N$ is the number of training images and $F_i(p)$ is the ground truth density for pixel $p$ of image $i$.

For the second task, global count regression, we reformulate it as learning residual functions with reference to the sum of densities, consisting of two parts: (i) the base count, i.e., the integration of the density map over the whole image; and (ii) the offset count, predicted by two fully connected layers from the feature map after the 3x3 convolution layer of the deconvolution network. We sum the two parts to get the estimated vehicle count:

$$C(i) = G(D^{(i)}; \gamma) + \sum_{p=1}^{P} D(i, p), \qquad (9)$$

where $\gamma$ denotes the learnable parameters of the two fully connected layers and $D(i, p)$ is the density at pixel $p$ of image $i$. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. Considering that the vehicle count of some frames may be very large, we adopt the Huber loss to measure the difference between the estimated and ground truth counts. The count loss for one frame is defined as

$$L_\delta(i) = \begin{cases} \frac{1}{2} (C(i) - C_t(i))^2 & \text{for } |C(i) - C_t(i)| \le \delta, \\ \delta |C(i) - C_t(i)| - \frac{1}{2} \delta^2 & \text{otherwise}, \end{cases} \qquad (10)$$

where $C_t(i)$ is the ground truth vehicle count of frame $i$, $C(i)$ is the estimated count of frame $i$, and $\delta$ is a threshold controlling the effect of outliers in the training set. The overall loss function of the network is then defined as

$$L = L_D + \frac{\lambda}{N} \sum_{i=1}^{N} L_\delta(i), \qquad (11)$$

where $\lambda$ is the weight of the count loss, tuned to achieve the best accuracy. By simultaneously learning the two related tasks, each task is trained better with far fewer parameters. The loss function is optimized via batch-based Adam and backpropagation. As FCN-MT adapts to different input image resolutions and to variations in vehicle scale and perspective, it is robust across different scenes.
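The multi-task objective of Eqs. (8)-(11) can be sketched as follows; the reduction choice and the value of $\delta$ are assumptions on our part, while $\lambda = 0.1$ follows the experimental section:

```python
import torch
import torch.nn.functional as F

def fcn_mt_loss(pred_density, gt_density, offset_count, gt_count,
                lam=0.1, delta=10.0):
    """Sketch of the FCN-MT objective, Eqs. (8)-(11).

    pred_density: (B, 1, H, W) estimated density maps
    offset_count: (B,) residual counts from the fully connected layers
    gt_count:     (B,) ground truth vehicle counts
    """
    # Eq. (8): pixel-wise Euclidean loss on the density map
    # (mean reduction here; the paper's normalization is 1/2N over sums)
    density_loss = 0.5 * F.mse_loss(pred_density, gt_density)
    # Eq. (9): count = base count (integral of density) + learned offset
    base_count = pred_density.sum(dim=(1, 2, 3))
    pred_count = base_count + offset_count
    # Eq. (10): Huber loss makes the count term robust to outliers
    count_loss = F.huber_loss(pred_count, gt_count, delta=delta)
    # Eq. (11): weighted sum of the two task losses
    return density_loss + lam * count_loss
```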
4. Experiments
We extensively evaluate the proposed methods on different datasets and counting tasks: (i) We first introduce our collected and annotated webcam traffic dataset (WebCamT). (ii) We then evaluate and compare the proposed methods with state-of-the-art methods on WebCamT and present an interesting application: detecting changes in traffic density patterns in NYC on Independence Day. (iii) We evaluate the proposed methods on the public dataset TRANCOS [25]. (iv) We evaluate our methods on the public pedestrian counting dataset UCSD [4] to verify the robustness and generalization of our models.
As there is no existing labeled real-world webcam traffic dataset, in order to evaluate the proposed methods we utilize existing traffic web cameras to collect continuous streams of street images and annotate rich information. Unlike existing traffic datasets, webcam data are challenging to analyze due to their low frame rate, low resolution, high occlusion, and large perspective. We select 212 representative web cameras, covering different locations, camera perspectives, and traffic states. For each camera, we download videos for four time intervals each day (7am-8am, 12pm-1pm, 3pm-4pm, 6pm-7pm). These cameras have a frame rate of around 1 frame/second and low resolution. Collecting these data for 4 weeks generated 1.4 terabytes of video data consisting of 60 million frames. To the best of our knowledge, WebCamT is the first and largest annotated webcam traffic dataset to date.

Figure 4. Annotation instance.

We annotate 60,000 frames with the following information: (i) Bounding box: a rectangle around each vehicle.
(ii) Vehicle type: ten types, including taxi, black sedan, other cars, little truck, middle truck, big truck, van, middle bus, big bus, and other vehicles.
(iii) Orientation: each vehicle's orientation is annotated into four categories: 0, 90, 180, and 270 degrees.
(iv) Vehicle density: the number of vehicles in the ROI of each frame.
(v) Re-identification: we match the same car across successive frames.
(vi) Weather: five types of weather, including sunny, cloudy, rainy, snowy, and intense sunshine.

The annotation of two successive frames is shown in Figure 4. The dataset is divided into training and testing sets, with 45,850 and 14,150 frames, respectively. We select testing videos taken at different times from the training videos. WebCamT serves as an appropriate dataset for evaluating the proposed methods. It also motivates research on vision based traffic flow analysis, posing new challenges for state-of-the-art algorithms. (Please email the authors if you are interested in the dataset.)

We evaluate the proposed methods on the testing set of WebCamT, containing 61 video sequences from 8 cameras and covering different scenes, congestion states, camera perspectives, weather, and times of day. Each video has low resolution and a frame rate of around 1 frame/second. The training set has the same resolution, but comes from different videos. Three metrics are employed for evaluation: (i) mean absolute error (MAE); (ii) mean squared error (MSE); (iii) average relative accuracy (ARA), averaged over the relative errors of all test frames.

For OPT-RC, to compare with the baseline methods, we extract SIFT features and learn visual words for each block. The block size is determined by the lane width and the smallest vehicle length. The parameters in Eq. (4) are selected by cross-validation. For FCN-MT, we divide the training data into two groups, downtown and parkway, and in each group balance the training frames with fewer than 15 vehicles against the frames with more than 15 vehicles. The network architecture is explained in Section 3 and some parameters are shown in Figure 3. The weight λ of the vehicle count loss in Eq. (11) is 0.1. More details can be found in the released code: https://github.com/polltooh/traffic_video_analysis.
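For reference, the three metrics can be computed as in this small sketch; reading "average relative accuracy" as one minus the mean relative error is our assumption:

```python
import numpy as np

def counting_metrics(pred, gt):
    """MAE, MSE, and average relative accuracy over test frames.
    ARA is computed here as the mean of 1 - |pred - gt| / gt, which is
    our assumed reading of 'average relative accuracy'."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    mae = np.abs(pred - gt).mean()
    mse = ((pred - gt) ** 2).mean()
    ara = (1.0 - np.abs(pred - gt) / np.maximum(gt, 1e-8)).mean()
    return mae, mse, ara
```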
Baseline approaches. We compare our methods with two baselines:

Baseline 1: Learning to count [30]. It maps the feature of each pixel into object density with a uniform weight over the whole image. For comparison, we extract dense SIFT features [22] using VLFeat [31]. The ground truth density is computed as a normalized 2D Gaussian kernel based on the center of each labeled bounding box.
Baseline 2: Hydra [25]. It learns a multi-scale regression network using a pyramid of image patches extracted at multiple scales to perform the final density prediction. We train the Hydra 3s model on the same training set as FCN-MT.
Table 1. Accuracy comparison on WebCamT
Method       Downtown          Parkway
             MAE     ARA       MAE     ARA
Baseline 1   5.91    0.5104    5.19    0.5248
Baseline 2   3.55    0.6613    3.64    0.6741
OPT-RC       4.56    0.6102    4.24    0.6281
FCN-MT       2.74    0.7175    2.52    0.784
Experimental Results. We compare the errors of the proposed and baseline approaches in Table 1. From these results, we conclude that FCN-MT outperforms the baseline approaches and OPT-RC on all measurements. As the testing data cover different congestion states, camera perspectives, weather conditions, and times of day, these results verify the generalization and robustness of FCN-MT. OPT-RC outperforms the non-deep-learning Baseline 1 and shows results comparable with Baseline 2 while requiring much less training data. Compared with FCN-MT, OPT-RC is less generalizable, but it learns a smooth density map with geometry information. Figure 5 shows the original image (a), the density map learned by Baseline 1 (b), and the density map learned by OPT-RC (c). The density map from Baseline 1 cannot reflect the perspective present in the video, while the density map from our method captures the camera perspective well. Figure 6 shows the density map learned by FCN-MT. Without foreground segmentation, the learned density map still localizes the vehicle regions and distinguishes background from foreground in both sunny and cloudy, dense and sparse scenes. Due to the uneven overlaps of the deconvolution layers, however, checkerboard artifacts appear on the learned density map. Figures 7 and 8 show the traffic density estimated by OPT-RC and FCN-MT, respectively, over long time sequences and multiple cameras. The MAE of each camera's estimated traffic density is shown at the bottom right of each plot. From the results, we see that FCN-MT yields more accurate estimates than OPT-RC, yet both methods capture the trends of the traffic density. For the same time of day, traffic density from the downtown cameras is on average higher than from the parkway cameras. For the same camera location, traffic density during nightfall (18:00-19:00) is generally higher than at other times of the day; especially for the parkway cameras, nightfall traffic density increases significantly compared with the morning and noon. As the test videos cover different locations, weather, camera perspectives, and traffic states, these results verify the robustness of the proposed methods.
Figure 5. Comparison of OPT-RC and Baseline 1.
Figure 6. Density map from FCN-MT: (a) Downtown; (b) Parkway. Top three rows are cloudy; bottom three rows are sunny.
Figure 7. Estimated traffic density from OPT-RC for three cameras. (Left) Estimated traffic density curve for each camera, where the X-axis is the frame index and the Y-axis is the traffic density. The MAE for each curve is shown at the bottom right of each plot. To show the time series for one day, we select 150 frames for each time interval (morning, noon, afternoon, and evening). (Right) One sample image for each camera.
Special Event Detection. An interesting application of traffic density estimation is to detect changes in traffic density when special events happen in the city.
Figure 8. Estimated traffic density from FCN-MT for three cameras. Same setting as Figure 7.
Figure 9. Independence Day traffic density detection. X-axis: frame index. Y-axis: vehicle count.

To verify that our method can detect such changes, we test FCN-MT on two cameras over multiple days. We find that the traffic density on July 4th at 18:00 differs from that of other, normal days, as illustrated in Figure 9. For the downtown camera (3Ave@49st), the traffic density on July 4th is on average lower than on the other days, and the traffic periodically becomes very sparse. This matches the fact that several roads around 3Ave@49st were closed after 3 pm for the Independence Day fireworks show, resulting in less traffic in the area. For the parkway camera (FDR Dr @ 79st), the average traffic is lower than on Friday (4-29) but higher than on a normal Monday (5-2). As July 4th was also a Monday, it would be expected to have traffic similar to May 2nd. The detected increase in traffic density on July 4th matches the fact that FDR below 68 St was closed, pushing more traffic onto FDR above 68 St. All these observations verify that our method can detect traffic density changes when special events happen.
To verify the efficacy of our methods, we also evaluate and compare the proposed methods with the baselines on a public dataset, TRANCOS [25]. TRANCOS provides a collection of 1244 images of different traffic scenes, obtained from real video surveillance cameras, with a total of 46796 annotated vehicles. The objects have been manually annotated using dots. It also provides a region of interest (ROI) per image, defining the region considered for evaluation. This database provides images from very different scenarios, and no perspective maps are provided. The ground truth object density maps are generated by placing a Gaussian kernel at the center of each annotated object [16].

Table 2. Results comparison on the TRANCOS dataset

Method             MAE      ARA
Baseline 1         13.76    0.6412
Baseline 2-CCNN    12.49    0.6743
Baseline 2-Hydra   10.99    0.7129
OPT-RC             12.41    0.6674
FCN-ST             5.47     0.827
FCN-MT             5.31     0.856

Figure 10. Comparing FCN-MT and Baseline 2-Hydra. X-axis: frame index. Y-axis: vehicle count.
We compare our methods with the baselines in Table 2. Baseline 1 and OPT-RC have the same settings as in the WebCamT evaluation. Baseline 2-CCNN is a basic version of the network in [25], and Baseline 2-Hydra augments its performance by learning a multi-scale non-linear regression model. FCN-ST is a single-task implementation (vehicle density estimation only) of FCN-MT, included for ablation analysis. Baseline 2-CCNN, Baseline 2-Hydra, FCN-ST, and FCN-MT are trained on 823 images and tested on 421 frames, following the split in [25]. From the results, we see that FCN-MT significantly reduces the MAE from 10.99 to 5.31 compared with Baseline 2-Hydra. FCN-MT also outperforms the single-task FCN-ST, verifying the efficacy of multi-task learning. Figure 10 further shows that the counts estimated by FCN-MT fit the ground truth better than those of Baseline 2. OPT-RC outperforms Baseline 1 and obtains results comparable to Baseline 2.
To verify the generalization of the proposed methods to different counting tasks, we evaluate them on the crowd counting dataset UCSD [4]. This dataset contains 2000 frames taken from one surveillance camera, with frame size 238 × 158 and frame rate 10 fps. The average number of people per frame is around 25.

Table 3. Results comparison on the UCSD dataset

Method                                MAE     MSE
Kernel Ridge Regression [1]           2.16    7.45
Ridge Regression [6]                  2.25    7.82
Gaussian Process Regression [4]       2.24    7.97
Cumulative Attribute Regression [5]   2.07    6.86
Cross-scene DNN [35]                  1.6     3.31
OPT-RC                                2.03    5.97
FCN-MT                                1.67    3.41
Following the same settings as [4], we use frames 601 to 1400 as training data and the remaining 1200 frames as test data. OPT-RC has the same settings as in the WebCamT evaluation. An ROI mask is applied to the input images and combined feature maps in both FCN-MT and OPT-RC. Table 3 compares our methods with existing methods. OPT-RC outperforms the non-deep-learning methods but is less accurate than the deep learning based method of [35]. FCN-MT outperforms all the non-deep-learning methods and achieves accuracy comparable to the deep learning based method of [35]. These results show that although our methods were developed to address the challenges of webcam video data, they are also robust on other types of counting tasks.
5. Comparison of OPT-RC and FCN-MT
From the extensive experiments, we highlight some differences between OPT-RC and FCN-MT: (i) OPT-RC can learn geometry information by learning different weights for different blocks of the image. (ii) As the handcrafted SIFT features are not discriminative enough to distinguish background from foreground, OPT-RC relies heavily on background subtraction. FCN-MT, by contrast, extracts hierarchical feature maps, and the combined, re-weighted features are quite discriminative for foreground and background, so FCN-MT does not require background subtraction. (iii) The density map learned by FCN-MT suffers from checkerboard artifacts in some cases.

Despite these differences, FCN-MT and OPT-RC remain strongly connected: both methods map the image into a vehicle density map and overcome the challenges of webcam video data. FCN-MT replaces the background subtraction, feature extractor, and block-regressors of OPT-RC with fully convolutional networks. Both methods avoid detecting or tracking individual vehicles and adapt to different vehicle scales. In future research, domain transfer learning will be explored to make the models more robust across multiple cameras.
Acknowledgment
This research was supported in part by Fundação para a Ciência e a Tecnologia (project FCT [SFRH/BD/113729/2015]). João P. Costeira is partially supported by project [Lx-01-0247-FEDER-017906] SmartCitySense, funded by ANI.

References

[1] S. An, W. Liu, and S. Venkatesh. Face recognition using kernel ridge regression. In CVPR, pages 1–7. IEEE, 2007.
[2] C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the wild. In European Conference on Computer Vision, pages 483–498. Springer, 2016.
[3] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH, 2004.
[4] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–7. IEEE, 2008.
[5] K. Chen, S. Gong, T. Xiang, and C. Change Loy. Cumulative attribute space for age and crowd density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2467–2474, 2013.
[6] K. Chen, C. C. Loy, S. Gong, and T. Xiang. Feature mining for localised crowd counting. In BMVC, volume 1, page 3, 2012.
[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[8] Y.-L. Chen, B.-F. Wu, H.-Y. Huang, and C.-J. Fan. A real-time vision system for nighttime vehicle detection and traffic surveillance. IEEE Transactions on Industrial Electronics, 2011.
[9] Z. Chen, T. Ellis, and S. A. Velastin. Vehicle detection, tracking and classification in urban traffic. In IEEE Conference on Intelligent Transportation Systems, 2012.
[10] T. E. Choe, M. W. Lee, and N. Haering. Traffic analysis with low frame rate camera networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2010.
[11] J. Du and Y.-C. Wu. Network-wide distributed carrier frequency offsets estimation and compensation via belief propagation. IEEE Transactions on Signal Processing, 61(23):5868–5877, 2013.
[12] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1936.
[13] K. Garg, S.-K. Lam, T. Srikanthan, and V. Agarwal. Real-time road traffic density estimation using block variance. In IEEE Winter Conference on Applications of Computer Vision, 2016.
[14] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[15] W. N. Gonçalves, B. B. Machado, and O. M. Bruno. Spatiotemporal Gabor filters: a new method for dynamic texture recognition. arXiv preprint arXiv:1201.3612, 2012.
[16] R. Guerrero-Gómez-Olmedo, B. Torre-Jiménez, R. López-Sastre, S. Maldonado-Bascón, and D. Oñoro-Rubio. Extremely overlapping vehicle counting. In Iberian Conference on Pattern Recognition and Image Analysis, pages 423–431. Springer, 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] S. Hua, J. Wua, and L. Xub. Real-time traffic congestion detection based on video analysis. J. Inf. Comput. Sci., 2012.
[19] B. S. Kerner. Introduction to Modern Traffic Flow Theory and Control: The Long Road to Three-Phase Traffic Theory. 2009.
[20] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pages 379–387, 2015.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[22] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[23] G. Mo and S. Zhang. Vehicles detection in traffic flow. In Sixth International Conference on Natural Computation, 2010.
[24] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[25] D. Oñoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, pages 615–629. Springer, 2016.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[27] R. Shirani, F. Hendessi, and T. A. Gulliver. Store-carry-forward message dissemination in vehicular ad-hoc networks with local density estimation. In IEEE 70th Vehicular Technology Conference, 2009.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] E. Toropov, L. Gui, S. Zhang, S. Kottur, and J. M. F. Moura. Traffic flow from a low frame rate city camera. In IEEE International Conference on Image Processing, 2015.
[30] V. Lempitsky and A. Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems, 2010.
[31] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
[32] L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang, H. Qi, J. Lim, M. Yang, and S. Lyu. DETRAC: A new benchmark and protocol for multi-object detection and tracking. arXiv CoRR, abs/1511.04136, 2015.
[33] F. Xia and S. Zhang. Block-coordinate Frank-Wolfe optimization for counting objects in images. In Advances in Neural Information Processing Systems Workshops, 2016.
[34] X.-D. Yu, L.-Y. Duan, and Q. Tian. Highway traffic information extraction from skycam MPEG video. In IEEE International Conference on Intelligent Transportation Systems, 2002.
[35] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
[36] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016.
[37] Z. Zhao, H. Li, R. Zhao, and X. Wang. Crossing-line crowd counting with two-phase deep neural networks. In European Conference on Computer Vision, pages 712–726. Springer, 2016.
[38] Y. Zheng and S. Peng. Model based vehicle localization for urban traffic surveillance using image gradient based matching. In 15th International IEEE Conference on Intelligent Transportation Systems, 2012.