Online Clustering-based Multi-Camera Vehicle Tracking in Scenarios with Overlapping FOVs
Elena Luna, Juan C. SanMiguel, José M. Martínez, and Marcos Escudero-Viñolo
Abstract—Multi-Target Multi-Camera (MTMC) vehicle tracking is an essential task of visual traffic monitoring, one of the main research fields of Intelligent Transportation Systems. Several offline approaches have been proposed to address this task; however, they are not compatible with real-world applications due to their high latency and post-processing requirements. In this paper, we present a new low-latency online approach for MTMC tracking in scenarios with partially overlapping fields of view (FOVs), such as road intersections. Firstly, the proposed approach detects vehicles at each camera. Then, the detections are merged between cameras by applying cross-camera clustering based on appearance and location. Lastly, the clusters containing different detections of the same vehicle are temporally associated to compute the tracks on a frame-by-frame basis. The experiments show promising low-latency results while addressing real-world challenges such as the a-priori unknown and time-varying number of targets and their continuous state estimation, without performing any post-processing of the trajectories.

Index Terms—multi-camera tracking, multi-target tracking, online tracking, intelligent transportation systems.
Elena Luna, Juan C. SanMiguel, José M. Martínez, and Marcos Escudero-Viñolo are with the Video Processing and Understanding Lab, Universidad Autónoma de Madrid, Spain, e-mail: {elena.luna; juancarlos.sanmiguel; josem.martinez; marcos.escudero}@uam.es.

I. INTRODUCTION

Intelligent Transportation Systems (ITS) are considered a key part of smart cities. Consistent with the accelerated development of modern sensors, new computing capabilities and communications, ITS technology engages the attention of both academia and industry. ITS aim to offer smarter transportation facilities and vehicles, along with safer transport services.

One of the main research fields in ITS is visual traffic monitoring using video analytics on data captured by visual sensors. These data can be used to provide information, such as traffic flow estimation, or to detect traffic patterns or anomalies. In recent years it has become an active field within the computer vision community [1]–[3]; however, it still remains a challenging task [4], especially when multiple cameras are considered. In contrast to mono-camera traffic monitoring, multi-camera setups require a more complex infrastructure, the capability of dealing with more data simultaneously, and a higher processing capability. Multi-Target Multi-Camera (MTMC) tracking algorithms are fundamental for many ITS technologies.

Different from Multi-Target Single-Camera (MTSC) tracking [5], [6], MTMC tracking entails the analysis of visual signals captured by multiple cameras, considering setups with overlapping fields of view (FOVs), but also scenarios for wide-area monitoring, where cameras may be separated by large distances. Road intersections are well-known targets for monitoring due to the high number of reported accidents and collisions [7]. These intersections are known for their intrinsic and complex nature due to the variety of the vehicles' behaviors. This kind of scenario is usually monitored with multiple partially overlapping cameras, which introduces new challenges, but also powerful opportunities for video analysis (e.g., traffic flow optimization and pedestrian behaviour analysis).

For the multi-camera tracking problem, efficient data association across cameras, and also across frames, becomes the key problem to solve. A considerable amount of existing MTMC vehicle tracking algorithms perform an offline batch processing scheme to carry out the association [8]–[15]. They consider previous and future frames, and often the whole video sequences at once, to merge vehicle trajectories across cameras and time. They also rely on post-processing techniques to refine the resulting trajectories. This offline scheme provides more robustness, compared to online designs, albeit it is not compatible with online applications; hence, limiting its applicability in real-time traffic monitoring scenarios.

In this paper, we describe the first, to the best of our knowledge, low-latency online MTMC vehicle tracking approach for cameras with partially overlapping FOVs capturing intersection scenarios. The proposed system follows an online and frame-by-frame processing scheme. Furthermore, compared to other state-of-the-art systems (see Table I), our approach does not perform any post-processing track refinement, it is agnostic to potential motion patterns (i.e., it works without prior knowledge of vehicle paths within the cameras' FOVs) and it does not require additional manual ad-hoc annotations (e.g., definition of regions and boundaries on the roads). These last two characteristics avoid the need of configuring each real setup where the system is deployed, improving flexibility and generalising its use.

The proposed MTMC tracking approach builds upon the detection of multiple vehicles on every single camera. Afterwards, a combined cross-camera agglomerative clustering, combining spatial locations (using GPS coordinates) and appearance features, is used to merge vehicles from different cameras. This clustering is evaluated using validation indexes and, finally, a temporal linkage of the obtained clusters is performed to obtain the trajectories of each moving vehicle in the scene along time.

This paper is an extended version of our related conference publication [16], with additional contributions as follows.
First, we include and evaluate the impact of additional object detectors. Second, we remove any offline dependency in order to become a genuine online approach. Third, we design and train a completely new appearance feature extraction, and also investigate the impact of an additional dataset for training. Fourth, we improve the cross-camera clustering and temporal association reasoning. Fifth, we design and implement a new occlusion handling strategy. Lastly, we perform an extensive ablation study to measure the impact of different parameters and strategies at different stages of the proposal, and we show results in a detailed comparison with the state-of-the-art.

The paper is organized as follows. Section II reviews the state-of-the-art in MTMC vehicle tracking. Section III describes the proposed approach. Section IV presents the evaluation framework, the implementation details, the ablation study and, finally, a comparison with the state-of-the-art. Concluding remarks are given in Section V.

TABLE I
Comparison of available MTMC vehicle tracking approaches. The table shows differences regarding the type of processing, the use of post-processing of tracks, the awareness about the vehicles' motion patterns, the use of ad-hoc information annotated manually, and the level of cross-camera association. As can be seen, ours is the only online approach considering detections to perform the cross-camera association; also, we do not employ any post-processing of the tracks, we are agnostic to the motion patterns of the vehicles, and we do not use any additional manual annotations.

Approach | Processing | Post-processing of tracks | Awareness of motion patterns | Ad-hoc annotations | Level of association
Baidu [8] | Offline | ✓ | ✓ | ✓ | Tracklets
NCCU-UAlbany [9] | Offline | - | ✓ | - | Tracklets
CUNY NPU [10] | Offline | ✓ | ✓ | ✓ | Tracklets
BUPT [11] | Offline | - | - | - | Tracklets
ANU [12] | Offline | - | - | - | Tracklets
UWIPL [13] | Offline | ✓ | ✓ | ✓ | Tracklets
DiDi Global [14] | Offline | ✓ | - | - | Tracklets
Shanghai Tech. U. [15] | Offline | - | ✓ | ✓ | Tracklets
Ours | Online | - | - | - | Detections

II. RELATED WORK
In recent years, several approaches devoted to tracking pedestrians in multi-camera environments have been published [17]–[22]. The release of public benchmarks such as MARS [23] and DukeMTMC [24] pushed the research community to put effort into Multi-Target Multi-Camera tracking oriented to people tracking.

Due to the lack of appropriate publicly available datasets, MTMC tracking focused on vehicles was a nearly unexplored field. To encourage research and development in ITS problems, the AI City Challenge Workshop launched three distinct but closely related tasks: 1) City-Scale Multi-Camera Vehicle Tracking, 2) City-Scale Multi-Camera Vehicle Re-Identification and 3) Traffic Anomaly Detection. Focusing on MTMC tracking, the CityFlow benchmark was presented [25]. At the time of publication, it is the only dataset and benchmark for MTMC vehicle tracking. Figure 1 depicts four sample views from an intersection in the CityFlow benchmark.

The major challenge of tracking vehicles is the viewpoint variation problem. As can be seen in Figure 2, different vehicles may appear quite similar from the same viewpoint, whereas the same vehicle captured from different viewpoints may be difficult to recognise. It can be extremely hard, even for humans, to determine if two vehicles from different points of view depict the same car (e.g., as shown in Figure 2, pairs [(a), (d)], [(b), (e)] and [(c), (f)]).

Fig. 2. Illustration of the viewpoint variation problem. Under the same view different vehicles may appear very similar (a), (b) and (c), while the same car from different viewpoints may be extremely difficult to recognise [(a), (d)], [(b), (e)] and [(c), (f)].

According to the processing scheme, MTMC tracking methods can be categorized in two groups: 1) offline methods, and 2) online methods. Offline tracking methods perform a global optimization to find the optimal association using the entire video sequence. The vehicle detections are temporally grouped into tracklets (short trajectories of detections) using MTSC tracking techniques and, afterwards, tracklet-to-tracklet association is performed, mainly by using re-identification techniques: considering the whole video sequences at once [8], [11], [13]–[15], considering windows of frames [10], or even combining both approaches [12]. On the other side, online approaches need to perform cross-camera association of target detections on a frame-by-frame basis, using detector outputs (usually, bounding boxes) as the smallest unit for matching, instead of tracklets.

As can be seen in Table I, to the best of our knowledge, all existing approaches chose to work in an offline way. In order to remove false positive trajectories or ID switches [24], the offline approaches sometimes apply post-processing filtering at the end of some intermediate stages [8], [10], [14], or at the end of the whole procedure [13]. Being aware of the motion patterns that the vehicles can adopt in every camera view can also help to remove undesired trajectories and, therefore, increase the recall [8]–[10], [13], [15]. Offline operation also allows applying additional temporal constraints to increase the performance [8], [10], [13]. Another strategy to improve overall performance consists in incorporating some additional manually annotated, scenario-specific information; for example, additional vehicle attributes (colour, type, etc.) for getting a better appearance model [8], or road boundaries [10].

It is common in the literature of MTMC tracking to treat the tracklet-to-tracklet cross-camera association task as a clustering problem, grouping tracklets by appearance features [8], [11], [26], or by combining appearance and other constraints (e.g., time and location) [10], [13], [27], [28]. Clustering algorithms are often categorized into two broad categories: 1) partitioning algorithms (center-based, e.g., K-means [29], or density-based, e.g., DBSCAN [30]), and 2) hierarchical clustering [31] (agglomerative or divisive). While hierarchical algorithms build clusters gradually (as a tree of clusters) and do not require pre-specification of the number of clusters, partitioning algorithms learn clusters at once and require pre-specification of the number of clusters (K-means) or the minimum number of points defining a cluster (DBSCAN). Therefore, hierarchical clustering is advantageous when there is no prior knowledge about the number of clusters; on the other hand, it outputs a tree of clusters, commonly represented as a dendrogram. Such structure does not provide the number of clusters, but gives information about the relations between the data. For this reason, cluster validation techniques, such as the Davies-Bouldin index [32], the Dunn index [33] or the Silhouette coefficient [34], are used to determine the number of clusters, which may differ for each technique. In the proposed work, as there is no prior knowledge about the number of vehicles in the scene, we apply agglomerative hierarchical clustering combining location and appearance information.

Existing MTMC vehicle tracking approaches firstly compute tracklets by temporally merging detections on every single camera, and then perform cross-camera tracklet-to-tracklet association. In contrast, we firstly compute clusters by cross-camera association of vehicle detections and, afterwards, on a frame-by-frame basis, we temporally associate the clusters to compute the tracks.
III. PROPOSED APPROACH

In the proposed online Multi-Target Multi-Camera tracking approach, all cameras' videos are processed simultaneously frame by frame, without any post-processing of the trajectories. The approach is composed of five processing blocks, as shown in Figure 3. As input, we consider a network of calibrated and synchronized cameras with partially overlapping FOVs providing independent video sequences. Given a network of N cameras, the pipeline includes the following stages: (1) vehicle detection, (2) feature extraction, (3) homography projection, which projects single-camera vehicles from each camera to the world coordinate system (GPS) for providing location information, (4) cross-camera clustering, which is fed with the output of blocks (2) and (3), and (5) temporal association of vehicle trajectories over time to compute the tracks. As a result, the system generates tracks consisting of the identity and location of every vehicle along time. The design of the processing blocks is detailed in the following subsections, whilst the implementation details are given in Section IV-B.

Table II summarizes the notation used in this section. The scope of each variable is also defined. Scenario refers to the set of cameras, Frames stands for all the simultaneous images coming from the cameras at each temporal instant, and Sequence is comprised of all the aggregated frames coming simultaneously from the cameras. N and H_n are intrinsic to the scenario, while r is a design parameter. D, B, W, F, and L are computed at each temporal instant, needing the simultaneous frames. Last, T is updated frame-by-frame for the whole sequence.

TABLE II
Notation used throughout the paper.

Symbol | Description | Scope
N | Number of cameras | Scenario
H_n | Homography matrix of the n-th camera | Scenario
r | Association radius | Scenario
D | Number of total detections | Frames
B | Set of bounding boxes | Frames
W | Set of GPS world locations | Frames
F | Set of feature descriptors | Frames
L | Set of clusters | Frames
T | Set of tracks | Frames, Sequence

A. Vehicle Detection
As most of the state-of-the-art MTMC tracking methods, we follow the tracking-by-detection paradigm. Therefore, the first stage of the pipeline is vehicle detection at each frame. Let b = [x, y, w, h] be a bounding box with [x, y] being the upper-left corner pixel coordinates, and [w, h] the width and height. Let us define B = {b_d, d ∈ [1, D]} as the set of bounding boxes at each frame for all the cameras, with D the total number of detections. Note that the proposal can incorporate any single-camera vehicle detection algorithm whose output is in bounding box form.

Fig. 3. Block diagram of the proposed approach. The inputs are frames from N cameras. The trajectories are computed for each frame. First, the vehicle detection block computes B, the set of vehicle detections. B feeds both the feature extraction and homography projection blocks. F is the set of appearance feature descriptors and W the set of GPS world coordinates of every vehicle. The cross-camera clustering block uses F and W to aggregate different views of the same vehicle and to compute the set of clusters L at each temporal instant. Lastly, the temporal association block associates clusters in L over time to compute the set of tracks T.

B. Feature Extraction
In order to describe the appearance of the d-th bounding box detection, let f_d be its k-dimensional deep feature descriptor. Let F = {f_d, d ∈ [1, D]} be the set of appearance feature descriptors of all the detected vehicles at each frame. Due to the intrinsic geometry of vehicles, their appearance may suffer strong variations across different camera views. This variance is such that it could be very hard, even for a human being, to determine whether two views depict the same vehicle. Thus, in order to obtain highly discriminating features, we trained a model to improve the vehicle classification ability in the considered scenario. More details on this vehicle-specific model are given in Section IV-B2.

Class imbalance is a form of the imbalance problem [35] that occurs when there is an important inequality regarding the number of examples pertaining to each class in the data. When not addressed, it may have negative effects on the final performance. It is known that classes with a higher number of observations tend to dominate the learning process, hindering the learning and generalization of low-represented classes. In order to minimize the imbalance effects, instead of the classical Cross-Entropy (CE) loss [36], we employ the focal loss (FL) proposed in [37].
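For illustration, the focal loss down-weights the contribution of well-classified examples so that hard or under-represented identities drive the gradient. Below is a minimal multi-class focal loss sketch written on top of the standard cross-entropy; the focusing parameter gamma and the optional class weights alpha are generic hyper-parameters, not values prescribed by our training setup.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss (Lin et al. [37]) as a drop-in replacement
    for cross-entropy: FL(p_t) = -(1 - p_t)^gamma * log(p_t).

    logits:  (batch, num_classes) raw classifier scores.
    targets: (batch,) integer identity labels.
    gamma:   focusing parameter; gamma = 0 recovers cross-entropy.
    alpha:   optional (num_classes,) tensor of per-class weights.
    """
    log_p = F.log_softmax(logits, dim=-1)                       # log-probabilities
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t of the true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt                      # down-weight easy samples
    if alpha is not None:
        loss = alpha[targets] * loss                            # optional class re-weighting
    return loss.mean()
```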
C. Homography-based Projection

This processing block computes the location of each detected vehicle on the common ground plane employing GPS coordinates. Let H_n be the homography matrix that transforms coordinates from the image plane of the n-th camera to the GPS coordinates of the common ground plane, and let H_n^{-1} be the inverse transformation. We leverage the GPS coordinates to achieve a high-precision clustering based on the location information by applying camera projection.

Fig. 4. Vehicle detections from four partially overlapping cameras projected to GPS coordinates at a certain temporal instant. Detections within an 8-meter radius are more likely to be joined. (Best viewed in color)

Given a bounding box b, one can obtain its associated GPS coordinates, i.e., [φ, λ] (latitude and longitude), by projecting the middle point of its base with the H_n transformation. W = {[φ, λ]_d, d ∈ [1, D]}, the set of GPS coordinates, is obtained after applying the transformation to the set B. Figure 4 illustrates an example of the projected detections coming from different cameras. Note that this block relies on the output of the object detection stage and, along with the feature extraction module, it feeds the cross-camera clustering.
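To make the projection step concrete, the sketch below maps the midpoint of a bounding-box base to the ground plane with a single perspective transform. It assumes the 3x3 homography provided by the benchmark calibration and uses OpenCV's perspectiveTransform; function and variable names are illustrative.

```python
import numpy as np
import cv2

def bbox_to_gps(bbox, H_n):
    """Project the middle point of a bounding-box base to GPS coordinates.

    bbox: (x, y, w, h) with (x, y) the upper-left corner in pixels.
    H_n:  3x3 homography mapping the image plane of camera n to the
          GPS ground plane (latitude, longitude).
    Returns (phi, lambda), i.e. latitude and longitude.
    """
    H_n = np.asarray(H_n, dtype=np.float64)
    x, y, w, h = bbox
    base_mid = np.array([[[x + w / 2.0, y + h]]], dtype=np.float64)  # shape (1, 1, 2)
    gps = cv2.perspectiveTransform(base_mid, H_n)                    # homogeneous warp
    phi, lam = gps[0, 0]
    return float(phi), float(lam)
```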
Given the sets B, W and F, the cross-camera clustering block associates different camera views of the same vehicle at each frame to compute L = {l_i, i ∈ [1, L]}, the set of clusters at a given frame, with L the number of created clusters. A cluster's content ranges from a single detection, if the vehicle is only visible from one camera, to the maximum number of detections, defined by the maximum number of cameras capturing the scene. To create the clusters, we compute a frame-by-frame linkage by performing an agglomerative hierarchical clustering combining location and appearance features.

Hierarchical clustering [31] requires a square connectivity matrix of distances (dissimilarities) or similarities of the input data to merge. We compute the connectivity matrix Θ as a constrained pairwise feature distance between all the vehicles coming from every camera at each frame. At each frame, we compute the pairwise Euclidean distance between the appearance feature vectors of all the vehicles under consideration, as follows:

$$\zeta_{d,d'} = \left\| \mathbf{f}_d - \mathbf{f}_{d'} \right\| \quad (1)$$

Also at each frame, we compute the pairwise Euclidean distance between the GPS coordinates of all the vehicles:

$$\psi_{d,d'} = \left\| \left( \phi_d - \phi_{d'},\ \lambda_d - \lambda_{d'} \right) \right\| \quad (2)$$

The spatial distance and the camera ID are used to apply some constraints. Since two vehicle detections widely separated in GPS coordinates are highly unlikely to come from the same vehicle, it is reasonable to assume a maximum association distance. This constraint narrows down the list of vehicles to be matched and improves the ability to distinguish different identities by focusing on comparing only nearby targets. Hence, the constrained matrix Θ' is computed as follows:

$$\Theta'_{d,d'} = \begin{cases} \zeta_{d,d'}, & \psi_{d,d'} \leq r \\ \infty, & \psi_{d,d'} > r \end{cases} \quad (3)$$

being r the maximum association radius. A second condition is applied to prevent vehicle detections from the same camera view from being merged together. It is done by constraining the association matrix as follows:

$$\Theta_{d,d'} = \begin{cases} \Theta'_{d,d'}, & c_d \neq c_{d'} \\ \infty, & c_d = c_{d'} \end{cases} \quad (4)$$

where c_d is the camera yielding the d-th detection.

As stated above, hierarchical clustering methods depart from a connectivity matrix Θ to compute a tree of clusters; this structure does not provide the number of clusters, but gives information on the relations between the data. These relationships can be represented by a tree diagram called dendrogram; an example is presented in Figure 5. In order to cut the dendrogram and identify the optimal number of clusters, we use the Dunn index [33] for cluster validation. The aim of this index is to find clusters that are compact, with a small variance between members of the cluster, and well separated, by comparing the minimal inter-cluster distance to the maximal cluster diameter. The cluster diameter is defined as the distance between the two farthest elements in the cluster. This process provides the number of vehicles in the scene at every frame, in the form of clusters, as well as their locations, in the form of the cluster centroids (computed as the mean point at each coordinate axis of all the components). To sum up, at every frame, each cluster designates an existing vehicle.
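As an illustration of this stage, the sketch below builds the constrained connectivity matrix of Eqs. (1)-(4) and runs an agglomerative clustering with SciPy. Complete linkage is assumed here as the merging criterion, and a large finite constant stands in for the infinite distance so that the linkage remains numerically valid; both are implementation assumptions rather than details fixed by the method description.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform, cdist

INF = 1e6  # large finite value standing in for the "infinite" distance

def connectivity_matrix(feats, gps, cam_ids, r):
    """Constrained pairwise distance matrix Theta of Eqs. (1)-(4).

    feats:   (D, k) appearance descriptors f_d.
    gps:     (D, 2) projected locations [phi, lambda] per detection.
    cam_ids: (D,)   camera index c_d of each detection.
    r:       maximum association radius (same units as the gps distances).
    """
    zeta = cdist(feats, feats)                    # Eq. (1): appearance distances
    psi = cdist(gps, gps)                         # Eq. (2): location distances
    theta = np.where(psi <= r, zeta, INF)         # Eq. (3): radius constraint
    same_cam = cam_ids[:, None] == cam_ids[None, :]
    theta[same_cam] = INF                         # Eq. (4): same-camera constraint
    np.fill_diagonal(theta, 0.0)
    return theta

def cluster_tree(theta):
    """Agglomerative clustering over the condensed form of Theta."""
    return linkage(squareform(theta, checks=False), method='complete')
```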
Fig. 5. Dendrogram illustrating the hierarchical relationships between all the detected vehicles d ∈ [1, 31] at a certain temporal instant. After cluster validation, the dendrogram is cut at height = 50: detections that are joined together below the red line are part of the same cluster. In this example (18, 25), (6, 27), (2, 14, 28, 19) and (13, 29) are joined together, while the rest of the detections comprise a cluster by themselves. Hence, L = 25.
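To make the dendrogram cut concrete, a simplified sketch of selecting the number of clusters with the Dunn index follows: it evaluates candidate flat cuts of the tree and keeps the one maximizing the ratio of the minimal inter-cluster distance to the maximal cluster diameter. Iterating over all candidate cut levels and the handling of degenerate cases are assumptions of this sketch, not the exact implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

def dunn_index(theta, labels):
    """Dunn index: minimal inter-cluster distance / maximal cluster diameter."""
    ids = np.unique(labels)
    if len(ids) < 2:
        return -np.inf
    diam = max(theta[np.ix_(labels == i, labels == i)].max() for i in ids)
    inter = min(theta[np.ix_(labels == i, labels == j)].min()
                for i in ids for j in ids if i < j)
    return inter / diam if diam > 0 else np.inf

def cut_dendrogram(tree, theta):
    """Pick the flat clustering with the highest Dunn index."""
    n = theta.shape[0]
    if n < 3:                                   # degenerate case: keep singletons
        return np.arange(1, n + 1)
    best_labels, best_score = None, -np.inf
    for k in range(2, n):                       # candidate numbers of clusters
        labels = fcluster(tree, t=k, criterion='maxclust')
        score = dunn_index(theta, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```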
E. Temporal Association

The last stage of the proposed approach links clusters over time to estimate the vehicle tracks. Let t_j = [x_j^{s_start}, ..., x_j^{s_end}] be the j-th track defining the trajectory of a moving vehicle by a succession of states. Each state is described by x_j^s = [φ, λ, v_φ, v_λ], where [φ, λ] is the target location and [v_φ, v_λ] is the target velocity, both represented using GPS coordinates. Let us define T = {t_j, j ∈ [1, J]} as the set of tracks along the video sequence. In contrast to the previous sets B, W and F, which are initialized at each frame, T is built incrementally, i.e., it is computed at the first frame and updated along time. In other words, tracks depict the location of clusters along time. As in the whole system, the temporal association is performed online, that is, frame-by-frame.

Vehicle motion is estimated using a constant-velocity Kalman filter [38]. The Kalman filter makes a prediction of the state of the target as a combination of the target's previous state (at the prior frame) and the new measurement (at the current frame) using a weighted average. It results in a new state estimation lying in between the previous prediction and the measurement. Thus, at each frame, on the one hand we employ the Kalman filter to get the estimated locations of the tracks from the previous frame and, on the other hand, we get the current vehicle measurements as the clusters resulting from the cross-camera association. In order to associate both, we apply the Hungarian algorithm [39] to solve the assignment problem, using an association matrix to enumerate all possible assignments. The association matrix is filled with the pairwise L2-norm, i.e., the Euclidean distance, between the locations of the estimated tracks and the cluster centroid locations (see Section III-D).

To provide robustness against occlusions, we designed two strategies: a blind occlusion handling and a reprojection-based occlusion handling. The first keeps tracks alive during a short time when the detections associated to them are lost. Continuing to predict the position of the track during that period allows recovering it in case the detections are recovered. This is helpful if the vehicle detector loses a detection, either due to a bad detection performance or a hard occlusion. The second strategy detects whether a track has lost one or more of its associated detections and looks for the same track in the previous frame to get the information about the size of its previously associated bounding boxes. The new location in the current frame is inferred by applying the corresponding inverse homography matrix (e.g., H_n^{-1} assuming a detection is missing for the n-th camera) to the estimated track position. Therefore, when this strategy reveals a track whose detection or detections are lost, mostly due to an occlusion the detector cannot deal with, we can generate an artificial detection with accurate estimates of the correct position and the previously detected size of the occluded vehicle.
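A compact sketch of the per-frame association described above follows: each live track predicts its state with a constant-velocity Kalman filter, the cost matrix collects Euclidean distances between predicted track positions and cluster centroids, and the assignment is solved with the Hungarian algorithm. The Kalman filter class, the noise parameters, the gating threshold and the helper names are illustrative assumptions, not the exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter on [phi, lambda, v_phi, v_lambda]."""
    def __init__(self, phi, lam, q=1e-4, r_meas=1e-3):
        self.x = np.array([phi, lam, 0.0, 0.0])
        self.P = np.eye(4)
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0  # position += velocity
        self.H = np.eye(2, 4)                                  # only position is observed
        self.Q = q * np.eye(4); self.R = r_meas * np.eye(2)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

def associate(tracks, centroids, max_dist=8.0):
    """Match predicted tracks to cluster centroids with the Hungarian algorithm."""
    if not tracks or len(centroids) == 0:
        return []
    preds = np.array([t.predict() for t in tracks])                       # (T, 2)
    cost = np.linalg.norm(preds[:, None, :] - centroids[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_dist]
    for i, j in matches:
        tracks[i].update(centroids[j])
    return matches
```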
IV. EXPERIMENTS

A. Evaluation Framework

1) Datasets:
We considered the CityFlow benchmark [25], since there is no other publicly available dataset devoted to MTMC vehicle tracking with partially overlapping FOVs. The dataset comprises videos from 40 cameras, 195 total minutes recorded for all cameras, and manually annotated ground-truth consisting of 229,690 bounding boxes for 666 vehicles. The dataset is divided into 5 scenarios (S01, S02, S03, S04 and S05) covering intersections and stretches of roadways. S01 and S02 have overlapping FOVs, while S03, S04 and S05 are wide-area scenarios. The CityFlow benchmark also provides the camera homography matrices between the 2D image plane and the ground plane defined by GPS coordinates, based on the flat-earth approximation.

We have also used the VeRi-776 dataset to improve the feature extraction model by using it as additional training data. VeRi-776 [40] is one of the largest and most common datasets for vehicle re-identification in multi-camera scenarios. It comprises about 50,000 bounding boxes of 776 vehicles captured by 20 cameras.
2) Evaluation Metrics:
The MTMC tracking ground-truth provided by the CityFlow benchmark consists of the bounding boxes of multi-camera vehicles labeled with consistent IDs. Following the CityFlow benchmark evaluation methodology, the Identification Precision, Identification Recall and Identification F1 Score measures [24] are adopted:

$$IDP = \frac{IDTP}{IDTP + IDFP} \quad (5)$$

$$IDR = \frac{IDTP}{IDTP + IDFN} \quad (6)$$

$$IDF_1 = \frac{2 \cdot IDTP}{2 \cdot IDTP + IDFP + IDFN} \quad (7)$$

where IDTP, IDFP and IDFN stand for True Positive IDs, False Positive IDs and False Negative IDs, respectively. IDP (IDR) is the fraction of computed (ground-truth) tracks that are correctly identified. IDF1 is the ratio of correctly identified tracks over the average number of ground-truth and computed tracks.

The tracks automatically obtained by the proposed method are pairwise compared with the ground-truth tracks. We declare a match, i.e., an IDTP, when two tracks temporally coexist and the area of the intersection of the bounding boxes is higher than τ_IoU (with 0 < τ_IoU < 1) times the area of the union of the two boxes. Hence, τ_IoU is the Intersection over Union (IoU) threshold. A high IDF1 score is obtained when the correct multi-camera vehicles are detected, accurately tracked within each camera view, and labeled with a consistent ID across all the views in the dataset.
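As a reference for Eqs. (5)-(7), the sketch below turns the ID counts into the three scores and includes the IoU test used to declare an IDTP match; the counting of IDTP/IDFP/IDFN itself (the truth-to-result track assignment of [24]) is outside the scope of this snippet.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def id_scores(idtp, idfp, idfn):
    """IDP, IDR and IDF1 as in Eqs. (5)-(7)."""
    idp = idtp / (idtp + idfp) if idtp + idfp > 0 else 0.0
    idr = idtp / (idtp + idfn) if idtp + idfn > 0 else 0.0
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn) if idtp > 0 else 0.0
    return idp, idr, idf1
```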
3) Hardware and software:
The algorithm and model training have been implemented using the PyTorch 1.0.1 deep learning framework, running on a computer with a 6-core CPU and an NVIDIA GeForce GTX 1080 12GB Graphics Processing Unit.
B. Implementation Details

1) Vehicle detection:
Regarding single-camera vehicle detection, we have experimented with public detections, i.e., vehicle detections provided by the CityFlow benchmark, and private detections, computed using a state-of-the-art algorithm. The public detections were obtained by using three popular detectors: YOLOv3 [41], SSD512 [42] and Mask R-CNN [43]. YOLOv3 is a one-stage object detector that solves detection as a regression problem. SSD512 is also a single-shot detector which directly predicts category scores and box offsets for a fixed set of default bounding boxes of different scales at each location. Mask R-CNN, on the contrary, is a two-stage detector consisting of a region proposal network that feeds region proposals into a classifier and a regressor. Moreover, we have complemented the provided detections with those obtained by the EfficientDet [44] algorithm, a top-performing state-of-the-art object detector. EfficientDet is also a one-stage detector that uses EfficientNet [45] as the backbone network and a bi-directional feature pyramid network (BiFPN).

All these approaches make use of models pre-trained on the COCO benchmark [46]. For our purpose, we considered only detections classified as instances of the car, truck and bus classes.
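To illustrate this filtering step, a small sketch is given below; it assumes generic detector outputs of the form (class_id, score, box) with 1-based COCO class indices, which is an assumption about the detector's label convention, and the score threshold (0.2/0.3 in the later ablation) is passed as a parameter.

```python
# COCO class indices for the vehicle categories kept by the pipeline
# (1-based COCO ids: car = 3, bus = 6, truck = 8).
VEHICLE_CLASSES = {3, 6, 8}

def filter_detections(detections, score_thr=0.2):
    """Keep car/truck/bus detections above the score threshold.

    detections: iterable of (class_id, score, (x, y, w, h)) tuples.
    Returns the list of kept bounding boxes.
    """
    kept = []
    for class_id, score, box in detections:
        if class_id in VEHICLE_CLASSES and score >= score_thr:
            kept.append(box)
    return kept
```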
2) Feature extraction:
For the feature extraction network, we employ ResNet-50 [47] as the backbone, but the original classification layer (the fc layer), shaped for image classification on the ImageNet dataset [48], is replaced by a new classification layer whose size is tailored to the total number of identities in the training data. In order to leverage the weights pretrained on ImageNet, we fine-tune the network but freeze it up to the conv5 layer. To fine-tune the network, we used the CityFlow benchmark training data (S01, S03 and S04) and we also included the VeRi-776 dataset, bringing a total of 905 vehicle IDs for training (129 IDs from CityFlow, plus 776 IDs from VeRi-776). Since only training identities are known, the network learns features to correctly classify the 905 different training vehicle identities.
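A minimal PyTorch sketch of this network surgery follows: the ImageNet classifier is swapped for a 905-way identity classifier, the early stages are frozen, and at inference the 2048-dimensional average-pooling output is used as the descriptor. Layer-group names follow torchvision's ResNet; the exact freezing boundary is an assumption matching the "freeze up to conv5" description above.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_TRAIN_IDS = 905  # 129 CityFlow + 776 VeRi-776 identities

def build_feature_extractor():
    model = models.resnet50(pretrained=True)          # ImageNet weights
    # Freeze everything before the last residual stage (conv5 / layer4).
    for name, param in model.named_parameters():
        if not name.startswith(('layer4', 'fc')):
            param.requires_grad = False
    # Replace the ImageNet classifier with a 905-way identity classifier.
    model.fc = nn.Linear(model.fc.in_features, NUM_TRAIN_IDS)
    return model

@torch.no_grad()
def extract_descriptor(model, image_batch):
    """Return the 2048-d average-pooling output used as descriptor at inference."""
    x = model.conv1(image_batch)
    x = model.bn1(x)
    x = model.relu(x)
    x = model.maxpool(x)
    x = model.layer1(x); x = model.layer2(x); x = model.layer3(x); x = model.layer4(x)
    x = model.avgpool(x)
    return torch.flatten(x, 1)                        # (batch, 2048)
```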
We validate on pairs of unseen vehicles, comparing whether the predicted identities are the same or not; in this way, we check the network's ability to discern different views of the same target. To create these pairs, we randomly select half of the data from the S05 scenario to create a validation set of 169 identities. We force the validation batches to contain approximately 50% positive and 50% negative pairs. The pair selection is randomly done over the set of IDs, instead of the set of images; thus, IDs containing few samples are not impaired. At inference, we adopt, as a 2048-dimensional descriptor, the output of the average pooling layer, just before the classifier.

Each input image containing a bounding box of a vehicle is adapted to the network by resizing it, and the pixel values are normalized by the mean and standard deviation of the ImageNet dataset. In order to reduce model overfitting and to improve generalization, we perform several random data augmentation techniques such as horizontal flip, dropout, Gaussian blur and contrast perturbation. To minimize the loss function and optimize the network parameters, we adopt the Stochastic Gradient Descent (SGD) solver. Experimentally, the initial learning rate was set to 0.1 and we follow a step decay schedule dropping it by a factor of 0.1 every 25 epochs. Momentum was set to 0.9, and weight decay was also applied.

TABLE III
MTMC tracking performance of the proposed approach for different vehicle detectors. We differentiate between the public detectors (SSD512, YOLOv3 and Mask R-CNN) and the private one (EfficientDet). For each detector, we include the mean Average Precision (mAP) for the object detection task on the COCO dataset [46] as a measure of performance. Best of both categories in bold.

COCO mAP | Vehicle Detector | Score Threshold | IDP ↑ | IDR ↑ | IDF1 ↑

C. Ablation Study
This section measures the impact of the strategies used along the different stages of the proposed approach. Firstly, the effect of using different vehicle detectors is evaluated. Secondly, the influence of the association radius parameter is analysed. Subsequently, we gauge the influence of the appearance model training method, as well as the size of the feature embedding. Finally, some additional strategies (e.g., occlusion handling) are assessed. All the experiments are evaluated on the testing scenario of the CityFlow benchmark dataset with partially overlapping FOVs, i.e., the S02 scenario. It is composed of 4 cameras pointing at an intersection roadway (see Figure 1). In total, it aggregates 129 annotated vehicles whose trajectories are distributed along 8440 frames (2110 per camera) captured at 10 fps.
TABLE IV
Impact of the association parameter r over the MTMC tracking performance. Best in bold.

Detector | Association radius | IDP ↑ | IDR ↑ | IDF1 ↑
Mask R-CNN | r = 5 m | 47.04 | 62.11 | 53.53
Mask R-CNN | r = 6.5 m | 46.03 | 60.78 | 52.39
Mask R-CNN | r = 8 m | … | … | …
Mask R-CNN | r = 9.5 m | 46.56 | 61.47 | 52.99
EfficientDet | r = 5 m | … | … | …
EfficientDet | r = 6.5 m | 46.41 | 67.24 | 54.92
EfficientDet | r = 8 m | 47.10 | 68.21 | 55.72
EfficientDet | r = 9.5 m | 43.24 | 62.59 | 61.14

Influence of the vehicle detector algorithm:
Table III summarizes the impact of different vehicle detectors on the overall performance of the proposed approach. As stated before, we consider three provided sets of object detections coming from YOLOv3, SSD512 and Mask R-CNN, i.e., public detections. We also evaluate the performance of EfficientDet, a top-performing algorithm. We experimented with three different score thresholds to filter the output detections (0.1, 0.2 and 0.3). Regarding the public detections, one can observe that the compared detectors achieve their peak performance when a low threshold is applied. The results suggest that filtering the output detections by scores higher than 0.2 leads to a lower IDR in the MTMC tracking performance. This finding indicates that detections with low confidence (mostly generated by remote and partially visible vehicles) are still useful. On the contrary, EfficientDet, since it is a better performing object detector, results in a higher IDR and IDP when filtered with 0.3 instead of 0.2. It enhances IDR by 3.37, compared with the best results of Mask R-CNN; however, IDP is degraded by 2.01. The reason for this decline is that EfficientDet provides more false positive trajectories, arising from the detection of partially occluded vehicles that Mask R-CNN is not able to detect. In the light of these results, we opted for adopting Mask R-CNN output detections filtered by a 0.2 score as public detections, and EfficientDet output filtered by 0.3 as private detections, for the rest of the experiments.
Influence of the association radius:
Table IV shows how the association radius r, used in the cross-camera clustering (see Section III-D), affects the MTMC tracking performance of the proposed approach in the evaluated scenario. We sweep radius values of 5, 6.5, 8 and 9.5 meters. The results in the table indicate that the choice of the radius is quite relevant, having a significant impact on the performance, and that it is also highly dependent on the detection algorithm. The Mask R-CNN detector peaks at r = 8 m; however, when using the EfficientDet detector a smaller radius, r = 5 m, is the optimal choice. The reason for this difference may be related to the bounding box accuracy (i.e., how well the output bounding boxes fit the vehicles). Since the middle points of the bases of the bounding boxes from different camera views are projected to the ground plane, the tighter the boxes are, the more accurate the projections are. Given common vehicle dimensions, it may seem natural to think that a smaller radius should be enough to successfully associate several detections of the same vehicle. However, due to noise in the video transmission while capturing the data, some frames are skipped within some videos, so some cameras suffer from a subtle temporal misalignment (i.e., they are unsynchronized with respect to the others). Therefore, the optimal r values for the CityFlow benchmark using the proposed algorithm are 5 and 8 meters, for the two evaluated detectors.

Influence of the appearance feature model:
Table V summarizes the effect of the training schemes on the model used to describe the appearance features of vehicles for the proposed MTMC tracking approach. The table lists the data that have been used for training the network (described in Section IV-B2) and how the weights of the network were obtained. As the baseline, we use the model pretrained on the ImageNet dataset. As training data, we considered the training set of the CityFlow benchmark (S01, S03 and S04 scenarios) and also the training set of the CityFlow benchmark jointly with the VeRi-776 dataset. We tried two classification loss functions: Cross-Entropy loss (CE) and Focal Loss (FL). Table V indicates that the tracking performance behaves in a coordinated manner when using both the Mask R-CNN and EfficientDet detectors. In both cases, fine-tuning the network to the CityFlow benchmark has a slight, but positive, influence. Including more training data, the VeRi-776 dataset, appears to improve the quality of the feature embeddings, resulting in an even better tracking performance.

Figure 6 depicts in red the distribution of the number of images per vehicle ID of the training set of the CityFlow dataset, illustrating that it is a quite unbalanced set with a very scattered distribution. The average of the distribution is µ_city = 232.9, while the standard deviation is σ_city ≈ 201. From Table V, we observe that training on the CityFlow benchmark with the Focal loss, instead of the Cross-Entropy loss, has a positive influence on our MTMC tracking approach. Figure 6 also depicts in blue the distribution of the number of images per vehicle ID of the VeRi-776 dataset; as one may observe, it is more balanced than the CityFlow set. Considering both datasets together, the joint distribution is described by µ_joint = 89.35 and σ_joint ≈ 102; as σ_joint << σ_city, one could say that the joint dataset is less disperse than CityFlow alone, which can be an indicator for the subtle increase in performance obtained when the combined dataset is used. According to these results, we opt for using the combined dataset and the CE loss for the rest of the experiments.

TABLE V
Impact of the appearance feature model over the MTMC tracking performance. F: finetuned. CE: Cross-Entropy loss. FL: Focal loss. Best in bold.

Detector | Training Data | Weights | IDP ↑ | IDR ↑ | IDF1 ↑
Mask R-CNN | ImageNet | Pretrained | 49.11 | 64.84 | 55.89
Mask R-CNN | CityFlow | F + CE | 49.16 | 64.91 | 55.95
Mask R-CNN | CityFlow | F + FL | 49.26 | 65.04 | 56.06
Mask R-CNN | CityFlow + VeRi-776 | F + CE | … | … | …
Mask R-CNN | CityFlow + VeRi-776 | F + FL | 49.53 | 65.39 | 56.37
EfficientDet | ImageNet | Pretrained | 47.18 | 68.37 | 55.83
EfficientDet | CityFlow | F + CE | 47.41 | 68.70 | 56.11
EfficientDet | CityFlow | F + FL | 47.43 | 68.73 | 56.13
EfficientDet | CityFlow + VeRi-776 | F + CE | … | … | …
EfficientDet | CityFlow + VeRi-776 | F + FL | 47.63 | 69.01 | 56.36

Fig. 6. The distribution of the number of images per vehicle identity in the CityFlow training dataset, the VeRi-776 dataset, and the distribution of both joined. Best viewed in color.

Influence of the size of the feature embedding:
Table VI summarizes the experiments carried out to explore the effect of the size of the feature embeddings. As stated in Section IV-B2, the output of the last average pooling layer of ResNet-50 provides a 2048-dimensional embedding. We set this embedding size as the baseline. In order to modify the length of the embedding, an additional fully connected layer of size 512, 1024 or 4096 is added at the end of the network. The additional fully connected layer is preceded by batch normalization and ReLU layers, and the training procedure is the same as described in Section IV-B2. The performance suggests that adding an additional layer, and therefore more complexity to the model, either to reduce or increase the embedding size, may harm the performance, leading the model to overfitting.
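As an illustration of this variant, the sketch below appends a batch-normalization, ReLU and fully connected block on top of the 2048-dimensional pooled feature to obtain a different embedding size; the module structure is an assumption consistent with the description above, not the exact training code.

```python
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Optional head changing the 2048-d pooled feature to `embed_dim`."""
    def __init__(self, in_dim=2048, embed_dim=512, num_ids=905):
        super().__init__()
        self.projection = nn.Sequential(
            nn.BatchNorm1d(in_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_dim, embed_dim),   # 512, 1024 or 4096 in the ablation
        )
        self.classifier = nn.Linear(embed_dim, num_ids)  # training-time classifier

    def forward(self, pooled_feat):
        embedding = self.projection(pooled_feat)   # used as descriptor at inference
        logits = self.classifier(embedding)        # used for the identity loss
        return embedding, logits
```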
TABLE VI
Impact of the feature embedding size. Best in bold.

Detector | Embedding size | IDP ↑ | IDR ↑ | IDF1 ↑
Mask R-CNN | Baseline (2048) | … | … | …
Mask R-CNN | 512 | 49.24 | 65.01 | 56.04
Mask R-CNN | 1024 | 49.67 | 65.58 | 56.53
Mask R-CNN | 4096 | 49.70 | 65.62 | 56.56
EfficientDet | Baseline (2048) | … | … | …
EfficientDet | 512 | 46.72 | 67.69 | 55.28
EfficientDet | 1024 | 47.06 | 68.20 | 55.69
EfficientDet | 4096 | 47.50 | 68.83 | 56.21

Influence of additional strategies:
The additional strategies we have designed are divided into two branches: removing small detected objects that are not considered in the ground-truth, and occlusion handling. To avoid the existing bias in the ground-truth towards distant cars that are not annotated, we performed a size filtering strategy by removing detections whose area is under 0.10% of the total frame area. The blind occlusion handling and the reprojection-based occlusion handling strategies are detailed in Section III-E. Table VII shows the ablation results of these strategies.

TABLE VII
Impact of additional strategies. Best in bold.

Detector | Approach | IDP ↑ | IDR ↑ | IDF1 ↑
Mask R-CNN | Baseline | 50.56 | 66.75 | 57.54
Mask R-CNN | + Size filtering | 53.03 | 66.70 | 59.08
Mask R-CNN | + Blind occlusion handling | 53.46 | 70.59 | 60.84
Mask R-CNN | + Reprojection-based handling | 52.99 | 70.96 | 60.67
Mask R-CNN | + Blind occlusion handling + Size filtering | … | … | …
Mask R-CNN | + Reprojection-based handling + Size filtering | 54.06 | 69.02 | 60.63
EfficientDet | Baseline | 48.33 | 70.03 | 57.19
EfficientDet | + Size filtering | 50.20 | 70.07 | 58.49
EfficientDet | + Blind occlusion handling | 53.18 | 77.05 | 63.50
EfficientDet | + Reprojection-based handling | 51.53 | 75.67 | 61.31
EfficientDet | + Blind occlusion handling + Size filtering | … | … | …
EfficientDet | + Reprojection-based handling + Size filtering | 53.34 | 75.50 | 62.52
As expected, we can observe that the procedure of removing small detections increases the IDP measure, for both object detectors, by 2.47 (1.87), while maintaining almost the same IDR. Since IDP reacts to false positives, this seems to indicate that the size filtering removes those small detections we track but that are not annotated in the ground-truth. Both occlusion handling strategies improve the baseline tracking IDR significantly, by 3.81 (7.02) and 4.21 (5.64) respectively, and IDP is also improved, by 2.90 (4.85) and 2.43 (3.20). Contrary to expectations, the reprojection-based strategy does not outperform the blind one. Another bias existing in the ground-truth could be the reason explaining this, since occluded vehicles are not annotated. When combining both occlusion handling strategies with size filtering, we achieve a higher precision than applying them separately, while recall is slightly narrowed. As in the previous comparison, these results suggest that the reprojection-based strategy does not provide improvements over the blind strategy due to the nature of the ground-truth. We consider using the baseline approach together with the blind occlusion handling and the size filtering strategies a good trade-off between IDP and IDR.

D. Comparison with the state-of-the-art
In this section, the proposed algorithm is compared with state-of-the-art approaches. The comparison is performed on the S02 scenario of the CityFlow benchmark, which is the only validation scenario with partially overlapping FOVs, as our method targets this scenario. The approaches in the literature devoted to MTMC vehicle tracking, listed in Table I, have already been compared in The 2019 AI City Challenge [49] jointly over the testing scenarios S02 and S05. However, as S05 consists of non-overlapping cameras, to ensure a fair comparison, we perform the evaluation only over S02. For this purpose, we ran the publicly available codes and we evaluated them following the CityFlow benchmark evaluation methodology detailed in Section IV-A2.

Table VIII shows the evaluated performances in terms of IDP, IDR, IDF1, latency and total computational time.

TABLE VIII
Comparison with the state-of-the-art approaches. τ_IoU is the Intersection over Union (IoU) evaluation threshold. The star (*) denotes that the value is an estimation. The extra 211 seconds are the duration of the video sequence under evaluation.

Approach | IDP ↑ | IDR ↑ | IDF1 ↑ | τ_IoU | Processing | Latency (s) | Total cost (min)
UWIPL [13] | 70.21 | 92.61 | 79.87 | 0.2 | Offline | 3015* + 211 | 53.76*
UWIPL [13] | 70.02 | 92.36 | 79.65 | 0.5 | Offline | 3015* + 211 | 53.76*
ANU [12] | 67.53 | 81.99 | 74.06 | 0.2 | Offline | 1159* + 211 | 22.83*
ANU [12] | 66.42 | 80.64 | 72.85 | 0.5 | Offline | 1159* + 211 | 22.83*
BUPT [11] | 78.23 | 63.69 | 70.22 | 0.2 | Offline | 1389* + 211 | 26.66*
BUPT [11] | 78.16 | 63.63 | 70.15 | 0.5 | Offline | 1389* + 211 | 26.66*
NCCU [9] | 48.91 | 43.35 | 45.97 | 0.2 | Offline | 2316* + 211 | 42.11*
NCCU [9] | 24.36 | 21.59 | 22.89 | 0.5 | Offline | 2316* + 211 | 42.11*
Ours (EfficientDet) | 55.15 | 76.98 | 64.26 | 0.2 | Online | 2.55 | 13.65
Ours (EfficientDet) | 54.73 | 76.38 | 63.77 | 0.5 | Online | 2.55 | 13.65
Ours (Mask R-CNN) | 57.23 | 71.99 | 63.76 | 0.2 | Online | 2.29 | 12.71
Ours (Mask R-CNN) | 54.99 | 69.17 | 61.27 | 0.5 | Online | 2.29 | 12.71

The listed approaches can be divided by the processing mode into two groups: offline and online processing. As described in Section II, to the best of our knowledge there is no previous proposal dealing with online MTMC vehicle tracking. For this reason, all the state-of-the-art methods that we evaluated are offline approaches. It is important to remark that, in Table VIII, the star denotes a partial and downward estimation. The codes for the complete systems are not publicly available, and only solutions based on precomputed intermediate results are accessible; hence, we can only compute the running time of the available modules. Therefore, the overall latency of the compared offline approaches is expected to be much higher than the results reported in Table VIII. Note that the duration of the sequence under evaluation is also included in the latency, since these offline approaches require access to results for the whole video to compute tracklets at each camera and then compute multi-camera tracks in a global way. As our proposal yields tracking results incrementally, from the beginning of the sequence, it can achieve a really low latency in comparison with the other methods.

Regarding the quantitative measures of the tracking performance, IDP, IDR and IDF1, offline methods using constraining priors tailored to the target scenario clearly benefit from this extra information (see Table I). In contrast to the rest of the state-of-the-art approaches, we are agnostic to the motion patterns of the vehicles (which would allow filtering erroneous tracks), we do not perform any track post-processing (which would permit refining and unifying tracks, thereby reducing ID switches) and, finally, we do not make use of manual annotations. On this basis, with an online approach we perform really close to the offline state-of-the-art approaches, outperforming two of them in terms of Identification Recall. Overall, our approach does not quite reach top performance in MTMC vehicle tracking, but its latency is three orders of magnitude smaller and the final computational cost is one order of magnitude lower, enabling high-performance operation in online mode with low latency, which is a common requirement for many video-related applications, while also favouring the generalization of the algorithm by avoiding hand-crafted strategies.

V. CONCLUSION
Not relying on manual ad-hoc annotations, having no prior knowledge about the number of targets, and providing the best result in the shortest possible time are crucial requirements for a convenient and versatile algorithm. This paper presents, to the best of our knowledge, the first online MTMC vehicle tracking solution. Unlike previous approaches, the proposed approach continuously computes and updates the targets' state. We calculate clusters of detections of the same vehicle from different camera views by applying a cross-camera clustering based on appearance and location. We train an appearance model to identify different views of the same vehicle and leverage the information of the homography matrices. Using information from the previous frame and a temporal estimation, we developed an occlusion handling strategy able to extrapolate accurate detections even if the target is occluded. Since the state estimation is continually updated, this strategy is useful even if the target is occluded for a long time. This approach results in a low-latency MTMC vehicle tracking solution with quite promising results. Although its performance is below that of its offline counterparts, the proposed approach is a suitable solution for real-world ITS technology.

ACKNOWLEDGEMENT
This work was partially supported by the Spanish Government (TEC2017-88169-R MobiNetVideo). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for the research of our group.

REFERENCES
[1] J. Guerrero-Ibáñez, S. Zeadally, and J. Contreras-Castillo, "Sensor technologies for intelligent transportation systems," Sensors, vol. 18, no. 4, p. 1212, 2018.
[2] Z. Yang and L. S. Pun-Cheng, "Vehicle detection in intelligent transportation systems and its applications under varying environments: A review," Image and Vision Computing, vol. 69, pp. 143–154, 2018.
[3] M. Veres and M. Moussa, "Deep learning for intelligent transportation systems: A survey of emerging trends," IEEE Transactions on Intelligent Transportation Systems, 2019.
[4] H. Menouar, I. Guvenc, K. Akkaya, A. S. Uluagac, A. Kadri, and A. Tuncer, "UAV-enabled intelligent transportation systems for the smart city: Applications and challenges," IEEE Communications Magazine, vol. 55, no. 3, pp. 22–28, 2017.
[5] L. Leal-Taixé, A. Milan, K. Schindler, D. Cremers, I. Reid, and S. Roth, "Tracking the trackers: An analysis of the state of the art in multiple object tracking," arXiv preprint arXiv:1704.02781, 2017.
[6] G. Ciaparrone, F. L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera, "Deep learning in video multi-object tracking: A survey," Neurocomputing, vol. 381, pp. 61–88, 2020.
[7] M. S. Shirazi and B. T. Morris, "Looking at intersections: A survey of intersection monitoring, behavior and safety analysis of recent studies," IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 1, pp. 4–24, 2016.
[8] X. Tan, Z. Wang, M. Jiang, X. Yang, J. Wang, Y. Gao, X. Su, X. Ye, Y. Yuan, D. He, S. Wen, and E. Ding, "Multi-camera vehicle tracking and re-identification based on visual and spatial-temporal features," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[9] M.-C. Chang, J. Wei, Z.-A. Zhu, Y.-M. Chen, C.-S. Hu, M.-X. Jiang, and C.-K. Chiang, "AI City Challenge 2019 – City-scale video analytics for smart transportation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[10] Y. Chen, L. Jing, E. Vahdani, L. Zhang, M. He, and Y. Tian, "Multi-camera vehicle tracking and re-identification on AI City Challenge 2019," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[11] Z. He, Y. Lei, S. Bai, and W. Wu, "Multi-camera vehicle tracking with powerful visual features and spatial-temporal cue," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[12] Y. Hou, H. Du, and L. Zheng, "A locality aware city-scale multi-camera vehicle tracking system," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[13] H.-M. Hsu, T.-W. Huang, G. Wang, J. Cai, Z. Lei, and J.-N. Hwang, "Multi-camera tracking of vehicles based on deep features re-id and trajectory-based camera link models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[14] P. Li, G. Li, Z. Yan, Y. Li, M. Lu, P. Xu, Y. Gu, B. Bai, and Y. Zhang, "Spatio-temporal consistency and hierarchical matching for multi-target multi-camera vehicle tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[15] M. Wu, G. Zhang, N. Bi, L. Xie, Y. Hu, and Z. Shi, "Multiview vehicle tracking by graph matching model," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[16] E. Luna, P. Moral, J. C. SanMiguel, A. Garcia-Martin, and J. M. Martinez, "VPULab participation at AI City Challenge 2019," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[17] L. Chen, H. Ai, R. Chen, Z. Zhuang, and S. Liu, "Cross-view tracking for multi-human 3D pose estimation at over 100 FPS," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3279–3288.
[18] Y. He, X. Wei, X. Hong, W. Shi, and Y. Gong, "Multi-target multi-camera tracking by tracklet-to-target assignment," IEEE Transactions on Image Processing, vol. 29, pp. 5191–5205, 2020.
[19] M. C. Liem and D. M. Gavrila, "Joint multi-person detection and tracking from overlapping cameras," Computer Vision and Image Understanding, vol. 128, pp. 36–50, 2014.
[20] C. H. Sio, H.-H. Shuai, and W.-H. Cheng, "Multiple fisheye camera tracking via real-time feature clustering," in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
[21] X. Zhang and E. Izquierdo, "Real-time multi-target multi-camera tracking with spatial-temporal information," IEEE, 2019, pp. 1–4.
[22] Z. Zhipeng, "Collaborative tracking method in multi-camera system," Journal of Shanghai Jiaotong University (Science), vol. 2, 2020.
[23] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, "MARS: A video benchmark for large-scale person re-identification," in European Conference on Computer Vision. Springer, 2016, pp. 868–884.
[24] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in European Conference on Computer Vision. Springer, 2016, pp. 17–35.
[25] Z. Tang, M. Naphade, M.-Y. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, and J.-N. Hwang, "CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8797–8806.
[26] E. Ristani and C. Tomasi, "Features for multi-target multi-camera tracking and re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6036–6046.
[27] Z. Tang, G. Wang, H. Xiao, A. Zheng, and J.-N. Hwang, "Single-camera and inter-camera vehicle tracking and 3D speed estimation based on fusion of visual and semantic features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 108–115.
[28] Z. Zhang, J. Wu, X. Zhang, and C. Zhang, "Multi-target, multi-camera tracking by hierarchical clustering: Recent progress on DukeMTMC project," arXiv preprint arXiv:1712.09531, 2017.
[29] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14. Oakland, CA, USA, 1967, pp. 281–297.
[30] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, vol. 96, no. 34, 1996, pp. 226–231.
[31] S. C. Johnson, "Hierarchical clustering schemes," Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.
[32] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 224–227, 1979.
[33] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, no. 3, pp. 32–57, 1973.
[34] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 2009, vol. 344.
[35] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, "Imbalance problems in object detection: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[36] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning. MIT Press, Cambridge, 2016, vol. 1, no. 2.
[37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[38] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[39] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[40] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, "Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1900–1909.
[41] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[42] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[43] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[44] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790.
[45] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning, 2019, pp. 6105–6114.
[46] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[47] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[48] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[49] M. Naphade, Z. Tang, M.-C. Chang, D. C. Anastasiu, A. Sharma, R. Chellappa, S. Wang, P. Chakraborty, T. Huang, J.-N. Hwang, and S. Lyu, "The 2019 AI City Challenge," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
Elena Luna García obtained a B.S. degree in Telecommunications Engineering in 2015 at the Universidad Autónoma de Madrid (Spain). In 2017 she received the M.S. degrees belonging to the International Joint Master Program in Image Processing and Computer Vision (IPCV) at the PPCU in Budapest (Hungary), the University of Bordeaux (France) and the UAM (Spain). She is currently pursuing the Ph.D. degree with the Video Processing and Understanding Lab (VPU-Lab) at the UAM (Spain).

Juan C. SanMiguel received the Ph.D. degree in computer science and telecommunication from the University Autónoma of Madrid, Madrid, Spain, in 2011. He was a Post-Doctoral Researcher with Queen Mary University of London, London, U.K., from 2013 to 2014, under a Marie Curie IAPP Fellowship. He is currently Associate Professor at the University Autónoma of Madrid and Researcher with the Video Processing and Understanding Laboratory. His research interests include computer vision with a focus on online performance evaluation and multi-camera activity understanding for video segmentation and tracking. He has authored over 40 journal and conference papers.

José M. Martínez received the Ph.D. degree in computer science and telecommunication from the Universidad Politécnica de Madrid, Madrid, Spain, in 1998. He is currently a Full Professor with the Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid. He has acted as an auditor and a reviewer for the EC for projects of the framework programs for research in Information Society and Technology (IST). He is the author or coauthor of more than 100 papers in international journals and conferences and a coauthor of the first book about the MPEG-7 standard, published in 2002. His professional interests cover different aspects of advanced video surveillance systems and multimedia information systems.