Deep Continuous Conditional Random Fields with Asymmetric Inter-object Constraints for Online Multi-object Tracking
Hui Zhou, Wanli Ouyang, Jian Cheng, Xiaogang Wang, Hongsheng Li
Abstract—Online Multi-Object Tracking (MOT) is a challenging problem with many important applications, including intelligent surveillance, robot navigation, and autonomous driving. In existing MOT methods, individual objects' movements and inter-object relations are mostly modeled separately, and the relations between them are manually tuned. In addition, inter-object relations are mostly modeled in a symmetric way, which we argue is not an optimal setting. To tackle these difficulties, in this paper we propose a Deep Continuous Conditional Random Field (DCCRF) for solving the online MOT problem in a tracking-by-detection framework. The DCCRF consists of unary and pairwise terms. The unary terms estimate tracked objects' displacements across time based on visual appearance information. They are modeled as deep Convolutional Neural Networks, which are able to learn discriminative visual features for tracklet association. The asymmetric pairwise terms model inter-object relations in an asymmetric way, which encourages high-confidence tracklets to help correct errors of low-confidence tracklets without being much affected by the low-confidence ones. The DCCRF is trained in an end-to-end manner to better balance the influences of visual information and inter-object relations. Extensive comparisons with state-of-the-art methods, as well as detailed component analysis of our proposed DCCRF on two public benchmarks, demonstrate the effectiveness of our proposed MOT framework.
Index Terms—Multi-object tracking, deep neural networks, continuous conditional random fields, asymmetric pairwise terms.
Hui Zhou and Jian Cheng are with the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China. Hongsheng Li and Xiaogang Wang are with the Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China. Wanli Ouyang is with the University of Sydney, Sydney, Australia. This work was done while Hui Zhou was a Research Assistant in the Department of Electronic Engineering at The Chinese University of Hong Kong. Hongsheng Li is the corresponding author (e-mail: [email protected]).

I. INTRODUCTION

Robust tracking of multiple objects [1] is a challenging problem in computer vision and an important component of many real-world applications. It aims to reliably recover trajectories and maintain identities of objects of interest in an image sequence. State-of-the-art Multi-Object Tracking (MOT) methods [2], [3] mostly utilize the tracking-by-detection strategy because of its robustness against tracking drift. Such a strategy generates per-frame object detection results from the image sequence and associates the detections into object trajectories. It is able to handle newly appearing objects and is robust to tracking drift. Tracking-by-detection methods can be categorized into offline and online methods. The offline methods [4] use detection results from both past and future frames with global optimization techniques to link detections into object trajectories. The online methods, on the other hand, use only detection results up to the current time to incrementally generate object trajectories. Our proposed method focuses on online MOT, which is more suitable for real-time applications including autonomous driving and intelligent surveillance.

In MOT, the tracked objects usually show consistent or slowly varying appearance across time. Visual features of the objects are therefore important cues for associating detection boxes into tracklets. In recent years, deep learning techniques have shown great potential in learning discriminative visual features for single-object and multi-object tracking. However, visual cues alone cannot guarantee robust tracking results. When tracked objects with similar appearances occlude or are close to each other, their trajectories might be wrongly associated with other objects. In addition, there also exist mis-detections and inaccurate detections by imperfect object detectors. Such difficulties escalate when the camera is held by hand or mounted on a car: each object then moves according to its own movement pattern as well as the global camera motion. These problems have been explored by modeling interactions between tracked objects in the optimization model. For online MOT methods, there have been investigations on modeling inter-object interactions with social force models [5], [6], [7], relative spatial and speed differences [8], [9], [10], and relative motion constraints [3], [11]. Most of the previous methods model pairwise inter-object interactions in symmetric mathematical forms, i.e., pairs of objects influence each other with the same magnitude.

However, such pairwise object interactions should be directional and modeled in an asymmetric form. For instance, large-size detection boxes are more likely to be noisy if measured in actual pixels. Smaller boxes should therefore influence larger boxes more than large ones influence small ones, because the smaller ones usually provide more accurate localization of objects. Similarly, high-confidence trajectories should influence low-confidence ones more, and low-confidence ones should have minimal impact on the high-confidence ones. In this way, the more accurate detections or trajectories can help correct errors of the inaccurate ones without being much affected by them. Moreover, in existing methods,
individual objects' movements and inter-object interactions are usually modeled separately. The relations between the two terms are mostly manually tuned and not effectively studied in a unified framework.

To tackle these difficulties, we propose a Deep Continuous Conditional Random Field (DCCRF) with asymmetric inter-object constraints for solving the online MOT problem. The DCCRF takes as inputs a pair of consecutive images at time t−1 and time t, together with the tracked objects' past trajectories up to time t−1, and estimates the locations of the tracked objects at time t. The DCCRF optimizes an objective function with two terms: the unary terms, which estimate individual objects' movement patterns, and the asymmetric pairwise terms, which model interactions between tracked objects. The unary terms are modeled by a deep Convolutional Neural Network (CNN), which is trained to estimate each individual object's displacement between time t−1 and time t from the object's visual appearance. The asymmetric pairwise terms aim to tackle the problems caused by object occlusions, object mis-detections, and global camera motion. For two neighboring tracked trajectories, the pairwise influence is different along each direction, so that the high-confidence trajectory assists the low-confidence one more. Our proposed DCCRF utilizes mean-field approximation for inference and is trained in an end-to-end manner to estimate the optimal displacement of each tracked object. Based on the estimated object locations, a visual-similarity CNN is then used for generating the final detection association results.

The contribution of our proposed online MOT framework is two-fold. (1) A novel DCCRF model is proposed for solving the online MOT problem. Each object's individual movement pattern as well as inter-object interactions are studied in a unified framework and trained in an end-to-end manner. In this way, the unary and pairwise terms of our DCCRF can better adapt to each other to achieve more accurate tracking performance. (2) An asymmetric inter-object interaction term is proposed to model the directional influence between pairs of objects, which aims to correct errors of low-confidence trajectories while maintaining the estimated displacements of the high-confidence ones. Extensive experiments on two public datasets show the effectiveness of our proposed MOT framework.

II. RELATED WORK
A large number of methods have been proposed for multi-object tracking. We focus on reviewing online MOT methods that utilize interactive constraints, as well as single-object and multi-object tracking algorithms based on deep neural networks.
Interaction models for MOT.
Social force models were adopted in MOT methods [5], [6], [7] to model pairwise interactions (attraction and repulsion) between objects. These methods required objects' 3D positions for modeling inter-object interactions, which were obtained by visual odometry. Grabner et al. [12] assumed that the relative positions between feature points and objects were more or less fixed over short time intervals. A generalized Hough transform was therefore used to predict each target's location with the assistance of supporter feature points. Duan et al. [10] proposed mutual relation models to describe the spatial relations between tracked objects and to handle occluded objects. Such constraints are learned by an online structured SVM. Zhang and van der Maaten [9] incorporated spatial constraints between objects into an MOT framework to track objects with similar appearances.

The CRF model [13] has frequently been used in segmentation tasks to model the relationships between pixels in the spatial domain. There are also many works that formulate multi-object tracking with CRF models. Yang and Nevatia [14] proposed an online-learned CRF model for MOT, and assumed linear and smooth motion of the objects to associate past and future tracklets. Andriyenko et al. [15] modeled multi-object tracking as optimizing discrete and continuous CRF models: a continuous CRF was used for enforcing motion smoothness, and a discrete CRF with a temporal interaction pairwise term was optimized for data association. Milan et al. [16] designed new CRF potentials for modeling spatio-temporal constraints between pairs of trajectories to tackle detection-level and trajectory-level occlusions.
Deep learning based object tracking.
Most existing deep learning based tracking methods focus on single-object tracking, because deep neural networks are able to learn powerful visual features for distinguishing the tracked objects from the background and from other similar objects. Early single-object tracking methods with deep learning [17], [18] focused on learning discriminative appearance features by online training. However, due to the large learning capacity of deep neural networks, such online training easily overfits the data. [19], [20] pretrained deep convolutional neural networks on large-scale image datasets to learn discriminative visual features, and updated the classifier online with new training samples. More recently, methods that do not require model updating were proposed. Tao et al. [21] utilized Siamese CNNs to determine visual similarities between image patches for tracking. Bertinetto et al. [22] changed the network into a fully convolutional setting and achieved real-time running speed.

Recently, deep models have also been applied to multi-object tracking. Milan et al. [23] proposed an online MOT framework with two RNNs: one RNN for state (object locations, motions, etc.) prediction and update, and the other for associating objects across time. However, this method did not utilize any visual feature and relied solely on the spatial locations of the detection results. [24], [25] replaced hand-crafted features (e.g., color histograms) with features learned from image patches by a Siamese CNN, which increases the discriminative ability. However, those methods focused on modeling individual objects' movement patterns with deep learning; inter-object relations were not integrated into the deep neural networks.

III. METHOD
The overall framework of our proposed MOT method is illustrated in Fig. 1. We propose a Deep Continuous Conditional Random Field (DCCRF) model for solving the online MOT problem. At each time t, the framework takes past tracklets up to time t−1 and detection boxes at time t as inputs, and generates new tracklets up to time t. At each time t, new tracklets are also initialized and current tracklets are terminated if tracked objects disappear from the scene.

Fig. 1: Illustration of the overall multi-object tracking framework. The proposed Deep Continuous Conditional Random Field consists of unary terms and asymmetric pairwise terms (Section III-A). The unary terms are modeled by a visual-displacement CNN, which takes pairs of object image patches as inputs and outputs the estimated object displacements between time t−1 and time t (Section III-A1). The asymmetric pairwise terms encourage the use of high-confidence tracklets to correct errors of low-confidence tracklets (Section III-A2). Size-based and confidence-based directional weighting functions are investigated.

The core components of the proposed DCCRF are the unary terms and the asymmetric pairwise terms. The unary terms of our DCCRF are modeled by a deep CNN that estimates each tracked object's displacement between consecutive times t−1 and t. The asymmetric pairwise terms aim to model inter-object interactions, which consider the differences of speeds, visual confidences, and object sizes between neighboring objects. Unlike the interaction terms in existing MOT methods, which treat inter-object interactions in a symmetric way, asymmetric relationship terms are proposed in our DCCRF. For pairs of tracklets in our DCCRF model, the proposed asymmetric pairwise terms model the two directions differently, so that high-confidence trajectories with small-size detection boxes can help correct errors of low-confidence trajectories with large-size detection boxes. Based on the object displacements estimated by the DCCRF, we adopt a visual-similarity CNN and the Hungarian algorithm to obtain the final tracklet-detection associations.

A. Deep Continuous Conditional Random Field (DCCRF)
The proposed DCCRF takes the object trajectories up to time t−1 and the video frame at time t as inputs, and outputs each tracked object's displacement between time t−1 and time t. Let $\mathbf{r}$ denote a random field defined over a set of variables $\{r_1, r_2, \cdots, r_n\}$, where each of the $n$ variables represents the visual and motion information of an object tracklet. Let $\mathbf{d}$ denote another random field defined over variables $\{d_1, d_2, \cdots, d_n\}$, where each variable represents the displacement of an object between time t−1 and time t. The domain of each variable is the two-dimensional space $\mathbb{R}^2$, denoting the x- and y-dimensional displacements of tracked objects. Let $I$ denote the new video frame at time t.

The goal of our conditional random field $(\mathbf{r}, \mathbf{d})$ is to maximize the following conditional distribution,
$$P(\mathbf{d} \mid \mathbf{r}, I) = \frac{1}{Z} \exp\left(-E(\mathbf{d}, \mathbf{r}, I)\right), \tag{1}$$
where $E(\mathbf{d}, \mathbf{r}, I)$ is the Gibbs energy and $Z = \int \exp\left(-E(\mathbf{d}, \mathbf{r}, I)\right) \mathrm{d}\mathbf{d}$ is the partition function. Maximizing the conditional distribution w.r.t. $\mathbf{d}$ is equivalent to minimizing the Gibbs energy function,
$$E(\mathbf{d}, \mathbf{r}, I) = \sum_{i=1}^{n} \phi(d_i, r_i, I) + \sum_{i,j} \psi(d_i, d_j, r_i, r_j, I), \tag{2}$$
where $\phi(d_i, r_i, I)$ are the unary terms and $\psi(d_i, d_j, r_i, r_j, I)$ are the pairwise terms.

After the displacements $\mathbf{d}$ of tracked objects between time t−1 and time t are obtained, each object's estimated location at time t can be easily calculated for associating tracklets and detection boxes to generate tracklets up to time t. Such displacements are then iteratively calculated for the following time frames. Without loss of generality, we only discuss the approach for optimizing object displacements between time t−1 and time t in this section.
Unary terms: For the $i$-th object tracklet, the unary term $\phi(d_i, r_i, I)$ of our DCCRF model is defined as
$$\phi(d_i, r_i, I) = w_{i,1} \left\| d_i - f_d(r_i, I) \right\|^2. \tag{3}$$
This term penalizes the quadratic deviation between the final output displacement $d_i$ and the displacement estimated by a visual displacement estimation function $f_d$. $w_{i,1}$ is an online adaptive parameter for the $i$-th object that controls whether to trust more the displacement estimated from the $i$-th object's visual cues (the unary terms) or the one inferred from inter-object relations (the pairwise terms). Intuitively, when the visual displacement estimator $f_d$ has high confidence in its estimated displacement, $w_{i,1}$ should be large to bias the final output $d_i$ towards the visually inferred displacement. On the other hand, when $f_d$ has low confidence in its estimation, due to object occlusion or the appearance of similar objects, $w_{i,1}$ should be small to let the final displacement $d_i$ be mainly inferred from inter-object constraints.

Fig. 2: Illustration of the visual-displacement CNN for modeling the unary terms. Two image patches are cropped from the same box location centered at the object location at time t−1 as inputs. The visual-displacement CNN estimates the confidences of discrete object displacements and is trained with a cross-entropy loss.

In our framework, the visual displacement estimation function $f_d$ is modeled as a deep Convolutional Neural Network (CNN) that utilizes only the tracked object's visual information for estimating its location displacement between time t−1 and time t. For each tracked object $r_i$, our visual-displacement CNN takes a pair of image patches cropped from frames t−1 and t as inputs, and outputs the object's inferred displacement. A network structure similar to ResNet-101 [26] (except for the topmost layer) is adopted for our visual-displacement CNN. The network inputs and outputs are illustrated in Fig. 2. For the inputs, given the currently tracked object $r_i$'s bounding box location $b_i$ at time t−1, a larger bounding box $\bar{b}_i$ centered at $b_i$ is first created. Two image patches are cropped at the same spatial location $\bar{b}_i$ but from different frames at time t−1 and time t. They are then concatenated along the channel dimension to serve as the inputs of our visual-displacement CNN. The reasons for using a larger bounding box $\bar{b}_i$ instead of the original box $b_i$ are to tolerate large possible displacements between the two consecutive frames and to incorporate more visual contextual information of the object for more accurate displacement estimation. After training with thousands of such pairs, the visual-displacement CNN is able to capture important visual cues from image-patch pairs to infer object displacements between time t−1 and time t.

For the CNN outputs, instead of directly estimating an object's two-dimensional x- and y-displacements, we discretize the possible 2D continuous displacements into a 2D grid of discrete bins $\{p_i^1, p_i^2, \cdots, p_i^m\}$ (bottom-right part of Fig. 2), where $p_i^k \in \mathbb{R}^2$ represents the displacement corresponding to the $k$-th bin of the $i$-th object. The visual-displacement CNN is trained to output confidence scores $c_i^k$ for the displacement bins $p_i^k$ with a softmax function. The cross-entropy loss is therefore used to train the CNN, and the final estimated displacement of the tracked object $r_i$ is calculated as the weighted average of all possible displacements, $\sum_{k=1}^{m} c_i^k p_i^k$, where $\sum_{k=1}^{m} c_i^k = 1$. In practice, we discretize displacements into $m = 20 \times 20$ bins, which is a good trade-off between discretization accuracy and robustness. Note that there are existing tracking methods [22], [27] that also utilize pairs of image patches as inputs to directly estimate object displacements. However, in our method, we propose to use the cross-entropy loss for estimating displacements and find that it achieves more accurate and robust displacement estimation in our experiments.
More importantly, this formulation provides displacement confidence scores $\{c_i^1, \cdots, c_i^m\}$ for calculating the adaptive parameter $w_{i,1}$ in Eq. (3), which weights the unary and pairwise terms. The confidence weight $w_{i,1}$ is obtained by the following equation,
$$w_{i,1} = \sigma\left(a \max(\mathbf{c}_i) + b\right), \tag{4}$$
where $\sigma$ is the sigmoid function constraining the range of $w_{i,1}$ to be between 0 and 1, $\max(\mathbf{c}_i)$ is the maximal confidence in $\mathbf{c}_i = \{c_i^1, c_i^2, \cdots, c_i^m\}$, and $a$ and $b$ are learnable scalar parameters. In our experiments, the learned parameter $a$ is generally positive after training, which means that if the visual-displacement CNN is more confident about its displacement estimation, the value of $w_{i,1}$ is larger and the final output displacement $d_i$ is biased more towards the visually inferred displacement $f_d(r_i, I)$. Otherwise, the final displacement $d_i$ is biased towards being inferred from inter-object constraints. If the energy function $E$ in Eq. (2) consisted of only the unary terms $\phi(d_i, r_i, I)$, the final output displacement $d_i$ would depend solely on each tracked object's visual information without considering inter-object constraints.
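As a concrete illustration of Eqs. (3) and (4), the following minimal NumPy sketch (our own illustration, not the authors' code; names such as `bin_scores` and `bin_offsets` are assumptions) turns the CNN's raw bin scores into the expected displacement $f_d(r_i, I)$ and the confidence weight $w_{i,1}$:

```python
import numpy as np

def unary_inference(bin_scores, bin_offsets, a, b):
    """Convert raw scores over the m displacement bins into the visually
    inferred displacement and the unary confidence weight of Eq. (4).

    bin_scores:  (m,)   raw CNN scores for the m displacement bins
    bin_offsets: (m, 2) 2D displacement p_i^k represented by each bin
    a, b:        learned scalar parameters of Eq. (4)
    """
    # Softmax over the bins gives the confidences c_i^k, which sum to 1.
    c = np.exp(bin_scores - bin_scores.max())
    c /= c.sum()
    # Weighted average of all bin displacements (the unary estimate f_d).
    f_d = (c[:, None] * bin_offsets).sum(axis=0)
    # Eq. (4): sigmoid of an affine function of the maximal bin confidence.
    w1 = 1.0 / (1.0 + np.exp(-(a * c.max() + b)))
    return f_d, w1
```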
Asymmetric pairwise terms: The pairwise terms in Eq. (2) are utilized to model asymmetric inter-object relations between object tracklets for regularizing the final displacement results $\mathbf{d}$. To handle global camera motion, we assume that from time t−1 to time t, the speed differences between two tracked objects should be maintained, i.e.,
$$\psi(d_i, d_j, r_i, r_j, I) = (1 - w_{i,1}) \sum_{k} w_{ij,2}^{(k)} \left\| \Delta d_{ij} - \Delta s_{ij} \right\|^2, \tag{5}$$
where $\Delta d_{ij} = d_i - d_j$ is the displacement (which can be viewed as speed) difference between objects $i$ and $j$ at time t, $\Delta s_{ij} = s_i - s_j$ is their speed difference at the previous time t−1, and the $w_{ij,2}^{(k)}$ are a series of weighting functions (two in our experiments) that control the directional influences between the pair of objects.
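Putting Eqs. (2), (3), and (5) together, the Gibbs energy of one frame pair can be sketched as below. This is a minimal NumPy illustration under our reconstruction of the quadratic terms; the array names (`unary_pred`, `prev_speed`, `w1`, `w2`) are assumptions, and the sum over the $k$ weighting functions is folded into a single matrix for brevity:

```python
import numpy as np

def gibbs_energy(d, unary_pred, prev_speed, w1, w2):
    """Evaluate E(d, r, I) of Eq. (2) for candidate displacements d.

    d:          (n, 2) candidate displacements of the n tracked objects
    unary_pred: (n, 2) CNN-estimated displacements f_d(r_i, I)
    prev_speed: (n, 2) object speeds s_i at time t-1
    w1:         (n,)   unary confidence weights w_{i,1}, Eq. (4)
    w2:         (n, n) asymmetric pairwise weights, w2[i, j] = w_{ij,2}
    """
    # Unary terms, Eq. (3): quadratic deviation from the visual estimate.
    energy = float(np.sum(w1 * np.sum((d - unary_pred) ** 2, axis=1)))
    # Pairwise terms, Eq. (5): preserve inter-object speed differences.
    n = d.shape[0]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            delta_d = d[i] - d[j]                    # speed difference at t
            delta_s = prev_speed[i] - prev_speed[j]  # speed difference at t-1
            energy += (1 - w1[i]) * w2[i, j] * np.sum((delta_d - delta_s) ** 2)
    return energy
```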
Fig. 3: The average deviation of detection boxes from their ground-truth locations is approximately proportional to the detection box size. The statistics are calculated from the 2DMOT16 training set [28], where the detection boxes are provided by the dataset.

For better modeling inter-object relations, two important observations are made to define the asymmetric weighting functions $w_{ij,2}^{(k)}$. 1) For detection boxes, in terms of localization accuracy, larger object detection boxes are more likely to be noisy, while smaller ones tend to be more stable (as shown in Fig. 3). This is because the displacements of both large and small detection boxes are all recorded in pixels in our tracking framework. Noisy large detection boxes would significantly influence the displacement estimation of other boxes. This problem is illustrated in Fig. 4. The two targets in Fig. 4(a) have accurate locations and speeds which can be used to build inter-object constraints at time t−1. When the detector outputs roughly accurate bounding boxes for both targets at time t, symmetric inter-object constraints can well refine the objects' locations (see Fig. 4(b)). However, since larger-size detection boxes are more likely to be noisy, using the symmetric inter-object constraints would significantly affect the tracking results of the small-size objects (see Fig. 4(c)). In contrast, small-size objects have smaller localization errors and can better infer larger-size objects' locations. Asymmetric small-to-large-size inter-object constraints are robust even when the smaller-size detection box is noisy (see Fig. 4(d)). Therefore, between a pair of tracked objects, the one with the smaller detection box should have more influence in inferring the displacement of the one with the larger detection box, and the object with the larger box should have less chance to deteriorate the displacement estimation of the smaller one. 2) If our above-mentioned visual-displacement CNN has high confidence in an object's displacement, this object's visually inferred displacement should be used more to infer other objects' displacements. On the other hand, objects with low confidences on their visually inferred displacements should not affect objects with high-confidence displacements.

Fig. 4: Illustration of symmetric and asymmetric inter-object constraints. (a) Two tracked objects at time t−1 with their estimated speeds (denoted by arrows). (b) Tracked objects at time t. Symmetric inter-object constraints work well when there is little detection noise for all detection boxes. (c) When there is localization noise for the large-size detection box, symmetric inter-object constraints are likely to deteriorate the tracking of the small-size object. (d) Asymmetric small-to-large-size inter-object constraints are more robust than symmetric inter-object constraints, even when there is localization noise for the small-size detection box.

Based on the two observations, we model the weighting function $w_{ij,2}^{(k)}$ as the product of a size-based weighting function and a confidence-based weighting function between a pair of tracked objects,
$$w_{ij,2}^{(k)} = \sigma\left(a_{21}^{(k)} \log\left(\mathrm{area}_i / \mathrm{area}_j\right) + b_{21}^{(k)}\right) \times \sigma\left(a_{22}^{(k)} \left(\max(\mathbf{c}_i) - \max(\mathbf{c}_j)\right) + b_{22}^{(k)}\right), \tag{6}$$
where $\sigma$ denotes the sigmoid function, $\mathrm{area}_i$ denotes the size of the $i$-th tracked object at time t−1, $\max(\mathbf{c}_i)$ is the maximal displacement confidence in $\{c_i^1, c_i^2, \cdots, c_i^m\}$ output by our proposed visual-displacement CNN, and $a_{21}^{(k)}, b_{21}^{(k)}, a_{22}^{(k)}, b_{22}^{(k)}$ are learnable scalar parameters. In our DCCRF, these parameters are learned by the back-propagation algorithm with mean-field approximation. With the mean-field approximation for DCCRF inference, the influence from object $r_i$ to $r_j$ and that from $r_j$ to $r_i$ are different (see the next subsection for details). After training, we observe that $a_{21}^{(k)} > 0$ and $a_{22}^{(k)} < 0$, which means that a smaller $\mathrm{area}_j$ relative to $\mathrm{area}_i$ and a larger $\max(\mathbf{c}_j)$ relative to $\max(\mathbf{c}_i)$ lead to greater weights $w_{ij,2}^{(k)}$. This validates our above-mentioned observations that objects with smaller sizes and greater visual-displacement confidences should have greater influence on other objects, but not the other way around.

In Fig. 5, we show example values of one learned weighting function $w_{ij,2}^{(k)}$. In Fig. 5(a), compared with object 6, objects 2-4 are of smaller sizes and also have higher visual confidences. With the directional weighting functions, they have greater influence
to correct errors of tracking object 6 (red vs. green rectangles of object 6) and are not much affected by the erroneous estimation of object 6. Similar directional weighting function values can be found in Fig. 5(b), where objects 1, 3, 4 with high visual-displacement confidences are able to correct tracking errors of object 5, which has a low visual-displacement confidence.

Fig. 5: Example values of the asymmetric weighting function $w_{ij,2}^{(k)}$ between tracked objects of different sizes and confidences. Green dashed rectangles denote object locations estimated by the unary terms (visual-displacement CNN) only. Red rectangles denote object locations estimated by both unary and pairwise terms. Orange arrows and numbers denote the weighting function values from one object to the other. (a) Small-size objects (objects 2-4) help correct errors of the large-size object (object 6) with higher directional weighting function values. (b) Objects with higher visual-displacement confidences (objects 1, 3, 4) help correct errors of the object (object 5) with lower visual-displacement confidence.
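The directional weighting function of Eq. (6) is simple to sketch. In the snippet below (ours; the argument names are assumptions), $w_{ij,2}$ is read as the weight of the message from object $j$ to object $i$, consistent with the update equation Eq. (7) in the next subsection, so with $a_{21} > 0$ and $a_{22} < 0$ a smaller and more confident object $j$ obtains a larger weight:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_weight(area_i, area_j, conf_i, conf_j, a21, b21, a22, b22):
    """One asymmetric weighting function w_{ij,2}^{(k)} of Eq. (6).

    area_i, area_j: box areas of objects i and j at time t-1
    conf_i, conf_j: maximal bin confidences max(c_i), max(c_j)
    With a21 > 0 and a22 < 0 (the signs learned in the paper), a smaller
    area_j and a larger conf_j increase the influence of j on i.
    """
    size_term = sigmoid(a21 * np.log(area_i / area_j) + b21)
    conf_term = sigmoid(a22 * (conf_i - conf_j) + b22)
    return size_term * conf_term  # note: not symmetric in (i, j)
```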
Inference: For the unary terms, we use forward propagation of the visual-displacement CNN to calculate each object's estimated displacement and displacement confidences $\{c_i^1, c_i^2, \cdots, c_i^m\}$. After the unary-term inference, the overall maximum posterior marginal inference is achieved by mean-field approximation, which yields an iterative message-passing scheme for approximate inference. Our unary and pairwise terms are both of quadratic form. The energy function is therefore convex in each $d_i$, and the optimal displacement is obtained as the mean value of the corresponding marginal,
$$d_i \leftarrow \frac{w_{i,1} f_d(r_i, I) + (1 - w_{i,1}) \sum_{j \neq i} \sum_{k} w_{ij,2}^{(k)} \left(d_j + \Delta s_{ij}\right)}{w_{i,1} + (1 - w_{i,1}) \sum_{j \neq i} \sum_{k} w_{ij,2}^{(k)}}. \tag{7}$$
In each iteration, node $i$ receives messages from all other objects to update its displacement estimate. The mean-field approximation usually converges in 5-10 iterations. The above displacement update equation clearly shows the difference between the messages transmitted from object $i$ to $j$ and from object $j$ to $i$, caused by the asymmetric weighting functions: for a pair of objects, $w_{ij,2}^{(k)}$ and $w_{ji,2}^{(k)}$ are generally different. Even if $w_{i,1} = w_{j,1}$, when $w_{ij,2}^{(k)} > w_{ji,2}^{(k)}$, object $j$ has greater influence on object $i$ than object $i$ has on object $j$.

A detailed derivation of Eq. (7) is given as follows. The mean-field method approximates the distribution $P(\mathbf{d} \mid \mathbf{r}, I)$ with a distribution $Q(\mathbf{d} \mid \mathbf{r}, I)$ that can be expressed as a product of independent marginals, $Q(\mathbf{d} \mid \mathbf{r}, I) = \prod_{i=1}^{n} Q_i(d_i \mid \mathbf{r}, I)$. The optimal approximation $Q$ is obtained by minimizing the Kullback-Leibler (KL) divergence between $P$ and $Q$. The solution for $Q$ has the following form,
$$\log Q_i(d_i \mid \mathbf{r}, I) = \mathbb{E}_{j \neq i}\left[\log P(\mathbf{d} \mid \mathbf{r}, I)\right] + \mathrm{const}, \tag{8}$$
where $\mathbb{E}_{j \neq i}$ denotes the expectation under the $Q$ distributions over all variables $d_j$ for $j \neq i$. The inference is formulated as
$$\begin{aligned} -\log Q_i(d_i \mid \mathbf{r}, I) &= \phi(d_i, r_i, I) + \sum_{j \neq i} \psi(d_i, d_j, r_i, r_j, I) + \mathrm{const} \\ &= w_{i,1} \left\|d_i - f_d(r_i, I)\right\|^2 + (1 - w_{i,1}) \sum_{j \neq i} \sum_{k} w_{ij,2}^{(k)} \left\|\Delta d_{ij} - \Delta s_{ij}\right\|^2 + \mathrm{const} \\ &= \Big(w_{i,1} + (1 - w_{i,1}) \sum_{j \neq i} \sum_{k} w_{ij,2}^{(k)}\Big) \|d_i\|^2 - 2\Big(w_{i,1} f_d(r_i, I) + (1 - w_{i,1}) \sum_{j \neq i} \sum_{k} w_{ij,2}^{(k)} \left(d_j + \Delta s_{ij}\right)\Big)^{\!\top} d_i + \mathrm{const}. \end{aligned} \tag{9}$$
Each $\log Q_i(d_i \mid \mathbf{r}, I)$ is a quadratic form with respect to $d_i$, and its mean is therefore
$$\mu_i = \frac{w_{i,1} f_d(r_i, I) + (1 - w_{i,1}) \sum_{j \neq i} \sum_{k} w_{ij,2}^{(k)} \left(d_j + \Delta s_{ij}\right)}{w_{i,1} + (1 - w_{i,1}) \sum_{j \neq i} \sum_{k} w_{ij,2}^{(k)}}. \tag{10}$$
The inference task is to maximize $P(\mathbf{d} \mid \mathbf{r}, I)$. Since we approximate the conditional distribution with a product of independent marginals, an estimate of each $d_i$ is obtained as the expected value $\mu_i$ of the corresponding quadratic function,
$$\hat{d}_i = \arg\max_{d_i} Q_i(d_i \mid \mathbf{r}, I) = \mu_i. \tag{11}$$
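The closed-form update of Eq. (7) can be run as a simple fixed-point iteration. Below is a minimal NumPy sketch of the mean-field inference (our illustration, not the authors' implementation, which is written in MATLAB/Caffe; the sum over the $k$ weighting functions is again folded into a single matrix `w2`):

```python
import numpy as np

def mean_field_inference(f_d, prev_speed, w1, w2, iters=5):
    """Iterate the mean-field update of Eq. (7) for all objects.

    f_d:        (n, 2) visually inferred displacements f_d(r_i, I)
    prev_speed: (n, 2) object speeds s_i at time t-1
    w1:         (n,)   unary confidence weights w_{i,1}
    w2:         (n, n) asymmetric pairwise weights, w2[i, j] = w_{ij,2}
    """
    n = f_d.shape[0]
    d = f_d.copy()                # initialize with the unary estimates
    for _ in range(iters):        # typically converges in 5-10 iterations
        d_new = np.empty_like(d)
        for i in range(n):
            num, den = w1[i] * f_d[i], w1[i]
            for j in range(n):
                if j == i:
                    continue
                delta_s = prev_speed[i] - prev_speed[j]
                num = num + (1 - w1[i]) * w2[i, j] * (d[j] + delta_s)
                den = den + (1 - w1[i]) * w2[i, j]
            d_new[i] = num / den  # Eq. (7)
        d = d_new
    return d
```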
B. The Overall MOT Algorithm

The overall algorithm with our proposed DCCRF is shown in Algorithm 1. At each time t, the DCCRF takes as inputs the existing tracklets up to time t−1 and the consecutive frames at time t−1 and time t, and outputs each tracklet's displacement estimate. After obtaining the displacement estimate $\hat{d}_i$ of each tracklet $r_i$ by the DCCRF, its estimated location at time t can simply be calculated as the sum of its location $b_{r_i}$ at time t−1 and its estimated displacement $\hat{d}_i$, i.e.,
$$\hat{b}_{r_i} = b_{r_i} + \hat{d}_i. \tag{12}$$
Based on such estimated locations, we utilize a visual-similarity CNN (Section III-B1) as well as Intersection-over-Union values as the criteria for tracklet-detection association to generate longer tracklets (Section III-B2). To make our online MOT system complete, we also specify our detailed strategies for tracklet initialization (Section III-B3), and for occlusion handling and tracklet termination (Section III-B4).
1) Visual-similarity CNN:
The tracklet-detection associations need to be determined based on visual cues and spatial cues simultaneously. We propose a visual-similarity CNN for calculating visual similarities between image patches cropped at bounding box locations in the same frame. The visual-similarity CNN has a similar network structure to our visual-displacement CNN in Section III-A1. However, this network takes image patches from the same video frame as inputs and outputs the confidence that the input pair represents the same object. It is therefore trained with a binary cross-entropy loss. In addition, the training samples are generated differently for the visual-similarity CNN. Instead of cropping two consecutive video frames at the same bounding box location as for the visual-displacement CNN, the visual-similarity CNN requires positive pairs to be cropped at different locations of the same object at any time in the same video, while negative pairs are image patches belonging to different objects. For cropping image patches, we do not enlarge the object's bounding box, which is also different from our visual-displacement CNN. During training, the ratio between positive and negative pairs is fixed, and the network is trained similarly to the visual-displacement CNN.

Algorithm 1: The overall MOT algorithm
Input: Image sequence up to time t, per-frame object detections b_1, b_2, b_3, ...
Output: Object tracklets up to time t.
for time = 1, ..., t do
  Estimate tracked object displacements d_i (Section III-A);
  Estimate tracklet locations b̂_{r_i} (Eq. (12));
  Calculate tracklet-detection similarities S(b̂_{r_i}, b_j) (Section III-B2);
  Run the Hungarian algorithm to obtain each tracklet's associated detection b_j (Section III-B2);
  for each tracklet r_i do
    if IoU(b̂_{r_i}, b_j) ≥ 0.5 then
      Append b_j to tracklet r_i;
    else if 0.5 > IoU(b̂_{r_i}, b_j) ≥ 0.3 then
      Append (b_j + b̂_{r_i})/2 to tracklet r_i;
    else
      r_i has no detection association;
      if no association for more than m frames then
        Tracklet termination (Section III-B4);
      else
        Append b̂_{r_i} to tracklet r_i (Section III-B4);
  end
  for detections not associated to tracklets do
    if high overall similarity for k frames then
      Tracklet initialization (Section III-B3);
  end
end
2) Tracklet-detection association:
Given the estimated tracklet locations and the detection boxes at time t, tracklets are associated with detection boxes based on the visual and spatial similarities between them. The associated detection boxes can then be appended to their corresponding tracklets to form longer ones up to time t. Let $\hat{b}_{r_i}$ and $b_j$ denote the $i$-th tracklet's estimated location and the $j$-th detection box at time t. Their visual similarity calculated by the visual-similarity CNN in Section III-B1 is denoted as $V(\hat{b}_{r_i}, b_j)$. The spatial similarity between the estimated tracklet locations and detection boxes is measured by their box Intersection-over-Union values $\mathrm{IoU}(\hat{b}_{r_i}, b_j)$. If a detection box can be associated with multiple tracklets, the Hungarian algorithm is utilized to determine the optimal associations with the following overall similarity,
$$S(\hat{b}_{r_i}, b_j) = V(\hat{b}_{r_i}, b_j) + \lambda\, \mathrm{IoU}(\hat{b}_{r_i}, b_j), \tag{13}$$
where $\lambda$ is the weight balancing the visual and spatial similarities and is set to 1 in our experiments. After the box association by the Hungarian algorithm, if a tracklet is associated with a detection box that has an IoU value greater than 0.5 with it, the associated detection box is directly appended to the end of the tracklet. If the IoU value is between 0.3 and 0.5, the average of the associated detection box and the estimated tracklet box is appended to the tracklet to compensate for a possibly noisy detection box. If the IoU value is smaller than 0.3, the tracklet is considered as being terminated or temporarily occluded (Section III-B4).
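The association step maps directly onto a standard assignment solver. The following sketch (ours, using SciPy's Hungarian implementation; the matrix names are assumptions) maximizes the overall similarity of Eq. (13):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_tracklets(visual_sim, iou, lam=1.0):
    """Optimal tracklet-detection assignment under Eq. (13).

    visual_sim: (n_tracklets, n_detections) similarities V from the
                visual-similarity CNN
    iou:        (n_tracklets, n_detections) IoU between estimated tracklet
                boxes and detection boxes
    """
    S = visual_sim + lam * iou
    # The Hungarian algorithm minimizes cost, so negate to maximize S.
    rows, cols = linear_sum_assignment(-S)
    return list(zip(rows, cols))  # (tracklet index, detection index) pairs
```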
3) Tracklet initialization:
If an object detection box at time t−1 is not associated to any tracklet in the above tracklet-detection association step, it is treated as a candidate box for initializing new tracklets. For each such candidate box at time t−1, its visually inferred displacement between time t−1 and t is first obtained by our visual-displacement CNN in Section III-A1. Its estimated box location can then be calculated following Eq. (12). The visual similarities $V$ and spatial similarities $\mathrm{IoU}$ between the estimated box at time t and the candidate boxes at time t are calculated. To form a new candidate tracklet, the candidate box at time t−1 is only associated with a candidate box at time t that has 1) a greater-than-0.3 IoU and 2) a greater-than-0.8 visual similarity with its estimated box location. If there are multiple candidate associations, the Hungarian algorithm is utilized to associate the candidate box at time t−1 with its optimal candidate association at time t according to the overall similarities (Eq. (13)). If none of the candidate associations at time t satisfies the above two conditions, the candidate box is ignored and is not used for tracklet initialization. Such operations are iterated over time to generate longer candidate tracklets. If a candidate tracklet spans more than k frames (k = 4 for pedestrian tracking with 25-fps videos), it is initialized as a new tracklet.
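The gating rules above reduce to a few comparisons; a minimal sketch follows (ours; the function and argument names are hypothetical):

```python
def gate_candidate(iou, visual_sim, tracklet_len, k=4):
    """Candidate-tracklet gating of Section III-B3.

    A candidate association is kept only with IoU > 0.3 and visual
    similarity > 0.8; a surviving candidate tracklet is promoted to a
    real tracklet once it spans more than k frames (k = 4 in the paper
    for pedestrian tracking with 25-fps videos).
    """
    if iou <= 0.3 or visual_sim <= 0.8:
        return "ignore"      # candidate association rejected
    if tracklet_len > k:
        return "initialize"  # promote to a new tracklet
    return "extend"          # keep growing the candidate tracklet
```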
4) Occlusion handling and tracklet termination:
If a past tracklet is not associated to any detection box at time t, the tracked object is considered as being possibly occluded or temporarily missed. For a possibly occluded object, we directly extend its past tracklet with its location estimated by our DCCRF at time t to create a virtual tracklet. The same operation is iterated for up to m frames, i.e., if the virtual tracklet is not associated to any detection box for more than m time steps, the virtual tracklet is terminated. For pedestrian tracking, we empirically set m = 5.
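A sketch of this occlusion-handling logic (ours; the tracklet fields `boxes`, `misses`, and `terminated` are assumed names, not from the paper):

```python
def handle_unmatched(tracklet, dccrf_box, m=5):
    """Extend an unmatched tracklet with its DCCRF-estimated box and
    terminate it after m consecutive misses (m = 5 in the paper)."""
    tracklet.misses += 1
    if tracklet.misses > m:
        tracklet.terminated = True        # no detection for > m frames
    else:
        tracklet.boxes.append(dccrf_box)  # virtual tracklet continues
```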
IV. EXPERIMENTS

In this section, we present experimental results of the proposed online MOT algorithm. We first introduce the evaluation datasets and implementation details of our proposed framework in Sections IV-A and IV-B. In Section IV-C, we compare the proposed method with state-of-the-art approaches on the public MOT datasets. The individual components of our proposed method are evaluated in Section IV-D.
A. Datasets and Evaluation Metric
We conduct experiments on the 2DMOT15 [29] and 2DMOT16 [28] benchmarks, which are widely used to evaluate the performance of MOT methods. Both of them have two tracks: public detection boxes [2], [3], [24] and private detection boxes [30], [31]. To compare only the performance of the tracking algorithms, we evaluate our method with the provided public detection boxes.
1) 2DMOT15:
This dataset is one of the largest MOT datasets, with moving or static cameras, different viewpoints, and different weather conditions. It contains a total of 22 sequences, half for training and half for testing, with a total of 11286 frames (or 996 seconds). The training sequences contain over 5500 frames, 500 annotated trajectories, and 39905 annotated bounding boxes. The testing sequences contain over 5700 frames, 721 annotated trajectories, and 61440 annotated bounding boxes. The public detection boxes in 2DMOT15 are generated with aggregated channel features (ACF).
2) 2DMOT16:
This dataset is an extension of 2DMOT15. Compared to 2DMOT15, new sequences are added and the dataset contains almost 3 times more bounding boxes for training and testing. Most sequences are in high resolution, and the average number of pedestrians in each video frame is 3 times higher than that of 2DMOT15. In 2DMOT16, deformable part model (DPM) based methods are used to generate the public detection boxes, which are more accurate than the boxes in 2DMOT15.
3) Evaluation Metric:
For the quantitative evaluation, we adopt the popular CLEAR MOT metrics [29], which include:
• MOTA: Multiple Object Tracking Accuracy. This metric is usually chosen as the main performance indicator for MOT methods. It combines three types of errors: false positives, false negatives, and identity switches.
• MOTP: Multiple Object Tracking Precision. The misalignment between the annotated and the predicted bounding boxes.
• MT: Mostly Tracked targets. The ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span.
• ML: Mostly Lost targets. The ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span.
• FP: The total number of false positives.
• FN: The total number of false negatives (missed targets).
• ID Sw: The total number of identity switches. Note that we follow the stricter definition of identity switches as described in the MOT challenge.
• Frag: The total number of times a trajectory is fragmented (i.e., interrupted during tracking).
B. Implementation details

1) Training schemes and settings:
For the visual-displacement and visual-similarity CNNs, we adopt ResNet-101 [26], [32] as the network structure and replace the topmost layer to output displacement confidences or same-object confidences. Both CNNs are pretrained on the ImageNet dataset. For cropping image patches from $\bar{b}_i$, we enlarge each detection box $b_i$ by a factor of 5 in width and 2 in height to obtain $\bar{b}_i$. Image patches for the two CNNs are cropped at the same locations from consecutive frames as described in Section III-A1, and are then resized to a fixed resolution as the CNN inputs.

We train our proposed DCCRF in three stages. In the first stage, the proposed visual-displacement CNN is trained with the cross-entropy loss and batch Stochastic Gradient Descent (SGD) with a batch size of 5. The initial learning rate is decreased by a factor of 1/10 every 50,000 iterations, and training generally converges after 600,000 iterations. In the second stage, the learned visual-displacement CNN from stage 1 is fixed and the other parameters of our DCCRF are trained with an $\ell_2$ loss,
$$\zeta_{\mathrm{loss}} = \sum_i \left\| \hat{d}_i - d_i^{gt} \right\|^2, \tag{14}$$
where $\hat{d}_i$ and $d_i^{gt}$ are the estimated and ground-truth displacements of the $i$-th tracked object. In the final stage, the DCCRF is trained in an end-to-end manner with the above $\ell_2$ loss and the cross-entropy loss for the visual-displacement CNN in the unary terms. We find that 5 iterations of the mean-field approximation generate satisfactory results. The DCCRF is trained with a small initial learning rate, which is decreased by a factor of 1/3 every 5,000 iterations; training typically converges after 3 epochs.

Our code is implemented with MATLAB and Caffe. The overall tracking speed of the proposed method on the MOT16 test sequences is 0.1 fps using a 2.4GHz CPU and a Maxwell TITAN X GPU, without acceleration libraries.
2) Data augmentation:
To introduce more variation into the training data and thus reduce possible overfitting, we augment the training data. For pre-training the visual-displacement CNN, the input images are image patches centered at detection boxes. We augment the training samples by random flipping, as well as by randomly shifting the cropping positions by no more than a fixed fraction of the detection box width and height in the x and y dimensions, respectively. For end-to-end training of the DCCRF, in addition to random flipping of whole video frames, the time interval between the two input video frames is randomly sampled to generate more frame pairs with larger possible displacements between them.
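A sketch of the random-shift cropping augmentation (ours; the exact shift fraction was lost in the source, so the default of 0.25 below is purely an assumed placeholder):

```python
import numpy as np

def jitter_crop_center(box, max_frac=0.25, rng=np.random):
    """Randomly shift a crop center before cropping a training patch.

    box: (cx, cy, w, h) with (cx, cy) the detection box center. The center
    is shifted by at most +/- max_frac of the box width/height in x/y.
    """
    cx, cy, w, h = box
    cx += rng.uniform(-max_frac, max_frac) * w
    cy += rng.uniform(-max_frac, max_frac) * h
    return (cx, cy, w, h)
```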
C. Quantitative results on 2DMOT15 and 2DMOT16

On the 2DMOT15 and 2DMOT16 datasets, we test our proposed method and compare it with state-of-the-art MOT methods including SMOT [33], MDP [2], SCEA [3], CEM [34], RNN_LSTM [23], RMOT [11], TC_ODAL [38], CNNTCM [36], SiameseCNN [25], oICF [39], NOMT [37], and CDA_DDAL [24]. (Note that only methods in peer-reviewed publications are compared in this paper; arXiv papers that have not undergone peer review are not included.) The results of the compared methods are listed in Tables I and II. We focus on the MOTA value as the main performance indicator, which is a weighted combination of false negatives (FN), false positives (FP), and identity switches (ID Sw).
TABLE I: Quantitative results by our method and state-of-the-art MOT methods on the 2DMOT15 dataset. ↑ denotes that higher is better and ↓ the opposite.

Tracking Mode | Method | MOTA↑ | MOTP↑ | MT↑ | ML↓ | FP↓ | FN↓ | ID Sw↓ | Frag↓
Offline | SMOT [33] | 18.2% | 71.2% | 2.8% | 54.8% | 8780 | 40310 | 1148 | 2132
Offline | CEM [34] | 19.3% | 70.7% | 8.5% | 46.5% | 14180 | 34591 | 813 | 1023
Offline | DCO_X [35] | 19.6% | 71.4% | 5.1% | 54.9% | 10652 | 38232 | 521 | –
Offline | SiameseCNN [25] | 29.0% | 71.2% | 8.5% | 48.4% | – | – | – | –
Online | MDP [2] | 30.3% | – | – | – | – | – | 680 | 1500
Online | CDA_DDAL [24] | 32.8% | 70.7% | 9.7% | 42.2% | – | – | – | –

TABLE II: Quantitative results by our proposed method and state-of-the-art MOT methods on the 2DMOT16 dataset. ↑ denotes that higher is better and ↓ the opposite.

Tracking Mode | Method | MOTA↑ | MOTP↑ | MT↑ | ML↓ | FP↓ | FN↓ | ID Sw↓ | Frag↓
Offline | TBD [40] | 33.7% | 76.5% | 7.2% | 54.2% | 5804 | 112587 | 2418 | 2252
Offline | LTTSC-CRF [41] | 37.6% | 75.9% | 9.6% | 55.2% | 11969 | 101343 | 481 | 1012
Offline | LINF [42] | 41.0% | 74.8% | 11.6% | 51.3% | 7896 | 99224 | 430 | 963
Offline | MHT_DAM [43] | 42.9% | – | – | – | – | – | – | –
Online | OVBT [44] | 38.4% | 75.4% | 7.5% | 47.3% | 11517 | 99463 | 1321 | 2140
Online | EAMTT_pub [45] | 38.8% | 75.1% | 7.9% | 49.1% | 8114 | 102452 | 965 | 1657
Online | oICF [39] | 43.2% | 74.3% | 11.3% | 48.5% | 6651 | 96515 | – | –

Note that offline methods generally achieve higher MOTA than online methods, because they can utilize not only past but also future information for object tracking; they are listed only for reference here. Our proposed online MOT method outperforms all compared online methods and most offline methods [2], [3], [39], [24], [25]. As shown by the quantitative results, our proposed method is able to alleviate the difficulties caused by object mis-detections, noisy detections, and short-term occlusions. Qualitative results are shown in Fig. 6.

Compared with SCEA [3], which also models inter-object interactions and speed differences to handle mis-detections caused by global camera motion, our learned DCCRF shows better performance, especially in FN, because our more accurate displacement prediction is able to recover more mis-detections. Our proposed method also outperforms MDP [2] in terms of MOTA and FP by a large margin. MDP learns to predict four target states (active, tracked, lost, and inactive) for each tracked object. However, it models each tracked object's movement pattern only with a constant-speed assumption, which is likely to result in false tracklet-detection associations and thus increases FP. CDA_DDAL [24] focuses on discriminative visual features learned by a Siamese CNN for tracklet-detection associations, which is not robust to occlusions and tends to increase FN. Compared with DCO_X [35] and LTTSC-CRF [41], which also use conditional random field approximations to solve the MOT problem, the results show that our proposed DCCRF has clear advantages in MOTA. However, our method produces more ID switches than some compared methods, which is due to long-term occlusions that cannot be handled by our method.

TABLE III: Component analysis of our proposed DCCRF on the 2DMOT16 dataset. ↑ denotes that higher is better and ↓ the opposite.

Method | MOTA↑ | FP↓ | FN↓ | ID Sw↓
Proposed DCCRF | 44.8% | 5613 | 94125 | 968
Unary-only | 41.9% | 7392 | 97618 | 876
Unary-only + ℓ2-loss (reg) | 34.2% | 12089 | 104810 | 3134
DCCRF w/o size-asym | 43.6% | 8063 | 93724 | 1035
DCCRF w/o cfd-asym | 43.8% | 7353 | 94163 | 969
DCCRF w/ symmetry | 43.4% | 9100 | 93076 | 1104

D. Component analysis on 2DMOT16
To analyze the effectiveness of the different components of our proposed framework, we design a series of baseline methods for comparison. The results of these baselines and our final method are reported in Table III. As in the above experiments, we focus on the MOTA value as the main performance indicator. 1) Unary-only: this baseline utilizes only the unary terms of our DCCRF, i.e., the visual-displacement CNN, within our overall MOT algorithm. Such a baseline model considers only the tracked objects' appearance information. Compared with our proposed DCCRF, it shows a 2.9% MOTA drop, which indicates that the inter-object relations are crucial for regularizing each object's estimated displacement and should not be ignored. 2) Unary-only + ℓ2-loss (reg): since our visual-displacement CNN is trained with the proposed cross-entropy loss instead of conventional ℓ1 or ℓ2 regression losses, we train a visual-displacement CNN with a smooth ℓ1 loss and test it in the same way as the above unary-only baseline. Compared with the unary-only baseline, this baseline shows a significant 7.7% MOTA drop, which demonstrates that our proposed cross-entropy loss results in much better displacement estimation accuracy. 3) DCCRF w/o cfd-asym and DCCRF w/o size-asym: the weighting functions of the pairwise terms in our proposed DCCRF have two factors, a confidence-asymmetric factor and a size-asymmetric factor. We test using only one of them in our DCCRF's pairwise terms. The results show more than a 1% MOTA drop for both baselines compared with our proposed DCCRF, which validates the need for both factors in the weighting functions. 4) DCCRF w/ symmetry: this baseline method replaces the asymmetric pairwise terms of our DCCRF with a symmetric one,
$$(1 - w_{i,1}) \sum_{k} \exp\left(-\frac{\|l_i - l_j\|^2}{\big(a_2^{(k)}\big)^2}\right) \left\| \Delta d_{ij} - \Delta s_{ij} \right\|^2, \tag{15}$$
where $l_i$ is the center position of the $i$-th object and the $a_2^{(k)}$ are learnable Gaussian kernel bandwidth parameters. Such a symmetric term assumes that the speed differences between close-by objects should be better maintained across time, while those between far-away objects are less regularized. There is a 1.4% MOTA drop compared with our proposed DCCRF, which shows that our asymmetric terms are beneficial for the final performance. We also tried directly replacing the sigmoid functions in Eq. (6) with a Gaussian-like function as in Eq. (15), which results in even worse performance.

Fig. 6: Example tracking results by our proposed method on the 2DMOT16 dataset: (a) MOT16-03, (b) MOT16-06, (c) MOT16-07, (d) MOT16-08, (e) MOT16-12, (f) MOT16-14.

In addition to the above, we also conduct experiments to analyze the effects of different hyper-parameters and show the robustness of our DCCRF. 1) The parameter λ controls the weight between the visual-similarity term and the DCCRF location prediction term for tracklet-detection association in Eq. (13). We test three different values of λ and report the results in Table IV; the final performance is not sensitive to the λ value. 2) The parameter k is the required length of a candidate tracklet for creating an actual tracklet in Section III-B3. We additionally test k = 8 in Table V, which shows a slight performance drop, because a larger k causes more low-confidence detections to be ignored. 3) The parameter m denotes the number of consecutive frames an object can be missing before its associated tracklet is terminated in Section III-B4. We additionally test m = 8, and the results in Table VI show that the performance is not sensitive to the choice of m.

TABLE IV: Effects of different λ values.
λ | – | – | –
MOTA | –.8% | 44.8% | 43.–%

TABLE V: Results with different tracklet initialization parameters k.
k | MOTA | FP | FN
4 | 44.8% | 5613 | 94125
8 | 43.0% | 4837 | 98433

TABLE VI: Results with different tracklet termination parameters m.
m | MOTA | FP | FN
5 | 44.8% | 5613 | 94125
8 | 44.7% | 6861 | 92976

V. CONCLUSION
In this paper, we presented the Deep Continuous Conditional Random Field (DCCRF) model with asymmetric inter-object constraints for solving the MOT problem. The unary terms are modeled as a visual-displacement CNN that estimates object displacements across time from visual information. The asymmetric pairwise terms regularize inter-object speed differences across time with both size-based and confidence-based weighting functions, which weight high-confidence tracklets more in order to correct tracking errors. By jointly training the two terms in the DCCRF, the relations between objects' individual movement patterns and complex inter-object constraints can be better modeled and regularized to achieve more accurate tracking performance. Extensive experiments demonstrate the effectiveness of our proposed MOT framework as well as of the individual components of our DCCRF.
REFERENCES

[1] W. Luo, J. Xing, X. Zhang, X. Zhao, and T. K. Kim, "Multiple object tracking: A literature review," arXiv preprint arXiv:1409.7618, 2014.
[2] Y. Xiang, A. Alahi, and S. Savarese, "Learning to track: Online multi-object tracking by decision making," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4705–4713.
[3] J. Hong Yoon, C.-R. Lee, M.-H. Yang, and K.-J. Yoon, "Online multi-object tracking via structural constraint event aggregation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1392–1400.
[4] S. Tang, B. Andres, M. Andriluka, and B. Schiele, "Multi-person tracking by multicut and deep matching," arXiv preprint arXiv:1608.05404, 2016.
[5] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, "You'll never walk alone: Modeling social behavior for multi-target tracking," in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 261–268.
[6] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
[7] L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn, "Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 120–127.
[8] X. Chen, Z. Qin, L. An, and B. Bhanu, "Multiperson tracking by online learned grouping model with nonlinear motion context," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 12, pp. 2226–2239, 2016.
[9] L. Zhang and L. Van Der Maaten, "Preserving structure in model-free tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 756–769, 2014.
[10] G. Duan, H. Ai, S. Cao, and S. Lao, "Group tracking: Exploring mutual relations for multiple object tracking," Computer Vision – ECCV 2012, pp. 129–143, 2012.
[11] J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon, "Bayesian multi-object tracking using motion context from multiple objects," in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 2015, pp. 33–40.
[12] H. Grabner, J. Matas, L. Van Gool, and P. Cattin, "Tracking the invisible: Learning where the object might be," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1285–1292.
[13] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, "Conditional random fields as recurrent neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
[14] B. Yang and R. Nevatia, "An online learned CRF model for multi-target tracking," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2034–2041.
[15] A. Andriyenko, K. Schindler, and S. Roth, "Discrete-continuous optimization for multi-target tracking," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1926–1933.
[16] A. Milan, K. Schindler, and S. Roth, "Detection- and trajectory-level exclusion in multiple object tracking," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
[17] H. Li, Y. Li, and F. Porikli, "Robust online visual tracking with a single convolutional neural network," in Asian Conference on Computer Vision. Springer, 2014, pp. 194–209.
[18] N. Wang and D.-Y. Yeung, "Learning a deep compact image representation for visual tracking," in Advances in Neural Information Processing Systems, 2013, pp. 809–817.
[19] S. Hong, T. You, S. Kwak, and B. Han, "Online tracking by learning discriminative saliency map with convolutional neural network," in International Conference on Machine Learning, 2015, pp. 597–606.
[20] L. Wang, W. Ouyang, X. Wang, and H. Lu, "Visual tracking with fully convolutional networks," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[21] R. Tao, E. Gavves, and A. W. Smeulders, "Siamese instance search for tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1420–1429.
[22] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, "Fully-convolutional siamese networks for object tracking," arXiv preprint arXiv:1606.09549, 2016.
[23] A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler, "Online multi-target tracking using recurrent neural networks," in AAAI, 2017, pp. 4225–4232.
[24] S.-H. Bae and K.-J. Yoon, "Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[25] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler, "Learning by tracking: Siamese CNN for robust target association," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 33–40.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[27] D. Held, S. Thrun, and S. Savarese, "Learning to track at 100 fps with deep regression networks," in European Conference on Computer Vision. Springer, 2016, pp. 749–765.
[28] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, "MOT16: A benchmark for multi-object tracking," arXiv preprint arXiv:1603.00831, 2016.
[29] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, "MOTChallenge 2015: Towards a benchmark for multi-target tracking," arXiv preprint arXiv:1504.01942, 2015.
[30] H. Roberto, L.-T. Laura, C. Daniel, and R. Bodo, "A novel multi-detector fusion framework for multi-object tracking," arXiv preprint arXiv:1705.08314, 2017.
[31] F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan, "POI: Multiple object tracking with high performance detection and appearance feature," in European Conference on Computer Vision Workshops, 2016, pp. 36–42.
[32] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang et al., "Crafting GBD-Net for object detection," arXiv preprint arXiv:1610.02579, 2016.
[33] C. Dicle, O. I. Camps, and M. Sznaier, "The way they move: Tracking multiple targets with similar appearance," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2304–2311.
[34] A. Milan, S. Roth, and K. Schindler, "Continuous energy minimization for multitarget tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 58–72, 2014.
[35] A. Milan, K. Schindler, and S. Roth, "Multi-target tracking by discrete-continuous energy minimization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2054–2068, 2016.
[36] B. Wang, L. Wang, B. Shuai, Z. Zuo, T. Liu, K. Luk Chan, and G. Wang, "Joint learning of convolutional neural networks and temporally constrained metrics for tracklet association," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 1–8.
[37] W. Choi, "Near-online multi-target tracking with aggregated local flow descriptor," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3029–3037.
[38] S.-H. Bae and K.-J. Yoon, "Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1218–1225.
[39] H. Kieritz, S. Becker, W. Hübner, and M. Arens, "Online multi-person tracking using integral channel features," in Advanced Video and Signal Based Surveillance (AVSS), 2016 13th IEEE International Conference on. IEEE, 2016, pp. 122–130.
[40] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun, "3D traffic scene understanding from movable platforms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 1012–1025, 2014.
[41] N. Le, A. Heili, and J.-M. Odobez, "Long-term time-sensitive costs for CRF-based tracking by detection," in Computer Vision – ECCV 2016 Workshops, Part II, vol. 9914. Springer International Publishing, 2016, pp. 43–51.
[42] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle, "Improving multi-frame data association with sparse representations for robust near-online multi-object tracking," in ECCV (8), 2016, pp. 774–790.
[43] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg, "Multiple hypothesis tracking revisited," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4696–4704.
[44] Y. Ban, S. Ba, X. Alameda-Pineda, and R. Horaud, "Tracking multiple persons based on a variational bayesian model," in ECCV Workshop on Benchmarking Multiple Object Tracking, 2016.
[45] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro, "Online multi-target tracking with strong and weak detections," in European Conference on Computer Vision Workshops, 2016.