Learning Multi-target Tracking with Quadratic Object Interactions
Shaofei Wang, Charless Fowlkes
Dept. of Computer Science, University of California, Irvine
[email protected], [email protected]
Abstract
We describe a model for multi-target tracking based on associating collections of candidate detections across frames of a video. In order to model pairwise interactions between different tracks, such as suppression of overlapping tracks and contextual cues about co-occurrence of different objects, we augment a standard min-cost flow objective with quadratic terms between detection variables. We learn the parameters of this model using structured prediction and a loss function which approximates the multi-target tracking accuracy. We evaluate two different approaches to finding an optimal set of tracks under the model objective, based on an LP relaxation and a novel greedy extension to dynamic programming that handles pairwise interactions. We find that the greedy algorithm achieves performance equivalent to the LP relaxation while being 2-7x faster than a commercial solver. The resulting model with learned parameters outperforms existing methods across several categories on the KITTI tracking benchmark.
1. Introduction
Multi-target tracking is a classic topic of research in computer vision. Thanks to advances in object detector performance on single, still images, "tracking-by-detection" approaches that build tracks on top of a collection of candidate object detections have shown great promise. Tracking-by-detection avoids some problems, such as drift, and is often able to recover from extended periods of occlusion since it is "self-initializing". Finding an optimal set of detections corresponding to each track is often formulated as a discrete optimization problem of finding low-cost paths through a graph of candidate detections, for which there are often efficient combinatorial algorithms (such as min-cost matching or min-cost network flow) that yield globally optimal solutions (e.g., [27, 20]).

Tracking by detection is somewhat different from traditional generative formulations of multi-target tracking, which draw a distinction between the problem of estimating a latent continuous trajectory for each object and the discrete per-frame data-association problem of assigning observations (e.g., detections) to underlying tracks.
Such methods (e.g., [2, 19, 24]) allow for explicitly specifying an intuitive model of trajectory smoothness but face a difficult joint inference problem over both continuous and discrete variables with little guarantee of optimality.

Figure 1. We describe a framework for learning parameters of a multi-object tracking objective that includes pairwise interactions between objects. The left column shows tracking without pairwise interactions. Our system learns to enforce both inter-class and intra-class mutual exclusion as well as co-occurrence relationships between trajectories. By incorporating pairwise interactions between objects within a frame we are able to improve detection performance.

In tracking by detection, trajectories are implicitly defined by the selected group of detections. For example, a path may skip over some frames entirely due to occlusions or missing detections. The transition cost of utilizing a given edge between detections in successive frames could thus be interpreted as an approximation of the marginal likelihood associated with integrating over the set of underlying continuous trajectories associated with the corresponding pair of detections. This immediately raises difficulties, both in (1) encoding strong trajectory models with only pairwise potentials and (2) identifying the parameters of these potentials from training data.

One line of attack is to first group detections into candidate tracklets and then perform scoring and association of these tracklets [25, 4, 23]. Tracklets allow for scoring much richer trajectory and appearance models while maintaining some of the benefits of purely combinatorial grouping. Another approach is to attempt to include higher-order constraints directly in a combinatorial framework [5, 6]. In either case, there are a large number of parameters associated with these richer models, which necessitates the application of machine learning techniques. This is particularly true for (undirected) combinatorial models based on, e.g., network flow, where parameters are often set empirically by hand.

In this work, we introduce an extension to the standard min-cost flow tracking objective that allows us to model pairwise interactions between tracks. This allows us to incorporate useful knowledge such as typical spatial relationships between detections of different objects and suppression of multiple overlapping tracks of the same object. This quadratic interaction necessitates the development of approximate inference methods, which we describe in Section 3. In Section 5 we describe an approach to jointly learning the model parameters so as to maximize tracking performance on a training set using techniques for structured prediction [22]. Structured prediction has been applied in tracking to learn inter-frame affinity metrics [14] and associations [18], as well as in a variety of other learning tasks such as fitting CRF parameters for segmentation [21] and word alignment for machine translation [15]. To the best of our knowledge, the work presented here is unique in utilizing discriminative structured prediction to jointly learn the complete set of parameters of a tracking model from labeled data, including track birth/death biases, transition affinities, and multi-object contextual relations.
We conclude with experimental results (Section 6) which demonstrate that the learned quadratic model and inference routines yield state-of-the-art performance on multi-target, multi-category object tracking in urban scenes.
2. Model
We begin by formulating multi-target tracking and data association as a min-cost flow network problem equivalent to that of [27], in which individual tracks are described by a first-order Markov model whose state space is the set of spatio-temporal locations in a video. This framework incorporates a state transition likelihood that generates transition features in successive frames, and an observation likelihood that generates appearance features for objects and background.
For a given video sequence, we consider a discrete set of candidate object detection sites V, where each candidate site x = (l, σ, t) is described by its location, scale and frame number. We write Φ = {φ_a(x) | x ∈ V} for the image evidence (appearance features) extracted at each corresponding spatio-temporal location in the video. A single object track consists of an ordered set of these detection sites, T = {x_1, ..., x_N}, with strictly increasing frame numbers.

We model the whole video by a collection of tracks 𝒯 = {T_1, ..., T_k}, each of which independently generates foreground object appearances at the corresponding sites according to a distribution p_fg(φ_a), while the remaining site appearances are generated by a background distribution p_bg(φ_a). Each site can belong to at most one track. Our task is to infer the collection of tracks that maximizes the posterior probability P(𝒯 | Φ) under the model. Assuming that tracks behave independently of each other and follow a first-order Markov model, we can write an expression for MAP inference:

\mathcal{T}^* = \arg\max_{\mathcal{T}} \prod_{T \in \mathcal{T}} P(\Phi \mid T) P(T)
             = \arg\max_{\mathcal{T}} \Big( \prod_{T \in \mathcal{T}} \prod_{x \in T} l(\phi_a(x)) \Big) \times \prod_{T \in \mathcal{T}} \Big( p_s(x_1)\, p_e(x_N) \prod_{i=1}^{N-1} p_t(x_{i+1} \mid x_i) \Big)    (1)

where l(φ_a(x)) = p_fg(φ_a(x)) / p_bg(φ_a(x)) is the appearance likelihood ratio that a specific location x corresponds to a tracked object, and p_s, p_e and p_t represent the likelihoods of tracks starting, ending and transitioning between given sites.

The set of optimal tracks can be found by taking the log of (1) to yield an integer linear program (ILP) over flow variables f:

\min_f \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_i c^t_i f^t_i    (2)
\text{s.t. } f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}, \quad f^s_i, f^t_i, f_i, f_{ij} \in \{0, 1\}

where E is the set of valid transitions between sites in successive frames and the costs are given by

c_i = -\log l(\phi_a(x_i)), \quad c_{ij} = -\log p_t(x_j \mid x_i), \quad c^s_i = -\log p_s(x_i), \quad c^t_i = -\log p_e(x_i)    (3)

This ILP is a well-studied problem known as minimum-cost network flow [1]. The constraint matrix satisfies the total unimodularity property, so the problem can be solved exactly using any LP solver or via various efficient specialized solvers, including network simplex, successive shortest paths and push-relabel with bisectional search [27].

While these approaches yield globally optimal solutions, the authors of [20] consider even faster approximations based on multiple rounds of dynamic programming (DP). In particular, the successive shortest paths algorithm (SSP) finds optimal flows by applying Dijkstra's algorithm on a residual graph constructed from the original network in which some edges corresponding to instanced tracks have been reversed. This can be implemented by performing multiple forward and backward passes of dynamic programming (see the Appendix for details). They also found that one or two passes of DP often perform nearly as well as SSP in practical tracking scenarios. In our experiments we evaluate several of these variants.

The model above assumes tracks are independent of each other, which is not always true in practice. A key contribution of our work is showing that pairwise relations between tracks can be integrated into the model to improve tracking performance.
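Before introducing the pairwise terms, the following minimal sketch (our own illustration, not the implementation used in our experiments; the graph and costs are toy values) makes objective (2) concrete by assembling it as an LP with scipy. Because the constraint matrix is totally unimodular, the LP optimum is already integral.

    import numpy as np
    from scipy.optimize import linprog

    # Toy instance: 3 candidate detections, one candidate transition 0 -> 2.
    n = 3
    c_det = np.array([-3.0, -0.5, -2.5])   # c_i: detection costs (toy values)
    edges = [(0, 2)]                        # set E of transitions between frames
    c_trans = np.array([0.3])               # c_ij
    c_birth = np.full(n, 1.0)               # c_i^s
    c_death = np.full(n, 1.0)               # c_i^t

    # Variable layout: [f_i (n) | f_i^s (n) | f_i^t (n) | f_ij (|E|)]
    m = len(edges)
    cost = np.concatenate([c_det, c_birth, c_death, c_trans])

    # Flow conservation: f_i^s + sum_j f_ji - f_i = 0  and  f_i^t + sum_j f_ij - f_i = 0
    A_eq = np.zeros((2 * n, 3 * n + m))
    for i in range(n):
        A_eq[i, n + i] = 1.0           # f_i^s enters node i
        A_eq[i, i] = -1.0
        A_eq[n + i, 2 * n + i] = 1.0   # f_i^t leaves node i
        A_eq[n + i, i] = -1.0
    for k, (i, j) in enumerate(edges):
        A_eq[j, 3 * n + k] = 1.0       # f_ij enters node j
        A_eq[n + i, 3 * n + k] = 1.0   # f_ij leaves node i
    b_eq = np.zeros(2 * n)

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    f = np.round(res.x)
    print("objective:", res.fun, "active detections:", np.flatnonzero(f[:n]))

With these toy costs the track 0 -> 2 has negative total cost and is instanced, while the weak detection 1 is not.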
In order to allow interactions between multiple objects, we add pairwise cost terms, denoted q_ij and q_ji, for jointly activating a pair of flows f_i and f_j corresponding to detections at sites x_i = (l_i, σ_i, t_i) and x_j = (l_j, σ_j, t_j). Intuitive examples of q_ij and q_ji are a penalty for overlapping locations or a boost for co-occurring objects. We only consider pairwise interactions between pairs of sites in the same video frame, which we denote by EC = {ij : t_i = t_j}. Adding this term to (2) yields an integer quadratic program (IQP):

\min_f \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_{ij \in EC} q_{ij} f_i f_j + \sum_i c^t_i f^t_i    (4)
\text{s.t. } f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}, \quad f^s_i, f^t_i, f_i, f_{ij} \in \{0, 1\}

The addition of quadratic terms makes this objective hard to solve in general. In the next section we discuss two different approximations for finding high-quality solutions f. In Section 5 we describe how the costs c and q can be learned from data.
3. Inference
Now we describe different methods for carrying out tracking inference (finding the optimal flows f). These inference routines are used both for predicting a set of tracks at test time and for optimizing parameters during learning (see Section 5).

As mentioned in the previous section, for the traditional min-cost network flow problem defined in Equation (2) there exist various efficient solvers that exploit its total unimodularity to find the global optimum. We employ MOSEK's built-in network simplex solver in our experiments, as the alternative algorithms yield exactly the same solution.

In contrast, finding the global minimum of the IQP (4) is NP-hard [26] due to the quadratic terms. We evaluate two different schemes for finding high-quality approximate solutions. The first is a standard approach of introducing auxiliary variables and relaxing the integrality constraints to yield a linear program (LP) that lower-bounds the original objective. We also consider a greedy approximation based on successive rounds of dynamic programming that also yields good solutions while avoiding the expense of solving a large-scale LP.

If we relax the integer constraints and deform the costs as necessary to make the objective convex, then the global optimum of (4) can be found in polynomial time. For example, one could apply the Frank-Wolfe algorithm to optimize the relaxed, convexified QP while simultaneously keeping track of good integer solutions [13]. However, for real-world tracking over long videos, the relaxed QP is still quite expensive. Instead we follow the approach proposed by Chari et al. [6], reformulating the IQP as an equivalent ILP by replacing the quadratic terms f_i f_j with a set of auxiliary variables u_ij:

\min_{f,u} \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_{ij \in EC} q_{ij} u_{ij} + \sum_i c^t_i f^t_i    (5)
\text{s.t. } f^s_i, f^t_i, f_i, f_j, f_{ij}, u_{ij} \in \{0, 1\}
\quad f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}
\quad u_{ij} \le f_i, \quad u_{ij} \le f_j, \quad f_i + f_j \le u_{ij} + 1

The new constraints force u_ij to be 1 exactly when f_i and f_j are both 1. By relaxing the integer constraints, program (5) can be solved efficiently via large-scale LP solvers such as CPLEX or MOSEK.

At test time we would like to predict a discrete set of tracks. This requires rounding the solution of the relaxed LP to a solution that satisfies not only the integer constraints but also the flow constraints. [6] proposed two rounding heuristics. The first is a Euclidean rounding scheme that minimizes ||f − f̂||², where f̂ is the non-integral solution given by the LP relaxation. When f is constrained to be binary, this objective simplifies to the linear function (1 − 2f̂)ᵀ f + ||f̂||², which can be optimized using a standard linear min-cost flow solver. Alternately, one can use a linear under-estimator of (4), similar to the Frank-Wolfe algorithm:

\sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i \Big( c_i + \sum_{ij \in EC} q_{ij} \hat{u}_{ij} + \sum_{ji \in EC} q_{ji} \hat{u}_{ji} \Big) f_i + \sum_i c^t_i f^t_i    (6)

Both rounding heuristics are linear functions subject to the original integer and flow constraints and thus can be solved as ordinary min-cost network flow problems. In our experiments we run both rounding heuristics and choose the solution with the lower cost.
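A small sketch (our own illustration, using the same hypothetical variable layout as the earlier LP snippet) of the three linear inequalities in (5) that tie an auxiliary variable u_ij to the product f_i f_j; stacking such rows into A_ub and appending q_ij to the cost vector for each u_ij column yields the LP relaxation.

    import numpy as np

    def pairwise_rows(num_vars, idx_fi, idx_fj, idx_uij):
        """Inequality rows (A_ub, b_ub) linearizing one quadratic term q_ij * f_i * f_j."""
        A = np.zeros((3, num_vars))
        b = np.zeros(3)
        A[0, idx_uij], A[0, idx_fi] = 1.0, -1.0             # u_ij - f_i <= 0
        A[1, idx_uij], A[1, idx_fj] = 1.0, -1.0             # u_ij - f_j <= 0
        A[2, idx_fi], A[2, idx_fj], A[2, idx_uij] = 1.0, 1.0, -1.0
        b[2] = 1.0                                          # f_i + f_j - u_ij <= 1
        return A, b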
We now describe a simple greedy algorithm inspired by the combination of dynamic programming and non-maximal suppression proposed in [20]. We carry out a series of rounds of dynamic programming to find shortest paths between the source and sink nodes. In each round, once we have identified a track, we update the (unary) costs associated with all detections to include the effect of the pairwise quadratic interaction terms of the newly activated track (e.g., suppressing overlapping detections, or boosting the scores of commonly co-occurring objects). This is analogous to greedy algorithms for maximum-weight independent set where the elements are paths through the network.
Algorithm 1: DP with pairwise cost update

    Input: a directed acyclic graph G with edge weights c_i, c_ij
    Initialize T ← ∅
    repeat
        Find the shortest start-to-end path p on G
        track_cost = cost(p)
        if track_cost < 0 then
            for all locations x_i in p do
                c_j ← c_j + q_ij + q_ji for all ij, ji ∈ EC
                c_i ← +∞
            end for
            T ← T ∪ {p}
        end if
    until track_cost ≥ 0
    Output: track collection T

In the absence of quadratic terms, this algorithm corresponds to the 1-pass DP approximation of the successive shortest paths (SSP) algorithm. Hence it does not guarantee an optimal solution but, as we show in the experiments, it performs well in practice. A practical implementation difference (from the linear objective) is that updating the costs with the quadratic terms when a track is instanced has the unfortunate effect of invalidating cost-to-go estimates, which could otherwise be cached and re-used between successive rounds to accelerate the DP computation.

Interestingly, the greedy approach to updating the pairwise terms can also be used with a 2-pass DP approximation to SSP in which backward passes subtract quadratic penalties. We describe the details of our implementation of the 2-pass algorithm in the Appendix. We found the 1-pass approach superior, as the complexity and runtime grow substantially for multi-pass DP with pairwise updates.
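A compact sketch (a minimal illustration, not the implementation used in our experiments) of the 1-pass greedy procedure above. The inputs preds (predecessor lists with transition costs) and q (pairwise costs for same-frame pairs in EC) are hypothetical data structures; detections are assumed sorted by frame.

    import math

    def shortest_track(c, c_birth, c_death, preds):
        """One forward sweep over the DAG of detections. preds[i] lists (j, c_ji)
        pairs for transitions j -> i. Returns (cost, path)."""
        n = len(c)
        best, back = [0.0] * n, [None] * n
        for i in range(n):
            cost_in = c_birth[i]                     # start a new track at i ...
            for j, c_ji in preds[i]:                 # ... or extend the best predecessor
                if best[j] + c_ji < cost_in:
                    cost_in, back[i] = best[j] + c_ji, j
            best[i] = c[i] + cost_in
        i = min(range(n), key=lambda k: best[k] + c_death[k])
        total, path = best[i] + c_death[i], []
        while i is not None:
            path.append(i)
            i = back[i]
        return total, path[::-1]

    def greedy_tracks(c, c_birth, c_death, preds, q):
        """Algorithm 1: q[(i, j)] holds pairwise costs q_ij for same-frame pairs."""
        c, tracks = list(c), []
        while True:
            cost, path = shortest_track(c, c_birth, c_death, preds)
            if cost >= 0:                            # stop once no negative-cost track exists
                break
            for i in path:                           # fold pairwise terms into unary costs
                for (a, b), q_ab in q.items():
                    if a == i:
                        c[b] += q_ab                 # q_ij applied to partner j
                    elif b == i:
                        c[a] += q_ab                 # q_ji applied to partner j
                c[i] = math.inf                      # each detection joins at most one track
            tracks.append(path)
        return tracks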
4. Tracking Features and Potentials
In order to learn the tracking potentials (c and q) we parameterize the flow cost objective by a vector of weights w and a set of features Ψ(X, f) that depend on features extracted from the video, the spatio-temporal relations between candidate detections, and which tracks are instanced. With this linear parameterization we write the cost of a given flow as C(f) = −wᵀΨ(X, f), where the negative sign is a useful convention to convert the minimization problem into a maximization. The components of the weight and feature vectors are given by:

w = \big[ w_S ;\; w_t ;\; w_s ;\; w_a ;\; w_E \big], \quad
\Psi(X, f) = \Big[ \sum_i \phi_S(x_i) f^s_i ;\; \sum_{ij \in E} \psi_t(x_i, x_j) f_{ij} ;\; \sum_{ij \in EC} \psi_s(x_i, x_j) f_i f_j ;\; \sum_i \phi_a(x_i) f_i ;\; \sum_i \phi_E(x_i) f^t_i \Big]    (7)

Here w_a represents a local appearance template for the tracked objects of interest, w_t represents weights for transition features, w_s represents weights for pairwise interactions, and w_S and w_E represent weights associated with track births and deaths. φ_a(x_i) is the image feature at spatio-temporal location x_i, ψ_t(x_i, x_j) is the feature of a transition from location x_i to location x_j, ψ_s(x_i, x_j) is the feature of the pairwise interaction between locations x_i and x_j in the same frame, φ_S(x_i) is the feature of the edge from the birth node to location x_i, and φ_E(x_i) is the feature of the edge from location x_i to the sink node.

Local appearance model:
We make use of an off-the-shelf detector to capture local appearance. Our local appearance feature thus consists of the detector score along with a constant 1 to allow for a variable bias.
Transition model:
We use a simple motion model (described in Section 6) to predict candidate windows' locations in future frames; we connect a candidate x_i at time t_i with another candidate x_j at a later time t_i + n only if the overlap ratio between x_i's predicted window at t_i + n and x_j's window at t_i + n exceeds a threshold. The overlap ratio is defined as the two windows' intersection over their union. We use this overlap ratio as a feature associated with each transition link, together with an indicator feature that is 1 if the ratio is lower than 0.5 and 0 otherwise. In our experiments we allow occlusion gaps of up to a fixed number of frames for all the network-flow methods. We append a constant 1 to this feature and bin these features according to the length of the transition, which yields a fixed-length feature vector for each transition link.
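A sketch of how such a transition feature could be computed; the pruning threshold and the 3-features-per-bin layout are illustrative assumptions, not the exact values used in our experiments.

    import numpy as np

    def iou(a, b):
        """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    def transition_feature(pred_box_i, box_j, gap, max_gap, min_overlap=0.3):
        """Feature for a candidate link i -> j spanning `gap` frames.
        pred_box_i is x_i's box propagated to x_j's frame by the motion model;
        min_overlap is a free pruning threshold (illustrative value)."""
        r = iou(pred_box_i, box_j)
        if r < min_overlap:
            return None                               # link is not created at all
        base = np.array([r, float(r < 0.5), 1.0])     # overlap, weak-link flag, bias
        feat = np.zeros(3 * max_gap)
        feat[3 * (gap - 1): 3 * gap] = base           # binned by transition length
        return feat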
Birth/death model: In applications with static cameras it can be useful to learn a spatially varying bias to model where tracks are likely to appear or disappear. However, the videos in our experiments are all captured from a moving vehicle, so we use a single constant value of 1 for the birth and death features.
Pairwise interactions: w_s is a weight vector that encodes valid geometric configurations of two objects. ψ_s(x_i, x_j) is a discretized spatial-context feature that bins the relative location of the detection window at location x_i and the window at location x_j into one of D relations including on top of, above, below, next-to, near, far and overlap (similar to the spatial context of [7]). To mimic the temporal NMS described in [20] we add one additional relation, strictly overlap, defined as the intersection of the two boxes over the area of the first box; we set the corresponding feature to 1 if this ratio is greater than 0.9 and 0 otherwise. Now, assuming there are K classes of objects in the video, w_s is a DK²-dimensional vector, i.e., w_s = [w_{s11}ᵀ, w_{s12}ᵀ, ..., w_{sij}ᵀ, ..., w_{sKK}ᵀ]ᵀ, in which w_{sij} is a length-D column vector that encodes valid geometric configurations of objects of class i with respect to objects of class j. In this way we can capture intra- and inter-class contextual relationships between tracks.
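A sketch of the spatial-context binning and of how a relation feature is placed into the class-pair block of w_s. The relation names and the two binning rules shown are illustrative stand-ins; only the strictly-overlap rule follows the 0.9 threshold stated above.

    import numpy as np

    RELATIONS = ["on-top-of", "above", "below", "next-to", "near", "far",
                 "overlap", "strictly-overlap"]          # D relations (illustrative list)

    def spatial_context(box_i, box_j):
        """Spatial relation feature psi_s(x_i, x_j) for two same-frame boxes."""
        D = len(RELATIONS)
        psi = np.zeros(D)
        area_i = (box_i[2] - box_i[0]) * (box_i[3] - box_i[1])
        ix1, iy1 = max(box_i[0], box_j[0]), max(box_i[1], box_j[1])
        ix2, iy2 = min(box_i[2], box_j[2]), min(box_i[3], box_j[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        if inter / area_i > 0.9:
            psi[RELATIONS.index("strictly-overlap")] = 1.0   # mimics temporal NMS of [20]
        elif inter > 0:
            psi[RELATIONS.index("overlap")] = 1.0
        # ... remaining relations would be set from relative offsets and distances ...
        return psi

    def pair_feature_block(psi, class_i, class_j, K):
        """Place psi into the (class_i, class_j) block of the DK^2 pairwise feature."""
        D = len(psi)
        full = np.zeros(D * K * K)
        block = class_i * K + class_j
        full[block * D:(block + 1) * D] = psi
        return full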
5. Learning
We formulate parameter learning of tracking models as a structured prediction problem. With some abuse of notation, assume we have N training videos (X_n, f_n) ∈ 𝒳 × ℱ, n = 1, ..., N. Given ground-truth tracks in the training videos specified by flow variables f_n, we discriminatively learn tracking model parameters w using a structured SVM with margin rescaling:

w^* = \arg\min_{w, \xi_n \ge 0} \; \tfrac{1}{2}\|w\|^2 + C \sum_n \xi_n    (8)
\text{s.t. } \forall n, \hat{f}: \; \langle w, \Delta\Psi(X_n, f_n, \hat{f}) \rangle \ge L(f_n, \hat{f}) - \xi_n
\text{where } \Delta\Psi(X_n, f_n, \hat{f}) = \Psi(X_n, f_n) - \Psi(X_n, \hat{f})

Here Ψ(X_n, f_n) are the features extracted from the n-th training video. L(f_n, f̂) is a loss function that penalizes any difference between the inferred labeling f̂ and the ground-truth labeling f_n. The constraint on the slack variables ξ_n ensures that we pay a cost for any training video in which the flow cost of the ground-truth tracks under model w is higher than that of some other, incorrect labeling.

We optimize the structured SVM objective in (8) using a standard cutting-plane method [12] in which the exponential number of constraints (one for each possible flow f̂) is approximated by a much smaller number of terms. Given a current estimate of w we find a "most violated constraint" for each training video:

\hat{f}^*_n = \arg\max_{\hat{f}} \; L(f_n, \hat{f}) - \langle w, \Delta\Psi(X_n, f_n, \hat{f}) \rangle

We then add these constraints to the optimization problem and solve for an updated w. This procedure is iterated until no additional constraints are added to the problem. In our implementation, at each iteration we add a single linear constraint which is the sum of the violated constraints derived from the individual videos in the dataset; this is also a valid cutting-plane constraint [7].

The key subroutine is finding the most violated constraint for a given video, which requires solving the loss-augmented inference problem (we drop the subscript n from here on):

\hat{f}^* = \arg\min_{\hat{f}} \; -\langle w, \Psi(X, \hat{f}) \rangle - L(f, \hat{f})    (9)

As long as the loss function L(f, f̂) decomposes as a sum over flow variables, this problem has the same form as our test-time tracking inference problem, the only difference being that the cost of each variable is augmented by its corresponding negative loss.

We note that our two inference algorithms behave somewhat differently when producing constraints. The greedy algorithm has no guarantee of finding the optimal flow for a given tracking problem and hence may not generate all the constraints necessary for learning w. In contrast, for the LP relaxation, we have the option of adding constraints corresponding to fractional solutions (rather than rounding them to discrete tracks). If we use a loss function that penalizes incorrect non-integral solutions, this may push the structured SVM to learn parameters that tend to result in tight relaxations. These scenarios are termed "undergenerating" and "overgenerating" respectively by [9], since approximate inference is performed over a subset or superset of the exact space of flows.
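A sketch of the learning loop, emphasizing the role of loss-augmented inference as the key subroutine. For brevity a plain subgradient step stands in for the cutting-plane QP of [12]; the function names and arguments are hypothetical.

    import numpy as np

    def train_struct_svm(videos, psi, loss_aug_infer, loss, C=1.0, epochs=20, lr=1e-3):
        """videos: list of (X_n, f_n) pairs; psi(X, f) returns the joint feature vector;
        loss_aug_infer(w, X, f_gt) solves the loss-augmented inference of Eq. (9);
        loss(f_gt, f_hat) is the decomposable tracking loss."""
        w = np.zeros_like(psi(*videos[0]))
        for _ in range(epochs):
            for X, f_gt in videos:
                f_hat = loss_aug_infer(w, X, f_gt)        # most violated constraint
                delta = psi(X, f_gt) - psi(X, f_hat)
                if w @ delta < loss(f_gt, f_hat):         # margin constraint violated
                    w -= lr * (w - C * delta)             # subgradient of regularized hinge
                else:
                    w -= lr * w                           # only the regularizer is active
        return w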
We now describe loss functions for the multi-target tracking problem. We use a weighted Hamming loss to measure the distance between ground-truth labels f and inferred labels f̂:

L(\hat{f}, f) = \sum_{f_i \in f} loss_i \, | f_i - \hat{f}_i |    (10)

where {loss_1, ..., loss_i, ..., loss_{|f|}} is a vector indicating the penalty for differences between the estimated flow f̂ and the ground truth f. For example, when loss_i = 1 for all i this reduces to the Hamming loss.

Transition Loss:
A critical aspect for successful learning is to define a good loss vector that closely resembles major tracking performance criteria, such as Multiple Object Tracking Accuracy (MOTA [3]). Metrics such as false positives, false negatives, true positives, true negatives and true/false births/deaths can easily be incorporated by setting their corresponding entries in loss to 1. By definition, ID switches and fragmentations [16] are determined by looking at the labels of two consecutive transition links simultaneously; under such a definition the loss cannot be optimized by our inference routine, which only considers pairwise relations between detections within a frame. Instead, we propose a decomposable loss for transition links that attempts to capture important aspects of MOTA by taking into account the length and localization of transition links rather than just using a constant (Hamming) loss on mislabeled links. We found empirically that careful specification of the loss function is crucial for learning a good tracking model.

In order to describe our transition loss, let us first denote five types of transition links: NN is a link from a false detection to another false detection; PN is a link from a true detection to a false detection; NP is a link from a false detection to a true detection; PP+ is a link from a true detection to another true detection with the same identity; and PP− is a link from a true detection to another true detection with a different identity. For every transition link we interpolate detections between its start and end detections (if their frame numbers differ by more than 1); each interpolated virtual detection is considered either a true virtual detection or a false virtual detection, depending on whether or not it overlaps a ground-truth label. The loss for the different types of transition links is defined as follows (a small sketch of this computation appears after the list):

1. For NN links, the loss is (number of true virtual detections + number of false virtual detections).
2. For PN and NP links, the loss is (number of true virtual detections + number of false virtual detections + 1).
3. For PP+ links, the loss is (number of true virtual detections).
4. For PP− links, the loss is (number of true virtual detections + number of false virtual detections + 2).
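The per-link loss above can be computed as in the following small helper (an illustration of the four rules, not our exact code):

    def transition_link_loss(start_is_true, end_is_true, same_identity,
                             n_true_virtual, n_false_virtual):
        # Loss contribution of a single transition link under the rules above.
        base = n_true_virtual + n_false_virtual
        if not start_is_true and not end_is_true:      # NN link
            return base
        if start_is_true != end_is_true:               # PN or NP link
            return base + 1
        if same_identity:                              # PP+ link
            return n_true_virtual
        return base + 2                                # PP- link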
Figure 2. Example benefit of the soft transition penalty. The left column shows an ID switch error (IDSW) of the baseline caused by removing aggressive transition links based on an empirical hard overlap threshold. In the right column, our model prevents this error by learning a soft penalty function that allows some aggressive transitions to occur.

Figure 3. Example of track co-occurrence. The right column is the model learned with pairwise terms (LP+Flow+Struct), while the left column is learned without pairwise terms (SSP+Flow+Struct). The co-occurrence term forces both tracks 2 and 3 to initialize even when the detector responses are weak.

Ground-truth flows: In practice, available training datasets specify ground-truth bounding boxes that need to be mapped onto ground-truth flow variables f_n for each video. To do this mapping, we first consider each frame separately, taking the highest-scoring detection window that overlaps a ground-truth label as a true detection; each true detection is assigned the track identity label of the ground-truth label it overlaps. Next, for each track identity, we run a simplified version of the dynamic programming algorithm to find the path that claims the largest number of true detections. After we have iterated through all identity labels, any instanced graph edge is labeled a true detection/transition/birth/death while the remainder are labeled false.
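The per-frame step of this mapping could look as follows (an illustrative sketch; the overlap threshold is an assumption and the iou helper from the earlier transition-feature sketch is reused):

    def label_frame(detections, gt_boxes, iou_thresh=0.5):
        """detections: list of (score, box); gt_boxes: list of (track_id, box).
        Returns {detection index: track_id} for true detections in this frame."""
        labels = {}
        for tid, gbox in gt_boxes:
            best_score, best_idx = float("-inf"), None
            for idx, (score, box) in enumerate(detections):
                if idx in labels:
                    continue                 # each candidate claims at most one identity
                if iou(box, gbox) >= iou_thresh and score > best_score:
                    best_score, best_idx = score, idx
            if best_idx is not None:
                labels[best_idx] = tid       # highest-scoring overlapping window wins
        return labels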
6. Experimental results
Dataset:
We have focused our experiments on the training sequences of the KITTI tracking benchmark [11], which consists of 21 training sequences with a total of 8008 frames and 8 classes of labeled objects. Of all the labeled objects, we evaluated the three categories that had a sufficient number of instances for comparative evaluation: cars, pedestrians and cyclists.
We use publicly available LSVM [8] reference detections and the official evaluation script. The evaluation script only evaluates objects that are not too far away and not truncated by more than 15 percent; it also does not count vans as false positives for cars or sitting persons as false positives for pedestrians. The final dataset contains 636 labeled car trajectories, 201 labeled pedestrian trajectories and 37 labeled cyclist trajectories.

Figure 4. Visualization of the weight vector learned by our method: (a) inter-frame weights; (b) intra-frame weights. Yellow indicates small cost, blue indicates large cost. (a) shows transition weights for different lengths of frame jumps. The model encourages transitions to nearby neighboring frames, and penalizes long or weak transition links (i.e., overlap ratio lower than 0.5). (b) shows the learned pairwise contextual weights between objects. The model encourages intra-class co-occurrence when objects are close but penalizes overlap and objects on top of other objects. Note the strong negative interaction learned between cyclist and pedestrian (two classes which are easily confused by their respective detectors). By exploiting such contextual cues we can make correct predictions in otherwise confusing configurations.

Training with ambiguous labels:
One difficulty of training on the KITTI tracking benchmark is that it has special evaluation rules for ground-truth labels such as small/truncated objects, vans for cars, and sitting persons for pedestrians. We resolve this by removing all detection candidates that correspond to any of these "ambiguous" ground-truth labels during training; in this way we avoid mining hard negatives from those labels. Also, to speed up training, we partition the full-sized training sequences into 10-frame-long subsequences with a 5-frame overlap, and define losses on each subsequence separately.
Data-dependent transition model:
In order to keep the size of the tracking graphs tractable for our inference methods, we need a heuristic to select a sparse set of links between detection candidates across frames. We found that simply predicting candidates' locations in future frames via optical flow gives very good performance. Specifically, we first compute frame-wise optical flow using the software of [17]; then, for a candidate detection x_i at frame t_i, we compute the mean of the vertical flow and the mean of the horizontal flow within the candidate box, and use them to predict the candidate's location in the next frame t_i + 1. For x_i's predicted location in frame t_i + 2 we use its newly predicted location at t_i + 1 and the candidate's original box size to repeat the process, and similarly for t_i + n.
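A minimal sketch of this prediction step (an illustration under the assumption that flow_u and flow_v are per-pixel horizontal and vertical flow fields covering the box):

    import numpy as np

    def predict_box(box, flow_u, flow_v):
        """Shift a box (x1, y1, x2, y2) by the mean optical flow inside it."""
        x1, y1, x2, y2 = (int(round(v)) for v in box)
        du = float(np.mean(flow_u[y1:y2, x1:x2]))
        dv = float(np.mean(flow_v[y1:y2, x1:x2]))
        return (box[0] + du, box[1] + dv, box[2] + du, box[3] + dv)

    def predict_n_frames(box, flows):
        """Chain single-frame predictions; flows is a list of (u, v) fields for
        frames t, t+1, ...; the box size stays fixed, only its location moves."""
        for flow_u, flow_v in flows:
            box = predict_box(box, flow_u, flow_v)
        return box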
Figure 5. Speed and quality comparison of the proposed undergenerating and overgenerating approximations (running time on a log scale versus number of frames). Over the 21 training sequences of the KITTI dataset, LP+rounding produces a cost (−129962.09) that is very close to the relaxed global optimum (−129967.91). DP (cost −129336.88) gives a solution within 1% of the relaxed global optimum while being 2 to 7 times faster than a commercial LP solver (MOSEK).
Trajectory smoothing: During evaluation we observe that many of the track fragmentation errors (FRAG) reported by the benchmark are due to the raw trajectory oscillating away from the ground truth because of poorly localized detection candidates. Inspired by the trajectory model of [2], we post-process each raw output trajectory by fitting a cubic B-spline. This smoothing of the trajectory eliminates many FRAGs from the raw track, making the fragmentation number more meaningful when compared across different models.
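This post-processing could be implemented, for example, with scipy splines (a sketch; the smoothing factor below is an illustrative value, not the one used in our experiments):

    import numpy as np
    from scipy.interpolate import splprep, splev

    def smooth_trajectory(frames, cx, cy, smoothing=5.0):
        """Fit a cubic B-spline to raw box centers (cx, cy) over time and resample
        it at the original frame indices."""
        tck, _ = splprep([np.asarray(cx), np.asarray(cy)], u=np.asarray(frames),
                         k=3, s=smoothing)
        sx, sy = splev(np.asarray(frames), tck)
        return sx, sy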
Baselines:
We use the publicly available code from [10] as a first baseline. It relies on a three-stage tracklet linking scheme with occlusion-sensitive appearance learning; it is by far the best tracker for cars on the KITTI tracking benchmark among all published methods. We also consider dynamic programming (DP) and successive shortest paths (SSP) with the default parameters of [20] as two further baselines, denoted DP+Flow and SSP+Flow in our tables.
Parameter settings:
We tuned the structural parameters of the various baselines to give good performance. For all baselines we only use detections that have a positive score. For DP+Flow and SSP+Flow we also remove all transition links that have overlap ratios lower than 0.5. For the learned tracking models (+Struct) we use detections that have scores greater than -0.5, and transition links that have overlap ratios greater than 0.3.

Benchmark Results:
We evaluate performance using a standard battery of performance measures. The evaluation results for each object category, as well as over all categories, are shown in Table 1. For our learned tracking models (+Struct) we use either the network simplex solver (for SSP+Flow+Struct) or the LP relaxation (for LP/DP+Flow+Struct) during training, and conduct leave-one-sequence-out cross-validation over a geometric range of values of C. We report cross-validation results under the best C for each model. Our simple motion model helps DP+Flow outperform the state-of-the-art baseline by a significant margin. One exception is IDSW, which we attribute to the fact that the network-flow methods do not explicitly model target appearance. While SSP+Flow performs poorly with default parameters, it turns out that with properly learned parameters (SSP+Flow+Struct) it produces results that are comparable to (and often better than) DP+Flow; this indicates that there is much more potential in SSP than suggested in previous work. In addition, SSP's guarantee of optimality makes it very attractive if more complicated features and network structures are to be used in learning. As shown in Table 1, in our evaluation over all objects the models learned with pairwise costs (LP/DP+Flow+Struct) achieve the best MOTA, Recall, Mostly Tracked (MT) and Mostly Lost (ML) performance while keeping the other metrics competitive.

Approximate Inference:
To evaluate the quality of the LP+rounding and DP approximations, we run LP+rounding and DP inference on the models trained via the LP relaxation and DP, respectively, and average the running time and the minimum cost found on each sequence. Fig. 5 shows the cumulative running time and cost for each algorithm. In our experiments, LP+rounding often finds the exact relaxed global optimum, and when it does not it still gives a very close approximation. While greedy forward search using DP rarely reaches the relaxed global optimum, it still produces good approximate solutions that are often within 1% of the relaxed global optimum while running significantly faster (2-7x) than LP+rounding.
Overgenerating versus Undergenerating:
Previous work has shown that, in general, models trained with relaxed inference are preferable to models trained with greedy inference. To investigate this in our particular problem, we also conduct leave-one-sequence-out cross-validation using either DP or the LP relaxation as the inference method during training. The evaluation results under different training/testing inference combinations are shown in Table 2. Notice that the model trained with the LP relaxation does slightly better on most metrics, whereas DP stands out as a good inference algorithm at test time.
Car
Method             MOTA  MOTP  Rec   Prec  MT    ML    IDSW  FRAG
Baseline [10]      57.8  78.8  58.6  98.8  14.9  28.4  22    225
SSP+Flow           49.0  79.1  49.1  99.7  18.4  59.9  0     47
DP+Flow            62.2  79.0  63.4  98.5  25.2  24.2  43    177
SSP+Flow+Struct    63.4  78.3  65.4  97.1  27.4  20.0  2     179
LP+Flow+Struct     64.1  78.1  67.1  95.7  30.5  18.7  3     208
DP+Flow+Struct     64.6  78.0  67.5  96.0  30.1  18.6  17    222

Pedestrian
Method             MOTA  MOTP  Rec   Prec  MT    ML    IDSW  FRAG
Baseline           40.2  73.2  49.0  86.6  4.2   32.2  132   461
SSP+Flow           37.9  73.4  41.8  92.0  8.4   57.5  25    146
DP+Flow            49.7  73.1  57.2  88.9  18.6  26.3  46    260
SSP+Flow+Struct    51.2  73.2  57.4  90.5  19.2  24.6  16    230
LP+Flow+Struct     52.6  72.9  60.2  89.2  22.2  21.6  31    281
DP+Flow+Struct     52.4  73.0  60.0  89.2  19.8  22.2  36    277

Cyclist
Method             MOTA  MOTP  Rec   Prec  MT    ML    IDSW  FRAG
Baseline           39.0  81.6  39.6  99.5  5.4   37.8  7     26
SSP+Flow           18.7  85.6  18.7  100   5.4   89.2  0     1
DP+Flow            42.4  81.2  42.5  100   18.9  45.9  2     5
SSP+Flow+Struct    47.4  79.7  59.9  83.0  35.1  32.4  5     10
LP+Flow+Struct     52.3  79.6  61.1  88.2  40.6  27.0  12    21
DP+Flow+Struct     56.3  79.4  64.2  89.7  40.5  27.0  9     15

All Categories
Method             MOTA  MOTP  Rec   Prec  MT    ML    IDSW  FRAG
Baseline           51.7  77.4  54.8  95.3  12.1  29.7  161   712
SSP+Flow           44.2
DP+Flow            57.6  77.4  60.5  95.7  23.5  25.7  91    442
SSP+Flow+Struct    59.0  77.0  62.8  94.5  25.9  21.5  62    514

Table 1. Tracking results for the car, pedestrian and cyclist categories of the KITTI tracking benchmark, and aggregate performance over all categories. The proposed method using quadratic interactions between objects and parameters trained using structured prediction achieves state-of-the-art MOTA and is competitive across multiple performance measures.
                            Train: DP   Train: LP
Test: DP        MOTA          60.5        60.6
                Recall        65.2        65.1
                Precision     93.5        93.8
                MT            28.6        28.4
                ML            20.5        19.7
                IDSW          68          62
                FRAG          517         514
Test: LP+round  MOTA          60.1        60.2
                Recall        64.9        64.8
                Precision     93.3        93.5
                MT            29.3        29.2
                ML            20.3        19.7
                IDSW          56          46
                FRAG          518         510

Table 2. Performance evaluation over 21 sequences using cross-validation for different combinations of the inference algorithm used during training and at test time.
Moreover, though it falls slightly behind, the model trained with greedy DP is very close in performance to the model trained with LP, which suggests that the greedy algorithm proposed here is a very competitive inference method.

7. Summary
We augmented the well-studied network-flow tracking model with pairwise costs and proposed an end-to-end framework that jointly optimizes the parameters of this model. We extensively evaluated a traditional LP relaxation-based method and a novel greedy dynamic programming method for inference in the augmented network, both of which achieve state-of-the-art performance, with our greedy DP algorithm being 2-7x faster than a commercial LP solver.
8. Acknowledgements
This work was supported by NSF DBI-1053036, IIS-1253538 and a Google Research Award.
References

[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice-Hall, Upper Saddle River, NJ, USA, 1993.
[2] A. Andriyenko, K. Schindler, and S. Roth. Discrete-continuous optimization for multi-target tracking. In CVPR, 2012.
[3] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. J. Image Video Process., 2008:1:1-1:10, Jan. 2008.
[4] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
[5] A. A. Butt and R. T. Collins. Multi-target tracking by Lagrangian relaxation to min-cost network flow. In CVPR, June 2013.
[6] V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic. On pairwise cost for multi-object network flow tracking. CoRR, abs/1408.3304, 2014.
[7] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In ICCV, 2009.
[8] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
[9] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML, pages 304-311, 2008.
[10] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3D traffic scene understanding from movable platforms. PAMI, 2014.
[11] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[12] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.
[13] A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.
[14] S. Kim, S. Kwak, J. Feyereisl, and B. Han. Online multi-target tracking by large margin structured learning. In ACCV, pages 98-111, 2013.
[15] S. Lacoste-Julien, B. Taskar, D. Klein, and M. I. Jordan. Word alignment via quadratic assignment. In HLT-NAACL, pages 112-119, 2006.
[16] Y. Li, C. Huang, and R. Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In CVPR, 2009.
[17] C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD thesis, Massachusetts Institute of Technology, 2009.
[18] X. Lou and F. A. Hamprecht. Structured learning for cell tracking. In NIPS, 2011.
[19] A. Milan, K. Schindler, and S. Roth. Detection- and trajectory-level exclusion in multiple object tracking. In CVPR, 2013.
[20] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
[21] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, October 2008.
[22] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. MIT Press, 2003.
[23] B. Wang, G. Wang, K. Luk Chan, and L. Wang. Tracklet association with online target-specific metric learning. In CVPR, June 2014.
[24] Z. Wu, A. Thangali, S. Sclaroff, and M. Betke. Coupling detection and data association for multiple object tracking. In CVPR, pages 1-8, June 2012.
[25] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In CVPR, 2012.
[26] A. N. H. Zaied and L. A. E. fatah Shawky. A survey of quadratic assignment problems. International Journal of Computer Applications, 101(6):28-36, September 2014.
[27] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, pages 1-8, 2008.

9. Appendix: Multi-Pass Dynamic Programming to Approximate Successive Shortest Path

We now describe the two dynamic programming (DP) algorithms proposed by [20] which approximate the successive shortest path (SSP) algorithm. Recall the network-flow problem described in Equation (2):

\min_f \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_i c^t_i f^t_i
\text{s.t. } f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}, \quad f^s_i, f^t_i, f_i, f_{ij} \in \{0, 1\}

The corresponding graphical model is shown in Fig. 6.

Figure 6. Graphical representation of the network flow model from [27]. A pair of nodes (connected by a red edge) represents a detection, blue edges represent possible transitions between detections, and birth/death flows are modeled by black edges. The costs c_i in Objective (2) are for red edges, c_ij for blue edges, and c^t_i and c^s_i for black edges. To simplify our description, we will refer to a detection edge and the two nodes associated with it as a "node" or "detection node". The set V consists of all detection nodes in the graph, whereas the set E consists of all transition edges in the graph.

SSP finds the global optimum of Objective (2) by repeating the following two steps until no negative-cost path can be found:

1. Find the minimum-cost source-to-sink path on the residual graph G_r(f).
2. If the cost of the path is negative, push a unit of flow along the path to update f.

A residual graph G_r(f) is the same as the original graph G except that all edges used by f are reversed and their costs negated. We focus on describing the DP algorithms and refer readers to [1] for a detailed description of the SSP algorithm.

Assume the detection nodes are sorted in time. We denote by cost(i) the cost of the shortest path from the source node to node i, by link(i) node i's predecessor on this shortest path, and by birth_node(i) the first detection node on this shortest path. We initialize cost(i) = c_i + c^s_i, link(i) = ∅, and birth_node(i) = i for all i ∈ V. To find the shortest path on the initial DAG G, we sweep from the first frame to the last frame, computing cost(i) as

cost(i) = c_i + \min(\pi, c^s_i), \quad \pi = \min_{ji \in E} \big( c_{ji} + cost(j) \big)    (11)

and updating birth_node(i) and link(i) accordingly. After sweeping through all frames, we find the node i for which cost(i) + c^t_i is minimal and reconstruct the shortest path by backtracking the cached link variables; the cost of this path is cost(i) + c^t_i. After the shortest path is found, we remove all of its nodes and edges from G; the resulting graph G' is still a DAG, so we can repeat this procedure until no path with negative cost remains. A further speed-up can be achieved by recomputing cost(i), birth_node(i) and link(i) only for those i whose birth node is the same as the birth node of the track found in the previous iteration.

It is also straightforward to integrate NMS into this algorithm: when we pick a shortest path, we also prune all nodes that overlap the shortest path. In practice this "temporal NMS" can be much more aggressive than pre-processing NMS, since the confidence of a track being composed of true positives is much higher than that of single detections.

A feasible solution f on the network corresponds to a residual graph G_r(f). We denote by V_forward the set of forward nodes in the current residual graph and by V_backward the set of backward nodes in the current residual graph, and describe one iteration of 2-pass DP as follows:

1. Ignore all backward edges (including reversed detection edges) and perform one pass of forward DP (from the first frame to the last frame) on all nodes. For each node i there is a path(i) array that stores the minimum-cost source-to-i path, with cost(i) being the total cost of this path.
2. Use cost(i) from step 1 as initial values and perform one pass of backward DP (from the last frame to the first frame) on V_backward. After this, cost(i) for i ∈ V_backward equals cost(j) − c_ij, where j is i's best (backward) predecessor and c_ij is taken from the original graph. Set cost(i) = +∞ for any backward node i that has no incoming backward edge.
3. Perform one pass of forward DP on i ∈ V_forward. To avoid producing a cyclic path, we need to backtrack the shortest paths for all j ∈ N(i), where N(i) is the set of neighboring nodes connected to i via a forward edge.
4. Find the node i with minimum cost(i) + c^t_i; the (approximate) shortest path is then path(i).
5. Update the solution f by setting all forward variables along path(i) to 1 and all backward variables along path(i) to 0.

It is straightforward to show that during the first iteration, 1-pass DP and 2-pass DP behave identically. Also, the path found by 2-pass DP will never go into a source node or out of a sink node, so in each iteration we generate exactly one more track, either by splitting a previously found track or by choosing an entirely new track. Therefore the algorithm terminates after at most |V| iterations.
10. Appendix: Incorporating Quadratic Interactions in Multi-Pass DP
Recall the augmented network-flow problem with quadratic costs (Eqn. 4):

\min_f \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_{ij \in EC} q_{ij} f_i f_j + \sum_i c^t_i f^t_i
\text{s.t. } f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}, \quad f^s_i, f^t_i, f_i, f_{ij} \in \{0, 1\}

where EC = {ij : t_i = t_j}. We propose two new variants of the DP algorithm that approximately minimize Objective (4). They are likewise divided into 1-pass DP and 2-pass DP. Since we already described 1-pass DP with pairwise interactions in the paper, we focus on 2-pass DP with pairwise interactions here.

A feasible solution f on the network corresponds to a residual graph G_r(f). We can apply the 2-pass steps described in the previous appendix section to find an approximate shortest path. This path may consist of both forward nodes and backward nodes, which correspond respectively to uninstanced detections (which will be instanced after this iteration) and already instanced detections (which will be uninstanced after this iteration). We then update the (unary) costs of the other nodes by adding or subtracting the pairwise costs imposed by turning the selected nodes on the path on or off. Additionally, at step 3 of 2-pass DP, one can also account for the pairwise cost to the current node imposed by previously selected nodes in the same path. The entire procedure is described in Algorithm 2.
Algorithm 2: Two-pass DP with pairwise cost update

    Input: a directed acyclic graph G with node and edge weights c_i, c_ij
    Initialize f ← 0
    repeat
        Find a start-to-end min-cost unit flow f* on G_r(f)
        track_cost = cost(f*)
        if track_cost < 0 then
            for all f_i ∈ f* do
                if f_i = 0 then
                    c_j ← c_j + q_ij + q_ji, ∀ ij, ji ∈ EC
                else
                    c_j ← c_j − q_ij − q_ji, ∀ ij, ji ∈ EC
                end if
            end for
            Update f: set all forward variables used by f* to 1 and all backward variables used by f* to 0
        end if
    until track_cost ≥ 0
    Output: solution f

Notice that, to simplify our notation, we construct a temporary residual graph at the beginning of each iteration and do not negate edge weights in the original graph. In practice, we can instead update edge costs and directions on the original graph at the end of each iteration; in that case we should add pairwise costs to forward nodes or subtract pairwise costs from backward nodes when we turn a node on, and, similarly, subtract pairwise costs from forward nodes or add pairwise costs to backward nodes when we turn a node off.

We found that 2-pass DP often finds a lower cost than 1-pass DP, though still not as low as LP+rounding. It also runs significantly slower, even slower than LP+rounding on long sequences. On a 1059-frame video with 3 categories of objects, 2-pass DP takes about 6 minutes to finish, whereas 1-pass DP finishes within 1 minute and LP+rounding within 4 minutes. The leave-one-sequence-out cross-validation result using 2-pass DP achieves a MOTA of 60.4%, which is equivalent to that of 1-pass DP and the LP relaxation. We observe that most of the running time of 2-pass DP is spent on the second forward pass, which involves backtracking for each forward node to avoid cyclic paths. It should be noted that with a proper data structure, such as a hashed linked list that caches path arrays, checking for cyclic paths can be done in O(1). Also, in the second forward pass, one could mark all backward nodes as active and propagate active labels to other forward nodes, so eventually we might not need to visit every forward node. Overall, despite the poor running time of our current implementation, 2-pass DP should still be a promising inference method with better choices of data structures and moderate optimization.
Figure 7. An illustration of 2-pass DP with quadratic interactions. (a) The initial DAG; a pair of nodes indicates a candidate detection. (b) The first iteration of the algorithm; red edges indicate the shortest path found in this iteration. (c) We reverse all edges on the shortest path (green edges) and add the pairwise cost imposed by this path to other candidates within the same time window (red pairs). (d) The second iteration of the algorithm; red and blue edges indicate the new shortest path; notice that it uses three of the reversed edges (blue edges). (e) We again reverse all edges on the shortest path; green edges now indicate the two tracks found in these two iterations. We also update pairwise costs: a blue node pair indicates that we subtract the pairwise cost imposed by "turning off" a candidate, a red pair indicates adding the pairwise cost of newly instanced candidates, and a blue-red pair indicates that we first add the pairwise cost of newly instanced candidates and then subtract the pairwise cost of newly uninstanced candidates. Additions and subtractions are applied to the non-negated edge costs, which are then negated if necessary.