Learning Multi-target Tracking with Quadratic Object Interactions
Shaofei Wang, Charless Fowlkes
Dept. of Computer Science, University of California, Irvine
[email protected], [email protected]
Abstract
We describe a model for multi-target tracking based on associating collections of candidate detections across frames of a video. In order to model pairwise interactions between different tracks, such as suppression of overlapping tracks and contextual cues about co-occurrence of different objects, we augment a standard min-cost flow objective with quadratic terms between detection variables. We learn the parameters of this model using structured prediction and a loss function which approximates the multi-target tracking accuracy. We evaluate two different approaches to finding an optimal set of tracks under the model objective, based on an LP relaxation and a novel greedy extension to dynamic programming that handles pairwise interactions. We find that the greedy algorithm achieves performance equivalent to the LP relaxation while being 2-7x faster than a commercial solver. The resulting model with learned parameters outperforms existing methods across several categories on the KITTI tracking benchmark.
1. Introduction
Multi-target tracking is a classic topic of research in computer vision. Thanks to advances in object detector performance on single, still images, "tracking-by-detection" approaches that build tracks on top of a collection of candidate object detections have shown great promise. Tracking-by-detection avoids some problems, such as drift, and is often able to recover from extended periods of occlusion since it is "self-initializing". Finding an optimal set of detections corresponding to each track is often formulated as a discrete optimization problem of finding low-cost paths through a graph of candidate detections, for which there are often efficient combinatorial algorithms (such as min-cost matching or min-cost network flow) that yield globally optimal solutions (e.g., [27, 20]).

Tracking by detection is somewhat different from traditional generative formulations of multi-target tracking, which draw a distinction between the problem of estimating a latent continuous trajectory for each object and the discrete per-frame data-association problem of assigning observations (e.g., detections) to underlying tracks.
Such methods (e.g., [2, 19, 24]) allow for explicitly specifying an intuitive model of trajectory smoothness but face a difficult joint inference problem over both continuous and discrete variables with little guarantee of optimality.

Figure 1. We describe a framework for learning parameters of a multi-object tracking objective that includes pairwise interactions between objects. The left column shows tracking without pairwise interactions. Our system learns to enforce both inter-class and intra-class mutual exclusion as well as co-occurrence relationships between trajectories. By incorporating pairwise interactions between objects within a frame we are able to improve detection performance.

In tracking by detection, trajectories are implicitly defined by the selected group of detections. For example, a path may skip over some frames entirely due to occlusions or missing detections. The transition cost of utilizing a given edge between detections in successive frames could thus be interpreted as an approximation of the marginal likelihood associated with integrating over the set of underlying continuous trajectories associated with the corresponding pair of detections. This immediately raises difficulties, both in (1) encoding strong trajectory models with only pairwise potentials and (2) identifying the parameters of these potentials from training data.

One line of attack is to first group detections into candidate tracklets and then perform scoring and association of these tracklets [25, 4, 23]. Tracklets allow for scoring much richer trajectory and appearance models while maintaining some of the benefits of purely combinatorial grouping. Another approach is to attempt to include higher-order constraints directly in a combinatorial framework [5, 6]. In either case, there are a large number of parameters associated with these richer models, which necessitates the application of machine learning techniques. This is particularly true for (undirected) combinatorial models based on, e.g., network flow, where parameters are often set empirically by hand.

In this work, we introduce an extension to the standard min-cost flow tracking objective that allows us to model pairwise interactions between tracks. This allows us to incorporate useful knowledge such as typical spatial relationships between detections of different objects and suppression of multiple overlapping tracks of the same object. This quadratic interaction necessitates the development of approximate inference methods, which we describe in Section 3. In Section 5 we describe an approach to jointly learning the model parameters so as to maximize tracking performance on a training set using techniques for structured prediction [22]. Structured prediction has been applied in tracking to learn inter-frame affinity metrics [14] and associations [18], as well as in a variety of other learning tasks such as fitting CRF parameters for segmentation [21] and word alignment for machine translation [15]. To the best of our knowledge, the work presented here is unique in utilizing discriminative structured prediction to jointly learn the complete set of parameters of a tracking model from labeled data, including track birth/death biases, transition affinities, and multi-object contextual relations.
We conclude with experimental results (Section 6) which demonstrate that the learned quadratic model and inference routines yield state-of-the-art performance on multi-target, multi-category object tracking in urban scenes.
2. Model
We begin by formulating multi-target tracking and data association as a min-cost flow network problem equivalent to that of [27], in which individual tracks are described by a first-order Markov model whose state space is the set of spatio-temporal locations in a video. This framework incorporates a state transition likelihood that generates transition features in successive frames, and an observation likelihood that generates appearance features for objects and background.
For a given video sequence, we consider a discrete set of candidate object detection sites V, where each candidate site x = (l, σ, t) is described by its location, scale and frame number. We write Φ = {φ_a(x) | x ∈ V} for the image evidence (appearance features) extracted at each corresponding spatio-temporal location in the video. A single object track consists of an ordered set of these detection sites, T = {x_1, ..., x_N}, with strictly increasing frame numbers.

We model the whole video by a collection of tracks 𝒯 = {T_1, ..., T_k}, each of which independently generates foreground object appearances at the corresponding sites according to a distribution p_fg(φ_a), while the remaining site appearances are generated by a background distribution p_bg(φ_a). Each site can belong to at most one track. Our task is to infer the collection of tracks that maximizes the posterior probability P(𝒯 | Φ) under the model. Assuming that tracks behave independently of each other and follow a first-order Markov model, we can write an expression for MAP inference:

\mathcal{T}^* = \arg\max_{\mathcal{T}} \prod_{T \in \mathcal{T}} P(\Phi \mid T) P(T)
             = \arg\max_{\mathcal{T}} \Big( \prod_{T \in \mathcal{T}} \prod_{x \in T} l(\phi_a(x)) \Big) \times \prod_{T \in \mathcal{T}} \Big( p_s(x_1)\, p_e(x_N) \prod_{i=1}^{N-1} p_t(x_{i+1} \mid x_i) \Big)    (1)

where l(φ_a(x)) = p_fg(φ_a(x)) / p_bg(φ_a(x)) is the appearance likelihood ratio that a specific location x corresponds to a tracked object, and p_s, p_e and p_t represent the likelihoods of tracks starting, ending and transitioning between given sites.

The set of optimal tracks can be found by taking the log of (1) to yield an integer linear program (ILP) over flow variables f:

\min_f \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_i c^t_i f^t_i    (2)
\text{s.t. } f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}, \quad f^s_i, f^t_i, f_i, f_{ij} \in \{0, 1\}

where E is the set of valid transitions between sites in successive frames and the costs are given by

c_i = -\log l(\phi_a(x_i)), \quad c_{ij} = -\log p_t(x_j \mid x_i), \quad c^s_i = -\log p_s(x_i), \quad c^t_i = -\log p_e(x_i)    (3)

This ILP is a well-studied problem known as minimum-cost network flow [1]. The constraint matrix satisfies the total unimodularity property, so the problem can be solved exactly using any LP solver or via various efficient specialized solvers, including network simplex, successive shortest paths and push-relabel with bisectional search [27].

While these approaches yield globally optimal solutions, the authors of [20] consider even faster approximations based on multiple rounds of dynamic programming (DP). In particular, the successive shortest paths algorithm (SSP) finds optimal flows by applying Dijkstra's algorithm on a residual graph constructed from the original network in which some edges corresponding to instanced tracks have been reversed. This can be implemented by performing multiple forward and backward passes of dynamic programming (see the Appendix for details). They also found that one or two passes of DP often perform nearly as well as SSP in practical tracking scenarios. In our experiments we evaluate several of these variants.

The model above assumes tracks are independent of each other, which is not always true in practice. A key contribution of our work is showing that pairwise relations between tracks can be integrated into the model to improve tracking performance.
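Before introducing the pairwise terms, the following minimal sketch (our own illustration, not the implementation used in our experiments; the graph and costs are toy values) makes objective (2) concrete by assembling it as an LP with scipy. Because the constraint matrix is totally unimodular, the LP optimum is already integral.

    import numpy as np
    from scipy.optimize import linprog

    # Toy instance: 3 candidate detections, one candidate transition 0 -> 2.
    n = 3
    c_det = np.array([-3.0, -0.5, -2.5])   # c_i: detection costs (toy values)
    edges = [(0, 2)]                        # set E of transitions between frames
    c_trans = np.array([0.3])               # c_ij
    c_birth = np.full(n, 1.0)               # c_i^s
    c_death = np.full(n, 1.0)               # c_i^t

    # Variable layout: [f_i (n) | f_i^s (n) | f_i^t (n) | f_ij (|E|)]
    m = len(edges)
    cost = np.concatenate([c_det, c_birth, c_death, c_trans])

    # Flow conservation: f_i^s + sum_j f_ji - f_i = 0  and  f_i^t + sum_j f_ij - f_i = 0
    A_eq = np.zeros((2 * n, 3 * n + m))
    for i in range(n):
        A_eq[i, n + i] = 1.0           # f_i^s enters node i
        A_eq[i, i] = -1.0
        A_eq[n + i, 2 * n + i] = 1.0   # f_i^t leaves node i
        A_eq[n + i, i] = -1.0
    for k, (i, j) in enumerate(edges):
        A_eq[j, 3 * n + k] = 1.0       # f_ij enters node j
        A_eq[n + i, 3 * n + k] = 1.0   # f_ij leaves node i
    b_eq = np.zeros(2 * n)

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    f = np.round(res.x)
    print("objective:", res.fun, "active detections:", np.flatnonzero(f[:n]))

With these toy costs the track 0 -> 2 has negative total cost and is instanced, while the weak detection 1 is not.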
In order to allow interactions between multiple objects, we add pairwise cost terms, denoted q_ij and q_ji, for jointly activating a pair of flows f_i and f_j corresponding to detections at sites x_i = (l_i, σ_i, t_i) and x_j = (l_j, σ_j, t_j). Intuitive examples of q_ij and q_ji are a penalty for overlapping locations or a boost for co-occurring objects. We only consider pairwise interactions between pairs of sites in the same video frame, which we denote by EC = {ij : t_i = t_j}. Adding this term to (2) yields an integer quadratic program (IQP):

\min_f \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_{ij \in EC} q_{ij} f_i f_j + \sum_i c^t_i f^t_i    (4)
\text{s.t. } f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}, \quad f^s_i, f^t_i, f_i, f_{ij} \in \{0, 1\}

The addition of quadratic terms makes this objective hard to solve in general. In the next section we discuss two different approximations for finding high-quality solutions f. In Section 5 we describe how the costs c and q can be learned from data.
3. Inference
Now we describe different methods for carrying out tracking inference (finding the optimal flows f). These inference routines are used both for predicting a set of tracks at test time and for optimizing parameters during learning (see Section 5).

As mentioned in the previous section, for the traditional min-cost network flow problem defined in Equation (2) there exist various efficient solvers that exploit its total unimodularity to find the global optimum. We employ MOSEK's built-in network simplex solver in our experiments, as the alternative algorithms yield exactly the same solution.

In contrast, finding the global minimum of the IQP (4) is NP-hard [26] due to the quadratic terms. We evaluate two different schemes for finding high-quality approximate solutions. The first is a standard approach of introducing auxiliary variables and relaxing the integrality constraints to yield a linear program (LP) that lower-bounds the original objective. We also consider a greedy approximation based on successive rounds of dynamic programming that also yields good solutions while avoiding the expense of solving a large-scale LP.

If we relax the integer constraints and deform the costs as necessary to make the objective convex, then the global optimum of (4) can be found in polynomial time. For example, one could apply the Frank-Wolfe algorithm to optimize the relaxed, convexified QP while simultaneously keeping track of good integer solutions [13]. However, for real-world tracking over long videos, the relaxed QP is still quite expensive. Instead we follow the approach proposed by Chari et al. [6], reformulating the IQP as an equivalent ILP by replacing the quadratic terms f_i f_j with a set of auxiliary variables u_ij:

\min_{f,u} \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_{ij \in EC} q_{ij} u_{ij} + \sum_i c^t_i f^t_i    (5)
\text{s.t. } f^s_i, f^t_i, f_i, f_j, f_{ij}, u_{ij} \in \{0, 1\}
\quad f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}
\quad u_{ij} \le f_i, \quad u_{ij} \le f_j, \quad f_i + f_j \le u_{ij} + 1

The new constraints force u_ij to be 1 exactly when f_i and f_j are both 1. By relaxing the integer constraints, program (5) can be solved efficiently via large-scale LP solvers such as CPLEX or MOSEK.

At test time we would like to predict a discrete set of tracks. This requires rounding the solution of the relaxed LP to a solution that satisfies not only the integer constraints but also the flow constraints. [6] proposed two rounding heuristics. The first is a Euclidean rounding scheme that minimizes ||f − f̂||², where f̂ is the non-integral solution given by the LP relaxation. When f is constrained to be binary, this objective simplifies to the linear function (1 − 2f̂)ᵀ f + ||f̂||², which can be optimized using a standard linear min-cost flow solver. Alternately, one can use a linear under-estimator of (4), similar to the Frank-Wolfe algorithm:

\sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i \Big( c_i + \sum_{ij \in EC} q_{ij} \hat{u}_{ij} + \sum_{ji \in EC} q_{ji} \hat{u}_{ji} \Big) f_i + \sum_i c^t_i f^t_i    (6)

Both rounding heuristics are linear functions subject to the original integer and flow constraints and thus can be solved as ordinary min-cost network flow problems. In our experiments we run both rounding heuristics and choose the solution with the lower cost.
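A small sketch (our own illustration, using the same hypothetical variable layout as the earlier LP snippet) of the three linear inequalities in (5) that tie an auxiliary variable u_ij to the product f_i f_j; stacking such rows into A_ub and appending q_ij to the cost vector for each u_ij column yields the LP relaxation.

    import numpy as np

    def pairwise_rows(num_vars, idx_fi, idx_fj, idx_uij):
        """Inequality rows (A_ub, b_ub) linearizing one quadratic term q_ij * f_i * f_j."""
        A = np.zeros((3, num_vars))
        b = np.zeros(3)
        A[0, idx_uij], A[0, idx_fi] = 1.0, -1.0             # u_ij - f_i <= 0
        A[1, idx_uij], A[1, idx_fj] = 1.0, -1.0             # u_ij - f_j <= 0
        A[2, idx_fi], A[2, idx_fj], A[2, idx_uij] = 1.0, 1.0, -1.0
        b[2] = 1.0                                          # f_i + f_j - u_ij <= 1
        return A, b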
We now describe a simple greedy algorithm inspired by the combination of dynamic programming and non-maximal suppression proposed in [20]. We carry out a series of rounds of dynamic programming to find shortest paths between the source and sink nodes. In each round, once we have identified a track, we update the (unary) costs associated with all detections to include the effect of the pairwise quadratic interaction terms of the newly activated track (e.g., suppressing overlapping detections, or boosting the scores of commonly co-occurring objects). This is analogous to greedy algorithms for maximum-weight independent set where the elements are paths through the network.
Algorithm 1: DP with pairwise cost update

    Input: a directed acyclic graph G with edge weights c_i, c_ij
    Initialize T ← ∅
    repeat
        Find the shortest start-to-end path p on G
        track_cost = cost(p)
        if track_cost < 0 then
            for all locations x_i in p do
                c_j ← c_j + q_ij + q_ji for all ij, ji ∈ EC
                c_i ← +∞
            end for
            T ← T ∪ {p}
        end if
    until track_cost ≥ 0
    Output: track collection T

In the absence of quadratic terms, this algorithm corresponds to the 1-pass DP approximation of the successive shortest paths (SSP) algorithm. Hence it does not guarantee an optimal solution but, as we show in the experiments, it performs well in practice. A practical implementation difference (from the linear objective) is that updating the costs with the quadratic terms when a track is instanced has the unfortunate effect of invalidating cost-to-go estimates, which could otherwise be cached and re-used between successive rounds to accelerate the DP computation.

Interestingly, the greedy approach to updating the pairwise terms can also be used with a 2-pass DP approximation to SSP in which backward passes subtract quadratic penalties. We describe the details of our implementation of the 2-pass algorithm in the Appendix. We found the 1-pass approach superior, as the complexity and runtime grow substantially for multi-pass DP with pairwise updates.
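A compact sketch (a minimal illustration, not the implementation used in our experiments) of the 1-pass greedy procedure above. The inputs preds (predecessor lists with transition costs) and q (pairwise costs for same-frame pairs in EC) are hypothetical data structures; detections are assumed sorted by frame.

    import math

    def shortest_track(c, c_birth, c_death, preds):
        """One forward sweep over the DAG of detections. preds[i] lists (j, c_ji)
        pairs for transitions j -> i. Returns (cost, path)."""
        n = len(c)
        best, back = [0.0] * n, [None] * n
        for i in range(n):
            cost_in = c_birth[i]                     # start a new track at i ...
            for j, c_ji in preds[i]:                 # ... or extend the best predecessor
                if best[j] + c_ji < cost_in:
                    cost_in, back[i] = best[j] + c_ji, j
            best[i] = c[i] + cost_in
        i = min(range(n), key=lambda k: best[k] + c_death[k])
        total, path = best[i] + c_death[i], []
        while i is not None:
            path.append(i)
            i = back[i]
        return total, path[::-1]

    def greedy_tracks(c, c_birth, c_death, preds, q):
        """Algorithm 1: q[(i, j)] holds pairwise costs q_ij for same-frame pairs."""
        c, tracks = list(c), []
        while True:
            cost, path = shortest_track(c, c_birth, c_death, preds)
            if cost >= 0:                            # stop once no negative-cost track exists
                break
            for i in path:                           # fold pairwise terms into unary costs
                for (a, b), q_ab in q.items():
                    if a == i:
                        c[b] += q_ab                 # q_ij applied to partner j
                    elif b == i:
                        c[a] += q_ab                 # q_ji applied to partner j
                c[i] = math.inf                      # each detection joins at most one track
            tracks.append(path)
        return tracks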
4. Tracking Features and Potentials
In order to learn the tracking potentials (c and q) we parameterize the flow cost objective by a vector of weights w and a set of features Ψ(X, f) that depend on features extracted from the video, the spatio-temporal relations between candidate detections, and which tracks are instanced. With this linear parameterization we write the cost of a given flow as C(f) = −wᵀΨ(X, f), where the negative sign is a useful convention to convert the minimization problem into a maximization. The components of the weight and feature vectors are given by:

w = \big[ w_S ;\; w_t ;\; w_s ;\; w_a ;\; w_E \big], \quad
\Psi(X, f) = \Big[ \sum_i \phi_S(x_i) f^s_i ;\; \sum_{ij \in E} \psi_t(x_i, x_j) f_{ij} ;\; \sum_{ij \in EC} \psi_s(x_i, x_j) f_i f_j ;\; \sum_i \phi_a(x_i) f_i ;\; \sum_i \phi_E(x_i) f^t_i \Big]    (7)

Here w_a represents a local appearance template for the tracked objects of interest, w_t represents weights for transition features, w_s represents weights for pairwise interactions, and w_S and w_E represent weights associated with track births and deaths. φ_a(x_i) is the image feature at spatio-temporal location x_i, ψ_t(x_i, x_j) is the feature of a transition from location x_i to location x_j, ψ_s(x_i, x_j) is the feature of the pairwise interaction between locations x_i and x_j in the same frame, φ_S(x_i) is the feature of the edge from the birth node to location x_i, and φ_E(x_i) is the feature of the edge from location x_i to the sink node.

Local appearance model:
We make use of an off-the-shelf detector to capture local appearance. Our local appearance feature thus consists of the detector score along with a constant 1 to allow for a variable bias.
Transition model:
We use a simple motion model (described in Section 6) to predict candidate windows' locations in future frames; we connect a candidate x_i at time t_i with another candidate x_j at a later time t_i + n only if the overlap ratio between x_i's predicted window at t_i + n and x_j's window at t_i + n exceeds a threshold. The overlap ratio is defined as the two windows' intersection over their union. We use this overlap ratio as a feature associated with each transition link, together with an indicator feature that is 1 if the ratio is lower than 0.5 and 0 otherwise. In our experiments we allow occlusion gaps of up to a fixed number of frames for all the network-flow methods. We append a constant 1 to this feature and bin these features according to the length of the transition, which yields a fixed-length feature vector for each transition link.
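A sketch of how such a transition feature could be computed; the pruning threshold and the 3-features-per-bin layout are illustrative assumptions, not the exact values used in our experiments.

    import numpy as np

    def iou(a, b):
        """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    def transition_feature(pred_box_i, box_j, gap, max_gap, min_overlap=0.3):
        """Feature for a candidate link i -> j spanning `gap` frames.
        pred_box_i is x_i's box propagated to x_j's frame by the motion model;
        min_overlap is a free pruning threshold (illustrative value)."""
        r = iou(pred_box_i, box_j)
        if r < min_overlap:
            return None                               # link is not created at all
        base = np.array([r, float(r < 0.5), 1.0])     # overlap, weak-link flag, bias
        feat = np.zeros(3 * max_gap)
        feat[3 * (gap - 1): 3 * gap] = base           # binned by transition length
        return feat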
Birth/death model: In applications with static cameras it can be useful to learn a spatially varying bias to model where tracks are likely to appear or disappear. However, the videos in our experiments are all captured from a moving vehicle, so we use a single constant value of 1 for the birth and death features.
Pairwise interactions: w_s is a weight vector that encodes valid geometric configurations of two objects. ψ_s(x_i, x_j) is a discretized spatial-context feature that bins the relative location of the detection window at location x_i and the window at location x_j into one of D relations including on top of, above, below, next-to, near, far and overlap (similar to the spatial context of [7]). To mimic the temporal NMS described in [20] we add one additional relation, strictly overlap, defined as the intersection of the two boxes over the area of the first box; we set the corresponding feature to 1 if this ratio is greater than 0.9 and 0 otherwise. Now, assuming there are K classes of objects in the video, w_s is a DK²-dimensional vector, i.e., w_s = [w_{s11}ᵀ, w_{s12}ᵀ, ..., w_{sij}ᵀ, ..., w_{sKK}ᵀ]ᵀ, in which w_{sij} is a length-D column vector that encodes valid geometric configurations of objects of class i with respect to objects of class j. In this way we can capture intra- and inter-class contextual relationships between tracks.
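A sketch of the spatial-context binning and of how a relation feature is placed into the class-pair block of w_s. The relation names and the two binning rules shown are illustrative stand-ins; only the strictly-overlap rule follows the 0.9 threshold stated above.

    import numpy as np

    RELATIONS = ["on-top-of", "above", "below", "next-to", "near", "far",
                 "overlap", "strictly-overlap"]          # D relations (illustrative list)

    def spatial_context(box_i, box_j):
        """Spatial relation feature psi_s(x_i, x_j) for two same-frame boxes."""
        D = len(RELATIONS)
        psi = np.zeros(D)
        area_i = (box_i[2] - box_i[0]) * (box_i[3] - box_i[1])
        ix1, iy1 = max(box_i[0], box_j[0]), max(box_i[1], box_j[1])
        ix2, iy2 = min(box_i[2], box_j[2]), min(box_i[3], box_j[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        if inter / area_i > 0.9:
            psi[RELATIONS.index("strictly-overlap")] = 1.0   # mimics temporal NMS of [20]
        elif inter > 0:
            psi[RELATIONS.index("overlap")] = 1.0
        # ... remaining relations would be set from relative offsets and distances ...
        return psi

    def pair_feature_block(psi, class_i, class_j, K):
        """Place psi into the (class_i, class_j) block of the DK^2 pairwise feature."""
        D = len(psi)
        full = np.zeros(D * K * K)
        block = class_i * K + class_j
        full[block * D:(block + 1) * D] = psi
        return full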
5. Learning
We formulate parameter learning of tracking models as a structured prediction problem. With some abuse of notation, assume we have N training videos (X_n, f_n) ∈ 𝒳 × ℱ, n = 1, ..., N. Given ground-truth tracks in the training videos specified by flow variables f_n, we discriminatively learn tracking model parameters w using a structured SVM with margin rescaling:

w^* = \arg\min_{w, \xi_n \ge 0} \; \tfrac{1}{2}\|w\|^2 + C \sum_n \xi_n    (8)
\text{s.t. } \forall n, \hat{f}: \; \langle w, \Delta\Psi(X_n, f_n, \hat{f}) \rangle \ge L(f_n, \hat{f}) - \xi_n
\text{where } \Delta\Psi(X_n, f_n, \hat{f}) = \Psi(X_n, f_n) - \Psi(X_n, \hat{f})

Here Ψ(X_n, f_n) are the features extracted from the n-th training video. L(f_n, f̂) is a loss function that penalizes any difference between the inferred labeling f̂ and the ground-truth labeling f_n. The constraint on the slack variables ξ_n ensures that we pay a cost for any training video in which the flow cost of the ground-truth tracks under model w is higher than that of some other, incorrect labeling.

We optimize the structured SVM objective in (8) using a standard cutting-plane method [12] in which the exponential number of constraints (one for each possible flow f̂) is approximated by a much smaller number of terms. Given a current estimate of w we find a "most violated constraint" for each training video:

\hat{f}^*_n = \arg\max_{\hat{f}} \; L(f_n, \hat{f}) - \langle w, \Delta\Psi(X_n, f_n, \hat{f}) \rangle

We then add these constraints to the optimization problem and solve for an updated w. This procedure is iterated until no additional constraints are added to the problem. In our implementation, at each iteration we add a single linear constraint which is the sum of the violated constraints derived from the individual videos in the dataset; this is also a valid cutting-plane constraint [7].

The key subroutine is finding the most violated constraint for a given video, which requires solving the loss-augmented inference problem (we drop the subscript n from here on):

\hat{f}^* = \arg\min_{\hat{f}} \; -\langle w, \Psi(X, \hat{f}) \rangle - L(f, \hat{f})    (9)

As long as the loss function L(f, f̂) decomposes as a sum over flow variables, this problem has the same form as our test-time tracking inference problem, the only difference being that the cost of each variable is augmented by its corresponding negative loss.

We note that our two inference algorithms behave somewhat differently when producing constraints. The greedy algorithm has no guarantee of finding the optimal flow for a given tracking problem and hence may not generate all the constraints necessary for learning w. In contrast, for the LP relaxation, we have the option of adding constraints corresponding to fractional solutions (rather than rounding them to discrete tracks). If we use a loss function that penalizes incorrect non-integral solutions, this may push the structured SVM to learn parameters that tend to result in tight relaxations. These scenarios are termed "undergenerating" and "overgenerating" respectively by [9], since approximate inference is performed over a subset or superset of the exact space of flows.
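A sketch of the learning loop, emphasizing the role of loss-augmented inference as the key subroutine. For brevity a plain subgradient step stands in for the cutting-plane QP of [12]; the function names and arguments are hypothetical.

    import numpy as np

    def train_struct_svm(videos, psi, loss_aug_infer, loss, C=1.0, epochs=20, lr=1e-3):
        """videos: list of (X_n, f_n) pairs; psi(X, f) returns the joint feature vector;
        loss_aug_infer(w, X, f_gt) solves the loss-augmented inference of Eq. (9);
        loss(f_gt, f_hat) is the decomposable tracking loss."""
        w = np.zeros_like(psi(*videos[0]))
        for _ in range(epochs):
            for X, f_gt in videos:
                f_hat = loss_aug_infer(w, X, f_gt)        # most violated constraint
                delta = psi(X, f_gt) - psi(X, f_hat)
                if w @ delta < loss(f_gt, f_hat):         # margin constraint violated
                    w -= lr * (w - C * delta)             # subgradient of regularized hinge
                else:
                    w -= lr * w                           # only the regularizer is active
        return w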
We now describe loss functions for the multi-target tracking problem. We use a weighted Hamming loss to measure the distance between ground-truth labels f and inferred labels f̂:

L(\hat{f}, f) = \sum_{f_i \in f} loss_i \, | f_i - \hat{f}_i |    (10)

where {loss_1, ..., loss_i, ..., loss_{|f|}} is a vector indicating the penalty for differences between the estimated flow f̂ and the ground truth f. For example, when loss_i = 1 for all i this reduces to the Hamming loss.

Transition Loss:
A critical aspect for successful learning is to define a good loss vector that closely resembles major tracking performance criteria, such as Multiple Object Tracking Accuracy (MOTA [3]). Metrics such as false positives, false negatives, true positives, true negatives and true/false births/deaths can easily be incorporated by setting their corresponding entries in loss to 1. By definition, ID switches and fragmentations [16] are determined by looking at the labels of two consecutive transition links simultaneously; under such a definition the loss cannot be optimized by our inference routine, which only considers pairwise relations between detections within a frame. Instead, we propose a decomposable loss for transition links that attempts to capture important aspects of MOTA by taking into account the length and localization of transition links rather than just using a constant (Hamming) loss on mislabeled links. We found empirically that careful specification of the loss function is crucial for learning a good tracking model.

In order to describe our transition loss, let us first denote five types of transition links: NN is a link from a false detection to another false detection; PN is a link from a true detection to a false detection; NP is a link from a false detection to a true detection; PP+ is a link from a true detection to another true detection with the same identity; and PP− is a link from a true detection to another true detection with a different identity. For every transition link we interpolate detections between its start and end detections (if their frame numbers differ by more than 1); each interpolated virtual detection is considered either a true virtual detection or a false virtual detection, depending on whether or not it overlaps a ground-truth label. The loss for the different types of transition links is defined as follows (a small sketch of this computation appears after the list):

1. For NN links, the loss is (number of true virtual detections + number of false virtual detections).
2. For PN and NP links, the loss is (number of true virtual detections + number of false virtual detections + 1).
3. For PP+ links, the loss is (number of true virtual detections).
4. For PP− links, the loss is (number of true virtual detections + number of false virtual detections + 2).
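The per-link loss above can be computed as in the following small helper (an illustration of the four rules, not our exact code):

    def transition_link_loss(start_is_true, end_is_true, same_identity,
                             n_true_virtual, n_false_virtual):
        # Loss contribution of a single transition link under the rules above.
        base = n_true_virtual + n_false_virtual
        if not start_is_true and not end_is_true:      # NN link
            return base
        if start_is_true != end_is_true:               # PN or NP link
            return base + 1
        if same_identity:                              # PP+ link
            return n_true_virtual
        return base + 2                                # PP- link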
Figure 2. Example benefit of the soft transition penalty. The left column shows an ID switch error (IDSW) of the baseline caused by removing aggressive transition links based on an empirical hard overlap threshold. In the right column, our model prevents this error by learning a soft penalty function that allows some aggressive transitions to occur.

Figure 3. Example of track co-occurrence. The right column is the model learned with pairwise terms (LP+Flow+Struct), while the left column is learned without pairwise terms (SSP+Flow+Struct). The co-occurrence term forces both tracks 2 and 3 to initialize even when the detector responses are weak.

Ground-truth flows: In practice, available training datasets specify ground-truth bounding boxes that need to be mapped onto ground-truth flow variables f_n for each video. To do this mapping, we first consider each frame separately, taking the highest-scoring detection window that overlaps a ground-truth label as a true detection; each true detection is assigned the track identity label of the ground-truth label it overlaps. Next, for each track identity, we run a simplified version of the dynamic programming algorithm to find the path that claims the largest number of true detections. After we have iterated through all identity labels, any instanced graph edge is labeled a true detection/transition/birth/death while the remainder are labeled false.
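The per-frame step of this mapping could look as follows (an illustrative sketch; the overlap threshold is an assumption and the iou helper from the earlier transition-feature sketch is reused):

    def label_frame(detections, gt_boxes, iou_thresh=0.5):
        """detections: list of (score, box); gt_boxes: list of (track_id, box).
        Returns {detection index: track_id} for true detections in this frame."""
        labels = {}
        for tid, gbox in gt_boxes:
            best_score, best_idx = float("-inf"), None
            for idx, (score, box) in enumerate(detections):
                if idx in labels:
                    continue                 # each candidate claims at most one identity
                if iou(box, gbox) >= iou_thresh and score > best_score:
                    best_score, best_idx = score, idx
            if best_idx is not None:
                labels[best_idx] = tid       # highest-scoring overlapping window wins
        return labels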
6. Experimental results
Dataset:
We have focused our experiments on the training sequences of the KITTI tracking benchmark [11], which consists of 21 training sequences with a total of 8008 frames and 8 classes of labeled objects. Of all the labeled objects, we evaluated the three categories that had a sufficient number of instances for comparative evaluation: cars, pedestrians and cyclists.
We use publicly available LSVM [8] reference detections and the official evaluation script. The evaluation script only evaluates objects that are not too far away and not truncated by more than 15 percent; it also does not count vans as false positives for cars or sitting persons as false positives for pedestrians. The final dataset contains 636 labeled car trajectories, 201 labeled pedestrian trajectories and 37 labeled cyclist trajectories.

Figure 4. Visualization of the weight vector learned by our method: (a) inter-frame weights; (b) intra-frame weights. Yellow indicates small cost, blue indicates large cost. (a) shows transition weights for different lengths of frame jumps. The model encourages transitions to nearby neighboring frames, and penalizes long or weak transition links (i.e., overlap ratio lower than 0.5). (b) shows the learned pairwise contextual weights between objects. The model encourages intra-class co-occurrence when objects are close but penalizes overlap and objects on top of other objects. Note the strong negative interaction learned between cyclist and pedestrian (two classes which are easily confused by their respective detectors). By exploiting such contextual cues we can make correct predictions in otherwise confusing configurations.

Training with ambiguous labels:
One difficulty of training on the KITTI tracking benchmark is that it has special evaluation rules for ground-truth labels such as small/truncated objects, vans for cars, and sitting persons for pedestrians. We resolve this by removing all detection candidates that correspond to any of these "ambiguous" ground-truth labels during training; in this way we avoid mining hard negatives from those labels. Also, to speed up training, we partition the full-sized training sequences into 10-frame-long subsequences with a 5-frame overlap, and define losses on each subsequence separately.
Data-dependent transition model:
In order to keep the size of the tracking graphs tractable for our inference methods, we need a heuristic to select a sparse set of links between detection candidates across frames. We found that simply predicting candidates' locations in future frames via optical flow gives very good performance. Specifically, we first compute frame-wise optical flow using the software of [17]; then, for a candidate detection x_i at frame t_i, we compute the mean of the vertical flow and the mean of the horizontal flow within the candidate box, and use them to predict the candidate's location in the next frame t_i + 1. For x_i's predicted location in frame t_i + 2 we use its newly predicted location at t_i + 1 and the candidate's original box size to repeat the process, and similarly for t_i + n.
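A minimal sketch of this prediction step (an illustration under the assumption that flow_u and flow_v are per-pixel horizontal and vertical flow fields covering the box):

    import numpy as np

    def predict_box(box, flow_u, flow_v):
        """Shift a box (x1, y1, x2, y2) by the mean optical flow inside it."""
        x1, y1, x2, y2 = (int(round(v)) for v in box)
        du = float(np.mean(flow_u[y1:y2, x1:x2]))
        dv = float(np.mean(flow_v[y1:y2, x1:x2]))
        return (box[0] + du, box[1] + dv, box[2] + du, box[3] + dv)

    def predict_n_frames(box, flows):
        """Chain single-frame predictions; flows is a list of (u, v) fields for
        frames t, t+1, ...; the box size stays fixed, only its location moves."""
        for flow_u, flow_v in flows:
            box = predict_box(box, flow_u, flow_v)
        return box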
Figure 5. Speed and quality comparison of the proposed undergenerating and overgenerating approximations (running time on a log scale versus number of frames). Over the 21 training sequences of the KITTI dataset, LP+rounding produces a cost (−129962.09) that is very close to the relaxed global optimum (−129967.91). DP (cost −129336.88) gives a solution within 1% of the relaxed global optimum while being 2 to 7 times faster than a commercial LP solver (MOSEK).
Trajectory smoothing: During evaluation we observe that many of the track fragmentation errors (FRAG) reported by the benchmark are due to the raw trajectory oscillating away from the ground truth because of poorly localized detection candidates. Inspired by the trajectory model of [2], we post-process each raw output trajectory by fitting a cubic B-spline. This smoothing of the trajectory eliminates many FRAGs from the raw track, making the fragmentation number more meaningful when compared across different models.
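This post-processing could be implemented, for example, with scipy splines (a sketch; the smoothing factor below is an illustrative value, not the one used in our experiments):

    import numpy as np
    from scipy.interpolate import splprep, splev

    def smooth_trajectory(frames, cx, cy, smoothing=5.0):
        """Fit a cubic B-spline to raw box centers (cx, cy) over time and resample
        it at the original frame indices."""
        tck, _ = splprep([np.asarray(cx), np.asarray(cy)], u=np.asarray(frames),
                         k=3, s=smoothing)
        sx, sy = splev(np.asarray(frames), tck)
        return sx, sy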
Baselines:
We use the publicly available code from [10] as a first baseline. It relies on a three-stage tracklet linking scheme with occlusion-sensitive appearance learning; it is by far the best tracker for cars on the KITTI tracking benchmark among all published methods. We also consider dynamic programming (DP) and successive shortest paths (SSP) with the default parameters of [20] as two further baselines, denoted DP+Flow and SSP+Flow in our tables.
Parameter settings:
We tuned the structural parameters of the various baselines to give good performance. For all baselines we only use detections that have a positive score. For DP+Flow and SSP+Flow we also remove all transition links that have overlap ratios lower than 0.5. For the learned tracking models (+Struct) we use detections that have scores greater than -0.5, and transition links that have overlap ratios greater than 0.3.

Benchmark Results:
We evaluate performance using a standard battery of performance measures. The evaluation results for each object category, as well as over all categories, are shown in Table 1. For our learned tracking models (+Struct) we use either the network simplex solver (for SSP+Flow+Struct) or the LP relaxation (for LP/DP+Flow+Struct) during training, and conduct leave-one-sequence-out cross-validation over a geometric range of values of C. We report cross-validation results under the best C for each model. Our simple motion model helps DP+Flow outperform the state-of-the-art baseline by a significant margin. One exception is IDSW, which we attribute to the fact that the network-flow methods do not explicitly model target appearance. While SSP+Flow performs poorly with default parameters, it turns out that with properly learned parameters (SSP+Flow+Struct) it produces results that are comparable to (and often better than) DP+Flow; this indicates that there is much more potential in SSP than suggested in previous work. In addition, SSP's guarantee of optimality makes it very attractive if more complicated features and network structures are to be used in learning. As shown in Table 1, in our evaluation over all objects the models learned with pairwise costs (LP/DP+Flow+Struct) achieve the best MOTA, Recall, Mostly Tracked (MT) and Mostly Lost (ML) performance while keeping the other metrics competitive.

Approximate Inference:
To evaluate the quality of the LP+rounding and DP approximations, we run LP+rounding and DP inference on the models trained via the LP relaxation and DP, respectively, and average the running time and the minimum cost found on each sequence. Fig. 5 shows the cumulative running time and cost for each algorithm. In our experiments, LP+rounding often finds the exact relaxed global optimum, and when it does not it still gives a very close approximation. While greedy forward search using DP rarely reaches the relaxed global optimum, it still produces good approximate solutions that are often within 1% of the relaxed global optimum while running significantly faster (2-7x) than LP+rounding.
Overgenerating versus Undergenerating:
Previous work has shown that, in general, models trained with relaxed inference are preferable to models trained with greedy inference. To investigate this in our particular problem, we also conduct leave-one-sequence-out cross-validation using either DP or the LP relaxation as the inference method during training. The evaluation results under different training/testing inference combinations are shown in Table 2. Notice that the model trained with the LP relaxation does slightly better on most metrics, whereas DP stands out as a good inference algorithm at test time.
Car
Method             MOTA  MOTP  Rec   Prec  MT    ML    IDSW  FRAG
Baseline [10]      57.8  78.8  58.6  98.8  14.9  28.4  22    225
SSP+Flow           49.0  79.1  49.1  99.7  18.4  59.9  0     47
DP+Flow            62.2  79.0  63.4  98.5  25.2  24.2  43    177
SSP+Flow+Struct    63.4  78.3  65.4  97.1  27.4  20.0  2     179
LP+Flow+Struct     64.1  78.1  67.1  95.7  30.5  18.7  3     208
DP+Flow+Struct     64.6  78.0  67.5  96.0  30.1  18.6  17    222

Pedestrian
Method             MOTA  MOTP  Rec   Prec  MT    ML    IDSW  FRAG
Baseline           40.2  73.2  49.0  86.6  4.2   32.2  132   461
SSP+Flow           37.9  73.4  41.8  92.0  8.4   57.5  25    146
DP+Flow            49.7  73.1  57.2  88.9  18.6  26.3  46    260
SSP+Flow+Struct    51.2  73.2  57.4  90.5  19.2  24.6  16    230
LP+Flow+Struct     52.6  72.9  60.2  89.2  22.2  21.6  31    281
DP+Flow+Struct     52.4  73.0  60.0  89.2  19.8  22.2  36    277

Cyclist
Method             MOTA  MOTP  Rec   Prec  MT    ML    IDSW  FRAG
Baseline           39.0  81.6  39.6  99.5  5.4   37.8  7     26
SSP+Flow           18.7  85.6  18.7  100   5.4   89.2  0     1
DP+Flow            42.4  81.2  42.5  100   18.9  45.9  2     5
SSP+Flow+Struct    47.4  79.7  59.9  83.0  35.1  32.4  5     10
LP+Flow+Struct     52.3  79.6  61.1  88.2  40.6  27.0  12    21
DP+Flow+Struct     56.3  79.4  64.2  89.7  40.5  27.0  9     15

All Categories
Method             MOTA  MOTP  Rec   Prec  MT    ML    IDSW  FRAG
Baseline           51.7  77.4  54.8  95.3  12.1  29.7  161   712
SSP+Flow           44.2
DP+Flow            57.6  77.4  60.5  95.7  23.5  25.7  91    442
SSP+Flow+Struct    59.0  77.0  62.8  94.5  25.9  21.5  62    514

Table 1. Tracking results for the car, pedestrian and cyclist categories of the KITTI tracking benchmark, and aggregate performance over all categories. The proposed method using quadratic interactions between objects and parameters trained using structured prediction achieves state-of-the-art MOTA and is competitive across multiple performance measures.
                            Train: DP   Train: LP
Test: DP        MOTA          60.5        60.6
                Recall        65.2        65.1
                Precision     93.5        93.8
                MT            28.6        28.4
                ML            20.5        19.7
                IDSW          68          62
                FRAG          517         514
Test: LP+round  MOTA          60.1        60.2
                Recall        64.9        64.8
                Precision     93.3        93.5
                MT            29.3        29.2
                ML            20.3        19.7
                IDSW          56          46
                FRAG          518         510

Table 2. Performance evaluation over 21 sequences using cross-validation for different combinations of the inference algorithm used during training and at test time.
Moreover, though it falls slightly behind, the model trained with greedy DP is very close in performance to the model trained with LP, which suggests that the greedy algorithm proposed here is a very competitive inference method.

7. Summary
We augmented the well-studied network-flow tracking model with pairwise costs and proposed an end-to-end framework that jointly optimizes the parameters of this model. We extensively evaluated a traditional LP relaxation-based method and a novel greedy dynamic programming method for inference in the augmented network, both of which achieve state-of-the-art performance, with our greedy DP algorithm being 2-7x faster than a commercial LP solver.
8. Acknowledgements
This work was supported by NSF DBI-1053036, IIS-1253538 and a Google Research Award.
References

[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice-Hall, Upper Saddle River, NJ, USA, 1993.
[2] A. Andriyenko, K. Schindler, and S. Roth. Discrete-continuous optimization for multi-target tracking. In CVPR, 2012.
[3] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. J. Image Video Process., 2008:1:1-1:10, Jan. 2008.
[4] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
[5] A. A. Butt and R. T. Collins. Multi-target tracking by Lagrangian relaxation to min-cost network flow. In CVPR, June 2013.
[6] V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic. On pairwise cost for multi-object network flow tracking. CoRR, abs/1408.3304, 2014.
[7] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In ICCV, 2009.
[8] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
[9] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML, pages 304-311, 2008.
[10] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3D traffic scene understanding from movable platforms. PAMI, 2014.
[11] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[12] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.
[13] A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.
[14] S. Kim, S. Kwak, J. Feyereisl, and B. Han. Online multi-target tracking by large margin structured learning. In ACCV, pages 98-111, 2013.
[15] S. Lacoste-Julien, B. Taskar, D. Klein, and M. I. Jordan. Word alignment via quadratic assignment. In HLT-NAACL, pages 112-119, 2006.
[16] Y. Li, C. Huang, and R. Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In CVPR, 2009.
[17] C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD thesis, Massachusetts Institute of Technology, 2009.
[18] X. Lou and F. A. Hamprecht. Structured learning for cell tracking. In NIPS, 2011.
[19] A. Milan, K. Schindler, and S. Roth. Detection- and trajectory-level exclusion in multiple object tracking. In CVPR, 2013.
[20] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
[21] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, October 2008.
[22] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. MIT Press, 2003.
[23] B. Wang, G. Wang, K. Luk Chan, and L. Wang. Tracklet association with online target-specific metric learning. In CVPR, June 2014.
[24] Z. Wu, A. Thangali, S. Sclaroff, and M. Betke. Coupling detection and data association for multiple object tracking. In CVPR, pages 1-8, June 2012.
[25] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In CVPR, 2012.
[26] A. N. H. Zaied and L. A. E. fatah Shawky. A survey of quadratic assignment problems. International Journal of Computer Applications, 101(6):28-36, September 2014.
[27] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, pages 1-8, 2008.

9. Appendix: Multi-Pass Dynamic Programming to Approximate Successive Shortest Path

We now describe the two dynamic programming (DP) algorithms proposed by [20] which approximate the successive shortest path (SSP) algorithm. Recall the network-flow problem described in Equation (2):

\min_f \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_i c^t_i f^t_i
\text{s.t. } f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}, \quad f^s_i, f^t_i, f_i, f_{ij} \in \{0, 1\}

The corresponding graphical model is shown in Fig. 6.

Figure 6. Graphical representation of the network flow model from [27]. A pair of nodes (connected by a red edge) represents a detection, blue edges represent possible transitions between detections, and birth/death flows are modeled by black edges. The costs c_i in Objective (2) are for red edges, c_ij for blue edges, and c^t_i and c^s_i for black edges. To simplify our description, we will refer to a detection edge and the two nodes associated with it as a "node" or "detection node". The set V consists of all detection nodes in the graph, whereas the set E consists of all transition edges in the graph.

SSP finds the global optimum of Objective (2) by repeating the following two steps until no negative-cost path can be found:

1. Find the minimum-cost source-to-sink path on the residual graph G_r(f).
2. If the cost of the path is negative, push a unit of flow along the path to update f.

A residual graph G_r(f) is the same as the original graph G except that all edges used by f are reversed and their costs negated. We focus on describing the DP algorithms and refer readers to [1] for a detailed description of the SSP algorithm.

Assume the detection nodes are sorted in time. We denote by cost(i) the cost of the shortest path from the source node to node i, by link(i) node i's predecessor on this shortest path, and by birth_node(i) the first detection node on this shortest path. We initialize cost(i) = c_i + c^s_i, link(i) = ∅, and birth_node(i) = i for all i ∈ V. To find the shortest path on the initial DAG G, we sweep from the first frame to the last frame, computing cost(i) as

cost(i) = c_i + \min(\pi, c^s_i), \quad \pi = \min_{ji \in E} \big( c_{ji} + cost(j) \big)    (11)

and updating birth_node(i) and link(i) accordingly. After sweeping through all frames, we find the node i for which cost(i) + c^t_i is minimal and reconstruct the shortest path by backtracking the cached link variables; the cost of this path is cost(i) + c^t_i. After the shortest path is found, we remove all of its nodes and edges from G; the resulting graph G' is still a DAG, so we can repeat this procedure until no path with negative cost remains. A further speed-up can be achieved by recomputing cost(i), birth_node(i) and link(i) only for those i whose birth node is the same as the birth node of the track found in the previous iteration.

It is also straightforward to integrate NMS into this algorithm: when we pick a shortest path, we also prune all nodes that overlap the shortest path. In practice this "temporal NMS" can be much more aggressive than pre-processing NMS, since the confidence of a track being composed of true positives is much higher than that of single detections.

A feasible solution f on the network corresponds to a residual graph G_r(f). We denote by V_forward the set of forward nodes in the current residual graph and by V_backward the set of backward nodes in the current residual graph, and describe one iteration of 2-pass DP as follows:

1. Ignore all backward edges (including reversed detection edges) and perform one pass of forward DP (from the first frame to the last frame) on all nodes. For each node i there is a path(i) array that stores the minimum-cost source-to-i path, with cost(i) being the total cost of this path.
2. Use cost(i) from step 1 as initial values and perform one pass of backward DP (from the last frame to the first frame) on V_backward. After this, cost(i) for i ∈ V_backward equals cost(j) − c_ij, where j is i's best (backward) predecessor and c_ij is taken from the original graph. Set cost(i) = +∞ for any backward node i that has no incoming backward edge.
3. Perform one pass of forward DP on i ∈ V_forward. To avoid producing a cyclic path, we need to backtrack the shortest paths for all j ∈ N(i), where N(i) is the set of neighboring nodes connected to i via a forward edge.
4. Find the node i with minimum cost(i) + c^t_i; the (approximate) shortest path is then path(i).
5. Update the solution f by setting all forward variables along path(i) to 1 and all backward variables along path(i) to 0.

It is straightforward to show that during the first iteration, 1-pass DP and 2-pass DP behave identically. Also, the path found by 2-pass DP will never go into a source node or out of a sink node, so in each iteration we generate exactly one more track, either by splitting a previously found track or by choosing an entirely new track. Therefore the algorithm terminates after at most |V| iterations.
10. Appendix: Incorporating Quadratic Interactions in Multi-Pass DP
Recall the augmented network-flow problem with quadratic costs (Eqn. 4):

\min_f \sum_i c^s_i f^s_i + \sum_{ij \in E} c_{ij} f_{ij} + \sum_i c_i f_i + \sum_{ij \in EC} q_{ij} f_i f_j + \sum_i c^t_i f^t_i
\text{s.t. } f^s_i + \sum_j f_{ji} = f_i = f^t_i + \sum_j f_{ij}, \quad f^s_i, f^t_i, f_i, f_{ij} \in \{0, 1\}

where EC = {ij : t_i = t_j}. We propose two new variants of the DP algorithm that approximately minimize Objective (4). They are likewise divided into 1-pass DP and 2-pass DP. Since we already described 1-pass DP with pairwise interactions in the paper, we focus on 2-pass DP with pairwise interactions here.

A feasible solution f on the network corresponds to a residual graph G_r(f). We can apply the 2-pass steps described in the previous appendix section to find an approximate shortest path. This path may consist of both forward nodes and backward nodes, which correspond respectively to uninstanced detections (which will be instanced after this iteration) and already instanced detections (which will be uninstanced after this iteration). We then update the (unary) costs of the other nodes by adding or subtracting the pairwise costs imposed by turning the selected nodes on the path on or off. Additionally, at step 3 of 2-pass DP, one can also account for the pairwise cost to the current node imposed by previously selected nodes in the same path. The entire procedure is described in Algorithm 2.
Algorithm 2: Two-pass DP with pairwise cost update

    Input: a directed acyclic graph G with node and edge weights c_i, c_ij
    Initialize f ← 0
    repeat
        Find a start-to-end min-cost unit flow f* on G_r(f)
        track_cost = cost(f*)
        if track_cost < 0 then
            for all f_i ∈ f* do
                if f_i = 0 then
                    c_j ← c_j + q_ij + q_ji, ∀ ij, ji ∈ EC
                else
                    c_j ← c_j − q_ij − q_ji, ∀ ij, ji ∈ EC
                end if
            end for
            Update f: set all forward variables used by f* to 1 and all backward variables used by f* to 0
        end if
    until track_cost ≥ 0
    Output: solution f

Notice that, to simplify our notation, we construct a temporary residual graph at the beginning of each iteration and do not negate edge weights in the original graph. In practice, we can instead update edge costs and directions on the original graph at the end of each iteration; in that case we should add pairwise costs to forward nodes or subtract pairwise costs from backward nodes when we turn a node on, and, similarly, subtract pairwise costs from forward nodes or add pairwise costs to backward nodes when we turn a node off.

We found that 2-pass DP often finds a lower cost than 1-pass DP, though still not as low as LP+rounding. It also runs significantly slower, even slower than LP+rounding on long sequences. On a 1059-frame video with 3 categories of objects, 2-pass DP takes about 6 minutes to finish, whereas 1-pass DP finishes within 1 minute and LP+rounding within 4 minutes. The leave-one-sequence-out cross-validation result using 2-pass DP achieves a MOTA of 60.4%, which is equivalent to that of 1-pass DP and the LP relaxation. We observe that most of the running time of 2-pass DP is spent on the second forward pass, which involves backtracking for each forward node to avoid cyclic paths. It should be noted that with a proper data structure, such as a hashed linked list that caches path arrays, checking for cyclic paths can be done in O(1). Also, in the second forward pass, one could mark all backward nodes as active and propagate active labels to other forward nodes, so eventually we might not need to visit every forward node. Overall, despite the poor running time of our current implementation, 2-pass DP should still be a promising inference method with better choices of data structures and moderate optimization.
Figure 7. An illustration of 2-pass DP with quadratic interactions. (a) The initial DAG; a pair of nodes indicates a candidate detection. (b) The first iteration of the algorithm; red edges indicate the shortest path found in this iteration. (c) We reverse all edges on the shortest path (green edges) and add the pairwise cost imposed by this path to other candidates within the same time window (red pairs). (d) The second iteration of the algorithm; red and blue edges indicate the new shortest path; notice that it uses three of the reversed edges (blue edges). (e) We again reverse all edges on the shortest path; green edges now indicate the two tracks found in these two iterations. We also update pairwise costs: a blue node pair indicates that we subtract the pairwise cost imposed by "turning off" a candidate, a red pair indicates adding the pairwise cost of newly instanced candidates, and a blue-red pair indicates that we first add the pairwise cost of newly instanced candidates and then subtract the pairwise cost of newly uninstanced candidates. Additions and subtractions are applied to the non-negated edge costs, which are then negated if necessary.