Towards Real-Time Multi-Object Tracking
Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang
Department of Electronic Engineering, Tsinghua University; Australian National University
[email protected], {liyali13, wgsgj}@tsinghua.edu.cn, [email protected]
Abstract.
Modern multiple object tracking (MOT) systems usually follow the tracking-by-detection paradigm. It has 1) a detection model for target localization and 2) an appearance embedding model for data association. Having the two models separately executed might lead to efficiency problems, as the running time is simply a sum of the two steps, without investigating potential structures that can be shared between them. Existing research efforts on real-time MOT usually focus on the association step, so they are essentially real-time association methods but not real-time MOT systems. In this paper, we propose an MOT system that allows target detection and appearance embedding to be learned in a shared model. Specifically, we incorporate the appearance embedding model into a single-shot detector, such that the model can simultaneously output detections and the corresponding embeddings. We further propose a simple and fast association method that works in conjunction with the joint model. In both components the computation cost is significantly reduced compared with former MOT systems, resulting in a neat and fast baseline for future follow-ups on real-time MOT algorithm design. To our knowledge, this work reports the first (near) real-time MOT system, with a running speed of 22 to 40 FPS depending on the input resolution. Meanwhile, its tracking accuracy is comparable to the state-of-the-art trackers embodying separate detection and embedding (SDE) learning (64.4% v.s. 66.1% MOTA on the MOT-16 challenge). Code and models are available at https://github.com/Zhongdao/Towards-Realtime-MOT.
Keywords: Multi-Object Tracking
1 Introduction

Multiple object tracking (MOT), which aims at predicting trajectories of multiple targets in video sequences, underpins critical applications ranging from autonomous driving to smart video analysis. The dominant strategy for this problem, i.e., the tracking-by-detection paradigm [24,40,6], breaks MOT down into two steps: 1) the detection step, in which targets in single video frames are localized; and 2) the association step, where detected targets are assigned and connected to existing trajectories.
Fig. 1. Comparison between (a) the Separate Detection and Embedding (SDE) model, (b) the two-stage model, and (c) the proposed Joint Detection and Embedding (JDE).
Such a pipeline means the system requires at least two compute-intensive components: a detector and an embedding (re-ID) model. We term these methods the Separate Detection and Embedding (SDE) methods for convenience. The overall inference time, therefore, is roughly the summation of the two components, and it increases as the target number increases. The characteristics of SDE methods bring critical challenges in building a real-time MOT system, an essential demand in practice.

In order to save computation, a feasible idea is to integrate the detector and the embedding model into a single network. The two tasks can thus share the same set of low-level features, and re-computation is avoided. One choice for joint detector and embedding learning is to adopt the Faster R-CNN framework [28], a representative two-stage detector. Specifically, the first stage, the region proposal network (RPN), remains the same as in Faster R-CNN and outputs detected bounding boxes; the second stage, Fast R-CNN [11], can be converted to an embedding model by replacing the classification supervision with metric learning supervision [39,36]. In spite of saving some computation, this method is still limited in speed due to its two-stage design and usually runs at fewer than 10 frames per second (FPS), far from real-time. Moreover, the runtime of the second stage also increases as the target number increases, as in SDE methods.

This paper is dedicated to improving the efficiency of an MOT system. We introduce an early attempt that Jointly learns the Detector and Embedding model (JDE) in a single-shot deep network. In other words, the proposed JDE employs a single network to simultaneously output detection results and the corresponding appearance embeddings of the detected boxes. In comparison, SDE methods and two-stage methods are characterized by re-sampled pixels (bounding boxes) and feature maps, respectively; both the bounding boxes and the feature maps are fed into a separate re-ID model for appearance feature extraction. Figure 1 briefly illustrates the difference between the SDE methods, the two-stage methods and the proposed JDE. Our method is near real-time while being almost as accurate as the SDE methods. For example, we obtain a running time of 20.2 FPS with MOTA = 64.4% on the MOT-16 test set. In comparison, Faster R-CNN + QAN embedding [40] only runs at fewer than 6 FPS, with MOTA = 66.1% on the MOT-16 test set.
To build a joint learning framework with high efficiency and accuracy, we explore and deliberately design the following fundamental aspects: training data, network architecture, learning objectives, optimization strategies, and validation metrics. First, we collect six publicly available datasets on pedestrian detection and person search to form a unified large-scale multi-label dataset. In this unified dataset, all the pedestrian bounding boxes are labeled, and a portion of the pedestrian identities are labeled. Second, we choose the Feature Pyramid Network (FPN) [21] as our base architecture and discuss with which type of loss functions the network learns the best embeddings. Then, we model the training process as a multi-task learning problem with anchor classification, box regression, and embedding learning. To balance the importance of each individual task, we employ task-dependent uncertainty [16] to dynamically weight the heterogeneous losses. A simple and fast association algorithm is proposed to further improve efficiency. Finally, we employ the following evaluation metrics. The average precision (AP) is employed to evaluate the performance of the detector. The retrieval metric True Accept Rate (TAR) at a certain False Alarm Rate (FAR) is adopted to evaluate the quality of the embedding. The overall MOT accuracy is evaluated by the CLEAR metrics [2], especially the MOTA score. This paper also provides new settings and baselines for joint detection and embedding learning, which we believe will facilitate research towards real-time MOT.

The contributions of our work are summarized as follows:
– We introduce JDE, a single-shot framework for joint detection and embedding learning. It runs in (near) real-time and is comparably accurate to the separate detection + embedding (SDE) state-of-the-art methods.
– We conduct thorough analysis and experiments on how to build such a joint learning framework from multiple aspects, including training data, network architecture, learning objectives and optimization strategy.
– Experiments with the same training data show that JDE performs as well as a range of strong SDE model combinations and achieves the fastest speed.
– Experiments on MOT-16 demonstrate the advantage of our method over state-of-the-art MOT systems considering the amount of training data, accuracy and speed.
2 Related Work

Recent progress on multiple object tracking can be primarily categorized into the following aspects:
1) methods that model the association problem as some form of optimization problem on graphs [37,42,17];
2) methods that model the association process with an end-to-end neural network [32,50];
3) methods that seek novel tracking paradigms other than tracking-by-detection [1].
Among them, the first two categories have been the prevailing solutions to MOT in the past decade.
In these methods, detection results and appearance embeddings are given as input, and the only problem to be solved is data association. A standard formulation uses a graph, where nodes represent detected targets and edges indicate the possibility of linkage among nodes. Data association can then be solved by minimizing some fixed [15,26,44] or learned [19] cost, or by more complex optimization such as multi-cuts [35] and minimum cliques [43]. Some recent works attempt to model the association problem using graph networks [4,20], so that end-to-end association can be achieved. Graph-based association shows good tracking accuracy, especially in hard cases such as large occlusions, but its efficiency is always a problem. Although some methods [6] claim to attain real-time speed, the runtime of the detector is excluded, so the overall system is still some distance from the claim. In contrast, in this work we consider the runtime of the entire MOT system rather than the association step only. Achieving efficiency for the entire system is more practically significant.

The third category attempts to explore novel MOT paradigms, for instance, incorporating single object trackers into the detector by predicting spatial offsets [1]. These methods are appealing owing to their simplicity, but tracking accuracy is not satisfying unless an additional embedding model is introduced. As such, the trade-off between performance and speed still needs improvement.

The spirit of our approach, i.e., learning auxiliary associative embeddings simultaneously with the main task, also shows good performance in many other vision tasks, such as person search [39], human pose estimation [25], and point-based object detection [18].
3 Joint Learning of Detection and Embedding

3.1 Problem Settings

The objective of JDE is to simultaneously output the location and appearance embeddings of targets in a single forward pass. Formally, suppose we have a training dataset {I, B, y}_{i=1}^{N}. Here, I ∈ R^{c×h×w} indicates an image frame, and B ∈ R^{k×4} represents the bounding box annotations for the k targets in this frame. y ∈ Z^{k} denotes the partially annotated identity labels, where the label −1 indicates targets without an identity annotation. JDE aims to predict bounding boxes B̂ ∈ R^{k̂×4} and appearance embeddings F̂ ∈ R^{k̂×D}, where D is the dimension of the embedding. The following objectives should be satisfied:
– B̂ is as close to B as possible.
– Given a distance metric d(·), for all triplets (k_t, k_{t+Δt}, k'_{t+Δt}) that satisfy y_{k_{t+Δt}} = y_{k_t} and y_{k'_{t+Δt}} ≠ y_{k_t}, we have d(f_{k_t}, f_{k_{t+Δt}}) < d(f_{k_t}, f_{k'_{t+Δt}}), where f_{k_t} is a row vector from F̂_t and f_{k_{t+Δt}}, f_{k'_{t+Δt}} are row vectors from F̂_{t+Δt}, i.e., embeddings of targets in frames t and t+Δt, respectively.

The first objective requires the model to detect targets accurately. The second objective requires the appearance embedding to have the following property: the distance between observations of the same identity in consecutive frames should be smaller than the distance between observations of different identities. The distance metric d(·) can be the Euclidean distance or the cosine distance. Technically, if the two objectives are both satisfied, even a simple association strategy, e.g., the Hungarian algorithm, would produce good tracking results.
Fig. 2.
Illustration of (a) the network architecture and (b) the prediction head. Prediction heads are added upon multiple FPN scales. In each prediction head the learning of JDE is modeled as a multi-task learning problem. We automatically weight the heterogeneous losses by learning a set of auxiliary parameters, i.e., the task-dependent uncertainty.

3.2 Architecture Overview

We employ the architecture of the Feature Pyramid Network (FPN) [21]. FPN makes predictions from multiple scales, thus bringing improvement in pedestrian detection, where the scale of targets varies a lot. Figure 2 briefly shows the neural architecture used in JDE. An input video frame first undergoes a forward pass through a backbone network to obtain feature maps at three scales, namely, scales with 1/32, 1/16 and 1/8 down-sampling rates, respectively. Then, the feature map with the smallest size (also the semantically strongest features) is up-sampled and fused with the feature map from the second smallest scale by a skip connection, and the same goes for the other scales. Finally, prediction heads are added upon the fused feature maps at all three scales. A prediction head consists of several stacked convolutional layers and outputs a dense prediction map of size (6A + D) × H × W, where A is the number of anchor templates assigned to this scale and D is the dimension of the embedding. The dense prediction map is divided into three parts (tasks):
1) the box classification results of size 2A × H × W;
2) the box regression coefficients of size 4A × H × W;
3) the dense embedding map of size D × H × W.
In the following sections, we detail how these tasks are trained; a minimal sketch of how the dense prediction map can be split is given below.
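For concreteness, the following PyTorch-style sketch shows how a single prediction head's dense output could be split into the three task maps described above. The layer sizes, number of stacked convolutions, and the embedding dimension are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch of a JDE-style prediction head on one FPN scale.

    Outputs a dense (6A + D) x H x W map and splits it into
    box classification, box regression and embedding parts.
    Channel sizes are illustrative assumptions.
    """
    def __init__(self, in_channels: int = 256, num_anchors: int = 4, embed_dim: int = 512):
        super().__init__()
        self.A, self.D = num_anchors, embed_dim
        out_channels = 6 * num_anchors + embed_dim
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, 1),
        )

    def forward(self, x: torch.Tensor):
        p = self.head(x)                              # (N, 6A + D, H, W)
        box_cls = p[:, : 2 * self.A]                  # (N, 2A, H, W) fg/bg logits
        box_reg = p[:, 2 * self.A : 6 * self.A]       # (N, 4A, H, W) box offsets
        embedding = p[:, 6 * self.A :]                # (N, D, H, W) dense embeddings
        return box_cls, box_reg, embedding

if __name__ == "__main__":
    feat = torch.randn(1, 256, 19, 34)                # a fused FPN feature map
    cls_map, reg_map, emb_map = PredictionHead()(feat)
    print(cls_map.shape, reg_map.shape, emb_map.shape)
```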
3.3 Learning to Detect

In general, the detection branch is similar to the standard RPN [28], but with two modifications. First, we redesign the anchors in terms of numbers, scales and aspect ratios to adapt to the targets, i.e., pedestrians in our case. Based on the common prior, all anchors are set to an aspect ratio of 1:3. The number of anchor templates is set to 12, such that A = 4 for each scale, and the scales (widths) of anchors range from 11 (≈ 8 × 2^{1/2}) to 512 (= 8 × 2^{6}). Second, we note that it is important to select proper values for the dual thresholds used for foreground/background assignment. By visualization we determine that an IOU > 0.5 w.r.t. the ground truth approximately ensures a foreground, which is consistent with the common setting in generic object detection. On the other hand, boxes that have an IOU < 0.4 w.r.t. the ground truth should be regarded as background in our case, rather than the threshold of 0.3 commonly used in generic scenarios. The detection branch therefore has two losses: the foreground/background classification loss L_α and the bounding box regression loss L_β. L_α is formulated as a cross-entropy loss and L_β as a smooth-L1 loss. The regression targets are encoded in the same manner as [28].

3.4 Learning Appearance Embeddings

The second objective is a metric learning problem, i.e., learning an embedding space where instances of the same identity are close to each other while instances of different identities are far apart. To achieve this goal, an effective solution is to use the triplet loss [29], which has also been used in previous MOT works [36]. Formally, we use the triplet loss L_triplet = max(0, f⊤f⁻ − f⊤f⁺), where f is an instance in a mini-batch selected as an anchor, f⁺ represents a positive sample w.r.t. f, and f⁻ is a negative sample. The margin term is neglected for convenience. This naive formulation of the triplet loss has several challenges. The first is the huge sampling space in the training set. In this work we address this problem by looking at a mini-batch and mining all the negative samples and the hardest positive sample in this mini-batch, such that

\mathcal{L}_{\mathrm{triplet}} = \sum_{i} \max\left(0, f^{\top} f_{i}^{-} - f^{\top} f^{+}\right),   (1)

where f⁺ is the hardest positive sample in the mini-batch.

The second challenge is that training with the triplet loss can be unstable and the convergence might be slow. To stabilize the training process and speed up convergence, it is proposed in [31] to optimize over a smooth upper bound of the triplet loss,

\mathcal{L}_{\mathrm{upper}} = \log\left(1 + \sum_{i} \exp\left(f^{\top} f_{i}^{-} - f^{\top} f^{+}\right)\right).   (2)

Note that this smooth upper bound of the triplet loss can also be written as

\mathcal{L}_{\mathrm{upper}} = -\log \frac{\exp(f^{\top} f^{+})}{\exp(f^{\top} f^{+}) + \sum_{i} \exp(f^{\top} f_{i}^{-})}.   (3)

It is similar to the formulation of the cross-entropy loss,

\mathcal{L}_{\mathrm{CE}} = -\log \frac{\exp(f^{\top} g^{+})}{\exp(f^{\top} g^{+}) + \sum_{i} \exp(f^{\top} g_{i}^{-})},   (4)

where we denote the class-wise weight of the positive class (to which the anchor instance belongs) as g⁺ and the weights of the negative classes as g_i⁻. The major distinctions between L_upper and L_CE are two-fold. First, the cross-entropy loss employs learnable class-wise weights as proxies of class instances rather than using the embeddings of instances directly. Second, all the negative classes participate in the loss computation in L_CE, such that the anchor instance is pulled away from all the negative classes in the embedding space. In contrast, in L_upper, the anchor instance is only pulled away from the sampled negative instances.

In light of the above analysis, we speculate that the performance of the three losses in our case should be L_CE > L_upper > L_triplet. Experimental results in the experiment section confirm this.
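As a rough illustration of the three objectives compared above, the sketch below implements the hardest-positive triplet loss (1), its smooth upper bound (2), and the identity-classification cross-entropy loss (4) for a single anchor embedding. The tensor shapes and the use of a plain weight matrix as the class proxies g are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def triplet_hardest(f, f_pos, f_negs):
    """Eq. (1): hardest positive, all negatives in the mini-batch.
    f: (D,) anchor; f_pos: (P, D) positives; f_negs: (M, D) negatives."""
    pos_sim = (f_pos @ f).min()            # hardest positive = lowest similarity
    neg_sim = f_negs @ f                   # similarities to all negatives
    return torch.clamp(neg_sim - pos_sim, min=0).sum()

def triplet_upper_bound(f, f_pos, f_negs):
    """Eq. (2): smooth upper bound of the triplet loss."""
    pos_sim = (f_pos @ f).min()
    neg_sim = f_negs @ f
    return torch.log1p(torch.exp(neg_sim - pos_sim).sum())

def embedding_cross_entropy(f, identity_label, class_weights):
    """Eq. (4): cross-entropy over identity classes, with learnable
    class-wise weights g acting as proxies of class instances.
    class_weights: (num_identities, D) matrix of g vectors."""
    logits = class_weights @ f             # (num_identities,)
    return F.cross_entropy(logits.unsqueeze(0), identity_label.view(1))

if __name__ == "__main__":
    D, num_ids = 8, 5
    f = F.normalize(torch.randn(D), dim=0)
    f_pos = F.normalize(torch.randn(2, D), dim=1)
    f_negs = F.normalize(torch.randn(6, D), dim=1)
    g = torch.randn(num_ids, D, requires_grad=True)
    print(triplet_hardest(f, f_pos, f_negs),
          triplet_upper_bound(f, f_pos, f_negs),
          embedding_cross_entropy(f, torch.tensor(3), g))
```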
As such, we select the cross-entropy loss as the objective for embedding learning (hereinafter referred to as L_γ). Specifically, if an anchor box is labeled as foreground, the corresponding embedding vector is extracted from the dense embedding map. Extracted embeddings are fed into a shared fully-connected layer that outputs the class-wise logits, and the cross-entropy loss is then applied to the logits. In this manner, embeddings from multiple scales share the same space, and association across scales is feasible. Embeddings with label −1, i.e., foregrounds with box annotations but without identity annotations, are ignored when computing the embedding loss.

3.5 Automatic Loss Balancing

The learning objective of each prediction head in JDE can be modeled as a multi-task learning problem. The joint objective can be written as a weighted linear sum of the losses from every scale and every component,

\mathcal{L}_{\mathrm{total}} = \sum_{i}^{M} \sum_{j=\alpha,\beta,\gamma} w_{j}^{i} \mathcal{L}_{j}^{i},   (5)

where M is the number of prediction heads and w_j^i, i = 1, ..., M, j = α, β, γ are the loss weights. A simple way to determine the loss weights is described below:
1. Let w_α^i = w_β^i, as suggested in [28].
2. Let w_{α/β/γ}^1 = ... = w_{α/β/γ}^M, i.e., loss weights of the same type are shared across scales.
3. Search for the remaining two independent loss weights for the best performance.
Searching loss weights with this strategy can yield decent results within several attempts. However, the reduction of the searching space also brings strong restrictions on the loss weights, such that the resulting loss weights might be far from optimal.
Instead, we adopt an automatic learning scheme for the loss weights proposed in [16], using the concept of task-dependent uncertainty. Formally, the learning objective with automatic loss balancing is written as

\mathcal{L}_{\mathrm{total}} = \sum_{i}^{M} \sum_{j=\alpha,\beta,\gamma} \frac{1}{2} \left( e^{-s_{j}^{i}} \mathcal{L}_{j}^{i} + s_{j}^{i} \right),   (6)

where s_j^i is the task-dependent uncertainty for each individual loss and is modeled as a learnable parameter. We refer readers to [16] for a more detailed derivation and discussion.
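A minimal sketch of this uncertainty-based weighting, assuming the formulation of Eq. (6) with one learnable parameter s per loss term (the class name and the example loss values are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine heterogeneous losses as in Eq. (6):
    L_total = sum_j 0.5 * (exp(-s_j) * L_j + s_j), with s_j learnable."""
    def __init__(self, num_losses: int):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_losses))  # one s per loss term

    def forward(self, losses):
        losses = torch.stack(losses)
        return 0.5 * (torch.exp(-self.s) * losses + self.s).sum()

if __name__ == "__main__":
    balancer = UncertaintyWeightedLoss(num_losses=3)    # L_alpha, L_beta, L_gamma
    l_cls, l_reg, l_emb = torch.tensor(0.7), torch.tensor(1.3), torch.tensor(4.2)
    total = balancer([l_cls, l_reg, l_emb])
    print(total)  # the s parameters are optimized jointly with the network weights
```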
Table 1. Comparison between our association method and SORT. Inputs are the same.

Method | Density | FPS | MOTA | IDF-1
SORT [3] | low | 44.1 | 66.9 | 55.8
Ours | low | – | – | –
SORT [3] | high | 26.4 | 35.0 | 32.4
Ours | high | – | – | –
3.6 Online Association

Although the association algorithm is not the focus of this work, here we introduce a simple and fast online association strategy to work in conjunction with JDE. Specifically, a tracklet is described with an appearance state e_i and a motion state m_i = (x, y, γ, h, ẋ, ẏ, γ̇, ḣ), where x, y indicate the bounding box center position, h indicates the bounding box height, γ indicates the aspect ratio, and ẋ indicates the velocity along the x direction (similarly for the other dotted terms). The tracklet appearance e_i is initialized with the appearance embedding of the first observation, f_i. We maintain a tracklet pool containing all the reference tracklets that observations may be associated with. For an incoming frame, we compute the pair-wise motion affinity matrix A_m and appearance affinity matrix A_e between all the observations and the tracklets from the pool. The appearance affinity is computed using cosine similarity, and the motion affinity is computed using the Mahalanobis distance. Then we solve the linear assignment problem with the Hungarian algorithm, using the cost matrix C = λA_e + (1 − λ)A_m. The motion state m_i of every matched tracklet is updated by the Kalman filter, and the appearance state e_i is updated by

e_i^{t} = \alpha e_i^{t-1} + (1 - \alpha) f_i^{t},   (7)

where f_i^t is the appearance embedding of the current matched observation and α = 0.9 is a momentum term. We implement a vectorized version of the Kalman filter and find it critical for a high FPS, especially when the model is already fast. A comparison between SORT and our association method, based on the same JDE model, is shown in Table 1. We use MOT-15 [24] for testing the low-density scenario and CVPR-19-01 [7] for the high-density one. It can be observed that our method outperforms SORT in both accuracy and speed, especially in the high-density case.
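The following sketch illustrates the association step described above: combining appearance and motion costs, solving the assignment with the Hungarian algorithm, and updating the appearance state with the exponential moving average of Eq. (7). The cost construction (distances rather than affinities) and the λ, α values are illustrative assumptions; the motion cost from the Kalman filter is abstracted into a precomputed matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs, det_embs, motion_cost, lam=0.98, alpha=0.9, max_cost=0.7):
    """One association step for a single frame (illustrative values).

    track_embs:  (T, D) L2-normalized tracklet appearance states e_i
    det_embs:    (N, D) L2-normalized detection embeddings f_i
    motion_cost: (T, N) motion cost, e.g. gated Mahalanobis distances
    Returns matched (track, det) pairs and the updated appearance states.
    """
    appearance_cost = 1.0 - track_embs @ det_embs.T           # 1 - cosine similarity
    cost = lam * appearance_cost + (1.0 - lam) * motion_cost   # fuse the two cues
    rows, cols = linear_sum_assignment(cost)                   # Hungarian algorithm

    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
    for r, c in matches:                                       # Eq. (7): EMA update
        track_embs[r] = alpha * track_embs[r] + (1.0 - alpha) * det_embs[c]
        track_embs[r] /= np.linalg.norm(track_embs[r])
    return matches, track_embs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.normal(size=(3, 16)); t /= np.linalg.norm(t, axis=1, keepdims=True)
    d = rng.normal(size=(4, 16)); d /= np.linalg.norm(d, axis=1, keepdims=True)
    m = rng.uniform(size=(3, 4))
    print(associate(t, d, m)[0])
```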
Table 2. Statistics of the joint training set.

Dataset | ETH | CP | CT | M16 | CS | PRW | Total
4 Experiments

4.1 Datasets and Evaluation Metrics

Performing experiments on small datasets may lead to biased results, and conclusions may not hold when applying the same algorithm to large-scale datasets. Therefore, we build a large-scale training set by putting together six publicly available datasets on pedestrian detection, MOT and person search. These datasets can be categorized into two types: ones that only contain bounding box annotations, and ones that have both bounding box and identity annotations. The first category includes the ETH dataset [9] and the CityPersons (CP) dataset [45]. The second category includes the CalTech (CT) dataset [8], the MOT-16 (M16) dataset [24], the CUHK-SYSU (CS) dataset [39] and the PRW dataset [48]. Training subsets of all these datasets are gathered to form the joint training set, and videos in the ETH dataset that overlap with the MOT-16 test set are excluded for fair evaluation. Table 2 shows the statistics of the joint training set.

For validation/evaluation, three aspects of performance need to be evaluated: the detection accuracy, the discriminative ability of the embedding, and the tracking performance of the entire MOT system. To evaluate detection accuracy, we compute the average precision (AP) at an IOU threshold of 0.5. To evaluate the quality of the embedding, we perform 1:N retrieval among instances in the embedding validation set and report the true positive rate at a false accept rate of 0.1 (TPR@FAR=0.1). The overall tracking performance is evaluated on the MOT-15 training set, with sequences duplicated in the joint training set removed. During testing, we use the MOT-16 test set to compare with existing methods.

4.2 Implementation Details

We employ DarkNet-53 [27] as the backbone network in JDE. The network is trained with standard SGD for 30 epochs. The learning rate is initialized as 10^-2 and is multiplied by 0.1 at the 15th and the 23rd epochs. Several data augmentation techniques, such as random rotation, random scaling and color jittering, are applied to reduce overfitting. Finally, the augmented images are resized to a fixed resolution. The input resolution is 1088 × 608 if not specified.
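As a side note on the embedding metric above, the snippet below sketches one way TPR@FAR=0.1 could be computed from verification-style similarity scores; the pairing scheme and variable names are assumptions, not the authors' evaluation code.

```python
import numpy as np

def tpr_at_far(scores, labels, far=0.1):
    """TPR at a given FAR for a verification-style evaluation.

    scores: similarity scores of query/gallery pairs
    labels: 1 for same-identity pairs, 0 for different-identity pairs
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    neg = np.sort(scores[labels == 0])[::-1]          # negatives, high to low
    k = max(int(np.floor(far * len(neg))) - 1, 0)
    threshold = neg[k]                                # ~FAR fraction of negatives pass
    pos = scores[labels == 1]
    return float((pos > threshold).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(0.7, 0.1, 1000)                  # same-identity similarities
    neg = rng.normal(0.2, 0.1, 5000)                  # different-identity similarities
    print(tpr_at_far(np.concatenate([pos, neg]),
                     np.concatenate([np.ones(1000), np.zeros(5000)])))
```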
Table 3. Comparing different embedding losses and loss weighting strategies. TPR is short for TPR@FAR=0.1 on the embedding validation set, and IDs means the number of ID switches on the tracking validation set. ↓ means the smaller the better; ↑ means the larger the better. In each column, the best result is in bold and the second best is underlined.

Embed. Loss | Weighting Strategy | Det AP ↑ | Emb TPR ↑ | MOT MOTA ↑ | MOT IDs ↓
L_triplet | App.Opt | 81.6 | 42.2 | 59.5 | 375
L_upper | App.Opt | 81.7 | 44.3 | 59.8 | 346
L_CE | App.Opt | 82.0 | 88.2 | 64.3 | 223
L_CE | Uniform | 6.8 | – | – | –
L_CE | MGDA-UB | 8.3 | 93.5 | 38.3 | 357
L_CE | Loss.Norm | 80.6 | 82.1 | 57.9 | 321
L_CE | Uncertainty | – | – | – | –

Comparison of the three loss functions for appearance embedding learning.
We first compare the discriminative ability of appearance embeddings trained with the cross-entropy loss, the triplet loss and its upper bound variant, described in the previous section. For models trained with L_triplet and L_upper, mini-batches are sampled such that every batch contains pairs of boxes from the same identities; this ensures that there always exist positive samples. For models trained with L_CE, images are randomly sampled to form a mini-batch. Table 3 presents comparisons of the three loss functions.

As expected, L_CE outperforms both L_triplet and L_upper. Surprisingly, the performance gap is large (+46.0/+43.9 TAR@FAR=0.1). A possible reason for the large gap is that the cross-entropy loss requires the similarity between one instance and its positive class to be higher than the similarities between this instance and all negative classes. This objective is more rigorous than the triplet loss family, which exerts constraints merely within a sampled mini-batch. Considering its effectiveness and simplicity, we adopt the cross-entropy loss in JDE.

Comparison of different loss weighting strategies.
The loss weighting strategy is crucial for learning a good joint representation in JDE. In this paper, three loss weighting strategies are implemented. The first is a loss normalization method (named "Loss.Norm"), where the losses are weighted by the reciprocal of their moving-average magnitudes. The second is the "MGDA-UB" algorithm proposed in [30], and the last is the weight-by-uncertainty strategy described in Section 3.5. Moreover, we have two baselines. The first trains all the tasks with identical loss weights, named "Uniform". The second, referred to as "App.Opt", uses a set of approximately optimal loss weights found by searching under the two-independent-variable assumption described in Section 3.5. Table 3 summarizes the comparisons of these strategies. Two observations can be made.

First, the Uniform baseline produces poor detection results, and thus the tracking accuracy is not good. This is because the scale of the embedding loss is much larger than the other two losses and dominates the training process. Once we set proper loss weights to let all tasks learn at a similar rate, as in the "App.Opt" baseline, both the detection and embedding tasks yield good performance.

Second, the results indicate that the "Loss.Norm" strategy outperforms the "Uniform" baseline but is inferior to the "App.Opt" baseline. The MGDA-UB algorithm, despite being the most theoretically sound method, fails in our case because it assigns too large a weight to the embedding loss, such that its performance is similar to the Uniform baseline. The only method that outperforms the App.Opt baseline is the weight-by-uncertainty strategy.
Comparison with SDE methods.
To demonstrate the superiority of JDE over the Separate Detection and Embedding (SDE) methods, we implement several state-of-the-art detectors and person re-ID models and compare their combinations with JDE in terms of both tracking accuracy (MOTA) and runtime (FPS). The detectors include JDE with ResNet-50 and ResNet-101 [13] as backbones, Faster R-CNN [28] with ResNet-50 and ResNet-101 as backbones, and Cascade R-CNN [5] with ResNet-50 and ResNet-101 as backbones. The person re-ID models include IDE [47], Triplet [14] and PCB [33]. In the association step, we use the same online association approach described in Section 3.6 for all the SDE models. For a fair comparison, all the training data are the same as used in JDE.

In Figure 3, we plot the MOTA metric and the IDF-1 score against the runtime for SDE combinations of the above detectors and person re-ID models. The runtimes of all models are tested on a single Nvidia Titan Xp GPU. Figure 3 (a) and (c) show comparisons on the MOT-15 train set, in which the pedestrian density is low, e.g., fewer than 20 pedestrians per frame. In contrast, Figure 3 (b) and (d) show comparisons on a video sequence that contains a high-density crowd (CVPR19-01 from the CVPR19 MOT challenge dataset, with a density of over 60 pedestrians per frame). Several observations can be made.

First, JDE offers the best time-accuracy trade-off among all combinations, although its IDF-1 score is somewhat lower than that of the strongest SDE combination. Specifically, JDE with DarkNet-53 presents a 66.2% IDF-1 score at 22 FPS, while Cascade R-CNN with ResNet-101 + PCB presents a 69.6% IDF-1 score at about 7 FPS. This gap is mainly caused by inaccurate detections when pedestrians overlap heavily (see the analysis at the end of this section); Figure ?? shows an example of such a failure case.
Fig. 3.
Comparing JDE and various SDE combinations in terms of tracking accuracy (MOTA/IDF-1) and speed (FPS). (a) and (c) show comparisons under the case where the pedestrian density is low (MOT-15 train set); (b) and (d) show comparisons under the crowded scenario (MOT-CVPR19-01). Different colors represent different embedding models, and different shapes denote different detectors. We clearly observe that the proposed JDE method (JDE Embedding + JDE-DN53) has the best time-accuracy trade-off. Best viewed in color.
Second, the tracking accuracy of JDE is very close to the combinations of JDE+IDE, JDE+Triplet and JDE+PCB (see the cross markers in Figure 3). With the other components fixed, JDE even outperforms the JDE+IDE combination. This strongly suggests that the jointly learned embedding is almost as discriminative as a separately learned embedding.

Finally, comparing the runtime of the same model between Figure 3 (a) and (b), it can be observed that all the SDE models suffer a significant speed drop under the crowded case. This is because the runtime of the embedding model increases with the number of detected targets. This drawback does not exist in JDE, because the embeddings are computed together with the detection results. As such, the runtime difference of JDE between the usual case and the crowded case is much smaller (see the red markers). In fact, the speed drop of JDE is due to the increased time in the association step, which is positively related to the target number.
Comparison with the state-of-the-art MOT systems.
Since we train JDE using additional data instead of the MOT-16 train set, we compare JDE under the "private data" protocol of the MOT-16 benchmark. State-of-the-art online MOT methods under the private protocol are compared, including DeepSORT 2 [38], RAR16wVGG [10], TAP [49], CNNMTT [23] and POI [40]. All these methods employ the same detector, i.e., Faster R-CNN with VGG-16 as backbone, which is trained on a large private pedestrian detection dataset.
Table 4.
Comparison with the state-of-the-art online MOT systems under the private data protocol on the MOT-16 benchmark. The performance is evaluated with the CLEAR metrics, and runtime is evaluated with three metrics: frames per second of the detector (FPSD), frames per second of the association step (FPSA), and frames per second of the overall system (FPS). ∗ indicates estimated timing. We clearly observe that our method has the best efficiency and a comparable accuracy.

Method | Training data (boxes / IDs) | MOTA ↑ | IDF1 ↑ | MT ↑ | ML ↓ | IDs ↓ | FPSD ↑ | FPSA ↑ | FPS ↑
DeepSORT 2 | 429K / 1.2k | 61.4 | 62.2 | 32.8 | 18.2 | 781 | <∗ | – | <∗
RAR16wVGG | – | – | – | – | – | – | <∗ | – | <∗
TAP | – | – | – | – | – | – | <∗ | – | <∗
CNNMTT | – | – | – | – | – | – | <∗ | – | <∗
POI | – | 66.1 | – | – | – | – | <∗ | – | <∗
JDE (ours) | – | 64.4 | – | – | – | – | – | – | 22.2

The main differences among these methods reside in their embedding models and the association strategies. For instance, DeepSORT 2 employs the Wide Residual Network (WRN) [41] as the embedding model and uses the MARS [46] dataset to train the appearance embedding. RAR16withVGG, TAP, CNNMTT and POI use Inception [34], Mask R-CNN [12], a 5-layer CNN, and QAN [22] as their embedding models, respectively. The training data of these embedding models also differ from each other. For a clear comparison, we list the amount of training data for all these methods in Table 4. Accuracy and speed metrics are also presented.

Considering the overall tracking accuracy, e.g., the MOTA metric, JDE is generally comparable: our result is higher than DeepSORT 2 by +3.0% and lower than POI by 1.7%. In terms of running speed, it is not feasible to directly compare these methods because their runtimes are not all reported. Therefore, we re-implement the VGG-16 based Faster R-CNN detector, benchmark its running speed, and then estimate the running-speed upper bounds of the entire MOT systems for these methods. Note that for some methods the runtime of the embedding model is not taken into account, so the speed upper bounds are far from tight. Even with such relaxed upper bounds, the proposed JDE runs at least 2× faster than existing methods, reaching a near real-time speed, i.e., 22.2 FPS at an input resolution as high as 1088 × 608. With smaller input resolutions, the speed can be further increased to about 40 FPS (cf. the abstract) with only a small accuracy cost (∆ = -2.6% MOTA).

Visualization.
To intuitively show the discriminative ability of the jointly learned embedding, we perform a simple retrieval experiment and visualize the results in Figure 4. We extract the feature of a pedestrian in one video frame as a query and compute the pixel-wise cosine similarity with the feature map of another frame. We compare the retrieval results between using the detection feature map as the feature and using the dense embedding map as the feature; it is clearly observed that the dense embedding yields better correspondence between the query and the target.
Fig. 4.
Visualization of the retrieval performance of the detection feature map and the dense embedding map. Similarity maps are computed as the cosine similarity between the query feature and the target feature map. The jointly learned dense embedding presents good correspondence between the query and the target.
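A minimal sketch of the retrieval visualization described above: a query embedding is compared against a dense D × H × W embedding map of another frame via pixel-wise cosine similarity. The tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_map(query_emb: torch.Tensor, dense_emb: torch.Tensor) -> torch.Tensor:
    """Pixel-wise cosine similarity between a query embedding (D,)
    and a dense embedding map (D, H, W); returns an (H, W) map."""
    q = F.normalize(query_emb, dim=0)                 # (D,)
    m = F.normalize(dense_emb, dim=0)                 # normalize along the channel dim
    return torch.einsum("d,dhw->hw", q, m)            # cosine similarity per location

if __name__ == "__main__":
    D, H, W = 512, 19, 34                             # assumed embedding map size
    query = torch.randn(D)
    target_map = torch.randn(D, H, W)
    sim = similarity_map(query, target_map)
    print(sim.shape, sim.max().item())                # the peak indicates the best match
```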
Analysis and discussions.
One may notice that JDE has a lower IDF1 score and more ID switches than existing methods. At first we suspected the reason to be that the jointly learned embedding might be weaker than a separately learned embedding. However, when we replace the jointly learned embedding with a separately learned embedding, the IDF1 score and the number of ID switches remain almost the same. Finally, we find that the major reason lies in inaccurate detections when multiple pedestrians have large overlaps with each other. Such inaccurate boxes introduce many ID switches, and, unfortunately, such ID switches often occur in the middle of a trajectory, hence the lower IDF1 score. It remains for future work to improve JDE so that it predicts more accurate boxes when pedestrian overlaps are significant.
5 Conclusion

In this paper, we introduce JDE, an MOT system that allows target detection and appearance features to be learned in a shared model. Our design significantly reduces the runtime of an MOT system, making it possible to run at a (near) real-time speed. Meanwhile, the tracking accuracy of our system is comparable with state-of-the-art online MOT methods. Moreover, we have provided thorough analysis, discussion and experiments about good practices and insights for building such a joint learning framework. In the future, we will investigate the time-accuracy trade-off more deeply.
References
1. Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. arXiv preprint arXiv:1903.05625 (2019)
2. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. Journal on Image and Video Processing, 1 (2008)
3. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP (2016)
4. Brasó, G., Leal-Taixé, L.: Learning a neural solver for multiple object tracking. In: CVPR (2020)
5. Cai, Z., Vasconcelos, N.: Cascade R-CNN: Delving into high quality object detection. In: CVPR (2018)
6. Choi, W.: Near-online multi-target tracking with aggregated local flow descriptor. In: ICCV (2015)
7. Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: CVPR19 tracking and detection challenge: How crowded can it get? arXiv preprint arXiv:1906.04567 (2019)
8. Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: CVPR (2009)
9. Ess, A., Leibe, B., Schindler, K., van Gool, L.: A mobile vision system for robust multi-person tracking. In: CVPR (2008)
10. Fang, K., Xiang, Y., Li, X., Savarese, S.: Recurrent autoregressive networks for online multi-object tracking. In: WACV (2018)
11. Girshick, R.: Fast R-CNN. In: ICCV (2015)
12. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
14. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017)
15. Jiang, H., Fels, S., Little, J.J.: A linear programming approach for multiple object tracking. In: CVPR (2007)
16. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: CVPR (2018)
17. Kim, C., Li, F., Ciptadi, A., Rehg, J.M.: Multiple hypothesis tracking revisited. In: ICCV (2015)
18. Law, H., Deng, J.: CornerNet: Detecting objects as paired keypoints. In: ECCV (2018)
19. Leal-Taixé, L., Fenzi, M., Kuznetsova, A., Rosenhahn, B., Savarese, S.: Learning an image-based motion context for multiple people tracking. In: CVPR (2014)
20. Li, J., Gao, X., Jiang, T.: Graph networks for multiple object tracking. In: CVPR (2020)
21. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
22. Liu, Y., Yan, J., Ouyang, W.: Quality aware network for set to set recognition. In: CVPR (2017)
23. Mahmoudi, N., Ahadi, S.M., Rahmati, M.: Multi-target tracking using CNN-based features: CNNMTT. Multimedia Tools and Applications 78