Simple Unsupervised Multi-Object Tracking

Abstract

Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus entirely sidestepping the labeling costs required for training. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT and then trains a ReID network to predict the generated labels using cross-entropy loss. We demonstrate that SimpleReID performs substantially better than simpler alternatives, and that it recovers the full performance of its supervised counterpart consistently across diverse tracking frameworks. These observations are unusual because unsupervised ReID is not expected to excel in crowded scenarios with occlusions and drastic viewpoint changes. By incorporating our unsupervised SimpleReID with CenterTrack trained on augmented still images, we establish a new state of the art on popular datasets like MOT16/17 without using tracking supervision, beating the current best (CenterTrack) by 0.2-0.3 MOTA and 4.4-4.8 IDF1 scores. We further provide evidence of limited scope for improvement in IDF1 scores beyond our unsupervised ReID in the studied settings. Our investigation invites reconsideration of the shift towards more sophisticated, supervised, end-to-end trackers by showing promise in simpler unsupervised alternatives.


Simple Unsupervised Multi-Object Tracking

Shyamgopal Karthik, Ameya Prabhu, Vineet Gandhi

Center for Visual Information Technology, Kohli Center on Intelligent Systems, IIIT Hyderabad, India
University of Oxford

{ shyamgopal.karthik@research,vgandhi@ } [email protected]


Keywords: Multi-Object Tracking, Re-Identification, Unsupervised Learning

arXiv preprint [cs.CV], Jun.

Understanding human interactions and behaviour over videos has been a fundamental problem in computer vision, with applications in action recognition, sports video analytics, and assistive tech, and it requires tracking multiple people over time. Multi-object trackers broadly consist of two key components: (i) a spatio-temporal association model, which associates boxes in nearby frames to create clusters of tracklets, and (ii) a re-identification model, which associates tracklets over larger windows to deal with complexities in tracking such as occlusions and target interactions. Re-identification is a major challenge in tracking, with sophisticated supervised approaches requiring expensive annotations to assign trajectories across frames to every single person in a video. Availability of labeled datasets [36,37] has alleviated the problem. For instance, IDF1 (MOTA) scores have improved from 51.3 (48.8) [49] to 59.9 (55.9) [5] on the MOT16 [37] benchmark in the past 3 years.

There has been a growing need to annotate larger tracking datasets with the aim of improving re-identification (ReID) models. However, annotating tracking datasets requires hefty labeling costs and scales poorly with dataset size. To illustrate the effort and cost required, annotating 6 minutes' worth of video of the MOT15 benchmark [27] using the standard annotation procedures would take at least 22 hours of annotation time [36]. Annotating just twenty-six hours of video data (VIRAT dataset [39]) with state-of-the-art protocols in place [39,50] costs tens of thousands of dollars. We propose to learn our model in an unsupervised manner in the free-labels paradigm (Section 6.3.2 in [21]) in two steps. We first generate tracking labels given unlabeled videos and the corresponding set of detections. Then, we learn a ReID network to predict the generated label given an input image.

To the best of our knowledge, ours is the first work to propose unsupervised ReID models for multi-object tracking and completely do away with the tremendous annotation costs for tracking datasets. Throughout the paper, we consider supervision only in the context of sidestepping trajectory-level annotations. Using off-the-shelf detectors [41,40,7] trained on COCO is not viewed as supervision in our context. The proposed ReID network complements the unsupervised spatio-temporal association models [53,1] proposed in the prior art, leading to a more complete unsupervised tracking framework.

We go one step further and aim to test the limits of our unsupervised tracking paradigm. We empirically test for two desiderata w.r.t. IDF1 scores: (i) our unsupervised ReID should perform significantly better than naive ReID methods when incorporated into any tracker; (ii) our unsupervised ReID should achieve performance equivalent to the original supervised counterpart. We demonstrate that we are able to achieve these desiderata consistently across datasets, detectors, and diverse trackers. The resultant unsupervised tracker, when combined with CenterTrack [69] trained on single images, achieves state-of-the-art performance on the MOT16/17 test challenge server. We beat the latest supervised trackers by large margins, outperforming CenterTrack by 0.3 MOTA and 4.8 IDF1 scores. We then demonstrate that there is limited scope for further improvement beyond our proposed unsupervised ReID by showing that the Oracle counterpart of our ReID model makes only minor gains.

We would also like to highlight that while our work is conceptually simple, the contributions made are significant. We expect our investigation to be of significant interest to the MOT community by demonstrating that simple unsupervised ReID is sufficient even in crowded scenarios with occlusions and person interactions. Our investigation contrasts with the current shift towards using more supervised, end-to-end trackers for MOT Challenge datasets. We hope our work spurs research in the unsupervised MOT paradigm, exploring extensions to other tracking scenarios (3D/vehicles/pose tracking), and does away with the labeling effort wherever unnecessary.

Monocular 2D multi-object tracking on videos is an extensively studied problem. [13] offers a comprehensive review of works on MOT Challenge datasets. A popular paradigm is to model the detections as a graph. Various approaches have been proposed here, including network flows [62], graph cuts [49], MCMC [60], and minimum cliques [61] if the entire video is provided beforehand (batch processing). In scenarios where we get frame-by-frame input, Hungarian matching [53,3], greedy matching [69], and Recurrent Neural Networks [15,43] are popularly used models for sequential prediction (online processing). The association metrics/cost functions used by these consist of (i) spatio-temporal relations and (ii) re-identification.

Spatiotemporal relations:

There has been much investigation into appearance-free methods for spatio-temporal association. Basic methods include using Intersection-over-Union (IoU) between detections [4] or incorporating a velocity model using a Kalman filter [3]. The velocity model can also be learned using Recurrent Neural Networks [15,43]. The complexity of assigning pairwise costs can be further increased by incorporating additional cues from head/joint detectors [6,19], segmentation [38], activity recognition [10], or keypoint trajectories [9]. Recent approaches leverage appearance-reliant pre-trained bounding box regressors from object detection [1] or single object tracking [56,11] pipelines to regress the bounding box in the next frame. Since most of the above models are unsupervised (requiring no tracking annotations), they complement our work and can be incorporated with our proposed approach for creating efficacious unsupervised trackers.
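As a rough illustration of the constant-velocity model mentioned above, the following sketch runs a Kalman filter on a single box coordinate. It is a simplification made for illustration only: the class name, the scalar position/velocity state, and the noise values `q` and `r` are assumptions, and a tracker such as SORT [3] filters a richer bounding-box state.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKalman1D:
    """Kalman filter over the state [position, velocity] of one box coordinate."""
    pos: float
    vel: float = 0.0
    p00: float = 1.0    # variance of the position estimate
    p01: float = 0.0    # covariance between position and velocity
    p11: float = 100.0  # variance of the velocity estimate (unknown at start)
    q: float = 0.01     # process noise added per frame (assumed value)
    r: float = 1.0      # measurement noise of the detector (assumed value)

    def predict(self) -> float:
        """Advance one frame: x <- F x, P <- F P F^T + Q with F = [[1, 1], [0, 1]]."""
        self.pos += self.vel
        p00 = self.p00 + 2.0 * self.p01 + self.p11 + self.q
        p01 = self.p01 + self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p11 = p00, p01, p11
        return self.pos

    def update(self, z: float) -> None:
        """Correct the state with an observed position z (H = [1, 0])."""
        s = self.p00 + self.r                 # innovation variance
        k0, k1 = self.p00 / s, self.p01 / s   # Kalman gain
        y = z - self.pos                      # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


def track_coordinate(measurements):
    """Filter a sequence of per-frame positions and return the final filter."""
    kf = ConstantVelocityKalman1D(pos=measurements[0])
    for z in measurements[1:]:
        kf.predict()
        kf.update(z)
    return kf
```

Calling `track_coordinate` on positions that advance by one unit per frame yields a velocity estimate close to one, so the next `predict()` extrapolates the box to where the following detection is expected.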

ReID across multiple cameras:

Supervised training of CNNs [68] on large labeled datasets [65,30] has given excellent results for ReID across multiple cameras. In addition to this, there have been approaches to exploit the pose information using off-the-shelf body pose detectors [47,48]. Attention mechanisms have also been explored to capture the important regions in the foreground [45,46]. Generative models have been employed to augment the training data for improved performance [66,31]. We recommend this excellent survey [58] for a complete review. In contrast, we work on tracking with a single camera, with reasonable frame rates (no drastic appearance variations). Additionally, the objective is to distinguish the target pedestrian among a small set of different-looking pedestrians in a given frame, with the aid of additional detection information. Hence, we believe our simple, noisy unsupervised re-identification model might suffice. Sophisticated unsupervised ReID networks [32,29] designed for multi-camera ReID may not be required for MOT.

ReID for monocular 2D tracking:

Re-identification has been a major challenge in tracking, with matching using similarity between CNN features being the dominant approach [42]. Past works have proposed different methods to train the CNN, ranging from siamese networks [26] with triplet loss, further augmented by hard negative mining [1], to other metric learning losses like cosine loss [53]. Others incorporate a combination of loss functions [34] or pose information [49], or fine-tune the ReID model on the test sequence [34]. All the above ReID networks are supervised and fairly complex to train. We are the first to demonstrate that simple unsupervised ReID networks are sufficient in this context. It is important to note that in most MOT pipelines, this is the only component that uses tracking annotations.

Evaluation metric for MOT:

Multi-Object Tracking Accuracy (MOTA) is not a good metric to illustrate ReID performance because it focuses on object coverage and is therefore dominated by false negatives. An excellent detector can achieve high MOTA scores despite being a poor tracker with a large number of ID switches [69]. Identity-F1 (IDF1) measures long consistent tracks without switches and has been widely shown [35,13] to be a better metric for tracking performance. We accordingly focus on IDF1 scores.
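To make the contrast between the two metrics concrete, the sketch below evaluates their standard definitions on hypothetical counts. The function names and all numbers are illustrative assumptions and are not taken from our experiments: two trackers share the same detections, but the second one fragments identities far more often.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """MOTA = 1 - (FN + FP + IDSw) / number of ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_boxes


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)


# Hypothetical sequence with 1000 ground-truth boxes, 100 misses, 50 false alarms.
# Tracker A keeps identities consistent; tracker B switches identities often.
mota_a = mota(100, 50, id_switches=5, num_gt_boxes=1000)
mota_b = mota(100, 50, id_switches=40, num_gt_boxes=1000)
idf1_a = idf1(idtp=850, idfp=100, idfn=150)
idf1_b = idf1(idtp=600, idfp=350, idfn=400)

print(f"MOTA: A={mota_a:.3f}  B={mota_b:.3f}")
print(f"IDF1: A={idf1_a:.3f}  B={idf1_b:.3f}")
```

With these counts the MOTA difference between the two trackers is a few points, while the IDF1 difference is far larger, which is why IDF1 is the more informative metric for comparing ReID models.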

End-to-end supervised MOT:

Recent works circumvent the above paradigm either partially or completely by learning the MOT solver using end-to-end supervision. Early works [51,44] performed end-to-end learning in the min-cost flow data association framework. Recently, approaches like [56] and [5] perform end-to-end optimization by introducing differentiable forms of Hungarian matching and clustering formulation, respectively. Parallel works [69,64,52] attempt to perform simultaneous object detection, data association, and sometimes re-identification in a single network. Most notable among these, CenterTrack [69] is capable of training the detector using only augmentations of still images. These methods involving joint detection and tracking deliver high performance at real-time inference speeds but require high annotation costs. Our work differs in principle by removing and replacing supervised components yet outperforming these trackers, without incurring the associated labelling cost.

Our goal is to leverage the abundance of unlabeled videos to learn ReID models (without manual cost). Our unsupervised learning method can be categorized as learning by generating labels (see Section 6.3.2 of [21]). In a nutshell, given unlabeled videos and corresponding bounding boxes, we first generate tracking labels. We then learn a ReID network by predicting the generated label given a detection.

Here, we describe the two parts of our proposed framework in detail: (i) generating the labels, and (ii) learning the network.

Generating labels:

Given a set of videos, each video is passed independently through an object detector. An unsupervised spatiotemporal association model from the list given in Table 1 (left) is then run through the detections to obtain short contiguous tracks or tracklets (sets of associated detections of the same person over time). Examples of spatiotemporal models range from tracking using a constant velocity assumption with Kalman filtering [3] (bounding box information only) to incorporating appearance features by using pre-trained bounding box regression from object detection pipelines to regress the bounding box [1] in the next frame. To cluster/associate detections, we can use online methods like greedy/Hungarian matching or expensive offline methods like graph cuts. Ultimately, the output of this step is a set of noisy track labels for each video, resulting in a pool of labeled video tracklets.

Model                                  Ref    Training Strategy                   Ref
Kalman filter + Hungarian matching     [3]    Cross-entropy                       [49]
IoU-based tracking                     [4]    Triplet + hard negative mining      [1]
Network Flow                           [62]   Contrastive                         [25]
Linear Programming                     [28]   SymTriplet                          [63]
Conditional Random Fields (CRFs)       [38]   Cosine Loss                         [53]
Markov Decision Processes (MDPs)       [54]   Joint Detections                    [49]
Recurrent Neural Networks (RNNs)       [43]   Verification + Classification Loss  [34]
Bounding Box Regression                [1]

Table 1. Approaches used for spatiotemporal data association (left). Loss functions and methods used to train CNNs for appearance modeling (right). We choose the simplest approach for both these components.
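The label-generation step can be sketched with the simplest entry of this family, greedy IoU matching between consecutive frames. This is a stand-in chosen for brevity and is not SORT itself, which adds a Kalman motion model and Hungarian matching; the function names and the threshold of 0.3 are assumptions made for this example.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def generate_tracklet_labels(frames, iou_threshold=0.3):
    """Assign a tracklet id to every detection of one video.

    frames: list over time of lists of boxes (x1, y1, x2, y2).
    Returns a list of the same shape holding integer tracklet ids.
    """
    next_id = 0
    active = {}  # tracklet id -> box it was matched to in the previous frame
    labels = []
    for boxes in frames:
        # All (overlap, tracklet, detection) candidates, best overlap first.
        pairs = sorted(
            ((iou(prev, box), tid, j)
             for tid, prev in active.items()
             for j, box in enumerate(boxes)),
            reverse=True,
        )
        frame_ids = [None] * len(boxes)
        used = set()
        for score, tid, j in pairs:
            if score < iou_threshold:
                break  # remaining candidates overlap even less
            if tid in used or frame_ids[j] is not None:
                continue
            frame_ids[j] = tid
            used.add(tid)
        # Unmatched detections start new tracklets.
        for j in range(len(boxes)):
            if frame_ids[j] is None:
                frame_ids[j] = next_id
                next_id += 1
        # Tracklets without a match in this frame are terminated.
        active = dict(zip(frame_ids, boxes))
        labels.append(frame_ids)
    return labels
```

Because a tracklet ends as soon as its detection is missed, a person who reappears after a missed frame receives a fresh id. This is one source of the label noise discussed in the text.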

Training ReID models:

Now, given noisy track labels per video, the task is to learn a ReID model using any of the methods given in Table 1 (right). In the absence of trajectory-level supervision, the challenge here is to explore ways to harness the given regularities in data (in the form of tracklets). There are two simple assumptions which can help the cause: (i) the videos are independent of each other (i.e., no common tracks between any two videos), and (ii) the tracklets within a video are independent of each other (i.e., each tracklet belongs to a different person). If both assumptions hold, then each tracklet can be considered an independent class. The simplest option which follows is to train a network to predict a label given an image, optimized with cross-entropy loss (with the number of classes equal to the number of tracklets). However, assumption (ii) may break in cases like missed detections and occlusions, which may result in multiple tracklets for the same person in a video.

An alternate option (relaxing assumption (ii)) could be to form positive pairs from the same tracklet and negative pairs from across other videos or from simultaneous tracks in the same video. Such pairing can enable learning Siamese networks to compare two images and predict whether they are the same person or not. They can be trained with pairwise losses such as contrastive loss [25] or triplet loss with hard-negative mining [1], or more complex ones like SymTriplet [63] or the group loss [14], resulting in a trained ReID network.
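The pairing rule of the alternate option can be written down directly. The sketch below is illustrative only: the `Detection` record and the function name are assumptions, and pairs from non-overlapping tracklets of the same video are skipped because they may or may not show the same person.

```python
from itertools import combinations
from typing import NamedTuple


class Detection(NamedTuple):
    video: str     # video the crop comes from
    tracklet: int  # tracklet id generated within that video
    frame: int     # frame index of the crop


def build_pairs(detections):
    """Return (positives, negatives) lists of detection pairs."""
    # Frames covered by each (video, tracklet), used to find simultaneous tracks.
    frames_of = {}
    for d in detections:
        frames_of.setdefault((d.video, d.tracklet), set()).add(d.frame)

    positives, negatives = [], []
    for a, b in combinations(detections, 2):
        key_a, key_b = (a.video, a.tracklet), (b.video, b.tracklet)
        if key_a == key_b:
            positives.append((a, b))   # same tracklet: same person
        elif a.video != b.video:
            negatives.append((a, b))   # assumption (i): videos are independent
        elif frames_of[key_a] & frames_of[key_b]:
            negatives.append((a, b))   # tracks co-occur in a frame: different people
        # Otherwise: same video, disjoint in time. Ambiguous, so the pair is skipped.
    return positives, negatives
```

The resulting pairs could then feed a contrastive or triplet loss; enumerating all pairs is quadratic in the number of detections, so a practical version would sample pairs instead.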

[Figure 1 diagram: Video (input) → (i) Generating labels with SORT → Noisy tracklets → (ii) Training ReID models with cross-entropy → Trained ResNet-50 (output)]

Fig. 1. Overview of our approach: Given a video with detections, we use SORT [3] to simulate noisy tracking labels. Then, we train the ReID network (ResNet50) to predict the track label for each input image.

We use simple methods to both simulate labels and learn the ReID network, as illustrated in Figure 1. In step (i), we only utilize the bounding boxes and use Kalman filtering combined with Hungarian matching to simulate labels. Since we use no appearance information, our tracking labels are noisy. In step (ii), we proceed by making both the aforementioned assumptions that no two videos or tracklets share common labels. We assign a unique label to each tracklet and train a network with cross-entropy loss to predict this label given any image from that tracklet. At inference time, we integrate our ReID model into existing frameworks by simply replacing their models with ours, with no other changes. In CenterTrack, we extract tracks using its unsupervised model and refine them with our ReID network using a DeepSORT framework. Although we are aware that some enhancements can be made to our proposed process (e.g., using a siamese framework), we show in subsequent sections that simpler choices alone are sufficient to match the performance of supervised networks.
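Step (ii) reduces to ordinary classification once every tracklet has a class index. The following is a minimal sketch of that bookkeeping and of the per-sample loss; the helper names and the dictionary-based input format are assumptions, and the actual training runs a ResNet50 through a deep learning framework rather than plain Python.

```python
import math


def assign_class_labels(tracklets_per_video):
    """Map every (video, tracklet id) to a unique class index.

    tracklets_per_video: dict from video name to an iterable of tracklet ids.
    Ids may repeat across videos; both assumptions make them distinct classes.
    """
    class_of = {}
    for video in sorted(tracklets_per_video):
        for tid in sorted(set(tracklets_per_video[video])):
            class_of[(video, tid)] = len(class_of)
    return class_of


def cross_entropy(logits, target):
    """Softmax cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


classes = assign_class_labels({"video_a": [0, 1], "video_b": [0, 1, 2]})
num_classes = len(classes)  # size of the classification layer
uniform_loss = cross_entropy([0.0] * num_classes, classes[("video_b", 0)])
print(num_classes, round(uniform_loss, 4))
```

An untrained network with uniform predictions has loss log(number of tracklets); training lowers it by making crops from the same tracklet score highest for their own class.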

In a nutshell, in this section we incorporate our unsupervised ReID model (SimpleReID) into various trackers and show compelling evidence for three results: (i) our unsupervised tracker obtains state-of-the-art tracking performance on MOT16/17, outperforming recent works; (ii) naive unsupervised trackers can replace their supervised counterparts consistently; (iii) there is limited scope for improvement beyond our unsupervised ReID complemented with better detectors in the settings we tested.

We evaluate our performance on the standard multi-object tracking benchmark, MOT Challenge, which consists of several challenging pedestrian tracking sequences with frequent occlusions and crowded scenes, with sequences varying in their angle of view, size of objects, camera motion, and frame rate. It contains two challenging tracking benchmarks, namely MOT16 and MOT17 [37]. They both share the same training and testing sequences, but MOT16 provides only DPM [16] detections, whereas MOT17 provides two additional sets of public detections (namely Faster R-CNN [41] and SDP [57]) and has more accurate ground truth. The primary metrics used for measuring performance are MOTA [2] and IDF1, which are a combination of simpler metrics like False Positives, False Negatives, and ID Switches.

Implementation details: We obtain our SimpleReID model by training a ResNet50 [17] backbone, which is popularly used by trackers, for a fair comparison. We train the model with tracklets generated by SORT [3] on the PathTrack [36] dataset to test generalization to unseen MOT16/17 data. We perform analysis studies on the entire training dataset and report results on the MOT Challenge hidden test set. Our model was implemented using PyTorch and Torchreid [67] and trained on a GTX1080Ti GPU. For any tracker used [53,1], we utilize the implementations provided by the authors, leaving all the hyperparameters unchanged and simply replacing their supervised ReID model with SimpleReID. We use the CenterTrack model trained on augmented single images and incorporate the SimpleReID model using the DeepSORT framework. Our code and pretrained models will be released upon acceptance of the paper.

We submit our best performing unsupervised tracker to the MOT Challenge Benchmark. The submitted tracker consists of our proposed SimpleReID model incorporated with CenterTrack [69] for bounding box regression using public detections. We compare the performance on the MOT Challenge test set with state-of-the-art supervised trackers and provide results in Table 2. Surprisingly, we observe that our unsupervised tracker outperforms all supervised trackers on MOT16/17, setting a new state of the art in terms of MOTA and IDF1 scores among all trackers on public detections.

We beat the previous best tracker (CenterTrack) by 0.2/0.3 MOTA and 4.4/4.8 IDF1 scores on MOT16/MOT17, respectively. The significant increase in IDF1 score can be entirely attributed to the efficacy of our SimpleReID model: while CenterTrack is a good detector, it cannot maintain long tracks, which is compensated for by using our appearance features for re-identification. We reduce the ID switches made by CenterTrack by nearly 3x, achieving the lowest ID switches among online trackers.

Past literature [49,34] indicates that unsupervised ReID is unlikely to excel in crowded scenarios due to the complexities of tracking in such scenes. In this subsection, we provide two sets of evidence to demonstrate that SimpleReID indeed performs well across diverse scenarios: (i) we show that the test performance of SimpleReID (on unseen videos) is equivalent to that of a supervised ReID model on its training set itself; (ii) we show that SimpleReID achieves the above desiderata even with simple trackers which are highly reliant on the ReID component.

The MOT Challenge web page: https://motchallenge.net.

[Table 2 body. Columns: Detector, Method, Published, Unsup, MOTA ↑, IDF1 ↑, IDSw ↓, FP ↓, FN ↓. Surviving rows: MOT16, batch: GCRA [33] (ICME18); MOT16, online: AMIR [43] (ICCV17), DMAN [70] (ECCV18), Ours (unsupervised; surviving values 588, 5909, 61981); MOT17, batch: MHT [23] (CVPR15); MOT17, online: FAMNet [12] (ICCV19), Tracktor++v2 [1] (ICCV19), Ours (unsupervised). Two further values, 431 and 7086, survive in the MOT16 block without a recoverable row.]

Table 2. Results on the MOT Challenge test set benchmark using public detections. Unsup indicates that the approach does not need supervision (no tracking labels required). * are recent parallel works. Up/down arrows indicate higher/lower is better.

Limits of unsupervised ReID:

We test the limits of SimpleReID by comparing the performance of our model with supervised models. We perform experiments across various weaker scenarios, such as having no ReID or using an ImageNet-pretrained network as-is, and show that these perform significantly worse than SimpleReID, showing that SimpleReID is important to match supervised performance. We first train another recent supervised tracker, Tracktor++v2 [1], which uses bounding box regression along with a supervised ReID model to predict the position of an object in the next frame. We train the supervised ReID model on the training data for MOT16/MOT17 and then benchmark the performance on the same training set. In contrast, this data is new to our SimpleReID models, i.e., they have not seen these videos previously. Our experimental results are tabulated in Table 3. We observe that using ImageNet-pretrained ReID improves IDF1 scores only marginally (by at most 0.1) compared to using no ReID network at all, and fails to achieve the upper bound by a considerable margin. Our SimpleReID approach successfully recovers the remaining performance gap. This is achieved consistently across different variations.

Dataset   Detector   ReID         MOTA ↑   IDF1 ↑
MOT16     DPM        None         57.6     62.0
MOT16     DPM        ImageNet     57.6     62.0
MOT16     DPM        Ours         57.6
MOT16     DPM        Supervised   57.6
MOT16     POI        None         68.3     67.6
MOT16     POI        ImageNet     68.3     67.7
MOT16     POI        Ours         68.5
MOT16     POI        Supervised   68.5
MOT17     FRCNN      None         61.6     64.6
MOT17     FRCNN      ImageNet     61.6     64.7
MOT17     FRCNN      Ours         61.7
MOT17     FRCNN      Supervised   61.7
MOT17     POI        None         68.5     67.6
MOT17     POI        ImageNet     68.5     67.6
MOT17     POI        Ours         68.6
MOT17     POI        Supervised   68.6

Table 3. Ablation study comparing the performance of different ReID models within the Tracktor [1] framework. We observe that our unsupervised SimpleReID achieves the same performance (IDF1 scores) as supervised ReID. DPM, FRCNN and POI correspond to different detectors.

ReID-reliant unsupervised tracking:

Due to the low dependence of Tracktor on its ReID model, one may argue that it might not be the best framework for evaluating ReID models in tracking. Hence, we also perform the same experiments on a popular tracker, DeepSORT [53], that is highly reliant on the ReID model used, since the only visual features it receives are from the ReID network. We replace the supervised ReID model used in DeepSORT with different ReID methods and tabulate results in Table 4. First, we observe that replacing supervised ReID with random features causes a severe drop in performance relative to the supervised counterpart, with the MOTA score decreasing by 9.4 points and IDF1 decreasing by 31.3 points, demonstrating the degree of reliance on ReID in the DeepSORT framework. When substituted with features from an ImageNet-pretrained ResNet, we get a similar result: a significant improvement over SORT, yet much lower than supervised ReID performance. We further benchmark with a supervised ReID model trained on the Market1501 dataset [65] and observe lower performance compared to the ImageNet-pretrained model, indicating that features learned for cross-camera person-ReID datasets without trajectory annotations do not transfer to multi-object tracking. Lastly, we observe that our unsupervised SimpleReID covers the remaining performance gap, as seen above.

Dataset     ReID         MOTA ↑   IDF1 ↑
MOT16-POI   No ReID      58.1     57.1
MOT16-POI   Random       51       34.6
MOT16-POI   ImageNet     60.3     62
MOT16-POI   Market1501   60.3     61.5
MOT16-POI   Ours         60.5
MOT16-POI   Supervised   60.4
MOT17-POI   No ReID      57.9     56.9
MOT17-POI   Random       50.7     34.3
MOT17-POI   ImageNet     59.9     61.6
MOT17-POI   Market1501   59.9     61.1
MOT17-POI   Ours         60.1
MOT17-POI   Supervised   60

Table 4. Ablation study comparing the performance of different ReID models within the DeepSORT [53] framework. We observe that our unsupervised SimpleReID achieves the same performance (IDF1 scores) as supervised ReID.

                     SimpleReID          Oracle ReID+Kill+MM
Detector             MOTA ↑   IDF1 ↑     MOTA ↑   IDF1 ↑       IDF1 Gain
YOLOv3 [40]          56.5     62.5       61.5     66.1         3.6
DPM [16]             58.5     62.9       62.4     66.3         3.4
Faster-RCNN [41]     61.7     65.2       65.5     68.5         3.3
HTC [7]              67.7     68.1       75.6     70.5         2.4
SDP [57]             67.7     68.1       73.0     70.6         2.5
POI [59]             68.6     69.4       73.5     71.4         2.0

Table 5. Ablation study comparing the difference between the performance of SimpleReID and the Oracle ReID across detectors on MOT17. We observe that the difference decreases from 3.6 to 2.0 with improved detectors.
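To illustrate why a tracker that sees appearance only through ReID features depends so heavily on their quality, the sketch below associates tracks and detections purely by cosine distance between feature vectors. It is not DeepSORT's implementation: the function names, the distance gate of 0.2, and the brute-force search (a stand-in for Hungarian matching that is only practical for a handful of boxes) are all assumptions made for this example.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def match_by_appearance(track_feats, det_feats, max_distance=0.2):
    """Return {track index: detection index} minimising total cosine distance.

    Every one-to-one assignment is enumerated, so this is exact but only
    usable for small inputs. Matches above max_distance are discarded.
    """
    n_t, n_d = len(track_feats), len(det_feats)
    cost = [[cosine_distance(t, d) for d in det_feats] for t in track_feats]
    if n_t <= n_d:
        candidates = (dict(zip(range(n_t), p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = (dict(zip(p, range(n_d))) for p in permutations(range(n_t), n_d))
    best, best_cost = {}, math.inf
    for assignment in candidates:
        total = sum(cost[t][d] for t, d in assignment.items())
        if total < best_cost:
            best, best_cost = assignment, total
    return {t: d for t, d in best.items() if cost[t][d] <= max_distance}
```

If the features are random, the distances carry no identity information and the assignment is effectively arbitrary, which is consistent with the large IDF1 drop of the Random row in Table 4.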

Scope for improvement in ReID:

We further explore the best performance achievable by a ReID network using the Tracktor framework and explore the scope for further improvement of our SimpleReID. To obtain the best possible performance, we test Tracktor with an Oracle ReID [1] and observe that there is a 3.3 IDF1 score gap between SimpleReID and the Oracle. We repeat the same experiment with the latest off-the-shelf detectors and tabulate the results in Table 5. We observe that with modern detectors, the gap between SimpleReID and the corresponding oracles is small enough to limit the scope for further improvement.

Overall, we conclude that unsupervised SimpleReID counterintuitively matches the limiting performance of supervised counterparts in difficult MOT scenarios, by leveraging only unlabeled videos. Since our model works in extreme cases such as DeepSORT, where tracking is entirely reliant on the ReID model for encoding appearance information, we expect that the efficacy of SimpleReID will generalize to other trackers as well. We demonstrated the potential of unsupervised trackers by outperforming all supervised MOT16/17 trackers, setting a new state of the art in MOTA and IDF1 scores and performing close to the optimal ReID. If it is indeed generalizable, we believe that this work has significant implications for research in supervised ReID for tracking.
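The IDF1 gains in Table 5 follow directly from the two IDF1 columns. The short sketch below recomputes them from the tabulated values; the variable and function names are illustrative.

```python
# (SimpleReID IDF1, Oracle IDF1) per detector, copied from Table 5.
table5_idf1 = {
    "YOLOv3": (62.5, 66.1),
    "DPM": (62.9, 66.3),
    "Faster-RCNN": (65.2, 68.5),
    "HTC": (68.1, 70.5),
    "SDP": (68.1, 70.6),
    "POI": (69.4, 71.4),
}


def idf1_gains(rows):
    """Oracle IDF1 minus SimpleReID IDF1, rounded to one decimal place."""
    return {det: round(oracle - ours, 1) for det, (ours, oracle) in rows.items()}


gains = idf1_gains(table5_idf1)
for detector, gain in gains.items():
    print(f"{detector:12s} {gain:.1f}")
```

The gap shrinks from 3.6 with YOLOv3 to 2.0 with POI, and the Faster-RCNN entry reproduces the 3.3 gap quoted in the text.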

We propose the first step in the direction of developing unsupervised re-identification for MOT and demonstrate that our simple approach performs on par with supervised counterparts across diverse setups. When combined with recent unsupervised association models [56,1], we obtain accurate unsupervised trackers. The tracker we submit ranks first in the MOT Challenge, beating all the latest supervised approaches. Our investigation suggests reconsidering whether the shift towards more complex, supervised, end-to-end MOT models is necessary. We hope our work is useful to sidestep high annotation costs otherwise thought to be necessary to feed the data-hungry supervised trackers.

References

1. Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: ICCV. pp. 941–951 (2019)
2. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 1–10 (2008)
3. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP. pp. 3464–3468 (2016)
4. Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: AVSS. pp. 1–6 (2017)
5. Brasó, G., Leal-Taixé, L.: Learning a neural solver for multiple object tracking. In: CVPR (2020)
6. Chari, V., Lacoste-Julien, S., Laptev, I., Sivic, J.: On pairwise costs for network flow multi-object tracking. In: CVPR. pp. 5537–5545 (2015)
7. Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al.: Hybrid task cascade for instance segmentation. In: CVPR. pp. 4974–4983 (2019)
8. Chen, L., Ai, H., Zhuang, Z., Shang, C.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: ICME. pp. 1–6 (2018)
9. Choi, W.: Near-online multi-target tracking with aggregated local flow descriptor. In: ICCV. pp. 3029–3037 (2015)
10. Choi, W., Savarese, S.: A unified framework for multi-target tracking and collective activity recognition. In: ECCV. pp. 215–230 (2012)
11. Chu, P., Fan, H., Tan, C.C., Ling, H.: Online multi-object tracking with instance-aware tracker and dynamic model refreshment. In: WACV. pp. 161–170 (2019)
12. Chu, P., Ling, H.: Famnet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In: ICCV. pp. 6172–6181 (2019)
13. Ciaparrone, G., Sánchez, F.L., Tabik, S., Troiano, L., Tagliaferri, R., Herrera, F.: Deep learning in video multi-object tracking: A survey. Neurocomputing (2020)
14. Elezi, I., Vascon, S., Torcinovich, A., Pelillo, M., Leal-Taixé, L.: The group loss for deep metric learning. arXiv preprint arXiv:1912.00385 (2019)
15. Fang, K., Xiang, Y., Li, X., Savarese, S.: Recurrent autoregressive networks for online multi-object tracking. In: WACV. pp. 466–475 (2018)
16. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR. pp. 1–8 (2008)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
18. Henschel, R., Leal-Taixé, L., Cremers, D., Rosenhahn, B.: Improvements to frank-wolfe optimization for multi-detector multi-object tracking. arXiv preprint arXiv:1705.08314 (2017)
19. Henschel, R., Leal-Taixé, L., Cremers, D., Rosenhahn, B.: Fusion of head and full-body detectors for multi-object tracking. In: CVPR-W. pp. 1428–1437 (2018)
20. Huang, P., Han, S., Zhao, J., Liu, D., Wang, H., Yu, E., Kot, A.C.: Refinements in motion and appearance for online multi-object tracking. arXiv preprint arXiv:2003.07177 (2020)
21. Jing, L., Tian, Y.: Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162 (2019)
22. Keuper, M., Tang, S., Andres, B., Brox, T., Schiele, B.: Motion segmentation & multiple object tracking by correlation co-clustering. TPAMI (1), 140–153 (2018)
23. Kim, C., Li, F., Ciptadi, A., Rehg, J.M.: Multiple hypothesis tracking revisited. In: ICCV. pp. 4696–4704 (2015)
24. Kim, C., Li, F., Rehg, J.M.: Multi-object tracking with neural gating using bilinear lstm. In: ECCV. pp. 200–215 (2018)
25. Kim, M., Alletto, S., Rigazio, L.: Similarity mapping with enhanced siamese network for multi-object tracking. arXiv preprint arXiv:1609.09156 (2016)
26. Leal-Taixé, L., Canton-Ferrer, C., Schindler, K.: Learning by tracking: Siamese cnn for robust target association. In: CVPR-W. pp. 33–40 (2016)
27. Leal-Taixé, L., Milan, A., Reid, I., Roth, S., Schindler, K.: Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942 (2015)
28. Leal-Taixé, L., Pons-Moll, G., Rosenhahn, B.: Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. In: ICCV-W. pp. 120–127 (2011)
29. Li, M., Zhu, X., Gong, S.: Unsupervised person re-identification by deep learning tracklet association. In: ECCV. pp. 737–753 (2018)
30. Li, W., Zhao, R., Xiao, T., Wang, X.: Deepreid: Deep filter pairing neural network for person re-identification. In: CVPR (2014)
31. Li, X., Wu, A., Zheng, W.S.: Adversarial open-world person re-identification. In: ECCV. pp. 280–296 (2018)
32. Lin, Y., Dong, X., Zheng, L., Yan, Y., Yang, Y.: A bottom-up clustering approach to unsupervised person re-identification. In: AAAI. pp. 8738–8745 (2019)
33. Ma, C., Yang, C., Yang, F., Zhuang, Y., Zhang, Z., Jia, H., Xie, X.: Trajectory factory: Tracklet cleaving and re-connection by deep siamese bi-gru for multiple object tracking. In: ICME. pp. 1–6 (2018)
34. Ma, L., Tang, S., Black, M.J., Van Gool, L.: Customized multi-person tracker. In: ACCV. pp. 612–628 (2018)
35. Maksai, A., Fua, P.: Eliminating exposure bias and metric mismatch in multiple object tracking. In: CVPR. pp. 4639–4648 (2019)
36. Manen, S., Gygli, M., Dai, D., Van Gool, L.: Pathtrack: Fast trajectory annotation with path supervision. In: ICCV. pp. 290–299 (2017)
37. Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)
38. Milan, A., Leal-Taixé, L., Schindler, K., Reid, I.: Joint tracking and segmentation of multiple targets. In: CVPR. pp. 5397–5406 (2015)
39. Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.C., Lee, J.T., Mukherjee, S., Aggarwal, J., Lee, H., Davis, L., et al.: A large-scale benchmark dataset for event recognition in surveillance video. In: CVPR. pp. 3153–3160 (2011)
40. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR. pp. 779–788 (2016)
41. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: NeurIPS. pp. 91–99 (2015)
42. Ristani, E., Tomasi, C.: Features for multi-target multi-camera tracking and re-identification. In: CVPR. pp. 6036–6046 (2018)
43. Sadeghian, A., Alahi, A., Savarese, S.: Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In: ICCV. pp. 300–311 (2017)
44. Schulter, S., Vernaza, P., Choi, W., Chandraker, M.: Deep network flow for multi-object tracking. In: CVPR. pp. 6951–6960 (2017)
45. Si, J., Zhang, H., Li, C.G., Kuen, J., Kong, X., Kot, A.C., Wang, G.: Dual attention matching network for context-aware feature sequence based person re-identification. In: CVPR. pp. 5363–5372 (2018)
46. Song, C., Huang, Y., Ouyang, W., Wang, L.: Mask-guided contrastive attention model for person re-identification. In: CVPR. pp. 1179–1188 (2018)
47. Su, C., Li, J., Zhang, S., Xing, J., Gao, W., Tian, Q.: Pose-driven deep convolutional model for person re-identification. In: ICCV. pp. 3960–3969 (2017)
48. Suh, Y., Wang, J., Tang, S., Mei, T., Mu Lee, K.: Part-aligned bilinear representations for person re-identification. In: ECCV. pp. 402–419 (2018)
49. Tang, S., Andriluka, M., Andres, B., Schiele, B.: Multiple people tracking by lifted multicut and person re-identification. In: CVPR. pp. 3539–3548 (2017)
50. Vondrick, C., Ramanan, D.: Video annotation and tracking with active learning. In: NeurIPS. pp. 28–36 (2011)
51. Wang, S., Fowlkes, C.C.: Learning optimal parameters for multi-target tracking with contextual interactions. IJCV 122