IA-MOT: Instance-Aware Multi-Object Tracking with Motion Consistency

Jiarui Cai, Yizhou Wang, Haotian Zhang, Hung-Min Hsu, Chengqian Ma, Jenq-Neng Hwang
Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, USA
{jrcai, ywang26, haotiz, hmhsu, cm74, hwang}@uw.edu

Abstract
Multiple object tracking (MOT) is a crucial task in computer vision. However, most tracking-by-detection MOT methods, given available detected bounding boxes, cannot effectively handle static, slow-moving, and fast-moving camera scenarios simultaneously due to ego-motion and frequent occlusion. In this work, we propose a novel tracking framework, called instance-aware MOT (IA-MOT), that can track multiple objects with either static or moving cameras by jointly considering instance-level features and object motions. First, robust appearance features are extracted from a variant of the Mask R-CNN detector with an additional embedding head, by feeding the given detections as region proposals. Meanwhile, a spatial attention map, which focuses on the foreground within each bounding box, is generated from the given instance masks and applied to the extracted embedding features. In the tracking stage, object instance masks are aligned by feature similarity and motion consistency using the Hungarian association algorithm. Moreover, object re-identification (ReID) is incorporated to recover ID switches caused by long-term occlusion or missing detections. When evaluated on the MOTS20 and KITTI-MOTS datasets, our proposed method won first place in Track 3 of the BMTT Challenge at the CVPR 2020 workshops.
1. Introduction
Multi-object tracking (MOT) leveraging detected bounding box locations, a.k.a. tracking-by-detection, has been the mainstream and well-studied paradigm. Most tracking-by-detection methods first extract the objects' deep features from the detected bounding boxes. With these extracted features, a tracking algorithm is then applied to associate the detections across consecutive frames. However, bounding-box-based features contain significant background noise, which is usually redundant and degrades tracking performance. Recently, Multi-Object Tracking and Segmentation (MOTS) [10] was posed as a new task to track multiple objects together with the corresponding instance segmentation.
Figure 1. The framework of the proposed IA-MOT. First, embedding features are extracted using a variant of Mask R-CNN and spatial attention generation. Then, isolated detections are aligned by the Hungarian algorithm based on the mask IoU and cosine feature similarity. Finally, missing detections and long-term occlusions are handled by the short-term retrieval and ReID modules.
MOTS is more informative and spatially precise, and the masks also offer potential improvements in tracking accuracy.

Typically, there are three key components in MOTS: object detection and segmentation, feature extraction, and multi-object tracking. Some related works are briefly reviewed as follows. Recent studies on object detection and segmentation achieve impressive performance with various advanced structures [8, 1]. In particular, for mask generation in MOTS, Luiten et al. [5] present the Proposal-generation, Refinement and Merging for Video Object Segmentation algorithm (PReMVOS). First, a set of object segmentation mask proposals is generated from Mask R-CNN and refined by a refinement network frame by frame. Then, with the assistance of optical flow and ReID information, these selected proposals are merged into pixel-wise object tracks over a video sequence. Feature extraction with deep neural networks is widely used for appearance similarity comparison in tracking. FaceNet [9] proposes a metric learning strategy. Moreover, a multi-scale pedestrian detector [12] is proposed that allows target detection and appearance features to be learned in a shared model. On top of that, several MOT methods have been studied to obtain object trajectories. Wang et al. [11] propose the TrackletNet Tracking (TNT) framework, which jointly considers appearance and geometric features to form a tracklet-based graph model. To incorporate segmentation into MOT, Voigtlaender et al. [10] propose a new baseline approach, Track R-CNN, which addresses detection, tracking, and segmentation via a unified convolutional neural network. Milan et al. [6] use superpixel information and consider detection and segmentation jointly with a conditional random field model. CAMOT [7] exploits stereo information to perform mask-based tracking of generic objects on the KITTI dataset.

In previous studies, instance segmentation is trained and inferred in parallel with bounding box detection. In this paper, to better utilize the instance masks for embedding feature extraction, we propose a novel MOTS framework, called instance-aware multi-object tracking with motion consistency (IA-MOT), which integrates the segmentation information with instance features. Given the detection results, the sequences are fed into a variant of Mask R-CNN [1], with the detections serving as region proposals of the RPN, to obtain the embedding features. Here, to produce instance-aware features, we apply a spatial attention map generated from the corresponding instance masks to weight the foreground of each bounding box more heavily. Second, the extracted instance-aware features are utilized in the subsequent tracking module based on the Hungarian assignment algorithm. In this tracking module, object motion consistency, i.e., the similarity of object motion, including object size, moving direction, and moving speed, is jointly considered.
Moreover, discontinuities caused by missing detections are recovered by the short-term retrieval (STR) module, and object re-identification (ReID) is further incorporated to reduce ID switches caused by long-term occlusion.
2. The Proposed Method
The proposed IA-MOT includes three steps: embedding feature extraction, online object tracking with STR, and object re-identification with motion consistency for final refinement. The overall framework is shown in Figure 1.
First of all, the embedding feature for each given bounding box is extracted from a variant of Mask R-CNN [2] with an extra embedding head. This network is trained by jointly optimizing bounding box classification and regression, object mask prediction, and object feature extraction as a multi-task problem. The loss function is

\[
L_{\text{total}} = \alpha_1 L_{\text{box}} + \alpha_2 L_{\text{cls}} + \alpha_3 L_{\text{mask}} + \alpha_4 L_{\text{emb}}, \tag{1}
\]

where $L_{\text{box}}$, $L_{\text{cls}}$, and $L_{\text{mask}}$ are the original Mask R-CNN losses, and $L_{\text{emb}}$ is a cross-entropy loss for object identity classification. Here, each public detection $D_n \in D$ is treated as a region proposal and fed into the network to acquire its bounding box feature $\mathrm{emb}^{\mathrm{box}}_n$.

Then, the extracted embedding features $\mathrm{emb}^{\mathrm{box}}_n$ are weighted by a spatial attention (SA) map $w_n$ to form our instance-aware (IA) embedding features $\mathrm{emb}^{\mathrm{IA}}_n$, where

\[
\mathrm{emb}^{\mathrm{IA}}_n = w_n \cdot \mathrm{emb}^{\mathrm{box}}_n. \tag{2}
\]

Here, the SA map $w_n$ is generated from the provided instance segmentation mask $\mathrm{mask}_n$:

\[
w_n(i,j) = \begin{cases} 1, & \text{if } \mathrm{mask}_n(i,j) \text{ is foreground}, \\ w_{bg}, & \text{if } \mathrm{mask}_n(i,j) \text{ is background}, \end{cases} \tag{3}
\]

where $w_{bg} < 1$ is a constant background weight.
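To make the weighting of Eqs. (2)-(3) concrete, the following is a minimal PyTorch sketch of applying an instance-mask-derived spatial attention map to an ROI-aligned feature map. The tensor shapes, the mean-pooling choice, and the background weight `w_bg` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def spatial_attention_map(mask: torch.Tensor, w_bg: float = 0.5) -> torch.Tensor:
    """Eq. (3): weight 1 on foreground pixels of the instance mask and a
    smaller constant w_bg on background pixels (w_bg is an assumed value)."""
    mask = mask.bool()
    return torch.where(mask,
                       torch.ones_like(mask, dtype=torch.float32),
                       torch.full_like(mask, w_bg, dtype=torch.float32))

def instance_aware_embedding(roi_feat: torch.Tensor, mask: torch.Tensor,
                             w_bg: float = 0.5) -> torch.Tensor:
    """Eq. (2): emb_IA = w_n * emb_box for one detection.
    roi_feat: (C, H, W) ROI-aligned feature map from the embedding head.
    mask:     (h, w) binary instance mask cropped to the same box."""
    sa = spatial_attention_map(mask, w_bg)                    # (h, w)
    # Resize the SA map to the spatial size of the feature map.
    sa = F.interpolate(sa[None, None], size=roi_feat.shape[-2:],
                       mode="nearest")[0, 0]                  # (H, W)
    weighted = roi_feat * sa                                  # broadcast over C
    # Pool to a fixed-length embedding vector (one common choice).
    return weighted.mean(dim=(-2, -1))                        # (C,)
```

Under these assumptions, a detection's bounding-box feature becomes the mask-weighted instance-aware feature before it enters the tracker.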
After the IA embedding features are extracted, they are passed to the tracking algorithm, which associates identities across consecutive frames. Given the detections $D^{(t)}$ and the corresponding feature embeddings $E^{(t)}$ in frame $t$, each hypothesis $D^{(t)}_i \in D^{(t)}$ is matched against the live tracks $D^{(t-1)}_j \in D^{(t-1)}$ from frame $t-1$ using the Hungarian assignment algorithm [3]. The assignment cost between the $j$-th track and the $i$-th detection is

\[
C = 2 - \mathrm{maskIOU}\big(D^{(t-1)}_j, D^{(t)}_i\big) - \mathrm{simi}\big(\mathrm{emb}_j, \mathrm{emb}^{(t)}_i\big), \tag{4}
\]

where the $j$-th track's feature $\mathrm{emb}_j$ is the stack of features from its earliest and most recent frames, and $\mathrm{simi}(\cdot)$ denotes the cosine similarity. Considering that the visibility of an object may vary due to camera movement, the features are not aggregated into one vector but kept separate: feature similarity is compared pair-wise, and the maximum is taken as the result.

Missing detections are compensated by the short-term retrieval (STR) module for lost objects. STR tries to match unassigned detections in frame $t$ with the live tracks that have no assigned detection in frame $t-1$. In addition to the feature similarity, the bounding box of a lost track in the current frame is extrapolated by Huber regression, and the distance between the regressed location and the track's last location is confined to within twice the object width. A tracklet is marked as terminated, and is no longer included in the Hungarian assignment or STR, if it has no alignment in the most recent $N_s$ frames.

Figure 2. Qualitative results of the proposed IA-MOT. 1st row: MOTS20-01 (static camera); 2nd row: MOTS20-06 (stroller-mounted camera); 3rd row: KITTI-MOTS-0011 (car-mounted camera, up-slope view); 4th row: KITTI-MOTS-0012 (car-mounted camera, turning view). Red arrows indicate that the targets are tracked robustly even under frequent occlusion or turning.

Then, long-term occlusions are recovered by feature-based re-identification (ReID). In this stage, two tracklets $\xi_u$ and $\xi_v$ that do not overlap in time (assuming $\xi_u$ is earlier than $\xi_v$), are within $N_l$ frames apart, and have a feature similarity higher than $\beta_1$ are considered as possible matched pairs. For static cameras, the interval in between is extrapolated from $\xi_u$ and $\xi_v$, respectively; $\xi_u$ and $\xi_v$ are reconnected if the average bounding box IoU between the two extrapolations is above $\beta_2$. For moving cameras, tracklet motion vectors are estimated from $\xi_u$'s last or $\xi_v$'s first $N_m$ frames. The motion of $\xi_u$ is defined as

\[
M_u = [M_{ux}, M_{uy}] = \frac{1}{N_m - 1}\left[\sum_{j=1}^{N_m-1}\big(x^{(j+1)}_u - x^{(j)}_u\big),\ \sum_{j=1}^{N_m-1}\big(y^{(j+1)}_u - y^{(j)}_u\big)\right], \tag{5}
\]

where $[x^{(j)}_u, y^{(j)}_u]$ is the top-left point of the $j$-th detection in $\xi_u$; the motion $M_v$ is defined in the same manner. Then, $\xi_u$ and $\xi_v$ are recognized as the same object if the cosine similarity between $M_u$ and $M_v$ is positive and above a threshold $\beta_3$.
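As a concrete reading of Eqs. (4) and (5), the sketch below builds the Hungarian cost matrix from mask IoU and the pair-wise-maximum cosine similarity, and computes a tracklet's motion vector as its average frame-to-frame displacement. The track/detection containers and their field names (`feats`, `feat`, `mask`) are assumptions for illustration, not the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(m1: np.ndarray, m2: np.ndarray) -> float:
    """IoU of two boolean instance masks on the same image grid."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def assignment_cost(track_feats, track_mask, det_feat, det_mask) -> float:
    """Eq. (4): C = 2 - maskIOU - simi. The track keeps a stack of features;
    similarity is compared pair-wise and the maximum is taken."""
    simi = max(cosine_sim(f, det_feat) for f in track_feats)
    return 2.0 - mask_iou(track_mask, det_mask) - simi

def associate(tracks, detections):
    """Hungarian assignment over the cost matrix of Eq. (4)."""
    cost = np.array([[assignment_cost(t.feats, t.mask, d.feat, d.mask)
                      for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

def motion_vector(track_xy: np.ndarray) -> np.ndarray:
    """Eq. (5): average displacement of the box's top-left corner over the
    tracklet's first or last N_m frames; track_xy has shape (N_m, 2)."""
    return np.diff(track_xy, axis=0).mean(axis=0)
```

During ReID with moving cameras, two tracklets would then be re-linked only if `cosine_sim(motion_vector(xy_u), motion_vector(xy_v))` is positive and above the threshold $\beta_3$, in addition to the appearance gate.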
3. Experiments
The data of the BMTT Challenge consists of the MOTS20 and KITTI-MOTS datasets [10]. The MOTS20 dataset contains pedestrian tracking sequences, evenly split for training and testing; the testing sequences vary in resolution and in average target density per frame. KITTI-MOTS is a driving-scenario dataset for both pedestrian and car tracking, consisting of 21 training sequences and 29 testing sequences. The pre-computed detections are generated from Mask R-CNN X152 [1] and refined by the refinement network of [5].

The proposed modified Mask R-CNN uses ResNet50 [2] as the backbone, pretrained on the COCO dataset [4] and fine-tuned on the MOTS20 and KITTI-MOTS datasets. Based on the convergence speed and the scale of each loss component, the detection loss weights are set to $\alpha_1 = \alpha_2 = 1$, while the mask weight $\alpha_3$ and the embedding weight $\alpha_4$ are set to fractional values below 1. The embedding head outputs a fixed-length feature vector. The short-term memory interval that determines the state of a track, $N_s$, is a fraction of a second; the long-term interval for ReID, $N_l$, is 1 second for pedestrians and below one second for cars; and $N_m = 5$ frames for both categories.

Method          KITTI-MOTS Car   KITTI-MOTS Ped   MOTS20   Total
MCFPA           77.0             67.2             66.1     69.1
TPM-MOTS        75.8             –                –        –
IA-MOT (Ours)   76.4             64.0             –        –

Table 1. Track 3 evaluation results of the BMTT Challenge, evaluated by sMOTSA.
Moreover, due to the large number of false positives in the provided detections, we create three different filters, based on detection confidence, bounding box size, and bounding box aspect ratio, to select valid candidates (a sketch of these filters is given at the end of this section). After tracking with the filtered detections, short tracks and tracks with low average confidence are discarded. In addition, the trajectory IoU is defined as the average mask IoU of two tracks over their co-existing frames; if the trajectory IoU of two tracks exceeds a preset threshold, the shorter track is discarded. These processing steps efficiently remove duplicated and non-target detections.

We evaluate IA-MOT on the Track 3 dataset of the BMTT Challenge (the combination of MOTS20 and KITTI-MOTS) and take first place in sMOTSA among all participating methods. The quantitative results are shown in Table 1, and some qualitative examples are shown in Figure 2. Specifically, the proposed framework's sMOTSA on the MOTS20 dataset with the public detections is nearly even with the leading method on the MOTS20 leaderboard, which uses private detectors. Although IA-MOT does not achieve the best performance on KITTI-MOTS, it still ranks 3rd for KITTI-MOTS cars and 5th for pedestrians, indicating its ability to generalize to different object categories and its potential to handle a variety of complex MOTS scenarios.
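Below is a minimal sketch of the three candidate filters described above (detection confidence, bounding box size, and aspect ratio). All threshold values are illustrative assumptions; the exact settings are not given in this text.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    score: float  # detection confidence
    w: float      # bounding box width in pixels
    h: float      # bounding box height in pixels

def keep_detection(det: Detection,
                   min_score: float = 0.5,    # assumed confidence threshold
                   min_area: float = 100.0,   # assumed minimum box area (px^2)
                   max_aspect: float = 4.0    # assumed max height/width ratio
                   ) -> bool:
    """Apply the three validity filters: confidence, box size, aspect ratio."""
    if det.score < min_score:
        return False
    if det.w * det.h < min_area:
        return False
    aspect = det.h / det.w if det.w > 0 else float("inf")
    return 1.0 / max_aspect <= aspect <= max_aspect

def filter_detections(dets):
    """Select valid candidates before tracking."""
    return [d for d in dets if keep_detection(d)]
```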
4. Conclusion
In this paper, a novel MOTS framework, IA-MOT, is proposed. A Mask R-CNN with an additional embedding head and spatial attention first generates discriminative features. The subsequent MOT stage consists of online Hungarian assignment, a short-term retrieval module, and ReID. In addition, several implementation details are presented for the MOTS20 and KITTI-MOTS datasets. The proposed framework can effectively track both pedestrians and cars, with static or moving cameras, and is flexible across different video resolutions and scenarios. Our proposed IA-MOT won first place in Track 3 of the BMTT Challenge at the CVPR 2020 workshops.
References

[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[3] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[4] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[5] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. PReMVOS: Proposal-generation, refinement and merging for the DAVIS challenge on video object segmentation, 2018.
[6] Anton Milan, Laura Leal-Taixé, Konrad Schindler, and Ian Reid. Joint tracking and segmentation of multiple targets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5397–5406, 2015.
[7] Aljoša Ošep, Wolfgang Mehner, Paul Voigtlaender, and Bastian Leibe. Track, then decide: Category-agnostic vision-based multi-object tracking. In IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[9] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[10] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7942–7951, 2019.
[11] Gaoang Wang, Yizhou Wang, Haotian Zhang, Renshu Gu, and Jenq-Neng Hwang. Exploit the connectivity: Multi-object tracking with TrackletNet. In Proceedings of the 27th ACM International Conference on Multimedia, pages 482–490, 2019.
[12] Zhongdao Wang, Liang Zheng, Yixuan Liu, and Shengjin Wang. Towards real-time multi-object tracking. arXiv preprint arXiv:1909.12605, 2019.