IA-MOT: Instance-Aware Multi-Object Tracking with Motion Consistency

Jiarui Cai, Yizhou Wang, Haotian Zhang, Hung-Min Hsu, Chengqian Ma, Jenq-Neng Hwang
Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, USA
{jrcai, ywang26, haotiz, hmhsu, cm74, hwang}@uw.edu

Abstract
Multiple object tracking (MOT) is a crucial task in computer vision. However, most tracking-by-detection MOT methods, given available detected bounding boxes, cannot effectively handle static, slow-moving, and fast-moving camera scenarios simultaneously due to ego-motion and frequent occlusion. In this work, we propose a novel tracking framework, called instance-aware MOT (IA-MOT), that can track multiple objects with either static or moving cameras by jointly considering instance-level features and object motions. First, robust appearance features are extracted from a variant of the Mask R-CNN detector with an additional embedding head, by feeding the given detections as region proposals. Meanwhile, a spatial attention map, which focuses on the foreground within each bounding box, is generated from the given instance masks and applied to the extracted embedding features. In the tracking stage, object instance masks are aligned by feature similarity and motion consistency using the Hungarian association algorithm. Moreover, object re-identification (ReID) is incorporated to recover ID switches caused by long-term occlusion or missing detections. When evaluated on the MOTS20 and KITTI-MOTS datasets, our proposed method won first place in Track 3 of the BMTT Challenge at the CVPR 2020 workshops.
1. Introduction
Multi-object tracking (MOT) leveraging detected bounding box locations, a.k.a. tracking-by-detection, has been the mainstream and well-studied paradigm. Most tracking-by-detection methods first extract the objects' deep features from the detected bounding boxes. With these extracted features, a tracking algorithm is then applied to associate the detections across consecutive frames. However, bounding-box-based features contain significant background noise, which is usually redundant and degrades tracking performance. Recently, Multi-Object Tracking and Segmentation (MOTS) [10] was posed as a new task to track multiple objects together with the corresponding instance segmentation.
Figure 1. The framework of the proposed IA-MOT. First, embedding features are extracted using a variant of Mask R-CNN and spatial attention generation. Then, isolated detections are aligned by the Hungarian algorithm based on the mask IoU and cosine feature similarity. Finally, missing detections and long-term occlusions are handled by the short-term retrieval and ReID modules.
MOTS is more informative and spatially precise, and the masks also offer potential improvements in tracking accuracy.

Typically, there are three key components in MOTS: object detection and segmentation, feature extraction, and multi-object tracking. Some related works are briefly reviewed as follows. Recent studies on object detection and segmentation achieve impressive performance with various advanced structures [8, 1]. In particular, for mask generation in MOTS, Luiten et al. [5] present the Proposal-generation, Refinement and Merging for Video Object Segmentation algorithm (PReMVOS). First, a set of object segmentation mask proposals is generated from Mask R-CNN and refined by a refinement network frame by frame. Then, with the assistance of optical flow and ReID information, these selected proposals are merged into pixel-wise object tracks over a video sequence. Feature extraction with deep neural networks is widely used for appearance similarity comparison in tracking. FaceNet [9] proposes a metric learning strategy. Moreover, a multi-scale pedestrian detector [12] is proposed that allows target detection and appearance features to be learned in a shared model. On top of that, several MOT methods have been studied to obtain object trajectories. Wang et al. [11] propose the TrackletNet Tracking (TNT) framework, which jointly considers appearance and geometric features to form a tracklet-based graph model. To incorporate segmentation into MOT, Voigtlaender et al. [10] propose a new baseline approach, Track R-CNN, which addresses detection, tracking, and segmentation via a unified convolutional neural network. Milan et al. [6] use superpixel information and consider detection and segmentation jointly with a conditional random field model. CAMOT [7] exploits stereo information to perform mask-based tracking of generic objects on the KITTI dataset.

In previous studies, instance segmentation is trained and inferred in parallel with bounding box detection. In this paper, to better utilize the instance masks for embedding feature extraction, we propose a novel MOTS framework, called instance-aware multi-object tracking with motion consistency (IA-MOT), which integrates the segmentation information with instance features. Given the detection results, the sequences are fed into a variant of Mask R-CNN [1], with the detections serving as region proposals of the RPN, to obtain the embedding features. Here, to produce instance-aware features, we apply a spatial attention map generated from the corresponding instance masks to weight the foreground of each bounding box more heavily. Second, the extracted instance-aware features are utilized in the subsequent tracking module based on the Hungarian assignment algorithm. In this tracking module, object motion consistency, i.e., the similarity of object motion, including object size, moving direction, and moving speed, is jointly considered.
Moreover, discontinuities caused by missing detections are recovered by the short-term retrieval (STR) module, and object re-identification (ReID) is further incorporated to reduce ID switches caused by long-term occlusion.
2. The Proposed Method
The proposed IA-MOT includes three steps: embedding feature extraction, online object tracking with STR, and object re-identification with motion consistency for final refinement. The overall framework is shown in Figure 1.
First of all, the embedding feature for each given bounding box is extracted from a variant of Mask R-CNN [2] with an extra embedding head. This network is trained by jointly optimizing bounding box classification and regression, object mask prediction, and object feature extraction as a multi-task problem. The loss function is

\[
L_{\text{total}} = \alpha_1 L_{\text{box}} + \alpha_2 L_{\text{cls}} + \alpha_3 L_{\text{mask}} + \alpha_4 L_{\text{emb}}, \tag{1}
\]

where $L_{\text{box}}$, $L_{\text{cls}}$, and $L_{\text{mask}}$ are the original Mask R-CNN losses, and $L_{\text{emb}}$ is a cross-entropy loss for object identity classification. Here, each public detection $D_n \in D$ is treated as a region proposal and fed into the network to acquire its bounding box feature $\mathrm{emb}^{\mathrm{box}}_n$.

Then, the extracted embedding features $\mathrm{emb}^{\mathrm{box}}_n$ are weighted by a spatial attention (SA) map $w_n$ to form our instance-aware (IA) embedding features $\mathrm{emb}^{\mathrm{IA}}_n$, where

\[
\mathrm{emb}^{\mathrm{IA}}_n = w_n \cdot \mathrm{emb}^{\mathrm{box}}_n. \tag{2}
\]

Here, the SA map $w_n$ is generated from the provided instance segmentation mask $\mathrm{mask}_n$:

\[
w_n(i,j) = \begin{cases} 1, & \text{if } \mathrm{mask}_n(i,j) \text{ is foreground}, \\ w_{bg}, & \text{if } \mathrm{mask}_n(i,j) \text{ is background}, \end{cases} \tag{3}
\]

where $w_{bg} < 1$ is a constant background weight.
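To make the weighting of Eqs. (2)-(3) concrete, the following is a minimal PyTorch sketch of applying an instance-mask-derived spatial attention map to an ROI-aligned feature map. The tensor shapes, the mean-pooling choice, and the background weight `w_bg` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def spatial_attention_map(mask: torch.Tensor, w_bg: float = 0.5) -> torch.Tensor:
    """Eq. (3): weight 1 on foreground pixels of the instance mask and a
    smaller constant w_bg on background pixels (w_bg is an assumed value)."""
    mask = mask.bool()
    return torch.where(mask,
                       torch.ones_like(mask, dtype=torch.float32),
                       torch.full_like(mask, w_bg, dtype=torch.float32))

def instance_aware_embedding(roi_feat: torch.Tensor, mask: torch.Tensor,
                             w_bg: float = 0.5) -> torch.Tensor:
    """Eq. (2): emb_IA = w_n * emb_box for one detection.
    roi_feat: (C, H, W) ROI-aligned feature map from the embedding head.
    mask:     (h, w) binary instance mask cropped to the same box."""
    sa = spatial_attention_map(mask, w_bg)                    # (h, w)
    # Resize the SA map to the spatial size of the feature map.
    sa = F.interpolate(sa[None, None], size=roi_feat.shape[-2:],
                       mode="nearest")[0, 0]                  # (H, W)
    weighted = roi_feat * sa                                  # broadcast over C
    # Pool to a fixed-length embedding vector (one common choice).
    return weighted.mean(dim=(-2, -1))                        # (C,)
```

Under these assumptions, a detection's bounding-box feature becomes the mask-weighted instance-aware feature before it enters the tracker.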
After the IA embedding features are extracted, they are passed to the tracking algorithm, which associates identities across consecutive frames. Given the detections $D^{(t)}$ and the corresponding feature embeddings $E^{(t)}$ in frame $t$, each hypothesis $D^{(t)}_i \in D^{(t)}$ is matched against the live tracks $D^{(t-1)}_j \in D^{(t-1)}$ from frame $t-1$ using the Hungarian assignment algorithm [3]. The assignment cost between the $j$-th track and the $i$-th detection is

\[
C = 2 - \mathrm{maskIOU}\big(D^{(t-1)}_j, D^{(t)}_i\big) - \mathrm{simi}\big(\mathrm{emb}_j, \mathrm{emb}^{(t)}_i\big), \tag{4}
\]

where the $j$-th track's feature $\mathrm{emb}_j$ is the stack of features from its earliest and most recent frames, and $\mathrm{simi}(\cdot)$ denotes the cosine similarity. Considering that the visibility of an object may vary due to camera movement, the features are not aggregated into one vector but kept separate: feature similarity is compared pair-wise, and the maximum is taken as the result.

Missing detections are compensated by the short-term retrieval (STR) module for lost objects. STR tries to match unassigned detections in frame $t$ with the live tracks that have no assigned detection in frame $t-1$. In addition to the feature similarity, the bounding box of a lost track in the current frame is extrapolated by Huber regression, and the distance between the regressed location and the track's last location is confined to within twice the object width. A tracklet is marked as terminated, and is no longer included in the Hungarian assignment or STR, if it has no alignment in the most recent $N_s$ frames.

Figure 2. Qualitative results of the proposed IA-MOT. 1st row: MOTS20-01 (static camera); 2nd row: MOTS20-06 (stroller-mounted camera); 3rd row: KITTI-MOTS-0011 (car-mounted camera, up-slope view); 4th row: KITTI-MOTS-0012 (car-mounted camera, turning view). Red arrows indicate that the targets are tracked robustly even under frequent occlusion or turning.

Then, long-term occlusions are recovered by feature-based re-identification (ReID). In this stage, two tracklets $\xi_u$ and $\xi_v$ that do not overlap in time (assuming $\xi_u$ is earlier than $\xi_v$), are within $N_l$ frames apart, and have a feature similarity higher than $\beta_1$ are considered as possible matched pairs. For static cameras, the interval in between is extrapolated from $\xi_u$ and $\xi_v$, respectively; $\xi_u$ and $\xi_v$ are reconnected if the average bounding box IoU between the two extrapolations is above $\beta_2$. For moving cameras, tracklet motion vectors are estimated from $\xi_u$'s last or $\xi_v$'s first $N_m$ frames. The motion of $\xi_u$ is defined as

\[
M_u = [M_{ux}, M_{uy}] = \frac{1}{N_m - 1}\left[\sum_{j=1}^{N_m-1}\big(x^{(j+1)}_u - x^{(j)}_u\big),\ \sum_{j=1}^{N_m-1}\big(y^{(j+1)}_u - y^{(j)}_u\big)\right], \tag{5}
\]

where $[x^{(j)}_u, y^{(j)}_u]$ is the top-left point of the $j$-th detection in $\xi_u$; the motion $M_v$ is defined in the same manner. Then, $\xi_u$ and $\xi_v$ are recognized as the same object if the cosine similarity between $M_u$ and $M_v$ is positive and above a threshold $\beta_3$.
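As a concrete reading of Eqs. (4) and (5), the sketch below builds the Hungarian cost matrix from mask IoU and the pair-wise-maximum cosine similarity, and computes a tracklet's motion vector as its average frame-to-frame displacement. The track/detection containers and their field names (`feats`, `feat`, `mask`) are assumptions for illustration, not the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(m1: np.ndarray, m2: np.ndarray) -> float:
    """IoU of two boolean instance masks on the same image grid."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def assignment_cost(track_feats, track_mask, det_feat, det_mask) -> float:
    """Eq. (4): C = 2 - maskIOU - simi. The track keeps a stack of features;
    similarity is compared pair-wise and the maximum is taken."""
    simi = max(cosine_sim(f, det_feat) for f in track_feats)
    return 2.0 - mask_iou(track_mask, det_mask) - simi

def associate(tracks, detections):
    """Hungarian assignment over the cost matrix of Eq. (4)."""
    cost = np.array([[assignment_cost(t.feats, t.mask, d.feat, d.mask)
                      for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

def motion_vector(track_xy: np.ndarray) -> np.ndarray:
    """Eq. (5): average displacement of the box's top-left corner over the
    tracklet's first or last N_m frames; track_xy has shape (N_m, 2)."""
    return np.diff(track_xy, axis=0).mean(axis=0)
```

During ReID with moving cameras, two tracklets would then be re-linked only if `cosine_sim(motion_vector(xy_u), motion_vector(xy_v))` is positive and above the threshold $\beta_3$, in addition to the appearance gate.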
3. Experiments
The data of the BMTT Challenge consists of the MOTS20 and KITTI-MOTS datasets [10]. The MOTS20 dataset contains pedestrian tracking sequences, evenly split for training and testing; the testing sequences vary in resolution and in average target density per frame. KITTI-MOTS is a driving-scenario dataset for both pedestrian and car tracking, consisting of 21 training sequences and 29 testing sequences. The pre-computed detections are generated from Mask R-CNN X152 [1] and refined by the refinement network of [5].

The proposed modified Mask R-CNN uses ResNet50 [2] as the backbone, pretrained on the COCO dataset [4] and fine-tuned on the MOTS20 and KITTI-MOTS datasets. Based on the convergence speed and the scale of each loss component, the detection loss weights are set to $\alpha_1 = \alpha_2 = 1$, while the mask weight $\alpha_3$ and the embedding weight $\alpha_4$ are set to fractional values below 1. The embedding head outputs a fixed-length feature vector. The short-term memory interval that determines the state of a track, $N_s$, is a fraction of a second; the long-term interval for ReID, $N_l$, is 1 second for pedestrians and below one second for cars; and $N_m = 5$ frames for both categories.

Method          KITTI-MOTS Car   KITTI-MOTS Ped   MOTS20   Total
MCFPA           77.0             67.2             66.1     69.1
TPM-MOTS        75.8             –                –        –
IA-MOT (Ours)   76.4             64.0             –        –

Table 1. Track 3 evaluation results of the BMTT Challenge, evaluated by sMOTSA.
Moreover, due to the large number of false positives in the provided detections, we create three different filters, based on detection confidence, bounding box size, and bounding box aspect ratio, to select valid candidates (a sketch of these filters is given at the end of this section). After tracking with the filtered detections, short tracks and tracks with low average confidence are discarded. In addition, the trajectory IoU is defined as the average mask IoU of two tracks over their co-existing frames; if the trajectory IoU of two tracks exceeds a preset threshold, the shorter track is discarded. These processing steps efficiently remove duplicated and non-target detections.

We evaluate IA-MOT on the Track 3 dataset of the BMTT Challenge (the combination of MOTS20 and KITTI-MOTS) and take first place in sMOTSA among all participating methods. The quantitative results are shown in Table 1, and some qualitative examples are shown in Figure 2. Specifically, the proposed framework's sMOTSA on the MOTS20 dataset with the public detections is nearly even with the leading method on the MOTS20 leaderboard, which uses private detectors. Although IA-MOT does not achieve the best performance on KITTI-MOTS, it still ranks 3rd for KITTI-MOTS cars and 5th for pedestrians, indicating its ability to generalize to different object categories and its potential to handle a variety of complex MOTS scenarios.
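Below is a minimal sketch of the three candidate filters described above (detection confidence, bounding box size, and aspect ratio). All threshold values are illustrative assumptions; the exact settings are not given in this text.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    score: float  # detection confidence
    w: float      # bounding box width in pixels
    h: float      # bounding box height in pixels

def keep_detection(det: Detection,
                   min_score: float = 0.5,    # assumed confidence threshold
                   min_area: float = 100.0,   # assumed minimum box area (px^2)
                   max_aspect: float = 4.0    # assumed max height/width ratio
                   ) -> bool:
    """Apply the three validity filters: confidence, box size, aspect ratio."""
    if det.score < min_score:
        return False
    if det.w * det.h < min_area:
        return False
    aspect = det.h / det.w if det.w > 0 else float("inf")
    return 1.0 / max_aspect <= aspect <= max_aspect

def filter_detections(dets):
    """Select valid candidates before tracking."""
    return [d for d in dets if keep_detection(d)]
```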
4. Conclusion
In this paper, a novel MOTS framework, IA-MOT, is proposed. A Mask R-CNN with an additional embedding head and spatial attention first generates discriminative features. The subsequent MOT stage consists of online Hungarian assignment, a short-term retrieval module, and ReID. In addition, several implementation details are presented for the MOTS20 and KITTI-MOTS datasets. The proposed framework can effectively track both pedestrians and cars, with static or moving cameras, and is flexible across different video resolutions and scenarios. Our proposed IA-MOT won first place in Track 3 of the BMTT Challenge at the CVPR 2020 workshops.
References

[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[3] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[4] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[5] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. PReMVOS: Proposal-generation, refinement and merging for the DAVIS challenge on video object segmentation, 2018.
[6] Anton Milan, Laura Leal-Taixé, Konrad Schindler, and Ian Reid. Joint tracking and segmentation of multiple targets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5397–5406, 2015.
[7] Aljoša Ošep, Wolfgang Mehner, Paul Voigtlaender, and Bastian Leibe. Track, then decide: Category-agnostic vision-based multi-object tracking. In IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[9] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[10] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7942–7951, 2019.
[11] Gaoang Wang, Yizhou Wang, Haotian Zhang, Renshu Gu, and Jenq-Neng Hwang. Exploit the connectivity: Multi-object tracking with TrackletNet. In Proceedings of the 27th ACM International Conference on Multimedia, pages 482–490, 2019.
[12] Zhongdao Wang, Liang Zheng, Yixuan Liu, and Shengjin Wang. Towards real-time multi-object tracking. arXiv preprint arXiv:1909.12605, 2019.