ReMOTS: Self-Supervised Refining Multi-Object Tracking and Segmentation
Fan Yang, Xin Chang, Chenyu Dang, Ziqiang Zheng, Sakriani Sakti, Satoshi Nakamura, Yang Wu
(1st place solution for MOTSChallenge 2020 Track 1)
Nara Institute of Science and Technology, Japan; RIKEN Center for Advanced Intelligence Project, Japan; UISEE Technology (Beijing) Co., Ltd., China; Kyoto University, Japan
Abstract
We aim to improve the performance of Multiple Object Tracking and Segmentation (MOTS) by refinement. However, refining MOTS results remains challenging, which could be attributed to the fact that appearance features are not adapted to target videos and that it is difficult to find proper thresholds to discriminate them. To tackle this issue, we propose a self-supervised refining MOTS (i.e., ReMOTS) framework. ReMOTS mainly takes four steps to refine MOTS results from the data association perspective: (1) training the appearance encoder using predicted masks; (2) associating observations across adjacent frames to form short-term tracklets; (3) training the appearance encoder using short-term tracklets as reliable pseudo labels; (4) merging short-term tracklets into long-term tracklets utilizing the adapted appearance features and thresholds that are automatically obtained from statistical information. Using ReMOTS, we reached the 1st place on the CVPR 2020 MOTS Challenge 1 [4], with an sMOTSA score of 69.9.
1. Introduction
Multiple Object Tracking (MOT), which depends on information from bounding boxes, faces a great challenge: different objects may stay in the same bounding box, which increases the ambiguity in distinguishing them. Recently, some researchers in this field have turned their attention to Multiple Object Tracking and Segmentation (MOTS), hoping to take advantage of object-instance masks. Against this background, the first MOTS challenge was organized to explore solutions for MOTS. We participated in this challenge (May 30th, 2020) and won the 1st place on Challenge 1. In this paper, we present our solution.
∗Corresponding email: [email protected]
2. Method Details
Overall, we apply the tracking-by-detection strategy to generate MOTS results. Since our ReMOTS is an offline approach, we refine the data association by retraining the appearance feature encoder. In each step of ReMOTS, we give practical guidance on quantitatively selecting hyperparameters. Our approach is illustrated in Figure 1. After obtaining object-instance masks, we perform: (1) encoder training with intra-frame data, (2) associating masks into short-term tracklets with a short-term tracker, (3) inter-short-tracklet encoder retraining, and (4) merging of short-term tracklets.
Referring to how the public detection is generated, we obtain object-instance masks using the Mask R-CNN X152 of Detectron2 [5] and the X-101-64x4d-FPN of MMDetection [2]. We fuse their segmentation results by a modified Non-maximum Suppression (NMS). Unlike traditional NMS, where the IoU (Intersection over Union) is applied, we propose a new metric named IoM (Intersection over Minimum), since heavily overlapping masks may still have low IoU values. The Python code of IoM is as follows.

    import numpy as np

    def pixel_iom(target, prediction):
        """
        Inputs:
            target: binary mask, array([H, W])
            prediction: binary mask, array([H, W])
        Outputs:
            iom_score: float
        """
        # Overlap between the two binary masks
        intersection = np.logical_and(target, prediction)
        # Normalize by the smaller mask area instead of the union
        min_area = min(np.sum(target), np.sum(prediction))
        iom_score = np.sum(intersection) / min_area
        return iom_score

After performing our modified NMS, the remaining masks may still have overlapping areas. Therefore, we only keep the mask with the top confidence score for each overlapping area.

Figure 1: The illustration of the ReMOTS framework.

Figure 2: Constructing training samples for intra-frame training. P and N represent positive and negative samples, respectively.
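To make the fusion step concrete, the following is a minimal sketch of greedy IoM-based suppression built on the pixel_iom function above; the function name iom_nms and the 0.5 threshold are our illustrative assumptions, not values given in the paper.

    def iom_nms(masks, scores, iom_threshold=0.5):
        """Greedy NMS over binary masks using IoM instead of IoU.

        masks: list of binary arrays [H, W]; scores: list of floats.
        Returns indices of the masks that are kept.
        Hypothetical sketch: the 0.5 threshold is an assumption.
        """
        order = np.argsort(scores)[::-1]  # highest confidence first
        keep = []
        for idx in order:
            # Suppress this mask if it heavily overlaps any kept mask
            if all(pixel_iom(masks[k], masks[idx]) < iom_threshold for k in keep):
                keep.append(idx)
        return keep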
We take an off-the-shelf appearance encoder and its training scheme from an object re-identification work [3]. SeResNeXt50 is used as the backbone, and its global-average-pooling output, a 2048-dimensional vector, is used as the appearance feature. The triplet loss [1] is applied to train the appearance encoder. To adapt the appearance feature learning to the target videos, we incorporate intra-frame observations of the target videos into a novel offline training process. As Figure 2 shows, we can sample triplets from the training set by referring only to the ground-truth tracklets. In the test set, since Non-maximum Suppression (NMS) is performed, we assume that predicted object masks are exclusive within the same frame, and therefore it is easy to form negative pairs from intra-frame observations. Before tracking, we create a positive sample by augmenting an anchor sample. The augmentation process can dramatically change the pixel content of the anchor sample without altering its identity. Finally, we take triplets from the training set and the target set to form a mini-batch input at a ratio of 1:1. Using such new training samples, we retrain the appearance encoder to obtain more discriminative appearance features.
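As an illustration of the intra-frame sampling, here is a minimal sketch; sample_test_set_triplet and the augment argument are hypothetical names, and the actual augmentation pipeline of [3] is not reproduced here.

    import random

    def sample_test_set_triplet(frame_crops, augment):
        """Sample one (anchor, positive, negative) triplet from a single test frame.

        frame_crops: list of image crops for the masks detected in one frame.
        augment: a heavy augmentation function that changes pixel content
        but not identity (hypothetical stand-in for the real pipeline).
        """
        # After NMS, masks within one frame are assumed to be exclusive,
        # so any other crop from the same frame is a valid negative.
        anchor, negative = random.sample(frame_crops, 2)
        positive = augment(anchor)  # augmented copy keeps the identity
        return anchor, positive, negative

In training, such test-set triplets would then be mixed with ground-truth triplets from the training set at the 1:1 mini-batch ratio described above.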
After intra-frame training, we apply the appearance encoder to generate appearance features for data association. Since the tracker part is not our main focus, we build a simple tracker that only associates observations from two adjacent frames at a time. Using the dense optical flow function of OpenCV, we generate optical flow between two adjacent frames, and then warp each mask from the previous frame to the current frame to calculate the IoU of cross-frame masks. The distance matrix is formulated as follows:

$$
W^{short}_{prev,curr} =
\begin{cases}
\inf, & \text{if } \mathrm{IoU}(mask_{prev}, mask_{curr}) = 0 \\
1 - \dfrac{f_{prev} \cdot f_{curr}}{\|f_{prev}\|\,\|f_{curr}\|}, & \text{otherwise},
\end{cases}
\tag{1}
$$

where $mask_{prev}$ and $mask_{curr}$ respectively denote the mask of the previous frame and the mask of the current frame; $W^{short}_{prev,curr}$ is their edge weight (i.e., distance); $f_{prev}$ and $f_{curr}$ are their appearance features.

Besides constraining data association with low IoU values, we also hope to constrain data association with low appearance similarity. However, it is tricky to heuristically determine a threshold for this constraint. We tackle this issue by analyzing the intra-frame distribution. Specifically, the histogram of appearance cosine similarity between intra-frame masks can be approximated by a normal distribution, and 99.7% of the observation pairs lie within three standard deviations (see Figure 4). We set the appearance affinity threshold at three standard deviations, denoted $\theta^{app}_{short}$.

Figure 3: Constructing training samples for inter-short-tracklet training. P and N represent positive and negative samples, respectively.

Figure 4: The appearance threshold value $\theta^{app}_{short}$ for short-term tracking.
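A minimal sketch of Eq. (1) follows; it assumes the IoU matrix between the flow-warped previous masks and the current masks has already been computed (e.g., with OpenCV's dense optical flow), and the function name short_term_distance is ours.

    import numpy as np

    def short_term_distance(iou_matrix, feats_prev, feats_curr):
        """Build the two-frame distance matrix of Eq. (1).

        iou_matrix: [M, N] IoU between flow-warped previous-frame masks
        and current-frame masks; feats_prev: [M, D] and feats_curr: [N, D]
        appearance features from the retrained encoder.
        """
        # Cosine distance between every previous/current feature pair
        a = feats_prev / np.linalg.norm(feats_prev, axis=1, keepdims=True)
        b = feats_curr / np.linalg.norm(feats_curr, axis=1, keepdims=True)
        dist = 1.0 - a @ b.T
        # Pairs whose warped masks do not overlap cannot be matched
        dist[iou_matrix == 0] = np.inf
        return dist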
Consequently, using the automatically obtained $\theta^{app}_{short}$, we further process $W^{short}_{prev,curr}$ by

$$
W^{short}_{prev,curr} =
\begin{cases}
\inf, & \text{if } W^{short}_{prev,curr} > 1 - \theta^{app}_{short} \\
W^{short}_{prev,curr}, & \text{otherwise}.
\end{cases}
\tag{2}
$$

We apply linear assignment on $W^{short}_{prev,curr}$ to determine the association of masks between the previous frame and the current frame. Due to misdetections and occlusions, such a process can only generate short-term tracklets. However, short-term tracklets reduce the risk of mixing different identities, which is an important condition for the next step.

As we assume that each short-term tracklet contains only a unique identity, short-term tracklets can be used as dependable pseudo labels to train the feature encoder. However, different short-term tracklets that have no overlap in the temporal domain may still hold the same identity. Therefore, we perform inter-short-tracklet retraining under the constraint that sampled short-term tracklets must be temporally overlapping within the same video.

We illustrate the process of training data sampling for inter-short-tracklet training in Figure 3. Within a video, we first sample two identities that appear in a randomly chosen frame, and then randomly choose another frame for one of the selected identities, thus constructing a triplet. The other settings of inter-short-tracklet training are the same as for intra-frame training. We update the appearance features after inter-short-tracklet retraining and use them in the next step.
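The following sketch shows one way the threshold and the matching step might be implemented; it reads the paper's "three standard deviations" as mean plus three standard deviations of the intra-frame similarities, and replacing infinite entries with a large finite cost is our workaround so that SciPy's Hungarian solver stays well defined.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def appearance_threshold(intra_frame_sims):
        """theta_app_short: three standard deviations above the mean of the
        intra-frame cosine-similarity histogram (different-identity pairs).
        One reading of the paper's statistic, stated here as an assumption."""
        sims = np.asarray(intra_frame_sims)
        return sims.mean() + 3.0 * sims.std()

    def associate(dist, theta_app_short):
        """Apply Eq. (2) to the Eq. (1) distance matrix and match masks
        across two adjacent frames."""
        dist = dist.copy()
        dist[dist > 1.0 - theta_app_short] = np.inf  # Eq. (2)
        LARGE = 1e6  # stand-in for inf so the solver accepts the matrix
        cost = np.where(np.isfinite(dist), dist, LARGE)
        rows, cols = linear_sum_assignment(cost)
        # Keep only matches that did not fall on infeasible entries
        return [(r, c) for r, c in zip(rows, cols) if np.isfinite(dist[r, c])]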
With better appearance features and the more robust spatio-temporal information of short-term tracklets, we are able to merge short-term tracklets into long-term ones. The merging process is summarized in Figure 1. Short-term tracklet association is formulated as a hierarchical clustering problem on a weighted graph, in which each node represents a tracklet and the graph edges are represented in a distance matrix $W^{long}$, defined as

$$
W^{long}_{k_1,k_2} =
\begin{cases}
\inf, & \text{if } k_1 = k_2 \\
\inf, & \text{if } \mathrm{Distance}(\Pi_{k_1}, \Pi_{k_2}) > \theta^{t} \\
\inf, & \text{if } \Pi_{k_1} \cap \Pi_{k_2} \neq \emptyset \\
\dfrac{1}{N_{k_1} N_{k_2}} \displaystyle\sum_{i \in \Pi_{k_1}} \sum_{j \in \Pi_{k_2}} \left( 1 - \dfrac{f^{i}_{k_1} \cdot f^{j}_{k_2}}{\|f^{i}_{k_1}\|\,\|f^{j}_{k_2}\|} \right), & \text{otherwise},
\end{cases}
\tag{3}
$$

where, for tracklets $T_{k_1}$ and $T_{k_2}$, $W^{long}_{k_1,k_2}$ is their edge weight (i.e., distance); $\Pi_{k_1}$ and $\Pi_{k_2}$ are their temporal ranges; $f^{i}_{k_1}$ and $f^{j}_{k_2}$ are their appearance features at frames $i$ and $j$; and $N_{k_1}$ and $N_{k_2}$ are the numbers of observations within the tracklets, respectively.

Whenever the matching condition between two short-term tracklets violates any of the following three principles, we set their distance value to be infinite: (1) the short-term tracklet IDs are different, (2) the temporal gap between the two short-term tracklets is within $\theta^{t}$ frames (we use $\theta^{t} = 15$), and (3) there is no temporal overlap between the two short-term tracklets. To maintain these constraints throughout the hierarchical clustering, we apply the centroid linkage criterion to determine the distance between clusters.

The main challenge of applying hierarchical clustering is how to set a proper cutting threshold. We do not give a heuristic value; instead, we let the data speak for themselves. We suppose that the intra-frame and inter-short-tracklet cosine similarity histograms can be separated at $\theta^{app}_{merge}$ (see Figure 5) after inter-short-tracklet retraining, though a small overlap might exist. Without access to the ground truth, this could be a reasonable boundary to distinguish objects based on appearance features. Therefore, we set $1 - \theta^{app}_{merge}$ as the cutting threshold in hierarchical clustering.

Rank       Method          sMOTSA↑  IDF1↑  MOTSA↑  MOTSP↑  MODSA↑  MT↑  ML↓
1st place  ReMOTS (ours)
2nd place  PTPM
3rd place  PT

Table 1: The performance of the top-3 methods on the CVPR 2020 MOTS Challenge test set (up to the submission deadline of May 30th, 2020).
Sequence   Method   sMOTSA↑  IDF1↑  MOTSA↑  MOTSP↑  MODSA↑  MT↑  ML↓
MOTS20-01  ReMOTS
MOTS20-06  ReMOTS
MOTS20-07  ReMOTS
MOTS20-12  ReMOTS

Table 2: The performance of ReMOTS on each sequence of the CVPR 2020 MOTS Challenge test set (up to the submission deadline of May 30th, 2020).
Figure 5: The appearance threshold value for merging short-term tracklets.
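As a sketch of the merging step, the following uses SciPy's hierarchical clustering with the cutting threshold $1 - \theta^{app}_{merge}$. Two substitutions are our assumptions: SciPy's centroid linkage requires raw Euclidean coordinates, so average linkage stands in for the paper's centroid criterion, and infinite entries are mapped to a large finite cost.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def merge_tracklets(W_long, theta_app_merge):
        """Cluster short-term tracklets into long-term identities.

        W_long: [K, K] symmetric tracklet distance matrix from Eq. (3),
        with np.inf for forbidden pairs. Sketch only: average linkage
        approximates the paper's centroid criterion.
        """
        LARGE = 1e6  # finite stand-in for the infinite constraints
        W = np.where(np.isfinite(W_long), W_long, LARGE)
        np.fill_diagonal(W, 0.0)  # squareform expects a zero diagonal
        Z = linkage(squareform(W, checks=False), method='average')
        labels = fcluster(Z, t=1.0 - theta_app_merge, criterion='distance')
        return labels  # tracklets sharing a label form one long-term tracklet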
The only neural network model used in this work, the appearance encoder [3], is not our contribution, and our ReMOTS can perform the same refining when other appearance models are used. Therefore, we make little change to the default settings of [3], except for forming the novel training samples used in our intra-frame training and inter-short-tracklet training. Here, we omit the other details described in [3].
3. Results
We report the performance of our ReMOTS on the MOTChallenge evaluation system, with the metrics introduced in [4]. In Table 1, we list the performance of the top-3 methods up to the submission deadline. Our method mainly outperforms the other two methods in terms of the IDF1 score, which leads to state-of-the-art performance in this challenge. The detailed performance on each test sequence is listed in Table 2. Although the same method is applied, it can be observed that the performance varies considerably across sequences. This may be attributed to the diversity between videos, which calls for more exploration in automatically adapting MOTS models to target videos. Our ReMOTS analyzes statistical information at the entire-video level, but temporally local statistical information, which might be useful for fine-grained adaptation, has not been considered yet.
4. Conclusion
We present our solution, which won the CVPR 2020 MOTS Challenge 1. In our proposed ReMOTS framework, intra-frame training and inter-short-tracklet training are introduced to learn better appearance features for more effective data association, which are our main contributions. Besides, we quantitatively demonstrate how to select proper thresholds by analyzing the statistical information of tracklets, which could be useful for other multiple object tracking works. The main limitation of ReMOTS is that it cannot be used in real-time scenarios, but it may bring insights for designing better online MOTS methods with feature adaptation.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant Number JP17H06101.
References

[1] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11(Mar):1109–1135, 2010.

[2] Kai Chen et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.

[3] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.

[4] Paul Voigtlaender, Michael Krause, Aljoša Ošep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[5] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.