MOTS R-CNN: Cosine-margin-triplet loss for multi-object tracking
Amit Satish Unde and Renu M. Rameshan
School of Computing and Electrical Engineering, Indian Institute of Technology, Mandi, India
[email protected], [email protected]
Abstract
One of the central tasks of multi-object tracking involves learning a distance metric that is consistent with the semantic similarities of objects. Designing an appropriate loss function that encourages discriminative feature learning is among the most crucial challenges in deep-neural-network-based metric learning. Despite significant progress, the slow convergence and poor local optima of existing contrastive and triplet loss based deep metric learning methods necessitate a better solution. In this paper, we propose the cosine-margin-contrastive (CMC) and cosine-margin-triplet (CMT) losses by reformulating both the contrastive and triplet loss functions from the perspective of cosine distance. The proposed reformulation as a cosine loss is achieved by feature normalization, which distributes the learned features on a hypersphere. We then propose the MOTS R-CNN framework for joint multi-object tracking and segmentation, particularly targeted at improving the tracking performance. Specifically, the tracking problem is addressed through deep metric learning based on the proposed loss functions. We propose scale-invariant tracking by using a multi-layer feature aggregation scheme to make the model robust against object scale variations and occlusions. MOTS R-CNN achieves state-of-the-art tracking performance on the KITTI MOTS dataset. We show that MOTS R-CNN reduces identity switching on both cars and pedestrians in comparison with Track R-CNN.
1. Introduction
Multi-object tracking (MOT) is a well-established problem in the field of computer vision [7, 4, 47, 2]. It is regaining the attention of the research community owing to significant progress in autonomous vehicles and robotics [16, 17, 24]. "Tracking-by-detection" is the widely adopted strategy to tackle the MOT problem [47, 31, 37]. It is the process of detecting objects of interest in each video frame and matching their identities across frames to obtain the object trajectories over time. Given a video frame, the object detector identifies and locates possible objects of interest and feeds them as input to the tracking system. The tracker computes the similarity score between the feature vectors of previously tracked targets and objects detected in the current frame. Based on the similarity score, the tracker links detections and targets to find optimal object trajectories. (Our code will be made available at: https://github.com/amitsunde/MOTS-R-CNN)

The traditional tracking system consists of two parts: 1) robust hand-crafted feature extraction for each detected object [35, 1, 13] and 2) distance metric learning consistent with the semantic similarities of objects [20, 34, 27]. After an exhaustive study, it is noticed that the discrimination power of hand-crafted features is low, which results in suboptimal performance [45, 22, 49]. With the evolution of convolutional neural networks (CNNs) in recent years, deep CNN-based object tracking has gained increasing popularity [2, 31, 48]. Different from traditional methods, CNN-based approaches learn the feature representation of each detected object and the distance metric in an end-to-end manner. The MOT problem using CNNs has been thoroughly studied over the past few years, which in turn has resulted in a noticeable improvement in tracking performance [40, 46, 41]. However, tracking remains a challenging problem, especially in unconstrained scenarios such as autonomous driving, due to severe occlusions, scale and appearance variation of objects, false positives (FP) and missing detections, and illumination variations [31, 49, 29].

One possible reason that affects the tracking performance of MOT is the use of axis-aligned rectangular bounding boxes [39, 11]. This choice of bounding box may contain information about the background and other objects in close proximity due to the overlapping of bounding boxes, especially in crowded scenes, which deteriorates the efficiency of the tracker. Hence, approaches to improve tracker performance are shifting from the bounding box level to the pixel level. In this direction, Voigtlaender et al. proposed a unified framework for multi-object tracking and segmentation (MOTS) through the integration of a tracking head in the Mask R-CNN model [9]. Most of the approaches thereafter reported in the literature are focused on learning embedding representations of objects using triplet loss [37, 29, 44]. However, existing deep metric learning frameworks aiming to learn discriminative features based on triplet loss often suffer from slow convergence and a poor local optimum [34, 45, 27, 25].

In this paper, we propose a cosine-margin-triplet (CMT) and cosine-margin-contrastive (CMC) loss to enhance the feature discrimination power of the multi-object tracking model. Since the Euclidean distance between feature vectors can be unbounded, we utilize the dot product between pairs of normalized feature vectors, which is equal to the cosine similarity.
The feature normalization distributes the learned feature vectors on a hypersphere and encourages better feature learning by separating them angularly. In addition, a margin term is introduced in the cosine metric to increase intra-class compactness and inter-class discrimination.

We then propose the MOTS R-CNN network to address the MOTS problem. The proposed model extends Mask R-CNN [9] with the addition of a tracking head. 3D convolutional layers are integrated into the MOTS R-CNN framework to model the spatio-temporal dependency of objects. The tracking problem is addressed through joint learning of the feature representation and distance metric based on the CMT loss in an end-to-end manner. The capability of MOTS R-CNN to handle object scale variations is improved by fusing features from different layers of the convolutional network. Further improvement in tracking accuracy is achieved by taking advantage of the readily available instance segmentation mask in MOTS R-CNN to focus only on the foreground information of objects. In addition, we empirically show the inefficiency of the softmax loss for object classification. We demonstrate that incorporating the large margin cosine loss (LMCL) [38] in MOTS R-CNN increases the object classification accuracy. The major contributions of our work are as follows:

• We propose cosine-margin-triplet and cosine-margin-contrastive loss functions to enforce closeness between pairs of feature vectors belonging to the same instance while pushing dissimilar vectors far away in the embedding space. The proposed loss functions encourage discriminative feature learning for multi-object tracking problems.

• We present the MOTS R-CNN model to address the MOTS problem. We show that identity switching (IDS) using the proposed model is reduced for both cars and pedestrians in comparison with Track R-CNN, which uses triplet loss for distance metric learning.

• We empirically demonstrate the usefulness of the large margin cosine loss over the traditional softmax loss for object classification.

The remainder of this paper is organized as follows. We discuss in section 2 the existing works related to the MOT and MOTS problems. In section 3, we review preliminaries for distance metric learning. In section 4, we propose the CMT and CMC loss functions. We present the proposed MOTS R-CNN model in section 5. The effectiveness of the proposed model is shown through experimental analysis in section 6, followed by conclusions in section 7.
2. Related Work
There are only a few works reported in the literature that address the MOTS task, since it was introduced only recently. So, we first discuss existing literature related to the MOT problem. Then, we briefly review the related work on the MOTS task.
Multi-object tracking.
The work in [31] combined appearance, motion, and interaction cues using a recurrent neural network to make the network robust against occlusions. However, a large number of IDS as compared to state-of-the-art methods shows that the combination of various cues does not guarantee an improvement in the tracking performance. Ji Zhu et al. proposed a spatial and temporal attention model to extract only object-specific features and to reduce the effect of noisy observations on the tracking performance [48]. Their work was mainly focused on the extension of a single object tracker for the MOT task. As an alternative to the traditional tracking-by-detection approach, a joint detection and tracking method was proposed in [40] by integrating a tracking head in a single-shot detector to develop a real-time MOT system. While the joint paradigm speeds up the overall process, the inferior performance of the single-shot detector caused a reduction in the tracking accuracy.

The work in [46] extended a Siamese single object tracker for the MOT task through the integration of motion estimation and data association into a single framework. The joint combination of appearance and motion features from 2D and 3D space was proposed in [41] for improving the accuracy of the tracker. Furthermore, a graph neural network was employed to promote feature interaction of various objects, aiming at more discriminative feature learning. Sarthak Sharma et al. proposed the BeyondPixels model for MOT by taking the 3D pose, shape, and motion information of objects into account [33]. The matching was done only in a specific search area obtained through the propagation of position and orientation information of objects in successive frames.
Multi-object tracking and segmentation.
Paul Voigtlaender et al. proposed the Track R-CNN framework to integrate object detection, instance segmentation, and tracking in a unified framework [37]. The tracking head was incorporated in parallel to the box and mask heads of the Mask R-CNN architecture. The tracking problem was posed as a feature learning problem by minimizing the triplet loss. It was observed that the tracking accuracy was poor both in the case of cars and pedestrians. This can be due to the use of triplet loss for distance metric learning and the extraction of features only from the final convolutional layer of the backbone network. This single-layer feature extraction can make the tracking performance sensitive to object scale variation.

The MOTSNet proposed in [29] achieved scale-invariant feature learning by making use of the feature pyramid network [18] to form the backbone convolutional network. The segmentation mask was used to improve the tracking performance by extracting only the foreground information of objects. The triplet loss was employed to facilitate discriminative feature learning for object tracking. However, the overall performance improvement was largely attributed to training the network with their automatically generated MOTS training dataset. A tracking-by-points strategy was proposed in [43] by interpreting image pixels as irregular 2D point clouds. The shape, color, semantic class labels, and position information of objects was taken into account for object tracking. The incorporation of motion information coupled with the recovery of missing detections in the MOTS framework was proposed in [23]. While MOTSFusion yields promising tracking performance, the two-stage tracking process involving the construction of short tracks using optical flow and their projection into 3D space makes it computationally heavy. The computational overhead in MOTSFusion can restrict its extension to real-time applications such as autonomous driving.
3. Review of distance metric learning
In this section, we review various loss functions that are used for distance metric learning.
Softmax loss.
The softmax loss is widely used for object classification, face recognition, and person re-identification [21, 6, 42, 15]. It enables feature learning by formulating the learning task as a multi-class classification problem. The softmax loss is given as,

$$L = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_j}} \tag{1}$$

where $(x_i, y_i)$ denote the input feature vector to the softmax layer and the associated class label respectively, and $N$ is the number of training examples. The terms $W_j$ and $b_j$ represent the $j$-th column of the weight matrix $W$ and the corresponding bias term respectively. In spite of its wide use, the features learned using the softmax loss function are known to be less discriminative [38, 6, 21].

Large margin cosine loss.
The LMCL overcomes limitations of the traditional softmax loss [38]. It forces the network to distribute learned features on a hypersphere through $\ell_2$ normalization of $W_j$ and $x_i$. The LMCL introduces a positive margin $m$ in the loss function and is expressed as,

$$L = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i} e^{s\cos(\theta_{j,i})}} \tag{2}$$

where $\theta_{j,i}$ is the angle between the normalized $W_j$ and $x_i$. The incorporation of LMCL in face recognition models leads to a noticeable performance improvement [38].
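To make the formulation concrete, the following is a minimal PyTorch-style sketch of the LMCL computation in Eq. (2). The class name and the scale and margin defaults (s = 30, m = 0.35) are illustrative assumptions, not settings taken from [38] or from our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeMarginCosineLoss(nn.Module):
    """Sketch of LMCL (Eq. 2): cosine logits with an additive margin m
    subtracted from the target class, scaled by s before cross-entropy."""

    def __init__(self, feat_dim, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        # One weight vector W_j per class; the bias is dropped since
        # normalization removes its effect on the decision boundary.
        self.W = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, x, labels):
        # l2-normalize features x_i and class weights W_j so that the
        # linear layer outputs cos(theta_{j,i}).
        cos = F.linear(F.normalize(x), F.normalize(self.W))
        # Subtract the margin m from the target-class cosine only.
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).float()
        return F.cross_entropy(self.s * (cos - one_hot * self.m), labels)
```

Contrastive loss.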
Given a pair of feature vectors $(f_i, f_j)$ together with the binary label $y_{ij}$, the contrastive loss minimizes the distance between feature vectors belonging to the same class while penalizing the distance between a negative pair if it is smaller than a specified margin $m$ [34, 45]. The contrastive loss is defined as,

$$L = y_{ij}\,\|f_i - f_j\|^2 + (1 - y_{ij})\max(0,\, m - \|f_i - f_j\|)^2 \tag{3}$$

where $y_{ij} = 1$ when $(f_i, f_j)$ are from the same class and $y_{ij} = 0$ when $(f_i, f_j)$ belong to different classes.

Triplet loss.
Triplet loss has gained a lot of attention for person re-identification and object tracking tasks due to its superior performance in deep face recognition [32, 12, 36, 37, 29]. It is defined on triplets $(f_i^a, f_i^p, f_i^n)$, where $f_i^a$ is referred to as the anchor of the triplet, such that the positive pair $(f_i^a, f_i^p)$ has the same class label and the negative pair $(f_i^a, f_i^n)$ has a different label. Triplet loss distributes the learned feature vectors on an embedding space where the distance between the positive pair is smaller than that of the negative pair by a specified margin $m$. It is given as,

$$L = \max(0,\, \|f_i^a - f_i^p\|^2 - \|f_i^a - f_i^n\|^2 + m) \tag{4}$$

The performance of both contrastive and triplet loss based distance metric learning methods depends on hard sampling, which is employed to obtain nontrivial pairs or triplets in large mini-batches. Despite their widespread use in various applications, both loss functions have the drawbacks of slow convergence and poor local optima [34, 45, 27, 25].
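For reference, the baseline losses of Eqs. (3) and (4) can be written in a few lines of PyTorch. This is a sketch with our own function names; y_ij is assumed to be a float tensor of pair labels, and the margin defaults are placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_i, f_j, y_ij, m=1.0):
    """Eq. (3): pull same-class pairs together; push negative pairs
    apart until they are at least margin m away. y_ij holds 1.0 for
    same-class pairs and 0.0 for different-class pairs."""
    d = F.pairwise_distance(f_i, f_j)
    return (y_ij * d.pow(2) + (1 - y_ij) * F.relu(m - d).pow(2)).mean()

def triplet_loss(f_a, f_p, f_n, m=0.2):
    """Eq. (4): the anchor-positive distance must be smaller than the
    anchor-negative distance by at least margin m."""
    d_ap = (f_a - f_p).pow(2).sum(dim=1)
    d_an = (f_a - f_n).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + m).mean()
```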
4. Proposed CMT and CMC loss functions
One of the important tasks of multi-object tracking involves learning a distance metric that is consistent with the semantic similarities of objects. Distance metric learning is the process of learning an embedded representation of the objects that keeps the distance between similar objects at a minimum while pushing instances of dissimilar objects far away in the embedding space. In this section, we propose a cosine-margin-triplet loss and a cosine-margin-contrastive loss to enhance the feature discrimination power of the multi-object tracking model.

4.1. Cosine-margin-triplet loss
Motivated by the success of triplet loss for face recognition and person re-identification, we design the loss function based on triplets of feature vectors obtained using deep CNNs. Let $D$ denote the set of detections over a batch of video frames. For each detected object $d_i \in D$, there is an associated ground truth segmentation mask, class label, tracking identity, and an association feature vector $f_i$. We define a loss function on the triplets $(f_i, f_i^+, f_i^-)$, where $f_i$ is the anchor of the triplet, which shares the same tracking identity with the anchor positive $f_i^+$ and has a different identity (but the same class) as the anchor negative $f_i^-$. The loss function is formulated as,

$$L = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{f_i^T f_i^+}}{e^{f_i^T f_i^+} + e^{f_i^T f_i^-}} \tag{5}$$

where $N = |D|$. The loss function in Eq. (5) aims to minimize the positive pair $(f_i, f_i^+)$ distance and penalize the negative pair $(f_i, f_i^-)$ distance. It builds upon the dot product between a pair of feature vectors, which is given as,

$$f_i^T f_i^+ = \|f_i\|\,\|f_i^+\| \cos(\theta_i^+) \tag{6}$$

where $\theta_i^+$ is the angle between $f_i$ and $f_i^+$.

The numerical value of the dot product is influenced by both the direction and the norm of the feature vectors. In order to develop effective feature learning, it is essential to make the dot product determined only by the direction. Hence, the feature vectors are normalized to have unit norm by $\ell_2$ normalization. In order to enable sufficient angular separability, the radius of the sphere is scaled to $s$. This step distributes the learned feature vectors on a hypersphere of radius $s$. The cosine similarity between the normalized feature vectors has a geometric correspondence to the geodesic distance on the hypersphere. This correspondence can be attributed to the relationship between the central angle and the arc of a circle, and also to the fact that the learned embeddings are angularly separable. The normalized loss function is expressed as,

$$L = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{s(\hat{f}_i^T \hat{f}_i^+)}}{e^{s(\hat{f}_i^T \hat{f}_i^+)} + e^{s(\hat{f}_i^T \hat{f}_i^-)}} \tag{7}$$

$$= \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{s\cos(\theta_i^+)}}{e^{s\cos(\theta_i^+)} + e^{s\cos(\theta_i^-)}} \tag{8}$$

where $\hat{f}_i = \frac{f_i}{\|f_i\|}$, $\hat{f}_i^+ = \frac{f_i^+}{\|f_i^+\|}$, and $\hat{f}_i^- = \frac{f_i^-}{\|f_i^-\|}$.

However, the features learned using the above loss function are not guaranteed to be highly discriminative. It does not take stringent constraints for discrimination into account, which can produce ambiguity in decision boundaries. For example, let $\theta_i^+$ and $\theta_i^-$ denote the angles of the positive pair (same tracking identity) and the negative pair (different tracking identity) respectively. The normalized loss function in Eq. (8) only forces $\cos(\theta_i^+) > \cos(\theta_i^-)$ to map similar feature vectors close to each other.

To further enhance intra-class compactness and inter-class discrimination, we introduce a margin term in the cosine metric. To be more specific, the loss function is modified to force $\cos(\theta_i^+) - m > \cos(\theta_i^-)$, where the hyperparameter $m$ is the additive cosine margin.
Since $\cos(\theta_i^+) - m$ is less than $\cos(\theta_i^+)$, the incorporation of the cosine margin penalty in the loss function makes the constraints more stringent and thereby promotes discriminative feature learning.

The cosine-margin-triplet loss, which enforces the distribution of the positive pair closer on the hypersphere while pushing the negative pair away, is defined as,

$$L_{CMT} = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{s(\cos(\theta_i^+) - m)}}{e^{s(\cos(\theta_i^+) - m)} + e^{s\cos(\theta_i^-)}} \tag{9}$$

The tracking model is trained by sampling a hard positive and a hard negative for each detected object.

4.2. Cosine-margin-contrastive loss

While the CMT loss results in a relative distance, we propose a loss function that measures the absolute distance, identical to the contrastive loss, to answer the question "How similar/dissimilar are the feature vectors of two objects?" The proposed cosine-margin-contrastive loss is defined as,

$$L_{CMC} = \frac{1}{N}\sum_{i=1}^{N} \Big( -\log e^{\sigma(s(\cos(\theta_i^+) - m))} - \log e^{1 - \sigma(s(\cos(\theta_i^-) - m))} \Big) \tag{10}$$

where $\hat{f}_i = \frac{f_i}{\|f_i\|}$, $\hat{f}_i^+ = \frac{f_i^+}{\|f_i^+\|}$, $\hat{f}_i^- = \frac{f_i^-}{\|f_i^-\|}$, $N = |D|$, and $\sigma(\cdot)$ is the sigmoid activation function. It is worth noting that the application of the sigmoid function bounds the cosine similarity score to $[0, 1]$, and hence $1 - \sigma(\cdot)$ gives the cosine distance between a pair of feature vectors. The first term $-\log e^{\sigma(s(\cos(\theta_i^+) - m))}$ in Eq. (10) is responsible for mapping the positive pair closer, while the second term $-\log e^{1 - \sigma(s(\cos(\theta_i^-) - m))}$ is responsible for pushing the negative pair far away on the hypersphere.

The loss function in Eq. (10) can be easily reformulated as a two-class softmax loss and is given as,

$$L_{CMC} = \frac{1}{N}\sum_{i=1}^{N} \Big( -\log \frac{e^{\sigma(s(\cos(\theta_i^+) - m))}}{e^{\sigma(s(\cos(\theta_i^+) - m))} + e^{1 - \sigma(s\cos(\theta_i^+))}} - \log \frac{e^{1 - \sigma(s(\cos(\theta_i^-) - m))}}{e^{1 - \sigma(s(\cos(\theta_i^-) - m))} + e^{\sigma(s\cos(\theta_i^-))}} \Big) \tag{11}$$
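A minimal PyTorch-style sketch of the proposed CMT (Eq. 9) and CMC (Eq. 10) losses is given below. The function names and the scale and margin defaults are our placeholders, not the settings used in our experiments, and the mining of hard positives and negatives is assumed to happen outside these functions.

```python
import torch
import torch.nn.functional as F

def cmt_loss(f_a, f_p, f_n, s=64.0, m=0.2):
    """Cosine-margin-triplet loss (Eq. 9). Inputs are (N, d) batches
    of anchor / hard-positive / hard-negative embeddings."""
    f_a, f_p, f_n = F.normalize(f_a), F.normalize(f_p), F.normalize(f_n)
    cos_p = (f_a * f_p).sum(dim=1)  # cos(theta_i^+)
    cos_n = (f_a * f_n).sum(dim=1)  # cos(theta_i^-)
    logits = torch.stack([s * (cos_p - m), s * cos_n], dim=1)
    # Cross-entropy with target class 0 is exactly -log softmax of the
    # margin-penalized positive logit, i.e. the fraction in Eq. (9).
    target = torch.zeros(f_a.size(0), dtype=torch.long, device=f_a.device)
    return F.cross_entropy(logits, target)

def cmc_loss(f_a, f_p, f_n, s=64.0, m=0.2):
    """Cosine-margin-contrastive loss (Eq. 10). The sigmoid bounds the
    score to [0, 1], so 1 - sigma(.) acts as a cosine distance."""
    f_a, f_p, f_n = F.normalize(f_a), F.normalize(f_p), F.normalize(f_n)
    cos_p = (f_a * f_p).sum(dim=1)
    cos_n = (f_a * f_n).sum(dim=1)
    pos = -torch.sigmoid(s * (cos_p - m))        # = -log e^{sigma(.)}
    neg = -(1 - torch.sigmoid(s * (cos_n - m)))  # = -log e^{1 - sigma(.)}
    return (pos + neg).mean()
```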
Similar to the CMT loss, hard positive and hard negative sampling is used during training.

Figure 1: Overview of the proposed MOTS R-CNN. The nomenclature conv-x denotes the depth of the backbone network.
5. Proposed MOTS R-CNN
We address the MOTS problem building on the strengths of Mask R-CNN, which extends the Faster R-CNN detector [30] with an instance segmentation (mask) head. We propose MOTS R-CNN, which is an extension of Mask R-CNN with the addition of a tracking head. The architecture of MOTS R-CNN has 1) a bounding box and classification head for object localization and classification, 2) a mask head for instance segmentation of objects, and 3) a tracking head for data association. Given an image, MOTS R-CNN outputs a set of detections with associated bounding boxes, class labels, and instance segmentation masks, together with learned feature vectors. These learned embeddings are then used to associate detected objects in the video sequence.
The proposed MOTS R-CNN architecture is shown in Fig. 1. The first 91 convolutional layers (up to conv-4) of ResNet-101 [10] are used as the convolutional backbone network. 3D convolutional layers are applied to the feature maps obtained from the backbone network. The incorporation of 3D convolutions efficiently models the spatio-temporal dependency of objects [3]. These spatio-temporal features are then shared by a region proposal network (generating a set of object proposals) and a Fast R-CNN detection network [8]. RoI-align is performed on the spatio-temporal features to extract the portion of the feature map specified by each object proposal. All convolutional layers of conv-5 in ResNet-101 are applied on the RoI-aligned features, which are shared by the box, classification, and mask heads. Similar to Mask R-CNN, the box head, classification head, and mask head localize the objects of interest, assign a class label, and generate binary segmentation masks respectively. The detected objects, augmented with their class labels and segmentation masks, are fed as input to the tracking head for association.
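The temporal layers mentioned above can be sketched as a depthwise-separable 3D convolution block. The block below is an illustrative sketch, not the released implementation; the kernel size and feature shapes are assumptions.

```python
import torch
import torch.nn as nn

class SepConv3d(nn.Module):
    """Depthwise-separable 3D convolution block applied on top of the
    2D backbone features (a sketch; the kernel size is assumed)."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.depthwise = nn.Conv3d(channels, channels, kernel_size=k,
                                   padding=k // 2, groups=channels)
        self.pointwise = nn.Conv3d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.relu(self.pointwise(self.depthwise(x)))

# conv-4 features from a short clip of adjacent frames, stacked on the
# time axis (the shapes here are illustrative):
feats = torch.randn(1, 1024, 4, 48, 156)
temporal = SepConv3d(1024)(SepConv3d(1024)(feats))  # same shape as input
```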
MOTS R-CNN extends the original implementation of Mask R-CNN through the incorporation of the tracking head in series with the box, classification, and mask heads. This cascade arrangement of the tracking head speeds up the network during inference, due to the processing of a smaller number of boxes, and improves accuracy by using more accurate bounding boxes. The tracking head consists of fully-connected layers with bounding boxes, class labels, and masks as inputs and an association feature vector for each detected box as output. It is worth noting that occlusion, noisy detections, and appearance changes pose several inevitable challenges in multi-object tracking scenarios. Besides, the scale of various objects also affects the tracking performance adversely. For example, the size of an occluded pedestrian is small relative to a non-occluded car. To address the aforementioned challenges, multi-layer feature aggregation is used in the proposed work, as described in the following.
Multi-layer feature aggregation for scale-invarianttracking.
We first extract the object-specific feature maps from the spatio-temporal features using RoI-align for each detected object. On these RoI-aligned features, all layers of conv-5 are applied. We refer to the feature maps corresponding to the final convolutional layer of the 3D convolutions and of conv-5 in ResNet-101 as C4 and C5 respectively. The RoI-aligned features are likely to comprise both foreground and background information due to the axis-aligned rectangular bounding boxes. It is well known from recent studies that foreground information alone contributes significantly to object tracking. Hence, it is highly essential to ensure that the tracking performance is largely dependent on foreground information.

Since the MOTS framework is equipped with instance segmentation masks, foreground-background discrimination is readily available. In view of the fact that C4 has a higher spatial resolution than C5, we pixel-wise multiply the RoI-aligned features from C4 with the corresponding mask to nullify the effect of background information. Finally, the RoI-aligned masked features and the C5 features are independently converted to 128-dimensional feature vectors by fully-connected layers. These individual 128-dimensional feature vectors are then concatenated to obtain a 256-dimensional embedding vector for each detected object. These feature vectors are then learned using the cosine-margin-triplet loss. A sketch of this aggregation step is given below.
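The following is a minimal sketch of the aggregation step under assumed RoI and channel sizes; the function and layer names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def aggregate_track_features(c4_roi, c5_roi, mask, fc_c4, fc_c5):
    """Sketch of the multi-layer aggregation: suppress background on
    the higher-resolution C4 RoI features with the instance mask,
    project both maps to 128-d, and concatenate into a 256-d embedding."""
    # Resize the binary instance mask to the RoI resolution and
    # pixel-wise multiply to keep only foreground responses.
    mask = F.interpolate(mask, size=c4_roi.shape[-2:], mode='nearest')
    v4 = fc_c4((c4_roi * mask).flatten(1))  # (N, 128)
    v5 = fc_c5(c5_roi.flatten(1))           # (N, 128)
    emb = torch.cat([v4, v5], dim=1)        # (N, 256)
    return F.normalize(emb)                 # unit norm for cosine scoring

# Illustrative shapes: 14x14 RoIs, C4 with 1024 channels, C5 with 2048.
c4_roi = torch.randn(8, 1024, 14, 14)
c5_roi = torch.randn(8, 2048, 7, 7)
mask = torch.randint(0, 2, (8, 1, 28, 28)).float()
fc_c4 = nn.Linear(1024 * 14 * 14, 128)
fc_c5 = nn.Linear(2048 * 7 * 7, 128)
emb = aggregate_track_features(c4_roi, c5_roi, mask, fc_c4, fc_c5)
```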
Data Association.
At test time, the dot product between normalized feature vectors of detected objects gives the similarity score, which is used for data association. Specifically, we link each detected object in the current frame $t$ having a detection confidence score greater than a threshold $\beta$ with detections in the previous frames. The detections from the previous $t - \alpha$ frames are considered for tracking. The detections in the current frame are compared with previously detected objects if and only if (1) the corresponding bounding box centers have a Euclidean distance less than $\gamma$ and (2) the associated pairwise similarity score is larger than the threshold $\delta$. Matching is performed using a greedy algorithm (a sketch is given at the end of this section), and all high-confidence detections that are not linked to any previously detected objects are assigned to a new track.

MOTS R-CNN is trained jointly in an end-to-end manner by defining a multi-task loss on each detected object. This is done by adding the losses associated with the box, classification, mask, and tracking heads together. The multi-task loss is given as,

$$L = L_{BH} + L_{LMC} + L_{MH} + L_{CMT} \tag{12}$$

where $L_{BH}$ and $L_{MH}$ denote the losses corresponding to the box head and mask head respectively, as defined in [9]. Different from Mask R-CNN, we use LMCL and the CMT loss for the classification and tracking heads respectively.
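Returning to the association step, the following is a minimal sketch of the greedy matching described above. The threshold names follow the paper (beta, gamma, delta), but the numeric defaults are placeholders, and for brevity the sketch matches against a single set of active tracks rather than the full window of previous frames.

```python
import numpy as np

def greedy_associate(tracks, detections, beta=0.5, gamma=100.0, delta=0.5):
    """Sketch of the greedy association step. Each track/detection is a
    dict with a unit-norm 'emb' and a bounding-box 'center'; detections
    also carry a confidence 'score'."""
    dets = [d for d in detections if d['score'] > beta]
    pairs = []
    for ti, t in enumerate(tracks):
        for di, d in enumerate(dets):
            # Gate (1): bounding-box centers must be close enough.
            if np.linalg.norm(np.subtract(t['center'], d['center'])) >= gamma:
                continue
            # Gate (2): cosine similarity (dot of unit vectors) above delta.
            sim = float(np.dot(t['emb'], d['emb']))
            if sim > delta:
                pairs.append((sim, ti, di))
    # Greedily commit the highest-similarity pairs first.
    matches, used_t, used_d = [], set(), set()
    for sim, ti, di in sorted(pairs, reverse=True):
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    # Unmatched high-confidence detections start new tracks.
    new_tracks = [di for di in range(len(dets)) if di not in used_d]
    return matches, new_tracks
```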
6. Experiments
In this section, we report a detailed experimental analysis to test the effectiveness of MOTS R-CNN, and thereby the proposed loss functions, on the KITTI MOTS dataset [37]. We carry out ablation studies to signify the contribution of each component of MOTS R-CNN. We present the performance of the proposed algorithm in comparison with the state-of-the-art methods for the MOTS task. The performance is evaluated in terms of the sMOTSA, MOTSA, IDS, FP, and false negative (FN) metrics as described in [37]. While sMOTSA and MOTSA measure the overall accuracy of the MOTS network, IDS evaluates the performance of the tracking head. The FP and FN measure the performance of the bounding box and classification head.
For MOTS R-CNN, ResNet-101 is used as the backbone convolutional network, and it is pre-trained on the ImageNet [5], COCO [19], and Mapillary [26] datasets. Two depthwise separable 3D convolutional layers, each followed by a ReLU activation function, are applied on top of the backbone network. Data augmentations including random flipping and gamma correction are used during training. We train MOTS R-CNN using the Adam optimizer [14]. The network is trained with mini-batches formed by stacking adjacent frames of the same video.

During training, the margin and scale (m, s) of the proposed cosine-margin-triplet loss are set separately for cars and pedestrians. Afterwards, the tracking heads for pedestrians and for cars are fine-tuned with their own mini-batch sizes and (m, s) settings. We set the hyper-parameters for data association during inference per class: the temporal window α, the detection confidence threshold β, the center-distance threshold γ, and the similarity threshold δ. All experiments are performed on an NVIDIA Quadro GP100 card with 16 GB of memory.

We evaluate the effectiveness of the proposed MOTS R-CNN on the KITTI MOTS validation dataset. We compare the performance of MOTS R-CNN with the existing Track R-CNN [37], MOTSFusion [23], BeyondPixels [33], and CIWT [28] models. The last two approaches are primarily designed for the MOT task and were extended to the MOTS problem through the addition of a mask head as described in [37]. The performance of the tracking-by-detection paradigm heavily relies on detection results. Hence, we consider reported results with the same detection and segmentation pipeline as MOTS R-CNN to maintain fairness in the analysis.

Table 1: Performance of MOTS R-CNN on the KITTI validation dataset in comparison with Track R-CNN (sMOTSA↑, MOTSA↑, IDS↓, FP↓, and FN↓, each for Car and Ped)

Comparison with Track R-CNN.
The comparison of the proposed algorithm is given in Table 1. To be more specific, our work differs from Track R-CNN in three aspects: 1) the multi-layer feature aggregation mechanism, 2) the proposed CMT loss for the tracking head, and 3) the LMCL for object classification. While MOTS R-CNN gives an sMOTSA improvement on the car class, the performance on pedestrians rises as well. The most encouraging finding is that ID switching using the proposed algorithm is reduced for both cars and pedestrians in comparison with Track R-CNN, which uses triplet loss for distance metric learning. This reduction in ID switching using MOTS R-CNN can be attributed to the proposed CMT loss and the multi-layer feature aggregation mechanism. Specifically, it signifies that the proposed CMT loss function guides the model to converge to a better optimal minimum than the triplet loss. Furthermore, the improvement in MOTSA for both cars and pedestrians is due to better classification accuracy. This demonstrates the prominence of the large margin cosine loss in reducing the number of false positives.

State-of-the-art comparison.
We compare in Table 2 the performance of MOTS R-CNN with the existing methods. While MOTS R-CNN outperforms BeyondPixels and CIWT, its performance is comparable with MOTSFusion [23]. Different from MOTS R-CNN, which uses appearance features, MOTSFusion exploits motion information for object tracking. It can be seen that MOTS R-CNN keeps the number of IDS at the minimum, thereby attaining state-of-the-art tracking performance. Furthermore, MOTSFusion has a large number of parameters since it requires an additional deep network for the computation of optical flow. It is worth noting that the reduced number of FNs for cars using MOTS R-CNN (629 vs. 673) indicates that the proposed algorithm tracks a larger number of ground truth objects, which is highly desirable. This validates the usefulness of the MOTS R-CNN model in practical applications. On the contrary, the improvement in sMOTSA and MOTSA for pedestrians using MOTSFusion is attributed to its ability to recover missing detections.
We perform several ablations to signify the importanceof each module of MOTS R-CNN.
Multi-layer feature aggregation.
We report in Table 3a the effectiveness of the multi-layer feature aggregation over single-layer feature extraction. More specifically, the features from the final convolutional layer of ResNet-101 are converted to 128-dimensional vectors by a fully connected layer and used for data association. The multi-layer feature aggregation significantly reduces the number of IDS. We observed empirically that the use of multi-layer features makes the network robust against occlusions.
Data association mechanism.
The performance of MOTS R-CNN using the popular Hungarian and greedy algorithms for data association is detailed in Table 3b. The reduction in the number of IDS using the greedy algorithm indicates that the features learned using the proposed method are strongly discriminative. Also, the greedy algorithm is extremely lightweight in comparison to the Hungarian algorithm, which speeds up the tracker at test time.
Object classification loss function.
In Table 3c, we demonstrate the significance of LMCL over the traditional softmax loss for the task of object classification. The incorporation of LMCL in MOTS R-CNN reduces the number of FNs for cars and the number of FPs in the case of pedestrians.

Instance segmentation mask for feature extraction.
Table 3d vividly illustrates the usefulness of the segmentation mask for extracting features from the foreground of objects to improve the tracking performance.
Distance metric learning.
We compare in Table 4 the effect of the proposed CMT and CMC loss functions on feature learning for multi-object tracking. The margin m and scale factor s for the CMC loss are set from an empirical analysis. It can be noticed that the performance using the CMT loss is better than that of the CMC loss. Specifically, the reduction in the number of IDS using the CMT loss illustrates its power to encourage discriminative feature learning. This observation is consistent with the fact that the traditional triplet loss, which results in a relative distance between pairs of feature vectors, is seen as an improvement over the contrastive loss.

Qualitative analysis.
We display in Fig. 2 the qualitative tracking results for visual analysis. It can be noticed that our model is robust against the merge and split problem. Furthermore, the ability of the model to handle occlusions and to perform consistent tracking is also witnessed.

Table 2: Performance of MOTS R-CNN on the KITTI validation dataset in comparison with existing algorithms

Method             sMOTSA↑ Car/Ped   MOTSA↑ Car/Ped   IDS↓ Car/Ped   FP↓ Car/Ped   FN↓ Car/Ped
MOTS R-CNN         78.2 / 49         90.1 / 67.3      35 / 30        130 / 208     629 / 856
MOTSFusion [23]    - / -             - / -            36 / 34        94 / 181      673 / 855
BeyondPixels [33]  76.9 / -          89.7 / -         88 / -         280 / -       - / -
CIWT [28]          68.1 / 42.9       79.4 / 61.0      106 / 42       333 / 401     1214 / 863

Table 3: Ablation results on the KITTI MOTS validation dataset

(a) Feature aggregation
Method        sMOTSA↑ Car/Ped   MOTSA↑ Car/Ped   IDS↓ Car/Ped
Multi-layer   78.2 / 49         90.1 / 67.3      35 / 30
Single-layer  77.8 / 47.7       89.8 / 66.6      39 / 50

(b) Data association mechanisms
Method     sMOTSA↑ Car/Ped   MOTSA↑ Car/Ped   IDS↓ Car/Ped
Greedy     78.2 / 49         90.1 / 67.3      35 / 30
Hungarian  78 / 48.6         90 / 66.8        47 / 46

(c) Object classification loss functions
Loss function  sMOTSA↑ Car/Ped   FP↓ Car/Ped   FN↓ Car/Ped
LMCL           78.2 / 49         130 / 208     629 / 856

(d) Instance segmentation mask for feature extraction
Method        sMOTSA↑ Car/Ped   MOTSA↑ Car/Ped   IDS↓ Car/Ped
With mask     78.2 / 49         90.1 / 67.3      35 / 30
Without mask  77.7 / 48.5       89.8 / 66.8      44 / 38

Table 4: Ablation results for distance metric learning
Loss function  sMOTSA↑ Car/Ped   MOTSA↑ Car/Ped   IDS↓ Car/Ped
CMT            78.2 / 49         90.1 / 67.3      35 / 30
CMC            77.8 / 48.5       89.8 / 67.2      46 / 34
7. Conclusions
In this paper, we proposed cosine-margin-triplet and cosine-margin-contrastive loss functions for deep metric learning. While feature normalization makes the features obtained via CNNs angularly separable, the incorporation of the margin term further boosts their discriminative ability. We then proposed the MOTS R-CNN model to demonstrate the significance of the proposed loss functions for multi-object tracking. We showed the robustness of our model to occlusions and scale variation, which can be attributed to the multi-layer feature aggregation mechanism. The significant reduction in the number of IDS validates the better convergence of the proposed CMT loss. We believe that further improvement in the performance is possible by using a backbone convolutional network involving a feature pyramid network.
Acknowledgement:
This work is funded by the SERB NPDF grant (ref. no. PDF/2019/003459), Government of India.
Figure 2: Qualitative analysis of tracking results on the KITTI MOTS validation dataset: (a) robustness against the merge and split problem and (b)+(c) robustness to occlusions.
References

[1] Anton Andriyenko, Konrad Schindler, and Stefan Roth. Discrete-continuous optimization for multi-target tracking. In CVPR, pages 1926–1933. IEEE, 2012.
[2] Guillem Brasó and Laura Leal-Taixé. Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6247–6257, 2020.
[3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[4] Wongun Choi. Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE International Conference on Computer Vision, pages 3029–3037, 2015.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[6] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[7] Andreas Ess, Bastian Leibe, Konrad Schindler, and Luc Van Gool. A mobile vision system for robust multi-person tracking. In CVPR, pages 1–8. IEEE, 2008.
[8] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Roberto Henschel, Laura Leal-Taixé, Daniel Cremers, and Bodo Rosenhahn. Fusion of head and full-body detectors for multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1428–1437, 2018.
[12] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[13] Weiming Hu, Wei Li, Xiaoqin Zhang, and Stephen Maybank. Single and multiple object tracking using a multi-feature joint sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):816–833, 2014.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2018.
[16] Junwei Liang, Lu Jiang, Kevin Murphy, Ting Yu, and Alexander Hauptmann. The garden of forking paths: Towards multi-future trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10508–10518, 2020.
[17] Junwei Liang, Lu Jiang, Juan Carlos Niebles, Alexander G Hauptmann, and Li Fei-Fei. Peeking into the future: Predicting future person activities and locations in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5725–5734, 2019.
[18] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[20] L. Lin, G. Wang, W. Zuo, X. Feng, and L. Zhang. Cross-domain visual matching via generalized similarity measure and feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1089–1102, 2016.
[21] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.
[22] Chen Long, Ai Haizhou, Zhuang Zijie, and Shang Chong. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, 2018.
[23] Jonathon Luiten, Tobias Fischer, and Bastian Leibe. Track to reconstruct and reconstruct to track. IEEE Robotics and Automation Letters, 5(2):1803–1810, 2020.
[24] Karttikeya Mangalam, Harshayu Girase, Shreyas Agarwal, Kuan-Hui Lee, Ehsan Adeli, Jitendra Malik, and Adrien Gaidon. It is not the journey but the destination: Endpoint conditioned trajectory prediction. arXiv preprint arXiv:2004.02025, 2020.
[25] Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pages 360–368, 2017.
[26] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 4990–4999, 2017.
[27] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
[28] Aljoša Ošep, Wolfgang Mehner, Markus Mathias, and Bastian Leibe. Combined image- and world-space tracking in traffic scenes. In ICRA, pages 1988–1995. IEEE, 2017.
[29] Lorenzo Porzi, Markus Hofinger, Idoia Ruiz, Joan Serrat, Samuel Rota Bulo, and Peter Kontschieder. Learning multi-object tracking and segmentation from automatic annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6846–6855, 2020.
[30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[31] Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In Proceedings of the IEEE International Conference on Computer Vision, pages 300–311, 2017.
[32] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[33] Sarthak Sharma, Junaid Ahmed Ansari, J Krishna Murthy, and K Madhava Krishna. Beyond pixels: Leveraging geometry and shape cues for online multi-object tracking. In ICRA, pages 3508–3515. IEEE, 2018.
[34] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
[35] Valtteri Takala and Matti Pietikainen. Multi-object tracking using color, texture and motion. In CVPR, pages 1–7. IEEE, 2007.
[36] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3539–3548, 2017.
[37] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7942–7951, 2019.
[38] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
[39] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1328–1338, 2019.
[40] Zhongdao Wang, Liang Zheng, Yixuan Liu, and Shengjin Wang. Towards real-time multi-object tracking. arXiv preprint arXiv:1909.12605, 2019.
[41] Xinshuo Weng, Yongxin Wang, Yunze Man, and Kris M Kitani. GNN3DMOT: Graph neural network for 3D multi-object tracking with 2D-3D multi-feature learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6499–6508, 2020.
[42] Ancong Wu, Wei-Shi Zheng, Xiaowei Guo, and Jian-Huang Lai. Distilled person re-identification: Towards a more scalable system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1187–1196, 2019.
[43] Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, and Liusheng Huang. Segment as points for efficient online multi-object tracking and segmentation. In European Conference on Computer Vision, pages 264–281. Springer, 2020.
[44] Fan Yang, Xin Chang, Chenyu Dang, Ziqiang Zheng, Sakriani Sakti, Satoshi Nakamura, and Yang Wu. ReMOTS: Self-supervised refining multi-object tracking and segmentation. arXiv e-prints, 2020.
[45] Xun Yang, Peicheng Zhou, and Meng Wang. Person reidentification via structural deep metric learning. IEEE Transactions on Neural Networks and Learning Systems, 30(10):2987–2998, 2018.
[46] Junbo Yin, Wenguan Wang, Qinghao Meng, Ruigang Yang, and Jianbing Shen. A unified object motion and affinity model for online multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6768–6777, 2020.
[47] Fengwei Yu, Wenbo Li, Quanquan Li, Yu Liu, Xiaohua Shi, and Junjie Yan. POI: Multiple object tracking with high performance detection and appearance feature. In European Conference on Computer Vision, pages 36–42. Springer, 2016.
[48] Ji Zhu, Hua Yang, Nian Liu, Minyoung Kim, Wenjun Zhang, and Ming-Hsuan Yang. Online multi-object tracking with dual matching attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 366–382, 2018.
[49] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision (ECCV), 2018.