Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification
Zhiyuan Chen, Annan Li, Shilu Jiang, and Yunhong Wang
School of Computer Science and Engineering, Beihang University, Beijing, China
{dechen, liannan, shilu jiang, yhwang}@buaa.edu.cn

Abstract.
Video-based person re-identification (Re-ID) is an important computer vision task. The batch-hard triplet loss frequently used in video-based person Re-ID suffers from the Distance Variance among Different Positives (DVDP) problem. In this paper, we address this issue by introducing a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL), which reduces the intra-class variation among positive samples by calculating attribute distance. To achieve a complete model of video-based person Re-ID, a multi-task framework with an Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed. Extensive experiments on the MARS and DukeMTMC-VID datasets show that both AITL and ASTA are very effective. Enhanced by them, even a simple light-weight video-based person Re-ID baseline can outperform existing state-of-the-art approaches. The code has been published at https://github.com/yuange250/Video-based-person-ReID-with-Attribute-information.

1 Introduction

In recent years, person re-identification (Re-ID) under video settings has drawn significant attention. In recent Re-ID studies, the batch-hard triplet loss [13] is frequently used. This metric learning method can significantly narrow the distance between the anchor and its positives, and expand the margin between the anchor and its negatives in a mini-batch. But as shown in Figure 1, the normal batch-hard triplet loss causes the
Distance Variance among Different Positives (DVDP) problem, which makes the model less robust to intra-class variations such as pose and appearance differences.

Attribute information has also been widely used for improving the performance of person re-identification. Many methods use attributes to strengthen the correlation of image pairs or triplets [16,17,58,39,44], in which the distance between predicted attributes across identities is widely used. Some methods use attributes to help co-train the Re-ID models [25,38,39,44,32,26]. Recently, feature aggregation strategies have been adopted to make full use of the attribute information, as in [7,12,22]. Although introducing attributes to person Re-ID in the aforementioned methods demonstrates good performance improvement, they are mainly practiced in an image-based way. Zhao et al. [54] first applied attribute information to video-based person Re-ID by introducing temporal attention, learned during attribute recognition, into the video-based Re-ID task via transfer learning. It should be pointed out that the frame-level temporal attention module in [54] is actually trained on the image-based RAP [20] pedestrian attribute dataset.
Fig. 1.
Normal batch-hard triplet loss can extremely shorten the distance between the anchor and its corresponding positives and expand the distance to its negatives. But among different positives there exists a high distance variance: the anchor tends to be much closer to the positive examples that have a more similar pose or appearance.

However, the authors only use the attribute information to generate the temporal attentions, and no spatial attention is considered. As shown in Figure 2(a) and Figure 2(b), if the bottom part of a pedestrian is occluded, recognizing attributes such as bottom color is not applicable, and other quality problems in pedestrian videos, such as background-dominated frames and multiple persons, also affect the recognition of specific attributes. This shows that spatial cues are also very important in recognizing attributes. However, spatial attention is ignored in existing work on attribute-assisted video-based person Re-ID. Lack of annotation is the likely reason. Recently, attribute annotations for large video-based person Re-ID datasets have become publicly available [4], based on which real attribute-driven spatio-temporal attention can be learned.

Based on the above observations, a novel attribute-assisted approach for video-based person Re-ID is proposed in this paper. To address the DVDP problem, we propose the Attribute-aware Identity-hard Triplet Loss (AITL). By building triplets within positive samples of the same identity according to the attribute distance, the high DVDP can be considerably reduced and the Re-ID performance can be improved. To achieve a complete video-based person Re-ID approach, a multi-task framework with an Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed, in which attribute modeling, identity recognition, and attribute embedding for DVDP reduction are integrated via a unified loss. The effectiveness of AITL and ASTA is well demonstrated by the experiments on the MARS [55] and DukeMTMC-VID [47] datasets.

The main contributions of this paper are summarized as follows.

– An attribute-aware metric learning method for the Distance Variance among Different Positives problem.
– A three-stream multi-task framework for attribute-assisted video-based person Re-ID using both identity-relevant and identity-irrelevant attributes.
– An attribute-driven spatio-temporal attention mechanism for video-based person Re-ID.
Fig. 2.
Performance of attribute recognition can be greatly influenced by temporal partial occlusion. Ruling out such negative effects requires attention along both the spatial and temporal axes. The attribute-driven spatio-temporal attention can also be used for re-identifying people.

The rest of this paper is organized as follows. In the next section we briefly review relevant research. The attribute-aware identity-hard triplet loss is described in Section 3. Then, Section 4 introduces the proposed attribute-driven spatio-temporal attentive multi-task framework for video-based person Re-ID. Experimental results are presented in Section 5 and the conclusion is drawn in Section 6.
2 Related Work

Video-based Person Re-ID
Since person Re-ID is an application of video surveillance, the video-based setting is a natural choice and closer to real-world scenarios. Many challenging datasets have been created for video-based person Re-ID: iLIDS-VID [11], PRID2011 [14], MARS [55], and DukeMTMC-VID [47]. Among these, MARS and DukeMTMC-VID are frequently used recently due to their large numbers of both track-lets and pedestrian identities.

Early methods on video-based person Re-ID focus on handcrafted video representations and metric learning [45,10,29,49,46]. Since the breakthrough of deep Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), deep learning has become the mainstream approach for video-based person Re-ID [23,1,42,43,3,52,51,28,34,50,36,56,15,40]. McLaughlin et al. [33] first proposed a baseline CNN-RNN model to extract features from pedestrian track-lets. Liu et al. [31], Li et al. [22] and Song et al. [37] estimated quality scores or high-quality regions for each frame to weaken the influence of noisy samples automatically.

Siamese networks have also been adopted for video-based Re-ID recently [5,24]. Chen et al. [2] proposed a competitive similarity aggregation scheme with short snippet-based representations to estimate the similarity between two pedestrian sequences. Gao and Nevatia [9] made a thorough experimental survey of the temporal models in video-based person Re-ID. Recently, the performance of video-based Re-ID methods has grown rapidly; Liu et al. [27] use non-local layers to apply spatial and temporal attention
in the training process and achieve good performance on the MARS and DukeMTMC-VID datasets.
Pedestrian Attribute Recognition
Similar to person Re-ID, early works on pedestrian attribute recognition are mainly image-based. Large datasets such as PETA [6], RAP [20] and P-100K [30] have been created. Traditional handcrafted methods as well as end-to-end deep learning methods have all been adopted [25,16,6,17,57,41,18,30,53,19]. Chen et al. [4] first tackled the pedestrian attribute recognition problem in the video-based setting and provided attribute annotations for the MARS and DukeMTMC-VID datasets. They propose a temporal-attention-based method and demonstrate the superiority of using video in recognizing attributes.
Attribute-assisted Person Re-ID
Attributes such as gender, age and clothing can be viewed as a kind of "soft biometrics". Therefore, they can provide additional information and have been introduced into person Re-ID. Early methods use attributes to strengthen the correlation between image pairs or triplets [16,17,58,39,44]. Some methods use attributes as a strong supervision to help co-training [25,38,39,44,32,26]. In some of the latest research, feature aggregation strategies are adopted to make full use of the attribute information [7,12].

Song et al. [38] first applied attribute information to the video-based person Re-ID problem; however, this work is rather limited due to the lack of data. Zhao et al. [54] apply attributes to video-based person Re-ID by learning temporal attention from the external image-based RAP dataset. Although they clearly show the effectiveness of cooperating with attributes, their attention model is limited to the temporal dimension owing to the lack of attribute ground truth on video-based person Re-ID datasets.
Metric Learning for Person Re-ID
Besides the batch-hard triplet loss [13], some other metric learning methods have also been applied to the person Re-ID task, such as the quadruplet loss [3] and the margin sample mining loss [48]. These methods likewise focus on improving Re-ID performance by adjusting the distances between positive and negative pairs. Among them, the batch-hard triplet loss is the most commonly used in recent Re-ID research due to its good generalization ability.
3 Attribute-aware Identity-hard Triplet Loss

As introduced in Section 1 and Figure 1, the batch-hard triplet loss is really effective at narrowing the distance between the anchor and its positives, and also at expanding the distance between the anchor and its negatives. But it ignores the distance relations among the positives themselves. As shown in Figure 1, the normal batch-hard triplet loss causes a problem: when the training process becomes stable, the anchor-to-negative distances are usually much larger than the anchor-to-positive distances, yet even when the overall triplet loss tends to zero, there still exists a distance variance between the anchor and its different positives. The feature of the anchor ends up much closer to the positives that are more similar in appearance. This leads the model to pay more attention to the direct appearance of the pedestrian, and to be less robust to pose and appearance variances between different positives. Obviously, this is harmful to person Re-ID.

So in this paper, we design an attribute-aware identity-hard triplet loss to solve this problem. We use the attribute prediction vectors to measure the appearance similarity among different positive pairs, and then close the Re-ID feature distance gap (see Figure 7) between the anchor-to-most-similar-positive and the anchor-to-most-different-positive distances.
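For reference, the batch-hard triplet loss of Hermans et al. [13] over a mini-batch of $P$ identities with $K$ videos each can be written, in the notation used below (margin $m$, Re-ID feature distance $FD$), as

\[
L_{BH}(\theta; X) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[\, m + \max_{p=1,\dots,K} FD\big(f_\theta(x^i_a), f_\theta(x^i_p)\big) - \min_{\substack{j=1,\dots,P,\; j\neq i \\ n=1,\dots,K}} FD\big(f_\theta(x^i_a), f_\theta(x^j_n)\big) \Big]_{+}.
\]

Once this hinge saturates for an anchor, nothing constrains how the remaining, non-hardest positives are distributed around it, which is exactly where the DVDP arises.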
Fig. 3.
Brief illustration of the attribute-aware triplet loss. "A distance" denotes the distance between attribute prediction vectors of different pedestrian videos, while "F distance" means the distance between Re-ID features. Due to the DVDP problem, there is a big difference among the anchor-to-positive distances (left). We find that the attribute distance is negatively correlated with DVDP. By building additional triplets within the identity positives according to the attribute distance and minimizing the resulting triplet loss, the identity-level DVDP can be reduced.
Intra-Class Triplet Loss
In this paper, we introduce the attribute-aware triplet loss within the positives themselves to solve the DVDP problem. As shown in Figure 3, the attribute prediction vectors generated by the multi-task model, which is described in Section 4, can be used to measure the appearance similarity between the anchor and its corresponding positives.

Based on the attribute distance, an intra-class negative and an intra-class positive can be picked out from the identity positives. As shown in Figure 3, their distances to the anchor are usually different. In other words, at the identity level the intra-class variation, a.k.a. the DVDP, is highly correlated with the attribute difference, and the high-DVDP pairs can be automatically discovered by the attributes. Once the intra-class negative and intra-class positive are selected, triplet-based optimization can be performed to narrow the distance.
Attribute-aware
As mentioned above, we use the attribute distance rather than the Re-ID feature distance to select the attribute-aware intra-class negative and attribute-aware intra-class positive. The reason is that the Re-ID feature of a pedestrian video is a high-level semantic representation while the attribute prediction result is low-level. Low-level semantics are naturally more suitable for comparing the appearance similarity between different pedestrian videos and are more robust to noisy and hard samples. The advantage of the attribute distance in selecting the intra-class negative and intra-class positive is also illustrated through the ablation study in Section 5.2.
Identity Hard
To form a mini-batch, the attribute-aware triplet loss basically follows the sampling rule proposed by Hermans et al. [13]. The core idea is to form batches by randomly sampling P classes (people), and then randomly sampling K (K ≥ 3) videos of each class, resulting in a batch of PK videos. The only difference between AITL and Hermans et al. [13] is that the attribute-aware triplet loss selects the intra-class negative and intra-class positive only from the samples sharing the same identity label as the anchor, so it is theoretically identity-hard.

For each anchor a, we select the intra-class positive and the intra-class negative from the samples sharing the same identity as a in the batch when forming the attribute-aware triplets. The Distance Variance among Different Positives (DVDP) on a single batch, summed over all anchors, can be represented by

\[
\mathrm{DVDP}(\theta; X) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[\,\underbrace{FD\big(f_\theta(x^i_a), f_\theta(x^i_{FN(i,a)})\big)}_{\text{intra-class negative}} - \underbrace{FD\big(f_\theta(x^i_a), f_\theta(x^i_{FP(i,a)})\big)}_{\text{intra-class positive}}\Big]_{+}. \tag{1}
\]

The Attribute-aware Identity-hard Triplet Loss used to reduce DVDP can then be written as

\[
L_{AITL}(\theta; X) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[\,\underbrace{FD\big(f_\theta(x^i_a), f_\theta(x^i_{AN(i,a)})\big)}_{\text{attribute negative}} - \underbrace{FD\big(f_\theta(x^i_a), f_\theta(x^i_{AP(i,a)})\big)}_{\text{attribute positive}}\Big]_{+}. \tag{2}
\]

This is defined for a mini-batch X, where the data point $x^i_j$ corresponds to the j-th video of the i-th person in the batch. In Equation (2), θ denotes the parameters of the Re-ID feature function f, and FD denotes the Re-ID feature distance between different samples. FN and FP are the indices of the intra-class negative and intra-class positive under the feature distance, while AN and AP are the indices of the attribute-aware intra-class negative and attribute-aware intra-class positive for the anchor. For each anchor indexed by (i, a), they are given by

\[
\begin{aligned}
FN(\theta; i, a) &= \operatorname*{argmax}_{j=1,\dots,K,\ j\neq a} FD\big(f_\theta(x^i_a), f_\theta(x^i_j)\big), &
FP(\theta; i, a) &= \operatorname*{argmin}_{j=1,\dots,K,\ j\neq a} FD\big(f_\theta(x^i_a), f_\theta(x^i_j)\big), \\
AN(\gamma; i, a) &= \operatorname*{argmax}_{j=1,\dots,K,\ j\neq a} AD\big(g_\gamma(x^i_a), g_\gamma(x^i_j)\big), &
AP(\gamma; i, a) &= \operatorname*{argmin}_{j=1,\dots,K,\ j\neq a} AD\big(g_\gamma(x^i_a), g_\gamma(x^i_j)\big).
\end{aligned} \tag{3}
\]

In Equation (3), γ denotes the parameters of the attribute recognition model g, and AD is the distance between the attribute prediction vectors of different pedestrian videos. For both distance functions AD and FD, we use the cosine distance as the metric.

By using the attribute prediction vectors to calculate the appearance similarity between the anchor and its different positives, we can easily obtain the intra-class negative and intra-class positive for the anchor by comparing the attribute distances. And by narrowing the Re-ID feature distance gap between the intra-class negative and the intra-class positive, the Distance Variance among Different Positives is considerably reduced. Consequently, the resulting model is more robust to variations such as pose difference.
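A minimal PyTorch sketch of Equations (1)–(3) is given below, assuming the P × K batch layout described above; the helper names (`cosine_dist`, `aitl_loss`) are ours for illustration and are not taken from the released code.

```python
import torch
import torch.nn.functional as F

def cosine_dist(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine distance d = 1 - cos(x, y) between row vectors."""
    x = F.normalize(x, dim=1)
    y = F.normalize(y, dim=1)
    return 1.0 - x @ y.t()

def aitl_loss(feats: torch.Tensor, attrs: torch.Tensor, P: int, K: int) -> torch.Tensor:
    """Attribute-aware Identity-hard Triplet Loss (Eq. 2), illustrative sketch.

    feats: (P*K, D) Re-ID features, grouped so rows [i*K:(i+1)*K] share identity i.
    attrs: (P*K, A) attribute prediction vectors, in the same order.
    """
    losses = []
    for i in range(P):
        f = feats[i * K:(i + 1) * K]           # features of one identity
        a = attrs[i * K:(i + 1) * K]           # matching attribute predictions
        fd = cosine_dist(f, f)                 # Re-ID feature distances (FD)
        ad = cosine_dist(a, a)                 # attribute distances (AD)
        # Exclude the anchor itself from the argmin over attribute distance.
        ad_masked = ad + torch.eye(K, device=ad.device) * 1e6
        an = ad.argmax(dim=1)                  # attribute-aware intra-class negative (AN)
        ap = ad_masked.argmin(dim=1)           # attribute-aware intra-class positive (AP)
        idx = torch.arange(K, device=fd.device)
        # Hinge on the FD gap between the two selected positives (Eq. 2).
        losses.append(torch.clamp(fd[idx, an] - fd[idx, ap], min=0.0))
    return torch.cat(losses).sum()
```

The feature-hard ITL variant used in the ablation study is obtained by selecting the intra-class pair from `fd` instead of `ad`.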
Fig. 4.
Architecture of our multi-task framework. It is mainly composed of three streams: the Re-ID backbone, the ID-irrelevant attribute recognition stream, and the ID-relevant attribute recognition stream.
4 Attribute-driven Spatio-Temporal Attentive Multi-task Framework

Video data can provide more information than a single image. However, as shown in Figure 2, due to inaccurate detectors and trackers, noisy frames are inevitably included. Therefore, besides picking out discriminative frames, screening out disturbing frames and regions is also important for video-based Re-ID. Learning attention from middle-level attributes is a better way to refine features.

The appearance of a pedestrian sequence is influenced by two kinds of factors. Internal factors such as clothing characteristics are directly relevant to identity and therefore should be emphasized. External factors such as pose angle and occlusion are harmful to recognition; to improve performance, their impact should be reduced. Fortunately, the recently released dataset [4] provides annotations for both kinds of factors, which makes it possible to learn comprehensive spatio-temporal attention from pedestrian attributes.

Based on the above observations, we propose a multi-task framework for attribute-enhanced video-based Re-ID. It consists of three streams, which correspond to identity-relevant attributes, identity itself, and identity-irrelevant attributes. As can be seen from Figure 4, to achieve the final re-identification, the three streams are fused together by transferring spatio-temporal attention from the attributes.
As illustrated in Figure 4, the multi-task framework consists of three streams. Before the stream splitting, a backbone ResNet-50 is used to extract the basic frame features. The identity-irrelevant (ID-irrelevant) attribute recognition stream is used to recognize attributes like motion, pose and occlusion. It takes the frame features as input and feeds them into an additional size-preserving convolutional layer. After that, the feature is further processed by a spatio-temporal attention module, which yields the attribute-driven spatio-temporal attention, with one attention value per frame and spatial location, where T is the frame number. The attention can be expanded to the original feature size for later point-wise multiplication. Finally, a spatio-temporal pooling operation encodes the feature into a single vector; it is achieved by performing average pooling along both the spatial and temporal axes. Based on the final feature, attribute estimation is achieved with a linear classifier.

The structure of the identity-relevant (ID-relevant) attribute module is the same except for the size of the output attribute prediction. The ID-relevant attribute recognition module mainly focuses on attributes related to the pedestrian identity, such as the color of clothes, gender, etc., while the ID-irrelevant attribute recognition module pays attention to motion, pose, and noise like occlusion. The attribute prediction vectors produced by these two attribute recognition streams are used for the attribute-aware identity-hard triplet loss, which has been discussed in Section 3.

Since each attribute stream generates a spatio-temporal attention vector that contains rich spatio-temporal information learned in the attribute recognition process, the features from the Re-ID stream are encoded by the two attribute-driven spatio-temporal attention vectors respectively. The two attention-enhanced features are concatenated with the original one. By applying a spatio-temporal pooling operation, the overall feature is finally encoded into a single vector (a schematic forward pass is sketched below). This final feature not only keeps the information from the original frames, but also incorporates the spatio-temporal clues learned from the attributes.
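The fusion can be summarized by the following schematic PyTorch forward pass, under our reading of Figure 4; the module boundaries and tensor shapes (C channels, an H × W map per frame) are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    """Schematic fusion of Re-ID features with two attribute-driven attentions."""

    def __init__(self, backbone, st_att_rel, st_att_irrel):
        super().__init__()
        self.backbone = backbone          # frame feature extractor (ResNet-50 trunk)
        self.st_att_rel = st_att_rel      # ST-attention of the ID-relevant stream
        self.st_att_irrel = st_att_irrel  # ST-attention of the ID-irrelevant stream

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, 3, H_in, W_in) pedestrian video clips
        B, T = clips.shape[:2]
        x = self.backbone(clips.flatten(0, 1))        # (B*T, C, H, W) frame features
        x = x.view(B, T, *x.shape[1:])                # (B, T, C, H, W)
        a_rel = self.st_att_rel(x)                    # (B, T, 1, H, W), values in (0, 1)
        a_irrel = self.st_att_irrel(x)                # (B, T, 1, H, W)
        # Point-wise multiplication (attention broadcasts over channels), then
        # concatenate the two enhanced features with the original one.
        fused = torch.cat([x, x * a_rel, x * a_irrel], dim=2)  # (B, T, 3C, H, W)
        # Spatio-temporal average pooling -> one vector per clip.
        return fused.mean(dim=(1, 3, 4))              # (B, 3C)
```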
Fig. 5.
Detailed structure of our spatio-temporal attention module.

To generate the spatio-temporal attention from the frame features, we designed a light-weight yet effective spatio-temporal attention module. As shown in Figure 5, the module takes the frame features as input. First, the channel dimension of the frame feature is reduced to one by a two-dimensional convolutional layer, which outputs the spatial attention map. After reshape and transpose operations, the attention is converted to a 128 × T matrix. Next, it is processed by a one-dimensional temporal convolutional layer with 128 input channels and 128 output channels, kernel size 3, padding 2 and stride 1. It takes the 128 × T spatial attention (T is the temporal axis) as input and generates an ST-attention of the same size by conducting 1-d convolution along the temporal axis. Finally, the attention is reshaped back to its original form and, after a sigmoid activation, this spatio-temporal attention is used to encode the attribute recognition feature as well as the Re-ID feature.
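A minimal PyTorch sketch of this module, assuming the frame feature map has 16 × 8 = 128 spatial positions (matching the 128 channels of the temporal convolution) and a size-preserving 3 × 3 spatial convolution; we use dilation 2 so that kernel size 3 with the stated padding 2 keeps the temporal length T unchanged.

```python
import torch
import torch.nn as nn

class STAttention(nn.Module):
    """Spatio-temporal attention: a 2-d conv squeezes channels to a spatial map,
    then a 1-d conv mixes information along the temporal axis."""

    def __init__(self, in_channels: int, h: int = 16, w: int = 8):
        super().__init__()
        self.spatial = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        # 128 = h*w spatial positions; padding=2 with dilation=2 keeps length T.
        self.temporal = nn.Conv1d(h * w, h * w, kernel_size=3,
                                  padding=2, dilation=2, stride=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) frame features
        B, T, C, H, W = x.shape
        s = self.spatial(x.flatten(0, 1))          # (B*T, 1, H, W) spatial attention
        s = s.view(B, T, H * W).transpose(1, 2)    # (B, H*W, T): one row per location
        s = self.temporal(s)                       # 1-d conv along the temporal axis
        s = s.transpose(1, 2).view(B, T, 1, H, W)  # back to the original layout
        return torch.sigmoid(s)                    # attention values in (0, 1)
```

The returned map multiplies the frame features point-wise after broadcasting over the channel dimension, as in the fusion sketch above.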
Table 1. Ablation study on the person Re-ID task on MARS (%).

Model                    R1    R5    R10   mAP
Baseline                 84.9  95.2  96.6  79.4
Baseline + ASTA          86.6  95.8  97.2  82.4
Baseline + ITL           86.7  96.0  97.4  83.3
Baseline + AITL          87.4  96.2  97.6  83.5
Baseline + ASTA + AITL   [values lost in extraction]
Since the multi-task model combines the attribute recognition task and the Re-ID task, we unify several loss functions in the training process. We use the binary cross-entropy loss $L_{BCE}$ to train the attribute recognition modules. For the Re-ID task, besides the usual combination of the batch-hard triplet loss $L_{tri}$ and the softmax loss $L_{softmax}$, we introduce the attribute-aware identity-hard triplet loss $L_{AITL}$ discussed in Section 3 to solve the DVDP problem. The final loss function of the multi-task model during training can be written as

\[
L = \underbrace{L_{BCE}}_{\text{attributes}} + \underbrace{L_{tri} + L_{softmax} + L_{AITL}}_{\text{Re-ID}}. \tag{4}
\]

As the final Re-ID feature is the result of concatenation, to strike a balance between attributes and identity we use the cosine distance in the two triplet losses, which is equivalent to the squared Euclidean distance if the features are normalized.
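To see this equivalence, for unit-normalized features $u$ and $v$,

\[
\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2u^{\top}v = 2\big(1 - \cos(u, v)\big) = 2\, d_{\cos}(u, v),
\]

so the two metrics differ only by a constant factor of two.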
5 Experiments

We evaluate our method on two large video datasets for person Re-ID, i.e., MARS [55] and DukeMTMC-VID [47]. MARS consists of 1,261 people captured by six cameras, of whom 625 are used for training and the rest for testing. DukeMTMC-VID is a subset of DukeMTMC [35] and consists of 702 training subjects, 702 test subjects and 408 distractors. The track-lets in both datasets are automatically detected and tracked. The attribute annotations for MARS and DukeMTMC-VID are provided by Chen et al. [4].

In the experiments, all detected pedestrians are resized to a uniform resolution. We set the video clip size T = 4 in the training process, following the best setting of the baseline model of Gao and Nevatia [9]. All models in the experiments are implemented in PyTorch; we choose Adam as the optimizer, and the learning rate is set to 0.0003. The source code has been published online.
Fig. 6.
Visualization of ST-attention on low-quality track-lets.
To verify the effectiveness of our multi-task model and the Attribute-aware Identity-hard Triplet Loss, we trained several models on the MARS dataset: the baseline temporal pooling model of Gao and Nevatia [9]; the baseline with the Attribute-driven Spatio-Temporal Attention (ASTA); the baseline with the Identity-hard Triplet Loss (ITL), which selects the intra-class positive and negative by directly comparing the feature distance rather than the attribute distance; the baseline with the Attribute-aware Identity-hard Triplet Loss (AITL); and the proposed multi-task model, which combines the attribute-driven spatio-temporal attention mechanism with the attribute-aware identity-hard triplet loss.

As shown in Table 1, by integrating the spatio-temporal attention learned during attribute recognition into the person Re-ID task, the re-identification performance is consistently improved. This implies that the attribute recognition streams can find discriminative spatial and temporal clues in the pedestrian image sequence, and that this discriminativeness is not only good for recognizing attributes but also helpful for re-identification.

Without attention, solely using the Identity-hard Triplet Loss gains even more consistent performance improvements, and AITL outperforms ITL in every metric. This is because directly using the feature distance to determine the intra-class positive and negative in a triplet is vulnerable to noisy and difficult samples; introducing the attribute distance makes the model more robust. Combining ASTA and AITL, the proposed multi-task model reaches the best performance, which proves the effectiveness of the proposed strategies.
Table 2.
Results of different stream combinations on MARS (%). [Table body lost in extraction: each row marks a combination of the Re-ID, ID-relevant and ID-irrelevant streams, with the resulting R1 and mAP.]
Fig. 7.
Variation curves of DVDP and mAP during training, with and without the Attribute-aware Identity-hard Triplet Loss (AITL) supervision.

Different attribute streams have different effects on the improvement of Re-ID performance. As shown in Figure 6, for the same track-lets, the ID-relevant and ID-irrelevant attribute streams have different spatio-temporal concentrations, but they consistently focus on the human body. In other words, although some attributes are irrelevant to a specific person, they are still closely relevant to the general human body. Consequently, the learned attention can filter out background elements and occlusions. The recognition results (see Table 2) also show that both ST-attentions are beneficial.

Regarding the feature length issue, as can be seen from Table 2, even with the same feature length the model trained with attributes is still much better than the baseline, which implies that the improvement is mainly derived from the attributes rather than from the expansion of the feature length.
Although the ablation study shows the effectiveness of the Attribute-aware Identity-hard Triplet Loss in improving Re-ID performance, it is still not obvious whether AITL really eases the DVDP problem. To illustrate the effectiveness of AITL in reducing the Distance Variance among Different Positives, we record the average DVDP in each training epoch, computed as the average distance gap between the intra-class negative and the intra-class positive over all anchors in each batch, as well as the mAP in every validation pass. By observing the variation of these two quantities, we get a more intuitive illustration of how AITL reduces DVDP and improves Re-ID performance.

As shown in Figure 7, without AITL supervision the DVDP gets stuck at about 0.075 as training stabilizes, while with AITL supervision the DVDP quickly decreases. The mAP metric also benefits from this decrease in DVDP and reaches a higher value than without AITL. This clearly shows the effectiveness of AITL in reducing the Distance Variance among Different Positives.

In both curves the DVDP drops dramatically at the beginning of training and then quickly increases. This phenomenon may be caused by the normal batch-hard triplet loss, which is often considered unstable at the beginning of training.
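A sketch of how this per-batch statistic can be logged, following Equation (1) with the hinge replaced by the plain average of the gap; `cosine_dist` is the helper defined in the AITL sketch in Section 3.

```python
import torch  # cosine_dist as defined in the AITL sketch above

@torch.no_grad()
def batch_dvdp(feats: torch.Tensor, P: int, K: int) -> float:
    """Average gap between the farthest and closest intra-class positive (Eq. 1)."""
    gaps = []
    for i in range(P):
        f = feats[i * K:(i + 1) * K]
        fd = cosine_dist(f, f)
        fd_masked = fd + torch.eye(K, device=fd.device) * 1e6  # exclude the anchor
        # FN: farthest same-identity sample; FP: closest one.
        gaps.append(fd.max(dim=1).values - fd_masked.min(dim=1).values)
    return torch.cat(gaps).mean().item()
```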
Table 3.
Performance comparison on MARS and DukeMTMC-VID (%).

Model          Source    MARS R1/R5/R20/mAP         DukeMTMC-VID R1/R5/R20/mAP
Zheng [55]     ECCV16    65.0 / 81.1 / 88.9 / 46.6  – / – / – / –
Hermans [13]   arXiv17   79.8 / 91.4 / – / 67.7     – / – / – / –
Xiao [48]      arXiv17   83.0 / 92.6 / – / 72.0     – / – / – / –
[Rows for Liu [27] (BMVC19; MARS mAP 82.8, cf. Figure 8, and 94.9 on DukeMTMC-VID), Zhao [54] (CVPR19; MARS R1 87.0, R5 95.4) and the proposed method were garbled in extraction.]
We compare our method (with ASTA and AITL) with state-of-the-art video-based person Re-ID methods on MARS and DukeMTMC-VID: 1) image-based methods, such as the ResNet-50 baseline [55]; 2) triplet losses: Hermans et al. [13] and the margin sample mining loss proposed by Xiao et al. [48]; 3) spatial or temporal attention methods: the temporal attention model with non-local layers [27], the attribute-driven temporal aggregation model [54], and the Spatio-Temporal Attention Network [8]; 4) other methods: Co-attentive Snippet Embedding (CASE) [2], VRSTC [15], and Global-Local Temporal Representations [21].

The comparisons on MARS are shown in Table 3. Compared with the image-based baseline, the identity-driven spatio-temporal attention models, and the models trained with triplet losses, the proposed method outperforms them in all metrics. Compared with some of the latest methods with complex architectures, such as the models of Liu et al. [27], which use non-local layers to enrich local image features with global sequence information by generating attention masks across frames and spatial locations, and the VRSTC model [15], which uses a GAN-like model to recover the appearance of occluded parts, our model still achieves the best performance in mAP and the Rank-5 recognition rate. To our knowledge, the work by Zhao et al. [54] is the best approach in attribute-assisted video-based person Re-ID; our method consistently outperforms it except for the Rank-20 recognition rate.

Several state-of-the-art video-based Re-ID methods are also compared on DukeMTMC-VID, with results likewise shown in Table 3. Our method clearly outperforms all the aforementioned approaches except on the Rank-1 metric. The mAP margin on DukeMTMC-VID is not as large as on MARS, because in training the AITL it is better to select K ≥ 3 videos for each identity; however, most people in the training set of DukeMTMC-VID cannot meet this requirement, which limits the performance of AITL.
[Plot data, mAP (%) on MARS vs. computation: STA (N=4): (16.0, 80.8); STA (N=6): (24.0, 81.0); STA (N=8): (32.0, 81.2); STE-NVAN: (16.5, 81.2); NVAN: (60.0, 82.8); ours: (20.7, 84.4).]

Fig. 8.
Computation-performance plot of our proposed multi-task model and other existing methods with attention mechanisms.

In conclusion, by comparing our multi-task model with state-of-the-art methods, we show that the proposed attribute-aware spatio-temporal attention and identity-hard triplet loss are very effective. It should be pointed out that the baseline model used in our method is rather simple; the performance could be further improved by introducing better baseline models.
Besides recognition performance, our model is extremely light-weight. To take computational complexity into consideration, we compare our method with existing methods that also use attention mechanisms on the performance-computation plot in Figure 8, visualizing mAP on the MARS dataset for performance and FLOPs for computation. For STA [8], we follow Liu et al. [27], who report three variants of the STA model with different numbers of sampled frames per sequence to better demonstrate the trade-off. The STE-NVAN and NVAN models of Liu et al. [27] are also compared.

Since our model achieves its best performance when the number of sampled frames per sequence T is set to 4, we directly use this best model to compare with the other attentive models. As can be seen from Figure 8, the results not only indicate the advantage of our model in both performance and computation, but also reveal the importance of using attributes in video-based Re-ID.

To evaluate the generalization performance of our multi-task model, we conduct cross-dataset validation on the MARS (M) and DukeMTMC-VID (D) datasets. We train the baseline model and the proposed method on the two datasets respectively, and test them on the other dataset. As shown in Table 4, the performance of both models decreases dramatically in the cross-dataset test due to the domain difference. It is obvious that the performance decline of the M → D test is smaller than that of D → M. A possible explanation is that, compared with DukeMTMC-VID, MARS contains more identities and track-lets; it is not surprising that a larger training set results in better models.
Table 4. Cross-dataset validation (%). [Table body lost in extraction: mAP and R1 of the baseline and the proposed method under the M → M, M → D, D → D and D → M settings.]

Although the proposed method does not show a significant advantage in domain adaptation in the cross-dataset validation, it still keeps an absolute lead over the baseline in every Re-ID metric, which proves that the attribute embedding in the proposed method does not cause serious over-fitting.

6 Conclusion

In this paper, we proposed a multi-task model that combines the Attribute-driven Spatio-Temporal Attention strategy and the Attribute-aware Identity-hard Triplet Loss to exploit attribute information for improving video-based Re-ID performance. The Attribute-driven Spatio-Temporal Attention strategy introduces the important spatio-temporal clues generated during attribute recognition to encode the Re-ID features, while the Attribute-aware Identity-hard Triplet Loss effectively reduces the Distance Variance among Different Positives. Both strategies substantially improve Re-ID performance, and the effectiveness of the proposed model is well demonstrated by the experiments.

Although effective, the proposed multi-task model does not make full use of the pose and motion attributes; they are simply combined with the other ID-irrelevant attributes to form an overall attention. Since the track-lets obtained from two cameras are captured within a very short time, the walking direction, which appears as the pose in a camera, has a strong relation across views, and the movement and speed are also supposed to be consistent. Moreover, we found in the experiments that some attributes, such as clothing color and gender, contribute more than others to the improvement of Re-ID performance; however, addressing the influence of individual attributes as well as their correlations with Re-ID is complicated, and the page limit of this paper does not allow a comprehensive study. We plan to discuss these problems in future work.