Global-Local Temporal Representations For Video Person Re-Identification
Jianing Li, Jingdong Wang, Qi Tian, Wen Gao, Shiliang Zhang
School of Electronics Engineering and Computer Science, Peking University; Huawei Noah's Ark Lab
Abstract
This paper proposes the Global-Local Temporal Representation (GLTR) to exploit the multi-scale temporal cues in video sequences for video person Re-Identification (ReID). GLTR is constructed by first modeling the short-term temporal cues among adjacent frames, then capturing the long-term relations among inconsecutive frames. Specifically, the short-term temporal cues are modeled by parallel dilated convolutions with different temporal dilation rates to represent the motion and appearance of pedestrians. The long-term relations are captured by a temporal self-attention model to alleviate the occlusions and noises in video sequences. The short and long-term temporal cues are aggregated as the final GLTR by a simple single-stream CNN. GLTR shows substantial superiority to existing features learned with body part cues or metric learning on four widely-used video ReID datasets. For instance, it achieves a Rank-1 Accuracy of 87.02% on the MARS dataset without re-ranking, better than the current state of the art.
1. Introduction
Person Re-Identification aims to identify a probe person in a camera network by matching his/her images or video sequences, and has many real applications, including smart surveillance and criminal investigation. Image person ReID has achieved significant progress in terms of both solutions [38, 20, 24] and large benchmark dataset construction [23, 57, 44]. Recently, video person ReID, the interest of this paper, has been attracting a lot of attention because video data is easier to acquire than before and provides richer information than image data. Being able to explore plenty of spatial and temporal cues, video person ReID has the potential to address some challenges in image person ReID, e.g., distinguishing different persons wearing visually similar clothes.

The key focus of existing studies for video person ReID lies in the exploitation of temporal cues. Existing works can be divided into three categories according to their ways of temporal feature learning:
Figure 1. Illustration of two video sequences from two different pedestrians with similar appearance on the MARS dataset (we cover the face for privacy purposes). Local temporal cues among adjacent frames, e.g., motion pattern or speed, help to differentiate those two pedestrians. The global contextual cues among inconsecutive frames can be applied to spot occlusions and noises, e.g., occluded frames show smaller similarity to other frames.

(i) extracting dynamic features from additional CNN inputs, e.g., through optical flow [30, 5]; (ii) extracting spatial-temporal features by regarding videos as 3-dimensional data, e.g., through 3D CNN [27, 19]; and (iii) learning robust person representations by temporally aggregating frame-level features, e.g., through Recurrent Neural Networks (RNN) [50, 30, 5], and temporal pooling or weight learning [26, 59, 22].

The third category, which our work belongs to, is currently dominant in video person ReID. It exhibits two advantages: (i) person representation techniques developed for image ReID can be easily explored compared to the first category; (ii) it avoids the estimation of optical flows, which is still not reliable enough due to misalignment errors between adjacent frames. Current studies have significantly boosted the performance on existing datasets; however, they still show certain limitations in either efficiency or the capability of temporal cue modeling. For instance, the RNN model is complicated to train for long video sequences. Feature temporal pooling cannot model the order of video frames, which also conveys critical temporal cues. It is appealing to explore a more efficient and effective way of acquiring spatial-temporal features through end-to-end CNN learning.

This work targets to learn a discriminative Global-Local Temporal Representation (GLTR) from a sequence of frame features by embedding both short and long-term temporal cues. As shown in Fig. 1, the short-term temporal cues among adjacent frames help to distinguish visually similar pedestrians. The long-term temporal cues help to alleviate the occlusions and noises in video sequences. Dilated spatial pyramid convolution [4, 51] is commonly used in image segmentation tasks to exploit the spatial contexts. Inspired by its strong and efficient spatial context modeling capability, this work generalizes the dilated spatial pyramid convolution to Dilated Temporal Pyramid (DTP) convolution for local temporal context learning. To capture the global temporal cues, a Temporal Self-Attention (TSA) model is introduced to exploit the contextual relations among inconsecutive frames. DTP and TSA are applied on frame-level features to learn the GLTR through end-to-end CNN training. As shown in our experiments and visualizations, GLTR presents strong discriminative power and robustness.

We test our approach on a newly proposed Large-Scale Video dataset for person ReID (LS-VID) and four widely used video ReID datasets, including PRID [14], iLIDS-VID [43], MARS [56], and DukeMTMC-VideoReID [47, 34], respectively. Experimental results show that GLTR achieves consistent performance superiority on those datasets. It achieves a Rank-1 Accuracy of 87.02% on MARS without re-ranking, better than the recent PBR [39] that uses extra body part cues for video feature learning. It also beats the current state of the art in Rank-1 Accuracy on PRID and achieves 96.29% Rank-1 Accuracy on DukeMTMC-VideoReID.

GLTR is extracted by simple DTP and TSA models posted on a sequence of frame features.
Although simple and efficient to compute, this solution outperforms many recent works that use complicated designs like body part detection and multi-stream CNNs. To our best knowledge, this is an early effort that jointly leverages dilated convolution and self-attention for multi-scale temporal feature learning in video person ReID.
2. Related Work
Existing person ReID works can be summarized into image-based ReID [43, 38, 31, 49, 55] and video-based ReID [56, 35, 39, 19], respectively. This part briefly reviews four categories of temporal feature learning in video person ReID, which are closely related to this work.
Temporal pooling is widely used to aggregate features across all time stamps. Zheng et al. [56] apply max and mean pooling to get the video feature. Li et al. [22] utilize part cues and learn a weighting strategy to fuse features extracted from video frames. Suh et al. [39] propose a two-stream architecture to jointly learn the appearance feature and part feature, and fuse the image-level features through a pooling strategy. Average pooling is also used in recent works [21, 47], which apply unsupervised learning for video person ReID. Temporal pooling exhibits promising efficiency, but extracts frame features independently and ignores the temporal order among adjacent frames.
Optical flow encodes the short-term motion between adjacent frames. Many works utilize optical flow to learn temporal features [36, 8, 5]. Simonyan et al. [36] introduce a two-stream network to learn the spatial feature and the temporal feature from stacked optical flows. Feichtenhofer et al. [7] leverage optical flow to learn spatial-temporal features and evaluate different types of motion interactions between the two streams. Chung et al. [5] introduce a two-stream architecture for appearance and optical flow, and investigate the weighting strategy for the two streams. Mclaughlin et al. [30] introduce optical flow and RNN to exploit long and short-term temporal cues. One potential issue of optical flow is its sensitivity to spatial misalignment errors, which commonly exist between adjacent person bounding boxes.
Recurrent Neural Network (RNN) is also adopted for video feature learning in video person ReID. Mclaughlin et al. [30] first extract image-level features, then introduce RNN to model temporal cues across frames. The outputs of RNN are then combined through temporal pooling as the final video feature. Liu et al. [29] propose a recurrent architecture to aggregate the frame-level representations and yield a sequence-level human feature representation. RNN introduces a certain number of fully-connected layers and gates for temporal cue modeling, making it complicated and difficult to train.

3D convolution directly extracts spatial-temporal features through end-to-end CNN training. Recently, deep 3D CNN has been introduced for video representation learning. Tran et al. [41] propose the C3D network for spatial-temporal feature learning. Qiu et al. [32] factorize the 3D convolutional filters into spatial and temporal components, which yields performance gains. Li et al. [19] build a compact Multi-scale 3D (M3D) convolution network to learn multi-scale temporal cues. Although 3D CNN has exhibited promising performance, it is still sensitive to spatial misalignments and needs to stack a certain number of 3D convolutional kernels, resulting in large parameter overheads and increased difficulty for CNN optimization.

This paper learns GLTR by posting DTP and TSA modules on frame features. Compared with existing temporal pooling strategies, our approach jointly captures global and local temporal cues, hence exhibits stronger temporal cue modeling capability. It is easier to optimize than RNN and presents better robustness to misalignment errors than optical flow. Compared with 3D CNN, our model has a simpler architecture and can easily leverage representations developed for image person ReID. As shown in our experiments, our approach outperforms the recent 3D CNN model M3D [19] and the recurrent model STMP [29].
Figure 2. Illustration of our frame feature aggregation subnetwork for GLTR extraction, which consists of the Dilated Temporal Pyramid (DTP) convolution for local temporal context learning and the Temporal Self-Attention (TSA) model to exploit the global temporal cues.
3. Proposed Methods
Video person ReID aims to identify a gallery video that is about the same person as a query video from a gallery set containing $K$ videos. A gallery video is denoted by $G_k = \{I_{k1}, I_{k2}, \ldots, I_{kT_k}\}$ with $k \in \{1, 2, \ldots, K\}$, and the query video is denoted by $Q = \{I_{q1}, I_{q2}, \ldots, I_{qT_q}\}$, where $T_k$ ($T_q$) denotes the number of frames in the sequence and $I_{kt}$ ($I_{qt}$) is the $t$-th frame. A gallery video $G$ will be identified as a true positive if it has the closest distance to the query based on a video representation, i.e.,

$$G = \arg\min_k \, \mathrm{dist}(f_{G_k}, f_Q), \quad (1)$$

where $f_{G_k}$ and $f_Q$ are the representations of the gallery video $G_k$ and the query video $Q$, respectively.

Our approach consists of two subnetworks to learn a discriminative video representation $f$, i.e., an image feature extraction subnetwork and a frame feature aggregation subnetwork. The first subnetwork extracts features of $T$ frames, i.e., $F = \{f_1, f_2, \ldots, f_T\}$, where $f_t \in \mathbb{R}^d$. The second subnetwork aggregates the $T$ frame features into a single video representation vector. We illustrate the second subnetwork, which is the focus of this work, in Fig. 2, and briefly demonstrate the computation of DTP and TSA in the following paragraphs.

DTP is designed to capture the local temporal cues among adjacent frames. As shown in Fig. 2, DTP takes the frame features in $F$ as input and outputs the updated frame features $F' = \{f'_1, f'_2, \ldots, f'_T\}$. Each $f'_t \in F'$ is computed by aggregating its adjacent frame features, i.e.,

$$f'_t = M_{DTP}(f_{t-i}, \ldots, f_{t+i}), \quad (2)$$

where $M_{DTP}$ denotes the DTP model, and $f'_t$ is computed from $2 \times i$ adjacent frames.

The TSA model exploits the relations among inconsecutive frames to capture the global temporal cues. It takes $F'$ as input and outputs the temporal features $F'' = \{f''_1, f''_2, \ldots, f''_T\}$. Each $f''_t \in F''$ is computed by considering the contextual relations among features inside $F'$, i.e.,

$$f''_t = M_{TSA}(F', f'_t), \quad (3)$$

where $M_{TSA}$ is the TSA model.

Each $f''_t$ aggregates both local and global temporal cues. We finally apply average pooling on $F''$ to generate the fixed-length GLTR $f$ for video person ReID, i.e.,

$$f = \frac{1}{T} \sum_{t=1}^{T} f''_t. \quad (4)$$

Average pooling is also commonly applied in RNN [30] and 3D CNN [19] to generate fixed-length video features. The global and local temporal cues embedded in each $f''_t$ guarantee the strong discriminative power and robustness of $f$. The following parts introduce the design of DTP and TSA.
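To make Eq. (1) concrete, the following is a minimal retrieval sketch rather than the paper's released code; it assumes the GLTRs have already been extracted and simply ranks gallery videos by Euclidean distance:

```python
# Minimal sketch of the retrieval rule in Eq. (1); tensor shapes are assumptions.
import torch

def retrieve(f_q: torch.Tensor, f_gallery: torch.Tensor) -> int:
    """f_q: (D,) GLTR of the query video Q.
    f_gallery: (K, D) GLTRs of the K gallery videos.
    Returns k = argmin_k dist(f_{G_k}, f_Q) under Euclidean distance."""
    dists = torch.cdist(f_q.unsqueeze(0), f_gallery)  # (1, K) pairwise distances
    return int(dists.argmin(dim=1))

# Example: 5 gallery videos with 384-dim GLTRs (384 = N*d for N = 3, d = 128).
print(retrieve(torch.randn(384), torch.randn(5, 384)))
```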
Dilated Temporal Convolution: Dilated spatial convolution has been widely used in image segmentation for its efficient spatial context modeling capability [52]. Inspired by dilated spatial convolution, we implement dilated temporal convolution for local temporal feature learning. Suppose $W \in \mathbb{R}^{d \times w}$ is a convolutional kernel with temporal width $w$. With input frame features $F = \{f_1, f_2, \ldots, f_T\}$, the output $F^{(r)}$ of the dilated convolution with dilation rate $r$ can be defined as

$$F^{(r)} = \{f^{(r)}_1, f^{(r)}_2, \ldots, f^{(r)}_T\}, \quad f^{(r)}_t = \sum_{i=1}^{w} f_{[t + r \cdot i]} \times W^{(r)}[i], \quad f^{(r)}_t \in \mathbb{R}^d, \quad (5)$$

where $F^{(r)}$ is the collection of output features containing $f^{(r)}_t$, and $W^{(r)}$ denotes the dilated convolution with dilation rate $r$.

Figure 3. Visualization of $F$, $F'$, $F''$, $M$, and $f$ computed on a tracklet with occlusions. The dimensionality of $F$, $F'$, and $F''$ is reduced by PCA for visualization. It is clear that occlusion affects the baseline feature $F$, i.e., the feature substantially changes as occlusion happens. DTP and TSA progressively alleviate the occlusions, i.e., the features of occluded frames in $F'$ and $F''$ appear similar to the others. $f^*$ is generated after manually removing occluded frames. $f$ is quite close to $f^*$, indicating the strong robustness of GLTR to occlusion.

The dilation rate $r$ indicates the temporal stride for sampling frame features. It decides the temporal scales covered by the dilated temporal convolution. For instance, with $r = 2$, $w = 3$, each output feature corresponds to a temporal range of five adjacent frames. Standard convolution can be regarded as a special case with $r = 1$, which covers three adjacent frames. Compared with standard convolution, dilated temporal convolution with $r \geq 2$ has the same number of parameters to learn, but enlarges the receptive field of neurons without reducing the temporal resolution. This property makes dilated temporal convolution an efficient strategy for multi-scale temporal feature learning.
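As a sketch, Eq. (5) maps onto a 1D convolution over the temporal axis. Assuming frame features are stacked as a (batch, d, T) tensor, a dilated `nn.Conv1d` reproduces the five-frame receptive field of the $r = 2$, $w = 3$ example above; the padding choice, which keeps the temporal resolution, is our assumption:

```python
# One dilated temporal convolution branch (Eq. (5)); d, T and padding are assumptions.
import torch
import torch.nn as nn

d, T, w, r = 128, 16, 3, 2
branch = nn.Conv1d(d, d, kernel_size=w, dilation=r, padding=r * (w - 1) // 2)

frames = torch.randn(1, d, T)   # F = {f_1, ..., f_T} stacked along the last axis
out = branch(frames)            # F^(r): each f_t^(r) sees r*(w-1)+1 = 5 frames
print(out.shape)                # torch.Size([1, 128, 16]) -- resolution preserved
```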
Dilated Temporal Pyramid Convolution: Dilated temporal convolutions with different dilation rates model temporal cues at different scales. We hence use parallel dilated convolutions to build the DTP convolution and enhance its local temporal cue modeling ability.

As illustrated in Fig. 2, the DTP convolution consists of $N$ parallel dilated convolutions with dilation rates increasing progressively to cover various temporal ranges. For the $n$-th dilated temporal convolution, we set its dilation rate $r_n$ as $r_n = 2^{n-1}$ to efficiently enlarge the temporal receptive fields. We concatenate the outputs of the $N$ branches as the updated temporal feature $F'$, i.e., we compute $f'_t \in F'$ as

$$f'_t = \mathrm{concat}(f^{(r_1)}_t, f^{(r_2)}_t, \ldots, f^{(r_N)}_t), \quad f'_t \in \mathbb{R}^{Nd}, \quad (6)$$

where $r_i$ is the dilation rate of the $i$-th dilated temporal convolution.
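A minimal DTP sketch under the stated design ($N$ parallel dilated convolutions with rates $r_n = 2^{n-1}$, outputs concatenated as in Eq. (6)); bias, padding, and the per-branch output width equal to $d$ are assumptions:

```python
# Sketch of DTP (Eq. (6)): N parallel dilated temporal convolutions, concatenated.
import torch
import torch.nn as nn

class DTP(nn.Module):
    def __init__(self, d: int = 128, n_branches: int = 3, w: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(d, d, kernel_size=w, dilation=2 ** n,
                      padding=(2 ** n) * (w - 1) // 2)   # rates 1, 2, 4 for N = 3
            for n in range(n_branches)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d, T) frame features F; output: (batch, N*d, T) features F'
        return torch.cat([branch(x) for branch in self.branches], dim=1)

print(DTP()(torch.randn(1, 128, 16)).shape)  # torch.Size([1, 384, 16])
```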
Self-Attention: The self-attention module has recently been used to learn long-range spatial dependencies in image segmentation [10, 15, 53], action recognition [42], and image person ReID [16, 1]. Inspired by its promising performance in spatial context modeling, we generalize self-attention to capture the contextual temporal relations among inconsecutive frames.
Temporal Self-Attention:
The basic idea of TSA is to compute a $T \times T$ sized attention mask $M$ to store the contextual relations among all frame features. As illustrated in Fig. 2, given the input $F' \in \mathbb{R}^{Nd \times T}$, TSA first applies two convolution layers followed by Batch Normalization and ReLU to generate feature maps $B$ and $C$, each with size $(Nd/\alpha) \times T$. Then, it performs a matrix multiplication between $C$ and the transpose of $B$, resulting in a $T \times T$ sized temporal attention mask $M$.

$M$ is applied to update $F'$ to embed extra global temporal cues. $F'$ is fed into a convolution layer to generate a new feature map $\bar{F}'$ with size $(Nd/\alpha) \times T$. $\bar{F}'$ is hence multiplied by $M$ and fed into a convolution layer to recover its size to $Nd \times T$. The resulting feature map is fused with the original $F'$ by a residual connection, leading to the updated temporal feature $F''$. The computation of TSA can be denoted as

$$F'' = W * (\bar{F}' \cdot M) + F', \quad F'' \in \mathbb{R}^{Nd \times T}, \quad (7)$$

where $W$ denotes the last convolutional kernel. $W$ is initialized as 0 to simplify the optimization of the residual connection. $\alpha$ controls the parameter size in TSA; we experimentally set $\alpha$ to 2. $F''$ is processed with average pooling to generate the final GLTR $f \in \mathbb{R}^{Nd}$.

We visualize $F$, $F'$, $F''$, $M$, and $f$ computed on a tracklet with occlusion in Fig. 3. DTP reasonably alleviates occlusion by applying convolutions to adjacent features. TSA alleviates occlusion mainly by computing the attention mask $M$, which stores the global contextual relations as shown in Fig. 3. With $M$, average pooling on $F''$ can be conceptually expressed as

$$\sum_{t=1}^{T} F''(:, t) \doteq \sum_{t=1}^{T} F'(:, t) \times m(t) + \sum_{t=1}^{T} F'(:, t), \quad (8)$$

where $m = \sum_{t=1}^{T} M(:, t)$ is a $T$-dim weighting vector. Note that Eq. (8) omits the convolutions before and after $\bar{F}'$ to simplify the expression. $m$ is visualized in Fig. 3, where occluded frames present lower weights, indicating that their features are suppressed during average pooling. Combining DTP and TSA, GLTR presents strong robustness.
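A sketch of TSA following the description above: $B$ and $C$ come from 1x1 temporal convolutions with BN and ReLU, the mask is built from $C^{\top}$ and $B$, and the last conv $W$ is zero-initialized so the residual branch starts as an identity. The softmax normalization (shown in Fig. 2) axis and other minor details are assumptions:

```python
# Sketch of TSA (Eq. (7)); alpha = 2 follows the text, softmax axis is an assumption.
import torch
import torch.nn as nn

class TSA(nn.Module):
    def __init__(self, channels: int = 384, alpha: int = 2):
        super().__init__()
        mid = channels // alpha
        def proj():  # 1x1 conv + BN + ReLU, producing an (Nd/alpha) x T map
            return nn.Sequential(nn.Conv1d(channels, mid, 1),
                                 nn.BatchNorm1d(mid), nn.ReLU())
        self.b, self.c = proj(), proj()
        self.bar = nn.Conv1d(channels, mid, 1)   # produces F-bar' of size (Nd/alpha) x T
        self.w = nn.Conv1d(mid, channels, 1)     # the last conv W, recovers Nd x T
        nn.init.zeros_(self.w.weight)            # W initialized as 0 (per the text),
        nn.init.zeros_(self.w.bias)              # so TSA initially passes F' through

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: F' of shape (batch, Nd, T); returns F'' of the same shape (Eq. (7))
        B, C = self.b(x), self.c(x)                        # (batch, mid, T)
        M = torch.softmax(C.transpose(1, 2) @ B, dim=-1)   # (batch, T, T) mask
        return self.w(self.bar(x) @ M) + x                 # residual connection

print(TSA()(torch.randn(1, 384, 16)).shape)  # torch.Size([1, 384, 16])
```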
4. Experiment
We test our methods on four widely used video ReID datasets and a novel large-scale dataset. Example images are depicted in Fig. 5 and statistics are given in Table 1.
PRID-2011 [14].
There are 400 sequences of 200 pedestrians captured by two cameras. Each sequence has a length between 5 and 675 frames.

iLIDS-VID [43]. There are 600 sequences of 300 pedestrians from two cameras. Each sequence has a variable length between 23 and 192 frames. Following the implementation in previous works [43, 22], we randomly split these two datasets into train/test identities. This procedure is repeated 10 times for computing averaged accuracies.

Table 1. The statistics of our LS-VID dataset and other video person ReID datasets.

dataset | #identities | #tracklets | #bboxes | avg. length | #indoor cam. | #outdoor cam. | detector | evaluation
MARS | 1,261 | 20,715 | 1,067,516 | 58 | 0 | 6 | DPM | CMC + mAP
PRID | 200 | 400 | 40,033 | 100 | 0 | 2 | Hand | CMC
iLIDS-VID | 300 | 600 | 42,460 | 73 | 2 | 0 | Hand | CMC
LS-VID | 3,772 | 14,943 | 2,982,685 | 200 | 3 | 12 | Faster R-CNN | CMC + mAP

Figure 4. Some statistics of the LS-VID dataset: (a) the number of sequences with different lengths; (b) the number of sequences in each of the 15 cameras; (c) the number of identities with different sequence numbers; (d) the ReID performance with different testing sequence lengths.
MARS [56]. This dataset is captured by 6 cameras. It consists of 17,503 sequences of 1,261 identities and 3,248 distractor sequences. It is split into 625 identities for training and 636 identities for testing. The bounding boxes are detected with the DPM detector [9] and tracked using the GMMCP tracker [6]. We follow the protocol of MARS and report the Rank-1 accuracy and mean Average Precision (mAP).
DukeMTMC-VideoReID [47, 34]. There are 702 identities for training, 702 identities for testing, and 408 identities as distractors. The training set contains 369,656 frames of 2,196 tracklets, and the test set contains 445,764 frames of 2,636 tracklets.
LS-VID.
Besides the above four datasets, we collect a novel Large-Scale Video dataset for person ReID (LS-VID).
Raw video capture:
We utilize a 15-camera network and select 4 days for data recording. For each day, 3 hours of videos are taken in the morning, noon, and afternoon, respectively. Our final raw video contains 180 hours of video, 12 outdoor cameras, 3 indoor cameras, and 12 time slots.

Figure 5. Frames evenly sampled from person tracklets. Each row shows two sequences of the same person under different cameras. Compared with existing datasets, LS-VID presents more substantial variations of lighting, scene, and background, etc. We cover the face for privacy purposes.
Detection and tracking:
Faster R-CNN [33] is utilized for pedestrian detection. After that, we design a feature matching strategy to track each detected pedestrian in each camera. After discarding sequences that are too short, we finally collect 14,943 sequences of 3,772 pedestrians, with an average sequence length of 200 frames.
Characteristics:
Example sequences of LS-VID are shown in Fig. 5, and statistics are given in Table 1 and Fig. 4. LS-VID shows the following new features: (1) longer sequences; (2) more accurate pedestrian tracklets; (3) currently the largest video ReID dataset; and (4) it defines a more realistic and challenging ReID task.
Evaluation protocol:
Because of the expensive data annotation, we randomly divide our dataset into training and test sets with a 1:3 ratio to encourage more efficient training strategies. We further divide out a small validation set. Finally, the training set contains 550,419 bounding boxes of 842 identities, the validation set contains 155,191 bounding boxes of 200 identities, and the test set contains 2,277,075 bounding boxes of 2,730 identities. Similar to existing video ReID datasets [56, 47], LS-VID utilizes the Cumulated Matching Characteristics (CMC) curve and mean Average Precision (mAP) as evaluation metrics.

Figure 6. Rank-1 accuracy of DTP and two competitors on three datasets with different numbers of branches, i.e., parameter N: (a) LS-VID, (b) MARS, (c) DukeMTMC.

Implementation details: We employ the standard ResNet50 [12] as the backbone for frame feature extraction. All models are trained and fine-tuned with PyTorch. Stochastic Gradient Descent (SGD) is used to optimize our model. Input images are resized to 256 × 128. Adjacent frames are sampled from each sequence as input for each training epoch. The batch size is set to 10. The initial learning rate is set to 0.01 and is reduced by a factor of ten after 120 epochs. The training is finished after 400 epochs. All models are trained with only the softmax loss.

During testing, we use the 2D CNN to extract a d = 128-dim feature from each video frame, then fuse the frame features into GLTR using the network illustrated in Fig. 2. The video feature is finally used for person ReID with the Euclidean distance. All of our experiments are implemented with a GTX TITAN X GPU, an Intel i7 CPU, and 128GB memory.
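Putting the pieces together, here is a sketch of test-time GLTR extraction under the setup above. The exact reduction from ResNet50's 2048-dim pooled output to the d = 128-dim frame feature is not specified in the text, so the linear projection below is an assumption; DTP and TSA refer to the sketches in Sec. 3:

```python
# Sketch of test-time GLTR extraction: ResNet50 frame features -> DTP -> TSA ->
# average pooling (Eq. (4)). The 2048 -> 128 projection is a hypothetical choice.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50().children())[:-1])  # drop the fc layer
project = nn.Linear(2048, 128)          # assumed d = 128 frame-feature reduction
dtp, tsa = DTP(d=128), TSA(channels=3 * 128)
for m in (backbone, dtp, tsa):
    m.eval()                            # inference mode (freezes BatchNorm stats)

@torch.no_grad()
def extract_gltr(video: torch.Tensor) -> torch.Tensor:
    # video: (T, 3, 256, 128) frames of one tracklet
    f = project(backbone(video).flatten(1))   # (T, 128) frame features F
    f = f.t().unsqueeze(0)                    # (1, 128, T): channels x time
    f = tsa(dtp(f))                           # F'' with local + global cues
    return f.mean(dim=2).squeeze(0)           # GLTR f in R^{Nd}, Eq. (4)

print(extract_gltr(torch.randn(8, 3, 256, 128)).shape)  # torch.Size([384])
```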
Comparison of DTP and other local temporal cue learning strategies: Besides DTP, we also implement the following strategies to learn temporal cues among adjacent frames: (i) pyramid temporal convolution without dilation, and (ii) temporal pyramid pooling [54]. As explained in Sec. 3.2, the dilation rate of the $i$-th pyramid branch in DTP is $r_i = 2^{i-1}$. To make a fair comparison, we set the three methods to have the same number of branches, where each branch has the same size of receptive field; for instance, we set the convolution kernel size to $d \times 9$ for the 3rd branch of the pyramid temporal convolution without dilation, as derived below. The experimental results on MARS, DukeMTMC-VideoReID, and the validation set of LS-VID are summarized in Fig. 6.
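The equal receptive fields follow directly from Eq. (5): a dilated temporal convolution of width $w$ and rate $r$ spans $r(w-1)+1$ frames, consistent with the five-frame example for $r = 2$, $w = 3$ in Sec. 3. Matching the three DTP branches without dilation therefore requires kernel widths 3, 5, and 9:

$$\mathrm{RF}(r, w) = r\,(w - 1) + 1, \qquad \mathrm{RF}(1, 3) = 3, \quad \mathrm{RF}(2, 3) = 5, \quad \mathrm{RF}(4, 3) = 9.$$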
Table 2. Performance of individual components in GLTR.

Method | LS-VID mAP | LS-VID rank-1 | MARS mAP | MARS rank-1 | DukeMTMC mAP | DukeMTMC rank-1 | PRID rank-1 | iLIDS rank-1
baseline | 30.72 | 46.18 | 65.45 | 78.43 | 82.08 | 86.47 | 83.15 | 62.67
DTP | 41.78 | 59.92 | 75.90 | 85.74 | 89.98 | 93.02 | 93.26 | 84.00
TSA | 40.01 | 58.73 | 75.62 | 85.40 | 89.26 | 92.74 | 92.14 | 83.33
GLTR | - | 63.07 | 78.47 | 87.02 | 93.74 | 96.29 | - | -
Table 3. Performance of GLTR with different backbones on the LS-VID test set.

method | backbone | mAP | rank-1 | rank-5 | rank-10 | rank-20
baseline | Alexnet [17] | 15.98 | 24.23 | 43.52 | 53.45 | 62.13
baseline | Inception [40] | 22.77 | 35.70 | 55.88 | 64.89 | 73.12
baseline | ResNet50 [12] | 30.72 | 46.18 | 67.41 | 74.71 | 82.33
GLTR | Alexnet [17] | 22.57 | 35.45 | 56.59 | 66.01 | 75.06
GLTR | Inception [40] | 35.75 | 51.83 | 71.66 | 79.19 | 84.79
GLTR | ResNet50 [12] | - | 63.07 | - | - | -

Fig. 6 also compares average pooling as the baseline. It is clear that the three methods perform substantially better than the baseline, indicating that average pooling is not effective in capturing the temporal cues among frame features. With N = 1, the three methods perform equally, i.e., they apply a $d \times 3$ sized convolution kernel to the frame features $F$. As we increase N, the performance of the three algorithms is boosted. This means that introducing multiple convolution scales benefits the learned temporal feature.

It is also clear that DTP consistently outperforms the other two strategies on the three datasets. The reason may be that temporal pyramid pooling loses certain temporal cues as it down-samples the temporal resolution, and that traditional temporal convolution introduces too many parameters, leading to difficult optimization. The dilated convolutions in DTP efficiently enlarge the temporal receptive fields and hence perform better for local temporal feature learning. With N ≥ 3, the performance boost slows down for DTP: further introducing more branches increases the size of parameters and causes more difficult optimization. We select N = 3 for DTP in the following experiments.
Validity of combining DTP and TSA: This part proceeds to show that combining DTP and TSA results in the best video feature. We compare several variants of our method and summarize the results on four datasets and the test set of LS-VID in Table 2. In the table, "baseline" denotes ResNet50 + average pooling, "DTP" and "TSA" denote aggregating frame features only with DTP or TSA, respectively, and "GLTR" combines DTP and TSA.

Table 2 shows that either DTP or TSA performs substantially better than the baseline, indicating that modeling extra local and global temporal cues results in a better video feature.

Table 4. Comparison with recent works on the LS-VID test set.
Method | mAP | rank-1 | rank-5 | rank-10 | rank-20
ResNet50 [12] | 30.72 | 46.18 | 67.41 | 74.71 | 82.33
GLAD [45] | 33.98 | 49.34 | 70.15 | 77.14 | 83.59
HACNN [24] | 36.65 | 53.93 | 72.41 | 80.88 | 85.27
PBR [39] | 37.58 | 55.34 | 74.68 | 81.56 | 86.16
DRSA [22] | 37.77 | 55.78 | 74.37 | 81.06 | 86.81
Two-stream [36] | 32.12 | 48.23 | 68.66 | 75.06 | 83.56
LSTM [50] | 35.92 | 52.11 | 72.57 | 78.91 | 85.50
I3D [2] | 33.86 | 51.03 | 70.08 | 78.08 | 83.65
P3D [32] | 34.96 | 53.37 | 71.15 | 78.08 | 83.65
STMP [29] | 39.14 | 56.78 | 76.18 | 82.02 | 87.12
M3D [19] | 40.07 | 57.68 | 76.09 | 83.35 | 88.18
GLTR | - | 63.07 | - | - | -
The DTP model achieves a rank-1 accuracy of 85.74% on the MARS dataset, outperforming the baseline by a large margin. Similarly, TSA also performs substantially better than the baseline. By combining DTP and TSA, GLTR consistently achieves the best performance on the five datasets. We hence conclude that jointly learning local and global temporal cues results in the best video feature.
Different backbones:
We further evaluate the effectiveness of GLTR with different backbone networks, including Alexnet [17], Inception [40], and ResNet50 [12]. Experimental results on the test set of LS-VID are summarized in Table 3. It shows that, implemented on different backbones, GLTR consistently outperforms the baselines, indicating that our method works well with different frame feature extractors. GLTR thus could leverage strong image representations and serve as a general solution for video person ReID. Since ResNet50 achieves the best performance in Table 3, we adopt ResNet50 in the following experiments.
LS-VID:
This section compares several recent methods with our approach on the LS-VID test set. To make a comparison on LS-VID, we implement several recent works with the code provided by their authors, including temporal feature learning methods for person ReID: M3D [19] and STMP [29]; other temporal feature learning methods: two-stream CNN with appearance and optical flow [36] and LSTM [50]; 3D convolution: I3D [2] and P3D [32]; as well as recent image person ReID works: GLAD [45], HACNN [24], PBR [39], and DRSA [22]. Video features of GLAD [45] and HACNN [24] are extracted by average pooling. We reproduce PBR [39] and DRSA [22] by referring to their implementations on MARS. Table 4 summarizes the comparison.

Table 4 shows that GLAD [45] and HACNN [24] achieve promising performance in image person ReID, but lower performance than temporal feature learning strategies, e.g., M3D [19] and STMP [29]. This indicates the importance of learning temporal cues in video person ReID.
Table 5. Comparison with recent works on MARS.
Method | mAP | rank-1 | rank-5 | rank-20
BoW+kissme [56] | 15.50 | 30.60 | 46.20 | 59.20
IDE+XQDA [56] | 47.60 | 65.30 | 82.00 | 89.00
SeeForest [59] | 50.70 | 70.60 | 90.00 | 97.60
QAN [28] | 51.70 | 73.70 | 84.90 | 91.60
DCF [18] | 56.05 | 71.77 | 86.57 | 93.08
TriNet [13] | 67.70 | 79.80 | 91.36 | -
MCA [37] | 71.17 | 77.17 | - | -
DRSA [22] | 65.80 | 82.30 | - | -
DuATM [35] | 67.73 | 81.16 | 92.47 | -
MGCAM [37] | 71.17 | 77.17 | - | -
PBR [39] | 75.90 | 84.70 | 92.80 | 95.00
CSA [3] | 76.10 | 86.30 | 94.70 | 98.20
STMP [29] | 72.70 | 84.40 | 93.20 | 96.30
M3D [19] | 74.06 | 84.39 | 93.84 | 97.74
STA [11] | - | 86.30 | - | -
GLTR | 78.47 | 87.02 | - | -
Among the compared temporal feature learning methods, the recent M3D achieves the best performance. Overall, the proposed GLTR achieves the best performance in Table 4. It outperforms the recent video person ReID works STMP [29] and M3D [19] by large margins, e.g., 6.29% and 5.39% in rank-1 accuracy, respectively.

MARS:
Table 5 reports the comparison with recent works on MARS. GLTR achieves a rank-1 accuracy of 87.02% and an mAP of 78.47%, outperforming most of the recent works, e.g., STMP [29], M3D [19], and STA [11] by 2.62%, 2.63%, and 0.72% in rank-1 accuracy, respectively. Note that STMP [29] introduces a complex recurrent network and uses part cues and the triplet loss. M3D [19] uses 3D CNN to learn the temporal cues, hence requires higher computational complexity. STA [11] achieves competitive performance on the MARS dataset and outperforms GLTR on mAP; note that STA introduces multiple branches for part feature learning and uses the triplet loss to promote its performance. Compared with those works, our method achieves competitive performance with a simple design, e.g., we extract a global feature with a basic backbone and train only with the softmax loss. GLTR can be further combined with a re-ranking strategy [58], which boosts its mAP to 85.54%.
PRID and iLIDS-VID:
The comparisons on the PRID and iLIDS-VID datasets are summarized in Table 6. Our method presents competitive rank-1 accuracy. M3D [19] also achieves competitive performance on both datasets. The reason may be that M3D jointly learns multi-scale temporal cues from video sequences and introduces a two-stream architecture to learn the spatial and temporal representations, respectively. With a single feature extraction stream, our method still outperforms M3D on both datasets. Table 6 also compares several temporal feature learning methods, e.g., RFA-Net [50], SeeForest [59], T-CN [48], CSA [3], and STMP [29]. Our method outperforms those works by large margins in rank-1 accuracy.

Figure 7. Illustration of person ReID results on the LS-VID, MARS, and DukeMTMC-VideoReID datasets. Each example shows the top-5 retrieved sequences by the baseline method (first row) and GLTR (second row), respectively. The true match is annotated by the red dot. We cover the face for privacy purposes.

Table 6. Comparison with recent works on PRID and iLIDS-VID.
Method | PRID rank-1 | PRID rank-5 | iLIDS-VID rank-1 | iLIDS-VID rank-5
BoW+XQDA [56] | 31.80 | 58.50 | 14.00 | 32.20
IDE+XQDA [56] | 77.30 | 93.50 | 53.00 | 81.40
DFCP [25] | 51.60 | 83.10 | 34.30 | 63.30
AMOC [26] | 83.70 | 98.30 | 68.70 | 94.30
QAN [28] | 90.30 | 98.20 | 68.00 | 86.80
DRSA [22] | 93.20 | - | 80.20 | -
RCN [30] | 70.00 | 90.00 | 58.00 | 84.00
DRCN [46] | 69.00 | 88.40 | 46.10 | 76.80
RFA-Net [50] | 58.20 | 85.80 | 49.30 | 76.80
SeeForest [59] | 79.40 | 94.40 | 55.20 | 86.50
T-CN [48] | 81.10 | 85.00 | 60.60 | 83.80
CSA [3] | 93.00 | 99.30 | 85.40 | 96.70
STMP [29] | 92.70 | 98.80 | 84.30 | 96.80
M3D [19] | 94.40 | 100.00 | 74.00 | 94.33
GLTR | - | - | - | -
DukeMTMC-VideoReID:
Comparisons on this dataset are shown in Table 7. Because DukeMTMC-VideoReID is a recently proposed video ReID dataset, a limited number of works have reported performance on it. We compare with ETAP-Net [47] and STA [11] in this section. The reported performance of ETAP-Net [47] in Table 7 is achieved with a supervised baseline. As shown in Table 7, GLTR achieves 93.74% mAP and 96.29% rank-1 accuracy, outperforming ETAP-Net [47] by large margins. STA [11] also achieves competitive performance on this dataset; GLTR still outperforms STA [11] on rank-1, rank-5, and rank-20 accuracy, respectively. Note that STA [11] utilizes extra body part cues and the triplet loss.
Summary:
The above comparisons on five datasets indicate the advantage of GLTR in video representation learning for person ReID, i.e., it achieves competitive accuracy with a simple and concise model design. We also observe that the ReID accuracy on LS-VID is substantially lower than the ones on the other datasets. For example, the best rank-1 accuracy on LS-VID is 63.07%, substantially lower than the 87.02% on MARS. This shows that, even though LS-VID collects longer sequences to provide more abundant spatial and visual cues, it still presents a more challenging person ReID task.

We show some person ReID results achieved by GLTR and the ResNet50 baseline on LS-VID, MARS [56], and DukeMTMC-VideoReID [47, 34] in Fig. 7. For each query, we show the top-5 returned video sequences by the two methods. It can be observed that the proposed GLTR is substantially more discriminative for identifying persons with similar appearance.
Table 7. Comparison on DukeMTMC-VideoReID.
Method | mAP | rank-1 | rank-5 | rank-20
ETAP-Net [47] | 78.34 | 83.62 | 94.59 | 97.58
STA [11] | - | - | - | -
GLTR | 93.74 | 96.29 | - | -
5. Conclusion
This paper proposes the Global-Local Temporal Representation (GLTR) for video person ReID. The proposed network consists of the DTP convolution and the TSA model. DTP consists of parallel dilated temporal convolutions to model the short-term temporal cues among adjacent frames. TSA exploits the relations among inconsecutive frames to capture global temporal cues. Experimental results on five benchmark datasets demonstrate the superiority of the proposed GLTR over current state-of-the-art methods.
Acknowledgments
This work is supported in part by Peng Cheng Laboratory, in part by Beijing Natural Science Foundation under Grant No. JQ18012, and in part by the Natural Science Foundation of China under Grants No. 61620106009, 61572050, and 91538111.

References

[1] Jean-Paul Ainam, Ke Qin, and Guisong Liu. Self attention grid for person re-identification. arXiv preprint arXiv:1809.08556, 2018.
[2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017.
[3] Dapeng Chen, Hongsheng Li, Tong Xiao, Shuai Yi, and Xiaogang Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, 2018.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. PAMI, 40(4):834–848, 2017.
[5] Dahjung Chung, Khalid Tahboub, and Edward J Delp. A two stream siamese convolutional neural network for person re-identification. In ICCV, 2017.
[6] Afshin Dehghan, Shayan Modiri Assari, and Mubarak Shah. GMMCP tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In CVPR, 2015.
[7] Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
[8] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[9] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. PAMI, 32(9), 2010.
[10] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
[11] Yang Fu, Xiaoyang Wang, Yunchao Wei, and Thomas Huang. STA: Spatial-temporal attention for large-scale video-based person re-identification. In AAAI, 2019.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[14] Martin Hirzer, Csaba Beleznai, Peter M Roth, and Horst Bischof. Person re-identification by descriptive and discriminative classification. In SCIA, 2011.
[15] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. arXiv preprint arXiv:1811.11721, 2018.
[16] Minyue Jiang, Yuan Yuan, and Qi Wang. Self-attention learning for person re-identification. In BMVC, 2018.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[18] Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, 2017.
[19] Jianing Li, Shiliang Zhang, and Tiejun Huang. Multi-scale 3D convolution network for video based person re-identification. In AAAI, 2019.
[20] Jianing Li, Shiliang Zhang, Qi Tian, Meng Wang, and Wen Gao. Pose-guided representation learning for person re-identification. IEEE Trans. PAMI, 2019.
[21] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised person re-identification by deep learning tracklet association. In ECCV, 2018.
[22] Shuang Li, Slawomir Bak, Peter Carr, and Xiaogang Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, 2018.
[23] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
[24] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, 2018.
[25] Youjiao Li, Li Zhuo, Jiafeng Li, Jing Zhang, Xi Liang, and Qi Tian. Video-based person re-identification by deep feature guided pooling. In CVPR Workshops, 2017.
[26] Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, and Jiashi Feng. Video-based person re-identification with accumulative motion context. IEEE Trans. CSVT, 28(10):2788–2802, 2017.
[27] Jiawei Liu, Zheng-Jun Zha, Xuejin Chen, Zilei Wang, and Yongdong Zhang. Dense 3D-convolutional neural network for person re-identification in videos. ACM TOMM, 15(1s):8, 2019.
[28] Yu Liu, Junjie Yan, and Wanli Ouyang. Quality aware network for set to set recognition. In CVPR, 2017.
[29] Yiheng Liu, Zhenxun Yuan, Wengang Zhou, and Houqiang Li. Spatial and temporal mutual promotion for video-based person re-identification. In AAAI, 2019.
[30] Niall McLaughlin, Jesus Martinez del Rincon, and Paul Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, 2016.
[31] Sateesh Pedagadi, James Orwell, Sergio Velastin, and Boghos Boghossian. Local fisher discriminant analysis for pedestrian re-identification. In CVPR, 2013.
[32] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, 2017.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[34] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, 2016.
[35] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, 2018.
[36] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[37] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Mask-guided contrastive attention model for person re-identification. In CVPR, 2018.
[38] Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Pose-driven deep convolutional model for person re-identification. In ICCV, 2017.
[39] Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. Part-aligned bilinear representations for person re-identification. In ECCV, 2018.
[40] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[41] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[42] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[43] Xiaogang Wang and Rui Zhao. Person re-identification: System design and evaluation overview. In Person Re-Identification, pages 351–370. Springer, 2014.
[44] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In CVPR, 2018.
[45] Longhui Wei, Shiliang Zhang, Hantao Yao, Wen Gao, and Qi Tian. GLAD: Global-local-alignment descriptor for pedestrian retrieval. In ACM MM, 2017.
[46] Lin Wu, Chunhua Shen, and Anton van den Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609, 2016.
[47] Yu Wu, Yutian Lin, Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, 2018.
[48] Yang Wu, Jie Qiu, Jun Takamatsu, and Tsukasa Ogasawara. Temporal-enhanced convolutional network for person re-identification. In AAAI, 2018.
[49] Fei Xiong, Mengran Gou, Octavia Camps, and Mario Sznaier. Person re-identification using kernel-based metric learning methods. In ECCV, 2014.
[50] Yichao Yan, Bingbing Ni, Zhichao Song, Chao Ma, Yan Yan, and Xiaokang Yang. Person re-identification via recurrent feature aggregation. In ECCV, 2016.
[51] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. DenseASPP for semantic segmentation in street scenes. In CVPR, 2018.
[52] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[53] Yuhui Yuan and Jingdong Wang. OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
[54] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[55] Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned part-aligned representations for person re-identification. In ICCV, 2017.
[56] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, 2016.
[57] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
[58] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 2017.
[59] Zhen Zhou, Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, 2017.