Video-based Person Re-identification with Accumulative Motion Context
Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, Fellow, IEEE, and Jiashi Feng
Abstract—Video-based person re-identification plays a central role in realistic security and video surveillance. In this paper we propose a novel Accumulative Motion Context (AMOC) network for addressing this important problem, which effectively exploits long-range motion context for robustly identifying the same person under challenging conditions. Given a video sequence of the same or different persons, the proposed AMOC network jointly learns appearance representation and motion context from a collection of adjacent frames using a two-stream convolutional architecture. AMOC then accumulates clues from the motion context by recurrent aggregation, allowing effective information flow among adjacent frames and capturing the dynamic gist of the persons. The architecture of AMOC is end-to-end trainable, so the motion context can be adapted to complement appearance clues under unfavorable conditions (e.g., occlusions). Extensive experiments are conducted on three public benchmark datasets, i.e., the iLIDS-VID, PRID-2011 and MARS datasets, to investigate the performance of AMOC. The experimental results demonstrate that the proposed AMOC network significantly outperforms the state-of-the-art methods for video-based re-identification and confirm the advantage of exploiting long-range motion context for video-based person re-identification, clearly validating our motivation.
Index Terms—video surveillance, person re-identification, accumulative motion context
I. INTRODUCTION

The person re-identification (re-id) problem, which associates different tracks of a person moving between non-overlapping cameras distributed at different physical locations, has received increasing attention [1]–[15]. It aims to re-identify the same person captured by one camera in another camera at a new location. This task is essential to many important surveillance applications such as people tracking and forensic search. However, it is still a challenging problem because pedestrian images captured in different camera views can display large variations in lighting, poses, viewpoints, and cluttered backgrounds.

In this work, we investigate the problem of video-based person re-identification. The recent state-of-the-art methods often solve the person re-id task by matching spatial appearance features (e.g., color and texture) using a pair of still images captured for persons [1]–[4], [7], [8].
Hao Liu, Meibin Qi and Jianguo Jiang are with the School of Computer and Information, Hefei University of Technology, P. R. China. Hao Liu is also with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (e-mail: [email protected]; [email protected]; [email protected]). Karlekar Jayashree is with Panasonic R&D Center Singapore, Singapore (e-mail: [email protected]). Zequn Jie, Shuicheng Yan and Jiashi Feng are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (e-mail: [email protected]; [email protected]; [email protected]).

However, appearance feature representations learned from still images of people are intrinsically limited due to the inherent visual ambiguity caused by, e.g., clothing similarity among people in public places and appearance variance from cross-camera viewing condition variations. In this work, we propose to explore both appearance and motion information from video frame sequences of people for re-id. The video setting is indeed more natural for re-id, considering that a person is often captured by surveillance cameras in a video rather than only a single image. By using sequences of persons' video frames, not only the spatial appearance but also the temporal information, such as a person's gait, can be utilized to discern difficult cases when trying to recognize a person in another camera. In general, describing person video can naturally be attributed to spatial and temporal cues. The spatial part carries information about scenes and the appearance of the persons, like clothing color, height and shape, while the temporal part, in the form of motion across frames, conveys the movement of the observer (the cameras) and the person objects, which is complementary to the spatial part.

Recently, spatial-temporal information has been explored in video-based person re-id works [9]–[15]. In [10], the authors extract optical flows between consecutive frames of a person to represent the short-term temporal information, then concatenate them with RGB images to construct the spatial-temporal features. However, this method simply uses one CNN to extract spatial and temporal information simultaneously. The limitation of such a single-stream CNN architecture is that it cannot take full advantage of valuable temporal information, and the performance is consequently limited by relying mainly on spatial (appearance) features. Thus the learned spatial-temporal features may not be sufficiently discriminative for identifying different persons in real videos.

To overcome the limitation of the single-stream CNN architecture, in related fields such as action recognition ([16], [17]), the authors decompose spatial and temporal information processing into two individual streams to separately learn more representative features from video sequences, and the information from the two streams is then fused at certain intermediate layers. For the person re-id task, when discriminating two walking persons, a specific location such as the legs or arms of a person may exhibit discriminative motion characteristics in addition to spatial features such as the trousers' colour or texture. As it can only be obtained
between two consecutive frames, this motion is called motion context. Therefore, we design a novel two-stream spatial-temporal architecture, called the Accumulative Motion Context network (AMOC), that can well capture the spatial features and motion context from sequences of persons. The useful appearance and motion information is then accumulated in a recurrent way to provide long-term information.

Furthermore, to our best knowledge, the motion information in most related methods [10], [16], [17] is represented by hand-crafted optical flow extracted in an off-line way. Such off-line optical flow extraction has the following disadvantages: (i) the pre-extraction of optical flow is time- and resource-consuming due to its storage demands, especially for large-scale datasets such as MARS [18]; (ii) off-line extracted optical flow may not be optimal for the person re-id task since it is computed independently of person re-id. Instead, we expect the model to automatically learn motion information from video sequential frames that is discriminative for person re-id. To encode the motion information from raw video frames, a recent work called FlowNet [19] was proposed to estimate high-quality optical flow. Basically, the idea of FlowNet is to exploit the ability of convolutional networks to learn strong features at multiple levels of scale and abstraction, which helps find the actual correspondences between frames. FlowNet consists of contracting and expanding parts trained as a whole using back-propagation. The optical flow is first learned and spatially compressed in the contracting part and then refined to high resolution in the expanding part. Inspired by FlowNet, we design a new motion network in our AMOC architecture to perform the end-to-end motion context learning task. Although our proposed end-to-end AMOC also estimates optical flow (encoding motion information) from raw video frames similar to FlowNet, there are still several distinct differences from that model, listed as follows:

• First, they have different targets. FlowNet is for computing low-level optical flow. In contrast, our proposed AMOC model utilizes the flow information for performing the high-level person re-identification task. Due to the different application targets, the optimization goal of FlowNet is to minimize the endpoint error (EPE) [19] to obtain more precise optical flow, while our AMOC jointly re-identifies persons and estimates optical flow towards benefiting person re-id.

• Second, the motion net in our model works together with the spatial stream to learn both appearance features and motion features, which are complementary to each other. By contrast, FlowNet only focuses on learning the motion information encoded by the optical flow.

• Last but not least, considering that the data in the person re-id datasets we experimented on has much lower resolution than the synthetic data FlowNet uses, the structure of the relevant layers in the motion net is modified accordingly to allow our model to take low-resolution video frames, especially person video frames, as input.

Our experiments and analyses show that our end-to-end trainable two-stream approach successfully learns motion features helpful for improving the performance of person re-identification. Moreover, we explore how best to fuse the learned appearance features and motion features so that our model performs at its best.
Compared to other state-of-the-art video person re-id methods, our method is also able to achieve better performance, as clearly validated by extensive experimental results.

To sum up, we make the following contributions to video-based person re-identification:

• We propose a novel two-stream motion context accumulating network model which can directly learn both spatial features and motion context from raw person video frames in an end-to-end manner, instead of pre-extracting optical flow using an off-line algorithm. The learned spatial appearance features and temporal motion context information are then accumulated in a recurrent way.

• We quantitatively validate the good performance of our end-to-end two-stream recurrent convolutional network by comparing it with the state-of-the-arts on three benchmark datasets: iLIDS-VID [20], PRID-2011 [21] and MARS [18].

II. RELATED WORK
Person re-identification has been extensively studied in recent years. Existing works on person re-identification can be roughly divided into two types: person re-id for still images and person re-id for video sequences.
A. Person Re-id for Still Images
Previous works on person re-id for still images focus on invariant feature representation and distance metric learning. Discriminative features that are invariant to environmental and view-point changes play a determining role in person re-id performance. [22] combines spatial and color information using an ensemble of discriminant localized features and classifiers selected by boosting to improve viewpoint invariance. Symmetry and asymmetry perceptual attributes are exploited in [1], based on the idea that features closer to the body axes of symmetry are more robust against scene clutter. [23] fits a body configuration composed of chest, head, thighs, and legs in pedestrian images and extracts per-part color information as well as color displacement within the whole body to handle pose variation. [24] turns local descriptors into the Fisher Vector to produce a global representation of an image. Kviatkovsky et al. [25] propose a novel illumination-invariant feature representation based on the log-chromaticity (log) color space and demonstrate that color as a single cue shows relatively good performance in identifying persons under greatly varying imaging conditions.

After feature extraction, distance metric learning is used in person re-id to emphasize inter-person distance and de-emphasize intra-person distance. The large margin nearest neighbor metric (LMNN) [26] is proposed to improve the performance of traditional kNN classification. Prosser et al. [27] formulate person re-identification as a ranking problem, and use ensembled RankSVMs to learn pairwise similarity. Zheng et al. [28] propose a soft discriminative scheme termed relative distance comparison (RDC) based on large and small distances corresponding to wrong matches and right matches, respectively. [29] proposes a high-dimensional representation of color and Scale Invariant Local Ternary Pattern (SILTP) histograms. It constructs a histogram of pixel features, and then takes its maximum values within horizontal strips to tackle viewpoint variations while maintaining local discrimination.
B. Person Re-id for Video Sequences
Recently, some works consider performing person re-id on video sequences. In this scenario, exploiting multiple frames in videos to boost the matching performance is crucial. [30] applies Dynamic Time Warping (DTW) to solve the sequence matching problem in video-based person re-id. [31] uses a conditional random field (CRF) to ensure that similar frames in a video sequence receive similar labels. [32] proposes to learn a mapping between the appearances in sequences by taking into account the differences between specific camera pairs. [20] introduces a pictorial video segmentation approach and deploys a fragment selecting and ranking model for person matching. [33] introduces a block sparse model to handle the video-based person re-id problem as a recovery problem in the embedding space. [13] proposes a spatio-temporal appearance representation method, and feature vectors that encode the spatially and temporally aligned appearance of the person in a walking cycle are extracted. [12] proposes a top-push distance learning (TDL) model incorporating a top-push constraint to quantify ambiguous video representations for video-based person re-id.

Deep learning methods have also been applied to video-based person re-id to simultaneously solve the feature representation and metric learning problems. Usually DNNs are used to learn ranking functions based on pairs [3] or triplets [34] of images. Such methods typically rely on a deep network (e.g., a
Siamese network [35]) used for feature mapping from raw images to a feature space where images from the same person are close while images from different persons are widely separated. [10] uses a recurrent neural network to learn the interaction between multiple frames in a video and a Siamese network to learn discriminative video-level features for person re-id. [9] uses the Long Short-Term Memory (LSTM) network to aggregate frame-wise person features in a recurrent manner. Unlike the existing deep learning based methods for person re-id in videos, our proposed Accumulative Motion Context (AMOC) network introduces an end-to-end two-stream architecture that has specialized network streams for learning spatial appearance and temporal feature representations individually. Spatial appearance information from the raw video frame input and temporal motion information from the optical flow predicted by the motion network are processed respectively and then fused at higher recurrent layers to form a discriminative video-level representation.

III. PROPOSED METHOD
We propose an end-to-end Accumulative Motion Context Network (AMOC) based architecture that addresses the video person re-identification problem through joint spatial appearance learning and motion context accumulation from raw video frames. We first introduce the overall architecture of our AMOC model (Sec. III-A), which is illustrated in Fig. 1. Then, for each pair of frames from a person sequence, the details of the motion networks and spatial networks and how they collaborate with each other are described. Besides, the recurrent fusion layers that integrate the two-stream spatial-temporal information and the fusion method are elaborated. Finally, implementation details of training and testing are introduced for reproducing the results.
A. Architecture Overview
Fig. 1 illustrates the architecture of the proposed end-to-end Accumulative Motion Context network (AMOC). In our architecture, each pair of consecutive frames is processed by a two-stream network to learn spatial appearance and temporal motion features representing the person's appearance at a certain time instant. These two-stream features are then fused in a recurrent way for learning discriminative accumulative motion contexts. After that, a temporal pooling layer integrates the two-stream features over time from an arbitrarily long video sequence into a single feature representation. Finally, the two-stream sub-networks for two sequences from two different cameras are constructed following the Siamese network architecture [35], in which the parameters of the Camera A networks and Camera B networks are shared. To train this network end-to-end, we adopt multi-task loss functions including a contrastive loss and a classification loss. The contrastive loss decides whether two sequences describe the same person or not, while the classification loss predicts the identity of the person in the sequence. In the following, we give more detailed explanations of each component of our proposed network.

Fig. 1. The architecture of our proposed Accumulative Motion Context Network (AMOC). At each time-step, each pair of two consecutive frames is processed by a two-stream network, including a spatial network (yellow and green boxes) and a motion network (blue box), to learn spatial appearance and temporal motion feature representations. These two-stream features are then fused in a recurrent way for learning discriminative accumulative motion contexts. The two-stream features are integrated by a temporal pooling layer from an arbitrarily long video sequence into a single feature representation. Finally, the two-stream sub-networks for two sequences from two different cameras are constructed following the Siamese network architecture [35], in which the parameters of the Camera A networks and Camera B networks are shared. The whole AMOC network is trained end-to-end with multi-task losses (classification loss and contrastive loss) to satisfy the contrastive objective and to predict the person's identity.
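To make the data flow concrete, the following sketch traces one AMOC forward pass for a single sequence in PyTorch (the paper's implementation is in Torch). All module names and the interface are ours, not the authors'; the individual components are detailed in the remainder of this section.

```python
# A minimal sketch of one AMOC forward pass (our naming, assuming the
# component modules of Secs. III-B and III-C are given; the two-scale
# resizing of Sec. III-D2 is omitted for brevity).
import torch

def amoc_sequence_feature(frames, motion_net, spatial_rgb, spatial_flow,
                          fuse, rnn_cell):
    """frames: (T, 3, H, W) frames of one person; returns one feature vector."""
    outputs, r = [], None
    for t in range(frames.size(0) - 1):   # each time-step sees an adjacent pair
        pair = torch.cat([frames[t], frames[t + 1]], dim=0).unsqueeze(0)
        flow = motion_net(pair)           # predicted motion context ("Pred3")
        a = spatial_rgb(frames[t].unsqueeze(0))   # appearance-stream features
        m = spatial_flow(flow)                    # motion-stream features
        f = fuse(a, m)                            # spatial fusion (Sec. III-C1)
        o, r = rnn_cell(f, r)                     # recurrent accumulation
        outputs.append(o)
    return torch.stack(outputs).mean(dim=0)       # temporal average-pooling
```

In the full model, two such sequence features (one per camera) feed the contrastive and classification losses of the Siamese setup.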
B. End-to-end Two-stream Networks
As aforementioned, each frame in a sequence of the person is processed by two convolutional network streams jointly. Specifically, one stream uses spatial networks (yellow boxes in Fig. 1) to learn spatial features from raw video frames, while in the other stream spatial networks (green boxes in Fig. 1) are applied to the motion context features produced by the motion networks (blue boxes in Fig. 1) to learn the temporal feature representations at each location between a pair of video frames. In this subsection, we introduce the details of the motion networks and spatial networks, and then describe how they work together.
1) The Motion Networks:
As shown in Fig. 2, at each time-step a pair of consecutive video frames of a person is processed by the motion network within AMOC (corresponding to the blue boxes in Fig. 1) to predict the motion between the adjacent frames. Similar to the structure used in [19], the motion network consists of several convolutional and deconvolutional layers for up-sizing the high-level coarse feature maps learned by the convolutional layers. In particular, it has 6 convolutional layers (corresponding to "Conv1", "Conv1 1", "Conv2", "Conv2 1", "Conv3" and "Conv3 1") with stride of 2 (the simplest form of pooling) and a tanh non-linearity after each layer. Taking the two concatenated person frames as input with size $h \times w \times 6$ ($h$ is the height of the frame and $w$ is the width), the network employs the 6 convolutional layers to learn high-level abstract representations of the frame pair by producing feature maps with reduced sizes (w.r.t. the raw input frames). However, this size-shrinking process could result in low resolution and harm the performance of person re-id. So, in order to provide dense per-pixel predictions, we need to refine the coarse pooled representation. To perform the refinement, we apply "deconvolution" to the feature maps, and concatenate them with the corresponding feature maps from the "contractive" part of the network and an upsampled coarser flow prediction ("Pred1", "Pred2", "Pred3"). In this way, the network preserves both the high-level information passed from coarser feature maps and the finer local information provided in lower-layer feature maps. Each step increases the resolution by a factor of 2. We repeat this process twice, leading to the final predicted flow ("Pred3") whose resolution is still half that of the raw input. Note that our motion network does not have any fully-connected layers and can thus take video frames of arbitrary size as input. This motion network can be trained end-to-end by using optical flow generated by several off-the-shelf algorithms as supervision, such as the Lucas-Kanade [36] and EpicFlow [37] algorithms. The training details will be described in Sec. III-D1.

Fig. 2. The structure of the motion network of our proposed accumulative motion context network at one time-step. It has 6 convolutional layers (corresponding to "Conv1", "Conv1 1", "Conv2", "Conv2 1", "Conv3" and "Conv3 1") with stride of 2 and a tanh non-linearity layer after each layer. The inputs are the two concatenated person frames with size $h \times w \times 6$. To provide dense per-pixel predictions, several deconvolutional layers are applied to the output feature maps of the convolutional layers and to the motion predictions to refine the coarse pooled representation. The purple cubes represent the convolutional kernels while the green cubes are the deconvolutional kernels.
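A PyTorch sketch of this contracting–expanding design follows. The channel widths and kernel sizes are placeholders of ours, and we assume the stride-2 downsampling happens at "Conv1", "Conv2" and "Conv3" (with stride 1 at the "Conv* 1" layers), so that the coarsest maps, and hence "Pred1", sit at 1/8 of the input resolution as stated in Sec. III-D1; the paper's Fig. 2 fixes the true hyper-parameters.

```python
import torch
import torch.nn as nn

def conv(cin, cout, stride):
    # 3x3 convolution followed by the tanh non-linearity used in the paper
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.Tanh())

class MotionNet(nn.Module):
    """FlowNet-style contracting/expanding sketch; input h, w divisible by 8."""
    def __init__(self):
        super().__init__()
        self.conv1   = conv(6,  32, 2)    # two stacked RGB frames: 6 channels
        self.conv1_1 = conv(32, 32, 1)
        self.conv2   = conv(32, 64, 2)
        self.conv2_1 = conv(64, 64, 1)
        self.conv3   = conv(64, 128, 2)   # coarsest feature maps: 1/8 size
        self.conv3_1 = conv(128, 128, 1)
        self.pred1 = nn.Conv2d(128, 2, 3, 1, 1)            # flow at 1/8
        self.up1   = nn.ConvTranspose2d(128, 64, 4, 2, 1)  # upsample features
        self.upf1  = nn.ConvTranspose2d(2, 2, 4, 2, 1)     # upsample flow
        self.pred2 = nn.Conv2d(64 + 64 + 2, 2, 3, 1, 1)    # flow at 1/4
        self.up2   = nn.ConvTranspose2d(64 + 64 + 2, 32, 4, 2, 1)
        self.upf2  = nn.ConvTranspose2d(2, 2, 4, 2, 1)
        self.pred3 = nn.Conv2d(32 + 32 + 2, 2, 3, 1, 1)    # final flow at 1/2

    def forward(self, pair):
        c1 = self.conv1_1(self.conv1(pair))   # 1/2 resolution
        c2 = self.conv2_1(self.conv2(c1))     # 1/4 resolution
        c3 = self.conv3_1(self.conv3(c2))     # 1/8 resolution
        p1 = self.pred1(c3)
        # concatenate upsampled features, the contracting-part maps, and the
        # upsampled coarser flow prediction, then predict a refined flow
        d1 = torch.cat([self.up1(c3), c2, self.upf1(p1)], dim=1)
        p2 = self.pred2(d1)
        d2 = torch.cat([self.up2(d1), c1, self.upf2(p2)], dim=1)
        p3 = self.pred3(d2)
        return p1, p2, p3   # "Pred1"-"Pred3"; "Pred3" feeds the motion stream
```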
2) The Spatial Networks:
As shown in Fig. 1, there are two spatial networks (yellow and green boxes), one in each stream, to learn spatial feature representations from raw video frames and temporal features at each spatial location between two consecutive frames. Both spatial networks have the same structure, each containing 3 convolutional layers and 3 max-pooling layers with a non-linearity layer (in this paper we use tanh) interpolated after each convolutional layer, and a fully-connected layer on top of the last max-pooling layer. The details of the spatial networks are shown in Fig. 3, where the purple cubes are convolutional kernels and the red ones are pooling kernels. All the strides in the convolutional layers and pooling layers are set to 2. Note that although the two spatial networks have the same structure, they play different roles in the two streams. When serving as a spatial feature extractor, the network (yellow boxes in Fig. 1) takes raw RGB images as input; otherwise, the inputs are the last predictions ("Pred3") of the motion networks (blue boxes in Fig. 1).

Fig. 3. The structure of the spatial networks of our proposed accumulative motion context network at one time-step. Each consists of 3 convolutional layers and 3 max-pooling layers with a tanh non-linearity layer interpolated after each convolutional layer, and a fully-connected layer on top of the last max-pooling layer. The purple cubes are convolutional kernels while the red ones are pooling kernels.
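A sketch of one such spatial network is given below. The channel widths, kernel sizes and input size are hypothetical placeholders of ours, and we use stride-1 convolutions so that a small person frame survives three pooling stages; the exact strides and hyper-parameters follow the paper's Fig. 3.

```python
import torch.nn as nn

class SpatialNet(nn.Module):
    """3 x (conv + tanh + max-pool) followed by a fully-connected embedding."""
    def __init__(self, in_channels=3, feat_dim=128, in_hw=(64, 32)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 5, 1, 2), nn.Tanh(), nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, 5, 1, 2), nn.Tanh(), nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 32, 5, 1, 2), nn.Tanh(), nn.MaxPool2d(2, 2),
        )
        h, w = in_hw[0] // 8, in_hw[1] // 8   # spatial size after 3 poolings
        self.fc = nn.Linear(32 * h * w, feat_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

# Two instances with identical structure but different roles and weights:
appearance_net  = SpatialNet(in_channels=3)   # input: raw RGB frame
motion_feat_net = SpatialNet(in_channels=2)   # input: "Pred3" flow map
```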
C. Spatial Fusion and Motion Context Accumulation

1) Spatial Fusion:
Here we consider different fusion methods (orange boxes in Fig. 1) to fuse the two network streams. Our intention is to fuse the spatial features and motion context information at each spatial location such that channel responses at the same pixel position are put in correspondence. Because the structures of the spatial networks in the two streams are the same, the feature maps produced by corresponding layers in each stream have an exact location correlation. So the problem is how to fuse the outputs of corresponding layers of the two streams. To motivate our fusion strategy, consider for example discriminating two walking persons. If the legs move or the arms swing periodically at
some spatial location, then the motion network can recognize that motion and obtain the motion context information from two consecutive frames, and the spatial network can recognize the location (legs or hands) and their combination so as to discriminate the persons.

This spatial correspondence can be easily obtained when the two networks have the same spatial resolution at the layers to be fused. The simplest way is overlaying (stacking) layers from one network on the other. However, there is also the issue of establishing the correct correspondence between one channel (or channels) in one network and the corresponding channel (or channels) of the other network. Suppose different channels in the spatial network learning spatial feature representations from one video frame are responsible for different body areas (head, arms, legs, etc.), and one channel in the spatial network following the motion network is responsible for contextual motion between two neighboring frames in those areas. Then, after the channels are stacked, the filters in the subsequent layers must learn the correspondence between these appropriate channels in order to best discriminate between the motions of different person samples. To make this more concrete, we investigate the following 3 ways of fusing layers between the two stream networks; a short code sketch of the three operations is given at the end of this discussion. Suppose $x^A \in \mathbb{R}^{H \times W \times D}$ and $x^B \in \mathbb{R}^{H \times W \times D}$ are two feature maps from two layers that need to be fused, where $W$, $H$ and $D$ are the width, height and channel number of the respective feature maps, and $y$ represents the fused feature map. When applied to the feed-forward spatial network architecture shown in Fig. 3, which consists of convolutional, max-pooling, non-linearity and fully-connected layers, the fusion can be performed at different points in the network to implement, e.g., early fusion or late fusion.

• Concatenation fusion:
This fusion operation stacks the two feature maps at the same spatial location $i, j$ across the feature channels $d$:

$y^{\mathrm{cat}}_{i,j,2d} = x^A_{i,j,d}, \quad y^{\mathrm{cat}}_{i,j,2d-1} = x^B_{i,j,d}$, (1)

where $x^A, x^B \in \mathbb{R}^{H \times W \times D}$, $y^{\mathrm{cat}} \in \mathbb{R}^{H \times W \times 2D}$ and $1 \le i \le H$, $1 \le j \le W$, $1 \le d \le D$.

• Sum fusion:
The sum fusion computes the sum of the two feature maps at the same spatial location $i, j$ and channel $d$:

$y^{\mathrm{sum}}_{i,j,d} = x^A_{i,j,d} + x^B_{i,j,d}$, (2)

where $y^{\mathrm{sum}} \in \mathbb{R}^{H \times W \times D}$.

• Max fusion:
Similarly, max fusion takes the element-wise maximum of the two feature maps:

$y^{\mathrm{max}}_{i,j,d} = \max\{x^A_{i,j,d}, x^B_{i,j,d}\}$, (3)

where $y^{\mathrm{max}} \in \mathbb{R}^{H \times W \times D}$.

Now we briefly introduce where our fusion method should be applied. Injecting fusion layers can have a significant impact on the number of parameters and layers in a two-stream network, especially if only the network which is fused into is kept and the other network tower is truncated. For example, in Fig. 3, if the fusion operation is performed on the second max-pooling layers ("Max-pool2") of the two spatial networks in the two-stream network, then the previous parts ("Conv1, Max-pool1, Conv2" and the non-linearity layers) are kept in both streams, and the layers after the fusion operation ("Conv3, Max-pool3, Fully connected") share one set of parameters. An illustration is shown in Fig. 4. In the experimental section (Sec. IV-C), we evaluate and compare the three fusion methods in terms of their re-identification accuracy.
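Concretely, the three fusion operations of Eqns. (1)-(3) reduce to one line each on (N, D, H, W) tensors; a minimal sketch:

```python
import torch

def fuse_cat(xa, xb):  # Eqn. (1): stack along channels -> 2D channels
    return torch.cat([xa, xb], dim=1)

def fuse_sum(xa, xb):  # Eqn. (2): element-wise sum, channels stay D
    return xa + xb

def fuse_max(xa, xb):  # Eqn. (3): element-wise maximum, channels stay D
    return torch.maximum(xa, xb)

# xa, xb: feature maps from the same layer of the two streams
xa, xb = torch.randn(2, 32, 16, 8), torch.randn(2, 32, 16, 8)
print(fuse_cat(xa, xb).shape)  # torch.Size([2, 64, 16, 8])
```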
Fig. 4. Illustration of the fusion of the two spatial networks at the second max-pooling layer (Max-pool2).
2) Motion Context Accumulation:
We now consider techniques to combine the fused features $f^{(t)}$ (the output of the fully-connected layer in Fig. 4), containing both spatial appearance and motion context information, over time $t$. Because the length of a sequence is arbitrary, the amount of motion context is also not fixed for each person. Therefore, we exploit recurrent neural networks (RNNs), which can process an arbitrarily long time-series, to address the problem of accumulating motion context information. Specifically, an RNN has feedback connections allowing it to remember information over time; at each time-step it produces an output based on both the current input and information from the previous time-steps. As shown in Fig. 1, the recurrent connections of the RNN are "unrolled" in time to create a very deep feed-forward network. Given the unrolled network, the lateral connections serve as "memory", allowing information to flow between an arbitrary number of time-steps.

As video-based person re-identification involves recognizing a person from a sequence, accumulating the motion context information of each frame at each instant is helpful for improving re-identification performance. This motion accumulation can be achieved by using the recurrent connections, which allow information to be passed between time-steps. To be more clear, we aim to better capture the person's spatial appearance features and the motion context information present in the video sequence, and then to accumulate them along the time axis. Specifically, given the $p$-dimensional output $f^{(t)} \in \mathbb{R}^{p \times 1}$ of the fused spatial networks, the RNN is defined as follows:

$o^{(t)} = M f^{(t)} + N r^{(t-1)}$, (4)

$r^{(t)} = \tanh(o^{(t)})$. (5)

Here $o^{(t)} \in \mathbb{R}^{q \times 1}$ is the $q$-dimensional output of the RNN at time-step $t$, and $r^{(t-1)} \in \mathbb{R}^{q \times 1}$ contains the information on the RNN's state at the previous time-step. $M \in \mathbb{R}^{q \times p}$ and $N \in \mathbb{R}^{q \times q}$ are the corresponding parameters for $f^{(t)}$ and $r^{(t-1)}$ respectively, where $p$ is the dimension of the output of the last fully-connected layer in the fusion part and $q$ is the dimension of the feature embedding space.

Although RNNs are able to accumulate the fused information, they still have some limitations. Specifically, some time-steps may be more dominant in the output of the RNN, which reduces the RNN's effectiveness when used to accumulate the input information over a full sequence, because discriminative frames may appear anywhere in the sequence. To overcome this drawback, similar to [10], we add a temporal pooling layer after the RNN to allow for the aggregation of information across all time-steps. The temporal pooling layer aims to capture the long-term information present in the whole sequence, which is combined with the motion context information accumulated through the RNN. In this paper, we use either average-pooling or max-pooling over the temporal dimension to produce a single feature vector: $u^{\mathrm{ave}}$ represents the person's spatial appearance and motion information averaged over the whole input sequence, while $u^{\mathrm{max}}$ denotes the feature vector after max-pooling. Given the temporal pooling layer inputs $\{o^{(1)}, o^{(2)}, \dots, o^{(T)}\}$, the average-pooling and max-pooling methods are as follows:

$u^{\mathrm{ave}} = \frac{1}{T} \sum_{t=1}^{T} o^{(t)}$, (6)

$u^{\mathrm{max}}_i = \max\left([o^{(1)}_i, o^{(2)}_i, \dots, o^{(T)}_i]\right)$, (7)

where $T$ is the length of the sequence (the number of time-steps).
Here $u^{\mathrm{max}}_i$ is the $i$th element of the vector $u^{\mathrm{max}}$, and $[o^{(1)}_i, o^{(2)}_i, \dots, o^{(T)}_i]$ collects the $i$th elements of the feature vectors across the temporal dimension.
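Eqns. (4)-(7) amount to a plain recurrent layer followed by a mean or element-wise max over time. A sketch, assuming a zero initial state (the paper does not state the initialization):

```python
import torch
import torch.nn as nn

class AccumulationRNN(nn.Module):
    """Eqns. (4)-(5): o_t = M f_t + N r_{t-1}, r_t = tanh(o_t)."""
    def __init__(self, p, q):
        super().__init__()
        self.M = nn.Linear(p, q, bias=False)   # maps fused features to R^q
        self.N = nn.Linear(q, q, bias=False)   # recurrent "memory" weights

    def forward(self, f_seq):                  # f_seq: (T, p) fused features
        r = torch.zeros(self.N.in_features)    # assumed zero initial state
        outs = []
        for f in f_seq:                        # accumulate over time-steps
            o = self.M(f) + self.N(r)
            r = torch.tanh(o)
            outs.append(o)
        return torch.stack(outs)               # (T, q)

rnn = AccumulationRNN(p=128, q=128)            # q = 128 embedding (Sec. III-D2)
o = rnn(torch.randn(16, 128))
u_ave = o.mean(dim=0)         # Eqn. (6): average over the temporal dimension
u_max = o.max(dim=0).values   # Eqn. (7): element-wise max over time
```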
3) Multi-task Loss:
Similar to the method suggested in [10], we train the whole AMOC network to both satisfy the contrastive objective and predict the person's identity. Given the sequence feature vector $u$, which contains the accumulated spatial appearance features and motion context information output by the feature representation learning networks of our AMOC, we can predict the identity of the person in the sequence using the standard softmax function, defined as follows:

$I(u) = P(z = c \mid u) = \frac{\exp(S_c u)}{\sum_k \exp(S_k u)}$, (8)

where there are a total of $K$ identities, $z$ is the identity of the person, and $S$ is the softmax weight matrix, with $S_c$ and $S_k$ representing its $c$th and $k$th rows, respectively. The corresponding softmax loss function (pink boxes in Fig. 1) is then defined as follows:

$L_{\mathrm{class}} = -\log(P(z = c \mid u))$. (9)

Besides, given a pair of sequence feature vectors $(u^{(a)}, u^{(b)})$ output by the Siamese network, the contrastive loss (purple box in Fig. 1) can be defined as

$L_{\mathrm{con}} = \left\| u^{(a)} - u^{(b)+} \right\|^2 + \max\left\{ 0, \; \alpha - \left\| u^{(a)} - u^{(b)-} \right\|^2 \right\}$, (10)

where $u^{(b)+}$ represents the positive pair of $u^{(a)}$ while $u^{(b)-}$ represents the negative pair of $u^{(a)}$. The loss consists of two penalties: the first term penalizes a positive pair $(u^{(a)}, u^{(b)+})$ that is too far apart, and the second penalizes a negative pair $(u^{(a)}, u^{(b)-})$ that is closer than a margin $\alpha$. If a negative pair is already separated by $\alpha$, then there is no penalty for this pair and $L_{\mathrm{con}}(u^{(a)}, u^{(b)-}) = 0$.

Finally, we jointly train our architecture end-to-end with both the classification loss and the contrastive loss. We can now define the overall multi-task training loss $L_{\mathrm{multi}}$ for a single pair of person sequences, which jointly optimizes the classification cost and the contrastive cost as follows:

$L_{\mathrm{multi}}(u^{(a)}, u^{(b)}) = L_{\mathrm{con}}(u^{(a)}, u^{(b)}) + L_{\mathrm{class}}(u^{(a)}) + L_{\mathrm{class}}(u^{(b)})$. (11)

Here, we give equal weights to the classification cost and contrastive cost terms. The above network can be trained end-to-end using back-propagation-through-time from raw video frames (details of our training parameters can be found in Sec. III-D). During the training phase, all recurrent connections are unrolled to create a deep feed-forward graph, where the weights of AMOC are shared between all time-steps. In the test phase, we discard the multi-task loss functions and apply AMOC to the raw video sequences as a feature extractor, where the extracted feature vectors can be directly compared using the Euclidean distance.
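Written per training pair with an indicator for positive/negative pairs, the multi-task loss of Eqns. (8)-(11) can be sketched as follows; the squared-distance form of Eqn. (10) is our reading of the formula, and the function names are ours:

```python
import torch
import torch.nn.functional as F

def multi_task_loss(u_a, u_b, same, id_a, id_b, S, alpha=2.0):
    """Eqn. (11) for one sequence pair: contrastive loss plus a softmax
    classification loss on each sequence feature, with equal weights.
    u_a, u_b: (q,) features; S: (K, q) softmax weights; same: whether the
    two sequences show one person; id_a, id_b: 0-dim long identity labels."""
    d2 = (u_a - u_b).pow(2).sum()              # squared Euclidean distance
    l_con = d2 if same else torch.clamp(alpha - d2, min=0.0)    # Eqn. (10)
    logits_a = S @ u_a                         # Eqn. (8), before the softmax
    logits_b = S @ u_b
    l_cls = (F.cross_entropy(logits_a.unsqueeze(0), id_a.view(1)) +
             F.cross_entropy(logits_b.unsqueeze(0), id_b.view(1)))  # Eqn. (9)
    return l_con + l_cls
```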
D. Implementation Details

1) Pre-training of the Motion Networks: As aforementioned in Sec. III-B1, we use pre-extracted optical flow as the ground truth to pre-train the motion networks. Our aim is that the proposed architecture can directly learn motion context information from raw consecutive video frames. When the pre-training of the motion networks is finished, we use the trained parameters to initialize the motion networks in the whole framework.

As shown in Fig. 2, our motion networks produce three optical flow maps at three levels of scale ("Pred1", "Pred2", "Pred3") for a pair of person frames. Because the size of each prediction map is 1/8, 1/4 and 1/2 of the input frame size respectively, the pre-extracted optical flow maps are all downsampled to the corresponding sizes and then serve as the ground truth. In this paper, we use the smooth-L1 loss function [38] to compute the losses between the predictions $e^{(l)} \in \mathbb{R}^{h^{(l)} \times w^{(l)} \times 2}$ and the ground-truth optical flow maps $g^{(l)} \in \mathbb{R}^{h^{(l)} \times w^{(l)} \times 2}$, where $l = 1, 2, 3$. The loss function is defined as

$L^{(l)}_{\mathrm{motion}}(e^{(l)}, g^{(l)}) = \sum_{i,j,k} \mathrm{smooth}_{L_1}\left(e^{(l)}_{i,j,k} - g^{(l)}_{i,j,k}\right)$, (12)

$\mathrm{smooth}_{L_1}(\theta) = \begin{cases} 0.5\,\theta^2 & \text{if } |\theta| < 1 \\ |\theta| - 0.5 & \text{otherwise} \end{cases}$. (13)

Then the overall cost function can be written as

$L_{\mathrm{motion}} = \sum_{l=1}^{3} \omega_l L^{(l)}_{\mathrm{motion}}$, (14)

where $\omega_l$ is the weight of the loss between each scale level's prediction and its ground-truth map. In the pre-training phase of the motion network, we set the weights to (0.01, 0.02, 0.08) respectively.

All input pairs of video frames are resized to $h \times w$, and Adam [39] is chosen as the optimization method due to its faster convergence than standard stochastic gradient descent with momentum for our task. As recommended in [39], we fix the parameters of Adam to $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Considering that every pixel is a training sample, we use small mini-batches of 4 frame pairs. We start from an initial learning rate $\lambda$ and then divide it by 2 every 10k iterations after the first 20k. Note that the motion network of our model has good generalization ability, which is verified in Sec. IV-C. Therefore, in the experiments on all three datasets, we use the motion network pre-trained only on iLIDS-VID to initialize the one in the whole end-to-end AMOC framework.
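The pre-training objective of Eqns. (12)-(14) can be sketched as below; downsampling the ground-truth flow by bilinear interpolation is our assumption, as the resampling method is not stated:

```python
import torch.nn.functional as F

def motion_pretrain_loss(preds, gt_flow, weights=(0.01, 0.02, 0.08)):
    """Eqns. (12)-(14): weighted smooth-L1 between the three flow predictions
    (Pred1, Pred2, Pred3) and the ground-truth flow resized to 1/8, 1/4 and
    1/2 resolution. preds: tuple of (N, 2, h_l, w_l); gt_flow: (N, 2, H, W)."""
    total = 0.0
    for pred, w in zip(preds, weights):
        gt = F.interpolate(gt_flow, size=pred.shape[-2:], mode='bilinear',
                           align_corners=False)   # match each prediction size
        # smooth_l1_loss with the default threshold of 1 matches Eqn. (13);
        # reduction='sum' matches the sum over i, j, k in Eqn. (12)
        total = total + w * F.smooth_l1_loss(pred, gt, reduction='sum')
    return total
```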
2) Training of the Overall Architecture:
After pre-training the motion network, we use it to initialize the corresponding component of our AMOC and train the whole network end-to-end. In the end-to-end training process, we set the margin $\alpha$ in Eqn. (10) to 2, and the embedding-space dimension is set to 128. For temporal pooling, we adopt the average-pooling method in the related experiments unless otherwise specified, and a small learning rate is used. Note that, as mentioned in Sec. III-B1, the resolution of the final predicted flow map of the motion network ("Pred3") is half that of the input. Therefore, at each time-step, we resize the first frame of a pair to two scales, $h \times w$ and $\frac{h}{2} \times \frac{w}{2}$, and we resize the second frame to $h \times w$. That is to say, at each time-step, the first-stream spatial network (yellow boxes in Fig. 1) takes the first frame of the pair at size $\frac{h}{2} \times \frac{w}{2}$ to learn spatial appearance feature representations, while the second-stream networks, including the spatial network and motion network (green and blue boxes in Fig. 1), take the frame pair at size $h \times w$ as input to extract motion context information. This operation can be performed in an on-line way. Furthermore, it guarantees that the feature maps generated by corresponding layers of the two spatial networks in the two streams have the same resolution and thus can be fused at an arbitrary layer.
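A sketch of this per-time-step resizing; the tensor layout and the use of bilinear interpolation are our assumptions:

```python
import torch
import torch.nn.functional as F

def prepare_inputs(frame_t, frame_t1, h, w):
    """Per-time-step resizing of Sec. III-D2. frame_t, frame_t1: (1, 3, H, W).
    Returns the appearance-stream input (half size, matching the resolution
    of the predicted flow) and the stacked pair for the motion network."""
    full_t  = F.interpolate(frame_t,  size=(h, w), mode='bilinear',
                            align_corners=False)
    full_t1 = F.interpolate(frame_t1, size=(h, w), mode='bilinear',
                            align_corners=False)
    half_t  = F.interpolate(frame_t,  size=(h // 2, w // 2), mode='bilinear',
                            align_corners=False)   # matches "Pred3" resolution
    return half_t, torch.cat([full_t, full_t1], dim=1)  # (1,3,h/2,w/2), (1,6,h,w)
```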
3) Data Augmentation:
To increase the diversity of the training sequences and to overcome data imbalance and overfitting issues, data augmentation is applied to all the datasets. Specifically, we artificially augment the data by performing random 2D translation, similar to the processing in [10]. For all frame images of size $w \times h$ in a sequence, we sample same-sized frame images around the image center, with translation drawn from a uniform distribution in the range $[-0.05w, 0.05w] \times [-0.05h, 0.05h]$. Besides, horizontal flipping is also performed to augment the data. At each epoch of the training phase, the augmentation is applied once. During the testing phase, data augmentation is also applied, and the similarity scores between sequences are averaged over all the augmented data.
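A sketch of this sequence-level augmentation; approximating the center-crop translation with an affine shift is an assumption on our part:

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment_sequence(frames):
    """frames: (T, C, H, W). Draws one translation (up to +/-5% of width and
    height) and one flip decision, then applies them to every frame so the
    whole sequence is transformed consistently."""
    _, _, h, w = frames.shape
    dx = int(random.uniform(-0.05, 0.05) * w)
    dy = int(random.uniform(-0.05, 0.05) * h)
    flip = random.random() < 0.5
    out = []
    for f in frames:
        f = TF.affine(f, angle=0.0, translate=[dx, dy], scale=1.0, shear=[0.0])
        if flip:
            f = TF.hflip(f)
        out.append(f)
    return torch.stack(out)
```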
IV. EXPERIMENTS

A. Datasets
In this paper, we use iLIDS-VID [20], PRID-2011 [21] and MARS [18], three publicly available benchmarks, to evaluate our proposed AMOC model.

iLIDS-VID: The iLIDS-VID dataset is created from pedestrians captured in two non-overlapping camera views at an airport arrival hall under a multi-camera CCTV network. It is very challenging due to large clothing similarities among people, lighting and viewpoint variations across camera views, cluttered backgrounds and random occlusions. There are 600 image sequences of 300 distinct individuals in the dataset, with one pair of image sequences from the two camera views for each person. The image sequences range in length from 23 to 192 frames, with an average of 73.
PRID-2011: The PRID-2011 dataset contains 749 persons in total, captured by two non-overlapping cameras. Among them, there are 400 image sequences for 200 people from two adjacent camera views, with sequence lengths of 5 to 675 frames. Compared with the iLIDS-VID dataset, it is less challenging as it was captured in non-crowded outdoor scenes with relatively simple and clean backgrounds and rare occlusions. Similar to the protocol used in [10], we only use the first 200 persons appearing in both cameras for evaluation.
MARS: MARS, containing 1,261 identities and around 20,000 video sequences, is the largest video re-id benchmark dataset to date. Each sequence is automatically obtained by the Deformable Part Model (DPM) [40] detector and the GMMCP [41] tracker. Each identity is captured by at least two and at most six cameras, and has 13.2 sequences on average. Additionally, 3,248 distractor sequences are contained in the dataset. The dataset is split into fixed training and test sets, with 631 and 630 identities, respectively. In testing, 2,009 probes are selected for query.
B. Experimental Settings and Evaluation Protocol
For the experiments performed on the iLIDS-VID and PRID-2011 datasets, half of the persons are used for training and the other half for testing. All experiments are conducted 10 times with different training/testing splits, and the averaged results are reported to ensure stability. For the MARS dataset, we use the provided fixed training and test sets, containing 631 and 630 identities respectively. Additionally, the training of the networks in our architecture, including the pre-training of the motion networks and the end-to-end training of the whole framework, is implemented using the Torch [42] framework on an NVIDIA GeForce GTX TITAN X GPU.

As described in Sec. III-A, our network is a Siamese-like network, so positive and negative sequence pairs are randomly selected on-line during the training phase. Positive and negative sequence pairs consist of two full sequences of arbitrary length containing the same person or different persons under different cameras, respectively. To guarantee the fairness of the experiments, we follow the same sequence length setting as [10]. Considering training efficiency, a sub-sequence containing 16 consecutive frames is sampled from the full-length sequence of a person during training. We train our model for 1000 epochs on the iLIDS-VID and PRID-2011 datasets; on MARS, the epoch number is set to 2000. At each epoch, the random selection is performed once. During testing, the sequences under the first camera and second camera are regarded as the probe and the gallery respectively, as in [10] and [20], and the length of each person sequence is set to 128 for testing. If the full sequence of a person is shorter than 128 frames, we use the full-length sequence of this person for testing. For the iLIDS-VID and PRID-2011 datasets, training takes approximately 4-5 hours, and training on MARS takes around 1 day. In the test phase, it takes roughly 5 minutes to test one split on the iLIDS-VID and PRID-2011 datasets; on the larger-scale MARS, the testing time increases to about 15 minutes.

In the experiments, for each pedestrian, the matching of his or her probe sequence (captured by one camera) against the gallery sequences (captured by another camera) is ranked. To reflect the statistics of the ranks of the true matches, the Cumulative Match Characteristic (CMC) curve is adopted as the evaluation metric. Specifically, in the testing phase, the Euclidean distances between probe sequence features and those of the gallery sequences are computed first. Then, for each probe person, all the persons in the gallery are ranked from the smallest distance to the largest. Finally, the percentage of true matches found among the first m ranked persons is computed and denoted as rank-m. In addition, for the MARS dataset, the mean average precision (mAP) as in [18] is also employed to evaluate the performance, since each query has multiple cross-camera ground-truth matches. Note that all the experiments performed on the three datasets are under the single-query setting.
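For reference, single-query CMC under this protocol can be computed as sketched below, assuming one true match per probe as in the iLIDS-VID and PRID-2011 setup:

```python
import torch

def cmc(probe_feats, gallery_feats, probe_ids, gallery_ids, max_rank=20):
    """Rank the gallery by Euclidean distance for each probe and record the
    rank at which its true match appears (one match per probe assumed)."""
    dists = torch.cdist(probe_feats, gallery_feats)     # (P, G) distances
    hits = torch.zeros(max_rank)
    for i in range(probe_feats.size(0)):
        order = dists[i].argsort()                      # nearest gallery first
        rank = (gallery_ids[order] == probe_ids[i]).nonzero()[0].item()
        if rank < max_rank:
            hits[rank:] += 1          # a match at rank r counts for all m >= r
    return hits / probe_feats.size(0)  # hits[m-1] = rank-m recognition rate
```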
C. Analysis of the Proposed AMOC Model

Before comparing our method with the state-of-the-arts, we conduct several analytic experiments on the iLIDS-VID and PRID-2011 datasets to verify the effectiveness of our model for solving the video-based person re-identification problem. We analyse and investigate the effect of several factors on the performance, including the generation of the motion context information, the choice of spatial fusion method, the location of the spatial fusion, the sequence lengths of the probe and gallery for testing, the temporal pooling method, and other parameter settings such as the embedding size of the RNN and the margin $\alpha$ in Eqn. (10). In this paper, we regard the method in [10] as our baseline, in which spatial-temporal features are also employed in a recurrent way but without the two-stream structure or spatial fusion method, and all the optical flow maps serving as motion information are pre-extracted in an off-line way.
1) Effect of Different Motion Information:
As described in Sec. III-B1, our motion network is able to learn motion information from video sequence frames end-to-end. Thus, besides the Lucas-Kanade optical flow (LK-Flow) algorithm [36] used in [10], we also exploit the optical flow maps produced by the EpicFlow [37] algorithm to investigate the effect of different motion information. The experimental results are shown in Tab. I. The AMOC networks in the "AMOC + LK-Flow" and "AMOC + EpicFlow" methods are the non-end-to-end versions, whose motion networks are replaced by the corresponding pre-extracted optical flow maps. Note that here we only study the effect of different optical flows carrying motion context information and verify the effectiveness of end-to-end learning. Therefore, all the shown results of our AMOC are achieved by using the "concatenation fusion" method introduced in Sec. III-C1 and fusing the two-stream networks at the "Max-pool2" layer as illustrated in Fig. 4.

TABLE I
RANK1, RANK5, RANK10 AND RANK20 RECOGNITION RATES (IN %) OF VARIOUS METHODS ON THE iLIDS-VID AND PRID-2011 DATASETS.

                                         iLIDS-VID                     PRID-2011
Methods                         Rank1  Rank5  Rank10  Rank20   Rank1  Rank5  Rank10  Rank20
Baseline + LK-Flow [10]         58.0   84.0   91.0    96.0     70.0   90.0   95.0    97.0
Baseline + EpicFlow [37]        59.3   87.2   92.7    98.2     76.2   97.5   98.2    99.0
AMOC w/o Motion                 54.2   78.3   89.1    95.8     65.4   88.9   95.6    98.5
AMOC + LK-Flow                  63.3   85.3   95.1    96.4     76.0   96.5   97.4    99.6
AMOC + EpicFlow                 65.5   93.1   97.2    98.7     82.0   97.3   99.3    99.4
end-to-end AMOC + LK-Flow       65.3   87.3   96.1    98.4     78.0   97.2   99.1    99.7
end-to-end AMOC + EpicFlow      68.7   94.3   98.3    99.3     83.7   98.3   99.4    100

In Tab. I, "Baseline + LK-Flow" is the method from [10], and we observe that the performance is boosted when LK-Flow is replaced by EpicFlow to produce the optical flow, in both the baseline methods and our AMOC method. As introduced in [37], EpicFlow assumes that contours often coincide with motion discontinuities and computes a dense correspondence field by performing a sparse-to-dense interpolation from an initial sparse set of matches, leveraging contour cues via an edge-aware distance. It is therefore more robust to motion boundaries, occlusions and large displacements than the LK-Flow algorithm [36], which is beneficial for extracting motion information between video frames. In addition, we can see that the performance of our "end-to-end AMOC" improves with either optical flow generation method.

Moreover, we visualize the optical flow maps produced by our motion networks in Fig. 5. The first and fourth rows are the consecutive raw video frames from the iLIDS-VID and PRID-2011 datasets, the optical flow maps computed using [37] for the two datasets are shown in the second and fifth rows, and the third and sixth rows are the output flow maps of our motion networks. All the produced optical flow maps are encoded with the flow color coding method [43], which is also used in [37]. Different colors represent different directions of motion, and shades indicate the speeds of motion; the faster the motion, the darker its color. In this experiment, we use only the sequences of half the persons of iLIDS-VID for training, and run the motion networks on the sequences of the remaining half of iLIDS-VID and on the whole PRID-2011 dataset. In other words, for the PRID-2011 dataset, there is no need to re-train the motion networks. From the results shown in Fig. 5, we can see that our motion networks approximate well the optical flow maps produced by EpicFlow [37] and successfully capture the motion details of persons, such as the speed and amplitude of leg movement. Especially for the PRID-2011 dataset, our motion network achieves good optical flow estimation without using any training data from PRID-2011, which means our motion network has good generalization ability.

To further illustrate the effectiveness of the motion information embedded in the AMOC architecture, we remove the temporal stream (corresponding to the green and blue boxes in Fig. 1) of AMOC and repeat the analytic experiments. That is, our AMOC degenerates to a single-stream network which only uses the appearance features of persons, with no motion information included or accumulated. The results are reported as "AMOC w/o Motion" in Tab. I. Compared with the other methods using motion information in Tab. I, the performance of "AMOC w/o Motion" is the worst on both datasets. This suggests that the motion information is beneficial for improving person re-identification performance on video sequences.
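For readers reproducing Fig. 5, a simple HSV-based rendering in the spirit of the flow colour coding of [43] can be sketched as follows (the exact colour wheel of [43] differs slightly; this mapping is an approximation of ours):

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_color(flow):
    """flow: (H, W, 2) array of (dx, dy) displacements. Hue encodes motion
    direction, saturation encodes normalized speed."""
    mag = np.linalg.norm(flow, axis=-1)
    hue = (np.arctan2(flow[..., 1], flow[..., 0]) + np.pi) / (2.0 * np.pi)
    sat = mag / (mag.max() + 1e-8)                 # scale speeds to [0, 1]
    hsv = np.stack([hue, sat, np.ones_like(hue)], axis=-1)
    return (hsv_to_rgb(hsv) * 255).astype(np.uint8)
```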
2) Effect of Spatial Fusion Method and Location:
In this part, we investigate how and where to fuse the two-stream networks in our end-to-end AMOC network. For these experiments, we use the same spatial network architecture introduced in Sec. III-B2. The fusion layer can be injected at any location, such as after "Max-pool2", i.e., its input is the output of "Max-pool2" from the two streams. After the fusion layer, a single processing stream is used.

We compare different fusion strategies in Tab. II, where we report the rank1, rank5, rank10 and rank20 recognition rates on both the iLIDS-VID and PRID-2011 datasets. From the results, we see that the "Concatenation" fusion method performs considerably better than the "Sum" and "Max" fusion methods. Compared to the choice of fusion method, our end-to-end AMOC network is more sensitive to the location of the spatial fusion. Specifically, for all three fusion methods, our method achieves the best performance when the spatial fusion is performed after the "Max-pool2" layer, and fusion at the FC (fully-connected) layer results in an evident drop in performance. The reason fusing at the FC layer performs worse may be that at this layer the spatial correspondences between the spatial appearance and the motion context information have collapsed.
3) Effect of Sequence Length and Temporal Pooling Methods:
Fig. 5. Visualization of the output optical flow of the proposed motion networks on the iLIDS-VID and PRID-2011 datasets. The first and fourth rows are the consecutive raw video frames from the iLIDS-VID and PRID-2011 datasets, while the optical flow maps computed using EpicFlow [37] for the two datasets are shown in the second and fifth rows. The third and sixth rows are the output flow maps of our motion networks. All the produced optical flow maps are encoded with the flow colour coding method [43]. Different colours represent different directions of motion, and shades indicate the speeds of motion.

To further study how re-identification accuracy varies with the lengths of the probe and gallery sequences during the test phase, we perform experiments on the iLIDS-VID dataset. The testing lengths of the probe and gallery sequences are set to the same number and simultaneously increased from 1 to 128, in steps corresponding to the powers of two. Training lengths are set to 16 as indicated in Sec. IV-B.

The results shown in Fig. 6 (a) demonstrate that increasing both the probe and gallery sequence lengths improves performance. This is reasonable because more appearance and motion information can be exploited when more samples of each person are available. A similar phenomenon is also observed for the baseline method [10]. Compared to it, our method achieves higher rank1 recognition rates by significant margins under the different sequence length settings.

As aforementioned in Sec. III-C2, AMOC accumulates the RNN output spatial-temporal features into a single feature vector by applying either average-pooling or max-pooling over the temporal dimension. We now compare the performance of the two pooling methods. The results are summarized in Fig. 6 (b). We observe that average-pooling is superior to max-pooling. One possible reason is that average-pooling considers all time-steps equally important in the decision, whilst max-pooling only employs the feature value at the temporal step with the largest activation. Therefore, average-pooling over the temporal sequence of features can produce a more robust feature vector to compress and represent the person's appearance and motion information over a period of time.
4) Effect of Other Parameter Settings:
In this subsection, we conduct experimental analysis on iLIDS-VID to investigate the effect of other parameter settings on our proposed AMOC: the margin $\alpha$ in Eqn. (10) and the embedding size of the RNN.

In this paper, we adopt a multi-task loss including a contrastive loss and a classification loss. The margin $\alpha$ in the contrastive loss function can be set empirically, since the AMOC model can learn to adaptively scale the feature vectors proportionally to $\alpha$, but the choice of the margin is still crucial to performance. Therefore, we investigate several different margins ranging from 1 to 10. The rank1 recognition rate is used for evaluation and the resulting curve is shown in Fig. 6 (c). We observe that the rank1 recognition rate stays stable when the margin is smaller than 5, and our model achieves the best rank1 performance (68.7%) when the margin is set to 2. If we further increase the margin from 5 to 10, the performance drops considerably. Therefore, 2 is the best choice of margin for our model.

Additionally, the effect of different RNN embedding sizes (from 64 to 1024) on the rank1 performance of our method is illustrated in Fig. 6 (d). We find that our method obtains the highest rank1 recognition rate with an embedding size of 128. When the embedding size is reduced to 64, the rank1 rate drops slightly, probably because of information loss from the reduction of parameters. With an embedding size of 256, a slight drop is also observed. Moreover, the performance is constantly undermined as the embedding size is further increased to 1024, mainly due to over-fitting; increasing the embedding size also lengthens the training time. So we choose an RNN embedding size of 128, as it gives the best trade-off between performance and computational cost.

Fig. 6. The performance of our proposed end-to-end AMOC using different parameter settings on the iLIDS-VID dataset. (a) shows comparisons of our proposed end-to-end AMOC with the baseline method [10] using different query and gallery sequence lengths, ranging from 1 to 128, for testing. (b) shows the CMC curves of our method using two different temporal pooling methods (i.e., average-pooling and max-pooling). (c) gives our method's performance variation depending on different margins in the contrastive loss (Eqn. (10)). (d) illustrates the effect of different RNN embedding sizes upon the performance of our method.

D. Comparison with State-of-the-Art Methods
We further evaluate the performance of end-to-end AMOC by comparing it with the state-of-the-art methods on the iLIDS-VID, PRID-2011 and MARS datasets.
TABLE II
RANK1, RANK5, RANK10 AND RANK20 RECOGNITION RATES (IN %) OF VARIOUS FUSION METHODS ON THE iLIDS-VID AND PRID-2011 DATASETS.

                                            iLIDS-VID                     PRID-2011
Fusion Method   Fusion Layer       Rank1  Rank5  Rank10  Rank20   Rank1  Rank5  Rank10  Rank20
Sum             Tanh1              60.8   90.1   91.7    96.5     71.5   93.4   97.3    98.1
                Max-pool1          61.2   89.8   90.3    95.1     72.6   92.8   96.2    96.9
                Tanh2              65.5   91.8   95.6    96.5     79.9   95.8   97.6    97.8
                Max-pool2          67.8   93.4   96.5    98.3     80.6   96.6   97.8    99.2
                Tanh3              63.9   91.8   96.7    97.7     78.3   94.5   97.8    98.2
                Max-pool3          65.0   92.8   97.9    98.4     78.8   94.9   98.0    99.1
                FC                 60.7   88.1   93.2    94.3     72.0   91.2   93.8    94.9
Max             Tanh1              60.6   90.0   91.9    97.2     73.1   94.9   97.2    99.5
                Max-pool1          61.0   91.2   93.1    97.8     75.1   94.9   99.0    99.5
                Tanh2              66.9   94.2   97.1    98.9     81.2   97.3   98.5    99.3
                Max-pool2          68.2   95.4   97.5    98.9     81.6   98.6   98.8    99.3
                Tanh3              64.0   91.5   96.4    97.3     78.1   94.1   97.3    98.2
                Max-pool3          64.3   93.1   98.2    98.3     78.4   94.5   97.8    99.0
                FC                 61.2   89.1   93.6    94.9     73.5   90.2   92.9    94.7
Concatenation   Tanh1              64.1   92.3   94.2    98.0     74.8   95.5   98.4    99.2
                Max-pool1          64.8   92.0   94.1    98.3     75.3   95.8   97.6    99.2
                Tanh2              -      -      -       -        -      -      -       -
                Max-pool2          68.7   94.3   98.3    99.3     83.7   98.3   99.4    100
                Tanh3              65.2   92.3   97.1    98.4     80.0   96.3   99.4    99.6
                Max-pool3          66.1   92.8   97.9    98.4     80.0   96.9   99.8    99.8
                FC                 62.3   88.3   93.9    95.6     73.2   91.3   94.4    96.7
The compared methods include STA [13], DVR [11], TDL [12], SI2DL [14], PaMM [44], mvRMLLC+ST-Alignment [45], TAPR [15], SRID [33], AFDA [46], DVDL [47], HOG3D [48], KISSME [49], GEI [50], HistLBP [51], XQDA [29], LOMO [29], BoW [52], gBiCov [53], IDE [54] and RFA-Net [9]. Note that, as discussed in Sec. IV-C, the motion network of our model has good generalization ability; therefore, for all the experiments of our end-to-end AMOC performed on the three datasets, we only use the iLIDS-VID dataset to pre-train the motion network.
1) Results on iLIDS-VID and PRID-2011:
Comparing the CMC results shown in Tab. III, we can see that the non-end-to-end version of our AMOC already achieves higher performance than all the compared methods on both the iLIDS-VID and PRID-2011 datasets. When the end-to-end AMOC is applied, the performance is further boosted, especially at Rank1, where the improvements are 3.2% and 1.7% on iLIDS-VID and PRID-2011, respectively. Moreover, to the best of our knowledge, we are the first to introduce a two-stream network structure that learns motion information end-to-end from raw frame pairs for the video-based person re-identification problem. Compared with methods that also use spatial-temporal features, such as STA [13], RFA-Net [9] and TAPR [15], our AMOC improves performance by a large margin on both datasets, owing to its two-stream structure, which handles the spatial appearance and the motion context separately and then integrates the two through spatial fusion. From the results, we also notice that the second best method, "mvRMLLC+ST-Alignment [45]", achieves good performance on iLIDS-VID. However, it requires complex pre-processing of person video frames, whereas our end-to-end AMOC directly learns representations from raw video frames. Note that, in Tab. III, "end-to-end AMOC + EpicFlow" denotes our end-to-end AMOC with the motion network pre-trained using EpicFlow optical flow maps as supervision.

TABLE III
COMPARISON OF OUR END-TO-END AMOC'S PERFORMANCE ON iLIDS-VID AND PRID-2011 DATASETS TO THE STATE-OF-THE-ARTS.

Dataset                        iLIDS-VID                      PRID-2011
Methods                        Rank1  Rank5  Rank10  Rank20   Rank1  Rank5  Rank10  Rank20
Baseline [10]                  58.0   84.0   91.0    96.0     70.0   90.0   95.0    97.0
STA [13]                       44.3   71.7   83.7    91.7     64.1   87.3   89.9    92.0
DVR [11]                       39.5   61.1   71.7    81.8     40.0   71.7   84.5    92.2
TDL [12]                       56.3   87.6   95.6    98.3     56.7   80.0   87.6    93.6
SI2DL [14]                     48.7   81.1   89.2    97.3     76.7   95.6   96.7    98.9
PaMM [44]                      30.3   56.3   70.3    82.7     56.5   85.7   96.3    97.0
mvRMLLC+ST-Alignment [45]      -      -      -       -        -      -      -       -
end-to-end AMOC + EpicFlow     -      -      -       -        -      -      -       -
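For reference, the CMC Rank-k rates reported in Tab. III (and throughout this section) can be computed from a query-gallery distance matrix as in the minimal numpy sketch below. It assumes, as in the single-shot protocols of iLIDS-VID and PRID-2011, exactly one correct gallery sequence per query; the function name `cmc` and its arguments are illustrative and not part of our implementation.

```python
import numpy as np

def cmc(dist, query_ids, gallery_ids, ranks=(1, 5, 10, 20)):
    """CMC rank-k matching rates (in %) from a (num_query, num_gallery)
    distance matrix, with one correct gallery entry per query."""
    num_query = dist.shape[0]
    hits = np.zeros(max(ranks))
    for q in range(num_query):
        order = np.argsort(dist[q])  # gallery indices, nearest first
        match_rank = int(np.where(gallery_ids[order] == query_ids[q])[0][0])
        if match_rank < len(hits):
            hits[match_rank] += 1
    curve = np.cumsum(hits) / num_query  # cumulative matching curve
    return {k: 100.0 * curve[k - 1] for k in ranks}

# Toy example: 50 persons, one query and one gallery sequence each.
rng = np.random.default_rng(0)
query_ids = np.arange(50)
gallery_ids = rng.permutation(50)
dist = rng.random((50, 50))
print(cmc(dist, query_ids, gallery_ids))
```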
2) Results on MARS:
MARS is a large and realistic video-based person re-id dataset, captured on a university campus in a complex environment. Besides, it contains natural detection/tracking errors, since the person videos were collected with the automatic DPM detector and GMMCP tracker. In MARS, each person is captured by at most six cameras. Compared with iLIDS-VID and PRID-2011, MARS has a much larger scale: 4 times and 30 times larger in the number of identities and total tracklets, respectively. Therefore, the relationships between person pairs are more complicated.

In Tab. IV, we compare our results with the state-of-the-art methods on MARS. The compared methods include 8 descriptors (i.e., SDALF, HOG3D, HistLBP, gBiCov, GEI, LOMO, BoW and IDE) and 3 metric learning methods (i.e., DVR, KISSME and XQDA). Among all the recent video re-id methods, the best known Rank1 accuracy is 65.3% on MARS under the single query setting, reported in [18]. From Tab. IV, we can observe that our AMOC model achieves a Rank1 recognition rate higher than the current best method (IDE [54]+XQDA) by 3%. Although the Rank5 performance of AMOC is slightly lower than that of "IDE+XQDA", it still achieves an mAP as high as 52.9%. Note that the descriptor "IDE" in "IDE+XQDA" is obtained by fine-tuning an ImageNet-pretrained CaffeNet [55]. By contrast, our model is trained only on MARS from scratch. Besides, in all the compared methods, descriptor extraction and metric learning are two separate processes; that is, the descriptor extractor cannot update its parameters during metric learning. In our proposed method, the motion and appearance features are learned end-to-end, jointly with the subsequent feature accumulation part. As a result, the learned features carry enough discriminative information for the ultimate person re-id target. To conclude, the experimental results show that our model compares favorably with other methods in this more complex multi-camera person re-identification task, benefiting from its inherent end-to-end trainable appearance and motion information accumulation mechanism.
TABLE IV
COMPARISON OF OUR END-TO-END AMOC METHOD'S PERFORMANCE ON THE MARS DATASET TO THE STATE-OF-THE-ARTS.

Method                       Rank1  Rank5  Rank20  mAP
SDALF [1]+DVR [11]           4.1    12.3   25.1    1.8
HOG3D [48]+KISSME [49]       2.6    6.4    12.4    0.8
HistLBP [51]+XQDA [29]       18.6   33.0   45.9    8.0
gBiCov [53]+XQDA             9.2    19.8   33.5    3.7
GEI [50]+KISSME              1.2    2.8    7.4     0.4
LOMO [29]+XQDA               30.7   46.6   60.9    16.4
BoW [52]+KISSME              30.6   46.2   59.2    15.5
IDE [54]+XQDA                65.3   -      -       -
end-to-end AMOC+EpicFlow     68.3   -      -       52.9
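Unlike the single-shot iLIDS-VID and PRID-2011 protocols, a MARS query can have several correct gallery tracklets, which is why Tab. IV additionally reports mAP. The following numpy sketch shows how mean average precision is commonly computed from a distance matrix; it deliberately omits protocol-specific details (e.g., any filtering of junk or same-camera matches), and the names used are ours, not taken from the official MARS evaluation code.

```python
import numpy as np

def mean_average_precision(dist, query_ids, gallery_ids):
    """mAP (in %) from a (num_query, num_gallery) distance matrix,
    allowing several correct gallery entries per query."""
    average_precisions = []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])  # gallery indices, nearest first
        matches = (gallery_ids[order] == query_ids[q]).astype(np.float64)
        if matches.sum() == 0:
            continue  # skip queries with no ground-truth match
        # Precision at each position of the ranked list ...
        precision = np.cumsum(matches) / np.arange(1, len(matches) + 1)
        # ... averaged over the positions of the correct matches.
        average_precisions.append((precision * matches).sum() / matches.sum())
    return 100.0 * float(np.mean(average_precisions))
```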
E. Discussion
We here discuss some potential limitations of our proposed model, though its practical effectiveness and superior performance have already been demonstrated in the above experiments.

One potential issue with our model is that the learned flow information may be redundant and not relevant to identifying the person of interest. From the visualization of the optical flow maps estimated by the motion net of our model in Fig. 5, we observe that the foreground person and the background present different colours (i.e., motion patterns) in each frame. This is reasonable, as the person and the background move with different velocities in the video. As explained in Sec. III-D1, the motion net within our model is pre-trained using pre-extracted optical flow as supervision. If this supervision includes too much unrelated motion (i.e., background motion), the person motion will be dominated by the background motion within the computed optical flow. This is quite possible for datasets whose person video sequences are obtained automatically by a detector and tracker, such as MARS, because inherent detection or tracking errors introduce more unrelated motion. Although the deep networks our method builds upon are powerful for learning robust feature representations, we believe the video person re-id performance could be further boosted if unrelated motion were effectively suppressed, e.g., by leveraging object segmentation models. We will study this issue in future work.
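As a rough illustration of the suppression idea above, one could mask the estimated flow with a person foreground mask before feeding it to the motion stream. The sketch below is hypothetical: it assumes a binary mask from some external segmentation model, and the function `suppress_background_flow` is not part of the proposed AMOC.

```python
import numpy as np

def suppress_background_flow(flow, person_mask, keep_background=0.0):
    """Down-weight optical flow outside the person region.

    flow            : (2, H, W) array of horizontal/vertical flow components
    person_mask     : (H, W) binary foreground mask (hypothetical, e.g. from
                      an external segmentation model)
    keep_background : weight applied to background flow (0 removes it)
    """
    weights = np.where(person_mask[None, :, :] > 0, 1.0, keep_background)
    return flow * weights
```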
V. CONCLUSION

In this work, we propose an end-to-end Accumulative Motion Context Network (AMOC) that addresses the video-based person re-identification problem through joint spatial appearance learning and motion context accumulation from raw video frames. We conducted extensive experiments on three publicly available video-based person re-identification datasets to validate our method. The experimental results demonstrate that our model outperforms other state-of-the-art methods in most cases, and verify that our accumulative motion context model benefits recognition accuracy in person matching.
ACKNOWLEDGMENT
This work was supported in part by the National Natural Science Foundation of China under Grant 61371155, Grant 61174170, and Grant 61632007, and in part by the China Scholarship Council under Grant 201506690007. The work of Jiashi Feng was partially supported by National University of Singapore startup grant R-263-000-C08-133, Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112 and NUS IDS grant R-263-000-C67-646.
REFERENCES

[1] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, "Person re-identification by symmetry-driven accumulation of local features," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2360–2367.
[2] W. Li, R. Zhao, T. Xiao, and X. Wang, "Deepreid: Deep filter pairing neural network for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 152–159.
[3] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Deep metric learning for person re-identification," in International Conference on Pattern Recognition. IEEE, 2014, pp. 34–39.
[4] E. Ahmed, M. Jones, and T. K. Marks, "An improved deep learning architecture for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3908–3916.
[5] Y. C. Chen, W. S. Zheng, J. H. Lai, and P. C. Yuen, "An asymmetric distance model for cross-view feature mapping in person re-identification," IEEE Transactions on Circuits and Systems for Video Technology, 2016.
[6] X. Wang, W. S. Zheng, X. Li, and J. Zhang, "Cross-scenario transfer person reidentification," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 8, 2015.
[7] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan, "End-to-end comparative attention networks for person re-identification," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3492–3506, 2017.
[8] H. Liu, M. Qi, and J. Jiang, "Kernelized relaxed margin components analysis for person re-identification," IEEE Signal Processing Letters, vol. 22, no. 7, pp. 910–914, 2015.
[9] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang, "Person re-identification via recurrent feature aggregation," in European Conference on Computer Vision. Springer, 2016, pp. 701–716.
[10] N. McLaughlin, J. Martinez del Rincon, and P. Miller, "Recurrent convolutional network for video-based person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1325–1334.
[11] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by discriminative selection in video ranking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, pp. 2501–2514, 2016.
[12] J. You, A. Wu, X. Li, and W.-S. Zheng, "Top-push video-based person re-identification," arXiv preprint arXiv:1604.08683, 2016.
[13] K. Liu, B. Ma, W. Zhang, and R. Huang, "A spatio-temporal appearance representation for video-based pedestrian re-identification," in IEEE International Conference on Computer Vision, 2015, pp. 3810–3818.
[14] X. Zhu, X.-Y. Jing, F. Wu, and H. Feng, "Video-based person re-identification by simultaneously learning intra-video and inter-video distance metrics," in IJCAI, 2016.
[15] C. Gao, J. Wang, L. Liu, J. G. Yu, and N. Sang, "Temporally aligned pooling representation for video-based person re-identification," in IEEE International Conference on Image Processing, 2016, pp. 4284–4288.
[16] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[17] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," arXiv preprint arXiv:1604.06573, 2016.
[18] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, "Mars: A video benchmark for large-scale person re-identification," in European Conference on Computer Vision. Springer, 2016, pp. 868–884.
[19] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "Flownet: Learning optical flow with convolutional networks," in IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.
[20] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by video ranking," in European Conference on Computer Vision. Springer, 2014, pp. 688–703.
[21] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, "Person re-identification by descriptive and discriminative classification," in Scandinavian Conference on Image Analysis. Springer, 2011, pp. 91–102.
[22] D. Gray and H. Tao, "Viewpoint invariant pedestrian recognition with an ensemble of localized features," in European Conference on Computer Vision. Springer, 2008, pp. 262–275.
[23] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino, "Custom pictorial structures for re-identification," in BMVC, vol. 1, no. 2, 2011, p. 6.
[24] B. Ma, Y. Su, and F. Jurie, "Local descriptors encoded by fisher vectors for person re-identification," in European Conference on Computer Vision. Springer, 2012, pp. 413–422.
[25] I. Kviatkovsky, A. Adam, and E. Rivlin, "Color invariants for person reidentification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1622–1634, 2013.
[26] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," Journal of Machine Learning Research, vol. 10, pp. 207–244, 2009.
[27] B. Prosser, W.-S. Zheng, S. Gong, and T. Xiang, "Person re-identification by support vector ranking," in BMVC, vol. 2, no. 5, 2010, p. 6.
[28] W.-S. Zheng, S. Gong, and T. Xiang, "Reidentification by relative distance comparison," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 3, pp. 653–668, 2013.
[29] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal occurrence representation and metric learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206.
[30] D. Simonnet, M. Lewandowski, S. A. Velastin, J. Orwell, and E. Turkbeyler, "Re-identification of pedestrians in crowds using dynamic time warping," in European Conference on Computer Vision. Springer, 2012, pp. 423–432.
[31] S. Karaman and A. D. Bagdanov, "Identity inference: generalizing person re-identification scenarios," in European Conference on Computer Vision. Springer, 2012, pp. 443–452.
[32] W. Li and X. Wang, "Locally aligned feature transforms across views," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3594–3601.
[33] S. Karanam, Y. Li, and R. J. Radke, "Sparse re-id: Block sparsity for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 33–40.
[34] S. Ding, L. Lin, G. Wang, and H. Chao, "Deep feature learning with relative distance comparison for person re-identification," Pattern Recognition, vol. 48, no. 10, pp. 2993–3003, 2015.
[35] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 1735–1742.
[36] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in IJCAI, 1981, pp. 674–679.
[37] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, "Epicflow: Edge-preserving interpolation of correspondences for optical flow," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1164–1172.
[38] R. Girshick, "Fast r-cnn," in IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[39] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[40] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[41] A. Dehghan, S. M. Assari, and M. Shah, "Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4091–4099.
[42] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A matlab-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.
[43] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, "A database and evaluation methodology for optical flow," International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, 2011.
[44] Y.-J. Cho and K.-J. Yoon, "Improving person re-identification via pose-aware multi-shot matching," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1354–1362.
[45] J. Chen, Y. Wang, and Y. Y. Tang, "Person re-identification by exploiting spatio-temporal cues and multi-view metric learning."
[46] Y. Li, Z. Wu, S. Karanam, and R. J. Radke, "Multi-shot human re-identification using adaptive fisher discriminant analysis," in British Machine Vision Conference, 2015, pp. 73.1–73.12.
[47] S. Karanam, Y. Li, and R. J. Radke, "Person re-identification with discriminatively trained viewpoint invariant dictionaries," in IEEE International Conference on Computer Vision, 2015, pp. 4516–4524.
[48] A. Kläser, M. Marszalek, and C. Schmid, "A spatio-temporal descriptor based on 3d-gradients," in British Machine Vision Conference, 2008.
[49] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, "Large scale metric learning from equivalence constraints," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2288–2295.
[50] J. Han and B. Bhanu, "Individual recognition using gait energy image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 2, pp. 316–322, 2006.
[51] F. Xiong, M. Gou, O. Camps, and M. Sznaier, "Person re-identification using kernel-based metric learning methods," in European Conference on Computer Vision. Springer, 2014, pp. 1–16.
[52] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in IEEE International Conference on Computer Vision, 2015, pp. 1116–1124.
[53] B. Ma, Y. Su, and F. Jurie, "Covariance descriptor based on bio-inspired features for person re-identification and face verification," Image and Vision Computing, vol. 32, no. 6, pp. 379–390, 2014.
[54] L. Zheng, H. Zhang, S. Sun, M. Chandraker, and Q. Tian, "Person re-identification in the wild," arXiv preprint arXiv:1604.02531, 2016.
[55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.