Multi-Channel Pyramid Person Matching Network for Person Re-Identification
Chaojie Mao, Yingming Li∗, Yaqing Zhang, Zhongfei Zhang, Xi Li
College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou, China
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Hangzhou, China
{mcj, yingming, zhongfei, yaqing, xilizju}@zju.edu.cn
∗ Corresponding author

Abstract
In this work, we present a Multi-Channel deep convolutional Pyramid Person Matching Network (MC-PPMN) based on the combination of the semantic-components and the color-texture distributions to address the problem of person re-identification. In particular, we learn separate deep representations for semantic-components and color-texture distributions from two person images and then employ a pyramid person matching network (PPMN) to obtain correspondence representations. These correspondence representations are fused to perform the re-identification task. Further, the proposed framework is optimized via a unified end-to-end deep learning scheme. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art literature, especially on the rank-1 recognition rate.
Introduction
The task of person Re-Identification (Re-ID) is to judge whether two person images depict the same target or not, and it has widespread applications in video surveillance for public security. From the perspective of human perception, two persons can be distinguished according to the color or texture features of the persons' attributes (e.g., clothes, hair) and the latent semantic parts (e.g., head, front and back upper body, belongings). Consequently, the person Re-ID task can be addressed by matching two aspects: the color-texture distributions and the latent semantic-components. In previous efforts, two common strategies are employed for the person Re-ID task. One strategy focuses on learning the correspondence among color-texture distributions from different person images, while ignoring the correspondence among semantic-components (Mignon and Jurie 2012; Pedagadi et al. 2013; Weinberger and Saul 2009; Koestinger et al. 2012). The other relies on learning the correspondence among semantic-components, while ignoring the color-texture correspondence (Ahmed, Jones, and Marks 2015; Li et al. 2014; Varior et al. 2016; Zhang et al. 2016a; Cheng et al. 2016a; Ding et al. 2015; Wang et al. 2016). Figure 1 gives two examples to show the respective advantages of the two strategies: images in the first row are re-identification results by the semantic correspondence, and images in the second row are re-identification results by the color-texture correspondence.
Figure 1: Example pairs of images from the CUHK03 dataset. Given the probe image of a person in view A marked by a blue window, the task is to find the same person in the gallery set of view B. The ground-truth images are marked by the green bounding boxes. The first row and the second row are re-identification results by the semantic-components (SC)-based and color-texture maps (CTM)-based strategies, respectively. Failures exist in both cases. The third row shows the results by the combination of the two strategies, which succeeds on both examples.

In this work, we assume that the semantic-components and color-texture distributions are complementary to each other and present a novel multi-channel deep convolutional person matching network based on the combination of the semantic-components and the color-texture distributions. In particular, we learn separate deep representations for semantic-components and color-texture distributions from two person images and then employ the matching network to obtain the correspondence representations. These correspondence representations are fused to address the Re-ID task.

On one hand, to learn the correspondence among semantic-components from two persons, we first fine-tune the model weights of the ImageNet-pretrained GoogLeNet (Szegedy et al. 2015) to learn the deep representation of each person's semantic-components. By visualizing some layers of this network, we observe that the discriminative regions in feature maps correspond to different components (bag, head, body, etc.) of a person. For matching these learned feature regions from two person images, the convolution operation is exploited to fuse the feature regions from different inputs in the same sub-windows. However, the feature regions of the same components from the two views of one person seldom have a consistent spatial scale and location due to viewpoint changes. To overcome the variation of spatial scale and location, we employ atrous convolution (Chen et al. 2016) with multi-scale views to construct a module called the pyramid matching module, which provides a desirable view of perception without increasing parameters and computation by introducing zeroes between the consecutive filter values. With this module, we obtain the correspondence representation between the semantic-components from different inputs.

On the other hand, to build the correspondence representation between color-texture distributions, we propose to introduce deep color-texture distribution representation learning based on a convolutional neural network. Different from the conventional hand-crafted features (e.g., LOMO (Liao et al. 2015)), we first extract RGB, HSV and SILTP histograms (Liao et al. 2010) with sliding windows and then project the histogram bins into specific feature maps, which encode the spatial distribution for the particular color-texture range.
With these Color-Texture feature Maps (CTM), we employ a three-layer convNet to learn the deep color-texture representation for each person image. Thus, the pyramid matching module is exploited to learn the correspondence representation between color-texture distributions from different person images.

Having the learned correspondence representations for the semantic-components and color-texture distributions, the MC-PPMN is carried out by fusing them with two fully connected layers to decide whether the two input images represent the same person or not. The proposed framework is evaluated on several real-world datasets. Extensive experiments on these benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art, especially on the rank-1 recognition rate.

The main contributions of this paper are as follows: (1) We propose a deep convolutional network named MC-PPMN which learns the correspondence representations from both the semantic-components and color-texture distributions. Deep structures for encoding both the semantic space and the color-texture distributions, and the cross-person correspondence, are jointly optimized to improve the generalization performance of the person re-identification task. (2) The proposed framework employs a pyramid matching strategy based on atrous convolution to learn the correspondence representation for two person images, which provides a desirable view of perception without increasing parameters and computation by introducing zeroes between the consecutive filter values.

Related Work
In the past five years, many efforts have been proposed for the task of person Re-ID, which greatly advance this field. Discriminative feature representation learning and effective matching strategy learning are the main topics for person Re-ID. For feature representation, many approaches design robust descriptors against misalignments and variations with color and texture, which are two of the most useful characteristics in image representation. Hand-crafted features, including the HSV color histogram (Farenzena et al. 2010), SIFT histogram (Zhao, Ouyang, and Wang 2013), LBP histogram (Li and Wang 2013), and their combinations, are widely used for image representation. Many efforts also consider the properties of person images, such as the symmetry structure of segments (Farenzena et al. 2010) and the horizontal occurrence of local features (Liao et al. 2015), to design the features, which significantly boost the matching rate.

For the matching strategy, metric learning is the basic idea: find a mapping function from the feature space to the distance space so as to minimize the intra-personal variance while maximizing the inter-personal margin. Many approaches have been proposed based on this idea, including pair-wise constrained component analysis (PCCA) (Mignon and Jurie 2012), local Fisher discriminant analysis (LFDA) (Pedagadi et al. 2013), Large Margin Nearest-Neighbour (LMNN) (Weinberger and Saul 2009), and KISS metric learning (KISSME) (Koestinger et al. 2012). However, these matching strategies often pay much attention to the distance learning of abstract features without taking the spatial-structural and semantic correspondence learning into consideration.

Recently, efforts that employ deep convolutional architectures for the task of person Re-ID have shown a remarkable improvement over the approaches based on hand-crafted features. For example, the patch-based methods (Ahmed, Jones, and Marks 2015; Li et al. 2014) perform patch-wise distance measurement to obtain the spatial relationship. Part-based methods (Varior et al. 2016) divide one person into several equal parts and jointly perform body-wise and part-wise correspondence learning, based on the assumption that the pedestrian generally keeps upright. Some efforts (Zhang et al. 2016a) try to capture the semantic and structural correlation using deep convolution networks, with promising results on challenging datasets. To improve the performance of feature extraction, triplet learning frameworks (Cheng et al. 2016a; Ding et al. 2015; Wang et al. 2016), which employ triplet training examples and the triplet loss function to learn fine-grained image representations, have also been proposed.
Our Architecture
Figure 2 illustrates our network's architecture. The proposed architecture extracts the color-texture and the mid-level semantic-components representations for a pair of input person images. With the features mentioned above, two pyramid matching modules are employed to learn the correspondence for the color-texture distributions and the semantic-components, respectively, and to output the correspondence representations. Finally, we fuse the correspondence representations utilizing two fully connected layers and employ softmax activations to compute the final decision, which indicates the probability that the image pair depicts the same person. The details of the architecture are explained in the following subsections.

Figure 2: The proposed architecture of the deep convolutional person matching network.
Semantic-Components (SC) Image Representation
As discussed previously, there exists a set of intrinsic latent semantic components (e.g., head, front and back upper body, belongings) in a person image, which are robust to the variations of views and background change. With these semantic representations for the images, we are able to learn the correspondence between the image pair. The well-known ImageNet-pretrained deep convolutional frameworks (e.g., AlexNet, GoogLeNet, ResNet) (Szegedy et al. 2015; He et al. 2016; Krizhevsky, Sutskever, and Hinton 2012) have been widely used to project the RGB space to the semantic-aware space. Previous efforts (Davis et al. 2007) have also verified that the mid-level feature maps of these frameworks represent the semantic-components of an object. In our architecture, we extract these semantic-components with two parameter-shared GoogLeNets for a pair of person images. Figure 3a shows the visualization of every block's responses in GoogLeNet fine-tuned on the Re-ID dataset CUHK03. We observe that the original person images are decomposed into many semantic-components (bag, head, etc.). The responses of low layers like Conv1 depict the particular components apparently, while the responses of high layers like Conv5 look abstract but still keep the shape and spatial location. For notational simplicity, we refer to the convNet as a function $f_{\mathrm{CNN}}(X; \theta)$ that takes $X$ as input and $\theta$ as parameters. For an input pair of images from two cameras, A and B, each resized to $160 \times 80$, the GoogLeNets output 1024 feature maps of size $10 \times 5$ as the image representations. We denote this process as follows:

$$\{R_{sc}^{A}, R_{sc}^{B}\} = \{f_{\mathrm{CNN}}(I^{A}; \theta_{sc}^{0}),\ f_{\mathrm{CNN}}(I^{B}; \theta_{sc}^{0})\} \quad (1)$$

where $R_{sc}^{A}$ and $R_{sc}^{B}$ denote the SC representations of images $I^{A}$ and $I^{B}$ separately, and $\theta_{sc}^{0}$ are the shared parameters.
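As a concrete illustration of Eq. (1), the following PyTorch sketch builds a single ImageNet-pretrained GoogLeNet and applies it with shared weights to both images of a pair. The original implementation is in Caffe, so the framework, the truncation point, and the omission of the last max-pooling layer (which keeps the $10 \times 5$ map size reported above) are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class SCExtractor(nn.Module):
    """Shared GoogLeNet trunk producing the SC representation of Eq. (1)."""
    def __init__(self):
        super().__init__()
        g = models.googlenet(weights="IMAGENET1K_V1")
        # Stack the blocks up to inception5b; maxpool4 is skipped so that a
        # 160x80 input yields 1024 maps of size 10x5 (an assumption made to
        # match the sizes quoted in the text).
        self.features = nn.Sequential(
            g.conv1, g.maxpool1, g.conv2, g.conv3, g.maxpool2,
            g.inception3a, g.inception3b, g.maxpool3,
            g.inception4a, g.inception4b, g.inception4c,
            g.inception4d, g.inception4e,
            g.inception5a, g.inception5b,
        )

    def forward(self, img_a, img_b):
        # theta_sc^0 is shared: the same weights process both camera views.
        return self.features(img_a), self.features(img_b)

extractor = SCExtractor().eval()
i_a = torch.randn(1, 3, 160, 80)  # image I^A, resized to 160x80
i_b = torch.randn(1, 3, 160, 80)  # image I^B
with torch.no_grad():
    r_a, r_b = extractor(i_a, i_b)
print(r_a.shape)  # torch.Size([1, 1024, 10, 5])
```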
Color-Texture Maps (CTM) Image Representation

The existing methods often extract color-texture features for images by computing the histograms of color channels within a partitioned horizontal stripe, which works under the assumption of slight vertical misalignment and only considers pose variations in the horizontal dimension. These methods also ignore the spatial structure information. To address these problems and represent the color spatial distributions, we propose to use sliding windows to describe local color details for a person image and construct spatial feature maps instead of feature vectors. The RGB and HSV channels are the basic color characteristics for images. The Scale Invariant Local Ternary Pattern (SILTP) (Liao et al. 2010) descriptor is an improved operator over the well-known Local Binary Pattern (LBP) (Li and Wang 2013) and an illumination-invariant texture description. Specifically, we use a subwindow size of $8 \times 8$, with an overlapping step of 4 pixels, to locate local patches in the input $160 \times 80$ images.
Figure 3: (a) Visualization of features of the ImageNet-pretrained GoogLeNet network fine-tuned on the CUHK03 dataset (input $160 \times 80 \times 3$; Conv1 $80 \times 40 \times 64$; Conv3 $20 \times 10 \times 480$; Conv5 $10 \times 5 \times 1024$). (b) Illustration of the proposed Color-Texture feature Maps (CTM) extraction (histogram extraction over $8 \times 8$ image patches with stride 4, followed by feature projection to histogram maps).

Within each subwindow, we extract a 24-bin RGB histogram, a 24-bin HSV histogram and a 16-bin SILTP histogram. The resulting histogram bins computed from all subwindows are then projected to the feature maps. Figure 3b shows the procedure of the proposed CTM extraction. With the extracted CTM, we employ a parameter-shared convolution network constructed with three convolution layers and two max-pooling layers to generate the color-texture representation with spatial size $10 \times 5$, consistent with the SC representation. We denote the representations as $R_{ctm}^{A}$ and $R_{ctm}^{B}$ for images $I^{A}$ and $I^{B}$, respectively, with the shared parameters $\theta_{ctm}^{0}$:

$$\{R_{ctm}^{A}, R_{ctm}^{B}\} = \{f_{\mathrm{CNN}}(I^{A}; \theta_{ctm}^{0}),\ f_{\mathrm{CNN}}(I^{B}; \theta_{ctm}^{0})\} \quad (2)$$
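The histogram-map construction just described can be sketched in NumPy as follows. The $8 \times 8$ window, stride 4, and bin counts follow the text; the per-channel binning (8 bins per RGB/HSV channel) and the placeholder texture codes standing in for SILTP are our assumptions, so this illustrates the projection of histogram bins into spatial maps rather than reproducing the exact descriptor.

```python
import numpy as np

def ctm_maps(img_rgb, img_hsv, tex_codes, win=8, stride=4, tex_bins=16):
    """Project sliding-window histograms into spatial feature maps.

    img_rgb, img_hsv: (H, W, 3) uint8 arrays; tex_codes: (H, W) integer
    texture codes in [0, tex_bins) standing in for SILTP (assumption).
    Returns (64, rows, cols): 24 RGB + 24 HSV + 16 texture maps, where
    each map encodes where one color/texture range occurs in the image.
    """
    h, w, _ = img_rgb.shape
    rows = (h - win) // stride + 1
    cols = (w - win) // stride + 1
    maps = np.zeros((24 + 24 + tex_bins, rows, cols), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            rgb = img_rgb[y:y + win, x:x + win].reshape(-1, 3)
            hsv = img_hsv[y:y + win, x:x + win].reshape(-1, 3)
            tex = tex_codes[y:y + win, x:x + win].ravel()
            # 8 bins per channel -> 24-bin RGB and 24-bin HSV histograms.
            h_rgb = [np.histogram(rgb[:, c], bins=8, range=(0, 256))[0] for c in range(3)]
            h_hsv = [np.histogram(hsv[:, c], bins=8, range=(0, 256))[0] for c in range(3)]
            h_tex = np.histogram(tex, bins=tex_bins, range=(0, tex_bins))[0]
            maps[:, i, j] = np.concatenate(h_rgb + h_hsv + [h_tex])
    return maps

maps = ctm_maps(
    np.random.randint(0, 256, (160, 80, 3), dtype=np.uint8),
    np.random.randint(0, 256, (160, 80, 3), dtype=np.uint8),
    np.random.randint(0, 16, (160, 80)),
)
print(maps.shape)  # (64, 39, 19) for a 160x80 input
```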
Pyramid Matching Module with Atrous Convolution

In this work, we represent the semantic-components of person images with the mid-level feature maps of GoogLeNet, which still preserve the original shape and relative spatial location. Therefore, the variations of spatial scale and the location misalignment caused by viewpoint changes remain significant in the image representation. As shown in Figure 4, the same bag belonging to the same person is located on the right side of one image but on the left side of the other image. The previous efforts (Zhang et al. 2016a) address this problem by decreasing the distance between the same semantic components from the two images with max-pooling layers. This strategy is effective but loses spatial information.
Figure 4: Illustration of correspondence learning with the pyramid matching module. Left: the component "head" has a similar spatial location in both views. Right: the component "bag" has a completely different shape and location. We match the components by computing their responses in one window; the convolutions with multi-scale fields-of-view are robust to the misalignment and scale variation caused by viewpoint changes.

We employ atrous convolution to address this issue. By introducing zeroes between the consecutive filter values, the atrous convolution computes the correspondences of the same semantic-components without decreasing their resolutions. Another challenge is that different semantic-components have different scales of variations and misalignments. To achieve scale invariance, we employ multi-rate atrous convolutions to construct the pyramid matching module, based on a pyramid matching strategy, to adaptively learn the correspondence for the semantic-components with multi-scale misalignments. Considering the size of the feature maps, the pyramid matching module includes three branches of $3 \times 3$ atrous convolutions with rates 1, 2 and 3, which provide fields-of-view of size $3 \times 3$, $5 \times 5$ and $7 \times 7$, respectively. Figure 5 shows the structure of this module, and Figure 4 gives two examples, whose correspondences are learned with the rate-1 and rate-2 atrous convolutions respectively, to illustrate how this module works. With the images' concatenated SC representations $\{R_{sc}^{A}, R_{sc}^{B}\}$, the proposed module computes the correspondence distributions denoted as $S_{sc}^{p} = \{S_{sc}^{r=1}, S_{sc}^{r=2}, S_{sc}^{r=3}\}$, in which the value at each location $(i, j)$ indicates the correspondence probability at that location, and $r$ is the rate of the atrous convolution. We formulate this matching strategy as follows:

$$S_{sc}^{p} = \{S_{sc}^{r=1}, S_{sc}^{r=2}, S_{sc}^{r=3}\} = f_{\mathrm{CNN}}(\{R_{sc}^{A}, R_{sc}^{B}\}; \{\theta^{1}, \theta^{2}, \theta^{3}\}_{sc}) = f_{\mathrm{CNN}}(\{R_{sc}^{A}, R_{sc}^{B}\}; \theta_{sc}^{r}) \quad (3)$$

where $\theta_{sc}^{r} = \{\theta^{1}, \theta^{2}, \theta^{3}\}_{sc}$ denotes the parameters of our module for the SC representation, and $\theta^{r}$ ($r = 1, 2, 3$) are the parameters of the matching branch with rate $r$. We fuse the concatenated correspondence maps $S_{sc}^{p}$ with learned parameters $\theta_{sc}^{f}$, which indicate the weights of the different matching branches, and output the fused correspondence representation.
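As a quick check on the field-of-view sizes quoted above (assuming $3 \times 3$ kernels, which is consistent with those sizes), the effective extent of an atrous convolution with kernel size $k$ and rate $r$ follows the standard dilation arithmetic:

$$k_{\mathrm{eff}} = k + (k - 1)(r - 1), \qquad k = 3:\quad r = 1 \Rightarrow 3 \times 3,\quad r = 2 \Rightarrow 5 \times 5,\quad r = 3 \Rightarrow 7 \times 7.$$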
Figure 5: Illustration of the pyramid matching module.

Inspired by (Zhang et al. 2016a), we further downsample the representation by max-pooling so as to preserve the most discriminative correspondence information and align it in a larger region. Finally, we obtain the correspondence representation $S_{sc}^{f}$:

$$S_{sc}^{f} = f_{\mathrm{CNN}}(\{S_{sc}^{r=1}, S_{sc}^{r=2}, S_{sc}^{r=3}\}; \theta_{sc}^{f}) = f_{\mathrm{CNN}}(\{R_{sc}^{A}, R_{sc}^{B}\}; \theta_{sc}^{r}, \theta_{sc}^{f}) \quad (4)$$

Based on the same motivation and principle, we learn the correspondence of the color-texture distributions of the person's attributes (e.g., clothes, hair) with another standalone pyramid matching module. With the images' concatenated CTM representations $\{R_{ctm}^{A}, R_{ctm}^{B}\}$, we obtain the correspondence representation as follows:

$$S_{ctm}^{f} = f_{\mathrm{CNN}}(\{R_{ctm}^{A}, R_{ctm}^{B}\}; \theta_{ctm}^{r}, \theta_{ctm}^{f}) \quad (5)$$

where $\theta_{ctm}^{r}$ and $\theta_{ctm}^{f}$ denote the parameters of the pyramid matching module for the CTM representation.
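A minimal PyTorch sketch of the pyramid matching module of Eqs. (3)-(4) is given below, assuming $3 \times 3$ atrous (dilated) convolutions at rates 1, 2 and 3 over the channel-wise concatenation of the two representations. The branch width, the $1 \times 1$ fusion convolution, and the pooling size are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PyramidMatching(nn.Module):
    def __init__(self, in_ch=2048, branch_ch=256):
        super().__init__()
        # One atrous branch per rate; "same" padding keeps the 10x5 maps.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, dilation=r, padding=r)
            for r in (1, 2, 3)
        ])
        # Learned fusion of the concatenated correspondence maps (theta^f),
        # followed by max-pooling to align correspondences in larger regions.
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * branch_ch, branch_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, ceil_mode=True),
        )

    def forward(self, r_a, r_b):
        x = torch.cat([r_a, r_b], dim=1)         # {R^A, R^B}, concatenated
        s_p = [b(x) for b in self.branches]      # S^{r=1}, S^{r=2}, S^{r=3}
        return self.fuse(torch.cat(s_p, dim=1))  # fused correspondence S^f

ppm = PyramidMatching()
r_a = torch.randn(1, 1024, 10, 5)  # SC representation of image A
r_b = torch.randn(1, 1024, 10, 5)  # SC representation of image B
s_f = ppm(r_a, r_b)
print(s_f.shape)  # torch.Size([1, 256, 5, 3]) after max-pooling
```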
The Unified Framework and Learning

The correspondence representations $S_{sc}^{f}$ and $S_{ctm}^{f}$ are fused into a correspondence descriptor of size 1024 by using two fully connected layers. We pass the correspondence descriptor to another fully connected layer containing two softmax units. The probability that the two images in the pair, $I^{A}$ and $I^{B}$, are of the same person, computed with softmax activations on the units above, is denoted as:

$$p = \frac{\exp(S_{1}(S_{sc}^{f}, S_{ctm}^{f}; \theta^{s}))}{\exp(S_{0}(S_{sc}^{f}, S_{ctm}^{f}; \theta^{s})) + \exp(S_{1}(S_{sc}^{f}, S_{ctm}^{f}; \theta^{s}))} \quad (6)$$

where $S_{0}(S_{sc}^{f}, S_{ctm}^{f}; \theta^{s})$ and $S_{1}(S_{sc}^{f}, S_{ctm}^{f}; \theta^{s})$ are the two softmax units of $S(S_{sc}^{f}, S_{ctm}^{f}; \theta^{s})$.

We reformulate the proposed framework as a unified deep convolution framework based on Eqs. (1)-(4):

$$S(S_{sc}^{f}, S_{ctm}^{f}; \theta^{s}) = f_{\mathrm{CNN}}(\{I^{A}, I^{B}\}; \{\{\theta^{0}, \{\theta^{r}\}, \theta^{f}\}_{sc}; \{\theta^{0}, \{\theta^{r}\}, \theta^{f}\}_{ctm}; \theta^{s}\}) = f_{\mathrm{CNN}}(\{I^{A}, I^{B}\}; \theta) \quad (7)$$

where $\theta = \{\{\theta^{0}, \{\theta^{r}\}, \theta^{f}\}_{sc}; \{\theta^{0}, \{\theta^{r}\}, \theta^{f}\}_{ctm}; \theta^{s}\}$ and $r = 1, 2, 3$.

We minimize the widely used cross-entropy loss over a training set of $N$ pairs using stochastic gradient descent to optimize the network of Eq. (7), where $l_{n}$ is the 1/0 label indicating whether the input pair depicts the same person or not. With this unified network, the processes of discriminative image representation learning and cross-person correspondence learning are optimized jointly to make the image representation optimal for this task.

$$L = -\frac{1}{N}\sum_{n=1}^{N}\left[l_{n}\log p_{n} + (1 - l_{n})\log(1 - p_{n})\right] \quad (8)$$

By setting $\{\theta^{0}, \{\theta^{r}\}, \theta^{f}\}_{ctm} = \mathbf{0}$ or $\{\theta^{0}, \{\theta^{r}\}, \theta^{f}\}_{sc} = \mathbf{0}$, we construct two independent convNets, named SC-PPMN and CTM-PPMN, which focus on semantic-components correspondence learning and color-texture distributions correspondence learning, respectively. These two convNets, denoted as Eq. (9) and Eq. (10), are optimized with the losses $L_{sc}$ and $L_{ctm}$ of the form of Eq. (8), respectively.

$$S_{sc}(S_{sc}^{f}; \theta_{sc}^{s}) = f_{\mathrm{CNN}}(\{I^{A}, I^{B}\}; \{\{\theta^{0}, \{\theta^{r}\}, \theta^{f}\}_{sc}; \theta_{sc}^{s}\}) = f_{\mathrm{CNN}}(\{I^{A}, I^{B}\}; \theta_{sc}) \quad (9)$$

$$S_{ctm}(S_{ctm}^{f}; \theta_{ctm}^{s}) = f_{\mathrm{CNN}}(\{I^{A}, I^{B}\}; \{\{\theta^{0}, \{\theta^{r}\}, \theta^{f}\}_{ctm}; \theta_{ctm}^{s}\}) = f_{\mathrm{CNN}}(\{I^{A}, I^{B}\}; \theta_{ctm}) \quad (10)$$
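To make the training objective concrete, the following sketch wires the two correspondence representations into the fusion head and the cross-entropy loss of Eqs. (6)-(8). The 1024-d descriptor and the two softmax units follow the text; the flattened input sizes and hidden layout are illustrative assumptions. Note that `nn.CrossEntropyLoss` applies the softmax of Eq. (6) internally, so for two classes it reproduces Eq. (8).

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, in_dim_sc, in_dim_ctm):
        super().__init__()
        # Two fully connected layers fuse S^f_sc and S^f_ctm into a
        # 1024-d correspondence descriptor ...
        self.fuse = nn.Sequential(
            nn.Linear(in_dim_sc + in_dim_ctm, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        # ... and a final layer with two softmax units: same / different.
        self.cls = nn.Linear(1024, 2)

    def forward(self, s_sc, s_ctm):
        z = torch.cat([s_sc.flatten(1), s_ctm.flatten(1)], dim=1)
        return self.cls(self.fuse(z))  # logits S_0, S_1 of Eq. (6)

head = FusionHead(in_dim_sc=256 * 5 * 3, in_dim_ctm=256 * 5 * 3)
logits = head(torch.randn(4, 256, 5, 3), torch.randn(4, 256, 5, 3))
labels = torch.tensor([1, 0, 1, 0])           # l_n: 1 = same person
loss = nn.CrossEntropyLoss()(logits, labels)  # Eq. (8) via softmax
loss.backward()
```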
Experiments

Datasets and Protocol
We evaluate the proposed architecture and compare our results with those of the state-of-the-art approaches on six person Re-ID datasets, namely CUHK03 (Li et al. 2014), CUHK01 (Li, Zhao, and Wang 2012), VIPeR (Gray and Tao 2008), PRID450s (Roth et al. 2014), i-LIDS (Office 2008) and PRID2011 (Hirzer et al. 2011). All the approaches are evaluated with Cumulative Matching Characteristic (CMC) curves from single-shot results, which characterize a ranking result for every image in the gallery given the probe image. Our experiments are conducted with 10 random training/testing splits of each dataset, and the average results are presented. We conduct experiments on SC-PPMN, CTM-PPMN and MC-PPMN to learn the correspondence between two person images with the SC features, the CTM features and the fused features, respectively. We report the experimental results and analyze the performances of the CTM features and SC features.

Table 1 lists the description of each dataset and our experimental settings with the training and testing splits. The CUHK03 dataset provides two settings: a labelled setting with manually annotated pedestrian bounding boxes, and a detected setting with automatically generated bounding boxes, in which possible misalignments and missing body parts are introduced for a more realistic setting.

Table 1: Datasets and settings in our experiments.
| Dataset | CUHK03 | CUHK01 | VIPeR | PRID450s | i-LIDS | PRID2011 |
|---|---|---|---|---|---|---|
| identities | 1360 | 971 | 632 | 450 | 119 | 385/749 |
| images | 13164 | 3884 | 1264 | 900 | 479 | 1134 |
| views | 2 | 2 | 2 | 2 | 2 | 2 |
| train IDs | 1160 | 871; 485 | 316 | 225 | 59 | 100 |
| test IDs | 100 | 100; 486 | 316 | 225 | 59 | 100 |

In this paper, the evaluation results on both the labelled and detected settings are reported. For the CUHK01 dataset, we report results on two different settings: 100 test IDs and 486 test IDs. The VIPeR and PRID450s datasets are relatively small and contain only one image per person in each view. The i-LIDS dataset is constructed from video images of a busy airport arrival hall and contains 479 images of 119 persons, each person having four images on average. The PRID2011 dataset consists of images captured by two static surveillance cameras, in which views A and B contain 385 and 749 persons, respectively, with 200 persons appearing in both views. Following the procedure described in (Cheng et al. 2016b) for evaluation on the test set, view A is used for the probe set (100 person IDs) and view B is used for the gallery set, which contains all images of view B (649 person IDs) except the 100 training samples.
Training the Network
The proposed architecture is implemented on the widely used deep learning framework Caffe (Jia et al. 2014) with an NVIDIA TITAN X GPU. We use stochastic gradient descent (SGD) to update the weights of the network. The parameters for training SC-PPMN, CTM-PPMN and MC-PPMN are listed in Table 2. We start with a base learning rate and gradually decrease it as the training progresses using a polynomial decay policy (see the sketch at the end of this subsection):

$$\eta_{i} = \eta_{0}\left(1 - \frac{i}{\mathrm{max\_iter}}\right)^{p}$$

where $i$ is the current mini-batch iteration and $\mathrm{max\_iter}$ is the maximum iteration. We train the MC-PPMN model by fixing the parameters of the pre-trained SC-PPMN and CTM-PPMN models.

Data Augmentation. To make the model robust to image translation and to enlarge the dataset, we sample 5 images around the image center, with translations drawn from a uniform distribution, for an original image of size $160 \times 80$.

Hard Negative Mining (hnm). In fact, the negative pairs are far more numerous than the positive pairs, which can lead to data imbalance. Also, among these negative pairs, there still exist some scenarios that are hard to distinguish. To address these difficulties, we sample hard negative pairs for retraining our network, following the approach in (Ahmed, Jones, and Marks 2015).
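A short sketch of the polynomial decay schedule above, using the SC-PPMN settings from Table 2 (base learning rate 0.01, 160K iterations). The decay power is left as a parameter because its exact value is not legible in the text; $p = 0.5$ below is purely illustrative.

```python
def poly_lr(base_lr, iteration, max_iter, p):
    """eta_i = eta_0 * (1 - i / max_iter) ** p  (polynomial decay)."""
    return base_lr * (1.0 - iteration / max_iter) ** p

# Learning rate halfway through SC-PPMN training, with illustrative p = 0.5.
print(poly_lr(0.01, 80_000, 160_000, p=0.5))  # ~0.00707
```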
Experimental Results
We compare our proposed MC-PPMN with several methods from recent years, including hand-crafted-feature-based methods: ITML (Davis et al. 2007), LMNN (Weinberger and Saul 2009), KISSME (Koestinger et al. 2012), LOMO+XQDA (Liao et al. 2015), LSSCDL (Zhang et al. 2016b) and LOMO+LSTM (Varior et al. 2016); and DCNN-feature-based methods: FPNN (Li et al. 2014), ImprovedDL (Ahmed, Jones, and Marks 2015), Single-Image and Cross-Images Representation learning (SICIR) (Wang et al. 2016), TCP (Cheng et al. 2016b), DCSL (Zhang et al. 2016a), Pose Invariant Embedding (PIE(R)+Kissme) (Zheng et al. 2017), MTDnet (including MTDnet-cross) (Chen et al. 2017), and JLML (Li, Zhu, and Gong 2017). We report the evaluation results in Tables 3-6.

Table 2: The parameters for training.
| Parameters | SC-PPMN | CTM-PPMN | MC-PPMN |
|---|---|---|---|
| Training Time (hours) | 40-48 | 16 | 10 |
| Maximum Iteration | 160K | 30K | 10K |
| Batch Size | 100 | 800 | 150 |
| Momentum | 0.9 | 0.9 | 0.9 |
| Weight Decay | 0.0002 | 0.0002 | 0.0002 |
| Base Learning Rate | 0.01 | 0.1 | 0.0001 |
Table 3: Comparison of state-of-the-art results on the labelled and detected CUHK03 dataset with 100 test IDs. The cumulative matching scores (%) at rank 1, 5, and 10 are listed.
| Methods | labelled r=1 | labelled r=5 | labelled r=10 | detected r=1 | detected r=5 | detected r=10 |
|---|---|---|---|---|---|---|
| KISSME | 14.17 | 37.46 | 52.20 | 11.70 | 33.45 | 45.69 |
| LMNN | 7.29 | 19.64 | 30.74 | 6.25 | 17.87 | 26.60 |
| LSSCDL | 57.00 | - | - | 51.20 | - | - |
| LOMO+LSTM | - | - | - | 57.30 | 80.10 | 88.30 |
| LOMO+XQDA | 52.20 | 82.23 | 92.14 | 46.25 | 78.90 | 88.55 |
| CTM-PPMN (no hnm) | 73.52 | 95.12 | 98.56 | 68.44 | 91.50 | 96.98 |
| CTM-PPMN (hnm) | 76.58 | 95.64 | 98.24 | 70.68 | 92.58 | 97.18 |
| FPNN | 20.65 | 50.94 | 67.01 | 19.89 | 49.41 | - |
| ImprovedDL | 54.74 | 86.50 | 93.88 | 44.96 | 76.01 | 81.85 |
| PIE(R)+Kissme | - | - | - | 67.10 | 92.20 | 96.60 |
| SICIR | - | - | - | 52.17 | - | - |
| DCSL (no hnm) | 78.60 | 97.76 | 99.30 | - | - | - |
| DCSL (hnm) | 80.20 | 97.73 | 99.17 | - | - | - |
| MTDnet | 74.68 | 95.99 | 97.47 | - | - | - |
| JLML | 83.20 | 98.00 | 99.40 | 80.60 | - | - |
| SC-PPMN (no hnm) | 83.20 | 97.50 | 99.25 | 77.60 | 96.10 | 98.60 |
| SC-PPMN (hnm) | 85.50 | 98.20 | 99.50 | 80.63 | 95.62 | 98.07 |
| MC-PPMN (no hnm) | 84.36 | - | - | - | - | - |
| MC-PPMN (hnm) | 86.36 | - | - | 81.88 | - | - |
Comparisons on the CUHK03 dataset. We conduct experiments on both the labelled and detected CUHK03 datasets. From Table 3, we see that our proposed approach achieves better results than the state-of-the-art methods. On the labelled dataset, our method outperforms the next best method by 3.16% (86.36% vs. 83.20%). On the detected dataset, the performance is reduced by the misalignment and incompleteness caused by the detector. However, the proposed method still achieves an improvement of 1.28% over the next best method (81.88% vs. 80.60%).
Comparisons on the CUHK01 dataset. Table 4 lists the top recognition rates on the CUHK01 dataset with 100 test IDs and 486 test IDs. Our proposed method achieves the best recognition rates of 93.45% (rank-1), 99.62% (rank-5) and 99.98% (rank-10) (vs. 89.60%, 96.90% and 99.98%, respectively, by the next best method) with 100 test IDs. For the setting with 486 test IDs, only 485 identities and half of the positive samples are left for training, which makes it challenging for our proposed deep architecture to converge. Following the procedure in (Zhang et al. 2016a), we fine-tune the network for CUHK01 with the model pre-trained on CUHK03 and achieve an improvement of 2.41% (78.95% vs. 76.54%) in rank-1 recognition rate.

Table 4: Comparison of state-of-the-art results on the CUHK01 dataset with 100 test IDs and 486 test IDs. The cumulative matching scores (%) at rank 1, 5, and 10 are listed.
| Methods | 100 IDs r=1 | 100 IDs r=5 | 100 IDs r=10 | 486 IDs r=1 | 486 IDs r=5 | 486 IDs r=10 |
|---|---|---|---|---|---|---|
| KISSME | 29.40 | 60.18 | 74.44 | - | - | - |
| LMNN | 21.17 | 48.51 | 62.98 | 13.45 | 31.33 | 42.25 |
| LSSCDL | 65.97 | 48.51 | 62.98 | - | - | - |
| CTM-PPMN (no hnm) | 71.18 | 91.94 | 96.54 | 48.01 | 75.91 | 84.34 |
| CTM-PPMN (hnm) | 73.74 | 92.32 | 98.18 | 53.57 | 79.32 | 87.13 |
| FPNN | 27.87 | 59.64 | 73.53 | - | - | - |
| ImprovedDL | 65.00 | 89.00 | 94.00 | 47.53 | 71.60 | 80.25 |
| SICIR | 71.80 | - | - | - | - | - |
| TCP | - | - | - | 53.70 | 84.30 | 91.00 |
| MTDnet-cross | 78.50 | 96.50 | 97.50 | - | - | - |
| DCSL (no hnm) | 88.00 | 96.90 | 98.10 | - | - | - |
| DCSL (hnm) | 89.60 | 97.80 | 98.90 | 76.54 | 94.24 | 97.49 |
| SC-PPMN (no hnm) | 92.10 | 99.50 | 99.95 | - | - | - |
| SC-PPMN (hnm) | 93.10 | 98.80 | 99.80 | 77.16 | 92.80 | 97.53 |
| MC-PPMN (no hnm) | 92.32 | 98.68 | 99.60 | - | - | - |
| MC-PPMN (hnm) | 93.45 | 99.62 | 99.98 | 78.95 | - | - |
Table 5: Comparison of state-of-the-art results on the VIPeR and PRID450s datasets. The cumulative matching scores (%) at rank 1, 5, and 10 are listed.
| Methods | VIPeR r=1 | VIPeR r=5 | VIPeR r=10 | PRID450s r=1 | PRID450s r=5 | PRID450s r=10 |
|---|---|---|---|---|---|---|
| KISSME | 19.60 | 48.00 | 62.20 | 15.00 | - | 39.00 |
| LSSCDL | 42.66 | - | 84.27 | 60.49 | - | 88.58 |
| LOMO+LSTM | 42.40 | 68.70 | 79.40 | - | - | - |
| LOMO+XQDA | 40.00 | 68.13 | 80.51 | 61.42 | - | 90.84 |
| CTM-PPMN | 32.12 | 64.24 | 80.38 | 28.98 | 59.47 | 73.60 |
| ImprovedDL | 34.81 | 63.61 | 75.63 | 34.81 | 63.72 | 76.24 |
| PIE(R) | 27.44 | 43.01 | 50.82 | - | - | - |
| SICIR | 35.76 | - | - | - | - | - |
| TCP | 47.80 | 74.70 | 84.80 | - | - | - |
| DCSL | 44.62 | 73.42 | 82.59 | - | - | - |
| JLML | - | - | - | - | - | - |
| SC-PPMN | 45.82 | - | - | 52.08 | - | - |
| MC-PPMN | 50.13 | - | - | 62.22 | - | - |
Comparisons on the VIPeR and PRID450s datasets. Following (Ahmed, Jones, and Marks 2015), we pre-train the network using the CUHK03 and CUHK01 datasets and fine-tune it on the training sets of VIPeR and PRID450s. As shown in Table 5, the proposed MC-PPMN is better than the state-of-the-art methods in all cases except the rank-1 recognition rate on the VIPeR dataset, where it is comparable with the best competing method, JLML.
Comparisons on the i-LIDS and PRID2011 datasets. We also conduct experiments on the i-LIDS and PRID2011 datasets. Table 6 shows our results. For both datasets, MC-PPMN achieves the best rank-1, rank-5 and rank-10 recognition rates, which demonstrates the effectiveness of the proposed method on small training sets.
Table 6: Comparison of state-of-the-art results on the i-LIDS and PRID2011 datasets. The cumulative matching scores (%) at rank 1, 5, and 10 are listed.
| Methods | i-LIDS r=1 | i-LIDS r=5 | i-LIDS r=10 | PRID2011 r=1 | PRID2011 r=5 | PRID2011 r=10 |
|---|---|---|---|---|---|---|
| ITML | 29.00 | 54.00 | 70.50 | 12.00 | - | 36.00 |
| KISSME | - | - | - | 15.00 | - | 39.00 |
| LMNN | 28.00 | 53.80 | 66.10 | 10.00 | - | 30.00 |
| CTM-PPMN | 44.17 | 73.31 | 85.02 | 12.00 | 32.00 | 42.00 |
| TCP | 60.40 | 82.70 | 90.70 | 22.00 | 47.00 | 57.00 |
| MTDnet | 57.80 | 78.61 | 87.28 | 32.00 | 51.00 | 62.00 |
| SC-PPMN | 54.80 | 81.92 | 92.32 | 32.00 | 53.00 | 63.00 |
| MC-PPMN | 62.69 | - | - | 34.00 | - | - |
Table 7: The improvement in rank-1 recognition rates (%) from fusing the correspondence representations on the experimental datasets.
| Dataset | CTM-PPMN | SC-PPMN | MC-PPMN | Improvement |
|---|---|---|---|---|
| CUHK03 (labelled) | 76.58 | 85.50 | 86.36 | 0.86 |
| CUHK03 (detected) | 70.68 | 80.63 | 81.88 | 1.25 |
| CUHK01 (100 test IDs) | 73.74 | 93.10 | 93.45 | 0.35 |
| CUHK01 (486 test IDs) | 53.57 | 77.16 | 78.95 | 1.79 |
| VIPeR | 32.12 | 45.82 | 50.13 | 4.31 |
| PRID450s | 28.98 | 52.08 | 62.22 | 10.14 |
| i-LIDS | 44.17 | 54.80 | 62.69 | 7.89 |
| PRID2011 | 12.00 | 32.00 | 34.00 | 2.00 |

The effect of fusion for the correspondence representations. Comparing with the experimental results obtained by learning the correspondence for two person images with the CTM features and the SC features separately, Table 7 shows the improvement in rank-1 recognition rates brought by fusing the correspondence representations. For the CUHK03 and CUHK01 datasets, we achieve an absolute gain of about 1.00%, and for the other datasets we see absolute gains of over 2.00%. In particular, the proposed method achieves a 10.14% improvement in rank-1 recognition rate on PRID450s. These results demonstrate the effectiveness of fusing the correspondence representations, which is especially evident on the small datasets.
The effect of hard negative mining. We also report the results of our models with and without hnm, as shown in Tables 3 and 4. We see an absolute gain of about 1.00% compared with the same model without hnm.
Conclusion
In this paper, we have developed a novel multi-channel deep convolutional architecture for person re-identification. We employ deep convNets to map a person's semantic components and color-texture distributions to the required feature spaces. Based on the learned deep features and a pyramid matching strategy, we learn their correspondence representations and fuse them together to perform the re-identification task. The effectiveness and promise of our method are demonstrated by extensive evaluations on various datasets. The results show that our method achieves a remarkable improvement over the competing models.
Acknowledgments
This work was supported by the National Key R&D Program of China (No. 2017YFB1002400), the National Natural Science Foundation of China (No. 61702448, 61672456), the Key R&D Program of Zhejiang Province (No. 2018C03042), and the Fundamental Research Funds for the Central Universities (No. 2017QNA5008, 2017FZA5007). X. Li was also supported in part by the National Natural Science Foundation of China under Grant U1509206 and Grant 61472353, and by the Alibaba-Zhejiang University Joint Institute of Frontier Technologies.
References

[Ahmed, Jones, and Marks 2015] Ahmed, E.; Jones, M.; and Marks, T. K. 2015. An improved deep learning architecture for person re-identification. In CVPR, 3908-3916.
[Chen et al. 2016] Chen, L. C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2016. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99):1-1.
[Chen et al. 2017] Chen, W.; Chen, X.; Zhang, J.; and Huang, K. 2017. A multi-task deep network for person re-identification. In AAAI, 3988-3994.
[Cheng et al. 2016a] Cheng, D.; Gong, Y.; Zhou, S.; Wang, J.; and Zheng, N. 2016a. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In CVPR, 1335-1344.
[Cheng et al. 2016b] Cheng, D.; Gong, Y.; Zhou, S.; Wang, J.; and Zheng, N. 2016b. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In CVPR, 1335-1344.
[Davis et al. 2007] Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S. 2007. Information-theoretic metric learning. In ICML, 209-216.
[Ding et al. 2015] Ding, S.; Lin, L.; Wang, G.; and Chao, H. 2015. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition.
[Farenzena et al. 2010] Farenzena, M.; Bazzani, L.; Perina, A.; Murino, V.; and Cristani, M. 2010. Person re-identification by symmetry-driven accumulation of local features. In CVPR, 2360-2367.
[Gray and Tao 2008] Gray, D., and Tao, H. 2008. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.
[Hirzer et al. 2011] Hirzer, M.; Beleznai, C.; Roth, P. M.; and Bischof, H. 2011. Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis, 91-102.
[Jia et al. 2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 675-678. ACM.
[Koestinger et al. 2012] Koestinger, M.; Hirzer, M.; Wohlhart, P.; Roth, P. M.; and Bischof, H. 2012. Large scale metric learning from equivalence constraints. In CVPR, 2288-2295. IEEE.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.
[Li and Wang 2013] Li, W., and Wang, X. 2013. Locally aligned feature transforms across views. In CVPR, 3594-3601.
[Li et al. 2014] Li, W.; Zhao, R.; Xiao, T.; and Wang, X. 2014. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 152-159.
[Li, Zhao, and Wang 2012] Li, W.; Zhao, R.; and Wang, X. 2012. Human reidentification with transferred metric learning. In ACCV, 31-44. Springer.
[Li, Zhu, and Gong 2017] Li, W.; Zhu, X.; and Gong, S. 2017. Person re-identification by deep joint learning of multi-loss classification. In IJCAI, 2194-2200.
[Liao et al. 2010] Liao, S.; Zhao, G.; Kellokumpu, V.; Pietikainen, M.; and Li, S. Z. 2010. Modeling pixel process with scale invariant local patterns for background subtraction in complex scenes. In CVPR, 1301-1306.
[Liao et al. 2015] Liao, S.; Hu, Y.; Zhu, X.; and Li, S. Z. 2015. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2197-2206.
[Mignon and Jurie 2012] Mignon, A., and Jurie, F. 2012. PCCA: A new approach for distance learning from sparse pairwise constraints. In CVPR, 2666-2672.
[Office 2008] Office, U. H. 2008. i-LIDS multiple camera tracking scenario definition.
[Pedagadi et al. 2013] Pedagadi, S.; Orwell, J.; Velastin, S.; and Boghossian, B. 2013. Local Fisher discriminant analysis for pedestrian re-identification. In CVPR, 3318-3325.
[Roth et al. 2014] Roth, P. M.; Hirzer, M.; Koestinger, M.; Beleznai, C.; and Bischof, H. 2014. Mahalanobis distance learning for person re-identification. In Person Re-Identification. Springer. 247-267.
[Szegedy et al. 2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR, 1-9.
[Varior et al. 2016] Varior, R. R.; Shuai, B.; Lu, J.; Xu, D.; and Wang, G. 2016. A siamese long short-term memory architecture for human re-identification. In ECCV, 135-153. Springer.
[Wang et al. 2016] Wang, F.; Zuo, W.; Lin, L.; Zhang, D.; and Zhang, L. 2016. Joint learning of single-image and cross-image representations for person re-identification. In CVPR, 1288-1296.
[Weinberger and Saul 2009] Weinberger, K. Q., and Saul, L. K. 2009. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10:207-244.
[Zhang et al. 2016a] Zhang, Y.; Li, X.; Zhao, L.; and Zhang, Z. 2016a. Semantics-aware deep correspondence structure learning for robust person re-identification. In IJCAI, 3545-3551.
[Zhang et al. 2016b] Zhang, Y.; Li, B.; Lu, H.; Irie, A.; and Ruan, X. 2016b. Sample-specific SVM learning for person re-identification. In CVPR, 1278-1287.
[Zhao, Ouyang, and Wang 2013] Zhao, R.; Ouyang, W.; and Wang, X. 2013. Unsupervised salience learning for person re-identification. In CVPR, 3586-3593.
[Zheng et al. 2017] Zheng, L.; Huang, Y.; Lu, H.; and Yang, Y. 2017. Pose invariant embedding for deep person re-identification. arXiv preprint arXiv:1701.07732.