JMLR: Workshop and Conference Proceedings 80:1–11, 2017. ACML 2017
Pyramid Person Matching Network for Person Re-identification
Chaojie Mao [email protected]
Yingming Li [email protected]
Zhongfei Zhang [email protected]
Yaqing Zhang [email protected]
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
Xi Li [email protected]
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Abstract
In this work, we present a deep convolutional Pyramid Person Matching Network (PPMN) with a specially designed Pyramid Matching Module to address the problem of person re-identification. The architecture takes a pair of RGB images as input and outputs a similarity value indicating whether the two input images represent the same person or not. Based on deep convolutional neural networks, our approach first learns a discriminative semantic representation with semantic-component-aware features for persons, and then employs the Pyramid Matching Module to match the common semantic components of persons, which is robust to the variation of spatial scales and the misalignment of locations posed by viewpoint changes. The above two processes are jointly optimized via a unified end-to-end deep learning scheme. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art approaches, especially on the rank-1 recognition rate.
Keywords: Person re-identification, Pyramid Matching Module, Unified end-to-end deep learning scheme
1. Introduction
The task of person re-identification (Re-ID) is to judge whether two person images represent the same person or not, and it has widespread applications in video surveillance. Viewpoint changes pose two challenges: the variation of a person's pose and misalignment.

Many existing methods address these challenges by extracting cross-view invariant features Weinberger and Saul (2009); Koestinger et al. (2012); Ahmed et al. (2015); Li et al. (2014); Varior et al. (2016). These methods focus on extracting local features, including hand-crafted features and deep learning features, from horizontal stripes of a person image and fuse them into a description vector as the representation. Though these methods usually work under the assumption of at most a slight vertical misalignment, they ignore the typically widely existing horizontal misalignment. From the perception of humans, the images captured by two cameras for the same person should share many common components (body parts, front and back pose, belongings), so that people can decide whether the two input images represent the same person or not. Based on this principle, methods like DCSL Zhang et al. (2016) employ deep convolutional networks to learn the correspondence among these components and have shown promising performance. DCSL uses deep convolutional networks like GoogLeNet Szegedy et al. (2015) to extract the semantic-component representation. For bottom layers, the discriminative region in each feature map learned by DCSL corresponds to one component of a person, such as bag, head, and body. For high layers, the learned regions still keep their shapes and spatial locations while being more abstract. However, the feature regions of the same components from two views of the same person seldom have consistent spatial scales and locations because of viewpoint changes. For example, the component "bag" is located on opposite sides in the two images in Figure 1. Existing methods like DCSL ignore this problem.

Figure 1: An example of misalignment for the component "bag" in two different images, where the feature maps are extracted with GoogLeNet.

To address the challenges above, in this work we present a deep convolutional Pyramid Person Matching Network (PPMN). Pyramid matching based on the convolution operation is employed to compute the responses of the same semantic component in different images. To further capture the variation of spatial scale and the misalignment of location, we exploit convolution operations with flexible kernel sizes to guarantee that most of the semantic components are matched within the same subwindows. Since convolution with large kernels increases the number of parameters and the computation, we propose to reduce the computational complexity by introducing the atrous convolution structure Chen et al. (2016), which has been used in convNet-based tasks such as image segmentation and object detection, and which provides a desirable field-of-view without increasing parameters or computation by inserting zeros between the consecutive filter values.
In particular, we employ multi-rate atrous convolution layers to construct the Pyramid Matching Module and produce the correspondence representation between the semantic components. From the correspondence representation, we learn the final similarity value to decide whether the two input images represent the same person or not.
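To make the atrous convolution argument concrete, the following minimal PyTorch sketch (illustrative only; the paper's implementation is in Caffe, and these layer names are not from the paper) verifies that dilating a 3 x 3 kernel enlarges its field-of-view without adding parameters.

```python
import torch.nn as nn

# A standard 3x3 convolution sees a 3x3 window of the input.
conv_r1 = nn.Conv2d(1024, 512, kernel_size=3, padding=1, dilation=1)

# Dilating the same 3x3 kernel inserts zeros between the filter taps:
# rate 2 covers a 5x5 window, rate 3 a 7x7 window, with exactly the
# same number of weights -- no extra parameters or multiplications.
conv_r2 = nn.Conv2d(1024, 512, kernel_size=3, padding=2, dilation=2)
conv_r3 = nn.Conv2d(1024, 512, kernel_size=3, padding=3, dilation=3)

for name, conv in [("rate 1", conv_r1), ("rate 2", conv_r2), ("rate 3", conv_r3)]:
    n_params = sum(p.numel() for p in conv.parameters())
    k, d = conv.kernel_size[0], conv.dilation[0]
    fov = k + (k - 1) * (d - 1)  # effective field-of-view of the dilated kernel
    print(f"{name}: {n_params} params, {fov}x{fov} field-of-view")
```

Running the loop prints identical parameter counts for all three rates while the field-of-view grows from 3 x 3 to 5 x 5 to 7 x 7, which is exactly the property the Pyramid Matching Module exploits.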
The proposed framework is evaluated on three real-world datasets. Extensive experiments on these benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art, especially on the rank-1 recognition rate.

The main contributions of this paper are as follows: (1) We propose an end-to-end deep convolutional framework to deal with the problem of person Re-ID. Image representation learning and cross-person correspondence learning are jointly optimized to enable the image representation to adapt to the task of person Re-ID. (2) The proposed framework maps a person's semantic components to the deep feature space and employs the pyramid matching strategy based on the atrous convolution to identify the common components of the person.
2. Related Work
In the literature, most existing efforts on person Re-ID are mainly carried out in two aspects: discriminative representation learning and effective matching strategy learning. For image representation, a number of approaches pay attention to designing descriptors robust against misalignments and variations. Early studies employ hand-crafted features, including the HSV color histogram Farenzena et al. (2010), SIFT Zhao et al. (2013), and LBP Li and Wang (2013) features, or combinations of them. Recently, several deep convolutional architectures Li et al. (2014); Zhang et al. (2016) have been proposed for person Re-ID and have shown significant improvements over those with hand-crafted features.

For the matching strategy, the essential idea behind metric learning is to find a mapping function from the feature space to the distance space so as to minimize the intra-personal variance while maximizing the inter-personal margin. Many approaches have been proposed based on this idea, including LMNN Weinberger and Saul (2009) and KISSME Koestinger et al. (2012). Recently, some efforts jointly learn the representation and the classifier in a unified deep architecture. For example, patch-based methods Ahmed et al. (2015); Li et al. (2014) decompose images into patches and perform patchwise distance measurement to find the spatial relationship. Part-based methods Varior et al. (2016) divide one person into equal parts and jointly perform bodywise and partwise correspondence learning, since pedestrians generally keep upright. Different from all the above efforts, which focus on feature distance measurement, our proposed method aims at learning the semantic correspondence of semantic components based on semantics-aware features and is robust to the variation and misalignment posed by viewpoint changes.
3. Our Architecture
Figure 2 illustrates our network's architecture. The proposed architecture extracts semantics-aware representations for a pair of input person images. The features are then concatenated and fed into the Pyramid Matching Module to learn the correspondence of semantic components. Finally, softmax activations compute the final decision, which indicates the probability that the image pair represents the same person. The details of the architecture are explained in the following subsections.

Figure 2: The proposed architecture of the deep convolutional Pyramid Person Matching Network (PPMN). Given a pair of person images as input, the parameter-shared GoogLeNets generate the semantics-aware representation. Semantic components such as bag and head are visible in the output of the Conv1 layer. With the extracted features, the Pyramid Matching Module learns the correspondence of these semantic components based on multi-scale atrous convolution layers. Finally, softmax activations give the final decision of whether the image pair depicts the same person or not.
The ImageNet-pretrained GoogLeNet employed in this work is able to capture semantic features for most of the objects in this task, as the ImageNet dataset covers a large number of object types over more than 100,000 concepts. In our architecture, these semantic features are extracted with two parameter-shared GoogLeNets, one for each image of the pair. As shown in Figure 2, the GoogLeNets have been adapted to the Re-ID task by finetuning on a Re-ID dataset, and they decompose the person image into many semantic components such as bag, head, and body. The particular components are easy to recognize in the visualizations of the bottom layers' output, such as the Conv1 layer. The visualizations for higher layers, such as the Conv5 layer, are more abstract but still keep the shapes and spatial locations. For notational simplicity, we refer to the convNet as a function f_CNN(X; θ), which takes X as input and θ as parameters. Given an input pair of images resized to 160 x 80 from two cameras, A and B, the GoogLeNets output 1024 feature maps of size 10 x 5:

    \{R_A, R_B\} = \{f_{\mathrm{CNN}}(I_A; \theta_g), f_{\mathrm{CNN}}(I_B; \theta_g)\}    (1)

where R_A and R_B denote the representations of images I_A and I_B, respectively, and θ_g denotes the shared parameters.
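As a concrete illustration of the weight sharing in Eq. (1), here is a minimal PyTorch sketch with torchvision's GoogLeNet standing in for the paper's Caffe backbone; where exactly the network is cut to obtain the semantic feature maps is an assumption, not the authors' configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class SharedFeatureExtractor(nn.Module):
    """Applies one backbone (the shared parameters theta_g) to both images."""
    def __init__(self):
        super().__init__()
        backbone = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
        # Keep only the convolutional trunk by dropping the average pool,
        # dropout, and classifier head; the exact truncation point needed
        # to reproduce the paper's 1024 feature maps is an assumption.
        self.trunk = nn.Sequential(*list(backbone.children())[:-3])

    def forward(self, img_a, img_b):
        # The same parameters produce both R_A and R_B, as in Eq. (1).
        return self.trunk(img_a), self.trunk(img_b)

extractor = SharedFeatureExtractor().eval()
img_a = torch.randn(1, 3, 160, 80)  # input size 160 x 80 from the paper
img_b = torch.randn(1, 3, 160, 80)
r_a, r_b = extractor(img_a, img_b)
```

Because a single `trunk` module serves both inputs, any gradient update moves the two views together, which is what "parameter-shared GoogLeNets" means in practice.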
Based on the semantic representations of persons, the problem of person matching reduces to matching the semantic components. The challenges, however, are the variations of spatial scales and the misalignments of locations of the semantic components posed by viewpoint changes. As shown in Figure 1, the same bag belonging to the same person is located on the right side in one image but on the left side in the other. To deal with these challenges, we employ the atrous convolution with multi-scale kernels to construct a module, called the Pyramid Matching Module, based on the pyramid matching strategy. We give two examples in Figure 3 to explain how this module works. In the left column, the component "head" has similar spatial shapes and locations in the two images. It is easy to learn the correspondence between the two feature maps with a general convolution operation, which computes the responses of two feature regions in closely located windows (the field-of-views) in the two images. In contrast, the component "bag" has completely different shapes and locations in the two images, and thus different field-of-views, so a larger field-of-view is required for the convolution in the latter case. Accordingly, we employ the atrous convolution for a large field-of-view. The Pyramid Matching Module includes three branches with field-of-views of 3 x 3, 5 x 5, and 7 x 7, respectively. With the representations concatenated as {R_A, R_B}, the proposed module computes the correspondence distribution S_PPM, in which the value at each location (i, j) indicates the correspondence probability at that location and r is the rate of the atrous convolution. We formulate this matching strategy as follows:

    S_{PPM} = \{S_{r=1}, S_{r=2}, S_{r=3}\} = f_{\mathrm{CNN}}(\{R_A, R_B\}; \{\theta_1, \theta_2, \theta_3\})    (2)

where θ_r (r = 1, 2, 3) are the parameters of the matching branch with rate r.

We fuse the concatenated correspondence maps S_PPM with learned parameters θ_f, which indicate the weights of the different matching branches, and output the fused correspondence representation S_fusion. Inspired by Zhang et al. (2016), we further downsample S_fusion by a max-pooling operation so as to preserve the most discriminative correspondence information and align the result over a larger region. We then obtain the final correspondence representation S_final:

    S_{final} = f_{\mathrm{CNN}}(\{S_{r=1}, S_{r=2}, S_{r=3}\}; \theta_f)    (3)

We apply two fully connected layers to encode the correspondence representation S_final into an abstract vector of size 1024. The vector is then passed to a softmax layer with two units, s_0(S_final; θ_s) and s_1(S_final; θ_s). We compute the probability that the two images in the pair, I_A and I_B, depict the same person with softmax activations over these units:

    p = \frac{\exp(s_1(S_{final}; \theta_s))}{\exp(s_0(S_{final}; \theta_s)) + \exp(s_1(S_{final}; \theta_s))}    (4)

Figure 3: Illustration of the correspondence learning with the Pyramid Matching Module. Left: the component "head" has similar spatial shapes and locations. Right: the component "bag" has completely different shapes and locations. We match the components above by computing their responses in the corresponding windows and take the convolutions with a multi-scale field-of-view, which is robust to the misalignments of locations and variations of scale posed by viewpoint changes.

We reformulate this approach as a unified framework with θ = {θ_g, {θ_r}, θ_f, θ_s}, where r = 1, 2, 3:

    s(S_{final}; \theta_s) = f_{\mathrm{CNN}}(\{S_{r=1}, S_{r=2}, S_{r=3}\}; \theta_f, \theta_s)
                           = f_{\mathrm{CNN}}(\{I_A, I_B\}; \theta_g, \{\theta_r\}, \theta_f, \theta_s)
                           = f_{\mathrm{CNN}}(\{I_A, I_B\}; \theta)    (5)

We optimize this framework by minimizing the widely used cross-entropy loss over a training set of N pairs:

    L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \left[ l_n \log p_n + (1 - l_n) \log(1 - p_n) \right]    (6)

where l_n is the 1/0 label of the n-th input pair, indicating whether it represents the same person or not.
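To ground Eqs. (2)-(6), the following is a minimal PyTorch sketch of the module (the paper's implementation is in Caffe); the branch width, pooled size, and classifier layout here are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidMatchingModule(nn.Module):
    """Sketch of Eqs. (2)-(4): multi-rate atrous matching branches,
    learned fusion, max-pooling, and a two-way softmax."""

    def __init__(self, in_ch=2048, branch_ch=256):
        super().__init__()
        # Three 3x3 branches with rates 1/2/3 give 3x3, 5x5, and 7x7
        # field-of-views (theta_1, theta_2, theta_3 in Eq. 2).
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r)
            for r in (1, 2, 3)
        ])
        # 1x1 convolution weighting the branches (theta_f in Eq. 3).
        self.fuse = nn.Conv2d(3 * branch_ch, branch_ch, 1)
        self.pool = nn.AdaptiveMaxPool2d((5, 3))  # assumed pooled size
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(branch_ch * 5 * 3, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 2),  # the two softmax units s_0, s_1
        )

    def forward(self, r_a, r_b):
        x = torch.cat([r_a, r_b], dim=1)               # {R_A, R_B}
        s_ppm = [F.relu(b(x)) for b in self.branches]  # S_{r=1..3}
        s_fusion = self.fuse(torch.cat(s_ppm, dim=1))  # fused maps
        s_final = self.pool(s_fusion)                  # S_final
        return self.classifier(s_final)                # logits (s_0, s_1)

# Cross-entropy over same/different labels implements Eq. (6);
# softmax over the two logits recovers the probability p of Eq. (4).
module = PyramidMatchingModule()
logits = module(torch.randn(4, 1024, 10, 5), torch.randn(4, 1024, 10, 5))
labels = torch.tensor([1, 0, 1, 0])  # 1 = same person
loss = F.cross_entropy(logits, labels)
```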
Table 1: Datasets and settings in our experiments. The settings for the CUHK01 dataset include both the 100-test-ID and the 486-test-ID protocols.

Dataset      CUHK03   CUHK01     VIPeR
identities   1360     971        632
images       13164    3884       1264
views        2        2          2
train IDs    1160     871; 485   316
test IDs     100      100; 486   316
4. Experiments
We compare our proposed architecture with the state-of-the-art approaches on three person Re-ID datasets, namely CUHK03 Li et al. (2014), CUHK01 Li et al. (2012), and VIPeR Gray and Tao (2008). All the approaches are evaluated with Cumulative Matching Characteristics (CMC) by single-shot results, which characterize a ranking result for every image in the gallery given a probe image. Our experiments are conducted on the datasets with 10 random initializations in training, and the average results are provided. Table 1 lists the description of each dataset and our experimental settings with the training and testing splits.
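For reference, single-shot CMC can be scored as in the generic Python sketch below; this is not the authors' evaluation code, and the similarity matrix is assumed to hold the network's matching probabilities.

```python
import numpy as np

def cmc_curve(sim, probe_ids, gallery_ids, max_rank=10):
    """Single-shot CMC: for each probe, rank the gallery by similarity
    and record the rank at which the true match appears.

    sim[i, j] is the similarity of probe i to gallery image j
    (e.g. the matching probability p output by the network).
    """
    hits = np.zeros(max_rank)
    for i, pid in enumerate(probe_ids):
        order = np.argsort(-sim[i])                       # best match first
        rank = np.where(gallery_ids[order] == pid)[0][0]  # 0-indexed rank
        if rank < max_rank:
            hits[rank:] += 1                              # cumulative hit
    return hits / len(probe_ids)                          # rate per rank

# Example: 3 probes against a 5-image gallery (one image per identity).
sim = np.random.rand(3, 5)
cmc = cmc_curve(sim, np.array([0, 1, 2]), np.array([4, 3, 2, 1, 0]))
print("rank-1:", cmc[0])
```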
The proposed architecture is implemented on the widely used deep learning framework Caffe Jia et al. (2014) with an NVIDIA TITAN X GPU. Training takes about 40-48 hours for 160K iterations with a batch size of 100. We use stochastic gradient descent to update the weights of the network, with momentum γ, weight decay µ, and an initial learning rate of η = 0.01, which we gradually decrease as the training progresses using a polynomial decay policy.

Data Augmentation. To make the model robust to the image translation variance and to further augment the training dataset, for every original training image we sample 5 images around the image center, with the translation drawn from a uniform distribution in the range [-8, 8] x [-4, 4] for an original image of size 160 x 80.

Hard Negative Mining (hnm). The negative pairs far outnumber the positive pairs, which can lead to data imbalance. Also, among these negative pairs there still exist scenarios that are hard to distinguish. To address these difficulties, we first sample the negative sets to get three times as many negatives as positives and train our network. Then, we use the trained model to classify all the negative pairs and retain the top-ranked ones on which the trained model performs the worst for retraining the network (see the sketch after Table 2).

Table 2: Comparison of state-of-the-art results on CUHK03. The cumulative matching scores (%) at rank 1, 5, and 10 are listed.

Methods          labelled CUHK03         detected CUHK03
                 r=1    r=5    r=10      r=1    r=5    r=10
KISSME           14.17  37.46  52.20     11.70  33.45  45.69
LMNN             7.29   19.64  30.74     6.25   17.87  26.60
LOMO+LSTM        -      -      -         57.30  80.10  88.30
LOMO+XQDA        52.20  82.23  92.14     46.25  78.90  88.55
FPNN             20.65  50.94  67.01     19.89  49.41  -
ImprovedDL       54.74  86.50  93.88     44.96  76.01  81.85
PIE(R)+Kissme    -      -      -         67.10  92.20  96.60
SICIR            -      -      -         52.17  -      -
DCSL(no hnm)     78.60  97.76  99.30     -      -      -
DCSL(hnm)        80.20  97.73  99.17     -      -      -
PPMN(no hnm)     83.20  97.50  99.25     77.60
PPMN(hnm)        85.50                   80.63
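The sketch below illustrates the mining step described above, under the assumption that the hardest negatives are those the current model assigns the highest same-person probability; the function name and keep ratio are hypothetical, not from the paper.

```python
import torch

def mine_hard_negatives(model, negative_pairs, keep_ratio=1 / 3):
    """Score all negative pairs with the trained model and keep the
    ones it gets most wrong, i.e. those with the highest predicted
    same-person probability, for retraining."""
    model.eval()
    scores = []
    with torch.no_grad():
        for img_a, img_b in negative_pairs:
            logits = model(img_a.unsqueeze(0), img_b.unsqueeze(0))
            p_same = torch.softmax(logits, dim=1)[0, 1]  # p from Eq. (4)
            scores.append(p_same.item())
    # Sort negatives from hardest to easiest and keep the top fraction.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_keep = int(len(order) * keep_ratio)
    return [negative_pairs[i] for i in order[:n_keep]]
```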
In this section, we compare PPMN with several recent methods, including both hand-crafted-feature-based methods: KISSME Koestinger et al. (2012), LMNN Weinberger and Saul (2009), LOMO+LSTM Varior et al. (2016), LOMO+XQDA Liao et al. (2015); and deep-learning-feature-based methods: FPNN Li et al. (2014), ImprovedDL Ahmed et al. (2015), Pose Invariant Embedding (PIE(R)+Kissme) Zheng et al. (2017), Single-Image and Cross-Images Representation learning (SICIR) Wang et al. (2016), and DCSL Zhang et al. (2016). We report the evaluation results in Table 2.

We evaluate PPMN on both the labelled and the detected CUHK03 datasets. From Table 2, our method achieves an improvement of 5.30% (85.50% vs. 80.20%) on the labelled dataset and an improvement of 23.33% (80.63% vs. 57.30%) on the detected dataset. Table 3 illustrates the top recognition rates on the CUHK01 dataset with 100 test IDs and 486 test IDs. We see that PPMN achieves the best rank-1 and rank-5 recognition rates of 93.10% and 99.50% (vs. 89.60% and 96.90%, respectively, by the next best method) with 100 test IDs, which means that in most cases we can find the correct person in the first five samples of the queried and returned results given 100 candidate images. For the setting with 486 test IDs, we finetune the network on the half-CUHK01 set from the model pre-trained on CUHK03 and achieve an improvement of 0.62% (77.16% vs. 76.54%) over DCSL on the rank-1 recognition rate using the same training protocol. The experimental results also demonstrate the effect of hard negative mining, which provides an absolute gain of over 1.00% compared with the same model without hard negative mining. Following the setup of Ahmed et al. (2015), we pre-train the network using the CUHK03 and CUHK01 datasets and finetune on the training set of VIPeR. As shown in Table 4, PPMN achieves the best rank-1, rank-5, and rank-10 recognition rates, with an improvement of 1.20% (45.82% vs. 44.62%) on the rank-1 recognition rate.
Table 3: Comparison of state-of-the-art results on the CUHK01 dataset. The cumulative matching scores (%) at rank 1, 5, and 10 are listed.

Methods          CUHK01 (100 test IDs)   CUHK01 (486 test IDs)
                 r=1    r=5    r=10      r=1    r=5    r=10
KISSME           29.40  60.18  74.44     -      -      -
LMNN             21.17  48.51  62.98     13.45  31.33  42.25
FPNN             27.87  59.64  73.53     -      -      -
ImprovedDL       65.00  89.00  94.00     47.53  71.60  80.25
PIE(R)+Kissme    -      -      -         -      -      -
SICIR            71.80  -      -         -      -      -
DCSL(no hnm)     88.00  96.90  98.10     -      -      -
DCSL(hnm)        89.60  97.80  98.90     76.54  -      -
PPMN(hnm)        93.10  99.50            77.16
Table 4: Comparison of state-of-the-art results on the VIPeR dataset. The cumulative matching scores (%) at rank 1, 5, and 10 are listed.

Methods          r=1    r=5    r=10
KISSME           19.60  48.00  62.20
LMNN             -      -      -
LOMO+LSTM        42.40  68.70  79.40
LOMO+XQDA        40.00  68.13  80.51
FPNN             -      -      -
ImprovedDL       34.81  63.61  75.63
PIE(R)+Kissme    27.44  43.01  50.82
SICIR            35.76  -      -
DCSL(hnm)        44.62  73.42  82.59
PPMN(hnm)        45.82
5. Conclusion
In this paper, we have developed a novel deep convolutional architecture for person re-identification. We employ a deep convNet, GoogLeNet, to map a person's semantic components to the required feature space. Based on the pyramid matching strategy, we design a module to address the misalignment and variation issues posed by viewpoint changes. We demonstrate the effectiveness and promise of our method through extensive evaluations on various datasets. The results indicate that our method achieves a remarkable improvement over the state-of-the-art literature.

References
Ejaz Ahmed, Michael Jones, and Tim K. Marks. An improved deep learning architecture for person re-identification. In CVPR, pages 3908-3916, 2015.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.

Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, pages 2360-2367, 2010.

Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, pages 262-275. Springer, 2008.

Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675-678. ACM, 2014.

Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In CVPR, pages 2288-2295. IEEE, 2012.

Wei Li and Xiaogang Wang. Locally aligned feature transforms across views. In CVPR, pages 3594-3601, 2013.

Wei Li, Rui Zhao, and Xiaogang Wang. Human reidentification with transferred metric learning. In ACCV, pages 31-44. Springer, 2012.

Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, pages 152-159, 2014.

Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, pages 2197-2206, 2015.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1-9, 2015.

Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, and Gang Wang. A siamese long short-term memory architecture for human re-identification. In ECCV, pages 135-153. Springer, 2016.

Faqiang Wang, Wangmeng Zuo, Liang Lin, David Zhang, and Lei Zhang. Joint learning of single-image and cross-image representations for person re-identification. In CVPR, pages 1288-1296, 2016.

Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207-244, 2009.

Yaqing Zhang, Xi Li, Liming Zhao, and Zhongfei Zhang. Semantics-aware deep correspondence structure learning for robust person re-identification. In IJCAI, pages 3545-3551, 2016.

Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised salience learning for person re-identification. In CVPR, pages 3586-3593, 2013.

Liang Zheng, Yujia Huang, Huchuan Lu, and Yi Yang. Pose invariant embedding for deep person re-identification. arXiv preprint arXiv:1701.07732, 2017.