Learning Diverse Features with Part-Level Resolution for Person Re-Identification
Ben Xie, Xiaofu Wu†, Suofei Zhang, Shiliang Zhao and Ming Li

Abstract—Learning diverse features is key to the success of person re-identification. Various part-based methods have been extensively proposed for learning local representations, which, however, are still inferior to the best-performing methods for person re-identification. This paper proposes to construct a strong lightweight network architecture, termed PLR-OSNet, based on the idea of Part-Level feature Resolution over the Omni-Scale Network (OSNet) for achieving feature diversity. The proposed PLR-OSNet has two branches, one branch for global feature representation and the other branch for local feature representation. The local branch employs a uniform partition strategy for part-level feature resolution but produces only a single identity-prediction loss, which is in sharp contrast to the existing part-based methods. Empirical evidence demonstrates that the proposed PLR-OSNet achieves state-of-the-art performance on popular person Re-ID datasets, including Market1501, DukeMTMC-reID and CUHK03, despite its small model size.
Index Terms—Person re-identification, person matching, feature diversity, deep learning.
I. INTRODUCTION
In recent years, person re-identification (Re-ID) has attracted increasing interest due to its fundamental role in emerging computer vision applications such as video surveillance, human identity validation and authentication, and human-robot interaction [1], [2], [3], [4], [5], [6]. The objective of person Re-ID is to match any query image with the images of the same person taken by the same or different cameras at different angles, times or locations. Despite recent progress, identifying the person of interest accurately and reliably is still very challenging due to huge variations in lighting, human pose, background, camera viewpoint, etc. With an ID-labeled training set, one of the main goals in the field of person Re-ID is to discover a low-dimensional but rich representation of any input image for person matching.

Person Re-ID was often formulated as a metric-learning problem (or a feature-embedding problem) [7], [8], [9], where the distance between intra-class samples is required to be less than the distance between inter-class ones by at least a margin.

† Corresponding author. This work was supported in part by the National Natural Science Foundation of China under Grants 61372123, 61671253 and by the Scientific Research Foundation of Nanjing University of Posts and Telecommunications under Grant NY213002. Ben Xie, Xiaofu Wu and Shiliang Zhao are with the National Engineering Research Center of Communications and Networking, Nanjing University of Posts and Telecommunications, Nanjing 210003, China (E-mails: [email protected]; [email protected]; [email protected]). Suofei Zhang is with the School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210003, China (E-mail: [email protected]). Ming Li is with the Supply Chain Platform Division, Alibaba Group, Hangzhou 311121, China (E-mail: [email protected]).
Fig. 1. The performance of different baselines on the DukeMTMC-reID (Rank-1 accuracy) and CUHK03-Detected (mAP) datasets versus model size (parameters in millions). We compare the proposed method with other baselines published in CVPR, ECCV and ICCV (2018/2019).
Unfortunately, a direct implementation of this idea requires grouping samples in a pairwise manner, which is known to be computationally intensive. Alternatively, a classification task is employed to find the feature-embedding solution due to its advantage in implementation complexity. Currently, various state-of-the-art methods [10], [2], [6], [11], [12] for person Re-ID have evolved from a single metric-learning problem or a single discriminative classification problem to a multi-task problem, where both the discriminative loss and the triplet loss are employed [13]. As each sample image is only labeled with the person ID, an end-to-end training approach usually has difficulty learning diverse and rich features without elaborate design of the underlying neural network and further use of some regularization techniques.

In the past years, various part-based approaches [14], [15], [16] and dropout-based approaches [17] have been proposed in order to learn rich features from the ID-labeled dataset. Differing from conventional pose-based Re-ID approaches [7], [18], [19], [20], part-based approaches usually locate a number of body parts first, and force each part to meet an individual ID-prediction loss so as to obtain discriminative part-level feature representations [21], [22], [23], [24]. Dropout-based approaches, however, intend to discover rich features by enlarging the dataset with various dropout-based data-augmentation methods, such as cutout [25] and random erasing [26], or by dropping intermediate features from feature-extracting networks, such as Batch DropBlock [2].

The performance of part-based methods relies heavily on the employed partition mechanism. Semantic partitions may offer stable cues to good alignment but are prone to noisy pose detections, as they require that human body parts be accurately identified and located. The uniform horizontal partition was widely employed in [15], [22], which, however, provides limited performance improvement.

This motivates the work in this paper, where we propose a novel two-branch lightweight architecture for discovering rich features in person Re-ID. In particular, we employ the idea of part-level feature resolution in developing a strong two-branch baseline for person Re-ID. Compared to the popular part-based method of PCB [15], our method differs mainly in two aspects. One is the use of a global branch for facilitating the extraction of a global feature, and the other is the use of a single ID-prediction loss for part-level feature resolution. We briefly summarize the main contributions of this paper as follows:

1) Based on the omni-scale network (OSNet) baseline [27], we propose a lightweight two-branch network architecture (PLR-OSNet) for person Re-ID. Its global branch adopts a global-max-pooling layer while its local branch employs a part-level feature resolution scheme producing only a single ID-prediction loss, which is in sharp contrast to existing part-based methods. The proposed architecture is shown to be effective for achieving feature diversity.

2) Despite its small model size, the proposed PLR-OSNet is very efficient, as depicted in Figure 1, achieving state-of-the-art results [28], [29], [6], [3], [2], [10], [30], [15], [31] on three popular person Re-ID datasets: Market1501, DukeMTMC-reID and CUHK03. It achieves a rank-1 accuracy of 91.6% for DukeMTMC-reID and 83.5% for CUHK03-Labeled without using re-ranking.

II. RELATED WORK
We review relevant work on embedded feature learning, discriminative feature learning, part-based feature learning, and multi-scale feature learning. Among the various problems in person Re-ID, we are particularly interested in achieving feature diversity throughout this section. Source code is available at https://github.com/AI-NERC-NUPT/PLR-OSNet.
A. Embedded Feature Learning
Person Re-ID can be formulated as a feature-embedding problem, which looks for a function mapping the high-dimensional pedestrian images into a low-dimensional feature space. This feature-embedding formulation requires the mapping function to ensure that, given any anchor image, any positive image from the same person has a lower distance to the anchor in the feature space than any negative image from a different person. This is known to be the objective of the triplet loss in training.

For efficient feature-embedding learning, the batch-hard triplet loss [32] was proposed to mine the hardest positive and the hardest negative samples for each pedestrian image in a batch. However, it is sensitive to outlier samples and may discard useful information due to its hard selection approach. To deal with these problems, Ristani et al. proposed the batch-soft triplet loss [32], which introduces a weighting factor for each pair distance. One hyper-parameter that exists in all of the triplet loss variations is the margin. To eliminate the manual choice of the margin, the softplus function $\ln(1 + \exp(\cdot))$ was introduced in [12] to replace $[\cdot]_+ = \max(0, \cdot)$ in the triplet loss function, which is known as the soft-margin triplet loss.
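To make the triplet-loss variants concrete, the following PyTorch-style sketch (our illustration, not code from any of the cited works; the function name, margin value and Euclidean distance choice are assumptions) mines the hardest positive and hardest negative for each anchor in a batch and applies either the hard-margin hinge or the soft-margin softplus formulation:

```python
# A minimal PyTorch sketch of the batch-hard triplet loss; with soft_margin=True
# the hinge [m + d_ap - d_an]_+ is replaced by softplus(d_ap - d_an).
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(features, labels, margin=0.3, soft_margin=True):
    # features: (N, D) embeddings of a batch; labels: (N,) person IDs.
    dist = torch.cdist(features, features, p=2)            # (N, N) pairwise distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)   # (N, N) boolean mask

    # Hardest positive: largest distance among samples sharing the anchor's ID.
    d_ap = dist.masked_fill(~same_id, float('-inf')).max(dim=1).values
    # Hardest negative: smallest distance among samples with a different ID.
    d_an = dist.masked_fill(same_id, float('inf')).min(dim=1).values

    if soft_margin:
        loss = F.softplus(d_ap - d_an)        # ln(1 + exp(d_ap - d_an)), no margin to tune
    else:
        loss = F.relu(d_ap - d_an + margin)   # classical hard-margin hinge
    return loss.mean()
```

With soft_margin=True the margin hyper-parameter disappears entirely, which is exactly the motivation for the soft-margin triplet loss in [12].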
B. Discriminative Feature Learning

A feature-embedding approach requires grouping the pedestrian images into pairs for training, which is not efficient in general. Discriminative feature learning is more efficient by training a classification task, where each person is regarded as a single class. As each pedestrian image in a training dataset is only labeled by a single person ID, a fundamental problem in the field of person Re-ID is how to learn diverse features from the ID-labeled dataset.

To get diverse features from an end-to-end training approach, multi-branch network architectures have been widely employed [15], [2], [33], where a shared-net is often followed by multiple subnetwork branches. To achieve feature diversity, distinct mechanisms should be imposed among different branches, such as attention [6], [3], feature dropping [2], [34], and overlapped activation penalty [4].
C. Part-Based Feature Learning
Part-based feature learning with hand-crafted algorithms had been pursued for a long time for the purpose of person retrieval before the era of deep learning. In [15], the Part-based Convolutional Baseline (PCB) network was proposed, which employs uniform partition on the conv-layer for learning part-level features. Essentially, it employed a 6-branch network by dividing the whole body into 6 horizontal stripes in the feature space, and each part feature vector was used to produce an independent ID-prediction loss. The idea of PCB was widely adopted for developing stronger person Re-ID methods in recent years [31], [30], [21].

Part-level feature learning has an intuitive advantage for extracting diverse features from ID-labeled pedestrian images. However, the pristine division strategy usually suffers from misalignment [35] between corresponding parts due to
large variations in poses, viewpoints and scales. In particular, the use of multiple ID-prediction losses (an independent ID loss for each part) may fail to capture semantic part-level features, since a uniformly-divided stripe of a pedestrian image may simply contain semantically different parts. This may partially explain the limited performance advantage of various PCB-based algorithms compared to the state-of-the-art methods [34], [3].

Fig. 2. Single ID loss vs. multiple ID losses with part-level feature resolution. The input image goes through the backbone network to obtain a 3-D tensor, which is vertically split into n = 4 part-level features; each part-level tensor is then averaged into a vector. These n part-level vectors are used to drive n independent ID losses in PCB, or simply concatenated for driving a single ID loss in training. The use of multiple ID losses in training may lead to false prediction of the person ID with some part-level features.

D. Multi-Scale Feature Learning
Recently, multi-scale feature learning, together with the multi-stream building block design [36], [37], [38], has proved to be efficient for improving the performance of person Re-ID. By designing a residual block composed of multiple convolutional feature streams, each detecting features at a certain scale, the concept of omni-scale deep feature learning was introduced in [27] and a lightweight CNN architecture, termed OSNet, was constructed for learning omni-scale feature representations. Experiments demonstrated that OSNet performs very well for both classification and person Re-ID, despite its lightweight design.

III. PLR-OSNET

A. Part-Level Feature Resolution
For PCB with n-part resolution, n part-level feature vectors are produced by dividing the whole body into n horizontal stripes in the feature space. As shown in Figure 2, the input image goes forward through the stacked convolutional layers of the backbone network to form a 3-D tensor T. PCB employs a conventional average pooling layer to spatially down-sample T into n column vectors g_1, g_2, ..., g_n, followed by n classifiers in order to produce n ID-prediction losses. Note that each classifier is implemented by a fully-connected (FC) layer and a softmax function. Hence, when the batch of input labeled samples is {(x_i, y_i), i = 1, ..., N_s}, PCB employs the multiple ID-prediction loss

L_{m-id} = \sum_{p=1}^{n} L_{id}^{p},    (1)

L_{id}^{p} = -\frac{1}{N_s} \sum_{i=1}^{N_s} \log\left( \frac{\exp((W_p^{y_i})^T g_p^i + b_{y_i})}{\sum_j \exp((W_p^j)^T g_p^i + b_j)} \right),    (2)

where W_p^j and W_p^{y_i} are the j-th and y_i-th columns of the weight matrix W_p (the p-th classifier designated for g_p), respectively.

By forcing each part-level feature vector to meet an independent ID-prediction loss, one may obtain useful part-level features for discriminating different persons. However, many part-level feature vectors may simply fail to catch any discriminative information for different persons, as shown in Figure 2. Therefore, the use of PCB is practically limited for getting discriminative part-level information.

In order to learn discriminative features with part-level resolution, we propose to concatenate the n part-level feature vectors into a single column vector

g = [g_1^T, g_2^T, \cdots, g_n^T]^T,    (3)

which is further used to produce the ID-prediction loss

L_{s-id} = -\frac{1}{N_s} \sum_{i=1}^{N_s} \log\left( \frac{\exp((W^{y_i})^T g^i + b_{y_i})}{\sum_j \exp((W^j)^T g^i + b_j)} \right).    (4)

Here, W^j and W^{y_i} are the j-th and y_i-th columns of the weight matrix W (the single classifier for g), respectively. As the vector g contains the full information about the input image, the use of a single ID-prediction loss could drive g to learn sufficient discriminative information.

The proposed approach is somewhat similar to OSNet, where the tensor T is followed by a global average pooling (GAP) layer for getting a global descriptor

\bar{g} = \frac{1}{n} \sum_{p=1}^{n} g_p.    (5)
Fig. 3. The overall network architecture of PLR-OSNet. The shared OSNet conv1–conv3 layers (with SAM and CAM attention modules) are followed by two branches of OSNet conv4–conv5 layers: a global branch with global max pooling and a local branch with part-level average pooling, each trained with an ID loss, a triplet loss and a center loss. During testing, the feature embedding concatenated from both the global branch and the local branch is used for the final matching distance computation.
Instead of using the GAP in OSNet, the proposed part-level resolution approach uses average pooling within each part to retrieve part-level feature vectors, and the final descriptor g in (3) is rich in local information, which might simply be filtered out by the GAP (5) in OSNet.

B. Proposed Network Architecture
We employ a two-branch neural network architecture by modifying the recently-proposed OSNet baseline. Figure 3 shows the overall network architecture, which includes a backbone network, a global branch (orange colored arrows), and a local branch (blue colored arrows).
1) Attention Modules:
Compared to OSNet, attention modules are explicitly employed in Figure 3, where both a spatial attention module (SAM) and a channel attention module (CAM) are used in the shared-net.

For SAM, we employ the version of [6], which was designed to capture and aggregate semantically related pixels in the spatial domain. To further reduce the computational complexity, we use a 1 × 1 convolution that forms the function q(x) (or k(x)) to reduce the number of channels of the input x from c to c/r.

For CAM, the squeeze-and-excitation mechanism [39] is employed with slight modifications detailed in Figure 3. Compared to the channel attention module in [6], it does not require computing the channel affinity matrix and therefore can be implemented more efficiently.
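As a concrete reference for the channel attention just described, the following PyTorch sketch (our illustration; the reduction ratio and layer choices are assumptions, and the paper's exact CAM may differ in its modifications) implements a squeeze-and-excitation style re-weighting of channels, which needs no channel affinity matrix:

```python
# A minimal sketch of a squeeze-and-excitation style channel attention module (CAM),
# following the general mechanism of [39]; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)              # global spatial average pooling
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),   # 1x1 conv: c -> c/r
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),   # 1x1 conv: c/r -> c
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.excite(self.squeeze(x))   # per-channel weights in (0, 1)
        return x * w                       # re-weight the input feature map

# Usage: attn = ChannelAttention(256); y = attn(torch.randn(8, 256, 24, 8))
```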
2) Shared-Net:
The recently-proposed OSNet is employed as the backbone network for feature extraction. OSNet uses a lightweight network architecture for omni-scale feature learning, which is achieved by employing factorised convolutional layers, omni-scale residual blocks and a unified aggregation gate. The shared-net consists of the first 3 conv layers and 2 transition layers from OSNet. As shown in Figure 3, we insert SAM + CAM modules in both the conv-2 and conv-3 layers of the shared-net.
3) Global Branch with Global-Max-Pooling:
The global branch consists of the conv4 and conv5 layers and a global-max-pooling (GMP) layer producing a 512-dimensional vector, which provides a compact global feature representation for both the triplet loss and the ID-prediction loss. The use of GMP is mainly for achieving feature diversity between the global branch and the local branch, where average pooling, known to be popular in PCB [15], is adopted in the local branch.
4) Local Branch with Part-Level Feature Resolution:
The local branch has a similar layer structure but with average pooling (AP) in place of GMP. To achieve feature diversity, a uniform partition strategy is employed for part-level feature resolution, and the four 512-dimensional part features are then concatenated for producing just one ID-prediction loss. The use of a single ID-prediction loss is unique to this paper, whereas PCB and its variations employed multiple ID-prediction losses, with an independent ID-prediction loss for each part.
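The following PyTorch-style sketch (our illustration, not the released code; the identity count of 751 corresponds to the Market1501 training split, and the tensor shapes are assumptions) shows the essence of the local-branch head: split the final feature map into four horizontal stripes, average-pool each stripe, concatenate the four part vectors, and feed the result to a single classifier so that only one ID-prediction loss is produced:

```python
# Illustrative sketch of the local-branch head: 4-part average pooling,
# concatenation, and a single ID classifier (one loss, unlike PCB).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartLevelHead(nn.Module):
    def __init__(self, in_channels=512, num_parts=4, num_ids=751):
        super().__init__()
        self.num_parts = num_parts
        # One classifier over the concatenated descriptor -> a single ID loss,
        # in contrast to PCB's one classifier (and one loss) per part.
        self.classifier = nn.Linear(in_channels * num_parts, num_ids)

    def forward(self, feat_map, labels=None):
        # feat_map: (N, C, H, W) output of the local-branch conv5 layer.
        n, c, h, w = feat_map.shape
        # Split vertically into num_parts horizontal stripes and average-pool each.
        parts = feat_map.view(n, c, self.num_parts, h // self.num_parts, w)
        part_vectors = parts.mean(dim=(3, 4))     # (N, C, num_parts)
        g = part_vectors.flatten(1)               # concatenation: (N, C * num_parts)
        logits = self.classifier(g)
        loss = F.cross_entropy(logits, labels) if labels is not None else None
        return g, logits, loss

# Usage:
# head = PartLevelHead()
# g, logits, loss = head(torch.randn(8, 512, 16, 8), torch.randint(0, 751, (8,)))
```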
C. Loss Functions
The feature vectors from the global and local branches are concatenated as the final descriptor for the person Re-ID task. The loss function at either the global branch or the local branch is the sum of a single ID loss (softmax loss), a soft-margin triplet loss [12] and a center loss [40], namely,

L_{total} = L_{s-id} + \gamma_t L_{triplet} + \gamma_c L_{center},    (6)

where \gamma_t and \gamma_c are weighting factors.
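A minimal sketch of the per-branch objective in (6) is given below, assuming the batch-hard/soft-margin triplet function sketched earlier and a simple squared-distance form of the center loss; the weighting factors shown are placeholders, not values stated in the paper:

```python
# Sketch of the per-branch training objective in Eq. (6):
# ID loss + gamma_t * soft-margin triplet loss + gamma_c * center loss.
import torch.nn.functional as F

def branch_loss(logits, embeddings, labels, centers, gamma_t=1.0, gamma_c=5e-4):
    # centers: (num_ids, D) tensor of per-identity feature centers (learnable in [40]).
    id_loss = F.cross_entropy(logits, labels)
    trip_loss = batch_hard_triplet_loss(embeddings, labels, soft_margin=True)  # earlier sketch
    # Center loss: mean squared distance between each embedding and its class center.
    center_loss = ((embeddings - centers[labels]) ** 2).sum(dim=1).mean()
    return id_loss + gamma_t * trip_loss + gamma_c * center_loss
```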
IV. EXPERIMENTS

Extensive experiments have been performed to evaluate the effectiveness of the proposed approach on three public person Re-ID datasets: Market1501, DukeMTMC-reID and CUHK03. The results are compared to the state-of-the-art methods.
A. Datasets
The Market1501 dataset [41] has 1,501 identities collected by six cameras and a total of 32,668 pedestrian images. Following [41], the dataset is split into a training set with 12,936 images of 751 identities and a testing set of 3,368 query images and 15,913 gallery images of 750 identities.

The DukeMTMC-reID dataset [42] contains 1,404 identities captured by more than 2 cameras and a total of 36,411 images. The training subset contains 702 identities with 16,522 images and the testing subset has the other 702 identities.

The CUHK03 dataset [43] contains 14,096 labeled images and 14,097 detected images of a total of 1,467 identities captured by two camera views. With the same split as in [41], a non-overlapping set of 767 identities is used for training and 700 identities for testing. The labeled dataset contains 7,368 training images, 5,328 gallery images, and 1,400 query images for testing, while the detected dataset contains 7,365 training images, 5,332 gallery images, and 1,400 query images for testing.
B. Implementation Details
Our network is trained using a single Nvidia Tesla P100 GPU with a batch size of 64. Each identity has 4 instance images in a batch, so there are 16 identities per batch. The backbone OSNet is initialized from the ImageNet pre-trained model. The total number of epochs is set to 120 [150], namely, 120 for both Market-1501 and DukeMTMC-reID, and 150 for CUHK03, respectively. We use the Adam optimizer with the base learning rate initialized to 3.5e-5. With a linear warm-up strategy in the first 20 [40] epochs, the learning rate increases to 3.5e-4. Then, the learning rate is decayed to 3.5e-5 after 60 [100] epochs, and further decayed to 3.5e-6 after 90 [130] epochs.

For training, the input images are re-sized to 256 × 128 and then augmented by random horizontal flipping, random erasing, and normalization. The testing images are re-sized to 256 × 128 with normalization.
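For clarity, the warm-up and step-decay schedule above can be written as a small function; this is our paraphrase of the stated schedule (the linear ramp is assumed to start from the base value 3.5e-5), using the Market-1501/DukeMTMC epoch boundaries, with the CUHK03 values given in brackets in the text:

```python
# Sketch of the learning-rate schedule: linear warm-up followed by two step decays.
def learning_rate(epoch, base_lr=3.5e-4, warmup_epochs=20, decay1=60, decay2=90):
    if epoch < warmup_epochs:
        # linear warm-up from base_lr / 10 (3.5e-5) to base_lr (3.5e-4)
        return base_lr * (0.1 + 0.9 * epoch / warmup_epochs)
    if epoch < decay1:
        return base_lr            # 3.5e-4
    if epoch < decay2:
        return base_lr * 0.1      # 3.5e-5
    return base_lr * 0.01         # 3.5e-6
```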
C. Comparison with State-of-the-art Methods

We compare our work with state-of-the-art methods, with particular emphasis on the recent remarkable works (CVPR'19 and ICCV'19) on person Re-ID, over the popular benchmark datasets Market-1501, DukeMTMC-reID and CUHK03. All reported results are obtained without any re-ranking [44], [45] or multi-query fusion [41] techniques. The comparison results are listed in Table I, Table II and Table III. From these tables, one can observe that our proposed method performs competitively among various state-of-the-art methods, including PCB [15], IAN [5], CAMA [4], MHN [3], Pyramid [31], BagOfTricks [10], ABD-Net [6], BDB [2], SONA [34], Auto-ReID [30] and OSNet [27].
TABLE I
Comparison of our proposed method with state-of-the-art methods for the Market-1501 dataset.

Method                        mAP     rank-1
KPM [46] (CVPR'18)            75.3    90.1
MLFN [36] (CVPR'18)           74.4    90.0
CRF [47] (CVPR'18)            81.6    93.5
HA-CNN [28] (CVPR'18)         75.7    91.2
PCB [15] (ECCV'18)            81.6    93.8
Mancs [29] (ECCV'18)          82.3    93.1
SNL [48] (ACM'18)             73.43   88.27
HDLF [49] (ACM MM'18)         79.10   93.30
MGN [21] (ACM MM'18)          86.9    95.7
Local CNN [50] (ACM MM'18)    87.4    –
IAN [5] (CVPR'19)             83.1    94.4
CAMA [4] (CVPR'19)            84.5    94.7
MHN [3] (CVPR'19)             85.0    95.1
Pyramid [31] (CVPR'19)        88.2    95.7
BagOfTricks [10] (CVPRW'19)   85.9    94.5
ABD [6] (ICCV'19)             88.28   95.6
BDB [2] (ICCV'19)             86.7    95.3
SONA [34] (ICCV'19)           88.67   95.68
Auto-ReID [30] (ICCV'19)      85.1    94.5
OSNet [27] (ICCV'19)          84.9    94.8
PLR-OSNet (ours)              –       –
As shown, our PLR-OSNet has achieved the best mAP performance among various state-of-the-art methods for all three datasets. For DukeMTMC-reID, PLR-OSNet obtained 91.6% Rank-1 accuracy and 81.2% mAP, which significantly outperforms all existing methods. For CUHK03, PLR-OSNet even outperforms SONA [34], which might be the best-performing algorithm for CUHK03, in both mAP and Rank-1 accuracy.

Besides its strong competitiveness in both Rank-1 and mAP performance, PLR-OSNet has a lightweight network architecture inherited from OSNet. It has only 3.4M parameters while the recently-available Robust-ReID has 6.4M parameters.

D. Visualization
Visualization of Feature Diversity Between Two Branches: In Figure 4, we show the visualization of class activation maps (CAMs) for the global feature vector and the 4 local part-level feature vectors. Note that the local branch produces 4 part-level feature vectors, corresponding to part-1, part-2, part-3 and part-4. As shown, these part-level features have some degree of diversity compared to the global features. This means that the proposed PLR-OSNet architecture allows the model to learn diverse features, which is key to the high performance of person Re-ID.
Re-ID Visual Retrieving Results: We compare PLR-OSNet with OSNet more directly through visual retrieval results. Three retrieved examples are shown in Figure 5. One can see that OSNet fails to retrieve several correct images among the top-10 results. Taking the second query as an example, PLR-OSNet is able to find correct images of the same identity in the top-10 results whilst OSNet gets 5 incorrect ones.

TABLE II
Comparison of our proposed method with state-of-the-art methods for the DukeMTMC-reID dataset.

Method                        mAP     rank-1
MLFN [36] (CVPR'18)           62.8    81.2
GP-Re-ID [51] (CVPR'18)       72.8    85.2
HA-CNN [28] (CVPR'18)         63.8    80.5
PCB [15] (ECCV'18)            69.2    83.3
Mancs [29] (ECCV'18)          71.8    84.9
MGN [21] (ACM MM'18)          78.40   88.7
Local CNN [50] (ACM MM'18)    66.04   82.23
IAN [5] (CVPR'19)             73.4    87.1
CAMA [4] (CVPR'19)            72.9    85.8
MHN [3] (CVPR'19)             77.2    89.1
Pyramid [31] (CVPR'19)        79.0    89.0
BagOfTricks [10] (CVPRW'19)   76.4    86.4
ABD [6] (ICCV'19)             78.59   89.0
BDB [2] (ICCV'19)             76.0    89.0
SONA [34] (ICCV'19)           78.05   89.25
Auto-ReID [30] (ICCV'19)      75.1    88.5
OSNet [27] (ICCV'19)          73.5    88.6
PLR-OSNet (ours)              81.2    91.6

TABLE III
Comparison of our proposed method with state-of-the-art methods for the CUHK03 dataset.

Method                        Labeled             Detected
                              mAP     rank-1      mAP     rank-1
DaRe+RE [52] (CVPR'18)        61.6    66.1        59.0    63.3
MLFN [36] (CVPR'18)           49.2    54.7        47.8    52.8
HA-CNN [28] (CVPR'18)         41.0    44.4        38.6    41.7
PCB [15] (ECCV'18)            –       –           57.5    63.7
Mancs [29] (ECCV'18)          63.9    69.0        60.5    65.5
MGN [21] (ACM MM'18)          67.4    68.0        66.0    68.0
MHN [3] (CVPR'19)             72.4    77.2        65.4    71.7
Pyramid [31] (CVPR'19)        76.9    78.9        74.8    78.9
BDB [2] (ICCV'19)             76.7    79.4        73.5    76.4
SONA [34] (ICCV'19)           79.23   81.85       76.35   79.10
Auto-ReID [30] (ICCV'19)      73.0    77.9        69.3    73.3
OSNet [27] (ICCV'19)          –       –           67.8    72.3
PLR-OSNet (ours)              –       –           –       –
E. Ablation Studies

1) Benefit of Global Features:
PCB employed a uniform partition strategy for producing part-level features and did not consider the use of global features. The proposed PLR-OSNet, however, introduces a global branch, which uses global-max-pooling for extracting global features as shown in Figure 3. With the use of global features, PLR-OSNet performs significantly better, as depicted in Table IV, for all three datasets. For CUHK03-Labeled, PLR-OSNet achieves a Rank-1 accuracy of 84.6% with the global features, without which it only obtains 81.2% Rank-1 accuracy. This suggests that the global branch and the local branch reinforce each other, both contributing to the final performance.
2) Single ID Loss vs. Multiple ID Loss:
PLR-OSNet uses only a single ID loss for multiple part-level features, which is in sharp contrast to PCB and its variants, where each part-level feature vector is employed to drive an ID loss so that
the number of ID losses equals the number of separated parts. The use of an ID loss for each part-level feature can force it to learn the feature at each specified part from the ID-labeled dataset. The drawback, however, is that some part-level features may fail to produce any reliable ID prediction. By concatenating multiple part-level feature vectors into a single feature vector, a single ID prediction becomes much more reliable. With the use of multiple-feature concatenation followed by a single ID loss, PLR-OSNet performs significantly better, as shown in Table V, for all three datasets. For Market1501, PLR-OSNet obtains 88.9% mAP, which surpasses its counterpart (with multiple ID losses) by about 3.3%.

Fig. 4. Visualization of class activation maps (CAMs) for the global branch and the local branch (including 4 part-level feature vectors). The proposed architecture allows the model to learn diverse features (marked in orange).
3) Benefit of Attention Modules:
Attention modules have been widely employed in various state-of-the-art methods for person Re-ID. Therefore, we also insert these popular attention modules into the shared-net as shown in Figure 3. Experimental results are shown in Table VI for all three datasets. Clearly, the attention modules achieve consistently improved performance on all three datasets. However, the improvement is moderate, which may be due to the inherent attention mechanisms already present in OSNet.
4) Soft Margin Triplet Loss vs. Hard Margin Triplet Loss:
Table VII studies the impact of the soft-margin triplet loss on the performance of PLR-OSNet over CUHK03. Surprisingly, there is a large performance gap between the soft-margin triplet loss and the hard-margin triplet loss. We can see that the Rank-1 accuracy gap is around 6% while the mAP gap is about 4.6%.
TABLE IV
The use of global features on the final performance.

Global Features   Market1501       DukeMTMC         CUHK03-Labeled   CUHK03-Detected
                  mAP     rank-1   mAP     rank-1   mAP     rank-1   mAP     rank-1
No                86.9    94.6     79.8    90.2     77.5    81.2     73.4    77.6
Yes               –       –        –       –        –       84.6     –       –
TABLE V
Single ID loss vs. multiple ID loss.

Method             Market1501       DukeMTMC         CUHK03-Labeled   CUHK03-Detected
                   mAP     rank-1   mAP     rank-1   mAP     rank-1   mAP     rank-1
Multiple ID loss   85.6    94.4     77.0    89.4     79.4    83.1     74.7    78.4
Single ID loss     88.9    –        –       –        –       –        –       –
TABLE VI
The use of attention modules on the final performance.

Attention Modules   Market1501       DukeMTMC         CUHK03-Labeled   CUHK03-Detected
                    mAP     rank-1   mAP     rank-1   mAP     rank-1   mAP     rank-1
No                  88.4    95.0     81.0    90.8     79.5    82.4     76.8    79.4
Yes                 –       –        –       –        –       –        –       –
Fig. 5. Three Re-ID examples of PLR-OSNet and OSNet on DukeMTMC-reID. Left: query image. Upper-right: top-10 results of PLR-OSNet. Lower-right: top-10 results of OSNet. Images in red boxes are negative results. PLR-OSNet boosts the retrieval performance.
We also performed experiments over Market1501 and DukeMTMC. However, the experiments show that the use of the soft-margin triplet loss does not produce any observable improvement over the hard-margin counterpart. Therefore, it remains unknown why the soft-margin triplet loss can produce significantly better results than the hard-margin version for CUHK03.

TABLE VII
Soft-margin triplet loss vs. hard-margin triplet loss.

Triplet Loss   CUHK03-Labeled   CUHK03-Detected
               mAP     rank-1   mAP     rank-1
Hard Margin    75.9    78.6     72.8    76.0
Soft Margin    –       –        –       –

V. CONCLUSION
In this paper, we propose a new OSNet structure with part-level feature resolution for person Re-ID. With a two-branch
network architecture, the proposed PLR-OSNet concatenates uniformly-partitioned part-level feature vectors into a long vector for producing a single ID-prediction loss, which proves to be more efficient than the existing part-based methods. Extensive experiments show that PLR-OSNet achieves state-of-the-art performance on popular person Re-ID datasets, including Market1501, DukeMTMC-reID and CUHK03. In the meantime, its model size is significantly smaller than those of various state-of-the-art methods, thanks to the lightweight architecture of OSNet.

REFERENCES

[1] L. Zheng, Y. Yang, and A. G. Hauptmann, "Person re-identification: Past, present and future," 2016. [Online]. Available: https://arxiv.org/abs/1610.02984
[2] Z. Dai, M. Chen, X. Gu, S. Zhu, and P. Tan, "Batch dropblock network for person re-identification and beyond," in Proc. ICCV, 2019, pp. 3691–3701.
[3] B. Chen, W. Deng, and J. Hu, "Mixed high-order attention network for person re-identification," in Proc. ICCV, 2019, pp. 371–381.
[4] W. Yang, H. Huang, Z. Zhang, X. Chen, K. Huang, and S. Zhang, "Towards rich feature discovery with class activation maps augmentation for person re-identification," in Proc. CVPR, June 2019, pp. 1389–1398.
[5] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen, "Interaction-and-aggregation network for person re-identification," in Proc. CVPR, 2019, pp. 9317–9326.
[6] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang, "ABD-Net: Attentive but diverse person re-identification," in Proc. ICCV, 2019, pp. 8351–8361.
[7] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, "Pose-driven deep convolutional model for person re-identification," in Proc. ICCV, 2017, pp. 3960–3969.
[8] W. Chen, X. Chen, J. Zhang, and K. Huang, "Beyond triplet loss: A deep quadruplet network for person re-identification," in Proc. CVPR, July 2017, pp. 1320–1329.
[9] S. Bai, X. Bai, and Q. Tian, "Scalable person re-identification on supervised smoothed manifold," in Proc. CVPR, July 2017, pp. 3356–3365.
[10] H. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, "Bag of tricks for image classification with convolutional neural networks," in Proc. CVPR, 2019, pp. 558–567.
[11] Z. Zheng, L. Zheng, and Y. Yang, "A discriminatively learned CNN embedding for person reidentification," ACM Trans. Multimedia Comput., Commun., Appl., vol. 14, no. 1, p. 13, 2018.
[12] A. Hermans, L. Beyer, and B. Leibe, "In defense of the triplet loss for person re-identification," 2017. [Online]. Available: https://arxiv.org/abs/1703.07737
[13] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, "Deep attributes driven multi-camera person re-identification," in Proc. ECCV. Springer, 2016, pp. 475–491.
[14] H. Yao, S. Zhang, R. Hong, Y. Zhang, C. Xu, and Q. Tian, "Deep representation learning with part loss for person re-identification," IEEE Trans. Image Process., vol. 28, no. 6, pp. 2860–2871, June 2019.
[15] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, "Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)," in Proc. ECCV, 2018, pp. 480–496.
[16] L. Zhao, X. Li, Y. Zhuang, and J. Wang, "Deeply-learned part-aligned representations for person re-identification," in Proc. ICCV, Oct 2017, pp. 3239–3248.
[17] Z. Dai, M. Chen, S. Zhu, and P. Tan, "Batch feature erasing for person re-identification and beyond," 2018. [Online]. Available: https://arxiv.org/abs/1811.07130
[18] V. Kumar, A. Namboodiri, M. Paluri, and C. V. Jawahar, "Pose-aware person recognition," in Proc. CVPR, July 2017, pp. 6797–6806.
[19] L. Zheng, Y. Huang, H. Lu, and Y. Yang, "Pose-invariant embedding for deep person re-identification," IEEE Trans. Image Process., vol. 28, no. 9, pp. 4500–4509, Sep. 2019.
[20] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y.-G. Jiang, and X. Xue, "Pose-normalized image generation for person re-identification," in Proc. ECCV, 2018, pp. 650–667.
[21] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, "Learning discriminative features with multiple granularities for person re-identification," in Proc. ACM Multimedia, 2018, pp. 274–282.
[22] Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee, "Part-aligned bilinear representations for person re-identification," in Proc. ECCV, 2018, pp. 402–419.
[23] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, "Person re-identification by multi-channel parts-based CNN with improved triplet loss function," in Proc. CVPR, June 2016, pp. 1335–1344.
[24] X. Fan, H. Luo, X. Zhang, L. He, C. Zhang, and W. Jiang, "SCPNet: Spatial-channel parallelism network for joint holistic and partial person re-identification," in Proc. ACCV. Springer, 2018, pp. 19–34.
[25] T. DeVries and G. W. Taylor, "Improved regularization of convolutional neural networks with cutout," 2018. [Online]. Available: https://arxiv.org/abs/1708.04552
[26] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, "Random erasing data augmentation," 2017. [Online]. Available: https://arxiv.org/abs/1708.04896
[27] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, "Omni-scale feature learning for person re-identification," in Proc. ICCV, 2019, pp. 3702–3712.
[28] W. Li, X. Zhu, and S. Gong, "Harmonious attention network for person re-identification," in Proc. CVPR, June 2018, pp. 2285–2294.
[29] C. Wang, Q. Zhang, C. Huang, W. Liu, and X. Wang, "Mancs: A multi-task attentional network with curriculum sampling for person re-identification," in Proc. ECCV, 2018, pp. 365–381.
[30] R. Quan, X. Dong, Y. Wu, L. Zhu, and Y. Yang, "Auto-ReID: Searching for a part-aware convnet for person re-identification," in Proc. ICCV, 2019.
[31] F. Zheng, C. Deng, X. Sun, X. Jiang, X. Guo, Z. Yu, F. Huang, and R. Ji, "Pyramidal person re-identification via multi-loss dynamic training," in Proc. CVPR, June 2019.
[32] E. Ristani and C. Tomasi, "Features for multi-target multi-camera tracking and re-identification," in Proc. CVPR, June 2018, pp. 6036–6046.
[33] W. Chen, X. Chen, J. Zhang, and K. Huang, "A multi-task deep network for person re-identification," in Proc. AAAI, 2017.
[34] B. N. Xia, Y. Gong, Y. Zhang, and C. Poellabauer, "Second-order non-local attention networks for person re-identification," in Proc. ICCV, 2019, pp. 3760–3769.
[35] Z. Zhang and M. Huang, "Person re-identification based on heterogeneous part-based deep network in camera networks," IEEE Transactions on Emerging Topics in Computational Intelligence, Early Access, pp. 1–10, December 2018.
[36] X. Chang, T. M. Hospedales, and T. Xiang, "Multi-level factorisation net for person re-identification," in Proc. CVPR, June 2018, pp. 2109–2118.
[37] X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue, "Multi-scale deep learning architectures for person reidentification," in Proc. ICCV, 2017.
[38] Y. Chen, X. Zhu, S. Gong et al., "Person re-identification by deep learning multi-scale representations," in Proc. ICCV, 2018.
[39] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. CVPR, 2018, pp. 7132–7141.
[40] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in Proc. ECCV. Springer, 2016, pp. 499–515.
[41] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in Proc. ICCV, Dec 2015, pp. 1116–1124.
[42] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in Proc. ECCV. Springer, 2016, pp. 17–35.
[43] W. Li, R. Zhao, T. Xiao, and X. Wang, "DeepReID: Deep filter pairing neural network for person re-identification," in Proc. CVPR, June 2014, pp. 152–159.
[44] Z. Zhong, L. Zheng, D. Cao, and S. Li, "Re-ranking person re-identification with k-reciprocal encoding," in Proc. CVPR, July 2017, pp. 3652–3661.
[45] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, "A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking," in Proc. CVPR, June 2018, pp. 420–429.
[46] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, "End-to-end deep Kronecker-product matching for person re-identification," in Proc. CVPR, June 2018, pp. 6886–6895.
[47] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang, "Group consistent similarity learning via deep CRF for person re-identification," in Proc. CVPR, June 2018, pp. 8649–8658.
[48] K. Li, Z. Ding, K. Li, Y. Zhang, and Y. Fu, "Support neighbor loss for person re-identification," in Proc. ACM Multimedia, 2018, pp. 1492–1500.
[49] M. Zeng, C. Tian, and Z. Wu, "Person re-identification with hierarchical deep learning feature and efficient XQDA metric," in Proc. ACM Multimedia, 2018, pp. 1838–1846.
[50] J. Yang, X. Shen, X. Tian, H. Li, J. Huang, and X.-S. Hua, "Local convolutional neural networks for person re-identification," in Proc. ACM Multimedia, 2018, pp. 1074–1082.
[51] J. Almazan, B. Gajic, N. Murray, and D. Larlus, "Re-ID done right: Towards good practices for person re-identification," 2018. [Online]. Available: https://arxiv.org/abs/1610.02984
[52] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger, "Resource aware person re-identification across multiple resolutions," in