Perceive Where to Focus: Learning Visibility-aware Part-level Features for Partial Person Re-identification
Yifan Sun, Qin Xu, Yali Li, Chi Zhang, Yikang Li, Shengjin Wang∗, Jian Sun
Tsinghua University    Megvii Technology
{sunyf15, xuq16, liyk11}@mails.tsinghua.edu.cn  {liyali13, wgsgj}@tsinghua.edu.cn  {zhangchi, sunjian}@megvii.com
∗ Corresponding author.

Abstract
This paper considers a realistic problem in the person re-identification (re-ID) task, i.e., partial re-ID. Under the partial re-ID scenario, the images may contain only a partial observation of a pedestrian. If we directly compare a partial pedestrian image with a holistic one, the extreme spatial misalignment significantly compromises the discriminative ability of the learned representation. We propose a Visibility-aware Part Model (VPM), which learns to perceive the visibility of regions through self-supervision. The visibility awareness allows VPM to extract region-level features and compare two images with focus on their shared regions (which are visible in both images). VPM gains two-fold benefit toward higher accuracy for partial re-ID. On the one hand, compared with learning a global feature, VPM learns region-level features and benefits from fine-grained information. On the other hand, with visibility awareness, VPM is capable of estimating the shared regions between two images and thus suppresses the spatial misalignment. Experimental results confirm that our method significantly improves the learned representation and the achieved accuracy is on par with the state of the art.
1. Introduction
Person re-identification (re-ID) aims to spot the appearances of the same person in different observations by measuring the similarity between the query image and the gallery images (i.e., the database). Although the re-ID research community has achieved significant progress during the past few years, re-ID systems are still faced with a series of realistic difficulties. A prominent challenge is the partial re-ID problem [36, 7, 35], which requires accurate retrieval with only a partial observation of the pedestrian. More concretely, in realistic re-ID systems, a pedestrian may happen to be partially occluded or be walking out of the camera's field of view, so that the camera fails to capture the holistic pedestrian.

Figure 1. Two challenges related to partial re-ID and our solution with the proposed VPM. (a) Aggravation of spatial misalignment; (b) distracting noises from unshared regions (the blue region in the left image); (c) VPM locates visible regions on a given image and extracts region-level features. With visibility awareness, VPM compares two images by focusing on their shared regions.
Intuitively, partial re-ID increases the difficulty of correct retrieval. Analytically, we find that partial re-ID raises two more unique challenges, compared with holistic person re-ID, as illustrated in Fig. 1:

• First, partial re-ID aggravates the spatial misalignment between probe and gallery images. Under the holistic re-ID setting, the spatial misalignment mainly originates from the articulated movement of pedestrians and the viewpoint variation. Under the partial re-ID setting, even when two pedestrians with the same pose are captured from the same viewpoint, there still exists severe spatial misalignment between the two images (Fig. 1 (a)).

• Second, when we directly compare a partial pedestrian against a holistic one, the unshared body regions in the holistic pedestrian become distracting noises, rather than discriminative clues. We note that the same situation also happens when any two compared images contain different proportions of the holistic pedestrian (Fig. 1 (b)).

We propose the Visibility-aware Part Model (VPM) for partial re-ID. VPM avoids or alleviates the two unique difficulties related to partial re-ID by focusing on the shared regions of the compared images, as shown in Fig. 1 (c). More specifically, we first define a set of regions on the holistic person image. During training, given partial pedestrian images, VPM learns to locate all the pre-defined regions on convolutional feature maps. After locating each region, VPM perceives which regions are visible and learns region-level features. During testing, given two images to be compared, VPM first calculates the local distances between their shared regions and then concludes the overall distance.

VPM gains two-fold benefit toward higher accuracy for partial re-ID. On the one hand, compared with learning a global feature, VPM learns region-level features and thus benefits from fine-grained information, which is similar to the situation in holistic person re-ID [23, 12]. On the other hand, with visibility awareness, VPM is capable of estimating the shared regions between two images and thus suppresses the spatial misalignment as well as the noises originating from unshared regions. Experimental results confirm that VPM achieves significant improvement in partial re-ID accuracy, compared with a global feature learning baseline [34], as well as a strong part-based convolutional baseline [23]. The achieved performance is on par with the state of the art.

Moreover, VPM is featured for employing self-supervision to learn the region visibility awareness. We randomly crop partial pedestrian images from the holistic ones and automatically generate region labels, yielding the so-called self-supervision. Self-supervision enables VPM to learn to locate the pre-defined regions. It also helps VPM to focus on visible regions during feature learning, which is critical to the discriminative ability of the learned features, as will be assessed in Section 4.4.

The main contributions of this paper are summarized as follows:
• We propose a visibility-aware part model (VPM) for the partial re-ID task. VPM learns to locate the visible regions on pedestrian images through self-supervision. Given two images to be compared, VPM conducts a region-to-region comparison within their shared regions, and thus significantly suppresses the spatial misalignment as well as the distracting noises originating from unshared regions.

• We conduct extensive partial re-ID experiments on both synthetic and realistic datasets and validate the effectiveness of VPM. On two realistic datasets, Partial-iLIDS and Partial-REID, VPM achieves performance on par with the state of the art. So far as we know, few previous works on partial re-ID reported performance on synthetic large-scale datasets, e.g., Market-1501 or DukeMTMC-reID. We experimentally validate that VPM can be easily scaled up to large-scale (synthetic) partial re-ID datasets, due to its fast matching capacity.
2. Related Works
Deep learning methods currently dominate the re-ID research community with significant superiority on retrieval accuracy [34]. Recent works [23, 12, 26, 31, 22, 28, 16] further advance the state of the art on holistic person re-ID through learning part-level deep features. For example, Wei et al. [26], Kalayeh et al. [12] and Sun et al. [23] extract several region parts, with pose estimation [17, 27, 10, 18, 1], human parsing [2, 5] and uniform partitioning, respectively. Then they learn a respective feature for each part and assemble the part-level features to form the final descriptor. This progress motivates us to extend part-level feature learning to the specific problem of partial re-ID.

However, learning part-level features does not naturally improve partial re-ID. We find that PCB [23], which maintains the latest state of the art on holistic person re-ID, encounters a substantial performance decrease when applied in the partial re-ID scenario. The achieved retrieval accuracy even drops below the global feature learning baseline (as will be assessed in Sec. 4.2). Arguably, this is because part models rely on precisely locating each part and are inherently more sensitive to the severe spatial misalignment in partial re-ID.

Our method is similar to PCB in that both methods perform uniform division, instead of extracting semantic body parts, for part extraction. Moreover, similar to SPReID [12], our method also uses probability maps to extract each part during inference. However, while SPReID requires an extra human parser and a human parsing dataset (strong supervision) for learning part extraction, our method relies on self-supervision. During the matching stage, both PCB and SPReID adopt the common strategy of concatenating part features. In contrast, VPM first measures the region-to-region distances and then concludes the overall distance by dynamically crediting the local distances with high visibility confidence.
Self-supervised learning is a specific unsupervised learning approach. It exploits the visual information to automatically generate a surrogate supervision signal for feature learning [19, 25, 13, 3, 14]. Larsson et al. [13] train the deep model to predict per-pixel color histograms and consequently facilitate automatic colorization.
Figure 2. The structure of VPM. We first define p = m × n (e.g., 3 × 1 in the figure) densely aligned rectangular regions on the holistic pedestrian. VPM resizes a partial pedestrian image to a fixed size, inputs it into a stack of convolutional layers ("conv") and transforms it into a 3D tensor T. Upon T, VPM appends a region locator to discover each region through pixel-wise classification. By predicting, for every pixel g, a probability of belonging to each region, the region locator generates p probability maps to infer the location of each region. It also generates p visibility scores through a summation ("Σ") operation over each probability map. Given the predicted probability maps, the feature extractor extracts a respective feature for each pre-defined region through weighted pooling ("WP"). VPM, as a whole, outputs p region-level features and p visibility scores for inference.

Doersch et al. [3] and Noroozi et al. [19] propose to predict the relative position of image patches. Gidaris et al. train the deep model to recognize the rotation applied to original images.

Self-supervision is an elemental tool in our work. We employ self-supervision to learn visibility awareness. VPM is especially close to [3] and [19] in that all three methods employ the position information of patches for self-supervision. However, VPM significantly differs from them in the following aspects.

Self-supervision signal. [3] randomly samples a patch and one of its eight possible neighbors, and then trains the deep model to recognize the spatial configuration. Similarly, [19] encodes the neighborhood relationship into a jigsaw puzzle. Different from [3] and [19], VPM does not explore the spatial relationship between multiple images or patches. VPM pre-defines a division on the holistic pedestrian image and then assigns an independent label to each region. Then VPM learns to directly predict which regions are visible on a partial pedestrian image, without comparing it against the holistic one.
Usage of the self-supervision.
Both [3] and [19] transfer the model trained through self-supervision to the object detection or classification task. In comparison, VPM utilizes self-supervision in a more explicit manner: with the visibility awareness gained from self-supervision, VPM decides which regions to focus on when comparing two images.
3. Proposed Method
VPM is designed as a fully convolutional network, as illustrated in Fig. 2. It takes a pedestrian image as the input and outputs a constant number of region-level features, as well as a set of visibility scores indicating which regions are visible on the input image.

We first define $p = m \times n$ densely aligned rectangular regions on the holistic pedestrian image through uniform division. Given a partial pedestrian image, we resize it to a fixed size, i.e., $H \times W$, and input it into VPM. Through a stack of convolutional layers ("conv" in Fig. 2, which consists of all the convolutional layers in ResNet-50 [6]), VPM transforms the input image into a 3D tensor $T$. The size of $T$ is $c \times h \times w$ (the number of channels, height and width, respectively), and we view each $c$-dim vector $g$ as a pixel on $T$. On tensor $T$, VPM appends a region locator and a region feature extractor. The region locator discovers regions on tensor $T$; the region feature extractor then generates a respective feature for each region.

The region locator perceives which regions are visible and predicts their locations on tensor $T$. To this end, the region locator employs a $1 \times 1$ convolutional layer and a following Softmax function to classify each pixel $g$ on $T$ into the pre-defined regions, which is formulated by

$$P(R_i \mid g) = \mathrm{softmax}(W^T g)_i = \frac{\exp(W_i^T g)}{\sum_{j=1}^{p} \exp(W_j^T g)}, \qquad (1)$$

where $P(R_i \mid g)$ is the predicted probability of $g$ belonging to $R_i$, $W$ is the learnable weight matrix of the $1 \times 1$ convolutional layer, and $p$ is the total number of pre-defined regions. By sliding over every pixel $g$ on $T$, the region locator predicts the probability of $g$ belonging to each pre-defined region, and thus produces $p$ probability maps (one $h \times w$ map for each region), as shown in Fig. 2. Each probability map indicates the location of a corresponding region on $T$, which allows region feature extraction.

The region locator also predicts the visibility score $C_i$ for each region, by accumulating $P(R_i \mid g)$ over all the pixels $g$ on $T$:

$$C_i = \sum_{g \in T} P(R_i \mid g). \qquad (2)$$

Eq. 2 is natural in that if considerable pixels on $T$ belong to $R_i$ (with large probability), it indicates that $R_i$ is visible on the input image, and $R_i$ is assigned a relatively large $C_i$. In contrast, if a region is actually invisible, the region locator still returns a probability map, but with all values approximating 0. In this case, $C_i$ will be very small, indicating a possibly-invisible region. The visibility score is important for calculating the distance between two images, as detailed in Section 3.2.

The region feature extractor generates a respective feature $f_i$ for each region by weighted pooling:

$$f_i = \frac{\sum_{g \in T} P(R_i \mid g)\, g}{C_i}, \qquad \forall i \in \{1, 2, \cdots, p\}, \qquad (3)$$

where the division by $C_i$ maintains the norm invariance against the size of the region.

The region locator returns a probability map for each region, even if the region is actually invisible on the input image. Correspondingly, we can see from Eq. 3 that the region feature extractor always generates a constant number (i.e., $p$) of region features for any input image.
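To make the locator and extractor concrete, the following is a minimal PyTorch sketch of Eqs. 1-3. The module name, tensor shapes and default hyper-parameters (2048 backbone channels, p = 6) are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionLocatorExtractor(nn.Module):
    """Sketch of the region locator (Eqs. 1-2) and feature extractor (Eq. 3)."""

    def __init__(self, in_channels=2048, num_regions=6):
        super().__init__()
        # The 1x1 convolution W that classifies every pixel g into p regions.
        self.classifier = nn.Conv2d(in_channels, num_regions, kernel_size=1)

    def forward(self, T):
        # T: (batch, c, h, w), the backbone output tensor.
        prob = F.softmax(self.classifier(T), dim=1)      # P(R_i | g), Eq. 1
        # Visibility scores: accumulate P(R_i | g) over all pixels g (Eq. 2).
        C = prob.sum(dim=(2, 3))                         # (batch, p)
        # Weighted pooling (Eq. 3); dividing by C_i keeps the feature norm
        # invariant to the region size.
        f = torch.einsum('bphw,bchw->bpc', prob, T)      # (batch, p, c)
        f = f / C.clamp(min=1e-6).unsqueeze(-1)
        return f, C
```

During inference, the p region features and p visibility scores of two images feed directly into the visibility-weighted distance of Eq. 4 below.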
Given two images to be compared, i.e., $I^k$ and $I^l$, VPM extracts their region features and predicts the region visibility scores through Eq. 3 and Eq. 2, respectively. With the region features and region visibility scores $\{f_i^k, C_i^k\}$ and $\{f_i^l, C_i^l\}$, VPM first calculates the region-to-region Euclidean distances $D_i^{kl} = \|f_i^k - f_i^l\|$ $(i = 1, 2, \cdots, p)$. Then VPM concludes the overall distance from the local distances by

$$D^{kl} = \frac{\sum_{i=1}^{p} C_i^k C_i^l D_i^{kl}}{\sum_{i=1}^{p} C_i^k C_i^l}. \qquad (4)$$

In Eq. 4, the visible regions have relatively large visibility scores. The local distances between shared regions are highly credited by VPM and thus dominate the overall distance $D^{kl}$. In contrast, if a region is invisible in either of the compared images, its region feature is considered unreliable and the corresponding local distance contributes little to $D^{kl}$.

Employing VPM adds very light computational cost, compared with popular part-based deep learning methods [23, 31, 12]. While some prior partial re-ID methods require pairwise comparison before feature extraction and may have efficiency problems, VPM presents high scalability, which allows experiments on large re-ID datasets such as Market-1501 [33] and DukeMTMC-reID [37], as will be assessed in Section 4.2.

Training VPM consists of training the region locator and training the region feature extractor. The two components share the convolutional layers before tensor $T$ and are trained end to end in a multi-task manner. Training VPM is also featured for employing auxiliary self-supervision.

Self-supervision is critical to VPM. It supervises VPM to learn region visibility awareness, as well as to focus on visible regions during feature learning. Specifically, given a holistic pedestrian image, we randomly crop a patch and resize it to $H \times W$. The random crop excludes several pre-defined regions, and the remaining regions are reshaped during the resizing. Then, we project the regions on the input image onto tensor $T$ through RoI projection [11, 20]. To be concrete, let us assume a region with its up-left corner located at $(x_1, y_1)$ and its bottom-right corner located at $(x_2, y_2)$ on the input image. The RoI projection then defines a corresponding region on tensor $T$ with its up-left corner located at $([x_1/S], [y_1/S])$ and its bottom-right corner located at $([x_2/S], [y_2/S])$, in which $[\bullet]$ denotes rounding and $S$ is the down-sampling rate from the input image to $T$. Finally, we assign every pixel $g$ on $T$ a region label $L$ ($L \in \{1, 2, \cdots, p\}$) to indicate which region $g$ belongs to. We also record all the visible regions in a set $V$. As we will see, self-supervision contributes to training VPM in the following three aspects:

• First, self-supervision generates the ground truth of region labels for training the region locator.

• Second, self-supervision enables VPM to focus on visible regions when learning features through the classification loss (cross-entropy loss).

• Finally, self-supervision enables VPM to focus on the shared regions when learning features through the triplet loss.

Without the auxiliary self-supervision, VPM encounters a dramatic performance decrease, as will be assessed in Section 4.4.
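The two pieces of this subsection, the self-supervision signal and the visibility-weighted matching of Eq. 4, can be sketched as follows. The function names are our own, and the label generation maps each pixel of T back into the holistic image instead of projecting region corners, which is an equivalent simplification of the RoI projection described above.

```python
import torch

def region_labels(crop_box, holistic_hw, feat_hw, m=6, n=1):
    """Self-supervision signal for one random crop: a region label in
    {0, ..., m*n - 1} for every pixel g on T, plus the visible set V.

    crop_box    : (x1, y1, x2, y2) of the crop inside the holistic image
    holistic_hw : (H0, W0) size of the holistic pedestrian image
    feat_hw     : (h, w) spatial size of tensor T (input size / stride S)
    """
    x1, y1, x2, y2 = crop_box
    H0, W0 = holistic_hw
    h, w = feat_hw
    # Center coordinates of every pixel of T, mapped back through the
    # resize into the cropped window of the holistic image.
    ys = y1 + (torch.arange(h).float() + 0.5) / h * (y2 - y1)
    xs = x1 + (torch.arange(w).float() + 0.5) / w * (x2 - x1)
    rows = (ys / H0 * m).long().clamp(max=m - 1)          # (h,)
    cols = (xs / W0 * n).long().clamp(max=n - 1)          # (w,)
    labels = rows.unsqueeze(1) * n + cols.unsqueeze(0)    # (h, w)
    V = set(labels.unique().tolist())                     # visible regions
    return labels, V

def vpm_distance(f_k, C_k, f_l, C_l, eps=1e-6):
    """Overall distance of Eq. 4: region-to-region Euclidean distances,
    credited by the product of the two visibility scores, so that shared
    (visible-in-both) regions dominate. f_*: (p, c); C_*: (p,)."""
    local = (f_k - f_l).norm(dim=1)    # D_i^{kl}, one distance per region
    weight = C_k * C_l
    return (weight * local).sum() / (weight.sum() + eps)
```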
The region locator is trained through cross-entropy loss with the self-supervision signal $L$ as the ground truth, which is formulated by

$$\mathcal{L}_R = -\sum_{g \in T} \sum_{i=1}^{p} \mathbb{1}(i = L)\, \log P(R_i \mid g), \qquad (5)$$

where $\mathbb{1}(i = L)$ returns 1 only when $i$ equals the ground-truth region label $L$ of pixel $g$, and returns 0 in any other case.

Figure 3. VPM learns region-level features with auxiliary self-supervision. Only features corresponding to visible regions contribute to the cross-entropy loss. Only features corresponding to shared regions contribute to the deducing of the triplet loss.

The region feature extractor is trained with the combination of cross-entropy loss and triplet loss, as illustrated in Fig. 3. Recall that the region feature extractor always generates $p$ region features for any input image. This leads to a nontrivial problem during feature learning: only features of visible regions should be allowed to contribute to the training losses. With the self-supervision signal $V$, we dynamically select the visible regions for feature learning.

The cross-entropy loss is commonly used in learning pedestrian features under the IDE [32] mode. We append a respective identity classifier, i.e., $IP_i(f_i)$ $(i = 1, 2, \cdots, p)$, upon each region feature $f_i$ to predict the identity of training images. The identity classifier consists of two sequential fully-connected layers and a Softmax function. The first fully-connected layer reduces the dimension of the input region feature, and the second one transforms the feature dimension to $K$ ($K$ being the total number of identities in the training set). The cross-entropy loss is then formulated by

$$\mathcal{L}_{ID} = -\sum_{i \in V} \log\big(\mathrm{softmax}(IP_i(f_i))_y\big), \qquad (6)$$

where $\mathrm{softmax}(IP_i(f_i))_y$ is the predicted probability of the ground-truth identity label $y$. With Eq. 6, self-supervision enforces focus on visible regions when learning region features through the cross-entropy loss.

The triplet loss pushes the features of the same pedestrian close to each other and pulls the features of different pedestrians far away. Given a triplet of images, i.e., an anchor image $I^a$, a positive image $I^p$ and a negative image $I^n$, we define a region-selective triplet loss derived from the canonical one by

$$\mathcal{L}_{tri} = \left[ D^{ap} - D^{an} + \alpha \right]_+, \quad
D^{ap} = \frac{\sum_{i \in (V^a \cap V^p)} \|f_i^a - f_i^p\|}{|V^a \cap V^p|}, \quad
D^{an} = \frac{\sum_{i \in (V^a \cap V^n)} \|f_i^a - f_i^n\|}{|V^a \cap V^n|}, \qquad (7)$$

where $f_i^a$, $f_i^p$ and $f_i^n$ are the region features of the anchor, positive and negative image, respectively, and $V^a$, $V^p$ and $V^n$ are their visible region sets. $|\bullet|$ denotes counting the elements of a set, i.e., the number of shared regions between the two compared images. $\alpha$ is the margin of the triplet loss and is set to a fixed constant in our implementation. With Eq. 7, self-supervision enforces focus on the shared regions when calculating the distance between two images.

The overall training loss is the sum of the region prediction loss, the identity classification loss and the region-selective triplet loss:

$$\mathcal{L} = \mathcal{L}_R + \mathcal{L}_{ID} + \mathcal{L}_{tri}. \qquad (8)$$

We also note that Eq. 4 and Eq. 7 share a similar pattern. Training with the modified triplet loss (Eq. 7) mimics the matching strategy (Eq. 4) and is thus especially beneficial (as detailed in Table 3). The difference is that, during training, the focus is enforced through "hard" visibility labels, while during testing, the focus is regularized through predicted "soft" visibility scores.
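A short sketch of the three training losses follows. The function names, shape conventions and the margin value are our own assumptions (the paper fixes a constant margin without restating it here), and `F.cross_entropy` serves as a drop-in for the summed log-likelihoods of Eqs. 5 and 6.

```python
import torch
import torch.nn.functional as F

def locator_loss(logits, labels):
    """Eq. 5: per-pixel region classification on T.
    logits: (b, p, h, w) raw scores of the 1x1 conv; labels: (b, h, w)."""
    return F.cross_entropy(logits, labels)

def identity_loss(region_logits, y, V):
    """Eq. 6: identity cross-entropy, restricted to visible regions.
    region_logits: (p, K), one row of logits per region classifier IP_i;
    y: 0-dim long tensor, ground-truth identity; V: boolean (p,) mask."""
    logits = region_logits[V]                       # keep visible regions
    target = y.expand(logits.size(0))
    return F.cross_entropy(logits, target, reduction='sum')

def triplet_loss(f_a, f_p, f_n, V_a, V_p, V_n, margin=0.3):
    """Eq. 7: region-selective triplet loss; margin=0.3 is an illustrative
    guess. Only regions visible in both compared images contribute; this
    sketch assumes the anchor shares at least one region with each image.
    f_*: (p, c) region features; V_*: boolean (p,) visibility masks."""
    ap = V_a & V_p                                  # shared anchor-positive
    an = V_a & V_n                                  # shared anchor-negative
    d_ap = (f_a[ap] - f_p[ap]).norm(dim=1).mean()
    d_an = (f_a[an] - f_n[an]).norm(dim=1).mean()
    return F.relu(d_ap - d_an + margin)             # the [.]_+ hinge

# Eq. 8: the overall loss is the plain sum of the three terms, e.g.
# loss = locator_loss(...) + identity_loss(...) + triplet_loss(...)
```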
4. Experiment
Datasets. We use four datasets, i.e., Market-1501 [33], DukeMTMC-reID [21, 37], Partial-REID and Partial-iLIDS, to evaluate our method. Market-1501 and DukeMTMC-reID are two large-scale holistic re-ID datasets. The Market-1501 dataset contains 1,501 identities observed from 6 camera viewpoints, with 19,732 gallery images and 12,936 training images detected by DPM [4]. The DukeMTMC-reID dataset contains 1,404 identities, 16,522 training images, 2,228 queries, and 17,661 gallery images. We crop certain patches from the query images during the testing stage to imitate the partial re-ID scenario and obtain a comprehensive evaluation of our method on large-scale (synthetic) partial re-ID datasets. We note that few prior works on partial re-ID evaluated their methods on large-scale datasets, mainly because of low computational efficiency. Partial-REID [36] and Partial-iLIDS [35] are two commonly-used datasets for partial re-ID. Partial-REID contains 600 images of 60 identities, each of which has 5 holistic images and 5 partial images. Partial-iLIDS is derived from iLIDS [24], which was collected in an airport where the lower body of a pedestrian is frequently occluded by luggage. Partial-iLIDS crops the non-occluded region from these images and yields 238 images of 119 identities. Both Partial-REID and Partial-iLIDS offer only testing images. When evaluating our method on these two public datasets, we train VPM on the training set of Market-1501, for fair comparison with other competing methods, including MTRC [15], AMC+SWM [36], DSR [7], and SFR [8].

| Dataset | γ | baseline R-1 | R-5 | R-10 | mAP | PCB R-1 | R-5 | R-10 | mAP | VPM R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Market-1501 | 0.5 | 64.5 | 82.2 | 88.1 | 44.4 | 0.9 | 3.2 | 5.6 | 1.7 | 70.9 | 86.5 | 92.1 | 48.8 |
| | 0.6 | 79.0 | 91.4 | 94.3 | 57.9 | 8.1 | 16.5 | 23.2 | 6.6 | 84.4 | 94.3 | 96.1 | 62.5 |
| | 0.7 | 83.9 | 93.9 | 95.9 | 63.7 | 36.5 | 58.9 | 67.4 | 26.8 | 88.2 | 95.8 | 97.2 | 71.7 |
| | 0.8 | 85.7 | 94.3 | 96.4 | 66.1 | 71.9 | 87.3 | 91.4 | 56.8 | 90.1 | 95.8 | 97.7 | 74.7 |
| | 0.9 | 87.1 | 95.5 | 97.4 | 67.8 | 88.8 | 95.8 | 97.1 | 77.2 | 91.7 | 96.6 | 98.0 | 78.7 |
| | 1.0 | 86.8 | 95.3 | 97.4 | 67.7 | 93.4 | 97.8 | 98.4 | 83.0 | 93.0 | 97.8 | 98.8 | 80.8 |
| DukeMTMC-reID | 0.5 | 65.0 | 81.1 | 86.7 | 47.2 | 5.0 | 10.1 | 13.6 | 4.0 | 69.5 | 83.1 | 87.9 | 52.2 |
| | 0.6 | 76.2 | 87.3 | 90.4 | 55.4 | 13.1 | 25.6 | 33.5 | 10.5 | 78.2 | 89.0 | 91.3 | 60.9 |
| | 0.7 | 76.3 | 87.3 | 90.6 | 90.6 | 35.9 | 57.0 | 65.4 | 28.4 | 80.3 | 89.5 | 92.0 | 63.1 |
| | 0.8 | 76.3 | 88.3 | 91.9 | 58.8 | 64.0 | 82.6 | 87.7 | 52.3 | 80.3 | 89.3 | 92.4 | 63.5 |
| | 0.9 | 77.0 | 88.1 | 91.7 | 59.0 | 81.6 | 90.4 | 93.0 | 70.3 | 81.7 | 90.9 | 93.1 | 70.7 |
| | 1.0 | 76.2 | 87.3 | 91.2 | 58.6 | 84.1 | 92.4 | 94.5 | 73.2 | 83.6 | 91.7 | 94.2 | 72.6 |

Table 1. Comparison between VPM, the baseline and PCB. For VPM, we use p = 6 × 1 pre-defined regions. For PCB, we adopt the code released by the authors and append an extra triplet loss, for fair comparison with VPM. On Market-1501, the extra triplet loss enables PCB to gain +5.6% mAP over the original 77.4% reported by the authors [23].

Implementation Details. Training VPM relies on the assumption that the original training images all contain a holistic pedestrian tightly bounded by the bounding box. In the two holistic re-ID datasets, Market-1501 and DukeMTMC-reID, there do exist some images which contain either a partial pedestrian or an oversized bounding box. We consider these images as tolerable noise. To generate the partial images for training VPM, we crop a patch from the holistic image with a random area ratio γ, which we set to be uniformly distributed between 0.5 and 1.0. VPM is not necessarily bound to any specific crop strategy, and we may consider prior knowledge for optimization.
We argue that choosing the detailed crop strategy according to the realistic condition is reasonable, because partial re-ID is a realistic challenge and the occlusion pattern is usually predictable. We also experimentally validate that choosing an appropriate crop strategy to imitate the confronted partial re-ID condition benefits the retrieval accuracy, as detailed in Section 4.3. That being said, VPM is still general in that it may adopt any crop strategy to conduct self-supervision.

VPM is trained with the combination of cross-entropy loss and triplet loss. We pre-train the model for 50 epochs with the single cross-entropy loss, because it helps VPM converge faster and better. Then we append the triplet loss and fine-tune the model for another 80 epochs. In both the pre-training and the fine-tuning stages, we use the standard Stochastic Gradient Descent (SGD) optimization strategy, initialize the learning rate to 0.1 and decay it to 0.01 after 30 epochs. During the fine-tuning stage, we construct each mini-batch with 64 images from 8 identities (8 images per identity) and use the hard mining strategy [9] for deducing the triplet loss.

We evaluate the effectiveness of VPM with experiments on the synthetic partial datasets derived from the two large-scale re-ID datasets, Market-1501 and DukeMTMC-reID. We vary the ratio γ of the cropped patches from 0.5 to 1.0 during testing. For comparison with VPM, we implement a baseline which learns a global feature through the combination of cross-entropy loss and triplet loss. We also implement a part-based feature learning method, PCB [23]. For fair comparison, we enhance PCB with an extra triplet loss during training, and achieve slightly higher performance than [23]. The results are summarized in Table 1.

VPM significantly improves partial re-ID performance over the baseline.
On Market-1501, VPM surpasses the baseline by +6.4%, +5.4%, +4.3%, +4.4%, +4.6% and +6.2% rank-1 accuracy, and by +4.4%, +4.6%, +8.0%, +8.6%, +10.9% and +13.1% mAP, when γ is set from 0.5 to 1.0, respectively. The superiority of VPM over the baseline, which learns a global feature representation, derives from a two-fold benefit. On the one hand, VPM learns region-level features and benefits from fine-grained information. On the other hand, with visibility awareness, VPM conducts a region-level alignment and eliminates the distracting noises originating from unshared regions.

Figure 4. Impact of p on the partial re-ID accuracy. We set p to 2, 3, 4, 6 and 8, respectively. We use Market-1501 for training and vary the crop ratio γ during testing.

VPM increases the robustness of part features under the partial re-ID scenario. Comparing VPM with PCB, a state-of-the-art part feature learning method for the holistic person re-ID task, we observe that as γ decreases, the retrieval accuracy achieved by PCB drops dramatically (e.g., 0.9% rank-1 accuracy at γ = 0.5), implying that PCB is extremely vulnerable to the spatial misalignment in partial re-ID. By contrast, the retrieval accuracy achieved by VPM decreases much more slowly as γ decreases. We infer that VPM facilitates region-to-region comparison within the shared regions of two images and thus gains strong robustness. We also notice that under γ = 1.0, i.e., the holistic person re-ID scenario, VPM achieves retrieval accuracy comparable with PCB.

In Table 1, we use 6 pre-defined parts to construct VPM. Moreover, we analyze the impact of the part number p on Market-1501, with results shown in Fig. 4. Under all settings of p and γ, VPM consistently surpasses the baseline, which further confirms the superiority of VPM. We also observe that a larger p generally brings higher (rank-1) retrieval accuracy. A larger p allows VPM to learn the region-level features at a finer granularity and thus benefits the discriminative ability, which is consistent with the observation in holistic person re-ID works [31, 23]. A larger p also allows more accurate region alignment when comparing a partial person image against a holistic one. We suggest choosing p with joint consideration of retrieval accuracy and computational efficiency, and set p = 6 in most of our experiments (if not specially mentioned).

We compare VPM with the state-of-the-art methods on two public datasets, i.e., Partial-REID and Partial-iLIDS. We train three different versions of VPM with different crop strategies for preparing training patches, i.e., top crop (the top regions are always visible), bottom crop (the bottom regions are always visible) and bilateral crop (top crop + bottom crop). The results are presented in Table 2, from which two observations are drawn.

| Methods | Partial-REID R-1 | R-3 | Partial-iLIDS R-1 | R-3 |
|---|---|---|---|---|
| MTRC [15] | 23.7 | 27.3 | 17.7 | 26.1 |
| AMC+SWM [36] | 37.3 | 46.0 | 21.0 | 32.8 |
| DSR [7] | 50.7 | 70.0 | 58.8 | 67.2 |
| SFR [8] | 56.9 | 78.5 | 63.9 | 74.8 |
| VPM (Bottom) | 53.2 | 73.2 | 53.6 | 62.3 |
| VPM (Top) | 64.3 | – | 67.2 | – |
| VPM (Bilateral) | 67.5 | – | – | – |

Table 2. Evaluation of VPM on Partial-REID and Partial-iLIDS. Three VPMs trained with different crop strategies are evaluated.

First, comparing the three versions of VPM against each other, we find that the crop strategy matters to VPM. On Partial-iLIDS, all query images of which are cropped from the top side of holistic pedestrian images, VPM (Top) achieves the highest retrieval accuracy. On Partial-REID, which contains images cropped from various directions, VPM (Bilateral) achieves the highest retrieval accuracy. VPM (Bottom) always performs the worst, for two reasons. First, retaining only the bottom regions severely deviates from the testing condition. Second, the bottom regions (mainly containing legs) inherently offer relatively weak discriminative clues. We note that when solving the partial re-ID problem, the realistic partial condition is usually estimable. We recommend analyzing the partial condition and choosing a similar crop strategy for training VPM. That being said, VPM is general in that it is able to cooperate with various crop strategies.

Second, given appropriate crop strategies, VPM achieves very competitive performance compared with the state of the art. On Partial-REID, VPM (Bilateral) surpasses the strongest competing method, SFR, by +10.6% rank-1 accuracy. On Partial-iLIDS, VPM (Top) surpasses SFR by +3.3% rank-1 accuracy. Even with no prior knowledge of the partial condition on the testing set, we may eclectically choose VPM (Bilateral), which considers both top and bottom occlusions and thus maintains stronger robustness.
We conduct an ablation study to analyze the impact of self-supervision on VPM. We train four malfunctioned VPMs for comparison:

• MVPM-1 is trained as a normal VPM, but abandons the visibility awareness during testing, i.e., MVPM-1 concludes the overall distance with all region-level features, even if some regions are invisible.
• MVPM-2 abandons self-supervision on the triplet loss during training, i.e., all region features contribute equally to deducing the triplet loss L_tri.

• MVPM-3 abandons self-supervision on the identification loss L_ID during training, i.e., all region features are supervised by the training identity label through L_ID.

• MVPM-4 abandons self-supervision on both the triplet loss and the identification loss.

| Methods | Partial-iLIDS R-1 | R-3 | R-5 | Market-1501 R-1 | R-5 | mAP |
|---|---|---|---|---|---|---|
| VPM | 67.2 | 76.5 | 82.4 | 93.0 | 97.8 | 80.8 |
| VPM (no triplet) | 57.1 | 73.9 | 79.0 | 91.3 | 97.0 | 77.8 |
| MVPM-1 | 63.0 | 74.8 | 82.4 | 93.0 | 96.3 | 79.7 |
| MVPM-2 | 61.3 | 73.1 | 79.0 | 92.8 | 97.4 | 80.1 |
| MVPM-3 | 58.8 | 74.8 | 82.4 | 91.4 | 96.5 | 75.5 |
| MVPM-4 | 59.7 | 74.8 | 78.2 | 90.4 | 96.6 | 75.7 |

Table 3. Ablation study on VPM. "VPM (no triplet)" is trained with no triplet loss. On Market-1501, we only analyze the holistic person re-ID mode.

Moreover, we also analyze the impact of the two types of losses (cross-entropy loss and triplet loss) in training VPM. The results are summarized in Table 3, from which we draw three observations.

First, appending an extra triplet loss enhances the discriminative ability of the learned features, under both the partial re-ID scenario (Partial-iLIDS) and the holistic person re-ID scenario (Market-1501). This observation is consistent with prior works [9, 38, 30, 29] on the holistic person re-ID task.

Second, comparing "MVPM-1" with "VPM", we observe a dramatic performance decrease on Partial-iLIDS. Both models are trained in exactly the same procedure. The difference is that "MVPM-1" employs all the region features to conclude the overall distance, while VPM focuses on the shared regions between two images. On Market-1501, all the regions are visible and the two models achieve very close retrieval accuracy. We thus infer that the visibility awareness is critical for VPM under the partial re-ID scenario.

Third, comparing the last three editions of MVPM with "VPM" as well as "MVPM-1", we observe further performance decreases on Partial-iLIDS. The last three editions abandon the self-supervision that regularizes the learning of region-level features (on the cross-entropy loss, the triplet loss, or both). Learning features from invisible regions brings about larger sample noises. Consequently, the learned region features are significantly compromised. We thus conclude that enforcing VPM to focus on visible regions through self-supervision is critical for learning region features.
Figure 5. Region visualization. We train VPM with 3 × 2 pre-defined regions. For each image, VPM discovers 6 regions with 6 probability maps, as detailed in Section 3.1. For better visualization, we assign each pixel to its closest region to achieve the partitioning effect. Images on the first and the second row are from (synthetic) Market-1501 and Partial-REID, respectively.

We visualize the regions discovered by VPM (the region locator, in particular) in Fig. 5. We use p = 3 × 2 pre-defined regions to facilitate both horizontal and vertical visibility awareness. We observe that VPM conducts adaptive partitioning with visibility awareness. Given holistic images (the first column), VPM successfully discovers all the 3 × 2 regions. Given partial pedestrian images with horizontal occlusion (the second column), VPM favors the dominating regions (the left regions in Fig. 5). Given partial pedestrian images with the lower body occluded (the last two columns), VPM roughly discovers the 4 visible regions and perceives that the bottom 2 regions are invisible. These observations confirm that VPM gains robust region-level visibility awareness and is capable of locating the visible regions through self-supervised learning.
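For reference, the hard partitioning used in Fig. 5 amounts to an argmax over the p probability maps. A minimal sketch follows, where `prob` is the (p, h, w) softmax output of the locator sketch in Section 3.1 for one image, and the visibility threshold is an illustrative guess rather than a value from the paper.

```python
# Assign every pixel of T to its most probable region, as in Fig. 5.
partition = prob.argmax(dim=0)                    # (h, w) region index map
# Treat a region as visible if enough probability mass accumulates on its
# map (cf. the visibility score C_i of Eq. 2); 0.5 is an illustrative guess.
visible = (prob.sum(dim=(1, 2)) > 0.5).nonzero().flatten().tolist()
```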
5. Conclusion
In this paper, we propose a region-based feature learning method, VPM, for the partial re-ID task. Given a set of pre-defined regions on the holistic pedestrian image, VPM learns to perceive which regions are visible on a partial image through self-supervision. VPM locates each region on the convolutional feature maps and then extracts region-level features. With visibility awareness, VPM compares two pedestrian images with focus on their shared regions and correspondingly suppresses the severe spatial misalignment in partial re-ID. Experimental results confirm that VPM surpasses both the global feature learning baseline and part-based convolutional methods, and the achieved performance is on par with the state of the art.

References

[1] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[2] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834-848, 2018.
[3] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422-1430, 2015.
[4] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[5] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, 2017.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] L. He, J. Liang, H. Li, and Z. Sun. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. CoRR, abs/1801.00881, 2018.
[8] L. He, Z. Sun, Y. Zhu, and Y. Wang. Recognizing partial biometric patterns. CoRR, abs/1810.07399, 2018.
[9] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[10] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[12] M. M. Kalayeh, E. Basaran, M. Gokmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
[13] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, pages 577-593, 2016.
[14] Q. V. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[15] S. Liao, A. K. Jain, and S. Z. Li. Partial face recognition: Alignment-free approach. IEEE Trans. Pattern Anal. Mach. Intell., 35(5):1193-1205, 2013.
[16] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang. HydraPlus-Net: Attentive deep features for pedestrian analysis. In ICCV, 2017.
[17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[18] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[19] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69-84, 2016.
[20] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., pages 1137-1149, 2017.
[21] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV Workshop on Benchmarking Multi-Target Tracking, 2016.
[22] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose-driven deep convolutional model for person re-identification. In ICCV, 2017.
[23] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling. In ECCV, 2018.
[24] T. Wang, S. Gong, X. Zhu, and S. Wang. Person re-identification by video ranking. In ECCV, 2014.
[25] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. In ICCV, 2017.
[26] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian. GLAD: Global-local-alignment descriptor for pedestrian retrieval. In ACM Multimedia, 2017.
[27] S. E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[28] H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian. Deep representation learning with part loss for person re-identification. arXiv preprint arXiv:1707.00798, 2017.
[29] R. Yu, Z. Dou, S. Bai, Z. Zhang, Y. Xu, and X. Bai. Hard-aware point-to-set deep metric for person re-identification. In ECCV, 2018.
[30] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun. AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
[31] L. Zhao, X. Li, J. Wang, and Y. Zhuang. Deeply-learned part-aligned representations for person re-identification. In ICCV, 2017.
[32] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, 2016.
[33] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
[34] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
[35] W. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In CVPR, 2011.
[36] W. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong. Partial person re-identification. In ICCV, 2015.
[37] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In ICCV, 2017.
[38] Z. Zhong, L. Zheng, S. Li, and Y. Yang. Generalizing a person retrieval model hetero- and homogeneously. In CVPR, 2018.