Perceive Where to Focus: Learning Visibility-aware Part-level Features for Partial Person Re-identification
Yifan Sun, Qin Xu, Yali Li, Chi Zhang, Yikang Li, Shengjin Wang∗, Jian Sun
Tsinghua University    Megvii Technology
{sunyf15, xuq16, liyk11}@mails.tsinghua.edu.cn  {liyali13, wgsgj}@tsinghua.edu.cn  {zhangchi, sunjian}@megvii.com
∗ Corresponding author.

Abstract
This paper considers a realistic problem in the person re-identification (re-ID) task, i.e., partial re-ID. Under the partial re-ID scenario, the images may contain only a partial observation of a pedestrian. If we directly compare a partial pedestrian image with a holistic one, the extreme spatial misalignment significantly compromises the discriminative ability of the learned representation. We propose a Visibility-aware Part Model (VPM), which learns to perceive the visibility of regions through self-supervision. The visibility awareness allows VPM to extract region-level features and compare two images with focus on their shared regions (which are visible in both images). VPM gains two-fold benefit toward higher accuracy for partial re-ID. On the one hand, compared with learning a global feature, VPM learns region-level features and benefits from fine-grained information. On the other hand, with visibility awareness, VPM is capable of estimating the shared regions between two images and thus suppresses the spatial misalignment. Experimental results confirm that our method significantly improves the learned representation and the achieved accuracy is on par with the state of the art.
1. Introduction
Person re-identification (re-ID) aims to spot the appearances of the same person in different observations by measuring the similarity between the query image and the gallery images (i.e., the database). Although the re-ID research community has achieved significant progress during the past few years, re-ID systems are still faced with a series of realistic difficulties. A prominent challenge is the partial re-ID problem [36, 7, 35], which requires accurate retrieval with only a partial observation of the pedestrian. More concretely, in realistic re-ID systems, a pedestrian may happen to be partially occluded or be walking out of the camera's field of view, so that the camera fails to capture the holistic pedestrian.

Figure 1. Two challenges related to partial re-ID and our solution with the proposed VPM. (a) Aggravation of spatial misalignment; (b) distracting noises from unshared regions (the blue region in the left image); (c) VPM locates visible regions on a given image and extracts region-level features. With visibility awareness, VPM compares two images by focusing on their shared regions.
Intuitively, partial re-ID increases the difficulty of correct retrieval. Analytically, we find that partial re-ID raises two more unique challenges, compared with holistic person re-ID, as illustrated in Fig. 1:

• First, partial re-ID aggravates the spatial misalignment between probe and gallery images. Under the holistic re-ID setting, the spatial misalignment mainly originates from the articulated movement of pedestrians and the viewpoint variation. Under the partial re-ID setting, even when two pedestrians with the same pose are captured from the same viewpoint, there still exists severe spatial misalignment between the two images (Fig. 1 (a)).

• Second, when we directly compare a partial pedestrian against a holistic one, the unshared body regions in the holistic pedestrian become distracting noises, rather than discriminative clues. We note that the same situation also happens when any two compared images contain different proportions of the holistic pedestrian (Fig. 1 (b)).

We propose the Visibility-aware Part Model (VPM) for partial re-ID. VPM avoids or alleviates the two unique difficulties related to partial re-ID by focusing on the shared regions of the compared images, as shown in Fig. 1 (c). More specifically, we first define a set of regions on the holistic person image. During training, given partial pedestrian images, VPM learns to locate all the pre-defined regions on convolutional feature maps. After locating each region, VPM perceives which regions are visible and learns region-level features. During testing, given two images to be compared, VPM first calculates the local distances between their shared regions and then concludes the overall distance.

VPM gains two-fold benefit toward higher accuracy for partial re-ID. On the one hand, compared with learning a global feature, VPM learns region-level features and thus benefits from fine-grained information, which is similar to the situation in holistic person re-ID [23, 12]. On the other hand, with visibility awareness, VPM is capable of estimating the shared regions between two images and thus suppresses the spatial misalignment as well as the noises originating from unshared regions. Experimental results confirm that VPM achieves significant improvement in partial re-ID accuracy, compared with a global feature learning baseline [34], as well as a strong part-based convolutional baseline [23]. The achieved performance is on par with the state of the art.

Moreover, VPM is featured for employing self-supervision to learn the region visibility awareness. We randomly crop partial pedestrian images from the holistic ones and automatically generate region labels, yielding the so-called self-supervision. Self-supervision enables VPM to learn to locate the pre-defined regions. It also helps VPM to focus on visible regions during feature learning, which is critical to the discriminative ability of the learned features, as will be assessed in Section 4.4.

The main contributions of this paper are summarized as follows:
• We propose a visibility-aware part model (VPM) for the partial re-ID task. VPM learns to locate the visible regions on pedestrian images through self-supervision. Given two images to be compared, VPM conducts a region-to-region comparison within their shared regions, and thus significantly suppresses the spatial misalignment as well as the distracting noises originating from unshared regions.

• We conduct extensive partial re-ID experiments on both synthetic and realistic datasets and validate the effectiveness of VPM. On two realistic datasets, Partial-iLIDS and Partial-REID, VPM achieves performance on par with the state of the art. So far as we know, few previous works on partial re-ID reported performance on synthetic large-scale datasets, e.g., Market-1501 or DukeMTMC-reID. We experimentally validate that VPM can be easily scaled up to large-scale (synthetic) partial re-ID datasets, due to its fast matching capacity.
2. Related Works
Deep learning methods currently dominate the re-ID research community with significant superiority on retrieval accuracy [34]. Recent works [23, 12, 26, 31, 22, 28, 16] further advance the state of the art on holistic person re-ID through learning part-level deep features. For example, Wei et al. [26], Kalayeh et al. [12] and Sun et al. [23] extract several region parts, with pose estimation [17, 27, 10, 18, 1], human parsing [2, 5] and uniform partitioning, respectively. Then they learn a respective feature for each part and assemble the part-level features to form the final descriptor. This progress motivates us to extend part-level feature learning to the specific problem of partial re-ID.

However, learning part-level features does not naturally improve partial re-ID. We find that PCB [23], which maintains the latest state of the art on holistic person re-ID, encounters a substantial performance decrease when applied in the partial re-ID scenario. The achieved retrieval accuracy even drops below the global feature learning baseline (as will be assessed in Sec. 4.2). Arguably, this is because part models rely on precisely locating each part and are inherently more sensitive to the severe spatial misalignment in partial re-ID.

Our method is similar to PCB in that both methods perform uniform division, instead of extracting semantic body parts, for part extraction. Moreover, similar to SPReID [12], our method also uses probability maps to extract each part during inference. However, while SPReID requires an extra human parser and a human parsing dataset (strong supervision) for learning part extraction, our method relies on self-supervision. During the matching stage, both PCB and SPReID adopt the common strategy of concatenating part features. In contrast, VPM first measures the region-to-region distances and then concludes the overall distance by dynamically crediting the local distances with high visibility confidence.
Self-supervised learning is a specific unsupervised learning approach. It exploits the visual information to automatically generate a surrogate supervision signal for feature learning [19, 25, 13, 3, 14]. Larsson et al. [13] train the deep model to predict per-pixel color histograms and consequently facilitate automatic colorization.
Figure 2. The structure of VPM. We first define p = m × n (e.g., 3 × 1 in the figure) densely aligned rectangular regions on the holistic pedestrian. VPM resizes a partial pedestrian image to a fixed size, inputs it into a stack of convolutional layers ("conv") and transforms it into a 3D tensor T. Upon T, VPM appends a region locator to discover each region through pixel-wise classification. By predicting, for every pixel g, a probability of belonging to each region, the region locator generates p probability maps to infer the location of each region. It also generates p visibility scores through a summation ("Σ") operation over each probability map. Given the predicted probability maps, the feature extractor extracts a respective feature for each pre-defined region through weighted pooling ("WP"). VPM, as a whole, outputs p region-level features and p visibility scores for inference.

Doersch et al. [3] and Noroozi et al. [19] propose to predict the relative position of image patches. Gidaris et al. train the deep model to recognize the rotation applied to original images.

Self-supervision is an elemental tool in our work. We employ self-supervision to learn visibility awareness. VPM is especially close to [3] and [19] in that all three methods employ the position information of patches for self-supervision. However, VPM significantly differs from them in the following aspects.

Self-supervision signal. [3] randomly samples a patch and one of its eight possible neighbors, and then trains the deep model to recognize the spatial configuration. Similarly, [19] encodes the neighborhood relationship into a jigsaw puzzle. Different from [3] and [19], VPM does not explore the spatial relationship between multiple images or patches. VPM pre-defines a division on the holistic pedestrian image and then assigns an independent label to each region. Then VPM learns to directly predict which regions are visible on a partial pedestrian image, without comparing it against the holistic one.
Usage of the self-supervision.
Both [3] and [19] transfer the model trained through self-supervision to the object detection or classification task. In comparison, VPM utilizes self-supervision in a more explicit manner: with the visibility awareness gained from self-supervision, VPM decides which regions to focus on when comparing two images.
3. Proposed Method
VPM is designed as a fully convolutional network, as illustrated in Fig. 2. It takes a pedestrian image as the input and outputs a constant number of region-level features, as well as a set of visibility scores indicating which regions are visible on the input image.

We first define $p = m \times n$ densely aligned rectangular regions on the holistic pedestrian image through uniform division. Given a partial pedestrian image, we resize it to a fixed size, i.e., $H \times W$, and input it into VPM. Through a stack of convolutional layers ("conv" in Fig. 2, which consists of all the convolutional layers in ResNet-50 [6]), VPM transforms the input image into a 3D tensor $T$. The size of $T$ is $c \times h \times w$ (the number of channels, height and width, respectively), and we view each $c$-dim vector $g$ as a pixel on $T$. On tensor $T$, VPM appends a region locator and a region feature extractor. The region locator discovers regions on tensor $T$; the region feature extractor then generates a respective feature for each region.

The region locator perceives which regions are visible and predicts their locations on tensor $T$. To this end, the region locator employs a $1 \times 1$ convolutional layer and a following Softmax function to classify each pixel $g$ on $T$ into the pre-defined regions, which is formulated by

$$P(R_i \mid g) = \mathrm{softmax}(W^T g)_i = \frac{\exp(W_i^T g)}{\sum_{j=1}^{p} \exp(W_j^T g)}, \qquad (1)$$

where $P(R_i \mid g)$ is the predicted probability of $g$ belonging to $R_i$, $W$ is the learnable weight matrix of the $1 \times 1$ convolutional layer, and $p$ is the total number of pre-defined regions. By sliding over every pixel $g$ on $T$, the region locator predicts the probability of $g$ belonging to each pre-defined region, and thus produces $p$ probability maps (one $h \times w$ map for each region), as shown in Fig. 2. Each probability map indicates the location of a corresponding region on $T$, which allows region feature extraction.

The region locator also predicts the visibility score $C_i$ for each region, by accumulating $P(R_i \mid g)$ over all the pixels $g$ on $T$:

$$C_i = \sum_{g \in T} P(R_i \mid g). \qquad (2)$$

Eq. 2 is natural in that if considerable pixels on $T$ belong to $R_i$ (with large probability), it indicates that $R_i$ is visible on the input image, and $R_i$ is assigned a relatively large $C_i$. In contrast, if a region is actually invisible, the region locator still returns a probability map, but with all values approximating 0. In this case, $C_i$ will be very small, indicating a possibly-invisible region. The visibility score is important for calculating the distance between two images, as detailed in Section 3.2.

The region feature extractor generates a respective feature $f_i$ for each region by weighted pooling:

$$f_i = \frac{\sum_{g \in T} P(R_i \mid g)\, g}{C_i}, \qquad \forall i \in \{1, 2, \cdots, p\}, \qquad (3)$$

where the division by $C_i$ maintains the norm invariance against the size of the region.

The region locator returns a probability map for each region, even if the region is actually invisible on the input image. Correspondingly, we can see from Eq. 3 that the region feature extractor always generates a constant number (i.e., $p$) of region features for any input image.
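To make the locator and extractor concrete, the following is a minimal PyTorch sketch of Eqs. 1-3. The module name, tensor shapes and default hyper-parameters (2048 backbone channels, p = 6) are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionLocatorExtractor(nn.Module):
    """Sketch of the region locator (Eqs. 1-2) and feature extractor (Eq. 3)."""

    def __init__(self, in_channels=2048, num_regions=6):
        super().__init__()
        # The 1x1 convolution W that classifies every pixel g into p regions.
        self.classifier = nn.Conv2d(in_channels, num_regions, kernel_size=1)

    def forward(self, T):
        # T: (batch, c, h, w), the backbone output tensor.
        prob = F.softmax(self.classifier(T), dim=1)      # P(R_i | g), Eq. 1
        # Visibility scores: accumulate P(R_i | g) over all pixels g (Eq. 2).
        C = prob.sum(dim=(2, 3))                         # (batch, p)
        # Weighted pooling (Eq. 3); dividing by C_i keeps the feature norm
        # invariant to the region size.
        f = torch.einsum('bphw,bchw->bpc', prob, T)      # (batch, p, c)
        f = f / C.clamp(min=1e-6).unsqueeze(-1)
        return f, C
```

During inference, the p region features and p visibility scores of two images feed directly into the visibility-weighted distance of Eq. 4 below.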
Given two images to be compared, i.e., $I^k$ and $I^l$, VPM extracts their region features and predicts the region visibility scores through Eq. 3 and Eq. 2, respectively. With the region features and region visibility scores $\{f_i^k, C_i^k\}$ and $\{f_i^l, C_i^l\}$, VPM first calculates the region-to-region Euclidean distances $D_i^{kl} = \|f_i^k - f_i^l\|$ $(i = 1, 2, \cdots, p)$. Then VPM concludes the overall distance from the local distances by

$$D^{kl} = \frac{\sum_{i=1}^{p} C_i^k C_i^l D_i^{kl}}{\sum_{i=1}^{p} C_i^k C_i^l}. \qquad (4)$$

In Eq. 4, the visible regions have relatively large visibility scores. The local distances between shared regions are highly credited by VPM and thus dominate the overall distance $D^{kl}$. In contrast, if a region is invisible in either of the compared images, its region feature is considered unreliable and the corresponding local distance contributes little to $D^{kl}$.

Employing VPM adds very light computational cost, compared with popular part-based deep learning methods [23, 31, 12]. While some prior partial re-ID methods require pairwise comparison before feature extraction and may have efficiency problems, VPM presents high scalability, which allows experiments on large re-ID datasets such as Market-1501 [33] and DukeMTMC-reID [37], as will be assessed in Section 4.2.

Training VPM consists of training the region locator and training the region feature extractor. The two components share the convolutional layers before tensor $T$ and are trained end to end in a multi-task manner. Training VPM is also featured for employing auxiliary self-supervision.

Self-supervision is critical to VPM. It supervises VPM to learn region visibility awareness, as well as to focus on visible regions during feature learning. Specifically, given a holistic pedestrian image, we randomly crop a patch and resize it to $H \times W$. The random crop excludes several pre-defined regions, and the remaining regions are reshaped during the resizing. Then, we project the regions on the input image onto tensor $T$ through RoI projection [11, 20]. To be concrete, let us assume a region with its up-left corner located at $(x_1, y_1)$ and its bottom-right corner located at $(x_2, y_2)$ on the input image. The RoI projection then defines a corresponding region on tensor $T$ with its up-left corner located at $([x_1/S], [y_1/S])$ and its bottom-right corner located at $([x_2/S], [y_2/S])$, in which $[\bullet]$ denotes rounding and $S$ is the down-sampling rate from the input image to $T$. Finally, we assign every pixel $g$ on $T$ a region label $L$ ($L \in \{1, 2, \cdots, p\}$) to indicate which region $g$ belongs to. We also record all the visible regions in a set $V$. As we will see, self-supervision contributes to training VPM in the following three aspects:

• First, self-supervision generates the ground truth of region labels for training the region locator.

• Second, self-supervision enables VPM to focus on visible regions when learning features through the classification loss (cross-entropy loss).

• Finally, self-supervision enables VPM to focus on the shared regions when learning features through the triplet loss.

Without the auxiliary self-supervision, VPM encounters a dramatic performance decrease, as will be assessed in Section 4.4.
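The two pieces of this subsection, the self-supervision signal and the visibility-weighted matching of Eq. 4, can be sketched as follows. The function names are our own, and the label generation maps each pixel of T back into the holistic image instead of projecting region corners, which is an equivalent simplification of the RoI projection described above.

```python
import torch

def region_labels(crop_box, holistic_hw, feat_hw, m=6, n=1):
    """Self-supervision signal for one random crop: a region label in
    {0, ..., m*n - 1} for every pixel g on T, plus the visible set V.

    crop_box    : (x1, y1, x2, y2) of the crop inside the holistic image
    holistic_hw : (H0, W0) size of the holistic pedestrian image
    feat_hw     : (h, w) spatial size of tensor T (input size / stride S)
    """
    x1, y1, x2, y2 = crop_box
    H0, W0 = holistic_hw
    h, w = feat_hw
    # Center coordinates of every pixel of T, mapped back through the
    # resize into the cropped window of the holistic image.
    ys = y1 + (torch.arange(h).float() + 0.5) / h * (y2 - y1)
    xs = x1 + (torch.arange(w).float() + 0.5) / w * (x2 - x1)
    rows = (ys / H0 * m).long().clamp(max=m - 1)          # (h,)
    cols = (xs / W0 * n).long().clamp(max=n - 1)          # (w,)
    labels = rows.unsqueeze(1) * n + cols.unsqueeze(0)    # (h, w)
    V = set(labels.unique().tolist())                     # visible regions
    return labels, V

def vpm_distance(f_k, C_k, f_l, C_l, eps=1e-6):
    """Overall distance of Eq. 4: region-to-region Euclidean distances,
    credited by the product of the two visibility scores, so that shared
    (visible-in-both) regions dominate. f_*: (p, c); C_*: (p,)."""
    local = (f_k - f_l).norm(dim=1)    # D_i^{kl}, one distance per region
    weight = C_k * C_l
    return (weight * local).sum() / (weight.sum() + eps)
```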
The region locator is trained through cross-entropy loss with the self-supervision signal $L$ as the ground truth, which is formulated by

$$\mathcal{L}_R = -\sum_{g \in T} \sum_{i=1}^{p} \mathbb{1}(i = L)\, \log P(R_i \mid g), \qquad (5)$$

where $\mathbb{1}(i = L)$ returns 1 only when $i$ equals the ground-truth region label $L$ of pixel $g$, and returns 0 in any other case.

Figure 3. VPM learns region-level features with auxiliary self-supervision. Only features corresponding to visible regions contribute to the cross-entropy loss. Only features corresponding to shared regions contribute to the deducing of the triplet loss.

The region feature extractor is trained with the combination of cross-entropy loss and triplet loss, as illustrated in Fig. 3. Recall that the region feature extractor always generates $p$ region features for any input image. This leads to a nontrivial problem during feature learning: only features of visible regions should be allowed to contribute to the training losses. With the self-supervision signal $V$, we dynamically select the visible regions for feature learning.

The cross-entropy loss is commonly used in learning pedestrian features under the IDE [32] mode. We append a respective identity classifier, i.e., $IP_i(f_i)$ $(i = 1, 2, \cdots, p)$, upon each region feature $f_i$ to predict the identity of training images. The identity classifier consists of two sequential fully-connected layers and a Softmax function. The first fully-connected layer reduces the dimension of the input region feature, and the second one transforms the feature dimension to $K$ ($K$ being the total number of identities in the training set). The cross-entropy loss is then formulated by

$$\mathcal{L}_{ID} = -\sum_{i \in V} \log\big(\mathrm{softmax}(IP_i(f_i))_y\big), \qquad (6)$$

where $\mathrm{softmax}(IP_i(f_i))_y$ is the predicted probability of the ground-truth identity label $y$. With Eq. 6, self-supervision enforces focus on visible regions when learning region features through the cross-entropy loss.

The triplet loss pushes the features of the same pedestrian close to each other and pulls the features of different pedestrians far away. Given a triplet of images, i.e., an anchor image $I^a$, a positive image $I^p$ and a negative image $I^n$, we define a region-selective triplet loss derived from the canonical one by

$$\mathcal{L}_{tri} = \left[ D^{ap} - D^{an} + \alpha \right]_+, \quad
D^{ap} = \frac{\sum_{i \in (V^a \cap V^p)} \|f_i^a - f_i^p\|}{|V^a \cap V^p|}, \quad
D^{an} = \frac{\sum_{i \in (V^a \cap V^n)} \|f_i^a - f_i^n\|}{|V^a \cap V^n|}, \qquad (7)$$

where $f_i^a$, $f_i^p$ and $f_i^n$ are the region features of the anchor, positive and negative image, respectively, and $V^a$, $V^p$ and $V^n$ are their visible region sets. $|\bullet|$ denotes counting the elements of a set, i.e., the number of shared regions between the two compared images. $\alpha$ is the margin of the triplet loss and is set to a fixed constant in our implementation. With Eq. 7, self-supervision enforces focus on the shared regions when calculating the distance between two images.

The overall training loss is the sum of the region prediction loss, the identity classification loss and the region-selective triplet loss:

$$\mathcal{L} = \mathcal{L}_R + \mathcal{L}_{ID} + \mathcal{L}_{tri}. \qquad (8)$$

We also note that Eq. 4 and Eq. 7 share a similar pattern. Training with the modified triplet loss (Eq. 7) mimics the matching strategy (Eq. 4) and is thus especially beneficial (as detailed in Table 3). The difference is that, during training, the focus is enforced through "hard" visibility labels, while during testing, the focus is regularized through predicted "soft" visibility scores.
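A short sketch of the three training losses follows. The function names, shape conventions and the margin value are our own assumptions (the paper fixes a constant margin without restating it here), and `F.cross_entropy` serves as a drop-in for the summed log-likelihoods of Eqs. 5 and 6.

```python
import torch
import torch.nn.functional as F

def locator_loss(logits, labels):
    """Eq. 5: per-pixel region classification on T.
    logits: (b, p, h, w) raw scores of the 1x1 conv; labels: (b, h, w)."""
    return F.cross_entropy(logits, labels)

def identity_loss(region_logits, y, V):
    """Eq. 6: identity cross-entropy, restricted to visible regions.
    region_logits: (p, K), one row of logits per region classifier IP_i;
    y: 0-dim long tensor, ground-truth identity; V: boolean (p,) mask."""
    logits = region_logits[V]                       # keep visible regions
    target = y.expand(logits.size(0))
    return F.cross_entropy(logits, target, reduction='sum')

def triplet_loss(f_a, f_p, f_n, V_a, V_p, V_n, margin=0.3):
    """Eq. 7: region-selective triplet loss; margin=0.3 is an illustrative
    guess. Only regions visible in both compared images contribute; this
    sketch assumes the anchor shares at least one region with each image.
    f_*: (p, c) region features; V_*: boolean (p,) visibility masks."""
    ap = V_a & V_p                                  # shared anchor-positive
    an = V_a & V_n                                  # shared anchor-negative
    d_ap = (f_a[ap] - f_p[ap]).norm(dim=1).mean()
    d_an = (f_a[an] - f_n[an]).norm(dim=1).mean()
    return F.relu(d_ap - d_an + margin)             # the [.]_+ hinge

# Eq. 8: the overall loss is the plain sum of the three terms, e.g.
# loss = locator_loss(...) + identity_loss(...) + triplet_loss(...)
```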
4. Experiment
Datasets. We use four datasets, i.e., Market-1501 [33], DukeMTMC-reID [21, 37], Partial-REID and Partial-iLIDS, to evaluate our method. Market-1501 and DukeMTMC-reID are two large-scale holistic re-ID datasets. The Market-1501 dataset contains 1,501 identities observed from 6 camera viewpoints, with 19,732 gallery images and 12,936 training images detected by DPM [4]. The DukeMTMC-reID dataset contains 1,404 identities, 16,522 training images, 2,228 queries, and 17,661 gallery images. We crop certain patches from the query images during the testing stage to imitate the partial re-ID scenario and obtain a comprehensive evaluation of our method on large-scale (synthetic) partial re-ID datasets. We note that few prior works on partial re-ID evaluated their methods on large-scale datasets, mainly because of low computational efficiency. Partial-REID [36] and Partial-iLIDS [35] are two commonly-used datasets for partial re-ID. Partial-REID contains 600 images of 60 identities, each of which has 5 holistic images and 5 partial images. Partial-iLIDS is derived from iLIDS [24], which was collected in an airport where the lower body of a pedestrian is frequently occluded by luggage. Partial-iLIDS crops the non-occluded region from these images and yields 238 images of 119 identities. Both Partial-REID and Partial-iLIDS offer only testing images. When evaluating our method on these two public datasets, we train VPM on the training set of Market-1501, for fair comparison with other competing methods, including MTRC [15], AMC+SWM [36], DSR [7], and SFR [8].

| Dataset | γ | baseline R-1 | R-5 | R-10 | mAP | PCB R-1 | R-5 | R-10 | mAP | VPM R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Market-1501 | 0.5 | 64.5 | 82.2 | 88.1 | 44.4 | 0.9 | 3.2 | 5.6 | 1.7 | 70.9 | 86.5 | 92.1 | 48.8 |
| | 0.6 | 79.0 | 91.4 | 94.3 | 57.9 | 8.1 | 16.5 | 23.2 | 6.6 | 84.4 | 94.3 | 96.1 | 62.5 |
| | 0.7 | 83.9 | 93.9 | 95.9 | 63.7 | 36.5 | 58.9 | 67.4 | 26.8 | 88.2 | 95.8 | 97.2 | 71.7 |
| | 0.8 | 85.7 | 94.3 | 96.4 | 66.1 | 71.9 | 87.3 | 91.4 | 56.8 | 90.1 | 95.8 | 97.7 | 74.7 |
| | 0.9 | 87.1 | 95.5 | 97.4 | 67.8 | 88.8 | 95.8 | 97.1 | 77.2 | 91.7 | 96.6 | 98.0 | 78.7 |
| | 1.0 | 86.8 | 95.3 | 97.4 | 67.7 | 93.4 | 97.8 | 98.4 | 83.0 | 93.0 | 97.8 | 98.8 | 80.8 |
| DukeMTMC-reID | 0.5 | 65.0 | 81.1 | 86.7 | 47.2 | 5.0 | 10.1 | 13.6 | 4.0 | 69.5 | 83.1 | 87.9 | 52.2 |
| | 0.6 | 76.2 | 87.3 | 90.4 | 55.4 | 13.1 | 25.6 | 33.5 | 10.5 | 78.2 | 89.0 | 91.3 | 60.9 |
| | 0.7 | 76.3 | 87.3 | 90.6 | 90.6 | 35.9 | 57.0 | 65.4 | 28.4 | 80.3 | 89.5 | 92.0 | 63.1 |
| | 0.8 | 76.3 | 88.3 | 91.9 | 58.8 | 64.0 | 82.6 | 87.7 | 52.3 | 80.3 | 89.3 | 92.4 | 63.5 |
| | 0.9 | 77.0 | 88.1 | 91.7 | 59.0 | 81.6 | 90.4 | 93.0 | 70.3 | 81.7 | 90.9 | 93.1 | 70.7 |
| | 1.0 | 76.2 | 87.3 | 91.2 | 58.6 | 84.1 | 92.4 | 94.5 | 73.2 | 83.6 | 91.7 | 94.2 | 72.6 |

Table 1. Comparison between VPM, the baseline and PCB. For VPM, we use p = 6 × 1 pre-defined regions. For PCB, we adopt the code released by the authors and append an extra triplet loss, for fair comparison with VPM. On Market-1501, the extra triplet loss enables PCB to gain +5.6% mAP over the original 77.4% reported by the authors [23].

Implementation Details. Training VPM relies on the assumption that the original training images all contain a holistic pedestrian tightly bounded by the bounding box. In the two holistic re-ID datasets, Market-1501 and DukeMTMC-reID, there do exist some images which contain either a partial pedestrian or an oversized bounding box. We consider these images as tolerable noise. To generate the partial images for training VPM, we crop a patch from the holistic image with a random area ratio γ, which we set to be uniformly distributed between 0.5 and 1.0. VPM is not necessarily bound to any specific crop strategy, and we may consider prior knowledge for optimization.
We argue that choosing the detailed crop strategy according to the realistic condition is reasonable, because partial re-ID is a realistic challenge and the occlusion pattern is usually predictable. We also experimentally validate that choosing an appropriate crop strategy to imitate the confronted partial re-ID condition benefits the retrieval accuracy, as detailed in Section 4.3. That being said, VPM is still general in that it may adopt any crop strategy to conduct self-supervision.

VPM is trained with the combination of cross-entropy loss and triplet loss. We pre-train the model for 50 epochs with the single cross-entropy loss, because it helps VPM converge faster and better. Then we append the triplet loss and fine-tune the model for another 80 epochs. In both the pre-training and the fine-tuning stages, we use the standard Stochastic Gradient Descent (SGD) optimization strategy, initialize the learning rate to 0.1 and decay it to 0.01 after 30 epochs. During the fine-tuning stage, we construct each mini-batch with 64 images from 8 identities (8 images per identity) and use the hard mining strategy [9] for deducing the triplet loss.

We evaluate the effectiveness of VPM with experiments on the synthetic partial datasets derived from the two large-scale re-ID datasets, Market-1501 and DukeMTMC-reID. We vary the ratio γ of the cropped patches from 0.5 to 1.0 during testing. For comparison with VPM, we implement a baseline which learns a global feature through the combination of cross-entropy loss and triplet loss. We also implement a part-based feature learning method, PCB [23]. For fair comparison, we enhance PCB with an extra triplet loss during training, and achieve slightly higher performance than [23]. The results are summarized in Table 1.

VPM significantly improves partial re-ID performance over the baseline.
On Market-1501, VPM surpasses the baseline by +6.4%, +5.4%, +4.3%, +4.4%, +4.6% and +6.2% rank-1 accuracy, and by +4.4%, +4.6%, +8.0%, +8.6%, +10.9% and +13.1% mAP, when γ is set from 0.5 to 1.0, respectively. The superiority of VPM over the baseline, which learns a global feature representation, derives from a two-fold benefit. On the one hand, VPM learns region-level features and benefits from fine-grained information. On the other hand, with visibility awareness, VPM conducts a region-level alignment and eliminates the distracting noises originating from unshared regions.

Figure 4. Impact of p on the partial re-ID accuracy. We set p to 2, 3, 4, 6 and 8, respectively. We use Market-1501 for training and vary the crop ratio γ during testing.

VPM increases the robustness of part features under the partial re-ID scenario. Comparing VPM with PCB, a state-of-the-art part feature learning method for the holistic person re-ID task, we observe that as γ decreases, the retrieval accuracy achieved by PCB drops dramatically (e.g., 0.9% rank-1 accuracy at γ = 0.5), implying that PCB is extremely vulnerable to the spatial misalignment in partial re-ID. By contrast, the retrieval accuracy achieved by VPM decreases much more slowly as γ decreases. We infer that VPM facilitates region-to-region comparison within the shared regions of two images and thus gains strong robustness. We also notice that under γ = 1.0, i.e., the holistic person re-ID scenario, VPM achieves retrieval accuracy comparable with PCB.

In Table 1, we use 6 pre-defined parts to construct VPM. Moreover, we analyze the impact of the part number p on Market-1501, with results shown in Fig. 4. Under all settings of p and γ, VPM consistently surpasses the baseline, which further confirms the superiority of VPM. We also observe that a larger p generally brings higher (rank-1) retrieval accuracy. A larger p allows VPM to learn the region-level features at a finer granularity and thus benefits the discriminative ability, which is consistent with the observation in holistic person re-ID works [31, 23]. A larger p also allows more accurate region alignment when comparing a partial person image against a holistic one. We suggest choosing p with joint consideration of retrieval accuracy and computational efficiency, and set p = 6 in most of our experiments (if not specially mentioned).

We compare VPM with the state-of-the-art methods on two public datasets, i.e., Partial-REID and Partial-iLIDS. We train three different versions of VPM with different crop strategies for preparing training patches, i.e., top crop (the top regions are always visible), bottom crop (the bottom regions are always visible) and bilateral crop (top crop + bottom crop). The results are presented in Table 2, from which two observations are drawn.

| Methods | Partial-REID R-1 | R-3 | Partial-iLIDS R-1 | R-3 |
|---|---|---|---|---|
| MTRC [15] | 23.7 | 27.3 | 17.7 | 26.1 |
| AMC+SWM [36] | 37.3 | 46.0 | 21.0 | 32.8 |
| DSR [7] | 50.7 | 70.0 | 58.8 | 67.2 |
| SFR [8] | 56.9 | 78.5 | 63.9 | 74.8 |
| VPM (Bottom) | 53.2 | 73.2 | 53.6 | 62.3 |
| VPM (Top) | 64.3 | – | 67.2 | – |
| VPM (Bilateral) | 67.5 | – | – | – |

Table 2. Evaluation of VPM on Partial-REID and Partial-iLIDS. Three VPMs trained with different crop strategies are evaluated.

First, comparing the three versions of VPM against each other, we find that the crop strategy matters to VPM. On Partial-iLIDS, all query images of which are cropped from the top side of holistic pedestrian images, VPM (Top) achieves the highest retrieval accuracy. On Partial-REID, which contains images cropped from various directions, VPM (Bilateral) achieves the highest retrieval accuracy. VPM (Bottom) always performs the worst, for two reasons. First, retaining only the bottom regions severely deviates from the testing condition. Second, the bottom regions (mainly containing legs) inherently offer relatively weak discriminative clues. We note that when solving the partial re-ID problem, the realistic partial condition is usually estimable. We recommend analyzing the partial condition and choosing a similar crop strategy for training VPM. That being said, VPM is general in that it is able to cooperate with various crop strategies.

Second, given appropriate crop strategies, VPM achieves very competitive performance compared with the state of the art. On Partial-REID, VPM (Bilateral) surpasses the strongest competing method, SFR, by +10.6% rank-1 accuracy. On Partial-iLIDS, VPM (Top) surpasses SFR by +3.3% rank-1 accuracy. Even with no prior knowledge of the partial condition on the testing set, we may eclectically choose VPM (Bilateral), which considers both top and bottom occlusions and thus maintains stronger robustness.
We conduct an ablation study to analyze the impact of self-supervision on VPM. We train four malfunctioned VPMs for comparison:

• MVPM-1 is trained as a normal VPM, but abandons the visibility awareness during testing, i.e., MVPM-1 concludes the overall distance with all region-level features, even if some regions are invisible.
• MVPM-2 abandons self-supervision on the triplet loss during training, i.e., all region features contribute equally to deducing the triplet loss L_tri.

• MVPM-3 abandons self-supervision on the identification loss L_ID during training, i.e., all region features are supervised by the training identity label through L_ID.

• MVPM-4 abandons self-supervision on both the triplet loss and the identification loss.

| Methods | Partial-iLIDS R-1 | R-3 | R-5 | Market-1501 R-1 | R-5 | mAP |
|---|---|---|---|---|---|---|
| VPM | 67.2 | 76.5 | 82.4 | 93.0 | 97.8 | 80.8 |
| VPM (no triplet) | 57.1 | 73.9 | 79.0 | 91.3 | 97.0 | 77.8 |
| MVPM-1 | 63.0 | 74.8 | 82.4 | 93.0 | 96.3 | 79.7 |
| MVPM-2 | 61.3 | 73.1 | 79.0 | 92.8 | 97.4 | 80.1 |
| MVPM-3 | 58.8 | 74.8 | 82.4 | 91.4 | 96.5 | 75.5 |
| MVPM-4 | 59.7 | 74.8 | 78.2 | 90.4 | 96.6 | 75.7 |

Table 3. Ablation study on VPM. "VPM (no triplet)" is trained with no triplet loss. On Market-1501, we only analyze the holistic person re-ID mode.

Moreover, we also analyze the impact of the two types of losses (cross-entropy loss and triplet loss) in training VPM. The results are summarized in Table 3, from which we draw three observations.

First, appending an extra triplet loss enhances the discriminative ability of the learned features, under both the partial re-ID scenario (Partial-iLIDS) and the holistic person re-ID scenario (Market-1501). This observation is consistent with prior works [9, 38, 30, 29] on the holistic person re-ID task.

Second, comparing "MVPM-1" with "VPM", we observe a dramatic performance decrease on Partial-iLIDS. Both models are trained in exactly the same procedure. The difference is that "MVPM-1" employs all the region features to conclude the overall distance, while VPM focuses on the shared regions between two images. On Market-1501, all the regions are visible and the two models achieve very close retrieval accuracy. We thus infer that the visibility awareness is critical for VPM under the partial re-ID scenario.

Third, comparing the last three editions of MVPM with "VPM" as well as "MVPM-1", we observe further performance decreases on Partial-iLIDS. The last three editions abandon the self-supervision that regularizes the learning of region-level features (on the cross-entropy loss, the triplet loss, or both). Learning features from invisible regions brings about larger sample noises. Consequently, the learned region features are significantly compromised. We thus conclude that enforcing VPM to focus on visible regions through self-supervision is critical for learning region features.
Figure 5. Region visualization. We train VPM with 3 × 2 pre-defined regions. For each image, VPM discovers 6 regions with 6 probability maps, as detailed in Section 3.1. For better visualization, we assign each pixel to its closest region to achieve the partitioning effect. Images on the first and the second row are from (synthetic) Market-1501 and Partial-REID, respectively.

We visualize the regions discovered by VPM (the region locator, in particular) in Fig. 5. We use p = 3 × 2 pre-defined regions to facilitate both horizontal and vertical visibility awareness. We observe that VPM conducts adaptive partitioning with visibility awareness. Given holistic images (the first column), VPM successfully discovers all the 3 × 2 regions. Given partial pedestrian images with horizontal occlusion (the second column), VPM favors the dominating regions (the left regions in Fig. 5). Given partial pedestrian images with the lower body occluded (the last two columns), VPM roughly discovers the 4 visible regions and perceives that the bottom 2 regions are invisible. These observations confirm that VPM gains robust region-level visibility awareness and is capable of locating the visible regions through self-supervised learning.
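For reference, the hard partitioning used in Fig. 5 amounts to an argmax over the p probability maps. A minimal sketch follows, where `prob` is the (p, h, w) softmax output of the locator sketch in Section 3.1 for one image, and the visibility threshold is an illustrative guess rather than a value from the paper.

```python
# Assign every pixel of T to its most probable region, as in Fig. 5.
partition = prob.argmax(dim=0)                    # (h, w) region index map
# Treat a region as visible if enough probability mass accumulates on its
# map (cf. the visibility score C_i of Eq. 2); 0.5 is an illustrative guess.
visible = (prob.sum(dim=(1, 2)) > 0.5).nonzero().flatten().tolist()
```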
5. Conclusion
In this paper, we propose a region-based feature learning method, VPM, for the partial re-ID task. Given a set of pre-defined regions on the holistic pedestrian image, VPM learns to perceive which regions are visible on a partial image through self-supervision. VPM locates each region on the convolutional feature maps and then extracts region-level features. With visibility awareness, VPM compares two pedestrian images with focus on their shared regions and correspondingly suppresses the severe spatial misalignment in partial re-ID. Experimental results confirm that VPM surpasses both the global feature learning baseline and part-based convolutional methods, and the achieved performance is on par with the state of the art.

References

[1] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[2] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834-848, 2018.
[3] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422-1430, 2015.
[4] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[5] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, 2017.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] L. He, J. Liang, H. Li, and Z. Sun. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. CoRR, abs/1801.00881, 2018.
[8] L. He, Z. Sun, Y. Zhu, and Y. Wang. Recognizing partial biometric patterns. CoRR, abs/1810.07399, 2018.
[9] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[10] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[12] M. M. Kalayeh, E. Basaran, M. Gokmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
[13] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, pages 577-593, 2016.
[14] Q. V. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[15] S. Liao, A. K. Jain, and S. Z. Li. Partial face recognition: Alignment-free approach. IEEE Trans. Pattern Anal. Mach. Intell., 35(5):1193-1205, 2013.
[16] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang. HydraPlus-Net: Attentive deep features for pedestrian analysis. In ICCV, 2017.
[17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[18] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[19] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69-84, 2016.
[20] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., pages 1137-1149, 2017.
[21] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV Workshop on Benchmarking Multi-Target Tracking, 2016.
[22] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose-driven deep convolutional model for person re-identification. In ICCV, 2017.
[23] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling. In ECCV, 2018.
[24] T. Wang, S. Gong, X. Zhu, and S. Wang. Person re-identification by video ranking. In ECCV, 2014.
[25] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. In ICCV, 2017.
[26] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian. GLAD: Global-local-alignment descriptor for pedestrian retrieval. In ACM Multimedia, 2017.
[27] S. E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[28] H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian. Deep representation learning with part loss for person re-identification. arXiv preprint arXiv:1707.00798, 2017.
[29] R. Yu, Z. Dou, S. Bai, Z. Zhang, Y. Xu, and X. Bai. Hard-aware point-to-set deep metric for person re-identification. In ECCV, 2018.
[30] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun. AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
[31] L. Zhao, X. Li, J. Wang, and Y. Zhuang. Deeply-learned part-aligned representations for person re-identification. In ICCV, 2017.
[32] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, 2016.
[33] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
[34] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
[35] W. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In CVPR, 2011.
[36] W. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong. Partial person re-identification. In ICCV, 2015.
[37] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In ICCV, 2017.
[38] Z. Zhong, L. Zheng, S. Li, and Y. Yang. Generalizing a person retrieval model hetero- and homogeneously. In CVPR, 2018.